Math. Program., Ser. B (2010) 125:263–295
DOI 10.1007/s10107-010-0394-2

Approximation accuracy, gradient methods, and error bound for structured convex optimization
Paul Tseng
Received: 14 May 2009 / Accepted: 22 January 2010 / Published online: 17 August 2010
© Springer and Mathematical Optimization Society 2010
1 Introduction
P. Tseng was invited to speak on this paper at ISMP 2009. Tragically, Paul went missing on August 13,
2009, while kayaking in China. The guest editors thank Michael Friedlander (mpf@cs.ubc.ca) for
carrying out minor revisions to the submitted manuscript as suggested by the referees.
P. Tseng
Department of Mathematics, University of Washington, Seattle, WA 98195, USA
e-mail: mpf@cs.ubc.ca
localization, are often large scale, possibly nonconvex or NP-hard, and the data may
be noisy. Such problems may be approximated by convex relaxations that are highly
structured.
Question 1: How accurate are approximations obtained via the convex relaxations?
Specifically, can we bound the error between a solution of the convex relaxation and an
(unknown) solution of the original noiseless problem in terms of knowable quantities
such as the noise level? This question is meaningful when a solution of the original
problem changes (semi)continuously with small perturbations in the data, so the prob-
lem has both discrete and continuous nature. Examples include compressed sensing
and sensor network localization; see Sect. 2. A certain noise-aware property of the
convex relaxation appears key.
Due to the large problem size, first-order gradient methods seem better suited to exploit
structures such as sparsity, (partial) separability, and simple nonsmoothness in the con-
vex relaxations; see Sect. 3. The asymptotic convergence rate of these methods depends
on an error bound on the distance to the solution set of the convex relaxation in terms
of a certain residual function. We prove such an error bound for a class of 2-norm-
regularized problems that includes the group lasso for linear and logistic regression;
see Sect. 4. Thus our aims are threefold: exposit on existing results, present a new
result, and suggest future research.
We begin with a problem that has received much attention recently: compressed
sensing. In the basic version of this problem, we wish to find a sparse representation
of a given noiseless signal $b^0 \in \Re^m$ from a dictionary of $n$ elementary signals. This may be formulated as

$$\min_{x \,\mid\, Ax = b^0} \|x\|_0, \qquad (1)$$

where $A \in \Re^{m\times n}$ has the elementary signals as its columns and $\|x\|_0$ counts the number of nonzero components of $x \in \Re^n$. In typical applications, $m$ and $n$ are large ($m, n \ge 2000$). This problem is known to be difficult (NP-hard), and a popular solution approach is to approximate it by a convex relaxation, with $\|\cdot\|_0$ replaced by the 1-norm $\|\cdot\|_1$.¹ This results in the linear program

$$\min_{x \,\mid\, Ax = b^0} \|x\|_1, \qquad (2)$$

¹ We are using the term “relaxation” loosely, since $\|\cdot\|_0$ majorizes $\|\cdot\|_1$ only on the unit ∞-norm ball.
Structured convex optimization 265
(τ > 0). Partial results on the accuracy of these relaxations are known [43,135,
144]. Various methods have been proposed to solve (3) and/or (4) for fixed ρ and τ ,
including interior point [34], coordinate minimization [123,124], proximal gradient
(with or without smoothing) [14,15,36], and proximal minimization [156]. Homotopy
approaches have been proposed to follow the solution as τ varies between 0 and ∞
[55,66,108,111]. The efficiency of these methods depends on the structure of $A$. The polyhedral and separable structure of $\|\cdot\|_1$ is key.
A problem related to (4) is image denoising using total variation (TV) regulariza-
tion:
where rank$(x)$ denotes the rank of the matrix $x \in \Re^{p\times n}$, $A$ is a linear mapping from $\Re^{p\times n}$ to $\Re^m$, and $b^0 \in \Re^m$. In the case of matrix completion, we have $Ax = (x_{ij})_{(i,j)\in A}$, where $A$ indexes the known entries of $x$. This problem is also NP-hard, and a convex relaxation has been proposed whereby rank$(x)$ is replaced by the nuclear/trace norm $\|x\|_{\rm nuc}$ (the 1-norm of the singular values of $x$) [25,27,52,76,83,117,157]. The resulting problem

$$\min_{x \,\mid\, Ax = b^0} \|x\|_{\rm nuc} \qquad (7)$$

has a more complex structure than (2), and only recently have solution methods, including interior-point and dual gradient methods, been developed [25,76,83] and exactness results been obtained [27,117]. For noisy $b$, we may consider, analogous to
where $\tau > 0$ and $\|\cdot\|_F$ denotes the Frobenius norm. This problem has applications
to dimension reduction in multivariate linear regression [78] and multi-task learn-
ing [1,3,106]. Recently, proximal/gradient methods have been applied to its solution
[78,83,133]. How well does (8) approximate (6)?
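Since $\|x\|_{\rm nuc}$ is the 1-norm of the singular values, the corresponding proximal step soft-thresholds the spectrum, at the cost of one SVD per iteration — the per-iteration cost alluded to for the methods just cited. A hedged sketch (ours, not taken from those papers):

```python
import numpy as np

def sv_threshold(x, t):
    """Minimizer of (1/2)||z - x||_F^2 + t*||z||_nuc:
    soft-threshold the singular values of x."""
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    return u @ (np.maximum(s - t, 0.0)[:, None] * vt)

x = np.outer([1.0, 2.0], [3.0, 4.0])   # rank-1 test matrix
y = sv_threshold(x, 1.0)               # shrinks the single singular value
```

Only the singular values above the threshold survive, which is why methods needing just the largest singular triplets can be much cheaper.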
A third problem related to (1) and of recent interest is that of sparse inverse covari-
ance estimation, where we seek a positive definite $x \in S^n$ that is sparse and whose inverse approximates a given sample covariance matrix $s \in S^n$. This may be formulated as the nonconvex problem
Interior-point methods appear ill-suited for solving (10), owing to the large size of the Newton equation to be solved. Block-coordinate minimization methods (with each
block corresponding to a row/column of x) and Nesterov’s accelerated gradient meth-
ods have been applied to its solution [5,9,56,77]. How well does (10) approximate
(9)?
A fourth problem of much recent interest is that of ad hoc wireless sensor net-
work localization [4,18,19,21,22,31,38,39,49,73,75,128]. In the basic version of this
problem, we have $n$ points $z_1^0, \ldots, z_n^0$ in $\Re^d$ ($d \ge 1$). We know the last $n-m$ points ("anchors") and an estimate $d_{ij} \ge 0$ of the Euclidean distance $d_{ij}^0 = \|z_i^0 - z_j^0\|_2$ between "neighboring" points $i$ and $j$ for all $(i,j) \in A$, where $A \subseteq (\{1,\ldots,m\}\times\{1,\ldots,n\}) \cup (\{1,\ldots,n\}\times\{1,\ldots,m\})$. We wish to estimate the first $m$ points ("sensors"). This problem may be formulated as

$$\min_{z_1,\ldots,z_m} \; \sum_{(i,j)\in A_s} \bigl| \|z_i - z_j\|_2^2 - d_{ij}^2 \bigr| \;+\; \sum_{(i,j)\in A_a} \bigl| \|z_i - z_j^0\|_2^2 - d_{ij}^2 \bigr|, \qquad (11)$$

where $A_s$ and $A_a$ index the sensor–sensor and sensor–anchor pairs in $A$, respectively.
$$\ell_{ij}(x) := \begin{cases} y_{ii} - 2y_{ij} + y_{jj} & \text{if } (i,j) \in A_s,\\ y_{ii} - 2z_i^T z_j^0 + \|z_j^0\|^2 & \text{if } (i,j) \in A_a. \end{cases}$$
The relaxation (12) is a semidefinite program (SDP) and is exact if it has a solu-
tion of rank d. On the other hand, (12) is still difficult to solve by existing methods
for SDP when m > 500, and decomposition methods have been proposed [21,31].
Recently, Wang, Zheng, Ye, and Boyd [149] proposed a further relaxation of (12),
called edge-based SDP (ESDP) relaxation, which is solved much faster by an interior-
point method than (12), and yields solutions comparable in approximation accuracy to those of (12). The ESDP relaxation is obtained by relaxing the constraint $x \succeq 0$ in (12) to require only those principal submatrices of $x$ associated with $A$ to be positive semidefinite. Specifically, the ESDP relaxation is

$$\begin{aligned} \min_x \;& \sum_{(i,j)\in A} \bigl|\ell_{ij}(x) - d_{ij}^2\bigr| \\ \text{subject to}\;& x = \begin{pmatrix} y & z^T \\ z & I \end{pmatrix},\\ & \begin{pmatrix} y_{ii} & y_{ij} & z_i^T \\ y_{ij} & y_{jj} & z_j^T \\ z_i & z_j & I \end{pmatrix} \succeq 0 \quad \forall (i,j) \in A_s,\\ & \begin{pmatrix} y_{ii} & z_i^T \\ z_i & I_d \end{pmatrix} \succeq 0 \quad \forall i \le m. \end{aligned} \qquad (13)$$

Notice that the objective function and the positive semidefinite constraints in (13) do not depend on $y_{ij}$, $(i,j) \notin A$. How well does (12) approximate (11)? Only partial results are known in the noiseless case, i.e., when (11) has zero optimal value [128,142]. How well does (13) approximate (11)?
The above convex problems share a common structure, namely, they entail mini-
mizing the sum of a smooth (i.e., continuously differentiable) convex function and a
“simple” nonsmooth convex function. More specifically, they have the form

$$\min_{x \in E} \; F(x) := f(x) + \tau P(x), \qquad (14)$$

where $E$ is a real linear space endowed with a norm $\|\cdot\|$, $f$ is smooth convex, $P$ is nonsmooth convex, and $\tau > 0$. The class of problems (14) was studied in [6,61,100] and by others; see [147] and references therein. For example, (4) corresponds to
(5) corresponds to

$$E = \Re^n, \quad \|\cdot\| = \|\cdot\|_2, \quad f(x) = \|Ax - b\|_2^2, \quad P(x) = \begin{cases} 0 & \text{if } x \in B, \\ \infty & \text{else,} \end{cases} \qquad (17)$$
(8) corresponds to
In the case of $0 < \lambda < \bar\lambda < \infty$, Lu [77] proposed a reformulation of (10) via Fenchel duality [118,131], corresponding to

$$f(x) = \sup_{\lambda I \preceq y \preceq \bar\lambda I} \bigl\{ \langle x, y\rangle - \log\det(y) \bigr\}, \qquad P(x) = \begin{cases} 0 & \text{if } \|x - s\|_\infty \le \tau, \\ \infty & \text{else,} \end{cases} \qquad (20)$$

where $\|x\|_\infty = \max_{i,j} |x_{ij}|$. (Note that $P$ in (20) depends on $\tau$.) Importantly, for (16), (17), (19), (20), $P$ is block-separable, i.e.,

$$P(x) = \sum_{J \in \mathcal{J}} P_J(x_J), \qquad (21)$$
where $\mathcal{J}$ is some partition of the coordinate indices of $x$. Then (14) is amenable to solution by (block) coordinatewise methods. Another special case of interest is variable selection in logistic regression [89,127,134,158], corresponding to

$$f(x) = \sum_{i=1}^m \log\bigl(1 + e^{A_i x}\bigr) - b_i A_i x, \qquad (22)$$

with $A_i$ the $i$th row of $A \in \Re^{m\times n}$ and $b_i \in \{0,1\}$; also see [57,122,125,126] for variants. A closely related problem is group variable selection ("group lasso"), which uses instead

$$P(x) = \sum_{J \in \mathcal{J}} \omega_J \|x_J\|_2, \qquad (23)$$

with $\omega_J \ge 0$, in (16) and (22) [89,158]. The TV image reconstruction model in [148, Eq. (1.3)] and the TV-L1 image deblurring model in [153, Eq. (1.5)] are special cases of (14) with $f$ quadratic and $P$ of the form (23).
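For $P$ of the form (23), the proximal step still separates across the groups $J \in \mathcal{J}$: each subvector $x_J$ is shrunk radially toward zero (block soft-thresholding). A sketch under our own naming conventions:

```python
import numpy as np

def group_soft_threshold(v, groups, weights, t):
    """Minimizer of (1/2)||x - v||^2 + t * sum_J w_J ||x_J||_2:
    shrink each block of v toward zero along its own direction."""
    x = np.zeros_like(v)
    for J, w in zip(groups, weights):
        nrm = np.linalg.norm(v[J])
        if nrm > t * w:                      # otherwise the whole block is zeroed
            x[J] = (1.0 - t * w / nrm) * v[J]
    return x

v = np.array([3.0, 4.0, 0.1, 0.2])
groups = [np.array([0, 1]), np.array([2, 3])]
x = group_soft_threshold(v, groups, [1.0, 1.0], 1.0)  # second block vanishes
```

A whole block is either kept (shrunk) or set to zero, which is the group-sparsity effect of (23).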
How accurately do the convex relaxations (7), (8), (10), (12), (13), and other related convex relaxations, such as that for max–min dispersion, approximate the original problems when the data are noisy? What are the iteration complexities and asymptotic convergence rates of first-order gradient methods for solving these and related problems? First-order methods are attractive since they can exploit the sparse or partially separable structure of $f$
and block-separable structure of P.
Throughout, $\Re^n$ denotes the space of $n$-dimensional real column vectors, $S^n = \{x \in \Re^{n\times n} \mid x = x^T\}$, and $^T$ denotes transpose. For any $x \in \Re^n$ and nonempty $J \subseteq \{1,\ldots,n\}$, $x_j$ denotes the $j$th coordinate of $x$, $x_J$ denotes the subvector of $x$ comprising $x_j$, $j \in J$, $\|x\|_\rho = \bigl(\sum_{j=1}^n |x_j|^\rho\bigr)^{1/\rho}$ for $0 < \rho < \infty$, and $\|x\|_\infty = \max_j |x_j|$. For any $x \in \Re^{p\times n}$, $x_{ij}$ denotes the $(i,j)$th entry of $x$.
There exist overcomplete sets with $n \approx m^2$ and $\mu \approx 1/\sqrt{m}$; see [130, pages 265–266] and references therein. The quantity $\mu$ is central to the analysis in [42–44,58,59,62,65,84,135,136,144]. In particular, (2) is exact whenever $N^0 < \frac12(\mu^{-1}+1) = O(n^{1/4})$,
where $N^0$ denotes the optimal value of (1) [42, Theorem 7], [65, Theorem 1], [58], [135, Theorems A and B]. When $b$ is noisy, it can be shown that the solution $x^\rho$ of the noise-aware model (3) is close to the solution $x^0$ of the original noiseless problem (1), i.e.,

$$\|x^\rho - x^0\|_2^2 \le \frac{(\delta + \rho)^2}{1 - \mu(4N^0 - 1)} \qquad (25)$$
$$S_{\rm resdp}^\rho := \bigl\{ x \mid x \text{ is feasible for (13) and } |\ell_{ij}(x) - d_{ij}^2| \le \rho_{ij} \;\;\forall (i,j) \in A \bigr\}. \qquad (27)$$

Then $S_{\rm resdp}^\rho$ contains the noiseless ESDP solutions whenever $\rho \ge \delta := \bigl(|d_{ij}^2 - (d_{ij}^0)^2|\bigr)_{(i,j)\in A}$. Moreover, defining its "analytic center" as

$$x^\rho := \arg\min_{x \in S_{\rm resdp}^\rho} \; -\sum_{(i,j)\in A_s} \log\det \begin{pmatrix} y_{ii} & y_{ij} & z_i^T \\ y_{ij} & y_{jj} & z_j^T \\ z_i & z_j & I \end{pmatrix} - \sum_{i=1}^m \log \mathop{\rm tr}\nolimits_i(x), \qquad (28)$$

it can be shown that, for $\rho > \delta$ sufficiently small, the nearness of $\mathop{\rm tr}_i(x^\rho)$ to zero is a reliable indicator of sensor $i$'s position accuracy; see [115, Theorem 4]. Moreover, for any $\rho > \delta$, we have the following computable bound on the individual sensor position accuracy:

$$\|z_i^\rho - z_i^0\| \le \sqrt{2|A_s| + m}\;\mathop{\rm tr}\nolimits_i(x^\rho)^{1/2} \quad \forall i.$$
Can these results be extended to the SDP relaxation (12) or SOS relaxations [69,104]
or to handle additional constraints such as bounds on distances and angles-of-arrival?
Closely related to sensor network localization is the continuous max–min dispersion problem: given existing points $z_{m+1}^0, \ldots, z_n^0$ in $\Re^d$ ($d \ge 1$), we wish to locate new points $z_1, \ldots, z_m$ inside, say, a box $[0,1]^d$ that are furthest from each other and from the existing points [35,150]:

$$\max_{z_1,\ldots,z_m \in [0,1]^d} \; \min\Bigl\{ \min_{i<j\le m} \omega_{ij} \|z_i - z_j\|_2, \; \min_{i\le m<j} \omega_{ij} \|z_i - z_j^0\|_2 \Bigr\}, \qquad (29)$$
with ωi j > 0. Replacing “max” by “sum” yields the max-avg dispersion problem
[116, Section 4]. It can be shown (by reduction from 0/1-integer program feasibil-
ity) that (29) is NP-hard if $d$ is part of the input, even when $m = 1$. How accurately do their convex (e.g., SDP, SOCP) relaxations approximate them? Another related problem arises in protein structure prediction, whereby the distances between neighboring
atoms and their bond angles are known, and we wish to find positions of the atoms
that minimize a certain energy function [120]. Although the energy function is com-
plicated and highly nonlinear, one can focus on the most nonlinear terms, such as the
Lennard-Jones interactions, in seeking approximate solutions.
How to solve (14)? We will assume that $\nabla f$ is Lipschitz continuous on a closed convex set $X \supseteq \mathop{\rm dom} P$, i.e.,

$$\|\nabla f(x) - \nabla f(y)\|_* \le L \|x - y\| \quad \forall x, y \in X \qquad (30)$$

for some $L > 0$, where $E^*$ is the vector space of continuous linear functionals on $E$, endowed with the dual norm $\|x^*\|_* = \sup_{\|x\| \le 1} \langle x^*, x\rangle$, and $\langle x^*, x\rangle$ is the value of $x^* \in E^*$ at $x \in E$. This assumption, which is satisfied by (16), (17), (18), (20), (22),
can be relaxed to hold for (19) as well. Owing to its size and structure, (14) is suited for
solution by first-order gradient methods, whereby at each iteration f is approximated
by a linear function plus a “simple” proximal term. We describe such methods below.
To simplify notation, we denote the linearization of $f$ in $F$ at $y \in X$ by

$$\mathcal{F}(x; y) := f(y) + \langle \nabla f(y), x - y\rangle + \tau P(x). \qquad (31)$$

We also use a Bregman function $D$ induced by a strictly convex differentiable $\eta$, i.e., $D(x, y) = \eta(x) - \eta(y) - \langle \nabla \eta(y), x - y\rangle$, satisfying

$$D(x, y) \ge \frac12 \|x - y\|^2 \quad \forall x, y \in X. \qquad (32)$$
The classical gradient projection method of Goldstein and of Levitin and Polyak (see [16,114]) naturally generalizes to solve (14) using the Bregman function $D$:

$$x^{k+1} = \arg\min_x \bigl\{ \mathcal{F}(x; x^k) + L\, D(x, x^k) \bigr\}, \quad k = 0, 1, \ldots, \qquad (33)$$

where $x^0 \in \mathop{\rm dom} P$. This method, which we call the proximal gradient (PG) method,
is closely related to the mirror descent method of Nemirovski and Yudin [93], as is discussed in [8,12]; also see [61] for the case of $\eta(\cdot) = \frac12\|\cdot\|_2^2$. When $P$ is the 1-norm or has the block-separable form (21), the new point $x^{k+1}$ can be found in closed form, which is a key advantage of this method for large-scale optimization; see [147,152] and references therein. When $P$ is given by (15) and $X$ is the unit simplex, $x^{k+1}$ can be found in closed form in $O(n)$ floating point operations (flops) by taking $\eta(x)$ to be the $x\log x$-entropy function [12, Section 5], [99, Lemma 4]. Moreover, the corresponding $D$ satisfies (32) with $\|\cdot\|$ being the 1-norm [12, Proposition 5.1], [99, Lemma 3]. If $\eta(\cdot) = \frac12\|\cdot\|_2^2$ is used instead, then $x^{k+1}$ can still be found in $O(n)$ flops, but this requires using a more complicated algorithm; see [72] and references therein. It can be shown that
$$F(x^k) - \inf F \le O\!\left(\frac{L}{k}\right) \quad \forall k,$$

and hence $O(L/\epsilon)$ iterations suffice to come within $\epsilon > 0$ of $\inf F$; see, e.g., [14, Theorem 3.1], [96, Theorem 2.1.14], [114, page 166], [145, Theorem 5.1].
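As a concrete instance of the PG iteration (33) with the Euclidean choice $\eta(\cdot) = \frac12\|\cdot\|_2^2$, the following sketch (our own toy data; $f(x) = \frac12\|Ax - b\|_2^2$ with the 1-norm as $P$, so the prox step is componentwise shrinkage) exhibits the monotone decrease of $F(x^k)$ consistent with the $O(L/k)$ bound:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 50))
b = rng.standard_normal(20)
tau = 0.1
L = np.linalg.norm(A, 2) ** 2                 # Lipschitz constant of grad f

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def F(x):                                     # objective f + tau*P
    return 0.5 * np.sum((A @ x - b) ** 2) + tau * np.abs(x).sum()

x = np.zeros(50)
vals = [F(x)]
for k in range(200):
    g = A.T @ (A @ x - b)                     # grad f(x^k)
    x = soft_threshold(x - g / L, tau / L)    # PG step (33), Euclidean D
    vals.append(F(x))
```

With the exact step $1/L$, each PG step minimizes a quadratic upper bound on $F$ that is tight at $x^k$, which is why the objective values decrease monotonically here.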
In a series of work [94,95,99] surveyed in [96] (also see [114, page 171]), Nesterov
proposed three methods for solving the smooth constrained case (15) that, at each iter-
ation, use either one or two projection steps together with extrapolation to accelerate
convergence. These accelerated gradient projection methods generate points $\{x^k\}$ that achieve

$$F(x^k) - \inf F \le O\!\left(\frac{L}{k^2}\right) \quad \forall k,$$

so that $O(\sqrt{L/\epsilon})$ iterations suffice to come within $\epsilon > 0$ of $\inf F$. In [99], it is shown that various large convex–concave optimization problems can be efficiently solved by applying these methods to a smooth approximation with Lipschitz constant $L = O(1/\epsilon)$. These methods have inspired various extensions and variants [7, Section 5], [14,64,74], [96, Section 2.2], [99,101,143], as well as applications to compressed sensing, sparse covariance selection, matrix completion, etc. [5,15,67,77,78,83,97,133]. In particular, all three methods can be extended to solve (14) in a unified way and achieve $O(\sqrt{L/\epsilon})$ iteration complexity; see [143] and the discussion below. The work per iteration is between $O(n)$ and $O(n^3)$ flops for the applications of Sect. 1. In contrast, the number of iterations for interior-point methods is at best $O(\sqrt{n}\log\frac{1}{\epsilon})$ and the work per iteration is typically between $O(n^3)$ and $O(n^4)$ flops. Thus, for moderate $\epsilon$ (say, $\epsilon = .001$), moderate $L$ (which may depend on $n$), and large $n$ (say, $n \ge 10000$), a proximal gradient method can outperform interior-point methods.
$$y^k = x^k + \theta_k(\theta_{k-1}^{-1} - 1)(x^k - x^{k-1}), \qquad (34)$$
$$x^{k+1} = \arg\min_x \Bigl\{ \mathcal{F}(x; y^k) + \frac{L}{2}\|x - y^k\|^2 \Bigr\}, \qquad (35)$$
$$\theta_{k+1} = \frac{\sqrt{\theta_k^4 + 4\theta_k^2} - \theta_k^2}{2}. \qquad (36)$$

An inductive argument shows that $\theta_k \le \frac{2}{k+2}$ for all $k$. As $k \to \infty$, we have $\theta_k/\theta_{k-1} = \sqrt{1 - \theta_k} \to 1$, so that, by (34), $y^k$ is asymptotically an isometric extrapolation from $x^{k-1}$ through $x^k$.

The second APG method maintains an auxiliary sequence $\{z^k\}$. For any $x^0 = z^0 \in \mathop{\rm dom} P$ and $\theta_0 = 1$, it generates (for $k = 0, 1, \ldots$)
$$y^k = (1 - \theta_k)x^k + \theta_k z^k, \qquad (37)$$
$$z^{k+1} = \arg\min_x \bigl\{ \mathcal{F}(x; y^k) + \theta_k L\, D(x, z^k) \bigr\}, \qquad (38)$$
$$x^{k+1} = (1 - \theta_k)x^k + \theta_k z^{k+1}, \qquad (39)$$
with $\theta_{k+1}$ given by (36). Since $0 < \theta_k \le 1$, we have from $x^k, z^k \in \mathop{\rm dom} P$ that $y^k \in \mathop{\rm dom} P$. In the smooth constrained case (15), this method corresponds to Auslender and Teboulle's extension [8, Section 5] of Nesterov's second method [95]; also see [96, page 90]. A variant proposed by Lan, Lu, and Monteiro [74, Section 3] replaces (39) by a PG step from $y^k$.
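The first APG method (34)–(36) can be sketched on the same kind of toy instance as before (data and helper are ours), also checking the inductive bound $\theta_k \le 2/(k+2)$ quoted above:

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

rng = np.random.default_rng(1)
A = rng.standard_normal((20, 50))
b = rng.standard_normal(20)
tau = 0.1
L = np.linalg.norm(A, 2) ** 2

def F(x):
    return 0.5 * np.sum((A @ x - b) ** 2) + tau * np.abs(x).sum()

x_prev = x = np.zeros(50)
theta_prev = theta = 1.0                       # theta_{-1} = theta_0 = 1
for k in range(300):
    y = x + theta * (1.0 / theta_prev - 1.0) * (x - x_prev)            # (34)
    x_prev, x = x, soft_threshold(y - A.T @ (A @ y - b) / L, tau / L)  # (35)
    theta_prev = theta
    theta = (np.sqrt(theta**4 + 4 * theta**2) - theta**2) / 2.0        # (36)
    assert theta_prev <= 2.0 / (k + 2) + 1e-12  # theta_k <= 2/(k+2)
```

Unlike the PG iterates, the APG iterates need not decrease $F$ monotonically; the extrapolation (34) trades monotonicity for the $O(L/k^2)$ rate.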
The third APG method differs from the second method mainly in the computation of $z^{k+1}$. For any $x^0 \in \mathop{\rm dom} P$ and $z^0 = \arg\min_{x \in \mathop{\rm dom} P} \eta(x)$, $\theta_0 = 1$, it generates (for $k = 0, 1, \ldots$) $y^k$ by (37),

$$z^{k+1} = \arg\min_x \Bigl\{ \sum_{i=0}^k \frac{\mathcal{F}(x; y^i)}{\theta_i} + L\,\eta(x) \Bigr\}, \qquad (40)$$

and $x^{k+1}$, $\theta_{k+1}$ by (39), (36). Thus, (40) replaces $\mathcal{F}(x; y^k)/\theta_k$ in (38) by its cumulative sum and replaces $D(\cdot, z^k)$ by $\eta(\cdot)$. In the case of (15), this method is similar to Nesterov's third method [99] but with only one projection instead of two. In the case
of $E$ being a Hilbert space and $\eta(\cdot) = \frac12\|\cdot\|^2$, this method bears some resemblance to
an accelerated dual gradient method in [101, Section 4].
The preceding accelerated methods may look unintuitive, but they arise “naturally”
from refining the analysis of the PG method, as is discussed in Appendix A. Moreover,
these methods are equivalent (i.e., generate the same sequences) when $P \equiv 0$, $E$ is a Hilbert space, and the Bregman function $D$ is quadratic (i.e., $\eta(\cdot) = \frac12\|\cdot\|^2$). Some extensions of these methods, including cutting planes and estimating $L$, are discussed in [143]. In particular, $L$ can be estimated using backtracking: increase $L$ and repeat the iteration whenever a suitable sufficient descent condition (e.g., (52) or (56)) is violated;
see [14,94,143]. Below we summarize the iteration complexity of the PG and APG
methods.
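The backtracking just described can be sketched as a single generic PG step (our own naming; the test below is the standard quadratic upper-bound condition rather than the paper's exact conditions (52)/(56)):

```python
import numpy as np

def pg_step_backtracking(f, grad, prox, x, L=1.0, inc=2.0):
    """One PG step, increasing L until the sufficient descent condition
    f(x+) <= f(x) + <grad f(x), x+ - x> + (L/2)||x+ - x||^2 holds."""
    g = grad(x)
    while True:
        x_new = prox(x - g / L, 1.0 / L)
        d = x_new - x
        if f(x_new) <= f(x) + g @ d + 0.5 * L * (d @ d):
            return x_new, L
        L *= inc

# toy check: f(x) = (1/2)||x||^2 (true Lipschitz constant 1), P = 0
x_new, L_used = pg_step_backtracking(
    lambda x: 0.5 * (x @ x), lambda x: x, lambda v, t: v,
    np.array([1.0, -2.0]))
```

Because $L$ is only ever increased, the accepted value overestimates the local curvature by at most the factor `inc`, preserving the complexity bounds up to a constant.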
Theorem 1 (a) Let $\{x^k\}$ be generated by the PG method (33). For any $x \in \mathop{\rm dom} P$, we have

$$F(x^k) \le F(x) + \frac{1}{k}\, L\, D(x, x^0) \quad \forall k \ge 1.$$
(c) Let {x k } be generated by the second APG method (36)–(39). For any x ∈ dom P,
we have
(d) Let {x k } be generated by the third APG method (36), (37), (39), (40). For any
x ∈ dom P, we have
for some saddle function $\phi$ and convex set $V$ in a suitable space, the duality gap can be used to terminate the methods [92,98,99]. The dual problem is $\max_v Q(v)$, with dual
function
where x k+1 , y k , θk are generated by the third APG method, and we let
with v̄ −1 = 0 [143, Corollary 3(c)]; also see [77, Theorem 2.2] and [98] for related
results in the constrained case (15). Analogous bounds hold for the first two APG
methods; see [143, Corollaries 1(b) and 2].
When applied to (4), the first APG method yields an accelerated version of the iterative thresholding method of Daubechies et al. [14,36]. What about the other two methods? How efficiently can these methods be applied to solve (5), (7), (8), (10), (22), and related problems such as those in [50,122,125,126,154]? When applied to (7) and (8), a singular value decomposition is needed at each iteration, and efficiency depends on the cost of this decomposition. However, only the largest singular values and their associated singular vectors are needed [25]. Can these be efficiently computed or updated? Some progress on this has recently been made [83,133].

Can the iteration complexity be further improved? The proofs suggest that the convergence rate can be improved to $O(1/k^p)$ ($p > 2$) if we can replace $\|\cdot\|^2$ in the proofs by $\|\cdot\|^p$; see Appendix A. However, this may require using a higher-order approximation of $f$ in $\mathcal{F}(\cdot; y)$, so the improvement would not come "for free".
When E is a Hilbert space and P is block-separable (21), we can apply (33) block-
coordinatewise, possibly with L and D dynamically adjusted, resulting in a block-
coordinate gradient (BCG) method. More precisely, given x k ∈ dom P, we choose Jk
as the union of some subcollection of indices in J and choose a self-adjoint positive
and update

$$x^{k+1} = x^k + \alpha_k d^k, \qquad (42)$$

where $0 < \beta, \sigma < 1$ and $\Delta_k$ is the difference between the optimal value of (41) and $F(x^k)$. This rule requires only function evaluations, and $\Delta_k$ predicts the descent from $x^k$ along $d^k$. For global convergence, the index subset $J_k$ is chosen either in a Gauss–Seidel manner, i.e., $J_k \cup \cdots \cup J_{k+K}$ covers $\mathcal{J}$ for some constant $K \ge 0$ [32,89,107,141,147], or in a Gauss–Southwell manner to satisfy

$$\Delta_k \le \omega\, \Delta_k^{\rm all},$$

where $0 < \omega < 1$ and $\Delta_k^{\rm all}$ denotes the analog of $\Delta_k$ when $J_k$ is replaced by the entire coordinate index set [145,147]. Moreover, assuming (30) with $X = \mathop{\rm dom} P$, the BCG method using the Gauss–Southwell choice of $J_k$ finds an $\epsilon$-minimizer of $F$ in $O(L/\epsilon)$ iterations [145, Theorem 5.1].
The above BCG method, which has been successful for compressed sensing and variable selection in regression [89,127,134] and can be extended to handle linear constraints as in support vector machine (SVM) training [113,145], may also be suited for solving (5), (10), (13), (20), and related problems. When applied to (10) with $J_k$ indexing a row/column of $x$, $d^k$ and $\alpha_k$ are computable in $O(n^2)$ flops using the Schur complement and properties of the determinant. This method may be similarly applied to Lu's reformulation (20). This contrasts with the block-coordinate minimization method in [9], which uses $O(n^3)$ flops per iteration. This method can also be applied to solve (28) by using a smooth convex penalty $\frac{1}{2\theta}\max\{0, |\cdot| - \rho_{ij}\}^2$ ($\theta > 0$) for each constraint in (27) of the form $|\cdot| \le \rho_{ij}$. By choosing each coordinate block to comprise $z_i$, $y_{ii}$, and $(y_{ij})_{j \mid (i,j) \in A}$, the computation distributes, with sensor $i$ needing to
communicate only with its neighbors when updating its position—an important con-
sideration for practical implementation. The resulting method can accurately position
up to 9000 sensors in little over a minute; see [115, Section 7.3]. Can Nesterov’s
extrapolation techniques of Sect. 3.1 be adapted to the BCG method?
A problem that frequently arises in machine learning and neural network training has
the form (14) with
$$f(x) = \sum_{i=1}^m f_i(x), \qquad (44)$$

where each $f_i$ is smooth. The classical incremental gradient method updates

$$x^{k+1} = x^k - \alpha_k \nabla f_{i_k}(x^k), \qquad (45)$$

with $i_k$ chosen cyclically from $\{1, \ldots, m\}$ and stepsize $\alpha_k > 0$ either constant or adjusted dynamically; see [16,71,79,86,91,129,139] and references therein. When
some of the $\nabla f_i$'s are "similar", this incremental method is more efficient than the (pure) gradient method since it does not wait for all component gradients $\nabla f_1, \ldots, \nabla f_m$ to be evaluated before updating $x$. However, global convergence (i.e., $\nabla f(x^k) \to 0$) generally requires $\alpha_k \to 0$, which slows convergence. Recently, Blatt, Hero and Gauchman
[23] proposed an aggregate version of (45) that approximates ∇ f (x k ) by
$$g^k = \sum_{\ell = k-m+1}^{k} \nabla f_{i_\ell}(x^\ell).$$

This method achieves global convergence for any sufficiently small constant stepsize $\alpha_k$, but requires $O(mn)$ storage. We can reduce the storage to $O(n)$ by updating a cumulative average of the component gradients:

$$g_{\rm sum}^k = g_{\rm sum}^{k-1} + \nabla f_{i_k}(x^k), \qquad g^k = \frac{m}{k}\, g_{\rm sum}^k,$$

with $g_{\rm sum}^{-1} = 0$. We then use $g^k$ in the PG, APG, or BCG method. The resulting incremental methods share features with recently proposed averaging gradient methods [68,102]. What are their convergence properties, iteration complexities, and practical performances?
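The $O(n)$-storage running average can be sketched as follows (our own minimal loop, for quadratic components $f_i(x) = \frac12(a_i^T x - b_i)^2$ and no $P$ term; the open convergence questions raised above apply to this sketch as well):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 100, 5
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)   # sum of the f_i
x = np.zeros(n)
g_sum = np.zeros(n)            # cumulative sum of component gradients
alpha = 1e-3                   # small constant stepsize
for k in range(1, 5001):
    i = (k - 1) % m                            # cyclic component choice i_k
    g_sum += A[i] * (A[i] @ x - b[i])          # g_sum^k = g_sum^{k-1} + grad f_{i_k}(x^k)
    g = (m / k) * g_sum                        # averaged gradient estimate g^k
    x -= alpha * g
```

Only the $n$-vector `g_sum` is stored, in contrast with the $O(mn)$ storage of the aggregate method; the price is that old gradients, evaluated at stale iterates, never leave the average.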
Under the EB condition, asymptotic linear and even superlinear convergence can be
established for various methods, including interior point, gradient projection, proximal
minimization, coordinate minimization, and coordinate gradient methods—even if X̄
is unbounded; see [13,51,81,82,140,145,146,147] and references therein. Moreover,
the EB condition holds under any of the following conditions; see [147, Theorem 4]
as well as [110, Theorem 3.1], [82, Theorem 2.1] for the constrained case (15).
C1. $E = \Re^n$, $f(x) = h(Ax) + \langle c, x\rangle$ for all $x \in \Re^n$, where $A \in \Re^{m\times n}$, $c \in \Re^n$, and $h$ is a strongly convex differentiable function on $\Re^m$ with $\nabla h$ Lipschitz continuous on $\Re^m$. $P$ is polyhedral.
C2. $E = \Re^n$, $f(x) = \max_{y \in Y}\{\langle y, Ax\rangle - h(y)\} + \langle c, x\rangle$ for all $x \in \Re^n$, where $Y$ is a polyhedral set in $\Re^m$, $A \in \Re^{m\times n}$, $c \in \Re^n$, and $h$ is a strongly convex differentiable function on $\Re^m$ with $\nabla h$ Lipschitz continuous on $\Re^m$. $P$ is polyhedral.
C3. $f$ is strongly convex and satisfies (30) for some $L > 0$.
What if f is not strongly convex and P is non-polyhedral? In particular, we are
interested in the group lasso for linear and logistic regression (see (16), (22), (23)), for
which f is not strongly convex (unless A has full column rank) and P is non-polyhedral
(unless J is a singleton for all J ∈ J ). The following new result shows that the error
bound (47) holds for the group lasso. The proof, given in Appendix B, exploits the
structure of the weighted sum of 2-norms (23). To our knowledge, this is the first
Lipschitzian error bound result for f not strongly convex and P non-polyhedral.
Theorem 2 Suppose that $E = \Re^n$, $P$ has the form (23) with $\omega_J > 0$ for all $J \in \mathcal{J}$, and $f$ has the form $f(x) = h(Ax)$ with $A \in \Re^{m\times n}$ and $h$ a strictly convex differentiable function such that (a) $\nabla h$ is Lipschitz continuous on any compact convex subset of $\mathop{\rm dom} h$, and (b) $h(y) \to \infty$ whenever $y$ approaches a boundary point of $\mathop{\rm dom} h$. If $\bar X \ne \emptyset$, then the error bound (47) holds.
$$h(y) = \frac12\|y - b\|^2,$$

corresponding to linear regression, and

$$h(y) = \sum_{i=1}^m \log\bigl(1 + e^{y_i}\bigr) - \langle b, y\rangle \quad \text{with } b \in \{0,1\}^m,$$

$$h(y) = -\sum_{i=1}^m \log(y_i) + \langle b, y\rangle \quad \text{with } b \ge 0,$$
which arises in likelihood estimation under Poisson noise [122]. In the first two exam-
ples, h is bounded from below by zero. In the third example, h is unbounded from
below but tends to $-\infty$ sublinearly. Since $P$ given by (23) is homogeneous of degree 1, it is readily seen that $\bar X \ne \emptyset$ for all three examples. (An example with $\bar X = \emptyset$ is $\min_x e^x + 2x + |x|$.)
However, $\bar X$ need not be a singleton. An example is

$$\min_x \; \frac12|x_1 + x_2 - 2|^2 + |x_1| + |x_2|,$$
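Indeed, one can check that every point $(t, 1-t)$ with $t \in [0,1]$ attains the same value $3/2$ in this example, so the solution set is a whole segment. A quick numerical confirmation (ours):

```python
import numpy as np

F = lambda x1, x2: 0.5 * (x1 + x2 - 2.0) ** 2 + abs(x1) + abs(x2)

# along the segment (t, 1 - t), 0 <= t <= 1, the objective is constant:
vals = [F(t, 1.0 - t) for t in np.linspace(0.0, 1.0, 11)]
```

Nearby points off the segment, e.g. $(0.6, 0.6)$, give strictly larger values, consistent with the segment being the solution set.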
Proof Theorem 2 shows that [147, Assumption 2(a)] is satisfied. By [147, Theorem 1(a)], $\{F(x^k)\}$ is nonincreasing, so, by (49), $\{x^k\}$ is bounded. Since $F$ is convex, [147, Assumption 2(b)] is automatically satisfied. Also, $\{x^k\}$ lies in a compact convex subset of $\mathop{\rm dom} f$, over which $\nabla f$ is Lipschitz continuous (since $\nabla h$ is Lipschitz continuous over the image of this subset under $A$). Conditions (i)–(iii) imply that the remaining assumptions in [147, Theorem 2(b)] are satisfied. In particular, the restricted Gauss–Seidel rule in [147] holds with the given $T$. Since $\{x^k\}$ is bounded, [147, Theorem 2(b)] implies that $\{F(x^k)\}_{k\in T}$ converges at least Q-linearly and $\{x^k\}_{k\in T}$ converges at least R-linearly [107].
By Corollary 1, the method in [158], [89, Section 2.2.1] for linear group lasso (corresponding to $H^k = A^T A$) and the method in [89, Section 2.2.2] for logistic group lasso (corresponding to $H^k = \nu_k I$ with $\nu_k > 0$) attain linear convergence. Linear convergence of the block-coordinate minimization methods used in [148] for TV-regularized image reconstruction, with $f(w,u) = \frac12\|w - Bu\|_2^2 + \frac{\gamma}{2}\|Au - b\|_2^2$, $P(w,u) = \sum_J \|w_J\|_2$, and in [153] for TV-regularized image deblurring, with $f(w,u,z) = \frac12\|w - Bu\|_2^2 + \frac{\gamma}{2}\|Au - b - z\|_2^2$, $P(w,u,z) = \sum_J \omega_J \|w_J\|_2 + \|z\|_1$, can be analyzed similarly.
Since $f$ is convex and (30) holds with $X \supseteq \mathop{\rm dom} P$, we have from (31) that

$$F(x) \ge \mathcal{F}(x; y) \ge F(x) - \frac{L}{2}\|x - y\|^2 \quad \forall x \in \mathop{\rm dom} P,\; y \in X. \qquad (50)$$

The following "3-point" property is also key; see [33, Lemma 3.2], [74, Lemma 6], [143, Section 2].

3-Point Property: For any proper lsc convex function $\psi : E \to (-\infty, \infty]$ and any $z \in X$, if $\eta$ is differentiable at $z^+ = \arg\min_x \{\psi(x) + D(x, z)\}$, then

$$\psi(x) + D(x, z) \ge \psi(z^+) + D(z^+, z) + D(x, z^+) \quad \forall x \in \mathop{\rm dom}\psi.$$
The APG methods can be motivated by analyzing the PG method (33). Let {x k } be
generated by the PG method. For any x ∈ dom P and any k ∈ {0, 1, . . . }, we have
$$\begin{aligned} F(x^{k+1}) &\le \mathcal{F}(x^{k+1}; x^k) + \frac{L}{2}\|x^{k+1} - x^k\|^2 \\ &\le \mathcal{F}(x^{k+1}; x^k) + L\,D(x^{k+1}, x^k) \\ &\le \mathcal{F}(x; x^k) + L\,D(x, x^k) - L\,D(x, x^{k+1}) \\ &\le F(x) + L\,D(x, x^k) - L\,D(x, x^{k+1}) \quad \forall x \in \mathop{\rm dom} P, \end{aligned} \qquad (51)$$

where the first and fourth inequalities use (50), the second inequality uses (32), and the third inequality uses (33) and the 3-Point Property with $\psi(x) = \mathcal{F}(x; x^k)/L$. Letting $e_k = F(x^k) - F(x)$ and $\Delta_k = L\,D(x, x^k)$, this simplifies to the recursion

$$e_{k+1} \le \Delta_k - \Delta_{k+1} \quad \forall k \ge 0.$$
$$\begin{aligned} F(x^{k+1}) &\le \mathcal{F}(x^{k+1}; y^k) + \frac{L}{2}\|x^{k+1} - y^k\|^2 \qquad\qquad (52) \\ &\le \mathcal{F}(y; y^k) + \frac{L}{2}\|y - y^k\|^2 - \frac{L}{2}\|y - x^{k+1}\|^2 \\ &\le F(y) + \frac{L}{2}\|y - y^k\|^2 - \frac{L}{2}\|y - x^{k+1}\|^2 \quad \forall y \in \mathop{\rm dom} P, \qquad (53) \end{aligned}$$

where the first and third inequalities use (50), and the second inequality uses (35) and the 3-Point Property. To get $L$ to be scaled by something tending to zero, set $y = (1-\theta_k)x^k + \theta_k x$ in the above inequality, with $x \in \mathop{\rm dom} P$ arbitrary and $0 < \theta_k \le 1$ to be determined. We can then factor $\theta_k$ out of $x$ to scale $L$, yielding
$$\begin{aligned} F(x^{k+1}) &\le F((1-\theta_k)x^k + \theta_k x) + \frac{L}{2}\|(1-\theta_k)x^k + \theta_k x - y^k\|^2 \\ &\quad - \frac{L}{2}\|(1-\theta_k)x^k + \theta_k x - x^{k+1}\|^2 \\ &= F((1-\theta_k)x^k + \theta_k x) + \theta_k^2 \frac{L}{2}\|x + (\theta_k^{-1} - 1)x^k - \theta_k^{-1} y^k\|^2 \\ &\quad - \theta_k^2 \frac{L}{2}\|x + (\theta_k^{-1} - 1)x^k - \theta_k^{-1} x^{k+1}\|^2, \end{aligned}$$

where we have rearranged the terms to look like the recursion (51). We want the two terms inside $\|\cdot\|^2$ to have the form "$x - z^k$" and "$x - z^{k+1}$", which we get by setting $z^k := \theta_k^{-1} y^k - (\theta_k^{-1} - 1)x^k$ and $z^{k+1} := \theta_k^{-1} x^{k+1} - (\theta_k^{-1} - 1)x^k$. Using also the convexity of $F$, this yields

$$F(x^{k+1}) \le (1-\theta_k)F(x^k) + \theta_k F(x) + \theta_k^2 \frac{L}{2}\|x - z^k\|^2 - \theta_k^2 \frac{L}{2}\|x - z^{k+1}\|^2 \quad \forall k. \qquad (54)$$
Upon dividing both sides by $\theta_k^2$, we see that, by choosing $\theta_{k+1}$ so that $\frac{1}{\theta_k^2} = \frac{1-\theta_{k+1}}{\theta_{k+1}^2}$ (which, upon solving for $\theta_{k+1}$, yields (36)), this rewrites as the recursion

$$\frac{1-\theta_{k+1}}{\theta_{k+1}^2}\, e_{k+1} + \Delta_{k+1} \le \frac{1-\theta_k}{\theta_k^2}\, e_k + \Delta_k,$$

where now $e_k = F(x^k) - F(x)$ and $\Delta_k := \frac{L}{2}\|x - z^k\|^2$, and hence

$$\frac{1-\theta_{k+1}}{\theta_{k+1}^2}\, e_{k+1} + \Delta_{k+1} \le \frac{1-\theta_0}{\theta_0^2}\, e_0 + \Delta_0 \quad \forall k.$$

Since $\frac{1-\theta_{k+1}}{\theta_{k+1}^2} = \frac{1}{\theta_k^2}$, setting $\theta_0 = 1$ simplifies this to $\frac{1}{\theta_k^2}\, e_{k+1} \le \Delta_0 - \Delta_{k+1} \le \Delta_0$. Also, we have from (34) and taking $\theta_{-1} = 1$ that $z^0 = y^0 = x^0$. This proves Theorem 1(b).
The preceding proof depends crucially on rearranging terms inside $\|\cdot\|^2$, so it cannot be directly extended to other Bregman functions $D$. However, (54) suggests we seek a recursion of the form

$$F(x^{k+1}) \le (1-\theta_k)F(x^k) + \theta_k F(x) + \theta_k^2 L\,D(x, z^k) - \theta_k^2 L\,D(x, z^{k+1}), \qquad (55)$$

which, as our derivation of (51) and (54) suggests, can be achieved by setting $z^{k+1}$ by (38) and using the 3-Point Property, setting $x^{k+1}$ by (39) (analogous to our setting of $y$ in (53)), and then choosing $y^k$ so that $x^{k+1} - y^k = \theta_k(z^{k+1} - z^k)$ (which works out to be (37)). Specifically, for any $k \in \{0, 1, \ldots\}$, we have
$$\begin{aligned} F(x^{k+1}) &\le \mathcal{F}(x^{k+1}; y^k) + \frac{L}{2}\|x^{k+1} - y^k\|^2 \\ &= \mathcal{F}((1-\theta_k)x^k + \theta_k z^{k+1}; y^k) + \theta_k^2 \frac{L}{2}\|z^{k+1} - z^k\|^2 \\ &\le (1-\theta_k)\mathcal{F}(x^k; y^k) + \theta_k \mathcal{F}(z^{k+1}; y^k) + \theta_k^2 L\,D(z^{k+1}, z^k) \\ &\le (1-\theta_k)\mathcal{F}(x^k; y^k) + \theta_k \mathcal{F}(x; y^k) + \theta_k^2 L\,D(x, z^k) - \theta_k^2 L\,D(x, z^{k+1}), \end{aligned}$$

where the first inequality uses (50), the second inequality uses the convexity of $\mathcal{F}(\cdot; y^k)$ and (32), and the last inequality uses the 3-Point Property with $\psi(x) = \mathcal{F}(x; y^k)/(\theta_k L)$. Using (50) yields (55), from which Theorem 1(c) follows.
The third APG method (see (40)) can be derived and analyzed similarly using an
analog of the 3-Point Property for η; see [143, Property 2]. For brevity we omit the
details.
so that $\bar X$ is bounded (since $\omega_J > 0$ for all $J \in \mathcal{J}$), as well as closed and convex.
Since F is convex and X̄ is bounded, it follows from [118, Theorem 8.7] that (49)
holds. By using these observations and Lemma 1, we prove below the EB condition
(47).
We argue by contradiction. Suppose there exists a $\zeta \ge \min F$ such that (47) fails to hold for all $\kappa > 0$ and $\epsilon > 0$. Then there exists a sequence $x^1, x^2, \ldots$ in $\Re^n \setminus \bar X$ satisfying

$$F(x^k) \le \zeta \;\;\forall k, \qquad \{r^k\} \to 0, \qquad \Bigl\{\frac{r^k}{\delta_k}\Bigr\} \to 0, \qquad (60)$$

where $\bar x^k$ denotes the point of $\bar X$ nearest to $x^k$ in the 2-norm and $\delta_k := \|x^k - \bar x^k\|_2$, and

$$u^k := \frac{x^k - \bar x^k}{\delta_k} \quad \forall k. \qquad (64)$$
$$Ax^k = A(\bar x^k + \delta_k u^k) = \bar y + \delta_k A u^k = \bar y + o(\delta_k).$$

Since $Ax^k$ and $\bar y$ are in $Y$, the Lipschitz continuity of $\nabla h$ on $Y$ (see (62)) and (61) yield

$$g^k = \bar g + o(\delta_k). \qquad (65)$$
Case (a). In this case, Lemma 1 implies that $r_J^k = -x_J^k$ for all $k$. Since $\{r^k\} \to 0$ and $\{x^k\} \to \bar x$, this implies $\bar x_J = 0$. Also, by (64) and (60),

$$u_J^k = \frac{-r_J^k - \bar x_J^k}{\delta_k} = \frac{o(\delta_k) - \bar x_J^k}{\delta_k}.$$

$$\bar g_J + \omega_J \frac{\bar x_J^k}{\|\bar x_J^k\|_2} = 0, \qquad (66)$$

so $\bar u_J$ is a positive multiple of $\bar g_J$.
Case (b). Since $\bar x^k \in \bar X$ and $\bar x_J^k \ne 0$, the optimality condition for (14) with $\tau = 1$, (23), and $\nabla f(\bar x^k) = \bar g$ imply that (66) holds for all $k$. Then Lemma 1 implies

$$\begin{aligned} -r_J^k &= \Bigl(1 - \frac{\omega_J}{\|g_J^k - x_J^k\|_2}\Bigr) g_J^k + \frac{\omega_J\, x_J^k}{\|g_J^k - x_J^k\|_2} \\ &= \Bigl(1 - \frac{\omega_J}{\|g_J^k - x_J^k\|_2}\Bigr)(\bar g_J + o(\delta_k)) + \frac{\omega_J(\bar x_J^k + \delta_k u_J^k)}{\|g_J^k - x_J^k\|_2} \\ &= \bar g_J + \frac{\omega_J(\bar x_J^k - \bar g_J)}{\|g_J^k - x_J^k\|_2} + o(\delta_k) + \frac{\omega_J \delta_k u_J^k}{\|g_J^k - x_J^k\|_2} \\ &= \Bigl(\frac{\omega_J}{\|\bar g_J - \bar x_J^k\|_2} - \frac{\omega_J}{\|g_J^k - x_J^k\|_2}\Bigr)(\bar g_J - \bar x_J^k) + o(\delta_k) + \frac{\omega_J \delta_k u_J^k}{\|g_J^k - x_J^k\|_2} \\ &= \Bigl(\frac{\omega_J}{\|\bar g_J - \bar x_J^k\|_2} - \frac{\omega_J}{\|\bar g_J - \bar x_J^k - \delta_k u_J^k + o(\delta_k)\|_2}\Bigr)(\bar g_J - \bar x_J^k) + o(\delta_k) + \frac{\omega_J \delta_k u_J^k}{\|\bar g_J - \bar x_J^k - \delta_k u_J^k + o(\delta_k)\|_2} \\ &= \frac{\omega_J \langle \bar g_J - \bar x_J^k, -\delta_k u_J^k\rangle}{\|\bar g_J - \bar x_J^k\|_2^3} (\bar g_J - \bar x_J^k) + o(\delta_k) + \frac{\omega_J \delta_k u_J^k}{\|\bar g_J - \bar x_J^k\|_2} \\ &= \frac{\omega_J \delta_k}{\|\bar g_J - \bar x_J^k\|_2} \Bigl( \frac{\langle \bar g_J, -u_J^k\rangle}{\omega_J^2}\, \bar g_J + u_J^k \Bigr) + o(\delta_k), \end{aligned}$$
where the second and fifth equalities use (64) and (65); the fourth and
ḡ J −x̄ kJ
last equalities use (66) so that ḡ J = ω J ; the sixth equality uses
ḡ J −x̄ kJ 2
∇x x−1
2 = − x
x
3. Multiplying both sides by ḡ J − x̄ kJ 2 /(ω J δk ) and
2
using (60) and ḡ J 2 = ω J (by (66)) yields in the limit
ḡ J , ū J
0=− ḡ J + ū J . (67)
ḡ J 22
ωJ ωJ ω J x kJ
−r Jk = − k g kJ +
ḡ J 2 g J − x kJ 2 g kJ − x kJ 2
ω J ḡ J , g kJ − x kJ − ḡ J
ω J x kJ
= g kJ + o(g kJ − x kJ − ḡ J 2 ) +
ḡ J 32 g kJ − x kJ 2
ω J ḡ J , x kJ
k ω J x kJ
= − gJ + o(δk ) + ,
ḡ J 32 g kJ − x kJ 2
x kJ + r Jk
g kJ + r Jk + ω J = 0 ∀k.
x kJ + r Jk 2
Hence

\[
\begin{aligned}
\langle g_J^k, u_J^k\rangle = \frac{\langle g_J^k, x_J^k\rangle}{\delta_k}
&= -\frac{\langle r_J^k, x_J^k\rangle}{\delta_k} - \omega_J\, \frac{\|x_J^k\|_2^2 + \langle r_J^k, x_J^k\rangle}{\delta_k\, \|x_J^k + r_J^k\|_2}\\
&= -\frac{\langle r_J^k, x_J^k\rangle}{\delta_k} - \omega_J\, \frac{\|u_J^k\|_2 + \big\langle r_J^k/\delta_k,\; x_J^k/\|x_J^k\|_2\big\rangle}{\left\|\dfrac{x_J^k}{\|x_J^k\|_2} + \dfrac{r_J^k}{\|x_J^k\|_2}\right\|_2}.
\end{aligned}
\]

Then (60) and $\{\|x_J^k\|_2/\delta_k\} \to \|\bar u_J\|_2 > 0$ yield in the limit that $\langle \bar g_J, \bar u_J\rangle = -\omega_J \|\bar u_J\|_2 < 0$. This together with (67) implies that $\bar u_J$ is a negative multiple of $\bar g_J$.
For $k$ sufficiently large, let $\hat x := \bar x^k + \epsilon \bar u$ with $\epsilon > 0$. Since $A\bar u = 0$, we have $\nabla f(\hat x) = \nabla f(\bar x^k) = \bar g$. We show below that, for each $J \in \mathcal{J}$,

\[
0 \in \bar g_J + \omega_J\, \partial\|\hat x_J\|_2, \tag{68}
\]

so that $\hat x \in \bar X$. Then

\[
\|x^k - \hat x\|_2^2 = \|x^k - \bar x^k - \epsilon\bar u\|_2^2 = \|x^k - \bar x^k\|_2^2 - 2\epsilon\langle x^k - \bar x^k, \bar u\rangle + \epsilon^2\|\bar u\|_2^2 < \|x^k - \bar x^k\|_2^2
\]

for all $\epsilon > 0$ sufficiently small, which contradicts $\bar x^k$ being the point in $\bar X$ nearest to $x^k$ and thus proves (63). For each $J \in \mathcal{J}$, if $\bar u_J = 0$, then $\hat x_J = \bar x_J^k$ and (68) holds automatically (since $\bar x^k \in \bar X$). Suppose instead that $\bar u_J \ne 0$. We prove (68) below by considering the three aforementioned cases (a), (b), and (c); when $\hat x_J \ne 0$, (68) reduces to

\[
\bar g_J + \omega_J \frac{\hat x_J}{\|\hat x_J\|_2} = 0. \tag{69}
\]
The remaining argument is similar to the proof of [147, Theorem 4], but using (63) and the strong convexity of $h$ in place of the strong convexity of $f$. For each $k$, since $r^k$ is a solution of the subproblem (46), by Fermat's rule [119, Theorem 10.1],

\[
r^k \in \arg\min_{d}\; \langle g^k + r^k, d\rangle + P(x^k + d).
\]

Hence

\[
\langle g^k + r^k, r^k\rangle + P(x^k + r^k) \le \langle g^k + r^k, \bar x^k - x^k\rangle + P(\bar x^k).
\]

Also, since $\bar x^k \in \bar X$, we have $-\bar g \in \partial P(\bar x^k)$, so

\[
P(\bar x^k) \le \langle \bar g,\, x^k + r^k - \bar x^k\rangle + P(x^k + r^k).
\]

Adding these two inequalities and rearranging terms yields

\[
\langle g^k - \bar g,\, x^k - \bar x^k\rangle + \|r^k\|_2^2 \le \langle \bar g - g^k, r^k\rangle + \langle r^k, \bar x^k - x^k\rangle.
\]

Since $Ax^k$ and $A\bar x^k = \bar y$ are in $Y$, we also have from (61), (62), and (63) that

\[
\langle g^k - \bar g,\, x^k - \bar x^k\rangle = \langle \nabla h(Ax^k) - \nabla h(\bar y),\, Ax^k - \bar y\rangle \ge \sigma \|Ax^k - \bar y\|_2^2 \ge \frac{\sigma}{\kappa^2}\|x^k - \bar x^k\|_2^2.
\]

Combining these two inequalities and using (62) yields

\[
\begin{aligned}
\frac{\sigma}{\kappa^2}\|x^k - \bar x^k\|_2^2 + \|r^k\|_2^2 &\le \langle \nabla h(\bar y) - \nabla h(Ax^k),\, Ar^k\rangle + \langle r^k, \bar x^k - x^k\rangle\\
&\le L\|A\|_2^2\, \|x^k - \bar x^k\|_2\, \|r^k\|_2 + \|x^k - \bar x^k\|_2\, \|r^k\|_2,
\end{aligned}
\]

so

\[
\frac{\sigma}{\kappa^2}\|x^k - \bar x^k\|_2^2 \le (L\|A\|_2^2 + 1)\,\|x^k - \bar x^k\|_2\, \|r^k\|_2 \quad \forall k.
\]

Thus $\delta_k = \|x^k - \bar x^k\|_2 \le \frac{\kappa^2}{\sigma}(L\|A\|_2^2 + 1)\,\|r^k\|_2$ for all $k$, contradicting $\{r^k/\delta_k\} \to 0$.
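As an illustration (not from the paper), an error bound of this type can be observed numerically on a one-dimensional lasso instance, with $r(x)$ the proximal-gradient residual for $\tau = 1$; the instance and constants below are hypothetical.

```python
import numpy as np

def soft(v, t):
    """Scalar soft-thresholding: the proximal map of t*|.| at v."""
    return np.sign(v) * max(abs(v) - t, 0.0)

def residual(x, a, b, omega):
    """Proximal-gradient residual r(x) = prox_P(x - grad f(x)) - x (tau = 1)
    for f(x) = 0.5*(a*x - b)^2 and P(x) = omega*|x|."""
    g = a * (a * x - b)
    return soft(x - g, omega) - x

# hypothetical instance: min 0.5*(2x - 3)^2 + |x|, unique minimizer x_bar = 1.25,
# so the residual vanishes exactly at x_bar.
a, b, omega, x_bar = 2.0, 3.0, 1.0, 1.25

# EB check: the ratio dist(x, X_bar) / |r(x)| stays bounded near the solution.
ratios = [abs(x - x_bar) / abs(residual(x, a, b, omega))
          for x in np.linspace(0.8, 1.6, 50)]
```

On this interval the ratio is identically $1/4$, so an inequality of the form (47) holds there with $\kappa = 1/4$; the content of the analysis above is that such a $\kappa$ exists for the general 2-norm-regularized class.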
References
1. Abernethy, J., Bach, F., Evgeniou, T., Vert, J.-P.: A new approach to collaborative filtering: operator
estimation with spectral regularization. J. Mach. Learn. Res. 10, 803–826 (2009)
2. Alfakih, A.Y.: Graph rigidity via Euclidean distance matrices. Linear Algebra Appl. 310, 149–165
(2000)
3. Argyriou, A., Evgeniou, T., Pontil, M.: Convex multi-task feature learning. Mach. Learn. 73, 243–272
(2008)
4. Aspnes, J., Goldenberg, D., Yang, Y.R.: On the computational complexity of sensor network
localization. In: Algorithmic Aspects of Wireless Sensor Networks: First International Workshop,
ALGOSENSORS 2004 (Turku, Finland, July 2004), vol. 3121 of Lecture Notes in Computer Sci-
ence, Springer, pp. 32–44
5. D’Aspremont, A., Banerjee, O., Ghaoui, L.E.: First-order methods for sparse covariance selection.
SIAM J. Matrix Anal. Appl. 30(1), 56–66 (2008)
6. Auslender, A.: Minimisation sans contraintes de fonctions localement lipschitziennes: Applications
à la programmation mi-convexe, mi-différentiable. Nonlinear Program. 3, 429–460 (1978)
7. Auslender, A., Teboulle, M.: Interior projection-like methods for monotone variational inequalities. Math. Program. 104 (2005)
8. Auslender, A., Teboulle, M.: Interior gradient and proximal methods for convex and conic optimiza-
tion. SIAM J. Optim. 16, 697–725 (2006)
9. Banerjee, O., Ghaoui, L.E., D’Aspremont, A.: Model selection through sparse maximum likelihood
estimation for multivariate Gaussian or binary data. J. Mach. Learn. Res. 9, 485–516 (2008)
10. Baraniuk, R., Davenport, M., DeVore, R., Wakin, M.: A simple proof of the restricted isometry
property for random matrices. Constr. Approx. 28(3), 253–263 (2008)
11. Bauschke, H.H., Borwein, J.M., Combettes, P.L.: Bregman monotone optimization algorithms. SIAM
J. Control Optim. 42(2), 596–636 (2003)
12. Beck, A., Teboulle, M.: Mirror descent and nonlinear projected subgradient methods for convex
optimization. Oper. Res. Lett. 31, 167–175 (2003)
13. Beck, A., Teboulle, M.: A linearly convergent dual-based gradient projection algorithm for quadrati-
cally constrained convex minimization. Math. Oper. Res. 31(2), 398–417 (2006). doi:10.1287/moor.
1060.0193
14. Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems.
Technical report, Department of Industrial Engineering and Management, Technion, Haifa (2008)
15. Becker, S., Bobin, J., Candès, E.: NESTA: a fast and accurate first-order method for sparse recovery.
Technical report, California Institute of Technology, April (2009)
16. Bertsekas, D.P.: Nonlinear Programming. 2nd edn. Athena Scientific, Belmont, MA (1999)
17. Biswas, P., Aghajan, H., Ye, Y.: Semidefinite programming algorithms for sensor network localization
using angle of arrival information. In: Proceedings of 39th Annual Asilomar Conference on Signals,
Systems, and Computers (Pacific Grove, CA, 2005)
18. Biswas, P., Liang, T.-C., Toh, K.-C., Wang, T.-C., Ye, Y.: Semidefinite programming approaches
for sensor network localization with noisy distance measurements. IEEE Trans. Autom. Sci. Eng. 3 (2006)
19. Biswas, P., Liang, T.-C., Wang, T.-C., Ye, Y.: Semidefinite programming based algorithms for sensor
network localization. ACM Trans. Sens. Netw. 2, 188–220 (2006)
20. Biswas, P., Toh, K.-C., Ye, Y.: A distributed SDP approach for large-scale noisy anchor-free graph
realization with applications to molecular conformation. SIAM J. Sci. Comput. 30(3), 1251–1277
(2008)
21. Biswas, P., Ye, Y.: A distributed method for solving semidefinite programs arising from ad hoc wireless sensor network localization. In: Multiscale Optimization Methods and Applications, vol. 82 of Nonconvex Optimization and Its Applications. Springer (2003)
22. Biswas, P., Ye, Y.: Semidefinite programming for ad hoc wireless sensor network localization. In:
Proceedings of the 3rd International Symposium on Information Processing in Sensor Networks
(2004)
23. Blatt, D., Hero, A.O., Gauchman, H.: A convergent incremental gradient method with a constant step
size. SIAM J. Optim. 18 (2007)
24. Bregman, L.: The relaxation method of finding the common point of convex sets and its application
to the solution of problems in convex programming. USSR Comput. Math. Math. Phys. 7, 200–
217 (1967)
25. Cai, J.-F., Candès, E.J., Shen, Z.: A singular value thresholding algorithm for matrix completion.
Report, California Institute of Technology, Pasadena, September (2008)
26. Candès, E.: The restricted isometry property and its implications for compressed sensing. Comptes
Rendus-Mathématique 346(9–10), 589–592 (2008)
27. Candès, E.J., Recht, B.: Exact matrix completion via convex optimization. To appear in Found. of
Comput. Math. (2008)
28. Candès, E.J., Romberg, J., Tao, T.: Robust uncertainty principles: exact signal reconstruction from
highly incomplete frequency information. IEEE Trans. Inform. Theory 52(2), 489–509 (2006)
29. Candès, E.J., Romberg, J., Tao, T.: Stable signal recovery from incomplete and inaccurate measure-
ments. Commun. Pure Appl. Math. 59(8), 1207–1223 (2006)
30. Candès, E.J., Tao, T.: Decoding by linear programming. IEEE Trans. Inform. Theory 51(12), 4203–
4215 (2005)
31. Carter, M.W., Jin, H.H., Saunders, M.A., Ye, Y.: Spaseloc: an adaptive subproblem algorithm for
scalable wireless sensor network localization. SIAM J. Optim. 17(4), 1102–1128 (2006)
32. Censor, Y., Zenios, S.: Parallel Optimization: Theory, Algorithms, and Applications. Oxford
University Press, New York (1997)
33. Chen, G., Teboulle, M.: Convergence analysis of a proximal-like minimization algorithm using Breg-
man functions. SIAM J. Optim. 3, 538–543 (1993)
34. Chen, S.S., Donoho, D.L., Saunders, M.A.: Atomic decomposition by basis pursuit. SIAM J. Sci.
Comput. 20(1), 33–61 (1998)
35. Dasarathy, B., White, L.J.: A maxmin location problem. Oper. Res. 28(6), 1385–1401 (1980)
36. Daubechies, I., Defrise, M., Mol, C.D.: An iterative thresholding algorithm for linear inverse problems
with a sparsity constraint. Commun. Pure Appl. Math. 57, 1413–1457 (2004)
37. Davies, M.E., Gribonval, R.: Restricted isometry constants where ℓp sparse recovery can fail for
0 < p ≤ 1. Publication interne 1899, Institut de Recherche en Informatique et Systèmes Aléatoires,
July (2008)
38. Ding, Y., Krislock, N., Qian, J., Wolkowicz, H.: Sensor network localization, euclidean distance
matrix completions, and graph realization. Technical report, Department of Combinatorics and Opti-
mization, University of Waterloo, Waterloo, February (2008)
39. Doherty, L., Pister, K.S.J., El Ghaoui, L.: Convex position estimation in wireless sensor networks.
In: Proceedings of 20th INFOCOM, vol. 3 (2001)
40. Donoho, D.L.: For most large underdetermined systems of linear equations the minimal l1-norm near-
solution approximates the sparsest near-solution. Technical report, Department of Statistics, Stanford
University, Stanford (2006)
41. Donoho, D.L.: For most large underdetermined systems of linear equations, the minimal l1-norm
solution is also the sparsest solution. Commun. Pure Appl. Math. 59, 797–829 (2006)
42. Donoho, D.L., Elad, M.: Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ1 minimization. Proc. Natl. Acad. Sci. USA 100(5), 2197–2202 (2003)
43. Donoho, D.L., Elad, M., Temlyakov, V.: Stable recovery of sparse overcomplete representations in
the presence of noise. IEEE Trans. Inform. Theory 52(1), 6–18 (2006)
44. Donoho, D.L., Huo, X.: Uncertainty principles and ideal atomic decomposition. IEEE Trans. Inform.
Theory 47(7), 2845–2862 (2001)
45. Donoho, D.L., Johnstone, I.M.: Ideal spatial adaptation by wavelet shrinkage. Biometrika 81(3), 425–
455 (1994)
46. Donoho, D.L., Johnstone, I.M.: Adapting to unknown smoothness via wavelet shrinkage. J. Am. Stat.
Assoc. 90(432), 1200–1224 (1995)
47. Donoho, D.L., Tsaig, Y., Drori, I., Starck, J.-L.: Sparse Solution of Underdetermined Linear Equa-
tions by Stagewise Orthogonal Matching Pursuit. Tech. Rep. 2006–2, Dept. of Statistics, Stanford
University, April (2006)
48. Eckstein, J.: Nonlinear proximal point algorithms using bregman functions, with applications to
convex programming. Math. Oper. Res. 18(1), 202–226 (1993)
49. Eren, T., Goldenberg, D.K., Whiteley, W., Yang, Y.R., Morse, A.S., Anderson, B.D.O., Belhumeur,
P.: Rigidity, computation, and randomization in network localization. In: 23rd INFOCOM, vol. 4.
(2004)
50. Evgeniou, T., Pontil, M.: Regularized multi-task learning. In: Proceedings of 10th ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining (2004)
51. Facchinei, F., Pang, J.-S.: Finite-Dimensional Variational Inequalities and Complementarity Prob-
lems, vol. I and II. Springer, New York (2003)
52. Fazel, M., Hindi, H., Boyd, S.P.: A rank minimization heuristic with application to minimum order
system approximation. In: Proceedings of American Control Conference (Arlington, 2001)
53. Ferris, M.C., Mangasarian, O.L.: Parallel variable distribution. SIAM J. Optim. 4 (1994)
54. Fletcher, R.: An overview of unconstrained optimization. In: Spedicato, E. (ed.) Algorithms for
Continuous Optimization, Kluwer Academic, Dordrecht (1994)
55. Friedman, J., Hastie, T., Höfling, H., Tibshirani, R.: Pathwise coordinate optimization. Ann. Appl. Stat. 1 (2007)
56. Friedman, J., Hastie, T., Tibshirani, R.: Sparse inverse covariance estimation with the graphical lasso.
Biostatistics (2007)
57. Friedman, J., Hastie, T., Tibshirani, R.: Regularization paths for generalized linear models via coor-
dinate descent. Report, Department of Statistics, Stanford University, Stanford, July (2008)
58. Fuchs, J.-J.: On sparse representations in arbitrary redundant bases. IEEE Trans. Inform. Theory
50(6), 1341–1344 (2004)
59. Fuchs, J.-J.: Recovery of exact sparse representations in the presence of bounded noise. IEEE Trans.
Inform. Theory 51 (2005)
60. Fukushima, M.: Parallel variable transformation in unconstrained optimization. SIAM J. Optim. 8
(1998)
61. Fukushima, M., Mine, H.: A generalized proximal point algorithm for certain non-convex minimiza-
tion problems. Int. J. Syst. Sci. 12, 989–1000 (1981)
62. Gilbert, A.C., Muthukrishnan, S., Strauss, M.J.: Approximation of functions over redundant dictio-
naries using coherence. In: 14th Annual ACM-SIAM Symposium on Discrete Algorithms (New York,
2003), ACM
63. Goldfarb, D., Yin, W.: Second-order cone programming methods for total variation-based image
restoration. SIAM J. Sci. Comput. 27 (2005)
64. Gonzaga, C.C., Karas, E.W.: Optimal steepest descent algorithms for unconstrained convex problems:
fine timing nesterov’s method. Technical report, Departnient of Mathematics, Federal University of
Santa Catarina, Florianópolis, August (2008)
65. Gribonval, R., Nielsen, M.: Sparse representations in unions of bases. IEEE Trans. Inform. Theory
49(12), 3320–3325 (2003)
66. Hale, E., Yin, W., Zhang, Y.: A fixed-point continuation method for l1-regularized minimization with
applications to compressed sensing. Tech. Rep. TR07-07, Department of Computational and Applied
Mathematics, Rice University, Houston, TX (2007)
67. Hoda, S., Gilpin, A., Peña, J.: Smoothing techniques for computing Nash equilibria of sequential games. Technical report, Carnegie Mellon University, Pittsburgh, March (2008)
68. Juditsky, A., Lan, G., Nemirovski, A., Shapiro, A.: Stochastic approximation approach to stochastic
programming. Technical report, to appear in SIAM J. Optim. (2007)
69. Kim, S., Kojima, M., Waki, H.: Exploiting sparsity in sdp relaxation for sensor network localization,
report b-447. Technical report, Tokyo Institute of Technology, Tokyo, October (2008)
70. Kiwiel, K.C.: Proximal minimization methods with generalized bregman functions. SIAM J. Control
Optim. 35 (1997)
71. Kiwiel, K.C.: Convergence of approximate and incremental subgradient methods for convex optimi-
zation. SIAM J. Optim. 14, 807–840 (2003)
72. Kiwiel, K.C.: On linear-time algorithms for the continuous quadratic knapsack problem. J. Optim.
Theory Appl. 134 (2007)
73. Krislock, N., Piccialli, V., Wolkowicz, H.: Robust semidefinite programming approaches for sensor
network localization with anchors. Technical report, Department of Combinatorics and Optimization,
University of Waterloo, Waterloo, May (2006)
74. Lan, G., Lu, Z., Monteiro, R.D.C.: Primal-dual first-order methods with O(1/ε) iteration-
complexity for cone programming. Technical report, School of Industrial and Systems Engineering,
Georgia Institute of Technology, Atlanta, December (2006)
75. Liang, T.-C., Wang, T.-C., Ye, Y.: A gradient search method to round the semidefinite program-
ming relaxation solution for ad hoc wireless sensor network localization. Technical report, Electrical
Engineering, Stanford University, October (2004)
76. Liu, Z., Vandenberghe, L.: Interior-point method for nuclear norm approximation with application
to system identification. Technical report, Electrical Engineering Department, UCLA, Los Angeles
(2008)
77. Lu, Z.: Smooth optimization approach for sparse covariance selection. Technical report, Department
of Mathematics, Simon Fraser University, Burnaby, January 2008. submitted to SIAM J. Optim
78. Lu, Z., Monteiro, R.D.C., Yuan, M.: Convex optimization methods for dimension reduction and coef-
ficient estimation in multivariate linear regression. Technical report, School of Industrial and Systems
Engineering, Georgia Institute of Technology, Atlanta, January 2008. revised March (2009)
79. Luo, Z.Q.: On the convergence of the lms algorithm with adaptive learning rate for linear feedforward
networks. Neural Comput. 3 (1991)
80. Luo, Z.Q., Tseng, P.: On the linear convergence of descent methods for convex essentially smooth
minimization. SIAM J. Control Optim. 30 (1992)
81. Luo, Z.Q., Tseng, P.: Error bounds and convergence analysis of feasible descent methods: a general
approach. Ann. Oper. Res. 46 (1993)
82. Luo, Z.Q., Tseng, P.: On the convergence rate of dual ascent methods for linearly constrained convex
minimization. Math. Oper. Res. 18 (1993)
83. Ma, S., Goldfarb, D., Chen, L.: Fixed point and bregman iterative methods for matrix rank minimi-
zation. Report 08–78, UCLA Computational and Applied Mathematics (2008)
84. Mallat, S., Zhang, Z.: Matching pursuits with time-frequency dictionaries. IEEE Trans. Signal Process. 41 (1993)
85. Mangasarian, O.L.: Sparsity-preserving SOR algorithms for separable quadratic and linear programming. Comput. Oper. Res. 11, 105–112 (1984)
86. Mangasarian, O.L.: Mathematical programming in neural networks. ORSA J. Comput. 5 (1993)
87. Mangasarian, O.L.: Parallel gradient distribution in unconstrained optimization. SIAM J. Control
Optim. 33, 1916–1925 (1995)
88. Mangasarian, O.L., Musicant, D.R.: Successive over relaxation for support vector machines. IEEE
Trans. Neural Netw. 10 (1999)
89. Meier, L., van de Geer, S., Bühlmann, P.: The group Lasso for logistic regression. J. Royal Statist. Soc.
B 70, 53–71 (2008)
90. More, J.J., Wu, Z.: Global continuation for distance geometry problems. SIAM J. Optim. 7 (1997)
91. Nedić, A., Bertsekas, D.P.: Incremental subgradient methods for nondifferentiable optimization.
SIAM J. Optim. 12 (2001)
92. Nemirovski, A.: Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM J.
Optim. 15 (2005)
93. Nemirovski, A., Yudin, D.: Problem Complexity and Method Efficiency in Optimization. Wiley,
New York (1983)
94. Nesterov, Y.: A method for unconstrained convex minimization problem with the rate of convergence O(1/k²). Doklady AN SSSR 269 (1983); translated as Soviet Math. Dokl.
95. Nesterov, Y.: On an approach to the construction of optimal methods of minimization of smooth
convex functions. Ekonom. i. Mat. Metody 24 (1988)
96. Nesterov, Y.: Introductory Lectures on Convex Optimization. Kluwer Academic Publisher, Dordrecht,
The Netherlands (2004)
97. Nesterov, Y.: Smoothing technique and its applications in semidefinite optimization. Technical report,
CORE, Catholic University of Louvain, Louvain-la-Neuve, Belgium, October (2004)
98. Nesterov, Y.: Excessive gap technique in nonsmooth convex minimization. SIAM J. Optim. 16 (2005)
99. Nesterov, Y.: Smooth minimization of nonsmooth functions. Math. Program. 103, 127–152 (2005)
100. Nesterov, Y.: Gradient methods for minimizing composite objective function. Technical report, Center
for Operations Research and Econometrics (CORE), Catholic University of Louvain (2007)
101. Nesterov, Y.: Gradient methods for minimizing composite objective function. Technical report, CORE,
Catholic University of Louvain, Louvain-la-Neuve, Belgium, September (2007)
102. Nesterov, Y.: Primal-dual subgradient methods for convex problems. Math. Program. 120, 221–
259 (2007). doi:10.1007/s10107-007-0149-x
103. Nesterov, Y.E., Nemirovski, A.: Interior-Point Polynomial Algorithms in Convex Programming, vol.
13 of Stud. Appl. Math. Society of Industrial and Applied Mathematics, Philadelphia (1994)
104. Nie, J.: Sum of squares method for sensor network localization. Technical report, Department of
Mathematics, University of California, Berkeley, June 2006. to appear in Comput. Optim. Appl
105. Nocedal, J., Wright, S.J.: Numerical Optimization. Springer, New York (1999)
106. Obozinski, G., Taskar, B., Jordan, M.I.: Joint covariate selection and joint subspace selection for
multiple classification problems. Stat. Comput. 20, 231–252 (2009)
107. Ortega, J.M., Rheinboldt, W.C.: Iterative Solutions of Nonlinear Equations in Several Variables. Aca-
demic Press, London (1970)
108. Osborne, M.R., Presnell, B., Turlach, B.A.: On the LASSO and its dual. J. Comput. Graph. Stat. 9, 319–
337 (2000)
109. Osher, S., Burger, M., Goldfarb, D., Xu, J.: An iterative regularization method for total variation-based
image restoration. SIAM J. Multiscale Model. Simul. 4, 460–489 (2005)
110. Pang, J.-S.: A posteriori error bounds for the linearly constrained variational inequality problem.
Math. Oper. Res. 12 (1987)
111. Park, M.-Y., Hastie, T.: An L1 regularization-path algorithm for generalized linear models. J. Roy.
Soc. Stat. B 69, 659–677 (2007)
112. Pfander, G.B., Rauhut, H., Tanner, J.: Identification of matrices having a sparse representation. IEEE
Trans. Signal Proc. 56 (2008)
113. Platt, J.: Fast training of support vector machines using sequential minimal optimization. In: Advances
in Kernel Methods: Support Vector Learning. MIT Press, Cambridge, MA, USA (1998)
114. Polyak, B.T.: Introduction to optimization. Optimization Software, New York (1987)
115. Pong, T.K., Tseng, P.: (robust) edge-based semidefinite programming relaxation of sensor network
localization. Technical report, Department of Mathematics, University of Washington, Seattle, Janu-
ary (2009)
116. Ravi, S.S., Rosenkrantz, D.J., Tayi, G.K.: Heuristic and special case algorithms for dispersion prob-
lems. Oper. Res. 42 (1994)
117. Recht, B., Fazel, M., Parrilo, P.A.: Guaranteed minimum-rank solutions of linear matrix equations
via nuclear norm minimization. arXiv 0706.4138, June (2007)
118. Rockafellar, R.T.: Convex Analysis. Princeton University Press, Princeton (1970)
119. Rockafellar, R.T., Wets, R.J.-B.: Variational Analysis. Springer, New York (1998)
120. Rohl, C.A., Strauss, C.E.M., Misura, K., Baker, D.: Protein structure prediction using rosetta. Methods
Enzym. 383, 66–93 (2004)
121. Sagastizabal, C.A., Solodov, M.V.: Parallel variable distribution for constrained optimization. Com-
put. Optim. Appl. 22, 111–131 (2002)
122. Sardy, S., Antoniadis, A., Tseng, P.: Automatic smoothing with wavelets for a wide class of distribu-
tions. J. Comput. Graph. Stat. 13, 399–421 (2004)
123. Sardy, S., Bruce, A., Tseng, P.: Block coordinate relaxation methods for nonparametric wavelet
denoising. J. Comput. Graph. Stat. 9, 361–379 (2000)
124. Sardy, S., Bruce, A., Tseng, P.: Robust wavelet denoising. IEEE Trans. Signal Proc. 49 (2001)
125. Sardy, S., Tseng, P.: AMlet, RAMlet, and GAMlet: automatic nonlinear fitting of additive models,
robust and generalized, with wavelets. J. Comput. Graph. Stat. 13, 283–309 (2004)
126. Sardy, S., Tseng, P.: On the statistical analysis of smoothing by maximizing dirty markov random
field posterior distributions. J. Amer. Statist. Assoc. 99, 191–204 (2004)
127. Shi, J., Yin, W., Osher, S., Sajda, P.: A fast algorithm for large scale L1-regularized logistic regression.
Technical report, Department of Computational and Applied Mathematics, Rice University, Houston
(2008)
128. So, A.M.-C., Ye, Y.: Theory of semidefinite programming for sensor network localization. Math.
Program. 109, 367–384 (2007)
129. Solodov, M.V.: Incremental gradient algorithms with step sizes bounded away from zero. Comput.
Optim. Appl. 11, 23–25 (1998)
130. Strohmer, T., Heath, R.W. Jr.: Grassmannian frames with applications to coding and communications. Appl. Comput. Harmon. Anal. 14, 257–275 (2003)
131. Rockafellar, R.T.: Conjugate Duality and Optimization. Society for Industrial and Applied Mathe-
matics, Philadelphia (1974)
132. Teboulle, M.: Convergence of proximal-like algorithms. SIAM J. Optim. 7 (1997)
133. Toh K.-C., Yun, S.: An accelerated proximal gradient algorithm for nuclear regularized least squares
problems. Technical report, Department of Mathematics, National University of Singapore, Singapore
(2009)
134. Yun, S., Toh, K.-C.: A coordinate gradient descent method for l1-regularized convex minimization. Technical report, Department of Mathematics, National University of Singapore, Singapore, 2008. Submitted to Comput. Optim. Appl.
135. Tropp, J.A.: Greed is good: algorithmic results for sparse approximation. IEEE Trans. Inform. The-
ory 50(10), 2231–2242 (2004)
136. Tropp, J.A.: Just relax: convex programming methods for identifying sparse signals in noise. IEEE
Trans. Inform. Theory 52(3), 1030–1051 (2006)
137. Tropp, J.A., Gilbert, A.C.: Signal recovery from random measurements via orthogonal matching
pursuit. IEEE Trans. Inform. Theory 53(12), 4655–4666 (2007)
138. Tseng, P.: Dual coordinate descent methods for non-strictly convex minimization. Math. Prog.
59, 231–247 (1993)
139. Tseng, P.: An incremental gradient(-projection) method with momentum term and adaptive stepsize
rule. SIAM J. Optim. 8 (1998)
140. Tseng, P.: Error bounds and superlinear convergence analysis of some Newton-type methods in optimization. In: Nonlinear Optimization and Related Topics. Kluwer, Dordrecht (2000)
141. Tseng, P.: Convergence of block coordinate descent method for nondifferentiable minimization.
J. Optim. Theory Appl. 109, 475–494 (2001)
142. Tseng, P.: Second-order cone programming relaxation of sensor network localization. SIAM J. Optim.
18 (2007)
143. Tseng, P.: On accelerated proximal gradient methods for convex-concave optimization. Technical
report, Department of Mathematics, University of Washington, Seattle, May 2008. submitted to
SIAM J. Optim
144. Tseng, P.: Further results on a stable recovery of sparse overcomplete representations in the presence
of noise. IEEE Trans. Inf. Theory. 55 (2009)
145. Tseng, P., Yun, S.: A coordinate gradient descent method for linearly constrained smooth optimization
and support vector machines training. Comput. Optim. Appl. (2008)
146. Tseng, P., Yun, S.: A block-coordinate gradient descent method for linearly constrained nonsmooth
separable optimization. J. Optim. Theory Appl. 140, 513–535 (2009)
147. Tseng, P., Yun, S.: A coordinate gradient descent method for nonsmooth separable minimiza-
tion. Math. Program. 117, 387–423 (2009)
148. Wang, Y., Yang, J., Yin, W., Zhang, Y.: A new alternating minimization algorithm for total variation
image reconstruction. Technical report, Department of Computational and Applied Mathematics,
Rice University, Houston, 2007. to appear in SIAM Imaging Sci
149. Wang, Z., Zheng, S., Ye, Y., Boyd, S.: Further relaxations of the semidefinite programming approach
to sensor network localization. SIAM J. Optim. 19 (2008)
150. White, D.J.: A heuristic approach to a weighted maxmin dispersion problem. IMA J. Manag. Math.
9 (1996)
151. Wright, S.J.: Primal-Dual Interior-Point Methods. SIAM, Philadelphia (1997)
152. Wright, S.J., Nowak, R.D., Figueiredo, M.A.T.: Sparse reconstruction by separable approximation.
Technical report, Computer Sciences Department, University of Wisconsin, Madison, October (2007)
153. Yang, J., Zhang, Y., Yin, W.: An efficient TVL1 algorithm for deblurring multichannel images cor-
rupted by impulsive noise. Technical report, Department of Computational and Applied Mathematics,
Rice University, Houston (2008)
154. Ye, J., Ji, S., Chen, J.: Multi-class discriminant kernel learning via convex programming. J. Mach.
Learn. Res. 9, 719–758 (2008)
155. Ye, Y.: Interior Point Algorithms: Theory and Analysis. John Wiley & Sons, New York (1997)
156. Yin, W., Osher, S., Goldfarb, D., Darbon, J.: Bregman iterative algorithms for l1 minimization with
applications to compressed sensing. SIAM J. Imaging Sci. 1 (2008)
157. Yuan, M., Ekici, A., Lu, Z., Monteiro, R.: Dimension reduction and coefficient estimation in multi-
variate linear regression. J. Royal Stat. Soc. B. 69, 329–346 (2007)
158. Yuan, M., Lin, Y.: Model selection and estimation in regression with grouped variables. J. Royal Stat.
Soc. B. 68, 49–67 (2006)
159. Yuan, M., Lin, Y.: Model selection and estimation in the Gaussian graphical model. Biometrika 94
(2007)
160. Zhu, M., Chan, T.F.: An efficient primal-dual hybrid gradient algorithm for total variation image
restoration. Technical report, Department of Mathematics, UCLA, Los Angeles (2008)
161. Zhu, M., Wright, S.J., Chan, T.F.: Duality-based algorithms for total-variation regularized image
restoration. Technical report, Department of Mathematics, UCLA, Los Angeles (2008)