R. Chartrand
Descartes Labs, Los Alamos, NM 87544, USA. e-mail: rick@descarteslabs.com
W. Yin
Department of Mathematics, University of California - Los Angeles, CA, 90095, USA. e-mail:
wotaoyin@math.ucla.edu
where $\lambda > 0$. The optimization problem in (1) is clearly separable. For an input component $x_i$, since $\|w\|_0$ only depends on whether $w_i$ is nonzero, the minimizing $w_i$ is clearly either $0$ or $x_i$. If $w_i = 0$, the $i$th term of the objective function is $x_i^2/(2\lambda)$, while if $w_i = x_i \neq 0$, the value is $1$. We thus obtain
$$
\bigl(\operatorname{prox}_{\lambda\|\cdot\|_0}(x)\bigr)_i =
\begin{cases}
0 & \text{if } |x_i| < \sqrt{2\lambda};\\
\{0, x_i\} & \text{if } |x_i| = \sqrt{2\lambda};\\
x_i & \text{if } |x_i| > \sqrt{2\lambda}.
\end{cases}
\qquad (2)
$$
1 We use “norm” loosely, to refer to such things as the $\ell_p$ quasinorm, or the $\ell_0$ penalty function (which has no correct normlike name).
Nonconvex Sparse Regularization and Splitting Algorithms
This motivates the use of the mapping known as hard thresholding, defined componentwise as follows:
$$
H_t(x)_i =
\begin{cases}
0 & \text{if } |x_i| \le t;\\
x_i & \text{if } |x_i| > t.
\end{cases}
\qquad (3)
$$
(Our choice of value at the discontinuity is arbitrary.) Thus $H_{\sqrt{2\lambda}}(x)$ gives a (global) minimizer of the problem in (1).
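In code, hard thresholding is a componentwise mask; a minimal NumPy sketch (the function name and the demo values are ours):

```python
import numpy as np

def hard_threshold(x, t):
    """Hard thresholding H_t: zero out components with |x_i| <= t, keep the rest."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) > t, x, 0.0)

# With t = sqrt(2*lam), this gives a global minimizer of problem (1).
lam = 0.5
print(hard_threshold([0.3, -0.9, 1.5, -2.0], np.sqrt(2 * lam)))
# zeroes the first two entries, keeps 1.5 and -2.0
```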
Now consider the following optimization problem:
$$
\min_x \|x\|_0 + \frac{1}{2\lambda}\|Ax - b\|_2^2. \qquad (4)
$$
The one difference between forward-backward splitting (FBS) and iterative hard thresholding (IHT) is that in IHT, the threshold value $t$ is adaptive, chosen so that the outcome of the thresholding has at most $K$ nonzero components, for some positive integer $K$.
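Under these conventions, an IHT-style iteration can be sketched as follows (the step-size choice, names, and the tiny demo are ours, not from the chapter):

```python
import numpy as np

def keep_largest(x, K):
    """Adaptive hard thresholding: keep only the K largest-magnitude components."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-K:]       # indices of the K largest magnitudes
    out[idx] = x[idx]
    return out

def iht(A, b, K, steps=200):
    """Iterative hard thresholding for min ||Ax - b||_2^2 s.t. ||x||_0 <= K."""
    eta = 1.0 / np.linalg.norm(A, 2) ** 2  # step size from the spectral norm
    x = np.zeros(A.shape[1])
    for _ in range(steps):
        x = keep_largest(x - eta * A.T @ (A @ x - b), K)
    return x
```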
The discontinuities of the $\ell_0$ norm and hard thresholding are obstacles to good algorithmic performance. For example, substituting hard thresholding for soft thresholding in ADMM gives an algorithm that oscillates fiercely, though Dong and Zhang [22] managed to tame the beast by using the “double augmented Lagrangian” [48] and using the mean of the iterates as the solution.
Using the $\ell_p$ norm, with $p \in (0, 1)$, provides a better-behaved alternative:
$$
\min_x \|x\|_p^p + \frac{1}{2\lambda}\|Ax - b\|_2^2. \qquad (6)
$$
Note that we use the $p$th power of the $\ell_p$ norm, because it is easier to compute with and satisfies the triangle inequality.
We can try to compute the proximal mapping. As before, this is a separable problem; for simplicity, assume $w$ and $x$ are scalars, and without loss of generality, assume $x > 0$. If we seek a nonzero minimizer, we need to solve $p w^{p-1} + (w - x)/2 = 0$. Eliminating the negative exponent gives $w^{2-p} - x w^{1-p} + 2p = 0$. This is not analytically solvable for general $p$. The cases where it can be solved are $p = 1/2$, where we have a cubic equation in $\sqrt{w}$, and $p = 2/3$, where we have a quartic equation in $\sqrt[3]{w}$. While this may seem restrictive, there is anecdotal, empirical evidence suggesting $p = 1/2$ is a good choice in at least a broad range of circumstances [18, 49, 63].
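For illustration, the $p = 1/2$ case can be evaluated numerically by solving such a cubic; the sketch below uses the common scaling $\arg\min_w \lambda|w|^{1/2} + \tfrac12(w - x)^2$ (this scaling and the function name are our choices, so the constants differ from the stationarity condition above):

```python
import numpy as np

def prox_half(x, lam):
    """Proximal mapping of lam*|w|**0.5 at a scalar x, via the cubic in sqrt(w).

    For w > 0, stationarity reads (lam/2)*w**-0.5 + w - x = 0; substituting
    u = sqrt(w) (with x taken positive) gives u**3 - x*u + lam/2 = 0.
    """
    sign, x = np.sign(x), abs(x)
    roots = np.roots([1.0, 0.0, -x, lam / 2.0])
    # Candidate minimizers: w = 0 and the squares of real positive roots u.
    cands = [0.0] + [float(u.real) ** 2 for u in roots
                     if abs(u.imag) < 1e-10 and u.real > 0]
    obj = lambda w: lam * np.sqrt(w) + 0.5 * (w - x) ** 2
    return sign * min(cands, key=obj)
```

Small inputs are mapped to zero and large ones are mildly shrunk, consistent with the thresholding interpretation discussed below.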
The first paper to make use of these special cases is by Krishnan and Fergus [29]. They consider a nonconvex generalization of TV-regularized deblurring, with (anisotropic) TV replaced by the $\ell_p$ norm of the gradient, for $p \in \{1/2, 2/3\}$. Their approach generalizes the splitting approach of FTVd [54], with soft thresholding replaced by the appropriate proximal mapping. They consider solving the cubic or quartic equation by radicals, and by means of a lookup table, and report the latter to be faster while giving the same quality.
Xu et al. [62] conduct further analysis of the $\ell_{1/2}$ proximal mapping, considering the solution of the cubic equation in terms of the cosine and arccosine functions. Similarly, Cao et al. [14] develop a closed-form formula for the $\ell_{2/3}$ proximal mapping. They show that the proximal mapping can be considered as a thresholding mapping, in the sense that each component of the output is the same scalar function of the corresponding component of the input, and that the scalar function maps all inputs of magnitude below some threshold to zero. They use the thresholding mapping within an application of forward-backward splitting to the $\ell_{1/2}$ generalization of BPDN, as well as in an alternative approach to the algorithm of Krishnan and Fergus.
Given the difficulty of computing proximal mappings for a wide range of nonconvex penalty functions, it is reasonable to consider reversing the process: specifying a thresholding mapping, and then determining whether it is a proximal mapping of a penalty function. Antoniadis considers a general class of thresholding functions, and then shows that these are proximal mappings of penalty functions [1]:
Theorem 1 (Antoniadis). Let $\delta_\lambda : \mathbb{R} \to \mathbb{R}$ be a thresholding function that is increasing and antisymmetric, such that $0 \le \delta_\lambda(x) \le x$ for $x \ge 0$ and $\delta_\lambda(x) \to \infty$ as $x \to \infty$. Then there exists a continuous positive penalty function $p_\lambda$, with $p_\lambda(x) \le p_\lambda(y)$ whenever $|x| \le |y|$, such that $\delta_\lambda(z)$ is the unique solution of the minimization problem $\min_\theta (z - \theta)^2 + 2 p_\lambda(|\theta|)$ for every $z$ at which $\delta_\lambda$ is continuous.
Such thresholding functions are considered in the context of thresholding of wavelet
coefficients, such as for denoising, and not for use with splitting algorithms.
Chartrand [16, 17] proves a similar theorem, but by making use of tools from convex analysis in the proof, he is able to add first-order information about the penalty function to the conclusions:
Theorem 2 (Chartrand). Suppose $s = s_\lambda : \mathbb{R}_+ \to \mathbb{R}_+$ is continuous, satisfies $x \le \lambda \Rightarrow s(x) = 0$ for some $\lambda > 0$, is strictly increasing on $[\lambda, \infty)$, and $s(x) \le x$. Define $S = S_\lambda$ on $\mathbb{R}^n$ by $S(x)_i = s(|x_i|)\operatorname{sign}(x_i)$ for each $i$. Then $S$ is the proximal mapping of a penalty function $G(x) = \sum_i g(x_i)$, where $g$ is even, strictly increasing and continuous on $[0, \infty)$, differentiable on $(0, \infty)$, and nondifferentiable at $0$ with $\partial g(0) = [-1, 1]$. If also $x - s(x)$ is nonincreasing on $[\lambda, \infty)$, then $g$ is concave on $[0, \infty)$ and $G$ satisfies the triangle inequality.
The proximal mappings are considered for use in generalizations of ADMM. A new thresholding function, a $C^\infty$ approximation of hard thresholding, is used to give a state-of-the-art image reconstruction result. Subsequently, however, the older and simpler firm thresholding function of Gao and Bruce [24] was found to perform at least as well.
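For concreteness, here is a sketch of firm thresholding in the standard Gao-Bruce form, with lower threshold $\lambda$ and upper threshold $\mu > \lambda$ (the parameter names are ours):

```python
import numpy as np

def firm_threshold(x, lam, mu):
    """Firm thresholding: zero below lam, identity above mu, linear in between.

    Interpolates between soft thresholding (mu -> infinity) and
    hard thresholding (mu -> lam).
    """
    x = np.asarray(x, dtype=float)
    a = np.abs(x)
    mid = np.sign(x) * mu * (a - lam) / (mu - lam)   # linear transition region
    return np.where(a <= lam, 0.0, np.where(a >= mu, x, mid))
```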
It can be shown (cf. [58]) that when all of the hypotheses of Thm. 2 are satisfied, the resulting penalty function $G$ is weakly convex (also known as semiconvex). This is the property that $G(x) + \rho\|x\|_2^2$ is convex for sufficiently large $\rho$. Bayram has shown [4] that the generalization of FBS for weakly convex penalty functions converges to a global minimizer of the analog of (6):
$$
\min_x G(x) + \frac{1}{2\lambda}\|Ax - b\|_2^2. \qquad (8)
$$
Theorem 3 (Bayram). Let $A$ in (8) have smallest and largest singular values $\sigma_m$ and $\sigma_M$, respectively. Suppose $G$ is proper, lower semi-continuous, and such that $G(x) + \rho\|x\|_2^2$ is convex for $\rho \ge 2\sigma_m^2$. Suppose also that the set of minimizers of (8) is nonempty. Then if $\lambda < 1/\sigma_M^2$, the sequence generated by FBS converges to a global minimizer of (8) and monotonically decreases the cost.
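In outline, FBS for (8) alternates a gradient step on the quadratic with a proximal (thresholding) step; a sketch with a pluggable prox (the names, step-size handling, and demo are ours):

```python
import numpy as np

def fbs(A, b, mu, prox, steps=300):
    """Forward-backward splitting: x <- prox(x - mu * A^T (A x - b)).

    `prox` should be the proximal mapping of the (scaled) penalty G;
    Theorem 3 ties convergence to the step size and the singular values of A.
    """
    x = np.zeros(A.shape[1])
    for _ in range(steps):
        x = prox(x - mu * A.T @ (A @ x - b))
    return x

# Demo: with A = I and a hard-thresholding prox, FBS reproduces H_t(b).
hard = lambda z: np.where(np.abs(z) > 1.0, z, 0.0)
x = fbs(np.eye(3), np.array([0.5, 2.0, -3.0]), 1.0, hard)
```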
Coordinate descent has been applied to problems with nonconvex regularization functions. This section reviews coordinate descent methods, as well as the work that adapts them to nonconvex problems.
Coordinate descent is a class of methods that solve a large problem by updating one coordinate, or one block of coordinates, at each step. The methods work with smooth and nonsmooth-separable functions, as well as convex and nonconvex functions. We call a function separable if it has the form $f(x) = f_1(x_1) + \cdots + f_n(x_n)$. (Coordinate descent methods generally cannot handle functions that are simultaneously nonsmooth and nonseparable, because such functions can cause convergence to a non-critical point [57].) There has been very active research on coordinate descent methods because their steps have a small memory footprint, making them suitable for solving large-sized problems. Certain coordinate descent methods can be parallelized. In addition, even on small- and mid-sized problems, coordinate descent methods can run much faster than traditional first-order methods, because each coordinate descent step can use a much larger step size, and the updating coordinates can be chosen to exploit the problem structure and data.
Let us consider the problem
$$
\min_x f(x_1, \ldots, x_n).
$$
At each step of coordinate descent, all but some $i$th coordinate of $x$ are fixed. Let $x_{-i}$ collect all components of $x$ but $x_i$. The general framework of coordinate descent is:
For $k = 0, 1, \ldots$, perform:
1. Select a coordinate $i_k$.
2. Update
$$
\begin{cases}
\text{update } x_{i_k}^{k+1}, \\
x_j^{k+1} = x_j^k \quad \text{for all } j \ne i_k.
\end{cases}
$$
3. Check stopping criteria.
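The framework above, specialized to a cyclic order and a prox-gradient update, can be sketched as follows (all names, the step size, and the demo objective are our choices):

```python
import numpy as np

def coordinate_descent(grad_i, prox_i, n, eta=0.1, sweeps=200):
    """Cyclic coordinate descent: one prox-gradient update per coordinate."""
    x = np.zeros(n)
    for _ in range(sweeps):
        for i in range(n):                    # step 1: cyclic selection
            g = grad_i(x, i)                  # partial gradient of the smooth part
            x[i] = prox_i(x[i] - eta * g, i)  # step 2: coordinate prox step
    return x

# Demo: minimize 0.5*||x - c||^2 + ||x||_1; the solution soft-thresholds c.
c = np.array([3.0, 0.2, -2.0])
soft = lambda z, i: np.sign(z) * max(abs(z) - 0.1, 0.0)  # threshold = eta * 1
x = coordinate_descent(lambda x, i: x[i] - c[i], soft, 3)
```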
In the algorithm, each update to $x_{i_k}^{k+1}$ can take one of the following forms:
$$
x_{i_k}^{k+1} \in \arg\min_{x_{i_k}} f(x_{i_k}, x_{-i_k}^k), \qquad \text{(9a)}
$$
$$
x_{i_k}^{k+1} \in \arg\min_{x_{i_k}} f(x_{i_k}, x_{-i_k}^k) + \frac{1}{2\eta_k}\,|x_{i_k} - x_{i_k}^k|^2, \qquad \text{(9b)}
$$
$$
x_{i_k}^{k+1} = x_{i_k}^k - \eta_k \nabla_{i_k} f(x^k), \qquad \text{(9c)}
$$
$$
x_{i_k}^{k+1} = \operatorname{prox}_{\eta_k f_{i_k}^{\mathrm{prox}}}\bigl(x_{i_k}^k - \eta_k \nabla_{i_k} f^{\mathrm{diff}}(x^k)\bigr), \qquad \text{(9d)}
$$
which are called the direct update (first appeared in [57]), proximal update (first appeared in [2]), gradient update, and prox-gradient update, respectively.
update (9c) is equivalent to the $x_i$-directional gradient descent with step size $\eta_k$:
$$
x_i^{k+1} \leftarrow \bigl(x^k - \eta_k \nabla f(x^k)\bigr)_i.
$$
The prox-gradient update (9d) is also known as the prox-linear update and the forward-backward update. It is applied to objective functions of the form
$$
f(x_1, \ldots, x_n) = f^{\mathrm{diff}}(x_1, \ldots, x_n) + \sum_{i=1}^n f_i^{\mathrm{prox}}(x_i),
$$
where $f^{\mathrm{diff}}$ is a differentiable function and each $f_i^{\mathrm{prox}}$ is a proximal function. The problem (6) is an example of this form. In addition to the four forms of updates in (9), there are other recent forms of coordinate descent: a gradient $\nabla_i f(x^k)$ can be replaced by its stochastic approximation and/or be evaluated at a point $\hat{x}^k$ extrapolated from $x^k$ and $x^{k-1}$.
Different forms of coordinate updates can be mixed within a coordinate descent algorithm, but each coordinate $x_i$ is usually updated by one chosen form throughout the iterations. The best choice is based on the problem structure and the required convergence properties.
As a very nice feature, coordinate descent can select the updating coordinates $i_k$ in different orders. Such flexibility leads to numerical advantages. The most common choice is the cyclic order. Recently, random and shuffled cyclic orders have been found to perform better in some settings.
The minimizers are $(x_1, x_2, x_3) = \pm(1, 1, 1)$. However, the subproblem (9a) generates a sequence of points that approximately cycle near the other six points in the set $\{(x_1, x_2, x_3) : x_i = \pm 1,\ i = 1, 2, 3\}$. Nonetheless, each of the other forms of updates in (9) has been shown to converge (under various conditions) for differentiable nonconvex functions, which brings us to the next issue: nonsmooth functions. Warga’s example [57],
$$
f(x_1, x_2) =
\begin{cases}
x_1 - 2x_2, & 0 \le x_2 \le x_1 \le 1, \\
x_2 - 2x_1, & 0 \le x_1 \le x_2 \le 1,
\end{cases}
$$
is a convex, nonsmooth function with its minimizer at $(x_1, x_2) = (1, 1)$. Starting from an arbitrary point $(x_1, x_2)$ with $x_1 \neq 1$ and $x_2 \neq 1$, all the existing coordinate descent algorithms will stall at a non-optimal point with $x_1 = x_2$. Although this example is simple, such a phenomenon generally occurs with many nonsmooth functions that couple multiple variables. To deal with this problem, we need to apply objective smoothing, variable splitting, or primal-dual splitting, which we do not review here.
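The stall is easy to reproduce numerically; in this sketch (names ours), exact coordinate minimization is approximated by a fine grid search over $[0, 1]$:

```python
import numpy as np

def warga(x1, x2):
    """Warga's convex, nonsmooth function on [0,1]^2; minimized at (1, 1)."""
    return x1 - 2 * x2 if x2 <= x1 else x2 - 2 * x1

def argmin_1d(f):
    """Approximate exact minimization over one coordinate by a grid search."""
    grid = np.linspace(0.0, 1.0, 1001)
    return float(grid[int(np.argmin([f(t) for t in grid]))])

x1, x2 = 0.3, 0.7
for _ in range(10):                 # alternating exact coordinate minimization
    x1 = argmin_1d(lambda t: warga(t, x2))
    x2 = argmin_1d(lambda t: warga(x1, t))
# The iterates stop on the diagonal x1 == x2, far from the minimizer (1, 1).
```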
Regularity assumptions on the nonconvex objective function are required to show convergence of coordinate descent algorithms. Grippo and Sciandrone [27] require that $f$ be component-wise strictly quasiconvex with respect to all but two blocks, or that $f$ be pseudoconvex with bounded level sets. Tseng [52] requires that either $f$ is pseudoconvex in every pair from among $n - 1$ coordinates, or $f$ has at most one minimizer in each of $n - 2$ coordinates, along with other assumptions. Tseng and Yun [53] studied coordinate descent based on (9d) under an assumed local error bound, which holds if $f^{\mathrm{diff}}$ is a (possibly nonconvex) quadratic or equals $g(Ax)$ for a strongly convex function $g$ and a matrix $A$. However, their local error bound requires the regularization function to be convex and polyhedral, which is too strong for the $\ell_p$ norm and other nonconvex regularization functions.
The paper [60] considers objectives that are the sum of a multiconvex$^2$ smooth function and separable proximal functions. It analyzes the mixed use of different update forms and extrapolation, as well as functions satisfying the Kurdyka-Łojasiewicz (KL) property, which improves subsequential convergence to global (whole-sequence) convergence. The BSUM framework [46] aims at analyzing different forms of coordinate descent in a unified approach. Its analysis allows the differentiable part of the objective to be nonconvex, along with other conditions; checking these conditions shows that piecewise-linear regularization functions are permitted. However, the assumptions in the above papers exclude other major nonconvex regularization functions.
The papers [38, 11] are among the first to apply coordinate descent to regularized statistical regression problems with nonconvex regularization functions, including the minimax concave penalty (MCP) and the smoothly clipped absolute deviation (SCAD). They require the fitting terms to have properties such that the sum of the fitting and regularization terms is convex, so that the results from [53] apply.
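For reference, the scalar MCP and SCAD penalties can be written down directly; the formulas below are the standard ones, with threshold $\lambda > 0$ and concavity parameter $\gamma$ (this is our sketch, not code from [38, 11]):

```python
def mcp(t, lam, gamma):
    """Minimax concave penalty: quadratically tapered |t|, constant past gamma*lam."""
    a = abs(t)
    if a <= gamma * lam:
        return lam * a - a ** 2 / (2 * gamma)
    return gamma * lam ** 2 / 2

def scad(t, lam, gamma):
    """SCAD penalty: linear near 0, quadratic transition, constant far out."""
    a = abs(t)
    if a <= lam:
        return lam * a
    if a <= gamma * lam:
        return (2 * gamma * lam * a - a ** 2 - lam ** 2) / (2 * (gamma - 1))
    return lam ** 2 * (gamma + 1) / 2
```

Both penalties are concave in $|t|$ and flatten out for large $|t|$, so, unlike the $\ell_1$ penalty, they do not bias large coefficients.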
Coordinate descent for nonconvex $\ell_p$-norm minimization recently appeared in [37], where the algorithm uses the update form (9d) and the cyclic order. Although the proximal map of $\ell_p$ is set-valued in general, a single-valued selection is used. The work establishes subsequential convergence and, under a further assumption of a scalable restricted isometry property, full sequential convergence to a local minimizer. The paper [35] analyzes a random prox-gradient coordinate descent algorithm; it accepts an objective of the smooth-plus-proximal form where both functions can be nonconvex. More recently, the work [61] extends the analysis of the previous work [60] by including nonconvex regularization functions such as the minimax concave penalty (MCP), the smoothly clipped absolute deviation (SCAD), and others. The work also analyzes both deterministic and randomly shuffled orders, and reports better performance with the latter. Another recent work [66] establishes that the support set converges in finitely many iterations, and provides new conditions that ensure convergence to a local minimizer. The papers [61, 66] also apply the KL property to obtain full sequential convergence.

$^2$ The objective is convex in each coordinate while the other coordinates are fixed.
Last but not least, to achieve strong performance, coordinate descent algorithms rely on solving smaller, simpler subproblems that have low complexity and low memory requirements. Not all problem structures are amenable to coordinate descent. Hence, identifying coordinate-friendly structures in a problem is crucial to implementing coordinate descent algorithms effectively. See the recent work [41].
4 Other methods
This section briefly mentions some other algorithms for problems with nonconvex
regularization functions.
The work [56] analyzes the convergence of the alternating direction method of multipliers (ADMM) for minimizing a nonconvex and possibly nonsmooth objective function, $f(x_1, \ldots, x_p, y)$, subject to linear constraints. Unlike the majority of coordinate descent algorithms, ADMM handles constraints by involving dual variables. The proposed ADMM in [56] sequentially updates the primal variables in the order $x_1, \ldots, x_p, y$, followed by updating the dual variable. For convergence to a critical point, the objective function on the $y$-block must be smooth (possibly nonconvex), while the objective function of each $x_i$-block can be smooth plus nonsmooth. A variety of nonconvex functions, such as piecewise-linear functions, the $\ell_p$ norm, the Schatten-$p$ norm ($0 < p < 1$), SCAD, as well as the indicator functions of compact smooth manifolds (e.g., spherical, Stiefel, and Grassmann manifolds), can be applied to the $x_i$ variables.
The sorted $\ell_1$ function [28] is a nonconvex regularization function. It is a weighted $\ell_1$ function where the weights have fixed values but are dynamically assigned to the components, under the principle that components with larger magnitudes get smaller weights. This is the same principle as in reweighted $\ell_2$ and $\ell_1$ algorithms, where the assignments are fixed and the weights are dynamic. Sorted $\ell_1$ regularization is related to iterative support detection [55] and iterative hard thresholding [8].
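The assignment rule is a one-liner once the magnitudes are sorted; a sketch (the weight values here are illustrative, not taken from [28]):

```python
import numpy as np

def sorted_l1(x, w):
    """Nonconvex sorted-l1 penalty: fixed weights, dynamically assigned so that
    larger-magnitude components receive smaller weights."""
    mags = np.sort(np.abs(x))[::-1]        # magnitudes, largest first
    weights = np.sort(np.asarray(w))       # weights, smallest first
    return float(weights @ mags)

# Largest |x_i| (5.0) gets the smallest weight (0.2): 0.2*5 + 1*2 + 3*0.1
val = sorted_l1([0.1, -5.0, 2.0], [1.0, 0.2, 3.0])
```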
The difference of $\ell_1$ and $\ell_2$, $\|x\|_1 - \|x\|_2$, favors sparse vectors [64], so it can also serve as a nonconvex regularization function for generating sparse solutions. The picture is courtesy of [64]. The authors analyzed the global solution and proposed an iterative algorithm based on the difference-of-convex-functions algorithm. Their simulations suggest stronger performance than other compared algorithms when the sampling matrix in compressed sensing is ill-conditioned (i.e., the restricted isometry property is not satisfied).
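A quick numerical check of the sparsity-favoring behavior (toy vectors ours): the penalty vanishes exactly on 1-sparse vectors and grows as more comparable-magnitude entries appear:

```python
import numpy as np

def l1_minus_l2(x):
    """The penalty ||x||_1 - ||x||_2; it is 0 iff x has at most one nonzero."""
    x = np.asarray(x, dtype=float)
    return float(np.sum(np.abs(x)) - np.linalg.norm(x))

sparse_val = l1_minus_l2([3.0, 0.0, 0.0])   # 0 for a 1-sparse vector
dense_val = l1_minus_l2([1.0, 1.0, 1.0])    # 3 - sqrt(3) > 0 for a dense one
```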
References
1. Antoniadis, A.: Wavelet methods in statistics: Some recent developments and their applica-
tions. Statistics Surveys 1, 16–55 (2007)
2. Auslender, A.: Asymptotic properties of the Fenchel dual functional and applications to de-
composition problems. Journal of Optimization Theory and Applications 73(3), 427–449
(1992)
3. Barrodale, I., Roberts, F.D.K.: Applications of mathematical programming to ` p approxima-
tion. In: J.B. Rosen, O.L. Mangasarian, K. Ritter (eds.) Nonlinear Programming, Madison,
Wisconsin, May 4–6, 1970, pp. 447–464. Academic Press, New York (1970)
4. Bayram, I.: On the convergence of the iterative shrinkage/thresholding algorithm with a
weakly convex penalty. IEEE Transactions on Signal Processing 64(6), 1597–1608 (2016)
5. Bertsekas, D.P.: Nonlinear Programming, 2nd edn. Athena Scientific (1999)
6. Bertsekas, D.P., Tsitsiklis, J.N.: Parallel and distributed computation: numerical methods.
Prentice Hall, Englewood Cliffs, N.J (1989)
7. Blake, A., Zisserman, A.: Visual Reconstruction. MIT Press, Cambridge, MA (1987)
8. Blumensath, T., Davies, M.E.: Iterative hard thresholding for compressed sensing. Applied
and Computational Harmonic Analysis 27(3), 265–274 (2009)
9. Blumensath, T., Davies, M.E.: Normalized iterative hard thresholding: Guaranteed stability
and performance. IEEE Journal of Selected Topics in Signal Processing 4(2), 298–309 (2010)
10. Bradley, J.K., Kyrola, A., Bickson, D., Guestrin, C.: Parallel coordinate descent for l1-
regularized loss minimization. arXiv preprint arXiv:1105.5379 (2011)
11. Breheny, P., Huang, J.: Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. The Annals of Applied Statistics 5(1), 232–253 (2011)
12. Candès, E.J., Romberg, J., Tao, T.: Robust uncertainty principles: Exact signal reconstruction
from highly incomplete frequency information. IEEE Transactions on Information Theory
52(2), 489–509 (2006)
13. Candès, E.J., Wakin, M.B., Boyd, S.P.: Enhancing sparsity by reweighted `1 minimization.
Journal of Fourier Analysis and Applications 14(5-6), 877–905 (2008)
14. Cao, W., Sun, J., Xu, Z.: Fast image deconvolution using closed-form thresholding formulas of $\ell_q$ ($q = \frac{1}{2}, \frac{2}{3}$) regularization. Journal of Visual Communication and Image Representation 24(1), 31–41 (2013)
15. Chartrand, R.: Exact reconstructions of sparse signals via nonconvex minimization. IEEE
Signal Processing Letters 14(10), 707–710 (2007)
16. Chartrand, R.: Generalized shrinkage and penalty functions. In: IEEE Global Conference on
Signal and Information Processing, p. 616. Austin, TX (2013)
17. Chartrand, R.: Shrinkage mappings and their induced penalty functions. In: IEEE International
Conference on Acoustics, Speech, and Signal Processing. Florence, Italy (2014)
18. Chartrand, R., Yin, W.: Iteratively reweighted algorithms for compressive sensing. In: IEEE
International Conference on Acoustics, Speech, and Signal Processing, pp. 3869–3872. Las
Vegas, NV (2008)
19. Chazan, D., Miranker, W.: Chaotic relaxation. Linear Algebra and its Applications 2(2), 199–
222 (1969)
20. Chen, S.S., Donoho, D.L., Saunders, M.A.: Atomic decomposition by basis pursuit. SIAM
Journal on Scientific Computing 20, 33–61 (1998)
21. Dhillon, I.S., Ravikumar, P.K., Tewari, A.: Nearest Neighbor based Greedy Coordinate De-
scent. In: J. Shawe-Taylor, R.S. Zemel, P.L. Bartlett, F. Pereira, K.Q. Weinberger (eds.) Ad-
vances in Neural Information Processing Systems 24, pp. 2160–2168. Curran Associates, Inc.
(2011)
22. Dong, B., Zhang, Y.: An efficient algorithm for `0 minimization in wavelet frame based image
restoration. Journal of Scientific Computing 54(2-3), 350–368 (2013)
23. Donoho, D.L.: Compressed sensing. IEEE Transactions on Information Theory 52(4), 1289–
1306 (2006)
24. Gao, H.Y., Bruce, A.G.: Waveshrink with firm shrinkage. Statistica Sinica 7(4), 855–874
(1997)
25. Gorodnitsky, I.F., Rao, B.D.: A new iterative weighted norm minimization algorithm and its
applications. In: IEEE Sixth SP Workshop on Statistical Signal and Array Processing, pp.
412–415 (1992)
26. Gorodnitsky, I.F., Rao, B.D.: Convergence analysis of a class of adaptive weighted norm ex-
trapolation algorithms. In: Asilomar Conference on Signals, Systems, and Computers, vol. 1,
pp. 339–343. Pacific Grove, CA (1993)
27. Grippo, L., Sciandrone, M.: On the convergence of the block nonlinear Gauss–Seidel method
under convex constraints. Operations Research Letters 26(3), 127–136 (2000)
28. Huang, X.L., Shi, L., Yan, M.: Nonconvex sorted `1 minimization for sparse approximation.
Journal of the Operations Research Society of China 3(2), 207–229 (2015)
29. Krishnan, D., Fergus, R.: Fast image deconvolution using hyper-Laplacian priors. In: Y. Bengio, D. Schuurmans, J. Lafferty, C. Williams, A. Culotta (eds.) Advances in Neural Information Processing Systems, vol. 22, pp. 1033–1041. Vancouver, BC (2009)
30. Lawson, C.L.: Contributions to the theory of linear least maximum approximation. Ph.D.
thesis, University of California, Los Angeles (1961)
31. Leahy, R.M., Jeffs, B.D.: On the design of maximally sparse beamforming arrays. IEEE
Transactions on Antennas and Propagation 39(8), 1178–1187 (1991)
32. Li, Y., Osher, S.: Coordinate descent optimization for $\ell_1$ minimization with application to compressed sensing; a greedy algorithm. Inverse Problems and Imaging 3(3), 487–503 (2009)
33. Liu, J., Wright, S.J.: Asynchronous Stochastic Coordinate Descent: Parallelism and Conver-
gence Properties. SIAM Journal on Optimization 25, 351–376 (2015)
34. Liu, J., Wright, S.J., Ré, C., Bittorf, V., Sridhar, S.: An Asynchronous Parallel Stochastic
Coordinate Descent Algorithm. J. Mach. Learn. Res. 16(1), 285–322 (2015)
35. Lu, Z., Xiao, L.: Randomized block coordinate non-monotone gradient method for a class of
nonlinear programming. arXiv preprint arXiv:1306.5918 (2013)
36. Mareček, J., Richtárik, P., Takáč, M.: Distributed Block Coordinate Descent for Minimizing
Partially Separable Functions. In: M. Al-Baali, L. Grandinetti, A. Purnama (eds.) Numerical
Analysis and Optimization, no. 134 in Springer Proceedings in Mathematics and Statistics,
pp. 261–288. Springer International Publishing (2015)
37. Marjanovic, G., Solo, V.: Sparsity Penalized Linear Regression With Cyclic Descent. IEEE
Transactions on Signal Processing 62(6), 1464–1475 (2014)
38. Mazumder, R., Friedman, J.H., Hastie, T.: SparseNet: Coordinate Descent With Nonconvex
Penalties. Journal of the American Statistical Association 106(495), 1125–1138 (2011)
39. Mohimani, H., Babaie-Zadeh, M., Jutten, C.: A fast approach for overcomplete sparse decom-
position based on smoothed L0 norm. IEEE Trans. Signal Process. 57(1), 289–301 (2009)
40. Nutini, J., Schmidt, M., Laradji, I.H., Friedlander, M., Koepke, H.: Coordinate descent converges faster with the Gauss-Southwell rule than random selection. arXiv preprint arXiv:1506.00552 (2015)
41. Peng, Z., Wu, T., Xu, Y., Yan, M., Yin, W.: Coordinate friendly structures, algorithms and
applications. Annals of Mathematical Sciences and Applications 1 (2016)
42. Peng, Z., Xu, Y., Yan, M., Yin, W.: ARock: an Algorithmic Framework for Asynchronous
Parallel Coordinate Updates. arXiv:1506.02396 [cs, math, stat] (2015)
43. Peng, Z., Yan, M., Yin, W.: Parallel and distributed sparse optimization. In: Signals, Systems
and Computers, 2013 Asilomar Conference on, pp. 659–646. IEEE (2013)
44. Powell, M.J.D.: On search directions for minimization algorithms. Mathematical Program-
ming 4(1), 193–201 (1973)
45. Rao, B.D., Kreutz-Delgado, K.: An affine scaling methodology for best basis selection. IEEE
Transactions on Signal Processing 47(1), 187–200 (1999)
46. Razaviyayn, M., Hong, M., Luo, Z.: A Unified Convergence Analysis of Block Successive
Minimization Methods for Nonsmooth Optimization. SIAM Journal on Optimization 23(2),
1126–1153 (2013)
47. Richtárik, P., Takáč, M.: Parallel coordinate descent methods for big data optimization. Math-
ematical Programming 156(1-2), 433–484 (2015)
48. Rockafellar, R.T.: Augmented lagrangians and applications of the proximal point algorithm in
convex programming. Mathematics of Operations Research 1(2), 97–116 (1976)
49. Saab, R., Chartrand, R., Yılmaz, Ö.: Stable sparse approximations via nonconvex optimization. In: IEEE International Conference on Acoustics, Speech, and Signal Processing (2008)
50. Scherrer, C., Halappanavar, M., Tewari, A., Haglin, D.: Scaling up coordinate descent algo-
rithms for large `1 regularization problems. arXiv preprint arXiv:1206.6409 (2012)
51. Tseng, P.: On the rate of convergence of a partially asynchronous gradient projection algo-
rithm. SIAM Journal on Optimization 1(4), 603–619 (1991)
52. Tseng, P.: Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications 109(3), 475–494 (2001)
53. Tseng, P., Yun, S.: A coordinate gradient descent method for nonsmooth separable minimiza-
tion. Mathematical Programming 117(1-2), 387–423 (2009)
54. Wang, Y., Yang, J., Yin, W., Zhang, Y.: A new alternating minimization algorithm for total
variation image reconstruction. SIAM Journal on Imaging Science 1(3), 248–272 (2008)
55. Wang, Y., Yin, W.: Sparse Signal Reconstruction via Iterative Support Detection. SIAM Jour-
nal on Imaging Sciences 3(3), 462–491 (2010)
56. Wang, Y., Yin, W., Zeng, J.: Global Convergence of ADMM in Nonconvex Nonsmooth Opti-
mization. arXiv:1511.06324 [cs, math] (2015)
57. Warga, J.: Minimizing certain convex functions. Journal of the Society for Industrial & Ap-
plied Mathematics 11(3), 588–593 (1963)
58. Woodworth, J., Chartrand, R.: Compressed sensing recovery via nonconvex shrinkage penalties. http://arxiv.org/abs/1504.02923
59. Wu, T.T., Lange, K.: Coordinate descent algorithms for lasso penalized regression. The Annals
of Applied Statistics 2(1), 224–244 (2008)
60. Xu, Y., Yin, W.: A Block Coordinate Descent Method for Regularized Multiconvex Optimiza-
tion with Applications to Nonnegative Tensor Factorization and Completion. SIAM Journal
on Imaging Sciences 6(3), 1758–1789 (2013)
61. Xu, Y., Yin, W.: A globally convergent algorithm for nonconvex optimization based on block
coordinate update. arXiv:1410.1386 [math] (2014)
62. Xu, Z., Chang, X., Xu, F., Zhang, H.: L1/2 regularization: A thresholding representation theory
and a fast solver. IEEE Transactions on Neural Networks and Learning Systems 23(7), 1013–
1027 (2012)
63. Xu, Z.B., Guo, H.L., Wang, Y., Zhang, H.: Representative of L1/2 regularization among Lq
(0 < q ≤ 1) regularizations: an experimental study based on phase diagram. Acta Automatica
Sinica 38(7), 1225–1228 (2012)
64. Yin, P., Lou, Y., He, Q., Xin, J.: Minimization of `1−2 for compressed sensing. SIAM Journal
on Scientific Computing 37(1), A536–A563 (2015)
65. Zangwill, W.I.: Nonlinear programming: a unified approach. Prentice-Hall, Englewood Cliffs,
NJ (1969)
66. Zeng, J., Peng, Z., Lin, S.: A Gauss-Seidel Iterative Thresholding Algorithm for lq Regularized
Least Squares Regression. arXiv:1507.03173 [cs] (2015)