Derivative-Free Optimization via Proximal Point Methods
W. L. Hare · Y. Lucet
W. Hare (corresponding author)
Mathematics, University of British Columbia, Okanagan Campus (UBCO), 3333 University Way, Kelowna, BC, V1V 1V7, Canada.
Tel.: +01-250-807-9378
Fax: +01-250-807-8001
E-mail: warren.hare@ubc.ca

Y. Lucet
Computer Science, UBCO.
Tel.: +01-250-807-9505
Fax: +01-250-807-8001
E-mail: yves.lucet@ubc.ca
1 Introduction
The goal of this paper is to provide a first look at how the proximal point
method could be adapted to a derivative-free setting. To this end, we create a
generalized Derivative-Free Proximal Point (DFPP) method and outline the
theory of its convergence.
The remainder of this paper is organized as follows. In Section 2, we define
the proximal point mapping, proximal point methods, and discuss some of
the known properties of these notions. In Section 3, we discuss the notions
of interpolation, and provide an algorithmic framework for a derivative-free
proximal point method. In Section 4, we prove the algorithmic framework
converges in the sense that the limit of the objective function’s gradients is 0.
We conclude and give directions for future study in Section 5.
Remark 2.1 The above proof does not ensure that $x^k$ converges, only that $\lim_{k\to\infty} \nabla f(x^k) = 0$. For example, if the proximal point algorithm is applied to $f(x) := e^x$, one finds that $x^k \to -\infty$.
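To see this behaviour numerically, the following sketch (our illustration, not part of the paper) runs the classical proximal point iteration $x^{k+1} = \operatorname{argmin}_y \{e^y + \tfrac{r}{2}(y - x^k)^2\}$ on $f(x) = e^x$; the prox-parameter $r$, starting point, and iteration count are arbitrary choices.

```python
# Minimal sketch (illustration only): proximal point iterations on f(x) = exp(x).
# The iterates drift toward -infinity while the gradient exp(x^k) tends to 0.
import numpy as np
from scipy.optimize import brentq

def prox_exp(x, r):
    """Prox of f(y) = exp(y) at x: solve the optimality condition exp(y) + r*(y - x) = 0."""
    lo = x - np.exp(x) / r - 1.0       # bracket: the unique root lies between lo and x
    return brentq(lambda y: np.exp(y) + r * (y - x), lo, x)

x, r = 0.0, 1.0                        # arbitrary starting point and prox-parameter
for k in range(50):
    x = prox_exp(x, r)
print(x, np.exp(x))                    # x^k -> -infinity, while f'(x^k) = exp(x^k) -> 0
```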
The basic proximal point method has led to a large number of successful proximal-style algorithms. Most common are the proximal-bundle methods, where the objective function $f$ is replaced by a sequence of piecewise linear model functions $f^k$; see [28, Chpt XIV] and [29]. Such methods essentially replace the minimization of $f$ with a sequence of quadratic programming problems.¹ The success of these methods in practice is well documented; see [29] and the references therein.
One interesting result regarding the proximal point method lies in the inherent stability of proximal points. Under some basic conditions (all of which are satisfied by convexity), linear errors in the objective function values lead to at most linear errors in the proximal point mapping [22, Thm 4.6]. Similar results hold for linear errors in the gradient values, the prox-parameter, and the prox-center [22, Thm 4.6].
This inherent stability makes proximal point methods an interesting con-
sideration for DFO, where we naturally expect errors in gradient values (and
possibly function values). In this paper, we demonstrate that we can replace
the objective function f with a reasonable approximation thereof, determine
the proximal points of the approximation functions, and then rely on the sta-
bility of the proximal point mapping to ensure convergence. To this end, we
next present an algorithmic framework for a derivative-free proximal point method.
¹ Note that, although the model function is piecewise linear, the computation of the proximal point still contains the quadratic penalty term. If the piecewise linear model function is given by $f^k(x) = \max_{i=1,2,\ldots,N} \{\langle a_i, x\rangle + b_i\}$, then the resulting subproblem takes the form
$$\operatorname{argmin}_{v,y}\Big\{ v + r\tfrac{1}{2}\|x-y\|^2 \;:\; v \ge \langle a_i, y\rangle + b_i \ \text{for } i=1,2,\ldots,N \Big\}.$$
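As a concrete illustration (ours, not the authors'), the following sketch solves this subproblem for a toy piecewise linear model using SciPy's SLSQP solver; the vectors $a_i$, offsets $b_i$, prox-center $x$, and prox-parameter $r$ are made-up data.

```python
# Minimal sketch (illustration only): the bundle subproblem
#   min_{v,y}  v + (r/2)*||x - y||^2   s.t.  v >= <a_i, y> + b_i  for all i.
import numpy as np
from scipy.optimize import minimize

A = np.array([[1.0, 0.0], [-1.0, 1.0], [0.0, -1.0]])   # rows are the a_i
b = np.array([0.0, 1.0, 0.5])                           # offsets b_i
x = np.array([2.0, 1.0])                                # prox-center
r = 1.0                                                 # prox-parameter

def objective(z):
    v, y = z[0], z[1:]
    return v + 0.5 * r * np.dot(x - y, x - y)

def cutting_planes(z):                                  # must be >= 0 componentwise
    v, y = z[0], z[1:]
    return v - (A @ y + b)

z0 = np.concatenate(([np.max(A @ x + b)], x))           # start at (f^k(x), x)
res = minimize(objective, z0, method="SLSQP",
               constraints=[{"type": "ineq", "fun": cutting_planes}])
print("model proximal point:", res.x[1:])
```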
Remark 3.2 Assumption 1 is slightly stronger than assuming the model functions form a fully linear class of models [1, Def 6.1], but slightly weaker than assuming they form a fully quadratic class of models [1, Def 6.2] and that the lower-level sets are compact (see Lemma 3.1). In particular, a fully linear model only implies the first two error bounds of Assumption 1, $\|f(y)-q(y)\| \le C\Delta^2$ and $\|\nabla f(y) - \nabla q(y)\| \le C\Delta$, while a fully quadratic model implies error bounds of the form $\|f(y)-q(y)\| \le C\Delta^3$, $\|\nabla f(y) - \nabla q(y)\| \le C\Delta^2$, and $\|\nabla^2 f(y) - \nabla^2 q(y)\| \le C\Delta$.
Methods for generating sampling sets and models that maintain Assumption 1 are well understood [30, 31, 1] (among others). For example, if $f \in \mathcal{C}^1$ and $\nabla f$ is Lipschitz, Assumption 1 can be achieved through linear interpolation over appropriately poised sets (noting that $\nabla^2 q = H = 0$ in this case) [30, Thm 2], where appropriately poised sets can be generated through a variety of algorithms; see for example [1, Alg 6.2 & 6.3].
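To illustrate the kind of model construction involved (our sketch, not the model-building algorithms of [30, 31, 1]), the following code fits a quadratic model $q(y) = c + \langle g, y - x\rangle + (y - x)^\top H (y - x)$ by least squares to function values sampled in $B_\Delta(x)$; the test function, sample size, and radius are made up, and no poisedness check is performed, so this simple sampling does not by itself guarantee the bounds of Assumption 1.

```python
# Minimal sketch (illustration only): least-squares fit of a quadratic model
#   q(y) = c + <g, y - x> + (y - x)^T H (y - x)
# from samples in the ball B_Delta(x).  No poisedness management is done here.
import numpy as np

def fit_quadratic_model(f, x, Delta, n_samples=30, seed=0):
    rng = np.random.default_rng(seed)
    n = x.size
    Y = x + Delta * rng.uniform(-1.0, 1.0, size=(n_samples, n))
    Y[0] = x                                       # always include the center
    fvals = np.array([f(y) for y in Y])
    rows = []
    for y in Y:
        d = y - x
        # Coefficients: constant, linear terms, upper triangle of H (off-diagonals doubled).
        upper = [d[i] * d[j] * (1.0 if i == j else 2.0)
                 for i in range(n) for j in range(i, n)]
        rows.append(np.concatenate(([1.0], d, upper)))
    coef, *_ = np.linalg.lstsq(np.array(rows), fvals, rcond=None)
    c, g = coef[0], coef[1:n + 1]
    H = np.zeros((n, n))
    idx = n + 1
    for i in range(n):
        for j in range(i, n):
            H[i, j] = H[j, i] = coef[idx]
            idx += 1
    return c, g, H

f = lambda y: np.sin(y[0]) + 0.5 * y[1] ** 2       # made-up smooth test function
c, g, H = fit_quadratic_model(f, x=np.array([1.0, -0.5]), Delta=0.1)
```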
Another approach to achieving Assumption 1 is understood through the
following lemma.
Proof Suppose error bounds of the form (4) hold. Since $f \in \mathcal{C}^2$ and the lower-level set is compact, there exists a constant $C_2$ such that $\|\nabla^2 f(y)\| \le C_2$ for all $y \in \{y : f(y) \le f(x^0)\}$. As $\Delta < 1$, we immediately have that
Remark 3.3 In Lemma 3.1, the assumption that the lower-level set is compact could be replaced by an assumption that $\nabla^2 f$ is bounded. We present the lemma in this form, as the lower-level set assumption is used in our convergence analysis.
$$\delta^k = q^k(x^k) - q^k(\tilde{x}^k).$$
Step 5. Loop:
Increment k → k + 1 and return to Step 1.
Before proceeding, we provide a few remarks on the DFPP algorithm.
First, note that although we assume $y^0 = x^k$, this is only to simplify discussion. Our proofs only require $x^k \in Y^k$.
Second, note that, in Step 3, the equality in $\{\tilde{x}^k\} = P_{r^k} q^k(x^k)$ is valid as the function $q^k + r^k\tfrac{1}{2}\|\cdot\|^2$ is strictly convex (by Step 2). This also ensures that the matrix inverse $(2H^k + r^k \mathrm{Id})^{-1}$ is well defined. Using Assumption 1, we note that $q^k + r\tfrac{1}{2}\|\cdot\|^2$ is strictly convex for any $r > M$. As such, Step 2 will never increase $r^k$ beyond $M + 1$.
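For intuition (our sketch, assuming the quadratic model form $q^k(y) = c + \langle g, y - x^k\rangle + (y - x^k)^\top H^k(y - x^k)$), the proximal point in Step 3 is available in closed form as $\tilde{x}^k = x^k - (2H^k + r^k\mathrm{Id})^{-1} g$; the data below are made up, and the simple loop that increases $r$ only mimics the role of Step 2.

```python
# Minimal sketch (illustration only): closed-form prox of a quadratic model,
#   x_tilde = x - (2H + r*Id)^{-1} g,
# increasing r until 2H + r*Id is positive definite (strict convexity of q + (r/2)||.-x||^2).
import numpy as np

def prox_of_quadratic_model(x, g, H, r):
    M = 2.0 * H + r * np.eye(x.size)
    while np.min(np.linalg.eigvalsh(M)) <= 0.0:    # mimic Step 2: enforce strict convexity
        r += 1.0
        M = 2.0 * H + r * np.eye(x.size)
    return x - np.linalg.solve(M, g), r

x = np.array([1.0, -0.5])                          # prox-center (made up)
g = np.array([0.3, -0.2])                          # model gradient at x (made up)
H = np.array([[0.5, 0.1], [0.1, -0.2]])            # possibly indefinite model Hessian term
x_tilde, r_used = prox_of_quadratic_model(x, g, H, r=1.0)
print(x_tilde, r_used)
```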
Third, note that, as $q^k$ is a quadratic interpolation of $f$ over $Y^k$ and $x^k \in Y^k$, the predicted decrease $\delta^k = q^k(x^k) - q^k(\tilde{x}^k)$ could equivalently be written as $\delta^k = f(x^k) - q^k(\tilde{x}^k)$.
Fifth, note that the line search in Step 4 is optional; convergence analysis without the line search is obtained by setting $\Lambda = 1$. We introduce the line search as a potential option to improve efficiency. Future research will explore which (if any) line search method is most effective for this algorithm.
Finally, note that the DFPP algorithm employs several parameters in its
implementation (see Step 0). In this paper, we keep these parameters general,
and only provide the bounds for these parameters required for convergence.
Future research will explore methods to select (and adaptively adjust) these
parameters. We do note that the parameter $r_{\min}$ must be strictly greater than 0 (see the proof of Lemma 4.2). This, however, is not a concern, as any implementation would enforce a similar tolerance to avoid floating-point errors.
4 Convergence Analysis
We now turn our focus to the theoretical convergence analysis of the algo-
rithm. We begin (Subsection 4.1) by examining the stopping condition of the
algorithm. In Subsection 4.2, we examine convergence under an infinite se-
quence of serious steps, while in Subsection 4.3 we examine convergence under
a finite sequence of serious steps. Under infinite serious steps we show that, if
the prox-parameter is bounded and the search radius converges to zero, then
the limit of the iterates' gradients is zero. Under finite serious steps we show that the final serious step results in a point with a gradient of zero.
4.1 Stopping
Our first theorem examines the ‘success’ stopping criterion occurring in Step 1.
this implies $\delta^{k_i}$ converges to 0. Note that $q^{k_i}(\tilde{x}^{k_i}) + r^{k_i}\tfrac{1}{2}\|\tilde{x}^{k_i} - x^{k_i}\|^2 \le q^{k_i}(x^{k_i})$ implies that
$$\begin{aligned}
\delta^{k_i} &= q^{k_i}(x^{k_i}) - q^{k_i}(\tilde{x}^{k_i}) \\
&\ge q^{k_i}(x^{k_i}) - \left(q^{k_i}(x^{k_i}) - r^{k_i}\tfrac{1}{2}\|\tilde{x}^{k_i} - x^{k_i}\|^2\right) \\
&\ge r^{k_i}\tfrac{1}{2}\|\tilde{x}^{k_i} - x^{k_i}\|^2.
\end{aligned}$$
Since $r^{k_i} \ge r_{\min}$ we have that
$$\|\tilde{x}^{k_i} - x^{k_i}\| \to 0. \tag{5}$$
Examining $\|x^{k+1} - x^k\|$ after a serious step, we note that
$$\|x^{k+1} - x^k\| = \|x^k + \alpha(\tilde{x}^k - x^k) - x^k\| = \alpha\|\tilde{x}^k - x^k\| \le \Lambda\|\tilde{x}^k - x^k\| \to 0.$$
Thus $\|x^{k+1} - x^k\| \to 0$.
Finally, for eventual contradiction, suppose that $\Delta^k$ does not converge to 0; i.e., there exists $\bar{\Delta} > 0$ such that $\Delta^k > \bar{\Delta}$ for all $k$ (note that $\Delta^k$ is a non-increasing sequence). Applying $\|x^{k+1} - x^k\| \to 0$, under an infinite number of serious steps, eventually we have
$$\frac{\|x^{k+1} - x^k\|}{\Delta^k} < \frac{\|x^{k+1} - x^k\|}{\bar{\Delta}} < \mathrm{tol}.$$
Thus the rule would be invoked an infinite number of times, so $\Delta^{k+1} \le \Gamma_S \Delta^k$ an infinite number of times, implying $\Delta^k \to 0$, a contradiction. ⊓⊔
Theorem 4.2 (Infinite Serious Steps) Assume Assumption 1 holds. Suppose the lower-level set $\{y : f(y) \le f(x^0)\}$ is compact. Suppose DFPP is applied to create iterates $x^k$ and that an infinite number of serious steps occurs. If $r^k$ is bounded above and $\Delta^k \to 0$, then
$$\lim_{k\to\infty} \nabla f(x^k) = 0.$$
Remark 4.4 In Theorem 4.2, the proof of convergence requires that the sequence $r^k$ is bounded above. By Assumption 1 we know that Step 2 will never increase $r^k$ beyond $M + 1$. However, $r^k$ can also be increased during Step 4. Thus an infinite sequence of serious steps intermixed with an infinite sequence of null steps could result in $r^k$ being unbounded. Clearly, the assumption that $r^k$ is bounded could be weakened to allow for $r^k \to \infty$ as long as $r^{k_i}(\tilde{x}^{k_i} - x^{k_i}) \to 0$. However, it remains an open question under what conditions this assumption can be removed completely.
Clearly, Theorem 4.2 implies that the limit of any convergent subsequence with an infinite number of serious steps will be a stationary point. Under the additional assumption that the objective function is strictly convex, an infinite sequence of serious steps will converge to the unique minimizer.
Corollary 4.1 (Infinite Serious Steps and Strict Convexity) Assume Assumption 1 holds. Suppose that $f$ is strictly convex and the lower-level set $\{y : f(y) \le f(x^0)\}$ is compact. Suppose DFPP is applied to create iterates $x^k$ and that an infinite number of serious steps occurs. If $r^k$ is bounded above and $\Delta^k \to 0$, then $x^k$ will converge to the unique minimizer
$$\lim_{k\to\infty} x^k = \bar{x} = \operatorname{argmin}_x \{f(x)\}.$$
Next note that the convexity of $f + \eta\tfrac{1}{2}\|\cdot\|^2$ implies that $f + \eta\tfrac{1}{2}\|\cdot - \bar{x}\|^2$ must be convex. Applying $\nabla(f + \eta\tfrac{1}{2}\|\cdot - \bar{x}\|^2)(\bar{x}) = \nabla f(\bar{x})$, and the fact that $f + \eta\tfrac{1}{2}\|\cdot - \bar{x}\|^2$ is convex, we see that
$$P_r f(\bar{x}) \subseteq \left\{ y : f(\bar{x}) + \langle \nabla f(\bar{x}), y - \bar{x}\rangle + (r - \eta)\tfrac{1}{2}\|y - \bar{x}\|^2 \le f(\bar{x}) \right\}.$$
Rearrangement and the Cauchy-Schwarz inequality complete the proof. ⊓⊔
Corollary 4.2 (Limits of Interpolation Set Radii) Given an infinite sequence of null steps, one must have $\Delta^k \to 0$.
Proof Suppose, for eventual contradiction, that an infinite sequence of null steps occurs and $\Delta^k$ does not converge to 0. As either $\Delta^{k+1} = \Delta^k$ or $\Delta^{k+1} \le \Gamma \Delta^k$, this would imply that $\Delta^{k+1} = \Delta^k$, $x^{k+1} = x^k$, $r^{k+1} = 2r^k$, and $q^{k+1} = q^k$ for all $k$ sufficiently large. Since $q^k$ is a quadratic function, there exists $\eta > 0$ such that $q^k + \eta\tfrac{1}{2}\|\cdot\|^2$ is convex for all $k$ sufficiently large. (For $k$ sufficiently large, the function $q^k$ does not depend on $k$, so $\eta$ is also independent of $k$.) Lemma 4.3 therefore states that
$$\tilde{x}^k \in P_{r^k} q^k(x^k) \subseteq \left\{ y : \|y - x^k\| \le \frac{2\|\nabla q^k(x^k)\|}{r^k - \eta} \right\}.$$
As $r^k \to \infty$, eventually we must have $\tilde{x}^k \in B_{\Delta^k}(x^k)$. At this point, $\Delta^k$ decreases, and we encounter a contradiction. ⊓⊔
The next convergence theorem makes no additional assumptions on the function. In particular, the assumptions that $f$ has compact lower-level sets, that $\Delta^k \to 0$, and that the sequence $r^k$ is bounded above are no longer needed. The removal of the assumption regarding compact lower-level sets is fairly obvious, as the iteration points $x^k$ are fixed after a finite number of iterations. The removal of the assumption that $\Delta^k \to 0$ follows from Corollary 4.2. The removal of the assumption that the sequence $r^k$ is bounded above requires use of the Mean Value Theorem and is only possible because the iteration points $x^k$ are fixed after a finite number of iterations.
We now prove convergence under a finite number of serious steps and an
infinite sequence of null steps.
Theorem 4.3 (Finite Serious and Infinite Null Steps) Assume Assumption 1 holds. Suppose DFPP is applied to create iterates $x^k$ and that a finite number of serious steps and an infinite number of null steps occur. Let $\bar{x}$ be the prox-center resulting from the last serious step. Then
$$\nabla f(\bar{x}) = 0.$$
Proof Since there is an infinite number of null steps, by Corollary 4.2 we have that $\Delta^k \to 0$. We now consider two cases: first, $r^k$ is bounded above; second, $r^k$ is unbounded.
Case I: Suppose $r^k$ is bounded above. This implies that $\tilde{x}^k \in B_{\Delta^k}(\bar{x})$ for all $k$ sufficiently large. As $\Delta^k \to 0$, we conclude $\tilde{x}^k \to \bar{x}$.
As in the proof of Theorem 4.2, $-\nabla q^k(\tilde{x}^k) = r^k(\tilde{x}^k - \bar{x})$. Since $r^k$ is bounded above, $\tilde{x}^k \to \bar{x}$ implies that $\|\nabla f(\bar{x}) - \nabla f(\tilde{x}^k)\| \to 0$ and $\|\nabla q^k(\tilde{x}^k)\| \to 0$. For any $\tilde{x}^k \in B_{\Delta^k}(\bar{x})$ we may apply Assumption 1 to see
$$\begin{aligned}
\|\nabla f(\bar{x})\| &\le \|\nabla f(\bar{x}) - \nabla f(\tilde{x}^k)\| + \|\nabla f(\tilde{x}^k) - \nabla q^k(\tilde{x}^k)\| + \|\nabla q^k(\tilde{x}^k)\| \\
&\le \|\nabla f(\bar{x}) - \nabla f(\tilde{x}^k)\| + C\Delta^k + \|\nabla q^k(\tilde{x}^k)\|.
\end{aligned}$$
for some $z^{k_i} = \beta\tilde{x}^{k_i} + (1-\beta)\bar{x}$, $\beta \in\, ]0,1[$. Noting that $-\nabla q^k(\tilde{x}^k) = r^k(\tilde{x}^k - \bar{x})$, and using the definition of $\delta^k$, we rewrite
$$f(\tilde{x}^{k_i}) - f(\bar{x}) = \left\langle \nabla f(y^{k_i}),\, -\frac{1}{r^{k_i}}\nabla q^{k_i}(\tilde{x}^{k_i}) \right\rangle$$
for some $y^{k_i} = \alpha\tilde{x}^{k_i} + (1-\alpha)\bar{x}$, $\alpha \in\, ]0,1[$, and
$$-\delta^{k_i} = \left\langle \nabla q^{k_i}(z^{k_i}),\, -\frac{1}{r^{k_i}}\nabla q^{k_i}(\tilde{x}^{k_i}) \right\rangle$$
for some $z^{k_i} = \beta\tilde{x}^{k_i} + (1-\beta)\bar{x}$, $\beta \in\, ]0,1[$. Since each step is a null step, we know that for each $k_i$
$$\nabla f(y^{k_i}) \to \nabla f(\bar{x}), \qquad \nabla q^{k_i}(\tilde{x}^{k_i}) \to \nabla f(\bar{x}),$$
Finally note
$$\nabla q^{k_i}(z^{k_i}) \to \nabla f(\bar{x})$$
by the same logic.
We may now pass to the limit in (7) to find
5 Conclusion
Acknowledgements
Work in this paper was supported by NSERC Discovery grants (Hare and Lucet), a UBC IRF grant (Hare), and a Canadian Foundation for Innovation (CFI) Leaders Opportunity Fund grant (Lucet).
References
6. Hare, W.L.: Using derivative free optimization for constrained parameter selection
in a home and community care forecasting model. In: International Perspectives on
Operations Research and Health Care, Proceedings of the 34th Meeting of the EURO
Working Group on Operational Research Applied to Health Sciences, pp. 61–73. (2010)
7. Audet, C., Dennis, J.E.Jr., Le Digabel, S.: Globalization strategies for mesh adaptive
direct search. Comput. Optim. Appl. 46(2), 193–215 (2010)
8. Nocedal, J., Wright, S.J.: Numerical Optimization. Springer Series in Operations Research and Financial Engineering. Springer, New York (2006)
9. Coope, I.D., Price, C.J.: A direct search frame-based conjugate gradients method. J.
Comput. Math. 22(4), 489–500 (2004)
10. Cheng, W., Xiao, Y., Hu, Q.J.: A family of derivative-free conjugate gradient methods
for large-scale nonlinear systems of equations. J. Comput. Appl. Math. 224(1), 11–19
(2009)
11. Martinet, B.: Détermination approchée d’un point fixe d’une application pseudo-
contractante. Cas de l’application prox. C. R. Acad. Sci. Paris Sér. A-B, 274, A163–A165
(1972)
12. Lemaréchal, C., Strodiot, J.J., Bihain, A.: On a bundle algorithm for nonsmooth opti-
mization. In: Nonlinear programming, vol. 4, pp. 245–282. Academic Press, New York
(1981)
13. Mifflin, R.: A modification and extension of Lemarechal’s algorithm for nonsmooth
minimization. Math. Programming Stud., 17, 77–90 (1982)
14. Lukšan, L., Vlček, J.: A bundle-Newton method for nonsmooth unconstrained minimization. Math. Program. 83(3, Ser. A), 373–391 (1998)
15. Hare, W., Sagastizábal, C.: Computing proximal points of nonconvex functions. Math.
Program. 116(1-2, Ser. B), 221–258 (2009)
16. Hare, W., Sagastizábal, C.: Redistributed proximal bundle method for nonconvex optimization. SIAM J. Optim. 20(5), 2442–2473 (2010)
17. Mifflin, R., Sagastizábal, C.: Proximal points are on the fast track. J. Convex Anal.
9(2), 563–579 (2002)
18. Hare, W.L., Lewis, A.S.: Identifying active constraints via partial smoothness and prox-
regularity. J. Convex Anal. 11(2), 251–266 (2004)
19. Hare, W.L.: A proximal method for identifying active manifolds. Comput. Optim. Appl.
43(2), 295–306 (2009)
20. Mifflin, R., Sagastizábal, C.: VU-smoothness and proximal point results for some nonconvex functions. Optim. Methods Softw. 19(5), 463–478 (2004)
21. Mifflin, R., Sagastizábal, C.: A VU-algorithm for convex minimization. Math. Program. 104(2-3, Ser. B), 583–608 (2005)
22. Hare, W.L., Poliquin, R.A.: Prox-regularity and stability of the proximal mapping. J.
Convex Anal. 14(3), 589–606 (2007)
23. Kiwiel, K.C.: A proximal bundle method with approximate subgradient linearizations.
SIAM J. Optim. 16(4), 1007–1023 (2006)
24. Shen, J., Xia, Z.Q., Pang, L.P.: A proximal bundle method with inexact data for convex
nondifferentiable minimization. Nonlinear Anal. 66(9), 2016–2027 (2007)
25. Kiwiel, K.C., Lemaréchal, C.: An inexact bundle variant suited to column generation.
Math. Program. 118(1, Ser. A), 177–206 (2009)
26. Kiwiel, K.C.: An inexact bundle approach to cutting-stock problems. INFORMS J.
Comput. 22(1), 131–143 (2010)
27. Rockafellar, R.T., Wets, R. J.-B.: Variational Analysis. Vol. 317 of Grundlehren der
Mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences].
Springer-Verlag, Berlin (1998)
28. Hiriart-Urruty, J.-B., Lemaréchal, C.: Convex Analysis and Minimization Algorithms.
Vol. 305–306 in Grundlehren der Mathematischen Wissenschaften [Fundamental Prin-
ciples of Mathematical Sciences]. Springer-Verlag, Berlin (1993)
29. Mäkelä, M.M.: Survey of bundle methods for nonsmooth optimization. Optim. Methods
Softw. 17(1), 1–29 (2002)
30. Conn, A.R., Scheinberg, K., Vicente, L.N.: Geometry of interpolation sets in derivative
free optimization. Math. Program. 111(1-2, Ser. B), 141–172 (2008)
31. Conn, A.R., Scheinberg, K., Vicente, L.N.: Geometry of sample sets in derivative-free
optimization: polynomial regression and underdetermined interpolation. IMA J. Numer.
Anal. 28(4), 721–748 (2008)
32. Apkarian, P., Noll, D., Prot, O.: A proximity control algorithm to minimize nonsmooth and nonconvex semi-infinite maximum eigenvalue functions. J. Convex Anal. 16(3-4), 641–666 (2009)