
Noname manuscript No.
(will be inserted by the editor)

Derivative-free Optimization Via Proximal Point Methods

W. L. Hare · Y. Lucet

Received: date / Accepted: date

Abstract Derivative-Free Optimization (DFO) examines the challenge of minimizing (or maximizing) a function without explicit use of derivative informa-
tion. Many standard techniques in DFO are based on using model functions
to approximate the objective function, and then applying classic optimiza-
tion methods on the model function. For example, the details behind adapting
steepest descent, conjugate gradient, and quasi-Newton methods to DFO have
been studied in this manner. In this paper we demonstrate that the proxi-
mal point method can also be adapted to DFO. To that end, we provide a
derivative-free proximal point (DFPP) method and prove convergence of the
method in a general sense. In particular, we give conditions under which the
gradient values of the iterates converge to 0, and conditions under which an
iterate corresponds to a stationary point of the objective function.

Keywords Derivative-free optimization · Proximal point method · Unconstrained minimization

Mathematics Subject Classification (2000) Primary 90C56 · 90C55; Secondary 65K05 · 49M25

W. Hare
corresponding author
Mathematics, University of British Columbia, Okanagan Campus (UBCO). 3333 University
Way, Kelowna, BC, V1V 1V7, Canada.
Tel.:+01-250-807-9378
Fax: +01-250-807-8001
E-mail: warren.hare@ubc.ca
Y. Lucet
Computer Science, UBCO.
Tel.:+01-250-807-9505
Fax: +01-250-807-8001
E-mail: yves.lucet@ubc.ca

1 Introduction

In this paper, we are concerned with (unconstrained) Derivative-free Optimization (DFO). That is, an optimization problem where the objective function is
not analytically available. We assume that through some sort of black-box we
are able to evaluate the objective function at an arbitrary point, but that we
are unable to determine any higher order information (e.g., gradient vectors)
directly.
Uses and example applications of DFO can be found in the recent book
[1, Chpt 1]. One increasingly common use of DFO is in black-box optimization
problems; that is, problems where the objective function is generated via a
computer simulation (see [2–6] among many other examples). For such prob-
lems, the objective function may in fact be well behaved, even smooth, but
higher order information is unavailable to the user. It is likely that, as modern
computing science develops more user-friendly simulation software, the value
of DFO in solving such real-world problems will continue to increase. (It should
be noted that various other numerical reasons, such as function instability, can
make derivatives unusable for optimization; see [1, Chpt 1] or [7] for further
examples.)
If the objective function f is differentiable and the derivatives are easily
evaluated or approximated, then a number of classical methods have become
well established in academia. For example, steepest descent [8, Chpt 3], conju-
gate gradient [8, Chpt 5], and quasi-Newton methods (such as BFGS) [8, Chpt
6], are all considered classical methods, and often taught at an undergraduate
level (although perhaps in a simplified manner).
Research into DFO methods has often begun by considering these classical derivative-based methods, and examining how to adapt them to a derivative-free setting. Each of the above methods has a natural counterpart in the world
of DFO. Steepest descent is easily converted to DFO by use of the Simplex
Gradient [1, Chpt 9]. The coordinate search method [1, Chpt 7] could be
thought of as a simplified DFO version of the conjugate gradient approach,
but more advanced DFO methods based on conjugate gradients have also been
developed [9, 10]. Quasi-Newton methods are encapsulated in the trust-region
frameworks of [1, Chpt 10].
Interestingly, research into DFO methods has not (to the best of the au-
thors’ knowledge) considered how the classical proximal point method ([11])
might be adapted to a derivative-free setting. This is surprising, as the prox-
imal point method provides a flexible robust framework, both in theory and
practice. The simplicity of the basic proximal point method can be seen in the
original one line description of the basic algorithm as given by Martinet [11].
The flexibility of the proximal point method can be found in applications to
non-smooth and non-convex optimization [12–16], and in its ability to iden-
tify active manifolds [17–19]. The ability to identify active manifolds has been
critical in the development of the VU theory for convex minimization [20, 21].
The robustness of proximal point methods has been noted in stability theory
such as [22] and inexact proximal-bundle methods such as [23–26].

The goal of this paper is to provide a first look at how the proximal point
method could be adapted to a derivative-free setting. To this end, we create a
generalized Derivative-Free Proximal Point (DFPP) method and outline the
theory of its convergence.
The remainder of this paper is organized as follows. In Section 2, we define
the proximal point mapping, proximal point methods, and discuss some of
the known properties of these notions. In Section 3, we discuss the notions
of interpolation, and provide an algorithmic framework for a derivative-free
proximal point method. In Section 4, we prove the algorithmic framework
converges in the sense that the limit of the objective function’s gradients is 0.
We conclude and give directions for future study in Section 5.

2 Proximal Point Methods

Given a function f and a parameter r > 0, we define the proximal envelope e_r f and its related proximal point mapping P_r f by the following:

    e_r f(x) := inf_y { f(y) + (r/2)‖y − x‖^2 },    (1)

    P_r f(x) := argmin_y { f(y) + (r/2)‖y − x‖^2 }.    (2)
The parameter r is referred to as the prox-parameter, and the point x is re-
ferred to as the prox-center. If the function f is non-convex, then the proximal
envelope is potentially equal to −∞ and the proximal point mapping is poten-
tially empty. We call a function prox-bounded if and only if there exists r̄ ≥ 0 such that e_r f(x̄) > −∞ for some x̄ ∈ IR^n and all r > r̄. The threshold of prox-boundedness is the greatest lower bound on all such r̄. If f is prox-bounded with threshold r̄, then e_r f(x) > −∞ for all x ∈ IR^n and any r > r̄ [27, Ex 1.24]. It is immediately clear that all convex functions are prox-bounded (with threshold r̄ = 0) and that all quadratic functions q are prox-bounded (with threshold equal to the smallest r that makes q + (r/2)‖·‖^2 convex).
Given a prox-bounded function f and a sequence of prox-parameters r^k greater than the threshold of prox-boundedness, the proximal point method is summarized in the inclusion

    x^{k+1} ∈ P_{r^k} f(x^k).    (3)

If r^k is bounded above and f ∈ C^1 is bounded below, then it is straightforward to prove that lim_{k→∞} ∇f(x^k) = 0:

    Since f(x^{k+1}) ≤ f(x^k) − (r^k/2)‖x^{k+1} − x^k‖^2 and f is bounded below, we have that f(x^k) converges, which implies (r^k/2)‖x^{k+1} − x^k‖^2 converges to 0. The first order optimality conditions for e_r f imply that ∇f(x^{k+1}) = −r^k(x^{k+1} − x^k), so r^k bounded above implies that lim_{k→∞} ∇f(x^k) = 0.

(It is equally straightforward to prove convergence of the proximal point method in the more general non-differentiable case [11], but it requires tools of non-smooth analysis such as sub-differentials.)

Remark 2.1 The above proof does not ensure that x^k converges, only that lim_{k→∞} ∇f(x^k) = 0. For example, if the proximal point algorithm is applied to f(x) := e^x, one finds that x^k → −∞.
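To make the basic iteration (3) concrete, the following Python sketch carries out the proximal point method on a simple smooth convex test function. The function, prox-parameter, and iteration counts are illustrative choices of ours, not taken from the paper; the prox subproblem is solved by iterating the fixed-point form of its optimality condition, which is a contraction because the prox-parameter exceeds the Lipschitz constant of the gradient.

import numpy as np

# Illustrative smooth convex test function (not from the paper):
# f(x) = sum_i sqrt(1 + x_i^2); its gradient is 1-Lipschitz.
def f(x):
    return np.sum(np.sqrt(1.0 + x ** 2))

def grad_f(x):
    return x / np.sqrt(1.0 + x ** 2)

def prox_step(x, r, inner_iters=100):
    # Approximate P_r f(x) = argmin_y f(y) + (r/2)||y - x||^2 by iterating
    # the optimality condition grad f(y) + r (y - x) = 0 in the form
    # y = x - (1/r) grad f(y); this is a contraction whenever r exceeds
    # the Lipschitz constant of grad f (here 1).
    y = x.copy()
    for _ in range(inner_iters):
        y = x - grad_f(y) / r
    return y

x = np.array([3.0, -2.0, 5.0])    # arbitrary starting prox-center
r = 2.0                           # prox-parameter above the Lipschitz constant
for k in range(50):
    x = prox_step(x, r)           # x^{k+1} in P_r f(x^k), cf. (3)
print(np.linalg.norm(grad_f(x)))  # gradient norm shrinks toward 0

As expected from the argument above, the gradient norms of the iterates decrease toward zero; no derivative-free ingredient appears yet, since the true gradient is used here to solve each subproblem.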

The basic proximal point method has led to a large number of successful
proximal-style algorithms. Most common are the proximal-bundle methods,
where the objective function f is replaced by a sequence of piecewise lin-
ear model functions f^k, see [28, Chpt XIV] and [29]. Such methods essentially
replace the minimization of f with a sequence of quadratic programming prob-
lems1 . The success of these methods in practice is well-documented [29] (and
references therein).
One interesting result regarding the proximal point method lies in the in-
herent stability of proximal points. Under some basic conditions (all of which
are satisfied by convexity), linear errors in the objective function values per-
petuate into linear errors in the proximal point mapping [22, Thm 4.6]. Similar
results hold for linear errors in the gradient values, prox-parameter and prox-
center [22, Thm 4.6].
This inherent stability makes proximal point methods an interesting con-
sideration for DFO, where we naturally expect errors in gradient values (and
possibly function values). In this paper, we demonstrate that we can replace
the objective function f with a reasonable approximation thereof, determine
the proximal points of the approximation functions, and then rely on the sta-
bility of the proximal point mapping to ensure convergence. To this end, we
next present an algorithmic framework for a derivative free proximal point
method.

3 Derivative-free Proximal Point Method

We shall use x^k to denote the prox-center of the algorithm during iteration k. At each iteration, the algorithm shall make use of a set of interpolation points Y = {y^0, y^1, . . . , y^p} ⊆ IR^n with y^0 = x^k. We define the sampling radius of Y via

    ∆(Y) := max_{y^i ∈ Y} ‖y^i − y^0‖.

1 Note that, although the model function is linear, the computation of the proximal point still contains the quadratic penalty term. If the piecewise linear model function is given by f^k(x) = max_{i=1,2,...,N} {⟨a_i, x⟩ + b_i}, then the resulting subproblem takes the form

    argmin_{v,y} { v + (r/2)‖x − y‖^2 : v ≥ ⟨a_i, y⟩ + b_i for i = 1, 2, . . . , N }.

(The proximal point of f^k is given by y.)



We say that q is a quadratic model function of f corresponding to Y if and only if

    q(y^i) = f(y^i) for each y^i ∈ Y

and q is a quadratic function, i.e.,

    q(y) := α + ⟨g, y⟩ + ⟨y, Hy⟩,

where α ∈ IR, g ∈ IR^n, and H ∈ IR^{n×n} is symmetric. (Note that we do not assume H is positive semi-definite.)
In order to avoid lengthy discussions on interpolation and poisedness of
sets [1, Chpt 2–6], we introduce the following assumption that will ensure
convergence. We denote the closed ball of radius ∆ centered at y^0 by B_∆(y^0); B_∆(y^0) := {x : ‖x − y^0‖ ≤ ∆}.

Assumption 1 Assume f ∈ C^1. Furthermore, assume that there exist constants C and M such that, for any point y^0 and any sampling radius ∆ > 0, we are able to generate a sampling set Y = {y^0, y^1, . . . , y^p} ⊆ IR^n and a corresponding quadratic model function q such that ∆(Y) = ∆ and

    ‖f(y) − q(y)‖ ≤ C∆^2 for all y ∈ B_∆(y^0),
    ‖∇f(y) − ∇q(y)‖ ≤ C∆ for all y ∈ B_∆(y^0), and
    ‖∇^2 q(y)‖ ≤ M.

Remark 3.2 Assumption 1 is slightly stronger than assuming the model functions form a fully linear class of models [1, Def 6.1], but slightly weaker than assuming they form a fully quadratic class of models [1, Def 6.2] and that the lower-level sets are compact (see Lemma 3.1). In particular, a fully linear model only implies the first two error bounds of Assumption 1 (‖f(y) − q(y)‖ ≤ C∆^2 and ‖∇f(y) − ∇q(y)‖ ≤ C∆), while a fully quadratic model implies error bounds of the form ‖f(y) − q(y)‖ ≤ C∆^3, ‖∇f(y) − ∇q(y)‖ ≤ C∆^2, and ‖∇^2 f(y) − ∇^2 q(y)‖ ≤ C∆.

Methods for generating sampling sets and models that maintain Assump-
tion 1 are well understood [30, 31, 1] (among others). For example, if f ∈ C 1
and ∇f is Lipschitz, Assumption 1 can be achieved through linear interpola-
tion over appropriately poised sets (noting that ∇2 q = H = 0 in this case) [30,
Thm 2], where appropriately poised sets can be generated through a variety
of algorithms, see for example [1, Alg 6.2 & 6.3].
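As a concrete illustration of the linear-interpolation route just described (a sketch only; the poised set, test function, and sample radii are our own choices), the Python fragment below interpolates f at y^0 and the coordinate points y^0 + ∆e_i, producing a linear model with H = 0 whose gradient error behaves like the first-order bound in Assumption 1.

import numpy as np

def linear_model(f, y0, delta):
    # Build q(y) = alpha + <g, y - y0> interpolating f on the poised set
    # {y0, y0 + delta*e_1, ..., y0 + delta*e_n}, so that Delta(Y) = delta
    # and the model Hessian is H = 0.
    alpha = f(y0)
    g = np.array([(f(y0 + delta * e) - alpha) / delta for e in np.eye(y0.size)])
    return alpha, g

# Illustrative smooth test function (not from the paper).
f = lambda x: np.sin(x[0]) + x[1] ** 2
y0 = np.array([0.5, 1.0])
true_grad = np.array([np.cos(y0[0]), 2.0 * y0[1]])
for delta in (1e-1, 1e-2, 1e-3):
    _, g = linear_model(f, y0, delta)
    # The error decreases roughly linearly in delta, consistent with the
    # bound ||grad f(y) - grad q(y)|| <= C*Delta of Assumption 1.
    print(delta, np.linalg.norm(g - true_grad))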
Another approach to achieving Assumption 1 is understood through the
following lemma.

Lemma 3.1 (Assumption via Quadratic Interpolation) Assume that f ∈ C^2, x^0 ∈ IR^n, and suppose that the lower-level set {y : f(y) ≤ f(x^0)} is compact. Furthermore, suppose there exists C such that, for any point y^0 and any sampling radius 0 < ∆ < 1, we are able to generate a sampling set
Y = {y^0, y^1, . . . , y^p} ⊆ IR^n and a corresponding quadratic model function q such that ∆(Y) = ∆ and

    ‖f(y) − q(y)‖ ≤ C∆^3 for all y ∈ B_∆(y^0),
    ‖∇f(y) − ∇q(y)‖ ≤ C∆^2 for all y ∈ B_∆(y^0), and    (4)
    ‖∇^2 f(y) − ∇^2 q(y)‖ ≤ C∆ for all y ∈ B_∆(y^0).

Then, Assumption 1 holds at all points y ∈ {y : f(y) ≤ f(x^0)}.

Proof Suppose error bounds of the form (4) hold. Since f ∈ C^2 and the lower-level set is compact, there exists a constant C_2 such that ‖∇^2 f(y)‖ ≤ C_2 for all y ∈ {y : f(y) ≤ f(x^0)}. As ∆ < 1, we immediately have that

    ‖f(y) − q(y)‖ ≤ C∆^2 for all y ∈ B_∆(y^0) and
    ‖∇f(y) − ∇q(y)‖ ≤ C∆ for all y ∈ B_∆(y^0).

The triangle inequality yields the final line in Assumption 1,

    ‖∇^2 q(y)‖ ≤ ‖∇^2 f(y) − ∇^2 q(y)‖ + ‖∇^2 f(y)‖ ≤ C∆ + C_2 ≤ C + C_2 =: M,

which finishes the proof. ⊓⊔

Using the above lemma, we see that, if f ∈ C^2 and ∇^2 f is Lipschitz, Assumption 1 can be achieved through quadratic interpolation [30, Thm 3] or quadratic regression [31, Thm 3.2] over appropriately poised sets. (Both approaches generate error bounds of the form (4).) Alternatively, if f ∈ C^2 and ∇^2 f is Lipschitz and bounded, then Assumption 1 can be achieved using minimum Frobenius norm Lagrange polynomials [31, Thm 5.12] (that ‖∇^2 q(y)‖ ≤ M can be easily shown from ‖∇^2 f‖ bounded).

Remark 3.3 In Lemma 3.1, the assumption of the lower-level set being compact could be replaced by an assumption that ∇^2 f is bounded. We present it in this form, as the lower-level set assumption is used in our convergence analysis.

We now state our derivative-free proximal point method framework. In the algorithm, we use λ_n(H) to denote the minimum eigenvalue of H.

Algorithm: [Derivative-Free Proximal Point Method (DFPP)]
Step 0. Initialize: Set k = 0 and input
    x^0 - an initial prox-center,
    ∆^0 - an initial sampling radius, ∆^0 > 0,
    r_min - lower bound for the prox-parameter, r_min > 0,
    r^0 - an initial prox-parameter, r^0 ≥ r_min,
    m - an Armijo-like parameter, 0 < m < 1,
    Λ - a maximal step length parameter, Λ ≥ 1,
    Γ - a minimal radius decrease parameter, 0 < Γ < 1,
    ε^∇_tol, ε^∆_tol - stopping tolerances, ε^∇_tol, ε^∆_tol > 0.

Step 1. Model and Stopping Conditions:
    Use y^0 = x^k and ∆^k to create a sampling set Y^k and a quadratic model q^k in accordance with Assumption 1:

        q^k(x) := a^k + ⟨g^k, x⟩ + (1/2)⟨x, H^k x⟩.

    If ‖∇q^k(x^k)‖ < ε^∇_tol and ∆^k < ε^∆_tol, then STOP (‘success’).
Step 2. Prox-feasibility Check:
    If r^k ≤ −λ_n(H^k), then (q^k + (r^k/2)‖·‖^2 is not strictly convex):
    – reset r^k = −λ_n(H^k) + 1.
Step 3. Prox Trial Point:
    Compute the trial point

        {x̃^k} = P_{r^k} q^k(x^k) = {(2H^k + r^k Id)^{−1}(r^k x^k − g^k)}.

    Compute the predicted decrease

        δ^k = q^k(x^k) − q^k(x̃^k).

Step 4. Serious/Null Check:
    If f(x̃^k) ≤ f(x^k) − mδ^k, then declare a serious step:
    – (line search, optional) set x^{k+1} = x^k + α^k(x̃^k − x^k) such that
        f(x^{k+1}) ≤ f(x^k) − mα^k δ^k and α^k ∈ [1, Λ],
    – select r^{k+1} ∈ [r_min, r^k],
    – select ∆^{k+1} ∈ ]0, ∆^k].
    Else (if f(x̃^k) > f(x^k) − mδ^k), then declare a null step:
    – if x̃^k ∉ B_{∆^k}(x^k), then set r^{k+1} = 2r^k, x^{k+1} = x^k and ∆^{k+1} = ∆^k,
    – if x̃^k ∈ B_{∆^k}(x^k), then set x^{k+1} = x^k and select ∆^{k+1} ∈ ]0, Γ∆^k].
Step 5. Loop:
    Increment k → k + 1 and return to Step 1.    ⊓⊔
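To make the framework concrete, the following Python sketch assembles the steps above into a runnable loop under several simplifying assumptions of ours: the line search is omitted (Λ = 1), the simplest admissible parameter updates are used, and the quadratic model is built by finite differences centered at the prox-center rather than by the interpolation constructions of Section 3. The model is written in the shifted variable s = x − x^k, so the trial point comes from a linear solve in s; this is a sketch of the framework, not the authors' implementation.

import numpy as np

def fd_model(f, x, h):
    # Finite-difference quadratic model of f about x with step h:
    # q(x + s) ~ fx + <g, s> + (1/2) <s, H s>   (our own centered convention).
    n, fx = x.size, f(x)
    E = h * np.eye(n)
    g = np.array([(f(x + E[i]) - f(x - E[i])) / (2.0 * h) for i in range(n)])
    H = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            H[i, j] = (f(x + E[i] + E[j]) - f(x + E[i]) - f(x + E[j]) + fx) / h ** 2
    return fx, g, 0.5 * (H + H.T)               # symmetrize

def dfpp(f, x0, delta0=1.0, r0=1.0, r_min=1e-8, m=0.1, gamma=0.5,
         eps_grad=1e-6, eps_delta=1e-4, max_iter=500):
    # Sketch of the DFPP loop with the line search omitted (Lambda = 1).
    x, delta, r = np.asarray(x0, dtype=float), delta0, max(r0, r_min)
    for _ in range(max_iter):
        # Step 1: build the model and test the stopping conditions.
        fx, g, H = fd_model(f, x, delta)
        if np.linalg.norm(g) < eps_grad and delta < eps_delta:
            return x                             # 'success'
        # Step 2: prox-feasibility check on the subproblem's curvature.
        lam_min = np.linalg.eigvalsh(H).min()
        if r <= -lam_min:
            r = -lam_min + 1.0
        # Step 3: trial point = unique minimizer of q(x + s) + (r/2)||s||^2.
        s = np.linalg.solve(H + r * np.eye(x.size), -g)
        pred_decrease = -(g @ s + 0.5 * s @ H @ s)   # delta^k = q(x) - q(x + s)
        # Step 4: serious / null check.
        if f(x + s) <= fx - m * pred_decrease:
            x = x + s                            # serious step
            delta = min(delta, max(np.linalg.norm(s), eps_delta / 10.0))
            # (an admissible choice of Delta^{k+1} in ]0, Delta^k])
        elif np.linalg.norm(s) > delta:
            r *= 2.0                             # null step, trial left the ball
        else:
            delta *= gamma                       # null step, shrink the radius
    return x

if __name__ == "__main__":
    # Illustrative run on a smooth quadratic (not from the paper).
    print(dfpp(lambda x: (x[0] - 1.0) ** 2 + 5.0 * (x[1] + 2.0) ** 2,
               np.array([0.0, 0.0])))

In this sketch the model construction dominates the cost, at roughly O(n^2) function evaluations per iteration; the interpolation and regression schemes cited in Section 3 exist precisely to reduce that cost and to allow reuse of earlier sample points.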
Before proceeding, we provide a few remarks on the DFPP algorithm.
First, note that although we assume y^0 = x^k, this is only to simplify discussion. Our proofs only require x^k ∈ Y.
Second, note that, in Step 3, the equality in {x̃^k} = P_{r^k} q^k(x^k) is valid as the function q^k + (r^k/2)‖·‖^2 is strictly convex (by Step 2). This also ensures that the matrix inversion, (2H^k + r^k Id)^{−1}, is well-defined. Using Assumption 1, we note that q^k + (r/2)‖·‖^2 is strictly convex for any r > M. As such, Step 2 will never increase r^k beyond M + 1.
Third, note that, as q^k is a quadratic interpolation of f over Y^k and x^k ∈ Y^k, the predicted decrease δ^k = q^k(x^k) − q^k(x̃^k) could equivalently be computed via δ^k = f(x^k) − q^k(x̃^k).


Fourth, we remark that the “Serious Step” and “Null Step” terminology is
taken from historical proximal point algorithm design. A more modern termi-
nology might be “Successful Iteration” and “Null Iteration”.

Fifth, note that the line search in Step 4 is optional; convergence analysis without the line search is done by setting Λ = 1. We introduce the line search
as a potential option to improve efficiency. Future research will explore which
(if any) line search method is most effective for this algorithm.
Finally, note that the DFPP algorithm employs several parameters in its
implementation (see Step 0). In this paper, we keep these parameters general,
and only provide the bounds for these parameters required for convergence.
Future research will explore methods to select (and adaptively adjust) these
parameters. We do note that the parameter rmin must be strictly greater than
0 (see the proof of Lemma 4.2). This, however, is not a concern, as any im-
plementation would enforce a similar tolerance to avoid floating point errors.

The derivative-free proximal point method shows notable similarity to DFO


trust region methods [1, Chpt 10 & 11]. Both methods approximate the ob-
jective function with a quadratic and then use minimization of the quadratic
approximation to determine the next iterate. The difference largely lies in how
the method maintains the next iterate close enough to the interpolation set
to ensure high accuracy of the quadratic approximation function. DFO trust
region methods apply a trust region, thereby minimizing over a closed convex
set. The DFPP method applies a quadratic penalty that stops iterates from
moving too far from the current prox-center.
There are two advantages to the quadratic penalty approach. First, the
quadratic penalty convexifies the approximating quadratic. This ensures that
the sub-problems are all easily solvable convex quadratic programs. Second, as
can be seen in the proof of Theorem 4.2, the quadratic penalty automatically
enforces an Armijo-like descent.
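A small numerical check of the convexification claim (an illustration of ours, stated under the convention that the model's Hessian is H): after the Step 2 reset, the curvature matrix of the penalized subproblem is positive definite even when H itself is indefinite.

import numpy as np

H = np.array([[1.0, 2.0],
              [2.0, -3.0]])                   # indefinite model Hessian (illustrative)
r = -np.linalg.eigvalsh(H).min() + 1.0        # the reset used in Step 2 of DFPP
print(np.linalg.eigvalsh(H))                  # one negative eigenvalue: q alone is nonconvex
print(np.linalg.eigvalsh(H + r * np.eye(2)))  # all positive: the penalized subproblem is convex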

4 Convergence Analysis

We now turn our focus to the theoretical convergence analysis of the algo-
rithm. We begin (Subsection 4.1) by examining the stopping condition of the
algorithm. In Subsection 4.2, we examine convergence under an infinite se-
quence of serious steps, while in Subsection 4.3 we examine convergence under
a finite sequence of serious steps. Under infinite serious steps we show that, if
the prox-parameter is bounded and the search radius converges to zero, then
the limit of the iterates' gradients is zero. Under finite serious steps we show
that the final serious step results in a point with a gradient of zero.

4.1 Stopping

Our first theorem examines the ‘success’ stopping criterion occurring in Step 1.

Theorem 4.1 (Stopping Test) Assume Assumption 1 holds. Suppose that DFPP is applied to f and creates iterates x^k. If the algorithm terminates in
Step 1 at iteration k̄, then

    ‖∇f(x^k̄)‖ ≤ Cε^∆_tol + ε^∇_tol.

Proof Stopping in Step 1 implies that ‖∇q^k̄(x^k̄)‖ < ε^∇_tol and ∆^k̄ < ε^∆_tol. Since x^k̄ ∈ Y^k̄ ⊆ B_{∆^k̄}(y^0), Assumption 1 implies

    ‖∇f(x^k̄)‖ ≤ ‖∇f(x^k̄) − ∇q^k̄(x^k̄)‖ + ‖∇q^k̄(x^k̄)‖ ≤ C∆^k̄ + ε^∇_tol ≤ Cε^∆_tol + ε^∇_tol,

which finishes the proof. ⊓⊔
There are now two outcomes left to consider: first, that the algorithm performs an infinite number of serious steps; and second, that the algorithm performs a finite number of serious steps and an infinite number of null steps. We consider each of these scenarios, using the framework of ε^∆_tol = ε^∇_tol = 0.

4.2 Infinite Serious Steps

This subsection examines convergence under the situation of an infinite num-


ber of serious steps. To prove convergence (in Theorem 4.2) we will assume
that the function f has compact lower-level sets. That is,
{y : f (y) ≤ f (x0 )} is compact.
Furthermore, convergence requires that the sequence rk is bounded above.
It is an open question to examine under what conditions this assumption can
be removed.
Finally, in Theorem 4.2 we also assume that ∆k → 0. Note that at each
serious step the user has the option of decreasing ∆k if desired. In Lemma 4.2
we provide a rule that ensures ∆k → 0 if an infinite number of serious steps
occurs. Later, in Corollary 4.2, we find that given an infinite number of null
steps, one must have ∆k → 0.
Lemma 4.2 (Enforcing Interpolation Radii Decrease) Suppose the lower-level set {y : f(y) ≤ f(x^0)} is compact. Suppose DFPP is applied to create iterates x^k and that an infinite number of serious steps occurs. Fix tol > 0 and 0 < Γ_S < 1. Suppose the following rule is invoked:

    if ‖x^{k+1} − x^k‖ / ∆^k < tol, then enforce ∆^{k+1} ≤ Γ_S ∆^k.    (∆rule)

Then ∆^k → 0.
Proof Let k_i be the infinite subsequence of serious steps. At each serious step we have

    f(x^{k_i+1}) ≤ f(x^{k_i}) − mδ^{k_i}.

Therefore f(x^{k_i}) is non-increasing. Since f is bounded below (as f ∈ C^1 and {y : f(y) ≤ f(x^0)} is compact), f(x^{k_i}) must converge. Since

    0 ≤ mδ^{k_i} ≤ f(x^{k_i}) − f(x^{k_i+1}),

this implies δ^{k_i} converges to 0. Note that q^{k_i}(x̃^{k_i}) + (r^{k_i}/2)‖x̃^{k_i} − x^{k_i}‖^2 ≤ q^{k_i}(x^{k_i}) implies that

    δ^{k_i} = q^{k_i}(x^{k_i}) − q^{k_i}(x̃^{k_i})
            ≥ q^{k_i}(x^{k_i}) − (q^{k_i}(x^{k_i}) − (r^{k_i}/2)‖x̃^{k_i} − x^{k_i}‖^2)
            ≥ (r^{k_i}/2)‖x̃^{k_i} − x^{k_i}‖^2.

Since r^{k_i} ≥ r_min we have that

    ‖x̃^{k_i} − x^{k_i}‖ → 0.    (5)

Examining ‖x^{k+1} − x^k‖ after a serious step, we note that

    ‖x^{k+1} − x^k‖ = ‖x^k + α(x̃^k − x^k) − x^k‖ = α‖x̃^k − x^k‖ ≤ Λ‖x̃^k − x^k‖ → 0.

Thus ‖x^{k+1} − x^k‖ → 0.
Finally, for eventual contradiction, suppose that ∆^k does not converge to 0; i.e., there exists ∆̄ > 0 such that ∆^k > ∆̄ for all k (note that ∆^k is a non-increasing sequence). Applying ‖x^{k+1} − x^k‖ → 0, under an infinite number of serious steps, eventually we have

    ‖x^{k+1} − x^k‖ / ∆^k < ‖x^{k+1} − x^k‖ / ∆̄ < tol.

Thus the rule would be invoked an infinite number of times, so ∆^{k+1} ≤ Γ_S ∆^k an infinite number of times, implying ∆^k → 0. ⊓⊔
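In an implementation, the rule (∆rule) is a single comparison applied after each serious step. A minimal sketch (the helper name and parameters are ours):

import numpy as np

def update_radius(x_new, x_old, delta, step_tol, gamma_s):
    # Enforce (Delta_rule): if the accepted serious step is short relative to
    # the current sampling radius, force the decrease Delta^{k+1} <= Gamma_S * Delta^k.
    if np.linalg.norm(x_new - x_old) / delta < step_tol:
        return gamma_s * delta
    return delta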
Theorem 4.2 (Infinite Serious Steps) Assume Assumption 1 holds. Suppose the lower-level set {y : f(y) ≤ f(x^0)} is compact. Suppose DFPP is applied to create iterates x^k and that an infinite number of serious steps occurs. If r^k is bounded above and ∆^k → 0, then

    lim_{k→∞} ∇f(x^k) = 0.

Proof By the same arguments as in Lemma 4.2, we have ‖x̃^{k_i} − x^{k_i}‖ → 0. Using x̃^{k_i} ∈ P_{r^{k_i}} q^{k_i}(x^{k_i}) gives

    −∇q^{k_i}(x̃^{k_i}) = r^{k_i}(x̃^{k_i} − x^{k_i}),

and since r^{k_i} is bounded above, we deduce that ‖∇q^{k_i}(x̃^{k_i})‖ converges to 0. Examining ∇q^{k_i}(x^{k_i}), we note that

    ‖∇q^{k_i}(x^{k_i}) − ∇q^{k_i}(x̃^{k_i})‖ = ‖H^{k_i}(x^{k_i} − x̃^{k_i})‖ ≤ ‖H^{k_i}‖ ‖x^{k_i} − x̃^{k_i}‖ ≤ M‖x^{k_i} − x̃^{k_i}‖.

This implies that ∇q^{k_i}(x^{k_i}) converges to 0. Finally, as x^{k_i} ∈ Y^{k_i}, we have

    ‖∇f(x^{k_i}) − ∇q^{k_i}(x^{k_i})‖ ≤ C∆^{k_i},

which yields that lim_{k_i→∞} ∇f(x^{k_i}) = 0. ⊓⊔

Remark 4.4 In Theorem 4.2, the proof of convergence requires that the sequence r^k is bounded above. By Assumption 1 we know that Step 2 will never increase r^k beyond M + 1. However, r^k can also be increased during Step 4. Thus an infinite sequence of serious steps intermixed with an infinite sequence of null steps could result in r^k being unbounded. Clearly, the assumption that r^k is bounded could be weakened to allow for r^k → ∞ as long as r^{k_i}(x̃^{k_i} − x^{k_i}) → 0. However, it is an open question to examine under what conditions this assumption can be removed completely.
Clearly Theorem 4.2 implies that the limit of any convergent subsequence
with an infinite number of serious steps will be a stationary point. Under the
additional assumption that the objective function is strictly convex, an infinite sequence of serious steps will converge to the unique minimizer.
Corollary 4.1 (Infinite Serious Steps and Strict Convexity) Assume Assumption 1 holds. Suppose that f is strictly convex and the lower-level set {y : f(y) ≤ f(x^0)} is compact. Suppose DFPP is applied to create iterates x^k and that an infinite number of serious steps occurs. If r^k is bounded above and ∆^k → 0, then x^k will converge to the unique minimizer

    lim_{k→∞} x^k = x̄ = argmin_x {f(x)}.

Proof The requirements of Theorem 4.2 are fulfilled. Thus

    lim_{k→∞} ∇f(x^k) = 0.

Since the lower-level set is compact, there exists a convergent subsequence. Let x^{k_i} → x̂ be any convergent subsequence. Since f ∈ C^1, ∇f(x̂) = 0 and therefore x̂ is the unique minimizer x̄. ⊓⊔

4.3 Finite Serious Steps

In order to prove convergence under a finite number of serious steps and an


infinite number of null steps, we require some basic results on the proximal
point mapping.
Lemma 4.3 (Proximal Point Bounds) Let f ∈ C^1 and η > 0 be such that f + (η/2)‖·‖^2 is convex. Then, for any r > η, the proximal point mapping P_r f is well-defined, single-valued, Lipschitz continuous, and satisfies

    P_r f(x̄) ⊆ { y : ‖y − x̄‖ ≤ 2‖∇f(x̄)‖ / (r − η) }.    (6)

Proof That the proximal point mapping P_r f is well-defined, single-valued, and Lipschitz continuous can be found in [27, Prop. 13.37] or [22, Thm 2.4]. Equation (6) follows from a classic argument. First note that

    P_r f(x̄) ⊆ { y : f(y) + (r/2)‖y − x̄‖^2 ≤ f(x̄) }
             = { y : f(y) + (η/2)‖y − x̄‖^2 + ((r − η)/2)‖y − x̄‖^2 ≤ f(x̄) }.




Next note that the convexity of f + (η/2)‖·‖^2 implies that f + (η/2)‖· − x̄‖^2 must be convex. Applying ∇(f + (η/2)‖· − x̄‖^2)(x̄) = ∇f(x̄), and the fact that f + (η/2)‖· − x̄‖^2 is convex, we see that

    P_r f(x̄) ⊆ { y : f(x̄) + ⟨∇f(x̄), y − x̄⟩ + ((r − η)/2)‖y − x̄‖^2 ≤ f(x̄) }.

Rearrangement and the Cauchy–Schwarz Inequality complete the proof. ⊓⊔
Corollary 4.2 (Limits of Interpolation Set Radii) Given an infinite sequence of null steps, one must have ∆^k → 0.

Proof Suppose, for eventual contradiction, that an infinite sequence of null steps occurs and ∆^k does not converge to 0. As either ∆^{k+1} = ∆^k or ∆^{k+1} ≤ Γ∆^k, this would imply that ∆^{k+1} = ∆^k, x^{k+1} = x^k, r^{k+1} = 2r^k, and q^{k+1} = q^k for all k sufficiently large. Since q^k is a quadratic function, there exists η > 0 such that q^k + (η/2)‖·‖^2 is convex for all k sufficiently large. (For k sufficiently large, the function q^k does not depend on k, so η is also independent of k.) Lemma 4.3 therefore states that

    x̃^k ∈ P_{r^k} q^k(x^k) ⊆ { y : ‖y − x^k‖ ≤ 2‖∇q^k(x^k)‖ / (r^k − η) }.

As r^k → ∞, eventually we must have x̃^k ∈ B_{∆^k}(x^k). At this point, ∆^k decreases, and we encounter a contradiction. ⊓⊔
The next convergence theorem makes no additional assumptions on the
function. In particular, the assumptions that f has compact lower-level sets,
that ∆k → 0, and that the sequence rk is bounded above are no longer needed.
The removal of the assumption regarding compact lower-level sets is fairly
obvious, as the iteration points xk are fixed after a finite number of iterations.
The removal of the assumption that ∆k → 0 follows from Corollary 4.2. The
removal of the assumption that the sequence rk is bounded above requires use
of the Mean Value Theorem and is only possible because the iteration points
xk are fixed after a finite number of iterations.
We now prove convergence under a finite number of serious steps and an
infinite sequence of null steps.
Theorem 4.3 (Finite Serious and Infinite Null Steps) Assume Assumption 1 holds. Suppose DFPP is applied to create iterates x^k and that a finite number of serious steps and an infinite number of null steps occur. Let x̄ be the prox-center resulting from the last serious step. Then

    ∇f(x̄) = 0.

Proof Since there is an infinite number of null steps, by Corollary 4.2, we have that ∆^k → 0. We now have two cases: first, r^k is bounded above; second, r^k is unbounded.
Case I: Suppose r^k is bounded above. This implies that x̃^k ∈ B_{∆^k}(x̄) for all k sufficiently large. As ∆^k → 0, we conclude x̃^k → x̄.

As in the proof of Theorem 4.2, −∇q^k(x̃^k) = r^k(x̃^k − x̄). Since r^k is bounded above, x̃^k → x̄ implies that ‖∇f(x̄) − ∇f(x̃^k)‖ → 0 and ‖∇q^k(x̃^k)‖ → 0. For any x̃^k ∈ B_{∆^k}(x̄) we may apply Assumption 1 to see

    ‖∇f(x̄)‖ ≤ ‖∇f(x̄) − ∇f(x̃^k)‖ + ‖∇f(x̃^k) − ∇q^k(x̃^k)‖ + ‖∇q^k(x̃^k)‖
             ≤ ‖∇f(x̄) − ∇f(x̃^k)‖ + C∆^k + ‖∇q^k(x̃^k)‖.

As the right-hand side converges to 0, we must have ‖∇f(x̄)‖ = 0.


Case II: [r^k unbounded] Suppose r^k is unbounded. Applying the arguments in Corollary 4.2 we know that there exists a subsequence k_i such that x̃^{k_i} ∈ B_{∆^{k_i}}(x̄) for each k_i. For this subsequence we have x̃^{k_i} → x̄ (as ∆^{k_i} → 0). Applying the Mean Value Theorem to elements from this subsequence we have that

    f(x̃^{k_i}) − f(x̄) = ⟨∇f(y^{k_i}), x̃^{k_i} − x̄⟩

for some y^{k_i} = αx̃^{k_i} + (1 − α)x̄, α ∈ ]0, 1[, and

    q^{k_i}(x̃^{k_i}) − q^{k_i}(x̄) = ⟨∇q^{k_i}(z^{k_i}), x̃^{k_i} − x̄⟩

for some z^{k_i} = βx̃^{k_i} + (1 − β)x̄, β ∈ ]0, 1[. Noting that −∇q^k(x̃^k) = r^k(x̃^k − x̄), and using the definition of δ^k, we rewrite

    f(x̃^{k_i}) − f(x̄) = ⟨∇f(y^{k_i}), −(1/r^{k_i})∇q^{k_i}(x̃^{k_i})⟩

for some y^{k_i} = αx̃^{k_i} + (1 − α)x̄, α ∈ ]0, 1[, and

    −δ^{k_i} = ⟨∇q^{k_i}(z^{k_i}), −(1/r^{k_i})∇q^{k_i}(x̃^{k_i})⟩

for some z^{k_i} = βx̃^{k_i} + (1 − β)x̄, β ∈ ]0, 1[. Since each step is a null step, we know that for each k_i

    f(x̃^{k_i}) − f(x̄) = f(x̃^{k_i}) − f(x^{k_i}) > −mδ^{k_i}.

Using the y^{k_i} and z^{k_i} above, this implies that for each k_i

    ⟨∇f(y^{k_i}), ∇q^{k_i}(x̃^{k_i})⟩ < m⟨∇q^{k_i}(z^{k_i}), ∇q^{k_i}(x̃^{k_i})⟩.    (7)

We seek to apply a limit to (7). To do so, we first note

    ∇f(y^{k_i}) → ∇f(x̄),

as f ∈ C^1 and y^{k_i} → x̄. Next note

    ∇q^{k_i}(x̃^{k_i}) → ∇f(x̄),

as ‖H^{k_i}‖ is bounded above and

    ‖∇q^{k_i}(x̃^{k_i}) − ∇f(x̄)‖ ≤ ‖∇q^{k_i}(x̃^{k_i}) − ∇q^{k_i}(x̄)‖ + ‖∇q^{k_i}(x̄) − ∇f(x̄)‖
                                  ≤ ‖H^{k_i}‖ ‖x̃^{k_i} − x̄‖ + C∆^{k_i} → 0.

Finally note

    ∇q^{k_i}(z^{k_i}) → ∇f(x̄)

by the same logic. We may now pass to the limit in (7) to find

    ‖∇f(x̄)‖^2 ≤ m‖∇f(x̄)‖^2.

Since 0 < m < 1, we must have that ‖∇f(x̄)‖ = 0. ⊓⊔
We now summarize our convergence results.

Corollary 4.3 (Convergence of the DFPP algorithm) Assume Assumption 1 holds and the lower-level set {y : f(y) ≤ f(x^0)} is compact. Suppose DFPP with (∆rule) is applied to create iterates x^k and that r^k is bounded above. Then either the algorithm terminates successfully with

    ‖∇f(x^k̄)‖ ≤ Cε^∆_tol + ε^∇_tol,

or the sequence ∇f(x^k) converges to 0.
If, in addition, f is strictly convex, then x^k converges to the unique minimizer.

5 Conclusion

In this paper, we have developed the theoretical framework required to gener-


ate a derivative-free proximal point method. We have shown that the method
converges in several senses. First, if the algorithm terminates due to a ‘success’, then an optimality certificate is generated. Second, if an infinite number of serious steps occurs, then the sequence of gradients converges to zero. This implies that the limit of any convergent subsequence is a stationary point and, under strict convexity, the algorithm converges to the global minimizer. Third,
we have shown that, if a finite number of serious steps occurs, then the final
serious step generates a stationary point.
There are several clear directions of study still required in this research.
Naturally, an implementation of the algorithm is desired. Such an implemen-
tation needs to incorporate several features to perform well numerically. For
example, model construction (including when and how to reuse previous sam-
ple points in model generation) must be considered, and methods to deal with
badly scaled problems should be explored. Proper management of the prox-
parameter also warrants further investigation.
In model construction, one promising direction is the application of mini-
mum Frobenius norm schemes [31]. These schemes allow for the generation of
quadratic models with reasonable computation time. Recent work has also con-
sidered over-interpolated models using regression techniques [1]. These ideas
may allow for reusing sample points during long sequences of null steps.
Some promising ideas to handle badly scaled functions can be adapted from
trust region methods [8, Section 4.5]. In particular, when a quadratic model
is used for interpolation or approximation, we could use the eigenvalues of the


current quadratic model to do a diagonal rescaling. Doing more linear algebra
operations make sense since we assume in our setting that function evaluations
dominate the computation time.
A preliminary discussion on how to manage the prox-parameter is con-
tained in [32]. It should be noted that, in derivative-based bundle methods, the prox-parameter is typically only increased due to a failure in the prox-feasibility test (Step 2 in DFPP). In this case, for some classes of functions, it is possible to guarantee a theoretical bound on the prox-parameter, thus showing that r^k will never be unbounded [15, Thm 3]. However, in the case
of the DFPP algorithm, the prox-parameter is also used to ensure accuracy
of the approximation function at trial points. This can lead to increasing the
prox-parameter even though the prox-feasibility test is passed. More work is
needed to determine whether the algorithm can be modified to avoid having
to assume the prox-parameter is bounded.
Another promising direction of research is found in the proximal-bundle
methodology. In proximal-bundle methods, the objective function is replaced
by a piecewise-linear model function based on cutting planes of the objective
function. In the context of derivative-free optimization, it may be possible
to use a collection of simplex gradients to generate a similar cutting planes
model. If successful, this would maintain a low number of function calls per
iteration (n + 1), but still provide an expectation of rapid convergence. Con-
vergence analysis for such an approach might be possible through adapting
this manuscript to incorporate bundle techniques.

Acknowledgements

Work in this paper was supported by NSERC Discovery grants (Hare and Lucet), a UBC IRF grant (Hare), and a Canada Foundation for Innovation (CFI) Leaders Opportunity Fund grant (Lucet).

References

1. Conn, A.R., Scheinberg, K., Vicente, L.N.: Introduction to derivative-free optimiza-


tion. Vol. 8 of MPS/SIAM Series on Optimization. Society for Industrial and Applied
Mathematics (SIAM), Philadelphia (2009)
2. Booker, A.J., Dennis, J.E.Jr., Frank, P.D., Serafini, D.B., Torczon, V.: Optimization
using surrogate objectives on a helicopter test example. In: Computational methods for
optimal design and control (Arlington, VA, 1997) of Progr. Systems Control Theory,
vol. 24, pp. 49–58. Birkhäuser Boston, Boston (1998)
3. Duvigneau, R., Visonneau, M.: Hydrodynamic design using a derivative-free method.
Struct. Multidiscip. Optim. 28, 195–205 (2004)
4. Marsden, A.L., Wang, M., Dennis, J.E.Jr., Moin, P.: Trailing-edge noise reduction using
derivative-free optimization and large-eddy simulation. J. Fluid Mech. 572, 13–36 (2007)
5. Marsden, A.L., Feinstein, J.A., Taylor, C.A.: A computational framework for derivative-
free optimization of cardiovascular geometries. Comput. Methods Appl. Mech. Engr.
197(21-24), 1890–1905 (2008)

6. Hare, W.L.: Using derivative free optimization for constrained parameter selection
in a home and community care forecasting model. In: International Perspectives on
Operations Research and Health Care, Proceedings of the 34th Meeting of the EURO
Working Group on Operational Research Applied to Health Sciences, pp. 61–73. (2010)
7. Audet, C., Dennis, J.E.Jr., Le Digabel, S.: Globalization strategies for mesh adaptive
direct search. Comput. Optim. Appl. 46(2), 193–215 (2010)
8. Nocedal, J., Wright, S.J.: Numerical optimization. Springer Series in Operations Re-
search and Financial Engineering. Springer, New York (2006)
9. Coope, I.D., Price, C.J.: A direct search frame-based conjugate gradients method. J.
Comput. Math. 22(4), 489–500 (2004)
10. Cheng, W., Xiao, Y., Hu, Q.J.: A family of derivative-free conjugate gradient methods
for large-scale nonlinear systems of equations. J. Comput. Appl. Math. 224(1), 11–19
(2009)
11. Martinet, B.: Détermination approchée d’un point fixe d’une application pseudo-
contractante. Cas de l’application prox. C. R. Acad. Sci. Paris Sér. A-B, 274, A163–A165
(1972)
12. Lemaréchal, C., Strodiot, J.J., Bihain, A.: On a bundle algorithm for nonsmooth opti-
mization. In: Nonlinear programming, vol. 4, pp. 245–282. Academic Press, New York
(1981)
13. Mifflin, R.: A modification and extension of Lemarechal’s algorithm for nonsmooth
minimization. Math. Programming Stud., 17, 77–90 (1982)
14. Lukšan, L., Vlček, J.: A bundle-Newton method for nonsmooth unconstrained
minimization. Math. Program. 83(3, Ser. A), 373–391 (1998)
15. Hare, W., Sagastizábal, C.: Computing proximal points of nonconvex functions. Math.
Program. 116(1-2, Ser. B), 221–258 (2009)
16. Hare, W., Sagastizábal, C.: Redistributed proximal bundle method for nonconvex opti-
mization. SIAM J. Opt. 20(5), 2442–2473 (2010)
17. Mifflin, R., Sagastizábal, C.: Proximal points are on the fast track. J. Convex Anal.
9(2), 563–579 (2002)
18. Hare, W.L., Lewis, A.S.: Identifying active constraints via partial smoothness and prox-
regularity. J. Convex Anal. 11(2), 251–266 (2004)
19. Hare, W.L.: A proximal method for identifying active manifolds. Comput. Optim. Appl.
43(2), 295–306 (2009)
20. Mifflin, R., Sagastizábal, C.: VU-smoothness and proximal point results for some non-
convex functions. Optim. Methods Softw. 19(5), 463–478 (2004)
21. Mifflin, R., Sagastizábal, C.: A VU-algorithm for convex minimization. Math. Program.
104(2-3, Ser. B), 583–608 (2005)
22. Hare, W.L., Poliquin, R.A.: Prox-regularity and stability of the proximal mapping. J.
Convex Anal. 14(3), 589–606 (2007)
23. Kiwiel, K.C.: A proximal bundle method with approximate subgradient linearizations.
SIAM J. Optim. 16(4), 1007–1023 (2006)
24. Shen, J., Xia, Z.Q., Pang, L.P.: A proximal bundle method with inexact data for convex
nondifferentiable minimization. Nonlinear Anal. 66(9), 2016–2027 (2007)
25. Kiwiel, K.C., Lemaréchal, C.: An inexact bundle variant suited to column generation.
Math. Program. 118(1, Ser. A), 177–206 (2009)
26. Kiwiel, K.C.: An inexact bundle approach to cutting-stock problems. INFORMS J.
Comput. 22(1), 131–143 (2010)
27. Rockafellar, R.T., Wets, R. J.-B.: Variational Analysis. Vol. 317 of Grundlehren der
Mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences].
Springer-Verlag, Berlin (1998)
28. Hiriart-Urruty, J.-B., Lemaréchal, C.: Convex Analysis and Minimization Algorithms.
Vol. 305–306 in Grundlehren der Mathematischen Wissenschaften [Fundamental Prin-
ciples of Mathematical Sciences]. Springer-Verlag, Berlin (1993)
29. Mäkelä, M.M.: Survey of bundle methods for nonsmooth optimization. Optim. Methods
Softw. 17(1), 1–29 (2002)
30. Conn, A.R., Scheinberg, K., Vicente, L.N.: Geometry of interpolation sets in derivative
free optimization. Math. Program. 111(1-2, Ser. B), 141–172 (2008)

31. Conn, A.R., Scheinberg, K., Vicente, L.N.: Geometry of sample sets in derivative-free
optimization: polynomial regression and underdetermined interpolation. IMA J. Numer.
Anal. 28(4), 721–748 (2008)
32. Apkarian, P., Noll, D., Prot, O.: A proximity control algorithm to minimize nonsmooth
and nonconvex semi-infinite maximum eigenvalue functions. J. Convex Anal. 16(3-4),
641–666 (2009)
