
Set-Valued and Variational Analysis (2020) 28:643–659

https://doi.org/10.1007/s11228-020-00556-y

A Discussion on Variational Analysis in Derivative-Free Optimization

Warren Hare1

Received: 23 March 2020 / Accepted: 13 September 2020 / Published online: 25 September 2020
© Springer Nature B.V. 2020

Abstract
Variational Analysis studies mathematical objects under small variations. With regards to
optimization, these objects are typified by representations of first-order or second-order
information (gradients, subgradients, Hessians, etc). On the other hand, Derivative-Free
Optimization studies algorithms for continuous optimization that do not use first-order
information. As such, researchers might conclude that Variational Analysis plays a limited
role in Derivative-Free Optimization research. In this paper we argue the contrary by show-
ing that many successful DFO algorithms rely heavily on tools and results from Variational
Analysis.

Keywords Derivative-Free Optimization · Variational Analysis · Direct-search method · model-based methods · order-N accuracy

Mathematics Subject Classification (2010) Primary 49-01 · 90C52; Secondary 49J52 · 65K05

1 Introduction

Let us begin by defining Variational Analysis as the study of how mathematical objects
(functions, sets, functionals, etc) behave under small variations. Variational Analysis is the
foundation of single-variable Calculus, where small variations are used to define derivatives
and integrals. Modern research in Variational Analysis has developed concepts of directional
derivatives, subdifferentials, normal and tangent cones, co-derivatives, etc. All of these have
numerous real-world applications, while simultaneously being fundamentally interesting in
a purely mathematical sense.
One major use of Variational Analysis is in the field of Mathematical Optimization. At the most basic level, Fermat's theorem links the derivative (gradient) of a function to its minimizers. More advanced concepts like directional derivatives and subdifferentials allow the idea of critical points to be extended to nonsmooth functions.

Hare's research is partially supported by Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant #2018-03865.

Warren Hare
warren.hare@ubc.ca

1 Department of Mathematics, University of British Columbia, Kelowna, British Columbia, Canada
Derivative-Free Optimization (DFO) is the mathematical study of algorithms for continuous optimization that do not use first-order information [8]. While DFO arguably dates back to the 1960s (see, for example, [42, 51, 70]), it was not until 1998 that the term "Derivative-Free Optimization" was first formally used [26]. Since then, DFO has grown enormously. Indeed, any optimization problem whose objective function is computed via simulation software can be approached by DFO methods; as simulation software becomes more common, the value of and interest in DFO increase accordingly.
By definition, DFO studies algorithms that do not use derivatives, gradients, directional derivatives, subgradients, normal cones, tangent cones, etc. As such, it might seem that Variational Analysis would have limited value in DFO research. However, a study of DFO shows that this is a false conclusion. In fact, many of the most successful DFO algorithms rely heavily on tools and results from Variational Analysis. In this paper, we will highlight some of this research and argue that Variational Analysis is a critical component of studying DFO.
DFO algorithms can be split into two broad categories: direct-search methods and model-
based methods [8]. In Section 2, we discuss direct-search methods in DFO, while in
Sections 3 and 4, we discuss model-based methods in DFO (first for smooth problems, then
for nonsmooth problems). We provide conclusions and thoughts on productive directions in
Section 5.
In the remainder of the paper, we shall assume that we are attempting to solve

min{f(x) : x ∈ R^n},

where f : R^n → R ∪ {∞} is proper and continuous on its domain. (Note, in DFO, f taking the value ∞ can also be viewed as covering any situation where the blackbox fails.) Other assumptions will be stated as required.

2 DFO Direct-Search Methods

2.1 A Simple Direct-Search Method

Direct-search methods start from an incumbent solution x^k and then consider nearby points by evaluating f in directions d_i^k with step size δ^k: f(x^k + δ^k d_i^k). If improvement is found, then the incumbent solution is updated. If no improvement is found, then the step size parameter δ^k is updated and the next iteration will consider points closer to the incumbent solution. We formalize a simple direct-search method in Algorithm 1 below. However, first we must define the notion of a positive spanning set.

Definition 2.1 (Positive spanning set) Let D ⊆ R^n with |D| = m. The positive span of D, denoted pspan(D), is the set of all nonnegative linear combinations of vectors in D:

pspan(D) = { Σ_{i=1}^m λ_i d_i : λ_i ≥ 0, d_i ∈ D } ⊆ R^n.

The set D is a positive spanning set for R^n if pspan(D) = R^n.
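To make Definition 2.1 concrete, one can test positive spanning numerically through the cosine measure cm(D) = min over unit vectors v of max over d ∈ D of v·d/‖d‖, which is positive exactly when D is a positive spanning set. The sketch below estimates cm(D) by random sampling; the sample count and example sets are illustrative choices, not part of the source.

```python
import numpy as np

def cosine_measure_estimate(D, n_samples=20000, seed=0):
    """Estimate cm(D) = min_{||v||=1} max_{d in D} v.d / ||d|| by sampling
    random unit vectors v; cm(D) > 0 iff D is a positive spanning set."""
    rng = np.random.default_rng(seed)
    D = np.array([np.asarray(d, float) / np.linalg.norm(d) for d in D])
    V = rng.normal(size=(n_samples, D.shape[1]))
    V /= np.linalg.norm(V, axis=1, keepdims=True)   # random unit vectors
    return (V @ D.T).max(axis=1).min()              # min over v of max over d

# The coordinate directions and their negatives positively span R^2.
D_plus = [(1, 0), (-1, 0), (0, 1), (0, -1)]
print(cosine_measure_estimate(D_plus))   # about 1/sqrt(2)
# Dropping (0, -1) leaves a set whose positive span misses the lower
# half-plane, and the estimated cosine measure collapses toward 0.
```
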



We now present a simplified version of the generalized pattern search method from [66] (see also [8, Ch 7]). A standard generalized pattern search method also includes a search step; we discuss this in Section 2.2. Note, the algorithm defines a local mesh optimizer in Step 1.

Algorithm 1: Simple generalized pattern search (SGPS).

Given f : R^n → R and starting point x^0 ∈ R^n
0. Initialization
   δ^0 ∈ (0, ∞)       initial step size parameter
   D ⊆ Q^n            a finite positive spanning set
   ε_stop ∈ [0, ∞)    stopping tolerance
   k ← 0              iteration counter
1. Poll
   select a positive spanning set D^k ⊆ D
   if f(t) < f(x^k) for some t ∈ P^k = {x^k + δ^k d : d ∈ D^k}
      set x^{k+1} ← t and δ^{k+1} ← 2δ^k
   otherwise x^k is a local mesh optimizer
      set x^{k+1} ← x^k and δ^{k+1} ← (1/2)δ^k
2. Termination
   if δ^{k+1} ≥ ε_stop
      increment k ← k + 1 and go to 1
   otherwise stop

In reviewing the SGPS algorithm, it is clear that no variational information (e.g., gra-
dients, Hessians, etc) are used in the algorithm. Nonetheless, we shall see that Variational
Analysis can be used to analyze the algorithm.
Indeed, the first step in proving convergence of the SGPS algorithm is to argue that lim inf_{k→∞} δ^k = 0. To accomplish this, we first apply a bounded level set assumption (i.e., we assume that the level sets of f are bounded). Using this, we note that if δ^k is bounded below, then the structure of D means that there are only a finite number of points the SGPS algorithm can examine. As this creates a finite set of possible function values, at least one of these points must be a local mesh optimizer. Once the incumbent solution is a local mesh optimizer, δ^k must decrease below the bound. Details can be found in [5] or [8, Lem 7.3 & Thm 7.4].
The next step in proving convergence of the SGPS algorithm is to argue the existence
of refined points. The definition of refined points uses the terminology of local mesh
optimizers from Step 1 of SGPS.

Definition 2.2 (Refining subsequences and refined points) Let {x^k}_{k∈K} be a convergent subsequence of local mesh optimizers from the SGPS algorithm. We say that the subsequence is a refining subsequence if lim_{k∈K} δ^k = 0. The limit of a refining subsequence is called its corresponding refined point.

Adding a continuity assumption (f ∈ C^0) to the bounded level set assumption, combined with the aforementioned lim inf_{k→∞} δ^k = 0, easily shows that refined points must exist and that every accumulation point of the algorithm has the same function value [8, Prop 7.5]. Therefore, we may focus our convergence analysis on refined points. The resulting convergence theorem is given in terms of the generalized directional derivative.

Definition 2.3 (Generalized directional derivative) Let f : R^n → R be locally Lipschitz near x ∈ R^n. The generalized directional derivative of f at x in the direction d ∈ R^n is

f°(x; d) = lim sup_{y→x, t↓0} [f(y + td) − f(y)] / t.
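The lim sup in Definition 2.3 can be approximated numerically by maximizing the difference quotient over small perturbations; the sketch below is an illustration only (a grid can never realize a true lim sup), with the sampling radius and grid size chosen arbitrarily.

```python
import numpy as np

def clarke_dd_estimate(f, x, d, h=1e-3, grid=50):
    """Estimate f°(x; d) = limsup_{y->x, t↓0} (f(y + t d) - f(y)) / t by
    maximizing the difference quotient over a small grid of y near x
    and small t > 0 (an approximation, not a true limsup)."""
    ys = x + np.linspace(-h, h, grid)
    ts = np.linspace(h / grid, h, grid)
    return max((f(y + t * d) - f(y)) / t for y in ys for t in ts)

# For f(x) = |x| at x = 0 the generalized directional derivative is 1
# in both directions, reflecting the kink at the origin.
print(clarke_dd_estimate(abs, 0.0, 1.0))   # close to 1
print(clarke_dd_estimate(abs, 0.0, -1.0))  # close to 1
```
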

Recall, the generalized directional derivative provides the following first-order optimality condition.

Theorem 2.4 (First-order optimality via f°) If f is locally Lipschitz continuous near x* and x* is a local minimizer of f, then f°(x*; d) ≥ 0 for every d ∈ R^n.

Returning to the SGPS algorithm, we see that if x̂ is a refined point for the SGPS algorithm, then x̂ satisfies a version of the first-order optimality condition above.

Theorem 2.5 (Convergence of SGPS) Let {x^k} be the sequence of iterates produced by applying the SGPS algorithm to a function f : R^n → R with bounded level sets. Let {x^k}_{k∈K} be a refining subsequence and x̂ the corresponding refined point.
i. If f is locally Lipschitz, then for any direction d ∈ D such that f was evaluated infinitely many times in the subsequence {x^k}_{k∈K} we have f°(x̂; d) ≥ 0.
ii. If f ∈ C^1, then ∇f(x̂) = 0.

Proof Apply [8, Thm 7.7] to this version of SGPS. Alternatively, adapt the proofs from [5] and [66].

What is remarkable in the above theorem is that, despite the SGPS algorithm being a
DFO method, we still apply Variational Analysis arguments and techniques to generate a
proof of convergence to a first-order critical point.

2.2 Advances on Direct-Search Methods

The SGPS algorithm presented above is a simplified version of Torczon’s pattern search
algorithm [66]. Even in this simplified form, the algorithm captures much of the flexibility
of the method. The finite positive spanning set D can be arbitrarily large, thereby allowing
for a great deal of flexibility in the search directions.
One obvious place where SGPS can be generalized is the step size increase multiplier rule (e.g., δ^{k+1} ← 2δ^k). While there is very little special about the update factor of 2, it is surprising to note that it cannot be replaced by a completely arbitrary constant or sequence of constants. Indeed, it has been shown that if the step size increase multiplier is replaced by a constant τ, then τ must be rational [2]. In that case, however, the step size updates can be replaced by a sequence of bounded positive powers of τ [5]. (Similar statements hold for the step size decrease multiplier of 1/2.) Another approach uses a fixed sequence of rational step sizes, e.g., 1, 0.5, 0.2, 1(10^{-1}), 0.5(10^{-1}), 0.2(10^{-1}), ... [13].
Larger advances in SGPS are created by the introduction of a mesh and a search step.

Definition 2.6 (Mesh) Let D ⊆ Q^n be a positive spanning set for R^n with p elements. Let D̄ ∈ R^{n×p} be the matrix whose columns are the elements of D. The mesh generated by D, centered at the incumbent solution x^k ∈ R^n, of coarseness δ^k > 0 is defined by

M^k = {x^k + δ^k D̄ y : y ∈ N^p} ⊂ R^n.
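To make Definition 2.6 concrete, the sketch below enumerates a finite window of a mesh; the set D, the center, the coarseness, and the bound on the entries of y are all illustrative choices (the true mesh is infinite).

```python
import itertools
import numpy as np

def mesh_points(x, delta, D, y_max=2):
    """Enumerate mesh points x + delta * Dbar @ y for y in N^p with entries
    bounded by y_max (a finite window of the infinite mesh)."""
    Dbar = np.asarray(D, dtype=float).T      # columns are the directions in D
    pts = [x + delta * (Dbar @ np.array(y))
           for y in itertools.product(range(y_max + 1), repeat=Dbar.shape[1])]
    return np.unique(np.round(pts, 12), axis=0)

# A minimal positive spanning set of R^2 (p = n + 1 = 3 directions).
D = [(1, 0), (0, 1), (-1, -1)]
M = mesh_points(np.zeros(2), 0.5, D)
# Every poll point x^k + delta*d with d in D lies on the mesh (take y = e_i).
```
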

The mesh is essentially the set of all points that SGPS could examine if the step size
never decreased beyond its current level. Once the mesh is established, it is possible to add
a search step into the SGPS algorithm:
1.5 Search
   if f(t) < f(x^k) for some t in a finite subset S^k of the mesh M^k
      set x^{k+1} ← t and δ^{k+1} ← 2δ^k and go to 2. Termination
   otherwise go to 1. Poll

Note, we have labelled the search as Step 1.5, as the search step goes between the poll step and the termination step.
The search step is a major advance to the SGPS algorithm, as it allows for a wide variety of heuristics to be embedded into the algorithm [3, 7, 11, 15, 33] (among many others). Convergence of the method is maintained under the same conditions as Theorem 2.5 ([8, Thm 7.7] still applies).
Advances beyond the mesh come from the introduction of frames. Frames are more complicated than meshes, but allow for the development of direct-search methods that have an asymptotically dense set of polling directions. This allows theoretical results showing that (under appropriate conditions) refined points x̂ have f°(x̂; d) ≥ 0 for all d ∈ R^n (as opposed to Theorem 2.5, where f°(x̂; d) ≥ 0 only for a finite list of directions). Frames also allow for the treatment of inequality constraints and provide theoretical proofs of convergence to critical points based on hypertangent cones. See [6] or [8, Ch 8] for details.
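The idea of asymptotically dense polling directions can be illustrated numerically. In the sketch below, normalized Gaussian samples (an assumption of this illustration, not the specific construction used by frame-based methods) come arbitrarily close to any fixed direction as the number of samples grows.

```python
import numpy as np

def random_unit_directions(k, n, seed=0):
    """Normalized Gaussian samples: as k grows these become dense on the
    unit sphere (almost surely), mimicking the asymptotically dense
    polling directions that frame-based methods require."""
    V = np.random.default_rng(seed).normal(size=(k, n))
    return V / np.linalg.norm(V, axis=1, keepdims=True)

target = np.array([0.0, 0.0, 1.0])   # an arbitrary direction to approximate
best10 = random_unit_directions(10, 3).dot(target).max()
best1000 = random_unit_directions(1000, 3).dot(target).max()
# With more directions, some poll direction aligns better with any
# fixed target direction (best1000 >= best10 since the samples nest).
```
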

3 DFO Model-Based Methods for Smooth Functions

3.1 Basic Model-Based Methods

In Section 2 we saw that tools and techniques of Variational Analysis could be used to
analyze direct-search methods in DFO. In the next two sections, we will see similar results
for model-based methods in DFO. This section focuses on model-based methods for smooth
functions.
In DFO, model-based methods begin by using function values to build a model of the
true objective function. Assuming the model is ‘well-built’, a classical 1st or 2nd order
optimization method can be applied to the model. As the algorithm proceeds, the model is
updated and (hopefully) convergence is achieved.
Before proceeding, we define terminology that allows us to mathematically capture when
a model is ‘well-built’. We begin with function accuracy.

Definition 3.1 (Order-N function accuracy) Given f ∈ C^0, x ∈ dom(f), and Δ̄ > 0, we say that {f̃_Δ}_{Δ∈(0,Δ̄]} is a class of models of f at x parameterized by Δ that provides order-N function accuracy at x if there exists a scalar κ(x) > 0 such that, given any Δ ∈ (0, Δ̄], the model f̃_Δ satisfies

|f(x) − f̃_Δ(x)| ≤ κ(x) Δ^N.

We say that {f̃_Δ}_{Δ∈(0,Δ̄]} is a class of models of f at x parameterized by Δ that provides order-N function accuracy near x if there exists a scalar κ(x) > 0 such that, given any Δ ∈ (0, Δ̄], the model f̃_Δ satisfies

|f(y) − f̃_Δ(y)| ≤ κ(x) Δ^N for all y ∈ B_Δ(x).

Definition 3.1 captures the idea that the function values produced by the model are at least as accurate as an Nth-order Taylor expansion. However, function accuracy is seldom sufficient to prove convergence. Therefore, in a similar manner, we can define gradient accuracy and Hessian accuracy.

Definition 3.2 (Order-N gradient accuracy) Given f ∈ C^1, x ∈ dom(f), and Δ̄ > 0, we say that {f̃_Δ}_{Δ∈(0,Δ̄]} is a class of models of f at x parameterized by Δ that provides order-N gradient accuracy at x if there exists a scalar κ(x) > 0 such that, given any Δ ∈ (0, Δ̄], the model f̃_Δ satisfies

‖∇f(x) − ∇f̃_Δ(x)‖ ≤ κ(x) Δ^N.

We say that {f̃_Δ}_{Δ∈(0,Δ̄]} is a class of models of f at x parameterized by Δ that provides order-N gradient accuracy near x if there exists a scalar κ(x) > 0 such that, given any Δ ∈ (0, Δ̄], the model f̃_Δ satisfies

‖∇f(y) − ∇f̃_Δ(y)‖ ≤ κ(x) Δ^N for all y ∈ B_Δ(x).

Definition 3.3 (Order-N Hessian accuracy) Given f ∈ C^2, x ∈ dom(f), and Δ̄ > 0, we say that {f̃_Δ}_{Δ∈(0,Δ̄]} is a class of models of f at x parameterized by Δ that provides order-N Hessian accuracy at x if there exists a scalar κ(x) > 0 such that, given any Δ ∈ (0, Δ̄], the model f̃_Δ satisfies

‖∇²f(x) − ∇²f̃_Δ(x)‖ ≤ κ(x) Δ^N.

We say that {f̃_Δ}_{Δ∈(0,Δ̄]} is a class of models of f at x parameterized by Δ that provides order-N Hessian accuracy near x if there exists a scalar κ(x) > 0 such that, given any Δ ∈ (0, Δ̄], the model f̃_Δ satisfies

‖∇²f(y) − ∇²f̃_Δ(y)‖ ≤ κ(x) Δ^N for all y ∈ B_Δ(x).

Examining Definitions 3.1, 3.2, and 3.3, it is clear that the idea of order-N accuracy
could easily be applied to any Variational Analysis object that consists of a single vector
or matrix. Later we will see how order-N accuracy can be applied to set-based Variational
Analysis objects.
Definitions 3.1, 3.2, and 3.3 allow us to describe the quality of a model in a manner that is extremely useful for proving convergence. For example, if a class of models has order-3 function accuracy, order-2 gradient accuracy, and order-1 Hessian accuracy, then the class of models behaves similarly to a 2nd-order Taylor expansion. Indeed, in this case, the behavior is so similar that we can prove such a class of models is sufficient to ensure quadratic convergence of Newton's method (provided other appropriate conditions are met).
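As a simple numerical illustration of order-N accuracy, a central-difference scheme yields model gradients with order-2 gradient accuracy at x for sufficiently smooth f; halving Δ should cut the gradient error by roughly a factor of 4. The test function, point, and constant κ below are arbitrary choices for the illustration.

```python
import numpy as np

def central_diff_grad(f, x, delta):
    """Gradient of a central-difference model of f at x; for smooth f this
    satisfies ||grad f(x) - g|| <= kappa * delta^2 (order-2 accuracy at x)."""
    n = len(x)
    g = np.zeros(n)
    for i in range(n):
        e = np.zeros(n); e[i] = delta
        g[i] = (f(x + e) - f(x - e)) / (2 * delta)
    return g

f = lambda x: np.sin(x[0]) + x[1] ** 3
grad_f = lambda x: np.array([np.cos(x[0]), 3 * x[1] ** 2])
x = np.array([0.3, 0.7])
err = lambda delta: np.linalg.norm(grad_f(x) - central_diff_grad(f, x, delta))
# Halving delta should cut the error by roughly 4 (order-2 behaviour):
print(err(1e-2), err(5e-3))
```
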

Theorem 3.4 Let f ∈ C^3 and Δ̄ > 0. For any given x^k ∈ R^n, suppose that {f̃_Δ}_{Δ∈(0,Δ̄]} is a class of models of f at x^k parameterized by Δ that provides order-2 gradient accuracy and order-1 Hessian accuracy, with parameters κ_g(x^k) and κ_H(x^k) respectively. Select a sequence {Δ^k} with Δ^k ≥ 0 and suppose ∇²f̃_{Δ^k}(x^k) is nonsingular for all k. Consider the model-Newton update

x^{k+1} = x^k − [∇²f̃_{Δ^k}(x^k)]^{-1} ∇f̃_{Δ^k}(x^k).

Suppose
i. x^k is sufficiently close to a local or global minimizer x*,
ii. κ_g(x^k) and κ_H(x^k) are bounded above for all k sufficiently large,
iii. there exists M > 0 such that ‖[∇²f̃_{Δ^k}(x^k)]^{-1}‖ ≤ M for all k sufficiently large,
iv. there exists μ > 0 such that Δ^k < μ‖x^k − x*‖ for all k sufficiently large.
Then x^k converges quadratically to x*.

Proof By Taylor's theorem, applying f ∈ C^3 and x^k sufficiently close to x*, we may select κ_T > 0 such that

‖∇f(x^k) + ∇²f(x^k)(x* − x^k)‖ ≤ κ_T ‖x* − x^k‖².

By order-2 gradient accuracy, order-1 Hessian accuracy, and assumption (ii), we have κ_g > 0 and κ_H > 0 such that

‖∇f(x^k) − ∇f̃_{Δ^k}(x^k)‖ ≤ κ_g (Δ^k)², and
‖∇²f(x^k) − ∇²f̃_{Δ^k}(x^k)‖ ≤ κ_H Δ^k

for all k sufficiently large.
Now, examine ‖x* − x^{k+1}‖:

‖x* − x^{k+1}‖ = ‖x* − x^k + [∇²f̃_{Δ^k}(x^k)]^{-1} ∇f̃_{Δ^k}(x^k)‖
  = ‖[∇²f̃_{Δ^k}(x^k)]^{-1} (∇²f̃_{Δ^k}(x^k)(x* − x^k) + ∇f̃_{Δ^k}(x^k))‖
  ≤ ‖[∇²f̃_{Δ^k}(x^k)]^{-1}‖ ‖∇²f̃_{Δ^k}(x^k)(x* − x^k) + ∇f̃_{Δ^k}(x^k)‖
  ≤ M ( ‖(∇²f̃_{Δ^k}(x^k) − ∇²f(x^k))(x* − x^k)‖ + ‖∇f̃_{Δ^k}(x^k) − ∇f(x^k)‖ + ‖∇f(x^k) + ∇²f(x^k)(x* − x^k)‖ )
  ≤ M ( κ_H Δ^k ‖x* − x^k‖ + κ_g (Δ^k)² + κ_T ‖x* − x^k‖² )
  ≤ M (κ_H μ + κ_g μ² + κ_T) ‖x* − x^k‖²,

where the final inequality applies assumption (iv).

Theorem 3.4 makes a number of assumptions on the presented model-Newton update. In


particular, we note that there is no reason to expect that, without careful algorithm design,
assumption (iv) should hold. However, our goal in presenting Theorem 3.4 is not to propose
a new algorithm, but instead to demonstrate how the idea of order-N accuracy can work
with Taylor’s theorem to recreate classical results from Variational Analysis for smooth
optimization.
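A one-dimensional numerical illustration of Theorem 3.4 follows. It is not an implementable algorithm: the "model" derivatives are the exact ones perturbed by exactly κ_g(Δ^k)² and κ_H Δ^k, and Δ^k is set using knowledge of x* so that assumption (iv) holds. All constants are illustrative.

```python
import numpy as np

# Minimize f(x) = cos(x), whose minimizer is x* = pi. The model derivatives
# are the true ones perturbed within the order-2 / order-1 accuracy bounds,
# with Delta^k = 0.5 * |x^k - x*| so that assumption (iv) holds (this uses
# knowledge of x* and is for demonstration only).
x_star = np.pi
kappa_g, kappa_H = 0.01, 0.01
x = 3.0
errors = [abs(x - x_star)]
for _ in range(6):
    Delta = 0.5 * abs(x - x_star)
    g = -np.sin(x) + kappa_g * Delta ** 2   # order-2 accurate model gradient
    H = -np.cos(x) + kappa_H * Delta        # order-1 accurate model Hessian
    x = x - g / H                           # model-Newton update
    errors.append(abs(x - x_star))
# The error collapses superlinearly toward the floating-point floor.
```
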

Remark 1 In 2009 [29] and 2011 [69], researchers defined and developed the classes of fully linear and fully quadratic models. These definitions were similar to order-N accuracy, in that they tried to capture the idea of a model mimicking the behaviour of a 1st-order Taylor expansion (fully linear) or a 2nd-order Taylor expansion (fully quadratic). In the language of order-N accuracy,
• for f ∈ C^1, a class of models is fully linear if and only if it provides order-2 function accuracy and order-1 gradient accuracy;
• for f ∈ C^2, a class of models is fully quadratic if and only if it provides order-3 function accuracy, order-2 gradient accuracy, and order-1 Hessian accuracy.
Thus, Theorem 3.4 could have been proven using a class of models that is fully quadratic, instead of the order-N assumptions given. However, this would have included the unused assumption that the models provide order-3 function accuracy.

As an example of a complete model-based method, we present a model-based steepest descent algorithm (MBSD, Algorithm 2).
Like SGPS, it is clear that MBSD uses no variational information in its application. Also like SGPS, we shall see that Variational Analysis provides the correct tools to analyze the algorithm.
Before discussing convergence of the MBSD algorithm, let us note that if Δ^k = 0, then the algorithm reduces to steepest descent. Indeed, if Δ^k = 0, then order-1 gradient accuracy implies that ∇f̃_{Δ^k}(x^k) = ∇f(x^k). As 0 = Δ^k < ε_stop, the algorithm will terminate if ‖∇f̃_{Δ^k}(x^k)‖ < ε_stop. If the algorithm does not terminate, then the accuracy check is automatically passed, and a line search is performed in the direction of steepest descent using a standard Armijo condition.
Suppose that Δ^k > 0 and the stopping condition is triggered. In this case,

‖∇f(x^k)‖ ≤ ‖∇f(x^k) − g̃^k‖ + ‖g̃^k‖ < κ_g Δ^k + ε_stop < (1 + κ_g)ε_stop,

where κ_g is the constant for order-1 gradient accuracy. Thus, an approximate critical point is found up to (1 + κ_g)ε_stop precision.
To prove convergence, the first step is to note that if ∇f(x^k) ≠ 0, then Δ^k → 0 would imply g̃^k → ∇f(x^k), which further implies that eventually the model accuracy check must pass (or the stopping condition will be triggered) [8, Prop 10.3]. Similar, albeit more technical, arguments show that if ∇f(x^k) ≠ 0, then once the model is sufficiently accurate the line search must succeed [8, Thm 10.5]. This leads to a simple proof of convergence.

Theorem 3.5 (Convergence of the MBSD algorithm) Suppose f ∈ C^1 is bounded below and {f̃_Δ} is a class of models with order-1 gradient accuracy. Suppose that there exists t̄ > 0 such that t^k ≥ t̄ for all k. If MBSD is run with ε_stop = 0, then

lim_{k→∞} ‖∇f(x^k)‖ = 0.

Proof Without loss of generality, we drop to a subsequence such that the model accuracy checks pass and the Armijo line search,

f(x^{k+1}) ≤ f(x^k − t^k g̃^k) < f(x^k) − η t^k ‖g̃^k‖²,



Algorithm 2: Model-based steepest descent (MBSD).

Given f ∈ C^1, starting point x^0 ∈ R^n, and a class of models with order-1 gradient accuracy
0. Initialize:
   Δ^0 ∈ (0, ∞)       initial model accuracy parameter
   μ^0 ∈ (0, ∞)       initial target accuracy parameter
   η ∈ (0, 1)         an Armijo parameter
   ε_stop ∈ [0, ∞)    stopping tolerance
   k ← 0              iteration counter
1. Model:
   use a finite number of points to create f̃_{Δ^k} with order-1 gradient accuracy
   set g̃^k = ∇f̃_{Δ^k}(x^k)
2. Model accuracy checks:
   a) if Δ^k < ε_stop and ‖g̃^k‖ < ε_stop
        declare algorithm success and stop
   b) if Δ^k > μ^k ‖g̃^k‖
        declare the model inaccurate
        decrease Δ^{k+1} ≤ (1/2)Δ^k, set μ^{k+1} = μ^k, x^{k+1} = x^k, k ← k + 1, go to 1
   c) if Δ^k ≤ μ^k ‖g̃^k‖
        declare the model accurate and proceed to 3
3. Line search:
   perform a line search in the direction −g̃^k to seek t^k with
      f(x^k − t^k g̃^k) < f(x^k) − η t^k ‖g̃^k‖²
   if t^k is found, declare line-search success
   otherwise declare line-search failure
4. Update:
   if line-search success
      let x^{k+1} be any point such that f(x^{k+1}) ≤ f(x^k − t^k g̃^k)
      set Δ^{k+1} = Δ^k, μ^{k+1} = μ^k
   otherwise (line-search failure)
      set x^{k+1} = x^k, Δ^{k+1} = Δ^k, and decrease μ^{k+1} ≤ (1/2)μ^k
   increment k ← k + 1 and go to 1

is successful. Noting that f is bounded below, we must have f(x^{k+1}) converge. As f(x^k) will converge to the same limit, we have

lim_{k→∞} f(x^{k+1}) ≤ lim_{k→∞} [f(x^k) − η t^k ‖g̃^k‖²],
so lim_{k→∞} η t^k ‖g̃^k‖² ≤ 0.

Since t^k ≥ t̄, we have lim_{k→∞} ‖g̃^k‖² = 0. This implies

lim_{k→∞} ‖∇f(x^k)‖ ≤ lim_{k→∞} [‖∇f(x^k) − g̃^k‖ + ‖g̃^k‖]
  ≤ lim_{k→∞} [κ_g Δ^k + ‖g̃^k‖]
  ≤ lim_{k→∞} [κ_g μ^k ‖g̃^k‖ + ‖g̃^k‖] = 0,

where the final line comes from the model accuracy check in Step 2 (note μ^k never increases).
Complete details can be found in [8, Thm 10.6].
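For illustration, the MBSD loop can be sketched with a forward-difference model gradient standing in for the abstract class of models (for f with Lipschitz continuous gradient, forward differences with step Δ give order-1 gradient accuracy). The objective, starting point, and all tolerances below are illustrative choices, not part of the source.

```python
import numpy as np

def fd_model_grad(f, x, delta):
    """Forward-difference gradient of an implicit linear model of f at x;
    for f with Lipschitz gradient this is order-1 accurate: error <= kappa*delta."""
    fx, n = f(x), len(x)
    g = np.zeros(n)
    for i in range(n):
        e = np.zeros(n); e[i] = delta
        g[i] = (f(x + e) - fx) / delta
    return g

def mbsd(f, x0, delta0=1.0, mu0=1.0, eta=1e-4, eps_stop=1e-6, max_iter=2000):
    x, delta, mu = np.asarray(x0, dtype=float), delta0, mu0
    for _ in range(max_iter):
        g = fd_model_grad(f, x, delta)               # 1. Model
        gn = np.linalg.norm(g)
        if delta < eps_stop and gn < eps_stop:       # 2a. declare success
            break
        if delta > mu * gn:                          # 2b. model inaccurate
            delta *= 0.5
            continue
        t, success = 1.0, False                      # 3. Armijo line search
        while t > 1e-12:
            if f(x - t * g) < f(x) - eta * t * gn ** 2:
                success = True
                break
            t *= 0.5
        if success:                                  # 4. Update
            x = x - t * g
        else:
            mu *= 0.5
    return x

x_min = mbsd(lambda x: (x[0] - 1) ** 2 + (x[1] + 2) ** 2, [0.0, 0.0])
```
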

3.2 Advances on Model-Based Methods

The MBSD algorithm above is a simplified version of the model-based descent method presented in [8, Ch 10], which is a slightly generalized version of Powell's COBYLA method (Constrained Optimization BY Linear Approximation) [55].
DFO research quickly advanced past model-based descent methods towards trust-
region methods [1, 20, 21, 27, 30, 56–58, 60, 65, 67] (and references therein). A few
approaches have branched away from trust-region methods and begun to explore other
algorithms [19, 41]. In both cases, instead of linear interpolation (which creates models
with order-2 function accuracy and order-1 gradient accuracy), researchers started focusing
on quadratic interpolation (which creates models with order-3 function accuracy, order-2
gradient accuracy, and order-1 Hessian accuracy).
When quadratic models are used, curvature information can be exploited and the rate of convergence tends to increase (e.g., see Theorem 3.4 above). However, quadratic interpolation requires (n + 1)(n + 2)/2 (well-poised) interpolation points [8, Def 9.8]. Conversely, linear interpolation only requires n + 1 interpolation points [8, Def 9.4]. Thus, as the dimension increases, we see quadratic growth in the number of function values required to construct quadratic interpolation models, but only linear growth in the number of function values required to construct linear interpolation models. This creates an intriguing question of balance between the improvements gained and the cost of gaining those improvements when comparing quadratic and linear interpolation [35].
This question has generated many proposed methods that aim to achieve a good balance between the work of model construction and the rate of convergence with respect to iterations. One of the most intuitive approaches is to begin with a linear model, and then increase model quality by using more interpolation points as necessary. To do this, it is necessary to be able to build over-determined linear interpolation models and/or under-determined quadratic interpolation models. Information on such constructions can be found in [28, 59, 61]. Other approaches to reducing the number of interpolation points required in complex models of smooth objective functions have abandoned the polynomial structure of the model [46, 52, 63, 68].

4 DFO Model-Based Methods for Nonsmooth Functions

4.1 Basic Model-Based Methods for Nonsmooth Functions

While the methods in Section 3 do not use gradients or Hessians in the algorithms, they nonetheless work on the assumption that the underlying true objective function is smooth. This is necessary if concepts like order-N gradient accuracy are used, as the true objective needs gradients to approximate. To consider a model-based DFO method for a nonsmooth objective function, we must introduce more advanced concepts from Variational Analysis. For the sake of brevity, we limit our discussion to locally Lipschitz functions. Recall, if a function is locally Lipschitz, then it is differentiable almost everywhere, which allows the following definition of the subdifferential and subgradients.

Definition 4.1 (Subdifferential and subgradient) Let f : R^n → R be a locally Lipschitz function. Let D(f) be the set of points where f is differentiable. The subdifferential of f at the point x ∈ R^n is defined by

∂f(x) = conv{v ∈ R^n : there exists y^i ∈ D(f) such that y^i → x and ∇f(y^i) → v},   (1)

where conv denotes the convex hull. The elements of the subdifferential are called subgradients.
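Formula (1) suggests a sampling approximation: collect derivative estimates at nearby (almost surely differentiable) points and take their convex hull, which in one dimension is just an interval. The sketch below is an illustration only; the sampling radius, sample count, and difference step are arbitrary choices.

```python
import numpy as np

def sampled_subdifferential_1d(f, x, radius=1e-4, n_samples=200, h=1e-8, seed=0):
    """Approximate the subdifferential of a Lipschitz f: R -> R at x by
    collecting derivative estimates at random nearby points, following the
    convex-hull formula (1); in 1-D the hull is the interval [min, max]."""
    rng = np.random.default_rng(seed)
    ys = x + radius * rng.uniform(-1.0, 1.0, n_samples)
    grads = [(f(y + h) - f(y - h)) / (2 * h) for y in ys]
    return min(grads), max(grads)

lo, hi = sampled_subdifferential_1d(abs, 0.0)
# Expect roughly the interval [-1, 1], i.e. the subdifferential of |.| at 0.
```
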

Similar to order-N gradient accuracy, we can define order-N subgradient accuracy of a locally Lipschitz function.

Definition 4.2 (Order-N subgradient accuracy) Given a locally Lipschitz function f, x ∈ R^n, and Δ̄ > 0, we say that {f̃_Δ}_{Δ∈(0,Δ̄]} is a class of models of f at x parameterized by Δ that provides order-N subgradient accuracy at x if there exists a scalar κ_g(x) > 0 such that, given any Δ ∈ (0, Δ̄], the model f̃_Δ satisfies
i) given any v ∈ ∂f(x) there exists ṽ ∈ ∂f̃_Δ(x) with ‖v − ṽ‖ ≤ κ_g(x) Δ^N, and
ii) given any ṽ ∈ ∂f̃_Δ(x) there exists v ∈ ∂f(x) with ‖v − ṽ‖ ≤ κ_g(x) Δ^N.
We say that {f̃_Δ}_{Δ∈(0,Δ̄]} is a class of models of f at x parameterized by Δ that provides order-N subgradient accuracy near x if there exists a scalar κ_g(x) > 0 such that, given any Δ ∈ (0, Δ̄], the model f̃_Δ satisfies
i) for all y ∈ B_Δ(x), given any v ∈ ∂f(y) there exists ṽ ∈ ∂f̃_Δ(y) with ‖v − ṽ‖ ≤ κ_g(x) Δ^N, and
ii) for all y ∈ B_Δ(x), given any ṽ ∈ ∂f̃_Δ(y) there exists v ∈ ∂f(y) with ‖v − ṽ‖ ≤ κ_g(x) Δ^N.

As the subdifferential is a set, instead of a singleton, Definition 4.2 requires two accuracy conditions to be satisfied. The approximation set must include an approximation of everything in the true set, and everything in the approximation set must approximate something in the true set. The first condition implies that if the true set suggests a descent direction, then the approximation set will suggest a similar descent direction. Conversely, the second condition implies that if the approximation set suggests you are near a local minimizer, then the true set will suggest the same result. To explore this further, let us adapt Algorithm 2 (MBSD) to nonsmooth functions. Algorithm 3 is virtually unchanged from Algorithm 2, except that the word gradient is changed to subgradient. Analysis of the stopping conditions and convergence is also extremely similar.

Algorithm 3: Model-based steepest descent for locally Lipschitz f (MBSD0).

Given a locally Lipschitz function f, starting point x^0 ∈ R^n, and a class of models with order-1 subgradient accuracy
0. Initialize:
   same as MBSD
1. Model:
   use a finite number of points to create f̃_{Δ^k} with order-1 subgradient accuracy
   set g̃^k = proj(0, ∂f̃_{Δ^k}(x^k))
2.–4. Model accuracy checks, Line search, and Update:
   same as MBSD
Suppose the stopping condition is triggered at iteration k. Let g̃^k = proj(0, ∂f̃_{Δ^k}(x^k)) and let g^k be the element in ∂f(x^k) given by item (ii) of order-1 subgradient accuracy. Then

‖g^k‖ ≤ ‖g^k − g̃^k‖ + ‖g̃^k‖ < κ_g Δ^k + ε_stop < (1 + κ_g)ε_stop,

where κ_g is the constant for order-1 subgradient accuracy. Thus, dist(0, ∂f(x^k)) ≤ ‖g^k‖ < (1 + κ_g)ε_stop, so an approximate critical point is found up to (1 + κ_g)ε_stop precision. (See [10, Lem 4.5] for more detail.)
Convergence analysis begins by noting that if proj(0, ∂f̃_{Δ^k}(x^k)) ≠ 0, then Δ^k → 0 would imply that eventually the model accuracy check must pass or the stopping condition will be triggered. Similarly, the proof of [8, Thm 10.5] can be trivially adapted to show that if proj(0, ∂f̃_{Δ^k}(x^k)) ≠ 0, then once the model is sufficiently accurate the line search must succeed. Replacing ∇f(x^k) with g̃^k = proj(0, ∂f̃_{Δ^k}(x^k)) in the proof of Theorem 3.5 then produces a convergence proof for Algorithm 3 (see [10, Thm 4.6] for more detail).

Theorem 4.3 (Convergence of the MBSD0 algorithm) Suppose f is locally Lipschitz and bounded below, and {f̃_Δ} is a class of models with order-1 subgradient accuracy. Suppose that there exists t̄ > 0 such that t^k ≥ t̄ for all k. If MBSD0 is run with ε_stop = 0, then

lim_{k→∞} ‖g̃^k‖ = 0,

where g̃^k = proj(0, ∂f̃_{Δ^k}(x^k)).

4.2 Advances on Model-Based Methods for Nonsmooth Functions

Algorithm 3 includes the assumption that it has access to a class of models with order-1 subgradient accuracy. Of course, this opens the question of whether such models exist.
First approaches to approximating the subdifferential of a nonsmooth function date back to 2008, when Bagirov, Karasözen, and Sezer presented a DFO method for unconstrained nonsmooth optimization problems [17]. Their approach takes the convex hull of a large number of approximate gradients near a point of nondifferentiability and applies equation (1) to approximate the subdifferential. The approximate subdifferential is then used in a conjugate subgradient method.
While provably convergent, Bagirov, Karasözen, and Sezer's method suffers from the curse of dimensionality. The number of function values required to create each approximate gradient is linked to the dimension [17] (see also Section 3 above). Moreover, the number of approximate gradients required is linked to the dimension [23] and to the complexity of the subdifferential [9]. Furthermore, the complexity of the subdifferential typically increases with dimension. As a result, the number of function values required per iteration increases dramatically with dimension.
Fortunately, in 2013 it was recognized that if the function has some underlying structure, then the work required to approximate the subdifferential of a nonsmooth function can be greatly decreased [37]. For example, in [37] the authors assume that the objective function f takes the form

f(x) = max_{i∈I} {f_i(x)},

where each f_i ∈ C^1 is provided by a blackbox and all blackboxes are evaluated in parallel. (An application that shows the validity of such assumptions can be found in [38].) Applying
A Discussion on Variational Analysis in Derivative-Free Optimization 655

the formula
∂f (x) = conv{∇fi (x) : fi (x) = f (x)}
it is possible to create models that satisfy order-1 subgradient accuracy using just n + 1
function evaluations.
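A minimal sketch of this construction under the stated assumptions (forward-difference simplex gradients; the helpers `simplex_gradient` and `approx_subdifferential` are illustrative names, not from [37]). For simplicity each active component is differenced separately here, whereas [37] reuses the same n + 1 parallel blackbox evaluations for all components:

```python
import numpy as np

def simplex_gradient(fi, x0, h=1e-6):
    """Simplex gradient of fi at x0 from n+1 evaluations:
    x0 and the n coordinate perturbations x0 + h*e_j."""
    n = len(x0)
    f0 = fi(x0)
    g = np.empty(n)
    for j in range(n):
        e = np.zeros(n)
        e[j] = h
        g[j] = (fi(x0 + e) - f0) / h
    return g

def approx_subdifferential(fs, x0, h=1e-6, active_tol=1e-8):
    """Generators of conv{grad f_i(x0) : f_i active at x0},
    mirroring the active-gradient formula above."""
    x0 = np.asarray(x0, dtype=float)
    vals = [fi(x0) for fi in fs]
    fmax = max(vals)
    return [simplex_gradient(fs[i], x0, h)
            for i, v in enumerate(vals) if fmax - v <= active_tol]
```

At a point of nondifferentiability, several components are active and the returned list has more than one generator, capturing the set-valued nature of ∂f.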
Similar techniques can be applied if the objective function takes the form f(x) =
Σ_i |f_i(x)| [44] or, more generally, if the objective function takes the form f(x) = g(F(x)),
where g is a known convex function and F is a blackbox function [34].
In working with nonsmooth functions, some researchers have chosen to continue explor-
ing trust-region approaches [47]. However, most researchers have branched away from such
methods and instead looked at algorithms for nonsmooth optimization. For example, the
methods mentioned above ([37] and [44]) both approach the problem through active man-
ifold methods. These ideas are further developed in [43, 48]. Nonconvex algorithms also
open the possibility for (proximal) bundle methods [40] or VU methods [39].

5 Conclusions and Future Challenges for VA in DFO

Over the last two decades, Derivative-Free Optimization has established itself as a valuable
field of Optimization research. Two books [8, 29] and several surveys [10, 45, 64] have
recently been published on the topic. All of these reinforce the elegant mathematics, chal-
lenging computer science, and numerous applications that are linked to DFO research. In
this paper, we have argued that despite DFO algorithms not using 1st-order information, the
links to Variational Analysis are extremely strong.
In direct-search methods, we saw how tools from Variational Analysis could be used to
prove convergence to 1st-order critical points, even though the algorithm accesses no 1st-
order information. In model-based methods, we introduced the idea of order-N accuracy
and demonstrated how it can be linked with Taylor series in order to adapt classical con-
vergence results to model-based DFO algorithms. We also saw that model-based methods
are not limited to smooth functions. In model-based methods for nonsmooth functions, we
adapted order-N accuracy to work for the set-valued object known as the subdifferential
and demonstrated that a model-based DFO algorithm for nonsmooth optimization will con-
verge to a 1st-order critical point. In all of these examples, we see the tools of Variational
Analysis at work.
Before concluding, we add a few remarks about the future of DFO and its connection to
Variational Analysis.
First, we remark that in Section 2 we define positive spanning sets, but did not discuss
how to construct them. At the basic level this is easy (e.g., the 2n coordinate directions form
a positive spanning set). However, direct-search methods can be significantly impacted by
the choice of positive spanning set, so a better understanding of their construction and quali-
ties would be valuable. We also note that, while we presented order-N accuracy specifically
for the subdifferential, similar definitions can easily be constructed for other set-valued
objects, such as normal cones. The question of how to efficiently and accurately construct
approximate subdifferentials, normal cones, or other Variational Analysis objects is far from
answered.
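For instance, the 2n coordinate directions ±e_i form a positive spanning set whose quality, as measured by the cosine measure, equals 1/√n. A small empirical check of that bound (random sampling only, not an exact cosine-measure computation; all names here are illustrative):

```python
import numpy as np

def max_cosine(D, v):
    """Largest cosine between v and any direction in D."""
    v = v / np.linalg.norm(v)
    return max((d @ v) / np.linalg.norm(d) for d in D)

n = 3
# The 2n coordinate directions +/- e_i (a maximal positive basis).
D = [s * e for e in np.eye(n) for s in (1.0, -1.0)]

rng = np.random.default_rng(0)
# For a positive spanning set, every nonzero v makes a positive
# cosine with some direction; for this set the bound is 1/sqrt(n).
worst = min(max_cosine(D, rng.standard_normal(n)) for _ in range(1000))
```

The sampled worst case never drops below 1/√n, consistent with the known cosine measure of this set; sets with larger cosine measure tend to make direct-search methods more robust.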
These open questions lead to the conclusion that DFO algorithms, in general, would
benefit from greater knowledge in what might be called “Numerical Variational Analysis
for Optimization”. Some current results in this area include: methods to construct well-
balanced positive bases [36, 62], novel calculus rules for approximate gradients [31, 36,
61], and methods to approximate subdifferentials for various functional forms [43, 48]. But
this leaves many questions unanswered.
With regards to algorithms, the three presented in this paper are obviously very basic.
They were selected for the purposes of demonstration, not for any level of competitive
performance. Algorithms with provable convergence results date back to the earliest days
of DFO research [49]. More recent research has focused on the development of algorithms
for specific styles of problems. For example,
• problems where the blackbox provides noisy function values [12, 19, 24, 53, 65];
• problems with very expensive blackboxes [22], possibly with cheaper low-accuracy
surrogate blackboxes [16, 50, 54];
• problems in multi-objective optimization [14, 25, 32]; and
• problems with structured (or grey-box) functions [4, 18, 48].
Like the numerical approximation questions above, the list of open questions in algorithm
design is virtually endless.
Much of the new research in algorithm design is working with model-based methods.
In this author’s opinion, one important aspect in researching model-based DFO algorithm
design is to try to separate the model construction and the algorithm design whenever
possible. The language of order-N accuracy can help in that regard.
An astute reader of this paper will note that we do not discuss other popular 0th-order
optimization methods, such as genetic algorithms, particle swarm optimization, or other
heuristic methods. This was intentional. While these algorithms have been found effective
in a number of applications, for the most part the convergence analysis of such methods does
not apply the rigorous tools of Variational Analysis. Whether this is a concern, a feature,
or an opportunity for future research, we leave for time to decide.

References

1. Amaioua, N., Audet, C., Conn, A.R., Le Digabel, S.: Efficient solution of quadratically constrained
quadratic subproblems within the mesh adaptive direct search algorithm. Eur. J. Oper. Res. 268(1), 13–24
(2018)
2. Audet, C.: Convergence results for generalized pattern search algorithms are tight. Optim. Eng. 5(2),
101–122 (2004)
3. Audet, C., Béchard, V., Le Digabel, S.: Nonsmooth optimization through mesh adaptive direct search
and variable neighborhood search. J. Glob. Optim. 41, 299–318 (2008)
4. Audet, C., Côté, P., Poissant, C., Tribes, C.: Monotonic grey box direct search optimization. Optim. Lett.
14, 3–18 (2020)
5. Audet, C., Dennis, J.E., Jr.: Analysis of generalized pattern searches. SIAM J. Optim. 13(3), 889–903
(2003)
6. Audet, C., Dennis, J.E., Jr.: Mesh adaptive direct search algorithms for constrained optimization. SIAM
J. Optim. 17(1), 188–217 (2006)
7. Audet, C., Dennis, J.E., Jr., Le Digabel, S.: Globalization strategies for mesh adaptive direct search.
Comput. Optim. Appl. 46(2), 193–215 (2010)
8. Audet, C., Hare, W.: Derivative-Free and Blackbox Optimization. Springer International Publishing AG,
Switzerland (2017)
9. Audet, C., Hare, W.: Algorithmic construction of the subdifferential from directional derivatives. Set-
Valued Var. Anal. 26(3), 431–447 (2018)
10. Audet, C., Hare, W.: Model-based methods in derivative-free nonsmooth optimization, chapter 18.
In: Bagirov, A., Gaudioso, M., Karmitsa, N., Mäkelä, M. (eds.) Numerical nonsmooth optimization.
Springer (2020)
11. Audet, C., Ianni, A., Le Digabel, S., Tribes, C.: Reducing the number of function evaluations in mesh
adaptive direct search algorithms. SIAM J. Optim. 24(2), 621–642 (2014)

12. Audet, C., Ihaddadene, A., Le Digabel, S., Tribes, C.: Robust optimization of noisy blackbox problems
using the mesh adaptive direct search algorithm. Optim. Lett. 12(4), 675–689 (2018)
13. Audet, C., Le Digabel, S., Tribes, C.: The mesh adaptive direct search algorithm for granular and discrete
variables. SIAM J. Optim. 29(2), 1164–1189 (2019)
14. Audet, C., Savard, G., Zghal, W.: A mesh adaptive direct search algorithm for multiobjective optimiza-
tion. Eur. J. Oper. Res. 204(3), 545–556 (2010)
15. Audet, C., Tribes, C.: Mesh-based Nelder–Mead algorithm for inequality constrained optimization.
Comput. Optim. Appl. 71(2), 331–352 (2018)
16. Aziz, M., Hare, W., Jaberipour, M., Lucet, Y.: Multi-fidelity algorithms for the horizontal alignment
problem in road design. Eng. Optim. 0(0), 1–20 (2019)
17. Bagirov, A.M., Karasözen, B., Sezer, M.: Discrete gradient method: derivative-free method for nons-
mooth optimization. J. Optim. Theory Appl. 137(2), 317–334 (2008)
18. Bajaj, I., Iyer, S.S., Hasan, M.M.F.: A trust region-based two phase algorithm for constrained black-box
and grey-box optimization with infeasible initial point. Comput. Chem. Eng. 116, 306–321 (2018)
19. Berahas, A.S., Byrd, R.H., Nocedal, J.: Derivative-free optimization of noisy functions via quasi-
Newton methods. SIAM J. Optim. 29(2), 965–993 (2019)
20. Berghen, F.V.: CONDOR: A Constrained, Non-Linear, Derivative-Free Parallel Optimizer for Contin-
uous, High Computing Load, Noisy Objective Functions. PhD Thesis, Université Libre de Bruxelles,
Belgium (2004)
21. Berghen, F.V., Bersini, H.: CONDOR, a new parallel, constrained extension of Powell’s UOBYQA
algorithm: experimental results and comparison with the DFO algorithm. J. Comput. Appl. Math. 181, 157–175 (2005)
22. Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine learning. SIAM Rev.
60(2), 223–311 (2018)
23. Burke, J.V., Lewis, A.S., Overton, M.L.: A robust gradient sampling algorithm for nonsmooth,
nonconvex optimization. SIAM J. Optim. 15(3), 751–779 (2005)
24. Chen, R., Menickelly, M., Scheinberg, K.: Stochastic optimization using a trust-region method and
random models. Math. Program. 169, 447–487 (2018)
25. Cocchi, G., Liuzzi, G., Papini, A., Sciandrone, M.: An implicit filtering algorithm for derivative-free
multiobjective optimization with box constraints. Comput. Optim. Appl. 69(2), 267–296 (2018)
26. Conn, A.R., Scheinberg, K., Toint, Ph.L.: A derivative free optimization algorithm in practice. In:
Proceedings of 7th AIAA/USAF/NASA/ISSMO Symposium on Multidisciplinary Analysis and Opti-
mization (1998). http://perso.fundp.ac.be/phtoint/pubs/TR98-11.ps
27. Conn, A.R., Scheinberg, K., Toint, Ph.L.: DFO (Derivative Free Optimization). https://projects.coin-or.
org/Dfo (2001)
28. Conn, A.R., Scheinberg, K., Vicente, L.N.: Geometry of sample sets in derivative free optimization:
polynomial regression and underdetermined interpolation. IMA J. Numer. Anal. 28(4), 721–749 (2008)
29. Conn, A.R., Scheinberg, K., Vicente, L.N.: Introduction to Derivative-Free Optimization. MOS-SIAM
Series on Optimization. SIAM, Philadelphia (2009)
30. Conn, A.R., Toint, Ph.L.: An algorithm using quadratic interpolation for unconstrained derivative free
optimization. In: Nonlinear Optimization and Applications, pp. 27–47. Springer, Berlin (1996)
31. Coope, I.D., Tappenden, R.: Efficient calculation of regular simplex gradients. Comput. Optim. Appl.
To appear (2019)
32. Custódio, A.L., Madeira, J.F.A., Vaz, A.I.F., Vicente, L.N.: Direct multisearch for multiobjective
optimization. SIAM J. Optim. 21(3), 1109–1140 (2011)
33. Gramacy, R.B., Le Digabel, S.: The mesh adaptive direct search algorithm with treed Gaussian process
surrogates. Pacific J. Optim. 11(3), 419–447 (2015)
34. Hare, W.: Compositions of convex functions and fully linear models. Optim. Lett. 11(7), 1217–1227
(2017)
35. Hare, W., Jaberipour, M.: Adaptive interpolation strategies in derivative-free optimization: a case study.
Pac. J. Optim. 14(2), 327–347 (2018)
36. Hare, W., Jarry-Bolduc, G.: Calculus identities for generalized simplex gradients: rules and applications.
SIAM J. Optim. 30(1), 853–884 (2020)
37. Hare, W., Nutini, J.: A derivative-free approximate gradient sampling algorithm for finite minimax
problems. Comput. Optim. Appl. 56(1), 1–38 (2013)
38. Hare, W., Nutini, J., Tesfamariam, S.: A survey of non-gradient optimization methods in structural
engineering. Adv. Eng. Softw. 59, 19–28 (2013)

39. Hare, W., Planiden, C., Sagastizábal, C.: A derivative-free VU-algorithm for convex finite-max prob-
lems. Optim. Methods Softw., (to appear). https://www.tandfonline.com/doi/full/10.1080/10556788.
2019.1668944
40. Hare, W., Sagastizábal, C., Solodov, M.: A proximal bundle method for nonsmooth nonconvex functions
with inexact information. Comput. Optim. Appl. 63(1), 1–28 (2016)
41. Hare, W.L., Lucet, Y.: Derivative-free optimization via proximal point methods. J. Optim. Theory Appl.
160(1), 204–220 (2014)
42. Hooke, R., Jeeves, T.A.: “Direct Search” solution of numerical and statistical problems. J. Assoc.
Comput. Mach. 8(2), 212–229 (1961)
43. Khan, K.A., Larson, J., Wild, S.M.: Manifold sampling for optimization of nonconvex functions that are
piecewise linear compositions of smooth components. SIAM J. Optim. 28(4), 3001–3024 (2018)
44. Larson, J., Menickelly, M., Wild, S.M.: Manifold sampling for ℓ1 nonconvex optimization. SIAM J.
Optim. 26(4), 2540–2563 (2016)
45. Larson, J., Menickelly, M., Wild, S.M.: Derivative-free optimization methods. Acta Numerica 28, 287–
404 (2019)
46. Lera, D., Sergeyev, Y.D.: GOSH: derivative-free global optimization using multi-dimensional space-
filling curves. J. Global Optim. 71(1), 193–211 (2018)
47. Liuzzi, G., Lucidi, S., Rinaldi, F., Vicente, L.N.: Trust-region methods for the derivative-free optimiza-
tion of nonsmooth black-box functions. SIAM J. Optim. 29(4), 3012–3035 (2019)
48. Menickelly, M., Wild, S.M.: Derivative-free robust optimization by outer approximations. Math.
Program. 179(1-2, Ser. A), 157–193 (2020)
49. Mifflin, R.: A superlinearly convergent algorithm for minimization without evaluating derivatives. Math.
Program. 9(1), 100–117 (1975)
50. Müller, J., Day, M.: Surrogate optimization of computationally expensive black-box problems with
hidden constraints. INFORMS J. Comput. To appear (2019)
51. Nelder, J.A., Mead, R.: A simplex method for function minimization. Comput. J. 7(4), 308–313 (1965)
52. Oeuvray, R., Bierlaire, M.: Boosters: a derivative-free algorithm based on radial basis functions. Int. J.
Model. Simul. 29(1), 26–36 (2009)
53. Paquette, C., Scheinberg, K.: A stochastic line search method with convergence rate analysis (2018)
54. Polak, E., Wetter, M.: Precision control for generalized pattern search algorithms with adaptive precision
function evaluations. SIAM J. Optim. 16(3), 650–669 (2006)
55. Powell, M.J.D.: A direct search optimization method that models the objective and constraint functions
by linear interpolation. In: Gomez, S., Hennart, J.-P. (eds.) Advances in Optimization and Numerical
Analysis, Proceedings of the 6th Workshop on Optimization and Numerical Analysis, Oaxaca, Mexico,
vol. 275, pp. 51–67, Kluwer Academic Publishers, Dordrecht (1994)
56. Powell, M.J.D.: UOBYQA: Unconstrained optimization by quadratic approximation. Technical Report
DAMTP 2000/NA14, Department of Applied Mathematics and Theoretical Physics, University of
Cambridge, Silver Street, Cambridge CB3 9EW, England (2000)
57. Powell, M.J.D.: UOBYQA: Unconstrained Optimization by quadratic approximation. Math. Program.
92(3), 555–582 (2002)
58. Powell, M.J.D.: On trust region methods for unconstrained minimization without derivatives. Math.
Program. 97(3), 605–623 (2003)
59. Powell, M.J.D.: Least Frobenius norm updating of quadratic models that satisfy interpolation conditions.
Math. Program. 100(1), 183–215 (2004)
60. Powell, M.J.D.: The BOBYQA algorithm for bound constrained optimization without derivatives. Tech-
nical report, Department of Applied Mathematics and Theoretical Physics, Cambridge University, UK
(2009)
61. Regis, R.G.: The calculus of simplex gradients. Optim. Lett. 9(5), 845–865 (2015)
62. Regis, R.G.: On the properties of positive spanning sets and positive bases. Optim. Eng. 17(1), 229–262
(2016)
63. Regis, R.G., Shoemaker, C.A.: Parallel radial basis function methods for the global optimization of
expensive functions. Eur. J. Oper. Res. 182(2), 514–535 (2007)
64. Rios, L.M., Sahinidis, N.V.: Derivative-free optimization: a review of algorithms and comparison of
software implementations. J. Glob. Optim. 56(3), 1247–1293 (2013)
65. Shashaani, S., Hashemi, F.S., Pasupathy, R.: ASTRO-DF: a class of adaptive sampling trust-region
algorithms for derivative-free stochastic optimization. SIAM J. Optim. 28(4), 3145–3176 (2018)
66. Torczon, V.: On the convergence of pattern search algorithms. SIAM J. Optim. 7(1), 1–25 (1997)
67. Verdério, A., Karas, E.W., Pedroso, L.G., Scheinberg, K.: On the construction of quadratic models for
derivative-free trust-region algorithms. EURO J. Comput. Optim. 5, 501–527 (2017)

68. Wild, S.M., Regis, R.G., Shoemaker, C.A.: ORBIT: optimization by radial basis function interpolation
in trust-regions. SIAM J. Sci. Comput. 30(6), 3197–3219 (2008)
69. Wild, S.M., Shoemaker, C.A.: Global convergence of radial basis function trust region derivative-free
algorithms. SIAM J. Optim. 21(3), 761–781 (2011)
70. Winfield, D.: Function and Functional Optimization by Interpolation in Data Tables. PhD thesis,
Harvard University, USA (1969)

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps
and institutional affiliations.
