Series Editors
Panos M. Pardalos
János D. Pintèr
Stephen M. Robinson
Tamás Terlaky
My T. Thai
Convex Optimization in
Normed Spaces
Theory, Methods and Examples
Juan Peypouquet
Universidad Técnica Federico Santa María
Valparaíso
Chile
Foreword
This book offers a self-contained introduction to, and a modern vision of, convex optimization in Banach spaces, with various examples and numerical methods. The challenge of
providing a clear and complete presentation in only 120 pages is met successfully.
This reflects a deep understanding by the author of this rich subject. Even when
presenting classical results, he offers new insights and a fresh perspective.
Convex analysis lies at the core of optimization, since convexity is often present or hidden, even in apparently non-convex problems. It may be partially present in the basic blocks of structured problems, or introduced intentionally (as in relaxation) as a solution technique.
The examples considered in the book are carefully chosen. They illustrate a wide range of applications in mathematical programming, elliptic partial differential equations, optimal control, and signal/image processing, among others. The framework of reflexive Banach spaces, which is analyzed in depth in the book, provides a general setting covering many of these situations.
The chapter on iterative methods is particularly accomplished. It gives a unified view of first-order algorithms, which are regarded as time-discretized versions of the (generalized) continuous steepest descent dynamics. Proximal algorithms are studied in depth. They play a fundamental role in splitting methods, which have proven successful at reducing the dimension of large-scale problems, such as sparse optimization in signal/image processing or domain decomposition for PDEs.
The text is accompanied by images that provide geometric intuition for the analytical concepts. The writing is clear and concise, and the presentation is fairly linear. Overall, the book makes for enjoyable reading.
This book is accessible to a broad audience. It is excellent both for those who want to become familiar with the subject and learn the basic concepts, and as a textbook for a graduate course. The extensive bibliography provides the reader with further information, as well as recent trends.
Preface
Convex analysis comprises a number of powerful and elegant ideas and techniques that are essential for understanding and solving problems in many areas of applied mathematics. In this introductory textbook, we have tried to present the main concepts
and results in an easy-to-follow and mostly self-contained manner. It is intended to
serve as a guide for students and more advanced independent learners who wish to
get acquainted with the basic theoretical and practical tools for the minimization of
convex functions on normed spaces. In particular, it may be useful for researchers working in related fields who wish to apply convex-analytic techniques in their own work. It can also be used for teaching a graduate-level course on the subject, in view of the way the contents are organized. We should point out that we have not attempted to include every relevant result on the matter, but rather to present the ideas as
transparently as possible. We believe that this work can be helpful to gain insight
into some connections between optimization, control, calculus of variations, partial
differential equations, and functional analysis that sometimes go unnoticed.
Juan Peypouquet
Santiago & Valparaı́so, Chile
October, 2014
Acknowledgments
This book contains my vision and experience on the theory and methods of con-
vex optimization, which have been modeled by my research and teaching activity
during the past few years, beginning with my PhD thesis at Universidad de Chile
and Université Pierre et Marie Curie - Paris VI. I am thankful, in the first place,
to my advisors, Felipe Álvarez and Sylvain Sorin, who introduced me to optimiza-
tion and its connection to dynamical systems, and transmitted to me their curiosity and
passion for convex analysis. I am also grateful to my research collaborators, espe-
cially Hedy Attouch, who is always willing to share his deep and vast knowledge
with kindness and generosity. He also introduced me to researchers and students at
Université Montpellier II, with whom I have had a fruitful collaboration. Of great
importance were the many stimulating discussions with colleagues at Universidad
Técnica Federico Santa María, in particular, Pedro Gajardo and Luis Briceño, as
well as my PhD students, Guillaume Garrigos and Cesare Molinari.
I also thank Associate Editor, Donna Chernyk, at Springer Science and Business
Media, who invited me to engage in the more ambitious project of writing a book to
be edited and distributed by Springer. I am truly grateful for this opportunity. A pre-
cursor [88] had been prepared earlier for a 20-hour minicourse on convex optimization at the XXVI EVM & EMALCA, held in Mérida, in 2013. I thank my friend and former teacher at Universidad Simón Bolívar, Stefania Marcantognini, for proposing me as a lecturer at that school. Without their encouragement, I would not have considered writing this book at this point in my career.
Finally, my research over the past few years was made possible thanks to Conicyt Anillo Project ACT-1106, ECOS-Conicyt Project C13E03, Millenium Nucleus ICM/FIC P10-024F, Fondecyt Grants 11090023 and 1140829, Basal Project CMM Universidad de Chile, STIC-AmSud Ovimine, and Universidad Técnica Federico Santa María, through their DGIP Grants 121123, 120916, 120821 and 121310.
Juan Peypouquet
Santiago & Valparaı́so, Chile
October, 2014
Contents
2 Existence of Minimizers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.1 Extended Real-Valued Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2 Lower-Semicontinuity and Minimization . . . . . . . . . . . . . . . . . . . . . . . 26
2.3 Minimizers of Convex Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.1 Norm of a Bounded Linear Functional . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.2 Optimal Control and Calculus of Variations . . . . . . . . . . . . . . . . . . . . . 67
4.2.1 Controlled Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.2.2 Existence of an Optimal Control . . . . . . . . . . . . . . . . . . . . . . . . 68
4.2.3 The Linear-Quadratic Problem . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.2.4 Calculus of Variations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.3 Some Elliptic Partial Differential Equations . . . . . . . . . . . . . . . . . . . . . 74
4.3.1 The Theorems of Stampacchia and Lax-Milgram . . . . . . . . . . 74
4.3.2 Sobolev Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.3.3 Poisson-Type Equations in $H^1$ and $W^{1,p}$ . . . . . . . . . . . . . . 76
4.4 Sparse Solutions for Underdetermined Systems of Equations . . . . . . 79
5 Problem-Solving Strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.1 Combining Optimization and Discretization . . . . . . . . . . . . . . . . . . . . 81
5.1.1 Recovering Solutions for the Original Problem: Ritz’s
Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.1.2 Building the Finite-Dimensional Approximations . . . . . . . . . 83
5.2 Iterative Procedures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.3 Problem Simplification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.3.1 Elimination of Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.3.2 Splitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
Chapter 1
Basic Functional Analysis
Abstract Functional and convex analysis are closely intertwined. In this chapter
we recall the basic concepts and results from functional analysis and calculus that
will be needed throughout this book. A first section is devoted to general normed
spaces. We begin by establishing some of their main properties, with an emphasis
on the linear functions between spaces. This leads us to bounded linear functionals
and the topological dual. Second, we review the Hahn–Banach Separation Theo-
rem, a very powerful tool with important consequences. It also illustrates the fact
that the boundaries between functional and convex analysis can be rather fuzzy at
times. Next, we discuss some relevant results concerning the weak topology, espe-
cially in terms of closedness and compactness. Finally, we include a subsection on
differential calculus, which also provides an introduction to standard smooth opti-
mization techniques. The second section deals with Hilbert spaces, and their very
rich geometric structure, including the ideas of projection and orthogonality. We
also revisit some of the general concepts from the first section (duality, reflexivity,
weak convergence) in the light of this geometry.
For a comprehensive presentation, the reader is referred to [30] and [94].
Example 1.2. The space $C([a, b]; \mathbb{R})$ of continuous real-valued functions on the interval [a, b], with the norm $\|\cdot\|_\infty$ defined by $\|f\|_\infty = \max_{t\in[a,b]} |f(t)|$.
Proof. A set C ⊂ X has empty interior if, and only if, every open ball intersects $C^c$. Let B be an open ball. Take another open ball B′ whose closure is contained in B. Since $C_0$ has empty interior, $B' \cap C_0^c \neq \emptyset$. Moreover, since both B′ and $C_0^c$ are open, there exist $x_1 \in X$ and $r_1 \in (0, \tfrac{1}{2})$ such that $B(x_1, r_1) \subset B' \cap C_0^c$. Analogously, there exist $x_2 \in X$ and $r_2 \in (0, \tfrac{1}{4})$ such that $B(x_2, r_2) \subset B(x_1, r_1) \cap C_1^c \subset B \cap C_0^c \cap C_1^c$. Inductively, one defines $(x_m, r_m) \in X \times \mathbb{R}$ such that $x_m \in B(x_n, r_n) \cap \bigcap_{k=0}^n C_k^c$ and $r_m \in (0, 2^{-m})$ for each m > n ≥ 1. In particular, $\|x_m - x_n\| < 2^{-n}$ whenever m > n ≥ 1. It follows that $(x_n)$ is a Cauchy sequence and so, it must converge to some x̄, which must belong to $B \cap \bigcap_{k=0}^\infty C_k^c \subset B \cap \big(\bigcup_{k=0}^\infty C_k\big)^c$, by construction.
We shall find several important consequences of this result, especially the Banach–
Steinhaus uniform boundedness principle (Theorem 1.9) and the continuity of con-
vex functions in the interior of their domain (Proposition 3.6).
The function $\|\cdot\|_{\mathcal{L}(X;Y)}$ is a norm on the space $\mathcal{L}(X;Y)$ of bounded linear operators from $(X, \|\cdot\|_X)$ to $(Y, \|\cdot\|_Y)$. For linear operators, boundedness and (uniform) continuity are closely related. This is shown in the following result:
Proposition 1.6. Let $(X, \|\cdot\|_X)$ and $(Y, \|\cdot\|_Y)$ be normed spaces and let L : X → Y be a linear operator. The following are equivalent:
i) L is continuous in 0;
ii) L is bounded; and
iii) L is uniformly Lipschitz-continuous in X.
Proof. Let i) hold. For each ε > 0 there is δ > 0 such that $\|L(h)\|_Y \le \varepsilon$ whenever $\|h\|_X \le \delta$. If $\|x\|_X = 1$, then $\|L(x)\|_Y = \delta^{-1}\|L(\delta x)\|_Y \le \delta^{-1}\varepsilon$ and so, $\sup_{\|x\|=1} \|L(x)\|_Y < \infty$. Next, if ii) holds, then
$$\|L(x) - L(z)\|_Y = \|x - z\|_X \left\| L\!\left(\frac{x - z}{\|x - z\|_X}\right) \right\|_Y \le \|L\|_{\mathcal{L}(X;Y)} \|x - z\|_X.$$
It ensues that $L \in \mathrm{Inv}(X;Y)$ and $L^{-1} = \bar{M} \circ L_0^{-1}$. For the continuity, since
$$\|M_{n+1} - I_X\| = \|M \circ M_n\| \le \|M\| \cdot (1 - \|M\|)^{-1},$$
we deduce that
$$\|L^{-1} - L_0^{-1}\| \le \frac{\|L_0^{-1}\|^2}{1 - \|M\|}\,\|L - L_0\|.$$
Observe that Φ is actually Lipschitz-continuous in every closed ball $\bar{B}(L_0, R)$ with $R < \|L_0^{-1}\|^{-1}$.
This fact will be used in Sect. 6.3.2. This and other useful calculus tools for normed spaces can be found in [33].
for each $r < \hat{r}$ and $\lambda \in \Lambda$. It follows that $\sup_{\lambda\in\Lambda} \|L_\lambda\|_{\mathcal{L}(X;Y)} < \infty$.
is the duality product between X and $X^*$. If the space can be easily guessed from the context, we shall write $\langle L, x\rangle$ instead of $\langle L, x\rangle_{X^*\!,X}$ to simplify the notation.
for each $L \in X^*$. Since $\langle L, x\rangle \le \|L\|_* \|x\|$ for each x ∈ X and $L \in X^*$, we have $\|\mu_x\|_{**} \le \|x\|$, so actually $\mu_x \in X^{**}$. The function $\mathcal{J} : X \to X^{**}$, defined by $\mathcal{J}(x) = \mu_x$, is the canonical embedding of X into $X^{**}$. Clearly, the function $\mathcal{J}$ is linear and continuous. We shall see later (Proposition 1.17) that $\mathcal{J}$ is an isometry. This fact will imply, in particular, that the canonical embedding $\mathcal{J}$ is injective. The space $(X, \|\cdot\|)$ is reflexive if $\mathcal{J}$ is also surjective; in other words, if every element μ of $X^{**}$ is of the form $\mu = \mu_x$ for some x ∈ X. Necessarily, $(X, \|\cdot\|)$ must be a Banach space since it is homeomorphic to $(X^{**}, \|\cdot\|_{**})$.
Let $(X, \|\cdot\|_X)$ and $(Y, \|\cdot\|_Y)$ be normed spaces and let $L \in \mathcal{L}(X;Y)$. Given $y^* \in Y^*$, the function $\zeta_{y^*} : X \to \mathbb{R}$ defined by
$$\zeta_{y^*}(x) = \langle y^*, L(x)\rangle_{Y^*\!,Y}$$
is linear and continuous. In other words, $\zeta_{y^*} \in X^*$. The adjoint of L is the operator $L^* : Y^* \to X^*$ defined by $L^*(y^*) = \zeta_{y^*}$. In other words, L and $L^*$ are linked by the identity
$$\langle L^*(y^*), x\rangle_{X^*\!,X} = \langle y^*, L(x)\rangle_{Y^*\!,Y}.$$
We shall verify later (Corollary 1.18) that the two norms actually coincide.
Let X be a real vector space. A set C ⊆ X is convex if for each pair of points of C, the segment joining them also belongs to C; in other words, if the point $\lambda x + (1-\lambda)y$ belongs to C whenever x, y ∈ C and λ ∈ (0, 1).
Remarks
Before proceeding with the proof of the Hahn–Banach Separation Theorem 1.10, some remarks are in order:
Theorem 1.11. Let C be a nonempty, convex subset of a normed space $(X, \|\cdot\|)$ not containing the origin.
i) If C is open, there exists $L \in X^* \setminus \{0\}$ such that $\langle L, x\rangle < 0$ for each x ∈ C.
ii) If C is closed, there exist $L \in X^* \setminus \{0\}$ and ε > 0 such that $\langle L, x\rangle + \varepsilon \le 0$ for each x ∈ C.
Clearly, Theorem 1.11 is a special case of Theorem 1.10. To verify that they are actually equivalent, simply write C = A − B and observe that C is open if A is, while C is closed if A is compact and B is closed.
Second, part ii) of Theorem 1.11 can be easily deduced from part i) of Theorem 1.10 by considering a sufficiently small open ball A around the origin (not intersecting C), and writing B = C.
V = { x ∈ RN : v · x = 0}
Therefore,
$$0 \le 2\|p_n\|^2 \le 2\, p_n \cdot x + t\|x - p_n\|^2.$$
Letting t → 0, we deduce that $p_n \cdot x \ge 0$ for all $x \in C_n$. Now write $v_n = -p_n/\|p_n\|$. The sequence $(v_n)$ lies in the unit sphere, which is compact. We may extract a subsequence that converges to some $v \in \mathbb{R}^N$ with $\|v\| = 1$ (thus v ≠ 0) and $v \cdot x \le 0$ for all x ∈ C.
Step 3: Conclusion.
Take any (necessarily continuous) isomorphism $\varphi : X/M \to \mathbb{R}$ and set $L = \varphi \circ \pi$. Then, either $\langle L, x\rangle < 0$ for all x ∈ C, or $\langle -L, x\rangle < 0$ for all x ∈ C.
Proof. Define
L, x − μ ≤ L − , y
L, x ≤ L − , y + α x
Proof. Set $A = B(0, \|x\|)$ and $B = \{x\}$. By Theorem 1.10, there exists $L_x \in X^* \setminus \{0\}$ such that $\langle L_x, y\rangle \le \langle L_x, x\rangle$ for all y ∈ A. This implies $\langle L_x, x\rangle = \|L_x\|_* \|x\|$. The functional $x^* = L_x/\|L_x\|_*$ has the desired properties.
The set F (x) is always convex and it need not be a singleton, as shown in the
following example:
Example 1.15. Consider $X = \mathbb{R}^2$ with $\|(x_1, x_2)\| = |x_1| + |x_2|$ for $(x_1, x_2) \in X$. Then, $X^* = \mathbb{R}^2$ with $\langle (x_1^*, x_2^*), (x_1, x_2)\rangle = x_1^* x_1 + x_2^* x_2$ and $\|(x_1^*, x_2^*)\|_* = \max\{|x_1^*|, |x_2^*|\}$ for $(x_1^*, x_2^*) \in X^*$. Moreover, $F(1, 0) = \{(1, b) \in \mathbb{R}^2 : b \in [-1, 1]\}$.
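As a quick sanity check of this example (ours, not from the text; it assumes the usual definition $F(x) = \{x^* \in X^* : \langle x^*, x\rangle = \|x\|^2 = \|x^*\|_*^2\}$), the following NumPy sketch tests which vectors $x^* = (1, b)$ belong to F(1, 0):

```python
import numpy as np

x = np.array([1.0, 0.0])
norm_x = np.abs(x).sum()                # l1 norm of x

for b in np.linspace(-1.5, 1.5, 7):
    x_star = np.array([1.0, b])
    pairing = x_star @ x                # <x*, x>
    dual_norm = np.abs(x_star).max()    # dual (sup) norm of x*
    # x* belongs to F(x) iff <x*, x> = ||x||^2 and ||x*||_* = ||x||
    in_F = np.isclose(pairing, norm_x**2) and np.isclose(dual_norm, norm_x)
    print(f"b = {b:+.2f}: in F(1,0)? {in_F}")
# prints True exactly for |b| <= 1
```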
From Corollary 1.14 we deduce the following:
Corollary 1.16. For every x ∈ X, $\|x\| = \max_{\|L\|_* = 1} \langle L, x\rangle$.
Recall from Sect. 1.1.1 that the canonical embedding $\mathcal{J}$ of X into $X^{**}$ is defined by $\mathcal{J}(x) = \mu_x$, where $\mu_x$ satisfies $\mu_x(L) = \langle L, x\rangle$ for each $L \in X^*$. Recall also that $\mathcal{J}$ is linear, and $\|\mathcal{J}(x)\|_{**} \le \|x\|$ for all x ∈ X. We have the following:
Proposition 1.17. The canonical embedding $\mathcal{J} : X \to X^{**}$ is a linear isometry.
Proof. It remains to prove that $\|x\| \le \|\mu_x\|_{**}$. To this end, simply notice that, by Corollary 1.16, $\|x\| = \max_{\|L\|_* = 1} \langle L, x\rangle = \max_{\|L\|_* = 1} \mu_x(L) \le \|\mu_x\|_{**}$.
Corollary 1.18. Let $(X, \|\cdot\|_X)$ and $(Y, \|\cdot\|_Y)$ be normed spaces. Given $L \in \mathcal{L}(X;Y)$, let $L^* \in \mathcal{L}(Y^*; X^*)$ be its adjoint. Then $\|L^*\|_{\mathcal{L}(Y^*;X^*)} = \|L\|_{\mathcal{L}(X;Y)}$.
Proof. We already proved that $\|L^*\|_{\mathcal{L}(Y^*;X^*)} \le \|L\|_{\mathcal{L}(X;Y)}$. For the reverse inequality, use Corollary 1.16 to deduce that
$$\|L\|_{\mathcal{L}(X;Y)} = \sup_{\|x\|_X = 1}\ \max_{\|y^*\|_{Y^*} = 1} \langle L^*(y^*), x\rangle_{X^*\!,X} \le \|L^*\|_{\mathcal{L}(Y^*;X^*)},$$
is open for the weak topology and contains x0 . Moreover, the collection of all such
sets generates a base of neighborhoods of x0 for the weak topology in the sense that
if V0 is a neighborhood of x0 , then there exist L1 , . . . , LN ∈ X ∗ and ε > 0 such that
$$x_0 \in \bigcap_{k=1}^N V_{L_k}^{\varepsilon}(x_0) \subset V_0.$$
Recall that a Hausdorff space is a topological space in which every two distinct
points admit disjoint neighborhoods.
Proposition 1.19. (X, σ ) is a Hausdorff space.
Proof. Let $x_1 \neq x_2$. Part ii) of the Hahn–Banach Separation Theorem 1.10 shows the existence of $L \in X^*$ such that $\langle L, x_1\rangle + \varepsilon \le \langle L, x_2\rangle$. Then $V_L^{\varepsilon/2}(x_1)$ and $V_{-L}^{\varepsilon/2}(x_2)$ are disjoint weakly open sets containing $x_1$ and $x_2$, respectively.
It follows from the definition that every weakly open set is (strongly) open. If X is
finite-dimensional, the weak topology coincides with the strong topology. Roughly
1 A trivial example is the discrete topology (for which every set is open), but it is not very useful
for our purposes.
speaking, the main idea is that, inside any open ball, one can find a weak neighbor-
hood of its center that can be expressed as a finite intersection of open half-spaces.
[Figure: a basic weak neighborhood of $x_0$, bounded by the hyperplanes $\langle L_k, x - x_0\rangle = \varepsilon$, k = 1, 2, 3.]
Proof. By definition, every weakly closed set must be strongly closed. Conversely, let C ⊂ X be convex and strongly closed. Given $x_0 \notin C$, we may apply part ii) of the Hahn–Banach Separation Theorem 1.10 with $A = \{x_0\}$ and B = C to deduce the existence of $L \in X^* \setminus \{0\}$ and ε > 0 such that $\langle L, x_0\rangle + \varepsilon \le \langle L, y\rangle$ for each y ∈ C. The set $V_L^{\varepsilon}(x_0)$ is a weak neighborhood of $x_0$ that does not intersect C. It follows that $C^c$ is weakly open.
We say that a sequence $(x_n)$ in X converges weakly to x̄, and write $x_n \rightharpoonup \bar{x}$ as n → ∞, if $\lim_{n\to\infty} \langle L, x_n - \bar{x}\rangle = 0$ for each $L \in X^*$. This is equivalent to saying that for each weakly open neighborhood V of x̄ there is N ∈ N such that $x_n \in V$ for all n ≥ N. The point x̄ is the weak limit of the sequence.
Since $|\langle L, x_n - \bar{x}\rangle| \le \|L\|_* \|x_n - \bar{x}\|$, strongly convergent sequences are weakly convergent and the limits coincide.
Proposition 1.22. Let $(x_n)$ converge weakly to x̄ as n → ∞. Then:
i) $(x_n)$ is bounded.
ii) $\|\bar{x}\| \le \liminf_{n\to\infty} \|x_n\|$.
iii) If $(L_n)$ is a sequence in $X^*$ that converges strongly to L̄, then $\lim_{n\to\infty} \langle L_n, x_n\rangle = \langle \bar{L}, \bar{x}\rangle$.
Proof. For i), write $\mu_n = \mathcal{J}(x_n)$, where $\mathcal{J}$ is the canonical injection of X into $X^{**}$, which is a linear isometry, by Proposition 1.17. Since $\lim_{n\to\infty} \mu_n(L) = \langle L, \bar{x}\rangle$ for all $L \in X^*$, we have $\sup_{n\in\mathbb{N}} \mu_n(L) < +\infty$. The Banach–Steinhaus uniform boundedness principle (Theorem 1.9) implies $\sup_{n\in\mathbb{N}} \|x_n\| = \sup_{n\in\mathbb{N}} \|\mu_n\|_{**} \le C$ for some C > 0. For ii), use Corollary 1.14 to find $x^* \in X^*$ with $\|x^*\|_* = 1$ and $\langle x^*, \bar{x}\rangle = \|\bar{x}\|$, so that $\|\bar{x}\| = \lim_{n\to\infty} \langle x^*, x_n\rangle \le \liminf_{n\to\infty} \|x_n\|$. Finally,
$$|\langle L_n, x_n\rangle - \langle \bar{L}, \bar{x}\rangle| \le \|L_n - \bar{L}\|_* \|x_n\| + |\langle \bar{L}, x_n - \bar{x}\rangle|.$$
As n → ∞, we obtain iii).
Since we have defined two topologies on X, there exist (strongly) closed sets and
weakly closed sets. It is possible and very useful to define some sequential notions
as well. A set C ⊂ X is sequentially closed if every convergent sequence of points
in C has its limit in C. Analogously, C is weakly sequentially closed if every weakly
convergent sequence in C has its weak limit in C. The relationship between the
various notions of closedness is summarized in the following result:
Proposition 1.23. Consider the following statements concerning a nonempty set
C ⊂ X:
i) C is weakly closed.
ii) C is weakly sequentially closed.
iii)C is sequentially closed.
iv)C is closed.
Then i) ⇒ ii) ⇒ iii) ⇔ iv) ⇐ i). If C is convex, the four statements are equivalent.
Proof. It is easy to show that i) ⇒ ii) and iii) ⇔ iv). We also know that i) ⇒ iv) and
ii) ⇒ iii) because the weak topology is coarser than the strong topology. Finally, if C
is convex, Proposition 1.21 states precisely that i) ⇔ iv), which closes the loop.
The topological dual $X^*$ of a normed space $(X, \|\cdot\|)$ is a Banach space with the norm $\|\cdot\|_*$. As in Sect. 1.1.3, we can define the weak topology $\sigma(X^*)$ in $X^*$. Recall that a base of neighborhoods for some point $L \in X^*$ is generated by the sets of the form
$$\{\ell \in X^* : \langle \mu, \ell - L\rangle_{X^{**}\!,X^*} < \varepsilon\}, \qquad (1.1)$$
with $\mu \in X^{**}$ and ε > 0.
with x ∈ X and ε > 0. Notice the similarity and difference with (1.1). Now, since every x ∈ X determines an element $\mu_x \in X^{**}$ by the relation $\mu_x(L) = \langle L, x\rangle$, it is clear that every set that is open for the weak∗ topology must be open for the weak topology as well. In other words, $\sigma^* \subset \sigma$.
In infinite-dimensional normed spaces, compact sets are rather scarce. For instance,
in such spaces the closed balls are not compact (see [30, Theorem 6.5]). One of the
most important properties of the weak∗ topology is that, according to the Banach–
Alaoglu Theorem (see, for instance, [30, Theorem 3.16]), the closed unit ball in X ∗
is compact for the weak∗ topology. Recall, from Sect. 1.1.1, that X is reflexive if
the canonical embedding J of X into X ∗∗ is surjective. This implies that the spaces
(X, σ ) and (X ∗∗ , σ ∗ ) are homeomorphic, and so, the closed unit ball of X is compact
for the weak topology. Further, we have the following:
Theorem 1.24. Let (X, · ) be a Banach space. The following are equivalent:
i) X is reflexive.
ii) The closed unit ball B̄(0, 1) is compact for the weak topology.
iii)Every bounded sequence in X has a weakly convergent subsequence.
We shall not include the proof here for the sake of brevity. The interested reader
may consult [50, Chaps. II and V] for full detail, or [30, Chap. 3] for abridged
commentaries.
Proof. Suppose $(y_n)$ does not converge weakly to ŷ. Then, there exist a weakly open neighborhood V of ŷ and a subsequence $(y_{k_n})$ of $(y_n)$ such that $y_{k_n} \notin V$ for all n ∈ N. Since $(y_{k_n})$ is bounded, it has a further subsequence that converges weakly as n → ∞ to some y̌, which cannot be in V, so that y̌ ≠ ŷ. This contradicts the uniqueness of ŷ.
$$f'(x; d) = \lim_{t\to 0^+} \frac{f(x + td) - f(x)}{t},$$
whenever this limit exists. The function f is differentiable in the sense of Gâteaux (or simply Gâteaux-differentiable) at x if $f'(x; d)$ exists for all d ∈ X and the function $d \mapsto f'(x; d)$ is linear and continuous. In this situation, the Gâteaux derivative (or gradient) of f at x is $\nabla f(x) = f'(x; \cdot)$, which is an element of $X^*$. On the other hand, f is differentiable in the sense of Fréchet (or Fréchet-differentiable) at x if there exists $L \in X^*$ such that
$$\lim_{\|h\|\to 0} \frac{|f(x + h) - f(x) - \langle L, h\rangle|}{\|h\|} = 0.$$
If this is the case, the Fréchet derivative of f at x is $Df(x) = L$. An immediate consequence of these definitions is
Proposition 1.26. If f is Fréchet-differentiable at x, then it is continuous and
Gâteaux-differentiable there, with ∇ f (x) = D f (x).
As usual, f is differentiable (in the sense of Gâteaux or Fréchet) on A if it is so
at every point of A.
for all x, y, z ∈ X and α ∈ R. Suppose also that B is continuous: $|B(x, y)| \le \beta\|x\|\|y\|$ for some β ≥ 0 and all x, y ∈ X. The function f : X → R, defined by f(x) = B(x, x), is Fréchet-differentiable and $Df(x)h = B(x, h) + B(h, x)$. Of course, if B is symmetric: B(x, y) = B(y, x) for all x, y ∈ X, then $Df(x)h = 2B(x, h)$.
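To illustrate, here is a small numerical sketch (NumPy; the matrix M and the test points are arbitrary choices of ours, not from the text) checking that the difference quotient of f(x) = B(x, x), with $B(u, v) = u^\top M v$, approaches B(x, h) + B(h, x) as t → 0⁺:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((3, 3))        # an arbitrary (non-symmetric) bilinear form
B = lambda u, v: u @ M @ v             # B(u, v) = u^T M v
f = lambda u: B(u, u)                  # f(x) = B(x, x)

x = rng.standard_normal(3)
h = rng.standard_normal(3)
exact = B(x, h) + B(h, x)              # claimed value of Df(x)h

for t in [1e-1, 1e-3, 1e-6]:
    quotient = (f(x + t * h) - f(x)) / t
    print(f"t = {t:.0e}: |quotient - exact| = {abs(quotient - exact):.2e}")
# the error is O(t), since f(x+th) = f(x) + t[B(x,h)+B(h,x)] + t^2 B(h,h)
```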
where we use ∇i to denote the gradient with respect to the i-th set of variables.
It is worth noting that Gâteaux-differentiability does not imply continuity. In particular, it is weaker than Fréchet-differentiability.
A simple computation shows that ∇f(0, 0) = (0, 0). However, $\lim_{z\to 0} f(z, z^2) = 1 \neq f(0, 0)$.
If the gradient of f is Lipschitz-continuous, we can obtain a more precise first-order estimate for the values of the function:
Lemma 1.30 (Descent Lemma). If f : X → R is Gâteaux-differentiable and ∇f is Lipschitz-continuous with constant L, then
$$f(y) \le f(x) + \langle \nabla f(x), y - x\rangle + \frac{L}{2}\|y - x\|^2$$
for each x, y ∈ X. In particular, f is continuous.
Therefore, writing h = y − x,
$$f(y) - f(x) = \int_0^1 \langle \nabla f(x), h\rangle\, dt + \int_0^1 \langle \nabla f(x + th) - \nabla f(x), h\rangle\, dt$$
$$\le \langle \nabla f(x), h\rangle + \int_0^1 \|\nabla f(x + th) - \nabla f(x)\|\,\|h\|\, dt$$
$$\le \langle \nabla f(x), h\rangle + \int_0^1 L\|h\|^2\, t\, dt$$
$$= \langle \nabla f(x), y - x\rangle + \frac{L}{2}\|y - x\|^2,$$
as claimed.
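The Descent Lemma underlies the classical gradient method with step size 1/L. The sketch below (NumPy; the quadratic objective and its data are illustrative choices of ours, not from the text) runs $x_{k+1} = x_k - \frac{1}{L}\nabla f(x_k)$ and verifies the inequality of Lemma 1.30 with $y = x_{k+1}$ at every step:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 5))
Q = A.T @ A + np.eye(5)                 # symmetric positive definite
b = rng.standard_normal(5)

f = lambda x: 0.5 * x @ Q @ x - b @ x
grad = lambda x: Q @ x - b
L = np.linalg.eigvalsh(Q).max()         # Lipschitz constant of the gradient

x = np.zeros(5)
for k in range(50):
    g = grad(x)
    x_new = x - g / L                   # gradient step with step size 1/L
    # Descent Lemma: f(y) <= f(x) + <grad f(x), y - x> + (L/2)||y - x||^2
    bound = f(x) + g @ (x_new - x) + 0.5 * L * np.sum((x_new - x) ** 2)
    assert f(x_new) <= bound + 1e-12
    x = x_new

print("final gradient norm:", np.linalg.norm(grad(x)))
```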
Second Derivatives
$$(\nabla f)'(x; d) = \lim_{t\to 0^+} \frac{\nabla f(x + td) - \nabla f(x)}{t},$$
whenever this limit exists (with respect to the strong topology of X ∗ ). The func-
tion f is twice differentiable in the sense of Gâteaux (or simply twice Gâteaux-
differentiable) in x if f is Gâteaux-differentiable in a neighborhood of x, (∇ f ) (x; d)
exists for all d ∈ X, and the function d → (∇ f ) (x; d) is linear and continu-
ous. In this situation, the second Gâteaux derivative (or Hessian) of f at x is
∇2 f (x) = (∇ f ) (x; ·), which is an element of L (X; X ∗ ). Similarly, f is twice differ-
entiable in the sense of Fréchet (or twice Fréchet-differentiable) at x if there exists
$M \in \mathcal{L}(X; X^*)$ such that
$$\lim_{\|h\|\to 0} \frac{\|Df(x + h) - Df(x) - M(h)\|_*}{\|h\|} = 0.$$
Define $\varphi(t) = f(x + td)$, so that $\varphi'(t) = \langle \nabla f(x + td), d\rangle$ and $\varphi''(0) = \langle \nabla^2 f(x)d, d\rangle$. The second-order Taylor expansion for φ in R yields
$$\lim_{t\to 0} \frac{1}{t^2}\left[\varphi(t) - \varphi(0) - t\varphi'(0) - \frac{t^2}{2}\varphi''(0)\right] = 0,$$
which gives the result.
Of course, it is possible to define derivatives of higher order, and obtain the cor-
responding Taylor approximations.
Proof. Take y ∈ C. Since C is convex, for each λ ∈ (0, 1), the point $y_\lambda = \lambda y + (1-\lambda)\hat{x}$ belongs to C. The inequality $f(\hat{x}) \le f(y_\lambda)$ is equivalent to $f(\hat{x} + \lambda(y - \hat{x})) - f(\hat{x}) \ge 0$. It suffices to divide by λ and let λ → 0 to deduce that $f'(\hat{x}; y - \hat{x}) \ge 0$ for all y ∈ C.
As we shall see in the next chapter, the condition given by Fermat’s Rule (The-
orem 1.32) is not only necessary, but also sufficient, for convex functions. In the
general case, one can provide second-order necessary and sufficient conditions for
optimality. We state this result in the unconstrained case (C = X) for simplicity.
$$\frac{t^2}{2}\langle \nabla^2 f(\hat{x})d, d\rangle \ge f(\hat{x} + td) - f(\hat{x}) - \varepsilon t^2 \ge -\varepsilon t^2$$
for all $t \in [0, t_0]$. It follows that $\langle \nabla^2 f(\hat{x})d, d\rangle \ge 0$.
For ii), assume $\nabla^2 f(\hat{x})$ is uniformly elliptic with constant α > 0 and take d ∈ X. Set $\varepsilon = \frac{\alpha}{4}\|d\|^2$. By Proposition 1.31, there is $t_1 > 0$ such that
$$f(\hat{x} + td) \ge f(\hat{x}) + \frac{t^2}{2}\langle \nabla^2 f(\hat{x})d, d\rangle - \varepsilon t^2 \ge f(\hat{x}) + \frac{\alpha}{4}t^2\|d\|^2 > f(\hat{x})$$
for all $t \in (0, t_1]$.
Hilbert spaces are an important class of Banach spaces with rich geometric
properties.
Therefore,
$$|\langle x, y\rangle| \le \frac{1}{2\alpha}\|x\|^2 + \frac{\alpha}{2}\|y\|^2$$
for each α > 0. In particular, taking $\alpha = \|x\|/\|y\|$, we deduce i). Next, we use i) to deduce that
$$\|x + y\|^2 = \|x\|^2 + 2\langle x, y\rangle + \|y\|^2 \le \|x\|^2 + 2\|x\|\|y\| + \|y\|^2 = (\|x\| + \|y\|)^2,$$
$$\cos(\gamma) = \frac{\langle x, y\rangle}{\|x\|\,\|y\|}, \qquad \gamma \in [0, \pi].$$
We shall say that x and y are orthogonal, and write x ⊥ y, if cos(γ) = 0. In a similar fashion, we say x and y are parallel, and write x ∥ y, if |cos(γ)| = 1. With this notation, we have
for each x, y ∈ H. It shows the relationship between the lengths of the sides and the lengths of the diagonals in a parallelogram, and is easily proved by adding the following identities:
$$\|x + y\|^2 = \|x\|^2 + \|y\|^2 + 2\langle x, y\rangle \quad\text{and}\quad \|x - y\|^2 = \|x\|^2 + \|y\|^2 - 2\langle x, y\rangle.$$
[Figure: a parallelogram with sides x, y and diagonals x + y, x − y.]
Example 1.36. The space X = C([0, 1]; R) with the norm $\|x\| = \max_{t\in[0,1]} |x(t)|$ is not a Hilbert space. Consider the functions x, y ∈ X, defined by x(t) = 1 and y(t) = t for t ∈ [0, 1]. We have $\|x\| = 1$, $\|y\| = 1$, $\|x + y\| = 2$ and $\|x - y\| = 1$. The parallelogram identity (1.3) does not hold, since $2^2 + 1^2 = 5 \neq 4 = 2(1^2 + 1^2)$.
An important property of Hilbert spaces is that given a nonempty, closed, and convex subset K of H and a point $x \notin K$, there is a unique point in K which is the closest to x. More precisely, we have the following:
Proposition 1.37. Let K be a nonempty, closed, and convex subset of H and let x ∈ H. Then, there exists a unique point $y^* \in K$ such that
$$\|x - y^*\| = \min_{y\in K} \|x - y\|. \qquad (1.4)$$
Moreover, $y^*$ is characterized by the property
$$\langle x - y^*, y - y^*\rangle \le 0 \quad\text{for all } y \in K. \qquad (1.5)$$
Proof. We shall prove Proposition 1.37 in three steps: first, we verify that (1.4) has a solution; next, we establish the equivalence between (1.4) and (1.5); and finally, we check that (1.5) cannot have more than one solution.
First, set $d = \inf_{y\in K} \|x - y\|$ and consider a sequence $(y_n)$ in K such that $\lim_{n\to\infty} \|y_n - x\| = d$. By the parallelogram identity, since $\frac{1}{2}(y_n + y_m) \in K$,
$$\|y_n - y_m\|^2 = 2\|y_n - x\|^2 + 2\|y_m - x\|^2 - 4\left\|\tfrac{y_n + y_m}{2} - x\right\|^2 \le 2\|y_n - x\|^2 + 2\|y_m - x\|^2 - 4d^2 \to 0.$$
Whence, $(y_n)$ is a Cauchy sequence, and must converge to some $y^*$, which must lie in K by closedness. The continuity of the norm implies $d = \lim_{n\to\infty} \|y_n - x\| = \|y^* - x\|$.
Next, assume (1.4) holds and let y ∈ K. Since K is convex, for each λ ∈ (0, 1) the point $\lambda y + (1-\lambda)y^*$ also belongs to K. Therefore,
$$\|x - y^*\|^2 \le \|x - \lambda y - (1-\lambda)y^*\|^2 = \|x - y^*\|^2 + 2\lambda\langle x - y^*, y^* - y\rangle + \lambda^2\|y^* - y\|^2.$$
This implies
$$\langle x - y^*, y - y^*\rangle \le \frac{\lambda}{2}\|y^* - y\|^2.$$
Letting λ → 0 we obtain (1.5). Conversely, if (1.5) holds, then
$$\|x - y\|^2 = \|x - y^*\|^2 + 2\langle x - y^*, y^* - y\rangle + \|y^* - y\|^2 \ge \|x - y^*\|^2$$
for every y ∈ K, which gives (1.4).
The point $y^*$ given by Proposition 1.37 is the projection of x onto K and will be denoted by $P_K(x)$. The characterization of $P_K(x)$ given by (1.5) says that for each $x \notin K$, the set K lies in the closed half-space
$$S = \{\, y \in H : \langle x - P_K(x), y - P_K(x)\rangle \le 0 \,\}.$$
Taking $y = P_M(x) \pm u$ with u ∈ M in (1.5), we obtain $\pm\langle x - P_M(x), u\rangle \le 0$, and hence
$$\langle x - P_M(x), u\rangle = 0$$
for every u ∈ M.
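Concretely, the projection onto a closed ball has a simple closed form, and (1.5) can be tested numerically. The following sketch (NumPy; the ball and the test points are illustrative choices of ours) does exactly that:

```python
import numpy as np

rng = np.random.default_rng(2)
c, r = np.array([1.0, -2.0, 0.5]), 2.0          # K = closed ball B(c, r)

def project(x):
    """Projection onto the closed ball: move x radially toward c if outside."""
    d = x - c
    n = np.linalg.norm(d)
    return x if n <= r else c + r * d / n

x = np.array([5.0, 3.0, -4.0])                  # a point outside K
p = project(x)

# Characterization (1.5): <x - P_K(x), y - P_K(x)> <= 0 for all y in K.
for _ in range(1000):
    y = project(c + rng.uniform(-3, 3, 3))      # a random point of K
    assert (x - p) @ (y - p) <= 1e-10

print("projection:", p)
```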
The topological dual of a real Hilbert space can be easily characterized. Given y ∈ H, the function $L_y : H \to \mathbb{R}$, defined by $L_y(h) = \langle y, h\rangle$, is linear and continuous by the Cauchy–Schwarz inequality. Moreover, $\|L_y\|_* = \|y\|$. Conversely, we have the following:
Theorem 1.41 (Riesz Representation Theorem). Let L : H → R be a continuous linear function on H. Then, there exists a unique $y_L \in H$ such that
$$L(h) = \langle y_L, h\rangle$$
for all h ∈ H.
$\hat{x} = x - P_M(x)$. The vector
$$y_L = \frac{L(\hat{x})}{\|\hat{x}\|^2}\,\hat{x}$$
has the desired property and it is straightforward to verify that the function $L \mapsto y_L$ is a linear isometry.
As a consequence, the inner product $\langle\cdot,\cdot\rangle_* : H^* \times H^* \to \mathbb{R}$ defined by $\langle L_y, L_z\rangle_* = \langle y, z\rangle$
Proof. The only if part is immediate. For the if part, notice that
$$0 \le \limsup_{n\to\infty} \|x_n - \bar{x}\|^2 = \limsup_{n\to\infty} \big(\|x_n\|^2 + \|\bar{x}\|^2 - 2\langle x_n, \bar{x}\rangle\big) \le 0.$$
Remark 1.45. Another consequence of Theorem 1.41 is that we may interpret the
gradient of a Gâteaux-differentiable function f as an element of H instead of H ∗
(see Sect. 1.1.4). This will be useful in the design of optimization algorithms (see
Chap. 6).
Chapter 2
Existence of Minimizers
are equivalent. The advantage of the second formulation is that linear, geometric, or
topological properties of the underlying space X may be exploited.
$$\Gamma_\gamma(f) = \{x \in X : f(x) \le \gamma\}.$$
If $x \in \mathrm{dom}(f)$, then $x \in \Gamma_{f(x)}(f)$. Therefore, $\mathrm{dom}(f) = \bigcup_{\gamma\in\mathbb{R}} \Gamma_\gamma(f)$. Observe that
$$\mathrm{argmin}(f) = \bigcap_{\gamma > \inf(f)} \Gamma_\gamma(f). \qquad (2.1)$$
This set includes the graph of f and all the points above it.
Proof. Let $(\gamma_n)$ be a nonincreasing sequence such that $\lim_{n\to\infty} \gamma_n = \inf(f)$ (possibly −∞). For each n ∈ N consider the compact set $K_n = \Gamma_{\gamma_n}(f)$. Then $K_{n+1} \subset K_n$ for each n ∈ N, and so
$$\mathrm{argmin}(f) = \bigcap_{n\in\mathbb{N}} K_n$$
is nonempty, as a nested intersection of nonempty compact sets.
$$C_n = \{\, x \in X : f(x) + \lambda\|x - x_n\| \le f(x_n) \,\}$$
and
$$v_n = \inf\{\, f(x) : x \in C_n \,\},$$
and take $x_{n+1} \in C_n$ such that
$$f(x_{n+1}) \le v_n + \frac{1}{n+1}.$$
Observe that $(C_n)$ is a nested (decreasing) sequence of closed sets, and $f(x_n)$ is nonincreasing and bounded from below by $f(x_0) - \varepsilon$. Clearly, for each h ≥ 0 we have
$$\lambda\|x_{n+h} - x_n\| \le f(x_n) - f(x_{n+h}),$$
and $(x_n)$ is a Cauchy sequence. Its limit, which we denote x̄, belongs to $\bigcap_{n\ge 0} C_n$. In particular,
$$f(\bar{x}) + \lambda\|\bar{x} - x_0\| \le f(x_0)$$
and
$$\lambda\|\bar{x} - x_0\| \le f(x_0) - f(\bar{x}) \le \varepsilon.$$
$$f(\tilde{x}) + \lambda\|\tilde{x} - x_n\| \le f(x_n) \le f(\tilde{x}) + \frac{1}{n}$$
for all n, and we deduce that x̃ = x̄.
Minimizing Sequences
$\liminf_{n\to\infty} f(u_{j_n}(t)) \ge f(\bar{u}(t))$ for almost every t. By Fatou's Lemma (see, for instance, [30, Lemma 4.1]), we obtain
Remark 2.10. It is easy to prove that f is convex if, and only if, epi( f ) is a convex
subset of X × R. Moreover, if f is convex, then each sublevel set Γγ ( f ) is con-
vex. Obviously the converse is not true in general: Simply consider any nonconvex
monotone function on X = R. Functions whose level sets are convex are called
quasi-convex.
We shall provide some practical characterizations of convexity for differentiable
functions in Sect. 3.2.
Example 2.11. Let B : X × X → R be a bilinear function and define f (x) = B(x, x),
as we did in Example 1.27. For each x, y ∈ X and λ ∈ (0, 1), we have
We mention some operations that allow us to construct convex functions from others. Another very important example will be studied in detail in Sect. 3.6.
Example 2.13. Suppose A : X → Y is affine, f : Y → R ∪ {+∞} is convex and
θ : R → R ∪ {+∞} convex and nondecreasing. Then, the function g = θ ◦ f ◦ A :
X → R ∪ {+∞} is convex.
Example 2.14. If f and g are convex and if α ≥ 0, then f + α g is convex. It follows
that the set of convex functions is a convex cone.
Example 2.15. If $(f_i)_{i\in I}$ is a family of convex functions, then $\sup(f_i)$ is convex, since $\mathrm{epi}(\sup(f_i)) = \bigcap_{i\in I} \mathrm{epi}(f_i)$ and the intersection of convex sets is convex.
Example 2.16. In general, the infimum of convex functions need not be convex. However, we have the following: Let V be a vector space and let $g : X \times V \to \mathbb{R}\cup\{+\infty\}$ be convex. Then, the function $f : X \to \mathbb{R}\cup\{+\infty\}$ defined by $f(x) = \inf_{v\in V} g(x, v)$ is convex. Here, the facts that g is convex in the product space and the infimum is taken over a whole vector space are crucial.
$$\lim_{\|x\|\to\infty} f(x) = \infty,$$
Proof. The function f fulfills the hypotheses of Theorem 2.5 for the weak topology.
Clearly, a strictly convex function cannot have more than one minimizer.
Chapter 3
Convex Analysis and Subdifferential Calculus
Abstract This chapter deals with several properties of convex functions, especially
in connection with their regularity, on the one hand, and the characterization of
their minimizers, on the other. We shall explore sufficient conditions for a convex
function to be continuous, as well as several connections between convexity and
differentiability. Next, we present the notion of subgradient, a generalization of the
concept of derivative for nondifferentiable convex functions that will allow us to
characterize their minimizers. After discussing conditions that guarantee their exis-
tence, we present the basic (yet subtle) calculus rules, along with their remarkable
consequences. Other important theoretical and practical tools, such as the Fenchel
conjugate and the Lagrange multipliers, will also be studied. These are particularly
useful for solving constrained problems.
Pretty much as closed convex sets are intersections of closed half-spaces, any lower-
semicontinuous convex function can be represented as a supremum of continuous
affine functions:
Proposition 3.1. Let f : X → R ∪ {+∞} be proper. Then, f is convex and lower-
semicontinuous if, and only if, there exists a family ( fi )i∈I of continuous affine func-
tions on X such that f = sup( fi ).
for all $(x, \lambda) \in \mathrm{epi}(f)$. Clearly, s ≥ 0; otherwise, we may take $x \in \mathrm{dom}(f)$ and λ sufficiently large to contradict (3.1). We distinguish two cases:
s > 0: We may assume, without loss of generality, that s = 1 (or divide by s and rename L and ε). We set
and take $\lambda = f(x)$ to deduce that $f(x) \ge \alpha(x)$ for all $x \in \mathrm{dom}(f)$ and $\alpha(x_0) > \lambda_0$. Observe that this is valid for each $x_0 \in \mathrm{dom}(f)$.
s = 0: As said above, necessarily $x_0 \notin \mathrm{dom}(f)$, and so, $f(x_0) = +\infty$. Set
and observe that $\alpha_0(x) \le 0$ for all $x \in \mathrm{dom}(f)$ and $\alpha_0(x_0) = \varepsilon > 0$. Now take $\hat{x} \in \mathrm{dom}(f)$ and use the argument of the case s > 0 to obtain a continuous affine function α̂ such that $f(x) \ge \hat{\alpha}(x)$ for all $x \in \mathrm{dom}(f)$. Given n ∈ N set $\alpha_n = \hat{\alpha} + n\alpha_0$. We conclude that $f(x) \ge \alpha_n(x)$ for all $x \in \mathrm{dom}(f)$ and $\lim_{n\to\infty} \alpha_n(x_0) = \lim_{n\to\infty}(\hat{\alpha}(x_0) + n\varepsilon) = +\infty = f(x_0)$.
The converse is straightforward, since $\mathrm{epi}(\sup(f_i)) = \bigcap_{i\in I} \mathrm{epi}(f_i)$.
Characterization of Continuity
In this subsection, we present some continuity results that reveal how remarkably
regular convex functions are.
Proposition 3.2. Let f : X → R ∪ {+∞} be convex and fix x0 ∈ X. The following
are equivalent:
i) f is bounded from above in a neighborhood of x0 ;
ii) f is Lipschitz-continuous in a neighborhood of x0 ;
iii) f is continuous in x0 ; and
iv) (x0 , λ ) ∈ int(epi( f )) for each λ > f (x0 ).
[Figure: the points x, y, ỹ, $x_0$ and $2x_0 - x$ used in the proof.]
$$f(y) - f(x) \le 2\lambda\,[K - f(x_0)] \le \frac{2(K - f(x_0))}{r}\,\|x - y\|.$$
Interchanging the roles of x and y we conclude that $|f(x) - f(y)| \le M\|x - y\|$ with $M = \frac{2(K - f(x_0))}{r} > 0$.
ii) ⇒ iii): This is straightforward.
iii) ⇒ iv): If f is continuous in x0 and λ > f (x0 ), then for each η ∈ ( f (x0 ), λ )
there exists δ > 0 such that f (z) < η for all z ∈ B(x0 , δ ). Hence, (x0 , λ ) ∈ B(x0 , δ ) ×
(η , +∞) ⊂ epi( f ), and (x0 , λ ) ∈ int(epi( f )).
iv) ⇒ i): If (x0 , λ ) ∈ int(epi( f )), then there exist r > 0 and K > f (x0 ), such that
B(x0 , r) × (K, +∞) ⊂ epi( f ). It follows that f (z) ≤ K for every z ∈ B(x0 , r).
Proposition 3.3. Let (X, · ) be a normed space and let f : X → R ∪ {+∞} be
convex. If f is continuous at some x0 ∈ dom( f ), then x0 ∈ int(dom( f )) and f is
continuous on int(dom( f )).
Proof. If f is continuous at $x_0$, Proposition 3.2 gives $x_0 \in \mathrm{int}(\mathrm{dom}(f))$ and there exist r > 0 and $K > f(x_0)$ such that $f(x) \le K$ for every $x \in B(x_0, r)$. Let $y_0 \in \mathrm{int}(\mathrm{dom}(f))$ and pick ρ > 0 such that the point $z_0 = y_0 + \rho(y_0 - x_0)$ belongs to $\mathrm{dom}(f)$. Take $y \in B\big(y_0, \frac{\rho r}{1+\rho}\big)$ and set $w = x_0 + \big(\frac{1+\rho}{\rho}\big)(y - y_0)$. Solving for y, we see that
$$y = \frac{\rho}{1+\rho}\,w + \frac{1}{1+\rho}\,z_0.$$
On the other hand, $\|w - x_0\| = \frac{1+\rho}{\rho}\|y - y_0\| < r$, and so $w \in B(x_0, r)$ and $f(w) \le K$.
[Figure: the points w, y, $x_0$, $y_0$, $z_0$ and the set dom(f).]
Therefore,
$$f(y) \le \frac{\rho}{1+\rho}f(w) + \frac{1}{1+\rho}f(z_0) \le \max\{K, f(z_0)\}.$$
Since this is true for each $y \in B\big(y_0, \frac{\rho r}{1+\rho}\big)$, we conclude that f is bounded from above in a neighborhood of $y_0$. Proposition 3.2 implies f is continuous at $y_0$.
Proof. Let $\{e_1, \ldots, e_N\}$ generate X. Let $x_0 \in \mathrm{int}(\mathrm{dom}(f))$ and take ρ > 0 small enough so that $x_0 \pm \rho e_i \in \mathrm{dom}(f)$ for all i = 1, ..., N. The convex hull C of these points is a neighborhood of $x_0$, and f is bounded by $\max_{i=1,\ldots,N}\{f(x_0 \pm \rho e_i)\}$ on C. The result follows from Proposition 3.2.
Proof. Fix $x_0 \in \mathrm{int}(\mathrm{dom}(f))$. Without any loss of generality we may assume that $x_0 = 0$. Take λ > f(0). Given x ∈ X, define $g_x : \mathbb{R} \to \mathbb{R}\cup\{+\infty\}$ by $g_x(t) = f(tx)$. Since $0 \in \mathrm{int}(\mathrm{dom}(g_x))$, we deduce that $g_x$ is continuous at 0 by Proposition 3.5. Therefore, there is $t_x > 0$ such that $t_x x \in \Gamma_\lambda(f)$. Repeating this argument for each x ∈ X, we see that $\bigcup_{n\ge 1} n\Gamma_\lambda(f) = X$. Baire's Category Theorem shows that $\Gamma_\lambda(f)$ has nonempty interior and we conclude by Corollary 3.4.
Remark 3.7. To summarize, let (X, · ) be a normed space and let f : X → R ∪
{+∞} be convex. Then f is continuous on int(dom( f )) if either (i) f is continuous
at some point, (ii) X is finite dimensional, or (iii) X is a Banach space and f is
lower-semicontinuous.
$$f'(x; h) = \inf_{t>0} \frac{f(x + th) - f(x)}{t},$$
which exists in $\mathbb{R}\cup\{-\infty, +\infty\}$.
We have the following:
Proposition 3.9. Let $f : X \to \mathbb{R}\cup\{+\infty\}$ be proper and convex, and let $x \in \mathrm{dom}(f)$. Then, the function $\varphi_x : X \to [-\infty, +\infty]$, defined by $\varphi_x(h) = f'(x; h)$, is convex. If, moreover, f is continuous in x, then $\varphi_x$ is finite and continuous in X.
Proof. Take y, z ∈ X, λ ∈ (0, 1) and t > 0. Write h = λ y + (1 − λ )z. By the convexity
of f , we have
$$-L\|h\| \le \frac{f(x + th) - f(x)}{t} \le L\|h\|.$$
It follows that $\mathrm{dom}(\varphi_x) = X$ and that $\varphi_x$ is bounded from above in a neighborhood of 0. Using Proposition 3.2 again, we deduce that $\varphi_x$ is continuous in 0, and, by Proposition 3.3, $\varphi_x$ is continuous in $\mathrm{int}(\mathrm{dom}(\varphi_x)) = X$.
Proof. By convexity,
$$f(x + \lambda(y - x)) \le (1-\lambda)f(x) + \lambda f(y)$$
for all y ∈ X and all λ ∈ (0, 1). Rearranging the terms we get
$$\frac{f(x + \lambda(y - x)) - f(x)}{\lambda} \le f(y) - f(x).$$
As λ → 0 we obtain ii). From ii), we immediately deduce iii).
To prove that iii) ⇒ i), define $\varphi : [0, 1] \to \mathbb{R}$ by
$$\varphi(\lambda) = f\big(\lambda x + (1-\lambda)y\big) - \lambda f(x) - (1-\lambda)f(y).$$
for λ ∈ (0, 1). Take $0 < \lambda_1 < \lambda_2 < 1$ and write $x_i = \lambda_i x + (1-\lambda_i)y$ for i = 1, 2. A simple computation shows that
$$\varphi'(\lambda_1) - \varphi'(\lambda_2) = \frac{1}{\lambda_1 - \lambda_2}\,\langle \nabla f(x_1) - \nabla f(x_2),\, x_1 - x_2\rangle \le 0.$$
for all x, y ∈ X.
Theorem 3.13 (Baillon–Haddad Theorem). Let f : X → R be convex and differen-
tiable. The gradient of f is Lipschitz-continuous with constant L if, and only if, it is
cocoercive with constant 1/L.
Proof. Let x, y, z ∈ X. Suppose ∇f is Lipschitz-continuous with constant L. By the Descent Lemma 1.30, we have
$$f(z) \le f(x) + \langle \nabla f(x), z - x\rangle + \frac{L}{2}\|z - x\|^2.$$
Subtract $\langle \nabla f(y), z\rangle$ from both sides, write $h_y(z) = f(z) - \langle \nabla f(y), z\rangle$ and rearrange the terms to obtain
$$h_y(z) \le h_x(x) + \langle \nabla f(x) - \nabla f(y), z\rangle + \frac{L}{2}\|z - x\|^2.$$
The function $h_y$ is convex, differentiable and $\nabla h_y(y) = 0$. We deduce that $h_y(y) \le h_y(z)$ for all z ∈ X, and so
$$h_y(y) \le h_x(x) + \langle \nabla f(x) - \nabla f(y), z\rangle + \frac{L}{2}\|z - x\|^2. \qquad (3.5)$$
Fix any ε > 0 and pick μ ∈ X such that $\|\mu\| \le 1$ and
$$h_y(y) \le h_x(x) + \langle \nabla f(x) - \nabla f(y), x\rangle - \frac{\|\nabla f(x) - \nabla f(y)\|_*^2}{2L} + R\,\varepsilon.$$
Interchanging the roles of x and y and adding the resulting inequality, we obtain
$$\frac{1}{L}\|\nabla f(x) - \nabla f(y)\|_*^2 \le \langle \nabla f(x) - \nabla f(y), x - y\rangle + 2R\,\varepsilon.$$
Since this holds for each ε > 0, we deduce that ∇f is cocoercive with constant 1/L. The converse is straightforward.
Remark 3.14. From the proof of the Baillon–Haddad Theorem 3.13, we see that the L-Lipschitz continuity and the 1/L-cocoercivity of ∇f are actually equivalent to the validity of the inequality given by the Descent Lemma 1.30, whenever f is convex.
lies below the set epi(f) and touches it at the point (x, f(x)).
for all y ∈ X, or, equivalently, if (3.7) holds for all y in a neighborhood of x. The set of all subgradients of f at x is the subdifferential of f at x and is denoted by ∂f(x). If $\partial f(x) \neq \emptyset$, we say f is subdifferentiable at x. The domain of the subdifferential is the set $\mathrm{dom}(\partial f) = \{x \in X : \partial f(x) \neq \emptyset\}$.
[Figure: graph of f and graph of ∂f.]
if x ∈ C, and $N_C(x) = \emptyset$ if $x \notin C$. Intuitively, the normal cone contains the directions that point outwards with respect to C. Suppose C is a closed affine subspace, that is, $C = \{x_0\} + V$, where $x_0 \in X$ and V is a closed subspace of X. In this case, $N_C(x) = V^\perp$ for all x ∈ C (here $V^\perp$ is the orthogonal space of V, see Sect. 1.1.1).
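In a Hilbert space, a convenient computational test follows from (1.5): $v \in N_C(x)$ if, and only if, $P_C(x + v) = x$. The sketch below (NumPy; the box C and the test vectors are illustrative choices of ours, not from the text) applies this test to $C = [0, 1]^2$:

```python
import numpy as np

def proj_box(z, lo=0.0, hi=1.0):
    """Projection onto the box [lo, hi]^n (componentwise clipping)."""
    return np.clip(z, lo, hi)

def in_normal_cone(v, x):
    """v is in N_C(x) iff projecting x + v back onto C returns x (see (1.5))."""
    return np.allclose(proj_box(x + v), x)

x_corner = np.array([0.0, 0.0])        # a corner of the box
x_inner = np.array([0.5, 0.5])         # an interior point

print(in_normal_cone(np.array([-1.0, -2.0]), x_corner))  # True: points outward
print(in_normal_cone(np.array([1.0, 0.0]), x_corner))    # False: points inward
print(in_normal_cone(np.array([0.0, 0.0]), x_inner))     # True: N_C = {0} inside
print(in_normal_cone(np.array([0.1, 0.0]), x_inner))     # False
```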
The subdifferential really is an extension of the notion of derivative:
Proposition 3.20. Let f : X → R ∪ {+∞} be convex. If f is Gâteaux-differentiable
at x, then x ∈ dom(∂ f ) and ∂ f (x) = {∇ f (x)}.
Proof. First, the gradient inequality (3.6) and the definition of the subdifferential together imply $\nabla f(x) \in \partial f(x)$. Now take any $x^* \in \partial f(x)$. By definition,
$$f(y) \ge f(x) + \langle x^*, y - x\rangle$$
for all y ∈ X. Take any h ∈ X and t > 0, and write y = x + th to deduce that
$$\frac{f(x + th) - f(x)}{t} \ge \langle x^*, h\rangle.$$
Passing to the limit as t → 0, we deduce that $\langle \nabla f(x) - x^*, h\rangle \ge 0$. Since this must be true for each h ∈ X, necessarily $x^* = \nabla f(x)$.
A convex function may have more than one subgradient. However, we have:
Proposition 3.21. For each x ∈ X, the set ∂f(x) is closed and convex.
Proof. Let $x_1^*, x_2^* \in \partial f(x)$ and λ ∈ (0, 1). For each y ∈ X we have
$$f(y) \ge f(x) + \langle x_1^*, y - x\rangle \quad\text{and}\quad f(y) \ge f(x) + \langle x_2^*, y - x\rangle.$$
Add λ times the first inequality and 1 − λ times the second one to obtain
$$f(y) \ge f(x) + \langle \lambda x_1^* + (1-\lambda)x_2^*, y - x\rangle.$$
Since this holds for each y ∈ X, we have $\lambda x_1^* + (1-\lambda)x_2^* \in \partial f(x)$. To see that ∂f(x) is (sequentially) closed, take a sequence $(x_n^*)$ in ∂f(x), converging to some $x^*$. We have
$$f(y) \ge f(x) + \langle x_n^*, y - x\rangle$$
for each y ∈ X and n ∈ N. As n → ∞ we obtain $f(y) \ge f(x) + \langle x^*, y - x\rangle$, and so $x^* \in \partial f(x)$.
Proof. Note that $f(y) \ge f(x) + \langle x^*, y - x\rangle$ and $f(x) \ge f(y) + \langle y^*, x - y\rangle$. It suffices to add these inequalities and rearrange the terms.
By Fermat's Rule (Theorem 3.24) and Proposition 3.20, if f is convex and differentiable, then $0 = \nabla f(\hat{x})$ if, and only if, x̂ is a global minimizer of f. In other words, the only critical points convex functions may have are their global minimizers.
3.4 Subdifferentiability
As pointed out in Example 3.18, subdifferentiability does not imply continuity. Sur-
prisingly enough, the converse is true.
Proposition 3.25. Let f : X → R ∪ {+∞} be convex. If f is continuous at x, then
∂ f (x) is nonempty and bounded.
for every $(y, \lambda) \in \mathrm{int}(\mathrm{epi}(f))$. Taking y = x, we deduce that s ≤ 0. If s = 0, then $\langle L, y - x\rangle \le 0$ for every y in a neighborhood of x. Hence L = 0, which is a contradiction. We conclude that s < 0. Therefore,
with $z^* = -L/s$, for every λ > f(y). Letting λ tend to f(y), we see that
Approximate Subdifferentiability
As seen in Example 3.17, it may happen that $\mathrm{dom}(\partial f) \subsetneq \mathrm{dom}(f)$. This can be an inconvenience when trying to implement optimization algorithms. Given ε > 0, a point $x^* \in X^*$ is an ε-approximate subgradient of f at $x \in \mathrm{dom}(f)$ if
$$f(y) \ge f(x) + \langle x^*, y - x\rangle - \varepsilon \qquad (3.8)$$
for all y ∈ X (compare with the subdifferential inequality (3.7)). The set $\partial_\varepsilon f(x)$ of such vectors is the ε-approximate subdifferential of f at x. Clearly, $\partial f(x) \subset \partial_\varepsilon f(x)$ for each x, and so
$$\mathrm{dom}(\partial f) \subset \mathrm{dom}(\partial_\varepsilon f) \subset \mathrm{dom}(f)$$
for each ε > 0. It turns out that the last inclusion is, in fact, an equality.
Proposition 3.26. Let f : X → R∪{+∞} be proper, convex, and lower-semicontinuous.
Then dom( f ) = dom(∂ε f ) for all ε > 0.
Proof. Let ε > 0 and let $x \in \mathrm{dom}(f)$. By Proposition 3.1, f can be represented as the pointwise supremum of continuous affine functions. Therefore, there exist $x^* \in X^*$ and μ ∈ R, such that
$$f(y) \ge \langle x^*, y\rangle + \mu$$
for all y ∈ X and
$$\langle x^*, x\rangle + \mu \ge f(x) - \varepsilon.$$
Adding these two inequalities, we obtain precisely (3.8).
[Figure: an ε-approximate subgradient; the affine function $y = -\frac{1}{4\varepsilon}x - \varepsilon$ lies below the graph, up to ε.]
In this section, we discuss some calculus rules for the subdifferentials of convex
functions. We begin by providing a Chain Rule for the composition with a bounded
linear operator. Next, we present the Moreau–Rockafellar Theorem, regarding the
subdifferential of a sum. This result has several important consequences, both for
theoretical and practical purposes.
$$A^*\partial f(Ax) \subset \partial(f \circ A)(x).$$
In particular,
$$f(Az) \ge f(Ax) + \langle x^*, A(z - x)\rangle$$
for every z ∈ X. We conclude that
does not intersect $\mathrm{int}(\mathrm{epi}(f))$, which is nonempty by continuity. By part i) of the Hahn–Banach Separation Theorem, there is $(L, s) \in Y^* \times \mathbb{R} \setminus \{(0, 0)\}$ such that
for all z ∈ X and all $(y, \lambda) \in \mathrm{int}(\mathrm{epi}(f))$. Pretty much like in the proof of the Moreau–Rockafellar Theorem 3.30, we prove that s > 0. Then, write $\ell = -L/s$ and let λ → f(y) to obtain
$$f(y) \ge f(Ax) + \langle x^*, z - x\rangle + \langle \ell, y - Az\rangle. \qquad (3.9)$$
Take z = x to deduce that $\ell \in \partial f(Ax)$; taking y = Ax in (3.9) then gives
$$0 \ge \langle x^* - A^*\ell,\, z - x\rangle$$
A natural question is whether the subdifferential of the sum of two functions is the sum of their subdifferentials. We begin by showing that this is not always the case.
Example 3.29. Let $f, g : \mathbb{R} \to \mathbb{R}\cup\{+\infty\}$ be given by
$$f(x) = \begin{cases} 0 & \text{if } x \le 0\\ +\infty & \text{if } x > 0 \end{cases} \qquad\text{and}\qquad g(x) = \begin{cases} +\infty & \text{if } x < 0\\ -\sqrt{x} & \text{if } x \ge 0. \end{cases}$$
We have
$$\partial f(x) = \begin{cases} \{0\} & \text{if } x < 0\\ [0, +\infty) & \text{if } x = 0\\ \emptyset & \text{if } x > 0 \end{cases} \qquad\text{and}\qquad \partial g(x) = \begin{cases} \emptyset & \text{if } x \le 0\\ \big\{-\frac{1}{2\sqrt{x}}\big\} & \text{if } x > 0. \end{cases}$$
In particular, $\partial f(0) + \partial g(0) = \emptyset$, while $\partial(f + g)(0) = \mathbb{R}$.
$$\langle L, y\rangle + s\lambda \le \langle L, z\rangle + s\mu$$
for each (y, λ) ∈ A and each (z, μ) ∈ B. In particular, taking (y, λ) = (x, 1) ∈ A and (z, μ) = (x, 0) ∈ B, we deduce that s ≤ 0. On the other hand, if s = 0, taking $z = x_0$ we see that $\langle L, x_0 - y\rangle \ge 0$ for every y in a neighborhood of $x_0$. This implies L = 0 and contradicts the fact that (L, s) ≠ (0, 0). Therefore, s < 0, and we may write
for every y ∈ dom(g), and we conclude that $z^* \in \partial g(x)$. In a similar fashion, use (3.12) with y = x, any λ > 0 and μ = g(x) − g(z), along with (3.11), to deduce that
Observe that, if $\partial(f + g)(x) = \partial f(x) + \partial g(x)$ for all x ∈ X, then $\mathrm{dom}(\partial(f + g)) = \mathrm{dom}(\partial f) \cap \mathrm{dom}(\partial g)$.
If X is a Banach space, another condition ensuring the equality in (3.10) is that $\bigcup_{t\ge 0} t\,[\mathrm{dom}(f) - \mathrm{dom}(g)]$ be a closed subspace of X. This is known as the Attouch–Brézis condition [9].
Combining the Chain Rule (Proposition 3.28) and the Moreau–Rockafellar Theorem 3.30, we obtain the following:
Corollary 3.31. Let $A \in \mathcal{L}(X;Y)$, and let $f : Y \to \mathbb{R}\cup\{+\infty\}$ and $g : X \to \mathbb{R}\cup\{+\infty\}$ be proper, convex, and lower-semicontinuous. For each x ∈ X, we have
where we have written $g(y) = f(y) - \langle x_0^*, y\rangle$. By Ekeland's Variational Principle (Theorem 2.6), there exists $\bar{x} \in B(x_0, \varepsilon/\lambda)$ such that
for all y ≠ x̄. The latter implies that x̄ is the unique minimizer of the function $h : X \to \mathbb{R}\cup\{+\infty\}$ defined by $h(y) = g(y) + \lambda\|\bar{x} - y\|$. From the Moreau–Rockafellar Theorem 3.30, it follows that $\bar{x} \in \mathrm{dom}(\partial h) = \mathrm{dom}(\partial f)$ and
where $\bar{B}^* = \{x^* \in X^* : \|x^*\|_* \le 1\}$ (see also Example 3.16). In other words, there exists $\bar{x}^* \in \partial f(\bar{x})$ such that $\|\bar{x}^* - x_0^*\|_* \le \lambda$.
Corollary 3.34. Let X be a Banach space and let $f : X \to \mathbb{R}\cup\{+\infty\}$ be convex and lower-semicontinuous. Then $\overline{\mathrm{dom}(\partial f)} = \overline{\mathrm{dom}(f)}$.
Let H be a Hilbert space and let $f : H \to \mathbb{R}\cup\{+\infty\}$ be proper, convex and lower-semicontinuous. Given λ > 0 and x ∈ H, the Moreau–Yosida regularization of f with parameter (λ, x) is the function $f_{(\lambda,x)} : H \to \mathbb{R}\cup\{+\infty\}$, defined by
$$f_{(\lambda,x)}(z) = f(z) + \frac{1}{2\lambda}\|z - x\|^2.$$
The following property will be useful later on, especially in Sect. 6.2, when we
study the proximal point algorithm (actually, it is the core of that method):
Proposition 3.35. For each λ > 0 and x ∈ H, the function $f_{(\lambda,x)}$ has a unique minimizer x̄. Moreover, x̄ is characterized by the inclusion
$$-\frac{\bar{x} - x}{\lambda} \in \partial f(\bar{x}). \qquad (3.13)$$
Proof. Since f is proper, convex, and lower-semicontinuous, $f_{(\lambda,x)}$ is proper, strictly convex, lower-semicontinuous, and coercive. Existence and uniqueness of a minimizer x̄ is given by Theorem 2.19. Finally, Fermat's Rule (Theorem 3.24) and the Moreau–Rockafellar Theorem 3.30 imply that x̄ must satisfy
$$0 \in \partial f_{(\lambda,x)}(\bar{x}) = \partial f(\bar{x}) + \frac{\bar{x} - x}{\lambda},$$
which gives the result.
In other words,
$$\bar{x} = (I + \lambda\partial f)^{-1}x. \qquad (3.14)$$
The operator $J_\lambda = (I + \lambda\partial f)^{-1}$ is the resolvent of ∂f with parameter λ. Therefore,
$$0 \le \|\bar{x} - \bar{y}\|^2 \le \langle x - y, \bar{x} - \bar{y}\rangle \le \|x - y\|\,\|\bar{x} - \bar{y}\|, \qquad (3.15)$$
and we conclude that $\|\bar{x} - \bar{y}\| \le \|x - y\|$.
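For concreteness, here is a sketch (NumPy; the choice f = |·| on H = R is an illustrative example of ours, not from the text) of the resolvent: solving (3.13) for f = |·| gives the so-called soft-thresholding formula, and the nonexpansiveness expressed by (3.15) can be checked numerically:

```python
import numpy as np

def prox_abs(x, lam):
    """Resolvent (I + lam*d|.|)^{-1} of f = |.|: soft-thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

rng = np.random.default_rng(3)
lam = 0.7
for _ in range(1000):
    x, y = rng.uniform(-3, 3, 2)
    # (3.15): the resolvent is nonexpansive
    assert abs(prox_abs(x, lam) - prox_abs(y, lam)) <= abs(x - y) + 1e-12

# Inclusion (3.13): -(xbar - x)/lam must be a subgradient of |.| at xbar
x = 2.0
xbar = prox_abs(x, lam)
print(xbar, -(xbar - x) / lam)   # 1.3 and 1.0 = sign(xbar): consistent
```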
Example 3.37. The indicator function $\delta_C$ of a nonempty, closed, and convex subset C of H is proper, convex, and lower-semicontinuous. Given λ > 0 and x ∈ H, $J_\lambda(x)$ is the unique solution of
$$\min\Big\{\, \delta_C(z) + \frac{1}{2\lambda}\|z - x\|^2 : z \in H \,\Big\},$$
which has the same solution as $\min\{\|z - x\| : z \in C\}$; that is, $J_\lambda(x) = P_C(x)$ for every λ > 0.
Recall from Proposition 3.35 that for each λ > 0 and x ∈ H, the Moreau–Yosida regularization $f_{(\lambda,x)}$ of f has a unique minimizer x̄. The Moreau envelope of f with parameter λ > 0 is the function $f_\lambda : H \to \mathbb{R}$ defined by
$$f_\lambda(x) = \min_{z\in H}\{f_{(\lambda,x)}(z)\} = \min_{z\in H}\Big\{ f(z) + \frac{1}{2\lambda}\|z - x\|^2 \Big\} = f(\bar{x}) + \frac{1}{2\lambda}\|\bar{x} - x\|^2.$$
Moreover, x̂ minimizes f on H if, and only if, it minimizes fλ on H for all λ > 0.
Example 3.38. Let C be a nonempty, closed, and convex subset of H, and let $f = \delta_C$. Then, for each λ > 0 and x ∈ H, we have
$$f_\lambda(x) = \min_{z\in C} \frac{1}{2\lambda}\|z - x\|^2 = \frac{1}{2\lambda}\,\mathrm{dist}(x, C)^2,$$
$$f_\lambda(y) - f_\lambda(x) = f(\bar{y}) - f(\bar{x}) + \frac{1}{2\lambda}\big(\|\bar{y} - y\|^2 - \|\bar{x} - x\|^2\big)$$
$$\ge -\frac{1}{\lambda}\langle \bar{x} - x, \bar{y} - \bar{x}\rangle + \frac{1}{2\lambda}\big(\|\bar{y} - y\|^2 - \|\bar{x} - x\|^2\big),$$
by the subdifferential inequality and the inclusion (3.13). Straightforward algebraic manipulations yield
$$f_\lambda(y) - f_\lambda(x) - \frac{1}{\lambda}\langle x - \bar{x}, y - x\rangle \ge \frac{1}{2\lambda}\|(\bar{y} - \bar{x}) - (y - x)\|^2 \ge 0. \qquad (3.16)$$
Interchanging the roles of x and y, we obtain
$$f_\lambda(x) - f_\lambda(y) - \frac{1}{\lambda}\langle y - \bar{y}, x - y\rangle \ge 0.$$
We deduce that
$$0 \le f_\lambda(y) - f_\lambda(x) - \frac{1}{\lambda}\langle x - \bar{x}, y - x\rangle \le \frac{1}{\lambda}\|y - x\|^2 - \frac{1}{\lambda}\langle \bar{y} - \bar{x}, y - x\rangle \le \frac{1}{\lambda}\|y - x\|^2,$$
which shows that $f_\lambda$ is Fréchet-differentiable with $Df_\lambda(x) = \frac{1}{\lambda}(x - \bar{x})$. Convexity of $f_\lambda$ follows from the inequality
$$\langle Df_\lambda(x) - Df_\lambda(y), x - y\rangle \ge 0,$$
and the characterization given in Proposition 3.10. Finally, from (3.15), we deduce that
$$\|(x - \bar{x}) - (y - \bar{y})\|^2 = \|x - y\|^2 - 2\langle x - y, \bar{x} - \bar{y}\rangle + \|\bar{x} - \bar{y}\|^2 \le \|x - y\|^2.$$
Therefore,
$$\|Df_\lambda(x) - Df_\lambda(y)\| \le \frac{1}{\lambda}\,\|x - y\|,$$
and so, $Df_\lambda$ is Lipschitz-continuous with constant 1/λ.
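As a concrete illustration (ours, not from the text): for f = |·| on H = R, the Moreau envelope is the well-known Huber function, and $Df_\lambda(x) = (x - J_\lambda(x))/\lambda$. The sketch below compares these closed forms with a brute-force numerical minimization of $f_{(\lambda,x)}$:

```python
import numpy as np
from scipy.optimize import minimize_scalar

lam = 0.5
prox = lambda x: np.sign(x) * max(abs(x) - lam, 0.0)   # J_lambda for f = |.|

def envelope(x):
    """Closed form: the Huber function (quadratic near 0, affine far away)."""
    return x**2 / (2 * lam) if abs(x) <= lam else abs(x) - lam / 2

for x in [-2.0, -0.3, 0.0, 0.8, 3.0]:
    # brute force: minimize z -> |z| + (z - x)^2 / (2 lam)
    res = minimize_scalar(lambda z: abs(z) + (z - x)**2 / (2 * lam))
    assert np.isclose(res.fun, envelope(x), atol=1e-6)
    # gradient of the envelope: D f_lambda(x) = (x - J_lambda(x)) / lam
    grad = (x - prox(x)) / lam
    print(f"x = {x:+.1f}: f_lam(x) = {envelope(x):.4f}, D f_lam(x) = {grad:+.3f}")
```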
Proof. Since
$$f(J_\lambda(x)) + \frac{1}{2\lambda}\|J_\lambda(x) - x\|^2 \le f(x)$$
and f is bounded from below by some continuous affine function (Proposition 3.1), $J_\lambda(x)$ remains bounded as λ → 0. As a consequence, $f(J_\lambda(x))$ is bounded from below, and so, $\lim_{\lambda\to 0} \|J_\lambda(x) - x\| = 0$. Finally,
$$f(x) \le \liminf_{\lambda\to 0} f(J_\lambda(x)) \le \limsup_{\lambda\to 0} f(J_\lambda(x)) \le f(x),$$
by lower-semicontinuity.
Let X and Y be normed spaces and let $A \in \mathcal{L}(X;Y)$. Consider two proper, lower-semicontinuous, and convex functions $f : X \to \mathbb{R}\cup\{+\infty\}$ and $g : Y \to \mathbb{R}\cup\{+\infty\}$. The primal problem in Fenchel–Rockafellar duality is given by:
$$(P)\qquad \min_{x\in X}\ \{\, f(x) + g(Ax) \,\}.$$
Its optimal value is denoted by α and the set of primal solutions is S. Also consider the dual problem:
$$(D)\qquad \min_{y^*\in Y^*}\ \{\, f^*(-A^*y^*) + g^*(y^*) \,\}.$$
Proof. Statements i) and ii) are equivalent by the Fenchel–Young Inequality (Proposition 3.42) and, clearly, they imply iii). But iii) implies ii) since
$$\big[f(x) + f^*(-A^*y^*) - \langle -A^*y^*, x\rangle\big] + \big[g(Ax) + g^*(y^*) - \langle y^*, Ax\rangle\big] = 0$$
and each term in brackets is nonnegative. Next, iii) and iv) are equivalent in view of (3.19). Finally, if x̂ ∈ S and there is $x \in \mathrm{dom}(f)$ such that g is continuous in Ax, then
by the Chain Rule (Proposition 3.28) and the Moreau–Rockafellar Theorem 3.30. Hence, there is $\hat{y}^* \in \partial g(A\hat{x})$ such that $-A^*\hat{y}^* \in \partial f(\hat{x})$, which is i).
For the linear programming problem, the preceding argument gives:
Example 3.52. Consider the linear programming problem:
$$(LP)\qquad \min_{x\in\mathbb{R}^N}\ \{\, c \cdot x : Ax \le b \,\},$$
If (LP) has a solution and there is $x \in \mathbb{R}^N$ such that Ax < b, then the dual problem has a solution, there is no duality gap, and the primal-dual solutions $(\hat{x}, \hat{y}^*)$ are characterized by Theorem 3.51. Observe that the dimension of the space of variables, namely N, is often much larger than the number of constraints, which is M. In those cases, the dual problem (DLP) is much smaller than the primal one.
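The following sketch (SciPy's linprog; the data A, b, c form an arbitrary small instance of ours, not from the text) solves a primal problem of the form (LP) together with the dual derived from the Lagrangian $c\cdot x + y\cdot(Ax - b)$ with y ≥ 0, and checks that there is no duality gap and that complementary slackness holds:

```python
import numpy as np
from scipy.optimize import linprog

# Primal (LP): min c.x subject to A x <= b, x free.
A = np.array([[-1.0, 0.0], [0.0, -1.0], [1.0, 2.0]])
b = np.array([0.0, 0.0, 4.0])
c = np.array([-1.0, -1.0])

primal = linprog(c, A_ub=A, b_ub=b, bounds=[(None, None)] * 2, method="highs")

# Dual: max -b.y subject to A^T y = -c, y >= 0
# (solved as min b.y; the optimal value then carries the opposite sign).
dual = linprog(b, A_eq=A.T, b_eq=-c, bounds=[(0, None)] * 3, method="highs")

alpha = primal.fun          # primal value
alpha_star = -dual.fun      # dual value
print("primal value:", alpha, " dual value:", alpha_star)   # both -4.0
assert np.isclose(alpha, alpha_star)                         # no duality gap
# complementary slackness: y_i (A x - b)_i = 0 for every constraint
assert np.allclose(dual.x * (A @ primal.x - b), 0.0)
```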
Therefore, f ∗∗ = f .
The situation discussed in Example 3.55 for continuous affine functions is not
exceptional. On the contrary, we have the following:
Proposition 3.56. Let f : X → R ∪ {+∞} be proper. Then, f is convex and lower-
semicontinuous if, and only if, f ∗∗ = f .
because fi∗∗ = fi for continuous affine functions (see Example 3.55). The converse
is straightforward, since f = f ∗∗ is a supremum of continuous affine functions.
Proof. As we saw in Proposition 3.9, the function $\varphi_{x_0} : X \to \mathbb{R}$ defined by $\varphi_{x_0}(h) = f'(x_0; h)$ is convex and continuous in X. We shall see that actually $\varphi_{x_0}(h) = \langle x_0^*, h\rangle$ for all h ∈ X. First observe that
Let $A \in \mathcal{L}(X;Y)$ and let b ∈ Y. We shall derive optimality conditions for the problem of minimizing a function f over a set C of the form:
$$C = \{\, x \in X : Ax = b \,\}, \qquad (3.20)$$
which we assume to be nonempty. Before doing so, let us recall that, given $A \in \mathcal{L}(X;Y)$, the adjoint of A is the operator $A^* : Y^* \to X^*$ defined by the identity
$$\langle A^*y^*, x\rangle_{X^*\!,X} = \langle y^*, Ax\rangle_{Y^*\!,Y}$$
for x ∈ X and $y^* \in Y^*$. It is possible to prove (see, for instance, [30, Theorem 2.19]) that, if A has closed range, then $\ker(A)^\perp = R(A^*)$.
assuming this set is nonempty. Since each gi is convex and each h j is affine, the set
C is convex. To simplify the notation, write
for all x ∈ X.
Proof. If α = −∞, the result is trivial. Otherwise, set
$$A = \bigcup_{x\in X}\left[ (f(x) - \alpha, +\infty) \times \prod_{i=1}^m (g_i(x), +\infty) \times \prod_{j=1}^p \{h_j(x)\} \right].$$
$$\lambda_0 u_0 + \cdots + \lambda_m u_m + \mu_1 v_1 + \cdots + \mu_p v_p \ge 0 \qquad (3.23)$$
for all $(u_0, \ldots, u_m, v_1, \ldots, v_p) \in A$. If some $\lambda_i < 0$, the left-hand side of (3.23) can be made negative by taking $u_i$ large enough. Therefore, $\lambda_i \ge 0$ for each i. Passing to the limit in (3.23), we obtain (3.22) for each $x \in \mathrm{dom}(f) \cap \big(\bigcap_{i=1}^m \mathrm{dom}(g_i)\big)$. The inequality holds trivially at all other points.
A more precise and useful result can be obtained under a qualification condition:
Slater's condition: There exists $x_0 \in \mathrm{dom}(f)$ such that $g_i(x_0) < 0$ for i = 1, ..., m, and $h_j(x_0) = 0$ for j = 1, ..., p.
Roughly speaking, this means that the constraint given by the system of inequal-
ities is thick in the subspace determined by the affine equality constraints.
Corollary 3.65. Assume Slater's condition holds. Then, there exist $\hat{\lambda}_1, \ldots, \hat{\lambda}_m \ge 0$ and $\hat{\mu}_1, \ldots, \hat{\mu}_p \in \mathbb{R}$, such that
$$\alpha \le f(x) + \sum_{i=1}^m \hat{\lambda}_i g_i(x) + \sum_{j=1}^p \hat{\mu}_j h_j(x)$$
for all x ∈ X.
Conversely, if x̂ ∈ C and there exist λ̂1 , . . . , λ̂m ≥ 0, and μ̂1 , . . . , μ̂ p ∈ R, such that
λ̂i gi (x̂) = 0 for all i = 1, . . . , m and (3.24) holds, then x̂ ∈ S.
Lagrangian Duality
For $(x, \lambda, \mu) \in X \times \mathbb{R}^m_+ \times \mathbb{R}^p$, the Lagrangian for the constrained optimization problem is defined as
$$L(x, \lambda, \mu) = f(x) + \sum_{i=1}^m \lambda_i g_i(x) + \sum_{j=1}^p \mu_j h_j(x).$$
Observe that
$$\sup_{(\lambda,\mu)\in\mathbb{R}^m_+\times\mathbb{R}^p} L(x, \lambda, \mu) = f(x) + \delta_C(x).$$
We shall refer to the problem of minimizing f over C as the primal problem. Its value is given by
$$\alpha = \inf_{x\in X}\ \sup_{(\lambda,\mu)\in\mathbb{R}^m_+\times\mathbb{R}^p} L(x, \lambda, \mu),$$
and S is called the set of primal solutions. By inverting the order in which the supremum and infimum are taken, we obtain the dual problem, whose value is
$$\alpha^* = \sup_{(\lambda,\mu)\in\mathbb{R}^m_+\times\mathbb{R}^p}\ \inf_{x\in X} L(x, \lambda, \mu).$$
The set S∗ of points at which the supremum is attained is the set of dual solutions or
Lagrange multipliers.
We say $(\hat{x}, \hat{\lambda}, \hat{\mu}) \in X \times \mathbb{R}^m_+ \times \mathbb{R}^p$ is a saddle point of L if
$$L(\hat{x}, \lambda, \mu) \le L(\hat{x}, \hat{\lambda}, \hat{\mu}) \le L(x, \hat{\lambda}, \hat{\mu})$$
for all $(x, \lambda, \mu) \in X \times \mathbb{R}^m_+ \times \mathbb{R}^p$.
Theorem 3.68. The following statements are equivalent:
i) x̂ ∈ S, $(\hat{\lambda}, \hat{\mu}) \in S^*$ and $\alpha = \alpha^*$;
ii) $(\hat{x}, \hat{\lambda}, \hat{\mu})$ is a saddle point of L; and
iii) x̂ ∈ C, $\hat{\lambda}_i g_i(\hat{x}) = 0$ for all i = 1, ..., m and (3.24) holds.
Moreover, if $S \neq \emptyset$ and Slater's condition holds, then $S^* \neq \emptyset$ and $\alpha = \alpha^*$.
Proof. Let i) hold. Since x̂ ∈ S, we have
for all $(\lambda, \mu) \in \mathbb{R}^m_+ \times \mathbb{R}^p$. Next,
It follows that $\sum_{i=1}^m \hat{\lambda}_i g_i(\hat{x}) = 0$ and we conclude that each term is zero, since they all have the same sign. The second inequality in (3.25) gives (3.24).
Remark 3.69. As we pointed out in Example 3.52 for the linear programming problem, the dimension of the space in which the dual problem is stated is the number of constraints. Therefore, the dual problem is typically easier to solve than the primal one. Moreover, according to Theorem 3.68, once we have found a Lagrange multiplier, we can recover the solution x̂ for the original (primal) constrained problem as a solution of
$$\min_{x\in X} L(x, \hat{\lambda}, \hat{\mu}),$$
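A tiny worked instance (ours, not from the text) illustrates the whole scheme: minimize f(x) = x² subject to g(x) = 1 − x ≤ 0. Here L(x, λ) = x² + λ(1 − x), the dual function is λ − λ²/4 with maximizer λ̂ = 2, and minimizing L(·, λ̂) recovers x̂ = 1, with α = α* = 1:

```python
import numpy as np

f = lambda x: x**2
g = lambda x: 1.0 - x                       # constraint g(x) <= 0, i.e. x >= 1
L = lambda x, lam: f(x) + lam * g(x)

def dual(lam):
    # inf_x L(x, lam) is attained at x = lam/2 (set dL/dx = 2x - lam = 0)
    return L(lam / 2.0, lam)                # equals lam - lam^2/4

lams = np.linspace(0.0, 5.0, 50001)
lam_hat = lams[np.argmax([dual(l) for l in lams])]   # dual solution
x_hat = lam_hat / 2.0                                # primal recovery

print("lambda_hat ~", lam_hat)                       # ~ 2.0
print("x_hat ~", x_hat)                              # ~ 1.0
print("alpha = alpha* ~", f(x_hat), dual(lam_hat))   # both ~ 1.0
print("complementary slackness:", lam_hat * g(x_hat))  # ~ 0.0
```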
Chapter 4
Examples
Abstract The tools presented in the previous chapters are useful, on the one hand, to
prove that a wide variety of optimization problems have solutions; and, on the other,
to provide useful characterizations allowing to determine them. In this chapter, we
present a short selection of problems to illustrate some of those tools. We begin
by revisiting some results from functional analysis concerning the maximization
of bounded linear functionals and the realization of the dual norm. Next, we dis-
cuss some problems in optimal control and calculus of variations. Another standard
application of these convex analysis techniques lies in the field of elliptic partial dif-
ferential equations. We shall review the theorems of Stampacchia and Lax-Milgram,
along with some variations of Poisson’s equation, including the obstacle problem
and the p-Laplacian. We finish by commenting a problem of data compression and
restoration.
Let $X$ be a normed space and take $L \in X^*$. Recall that $\|L\|_* = \sup_{\|x\|=1}|L(x)|$. By linearity, it is easy to prove that
$$\|L\|_* = \sup_{\|x\|=1}|L(x)| = \sup_{\|x\|\le 1} L(x) = -\inf_{\|x\|\le 1} L(x).$$
We shall focus in this section on the third form, since it can be seen as a convex minimization problem. Indeed, the value of
$$\inf_{x\in X}\big[L(x) + \delta_{\bar B(0,1)}(x)\big]$$
is precisely $-\|L\|_*$.
Let $B$ be any nonempty, closed, convex, and bounded subset of $X$. The function $f = L + \delta_B$ is proper, lower-semicontinuous, convex, and coercive. If $X$ is reflexive, the infimum is attained, by Theorem 2.19. We obtain the following:
Corollary 4.1. Let $X$ be reflexive and let $L \in X^*$. Then, $L$ attains its maximum and its minimum on each nonempty, closed, convex, and bounded subset of $X$. In particular, there exists $x_0 \in \bar B(0,1)$ such that $L(x_0) = \|L\|_*$.
The last part can also be seen as a consequence of Corollary 1.16. Actually, the
converse is true as well: if every L ∈ X ∗ attains its maximum in B̄(0, 1), then X
is reflexive (see [68]). Therefore, in nonreflexive spaces there are bounded linear
functionals that do not realize their supremum on the ball. Here is an example:
Example 4.2. Consider (as in Example 1.2) the space $X = \mathcal{C}([-1,1];\mathbb{R})$ of continuous real-valued functions on $[-1,1]$, with the norm $\|\cdot\|_\infty$ defined by $\|x\|_\infty = \max_{t\in[-1,1]}|x(t)|$. Define $L : X \to \mathbb{R}$ by
$$L(x) = \int_{-1}^{0} x(t)\,dt - \int_{0}^{1} x(t)\,dt.$$
We shall see that $\|L\|_* = 2$. Indeed, consider the sequence $(x_n)_{n\ge 2}$ in $X$ defined by
$$x_n(t) = \begin{cases} 1 & \text{if } -1 \le t \le -\tfrac1n \\ -nt & \text{if } -\tfrac1n < t < \tfrac1n \\ -1 & \text{if } \tfrac1n \le t \le 1. \end{cases}$$
Each $x_n$ satisfies $\|x_n\|_\infty = 1$ and $L(x_n) = 2 - \tfrac1n$, so that $\|L\|_* = 2$. However, if $\|x\|_\infty \le 1$ and $L(x) = 2$, then, necessarily, $x(t) = 1$ for all $t \in [-1,0]$ and $x(t) = -1$ for all $t \in [0,1]$. Therefore, $x$ cannot belong to $X$.
As a consequence of Corollary 4.1 and Example 4.2, we obtain:
Corollary 4.3. The space $\big(\mathcal{C}([-1,1];\mathbb{R}), \|\cdot\|_\infty\big)$ is not reflexive.
As we have pointed out, if X is not reflexive, then not all the elements of X ∗
attain their supremum on the unit ball. However, many of them do. More precisely,
we have the following remarkable result of Functional Analysis:
Corollary 4.4 (Bishop-Phelps Theorem). Let X be a Banach space and let B ⊂
X be nonempty, closed, convex, and bounded. Then, the set of all bounded linear
functionals that attain their maximum in B is dense in X ∗ .
Proof. The conjugate of $\delta_B$ is the support function of $B$, given by $\delta_B^*(L) = \sup_{x\in B} L(x)$ for each $L \in X^*$ (see also Example 3.47). Since $B$ is bounded, $\mathrm{dom}(\delta_B^*) = X^*$. On the other hand, $L$ attains its maximum in $B$ if, and only if, $L \in R(\partial\delta_B)$, by Fermat's Rule (Theorem 3.24) and the Moreau–Rockafellar Theorem 3.30. But, by Proposition 3.43, $R(\partial\delta_B)$ is dense in $\mathrm{dom}(\delta_B^*) = X^*$.
Using convex analysis and differential calculus tools, it is possible to prove the exis-
tence and provide characterizations for the solutions of certain problems in optimal
control and calculus of variations.
Given $u \in L^p(0,T;\mathbb{R}^M)$ with $p \in [1,\infty]$, one can find $y_u \in \mathcal{C}(0,T;\mathbb{R}^N)$ such that the pair $(u, y_u)$ satisfies (CS). Indeed, let $R : [0,T] \to \mathbb{R}^{N\times N}$ be the resolvent of the matrix equation $\dot X = AX$ with initial condition $X(0) = I$. Then $y_u$ can be computed by using the Variation of Parameters Formula:
$$y_u(t) = R(t)y_0 + R(t)\int_0^t R(s)^{-1}\big[B(s)u(s) + c(s)\big]\,ds. \qquad (4.1)$$
Consider the functional $J$ defined on some function space $X$ (to be specified later) by
$$J[u] = \int_0^T \ell(t, u(t), y_u(t))\,dt + h(y_u(T)).$$
The problem (OC) of minimizing $J$ is feasible if $J$ is proper. In this section we focus our attention on the case where the function $\ell$ has the form
$$\ell(t, v, y) = f(v) + g(y), \qquad f(v) \ge \alpha_f + \beta|v|^p, \qquad (4.2)$$
for all $v \in \mathbb{R}^M$ and some $p \in (1,\infty)$. This $p$ will determine the function space where we shall search for solutions. Let $g : \mathbb{R}^N \to \mathbb{R}$ be continuous and bounded from below by $\alpha_g$, and let $h : \mathbb{R}^N \to \mathbb{R}\cup\{+\infty\}$ be lower-semicontinuous (for the strong topology) and bounded from below by $\alpha_h$. We shall prove that $J$ has a minimizer.
Theorem 4.5. If (OC) is feasible, it has a solution in $L^p(0,T;\mathbb{R}^M)$.
Proof. Since $f(u(t)) \ge \alpha_f + \beta|u(t)|^p$, every minimizing sequence is bounded in $L^p(0,T;\mathbb{R}^M)$, and so contains a subsequence $(u_n)$ converging weakly to some $\bar u$. Write
$$z_0(t) = R(t)y_0 + R(t)\int_0^t R(s)^{-1}c(s)\,ds,$$
which is the part of $y_u$ that does not depend on $u$. The $j$-th component of $y_{u_n}$ satisfies
$$y_{u_n}^j(t) = z_0^j(t) + \int_0^T \mathbf{1}_{[0,t]}(s)\,\mu_j(t,s)\,u_n(s)\,ds,$$
where $\mu_j(t,s)$ is the $j$-th row of the matrix $R(t)R(s)^{-1}B(s)$. The weak convergence of $(u_n)$ to $\bar u$ implies
$$\lim_{n\to\infty} y_{u_n}(t) = z_0(t) + R(t)\int_0^t R(s)^{-1}B(s)\bar u(s)\,ds = y_{\bar u}(t)$$
for each $t \in [0,T]$. Moreover, this convergence is uniform because the sequence $(y_{u_n})$ is equicontinuous. Indeed, if $0 \le \tau \le t \le T$, we have
$$y_{u_n}^j(t) - y_{u_n}^j(\tau) = \int_0^T \mathbf{1}_{[\tau,t]}(s)\,\mu_j(t,s)\,u_n(s)\,ds.$$
But $\mu_j(t,s)$ is bounded, by continuity, while the sequence $(u_n)$ is bounded because it is weakly convergent. Therefore, the sequence $(y_{u_n})$ is indeed equicontinuous, and, since $h$ is lower-semicontinuous,
$$\liminf_{n\to\infty} h(y_{u_n}(T)) \ge h(y_{\bar u}(T)).$$
Consider now the linear-quadratic optimal control problem of minimizing
$$J[u] = \frac12\int_0^T \big[u(t)^* U(t) u(t) + y_u(t)^* W(t) y_u(t)\big]\,dt + h(y_u(T)),$$
where
• The vector functions $u$ and $y_u$ are related by (CS) but, for simplicity, we assume $c \equiv 0$.
• All vectors are represented as columns, and the star $^*$ denotes the transposition of vectors and matrices.
• For each $t$, $U(t)$ is a uniformly elliptic symmetric matrix of size $M \times M$, and the function $t \mapsto U(t)$ is continuous. With this, there is a constant $\alpha > 0$ such that $u^* U(t) u \ge \alpha\|u\|^2$ for all $u \in \mathbb{R}^M$ and $t \in [0,T]$.
• For each $t$, $W(t)$ is a positive semidefinite symmetric matrix of size $N \times N$, and the function $t \mapsto W(t)$ is continuous.
• The function $h$ is convex and differentiable. Its gradient $\nabla h$ is represented as a column vector.
By applying Theorem 4.5 in L2 (0, T ; RM ), the optimal control problem has a
solution. Moreover, it is easy to see that J is strictly convex, and so, there is a unique
optimal pair (ū, ȳu ). We have the following:
Theorem 4.7. The pair $(\bar u, \bar y_u)$ is optimal if, and only if,
$$U(t)\bar u(t) = B(t)^* p(t) \qquad (4.4)$$
for almost every $t \in [0,T]$, where the adjoint state $p$ is the unique solution of the adjoint equation
$$\dot p(t) = -A(t)^* p(t) + W(t)\bar y_u(t)$$
with terminal condition $p(T) = -\nabla h(\bar y_u(T))$.
$$0 = \langle\nabla J(\bar u), v\rangle = \int_0^T \bar u(t)^* U(t) v(t)\,dt + \int_0^T \bar y_u(t)^* W(t) y_v(t)\,dt + \nabla h(\bar y_u(T))^* y_v(T)$$
for all v ∈ L2 (0, T ; RM ). Let us focus on the middle term first. By using the adjoint
equation, we see that
Therefore,
$$\int_0^T \bar y_u(t)^* W(t) y_v(t)\,dt = \int_0^T \dot p(t)^* y_v(t)\,dt + \int_0^T p(t)^* A(t) y_v(t)\,dt$$
$$= p(T)^* y_v(T) - \int_0^T p(t)^*\big[\dot y_v(t) - A(t) y_v(t)\big]\,dt$$
$$= -\nabla h(\bar y_u(T))^* y_v(T) - \int_0^T p(t)^* B(t) v(t)\,dt.$$
We conclude that
$$0 = \langle\nabla J(\bar u), v\rangle = \int_0^T \big[\bar u(t)^* U(t) - p(t)^* B(t)\big] v(t)\,dt$$
for all $v \in L^2(0,T;\mathbb{R}^M)$, which gives (4.4),
which we can interpret as moving the car far away to the left, without spending too
much energy. The adjoint equation is
$$\begin{pmatrix} \dot p_1(t) \\ \dot p_2(t) \end{pmatrix} = -\begin{pmatrix} 0 & 0 \\ 1 & 0 \end{pmatrix}\begin{pmatrix} p_1(t) \\ p_2(t) \end{pmatrix}, \qquad \begin{pmatrix} p_1(T) \\ p_2(T) \end{pmatrix} = \begin{pmatrix} -1 \\ 0 \end{pmatrix},$$
from which we deduce that p1 (t) ≡ −1 and p2 (t) = t − T . Finally, from (4.4), we
deduce that ū(t) = p2 (t) = t − T . With this information and the initial condition, we
can compute x̄(T ) and the value J[ū].
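The computation can be checked numerically. The sketch below assumes the cost $J[u] = \frac12\int_0^T u(t)^2\,dt + y_1(T)$, which is consistent with the terminal condition $p(T) = (-1, 0)^*$ above but is our assumption, since the statement of the example is not reproduced here; the discretization and the comparison controls are illustrative.

```python
import numpy as np

# Numerical check of the rocket-car solution (assumed cost
# J[u] = 0.5 * int_0^T u^2 dt + y1(T); all data illustrative).
T, n = 1.0, 2000
t = np.linspace(0.0, T, n + 1)
dt = t[1] - t[0]

def J(u):
    y1, y2 = 0.0, 0.0                        # dynamics: y1' = y2, y2' = u
    for k in range(n):                       # explicit Euler integration
        y1, y2 = y1 + dt * y2, y2 + dt * u[k]
    energy = 0.5 * dt * np.sum(0.5 * (u[:-1] ** 2 + u[1:] ** 2))
    return energy + y1

u_star = t - T                               # candidate control u(t) = t - T
for u in (u_star, 0 * t, 2 * (t - T), np.full(n + 1, -0.5)):
    print(J(u))                              # u_star attains the lowest value
```

With these data, $J[\bar u] \approx -1/6$, strictly below the value of every comparison control, as the strict convexity of $J$ guarantees.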
Introducing Controllability
Assume now that $h = \delta_{\mathcal{T}}$, which means that we require that the terminal state $\bar y_u(T)$ belong to a convex target set $\mathcal{T}$ (see Remark 4.6). Although $h$ is no longer differentiable, one can follow pretty much the same argument as in the proof of Theorem 4.7 (but using the nonsmooth version of Fermat's Rule, namely Theorem 3.24) to deduce that the adjoint state $p$ must satisfy the terminal condition $-p(T) \in N_{\mathcal{T}}(\bar y_u(T))$.
Example 4.9. If the distance between the terminal state and some reference point $y_T$ must be less than or equal to $\rho > 0$, then $\mathcal{T} = \bar B(y_T, \rho)$. There are two possibilities for the terminal states:
1. either $\|\bar y_u(T) - y_T\| < \rho$ and $p(T) = 0$;
2. or $\|\bar y_u(T) - y_T\| = \rho$ and $p(T) = \kappa\,(y_T - \bar y_u(T))$ for some $\kappa \ge 0$.
for some function $L : \mathbb{R}\times\mathbb{R}^N\times\mathbb{R}^N \to \mathbb{R}$. If $L$ is of the form described in (4.2), namely $L(t, x, \dot x) = f(\dot x) + g(x)$, existence of solutions follows as in Theorem 4.5. In what follows, $\nabla_i$ denotes the gradient with respect to the $i$-th set of variables.
Lemma 4.10 (du Bois-Reymond). Let $\alpha, \beta \in \mathcal{C}([0,T];\mathbb{R}^N)$ satisfy
$$\int_0^T \big[\langle \alpha(t), h(t)\rangle + \langle \beta(t), \dot h(t)\rangle\big]\,dt = 0$$
for every $h \in \mathcal{C}^1([0,T];\mathbb{R}^N)$ with $h(0) = h(T) = 0$. Then $\beta$ is differentiable and $\dot\beta = \alpha$.
Proof. Let us first analyze the case $\alpha \equiv 0$. We must prove that $\beta$ is constant. To this end, define the function $H : [0,T] \to \mathbb{R}$ as
$$H(t) = \int_0^t [\beta(s) - B]\,ds, \qquad \text{where} \quad B = \frac1T\int_0^T \beta(t)\,dt.$$
We are now in position to present the necessary optimality condition for (CV):
Theorem 4.11 (Euler–Lagrange Equation). Let $x_*$ be a smooth solution of (CV). Then, the function $t \mapsto \nabla_3 L(t, x_*(t), \dot x_*(t))$ is continuously differentiable and
$$\frac{d}{dt}\nabla_3 L(t, x_*(t), \dot x_*(t)) = \nabla_2 L(t, x_*(t), \dot x_*(t))$$
for every $t \in (0,T)$.
Proof. Set
$$C_0 = \big\{x \in \mathcal{C}^1([0,T];\mathbb{R}^N) : x(0) = x(T) = 0\big\},$$
and define $g : C_0 \to \mathbb{R}$ as $g(h) = f(x_* + h)$. Clearly, $0$ minimizes $g$ on $C_0$ because $x_*$ is a smooth solution of (CV). In other words, $0 \in \mathrm{argmin}_{C_0}(g)$. Moreover, $Dg(0) = Df(x_*)$ and so, Fermat's Rule (Theorem 1.32)
The theory of partial differential equations makes systematic use of several tools that lie in the interface between functional and convex analysis. Here, we comment on a few examples.
In other words, $-\nabla f(\bar x) \in N_C(\bar x)$, which is equivalent to $B(\bar x, c - \bar x) \ge \langle x^*, c - \bar x\rangle$ for all $c \in C$ (see Example 1.27).
We shall provide a brief summary of some Sobolev spaces, which are particu-
larly useful for solving second-order elliptic partial differential equations. For more
details, the reader may consult [1, 30] or [54].
$$W^{1,p} = \{u \in L^p : \nabla u \in (L^p)^N\}.$$
The gradient is taken in the sense of distributions. We omit the reference to the
domain Ω to simplify the notation, since no confusion should arise.
The space W01,p is the closure in W 1,p of the subspace of infinitely differentiable
functions in Ω whose support is a compact subset of Ω . Intuitively, it consists of
the functions that vanish on the boundary ∂ Ω of Ω . The following property will be
useful later:
Proposition 4.14 (Poincaré's Inequalities). Let $\Omega$ be bounded.
i) There exists $C > 0$ such that
$$\|u\|_{L^p} \le C\|\nabla u\|_{(L^p)^N}$$
for all $u \in W_0^{1,p}$.
ii) There exists $C > 0$ such that
$$\|u - \tilde u\|_{L^p} \le C\|\nabla u\|_{(L^p)^N}$$
for all $u \in W^{1,p}$. Here, we have written $\tilde u = \frac{1}{|\Omega|}\int_\Omega u(\xi)\,d\xi$, where $|\Omega|$ is the measure of $\Omega$.
If $p > 1$, then $W^{1,p}$ and $W_0^{1,p}$ are reflexive. We shall restrict ourselves to this case. We have
$$W_0^{1,p} \subset W^{1,p} \subset L^p$$
with continuous embeddings. The topological duals satisfy
$$L^q = (L^p)^* \subset (W^{1,p})^* \subset (W_0^{1,p})^*,$$
where $q$ is the conjugate exponent of $p$. We write $H^1 := W^{1,2}$ and $H_0^1 := W_0^{1,2}$. It is sometimes convenient not to identify $H^1$ and $H_0^1$ with their duals. Instead, one can see them as subspaces of $L^2$, identify the latter with itself, and write
$$H_0^1 \subset L^2 \simeq (L^2)^* \subset (H_0^1)^*.$$
for all $v \in H_0^1$. In other words, there is a unique weak solution $\bar u$ for the homogeneous Dirichlet boundary condition problem
$$-\Delta u + \mu u = v^* \quad \text{in } \Omega, \qquad u = 0 \quad \text{on } \partial\Omega.$$
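For intuition, the one-dimensional analogue of this problem can be solved directly: discretizing $-u'' + \mu u = v^*$ on $(0,1)$ with homogeneous boundary values by centered finite differences yields a symmetric positive definite linear system, whose unique solution mirrors the unique weak solution above. The sketch below uses illustrative data ($\mu = 1$ and $v^*(x) = \sin(\pi x)$, for which the exact solution $u(x) = \sin(\pi x)/(\pi^2 + \mu)$ is known).

```python
import numpy as np

# 1D model of -u'' + mu*u = f on (0,1), u(0) = u(1) = 0, by centered
# finite differences (illustrative data; the exact solution is
# u(x) = sin(pi x) / (pi^2 + mu)).
mu, n = 1.0, 200
h = 1.0 / n
x = np.linspace(0.0, 1.0, n + 1)[1:-1]          # interior nodes
f = np.sin(np.pi * x)

# tridiagonal matrix of -u'' plus mu*I: SPD, hence uniquely solvable
A = (np.diag(np.full(n - 1, 2.0 / h**2 + mu))
     + np.diag(np.full(n - 2, -1.0 / h**2), 1)
     + np.diag(np.full(n - 2, -1.0 / h**2), -1))
u = np.linalg.solve(A, f)

exact = np.sin(np.pi * x) / (np.pi**2 + mu)
print(np.max(np.abs(u - exact)))                # O(h^2) discretization error
```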
Let $\Omega$ be connected and let $\partial\Omega$ be sufficiently smooth. Consider the problem with homogeneous Neumann boundary condition:
$$-\Delta u + \mu u = v^* \quad \text{in } \Omega, \qquad \frac{\partial u}{\partial\nu} = 0 \quad \text{on } \partial\Omega.$$
If $\mu > 0$, the bilinear form $B(u,v) = \langle\nabla u, \nabla v\rangle_{(L^2)^N} + \mu\langle u, v\rangle_{L^2}$ is uniformly elliptic, and the arguments above allow us to prove that the problem has a unique weak solution, which is a $\bar u \in H^1$ such that
$$B(\bar u, v) = \langle v^*, v\rangle_{L^2}$$
for all $v \in H^1$.
If $\mu = 0$, then $B$ is not uniformly elliptic (it is not even positive definite, since $B(u,u) = 0$ when $u$ is constant). However, by Poincaré's Inequality (part ii) of Proposition 4.14), it is possible to prove that $B$ is uniformly elliptic when restricted to
$$V = \{u \in H^1 : \langle u, 1\rangle_{L^2} = 0\},$$
The set $C = \{u \in H_0^1 : u \ge \varphi \text{ a.e. in } \Omega\}$ is a nonempty, closed and convex subset of $H_0^1$. With the notation introduced for the Dirichlet problem, and using Stampacchia's Theorem 4.12, we deduce that there exists a unique $\bar u \in C$ such that
$$B(\bar u, c - \bar u) \ge \langle v^*, c - \bar u\rangle$$
for all $c \in C$. In other words, $\bar u$ is a weak solution for the Dirichlet boundary value problem with an obstacle:
$$-\Delta u + \mu u \ge v^* \quad \text{in } \Omega, \qquad u \ge \varphi \quad \text{in } \Omega, \qquad u = 0 \quad \text{on } \partial\Omega.$$
To fix ideas, take $\mu = 0$ and $v^* = 0$, and assume $\bar u$ and $\varphi$ are of class $\mathcal{C}^2$. For each nonnegative function $p \in H_0^1$ of class $\mathcal{C}^2$, the function $c = \bar u + p$ belongs to $C$, and so $\langle\nabla\bar u, \nabla p\rangle_{(L^2)^N} \ge 0$. Therefore, $\Delta\bar u \le 0$ on $\Omega$, and $\bar u$ is superharmonic. Moreover, if $\bar u > \varphi$ on some open subset of $\Omega$, then $\Delta\bar u \ge 0$ there. It follows that $\Delta\bar u = 0$, and $\bar u$ is harmonic whenever $\bar u > \varphi$. Summarizing, $\bar u$ is harmonic in the free set (where $\bar u > \varphi$), and both $\bar u$ and $\varphi$ are superharmonic in the contact set (where $\bar u = \varphi$). Under these conditions, it follows that $\bar u$ is the least of all superharmonic functions that lie above $\varphi$. For further discussion, see [69, Chap. 6].
$\nabla f(\bar u) = 0$. In other words,
$$0 = \mu\langle |\bar u|^{p-2}\bar u, v\rangle_{L^q, L^p} + \langle |\nabla\bar u|^{p-2}\nabla\bar u, \nabla v\rangle_{(L^q)^N,(L^p)^N} - \langle v^*, v\rangle_{(W_0^{1,p})^*, W_0^{1,p}}$$
for all $v \in W_0^{1,p}$. If $\mu = 0$, we recover a classical existence and uniqueness result for the $p$-Laplace Equation (see, for instance, [74, Example 2.3.1]): The function $\bar u$ is the unique weak solution for the problem
$$-\Delta_p u = v^* \quad \text{in } \Omega, \qquad u = 0 \quad \text{on } \partial\Omega,$$
where $\|\cdot\|_0$ denotes the so-called counting norm (number of nonzero entries; not actually a norm). This comes from the necessity of representing rather complicated functions (such as images or other signals) in a simple and concise manner. The convex relaxation of this nonconvex problem is
$$\min\{\|x\|_1 : Ax = b\}. \qquad (P_1)$$
Under some conditions on the matrix $A$ (see [48]), solutions of $(P_0)$ can be found by solving $(P_1)$. This can be done, for instance, by means of the iterative shrinkage/thresholding algorithm (ISTA), described in Example 6.26.
Related $\ell^1$-regularization approaches can be found in [34, 39, 58]. The least absolute shrinkage and selection operator (LASSO) method [99] is closely related but slightly different: it considers a constraint of the form $\|x\|_1 \le T$ and minimizes $\|Ax - b\|^2$. Also, it is possible to consider the problem with inequality constraints
$$\min\{\|x\|_1 : Ax \le b\},$$
as well as the stable signal recovery [35], where the constraint $Ax = b$ is replaced by $\|Ax - b\| \le \varepsilon$.
Chapter 5
Problem-Solving Strategies
Abstract Only in rare cases can problems in function spaces be solved analytically and exactly. In most cases, it is necessary to apply computational methods to approximate the solutions. In this chapter, we discuss some of the basic general strategies that can be applied. First, we present several connections between optimization and discretization, along with their role in the problem-solving process. Next, we introduce the idea of iterative procedures, and discuss some abstract tools for proving their convergence. Finally, we comment on some ideas that are useful to simplify or reduce the problems, in order to make them tractable or more efficiently solved.
The strategy here consists in replacing the original problem in the infinite-dimensional space $X$ by a problem in a finite-dimensional subspace $X_n$ of $X$. To this end, the first step is to modify, reinterpret or approximate the objective function and the constraints by a finite-dimensional model (we shall come back to this point later). Next, we use either the optimality conditions, such as those discussed in Chap. 3, or an iterative method, like the ones we will present in Chap. 6, for the latter. If the model is sufficiently accurate, we can expect the approximate solution $x_n$ of the problem in $X_n$ to be close to a real solution for the original problem in $X$, or, at least, that the value of the objective function at $x_n$ be close to its minimum in $X$. The theoretical justification for this procedure will be discussed below.
In this section, we present a classical and very useful result that relates the infinite-
dimensional problem to its finite-dimensional approximations. It is an important
tool in general numerical analysis and provides a theoretical justification for the dis-
cretize, then optimize strategy.
This is always possible since J is bounded from below and Xn−1 ⊂ Xn . Clearly, if
εn = 0, the condition J(xn ) ≤ J(xn−1 ) holds automatically.
whenever $\|y - y_0\| < \rho$. Using the density hypothesis and the fact that $\lim_{n\to\infty}\varepsilon_n = 0$, we can find $N \in \mathbb{N}$ and $z_N \in X_N$ such that $\|z_N - y_0\| < \rho$ and $\varepsilon_N \le \delta/3$. We deduce that
for every n ≥ N. Since this holds for each x ∈ X, we obtain the minimizing property.
The last part follows from Proposition 2.8.
[Figure: a rectangular grid over the domain, with the values of $u(x,y)$ represented at the nodes $u_{i,j}$, $u_{i+1,j}$, $u_{i,j+1}$.]
Represent the functions on each block by their values on a prescribed node, pro-
ducing piecewise constant approximations. Integrals are naturally approximated by
a Riemann sum. Partial derivatives are replaced by the corresponding difference
quotients. These can be, for instance, one-sided, $\frac{\partial u}{\partial x}(x,y) \approx \frac{u_{i+1,j} - u_{i,j}}{h}$; or centered, $\frac{\partial u}{\partial x}(x,y) \approx \frac{u_{i+1,j} - u_{i-1,j}}{2h}$. The dimension of the approximating space is the number of
nodes and the new variables are vectors whose components are the values of the
original variable (a function) on the nodes.
Finite Elements: As before, the domain is partitioned into small polyhedral units,
whose geometry may depend on the dimension and the type of approximation. On
each unit, the function is most often represented by a polynomial. Integrals and
derivatives are computed accordingly. The dimension of the approximating space
depends on the number of subdomains in the partition and the type of approximation
(for instance, the degree of the polynomials). For example:
1. If $N = 1$, the Trapezoidal Rule for computing integrals corresponds to an approximation by polynomials of order 1, while Simpson's Rule uses polynomials of order 2.
2. When N = 2, it is common to partition the domain into triangles. If function val-
ues are assigned to the vertices, it is possible to construct an affine interpolation.
This generates the graph of a continuous function whose derivatives, which are
constant on each triangle, and integral are easily computed.
The method of finite elements has several advantages: First, reasonably smooth
open domains can be well approximated by triangulation. Second, partitions can
be very easily refined, and one can choose not to do this refinement uniformly
throughout the domain, but only in specific areas where more accuracy is required.
Third, since the elements have very small support, the method is able to capture
local behavior in a very effective manner. In particular, if the support of the solution
Problem-Solving Strategies 85
is localized, the finite element scheme will tend to identify this fact very quickly.
Overall, this method offers a good balance between simplicity, versatility, and pre-
cision. One possible drawback is that it usually requires very fine partitions in order
to represent smooth functions. The interested reader is referred to [28, 41], or [90].
$$z_{n+1} = F_n(z_n)$$
for each $n \ge 0$.
Proof. Since $(z_n)$ is bounded, it suffices to prove that it has at most one weak limit point, by Corollary 1.25. Let $z_{k_n} \rightharpoonup \hat z$ and $z_{m_n} \rightharpoonup \check z$. Since $\hat z, \check z \in S$, the limits $\lim_{n\to\infty}\|z_n - \hat z\|$ and $\lim_{n\to\infty}\|z_n - \check z\|$ exist. Since
The formulation and proof presented above for Lemma 5.2 first appeared in [19].
Part of the ideas are present in [84], which is usually given as the standard reference.
$$\tilde z_n = F_n(\tilde z_{n-1}) + \varepsilon_n \qquad (5.1)$$
provided $\sum_{n=1}^{\infty}\|\varepsilon_n\| < +\infty$.
Proof. Let $\tau$ denote either the strong or the weak topology. Given $n > k \ge 0$, define
$$\Pi_k^n = F_n \circ \cdots \circ F_{k+1}.$$
Since
$$\|\Pi_{k+h}^{n}(\tilde z_{k+h}) - \Pi_{k}^{n}(\tilde z_{k})\| = \|\Pi_{k+h}^{n}(\tilde z_{k+h}) - \Pi_{k+h}^{n}\circ\Pi_{k}^{k+h}(\tilde z_{k})\| \le \|\tilde z_{k+h} - \Pi_{k}^{k+h}(\tilde z_{k})\|,$$
we have
But, by definition,
$$\|\tilde z_{k+h} - \Pi_{k}^{k+h}(\tilde z_{k})\| \le \sum_{j=k}^{\infty}\|\varepsilon_j\|,$$
$$\tilde z_{k+h} - \zeta = \big[\tilde z_{k+h} - \Pi_{k}^{k+h}(\tilde z_{k})\big] + \big[\Pi_{k}^{k+h}(\tilde z_{k}) - \zeta_k\big] + \big[\zeta_k - \zeta\big]. \qquad (5.2)$$
Given ε > 0 we can take k large enough so that the first and third terms on the right-
hand side of (5.2) are less than ε in norm, uniformly in h. Next, for such k, we let
h → ∞ so that the second term converges to zero for the topology τ .
Penalization
Suppose, for simplicity, that $f : X \to \mathbb{R}$ is convex and continuous, and consider the constrained problem
$$\min\{f(x) : x \in C\}. \qquad (P)$$
[Figure: an interior penalization function $\Phi$, finite on $C$ and $+\infty$ outside, and an exterior penalization function $\Psi$, vanishing on $C$ and positive outside.]
A (not necessarily positive) function looking like $\Phi$ in the figure above is usually referred to as an interior penalization function for $C$. It satisfies $\mathrm{int}(C) \subset \mathrm{dom}(\Phi)$, $\mathrm{dom}(\Phi) = C$, and $\lim_{y\to x}\|\nabla\Phi(y)\| = +\infty$ for all $x$ on the boundary of $C$, denoted by $\partial C$. A Legendre function is an interior penalization function that is differentiable and strictly convex in $\mathrm{int}(C)$ (see [91, Chap. 26]). A barrier is a Legendre function such that $\lim_{y\to x}\Phi(y) = +\infty$ for all $x \in \partial C$. In turn, a convex, nonnegative, everywhere defined function $\Psi$ such that $\Psi^{-1}(0) = C$ is an exterior penalization function, since it acts only outside of $C$.
Under suitable conditions, solutions for the (rather smoothly constrained) problem
$$\min\{f(x) + \varepsilon\Phi(x) : x \in X\},$$
for $\varepsilon > 0$, exist and lie in $\mathrm{int}(C)$. One would reasonably expect them to approximate solutions of (P), as $\varepsilon \to 0$. Indeed, by weak lower-semicontinuity, it is easy to see that, if $x_\varepsilon$ minimizes $f + \varepsilon\Phi$ on $X$, then every weak limit point of $(x_\varepsilon)$, as $\varepsilon \to 0$, minimizes $f$ on $C$. Similarly, if $x_\beta$ is a solution for the unconstrained problem
$$\min\{f(x) + \beta\Psi(x) : x \in X\}$$
with $\beta > 0$, then every weak limit point of $(x_\beta)$, as $\beta \to +\infty$, belongs to $C$ and minimizes $f$ on $C$.
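A one-dimensional numerical sketch (all data illustrative): minimize $f(x) = (x-2)^2$ over $C = (-\infty, 1]$, whose solution is $\bar x = 1$, using the barrier $\Phi(x) = -\log(1-x)$ as an interior penalization and $\Psi(x) = \max(x-1, 0)^2$ as an exterior one.

```python
import numpy as np

# Illustrative 1D example: min (x-2)^2 over C = (-inf, 1]; solution x = 1.
f = lambda x: (x - 2.0) ** 2

# interior penalization (log barrier), minimized over a grid inside C
xs_in = np.linspace(-1.0, 1.0 - 1e-9, 200001)
for eps in (1.0, 0.1, 0.01, 0.001):
    vals = f(xs_in) - eps * np.log(1.0 - xs_in)
    print("eps =", eps, " x_eps ~", xs_in[np.argmin(vals)])

# exterior penalization, minimized over an unconstrained grid
xs = np.linspace(-1.0, 3.0, 200001)
for beta in (1.0, 10.0, 100.0, 1000.0):
    vals = f(xs) + beta * np.maximum(xs - 1.0, 0.0) ** 2
    print("beta =", beta, " x_beta ~", xs[np.argmin(vals)])
```

The interior minimizers approach $\bar x = 1$ from inside $C$ as $\varepsilon \to 0$, while the exterior minimizers approach it from outside as $\beta \to +\infty$, as described above.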
Remark 5.4. Loosely speaking, we replace the function f + δC by a family ( fν ) of
functions in such a way that fν is reasonably similar to f + δC if the parameter ν is
close to some value ν0 .
Multiplier Methods
Consider the problem of minimizing $f$ over
$$C = \{x \in X : g_i(x) \le 0 \text{ for } i = 1, \ldots, m\}.$$
Suppose we are able to solve the dual problem and find a Lagrange multiplier $\hat\lambda \in \mathbb{R}^m_+$. Then, the constrained problem
$$\min\{f(x) : x \in C\}$$
can be solved by minimizing the Lagrangian $L(\cdot, \hat\lambda)$ over $X$, which is an unconstrained problem (see Remark 3.69).
Similar approaches can be used for the affinely constrained problem where
$$C = \{x \in X : Ax = b\}$$
(see Theorem 3.63). For instance, the constrained problem is equivalent to an unconstrained problem in view of Remark 3.70, and the optimality condition is a system of inclusions: $A\hat x = b$ and $-A^*y^* \in \partial f(\hat x)$, for $(\hat x, y^*) \in X \times Y^*$.
5.3.2 Splitting
In many practical cases, one can identify components either in the objective func-
tion, the variable space, or the constraint set. In such situations, it may be useful to
focus on the different components independently, and then use the partial informa-
tion to find a point which is closer to being a solution for the original problem.
A common strategy consists in treating the components separately (done either simultaneously or successively), and then using some averaging or projection procedure to consolidate the information. Let us see some model situations, some of which we will revisit in Chap. 6:
Separate variables in a product space: The space X is the product of two spaces
X1 and X2 , where fi : Xi → R ∪ {+∞} for i = 1, 2, and the corresponding variables
x1 and x2 are linked by a simple relation, such as an affine equation.
Example 5.5. For the linear-quadratic optimal control problem described in
Sect. 4.2.3, we can consider the variables u ∈ L p (0, T ; RM ) and y ∈ C ([0, T ]; RN ),
related by the Variation of Parameters Formula (4.1).
When independent minimization is simple: The functions f1 and f2 depend on the
same variable but minimizing the sum is difficult, while minimizing each function
separately is relatively simple.
Example 5.6. Suppose one wishes to find a point in the intersection of two closed
convex subsets, C1 and C2 , of a Hilbert space. This is equivalent to minimizing
f = δC1 + δC2 . We can minimize δCi by projecting onto Ci . It may be much simpler
to project onto C1 and C2 independently, than to project onto C1 ∩C2 . This method
of alternating projections has been studied in [40, 61, 102] among others.
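A minimal sketch of this method in $\mathbb{R}^2$ (the sets are illustrative): $C_1$ the closed unit ball and $C_2$ a half-plane, both with explicit projections.

```python
import numpy as np

# Alternating projections between a closed ball C1 and a half-plane C2
# in R^2 (illustrative sets, chosen so that C1 and C2 intersect and the
# projections have closed forms).
a, b = np.array([1.0, 1.0]), 1.2

def P1(x):                                   # projection onto the unit ball
    nx = np.linalg.norm(x)
    return x if nx <= 1.0 else x / nx

def P2(x):                                   # projection onto {x : a.x >= b}
    s = a @ x
    return x if s >= b else x + (b - s) * a / (a @ a)

x = np.array([-2.0, 0.5])
for _ in range(200):
    x = P1(P2(x))                            # alternate the two projections
print(x, np.linalg.norm(x), a @ x)           # a point near C1 ∩ C2
```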
Different methods for different regularity: For instance, suppose that f1 : X → R
is smooth, while f2 : X → R ∪ {+∞} is proper, convex and lower-semicontinuous.
We shall see, in Chap. 6, that some methods are well adapted to nonsmooth func-
tions, whereas other methods require certain levels of smoothness to be effective.
Example 5.7. Suppose we wish to minimize an everywhere defined, smooth func-
tion f over a closed, convex set C. In Hilbert spaces, the projected gradient algorithm
(introduced by [59] and [73]) consists in performing an iteration of the gradient
method (see Sect. 6.3) with respect to f , followed by a projection onto C.
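A sketch of one such combination, the projected gradient method, for an illustrative problem: minimizing a smooth quadratic over the closed unit ball of $\mathbb{R}^2$.

```python
import numpy as np

# Projected gradient for min ||x - c||^2 over the unit ball (illustrative
# data): a gradient step followed by the projection onto C.
c = np.array([2.0, 1.0])
grad = lambda x: 2.0 * (x - c)
proj = lambda x: x / max(1.0, np.linalg.norm(x))   # projection onto the ball

x = np.zeros(2)
lam = 0.25                                         # step size < 2/L, L = 2
for _ in range(100):
    x = proj(x - lam * grad(x))
print(x, c / np.linalg.norm(c))                    # both ~ the solution
```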
Composite structure: If A : X → Y is affine and f : Y → R ∪ {+∞} is convex, then
f ◦ A : X → R ∪ {+∞} is convex (see Example 2.13). In some cases, and depending
on f and A, it may be useful to consider a constrained problem in the product space.
More precisely, the problems
are equivalent. The latter has an additive structure and the techniques presented
above can be used, possibly in combination with iterative methods. This is particu-
larly useful when f has a simple structure, but f ◦ A is complicated.
Example 5.8. As seen in Sect. 4.3.3, the optimality condition for minimizing the
square of the L2 norm of the gradient is a second-order partial differential equation.
We shall come back to these techniques in Chap. 6.
Chapter 6
Keynote Iterative Methods
[Figure: a steepest-descent trajectory $x(t)$, issued from $x_0$ in the direction $-\nabla f(x_0)$ and approaching the set $S$.]
Consider the continuous steepest descent dynamics
$$\dot x(t) = -\nabla f(x(t)), \qquad x(0) = x_0. \qquad \text{(ODE)}$$
At each instant $t$, the velocity $\dot x(t)$ points towards the interior of the sublevel set $\Gamma_{f(x(t))}(f)$. Observe also that the stationary points of (ODE) are exactly the critical points of $f$ (the zeroes of $\nabla f$). Moreover, the function $f$ decreases along the solutions, and does so strictly unless it bumps into a critical point, since
$$\frac{d}{dt}f(x(t)) = \langle\nabla f(x(t)), \dot x(t)\rangle = -\|\dot x(t)\|^2 = -\|\nabla f(x(t))\|^2.$$
If $f$ is convex, one would reasonably expect the trajectories to minimize $f$ as $t \to \infty$. Indeed, by convexity, we have (see Proposition 3.10)
$$f(z) - f(x(t)) \ge \langle\nabla f(x(t)), z - x(t)\rangle = \langle\dot x(t), x(t) - z\rangle = \frac{d}{dt}h_z(t),$$
where $h_z(t) = \frac12\|x(t) - z\|^2$. Integrating on $[0,T]$ and recalling that $t \mapsto f(x(t))$ is nonincreasing, we deduce that
$$f(x(T)) \le f(z) + \frac{\|x_0 - z\|^2}{2T},$$
and we conclude that
$$\lim_{t\to\infty} f(x(t)) = \inf(f).$$
$$\frac{d}{dt}h_{\bar z}(t) = \langle\dot x(t), x(t) - \bar z\rangle = -\langle\nabla f(x(t)) - \nabla f(\bar z), x(t) - \bar z\rangle \le 0,$$
and $\lim_{t\to\infty}\|x(t) - \bar z\|$ exists. Since $f$ is continuous and convex, it is weakly lower-semicontinuous, by Proposition 2.17. Therefore, every weak limit point of $x(t)$ as $t \to \infty$ must minimize $f$, by Proposition 2.8. We conclude, by Opial's Lemma 5.2, that $x(t)$ converges weakly to some $\bar z \in S$ as $t \to \infty$.
When $f$ is not differentiable, the differential inclusion
$$-\dot x(t) \in \partial f(x(t)) \qquad \text{(DI)}$$
has similar properties: for each $x_0 \in \mathrm{dom}(f)$, (DI) has a unique absolutely continuous solution $x : [0,+\infty) \to H$, which satisfies $\lim_{t\to\infty} f(x(t)) = \inf(f)$. Further, if $S \neq \emptyset$, then $x(t)$ converges weakly to a point in $S$. The proof (especially the existence) is more technical and will be omitted. The interested reader may consult [87] for further discussion, or also [29], which is the original source.
[Figure: the half-line $[0,\infty)$ subdivided by the points $\sigma_1, \sigma_2, \ldots, \sigma_n$, where $\sigma_n = \lambda_1 + \cdots + \lambda_n$ and the $n$-th subinterval has length $\lambda_n$.]
S = argmin( f ),
which, of course, may be empty. To define the proximal point algorithm, we con-
sider a sequence (λn ) of positive numbers called the step sizes.
Proximal Point Algorithm: Begin with x0 ∈ H. For each n ≥ 0, given a step size λn
and the state xn , define the state xn+1 as the unique minimizer of the Moreau–Yosida
Regularization f(λn ,xn ) of f , which is the proper, lower-semicontinuous and strongly
convex function defined as
$$f_{(\lambda_n, x_n)}(z) = f(z) + \frac{1}{2\lambda_n}\|z - x_n\|^2$$
(see Sect. 3.5.4). In other words,
$$\{x_{n+1}\} = \mathrm{argmin}\Big\{f(z) + \frac{1}{2\lambda_n}\|z - x_n\|^2 : z \in H\Big\}. \qquad (6.1)$$
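In the absence of a closed form, each proximal step can be carried out by an inner solver for the strongly convex subproblem (6.1). A sketch (the objective and the choice of inner solver are illustrative assumptions, not the book's prescription):

```python
import numpy as np
from scipy.optimize import minimize

# A generic proximal point iteration (sketch): each step solves the
# strongly convex subproblem (6.1) with an off-the-shelf inner solver.
# The objective f(x) = ||x - a||_1 is illustrative (nonsmooth, S = {a}).
a = np.array([1.0, -2.0])
f = lambda x: np.sum(np.abs(x - a))

def prox(x, lam):
    obj = lambda z: f(z) + np.sum((z - x) ** 2) / (2.0 * lam)
    return minimize(obj, x, method="Nelder-Mead").x

x = np.zeros(2)
for n in range(30):
    x = prox(x, lam=0.5)        # constant step sizes: (lam_n) not in l^1
print(x)                        # ~ a = argmin f
```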
We now study the basic properties of proximal sequences, which hold for any
(proper, lower-semicontinuous and convex) f . As we shall see, proximal sequences
minimize f . Moreover, they converge weakly to a point in the solution set S, pro-
vided it is nonempty.
In the first place, the very definition of the algorithm (see (6.1)) implies
$$f(x_{n+1}) + \frac{1}{2\lambda_n}\|x_{n+1} - x_n\|^2 \le f(x_n).$$
In particular,
$$f(x_{n+1}) - \alpha \le \frac{d(x_0, S)^2}{2\sigma_n}.$$
iii) If $(\lambda_n) \notin \ell^1$, then $\lim_{n\to\infty} f(x_n) = \alpha$.
As mentioned above, we shall see (Theorem 6.3) that proximal sequences always converge weakly. However, if a proximal sequence $(x_n)$ happens to converge strongly (see Sect. 6.2.2), it is possible to obtain a convergence rate of $\lim_{n\to\infty}\sigma_n\big(f(x_{n+1}) - \alpha\big) = 0$, which is faster than the one predicted in part ii) of the preceding proposition. This was proved in [60] (see also [85, Sect. 2] for a simplified proof).
As seen above, proximal sequences converge weakly under very general assumptions. The fact that the computation of the point $x_{n+1}$ involves information on the point $x_{n+1}$ itself actually implies that, in some sense, the future history of the sequence is taken into account as well.
For this reason, the evolution of proximal sequences turns out to be very sta-
ble. More precisely, the displacement xn+1 − xn always forms an acute angle both
with the previous displacement xn − xn−1 and with any vector pointing towards the
minimizing set S.
Proposition 6.4. Let $(x_n)$ be a proximal sequence. If $x_{n+1} \neq x_n$ and $\hat x \in S$, then
$$\langle x_{n+1} - x_n, \hat x - x_n\rangle > 0.$$
In other words,
$$\langle x_{n+1} - x_n, x_n - x_{n-1}\rangle \ge \frac{\lambda_{n-1}}{\lambda_n}\|x_{n+1} - x_n\|^2.$$
In particular, $\langle x_{n+1} - x_n, x_n - x_{n-1}\rangle > 0$ whenever $x_{n+1} \neq x_n$. For the second inequality, if $\hat x \in S$, then $0 \in \partial f(\hat x)$. Therefore
$$\langle x^* - y^*, x - y\rangle \ge r\|x - y\|^2$$
whenever $x^* \in \partial f(x)$ and $y^* \in \partial f(y)$. Observe also that the latter implies that $\partial f$ is $r$-expansive, in the sense that
$$\|x^* - y^*\| \ge r\|x - y\|$$
When applied to strongly convex functions, the proximal point algorithm generates strongly convergent sequences.
Proposition 6.5. Let $(x_n)$ be a proximal sequence with $(\lambda_n) \notin \ell^1$. If $f$ is strongly convex with parameter $r$, then $S$ is a singleton $\{\hat x\}$ and
$$\|x_{n+1} - \hat x\| \le \|x_0 - \hat x\|\prod_{k=0}^{n}(1 + r\lambda_k)^{-1}. \qquad (6.4)$$
and so
$$\|\hat x - x_{n+1}\| \le (1 + r\lambda_n)^{-1}\|\hat x - x_n\|.$$
Proceeding inductively we obtain (6.4). To conclude, we use the fact that $\prod_{k=0}^{\infty}(1 + r\lambda_k)$ tends to $\infty$ if, and only if, $(\lambda_n) \notin \ell^1$.
It is possible to weaken the strong convexity assumption provided the step sizes are sufficiently large. More precisely, we have the following:
Proposition 6.6. Let $(x_n)$ be a proximal sequence with $(\lambda_n) \notin \ell^2$. If $\partial f$ is $r$-expansive, then $S$ is a singleton $\{\hat x\}$ and
$$\|x_{n+1} - \hat x\| \le \|x_0 - \hat x\|\prod_{k=0}^{n}(1 + r^2\lambda_k^2)^{-\frac12}.$$
Even Functions
$$\|x_{n+1} + x_m\|^2 \le \|x_n + x_m\|^2.$$
We deduce that
$$\|x_n - x_m\|^2 \le 2\|x_n\|^2 - 2\|x_m\|^2.$$
Since $\lim_{n\to\infty}\|x_n - 0\| = \lim_{n\to\infty}\|x_n\|$ exists, $(x_n)$ must be a Cauchy sequence.
If $\mathrm{int}(S) \neq \emptyset$, there exist $\bar u \in S$ and $r > 0$ such that for every $h \in B(0,1)$, we have $\bar u + rh \in S$. Now, given $x \in \mathrm{dom}(\partial f)$ and $x^* \in \partial f(x)$, the subdifferential inequality gives
$$f(\bar u + rh) \ge f(x) + \langle x^*, \bar u + rh - x\rangle.$$
Since $\bar u + rh \in S$, we deduce that
and therefore,
$$r\|x^*\| = \sup_{h\in B(0,1)}\langle x^*, rh\rangle \le \langle x^*, x - \bar u\rangle. \qquad (6.5)$$
Proposition 6.8. Let $(x_n)$ be a proximal sequence with $(\lambda_n) \notin \ell^1$. If $S$ has nonempty interior, then $(x_n)$ converges strongly, as $n \to \infty$, to some $\hat x \in S$.
Proof. If we apply Inequality (6.5) with $x = x_{n+1}$ and $x^* = \frac{x_n - x_{n+1}}{\lambda_n}$, we deduce that
Whence, if $m > n$,
$$2r\|x_n - x_m\| \le 2r\sum_{j=n}^{m-1}\|x_j - x_{j+1}\|$$
$$d(x_{n+1}, S) \ge Ce^{-\sigma_n L}.$$
Proof. Recall that $Px$ denotes the projection of $x$ onto $S$. Let $\eta > 0$ be such that
whenever $d(u, S) < \eta$. As said before, the minimizing set cannot be attained in a finite number of steps. If $d(x_N, S) < \eta$ for some $N$, then $d(x_{n+1}, S) < \eta$ for all $n \ge N$, and so
We conclude that
$$d(x_{n+1}, S) \ge \prod_{k=N}^{n}(1 + \lambda_k L)^{-1}\,d(x_N, S) \ge e^{-\sigma_n L}\,d(x_N, S),$$
Proof. If $x_k \notin S$ for $k = 0, \ldots, n+1$, then
$$f(x_k) - f(x_{k+1}) \ge \Big\langle\frac{x_k - x_{k+1}}{\lambda_k},\, x_k - x_{k+1}\Big\rangle = \frac{\|x_k - x_{k+1}\|^2}{\lambda_k} \ge r^2\lambda_k.$$
by the subdifferential inequality. Hence $r\|x - Px\| \le \langle v, x - Px\rangle$, and so $\|v\| \ge r > 0$ for all $v \in \partial f(x)$ such that $x \notin S$. Inequality (6.6) is known as a linear error bound. Roughly speaking, it means that $f$ is steep on the boundary of $S$.
[Figure: the graph of a function that is steep near the boundary of $S$, forming an angle $\theta$ with the horizontal, where $\tan(\theta) = r$.]
6.2.3 Examples
In some cases, it is possible to obtain an explicit formula for the proximal iterations.
In other words, for $x \in H$ and $\lambda > 0$, one can find $y \in H$ such that
$$y \in \mathrm{argmin}\Big\{f(\xi) + \frac{1}{2\lambda}\|\xi - x\|^2 : \xi \in H\Big\},$$
or, equivalently,
$$x - y \in \lambda\partial f(y).$$
Indicator Function
Let $C$ be a nonempty, closed and convex subset of $H$. Recall that the indicator function $\delta_C : H \to \mathbb{R}\cup\{+\infty\}$ of $C$, given by $\delta_C(\xi) = 0$ if $\xi \in C$ and $+\infty$ otherwise, is proper, lower-semicontinuous and convex. In this case, the relation $x - y \in \lambda\partial\delta_C(y) = \lambda N_C(y)$ characterizes $y$ as the projection of $x$ onto $C$.
Observe that, in this case, the proximal iterations reach the projection of $x_0$
onto S = C in one iteration. The term proximal was introduced by J.-J. Moreau. In
some sense, it expresses the fact that the proximal iteration generalizes the concept
of projection when the indicator function of a set is replaced by an arbitrary proper,
lower-semicontinuous and convex function (see [80], where the author presents a
survey of his previous research [77, 78, 79] in the subject).
Quadratic Function
The quadratic case was studied in [71] and [72]. Consider a bounded, symmetric
and positive semidefinite linear operator A : H → H, and a point b ∈ H. The function
f : H → R defined by
$$f(x) = \frac12\langle x, Ax\rangle - \langle b, x\rangle$$
is convex and differentiable, with gradient $\nabla f(x) = Ax - b$. Clearly, $\hat x$ minimizes $f$ if, and only if, $A\hat x = b$. In particular, $S \neq \emptyset$ if, and only if, $b$ belongs to the range of $A$. Given $x \in H$ and $\lambda > 0$, we have
$$x - y \in \lambda\partial f(y) \iff y = (I + \lambda A)^{-1}(x + \lambda b).$$
By Theorem 6.3, if the equation $Ax = b$ has solutions, then each proximal sequence must converge weakly to one of them. Actually, the convergence is strong. To see this, take any solution $\bar x$ and use the change of variables $u_n = x_n - \bar x$. A simple computation shows that $(u_n)$ is a proximal sequence for $g$, where $g$ is the even function given by $g(x) = \frac12\langle x, Ax\rangle$. It follows that $(u_n)$ converges strongly to some $\bar u$, and we conclude that $(x_n)$ converges strongly to $\bar u + \bar x$.
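In finite dimensions, each proximal step is therefore a linear solve; a sketch with illustrative data:

```python
import numpy as np

# Proximal iterations for f(x) = 0.5*<x, Ax> - <b, x> (illustrative data):
# each step is the linear solve y = (I + lam*A)^{-1} (x + lam*b).
A = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [0.0, 1.0, 2.0]])      # symmetric positive definite
b = A @ np.array([1.0, 2.0, 3.0])    # so that S = {(1, 2, 3)}

x, lam = np.zeros(3), 0.5
for n in range(100):
    x = np.linalg.solve(np.eye(3) + lam * A, x + lam * b)
print(x)                             # ~ (1, 2, 3), strongly convergent
```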
ℓ¹ and L¹ Norms
Consider first the absolute value function $f(x) = |x|$ on $\mathbb{R}$. Given $x \in \mathbb{R}$ and $\lambda > 0$, the point $y$ such that $x - y \in \lambda\partial f(y)$ is characterized by
$$y = \begin{cases} x - \lambda & \text{if } x > \lambda \\ 0 & \text{if } -\lambda \le x \le \lambda \\ x + \lambda & \text{if } x < -\lambda. \end{cases}$$
The preceding argument can be easily extended to the $\ell^1$ norm in $\mathbb{R}^N$, namely $f_N : \mathbb{R}^N \to \mathbb{R}$ given by $f_N(\xi) = \|\xi\|_1 = |\xi_1| + \cdots + |\xi_N|$.
Since $x \in H$, there is $I \in \mathbb{N}$ such that $|x_i| \le \lambda$ for all $i \ge I$. Hence $y_i = 0$ for all $i \ge I$ and the sequence $(y_i)$ belongs to $\ell^1(\mathbb{N};\mathbb{R})$. Finally,
$$|y_i| + \frac{1}{2\lambda}|y_i - x_i|^2 \le |z| + \frac{1}{2\lambda}|z - x_i|^2$$
for all $z \in \mathbb{R}$, and so
$$y \in \mathrm{argmin}\Big\{\|u\|_{\ell^1(\mathbb{N};\mathbb{R})} + \frac{1}{2\lambda}\|u - x\|^2_{\ell^2(\mathbb{N};\mathbb{R})} : u \in H\Big\}.$$
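In coordinates, this proximal operator is the componentwise soft-thresholding map; a minimal sketch:

```python
import numpy as np

# Proximal operator of lam*||.||_1: componentwise soft thresholding,
# reproducing the piecewise formula above.
def prox_l1(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

print(prox_l1(np.array([3.0, 0.2, -1.5]), 1.0))   # -> [ 2.  0. -0.5]
```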
[Figure: a point $z$, the gradient $\nabla f(z)$, and a descent direction $d$ pointing towards the set $S$.]
$$z_{n+1} = z_n + \lambda_n d_n, \qquad \text{(G)}$$
where the sequence (dn ) satisfies some nondegeneracy condition with respect to the
sequence of steepest descent directions, namely Hypothesis 6.12 below.
The idea (which dates back to [36]) of selecting exactly the steepest descent
direction d = −∇ f (z) is reasonable, but it may not be the best option. On the other
hand, the choice of the step size λn may be crucial in the performance of such an
algorithm, as shown in the following example:
Example 6.11. Let $f : \mathbb{R} \to \mathbb{R}$ be defined by $f(z) = z^2$. Take $\lambda_n \equiv \lambda$ and set $z_0 = 1$. A straightforward computation yields $z_n = (1 - 2\lambda)^n$ for each $n$. Therefore, if $\lambda = 1$, the sequence $(z_n)$ remains bounded but does not converge. Next, if $\lambda > 1$, then $\lim_{n\to\infty}|z_n| = +\infty$. Finally, if $\lambda < 1$, then $\lim_{n\to\infty} z_n = 0$, the unique minimizer of $f$.
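The three regimes are easy to reproduce; a minimal sketch:

```python
# Gradient iterations for f(z) = z^2: z_{n+1} = z_n - lam * f'(z_n),
# i.e. z_{n+1} = (1 - 2*lam) * z_n, starting from z_0 = 1.
for lam in (0.25, 1.0, 1.5):
    z = 1.0
    for n in range(20):
        z = (1.0 - 2.0 * lam) * z
    # lam < 1: -> 0;  lam = 1: bounded, alternating;  lam > 1: blows up
    print(lam, z)
```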
In this section, we will explore different possibilities for λn and dn , and provide
qualitative and quantitative convergence results.
Hypothesis 6.12. The function $f$ is bounded from below, its gradient $\nabla f$ is Lipschitz-continuous with constant $L$, there exist positive numbers $\alpha$ and $\beta$ such that
for all $n \in \mathbb{N}$, and the sequence $(\lambda_n)$ of step sizes satisfies $(\lambda_n) \notin \ell^1$ and
$$\bar\lambda := \sup_{n\in\mathbb{N}}\lambda_n < \frac{2\alpha}{\beta L}. \qquad (6.8)$$
Lemma 6.13. Assume Hypothesis 6.12 holds. There is $\delta > 0$ such that
It suffices to define
$$\delta = \frac{2\alpha - \beta\bar\lambda L}{2\alpha\beta},$$
which is positive by Hypothesis 6.12.
Keynote Iterative Methods 109
Proposition 6.14. Assume Hypothesis 6.12 holds. Then $\lim_{n\to\infty}\|\nabla f(z_n)\| = 0$ and every strong limit point of the sequence $(z_n)$ must be critical.
Proof. By Lemma 6.13, we have
Since $(\lambda_n) \notin \ell^1$, we also have $\liminf_{n\to\infty}\|\nabla f(z_n)\| = 0$. Let $\varepsilon > 0$ and take $N \in \mathbb{N}$ large enough so that
$$\sum_{n\ge N}\lambda_n\|\nabla f(z_n)\|^2 < \frac{\varepsilon^2\sqrt\alpha}{4L}. \qquad (6.9)$$
We shall prove that $\|\nabla f(z_m)\| < \varepsilon$ for all $m \ge N$. Indeed, if $m \ge N$ and $\|\nabla f(z_m)\| \ge \varepsilon$, define
$$\nu_m = \min\Big\{n > m : \|\nabla f(z_n)\| < \frac\varepsilon2\Big\},$$
which is finite because $\liminf_{n\to\infty}\|\nabla f(z_n)\| = 0$. Then
$$\|\nabla f(z_m)\| - \|\nabla f(z_{\nu_m})\| \le \sum_{k=m}^{\nu_m-1}\|\nabla f(z_{k+1}) - \nabla f(z_k)\| \le L\sum_{k=m}^{\nu_m-1}\|z_{k+1} - z_k\|.$$
$$\|\nabla f(z_m)\| \le \|\nabla f(z_{\nu_m})\| + \frac{2L}{\varepsilon\sqrt\alpha}\sum_{k=m}^{\nu_m-1}\lambda_k\|\nabla f(z_k)\|^2 < \frac\varepsilon2 + \frac\varepsilon2 = \varepsilon$$
for each $n \in \mathbb{N}$. But $(z_{k_n})$ is bounded, $\lim_{n\to\infty}\nabla f(z_{k_n}) = 0$ and $\lim_{n\to\infty} f(z_{k_n}) \ge f(z)$ by weak lower-semicontinuity. We conclude that $f(z) \le f(u)$.
Proposition 6.19. Let $(z_n)$ satisfy (G), where $f$ is convex, $S \neq \emptyset$, $(\lambda_n) \notin \ell^1$ and $\sup_{n\in\mathbb{N}}\lambda_n < \frac2L$. Then $(z_n)$ converges weakly as $n \to \infty$ to a point in $S$.
Since $\sum_{n\in\mathbb{N}}\lambda_n\|\nabla f(z_n)\|^2 < \infty$, also $\sum_{n\in\mathbb{N}}\lambda_n^2\|\nabla f(z_n)\|^2 < \infty$. We deduce the existence of $\lim_{n\to\infty}\|z_n - u\|$ by Lemma 6.18. We conclude using Proposition 6.15 and Opial's Lemma 5.2.
i) For each $\varepsilon \in (0,1)$, there is $R > 0$ such that, if $\|z_N - z_*\| \le R$ for some $N \in \mathbb{N}$, then
$$\|z_{N+m} - z_*\| \le R\,\varepsilon^m$$
for all $m \in \mathbb{N}$.
ii) Assume $\nabla^2 f$ is Lipschitz-continuous on $U$ with constant $M$ and let $\varepsilon \in (0,1)$. If $\|z_N - z_*\| \le \frac{2\varepsilon}{MC}$ for some $N \in \mathbb{N}$, then
$$\|z_{N+m} - z_*\| \le \frac{2}{MC}\,\varepsilon^{2^m}$$
for all $m \in \mathbb{N}$.
Proof. Fix $n \in \mathbb{N}$ with $z_n \in U$, and define $g : [0,1] \to H$ by $g(t) = \nabla f(z_* + t(z_n - z_*))$. We have
$$\nabla f(z_n) = g(1) - g(0) = \int_0^1 \dot g(t)\,dt = \int_0^1 \nabla^2 f(z_* + t(z_n - z_*))(z_n - z_*)\,dt.$$
It follows that
$$\|z_{n+1} - z_*\| = \big\|z_n - z_* - [\nabla^2 f(z_n)]^{-1}\nabla f(z_n)\big\| = \big\|[\nabla^2 f(z_n)]^{-1}\big(\nabla^2 f(z_n)(z_n - z_*) - \nabla f(z_n)\big)\big\|$$
$$\le C\Big\|\int_0^1 \big[\nabla^2 f(z_n) - \nabla^2 f(z_* + t(z_n - z_*))\big](z_n - z_*)\,dt\Big\|$$
$$\le C\|z_n - z_*\|\int_0^1 \big\|\nabla^2 f(z_n) - \nabla^2 f(z_* + t(z_n - z_*))\big\|_{\mathcal{L}(H;H)}\,dt. \qquad (6.10)$$
Take $\varepsilon \in (0,1)$ and pick $R > 0$ such that $\|\nabla^2 f(z) - \nabla^2 f(z_*)\| \le \varepsilon/2C$ whenever $\|z - z_*\| \le R$. We deduce that if $\|z_N - z_*\| < R$, then $\|z_{n+1} - z_*\| \le \varepsilon\|z_n - z_*\|$ for all $n \ge N$, and this implies i).
For part ii), using (6.10) we obtain $\|z_{n+1} - z_*\| \le \frac{MC}{2}\|z_n - z_*\|^2$. Given $\varepsilon \in (0,1)$, if $\|z_N - z_*\| \le \frac{2\varepsilon}{MC}$, then $\|z_{N+m} - z_*\| \le \frac{2}{MC}\varepsilon^{2^m}$ for all $m \in \mathbb{N}$.
Despite its remarkable local speed of convergence, Newton’s method has some
noticeable drawbacks:
First, the fast convergence can only be granted once the iterates are sufficiently
close to a solution. One possible way to overcome this problem is to perform a few
iterations of another method in order to provide a warm starting point.
Next, at points where the Hessian is close to degeneracy, the method may have a
very erratic behavior.
Example 6.24. Consider the function $f : \mathbb{R} \to \mathbb{R}$ defined by $f(z) = \sqrt{z^2 + 1}$. A simple computation shows that Newton's method gives $z_{n+1} = -z_n^3$ for each $n$. The problem here is that, since $f$ is almost flat for large values of $z_n$, the method predicts that the minimizers should be very far.
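A sketch reproducing this behavior (the formulas follow from the derivatives of $f$):

```python
import numpy as np

# Newton's method for f(z) = sqrt(z^2 + 1): f'(z) = z / sqrt(z^2 + 1) and
# f''(z) = (z^2 + 1)^(-3/2), so z - f'(z)/f''(z) = -z**3.
def newton_step(z):
    return z - (z / np.sqrt(z**2 + 1)) / (z**2 + 1) ** (-1.5)

for z0 in (0.5, 1.1):
    z = z0
    for n in range(6):
        z = newton_step(z)
    print(z0, z)   # |z0| < 1 converges to 0; |z0| > 1 diverges rapidly
```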
Some alternatives are the following:
• Adding a small constant $\lambda_n$ (a step size), and computing
$$z_{n+1} = z_n - \lambda_n[\nabla^2 f(z_n)]^{-1}\nabla f(z_n),$$
can help to avoid very large displacements. The constant $\lambda_n$ has to be carefully selected: If it is too large, it will not help; while, if it is too small, convergence may be very slow.
• Another option is to force the Hessian away from degeneracy by adding a uniformly elliptic term, and compute
$$z_{n+1} = z_n - [\nabla^2 f(z_n) + \varepsilon_n I]^{-1}\nabla f(z_n).$$
Finally, each iteration has a high computational complexity due to the inversion of the Hessian operator. A popular solution for this is to update the Hessian every once in a while. More precisely, compute $[\nabla^2 f(z_N)]^{-1}$ and iterate
$$z_{n+1} = z_n - [\nabla^2 f(z_N)]^{-1}\nabla f(z_n)$$
for the next $p_N$ iterations. After that, one computes $[\nabla^2 f(z_{N+p_N})]^{-1}$ and continues iterating in a similar fashion.
The proximal point algorithm is suitable for solving minimization problems that involve nonsmooth objective functions, but it can be hard to implement. If the objective function is sufficiently regular, it may be more effective and efficient to use gradient-consistent methods, which are easier to implement; or Newton's method, which converges remarkably fast. Let $f = f_1 + f_2$. If the problem involves smooth and nonsmooth features (say, $f_1$ is smooth but $f_2$ is not), it is possible to combine these approaches and construct better adapted methods. If neither $f_1$ nor $f_2$ is smooth, it may still be useful to treat them separately, since it may be much harder to implement the proximal iterations for the sum $f_1 + f_2$ than for $f_1$ and $f_2$ independently. A thorough review of these methods can be found in [21].
where V = { (x, y) : Ax + By = c }.
Example 6.26. For the problem
$$\min\Big\{\frac12\|Ax - b\|^2 + \gamma\|x\|_1 : x \in \mathbb{R}^n\Big\},$$
the forward–backward iteration is the iterative shrinkage/thresholding algorithm (ISTA), where
$$\big[(I + \lambda\gamma\,\partial\|\cdot\|_1)^{-1}(v)\big]_i = (|v_i| - \lambda\gamma)_+\,\mathrm{sgn}(v_i), \qquad i = 1, \ldots, n$$
(see [22, 23, 42]). See also [104].
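A compact sketch of ISTA (illustrative data; the step size $\lambda = 1/\|A\|^2$ is one admissible choice):

```python
import numpy as np

# ISTA for min 0.5*||Ax - b||^2 + gamma*||x||_1 (illustrative data):
# a gradient step on the smooth part, then soft thresholding.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 50))
x_true = np.zeros(50); x_true[[3, 17, 30]] = (1.0, -2.0, 1.5)
b = A @ x_true
gamma = 0.1
lam = 1.0 / np.linalg.norm(A, 2) ** 2        # step size <= 1/L, L = ||A'A||

x = np.zeros(50)
for n in range(2000):
    v = x - lam * (A.T @ (A @ x - b))        # forward (gradient) step
    x = np.sign(v) * np.maximum(np.abs(v) - lam * gamma, 0.0)  # backward step
print(np.nonzero(np.abs(x) > 1e-3)[0])       # support ~ {3, 17, 30}
```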
A variant of the forward–backward algorithm was proposed by Tseng [101] for $\min\{f_1(x) + f_2(x) : x \in C\}$. It includes a second gradient subiteration and closes each iteration with a projection onto $C$.
Example 6.27. For example, in order to find a point in the intersection of two closed convex sets, $C_1$ and $C_2$, this gives (with $\gamma = 1$) the method of alternating projections:
$$x_{n+1} = P_{C_1}\big(P_{C_2}(x_n)\big),$$
studied in [40, 61, 102], among others. As one should expect, sequences produced by this method converge weakly to points in $C_1 \cap C_2$. Convergence is strong if $C_1$ and $C_2$ are closed, affine subspaces.
A modification of the double-backward method, including an over-relaxation
between the backward subiterations, gives the Douglas-Rachford [49] algorithm:
$$\begin{cases} y_n = (I + \lambda_n\partial f_1)^{-1}(x_n) \\ z_n = (I + \lambda_n\partial f_2)^{-1}(2y_n - x_n) \\ x_{n+1} = x_n + \gamma(z_n - y_n), \qquad \gamma \in (0,2). \end{cases}$$
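For two indicator functions, the resolvents are again projections; a sketch with illustrative sets (a unit ball and a half-plane):

```python
import numpy as np

# Douglas-Rachford for f1 = delta_{C1}, f2 = delta_{C2}: the resolvents
# (I + lam*df_i)^{-1} are the projections onto C1 (unit ball) and C2
# (half-plane a.x >= b). Sets and data are illustrative.
a, b = np.array([1.0, 1.0]), 1.2

def P1(x):
    nx = np.linalg.norm(x)
    return x if nx <= 1.0 else x / nx

def P2(x):
    s = a @ x
    return x if s >= b else x + (b - s) * a / (a @ a)

x, gamma = np.array([-2.0, 3.0]), 1.0
for n in range(500):
    y = P1(x)
    z = P2(2 * y - x)
    x = x + gamma * (z - y)
print(P1(x))    # the shadow sequence y_n approaches a point of C1 ∩ C2
```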
Multiplier Methods
where r > 0. This idea was introduced in [63] and [89], and combined with proximal
iterations in [92]. Similar approaches can be used for the affinely constrained problem $\min\{f(x) : Ax = b\}$, whose Lagrangian is $L(x, y^*) = f(x) + \langle y^*, Ax - b\rangle$. Moreover, for a structured problem $\min\{f_1(x) + f_2(y) : Ax + By = c\}$, one can use alternating methods of multipliers. This approach has been followed by [15, 103] and
[38] (see also [52, 62] and [37]). In the latter, a multiplier prediction step is added
and implementability is enhanced. See also [27, 97], and the references therein, for
further commentaries on augmented Lagrangian methods.
Penalization
Recall, from Remark 5.4, that the penalization technique is based on the idea of
progressively replacing f + δC by a family ( fν ), in such a way that the minimizers
of fν approximate the minimizers of f + δC , as ν → ν0 . Iterative procedures can be
combined with penalization schemes in a diagonal fashion. Starting from a point
x0 ∈ H, build a sequence (xn ) as follows: At the n-th iteration, you have a point
xn . Choose a value νn for the penalization parameter, and apply one iteration of the
algorithm of your choice (for instance, the proximal point algorithm or the gradient
method), to obtain xn+1 . Update the value of your parameter to νn+1 , and iterate.
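A sketch of such a diagonal scheme (illustrative data), using gradient steps on $f + \beta_n\Psi$ with the exterior penalization $\Psi(x) = \max(x-1, 0)^2$ and $\beta_n \to \infty$, for the problem of minimizing $(x-2)^2$ over $C = (-\infty, 1]$:

```python
# Diagonal penalization scheme (illustrative): one gradient step per
# iteration on f + beta_n * Psi, with f(x) = (x - 2)^2,
# Psi(x) = max(x - 1, 0)^2 and beta_n -> infinity; the iterates
# approach the constrained solution x = 1.
x = 3.0
for n in range(1, 20001):
    beta = 0.1 * n                       # slowly increasing penalty
    grad = 2.0 * (x - 2.0) + beta * 2.0 * max(x - 1.0, 0.0)
    lam = 1.0 / (2.0 + 2.0 * beta)       # step ~ 1/L_n keeps the scheme stable
    x = x - lam * grad
print(x)                                 # ~ 1
```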
17. Auslender A, Teboulle M. Interior gradient and proximal methods for convex and conic opti-
mization. SIAM J Optim. 2006;16:697–725.
18. Bachmann G, Narici L, Beckenstein E. Fourier and Wavelet Analysis. New York: Springer;
2002.
19. Baillon JB. Comportement asymptotique des contractions et semi-groupes de contraction.
Thèse. Université Paris 6; 1978.
20. Baillon JB. Un exemple concernant le comportement asymptotique de la solution du
problème du/dt + ∂ϕ(u) ∋ 0. J Funct Anal. 1978;28:369–76.
21. Bauschke H, Combettes PL. Convex analysis and monotone operator theory in Hilbert
spaces. New York: Springer; 2011.
22. Beck A, Teboulle M. A fast iterative shrinkage-thresholding algorithm for linear inverse
problems. SIAM J Imaging Sci. 2009;2:183–202.
23. Beck A, Teboulle M. Gradient-based algorithms with applications in signal recovery prob-
lems. Convex optimization in signal processing and communications. Cambridge: Cam-
bridge University Press; 2010. pp. 33–88.
24. Bertsekas D. Nonlinear programming. Belmont: Athena Scientific; 1999.
25. Borwein J. A note on ε -subgradients and maximal monotonicity. Pac J Math. 1982;103:307–
14.
26. Borwein J, Lewis A. Convex analysis and nonlinear optimization. New York: Springer; 2010.
27. Boyd S, Parikh N, Chu E, Peleato B, Eckstein J. Distributed optimization and statistical learn-
ing via the alternating direction method of multipliers. Found Trends Mach Learn. 2011;3:1–
122.
28. Brenner SC, Scott LR. The mathematical theory of finite element methods. 3rd ed. New
York: Springer; 2008.
29. Brézis H. Opérateurs maximaux monotones et semi-groupes de contractions dans les espaces
de Hilbert. Amsterdam: North Holland Publishing Co; 1973.
30. Brézis H. Functional analysis, Sobolev spaces and partial differential equations. New York:
Springer; 2011.
31. Brézis H, Lions PL. Produits infinis de résolvantes. Israel J Math. 1978;29:329–45.
32. Burachik R, Graña Drummond LM, Iusem AN, Svaiter BF. Full convergence of the steepest descent method with inexact line searches. Optimization. 1995;32:137–46.
33. Cartan H. Differential calculus. London: Kershaw Publishing Co; 1971.
34. Candès EJ, Romberg JK, Tao T. Robust uncertainty principles: exact signal reconstruction
from highly incomplete frequency information. IEEE Trans Inform Theory. 2006;52:489–
509.
35. Candès EJ, Romberg JK, Tao T. Stable signal recovery from incomplete and inaccurate mea-
surements. Comm Pure Appl Math. 2006;59:1207–23.
36. Cauchy AL. Méthode générale pour la résolution des systèmes d’équations simultanées. C R
Acad Sci Paris. 1847;25:536–8.
37. Chambolle A, Pock T. A first-order primal-dual algorithm for convex problems with appli-
cations to imaging. J Math Imaging Vis. 2010;40:1–26.
38. Chen G, Teboulle M. A proximal-based decomposition method for convex minimization
problems. Math Program. 1994;64:81–101.
39. Chen SS, Donoho DL, Saunders MA. Atomic decomposition by basis pursuit. SIAM J Sci
Comput. 1998;20:33–61.
40. Cheney W, Goldstein AA. Proximity maps for convex sets. Proc Amer Math Soc.
1959;10:448–50.
41. Ciarlet PG. The finite element method for elliptic problems. Amsterdam: North Holland
Publishing Co; 1978.
42. Combettes PL, Wajs VR. Signal recovery by proximal forward-backward splitting. Multi-
scale Model Simul. 2005;4:1168–200.
43. Cominetti R, Courdurier M. Coupling general penalty schemes for convex programming with
the steepest descent method and the proximal point algorithm. SIAM J Opt. 2002;13:745–65.
44. Cominetti R, San Martı́n J. Asymptotic analysis of the exponential penalty trajectory in linear
programming. Math Program. 1994;67:169–87.
45. Crandall MG, Liggett TM. Generation of semigroups of nonlinear transformations on general
Banach spaces. Am J Math. 1971;93:265–98.
46. Daubechies I. Ten lectures on wavelets. Philadelphia: SIAM; 1992.
47. Donoho DL. Compressed sensing. IEEE Trans Inform Theory. 2006;52:1289–306.
48. Donoho DL, Tanner J. Sparse nonnegative solutions of underdetermined linear equations by
linear programming. Proc Natl Acad Sci U S A. 2005;102:9446–51.
49. Douglas J Jr, Rachford HH Jr. On the numerical solution of heat conduction problems in two
and three space variables. Trans Am Math Soc. 1956;82:421–39.
50. Dunford N, Schwartz J. Linear operators. Part I. General theory. New York: Wiley; 1988.
51. Eckstein J. Nonlinear proximal point algorithms using Bregman functions, with applications
to convex programming. Math Oper Res. 1993;18:202–26.
52. Eckstein J. Some saddle-function splitting methods for convex programming. Opt Methods
Softw. 1994;4:75–83.
53. Ekeland I, Temam R. Convex analysis and variational problems. Philadelphia: SIAM; 1999.
54. Evans L. Partial differential equations. 2nd edn. Providence: Graduate Studies in Mathemat-
ics, AMS; 2010.
55. Ferris M. Finite termination of the proximal point algorithm. Math Program. 1991;50:359–
66.
56. Folland G. Fourier analysis and its applications. Pacific Grove: Wadsworth & Brooks/Cole;
1992.
57. Frankel P, Peypouquet J. Lagrangian-penalization algorithm for constrained optimization
and variational inequalities. Set-Valued Var Anal. 2012;20:169–85.
58. Friedlander MP, Tseng P. Exact regularization of convex programs. SIAM J Optim.
2007;18:1326–50.
59. Goldstein AA. Convex programming in Hilbert space. Bull Am Math Soc. 1964;70:709–10.
60. Güler O. On the convergence of the proximal point algorithm for convex minimization. SIAM
J Control Optim. 1991;29:403–19.
61. Halperin I. The product of projection operators. Acta Sci Math (Szeged). 1962;23:96–9.
62. He B, Liao L, Han D, Yang H. A new inexact alternating directions method for monotone
variational inequalities. Math Program. 2002;92:103–18.
63. Hestenes MR. Multiplier and gradient methods. J Opt Theory Appl. 1969;4:303–20.
64. Hiriart-Urruty JB, Lemaréchal C. Fundamentals of convex analysis. New York: Springer;
2001.
65. Iusem A, Svaiter BF, Teboulle M. Entropy-like proximal methods in convex programming.
Math Oper Res. 1994;19:790–814.
66. Izmailov A, Solodov M. Optimization, Volume 1: optimality conditions, elements of convex
analysis and duality. Rio de Janeiro: IMPA; 2005.
67. Izmailov A, Solodov M. Optimization, Volume 2: computational methods. Rio de Janeiro:
IMPA; 2007.
68. James R. Reflexivity and the supremum of linear functionals. Israel J Math. 1972;13:289–
300.
69. Kinderlehrer D, Stampacchia G. An introduction to variational inequalities and their appli-
cations. New York: Academic; 1980.
70. Kobayashi Y. Difference approximation of Cauchy problems for quasi-dissipative operators
and generation of nonlinear semigroups. J Math Soc Japan. 1975;27:640–65.
71. Krasnosel’skii MA. Solution of equations involving adjoint operators by successive approx-
imations. Uspekhi Mat Nauk. 1960;15:161–5.
72. Kryanev AV. The solution of incorrectly posed problems by methods of successive approxi-
mations. Dokl Akad Nauk SSSR. 1973;210:20–2. Soviet Math Dokl. 1973;14:673–6.
73. Levitin ES, Polyak BT. Constrained minimization methods. USSR Comp Math Math Phys.
1966;6:1–50.
74. Lions JL. Quelques méthodes de résolution des problèmes aux limites non linéaires. Paris:
Dunod; 2002.
75. Mangasarian OL, Wild EW. Exactness conditions for a convex differentiable exterior penalty
for linear programming. Optimization. 2011;60:3–14.