
SpringerBriefs in Optimization

Series Editors
Panos M. Pardalos
János D. Pintér
Stephen M. Robinson
Tamás Terlaky
My T. Thai

SpringerBriefs in Optimization showcases algorithmic and theoretical techniques,


case studies, and applications within the broad-based field of optimization.
Manuscripts related to the ever-growing applications of optimization in applied
mathematics, engineering, medicine, economics, and other applied sciences are
encouraged.

For further volumes:


http://www.springer.com/series/8918
Juan Peypouquet

Convex Optimization in
Normed Spaces
Theory, Methods and Examples
Juan Peypouquet
Universidad Técnica Federico Santa María
Valparaíso, Chile

ISSN 2190-8354 ISSN 2191-575X (electronic)


SpringerBriefs in Optimization
ISBN 978-3-319-13709-4 ISBN 978-3-319-13710-0 (eBook)
DOI 10.1007/978-3-319-13710-0

Library of Congress Control Number: 2015935427

Springer Cham Heidelberg New York Dordrecht London


© The Author(s) 2015
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology
now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, express or implied, with respect to the material contained herein or for any
errors or omissions that may have been made.

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)


To my family

Foreword

This book offers self-contained access and a modern vision of convex optimization
in Banach spaces, with various examples and numerical methods. The challenge of
providing a clear and complete presentation in only 120 pages is met successfully.
This reflects a deep understanding by the author of this rich subject. Even when
presenting classical results, he offers new insights and a fresh perspective.
Convex analysis lies at the core of optimization, since convexity is often present
or hidden, even in apparently non-convex problems. It may be partially present in
the basic blocks of structured problems, or introduced intentionally (as in relaxation)
as a solution technique.
The examples considered in the book are carefully chosen. They illustrate a
wide range of applications in mathematical programming, elliptic partial differential
equations, optimal control, signal/image processing, among others. The framework
of reflexive Banach spaces, which is analyzed in depth in the book, provides a gen-
eral setting covering many of these situations.
The chapter on iterative methods is particularly accomplished. It gives a unified
view on first-order algorithms, which are regarded as time-discretized versions of the
(generalized) continuous steepest descent dynamics. Proximal algorithms are studied
in depth. They play a fundamental role in splitting methods, which have proven
successful at reducing the dimension of large-scale problems, such as sparse
optimization in signal/image processing, or domain decomposition for PDEs.
The text is accompanied by images that provide a geometric intuition of the analytical
concepts. The writing is clear and concise, and the presentation is fairly linear.
Overall, the book makes for an enjoyable read.
This book is accessible to a broad audience. It is excellent both for those who
want to become familiar with the subject and learn the basic concepts, and as a
textbook for a graduate course. The extensive bibliography provides the reader with
more information, as well as recent trends.

Montpellier (France), October 2014 Hedy Attouch

Preface

Convex analysis comprises a number of powerful, elegant ideas and techniques that
are essential for understanding and solving problems in many areas of applied math-
ematics. In this introductory textbook, we have tried to present the main concepts
and results in an easy-to-follow and mostly self-contained manner. It is intended to
serve as a guide for students and more advanced independent learners who wish to
get acquainted with the basic theoretical and practical tools for the minimization of
convex functions on normed spaces. In particular, it may be useful for researchers
working on related fields, who wish to apply convex-analytic techniques in their
own work. It can also be used for teaching a graduate-level course on the subject, in
view of the way the contents are organized. We should point out that we have not
attempted to include every relevant result on the matter, but rather to present the ideas as
transparently as possible. We believe that this work can be helpful to gain insight
into some connections between optimization, control, calculus of variations, partial
differential equations, and functional analysis that sometimes go unnoticed.

The book is organized as follows:


Keeping in mind users with different backgrounds, we begin by reviewing in
Chap. 1 the minimal functional-analytic concepts and results convex analysis users
should be familiar with. The chapter is divided into two sections: one devoted to
normed spaces, including the basic aspects of differential calculus; and the other,
containing the main aspects of Hilbert spaces.
Chapter 2 deals with the existence of minimizers. We begin by providing a general
result in a Hausdorff space setting, and then we consider the case of convex functions
defined on reflexive spaces. It is possible to go straight to the convex case, but
by working out the abstract setting first, one can deal with problems that are not
convex but where convexity plays an important role.
Analysis and calculus are the core of Chap. 3. First, we discuss the connection
between convexity, continuity, and differentiability. Next, we present the notion of
subgradient, which extends the idea of derivative, along with its main properties and
calculus rules. We use these concepts to derive necessary and sufficient optimality
conditions for nonsmooth constrained optimization problems.


Chapter 4 contains selected applications: some functional analysis results are


revisited under a convex-analytic perspective, existence of solutions as well as opti-
mality conditions are established for optimal control and calculus of variations prob-
lems, and for some elliptic partial differential equations, including the obstacle prob-
lem and the p-Laplacian. We also mention the compressed sensing problem.
The main problem-solving strategies are described in Chap. 5. On the one
hand, we provide a quick overview of the most popular discretization methods.
On the other, we mention some ideas that help tackle the problems more easily.
For instance, splitting is useful to reduce problem size or simplify optimality condi-
tions, while Lagrangian duality and penalization allow us to pass from a constrained
to an unconstrained setting.
In Chap. 6, standard abstract iterative methods are presented from a dynamical
systems perspective and their main properties are discussed in detail. We finish by
commenting on how these methods can be applied to more realistic, structured problems,
by combining them with splitting, duality, and penalization techniques. We
restrict ourselves to a Hilbert-space context in this chapter for two reasons: First,
this setting is sufficiently rich and covers a broad range of situations; in particular,
it covers discretized problems, which are posed in finite-dimensional (Euclidean) spaces.
Second, algorithms in Hilbert spaces enjoy good convergence properties that, in
general, cannot be extended to arbitrary Banach spaces.

We intended to make this introductory textbook as complete as possible, but some


relevant aspects of the theory and methods of convex optimization have not been
covered. The interested reader is invited to consult related works. For a more gen-
eral functional setting, see [10, 53, 105]. In [21], the authors did a remarkable job
putting together essentially all there is to know about fixed point theory, convex
analysis and monotone operators in Hilbert spaces, and elegantly pointing out the
relationships. The exposition in [64] is deep, well-written, and rich in geometric
insight (see also [26]). The timeless classic is, of course, [91]. For (convex and non-
convex) numerical optimization methods, see [24, 105]. Theoretical and practical
aspects are clearly explained also in the two-volume textbook [66, 67]. More bibli-
ographical commentaries are included throughout the text.

Finally, we invite the reader to check http://jpeypou.mat.utfsm.cl/


for additional material including examples, exercises, and complements. Sugges-
tions and comments are welcome at juan.peypouquet@usm.cl.

Juan Peypouquet
Santiago & Valparaı́so, Chile
October, 2014
Acknowledgments

This book contains my vision and experience on the theory and methods of con-
vex optimization, which have been modeled by my research and teaching activity
during the past few years, beginning with my PhD thesis at Universidad de Chile
and Université Pierre et Marie Curie - Paris VI. I am thankful, in the first place,
to my advisors, Felipe Álvarez and Sylvain Sorin, who introduced me to optimiza-
tion and its connection to dynamical systems, and transmitted to me their curiosity and
passion for convex analysis. I am also grateful to my research collaborators, espe-
cially Hedy Attouch, who is always willing to share his deep and vast knowledge
with kindness and generosity. He also introduced me to researchers and students at
Université Montpellier II, with whom I have had a fruitful collaboration. Of great
importance were the many stimulating discussions with colleagues at Universidad
Técnica Federico Santa Marı́a, in particular, Pedro Gajardo and Luis Briceño, as
well as my PhD students, Guillaume Garrigos and Cesare Molinari.
I also thank Associate Editor Donna Chernyk at Springer Science+Business
Media, who invited me to engage in the more ambitious project of writing a book to
be edited and distributed by Springer. I am truly grateful for this opportunity. A pre-
cursor [88] had been prepared earlier for a 20-h minicourse on convex optimization
at the XXVI EVM & EMALCA, held in Mérida, in 2013. I thank my friend and for-
mer teacher at Universidad Simón Bolı́var, Stefania Marcantognini, for proposing
me as a lecturer at that School. Without their encouragement, I would not have considered
writing this book at this point in my career.
Finally, my research over the past few years was made possible thanks to Coni-
cyt Anillo Project ACT-1106, ECOS-Conicyt Project C13E03, Millenium Nucleus
ICM/FIC P10-024F, Fondecyt Grants 11090023 and 1140829, Basal Project CMM
Universidad de Chile, STIC-AmSud Ovimine, and Universidad Técnica Federico
Santa Marı́a, through their DGIP Grants 121123, 120916, 120821 and 121310.

Juan Peypouquet
Santiago & Valparaı́so, Chile
October, 2014

Contents

1 Basic Functional Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1


1.1 Normed Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Bounded Linear Operators and Functionals, Topological
Dual . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.2 The Hahn–Banach Separation Theorem . . . . . . . . . . . . . . . . . 6
1.1.3 The Weak Topology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.1.4 Differential Calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.2 Hilbert Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.2.1 Basic Concepts, Properties and Examples . . . . . . . . . . . . . . . . 18
1.2.2 Projection and Orthogonality . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.2.3 Duality, Reflexivity and Weak Convergence . . . . . . . . . . . . . . 22

2 Existence of Minimizers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.1 Extended Real-Valued Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2 Lower-Semicontinuity and Minimization . . . . . . . . . . . . . . . . . . . . . . . 26
2.3 Minimizers of Convex Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

3 Convex Analysis and Subdifferential Calculus . . . . . . . . . . . . . . . . . . . . . 33


3.1 Convexity and Continuity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2 Convexity and Differentiability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.2.1 Directional Derivatives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.2.2 Characterizations of Convexity for Differentiable Functions 38
3.2.3 Lipschitz-Continuity of the Gradient . . . . . . . . . . . . . . . . . . . . 40
3.3 Subgradients, Subdifferential and Fermat’s Rule . . . . . . . . . . . . . . . . . 41
3.4 Subdifferentiability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.5 Basic Subdifferential Calculus Rules and Applications . . . . . . . . . . . 45
3.5.1 Composition with a Linear Function: A Chain Rule . . . . . . . 46
3.5.2 Sum of Convex Functions and the Moreau–Rockafellar
Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.5.3 Some Consequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.5.4 Moreau–Yosida Regularization and Smoothing . . . . . . . . . . . 50


3.6 The Fenchel Conjugate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53


3.6.1 Main Properties and Examples . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.6.2 Fenchel–Rockafellar Duality . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.6.3 The Biconjugate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.7 Optimality Conditions for Constrained Problems . . . . . . . . . . . . . . . . 59
3.7.1 Affine Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.7.2 Nonlinear Constraints and Lagrange Multipliers . . . . . . . . . . 60

4 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.1 Norm of a Bounded Linear Functional . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.2 Optimal Control and Calculus of Variations . . . . . . . . . . . . . . . . . . . . . 67
4.2.1 Controlled Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.2.2 Existence of an Optimal Control . . . . . . . . . . . . . . . . . . . . . . . . 68
4.2.3 The Linear-Quadratic Problem . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.2.4 Calculus of Variations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.3 Some Elliptic Partial Differential Equations . . . . . . . . . . . . . . . . . . . . . 74
4.3.1 The Theorems of Stampacchia and Lax-Milgram . . . . . . . . . . 74
4.3.2 Sobolev Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.3.3 Poisson-Type Equations in H¹ and W^{1,p} . . . . . . . . . . . . . . . . . 76
4.4 Sparse Solutions for Underdetermined Systems of Equations . . . . . . 79

5 Problem-Solving Strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.1 Combining Optimization and Discretization . . . . . . . . . . . . . . . . . . . . 81
5.1.1 Recovering Solutions for the Original Problem: Ritz’s
Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.1.2 Building the Finite-Dimensional Approximations . . . . . . . . . 83
5.2 Iterative Procedures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.3 Problem Simplification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.3.1 Elimination of Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.3.2 Splitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

6 Keynote Iterative Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93


6.1 Steepest Descent Trajectories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
6.2 The Proximal Point Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.2.1 Basic Properties of Proximal Sequences . . . . . . . . . . . . . . . . . 97
6.2.2 Strong Convergence and Finite-Time Termination . . . . . . . . . 100
6.2.3 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
6.3 Gradient-Consistent Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
6.3.1 The Gradient Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
6.3.2 Newton’s Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
6.4 Some Comments on Extensions and Variants . . . . . . . . . . . . . . . . . . . 113
6.4.1 Additive Splitting: Proximal and Gradient Methods . . . . . . . . 113
6.4.2 Duality and Penalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
Chapter 1
Basic Functional Analysis

Abstract Functional and convex analysis are closely intertwined. In this chapter
we recall the basic concepts and results from functional analysis and calculus that
will be needed throughout this book. A first section is devoted to general normed
spaces. We begin by establishing some of their main properties, with an emphasis
on the linear functions between spaces. This leads us to bounded linear functionals
and the topological dual. Second, we review the Hahn–Banach Separation Theo-
rem, a very powerful tool with important consequences. It also illustrates the fact
that the boundaries between functional and convex analysis can be rather fuzzy at
times. Next, we discuss some relevant results concerning the weak topology, espe-
cially in terms of closedness and compactness. Finally, we include a subsection on
differential calculus, which also provides an introduction to standard smooth opti-
mization techniques. The second section deals with Hilbert spaces, and their very
rich geometric structure, including the ideas of projection and orthogonality. We
also revisit some of the general concepts from the first section (duality, reflexivity,
weak convergence) in the light of this geometry.
For a comprehensive presentation, the reader is referred to [30] and [94].

1.1 Normed Spaces

A norm on a real vector space X is a function ‖·‖ : X → R such that

i) ‖x‖ > 0 for all x ≠ 0;
ii) ‖αx‖ = |α| ‖x‖ for all x ∈ X and α ∈ R;
iii) the triangle inequality ‖x + y‖ ≤ ‖x‖ + ‖y‖ holds for all x, y ∈ X.
A normed space is a vector space where a norm has been specified.
Example 1.1. The space R^N with the norms ‖x‖_∞ = max_{i=1,...,N} |x_i|, or
‖x‖_p = ( ∑_{i=1}^{N} |x_i|^p )^{1/p}, for p ≥ 1.
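For readers who want to experiment numerically, the following minimal Python sketch (our own illustration, not part of the original text; it assumes NumPy is available) evaluates ‖·‖_∞ and ‖·‖_p on a sample vector and checks the triangle inequality for a few values of p.

    import numpy as np

    def norm_inf(x):
        # ‖x‖_∞ = max_i |x_i|
        return np.max(np.abs(x))

    def norm_p(x, p):
        # ‖x‖_p = (sum_i |x_i|^p)^(1/p), for p >= 1
        return np.sum(np.abs(x) ** p) ** (1.0 / p)

    x = np.array([3.0, -4.0, 1.0])
    y = np.array([-1.0, 2.0, 5.0])
    for p in (1, 2, 3):
        # triangle inequality: ‖x + y‖_p <= ‖x‖_p + ‖y‖_p
        assert norm_p(x + y, p) <= norm_p(x, p) + norm_p(y, p) + 1e-12
    print(norm_inf(x), norm_p(x, 1), norm_p(x, 2))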


Example 1.2. The space C([a, b]; R) of continuous real-valued functions on the
interval [a, b], with the norm ‖·‖_∞ defined by ‖f‖_∞ = max_{t∈[a,b]} |f(t)|.

Example 1.3. The space L¹(a, b; R) of Lebesgue-integrable real-valued functions
on the interval [a, b], with the norm ‖·‖_1 defined by ‖f‖_1 = ∫_a^b |f(t)| dt.

Given r > 0 and x ∈ X, the open ball of radius r centered at x is the set

B_X(x, r) = { y ∈ X : ‖y − x‖ < r }.

The closed ball is

B̄_X(x, r) = { y ∈ X : ‖y − x‖ ≤ r }.

We shall omit the reference to the space X whenever it is clear from the context. A
set is bounded if it is contained in some ball.

In a normed space one can define a canonical topology as follows: a set V is a


neighborhood of a point x if there is r > 0 such that B(x, r) ⊂ V . We call it the strong
topology, in contrast with the weak topology to be defined later on.

We say that a sequence (xn) in X converges (strongly) to x̄ ∈ X, and write xn → x̄
as n → ∞, if lim_{n→∞} ‖xn − x̄‖ = 0. The point x̄ is the limit of the sequence. On the other
hand, (xn) has the Cauchy property, or is a Cauchy sequence, if lim_{n,m→∞} ‖xn − xm‖ = 0.
Every convergent sequence has the Cauchy property, and every Cauchy sequence
is bounded. A Banach space is a normed space in which every Cauchy sequence is
convergent, a property called completeness.
Example 1.4. The spaces in Examples 1.1, 1.2 and 1.3 are Banach spaces.

We have the following result:
Theorem 1.5 (Baire's Category Theorem). Let X be a Banach space and let (Cn)
be a sequence of closed subsets of X. If each Cn has empty interior, then so does
⋃_{n∈N} Cn.

Proof. A set C ⊂ X has empty interior if, and only if, every open ball intersects
C^c. Let B be an open ball. Take another open ball B′ whose closure is contained
in B. Since C_0 has empty interior, B′ ∩ C_0^c ≠ ∅. Moreover, since both B′ and C_0^c are
open, there exist x_1 ∈ X and r_1 ∈ (0, 1/2) such that B(x_1, r_1) ⊂ B′ ∩ C_0^c. Analogously,
there exist x_2 ∈ X and r_2 ∈ (0, 1/4) such that B(x_2, r_2) ⊂ B(x_1, r_1) ∩ C_1^c ⊂ B′ ∩ C_0^c ∩ C_1^c.
Inductively, one defines (x_m, r_m) ∈ X × R such that x_m ∈ B(x_n, r_n) ∩ ⋂_{k=0}^{n} C_k^c and
r_m ∈ (0, 2^{−m}) for each m > n ≥ 1. In particular, ‖x_m − x_n‖ < 2^{−n} whenever m >
n ≥ 1. It follows that (x_n) is a Cauchy sequence and so, it must converge to some x̄,
which must belong to B ∩ ⋂_{k=0}^{∞} C_k^c ⊂ B ∩ ( ⋃_{k=0}^{∞} C_k )^c, by construction.


We shall find several important consequences of this result, especially the Banach–
Steinhaus uniform boundedness principle (Theorem 1.9) and the continuity of con-
vex functions in the interior of their domain (Proposition 3.6).

1.1.1 Bounded Linear Operators and Functionals, Topological Dual

Bounded Linear Operators

Let (X, ‖·‖_X) and (Y, ‖·‖_Y) be normed spaces. A linear operator L : X → Y is
bounded if

‖L‖_{L(X;Y)} := sup_{‖x‖_X = 1} ‖L(x)‖_Y < ∞.

The function ‖·‖_{L(X;Y)} is a norm on the space L(X;Y) of bounded linear operators
from (X, ‖·‖_X) to (Y, ‖·‖_Y). For linear operators, boundedness and (uniform)
continuity are closely related. This is shown in the following result:
Proposition 1.6. Let (X, ‖·‖_X) and (Y, ‖·‖_Y) be normed spaces and let L : X → Y
be a linear operator. The following are equivalent:
i) L is continuous at 0;
ii) L is bounded; and
iii) L is uniformly Lipschitz-continuous on X.
Proof. Let i) hold. For each ε > 0 there is δ > 0 such that ‖L(h)‖_Y ≤ ε whenever
‖h‖_X ≤ δ. If ‖x‖_X = 1, then ‖L(x)‖_Y = δ^{-1}‖L(δx)‖_Y ≤ δ^{-1}ε and so,
sup_{‖x‖_X = 1} ‖L(x)‖_Y < ∞. Next, if ii) holds, then

‖L(x) − L(z)‖_Y = ‖x − z‖_X ‖ L( (x − z)/‖x − z‖_X ) ‖_Y ≤ ‖L‖_{L(X;Y)} ‖x − z‖_X

and L is uniformly Lipschitz-continuous. Clearly, iii) implies i).



We have the following:
Proposition 1.7. If (X, ‖·‖_X) is a normed space and (Y, ‖·‖_Y) is a Banach space,
then (L(X;Y), ‖·‖_{L(X;Y)}) is a Banach space.

Proof. Let (Ln) be a Cauchy sequence in L(X;Y). Then, for each x ∈ X, the
sequence (Ln(x)) has the Cauchy property as well. Since Y is complete, there
exists L(x) = lim_{n→∞} Ln(x). Clearly, the function L : X → Y is linear. Moreover, since
(Ln) is a Cauchy sequence, it is bounded. Therefore, there exists C > 0 such that
‖Ln(x)‖_Y ≤ ‖Ln‖_{L(X;Y)} ‖x‖_X ≤ C‖x‖_X for all x ∈ X. Passing to the limit, we deduce
that L ∈ L(X;Y) and ‖L‖_{L(X;Y)} ≤ C.

The kernel of L ∈ L (X;Y ) is the set

ker(L) = { x ∈ X : L(x) = 0 } = L−1 (0),

which is a closed subspace of X. The range of L is

R(L) = L(X) = { L(x) : x ∈ X }.



It is a subspace of Y , but it is not necessarily closed.

An operator L ∈ L (X;Y ) is invertible if there exists an operator in L (Y ; X),


called the inverse of L, and denoted by L−1 , such that L−1 ◦ L(x) = x for all x ∈ X
and L◦L−1 (y) = y for all y ∈ Y . The set of invertible operators in L (X;Y ) is denoted
by Inv(X;Y ). We have the following:
Proposition 1.8. Let (X, ‖·‖_X) and (Y, ‖·‖_Y) be Banach spaces. The set Inv(X;Y)
is open in L(X;Y) and the function Φ : Inv(X;Y) → Inv(Y;X), defined by Φ(L) =
L^{-1}, is continuous.

Proof. Let L_0 ∈ Inv(X;Y) and let L ∈ B(L_0, ‖L_0^{-1}‖^{-1}). Let I_X be the identity
operator in X and write M = I_X − L_0^{-1} ∘ L = L_0^{-1} ∘ (L_0 − L). Denote by M^k the
composition of M with itself k times and define M_n = ∑_{k=0}^{n} M^k. Since ‖M‖ < 1, (M_n)
is a Cauchy sequence in L(X;X) and must converge to some M̄. But M ∘ M_n =
M_n ∘ M = M_{n+1} − I_X implies M̄ ∘ (I_X − M) = (I_X − M) ∘ M̄ = I_X, which in turn gives

(M̄ ∘ L_0^{-1}) ∘ L = I_X and L ∘ (M̄ ∘ L_0^{-1}) = I_Y.

It ensues that L ∈ Inv(X;Y) and L^{-1} = M̄ ∘ L_0^{-1}. For the continuity, since

‖L^{-1} − L_0^{-1}‖ ≤ ‖L^{-1} ∘ L_0 − I_X‖ · ‖L_0^{-1}‖ = ‖L_0^{-1}‖ · ‖M̄ − I_X‖

and

‖M_{n+1} − I_X‖ = ‖M ∘ M_n‖ ≤ ‖M‖ · (1 − ‖M‖)^{-1},

we deduce that

‖L^{-1} − L_0^{-1}‖ ≤ ( ‖L_0^{-1}‖² / (1 − ‖M‖) ) ‖L − L_0‖.

Observe that Φ is actually Lipschitz-continuous on every closed ball B̄(L_0, R) with
R < ‖L_0^{-1}‖^{-1}.

This fact will be used in Sect. 6.3.2. This and other useful calculus tools for
normed spaces can be found in [33].

A remarkable consequence of linearity and completeness is that pointwise boundedness
implies boundedness in the operator norm ‖·‖_{L(X;Y)}. More precisely, we
have the following consequence of Baire's Category Theorem 1.5:
Theorem 1.9 (Banach–Steinhaus uniform boundedness principle). Let (L_λ)_{λ∈Λ} be
a family of bounded linear operators from a Banach space (X, ‖·‖_X) to a normed
space (Y, ‖·‖_Y). If

sup_{λ∈Λ} ‖L_λ(x)‖_Y < ∞

for each x ∈ X, then

sup_{λ∈Λ} ‖L_λ‖_{L(X;Y)} < ∞.

Proof. For each n ∈ N, the set

C_n := { x ∈ X : sup_{λ∈Λ} ‖L_λ(x)‖_Y ≤ n }

is closed, as an intersection of closed sets. Since ⋃_{n∈N} C_n = X has nonempty interior,
Baire's Category Theorem 1.5 shows the existence of N ∈ N, x̂ ∈ X and r̂ > 0 such
that B(x̂, r̂) ⊂ C_N. This implies

r‖L_λ(h)‖_Y ≤ ‖L_λ(x̂ + rh)‖_Y + ‖L_λ(x̂)‖_Y ≤ 2N

for each h ∈ X with ‖h‖_X ≤ 1, each r < r̂, and each λ ∈ Λ. It follows that
sup_{λ∈Λ} ‖L_λ‖_{L(X;Y)} < ∞.

The Topological Dual and the Bidual

The topological dual of a normed space (X, ‖·‖) is the normed space (X*, ‖·‖_*),
where X* = L(X;R) and ‖·‖_* = ‖·‖_{L(X;R)}. It is actually a Banach space, by
Proposition 1.7. Elements of X* are bounded linear functionals. The bilinear
function ⟨·, ·⟩_{X*,X} : X* × X → R, defined by

⟨L, x⟩_{X*,X} = L(x),

is the duality product between X and X*. If the space can be easily guessed from the
context, we shall write ⟨L, x⟩ instead of ⟨L, x⟩_{X*,X} to simplify the notation.

The orthogonal space or annihilator of a subspace V of X is

V^⊥ = { L ∈ X* : ⟨L, x⟩ = 0 for all x ∈ V },

which is a closed subspace of X*, even if V is not closed.

The topological bidual of (X, ‖·‖) is the topological dual of (X*, ‖·‖_*), which
we denote by (X**, ‖·‖_**). Each x ∈ X defines a linear function μ_x : X* → R by

μ_x(L) = ⟨L, x⟩_{X*,X}

for each L ∈ X*. Since ⟨L, x⟩ ≤ ‖L‖_* ‖x‖ for each x ∈ X and L ∈ X*, we have
‖μ_x‖_** ≤ ‖x‖, so actually μ_x ∈ X**. The function J : X → X**, defined by
J(x) = μ_x, is the canonical embedding of X into X**. Clearly, the function J
is linear and continuous. We shall see later (Proposition 1.17) that J is an isometry.
This fact will imply, in particular, that the canonical embedding J is injective.
The space (X, ‖·‖) is reflexive if J is also surjective; in other words, if every
element μ of X** is of the form μ = μ_x for some x ∈ X. Necessarily, (X, ‖·‖) must be
a Banach space since it is homeomorphic to (X**, ‖·‖_**).

The Adjoint Operator

Let (X, ‖·‖_X) and (Y, ‖·‖_Y) be normed spaces and let L ∈ L(X;Y). Given y* ∈ Y*,
the function ζ_{y*} : X → R defined by

ζ_{y*}(x) = ⟨y*, Lx⟩_{Y*,Y}

is linear and continuous. In other words, ζ_{y*} ∈ X*. The adjoint of L is the operator
L* : Y* → X* defined by

L*(y*) = ζ_{y*}.

In other words, L and L* are linked by the identity

⟨L*y*, x⟩_{X*,X} = ⟨y*, Lx⟩_{Y*,Y}.

We immediately see that L* ∈ L(Y*;X*) and

‖L*‖_{L(Y*;X*)} = sup_{‖y*‖_{Y*} = 1} sup_{‖x‖_X = 1} ⟨y*, Lx⟩_{Y*,Y} ≤ ‖L‖_{L(X;Y)}.

We shall verify later (Corollary 1.18) that the two norms actually coincide.

1.1.2 The Hahn–Banach Separation Theorem

The Hahn–Banach Separation Theorem is a cornerstone in functional and convex


analysis. As we shall see in the next chapter, it has several important consequences.

Let X be a real vector space. A set C ⊆ X is convex if for each pair of points of C,
the segment joining them also belongs to C. In other words, if the point λ x+(1− λ )y
belongs to C whenever x, y ∈ C and λ ∈ (0, 1).

Theorem 1.10 (Hahn–Banach Separation Theorem). Let A and B be nonempty,
disjoint convex subsets of a normed space (X, ‖·‖).
i) If A is open, there exists L ∈ X* \ {0} such that ⟨L, x⟩ < ⟨L, y⟩ for each x ∈ A and
y ∈ B.
ii) If A is compact and B is closed, there exist L ∈ X* \ {0} and ε > 0 such that
⟨L, x⟩ + ε ≤ ⟨L, y⟩ for each x ∈ A and y ∈ B.

Remarks

Before proceeding with the proof of the Hahn–Banach Separation Theorem 1.10,
some remarks are in order:

First, Theorem 1.10 is equivalent to



Theorem 1.11. Let C be a nonempty, convex subset of a normed space (X, ‖·‖) not
containing the origin.
i) If C is open, there exists L ∈ X* \ {0} such that ⟨L, x⟩ < 0 for each x ∈ C.
ii) If C is closed, there exist L ∈ X* \ {0} and ε > 0 such that ⟨L, x⟩ + ε ≤ 0 for each
x ∈ C.
Clearly, Theorem 1.11 is a special case of Theorem 1.10. To verify that they are
actually equivalent, simply write C = A − B and observe that C is open if A is, while
C is closed if A is compact and B is closed.

Second, part ii) of Theorem 1.11 can be easily deduced from part i) of The-
orem 1.10 by considering a sufficiently small open ball A around the origin (not
intersecting C), and writing B = C.

Finally, in finite-dimensional spaces, part i) of Theorem 1.11 can be obtained


without any topological assumptions on the sets involved. More precisely, we have
the following:
Proposition 1.12. Given N ≥ 1, let C be a nonempty and convex subset of RN not
containing the origin. Then, there exists v ∈ RN \ {0} such that v · x ≤ 0 for each
x ∈ C. In particular, if N ≥ 2 and C is open, then

V = { x ∈ RN : v · x = 0}

is a nontrivial subspace of RN that does not intersect C.


Proof. Let (xn) be a sequence in C such that the set {xn : n ≥ 1} is dense in C. Let Cn be the convex
hull of the set {xk : k = 1, . . . , n} and let pn be the least-norm element of Cn. By
convexity, for each x ∈ Cn and t ∈ (0, 1), we have

‖pn‖² ≤ ‖pn + t(x − pn)‖² = ‖pn‖² + 2t pn · (x − pn) + t² ‖x − pn‖².

Therefore,

0 ≤ 2‖pn‖² ≤ 2 pn · x + t ‖x − pn‖².

Letting t → 0, we deduce that pn · x ≥ 0 for all x ∈ Cn. Now write vn = −pn/‖pn‖.
The sequence (vn) lies in the unit sphere, which is compact. We may extract a
subsequence that converges to some v ∈ R^N with ‖v‖ = 1 (thus v ≠ 0) and v · x ≤ 0 for
all x ∈ C.


Proof of Theorem 1.11

Many standard functional analysis textbooks begin by presenting a general form
of the Hahn–Banach Extension Theorem (see Theorem 1.13 below), which is then
used to prove Theorem 1.11. We prefer to take the opposite path here, which has a more
convex-analytic flavor. The same approach can be found, for instance, in [96].

Step 1: If the dimension of X is at least 2, there is a nontrivial subspace of X not
intersecting C.
Take any two-dimensional subspace Y of X. If Y ∩ C = ∅ there is nothing to prove.
Otherwise, identify Y with R² and use Proposition 1.12 to obtain a subspace of Y
disjoint from Y ∩ C, which clearly gives a subspace of X not intersecting C.

Step 2: There is a closed subspace M of X such that M ∩ C = ∅ and the quotient
space X/M has dimension 1.
Let 𝓜 be the collection of all subspaces of X not intersecting C, ordered by inclusion.
Step 1 shows that 𝓜 ≠ ∅. According to Zorn's Lemma (see, for instance,
[30, Lemma 1.1]), 𝓜 has a maximal element M, which must be a closed subspace
of X not intersecting C. The dimension of the quotient space X/M is at least 1
because M ≠ X. The canonical homomorphism π : X → X/M is continuous and
open. If the dimension of X/M is greater than 1, we use Step 1 again with X̃ = X/M
and C̃ = π(C) to find a nontrivial subspace M̃ of X̃ that does not intersect C̃. Then
π^{-1}(M̃) is a subspace of X that does not intersect C and is strictly greater than M,
contradicting the maximality of the latter.

Step 3: Conclusion.
Take any (necessarily continuous) isomorphism φ : X/M → R and set L = φ ∘ π.
Then, either ⟨L, x⟩ < 0 for all x ∈ C, or ⟨−L, x⟩ < 0 for all x ∈ C.

A Few Direct but Important Consequences

The following is known as the Hahn–Banach Extension Theorem:


Theorem 1.13. Let M be a subspace of X and let ℓ : M → R be a linear function
such that ⟨ℓ, x⟩ ≤ α‖x‖ for some α > 0 and all x ∈ M. Then, there exists L ∈ X*
such that ⟨L, x⟩ = ⟨ℓ, x⟩ for all x ∈ M and ‖L‖_* ≤ α.

Proof. Define

A = { (x, μ) ∈ X × R : μ > α‖x‖, x ∈ X }
B = { (y, ν) ∈ X × R : ν = ⟨ℓ, y⟩, y ∈ M }.

By the Hahn–Banach Separation Theorem 1.10, there is (L̃, s) ∈ X* × R \ {(0, 0)}
such that

⟨L̃, x⟩ + sμ ≤ ⟨L̃, y⟩ + sν

for all (x, μ) ∈ A and (y, ν) ∈ B. Taking x = y = 0, μ = 1 and ν = 0, we deduce that
s ≤ 0. If s = 0, then ⟨L̃, x − y⟩ ≤ 0 for all x ∈ X and so L̃ = 0, which contradicts the
fact that (L̃, s) ≠ (0, 0). We conclude that s < 0. Writing L = −L̃/s, we obtain

⟨L, x⟩ − μ ≤ ⟨L − ℓ, y⟩

for all (x, μ) ∈ A and y ∈ M. Passing to the limit as μ → α‖x‖, we get

⟨L, x⟩ ≤ ⟨L − ℓ, y⟩ + α‖x‖

for all x ∈ X and all y ∈ M. It follows that L = ℓ on M and ‖L‖_* ≤ α.




Another consequence of the Hahn–Banach Separation Theorem 1.10 is the
following:

Corollary 1.14. For each x ∈ X, there exists x⋆ ∈ X* such that ‖x⋆‖_* = 1 and
⟨x⋆, x⟩ = ‖x‖.

Proof. Set A = B(0, ‖x‖) and B = {x}. By Theorem 1.10, there exists L_x ∈ X* \
{0} such that ⟨L_x, y⟩ ≤ ⟨L_x, x⟩ for all y ∈ A. This implies ⟨L_x, x⟩ = ‖L_x‖_* ‖x‖. The
functional x⋆ = L_x/‖L_x‖_* has the desired properties.


The functional x⋆, given by Corollary 1.14, is a support functional for x. The
normalized duality mapping is the set-valued function F : X → P(X*) given by

F(x) = { x* ∈ X* : ‖x*‖_* = 1 and ⟨x*, x⟩ = ‖x‖ }.

The set F(x) is always convex and it need not be a singleton, as shown in the
following example:

Example 1.15. Consider X = R² with ‖(x_1, x_2)‖ = |x_1| + |x_2| for (x_1, x_2) ∈ X. Then,
X* = R² with ⟨(x_1*, x_2*), (x_1, x_2)⟩ = x_1* x_1 + x_2* x_2 and ‖(x_1*, x_2*)‖_* = max{|x_1*|, |x_2*|} for
(x_1*, x_2*) ∈ X*. Moreover, F(1, 0) = { (1, b) ∈ R² : b ∈ [−1, 1] }.
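The set F(1, 0) in Example 1.15 can be verified numerically. The short Python sketch below (our own illustration; the helper names are not from the text) checks that every x* = (1, b) with b ∈ [−1, 1] has dual (max) norm equal to 1 and satisfies ⟨x*, x⟩ = ‖x‖ for x = (1, 0).

    import numpy as np

    def norm1(x):
        # the norm of Example 1.15: ‖(x1, x2)‖ = |x1| + |x2|
        return np.sum(np.abs(x))

    def dual_norm(xs):
        # its dual norm: ‖(x1*, x2*)‖_* = max{|x1*|, |x2*|}
        return np.max(np.abs(xs))

    x = np.array([1.0, 0.0])
    for b in np.linspace(-1.0, 1.0, 5):
        xs = np.array([1.0, b])
        assert abs(dual_norm(xs) - 1.0) < 1e-12   # ‖x*‖_* = 1
        assert abs(xs @ x - norm1(x)) < 1e-12     # ⟨x*, x⟩ = ‖x‖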

From Corollary 1.14 we deduce the following:

Corollary 1.16. For every x ∈ X, ‖x‖ = max_{‖L‖_* = 1} ⟨L, x⟩.

Recall from Sect. 1.1.1 that the canonical embedding J of X into X** is defined
by J(x) = μ_x, where μ_x satisfies

⟨μ_x, L⟩_{X**,X*} = ⟨L, x⟩_{X*,X}.

Recall also that J is linear, and ‖J(x)‖_** ≤ ‖x‖ for all x ∈ X. We have the
following:
Proposition 1.17. The canonical embedding J : X → X** is a linear isometry.

Proof. It remains to prove that ‖x‖ ≤ ‖μ_x‖_**. To this end, simply notice that

‖μ_x‖_** = sup_{‖L‖_* = 1} μ_x(L) ≥ μ_x(x⋆) = ⟨x⋆, x⟩_{X*,X} = ‖x‖,

where x⋆ is the functional given by Corollary 1.14.




Another consequence of Corollary 1.16 concerns the adjoint of a bounded linear


operator, defined in Sect. 1.1.1:

Corollary 1.18. Let (X, ‖·‖_X) and (Y, ‖·‖_Y) be normed spaces. Given L ∈ L(X;Y),
let L* ∈ L(Y*;X*) be its adjoint. Then ‖L*‖_{L(Y*;X*)} = ‖L‖_{L(X;Y)}.

Proof. We already proved that ‖L*‖_{L(Y*;X*)} ≤ ‖L‖_{L(X;Y)}. For the reverse
inequality, use Corollary 1.16 to deduce that

‖L‖_{L(X;Y)} = sup_{‖x‖_X = 1} max_{‖y*‖_{Y*} = 1} ⟨L*(y*), x⟩_{X*,X} ≤ ‖L*‖_{L(Y*;X*)},

which gives the result.




1.1.3 The Weak Topology

By definition, each element of X* is continuous as a function from (X, ‖·‖) to
(R, |·|). However, there are other topologies on X for which every element of X* is
continuous.¹ The coarsest of such topologies (the one with the fewest open sets) is
called the weak topology and will be denoted by σ(X), or simply σ, if there is no
possible confusion.

¹ A trivial example is the discrete topology (for which every set is open), but it is not very useful
for our purposes.
called the weak topology and will be denoted by σ (X), or simply σ , if there is no
possible confusion.

Given x_0 ∈ X, L ∈ X* and ε > 0, every set of the form

V_L^ε(x_0) = { x ∈ X : ⟨L, x − x_0⟩ < ε } = L^{-1}( (−∞, L(x_0) + ε) )

is open for the weak topology and contains x_0. Moreover, the collection of all such
sets generates a base of neighborhoods of x_0 for the weak topology in the sense that
if V_0 is a neighborhood of x_0, then there exist L_1, . . . , L_N ∈ X* and ε > 0 such that

x_0 ∈ ⋂_{k=1}^{N} V_{L_k}^ε(x_0) ⊂ V_0.

Recall that a Hausdorff space is a topological space in which every two distinct
points admit disjoint neighborhoods.
Proposition 1.19. (X, σ ) is a Hausdorff space.

Proof. Let x_1 ≠ x_2. Part ii) of the Hahn–Banach Separation Theorem 1.10 shows
the existence of L ∈ X* and ε > 0 such that ⟨L, x_1⟩ + ε ≤ ⟨L, x_2⟩. Then V_L^{ε/2}(x_1) and
V_{−L}^{ε/2}(x_2) are disjoint weakly open sets containing x_1 and x_2, respectively.


It follows from the definition that every weakly open set is (strongly) open. If X is
finite-dimensional, the weak topology coincides with the strong topology. Roughly
speaking, the main idea is that, inside any open ball, one can find a weak neighborhood
of its center that can be expressed as a finite intersection of open half-spaces.

[Figure: a weak neighborhood of x_0 bounded by the hyperplanes ⟨L_1, x − x_0⟩ = ε, ⟨L_2, x − x_0⟩ = ε and ⟨L_3, x − x_0⟩ = ε.]

On the other hand, if X is infinite-dimensional, the weak topology is strictly


coarser than the strong topology, as shown in the following example:
Example 1.20. If X is infinite-dimensional, the ball B(0, 1) is open for the strong
topology but not for the weak topology. Roughly speaking, the reason is that no
finite intersection of half-spaces can be bounded in all directions.

In other words, there are open sets which are not weakly open. Of course, the
same is true for closed sets. However, closed convex sets are weakly closed.
Proposition 1.21. A convex subset of a normed space is closed for the strong topol-
ogy if, and only if, it is closed for the weak topology.

Proof. By definition, every weakly closed set must be strongly closed. Conversely,
let C ⊂ X be convex and strongly closed. Given x_0 ∉ C, we may apply part ii) of
the Hahn–Banach Separation Theorem 1.10 with A = {x_0} and B = C to deduce the
existence of L ∈ X* \ {0} and ε > 0 such that ⟨L, x_0⟩ + ε ≤ ⟨L, y⟩ for each y ∈ C. The
set V_L^ε(x_0) is a weak neighborhood of x_0 that does not intersect C. It follows that C^c
is weakly open.

Weakly Convergent Sequences

We say that a sequence (xn) in X converges weakly to x̄, and write xn ⇀ x̄ as n → ∞,
if lim_{n→∞} ⟨L, xn − x̄⟩ = 0 for each L ∈ X*. This is equivalent to saying that for each
weakly open neighborhood V of x̄ there is N ∈ N such that xn ∈ V for all n ≥ N.
The point x̄ is the weak limit of the sequence.

Since |⟨L, xn − x̄⟩| ≤ ‖L‖_* ‖xn − x̄‖, strongly convergent sequences are weakly
convergent and the limits coincide.
Proposition 1.22. Let (xn) converge weakly to x̄ as n → ∞. Then:
i) (xn) is bounded.
ii) ‖x̄‖ ≤ liminf_{n→∞} ‖xn‖.
iii) If (Ln) is a sequence in X* that converges strongly to L̄, then lim_{n→∞} ⟨Ln, xn⟩ = ⟨L̄, x̄⟩.

Proof. For i), write μ_n = J(xn), where J is the canonical injection of X into
X**, which is a linear isometry, by Proposition 1.17. Since lim_{n→∞} μ_n(L) = ⟨L, x̄⟩ for
all L ∈ X*, we have sup_{n∈N} μ_n(L) < +∞. The Banach–Steinhaus uniform boundedness
principle (Theorem 1.9) implies sup_{n∈N} ‖xn‖ = sup_{n∈N} ‖μ_n‖_** ≤ C for some C > 0. For ii),
use Corollary 1.14 to deduce that

‖x̄‖ = ⟨x̄⋆, x̄ − xn⟩ + ⟨x̄⋆, xn⟩ ≤ ⟨x̄⋆, x̄ − xn⟩ + ‖xn‖,

where x̄⋆ is a support functional for x̄, and let n → ∞. Finally, by part i), we have

|⟨Ln, xn⟩ − ⟨L̄, x̄⟩| ≤ |⟨Ln − L̄, xn⟩| + |⟨L̄, xn − x̄⟩| ≤ C ‖Ln − L̄‖_* + |⟨L̄, xn − x̄⟩|.

As n → ∞, we obtain iii).


More on Closed Sets

Since we have defined two topologies on X, there exist (strongly) closed sets and
weakly closed sets. It is possible and very useful to define some sequential notions
as well. A set C ⊂ X is sequentially closed if every convergent sequence of points
in C has its limit in C. Analogously, C is weakly sequentially closed if every weakly
convergent sequence in C has its weak limit in C. The relationship between the
various notions of closedness is summarized in the following result:
Proposition 1.23. Consider the following statements concerning a nonempty set
C ⊂ X:
i) C is weakly closed.
ii) C is weakly sequentially closed.
iii)C is sequentially closed.
iv)C is closed.
Then i) ⇒ ii) ⇒ iii) ⇔ iv) ⇐ i). If C is convex, the four statements are equivalent.
Proof. It is easy to show that i) ⇒ ii) and iii) ⇔ iv). We also know that i) ⇒ iv) and
ii) ⇒ iii) because the weak topology is coarser than the strong topology. Finally, if C
is convex, Proposition 1.21 states precisely that i) ⇔ iv), which closes the loop. 

The Topological Dual Revisited: The Weak∗ Topology

The topological dual X* of a normed space (X, ‖·‖) is a Banach space with the
norm ‖·‖_*. As in Sect. 1.1.3, we can define the weak topology σ(X*) in X*. Recall
that a base of neighborhoods for some point L ∈ X* is generated by the sets of the
form

{ ℓ ∈ X* : ⟨μ, ℓ − L⟩_{X**,X*} < ε },   (1.1)

with μ ∈ X** and ε > 0.

Nevertheless, since X* is, by definition, a space of functions, a third topology can
be defined on X* in a very natural way. It is the topology of pointwise convergence,
which is usually referred to as the weak* topology in this context. We shall denote it
by σ*(X*), or simply σ* if the space is clear from the context. For this topology, a
base of neighborhoods for a point L ∈ X* is generated by the sets of the form

{ ℓ ∈ X* : ⟨ℓ − L, x⟩_{X*,X} < ε },   (1.2)

with x ∈ X and ε > 0. Notice the similarity and difference with (1.1). Now, since
every x ∈ X determines an element μ_x ∈ X** by the relation

⟨μ_x, ℓ⟩_{X**,X*} = ⟨ℓ, x⟩_{X*,X},

it is clear that every set that is open for the weak* topology must be open for the
weak topology as well. In other words, σ* ⊂ σ.

Reflexivity and Weak Compactness

In infinite-dimensional normed spaces, compact sets are rather scarce. For instance,
in such spaces the closed balls are not compact (see [30, Theorem 6.5]). One of the
most important properties of the weak∗ topology is that, according to the Banach–
Alaoglu Theorem (see, for instance, [30, Theorem 3.16]), the closed unit ball in X ∗
is compact for the weak∗ topology. Recall, from Sect. 1.1.1, that X is reflexive if
the canonical embedding J of X into X ∗∗ is surjective. This implies that the spaces
(X, σ ) and (X ∗∗ , σ ∗ ) are homeomorphic, and so, the closed unit ball of X is compact
for the weak topology. Further, we have the following:
Theorem 1.24. Let (X, ‖·‖) be a Banach space. The following are equivalent:
i) X is reflexive.
ii) The closed unit ball B̄(0, 1) is compact for the weak topology.
iii)Every bounded sequence in X has a weakly convergent subsequence.
We shall not include the proof here for the sake of brevity. The interested reader
may consult [50, Chaps. II and V] for full detail, or [30, Chap. 3] for abridged
commentaries.

An important consequence is the following:


Corollary 1.25. Let (yn ) be a bounded sequence in a reflexive space. If every weakly
convergent subsequence has the same weak limit ŷ, then (yn ) must converge weakly
to ŷ as n → ∞.

Proof. Suppose (yn) does not converge weakly to ŷ. Then, there exist a weakly
open neighborhood V of ŷ and a subsequence (y_{k_n}) of (yn) such that y_{k_n} ∉ V for all
n ∈ N. Since (y_{k_n}) is bounded, it has a subsequence that converges weakly as
n → ∞ to some y̌, which cannot be in V and so y̌ ≠ ŷ. This contradicts the uniqueness
of ŷ.


1.1.4 Differential Calculus

Consider a nonempty open set A ⊂ X and a function f : A → R. The directional
derivative of f at x ∈ A in the direction d ∈ X is

f′(x; d) = lim_{t→0+} [ f(x + td) − f(x) ] / t,

whenever this limit exists. The function f is differentiable (or simply Gâteaux-differentiable)
at x if f′(x; d) exists for all d ∈ X and the function d ↦ f′(x; d) is
linear and continuous. In this situation, the Gâteaux derivative (or gradient) of f at
x is ∇f(x) = f′(x; ·), which is an element of X*. On the other hand, f is differentiable
in the sense of Fréchet (or Fréchet-differentiable) at x if there exists L ∈ X*
such that

lim_{h→0} | f(x + h) − f(x) − ⟨L, h⟩ | / ‖h‖ = 0.

If this is the case, the Fréchet derivative of f at x is Df(x) = L. An immediate
consequence of these definitions is
Proposition 1.26. If f is Fréchet-differentiable at x, then it is continuous and
Gâteaux-differentiable there, with ∇ f (x) = D f (x).
As usual, f is differentiable (in the sense of Gâteaux or Fréchet) on A if it is so
at every point of A.

Example 1.27. Let B : X × X → R be a bilinear function:

B(x + α y, z) = B(x, z) + α B(y, z) and B(x, y + α z) = B(x, y) + α B(x, z)

for all x, y, z ∈ X and α ∈ R. Suppose also that B is continuous: |B(x, y)| ≤ β ‖x‖ ‖y‖
for some β ≥ 0 and all x, y ∈ X. The function f : X → R, defined by f (x) = B(x, x), is
Fréchet-differentiable and D f (x)h = B(x, h) + B(h, x). Of course, if B is symmetric:
B(x, y) = B(y, x) for all x, y ∈ X, then D f (x)h = 2B(x, h).
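In finite dimensions, the formula Df(x)h = B(x, h) + B(h, x) can be checked against a difference quotient. The sketch below is our own illustration (it assumes NumPy), taking B(x, y) = xᵀAy for a fixed matrix A.

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((3, 3))       # defines the bilinear form B(x, y) = x^T A y
    B = lambda x, y: x @ A @ y
    f = lambda x: B(x, x)

    x = rng.standard_normal(3)
    h = rng.standard_normal(3)
    exact = B(x, h) + B(h, x)             # D f(x)h as in Example 1.27
    t = 1e-6
    approx = (f(x + t * h) - f(x)) / t    # difference quotient
    assert abs(exact - approx) < 1e-4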


Example 1.28. Let X be the space of continuously differentiable functions defined
on [0, T] with values in R^N, equipped with the norm

‖x‖_X = max_{t∈[0,T]} ‖x(t)‖ + max_{t∈[0,T]} ‖ẋ(t)‖.

Given a continuously differentiable function ℓ : R × R^N × R^N → R, define J : X → R
by

J[x] = ∫_0^T ℓ(t, x(t), ẋ(t)) dt.

Then J is Fréchet-differentiable and

DJ(x)h = ∫_0^T [ ∇_2 ℓ(t, x(t), ẋ(t)) · h(t) + ∇_3 ℓ(t, x(t), ẋ(t)) · ḣ(t) ] dt,

where we use ∇_i to denote the gradient with respect to the i-th set of variables.

It is worth noting that Gâteaux-differentiability does not imply continuity. In
particular, it is weaker than Fréchet-differentiability.

Example 1.29. Define f : R² → R by

f(x, y) = 2x⁴y / (x⁶ + y³) if (x, y) ≠ (0, 0), and f(x, y) = 0 if (x, y) = (0, 0).

A simple computation shows that ∇f(0, 0) = (0, 0). However, lim_{z→0} f(z, z²) = 1 ≠
f(0, 0).
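This behavior can be observed numerically. The following Python sketch (our own illustration) evaluates f along the line (t, t), where the values tend to f(0, 0) = 0, and along the parabola (t, t²), where they are identically equal to 1.

    def f(x, y):
        # the function of Example 1.29
        if (x, y) == (0.0, 0.0):
            return 0.0
        return 2.0 * x**4 * y / (x**6 + y**3)

    for t in (1e-1, 1e-2, 1e-3):
        print(f(t, t))       # tends to 0 as t -> 0+
        print(f(t, t**2))    # equals 1 for every t > 0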

If the gradient of f is Lipschitz-continuous, we can obtain a more precise first-
order estimation for the values of the function:
Lemma 1.30 (Descent Lemma). If f : X → R is Gâteaux-differentiable and ∇f is
Lipschitz-continuous with constant L, then

f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (L/2) ‖y − x‖²

for each x, y ∈ X. In particular, f is continuous.

Proof. Write h = y − x and define g : [0, 1] → R by g(t) = f(x + th). Then ġ(t) =
⟨∇f(x + th), h⟩ for each t ∈ (0, 1), and so

∫_0^1 ⟨∇f(x + th), h⟩ dt = ∫_0^1 ġ(t) dt = g(1) − g(0) = f(y) − f(x).

Therefore,

f(y) − f(x) = ∫_0^1 ⟨∇f(x), h⟩ dt + ∫_0^1 ⟨∇f(x + th) − ∇f(x), h⟩ dt
  ≤ ⟨∇f(x), h⟩ + ∫_0^1 ‖∇f(x + th) − ∇f(x)‖ ‖h‖ dt
  ≤ ⟨∇f(x), h⟩ + ∫_0^1 L ‖h‖² t dt
  = ⟨∇f(x), y − x⟩ + (L/2) ‖y − x‖²,

as claimed.
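As a concrete check (our own illustration, assuming NumPy), take f(x) = ½ xᵀQx on Rⁿ with Q symmetric positive semidefinite; then ∇f(x) = Qx is Lipschitz-continuous with constant L equal to the largest eigenvalue of Q, and the inequality of Lemma 1.30 can be tested on random pairs of points.

    import numpy as np

    rng = np.random.default_rng(1)
    M = rng.standard_normal((4, 4))
    Q = M.T @ M                            # symmetric positive semidefinite
    L = np.linalg.eigvalsh(Q).max()        # Lipschitz constant of the gradient

    f = lambda x: 0.5 * x @ Q @ x
    grad = lambda x: Q @ x

    for _ in range(100):
        x, y = rng.standard_normal(4), rng.standard_normal(4)
        bound = f(x) + grad(x) @ (y - x) + 0.5 * L * np.linalg.norm(y - x) ** 2
        assert f(y) <= bound + 1e-10       # Descent Lemma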


Second Derivatives

If f : A → R is Gâteaux-differentiable on A, a valid question is whether the function
∇f : A → X* is differentiable. As before, one can define a directional derivative

(∇f)′(x; d) = lim_{t→0+} [ ∇f(x + td) − ∇f(x) ] / t,

whenever this limit exists (with respect to the strong topology of X*). The function
f is twice differentiable in the sense of Gâteaux (or simply twice Gâteaux-differentiable)
at x if f is Gâteaux-differentiable in a neighborhood of x, (∇f)′(x; d)
exists for all d ∈ X, and the function d ↦ (∇f)′(x; d) is linear and continuous.
In this situation, the second Gâteaux derivative (or Hessian) of f at x is
∇²f(x) = (∇f)′(x; ·), which is an element of L(X; X*). Similarly, f is twice differentiable
in the sense of Fréchet (or twice Fréchet-differentiable) at x if there exists
M ∈ L(X; X*) such that

lim_{h→0} ‖Df(x + h) − Df(x) − M(h)‖_* / ‖h‖ = 0.

The second Fréchet derivative of f at x is D²f(x) = M.

We have the following:


Proposition 1.31 (Second-order Taylor Approximation). Let A be an open subset
of X and let x ∈ A. Assume f : A → R is twice Gâteaux-differentiable at x. Then, for
each d ∈ X, we have

lim_{t→0} (1/t²) | f(x + td) − f(x) − t⟨∇f(x), d⟩ − (t²/2)⟨∇²f(x)d, d⟩ | = 0.

Proof. Define φ : I ⊂ R → R by φ(t) = f(x + td), where I is a sufficiently small
open interval around 0 such that φ(t) exists for all t ∈ I. It is easy to see that φ′(t) =
⟨∇f(x + td), d⟩ and φ″(0) = ⟨∇²f(x)d, d⟩. The second-order Taylor expansion for
φ in R yields

lim_{t→0} (1/t²) | φ(t) − φ(0) − tφ′(0) − (t²/2)φ″(0) | = 0,

which gives the result.


Of course, it is possible to define derivatives of higher order, and obtain the cor-
responding Taylor approximations.

Optimality Conditions for Differentiable Optimization Problems

The following is the keynote necessary condition for a point x̂ to minimize a


Gâteaux-differentiable function f over a convex set C.
Theorem 1.32 (Fermat's Rule). Let C be a convex subset of a normed space (X, ‖·‖)
and let f : X → R ∪ {+∞}. If f(x̂) ≤ f(y) for all y ∈ C and if f is Gâteaux-differentiable
at x̂, then

⟨∇f(x̂), y − x̂⟩ ≥ 0

for all y ∈ C. If, moreover, x̂ ∈ int(C), then ∇f(x̂) = 0.

Proof. Take y ∈ C. Since C is convex, for each λ ∈ (0, 1), the point y_λ = λy + (1 −
λ)x̂ belongs to C. The inequality f(x̂) ≤ f(y_λ) is equivalent to f(x̂ + λ(y − x̂)) −
f(x̂) ≥ 0. It suffices to divide by λ and let λ → 0 to deduce that f′(x̂; y − x̂) ≥ 0 for
all y ∈ C.


To fix the ideas, consider a differentiable function on X = R2 . Theorem 1.32


states that the gradient of f at x̂ must point inwards, with respect to C. In other
words, f can only decrease by leaving the set C. This situation is depicted below:

[Figure: the level curves of f and the convex set C, with the gradient ∇f(x̂) at the minimizer x̂ pointing inwards with respect to C.]
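A finite-dimensional sanity check of Fermat's Rule (our own illustration, assuming NumPy): minimize f(x) = ‖x − a‖² over the box C = [0, 1]² in R². The minimizer x̂ is the componentwise truncation of a to [0, 1], and ⟨∇f(x̂), y − x̂⟩ ≥ 0 for every y ∈ C.

    import numpy as np

    a = np.array([1.7, -0.3])
    f = lambda x: np.sum((x - a) ** 2)
    grad = lambda x: 2.0 * (x - a)

    x_hat = np.clip(a, 0.0, 1.0)           # minimizer of f over the box [0, 1]^2
    rng = np.random.default_rng(2)
    for _ in range(1000):
        y = rng.uniform(0.0, 1.0, size=2)  # random point of C
        assert grad(x_hat) @ (y - x_hat) >= -1e-12   # Fermat's Rule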

As we shall see in the next chapter, the condition given by Fermat’s Rule (The-
orem 1.32) is not only necessary, but also sufficient, for convex functions. In the
general case, one can provide second-order necessary and sufficient conditions for
optimality. We state this result in the unconstrained case (C = X) for simplicity.

An operator M ∈ L(X; X*) is positive semidefinite if ⟨Md, d⟩ ≥ 0 for all d ∈ X;
positive definite if ⟨Md, d⟩ > 0 for all d ≠ 0; and uniformly elliptic with constant
α > 0 if ⟨Md, d⟩ ≥ (α/2) ‖d‖² for all d ∈ X.
Theorem 1.33. Let A be an open subset of a normed space (X, ‖·‖), let x̂ ∈ A, and
let f : A → R.
i) If f(x̂) ≤ f(y) for all y in a neighborhood of x̂, and f is twice Gâteaux-differentiable
at x̂, then ∇f(x̂) = 0 and ∇²f(x̂) is positive semidefinite.
ii) If ∇f(x̂) = 0 and ∇²f(x̂) is uniformly elliptic, then f(x̂) < f(y) for all y in a
neighborhood of x̂.

Proof. For i), we already know by Theorem 1.32 that ∇f(x̂) = 0. Now, if d ∈ X and
ε > 0, by Proposition 1.31, there is t_0 > 0 such that

(t²/2) ⟨∇²f(x̂)d, d⟩ > f(x̂ + td) − f(x̂) − εt² ≥ −εt²

for all t ∈ [0, t_0]. It follows that ⟨∇²f(x̂)d, d⟩ ≥ 0.

For ii), assume ∇²f(x̂) is uniformly elliptic with constant α > 0 and take d ∈ X.
Set ε = (α/4) ‖d‖². By Proposition 1.31, there is t_1 > 0 such that

f(x̂ + td) > f(x̂) + (t²/2) ⟨∇²f(x̂)d, d⟩ − εt² ≥ f(x̂)

for all t ∈ [0, t_1].
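To illustrate Theorem 1.33 in R² (our own example, assuming NumPy), take f(x) = ½ xᵀQx − bᵀx with Q symmetric positive definite. The critical point x̂ = Q⁻¹b satisfies ∇f(x̂) = 0, and ∇²f(x̂) = Q is positive definite (hence uniformly elliptic in finite dimensions), so x̂ is a local, in fact global, minimizer.

    import numpy as np

    Q = np.array([[2.0, 0.5],
                  [0.5, 1.0]])             # symmetric positive definite
    b = np.array([1.0, -1.0])
    f = lambda x: 0.5 * x @ Q @ x - b @ x
    grad = lambda x: Q @ x - b

    x_hat = np.linalg.solve(Q, b)          # the critical point
    assert np.allclose(grad(x_hat), 0.0)
    assert np.all(np.linalg.eigvalsh(Q) > 0)    # second-order condition

    rng = np.random.default_rng(3)
    for _ in range(100):
        d = 1e-3 * rng.standard_normal(2)
        assert f(x_hat + d) >= f(x_hat)    # x_hat is a (global) minimizer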


1.2 Hilbert Spaces

Hilbert spaces are an important class of Banach spaces with rich geometric
properties.

1.2.1 Basic Concepts, Properties and Examples

An inner product in a real vector space H is a function ⟨·, ·⟩ : H × H → R such that:

i) ⟨x, x⟩ > 0 for every x ≠ 0;
ii) ⟨x, y⟩ = ⟨y, x⟩ for each x, y ∈ H;
iii) ⟨αx + y, z⟩ = α⟨x, z⟩ + ⟨y, z⟩ for each α ∈ R and x, y, z ∈ H.

The function ‖·‖ : H → R, defined by ‖x‖ = √⟨x, x⟩, is a norm on H. Indeed,
it is clear that ‖x‖ > 0 for every x ≠ 0. Moreover, for each α ∈ R and x ∈ H, we
have ‖αx‖ = |α| ‖x‖. It only remains to verify the triangle inequality. We have the
following:

Proposition 1.34. For each x, y ∈ H we have
i) The Cauchy–Schwarz inequality: |⟨x, y⟩| ≤ ‖x‖ ‖y‖.
ii) The triangle inequality: ‖x + y‖ ≤ ‖x‖ + ‖y‖.

Proof. The Cauchy–Schwarz inequality is trivially satisfied if y = 0. If y ≠ 0 and
α > 0, then

0 ≤ ‖x ± αy‖² = ⟨x ± αy, x ± αy⟩ = ‖x‖² ± 2α⟨x, y⟩ + α²‖y‖².

Therefore,

|⟨x, y⟩| ≤ (1/(2α)) ‖x‖² + (α/2) ‖y‖²

for each α > 0. In particular, taking α = ‖x‖/‖y‖, we deduce i). Next, we use i) to
deduce that

‖x + y‖² = ‖x‖² + 2⟨x, y⟩ + ‖y‖² ≤ ‖x‖² + 2‖x‖‖y‖ + ‖y‖² = ( ‖x‖ + ‖y‖ )²,

whence ii) holds.




If ‖x‖ = √⟨x, x⟩ for all x ∈ H, we say that the norm ‖·‖ is associated to the inner
product ⟨·, ·⟩. A Hilbert space is a Banach space whose norm is associated to an
inner product.
Example 1.35. The following are Hilbert spaces:
i) The Euclidean space R^N is a Hilbert space with the inner product given by the
dot product: ⟨x, y⟩ = x · y.
ii) The space ℓ²(N; R) of real sequences x = (xn) such that ∑_{n∈N} xn² < +∞,
equipped with the inner product defined by ⟨x, y⟩ = ∑_{n∈N} xn yn.
iii) Let Ω be a bounded open subset of R^N. The space L²(Ω; R^M) of (classes of)
measurable vector fields φ : Ω → R^M such that

∫_Ω φ_m(ζ)² dζ < +∞, for m = 1, 2, . . . , M,

with the inner product ⟨φ, ψ⟩ = ∑_{m=1}^{M} ∫_Ω φ_m(ζ) ψ_m(ζ) dζ.

By analogy with R^N, it seems reasonable to define the angle γ between two
nonzero vectors x and y by the relation

cos(γ) = ⟨x, y⟩ / ( ‖x‖ ‖y‖ ),   γ ∈ [0, π].

We shall say that x and y are orthogonal, and write x ⊥ y, if cos(γ) = 0. In a similar
fashion, we say x and y are parallel, and write x ∥ y, if |cos(γ)| = 1. With this
notation, we have

i) Pythagoras' Theorem: x ⊥ y if, and only if, ‖x + y‖² = ‖x‖² + ‖y‖²;
ii) The collinearity condition: x ∥ y if, and only if, x = λy with λ ∈ R.

Another important geometric property of the norm in a Hilbert space is the
Parallelogram Identity, which states that

‖x + y‖² + ‖x − y‖² = 2( ‖x‖² + ‖y‖² )   (1.3)

for each x, y ∈ H. It shows the relationship between the lengths of the sides and
the lengths of the diagonals in a parallelogram, and is easily proved by adding the
following identities:

‖x + y‖² = ‖x‖² + ‖y‖² + 2⟨x, y⟩ and ‖x − y‖² = ‖x‖² + ‖y‖² − 2⟨x, y⟩.

[Figure: a parallelogram with sides x and y and diagonals x + y and x − y.]

Example 1.36. The space X = C([0, 1]; R) with the norm ‖x‖ = max_{t∈[0,1]} |x(t)| is not
a Hilbert space. Consider the functions x, y ∈ X defined by x(t) = 1 and y(t) = t for
t ∈ [0, 1]. We have ‖x‖ = 1, ‖y‖ = 1, ‖x + y‖ = 2 and ‖x − y‖ = 1. The parallelogram
identity (1.3) does not hold.
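The failure of (1.3) in this example is easy to reproduce numerically (our own illustration, assuming NumPy) by discretizing [0, 1] with a fine grid and approximating the max-norm:

    import numpy as np

    t = np.linspace(0.0, 1.0, 1001)
    x = np.ones_like(t)                    # x(t) = 1
    y = t                                  # y(t) = t
    sup = lambda z: np.max(np.abs(z))      # the norm of C([0, 1]; R)

    lhs = sup(x + y) ** 2 + sup(x - y) ** 2    # = 4 + 1 = 5
    rhs = 2.0 * (sup(x) ** 2 + sup(y) ** 2)    # = 4
    print(lhs, rhs)                        # the parallelogram identity fails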


1.2.2 Projection and Orthogonality

An important property of Hilbert spaces is that given a nonempty, closed, and convex
subset K of H and a point x ∉ K, there is a unique point in K which is the closest
to x. More precisely, we have the following:

Proposition 1.37. Let K be a nonempty, closed, and convex subset of H and let
x ∈ H. Then, there exists a unique point y* ∈ K such that

‖x − y*‖ = min_{y∈K} ‖x − y‖.   (1.4)

Moreover, it is the only element of K such that

⟨x − y*, y − y*⟩ ≤ 0 for all y ∈ K.   (1.5)


Proof. We shall prove Proposition 1.37 in three steps: first, we verify that (1.4) has
a solution; next, we establish the equivalence between (1.4) and (1.5); and finally,
we check that (1.5) cannot have more than one solution.

First, set d = inf_{y∈K} ‖x − y‖ and consider a sequence (yn) in K such that
lim_{n→∞} ‖yn − x‖ = d. We have

‖yn − ym‖² = ‖(yn − x) + (x − ym)‖² = 2( ‖yn − x‖² + ‖ym − x‖² ) − ‖(yn + ym) − 2x‖²,

by virtue of the parallelogram identity (1.3). Since K is convex, the midpoint
between yn and ym belongs to K. Therefore,

‖(yn + ym) − 2x‖² = 4 ‖ (yn + ym)/2 − x ‖² ≥ 4d²,

according to the definition of d. We deduce that

‖yn − ym‖² ≤ 2( ‖yn − x‖² + ‖ym − x‖² − 2d² ).

Whence, (yn) is a Cauchy sequence, and must converge to some y*, which must lie
in K by closedness. The continuity of the norm implies d = lim_{n→∞} ‖yn − x‖ = ‖y* − x‖.

Next, assume (1.4) holds and let y ∈ K. Since K is convex, for each λ ∈ (0, 1) the
point λy + (1 − λ)y* also belongs to K. Therefore,

‖x − y*‖² ≤ ‖x − λy − (1 − λ)y*‖² = ‖x − y*‖² + 2λ⟨x − y*, y* − y⟩ + λ²‖y* − y‖².

This implies

⟨x − y*, y − y*⟩ ≤ (λ/2) ‖y* − y‖².

Letting λ → 0 we obtain (1.5). Conversely, if (1.5) holds, then

‖x − y‖² = ‖x − y*‖² + 2⟨x − y*, y* − y⟩ + ‖y* − y‖² ≥ ‖x − y*‖²

for each y ∈ K, and (1.4) holds.

Finally, if y*_1, y*_2 ∈ K satisfy (1.5), then

⟨x − y*_1, y*_2 − y*_1⟩ ≤ 0 and ⟨x − y*_2, y*_1 − y*_2⟩ ≤ 0.

Adding the two inequalities we deduce that y*_1 = y*_2.




The point y∗ given by Proposition 1.37 is the projection of x onto K and will be
denoted by PK (x). The characterization of PK (x) given by (1.5) says that for each
x ∉ K, the set K lies in the closed half-space

S = { y ∈ H : ⟨x − P_K(x), y − P_K(x)⟩ ≤ 0 }.

Corollary 1.38. Let K be a nonempty, closed, and convex subset of H. Then

K = ⋂_{x∉K} { y ∈ H : ⟨x − P_K(x), y − P_K(x)⟩ ≤ 0 }.

Conversely, the intersection of closed convex half-spaces is closed and convex.

For subspaces we recover the idea of orthogonal projection:


Proposition 1.39. Let M be a closed subspace of H. Then,

⟨x − P_M(x), u⟩ = 0

for each x ∈ H and u ∈ M. In other words, x − PM (x) ⊥ M.


Proof. Let u ∈ M and write v± = P_M(x) ± u. Then v± ∈ M and, by (1.5),

±⟨x − P_M(x), u⟩ ≤ 0.

It follows that ⟨x − P_M(x), u⟩ = 0.



Another property of the projection, bearing important topological consequences,
is the following:
Proposition 1.40. Let K be a nonempty, closed, and convex subset of H. The func-
tion x → PK (x) is nonexpansive.
Proof. Let x₁, x₂ ∈ H. Then ⟨x₁ − P_K(x₁), P_K(x₂) − P_K(x₁)⟩ ≤ 0 and ⟨x₂ − P_K(x₂),
P_K(x₁) − P_K(x₂)⟩ ≤ 0. Summing these two inequalities we obtain

‖P_K(x₁) − P_K(x₂)‖² ≤ ⟨x₁ − x₂, P_K(x₁) − P_K(x₂)⟩.

We conclude using the Cauchy–Schwarz inequality.
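
As an illustration of Proposition 1.37 and Proposition 1.40, take H = R^n and let K be a box, for which the projection is a componentwise clipping. The following sketch checks the characterization (1.5) and the nonexpansiveness numerically; the set K and the sample points are arbitrary choices made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5

def proj_box(x, lo=0.0, hi=1.0):
    """Projection onto K = [lo, hi]^n (componentwise clipping)."""
    return np.clip(x, lo, hi)

x1, x2 = 3 * rng.normal(size=n), 3 * rng.normal(size=n)
p1, p2 = proj_box(x1), proj_box(x2)

# Characterization (1.5): <x - P_K(x), y - P_K(x)> <= 0 for all y in K
for _ in range(1000):
    y = rng.uniform(0.0, 1.0, size=n)          # a point of K
    assert np.dot(x1 - p1, y - p1) <= 1e-12

# Proposition 1.40: the projection is nonexpansive
assert np.linalg.norm(p1 - p2) <= np.linalg.norm(x1 - x2) + 1e-12
print("checks passed")
```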




1.2.3 Duality, Reflexivity and Weak Convergence

The topological dual of a real Hilbert space can be easily characterized. Given y ∈ H,
the function L_y : H → R, defined by L_y(h) = ⟨y, h⟩, is linear and continuous by
the Cauchy–Schwarz inequality. Moreover, ‖L_y‖_* = ‖y‖. Conversely, we have the
following:
Theorem 1.41 (Riesz Representation Theorem). Let L : H → R be a continuous
linear function on H. Then, there exists a unique yL ∈ H such that

L(h) = ⟨y_L, h⟩

for each h ∈ H. Moreover, the function L → yL is a linear isometry.


Proof. Let M = ker(L), which is a closed subspace of H because L is linear and
continuous. If M = H, then L(h) = 0 for all h ∈ H and it suffices to take yL = 0. If
M ≠ H, let us pick any x ∉ M and define

x̂ = x − P_M(x).

Notice that x̂ ≠ 0 and x̂ ∉ M. Given any h ∈ H, set u_h = L(x̂)h − L(h)x̂, so that
u_h ∈ M. By Proposition 1.39, we have ⟨x̂, u_h⟩ = 0. In other words,

0 = ⟨x̂, u_h⟩ = ⟨x̂, L(x̂)h − L(h)x̂⟩ = L(x̂)⟨x̂, h⟩ − L(h)‖x̂‖².

The vector

y_L = (L(x̂)/‖x̂‖²) x̂
has the desired property and it is straightforward to verify that the function L → yL
is a linear isometry.

As a consequence, the inner product ⟨·, ·⟩_* : H* × H* → R defined by

⟨L₁, L₂⟩_* = L₁(y_{L₂}) = ⟨y_{L₁}, y_{L₂}⟩

turns H* into a Hilbert space, which is isometrically isomorphic to H. The norm
associated with ⟨·, ·⟩_* is precisely ‖·‖_*.
Corollary 1.42. Hilbert spaces are reflexive.
Proof. Given μ ∈ H**, use the Riesz Representation Theorem 1.41 twice to obtain
L_μ ∈ H* such that μ(ℓ) = ⟨L_μ, ℓ⟩_* for each ℓ ∈ H*, and then y_{L_μ} ∈ H such that
L_μ(x) = ⟨y_{L_μ}, x⟩ for all x ∈ H. It follows that μ(ℓ) = ⟨L_μ, ℓ⟩_* = ℓ(y_{L_μ}) for each
ℓ ∈ H* by the definition of ⟨·, ·⟩_*.

Remark 1.43. Theorem 1.41 also implies that a sequence (x_n) in a Hilbert space H
converges weakly to some x ∈ H if, and only if, lim_{n→∞} ⟨x_n − x, y⟩ = 0 for all y ∈ H.
Strong and weak convergence can be related as follows:
Proposition 1.44. A sequence in (xn ) converges strongly to x̄ if, and only if, it con-
verges weakly to x̄ and lim sup xn  ≤ x̄.
n→∞

Proof. The only if part is immediate. For the if part, notice that 0 ≤ lim sup xn −
  n→∞
x̄2 = lim sup xn 2 + x̄2 − 2 xn , x̄ ≤ 0.

n→∞

Remark 1.45. Another consequence of Theorem 1.41 is that we may interpret the
gradient of a Gâteaux-differentiable function f as an element of H instead of H ∗
(see Sect. 1.1.4). This will be useful in the design of optimization algorithms (see
Chap. 6).

Chapter 2
Existence of Minimizers

Abstract In this chapter, we present sufficient conditions for an extended real-


valued function to have minimizers. After discussing the main concepts, we begin
by addressing the existence issue in abstract Hausdorff spaces, under certain (one-
sided) continuity and compactness hypotheses. We also present Ekeland’s Varia-
tional Principle, providing the existence of approximate minimizers that are strict in
some sense. Afterward, we study the minimization of convex functions in reflexive
spaces, where the verification of the hypothesis is more practical. Although it is pos-
sible to focus directly on this setting, we preferred to take the long path. Actually,
the techniques used for the abstract framework are useful for problems that do not
fit in the convex reflexive setting, but where convexity and reflexivity still play an
important role.

2.1 Extended Real-Valued Functions

In order to deal with unconstrained and constrained optimization problems in a uni-


fied setting, we introduce the extended real numbers by adding the symbol +∞. The
convention γ < +∞ for all γ ∈ R allows us to extend the total order of R to a total
order of R ∪ {+∞}. We can define functions on a set X with values in R ∪ {+∞}.
The simplest example is the indicator function of a set C ⊂ X, defined as

0 if x ∈ C
δC (x) =
+∞ otherwise.

The main interest of introducing these kinds of functions is that, clearly, if f : X →


R, the optimization problems

min{ f (x) : x ∈ C} and min{ f (x) + δC (x) : x ∈ X}


are equivalent. The advantage of the second formulation is that linear, geometric, or
topological properties of the underlying space X may be exploited.

Let f : X → R ∪ {+∞} be an extended real-valued function. The effective domain


(or, simply, the domain) of f is the set of points where f is finite. In other words,

dom( f ) = {x ∈ X : f (x) < +∞}.

A function f is proper if dom(f) ≠ ∅. The alternative, namely f ≡ +∞, is not very
interesting for optimization purposes. Given γ ∈ R, the γ-sublevel set of f is

Γγ ( f ) = {x ∈ X : f (x) ≤ γ }.



If x ∈ dom(f), then x ∈ Γ_{f(x)}(f). Therefore, dom(f) = ⋃_{γ∈R} Γ_γ(f). On the other
hand, recall that

argmin( f ) = {x∗ ∈ X : f (x∗ ) ≤ f (x) for all x ∈ X}.

Observe that

argmin(f) = ⋂_{γ > inf(f)} Γ_γ(f).   (2.1)

Finally, the epigraph of f is the subset of the product space X × R defined as

epi( f ) = {(x, α ) ∈ X × R : f (x) ≤ α }.

This set includes the graph of f and all the points above it.

2.2 Lower-Semicontinuity and Minimization

In this section, we discuss the existence of minimizers for lower-semicontinuous


functions in a fairly abstract setting.

Minimization in Hausdorff Spaces

In what follows, (X, τ ) is a Hausdorff space (a topological space in which every


two distinct points admit disjoint neighborhoods). A function f : X → R ∪ {+∞} is
lower-semicontinuous at a point x0 ∈ X if for each α < f (x0 ) there is a neighborhood
V of x0 such that f (y) > α for all y ∈ V . If f is lower-semicontinuous at every point
of X, we say f is lower-semicontinuous in X.
Example 2.1. The indicator function δC of a closed set C is lower-semicontinuous.


Example 2.2. If f and g are lower-semicontinuous and if α ≥ 0, then f + α g is
lower-semicontinuous. In other words, the set of lower-semicontinuous functions is
a convex cone.

Lower-semicontinuity can be characterized in terms of the epigraph and the level
sets of the function:
Proposition 2.3. Let f : X → R ∪ {+∞}. The following are equivalent:
i) The function f is lower-semicontinuous;
ii) The set epi( f ) is closed in X × R; and
iii)For each γ ∈ R, the sublevel set Γγ ( f ) is closed.

Proof. We shall prove that i) ⇒ ii) ⇒ iii) ⇒ i).


Let f be lower-semicontinuous and take (x₀, α) ∉ epi(f). We have α < f(x₀).
Now take any β ∈ (α , f (x0 )). There is a neighborhood V of x0 such that f (y) >
β for all y ∈ V . The set V × (−∞, β ) is a neighborhood of (x0 , α ) that does not
intersect epi( f ), which gives ii).
Suppose now that epi( f ) is closed. For each γ ∈ R, the set Γγ ( f ) is homeomorphic
to epi( f ) ∩ [X × {γ }] and so it is closed.
Finally, assume Γγ ( f ) is closed. In order to prove that f is lower-semicontinuous,
pick x₀ ∈ X and α ∈ R such that α < f(x₀). Then x₀ ∉ Γ_α(f). Since this set is
closed, there is a neighborhood V of x0 that does not intersect Γα ( f ). In other words,
f (y) > α for all y ∈ V .


Example 2.4. If ( fi )i∈I is a family of lower-semicontinuous functions, then sup( fi )


is lower-semicontinuous, since epi(sup( fi )) = ∩ epi( fi ) and the intersection of
closed sets is closed.

A function f : X → R ∪ {+∞} is inf-compact if, for each γ ∈ R, the sublevel set
Γγ ( f ) is relatively compact. Clearly, a function f : X → R ∪ {+∞} is proper, lower-
semicontinuous, and inf-compact if, and only if, Γγ ( f ) is nonempty and compact for
each γ > inf( f ).
Theorem 2.5. Let (X, τ ) be a Hausdorff space and let f : X → R ∪ {+∞} be proper,
lower-semicontinuous, and inf-compact. Then argmin( f ) is nonempty and compact.
Moreover, inf( f ) > −∞.

Proof. Let (γ_n) be a nonincreasing sequence such that lim_{n→∞} γ_n = inf(f) (possibly
−∞). For each n ∈ N consider the compact set K_n = Γ_{γ_n}(f). Then K_{n+1} ⊂ K_n for
each n ∈ N and so

argmin(f) = ⋂_{n∈N} K_n

is nonempty and compact. Clearly, inf(f) > −∞.



The usefulness of Theorem 2.5 relies strongly on the appropriate choice of
topology for X. If τ has too many open sets, it will be very likely for a function
f : X → R ∪ {+∞} to be lower-semicontinuous but one would hardly expect it to
be inf-compact. We shall see in Sect. 2.3 that for convex functions defined on a
reflexive space, the weak topology is a suitable choice.

Improving Approximate Minimizers

The following result provides an important geometric property satisfied by approx-


imate minimizers of lower-semicontinuous functions:
Theorem 2.6 (Ekeland's Variational Principle). Consider a lower-semicontinuous
function f : X → R ∪ {+∞} defined on a Banach space X. Let ε > 0 and suppose
x₀ ∈ dom(f) is such that f(x₀) ≤ inf_{x∈X} f(x) + ε. Then, for each λ > 0, there exists
x̄ ∈ B̄(x₀, ε/λ) such that f(x̄) + λ‖x̄ − x₀‖ ≤ f(x₀) and f(x) + λ‖x − x̄‖ > f(x̄) for
all x ≠ x̄.
Proof. For each n ≥ 0, given x_n ∈ X, define

C_n = { x ∈ X : f(x) + λ‖x − x_n‖ ≤ f(x_n) }

and
v_n = inf{ f(x) : x ∈ C_n },
and take x_{n+1} ∈ C_n such that

f(x_{n+1}) ≤ v_n + 1/(n+1).
Observe that (C_n) is a nested (decreasing) sequence of closed sets, and f(x_n) is
nonincreasing and bounded from below by f(x₀) − ε. Clearly, for each h ≥ 0 we
have
λ‖x_{n+h} − x_n‖ ≤ f(x_n) − f(x_{n+h}),

and (x_n) is a Cauchy sequence. Its limit, which we denote x̄, belongs to ⋂_{n≥0} C_n. In
particular,
f(x̄) + λ‖x̄ − x₀‖ ≤ f(x₀)
and
λ‖x̄ − x₀‖ ≤ f(x₀) − f(x̄) ≤ ε.

Finally, if there is x̃ ∈ X such that

f(x̃) + λ‖x̃ − x̄‖ ≤ f(x̄),

then also x̃ ∈ ⋂_{n≥0} C_n and so

f(x̃) + λ‖x̃ − x_n‖ ≤ f(x_n) ≤ f(x̃) + 1/n

for all n, and we deduce that x̃ = x̄.



Minimizing Sequences

It is often useful to consider a sequential notion of lower-semicontinuity. A func-


tion f : X → R ∪ {+∞} is sequentially lower-semicontinuous at x ∈ dom( f ) for the
topology τ if
f(x) ≤ lim inf_{n→∞} f(x_n)

for every sequence (xn ) converging to x for the topology τ .

Example 2.7. Let f : RM → R ∪ {+∞} be proper, lower-semicontinuous, and


bounded from below. Define F : L^p(0, T; R^M) → R ∪ {+∞} by

F(u) = ∫_0^T f(u(t)) dt.

We shall see that F is sequentially lower-semicontinuous. Let (un ) be a sequence


in the domain of F converging strongly to some ū ∈ L p (0, T ; RM ). Extract a subse-
quence (ukn ) such that
lim_{n→∞} F(u_{k_n}) = lim inf_{n→∞} F(u_n).

Next, since (ukn ) converges in L p (0, T ; RM ),


we may extract yet another subse-
quence (u jn ) such that (u jn (t)) converges to ū(t) for almost every t ∈ [0, T ] (see,
for instance, [30, Theorem 4.9]). Since f is lower-semicontinuous, we must have

lim inf_{n→∞} f(u_{j_n}(t)) ≥ f(ū(t)) for almost every t. By Fatou's Lemma (see, for instance,
[30, Lemma 4.1]), we obtain

F(ū) ≤ lim inf_{n→∞} F(u_{j_n}) = lim_{n→∞} F(u_{k_n}) = lim inf_{n→∞} F(u_n),

and so F is sequentially lower-semicontinuous.



It is possible to establish a sequential analogue of Proposition 2.3, along with
relationships between lower-semicontinuity and sequential lower-semicontinuity for
the strong and the weak topologies, in the spirit of Proposition 1.23. We shall come
back to this point in Proposition 2.17.

We shall say (x_n) is a minimizing sequence for f : X → R ∪ {+∞} if lim_{n→∞} f(x_n) =
inf(f). An important property of sequentially lower-semicontinuous functions is
that the limits of convergent minimizing sequences are minimizers.
Proposition 2.8. Let (xn ) be a minimizing sequence for a proper and sequentially
lower-semicontinuous function f : X → R ∪ {+∞}. If (xn ) converges to x̄, then x̄ ∈
argmin( f ).
One can also prove a sequential version of Theorem 2.5. A function f : X →
R ∪ {+∞} is sequentially inf-compact if for each γ > inf( f ), every sequence in
Γγ ( f ) has a convergent subsequence. We have the following:
Theorem 2.9. If f : X → R ∪ {+∞} is proper, sequentially lower-semicontinuous,
and sequentially inf-compact, then there exists a convergent minimizing sequence
for f . In particular, argmin( f ) is nonempty.
The proofs of Proposition 2.8 and Theorem 2.9 are straightforward, and left to
the reader.

2.3 Minimizers of Convex Functions

An extended real valued function f : X → R ∪ {+∞} defined on a vector space X is


convex if
f (λ x + (1 − λ )y) ≤ λ f (x) + (1 − λ ) f (y) (2.2)
for each x, y ∈ dom( f ) and λ ∈ (0, 1). Notice that inequality (2.2) holds trivially if
λ ∈ {0, 1} or if either x or y is not in dom(f). If the inequality in (2.2) is strict
whenever x ≠ y and λ ∈ (0, 1), we say f is strictly convex. Moreover, f is strongly
convex with parameter α > 0 if
f(λx + (1 − λ)y) ≤ λ f(x) + (1 − λ) f(y) − (α/2) λ(1 − λ)‖x − y‖²
for each x, y ∈ dom( f ) and λ ∈ (0, 1).

Remark 2.10. It is easy to prove that f is convex if, and only if, epi( f ) is a convex
subset of X × R. Moreover, if f is convex, then each sublevel set Γγ ( f ) is con-
vex. Obviously the converse is not true in general: Simply consider any nonconvex
monotone function on X = R. Functions whose level sets are convex are called
quasi-convex.

We shall provide some practical characterizations of convexity for differentiable
functions in Sect. 3.2.
Example 2.11. Let B : X × X → R be a bilinear function and define f (x) = B(x, x),
as we did in Example 1.27. For each x, y ∈ X and λ ∈ (0, 1), we have

f (λ x + (1 − λ )y) = λ f (x) + (1 − λ ) f (y) − λ (1 − λ )B(x − y, x − y).

Therefore, we have the following:


i) f is convex if, and only if, B is positive semidefinite (B(z, z) ≥ 0 for all z ∈ X);
ii) f is strictly convex if, and only if, B is positive definite (B(z, z) > 0 for all z ≠ 0);
and
iii) f is strongly convex with parameter α if, and only if, B is uniformly elliptic with
parameter α (B(z, z) ≥ (α/2)‖z‖²).
In particular, if A : H → H is a linear operator on a Hilbert space H and we set
B(x, y) = ⟨Ax, y⟩, then B is bilinear. Therefore, the function f : H → R, defined by
f(x) = ⟨Ax, x⟩, is convex if, and only if, A is positive semidefinite (⟨Az, z⟩ ≥ 0 for
all z ∈ H); strictly convex if, and only if, A is positive definite (⟨Az, z⟩ > 0 for all
z ≠ 0); and strongly convex with parameter α if, and only if, A is uniformly elliptic
with parameter α (⟨Az, z⟩ ≥ (α/2)‖z‖²).
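
As an illustration, in H = R^n with B(x, y) = ⟨Ax, y⟩ and A symmetric, the three regimes above can be read off the spectrum of A. The small numerical sketch below (the matrix and vectors are arbitrary choices) verifies the strong convexity inequality with α = 2λ_min(A).

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.normal(size=(4, 4))
A = M.T @ M + 0.5 * np.eye(4)        # symmetric positive definite by construction

eigs = np.linalg.eigvalsh(A)
alpha = 2 * eigs.min()               # B(z,z) = <Az,z> >= (alpha/2)||z||^2

# f(x) = <Ax, x> should be strongly convex with parameter alpha
f = lambda x: x @ (A @ x)
x, y, lam = rng.normal(size=4), rng.normal(size=4), 0.3
lhs = f(lam * x + (1 - lam) * y)
rhs = (lam * f(x) + (1 - lam) * f(y)
       - 0.5 * alpha * lam * (1 - lam) * np.linalg.norm(x - y) ** 2)
print(lhs <= rhs + 1e-10)            # True
```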

Example 2.12. The indicator function δC of a convex set C is a convex function.


Some Convexity-Preserving Operations

We mention some operations that allow us to construct convex functions from oth-
ers. Another very important example will be studied in detail in Sect. 3.6.
Example 2.13. Suppose A : X → Y is affine, f : Y → R ∪ {+∞} is convex and
θ : R → R ∪ {+∞} convex and nondecreasing. Then, the function g = θ ◦ f ◦ A :
X → R ∪ {+∞} is convex.

Example 2.14. If f and g are convex and if α ≥ 0, then f + α g is convex. It follows
that the set of convex functions is a convex cone.

Example 2.15. If ( fi )i∈I is a family of convex functions, then sup( fi ) is convex,
since epi(sup( fi )) = ∩ epi( fi ) and the intersection of convex sets is convex.

Example 2.16. In general, the infimum of convex functions need not be convex.
However, we have the following: Let V be a vector space and let g : X × V → R ∪
{+∞} be convex. Then, the function f : X → R ∪ {+∞} defined by f(x) = inf_{v∈V} g(x, v)

is convex. Here, the facts that g is convex in the product space and the infimum is
taken over a whole vector space are crucial.


Convexity, Lower-Semicontinuity, and Existence of Minimizers

As a consequence of Propositions 1.21, 1.23 and 2.3, we obtain


Proposition 2.17. Let (X,  · ) be a normed space and let f : X → R ∪ {+∞}.
Consider the following statements:
i) f is weakly lower-semicontinuous.
ii) f is weakly sequentially lower-semicontinuous.
iii) f is sequentially lower-semicontinuous.
iv) f is lower-semicontinuous.
Then i) ⇒ ii) ⇒ iii) ⇔ iv) ⇐ i). If f is convex, the four statements are equivalent.
Example 2.18. Let f : RM → R ∪ {+∞} be proper, convex, lower-semicontinuous,
and bounded from below. As in Example 2.7, define F : L p (0, T ; RM ) → R ∪ {+∞}
by

F(u) = ∫_0^T f(u(t)) dt.
Clearly, F is proper and convex. Moreover, we already proved that F is sequentially
lower-semicontinuous. Proposition 2.17 shows that F is lower-semicontinuous and
sequentially lower-semicontinuous both for the strong and the weak topologies. 
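
The role played by convexity in Example 2.18 can be illustrated on a classical oscillating sequence: u_n(t) = sign(sin(2πnt)) converges weakly, but not strongly, to 0 in L²(0, 1). For the convex integrand f(s) = s² one finds F(u_n) = 1 ≥ F(0) = 0, in accordance with weak sequential lower-semicontinuity, whereas for the nonconvex integrand f(s) = (s² − 1)² one finds F(u_n) = 0 < F(0) = 1, so weak lower-semicontinuity fails. Below is a rough numerical sketch of these computations (Riemann sums on an arbitrary uniform grid).

```python
import numpy as np

t = np.linspace(0.0, 1.0, 200001)
dt = t[1] - t[0]

def F(u, f):
    """Riemann-sum approximation of F(u) = int_0^1 f(u(t)) dt."""
    return np.sum(f(u)) * dt

convex    = lambda s: s ** 2
nonconvex = lambda s: (s ** 2 - 1) ** 2

for n in [1, 5, 25]:
    u_n = np.sign(np.sin(2 * np.pi * n * t))      # oscillates faster as n grows
    print(n, F(u_n, convex), F(u_n, nonconvex))   # ~1.0 and ~0.0 for every n

u_bar = np.zeros_like(t)                          # the weak limit
print(F(u_bar, convex), F(u_bar, nonconvex))      # 0.0 and 1.0
```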
A function f : X → R ∪ {+∞} is coercive if

lim_{‖x‖→+∞} f(x) = +∞,

or, equivalently, if Γγ ( f ) is bounded for each γ ∈ R. By Theorem 1.24, coercive


functions on reflexive spaces are weakly inf-compact. We have the following:
Theorem 2.19. Let X be reflexive. If f : X → R ∪ {+∞} is proper, convex, coercive,
and lower-semicontinuous, then argmin( f ) is nonempty and weakly compact. If,
moreover, f is strictly convex, then argmin( f ) is a singleton.

Proof. The function f fulfills the hypotheses of Theorem 2.5 for the weak topology.
Clearly, a strictly convex function cannot have more than one minimizer.


If f is strongly convex, it is strictly convex and coercive. Therefore, we deduce:


Corollary 2.20. Let X be reflexive. If f : X → R ∪ {+∞} is proper, strongly convex,
and lower-semicontinuous, then argmin( f ) is a singleton.
Chapter 3
Convex Analysis and Subdifferential Calculus

Abstract This chapter deals with several properties of convex functions, especially
in connection with their regularity, on the one hand, and the characterization of
their minimizers, on the other. We shall explore sufficient conditions for a convex
function to be continuous, as well as several connections between convexity and
differentiability. Next, we present the notion of subgradient, a generalization of the
concept of derivative for nondifferentiable convex functions that will allow us to
characterize their minimizers. After discussing conditions that guarantee their exis-
tence, we present the basic (yet subtle) calculus rules, along with their remarkable
consequences. Other important theoretical and practical tools, such as the Fenchel
conjugate and the Lagrange multipliers, will also be studied. These are particularly
useful for solving constrained problems.

3.1 Convexity and Continuity

In this section, we discuss some characterizations of continuity and lower-semi-


continuity for convex functions.

Lower-Semicontinuous Convex Functions

Pretty much as closed convex sets are intersections of closed half-spaces, any lower-
semicontinuous convex function can be represented as a supremum of continuous
affine functions:
Proposition 3.1. Let f : X → R ∪ {+∞} be proper. Then, f is convex and lower-
semicontinuous if, and only if, there exists a family ( fi )i∈I of continuous affine func-
tions on X such that f = sup( fi ).


Proof. Suppose f is convex and lower-semicontinuous and let x0 ∈ X. We shall


prove that, for every λ0 < f (x0 ), there exists a continuous affine function α such that
α (x) ≤ f (x) for all x ∈ dom( f ) and λ0 < α (x0 ) < f (x0 ). Since epi( f ) is nonempty,
closed, and convex, and (x₀, λ₀) ∉ epi(f), by the Hahn–Banach Separation Theo-
rem 1.10, there exist (L, s) ∈ X ∗ × R \ {(0, 0)} and ε > 0 such that

⟨L, x₀⟩ + sλ₀ + ε ≤ ⟨L, x⟩ + sλ   (3.1)

for all (x, λ ) ∈ epi( f ). Clearly, s ≥ 0. Otherwise, we may take x ∈ dom( f ) and λ
sufficiently large to contradict (3.1). We distinguish two cases:
s > 0: We may assume, without loss of generality, that s = 1 (or divide by s and
rename L and ε ). We set

α(x) = −⟨L, x⟩ + [⟨L, x₀⟩ + λ₀ + ε],

and take λ = f (x) to deduce that f (x) ≥ α (x) for all x ∈ dom( f ) and α (x0 ) > λ0 .
Observe that this is valid for each x0 ∈ dom( f ).
s = 0: As said above, necessarily x₀ ∉ dom(f), and so, f(x₀) = +∞. Set

α₀(x) = −⟨L, x⟩ + [⟨L, x₀⟩ + ε],

and observe that α0 (x) ≤ 0 for all x ∈ dom( f ) and α0 (x0 ) = ε > 0. Now take x̂ ∈
dom( f ) and use the argument of the case s > 0 to obtain a continuous affine function
α̂ such that f (x) ≥ α̂ (x) for all x ∈ dom( f ). Given n ∈ N set αn = α̂ + nα0 . We
conclude that f(x) ≥ α_n(x) for all x ∈ dom(f) and lim_{n→∞} α_n(x₀) = lim_{n→∞} (α̂(x₀) + nε) =
+∞ = f(x₀).
The converse is straightforward, since epi(sup( fi )) = ∩ epi( fi ).


Characterization of Continuity

In this subsection, we present some continuity results that reveal how remarkably
regular convex functions are.
Proposition 3.2. Let f : X → R ∪ {+∞} be convex and fix x0 ∈ X. The following
are equivalent:
i) f is bounded from above in a neighborhood of x0 ;
ii) f is Lipschitz-continuous in a neighborhood of x0 ;
iii) f is continuous in x0 ; and
iv) (x0 , λ ) ∈ int(epi( f )) for each λ > f (x0 ).

Proof. We shall prove that i) ⇒ ii) ⇒ iii) ⇒ iv) ⇒ i).


i) ⇒ ii): There exist r > 0 and K > f (x0 ) such that f (z) ≤ K for every z ∈
B(x₀, 2r). We shall find M > 0 such that |f(x) − f(y)| ≤ M‖x − y‖ for all x, y ∈

B(x0 , r). First, consider the point


ỹ = y + r (y − x)/‖y − x‖.   (3.2)

Since ‖ỹ − x₀‖ ≤ ‖y − x₀‖ + r < 2r, we have f(ỹ) ≤ K. Solving for y in (3.2), we
see that

y = λỹ + (1 − λ)x,   with   λ = ‖y − x‖/(‖y − x‖ + r) ≤ ‖y − x‖/r.
The convexity of f implies

f (y) − f (x) ≤ λ [ f (ỹ) − f (x)] ≤ λ [K − f (x)]. (3.3)

Write x₀ = ½ x + ½ (2x₀ − x). We have f(x₀) ≤ ½ f(x) + ½ f(2x₀ − x) and

− f (x) ≤ K − 2 f (x0 ). (3.4)


Combining (3.3) and (3.4), we deduce that

f(y) − f(x) ≤ 2λ[K − f(x₀)] ≤ (2(K − f(x₀))/r) ‖x − y‖.

Interchanging the roles of x and y we conclude that |f(x) − f(y)| ≤ M‖x − y‖ with
M = 2(K − f(x₀))/r > 0.
ii) ⇒ iii): This is straightforward.
iii) ⇒ iv): If f is continuous in x0 and λ > f (x0 ), then for each η ∈ ( f (x0 ), λ )
there exists δ > 0 such that f (z) < η for all z ∈ B(x0 , δ ). Hence, (x0 , λ ) ∈ B(x0 , δ ) ×
(η , +∞) ⊂ epi( f ), and (x0 , λ ) ∈ int(epi( f )).
iv) ⇒ i): If (x0 , λ ) ∈ int(epi( f )), then there exist r > 0 and K > f (x0 ), such that
B(x0 , r) × (K, +∞) ⊂ epi( f ). It follows that f (z) ≤ K for every z ∈ B(x0 , r).

Proposition 3.3. Let (X,  · ) be a normed space and let f : X → R ∪ {+∞} be
convex. If f is continuous at some x0 ∈ dom( f ), then x0 ∈ int(dom( f )) and f is
continuous on int(dom( f )).
Proof. If f is continuous at x0 , Proposition 3.2 gives x0 ∈ int(dom( f )) and there
exist r > 0 and K > f(x₀) such that f(x) ≤ K for every x ∈ B(x₀, r). Let y₀ ∈

int(dom(f)) and pick ρ > 0 such that the point z₀ = y₀ + ρ(y₀ − x₀) belongs to
dom(f). Take y ∈ B(y₀, ρr/(1+ρ)) and set w = x₀ + ((1+ρ)/ρ)(y − y₀). Solving for y, we see
that

y = (ρ/(1+ρ)) w + (1/(1+ρ)) z₀.

On the other hand, ‖w − x₀‖ = ((1+ρ)/ρ)‖y − y₀‖ < r, and so w ∈ B(x₀, r) and f(w) ≤ K.


Therefore,

f(y) ≤ (ρ/(1+ρ)) f(w) + (1/(1+ρ)) f(z₀) ≤ max{K, f(z₀)}.

Since this is true for each y ∈ B(y₀, ρr/(1+ρ)), we conclude that f is bounded from above
in a neighborhood of y0 . Proposition 3.2 implies f is continuous at y0 .


An immediate consequence is:


Corollary 3.4. Let f : X → R ∪ {+∞} be convex. If int(Γ_λ(f)) ≠ ∅ for some λ ∈ R,
then f is continuous on int(dom( f )).
Proposition 3.3 requires that f be continuous at some point. This hypothesis can
be waived if the space is complete. We begin by establishing the result in finite-
dimensional spaces:
Proposition 3.5. Let X be finite dimensional and let f : X → R ∪ {+∞} be convex.
Then f is continuous on int(dom( f )).

Proof. Let {e1 , . . . , eN } generate X. Let x0 ∈ int(dom( f )) and take ρ > 0 small
enough so that x0 ± ρ ei ∈ dom( f ) for all i = 1, . . . , N. The convex hull C of these
points is a neighborhood of x₀ and f is bounded by max_{i=1,...,N} { f(x₀ ± ρe_i) } on C. The
result follows from Proposition 3.2.


For general Banach spaces, we have the following:


Proposition 3.6. Let (X,  · ) be a Banach space and let f : X → R ∪ {+∞} be
lower-semicontinuous and convex. Then f is continuous on int(dom( f )).

Proof. Fix x0 ∈ int(dom( f )). Without any loss of generality we may assume that
x0 = 0. Take λ > f (0). Given x ∈ X, define gx : R → R ∪ {+∞} by gx (t) = f (tx).
Since 0 ∈ int(dom(gx )), we deduce that gx is continuous at 0 by Proposition 3.5.
Therefore, there is t_x > 0 such that t_x x ∈ Γ_λ(f). Repeating this argument for each
x ∈ X, we see that ⋃_{n≥1} nΓ_λ(f) = X. Baire's Category Theorem shows that Γ_λ(f)
has nonempty interior and we conclude by Corollary 3.4.

Remark 3.7. To summarize, let (X,  · ) be a normed space and let f : X → R ∪
{+∞} be convex. Then f is continuous on int(dom( f )) if either (i) f is continuous
at some point, (ii) X is finite dimensional, or (iii) X is a Banach space and f is
lower-semicontinuous.


3.2 Convexity and Differentiability

In this subsection, we study some connections between convexity and differentia-


bility. First, we analyze the existence and properties of directional derivatives. Next,
we provide characterizations for the convexity of differentiable functions. Finally,
we provide equivalent conditions for the gradient of a differentiable convex function
to be Lipschitz continuous.

3.2.1 Directional Derivatives

Recall from Sect. 1.1.4 that the directional derivative of a function f : X → R ∪


{+∞} at a point x ∈ dom(f) in the direction h ∈ X is given by

f′(x; h) = lim_{t→0} [f(x + th) − f(x)] / t.
Remark 3.8. If f is convex, a simple computation shows that the quotient
[f(x + th) − f(x)]/t is nondecreasing as a function of t. We deduce that

f′(x; h) = inf_{t>0} [f(x + th) − f(x)] / t,
which exists in R ∪ {−∞, +∞}.

We have the following:
Proposition 3.9. Let f : X → R ∪ {+∞} be proper and convex, and let x ∈ dom( f ).
Then, the function φ_x : X → [−∞, +∞], defined by φ_x(h) = f′(x; h), is convex. If,
moreover, f is continuous in x, then φx is finite and continuous in X.
Proof. Take y, z ∈ X, λ ∈ (0, 1) and t > 0. Write h = λ y + (1 − λ )z. By the convexity
of f , we have

[f(x + th) − f(x)]/t ≤ λ [f(x + ty) − f(x)]/t + (1 − λ) [f(x + tz) − f(x)]/t.

Passing to the limit, φx (λ y + (1 − λ )z) ≤ λ φx (y) + (1 − λ )φx (z). Now, if f is con-


tinuous in x, it is Lipschitz-continuous in a neighborhood of x, by Proposition 3.2.
Then, for all h ∈ X, all sufficiently small t > 0 and some L > 0, we have

−L‖h‖ ≤ [f(x + th) − f(x)]/t ≤ L‖h‖.
It follows that dom(φx ) = X and that φx is bounded from above in a neighborhood
of 0. Using Proposition 3.2 again, we deduce that φx is continuous in 0, and, by
Proposition 3.3, φx is continuous in int(dom(φx )) = X.


A function f is Gâteaux-differentiable at x if the function φx defined above is


linear and continuous in X. Proposition 3.9 provides the continuity part, but it is
clear that a continuous convex function need not be Gâteaux-differentiable (take,
for instance, the absolute value in R). A sufficient condition for a convex function
to be Gâteaux-differentiable will be given in Proposition 3.58.

3.2.2 Characterizations of Convexity for Differentiable Functions

We begin by providing the following characterization for the convexity of Gâteaux-


differentiable functions:
Proposition 3.10 (Characterization of convexity). Let A ⊂ X be open and convex,
and let f : A → R be Gâteaux-differentiable. The following are equivalent:
i) f is convex.
ii) f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ for every x, y ∈ A.
iii) ⟨∇f(x) − ∇f(y), x − y⟩ ≥ 0 for every x, y ∈ A.
If, moreover, f is twice Gâteaux-differentiable on A, then the preceding conditions
are equivalent to
iv) ⟨∇²f(x)d, d⟩ ≥ 0 for every x ∈ A and d ∈ X (positive semidefinite).

Proof. By convexity,

f (λ y + (1 − λ )x) ≤ λ f (y) + (1 − λ ) f (x)

for all y ∈ X and all λ ∈ (0, 1). Rearranging the terms we get

[f(x + λ(y − x)) − f(x)] / λ ≤ f(y) − f(x).
As λ → 0 we obtain ii). From ii), we immediately deduce iii).
To prove that iii) ⇒ i), define φ : [0, 1] → R by
 
φ(λ) = f(λx + (1 − λ)y) − λ f(x) − (1 − λ) f(y).

Then φ(0) = φ(1) = 0 and

φ′(λ) = ⟨∇f(λx + (1 − λ)y), x − y⟩ − f(x) + f(y)

for λ ∈ (0, 1). Take 0 < λ1 < λ2 < 1 and write xi = λi x + (1 − λi )y for i = 1, 2. A
simple computation shows that
φ′(λ₁) − φ′(λ₂) = (1/(λ₁ − λ₂)) ⟨∇f(x₁) − ∇f(x₂), x₁ − x₂⟩ ≤ 0.

In other words, φ′ is nondecreasing. Since φ(0) = φ(1) = 0, there is λ̄ ∈ (0, 1) such
that φ′(λ̄) = 0. Since φ′ is nondecreasing, φ′ ≤ 0 (and φ is nonincreasing) on [0, λ̄]
and next φ′ ≥ 0 (whence φ is nondecreasing) on [λ̄, 1]. It follows that φ(λ) ≤ 0 on
[0, 1], and so, f is convex.
Assume now that f is twice Gâteaux-differentiable and let us prove that iii) ⇒
iv) ⇒ i).
Assume now that f is twice Gâteaux-differentiable and let us prove that iii) ⇒
iv) ⇒ i).
For t > 0 and h ∈ X, we have ⟨∇f(x + th) − ∇f(x), th⟩ ≥ 0. Dividing by t² and
passing to the limit as t → 0, we obtain ⟨∇²f(x)h, h⟩ ≥ 0.
Finally, defining φ as above, we see that

φ″(λ) = ⟨∇²f(λx + (1 − λ)y)(x − y), x − y⟩ ≥ 0.

It follows that φ′ is nondecreasing and we conclude as before.



It is possible to obtain characterizations for strict and strong convexity as well.
The details are left as an exercise.
Proposition 3.11 (Characterization of strict convexity). Let A ⊂ X be open and
convex, and let f : A → R be Gâteaux-differentiable. The following are equivalent:
i) f is strictly convex.
ii) f(y) > f(x) + ⟨∇f(x), y − x⟩ for any distinct x, y ∈ A.
iii) ⟨∇f(x) − ∇f(y), x − y⟩ > 0 for any distinct x, y ∈ A.
If, moreover, f is twice Gâteaux-differentiable on A, then the following condition is
sufficient for the previous three:
iv) ⟨∇²f(x)d, d⟩ > 0 for every x ∈ A and d ∈ X (positive definite).
Proposition 3.12 (Characterization of strong convexity). Let A ⊂ X be open and
convex, and let f : A → R be Gâteaux-differentiable. The following are equivalent:
i) f is strongly convex with constant α > 0.
ii) f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (α/2)‖x − y‖² for every x, y ∈ A.
iii) ⟨∇f(x) − ∇f(y), x − y⟩ ≥ α‖x − y‖² for every x, y ∈ A.
If, moreover, f is twice Gâteaux-differentiable on A, then the preceding conditions
are equivalent to
iv) ⟨∇²f(x)d, d⟩ ≥ α‖d‖² for every x ∈ A and d ∈ X (uniform ellipticity).
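
These criteria are easy to test numerically. The sketch below does so for the log-sum-exp function f(x) = log(∑_i exp(x_i)) on R^n, a convex (but not strictly convex) function whose gradient (the softmax vector) and Hessian are available in closed form; the test points are arbitrary and the example is only illustrative.

```python
import numpy as np

def grad_lse(x):
    """Gradient of f(x) = log(sum(exp(x))): the softmax vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def hess_lse(x):
    p = grad_lse(x)
    return np.diag(p) - np.outer(p, p)

rng = np.random.default_rng(2)
x, y = rng.normal(size=6), rng.normal(size=6)

# iii) of Proposition 3.10: monotonicity of the gradient
print(np.dot(grad_lse(x) - grad_lse(y), x - y) >= 0)      # True

# iv): the Hessian is positive semidefinite (smallest eigenvalue ~ 0,
#      reflecting that f is constant along the diagonal direction)
print(np.linalg.eigvalsh(hess_lse(x)).min() >= -1e-12)    # True
```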

3.2.3 Lipschitz-Continuity of the Gradient

A function F : X → X ∗ is cocoercive with constant β if

⟨F(x) − F(y), x − y⟩ ≥ β‖F(x) − F(y)‖²_*

for all x, y ∈ X.
Theorem 3.13 (Baillon–Haddad Theorem). Let f : X → R be convex and differen-
tiable. The gradient of f is Lipschitz-continuous with constant L if, and only if, it is
cocoercive with constant 1/L.
Proof. Let x, y, z ∈ X. Suppose ∇ f is Lipschitz-continuous with constant L. By the
Descent Lemma 1.30, we have
f(z) ≤ f(x) + ⟨∇f(x), z − x⟩ + (L/2)‖z − x‖².
Subtract ⟨∇f(y), z⟩ from both sides, write h_y(z) = f(z) − ⟨∇f(y), z⟩, and rearrange the
terms to obtain

h_y(z) ≤ h_x(x) + ⟨∇f(x) − ∇f(y), z⟩ + (L/2)‖z − x‖².

The function h_y is convex, differentiable and ∇h_y(y) = 0. We deduce that h_y(y) ≤
h_y(z) for all z ∈ X, and so

h_y(y) ≤ h_x(x) + ⟨∇f(x) − ∇f(y), z⟩ + (L/2)‖z − x‖².   (3.5)
Fix any ε > 0 and pick μ ∈ X such that ‖μ‖ ≤ 1 and

⟨∇f(x) − ∇f(y), μ⟩ ≥ ‖∇f(x) − ∇f(y)‖_* − ε.


Set R = ‖∇f(x) − ∇f(y)‖_*/L and take z = x − Rμ in the right-hand side of (3.5) to obtain

h_y(y) ≤ h_x(x) + ⟨∇f(x) − ∇f(y), x⟩ − (1/(2L))‖∇f(x) − ∇f(y)‖²_* + Rε.
Interchanging the roles of x and y and adding the resulting inequality, we obtain
(1/L)‖∇f(x) − ∇f(y)‖²_* ≤ ⟨∇f(x) − ∇f(y), x − y⟩ + 2Rε.
Since this holds for each ε > 0, we deduce that ∇ f is cocoercive with constant 1/L.
The converse is straightforward.

Remark 3.14. From the proof of the Baillon–Haddad Theorem 3.13, we see that the
L-Lipschitz continuity and the 1/L-cocoercivity of ∇f are actually equivalent to the
validity of the inequality given by the Descent Lemma 1.30, whenever f is convex.
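
For a convex quadratic f(x) = ½⟨Ax, x⟩ with A symmetric and positive semidefinite, ∇f(x) = Ax is Lipschitz-continuous with constant L = λ_max(A), and the cocoercivity inequality of Theorem 3.13 can be checked directly. A small numerical sketch (the matrix is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(3)
M = rng.normal(size=(5, 5))
A = M.T @ M                              # symmetric positive semidefinite
L = np.linalg.eigvalsh(A).max()          # Lipschitz constant of grad f(x) = A x

x, y = rng.normal(size=5), rng.normal(size=5)
gx, gy = A @ x, A @ y

lhs = np.dot(gx - gy, x - y)             # <grad f(x) - grad f(y), x - y>
rhs = np.linalg.norm(gx - gy) ** 2 / L   # (1/L) ||grad f(x) - grad f(y)||^2
print(lhs >= rhs - 1e-10)                # True (Theorem 3.13)
```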

3.3 Subgradients, Subdifferential and Fermat’s Rule

Let f : X → R be convex and assume it is Gâteaux-differentiable at a point x ∈ X.


By convexity, Proposition 3.10 gives

f(y) ≥ f(x) + ⟨∇f(x), y − x⟩   (3.6)

for each y ∈ X. This shows that the hyperplane

V = {(y, z) ∈ X × R : f(x) + ⟨∇f(x), y − x⟩ = z}

lies below the set epi( f ) and touches it at the point (x, f (x)).

This idea can be generalized to nondifferentiable functions. Let f : X → R ∪


{+∞} be proper and convex. A point x∗ ∈ X ∗ is a subgradient of f at x if

f(y) ≥ f(x) + ⟨x*, y − x⟩   (3.7)

for all y ∈ X, or, equivalently, if (3.7) holds for all y in a neighborhood of x. The set
of all subgradients of f at x is the subdifferential of f at x and is denoted by ∂ f (x).
If ∂f(x) ≠ ∅, we say f is subdifferentiable at x. The domain of the subdifferential is

dom(∂f) = {x ∈ X : ∂f(x) ≠ ∅}.

By definition, dom(∂f) ⊂ dom(f). The inclusion may be strict though, as in Example 3.17 below. However, we shall see (Corollary 3.34) that dom(∂f) is dense in dom(f).

Let us see some examples:


Example 3.15. For f : R → R, given by f (x) = |x|, we have ∂ f (x) = {−1} if x < 0,
∂ f (0) = [−1, 1], and ∂ f (x) = {1} for x > 0.


[Figure: the graph of f and the graph of ∂f.]

Example 3.16. More generally, if f : X → R is given by f(x) = ‖x − x₀‖, with
x₀ ∈ X, then

∂f(x) = B̄_{X*}(0, 1) if x = x₀,   and   ∂f(x) = F(x − x₀) if x ≠ x₀,

where F is the normalized duality mapping defined in Sect. 1.1.2.


Example 3.17. Define g : R → R ∪ {+∞} by g(x) = +∞ if x < 0, and g(x) = −√x
if x ≥ 0. Then

∂g(x) = ∅ if x ≤ 0,   and   ∂g(x) = { −1/(2√x) } if x > 0.

Notice that 0 ∈ dom(g) but ∂g(0) = ∅, thus dom(∂g) ⊊ dom(g).

Example 3.18. Let h : R → R ∪ {+∞} be given by h(x) = +∞ if x ≠ 0, and h(0) = 0.
Then

∂h(x) = ∅ if x ≠ 0,   and   ∂h(0) = R.
Observe that h is subdifferentiable but not continuous at 0.

Example 3.19. More generally, let C be a nonempty, closed, and convex subset of
X and let δC : X → R ∪ {+∞} be the indicator function of C:

δ_C(x) = 0 if x ∈ C,   and   δ_C(x) = +∞ otherwise.

Given x ∈ X, the set NC (x) = ∂ δC (x) is the normal cone to C at x. It is given by

N_C(x) = {x* ∈ X* : ⟨x*, y − x⟩ ≤ 0 for all y ∈ C}

if x ∈ C, and N_C(x) = ∅ if x ∉ C. Intuitively, the normal cone contains the directions
that point outwards with respect to C. Suppose C is a closed affine subspace, that is,
C = {x₀} + V, where x₀ ∈ X and V is a closed subspace of X. In this case, N_C(x) = V^⊥
for all x ∈ C (here V^⊥ is the orthogonal space of V, see Sect. 1.1.1).

The subdifferential really is an extension of the notion of derivative:
Proposition 3.20. Let f : X → R ∪ {+∞} be convex. If f is Gâteaux-differentiable
at x, then x ∈ dom(∂ f ) and ∂ f (x) = {∇ f (x)}.

Proof. First, the gradient inequality (3.6) and the definition of the subdifferential
together imply ∇ f (x) ∈ ∂ f (x). Now take any x∗ ∈ ∂ f (x). By definition,

f(y) ≥ f(x) + ⟨x*, y − x⟩

for all y ∈ X. Take any h ∈ X and t > 0, and write y = x + th to deduce that

[f(x + th) − f(x)]/t ≥ ⟨x*, h⟩.
Passing to the limit as t → 0, we deduce that ⟨∇f(x) − x*, h⟩ ≥ 0. Since this must be
true for each h ∈ X, necessarily x∗ = ∇ f (x).


Proposition 3.58 provides a converse result.

A convex function may have more than one subgradient. However, we have:

Proposition 3.21. For each x ∈ X, the set ∂ f (x) is closed and convex.

Proof. Let x1∗ , x2∗ ∈ ∂ f (x) and λ ∈ (0, 1). For each y ∈ X we have

f(y) ≥ f(x) + ⟨x₁*, y − x⟩
f(y) ≥ f(x) + ⟨x₂*, y − x⟩.

Add λ times the first inequality and 1 − λ times the second one to obtain

f(y) ≥ f(x) + ⟨λx₁* + (1 − λ)x₂*, y − x⟩.

Since this holds for each y ∈ X, we have λ x1∗ + (1 − λ )x2∗ ∈ ∂ f (x). To see that ∂ f (x)
is (sequentially) closed, take a sequence (xn∗ ) in ∂ f (x), converging to some x∗ . We
have
f(y) ≥ f(x) + ⟨x_n*, y − x⟩
for each y ∈ X and n ∈ N. As n → ∞ we obtain

f(y) ≥ f(x) + ⟨x*, y − x⟩.

It follows that x∗ ∈ ∂ f (x).




Recall that if f : R → R is convex and differentiable, then its derivative is non-


decreasing. The following result generalizes this fact:
Proposition 3.22. Let f : X → R ∪ {+∞} be convex. If x∗ ∈ ∂ f (x) and y∗ ∈ ∂ f (y),
then ⟨x* − y*, x − y⟩ ≥ 0.

Proof. Note that f(y) ≥ f(x) + ⟨x*, y − x⟩ and f(x) ≥ f(y) + ⟨y*, x − y⟩. It suffices
to add these inequalities and rearrange the terms.


This property is known as monotonicity.

For strongly convex functions, we have the following result:


Proposition 3.23. Let f : X → R ∪ {+∞} be strongly convex with parameter α .
Then, for each x∗ ∈ ∂ f (x) and y ∈ X, we have
f(y) ≥ f(x) + ⟨x*, y − x⟩ + (α/2)‖x − y‖².

Moreover, for each y* ∈ ∂f(y), we have ⟨x* − y*, x − y⟩ ≥ α‖x − y‖².
The definition of the subdifferential has a straightforward yet remarkable conse-
quence, namely:
Theorem 3.24 (Fermat’s Rule). Let f : X → R ∪ {+∞} be proper and convex. Then
x̂ is a global minimizer of f if, and only if, 0 ∈ ∂ f (x̂).
For convex functions, the condition 0 ∈ ∂ f (x̂) given by Fermat’s Rule is not only
necessary but also sufficient for x̂ to be a global minimizer. In particular, in view of

Proposition 3.20, if f is convex and differentiable, then 0 = ∇ f (x̂) if, and only if, x̂
is a global minimizer of f . In other words, the only critical points convex functions
may have are their global minimizers.
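
A standard one-dimensional illustration combines Fermat's Rule with Example 3.15 and (anticipating the sum rule of Theorem 3.30 below) the function f(x) = ½(x − b)² + μ|x|: the condition 0 ∈ (x̂ − b) + μ∂|·|(x̂) yields the soft-thresholding formula x̂ = sign(b) max(|b| − μ, 0). Below is a brief numerical check of this formula against a grid search (a sketch with arbitrary data).

```python
import numpy as np

def soft_threshold(b, mu):
    """Unique minimizer of f(x) = 0.5*(x - b)**2 + mu*|x|, from 0 in the subdifferential."""
    return np.sign(b) * max(abs(b) - mu, 0.0)

b, mu = 1.7, 0.5
xhat = soft_threshold(b, mu)                     # = 1.2

# compare with a brute-force minimization on a fine grid
grid = np.linspace(-5, 5, 200001)
vals = 0.5 * (grid - b) ** 2 + mu * np.abs(grid)
print(xhat, grid[np.argmin(vals)])               # both close to 1.2
```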

3.4 Subdifferentiability

As pointed out in Example 3.18, subdifferentiability does not imply continuity. Sur-
prisingly enough, the converse is true.
Proposition 3.25. Let f : X → R ∪ {+∞} be convex. If f is continuous at x, then
∂ f (x) is nonempty and bounded.

Proof. According to Proposition 3.2, int(epi(f)) ≠ ∅. Part i) of the Hahn–Banach
Separation Theorem 1.10, with A = int(epi( f )) and B = {(x, f (x))}, gives (L, s) ∈
X ∗ × R \ {(0, 0)} such that

⟨L, y⟩ + sλ ≤ ⟨L, x⟩ + s f(x)

for every (y, λ) ∈ int(epi(f)). Taking y = x, we deduce that s ≤ 0. If s = 0 then
⟨L, y − x⟩ ≤ 0 for every y in a neighborhood of x. Hence L = 0, which is a contradiction.
We conclude that s < 0. Therefore,

λ ≥ f(x) + ⟨z*, y − x⟩,

with z∗ = −L/s, for every λ > f (y). Letting λ tend to f (y), we see that

f(y) ≥ f(x) + ⟨z*, y − x⟩.

This implies z* ∈ ∂f(x), so ∂f(x) ≠ ∅. On the other hand, since f is continuous at x, it is
Lipschitz-continuous on a neighborhood of x, by Proposition 3.2. If x∗ ∈ ∂ f (x),
then
f(x) + ⟨x*, y − x⟩ ≤ f(y) ≤ f(x) + M‖y − x‖,
and so, ⟨x*, y − x⟩ ≤ M‖y − x‖ for every y in a neighborhood of x. We conclude that
‖x*‖_* ≤ M. In other words, ∂f(x) ⊂ B̄(0, M).


Approximate Subdifferentiability

As seen in Example 3.17, it may happen that dom(∂ f )  dom( f ). This can be an
inconvenient when trying to implement optimization algorithms. Given ε > 0, a
point x∗ ∈ X ∗ is an ε -approximate subgradient of f at x ∈ dom( f ) if

f(y) ≥ f(x) + ⟨x*, y − x⟩ − ε   (3.8)



for all y ∈ X (compare with the subdifferential inequality (3.7)). The set ∂ε f (x) of
such vectors is the ε -approximate subdifferential of f at x. Clearly, ∂ f (x) ⊂ ∂ε f (x)
for each x, and so
dom(∂ f ) ⊂ dom(∂ε f ) ⊂ dom( f )
for each ε > 0. It turns out that the last inclusion is, in fact, an equality.
Proposition 3.26. Let f : X → R∪{+∞} be proper, convex, and lower-semicontinuous.
Then dom( f ) = dom(∂ε f ) for all ε > 0.

Proof. Let ε > 0 and let x ∈ dom( f ). By Proposition 3.1, f can be represented as the
pointwise supremum of continuous affine functions. Therefore, there exist x∗ ∈ X ∗
and μ ∈ R, such that
f(y) ≥ ⟨x*, y⟩ + μ
for all y ∈ X and
⟨x*, x⟩ + μ ≥ f(x) − ε.
Adding these two inequalities, we obtain precisely (3.8).


Example 3.27. Let us recall from Example 3.17 that the function g : R → R ∪ {+∞}
given by g(x) = +∞ if x < 0, and g(x) = −√x if x ≥ 0, satisfies 0 ∈ dom(g) but
∂g(0) = ∅. A simple computation shows that ∂_ε g(0) = (−∞, −1/(4ε)] for ε > 0. Observe
that dom(∂g) ⊊ dom(∂_ε g) = dom(g).


[Figure: the affine function y = −x/(4ε) − ε, which lies below g and takes the value −ε at x = 0.]

3.5 Basic Subdifferential Calculus Rules and Applications

In this section, we discuss some calculus rules for the subdifferentials of convex
functions. We begin by providing a Chain Rule for the composition with a bounded
linear operator. Next, we present the Moreau–Rockafellar Theorem, regarding the
subdifferential of a sum. This result has several important consequences, both for
theoretical and practical purposes.

3.5.1 Composition with a Linear Function: A Chain Rule

The following is a consequence of the Hahn–Banach Separation Theorem 1.10:


Proposition 3.28 (Chain Rule). Let A ∈ L (X;Y ) and let f : Y → R ∪ {+∞} be
proper, convex, and lower-semicontinuous. For each x ∈ X, we have

A∗ ∂ f (Ax) ⊂ ∂ ( f ◦ A)(x).

Equality holds for every x ∈ X if f is continuous at some y0 ∈ A(X).


Proof. Let x ∈ X and x∗ ∈ ∂ f (Ax). For every y ∈ Y , we have

f(y) ≥ f(Ax) + ⟨x*, y − Ax⟩.

In particular,
f(Az) ≥ f(Ax) + ⟨x*, A(z − x)⟩
for every z ∈ X. We conclude that

(f ◦ A)(z) ≥ (f ◦ A)(x) + ⟨A*x*, z − x⟩

for each z ∈ X, and so A∗ ∂ f (Ax) ⊆ ∂ ( f ◦ A)(x).


Conversely, take x∗ ∈ ∂ ( f ◦ A)(x), which means that

f(Az) ≥ f(Ax) + ⟨x*, z − x⟩

for all z ∈ X. This inequality implies that the affine subspace

V = {(Az, f(Ax) + ⟨x*, z − x⟩) : z ∈ X} ⊂ Y × R

does not intersect int (epi( f )), which is nonempty by continuity. By part i) of the
Hahn–Banach Separation Theorem, there is (L, s) ∈ Y ∗ × R \ {(0, 0)} such that

⟨L, Az⟩ + s( f(Ax) + ⟨x*, z − x⟩ ) ≤ ⟨L, y⟩ + sλ

for all z ∈ X and all (y, λ ) ∈ int (epi( f )). Pretty much like in the proof of the Moreau–
Rockafellar Theorem 3.30, we prove that s > 0. Then, write ℓ = −L/s and let λ →
f(y) to obtain
f(y) ≥ f(Ax) + ⟨x*, z − x⟩ + ⟨ℓ, y − Az⟩.   (3.9)
Take z = x to deduce that

f(y) ≥ f(Ax) + ⟨ℓ, y − Ax⟩

for all y ∈ Y . On the other hand, if we take y = Ax in (3.9), we obtain

0 ≥ ⟨x* − A*ℓ, z − x⟩

for all z ∈ X. We conclude that ℓ ∈ ∂f(Ax) and x* = A*ℓ, whence x* ∈ A*∂f(Ax).




3.5.2 Sum of Convex Functions and the Moreau–Rockafellar


Theorem

A natural question is whether the subdifferential of the sum of two functions, is the
sum of their subdifferentials. We begin by showing that this is not always the case.
Example 3.29. Let f, g : R → R ∪ {+∞} be given by

f(x) = 0 if x ≤ 0, f(x) = +∞ if x > 0;   and   g(x) = +∞ if x < 0, g(x) = −√x if x ≥ 0.

We have

∂f(x) = {0} if x < 0, ∂f(0) = [0, +∞), ∂f(x) = ∅ if x > 0;   and   ∂g(x) = ∅ if x ≤ 0, ∂g(x) = {−1/(2√x)} if x > 0.

Therefore, ∂f(x) + ∂g(x) = ∅ for every x ∈ R. On the other hand, f + g = δ_{0},
which implies ∂(f + g)(x) = ∅ if x ≠ 0, but ∂(f + g)(0) = R. We see that ∂(f + g)(x)
may differ from ∂f(x) + ∂g(x).

Roughly speaking, the problem with the function in the example above is that the
intersection of their domains is too small. We have the following:
Theorem 3.30 (Moreau–Rockafellar Theorem). Let f , g : X → R∪{+∞} be proper,
convex, and lower-semicontinuous. For each x ∈ X, we have

∂ f (x) + ∂ g(x) ⊂ ∂ ( f + g)(x). (3.10)

Equality holds for every x ∈ X if f is continuous at some x0 ∈ dom(g).

Proof. If x∗ ∈ ∂ f (x) and z∗ ∈ ∂ g(x), then

f(y) ≥ f(x) + ⟨x*, y − x⟩   and   g(y) ≥ g(x) + ⟨z*, y − x⟩

for each y ∈ X. Adding both inequalities, we conclude that

f(y) + g(y) ≥ f(x) + g(x) + ⟨x* + z*, y − x⟩

for each y ∈ X and so, x∗ + z∗ ∈ ∂ ( f + g)(x).


Suppose now that u∗ ∈ ∂ ( f + g)(x). We have

f(y) + g(y) ≥ f(x) + g(x) + ⟨u*, y − x⟩   (3.11)

for every y ∈ X. We shall find x∗ ∈ ∂ f (x) and z∗ ∈ ∂ g(x) such that x∗ + z∗ = u∗ . To


this end, consider the following nonempty convex sets:

B = {(y, λ) ∈ X × R : g(y) − g(x) ≤ −λ}
C = {(y, λ) ∈ X × R : f(y) − f(x) − ⟨u*, y − x⟩ ≤ λ}.

Define h : X → R ∪ {+∞} as h(y) = f(y) − f(x) − ⟨u*, y − x⟩. Since h is continuous


in x0 and C = epi(h), the open convex set A = int(C) is nonempty by Proposition 3.2.
Moreover, A ∩ B = ∅ by inequality (3.11). Using the Hahn–Banach Separation Theorem
1.10, we obtain (L, s) ∈ X ∗ × R \ {(0, 0)} such that

⟨L, y⟩ + sλ ≤ ⟨L, z⟩ + sμ

for each (y, λ ) ∈ A and each (z, μ ) ∈ B. In particular, taking (y, λ ) = (x, 1) ∈ A and
(z, μ ) = (x, 0) ∈ B, we deduce that s ≤ 0. On the other hand, if s = 0, taking z = x0
we see that ⟨L, x₀ − y⟩ ≥ 0 for every y in a neighborhood of x₀. This implies L = 0
and contradicts the fact that (L, s) = (0, 0). Therefore, s < 0, and we may write

⟨z*, y⟩ + λ ≤ ⟨z*, z⟩ + μ   (3.12)

with z∗ = −L/s. By the definition of C, taking again (z, μ ) = (x, 0) ∈ B, we obtain

⟨z*, y − x⟩ + f(y) − f(x) − ⟨u*, y − x⟩ ≤ 0.

Inequality (3.11) then gives

g(y) ≥ g(x) + ⟨z*, y − x⟩

for every y ∈ dom(g), and we conclude that z∗ ∈ ∂ g(x). In a similar fashion, use
(3.12) with y = x, any λ > 0 and μ = g(x) − g(z), along with (3.11), to deduce that

f(z) ≥ f(x) + ⟨u* − z*, z − x⟩

for all z ∈ X, which implies x∗ = u∗ − z∗ ∈ ∂ f (x) and completes the proof.




Observe that, if ∂ ( f +g)(x) = ∂ f (x)+ ∂ g(x) for all x ∈ X, then dom(∂ ( f +g)) =
dom(∂ f ) ∩ dom(∂ g).
If X is a Banach space, another condition ensuring the equality in (3.10) is that
⋃_{t≥0} t(dom(f) − dom(g)) be a closed subspace of X. This is known as the Attouch–
Brézis condition [9].

3.5.3 Some Consequences

Combining the Chain Rule (Proposition 3.28) and the Moreau–Rockafellar Theorem
3.30, we obtain the following:
Corollary 3.31. Let A ∈ L(X; Y), and let f : Y → R ∪ {+∞} and g : X → R ∪
{+∞} be proper, convex, and lower-semicontinuous. For each x ∈ X, we have

A∗ ∂ f (Ax) + ∂ g(x) ⊂ ∂ ( f ◦ A + g)(x).



Equality holds for every x ∈ X if there is x0 ∈ dom(g) such that f is continuous at


Ax0 .
This result will be useful in the context of Fenchel–Rockafellar duality (see The-
orem 3.51).

Another interesting consequence of the Moreau–Rockafellar Theorem 3.30 is:


Corollary 3.32. Let X be reflexive and let f : X → R ∪ {+∞} be proper, strongly
convex, and lower-semicontinuous. Then ∂ f (X) = X ∗ . If, moreover, f is Gâteaux-
differentiable, then ∇ f : X → X ∗ is a bijection.

Proof. Take x* ∈ X*. The function g : X → R ∪ {+∞}, defined by g(x) = f(x) −
⟨x*, x⟩, is proper, strongly convex, and lower-semicontinuous. By Theorem 2.19,
there is a unique x̄ ∈ argmin(g) and, by Fermat’s Rule (Theorem 3.24), 0 ∈ ∂ g(x̄).
This means that x∗ ∈ ∂ f (x̄). If f is Gâteaux-differentiable, the equation ∇ f (x) = x∗
has exactly one solution for each x∗ ∈ X ∗ .


Points in the graph of the approximate subdifferential can be approximated by


points that are actually in the graph of the subdifferential.
Theorem 3.33 (Brønsted–Rockafellar Theorem). Let X be a Banach space and
let f : X → R ∪ {+∞} be lower-semicontinuous and convex. Take ε > 0 and x0 ∈
dom( f ), and assume x0∗ ∈ ∂ε f (x0 ). Then, for each λ > 0, there exist x̄ ∈ dom(∂ f )
and x̄* ∈ ∂f(x̄) such that ‖x̄ − x₀‖ ≤ ε/λ and ‖x̄* − x₀*‖_* ≤ λ.


Proof. Since x0∗ ∈ ∂ε f (x0 ), we have

f(y) ≥ f(x₀) + ⟨x₀*, y − x₀⟩ − ε

for all y ∈ X. In other words, x0 satisfies

g(x₀) ≤ inf_{y∈X} g(y) + ε,

where we have written g(y) = f(y) − ⟨x₀*, y⟩. By Ekeland's Variational Principle
(Theorem 2.6), there exists x̄ ∈ B(x0 , ε /λ ) such that

g(x̄) + λ‖x̄ − x₀‖ ≤ g(x₀)   and   g(x̄) < g(y) + λ‖x̄ − y‖

for all y ≠ x̄. The latter implies that x̄ is the unique minimizer of the function h :
X → R ∪ {+∞} defined by h(y) = g(y) + λ‖x̄ − y‖. From the Moreau–Rockafellar
Theorem 3.30, it follows that x̄ ∈ dom(∂ h) = dom(∂ f ) and

0 ∈ ∂ h(x̄) = ∂ g(x̄) + λ B = ∂ f (x̄) − x0∗ + λ B,

where B = {x* ∈ X* : ‖x*‖_* ≤ 1} (see also Example 3.16). In other words, there
exists x̄* ∈ ∂f(x̄) such that ‖x̄* − x₀*‖_* ≤ λ.

An immediate consequence is the following:



Corollary 3.34. Let X be a Banach space and let f : X → R ∪ {+∞} be convex and
lower-semicontinuous. Then the closure of dom(∂f) coincides with the closure of dom(f).

Proof. Since dom(∂f) ⊂ dom(f), it suffices to prove that dom(f) is contained in the
closure of dom(∂f). Indeed, take x₀ ∈ dom(f) and let ε > 0. By the Brønsted–Rockafellar
Theorem 3.33 (take λ = 1), there is x̄ ∈ dom(∂f) such that ‖x̄ − x₀‖ ≤ ε.


In other words, the points of subdifferentiability are dense in the domain.

3.5.4 Moreau–Yosida Regularization and Smoothing

In this subsection, we present a regularizing and smoothing technique for convex


functions defined on a Hilbert space. By adding a quadratic term, one is able to
force the existence and uniqueness of a minimizer. This fact, in turn, has two major
consequences: first, it is the core of an important minimization method, known as
the proximal point algorithm, which we shall study in Chap. 6; and second, it allows
us to construct a smooth version of the function having the same minimizers.

The Moreau–Yosida Regularization

Let H be a Hilbert space and let f : H → R ∪ {+∞} be proper, convex and lower-
semicontinuous. Given λ > 0 and x ∈ H, the Moreau–Yosida Regularization of f
with parameter (λ , x) is the function f(λ ,x) : H → R ∪ {+∞}, defined by

f_{(λ,x)}(z) = f(z) + (1/(2λ))‖z − x‖².

The following property will be useful later on, especially in Sect. 6.2, when we
study the proximal point algorithm (actually, it is the core of that method):
Proposition 3.35. For each λ > 0 and x ∈ H, the function f(λ ,x) has a unique min-
imizer x̄. Moreover, x̄ is characterized by the inclusion
−(x̄ − x)/λ ∈ ∂f(x̄).   (3.13)
Proof. Since f is proper, convex, and lower-semicontinuous, f(λ ,x) is proper, strictly
convex, lower-semicontinuous, and coercive. Existence and uniqueness of a mini-
mizer x̄ is given by Theorem 2.19. Finally, Fermat’s Rule (Theorem 3.24) and the
Moreau–Rockafellar Theorem 3.30 imply that x̄ must satisfy
0 ∈ ∂f_{(λ,x)}(x̄) = ∂f(x̄) + (x̄ − x)/λ,
which gives the result.


Solving for x̄ in (3.13), we can write

x̄ = (I + λ ∂ f )−1 x, (3.14)

where I : H → H denotes the identity function. In view of Proposition 3.35, the


expression Jλ = (I + λ ∂ f )−1 defines a function Jλ : H → H, called the proximity
operator of f , with parameter λ . It is also known as the resolvent of the operator
∂ f with parameter λ , due to its analogy with the resolvent of a linear operator. We
have the following:
Proposition 3.36. If f : H → R ∪ {+∞} is proper, lower-semicontinuous, and con-
vex, then Jλ : H → H is (everywhere defined and) nonexpansive.

Proof. Let x̄ = Jλ (x) and ȳ = Jλ (y), so that


−(x̄ − x)/λ ∈ ∂f(x̄)   and   −(ȳ − y)/λ ∈ ∂f(ȳ).
In view of the monotonicity of ∂ f (Proposition 3.22), we have

⟨(x̄ − x) − (ȳ − y), x̄ − ȳ⟩ ≤ 0.

Therefore,
0 ≤ ‖x̄ − ȳ‖² ≤ ⟨x − y, x̄ − ȳ⟩ ≤ ‖x − y‖ ‖x̄ − ȳ‖,   (3.15)
and we conclude that ‖x̄ − ȳ‖ ≤ ‖x − y‖.


Example 3.37. The indicator function δC of a nonempty, closed, and convex subset
of H is proper, convex, and lower-semicontinuous. Given λ > 0 and x ∈ H, Jλ (x) is
the unique solution of
 
min{ δ_C(z) + (1/(2λ))‖z − x‖² : z ∈ H } = min{ (1/(2λ))‖z − x‖² : z ∈ C }.

Independently of λ , Jλ (x) is the point in C which is closest to x. In other words, it is


the projection of x onto C, which we have denoted by PC (x). From Proposition 3.36,
we recover the known fact that the function PC : H → H, defined by PC (x) = Jλ (x),
is nonexpansive (see Proposition 1.40).


The Moreau Envelope and Its Smoothing Effect

Recall from Proposition 3.35 that for each λ > 0 and x ∈ H, the Moreau–Yosida
Regularization f(λ ,x) of f has a unique minimizer x̄. The Moreau envelope of f with
parameter λ > 0 is the function fλ : H → R defined by
 
f_λ(x) = min_{z∈H} f_{(λ,x)}(z) = min_{z∈H} { f(z) + (1/(2λ))‖z − x‖² } = f(x̄) + (1/(2λ))‖x̄ − x‖².

An immediate consequence of the definition is that

inf_{z∈H} f(z) ≤ f_λ(x) ≤ f(x)

for all λ > 0 and x ∈ H. Therefore,

inf_{z∈H} f(z) = inf_{z∈H} f_λ(z).

Moreover, x̂ minimizes f on H if, and only if, it minimizes fλ on H for all λ > 0.

Example 3.38. Let C be a nonempty, closed, and convex subset of H, and let f = δC .
Then, for each λ > 0 and x ∈ H, we have
 
f_λ(x) = min_{z∈C} (1/(2λ))‖z − x‖² = (1/(2λ)) dist(x, C)²,

where dist(x,C) denotes the distance from x to C.
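
A closely related classical computation: for f = |·| on H = R, the proximity operator J_λ is the soft-thresholding map and the Moreau envelope f_λ is the Huber function, which is differentiable with derivative (x − J_λ(x))/λ (see Proposition 3.39 below). The following is only a small numerical sketch of these formulas.

```python
import numpy as np

lam = 0.5

def prox_abs(x, lam):
    """J_lam(x) = argmin_z { |z| + (1/(2*lam)) * (z - x)**2 }  (soft thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def moreau_env_abs(x, lam):
    """f_lam(x) = |J_lam(x)| + (1/(2*lam)) * (J_lam(x) - x)**2  (Huber function)."""
    z = prox_abs(x, lam)
    return np.abs(z) + (z - x) ** 2 / (2 * lam)

x = np.linspace(-3, 3, 7)
print(moreau_env_abs(x, lam))            # smooth approximation of |x| from below
print((x - prox_abs(x, lam)) / lam)      # D f_lam(x), cf. Proposition 3.39
```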



A remarkable property of the Moreau envelope is given by the following:
Proposition 3.39. For each λ > 0, the function fλ is Fréchet-differentiable and
Df_λ(x) = (1/λ)(x − x̄)
for all x ∈ H. Moreover, fλ is convex and D fλ is Lipschitz-continuous with constant
1/λ .

Proof. As before, write x̄ = Jλ (x), ȳ = Jλ (y), and observe that

f_λ(y) − f_λ(x) = f(ȳ) − f(x̄) + (1/(2λ))( ‖ȳ − y‖² − ‖x̄ − x‖² )
              ≥ −(1/λ)⟨x̄ − x, ȳ − x̄⟩ + (1/(2λ))( ‖ȳ − y‖² − ‖x̄ − x‖² ),
λ 2λ
by the subdifferential inequality and the inclusion (3.13). Straightforward algebraic
manipulations yield
f_λ(y) − f_λ(x) − (1/λ)⟨x − x̄, y − x⟩ ≥ (1/(2λ))‖(ȳ − x̄) − (y − x)‖² ≥ 0.   (3.16)
Interchanging the roles of x and y, we obtain
f_λ(x) − f_λ(y) − (1/λ)⟨y − ȳ, x − y⟩ ≥ 0.
We deduce that
0 ≤ f_λ(y) − f_λ(x) − (1/λ)⟨x − x̄, y − x⟩ ≤ (1/λ)‖y − x‖² − (1/λ)⟨ȳ − x̄, y − x⟩ ≤ (1/λ)‖y − x‖²,

in view of (3.15). Writing y = x + h, we conclude that


 
lim_{h→0} (1/‖h‖) | f_λ(x + h) − f_λ(x) − (1/λ)⟨x − x̄, h⟩ | = 0,

and fλ is Fréchet-differentiable. The convexity follows from the fact that

⟨Df_λ(x) − Df_λ(y), x − y⟩ ≥ 0,

and the characterization given in Proposition 3.10. Finally, from (3.15), we deduce
that

‖(x − x̄) − (y − ȳ)‖² = ‖x − y‖² + ‖x̄ − ȳ‖² − 2⟨x − y, x̄ − ȳ⟩ ≤ ‖x − y‖².

Therefore,
‖Df_λ(x) − Df_λ(y)‖ ≤ (1/λ)‖x − y‖,
and so, D fλ is Lipschitz-continuous with constant 1/λ .


Remark 3.40. Observe that D fλ (x) ∈ ∂ f (Jλ (x)).



The approximation property of the Moreau envelope is given by the following:
Proposition 3.41. As λ → 0, fλ converges pointwise to f .

Proof. Since
f(J_λ(x)) + (1/(2λ))‖J_λ(x) − x‖² ≤ f(x)
and f is bounded from below by some continuous affine function (Proposition 3.1),
Jλ (x) remains bounded as λ → 0. As a consequence, f (Jλ (x)) is bounded from
below, and so, lim_{λ→0} ‖J_λ(x) − x‖ = 0. Finally,

f(x) ≤ lim inf_{λ→0} f(J_λ(x)) ≤ lim sup_{λ→0} f(J_λ(x)) ≤ f(x),

by lower-semicontinuity.


3.6 The Fenchel Conjugate

The Fenchel conjugate of a proper function f : X → R ∪ {+∞} is the function f ∗ :


X ∗ → R ∪ {+∞} defined by

f*(x*) = sup_{x∈X} { ⟨x*, x⟩ − f(x) }.   (3.17)

Since f ∗ is a supremum of continuous affine functions, it is convex and lower-


semicontinuous (see Proposition 3.1). Moreover, if f is bounded from below by

a continuous affine function (for instance if f is proper and lower-semicontinuous),


then f ∗ is proper.

3.6.1 Main Properties and Examples

As a consequence of the definition, we deduce the following:


Proposition 3.42 (Fenchel–Young Inequality). Let f : X → R∪{+∞}. For all x ∈ X
and x∗ ∈ X ∗ , we have
f(x) + f*(x*) ≥ ⟨x*, x⟩.   (3.18)
If f is convex, then equality in (3.18) holds if, and only if, x∗ ∈ ∂ f (x).
In particular, if f is convex, then R(∂ f ) ⊂ dom( f ∗ ). But the relationship between
these two sets goes much further:
Proposition 3.43. Let X be a Banach space and let f : X → R ∪ {+∞} be lower-
semicontinuous and convex. Then the closure of R(∂f) coincides with the closure of dom(f*).

Proof. Clearly, R(∂f) ⊂ dom(f*), so it suffices to prove that dom(f*) is contained in the closure of R(∂f).


Let x0∗ ∈ dom( f ∗ ) and let ε > 0. Take x0 ∈ dom( f ) such that

⟨x₀*, x₀⟩ − f(x₀) + ε ≥ f*(x₀*) = sup_{y∈X} { ⟨x₀*, y⟩ − f(y) }.

We deduce that x₀* ∈ ∂_ε f(x₀). By the Brønsted–Rockafellar Theorem 3.33 (take λ =
ε), there exist x̄ ∈ dom(∂f) and x̄* ∈ ∂f(x̄) such that ‖x̄ − x₀‖ ≤ 1 and ‖x̄* − x₀*‖_* ≤
ε. We conclude that x₀* belongs to the closure of R(∂f).


Another direct consequence of the Fenchel–Young Inequality (Proposition 3.42) is:


Corollary 3.44. If f : X → R is convex and differentiable, then ∇ f (X) ⊂ dom( f ∗ )
and f*(∇f(x)) = ⟨∇f(x), x⟩ − f(x) for all x ∈ X.
Let us see some examples:
Example 3.45. Let f : R → R be defined by f(x) = eˣ. Then

f*(x*) = x* ln(x*) − x* if x* > 0,   f*(0) = 0,   and   f*(x*) = +∞ if x* < 0.

This function is called the Boltzmann–Shannon Entropy.
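
The conjugate in Example 3.45 can also be obtained numerically by maximizing x*x − eˣ over a grid; this is only a rough approximation of the supremum in (3.17), with arbitrary grid bounds.

```python
import numpy as np

xs = np.linspace(-20.0, 10.0, 400001)          # grid approximating the sup over R

def conj_exp(xstar):
    return np.max(xstar * xs - np.exp(xs))     # sup_x { x* x - e^x }

for xstar in [0.5, 1.0, 2.0, 3.0]:
    exact = xstar * np.log(xstar) - xstar      # Boltzmann-Shannon entropy
    print(xstar, conj_exp(xstar), exact)       # numerical and exact values agree
```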



Example 3.46. Let f : X → R be defined by f(x) = ⟨x₀*, x⟩ + α for some x₀* ∈ X*
and α ∈ R. Then,

f*(x*) = sup_{x∈X} { ⟨x* − x₀*, x⟩ − α } = +∞ if x* ≠ x₀*,   and   f*(x₀*) = −α.

In other words, f ∗ = δ{x0∗ } − α .



Example 3.47. Let δC be the indicator function of a nonempty, closed convex subset
C of X. Then,
  δ_C*(x*) = sup_{x∈C} ⟨x*, x⟩.

This is the support function of the set C, and is denoted by σC (x∗ ).



Example 3.48. The Fenchel conjugate of a radial function is radial: Let φ : [0, ∞) →
R ∪ {+∞} and define f : X → R ∪ {+∞} by f(x) = φ(‖x‖). Extend φ to all of R by
φ(t) = +∞ for t < 0. Then

  f*(x*) = sup_{x∈X} { ⟨x*, x⟩ − φ(‖x‖) }
         = sup_{t≥0} sup_{‖h‖=1} { t⟨x*, h⟩ − φ(t) }
         = sup_{t∈R} { t‖x*‖_* − φ(t) }
         = φ*(‖x*‖_*).

For instance, for φ(t) = ½ t², we obtain f*(x*) = ½ ‖x*‖_*².



Example 3.49. Let f : X → R ∪ {+∞} be proper and convex, and let x ∈ dom( f).
For h ∈ X, set

  φ_x(h) = f′(x; h) = lim_{t↓0} [ f(x + th) − f(x)]/t = inf_{t>0} [ f(x + th) − f(x)]/t

(see Remark 3.8). Let us compute φ_x*:

  φ_x*(x*) = sup_{h∈X} { ⟨x*, h⟩ − φ_x(h) }
           = sup_{h∈X} sup_{t>0} { ⟨x*, h⟩ − [ f(x + th) − f(x)]/t }
           = sup_{t>0} sup_{z∈X} { [ f(x) + ⟨x*, z⟩ − f(z) − ⟨x*, x⟩ ]/t }
           = sup_{t>0} { [ f(x) + f*(x*) − ⟨x*, x⟩ ]/t }.

By the Fenchel–Young Inequality (Proposition 3.42), we conclude that φ_x*(x*) = 0
if x* ∈ ∂ f(x), and φ_x*(x*) = +∞ otherwise. In other words, φ_x* = δ_{∂ f(x)}.

Another easy consequence of the definition is:
Proposition 3.50. If f ≤ g, then f ∗ ≥ g∗ . In particular,
  ( sup_{i∈I} f_i )* ≤ inf_{i∈I} f_i*    and    ( inf_{i∈I} f_i )* = sup_{i∈I} f_i*

for any family ( fi )i∈I of functions on X with values in R ∪ {+∞}.

3.6.2 Fenchel–Rockafellar Duality

Let X and Y be normed spaces and let A ∈ L (X;Y ). Consider two proper, lower-
semicontinuous, and convex functions f : X → R ∪ {+∞} and g : Y → R ∪ {+∞}.
The primal problem in Fenchel–Rockafellar duality is given by:

  (P)    inf_{x∈X}  f(x) + g(Ax).

Its optimal value is denoted by α and the set of primal solutions is S. Also consider
the dual problem:

  (D)    inf_{y*∈Y*}  f*(−A*y*) + g*(y*),

with optimal value α ∗ . The set of dual solutions is denoted by S∗ .

By the Fenchel–Young Inequality (Proposition 3.42), for each x ∈ X and y∗ ∈ Y ∗ ,


we have f(x) + f*(−A*y*) ≥ ⟨−A*y*, x⟩ and g(Ax) + g*(y*) ≥ ⟨y*, Ax⟩. Thus,

  f(x) + g(Ax) + f*(−A*y*) + g*(y*) ≥ ⟨−A*y*, x⟩ + ⟨y*, Ax⟩ = 0   (3.19)

for all x ∈ X and y∗ ∈ Y ∗ , and so α + α ∗ ≥ 0. The duality gap is α + α ∗ .

Let us characterize the primal-dual solutions:


Theorem 3.51. The following statements concerning points x̂ ∈ X and ŷ∗ ∈ Y ∗ are
equivalent:
i) −A∗ ŷ∗ ∈ ∂ f (x̂) and ŷ∗ ∈ ∂ g(Ax̂);
ii) f(x̂) + f*(−A*ŷ*) = ⟨−A*ŷ*, x̂⟩ and g(Ax̂) + g*(ŷ*) = ⟨ŷ*, Ax̂⟩;
iii) f (x̂) + g(Ax̂) + f ∗ (−A∗ ŷ∗ ) + g∗ (ŷ∗ ) = 0; and
iv) x̂ ∈ S and ŷ∗ ∈ S∗ and α + α ∗ = 0.
Moreover, if x̂ ∈ S and there is x ∈ dom( f ) such that g is continuous in Ax, then
there exists ŷ∗ ∈ Y ∗ such that all four statements hold.

Proof. Statements i) and ii) are equivalent by the Fenchel–Young Inequality (Propo-
sition 3.42) and, clearly, they imply iii). But iii) implies ii) since

  [ f(x̂) + f*(−A*ŷ*) − ⟨−A*ŷ*, x̂⟩ ] + [ g(Ax̂) + g*(ŷ*) − ⟨ŷ*, Ax̂⟩ ] = 0

and each term in brackets is nonnegative. Next, iii) and iv) are equivalent in view of
(3.19). Finally, if x̂ ∈ S and there is x ∈ dom( f ) such that g is continuous in Ax, then

0 ∈ ∂ ( f + g ◦ A)(x̂) = ∂ f (x̂) + A∗ ∂ g(Ax̂)



by the Chain Rule (Proposition 3.28) and the Moreau–Rockafellar Theorem 3.30.
Hence, there is ŷ∗ ∈ ∂ g(Ax̂) such that −A∗ ŷ∗ ∈ ∂ f (x̂), which is i).

For the linear programming problem, the preceding argument gives:
Example 3.52. Consider the linear programming problem:

  (LP)    min_{x∈R^N} { c · x : Ax ≤ b },

where c ∈ R^N, A is a matrix of size M × N, and b ∈ R^M. This problem can be
recast in the form of (P) by setting f(x) = c · x and g(y) = δ_{R^M_+}(b − y). The Fenchel
conjugates are easily computed (Examples 3.46 and 3.47) and the dual problem is

  (DLP)    min_{y*∈R^M} { b · y* : A*y* + c = 0 and y* ≥ 0 }.

If (LP) has a solution and there is x ∈ RN such that Ax < b, then the dual problem
has a solution, there is no duality gap and the primal-dual solutions (x̂, ŷ∗ ) are char-
acterized by Theorem 3.51. Observe that the dimension of the space of variables,
namely N, is often much larger than the number of constraints, which is M. In those
cases, the dual problem (DLP) is much smaller than the primal one.
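A small numerical illustration of the pair (LP)–(DLP), with data chosen arbitrarily for this sketch, using scipy.optimize.linprog. Both problems are stated in minimization form, so a zero duality gap shows up as α + α* ≈ 0, and complementary slackness can be read off directly.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical data: minimize c.x subject to A x <= b (the primal (LP))
c = np.array([-1.0, -2.0])
A = np.array([[1.0, 1.0],
              [1.0, 0.0],
              [0.0, 1.0]])
b = np.array([4.0, 3.0, 3.0])

primal = linprog(c, A_ub=A, b_ub=b, bounds=[(None, None)] * 2)

# Dual (DLP): minimize b.y subject to A^T y = -c, y >= 0
dual = linprog(b, A_eq=A.T, b_eq=-c, bounds=[(0, None)] * 3)

print("alpha  =", primal.fun)                    # about -7
print("alpha* =", dual.fun)                      # about +7
print("alpha + alpha* =", primal.fun + dual.fun) # duality gap, about 0
print(dual.x * (b - A @ primal.x))               # complementary slackness, about 0
```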


3.6.3 The Biconjugate

If f ∗ is proper, the biconjugate of f is the function f ∗∗ : X → R ∪ {+∞} defined by

  f**(x) = sup_{x*∈X*} { ⟨x*, x⟩ − f*(x*) }.

Being a supremum of continuous affine functions, the biconjugate f ∗∗ is convex and


lower-semicontinuous.
Remark 3.53. From the Fenchel–Young Inequality (Proposition 3.42), we deduce
that f ∗∗ (x) ≤ f (x) for all x ∈ X. In particular, f ∗∗ is proper. A stronger conclusion
for lower-semicontinuous convex functions is given in Proposition 3.56.

Remark 3.54. We have defined the biconjugate in the original space X and not
in the bidual X**. Therefore, f** ≠ ( f*)* (although f**(x) = ( f*)*(J(x)) for all
x ∈ X, where J is the canonical embedding of X into X**), but this difference
disappears in reflexive spaces.

Example 3.55. Let f : X → R be defined by f(x) = ⟨x0*, x⟩ + α for some x0* ∈ X*
and α ∈ R. We already computed f* = δ_{x0*} − α in Example 3.46. We deduce that

  f**(x) = sup_{x*∈X*} { ⟨x*, x⟩ − f*(x*) } = ⟨x0*, x⟩ + α.

Therefore, f** = f.


The situation discussed in Example 3.55 for continuous affine functions is not
exceptional. On the contrary, we have the following:
Proposition 3.56. Let f : X → R ∪ {+∞} be proper. Then, f is convex and lower-
semicontinuous if, and only if, f ∗∗ = f .

Proof. We already mentioned in Remark 3.53 that always f ∗∗ ≤ f . On the other


hand, since f is convex and lower-semicontinuous, there exists a family ( fi )i∈I of
continuous affine functions on X such that f = sup( fi ), by Proposition 3.1. As in
i∈I
Proposition 3.50, we see that f ≤ g implies f ∗∗ ≤ g∗∗ . Therefore,

f ∗∗ ≥ sup( fi∗∗ ) = sup( fi ) = f ,


i∈I i∈I

because fi∗∗ = fi for continuous affine functions (see Example 3.55). The converse
is straightforward, since f = f ∗∗ is a supremum of continuous affine functions. 

Corollary 3.57. Let f : X → R ∪ {+∞} be proper. Then, f ∗∗ is the greatest lower-


semicontinuous and convex function below f .

Proof. If g : X → R ∪ {+∞} is a lower-semicontinuous and convex function such


that g ≤ f , then g = g∗∗ ≤ f ∗∗ .


Proposition 3.56 also gives the following converse to Proposition 3.20:


Proposition 3.58. Let f : X → R ∪ {+∞} be convex. Suppose f is continuous in x0
and ∂ f (x0 ) = {x0∗ }. Then f is Gâteaux-differentiable in x0 and ∇ f (x0 ) = x0∗ .

Proof. As we saw in Proposition 3.9, the function φ_{x0} : X → R defined by φ_{x0}(h) =
f′(x0; h) is convex and continuous in X. We shall see that actually φ_{x0}(h) = ⟨x0*, h⟩
for all h ∈ X. First observe that

  φ_{x0}(h) = φ_{x0}**(h) = sup_{x*∈X*} { ⟨x*, h⟩ − φ_{x0}*(x*) }.

Next, as we computed in Example 3.49, φ_{x0}*(x*) = δ_{∂ f(x0)}(x*). Hence,

  φ_{x0}(h) = sup_{x*∈X*} { ⟨x*, h⟩ − δ_{{x0*}}(x*) } = ⟨x0*, h⟩.

We conclude that f is Gâteaux-differentiable in x0 and ∇ f (x0 ) = x0∗ .




Combining the Fenchel–Young Inequality (Proposition 3.42) and Proposition


3.56, we obtain the following:
Proposition 3.59 (Legendre–Fenchel Reciprocity Formula). Let f : X → R ∪ {+∞}
be proper, lower-semicontinuous, and convex. Then

x∗ ∈ ∂ f (x) if, and only if, J (x) ∈ ∂ f ∗ (x∗ ),

where J is the canonical embedding of X into X ∗∗ .



Remark 3.60. The dissymmetry introduced by defining ∂ f ∗ as a subset of X ∗∗ and


the biconjugate as a function of X may seem unpleasant. Actually, it does produce
some differences, not only in Proposition 3.59, but, for instance, in the roles of pri-
mal and dual problems (see Sects. 3.6.2 and 3.7.2). Of course, all the dissymmetry
disappears in reflexive spaces. In general Banach spaces, this difficulty may be cir-
cumvented by endowing X ∗ with the weak∗ topology σ ∗ . The advantage of doing
so, is that the topological dual of X ∗ is X (see [94, Theorem 3.10]). In this setting,
X ∗ will no longer be a normed space, but the core aspects of this chapter can be
recast in the context of locally convex topological vector spaces in duality.


3.7 Optimality Conditions for Constrained Problems

Let C ⊂ X be closed and convex, and let f : X → R ∪ {+∞} be proper, lower-


semicontinuous, and convex. The following result characterizes the solutions of the
optimization problem
min{ f (x) : x ∈ C }.
Proposition 3.61. Let f : X → R ∪ {+∞} be proper, lower-semicontinuous, and
convex and let C ⊂ X be closed and convex. Assume either that f is continuous at
some point of C, or that there is an interior point of C where f is finite. Then x̂
minimizes f on C if, and only if, there is p ∈ ∂ f (x̂) such that −p ∈ NC (x̂).

Proof. By Fermat’s Rule (Theorem 3.24) and the Moreau–Rockafellar Theorem


3.30, x̂ minimizes f on C if, and only if,

0 ∈ ∂ ( f + δC )(x̂) = ∂ f (x̂) + NC (x̂).

This is equivalent to the existence of p ∈ ∂ f (x̂) such that −p ∈ NC (x̂).




Let C be a closed affine subspace of X, namely, C = {x0 } +V , where x0 ∈ X and


V is a closed subspace of X. Then NC (x̂) = V ⊥ . We obtain:
Corollary 3.62. Let C = {x0 } +V , where x0 ∈ X and V is a closed subspace of X,
and let f : X → R ∪ {+∞} be a proper, lower-semicontinuous, and convex function
and assume that f is continuous at some point of C. Then x̂ minimizes f on C if, and
only if, ∂ f(x̂) ∩ V⊥ ≠ ∅.

3.7.1 Affine Constraints

Let A ∈ L (X;Y ) and let b ∈ Y . We shall derive optimality conditions for the problem
of minimizing a function f over a set C of the form:

C = { x ∈ X : Ax = b }, (3.20)

which we assume to be nonempty. Before doing so, let us recall that, given A ∈
L (X;Y ), the adjoint of A is the operator A∗ : Y ∗ → X ∗ defined by the identity

  ⟨A*y*, x⟩_{X*,X} = ⟨y*, Ax⟩_{Y*,Y},

for x ∈ X and y∗ ∈ Y ∗ . It is possible to prove (see, for instance, [30, Theorem 2.19])
that, if A has closed range, then ker(A)⊥ = R(A∗ ).

We have the following:


Theorem 3.63. Let C be defined by (3.20), where A has closed range. Let f : X →
R ∪ {+∞} be a proper, lower-semicontinuous, and convex function and assume that
f is continuous at some point of C. Then x̂ minimizes f on C if, and only if, Ax̂ = b
and there is ŷ∗ ∈ Y ∗ such that −A∗ ŷ∗ ∈ ∂ f (x̂).
Proof. First observe that NC (x̂) = ker(A)⊥ . Since A has closed range, ker(A)⊥ =
R(A∗ ). We deduce that − p̂ ∈ NC (x̂) = R(A∗ ) if, and only if, p̂ = A∗ ŷ∗ for some
ŷ∗ ∈ Y ∗ , and conclude using Proposition 3.61.


3.7.2 Nonlinear Constraints and Lagrange Multipliers

Throughout this section, we consider proper, lower-semicontinuous, and convex


functions f , g1 , . . . , gm : X → R ∪ {+∞}, along with continuous affine functions
h1 , . . . , h p : X → R, which we assume to be linearly independent. We shall derive
optimality conditions for the problem of minimizing f over the set C defined by

C = {x ∈ X : gi (x) ≤ 0 for all i, and h j (x) = 0 for all j }, (3.21)

assuming this set is nonempty. Since each gi is convex and each h j is affine, the set
C is convex. To simplify the notation, write

S = argmin{ f (x) : x ∈ C } and α = inf{ f (x) : x ∈ C }.

We begin by showing the following intermediate result, which is interesting in


its own right:
Proposition 3.64. There exist λ0 , . . . , λm ≥ 0, and μ1 , . . . μ p ∈ R (not all zero) such
that
  λ0 α ≤ λ0 f(x) + ∑_{i=1}^{m} λ_i g_i(x) + ∑_{j=1}^{p} μ_j h_j(x)   (3.22)

for all x ∈ X.
Proof. If α = −∞, the result is trivial. Otherwise, set

  A = ⋃_{x∈X} [ ( f(x) − α, +∞) × ∏_{i=1}^{m} (g_i(x), +∞) × ∏_{j=1}^{p} {h_j(x)} ].

The set A ⊂ R × Rm × R p is nonempty and convex. Moreover, by the definition of α ,


0 ∉ A. According to the finite-dimensional version of the Hahn-Banach Separation
Theorem, namely Proposition 1.12, there exists (λ0 , . . . , λm , μ1 , . . . , μ p ) ∈ R × Rm ×
R p \ {0} such that

λ0 u0 + · · · + λm um + μ1 v1 + · · · + μ p v p ≥ 0 (3.23)

for all (u0 , . . . , um , v1 , . . . , v p ) ∈ A. If some λi < 0, the left-hand side of (3.23) can
be made negative by taking ui large enough. Therefore, λi ≥ 0 for each i. Passing

to the limit in (3.23), we obtain (3.22) for each x ∈ dom( f ) ∩ ( m i=1 dom(gi )). The
inequality holds trivially in all other points.


A more precise and useful result can be obtained under a qualification condition:

Slater’s condition: There exists x0 ∈ dom( f ) such that gi (x0 ) < 0 for i = 1, . . . m,
and h j (x0 ) = 0 for j = 1, . . . p.

Roughly speaking, this means that the constraint given by the system of inequal-
ities is thick in the subspace determined by the affine equality constraints.
Corollary 3.65. Assume Slater’s condition holds. Then, there exist λ̂1 , . . . , λ̂m ≥ 0,
and μ̂1 , . . . , μ̂ p ∈ R, such that
  α ≤ f(x) + ∑_{i=1}^{m} λ̂_i g_i(x) + ∑_{j=1}^{p} μ̂_j h_j(x)

for all x ∈ X.

Proof. If λ0 = 0 in Proposition 3.64, and Slater’s condition holds, then λ1 = · · · =


λm = 0. It follows that a nontrivial linear combination of the affine functions is
nonnegative on X. By linearity, this combination must be identically 0, which con-
tradicts the linear independence. It suffices to divide the whole expression by λ0 > 0
and rename the other variables.


As a consequence of the preceding discussion we obtain the first-order optimality


condition for the constrained problem, namely:
Theorem 3.66. If x̂ ∈ S and Slater’s condition holds, then there exist λ̂1 , . . . , λ̂m ≥ 0,
and μ̂1 , . . . , μ̂ p ∈ R, such that λ̂i gi (x̂) = 0 for all i = 1, . . . , m and
" #
m p
0∈∂ f + ∑ λ̂i gi (x̂) + ∑ μ̂ j ∇h j (x̂). (3.24)
i=1 j=1

Conversely, if x̂ ∈ C and there exist λ̂1 , . . . , λ̂m ≥ 0, and μ̂1 , . . . , μ̂ p ∈ R, such that
λ̂i gi (x̂) = 0 for all i = 1, . . . , m and (3.24) holds, then x̂ ∈ S.

Proof. If x̂ ∈ S, the inequality in Corollary 3.65 is in fact an equality and we easily


see that λ̂i gi (x̂) = 0 for all i = 1, . . . , m. Fermat’s Rule (Theorem 3.24) gives (3.24).
Conversely, we have

  f(x̂) ≤ f(x) + ∑_{i=1}^{m} λ̂_i g_i(x) + ∑_{j=1}^{p} μ̂_j h_j(x)
       ≤ f(x) + sup_{λ_i≥0} sup_{μ_j∈R} [ ∑_{i=1}^{m} λ_i g_i(x) + ∑_{j=1}^{p} μ_j h_j(x) ]
       = f(x) + δ_C(x)

for all x ∈ X. It follows that x̂ minimizes f on C.




Remark 3.67. By the Moreau–Rockafellar Theorem 3.30, the inclusion


  0 ∈ ∂ f(x̂) + ∑_{i=1}^{m} λ̂_i ∂g_i(x̂) + ∑_{j=1}^{p} μ̂_j ∇h_j(x̂)

implies (3.24). Moreover, they are equivalent if f , g1 , . . . , gm have a common point


of continuity.


Lagrangian Duality

For (x, λ, μ) ∈ X × R^m_+ × R^p, the Lagrangian for the constrained optimization
problem is defined as

  L(x, λ, μ) = f(x) + ∑_{i=1}^{m} λ_i g_i(x) + ∑_{j=1}^{p} μ_j h_j(x).

Observe that

  sup_{(λ,μ)∈R^m_+×R^p} L(x, λ, μ) = f(x) + δ_C(x).

We shall refer to the problem of minimizing f over C as the primal problem. Its
value is given by

  α = inf_{x∈X} sup_{(λ,μ)∈R^m_+×R^p} L(x, λ, μ),

and S is called the set of primal solutions. By inverting the order in which the supre-
mum and infimum are taken, we obtain the dual problem, whose value is


  α* = sup_{(λ,μ)∈R^m_+×R^p} inf_{x∈X} L(x, λ, μ).

The set S∗ of points at which the supremum is attained is the set of dual solutions or
Lagrange multipliers.

Clearly, α ∗ ≤ α . The difference α − α ∗ is the duality gap between the primal


and the dual problems.

We say (x̂, λ̂, μ̂) ∈ X × R^m_+ × R^p is a saddle point of L if

  L(x̂, λ, μ) ≤ L(x̂, λ̂, μ̂) ≤ L(x, λ̂, μ̂)   (3.25)

for all (x, λ, μ) ∈ X × R^m_+ × R^p.

Primal and dual solutions are further characterized by:


Theorem 3.68. Let (x̂, λ̂, μ̂) ∈ X × R^m_+ × R^p. The following are equivalent:

i) x̂ ∈ S, (λ̂, μ̂) ∈ S* and α = α*;
ii) (x̂, λ̂, μ̂) is a saddle point of L; and
iii) x̂ ∈ C, λ̂_i g_i(x̂) = 0 for all i = 1, . . . , m and (3.24) holds.
Moreover, if S ≠ ∅ and Slater's condition holds, then S* ≠ ∅ and α = α*.
Proof. Let i) hold. Since x̂ ∈ S, we have

  L(x̂, λ, μ) ≤ sup_{λ,μ} L(x̂, λ, μ) = α

for all (λ, μ) ∈ R^m_+ × R^p. Next,

  α* = inf_x L(x, λ̂, μ̂) ≤ L(x, λ̂, μ̂)

for all x ∈ X. Finally,

  α* = inf_x L(x, λ̂, μ̂) ≤ L(x̂, λ̂, μ̂) ≤ sup_{λ,μ} L(x̂, λ, μ) = α.

Since α ∗ = α , we obtain (3.25), which gives ii).


Now, suppose ii) holds. By (3.25), we have

  f(x̂) + δ_C(x̂) = sup_{λ,μ} L(x̂, λ, μ) ≤ inf_x L(x, λ̂, μ̂) = α* < +∞,

and so, x̂ ∈ C. Moreover,

  f(x̂) = f(x̂) + δ_C(x̂) ≤ L(x̂, λ̂, μ̂) = f(x̂) + ∑_{i=1}^{m} λ̂_i g_i(x̂) ≤ f(x̂).

It follows that ∑_{i=1}^{m} λ̂_i g_i(x̂) = 0 and we conclude that each term is zero, since they all
have the same sign. The second inequality in (3.25) gives (3.24).

Next, assume iii) holds. By Theorem 3.66, x̂ ∈ S. On the other hand, by (3.24) and
Fermat's Rule, x̂ minimizes L(·, λ̂, μ̂) over X, and so

  inf_{x∈X} L(x, λ̂, μ̂) = L(x̂, λ̂, μ̂) = f(x̂) = α ≥ α* = sup_{λ,μ} inf_{x∈X} L(x, λ, μ),

whence (λ̂, μ̂) ∈ S*. We conclude that α* = inf_{x∈X} L(x, λ̂, μ̂) = L(x̂, λ̂, μ̂) = α.
Finally, if S ≠ ∅ and Slater's condition holds, by Theorem 3.66, there is (λ̂, μ̂)
such that condition iii) in Theorem 3.68 holds. Then, (λ̂, μ̂) ∈ S* and α = α*.

Remark 3.69. As we pointed out in Example 3.52 for the linear programming prob-
lem, the dimension of the space in which the dual problem is stated is the number of
constraints. Therefore, the dual problem is typically easier to solve than the primal
one. Moreover, according to Theorem 3.68, once we have found a Lagrange multi-
plier, we can recover the solution x̂ for the original (primal) constrained problem as
a solution of
  min_{x∈X} L(x, λ̂, μ̂),

which is an unconstrained problem.
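A minimal numerical sketch of this recovery procedure (the problem data below are an arbitrary illustration, not from the text): minimize f(x) = ‖x‖² subject to the single constraint g(x) = 1 − a·x ≤ 0. The dual function is one-dimensional, so it can be maximized on a grid; the primal solution is then recovered by minimizing the Lagrangian at the multiplier found.

```python
import numpy as np

# Hypothetical problem: minimize ||x||^2 subject to 1 - a.x <= 0.
a = np.array([3.0, 4.0])

def lagrangian_min(lmbda):
    # For fixed lambda >= 0, L(x, lambda) = ||x||^2 + lambda*(1 - a.x)
    # is minimized at x = lambda*a/2 (set the gradient to zero).
    x = 0.5 * lmbda * a
    return x, x @ x + lmbda * (1.0 - a @ x)

grid = np.linspace(0.0, 1.0, 10001)            # dual variable grid
q = np.array([lagrangian_min(l)[1] for l in grid])
lam_hat = grid[np.argmax(q)]                   # approximate Lagrange multiplier

x_hat, _ = lagrangian_min(lam_hat)             # recover the primal solution
print(lam_hat)                                 # about 2/||a||^2 = 0.08
print(x_hat)                                   # about a/||a||^2 = [0.12, 0.16]
print(lam_hat * (1.0 - a @ x_hat))             # complementary slackness, about 0
```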



Remark 3.70. In the setting of Theorem 3.63, we know that x̂ minimizes f on C =
{x ∈ X : Ax = b} if, and only if, Ax̂ = b and there is ŷ∗ ∈ Y ∗ such that −A∗ ŷ∗ ∈ ∂ f (x̂).
If we define the Lagrangian L : X ×Y ∗ → R ∪ {+∞} by

  L(x, y*) = f(x) + ⟨y*, Ax − b⟩,

then, the following are equivalent:


i) Ax̂ = b and −A∗ ŷ∗ ∈ ∂ f (x̂);
ii) (x̂, ŷ∗ ) is a saddle point of L; and
iii) x̂ ∈ C, and minimizes x → L(x, ŷ∗ ) over X.
This allows us to characterize x̂ as a solution for an unconstrained problem. In this
context, ŷ∗ is a (possibly infinite-dimensional) Lagrange multiplier.

Chapter 4
Examples

Abstract The tools presented in the previous chapters are useful, on the one hand, to
prove that a wide variety of optimization problems have solutions; and, on the other,
to provide useful characterizations allowing to determine them. In this chapter, we
present a short selection of problems to illustrate some of those tools. We begin
by revisiting some results from functional analysis concerning the maximization
of bounded linear functionals and the realization of the dual norm. Next, we dis-
cuss some problems in optimal control and calculus of variations. Another standard
application of these convex analysis techniques lies in the field of elliptic partial dif-
ferential equations. We shall review the theorems of Stampacchia and Lax-Milgram,
along with some variations of Poisson’s equation, including the obstacle problem
and the p-Laplacian. We finish by commenting a problem of data compression and
restoration.

4.1 Norm of a Bounded Linear Functional

Let X be a normed space and take L ∈ X*. Recall that ‖L‖_* = sup_{‖x‖=1} |L(x)|. By
linearity, it is easy to prove that

  sup_{‖x‖=1} |L(x)| = sup_{‖x‖<1} |L(x)| = sup_{‖x‖≤1} |L(x)|.

We shall focus in this section on the third form, since it can be seen as a convex
minimization problem. Indeed, the value of
 
  inf_{x∈X} { L(x) + δ_{B̄(0,1)}(x) }

is precisely −‖L‖_*.


Let B be any nonempty, closed, convex, and bounded subset of X. The function
f = L + δB is proper, lower-semicontinuous, convex, and coercive. If X is reflexive,
the infimum is attained, by Theorem 2.19. We obtain the following:
Corollary 4.1. Let X be reflexive and let L ∈ X ∗ . Then, L attains its maximum and its
minimum on each nonempty, closed, convex, and bounded subset of X. In particular,
there exists x0 ∈ B̄(0, 1) such that L(x0) = ‖L‖_*.
The last part can also be seen as a consequence of Corollary 1.16. Actually, the
converse is true as well: if every L ∈ X ∗ attains its maximum in B̄(0, 1), then X
is reflexive (see [68]). Therefore, in nonreflexive spaces there are bounded linear
functionals that do not realize their supremum on the ball. Here is an example:
Example 4.2. Consider (as in Example 1.2) the space X = C ([−1, 1]; R) of con-
tinuous real-valued functions on [−1, 1], with the norm ‖·‖_∞ defined by ‖x‖_∞ =
max_{t∈[−1,1]} |x(t)|. Define L : X → R by

  L(x) = ∫_{−1}^{0} x(t) dt − ∫_{0}^{1} x(t) dt.

Clearly, L ∈ X* with ‖L‖_* ≤ 2, since

  |L(x)| ≤ ∫_{−1}^{0} ‖x‖_∞ dt + ∫_{0}^{1} ‖x‖_∞ dt = 2‖x‖_∞.

We shall see that ‖L‖_* = 2. Indeed, consider the sequence (x_n)_{n≥2} in X defined by

           ⎧  1    if −1 ≤ t ≤ −1/n
  x_n(t) = ⎨ −nt   if −1/n < t < 1/n
           ⎩ −1    if 1/n ≤ t ≤ 1.

We easily see that x_n ∈ B̄(0, 1) for all n ≥ 2 and

  lim_{n→∞} L(x_n) = lim_{n→∞} (2 − 1/n) = 2.

However, if ‖x‖_∞ ≤ 1 and L(x) = 2, then, necessarily, x(t) = 1 for all t ∈ [−1, 0] and
x(t) = −1 for all t ∈ [0, 1]. Therefore, x cannot belong to X.

As a consequence of Corollary 4.1 and Example 4.2, we obtain:
 
Corollary 4.3. The space (C([−1, 1]; R), ‖·‖_∞) is not reflexive.
As we have pointed out, if X is not reflexive, then not all the elements of X ∗
attain their supremum on the unit ball. However, many of them do. More precisely,
we have the following remarkable result of Functional Analysis:
Corollary 4.4 (Bishop-Phelps Theorem). Let X be a Banach space and let B ⊂
X be nonempty, closed, convex, and bounded. Then, the set of all bounded linear
functionals that attain their maximum in B is dense in X ∗ .

Proof. As usual, let δB be the indicator function of the set B. Then

  δ_B*(L) = sup_{x∈B} ⟨L, x⟩

for each L ∈ X ∗ (see also Example 3.47). Since B is bounded, dom(δB∗ ) = X ∗ . On the
other hand, L attains its maximum in B if, and only if, L ∈ R(∂ δB ) by Fermat’s Rule
(Theorem 3.24) and the Moreau–Rockafellar Theorem 3.30. But, by Proposition
3.43, the closure of R(∂ δ_B) equals the closure of dom(δ_B*), which is X*; that is, R(∂ δ_B) is dense in X*.


4.2 Optimal Control and Calculus of Variations

Using convex analysis and differential calculus tools, it is possible to prove the exis-
tence and provide characterizations for the solutions of certain problems in optimal
control and calculus of variations.

4.2.1 Controlled Systems

Let y0 ∈ Rn , A : [0, T ] → Rn×n , B : [0, T ] → Rn×m , and c : [0, T ] → Rn . Consider the


linear control system

ẏ(t) = A(t)y(t) + B(t)u(t) + c(t), t ∈ (0, T )
(CS)
y(0) = y0 .

We assume, for simplicity, that the functions A, B, and c are continuous.

Given u ∈ L p (0, T ; RM ) with p ∈ [1, ∞], one can find yu ∈ C (0, T ; RN ) such that
the pair (u, yu ) satisfies (CS). Indeed, let R : [0, T ] → Rn×n be the resolvent of the
matrix equation Ẋ = AX with initial condition X(0) = I. Then yu can be computed
by using the Variation of Parameters Formula:

  y_u(t) = R(t)y_0 + R(t) ∫_0^t R(s)^{-1} [B(s)u(s) + c(s)] ds.   (4.1)

It is easy to see that the function u → yu is affine and continuous.

The system (CS) is controllable from y0 to the target set T ⊂ RN in time T if


there exists u ∈ L p (0, T ; RM ) such that yu (T ) ∈ T . In other words, if the function
Φ : L p (0, T ; RM ) → R ∪ {+∞}, defined by Φ (u) = δT (yu (T )), is proper.

4.2.2 Existence of an Optimal Control

Consider the functional J defined on some function space X (to be specified later)
by

  J[u] = ∫_0^T ℓ(t, u(t), y_u(t)) dt + h(y_u(T))

with ℓ : R × R^M × R^N → R ∪ {+∞}. The optimal control problem is

(OC) min{ J[u] : u ∈ X }.

It is feasible if J is proper. In this section we focus our attention on the case where
the function ℓ has the form

  ℓ(t, v, x) = f(v) + g(x).   (4.2)

We assume f : RM → R ∪ {+∞} is proper, convex, and lower-semicontinuous,


and that there exist α f ∈ R and β > 0 such that

(G) f (v) ≥ α f + β |v| p

for all v ∈ RM and some p ∈ (1, ∞). This p will determine the function space where
we shall search for solutions. Let g : RN → R be continuous and bounded from
below by αg , and let h : RN → R ∪ {+∞} be lower-semicontinuous (for the strong
topology) and bounded from below by αh . We shall prove that J has a minimizer.
Theorem 4.5. If (OC) is feasible, it has a solution in L p (0, T ; RM ).

Proof. By Theorem 2.9, it suffices to verify that J is sequentially inf-compact and


sequentially lower-semicontinuous for the weak topology.
For inf-compactness, since L p (0, T ; RM ) is reflexive, it suffices to show that the
sublevel sets are bounded, by Theorem 1.24. Fix γ ∈ R and take u ∈ Γγ (J). By the
growth condition (G), we have

f (u(t)) ≥ α f + β |u(t)| p

for almost every t ∈ [0, T ]. Integrating from 0 to T we obtain

  γ ≥ J[u] ≥ (α_f + α_g + α_h)T + β ‖u‖^p_{L^p(0,T;R^M)}.



Setting R = (β^{−1}[γ − (α_f + α_g + α_h)T])^{1/p}, we have Γ_γ(J) ⊂ B(0, R).
For the lower-semicontinuity, let (u_n) be a sequence in L^p(0, T; R^M) such that
u_n ⇀ ū weakly as n → ∞. We shall prove that

  J[ū] ≤ lim inf_{n→∞} J[u_n].   (4.3)

To simplify the notation, for t ∈ [0, T ], set

  z_0(t) = R(t)y_0 + R(t) ∫_0^t R(s)^{-1} c(s) ds,

which is the part of y_u that does not depend on u. The j-th component of y_{u_n} satisfies

  y_{u_n}^j(t) = z_0^j(t) + ∫_0^T 1_{[0,t]}(s) μ_j(t, s) u_n(s) ds,

where μ_j(t, s) is the j-th row of the matrix R(t)R(s)^{-1}B(s). The weak convergence
of (u_n) to ū implies

  lim_{n→∞} y_{u_n}(t) = z_0(t) + R(t) ∫_0^t R(s)^{-1} B(s) ū(s) ds = y_ū(t)

for each t ∈ [0, T ]. Moreover, this convergence is uniform because the sequence
(y_{u_n}) is equicontinuous. Indeed, if 0 ≤ τ ≤ t ≤ T , we have

  y_{u_n}^j(t) − y_{u_n}^j(τ) = ∫_0^T 1_{[τ,t]}(s) μ_j(t, s) u_n(s) ds.

But μ j (t, s) is bounded by continuity, while the sequence (un ) is bounded because it
is weakly convergent. Therefore,

  |y_{u_n}^j(t) − y_{u_n}^j(τ)| ≤ C|t − τ|

for some constant C, independent of n. It immediately follows that

  lim_{n→∞} ∫_0^T g(y_{u_n}(t)) dt = ∫_0^T g(y_ū(t)) dt

and

  lim inf_{n→∞} h(y_{u_n}(T)) ≥ h(y_ū(T)).

Therefore, in order to establish (4.3), it suffices to prove that

  lim inf_{n→∞} ∫_0^T f(u_n(t)) dt ≥ ∫_0^T f(ū(t)) dt,

which was already done in Example 2.18.




Remark 4.6. Requiring a controllability condition yu (T ) ∈ T is equivalent to


adding a term δT (yu (T )) in the definition of h(yu (T )). 


4.2.3 The Linear-Quadratic Problem

We shall derive a characterization for the solutions of the linear-quadratic control


problem, where the objective function is given by

  J[u] = (1/2) ∫_0^T u(t)*U(t)u(t) dt + (1/2) ∫_0^T y_u(t)*W(t)y_u(t) dt + h(y_u(T)),

where
• The vector functions u and yu are related by (CS) but, for simplicity, we assume
c ≡ 0.
• All vectors are represented as columns, and the star ∗ denotes the transposition
of vectors and matrices.
• For each t, U(t) is a uniformly elliptic symmetric matrix of size M × M, and the
function t → U(t) is continuous. With this, there is a constant α > 0 such that
u*U(t)u ≥ α‖u‖² for all u ∈ R^M and t ∈ [0, T ].
• For each t, W (t) is a positive semidefinite symmetric matrix of size N × N, and
the function t → W (t) is continuous.
• The function h is convex and differentiable. Its gradient ∇h is represented as a
column vector.
By applying Theorem 4.5 in L2 (0, T ; RM ), the optimal control problem has a
solution. Moreover, it is easy to see that J is strictly convex, and so, there is a unique
optimal pair (ū, ȳu ). We have the following:
Theorem 4.7. The pair (ū, ȳu ) is optimal if, and only if,

ū(t) = U(t)−1 B(t)∗ p(t) (4.4)

for almost every t ∈ [0, T ], where the adjoint state p is the unique solution of the
adjoint equation
ṗ(t) = −A(t)∗ p(t) +W (t)ȳu (t)
with terminal condition p(T ) = −∇h(ȳu (T )).

Proof. Since J is convex and differentiable, by Fermat’s Rule (Theorem 1.32), we


know that (ū, ȳu ) is optimal if, and only if,

  0 = ⟨∇J(ū), v⟩
    = ∫_0^T ū(t)*U(t)v(t) dt + ∫_0^T ȳ_u(t)*W(t)y_v(t) dt + ∇h(ȳ_u(T))* y_v(T)

for all v ∈ L2 (0, T ; RM ). Let us focus on the middle term first. By using the adjoint
equation, we see that

ȳu (t)∗W (t) = ṗ(t)∗ + p(t)∗ A(t).



Therefore,

  ∫_0^T ȳ_u(t)*W(t)y_v(t) dt = ∫_0^T ṗ(t)*y_v(t) dt + ∫_0^T p(t)*A(t)y_v(t) dt
                             = p(T)*y_v(T) − ∫_0^T p(t)*[ ẏ_v(t) − A(t)y_v(t) ] dt
                             = −∇h(ȳ_u(T))*y_v(T) − ∫_0^T p(t)*B(t)v(t) dt.

We conclude that

  0 = ⟨∇J(ū), v⟩ = ∫_0^T [ ū(t)*U(t) − p(t)*B(t) ] v(t) dt

for all v ∈ L2 (0, T ; RM ), and the result follows.




Let us analyze the rocket-car problem:


Example 4.8. A car of unit mass is equipped with propulsors on each side, to move
it along a straight line. It starts at the point x0 with initial velocity v0 . The control u
is the force exerted by the propulsors, so that the position x and velocity v satisfy:

  [ ẋ(t) ]   [ 0  1 ] [ x(t) ]   [ 0 ]              [ x(0) ]   [ x0 ]
  [ v̇(t) ] = [ 0  0 ] [ v(t) ] + [ 1 ] u(t),        [ v(0) ] = [ v0 ].

Consider the optimal control problem of minimizing the functional

  J[u] = x_u(T) + (1/2) ∫_0^T u(t)² dt,

which we can interpret as moving the car far away to the left, without spending too
much energy. The adjoint equation is

  [ ṗ1(t) ]     [ 0  0 ] [ p1(t) ]          [ p1(T) ]   [ −1 ]
  [ ṗ2(t) ] = − [ 1  0 ] [ p2(t) ] ,        [ p2(T) ] = [  0 ],

from which we deduce that p1 (t) ≡ −1 and p2 (t) = t − T . Finally, from (4.4), we
deduce that ū(t) = p2 (t) = t − T . With this information and the initial condition, we
can compute x̄(T ) and the value J[ū].
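To make the last sentence concrete, here is a small numerical check (our own computation, with hypothetical data x0 = 2, v0 = 1, T = 3): integrate the dynamics under ū(t) = t − T and compare with the values obtained by integrating by hand, namely x̄(T) = x0 + v0 T − T³/3 and J[ū] = x0 + v0 T − T³/6.

```python
import numpy as np

# Hypothetical data; the closed forms in the comments come from integrating
# u(t) = t - T by hand and are not quoted from the text.
x0, v0, T = 2.0, 1.0, 3.0
n = 20000
t = np.linspace(0.0, T, n + 1)
u = t - T
dt = np.diff(t)

v = v0 + np.concatenate(([0.0], np.cumsum(0.5 * (u[1:] + u[:-1]) * dt)))   # v' = u
x = x0 + np.concatenate(([0.0], np.cumsum(0.5 * (v[1:] + v[:-1]) * dt)))   # x' = v
J = x[-1] + 0.5 * np.sum(0.5 * (u[1:] ** 2 + u[:-1] ** 2) * dt)

print(x[-1], x0 + v0 * T - T ** 3 / 3)   # x(T) = x0 + v0*T - T^3/3  (= -4 here)
print(J,     x0 + v0 * T - T ** 3 / 6)   # J[u] = x0 + v0*T - T^3/6  (= 0.5 here)
```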


Introducing Controllability

Assume now that h = δT , which means that we require that the terminal state ȳu (T )
belong to a convex target set T (see Remark 4.6). Although h is no longer differen-
tiable, one can follow pretty much the same argument as in the proof of Theorem 4.7
(but using the nonsmooth version of Fermat’s Rule, namely 3.24) to deduce that the
adjoint state p must satisfy the terminal condition −p(T ) ∈ NT (ȳu (T )).

Example 4.9. If the distance between the terminal state and some reference point yT
must be less than or equal to ρ > 0, then T = B̄(yT , ρ ). There are two possibilities
for the terminal states:
1. either ‖ȳ_u(T) − y_T‖ < ρ and p(T) = 0;
2. or ‖ȳ_u(T) − y_T‖ = ρ and p(T) = κ(y_T − ȳ_u(T)) for some κ ≥ 0.


4.2.4 Calculus of Variations

The classical problem of Calculus of Variations is

(CV ) min{ J[x] : x ∈ AC(0, T ; RN ), x(0) = x0 , x(T ) = xT },

where the function J is of the form

  J[x] = ∫_0^T ℓ(t, ẋ(t), x(t)) dt

for some function ℓ : R × R^N × R^N → R. If ℓ is of the form described in (4.2), namely

  ℓ(t, v, x) = f(v) + g(x),

this problem fits in the framework of (OC) by setting M = N, A ≡ 0, B = I, c ≡ 0


and h = δ{xT } . From Theorem 4.5, it ensues that (CV ) has a solution whenever it is
feasible.

Optimality Condition: Euler–Lagrange Equation

If ℓ is of class C¹, the function J is differentiable (see Example 1.28) and

  DJ(x)h = ∫_0^T [ ∇_2 ℓ(t, x(t), ẋ(t)) · h(t) + ∇_3 ℓ(t, x(t), ẋ(t)) · ḣ(t) ] dt,

where ∇i denotes the gradient with respect to the i-th set of variables.

In order to obtain necessary optimality conditions, we shall use the following


auxiliary result:
 
Lemma 4.10. Let α, β ∈ C⁰([0, T ]; R) be such that

  ∫_0^T [ α(t)h(t) + β(t)ḣ(t) ] dt = 0

for each h ∈ C¹([0, T ]; R) satisfying h(0) = h(T ) = 0. Then β is continuously dif-
ferentiable and β̇ = α.

Proof. Let us first analyze the case α ≡ 0. We must prove that β is constant. To this
end, define the function H : [0, T ] → R as

  H(t) = ∫_0^t [β(s) − B] ds,   where   B = (1/T) ∫_0^T β(t) dt.

Since H is continuously differentiable and H(0) = H(T ) = 0,

  0 = ∫_0^T β(t)Ḣ(t) dt
    = ∫_0^T β(t)[β(t) − B] dt
    = ∫_0^T [β(t) − B]² dt + B ∫_0^T [β(t) − B] dt
    = ∫_0^T [β(t) − B]² dt.

This implies β ≡ B because β is continuous. For the general case, define

  A(t) = ∫_0^t α(s) ds.

Using integration by parts, we see that

  0 = ∫_0^T [ α(t)h(t) + β(t)ḣ(t) ] dt = ∫_0^T [ β(t) − A(t) ] ḣ(t) dt

for each h ∈ C¹([0, T ]; R) such that h(0) = h(T ) = 0. As we have seen, this implies

β − A must be constant. In other words, β is a primitive of α . It follows that β is


continuously differentiable and β̇ = α .


We are now in position to present the necessary optimality condition for (CV):
Theorem 4.11 (Euler–Lagrange Equation). Let x∗ be a smooth solution of (CV).
Then, the function t ↦ ∇_3 L(t, x*(t), ẋ*(t)) is continuously differentiable and

  d/dt [ ∇_3 L(t, x*(t), ẋ*(t)) ] = ∇_2 L(t, x*(t), ẋ*(t))

for every t ∈ (0, T ).

Proof. Set

  C_0 = { x ∈ C¹([0, T ]; R^N) : x(0) = x(T ) = 0 },

and define g : C_0 → R as g(h) = J[x* + h]. Clearly, 0 minimizes g on C_0 because x* is
a smooth solution of (CV). In other words, 0 ∈ argmin_{C_0}(g). Moreover, Dg(0) = DJ(x*)
and so, Fermat's Rule (Theorem 1.32)

gives DJ(x*)h = Dg(0)h = 0 for every h ∈ C_0. This is precisely

  ∫_0^T [ ∇_2 L(t, x*(t), ẋ*(t)) · h(t) + ∇_3 L(t, x*(t), ẋ*(t)) · ḣ(t) ] dt = 0

for each h ∈ C_0. To conclude, for k = 1, . . . , N set h_j ≡ 0 for j ≠ k, write α_k(t) =
[∇_2 L(t, x*(t), ẋ*(t))]_k and use Lemma 4.10 componentwise.


4.3 Some Elliptic Partial Differential Equations

The theory of partial differential equations makes systematic use of several tools
that lie in the interface between functional and convex analysis. Here, we comment
a few examples.

4.3.1 The Theorems of Stampacchia and Lax-Milgram

Theorem 4.12 (Stampacchia’s Theorem). Let X be reflexive and let C ⊂ X be


nonempty, closed, and convex. Given a continuous and symmetric bilinear form
B : X × X → R and some x* ∈ X*, define f : X → R by

  f(x) = (1/2) B(x, x) − ⟨x*, x⟩,

and set S = argmin_C( f). We have the following:

i) If B is positive semidefinite and C is bounded, then S ≠ ∅;
ii) If either B is positive definite and C is bounded, or B is uniformly elliptic, then S
is a singleton S = {x̄}.
In any case, x̄ ∈ S if, and only if, x̄ ∈ C and B(x̄, c − x̄) ≥ ⟨x*, c − x̄⟩ for all c ∈ C.

Proof. The function g = f + δC is proper, lower-semicontinuous, convex, and coer-


cive in both cases, and strictly convex in case ii), as shown in Example 2.11. The
existence and uniqueness of minimizers follow from Theorem 2.19. On the other
hand, by Fermat’s Rule (Theorem 3.24) and the Moreau–Rockafellar Theorem 3.30
(see also Proposition 3.61 for a shortcut), x̄ ∈ S if, and only if,

0 ∈ ∂ g(x̄) = ∂ ( f + δC )(x̄) = ∂ f (x̄) + ∂ δC (x̄).

In other words, −∇ f(x̄) ∈ N_C(x̄), which is equivalent to B(x̄, c − x̄) ≥ ⟨x*, c − x̄⟩ for
all c ∈ C (see Example 1.27).


If C = X we obtain the following:



Corollary 4.13 (Lax-Milgram Theorem). Let X be reflexive, let B : X × X → R be


a uniformly elliptic, continuous and symmetric bilinear function, and let x∗ ∈ X ∗ .
Then, the function f : X → R, defined by

  f(x) = (1/2) B(x, x) − ⟨x*, x⟩,

has a unique minimizer x̄. Moreover, x̄ is the unique element of X that satisfies
B(x̄, h) = ⟨x*, h⟩ for all h ∈ X.

4.3.2 Sobolev Spaces

We shall provide a brief summary of some Sobolev spaces, which are particu-
larly useful for solving second-order elliptic partial differential equations. For more
details, the reader may consult [1, 30] or [54].

Let Ω be a nonempty open subset of RN and let p ≥ 1. Consider the space

W 1,p = {u ∈ L p : ∇u ∈ (L p )N }

with the norm

  ‖u‖_{W^{1,p}} = ( ∫_Ω [ |∇u(ξ)|^p + |u(ξ)|^p ] dξ )^{1/p}.

The gradient is taken in the sense of distributions. We omit the reference to the
domain Ω to simplify the notation, since no confusion should arise.

The space W01,p is the closure in W 1,p of the subspace of infinitely differentiable
functions in Ω whose support is a compact subset of Ω . Intuitively, it consists of
the functions that vanish on the boundary ∂ Ω of Ω . The following property will be
useful later:
Proposition 4.14 (Poincaré’s Inequalities). Let Ω be bounded.
i) There exists C > 0 such that

     ‖u‖_{L^p} ≤ C ‖∇u‖_{(L^p)^N}

   for all u ∈ W_0^{1,p}. In particular, the function u ↦ ‖∇u‖_{(L^p)^N} is a norm in W_0^{1,p},
   which is equivalent to the original norm.
ii) If Ω is bounded, connected and ∂Ω is smooth, there exists C > 0 such that

     ‖u − ũ‖_{L^p} ≤ C ‖∇u‖_{(L^p)^N}

   for all u ∈ W^{1,p}. Here, we have written ũ = (1/|Ω|) ∫_Ω u(ξ) dξ, where |Ω| is the
   measure of Ω.

If p > 1, then W 1,p and W01,p are reflexive. We shall restrict ourselves to this case.
We have
W01,p ⊂ W 1,p ⊂ L p
with continuous embeddings. The topological duals satisfy

  L^q = (L^p)* ⊂ (W^{1,p})* ⊂ (W_0^{1,p})*,

where q > 1 is defined by 1/p + 1/q = 1. Now, for each v* ∈ (W_0^{1,p})*, there exist
v_0, v_1, . . . , v_N ∈ L^q such that

  ⟨v*, u⟩_{(W_0^{1,p})*, W_0^{1,p}} = ∫_Ω v_0(ξ)u(ξ) dξ + ∑_{i=1}^{N} ∫_Ω v_i(ξ) (∂u/∂x_i)(ξ) dξ

for all u ∈ W_0^{1,p}. In turn, given v_0, v_1, . . . , v_N ∈ L^q, we can define an element v* of
(W^{1,p})* by

  ⟨v*, u⟩_{(W^{1,p})*, W^{1,p}} = ∫_Ω v_0(ξ)u(ξ) dξ + ∑_{i=1}^{N} ∫_Ω v_i(ξ) (∂u/∂x_i)(ξ) dξ

for all u ∈ W 1,p .

The case p = 2, namely H 1 := W 1,2 , is of particular importance. It is a Hilbert


space, whose norm is associated to the inner product

  ⟨u, v⟩_{H^1} = ⟨u, v⟩_{L^2} + ⟨∇u, ∇v⟩_{(L^2)^N}.

We also write H01 := W01,2 . It is sometimes convenient not to identify H 1 and H01
with their duals. Instead, one can see them as subspaces of L2 , identify the latter
with itself and write

H01 ⊂ H 1 ⊂ L2 = (L2 )∗ ⊂ (H 1 )∗ ⊂ (H01 )∗ .

4.3.3 Poisson-Type Equations in H 1 and W 1,p

Dirichlet Boundary Conditions

Let Ω ⊂ RN be bounded. Given μ ∈ R, consider the bilinear function B : H01 ×H01 →


R defined by
  B(u, v) = μ⟨u, v⟩_{L^2} + ⟨∇u, ∇v⟩_{(L^2)^N}.
Clearly, B is continuous and symmetric. It is not difficult to see that it is uniformly
elliptic if μ > −1/C2 , where C is given by Poincaré’s Inequality (part i) of Proposi-
tion 4.14).

Now take v∗ ∈ (H01 )∗ . By the Lax-Milgram Theorem (Corollary 4.13), there is a


unique ū ∈ H01 such that

  μ⟨ū, v⟩_{L^2} + ⟨∇ū, ∇v⟩_{(L^2)^N} = ⟨v*, v⟩_{(H_0^1)*, H_0^1}

for all v ∈ H01 . In other words, there is a unique weak solution ū for the homogeneous
Dirichlet boundary condition problem

−Δ u + μ u = v∗ in Ω
u = 0 in ∂ Ω .

Moreover, the function Φ : H01 → (H01 )∗ defined by

  ⟨Φ(u), v⟩_{(H_0^1)*, H_0^1} = B(u, v)

is a continuous isomorphism with continuous inverse.

Suppose now that g ∈ H 1 is uniformly continuous on Ω (and extend it continu-


ously to Ω ). If we replace v∗ by v∗ − Φ (g) and apply the argument above, we obtain
the unique weak solution ū ∈ H 1 for the nonhomogeneous Dirichlet boundary con-
dition problem: 
−Δ u + μ u = v∗ in Ω
u = g in ∂ Ω .

Neumann Boundary Condition

Let Ω be connected and let ∂ Ω be sufficiently smooth. Consider the problem with
homogeneous Neumann boundary condition:

  −Δu + μu = v*   in Ω,
  ∂u/∂ν = 0       in ∂Ω,

where ν denotes the outward normal vector to ∂ Ω .


If μ > 0, the bilinear function B : H 1 × H 1 → R, defined by

  B(u, v) = μ⟨u, v⟩_{L^2} + ⟨∇u, ∇v⟩_{(L^2)^N},

is uniformly elliptic, and the arguments above allow us to prove that the problem
has a unique weak solution, which is a ū ∈ H 1 such that

  μ⟨ū, v⟩_{L^2} + ⟨∇ū, ∇v⟩_{(L^2)^N} = ⟨v*, v⟩_{(H^1)*, H^1}

for all v ∈ H 1 .

If μ = 0, then B is not uniformly elliptic (it is not even positive definite, since
B(u, u) = 0 when u is constant). However, by Poincaré’s Inequality (part ii) of Propo-
sition 4.14), it is possible to prove that B is uniformly elliptic when restricted to

  V = { u ∈ H^1 : ⟨u, 1⟩_{L^2} = 0 },

which is a closed subspace of H 1 . Using this fact, the problem



  −Δu = v*     in Ω,
  ∂u/∂ν = 0    in ∂Ω

has a solution (exactly one in V ) whenever v* ∈ (H^1)* and ⟨v*, 1⟩_{(H^1)*, H^1} = 0.

The Obstacle Problem

Let φ ∈ H01 . The set

C = { u ∈ H01 : u(ξ ) ≥ φ (ξ ) for almost every ξ ∈ Ω }

is a nonempty, closed and convex subset of H01 . With the notation introduced for
the Dirichlet problem, and using Stampacchia’s Theorem 4.12, we deduce that there
exists a unique ū ∈ C such that

  μ⟨ū, c − ū⟩_{L^2} + ⟨∇ū, ∇(c − ū)⟩_{(L^2)^N} ≥ ⟨v*, c − ū⟩_{(H_0^1)*, H_0^1}

for all c ∈ C. In other words, ū is a weak solution for the Dirichlet boundary value
problem with an obstacle:

  ⎧ −Δu + μu ≥ v*   in Ω
  ⎨ u ≥ φ            in Ω
  ⎩ u = 0            in ∂Ω.

To fix ideas, take μ = 0 and v∗ = 0, and assume u and φ are of class C 2 . For each
nonnegative function p ∈ H01 of class C 2 , the function c = ū + p belongs to C, and
so ⟨∇ū, ∇p⟩_{(L^2)^N} ≥ 0. Therefore, Δū ≤ 0 on Ω, and ū is superharmonic. Moreover,
if ū > φ on some open subset of Ω, then Δū ≥ 0 there. It follows that Δū = 0, and
ū is harmonic whenever ū > φ. Summarizing, ū is harmonic in the free set (where
ū > φ ), and both ū and φ are superharmonic in the contact set (where ū = φ ). Under
these conditions, it follows that ū is the least of all superharmonic functions that lie
above φ . For further discussion, see [69, Chap. 6].
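A discrete illustration (our own sketch, with an arbitrarily chosen obstacle) of the case μ = 0, v* = 0 on Ω = (0, 1): minimize the discrete Dirichlet energy over the node values subject to u ≥ φ, with a projected gradient step. The projection onto the constraint set is a componentwise maximum.

```python
import numpy as np

# Finite-difference obstacle problem on (0,1): minimize the Dirichlet energy
# 0.5 * sum ((u_{i+1}-u_i)/h)^2 * h over u >= phi, with u(0) = u(1) = 0.
N = 100
h = 1.0 / N
xs = np.linspace(0.0, 1.0, N + 1)
phi = 0.3 - 8.0 * (xs - 0.5) ** 2          # illustrative obstacle, negative near the boundary

u = np.maximum(phi, 0.0)                   # feasible starting point
u[0] = u[-1] = 0.0
step = h / 4.0                             # safe step: 1 / (bound on the stiffness matrix norm)

for _ in range(50000):
    grad = np.zeros_like(u)
    grad[1:-1] = (2 * u[1:-1] - u[:-2] - u[2:]) / h          # gradient of the discrete energy
    u[1:-1] = np.maximum(u[1:-1] - step * grad[1:-1], phi[1:-1])  # project onto {u >= phi}

contact = np.where(np.abs(u - phi) < 1e-8)[0]
print("number of contact nodes:", contact.size)
print("solution maximum:", u.max(), " obstacle maximum:", phi.max())
# In the free set the discrete solution is affine (discretely harmonic), as in the text.
```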

Poisson’s Equation for the p-Laplacian

Let Ω ⊂ R^N be bounded and take p > 1, μ ∈ R and v* ∈ (W_0^{1,p})*. Define f : W_0^{1,p} →
R by

  f(u) = (μ/p) ‖u‖^p_{L^p} + (1/p) ‖∇u‖^p_{(L^p)^N} − ⟨v*, u⟩_{(W_0^{1,p})*, W_0^{1,p}}.

Clearly, f is proper and lower-semicontinuous. Moreover, it is strongly convex if μ >


−1/C2 , where C is given by Poincaré’s Inequality (part i) of Proposition 4.14). By
Theorem 2.19 (or, if you prefer, by Corollary 2.20), f has a unique minimizer ū in
W01,p . Since f is differentiable, Fermat’s Rule (Theorem 3.24) implies

∇ f (ū) = 0.

With the convention that 0^{p−2} · 0 = 0 for p > 1, this is

  0 = μ⟨|ū|^{p−2}ū, v⟩_{L^q, L^p} + ⟨|∇ū|^{p−2}∇ū, ∇v⟩_{(L^q)^N, (L^p)^N} − ⟨v*, v⟩_{(W_0^{1,p})*, W_0^{1,p}}

for all v ∈ W01,p . If μ = 0, we recover a classical existence and uniqueness result for
the p-Laplace Equation (see, for instance, [74, Example 2.3.1]): The function ū is
the unique weak solution for the problem

−Δ p u = v∗ in Ω
u = 0 in ∂ Ω ,

where the p-Laplacian operator Δ_p : W_0^{1,p} → (W_0^{1,p})* is given by

  Δ_p(u) = div(|∇u|^{p−2}∇u).

4.4 Sparse Solutions for Underdetermined Systems of Equations

Let A be a matrix of size J × N and let b ∈ RJ . When the system Ax = b is underde-


termined, an important problem in signal compression and statistics (see [47, 99])
is to find its sparsest solutions. That is,

  (P_0)    min{ ‖x‖_0 : Ax = b },

where ‖·‖_0 denotes the so-called counting norm (number of nonzero entries; not
a norm). This comes from the necessity of representing rather complicated func-
tions (such as images or other signals) in a simple and concise manner. The convex
relaxation of this nonconvex problem is

  (P_1)    min{ ‖x‖_1 : Ax = b }.



Under some conditions on the matrix A (see [48]) solutions of (P0 ) can be found
by solving (P1 ). This can be done, for instance, by means of the iterative shrink-
age/thresholding algorithm (ISTA), described in Example 6.26.
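Example 6.26 is not reproduced here, but the following minimal sketch (synthetic data, our own choice of parameters) conveys the idea: a forward–backward iteration of this type applied to the penalized surrogate min_x ½‖Ax − b‖² + γ‖x‖₁, whose solutions approximate those of (P1) when γ is small. The nonsmooth step is the componentwise soft-thresholding map.

```python
import numpy as np

rng = np.random.default_rng(0)
J, N = 20, 80
A = rng.standard_normal((J, N)) / np.sqrt(J)
x_true = np.zeros(N)
x_true[[3, 17, 42]] = [1.0, -2.0, 0.5]     # sparse synthetic signal
b = A @ x_true

gamma = 1e-3
L = np.linalg.norm(A, 2) ** 2              # Lipschitz constant of the smooth part's gradient
x = np.zeros(N)
for _ in range(5000):
    g = A.T @ (A @ x - b)                  # gradient step on 0.5*||Ax - b||^2
    z = x - g / L
    x = np.sign(z) * np.maximum(np.abs(z) - gamma / L, 0.0)   # soft-thresholding

print(np.nonzero(np.abs(x) > 1e-3)[0])    # indices of the recovered support
print(np.linalg.norm(A @ x - b))          # small residual
```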
Related 1 -regularization approaches can be found in [34, 39, 58]. The least abso-
lute shrinkage and selection operator (LASSO) method [99] is closely related but
slightly different: it considers a constraint of the form ‖x‖_1 ≤ T and minimizes
‖Ax − b‖². Also, it is possible to consider the problem with inequality constraints

  min{ ‖x‖_1 : Ax ≤ b },

as well as the stable signal recovery [35], where the constraint Ax = b is replaced
by ‖Ax − b‖ ≤ ε.
Chapter 5
Problem-Solving Strategies

Abstract Only in rare cases, can problems in function spaces be solved analytically
and exactly. In most occasions, it is necessary to apply computational methods to
approximate the solutions. In this chapter, we discuss some of the basic general
strategies that can be applied. First, we present several connections between opti-
mization and discretization, along with their role in the problem-solving process.
Next, we introduce the idea of iterative procedures, and discuss some abstract tools
for proving their convergence. Finally, we comment some ideas that are useful to
simplify or reduce the problems, in order to make them tractable or more efficiently
solved.

5.1 Combining Optimization and Discretization

The main idea is to combine optimization techniques (based either on optimality


conditions or iterative procedures) and numerical analysis tools, which will allow us
to pass from the functional setting to a computationally tractable model and back. In
rough terms, we can identify two types of strategy, depending on the order in which
these components intervene, namely:

Discretize, Then Optimize

The strategy here consists in replacing the original problem in the infinite-
dimensional space X by a problem in a finite-dimensional subspace X_n of X.¹ To
this end, the first step is to modify, reinterpret or approximate the objective function
and the constraints by a finite-dimensional model (we shall come back to this point
later). Next, we use either the optimality conditions, as those discussed in Chap. 3,

1 Or even a large-dimensional X by a subspace Xn of smaller dimension.


or an iterative method, like the ones we will present in Chap. 6, for the latter. If
the model is sufficiently accurate, we can expect the approximate solution xn of the
problem in Xn to be close to a real solution for the original problem in X, or, at
least, that the value of the objective function in xn be close to its minimum in X. The
theoretical justification for this procedure will be discussed below.

Optimize, Then Discretize

An alternative is to use optimization techniques in order to design a method for


solving the original problem directly, and then use numerical analysis techniques to
implement it. Here, again, there are at least two options:
1. One option is to determine optimality conditions (see Chap. 3). These are usually
translated into a functional equation (an ordinary or partial differential equation,
for instance) or inequality, which is then solved numerically.
2. Another possibility is to devise an iterative procedure (this will be discussed in
Chap. 6) and implement it computationally. To fix ideas, in the unconstrained
case, at step n, starting from a point xn , we identify a direction in which the
objective function decreases, and advance in that direction to find the next iterate
xn+1 such that the value of the objective function at xn+1 is less than the value at
the previous point xn .

5.1.1 Recovering Solutions for the Original Problem: Ritz’s


Method

In this section, we present a classical and very useful result that relates the infinite-
dimensional problem to its finite-dimensional approximations. It is an important
tool in general numerical analysis and provides a theoretical justification for the dis-
cretize, then optimize strategy.

Let J : X → R be bounded from below. Consider a sequence (Xn ) of subspaces


of X such that Xn−1 ⊂ Xn for each n ≥ 1, and whose union is dense in X. At the n-th
iteration, given a point xn−1 ∈ Xn−1 and a tolerance εn ≥ 0, find xn such that

xn ∈ εn − argmin{ J(x) : x ∈ Xn } and J(xn ) ≤ J(xn−1 ).

This is always possible since J is bounded from below and Xn−1 ⊂ Xn . Clearly, if
εn = 0, the condition J(xn ) ≤ J(xn−1 ) holds automatically.

This procedure allows us to construct a minimizing sequence for J, along with


minimizers for J. More precisely, we have the following:

Theorem 5.1 (Ritz’s Theorem). If J is upper-semicontinuous and lim εn = 0, then


n→∞
lim J(xn ) = inf J(x). If, moreover, J is continuous, then every limit point of (xn )
n→∞ x∈X
minimizes J.

Proof. Fix x ∈ X and δ > 0. There exists y0 ∈ X such that

J(y0 ) ≤ J(x) + δ /3.

Since J is upper-semicontinuous at y0 , there is ρ > 0 such that

J(y) < J(y0 ) + δ /3

whenever ‖y − y_0‖ < ρ. Using the density hypothesis and the fact that lim_{n→∞} ε_n = 0,
we can find N ∈ N and z_N ∈ X_N such that ‖z_N − y_0‖ < ρ and ε_N ≤ δ/3. We
deduce that

J(xn ) ≤ J(xN ) ≤ J(zN ) + εN < J(y0 ) + 2δ /3 ≤ J(x) + δ

for every n ≥ N. Since this holds for each x ∈ X, we obtain the minimizing property.
The last part follows from Proposition 2.8.


Typically, the finite-dimensional space Xn is spanned by a finite (not necessarily


n) number of elements of X, that are chosen considering the nature of the problem
and its solutions.
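A toy illustration of the procedure (our own example, not from the text): take X = ℓ², J(x) = ½‖x‖² − ⟨b, x⟩ with b_k = 1/k², and X_n = span{e_1, …, e_n}, whose union is dense in ℓ². The exact minimizer over X_n (so ε_n = 0) is obtained coordinatewise, and J(x_n) decreases to inf_X J = −π⁴/180, as Ritz's Theorem predicts.

```python
import numpy as np

# J(x) = 0.5*||x||^2 - <b, x> on l2, with b_k = 1/k^2; X_n = span{e_1,...,e_n}.
def J(x):
    k = np.arange(1, len(x) + 1)
    b = 1.0 / k ** 2
    return 0.5 * np.sum(x ** 2) - np.sum(b * x)

for n in [1, 2, 5, 10, 100, 1000]:
    k = np.arange(1, n + 1)
    xn = 1.0 / k ** 2                       # exact minimizer over X_n (epsilon_n = 0)
    print(n, J(xn))                         # J(x_n) = -0.5 * sum_{k<=n} 1/k^4, decreasing

print(-np.pi ** 4 / 180.0)                  # limit value guaranteed by Ritz's Theorem
```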

5.1.2 Building the Finite-Dimensional Approximations

Most optimization problems in infinite-dimensional spaces (see Chap. 4 for some


examples) share the following properties:
1. The variables are functions whose domain is some subset of RN .
2. The objective function involves an integral functional depending on the variables
and possibly their derivatives.
3. The constraints are given by functional equations or inequalities.
The most commonly used discretization schemes for these kinds of problems,
use a finite number of elements selected from a Schauder basis for the space and
consider approximations in the subset spanned by these elements. Here are a few
examples:

Finite Differences: Draw a grid in the domain, partitioning it into small N-


dimensional blocks.

[Figure: a grid cell with nodes u_{i,j}, u_{i+1,j}, and u_{i,j+1} used to represent u(x, y).]

Represent the functions on each block by their values on a prescribed node, pro-
ducing piecewise constant approximations. Integrals are naturally approximated by
a Riemann sum. Partial derivatives are replaced by the corresponding difference
quotients. These can be, for instance, one-sided, ∂u/∂x(x, y) ∼ (u_{i+1,j} − u_{i,j})/h; or
centered, ∂u/∂x(x, y) ∼ (u_{i+1,j} − u_{i−1,j})/(2h). The dimension of the approximating
space is the number of nodes and the new variables are vectors whose components
are the values of the original variable (a function) on the nodes.
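A one-dimensional sketch of the discretize-then-optimize idea with finite differences (the model problem is our own choice): approximate min { ∫₀¹ ( ½ u′(t)² − u(t) ) dt : u(0) = u(1) = 0 }. Replacing u′ by difference quotients and the integral by a Riemann sum gives a quadratic function of the node values; setting its gradient to zero yields the familiar tridiagonal system.

```python
import numpy as np

n = 100
h = 1.0 / n
# Optimality condition of the discrete quadratic objective: K u = h * 1,
# with K the tridiagonal stiffness matrix (-1, 2, -1)/h.
K = (np.diag(2.0 * np.ones(n - 1))
     - np.diag(np.ones(n - 2), 1)
     - np.diag(np.ones(n - 2), -1)) / h
u = np.linalg.solve(K, h * np.ones(n - 1))

xs = np.linspace(0.0, 1.0, n + 1)[1:-1]
print(np.max(np.abs(u - xs * (1.0 - xs) / 2.0)))   # close to the exact solution x(1-x)/2
```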

Finite Elements: As before, the domain is partitioned into small polyhedral units,
whose geometry may depend on the dimension and the type of approximation. On
each unit, the function is most often represented by a polynomial. Integrals and
derivatives are computed accordingly. The dimension of the approximating space
depends on the number of subdomains in the partition and the type of approximation
(for instance, the degree of the polynomials). For example:
1. If N = 1, the Trapezoidal Rule for computing integrals is an approximation by
polynomials of order 1, while Simpson’s Rule uses polynomials of order 2.
2. When N = 2, it is common to partition the domain into triangles. If function val-
ues are assigned to the vertices, it is possible to construct an affine interpolation.
This generates the graph of a continuous function whose derivatives (which are
constant on each triangle) and integral are easily computed.

The method of finite elements has several advantages: First, reasonably smooth
open domains can be well approximated by triangulation. Second, partitions can
be very easily refined, and one can choose not to do this refinement uniformly
throughout the domain, but only in specific areas where more accuracy is required.
Third, since the elements have very small support, the method is able to capture
local behavior in a very effective manner. In particular, if the support of the solution

is localized, the finite element scheme will tend to identify this fact very quickly.
Overall, this method offers a good balance between simplicity, versatility, and pre-
cision. One possible drawback is that it usually requires very fine partitions in order
to represent smooth functions. The interested reader is referred to [28, 41], or [90].

Fourier Series: Trigonometric polynomials provide infinitely differentiable approx-


imations to piecewise smooth functions. One remarkable property is that their inte-
grals and derivatives can be explicitly computed and admit very accurate approxi-
mations. Another advantage is that functions in the basis can all be generated from
any one of them, simply by a combination of translations and contractions, which
is a very positive feature when it comes to implementation. To fix the ideas and see
the difference with the previous methods, suppose the domain is a bounded interval,
interpreted as a time horizon. The Fourier coefficients are computed by means of
projections onto sinusoidal functions with different frequencies. The dimension of
the approximating space is the number of Fourier coefficients computed. In some
sense, the original function is replaced by a representation that takes into account its
main oscillation frequencies. There are two main drawbacks in taking the standard
Fourier basis. First comes its lack of localization. A very sparse function (a func-
tion with a very small support) may have a large number of nonnegligible Fourier
coefficients. The second weak point is that the smooth character of the trigonomet-
ric functions offers a poor approximation around discontinuities and produces some
local instabilities such as the Gibbs Phenomenon. More details on Fourier analysis
and series can be found in [56, 100], or still [18].

Wavelets: This term gathers a broad family of methods involving representation of


functions with respect to certain bases. The common and distinctive features are the
following:
• Localization: Each element in the basis may be chosen either with arbitrarily
small support, or having very small values outside an arbitrarily small domain.
This makes them ideal for identifying peaks and other types of local behavior.
Just like the finite elements, they reveal sparsity very quickly.
• Simple representation: All the functions in the basis are constructed from one
particular element, called the mother wavelet, by translations and contractions.
• Customized smoothness: One can choose pretty much any degree of smoothness
ranging from piecewise constant (such as the Haar basis) to infinitely differen-
tiable (like the so-called Mexican hat) functions, depending on the qualitative
nature of the functions to approximate.
• Capture oscillating behavior: The shape of the elements of the basis makes it
possible to identify the main oscillating frequencies of the functions.
Wavelet bases share several good properties of finite elements and Fourier series.
For further reference, the reader may consult [46] or [18].

[Figure: the Haar mother wavelet and the Mexican hat mother wavelet.]

5.2 Iterative Procedures

An iterative algorithm on X is a procedure by which, starting from a point z0 ∈ X,


and using a family (Fn ) of functions from X to X, we produce a sequence (zn ) of
points in X by applying the rule

zn+1 = Fn (zn )

for each n ≥ 0.

In the search for minimizers of a function f , a useful idea is to follow a certain


direction in order to find a point where f has a smaller value. In the process, we
construct a sequence that is meant to minimize f . We shall come back to this point
in Chap. 6.

A Useful Criterion for Weak Convergence in Hilbert Space

It is possible to establish the strong or weak convergence of a sequence by using the


definition, provided one knows the limit beforehand. Otherwise, one must rely on
different tools. For example, the Cauchy property provides a convergence criterion
in Banach spaces. The following result is useful to prove the weak convergence of
a sequence in a Hilbert space without a priori knowledge of the limit.
Lemma 5.2 (Opial’s Lemma). Let S be a nonempty subset of a Hilbert space H and
let (zn ) be a sequence in H. Assume
i) For each u ∈ S there exists lim_{n→∞} ‖z_n − u‖; and
ii) Every weak limit point of (zn ) belongs to S.
Then (zn ) converges weakly as n → ∞ to some ū ∈ S.

Proof. Since (z_n) is bounded, it suffices to prove that it has at most one weak limit
point, by Corollary 1.25. Let z_{k_n} ⇀ ẑ and z_{m_n} ⇀ ž. Since ẑ, ž ∈ S, the limits
lim_{n→∞} ‖z_n − ẑ‖ and lim_{n→∞} ‖z_n − ž‖ exist. Since

  ‖z_n − ẑ‖² = ‖z_n − ž‖² + ‖ž − ẑ‖² + 2⟨z_n − ž, ž − ẑ⟩,

by passing to appropriate subsequences we deduce that ẑ = ž.




The formulation and proof presented above for Lemma 5.2 first appeared in [19].
Part of the ideas are present in [84], which is usually given as the standard reference.

Qualitative Effect of Small Computational Errors

The minimizing property of sequences of approximate solutions is given by Ritz’s


Theorem 5.1. However, when applying an iterative algorithm, numerical errors typ-
ically appear in the computation of the iterates. It turns out that small computational
errors do not alter the qualitative behavior of sequences generated by nonexpansive
algorithms. Most convex optimization algorithms are nonexpansive. We have the
following:
Lemma 5.3. Let (F_n) be a family of nonexpansive functions on a Banach space
X and assume that every sequence (z_n) satisfying z_n = F_n(z_{n−1}) converges weakly
(resp. strongly). Then so does every sequence (z̃_n) satisfying

  z̃_n = F_n(z̃_{n−1}) + ε_n   (5.1)

provided ∑_{n=1}^{∞} ‖ε_n‖ < +∞.
Proof. Let τ denote either the strong or the weak topology. Given n > k ≥ 0, define

  Π_k^n = F_n ∘ · · · ∘ F_{k+1}.

In particular, τ−lim_{n→∞} Π_k^n(z) exists for each k ∈ N and z ∈ X. Let (z̃_n) satisfy (5.1)
and set

  ζ_k = τ−lim_{n→∞} Π_k^n(z̃_k).

Since

  ‖Π_{k+h}^n(z̃_{k+h}) − Π_k^n(z̃_k)‖ = ‖Π_{k+h}^n(z̃_{k+h}) − Π_{k+h}^n ∘ Π_k^{k+h}(z̃_k)‖
                                      ≤ ‖z̃_{k+h} − Π_k^{k+h}(z̃_k)‖,

we have

  ‖ζ_{k+h} − ζ_k‖ ≤ lim_{n→∞} ‖Π_{k+h}^n(z̃_{k+h}) − Π_k^n(z̃_k)‖ ≤ ‖z̃_{k+h} − Π_k^{k+h}(z̃_k)‖.

But, by definition,

  ‖z̃_{k+h} − Π_k^{k+h}(z̃_k)‖ ≤ ∑_{j=k+1}^{k+h} ‖ε_j‖,

and so, it tends to zero uniformly in h as k → ∞. Therefore (ζ_k) is a Cauchy sequence
that converges strongly to a limit ζ. Write

  z̃_{k+h} − ζ = [z̃_{k+h} − Π_k^{k+h}(z̃_k)] + [Π_k^{k+h}(z̃_k) − ζ_k] + [ζ_k − ζ].   (5.2)
zk ) − ζk ] + [ζk − ζ ]. (5.2)

Given ε > 0 we can take k large enough so that the first and third terms on the right-
hand side of (5.2) are less than ε in norm, uniformly in h. Next, for such k, we let
h → ∞ so that the second term converges to zero for the topology τ .


Further robustness properties for general iterative algorithms can be found in


[5, 6] and [7].
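To make the robustness statement concrete, here is a small numerical sketch (ours, not from the book; the sets and constants are illustrative): the composition of two projections onto intersecting half-planes of R² is a nonexpansive map whose unperturbed orbits converge, and perturbing each iteration by a summable error εn = n⁻²(1,1) still produces a convergent sequence, as Lemma 5.3 predicts.

import numpy as np

def proj_halfspace(z, a, b):
    # Projection of z onto the half-space {x : <a, x> <= b}.
    slack = a @ z - b
    return z if slack <= 0 else z - slack * a / (a @ a)

def F(z):
    # Nonexpansive map: composition of the projections onto
    # C1 = {x + y <= 1} and C2 = {x - y <= 0}; its fixed points are C1 ∩ C2.
    z = proj_halfspace(z, np.array([1.0, 1.0]), 1.0)
    z = proj_halfspace(z, np.array([1.0, -1.0]), 0.0)
    return z

z = np.array([5.0, -3.0])
for n in range(1, 5001):
    eps = (1.0 / n**2) * np.array([1.0, 1.0])   # summable perturbations
    z = F(z) + eps
print(z)   # numerically close to a point of C1 ∩ C2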

5.3 Problem Simplification


There are several ways to simplify optimization problems. Here, we mention two
important techniques: one consists in passing from the constrained setting to an
unconstrained one; the other allows us to deal with different sources of difficulty
separately.

5.3.1 Elimination of Constraints

Constrained optimization problems are, in general, hard to solve. In some occa-


sions, it is possible to reformulate the problem in an equivalent (or nearly equiv-
alent), unconstrained version. We shall present two different strategies, which are
commonly used. Of course, this list is not exhaustive (see, for instance, [95]).

Penalization

Suppose, for simplicity, that f : X → R is convex and continuous, and consider the
constrained problem

(P) min{ f (x) : x ∈ C } = min{ f (x) + δC (x) : x ∈ X },

where C ⊂ X is nonempty, closed and convex. If one replaces δC by a function that


is similar in some sense, it is natural to expect the resulting problem to have similar
solutions as well.

[Figure: an interior penalization function Φ, finite only on C and blowing up near its boundary, and an exterior penalization function Ψ, vanishing on C and positive outside.]

A (not necessarily positive) function looking like Φ in the figure above is usually
referred to as an interior penalization function for C. It satisfies int(C) ⊂ dom(Φ),
cl(dom(Φ)) = C, and lim_{y→x} ‖∇Φ(y)‖ = +∞ for all x on the boundary of C, denoted by
∂C. A Legendre function is an interior penalization function that is differentiable
and strictly convex in int(C) (see [91, Chap. 26]). A barrier is a Legendre function
such that lim_{y→x} Φ(y) = +∞ for all x ∈ ∂C. In turn, a convex, nonnegative, everywhere
defined function Ψ such that Ψ⁻¹(0) = C is an exterior penalization function, since
it acts only outside of C.
Under suitable conditions, solutions for the (rather smoothly constrained) prob-
lem

(Pε ) min{ f (x) + εΦ (x) : x ∈ X },

for ε > 0, exist and lie in int(C). One would reasonably expect them to approximate
solutions of (P), as ε → 0. Indeed, by weak lower-semicontinuity, it is easy to see
that, if xε minimizes f + εΦ on X, then every weak limit point of (xε ), as ε → 0,
minimizes f on C. Similarly, if xβ is a solution for the unconstrained problem

(Pβ ) min{ f (x) + βΨ (x) : x ∈ X },

with β > 0, then every weak limit point of (xβ ), as β → +∞, belongs to C and
minimizes f on C.
Remark 5.4. Loosely speaking, we replace the function f + δC by a family ( fν ) of
functions in such a way that fν is reasonably similar to f + δC if the parameter ν is
close to some value ν0 .
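As a toy numerical sketch of exterior penalization (ours, not from the book; the problem and penalty are illustrative): minimize f(x) = x² over C = {x ≥ 1} by minimizing f + βΨ with Ψ(x) = max(0, 1 − x)². The unconstrained minimizers approach the constrained solution x = 1 as β → +∞.

import numpy as np

grid = np.linspace(-2.0, 3.0, 200001)          # brute-force search, good enough in 1-D
f = grid**2
Psi = np.maximum(0.0, 1.0 - grid)**2

for beta in [1.0, 10.0, 100.0, 1000.0]:
    x_beta = grid[np.argmin(f + beta * Psi)]
    print(f"beta = {beta:7.1f}  ->  x_beta ~ {x_beta:.4f}")
# The printed values approach 1.0, the minimizer of f over C.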

Multiplier Methods

As in Sect. 3.7.2, consider proper, lower-semicontinuous and convex functions


f , g1 , . . . , gm : X → R ∪ {+∞}, and focus on the problem of minimizing a convex
function f : X → R ∪ {+∞} over the set C defined by

C = {x ∈ X : gi (x) ≤ 0 for i = 1, . . . , m }.

Suppose we are able to solve the dual problem and find a Lagrange multiplier λ̂ ∈
R^m_+. Then, the constrained problem

min{ f (x) : x ∈ C }

is equivalent (see Theorems 3.66 and 3.68) to the unconstrained problem

min{ f(x) + ∑_{i=1}^{m} λ̂i gi(x) : x ∈ X }.

One can proceed either:


• Subsequently: solving the (finite-dimensional) dual problem first, and then using
the Lagrange multiplier to state and solve an unconstrained primal problem; or
• Simultaneously: approach the primal-dual solution in the product space. This, in
turn, can be performed in two ways:
– Use optimality conditions for the saddle-point problem. The primal-dual solu-
tion must be a critical point of the Lagrangian, if it is differentiable;
– Devise an algorithmic procedure based on iterative minimization with respect
to the primal variable x and maximization with respect to the dual variable λ. We shall
comment on this method in Chap. 6.

Similar approaches can be used for the affinely constrained problem where

C = {x ∈ X : Ax = b }

(see Theorem 3.63). For instance, the constrained problem is equivalent to an uncon-
strained problem in view of Remark 3.70, and the optimality condition is a system
of inclusions: Ax̂ = b, and −A∗ y∗ ∈ ∂ f (x̂) for (x̂, y∗ ) ∈ X ×Y ∗ .

5.3.2 Splitting

In many practical cases, one can identify components either in the objective func-
tion, the variable space, or the constraint set. In such situations, it may be useful to
focus on the different components independently, and then use the partial informa-
tion to find a point which is closer to being a solution for the original problem.

Additive and Composite Structure

In some occasions, the objective function f is structured as a sum of functions


f1 , . . . , fK , each of which has special properties. One common strategy consists in
performing minimization steps on each fk (k = 1, . . . , K) independently (this can be

done either simultaneously or successively), and then using some averaging or pro-
jection procedure to consolidate the information. Let us see some model situations,
some of which we will revisit in Chap. 6:

Separate variables in a product space: The space X is the product of two spaces
X1 and X2 , where fi : Xi → R ∪ {+∞} for i = 1, 2, and the corresponding variables
x1 and x2 are linked by a simple relation, such as an affine equation.
Example 5.5. For the linear-quadratic optimal control problem described in
Sect. 4.2.3, we can consider the variables u ∈ L p (0, T ; RM ) and y ∈ C ([0, T ]; RN ),
related by the Variation of Parameters Formula (4.1).

When independent minimization is simple: The functions f1 and f2 depend on the
same variable but minimizing the sum is difficult, while minimizing each function
separately is relatively simple.
Example 5.6. Suppose one wishes to find a point in the intersection of two closed
convex subsets, C1 and C2 , of a Hilbert space. This is equivalent to minimizing
f = δC1 + δC2 . We can minimize δCi by projecting onto Ci . It may be much simpler
to project onto C1 and C2 independently, than to project onto C1 ∩C2 . This method
of alternating projections has been studied in [40, 61, 102] among others.

Different methods for different regularity: For instance, suppose that f1 : X → R
is smooth, while f2 : X → R ∪ {+∞} is proper, convex and lower-semicontinuous.
We shall see, in Chap. 6, that some methods are well adapted to nonsmooth func-
tions, whereas other methods require certain levels of smoothness to be effective.
Example 5.7. Suppose we wish to minimize an everywhere defined, smooth func-
tion f over a closed, convex set C. In Hilbert spaces, the projected gradient algorithm
(introduced by [59] and [73]) consists in performing an iteration of the gradient
method (see Sect. 6.3) with respect to f , followed by a projection onto C.

Composite structure: If A : X → Y is affine and f : Y → R ∪ {+∞} is convex, then
f ◦ A : X → R ∪ {+∞} is convex (see Example 2.13). In some cases, and depending
on f and A, it may be useful to consider a constrained problem in the product space.
More precisely, the problems

min_{x∈X} { f(Ax) }   and   min_{(x,y)∈X×Y} { f(y) : Ax = y }

are equivalent. The latter has an additive structure and the techniques presented
above can be used, possibly in combination with iterative methods. This is particu-
larly useful when f has a simple structure, but f ◦ A is complicated.
Example 5.8. As seen in Sect. 4.3.3, the optimality condition for minimizing the
square of the L2 norm of the gradient is a second-order partial differential equation.


We shall come back to these techniques in Chap. 6.
Chapter 6
Keynote Iterative Methods

Abstract In this chapter we give an introduction to the basic (sub)gradient-based


methods for minimizing a convex function on a Hilbert space. We pay special atten-
tion to the proximal point algorithm and the gradient method, which are interpreted
as time discretizations of the steepest descent differential inclusion. Moreover, these
methods, along with some extensions and variants, are the building blocks for other
− more sophisticated − methods that exploit particular features of the problems,
such as the structure of the feasible (or constraint) set. The choice of proximal- or
gradient-type schemes depends strongly on the regularity of the objective function.

Throughout this chapter, H is a real Hilbert space.

6.1 Steepest Descent Trajectories

Let f : H → R be a continuously differentiable function with Lipschitz-continuous


gradient. By the Cauchy-Lipschitz-Picard Theorem, for each x0 ∈ H, the ordinary
differential equation

x(0) = x0
(ODE)
−ẋ(t) = ∇ f (x(t)), t > 0,

has a unique solution. More precisely, there is a unique continuously differentiable


function x : [0, +∞) → H such that x(0) = x0 and −ẋ(t) = ∇ f (x(t)) for all t > 0. To
fix the ideas, let us depict this situation in R2 :


[Figure: a steepest descent trajectory x(t) in R², starting at x0 with initial velocity −∇f(x0) and approaching the solution set S.]

At each instant t, the velocity ẋ(t) points towards the interior of the sublevel set
Γf (x(t)) ( f ). Observe also that the stationary points of (ODE) are exactly the criti-
cal points of f (the zeroes of ∇ f ). Moreover, the function f decreases along the
solutions, and does so strictly unless it bumps into a critical point, since

(d/dt) f(x(t)) = ⟨∇f(x(t)), ẋ(t)⟩ = −‖ẋ(t)‖² = −‖∇f(x(t))‖².
If f is convex, one would reasonably expect the trajectories to minimize f as
t → ∞. Indeed, by convexity, we have (see Proposition 3.10)

f(z) − f(x(t)) ≥ ⟨∇f(x(t)), z − x(t)⟩ = ⟨ẋ(t), x(t) − z⟩ = (d/dt) hz(t),

where hz(t) = ½‖x(t) − z‖². Integrating on [0, T] and recalling that t ↦ f(x(t)) is
nonincreasing, we deduce that

f(x(T)) ≤ f(z) + ‖x0 − z‖²/(2T),

and we conclude that lim_{t→∞} f(x(t)) = inf(f).

If, moreover, S = argmin(f) ≠ ∅ and we take z̄ ∈ S, then

(d/dt) h_{z̄}(t) = ⟨ẋ(t), x(t) − z̄⟩ = −⟨∇f(x(t)) − ∇f(z̄), x(t) − z̄⟩ ≤ 0

and lim_{t→∞} ‖x(t) − z̄‖ exists. Since f is continuous and convex, it is weakly lower-
semicontinuous, by Proposition 2.17. Therefore, every weak limit point of x(t) as
t → ∞ must minimize f, by Proposition 2.8. We conclude, by Opial's Lemma 5.2,
that x(t) converges weakly to some z̄ ∈ S as t → ∞.

It is possible to prove that if f : H → R ∪ {+∞} is proper, lower-semicontinuous


and convex, then the Steepest Descent Differential Inclusion

x(0) = x0
(DI)
−ẋ(t) ∈ ∂ f (x(t)), t > 0,

has similar properties: for each x0 ∈ dom(f), (DI) has a unique absolutely continuous
solution x : [0, +∞) → H, which satisfies lim_{t→∞} f(x(t)) = inf(f). Further, if S ≠ ∅,
then x(t) converges weakly to a point in S. The proof (especially the existence) is
more technical and will be omitted. The interested reader may consult [87] for fur-
ther discussion, or also [29], which is the original source.

If we discretize (ODE) by the method of finite differences, we first take a


sequence (λn ) of positive parameters called the step sizes, set σn = ∑nk=1 λk , and
partition the interval [0, +∞) as:

λ1 λ2 λ3 λn

0 σ1 σ2 σ3 ··· σn−1 σn

In order to recover the asymptotic properties of (ODE) as t → ∞, we suppose
that σn → ∞ as n → ∞, which is equivalent to (λn) ∉ ℓ¹.

Then ẋ(t) is approximated by (xn − xn−1)/λn,
where the sequence (xn ) must be determined. On the other hand, it is natural to
approximate the term ∇ f (x(t)) either by ∇ f (xn−1 ), or by ∇ f (xn ), which correspond
to evaluating the gradient at the time corresponding to the left, or the right, end of the
interval [σn−1 , σn ], respectively. The first option produces an explicit update rule:
−(xn − xn−1)/λn = ∇f(xn−1) ⟺ xn = xn−1 − λn ∇f(xn−1),
known as the gradient method, originally devised by Cauchy [36]. This method finds
points in S if f is sufficiently regular and the step sizes are properly selected. The
second option, namely ∇ f (x(t)) ∼ ∇ f (xn ), gives an implicit update rule:
−(xn − xn−1)/λn = ∇f(xn) ⟺ xn + λn ∇f(xn) = xn−1,
that is known as the proximal point algorithm, and was introduced by Martinet [76]
much more recently. Despite the difficulty inherent to the implementation of the
implicit rule, this method has remarkable stability properties, and can be applied

to non-differentiable (even discontinuous) functions successfully. These two algo-


rithms are the building blocks for a wide spectrum of iterative methods in optimiza-
tion. The remainder of this chapter is devoted to their study.
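A minimal sketch (ours, not from the book) contrasting the two discretizations on the real line, for f(x) = x²/2, where both updates have closed forms: the implicit (proximal) rule is stable for every step size, while the explicit (gradient) rule diverges as soon as λ > 2.

def run(update, lam, x0=1.0, n_iter=20):
    x = x0
    for _ in range(n_iter):
        x = update(x, lam)
    return x

# explicit (gradient):  x_{n+1} = x_n - lam * f'(x_n) = (1 - lam) * x_n
explicit = lambda x, lam: (1.0 - lam) * x
# implicit (proximal):  x_{n+1} + lam * f'(x_{n+1}) = x_n  =>  x_{n+1} = x_n / (1 + lam)
implicit = lambda x, lam: x / (1.0 + lam)

for lam in [0.5, 1.5, 2.5]:
    print(f"lam = {lam}:  gradient -> {run(explicit, lam):+.3e}   proximal -> {run(implicit, lam):+.3e}")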

6.2 The Proximal Point Algorithm

In what follows, f : H → R ∪ {+∞} is a proper, lower-semicontinuous convex func-


tion. We denote the optimal value

α = inf{ f (u) : u ∈ H},

which, in principle, may be −∞; and the solution set

S = argmin( f ),

which, of course, may be empty. To define the proximal point algorithm, we con-
sider a sequence (λn ) of positive numbers called the step sizes.

Proximal Point Algorithm: Begin with x0 ∈ H. For each n ≥ 0, given a step size λn
and the state xn , define the state xn+1 as the unique minimizer of the Moreau–Yosida
Regularization f(λn ,xn ) of f , which is the proper, lower-semicontinuous and strongly
convex function defined as
f_{(λn,xn)}(z) = f(z) + (1/(2λn))‖z − xn‖²

(see Sect. 3.5.4). In other words,

{xn+1} = argmin{ f(z) + (1/(2λn))‖z − xn‖² : z ∈ H }.    (6.1)

Moreover, according to the Moreau–Rockafellar Theorem,

0 ∈ ∂f_{(λn,xn)}(xn+1) = ∂f(xn+1) + (xn+1 − xn)/λn,

or, equivalently,

−(xn+1 − xn)/λn ∈ ∂f(xn+1).    (6.2)
Using this property, one can interpret the proximal iteration xn → xn+1 as an implicit
discretization of the differential inclusion

−ẋ(t) ∈ ∂ f (x(t)) t > 0.



Inclusion (6.2) can be written in resolvent notation (see (3.14)) as:

xn+1 = (I + λn ∂ f )−1 (xn ).

A sequence (xn ) generated following the proximal point algorithm is a proximal


sequence. The stationary points of a proximal sequence are the minimizers of the
objective function f , since, clearly,

xn+1 = xn if, and only if, 0 ∈ ∂ f (xn+1 ).
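In code, the algorithm only needs an oracle for the resolvent (I + λ∂f)⁻¹. Here is a minimal sketch (ours, not from the book; names are illustrative), tested on f(x) = ‖x‖₁ in R², whose resolvent is the soft-thresholding map computed explicitly in Sect. 6.2.3 below.

import numpy as np

def proximal_point(prox, x0, step_sizes):
    # Iterate x_{n+1} = (I + lambda_n ∂f)^{-1}(x_n), i.e. the minimizer in (6.1).
    x = np.asarray(x0, dtype=float)
    for lam in step_sizes:
        x = prox(x, lam)
    return x

def prox_l1(x, lam):
    # Resolvent of lam * ∂||.||_1: soft-thresholding (see Sect. 6.2.3).
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

print(proximal_point(prox_l1, [3.0, -1.5], [0.5] * 10))   # reaches the minimizer (0, 0)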

6.2.1 Basic Properties of Proximal Sequences

We now study the basic properties of proximal sequences, which hold for any
(proper, lower-semicontinuous and convex) f . As we shall see, proximal sequences
minimize f . Moreover, they converge weakly to a point in the solution set S, pro-
vided it is nonempty.
In the first place, the very definition of the algorithm (see (6.1)) implies

f(xn+1) + (1/(2λn))‖xn+1 − xn‖² ≤ f(xn)

for each n ≥ 0. Therefore, the sequence (f(xn)) is nonincreasing. In fact, it decreases
strictly as long as xn+1 ≠ xn. In other words, at each iteration, the point xn+1 will be
different from xn only if a sufficient reduction in the value of the objective function
can be attained.
Inclusion (6.2) and the definition of the subdifferential together give

f(u) ≥ f(xn+1) + ⟨−(xn+1 − xn)/λn, u − xn+1⟩

for each u ∈ H. This implies

2λn [f(xn+1) − f(u)] ≤ 2⟨xn+1 − xn, u − xn+1⟩
 = ‖xn − u‖² − ‖xn+1 − u‖² − ‖xn+1 − xn‖²
 ≤ ‖xn − u‖² − ‖xn+1 − u‖².    (6.3)

Summing up for n = 0, . . . , N we deduce that

2 ∑_{n=0}^{N} λn [f(xn+1) − f(u)] ≤ ‖x0 − u‖² − ‖xN+1 − u‖² ≤ ‖x0 − u‖²

for each N ∈ N. Since the sequence (f(xn)) is nonincreasing, we deduce that

2σN (f(xN+1) − f(u)) ≤ ‖x0 − u‖²,



where we have written

σn = ∑_{k=0}^{n} λk.

Gathering all this information we obtain the following:


Proposition 6.1. Let (xn) be a proximal sequence.
i) The sequence (f(xn)) is nonincreasing. Moreover, it decreases strictly as long as
xn+1 ≠ xn.
ii) For each u ∈ H and n ∈ N, we have

2σn (f(xn+1) − f(u)) ≤ ‖x0 − u‖².

In particular,

f(xn+1) − α ≤ d(x0, S)²/(2σn).

iii) If (λn) ∉ ℓ¹, then lim_{n→∞} f(xn) = α.
As mentioned above, we shall see (Theorem 6.3) that proximal sequences always
converge weakly. However, if a proximal sequence (xn) happens to converge strongly
(see Sect. 6.2.2), it is possible to obtain a convergence rate of lim_{n→∞} σn(f(xn+1) − α) = 0,
which is faster than the one predicted in part ii) of the preceding proposition. This
was proved in [60] (see also [85, Sect. 2] for a simplified proof).

Concerning the sequence (xn ) itself, we have the following:


Proposition 6.2. Let (xn) be a proximal sequence with (λn) ∉ ℓ¹. Then:
i) Every weak limit point of the sequence (xn) must lie in S.
ii) If x* ∈ S, then the sequence (‖xn − x*‖) is nonincreasing. As a consequence,
lim_{n→∞} ‖xn − x*‖ exists.
In particular, the sequence (xn) is bounded if, and only if, S ≠ ∅.
Proof. For part i), if (x_{k_n}) converges weakly to x̄, then

f(x̄) ≤ lim inf_{n→∞} f(x_{k_n}) = lim_{n→∞} f(xn) ≤ inf{ f(u) : u ∈ H },

by the weak lower-semicontinuity of f. This implies x̄ ∈ S. Part ii) is obtained by
replacing u = x* in (6.3), since f(xn) ≥ f(x*) for all n ∈ N. □

Proposition 6.2, along with Opial's Lemma 5.2, allows us to establish the weak
convergence of proximal sequences:
Theorem 6.3. Let (xn) be a proximal sequence generated using a sequence of step
sizes (λn) ∉ ℓ¹. If S ≠ ∅, then (xn) converges weakly, as n → ∞, to a point in S.
Otherwise, lim_{n→∞} ‖xn‖ = ∞.

Consistency of the Directions

As seen above, proximal sequences converge weakly under very general assump-
tions. The fact that the computation of the point xn+1 involves information on the
point xn+1 itself actually implies that, in some sense, the future history of the
sequence will be taken into account as well.
For this reason, the evolution of proximal sequences turns out to be very sta-
ble. More precisely, the displacement xn+1 − xn always forms an acute angle both
with the previous displacement xn − xn−1 and with any vector pointing towards the
minimizing set S.
Proposition 6.4. Let (xn) be a proximal sequence. If xn+1 ≠ xn, then

⟨xn+1 − xn, xn − xn−1⟩ > 0.

Moreover, if xn+1 ≠ xn and x̂ ∈ S, then

⟨xn+1 − xn, x̂ − xn⟩ > 0.

Proof. According to (6.2), we have

−(xn+1 − xn)/λn ∈ ∂f(xn+1)   and   −(xn − xn−1)/λn−1 ∈ ∂f(xn).

By monotonicity,

⟨(xn+1 − xn)/λn − (xn − xn−1)/λn−1, xn+1 − xn⟩ ≤ 0.

In other words,

⟨xn+1 − xn, xn − xn−1⟩ ≥ (λn−1/λn)‖xn+1 − xn‖².

In particular, ⟨xn+1 − xn, xn − xn−1⟩ > 0 whenever xn+1 ≠ xn. For the second inequal-
ity, if x̂ ∈ S, then 0 ∈ ∂f(x̂). Therefore

⟨xn+1 − xn, x̂ − xn⟩ = ⟨xn+1 − xn, x̂ − xn+1⟩ + ‖xn+1 − xn‖² > 0,

as long as xn+1 ≠ xn, by monotonicity. □




In conclusion, proximal sequences cannot go “back and forth;” they always move
in the direction of S.

6.2.2 Strong Convergence and Finite-Time Termination

Theorem 6.3 guarantees the weak convergence of proximal sequences. A natu-


ral question—already posed by R.-T. Rockafellar in [93]—is whether or not the
convergence is always strong. A negative answer was provided by O. Güler in
[60] based on an example given by J.-B. Baillon in [20] for the differential inclu-
sion −ẋ(t) ∈ ∂ f (x(t)), t > 0. However, under additional assumptions, the proximal
sequences can be proved to converge strongly. We shall present some examples here.
Some variations of the proximal algorithm also converge strongly (see, for instance,
[98]).

Strong Convexity and Expansive Subdifferential

Recall from Proposition 3.23 that if f : H → R ∪ {+∞} is strongly convex with
parameter r > 0, then ∂f is strongly monotone with parameter r, which means that

⟨x* − y*, x − y⟩ ≥ r‖x − y‖²

whenever x* ∈ ∂f(x) and y* ∈ ∂f(y). Observe also that the latter implies that ∂f is
r-expansive, in the sense that

‖x* − y*‖ ≥ r‖x − y‖

whenever x* ∈ ∂f(x) and y* ∈ ∂f(y).

When applied to strongly convex functions, the proximal point algorithm gener-
ates strongly convergent sequences.
Proposition 6.5. Let (xn) be a proximal sequence with (λn) ∉ ℓ¹. If f is strongly
convex with parameter r, then S is a singleton {x̂} and

‖xn+1 − x̂‖ ≤ ‖x0 − x̂‖ ∏_{k=0}^{n} (1 + rλk)⁻¹.    (6.4)

In particular, (xn ) converges strongly, as n → ∞, to the unique x̂ ∈ S.


xn+1 −xn
Proof. Since 0 ∈ ∂ f (x̂) and − λn ∈ ∂ f (xn+1 ), the strong convexity implies

xn+1 − xn , x̂ − xn+1  ≥ rλn x̂ − xn+1 2

for all n. The equality 2ζ , ξ  = ζ + ξ 2 − ζ 2 − ξ 2 gives

x̂ − xn 2 − xn+1 − xn 2 ≥ (1 + 2rλn )x̂ − xn+1 2 .

But xn+1 − xn  ≥ rx̂ − xn+1  by the expansive property of the subdifferential of a


strongly convex function. We deduce that

x̂ − xn 2 ≥ (1 + 2rλn + r2 λn2 )x̂ − xn+1 2

and so
x̂ − xn+1  ≤ (1 + rλn )−1 x̂ − xn .
Proceeding inductively we obtain (6.4). To conclude, we use the fact that ∏∞
k=0 (1 +
rλk ) tends to ∞ if, and only if, (λn ) ∈
/ 1 .

It is possible to weaken the strong convexity assumption provided the step sizes
are sufficiently large. More precisely, we have the following:
Proposition 6.6. Let (xn) be a proximal sequence with (λn) ∉ ℓ². If ∂f is r-
expansive, then S is a singleton {x̂} and

‖xn+1 − x̂‖ ≤ ‖x0 − x̂‖ ∏_{k=0}^{n} (1 + r²λk²)^{−1/2}.

In particular, (xn) converges strongly, as n → ∞, to the unique x̂ ∈ S.

Even Functions

Recall that a function f : H → R ∪ {+∞} is even if f(−x) = f(x) for all x ∈ H.
We shall see that, when applied to an even function, the proximal point algorithm
generates only strongly convergent sequences. Notice first that if f is even, then
0 ∈ S, and so S ≠ ∅.
Proposition 6.7. Let (xn) be a proximal sequence with (λn) ∉ ℓ¹. If f is even, then
(xn) converges strongly, as n → ∞, to some x̂ ∈ S.
Proof. Take m > n and replace u = −xm in inequality (6.3) to obtain

‖xn+1 + xm‖² ≤ ‖xn + xm‖².

For each fixed m, the function n ↦ ‖xn + xm‖² is nonincreasing. In particular,

4‖xm‖² = ‖xm + xm‖² ≤ ‖xn + xm‖².



On the other hand, the Parallelogram Identity gives

‖xn + xm‖² + ‖xn − xm‖² = 2‖xm‖² + 2‖xn‖².

We deduce that

‖xn − xm‖² ≤ 2‖xn‖² − 2‖xm‖².

Since lim_{n→∞} ‖xn − 0‖ exists, (xn) must be a Cauchy sequence. □

Solution Set with Nonempty Interior

If int(S) ≠ ∅, there exist ū ∈ S and r > 0 such that ū + rh ∈ S for every h ∈ B(0, 1).
Now, given x ∈ dom(∂f) and x* ∈ ∂f(x), the subdifferential inequality gives

f(ū + rh) ≥ f(x) + ⟨x*, ū + rh − x⟩.

Since ū + rh ∈ S, we deduce that

r⟨x*, h⟩ ≤ ⟨x*, x − ū⟩,

and therefore,

r‖x*‖ = sup_{h∈B(0,1)} r⟨x*, h⟩ ≤ ⟨x*, x − ū⟩.    (6.5)

Proposition 6.8. Let (xn) be a proximal sequence with (λn) ∉ ℓ¹. If S has nonempty
interior, then (xn) converges strongly, as n → ∞, to some x̂ ∈ S.
Proof. If we apply Inequality (6.5) with x = xn+1 and x* = (xn − xn+1)/λn, we deduce that

2r‖xn − xn+1‖ ≤ 2⟨xn − xn+1, xn+1 − ū⟩
 = ‖xn − ū‖² − ‖xn+1 − ū‖² − ‖xn+1 − xn‖²
 ≤ ‖xn − ū‖² − ‖xn+1 − ū‖².

Whence, if m > n,

2r‖xn − xm‖ ≤ 2r ∑_{j=n}^{m−1} ‖xj − xj+1‖ ≤ ‖xn − ū‖² − ‖xm − ū‖².

Part ii) of Proposition 6.2 implies (xn) is a Cauchy sequence. □




Linear Error Bound and Finite-Time Termination

If f is differentiable on S, the proximal point algorithm cannot terminate in a


finite number of iterations unless the initial point is already in S (and the proximal

sequence is stationary). Indeed, if xn+1 ∈ S and f is differentiable at xn+1, then

−(xn+1 − xn)/λn ∈ ∂f(xn+1) = {0},

and so xn+1 = xn. Proceeding inductively, one proves that xn = x0 ∈ S for all n.
Moreover, if the function is "too smooth" around S, one can even obtain an upper
bound for the speed of convergence. To see this, let d(u, S) denote the distance from
u to S, and, for each n ∈ N, write σn = ∑_{k=0}^{n} λk. We have the following:
Proposition 6.9. Assume f is a continuously differentiable function such that ∇f
is Lipschitz-continuous with constant L in a neighborhood of S and let (xn) be a
proximal sequence with x0 ∉ S. Then, there exists C > 0 such that

d(xn+1, S) ≥ C e^{−σn L}.

Proof. Recall that Px denotes the projection of x onto S. Let η > 0 be such that

‖∇f(x)‖ = ‖∇f(x) − ∇f(Px)‖ ≤ L‖x − Px‖ = L d(x, S)

whenever d(x, S) < η. As said before, the minimizing set cannot be attained in
a finite number of steps. If d(xN, S) < η for some N, then d(xn+1, S) < η for all
n ≥ N, and so

‖xn+1 − xn‖ = λn ‖∇f(xn+1)‖ ≤ λn L d(xn+1, S)

for all n ≥ N. But d(xn, S) − d(xn+1, S) ≤ ‖xn+1 − xn‖, which implies

(1 + λn L) d(xn+1, S) ≥ d(xn, S).

We conclude that

d(xn+1, S) ≥ ∏_{k=N}^{n} (1 + λk L)⁻¹ d(xN, S) ≥ e^{−σn L} d(xN, S),

and this completes the proof.




On the other hand, if the function f is somehow steep on the boundary of S,


the sequence {xn } will reach a point in S in a finite number of iterations, which is
possible to estimate in terms of the initial gap f (x0 ) − α , where α = inf( f ).
Proposition 6.10. Assume α > −∞ and ‖v‖ ≥ r > 0 for all v ∈ ∂f(x) such that
x ∉ S. Set N = min{ n ∈ N : r²σn ≥ f(x0) − α }. Then xN ∈ S.

Proof. If xk ∉ S for k = 0, . . . , n + 1, then

f(xk) − f(xk+1) ≥ ⟨−(xk+1 − xk)/λk, xk − xk+1⟩ = ‖xk − xk+1‖²/λk ≥ r²λk

by the subdifferential inequality. Summing up one gets

r²σn ≤ f(x0) − f(xn+1) ≤ f(x0) − α.

This implies xN ∈ S, as claimed.



If

f(x) ≥ α + r‖x − Px‖    (6.6)

for all x ∈ H, then

f(x) ≥ α + r‖x − Px‖ ≥ f(x) + ⟨v, Px − x⟩ + r‖x − Px‖

by the subdifferential inequality. Hence r‖x − Px‖ ≤ ⟨v, x − Px⟩ and so ‖v‖ ≥ r > 0
for all v ∈ ∂f(x) such that x ∉ S. Inequality (6.6) is known as a linear error bound.
Roughly speaking, it means that f is steep on the boundary of S.

[Figure: a function that is steep near the boundary of S; the slope there corresponds to an angle θ with tan(θ) = r.]

Using an equivalent form of Inequality (6.6), convergence in a finite number of


iterations was proved in [93] and [55].
One remarkable property of the proximal point algorithm is that it can be applied
to nonsmooth functions. Moreover, Proposition 6.10 implies that, in a sense, non-
smoothness has a positive impact on the speed of convergence.

6.2.3 Examples

In some cases, it is possible to obtain an explicit formula for the proximal iterations.
In other words, for x ∈ H and λ > 0, one can find y ∈ H such that

y ∈ argmin{ f(ξ) + (1/(2λ))‖ξ − x‖² : ξ ∈ H },

or, equivalently,

x − y ∈ λ∂f(y).

Indicator Function

Let C be a nonempty, closed and convex subset of H. Recall that the indicator func-
tion δC : H → R ∪ {+∞} of C, given by δC (ξ ) = 0 if ξ ∈ C and +∞ otherwise, is

proper, lower-semicontinuous and convex. Clearly, the set of minimizers of δC is
precisely the set C. Take x ∈ H and λ > 0. Then,

argmin{ δC(ξ) + (1/(2λ))‖ξ − x‖² : ξ ∈ H } = argmin{ ‖ξ − x‖ : ξ ∈ C } = {PC(x)}.

Observe that, in this case, the proximal iterations converge to the projection of x0
onto S = C in one iteration. The term proximal was introduced by J.-J. Moreau. In
some sense, it expresses the fact that the proximal iteration generalizes the concept
of projection when the indicator function of a set is replaced by an arbitrary proper,
lower-semicontinuous and convex function (see [80], where the author presents a
survey of his previous research [77, 78, 79] in the subject).

Quadratic Function

The quadratic case was studied in [71] and [72]. Consider a bounded, symmetric
and positive semidefinite linear operator A : H → H, and a point b ∈ H. The function
f : H → R defined by

f(x) = ½⟨x, Ax⟩ − ⟨b, x⟩

is convex and differentiable, with gradient ∇f(x) = Ax − b. Clearly, x̂ minimizes f
if, and only if, Ax̂ = b. In particular, S ≠ ∅ if, and only if, b belongs to the range of
A. Given x ∈ H and λ > 0, we have

x − y ∈ λ ∂ f (y)

if, and only if


y = (I + λ A)−1 (x + λ b).
Since A is positive semidefinite, I + λ A is indeed invertible. The iterations of the
proximal point algorithm are given by

xn+1 = (I + λn A)−1 (xn + λn b), n ≥ 0.

By Theorem 6.3, if the equation Ax = b has solutions, then each proximal sequence
must converge weakly to one of them. Actually, the convergence is strong. To see
this, take any solution x̄ and use the change of variables un = xn − x̄. A simple
computation shows that

un+1 = (I + λn A)⁻¹(un) = (I + λn ∇g)⁻¹(un),   n ≥ 0,

where g is the even function given by g(x) = ½⟨x, Ax⟩. It follows that (un) converges
strongly to some ū and we conclude that (xn ) converges strongly to ū + x̄.
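A short numerical sketch (ours, not from the book; the data are illustrative): each proximal step for this quadratic function amounts to solving the linear system (I + λn A)xn+1 = xn + λn b.

import numpy as np

A = np.array([[2.0, 0.0], [0.0, 0.5]])   # symmetric positive semidefinite
b = np.array([2.0, 1.0])
x = np.array([10.0, -10.0])
lam = 1.0

for _ in range(50):
    x = np.linalg.solve(np.eye(2) + lam * A, x + lam * b)

print(x)                      # close to the solution of A x = b
print(np.linalg.solve(A, b))  # [1. 2.]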

ℓ1 and L1 Norms

Minimization of ℓ1 and L1 norms has a great number of applications in image pro-
cessing and optimal control, since these norms are known to induce sparsity of the
solutions. We shall comment on these kinds of applications in the next chapter. For
the moment, let us present how the proximal iterations can be computed for these
functions.

In finite dimension: Let us start by considering f1 : R → R defined by f1(ξ) = |ξ|.
Take x ∈ R and λ > 0. The unique point

y ∈ argmin{ |ξ| + (1/(2λ))|ξ − x|² : ξ ∈ R }

is characterized by

y = x − λ   if x > λ,
y = 0        if −λ ≤ x ≤ λ,
y = x + λ   if x < −λ.
The preceding argument can be easily extended to the ℓ1 norm in R^N, namely
fN : R^N → R given by fN(ξ) = ‖ξ‖₁ = |ξ1| + · · · + |ξN|.
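The componentwise rule above is the soft-thresholding operator; a direct sketch in code (ours, not from the book):

import numpy as np

def soft_threshold(x, lam):
    # Proximal map of lam * ||.||_1: shrink each entry toward 0 by lam.
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

print(soft_threshold(np.array([2.0, -0.3, 0.7]), 0.5))  # [ 1.5  0.   0.2]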

In ℓ²(N; R): Now let H = ℓ²(N; R) and define f∞ : H → R ∪ {+∞} by

f∞(ξ) = ‖ξ‖_{ℓ¹(N;R)} = ∑_{i∈N} |ξi|  if ξ ∈ ℓ¹(N; R),   and   f∞(ξ) = +∞ otherwise.

Take x ∈ H and λ > 0. For each i ∈ N define

yi = xi − λ   if xi > λ,
yi = 0        if −λ ≤ xi ≤ λ,
yi = xi + λ   if xi < −λ.

Since x ∈ H there is I ∈ N such that |xi| ≤ λ for all i ≥ I. Hence yi = 0 for all i ≥ I
and the sequence (yi) belongs to ℓ¹(N; R). Finally,

|yi| + (1/(2λ))|yi − xi|² ≤ |z| + (1/(2λ))|z − xi|²

for all z ∈ R and so

y ∈ argmin{ ‖u‖_{ℓ¹(N;R)} + (1/(2λ))‖u − x‖²_{ℓ²(N;R)} : u ∈ H }.

In L²(Ω; R): Let Ω be a bounded open subset of R^N and consider the Hilbert
space H = L²(Ω; R), which is included in L¹(Ω; R). Define fΩ : H → R by

fΩ(u) = ‖u‖_{L¹(Ω;R)} = ∫_Ω |u(ζ)| dζ.

Take x ∈ H (any representing function) and λ > 0. Define y : Ω → R by

y(ζ) = x(ζ) − λ   if x(ζ) > λ,
y(ζ) = 0          if −λ ≤ x(ζ) ≤ λ,
y(ζ) = x(ζ) + λ   if x(ζ) < −λ,

for each ζ ∈ Ω. Observe that if one chooses a different representing function x′,
then the corresponding y′ will match y almost everywhere. Then y ∈ H and

|y(ζ)| + (1/(2λ))|y(ζ) − x(ζ)|² ≤ |z| + (1/(2λ))|z − x(ζ)|²

for all z ∈ R. Therefore

y ∈ argmin{ ‖u‖_{L¹(Ω;R)} + (1/(2λ))‖u − x‖²_{L²(Ω;R)} : u ∈ H }.

6.3 Gradient-Consistent Algorithms

Gradient-consistent methods are based on the fact that if f : H → R is differentiable


at a point z, then ∇ f (z) belongs to the normal cone of the sublevel set Γf (z) ( f ).
Indeed, if y ∈ Γf (z) ( f ), we have

⟨∇f(z), y − z⟩ ≤ f(y) − f(z) ≤ 0

by the subdifferential inequality. Intuitively, a direction d forming a sufficiently


small angle with −∇ f (z) must point inwards, with respect to Γf (z) ( f ), and we should
have f (z + λ d) < f (z) if the step size λ > 0 is conveniently chosen.

[Figure: the gradient ∇f(z) at a point z, and a descent direction d forming a small angle with −∇f(z) and pointing towards S.]

Gradient-consistent algorithms take the form



z0 ∈ H
(GC)
zn+1 = zn + λn dn ,

where the sequence (dn ) satisfies some nondegeneracy condition with respect to the
sequence of steepest descent directions, namely Hypothesis 6.12 below.

The idea (which dates back to [36]) of selecting exactly the steepest descent
direction d = −∇ f (z) is reasonable, but it may not be the best option. On the other
hand, the choice of the step size λn may be crucial in the performance of such an
algorithm, as shown in the following example:
Example 6.11. Let f : R → R be defined by f(z) = z². Take λn ≡ λ and set z0 = 1. A
straightforward computation yields zn = (1 − 2λ)^n for each n. Therefore, if λ = 1,
the sequence (zn) remains bounded but does not converge. Next, if λ > 1, then
lim_{n→∞} |zn| = +∞. Finally, if λ < 1, then lim_{n→∞} zn = 0, the unique minimizer of f.
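A two-line numerical check of this example (ours):

for lam in [0.4, 1.0, 1.1]:
    print(lam, [(1 - 2 * lam)**n for n in range(0, 9, 2)])
# lam = 0.4: powers of 0.2, converging to 0; lam = 1.0: +/-1 forever; lam = 1.1: blows up.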

In this section, we will explore different possibilities for λn and dn , and provide
qualitative and quantitative convergence results.
Hypothesis 6.12. The function f is bounded from below, its gradient ∇f is Lipschitz-
continuous with constant L, there exist positive numbers α and β such that

α‖dn‖² ≤ ‖∇f(zn)‖² ≤ −β⟨∇f(zn), dn⟩    (6.7)

for all n ∈ N, and the sequence (λn) of step sizes satisfies (λn) ∉ ℓ¹ and

λ̄ := sup_{n∈N} λn < 2α/(βL).    (6.8)

Lemma 6.13. Assume Hypothesis 6.12 holds. There is δ > 0 such that

f(zn+1) − f(zn) ≤ −δ λn ‖∇f(zn)‖²

for all n ∈ N. Thus, the sequence (f(zn)) decreases to some v* ∈ R.

Proof. Hypothesis 6.12 and Lemma 1.30 together give

f(zn+1) − f(zn) ≤ ⟨∇f(zn), zn+1 − zn⟩ + (L/2)‖zn+1 − zn‖²
 = λn⟨∇f(zn), dn⟩ + (Lλn²/2)‖dn‖²
 ≤ λn (Lλn/(2α) − 1/β) ‖∇f(zn)‖²
 ≤ λn (βλ̄L − 2α)/(2αβ) ‖∇f(zn)‖².

It suffices to define

δ = (2α − βλ̄L)/(2αβ),

which is positive by Hypothesis 6.12. □


Proposition 6.14. Assume Hypothesis 6.12 holds. Then lim_{n→∞} ‖∇f(zn)‖ = 0 and every
strong limit point of the sequence (zn) must be critical.
Proof. By Lemma 6.13, we have

∑_{n∈N} λn‖∇f(zn)‖² ≤ f(z0) − inf(f) < +∞.

Since (λn) ∉ ℓ¹, we also have lim inf_{n→∞} ‖∇f(zn)‖ = 0. Let ε > 0 and take N ∈ N
large enough so that

∑_{n≥N} λn‖∇f(zn)‖² < ε²√α/(4L).    (6.9)

We shall prove that ‖∇f(zm)‖ < ε for all m ≥ N. Indeed, if m ≥ N and ‖∇f(zm)‖ ≥ ε,
define

νm = min{ n > m : ‖∇f(zn)‖ < ε/2 },

which is finite because lim inf_{n→∞} ‖∇f(zn)‖ = 0. Then

‖∇f(zm)‖ − ‖∇f(z_{νm})‖ ≤ ∑_{k=m}^{νm−1} ‖∇f(zk+1) − ∇f(zk)‖ ≤ L ∑_{k=m}^{νm−1} ‖zk+1 − zk‖.

But for k = m, . . . , νm − 1 we have

‖zk+1 − zk‖ = λk‖dk‖ ≤ (1/√α) λk‖∇f(zk)‖ ≤ (2/(ε√α)) λk‖∇f(zk)‖²,

because ‖∇f(zk)‖ ≥ ε/2 for all such k. We deduce that

‖∇f(zm)‖ ≤ ‖∇f(z_{νm})‖ + (2L/(ε√α)) ∑_{k=m}^{νm−1} λk‖∇f(zk)‖² < ε/2 + ε/2 = ε

by (6.9). This contradicts the assumption that ‖∇f(zm)‖ ≥ ε. □



If we suppose in Proposition 6.14 that the sequence (λn ) is bounded from below
by a positive constant, the proof is trivial, in view of Lemma 6.13.

So far, we have not required the objective function f to be convex.


Proposition 6.15. If Hypothesis 6.12 holds and f is convex, then every weak limit
point of the sequence (zn) minimizes f.
Proof. Take any u ∈ H and let z_{k_n} ⇀ z as n → ∞. By convexity,

f(u) ≥ f(z_{k_n}) + ⟨∇f(z_{k_n}), u − z_{k_n}⟩

for each n ∈ N. But (z_{k_n}) is bounded, lim_{n→∞} ‖∇f(z_{k_n})‖ = 0 and lim_{n→∞} f(z_{k_n}) ≥ f(z) by
weak lower-semicontinuity. We conclude that f(z) ≤ f(u). □


In the strongly convex case, we immediately obtain strong convergence:


Proposition 6.16. If Hypothesis 6.12 holds and f is strongly convex with parameter
μ, then the sequence (zn) converges strongly to the unique minimizer of f.
Proof. Let z* be the unique minimizer of f. The strong convexity implies

μ‖zn+1 − z*‖² ≤ ⟨∇f(zn+1) − ∇f(z*), zn+1 − z*⟩
 = ⟨∇f(zn+1), zn+1 − z*⟩
 ≤ ‖∇f(zn+1)‖ · ‖zn+1 − z*‖,

by Proposition 3.12. It follows that μ‖zn+1 − z*‖ ≤ ‖∇f(zn+1)‖. By Proposition
6.14, the right-hand side tends to 0 and so (zn) must converge to z*. □

Remark 6.17. Step size selection may have a great impact on the speed of con-
vergence. Several rules have been devised and are commonly used: the exact and
limited minimization rules, as well as the rules of Armijo, Goldstein and Wolfe (see
[24]). A more sophisticated approach was developed in [81]. See also [32].

Let us analyze the convergence of (zn ) for specific gradient-consistent methods.

6.3.1 The Gradient Method

Let us begin by the pure gradient method, given by



z0 ∈ H
(G)
zn+1 = zn − λn ∇ f (zn ) for n ∈ N.

This algorithm is gradient-consistent (take α = β = 1) and the condition on the step
sizes is simply

sup_{n∈N} λn < 2/L.
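A minimal sketch of (G) (ours, not from the book; the data are illustrative) for the least-squares function f(x) = ½‖Ax − b‖², whose gradient Aᵀ(Ax − b) is Lipschitz-continuous with constant L = ‖AᵀA‖, using the constant step size 1/L < 2/L:

import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0], [0.0, 1.0]])
b = np.array([1.0, 0.0, 2.0])

grad = lambda x: A.T @ (A @ x - b)
L = np.linalg.eigvalsh(A.T @ A).max()     # Lipschitz constant of the gradient
lam = 1.0 / L

x = np.zeros(2)
for _ in range(500):
    x = x - lam * grad(x)

print(x)                                   # close to the least-squares solution
print(np.linalg.lstsq(A, b, rcond=None)[0])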
Let us recall the following basic result on real sequences:
Lemma 6.18. Let (an) and (εn) be nonnegative sequences such that (εn) ∈ ℓ¹ and
an+1 − an ≤ εn for each n ∈ N. Then lim_{n→∞} an exists.

Proposition 6.19. Let (zn) satisfy (G), where f is convex, S ≠ ∅, (λn) ∉ ℓ¹ and
sup_{n∈N} λn < 2/L. Then (zn) converges weakly as n → ∞ to a point in S.
Proof. Take u ∈ S. For each n ∈ N we have 0 ≥ f(u) − f(zn) ≥ ⟨∇f(zn), u − zn⟩.
Therefore,

‖zn+1 − u‖² = ‖zn+1 − zn‖² + ‖zn − u‖² + 2⟨zn+1 − zn, zn − u⟩
 = λn²‖∇f(zn)‖² + ‖zn − u‖² + 2λn⟨∇f(zn), u − zn⟩
 ≤ λn²‖∇f(zn)‖² + ‖zn − u‖².

Since ∑_{n∈N} λn‖∇f(zn)‖² < ∞, also ∑_{n∈N} λn²‖∇f(zn)‖² < ∞. We deduce the exis-
tence of lim_{n→∞} ‖zn − u‖ by Lemma 6.18. We conclude using Proposition 6.15 and
Opial's Lemma 5.2. □


Using Lemma 5.3 we immediately obtain


Corollary 6.20. Let (zn) satisfy (GC) and assume that ∑_{n∈N} λn‖dn + ∇f(zn)‖ < ∞. If
(λn) ∉ ℓ¹ and sup_{n∈N} λn < 2/L, then (zn) converges weakly as n → ∞ to a point in S.
n∈N

For strong convergence we have the following:


Proposition 6.21. Let (zn) satisfy (G), where f is convex, S ≠ ∅, (λn) ∉ ℓ¹ and
sup_{n∈N} λn < 2/L. Then (zn) converges strongly as n → ∞ to a point in S if either f is
strongly convex; f is even; or int(argmin(f)) ≠ ∅.
6.3.2 Newton’s Method

Let f : H → R ∪ {+∞} and let z* ∈ S. Assume that f is twice Gâteaux-differentiable
in a neighborhood of z*, that its Hessian ∇²f is continuous in z*, and that the oper-
ator ∇²f(z*) is invertible. By Proposition 1.8, there exist a convex neighborhood U
of z* and a constant C > 0 such that f is strictly convex on U, ∇²f(z) is invertible
for all z ∈ U, and

sup_{z∈U} ‖[∇²f(z)]⁻¹‖_{L(H;H)} ≤ C.

Newton’s Method is defined by



z0 ∈ U
(NM)
zn+1 = zn − [∇2 f (zn )]−1 ∇ f (zn ) for n ∈ N.

Example 6.22. If f is quadratic, namely

f(x) = ⟨Ax, x⟩ + ⟨b, x⟩ + c,

then ∇²f(x) = A + A* for all x ∈ U = H. If the operator A + A* is invertible, then


Newton’s Method can be applied and converges in one iteration. Indeed, a straight-
forward computation shows that ∇ f (z1 ) = 0. We should point out that, in this set-
ting, computing this one iteration is equivalent to solving the optimality condition
∇ f (x) = 0.
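A minimal sketch of (NM) (ours, not from the book; the function is illustrative) for the smooth, strictly convex function f(x) = ∑ᵢ cosh(xᵢ) − ⟨b, x⟩ on R², whose Hessian diag(cosh(xᵢ)) is positive definite and easy to invert:

import numpy as np

b = np.array([0.5, -1.0])
grad = lambda x: np.sinh(x) - b
hess = lambda x: np.diag(np.cosh(x))      # positive definite everywhere

x = np.zeros(2)
for _ in range(8):
    x = x - np.linalg.solve(hess(x), grad(x))

print(x, np.linalg.norm(grad(x)))  # the gradient is ~ 0 after a handful of iterations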

In general, if the initial point z0 is sufficiently close to z∗ , Newton’s method
converges very fast. More precisely, we have the following:
Proposition 6.23. Let (zn ) be defined by Newton’s Method (NM) with z0 ∈ U. We
have the following:

i) For each ε ∈ (0, 1), there is R > 0 such that, if ‖zN − z*‖ ≤ R for some N ∈ N,
then

‖z_{N+m} − z*‖ ≤ R ε^m

for all m ∈ N.
ii) Assume ∇²f is Lipschitz-continuous on U with constant M and let ε ∈ (0, 1). If
‖zN − z*‖ ≤ 2ε/(MC) for some N ∈ N, then

‖z_{N+m} − z*‖ ≤ (2/(MC)) ε^{2^m}

for all m ∈ N.

Proof. Fix n ∈ N with zn ∈ U, and define g : [0, 1] → H by g(t) = ∇f(z* + t(zn − z*)).
We have

∇f(zn) = g(1) − g(0) = ∫₀¹ ġ(t) dt = ∫₀¹ ∇²f(z* + t(zn − z*))(zn − z*) dt.

It follows that

‖zn+1 − z*‖ = ‖zn − z* − [∇²f(zn)]⁻¹∇f(zn)‖
 = ‖[∇²f(zn)]⁻¹( ∇²f(zn)(zn − z*) − ∇f(zn) )‖
 ≤ C ‖ ∫₀¹ ( ∇²f(zn) − ∇²f(z* + t(zn − z*)) )(zn − z*) dt ‖
 ≤ C‖zn − z*‖ ∫₀¹ ‖∇²f(zn) − ∇²f(z* + t(zn − z*))‖_{L(H;H)} dt.    (6.10)

Take ε ∈ (0, 1) and pick R > 0 such that ‖∇²f(z) − ∇²f(z*)‖ ≤ ε/(2C) whenever
‖z − z*‖ ≤ R. We deduce that if ‖zN − z*‖ < R, then ‖zn+1 − z*‖ ≤ ε‖zn − z*‖ for all
n ≥ N, and this implies i).
For part ii), using (6.10) we obtain ‖zn+1 − z*‖ ≤ (MC/2)‖zn − z*‖². Given ε ∈ (0, 1),
if ‖zN − z*‖ ≤ 2ε/(MC), then ‖z_{N+m} − z*‖ ≤ (2/(MC)) ε^{2^m} for all m ∈ N. □


Despite its remarkable local speed of convergence, Newton’s method has some
noticeable drawbacks:

First, the fast convergence can only be granted once the iterates are sufficiently
close to a solution. One possible way to overcome this problem is to perform a few
iterations of another method in order to provide a warm starting point.

Next, at points where the Hessian is close to degeneracy, the method may have a
very erratic behavior.

Example 6.24. Consider the function f : R → R defined by f(z) = √(z² + 1). A
simple computation shows that Newton's method gives zn+1 = −zn³ for each n. The
problem here is that, since f is almost flat for large values of zn, the method predicts
that the minimizers should be very far.

Some alternatives are the following:
• Adding a small constant λn (a step size), and computing

zn+1 = zn − λn [∇2 f (zn )]−1 ∇ f (zn )

can help to avoid very large displacements. The constant λn has to be carefully
selected: If it is too large, it will not help; while, if it is too small, convergence
may be very slow.
• Another option is to force the Hessian away from degeneracy by adding a uni-
formly elliptic term, and compute

zn+1 = zn − [∇2 f (zn ) + ε I]−1 ∇ f (zn ),

where ε > 0 and I denotes the identity operator.

Finally, each iteration has a high computational complexity due to the inversion
of the Hessian operator. A popular solution for this is to update the Hessian every
once in a while. More precisely, compute [∇2 f (zN )]−1 and compute

zn+1 = zn − λn [∇2 f (zN )]−1 ∇ f (zn )

for the next pN iterations. After that, one computes [∇2 f (zN+pN )]−1 and continues
iterating in a similar fashion.

6.4 Some Comments on Extensions and Variants

Many constrained problems can be reinterpreted, reduced or simplified, so they can


be more efficiently solved. Moreover, in some cases, this preprocessing is even nec-
essary, in order to make problems computationally tractable.

6.4.1 Additive Splitting: Proximal and Gradient Methods

The proximal point algorithm is suitable for solving minimization problems that
involve nonsmooth objective functions, but can be hard to implement. If the objec-
tive function is sufficiently regular, it may be more effective and efficient to use
gradient-consistent methods, which are easier to implement; or Newton’s method,
which converges remarkably fast. Let f = f1 + f2. If the problem involves smooth and
nonsmooth features (say, f1 is smooth but f2 is not), it is possible to combine these
approaches and construct better adapted methods. If neither f1 nor f2 is smooth, it

may still be useful to treat them separately since it may be much harder to imple-
ment the proximal iterations for the sum f1 + f2 , than for f1 and f2 independently.
A thorough review of these methods can be found in [21].

Forward–Backward, Projected Gradient and Tseng’s Algorithm

For instance, suppose that f1 : X → R is smooth, while f2 : X → R ∪ {+∞} is only


proper, convex and lower-semicontinuous. We have seen in previous sections that
the proximal point algorithm is suitable for the minimization of f2 , but harder to
implement; while the gradient method is simpler and effective for the minimization
of f1 . One can combine the two methods as:

xn+1 = γ(I + λn ∂f2)⁻¹(xn − λn ∇f1(xn)) + (1 − γ)xn,   γ ∈ (0, 1]

(using γ ∈ (0, 1), rather than γ = 1, is sometimes useful for stabilization purposes).

This scheme is known as the forward–backward algorithm. It is a generalization of


the projected gradient method introduced by [59] and [73], which corresponds to
the case f2 = δC , where C is a nonempty, closed and convex subset of H, and reads

xn+1 = PC (xn − λn ∇ f1 (xn )).

Example 6.25. For the structured problem with affine constraints

min{ φ(x) + ψ(y) : Ax + By = c },

the projected gradient method reads

(xn+1, yn+1) = PV(xn − λn ∇φ(xn), yn − λn ∇ψ(yn)),

where V = { (x, y) : Ax + By = c }.

Example 6.26. For the problem

min{ γ‖v‖₁ + ‖Av − b‖² : v ∈ Rⁿ },

the forward–backward algorithm gives the iterative shrinkage/thresholding algorithm (ISTA):

xn+1 = (I + λγ ∂‖·‖₁)⁻¹(xn − 2λ A*(Axn − b)),

where

[(I + λγ ∂‖·‖₁)⁻¹(v)]_i = (|v_i| − λγ)₊ sgn(v_i),   i = 1, . . . , n

(see [22, 23, 42]). See also [104].
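A minimal ISTA sketch (ours, not taken from [22, 23, 42]; the data and names are illustrative): a gradient step on the smooth part ‖Av − b‖² followed by the soft-thresholding (proximal) step on the ℓ1 part.

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 50))
v_true = np.zeros(50); v_true[[3, 17, 40]] = [1.0, -2.0, 0.5]   # sparse signal
b = A @ v_true
gamma = 0.1

lam = 1.0 / (2 * np.linalg.norm(A, 2)**2)    # step size below 2/L, L = 2*||A||^2
v = np.zeros(50)
for _ in range(2000):
    w = v - 2 * lam * A.T @ (A @ v - b)                        # forward (gradient) step
    v = np.sign(w) * np.maximum(np.abs(w) - lam * gamma, 0.0)  # backward (prox) step
print(np.argsort(-np.abs(v))[:3])   # indices of the largest entries; typically the true support {3, 17, 40}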

A variant of the forward–backward algorithm was proposed by Tseng [101] for
min{ f1(x) + f2(x) : x ∈ C }. It includes a second gradient subiteration and closes
with a projection onto C:




yn = xn − λn ∇f1(xn)
zn = (I + λn ∂f2)⁻¹(yn)
wn = zn − λn ∇f1(zn)
xn+1 = PC(xn + (wn − yn)).

Double-Backward, Alternating Projections and Douglas-Rachford Algorithm

If neither f1 nor f2 is smooth, an alternative is to use a double-backward approach:

xn+1 = γ (I + λn ∂ f2 )−1 (I + λn ∂ f1 )−1 (xn ) + (1 − γ )xn , γ ∈ (0, 1].

Example 6.27. For example, in order to find the intersection of two closed convex
sets, C1 and C2 , this gives (with γ = 1) the method of alternating projections:

xn+1 = PC2 PC1 (xn ),

studied in [40, 61, 102], among others. As one should expect, sequences produced
by this method converge weakly to points in C1 ∩ C2 . Convergence is strong if C1
and C2 are closed, affine subspaces.

A modification of the double-backward method, including an over-relaxation
between the backward subiterations, gives the Douglas-Rachford [49] algorithm:

yn = (I + λn ∂f1)⁻¹(xn)
zn = (I + λn ∂f2)⁻¹(2yn − xn)
xn+1 = xn + γ(zn − yn),   γ ∈ (0, 2).
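A minimal sketch of the Douglas–Rachford iteration above (ours, not from the book) for two closed convex subsets of R², taking f1 and f2 to be indicator functions so that both resolvents are projections:

import numpy as np

def proj_C1(x):
    # C1: the closed unit ball.
    n = np.linalg.norm(x)
    return x if n <= 1.0 else x / n

def proj_C2(x):
    # C2: the horizontal line {x2 = 1/2}; C1 ∩ C2 is a nontrivial segment.
    return np.array([x[0], 0.5])

x = np.array([5.0, -4.0])
gamma = 1.0
for _ in range(500):
    y = proj_C1(x)                 # y_n = (I + lambda ∂f1)^{-1}(x_n)
    z = proj_C2(2 * y - x)         # z_n = (I + lambda ∂f2)^{-1}(2 y_n - x_n)
    x = x + gamma * (z - y)
print(proj_C1(x))   # numerically a point of C1 ∩ C2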

6.4.2 Duality and Penalization

Multiplier Methods

Let f , g : H → R ∪ {+∞} be proper, convex and lower-semicontinuous, and recall


from Sect. 3.7 that solutions for min{ f (x) : g(x) ≤ 0} are saddle points of the
Lagrangian L(x, μ ) = f (x) + μ g(x). This suggests that an iterative procedure based
on minimization with respect to the primal variable x, and maximization with respect
to the dual variable μ , may be an effective strategy for solving this problem. Several
methods are based on the augmented Lagrangian

Lr(x, μ) = f(x) + μ g(x) + r g(x)²,

where r > 0. This idea was introduced in [63] and [89], and combined with proximal
iterations in [92]. Similar approaches can be used for the affinely constrained prob-

lem min{ f(x) : Ax = b }, whose Lagrangian is L(x, y*) = f(x) + ⟨y*, Ax − b⟩. More-
over, for a structured problem min{ f1 (x) + f2 (y) : Ax + By = c}, one can use alter-
nating methods of multipliers. This approach has been followed by [15, 103] and
[38] (see also [52, 62] and [37]). In the latter, a multiplier prediction step is added
and implementability is enhanced. See also [27, 97], and the references therein, for
further commentaries on augmented Lagrangian methods.
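As a minimal sketch of the multiplier idea for the affinely constrained case (ours, a toy example not taken from the cited papers): for min{ ½‖x‖² : ⟨a, x⟩ = 1 }, alternately minimize the augmented Lagrangian in x and update the multiplier with the constraint residual.

import numpy as np

a = np.array([1.0, 2.0])
r, y = 1.0, 0.0
x = np.zeros(2)
for _ in range(50):
    # x-step: argmin_x ||x||^2/2 + y*(<a,x> - 1) + (r/2)*(<a,x> - 1)^2,
    # i.e. solve (I + r a a^T) x = (r - y) a.
    x = np.linalg.solve(np.eye(2) + r * np.outer(a, a), (r - y) * a)
    y = y + r * (a @ x - 1.0)            # multiplier update
print(x, a @ x)   # x ~ a / ||a||^2 = (0.2, 0.4), and <a, x> ~ 1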

Penalization

Recall, from Remark 5.4, that the penalization technique is based on the idea of
progressively replacing f + δC by a family ( fν ), in such a way that the minimizers
of fν approximate the minimizers of f + δC , as ν → ν0 . Iterative procedures can be
combined with penalization schemes in a diagonal fashion. Starting from a point
x0 ∈ H, build a sequence (xn ) as follows: At the n-th iteration, you have a point
xn . Choose a value νn for the penalization parameter, and apply one iteration of the
algorithm of your choice (for instance, the proximal point algorithm or the gradient
method), to obtain xn+1 . Update the value of your parameter to νn+1 , and iterate.

Some interior penalties: Let f, g : H → R be convex (thus continuous), and set C =
{x ∈ H : g(x) ≤ 0}. Consider the problem min{ f(x) : x ∈ C }. Let θ : R → R ∪ {+∞}
be a convex, increasing function such that lim_{u→0} θ(u) = +∞ and lim_{u→−∞} θ(u)/u = 0. For
ν > 0, define fν : H → R ∪ {+∞} by fν(x) = f(x) + νθ(g(x)/ν). Then dom(fν) =
{x ∈ H : g(x) < 0} = int(C) for all ν > 0, and lim_{ν→0} fν(x) = f(x) for all x ∈ int(C).
Example 6.28. The following functions satisfy the hypotheses described above:
1. Logarithmic barrier: θ (u) = − ln(−u) for u < 0 and θ (u) = +∞ if u ≥ 0.
2. Inverse barrier: θ (u) = −1/u for u < 0 and θ (u) = +∞ if u ≥ 0.
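A toy sketch (ours, not from [43]) of the logarithmic barrier of Example 6.28 at work: minimize f(x) = (x − 2)² over C = {x ≤ 1}, i.e. g(x) = x − 1, by minimizing fν = f + νθ(g(·)/ν) over int(C) for decreasing values of ν; the penalized minimizers approach the constrained solution x = 1.

import numpy as np

x = np.linspace(-1.0, 1.0 - 1e-9, 400001)       # a fine grid inside int(C)
f = (x - 2.0)**2
theta = lambda u: -np.log(-u)                   # logarithmic barrier

for nu in [1.0, 0.1, 0.01, 0.001]:
    x_nu = x[np.argmin(f + nu * theta((x - 1.0) / nu))]
    print(f"nu = {nu:6.3f}  ->  x_nu ~ {x_nu:.4f}")
# The penalized minimizers x_nu increase toward the constrained solution x = 1.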

In [43], the authors consider a number of penalization schemes coupled with the
proximal point algorithm (see also [2, 16] and the references in [43]). Other related
approaches include [17, 65] and [51].

Exterior penalties: Let f : H → R be convex and let Ψ be an exterior penalization


function for C. Depending on the regularity of f and Ψ , possibilities for solving
min{ f (x) : x ∈ C} include proximal and double backward methods ([12]):

xn+1 = (I + λn ∂ f + λn βn ∂Ψ )−1 (xn ), or xn+1 = (I + λn ∂ f )−1 (I + λn βn ∂Ψ )−1 (xn ),

a forward–backward algorithm ([13] and [83]):


 
xn+1 = (I + λn ∂f)⁻¹(xn − λn βn ∇Ψ(xn)),

or a gradient method ([86]):

xn+1 = xn − λn ∇ f (xn ) − λn βn ∇Ψ (xn ).



For instance, if C = {x ∈ H : Ax = b}, we can write Ψ(x) = ½‖Ax − b‖²_K, so that
∇Ψ(x) = A*(Ax − b) and lim_{β→+∞} βΨ(x) = δC(x).
Alternating variants have been studied in [8, 11, 14], among others. See also [57],
where the penalization parameter is interpreted as a Lagrange multiplier. Related
methods using an exponential penalization function have been studied, for instance,
in [3, 4, 44] and [75].
References

1. Adams R. Sobolev spaces. Academic; 1975.


2. Alart P, Lemaire B. Penalization in nonclassical convex programming via variational con-
vergence. Math Program. 1991;51:307–31.
3. Álvarez F. Absolute minimizer in convex programming by exponential penalty. J Convex
Anal. 2000;7:197–202.
4. Alvarez F, Cominetti R. Primal and dual convergence of a proximal point exponential penalty
method for linear programming. Math Program. 2002;93:87–96.
5. Álvarez F, Peypouquet J. Asymptotic equivalence and Kobayashi-type estimates for nonau-
tonomous monotone operators in Banach spaces. Discret Contin Dynam Syst. 2009;25:1109–
28.
6. Álvarez F, Peypouquet J. Asymptotic almost-equivalence of Lipschitz evolution systems in
Banach spaces. Nonlinear Anal. 2010;73:3018–33.
7. Álvarez F, Peypouquet J. A unified approach to the asymptotic almost-equivalence of evolu-
tion systems without Lipschitz conditions. Nonlinear Anal. 2011;74:3440–4.
8. Attouch H, Bolte J, Redont P, Soubeyran A. Alternating proximal algorithms for weakly cou-
pled convex minimization problems, Applications to dynamical games and PDE’s. J Convex
Anal. 2008;15:485–506.
9. Attouch H, Brézis H. Duality for the sum of convex functions in general Banach spaces. In:
Barroso J, editor. Aspects of mathematics and its applications. Amsterdam: North-Holland
Publishing Co; 1986. pp. 125–33.
10. Attouch H, Butazzo G, Michaille G. Variational analysis in Sobolev and BV spaces: applica-
tions to PDEs and optimization. Philadelphia: SIAM; 2005.
11. Attouch H, Cabot A, Frankel P, Peypouquet J. Alternating proximal algorithms for con-
strained variational inequalities. Application to domain decomposition for PDE’s. Nonlinear
Anal. 2011;74:7455–73.
12. Attouch H, Czarnecki MO, Peypouquet J. Prox-penalization and splitting methods for con-
strained variational problems. SIAM J Optim. 2011;21:149–73.
13. Attouch H, Czarnecki MO, Peypouquet J. Coupling forward-backward with penalty schemes
and parallel splitting for constrained variational inequalities. SIAM J Optim. 2011;21:1251–
74.
14. Attouch H, Redont P, Soubeyran A. A new class of alternating proximal minimization algo-
rithms with costs-to-move. SIAM J Optim. 2007;18:1061–81.
15. Attouch H, Soueycatt M. Augmented Lagrangian and proximal alternating direction methods
of multipliers in Hilbert spaces. Applications to games, PDE’s and control. Pacific J Opt.
2009;5:17–37.
16. Auslender A, Crouzeix JP, Fedit P. Penalty-proximal methods in convex programming. J
Optim Theory Appl. 1987;55:1–21.


17. Auslender A, Teboulle M. Interior gradient and proximal methods for convex and conic opti-
mization. SIAM J Optim. 2006;16:697–725.
18. Bachmann G, Narici L, Beckenstein E. Fourier and Wavelet Analysis. New York: Springer;
2002.
19. Baillon JB. Comportement asymptotique des contractions et semi-groupes de contraction.
Thèse. Université Paris 6; 1978.
20. Baillon JB. Un exemple concernant le comportement asymptotique de la solution du
problème du/dt + ∂φ(u) ∋ 0. J Funct Anal. 1978;28:369–76.
21. Bauschke H, Combettes PL. Convex analysis and monotone operator theory in Hilbert
spaces. New York: Springer; 2011.
22. Beck A, Teboulle M. A fast iterative shrinkage-thresholding algorithm for linear inverse
problems. SIAM J Imaging Sci. 2009;2:183–202.
23. Beck A, Teboulle M. Gradient-based algorithms with applications in signal recovery prob-
lems. Convex optimization in signal processing and communications. Cambridge: Cam-
bridge University Press; 2010. pp. 33–88.
24. Bertsekas D. Nonlinear programming. Belmont: Athena Scientific; 1999.
25. Borwein J. A note on ε -subgradients and maximal monotonicity. Pac J Math. 1982;103:307–
14.
26. Borwein J, Lewis A. Convex analysis and nonlinear optimization. New York: Springer; 2010.
27. Boyd S, Parikh N, Chu E, Peleato B, Eckstein J. Distributed optimization and statistical learn-
ing via the alternating direction method of multipliers. Found Trends Mach Learn. 2011;3:1–
122.
28. Brenner SC, Scott LR. The mathematical theory of finite element methods. 3rd ed. New
York: Springer; 2008.
29. Brézis H. Opérateurs maximaux monotones et semi-groupes de contractions dans les espaces
de Hilbert. Amsterdam: North Holland Publishing Co; 1973.
30. Brézis H. Functional analysis, Sobolev spaces and partial differential equations. New York:
Springer; 2011.
31. Brézis H, Lions PL. Produits infinis de résolvantes. Israel J Math. 1978;29:329–45.
32. Burachik R, Grana Drummond LM, Iusem AN, Svaiter BF. Full convergence of the steepest
descent method with inexact line searches. Optimization. 1995;32:137–46.
33. Cartan H. Differential calculus. London: Kershaw Publishing Co; 1971.
34. Candès EJ, Romberg JK, Tao T. Robust uncertainty principles: exact signal reconstruction
from highly incomplete frequency information. IEEE Trans Inform Theory. 2006;52:489–
509.
35. Candès EJ, Romberg JK, Tao T. Stable signal recovery from incomplete and inaccurate mea-
surements. Comm Pure Appl Math. 2006;59:1207–23.
36. Cauchy AL. Méthode générale pour la résolution des systèmes d’équations simultanées. C R
Acad Sci Paris. 1847;25:536–8.
37. Chambolle A, Pock T. A first-order primal-dual algorithm for convex problems with appli-
cations to imaging. J Math Imaging Vis. 2010;40:1–26.
38. Chen G, Teboulle M. A proximal-based decomposition method for convex minimization
problems. Math Program. 1994;64:81–101.
39. Chen SS, Donoho DL, Saunders MA. Atomic decomposition by basis pursuit. SIAM J Sci
Comput. 1998;20:33–61.
40. Cheney W, Goldstein AA. Proximity maps for convex sets. Proc Amer Math Soc.
1959;10:448–50.
41. Ciarlet PG. The finite element method for elliptic problems. Amsterdam: North Holland
Publishing Co; 1978.
42. Combettes PL, Wajs VR. Signal recovery by proximal forward-backward splitting. Multi-
scale Model Simul. 2005;4:1168–200.
43. Cominetti R, Courdurier M. Coupling general penalty schemes for convex programming with
the steepest descent method and the proximal point algorithm. SIAM J Opt. 2002;13:745–65.
44. Cominetti R, San Martín J. Asymptotic analysis of the exponential penalty trajectory in linear
programming. Math Program. 1994;67:169–87.

45. Crandall MG, Liggett TM. Generation of semigroups of nonlinear transformations on general
Banach spaces. Am J Math. 1971;93:265–98.
46. Daubechies I. Ten lectures on wavelets. Philadelphia: SIAM; 1992.
47. Donoho DL. Compressed sensing. IEEE Trans Inform Theory. 2006;52:1289–306.
48. Donoho DL, Tanner J. Sparse nonnegative solutions of underdetermined linear equations by
linear programming. Proc Natl Acad Sci U S A. 2005;102:9446–51.
49. Douglas J Jr, Rachford HH Jr. On the numerical solution of heat conduction problems in two
and three space variables. Trans Am Math Soc. 1956;82:421–39.
50. Dunford N, Schwartz J. Linear operators. Part I. General theory. New York: Wiley; 1988.
51. Eckstein J. Nonlinear proximal point algorithms using Bregman functions, with applications
to convex programming. Math Oper Res. 1993;18:202–26.
52. Eckstein J. Some saddle-function splitting methods for convex programming. Opt Methods
Softw. 1994;4:75–83.
53. Ekeland I, Temam R. Convex analysis and variational problems. Philadelphia: SIAM; 1999.
54. Evans L. Partial differential equations. 2nd edn. Providence: Graduate Studies in Mathemat-
ics, AMS; 2010.
55. Ferris M. Finite termination of the proximal point algorithm. Math Program. 1991;50:359–
66.
56. Folland G. Fourier analysis and its applications. Pacific Grove: Wadsworth & Brooks/Cole;
1992.
57. Frankel P, Peypouquet J. Lagrangian-penalization algorithm for constrained optimization
and variational inequalities. Set-Valued Var Anal. 2012;20:169–85.
58. Friedlander MP, Tseng P. Exact regularization of convex programs. SIAM J Optim.
2007;18:1326–50.
59. Goldstein AA. Convex programming in Hilbert space. Bull Am Math Soc. 1964;70:709–10.
60. Güler O. On the convergence of the proximal point algorithm for convex minimization. SIAM
J Control Optim. 1991;29:403–19.
61. Halperin I. The product of projection operators. Acta Sci Math (Szeged). 1962;23:96–9.
62. He B, Liao L, Han D, Yang H. A new inexact alternating directions method for monotone
variational inequalities. Math Program. 2002;92:103–18.
63. Hestenes MR. Multiplier and gradient methods. J Opt Theory Appl. 1969;4:303–20.
64. Hiriart-Urruty JB, Lemaréchal C. Fundamentals of convex analysis. New York: Springer;
2001.
65. Iusem A, Svaiter BF, Teboulle M. Entropy-like proximal methods in convex programming.
Math Oper Res. 1994;19:790–814.
66. Izmailov A, Solodov M. Optimization, Volume 1: optimality conditions, elements of convex
analysis and duality. Rio de Janeiro: IMPA; 2005.
67. Izmailov A, Solodov M. Optimization, Volume 2: computational methods. Rio de Janeiro:
IMPA; 2007.
68. James R. Reflexivity and the supremum of linear functionals. Israel J Math. 1972;13:289–
300.
69. Kinderlehrer D, Stampacchia G. An introduction to variational inequalities and their appli-
cations. New York: Academic; 1980.
70. Kobayashi Y. Difference approximation of Cauchy problems for quasi-dissipative operators
and generation of nonlinear semigroups. J Math Soc Japan. 1975;27:640–65.
71. Krasnosel’skii MA. Solution of equations involving adjoint operators by successive approx-
imations. Uspekhi Mat Nauk. 1960;15:161–5.
72. Kryanev AV. The solution of incorrectly posed problems by methods of successive approxi-
mations. Dokl Akad Nauk SSSR. 1973;210:20–2. Soviet Math Dokl. 1973;14:673–6.
73. Levitin ES, Polyak BT. Constrained minimization methods. USSR Comp Math Math Phys.
1966;6:1–50.
74. Lions JL. Quelques méthodes de résolution des problèmes aux limites non linéaires. Paris:
Dunod; 2002.
75. Mangasarian OL, Wild EW. Exactness conditions for a convex differentiable exterior penalty
for linear programming. Optimization. 2011;60:3–14.
76. Martinet B. Régularisation d’inéquations variationnelles par approximations successives.
Rev Française Informat Recherche Opérationnelle. 1970;4:154–8.
77. Moreau JJ. Décomposition orthogonale d’un espace hilbertien selon deux cônes mutuellement
polaires. C R Acad Sci Paris. 1962;255:238–40.
78. Moreau JJ. Fonctions convexes duales et points proximaux dans un espace hilbertien. C R
Acad Sci Paris. 1962;255:2897–9.
79. Moreau JJ. Propriétés des applications “prox”. C R Acad Sci Paris. 1963;256:1069–71.
80. Moreau JJ. Proximité et dualité dans un espace hilbertien. Bull Soc Math France.
1965;93:273–99.
81. Nesterov YE. A method for solving the convex programming problem with convergence rate
O(1/k²). Dokl Akad Nauk SSSR. 1983;269:543–7.
82. Nocedal J, Wright S. Numerical optimization. 2nd ed. New York: Springer; 2006.
83. Noun N, Peypouquet J. Forward-backward-penalty scheme for constrained convex minimiza-
tion without inf-compactness. J Optim Theory Appl. 2013;158:787–95.
84. Opial Z. Weak convergence of the sequence of successive approximations for nonexpansive
mappings. Bull Amer Math Soc. 1967;73:591–7.
85. Peypouquet J. Asymptotic convergence to the optimal value of diagonal proximal iterations
in convex minimization. J Convex Anal. 2009;16:277–86.
86. Peypouquet J. Coupling the gradient method with a general exterior penalization scheme for
convex minimization. J Optim Theory Appl. 2012;153:123–38.
87. Peypouquet J, Sorin S. Evolution equations for maximal monotone operators: asymptotic
analysis in continuous and discrete time. J Convex Anal. 2010;17:1113–63.
88. Peypouquet J. Optimización y sistemas dinámicos. Caracas: Ediciones IVIC; 2013.
89. Powell MJD. A method for nonlinear constraints in minimization problems. In: Fletcher R,
editor. Optimization. New York: Academic; 1969. pp. 283–98.
90. Raviart PA, Thomas JM. Introduction à l’analyse numérique des équations aux dérivées par-
tielles. Paris: Masson; 1983.
91. Rockafellar RT. Convex analysis. Princeton: Princeton University Press; 1970.
92. Rockafellar RT. Augmented Lagrangians and applications of the proximal point algorithm in
convex programming. Math Oper Res. 1976;1:97–116.
93. Rockafellar RT. Monotone operators and the proximal point algorithm. SIAM J Control
Optim. 1976;14:877–98.
94. Rudin W. Functional analysis. Singapore: McGraw-Hill Inc; 1991.
95. Sagastizábal C, Solodov M. An infeasible bundle method for nonsmooth convex constrained
optimization without a penalty function or a filter. SIAM J Optim. 2005;16:146–69.
96. Schaefer HH, Wolff MP. Topological vector spaces. New York: Springer; 1999.
97. Shefi R, Teboulle M. Rate of convergence analysis of decomposition methods based on the
proximal method of multipliers for convex minimization. SIAM J Optim. 2014;24:269–97.
98. Solodov MV, Svaiter BF. Forcing strong convergence of proximal point iterations in a Hilbert
space. Math Program. 2000;87:189–202.
99. Tibshirani R. Regression shrinkage and selection via the Lasso. J Royal Stat Soc Ser B.
1996;58:267–88.
100. Tolstov GP. Fourier series. Mineola: Dover; 1976.
101. Tseng P. A modified forward-backward splitting method for maximal monotone mappings.
SIAM J Control Optim. 2000;38:431–46.
102. von Neumann J. On rings of operators. Reduction theory. Ann Math. 1949;50:401–85.
103. Xu MH. Proximal alternating directions method for structured variational inequalities. J
Optim Theory Appl. 2007;134:107–17.
104. Yang J, Zhang Y. Alternating direction algorithms for ℓ1-problems in compressive sensing.
SIAM J Sci Comput. 2011;33:250–78.
105. Zalinescu C. Convex analysis in general vector spaces. Singapore: World Scientific; 2002.
Index
alternating projection method, 115
alternating projections, 90
annihilator, 5
Attouch-Brézis condition, 48
augmented Lagrangian, 115

Baillon-Haddad Theorem, 40
Banach-Steinhaus Uniform Boundedness Principle, 4
barrier, 88
Bishop-Phelps Theorem, 66
Boltzmann-Shannon Entropy, 54
bounded operator, 2
bounded set, 1
Brøndsted-Rockafellar Theorem, 49

Calculus of Variations, 72
canonical embedding, 5
Cauchy sequence, 2
Chain Rule, 46
coercive function, 32
completeness, 2
controllable system, 67
convergence
    strong, 2
    weak, 11
convex function, 30
    strictly, 30
    strongly, 30
convex set, 6
counting norm, 79

Descent Lemma, 15
differentiable
    in the sense of Fréchet, 14
    in the sense of Gâteaux, 14
directional derivative, 14
domain, 25
    of the subdifferential, 41
double-backward algorithm, 115
Douglas-Rachford algorithm, 115
dual problem, 56, 62
dual solution, 56, 63
duality gap, 56, 63
duality mapping, 9

epigraph, 26
Euler-Lagrange Equation, 72, 73
extended real numbers, 25

feasible problem, 68
Fenchel conjugate, 53
Fenchel-Young Inequality, 54
Fermat’s Rule, 17, 43
forward-backward algorithm, 114
function
    biconjugate, 57
    cocoercive, 40
    harmonic, 78
    indicator, 25
    inf-compact, 27
    Legendre, 88
    lower-semicontinuous, 26
    proper, 25
    quasi-convex, 30
    sequentially inf-compact, 30
    sequentially lower-semicontinuous, 29
    subdifferentiable, 41
    superharmonic, 78
    support, 55
functional
    bounded linear, 5
    support, 9

gradient method, 95

Hahn-Banach Extension Theorem, 8
Hahn-Banach Separation Theorem, 6
Hessian, 16

inequality
    Cauchy-Schwarz, 19
    triangle, 19
inner product, 18
inverse barrier, 116
ISTA, 80, 114
iterative algorithm, 85

kernel, 3

Lagrange multipliers, 63
Lagrangian, 62, 64
LASSO method, 80
Legendre-Fenchel Reciprocity Formula, 58
linear control system, 67
linear error bound, 103
linear programming problem, 57
linear-quadratic control problem, 69
logarithmic barrier, 116

minimizing sequence, 30
monotone (see monotonicity), 43
monotonicity, 43
Moreau envelope, 51
Moreau-Rockafellar Theorem, 47
Moreau-Yosida Regularization, 50

Newton’s Method, 111
norm, 1, 19
normal cone, 42
normed space, 1

operator, 2
    p-Laplacian, 79
    adjoint, 6
    inverse, 3
    invertible, 3
    positive definite, 18, 39
    positive semidefinite, 18, 38
    uniformly elliptic, 18, 39
Opial’s Lemma, 86
Optimal Control, 67
optimal control problem, 67
orthogonal projection, 22
orthogonal vectors, 19

parallel vectors, 19
Parallelogram Identity, 20
penalization, 88
Poincaré’s Inequalities, 75
primal problem, 56, 62
primal solution, 56, 62
projected gradient algorithm, 114
projection, 22
proximal point algorithm, 95
proximity operator, 51

range, 3
resolvent, 51, 96
Riesz Representation Theorem, 22
Ritz’s Theorem, 82
rocket-car problem, 71

saddle point, 63
sequentially closed set, 12
space
    Banach, 2
    Hausdorff, 10, 26
    Hilbert, 19
    orthogonal, 5
    reflexive, 5
    topological bidual, 5
    topological dual, 5
sparse solutions, 79
stable signal recovery, 80
Stampacchia’s Theorem, 74
Steepest Descent, 94
step sizes, 94
subdifferential, 41
    approximate, 45
subgradient, 41
    approximate, 44
sublevel set, 25

target set, 67
Taylor approximation, 16
topology
    discrete, 10
    of pointwise convergence, 13
    strong, 2
    weak, 10
    weak∗, 13
