(Applied Optimization 87) Yurii Nesterov (Auth.) - Introductory Lectures On Convex Optimization - A Basic Course-Springer US (2004)

INTRODUCTORY LECTURES ON
CONVEX OPTIMIZATION
A Basic Course
Applied Optimization
Volume 87
Series Editors:
Panos M. Pardalos
University ofFlorida, US.A.
Donald W. Heam
University ofFlorida, US.A.
INTRODUCTORY LECTURES ON
CONVEX OPTIMIZATION
A Basic Course
By
Yurii Nesterov
Center of Operations Research and Econometrics, (CORE)
Universite Catholique de Louvain (UCL)
Louvain-la-Neuve, Belgium
''
~·
Springer Science+Business Media, LLC
Library of Congress Cataloging-in-PubHcation
Nesterov, Yurri
Introductory Lectures on Convex Optimization: A Basic Course
ISBN 978-1-4613-4691-3 ISBN 978-1-4419-8853-9 (eBook)
DOI 10.1007/978-1-4419-8853-9
Copyright Cl 2004 by Springer Science+Business Media New York

Originally published by Kluwer Academic Publishers in 2004
Softcover reprint of the hardcover 1st edition 2004
All rights reserved. No part ofthis publication may be reproduced, stored in a retrieval system
or transmitted in any form or by any means, electronic, mechanical, photo-copying,
microfilming, recording, or otherwise, without the prior written perrnission ofthe publisher, with
the exception ofany material supplied specifrcally for the purpose ofbeing entered and executed
on a computer system, for exclusive use by the purchaser ofthe work.
Permissions for books published in the USA: permj ssi ons®wkap com
Permissions for books published in Europe: permissions@wkap.nl
Printedon acid-free paper.
Contents
Preface lX
Acknowledgments xiii
Introduction XV
1. NONLINEAR OPTIMIZATION 1
1.1 World of nonlinear optimization 1
1.1.1 General formulation of the problern 1
1.1.2 Performance of numerical methods 4
1.1.3 Complexity bounds for global optimization 7
1.1.4 Identity cards of the fields 13
1.2 Local methods in unconstrained minimization 15
1.2.1 Relaxation and approximation 15
1.2.2 Classes of differentiable functions 20
1.2.3 Gradient method 25
1.2.4 Newton method 32
1.3 First-order methods in nonlinear optimization 37
1.3.1 Gradient method and Newton method:
What is different? 37
1.3.2 Conjugate gradients 42
1.3.3 Constrained minimization 46
2. SMOOTH CONVEX OPTIMIZATION 51
2.1 Minimization of smooth functions 51
2.1.1 Smooth convex functions 51
2.1.2 Lower complexity bounds for :FJ:• 1 (Rn) 58
2.1.3 Strongly convex functions 63
2.1.4 Lower complexity bounds for s:,J}(Rn) 66
vi INTRODUCTORY LEGTURES ON CONVEX OPTIMIZATION
2.1.5 Gradient method 68

2.2 Optimal Methods 71
2.2.1 Optimal methods 71
2.2.2 Convex sets 81
2.2.3 Gradient mapping 86
2.2.4 Minimization methods for simple sets 87
2.3 Minimization problern with smooth components 90
2.3.1 Minimax problern 90
2.3.2 Gradient mapping 93
2.3.3 Minimization methods for minimaxproblern 96
2.3.4 Optimization with functional constraints 100
2.3.5 Method for constrained minimization 105
3. NONSMOOTH CONVEX OPTIMIZATION 111
3.1 General convex functions 111
3.1.1 Motivation and definitions 111
3.1.2 Operations with convex functions 117
3.1.3 Continuity and differentiability 121
3.1.4 Separation theorems 124
3.1.5 Subgradients 126
3.1.6 Computing subgradients 130
3.2 Nonsmooth minimization methods 135
3.2.1 Generallower complexity bounds 135
3.2.2 Main lemma 138
3.2.3 Subgradient method 141
3.2.4 Minimization with functional constraints 144
3.2.5 Complexity bounds in finite dimension 146
3.2.6 Cutting plane schemes 149
3.3 Methods with complete data 156
3.3.1 Model of nonsmooth function 157
3.3.2 Kelley method 158
3.3.3 Level method 160
3.3.4 Constrained minimization 164
4. STRUCTURAL OPTIMIZATION 171
4.1 Self-concordant functions 171
4.1.1 Black box concept in convex optimization 171
4.1.2 What the Newton method actually does? 173
4.1.3 Definition of self-concordant function 175
Contents vii
4.1.4 Main inequalities 181

4.1.5 Minimizing the self-concordant function 187
4.2 Self-concordant barriers 192
4.2.1 Motivation 192
4.2.2 Definition of self-concordant barriers 193
4.2.3 Main inequalities 196
4.2.4 Path-following scheme 199
4.2.5 Finding the analytic center 203
4.2.6 Problems with functional constraints 206
4.3 Applications of structural optimization 210
4.3.1 Bounds on parameters of self-concordant barriers 210
4.3.2 Linear and quadratic optimization 213
4.3.3 Semidefinite optimization 216
4.3.4 Extremal ellipsoids 220
4.3.5 Separable optimization 224
4.3.6 Choice of minimization scheme 227
Bibliography 231
References 233
Index
235
Preface
It was in the middle of the 1980s, when the seminal paper by Kar-
markar opened a new epoch in nonlinear optimization. The importance
of this paper, containing a new polynomial-time algorithm for linear op-
timization problems, was not only in its complexity bound. At that time,
the most surprising feature of this algorithm was that the theoretical pre-
diction of its high efficiency was supported by excellent computational
results. This unusual fact dramatically changed the style and direc-
tions of the research in nonlinear optimization. Thereafter it became
more and more common that the new methods were provided with a
complexity analysis, which was considered a better justification of their
efficiency than computational experiments. In a new rapidly develop-
ing field, which got the name "polynomial-time interior-point methods",
such a justification was obligatory.
Afteralmost fifteen years of intensive research, the main results of this
development started to appear in monographs [12, 14, 16, 17, 18, 19].
Approximately at that time the author was asked to prepare a new course
on nonlinear optimization for graduate students. The idea was to create
a course which would reflect the new developments in the field. Actually,
this was a major challenge. At the time only the theory of interior-point
methods for linear optimization was polished enough to be explained to
students. The general theory of self-concordant functions had appeared
in print only once in the form of research monograph [12]. Moreover,
it was clear that the new theory of interior-point methods represented
only a part of a general theory of convex optimization, a rather involved
field with the complexity bounds, optimal methods, etc. The majority
of the latter results were published in different journals in Russian.
The book you see now is a result of an attempt to present serious
thingsinan elementary form. As is always the case with a one-semester
course, the most difficult problern is the selection of the material. For
X INTRODUCTORY LEGTURES ON CONVEX OPTIMIZATION
us the target notions were the complexity of the optimization problems

and a provable effi.ciency of numerical schemes supported by complex-
ity bounds. In view of a severe volume Iimitation, we had to be very
pragmatic. Any concept or fact included in the book is absolutely nec-
essary for the analysis of at least one optimization scheme. Surprisingly
enough, none of the material presented requires any facts from duality
theory. Thus, this topic is completely omitted. This does not mean, of
course, that the author neglects this fundamental concept. However, we
hope that for the first treatment of the subject such a compromise is
acceptable.
The main goal of this course is the development of a correct under-
standing of the complexity of different optimization problems. This goal
was not chosen by chance. Every year I meet Ph.D. students of different
specializations who ask me for advice on reasonable numerical schemes
for their optimization models. And very often they seem to have come
too late. In my experience, if an optimization model is created without
taking into account the abilities of numerical schemes, the chances that
it will be possible to find an acceptable numerical solution are close to
zero. In any field of human activity, if we create something, we know
in advance why we are doing so and what we are going to do with the
result. And only in numerical modeHing is the situationstill different.
This coursewas given during several years at Universite Catholique de
Louvain (Louvain-la-Neuve, Belgium). The course is self-contained. It
consists of four chapters (Nonlinear optimization, Smooth convex opti-
mization, Nonsmooth convex optimization and Structural optimization
(Interior-point methods)). The chapters are essentially independent and
can be used as parts of more general courses on convex analysis or op-
timization. In our experience each chapter can be covered in three two-
hour lectures. We assume a reader to have a standard undergraduate
background in analysis and linear algebra. We provide the reader with
short bibliographical notes which should help in a closer examination of
the subject.
YURII NESTEROV
Louvain-la-Neuve, Belgium
May, 2003.
To my wife Svetlana
Acknowledgments
This book is a refiection of the main achievements in convex optimiza-

tion, the field in which the author has worked for more than twenty five
years. During all these years the author has had the exceptional oppor-
tunity to communicate and collaborate with the top-level scientists in
the field. I am greatly indebted to many of them.
I was very lucky to start my scientific career in Moscow at the time of
decline of the Soviet Union, which managed to gather in a single city the
best brains of a 300-million population. The contacts with A. Antipin,
Yu. Evtushenko, E. Golshtein, A. Ioffe, V. Karmanov, L. Khachian,
R. Polyak, V. Pschenichnyj, N. Shor, N. Tretiakov, F. Vasil'ev, D. Yudin,
and, of course, with A. Nemirovsky and B. Polyak, were invaluable in
forming the directions and priorities of my research.
I was very lucky to move to the West at a very important moment
in time. For nonlinear optimization that was the era of interior-point
methods. That was the time, when a new paper was announced almost
every day, and a time of open contacts and interesting conferences. I am
very thankful to my colleges Kurt Anstreicher, Freddy Auslender, Rony
Ben-Tal, Rob Freund, Jean-Louis Goffin, Don Goldfarb, Osman Guller,
Florian Jarre, Ken Kortanek, Claude Lemarechal, Olvi Mangasarian,
Florian Potra, Jim Renegar, Kees Roos, Tamas Terlaky, Mike Todd,
Levent Tuncel and Yinyu Ye for interesting discussions and cooperation.
Special thanks to Jean-Philippe Vial, the author of the idea of writing
this book.
Finally, I was very lucky to find myself at the Center of Operations
Research and Econometrics (CORE) in Louvain-la-Neuve, Belgium. The
excellent working conditions of this research center and the exceptional
environment were very helpful during all these years. It is impossible to
overestimate the importance of the spirit of research, which is created
and maintained here by my colleagues Vincent Blondel, Yves Genin,
xiv INTRODUCTORY LEGTURES ON CONVEX OPTIMIZATION
Michel Gevers, Etienne Laute, Yves Poches, Yves Smeers, Paul Van
Dooren and Laurence Wolsey, both coming from CORE and CESAME,
a research center of the Engineering department ofUniversite Catholique
de Louvain (UCL). The research activity of the author during many
years was supported by the Belgian Program on Interuniversity Poles of
Attraction initiated by the Belgian State, Prime Minister's Office and
Science Policy Programming.
Introduction
Optimization problems arise naturally in different fields of applica-

tions. In many situations, at some point we get a craving to arrange
things in a best possible way. This intention, converted into a mathe-
matical form, becomes an optimization problern of a certain type. De-
pending on the field of interest, it could be an optimal design problem, an
optimal control problem, an optimal location problem, an optimal diet
problem, etc. However, the next step, finding a solution to the mathe-
matical model, is far from trivial. At first glance, everything Iooks very
simple: many commercial optimization packages are easily available and
any user can get a "solution" to the model just by clicking on an icon
on the screen of his/her personal computer. The question is, what do
we actually get? How much can we trust the answer?
One of the goals of this course is to show that, despite their attraction,
the proposed "solutions" of general optimization problems very often
cannot satisfy the expectations of a naive user. In our opinion, the main
fact, which should be known to any person dealing with optimization
models, is that in general optimization problems are unsolvable. This
Statement, which is usually missing in standard optimization courses,
is very important for an understanding of optimization theory and its
development in the past and in the future.
In many practical applications the process of creating a model can take
a Iot of time and effort. Therefore, the researchers should have a clear
understanding of the properties of the model they are constructing. At
the stage of modelling, many different tools can be used to approximate
the real situation. And it is absolutely necessary to understand the
computational consequences of each decision. Very often we have to
xv1 INTRODUCTORY LEGTURES ON CONVEX OPTIMIZATION
choose between a "good" model, which we cannot solve, 1 and a "bad"

model, which can be solved for sure. What is better?
In fact, the computational practice provides us with a hint of an an-
swer to the above question. Actually, the most widespread optimization
models now are still linear optimization models. It is very unlikely that
such models can describe our nonlinear world very well. Thus, the main
reason for their popularity is that practitioners prefer to deal with solv-
able models. Of course, very often the linear approximation is poor. But
usually it is possible to predict the consequences of such a choice and
make a correction in interpretation of the obtained solution. It seems
that for them this is better than trying to solve a model without any
guarantee for success.
Another goal of this course consists in discussing numerical meth-
ods for solvable nonlinear models, namely convex optimization problems.
The development of convex optimization theory in the last years has
been very rapid and very exciting. Now it consists of several competing
branches, each of which has some strong and some weak points. We will
discuss in detail their features, taking into account the historical aspect.
More precisely, we will try to understand the internal logic of the de-
velopment of each branch of the field. Up to now, the main results of
the development can be found only in special journals and monographs.
However, in our opinion, this theory is ripe for explanation to the final
users, industrial engineers, economists and students of different special-
izations. We hope that this book will be interesting even for the experts
in optimization theory since it contains many results, which have never
been published in English.
In this book we will try to convince the reader, that in order to apply
the optimization formulations successfully, it is necessary to be aware
of some theory, which explains what we can and what we cannot do
with optimization problems. The elements of this simple theory can be
found in each lecture of the course. We will try to show that convex
optimization is an excellent example of a complete application theory,
which is simple, easy to learn and which can be very useful in practical
applications.
In this course we discuss the most efficient modern optimization sche-
mes and establish for them efficiency bounds. This course is self-contai-
ned; we prove all necessary results. Nevertheless, the proofs and the
reasoning should not be a problern even for graduate students.
1 More precisely, which we can try to solve

INTRODUCTION xvii
The structure of the book is as follows. It consists of four relatively in-

dependent chapters. Each chapter includes three sections, each of which
corresponds approximately to a two-hour lecture. Thus, the contents of
the book can be directly used for a standard one-semester course.
Chapter 1 is devoted to generat optimization problems. In Sec-
tion 1.1 we introduce the terminology, the notions of oracle, black box,
functional model of an optimization problern and the complexity of gen-
eral iterative schemes. We prove that global optimization problems are
"unsolvable" and discuss the main features of different fields of opti-
mization theory. In Section 1.2 we discuss two main local unconstrained
minimization schemes: the gradient method and the Newton method.
We establish their local rates of convergence and discuss the possible dif-
ficulties {divergence, convergence to a saddle point). In Section 1.3 we
compare the formal structures of the gradient and the Newton method.
This analysis leads to the idea of a variable metric. We describe quasi-
Newton methods and conjugate gradients schemes. We conclude this sec-
tion with an analysis of sequential unconstrained minimization schemes.
In Chapter 2 we consider smooth convex optimization methods. In
Section 2.1 we analyze the main reason for the difficulties encountered
in the previous chapter and from this analysis derive two good func-
tional classes, the class of smooth convex functions and that of smooth
strongly convex functions. For corresponding unconstrained minimiza-
tion problems we establish the lower complexity bounds. We conclude
this section with an analysis of a gradient scheme, which demonstrates
that this method is not optimal. The optimal schemes for smooth con-
vex minimization problems are discussed in Section 2.2. We start from
the unconstrained minimization problem. After that we introduce con-
vex sets and define a notion of gradient mapping for a minimization
problern with simple constraints. We show that the gradient mapping
can formally replace a gradient step in the optimization schemes. In
Section 2.3 we discuss more complicated problems, which involve sev-
eral smooth convex functions, namely, the minimax problern and the
constrained minimization problem. For both problems we introduce the
notion of gradient mapping and present the optimal schemes.
Chapter 3 is devoted to the theory of nonsmooth convex optimiza-
tion. Since we do not assume that the reader has a background in convex
analysis, the chapter is started by Section 3.1, which contains a compact
presentation of all necessary facts. The final goal of this section is to
justify the rules for computing the subgradients of a convex function.
The next Section 3.2 starts from the lower complexity bounds for non-
smooth optimization problems. After that we present a general scheme
for the complexity analysis of the corresponding methods. We use this
xviii INTRODUCTORY LEGTURES ON CONVEX OPTIMIZATION
scheme to establish the convergence rate of the subgradient method, the

center-of-gravity method and the ellipsoid method. We also discuss some
other cutting plane schemes. Section 3.3 is devoted to the minimization
schemes, which employ a piece-wise linear model of a convex function.
We describe Kelley's method and show that it can be extremely slow. Af-
ter that we introduce the so-called Ievel method. We justify its efficiency
estimates for unconstrained and constrained minimization problems.
Chapter 4 is devoted to convex minimization problems with explicit
structure. In Section 4.1 we discuss a certain contradiction in the black
box concept as applied to a convex optimization model. We introduce a
barrier model of an optimization problem, which is based on the notion
of self-concordant function. For such functions the second-order oracle
is not local and they can be easily minimized by the Newton method.
We study the properties of these functions and establish the rate of
convergence of the Newton method. In Section 4.2 we introduce self-
concordant barriers, the subdass of self-concordant functions, which is
suitable for sequential unconstrained minimization schemes. We study
the properties of such barriers and prove the efficiency estimate of the
path-following scheme. In Section 4.3 we consider several examples of
optimization problems, for which we can construct a self-concordant bar-
rier, and, consequently, these problems can be solved by a path-following
scheme. We consider linear and quadratic optimization problems, prob-
lems of semidefinite optimization, separable optimization and geomet-
rical optimization, problems with extremal ellipsoids, and problems of
approximation in lp-norms. We conclude this chapter and the whole
course by a comparison of an interior-point scheme with a nonsmooth
optimization method as applied to a particular problern instance.
Chapter 1
NONLINEAR OPTIMIZATION
1.1 World of nonlinear optimization

(General formulation of the problem; Important examples; Blackbox and iter·
ative methods; Analytical and arithmetical complexity; Uniform grid method;
Lower complexity bounds; Lower bounds for global optimization; Rules of the
game.)
1.1.1 General formulation of the problern

Let us start by fixing the mathematical form of our main problern and
the standard terminology. Let x be an n-dimensional real vector:
X= (x(l), ... ,x(n))T ERn,
S be a subset of Rn, and fo(x), ... , fm(x) be some real-valued functions
of x. In the entire book we deal with different variants of the following
general minimization problem:
min fo(x),
s.t. fj(x) & 0, j = 1 ... m, (1.1.1)
XE s,
where the sign & could be ::;, ~ or =.
We call fo (x) the objective function of our prob lern, the vector function
f(x) = (JI(x), ... , fm(x)f
is called the vector of functional constraints, the set S is called the basic
feasible set, and the set
Q = {x ES I /j(x)::; 0, j = 1 ... m}
2 INTRODUCTORY LEGTURES ON CONVEX OPTIMIZATION
is called the feasible set of problern (1.1.1). That is just a convention

to consider a rninirnization problern. Instead, we could consider a max-
imization problern with the objective function - fo(x).
There is a natural classification of the types of rninimization problerns:
• Constrained problems: Q c Rn.
• Unconstrained problems: Q =Rn.
• Smooth problems: allfi(x) are differentiable.
• Nonsmooth problems: there is a nondifferentiable component fk(x).
• Linearly constrained problems: all functional constraints are linear:
n
fi(x) = L:a}i)x(i) +bj = (aj,x) +bj, j = 1 ... m,
i=l
(here (·, ·) stands for the inner product in Rn), and S is a polyhedron.
If fo(x) is also linear, then (1.1.1) isalinear optimization problem. If
fo(x) is quadratic, then (1.1.1) is a quadratic optimization problem. If
allfi are quadratic, then this is a quadratically constrained quadratic
problern.
There is also a classification based on the properties of feasible set.
• Problem (1.1.1) is called feasible if Q-=/::. 0.
• Problem (1.1.1) is called strictly feasible if there exists x E int Q such
that fi(x) < 0 (or > 0) for all inequality constraints and fi(x) = 0
for all equality constraints. (Slater condition.)
Finally, we distinguish different types of solutions to (1.1.1):
• x* is called the optimal global solution to {1.1.1) if fo(x*) ~ fo(x) for
all x E Q (global minimum). In this case fo(x*) is called the (global)
optimal value of the problern.
• x* is called a local solution to (1.1.1) if fo(x*) :::; fo(x) for all x E
int Q c Q (local minimum).
Let us consider now several examples demonstrating the origin of the
optirnization problerns.
EXAMPLE 1.1.1 Let x{l), ... , x(n) be our design variables. Then we can
fix sorne functional characteristics of our decision: fo(x), ... , fm(x). For
exarnple, we can consider a price of the project, amount of required
Nonlinear Optimization 3
resources, reliability of the systern, etc. We fix the rnost irnportant

characteristics, f 0 (x), as our objective. For all others we irnpose sorne
bounds: aj :5 /j(x) :5 bj. Thus, we corne up with the problern:
rnin fo(x),
s.t.: aj :::; fi(x) :::; bj, j = 1. .. m,
XE S,
where S stands for the structural constraints, like nonnegativity or boun-

dedness of sorne variables. 0
EXAMPLE 1.1.2 Let our initialproblern be as follows:

Find x ERn suchthat fi(x) = aj, j = 1 ... m. (1.1.2)
Then we can consider the problem:
m
mjn ~)/j(x)- aj) 2 ,
j=l
perhaps even with some additional constraints on x. 1f the optimal value

of the latter problern is zero, we conclude that our initialproblern (1.1.2)
has a solution.
Note that the problern (1.1.2) is almost universal. lt covers ordinary
differential equations, partial differential equations, problems arising in
Game Theory, and rnany others. 0
EXAMPLE 1.1.3 Sornetirnes our decision variables x(l), ... , x(n) must be
integer. This can be described by the following constraint:
sin(7rx(i)) = 0, i = 1 ... n.
Thus, we can also treat integer optimization problems:

min fo(x),
s.t.: aj :5 /j(x) :5 bj, j = 1 ... m,
XE S,
. ( 7l'X (i))
Sill = 0, . = 1 ... n.
~
0
Looking at these examples, a reader can understand the optimism

of the pioneers of nonlinear optimization, which can be easily seen in
the papers of the 1950's and 1960's. Our first impression should be, of
course, as follows:
Nonlinear optimization is a very important and promising

application theory. It covers almost all needs of Opera-
tions research and numerical analysis.
However, just by looking at the same examples, especially at Examples

1.1.2 and 1.1.3, a more suspicious (or more experienced) reader could
come to the following conjecture:
In general, optimization problems are unsolvable
lndeed, the real life is too complicated to believe in a universal tool,

which can solve all problems at once.
However, such suspicions are not so important in science; that is a
question of personal taste how much we trust them. Therefore it was
definitely one of the most important events in optimization, when in the
middle of the 1970s this conjecture was proved in some strict mathemat-
ical sense. The proof is so simple and remarkable, that we cannot avoid
it in our course. But first of all, we should introduce a speciallanguage,
which is necessary to speak about such things.
1.1.2 Performance of numerical methods

Let us imagine the following situation: Wehave a problern 'P, which
we are going to solve. We know that there are different numerical meth-
ods for doing so, and of course, we want to find a scheme that is the
best for our 'P. However, it turnsout that we are looking for something
that does not exist. In fact, maybe it does, but it is definitely not rec-
ommended to ask the winner for help. Indeed, consider a method for
solving problern (1.1.1), which is doing nothing except reporting that
x* = 0. Of course, this method does not work for all problems except
those for which the solution is indeed the origin. And for the latter
problems the "performance" of such a scheme is unbeatable.
Thus, we cannot speak about the best method for a particular problern
'P, but we can do so for a class of problems :F :J 'P. Indeed, usually the
numerical methods are developed for solving many different problems
with similar characteristics. Thus, a performance of method M on the
whole dass :F is a natural characteristic of its efficiency.
Since we are going to speak about the performance of M on a class

F, we should assume that M does not have complete information about
a particular problern P.
A known (for numerical scheme) "part" of problern P is

called the model of the problem.
We denote the model by L:. Usually the model consists of problern

formulation, description of classes of functional components, etc.
In order to recognize the problern P (and solve it), the method should
be able to collect specific information about P. It is convenient to de-
scribe the process of collecting the data by the notion of an oracle. An
oracle 0 is just a unit, which answers the successive questions of the
method. The method M is trying to solve the problern P by collecting
and handling the answers.
In general, each problern can be described by different models. More-
over, for each problern we can develop different types of oracles. But let
us fix r; and 0. In this case, it is natural to define the performance of
M on (L:, 0) as its performance on the worst Pw from (r;, 0). Note that
this Pw can be bad only forM.
Further, what is the performance of M on P? Let us start from the
intuitive definition:
Performance of M on P is the total amount of computa-

tional efforts that is required by method M to solve the
problern P.
In this definition there are two additional notions to be specified. First of

all, what does it mean: to solve the problem? In some situations it could
mean finding an exact solution. However, in many areas of numerical
analysis that is impossible (and optimization is definitely such a case).
Therefore,
To solve the problern means to find an approximate solu-

tion to M with an accuracy E > 0.
The meaning of the words with an accuracy E > 0 is very important for
our definitions. However, it is too early to speak about that now. We
just introduce notation ~ for a stopping criterion; its meaning will be
always precise for particular problern classes. Now we can give a formal
definition of the problern dass:
:F =(2:, 0, 1;).
In order to solve a problern P E :F we can apply an iterative process,
which naturally describes any method M working with the oracle.
General Iterative Scheme.
Input: A starting point xo and an accuracy E > 0.

Initialization. Set k = 0, L 1 = 0. Here k is iteration
counter and h is accumulated information set.
(1.1.3)
Main loop:
1. Call oracle 0 at Xk·
2. Updatetheinformation set: h = h-1 U(xk, O(xk)).
3. Apply rules of method M to h and form point Xk+l·
4. Check criterion 7;. If yes then form an output x.
Otherwise set k := k + 1 and go to Step 1.
Now we can specify the term computational efforts in our definition of

performance. In the scheme (1.1.3) we can easily find two mostexpensive
steps. The first one is Step 1, where we call the oracle, and the second
one is Step 3, where we form the next test point. Thus, we can introduce
two measures of complexity of problern P for method M:
Analytical complexity: The number of calls of oracle,

which is required to solve problern P up to accuracy t:.
Arithmetical complexity: The total number of arithmetic
operations (including the work of oracle and work of
method), which is required to solve problern P up to ac-
curacy t.
Comparing the notions of analytical and arithmetical complexity, we

can see that the second one is more realistic. However, for a particu-
lar method M as applied to a problern P, the arithmetical complexity
usually can be easily obtained frorn the analytical cornplexity and the
cornplexity of the oracle. Therefore, in this course we will speak rnainly
about bounds on the analytical cornplexity for some problern classes.
There is one standard assumption on the oracle, which allows us to
obtain the majority of the results on the analytical cornplexity of opti-
mization problems. This assurnption is called the local black box concept
and it Iooks as follows:
Local black box
1. The only information available for the numerical

scheme is the answer of the oracle.
2. The oracle is local: A small variation of the problern
far enough frorn the test point x does not change the
answer at x.
This concept is very useful in numerical analysis. Of course, its first

part Iooks like an artificial wall between the method and the oracle. It
seems natural to give the method an access to internal structure of the
problem. However, we will see that for problems with rather complicated
structure this access is almost useless. For more simple problems it could
help. We will see that in the last chapter of the book.
To conclude the section, Iet us mention that the standard formulation
(1.1.1) is called a functional model of optimization problems. Usually,
for such models the standard assumptions are related to the smoothness
of functional components. In accordance to degree of srnoothness we can
apply different types of oracle:
• Zero-order oracle: returns the value f(x).
• First-order oracle: returns f(x) and the gradient J'(x).
• Second-order oracle: returns f(x), f'(x) and the Hessian f"(x).
1.1.3 Complexity bounds for global optimization

Let us try to apply the formal language, introduced in the previous
section, to a particular problern class. Consider, for example, the fol-
lowing problem:
min j(x). (1.1.4)
xEBn
In our terrninology, this is a constrained rninirnization problern without
functional constraints. The basic feasible set of this problern is Bn, an
n-dimensional box in Rn:
Bn={xERnl O~x(i)~l, i=l. .. n}.
Let us measure distances in Rn using l00 -norm:
Assurne that, with respect to this norm,
the objective function f(x) is Lipschitz continuous on Bn:
I f(x)- f(y) I~ L II X- Y lloo Vx, Y E Bn, (1.1.5)
with some constant L (Lipschitz constant).
Consider a very simple method for solving (1.1.4), which is called the
uniform grid method. This method g(p) has one integer input parameter
p. Its scheme is as follows.
Method Q(p)
1. Form (p + 1) n points
X· . -
(tl, ... ,tn)-
c . .)T
:!J.!2. !n.
p'p'•••'p '
where (ib···,in) E {O, ... ,p}n. (1.1.6)
2. Among all points X(i 1,... ,in) find a point x, which has
the minimal value of objective function.
3. Return the pair (x, f(x)) as a result.
Thus, this method forms a uniform grid of the test points inside the
box Bn, computes the minimum value of the objective over this grid and
returns this value as an approximate solution to problern (1.1.4). In our
terminology, this is a zero-order iterative method without any inßuence
of the accumulated information on the sequence of test points. Let us

find its efficiency estimate.
THEOREM 1.1.1 Let f* be the global optimal value of problern {1.1.4).
Then
f(x)- !* ::; ~·
Proof: Let x* be a global minimum of our problem. Then there exist
coordinates (i1, i2, ... , in) suchthat
= X (tJ· ,t2,
X - · -< x* -< X (tJ
· ... ,tn) ·
· +l,t2+1, ·
... ,tn+l) =
- y
(here and in the sequel we write x ::; y for x, y E Rn if and only if

for all i = 1 ... n). Note that y(i) - x<i) = ~ for i = 1 ... n,
x<i) ::; y(i)
and
x~i) E [x(i), y(i)], i = 1 ... n.
Denote x = (x + y)/2. Let us form a point x as follows:
y(i)' 'f (i) >
1 X*
'(i)
_X ,
x(i) = {
x<i) otherwise.
'
lt is clear that I x(i)- x~i) 1::; 2~, i = 1 ... n. Therefore
I x- x* lloo= l::;t::;n
ID!ix lx(i) - x~i) I ::; 2~.
Since x belongs to our grid, we conclude that
Let us finish the definition of our problern dass. Define our goal as
follows:
Find x E Bn : j(x)- j*::; €. (1.1. 7)
Then we immediately get the following result.
COROLLARY 1.1.1 Analytical complexity of the problern class {1.1.4),
{1.1.5), {1.1. 7) for method g is at most
A(Q) = ( l~J + 2r.

(here l aJ is an integer part of a).
Proof: Take p = l ~ J + 1. Then p 2: ~, and, in view of Theorem 1.1.1,
we have f(x) - f* ::; ~ ::; €. Note that we constrict (p + 1)n points. 0
Thus, A(Q) justifies an upper complexity bound for our problern class.
This result is quite informative, but we still have some questions.
Firstly, it may happen that our proof is too rough and the real perfor-
mance of Q(p) is much better. Secondly, we still cannot be sure that
Q(p) is a reasonable method for solving (1.1.4). There may exist other
schemes with much higher performance.
In order to answer these questions, we need to derive lower complexity
bounds for the problern dass (1.1.4), (1.1.5), (1.1.7). The rnain features
of such bounds are as follows.
• They are based on the black box concept.
• These bounds are valid for all reasonable iterative schernes. Thus,
they provide us with a lower estirnate for analytical complexity on
the problern dass.
• Very often such bounds ernploy the idea of the resisting oracle.
For us only the notion of the resisting orade is new. Therefore, Iet us
discuss it in rnore detail.
A resisting oracle tries to create a worst problern for each particular
rnethod. It starts frorn an "ernpty" function and it tries to answer each
call of the rnethod in the warst possible way. However, the answers rnust
be compatible with the previous answers and with the description of the
problern dass. Then, after terrnination of the method it is possible to
reconstruct a problern, which fits cornpletely the final inforrnation set
accurnulated by the algorithrn. Moreover, if we launch this rnethod on
this problern, it will reproduce the sarne sequence of the test points since
it will have the sarne sequence of answers frorn the orade.
Let us show how that works for problern (1.1.4). Consider the class
of problems C defined as follows:
Model: rnin f(x),

xEBn
f (x) is 100 - Lipschitz continuous on Bn.
Oracle: Zero-order local black box.
Approximate solution: Find x E Bn: f(x)- f* ~ E.

r.
THEOREM 1.1. 2 For < ~ L the analytical complexity of C for zero-
E
order methods is at least ( l ii J
Proof: Derrote p = liiJ

(~ 1). Assurne that there exists a method,
which needs N < pn calls of oracle to solve any problern from C. Let us
apply this method to the following resisting strategy:
Oracle returns f(x) = 0 at any test point x.
Therefore this method can find only x E Bn with f(x) = 0. However,

note that there exists x E Bn such that
x, + Pe
1 B
E n, e = (1, ... , 1) T E Rn,
and there were no test points inside the box B = {x I x ~ x ~ x + ~e}.

Derrote x* = x + 2~e and consider the function
](x) = min{O, L I x- x* lloo -E},

Clearly, this function is l 00 -Lipschitz continuous with the constant L
and its global optimal value is -E. Moreover, f{x) differs from zero only
inside the box B' = {x 111 x- x* lloo~ f}. Since 2p ~ LjE, we conclude
that
B' ~ B := {x 111 x- x Jloo~ 2~}.
Thus, f(x) is equal to zero at all test points of our method. Since
the accuracy of the result of our method is E, we come to the following
conclusion: If the number of calls of the oracle is less than pn, then the
accuracy of the result cannot be better than E. 0
Now we can say much more about the performance of the uniform grid
method. Let us compare its efficiency estimate with the lower bound:
Lower bound: ( liiJ) n.

Thus, if E = 0(*), the lower and upper bounds coincide up to a constant
multiplicative factor. This implies that Q(p) is an optimal method for C.
At the same time, Theorem 1.1.2 supports our initial claim that the
general optimization problems are unsolvable. Let us Iook at the follow-
ing example.
EXAMPLE 1.1.4 Consider the problern dass :F defined by the following
parameters:
L = 2, n = 10, E = 0.01.
Note that the size of the problern is very small and we ask only for 1%
accuracy.
The lower complexity bound for this dass is ( fe)
n. Let us compute
it for our example.
Lower bound: 1020 calls of orade

Complexity of oracle: at least n arithmetic Operations (a.o.)
Total complexity: 1021 a.o.
Work station: 106 a.o. per second
Total time: 1015 seconds
One year: less than 3.2 · 107 sec.
I We need• 131 250 000 ymrs.
This estimate is so disappointing that we cannot keep any hope that

such problems may become solvable in a future. Let us just play with
the parameters of the problern dass.
• Ifwe change n to n+1, then the estimate is multiplied by one hundred.

Thus, for n = 11 our lower bound is valid for a much more powerful
computer.
• On the contrary, if we multiply e by two, we reduce the complexity

by a factor of a thousand. For example, if e = 8%, then we need only
two weeks. 0
We should note, that the lower complexity bounds for problems with
smooth functions, or for high-arder methods are not much better than
those of Theorem 1.1.2. This can be proved using the same arguments
and we leave the proof as an exercise for the reader. Camparisan of
the above results with the upper bounds for NP-hard problems, which
are considered as a dassical example of very difficult problems in com-
binatorial optimization, is also quite disappointing. Hard combinatorial
problems need 2n a.o. only!
To conclude this section, let us compare our situation with one in some
other fields of numerical analysis. lt is well known, that the uniform grid
approach is a standard tool in many domains. For example, if we need
to compute numerically the value of the integral of a univariate function
J
1
I= f(x)dx,
0
the standard way to proceed is to form a discrete sum

1 n .
Sn= N "L f(xi), Xi = ;.,, i = 1 ... N.
i=l
lf f(x) is Lipschitz continuous, then this value can be used as an ap-

proximation to I:
Note that in our terminology this is exactly the uniform grid approach.
Moreover, that is a standard way for approximating the integrals. The
reason why it works here lies in the dimension of problems. For inte-
gration the standard dimensions are very small (up to three), and in
optimization sometimes we need to solve problems with several millions
of variables.
1.1.4 Identity cards of the fields

After the pessimistic results of the previous section, first of all we
should understand what could be our goal in theoretical analysis of op-
timization problems. It seems, everything is clear for general global op-
timization. But maybe the goals of this field are too ambitious? Maybe
in some practical problems we would be satisfied by much less "optimal"
solution? Or, maybe there are some interesting problern classes, which
are not so dangerous as the dass of general continuous functions?
In fact, each of these questions can be answered in a different way.
And this way defines the style of research (or rules of the game) in
the different fields of nonlinear optimization. lf we try to classify these
fields, we can easily see that they differ one from another in the following
aspects:
• Goals of the methods.

• Classes of functional components.
• Description of the oracle.
These aspects define in a natural way the list of desired properties of the
optimization methods. Let us present the "identity cards" of the fields,
which we are going to consider in the book.
Name: General global optimization. (Section 1.1)

G~als: Find a global minimum.
Functional dass: Continuous functions.
Orade: 0 - 1 - 2 order black box.
Desired properties: Convergence to a global minimum.
Features: From theoretical point of view, this game is too
short. We always lose it.
Problem sizes: There are examples of solving problems
with thousands of variables. However, no guarantee for suc-
cess even for very small problems.
History: Starts from 1955. Several local peaks of interest
related to new heuristic ideas (simulated annealing, neural
networks, genetic algorithms).
Name: Nonlinear optimization. (Sections 1.2, 1.3)

Goals: Find a local minimum.
Functional dass: Differentiable functions.
Oracle: 1 - 2 order black box.
Desired properties: Convergence to a local minimum.
Fast convergence.
Features: Variability of approaches. Most widespread soft-
ware. The goal is not always acceptable and reachable.
Problem sizes: up to 1000 variables.
History: Starts from 1955. Peak period: 1965 - 1985.
Theoretical activity now is rather low.
Name: Convex optimization. (Chapters 2, 3)

Goals: Find a global minimum.
Functional dass: Convex sets and functions.
Orade: lst-order black box.
Desired properties: Convergence to a global minimum.
Rate of convergence depends on the dimension.
Features: Very rich and interesting theory. Comprehen-
sive complexity theory. Efficient practical methods. The
problern dass is sometimes restrictive.
Problem sizes: up to 1000 variables.
History: Starts from 1970. Peak period: 1975- 1985 (ter-
minated by explosion of interior-point ideas). Theoretical
activity now is growing up.
Name: Interior-point polynomial-time methods. {Chap-

ter 4)
Goals: Find a global minimum.
Functional class: Convex sets and functions with explicit
structure.
Oracle: 2nd-order black box oracle, which is not local.
Desired properties: Fast convergence to a global mini-
mum. Rate of convergence depends on the structure of the
problem.
Features: Very new and perspective theory. Avoid the
black box concept. The problern dass is practically the same
as in convex optimization.
Problem sizes: Sometimes up to 10 000 000 variables.
History: Starts from 1984. Peak period: 1990 - .... Very
high theoretical activity just now.
1.2 Local methods in unconstrained minimization

(Relaxation and approximation; Necessary optimality conditions; Sufficient
optimality conditions; Class of differentiahte functions; Class of twice differ-
entiable functions; Gradient method; Rate of convergence; Newton method.}
1.2.1 Relaxation and approximation

The simplest goal of generat nonlinear optimization is to find a local
minimum of a differentiable function. In general, the global structure of
such a function is not simpler than that one of a Lipschitz continuous
function. Therefore, even for reaching such a restricted goal, it is neces-
sary to follow some special principles, which guarantee the convergence
of a minimization process.
The majority of generat nonlinear optimization methods are based on
the idea of relaxation:
We call the sequence {ak}~ 0 a relaxation sequence if
ak+l :=:; ak Vk ~ 0.
In this section we consider several methods for solving the following

unconstrained minimization problern
min f(x), (1.2.1)

xERn
where f(x) is a smooth function. In order to do so, we generate a

relaxation sequence {!(xk)}r: 0:
This strategy has the following important advantages:

1. If f(x) is bounded below on Rn, then the sequence {f(xk)}~ 0 con-
verges.
2. In any case we improve the initial value of the objective function.
However, it would be impossible to implement the idea of relaxation
without employing another fundamental principle of numerical analysis,
the approximation. In general,
To approximate means to replace an initial complex ob-

ject by a simplified one, which is close by its properties
to the original.
In nonlinear optimization we usually apply local approximations based

on derivatives of nonlinear functions. These are the first- and the second-
order approximations (or, the linear and quadratic approximations).
Let f(x) be differentiable at x. Then for y ERn we have
f(y) = f(x) + (!'(x), y- x) + o(ll y- x Ii),

where o( r) is some function of r ~ 0 such that
lim ~o(r)
r.l-0
= 0, o(O) = 0.
In the sequel we fix the notation II · II for the standard Euclidean norm
in Rn:
The linear function f(x) + (f'(x), y- x) is called the linear approxi-

mation of f at x. Recall that the vector f'(x) is called the gradient of
function f at x. Considering the points Yi = x + Eei, where ei is the ith
coordinate vector in Rn, and taking the limit in € -+ 0, we obtain the
following coordinate representation of the gradient:
Let us mention two important properties of the gradient. Derrote by

.CJ(a) the level set of f(x):
.Cf (0:) = {X E Rn I J (X) S 0:} ·
Consider the set of directions that are tangent to .Ct(f(x)) at x:
LEMMA 1.2.1 lf s E S 1(x), then (f'(x), s) = 0.

Proof: Since f(Yk) = f(x), we have
J(yk) = f(x) + (J'(x), Yk- x) + o(ll Yk- x II) = J(x).

Therefore (J'(x), Yk- x) + o(ll Yk- x II) = 0. Dividing this equation by
II Yk - x II and taking the limit in Yk -+ x, we obtain the result. 0
Let s be a direction in Rn, II s II = 1. Consider the local decrease of

f(x) along s:
~(s) = lim l[j(x + as)- f(x)].
a.j_O a
Note that f(x + o:s)- f(x) = a(f'(x), s) + o(o:). Therefore

~(s) = (J'(x), s).
Using the Cauchy-Schwartz inequality:

- II XII . II y lls (x, y) sll X II . II y II,
we obtain ~(s) = (f'(x), s) 2: - II f'(x) 11. Let us take
s = -J'(x)/ II !'(x II.

Then
~(s) = -(f'(x),J'(x))/ II J'(x) II=- II J'(x) II.
Thus, the direction - J' (x) (the antigradient) is the direction of the
fastest local decrease of J(x) at point x.
The next statement is probably the most fundamental fact in opti-
mization.
THEOREM 1. 2.1 (First-order optimality condition.)
Let x* be a local minimum of differentiable function f(x). Then
f'(x*) = 0.
Proof: Since x* is a local minimum of f(x), then there exists r > 0

such that for all y, IIY - x*ll ~ r, we have f(y) 2 f(x*). Since f is
differentiable, this implies that
f(y) = f(x*) + (f'(x*), y- x*) + o(ll Y- x* II) 2 f(x*).

Thus, for all s, II s II= 1, we have (f'(x*), s) 2 0. Consider the directions
s and -s; we get
(f'(x*), s} = 0, Vs, II s II= 1.

Finally, choosing s = ei, i = 1 ... n, where ei is the ith coordinate vector
in Rn, we obtain f'(x*) = 0. 0
CoROLLARY 1. 2.1 Let x* be a local minimum of a dijjerentiable func-

tion f(x) subject to linear equality constraints
x E C, := {x ERnlAx = b} =J 0,
where A is an m x n-matrix and b E Rm, m < n. Then there exists a
vector of multipliers ). * such that
{1.2.2)
Proof: Consider some vectors Ui, i = 1 ... k, that form a basis of the
null space of matrix A. Then any x E C, can be represented as follows:
k
x = x(y) = x* + LY(i)ui, y E Rk.
i=l
Moreover, the point y = 0 is a local minimum of the function cp(y) =

f(x(y)). In view of Theorem 1.2.1, cp'(O) = 0. This means that
~!~?l = (f'(x*), ui) = 0, i = 1 ... k,

and (1.2.2) follows. 0
Note that we have proved only a necessary condition of a local min-

imum. The points satisfying this condition are called the stationary
points of function f. In order to see that such points are not always the
local minima, it is enough to look at function f(x) = x 3 , x E Rl, at
x=O.
Let us introduce now the second-order approximation. Let function

x. Then
f(x) be twice differentiahte at
J(y) = f(x) + (f'(x), y- x) + !U"(x)(y- x), y- x) + o(ll y- x 11 2 ).
The quadratic function
f(x) + (f'(x), y- x) + !U"(x)(y- x), y- x)

is called the quadratic (or second-order) approximation of function f at
x. Recall that the (n x n)-matrix f"(x) has the following entries:
(f"(x))(i,j) = 8:c:{J~L>.
It is called the Hessian of function f at x. Note that the Hessian is a
symmetric matrix:
f"(x) = [f"(x)f.
The Hessian can be seen as a derivative of the vector function f'(x):
f'(y) = f'(x) + f"(x)(y- x) + o(ll y- x II),

where o(r) is a vector function suchthat lim ~
r,l-0
II o(r) II= 0 and o(O) = 0.
Using the second-order approximation, we can write down the second-
order optimality conditions. In what follows notation A t 0, used for a
symmetric matrix A, means that A is positive semidefinite:
(Ax, x) 2: 0 Vx ERn.
Notation A >- 0 means that Ais positive definite (above inequality must
be strict for x =/= 0).
THEOREM 1.2.2 (Second-order optimality condition.)

Let x* be a local minimum of twice differentiahte function f(x). Then
f'(x*) = 0, f"(x*) ?:: 0.
Proof: Since x* is a local minimum of function f(x), there exists r >0
suchthat for all y, IIY- x*ll ~ r, we have
f(y) ~ f(x*).
In view of Theorem 1.2.1, f'(x*) = 0. Therefore, for any such y,
f(y) = f(x*) + (f"(x*)(y- x*), y- x*) + o(ll y- x* 11 2 ) ~ f(x*).
Thus, (f"(x*)s, s) ~ 0, for all s, II s II= 1. 0

Again, the above theorem is a necessary (second-order) characteristic

of a local minimum. Let us prove a sufficient condition.
THEOREM 1.2.3 Let function f(x) be twice differentiable on Rn and let
x* satisfy the following conditions:
f'(x*) = 0, J"(x*) >- 0.

Then x* is a strict local minimum of f(x).
(Sometimes, instead of strict we say isolated.)
Proof: Note that in a small neighborhood ofpoint x* function f(x) can
be represented as
f(y) = f(x*) + !U"(x*)(y- x*), Y- x*) + o(ll Y- x* 11 2).

Since ~ --+ 0, there exists a value f such that for all r E [0, f] we have
I o(r) 1:::; V•df"(x*)),

where >.1(f"(x*)) is the smallest eigenvalue of matrix f"(x*). Recall,
that in view of our assumption, this eigenvalue is positive. Therefore,
for any y, II y- x*ll :::; f we have
f(y) ;::=: f(x*) + ~>.I(f"(x*)) II Y- x* 11 2 +o(ll Y- x* 11 2 )
> f(x*) + ~>.1(f"(x*)) II Y- x* 11 2> f(x*).

0
1.2.2 Classes of differentiable functions

It is well knowll that ally colltilluous fullctioll call be approximated by
a smooth fullctioll with arbitrarily small accuracy. Therefore, assumillg
only differelltiability of the objective function we cannot get any rea-
sonable properties of minimizatioll processes. Hence, we have to impose
some additional assumptiolls on the magllitude of the derivatives. Tra-
ditionally, in optimizatioll such assumptions are presented in the form
of a Lipschitz condition for a derivative of certaill order.
Let Q be a subset of Rn. Wedellote by C1•P(Q) the class offunctions
with the following properties:
• any jE C1•P(Q) is k times COlltilluously differelltiable Oll Q.

• Its pth derivative is Lipschitz colltinuous Oll Q with the constant L:
for all x, y E Q.
Clearly, we always have p ~ k. If q 2: k, then Cf·P(Q) ~ c1·P(Q). For
example, Cz•
1 (Q) ~ Cl'
1 (Q). Note also that these classes possess the
following property: if ft E C1~(Q), /2 E c1:(Q) and a, ß E R 1 , then

for
L3 =I a I L1 + I ß I L2
we have aft + ß/2 E c1:(Q).
We use notation f E Ck(Q) for a function f which is k times contin-
uously differentiable on Q.
For us the most important dass of functions of the above type will be
cl•1 (Rn), the class of functions with Lipschitz continuous gradient. By
definition, the inclusion f E Cl•

1 (Rn) implies that
II f'(x)- f'(y) II~ L II x- Y II (1.2.3)
for all x, y E Rn. Let us give a sufficient condition for that inclusion.
LEMMA 1.2.2 Function f(x) belongs to Cz• 1 (Rn) C Cl' 1 (Rn) if and only
if
II f"(x) II~ L, Vx ERn. {1.2.4)
Proof. Indeed, for any x, y E Rn we have

1
f'(y) = f'(x) +f J"(x + r(y- x))(y- x)dr
(l
0
= J'(x) + f"(x + T(y- x))dT) · (y- x).
Therefore, if condition (1.2.4) is satisfied then
II f'(y)- f'(x) II = (l f"(x + T(y- x))dT) · (y- x)
1
< Jf"(x+r(y-x))dr ·lly-xll
0
1
~I II f"(x+r(y-x)) II dr·ll y-x II
0
~L II y- XII-
On the other hand, if f E Cz• (Rn), then for any s ERn and a > 0, we
1
have
(! J"(x + Ts)dT) · s =II f'(x + as)- /'(x) II ::; aL II s II .
Dividing this inequality by a and tending a ,t 0, we obtain (1.2.4). 0
This simple result provides us with many examples of functions with

Lipschitz continuous gradient.
EXAMPLE 1.2.1 1. Linear function f(x) = a + (a,x) E C~' 1 (Rn) since
f'(x) = a, f"(x) = 0.
2. For the quadratic function f(x) = a+(a, x)+!(Ax, x) with A =AT
we have
f'(x) = a + Ax, J"(x) = A.
Therefore f(x) E Cl' 1 (Rn) with L =II A II·
3. Consider the function of one variable f(x) = J1 + x2 , x E R 1 . We
have
f'(x) = n· J"(x) = (1 + ~2)3/2 ~ 1.
Therefore f(x) E Ci' (R). 1 0
The next statement is important for the geometric interpretation of

functions from ci·
1 (Rn)
LEMMA 1.2.3 Let f E Ci' 1 (Rn). Then for any x, y from Rn we have
I f(y)- f(x)- (f'(x),y- x) I~ f II Y- x 11 2 · {1.2.5}
Proof: For all x, y E Rn we have

1
f(y) = f(x) + f(f'(x + T(y- x}),y- x)dT
0
1
= f(x) + (f'(x),y- x) + f(f'(x + T(y- x))- f'(x),y- x)dT.
0
Therefore
I J(y)- J(x)- (J'(x), y- x) I
1
=I f(f'(x + r(y- x))- J'(x),y- x)dr I
0
1
::; J I (J'(x + r(y- x))- J'(x), y- x) I dr
0
1
::; f II f'(x+r(y-x)) -f'(x) 11·11 y-x II dr
0
Geometrically, we can draw the following picture. Consider a function

f from Ci' 1 (Rn). Let us fix some Xo E Rn and define two quadratic
functions
<h(x) = f(xo) + (f'(xo),x- xo) + ~ II x- xo 11 2,
</J2(x) = f(xo) + (J'(xo), x- xo)- ~ II x- xo 11 2 .
Then graph of function f is located between the graphs of </; 1 and </J2:
Let us prove a similar result for the dass of twice differentiable func-
tions. Our main dass of functions ofthat type will be dj:/(Rn), the
dass of twice differentiable functions with Lipschitz continuous Hessian.
Recall that for f E C'j;/(Rn) we have
II f"(x)- f"(y) II:S M II x- Y II (1.2.6)
for all x, y ERn.
LEMMA 1.2.4 Let f E Cz' (Rn).

2 Then for any x, y from Rn we have
II f'(y)- f'(x)- f"(x)(y- x) II:S "{ II y- x 11 2 , (1.2. 7}
lf(y)- f(x)- (f'(x),y- x)- !U"(x)(y- x),y- x)l

(1.2.8}
::; 't II y - X 11 3 .
Proof: Let us fix some x, y E Rn. Then

1
f'(y) = f'(x) + J f"(x + r(y- x))(y- x)dr
0
1
= f'(x) + f"(x)(y- x) + f(f"(x + r(y- x))- f"(x))(y- x)dr.
0
Therefore
II f'(y)- f'(x)- f"(x)(y- x) II
1
= II f(f"(x + r(y- x))- f"(x))(y- x)dr II
0
1
< J II (f"(x + r(y- x))- J"(x))(y- x) II dr
0
1
< f I f"(x + r(y- x)) - f"(x) II · II y- x II dr
0
1
< f rM II y- x 11 2 dr = !vf II y- x 11 2 •
0
Inequality (1.2.8) can be proved in a similar way. D
CoROLLARY 1.2.2 Let f E c'f.;/(Rn) and II y- x II= r. Then
f"(x)- Mrln ::5 f"(y) ::5 J"(x) + Mrln,

where In is the unit matrix in Rn.
(Recall that for matrices A and B we write A t B if A - B t 0.)

Proof: Denote G = f"(y)- f"(x). Since f E cJl(Rn), we have II G II;:;
Mr. This means that eigenvalues of the symmetric matrix G, >.i(G),
satisfy the following inequality:
I >.i(G) 1;:; Mr, i = 1 ... n.
Hence, -Mrln ::5 G:::: J"(y)- f"(x) ::5 Mrln. D

1.2.3 Gradient method

Now we are completely ready for studying the convergence rate of
unconstrained minimization methods. Let us start from the simplest
scheme. We already know that antigradient is a direction of locally
steepest descent of differentiable function. Since we are going to find its
local minimum, the following scheme is the first to be tried:
Gradient method
(1.2.9)
Choose xo E Rn.
Iterate Xk+I = Xk- hkf'(xk), k = 0, 1, ....
We will refer to this scheme as a gradient method. The scalar factor

of the gradient, hk, is called the step size. Of course, it must be positive.
There are many variants of this method, which differ one from another
by the step-size strategy. Let us consider the most important examples.
1. The sequence {hk} ~ 0 is chosen in advance. For example,
hk = h > 0, (constant step)
hk = h
v'k+T"
2. Full relaxation:
3. Goldstein-Armijo rule: Find Xk+I = Xk- hf'(xk) suchthat
(1.2.10)
ß(f'(xk), Xk- Xk+I) 2:: f(xk)- f(xk+I), (1.2.11)

where 0 < a < ß < 1 are some fixed parameters.
Comparing these strategies, we see that the first strategy is the sirn-
plest one. Indeed, it is often used, but mainly in the context of convex
optimization. In that framework the behavior of functions is much more
predictable than in the general nonlinear case.
The second strategy is completely theoretical. It is never used in

practice since even in one-dimensional cases we cannot find an exact
minimum in finite time.
The third strategy is used in the majority of the practical algorithms.
It has the following geometric interpretation. Let us fix x E Rn. Con-
sider the function of one variable
cp(h) = f(x- hf'(x)), h 2:: 0.

Then the step-size values acceptable for this strategy belang to the part
of the graph of cp that is located between two linear functions:
c/J1(h) = f(x)- ah II f'(x) 11 2, c/J2(h) = f(x)- ßh II f'(x) 11 2 .
Note that cp(O) = cfJ1(0) = cp2 (0) and cp'(O) < cp~(O) < cp~(O) < 0. There-
fore, the acceptable values exist unless cp(h) is not bounded below. There
are several very fast aue-dimensional procedures for finding a point sat-
isfying the conditions of this strategy, but their description is not so
important for us now.
Let us estimate the performance of the gradient method. Consider
the problern
min f(x),
xERn
with f E Ci' 1 (Rn). And assume that f(x) is bounded below on Rn.
Let us evaluate a result of one gradient step. Consider y = x- hf'(x).
Then, in view of (1.2.5), we have
f(y) < f(x) + (J'(x),y- x) + ~ II y- x 11 2
f(x)- h II f'(x) 11 2 +h22 L II f'(x) 11 2 (1.2.12)
f(x)- h(l- ~L) II f'(x) 11 2 .
Thus, in order to get the best estimate for possible decrease of the objec-
tive function, we have to solve the following one-dimensional problem:
D. (h) = - h (1 - ~ L) --t m~n .

Computing the derivative of this function, we conclude that the optimal
step size must satisfy the equation Ll'(h) = hL- 1 = 0. Thus, that is
1:,
h* = which is a minimum of Ll(h) since Ll"(h) = L > 0.
Thus, our considerations prove that one step of the gradient method
decreases the value of objective function at least as follows:
f(y) ::; f(x)- A II f'(x) 11 2 .
Let us check what is going on with the above step-size strategies.

Let iJ:k+l = Xk -hkf'(xk). Then for the constant step strategy, hk = h,
we have
f(xk)- f(xk+l) ~ h(1- ~Lh) II J'(xk) 11 2 .
Therefore, if we choose hk = 2_Z with a E (0, 1), then
Of course, the optimal choice is hk = i.

For the full relaxation strategy we have
since the maximal decrease is not worse than that for hk = t.

Finally, for the Goldstein-Armijo rule in view of (1.2.11) we have
From (1.2.12) we obtain
f(xk) - f(xk+l) ~ hk ( 1 - 4; L) II f'(xk) 11 2 •
Therefore hk ~ j;(l- ß). Further, using (1.2.10) we have
f(xk)- J(xk+I) ~ a(J'(xk), Xk- Xk+I) = ahk II f'(xk) 11 2 .
Combining this inequality with the previous one, we conclude that
Thus, we have proved that in all cases we have
(1.2.13)
where w is some positive constant.

Now we areready to estimate the performance of the gradient scheme.
Let us sum up the inequalities {1.2.13) for k = 0 ... N. We obtain
(1.2.14)
where f* is the optimal value of the problern (1.2.1). As a simple con-

sequence of (1.2.14) we have
II f'(xk) II-+ 0 as k-+ oo.

However, we can also say something about the convergence rate. lndeed,
denote
g* - min g
N- o-:::k-:::N k,
where 9k =II f'(xk) II· Then, in view of (1.2.14), we come to the following
inequality:
1 [1
Y'N :::; JN+l z::;L(J(xo) - !*) ] 1/2 . (1.2.15)
The right-hand side of this inequality describes the rate of convergence

of the sequence {g'N} to zero. Note that we cannot say anything about
the rate of convergence of sequences {f(xk)} and {xk}·
Recall, that in general nonlinear optimization our goal is quite moder-
ate: We want to find only a local minimum of our problem. Nevertheless,
even this goal is unreachable for a gradient method. Let us consider the
following example.
EXAMPLE 1.2.2 Let us look at the following function of two variables:
j(x) =j(x(1),x(2)) = !(x(1))2 + !(x(2))4 _ !(x(2))2.
The gradient of this function is f'(x) = (x(l), (x< 2)) 3 - x< 2))T. Therefore
there are only three points which can pretend tobe a local minimum of
this function:
x! = (0, 0), x; = (0, -1), xj = (0, 1).

Computing the Hessian of this function,
!"(x) ~ ( ~ 3(x(2~2 _ 1 ) ,
we conclude that x2 and x3 are the isolated local minima 1 , but xi is only
a stationary point of our function. Indeed, f(xi) = 0 and f(xi + €e 2 ) =
44 - ~2 < 0 for € small enough.
Now, let us consider the trajectory of the gradient method, which
starts from xo = (1, 0). Note that the second coordinate of this point
is zero. Therefore, the second coordinate of f'(xo) is also zero. Con-
sequently, the second coordinate of x 1 is zero, etc. Thus, the entire
sequence of points, generated by the gradient method will have the sec-
ond coordinate equal to zero. This means that this sequence converges
to xi.
1 In fact, in our example they are the global solutions.

To condude our example, note that this situation is typical for all
first-order unconstrained minimization methods. Without additional
rather strict assumptions it is impossible to guarantee their global con-
vergence to a local minimum, only to a stationary point. 0
Note that inequality (1.2.15) provides us with an example of a new

notion, that is the rate of convergence of minimization process. How
can we use this notion in the complexity analysis? Rate of convergence
delivers the upper complexity bounds for a problern dass. These bounds
are always justified by some numerical methods. lf there exists a method,
for which its upper complexity bounds are proportional to the lower
complexity bounds of the problern dass, we call this method optimal.
Recall that in Section 1 we have already seen an example of optimal
method.
Let us look at an example of upper complexity bounds.
EXAMPLE 1.2.3 Consider the following problern dass:
Model: 1. Unconstrained minimization.

2. f E Cl'
1 (Rn).
3. f(x) is bounded below.
(1.2.16)
Oracle: First order black box.
€ - solution: f(x) ~ f(xo), II f'(x) II~ €.
Note, that inequality (1.2.15) can be used in order to obtain an upper

bound for the number of steps (= calls of the oracle), which is necessary
to find a point with a small norm of the gradient. For that, Iet us write
down the following inequality:
1 [1
g'N ~ vfN+l -wL(f(xo)- f*) ] 1/2 ~ €.
Therefore, if N + 1 2: ~(f(xo)- J*), we necessarily have g'N ~ €.

Thus, we can use the value ~(f(x 0 ) - f*) as an upper complexity
bound for our problern dass. Comparing this estimate with the result
of Theorem 1.1.2, we can see that it is much better; at least it does not
depend on n. The lower complexity bound for the dass (1.2.16) is not
known. 0
Let us check, what can be said about the local convergence of the
gradient method. Consider the unconstrained minimization problern
min f(x)
xERn
under the following assumptions:

1. f E c;;/(Rn).
2. There exists a local minimum of function f at which the Hessian is

positive definite.
3. We know some bounds 0 < l:::; L < oo for the Hessian at x*:
lln ~ f"(x*) ~ Lln. (1.2.17)
4. Our starting point x 0 is close enough to x*.

Consider the process: Xk+1 = Xk - hkf'(xk)· Note that f'(x*) = 0.
Hence,
1
f'(xk) = f'(xk) - f'(x*) =f f"(x* + T(Xk - x*))(xk - x*)dT
0
1
where Gk = f J"(x* + T(xk- x*))dT. Therefore
0
There is a standard technique for analyzing processes of this type,

which is based on contracting mappings. Let sequence {ak} be defined
as follows:
ao ERn, ak+l = Akak,
where Ak are (n x n) matrices suchthat II Ak 11:::; 1- q with q E (0, 1).
Then we can estimate the rate of convergence of sequence {ak} to zero:
II ak+l II~ (1- q) II ak II~ (1- q)k+l II ao ll-7 o.

In our case we need to estimate II In -hkGk II· Denote rk =II xk -x* 11.
In view of Corollary 1.2.2, we have
f"(x*)- TMrkln j f"(x* + T(Xk- x*)) j f"(x*) + TMrkln.
Therefore, using assumption (1.2.17), we obtain
(l- '!fM)In j Gk j (L + '!fM)In·

Hence, (1- hk(L + ItM))In ::S In- hkGk ::S (1- hk(l- ItM))In and we
conclude that
(1.2.18)
where ak(h) = 1- h(l- ItM) and bk(h) = h(L + ItM)- 1.

Note that ak(O) = 1 and bk(O) = -1. Therefore, if rk < f
ak(h) is a strictly decreasing function of h and we can ensure
~' then =
for small enough hk. In this case we will have rk+l < rk.
As usual, many step-size strategies are available. For example, we
can choose hk = t.
Let us consider the "optimal" strategy consisting in
minimizing the right-hand side of (1.2.18}:
max{ ak(h}, bk(h)} -+ min.

h
Assurne that ro < f. Then, if we form the sequence {xk} using the
optimal strategy, we can be sure that rk+l < rk < f. Further, the
optimal step size h'k can be found from the equation:
ak(h) = bk(h) <=? 1- h(l- ~M) = h(L + ~M)- 1.

Hence
h*- 2
k- L+l' (1.2.19)
(Surprisingly enough, the optimal step does not depend on M.) Under
this choice we obtain
Tk+l -
< (L-th
L+l +
Mrz
L+l'
Let us estimate the rate of convergence of the process. Denote q = L2~ 1

and ak = ~zrk ( < q). Then
ak+l ::S (1- q)ak + a2k = ak(1 + (ak- q)) = ak(l-(ak-~) 2 )

1-(ak-q
<
-
ak
l+q-ak
•
1- > lli - 1, or
Therefore -ak+l - ak
Hence,
Thus,
a
k -
qr~ (r-ro) -< ..!l!!L
< ro+(l+q) f-ro l+q
(-1-)k ·
This proves the following theorem.
THEOREM 1.2.4 Let function f(x) satisfy our assumptions and let the
starting point xo be close enough to a local minimum:
ro =II xo - x* II < f = i} .
Then the gradient method with step size (1.2.19} converges as follows:
II Xk- x* lls-; r~~o (1- L~3l)k ·

This rate of convergence is called linear.
1.2.4 Newton method

The Newton method is widely known as a technique for finding a root
of a function of one variable. Let cjJ(t) : R -7 R. Consider the equation
cjJ(t*) = 0.
The Newton method is basedonlinear approximation. Assurne that we

get some t close enough to t*. Note that
cp(t + t:.t) = cp(t) + cfJ'(t)t:.t + o(i t:.t j).

Therefore the equation cjJ(t + t:.t) = 0 can be approximated by the fol-
lowing linear equation:
cjJ(t) + c/J'{t)b.t = 0.
We can expect that the solution of this equation, the displacement t:.t,
is a good approximation to the optimal displacement b.t* = t* - t.
Converting this idea in an algorithmic form, we get the process
t k+l = t k -
!Pl!:JJ..
tP'(tk).
This scherne can be naturally extended onto the problern of finding

solution to a systern of nonlinear equations,
F(x) = 0,
where x ERn and F(x) :Rn~ Rn. In this case we have to define the
displacernent tl.x as a solution to the following systern of linear equations:
F(x) + F'(x)tl.x = 0
(it is called the Newton system). Ifthe Jacobian F'(x) is nondegenerate,

we can compute displacernent tl.x = -[F'(x)t 1F(x). The correspond-
ing iterative scherne Iooks as follows:
xk+l = Xk- [F'(xk)t 1F(xk)·

Finally, in view of Theorem 1.2.1, we can replace the unconstrained
minirnization problern by a problern of finding roots of the nonlinear
systern
f'(x) = 0. (1.2.20)
(This replacement is not completely equivalent, but it works in nonde-
generate situations.) Further, for solving {1.2.20) we can apply a stan-
dard Newton rnethod for systems of nonlinear equations. In this case,
the Newton system Iooks as follows:
f'(x) + f"(x)ilx = 0.
Hence, the Newton method for optimization problems appears tobe in
the form
(1.2.21)
Note that we can obtain the process {1.2.21), using the idea of quadra-
tic approximation. Consider this approximation, computed with respect
to the point Xk:
</>(x) = f(xk) + (f'(xk), X- Xk) + !(f"(xk)(X- Xk), X- Xk)·

Assurne that f"(xk) >- 0. Then we can choose xk+ 1 as a point of mini-
mum of the quadratic function <f>(x). This means that
and we come again to the Newton process (1.2.21).

We will see that the convergence of the Newton method in a neigh-
borhood of a strict local minimum is very fast. However, this method
has two serious drawbacks. Firstly, it can break down if f"(xk) is de-
generate. Secondly, the Newton process can diverge. Let us look at the
following example.
EXAMPLE 1.2.4 Let us apply the Newton method for finding a root of
the following function of one variable:
qy(t) = yl~t2.
Clearly, t* = 0. Note that
qy'(t) = [l+t;j3/2.
Therefore the Newton process looks as follows:
t k+l -- t k - !fi!:JJ_
"''(t ) -- tk - __lk.__
f177I . [1 + t2]3/2
k -- - t3k•
'I' k yl+tk
Thus, if I t 0 I< 1, then this method converges and the convergence is

extremely fast. The points ±1 are the oscillation points of this method.
If I to I> 1, then the method diverges. D
In order to avoid a possible divergence, in practice we can apply a

damped Newton method:
where hk > 0 is a step-size parameter. At the initial stage of the method

we can use the same step size strategies as for the gradient scheme. At
the final stage it is reasonable to choose hk = 1.
Let us study the local convergence of the Newton method. Consider
the problern
mm f(x)
xERn
under the following assumptions:
1. f E c'~;/(Rn).
2. There exists a local minimum of function f with positive definite

Hessian:
f"(x*) t lln, l > 0. (1.2.22)
3. Our starting point xo is close enough to x*.

Consider the process: Xk+1 = xk - [f"(xk)t 1f'(xk)· Then, using

the same reasoning as for the gradient method, we obtain the following
representation:
Xk+1 - x* = Xk- x*- [J"(xk)]- 1f'(xk)
1
= Xk- x*- [f"(xk)]- 1 f f"(x* + T(Xk- x*))(xk- x*)dT
0
1
where Gk = f[f"(xk)- f"(x* + T(Xk- x*))]dT.
0
Derrote rk =II xk- x* II· Then
1
II Gk I = I J[f"(xk)- f"(x* + T(xk- x*))]dT II
0
1
< f II f"(xk)- f"(x* + T(Xk- x*)) II dT
0
1
< f M(l- T)rkdT = !fM.
0
In view of Corollary 1.2.2, and (1.2.22), we have

f"(xk) 2 f"(x*)- Mrkln 2 (l- Mrk)In·
Therefore, if rk < ~, then f" (x k) is positive definite and
II [J"(xk)t 1 II~ (l- Mrk)- 1 •

Hence, for rk small enough (rk < 3~), we have
The rate of convergence of this type is called quadratic.

Thus, we have proved the following theorem.
THEOREM 1.2.5 Let function f(x) satisfy our assumptions. Suppose
that the initial starting point xo is close enough to x*:
II xo- x* II< r = 3~.
Then II Xk- x* II < f for alt k and the Newton method converges quadrat-
ically:
Comparing this result with the rate of convergence of the gradient

method, we see that the Newton method is much faster. Surprisingly
enough, the region of quadratic convergence of the Newton method is
almost the same as the region of the linear convergence of the gradient
method. This justifies a standard recommendation to use the gradient
method only at the initial stage of the minimization process in order to
get close to a local minimum. The final job should be performed by the
Newton method.
In this section we have seen several examples of the convergence rate.
Let us make a correspondence between these rates and the complexity
bounds. As we have seen in Example 1.2.3, the upper bound for the
analytical complexity of a problern dass is an inverse function of the
rate of convergence.
1. Sublinear rate. This rate is described in terms of a power function of
the iteration counter. For example, we can have rk ~ -Jk·
In this case
the upper complexity bound of corresponding problern dass justified
by this scheme is ( ~) 2 .
Sublinear rate is rather slow. In terms of complexity, each new right
digit of the answer takes the amount of computations comparable with
the total amount of the previous work. Note also, that the constant
c plays a significant role in the corresponding complexity estimate.
2. Linear rate. This rate is given in terms of an exponential function of

the iteration counter. For example,
rk ~ c(l- q)k.
Note that the corresponding complexity bound is *{lnc +In:).

This rate is fast: Each new right digit of the answer takes a constant
amount of computations. Moreover, the dependence of the complex-
ity estimate in constant c is very weak.
3. Quadratic rate. This rate has a form of a double exponential function
of the iteration counter. For example,
The corresponding complexity estimate depends on a double loga-

rithm of the desired accuracy: In ln ~.
This rate is extremely fast: Each iteration doubles the number of
right digits in the answer. The constant c is important only for the
starting moment of the quadratic convergence (crk < 1).
1.3 First-order methods in nonlinear optimization

(Gradient method and Newton method: What is different? Idea of variable
metric; Variable metric methods; Conjugate gradient methods; Constrained
minimization: Penalty functions and penalty function methods; Barrier func-
tions and barrier function methods.)
1.3.1 Gradient method and Newton method:

What is different?
In the previous section we have considered two local methods for find-
ing a local minimum in the simplest minimization problern
min f(x),
xERn
with f E Ci'2 (Rn). Those are the gradient method
Xk+l = Xk- hkf'(xk), hk > 0.

and the Newton Method:
Xk+l = Xk- [!"(xk)t 1f'(xk)·

Recall that the local rate of convergence of these methods is different.
Wehave seen, that the gradient method has a linear rate and the Newton
method converges quadratically. What is the reason for this difference?
If we look at the analytic form of these methods, we can see at least
the following formal difference: In the gradient method the search di-
rection is the antigradient, while in the Newton method we multiply the
antigradient by some matrix, that is the inverse Hessian. Let us try to
derive these directions using some "universal" reasoning.
Let us fix some x E Rn. Consider the following approximation of the
function f (x):
</JI(x) = f(x) + (J'(x),x- x) + 21 II x- x 11 2 ,
where the parameter h is positive. The first-order optimality condition

provides us with the following equation for xi, the unconstrained mini-
mum of the function </JI(x):
<P~ (xi) = f'(x) + *(xi- x) = 0.

Thus, xi = x - hf'(x). That is exactly the iterate of the gradient
method. Note, that if h E (0, ±J, then the function </J1(x) is a global
upper approximation of j(x):
(see Lemma 1.2.3). This fact is responsible for global convergence of the
gradient method.
Further, consider a quadratic approximation of function f(x):
<P2(x) = f(x) + (f'(x), x- x) + !U"(x)(x- x), x- x).
We have already seen that the minimum of this function is
x2 = x- [f"(x)t 1 j'(x),
and that is exactly the iterate of the Newton method.
Thus, we can try to use some approximations of function f(x), which
are better than </> 1 (x) and which are less expensive than </>2(x).
Let G be a positive definiten x n-matrix. Denote
</>c(x) = J(x) + (f'(x), x- x) + !(G(x- x), x- x).

Computing its minimum from the equation
<Pa(xa) = J'(x) + G(xa- x) = o,

we obtain
x(; = x- a- f 1 1(x). (1.3.1)
The first-order methods, which form a sequence of matrices
{Gk} : Gk --7 J"(x*)
(or {Hk} : Hk := GJ; 1 --7 [f"(x*)t 1), are called the variable metric
methods. (Sometimes the name quasi-Newton methods is used.) In these
methods only the gradients are involved in the process of generating the
sequences {Gk} or {Hk}·
The updating rule (1.3.1) is very common in optimization. Let us
provide it with one more interpretation.
Note that the gradient and the Hessian of a nonlinear function f(x)
are defined with respect to a standard Euclidean inner product on Rn:
n
(x,y) = L:x(i)y(i), x,y ERn, II x II= (x,x) 112 .
i=l
Indeed, the definition of the gradient is
f(x + h) = f(x} + (f'(x), h) + o(ll h II),

and from this equation we derive its coordinate representation:
Let us introduce now a new inner product. Consider a symmetric posi-

tive definite n x n-matrix A. For x, y E Rn derrote
(x,y)A = (Ax,y), II x IIA= (Ax,x) 112 .

The function II x IIA is a new norm on Rn. Note that topologically this
new metric is equivalent to the old one:
where An (A) and A1 ( A) are the smallest and the largest eigenvalues of
the matrix A. However, the gradient and the Hessian, computed with
respect to the new inner product are changing:
f(x + h) = f(x) + (J'(x), h) + ~(J"(x)h, h) + o(ll h II)
= f(x) + (A- 1f'(x), h)A + ~(A- 1 J"(x)h, h)A + a(ll h IIA)·

Hence, J~(x) = A- 1f'(x) is the new gradient and n(x) = A- 1f"(x) is
the new Hessian.
Thus, the direction used in the Newton method can be seen as a
gradient computed with respect to the metric defined by A = J"(x).
Note that the Hessian of J(x) at x computed with respect to A = J"(x)
is In.
EXAMPLE 1.3.1 Consider quadratic function
f(x) = a + (a, x) + !(Ax, x),

where A =AT>- 0. Note that J'(x) = Ax + a, J"(x) = A and
f' (x*) = Ax* +a = 0
for x* = -A- 1 a. Let us compute the Newton direction at some x ERn:
Therefore for any x ERnwehave x- dN(x) = -A- 1 a = x*. Thus, for

a quadratic function the Newton method converges in one step. Note
also that
f(x) = a + (A- 1 a, x)A + ~ II x II~,
f~(x) = A- 1f'(x) = dN(x),
D
Let us write down a general scheme of the variable metric methods.
Variable metric metbad
o. Choose Xo E nn. Set Ho= In.

Compute f(xo) and f'(xo).
1. kth iteration (k ~ 0).
a). Set Pk = Hkf'(xk).
b). Find Xk+I = Xk - hkPk

(see Section 1.2.3 for step-size rules).
c). Compute f(xk+I) and f'(xk+I)·
d). Update the matrix Hk: Hk ~ Hk+l·
The variable metric schemes differ one from another only in implemen-
tation of Step ld), which updates matrix Hk· For that, they use new
information, accumulated at Step lc), namely the gradient f'(xk+I)·
The idea is justified by the following property of a quadratic function.
Let
f(x) = a + (a, x) + !(Ax, x), J'(x) = Ax + a.
Then, for any x, y ERnwehave f'(x)- f'(y) = A(x- y). This identity
explains the origin of the so-called quasi-Newton rule.
Quasi-Newton rule
Choose Hk+l suchthat
Actually, there are many ways to satisfy this relation. Below we present
several examples of the schemes that usually are recommended as the
most efficient ones.
EXAMPLE 1.3.2 Denote
Then the quasi-Newton relation is satisfied by the following rules.

1. Rank-one correction scheme.
1:1Hk = (ok- Hk'Yk)(ok- Hk'Yk)T

(ok - Hk"fk, 'Yk)
2. Davidon-Fletcher-Powell scheme (DFP}.
Hk'Yk'YfHk
(Hk"fk, 'Yk) .
3. Broyden-Fletcher-Goldfarb-Shanno scheme (BFGS).
ßHk = Hk'Ykof + Ok"ff Hk _ ßk Hk'Yk'Yf Hk,

(Hk"fk, 'Yk) (Hk"fk, 'Yk)
where ßk = 1 + bk,ok)/(Hk'Yk,'Yk)·
Clearly, there are many other possibilities. From the computational
point of view, BFGS is considered as the most stable scheme. D
Note that for quadratic functions the variable metric methods usually
terminate in n iterations. In a neighborhood of strict minimum they
have a superlinear rate of convergence: for any xo E Rn there exists a
number N such that for all k ~ N we have
II Xk+l - x* 11:5 const· II Xk- x* II · II Xk-n- x* II

(the proofs are very long and technical). As far as global convergence is
concerned, these methods are not better than the gradient method (at
least, from the theoretical point of view).
Note that in the variable metric schemes it is necessary to store and
update a symmetric n x n-matrix. Thus, each iteration needs O(n2 )
auxiliary arithmetic operations. During many years this feature was
considered as one of the main drawbacks of the variable metric methods.
That stimulated the interest in so-called conjugate gradients schemes,
which have much lower complexity of each iteration (see Section 1.3.2).
However, in view of an amazing growth of computer power in the last
decades, these objections are not so important anymore.
1.3.2 Conjugate gradients

The conjugate gradients methods were initially proposed for minimiz-
ing a quadratic function. Consider the problern
min f(x), (1.3.2)
xERn
with f(x) = o: + (a, x) + !(Ax, x) and A = AT >- 0. We have already

seen that the solution of this problern is x* = -A- 1a. Therefore, our
objective function can be written in the following form:
f(x) = o: + (a,x) + ~(Ax,x) = o:- (Ax*,x) + ~(Ax,x)
= o:- ~(Ax*,x*) + ~(A(x- x*),x- x*).
Thus, f* = o:- ~(Ax*, x*) and f'(x) = A(x- x*).
Suppose we are given by a starting point xo. Consider the linear
K rylov subspaces
Lk = Lin { A(xo - x*), ... , Ak(xo - x*)}, k;:::: 1,
where Ak is the kth power of matrix A. The sequence of points {xk}
generated by a conjugate gradients method is defined as follows:
I Xk =arg min{f(x) I x E xo + .Ck}, k ~ 1.1 (1.3.3)

This definition Iooks quite artificial. However, later we will see that
this method can be written in a pure "algorithmic" form. We need
representation (1.3.3) only for theoretical analysis.
LEMMA 1.3.1 For any k;:::: 1 we have Lk = Lin{f'(xo), ... , f'(xk_l)}.
Proof: For k = 1 the statement is true since f'(x 0 ) = A(xo - x*).
Suppose that it is true for some k ~ 1. Then
k
Xk = xo +L A,(i) Ai(xo- x*)
i=l
with some). E Rk. Therefore

k
f'(xk) = A(xo- x*) + L ).(i) Ai+l(xo- x*) = y + ).(k) Ak+ 1 (xo- x*),
i=l
for certain y from .Ck. Thus,
.Ck+l Lin {C.k, Ak+ 1 (xo- x*)} = Lin {L:k, J'(xk)}
Lin {f'(xo), ... , f'(xk)}.

0
The next result helps to understand the behavior ofthe sequence {xk}.
LEMMA 1.3.2 For any k, i ~ 0, k i= i we have (f'(xk), f'(xi)) = 0.
Proof: Let k > i. Consider the function
In view of Lemma 1.3.1, for some .X* we have Xk = xo + E_x~) J'(xj_t).

j=l
However, by definition, x k is the point of minimum of f (x) on .Ck. There-
fore 4>'(-X*) = 0. It remains to compute the components of the gradient:
COROLLARY 1.3.1 The sequence generated by the conjugate gradients

method for (1.3.2) is finite.
Proof: The number of orthogonal directions in Rn cannot exceed n. 0
COROLLARY 1.3.2 For any p E .Ck we have (J'(xk),p) = 0. 0
The last auxiliary result explains the name of the method. Denote
6i = Xi+l- Xi. It is clear that .Ck = Lin {6o, ... , 6k-l }.
LEMMA 1.3.3 For any k i= i we have (A6k, 6i) = 0.
(Such directions are called conjugate with respect to A.)
Proof: Without loss of generality we can assume that k > i. Then
Let us show how we can write down the conjugate gradients method in
a more algorithmic form. Since .Ck = Lin {60 , ... , 6k-l }, we can represent
Xk+I as follows:
k-l
Xk+l = Xk- hkf1(Xk) +L _x(j)dj.
j=O
In our notation that is
(1.3.4)
Let us compute the coeflicients ofthe representation. Multiplying (1.3.4)

by A and di, 0 :::; i :::; k- 1, and using Lemma 1.3.3 we obtain
Hence, in view of Lemma 1.3.2, Ai= 0, i < k -1. For i = k -1 we have
since dk-1 = -hk-1Pk-1 by the definition of {Pk}·

Note that we managed to write down a conjugate gradients scheme in
terms of the gradients of the objective function f(x). This provides us
with a possibility to apply formally this scheme for minimizing a general
nonlinear function. Of course, such extension destroys all properties of
the process, which are specific for the quadratic functions. However,
in the neighborhood of a strict local minimum the objective function is
close to quadratic. Therefore asymptotically this method can be fast.
Let us present a general scheme of the conjugate gradients method
for minimizing a nonlinear function.
Conjugate gradient method
0. Let xo ERn. Compute f(xo),J'(xo). Set Po= f'(xo).
a). Find Xktl = xk + hkPk (by "exact" line search).
b). Compute f(xk+I) and f'(xk+d·
c). Compute the coefficient ßk·
d). Set Pk+l = f'(xk+I)- ßkPk·
In that scheme we did not specify yet the coefficient ßk· In fact, there
are many different formulas for this coefficient. All of them give the
same result on quadratic functions, but in a general nonlinear case they
generate different sequences. Let us present three of the most popular
1.
2.
3 Polak-Ribbiere· ß - - (J'(xk+t),J'(xk±d-f'(xk))
· · k - II!' (xk)ll 2 •
Recall that in the quadratic case the conjugate gradients method

terminates in n iterations (or less). Algorithmically, this means that
Pn+l = 0. In a nonlinear case that is not true. However, after n iter-
ation this direction loses any interpretation. Therefore, in all practical
schemes there exists a restarting strategy, which at some moment sets
ßk = 0 ( usually after every n iterations). This ensures a global con-
vergence of the scheme (since we have a usual gradient step just after
the restart and all other iterations decrease the function value). In a
neighborhood of a strict minimum the conjugate gradients schemes have
a local n-step quadratic convergence:
II Xn+I - x* II::; const· II xo - x* 11 2 .
Note, that this local convergence is slower than that of the variable
metric methods. However, the conjugate gradients schemes have an
advantage of a very cheap iteration. As far as the global convergence is
concerned, the conjugate gradients, in general, are not better than the
gradient method.
1.3.3 Constrained minimization

Let us discuss briefly the main ideas underlying the methods of general
constrained minimization. The problern we deal with is as follows:
fo(x) -+ min,
(1.3.5)
fi(x) ~ 0, i = 1 ... m.
where fi(x) are smooth functions. For example, we can consider fi(x)
frorn Cl'1 (Rn).
Since the cornponents of the problern (1.3.5) are general nonlinear
functions, we cannot expect that this problern is easier than an uncon-
strained minimization problem. Indeed, even the standard difficulties
with stationary points, which we have in unconstrained minimization,
appear in (1.3.5) in a much stronger form. Note that a stationary point
of this problern (whatever it is) can be infeasible for the systern of func-
tional constraints. Hence, any minimization scheme attracted by such a
point should accept that it fails even to find a feasible solution to (1.3.5).
Therefore, the following reasoning Iooks quite convincing.
1. We have efficient methods for unconstrained minimization. (?) 2
2. Unconstrained minimization is simpler than the constrained one. (?) 3
3. Therefore, Iet us try to approxirnate a solution to the problern (1.3.5)
by a sequence of solutions to some auxiliary unconstrained minimiza-
tion problems.
This philosophy is implemented by the schemes of Sequential Uncon-
strained Minimization. There are two rnain groups of such rnethods: the
penalty function methods and the barrier methods. Let us describe the
basic ideas of these approaches.
2 In fact, that is not absolutely true. We will see, that in order to apply an unconstrained
minimization method for solving constrained problems, we need to be able to find at least
a strict local minimum. And we have already seen (Example 1.2.2}, that this could pose a
problem.
3 We are not going to discuss the correctness of this statement for general nonlinear problems.
We just prevent the reader from extending it onto another problern classes. In the next
chapters we will have a possibility to see that this statement is not always true.
We start from penalty function methods.
DEFINITION 1. 3.1 A continuous function ci> (x) is called a penalty func-
tion for a closed set Q if
• ci>(x) = 0 for any x E Q,
• <P(x) > 0 for any x ~ Q.
Sometimes a penalty function is called just penalty. The main property
of the penalty functions is as follows.
If ci> 1 (x) is a penalty for Q 1 and ci> 2(x) is a penalty for Q2,
then ci>l(x) + ci>2(x) is a penalty for intersection Ql nQ2.
Let us give several examples of such functions.
EXAMPLE 1.3.3 Denote (a)+ = max{ a, 0}. Let
Q = {x ERn lfi(x) ~ 0, i = 1. .. m}.
Then the following functions are penalties for Q:
m
1. Quadratic penalty: ci>(x) = L: (!i(x))~.
i=l
m
2. Nonsmooth penalty: ci>(x) = L.: (fi(x))+·
i=l
The reader can easily continue the Iist. 0
The general scheme of a penalty function method is as follows.
Penalty function method
0. Choose Xo E nn. Choose a sequence of penalty coeffi-

cients: 0 < tk < tk+I and tk -+ oo.
1. kth iteration (k;::: 0).
Find a point Xk+I =arg min{/o(x)
xERn
+ tkci>(x)}
using Xk as a starting point.
It is easy to prove the convergence of this scheme assuming that Xk+I

isaglobal minimum of the auxiliary function. 4 Denote
(lllk is the global optimal value of lllk(x)). Denote by x* the global

solution to (1.3.5).
THEOREM 1.3.1 Let there exist a value l > 0 such that the set
S = {x ERn I fo(x) + t<P(x) :::; fo(x*)}
is bounded. Then
lim f(xk) = fo(x*), lim <P(xk) = 0.

k---'too k---'too
Proof: Note that lllk :::; IJ!k(x*) = fo(x*). At the same time, for any
x ERnwehave 'l!k+l(x) ~ 'l!k(x). Therefore 'l!k+l ~ 'l!k. Thus, there
=
exists a Iimit lim 'l!k lll* :::; f*. If tk > l then
k---'too
Therefore, the sequence {xk} has Iimit points. Since lim tk = +oo, for
k---'too
any such point x. we have <P{x.) = 0 and fo(x.) :::; fo(x*). Thus x. E Q
and
w* = fo(x.) + <P(x.) = fo(x.) ~ fo(x*).
0
Note that this result is very general, but not too informative. There
are still many questions, which should be answered. For example, we
do not know what kind of penalty function we should use. What should
be the rules for choosing the penalty coefficients? What should be the
accuracy for solving the auxiliary problems? The main feature of these
questions is that they can be hardly addressed in the framework of gen-
eral nonlinear optimization theory. Traditionally, they are considered as
questions to be answered by computational practice.
Let us Iook at the barrier methods.
DEFINITION 1.3.2 Let Q be a closed set with nonempty interior. A
continuous function F(x) is called a barrier function for Q if F(x) ~ oo
when x approaches the boundary of Q.
4 If we assume that it is a strict local minimum, then the result is much weaker.
Sometimes a barrier function is called barrier for short. Similarly to the

penalty functions, the barriers possess the following property:
lf F 1 ( x) is a barrier for Q1 and F2 (x) is a barrier for Q2,

then Fl (X) + F2 (X) is a barrier for intersection Ql nQ2.
In order to apply the barrier approach, the problern (1.3.5) must sat-
isfy the Slater condition:
3x : fi(x) < o, i = 1. .. m.
Let us look at some examples of barrier functions.
EXAMPLE 1.3.4 Let Q = {x E Rn ifi(x) ~ 0, i = 1 ... m}. Then all
functions below are barriers for Q:
m
1. Power-function barrier: F(x) = i~l (-/;~x))P' P :::=: 1.
m
2. Logarithmic barrier: F(x) = - E In(- fi(x)).
i=l
3. Exponential barrier: F(x) = i~ exp (-i(x)).

The reader can easily extend this list. 0
The scheme of a barrier method is as follows.
Barrier function method
0. Choose xo E int Q. Choose a sequence of penalty

coefficients: 0 < tk < tk+l and tk
-t oo.
1. kth iteration (k :::=: 0).

Find a point Xk+l =arg min{fo(x)
xEQ
+ f-F(x)}
k
using Xk as a starting point.
Let us prove the convergence of this method assuming that Xk+I is a

global minimum of the auxiliary function. Denote
wk(x) = fo(x) t:
+ F(x),
('llk is the global optimal value of 'llk(x)). And let f* be the optimal
value of the problern (1.3.5).
THEOREM 1.3.2 Let barrier F(x) be bounded below on Q. Then
lim
k-too
w;. = f*.
Proof: Let F(x) ~ F* for all x E Q. For arbitrary x Eint Q we have
sup lim 'llj. ~ lim [Jo(x)

k-too k-too
+ /k F(x)] = fo(x).
Therefore sup lim 'll'k

k-too
~ J*. On the other hand,
'llj. = min
xeQ
{!o(x) + f-F(x)}
k
~ min
xEQ
{!o(x) + lk F*} = !* + lk F*.
Thus, lim 'llj. = J*. D

k-too
The same as with the penalty functions method, there are many ques-
tions to be answered. We do not know how to find the starting point xo
and how to choose the best barrier function. We do not know the rules
for updating the penalty coefficients and the acceptable accuracy of the
solutions to the auxiliary problems. Finally, we have no idea about the
efficiency estimates of this process. And the reason is not in the lack
of the theory. Our problern (1.3.5) is just too complicated. We will see
that all of the above questions get precise answers in the framework of
convex optimization.
We have finished our brief presentation of general nonlinear optimiza-
tion. It was really very short and there are many interesting theoreti-
cal topics that we did not mention. That is because the main goal of
this book is to describe the areas of optimization in which we can ob-
tain some clear and complete results on the performance of numerical
methods. Unfortunately, the general nonlinear optimization is just too
complicated to fit the goal. However, it is impossible to skip this field
since a lot of basic ideas, underlying the convex optimization methods,
have their origin in general nonlinear optimization theory. The gradient
method and the Newton method, sequential unconstrained minimization
and barrier functions were originally developed and used for general op-
timization problems. But only the framework of convex optimization
allows these ideas to get their real power. In the next chapters of this
book we will see many examples of the second birth of these old ideas.
Chapter 2
SMOOTH CONVEX OPTIMIZATION
2.1 Minimization of smooth functions

(Smooth convex functions; Lower complexity bounds for :F't'' 1 (Rn);
Strongly
convex functions. Lower complexity bounds s:l(Rn); Gradient method.)
2.1.1 Smooth convex functions

In this section we deal with unconstrained minimization problern
min f(x), {2.1.1)

xERn
where the function j(x) is smooth enough. Recall that in the previous
chapter we were trying to solve this problern under very weak assump-
tions on function f. And we have seen that in this general situation we
cannot do too much: It is impossible to guarantee convergence even to a
local minimum, impossible to get acceptable bounds on the global per-
formance of minimization schemes, etc. Let us try to introduce some rea-
sonable assumptions on function f to make our problern more tractable.
For that, Iet us try to determine the desired properties of a class of
differentiable functions F we want to work with.
From the results of the previous chapter we can get an impression
that the main reasons of our troubles is the weakness of the first-order
optimality condition (Theorem 1.2.1). Indeed, we have seen that, in
general, the gradient method converges only to a stationary point of
function f (see inequality (1.2.15) and Example 1.2.2). Therefore the
first additional property we definitely need is as follows.
AssuMPTION 2.1.1 For any f E F the first-order optimality condition

is sufficient for a point to be a global solution to (2.1.1}.
Further, the main feature of any tractable functional dass :F is the

possibility to verify indusion f E :F in a simple way. Usually that
is ensured by a set of basic elements of the dass and by the list of
possible operations with elements of :F, which keep the result in the
dass (such Operationsare called invariant). An excellent example is the
dass of differentiahte functions: In order to check either a function is
differentiahte or not, we need just to look at its analytical expression.
We do not want to restriet our dass too much. Therefore, let us
introduce only one invariant operation for the hypothetical class :F.
AssUMPTION 2.1.2 lf JI, h E :Fand a,ß ~ 0, then afi + ßh E :F.
The reason for the restriction on the sign of coefficients in this assump-
tion is evident: We would like to see x 2 in our dass, but function -x 2
is not suitable for our goals.
Finally, let us add in :F some basic elements.
ASSUMPTION 2.1.3 Any linear function f(x) = a+(a,x} belongs to :F. 1
Note that the linear function f(x) perfectly fits Assumption 2.1.1. In-
deed, f'(x) = 0 implies that this function is constant and any point in
Rn is its global minimum.
It turns out that we have assumed enough to specify our functional
class. Consider f E :F. Let us fix some x 0 E Rn and consider the
function
cf>(y) = f(y)- (f'(xo),y}.
Then cf> E :F in view of Assumptions 2.1.2 and 2.1.3. Note that
cf>'(y) ly=xo= f'(xo)- f'(xo) = 0.

Therefore, in view of Assumption 2.1.1, x 0 is the global minimum of
function cf> and for any y E Rn we have
cf>(y) ~ cf>(xo) = f(xo)- (J'(xo),xo}.

Hence, f(y) ~ f(xo) + (f'(xo), x- xo}.
This inequality is very well known in optimization. It defines the class
of differentiable convex functions.
DEFINITION 2.1.1 A continuously differentiable function f(x) is called
convex on Rn (notation f E :F1 (Rn)) if for any x,y ERnwehave
f(y) ~ f(x) + (f'(x), y- x}. (2.1.2}
1 This is not a description of the whole set of basic elements. We just say that we want to
have linear functions in our class.
Smooth convex optimization 53
lf- f(x) is convex, we call f(x) concave.

In what follows we consider also the classes of convex functions :F~· 1 ( Q)
with the same meaning of the indices as for the classes C~· 1 (Q).
Let us check our assumptions, which become now the properties of
the functional class.
THEOREM 2.1.1 lf f E :F1 (Rn) and f'(x*) = 0 then x* is the global

minimum of J(x) on Rn.
Proof: In view of inequality (2.1.2), for any x ERn we have
f(x) 2:: j(x*) + (J'(x*), x- x*) = j(x*).

0
Thus, we get what we want in Assumption 2.1.1. Let us check As-

sumption 2.1.2.
LEMMA 2.1.1 If fi and h belong to :F1 (Rn) and a,ß 2::0 thenfunction
f = afl + ßh also belongs to :F1 (Rn).
Proof: For any x, y E Rn we have
ft(y) > fr(x) + (J{(x),y- x},

h(y) > h(x) + (J~(x),y- x}.
lt remains to multiply the first equation by a, the second one by ß and
add the results. 0
Thus, for differentiable functions our hypothetical class coincides with

the class of convex functions. Let us present their main properties.
The next statement significantly increases our possibilities in con-
structing the convex functions.
Proof: lndeed, let x, y E Rn. Denote x = Ax + b, fj = Ay + b. Since

<P'(x) =AT f'(Ax + b), we have
</J(y) = f(y) > J(x) + (J'(x), y- x)
= <P(x) + (J'(x), A(y- x))
= ifJ(x) +(AT f'(x), y- x)
= <P(x) + (<P'(x), y- x).

0
In order to simplify the verification of inclusion f E :F1 (Rn), we pro-

vide this class with several equivalent definitions.
THEOREM 2.1.2 Continuously differentiable function f belongs to the
class :F1 (Rn) if and only if for any x, y E Rn and a E [0, 1] we have2
f(ax + (1- a)y) ~ af(x) + (1- a)f(y). (2.1.3)
Proof: Denote x 0 = ax + {1- a)y. Let f E :F 1 (Rn). Then

f(xa) :::; f(y) + (f'(xa), Y- Xa) = f(y) + a(f'(xa), Y- x),
f(xa) ~ f(x) + (f'(xa), X- X 0 ) = f(x)- (1- a)(f'(xa), Y- x).
Multiplying first inequality by (1- a}, the second one by a and adding
the results, we get {2.1.3).
Let (2.1.3} be true for all x, y ERn and a E [0, 1]. Let us choose some
a E [0, 1). Then
f(y) > 1 ~ 0 [/(xa)- af(x)] = f(x) + 1 ~ 0 [/(xa)- f(x)]
= f(x) + 1 ~ 0 [f(x + (1- a)(y- x))- f(x)].

Tending a to 1, we get (2.1.2}. 0
THEOREM 2.1.3 Continuously differentiable function f belongs to the

class :F1 (Rn) if and only if for any x, y E Rn we have
(f'(x)- J'(y),x- y) ~ 0. {2.1.4)
2 Note that inequality (2.1.3) without assumption on differentiability of /, serves as a defini-

tion of geneml convex functions. We will study these functions in detail in the next chapter.
Proof: Let f be a convex continuously differentiable function. Then

f(x) 2 f(y) + (f'(y), x- y), f(y) 2 f(x) + (f'(x), Y- x).
Adding these inequalities, we get (2.1.4).
Let (2.1.4) holdforallx,yERn. Denotex 7 =x+r(y-x). Then
1
f(y) = f(x) + f(f'(x + r(y- x)), y- x)dr
0
1
= f(x) + (f'(x),y- x) + f(f'(x 7 )- f'(x),y- x)dr
0
1
f(x) + (f'(x), y- x) + f ~(f'(xT)- f'(x), xT- x)dr
0
> f(x) + (f'(x), y- x).

0
Sometimes it is more convenient to work with functions from the dass

:F2(Rn) C :Fl(Rn).
THEOREM 2 .1. 4 Two tim es continuously differentiable function f be-
longs to :F2(Rn) if and only for any x ERn we have
f"(x} C::: 0. (2.1.5)
Proof: Let f E C 2 (Rn) be convex. Derrote x 7 = x + rs, r > 0. Then,

in view of (2.1.4}, we have
0 < ~(f'(xT)- f'(x),xT- x) = ~(f'(xT)- f'(x),s)
~ f(f"(x + >.s)s, s)d>.,

0
and we get (2.1.5) by tending r to zero.
Let (2.1.5) hold for all x ERn. Then
f(y) = f(x) + (f'(x), y- x)
1 T
+ J f(J"(x + >.(y- x))(y- x},y- x)d>.dr
0 0
> f(x) + (J'(x), y- x).

0
Let us look at some examples of differentiable convex functions.

EXAMPLE 2.1.1 1. Linear function f(x) = a + (a,x) is convex.
2. Let a matrix A be symmetric and positive semidefinite. Then the
quadratic function
f(x) = a + (a, x) + ~(Ax, x)
is convex (since f"(x) =At 0}.
3. The following functions of one variable belang to F 1 (R):
f(x} = ex,
f(x) = I X IP, p > 1,

x2
f(x) = 1-lxl'
f(x) = I x I -ln(1+ I x 1).
We can checkthat using Theorem 2.1.4.
Therefore, the function arising in geometric optimization,
L ea:;+(a;,x)'
m
f(x) =
i=l
is convex (see Lemma 2.1.2). Similarly, the function arising in lp-norm

approximation problem,
m
f(x) =LI (ai,x)- bi IP,
i=l
is convex too. D
As with general nonlinear functions, the differentiability itself cannot

ensure any special topological properties of convex functions. There-
fore we need to consider the problern classes with Lipschitz continuous
derivatives of a certain order. The most important dassofthat type is
.r'l•
1 (Rn), the dass of convex functions with Lipschitz continuous gradi-
ent. Let us provide it with several necessary and sufficient conditions.

THEOREM 2.1.5 All conditions below, holding for all x, y E Rn and a
from [0, 1], are equivalent to inclusion f E :Fz'
1 (Rn):
0 ~ f(y)- f(x)- (J'(x), y- x) ~ t II x- y 11 2, {2.1.6}

f(x) + (f'(x), Y- x) + A II f'(x)- f'(y) 11 2 ~ f(y), {2.1. 7}
t II f'(x) - f'(y) 11 2 ~ (f'(x) - f'(y), x- y), {2.1.8}
(J'(x)- f'(y), x- y) ~ L II x- y 11 2 , {2.1.9}
af(x) + (1 - a)f(y) 2:: f(ax + (1 - a)y}

(2.1.10}
+ o(~Lo) II f'(x)- f'(y) 11 2,
af(x) + (1 - a)f(y) < f(ax + (1- a)y)

{2.1.11}
+a(1-a)~ II x-y 11 2 •
Proof: Indeed, (2.1.6) follows from the definition of convex functions

and Lemma 1.2.3. Further, let us fix x 0 ERn. Consider the function
ifl(y) = f(y}- (J'(xo},y}.
Note that ifJ E .1'1' 1 (Rn) and its optimal point is y* = x 0 • Therefore, in
view of (2.1.6), we have
ifl(y*) ~ ifl(y- tifl'(y)) ~ ifl(y) - A II ifl'(y) 11 2 •
And we get {2.1.7} since ifl'(y) = f'(y)- f'(xo}.

We obtain {2.1.8) from inequality (2.1.7) by adding two copies of it
with x and y interchanged. Applying now to (2.1.8) Cauchy-Schwartz
inequality we get II f'(x) - f'(y) II~ L II x- Y II·
In the same way we can obtain {2.1.9) from (2.1.6). In order to get
(2.1.6) from (2.1.9) we apply integration:
1
f(y)- f(x)- (J'(x), y- x) = f(J'(x + r(y- x))- f'(x), y- x)dr
0
~ !Lily- xll 2 •
Let us prove the last two inequalities. Derrote x 0 = ax + (1 - a)y.
Then, using (2.1.7) we get
f(x) ~ f(xo:) + (J'(xo:), (1- a)(x- y)) + A II J'(x)- f'(xo:} 11 2 ,
f(y) > f(xo:) + (J'(xo:), a(y- x)) + A II f'(y)- f'(xo:) 11 2 •

Adding these inequalities multiplied by a and (1- a) respectively, and

using inequality
a II 91 - u 11 2 +(1 - a) II 92 - u 11 2 ~ a(1 - a) II 91 - 92 11 2 ,
we get (2.1.10). It is easy to checkthat we get (2.1.7) from (2.1.10) by

tending a ---+ 1.
Similarly, from (2.1.6) we get
f(x) < f(xa) + (f'(xa), (1- a)(x- y)) + ~ II (1- a)(x- y) 11 2 ,
f(y) < f(xa) + (f'(xa), a(y- x)) + ~ II a(y- x) 11 2 .
Adding these inequalities multiplied by a and (1 - a) respectively, we

obtain (2.1.11). And we get back to (2.1.6) as a---+ 1. D
Finally, let us give a characterization of the dass :Fi• 1 (Rn).

THEOREM 2.1. 6 Two tim es continuously differentiable function f be-
lon9s to :Fi' 1 (Rn) if and only for any x ERn we have
0 j J"(x) j Lln. (2.1.12}
Proof: The statement follows from Theorem 2.1.4 and (2.1.9). D
2.1.2 Lower complexity bounds for :F~· 1 (Rn)

Before we go forward with optimization methods, let us check our
possibilities in minimizing smooth convex functions. In this section we
obtain the lower complexitr bounds for optimization problems with ob-
jective functions from :F'f' (Rn) (and, consequently, :Fi• 1 (Rn)).
Recall that our problern class is as follows.
Model: min /(x),

xERn
f E :Fi'l(Rn).
Oracle: First-order local black box.
Approximate solution: X E Rn, / (X) - /* ~ t.

In order to make our considerations simpler, let us introduce the follow-

ing assumption on iterative processes.
AssUMPTION 2.1.4 An iterative method M generates a sequence of test
points {Xk} such that
Xk E xo + Lin {f'(xo), ... , J'(xk_I)}, k ~ 1.
This assumption is not absolutely necessary and it can be avoided by
a more sophisticated reasoning. However, it holds for the majority of
practical methods.
We can prove the lower complexity bounds for our problern dass with-
out developing a resisting oracle. Instead, we just point out the "worst
function in the world" (that means, in Ff' 1 (Rn)). This function appears
to be difficult for alliterative schemes satisfying Assumption 2.1.4.
Let us fix some constant L > 0. Consider the following family of
quadratic functions
fk(x) = ~ { ~((x(ll) 2 + ~~>x(i) - x(i+Il) 2 + (x(k)) 2 ] - x(l)}

for k = 1 ... n. Note that for all s E Rn we have
ur(x)s, s) ~ ~ [(s(ll)' + :~>(i) - s(i+ll) 2 + (s('i)'] ~ 0,
and
n
< LL:(s(i)) 2 .
i=l
Thus, 0 ~ ff:(x) ~ Lln. Therefore fk(x) E :Ff' 1(Rn), 1::; k::; n.

Let us compute the minimum of function fk· It is easy to see that
f/:(x) = fAk with
where Ok,p is a (k x p) zero matrix. Therefore the equation
fk(x) = Akx- e1 = 0
has the following unique solution:
1--i i = 1. .. k,
-(i) - k+l'
xk - {
0, k+1~ i ~ n.
Hence, the optimal value of function fk is
fi = ~ [!(Akxbxk)- (el,xk)] = -t(el!xk)

(2.1.13)
= t(-1+k~l).
Note also that
~ ·2 - k(k+1){2k+l}
L..~- 6
<
-
{k+1)3
3 • (2.1.14)
i=l
Therefore
(2.1.15)
< k- _2_. k{k+l}

k+l 2
+ {k+1)2
1 . {k+l)l - l(k
3 - 3
+ 1) .
Denote Rk,n = {x ERn I x(i) = 0, k+ 1 ~ i ~ n}; that is a subspace

of Rn, in which only the first k components of the point can differ from
zero. From the analytical form of the functions {!k} it is easy to see
that for all x E Rk,n we have
/p(x) = fk(x), p = k ... n.
Let us fix some p, 1 ~ p ~ n.

LEMMA 2.1.3 Let xo = 0. Then for any sequence {xk}t=o satisfying
the condition
Xk E Lk = Lin {f;(xo), ... , J;(xk_I) },
we have Ck ~ Rk,n.
Proof: Since xo = 0, we have J;(xo) =-tel E R 1•n. Thus .C1 R 1•n.=

Let .Ck ~ Rk,n for some k < p. Since Ap is three-diagonal, for any
x E Rk,n we have J;(x) E Rk+l,n. Therefore .Ck+l ~ Rk+l,n, and we can
complete the proof by induction. 0
COROLLARY 2.1.1 For any sequence {xk}1=o suchthat xo = 0 and Xk E

.Ck we have
Now we areready to prove the main result of this section.
THEOREM 2.1.7 For any k, 1 ~ k ~ ~(n- 1), and any xo ERn there
exists a junction f E :FJ:• 1 (Rn) suchthat for any first-order method M
satisjying Assumption 2.1.4 we have
where x* is the minimum of j(x) and J* = j(x*).

Proof: It is clear that the methods of this type are invariant with
respect to a simultaneaus shift of all objects in the space of variables.
Thus, the sequence of iterates, which is generated by such a method for
function f(x) starting from x 0 , is just a shift of the sequence generated
for j (x) = f (x + xo) starting from the origin. Therefore, we can assume
that xo = 0.
Let us prove the first inequality. For that, let us fix k and apply M to
minimizing f(x) = hk+I(x). Then x* = X2k+I and f* = f2k+I· Using
Corollary 2.1.1, we conclude that
Hence, since xo = 0, in view of (2.1.13) and (2.1.15) we get the following

estimate:
Let us prove the second inequality. Since Xk E Rk,n and xo = 0, we

have
2k+l (-(i) ) 2 - 2k+l ( i )2
II Xk - x* 11 2 > L:: x2k+1 - L::
1- 2k+2
i==k+ 1 i==k+ 1
1 2k+1 . 1 2k+1 ·2
= k +1- k+l L:: 't + 4(k+I)2 L:: Z
i=k+l i==k+l
In view of (2.1.14), we have

2k+1
L:: i2 ! [(2k + 1)(2k + 2)(4k + 3)- k(k + 1)(2k + 1)]
i==k+l
= t(k + 1)(2k + 1)(7k + 6).

Therefore, using (2.1.15) we finally obtain
II Xk _ x* ll2 > k + 1 __1_ . (3k+2)(k+l) + (2k+1)(7k+6)

k+l 2 24(k+l)
k - 2k 2+7k+6
= - 2- 24(k+l)
> 2k2+7k+6
lß(k+l)2 II -
Xo- X2k+1 112>
_ S1 II XQ- X * 112 .
0
The above theorem is valid only under assumption that the number
of steps of the iterative scheme is not too large as compared with the
dimension of the space (k :::; ~(n- 1)). The complexity bounds ofthat
type are called uniform in the dimension of variables. Clearly, they
are valid for very large problems, in which we cannot wait even for n
iterates of the method. However, even for problems with a moderate
dimension, these bounds also provide us with some information. Firstly,
they describe the potential performance of numerical methods on the
initial stage of the minimization process. And secondly, they warn us
that without a direct use of finite-dimensional arguments we cannot get
better complexity for any numerical scheme.
To conclude this section, Iet us note that the obtained lower bound for
the value of the objective function is rather optimistic. Indeed, after one
hundred iterations we could decrease the initial residual in 104 times.
However, the result on the behavior of the minimizing sequence is quite
disappointing: The convergence to the optimal point can be arbitrarily
slow. Since that is a lower bound, this conclusion is inevitable for our
problern dass. The only thing we can do is to try to find problern
classes in which the situation could be better. That is the goal of the
next section.
2.1.3 Strongly convex functions

Thus, we are looking for a restriction of the functional class 1 (Rn), .rl•
for which we can guarantee a reasonable rate of convergence to a unique
solution of the minimization problern
Recall, that in Section 1.2.3 we have proved that in a small neighbor-

hood of a nondegenerate local minimum the gradient method converges
linearly. Let us try to make this non-degeneracy assumption global.
Namely, Iet us assume that there exists some constant 1-t > 0 such that
for any x with f'(x) = 0 and any x ERnwehave
f(x) ;:::: f(x) + ~1-t II x- x 11 2 •
Using the same reasoning as in Section 2.1.1, we obtain the dass of

strongly convex functions.
DEFINITION 2.1.2 A continuously differentiable function f(x) is called
strongly convex on Rn (notation f E S~(Rn)) if there exists a constant
1J > 0 such that for any x, y E Rn we have
f(y);:::: f(x) + (f'(x),y- x} + ~1-t II y- x 11 2 . {2.1.16}

Constant IJ is called the convexity parameter of function f.
We will also consider the classes s~t(Q) with the same meaning of
the indices k, l and L as for the class C~· 1 (Q).
Let us fix some properties of strongly convex functions.
THEOREM 2.1.8 lf f E Sb(Rn) and f'(x*) = 0, then
f(x) ;:::: f(x*) + !~-t II x- x* 11 2
for all x ERn.

Proof: Since f'(x*) = 0, in view of inequality (2.1.16), for any x E Rn
we have
f(x) > f(x*) + (f'(x*), x- x*) + !~-t II x- x* 11 2
- f(x*) + !~-t II x- x* 11 2 •
0
The following result justifies the addition of strongly convex functions.

LEMMA 2.1.4 lf fi E Sh 1 (Rn), f2 E Sh 2 (Rn) and a, ß?: 0, then
f = afi + ßh E S~ll- 1 +ß11- 2 (Rn).

Proof: For any x, y E Rn we have
JI(y) > !I(x) + (fi(x),y- x) + !J.Ll II y- x 11 2,
h(y) > h(x) + UHx), Y- x) + !J.L2 II Y- x 11 2 .
It remains to add these equations multiplied respectively by a and ß. D
Note that the class SJ(Rn) coincides with F 1 (Rn). Therefore addition
of a convex function to a strongly convex function gives a strongly convex
function with the same convexity parameter.
Let us give several equivalent definitions of strongly convex functions.
THEOREM 2.1.9 Let f be continuously differentiable. Both conditions
below, holding for all x, y ERn and a E (0, 1], are equivalent to inclusion
f E S~(Rn):
(f'(x) - f'(y), x- y) ?: J.L II x- y 11 2 , (2.1.17)
af(x) + (1- a)f(y) ?: f(ax + (1- a)y)

(2.1.18}
+a(1- a)~ II x- y 11 2 .
The proof of this theorem is very similar to the proof of Theorem 2.1.5
and we leave it as an exercise for the reader.
The next statement sometimes is useful.
THEOREM 2 .1.1 0 If f E Sh (Rn), then for any x and y from Rn we have
f(y) S f(x) + (f'(x), Y- x) + 2~ II f'(x) - f'(y) 11 2 , (2.1.19)
(f'(x) - f'(y), x- y) S ~ II f'(x) - f'(y) 11 2 • (2.1.20)
Proof: Let us fix some x E Rn. Consider the function

cp(y) = f(y)- (f'(x), y) E S~(Rn).
Since rp'(x) = 0, in view of (2.1.16) for any y ERn we have that
rp(x) = minrp(v)
V
;::: min[<P(y)
V
+ (<P'(y), v- y) + !ltllv- Yll 2]
= <P(y)- 2~11<P'(y)li 2 ,
and that is exactly (2.1.19). Adding two copies of (2.1.19) with x and y
interchanged we get (2.1.20). D
Finally, the second-order characterization of the class Sh(Rn) is as

follows.
THEOREM 2 .1.11 Two times continuously differentiable function f be-
longs to the class s~ (Rn) if and only if X E Rn
f"(x) !: flln. {2.1.21)
Proof: Apply (2.1.17). D
Note we can Iook at examples of strongly convex functions.

EXAMPLE 2.1.2 1. f(x) = !II x 11 2 belongs to Sf(Rn) since f"(x) =In.
2. Let symmetric matrix A satisfy the condition: ttln :S A :S Lln. Then
j(x) = a+ (a,x) + !(Ax,x) E s::l(Rn) C S!;l(Rn)
since f"(x) = A.
Other examples can be obtained as a sum of convex and strongly
convex functions. D
For us the most interesting functional class is s!:l(Rn). This class is

described by the following inequalities:
(f'(x)- J'(y),x- y};::: flll x- Y 11 2 , (2.1.22)
II !'(x)- !'(y) II:S L IIx- Y II · (2.1.23)

The value Qf = L / fl ;::: 1 is called the condition number of function f.
It is important that the inequality (2.1.22) can be strengthened using
the additional information (2.1.23).
THEOREM 2.1.12 lf f E S~:l(Rn), then for any x, y ERn we have
(f'(x)- f'(y), x- Y) 2: Jl-~L II x- Y 11 2 + Jl-~L II f'(x)- f'(y) 11 2 ·

{2.1.24}
Proof: Derrote cp(x) = f(x)- ~1111xll 2 . Then cp'(x) = f'(x)- 11x; hence,
by (2.1.22) and (2.1.9) cjJ E :Fl~M(Rn). If 11 = L, then (2.1.24) is proved.
If 11 < L, then by (2.1.8) we have
(cp'(x)- <P'(y),y- x) ~ L~J.LII<P'(x)- </>'(y)ll 2 ,
and that is exactly (2.1.24). D
2.1.4 Lower complexity bounds for sc;,l(Rn)

Let us get the lower complexity bounds for unconstrained minimiza-
tion of functions from the dass sr;:J}(Rn) c s~:l(Rn). Consider the
following problern dass.
Model: min J(x),

xERn
f E s;:l(Rn), 1-l > 0.
Oracle: First-order local black box.
Approximate solution: x: f(x) - !* ~ E, II x- x* 11 2 ~ E.
As in the previous section, we consider the methods satisfying Assump-

tion 2.1.4. We are going to find the lower complexity bounds for our
problern in terms of condition number Qf = ~·
Note that in the description of our problern dass we do not say any-
thing about the dimension of the space of variables. Therefore formally,
this dass includes also the infinite-dimensional problems.
We are going to give an example of some bad function defined in the
infinite-dimensional space. We could do that also in a finite dimension,
=
but the corresponding reasoning is more complicated.
Consider gX) l2 , the space of all sequences x = {x(i)}~ 1 with finite
norm
Let us choose some parameters f-l > 0 and Qf > 1, which define the
following function
Denote
A= ( -i -~ -~ ~ )
0 -1 2 .
0 0
Then f"(x) = tt(Qf 1) A + f-ll, where I is the unit operator in R 00 • In
the previous section we have already seen that 0 :::5 A :::5 4/. Therefore
This means that J11 ,Q 1 E s::,~~ 1 (R 00 ). Note that the condition number
of function f 11 ,Q1 is
Q!Jl.,Qf -_ ttQr _ Q
11 - !·
Let us find the minimum of function f 11 ,11 Q1 • The first-order optimality
condition
, (x)
! J-L,IlQf = (tt(Qr 1) A
- 4
+ u1) X-
f""'
J-L(Qr 1) e = 0
4 1
can be written as
(A + Q/- 1 ) x = e1.
The coordinate form of this equation is as follows:
2 Q r+l x(l) - x( 2) = 1,
Qr1
(2.1.25)
x(k+l)- 2Qr+ 1 x(k)
Qrl
+ x(k-l) = 0, k = 2, ....
Let q be the smallest root of the equation
q2 - 2 Q I+ 1 q + 1 = 0
Qrl '
that is q = 1-1 •
~QQt+1 Then the sequence (x*)(k) = qk, k = 1, 2, ... , satisfies
the system (2.1.25). Thus, we come to the following result.
THEOREM 2.1.13 For any xo E R 00 and any constants 1-" > 0, Qf > 1
there exists a function f E s:,~~l (R00 ) such that for any first-order
method M satisfying Assumption 2.1.4, we have
( ~~~)
2
II Xk- x* 11 2 2 k II xo- x* 11 2 ,
f(xk)- !* ~ ~ ( ~~~) 2k II xo- x* 11 2 ,

where x* is the minimum of function f and f* = f(x*).
Proof: Indeed, we can assume that xo = 0. Let us choose f(x) =
JIJ.,!J.Qf(x). Then
II xo- x* 11 2 =f ((x*)(i)]2 = fq 2i = ,S.

1
i=l i=l q
Since J:.IJ.QJ(x) is a three-diagonal operator and f~,IJ.QJ(O) = e 1 , we

conclude that Xk E Rk,oo. Therefore
= T-:r = q2k II xo- x*

00 ( ") 00 . 2(k+l)
II Xk- x* 11 2 ~ L: [(x*) ~ J2 = L: q2~ 11 2 •
i=k+ 1 i=k+ 1 q
The second bound of the theorem follows from the first one and The-
orem 2.1.8. 0
2.1.5 Gradient method

Let us check how the gradient method works on the problern
min f(x)
xERn
with f E :t'l' (Rn).

1 Recall that the scheme of the gradient method is as
follows.
Gradient method
0. Choose x 0 ERn.

a). Compute f(xk) and f'(xk).
b). Find Xk+l = Xk- hkf'(xk) (see Section 2 for
step-size rules).
In this section we analyze the simplest variant of the gradient scheme

with hk = h > 0. It is possible to show that for all other reasonable
step-size rules the rate of convergence of this method is similar. Denote
by x* the optimal point of our problern and f* = f(x*).
THEOREM 2.1.14 Let jE Fi,• 1 (Rn) and 0 < h < Then the gradient f·
method generates a sequence {xk}, which converges as follows:
f(xk)- f* ::; 2llxo-!~11~x~t~;~r~~u·c~:)-/*) ·

Proof: Denote rk =II Xk- x* II· Then
r~+l = II Xk - x* - hf'(xk) 11 2
= r~- 2h(f'(xk), Xk - x*) + h2 II f'(xk) 11 2
< r~- h(f- h) II f'(xk) 11 2
(we use (2.1.8) and f'(x*) = 0). Therefore rk ::; r 0 . In view of (2.1.6)
we have
where w = h(l- th). Denote Ö.k = f(xk)- f*. Then
ö.k::; (f'(xk), Xk- x*) ::; ro II f'(xk) II .

Therefore ö.k+l ::; Ö.k - ;%-ll.~. Thus,
0
Summing up these inequalities, we get
~
'-"k+l
>
-
} + r~(k
'-"O 0
+ 1).
0
In order to choose the optimal step size, we need to maximize the

function 4J(h} = h(2- Lh) with respect to h. The first-order optimality
condition 4J'(h} = 2 - 2Lh = 0 provides us with the value h* = In t·
this case we get the following efficiency estimate of the gradient method:
f( ) f* < 2L(f(xo)-f*)llxo-x*l! 2 (2.1.26)

Xk - - 2LJixo-x•IJ2+k·(/(xo)- /*) ·
Further, in view of (2.1.6) we have
f(xo) < f* + (f'(x*), xo- x*} + t II xo- x* 11 2
= f* + t II xo - x* 11 2 .
Since the right-hand side of inequality (2.1.26) is increasing in f(xo)- f*,

we obtain the following result.
COROLLARY 2.1.2 lf h = t and jE .rt' (Rn), then
1
f( Xk ) _ f* <
-
2LIIxo-x•ll 2
k+4 . {2.1.27}
Let us estimate the performance of the gradient method on the dass

of strongly convex functions.
THEOREM 2.1.15 lf jE S~:l(Rn) and 0 < h ~ 1-L!L' then the gradient
method generates a sequence {xk} such that
II Xk - x* 11 2 ~ ( 1 - ~_fJ;) k II xo- x* 11 2 .
lf h = 1-L!L' then
llxk-x*ll < (~;~i)kllxo-x*ll,
f(xk)- f* < ~ (~;~i) 2 k II xo- x* 11 2 ,
where Qf = LfJ-L.
Proof: Denote rk =II Xk- x* 11. Then
rf+l = II Xk- x* - hf'(xk) 11 2
= r~ - 2h(f'(xk), Xk - x*} + h2 II f'(xk) 11 2
< (1- ~-ff) r~ + h(h- 1-L!L) II f'(xk) 11 2
(we use (2.1.24) and f'(x*) = 0). The last inequality in the theorem
follows from the previous one and (2.1.6). 0
Recall that we have seen already the step-size rule h = 1-L!L and
the linear rate of convergence of the gradient method in Section 1.2.3,
Theorem 1.2.4. But that was only a local result.
Comparing the rate of convergence of the gradient method with the

lower complexity bounds (Theorems 2.1.7 and 2.1.13), we can see that
the gradient method is far from being optimal for classes Fz• 1 (Rn) and
s~:i(Rn). We should also note that on theseproblern classes the Stan-
dard unconstrained minimization methods (conjugate gradients, variable
metric) have a similar global efficiency bound. The optimal methods for
minimizing smooth convex and strongly convex functions will be consid-
ered in the next section.
2.2 Optimal Methods

(Optimal methods; Convex sets; Constrained minimization problem; Gradient
mapping; Minimization methods over a simple set.)
2.2.1 Optimal methods

In this section we consider an unconstrained minimization problern
min f(x),
xeRn
with f being strongly convex: f E s~;i_(Rn), t.t 2: 0. Formally, this

family of classes contains also the dass of convex functions with Lipschitz
gradient (S~·z(Rn) = Fz• 1 (Rn)).
In the pr~vious section we proved the following efficiency estimates
for the gradient method:
!( Xk ) _ J* <
-
2LIIxo-x*ll 2
k+4 '
These estimates differ from our lower complexity bounds (Theorem 2.1.7
and Theorem 2.1.13) by an order of magnitude. Of course, in general
this does not mean that the gradient method is not optimal since the
lower bounds might be too optimistic. However, we will see that in our
case the lower bounds are exact up to a constant factor. We prove that
by constructing a method that has corresponding efficiency bounds.
Recall that the gradient method forms a relaxation sequence:
This fact is crucial for justification of its convergence rate (Theorem

2.1.14). However, in convex optimization the optimal methods never
rely on relaxation. Firstly, for some problern classes this property is too
expensive. Secondly, the schemes and efficiency estimates of optimal
methods are derived from some global topological properties of convex

functions. From this point of view, relaxation is a too "microscopic"
property to be useful.
The schemes and efficiency bounds of optimal methods are based on
the notion of estimate sequence.
DEFINITION 2.2.1 A pair of sequences {1>k(x)}~ 0 and {Ak}~ 0 , >.k 2: 0

is called an estimate sequence of function f(x) if
and for any x E Rn and all k 2: 0 we have
{2.2.1)
The next statement explains why these objects could be useful.
LEMMA 2.2.1 lf for some sequence {xk} we have
{2.2.2}
then f(xk)- f* ~ >.k[<!>o(x*)- !*] --+ 0.
Proof: Indeed,
Thus, for any sequence {xk}, satisfying (2.2.2) we can derive its rate
of convergence directly from the rate of convergence of sequence { Ak}.
However, at this moment we have two serious questions. Firstly, we do
not know how to form an estimate sequence. And secondly, we do not
know how we can ensure (2.2.2). The first question is simpler, so let us
answer it.
LEMMA 2.2.2 Assurne that:

1 f E S~:l(Rn),
2 <Po (x) is an arbitmry function an Rn,
3 {yk}~ 0 is an arbitrary sequence in Rn,
4 {ak}r=o: ak E (O, 1),

5 Ao = 1.
Then the pair of sequences {4>k(x)}r=o• {Ak}k:o recursively defined by t:
Ak+l = (1 - ak)Ak,
4>k+I(x) = (1- ak)4>k(x) {2.2.3}
+ak[f(Yk) + (J'(yk), x- Yk) + ~ II x- Yk 11 2],
is an estimate sequence.
Proof: Indeed, 4>o(x) ~ (1 - Ao)f(x) + Ao4>o(x) = 4>o(x). Further, let
(2.2.1) hold for some k ~ 0. Then
4>k+I(x) ~ (1- ak)4>k(x) + akf(x)
= (1- (1- ak)Ak)f(x) + (1- ak)(cPk(x)- (1- Ak)f(x))
< (1- {1- ak)Ak)f(x) + (1- ak)AkcPo(x)
= (1- Ak+l)f(x) + Ak+I4>o(x).
It remains to note that condition 4) ensures Ak -+ 0. D
Thus, the above statement provides us with some rules for updating
the estimate sequence. Now we have two control sequences, which can
help to ensure inequality (2.2.2). Note that we arealso free in the choice
of initial function c/Jo(x). Let us choose it as a simple quadratic function.
Then we can obtain the exact description of the way c/J'k varies.
LEMMA 2.2.3 Let 4>0 (x) = 4>0+ ~ II x- vo 11 2 . Then the process {2.2.3}
preserves the canonical form of functions {4>k(x)}:
cPk(x) = c/J'k + 1f II x- Vk 11 2 , {2.2.4}
where the sequences {'yk}, {Vk} and {c/Jk} are defined as follows:
'Yk+l = (1 - ak)'Yk + akp.,
cPk+I = (1- ak)4>k + akf(Yk)- 21:~ 1 II f'(yk) 11 2
+ak(~~:;bk (~ II Yk- Vk 11 2 +(f'(yk),vk- Yk)) ·

Proof: Note that cp~(x) = 'Yoln. Let us prove that cf;%(x) = 'Ykln for all
k ~ 0. Indeed, if that is true for some k, then
cf;%+1 (x) = (1- ak)cf;%(x) + akJ.Lln = ((1- akhk + akJ.L)ln ='Yk+1In.

This justifies the canonical form (2.2.4) of functions cpk(x).
Further,
if;k+l(x) = (1- ak) (if;ic + 1t II x- Vk 11 2 )
+ ak[f(yk) + (f'(yk), x- Yk) + ~ II x- Yk 11 2].
Therefore the equation cpk+l (x) = 0, which is the fi.rst-order optimality

condition for function cpk+l(x), Iooks as follows:
From that we get the equation for the point vk+ 1 , which is the minimum
of the function cpk+l(x).
Finally, let us compute if;'ic+l· In view of the recursion rule for the
sequence {cpk(x)}, we have
cpk+1 + 'Ykt II Yk - Vk+1 11 2 = cpk+l (Yk)

(2.2.5)
Note that in view of the relation for vk+ 1,
Therefore
'Ykr II Vk+l - Yk 11 2 = 2"f!+l [(1 - ak) 2 'Y~ II Vk - Yk 11 2
It remains to substitute this relation into (2.2.5) noting that the factor
for the term II Yk- Vk 11 2 in this expression is as follows:
(1- ak)'lk.-
2
_1_{1- ak)2'Y2
2'Yk+l k
= {1- ak)'lk. (1- (1-akhk)
2 'Yk+l
0
Now the situation is more clear and we are close to getting an algo-
rithmic scheme. Indeed, assume that we already have Xk:
Then, in view of the previous Iemma,
Since f(xk) 2: f(Yk) + (f'(yk), Xk - Yk), we get the following estimate:

2
cf>*k+l 2: f(Yk)- 21~~1 II f'(Yk) 11 2
+(1- ak)(f'(yk ), ~(vk- Yk) + Xk- Yk)·

lk+l
Let us Iook at this inequality. We want to have c/>k+l 2: f(xk+l)· Recall,

that we can ensure the inequality
in many different ways. The simplest one is just to take the gradient
step
Xk+l = Yk- hkf'(xk)
with hk = ±(see (2.1.6)). Let us define ak as follows:
2
Then 2""lk+l
0
k 2~ and we can replace the previous inequality by the
following:
Now we can use our freedom in the choice of Yk· Let us find it from the
equation:
~krk ( Vk - Yk)
lk+l
+ Xk - Yk = 0.
That is
Thus, we come to the following method.
General scheme of optimal method
0. Choose xo ERn and 10 > 0. Set vo = xo.

1. kth iteration (k 2': 0).
a). Compute ak E {0, 1) from equation
La~ = {1 - akhk + akfJ·

Set lk+l = {1- akhk + akfJ· (2.2.6)
b). Choose
1 _ Qk/'kVk+l'kt!Xk
Yk- l'k+akJL
and compute f(Yk) and f'(Yk)·

c). Find Xk+l suchthat
(see Section 1.2.3 for the step-size rules}.

d). Set Vk = (l-akhkvk+akllYk-akf'(Yk).
+l 'Yktl
Note that in Step 1c} ofthis scheme we can choose any Xk+l satisfying
the inequality
with some w > 0. Then the constant b replaces L in the equation of

Step 1a).
THEOREM 2.2.1 The scheme (2.2.6} generates a sequence {xk}~ 0 such

that
f(xk) - f* :S Ak [f(xo) - f* +f II xo - x* 11 2 ] ,
where Ao =1 and )..k = ll~01 {1- ai).
Proof: Indeed, let us choose </Jo(x) = f(xo) + ~ II x- vo 11 2 . Then

f(xo) = <Po and we get f(xk) ::; <P'k by construction of the scheme. It
remains to use Lemma 2.2.1. D
Thus, in order to estimate the rate of convergence of (2.2.6), we need

to understand how fast >.k goes to zero.
LEMMA 2.2.4 If in the scheme (2.2.6} /O;:::: J-t, then
{2.2. 7)
Proof: Indeed, if /k ;:::: J-t, then /k+l = La~ = (1 - akhk + akJ-t ;:::: 1-t·
Since /O ;:::: J-t, we conclude that this inequality is valid for all /k· Hence,
jii
ak ;:::: and we have proved the first inequality in (2.2.7).
Further, let us prove that /k ;:::: 1o>.k. Indeed, since /o = /o>.o, we can
use induction:
/k+l ;:::: (1- akhk? (1- ak)/o>.k = /O)..k+l·

Therefore La~ = /k+l ;:::: /o>.k+l·
Denote ak = k.Since {>.k} is a decreasing sequence, we have
,;>:;- .;>:;:±; Ak- Akt 1

ak+l- ak = vf..\k..\k+l = vf..\k..\ktt(V'Xk+vf.Xk±d
> ..\k-..\ktt = Ak-(1-ak)..\k = ct& >1 /JQ,
-2..\k~ 2.xk~ 2~-2VT
Thus, ak;:::: 1 + ~fl and the Iemma is proved. D
Let us present an exact statement on optimality of (2.2.6).

THEOREM 2.2.2 Let us take in (2.2.6} /o = L. Then this scheme gen-
erates a sequence {xk}r:o such that
f(xk)- !* ~ Lmin { (1-/f/, (k;2)2} II xo- x* 11 2 .
This means that {2.2. 6) is optimal for unconstrained minimization of

the functions from s~;i(Rn), 1-t ;:::: 0.
Proof: We get the above inequality using f(xo)- f* ~ ~ II xo- x* 11 2

and Theorem 2.2.1 with Lemma 2.2.4.
Let J-t > 0. From the lower complexity bounds for the dass (see
Theorem 2.1.13) we have
f( xk)- f* > 1:!:. (&-1)2k R2 >- l!exp

- 2 ..jQ/+1 2
(- 4k ) R2
..jQ/-1 '
78 INTRODUCTORY LEGTURES ON GONVEX OPTIMIZATION
where Qf = L/~-t and R =II xo- x* II· Therefore, the worst case bound
for finding Xk satisfying f(xk) - f* ~ E cannot be better than
k>- .,fQi4 -l [In 1€ + In !!2 + 2In R] .

For our scheme we have
f(xk)- f* ~ LR2 (1-lf)k ~ LR 2exp ( -)ij).

Therefore we guarantee that k ~ J([j [ln -: + ln L + 2ln R] . Thus, the
main term in this estimate, JQi In ~, is proportional to the lower bound.
The same reasoning can be used for the dass (Rn). sJ:l D
Let us analyze a variant of the scheme {2.2.6), which uses the gradient
step for finding the point Xk+I·
Constant Step Scheme, I
0. Choose xo ERn and 'Yo > 0. Set vo = xo.

a). Compute ak E {0, 1) from the equation
La~ = (1 - ak)rk + CtkJ-t· (2.2.8)
Set rk+l = (1- akhk + CtkJ-t·

b). Choose y = ak"J'kvk+"l'k+lxk.
k "l'k+akJl
Compute f(Yk) and f'(yk)·
c). Set Xk+I = Yk --j;J'(yk) and
Vk+I = ;;l--[{1-
lk±l
ak)rkvk + CtkJ-tYk- akf'(Yk)].
Let us demonstrate that this scheme can be rewritten in a simpler

form. Note that
Yk = 'Yk+1akJl (akrkVk + 'Yk+lxk),
Yk- tf'(Yk),
= ~[(1-
lk±l
ak)rkvk + CikJ.tYk- akf'(Yk)].
Therefore
_ _1_ { (1-akhk } _ 1-a~sx _ -2LJ'( )

- 'Yk+t ak Yk + J-LYk ak k 'Yk+l Yk
Hence,
= Xk
+1
+ Clktl'Yktl(Vk±l-Xktl)
'Yk+l +ak+tJJ
= Xk + 1 + fJ,f.lk(Xk + 1 _ Xk)
'
where
Thus, we managed to get rid of {vk}. Let us do the same with 'Yk· We
have
Therefore
_ 'Yk 1 1-ak) _ ak(l-ak)

- ak 'Yk+l +ak+l L) - af +ak±l ·
Note also that a~+l = (1- ak+I)a~ + qak+l with q = J-L/ L, and
a5L = (1- aoho + J-Lao.

The latter relation means that 'Yo can be seen as a function of ao. Thus,
we can completely eliminate the sequence {'Yk}. Let us write down the
corresponding scheme.
Constant Step Scheme, li
0. Ghoose xo ERn and ao E {0, 1).

Set Yo = xo and q = 1;.
1. kth iteration (k ;::: 0).
a). Garnpute f(Yk) and f'(Yk)· Set
{2.2.9)
Xk+l = Yk - tf'(Yk)·
b). Garnpute ak+l E {0, 1) from equation
a~+l = {1- ak+l)a~ + qak+l,

and set ß = a1(l-ak)
k ak+ak+l'
The rate of convergence of the above scheme can be derived from

Theorem 2.2.1 and Lemma 2.2.4. Let us write down the corresponding
statement in terms of a 0 .
THEOREM 2.2.3 lf in scheme (2.2.9}
{2.2.10}
then
f (xk) - f* S min { (1 - .ft) k , ( Jt1Zy'1o')2}

2
x [f(xo) - !* + lll xo - x* 11 2 ] ,
where "'O = no(noL-tL).
1 1-ao
We do not need to prove this theorem since the initial scheme is not
changed. We change only notation. In Theorem 2.2.3 condition {2.2.10)
is equivalent to /o ~ Jl-·
Scheme {2.2.9) becomes very simple if we choose ao = {this cor- .ft
responds to /o = 11-). Then
for all k ;:::: 0. Thus, we corne to the following process.
Constant step scheme, 111
0. Choose Yo = xo E Rn.
1. kth iteration (k;:::: 0). (2.2.11)
Xk+l = Yk - tf'(Yk),
However, note that this process does not work for p. = 0. The choice
'Yo = L (which changes corresponding value of ao) is safer.
2.2.2 Convex sets

Let us try to understand which constrained rninirnization problern we
can solve. Let us start frorn the sirnplest problern of this type, the
problern without functional constraints:
rnin f(x),
xEQ
where Q is sorne subset of Rn. In order to rnake our problern tractable,

we should irnpose sorne assurnptions on the set Q. And first of all, let
us answer the following question: Which sets fit naturally the class of
convex functions? Frorn definition of convex function,
f(ax + (1- a)y) :::; af(x) + (1- a)f(y), Vx, y ERn, a E [0, 1],
we see that it is irnplicitly assurned that it is possible to check this

inequality at any point of the segment [x, y]:
[x,y] = {z = ax + (1- a)y, a E [0, 1]}.

Thus, it would be natural to consider a set that contains the whole
segrnent [x, y] provided that the end points x and y belang to the set.
Such sets are called convex.
DEFINITION 2.2.2 Set Q is called convex if for any x, y E Q and a
from [0, 1] we have
ax + (1- a)y E Q.
The point ax + {1- a)y with a E [0, 1] is called a convex combination

of these two points.
In fact, we have already met some convex sets.
LEMMA 2.2.5 If f(x) is a convex function, then for any ß E R 1 its level
set
is either convex or empty.

Proof: Indeed, let X and y belong to c,(ß). Then J(x) ~ ß and
J(y) ~ ß. Therefore
f(ax + (1- a)y) :$ af(x) + {1 - a)f(y) :$ ß.
0
LEMMA 2.2.6 Let J(x) be a convex function. Then its epigraph
Cf = {(X, T) E Rn+ 1 I J(X) :$ T}
is a convex set.
Proof: lndeed, let z1 = (x1,rt) E Cf and z2 = (x2,r2) E Cf· Then for

any a E [0, 1] we have
Za =az1 + {1- a)z2 = (ax1 + (1- a)x2, ar1 + (1- a)r2),

Thus, z 0 E Cf· 0
Let us look at some properties of convex sets.

THEOREM 2.2.4 Let Q1 ~ Rn and Q2 ~ Rm be convex sets and A(x)
be a linear operator:
Then all sets below are convex:

1. Intersection (m = n}: Ql nQ2 = {x ERn I XE Ql, XE Q2}·
2. Sum (m = n): Q1 + Q2 = {z = x + y I x E Qt, y E Q2}.
3. Direct sum: Q1 x Q2 = {(x, y) E Rn+m I x E Q1, y E Q2}·
4· Conic hull: K:(Ql) = {z ERn I z = ßx, x E Qbß ~ 0}.
5. Convex hull
Conv (Q I, Q2) = { z E Rn I z = ax + (1 - a),
y,x E QI, y E Q2, a E [0, 1]}.
6. Affine image: A(QI) = {y E Rm I y = A(x), x E QI}·

7. Inverse affine image: A-I(Q2) = {x ERn I A(x) E Q2}.
Proof: 1. If XI E QinQ2, XI E QlnQ2, then [xl,X2] c QI and
[xi, x2] c Q2. Therefore [xl, X2] c Ql n Q2.
2. If ZI = Xl + X2, XI E Ql, X2 E Q2 and Z2 = YI + Y2, YI E Q1,
Y2 E Q2, then
azi + (1- a)z2 = (ax1 + (1- a)yi)l + (ax2 + (1- a)y2)2,
where (·h E Q1 and (·)2 E Q2.
3. If ZI = (xi, X2), XI E QI, X2 E Q2 and Z2 = (yi, Y2), YI E QI,
Y2 E Q2, then
azi + (1- a)z2 = ((axi + (1- a)yi)1, (ax2 + (1- a)y2)2),
where (·h E Q1 and (·)2 E Q2.
4. If ZI = ßiXI, XI E QI, ßi ;::: 0, and Z2 = ß2x2, X2 E QI, ß2 ;::: 0,
then for any a E [0, 1J we have
azi + (1- a)z2 = aß1x1 + (1- a)ß2x2 = 1(ax1 + (1- a)x2),

where 1 = aßl + (1- a)ß2, and a = aßd'Y E [0, 1].
5. If z1 = ßixi + (1 - ßdx2, x1 E Q1, x2 E Q2, ß1 E [0, 1], and
Z2 = ß2Y1 + (1 - ß2)y2, Yl E Ql, Y2 E Q2, ß2 E [0, 1], then for any
a E [0, 1] we have
az1 + (1 - a)z2 = a(ß1x1 + (1 - ßl)x2)
+(1 - a)(ß2YI + (1 - ß2)y2)
= a(ß1x1 + (1- ßi)Yd

+(1 - a)(/32x2 + (1 - ß2)Y2),
where a = aß1 + (1- a)ß2 and ß1 = aßi/a, ß2 = a(1- ßi)/(1- a).
6. If YI, Y2 E A(QI) then YI = Ax1 +band Y2 = Ax2 + b for some x1,
x2 E Q1. Therefore, for y(a) = ay1 + (1- a}y2, 0 ~ a ~ 1, we have
y(a) = a(Ax1 + b) + (1- a)(Ax2 + b) = A(ax1 + (1- a)x2) + b.
Thus, y(a) E A(Ql).

7. If x1, x2 E A- 1(Q2) then Ax1 + b = Y1 and Ax2 + b = Y2 for some
YI, Y2 E Q2. Therefore, for x(a) = ax1 + (1- a)x2, 0 ~ a ~ 1, we have
A(x(a)) = A(ax1 + (1- a)x2) + b

= a(Ax1 + b) + (1- a)(Ax2 + b) = ay1 + (1- a)y2 E Q2.
0
Let us give several examples of convex sets.

EXAMPLE 2.2.1 1. Half-space {x E Rn I (a,x} ~ ß} is convex since
linear function is convex.
2. Polytope {x E Rn I (ai, x) ~ bi, i = 1 ... m} is convex as an
intersection of convex sets.
3. Ellipsoid. Let A =AT!:: 0. Then the set {x ERn I (Ax, x) ~ r 2} is
convex since function (Ax, x} is convex. 0
Let us write down the optimality conditions for the problern
(2.2.12)
where Q is a closed convex set. It is clear that the old condition
f'(x) = 0
does not work here.
EXAMPLE 2.2.2 Consider the one-dimensional problem:
minx.
x~O
Here x E Rl, Q = {x : x ~ 0} and f(x) = x. Note that x* = 0 but

f'(x*) = 1 > 0. 0
THEOREM 2.2.5 Let f E .1'1 (Rn) and Q be a closed convex set. The
point x* is a solution of (2.2.12} if and only if
(J'(x*),x- x*} ~ 0 (2.2.13}

for all x E Q.
Proof: Indeed, if (2.2.13) is true, then
f(x) 2 f(x*) + (J'(x*), x- x*) 2 f(x*)

for all x E Q.
Let x* be a solution to (2.2.12). Assurne that there exists some x E Q
suchthat
(J'(x*), x- x*) < 0.
Consider the function cf;(a) = f(x* + a(x- x*)), a E [0, 1]. Note that
cf;(O) = f(x*), cj;'{O) = (J'(x*), x- x*} < 0.
Therefore, for a small enough we have
f(x* + a(x- x*)) = cf;(a) < cf;(O) = f(x*).
That is a contradiction. 0
THEOREM 2.2.6 Let f E S~(Rn) and Q be a closed convex set. Then

there exists a unique solution x* of problern {2.2.12}.
Proof: Let xo E Q. Consider the set Q = {x E Q I f(x) ~ f(xo)}.

Note that problern (2.2.12) is equivalent to
min{f(x) I x E Q}. {2.2.14)
However, Q is bounded: for all x E Q we have

f(xo) 2 f(x) 2 f(xo) + (f'(xo), x- xo) + ~ II x- xo 11 2 •
Hence, II x- xo II~ ~ II f'(xo) II·

Thus, the solution x* of (2.2.14) (::: (2.2.12)) exists. Let us prove that
it is unique. Indeed, if xi is also a solution to (2.2.12), then
f* = f(xi) 2 f(x*) + (J'(x*), xi - x*) + ~ II xi - x* 11 2
2 !* + ~ II xi - x* 11 2
(we have used Theorem 2.2.5). Therefore xi = x*. 0

2.2.3 Gradient mapping

In the constrained minimization problern the gradient of the objective
function should be treated differently as compared to the unconstrained
situation. In the previous section we have already seen that its role in
optimality conditions is changing. Moreover, we cannot use it anymore in
a gradient step since the result could be infeasible, etc. If we Iook at the
main properties of the gradient, which we have used for f E 1 (Rn), .r'l'
we can see that two of them are of the most importance. The first
one is that the gradient step decreases the function value by an amount
comparable with the squared norm of the gradient:
f(x- tf'(x)) ~ f(x)- lr, II f'(x) 11 2 •
And the second one is the inequality

{f'(x), x- x*) ;::: ± f'(x)
II 11 2 •
It turns out that for constrained minimization problems we can in-

troduce an object that inherits the most important properties of the
gradient.
DEFINITION 2.2.3 Let us fix some r > 0. Denote
XQ(x; 'Y) =arg min [f(x)
xEQ
+ (f'(x), x- x) + ~ II x- x 11 2] ,
9Q(x; 'Y) = r(x- xQ(x; 'Y)).
We call gQ(r, x) the gradient mapping of f on Q.

For Q =Rn we have
xQ(x; r) = x- ~f'(x), 9Q(x; r) = J'(x).
Thus, the value ~ can be seen as a step size for the "gradient" step
x--+ XQ(X;[).
Note that the gradient mapping is weil defined in view of Theorem
2.2.6. Moreover, it is defined for all x ERn, not necessarily from Q.
Let us write down the main property of gradient mapping.
THEOREM 2.2. 7 Let f E S~:l(Rn), r ;::: L and xE Rn. Then for any
x E Q we have
f(x) ;::: f(xQ(x; 'Y)) + (gQ(x; 'Y), x- x)
(2.2.15)
+2~ II 9Q(x;r) 11 2 +~ II x-x 11 2 .
Proof: Derrote XQ = xQ(/,x), 9Q = 9Q(/,x) and let
cp(x) = J(x) + (f'(x), x- x) + ~ II x- x 11 2 .
Then cp'(x) = f'(x) + r(x- x), and for any x E Q we have
(f'(x)- 9Q,x- XQ) = (cp'(xQ),x- XQ) ~ 0.
Hence,
f(x)- ~ II x- x 11 2 > f(x) + (f'(x),x- x)

= f(x) + (f'(x), xQ- x) + (f'(x), x- xQ)
> J(x) + (f'(x), xQ- x) + (gQ, x- xQ)
= cfJ(xQ)- ~ II XQ- x 11 2 +(gQ, x- XQ)
= cfJ(xQ)- 2~ II 9Q 11 2 +(gQ, x- XQ)
= cfJ(xQ) + 2~ II 9Q 11 2 +(gQ, x- x),
and cp(xQ) ~ f(xQ) since 1 ~ L. 0
COROLLARY 2.2.1 Let jE S~;i(Rn), r~L and x ERn. Then
(2.2.16)
(2.2.17)
Proof: Indeed, using (2.2.15) with x = x, we get (2.2.16). Using (2.2.15)

with x = x*, we get (2.2.17) since f(xQ(x; r)) ~ f(x*). 0
2.2.4 Minimization methods for simple sets

Let us show how we can use the gradient mapping for solving the
following problem:
min J(x),
xEQ
where f E s!:i(Rn) and Q is a closed convex set. We assume that

the set Q is simple enough, so the gradient mapping can be computed
explicitly. This assumption is valid for positive orthant, n dimensional
box, simplex, Euclidean ball and some other sets.
Let us start from the gradient method:
Gradient method for simple sets
0. Choose xo E Q. (2.2.18)
Xk+l = Xk- hgQ(Xk;L).
The efficiency analysis of this scheme is very similar to that of the

unconstrained version. Let us give an example of such a reasoning.
THEOREM 2.2.8 Let f E S!;l(Rn). lf in scheme {2.2.18) h = ·!, then
II Xk - x* 11 2 ~ (1- z)k II Xo- x* 11 2 .
Proof: Denote rk =II xk- x* II, 9Q = 9Q(Xki L). Then, using inequality
(2.2.17), we obtain
Tf+l = II Xk - x* - hgQ 11 2 = Tf - 2h(gQ, Xk - x*) + h2 II 9Q 11 2
< (1-ht-t)rf+h(h-t) IIga 11 2 = (1-r)rf.

D
Note that for the step size h = t we have

Xk+l = Xk- t9Q(Xkj L) = XQ(Xki L).
Consider now the optimal schemes. We give only a sketch of justifi-
cation since it is very similar to that of Section 2.2.1.
First of all, we define the estimate sequence. Assurne that x 0 E Q.
Define
4>o(x) = f(xo) + lll x- xo 11 2 ,
cPk+I(x) = {1- ak)cPk(x) + ak[f(xQ(Yki L)) + A II 9Q(Yki L) 11 2
+(gQ(Yk;L),x- Yk) +~ II x- Yk 11 2 ).
Note that the form ofthe recursive rule for cfJk(x) is changed. The reason
is that now we use inequality (2.2.15} instead of (2.1.16). However,
this modification does not change the analytical form of recursion and
therefore it is possible to keep all convergence results of Section 2.2.1.
Similarly, it is easy to see that the estimate sequence {cfJk(x)} can be
written as
with the following recursive rules for 'Yk, vk and c/J'k:
'Yk+I
+ ( E- 2-y:~l) II gQ(Yki L) 11 2
+ak(~~:thk (~ II Yk -Vk 11 2 +(gQ(Yk;L),vk -yk)) ·
Further, assuming that c/J'k ~ f(xk) and using the inequality
f(xk) ~ f(xQ(Yki L}) + (gQ(Yki L), Xk- Yk)

+A II gQ(Yki L) 11 2 +~ II Xk- Yk 11 2 ],
we come to the following lower bound:
c/J'k+l ~ (1- ak)f(xk) + akf(xQ(Yki L))
+( n- 2-y:~~) 11 YQ(Yki L) 11 2
> f(xQ(Yki L)) + ( A- 27:~ 1 ) II gQ(Yki L) 11 2
+(1- ak}(gQ(Yki L), ~=+~ (vk- Yk) + Xk- Yk)·

Thus, again we can choose

Xk+I = XQ(Yki L),
Yk = 'Yk+letk/L (O!k"fkVk + 'Yk+IXk)·

Let us write down the corresponding variant of scheme (2.2.9).
Constant Step Scheme, II. Simple sets.
0. Choose xo ERn and ao E (0, 1).

Set Yo = xo and q = z.
1. kth iteration (k 2 0).
a). Compute f(Yk) and f'(Yk)· Set
(2.2.19)
b). Compute ak+l E (0, 1) from equation
a~+l = {1- ak+I)a~ + qak+ll
Clearly, the rate of convergence of this method is given by Theo-

rem 2.2.3. In this scheme only points {xk} are feasible for Q. The
sequence {yk} is used for computing the gradient mapping and may be
infeasible.
2.3 Minimization problern with smooth

components
(Minimax problem: grodient mapping, gradient method, optimal methods; Pro-
blem with functional constraints; Methods for constrained minimization.)
2.3.1 Minimax problern

Very often the objective function of an optimization problern is com-
posed by several components. For example, the reliability of a complex
system usually is defined as a minimal reliability of its parts. A con-

strained minimization problern with functional constraints provides us
with an example of interaction of several nonlinear functions, etc.
The siruplest problern of that type is called the minimax problem. In
this section we deal with the smooth minimax problem:
min
xEQ
[!(x) = m.ax fi(x)]
l:O:::l:O:::m
, {2.3.1)
where Ii E s~;}.(Rn), i = 1 ... m and Q is a closed convex set. We call

the function f(x) max-type function composed by the components fi(x).
We write f E S~;i_(Rn) if all components of function f belangtothat
class.
Note that, in general, f(x) is not differentiable. However, provided
that all fi are differentiable functions, we can introduce an object, which
behaves exactly as a linear approximation of a smooth function.
DEFINITION 2.3.1 Let f be a max-type function:
f(x) = m.ax fi(x).

1:5z:Sm
Function
f(x; x) = m.ax [fi(x) + (Jf{x), x- x)],
1:5z:5m
is called the linearization of f(x) at x.
Campare the following result with inequalities (2.1.16) and (2.1.6).
LEMMA 2.3.1 For any x ERn we have
f(x) ;:::: f(x; x) +~ II x- x 11 2 , {2.3.2}
f(x) ~ f(x; x) +t II x- x 11 2 • {2.3.3}
Proof: Indeed,
fi(x) ;:::: fi(x) + (Ji(x), x- x) + ~ II x- x 11 2
(see (2.1.16)). Taking the maximum ofthis inequality in i, we get (2.3.2).

For {2.3.3) we use inequality
fi(x) ~ fi(x) + (Jf(x), x- x) + t II x- x 11 2
(see (2.1.6)). 0
Let us write down the optimality conditions for problern {2.3.1) (com-
pare with Theorem 2.2.5).
THEOREM 2.3.1 A point x* E Q is a solution to {2.3.1} if and only if
for any x E Q we have
f(x*; x) ~ f(x*; x*) = f(x*). {2.3.4)
Proof: Indeed, if (2.3.4) is true, then

f(x) ~ f(x*; x) ~ f(x*; x*) = f(x*)
for all x E Q.
Let x* be a solution to (2.3.1). Assurne that there exists x E Q such
that f(x*; x) < f(x*). Consider the functions
</>i(a) = fi(x* + a(x- x*)), i = 1 ... m.
Note that for all i, 1 ~ i ~ m, we have
fi(x*) + (!}(x*), x- x*) < f(x*) = m_ax fi(x*).
l~t~m
Therefore either </>i(O) =fi(x*) < f(x*), or

</>i(O) = f(x*), 4>~(0) = (ff(x*), x- x*) < 0.
Therefore, for a small enough we have
fi(x* + a(x- x*)) = </>i(a) < f(x*)
for all i, 1 ~ i ~ m. That is a contradiction. 0
COROLLARY 2.3.1 Let x* be a minimum of a max-type function f(x)

on the set Q. If f belongs to S~(Rn), then
f(x) ~ f(x*) +~ II x- x* 11 2
for all x E Q.
Proof: Indeed, in view of (2.3.2) and Theorem 2.3.1, for any x E Q we
have
f(x) > f(x*;x) + ~ II x- x* 11 2
> f(x*;x*) +~ II x- x* 11 2 = f(x*) +~ II x- x* 11 2 ·

0
Finally, Iet us prove the existence theorem.

THEOREM 2.3.2 Let max-type function f(x) belong to S~(Rn) with J-t >
0, and Q be a closed convex set. Then there exists a unique optimal
solution x* to the problern (2.3.1}.
Proof: Let x E Q. Consider the set Q = {x E Q I f(x) ~ f(x)}. Note
that the problern (2.3.1) is equivalent to
min{f(x) I x E Q}. (2.3.5)
But Q is bounded: for any x E Q we have
J(x) 2:: /i(x) 2:: /i(x) + (ff(x), x- x) + ~ II x- x 11 2 ,
consequently,
~ II x- x 11 2 ~11 J'(x) II · II x- x II +J(x) -fi(x).

Thus, the solution x* of (2.3.5) (and of (2.3.1)) exists.
If xi is another solution to (2.3.1), then
f(x*) = f(xi) ~ f(x*; xi) + ~ II xi - x* 11 2 ~ f(x*) + ~ II xi - x* 11 2
(by (2.3.2) ). Therefore xi = x*. D
2.3.2 Gradient mapping

In Section 2.2.3 we have introduced the gradient mapping, which re-
places the gradient for a constrained minimization problern over a sim-
ple set. Since linearization of a max-type function behaves similarly to
linearization of a smooth function, we can try to adapt the notion of
gradient mapping to our particular situation.
Let us fix some 1 > 0 and x E Rn. Consider a max-type function
f(x). Denote
J'Y(x; x) = f(x; x) +! II x- x 11 2 •
The following definition is an extension of Definition 2.2.3.
DEFINITION 2.3.2 Define
f*(x;'Y) = min/'Y(x; x),

xEQ
XJ(Xi/) = arg min/'Y(x; x),

xeQ
YJ(x; 1) = /(X- Xf(Xi/)).

We call 9J(x; ')') gradient mapping of max-type function f on Q.

For m = 1 this definition is equivalent to Definition 2.2.3. Similarly,
the point of linearization x does not necessarily belong to Q.
It is clear that f 1 (x; x) is a max-type function composed by the com-
ponents
fi(x) + (Jf(x), x- x) + ~ II x- x II 2 E S~;~(Rn), i = 0 ... m.
Therefore the gradient mapping is well defined (Theorem 2.3.2).

Let us prove the main result of this section, which highlights the simi-
larity between the properties of the gradient mapping and the properties
of the gradient (compare with Theorem 2.2.7).
THEOREM 2.3.3 Let f E S~:i(Rn). Then for all XE Q we have
f(x; x) ;:::: f*(x; 1') + (gJ(x; 1'), x- x) + 2~ II 9J(x; 1') 11 2 • (2.3.6}
Proof: Denote Xf = Xf(X;')'), 9! = 9J(x;')'). It is clear that f 1 (x;x) E

S~;~(Rn) and it is a max-type function. Therefore all results of the
previous section can be applied also to f,y.
Since Xf = argmin j 1 (x;x), in view of Corollary 2.3.1 and Theorem
xEQ
2.3.1 we have
f(x;x) = ! 1 (x;x)- ~ II x- x 11 2
> J,(x;xJ) + 1-(11 x- Xf 11 2 - II x- x 11 2 )
f*(x;')') + 1-(x- x 1,2(x- x) + x- x 1)

f*(x; 1') + (91, x- x) + 2~ II 9! 11 2 .
0
In what follows we often refer to the following corollary to Theorem

2.3.3.
COROLLARY 2.3.2 Let f E S~;i(Rn) and ')';:::: L. Then:
1. For any x E Q and x E Rn we have
f (X) ;:::: f (X f (X; ')')) + (9 f (X; ')') , X - X)
(2.3. 7)
+2\ II 9J(X;')') 11 2 +~ II X- X 11 2 .
2. lf x E Q, then
J(x,(x;'Y)) ~ J(x)- 2~ II 9J(x;'Y) 11 2 , {2.3.8}
3. For any x E Rn we have
{2.3.9}
Proof: Assumption 'Y 2: L implies that f*(x; 1) 2: f(xJ(x; 'Y)). There-

fore (2.3.7) follows from (2.3.6) since
f(x) 2: f(x;x} + ~ II x- x 11 2
for all x ERn (see Lemma 2.3.1).

Using (2.3.7) with x = x, we get (2.3.8), and using (2.3.7) with x = x*,
we get (2.3.9) since f(x,(x;"(})- f(x*) 2: 0. 0
Finally, Iet us estimate the variation of f* (x; 1) as a function of 1.

LEMMA 2.3.2 For any /1, /2 > 0 and x ERn we have
Proof: Denote Xi = x J(x; /i), 9i = 9J(x; 1i), i = 1, 2. In view of (2.3.6),

we have
J(x;x} + ~ II x- x 11 2 2: f*(x;'Yd + (91,x- x)
(2.3.10}
+ 2~1 II 91 11 2 +~ II X - X 11 2
for all x E Q. In particular, for x = x2 we obtain
f*(x; 12) = f(x; x2) +~ II x2- x 11 2
> J*(x;,l) + (91,x2- x) + 2~ 1 II 91 11 2 +~ II x2- x 11 2
= f*(x;11) + 2~ 1 II 91 11 2 -_;2 (91,92) + 2;2 II 9211 2
> J*(x;11) + 2~ 1 II 91 11 2 - 2~2 II 91 11 2 .

0
2.3.3 Minimization methods for minimax problern

As usual, we start a presentation of numerical methods for problern
(2.3.1) from a "gradient" method with constant step:
Gradient method for minimaxproblern
0. Choose x 0 E Q and h > 0. (2.3.11)

Xk+l = Xk- hgt(Xki L).
THEOREM 2.3.4 Let j E S~:l(Rn). lf in {2.3.11} we choose h $ f,

then
II Xk- x* 11 2 $ (1 - p,h)k II xo- x* 11 2 •
Proof: Denote rk =II Xk- x* II, g = gf(Xki L). Then, in view of (2.3.9)
we have
r~+l = II Xk- x*- hgQ 11 2 = r~- 2h(g,xk- x*) + h 2 II g 11 2
< (1- hJ-L)r~ + h (h- t) II g 11 2 $ (1 -p,h)r~.

0
Note that with h = t we have
Xk+l = Xk- tgf(Xki L) = Xf(Xki L).

Forthis step size, the rate of convergence of scheme (2.3.11) is as follows:
Goroparing this result with Theorem 2.2.8, we see that for minimax
problern the gradient method has the same rate of convergence, as it has
in the smooth case.
Let us check, what the situation is with the optimal methods. Recall,
that in order to develop an optimal method, we need to introduce an
estimate sequence with some recursive updating rules. Formally, the
minimax problern differs from the unconstrained minimization problern
only by the form of lower approximation of the objective function. In
the case of unconstrained minimization, inequality (2.1.16) was used for
updating the estimate sequence. Now it must be replaced by inequality

(2.3.7).
Let us introduce an estimate sequence for problern (2.3.1). Let us
fix some xo E Q and 'Yo > 0. Consider the sequences {Yk} C Rn and
{ak} C (0, 1). Define
c/Jo(x) = f(xo) + ~ II x- xo 11 2 ,
cPk+I(x) = (1- ak)<f>k(x)
+ak[' f(xJ(YkiL)) + fr II 9J(YkiL) 11 2 1
+(gj(Yki L), X- Yk} +~ II X- Yk 11 2 ].
Comparing these relations with (2.2.3), we can find the difference only
in the constant term (it is in the frame). In (2.2.3) this place was taken
by f(Yk)· This difference Ieads to a trivial modification in the results
of Lemma 2.2.3: All inclusions of f(Yk) must be formally replaced by
the expression in the frame, and f'(Yk) must be replaced by 9J(Yki L).
Thus, we come to the following lemma.
LEMMA 2.3.3 For all k 2:: 0 we have
<l>k(x) =<PZ + 1f II x- Vk 11 2 ,
where the sequences bk}, { vk} and {<Pk} are defined as follows: vo = xo,
<Po = f(xo) and
'Yk+l = (1- O:'khk + O:'kiJ.,
o2 I (
+~I 9! YkiL) II 2
+ ok(~~:~hk {~ II Yk- Vk 11 2 +(gj(Yki L), Vk- Yk}) ·
D
Now we can proceed exactly as in Section 2.2. Assurne that <Pie 2::
f(xk). Inequality (2.3.7) with x = Xk and x = Yk becomes
f(xk) 2:: f(xJ(YkiL)) + (gJ(YkiL),xk- Yk)
+A II 9J(Yki L) 1 2 +~ II Xk- Yk 11 2 •
Hence,
~ f(xJ(Yki L)) + ( A - 2:~ 1 ) II 9J(Yki L) 11 2
+(1- ak)(gj(Yki L), ~(vk-

lk+l
Yk) + Xk- Yk).
Thus, again we can choose
Xk+l Xj(Yki L),
La~ = (1- ak)rk + akt-t = rk+l•

Yk 1
1 d akf.k (akrkVk + rk+lxk)·
Let us write down the resulting scheme in the form of (2.2.!)), with
eliminated sequences {vk} and {rk}.
Constant Step Scheme, II. Minimax.
0. Choose xo ERn and ao E (0, 1).

Set Yo = xo and q = t.
a). Compute {fi(yk)} and {fi(yk)}. Set
(2.3.12)
b). Compute ak+l E (0, 1) from equation
a~+l = (1- ak+l)a~ + qak+l•

and set ßk = ak(l-ak)
a%+ak+l'
The convergence analysis of this scheme is completely identical to that

of scheme (2.2.9}. Let us just give the result.
THEOREM 2.3.5 Let the max-type function f belong to S~;l(Rn). lf in

{2.3.12} we take ao 2:: Jii, then
f(xk)- f* S min{ (1- [i)k, ( 2 VI~Zy"YY) 2 }
x [f(xo)- !* + ~ II xo- x* 11 2],
where "' = ao(aoL-JL).

tO 1-ao
Note that the scheme (2.3.12) works for all11- ~ 0. Let us write down
the method for solving (2.3.1) with strictly convex components.
VI-{Ji
0. ehoose xo E Q. Set Yo = xo, ß = VL+,fii' (2.3.13}
1. kth iteration (k 2:: 0).
Compute {fi(Yk)} and {![(yk)}. Set
THEOREM 2.3.6 For scheme (2.3.13} we have
f(xk) - f* :5 2 ( 1 -{f) k (f(x 0) - f*). (2.3.14}
Proof: Scheme (2.3.13) is a variant of {2.3.12) with a 0 = Jli.

Under
this choice, ro = 1-" and we get (2.3.14) from Theorem 2.3.5 since, in view
of Corollary 2.3.1, ~ II xo- x* 11 2 :5 f(xo)- f*. 0
To conclude this section, let us look at an auxiliary problern, which we

need to solve in order to cornpute the gradient rnapping of the rninirnax
problem. Recall, that this problern is as follows:
rnin { rn.ax [fi(xo) + (fi(xo), x- xo)] +

xEQ l~t~m
~ II x- xo 11 2 }.
Introducing the additional variables t E Rm, we can rewrite this problern

in the following way:
min { E
i=l
t(i) + ~ II x- xo 11 2 }
(2.3.15)
s. t. fi(xo) + (ff(xo), x- xo) ~ t(i), i = 1 ... m,
xEQ, tERm,
Note that if Q is a polytope, then the problern (2.3.15) is a quadratic

optirnization problem. This problern can be solved by sorne special finite
methods (simplex-type algorithms). It can be also solved by interior
point rnethods. In the latter case, we can treat rnuch rnore cornplicated
nonlinear structure of the set Q.
2.3.4 Optimization with functional constraints

Let us show that rnethods described in the previous section can be
used for solving a constrained minimization problern with smooth func-
tional constraints. Recall, that the analytical form of such a problern is
as follows:
min fo(x),
s.t. fi(x) ~ 0, i = 1 ... m, (2.3.16)
XE Q,
where the functions fi are convex and srnooth and Q is a closed convex
set. In this section we assume that fi E s!:l
(Rn), i = 0 ... m, with sorne
J1. > 0.
The relation between the problern (2.3.16) and rninirnax problerns
is established by sorne special function of one variable. Consider the
parametric rnax-type function
f(t; x) = max{fo(x)- t; fi(x), i = 1 ... m}, t ER\ x E Q.

Let us introduce the function
f*(t) = rnin f(t; x). (2.3.17)
xEQ
Note that the components of max-type function f(t; ·) are strongly con-
vex in x. Therefore, for any t E R 1 the solution of problern (2.3.17),
x*(t), exists and is unique in view of Theorem 2.3.2.
We will try to get close to the solution of (2.3.16) using a process
based on approximate values of function f* (t). This approach can be
seen as a variant of sequential quadratic optimization. It can be applied
also to nonconvex problems.
Let us establish some properties of function f* (t).
LEMMA 2.3.4 Let t* be an optimal value of problern (2.3.16}. Then
f* (t) < 0 for alt t ?:. t*,
f*(t) > 0 for all t < t*.
Proof: Let x* be a solution to (2.3.16). 1ft?:. t*, then
f*(t) < f(t; x*) = max{fo(x*) - t; ]i(x*)}
< max{ t* - t; fi(x*)} ~ 0.
Suppose that t < t* and f*(t) ~ 0. Then there exists y E Q suchthat
fo (y) ~ t < t*, Ii (y) ~ 0, i = 1 ... m.
Thus, t* cannot be an optimal value of (2.3.16). 0
Thus, the smallest root of function f* (t) corresponds to the optimal

value of the problern (2.3.16). Note also, that using the methods of
the previous section, we can compute an approximate value of function
f*(t). Hence, our goal is to form a process of finding the root, based on
that information. However, forthat we need to know more properties of
function f*(t).
LEMMA 2.3.5 For any!:::..?:. 0 we have
j*(t)-!:::.. ~ j*(t + !:::..) ~ f*(t).

Proof: Indeed,
f*(t + ß) = min
xEQ
~ax
l~z~m
{fo(x)- t- ß; fi(x)}
< min ~ax {fo(x)- t; fi(x)} = f*(t),

xeQ l~z~m
f*(t + ß) = min m.ax {fo(x)- t; fi(x) + ß}- ß

xEQ l~z~m
> min ~ax {fo(x)- t; fi(x)}- ß = f*(t)- ß.

xeQ l~z~m
D
In other words, function f*(t) decreases in t and it is Lipschitz con-

tinuous with constant equal to 1.
LEMMA 2.3.6 For any t1 < t2 and b.. ~ 0 we have
f*(tl - ß) ~ f*(ti) + ß f*(t~l=f:(t2). (2.3.18}
Proof: Denote to = t1 - b.., a = t 2 ~to - t 2 -~+ß E [0, 1]. Then

t1 = (1 - a)to + at2 and (2.3.18) can be written as
(2.3.19)
Let x 0 = (1- a)x*(to) + ax*(t2). Wehave
f*(ti) <
< ~~~~m
m.ax {(1- a)(fo(x*(to))- to) + a(fo(x*(t2))- t2);
(1- a)fi(x*(to)) + afi(x*(t2))}

< (1- a) m.ax {fo(x*(to))- to; fi(x*(to))}
1:5~~m
= (1- a)f*(to) + aj*(t2),
and we get (2.3.18). D

Note that Lemmas 2.3.5 and 2.3.6 are valid for any parametric max-
type functions, not necessarily formed by functional components of prob-
lern (2.3.16).
Let us study now the properties of gradient mapping for a parametric
max-type function. To do that, Iet us introduce first a linearization of a
parametric max-type function f(t; x):
f(t; i; x) = m_ax {fo(i)
l~z~m
+ (J~(x), x- x)- t; fi(x) + (Jf(x), x- i) }.
Now we can introduce a gradient mapping in a standard way. Let us fix
some 'Y > 0. Denote
f'Y(t; i; x) f(t;i;x) +! II x-i 11 2 ,
f*(t; i; 'Y) = min f,y(t; i; x)
xEQ
Xf(t; i; "f) = argmin J'Y(t;i;x)

xEQ
g 1(t;x;,) = 'Y(i- x 1(t;x;,)).
We call 9t(t; i; "() the constrained gradient mapping of problern (2.3.16).

As usual, the point of linearization x is not necessarily feasible for Q.
Note that function J'Y(t; i; x) itself is a max-type function composed
by the components
fo(x) + (JfJ(x),x- x)- t + ~ II x- x 11 2 ,
fi(x) + (JI(x), x- x) - t +! II x- x 11 2 , i = 1 ... m.
Moreover, f'Y(t; x; x) E S~;~(Rn). Therefore, for any t E R 1 the con-

strained gradient mapping is weil defined in view of Theorem 2.3.2.
Since f(t; x) E S~;L(Rn), we have
!JL(t; x; x) :::; f(t; x) :::; fL(t; x; x)

for all x E Rn. Therefore f*(t; i; ~-t) :::; f*(t) :::; f*(t; i; L). Moreover,
using Lemma 2.3.6, we obtain the following result:
For any i E Rn, "( > 0, !:l. ;:::: 0 and t1 < t2 and we have
f*(tt- !:l.; x; 'Y) ;:::: J*(t1; x; 'Y) + t 2 ~t 1 (J*(t1; x; 'Y)- j*(t2; x; 'Y)).
(2.3.20)
There are two values of "(, which are important for us. Theseare 'Y = L
and "( = 1-t· Applying Lemma 2.3.2 to max-type function J'Y(t;i;x) with
/l = L and 1 2 = J.L, we obtain the following inequality:
f*(t; x; J.L) 2': f*(t; x; L)- ~;f il.9t(t; x; L) 11 2 . (2.3.21)
Since we are interested in finding a root of the function f*(t), let us

describe the behavior of the roots of function j*(t;x;1), which can be
seen as an approximation of f*(t).
Denote
t*(x, t) = root tU*(t; x; J.L))
(notation root t (-) means the root in t of function (·)).
LEMMA 2.3.7 Let x ERn and f < t* besuchthat
J*(f;x;J.L) 2: (1- "")f*(l;x;L)
for some ""E {0, 1). Then l < t*(x, f) S t*. Moreover, for any t < t and
x ERnwehave
f*(t; x; L) 2: 2(1- "")f*(t; x; L) t-t

t• (x,t)-t ·
Proof: Since t < t*, we have
o < f*(l) s f*(t; x; L) s 1 ~,J*(f; x; J.L).
Thus, f*(t; x; J.L) > 0 and, since f*(t; x; J.L) decreases in t, we get
t*(x, l) > l.
Denote ~ = f- t. Then, in view of (2.3.20), we have
f*(t; x; L) > f*(t) 2: f*(l; x; J.L) 2: f*(l; x; J.L) + t•(x~)-tf*(t; x; J.L)
> (1- "") ( 1 + t•(x~)-t) f*(t; x; L)
0
2.3.5 Method for constrained minimization

Now we areready to analyze the following process.
Constrained minimization scheme
0. Choose xo E Q, K E (0, !), to < t* and accuracy E > 0.

a). Generate sequence {xk,j} by method (2.3.13) as
applied to f(tk; x) with starting point Xk,o = Xk· lf
f*(tk; Xk,ji Jl.) ~ (1- K)j*(tk; Xk,ji L)

(2.3.22)
then stop the internal process and set j(k) = j,
j*(k) =arg min j*(tkiXk 3·;L),

O< '< '(k)
_)_) '
Xk+! = XJ(tkiXk,j•(k)iL).
Global stop, if at some iteration of the internal

scheme we have j*(tkiXk,j;L) $ E.
b). Set tk+I = t*(xk,j(k)• tk)·
This is the first time in our course we meet a two-level process. Clearly,
its analysis is rather complicated. Firstly, we need to estimate the rate
of convergence of the upper-level process in (2.3.22) (it is called a mas-
ter process). Secondly, we need to estimate the total complexity of the
internal processes in Step 1a). Since we are interested in the analyti-
cal complexity of this method, the arithmetical cost of computation of
t*(x, t) and j*(t; x, 'Y) is not important for us now.
Let us describe convergence of the master process.
LEMMA 2.3.8
f *(t·x ·L)<to-t•[
k, k+1,
1
- 1-tt 2(1-tt)
]k .
Proof: Denote ß = 2 ( 1 ~~~:) ( < 1) and

Since tk+l = t*(xk,j(k)• tk), in view of Lemma 2.3.7 for k ~ 1 we have

2(1- K)f*(tkiXk,j(k)iL) < f*(tk-liXk-l,j(k-l)iL)) •
.jtk+l-tk - .jtk-tk-1
Thus, 6k $ ßbk-1 and we obtain
f*(tk; xk,j(k)i L) = 6k..Jtk+1- tk $ ßk6o..Jtk+l- tk
= ßkJ* (t o; Xo,j(O); L) t,.+t-tk

tt -to ·
Further, in view of Lemma 2.3.5, we have t1 - to ~ f*(to; xo,j(o)i J.t).

Hence,
f*(tk; xk,j(k)i L) < ßk f*(to; xo,j(O)i L) /*(to;xo,j(O)il')
< ~V j*(to)(to- t*).

It remains to note that f*(to) $ to - t* (Lemma 2.3.5) and
j*(tk; xk+li L) = f*(tk; xk,j*(k)i L) $ f*(tk; xk,j(k)i L).
0
The above result provides us with an estimate for the nurober of

upper-level iterations, which are necessary to find an €-Solution of prob-
lern (2.3.16). Indeed, let f*(tk; xk,ji L) $ €. Then for x. = Xf(tk; xk,ji L)
we have
f(tk;x.) = 1 ~~{fo(x.) -tk;fi(x*)} $ f*(tkiXk,jiL) $ €.
Since tk s t*, we conclude that

fo(x.) S t* + E,
(2.3.23)
fi(x*) $ €, i = 1 .. . m.
In view of Lemma 2.3.8, we can get (2.3.23) at most in

1 t -t•
N(€) = ln[2(1-~~:)] In (t~~:)e (2.3.24)
full iterations of the master process (the last iteration of the process, in
general, is not full since it is terminated by the Global stop rule). Note
that in this estimate K is an absolute constant {for example, K = :\).
Let us analyze the complexity of the internal process. Let the sequence
{Xk,j} be generated by {2.3.13) with the starting point Xk,o = Xk· In view
of Theorem 2.3.6, we have
j(tk; Xk,j)- j*(tk) ~ 2(1- [f)j (J(tkj Xk)- f*(tk))
~ 2e-u-i(J(tkiXk)- f*(tk)) :5 2e-u·j j(tk;xk),
where a = jli.
Denote by N the number of full iterations of the process (2.3.22)
(N ~ N(E)). Thus, j(k) is defined for all k, 0 ~ k ~ N. Note that
tk = t*(xk-1,j(k-1)• tk-1) > tk-1· Therefore
f(tk; Xk) ~ f(tk-1i Xk) ~ j*(tk-1i Xk-I,j•(k-1)• L).

Denote
ßk = f*(tk-Ii xk-l,j•(k-1)• L), k 2: 1, ßo = f(to; xo).
Then, for all k 2: 0 we have
LEMMA 2.3.9 For all k, 0 ~ k ~ N, the intemal process works no

longer as the following condition is satisfied:
(2.3.25)
Proof: Assurne that (2.3.25) is satisfied. Then, in view of (2.3.8), we

have
A II 9j(tk; Xk,ji L 11 2 < f(tk; Xk,j)- f(tk; Xj(tk; Xk,ji L))
< f(tkj Xk,j) - f*(tk)·

Therefore, using (2.3.21), we obtain
f*(tkiXk,jiJ.t) > f*(tkjXk,jiL)- ~;r "g,(tkiXk,jiL 11 2
> f*(tkiXk,jiL)- L?·(J(tkiXk,j)- f*(tk))

> (1- K)j*(tk; Xk,ji L).
And that is the termination criterion of the internal process in Step la)
in (2.3.22). 0
The above result, combined with the estimate of the rate of con-
vergence for the internal process, provide us with the total complexity
estimate of the constrained minimization scheme.
LEMMA 2.3.10 For alt k, 0 ~ k ~ N, we have
J'(k) < 1 + !I. ln 2(L-fL)Llk.

- VIi KJLLlk+l
Proof: Assurne that
where a = fi{.
VL
min f*(tk; Xk Ji L). Note that
Recall that ßk+l =
O"Sj"Sj(k) '
the stopping criterion ofthe internal process did not work for j = j(k)-
1. Therefore, in view of Lemma 2.3.9, we have
f*(tk; Xk,ji L) ~ LJL-;_H(f(tk; Xk,j)- f*(tk)) ~ 2LJL-;_He-u·j ßk < ßk+l·
That is a contradiction with the definition of ßk+ 1• 0
COROLLARY 2.3.3
N
L: j(k) < (N + 1) [1 + fi ·In~] + fi ·ln _Au_,
k=O - V Ii KJL V Ii LlN+l
0
It remains to estimate the number of internal iterations in the last

step of the master process. Denote this number by j*.
LEMMA 2.3.11
'* < 1 + fi. }n 2(L-JL)LlNtl.
J - VIi KJL€
Proof: The proof is very similar tothat of Lemma 2.3.10. Suppose that
j* _ l > fi. ·ln 2(L-JL)LlNtl.
V Ii KJLf
Note that for j = j*- 1 we have
E < J*(tN+liXN+l,jiL) ~ LJL7(f(tN+liXN+!,j)- f*(tN+I))

That is a contradiction. 0
COROLLARY 2.3.4
j* + Ej(k)::;
k==O
(N + 2) [1 + lf ·In 2(~/>] + lf ·In~.
Let us put all things together. Substituting estimate (2.3.24) for the
number of full iterations N into the estimate of Corollary 2.3.4, we come
to the following bound for the total nurnber of internal iterations in the
process (2.3.22):
1
[ ln[2{1-!~:))ln t -t•
(1o-,.)e + 2] . [1 + V!IJj ·ln ~]
"11-
(2.3.26)
+V'fiIi · ln (t · rn.ax {fo(xo) - to; fi(xo) }) .
I::;)$m
Note that method (2.3.13), which implernents the internal process, calls
the oracle of problern (2.3.16) at each iteration only once. Therefore,
we conclude that estirnate (2.3.26) is an upper cornplexity bound for the
problern {2.3.16) with E-solution defined by {2.3.23). Let us check, how
far this estirnate is frorn the lower bounds.
The principal terrn in estirnate {2.3.26) is of the order
ln to-t• .
e
/I. In L.p,'
VJi
This value differs frorn the lower bound for an unconstrained minimiza-
tion problern by a factor of In t·
This rneans, that scherne (2.3.22) is at
least suboptimal for constrained optimization problems. We cannot say
rnore since a specific lower complexity bound for constrained minimiza-
tion is not known.
To conclude this section, Iet us answer two technical questions. Firstly,
in scheme (2.3.22) we assume that we know sorne estimate t 0 < t*. This
assumption is not binding since we can choose to equal to the optimal
value of the minimization problern
rnin [f(xo)
xeQ
+ (f'(xo), x- xo) + ~ II x- xo 11 2 ].
Clearly, this value is less than or equal to t*.

Secondly, we assume that we are able to compute t*(x, t). Recall that
t* ( x, t) is the root of function
j*(t;x;J.L) = rnin !11-(t;x;x),

xEQ
where j,_,.(t;x;x) is a max-type function composed by the components
fo(x) + (!Ö(x),x- x} + ~ II x- x 11 2 -t,
fi(x) + (fi(x),x- x) + ~ II x- x 11 2 , i = 1. .. m.
In view of Lemma 2.3.4, it is the optimal value of the following mini-

mization problem:
min [fo(x) + Uö(x), x- x) + ~ II x- x 11 2 ],
s.t. fi(x) + (ff(x), x- x) + ~ II x- x 11 2 ~ 0, i = 1 ... m,
XE Q.
This problern is not a quadratic optimization problem, since the con-
straints are not linear. However, it can be solved in finite time by a
simplex-type process, since the objective function and the constraints
have the same Hessian. This problern can be also solved by interior-
point methods.
Chapter 3
NONSMOOTH CONVEX OPTIMIZATION
3.1 General convex functions

{Equivalent definitions; Closed functions; Continuity of convex functions; Sep-
aration theorems; Subgradients; Computation rules; Optimality conditions.)
3.1.1 Motivation and definitions

In this chapter we consider rnethods for solving general convex rnini-
rnization problern
min fo(x),
s.t. fi(x) :::; 0, i = 1. .. m, (3.1.1)
where Q is a closed convex set and fi(x), i = 0 ... m, are general convex
functions. The terrn general means that these functions can be non-
differentiable. Clearly, such a problern is rnore difficult than a srnooth
one.
Note that nonsrnooth rninirnization problerns arise frequently in differ-
ent applications. Quite often sorne components of a rnodel are composed
by rnax-type functions:
f(x) = max r/Jj(x),
l:5J:5P
where r/Jj(x) are convex and differentiable. In the previous section we

have seen that such a function can be treated by a gradient rnapping.
However, if the number of srnooth cornponents p in this function is very
big, the cornputation of the gradient rnapping becornes too expensive.
Then, it is reasonable to treat this max-type function as a general convex

function. Another source of nondifferentiable functions is the situation
when some components of the problern (3.1.1) are given implicitly, as
a solution of an auxiliary problem. Such functions are called the func-
tions with implicit structure. Very often these functions appear to be
nonsmooth.
Let us start our considerations with definition of general convex func-
tion. In the sequel the term "general" is often omitted.
Denote by
dom f = { x E Rn : I f (X) I< oo}
the domain of function f. We always assume that dom f i= 0.
DEFINITION 3.1.1 Function J(x) is called convex ijits domain is convex
and for all x, y E dom f and a E [0, 1] the following inequality holds:
f(ax + (1- a)y) :::; af(x) + (1- a)J(y).

We call f concave if- f is convex.
At this point, we are not ready to speak about any method for solving
(3.1.1). In the previous chapter, our optimization methods were based
on the gradients of smooth functions. For nonsmooth functions such
objects do not exist and we have to find something to replace them.
However, in order to do that, we should study first the properties of
general convex functions and justify a possibility to define a generalized
gradient. That is a long way, but we have to pass through it.
A Straightforward consequence of Definition 3.1.1 is as follows.
LEMMA 3.1.1 (Jensen inequality) For any xl, ... ,xm E domfand co-
efficients 0:1, ... , O:m such that
m
L O:i = 1, O:i;::: 0, i = 1 ... m, (3.1.2)
i=l
we have
Proof: Let us prove this statement by induction in m. Definition 3.1.1

justifies the inequality for m = 2. Assurne it is true for some m ;::: 2. For
the set of m + 1 points we have
m+l m
L O:iXi = O:!Xl + (1- o:I) LßiXi,
i=l i=l
Nonsmooth convex optimization 113
where ßi = ~
l~a 1 • clearly,
m
Lßi = 1, ßi 2 0, i = 1. .. m.
i=l
Therefore, using Definition 3.1.1 and our inductive assumption, we have
m
The point x = 2:: aiXi with coefficients ai satisfying (3.1.2) is called
i=l
a convex combination of points Xi.
Let us pointout two important consequences of Jensen inequality.
COROLLARY 3.1.1 Let x be a convex combination of points x1, •.. ,xm.
Then
m
Proof: Indeed, in view of Jensen inequality and since ai 2 0, E ai = 1,
i=l
we have
CoROLLARY 3.1.2 Let
= Conv{xl,··· ,xm} ={x = LaiXi I ai 2 0,

m m
~ Lai= 1}.
i=l i=l
Let us give two equivalent definitions of convex functions.

THEOREM 3.1.1 Function f is convex if and only if for all x, y E domf
and ß 2 0 suchthat y + ß(y- x) E dom J, we have
f(y + ß(y- x)) ? f(y) + ß(f(y)- f(x)). {3.1.3)

Proof: Let f be convex. Derrote a = -dRI and u = y + ß(y- x). Then
y= l~ß(u+ßx)=(1-a)u+ax.
Therefore
f(y) :::; (1- a)f(u) + af(x) = !hf(u) + !RIJ(x).

Let (3.1.3) hold. Let us fix x, y E domf and a E (0, 1]. Derrote
ß = 1 ~ 0 and u = ax+ (1- a)y. Then
x = ~(u- (1- a)y) = u + ß(u- y).
Therefore
J(x) 2:: J(u) + ß(f(u)- J(y)) = ~f(u)- l~a J(y).
D
THEOREM 3 .1. 2 Function f is convex if and only if its epigraph
epi (!) = {(x, t) E domf x R I t 2:: f(x)}
is a convex set.
Proof: Indeed, if (x1, tr) E epi (!) and (x2, t2) E epi (!), then for any
a E [0, 1] we have
at1 + (1- a)t2 2:: af(xl) + {1- a)j(x2) 2:: j(ax1 + (1- a)x2).
Thus, (ax1 + (1- a)x2, at1 + (1- a)t2) E epi (!).
Let epi {!) be convex. Note that for x 1 , x 2 E dom f
(x1,J(xl)) E epi{f), (x1,/(x2) E epi{f).
Therefore (ax1 + (1- a)x2, af(xl) + (1- a)j(x2)) E epi {!). That is
j(ax1 + (1- a)x2) :::; af(xl) + (1- a)j(x2).

D
We need also the following property of Ievel sets of convex functions.

THEOREM 3 .1. 3 If function f is convex, then all its Level sets
Cr(ß) = {x E domf I f(x):::; ß}

are either convex or empty.
Proof: Indeed, if XI E .c,(ß) and x2 E .CJ(ß), then for any a E [0, 1]

we have
j(ax1 + (1- a)x2) ~ af(xi) + (1- a)j(x2) ~aß+ (1- a)ß = ß.
We will see that the behavior of a general convex function on the

boundary of its domain sometimes is out of any control. Therefore, let
us introduce one convenient notion, which will be very useful in our
analysis.
DEFINITION 3 .1. 2 A convex function f is called closed if its epigraph is

a closed set.
As an immediate consequence of the definition we have the following

result.
THEOREM 3.1.4 lf convex function f is closed, then alt its level sets are
either empty or closed.
Proof: By its definition, (.CJ(ß),ß) = epi (f) n{(x, t) I t = ß}. There-

fore, the epigraph .Cf (ß) is closed and convex as an intersection of two
closed convex sets. 0
Note that, if f is convex and continuous and its domain dom f is

closed, then f is a closed function. However, in general, a closed convex
function is not necessarily continuous.
Let us look at some examples of convex functions.
EXAMPLE 3 .1.1 1. Linear function is closed and convex.
2. f(x) =I x I, x E R 1, is closed and convex since its epigraph is

{(x,t) I t?. x, t?. -x},
the intersection of two closed convex sets (see Theorem 3.1.2).
3. All dijjerentiable and convex on Rn functions belong to the class of

general closed convex functions.
4. Function f(x) = ~' x > 0, is convex and closed. However, its domain
dom f = int R~ is open.
5. Function f(x) =II x II, where II · II is any norm, is closed and convex:
f(a.xl + (1- a.)x2) = II a.x1 + (1- a.)x2 II
< II a.x1 II + II (1 - a.)x2 II

= a. II Xl II +(1 - a.) II X2 II
for any Xt,X2 E Rn and a. E (0, 1]. The most important norms in
numerical analysis are so-called lp-norms:
p~l.
Among them there are three norms, which are commonly used:
- The Euclidean norm: II x II= [f; (xCi)) 2 ]112 , p = 2.

i=l
- The h-norm: II x llt= f: I xCi) I, p = 1.

i=l
- The l00 -norm ( Chebyshev norm, uniform norm, infinity norm):
II X lloo= ffi!iX I X(i) I .

l::S~::Sn
Any norm defines a system of balls,
Bll·ll(xo,r) = {x ERn 111 x- xo II$ r}, r ~ 0,

where r is the radius of the ball and xo E Rn is its center. We call
the ball BII·II(O, 1) the unit ball of the norm II · 11. Clearly, these balls
are convex sets (see Theorem 3.1.3). For lp-balls of the radius r we
use the notation
Bp(xo,r) = {x ERn 111 x- xo llp$ r}.

Note the following relation between Euclidean and l 1-balls:
B1(xo,r) C B2(xo,r) C Bl(xo,r.fii).

That is true because of the standard inequalities:
6. Up to now, all our examples did not show up any pathological behav-
ior. However, let us Iook at the following function of two variables:
0,
f(x, y) = {
cp(x, y),
where cp(x, y) is an arbitrary nonnegative function defined on a unit
circle. The domain ofthis function is the unit Euclidean disk, which is
closed and convex. Moreover, it is easy to see that f is convex. How-
ever, it has no reasonable properties on the boundary of its domain.
Definitely, we want to exclude such functions from our considerations.
That was the reason for introducing the notion of closed function. It
is clear that f (x, y) is not closed unless cjJ( x, y) = 0. D
3.1.2 Operations with convex functions

In the previous section we have seen several examples of convex func-
tions. Let us describe a set of invariant operations, which allow us to
create more complicated objects.
THEOREM 3 .1. 5 Let functions /I and h be closed and convex and let
ß 2:: 0. Then alt functions below are closed and convex:
1. f(x) = ßfi(x) , domf = domfi.
2. f(x) = h(x) + h(x), domf = (domfi) n(dom/2).
3. f(x) = max{fi(x),h(x)}, domf = (domfi)n(domf2).
Proof:
1. The first item is evident:
j(ax1 + (1- a)x2) :S ß(afi (xl) + (1- a)fi (x2)).
2. For all Xl,X2 E (domfl) n(domh) and Ci E [0, 1] we have
h (ax1 + (1- a)x2) + h(axl + (1- a)x2)
:S afi (xl) + (1 - a)fi (x2) + af2(xl) + (1 - a)f2(x2)
= a(fi(xi) + h(xl)) + (1- a)(fi(x2) + h(x2)).

Thus, f(x) is convex. Let us prove that it is closed. Consider a sequence
{ (xk, tk)} C epi (/):
lim Xk = x E domj, lim tk = l.
k-too k-too
Since h and f2 are closed, we have

inf lim fi(xk) ~ fi(x), inf lim h(xk) ~ f2(x).
k-HXJ k--too
Therefore
l = lim tk ~ inf lim

k--too k--too
fi (xk) + inf k--too
lim f2(xk) ~ f(x).
Thus, (x, l) E epi/.1

3. The epigraph of function f(x) is as follows:
epif = {(x,t) It~ fi(x) t ~ f2(x) x E (domfi)n(domf2)}
- epifi nepif2.
Thus, epi f is closed and convex as an intersection of two closed convex
sets. It remains to use Theorem 3.1.2. 0
The following theorem demonstrates that convexity is an affine-inva-

riant property.
THEOREM 3 .1. 6 Let function ifJ(y), y E Rm, be convex and closed. Con-
sider a linear operator
A(x) = Ax + b: Rn -t Rm.
Then f(x) = ifJ(A(x)) is a closed convex function with domain
dom f = {x ERn I A(x) E domifJ}.
Proof: For x1 and x2 from domf denote Y1 = A(xl), Y2 = A(Y2)· Then
for a E [0, 1] we have
= ifJ(ayl + (1 - a)y2)
< aifJ(yi) + (1 - a)ifJ(Y2)
= af(xl) + (1 - a)j(x2).
1 It is important to understand, that a similar property for convex sets is not valid. Consider
the following two-dimensional example: Q1 = {(x,y): y 2:: ~. x > 0}, Q2 = {(x,y): y =
0, x ~ 0}. Bothofthese sets are convex and closed. However, their sum Q1 + Q2 {(x, y) : =
y > 0} is convex and open.
Thus, j(x) is convex. The closedness of its epigraph follows from conti-
nuity of the linear operator A(x). D
The next theorem is one of the main suppliers of convex functions

with implicit structure.
THEOREM 3.1. 7 Let ~ be some set and
f(x) = sup{cp(y,x) I yE ~}.
y
Suppose that for any fixed y E ~ the function cp(y, x) is closed and convex
in x. Then f(x) is a closed and convex function with domain
domf={xE n domcp(y,·)l
yEt:.
:3/:c/J(y,x)S/VYE~}. (3.1.4)
Proof: Indeed, if x belongs to the right-hand side of equation (3.1.4),

then f(x) < oo and we conclude that x E dom f. If x does not belang
to this set, then there exists a sequence {yk} such that cp(yk, x) ---+ oo.
Therefore x does not belong to dom f.
Finally, it is clear tlmt (x, t) E epi f if and only if for all y E ~ we
have
x E domcp(y, ·), t;::: cp(y, x).
This means that
epif = n
epicp(y, ·).
yEt:.
Therefore f is convex and closed since each epi cp(y, ·) is convex and
closed. 0
Now we are ready to Iook at more sophisticated examples of convex

functions.
EXAMPLE 3.1.2 1. Function J(x) = ID!lX {x(il} is closed and convex.
l:St:Sn
2. Let.>.= (.>.( 1 ), ... , .>,(m)) and ~ be a set in R+. Consider the function
m
f(x) = sup L .>.(i) fi(x),
>.Et:. i=l
where fi are closed and convex. In view of Theorem 3.1.5, the

epigraphs of functions
m
c/J>.(x) = L >,(i) fi(x)
i=l
are convex and closed. Thus, j(x) is closed and convex in view of
Theorem 3.1.7. Note that we did not assume anything about the
structure of the set b..
3. Let Q be a convex set. Consider the function
'1/JQ(x) = sup{ (g, x) I g E Q}.

Function '1/JQ(x) is called support function of the set Q. Note that
'1/JQ(x) is closed and convex in view of Theorem 3.1.7. This function
is homogeneaus of degree one:
'1/JQ(tx) = t'I/JQ(x), x E domQ, t ~ 0.
lf the set Q is bounded then dom '1/JQ = Rn.
4. Let Q be a set in Rn. Consider the function '1/J (g, 'Y) = sup cp(y, g, 'Y),
yEQ
where
c/J(y, g, 1) = (g, y) - ~ II Y 11 2 •
The function '1/J(g, 'Y) is closed and convex in (g, 'Y) in view ofTheorem
3.1.7. Let us look at its properties.
lf Q is bounded, then dom '1/J = Rn+l. Consider the case Q = Rn.
Let us describe the domain of '1/J. If 'Y < 0, then for any g =/: 0 we
can take Ya = ag. Clearly, along this sequence cp(ya, g, 'Y) -+ oo as
a -+ oo. Thus, dom '1/J contains only points with 'Y ~ 0.
lf 'Y = 0, the only possible value for g is zero since otherwise the
function 4J(y, g, 0} is unbounded.
Finally, if 'Y > 0 then the point maximizing 4J(y, g, 'Y) with respect
to y is y* (g, 'Y) = ~ g and we get the following expression for '1/J:
'1/J(g, 'Y) = 1Wf

2')' •
Thus,
0, if g = 0, 'Y = 0,
'1/J(g, 'Y) = { 1Wf
2')' ' if 'Y > 0,
with the domain dom'!f; = (Rn x {'Y > 0}) U{O,O). Note that this
is a convex set, which is neither closed nor open. Nevertheless, '1/J
is a closed convex function. At the same time, this function is not
continuous at the origin:
0
3.1.3 Continuity and differentiability

In the previous sections we have seen that the behavior of convex func-
tions at the boundary points of its domain can be rather disappointing
(see Examples 3.1.1(6), 3.1.2(4)). Fortunately, this is the only bad news
about convex functions. In this section we will see that the structure of
convex functions in the interior of its domain is very simple.
LEMMA 3.1.2 Let function f be convex and xo E int (domf). Then f
is locally upper bounded at xo.
Proof: Let us choose some E > 0 such that xo ± Eei E int (dom f),
i = 1 ... n, where ei are the coordinate vectors of Rn. Denote
!:::.=Conv{xo±Eei, i= 1 ... n}.
Let us show that !:::. =:> B2(xo, €) with € = Jn· Indeed, consider
n n
x = x0 + Lhiei, L(hi)2:::; E.
i=l i=l
We can assume that hi 2:: 0 (otherwise, in the above representation we

can choose -ei instead of ei). Then
Therefore for hi = -/Jhi we have
x = xo + ß "f:
i=l
hiei = xo +~ "f: hiEei
i=l
Thus, using Corollary 3.1.2, we obtain
M = xEB2(xo,l)
max f(x) :::; max f(x) :::; max f(xo ±
xE6 l:Sz:Sn
Eei).
Remarkably enough, the above result implies continuity of a convex

function at any interior point of its domain.
THEOREM 3.1.8 Let f be convex and xo Eint (domf). Then f is locally

Lipschitz continuous at xo.
Proof: Let B2(xo, ~:) ~ domfand sup{f(x) I x E B2(xo, ~:)} ~ M (M is

finite in view of Lemma 3.1.2). Consider y E B2(xo, ~:), y =I xo. Denote
a = ~ II y- xo II, z = xo + ~(y- xo).

It is clear that II z - xo II= ~ II y - xo II= E. Therefore a :::; 1 and
y = az + (1- a)x 0 • Hence,
f(y) < af(z) + (1- a)f(xo) :::; f(xo) + a(M- f(xo))
= f(xo) + M-{(xo) II Y- xo II ·
Further, denote u = xo + ~(xo- y). Then II u- xo II= E and y =
xo + a(x0 - u). Therefore, in view of Theorem 3.1.1 we have
f(y) > f(xo) + a(f(xo)- l(u)) ;:::: f(xo)- a(M- l(xo))
= f(xo)- M-{(xo} II Y- Xo II ·
Thus, I f(y)- f(xo) I~ M-!(xo) II Y- xo II· 0
Let us show that the convex functions possess a property, which is

very close to differentiability.
DEFINITION 3.1.3 Let x E dom I. We call f differentiable in a direction
p at point x if the following limit exists:
f'(x;p) = lim i[f(x + ap)- l(x)]. (3.1.5)
a.(.O
The value f' (x; p) is called the directional derivative of I at x.

THEOREM 3.1.9 Convex function f is differentiable in any direction at
any interior point of its domain.
Proof: Let x E int (dom !) . Consider the function
<jJ(a) = ~[f(x + ap) - l(x)], a > 0.

Let 'Y E (0, 1] and a E (0, ~:) be small enough to have x + Ep E domf.
Then
f(x + aßp) = !((1- ß)x + ß(x + ap)) :::; {1- ß)f(x) + ßf(x + ap).
Therefore
<!J(aß) = ~[f(x + aßp)- l(xo)] :::; ~[f(x + ap)- f(x)] = <!J(a).
Thus, <j>(a) decreases as a .j,. 0. Let us choose 1 > 0 small enough to

have x- IP E domf. Then, in view of (3.1.3) we have
</>( a) 2: * [f(x) - f(x - /P )].
Hence, the Iimit in (3.1.5) exists. D
Let us prove that the directional derivative provides us with a global

lower support of the convex function.
LEMMA 3.1.3 Let f be a convex function and x E int (domj). Then

f'(x;p) is a convexfunction ofp, which is homogeneaus ofdegree 1. For
any y E dom f we have
f(y) 2: f(x) + f'(x;y- x). (3.1.6)
Proof: Let us prove that the directional derivative is homogeneous.

Indeed, for p E Rn and T > 0 we have
f'(x; Tp) = lim i-[f(x

a.j.O
+ Tap)- f(x)]
T lim -ß1 [f(x

ß.j.O
+ ßp)- f(x)] = T f'(xo;p).
Further, for any Pl, P2 ERn and ß E [0, 1] we obtain
f'(x; ßP1 + {1 - ß)P2)
< lim i-{ß[f(x

a.j.O
+ api)- f(x)]
+{1 - ß)[f(x + ap2) - f(x )]}
= ßf'(x;pl) + {1- ß)f'(x;p2)·

Thus, f'(x;p) is convex in p. Finally, Iet a E (0, 1], y E domf and
Ya = x + a(y- x). Then in view of Theorem 3.1.1, we have
f(y) = f(ya + i-{1- a)(ya- x)) 2: f(ya) + i-(1- a)[f(ya)- f(x)],

and we get {3.1.6) taking the Iimit in a .j,. 0. D
3.1.4 Separation theorems

Up to now we were describing properties of convex functions in terms
of function values. We did not introduce any directions which could
be useful for constructing the minimization schemes. In convex analy-
sis such directions are defined by separation theorems, which form the
subject of this section.
DEFINITION 3.1.4 Let Q be a convex set. We say that hyperplane
tl(g,'"'f) = {x ERn I (g,x} = '"'f}, g =f 0,

is supporting to Q if any x E Q satisfies inequality (g,x} ~ 'Y·
We say that the hyperplane 1l(g, 'Y) separates a point xo from Q if
(g, x) ~ 'Y ~ (g, xo} (3.1. 7)
for all x E Q. lf the second inequality in (3.1. 7) is strict, we call the
separation strict.
The separation theorems can be derived from the properties of pro-
jection.
DEFINITION 3.1.5 Let Q be a closed set and xo ERn. Denote
?rq(xo) = argmin{ll x- xo II: x E Q}.
We call ?rq(xo) projection of point xo onto the set Q.
THEOREM 3 .1.1 0 If Q is a convex set, then there exists a unique pro-
jection ?rq(xo).
Proof: Indeed, 1rq(x 0 ) = arg min{ <P(x) I x E Q}, where the function
!
<P(x) = II X- Xo 11 2 belongs to st:i(Rn). Therefore 1l'q(xo) is unique
and well defined in view of Theorem 2.2.6. D
It is clear that 11'Q(xo) = xo if and only if xo E Q.

LEMMA 3.1.4 Let Q be a closed convex set and x 0 ~ Q. Then for any
x E Q we have
(7rQ(xo) - xo, x- 11'Q(xo)} ~ 0. (3.1.8}
Proof: Note that 7rq(x0 ) is a solution to the minimization problern

min <P(x) with <P(x) = ~ II x- xo 11 2 . Therefore, in view of Theorem
xEQ
2.2.5 we have
(<P1(11'q(xo)),x- ?rq(xo)) ~ 0
for all x E Q. It remains to note that c/J'(x) = x- xo. 0
Finally, we need a kind of triangle inequality for projection.

LEMMA 3.1.5 For any x E Q we have
II x -1l"Q(xo) 11 2 + II 11"Q(xo)- xo 11 2 ~11 x- xo 11 2 ·
Proof: Indeed, in view of (3.1.8), we have

II x- 11"Q(xo) 11 2 - II x- xo 11 2 (xo- 11"Q(xo), 2x- 11"Q(xo) - xo}
< - II xo - 11"Q(xo) 11 2 •
0
Now we can prove the separation theorems. We will need two of them.
The first one describes our possibilities in strict Separation.
THEOREM 3.1.11 Let Q be a closed convex set and xo rj. Q. Then there
exists a hyperplane 1l(g,,), which strictly separates xo from Q. Namely,
we can take
g = xo- 11"Q(xo)-:/= 0, 'Y = {xo- 11"Q(xo), 11"Q(xo)}.
Proof: Indeed, in view of (3.1.8}, for any x E Q we have
{xo- 11"Q(xo}, x} < {xo -1l"Q(xo), 11"Q(xo)}
= (xo- 11"Q(xo), xo}- II xo- 11"Q(xo) 11 2 •

0
Let us give an example of an application of the above theorem.

COROLLARY 3.1.3 Let Qt and Q2 be two closed convex sets.
1. If for any g E dom'I/JQ 2 we have 'I/JQ 1 (g) ~ 'I/JQ 2 (g), then Q1 ~ Q2.
2. Let dom 1/JQ 1 = dom 1/JQ 2 and for any g E dom 1/JQ 1 we have
1/JQ 1 (g) = 1/JQ 2 (p). Then Q1 Q2. =
Proof: 1. Assurne that there exists x 0 E Q1 , which does not belong
to Q2. Then, in view of Theorem 3.1.11, there exists a direction g such
that
(g, xo} > 'Y (g, x}
~
for all x E Q2. Hence, g E dom'I/JQ 2 and 'I/JQ 1 (g) > 'I/JQ 2 (g). That is a
contradiction.
2. In view of the first statement, Q1 ~ Q2 and Q2 ~ Q1 . Therefore,
Ql::: Q2. D
The next separation theorem deals with boundary points of convex

sets.
THEOREM 3 .1.12 Let Q be a closed convex set and xo belang to the
boundary of set Q. Then there exists a hyperplane 1l (g, 'Y), supporting
to Q and passing through xo.
(Such a vector g is called supporting to Q at xo.)
Proof: Consider a sequence {yk} suchthat Yk ~ Q and Yk --+ xo. Denote
In view of Theorem 3.1.11, for all x E Q we have
(gk! x) :S 'Yk :S (gk, Yk)· (3.1.9)
However, I 9k II= 1 and the sequence {rk} is bounded:

I rk I I (gk, 1fQ(Yk)- xo) + (gk, xo) I
(Lemma 3.1.5) < I 1fQ(Yk)- xo I + II xo II:SII Yk- xo II + II xo II ·
Therefore, without loss of generality we can assume that there exist
g* = lim 9k and 1* = lim rk· It remains to take the limit in (3.1.9). D
k-too k-too
3.1.5 Subgradients
Now we are completely ready to introduce an extension of the notion
of gradient.
DEFINITION 3.1.6 Let f be a convex function. A vector g is called a

subgradient of function f at point xo E dom f if for any x E dom f we
have
f(x) ~ f(xo) + (g, x- xo). (3.1.10}
The set of alt subgradientsoff at xo, äj(xo), is called the subdifferential
of function f at point xo.
The necessity of the notion of subdifferential is clear from the following
example.
EXAMPLE 3 .1. 3 Consider function f (x) =I x I, x E R 1 • For all y E R 1
and g E [-1, 1] we have
f(y) =I Y I~ g · y = J(o) + g · (y- o).

Therefore, the subgradient of f at x = 0 is not unique. In our example

it is the whole segment [-1, 1]. 0
The whole set of inequalities (3.1.10), x E dom j, can be seen as a set

of linear constraints, defining the set ß f (xo). Therefore, by definition,
the subdifferential is a closed convex set.
Note that the subdifferentiability of a function implies convexity.
LEMMA 3.1.6 Letforanyx E domf subdifferentialßj(x) be nonempty.
Then f is a convex function.
Proof: Indeed, let x, y E domj, a E [0, 1]. Consider Yo: = x+a(y -x).
Let g E ßj(y0 ). Then
f(y) > f(yo:) + (g, y- Yo:) = f(yo:) + (1- a)(g, y- x),
J(x) > f(yo:) + (g, x- Yo:) = f(Yo:)- a(g, y- x).
Adding these inequalities multiplied by a and (1 - a) respectively, we
get
af(y) + (1- a)f(x) 2: f(Yo:)·
D
On the other hand, we can prove a converse statement.

THEOREM 3 .1.13 Let f be closed and convex and xo E int (dom J).
Then ßj(x 0 ) is a nonempty bounded set.
Proof: Note that the point (f(xo), x 0 ) belongs to the boundary of
epi (f). Hence, in view of Theorem 3.1.12, there exists a hyperplane
supporting to epi (J) at (J(xo), xo):
-aT + (d, x) ::::; -aj(xo) + (d, xo) (3.1.11)
for all (T,x) E epi (J). Note that we can take
II d 112 +a2 = 1. (3.1.12)
Since for all T 2: f(xo) the point (T, x 0 ) belongs to epi (f), we conclude
that a 2: 0.
Recall, that a convex function is locally upper bounded in the interior
of its domain (Lemma 3.1.2). This means that there exist some E > 0
and M > 0 suchthat B2(xo, E) ~ domfand
f(x)- f(xo) ::::; M II x- xo I
for all x E B 2 (x 0 , €). Therefore, in view of (3.1.11), for any x from this
ball we have
(d,x- xo) $ a(f(x)- f(xo)) $ aM II x- xo II·

Choosing x = xo + Ed we get II d II 2 :S Ma II d II· Thus, in view of the
normalizing condition (3.1.12) we obtain
Hence, choosing g = dja we get
f(x) 2 f(xo) + (g, x- xo)

for all x E domf.
Finally, if g E ßj(xo), g =/:. 0, then choosing x = xo + €9/ II g II we
obtain
E II g II= (g, x- xo) :S f(x) - f(xo) :S M II x- xo II= M €.

Thus, 8f(x 0 ) is bounded. 0
Let us show that the conditions of the above theorem cannot be re-
laxed.
EXAMPLE 3.1.4 Consider the function f(x) = -../X with the domain
{x E R 1 I x 2 0}. This function is convex and closed, but the subdiffer-
ential does not exist at x = 0. D
Let us deterine an important relation between the subdifferential and

the directional derivative of convex function.
THEOREM 3.1.14 Let f be a closed convex function. For any xo E

int (dom f) and p E Rn we have
f'(xo;p) = max{(g,p) I g E ßj(xo)}.

Proof: Note that
f'(xo;p) = ~N ~[f(xo + ap)- f(xo)] 2 (g,p), (3.1.13)
where g is an arbitrary vector from af (xo). Therefore, the subdifferential

of function f'(xo;p) at p = 0 is not empty and 8f(xo) ~ ßpj'(xo; 0). On
the other hand, since f'(x 0 ;p) is convex in p, in view of Lemma 3.1.3,
for any y E dom f we have
f(y) 2: f(xo) + !'(xo; y- xo) 2: f(xo) + (g, y- xo),

where g E 8pf'(x 0 ; 0). Thus, ßp!'(xo; 0) ~ 8f(xo) and we conclude that
=
ßj(xo) ßpj'(xo; 0).
Consider 9p E ßpj'(xo;p). Then, in view of inequality (3.1.6), for all
v E Rn and T > 0 we have
Tf'(xo; v) = !'(xo; Tv) 2: J'(xo;p) + (gp, TV- p).

Considering T --+ oo we conclude that
f'(xo; v) 2: (gp, v), (3.1.14)
and, considering T --+ 0, we obtain
J'(xo;p)- (gp,p) ~ 0. (3.1.15)
However, inequality (3.1.14) implies that gp E Bpf'(xo; 0). Therefore,

comparing (3.1.13) and (3.1.15) we conclude that (gp,p) = f'(xo;p). 0
To conclude this section, let us point out several properties of sub-

gradients, which are of main importance for optimization. Let us start
from the optimality condition.
THEOREM 3.1.15 Wehave f(x*) = min f(x) if and only if
xEdomf
0 E 8f(x*).
Proof: Indeed, if 0 E 8f(x*), then f(x) 2: f(x*) + (0, x- x*) = f(x*)

for all x E domf. On the other hand, if f(x) 2: f(x*) for all x E domf,
then 0 E 8f(x*) in view of Definition 3.1.6. 0
The next result forms a basis for cutting plane optimization schemes.
THEOREM 3.1.16 For any xo E domf all vectors g E 8f(xo) are sup-
porting to the level set .Cf (f (xo)):
(g,xo- x 2: 0 Vx E LJ(f(xo)) ={x E domf: f(x) ~ f(xo)}.
Proof: Indeed, if f(x) ~ f(xo) and g E 8f(xo), then
f(xo) + (g, x- xo) ~ f(x) ~ f(xo).
0
COROLLARY 3.1.4 Let Q ~ domf be a closed convex set, xo E Q and
x* = argmin{f(x) I x E Q}.
Then for any g E öj(xo) we have (g, xo- x*} 2: 0. 0
3.1.6 Computing subgradients

In the previous section we introduced the subgradients, objects which
we are going to use in minimization schemes. However, in order to
apply such schemes in practice, we need to be sure that these objects
are computable. In this section we present some rules for computing the
things.
LEMMA 3.1. 7 Let f be closed and convex. Assurne that it is differen-
tiable on its domain. Then Öf (x) = {!' (x)} for any x E int {dom f).
Proof: Let us fix some x E int (domf). Then, in view of Theorem
3.1.14, for any direction p ERn and any g E öj(x) we have
(f'(x),p} = f'(x;p) 2: (g,p).
Changing the sign of p, we conclude that (f'(x),p) = (g, p) for all g from
öf(x). Finally, considering p = ek, k = 1 ... n, we get g = f'(x). 0
Let us provide all operations with convex functions, described in Sec-

tion 3.1.2, with corresponding rules for updating subgradients.
LEMMA 3 .1. 8 Let function f (y) be closed and convex with dom f ~ Rm.
Consider a linear operator
A(x) = Ax + b: Rn---+ Rm.

Then cfi(x) = j(A(x)) is a closed convex function with domain domcjJ =
{x I A(x) E domf}. For any x Eint (domcjJ) we have
öcjJ(x) =AT öj(A(x)).
Proof: We have already proved the first part of this lemma in Theo-
rem 3.1.6. Let us prove the relation for the subdifferentiaL
Indeed, let Yo = A(xo). Then for all p ERn we have
cfi'(xo,p) = f'(yo;Ap) = max{(g,Ap) I g E öj(yo)}
= max{ (g,p) I g E AT öj(yo)}.

Using Theorem 3.1.14 and Corollary 3.1.3, we get
LEMMA 3.1.9 Let ft(x) and h(x) be closed convex functions and a1,
a2 ~ 0. Then function f(x) = ad1(x) + a2h(x) is closed and convex
and
8f(x) = a18!I(x) + a28h(x) (3.1.16}
jor any X jrom int (dom/) = int (domfl) n int (domf2).
Proof: In view of Theorem 3.1.5, we need to prove only the relation for
the subdifferentials. Consider Xo E int (dom /I) n
int (dom h). Then,
for any p E Rn we have
= max{(91,a1p) I 91 E 8/I(xo)}
+ max{ (92, a2p) I 92 E 8f2(xo)}
= max{ (a191 + a292,p) I 91 E 8/I (xo), 92 E 8f2(xo)}
= max{(9,p) I 9 E at8ft(xo) + a2äh(xo)}.

Note that both äfi(xo) and äfi(xo) are bounded. Hence, using Theo-
rem 3.1.14 and Corollary 3.1.3, we get (3.1.16). o
LEMMA 3.1.10 Let functions fi(x), i = 1 ... m, be closed and convex.

Then function f(x) = m_ax fi(x) is also closed and convex. For any
1$t$m
XE int (dom/) = n int (domfi) we have
m
i=l
8f(x) = Conv {8/i(x) I i E J(x)}, (3.1.17}

where I(x) = {i: fi(x) = f(x)}.
Proof: Again, in view of Theorem 3.1.5, we need to justify only the
n
m
rules for subdifferentials. Consider X E int (domfi). Assurne that
i=l
I(x) = {1, ... , k }. Then for any p ERn we have
f'(x;p) = 1$i$k
max ff(x;p) = 1$i$k
max max{(gi,p) I 9i E äfi(x)}.
Note that for any set of values a 1 , ... , ak we have
k
where D.k = Pi ~ 0, L: Ai = 1}, the k-dimensional standard simplex.
i==l
Therefore,
f'(x;p) =
k
max{ (g,p) I g = I: Ai9i, 9i E öfi(x), {.\i} E D.k}
i=l
max{(g,p) I g E Conv{8fi(x),i E I(x)} }.

0
The last rule can be useful for computing some elements from the
subdifferentiaL
LEMMA 3.1.11 Let ß be a set and f(x) = sup{4>(y,x) I y E ß}.
Suppose that for any fixed y E D. the function 4>(y, x) is closed and
convex in x. Then f(x) is closed convex.
Moreover, for any x from
domf = {x ERn I 3')': 4>(y,x) ~ ')'Vy E ß}
we have
8f(x) :2 Conv {84>x(y, x) I y E J(x)},
where I(x) = {y I 4>(y,x) = f(x)}.
Proof: In view of Theorem 3.1.7, we have to prove only the inclusion.
Indeed, for any x E domf, y E I(x) and g E 84>x(y,x) we have
f(x) ~ 4>(y,x) ~ 4>(y,xo) + (g,x- xo) = f(xo) + (g,x- xo).
0
Now we can look at some examples of subdifferentials.

EXAMPLE 3.1.5 1. Let f(x) =I x I, x E R 1 . Then 8f(O) = [-1, 1] since
f(x) = max g · x.
-1:599
m
2. Consider function f (x) = I: I (ai, x) - bi I· Derrote
i=l
L(x) {i: (ai, x)- bi < 0},

h (X) {i : (ai, X) - bi > 0},
I o(x) {i : (ai, x) - bi = 0}.
Then äj(x) = I: ai- I: ai + I: [-ai,ai].

iEI+(x) iEL(x) iElo(x)
3. Consider function j(x) = m;:t:x x(i). Derrote I(x) = {i : x(i) =

l<~<n
f(x)}. Then äj(x) = Conv {ei-l i E I(x)}. For x = 0 we have
8j(O) = Conv {ei I 1::; i::; n} =D.n.

4. For Euclidean norm f(x) =II x II we have
8](0) = B2(0, 1) = {x ERn 111 X II:S 1},

8j(x) = {x/ II x 1!}, x =/= 0.
n
5. For l1-norm f(x) =II x lh= I: I x(i) I we have
i=l
8](0) = B 00 (0 1) = {x ERn I max

l:Si:Sn
I x(i) I< 1}
' - '
äj(x) I: ei- I: ei + I: [-ei, ei], x =/= 0,

iEl+(x) iEL(x) iElo(x)
where I+(x) = {i I x(i) > 0}, L(x) = {i I x(i) < 0} and Io(x) =
{i I x(i) = 0}.
We leave justification of these examples as an exercise for the reader. 0
We conclude this section with an example of application of the above

technique for deriving an optimality condition for a smooth minimization
problern with functional constraints.
THEOREM 3.1.17 (Kuhn-Tucker). Let fi be differentiable convex func-

tions, i = 0 ... m. Suppose that there exists a point x such that fi(x) < 0
for all i = 1 ... m. (Slater condition.)
A point x* is a solution to the problern

min{fo(x) I fi(x) ~ O,i = 1. .. m} (3.1.18)
if and only if it is feasible and there exist nonnegative numbers Ai, i =
1 ... m, such that
f~(x*) + L Adf(x*) = 0,
iEl*
where I* = {i E [1, m) : fi(x*) = 0}.
Proof: In view of Lemma 2.3.4, x* is a solution to (3.1.18) if and only
if it is a global minimizer of the function
r/>(x) = max{fo(x)- J*;fi(x),i = 1. .. m}.
In view of Theorem 3.1.15, this is the case if and only if 0 E ßrJ>(x*).
Further, in view of Lemma 3.1.10, this is true if and only if there exist
nonnegative ~i, such that
5.of~(x*) +L ~df(x*) = 0, 5.o +L ~i = 1.

iEl" iEl*
Thus, we need to prove only that 5.o > 0. Indeed, if 5.o = 0, then
E ~ifi(x) ~ E ~i[fi(x*) + (fi(x*), x- x*)] = 0.
iEl* iEl*
This contradicts the Slater condition. Therefore 5. 0 > 0 and we can take
Ai = J..if 5.o, i E J*. 0
Theorem 3.1.17 is very useful for solving simple optimization prob-

lems.
LEMMA 3.1.12 Let A >- 0. Then
max{(c,x}: (Ax,x} ~I}= (A- 1 c,c) 112 .
X
Proof: Note that all conditions of Theorem 3.1.17 are satisfied and
the solution x* of the above problern is attained at the boundary of the
feasible set. Therefore, in accordance with Theorem 3.1.17 we have to
solve the following equations:
c = AAx*, (Ax*, x*} = 1.
0
3.2 Nonsmooth minimization methods

(Generallower complexity bounds; Main lemma; Localization sets; Subgradient
method; Constrained minimization scheme; Optimization in finite dimension
and lower complexity bounds; Cutting plane scheme; Center of gravity method;
Ellipsoid method; Other methods.)
3.2.1 General lower complexity bounds

In the previous section we have introduced a dass of general convex
functions. These functions can be nonsrnooth and therefore the corre-
sponding rninirnization problern can be quite difficult. As for srnooth
problerns, Iet us try to derive a lower cornplexity bounds, which will
help us to evaluate the perforrnance of nurnerical rnethods.
In this section we derive such bounds for the following unconstrained
rninirnization problern
rnin J(x), (3.2.1)
xERn
where f is a convex function. Thus, our problern dass is as follows:
Model: 1. Unconstrained rninirnization.

2. f is convex on Rn and Lipschitz
continuous on a bounded set.
Oracle: First-order black box:

at each point x we can cornpute
J(x), g(x) E öf(x),
g(x) is an arbitrary subgradient. (3.2.2)
Approximate Find x E Rn : f(x) - j* ~ €.

solution:
Methods: Generate a sequence {xk} :

xk E xo + Lin {g(xo), ... , g(xk-d }.
As in Section 2.1.2, for deriving a lower cornplexity bound for our

problern dass, we will study the behavior of nurnerical rnethods on sorne
function, which appears tobe very difficult for all of thern.
Let us fix some constants p > 0 and 1 > 0. Consider the family of
functions
fk(x) = 1 max x(il + ~ I x 1 2, k = 1. . . n.

l~i:::;k
Using the rules of subdifferential calculus, described in Section 3.1.6, we

can write down an expression for the subdifferential of fk at x. That is
Bfk(x) px + 1Conv {ei I i E I (x)},
I(x) {J. I 1 < J. <k xUl = max x(il}.

- - ' l~i~k
Therefore for any x, y E B2(0, p), p > 0, and 9k(Y) E ßfk(Y) we have
fk(Y)- fk(x) < (gk(y), y- x}
< I 9k(Y) 11·11 y-x II~ (pp+!) II y-x II·
Thus, fk is Lipschitz continuous on B 2 (0, p) with Lipschitz constant

M =PP+!·
Further, consider the point xk with the coordinates
It is easy to check that 0 E 8fk(xk) and therefore x"k is the minimum of

function fk(x) (see Theorem 3.1.15). Note that
Rk =II x"k II= ~'
Let us describe now a resisting oracle for function fk(x). Since the
analytical form of this function is fixed, the resistance of this oracle
consists in providing us with the worst possible subgradient at each test
point. The algorithmic scheme of this oracle is as follows.
Input: XE Rn.
Main Loop: f := -oo; i* := 0;
for j := 1 to m do
if xU) > f then { f := xU); i* := j };
f := 'Yf +} II X 11 2 ; g := ei• + fLX;
Output: fk(x) := J, 9k(x) :=gERn.
At the first glance, there is nothing special in this scheme. Its main loop
is just a standard process for finding a maximal coordinate of a vector
from Rn. However, the main feature of this loop is that we always
form the subgradient as a coordinate vector. Moreover, this coordinate
corresponds to i*, which is the firstmaximal component of vector x. Let
us check what happens with a minimizing sequence, which uses such an
oracle.
Let us choose starting point x 0 = 0. Derrote
Rp,n = {x ERn I x(i) = 0, p +1 ~ i ~ n}.
Since xo = 0, the answer of the oracle is fk(xo) = 0 and 9k(xo) = e1.

Therefore the next point ofthe sequence, x 1 , necessarily belongs to R 1•n.
Assurne now that the current test point of the sequence, Xi, belongs to
RP,n, 1 ~ p ~ k. Then the oracle will return a subgradient
g = fLXi + Iei•,
where i* ~ p + 1. Therefore, the next test point Xi+l belongs to RP+l,n.
This simple reasoning proves that for all i, 1 ~ i ~ k, we have Xi E
Ri,n. Consequently, for i: 1 ~ i ~ k - 1, we cannot improve the starting
value of the objective function:
!k(xi) ~ 1 max x~j) = 0.

l=:;j=:;k
Let us convert this observation in a lower complexity bound. Let us

fix some parameters of our problern dass P(xo, R, M), that is R > 0 and
M > 0. In addition to (3.2.2) we assume that
• the solution of problern (3.2.1), x*, exists and x* E B2(xo, R).

• f is Lipschitz continuous on B2(x 0 , R) with constant M > 0.
THEOREM 3.2.1 For any class 'P(x 0 , R, M) and any k, 0 ::; k ::; n- 1,
there exists a function f E P(xo, R, M) such that
f(xk)- f* ~ 2(l:Jkr)
for any optimization scheme, which generates a sequence {xk} satisfying
the condition
Xk E xo + Lin {g(xo), ... , g(xk-d }.
Proof: Without lass of generality we can assume that xo = 0. Let us
choose f(x) = fk+l(x) with
'Y = ~' J1. = (l+v'~+l)R"

Then
II xo - x* II= Rk+l = JLJ:IT = R,

and f(x) is Lipschitz continuous on B 2(x 0 , R) with constant ~-tR+"Y = M.
Note that Xk E Rk,n. Hence, f(xk)- f* ~ -f*. D
The lower complexity bound presented in Theorem 3.2.1 is uniform

in the dimension of the space of variables. As for the lower bound of
Theorem 2.1.7, it can be applied to problems with very large dimension,
or to efficiency analysis of starting iterations of a minimization scheme
(k::; n- 1).
We will see that our lower estimate is exact: There exist minimiza-
tion methods, which have the rate of convergence proportional to this
lower bound. Comparing this bound with the lower bound for smooth
minimization problems, we can see that now the possible convergence
rate is much slower. However, we should remernher that we are working
now with the most general dass of convex problems.
3.2.2 Main lemma

At this moment we are interested in the following problem:
min{f(x) I x E Q}, (3.2.3)
where Q is a closed convex set, and f is a function, which is convex
on Rn. We are going to study some methods for solving (3.2.3), which
employ subgradients g(x) of the objective function. As compared with
=
the smooth problem, our goal now is much more complicated. Indeed,
even in the simplest situation, when Q Rn, the subgradient seems to
be a poor replacement for the gradient of smooth function. For example,
we cannot be sure that the value of the objective function is decreasing
in the direction -g(x). We cannot expect that g(x) -+ 0 as x approaches
a solution of our problem, etc.
Fortunately, there is one property of subgradients that makes our
goals reachable. We have proved this property in Corollary 3.1.4:
At any x E Q the following inequality holds:
(g(x),x- x*) :2:0. (3.2.4)

This simple inequality leads to two consequences, which form a basis for
any nonsmooth minimization method. Namely:
• The distance between x and x* is decreasing in the direction -g(x).
• Inequality (3.2.4) cuts Rn on two half-spaces. Only one of them
contains x*.
Nonsmooth minimization methods cannot employ the idea of relax-
ation or approximation. There is another concept, underlying all these
schemes. That is the concept of localization. However, to go forward
with this concept, we have to develop some special technique, which al-
lows us to estimate the quality of an approximate solution to problern
(3.2.3). That is the main goal of this section.
Let us fix some x ERn. For x ERn with g(x) =J 0 define
v1 (x, x) = ~(g(x), x- x).
If g(x) = 0, then define Vj(x; x) = 0. Clearly, Vj(X, x) sll X - X II·

The values Vj(x, x) have a natural geometric interpretation. Consider
a point x such that g(x) =P 0 and (g(x), x - x) :2: 0. Let us Iook at the
point y = x + VJ(x)g(x)/ I g(x) II· Then
(g(x),x- y) = (g(x),x- x)- Vj(x,x) I g(x) II= 0

and II y- x II= VJ(x, x). Thus, Vj(X, x) is a distance from the point x to
hyperplane {y: (g(x), x- y) = 0}.
Let us introduce a function that measures the variation of function f
with respect to the point x. For t :2: 0 define
wf(x; t) = max{f(x)- J(x) 111 x- x lls t}.

If t < 0, we set w1 (x; t) = 0.
Clearly, the function Wf possesses the following properties:

• wf(x;O) = 0 for all t ~ 0.
• w1(x; t) is a nondecreasing function oft, t E R 1 .
• f(x) - f(x) ~ wf(x; II x- x* II).
It is important that in the convex situation the last inequality can be
strengthened.
LEMMA 3.2.1 For any x ERn we have
f(x)- f(x) ~ w1(x; v 1(x; x)). {3.2.5)

If f(x) is Lipschitz continuous on B2(x, R) with some constant M, then
f(x)- f(x) ~ M(vt(x; x))+ {3.2.6)

for all x ERn with VJ(x;x) ~ R.
Proof: If (g(x), x - x) ~ 0, then f(x) ~ f(x) + (g(x), x- x) ~ f(x).
This implies that v1(x; x) ~ 0. Hence, wf(x; VJ(x; x)) = 0 and (3.2.5)
holds.
Let (g(x), X - x) > 0. For
y = iJY&m(x + v1(x; x)g(x))
we have (g(x),y- x) = 0 and II y- x II= vf(x;x). Therefore
f(y) ~ f(x) + (g(x), y- x) = f(x),
and
f(x)- f(x) ~ f(y)- f(x) ~ wf(x; II y- x II) = wf(x; vf(x; x)).
If f
is Lipschitz continuous on B2 (x, R) and 0 ~ v1(x; x) ~ R, then
y E B2(x, R). Hence,
f(x)- f(x) ~ f(y)- f(x) ~ M II y- x II= Mvt(x; x).
0
Let us fix some x*, a solution to problern (3.2.3). The values VJ(x*;x)
allow us to estimate the quality of localization sets.
DEFINITION 3.2.1 Let {xi}~ 0 be a sequence in Q. Define
We call this set the localization set of problern {3. 2. 3) generated by se-
quence {xi}~o·
Note that in view of inequality (3.2.4), for all k ;:::: 0 we have x* E Sk.
Denote
v·- z (>
z- VJ(x*·' x·) - 0) ' vk* = O<i<k
mm Vi·
Thus,
vZ = max{r I (g(xi), Xi- x) ;:::: 0, i = 0 ... k, Vx E B2(x*, r)}.
LEMMA 3.2.2 Let fi. = min f(xi). Then fi.- f* ~ WJ(x*; vk).
O$i::;k
Proof: Using Lemma 3.2.1, we have
WJ(x*;vZ) = O$i$k
min WJ(x*;vi) > min [f(xi)- J*] = fj.- f*.
- O~i~k
0
3.2.3 Subgradie nt method

Now we are ready to analyze the behavior of some minimization
schemes. Consider the problern
min{f(x) I x E Q}, (3.2.7)
where f is a convex on Rn function and Q is a simple closed convex set.

The term "simple" means that we can solve explicitly some simple mini-
mization problems over Q. In accordance to the goals of this section, we
have to be able to find in a reasonably cheap way a Euclidean projection
of any point onto Q.
We assume that problern (3.2. 7) is equipped with a first-order oracle,
which at any test point x provides us with the value of objective function
f(x) and with one of its subgradients g(x).
As usual, we try first a version of a gradient method. Note that for
nonsmooth problems the norm of the subgradient, II g(x) II, is not very
informative. Therefore in the subgradient scheme we use a normalized
direction g(x)/ II g(x) II·
Subgradient method. Unconstrained minimization
0. Choose xo E Q and a sequence {hk}~ 0 :

00
hk > o, hk ~ o, E hk = oo.
(3.2.8)
k=O

Compute f(xk), g(xk) and set
Let us estimate the rate of convergence of this scheme.

THEOREM 3.2.2 Let f be Lipschitz continuous on B2(x*, R) with con-
stant M and xo E B(x*, R). Then
k
R2 + L h~
f k*- f* -< M i=O
k
(3.2.9}
2 L: h;
i=O
Proof: Denote ri =II Xi- x* II· Then, in view of Lemma 3.1.5, we have
< llxi- hi ~~~~::~11 - x*ll 2

= rf - 2hivi + hf.
Summing up these inequalities for i = 0 ... k we get
Thus,
k
R2+Eh~
v*
k-
< i=O
k
2L: h;
i=O
It remains to use Lemma 3.2.2. 0
Thus, Theorem 3.2.2 demonstrates that the rate of convergence of

subgradient method (3.2.8) depends on the values
i=D
00
We can easily see that Ö.k --+ 0 if the series 2:: hi diverges. However, Iet
i=D
us try to choose hk in an optimal way.
Let us assume that we have to perform a fixed number of steps of
the subgradient method, say, N. Then, minimizing D..k as a function of
{ hk }k"=o, we find that the optimal strategy is as follows: 2
hi = ,;/}+1, i = 0 ... N. (3.2.10)

In this case D..N = ,;/}+ 1 and we obtain the following rate of convergence:
! k* - f* < ../N+l.
-
MR
Comparing this result with the Iower bound of Theorem 3.2.1, we con-
clude:
The subgradient method (3.2.8), (3.2.10) is optimal for

problern (3.2. 7) uniformly in the dimension n.
If we do not want to fix the number of iterations apriori, we can choose

hi = vf+P i = 0, ....
Then it is easy to see that D..k is proportional to
and we can classify the rate of convergence of this scheme as sub-optimal.

Thus, the simplest method for solving the problern (3.2.3) appears
to be optimal. This indicates that the problems from our dass are too
complicated tobe solved efficiently. However, we should remember, that
our conclusion is valid uniformly in the dimension of the problem. We
will see that a moderate dimension of the problem, taken into account
in a proper way, helps to develop much more efficient schemes.
2 From Example 3.1.2{3) we can see that t:;.k is a convex function of {hi} .
3.2.4 Minimization with functional constraints

Let us apply a subgradient method to a constrained minimization
problern with functional constraints. Consider the problern
min{f(x) I x E Q, fi(x) ~ O,i = 1 ... m}, (3.2.11)
with convex f and fi, and a simple bounded closed convex set Q:
II x- y 11:::; R, Vx,y E Q.
Let us form an aggregate constraint ](x) = ( m.ax fi(x)) . Then
l$J$m +
our problern can be written as follows:
min{f(x) I x E Q, J(x) :::; 0}. (3.2.12)
Note that we can easily compute a subgradient g(x) of function j, pro-
vided that we can do so for functions fi (see Lemma 3.1.10).
Let us fix some x*, a solution to (3.2.11). Note that J(x*) = 0 and
Vj(x*;x) ~ 0 for all x ERn. Therefore, in view ofLemma 3.2.1 we have
](x) :::; Wj(x*; vf(x*; x)}.
If fi are Lipschitz continuous on Q with constant M, then for any x
from Rn we have the estimate
J(x):::; M · Vj(x*;x).
Let us write down a subgradient minimization scheme for constrained
minimization problern (3.2.12). We assume that R is known.
Subgradient method. Functional constraints
0. Choose xo E Q and sequence {hk}f=o=
hk = v'k~0.5.
1. kth iteration (k ~ 0). (3.2.13)
a). Compute f(xk), g(xk), ](xk), g(xk) and set
g(xk), if j(xk) <II g(xk) II hk, (A),

Pk = {
g(xk), if f(xk) ~II g(xk) II hk, (B).
THEOREM 3.2.3 Let f be Lipschitz continuous on B2(x*, R) with con-

stant M1 and
M2 = m~
1::;3::;m
{II g II: g E 8fi(x), x E B2(x*, R)}.
Then for any k 2: 3 there exists a number i', 0 :5 i' :5 k, such that
f(xi') - f* :5 - :5 1M
1M1R f(xi')
k-1.5'
2R
k-1.5 ·
Proof: Note that for direction Pk, chosen in accordance to rule (B), we
have
II g(xk) II hk :5 ](xk) :5 (g(xk), Xk - x*).
Hence, in this case vf( x*; x k) 2: hk.
Let k' =] t( and Ik ={i E (k', ... , k]: Pi= g(xi)}· Derrote
v·-
t-
VJ(x*·' x·)
t '
Then for all i, k' :5 i :5 k, we have

if i E lk, then r[+1 $ r[- 2hivi + hr,
if i f/. Ik, then r[+l :5 r[- 2hiih + hr.
Summing up these inequalities for i E [k', ... , k], we get:
k
r~, + L hr ~ r~+l +2 L hiVi +2 L hiVi.
i=k' iElk iri.Jk
Recall that for i f/. Ik we have Vi 2: hi (Oase (B)).

Assurne that Vi 2: hi for all i E Ik. Then
1 k 2 _ k 1 k+l dT _ 2k 3
1 2: Ji.'I .L hi - .L i+0.5 2: f r+0.5 - ln 2kq 1 2: ln 3.
t=k' t=k' k'
That is a contradiction. Thus, h f= 0 and there exists some i' E h

such that Vi' < hi'. Clearly, for this number we have Vi' :5 hk', and,
consequently, (Vi') + :5 hk' .
Thus, we conclude that f(xi')- f* :5 M 1 hk' (see Lemma 3.2.1) and,
since i' E h we have also the estimate
It remains to note that k' 2: i - 1 and therefore hk' :5 Jf2-~. 5 • 0

Comparing the result of Theorem 3.2.3 with the lower complexity

bound of Theorem 3.2.1, we see that scheme (3.2.13) has an optimal rate
of convergence. Recall, that this lower complexity bound was obtained
for an unconstrained minimization problem. Thus, our result proves
that from the viewpoint of analytical complexity the general convex
unconstrained minimization problems are not easier than the constrained
ones.
3.2.5 Complexity bounds in finite dimension

Let us look at the unconstrained minimization problern again, assum-
ing that its dimension is relatively small. This means that our com-
putational resources allow us to perform the number of iterations of
a minimization method, proportional to the dimension of the space of
variables. What will be the lower complexity bounds in this case?
In this section we obtain a finite-dimensionallower complexity bound
for a problem, which is closely related to minimization problem. This is
the feasibility problem:
Find x* E Q, (3.2.14)
where Q is a convex set. We assume that this problern is endowed with
an oracle, which answers our request at point x E Rn in the following
way:
• Either it reports that x E Q.
• Or, it returns a vector g, separating x from Q:
(g, x - x) ~ 0 \lx E Q.
To estimate the complexity of this problem, we introduce the following

assumption.
AssUMPTION 3.2.1 There exists a point x* E Q such that for some
€ > 0 the ball B2(x*, €) belongs to Q .
For example, if we know an optimal value f* for problern (3.2.3), we
can treat this problern as a feasibility problern with
Q = {(t, x) E Rn+l I t ~ f(x), t ~ J* + l, x E Q}.

The relation between the accuracy parameters land € in (3.2.1) can be
easily obtained, assuming that the function f is Lipschitz continuous.
We leave this reasoning as an exercise for the reader.
Let us describe now a resisting oracle for problern (3.2.14). lt forms

a sequence of boxes {Bk}f:, 0 , Bk+l C Bk, defined by their lower and
upper bounds.
Bk= {x ERn I ak :S x :S bk}·
Foreach box Bk, k 2: 0, denote by Ck = ~(ak + bk) its center. For boxes
Bk, k 2: 1, the oracle creates an individual sepa.rating vector 9k· Up to
a sign change, this is always a coordinate vector.
In the scheme below we use two dynamic counters:
• m is the number of generated boxes.
• i is the active coordinate.
Denote by e E Rn a vector of all ls. The oracle sta.rts from the following
settings:
ao :=-Re, bo :=Re, m := 0, i := 1.
Its input is an arbitrary x E Rn.
Resisting oracle. Feasibility problern
If x ~ Bo then return a separator of x from Bo eise
1. Find the maximal k E [0, ... , m] : x E Bk.

2. lf k < m then return 9k eise {Create a new box}:
If
m := m + 1; i := i + 1; If i > n then i := 1.
Return 9m·
This oracle implements a very simple strategy. Note, that the next
box Bm+l is always a half of the last box Bm. The box Bm is divided into
148 INTRODUCTO RY LEGTURES ON CONVEX OPTIMIZATIO N
two parts by a hyperplane, which passes through its center and which
corresponds to the active coordinate i. Depending in which part of the
box Bm we get the test point x, we choose the sign of the separation
vector 9m+l = ±ei. After creating a new box Bm+l the index i is
increased by 1. If this value exceeds n, we return again to i = 1. Thus,
the sequence of boxes {Bk} possesses two important properties:
• voln Bk+l = !voln Bk.
• For any k 2 0 we have bk+n- ak+n = !(bk- ak)·

Note also that the number of generated boxes does not exceed the num-
ber of calls of the oracle.
LEMMA 3.2.3 For all k;:::: 0 we have the inclusion
k
B2(ck, rk) C Bk, with rk = ~ (~) -;;-. {3.2.15}
Proof: Indeed, for all k E [0, ... , n - 1] we have
Bk :J Bn = {x I Cn- ~Re :S X :S Cn +~Re} :J B2(cn, ~R).
Therefore, for such k we have Bk :J B2(cb ~R) and (3.2.15) holds. Fur-
ther, let k = nl + p with some p E [0, ... , n- 1]. Since
bk- ak = (~) -! (bp- ap),

we conclude that
Bk :J B2 (ck, ~R (~) -l).
It remains to note that rk :::; ~ R ( ~) -!. 0
Lemma 3.2.3 immediately leads to the following complexity result.

THEOREM 3.2.4 Consider a class of feasibility problems {3.2.14}, which
satisfy Assumption 3.2.1, and for which the feasible sets Q belang to
B 00 ( 0, R). The lower analytical complexity bound for this class is n ln ~
calls of the oracle.
Proof: Indeed, we have seen that the number of generated boxes does
not exceed the number of calls of the oracle. Moreover, in view of Lemma
3.2.3, after k iterations the last box contains the ball B2(Cmk,rk)· 0
The lower complexity bound for minimization problern (3.2.3) can be

obtained in a similar way. However, the corresponding reasoning is more
complicated. Therefore we present here only a conclusion.
THEOREM 3.2.5 The lower bound for analytical complexity of problern
class formed by minimization problems {3.2.3} with Q ~ B 00 (0, R) and
f E .rt 0 (B00 (0, R)), is nln "':~ calls of the oracle. D
3.2.6 Cutting plane schemes

Let us look now at the following constrained minimization problem:
min{f(x) I x E Q}, (3.2.16)
where f is a function convex on Rn, and Q is a bounded closed convex
set such that
int Q '=I 0, diam Q = D < oo.
We assume that Q is not simple and that our problern is equipped with a
separating oracle. At any test point x E Rn this oracle returns a vector
gwhich is:
• a subgradient of f at x, if x E Q,
• a separator of x from Q, if x fl. Q.
An important example of such a problern is a constrained minimization
problern with functional constraints (3.2.11}. We have seen that this
problern can be rewritten as a problern with a single functional constraint
(see (3.2.12)), which defines a feasible set
Q = {x ERn I ](x) S 0}.
In this case, for x f/. Q the oracle has to provide us with any subgradient
g E a](x). Clearly, g separates X from Q (see Theorem 3.1.16).
Let us present the main property of finite-dimensional localization
sets.
Consider a sequence X= {xi}~ 0 belanging to the set Q. Recall, that
the localization sets, generated by this sequence, are defined as follows:
So(X) Q,
sk+l (X) - {X E Sk(X) I (g(xk), Xk - x) ~ 0}.

Clearly, for any k ~ 0 we have x* E Sk. Denote
Denote by voln S an n-dimensional volume of set S C Rn.

THEOREM 3.2.6 For any k ~ 0 we have
1
*<D
vk-
[voln Sk(X)]
voln Q
n
.
Proof: Denote a = vkf D (:::; 1). Since Q ~ B2 (x*, D) we have the

following inclusion:
(1- a)x* + aQ ~ (1- a)x* + aB2(x*, D) = B2(x*, vZ).
Since Q is convex, we conclude that
(1 - a)x* + aQ =[(1 - a)x* + aQ] nQ ~ B2(x*, vk) nQ ~ Sk(X).
Therefore voln Sk(X) ~ voln [(1 - a)x* + aQ] = anvoln Q. 0
Quite often the set Q is rather complicated and it is difficult to work

directly with sets Sk(X). Instead, we can update some simple upper
approximations of these sets. The process of generating such approxi-
mations is described by the following cutting plane scheme.
General cutting plane scheme
0. Choose a bounded set Eo 2 Q.

a) Choose Yk E Ek
b) If Yk E Q then compute f(Yk), g(yk)· If Yk f/; Q, (3.2.17)
then compute g(yk), which separates Yk from Q.
c) Set
g(yk), if Yk E Q,
9k = {
g(yk), if Yk f/; Q.
d) Choose Ek+l 2 {x E Ek I (gk,Yk- x} ~ 0}.
Let us estimate the performance of the above process. Consider the

sequence Y = {Yk}~ 0 , involved in this scheme. Denote by X a sub-
sequence of feasible points in the sequence Y: X = Y Q. Let us n
introduce the counter
i(k) = number of points Yi, 0:::; j < k, suchthat Yi E Q.
Thus, if i(k)> 0, then X f 0.

LEMMA 3.2.4 For any k;::: 0, we have Si(k) ~ Ek·
Proof: Indeed, if i(O) = 0, then So = Q ~ Eo. Let us assume that
Si(k) ~ Ek for some k 2: 0. Then, at the next iteration there are two
possibilities:
a) i(k + 1) = i(k). This happens if and only if Yk rt Q. Then
Ek+l ::) {x E Ek I (g(yk),Yk- x) 2: 0}
::) {X E si(k+l) I (g(yk), Yk - x) ;::: 0} = si(k+l)

since Si(k+l) ~ Q and g(yk) separates Yk from Q.
b) i(k + 1) = i(k) + 1. In this case Yk E Q. Then
Ek+l ::) {x E Ek I (g(yk), Yk - x) 2: 0}
::) {X E si(k) I (g(yk), Yk - x) 2: 0} = si(k)+l

since Yk = xi(k)· 0
The above results immediately Iead to the following important con-

clusion.
COROLLARY 3.2.1 1. For any k such that i(k) > 0, we have
I I
~
V~(k)
(X)< D
-
[volnSi(k)(X)];- <
voln Q -
D [volnEk];-
voln Q ·
2. lf voln Ek < voln Q, then i(k) > 0.

Proof: We have already proved the first statement. The second one
follows from the inclusion Q = So = Si(k) ~ Ek, which is valid for all k
such that i(k) = 0. 0
Thus, if we manage to ensure voln Ek -t 0, then we obtain a conver-

gent scheme. Moreover, the rate of decrease of the volume automatically
defines the rate of convergence of the method. Clearly, we should try to
decrease voln Ek as fast as possible.
Historically, the first nonsmooth minimization method, implementing
the idea of cutting planes, was the center of gravity method. It is based
on the following geometrical fact.
Consider a bounded convex set S C Rn, int S f 0. Define the center
of gravity of this set as
cg(S) = voll S
n
J xdx.
S
The following result demonstrates that any cut passing through the cen-
ter of gravity divides the set on two proportional pieces.
LEMMA 3.2.5 Let g be a direction in Rn. Define
s+ = {x Es I (g, cg(S)- x) ~ 0}.

Then
(We accept this result without proof.) 0
This observation naturally Ieads to the following minimization scheme.
Method of centers of gravity
0. Set So= Q.
a) Choose Xk = cg(Sk) and compute f(xk), g(xk)·
b) Set sk+l = {x E sk I (g(xk), Xk- x) ~ 0}.
Let us estimate the rate of convergence of this method. Denote

Jt. = min f(xj)·
0$j$k
THEOREM 3.2.7 lf f is Lipschitz continuous on B2(x*,D) with a con-
stant M, then for any k ~ 0 we have
k
J:.-f*~MD(l-~)--n.
Proof: The statement follows from Lemma 3.2.2, Theorem 3.2.6 and
Lemma 3.2.5. 0
Comparing this result with the lower complexity bound of Theo-

rem 3.2.5, we see that the center-of-gravity method is optimal in finite
dimension. lts rate of convergence does not depend on any individual
characteristics of our problern like condition number, etc. However, we
should accept that this metbad is absolutely impractical, since the com-
putation of the center of gravity in multi-dimensional space is a more
difficult problern than our initial one.
Let us look at another method, which uses a possibility of approxi-

mation of the localization sets. This method is based on the following
geometrical observation.
Let H be a positive definite symmetric n x n matrix. Consider the
ellipsoid
E(H,x) = {x ERn I (H- 1 (x- x),x- x)::::; 1}.
Let us choose a direction g E Rn and consider a half of the above ellip-
soid, defined by corresponding hyperplane:
E+ = {x E E(H,x) 1 (g,x- x) 2: 0}.
lt turns out that this set belongs to another ellipsoid, which volume is
strictly smaller than the volume of E(H, x}.
LEMMA 3.2.6 Denote
x+ = x- 1 Hg
n+l · (Hg,g)l/2'
voln E(H+, x+) ::::; ( 1- (n~ 1 F) 2 voln E(H, x).

Proof: Denote G = H- 1 and G+ = H+1 . It is clear that
G+ = n:-;1 (a + n:_1. dt~~g)).

Without loss of generality we can assume that x = 0 and (Hg, g) = 1.
Suppose x E E+· Note that x+ =- n~ 1 Hg. Therefore
II X- X+ llb+ n~-;1 (11 X- X+ llb +n:1 (g,x- x+)2)'
II X -X+ II~ II X II~ + n!l (g, x} + (n~lF'

(g, x - x+) 2 = (g, x} 2 + n! 1(g, x} + (n~l)2 ·
Putting all terms together, we obtain
II X - X+ II~+ = n:2 1 (II XII~+ n:_1 (g, X) 2 + n:_1 (g, x} + n2:_1) ·

Note that (g, x) ::::; 0 and II x lla:s; 1. Therefore
(g, x} 2 + (g, x) = (g, x)(1 + (g, x)) ::::; 0.
Hence,
I X- X+ II~+~ n:2 1 (11 XII~ +nLl) ~ 1.
Thus, we have proved that E+ c E(H+, x+)·
Let us estimate the volume of E(H+, x+)·
[ detHt]1/2 = [( !12 )n n- 1]1/2

detH n2-l n+1
n2 ( 1
[ nL1 -
2 )
n+1
~] ~ -< [n2-1
n2 ( 1 - n(n+l)
2 )] ~
_ [ n 2 (n 2 +n-2) ] ~ _ [ 1 ] ~
- n(n-l)(n+l)2 - 1 - (n+l)2 ·
0
It turnsout that the ellipsoid E(H+, x+) is the ellipsoid ofthe minimal
volume containing the half of the initialellipsoid E+.
Our observations can be implemented in the following algorithmic
scheme of the ellipsoid method.
Ellipsoid method
0. Choose Yo ERn and R > 0 suchthat B2(Yo, R) 2 Q.

Set Ho = R 2 · In.
g(yk), if Yk E Q, (3.2.18)
gk {
g(yk), if Yk ~ Q,
Yk+1
This method can be seen as a particular implementation of general

scheme (3.2.17) by choosing
Ek = {x ERn I (Hj; 1 (x- Yk),x- Yk) ~ 1}

and Yk being the center of this ellipsoid.

Let us present an efficiency estimate for the ellipsoid method. Denote
Y = {Yk}~ 0 and Iet X be a feasible subsequence of the sequence Y:
Denote J; = min f(xj)·

O~j~k
THEOREM 3.2.8 Let f be Lipschitz continuous on B2(x*, R) with some

constant M. Then for i(k) > 0, we have
k 1
~ _ f* < MR ( 1 -
f~(k) 1 ) 2 . [voln Bo(xo,R)];;.
- (n+1)2 voln Q
Proof: The proof follows from Lemma 3.2.2, Corollary 3.2.1 and Lem-
ma 3.2.6. D
We need additional assumptions to guarantee X =/= 0. Assurne that

there exists some p > 0 and x E Q such that
(3.2.19)
Then
< (1 _ (n+1)2
.! 1 .! k
[ voln E&] n
voln Q -
1 ) 2 [voln B2 xo,R ] n
voln
< le- 2(n+1)2 R.
- p
In view of Corollary 3.2.1, this implies that i(k) > 0 for all
k > 2(n + 1) 2 In~.

lf i(k) > 0, then
k
fz(k)
~ _ f* < lMR2. e-2<n+t>2
- p •
In order to ensure that(3.2.19) holds for a constrained minimization

problern with functional constraints, it is enough to assume that all
constraints are Lipschitz continuous and there is a feasible point, at
which all functional constraints are strictly negative (Slater condition).
We leave the details of the proof as an exercise for the reader.
Let us discuss now the complexity of ellipsoid method (3.2.18). Each
iteration of this method is rather cheap; it takes only O(n 2 ) arithmetic
operations. On the other hand, in order to generate an e-solution of
problern (3.2.16), satisfying assumption (3.2.19), this method needs
calls of the oracle. This efficiency estimate is not optimal (see Theorem
3.2.5), but it has a polynomial dependence on In~ and a polynomial de-
pendence on logarithms ofthe dass parameters M, Rand p. For problern
classes, whose oracle has a polynomial complexity, such algorithms are
called (weakly) polynomial.
To conclude this section, let us mention that there are several methods
that work with localization sets in the form of a polytope: ·
Let us list the most important methods of this type:
• Inscribed Ellipsoid Method. The point Yk in this scheme is chosen as

follows:
Yk =Center of the maximal ellipsoid Wk: Wk c Ek.
• Analytic Center Method. In this method the point Yk is chosen as

the minimum of the analytic barrier
mk
Fk(x) =- I:In(bj- (aj,x)).
j=l
• Volumetrie Center Method. This is also a barrier-type scheme. The

point Yk is chosen as the minimum of the volumetric barrier
where Fk(x) is the analytic barrier for the set Ek·

All these methods are polynomial with complexity bound
where p is either 1 or 2. However, the complexity of each iteration in

these methods is much !arger (n3 + n 4 arithmetic operations). In the
next chapter we will see that the test point Yk for these schemes can be
computed by interior-point methods.
3.3 Methods with complete data

(Model of nonsmooth function; Kelley method; ldea of level method; Uncon-
strained minimization; Efficiency estimates; Problems with functional con-
straints.)
3.3.1 Model of nonsrnooth function

In the previous section we studied several methods for solving the
following problem:
min f(x), (3.3.1)
xEQ
where f is a Lipschitz continuous convex function and Q is a closed con-

vex set. We have seen that the optimal method for problern (3.3.1) is the
subgradient method (3.2.8), (3.2.10). Note, that this conclusion is valid
for the whole class of Lipschitz continuous functions. However, when
we are going to minimize a particular function from that dass, we can
expect that it is not too bad. We can hope that the real performance of
the minimization method will be much better than a theoretical bound
derived from a worst-case analysis. Unfortunately, as far as the subgra-
dient method is concerned, these expectations are too optimistic. The
scheme of the subgradient method is very strict and in general it cannot
converge faster than in theory. It can be also shown that the ellipsoid
method (3.2.18), inherits this drawback of subgradient scheme. In prac-
tice it works more or less in accordance to its theoretical bound even
when it is applied to a very simple function like II x 11 2 .
In this section we will discuss the algorithmic schemes, which are more
flexible than the subgradient and the ellipsoid methods. These schemes
are based on the notion of a model of nonsmooth function.
DEFINITION 3.3.1 Let X= {xk}~ 0 be a sequence in Q. Denote
ik(X; x) = max [f(xi) + (g(xi), x- xi)],

O<i<k
where g(xi) are some subgradients of f at Xi·

The function jk (X; x) is called the model of convex function f.
Note that f k (X; x) is a piece-wise linear function of x. In view of
inequality (3.1.10) we always have
for all x E Rn. However, at all test points Xi, 0 :S i :S k, we have
The next model is always better than the previous one:
for all x ERn.

3.3.2 Kelley method

Model A(X;x) represents the complete information on function f
accumulated after k calls of the oracle. Therefore it seems natural to
develop a minimization scheme, based on this object. Perhaps the most
natural method of this type is as follows:
Kelley method
0. Choose xo E Q. (3.3.2)
Find Xk+I E Argmin A(X; x).
xEQ
Intuitively, this scheme looks very attractive. Even the presence of

a complicated auxiliary problern is not too disturbing, since it can be
solved by linear optimization methods in finite time. However, it turns
out that this method cannot be recommended for practical applications.
And the main reason for that is its instability. Note that the solution
of auxiliary problern in the method (3.3.2) may be not unique. More-
over, the whole set Arg min jk (X; x) can be unstable with respect to an
xeQ
arbitrary small variation of data {f(xi), g(xi)}. This feature results in
unstable practical behavior of the scheme. Moreover, this feature can
be used for constructing an example of a problem, in which the Kelley
method has a very bad lower complexity bound.
EXAMPLE 3.3.1 Consider the problern (3.3.1) with
f(y,x) = max{j Y j, II x 11 2 }, y E R 1 , x ERn,
Q = {z=(y,x): y2 +11xll 2::;1}.

Thus, the solution of this problern is z* = (y*, x*) = (0, 0), and the
optimal value f* = 0. Denote by Zk = Arg min A(
Z; z), the optimal set
zeQ
of model A(Z; z), and by jz = jk(Z"k) the optimal value of the model.
Let us choose zo = (1, 0). Then the initial model of function f is
fo(Z; z) = y. Therefore, the first point, generated by the Kelley method
is z1 = (-1, 0). Hence, the next model of the function f is as follows:
A(Z;z) = max{y, -y} =I Y I·

Clearly, /i = 0. Note that /k,+ 1 2: fk.. On the other hand,
fk. S f(z*) = 0.
Thus, for all consequent models with k 2: 1 we will have /; = 0 and
Zk. = (0, Xk), where
xz = {x E B2(0, 1): I Xi 1 2 +(2xi,X -xi) s 0, i = o... k}.

Let us estimate efficiency of the cuts for the set Xk.. Since Xk+l can
be an arbitrary point from Xk., at the first stage of the method we can
choose Xi with the unit norms: II Xi II= 1. Then the set Xk. is defined as
follows:
Xk. = {x E B2(0, 1) I (xi,x) S ~,i = 0 ... k}.
We can do that if
As far as this is possible, we can have
f(zi) =f(O, xi) = 1.
Let us estimate a possible length of this stage using the following fact.
Let d be a direction in nn, II d II= 1. Consider a surface

S(o:) = {x ERn 111 x II= 1, (d,x) 2: o:}, o: E [~, 1].
= voln-1 (S(o:)) s v(O) [1- o: 2 ] -

n-1
Then v(o:) 2 •
At the first stage, each step cuts from the sphere S2 (0, 1) at most the
segment S( ~) . Therefore, we can continue the process if k [ ] s Ja
n-1
.
During all these iterations we still have f(zi) = 1.
Since at the first stage of the process the cuts are (xi, x) s ~' for all
k, 0 S k S N = ./J ,
[ ]
n-1
we have
This means that after N iterations we can repeat our process with the
ball B2(0, ~), etc. Note that f(O, x) = ~ for all x from B 2(0, ~ ).
Thus, we prove the following lower bound for the Kelley method
(3.3.2):
k [.ii] n-1
f(xk)- f* ~ {!) 2
This means that we cannot get an ~:-solution of our problern less than in
~ [ 2 ]n-1
2ln2 '7a
calls of the oracle. It remains to compare this lower bound with the
upper complexity bounds of other methods:
Ellipsoid method: 0 (n2 ln~)
Optimal methods: 0 (nln~)
Gradient method: 0 (e\-)

0
3.3.3 Level method

Let us show that it is possible to work with models in a stable way.
Denote
J; = minA(X;x), fZ = min f(xi)·
xEQ 09~k
The first value is called the minimal value of the model, and the second
one the record value of the model. Clearly }'; ~ f* ~ fZ.
Let us choose some a E (0, 1). Denote
lk(a) = (1- a)jk + afk·

Consider the Ievel set
Clearly, Ck(a) is a closed convex set.

Note that the set Ck(a) is of a certain interest for an optimization
scheme. Firstly, inside this set clearly there are no test points of the
current model. Secondly, this set is stable with respect to a small vari-
ation of the data. Let us present a minimization scheme, which deals
directly with this level set.
Level method
0. Choose point xo E Q, accuracy E > 0 and Ievel coeffi-

cient a E (0, 1).
(3.3.3)
a). Compute J; and J;.
b). If !Z- jz ~ e, then STOP.
c). Set Xk+l = 7r.ck(a)(Xk)·
In this scheme there are two quite expensive Operations. We need to

compute an optimal value jz
of the current model. lf Q is a polytope,
then this value can be obtained from the following linear programming
problem:
min t,
s.t. f(xi) + (g(xi), x- Xi) ~ t, i = 0 ... k,
XE Q.
We also need to compute projection 7r.ck(a)(Xk)· If Q is a polytope, then
this is a quadratic programming problem:
min II x- Xk 11 2 ,
xEQ.
Both these problems are solvable either by a standard simplex-type
method, or by interior-point schemes.
Let us Iook at some properties of the Ievel method. Recall, that the
optimal values of the model decrease, and the records values increase:
Jz ~ Jz+l ~ r ~ JZ+l ~ JZ.

Denote D..k = [JZ, !Z] and 6k = JZ - jz. We call 6k the gap of the model
]k(X;x). Then
The next result is crucial for the analysis of the Ievel method.
LEMMA 3.3.1 Assurne that for some p ~ k we have Op ~ (1 - a)ok.
Then for all i, k ~ i ~ p,
Proof: Note that for such i we have c5p ~ (1-a)c5k ~ (1-a)<>i· Therefore
li(a) = ft- (1- a)di ~ 1;- (1- a)di = J; + dp- (1- a)di ~ J;.
0
Let us show that the steps of the Ievel method are Iarge enough.
Denote
MJ = max{ll g 111 g E oj(x), x E Q}.
LEMMA 3.3.2 For the sequence {xk} generated by the Ievel method we
have
II Xk+l - Xk II>
(1-a)6k
- MI .
Proof: Indeed,
> jk(Xk+I) ~ f(xk) + (g(xk), Xk+l - Xk)

> f(xk)- MJ II Xk+l- Xk II ·
0
Finally, we need to show that the gap in the model is decreasing.

LEMMA 3.3.3 Let Q in the problern (3.3.1} be bounded: diamQ ~ D.
If for some p ~ k we have Op ~ (1- a)ok, then
M2D2
p +1- k :s; (1-~)26r
Proof: Denote xic E Argmin jk(X;x). In view ofLemma 3.3.1 we have
xEQ
Ji(X; x;) S Jp(X; x;) = J; S li(a)

for all i, k ~ i ~ p. Therefore, in view of Lemma 3.1.5 and Lemma 3.3.2
we obtain the following:
II Xi+l - x; 11 2 < II Xi - x; 11 2 - II Xi+l - Xi 11 2
< II X, 0 - * 112 -
Xp
(1-a)26l
M2 _
<II Xz 0 - * 112 - (1-a)26~
Xp Mz •
I I
Summing up these inequalities in i = k ... p we get
(p + 1- k) ( 1 -;~25~ ~II
I
Xk- x; 11 2 ~ D 2•
Note that the value p + 1 - k is equal to the number of indices in

the segment [k,p]. Now we can prove the efficiency estimate of the Ievel
method.
THEOREM 3.3.1 Let diamQ = D. Then the scheme of the Level method
terminates no sooner than after
N = lf2a(1~I~~2-a) J+ 1
iterations. Termination criterion of the method guarantees J;- f* ~ E.
Proof: Assurne that ök ~ E, 0 ~ k ~ N. Let us divide the indices on

the groups in the decreasing order
{N, ... , 0} = I(O) U 1(2) U ... U I(m),
suchthat
I(j) = [p(j), k(j)], p(j) ~ k(j), j = 0 ... m,
p(O) = N, p(j + 1) = k(j) + 1, k(m) = 0,
Ök(j) ~ 1~a0p(j) < t5k(j)+1 = t5p(j+l)·
Clearly, for j ~ 0 we have
t5 .
P(J+l) -
>~ >
1-a -
Öp(O)
(1-a)J+i -
> f
(1-a)J+i ·
In view of Lemma 3.3.2, n(j) = p(j) +1- k(j) is bounded:

") MJD 2 MJD2 ( 1 ) 2j
n (J ~ (1-o)2ö2 . ~ f2(1-a)2
p(J)
- a .
Therefore
D
Let us discuss the above efficiency estimate. Note that we can obtain
the optimal value of the Ievel parameter a from the following maximiza-
tion problem:
--+ max.
oE[O,l]
Its solution is o:* = 2+J2.

1 Under this choice we have the following ef-
ficiency estimate of the Ievel method: N ~ ~ M} D 2 • Comparing this
result with Theorem 3.2.1 we see that the Ievel rnethod is optimal uni-
formly in the dimension of the space of variables. Note that the analyt-
ical cornplexity bound of this rnethod in finite dirnension is not known.
One of the advantages of this rnethod isthat the gap 8k = JZ- J;
provides us with an exact estirnate of current accuracy. Usually, this gap
converges to zero rnuch faster than in the worst case situation. For the
rnajority of reallife optirnization problems the accuracy € = w- 4 - w- 5
is obtained by the rnethod after 3n - 4n iterations.
3.3.4 Constrained minimization

Let us dernonstrate how we can use the rnodels for solving constrained
rninirnization problems. Consider the problern
min f(x),
s.t. /j(x) s 0, j = 1 ... m, (3.3.4)
xEQ,
where Q is a bounded closed convex set, and functions f(x), fi(x) are
Lipschitz continuous on Q.
Let us rewrite this problern as a problern with a single functional
constraint. Denote f(x) = rn~x /j(x). Then we obtain the equivalent
1$J$m
problern
rnin f(x),
s.t. j(x) ~ 0, (3.3.5)
XE Q.
Note that f(x) and /(x) are convex and Lipschitz continuous. In this
section we will try to solve (3.3.5) using the rnodels for both of thern.
Let us define the corresponding rnodels. Consider a sequence X =
{xk}k::, 0 . Denote
Jk(X; x) = 0~~k [f(xj) + (g(xj), x- Xj}] ~ f(x),

_)_
As in Section 2.3.4, our scherne is based on the parametric function
j(t; x) rnax{f(x)- t, ](x)},
f*(t) = rninf(t;x).
xEQ
Recall that f*(t) is nonincreasing in t. Let x* be a solution to (3.3.5).

Denote t* = J(x*). Then t* is the srnallest root of function f*(t).
Using the rnodels for the objective function and the constraint, we
can introduce a rnodel for the pararnetric function. Denote
fk(X;t,x) = rnax{jk(X;x)- t,A(X;x)} ~ f(t;x),
J;(X;t) = rninfk(X;t,x) ~ j*(t).

xEQ
Again, J;(x; t) is nonincreasing in t. It is clear that its srnallest root

tA;(X} does not exceed t*.
We will need the following characterization of the root tA;(X).
LEMMA 3.3.4
tk(X) = rnin{}k(X; x) I /k(X; x) ~ 0, x E Q}.
Proof: Denote by xt:

the solution of the rninirnization problern in the
right-hand side of the above equation. And let iz
= j~(X; xt:). Then
Thus, we always have ik ;::: tk(X).

Assurne that i'k > t'k(X). Then there exists a point y such that
A(X; y)- t'k(X) ~ 0, A(X; y) ~ 0.
However, in this case i'k = jk(X; x'J.:) :::; /k(X; y) :::; tk(X) < i'k. That is
a contradiction. D
In our analysis we will need also the function
fk(X; t) = rnin fk(X; t, Xj),

OS,jS,k
the record value of our pararnetric rnodel.

LEMMA 3.3.5 Let t 0 < t 1 :::; t*. Assurne that J;(X; tt) > 0. Then
tk(X) > t1 and
{3.3.6}
Proof. Denote xk(t) E Arg minfk(X; t, x), t2 = tk(X), a = :~:::::~ E

[0, 1]. Then
t1 = (1 - a)to + at2
and inequality (3.3.6) is equivalent to the following:
f;(x; h) ::; (1- a)fZ(X; to) + afZ(X; t2) (3.3.7)
(note that J;(X; t2) = 0). Let Xa = (1- a)xk(to) + axk(t2). Then we
have
Jk,(X; tt) ::=; max{!k(X; Xa)- t1; Jk(X; Xa)}
:::; max{(1- a)(!k(X; xk(to))- to) + a(!k(X; xk(t2))- t2);

(1- a)!k(X; xA;(to)) + ajk(X; xA;(t2))}
:::; (1 - a) max{!k(X; xk(to)) - to; fk(X; xk(to))}
+amax{jk(X; xk(t2))- t2; fk(X; xk(t2))}
= (1- a)/k,(X; to) + af;(x; t2),

and we get (3.3.7). 0
We also need the following statement (compare with Lemma 2.3.5).

LEMMA 3.3.6 For any A ~ 0 we have
f*(t) - A :::; f*(t + A),
/;(X; t)- A :::; jz(X; t + A)

Proof. Indeed, for f*(t) we have
f*(t + A) = min [max{f(x)- t; f(x) + A}- A]
xeQ
2:: min [max{f(x)- t; f(x)}- A] = J*(t)- A.

xEQ
The proof of the second inequality is similar. 0
Now we areready to present a constrained minimization scheme (com-

pare with constrained minimization scheme of Section 2.3.5).
Constrained Ievel method
0. Choose x 0 E Q, t 0 < t*, /'i, E (0, ! ) and accuracy € > 0.

a). Keep generating sequence X = {xj}~ 0 by the
Ievel method as applied to function f(tk; x). If the (3.3.8)
internal termination criterion
holds, then stop the internal process and set j(k) = j.

Global stop: fj(X; tk) :SE.
b). Set tk+1 = tj(k)(X).
We are interested in an analytical complexity bound for this method.

Therefore the complexity of computation of the root tj(X) and of the
value ]J(X; t) is not important for us now. We need to estimate the rate
of convergence of the master process and the complexity of Step 1a).
Let us start from the master process.
LEMMA 3.3. 7 For all k ~ 0, we have
!J(k)(X;tk) :S t~=~· [2(1~~~:)r.

Proof: Denote
ß = 2(1~~~:) (< 1).
Since tk+l = tj(k)(X) andin view of Lemma 3.3.5, for all k :2: 1, we have
O"k-1 = y'tk~tk_Jj(k-1)(X; tk-d ~ y'tk~tk-l Jj(k)(X; tk-d
> 2 JA* (X t ) > 2(1-~~:) f* (X t ) 5!..Jr..
- y'tk+t-tk j(k) j k - y'tk+t -tk j(k) j k = ß .
Thus, ak :::; ßak-l and we obtain
= ßkj*j(O) (X ; t 0 ) tk±l-tk
lt-to ·
Further, in view of Lemma 3.3.6, t1- to 2: Jj(o)(X; to). Therefore
:S /:.k~~:Ji;(o)(X;to)(tk+l-tk) :S /:t~~:Jf*(to)(to -t*).

It remains to note that f*(to) ~to-t* (see Lemma 3.3.6). 0
Let Global stop condition in (3.3.8) be satisfied: fj (X; tk) ~ €. Then

there exist j* such that
Therefore we have
Since tk ~ t*, we conclude that

j(Xj•) < t*+~:,
(3.3.9)
/(Xj•) < E.
In view of Lemma 3.3.7, we can get (3.3.9) at most in
1 t -t•
N(~:) = ln[2(1-~~:)]ln (1°-~~:)~
full iterations of the master process. (The last iteration of the process is
terminated by the Global stop rule). Note that in the above expression
r;, is an absolute constant (for example, we can take r;, = t)·
Let us estimate the complexity of the internal process. Denote
Mt= max{ll g 111 g E 8f(x) U8j(x), x E Q}.

We need to analyze two cases.
1. Full step. At this step the internal process is terminated by the
rule
The corresponding inequality for the gap is as follows:
fj(k)(X; tk)- ]J(k)(X; tk) ~ ~JJ(k)(X; tk).
In view of Theorem 3.3.1, this happens at most after
M2D2
K- 2 (/j(k) (X ;tk )f2a(1-a)2 (2-a)
iterations of the internal process. Since at the full step !J(k)(X; tk)) ~ e,
we conclude that
M2D2
j(k)- j(k- 1) ~ K.2~:2a(1~a)2(2-a)
for any full iteration of the master process.
2. Last step. The internal process of this step was terminated by
Global stop rule:
fj(X; tk) ~ e.
Since the normal stopping criterion did not work, we conclude that
fJ-1 (X; tk)- fJ-1 (X; tk) ~ ~fJ-1 (X; tk) ~ ~e.
Therefore, in view of Theorem 3.3.1, the number of iterations at the last
step does not exceed
K.2€2a(l-a)2(2-a) ·
Thus, we come to the following estimate of total complexity of the
constrained level method:
-
-
2
MrD
2
,.2e2a(l-a)2(2-a}
[1 + In(2(1-K-))
1 1 J.o..=.L...]
n (1-1•;}€
M2D2Jn 2(to-t•)
- I •
- ~:2a(1-a)2(2-a)K.2Jn[2(1-K.))'
It can be shown that the reasonable choice for the parameters of this
scheme is a = ~ = 2+\!2'.
The principal term in the above complexity estimate is on the order
of ~ ln 2(to;t•). Thus, the constrained level method is suboptimal (see
Theorem 3.2.1).
In this method, at each iteration of the master process we need to
find the root tj(k)(X). In view of Lemma 3.3.4, that is equivalent to the
following problem:
min{A(X;x) I /k(X;x) ~ 0, x E Q}.
In other words, we need to solve the problern
rnm t,
s.t. f(xj) + (g(xj}, x- Xj) ~ t, j = 0 ... k,
/(xj) + (g(xj}, x- Xj) ~ 0, j = 0 ... k,
XE Q.
If Q is a polytope, this problern can be solved by finite linear prograrn-
rning rnethods (sirnplex rnethod}. lf Q is rnore cornplicated, we need to
use interior-point schernes.
To conclude this section, let us note that we can use a better rnodel
for the functional constraints. Since
/(x) = rn.ax fi(x),

1$t$m
it is possible to work with
where 9i(Xj) E öfi(xj)· In practice, this complete rnodel significantly ac-

celerates the convergence of the process. However, clearly each iteration
becomes more expensive.
As far as practical behavior of this scherne is concerned, we note that
usually the process is very fast. There are some technical problems,
related to accurnulation of too many linear pieces in the model. However,
in all practical schemes there exists sorne strategy for dropping the old
elernents of the rnodel.
Chapter 4
STRUCTURAL OPTIMIZATION
4.1 Self-concordant functions

(Do we really have a black box? What the Newton method actually does?
Definition of self-concordant functions; Main properties; Minimizing the self-
concordant function.)
4.1.1 Black box concept in convex optimization

In this chapter we are going to present the main ideas underlying the
modern polynomial-time interior-point methods in nonlinear optimiza-
tion. In order to start, let us look first at the traditional formulation of
a minimization problem.
Suppose we want to solve a minimization problern in the following
form:
min{fo(x) I /j(x) ~ 0, j = l ... m}.
xeRn
We assume that the functional cornponents of this problern are con-
vex. Note that all standard convex optimization schemes for solving
this problern are based on the black-box concept. This means that we
assume our problern to be equipped with an oracle, which provides us
with some inforrnation on the functional components of the problern at
some test point x. This oracle is local: lf we change the shape of a
component far enough from the test point, the answer of the oracle does
not change. These answers comprise the only inforrnation available for
numerical methods. 1
However, if we look carefully at the above situation, we can see a
certain contradiction. Indeed, in order to apply the convex optirnization
1 We have discussed this concept and the corresponding methods in the previous chapters.
methods, we need to be sure that our functional components are convex.

However, we can check convexity only by analyzing the structure of these
functions 2 : If our function is obtained from the basic convex functions
by convex operations (summation, maximum, etc.), we conclude that it
is convex.
Thus, the functional components of the problern are not in a black
box at the moment we check their convexity and choose a minimization
scheme. But we put them in a black box for numerical methods. That is
the main conceptual contradiction of the standard convex optimization
theory. 3
The above Observation gives us hope that the structure of the problern
can be used to improve the performance of convex minimization schemes.
Unfortunately, structure is a very fuzzy notion, which is quite difficult
to formalize. One possible way to describe the structure is to fix the
analytical type of functional components. For example, we can consider
the problems with linear functions /j(x) only. This works, but note that
this approach is very fragile: Ifwe addjust a single functional component
of different type, we get another problern dass and all theory must be
done from scratch.
Alternatively, it is clear that having the structure at hand we can
play a lot with the analytical form of the problem. We can rewrite the
problern in many equivalent forms using nontrivial transformation of
variables or constraints, introducing additional variables, etc. However,
this would serve no purpose until the moment we realize the final goal
of such transformations. So, let us try to find the goal.
At this moment, it is better to Iook at classical examples. In many
situations the sequential reformulations of the initial problern can be
seen as a part of the numerical scheme. We start from a complicated
problern 'P and, step by step, we simplify its structure up to the moment
wegetatrivial problern (or, a problern which we know how to solve):
'P ----t ... ----t (!*' x*).
Let us Iook at the standard approach for solving a system of linear
equations, namely,
Ax=b.
We can proceed as follows:
1. Check that A is symmetric and positive definite. Sometimes this is
clear from the origin of matrix A.
2A numerical verification of convexity is a hopeless problem.

3 However, the conclusions of the theory concerning the oracle-based minimization schemes
remain valid.
Structural optimization 173
2. Compute the Cholesky factorization of the matrix:

A=LLT,
where L is a lower-triangular matrix. Form an auxiliary system
Ly = b, LT X = y.
3. Salve the auxiliary system.
This process can be seen as a sequence of equivalent transformations of
the initial problern
Imagine for a moment that we do not know how to solve systems
of linear equations. In order to discover the above scheme we should
perform the following steps:
1. Find a dass of problems which can be solved very efficiently (linear
systems with triangular matrices in our example).
2. Describe the transformation rules for converting our initial problern
into the desired form.
3. Describe the dass of problems for which these transformation rules
are applicable.
We are ready to explain the way it works in optimization. First of
all, we need to find a basic numerical scheme and a problern formulation
at which this scheme is very efficient. We will see that for our goals the
most appropriate candidate is the Newton method (see Section 1.2.4) as
applied in the framework of Sequential Unconstrained Minimization (see
Section 1.3.3).
In the succeeding section we will highlight some drawbacks of the
standard analysis of Newton method. From this analysis we derive a
family of very special convex functions, the self-concordant functions
and self-concordant barriers, which can be efficiently minimized by the
Newton method. We use these objects in a description of a transformed
version of our initial problem. In the sequel we refer to this description as
to a barrier model of our problem. This model will replace the standard
functional model of optimization problern used in the previous chapters.
4.1.2 What the Newton method actually does?

Let us Iook at the standard result on local convergence of the Newton
method (we have proved it as Theorem 1.2.5). We are trying to find an
unconstrained local minimum x* of twice differentiable function f(x).
Assurne that:
• f"(x*) t lln with some constant l > 0,

• II f"(x)- f"(y) II~ M II x- y II for all x and y ERn.
We assume also that the starting point of the Newton process xo is close
enough to x*:
II xo- x* II< r = 32f.t· (4.1.1)
Then we can prove that the sequence
xk+l = xk- [f"(xkt 1 J'(xk), k ~ o, (4.1.2)
is well defined. Moreover, II Xk- x* II< f for all k ~ 0 and the Newton
method (4.1.2) converges quadratically:
II Xk+l -X * II<- M!lxk-x*ll 2
2(l-Mjjxk-x•li)'
What is wrong with this result? Note that the description of the
region of quadratic convergence (4.1.1) for this method is given in terms
of the standard inner product
n
(x, y) = L x(i}y(i).
i=l
If we choose a new basis in Rn, then all objects in our description change:
the metric, the Hessians, the bounds l and M. But Iet us look what
happens with the Newton process. Namely, let A be a nondegenerate
(n x n)-matrix. Consider the function
</>(y) = f(Ay).
The following result is very important for understanding the nature of
Newton method.
LEMMA 4.1.1 Let {xk} be a sequence, generated by the Newton method
for function f:
Xk+l = Xk- [f"(xk)t 1J'(xk), k ~ 0.
Consider a sequence {Yk}, generated by the Newton method for function
</>:
Yk+l = Yk- [<f>"(Yk)t 1</>'(yk), k ~ 0,
with Yo = A- 1xo. Then Yk = A- 1xk for all k ~ 0.
Proof: Let Yk = A- 1xk for some k ~ 0. Then
Yk+l = Yk- [4>"(yk)t 1 4>'(Yk) = Yk- [AT f"(Ayk)At 1AT f'(Ayk)
0
Thus, the Newton method is affine invariant with respect to affine

transformation of variables. Therefore its real region of quadratic con-
vergence does not depend on a particular inner product. It depends only
on the local topological structure of function f (x).
Let us try to understand what was bad in our assumptions. The main
assumption we used is the Lipschitz continuity of Hessians:
II f"(x)- f"(y) II:S: M II x- Y II, Vx,y ERn.

Let us assume that f E 0 3 (Rn). Denote
f 111 (x)[u] = lim l[J"(x + au)- J"(x)].

a~oa
Note that the object in the right-hand side is an (n x n)-matrix. Then

our assumption is equivalent to
II f"'(x)[u]II:S: M II u II .
This means that at any point x E Rn we have
(i 111 (x)[u]v,v} :S: M II u II· II v 11 2 Vu,v ERn.

Note that the value in the left-hand side of this inequality is invariant
with respect to affine transformation of variables. However, the right-
hand side does not possess this property. Therefore the most natural
way to improve the situation is to find an affine-invariant replacement
for the standard norm II · II· The main candidate for such a replacement
is rather evident: That is the norm defined by the Hessian f" (x) itself,
namely,
II u IIJ''(x)= (f"(x)u,u) 112 •
This choice gives us the dass of self-concordant functions.
4.1.3 Definition of self-concordant function

Let us consider a closed convex function f (x) E 0 3 ( dom f) with open
domain. Let us fix a point x E dom f and a direction u E Rn. Consider
the function
1J(x; t) = f(x +tu),
as a function of variable t E dom 1J(x; ·) ~ R 1 . Denote
D f(x)[u] = 4J'(x; t) = (f'(x), u),
D 2 J(x)[u, u] = 4J"(x; t) = (f"(x)u, u) =II u ll}"(x)'

D 3 f(x)[u,u,u] = 1J111 (x;t) = (D 3f(x)[u]u,u).
DEFINITION 4.1.1 We call lunction I self-concordant il there exists a

constant MJ 2:: 0 such that the inequality
holds lor any x E dom I and u E Rn.

Note that we cannot expect these functions to be very widespread.
But we need them only to construct a barrier model of our problem. We
will see very soon that such functions are easy to be minimized by the
Newton method.
Let us point out an equivalent definition of self-concordant functions.
LEMMA 4 .1. 2 A lunction f is sell-concordant il and only il lor any
x E doml and any u1, u2, U3 ERn we have
3
I D 3 !(x)[u1, u2, u3] 1::; MJ IT II Ui llrcx) · (4.1.3)
i=l
We accept this statement without proof since it needs some special facts
from the theory of three-linear symmetric forms.
In what follows, we very often use Definition 4.1.1 in order to prove
that some I is self-concordant. On the contrary, Lemma 4.1.2 is useful
for establishing the properties of self-concordant functions.
Let us consider several examples.
EXAMPLE 4.1.1 1. Linear function. Consider the function
l(x) = a+ (a,x), domf =Rn.
Then
f'(x) = a, J"(x) = 0, j 111 (x) = 0,
and we conclude that M1 = 0.
2. Convex quadratic function. Consider the function
f(x) = a + (a, x} + ~(Ax, x}, domf =Rn,
where A =AT t 0. Then
f'(x) = a + Ax, f"(x) = A, / 111 (x) = 0,

and we conclude that M1 = 0.
3. Logarithmic barrier for a ray. Consider a function of one variable
f(x) = -lnx, domf = {x E R 1 I x > 0}.

Then
f'(x) = -~, f"(x) = ~' f"'(x) = -~.
Therefore f(x) is self-concordant with MJ = 2.
4. Logarithmic barrier for a second-order region. Let A = AT t 0.

Consider the concave function
if>(x) = a + {a, x} - !(Ax, x).
Define f(x) = -lnif>(x), with domf = {x ERn I if>(x) > 0}. Then
Dj(x)[u] = - t/>(~)[(a,u}- (Ax,u}],
D 2f(x)[u, u] = t/>2{x) [(a, u)- (Ax, u)J2 + tf>(~) (Au, u),
D 3 j(x)[u,u,u] = - t/> 3Cx)[(a,u)- {Ax,u)] 3
- tjl2(x)[{a,u)- (Ax,u}](Au,u}.
Derrote w1 = Df(x)[u] and w2 = t/>(~)(Au,u}. Then
D 2 f(x)[u, u] = w~ + w2 2 0,
I D 3 f(x)[u, u, u] I = l2w~ + 3wlw21 .
The only nontrivial case is w1 =f 0. Derrote a = w2jw~. Then
/D 3f(x)(u,u,u)l < 2/w1r+3/w1!w2 _ 2(1+ta) < 2

(D2J(x)[u,u])3 2 - (w 1 +w 2)3f2 - (l+a)3f2 - ·
Thus, this function is self-concordant and M 1 = 2.
5. lt is easy to verify that none of the following functions of one variable

is self-concordant:
f(x) = ex, f(x) = x1v, x > 0, p > 0, f(x) =I x IP, p > 2.

0
Let us Iook now at the main properties of self-concordant functions.

THEOREM 4.1.1 Let functions Ii be self-concordant with constants Mi,

i = 1, 2, and let a, ß > 0. Then the function f(x) = aft (x) + ßh(x) is
self-concordant with constant
MJ = max { .}aMt, .ßM2}

and domf = domftndom/2.
Proof: In view of Theorem 3.1.5, f is a closed convex function. Let us
fix some x E dom f and u E Rn. Then
I D 3 fi(x)[u, u, u]l~ Mi [ D 2 fi(x)[u, u] ] 3/2 , i = 1, 2.
Denote Wi = D 2 fi(x)[u, u] ;:=: 0. Then

ID 3 f(x)[u,u,uji < aiD 3 b(x)[u,u,u)i+ßiD3 b(x)[u,u,u]l
[D2 f(x)[u,u]]3 2 [aDl b (x)[u,u)+ßD2 f2(x)[u,u]]3/2
< ctM1w~/ 2 +ßM2w~/ 2

[aw1 +ßw2]3f 2
The right-hand side of this inequality does not change when we replace
(w1,w2) by (tw 1, tw2) with t > 0. Therefore we can assume that
aw1 +ßw2 = 1.
Denote e = O:Wt. Then the right-hand side of the above inequality
becomes equal to
*'e1 2 + ~(1- e) 312 , eE [o, 11.

This function is convex in e.
Therefore it attains its maximum at the
end points of the interval (see Corollary 3.1.1). D
CoROLLARY 4.1.1 Let function f be self-concordant with some constant

MJ. If A =AT~ 0, then the function
cf>(x) = a + (a, x) + ~(Ax, x) + j(x)
is also self-concordant with constant MI/I= MJ.
Proof: We have seen that any convex quadratic function is self-concor-
dant with the constant equal to zero. D
CoROLLARY 4.1.2 Let function f be self-concordant with some constant

M1 and a > 0. Then the function <P(x) = af(x) is also self-concordant
with the constant MI/J = )oMJ. 0
Let us prove now that self-concordance is an affine-invariant property.

THEOREM 4.1.2 Let A(x) = Ax + b: Rn -t Rm, be a linear operator.
Assurne that function J(y) is self-concordant with constant M1. Then
the function <P(x) = J(A(x)) is also self-concordant and MI/J = MJ·
Proof: The function <P(x) is closed and convex in view ofTheorem 3.1.6.
Let us fix some X E dom <P = {X : A(x) E dom!} and u E nn. Denote
y = A(x), v =Au. Then
D<P(x)[u] = (J'(A(x)), Au) = (J'(y), v),
D 2<P(x)[u, u] - (J"(A(x))Au, Au) = (J"(y)v, v),
D 3 <P(x)[u, u, u] = D 3 J(A(x))[Au, Au, Au] = D 3 f(y)[v, v, v].

Therefore
I D 3 <P(x)[u,u,u] I = I D 3 f(y)[v,v,v] I~ MJ(J"(y)v,v) 312
= MJ(D 2<P(x)[u, u]) 312 •

0
The next statement demonstrates that some local properties of a self-

concordant function reflect somehow the global properties of its domain.
THEOREM 4.1.3 Let function f be self-concordant. If dom f contains
no straight line, then the Hessian J"(x) is nondegenerate at any x from
domf.
Proof: Assurne that (J"(x)u,u) = 0 for some x E domfand u ERn,
u =/:- 0. Consider the points Ya = x + o:u E dom f and the function
'1/J(a) = (J"(y 0 )u, u).
Note that
Since '1/;(a) ~ 0, we conclude that '1/J'(O) = 0. Therefore this function is

a part of the solution of the following system of differential equations:
t/J'(a) = 2'1/;(a) 312 - e(a),
t/J(O) = e<o) = o, {
e'(a) = 0.
However, this system has a unique trivial solution. Therefore 'lj;(a) = 0

for all feasible a.
Thus, we have shown that the function </J(a) = f(Ya) is linear:
a>.
</J(a) = f(x) + (J'(x), Ya- x) + f f(J"(Yr )u, u)drd)..
0 0
= J(x) + a(J'(x), u).

Assurne that there exists ä such that Ya E fJ( dom f). Consider a se-
quence {ak} such that ak t ä. Then
Note that Zk E epi f, but z rt. epi f since Ya rt. dom f. That is a contradie-
tioll since function f is closed. Considering direction -u, and assuming
that this ray intersects the boundary, we come to a contradiction again.
Therefore we conclude that Ya E dom f for all a. However, that is a
contradiction with the assumptions of the theorem. D
Finally, Iet us describe the behavior of self-concordant function near

the boundary of its domain.
THEOREM 4.1.4 Let f be a self-concordant function. Then for any point
x E 8( dom f) and any sequence
{xk} C domf: Xk-+ x

we have f(xk) -+ +oo.
Proof: Note that the sequence {f(xk)} is bounded below:
f(xk) ~ f(xo) + (J'(xo), Xk- xo).

Assurne that it is bounded from above. Then it has a Iimit point f. Of
course, we can think that this is a unique Iimit point of the sequence.
Therefore
Zk = (xk, f(xk)) -+ z = (x, f).
Note that Zk E epi f, but z ~ epi f since x ~ dom f. That is a contra-
diction since function f is closed. D
Thus, we have proved that f (x) is a barrier function for cl (dom f)

(see Section 1.3.3).
4.1.4 Main inequalities

Let us fix some self-concordant function f(x). We assume that its
constant M 1 = 2 (otherwise we can scale it, see Corollary 4.1.2). We call
such functions the standard self-concordant. We assume also that dom f
contains no straight line (this implies that all f" (x) are nondegenerate,
see Theorem 4.1.3).
Denote:
1/ u llx = (J"(x)u,u) 112 ,
II v 11; = ([f"(x)tlv, v)l/2'
AJ(x) = ([f"(x)t 1f'(x), f'(x)) 112.

Clearly, I (v, u) 1::;11 v 11; · II u llx· We call II u llx the local norm of
direction u with respect to x, and AJ(x) =II f'(x) 11; is called the local
norm of the gradient J'(x). 4
Let us fix x E dom f and u E Rn, u =/= 0. Consider the function of one
variable
<fJ(t) = (f"(x+t~)u,u)l/2
with the domain dom</J = {t E R 1 : x +tuE domf}.
LEMMA 4 .1. 3 For alt feasible t we have I </J' (t) I::; 1.
Proof: Indeed,
"'-'( ) _ _ f'"(x+tu)[u,u,ul
'+' t - 2(! 11 (x+tu)u,u)3)2 ·
Therefore I <P'(t) 1::; 1 in view of Definition 4.1.1. 0
COROLLARY 4.1.3 Domain of function <jJ(t) contains the interval
(-cp(O), cp(O)).
Proof: Since f(x+tu) --+ oo as x+tu approaches the boundary of dom f

(see Theorem 4.1.4), the function (f"(x + tu)u, u} cannot be bounded.
Therefore dom cp ={
t I c/J( t) > 0}. It remains to note that
<P(t) ~ </J(ü)- I t I
in view of Lemma 4.1.3. 0
4 Sometimes AJ(X) is called the Newton decrement of function f at x.

Let us consider the following ellipsoid:

W 0 (x;r) = {y ERn I II y- X llx< r},
W(x; r) = cl (W 0 (x; r)) ={y ERn I II y- x llxS r}.
This ellipsoid is called the Dikin ellipsoid of function f at x.
THEOREM 4.1.5 1. For any XE domf we have W 0 (x; 1) ~ domf.
2. For all x, y E dom f the following inequality holds:
I y- X II Y-> l+liy-xll.,.
lly-xll.,
3. If II y- x llx< 1, then
II y- X
I Y-< 1-ily-xl!.,'
lly-xll., (4.1.5)
Proof: 1. In view of Corollary 4.1.3, dom f contains the set

{y = X + tu I t 2 I u II; < 1}
(since </>(0) = 1/ II u llx)· That is exactly W 0 (x; 1).
2. Let us choose u = y- x. Then
4>( 1) = IIY!xlly' 4>(0) = iiy!xu.,'

and </>(1) :$ 4>(0) + 1 in view of Lemma 4.1.3. That is (4.1.4}.
3. lf II y- x llx< 1, then 4>(0) > 1, and in view of Lemma 4.1.3
4>(1} 2:: 4>(0}- 1. That is (4.1.5}. 0
THEOREM 4.1.6 Let XE domf. Then for any y E W 0 (x; 1} we have
(1- II Y- x llx) 2 f"(x) ~ f"(y) ~ (t-IIY_:xll.,)2 f"(x). (4.1.6)
Proof: Let us fix some u E Rn, u ::j:. 0. Consider the function

1/;(t) = (f"(x + t(y- x))u,u), t E [0, 1].
Denote Yt = x + t(y- x). Then, in view of Lemma 4.1.2 and (4.1.5}, we
have
11/J'(t) I = I D 3 f(YtHY- x,u,u] IS 211 y- X IIYtll u ll;t
Therefore
2{ln(1- t II y- x llx))'::; (ln1f.l(t))'::; -2(ln(1- t II y- X llx))'.

Let us integrate this inequality in t E [0, 1]. We get:
(1- II Y- X llx) 2 ~ ~f~~ ~ (1-lly:xiJx)2 ·
That is exactly (4.1.6). 0
ÜOROLLARY 4.1.4 Let x E domf and r =II y- x llx< 1. Then we can

estimate the matrix
J
1
G= f"(x + T(y- x))dT
0
as follows:
(1- r + r32 )f"(x) ~ G ~ l~rf"(x).
Proof: Indeed, in view of Theorem 4.1.6 we have
1 1
G = I f"(x + T(y- x))dT?: f"(x) · I(1- Tr) 2dT
0 0
= (1 - r + ~r 2 )f"(x),
1
G -< f"(x) ·I (l!;r)2 = l~rf"(x).
0
0
Let us look again at the most important facts we have proved.

• At any point x E dom f we can point out an ellipsoid
W 0 (x; 1) = {x E Rn I (f"(x)(y- x), y- x)) < 1},

b~longing to dom f.
• Inside the ellipsoid W(x; r) with r E [0, 1) function f is almost
quadratic:
(1- r) 2 f"(x) ~ f"(y) ~ (l!r)2f"(x)
for all y E W (x; r). Choosing r small enough, we can make the
quality of the quadratic approximation acceptable for our goals.
These two facts form the basis for almost all consequent results.
We conclude this section with the results describing the variation of
a self-concordant function with respect to a linear approximation.
THEOREM 4.1. 7 For any x, y E dom f we have
(! '( Y) - J'( X )'Y - X

) -> I+JJy-xJJx'
JJy-xl!~ (4.1. 7}
+ (f'(x), Y- x) + w(ll Y- x llx},

f(y) ~ f(x) (4.1.8}
where w(t) = t -ln(l + t).
Proof: Denote Yr = x + T(y- x), T E [0, 1], and r =II y- x llx· Then,
in view of {4.1.4) we have
1
(f'(y)- f'(x), y- x) f(f"(yr)(y- x), y- x)dT
0
1 2 r 2
2: J (l_;rr)2dT = r 0J (l~t)2dt =
0
1~r·
Further, using (4.1.7), we obtain
1
f(y)- f(x)- (f'(x),y- x) = f(f'(yr)- f'(x),y- x)dT
0
1
= f ~(f'(Yr)- f'(x),yr- x)dT
0
> Jl IIYr-xlli d _ J1 rr 2 d
0 r(l+JJyr-xJJx) T - 0 l+rr T
r
= J f~t = w(r).
0
0
THEOREM 4.1.8 Let x E dom f and II y- x llx< 1. Then
(! '( Y) - J'( X )'Y - X ) -< 1-JJy-xJix'

JJy-xl!~ (4.1.9}
f(y) S:: f(x) + (f'(x), Y- x) + w*(ll y- x llx), (4.1.10}

where w.(t) = -t -ln(l- t).
Proof: Derrote Yr = x + T(y- x), T E [0, 1], and r =II y- X llx· Since
II Yr- x II< 1, in view of (4.1.5) we have
1
(f'(y)-J'(x),y-x) = I(f"(yr)(y-x),y-x)dT
0
r
= r I0
1
$ I r2 d
(1-rrF 7
1 d
(1-t)2 t = 1-r
r2
·
0
Further, using (4.1.9), we obtain

1
J(y)- f(x)- (f'(x), y- x) = I(f'(Yr)- f'(x), y- x)dT
0
1
I ~(J'(Yr)- f'(x),Yr- x)dT
0
r
= I f~tt = w.(r).
0
D
THEOREM 4.1.9 /nequalities (4.1.4}, (4.1.5), (4.1. 7}, (4.1.8}, (4.1.9}

and (4.1.1 0) are necessary and sufficient characteristics of standard self-
concordant functions.
Proof: We have justified two sequences of implications:
Definition 4.1.1 => (4.1.4) => (4.1.7) => (4.1.8),
Definition 4.1.1 => (4.1.5) => {4.1.9) => {4.1.10).

Let us prove the implication (4.1.8) => Definition 4.1.1. Let x E dom f
and x - au E dom f for a E [0, E). Consider the function
1/J(a) = f(x- au), a E [0, ~:).

=
Denote r = llullx [cp"(o)Jl/ 2 • Assuming that {4.1.8} holds for all x and
y from dom f, we have
'1/J(a)- '1/J(O)- '1/J'(O)a- !'l/J"(O)a2 ~ w(ar)- !a2 r 2 •

Therefore
~'1/J/11 (0) = lim ['1/J(a)- '1/J(O)- '1/J'(O)a- !'l/J"(O)a2 ]
a,l.O
> lim ~ [w(ar)- !a2 r 2]

a,l.O a:
= lim
a,I.O
p0
[w'(ar)- ar]
Thus, D3 f(x)[u,u,u] = -'1/J"(O) ~ 'l/J111 (0) ~ 2['1/J"(O)j31 2 and that is

Definition 4.1.1 with MJ = 2. Implication (4.1.10) => Definition 4.1.1
can be proved by a similar reasoning. D
The above theorems are written in terms of two auxiliary functions

w(t) = t -ln(1 + t) and w*(r) = -T -ln(1- r). Note that
t .
w 1(t) = 1+t 2 0, W "(t) = (1+t)2
1 > 0'
w~(r) = 1 ~r 2 0, w~(r) = c1 !r)2 > 0.

Therefore, w(t) and w*(r) are convex functions. In what follows we often
use different relations between these functions. Let us fix this notation
for future references.
LEMMA 4.1.4 For any t ~ 0 and TE [0, 1) we have
w'(w~(r)) = r, w~(w'(t)) = t,
w(t) = max [€t- w*(€)], w*(r) = max[€r- w(€)],

0~{<1 {~0
w(t) + w.(r) 2 rt,

w*(r) = rw~(r)- w(w~(r)), w(t) = tw'(t)- w*(w'(t)).
We leave the proof of this Iemma as an exercise for the reader. For
an advanced reader we should note that the only reason for the above
relationsisthat functions w(t) and w*(t) are conjugate.
Let us prove two more inequalities.
THEOREM 4.1.10 For any x and y from Q we have

f(y) ~ f(x) + (f'(x), Y- x) + w(llf'(y)- f'(x)ll;). (4.1.11)
If in addition llf'(y)- f'(x)ll; < 1, then
f(y) ::; f(x) + (f'(x), Y- x) + w*(llf'(y)- f'(x)ll;). (4.1.12)
Proof: Let us fix an arbitrary x and y from Q. Consider the function

<P(z) = f(z)- (f'(x), z), z E Q.
Note that this function is self-concordant and <P'(x) = 0. Therefore,
using inequality (4.1.10) we get
f(x)- (f'(x),x) = </J(x) = min</J(z)
zEQ
< min[<!J(y)
zEQ
+ (<!J'(y), z- y) + w*(llz- yjjy)]
= </J(y)- w(II<P'(y)ll;)
= f(y) - (f'(x), y) - w(llf'(y) - f'(x) II;),

and that is (4.1.11). In order to prove inequality (4.1.12) we use a similar
reasoning with (4.1.8). D
4.1.5 Minimizing the self-concordant function

Let us consider the following minimization problem:
min{f(x) Ix E domf}. (4.1.13)
The next theorem provides us with a sufficient condition for existence of
its solution. Recall that we assume that f is a standard self-concordant
function and dom f contains no straight line.
THEOREM 4 .1.11 Let >.1 (x) < 1 for some x E dom f. Then the solution
of problern (4.1.13), xj, exists and is unique.
Proof: Indeed, in view of (4.1.8), for any y E domf we have
f(y) > f(x) + (f'(x), Y- x) + w(ll Y- x llx)
> f(x)- II f'(x) u;. 'II y- X llx +w(ll y- X llx)
= f(x)- AJ(x)· II Y- X llx +w(ll Y- X llx)·
Therefore for any y E .Cj(J(x)) = {y ERn I f(y) ~ f(x)} we have
lly!xllx w(ii Y- X iix) ~ Aj(X) < 1.

Note that the function tw(t) = 1- t ln(1 + t) is strictly increasing in t.
Hence, II y- x llx~ f, where f is a unique positive root of the equation
(1- AJ(x))t = ln(1 + t).
Thus, CJ(f(x)) is bounded and therefore xj exists. It is unique since in

view of {4.1.8) for all y E domf we have
f(y) ~ f(xj) + w(ll Y- xj llxj)·

0
Thus, we have proved that a local condition AJ(x) < 1 provides us

with some global information on function f, that is the existence of
the minimum xj. Note that the result of Theorem 4.1.11 cannot be
strengthened.
EXAMPLE 4.1.2 Let us fix some E > 0. Consider a function of one
variable
ff(x} =EX -lnx, x > 0.
This function is self-concordant in view of Example 4.1.1 and Corol-
lary 4.1.1. Note that
f: (X) = E - ~, f:' = ~.
Therefore AJ,(x) =11- EX I· Thus, for E = 0 we have AJ0 (x) = 1 for any
x > 0. Note that the function fo is not bounded below.
If E > 0, then xj. = ~· Note that we can recognize the existence of
the minimizer at point x = 1 even if E is arbitrary small. D
Let us consider now a scheme of the damped Newton method:
Damped Newton method
(4.1.14)
0. Choose xo E dorn f.
1. Iterate Xk+l = Xk- l+A~(xk) [f"(xk)]- 1f'(xk), k ~ 0.
THEOREM 4.1.12 For any k 2 0 we have

(4.1.15}
Proof: Denote >. = AJ(Xk)· Then II Xk+I - Xk llx= 1 ~..\ = w'(>.).

Therefore, in view of (4.1.10) and Lemma 4.1.4, we have
.>.2
= f(xk)- 1+.>. + w*(w'(>.))
= f(xk)- >.w'(>.) + w*(w'(>.)) = f(xk)- w(>.).
D
Thus, for all x E domf with AJ(x) 2 ß > 0 one step of the damped
Newton method decreases the value of f(x) at least by a constant w(ß) >
0. Note that the result of Theorem 4.1.12 is global. It can be used to
obtain a global efficiency estimate of the process.
Let us describe now the local convergence of the standard Newton
method:
Standard Newton method
(4.1.16)
0. Choose xo E dom f.
1. Iterate Xk+l = Xk- [f"(xk)]- 1 /'(xk), k 2 0.
Note that we can measure the convergence of this process in different

ways. We can estimate the rate of convergence for the functional gap
f(xk)- f(xj), or for the local norm ofthe gradient AJ(xk) =II f'(xk) ll;k,
or for the local distance to the minimum II Xk - xj llxk. Finally, we can
look at the distance to the minimum in a fixed metrics
defined by the minimum itself. Let us prove that locally all these mea-
sures are equivalent.
THEOREM 4.1.13 Let AJ(x) < 1. Then

w(A.J(x)}::; f(x}- f(xj)::; w*(A.J(x}}, {4.1.17}
(4.1.18}
w(r*(x)) ::; f(x)- f(xj)::; w*(r*(x)}, (4.1.19}

where the last inequality ·is valid for r * (x) < 1.
Proof: Denote r =II x- xj llx and ).. = AJ(x}. Inequalities (4.1.17}
follow from Theorem 4.1.10. Further, in view of (4.1.7} we have
1~r::; (J'(x),x- xj)::; .Ar.

That is the right-hand side of inequality (4.1.18). If r ~ 1 then the
left-hand side of this inequality is trivial. Suppose that r < 1. Then
f'(x) = G(x- xj) with
J
1
G= J"(xj + T(x- xj)}dT,

0
and
A.j(x) = ([J"(x)t 1 G(x- xj}, G(x- xj)) ::;11 H 11 2 r 2 ,
where H = [f"(x)J- 112 G[f"(x)J- 112 • In view of Corollary 4.1.4, we have
- -
G -< 1-f"(x).
1-r
Therefore II H 11::; 1 ~r and we conclude that
A.}(x) ::; ( 1 ~~)2 = (w~(r)) 2 .
Thus, AJ(x) ::; w~(r}. Applying w'(·) to both sides, we get the remairring
part of (4.1.18).
Finally, inequalities (4.1.19) follow from (4.1.8) and (4.1.10). 0
Let us estimate the local rate convergence of the standard Newton

method (4.1.16). It is convenient to do that in terms of >..1(x), the local
norm of the gradient.
THEOREM 4.1.14 Let x E domf and AJ(x) < 1. Then the point
x+ = x- [f"(x)t 1 f'(x)
belongs to dom f and we have

, (
Af
)
X+ ::=;
( ,x 1 (x)
1->.,(x)
)2 '
Proof: Denote p =X+- x, >. = >.t(x). Then II p llx= >. < 1. Therefore
x+ E domf (see Theorem 4.1.5). Note that in view of Theorem 4.1.6,
>-t(x+) = ([/"(x+)t 1 f'(x+), /'(x+)) 112
~ 1-liPII:r II f'(x+) llx= 1~-X II f'(x+) llx ·

Further,
f'(x+) = f'(x+)- f'(x)- f"(x)(x+- x) = Gp,
1
where G = J[f"(x + rp)- f"(x)]dr. Therefore
0
II !'(x+) II;= ([J"(x)t 1 Gp, Gp) ::;II H 11 2 . II p II;,

where H = [f"(x)t 112 G[f"(x)t 112 • In view of Corollary 4.1.4,
(->. + k>. 2 )f"(x) j G j 1 ~_xf"(x).
Therefore II H 11:: ; max:{t~_x,>.- k>- 2 } = 1 ~_x, and we conclude that
>.](x+) :::; (l!.x)2 II f'(x+) lli:::; c 1 ~~)4 •

0
Theorem 4.1.14 provides us with the following description of the region

of quadratic convergence of scheme (4.1.16):
>.t(x) < >. = 3_ yg = 0.3819 ... ,

2
where ); is the root of the equation (1 _\)2

= 1. In this case we can
guarantee that >.J(x+) < >.J(x).
Thus, our results Iead to the following strategy for solving the initial
problern (4.1.13).
• First stage: >.t(xk) ~ ß, where ß E (0, 3;). At this stage we apply the
damped Newton method. At each iteration of this method we have
Thus, the number of steps of this stage is bounded:

N ~ wfßy[f(xo)- f(xi)].
• Second stage: AJ(Xk) ::; ß. At this stage we apply the standard

Newton method. This process converges quadratically:
\ ( ) ( AJ(Xk) ) 2 ß>.J(Xk) \ ( )
Af Xk+l ::; 1->.f(Xk) ::; (l-ß)2 < Af Xk •
It can be shown that the local convergence of the damped Newton

method (4.1.14) is also quadratic:
(4.1.20)
However, we prefer to use the above switching strategy since it gives

better complexity bounds. Relation (4.1.20) can be justified in the same
way as it was clone in Theorem 4.1.14. We leave the reasoning as an
exercise for the reader.
4.2 Self-concordant barriers

(Motivation; Definition of self-concordant barriers; Main properties; Standard
minimization problem; Central path; Path-following method; How to initialize
the process? Problems with functional constraints.)
4.2.1 Motivation
In the previous section we have seen that the Newton method is
very efficient in minimizing a standard self-concordant function. Such
a function is always a barrier for its domain. Let us check what can
be proved about the sequential unconstrained minimization approach
(Section 1.3.3), which uses such barriers.
In what follows we deal with constrained minimization problems of
special type. Denote Dom f = cl (dom f).
DEFINITION 4.2.1 We call a constrained minimization problern stan-
dard if it has the form
min{ (c, x) I x E Q}, (4.2.1}

where Q is a closed convex set. We assume also that we know a self-
concordant function f such that Dom f = Q.
Let us introduce a parametric penalty function
f(t; x) = t(c, x) + f(x)
with t 2:: 0. Note that f(t; x) is self-concordant in x (see Corollary 4.1.1).
Denote
x*(t) =arg min f(t; x).
xEdom/
This trajectory is called the centrat path of the problern (4.2.1). Note
that we can expect x*(t) ---+ x* as t ---+ oo (see Section 1.3.3}. Therefore
we are going to follow this trajectory.
Recall that the standard Newton method, as applied to minimization
of function f(t; x), has a local quadratic convergence (Theorem 4.1.14}.
Moreover, we have an explicit description of the region of quadratic
convergence:
- ~
)..f(t;·)(x)~ß<>.= -2 ·
Let us study our possibilities assuming that we know exactly x = x*(t}
for some t > 0.
Thus, we are going to increase t:
t+ = t + ~' ~ > 0.
However, we need to keep x in the region of quadratic convergence of
the Newton method for function f(t + ~; ·):
)..f(t+.6;·)(x) ~ ß <.X.
Note that the update t---+ t+ does not change the Hessian of the barrier
function:
f"(t + ~;x} = f"(t;x).
Therefore it is easy to estimate how can be big the step ~. lndeed, the
first order optimality condition provides us with the following centrat
path equation:
tc + f'(x*(t)) = 0. (4.2.2)
Since tc + J'(x) = 0, we obtain
)..f(t+Ä;·)(x) =II t+c + f'(x) llx= ~ II c llx= ~ II f'(x) llx~ ß.

Hence, if we want to increase t at a linear rate, we need to assume that
the value
>.}(x) =II f'(x) II~= ([/"(x)t 1 f'(x), f'(x))
is uniformly bounded on dom f.
Thus, we come to a definition of self-concordant barrier.
4.2.2 Definition of self-concordant barriers

DEFINITION 4.2.2 Let F(x) be a standard self-concordant function. We
call it a v-self-concordant barrier for set Dom F, if
sup [2(F'(x), u)- (F"(x)u, u)] ~ v (1,.2.3}

uER"
for all x E dom F. The value v is called the parameter of the barrier.
Note that we do not assume F"(x) tobe nondegenerate. However, if

this is the case, then the inequality (4.2.3) is equivalent to
([F"{x)t 1F'(x), F'(x)} ~ v. (4.2.4)

We will use also another equivalent form of inequality (4.2.3):
(F'(x), u} 2 ~ v(F"(x)u, u} 't:/u ERn. {4.2.5)
(To see that for u with (F"(x)u, u} > 0, replace u in (4.2.3) by .Xu and
find the maximum of the left-hand side in .X.) Note that the condition
(4.2.5) can be written in a matrix notation:
F"(x) t tF'(x)F'(xf. {4.2.6)
Let us check now which self-concordant functions given by Exam-

ple 4.1.1 arealso self-concordant barriers.
EXAMPLE 4.2.1 1. Linear function: f(x) = a + (a,x}, domf = Rn.
Clearly, for a =/::. 0 this function is not a self-concordant barrier since
f"(x) = 0.
2. Convex quadratic function. Let A = AT >- 0. Consider the function
f(x) = a + (a,x) + ~(Ax,x), domf =Rn.
Then f'(x) = a + Ax and f"(x) = A. Therefore
([f(x)J- 1 f'(x), f'(x)) = (A- 1 (Ax- a), Ax- a)
= (Ax,x)- 2(a,x) + (A- 1 a,a).

Clearly, this value is unbounded from above on Rn. Thus, a quadratic
function is not a self-concordant barrier.
3. Logarithmic barrier for a ray. Consider the following function of one
variable:
F(x) = -lnx, domF = {x E R 1 I x > 0}.

Then F'(x) = -~ and F"(x} = ~ > 0. Therefore
~ 1 2
F"fX} = X2" • X = 1.
Thus, F(x) is a v-self-concordant barrier for {x > 0} with v = 1.
4. Logarithmic barrier for a second-order region. Let A = AT >- 0.

Consider the concave quadratic function
cp(x) = a + (a, x) - ~(Ax, x).
Define F(x) = -lncp(x), domf = {x ERn I cjJ(x) > 0}. Then
(F'(x),u} = -rt>lx)[(a,u}- (Ax,u)],
(F"(x}u,u} = r/> 2tx}[(a,u}- (Ax,u}]2 + rt>(~)(Au,u).
Denote WI = (F'(x),u) and w2 = rt>lx)(Au,u). Then
(F"(x)u, u) = w~ + w2 ~ wi.
Therefore 2(F'(x), u}- (F"(x}u, u} =::; 2w1- w~ =::; 1. Thus, F(x) is a
v-self-concordant barrier with v = 1. 0
Let us present some simple properties of self-concordant barriers.
THEOREM 4.2.1 Let F(x) be a self-concordant barrier. Then the func-

tion (c, x} + F(x)is a self-concordant function on domF.
Proof: Since F(x) is a self-concordant function, we just apply Corol-

lary 4.1.1. D
Note that this property is important for path-following schemes.
THEOREM 4.2.2 Let Fi be Vi-self-concordant barriers, i = 1, 2. Then

the function
is a self-concordant barrier for convex set Dom F = Dom F1 nDom F2

with the parameter v = v1 + v2.
Proof: In view of Theorem 4.1.1, Fis a standard self-concordant func-

tion. Let us fix x E dom F. Then
max [2(F'(x)u, u}- (F"(x)u, u)]
ueRn
- max [2(F{(x)u, u}- (F{'(x)u, u) + 2(F{(x)u, u}- (F{'(x)u, u)]

uERn
< max [2(F{(x)u,u)- (F{'(x)u,u)]

uERn
+ uERn
max [2(F2(x)u, u)- (F2'(x)u, u)] ~ v1 + v2.
0
Finally, let us show that the value of a parameter of a self-concordant

barrier is invariant with respect to affine transformation of variables.
THEOREM 4.2.3 Let A(x) = Ax + b be a linear operator, A(x) :Rn --t
Rm. Assurne that function F(y) is a v-self-concordant barrier. Then the
function q>(x) = F(A(x)) is a v-self-concordant barrier for the set
Domq_) = {x ERn I A(x) E DomF}.
Proof: Function g_}(x) is a standard self-concordant function in view of
Theorem 4.1.2. Let us fix x E domg_}, Then y = A(x) E domF. Note
that for any u E Rn we have
(q>'(x), u) = (F'(y), Au), (q>"(x)u, u) = (F"(y)Au, Au).
Therefore
max [2(<I>'(x), u)- (<P"(x)u, u)]
uERn
= max [2(F'(y), Au)- (F"(y)Au, Au)]

uERn
~ max [2(F'(y), v)- (F"(y)v, v)] ~ v.

veRm
0
4.2.3 Main inequalities

Let us show that the local characteristics of a self-concordant barrier
(the gradient and the Hessian) provide us with global information about
the structure of the domain.
THEOREM 4.2.4 1. Let F(x) be a v-self-concordant barrier. For any x
and y from dom F, we have
(F'(x),y- x} < v. (4.2. 7)
Moreover, if (F'(x), y- x) 2': 0, then
(F'( Y) - F'( X )'Y - X

)> (F'Cxt·rx>2
- v-(F' x ,y-x) · (4.2.8)
2. A standard self-concordant function F(x) is a v-self-concordant bar-

rier if and only if
F(y) 2:: F(x)- v In ( 1- t(F'(x), y- x)) Vx, y E domF. (4.2.9)
Proof: 1. Let x,y E domF. Consider the function
<jJ(t) = (F'(x+t(y-x)),y-x), t E [0,1].
If </J(O) :::; 0, then (4.2. 7) is trivial. If </J(O) = 0, then (4.2.8) is trivial.

Suppose that </J(O) > 0. Note that in view of (4.2.5) we have
<P'(t) = (F"(x + t(y- x))(y- x), y- x)
Therefore <jJ(t) increases and it is positive for t E [0, 1]. Moreover, for
any t E [0, 1] we have
-itJ + q,?o) 2': tt.

This implies that (F'(x),y- x) = <P(O) < y for all t E [0,1). Thus,
(4.2.7) is proved. Moreover,
</J(t) - </J(O) 2': v~~~(b) - </J(O) = vt~~~t~), t E [0, 1).

Taking t = 1, we get (4.2.8).
2. Denote 'lj;(x) = e-~F(x). Then
'1/l'(x) = -te-~F(x) · F'(x),
'1/J"(x) -te-~F(x) [F"(x)- tF'(x)F'(xf] .
Thus, in view of Theorem 2.1.4 and definition (4.2.6), function 'ljJ(x) is

concave if and only if the function F(x) is a v-self-concordant barrier.
It remains to note that (4.2.9) is the same as
'1/J(y):::; 'lj;(x) + ('1/J'(x),y -x)

up to a logarithmic transformation of both sides of the inequality. 0
THEOREM 4.2.5 Let F(x) be a v-self-concordant barrier. Then for any

x E dom F and y E Dom F such that
(F'(x),y- x} ~ 0, (4.2.10}
we have
II Y- X llx~ V+ 2y'i/. (4.2.11}
Proof: Denote r =II y - x llx· Let r > .,fii. Consider the point
Ya = x + a(y - x) with a = 4
< 1. In view of our assumption
(4.2.10) and inequality (4.1.7) we have
w =(F'(Ya),y- x} > (F'(Ya)- F'(x),y- x}
= ~(F'(Ya)- F'(x), Ya- x}
> l . IIYo -xll~

a l+IIYo-xll~
= l+a
al!rxll = ~
2
ly-x]x I+Tv'
On the other hand, in view of (4.2.7), we obtain
(1- a)w = (F'(y 0 ),y- Ya} ~ v.
Thus,
( 1-~)~
r l+vfv <v
- '
and that is exactly (4.2.11). D
We conclude this section by studying the properties of one special

point of a convex set.
DEFINITION 4.2.3 Let F(x) be a v-self-concordant barrier for the set
Dom F. The point
xP, =arg min F(x),
xEdomF
is called the analytic center of convex set Dom F, generated by the barrier
F(x).
THEOREM 4.2.6 Assurne that the analytic center of a v-self-concordant
barrier F(x) exists. Then for any x E Dom F we have
II X - Xp llxj,. ~ V + 2..jV.
On the other hand, for any x E Rn such that II x - Xp llx;, ~ 1 we have

x E DornF.
Proof: The first staternent follows frorn Theorem 4.2.5 since F'(x}) =
0. The second staternent follows frorn Theorem 4.1.5. D
Thus, the asphericity of the set Dorn F with respect to x}, cornputed
in the metric II · llxj.., does not exceed v + 2yfi/. It is well known that for
any convex set in Rn there exists a metric in which the asphericity of this
set is less than or equal to n (John Theorem). However, we rnanaged to
estirnate the asphericity in terrns of the parameter of the barrier. This
value does not depend directly on the dirnension of the space.
Note also, that if Dorn F contains no straight line, the existence of x}
irnplies the boundedness of DomF. (Since then F"(x}) is nondegener-
ate, see Theorem 4.1.3).
COROLLARY 4.2.1 Let DomF be bounded. Then for any x E dornF
and v E Rn we have
II v 11;::; (v + 2y'v) II v 11;;, .

Proof: By Lemma 3.1.12 we get the following representation:
II v 11;= ([F"(x)r 1v,v) 112 = rnax{(v,u) I (F"(x)u,u) ~ 1}.
On the other hand, in view of Theorem 4.1.5 and Theorem 4.2.6, we

have
B =
{y ERn I II y- X llx::; 1} ~Dorn F
~ {y E Rn I II Y - Xp llx::; ll + 2y0} := B* ·
Therefore, using again Theorem 4.2.6, we get the following:
II v 11; = rnax{(v,y- x) I y E B} ~ rnax{(v,y- x) I y E B*}
= (v,x}- x) + (v + 2yfi/) II v 11;•.

F
Note that II v II;= !I -v 11;. Therefore we can assurne (v, x}- x) ~ 0. D
4.2.4 Path-following scheme

Now we are ready to describe a barrier model of the rninirnization
problern. This is the standard rninirnization problern
rnin{ (c, x) I x E Q} (4.2.12)
with bounded closed convex set Q = DomF, which has nonempty inte-
rior, and which is endowed with a v-self-concordant barrier F(x).
Recall that we are going to solve (4.2.12) by tracing the central path:
x*(t) =arg min f(t; x), (4.2.13)

xEdomF
where f(t; x) = t(c, x) + F(x) and t ;:::: 0. In view of the first-order

optimality condition, any point of the central path satisfies equation
tc+ F'(x*(t)) = 0. (4.2.14)
Since the set Q is bounded, the analytic center of this set, x'F, exists and
x*(O) = x}. (4.2.15)
In order to follow the central path, we are going to update the points,
satisfying an approximate centering condition:
AJ(t;·)(x) =II f'(t; x) II;= I tc + F'(x) 11;~ ß, (4.2.16)
where the centering parameter ß is small enough.

Let us show that this is a reasonable goal.
THEOREM 4.2. 7 For any t > 0 we have
(c, x* (t)) - c* ~ !f, (4.2.17}
where c* is the optimal value of (4. 2.12). If a point x satisfies the cen-
tering condition (4.2.16), then
(c ' x)- c* < l

- t
(v + (ß+vvlß).
1-ß
(4.2.18)
Proof: Let x* be a solution to (4.2.12). In view of (4.2.14) and (4.2.7)

we have
(c, x* (t) - x*) = t (F' (x* (t) ), x* - x* (t)) ~ ![.

Further, let x satisfy (4.2.16). Denote ,\ = AJ(t;·)(x). Then
t(c,x-x*(t)) (J'(t;x)- F'(x),x- x*(t))
< (.\ + JV) II X- x*(t) llx

< (.\ + V'v)-L <
1-A -
V
(ß+Vvlß
1-ß
in view of (4.2.4), Theorem 4.1.13 and (4.2.16). 0
Let us analyze now one step of a path-following scheme. Namely,

assume that x E dom F. Consider the following iterate:
(4.2.19)
THEOREM 4.2.8 Let X satisfy (4.2.16):
II tc + F'(x) 11;~ ß
with ß < 5. = 3-l'g. Then for 1, such that
I I I~ 111- ß, (4.2.20}
we have again I t+c + F'(x+) 11;~ ß.

Proof: Denote .Ao = II tc + F'(x) 11;~ ß, AI = II t+c + F'(x) 11; and
A+ = II t+c + F'(x+) 11;+. Then
and in view of Theorem 4.1.14 we have
A+ ~ (l~~J2 =[w~(.AI)j2.
It remains to note that inequality (4.2.20) is equivalent to
(recall that w'(w~(r)) = r, see Lemma 4.1.4). 0
Let us prove now that the increase of t in the scheme (4.2.19) 1s

suffi.ciently !arge.
LEMMA 4.2.1 Let x satisfy (4.2.16). Then
II c 11;~ f(ß + JV). (4.2.21)

Proof: Indeed, in view of {4.2.16) and (4.2.4), we have

t II c 11;=11 f'(t; x)- F'(x) 11;~11 f'(t; x) 11; + II F'(x) 11;~ ß + JV.
0
Let us fix now some reasonable values of parameters in the scheme

(4.2.19). In the rest of this chapter we always assume that
ß -- gl
1 'V -
I - l+...[ß- ß --
_:fl_ 5
35• (4 •2•22)
We have proved that it is possible to follow the central path, using the
rule (4.2.19). Note that we can either increase or decrease the current
value of t. The lower estimate for the rate of increasing t is
t+ ~ (1 + 4+3~\fV). t,
and the upper estimate for the rate of decreasing t is
t+ ~ ( 1 - 4+3~\fV) . t.
Thus, the general scheme for solving the problern (4.2.12) is as follows.
Main path-following scheme
0. Set t 0 = 0. Choose an accuracy e > 0 and x 0 E dom F

suchthat
II F'(xo) n;o~ ß.
1. kth iteration (k ~ 0). Set (4.2.23)
2. Stop the process if e tk ~ v + (ßi!glß.
Let us give a complexity bound for the above scheme.

THEOREM 4.2.9 The scheme (4.2.23} terminates no more than ajter N
steps, where
Moreover, at the moment of termination we have (c,xN}- c* S €.
Proof: Note that ro =

II xo - x} llxoS ~ {see Theorem 4.1.13}.
Therefore, in view of Theorem 4.1.6 we have
~
h -II
- c II*xo-
< -1-ro
1- II c II*xj;.-< 1..=1L
1=211 II c II*xj;. ·
-y{l 2ß) ( )k-1

Thus, tk 2:: ( 1 -ß~lcll;. 1 + ß1JV for all k 2:: 1. 0
F
Let us discuss now the above complexity estimate. The main term in
the complexity is
vl/cl/*.
7. 2.,fo In ---fL.
Note that the value V II c II;. estimates the variation of the linear func-
F
tion (c, x} over the set Dom F (see Theorem 4.2.6). Thus, the ratio
can be seen as a relative accuracy of the solution.

The process (4.2.23) has one serious drawback. Sometimes it is diffi-
cult to satisfy its starting condition
In such cases we need an additional process for finding an appropriate

starting point. We analyze the corresponding strategies in the next
section.
4.2.5 Finding the analytic center

Thus, our goal now is to find an approximation to the analytic center
of the set Dom F. Let us look at the following minimization problem:
min{F{x} Ix E domF}, {4.2.24)
where F is a v-self-concordant barrier. In view of the needs of the

previous section, we have to find an approximate solution x E dom F of
this problem, which satisfies inequality
II F'(x) llis ß,
for certain ß E {0, 1}.
In order to reach our goal, we can apply two different mmimiza-
tion schemes. The first one is a Straightforward implementation of the
damped Newton method. And the second one is based on path-following

approach.
Consider the first scheme.
Damped Newton method for analytic centers
0. Choose yo E domF.
1. kth iteration (k 2:: 0). Set (4.2.25)

_ _ [F"(Yk)]- 1 F'(UkJ
Yk+l - Yk l+IIF'(yk)IIYk •
2. Stop the process if I F' (Yk) IIZ,. ~ ß.
THEOREM 4.2.10 The process (4.2.25) terminates no later than after

w(lß) (F(yo)- F(xF)) iterations.
Proof: Indeed, in view of Theorem 4.1.12, we have

F(Yk+l) :S F(yk)- w(>.F(Yk)) :S F(yk)- w(ß).
Therefore F(yo)- k w(ß) 2:: F(yk) 2:: F(xP,). 0
The implementation of the path-following approach is a little bit more

complicated. Let us choose some Yo E dom F. Define the auxiliary
centrat path as follows:
y*(t) =arg min

yEdomF
[-t(F'(yo), y) + F(y)],
where t 2:: 0. Note that this trajectory satisfies the equation
F'(y*(t)) = tF'(yo). (4.2.26)

Therefore it connects two points, the starting point y0 and the analytic
center xF:
y*(1) = Yo, y*(O) = xF.
We can follow this trajectory by the process (4.2.19) with decreasing t.
Let us estimate the rate of convergence of the auxiliary central path
y* (t) to the analytic center.
LEMMA 4.2.2 For any t ~ 0 we have
II F'(y*(t)) ll;•(t)::; (v + 2Ji/) II F'(xo) 11;f. ·t.

Proof: This estimate follows from (4.2.26) and Corollary 4.2.1. 0
Let us look now at the corresponding algorithmic scheme.
Auxiliary path-following scheme
0. Choose Yo E Dom F. Set to = 1.

1. kth iteration (k ~ 0). Set
t k+l -- t k -
1
IIF'(vo)ll;k ' (4.2.27)
2. Stop the process if II F'(Yk) llvk::; 1 ~.

Set x = Yk- [F"(Yk)t 1 F'(Yk)·
Note that the above scheme follows the auxiliary central path y*(t) as
tk --+ 0. It updates the points {Yk} satisfying the approximate centering
condition
The termination criterion of this process,
>.k =II F'(Yk) llvk::; 1 ~,
guarantees that II F'(x) llx::; ( 1 ~1J 2 : : ; ß (see Theorem 4.1.14).

Let us derive a complexity estimate for this process.
THEOREM 4.2.11 The process (4.2.27} terminates no later than ajter
~(ß + y'v) ln [~(v + 2y'v) II F'(xo) II;;,]

iterations.
Proof: Recall that we have fixed the parameters:
ß -_ lgl "'-&
I-
1+ ß
- ß-5
- 36'
Note that to = 1. Therefore, in view of Theorem 4.2.8 and Lemma 4.2.1,

we have
tk+l ~ ( 1 - ßlv'ii)
tk ~ exp ( -1~~) .
Further, in view of Lemma 4.2.2, we obtain
II F'(yk) ll;k = I (tkF'(xo) + F'(yk))- tkF'(xo) ll;k
< ß+tk II F'(xo) ll;k~ ß+tk(v+2y'i)) II F'(xo) II;F.
Thus, the process is terminated at most when the following inequality
holds:
tk(ll + 2y'V) II F'(xo) < ~- ß = 'Y·

II;.F-l+Jß
D
Now we can discuss the complexity of both schemes. The principal

term in the complexity of the auxiliary path-following scheme is
7.2y'v[lnv +In II F'(xo) II;.]
F
and for the auxiliary damped Newton method it is O(F(y0 )-F(xj;.)). We

cannot compare these estimates directly. However, a more sophisticated
analysis demonstrates the advantages of the path-following approach.
Note also that its complexity estimate naturally fits the complexity of the
main path-following process. Indeed, if we apply (4.2.23} with (4.2.27},
we get the following complexity bound for the whole process:
7.2JV [2Inv+ In I F'(xo) ll;f. +In II c 11;f. +ln~].

To conclude this section, note that for some problems it is difficult
even to point out a starting point y0 E dom F. In such cases we should
apply one more auxiliary minimization process, which is similar to the
process (4.2.27). We discuss this situation in the next section.
4.2.6 Problems with functional constraints

Let us consider the following minimization problem:
min fo(x),
s.t fj(x) ~ 0, j = l ... m, (4.2.28}
xEQ,
where Q is a simple bounded closed convex set with nonempty inte-

rior and all functions fi, j = 0 ... m, are convex. We assume that the
problern satisfies the Slater condition: There exists x E int Q such that
/j(x) < 0 for all j = 1 ... m.
Let us assume that we know an upper bound f suchthat fo(x) < f
for all x E Q. Then, introducing two additional variables T and "'' we
can rewrite this problern in the standard form:
minT,
s.t fo(x) ~ T,
(4.2.29)
/j(x) ~ "'' j = 1 ... m,
XE Q, T ~ f, K, ~ 0.
Note that we can apply the interior-point methods to a problern only if

we are able to construct the self-concordant barrier for the feasible set.
In the current situation this means that we should be able to construct
the following barriers:
• A self-concordant barrier FQ(x) for the set Q.
• A self-concordant barrier Fo(x, T) for the epigraph of the objective
function fo(x).
• Self-concordant barriers Fj(x, "') for the epigraphs of the functional
constraints /j(x).
Let us assume that we can do that. Then the resulting self-concordant
barrier for the feasible set of the problern (4.2.29) is as follows:
m
F(x, T, "') = FQ(x) + Fo(x, T) + L Fj(x, "') -ln(f- T) -In( -K,).
j=l
The parameter of this barrier is

m
f) = VQ + v0 + L vi + 2, (4.2.30)
j=l
where vo are the parameters of the corresponding barriers.

Note that it could be still difficult to find a starting point from dom.F.
This domain is an intersection of the set Q with the epigraphs of the ob-
jective function and the constraints and with two additional constraints
r ::; f and K. ::; 0. If we have a point x 0 E int Q, then we can choose

large enough r 0 and K.o to guarantee
fo(xo) < ro < f, fj(xo) < K.o, j = 1 ... m,
but then the constraint K. ::; 0 could be violated.
In order to sirnplify our analysis, let us change notation. Frorn now
on we consider the problern
rnin (c, z},
s.t. z ES, (4.2.31)
(d, z} ::; 0,
where z = (x, r, ~), (c, z} = r, (d, z} = ~ and S is the feasible set of the
problern (4.2.29) without the constraint K. ::; 0. Note that we know a
self-concordant barrier F(z) for the set Sand we can easily find a point
zo Eint S. Moreover, in view of our assurnptions, the set
S(a) = {z ES I (d,z}::; a}
is bounded and it has nonernpty interior for a large enough.
The process of solving the problern (4.2.31) consists of three stages.
1. Choose a starting point zo E int S and an initial gap ~ > 0.
Set a = (d, zo} + ~. If a ::; 0, then we can use the two-stage process
described in Section 4.2.5. Otherwise we do the following. First, we find
an approxirnate analytic center ofthe set S(a), generated by the barrier
F(z) = F(z) -ln(a- (d,z}).
Narnely, we find a point z satisfying the condition
.Ap.(z) =(F"(z)-1 (F'(z) + a-(d,z)) ,F'(z) + a-fd,z))l/2::; ß.
In order to generate such a point, we can use the auxiliary schernes
discussed in Section 4.2.5.
2. The next stage consists in following the centrat path z(t) defined
by the equation
td + F'(z(t)) = 0, t 2:: 0.
Note that the previous stage provides us with a reasonable approxirna-
tion to the analytic center z(O). Therefore we can follow this path,
using the process (4.2.19). This trajectory leads us to the solution of
the rninirnization problern
rnin{ (d, z} I z E S(a)}.
Structv.ral optimization 209
In view of the Slater condition for problern (4.2.31), the optimal value
of this problern is strictly negative.
The goal of this stage consists in finding an approximation to the
analytic center of the set
S = {z E S(a) I (d,z) ~ 0},
generated by the barrier
F(z) = F(z) -ln( -(d, z) ).

This point, z*, satisfies the equation
F'(z*)- (d,~.) = 0.
Therefore z* is a point of the central path z(t). The corresponding value
of the penalty parameter t* is
t* = - (d,~.) > 0.
This stage ends up with a point z, satisfying the condition
Ap(z) = (F"(z)- 1 (P'(z)- (d~z)) ,F'(z)- (d~z)) 1 1 2 ~ ß.
3. Note that F"(z) > F"(z). Therefore, the point z, computed at the
previous stage satisfies inequality
Ap(z) =(F"(z)- 1 (F'(z)- (d~z)) ,F'(z)- (d~z)) 1 1 2 ~ ß.

This means that we have a good approximation of the analytic center of
the set S and we can apply the main path-following scheme (4.2.23) to
solve the problern
min{ (c, z) I z E S}.
Clearly, this problern is equivalent to (4.2.31).
We omit the detailed complexity analysis of the above three-stage
scheme. It could be done similarly to the analysis of Section 4.2.5. The
main term in the complexity of this scheme is proportional to the product
of .fD (see (4.2.30)) and the sum of the logarithm of desired accuracy
e and the logarithms of some structural characteristics of the problern
(size of the region, deepness of Slater condition, etc.).
Thus, we have shown that we can apply efficient interior point meth-
ods to all problems, for which we can point out some self-concordant
barriers for the basic feasible set Q and for the epigraphs of functional
constraints. Our main goal now is to describe the classes of convex
problems, for which such barriers can be constructed in a computable

form. Note that we have an exact characteristic of the quality of the
self-concordant barrier. That is the value of its parameter: The smaller
it is, the more efficient will be the corresponding path-following scheme.
In the next section we discuss our possibilities in applying the developed
theory to particular convex problems.
4.3 Applications of structural optimization

(Bounds on parameters of self-concordant barriers; Linear and quadratic opti-
mization; Semidefinite optimization; Extremal ellipsoids; Separable problems;
Geometrie optimization; Approximation in lp norms; Choice of optimization
scheme.)
4.3.1 Bounds on parameters of self-concordant

barriers
In the previous section we have discussed a path-following scheme for
solving the following problem:
min (c,x), (4.3.1)

xEQ
where Q is a closed convex set with nonempty interior, for which we

know a v-self-concordant barrier F(x). Using such a barrier, we can
solve (4.3.1) in 0 (y'V · In~) iterations of a path-following scheme. Recall
that the most difficult part of each iteration is the solution of a system
of linear equations.
In this section we study the Iimits of applicability of this approach. We
discuss the lower and upper bounds for the parameters of self-concordant
barriers; we also discuss some classes of convex problems, for which the
model (4.3.1) can be created in a computable form.
Let us start from lower bounds on barrier parameters.
LEMMA 4.3.1 Let f(t) be a v-self-concordant barrier for the interval

(a,ß) C R 1 , a < ß < oo. Then
v 2::: /'b = sup

tE{a,ß)
iB.ill:
f"(t) 2::: 1.
Proof: Note that v 2::: /'b by definition. Let us assume that /'b < 1. Since
f(t) is a barrier for (a., ß), there exists a value ä E (a., ß) such that
f'(t) > 0 for all t E [ö, ß).
Consider the function cjJ(t) = <~Jar, t E [a,ß). Then, since f'(t) > 0,
f(t) is self-concordant and c/J{t) ::; "' < 1, we have
c/J 1 (t) = 2f 1(t)- (F?~)) 2 f" (t)1
= f I (t) (
2- __j:J!:L f'" (t) ) ~ 2(1-
y'f'(t) . [f"(t)J312 fo)f I (t).
Hence, for all t E [a, ß) we obtain cjJ(t) ~ c/J(a) + 2(1- fo)(f(t)- f(ä)).
This is a contradiction since f (t) is a barrier and c/J( t) is bounded from
above. 0
COROLLARY 4.3.1 Let F(x) be a v-self-concordant barrier for Q C Rn.

Then v ~ 1.
Proof: Indeed, let x E int Q. Since Q C Rn, there exists a nonzero
direction u E Rn such that the line {y = x + tu, t E R 1 } intersects
the boundary of the set Q. Therefore, considering the function f(t) =
F(x +tu), and using Lemma 4.3.1, we get the result. 0
Let us prove a simple lower bound for parameters of self-concordant

barriers for unbounded sets.
Let Q be a closed convex set with nonempty interior. Consider x E
int Q. Assurne that there exists a nontrivial set of recession directions
{PI, ... , pk} of the set Q:
x + api E Q Va ~ 0.
THEOREM 4.3.1 Let positive coefficients {ßi}f= 1 satisfy condition
x- ßi Pi ~ int Q, i = 1 ... k.
k
lf for some positive a1, ... , ak we have '{) = x - E aiPi E Q, then the
i=l
parameter v of any self-concordant barrier for Q satisfies inequality:
k
v>
-
E
i=l
ili.
{3;
Proof: Let F(x) be a v-self-concordant barrier for the set Q. Since Pi

is a recession direction, we have
(since otherwise the function f(t) = F(x + tp) attains its minimum; see
Theorem 4.1.11).
Note that x- ßi Pi f!. Q. Therefore, in view of Theorem 4.1.5, the
norm of the direction Pi is large enough: ßi II Pi llx~ 1. Hence, in view
of Theorem 4.2.4, we obtain
k k k
v ~ (F'(x), y- x) = (F'(x),- E aiPi) ~ E ai II Pi llx~ E ~-
i=l i=l i=l •
0
Let us present now an existence theorem for self-concordant barriers.

Consider a closed convex set Q, int Q '!- 0, and assume that Q contains
no straight line. Define a polar set of Q with respect to some point
x E intQ:
P(x) = {s ERn I (s,x- x) 51, Vx E Q}.
It can be proved that for any x E int Q the set P(x) is a bounded closed
convex set with nonempty interior. Denote V(x) = voln P(x).
THEOREM 4.3.2 There exist absolute constants c1 and c2, such that the
function
U(x) = c1 ·In V(x}
is a (c2 · n)-self-concordant barrier for Q. 0
Function U(x) is called the universal barrier for the set Q. Note that
the analytical complexity of problern (4.3.1}, equipped with a universal
barrier, is 0 ( yn · ln '7) . Recall that such efficiency estimate is impossi-
ble, if we use a local black-box oracle (see Theorem 3.2.5).
The above result has mainly a theoretical interest. In general, the uni-
versal barrier U(x) cannot be easily computed. However, Theorem 4.3.2
demonstrates that such barriers, in principle, can be found for any con-
vex set. Thus, the applicability of our approach is restricted only by
abilities of constructing a computable self-concordant barrier, hopefully
with a small value of the parameter. The process of creating the barrier
model of the initial problem, can be hardly described in a formal way.
For each particular problern there could be many different barrier mod-
els, and we should choose the best one, taking into account the value of
the parameter of the self-concordant barrier, the complexity of its gradi-
ent and Hessian, and the complexity of solution of the Newton system.
In the rest of this section we will see how that can be done for some
standard problern classes of convex optimization.
4.3.2 Linear and quadratic optimization

Let us start from linear optimization problem:
min (c, x),

xERn
s.t Ax = b, (4.3.2)
x (i) > . - 1 .. . n,
_ 0 , ~-
where A is an (m x n)-matrix, m < n. The inequalities in this problern

define the positive orthant in Rn. This set can be equipped with the
following self-concordant barrier:
n
F(x) =- l:Inx(i), v = n,
i=l
(see Example 4.2.1 and Theorem 4.2.2). This barrier is called the stan-
dard logarithmic barrier for R+..
In order to solve the problern (4.3.2), we have to use a restriction
of the barrier F(x) onto affine subspace {x : Ax = b}. Since this
restriction is an n-self-concordant barrier (see Theorem 4.2.3), the com-
plexity estimate for the problern (4.3.2) is 0 (.,fii · ln 7) iterations of a
path-following scheme.
Let us prove that the standard logarithmic barrier is optimal for R+..
LEMMA 4.3.2 Parameter v of any self-concordant barrier for R+. satis-

fies the inequality v ~ n.
Proof: Let us choose
x = e := ( 1, ... , 1) T E int R+.,
Pi = ei, i = 1 ... n,
where ei is the ith coordinate vector of Rn. Clearly, the conditions of

Theorem 4.3.1 are satisfied with ai = ßi = 1, i = 1 ... n. Therefore
n
v>
-
2:: lli
i=l ß;
= n.
0
Note that the above lower bound is valid only for the entire set R~.
The lower bound for intersection {x E R~ I Ax = b} can be smaller.
Let us Iook now at a quadratically constrained quadratic optimization

problern:
min qo(x) = o:o
xeRn
+ (ao, x} + ~(Aox, x},
(4.3.3)
s.t Qi(x) = O:i + (ai,x} + !(Aix,x} :S: ßi, i = 1 ... m,
where Ai are some positive semidefinite (n x n)-matrices. Let us rewrite
this problern in a standard form:
minT,
X,T
s. t qo(x) :S: T,
(4.3.4}
XE nn, TE R1.
The feasible set of this problern can be equipped with the following self-
concordant barrier:
m
F(x, T) = -ln(T- qo(x))- L ln(ßi- Qi(x)), v= m + 1,
i=l
(see Example 4.2.1, and Theorem 4.2.2). Thus, the complexity bound
for problern (4.3.3) is 0 ( ym + 1 ·ln ':) iterations of a path-following
scheme. Note this estimate does not depend on n.
In many applications the functional components of the problern in-
clude a nonsmooth quadratic term of the form II Ax - b II· Let us show
that we can treat such terms using interior-point technique.
LEMMA 4.3.3 The function
F(x, t) = -ln(t2- II x 11 2 )
is a 2-self-concordant barrier for the convex set 5
K2 = {(x, t) ERn+ I I t 2::11 X II}.
Proof: Let us fix a point z = (x, t) E int K 2 and a nonzero direction
u = (h,T} E Rn+l. Denote ~(o:) = (t + o:T) 2 -II x + o:h 11 2 • We need to
compare the derivatives of function
c/J(o:) = F(z + o:u) = -Ine(o:)
5 Depending on the field, this set has different names: Lorentz cone, ice-cream cone, second-
order cone.
at a = 0. Denote cpO = cpO(o), eO = e(·l(o). Then
( = 2(t7- (x,h)), ~ 11 = 2(r 2 - II h 11 2 ),

II-
</;-
(.t)2-
~
t:_
~' cjJ111 E'E" _
-37
_ (.t) 3.
2 ~
Note the inequality 2cjJ" ~ (cp') 2 is equivalent to (e') 2 ~ 2~~"· Thus, we

need to prove that for any (h, r) we have
(4.3.5)
Clearly, we can restriet ourselves by I T 1>11 h II (otherwise the right-

hand side of the above inequality is nonpositive). Moreover, in order to
minimize the left-hand side, we should chose sign T = sign (x, h) (thus,
let T > 0), and (x, h) =II x II · Jl h Jl. Substituting these values in (4.3.5),
we get a valid inequality.
Finally, since 0 < (~};'2 ::::; ~ and [1 - eJ 312 ~ 1 - ~~, we get the
following:
Let us prove that the barrier described in the above statement is

optimal for the second-order cone.
LEMMA 4.3.4 Parameter v of any self-concordant barrier for the set K 2
satisjies inequality v ~ 2.
Proof: Let us choose z= (0, 1) E intK2 and some h ERn, II h II= 1.

Define
P1 = (h, 1), P2 = ( -h, 1), " '_"'_1

<-<1 - <-<2- 2•
Note that for all 1 2: 0 we have z +/Pi= (±1h, 1 + 1) E Kz and

z- ßiPi = (±~h,!) rt int K2,
z- a1P1- a2P2 = ( -~h + ~h, 1- ~- ~) = 0 E K2.
Therefore, the conditions of Theorem 4.3.1 are satisfied and
V >
-
Q.J..
ßt
+ !:!2.
ß2
= 2.
0
4.3.3 Semidefinite optimization

In semidefinite optimization the decision variables are some matrices.
Let X = {X(i,j)}f.j=l be a symmetric n x n-matrix (notation: X E
snxn). The linear space snxn can be provided with the following inner
product: for any X, Y E snxn define
n n
(X, Y)F = LL x(i,i)y(i,j), II X IIF= (X,X)F1/2 .
i=l j=l
Sometimes the value II X IIF is called the Frobenius norm of matrix X.
For symmetric matrices X and Y we have the following identity:
(X,Y · Y}F
= f: f: x(i,j) f: y(i,k)y(i,k) = f: f: f: x(i,i)y(i,k)y(j,k)

i=lj=l k=l i=lj=lk=l
. (4.3.6)
= Ln nL y(k,J)
.n
L ...
X(J,l)y(l,k) = Ln nL .
y(k,J)(XY)(J,k)
k=lj=l i=l k=lj=l
n
= }: (Y XY)(k,k) = Trace {Y XY) = (Y XY, In) F·
k=l
In semidefinite optimization problems a nontrivial part of constraints
is formed by the cone of positive semidefinite n x n-matrices Pn C snxn.
Recall that X E Pn if and only if (Xu, u} ~ 0 for any u E Rn. If
(X u, u} > 0 for all nonzero u, we call X positive definite. Such matrices
form an interior of cone Pn· Note that Pn is a closed convex set.
The general formulation of the semidefinite optimization problern is
as follows:
min (C,X)F,
(4.3.7)
X EPn,
where C and Ai belong to snxn. In order to apply a path-following
scheme to this problem, we need a self-concordant barrier for Pn.
Letmatrix X belang to intPn. Denote F(X) = -lndetX. Clearly
n
F(X) = -ln I1 ~i(X),
i=l
where {Ai(X)}f=I is the set of eigenvalues of matrix X.
LEMMA 4.3.5 Function F(X) is convex and F'(X) = -X- 1 . For any
direction ~ E snxn we have
Trace (rx-l/2ö.x-I/2f)'
-2(In, (X-1/2Llx-1/2j3)F
Proof: Let us fix some Ll E snxn and X Eint Pn suchthat X +b. E Pn·
Then
F(X + b.) - F(X) -In det(X + Ll) -In det X
Thus, -x- 1 E 8F(X). Therefore F is convex (Lemma 3.1.6) and

F'(x) = -x- 1 (Lemma 3.1.7).
Further, consider function cp(a) = (F'(X +aLl), Ll)F, a E [0, 1]. Then
cp(a)- cp(O) (x- 1 -(X+ ail)-I, Ll) F
((X+ ail)- 1((X + aLl)- XJX- 1 , Ll)F
a((X + ail)- 1b.x-I, b.) F·
Thus, cp'{O) = (F"(X)b., Ll)F = (X- 1 b.X-1, ~)F·

The last expression can be proved in a similar way by differentiating
the function '1/J(a) =((X+ ab.)- 1b.(X + a~)- 1 , b.)F· D
THEOREM 4.3.3 Function F(X) is an n-self-concordant barrier for Pn.

Proof: Let us fix XE int Pn and ß E snxn. Denote Q = x- 112 ßX- 112
and Ai= Ai(Q), i = 1 ... n. Then, in view of Lemma 4.3.5 we have
n
(F'(X), ß)F = LAi,
i=l
n 2
(F"(X)ß, ß)F = E\,
i=l
n 3
D 3 F(X)[ß, ß, ß] = -2 E Ai.
i=l
Using two standard inequalities
I.En Ar I ::; (.En Af )3/2 ,

~=1 ~=1
we obtain
(F'{X), ß)} < n(F"(X}ß, ß)F,
I D 3 F(X)[ß, ß, ß] I < 2(F"(X)ß, ß)~ 2 •

0
Let us prove that F(X) = -In det X is the optimal barrier for Pn.
LEMMA 4.3.6 Parameter v of any self-concordant barrier for cone Pn
satisfies inequality v ;::: n.
Proof: Let us choose X = In E int Pn and the directions Pi = eie[,
i = 1 ... n, where ei is the ith coordinate vector of Rn. Note that for
any 'Y;::: 0 we have In+ 'Ypi Eint Pn. Moreover,
n
In - eief f/. int Pn, In - E eie[ = 0 E Pn·
i=l
Therefore, the conditions of Theorem 4.3.1 are satisfied with O:i = ßi = 1,

i = 1 ... n, and we obtain v ;::: E
n
i=l
r. = n.
'
0
As in the linear optimization problern (4.3.2}, in problern (4.3.7) we

need to use restriction of F(X) onto the set
.C ={X: (Ai,X)F = bi, i = 1 ... m}.
This restriction is an n-self-concordant barrier in view of Theorem 4.2.3.
Thus, the complexity bound of the problern (4.3. 7) is 0 ( y'n · In~) it-
erations of a path-following scheme. Note that this estimate is very
encouraging since the dimension of the problern (4.3.7) is !n(n + 1).
Let us estimate the arithmetical cost of each iteration of a path-

following scheme (4.2.23) as applied to the problern (4.3.7). Note that
we work with a restriction of the barrier F(X) onto the set C. In view of
Lemma 4.3.5, each Newton step consists in solving the following prob-
lern:
min{ (U, ~)F
.6.
+ ~(x- 1 ~x- 1 , ~)F: (Ai, ~)F = 0, i = 1 ... m},
where X >-- 0 belongs to C and U is a combination of the cost matrix C
and the gradient F'(X). In accordance to Corollary 1.2.1, the solution of
this problern can be found from the following system of linear equations:
(4.3.8)
(Ai, ~)F = 0, i = 1 ... m.
From the first equation in (4.3.8) we get
~=X [-u + f:
J=l
>.(i) Ajl X. (4.3.9)
Substituting this expression in the second equation in (4.3.8), we get the

linear system
m
L >.(j)(Ai,XAjX)F = (Ai,XUX)p, i = 1 ... m, (4.3.10)
j=l
which can be written in a matrixform as S>. = d with

S(i,j) = (A,XAjX)p, d(i) = (U,XAjX)p, i,j = 1 ... n.
Thus, a Straightforward strategy of solving the system (4.3.8) consists

in the following steps.
• Compute matrices XAjX, j = 1 ... m. Cost: O(mn 3 ) Operations.
• Compute the elements of Sand d. Cost: O(m 2 n 2 ) operations .
• Compute). = s- 1d. Cost: O(m 3 ) Operations.
• Compute ~ by (4.3.9). Cost: O(mn 2 ) operations.
Taking into account that m ~ n(n2+l) we conclude that the complexity
of one Newton step does not exceed
O(n 2 (m + n)m) arithmetic operations.j (4.3.11)

However, if the matrices Aj possess a certain structure, then this esti-

mate can be significantly improved. For example, if all Aj are of rank 1:
Aj = ajaJ, aj E Rn, j = 1 ... m,
then the computation of the Newton step can be clone in
O((m + n) 3 ) arithmetic Operations. (4.3.12)
We leave the justification of this claim as an exercise for the reader.

To conclude this section, note that in many important applications we
can use the barrier - ln det (·) for treating some functions of eigenvalues.
Consider, for example, a matrix A(x) E snxn, which depends linearly
on x. Then the convex region
{(x, t) I 1~ftn >.i(A(x)) ~ t},

can be described by a self-concordant barrier
F(x, t) = -lndet(tln- A(x)).
The value of the parameter of this barrier is equal to n.
4.3.4 Extremal ellipsoids

In some applications we are interested in approximating polytopes by
ellipsoids. Let us consider the most important examp1es.
4.3.4.1 Circumscribed ellipsoid
Given by a set of points a1, ••• , am E Rn, find an ellipsoid

W, which contains all points {ai} and which volume is as
small as possible.
Let us pose this problern in a formal way. First of all note, that any
bounded ellipsoid W C Rn can be represented as
w = {x ERn I X= n- 1(v + u), II u II~ 1},
where H E int Pn and v E Rn. Then inclusion a E W is equivalent to

inequality II Ha - v II ~ 1. Note also that
1 W =
VO n
1 B 2 (0 ' 1) • d et H -1 =
VO n
voln B2(0,1)
det H ·
Thus, our problern is as follows:

minT,
H,v,T
s.t. -lndetH~T,
(4.3.13)
II Hai -v II~ 1, i = 1 ... m,
In order to solve this problern by an interior-point scherne we need to find

a self-concordant barrier for a feasible set. At this mornent we know such
barriers for all components of this problern except the first inequality.
LEMMA 4.3. 7 Function
-lndet H -ln(T + lndet H)
is an (n + 1)-self-concordant barrier for the set
{(H,T) E snxn X R 1 I T ~ -lndetH, HE 'Pn}·

0
Thus, we can use the following barrier:

m
F(H, v, T) = -lndet H -ln(T + lndetH)- E ln{1- II Hai- v 11 2 ),
i=l
v=m+n+l.
The corresponding cornplexity bound is 0 ( Jm + n + 1 ·In m;n) itera-

tions of a path-following scheme.
4.3.4.2 Inscribed ellipsoid with fixed center
Let Q be a convex polytope defined by a set of linear

inequalities:
and Iet v E int Q. Find an ellipsoid W, centered at v,

such that W C Q and which volurne is as big as possible.
Let us fix some HE int 'Pn· We can represent the ellipsoidWas

W = {x ERn I (H- 1 (x- v),x- v} ~ 1}.
We need the following simple result.
LEMMA 4.3.8 Let (a, v} < b. Inequality (a, x} < b is valid for any
x E W if and only if
(Ha, a} ~ (b - (a, v} )2 •
Proof: In view of Lemma 3.1.12, we have
max{(a,u}
u
I (H- 1 u,u} ~ 1} = (Ha,a) 112 •
Therefore we need to ensure

max(a,x}
xEW
= max((a,x- v)
xEW
+ (a,v)]
= (a, v) + max{ (a, u} I (H- 1u, u) ~ 1}

X
- + (Ha,a) 112
(a,v) ~ b.
This proves our statement since (a, v} < b. D
Note that voln W = voln B2(0, 1)[det Hjll 2. Hence, our problern is as
follows:
minT,
H,T
s.t. -lndet H ~ T,
{4.3.14)
H E 'Pn, T E R1 .
In view of Lemma 4.3. 7, we can use the following self-concordant barrier:
F(H, T) = -lndet H -ln(T + lndet H)
v = m+n+l.
The efficiency estimate of the corresponding path-following scheme is
0 ( v'm + n + 1 · In m;n)
iterations.
4.3.4.3 Inscribed ellipsoid with free center
Let Q be a convex polytope defined by a set of linear

inequalities:
Q = {x ERn I (ai,x) :$ bi, i = l. .. m},
and Iet int Q =/= 0. Find an ellipsoid W C Q, which has

the maximal volume.
Let GE int Pn, v Eint Q. We can represent Was follows:

w - {x ERn I II G- 1 (x- v) II:$ 1}
- {x ERn I (G- 2 (x- v),x- v) :$ 1}.
In view of Lemma 4.3.8, the inequality (a, x) :$bis valid for any x E W
if and only if
II Ga 11 2 =
(G2a, a) :$ (b- (a, v)} 2 •
That gives a convex region for (G, v):
II Ga II:$ b- (a,v).
Note that voln W = voln B2(0, 1) det G. Therefore our problern can be
written as follows:
min r,
G,v,r
s.t. -lndet G :$ r,
(4.3.15)
In view of Lemmas 4.3. 7 and 4.3.3, we can use the following self-
concordant barrier:
F(G, v, r) = -lndet G -ln(r + lndet G)
v = 2m+ n + 1.
The corresponding efficiency estimate is 0 ( v'2m + n + 1 · ln m;n) iter-

ations of a path-following scheme.
4.3.5 Separable optimization

In problems of separable optimization all nonlinear terms are pre-
sented by univariate functions. A general formulation of such a problern
looks as follows:
mo
mRi~
XE
qo(x) = L: ao,jfo,j((ao,j,x) + bo,j),
j==l
(4.3.16)
m;
s.t Qi(x) =E ai,jfi,j((ai,j,x) +bi,j)::::; ßi, i = l. .. m,
j==l
where ai,j are some positive coefficients, ai,j E Rn and fi,j(t) are convex
functions of one variable. Let us rewrite this problern in a standard
form:
min ro,
x,t,r
m;
L: ai,jti,j ::::; Ti, i = 0 ... m, (4.3.17)
j=l
m
where M = E mi. Thus, in order to construct a self-concordant barrier
i=O
for the feasible set of the problem, we need barriers for epigraphs of
univariate convex functions Ai. Let us point out such barriers for several
important functions.
4.3.5.1 Logarithm and exponent.

Function FI(x, t) = -lnx -ln(lnx + t) is a 2-self-concordant barrier
for the set
Ql = {(x,t) E R2 1 x > 0, t ~ -lnx},
and function F2(x, t) = -ln t -ln(ln t- x) is a 2-self-concordant barrier

for the set
4.3.5.2 Entropy function.

Function Fg(x, t) = -lnx -ln(t- x lnx) is a 2-self-concordant barrier
for the set
Qg = {(x,t) E R 2 1 x 2 0, t 2 xlnx}.
4.3.5.3 Increasing power functions.

Function F4 (x, t) = -2ln t-1n(t 21P -x 2 ) is a 4-self-concordant barrier
for the set
Q4 = {(x, t) E R 2 I t 21 x IP}, p 2 1,
and function F5 (x, t) = -lnx -ln(tP- x) is a 2-self-concordant barrier
for the set
Qs = { (x, t) E R 2 I x 2 0, tP 2 x }, 0 < p s 1.
4.3.5.4 Decreasing power functions.
Function F6 (X' t) = - ln t -ln( X - r l/p) is a 2-self-concordant barrier
for the set
Q6 = { (x, t) E R 2 I x > 0, t 2 x~} , p 2 1,
and function F7(x, t) = -lnx -ln(t- x-P) is a 2-self-concordant barrier

for the set
Q7 = { (x, t) E R 2 I x > 0, t ;::: :lv} , 0 < p < 1.

We omit the proofs of the above statements since they are rather
technical. It can be also shown that the barriers for all of these sets,
(except maybe Q4 ), are optimal. Let us prove this statement for the sets
Q6 and Q7.
LEMMA 4.3.9 Parameter v of any self-concordant barrier for the set
Q = { (x(l) ' x( 2)) E R 2 I x(l) > 0' x( 2) > ~l }

- (x1l1)P '
with p > 0, satisfies inequality v 2 2.

Proof: Let us fix some 'Y > 1 and choose x= ( 'Y, 'Y) E int Q. Denote
Pt= e1, P2 = e2, ß1 = ß2 = {, a1 = a2 = a ='Y- 1.
Then x + eei E Q for any e;: : 0 and

x- ßet = (O,r) fj. Q, x- ße2 = {r,O) fj. Q,
x-a(el +e2) = (r-a,{-a) = (1,1) E Q.

Therefore, the conditions of Theorem 4.3.1 are satisfied and
V >
-
Ql.
ßt
+ Q.2.
ß2 = 21.::.!.
'Y
This proves the statement since 1 can be chosen arbitrarily big. 0
Let us conclude our discussion by two examples.
4.3.5.5 Geometrie optimization.

The initial formulation of such problems is as follows:
mo n . (i)
min qo(x) = L ao,j TI (x(J)Y'"0 ·i,
xERn j=l j=l
m; n . (i)
s.t qi(x) = L ai,j TI (x(3 >yri.j ~ 1, i = 1 ... m, (4.3.18)
j=l j=l
x(j) > 0, j = 1 ... n,
where ai,j are some positive coefficients. Note that the problern (4.3.18)
is not convex.
Let us introduce the vectors ai,j = (o{~, ... , u~~)) E Rn, and change
the variables: x(i) = eY(i). Then (4.3.18) is transformed into a convex
separable problem.
mo
min L ao 3·exp{(ao 3·,y)),
yERn j=l ' '
(4.3.19)
m;
s.t. L D!i,j exp( (ai,j, y)) ~ 1, i = 1 ... m.
j=l
m
Denote M = E mi. The complexity of solving (4.3.19) by a path-fol-
i=O
lowing scheme is
o(M 1 1 2 ·ln~).
4.3.5.6 Approximation in lp norms.
The simplest problern of that type is as follows:
(4.3.20)
s.t. a :::; x :::; ß,
where p ? 1. Clearly, we can rewrite this problern in an equivalent

standard form:
min T(o)
x,r '
s.t I (ai, x} - b(i) jP:::; T(i), i = 1 ... m,
E T(i) :::; T(O)' (4.3.21)

i=l
a::=;x ::=;ß,
The complexity bound of this problern is 0 (Jm + n · In m;n) iterations

of a path-following scheme.
Wehave discussed the performance of interior-point methods on sev-
eral pure problern formulations. However, it is important that we can ap-
ply these methods to mixed problems. For example, in problems (4.3. 7)
or (4.3.20) we can treat also the quadratic constraints. To do that, we
need to construct a corresponding self-concordant barrier. Such barriers
are known for all important examples we meet in practical applications.
4.3.6 Choice of minimization scheme

We have seen that many convex optimization problems can be solved
by interior-point methods. However, we know that the same problems
can be solved by another general technique, the nonsmooth optimization
methods. In general, we cannot say which approach is better, since the
answer depends on individual structure of a particular problem. How-
ever, the complexity estimates for optimization schemes often help to
make a reasonable choice. Let us consider a simple example.
Assurne we are going to solve a problern of finding the best approxi-
mation in lp-norms:
(4.3.22)
s.t a:::; x:::; ß,
where p ? 1. And Iet us have two numerical methods available:
• The ellipsoid method (Section 3.2.6).
• The interior-point path-following scheme.
What scheme should we use? We can derive the answer from the com-
plexity estimates of corresponding methods.
Let us estimate first the performance of the ellipsoid method as ap-
plied to problern {4.3.22).
Complexity of ellipsoid method
Number of iterations: 0 ( n 2 In *) ,
Complexity of the oracle: O(mn) operations,
Complexity of the iteration: O(n 2 ) Operations.
Total complexity: 0 ( n 3 (m + n) In~) operations.
The analysis of a path-following scheme is more involved. First of all,

we should form a barrier model of the problem:
min ~'
x,r,{
s.t. I (ai, x) - b(i) IP~ T(i), i = 1 ... m,
f
i=l
T(i) ~ ~' a ~X~ ß,
(4.3.23)
F(x,T,~)) = Ej(T(i),(ai,x) -b(i)) -In(~- i=l

i=l
ET(i))
where f(y, t) = -2ln t -ln(t21P- y 2 ).

Wehaveseen that the parameter of barrier F(x, T, e) is V = 4m+n+ 1.
Therefore, the number of iterations of a path-following scheme can be
estimated as 0 ( v'4m + n + 1 ·In mpn ).
At each iteration of the path-following scheme we need to compute
the gradient and the Hessian of barrier F(x, T, e). Denote
9I(y, t) = f~(y, t), 92(Y, t) = J:(y, t).
Then
E9I(T(il,(ai,x)-b(il)ai- f: [ (il~
F~(x,r,~)= i=l (il- ßiil~ (il]ei,
i=l X Q X
Further, denoting
we obtain
F;(i)x(x,T,~) = h12(T(i), (ai,x)- b(i))ai,
F~'(i>,r<il(x,r,O = h22(r(i), (ai,x)- b(i)) + (~- i~l T(i)) - 2 ,
F"(il
T 1T
Ul (x, T, 0 = ( e- .L:m r(i) )
!=l
-2
, i =1= j,
F"(i)
T '~
c(x, T, 0 = - ( e- L: T(i) )-2 '
m
i=l
FJ:e(x, T, 0 = ( ~- i~l T(i)) -2

Thus, the complexity of the second-order oracle in the path-following
scheme is O(mn 2 ) arithmetic operations.
Let us estimate now the complexity of each iteration. The main source
of computations at each iteration is the solution of a Newton system.
Denote
/'i, = (e - .I:
z=l
T(i)) -
2
, Si = (ai, x) - b(i), i = 1 ... n,
and
A2 = diag(hi2(T(i),si))~ 1 , D = diag(h22(T(i),si))~ 1 .
Then, using the notation A = (a 1 , ... , am), e = (1, ... , 1) E Rm, the
Newton system can be written in the following form:
(4.3.24)
where t is the penalty parameter. From the second equation in (4.3.24)

we obtain
Substituting !:l.T in the first equation in (4.3.24), we can express
!:l.x = [A(Ao + A1- A~[D + tblmt 1 )AT]- 1 {F~(x,T,e)

-AA2[D + tblm]- 1(F:(x,T,e)- l'be!:l.e)}.
Using these relations we can find !:l.e from the last equation in (4.3.24).
Thus, the Newtonsystem (4.3.24) can be solved in O(n 3 + mn 2 ) Op-
erations. This implies that the total complexity of the path-following
scheme can be estimated as
0 ( n 2 (m + n) 312 ·ln m;n)

arithmetic operations. Comparing this estimate with that ofthe ellipsoid
method, we conclude that the interior-point methods are more efficient
if m is not too large, namely, if m ~ O(n2 ).
Bibliography
Chapter 1. Nonlinear optimization

Beetion 1.1. Complexity theory for black-box optimization schemes was
developed in [8]. In this monograph the reader can find different exam-
ples of resisting oracles and lower complexity bounds similar to that of
Theorem 1.1. 2.
Beetions 1.2 and 1.3. There exist several dassie monographs [2, 3, 7],
which treat different aspects of nonlinear optimization and numerical
schemes. For sequential unconstrained minimization the best source is
still [4].
Chapter 2. Smooth convex optimization

Seetion 2.1. The lower complexity bounds for smooth convex and strong-
ly convex functions can be found in [8]. However, the proof used in this
section seems to be new.
Beetion 2.2. Gradient mapping was introduced in [8]. The optimal

method for smooth and strongly smooth convex functions was proposed
in [10]. A constrained variant of this scheme is taken from [11].
Section 2.3. Optimal methods for minimax problems were developed in
[11]. The approach of Section 2.3.5 seems tobe new.
Chapter 3. Nonsmooth convex optimization

Section 3.1. A comprehensive treatment of different topics of convex
analysis can be found in [5]. However, the dassie [15] is still very useful.
Section 3.2. Lower complexity bounds for nonsmooth minimization
problems can be found in [8]. The scheme of the proof of the con-
vergence rate was suggested in [9]. See [13] for detailed bibliographical
comments on the history of nonsmooth minimization schemes.
Section 3.3. The example for the Kelley method is taken from [8]. The
presentation of the Ievel method is close to [6].
Chapter 4. Structural optimization

This chapter contains an adaptation of the main concepts from [12].
We added several useful inequalities and slightly simplified the path-
following scheme. We refer the reader to [1] for numerous applications
of interior-point methods, and to [14], [16), [18] and [19] for detailed
treatment of different theoretical aspects.
References
[1] A. Ben-Tal and A. Nemirovskii. Lectures on Moden Convex Optimizatin Anal-

ysis, Alogorithms, and Engineering Applications, SIAM, Philadelphia, 2001.
[2] A.B. Conn, N.I.M. Gould and Ph.L. Toint. Trust Region Methods, SIAM,
Philadelphia, 2000.
[3] J.E. Dennis and R.B. Schnabel. Numerical Methods for Unconstrained Opti-
mization and Nonlinear Equations, SIAM, Philadelphia, 1996.
[4] A.V. Fiacco and G.P. McCormick. Nonlinear Programming: Sequential Uncon-
strained Minimization Techniques, John Wiley and Sons, New York, 1968.
[5] J.-B. Hiriart-Urruty and C. Lemankhal. Convex Analysis and Minimization

Algorithms, vols. I and II. Springer-Verlag, 1993.
[6] C. Lemarechal, A. Nemirovskii and Yu. Nesterov. New variants of bundle meth-
ods. Mathematical Programmming, 69(1995), 111-148.
[7] D.G. Luenberger. Linear and Nonlinear Programming, Second Edition, Addi-
son Wesley. 1984.
[8) A.Nemirovsky and D.Yudin. Informational complexity and efficient methods

for solution of convex extremal problems, Wiley, New York, 1983.
[9) Yu.Nesterov. Minimization methods for nonsmooth convex and quaskonvex

functions. Ekonomika i Mat. Metody, v.ll, No.3, 519-531, 1984. (In Russian;
translated as MatEcon.)
[10] Yu.Nesterov. A method for solving a convex programming problern with rate
of convergence 0( ~ ). Soviet Math. Doklady, 1983, v.269, No.3, 543-547. (In
Russian.)
[11] Yu.Nesterov. Efficient methods in nonlinear programming. Radio i Sviaz,

Moscow, 1989. (In Russian.)
[12] Yu. Nesterov and A.Nemirovskii. Interior-Point Polynomial Algorithms in

Convex Programming, SIAM, Philadelphia, 1994.
[13] B.T. Polyak. Introduction to Optimization. Optimization Software, New York,

NY, 1987.
[14] J.Renegar. A Mathematical View of Interior-Point Methods in Convex Opti-

mization, MPS-SIAM Series on Optimization, SIAM 2001.
[15] R.T. Rockafellar Convex Analysis, Princeton Univ. Press, Princeton, NJ, 1970.
[16] C. Roos, T. Terlaky and J.-Ph. Vial. Theory and Algorithms for Linear Opti-
mization: An Interior Point Approach. John Wiley, Chichester, 1997.
[17] R.J. Vanderbei. Linear Programming: Foundations and Extensions. Kluwer

Academic Publishers, Boston, 1996.
[18] S. Wright. Primal-dual interior point methods. SIAM, Philadelphia, 1996.
[19] Y. Ye. Interior Point Algorithms: Theory and Analysis, John Wiley and Sons,
lnc., 1997.
Index
Analytic center, 198 function, 112

Antigradient, 17 set, 81
Approximate centering condition, 200 Cutting plane scheme, 150
Approximation, 16
first order, 16 Damped Newton method, 34
global upper, 37 Dikin ellipsoid, 182
in !p-norms, 226 Directional derivative, 122
linear, 16 Domain of function, 112
quadratic, 19
second order, 19 Epigraph, 82
Estimate sequence, 72
Barrier
analytic, 156 Feasibility problem, 146
self-concordant, 193 Function
universal, 212 barrier, 48, 180
volumetric, 156 convex, 112
Black box concept, 7 objective, 1
self-concordant, 176
Center strongly convex, 63
analytic, 198 Functional constraints, 1
of gravity, 151
Central path, 193
General iterative scheme, 6
auxiliary, 204
Gradient, 16
equation, 193
mapping, 86
Cla.ss of problems, 6
Complexity
analytical, 6 Hessian, 19
arithmetical, 6 Hyperplane
lower bounds, 10 separating, 124
upper bounds, 10 supporting, 124
Computational efforts, 6
Condition number, 65 Inequality
Cone of Cauchy-Schwartz, 17
positive semidefinite matrices, 216 Jensen, 112
second order, 214 Infinity norm, 116
Conjugate directions, 43 Information set, 6
Contracting mapping, 30 Inner product, 2
Convex
combination, 82, 113 Kelley method, 158
differentiable function, 52 Krylov subspace, 42
Level set, 17, 82 on a problem, 5

Localization set, 141 on a problern dass, 5
Polar set, 212
Matrix Polynomial methods, 156
positive definite, 19 Positive orthant, 213
positive semidefinite, 19 Problem
Max-type function, 91 constrained, 2
Method of feasible, 2
analytic centers, 156 general, 1
barrier functions, 49 linearly constrained, 2
centers of gravity, 151 nonsmooth, 2
conjugate gradients, 42 of approximation in lv-norms, 226-227
ellipsoids, 154 of geometric optimization, 226
gradient, 25 of integer optimization, 3
inscribed ellipsoids, 156 of linear optimization, 2, 213
penalty functions, 47 of quadratic optimization, 2
uniform grid, 8 of semidefinite optimization, 216
variable metric, 38, 40 of separable optimization, 224
volumetric centers, 156 quadratically constrained quadratic, 2,
Method 214
optimal, 29 smooth, 2
path-following, 202 strictly feasible, 2
quasi-Newton, 38 unconstrained, 2
Minimax problem, 91 Projection, 124
Minimum
global, 2 Quasi-Newton rule, 40
local, 2
Model of Recession direction, 211
convex function, 157 Relaxation, 15
minimization problem, 5 Restarting strategy, 45
barrier, 173, 199
functional, 7 Self-concordant
barrier, 193
Newton method function, 176
damped, 34, 188 Sequential unconstrained minimization, 46
standard, 33, 189 Set
Newton system, 33 convex, 81
Norm feasible, 2
! 00 , 8, 116 basic, 1
h, 116 Slater condition, 2, 49
Euclidean, 16 Solution
Frobenius, 216 global, 2
local, 181 local, 2
Standard
Optimality condition logarithmic barrier, 213
constrained problem, 84 minimization problem, 192
first order, 17 simplex, 132
minimax problem, 92 Stationary point, 18
second order, 19 Step-size strategies, 25
Oracle Strict Separation, 124
local black box, 7 Structural constraints, 3
resisting, 10-11 Subdifferential, 126
Subgradient, 126
Parameter Support function, 120
barrier, 194 Supporting vector, 126
centering, 200
Performance Unit ball, 116

(Applied Optimization 87) Yurii Nesterov (Auth.) - Introductory Lectures On Convex Optimization - A Basic Course-Springer US (2004)

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

(Applied Optimization 87) Yurii Nesterov (Auth.) - Introductory Lectures On Convex Optimization - A Basic Course-Springer US (2004)

Uploaded by

Copyright:

Available Formats

INTRODUCTORY LECTURES ON

Copyright Cl 2004 by Springer Science+Business Media New York

2.1.5 Gradient method 68

4.1.4 Main inequalities 181

us the target notions were the complexity of the optimization problems

This book is a refiection of the main achievements in convex optimiza-

Optimization problems arise naturally in different fields of applica-

choose between a "good" model, which we cannot solve, 1 and a "bad"

1 More precisely, which we can try to solve

The structure of the book is as follows. It consists of four relatively in-

scheme to establish the convergence rate of the subgradient method, the

1.1 World of nonlinear optimization

1.1.1 General formulation of the problern

s.t. fj(x) & 0, j = 1 ... m, (1.1.1)

is called the feasible set of problern (1.1.1). That is just a convention

resources, reliability of the systern, etc. We fix the rnost irnportant

s.t.: aj :::; fi(x) :::; bj, j = 1. .. m,

where S stands for the structural constraints, like nonnegativity or boun-

EXAMPLE 1.1.2 Let our initialproblern be as follows:

perhaps even with some additional constraints on x. 1f the optimal value

Thus, we can also treat integer optimization problems:

s.t.: aj :5 /j(x) :5 bj, j = 1 ... m,

Looking at these examples, a reader can understand the optimism

Nonlinear optimization is a very important and promising

However, just by looking at the same examples, especially at Examples

In general, optimization problems are unsolvable

lndeed, the real life is too complicated to believe in a universal tool,

1.1.2 Performance of numerical methods

Since we are going to speak about the performance of M on a class

A known (for numerical scheme) "part" of problern P is

We denote the model by L:. Usually the model consists of problern

Performance of M on P is the total amount of computa-

In this definition there are two additional notions to be specified. First of

To solve the problern means to find an approximate solu-

General Iterative Scheme.

Input: A starting point xo and an accuracy E > 0.

Now we can specify the term computational efforts in our definition of

Analytical complexity: The number of calls of oracle,

Comparing the notions of analytical and arithmetical complexity, we

Local black box

1. The only information available for the numerical

This concept is very useful in numerical analysis. Of course, its first

1.1.3 Complexity bounds for global optimization

n-dimensional box in Rn:

Bn={xERnl O~x(i)~l, i=l. .. n}.

Let us measure distances in Rn using l00 -norm:

Assurne that, with respect to this norm,

the objective function f(x) is Lipschitz continuous on Bn:

I f(x)- f(y) I~ L II X- Y lloo Vx, Y E Bn, (1.1.5)

with some constant L (Lipschitz constant).

where (ib···,in) E {O, ... ,p}n. (1.1.6)

3. Return the pair (x, f(x)) as a result.

of the accumulated information on the sequence of test points. Let us

(here and in the sequel we write x ::; y for x, y E Rn if and only if

Since x belongs to our grid, we conclude that

A(Q) = ( l~J + 2r.

Model: rnin f(x),

f (x) is 100 - Lipschitz continuous on Bn.

Oracle: Zero-order local black box.

Approximate solution: Find x E Bn: f(x)- f* ~ E.

order methods is at least ( l ii J

Proof: Derrote p = liiJ

Oracle returns f(x) = 0 at any test point x.

f(y) = f(x) + (f'(x), y- x) + o(ll Y- x II) 2 f(x*).

f'(x) = 0, J"(x) >- 0.

f(y) = f(x) + !U"(x)(y- x), Y- x) + o(ll Y- x* 11 2).

> f(x) + ~>.1(f"(x)) II Y- x* 11 2> f(x*).