Yurii Nesterov, Introductory Lectures on Convex Optimization: A Basic Course. Applied Optimization 87, Springer US, 2004.
CONVEX OPTIMIZATION
A Basic Course
Applied Optimization
Volume 87
Series Editors:
Panos M. Pardalos
University of Florida, U.S.A.
Donald W. Hearn
University of Florida, U.S.A.
INTRODUCTORY LECTURES ON
CONVEX OPTIMIZATION
A Basic Course
By
Yurii Nesterov
Center for Operations Research and Econometrics (CORE)
Universite Catholique de Louvain (UCL)
Louvain-la-Neuve, Belgium
Springer Science+Business Media, LLC
Library of Congress Cataloging-in-Publication
Nesterov, Yurii
Introductory Lectures on Convex Optimization: A Basic Course
ISBN 978-1-4613-4691-3 ISBN 978-1-4419-8853-9 (eBook)
DOI 10.1007/978-1-4419-8853-9
Contents

Preface
Acknowledgments
Introduction

1. NONLINEAR OPTIMIZATION
1.1 World of nonlinear optimization
1.1.1 General formulation of the problem
1.1.2 Performance of numerical methods
1.1.3 Complexity bounds for global optimization
1.1.4 Identity cards of the fields
1.2 Local methods in unconstrained minimization
1.2.1 Relaxation and approximation
1.2.2 Classes of differentiable functions
1.2.3 Gradient method
1.2.4 Newton method
1.3 First-order methods in nonlinear optimization
1.3.1 Gradient method and Newton method: What is different?
1.3.2 Conjugate gradients
1.3.3 Constrained minimization

2. SMOOTH CONVEX OPTIMIZATION
2.1 Minimization of smooth functions
2.1.1 Smooth convex functions
2.1.2 Lower complexity bounds for F_L^{∞,1}(R^n)
2.1.3 Strongly convex functions
2.1.4 Lower complexity bounds for S_{μ,L}^{∞,1}(R^n)

Index
Preface
It was in the middle of the 1980s when the seminal paper by Karmarkar opened a new epoch in nonlinear optimization. The importance of this paper, containing a new polynomial-time algorithm for linear optimization problems, was not only in its complexity bound. At that time,
the most surprising feature of this algorithm was that the theoretical pre-
diction of its high efficiency was supported by excellent computational
results. This unusual fact dramatically changed the style and direc-
tions of the research in nonlinear optimization. Thereafter it became
more and more common that the new methods were provided with a
complexity analysis, which was considered a better justification of their
efficiency than computational experiments. In a new rapidly develop-
ing field, which got the name "polynomial-time interior-point methods",
such a justification was obligatory.
After almost fifteen years of intensive research, the main results of this
development started to appear in monographs [12, 14, 16, 17, 18, 19].
Approximately at that time the author was asked to prepare a new course
on nonlinear optimization for graduate students. The idea was to create
a course which would reflect the new developments in the field. Actually,
this was a major challenge. At the time only the theory of interior-point
methods for linear optimization was polished enough to be explained to
students. The general theory of self-concordant functions had appeared
in print only once in the form of research monograph [12]. Moreover,
it was clear that the new theory of interior-point methods represented
only a part of a general theory of convex optimization, a rather involved
field with the complexity bounds, optimal methods, etc. The majority
of the latter results were published in different journals in Russian.
The book you see now is the result of an attempt to present serious things in an elementary form. As is always the case with a one-semester course, the most difficult problem is the selection of the material. For
YURII NESTEROV
Louvain-la-Neuve, Belgium
May, 2003.
To my wife Svetlana
Acknowledgments
Michel Gevers, Etienne Loute, Yves Pochet, Yves Smeers, Paul Van Dooren and Laurence Wolsey, all coming from CORE and CESAME, a research center of the Engineering department of Université Catholique
de Louvain (UCL). The research activity of the author during many
years was supported by the Belgian Program on Interuniversity Poles of
Attraction initiated by the Belgian State, Prime Minister's Office and
Science Policy Programming.
Introduction
NONLINEAR OPTIMIZATION
x ∈ S,

where the sign & could be ≤, ≥ or =.

We call f0(x) the objective function of our problem, the vector function

f(x) = (f1(x), ..., fm(x))^T

is called the vector of functional constraints, the set S is called the basic feasible set, and the set

Q = {x ∈ S | fj(x) ≤ 0, j = 1 ... m}
(here ⟨·, ·⟩ stands for the inner product in R^n), and S is a polyhedron. If f0(x) is also linear, then (1.1.1) is a linear optimization problem. If f0(x) is quadratic, then (1.1.1) is a quadratic optimization problem. If all fi are quadratic, then this is a quadratically constrained quadratic problem.
There is also a classification based on the properties of the feasible set.
• Problem (1.1.1) is called feasible if Q ≠ ∅.
• Problem (1.1.1) is called strictly feasible if there exists x ∈ int Q such that fi(x) < 0 (or > 0) for all inequality constraints and fi(x) = 0 for all equality constraints. (Slater condition.)
Finally, we distinguish different types of solutions to (1.1.1):
• x* is called the optimal global solution to (1.1.1) if f0(x*) ≤ f0(x) for all x ∈ Q (global minimum). In this case f0(x*) is called the (global) optimal value of the problem.
• x* is called a local solution to (1.1.1) if f0(x*) ≤ f0(x) for all x in some neighborhood of x* intersected with Q (local minimum).
Let us consider now several examples demonstrating the origin of optimization problems.

EXAMPLE 1.1.1 Let x^(1), ..., x^(n) be our design variables. Then we can fix some functional characteristics of our decision: f0(x), ..., fm(x). For example, we can consider the price of the project, the amount of required
x ∈ S,
EXAMPLE 1.1.3 Sometimes our decision variables x^(1), ..., x^(n) must be integer. This can be described by the following constraint:

sin(πx^(i)) = 0, i = 1 ... n.

x ∈ S,

sin(πx^(i)) = 0, i = 1 ... n. □
The meaning of the words with an accuracy ε > 0 is very important for our definitions. However, it is too early to speak about that now. We just introduce notation T_ε for a stopping criterion; its meaning will be always precise for particular problem classes. Now we can give a formal definition of the problem class:

F = (Σ, O, T_ε).

In order to solve a problem P ∈ F we can apply an iterative process, which naturally describes any method M working with the oracle.
usually can be easily obtained from the analytical complexity and the complexity of the oracle. Therefore, in this course we will speak mainly about bounds on the analytical complexity for some problem classes.
There is one standard assumption on the oracle, which allows us to obtain the majority of the results on the analytical complexity of optimization problems. This assumption is called the local black box concept and it looks as follows:
Consider a very simple method for solving (1.1.4), which is called the uniform grid method. This method G(p) has one integer input parameter p. Its scheme is as follows.

Method G(p)

1. Form (p+1)^n points

x_(i1,...,in) = (i1/p, i2/p, ..., in/p)^T.

2. Among all points x_(i1,...,in) find a point x̄ which has the minimal value of the objective function.

Thus, this method forms a uniform grid of the test points inside the box Bn, computes the minimum value of the objective over this grid and returns this value as an approximate solution to problem (1.1.4). In our terminology, this is a zero-order iterative method without any influence
‖x̄ − x*‖∞ = max_{1≤i≤n} |x̄^(i) − x*^(i)| ≤ 1/(2p).
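The scheme of G(p) is easy to sketch in code. The following is an illustrative implementation, not the book's own code; the test function and the box [0,1]^n (standing in for Bn) are assumptions made for the example:

```python
import itertools

def uniform_grid_minimize(f, n, p):
    """Zero-order uniform grid method G(p): evaluate f at the (p+1)^n grid
    points (i1/p, ..., in/p) of the box [0,1]^n and return the best point."""
    best_x, best_val = None, float("inf")
    for idx in itertools.product(range(p + 1), repeat=n):
        x = tuple(i / p for i in idx)
        val = f(x)
        if val < best_val:
            best_x, best_val = x, val
    return best_x, best_val

# Illustrative objective (not from the book): minimum at (0.3, 0.7).
f = lambda x: (x[0] - 0.3) ** 2 + (x[1] - 0.7) ** 2
x_bar, f_bar = uniform_grid_minimize(f, n=2, p=10)
```

Note how the cost (p+1)^n explodes with the dimension n, which is exactly the point of the complexity discussion that follows.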
Let us finish the definition of our problem class. Define our goal as follows:

Find x̄ ∈ Bn : f(x̄) − f* ≤ ε.  (1.1.7)

Then we immediately get the following result.

COROLLARY 1.1.1 The analytical complexity of the problem class (1.1.4), (1.1.5), (1.1.7) for method G is at most

A(G) = (⌊L/(2ε)⌋ + 2)^n.
Thus, A(G) justifies an upper complexity bound for our problem class. This result is quite informative, but we still have some questions. Firstly, it may happen that our proof is too rough and the real performance of G(p) is much better. Secondly, we still cannot be sure that G(p) is a reasonable method for solving (1.1.4). There may exist other schemes with much higher performance.
In order to answer these questions, we need to derive lower complexity bounds for the problem class (1.1.4), (1.1.5), (1.1.7). The main features of such bounds are as follows.
• They are based on the black box concept.
• These bounds are valid for all reasonable iterative schemes. Thus, they provide us with a lower estimate for the analytical complexity of the problem class.
• Very often such bounds employ the idea of the resisting oracle.
For us only the notion of the resisting oracle is new. Therefore, let us discuss it in more detail.
A resisting oracle tries to create a worst problem for each particular method. It starts from an "empty" function and tries to answer each call of the method in the worst possible way. However, the answers must be compatible with the previous answers and with the description of the problem class. Then, after termination of the method, it is possible to reconstruct a problem which fits completely the final information set accumulated by the algorithm. Moreover, if we launch this method on this problem, it will reproduce the same sequence of test points, since it will have the same sequence of answers from the oracle.
Let us show how that works for problem (1.1.4). Consider the class of problems C defined as follows:

THEOREM 1.1.2 For ε < L/2 the analytical complexity of C for zero-order methods is at least (⌊L/(2ε)⌋)^n.
Now we can say much more about the performance of the uniform grid method. Let us compare its efficiency estimate with the lower bound. Note that the size of the problem is very small and we ask only for 1% accuracy.
The lower complexity bound for this class is (⌊L/(2ε)⌋)^n. Let us compute it for our example.
We should note that the lower complexity bounds for problems with smooth functions, or for high-order methods, are not much better than those of Theorem 1.1.2. This can be proved using the same arguments and we leave the proof as an exercise for the reader. Comparison of the above results with the upper bounds for NP-hard problems, which are considered a classical example of very difficult problems in combinatorial optimization, is also quite disappointing: hard combinatorial problems need only 2^n a.o.!
To conclude this section, let us compare our situation with that in some other fields of numerical analysis. It is well known that the uniform grid approach is a standard tool in many domains. For example, if we need
I = ∫_0^1 f(x) dx,
Note that in our terminology this is exactly the uniform grid approach.
Moreover, that is a standard way for approximating the integrals. The
reason why it works here lies in the dimension of problems. For inte-
gration the standard dimensions are very small (up to three), and in
optimization sometimes we need to solve problems with several millions
of variables.
a_{k+1} ≤ a_k  ∀k ≥ 0.

lim_{r↓0} (1/r) o(r) = 0,  o(0) = 0.
In the sequel we fix the notation ‖ · ‖ for the standard Euclidean norm in R^n:
x ∈ L := {x ∈ R^n | Ax = b} ≠ ∅,

where A is an m × n matrix and b ∈ R^m, m < n. Then there exists a vector of multipliers λ* such that

f'(x*) = A^T λ*.  (1.2.2)
Proof: Consider some vectors u_i, i = 1 ... k, that form a basis of the null space of matrix A. Then any x ∈ L can be represented as follows:

x = x(y) = x* + Σ_{i=1}^k y^(i) u_i,  y ∈ R^k.
(f''(x))^(i,j) = ∂²f(x) / ∂x^(i) ∂x^(j).

It is called the Hessian of function f at x. Note that the Hessian is a symmetric matrix:

f''(x) = [f''(x)]^T.

The Hessian can be seen as the derivative of the vector function f'(x):
⟨Ax, x⟩ ≥ 0  ∀x ∈ R^n.

Notation A ≻ 0 means that A is positive definite (the above inequality must be strict for x ≠ 0).

f(y) ≥ f(x*).

In view of Theorem 1.2.1, f'(x*) = 0. Therefore, for any such y,

f(y) = f(x*) + ½⟨f''(x*)(y − x*), y − x*⟩ + o(‖y − x*‖²) ≥ f(x*).
for all x, y ∈ Q.
Clearly, we always have p ≤ k. If q ≥ k, then C_L^{q,p}(Q) ⊆ C_L^{k,p}(Q). For example, C_L^{2,1}(Q) ⊆ C_L^{1,1}(Q). Note also that these classes possess the

for all x, y ∈ R^n. Let us give a sufficient condition for that inclusion.

LEMMA 1.2.2 Function f(x) belongs to C_L^{2,1}(R^n) ⊂ C_L^{1,1}(R^n) if and only if

‖f''(x)‖ ≤ L,  ∀x ∈ R^n.  (1.2.4)
≤ ‖∫_0^1 f''(x + τ(y − x)) dτ‖ · ‖y − x‖

≤ ∫_0^1 ‖f''(x + τ(y − x))‖ dτ · ‖y − x‖

≤ L ‖y − x‖.
On the other hand, if f ∈ C_L^{2,1}(R^n), then for any s ∈ R^n and α > 0, we have

f'(x) = a,  f''(x) = 0.

2. For the quadratic function f(x) = α + ⟨a, x⟩ + ½⟨Ax, x⟩ with A = A^T we have

f'(x) = a + Ax,  f''(x) = A.

Therefore f(x) ∈ C_L^{1,1}(R^n) with L = ‖A‖.

3. Consider the function of one variable f(x) = √(1 + x²), x ∈ R¹. We have

f'(x) = x / √(1 + x²),  f''(x) = 1 / (1 + x²)^{3/2} ≤ 1.

Therefore f(x) ∈ C_1^{2,1}(R). □
LEMMA 1.2.3 Let f ∈ C_L^{1,1}(R^n). Then for any x, y from R^n we have

|f(y) − f(x) − ⟨f'(x), y − x⟩| ≤ (L/2) ‖y − x‖².  (1.2.5)

Proof: For any x, y ∈ R^n,

f(y) = f(x) + ⟨f'(x), y − x⟩ + ∫_0^1 ⟨f'(x + τ(y − x)) − f'(x), y − x⟩ dτ.
Therefore

|f(y) − f(x) − ⟨f'(x), y − x⟩|

= |∫_0^1 ⟨f'(x + τ(y − x)) − f'(x), y − x⟩ dτ|

≤ ∫_0^1 |⟨f'(x + τ(y − x)) − f'(x), y − x⟩| dτ

≤ ∫_0^1 ‖f'(x + τ(y − x)) − f'(x)‖ · ‖y − x‖ dτ

≤ ∫_0^1 τL ‖y − x‖² dτ = (L/2) ‖y − x‖².
Then the graph of function f is located between the graphs of φ₁ and φ₂.
Let us prove a similar result for the class of twice differentiable functions. Our main class of functions of that type will be C_M^{2,2}(R^n), the class of twice differentiable functions with Lipschitz continuous Hessian. Recall that for f ∈ C_M^{2,2}(R^n) we have
f'(y) = f'(x) + f''(x)(y − x) + ∫_0^1 (f''(x + τ(y − x)) − f''(x))(y − x) dτ.
Therefore

‖f'(y) − f'(x) − f''(x)(y − x)‖

= ‖∫_0^1 (f''(x + τ(y − x)) − f''(x))(y − x) dτ‖

≤ ∫_0^1 ‖(f''(x + τ(y − x)) − f''(x))(y − x)‖ dτ

≤ ∫_0^1 ‖f''(x + τ(y − x)) − f''(x)‖ · ‖y − x‖ dτ

≤ ∫_0^1 τM ‖y − x‖² dτ = (M/2) ‖y − x‖².
Gradient method

0. Choose x₀ ∈ R^n.

1. Iterate x_{k+1} = x_k − h_k f'(x_k), k = 0, 1, ....  (1.2.9)

h_k = h / √(k + 1).

2. Full relaxation:

h_k = arg min_{h≥0} f(x_k − h f'(x_k)).  (1.2.10)
with f ∈ C_L^{1,1}(R^n), and assume that f(x) is bounded below on R^n.
Let us evaluate the result of one gradient step. Consider y = x − h f'(x). Then, in view of (1.2.5), we have

Thus, in order to get the best estimate for the possible decrease of the objective function, we have to solve the following one-dimensional problem:

(1.2.13)

(1.2.14)
However, we can also say something about the convergence rate. Indeed, denote

g*_N = min_{0≤k≤N} g_k,

where g_k = ‖f'(x_k)‖. Then, in view of (1.2.14), we come to the following inequality:

g*_N ≤ (1/√(N + 1)) [2L(f(x₀) − f*)]^{1/2}.  (1.2.15)
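The O(1/√N) decay of the smallest observed gradient norm is easy to watch numerically. A minimal sketch of scheme (1.2.9) with a constant step; the objective below is an illustrative assumption, not the book's example:

```python
import math

def gradient_method(grad, x0, h, steps):
    """Scheme (1.2.9): x_{k+1} = x_k - h f'(x_k) with a constant step h."""
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] - h * grad(xs[-1]))
    return xs

# Illustrative C_L^{1,1} function (assumption): f(x) = log cosh x,
# whose gradient is f'(x) = tanh x with Lipschitz constant L = 1, so h = 1/L = 1.
xs = gradient_method(math.tanh, x0=3.0, h=1.0, steps=50)
g_star = min(abs(math.tanh(x)) for x in xs)   # g*_N = min_k ||f'(x_k)||
```

Note that the guarantee (1.2.15) only bounds the smallest gradient norm along the trajectory; it says nothing about convergence of f(x_k) to a minimum, which is exactly the limitation discussed in the next example.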
f''(x) = ( 1, 0; 0, 3(x^(2))² − 1 ),

we conclude that x̄₂ and x̄₃ are isolated local minima¹, but x̄₁ is only a stationary point of our function. Indeed, f(x̄₁) = 0 and f(x̄₁ + εe₂) = ε⁴/4 − ε²/2 < 0 for ε small enough.
Now, let us consider the trajectory of the gradient method which starts from x₀ = (1, 0). Note that the second coordinate of this point is zero. Therefore, the second coordinate of f'(x₀) is also zero. Consequently, the second coordinate of x₁ is zero, etc. Thus, the entire sequence of points generated by the gradient method will have the second coordinate equal to zero. This means that this sequence converges to x̄₁.
To conclude our example, note that this situation is typical for all first-order unconstrained minimization methods. Without additional rather strict assumptions, it is impossible to guarantee their global convergence to a local minimum; only convergence to a stationary point can be guaranteed. □
(1.2.16)

Oracle: first-order black box.

Let us check what can be said about the local convergence of the gradient method. Consider the unconstrained minimization problem

min_{x∈R^n} f(x)
where G_k = ∫_0^1 f''(x* + τ(x_k − x*)) dτ. Therefore

(1.2.18)

for small enough h_k. In this case we will have r_{k+1} < r_k.
As usual, many step-size strategies are available. For example, we can choose h_k = 1/L. Let us consider the "optimal" strategy consisting in minimizing the right-hand side of (1.2.18):
Assume that r₀ < r̄. Then, if we form the sequence {x_k} using the optimal strategy, we can be sure that r_{k+1} < r_k < r̄. Further, the optimal step size h*_k can be found from the equation:

r_{k+1} ≤ ((L − l) r_k)/(L + l) + (M r_k²)/(L + l).

Therefore

1/r_{k+1} − 1/r̄ ≥ (1 + q)(1/r_k − 1/r̄),  where q = 2l/(L + l).

Hence,

1/r_k − 1/r̄ ≥ (1 + q)^k (1/r₀ − 1/r̄).

Thus,

r_k ≤ (r̄ r₀)/(r₀ + (1 + q)^k (r̄ − r₀)) ≤ (r̄ r₀)/(r̄ − r₀) · (1/(1 + q))^k.

This proves the following theorem.

THEOREM 1.2.4 Let function f(x) satisfy our assumptions and let the starting point x₀ be close enough to a local minimum:

r₀ = ‖x₀ − x*‖ < r̄ = 2l/M.

Then the gradient method with step size (1.2.19) converges as follows:

r_k ≤ (r̄ r₀)/(r̄ − r₀) · (1 − 2l/(L + 3l))^k.
φ(t*) = 0.

φ(t) + φ'(t)Δt = 0.

We can expect that the solution of this equation, the displacement Δt, is a good approximation to the optimal displacement Δt* = t* − t. Converting this idea into an algorithmic form, we get the process

t_{k+1} = t_k − φ(t_k)/φ'(t_k).
Note that we can obtain the process (1.2.21) using the idea of quadratic approximation. Consider this approximation, computed with respect to the point x_k:

has two serious drawbacks. Firstly, it can break down if f''(x_k) is degenerate. Secondly, the Newton process can diverge. Let us look at the following example.
EXAMPLE 1.2.4 Let us apply the Newton method to finding a root of the following function of one variable:

φ(t) = t / √(1 + t²).

Clearly, t* = 0. Note that

φ'(t) = 1 / (1 + t²)^{3/2}.

Therefore the Newton process looks as follows:

t_{k+1} = t_k − φ(t_k)/φ'(t_k) = t_k − (t_k / √(1 + t_k²)) · (1 + t_k²)^{3/2} = −t_k³.
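The cubing behavior t_{k+1} = −t_k³ is easy to observe numerically. A sketch of process (1.2.21) on this φ; the starting points and step counts are illustrative choices, not from the book:

```python
import math

def newton_root(phi, dphi, t0, steps):
    """Newton's process (1.2.21): t_{k+1} = t_k - phi(t_k)/phi'(t_k)."""
    t = t0
    for _ in range(steps):
        t -= phi(t) / dphi(t)
    return t

phi = lambda t: t / math.sqrt(1.0 + t * t)
dphi = lambda t: (1.0 + t * t) ** (-1.5)

# Since the process reduces to t_{k+1} = -t_k^3, behavior depends on |t0|:
t_small = newton_root(phi, dphi, t0=0.5, steps=5)   # |t0| < 1: tends to 0
t_unit  = newton_root(phi, dphi, t0=1.0, steps=5)   # |t0| = 1: oscillates +-1
t_large = newton_root(phi, dphi, t0=1.1, steps=4)   # |t0| > 1: diverges
```

So the very fast local convergence for |t₀| < 1 coexists with divergence for |t₀| > 1, which is the second drawback mentioned above.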
1. f ∈ C_M^{2,2}(R^n).

= x_k − x* − [f''(x_k)]⁻¹ ∫_0^1 f''(x* + τ(x_k − x*))(x_k − x*) dτ

where G_k = ∫_0^1 [f''(x_k) − f''(x* + τ(x_k − x*))] dτ.

Denote r_k = ‖x_k − x*‖. Then

‖G_k‖ = ‖∫_0^1 [f''(x_k) − f''(x* + τ(x_k − x*))] dτ‖

≤ ∫_0^1 ‖f''(x_k) − f''(x* + τ(x_k − x*))‖ dτ

≤ ∫_0^1 M(1 − τ) r_k dτ = (r_k/2) M.
r_k ≤ c(1 − q)^k.
min_{x∈R^n} f(x),

(see Lemma 1.2.3). This fact is responsible for the global convergence of the gradient method. Further, consider the quadratic approximation of function f(x):

φ₂(x) = f(x̄) + ⟨f'(x̄), x − x̄⟩ + ½⟨f''(x̄)(x − x̄), x − x̄⟩.

We have already seen that the minimum of this function is

x̄₂* = x̄ − [f''(x̄)]⁻¹ f'(x̄),

and that is exactly the iterate of the Newton method.
Thus, we can try to use some approximations of function f(x) which are better than φ₁(x) and less expensive than φ₂(x).
Let G be a positive definite n × n matrix. Denote

where λ_n(A) and λ₁(A) are the smallest and the largest eigenvalues of the matrix A. However, the gradient and the Hessian computed with respect to the new inner product change:
□
The variable metric schemes differ from one another only in the implementation of Step 1d), which updates matrix H_k. For that, they use new information accumulated at Step 1c), namely the gradient f'(x_{k+1}). The idea is justified by the following property of a quadratic function. Let

f(x) = α + ⟨a, x⟩ + ½⟨Ax, x⟩,  f'(x) = Ax + a.

Then, for any x, y ∈ R^n we have f'(x) − f'(y) = A(x − y). This identity explains the origin of the so-called quasi-Newton rule.

Quasi-Newton rule

H_{k+1} γ_k = δ_k,  where γ_k = f'(x_{k+1}) − f'(x_k), δ_k = x_{k+1} − x_k.

Actually, there are many ways to satisfy this relation. Below we present several examples of schemes that are usually recommended as the most efficient ones.
− (H_k γ_k γ_k^T H_k) / ⟨H_k γ_k, γ_k⟩,

where β_k = 1 + ⟨γ_k, δ_k⟩ / ⟨H_k γ_k, γ_k⟩.
Clearly, there are many other possibilities. From the computational point of view, BFGS is considered the most stable scheme. □

Note that for quadratic functions the variable metric methods usually terminate in n iterations. In a neighborhood of a strict minimum they have a superlinear rate of convergence: for any x₀ ∈ R^n there exists a number N such that for all k ≥ N we have
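The finite termination on quadratics can be checked directly. Below is a sketch of a variable metric iteration with the standard BFGS update of the inverse-Hessian approximation H_k; the test matrix, the exact line search (possible only because the function is quadratic), and the two-iteration budget are illustrative assumptions, not the book's scheme:

```python
import numpy as np

def bfgs_quadratic(A, b, x0, iters):
    """Variable metric sketch for f(x) = 1/2 <Ax, x> - <b, x>: H_k approximates
    [f''(x)]^{-1} and is corrected by the standard BFGS update, which enforces
    the quasi-Newton relation H_{k+1} gamma_k = delta_k."""
    n = len(b)
    x = np.asarray(x0, dtype=float)
    H = np.eye(n)                        # initial metric H_0 = I
    g = A @ x - b                        # f'(x)
    for _ in range(iters):
        d = H @ g                        # search direction
        h = (g @ d) / (d @ (A @ d))      # exact line search (quadratic case)
        x_new = x - h * d
        g_new = A @ x_new - b
        delta, gamma = x_new - x, g_new - g
        rho = 1.0 / (gamma @ delta)
        V = np.eye(n) - rho * np.outer(delta, gamma)
        H = V @ H @ V.T + rho * np.outer(delta, delta)   # BFGS update of H_k
        x, g = x_new, g_new
    return x

A = np.array([[3.0, 1.0], [1.0, 2.0]])   # illustrative positive definite matrix
b = np.array([1.0, 1.0])
x = bfgs_quadratic(A, b, np.zeros(2), iters=2)   # n = 2 iterations terminate
```

With exact line search the iterates reach the minimizer A⁻¹b after n = 2 steps, in agreement with the termination property stated above.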
The next result helps to understand the behavior of the sequence {x_k}.

LEMMA 1.3.2 For any k, i ≥ 0, k ≠ i, we have ⟨f'(x_k), f'(x_i)⟩ = 0.

Proof: Let k > i. Consider the function

The last auxiliary result explains the name of the method. Denote δ_i = x_{i+1} − x_i. It is clear that L_k = Lin{δ₀, ..., δ_{k−1}}.

LEMMA 1.3.3 For any k ≠ i we have ⟨Aδ_k, δ_i⟩ = 0.
(Such directions are called conjugate with respect to A.)

Proof: Without loss of generality we can assume that k > i. Then
Let us show how we can write down the conjugate gradient method in a more algorithmic form. Since L_k = Lin{δ₀, ..., δ_{k−1}}, we can represent x_{k+1} as follows:

x_{k+1} = x_k − h_k f'(x_k) + Σ_{j=0}^{k−1} λ^(j) δ_j.
(1.3.4)

In that scheme we did not yet specify the coefficient β_k. In fact, there are many different formulas for this coefficient. All of them give the same result on quadratic functions, but in the general nonlinear case they generate different sequences. Let us present three of the most popular expressions.

1.

2.

3. Polak-Ribière: β_k = −⟨f'(x_{k+1}), f'(x_{k+1}) − f'(x_k)⟩ / ‖f'(x_k)‖².
Note that this local convergence is slower than that of the variable metric methods. However, the conjugate gradient schemes have the advantage of a very cheap iteration. As far as global convergence is concerned, the conjugate gradient schemes, in general, are no better than the gradient method.
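On a quadratic, any of these coefficients combined with exact line search terminates in n steps. A sketch using the Polak-Ribière coefficient (the sign convention depends on how the direction is updated; the matrix below is an illustrative assumption, not from the book):

```python
import numpy as np

def cg_quadratic(A, b, x0, iters):
    """Conjugate gradient sketch for f(x) = 1/2 <Ax, x> - <b, x> with the
    Polak-Ribiere coefficient and exact line search (illustrative only)."""
    x = np.asarray(x0, dtype=float)
    g = A @ x - b                        # f'(x)
    p = g.copy()                         # initial direction: the gradient
    for _ in range(iters):
        h = (g @ p) / (p @ (A @ p))      # exact one-dimensional minimization
        x = x - h * p
        g_new = A @ x - b
        beta = g_new @ (g_new - g) / (g @ g)   # Polak-Ribiere coefficient
        p = g_new + beta * p             # new search direction
        g = g_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])   # illustrative positive definite matrix
b = np.array([1.0, 2.0])
x = cg_quadratic(A, b, np.zeros(2), iters=2)   # n = 2 steps suffice
```

Each iteration needs only one matrix-vector product and a few inner products, which is the "very cheap iteration" mentioned above.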
2 In fact, that is not absolutely true. We will see that in order to apply an unconstrained minimization method to solving constrained problems, we need to be able to find at least a strict local minimum. And we have already seen (Example 1.2.2) that this could pose a problem.
3 We are not going to discuss the correctness of this statement for general nonlinear problems. We just prevent the reader from extending it onto other problem classes. In the next chapters we will have a possibility to see that this statement is not always true.
We start from penalty function methods.

DEFINITION 1.3.1 A continuous function Φ(x) is called a penalty function for a closed set Q if

• Φ(x) = 0 for any x ∈ Q,
• Φ(x) > 0 for any x ∉ Q.

Sometimes a penalty function is called simply a penalty. The main property of penalty functions is as follows.

If Φ₁(x) is a penalty for Q₁ and Φ₂(x) is a penalty for Q₂, then Φ₁(x) + Φ₂(x) is a penalty for the intersection Q₁ ∩ Q₂.

Let us give several examples of such functions.

EXAMPLE 1.3.3 Denote (a)₊ = max{a, 0}. Let

Q = {x ∈ R^n | f_i(x) ≤ 0, i = 1 ... m}.

Then the following functions are penalties for Q:

1. Quadratic penalty: Φ(x) = Σ_{i=1}^m ((f_i(x))₊)².

2. Nonsmooth penalty: Φ(x) = Σ_{i=1}^m (f_i(x))₊.
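A minimal sketch of the sequential unconstrained minimization idea with the quadratic penalty from Example 1.3.3. The concrete problem, the inner gradient-descent solver, the step size, and the coefficient schedule t_k are all illustrative assumptions, not the book's prescriptions:

```python
def penalty_method(grad_f0, grad_pen, x0, t_ks, inner_steps, lr):
    """Sequential unconstrained minimization sketch: for an increasing
    sequence of coefficients t_k, approximately minimize
    Psi_k(x) = f0(x) + t_k * Phi(x) by plain gradient descent."""
    x = list(x0)
    for t in t_ks:
        for _ in range(inner_steps):
            g = [gf + t * gp for gf, gp in zip(grad_f0(x), grad_pen(x))]
            x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x

# Hypothetical problem: minimize f0(x) = x1 + x2 subject to
# f1(x) = x1^2 + x2^2 - 2 <= 0, with the quadratic penalty
# Phi(x) = ((f1(x))_+)^2; the constrained solution is (-1, -1).
viol = lambda x: max(x[0] ** 2 + x[1] ** 2 - 2.0, 0.0)
grad_f0 = lambda x: [1.0, 1.0]
grad_pen = lambda x: [4.0 * viol(x) * x[0], 4.0 * viol(x) * x[1]]

x = penalty_method(grad_f0, grad_pen, [0.0, 0.0],
                   t_ks=[1.0, 10.0, 100.0], inner_steps=5000, lr=2e-4)
```

Note that the iterates approach the feasible set from outside: the penalty is zero inside Q, so each auxiliary minimizer sits slightly beyond the boundary and moves inward as t_k grows.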
Proof: Note that Ψ*_k ≤ Ψ_k(x*) = f₀(x*). At the same time, for any x ∈ R^n we have Ψ_{k+1}(x) ≥ Ψ_k(x). Therefore Ψ*_{k+1} ≥ Ψ*_k. Thus, there exists a limit lim_{k→∞} Ψ*_k = Ψ* ≤ f*. If t_k > 1 then

Therefore, the sequence {x_k} has limit points. Since lim_{k→∞} t_k = +∞, for any such point x̄ we have Φ(x̄) = 0 and f₀(x̄) ≤ f₀(x*). Thus x̄ ∈ Q and

Ψ* = f₀(x̄) + Φ(x̄) = f₀(x̄) ≥ f₀(x*).  □
Note that this result is very general, but not too informative. There are still many questions which should be answered. For example, we do not know what kind of penalty function we should use. What should be the rules for choosing the penalty coefficients? What should be the accuracy for solving the auxiliary problems? The main feature of these questions is that they can hardly be addressed in the framework of general nonlinear optimization theory. Traditionally, they are considered questions to be answered by computational practice.
Let us look at the barrier methods.

DEFINITION 1.3.2 Let Q be a closed set with nonempty interior. A continuous function F(x) is called a barrier function for Q if F(x) → ∞ when x approaches the boundary of Q.
4 If we assume that it is a strict local minimum, then the result is much weaker.
(Ψ*_k is the global optimal value of Ψ_k(x)). And let f* be the optimal value of problem (1.3.5).

THEOREM 1.3.2 Let barrier F(x) be bounded below on Q. Then

lim_{k→∞} Ψ*_k = f*.

Proof: Let F(x) ≥ F* for all x ∈ Q. For arbitrary x̄ ∈ int Q we have

Ψ*_k = min_{x∈Q} { f₀(x) + (1/t_k) F(x) } ≥ min_{x∈Q} { f₀(x) + (1/t_k) F* } = f* + (1/t_k) F*.
The same as with the penalty function method, there are many questions to be answered. We do not know how to find the starting point x₀ and how to choose the best barrier function. We do not know the rules for updating the penalty coefficients and the acceptable accuracy of the solutions to the auxiliary problems. Finally, we have no idea about the efficiency estimates of this process. And the reason is not a lack of theory; our problem (1.3.5) is just too complicated. We will see that all of the above questions get precise answers in the framework of convex optimization.
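All the same, the mechanics of the scheme are easy to illustrate. A one-dimensional sketch with the logarithmic barrier F(x) = −ln(−f1(x)); this particular barrier, the toy problem, and all step-size constants are assumptions made for illustration, not the book's choices:

```python
def barrier_method(f0p, f1, f1p, x0, t_ks, inner_steps, lr):
    """Barrier method sketch in one dimension for min f0(x) s.t. f1(x) <= 0:
    approximately minimize Psi_k(x) = f0(x) + (1/t_k) F(x), where
    F(x) = -ln(-f1(x)), so F'(x) = f1'(x) / (-f1(x))."""
    x = x0
    for t in t_ks:
        for _ in range(inner_steps):
            g = f0p(x) + (1.0 / t) * f1p(x) / (-f1(x))
            step = lr * g
            while f1(x - step) >= 0.0:   # crude safeguard: stay strictly inside
                step *= 0.5
            x -= step
    return x

# Hypothetical problem: minimize f0(x) = x subject to f1(x) = 1 - x <= 0,
# i.e. x >= 1; the solution x* = 1 is approached from the interior.
f1 = lambda x: 1.0 - x
x = barrier_method(lambda x: 1.0, f1, lambda x: -1.0,
                   x0=2.0, t_ks=[1.0, 10.0, 100.0],
                   inner_steps=500, lr=0.01)
```

In contrast with the penalty method, the iterates stay strictly feasible: for each t_k the auxiliary minimizer here is x = 1 + 1/t_k, which approaches the boundary from inside as t_k grows.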
We have finished our brief presentation of general nonlinear optimiza-
tion. It was really very short and there are many interesting theoreti-
cal topics that we did not mention. That is because the main goal of
this book is to describe the areas of optimization in which we can ob-
tain some clear and complete results on the performance of numerical
methods. Unfortunately, the general nonlinear optimization is just too
complicated to fit the goal. However, it is impossible to skip this field
since a lot of basic ideas, underlying the convex optimization methods,
have their origin in general nonlinear optimization theory. The gradient
method and the Newton method, sequential unconstrained minimization
and barrier functions were originally developed and used for general op-
timization problems. But only the framework of convex optimization
allows these ideas to get their real power. In the next chapters of this
book we will see many examples of the second birth of these old ideas.
Chapter 2
where the function f(x) is smooth enough. Recall that in the previous chapter we were trying to solve this problem under very weak assumptions on function f. And we have seen that in this general situation we
cannot do too much: It is impossible to guarantee convergence even to a
local minimum, impossible to get acceptable bounds on the global per-
formance of minimization schemes, etc. Let us try to introduce some rea-
sonable assumptions on function f to make our problern more tractable.
For that, let us try to determine the desired properties of a class of differentiable functions F we want to work with.
From the results of the previous chapter we can get the impression that the main reason for our troubles is the weakness of the first-order optimality condition (Theorem 1.2.1). Indeed, we have seen that, in general, the gradient method converges only to a stationary point of function f (see inequality (1.2.15) and Example 1.2.2). Therefore the
first additional property we definitely need is as follows.
1 This is not a description of the whole set of basic elements. We just say that we want to
have linear functions in our class.
LEMMA 2.1.1 If f₁ and f₂ belong to F¹(R^n) and α, β ≥ 0, then the function f = αf₁ + βf₂ also belongs to F¹(R^n).
Multiplying the first inequality by (1 − α), the second one by α, and adding the results, we get (2.1.3).
Let (2.1.3) be true for all x, y ∈ R^n and α ∈ [0, 1]. Let us choose some α ∈ [0, 1). Then

f(y) ≥ (1/(1 − α)) [f(x_α) − α f(x)] = f(x) + (1/(1 − α)) [f(x_α) − f(x)]

= f(x) + ⟨f'(x), y − x⟩ + ∫_0^1 ⟨f'(x_τ) − f'(x), y − x⟩ dτ

= f(x) + ⟨f'(x), y − x⟩ + ∫_0^1 (1/τ) ⟨f'(x_τ) − f'(x), x_τ − x⟩ dτ
f(x) = Σ_{i=1}^m e^{α_i + ⟨a_i, x⟩}

is convex too. □
≤ ½ L ‖y − x‖².

Let us prove the last two inequalities. Denote x_α = αx + (1 − α)y. Then, using (2.1.7) we get

f(x) ≥ f(x_α) + ⟨f'(x_α), (1 − α)(x − y)⟩ + (1/(2L)) ‖f'(x) − f'(x_α)‖²,

α ‖g₁ − u‖² + (1 − α) ‖g₂ − u‖² ≥ α(1 − α) ‖g₁ − g₂‖²,

and

≤ L Σ_{i=1}^n (s^(i))².
f_k'(x) = A_k x − e₁ = 0

has the following unique solution:

x̄_k^(i) = 1 − i/(k + 1) for i = 1 ... k;  x̄_k^(i) = 0 for k + 1 ≤ i ≤ n.

Hence, the optimal value of function f_k is

(2.1.15)

we have L_k ⊆ R^{k,n}.
THEOREM 2.1.7 For any k, 1 ≤ k ≤ ½(n − 1), and any x₀ ∈ R^n there exists a function f ∈ F_L^{∞,1}(R^n) such that for any first-order method M satisfying Assumption 2.1.4 we have

= k + 1 − (1/(k+1)) Σ_{i=k+1}^{2k+1} i + (1/(4(k+1)²)) Σ_{i=k+1}^{2k+1} i²

= (2k² + 7k + 6)/(24(k+1))

≥ ((2k² + 7k + 6)/(16(k+1)²)) ‖x₀ − x̄_{2k+1}‖²

≥ (1/8) ‖x₀ − x*‖².  □
The above theorem is valid only under the assumption that the number of steps of the iterative scheme is not too large as compared with the dimension of the space (k ≤ ½(n − 1)). Complexity bounds of that type are called uniform in the dimension of variables. Clearly, they are valid for very large problems, in which we cannot wait even for n iterates of the method. However, even for problems with a moderate dimension, these bounds provide us with some information. Firstly, they describe the potential performance of numerical methods at the initial stage of the minimization process. And secondly, they warn us that without a direct use of finite-dimensional arguments we cannot get better complexity for any numerical scheme.
To conclude this section, let us note that the obtained lower bound for the value of the objective function is rather optimistic. Indeed, after one hundred iterations we could decrease the initial residual by a factor of 10⁴. However, the result on the behavior of the minimizing sequence is quite disappointing: the convergence to the optimal point can be arbitrarily slow. Since that is a lower bound, this conclusion is inevitable for our problem class. The only thing we can do is to try to find problem classes in which the situation could be better. That is the goal of the next section.
≥ f(x*) + ½ μ ‖x − x*‖².  □
Note that the class S₀¹(R^n) coincides with F¹(R^n). Therefore the addition of a convex function to a strongly convex function gives a strongly convex function with the same convexity parameter.
Let us give several equivalent definitions of strongly convex functions.

THEOREM 2.1.9 Let f be continuously differentiable. Both conditions below, holding for all x, y ∈ R^n and α ∈ [0, 1], are equivalent to the inclusion f ∈ S_μ¹(R^n):

The proof of this theorem is very similar to the proof of Theorem 2.1.5 and we leave it as an exercise for the reader.
The next statement is sometimes useful.

THEOREM 2.1.10 If f ∈ S_μ¹(R^n), then for any x and y from R^n we have
φ(x*) = min_v φ(v) ≥ min_v [φ(y) + ⟨φ'(y), v − y⟩ + ½ μ ‖v − y‖²] = φ(y) − (1/(2μ)) ‖φ'(y)‖²,

and that is exactly (2.1.19). Adding two copies of (2.1.19) with x and y interchanged, we get (2.1.20). □
since f''(x) = A.
Other examples can be obtained as sums of convex and strongly convex functions. □
Proof: Denote φ(x) = f(x) − ½ μ ‖x‖². Then φ'(x) = f'(x) − μx; hence, by (2.1.22) and (2.1.9), φ ∈ F_{L−μ}^{1,1}(R^n). If μ = L, then (2.1.24) is proved. If μ < L, then by (2.1.8) we have

⟨φ'(x) − φ'(y), x − y⟩ ≥ (1/(L − μ)) ‖φ'(x) − φ'(y)‖²,

and that is exactly (2.1.24). □
but the corresponding reasoning is more complicated.
Consider l₂, the space of all sequences x = {x^(i)}_{i=1}^∞ with finite norm

‖x‖² = Σ_{i=1}^∞ (x^(i))² < ∞.
Let us choose some parameters μ > 0 and Q_f > 1, which define the following function

Denote

A = ( 2 −1 0 ⋯ ; −1 2 −1 ⋯ ; 0 −1 2 ⋯ ; ⋯ ).

Then f''(x) = (μ(Q_f − 1)/4) A + μI, where I is the unit operator in R^∞. In the previous section we have already seen that 0 ⪯ A ⪯ 4I. Therefore

This means that f_{μ,Q_f} ∈ S_{μ,μQ_f}^{∞,1}(R^∞). Note that the condition number of function f_{μ,Q_f} is

Q_{f_{μ,Q_f}} = μQ_f / μ = Q_f.
Let us find the minimum of function f_{μ,Q_f}. The first-order optimality condition

f'_{μ,Q_f}(x) = ((μ(Q_f − 1)/4) A + μI) x − (μ(Q_f − 1)/4) e₁ = 0

can be written as

(A + (4/(Q_f − 1)) I) x = e₁.
The coordinate form of this equation is as follows:
$$\begin{array}{rcl}
2\,\frac{Q_f+1}{Q_f-1}\,x^{(1)} - x^{(2)} & = & 1, \\[4pt]
x^{(k+1)} - 2\,\frac{Q_f+1}{Q_f-1}\,x^{(k)} + x^{(k-1)} & = & 0, \quad k = 2, \ldots
\end{array} \eqno(2.1.25)$$
Let $q$ be the smallest root of the equation
$$q^2 - 2\,\frac{Q_f+1}{Q_f-1}\,q + 1 = 0,$$
that is $q = \frac{\sqrt{Q_f}-1}{\sqrt{Q_f}+1}$. Then the sequence $(x^*)^{(k)} = q^k$, $k = 1, 2, \ldots$, satisfies
the system (2.1.25). Thus, we come to the following result.
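The claim above is easy to check numerically. The sketch below (our own, not part of the text; the variable names are ours) verifies that $q = (\sqrt{Q_f}-1)/(\sqrt{Q_f}+1)$ solves the characteristic equation and that the sequence $q^k$ satisfies the system (2.1.25):

```python
import math

# Sanity check: for a chosen condition number Q_f > 1, verify that
# q = (sqrt(Q_f)-1)/(sqrt(Q_f)+1) solves q^2 - 2*(Q_f+1)/(Q_f-1)*q + 1 = 0,
# and that x*(k) = q^k satisfies the tridiagonal system (2.1.25).
Qf = 10.0
q = (math.sqrt(Qf) - 1.0) / (math.sqrt(Qf) + 1.0)
c = 2.0 * (Qf + 1.0) / (Qf - 1.0)

char_residual = q * q - c * q + 1.0            # characteristic equation

x = [q ** k for k in range(1, 50)]             # x[k-1] = (x*)^(k)
first_eq_residual = c * x[0] - x[1] - 1.0      # 2 (Qf+1)/(Qf-1) x(1) - x(2) = 1
interior_residuals = [x[k + 1] - c * x[k] + x[k - 1] for k in range(1, 48)]
```

All three residuals vanish up to rounding error, for any choice of $Q_f > 1$.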
68 INTRODUCTORY LECTURES ON CONVEX OPTIMIZATION
THEOREM 2.1.13 For any $x_0 \in R^\infty$ and any constants $\mu > 0$, $Q_f > 1$,
there exists a function $f \in \mathcal{S}^{\infty,1}_{\mu,\mu Q_f}(R^\infty)$ such that for any first-order
method $\mathcal{M}$ satisfying Assumption 2.1.4 we have
$$\|x_k - x^*\|^2 \ge \left(\frac{\sqrt{Q_f}-1}{\sqrt{Q_f}+1}\right)^{2k}\|x_0 - x^*\|^2,$$
Gradient method
0. Choose x 0 ERn.
THEOREM 2.1.14 Let $f \in \mathcal{F}^{1,1}_L(R^n)$ and $0 < h < \frac{2}{L}$. Then the gradient
method generates a sequence $\{x_k\}$, which converges as follows:
Proof: Denote $r_k = \|x_k - x^*\|$. Then
$$r_{k+1}^2 = \|x_k - x^* - h f'(x_k)\|^2 = r_k^2 - 2h\langle f'(x_k), x_k - x^*\rangle + h^2\|f'(x_k)\|^2 \le r_k^2 - h\left(\tfrac{2}{L} - h\right)\|f'(x_k)\|^2$$
(we use (2.1.8) and $f'(x^*) = 0$). Therefore $r_k \le r_0$. In view of (2.1.6)
we have
$$\frac{1}{\Delta_{k+1}} \ge \frac{1}{\Delta_0} + \frac{\omega}{r_0^2}(k+1),$$
where $\Delta_k = f(x_k) - f^*$ and $\omega = h\left(1 - \tfrac{L}{2}h\right)$. □
In particular, for $h = \frac{1}{L}$, using $f(x_0) \le f^* + \frac{L}{2}\|x_0 - x^*\|^2$, we obtain
$$f(x_k) - f^* \le \frac{2L\|x_0 - x^*\|^2}{k+4}. \eqno(2.1.27)$$
$$\|x_k - x^*\|^2 \le \left(1 - \frac{2h\mu L}{\mu+L}\right)^k \|x_0 - x^*\|^2.$$
If $h = \frac{2}{\mu+L}$, then
$$\|x_k - x^*\| \le \left(\frac{Q_f-1}{Q_f+1}\right)^k \|x_0 - x^*\|,$$
where $Q_f = L/\mu$.
Proof: Denote $r_k = \|x_k - x^*\|$. Then
$$r_{k+1}^2 = \|x_k - x^* - hf'(x_k)\|^2 = r_k^2 - 2h\langle f'(x_k), x_k - x^*\rangle + h^2\|f'(x_k)\|^2 \le \left(1 - \frac{2h\mu L}{\mu+L}\right)r_k^2 + h\left(h - \frac{2}{\mu+L}\right)\|f'(x_k)\|^2$$
(we use (2.1.24) and $f'(x^*) = 0$). The last inequality in the theorem
follows from the previous one and (2.1.6). □
Recall that we have already seen the step-size rule $h = \frac{2}{\mu+L}$ and
the linear rate of convergence of the gradient method in Section 1.2.3,
Theorem 1.2.4. But that was only a local result.
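The global linear rate is easy to observe in practice. The following minimal experiment (our own, not from the text) runs the gradient method with $h = 2/(\mu+L)$ on the strongly convex quadratic $f(x) = \frac{1}{2}(\mu x_1^2 + L x_2^2)$, whose minimizer is $x^* = 0$; the distance to $x^*$ contracts exactly by the factor $(Q_f-1)/(Q_f+1)$ per step:

```python
# Gradient method on f(x) = 0.5*(mu*x1^2 + L*x2^2), x* = 0.
mu, L = 1.0, 10.0
h = 2.0 / (mu + L)
Qf = L / mu
factor = (Qf - 1.0) / (Qf + 1.0)       # per-step contraction of ||x_k - x*||

x = [1.0, 1.0]
norms = []
for _ in range(50):
    g = [mu * x[0], L * x[1]]          # f'(x)
    x = [x[0] - h * g[0], x[1] - h * g[1]]
    norms.append((x[0] ** 2 + x[1] ** 2) ** 0.5)
```

Here each coordinate is multiplied by $\pm$factor per step, so the theoretical rate is attained with equality.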
$$\min_{x\in R^n} f(x),$$
$$f(x_k) - f^* \le \frac{2L\|x_0 - x^*\|^2}{k+4}.$$
These estimates differ from our lower complexity bounds (Theorem 2.1.7
and Theorem 2.1.13) by an order of magnitude. Of course, in general
this does not mean that the gradient method is not optimal since the
lower bounds might be too optimistic. However, we will see that in our
case the lower bounds are exact up to a constant factor. We prove that
by constructing a method that has corresponding efficiency bounds.
Recall that the gradient method forms a relaxation sequence: $f(x_{k+1}) \le f(x_k)$.
A pair of sequences $\{\phi_k(x)\}_{k=0}^\infty$ and $\{\lambda_k\}_{k=0}^\infty$, $\lambda_k \ge 0$, is called an estimate sequence of the function $f(x)$ if $\lambda_k \to 0$ and for any $x \in R^n$ and all $k \ge 0$ we have
$$\phi_k(x) \le (1-\lambda_k)f(x) + \lambda_k\phi_0(x). \eqno(2.2.1)$$
If, in addition,
$$f(x_k) \le \phi_k^* \equiv \min_{x\in R^n}\phi_k(x), \eqno(2.2.2)$$
then $f(x_k) - f^* \le \lambda_k\left[\phi_0(x^*) - f^*\right] \to 0$.
Proof: Indeed,
Thus, for any sequence {xk}, satisfying (2.2.2) we can derive its rate
of convergence directly from the rate of convergence of sequence { Ak}.
However, at this moment we have two serious questions. Firstly, we do
not know how to form an estimate sequence. And secondly, we do not
know how we can ensure (2.2.2). The first question is simpler, so let us
answer it.
is an estimate sequence.
Proof: Indeed, $\phi_0(x) \equiv (1-\lambda_0)f(x) + \lambda_0\phi_0(x)$, since $\lambda_0 = 1$. Further, let
(2.2.1) hold for some $k \ge 0$. Then
$$\begin{array}{rcl}
\phi_{k+1}(x) & \le & (1-\alpha_k)\phi_k(x) + \alpha_k f(x) \\
& = & (1-(1-\alpha_k)\lambda_k)f(x) + (1-\alpha_k)(\phi_k(x) - (1-\lambda_k)f(x)) \\
& \le & (1-(1-\alpha_k)\lambda_k)f(x) + (1-\alpha_k)\lambda_k\phi_0(x) \\
& = & (1-\lambda_{k+1})f(x) + \lambda_{k+1}\phi_0(x).
\end{array}$$
It remains to note that condition 4) ensures $\lambda_k \to 0$. □
Thus, the above statement provides us with some rules for updating
the estimate sequence. Now we have two control sequences, which can
help to ensure inequality (2.2.2). Note that we are also free in the choice
of the initial function $\phi_0(x)$. Let us choose it as a simple quadratic function.
Then we can obtain an exact description of the way $\phi_k^*$ varies.
LEMMA 2.2.3 Let $\phi_0(x) = \phi_0^* + \frac{\gamma_0}{2}\|x - v_0\|^2$. Then the process (2.2.3)
preserves the canonical form of the functions $\{\phi_k(x)\}$:
$$\phi_k(x) = \phi_k^* + \frac{\gamma_k}{2}\|x - v_k\|^2, \eqno(2.2.4)$$
where the sequences $\{\gamma_k\}$, $\{v_k\}$ and $\{\phi_k^*\}$ are defined as follows:
$$\gamma_{k+1} = (1-\alpha_k)\gamma_k + \alpha_k\mu,$$
Proof: Note that $\phi_0''(x) = \gamma_0 I_n$. Let us prove that $\phi_k''(x) = \gamma_k I_n$ for all
$k \ge 0$. Indeed, if that is true for some $k$, then
$$\phi_{k+1}''(x) = (1-\alpha_k)\phi_k''(x) + \alpha_k\mu I_n = \left((1-\alpha_k)\gamma_k + \alpha_k\mu\right)I_n = \gamma_{k+1}I_n.$$
From that we get the equation for the point $v_{k+1}$, which is the minimum
of the function $\phi_{k+1}(x)$.
Finally, let us compute $\phi_{k+1}^*$. In view of the recursion rule for the
sequence $\{\phi_k(x)\}$, we have
$$\frac{\gamma_{k+1}}{2}\|v_{k+1} - y_k\|^2 = \frac{1}{2\gamma_{k+1}}\left[(1-\alpha_k)^2\gamma_k^2\|v_k - y_k\|^2 - 2\alpha_k(1-\alpha_k)\gamma_k\langle f'(y_k), v_k - y_k\rangle + \alpha_k^2\|f'(y_k)\|^2\right].$$
It remains to substitute this relation into (2.2.5), noting that the factor
for the term $\|y_k - v_k\|^2$ in this expression is as follows:
$$(1-\alpha_k)\frac{\gamma_k}{2} - \frac{1}{2\gamma_{k+1}}(1-\alpha_k)^2\gamma_k^2 = (1-\alpha_k)\frac{\gamma_k}{2}\left(1 - \frac{(1-\alpha_k)\gamma_k}{\gamma_{k+1}}\right). \qquad □$$
Now the situation is clearer and we are close to an algorithmic scheme. Indeed, assume that we already have $x_k$: $\phi_k^* \ge f(x_k)$. Then this inequality can be maintained
in many different ways. The simplest one is just to take the gradient
step
$$x_{k+1} = y_k - h_k f'(y_k)$$
with $h_k = \frac{1}{L}$ (see (2.1.6)). Let us define $\alpha_k$ as the positive root of the equation
$$L\alpha_k^2 = (1-\alpha_k)\gamma_k + \alpha_k\mu.$$
Then $\frac{\alpha_k^2}{2\gamma_{k+1}} = \frac{1}{2L}$ and we can replace the previous inequality accordingly.
Now we can use our freedom in the choice of $y_k$. Let us find it from the
equation
$$\frac{\alpha_k\gamma_k}{\gamma_{k+1}}(v_k - y_k) + x_k - y_k = 0.$$
That is
$$y_k = \frac{\alpha_k\gamma_k v_k + \gamma_{k+1}x_k}{\gamma_k + \alpha_k\mu}.$$
Note that in Step 1c) of this scheme we can choose any $x_{k+1}$ satisfying
the inequality
$$f(x_{k+1}) \le f(y_k) - \frac{1}{2L}\|f'(y_k)\|^2.$$
Proof: Indeed, if $\gamma_k \ge \mu$, then $\gamma_{k+1} = L\alpha_k^2 = (1-\alpha_k)\gamma_k + \alpha_k\mu \ge \mu$.
Since $\gamma_0 \ge \mu$, we conclude that this inequality is valid for all $\gamma_k$. Hence,
$\alpha_k \ge \sqrt{\frac{\mu}{L}}$, and we have proved the first inequality in (2.2.7).
Further, let us prove that $\gamma_k \ge \gamma_0\lambda_k$. Indeed, since $\lambda_0 = 1$, we can
use induction:
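The recursion for $\alpha_k$ can also be checked numerically. The sketch below (our own; the starting value $\gamma_0 = L$ is our choice) iterates $L\alpha_k^2 = (1-\alpha_k)\gamma_k + \alpha_k\mu$, $\gamma_{k+1} = L\alpha_k^2$, and confirms that $\gamma_k \ge \mu$ and $\alpha_k \ge \sqrt{\mu/L}$ throughout:

```python
import math

# Iterate L*a^2 = (1-a)*gamma + a*mu, taking the positive root of
# L*a^2 + (gamma-mu)*a - gamma = 0, and update gamma := L*a^2.
mu, L = 1.0, 100.0
gamma = L                      # gamma_0 >= mu
alphas = []
for _ in range(30):
    a = (-(gamma - mu) + math.sqrt((gamma - mu) ** 2 + 4.0 * L * gamma)) / (2.0 * L)
    alphas.append(a)
    gamma = L * a * a          # = (1-a)*gamma_old + a*mu
```

As the lemma above asserts, the sequence never drops below the threshold $\sqrt{\mu/L}$.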
where $Q_f = L/\mu$ and $R = \|x_0 - x^*\|$. Therefore, the worst-case bound
for finding $x_k$ satisfying $f(x_k) - f^* \le \epsilon$ cannot be better than
$O\!\left(\sqrt{\tfrac{L}{\mu}}\,\ln\tfrac{1}{\epsilon}\right)$ calls of the oracle.
Let us analyze a variant of the scheme (2.2.6), which uses the gradient
step for finding the point $x_{k+1}$:
$$x_{k+1} = y_k - \tfrac{1}{L}f'(y_k),$$
$$v_{k+1} = \frac{1}{\gamma_{k+1}}\left[(1-\alpha_k)\gamma_k v_k + \alpha_k\mu y_k - \alpha_k f'(y_k)\right].$$
Therefore
$$y_{k+1} = x_{k+1} + \frac{\alpha_{k+1}\gamma_{k+1}}{\gamma_{k+1} + \alpha_{k+1}\mu}(v_{k+1} - x_{k+1}) = x_{k+1} + \beta_k(x_{k+1} - x_k),$$
where
$$\beta_k = \frac{\alpha_k(1-\alpha_k)}{\alpha_k^2 + \alpha_{k+1}}.$$
Thus, we managed to get rid of $\{v_k\}$. Let us do the same with $\{\gamma_k\}$.
Note also that $\alpha_{k+1}^2 = (1-\alpha_{k+1})\alpha_k^2 + q\alpha_{k+1}$ with $q = \mu/L$, and
$$\times\left[f(x_0) - f^* + \frac{\gamma_0}{2}\|x_0 - x^*\|^2\right],$$
where $\gamma_0 = \frac{\alpha_0(\alpha_0 L - \mu)}{1-\alpha_0}$.
We do not need to prove this theorem since the initial scheme is not
changed; we change only the notation. In Theorem 2.2.3, condition (2.2.10)
is equivalent to $\gamma_0 \ge \mu$.
Scheme (2.2.9) becomes very simple if we choose $\alpha_0 = \sqrt{\frac{\mu}{L}}$ (this
corresponds to $\gamma_0 = \mu$). Then
0. Choose $y_0 = x_0 \in R^n$.
1. $k$th iteration ($k \ge 0$). (2.2.11)
$$x_{k+1} = y_k - \tfrac{1}{L}f'(y_k), \qquad y_{k+1} = x_{k+1} + \beta(x_{k+1} - x_k).$$
However, note that this process does not work for $\mu = 0$. The choice
$\gamma_0 = L$ (which changes the corresponding value of $\alpha_0$) is safer.
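A minimal run of the constant-step scheme (2.2.11) is shown below (our own experiment; the test quadratic is our choice). With $\beta = (\sqrt{L}-\sqrt{\mu})/(\sqrt{L}+\sqrt{\mu})$ the method drives $f$ to zero far faster than the plain gradient method for an ill-conditioned problem:

```python
# Scheme (2.2.11) on f(x) = 0.5*(mu*x1^2 + L*x2^2):
#   x_{k+1} = y_k - (1/L) f'(y_k),
#   y_{k+1} = x_{k+1} + beta*(x_{k+1} - x_k).
mu, L = 1.0, 100.0
beta = (L ** 0.5 - mu ** 0.5) / (L ** 0.5 + mu ** 0.5)

def grad(v):
    return [mu * v[0], L * v[1]]

x = [1.0, 1.0]
y = x[:]
for _ in range(200):
    g = grad(y)
    x_new = [y[0] - g[0] / L, y[1] - g[1] / L]
    y = [x_new[0] + beta * (x_new[0] - x[0]),
         x_new[1] + beta * (x_new[1] - x[1])]
    x = x_new

f_final = 0.5 * (mu * x[0] ** 2 + L * x[1] ** 2)
```

After 200 iterations the accuracy is consistent with the $(1-\sqrt{\mu/L})^k$ rate, i.e. roughly $0.9^{200}$ here.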
$$\min_{x\in Q} f(x), \eqno(2.2.12)$$
where $Q$ is a convex set. Note that the first-order optimality condition
$$f'(x) = 0$$
does not work here.
EXAMPLE 2.2.2 Consider the one-dimensional problem:
minx.
x~O
THEOREM 2.2.5 Let $f \in \mathcal{F}^1(R^n)$ and $Q$ be a closed convex set. The
point $x^*$ is a solution of (2.2.12) if and only if
$$\langle f'(x^*), x - x^*\rangle \ge 0$$
for all $x \in Q$.
That is a contradiction. □
2 !* + ~ II xi - x* 11 2
Thus, the value $\frac{1}{\gamma}$ can be seen as a step size for the "gradient" step
$\bar x \to x_Q(\bar x; \gamma)$.
Note that the gradient mapping is well defined in view of Theorem
2.2.6. Moreover, it is defined for all $\bar x \in R^n$, not necessarily from $Q$.
Let us write down the main property of the gradient mapping.
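For a concrete feeling of the gradient mapping, here is a small sketch (our own; the set $Q$, the quadratic $f$, and all names are our illustrative choices). For $\gamma > 0$ the point $x_Q(\bar x;\gamma)$ is the Euclidean projection of the step $\bar x - \frac{1}{\gamma}f'(\bar x)$ onto $Q$, and $g_Q(\bar x;\gamma) = \gamma(\bar x - x_Q(\bar x;\gamma))$:

```python
# Gradient mapping for Q = R^n_+ (the nonnegative orthant), where the
# projection is a coordinate-wise max with zero.
def grad_mapping(xb, grad, gamma):
    step = [xb[i] - grad[i] / gamma for i in range(len(xb))]
    x_q = [max(0.0, s) for s in step]                 # projection onto Q
    g_q = [gamma * (xb[i] - x_q[i]) for i in range(len(xb))]
    return x_q, g_q

# Example: f(x) = 0.5*||x - c||^2 with c = (1, -1), so f'(x) = x - c.
c = [1.0, -1.0]
xb = [0.5, 0.5]
grad = [xb[0] - c[0], xb[1] - c[1]]
x_q, g_q = grad_mapping(xb, grad, gamma=1.0)
# with gamma = 1 the unconstrained step lands exactly at c = (1, -1),
# whose projection onto Q is (1, 0)
```

Note that $g_Q$ plays the role of $f'$ in the constrained setting: it vanishes exactly at the constrained minimizers.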
THEOREM 2.2.7 Let $f \in \mathcal{S}^{1,1}_{\mu,L}(R^n)$, $\gamma \ge L$ and $\bar x \in R^n$. Then for any
$x \in Q$ we have
$$f(x) \ge f(x_Q(\bar x;\gamma)) + \langle g_Q(\bar x;\gamma), x - \bar x\rangle + \frac{1}{2\gamma}\|g_Q(\bar x;\gamma)\|^2 + \frac{\mu}{2}\|x - \bar x\|^2. \eqno(2.2.15)$$
Hence,
(2.2.16)
(2.2.17)
0. Choose $x_0 \in Q$. (2.2.18)
1. $k$th iteration ($k \ge 0$).
Proof: Denote $r_k = \|x_k - x^*\|$, $g_Q = g_Q(x_k; L)$. Then, using inequality
(2.2.17), we obtain
$$r_{k+1}^2 = \|x_k - x^* - hg_Q\|^2 = r_k^2 - 2h\langle g_Q, x_k - x^*\rangle + h^2\|g_Q\|^2$$
$$\quad + \langle g_Q(y_k;L), x - y_k\rangle + \tfrac{\mu}{2}\|x - y_k\|^2\big).$$
Note that the form of the recursive rule for $\phi_k(x)$ has changed. The reason
is that now we use inequality (2.2.15) instead of (2.1.16). However,
this modification does not change the analytical form of the recursion, and
therefore it is possible to keep all convergence results of Section 2.2.1.
Similarly, it is easy to see that the estimate sequence $\{\phi_k(x)\}$ can be
written as
$$\ldots + \left(\frac{1}{2L} - \frac{\alpha_k^2}{2\gamma_{k+1}}\right)\|g_Q(y_k;L)\|^2 + \frac{\alpha_k(1-\alpha_k)\gamma_k}{\gamma_{k+1}}\left(\frac{\mu}{2}\|y_k - v_k\|^2 + \langle g_Q(y_k;L), v_k - y_k\rangle\right).$$
$$\min_{x\in Q}\left[f(x) = \max_{1\le i\le m} f_i(x)\right], \eqno(2.3.1)$$
Proof: Indeed,
(see (2.1.6)). □
Let us write down the optimality conditions for problem (2.3.1) (compare
with Theorem 2.2.5).
THEOREM 2.3.1 A point $x^* \in Q$ is a solution to (2.3.1) if and only if
for any $x \in Q$ we have
$$f(x^*; x) \ge f(x^*; x^*) = f(x^*). \eqno(2.3.4)$$
for all $x \in Q$.
Proof: Indeed, in view of (2.3.2) and Theorem 2.3.1, for any $x \in Q$ we
have
$$f(x) \ge f(x^*; x) + \tfrac{\mu}{2}\|x - x^*\|^2;$$
consequently,
2. If $\bar x \in Q$, then
$$f(x) \ge f(\bar x; x) + \tfrac{\mu}{2}\|x - \bar x\|^2. \eqno(2.3.9)$$
Proof: Denote $r_k = \|x_k - x^*\|$, $g = g_f(x_k; L)$. Then, in view of (2.3.9)
we have
$$r_{k+1}^2 = \|x_k - x^* - hg\|^2 = r_k^2 - 2h\langle g, x_k - x^*\rangle + h^2\|g\|^2$$
Comparing this result with Theorem 2.2.8, we see that for the minimax
problem the gradient method has the same rate of convergence as it has
in the smooth case.
Let us check what the situation is with the optimal methods. Recall
that in order to develop an optimal method, we need to introduce an
estimate sequence with some recursive updating rules. Formally, the
minimax problem differs from the unconstrained minimization problem
only by the form of the lower approximation of the objective function. In
the case of unconstrained minimization, inequality (2.1.16) was used for
updating the estimate sequence.
Comparing these relations with (2.2.3), we can find the difference only
in the constant term (it is in the frame). In (2.2.3) this place was taken
by $f(y_k)$. This difference leads to a trivial modification in the results
of Lemma 2.2.3: all occurrences of $f(y_k)$ must be formally replaced by
the expression in the frame, and $f'(y_k)$ must be replaced by $g_f(y_k; L)$.
Thus, we come to the following lemma.
LEMMA 2.3.3 For all $k \ge 0$ we have
$$\phi_k(x) = \phi_k^* + \frac{\gamma_k}{2}\|x - v_k\|^2,$$
where the sequences $\{\gamma_k\}$, $\{v_k\}$ and $\{\phi_k^*\}$ are defined as follows: $v_0 = x_0$,
$\phi_0^* = f(x_0)$ and
$$\gamma_{k+1} = (1-\alpha_k)\gamma_k + \alpha_k\mu,$$
$$\ldots - \frac{\alpha_k^2}{2\gamma_{k+1}}\|g_f(y_k;L)\|^2 + \frac{\alpha_k(1-\alpha_k)\gamma_k}{\gamma_{k+1}}\left(\frac{\mu}{2}\|y_k - v_k\|^2 + \langle g_f(y_k;L), v_k - y_k\rangle\right). \qquad □$$
D
Now we can proceed exactly as in Section 2.2. Assume that $\phi_k^* \ge f(x_k)$. Inequality (2.3.7) with $x = x_k$ and $\bar x = y_k$ becomes
$$f(x_k) \ge f(x_f(y_k;L)) + \langle g_f(y_k;L), x_k - y_k\rangle + \frac{1}{2L}\|g_f(y_k;L)\|^2 + \frac{\mu}{2}\|x_k - y_k\|^2.$$
Hence,
Note that the scheme (2.3.12) works for all $\mu \ge 0$. Let us write down
the method for solving (2.3.1) with strongly convex components.
0. Choose $x_0 \in Q$. Set $y_0 = x_0$, $\beta = \frac{\sqrt{L}-\sqrt{\mu}}{\sqrt{L}+\sqrt{\mu}}$. (2.3.13)
1. $k$th iteration ($k \ge 0$).
Compute $\{f_i(y_k)\}$ and $\{f_i'(y_k)\}$. Set
$$\begin{array}{ll}
\min & \max\limits_{1\le i\le m}\, t^{(i)} + \frac{\gamma}{2}\|x - x_0\|^2 \\[4pt]
\text{s.t.} & f_i(x_0) + \langle f_i'(x_0), x - x_0\rangle \le t^{(i)}, \quad i = 1\ldots m, \\
& x \in Q,\ t \in R^m.
\end{array} \eqno(2.3.15)$$
$$x \in Q,$$
where the functions $f_i$ are convex and smooth and $Q$ is a closed convex
set. In this section we assume that $f_i \in \mathcal{S}^{1,1}_{\mu,L}(R^n)$, $i = 0\ldots m$, with some
$\mu > 0$.
The relation between the problem (2.3.16) and minimax problems
is established by a special function of one variable. Consider the
parametric max-type function
$$f(t; x) = \max_{1\le i\le m}\{f_0(x) - t;\ f_i(x)\}, \qquad f^*(t) = \min_{x\in Q} f(t; x). \eqno(2.3.17)$$
Note that the components of the max-type function $f(t;\cdot)$ are strongly convex
in $x$. Therefore, for any $t \in R^1$ the solution of problem (2.3.17),
$x^*(t)$, exists and is unique in view of Theorem 2.3.2.
We will try to get close to the solution of (2.3.16) using a process
based on approximate values of the function $f^*(t)$. This approach can be
seen as a variant of sequential quadratic optimization. It can also be applied
to nonconvex problems.
Let us establish some properties of the function $f^*(t)$.
Proof: Indeed,
$$f^*(t+\Delta) = \min_{x\in Q}\ \max_{1\le i\le m}\{f_0(x) - t - \Delta;\ f_i(x)\}$$
$$f^*(t_1) \le \max_{1\le i\le m}\big\{(1-\alpha)(f_0(x^*(t_0)) - t_0) + \alpha(f_0(x^*(t_2)) - t_2);\ \ldots\big\}$$
Note that Lemmas 2.3.5 and 2.3.6 are valid for any parametric max-type
functions, not necessarily formed by the functional components of problem
(2.3.16).
Let us now study the properties of the gradient mapping for a parametric
max-type function. To do that, let us first introduce a linearization of the
parametric max-type function $f(t; x)$:
$$f(t; \bar x; x) = \max_{1\le i\le m}\left\{f_0(\bar x) + \langle f_0'(\bar x), x - \bar x\rangle - t;\ f_i(\bar x) + \langle f_i'(\bar x), x - \bar x\rangle\right\}.$$
Now we can introduce the gradient mapping in the standard way. Let us fix
some $\gamma > 0$. Denote
$$f_\gamma(t; \bar x; x) = f(t; \bar x; x) + \frac{\gamma}{2}\|x - \bar x\|^2, \qquad f^*(t; \bar x; \gamma) = \min_{x\in Q} f_\gamma(t; \bar x; x).$$
There are two values of $\gamma$ which are important for us: these are $\gamma = L$
and $\gamma = \mu$. Applying Lemma 2.3.2 to the max-type function $f_\gamma(t;\bar x;x)$ with
for some $\kappa \in (0,1)$. Then $\bar t < t^*(\bar x, \bar t) \le t^*$. Moreover, for any $t < \bar t$ and
$x \in R^n$ we have
Thus, $f^*(t; x; \mu) > 0$ and, since $f^*(t; x; \mu)$ decreases in $t$, we get
$t^*(x, \bar t) > \bar t$. □
$$x_{k+1} = x_f(t_k; x_{k,j(k)}; L).$$
This is the first time in our course we meet a two-level process. Clearly,
its analysis is rather complicated. Firstly, we need to estimate the rate
of convergence of the upper-level process in (2.3.22) (it is called a master
process). Secondly, we need to estimate the total complexity of the
internal processes in Step 1a). Since we are interested in the analytical
complexity of this method, the arithmetical cost of computation of
$t^*(\bar x, t)$ and $f^*(t; \bar x; \gamma)$ is not important for us now.
Let us describe convergence of the master process.
LEMMA 2.3.8
$$f^*(t_k; x_{k+1}; L) \le \frac{t_0 - t^*}{1-\kappa}\left[\frac{1}{2(1-\kappa)}\right]^k.$$
full iterations of the master process (the last iteration of the process, in
general, is not full, since it is terminated by the Global stop rule). Note
that in this estimate $\kappa$ is an absolute constant (for example, $\kappa = \frac{1}{4}$).
Let us analyze the complexity of the internal process. Let the sequence
$\{x_{k,j}\}$ be generated by (2.3.13) with the starting point $x_{k,0} = x_k$. In view
of Theorem 2.3.6, we have
where $\sigma = \sqrt{\frac{\mu}{L}}$.
Denote by $N$ the number of full iterations of the process (2.3.22)
($N \le N(\epsilon)$). Thus, $j(k)$ is defined for all $k$, $0 \le k \le N$. Note that
$t_k = t^*(x_{k-1,j(k-1)}, t_{k-1}) > t_{k-1}$. Therefore
(2.3.25)
And that is the termination criterion of the internal process in Step 1a)
in (2.3.22). □
The above result, combined with the estimate of the rate of con-
vergence for the internal process, provide us with the total complexity
estimate of the constrained minimization scheme.
LEMMA 2.3.10 For all $k$, $0 \le k \le N$, we have
where $\sigma = \sqrt{\frac{\mu}{L}}$.
Proof: Recall that $\Delta_{k+1} = \min\limits_{0\le j\le j(k)} f^*(t_k; x_{k,j}; L)$. Note that
the stopping criterion of the internal process did not work for $j = j(k) - 1$.
Therefore, in view of Lemma 2.3.9, we have
$$f^*(t_k; x_{k,j}; L) \le \frac{L}{L-\mu}\left(f(t_k; x_{k,j}) - f^*(t_k)\right) \le \frac{2L}{L-\mu}\,e^{-\sigma j}\Delta_k < \Delta_{k+1}.$$
COROLLARY 2.3.3
$$\sum_{k=0}^{N} j(k) \le (N+1)\left[1 + \sqrt{\frac{L}{\mu}}\,\ln\frac{2L}{\kappa\mu}\right] + \sqrt{\frac{L}{\mu}}\,\ln\frac{\Delta_0}{\Delta_{N+1}}.$$
That is a contradiction. □
COROLLARY 2.3.4
$$j^* + \sum_{k=0}^{N} j(k) \le (N+2)\left[1 + \sqrt{\frac{L}{\mu}}\,\ln\frac{2L}{(1-\kappa)\mu}\right] + \sqrt{\frac{L}{\mu}}\,\ln\frac{\Delta_0}{\epsilon}.$$
Let us put all things together. Substituting the estimate (2.3.24) for the
number of full iterations $N$ into the estimate of Corollary 2.3.4, we come
to the following bound for the total number of internal iterations in the
process (2.3.22):
$$\left[\frac{1}{\ln[2(1-\kappa)]}\,\ln\frac{t_0-t^*}{(1-\kappa)\epsilon} + 2\right]\cdot\left[1 + \sqrt{\frac{L}{\mu}}\,\ln\frac{2L}{\kappa\mu}\right] + \sqrt{\frac{L}{\mu}}\,\ln\left(\frac{1}{\epsilon}\,\max_{1\le i\le m}\{f_0(x_0) - t_0;\ f_i(x_0)\}\right). \eqno(2.3.26)$$
Note that method (2.3.13), which implements the internal process, calls
the oracle of problem (2.3.16) at each iteration only once. Therefore,
we conclude that estimate (2.3.26) is an upper complexity bound for
problem (2.3.16) with an $\epsilon$-solution defined by (2.3.23). Let us check how
far this estimate is from the lower bounds.
The principal term in estimate (2.3.26) is of the order
$$\sqrt{\frac{L}{\mu}}\cdot\ln\frac{t_0 - t^*}{\epsilon}\cdot\ln\frac{L}{\mu}.$$
This value differs from the lower bound for an unconstrained minimization
problem by a factor of $\ln\frac{L}{\mu}$. This means that scheme (2.3.22) is at
least suboptimal for constrained optimization problems. We cannot say
more, since a specific lower complexity bound for constrained minimization
is not known.
To conclude this section, let us answer two technical questions. Firstly,
in scheme (2.3.22) we assume that we know some estimate $t_0 < t^*$. This
assumption is not binding, since we can choose $t_0$ equal to the optimal
value of the minimization problem
$$\min_{x\in Q}\left[f_0(x_0) + \langle f_0'(x_0), x - x_0\rangle + \frac{\mu}{2}\|x - x_0\|^2\right].$$
Secondly, at each step of the internal process we minimize over $x \in Q$ a
max-type function with the components
$$f_i(\bar x) + \langle f_i'(\bar x), x - \bar x\rangle + \frac{\mu}{2}\|x - \bar x\|^2, \quad i = 1\ldots m.$$
This problem is not a quadratic optimization problem, since the constraints
are not linear. However, it can be solved in finite time by a
simplex-type process, since the objective function and the constraints
have the same Hessian. This problem can also be solved by interior-point
methods.
Chapter 3
where $Q$ is a closed convex set and $f_i(x)$, $i = 0\ldots m$, are general convex
functions. The term general means that these functions can be nondifferentiable.
Clearly, such a problem is more difficult than a smooth one.
Note that nonsmooth minimization problems arise frequently in different
applications. Quite often some components of a model are composed
of max-type functions:
$$f(x) = \max_{1\le j\le p}\,\phi_j(x),$$
At this point, we are not ready to speak about any method for solving
(3.1.1). In the previous chapter, our optimization methods were based
on the gradients of smooth functions. For nonsmooth functions such
objects do not exist and we have to find something to replace them.
However, in order to do that, we should first study the properties of
general convex functions and justify the possibility of defining a generalized
gradient. That is a long way, but we have to pass through it.
A straightforward consequence of Definition 3.1.1 is as follows.
LEMMA 3.1.1 (Jensen inequality) For any $x_1,\ldots,x_m \in \mathrm{dom}\,f$ and
coefficients $\alpha_1,\ldots,\alpha_m$ such that
$$\sum_{i=1}^{m}\alpha_i = 1, \qquad \alpha_i \ge 0,\ i = 1\ldots m, \eqno(3.1.2)$$
we have
$$f\left(\sum_{i=1}^{m}\alpha_i x_i\right) \le \sum_{i=1}^{m}\alpha_i f(x_i).$$
where $\beta_i = \frac{\alpha_i}{1-\alpha_1}$. Clearly,
$$\sum_{i=2}^{m}\beta_i = 1, \qquad \beta_i \ge 0,\ i = 2\ldots m.$$
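A tiny numeric illustration of the Jensen inequality (our own, with our own choice of convex function and weights):

```python
# Jensen inequality for the convex function f(x) = x^2:
# f(sum a_i x_i) <= sum a_i f(x_i) whenever a_i >= 0, sum a_i = 1.
f = lambda x: x * x
xs = [-1.0, 0.5, 2.0, 3.0]
a = [0.1, 0.2, 0.3, 0.4]

lhs = f(sum(ai * xi for ai, xi in zip(a, xs)))   # f of the convex combination
rhs = sum(ai * f(xi) for ai, xi in zip(a, xs))   # convex combination of values
```

Here `lhs` equals $f(1.8) = 3.24$ while `rhs` equals $4.95$, in agreement with the lemma.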
The point $x = \sum_{i=1}^{m}\alpha_i x_i$ with coefficients $\alpha_i$ satisfying (3.1.2) is called
a convex combination of the points $x_i$.
Let us point out two important consequences of the Jensen inequality.
COROLLARY 3.1.1 Let $x$ be a convex combination of the points $x_1,\ldots,x_m$.
Then
Proof: Indeed, in view of the Jensen inequality and since $\alpha_i \ge 0$,
$\sum_{i=1}^{m}\alpha_i = 1$, we have
$$y = \frac{1}{1+\beta}(u + \beta x) = (1-\alpha)u + \alpha x.$$
Therefore
$$f(x) \ge f(u) + \beta(f(u) - f(y)) = \frac{1}{1-\alpha}f(u) - \frac{\alpha}{1-\alpha}f(y). \qquad □$$
is a convex set.
Proof: Indeed, if $(x_1,t_1) \in \mathrm{epi}(f)$ and $(x_2,t_2) \in \mathrm{epi}(f)$, then for any
$\alpha \in [0,1]$ we have
$$\alpha t_1 + (1-\alpha)t_2 \ge \alpha f(x_1) + (1-\alpha)f(x_2) \ge f(\alpha x_1 + (1-\alpha)x_2).$$
Thus, $(\alpha x_1 + (1-\alpha)x_2,\ \alpha t_1 + (1-\alpha)t_2) \in \mathrm{epi}(f)$.
Let $\mathrm{epi}(f)$ be convex. Note that for $x_1, x_2 \in \mathrm{dom}\,f$
Therefore $(\alpha x_1 + (1-\alpha)x_2,\ \alpha f(x_1) + (1-\alpha)f(x_2)) \in \mathrm{epi}(f)$. That is
THEOREM 3.1.4 If a convex function $f$ is closed, then all its level sets are
either empty or closed.
4. The function $f(x) = \frac{1}{x}$, $x > 0$, is convex and closed. However, its domain
$\mathrm{dom}\,f = \mathrm{int}\,R^1_+$ is open.
5. The function $f(x) = \|x\|$, where $\|\cdot\|$ is any norm, is closed and convex.
Among the $l_p$-norms, $p \ge 1$, there are three which are commonly used:
6. Up to now, none of our examples showed any pathological behavior.
However, let us look at the following function of two variables:
$$f(x,y) = \begin{cases} 0, & \text{if } x^2 + y^2 < 1, \\ \varphi(x,y), & \text{if } x^2 + y^2 = 1, \end{cases}$$
where $\varphi(x,y)$ is an arbitrary nonnegative function defined on the unit
circle. The domain of this function is the unit Euclidean disk, which is
closed and convex. Moreover, it is easy to see that $f$ is convex. However,
it has no reasonable properties on the boundary of its domain.
Definitely, we want to exclude such functions from our considerations.
That was the reason for introducing the notion of a closed function. It
is clear that $f(x,y)$ is not closed unless $\varphi(x,y) \equiv 0$. □
Therefore
$$\mathrm{epi}\,f = \mathrm{epi}\,f_1 \cap \mathrm{epi}\,f_2.$$
Thus, $\mathrm{epi}\,f$ is closed and convex as an intersection of two closed convex
sets. It remains to use Theorem 3.1.2. □
$$= \phi(\alpha y_1 + (1-\alpha)y_2) \le \alpha\phi(y_1) + (1-\alpha)\phi(y_2) = \alpha f(x_1) + (1-\alpha)f(x_2).$$
1 It is important to understand that a similar property for convex sets is not valid. Consider
the following two-dimensional example: $Q_1 = \{(x,y): y \ge \frac{1}{x},\ x > 0\}$, $Q_2 = \{(x,y): y = 0,\ x \le 0\}$. Both of these sets are convex and closed. However, their sum $Q_1 + Q_2 = \{(x,y): y > 0\}$ is convex and open.
Thus, $f(x)$ is convex. The closedness of its epigraph follows from the
continuity of the linear operator $\mathcal{A}(x)$. □
Suppose that for any fixed $y \in \Delta$ the function $\phi(y,x)$ is closed and convex
in $x$. Then $f(x)$ is a closed and convex function with domain
$$\mathrm{dom}\,f = \left\{x \in \bigcap_{y\in\Delta}\mathrm{dom}\,\phi(y,\cdot)\ \Big|\ \exists\gamma:\ \phi(y,x) \le \gamma\ \ \forall y \in \Delta\right\}. \eqno(3.1.4)$$
are convex and closed. Thus, $f(x)$ is closed and convex in view of
Theorem 3.1.7. Note that we did not assume anything about the
structure of the set $\Delta$.
3. Let $Q$ be a convex set. Consider the function
4. Let $Q$ be a set in $R^n$. Consider the function $\psi(g,\gamma) = \sup\limits_{y\in Q}\phi(y,g,\gamma)$,
where
$$\phi(y,g,\gamma) = \langle g, y\rangle - \frac{\gamma}{2}\|y\|^2.$$
The function $\psi(g,\gamma)$ is closed and convex in $(g,\gamma)$ in view of Theorem
3.1.7. Let us look at its properties.
If $Q$ is bounded, then $\mathrm{dom}\,\psi = R^{n+1}$. Consider the case $Q = R^n$.
Let us describe the domain of $\psi$. If $\gamma < 0$, then for any $g \ne 0$ we
can take $y_\alpha = \alpha g$. Clearly, along this sequence $\phi(y_\alpha, g, \gamma) \to \infty$ as
$\alpha \to \infty$. Thus, $\mathrm{dom}\,\psi$ contains only points with $\gamma \ge 0$.
If $\gamma = 0$, the only possible value for $g$ is zero, since otherwise the
function $\phi(y, g, 0)$ is unbounded.
Finally, if $\gamma > 0$, then the point maximizing $\phi(y, g, \gamma)$ with respect
to $y$ is $y^*(g,\gamma) = \frac{1}{\gamma}g$, and we get the following expression for $\psi$:
Thus,
$$\psi(g,\gamma) = \begin{cases} 0, & \text{if } g = 0,\ \gamma = 0, \\[2pt] \frac{\|g\|^2}{2\gamma}, & \text{if } \gamma > 0, \end{cases}$$
with the domain $\mathrm{dom}\,\psi = (R^n \times \{\gamma > 0\}) \cup (0,0)$. Note that this
is a convex set which is neither closed nor open. Nevertheless, $\psi$
is a closed convex function. At the same time, this function is not
continuous at the origin. □
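The discontinuity at the origin is easy to see numerically. The sketch below (our own; the two paths to the origin are our choice) evaluates $\psi(g,\gamma) = \|g\|^2/(2\gamma)$ along two sequences converging to $(0,0)$ and obtains two different limits:

```python
# psi(g, gamma) = ||g||^2 / (2*gamma) for gamma > 0 (case Q = R^n, n = 2).
def psi(g, gamma):
    return (g[0] ** 2 + g[1] ** 2) / (2.0 * gamma)

# Path 1: g = (t, 0), gamma = t      -> psi = t/2   -> 0
# Path 2: g = (t, 0), gamma = t^2    -> psi = 1/2   -> 1/2
path1 = [psi((t, 0.0), t) for t in (1e-2, 1e-4, 1e-6)]
path2 = [psi((t, 0.0), t * t) for t in (1e-2, 1e-4, 1e-6)]
```

Since the limit depends on the path, $\psi$ cannot be continuous at $(0,0)$, even though it is a closed convex function.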
Proof: Let us choose some $\epsilon > 0$ such that $x_0 \pm \epsilon e_i \in \mathrm{int}(\mathrm{dom}\,f)$,
$i = 1\ldots n$, where $e_i$ are the coordinate vectors of $R^n$. Denote
$$\Delta = \mathrm{Conv}\{x_0 \pm \epsilon e_i,\ i = 1\ldots n\}.$$
Let us show that $\Delta \supseteq B_2(x_0, \bar\epsilon)$ with $\bar\epsilon = \frac{\epsilon}{\sqrt{n}}$. Indeed, consider
$$x = x_0 + \sum_{i=1}^{n} h_i e_i, \qquad \sum_{i=1}^{n}(h_i)^2 \le \bar\epsilon^2.$$
Then $x$ can be represented as a convex combination of the points $x_0 \pm \epsilon e_i$
and $x_0$. Hence,
$$\max_{x\in B_2(x_0,\bar\epsilon)} f(x) \le \max_{x\in\Delta} f(x) \le \max_{1\le i\le n} f(x_0 \pm \epsilon e_i) \equiv M.$$
$$f(y) \le f(x_0) + \frac{M - f(x_0)}{\bar\epsilon}\|y - x_0\|.$$
Further, denote $u = x_0 + \frac{\bar\epsilon}{\|y-x_0\|}(x_0 - y)$. Then $\|u - x_0\| = \bar\epsilon$ and $y =
x_0 + \alpha(x_0 - u)$ with $\alpha = \frac{\|y-x_0\|}{\bar\epsilon}$. Therefore, in view of Theorem 3.1.1 we have
$$f(y) \ge f(x_0) + \alpha(f(x_0) - f(u)) \ge f(x_0) - \alpha(M - f(x_0)) = f(x_0) - \frac{M - f(x_0)}{\bar\epsilon}\|y - x_0\|.$$
Thus, $|f(y) - f(x_0)| \le \frac{M - f(x_0)}{\bar\epsilon}\|y - x_0\|$. □
$$f(x + \alpha\beta p) = f((1-\beta)x + \beta(x + \alpha p)) \le (1-\beta)f(x) + \beta f(x + \alpha p).$$
Therefore
$$\phi(\alpha\beta) = \frac{1}{\alpha\beta}\left[f(x + \alpha\beta p) - f(x)\right] \le \frac{1}{\alpha}\left[f(x + \alpha p) - f(x)\right] = \phi(\alpha).$$
$$\le -\|x_0 - \pi_Q(x_0)\|^2. \qquad □$$
Now we can prove the separation theorems. We will need two of them.
The first one describes our possibilities in strict separation.
THEOREM 3.1.11 Let $Q$ be a closed convex set and $x_0 \notin Q$. Then there
exists a hyperplane $\mathcal{H}(g,\gamma)$ which strictly separates $x_0$ from $Q$. Namely,
we can take
$$g = x_0 - \pi_Q(x_0) \ne 0, \qquad \gamma = \langle x_0 - \pi_Q(x_0),\ \pi_Q(x_0)\rangle.$$
Proof: Indeed, in view of (3.1.8), for any $x \in Q$ we have
$$\langle x_0 - \pi_Q(x_0),\ x\rangle \le \langle x_0 - \pi_Q(x_0),\ \pi_Q(x_0)\rangle$$
3.1.5 Subgradients
Now we are completely ready to introduce an extension of the notion
of gradient.
Since for all $\tau \ge f(x_0)$ the point $(x_0, \tau)$ belongs to $\mathrm{epi}(f)$, we conclude
that $a \ge 0$.
Recall that a convex function is locally upper bounded in the interior
of its domain (Lemma 3.1.2). This means that there exist some $\epsilon > 0$
and $M > 0$ such that $B_2(x_0, \epsilon) \subseteq \mathrm{dom}\,f$ and
$$f(x) - f(x_0) \le M\|x - x_0\|$$
for all $x \in B_2(x_0, \epsilon)$. Therefore, in view of (3.1.11), for any $x$ from this
ball we have
Let us show that the conditions of the above theorem cannot be re-
laxed.
EXAMPLE 3.1.4 Consider the function $f(x) = -\sqrt{x}$ with the domain
$\{x \in R^1 \mid x \ge 0\}$. This function is convex and closed, but the subdifferential
does not exist at $x = 0$. □
the other hand, since $f'(x_0; p)$ is convex in $p$, in view of Lemma 3.1.3,
for any $y \in \mathrm{dom}\,f$ we have
$$0 \in \partial f(x^*).$$
The next result forms a basis for cutting plane optimization schemes.
THEOREM 3.1.16 For any $x_0 \in \mathrm{dom}\,f$, all vectors $g \in \partial f(x_0)$ are
supporting to the level set $\mathcal{L}_f(f(x_0))$:
$$\langle g,\ x_0 - x\rangle \ge 0 \quad \forall x \in \mathcal{L}_f(f(x_0)) \equiv \{x \in \mathrm{dom}\,f:\ f(x) \le f(x_0)\}.$$
Proof: Indeed, if $f(x) \le f(x_0)$ and $g \in \partial f(x_0)$, then
$$f(x_0) + \langle g,\ x - x_0\rangle \le f(x) \le f(x_0). \qquad □$$
$$x^* = \arg\min\{f(x) \mid x \in Q\}.$$
Changing the sign of $p$, we conclude that $\langle f'(x), p\rangle = \langle g, p\rangle$ for all $g$ from
$\partial f(x)$. Finally, considering $p = e_k$, $k = 1\ldots n$, we get $g = f'(x)$. □
LEMMA 3.1.9 Let $f_1(x)$ and $f_2(x)$ be closed convex functions and $a_1,
a_2 \ge 0$. Then the function $f(x) = a_1 f_1(x) + a_2 f_2(x)$ is closed and convex,
and
$$\partial f(x) = a_1\partial f_1(x) + a_2\partial f_2(x) \eqno(3.1.16)$$
for any $x$ from $\mathrm{int}(\mathrm{dom}\,f) = \mathrm{int}(\mathrm{dom}\,f_1) \cap \mathrm{int}(\mathrm{dom}\,f_2)$.
Proof: In view of Theorem 3.1.5, we need to prove only the relation for
the subdifferentials. Consider $x_0 \in \mathrm{int}(\mathrm{dom}\,f_1) \cap \mathrm{int}(\mathrm{dom}\,f_2)$. Then,
for any $p \in R^n$ we have
$$= \max\{\langle g_1,\ a_1 p\rangle \mid g_1 \in \partial f_1(x_0)\}$$
where $\Delta_k = \left\{\lambda_i \ge 0,\ \sum_{i=1}^{k}\lambda_i = 1\right\}$ is the $k$-dimensional standard simplex.
Therefore,
$$f'(x; p) = \max\left\{\langle g, p\rangle\ \Big|\ g = \sum_{i=1}^{k}\lambda_i g_i,\ g_i \in \partial f_i(x),\ \{\lambda_i\} \in \Delta_k\right\}.$$
The last rule can be useful for computing some elements of the
subdifferential.
LEMMA 3.1.11 Let $\Delta$ be a set and $f(x) = \sup\{\phi(y,x) \mid y \in \Delta\}$.
Suppose that for any fixed $y \in \Delta$ the function $\phi(y,x)$ is closed and
convex in $x$. Then $f(x)$ is closed and convex.
Moreover, for any $x$ from
$$\mathrm{dom}\,f = \{x \in R^n \mid \exists\gamma:\ \phi(y,x) \le \gamma\ \ \forall y \in \Delta\}$$
we have
$$\partial f(x) \supseteq \mathrm{Conv}\{\partial\phi_x(y,x) \mid y \in I(x)\},$$
where $I(x) = \{y \mid \phi(y,x) = f(x)\}$.
Proof: In view of Theorem 3.1.7, we have to prove only the inclusion.
Indeed, for any $x \in \mathrm{dom}\,f$, $y_0 \in I(x_0)$ and $g \in \partial\phi_x(y_0,x_0)$ we have
$$f(x) \ge \phi(y_0,x) \ge \phi(y_0,x_0) + \langle g,\ x - x_0\rangle = f(x_0) + \langle g,\ x - x_0\rangle. \qquad □$$
5. For the $l_1$-norm $f(x) = \|x\|_1 = \sum_{i=1}^{n}|x^{(i)}|$ we have
where $I_+(x) = \{i \mid x^{(i)} > 0\}$, $I_-(x) = \{i \mid x^{(i)} < 0\}$ and $I_0(x) =
\{i \mid x^{(i)} = 0\}$.
We leave the justification of these examples as an exercise for the reader. □
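As a quick sanity check of the $l_1$-norm example (our own sketch; the test points are our choice), the sign vector — with zeros on $I_0(x)$, where any value in $[-1,1]$ would do — is one element of $\partial\|x\|_1$, and it satisfies the subgradient inequality:

```python
# One element of the subdifferential of f(x) = ||x||_1: sign(x_i) on the
# nonzero coordinates, and 0 (an admissible choice from [-1, 1]) on I_0(x).
def l1_subgradient(x):
    return [1.0 if xi > 0 else (-1.0 if xi < 0 else 0.0) for xi in x]

def l1(v):
    return sum(abs(vi) for vi in v)

x = [2.0, -3.0, 0.0]
g = l1_subgradient(x)
# subgradient inequality ||y||_1 >= ||x||_1 + <g, y - x> at a few points y
ok = all(
    l1(y) >= l1(x) + sum(gi * (yi - xi) for gi, yi, xi in zip(g, y, x)) - 1e-12
    for y in ([0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [-5.0, 2.0, -1.0])
)
```

Replacing the zero on $I_0(x)$ by any number in $[-1,1]$ preserves the inequality, which reflects the full description given above.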
Thus, we need to prove only that $\bar\lambda_0 > 0$. Indeed, if $\bar\lambda_0 = 0$, then
$$\sum_{i\in I^*}\bar\lambda_i f_i(x) \ge \sum_{i\in I^*}\bar\lambda_i\left[f_i(x^*) + \langle f_i'(x^*),\ x - x^*\rangle\right] = 0.$$
This contradicts the Slater condition. Therefore $\bar\lambda_0 > 0$ and we can take
$\lambda_i = \bar\lambda_i/\bar\lambda_0$, $i \in I^*$. □
Proof: Note that all conditions of Theorem 3.1.17 are satisfied and
the solution $x^*$ of the above problem is attained at the boundary of the
feasible set. Therefore, in accordance with Theorem 3.1.17, we have to
solve the following equations:
□
Let us fix some constants $\rho > 0$ and $\gamma > 0$. Consider the family of
functions
Therefore, for any $x, y \in B_2(0,\rho)$, $\rho > 0$, and $g_k(y) \in \partial f_k(y)$ we have
Let us now describe a resisting oracle for the function $f_k(x)$. Since the
analytical form of this function is fixed, the resistance of this oracle
consists in providing us with the worst possible subgradient at each test
Input: $x \in R^n$.
for $j := 1$ to $m$ do
  if $x^{(j)} > f$ then { $f := x^{(j)}$; $i^* := j$ };
At first glance, there is nothing special in this scheme. Its main loop
is just a standard process for finding the maximal coordinate of a vector
from $R^n$. However, the main feature of this loop is that we always
form the subgradient as a coordinate vector. Moreover, this coordinate
corresponds to $i^*$, the first maximal component of the vector $x$. Let
us check what happens with a minimizing sequence that uses such an
oracle.
Let us choose the starting point $x_0 = 0$. Denote
$$R^{p,n} = \{x \in R^n \mid x^{(i)} = 0,\ p+1 \le i \le n\}.$$
$$g = \mu x_i + e_{i^*},$$
where $i^* \le p + 1$. Therefore, the next test point $x_{i+1}$ belongs to $R^{p+1,n}$.
This simple reasoning proves that for all $i$, $1 \le i \le k$, we have $x_i \in
R^{i,n}$. Consequently, for $i$: $1 \le i \le k-1$, we cannot improve the starting
value of the objective function:
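The resisting oracle is easy to simulate. The sketch below is our own (the objective $f(x) = \max_i x^{(i)} + \frac{\mu}{2}\|x\|^2$ and all parameters are our illustrative choices, following the construction above): each call returns the subgradient $\mu x + e_{i^*}$ with $i^*$ the *first* maximal coordinate, so a method started from $x_0 = 0$ can discover at most one new nonzero coordinate per call:

```python
# Resisting oracle for f(x) = max_i x^(i) + (mu/2)*||x||^2 on R^n.
n, mu = 10, 0.1

def oracle(x):
    # first maximal coordinate: ties broken toward the smallest index
    i_star = max(range(n), key=lambda j: (x[j], -j))
    f_val = x[i_star] + 0.5 * mu * sum(v * v for v in x)
    g = [mu * v for v in x]
    g[i_star] += 1.0               # g = mu*x + e_{i*}
    return f_val, g

# simple subgradient steps from x0 = 0: before the k-th call only the
# first k coordinates can be nonzero, i.e. x_k lives in R^{k,n}
x = [0.0] * n
for k in range(5):
    f_val, g = oracle(x)
    assert all(x[i] == 0.0 for i in range(k, n))
    x = [xi - 0.1 * gi for xi, gi in zip(x, g)]
```

After 5 calls the last $n-5$ coordinates are still exactly zero, which is the mechanism behind the lower bound argued in the text.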
the smooth problem, our goal now is much more complicated. Indeed,
even in the simplest situation, when $Q = R^n$, the subgradient seems to
be a poor replacement for the gradient of a smooth function. For example,
we cannot be sure that the value of the objective function is decreasing
in the direction $-g(x)$. We cannot expect that $g(x) \to 0$ as $x$ approaches
a solution of our problem, etc.
Fortunately, there is one property of subgradients that makes our
goals reachable. We have proved this property in Corollary 3.1.4:
at any $x \in Q$ the following inequality holds:
and
If $f$ is Lipschitz continuous on $B_2(\bar x, R)$ and $0 \le v_f(\bar x; x) \le R$, then
$y \in B_2(\bar x, R)$. Hence,
$$f(x) - f(\bar x) \le f(y) - f(\bar x) \le M\|y - \bar x\| = M v_f(\bar x; x). \qquad □$$
Let us fix some $x^*$, a solution to problem (3.2.3). The values $v_f(x^*;x)$
allow us to estimate the quality of localization sets.
DEFINITION 3.2.1 Let $\{x_i\}_{i=0}^{\infty}$ be a sequence in $Q$. Define
We call this set the localization set of problem (3.2.3) generated by the
sequence $\{x_i\}_{i=0}^{\infty}$.
Note that in view of inequality (3.2.4), for all $k \ge 0$ we have $x^* \in S_k$.
Denote
$$v_i = v_f(x^*; x_i)\ (\ge 0), \qquad v_k^* = \min_{0\le i\le k} v_i.$$
Thus,
LEMMA 3.2.2 Let $f_k^* = \min\limits_{0\le i\le k} f(x_i)$. Then $f_k^* - f^* \le \omega_f(x^*; v_k^*)$.
Proof:
$$\omega_f(x^*; v_k^*) = \min_{0\le i\le k}\omega_f(x^*; v_i) \ge \min_{0\le i\le k}\left[f(x_i) - f^*\right] = f_k^* - f^*. \qquad □$$
Proof: Denote $r_i = \|x_i - x^*\|$. Then, in view of Lemma 3.1.5, we have
Thus,
$$v_k^* \le \frac{R^2 + \sum_{i=0}^{k} h_i^2}{2\sum_{i=0}^{k} h_i}.$$
We can easily see that $\Delta_k \to 0$ if the series $\sum_{i=0}^{\infty} h_i$ diverges. However, let
us try to choose $h_k$ in an optimal way.
Let us assume that we have to perform a fixed number of steps of
the subgradient method, say $N$. Then, minimizing $\Delta_k$ as a function of
$\{h_k\}_{k=0}^{N}$, we find that the optimal strategy is as follows:²
Comparing this result with the lower bound of Theorem 3.2.1, we
conclude:
² From Example 3.1.2(3) we can see that $\Delta_k$ is a convex function of $\{h_i\}$.
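A minimal run of the subgradient method with divergent-series steps is shown below (our own experiment; the rule $h_k = R/\sqrt{k+0.5}$ and the test function $f(x) = |x|$ are our illustrative choices, with $R = 1$):

```python
# Subgradient method on f(x) = |x|: x_{k+1} = x_k - h_k * g(x_k),
# g(x) in the subdifferential of |x|, with h_k = R / sqrt(k + 0.5).
R = 1.0
x = 1.0
best = abs(x)                 # record value f_k^*
for k in range(2000):
    g = 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)
    x -= R / (k + 0.5) ** 0.5 * g
    best = min(best, abs(x))
```

The iterates oscillate around the minimizer with amplitude of order $h_k$, so the record value decays like $O(1/\sqrt{k})$, matching the rate discussed above.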
$$h_k = \frac{R}{\sqrt{k+0.5}}.$$
1. $k$th iteration ($k \ge 0$). (3.2.13)
a) Compute $f(x_k)$, $g(x_k)$, $\bar f(x_k)$, $\bar g(x_k)$ and set
Then for any $k \ge 3$ there exists a number $i'$, $0 \le i' \le k$, such that
$$f(x_{i'}) - f^* \le \frac{MR}{\sqrt{k-1.5}}, \qquad \bar f(x_{i'}) \le \frac{2MR}{\sqrt{k-1.5}}.$$
Proof: Note that for the direction $p_k$, chosen in accordance with rule (B), we
have
$$\|\bar g(x_k)\|\, h_k \le \bar f(x_k) \le \langle \bar g(x_k),\ x_k - x^*\rangle.$$
Hence, in this case $v_f(x^*; x_k) \ge h_k$.
Let $k' = \lceil k/2\rceil$ and $I_k = \{i \in [k', \ldots, k]:\ p_i = g(x_i)\}$. Denote
$$v_i = v_f(x^*; x_i),$$
$$\langle g,\ \bar x - x\rangle \ge 0 \quad \forall x \in Q.$$
Denote by $e \in R^n$ the vector of all ones. The oracle starts from the following
settings:
$$a_0 := -Re, \qquad b_0 := Re, \qquad m := 0, \qquad i := 1.$$
Its input is an arbitrary $x \in R^n$.
$$m := m+1; \qquad i := i+1; \qquad \text{if } i > n \text{ then } i := 1.$$
Return $g_m$.
This oracle implements a very simple strategy. Note that the next
box $B_{m+1}$ is always a half of the last box $B_m$. The box $B_m$ is divided into
two parts by a hyperplane which passes through its center and
corresponds to the active coordinate $i$. Depending on which part of the
box $B_m$ the test point $x$ lands in, we choose the sign of the separation
vector $g_{m+1} = \pm e_i$. After creating the new box $B_{m+1}$, the index $i$ is
increased by 1. If this value exceeds $n$, we return again to $i = 1$. Thus,
the sequence of boxes $\{B_k\}$ possesses two important properties:
Therefore, for such $k$ we have $B_k \supset B_2(c_k, \tfrac{1}{2}R)$ and (3.2.15) holds. Further,
let $k = nl + p$ with some $p \in [0, \ldots, n-1]$. Since
The following result demonstrates that any cut passing through the center
of gravity divides the set into two proportional pieces.
LEMMA 3.2.5 Let $g$ be a direction in $R^n$. Define
0. Set $S_0 = Q$.
1. $k$th iteration ($k \ge 0$).
a) Choose $x_k = \mathrm{cg}(S_k)$ and compute $f(x_k)$, $g(x_k)$.
b) Set $S_{k+1} = \{x \in S_k \mid \langle g(x_k),\ x_k - x\rangle \ge 0\}$.
Hence,
I X- X+ II~+~ n:2 1 (11 XII~ +nLl) ~ 1.
Thus, we have proved that $E_+ \subset E(H_+, x_+)$.
Let us estimate the volume of $E(H_+, x_+)$.
$$\left[\frac{n^2}{n^2-1}\left(1 - \frac{2}{n+1}\right)\right]^{\frac{n}{2}} \le \left[\frac{n^2}{n^2-1}\left(1 - \frac{2}{n(n+1)}\right)\right]^{\frac{n}{2}} = \left[\frac{n^2(n^2+n-2)}{n(n-1)(n+1)^2}\right]^{\frac{n}{2}} = \left[1 - \frac{1}{(n+1)^2}\right]^{\frac{n}{2}}. \qquad □$$
It turns out that the ellipsoid $E(H_+, x_+)$ is the ellipsoid of minimal
volume containing the half of the initial ellipsoid $E_+$.
Our observations can be implemented in the following algorithmic
scheme of the ellipsoid method.
Ellipsoid method
$$g_k = \begin{cases} g(y_k), & \text{if } y_k \in Q, \\ \bar g(y_k), & \text{if } y_k \notin Q, \end{cases} \eqno(3.2.18)$$
$$y_{k+1} = y_k - \frac{1}{n+1}\cdot\frac{H_k g_k}{\langle H_k g_k,\ g_k\rangle^{1/2}},$$
Proof: The proof follows from Lemma 3.2.2, Corollary 3.2.1 and
Lemma 3.2.6. □
Then
$$\left[\frac{\mathrm{vol}_n\,E_k}{\mathrm{vol}_n\,Q}\right]^{\frac{1}{n}} \le \left(1 - \frac{1}{(n+1)^2}\right)^{\frac{k}{2}}\left[\frac{\mathrm{vol}_n\,B_2(x_0,R)}{\mathrm{vol}_n\,Q}\right]^{\frac{1}{n}} \le e^{-\frac{k}{2(n+1)^2}}\cdot\frac{R}{\rho}.$$
In view of Corollary 3.2.1, this implies that $i(k) > 0$ for all
calls of the oracle. This efficiency estimate is not optimal (see Theorem
3.2.5), but it has a polynomial dependence on $\ln\frac{1}{\epsilon}$ and a polynomial
dependence on the logarithms of the class parameters $M$, $R$ and $\rho$. For
problem classes whose oracle has polynomial complexity, such algorithms are
called (weakly) polynomial.
To conclude this section, let us mention that there are several methods
that work with localization sets in the form of a polytope:
Kelley method
0. Choose xo E Q. (3.3.2)
1. $k$th iteration ($k \ge 0$).
Find Xk+I E Argmin A(X; x).
xEQ
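A one-dimensional sketch of the Kelley method with Q = [a, b]; the piecewise-linear model f̂_k is minimized exactly by checking the interval endpoints and the pairwise intersections of the supporting lines (the test function is an illustrative choice):

```python
def kelley_1d(f, g, a, b, iters=30):
    """Kelley method (3.3.2) on Q = [a, b]: minimize the piecewise-linear
    model built from the supporting lines f(x_i) + g(x_i)(x - x_i)."""
    lines = []                       # (slope, intercept) of each cut
    x = a
    for _ in range(iters):
        lines.append((g(x), f(x) - g(x) * x))
        model = lambda t: max(s * t + c for s, c in lines)
        # candidate minimizers: endpoints and intersections of cut pairs
        cands = [a, b]
        for i in range(len(lines)):
            for j in range(i + 1, len(lines)):
                s1, c1 = lines[i]; s2, c2 = lines[j]
                if s1 != s2:
                    t = (c2 - c1) / (s1 - s2)
                    if a <= t <= b:
                        cands.append(t)
        x = min(cands, key=model)    # x_{k+1} in Argmin of the model
    return x

x_kelley = kelley_1d(lambda t: (t - 1.0) ** 2, lambda t: 2.0 * (t - 1.0), -3.0, 5.0)
```

On this smooth quadratic the scheme happens to converge quickly; the lower bound discussed below shows that in high dimension its worst-case behavior is dramatically worse.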
f̂*_k ≤ f(z*) = 0.
Thus, for all subsequent models with k ≥ 1 we will have f̂*_k = 0 and z*_k = (0, x_k), where
Let us estimate the possible length of this stage using the following fact.
At the first stage, each step cuts from the sphere S_2(0, 1) at most the cap {x ∈ S_2(0, 1) | (x_i, x) ≥ 1/2}. Therefore we can continue the process at least for
N = [ 2/√3 ]^{n−1}
iterations. During all these iterations we still have f(z_i) = 1.
Since at the first stage of the process the cuts are (x_i, x) ≤ 1/2, for all k, 0 ≤ k ≤ N,
we have
This means that after N iterations we can repeat our process with the ball B_2(0, 1/2), etc. Note that f(0, x) = 1/2 for all x from B_2(0, 1/2).
160 INTRODUCTORY LEGTURES ON CONVEX OPTIMIZATION
Thus, we have proved the following lower bound for the Kelley method (3.3.2):
f(x_k) − f* ≥ (1/2)^{ 2k [√3/2]^{n−1} }.
This means that we cannot get an ε-solution of our problem in fewer than
(1/(2 ln 2)) [ 2/√3 ]^{n−1} ln(1/ε)
calls of the oracle. It remains to compare this lower bound with the
upper complexity bounds of other methods:
The first value is called the minimal value of the model, and the second one the record value of the model. Clearly f̂*_k ≤ f* ≤ f*_k.
Let us choose some α ∈ (0, 1). Denote ℓ_k(α) = (1 − α) f̂*_k + α f*_k.

Level method

0. Choose x_0 ∈ Q and α ∈ (0, 1).
1. kth iteration (k ≥ 0). Compute f̂*_k, f*_k and ℓ_k(α); set
   x_{k+1} = π_{L_k(α)}(x_k), where L_k(α) = {x ∈ Q | f̂_k(X; x) ≤ ℓ_k(α)}.
We also need to compute the projection π_{L_k(α)}(x_k). If Q is a polytope, then this is a quadratic programming problem:
min { || x − x_k ||^2 : f̂_k(X; x) ≤ ℓ_k(α), x ∈ Q }.
Both these problems are solvable either by a standard simplex-type
method, or by interior-point schemes.
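In one dimension both auxiliary problems can be solved exactly with elementary means, which gives the following sketch of the level method (Q = [a, b]; the test function, interval, and parameter values are illustrative choices):

```python
def level_method_1d(f, g, a, b, alpha=0.5, iters=60):
    """Level method sketch on Q = [a, b]: keep the piecewise-linear model,
    form the level between the model minimum and the record value, and
    project x_k onto the sublevel interval {model <= level}."""
    lines = []
    x = 0.5 * (a + b)
    x_best, f_best = x, float("inf")
    for _ in range(iters):
        fx = f(x)
        if fx < f_best:
            x_best, f_best = x, fx                   # record value f*_k
        lines.append((g(x), fx - g(x) * x))
        model = lambda t: max(s * t + c for s, c in lines)
        cands = [a, b] + [(c2 - c1) / (s1 - s2)      # breakpoints of the model
                          for i, (s1, c1) in enumerate(lines)
                          for s2, c2 in lines[i + 1:] if s1 != s2]
        cands = [t for t in cands if a <= t <= b]
        xm = min(cands, key=model)                   # model minimizer
        level = (1.0 - alpha) * model(xm) + alpha * f_best
        if model(x) <= level:
            continue                                 # x_k already in the level set
        lo, hi = x, xm                               # model(lo) > level >= model(hi)
        for _ in range(60):                          # projection by bisection
            mid = 0.5 * (lo + hi)
            if model(mid) > level:
                lo = mid
            else:
                hi = mid
        x = hi
    return x_best, f_best

x_best, f_best = level_method_1d(lambda t: (t - 1.0) ** 2,
                                 lambda t: 2.0 * (t - 1.0), -3.0, 6.0)
```

Here the model minimization is done over breakpoints and the projection reduces to locating one boundary point of a convex sublevel interval; in higher dimension both steps become the LP and QP discussed above.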
Let us look at some properties of the level method. Recall that the minimal values of the model increase, and the record values decrease:
f̂*_k ≤ f̂*_{k+1} ≤ f* ≤ f*_{k+1} ≤ f*_k.
The next result is crucial for the analysis of the level method.
LEMMA 3.3.1 Assume that for some p ≥ k we have δ_p ≥ (1 − α)δ_k. Then for all i, k ≤ i ≤ p,
Proof: Note that for such i we have δ_p ≥ (1 − α)δ_k ≥ (1 − α)δ_i. Therefore
ℓ_i(α) = f*_i − (1 − α)δ_i ≥ f*_p − (1 − α)δ_i = f̂*_p + δ_p − (1 − α)δ_i ≥ f̂*_p. □
Let us show that the steps of the level method are large enough. Denote
M_f = max{ || g || : g ∈ ∂f(x), x ∈ Q }.
LEMMA 3.3.2 For the sequence {x_k} generated by the level method we have
|| x_{k+1} − x_k || ≥ (1 − α)δ_k / M_f.
Proof: Indeed,
0 ≤ || x_{p+1} − x*_p ||^2 ≤ || x_p − x*_p ||^2 − (1 − α)^2 δ_p^2 / M_f^2
≤ ... ≤ || x_k − x*_p ||^2 − (p + 1 − k)(1 − α)^2 δ_p^2 / M_f^2.
Nonsmooth convex optimization 163
Hence,
(p + 1 − k)(1 − α)^2 δ_p^2 / M_f^2 ≤ || x_k − x*_p ||^2 ≤ D^2.
Thus, the number of iterations of the level method does not exceed
N = ⌊ M_f^2 D^2 / ( ε^2 α(1 − α)^2 (2 − α) ) ⌋ + 1. □
Let us discuss the above efficiency estimate. Note that we can obtain the optimal value of the level parameter α from the following maximization problem:
α(1 − α)^2 (2 − α) → max, α ∈ [0, 1].
xEQ,
where Q is a bounded closed convex set, and the functions f(x), f_j(x) are Lipschitz continuous on Q.
Let us rewrite this problem as a problem with a single functional constraint. Denote f̄(x) = max_{1≤j≤m} f_j(x). Then we obtain the equivalent problem
min f(x),
s.t. f̄(x) ≤ 0, x ∈ Q.    (3.3.5)
Note that f(x) and f̄(x) are convex and Lipschitz continuous. In this section we will try to solve (3.3.5) using models for both of them.
Let us define the corresponding models. Consider a sequence X = {x_k}_{k=0}^∞. Denote
f*(t) = min_{x∈Q} f(t; x).
LEMMA 3.3.4
However, in this case f̂*_k = f̂_k(X; x*_k) ≤ f̂_k(X; y) ≤ t_k(X) < f̂*_k. That is a contradiction. □
LEMMA 3.3.5 Let t_0 < t_1 ≤ t*. Assume that f*_k(X; t_1) > 0. Then t*_k(X) > t_1 and
(3.3.6)
(note that f*_k(X; t_2) = 0). Let x_α = (1 − α) x*_k(t_0) + α x*_k(t_2). Then we have
f*_k(X; t_1) ≤ max{ f̂_k(X; x_α) − t_1; f̄_k(X; x_α) }
Since t_{k+1} = t*_{j(k)}(X) and in view of Lemma 3.3.5, for all k ≥ 1 we have
σ_{k−1} = (1/√(t_k − t_{k−1})) f*_{j(k−1)}(X; t_{k−1}) ≥ (1/√(t_k − t_{k−1})) f*_{j(k)}(X; t_{k−1})
≥ (2/√(t_{k+1} − t_k)) f̂*_{j(k)}(X; t_k) ≥ (2(1 − κ)/√(t_{k+1} − t_k)) f*_{j(k)}(X; t_k) = σ_k / β,
where β = 1/(2(1 − κ)). Therefore
f*_{j(k)}(X; t_k) = σ_k √(t_{k+1} − t_k) ≤ β^k σ_0 √(t_{k+1} − t_k)
= β^k f*_{j(0)}(X; t_0) √( (t_{k+1} − t_k)/(t_1 − t_0) ).
Therefore we obtain the following bound on the total number of oracle calls:
M_f^2 D^2 ln( 2(t_0 − t*)/((1 − κ)ε) ) / ( ε^2 α(1 − α)^2 (2 − α) κ^2 ln[2(1 − κ)] ).
It can be shown that a reasonable choice for the parameters of this scheme is α = κ = 1/(2 + √2).
The principal term in the above complexity estimate is of the order of (1/ε^2) ln( 2(t_0 − t*)/ε ). Thus, the constrained level method is suboptimal (see Theorem 3.2.1).
In this method, at each iteration of the master process we need to find the root t*_{j(k)}(X). In view of Lemma 3.3.4, that is equivalent to the following problem:
min{ f̂_k(X; x) : f̄_k(X; x) ≤ 0, x ∈ Q }.
min t,  x ∈ Q.
If Q is a polytope, this problem can be solved by finite linear programming methods (simplex method). If Q is more complicated, we need to use interior-point schemes.
To conclude this section, let us note that we can use a better model for the functional constraints. Since
STRUCTURAL OPTIMIZATION
1 We have discussed this concept and the corresponding methods in the previous chapters.
L y = b,  L^T x = y.
3. Solve the auxiliary system.
This process can be seen as a sequence of equivalent transformations of the initial problem.
Imagine for a moment that we do not know how to solve systems of linear equations. In order to discover the above scheme we should perform the following steps:
1. Find a class of problems which can be solved very efficiently (linear systems with triangular matrices in our example).
2. Describe the transformation rules for converting our initial problem into the desired form.
3. Describe the class of problems for which these transformation rules are applicable.
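In code, the triangular-solve scheme above amounts to one Cholesky factorization followed by two triangular substitutions (the matrix and right-hand side below are illustrative):

```python
import numpy as np
from scipy.linalg import solve_triangular

# Solve A x = b for a symmetric positive definite A via two
# triangular systems: A = L L^T, then L y = b, then L^T x = y.
A = np.array([[4.0, 2.0],
              [2.0, 3.0]])
b = np.array([1.0, 2.0])

L = np.linalg.cholesky(A)                  # A = L L^T, L lower triangular
y = solve_triangular(L, b, lower=True)     # forward substitution: L y = b
x = solve_triangular(L.T, y, lower=False)  # backward substitution: L^T x = y
```

The two triangular systems are exactly the "class of problems which can be solved very efficiently" referred to in step 1.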
We are ready to explain the way it works in optimization. First of all, we need to find a basic numerical scheme and a problem formulation for which this scheme is very efficient. We will see that for our goals the most appropriate candidate is the Newton method (see Section 1.2.4) as applied in the framework of Sequential Unconstrained Minimization (see Section 1.3.3).
In the succeeding section we will highlight some drawbacks of the standard analysis of the Newton method. From this analysis we derive a
family of very special convex functions, the self-concordant functions
and self-concordant barriers, which can be efficiently minimized by the
Newton method. We use these objects in a description of a transformed
version of our initial problem. In the sequel we refer to this description as
to a barrier model of our problem. This model will replace the standard
functional model of the optimization problem used in the previous chapters.
0
Structural optimization 175
II f"'(x)[u]II:S: M II u II .
This means that at any point x E Rn we have
We accept this statement without proof since it needs some special facts from the theory of trilinear symmetric forms.
In what follows, we very often use Definition 4.1.1 in order to prove that some f is self-concordant. In contrast, Lemma 4.1.2 is useful for establishing the properties of self-concordant functions.
Let us consider several examples.
EXAMPLE 4.1.1 1. Linear function. Consider the function
Then
f'(x) = a,  f''(x) = 0,  f'''(x) = 0,
and we conclude that M_f = 0.
2. Convex quadratic function. Consider the function
Define f(x) = −ln φ(x), with dom f = {x ∈ R^n | φ(x) > 0}. Then
− (1/φ^2(x)) [ (a, u) − (Ax, u) ] (Au, u).
D^2 f(x)[u, u] = ω_1^2 + ω_2 ≥ 0,
| D^3 f(x)[u, u, u] | = | 2ω_1^3 + 3ω_1 ω_2 |.
The only nontrivial case is ω_1 ≠ 0. Denote α = ω_2/ω_1^2. Then
the right-hand side of this inequality does not change when we replace (ω_1, ω_2) by (tω_1, t^2 ω_2) with t > 0. Therefore we can assume that
ω_1^2 + ω_2 = 1.
Denote ξ = αω_1. Then the right-hand side of the above inequality becomes equal to
Note that z_k ∈ epi f, but z̄ ∉ epi f since y_α ∉ dom f. That is a contradiction since the function f is closed. Considering the direction −u, and assuming that this ray intersects the boundary, we come to a contradiction again. Therefore we conclude that y_α ∈ dom f for all α. However, that contradicts the assumptions of the theorem. □
(−φ(0), φ(0)).
φ(t) ≥ φ(0) − | t |
in view of Lemma 4.1.3. □
3. If || y − x ||_x < 1, then
|| y − x ||_y ≤ || y − x ||_x / (1 − || y − x ||_x).    (4.1.5)
Therefore we can estimate the matrix
G = ∫_0^1 f''(x + τ(y − x)) dτ
as follows:
(1 − r + r^2/3) f''(x) ⪯ G ⪯ (1/(1 − r)) f''(x).
Proof: Indeed, in view of Theorem 4.1.6 we have
G = ∫_0^1 f''(x + τ(y − x)) dτ ⪰ f''(x) · ∫_0^1 (1 − τr)^2 dτ = (1 − r + r^2/3) f''(x),
G ⪯ f''(x) · ∫_0^1 dτ/(1 − τr)^2 = (1/(1 − r)) f''(x). □
These two facts form the basis for almost all subsequent results.
We conclude this section with the results describing the variation of
a self-concordant function with respect to a linear approximation.
THEOREM 4.1.7 For any x, y ∈ dom f we have
≥ ∫_0^1 ( r^2/(1 + τr)^2 ) dτ = r ∫_0^r dt/(1 + t)^2 = r^2/(1 + r).
Further, using (4.1.7), we obtain
f(y) − f(x) − (f'(x), y − x) = ∫_0^1 (f'(y_τ) − f'(x), y − x) dτ
= ∫_0^1 (1/τ)(f'(y_τ) − f'(x), y_τ − x) dτ
≥ ∫_0^1 || y_τ − x ||_x^2 / ( τ(1 + || y_τ − x ||_x) ) dτ = ∫_0^1 ( τ r^2/(1 + τr) ) dτ
= ∫_0^r ( t/(1 + t) ) dt = ω(r). □
Proof: Denote y_τ = x + τ(y − x), τ ∈ [0, 1], and r = || y − x ||_x. Since || y_τ − x ||_x < 1, in view of (4.1.5) we have
(f'(y) − f'(x), y − x) = ∫_0^1 (f''(y_τ)(y − x), y − x) dτ
≤ ∫_0^1 ( r^2/(1 − τr)^2 ) dτ = r ∫_0^r dt/(1 − t)^2 = r^2/(1 − r).
f(y) − f(x) − (f'(x), y − x) = ∫_0^1 (1/τ)(f'(y_τ) − f'(x), y_τ − x) dτ
≤ ∫_0^1 ( τ r^2/(1 − τr) ) dτ = ∫_0^r ( t/(1 − t) ) dt = ω_*(r). □
Denote r = || u ||_x = [φ''(0)]^{1/2}. Assuming that (4.1.8) holds for all x and y from dom f, we have
ω'(ω_*'(t)) = t,  ω_*'(ω'(t)) = t,
≤ min_{z∈Q} [ φ(y) + (φ'(y), z − y) + ω_*(|| z − y ||_y) ]
= φ(y) − ω( || φ'(y) ||_y^* )
f_ε'(x) = ε − 1/x,  f_ε''(x) = 1/x^2.
Therefore λ_{f_ε}(x) = | 1 − εx |. Thus, for ε = 0 we have λ_{f_0}(x) = 1 for any x > 0. Note that the function f_0 is not bounded below.
If ε > 0, then x*_ε = 1/ε. Note that we can recognize the existence of the minimizer at the point x = 1 even if ε is arbitrarily small. □
(4.1.14)
0. Choose x_0 ∈ dom f.
1. Iterate x_{k+1} = x_k − (1/(1 + λ_f(x_k))) [f''(x_k)]^{−1} f'(x_k), k ≥ 0.
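A sketch of this damped Newton scheme in dimension one, applied to the self-concordant function f_ε(x) = εx − ln x discussed above (ε = 0.01 is an illustrative value; the minimizer is x*_ε = 1/ε = 100):

```python
import math

def damped_newton(fp, fpp, x0, iters=50):
    """Damped Newton scheme (4.1.14): the Newton step is scaled by
    1/(1 + lambda_f(x)), where lambda_f(x) = (f'(x)^2 / f''(x))^{1/2}
    is the Newton decrement."""
    x = x0
    for _ in range(iters):
        g, h = fp(x), fpp(x)
        lam = math.sqrt(g * g / h)       # Newton decrement
        x = x - g / ((1.0 + lam) * h)    # damped Newton step
    return x

eps = 0.01
x_min = damped_newton(lambda x: eps - 1.0 / x,
                      lambda x: 1.0 / (x * x), x0=1.0)
```

The damping keeps every iterate inside dom f = (0, ∞), which is exactly the global guarantee of Theorem 4.1.12; once λ_f(x_k) becomes small, the scheme behaves like the pure Newton method and converges quadratically.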
= f(x_k) − λ^2/(1 + λ) + ω_*(ω'(λ))
= f(x_k) − λω'(λ) + ω_*(ω'(λ)) = f(x_k) − ω(λ). □
Thus, for all x ∈ dom f with λ_f(x) ≥ β > 0, one step of the damped Newton method decreases the value of f(x) at least by the constant ω(β) > 0. Note that the result of Theorem 4.1.12 is global. It can be used to obtain a global efficiency estimate of the process.
Let us now describe the local convergence of the standard Newton method:
(4.1.16)
0. Choose x_0 ∈ dom f.
1. Iterate x_{k+1} = x_k − [f''(x_k)]^{−1} f'(x_k), k ≥ 0.
defined by the minimum itself. Let us prove that locally all these measures are equivalent.
(4.1.18)
G = ∫_0^1 f''(x*_f + τ(x − x*_f)) dτ,
and
λ_f(x)^2 = ( [f''(x)]^{−1} G(x − x*_f), G(x − x*_f) ) ≤ || H ||^2 r^2,
where H = [f''(x)]^{−1/2} G [f''(x)]^{−1/2}. In view of Corollary 4.1.4, we have
G ⪯ (1/(1 − r)) f''(x).
Thus, λ_f(x) ≤ ω_*'(r) = r/(1 − r). Applying ω'(·) to both sides, we get the remaining part of (4.1.18).
Finally, inequalities (4.1.19) follow from (4.1.8) and (4.1.10). □
λ_f(x_{k+1}) ≤ ( λ_f(x_k)/(1 − λ_f(x_k)) )^2 ≤ β λ_f(x_k)/(1 − β)^2 < λ_f(x_k).    (4.1.20)
4.2.1 Motivation
In the previous section we have seen that the Newton method is
very efficient in minimizing a standard self-concordant function. Such
a function is always a barrier for its domain. Let us check what can
be proved about the sequential unconstrained minimization approach
(Section 1.3.3), which uses such barriers.
In what follows we deal with constrained minimization problems of a special type. Denote Dom f = cl(dom f).
DEFINITION 4.2.1 We call a constrained minimization problem standard if it has the form
This trajectory is called the central path of the problem (4.2.1). Note that we can expect x*(t) → x* as t → ∞ (see Section 1.3.3). Therefore we are going to follow this trajectory.
Recall that the standard Newton method, as applied to the minimization of the function f(t; x), has local quadratic convergence (Theorem 4.1.14). Moreover, we have an explicit description of the region of quadratic convergence:
λ_{f(t;·)}(x) ≤ β < λ̄ = (3 − √5)/2.
Let us study our possibilities assuming that we know exactly x = x*(t) for some t > 0.
Thus, we are going to increase t:
t_+ = t + Δ, Δ > 0.
However, we need to keep x in the region of quadratic convergence of the Newton method for the function f(t + Δ; ·):
λ_{f(t+Δ;·)}(x) ≤ β < λ̄.
Note that the update t → t_+ does not change the Hessian of the barrier function:
f''(t + Δ; x) = f''(t; x).
Therefore it is easy to estimate how big the step Δ can be. Indeed, the first-order optimality condition provides us with the following central path equation:
t c + f'(x*(t)) = 0.    (4.2.2)
Since t c + f'(x) = 0, we obtain
for all x ∈ dom F. The value ν is called the parameter of the barrier. (To see this for u with (F''(x)u, u) > 0, replace u in (4.2.3) by λu and find the maximum of the left-hand side in λ.) Note that condition (4.2.5) can be written in matrix notation:
F'(x) = −1/x,  F''(x) = 1/x^2,  (F'(x))^2 / F''(x) = 1.
(F''(x)u, u) = ω_1^2 + ω_2 ≥ ω_1^2.
Therefore 2(F'(x), u) − (F''(x)u, u) ≤ 2ω_1 − ω_1^2 ≤ 1. Thus, F(x) is a ν-self-concordant barrier with ν = 1. □
+ max_{u∈R^n} [ 2(F_2'(x), u) − (F_2''(x)u, u) ] ≤ ν_1 + ν_2. □
Therefore φ(t) increases and is positive for t ∈ [0, 1]. Moreover, for any t ∈ [0, 1] we have
Proof: Denote r = || y − x ||_x and let r > √ν. Consider the point y_α = x + α(y − x) with α = √ν/r < 1. In view of our assumption (4.2.10) and inequality (4.1.7) we have
ω ≡ (F'(y_α), y − x) > (F'(y_α) − F'(x), y − x)
= (1/α)(F'(y_α) − F'(x), y_α − x)
Proof: The first statement follows from Theorem 4.2.5 since F'(x*_F) = 0. The second statement follows from Theorem 4.1.5. □
Thus, the asphericity of the set Dom F with respect to x*_F, computed in the metric || · ||_{x*_F}, does not exceed ν + 2√ν. It is well known that for any convex set in R^n there exists a metric in which the asphericity of this set is less than or equal to n (John theorem). However, we managed to estimate the asphericity in terms of the parameter of the barrier. This value does not depend directly on the dimension of the space.
Note also that if Dom F contains no straight line, then the existence of x*_F implies the boundedness of Dom F (since then F''(x*_F) is nondegenerate; see Theorem 4.1.3).
COROLLARY 4.2.1 Let Dom F be bounded. Then for any x ∈ dom F and v ∈ R^n we have
⊆ { y ∈ R^n : || y − x*_F ||_x ≤ ν + 2√ν } ≡ B_*.
Therefore, using again Theorem 4.2.6, we get the following:
|| v ||_x^* = max{ (v, y − x) : y ∈ B } ≤ max{ (v, y − x) : y ∈ B_* }
with a bounded closed convex set Q = Dom F, which has nonempty interior and is endowed with a ν-self-concordant barrier F(x).
Recall that we are going to solve (4.2.12) by tracing the central path:
Since the set Q is bounded, the analytic center x*_F of this set exists and
In order to follow the central path, we are going to update points satisfying the approximate centering condition:
where c* is the optimal value of (4.2.12). If a point x satisfies the centering condition (4.2.16), then
(4.2.19)
|| t c + F'(x) ||_x^* ≤ β
with β < λ̄ = (3 − √5)/2. Then, for γ such that
| γ | ≤ √β/(1 + √β) − β,    (4.2.20)
we have
λ_+ ≤ ( λ/(1 − λ) )^2 = [ ω_*'(λ) ]^2.
It remains to note that inequality (4.2.20) holds for
β = 1/9,  γ = √β/(1 + √β) − β = 5/36.    (4.2.22)
We have proved that it is possible to follow the central path using the rule (4.2.19). Note that we can either increase or decrease the current value of t. The lower estimate for the rate of increase of t is
t_+ ≥ ( 1 + 5/(4 + 36√ν) ) · t,
and the upper estimate for the rate of decrease of t is
t_+ ≤ ( 1 − 5/(4 + 36√ν) ) · t.
Thus, the general scheme for solving the problem (4.2.12) is as follows.
t_0 = γ / || c ||_{x_0}^*,
|| c ||_{x_0}^* ≤ ( (1 − r_0)/(1 − 2r_0) ) || c ||_{x*_F}^*.
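A compact numerical sketch of the whole path-following idea on a toy instance: minimize (c, x) over the box (0, 1)^n with the barrier F(x) = −Σ [ln x^{(i)} + ln(1 − x^{(i)})], for which ν = 2n. The fixed multiplicative update of t and the small number of recentering Newton steps below are simplifications of the exact rule (4.2.19), not the scheme itself:

```python
import numpy as np

def path_following_box(c, outer=150, inner=3, growth=1.05):
    """Follow the central path x*(t) = argmin_x [t(c,x) + F(x)] for the
    log-barrier of the box (0,1)^n. Each increase of t is followed by a
    few damped Newton steps on f(t; .) to stay near the path."""
    c = np.asarray(c, dtype=float)
    x = 0.5 * np.ones_like(c)                # analytic center of the box
    t = 1.0
    for _ in range(outer):
        t *= growth                          # simplified penalty update
        for _ in range(inner):               # recentering for f(t; .)
            g = t * c - 1.0 / x + 1.0 / (1.0 - x)          # gradient
            h = 1.0 / x ** 2 + 1.0 / (1.0 - x) ** 2        # diagonal Hessian
            lam = np.sqrt(np.sum(g * g / h))               # Newton decrement
            x = x - g / ((1.0 + lam) * h)                  # damped Newton step
    return x

x_end = path_following_box(np.array([1.0, -1.0]))
```

As t grows, the iterates drift toward the vertex of the box minimizing (c, x), staying strictly inside the domain because the damped step always lies in the Dikin ellipsoid of the barrier.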
Let us now discuss the above complexity estimate. The main term in the complexity is
7.2 √ν · ln( ν || c ||_{x*_F}^* / ε ).
Note that the value ν || c ||_{x*_F}^* estimates the variation of the linear function (c, x) over the set Dom F (see Theorem 4.2.6). Thus, the ratio
|| F'(x) ||_x^* ≤ β,
for certain β ∈ (0, 1).
In order to reach our goal, we can apply two different minimization schemes. The first one is a straightforward implementation of the
0. Choose y_0 ∈ dom F.
Note that the above scheme follows the auxiliary central path y*(t) as t_k → 0. It updates the points {y_k} satisfying the approximate centering condition
β = 1/9,  γ = √β/(1 + √β) − β = 5/36,
x ∈ Q,
min τ,
s.t. f_0(x) ≤ τ,
f_j(x) ≤ κ, j = 1, ..., m,    (4.2.29)
x ∈ Q, τ ≤ t̄, κ ≤ 0.
(d, z) ≤ 0,
where z = (x, τ, κ), (c, z) = τ, (d, z) = κ, and S is the feasible set of the problem (4.2.29) without the constraint κ ≤ 0. Note that we know a self-concordant barrier F(z) for the set S, and we can easily find a point z_0 ∈ int S. Moreover, in view of our assumptions, the set
S(α) = { z ∈ S : (d, z) ≤ α }
is bounded and has nonempty interior for α large enough.
The process of solving the problem (4.2.31) consists of three stages.
1. Choose a starting point z_0 ∈ int S and an initial gap Δ > 0. Set α = (d, z_0) + Δ. If α ≤ 0, then we can use the two-stage process described in Section 4.2.5. Otherwise we do the following. First, we find an approximate analytic center of the set S(α), generated by the barrier
F̃(z) = F(z) − ln(α − (d, z)).
Namely, we find a point z̃ satisfying the condition
λ_{F̃}(z̃) = ( F̃''(z̃)^{−1} ( F'(z̃) + d/(α − (d, z̃)) ), F'(z̃) + d/(α − (d, z̃)) )^{1/2} ≤ β.
In order to generate such a point, we can use the auxiliary schemes discussed in Section 4.2.5.
2. The next stage consists in following the central path z(t) defined by the equation
t d + F̃'(z(t)) = 0, t ≥ 0.
Note that the previous stage provides us with a reasonable approximation to the analytic center z(0). Therefore we can follow this path using the process (4.2.19). This trajectory leads us to the solution of the minimization problem
min{ (d, z) : z ∈ S(α) }.
In view of the Slater condition for the problem (4.2.31), the optimal value of this problem is strictly negative.
The goal of this stage consists in finding an approximation to the analytic center of the set
F'(z*) − d/(d, z*) = 0.
Therefore z* is a point of the central path z(t). The corresponding value of the penalty parameter is
t* = −1/(d, z*) > 0.
This stage ends with a point z̄ satisfying the condition
λ(z̄) = ( F''(z̄)^{−1} ( F'(z̄) − d/(d, z̄) ), F'(z̄) − d/(d, z̄) )^{1/2} ≤ β.
3. Note that F̃''(z) ⪰ F''(z). Therefore, the point z̄ computed at the previous stage satisfies the inequality
Proof: Note that ν ≥ κ by definition. Let us assume that κ < 1. Since f(t) is a barrier for (α, β), there exists a value ᾱ ∈ (α, β) such that f'(t) > 0 for all t ∈ [ᾱ, β).
Consider the function φ(t) = (f'(t))^2 / f''(t), t ∈ [ᾱ, β). Then, since f'(t) > 0, f(t) is self-concordant and φ(t) ≤ κ < 1, we have
φ'(t) = f'(t) ( 2 − f'(t) f'''(t)/[f''(t)]^2 ) ≥ 2(1 − √κ) f'(t).
Hence, for all t ∈ [ᾱ, β) we obtain φ(t) ≥ φ(ᾱ) + 2(1 − √κ)(f(t) − f(ᾱ)). This is a contradiction since f(t) is a barrier and φ(t) is bounded from above. □
x̄ + α p_i ∈ Q for all α ≥ 0,
x̄ − β_i p_i ∉ int Q, i = 1, ..., k.
If for some positive α_1, ..., α_k we have ȳ = x̄ − Σ_{i=1}^k α_i p_i ∈ Q, then the parameter ν of any self-concordant barrier for Q satisfies the inequality
ν ≥ Σ_{i=1}^k α_i/β_i.
(since otherwise the function f(t) = F(x̄ + t p) attains its minimum; see Theorem 4.1.11).
Note that x̄ − β_i p_i ∉ Q. Therefore, in view of Theorem 4.1.5, the norm of the direction p_i is large enough: β_i || p_i ||_{x̄} ≥ 1. Hence, in view of Theorem 4.2.4, we obtain
ν ≥ (F'(x̄), ȳ − x̄) = (F'(x̄), −Σ_{i=1}^k α_i p_i) ≥ Σ_{i=1}^k α_i || p_i ||_{x̄} ≥ Σ_{i=1}^k α_i/β_i. □
It can be proved that for any x E int Q the set P(x) is a bounded closed
convex set with nonempty interior. Denote V(x) = voln P(x).
THEOREM 4.3.2 There exist absolute constants c_1 and c_2 such that the function
U(x) = c_1 · ln V(x)
is a (c_2 · n)-self-concordant barrier for Q. □
The function U(x) is called the universal barrier for the set Q. Note that the analytical complexity of problem (4.3.1), equipped with the universal barrier, is O(√n · ln(n/ε)). Recall that such an efficiency estimate is impossible if we use a local black-box oracle (see Theorem 3.2.5).
The above result is mainly of theoretical interest. In general, the universal barrier U(x) cannot be easily computed. However, Theorem 4.3.2 demonstrates that such barriers, in principle, can be found for any convex set. Thus, the applicability of our approach is restricted only by our ability to construct a computable self-concordant barrier, hopefully with a small value of the parameter. The process of creating a barrier model of the initial problem can hardly be described in a formal way. For each particular problem there could be many different barrier models, and we should choose the best one, taking into account the value of the parameter of the self-concordant barrier, the complexity of computing its gradient and Hessian, and the complexity of solving the Newton system. In the rest of this section we will see how that can be done for some standard problem classes of convex optimization.
s.t. Ax = b,    (4.3.2)
x^{(i)} ≥ 0, i = 1, ..., n,
(see Example 4.2.1 and Theorem 4.2.2). This barrier is called the standard logarithmic barrier for R^n_+.
In order to solve the problem (4.3.2), we have to use the restriction of the barrier F(x) to the affine subspace {x : Ax = b}. Since this restriction is an n-self-concordant barrier (see Theorem 4.2.3), the complexity estimate for the problem (4.3.2) is O(√n · ln(n/ε)) iterations of a path-following scheme.
Let us prove that the standard logarithmic barrier is optimal for R^n_+.
p_i = e_i, i = 1, ..., n,
Note that the above lower bound is valid only for the entire set R^n_+. The lower bound for the intersection {x ∈ R^n_+ : Ax = b} can be smaller.
s.t. q_0(x) ≤ τ,    (4.3.4)
x ∈ R^n, τ ∈ R^1.
The feasible set of this problem can be equipped with the following self-concordant barrier:
F(x, τ) = −ln(τ − q_0(x)) − Σ_{i=1}^m ln(β_i − q_i(x)),  ν = m + 1,
(see Example 4.2.1 and Theorem 4.2.2). Thus, the complexity bound for problem (4.3.3) is O(√(m+1) · ln((m+1)/ε)) iterations of a path-following scheme. Note that this estimate does not depend on n.
In many applications the functional components of the problem include a nonsmooth quadratic term of the form || Ax − b ||. Let us show that we can treat such terms using the interior-point technique.
LEMMA 4.3.3 The function
F(x, t) = −ln(t^2 − || x ||^2)
is a 2-self-concordant barrier for the convex set^5
K_2 = { (x, t) ∈ R^{n+1} : t ≥ || x || }.
Proof: Let us fix a point z = (x, t) ∈ int K_2 and a nonzero direction u = (h, τ) ∈ R^{n+1}. Denote ξ(α) = (t + ατ)^2 − || x + αh ||^2. We need to compare the derivatives of the function
φ(α) = F(z + αu) = −ln ξ(α)
^5 Depending on the field, this set has different names: Lorentz cone, ice-cream cone, second-order cone.
(4.3.5)
ν ≥ α_1/β_1 + α_2/β_2 = 2. □
(4.3.7)
X ∈ P_n,
where C and A_i belong to S^{n×n}. In order to apply a path-following scheme to this problem, we need a self-concordant barrier for P_n.
Let the matrix X belong to int P_n. Denote F(X) = −ln det X. Clearly
F(X) = −ln Π_{i=1}^n λ_i(X),
where {λ_i(X)}_{i=1}^n is the set of eigenvalues of the matrix X.
LEMMA 4.3.5 The function F(X) is convex and F'(X) = −X^{−1}. For any direction Δ ∈ S^{n×n} we have
(F''(X)Δ, Δ)_F = Trace( (X^{−1/2} Δ X^{−1/2})^2 ),
D^3 F(X)[Δ, Δ, Δ] = −2 ( I_n, (X^{−1/2} Δ X^{−1/2})^3 )_F.
Proof: Let us fix some Δ ∈ S^{n×n} and X ∈ int P_n such that X + Δ ∈ P_n. Then
Proof: Let us fix X ∈ int P_n and Δ ∈ S^{n×n}. Denote Q = X^{−1/2} Δ X^{−1/2} and λ_i = λ_i(Q), i = 1, ..., n. Then, in view of Lemma 4.3.5 we have
(F'(X), Δ)_F = −Σ_{i=1}^n λ_i,
(F''(X)Δ, Δ)_F = Σ_{i=1}^n λ_i^2,
D^3 F(X)[Δ, Δ, Δ] = −2 Σ_{i=1}^n λ_i^3.
Using the two standard inequalities
( Σ_{i=1}^n λ_i )^2 ≤ n Σ_{i=1}^n λ_i^2,   | Σ_{i=1}^n λ_i^3 | ≤ ( Σ_{i=1}^n λ_i^2 )^{3/2},
we obtain
(F'(X), Δ)_F^2 ≤ n (F''(X)Δ, Δ)_F,
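The identities of Lemma 4.3.5 and the last inequality are easy to check numerically; the sketch below compares (F'(X), Δ)_F with a finite difference of −ln det and verifies (F'(X), Δ)_F^2 ≤ n (F''(X)Δ, Δ)_F on random symmetric data (all test matrices are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
B = rng.standard_normal((n, n))
X = B @ B.T + n * np.eye(n)              # a point in int P_n
D = rng.standard_normal((n, n))
D = 0.5 * (D + D.T)                      # a symmetric direction

f = lambda M: -np.log(np.linalg.det(M))  # F(X) = -ln det X

# (F'(X), D)_F = Trace(-X^{-1} D): compare with a central difference
eps = 1e-6
fd = (f(X + eps * D) - f(X - eps * D)) / (2.0 * eps)
lhs = np.trace(-np.linalg.inv(X) @ D)

# (F''(X)D, D)_F = Trace((X^{-1/2} D X^{-1/2})^2) = Trace(X^{-1} D X^{-1} D)
Xinv = np.linalg.inv(X)
quad = np.trace(Xinv @ D @ Xinv @ D)
```

The quadratic form `quad` equals Σ λ_i^2 for the eigenvalues λ_i of X^{−1/2} Δ X^{−1/2}, so the barrier inequality above is just the Cauchy-Schwarz bound (Σ λ_i)^2 ≤ n Σ λ_i^2.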
Let us prove that F(X) = −ln det X is an optimal barrier for P_n.
LEMMA 4.3.6 The parameter ν of any self-concordant barrier for the cone P_n satisfies the inequality ν ≥ n.
Proof: Let us choose X̄ = I_n ∈ int P_n and the directions p_i = e_i e_i^T, i = 1, ..., n, where e_i is the ith coordinate vector of R^n. Note that for any γ ≥ 0 we have I_n + γ p_i ∈ int P_n. Moreover,
I_n − e_i e_i^T ∉ int P_n,   I_n − Σ_{i=1}^n e_i e_i^T = 0 ∈ P_n.
Therefore, in view of Theorem 4.3.1 with α_i = β_i = 1, we obtain
ν ≥ Σ_{i=1}^n 1 = n. □
(4.3.8)
(A_i, Δ)_F = 0, i = 1, ..., m.
Δ = X [ −U + Σ_{j=1}^m λ^{(j)} A_j ] X.    (4.3.9)
Let us pose this problem in a formal way. First of all, note that any bounded ellipsoid W ⊂ R^n can be represented as
s.t. −ln det H ≤ τ,    (4.3.13)
|| H a_i − v || ≤ 1, i = 1, ..., m,
ν = m + n + 1.
(a, v) + (Ha, a)^{1/2} ≤ b.
This proves our statement since (a, v) < b. □
Note that vol_n W = vol_n B_2(0, 1) [det H]^{1/2}. Hence, our problem is as follows:
min_{H,τ} τ,
s.t. −ln det H ≤ τ,    (4.3.14)
H ∈ P_n, τ ∈ R^1.
In view of Lemma 4.3.7, we can use the following self-concordant barrier:
F(H, τ) = −ln det H − ln(τ + ln det H),
ν = m + n + 1.
The efficiency estimate of the corresponding path-following scheme is
O( √(m + n + 1) · ln( (m + n)/ε ) )
iterations.
s.t. −ln det G ≤ τ,    (4.3.15)
In view of Lemmas 4.3.7 and 4.3.3, we can use the following self-concordant barrier:
F(G, v, τ) = −ln det G − ln(τ + ln det G),
ν = 2m + n + 1.
where α_{i,j} are positive coefficients, a_{i,j} ∈ R^n, and f_{i,j}(t) are convex functions of one variable. Let us rewrite this problem in a standard form:
min_{x,t,τ} τ_0,
Σ_{j=1}^{m_i} α_{i,j} t_{i,j} ≤ τ_i, i = 0, ..., m,    (4.3.17)
where M = Σ_{i=0}^m m_i. Thus, in order to construct a self-concordant barrier for the feasible set of the problem, we need barriers for the epigraphs of the univariate convex functions f_{i,j}. Let us point out such barriers for several important functions.
Q_5 = { (x, t) ∈ R^2 : x ≥ 0, t^p ≥ x },  0 < p ≤ 1.
4.3.5.4 Decreasing power functions.
The function F_6(x, t) = −ln t − ln(x − t^{−1/p}) is a 2-self-concordant barrier for the set
Q_6 = { (x, t) ∈ R^2 : x > 0, t ≥ x^{−p} },  p ≥ 1,
ν ≥ α_1/β_1 + α_2/β_2 = 2(γ − 1)/γ.
s.t. q_i(x) = Σ_{j=1}^{m_i} α_{i,j} Π_{l=1}^n ( x^{(l)} )^{σ_{i,j}^{(l)}} ≤ 1, i = 1, ..., m,    (4.3.18)
where α_{i,j} are positive coefficients. Note that the problem (4.3.18) is not convex.
Let us introduce the vectors a_{i,j} = ( σ_{i,j}^{(1)}, ..., σ_{i,j}^{(n)} ) ∈ R^n and change the variables: x^{(i)} = e^{y^{(i)}}. Then (4.3.18) is transformed into a convex separable problem:
min_{y∈R^n} Σ_{j=1}^{m_0} α_{0,j} exp( (a_{0,j}, y) ),
s.t. Σ_{j=1}^{m_i} α_{i,j} exp( (a_{i,j}, y) ) ≤ 1, i = 1, ..., m.    (4.3.19)
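Convexity of the transformed problem is immediate: each term α exp((a, y)) has Hessian α exp((a, y)) a a^T ⪰ 0, so the constraints are sums of positive semidefinite contributions. A quick numerical confirmation on an arbitrary posynomial constraint (all data below are illustrative):

```python
import numpy as np

def posynomial_hessian(alphas, A, y):
    """Hessian of q(y) = sum_j alphas[j] * exp((A[j], y)):
    a nonnegative combination of rank-one terms a_j a_j^T,
    hence positive semidefinite for positive alphas."""
    H = np.zeros((len(y), len(y)))
    for alpha, a in zip(alphas, A):
        H += alpha * np.exp(a @ y) * np.outer(a, a)
    return H

alphas = [0.5, 2.0, 1.3]
A = np.array([[1.0, -2.0], [0.3, 0.7], [-1.1, 0.0]])
H = posynomial_hessian(alphas, A, y=np.array([0.2, -0.4]))
eigs = np.linalg.eigvalsh(H)
```

This is exactly why the change of variables x^{(i)} = e^{y^{(i)}} turns the nonconvex problem (4.3.18) into the convex separable problem (4.3.19).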
Denote M = Σ_{i=0}^m m_i. The complexity of solving (4.3.19) by a path-following scheme is
O( M^{1/2} · ln(M/ε) ).
4.3.5.6 Approximation in l_p norms.
The simplest problem of this type is as follows:
(4.3.20)
s.t. α ≤ x ≤ β,
s.t. α ≤ x ≤ β,    (4.3.22)
where p ≥ 1. And let us have two numerical methods available:
• the ellipsoid method (Section 3.2.6);
• the interior-point path-following scheme.
What scheme should we use? We can derive the answer from the complexity estimates of the corresponding methods.
Let us first estimate the performance of the ellipsoid method as applied to problem (4.3.22):
Number of iterations: O( n^2 ln(1/ε) ),
Complexity of the oracle: O(mn) operations,
Σ_{i=1}^m τ^{(i)} ≤ ξ,  α ≤ x ≤ β,    (4.3.23)
Then
F'_x(x, τ, ξ) = Σ_{i=1}^m g_1( τ^{(i)}, (a_i, x) − b^{(i)} ) a_i − Σ_{i=1}^n [ 1/( x^{(i)} − α^{(i)} ) − 1/( β^{(i)} − x^{(i)} ) ] e_i.
Further, denoting
we obtain
F''_{τ^{(i)} τ^{(j)}}(x, τ, ξ) = ( ξ − Σ_{l=1}^m τ^{(l)} )^{−2},  i ≠ j,
F''_{τ^{(i)} ξ}(x, τ, ξ) = −( ξ − Σ_{l=1}^m τ^{(l)} )^{−2},
κ = ( ξ − Σ_{l=1}^m τ^{(l)} )^{−2},  s_i = (a_i, x) − b^{(i)},  i = 1, ..., m,
and
Λ_2 = diag( g_{12}( τ^{(i)}, s_i ) )_{i=1}^m,  D = diag( g_{22}( τ^{(i)}, s_i ) )_{i=1}^m.
Then, using the notation A = (a_1, ..., a_m) and e = (1, ..., 1) ∈ R^m, the Newton system can be written in the following form:
(4.3.24)