
Cybernetics and Systems Analysis, Vol. 39, No. 4, 2003

ALGORITHMS OF NONDIFFERENTIABLE
OPTIMIZATION: DEVELOPMENT AND
APPLICATION

N. Z. Shor, N. G. Zhurbenko, A. P. Likhovid, and P. I. Stetsyuk UDC 519.8

A brief survey of nondifferentiable optimization methods developed at the Institute of Cybernetics is
presented: the subgradient method, the subgradient method with space dilatation in the subgradient
direction, and the r-algorithm. Applications of nondifferentiable optimization methods are considered.

Keywords: nondifferentiable optimization, method of subgradient type, operator of space dilatation,
method of ellipsoids, r-algorithm.

INTRODUCTION

A great many problems arising in the solution of complex mathematical programming problems can be reduced to
the minimization of convex functions with a discontinuous gradient. Methods of nondifferentiable optimization make it
possible to apply various decomposition schemes flexibly (in terms of variables, constraints, resources, etc.), taking into
account the specific character of high-dimension problems, and allow us to derive dual estimates efficiently in discrete and
continuous-discrete programming problems and for some classes of multiextremal problems. It also becomes possible to use
nonsmooth penalty functions, which, for finite values of the penalty parameters, yield an unconstrained minimization
problem completely equivalent to the initial convex programming problem. Moreover, performance characteristics of the
plants being optimized can frequently be well approximated by piecewise smooth functions of unknown parameters, which
generates optimization problems with nonsmooth functions.
The lack of efficient methods of nonsmooth optimization hampered the solution of the above classes of problems and
forced one either to change the problem formulation at the expense of the adequacy of the model of reality or to use various
smoothing methods. The latter does not always lead to success, since the application of smoothing functions results in
ill-conditioning of the minimized function, which worsens the computational stability of such efficient smooth-minimization
methods as quasi-Newton methods and the method of conjugate gradients.
Thus, the domain of application of nonsmooth optimization methods is rather wide; therefore, much attention is given
to the development of computational methods of nonsmooth optimization. In reliability, computation time, and accuracy of
results, these methods are quite competitive with the most efficient methods for solving smooth ill-conditioned problems.
The purpose of the present review is to describe briefly the sets of nondifferentiable optimization algorithms developed
at the Institute of Cybernetics and to show their numerous applications. The review includes the following subgradient-type
methods: the generalized gradient descent; subgradient-type methods with space dilatation in the subgradient direction; and
r-algorithms, a powerful practical tool for solving nondifferentiable optimization problems.
V. M. Glushkov placed strong emphasis on nonsmooth optimization methods. At the highest governmental level, he
promoted their application to various branches of the national economy. In particular, he paid much attention to the creation
of a system of automated production scheduling and order distribution in ferrous metallurgy.

V. M. Glushkov Institute of Cybernetics, National Academy of Sciences of Ukraine, Kiev, nzshor@d120.icyb.kiev.ua;
zhurb@d120.icyb.kiev.ua; stetsyuk@d120.icyb.kiev.ua. Translated from Kibernetika i Sistemnyi Analiz, No. 4, pp. 80-94,
July-August 2003. Original article submitted June 10, 2003.

1060-0396/03/3904-0537$25.00 © 2003 Plenum Publishing Corporation
THE GENERALIZED GRADIENT DESCENT

Let f(x) be a convex function defined on the Euclidean space E^n, X* be its set of minima (it can also be empty),
x* ∈ X* be a minimum point, inf f(x) = f*, and g_f(x) be an (arbitrary) subgradient of the function f(x) at the point x.
The subgradient g_f(x̄) of the function f at the point x̄ is a vector g_f(x̄) such that

f(x) − f(x̄) ≥ (g_f(x̄), x − x̄)   ∀ x ∈ E^n.

It follows from the definition of a subgradient that if f(x) < f(x̄), then

(−g_f(x̄), x − x̄) > 0.   (1)

Geometrically, formula (1) means that the antisubgradient at the point x̄ forms an acute angle with any direction
from x̄ toward a point x with a smaller value of f(x). Hence, if X* is not empty and x̄ ∉ X*, then, when moving from
x̄ in the direction −g_f(x̄) with a sufficiently small step, the distance to X* decreases. This simple fact forms the basis of
the subgradient method, or the method of the generalized gradient descent (GGD), first proposed in [1] in connection
with the solution of the network transport problem.
By the method of the generalized gradient descent we mean the procedure of constructing a minimizing sequence
{x_k}, k = 0, 1, 2, …, where x_0 is the initial approximation and the x_k are constructed by the recurrence formula

x_{k+1} = x_k − h_{k+1} g_f(x_k)/||g_f(x_k)||,   k = 0, 1, 2, …;   (2)

here g_f(x_k) is an arbitrary subgradient of the function f(x) at the point x_k and h_{k+1} is a step factor. If ||g_f(x_k)|| = 0,
then x_k is a minimum point of the function f(x), and the process terminates.
Beginning in 1962, several versions of the GGD based on relation (2) were developed at the Institute of Cybernetics.
These results were obtained in the period from 1962 till 1968 and are reflected in the monograph [2]. The following theorem
(see, for example, [2]) contains the most general result on the convergence of the GGD.
THEOREM 1. Let f(x) be a convex function defined on E^n with a bounded set of minima X*, and let {h_k},
k = 1, 2, …, be a sequence of numbers with the properties

h_k > 0;   lim_{k→∞} h_k = 0;   Σ_{k=1}^∞ h_k = +∞.

Then the sequence {x_k}, k = 1, 2, …, derived by formula (2), possesses one of the following properties for an arbitrary
x_0 ∈ E^n: either there exists k = k̄ such that x_{k̄} ∈ X*, or lim_{k→∞} min_{y∈X*} ||x_k − y|| = 0 and
lim_{k→∞} f(x_k) = min_{y∈E^n} f(y) = f*.
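The recurrence (2) with the divergent-series step rule of Theorem 1 can be sketched in a few lines of pure Python (a minimal illustration, not the original implementation; the test function and the rule h_k = 1/k are chosen here only as an example satisfying the hypotheses of the theorem):

```python
import math

def ggd(f, subgrad, x0, h, iters):
    """Generalized gradient descent (2): x_{k+1} = x_k - h_{k+1} g/||g||."""
    x = list(x0)
    best = f(x)
    for k in range(1, iters + 1):
        g = subgrad(x)
        norm = math.sqrt(sum(gi * gi for gi in g))
        if norm == 0.0:               # x_k is already a minimum point
            return x, best
        hk = h(k)                     # step factor h_k
        x = [xi - hk * gi / norm for xi, gi in zip(x, g)]
        best = min(best, f(x))
    return x, best

# Example: f(x) = |x1| + 2|x2|, a nonsmooth convex function with minimum 0.
f = lambda x: abs(x[0]) + 2.0 * abs(x[1])
sg = lambda x: [1.0 if x[0] >= 0 else -1.0, 2.0 if x[1] >= 0 else -2.0]

# Divergent-series rule of Theorem 1: h_k = 1/k (h_k -> 0, sum h_k = +inf).
x_fin, best = ggd(f, sg, [3.0, -2.0], lambda k: 1.0 / k, 2000)
```

As the theorem predicts, the best recorded value of f approaches f* = 0, although the convergence is slow: the iterates eventually hover around the minimum with an oscillation of the order of the current step h_k.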
Under certain additional assumptions, it became possible to obtain GGD versions that converge at the rate of a
geometrical progression.
THEOREM 2. Let f(x) be a convex function defined on E^n, and let the inequality

(g_f(x), x − x*(x)) ≥ cos φ · ||g_f(x)|| · ||x − x*(x)||   (3)

hold for all x ∈ E^n and some φ (0 ≤ φ < π/2), where x*(x) is the point of the set of minima of the function f(x) nearest
to x. If for a given x_0 we select a quantity h_1 satisfying the inequality

h_1 ≥ ||x*(x_0) − x_0|| cos φ          for π/4 ≤ φ < π/2,
h_1 ≥ ||x*(x_0) − x_0|| / (2 cos φ)    for 0 ≤ φ < π/4,

determine {h_k}, k = 1, 2, …, according to the recurrence formula

h_{k+1} = h_k r(φ),   k = 1, 2, …,

where

r(φ) = sin φ           for π/4 ≤ φ < π/2,
r(φ) = 1/(2 cos φ)     for 0 ≤ φ < π/4,

and calculate {x_k}, k = 1, 2, …, by formula (2), then for some k = k* either g_f(x_{k*}) = 0 and x_{k*} belongs to the
set of minima, or for all k = 1, 2, … the inequality

||x_k − x*(x_k)|| ≤ h_{k+1} / cos φ      for π/4 ≤ φ < π/2,
||x_k − x*(x_k)|| ≤ 2 cos φ · h_{k+1}    for 0 ≤ φ < π/4

is fulfilled.

Thus, if the angle φ is known beforehand, then, adjusting the step by the formulas of Theorem 2, one can obtain
convergence to the minimum at the rate of a geometrical progression with denominator q = r(φ).
In formula (3), cos φ characterizes the degree of elongation of the level surfaces of the function f(x). If in some
neighborhood of the minimum of the function f(x) there is no φ < π/2 such that (3) is fulfilled for every x from this
neighborhood, then we call such a function essentially ravine. The way of adjusting the step factors presented in
Theorem 2 is inapplicable to the minimization of essentially ravine functions. In this case, one should use the universal
method of choosing step factors presented in Theorem 1.
Let us formulate a theorem similar to Theorem 2 immediately in terms that characterize the degree of "elongation" of
the level surfaces.
THEOREM 3. Let the convex function f(x) be defined on E^n, x* be the unique minimum point of f(x), and the initial
approximation x_0 and the numbers σ and h_1 be given, with σ ≥ 2, h_1 ≥ ||x_0 − x*||/σ. Consider the set
Y = {y : ||y − x*|| ≤ σ h_1}. If for any pair of points x, z ∈ Y such that f(x) = f(z) ≠ f(x*) the condition

||x − x*|| ≤ σ ||z − x*||

is satisfied, then the sequence {x_k}, k = 0, 1, 2, …, derived by recurrence formula (2) with h_{k+1} = h_k √(σ² − 1)/σ,
converges to x* at the rate of a geometrical progression:

||x_k − x*|| ≤ h_{k+1} σ,

except for the case where g_f(x_{k̄}) = 0 for some k = k̄, i.e., x_{k̄} = x*.


Let us consider one more version of the GGD method, in which the step factor remains constant during a definite number of
steps and then is halved [2].
THEOREM 4. Let the conditions of Theorem 3 with σ ≥ 2 be satisfied for the convex function f(x). Consider for a
given x_0 the iterative process (2), where h_{k+1} = h_0 · 2^{−⌊(k+1)/N⌋}. Here ⌊a⌋ is the integer part of the number a. For a
sufficiently large h_0 and N ≥ 3σ² + 1, the inequality

||x_k − x*|| ≤ 2σ h_{k+1},   k = 0, 1, 2, …

is fulfilled.

Step adjustment according to Theorem 4 was used in the GGD method proposed in 1962 for solving the
transport problem in network form [1]. Actually, paper [1] was the first example of application of a subgradient
process to the minimization of convex nondifferentiable functions.
The GGD methods made it possible to solve many problems of production-transport scheduling with the use of
decomposition schemes (with respect to variables and constraints) for large-dimension problems. Detailed information on
these problems is available in [3]. The GGD method also served as a basis for the creation of a stochastic analog of the
generalized gradient descent [4], which has wide practical application, in particular, in solving multistage stochastic
programming problems. The application of the generalized stochastic gradient method to the two-stage stochastic transport
problem connected with determining the volumes of warehouses of similar products in the case of random demand is
described in [3].
The results on the GGD methods obtained at the Institute of Cybernetics were developed further in [5, 6] for solving
convex programming problems with constraints. Little was known abroad about the works on the GGD till 1974, since they
were published in Russian in not easily accessible literature. They became popular after the publication in [7] of a detailed
review of the results and the bibliography on nondifferentiable optimization obtained in the USSR.

SUBGRADIENT METHODS WITH SPACE
DILATATION IN THE SUBGRADIENT DIRECTION

In the analysis of the GGD algorithms converging at the rate of a geometrical progression, the upper bounds of the
sines of the angles between the antigradient at a given point and the direction from this point to the minimum point played a
significant role. The slow convergence of the GGD in ravine problems, when the upper bound of the above-mentioned
angles is equal to π/2, is insurmountable within the framework of this method.
One can change the situation by using linear nonorthogonal transformations of the space of arguments to improve the
conditionality of the problem. When the antigradients form an angle close to π/2 with the direction to the minimum point, it is
reasonable to apply the operation of space dilatation in the gradient direction in order to decrease their "transversal" components.
These heuristic considerations served as a basis for the creation of a set of subgradient-type methods with space dilatation.
N. Z. Shor [8–10] pioneered the introduction of the operation of space dilatation in the gradient direction as a heuristic
procedure for improving the conditionality of the problem. The procedure is realized through the operator of space
dilatation, which has the following vector form:

R_α(ξ) = I_n + (α − 1) ξξ^T,   ξ ∈ E^n, ||ξ|| = 1, α > 1,

where (·)^T denotes transposition, ||·|| is the Euclidean norm, I_n is the unit matrix of order n, α is the coefficient of space
dilatation, and ξ is the dilatation direction. Properties of the operator R_α(ξ) are presented at length in [2].
In describing the algorithms, the operator R_β(ξ) inverse to the operator of space dilatation R_α(ξ) is used. It has
the following form (see [2]):

R_β(ξ) = R_α^{−1}(ξ) = I_n + (β − 1) ξξ^T,   β = 1/α < 1.
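The identity R_β(ξ) R_α(ξ) = I_n for β = 1/α is easy to verify numerically (a small pure-Python check with an illustrative unit direction; the helper names are not from the original):

```python
def dilatation(xi, alpha):
    """Space dilatation operator R_alpha(xi) = I_n + (alpha - 1) xi xi^T, ||xi|| = 1."""
    n = len(xi)
    return [[(1.0 if i == j else 0.0) + (alpha - 1.0) * xi[i] * xi[j]
             for j in range(n)] for i in range(n)]

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

xi = [3.0 / 5.0, 4.0 / 5.0]           # a unit vector, ||xi|| = 1
alpha = 2.0
R_a = dilatation(xi, alpha)           # dilatation with coefficient alpha
R_b = dilatation(xi, 1.0 / alpha)     # the inverse operator R_beta, beta = 1/alpha
P = matmul(R_a, R_b)                  # should be the unit matrix
```

Expanding the product gives I_n + (α − 1 + β − 1 + (α − 1)(β − 1)) ξξ^T, and the coefficient vanishes exactly when β = 1/α, which is what the check confirms.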
Let us describe a schematic diagram of the subgradient-type algorithms with space dilatation in the subgradient
direction for the minimization of the function f(x).
Given are x_0 ∈ E^n and B_0 = A_0^{−1} = I_n (the unit n × n matrix). After k steps we have x_k ∈ E^n and
B_k = A_k^{−1}, where A_k is the n × n matrix of the space transformation after k steps.
1. Calculate g_f(x_k) (if g_f(x_k) = 0, the process terminates).
2. Determine g̃_k = g_{φ_k}(y_k) = B_k^T g_f(x_k), where φ_k(y) = f(B_k y) and y_k = A_k x_k; g̃_k is the generalized
gradient of the function φ_k(y) defined in the "dilatated" space.
3. Find

ξ_k = g̃_k/||g̃_k||;   x_{k+1} = x_k − h_{k+1} B_k ξ_k.   (4)

Formula (4) corresponds to moving along the antisubgradient in the "dilatated" space: A_k x_{k+1} = y_k − h_{k+1} ξ_k.
4. Calculate

B_{k+1} = A_{k+1}^{−1} = B_k R_{β_{k+1}}(ξ_k),   β_{k+1} = 1/α_{k+1}.   (5)

Formula (5) corresponds to dilatation of the transformed space in the direction ξ_k: A_{k+1} = R_{α_{k+1}}(ξ_k) A_k,
α_{k+1} > 1.
5. Go to the next step, replacing k by k + 1.
The main difficulty in designing an efficient algorithm is the selection of the coefficients of space dilatation α_k and
the strategy of modification of the step factors h_k. The first experiments showed that selecting α = 2 and h_k = const gives
good results for many examples of convex ravine functions [8]. Unfortunately, such a simple recipe does not always
furnish the desired result. In constructing other theoretically justified versions of the algorithms, the step factors and
coefficients of space dilatation were selected so that the sequence of distances to a minimum point in the corresponding
transformed spaces did not increase. This principle guarantees convergence at the rate of a geometrical progression in the
value of the function. To realize it, we need some additional information about the function f(x), namely, its value at the
minimum point f* and the so-called growth constants M and N.
THEOREM 5. Let f(x) be a convex function defined on E^n, and let its subgradient satisfy in some spherical
neighborhood S_d = {x : ||x − x*|| ≤ d} of the minimum point x* the two-sided inequality

N (f(x) − f(x*)) ≤ (g_f(x), x − x*) ≤ M (f(x) − f(x*)),   (6)

where M > N are positive constants. Then if we put in the algorithm

x_0 ∈ S_d,   h_{k+1} = (2MN/(M + N)) · (f(x_k) − f(x*))/||g̃_k||,

1 < α_{k+1} ≤ (M + N)/(M − N),   k = 0, 1, 2, …,

then for all k = 0, 1, …

||A_k (x_k − x*)|| ≤ d,   (7)

whence the localization of x* in an ellipsoid Φ_k with center at the point x_k follows. The ratio of the volumes of the
ellipsoids Φ_{k+1} and Φ_k is specified by the equality

vol(Φ_{k+1})/vol(Φ_k) = β_k = (M − N)/(M + N).

For a quadratic positive definite function, we may select M = N = 2 in inequality (6). For a piecewise linear function
whose epigraph is a cone with vertex at the point (x*, f*), we may select M = N = 1. In these cases, β_{k+1} = β = 0 and
the algorithm converges in no more than n steps.
The solution of a nondegenerate system of n linear equations in n unknowns, (a_i, x) + b_i = 0, i = 1, …, n, can be
replaced by finding the minimum of f(x) = max_{1≤i≤n} |(a_i, x) + b_i|. Putting f* = 0 and β_k = 0 and applying method
(4), (5), we obtain an algorithm corresponding to a well-known finite procedure for solving linear algebraic systems, namely,
the method of orthogonalization of gradients.
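The quadratic special case M = N = 2 can be checked numerically in its simplest setting (an illustrative sketch; the point c and starting point are arbitrary choices): for f(x) = ||x − c||² the level sets are spherical, no dilatation is needed, and the step rule of Theorem 5 with f* = 0 reduces to h = 2(f(x) − f*)/||g_f(x)||, which reaches the minimum in a single step.

```python
import math

c = [1.0, -2.0, 0.5]                           # minimum point (illustrative)
f = lambda x: sum((xi - ci) ** 2 for xi, ci in zip(x, c))
grad = lambda x: [2.0 * (xi - ci) for xi, ci in zip(x, c)]

x0 = [4.0, 1.0, -3.0]
g = grad(x0)
gnorm = math.sqrt(sum(gi * gi for gi in g))
# Step rule of Theorem 5 with M = N = 2 and f* = 0: h = 2 f(x0)/||g||.
h = (2.0 * 2 * 2 / (2 + 2)) * (f(x0) - 0.0) / gnorm
x1 = [xi - h * gi / gnorm for xi, gi in zip(x0, g)]
# x1 coincides with c: the minimum is reached in one step.
```

Indeed, ||g|| = 2||x_0 − c|| and h = ||x_0 − c||, so the normalized step x_0 − h g/||g|| lands exactly at c.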
Generalizations of Theorem 5 have also been obtained for some classes of nonconvex functions appearing in the
solution of systems of nonlinear equations f_i(x) = 0, i = 1, …, n. For f(x) = max_i |f_i(x)| one can show that if x* (the
solution of the system) is a regular point (i.e., the functions f_i(x) are continuously differentiable at this point and the
Jacobian of the system I(x*) is nonzero), then for any δ > 0 there exists a small neighborhood S_d(x*) such that the
constants M and N in (6) can be selected, respectively, as M = 1 + δ, N = 1 − δ, β = (M − N)/(M + N) = δ. As shown
in [2], if we apply the limiting version of the algorithm with β = 0 and restoration after every n iterations (the large cycle),
then, under ordinary assumptions concerning smoothness and regularity, a quadratic convergence rate (with respect to large
cycles) can be obtained for the solution of systems of nonlinear equations.
The set of algorithms with space dilatation in the subgradient direction contains, as a special case, the so-called
method of ellipsoids. The method of ellipsoids, based on sequential truncation, was proposed by D. B. Judin and
A. S. Nemirovskii [11] and, independently, by N. Z. Shor [12] as a special case of the algorithm with space dilatation in the
subgradient direction. This algorithm employs the following parameters: the coefficient of space dilatation is constant,

α_{k+1} = α = √((n + 1)/(n − 1)),

and the step is adjusted according to the rule

h_1 = r/(n + 1);   h_{k+1} = h_k · n/√(n² − 1),   k = 1, 2, …,

where n is the space dimension and r is the radius of a sphere with center at the point x_0 containing the point x*.
The method of ellipsoids converges at the rate of a geometrical progression in the deviation of the best value of f(x)
achieved by the given step from the optimal one. The denominator of the geometrical progression depends asymptotically
only on the space dimension n:

q_n ≈ 1 − 1/(2n²).
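The method of ellipsoids also admits an equivalent "center-and-matrix" form, in which the ellipsoid containing x* is carried explicitly instead of the dilatation bookkeeping above. The following pure-Python sketch uses that standard central-cut form (the update formulas are the textbook ones, not the space-dilatation recurrences of this survey; the test problem and all names are illustrative):

```python
import math

def ellipsoid(subgrad, f, x0, r, iters):
    """Central-cut ellipsoid method: maintain an ellipsoid {x: (x-c)^T P^{-1} (x-c) <= 1}
    containing the minimizer and shrink it with one subgradient cut per step."""
    n = len(x0)
    c = list(x0)
    P = [[r * r if i == j else 0.0 for j in range(n)] for i in range(n)]
    best_x, best_f = list(c), f(c)
    for _ in range(iters):
        g = subgrad(c)
        Pg = [sum(P[i][j] * g[j] for j in range(n)) for i in range(n)]
        s = math.sqrt(sum(g[i] * Pg[i] for i in range(n)))
        if s == 0.0:
            break
        gn = [v / s for v in Pg]                  # P g / sqrt(g^T P g)
        c = [c[i] - gn[i] / (n + 1) for i in range(n)]
        coef = n * n / (n * n - 1.0)
        P = [[coef * (P[i][j] - 2.0 / (n + 1) * gn[i] * gn[j])
              for j in range(n)] for i in range(n)]
        fc = f(c)
        if fc < best_f:
            best_f, best_x = fc, list(c)
    return best_x, best_f

# Illustrative problem: f(x) = max(|x1 - 1|, |x2 + 0.5|), minimum 0 at (1, -0.5),
# which lies inside the ball of radius r = 3 around the starting point.
def f(x):
    return max(abs(x[0] - 1.0), abs(x[1] + 0.5))

def sg(x):
    if abs(x[0] - 1.0) >= abs(x[1] + 0.5):
        return [1.0 if x[0] >= 1.0 else -1.0, 0.0]
    return [0.0, 1.0 if x[1] >= -0.5 else -1.0]

x_best, f_best = ellipsoid(sg, f, [0.0, 0.0], 3.0, 200)
```

The per-step volume reduction matches the geometric-progression behavior stated above: the record value of f decreases by a dimension-dependent factor per iteration, independently of the conditioning of f.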
L. G. Khachiyan [13] used the method of ellipsoids to construct and justify the first polynomial algorithm for the
solution of the linear programming problem with rational coefficients. Moreover, this method has found an important
application in the theory of complexity of discrete optimization algorithms [14]. Much attention was given at the Congress
on Mathematical Programming (Bonn, 1982) to the method of ellipsoids and its applications. In particular, N. Z. Shor's
survey report on nonsmooth optimization methods developed at the Institute of Cybernetics [16] was published in [15].
Experience with algorithms involving space dilatation in the gradient direction has shown that subgradient processes
can be significantly accelerated by operators changing the space metric. At the same time, the difficulties of selecting step
factors stimulated a search for new methods of nonsmooth optimization with a variable metric, in which the choice of the
step factor is connected with the search for the directional minimum. This class of algorithms is described in the next
section.

SUBGRADIENT METHODS WITH SPACE DILATATION IN THE DIRECTION
OF THE DIFFERENCE OF TWO SEQUENTIAL SUBGRADIENTS

Subgradient-type algorithms with space dilatation in the direction of the difference of two sequential subgradients
(r-algorithms), proposed in 1971 in [17], proved to be especially efficient in the solution of complex problems of
nondifferentiable optimization of average dimension (up to several hundred variables).
In their structure and the complexity of an iteration, r-algorithms are close to the methods with space dilatation in the
subgradient direction. But there is an important difference between them: the GGD methods with space dilatation in the
subgradient direction cannot be monotone, while r-algorithms may become monotone for a definite adjustment of step factors
and coefficients of space dilatation. This is due to a simple geometrical fact: if a point x_k lies at the boundary of two
"pieces" of a piecewise smooth level surface and the gradients of these smooth "pieces" form an obtuse angle, then no space
dilatation in the direction of one of the gradients (or alternately in the directions of the two indicated gradients) can turn this
angle into an acute one; it may only approach π/2 while remaining obtuse. Applying the method with space dilatation in the
subgradient direction, one cannot obtain a direction of decrease of the function as an antigradient of one of the pieces in the
dilatated space. At the same time, space dilatation in the direction of the difference of the two indicated gradients with a
sufficient coefficient of dilatation transforms the obtuse angle between the gradients into an acute one, i.e., the respective
images of these antigradients in the dilatated space become directions of decrease of the function.
Let us present a general scheme of the r-algorithms for the minimization of a convex function f(x) defined on E^n. We
assume that f(x) has a bounded set of minima X*, so that lim_{||x||→∞} f(x) = +∞.
We select an initial approximation x_0 ∈ E^n and a nonsingular matrix B_0 (most often B_0 coincides with the unit
matrix I_n or with a diagonal matrix D_n with positive diagonal elements, which is used for scaling the variables).
The first step of the algorithm is made by the formula x_1 = x_0 − h_0 η_0, where η_0 = B_0 B_0^T g_f(x_0) and h_0 is
a step factor selected from the condition that there exists at the point x_1 a subgradient g_f(x_1) such that
(g_f(x_1), η_0) ≤ 0. For B_0 = I_n, we have η_0 = g_f(x_0), and the first step coincides with an iteration of the subgradient
process.
Let definite values of x_k ∈ E^n and the n × n matrix B_k be obtained as a result of calculations after k (k = 1, 2, …)
steps. Let us describe the (k+1)th step.
1. Calculate g_f(x_k), a subgradient of the function f(x) at the point x_k, and r_k = B_k^T (g_f(x_k) − g_f(x_{k−1})),
the vector of the difference of two sequential subgradients in the transformed space. The formula y = A_k x, where
A_k = B_k^{−1}, specifies the passage from the initial space to the transformed one. Define the function φ_k(y) = f(B_k y);
then g_{φ_k}(y) = B_k^T g_f(x). Thus, r_k is the difference of two subgradients of the function φ_k(y) calculated at the
points y_k = A_k x_k and ỹ_k = A_k x_{k−1}.
2. Derive ξ_k = r_k/||r_k||.
3. Specify the quantity β_k inverse to the coefficient of space dilatation α_k before the (k+1)th step.
4. Calculate B_{k+1} = B_k R_{β_k}(ξ_k), where R_{β_k}(ξ_k) is the operator of space dilatation at the (k+1)th step.
Note that B_{k+1} = A_{k+1}^{−1}.
5. Derive g̃_k = B_{k+1}^T g_f(x_k), the subgradient of the function φ_{k+1}(y) = f(B_{k+1} y) at the point
y_{k+1} = A_{k+1} x_k.
6. Determine

x_{k+1} = x_k − h_k B_{k+1} g̃_k/||g̃_k||.   (8)

A step of algorithm (8) corresponds to a step of the generalized gradient descent in the space transformed under the
action of the operator A_{k+1}. Indeed, applying the operator A_{k+1} to both sides of (8) yields

y_{k+1} = A_{k+1} x_{k+1} = y_k − h_k g̃_k/||g̃_k||,   (9)

where y_k = A_{k+1} x_k.
7. Go to the next step or terminate the algorithm if some stopping conditions hold.
The practical efficiency of the algorithm depends in many respects on the choice of the step factor h_k. In the
r-algorithm, h_k is selected from the condition of approximate search for the minimum of f(x) in a direction. In the
minimization of convex functions, the condition h_k ≥ h_k* should be observed (h_k* is the value of the step factor
corresponding to the directional minimum). In the general case, the subgradient direction at the point x_{k+1} should form a
non-obtuse angle with the direction of descent from the point x_k.
In the minimization of nonsmooth convex functions defined on E^n, the following versions of the algorithm turned out
to be most successful in experimental and practical calculations. The coefficients of space dilatation α_k are selected within
the limits 2 to 3, and adaptive adjustment is applied to the step factor h_k. A natural number m and constants q > 1 and
t_0^0 > 0 are specified. After k steps, we have a step constant t_k^0. We move from the point x_k in the direction of descent
with step t_k^0 until the condition of completion of descent in the direction holds or the number of steps becomes equal
to m. The descent completion condition may require that the value of the function at the next point be no less than that at
the previous point; another variant of this condition is that the derivative in the direction of descent at the given point be
nonnegative. If the descent completion condition is not satisfied after the first m steps, then instead of t_k^0 we store
t_k^1 = q t_k^0, where q > 1, and continue the descent in the same direction with the larger step. If the descent completion
condition is not fulfilled after the next m steps, we take t_k^2 = q t_k^1 instead of t_k^1, etc. Since we assume that
lim_{||x||→∞} f(x) = +∞, the descent completion condition will necessarily be satisfied after a finite number of steps in a
given direction. The step constant t_k^{p_k} = q^{p_k} t_k^0 (p_k ∈ {0, 1, 2, …}) used at the last step is taken as the initial
one for the descent in the new direction from the point x_{k+1}, i.e., t_{k+1}^0 = t_k^{p_k}.
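The scheme of steps 1–7 with this adaptive step adjustment can be sketched in pure Python (a minimal illustration under stated assumptions, not the authors' implementation: α = 3 and m = 3 per the recommendations below, q = 2 as an arbitrary choice; the line search keeps the last decreasing substep, so h_k ≥ h_k* holds by construction, and the smooth-case amendment of shrinking t when the very first substep fails is included; the test function is illustrative):

```python
import math

def r_algorithm(f, grad, x0, alpha=3.0, t0=1.0, m=3, q=2.0, iters=100):
    """Sketch of the r-algorithm: dilatation with coefficient alpha in the direction
    of the difference of two sequential (sub)gradients, adaptive step constant t."""
    n = len(x0)
    beta = 1.0 / alpha
    B = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    x, g, t = list(x0), grad(x0), t0

    def Bt_times(v):          # B^T v
        return [sum(B[i][j] * v[i] for i in range(n)) for j in range(n)]

    def B_times(v):           # B v
        return [sum(B[i][j] * v[j] for j in range(n)) for i in range(n)]

    for _ in range(iters):
        gt = Bt_times(g)                          # subgradient in transformed space
        nrm = math.sqrt(sum(v * v for v in gt))
        if nrm == 0.0:
            break
        d = B_times([v / nrm for v in gt])        # descent direction B gt/||gt||
        # adaptive descent along -d: substeps of length t; after every m substeps
        # without completion, t is multiplied by q > 1
        fx, steps = f(x), 0
        while True:
            xn = [xi - t * di for xi, di in zip(x, d)]
            fn = f(xn)
            if fn >= fx:                          # descent in this direction completed
                if steps == 0:
                    t *= 0.5                      # smooth-case amendment: shrink, retry
                    if t < 1e-14:
                        break
                    continue
                break
            x, fx = xn, fn
            steps += 1
            if steps % m == 0:
                t *= q
        g_new = grad(x)
        r = Bt_times([a - b for a, b in zip(g_new, g)])
        rn = math.sqrt(sum(v * v for v in r))
        if rn > 1e-14:                            # dilatation: B <- B R_beta(xi_k)
            xi_k = [v / rn for v in r]
            Bxi = B_times(xi_k)
            B = [[B[i][j] + (beta - 1.0) * Bxi[i] * xi_k[j] for j in range(n)]
                 for i in range(n)]
        g = g_new
    return x

# Illustrative ravine function f(x) = x1^2 + 25 x2^2.
f = lambda x: x[0] ** 2 + 25.0 * x[1] ** 2
grad = lambda x: [2.0 * x[0], 50.0 * x[1]]
x_fin = r_algorithm(f, grad, [10.0, 1.0])
```

After a few iterations the metric B adapts to the ravine and long steps become possible along its bottom, which is exactly the acceleration mechanism described above.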
The above adjustment of the step factor is based on the following considerations. Let the r-algorithm be applied with a
constant coefficient of space dilatation α. Then an arbitrary direction is stretched "on the average" α times in n iterations,
i.e., when moving in the dilatated space with a constant step, the step decreases approximately α times per n iterations in the
initial space. If the distance to the minimum point decreases at the same average velocity, i.e., approximately α times in n
iterations, then the number of steps in a direction will be limited. For slower convergence, the step in the initial space will
decrease "on the average" faster than the distance to the minimum point, and the number of steps in a direction may increase
without limit. To prevent this, an adaptive procedure of increasing the step factor h is introduced that stabilizes the number
of steps in a direction at between one and several units. This procedure also allows us to reveal faster the directions in which
the function decreases without limit (such cases appear often in actual practice, when a linear programming problem is
solved in the space of dual variables using the scheme of decomposition in constraints and the problem turns out to be
inconsistent). In this case, the step factor increases sharply, which is a signal for terminating the descent.
As numerous computing experiments and practical calculations have shown, in most cases, for α ∈ [2, 3], m = 3, and
the above h adjustment technique, the number of steps in a direction on the average rarely exceeds two. In this case, the
accuracy in the functional is, as a rule, improved 3 to 5 times in n steps of the r-algorithm.
In the case of minimization of a smooth function, finer techniques of search for the directional minimum can be
applied to accelerate convergence, for example, quadratic approximation on three points, the "golden section" process, etc.
In the smooth case, an adaptive technique of step adjustment in a direction similar to that described above, with one small
amendment, has proved itself well: if at a given iteration the function has already taken a greater value after the first step,
then the step factor is multiplied by a given number smaller than unity (about 0.8–0.95). This is because in the smooth case
the rate of convergence may turn out to be faster for a more accurate determination of the directional minimum, and this
additional refinement of the step increases the accuracy of the search for the directional minimum.
As applied to problems of smooth optimization, the r-algorithm is close in formal structure to quasi-Newton-type
algorithms with a variable metric. For example, the limiting version of the r-algorithm with an infinite coefficient of
dilatation (i.e., β = 0) is a projective version of the method of conjugate gradients. It is shown in [17] that the limiting
version of the r-algorithm has a quadratic rate of convergence under ordinary conditions of smoothness and regularity. Note
that using operators of space dilatation, one can also construct sets of quasi-Newton methods for finite values of the
coefficients of dilatation [18, 19]. These algorithms are characterized by numerical stability with respect to the accuracy of
the search for the directional minimum.
Let us now consider a monotone modification of the r-algorithm [20]. It differs from other versions of the
r-algorithms both in the choice of the direction of motion and in the adjustment of the step factor. In the general case, the
monotone modification of the r-algorithm consists of a sequence of iterations, on each of which the following operations take
place (below, x_0, x_1, …, x_k, … is the minimizing sequence of points of the space E^n on which the function f(x) is
defined).
Let a starting point x_0 ∈ E^n and a nonsingular n × n matrix B_0 be given. The initial iteration corresponds to the
first step of the above scheme of the r-algorithm.
Let k iterations be performed (k > 1) and, as a result, the point x_k ∈ E^n, the inverse matrix B_k of the space
transformation, and the subgradient g(x_k) be obtained, so that the direction of motion from the point x_k is determined by
the formula

x_{k+1}(h_k) = x_k − h_k B_k B_k^T g_k,

where h_k is the unknown step factor, which is determined from the condition of the directional minimum

h_k^min = arg min_{h_k} f(x_{k+1}(h_k)).

In particular, h_k^min can be 0; then x_{k+1} = x_k.
When moving in the main space, the step factor is selected from the condition of approximate minimum in the
direction of movement, but, as distinct from the well-known modifications, the step factor may turn out to be equal to 0 in
definite circumstances. In this case, we remain at the current point and only the inverse matrix varies at the respective
iteration.
If h_k^min > 0, then

x_{k+1} = x_k − h_k^min B_k B_k^T g_k.

In both cases, we find among the subgradients of the function f at the point x_{k+1} a subgradient that forms an angle
greater than or equal to π/2 with the direction of motion. If the required subgradient is not unique, then the one whose
projection onto the direction of descent is minimal should be selected.
The sequence f(x_0), f(x_1), …, f(x_k) is nonincreasing, since each next point is obtained from the previous one as a
result of finding an approximate minimum in some direction, which ensures the inequality f(x_k) ≥ f(x_{k+1}). Hence it is
clear that the sequence of values of the function {f(x_k)}, k = 0, 1, 2, …, has a limit f̄ ≥ f(x*).

A new possibility of selecting the direction of movement is introduced in the algorithm, which involves the solution of
auxiliary problems of convex quadratic programming. Let us consider the case of minimization of a maximum of smooth
convex functions. Note that a smooth convex function can be approximated from below at any point x̄ by a linear function
of the form f(x̄) + (g_f(x̄), x − x̄).
Let φ(x) = max_{i∈I} f_i(x), where I is a finite set of indices and the f_i(x) are continuously differentiable convex
functions defined on the whole of E^n. To simplify the calculations, we assume that φ(x) has a unique minimum at the
desired point x* ∈ E^n.
Let x be an arbitrary point of E^n. Denote I_0(x) = {i ∈ I | φ(x) = f_i(x)} and
I_ε(x) = {i ∈ I | φ(x) ≥ f_i(x) ≥ φ(x) − ε}. We call I_ε(x) the ε-active set of indices for the point x.
As compared with the ordinary r-algorithm, the following special quadratic problem is used for the selection of the
descent direction: find

ρ* = min_{λ = {λ_i}, i ∈ I_ε(x)} Σ_{p,l} λ_p λ_l (g_{f_p}(x), g_{f_l}(x))   (10)

under the constraints

λ_i ≥ 0 for all i ∈ I_ε(x),   (11)

Σ_{i∈I_ε(x)} λ_i = 1.   (12)

If ρ* = 0, then ϕ(x*) ≥ ϕ(x) − ε, i.e., ϕ(x) − ε is a lower estimate. Having a record value and a lower estimate of the
minimized function at the point x, we can estimate the accuracy of the current approximation to the minimum. If ρ* > 0,
then the vector opposite to Σ_{i ∈ I_ε(x)} λ_i* g_{f_i}(x) gives a direction of movement from the point x that decreases
the value of ϕ(x) by no less than ε.
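The auxiliary problem (10)–(12) minimizes a quadratic form over the unit simplex, so simple simplex-constrained methods apply. The sketch below uses the Frank–Wolfe method with exact line search (our choice for illustration; the paper itself recommends the r-algorithm with nonsmooth penalties) on the Gram matrix of the ε-active gradients; all names are illustrative:

```python
import numpy as np

def aux_problem(G, iters=1000):
    """Frank-Wolfe sketch for problem (10)-(12): minimize lambda^T G lambda
    over the unit simplex, where G is the Gram matrix of the eps-active
    gradients, G[p, l] = (g_fp(x), g_fl(x))."""
    m = G.shape[0]
    lam = np.full(m, 1.0 / m)             # start at the simplex center
    for _ in range(iters):
        grad = 2.0 * G @ lam
        j = int(np.argmin(grad))          # best simplex vertex e_j
        d = -lam.copy()
        d[j] += 1.0                       # direction e_j - lambda
        dGd = d @ G @ d
        if dGd <= 1e-15:
            break
        gamma = min(1.0, max(0.0, -(lam @ G @ d) / dGd))  # exact line search
        lam = lam + gamma * d
    return lam, float(lam @ G @ lam)      # (lambda*, rho*)

# two eps-active gradients g1 = (1, 0), g2 = (-1, 1); their convex hull
# approaches, but does not contain, the origin, so rho* > 0
grads = np.array([[1.0, 0.0], [-1.0, 1.0]])
G = grads @ grads.T
lam, rho = aux_problem(G)
direction = -(grads.T @ lam)  # the movement direction described after (12)
```

Here ρ* = 0.2 > 0, so −(λ_1* g_1 + λ_2* g_2) gives a descent direction, exactly as in the text.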
Note that the auxiliary problem (10)–(12) in λ is a convex quadratic programming problem with linear constraints; using
nonsmooth penalty functions, it is easily reduced to the unconstrained minimization of a nonsmooth piecewise quadratic
function, to which the r-algorithm combined with the method of nonsmooth penalty functions can be applied. For large
examples, computational experiments have shown that the auxiliary problem is solved faster in this way than by the
well-known LOQO program, which implements an interior-point approach.
Thus, implementation of the monotone modification of the r-algorithm consists in the general case of a finite number
of stages, each including several units:
• calculation of successive values x_k, x_{k+1}, … with nonincreasing values of the minimized function for a constant
α > 1 and the respective matrices B_k, B_{k+1};
• control over degeneracy of the matrices B k and their “restoration” if necessary;
• solution of special auxiliary quadratic problems for deriving lower estimates of the function being minimized.
The monotone modification of the r-algorithm has been applied to a wide class of optimization problems, including
minimax problems, special quadratic problems, the maximum cut problem in graphs, optimal control problems with
discrete time, and polynomial extremum problems [20].

APPLICATIONS OF ALGORITHMS OF NONDIFFERENTIABLE OPTIMIZATION

Let us present the main sources generating nonsmooth optimization problems.


First, these are mathematical programming problems of large dimensions with a block structure and a comparatively
small number of interblock communications. The use of decomposition schemes for solution of such problems leads to
problems of minimization (maximization) of, as a rule, nonsmooth functions of coupling variables or of Lagrangian
multipliers (dual estimates), corresponding to the coupling constraints.
Second, these are minimization problems for functions of maxima. Let a parametric family of convex functions
{f_α(x)}_{α ∈ A}, defined on E^n, be given. The main source of nonsmooth functions in convex programming is the
operation of taking the pointwise maximum with respect to the parameter α, i.e., constructing the function of maximum

F(x) = sup_{α ∈ A} f_α(x).
The domain of definition of F(x) (dom F) consists of those x ∈ E^n for which {f_α(x)} is bounded from above in α. For
each x ∈ dom F, define the subset of indices

I(x) = {α ∈ A : f_α(x) = F(x)}.

The subgradient set G_F(x) of the function F at the point x is determined by the formula

G_F(x) = conv { ∪_{α ∈ I(x)} G_{f_α}(x) },     (13)

where conv{M} denotes the operation of taking the minimal convex closed set containing M, and G_{f_α}(x) are the
subgradient sets of the functions f_α at the point x, α ∈ I(x). If all the functions f_α (α ∈ I(x)) are differentiable at the
point x, then G_{f_α}(x) consists of the unique point g_{f_α}(x), the gradient, and formula (13) takes the following form:

G_F(x) = conv { ∪_{α ∈ I(x)} g_{f_α}(x) }.

In the case where I(x) is a finite set, all extreme points of G_F(x) are gradients of some functions f_α, α ∈ I(x), at the
point x, and G_F(x) is a convex polyhedron of the respective dimension.
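For a finite index set, formula (13) translates directly into code: evaluate all the f_α, detect the active indices, and take any convex combination of their gradients as a subgradient. The sketch below (function names are our own) returns the arithmetic mean of the active gradients:

```python
import numpy as np

def max_function_subgradient(fs, grads, x, tol=1e-9):
    """Sketch of formula (13) for a finite index set: F(x) = max_i f_i(x).
    Any convex combination of the gradients of the active f_i belongs to
    G_F(x); here we return the arithmetic mean of the active gradients."""
    vals = np.array([f(x) for f in fs])
    F = vals.max()
    active = np.where(vals >= F - tol)[0]     # the index set I(x)
    g = np.mean([grads[i](x) for i in active], axis=0)
    return F, active, g

# F(x) = max(x1^2, x2^2) at x = (1, -1): both pieces are active,
# so G_F(x) is the segment between (2, 0) and (0, -2)
fs = [lambda x: x[0] ** 2, lambda x: x[1] ** 2]
grads = [lambda x: np.array([2 * x[0], 0.0]),
         lambda x: np.array([0.0, 2 * x[1]])]
F, active, g = max_function_subgradient(fs, grads, np.array([1.0, -1.0]))
```

At this point G_F(x) is a one-dimensional polyhedron (a segment) whose extreme points are the two gradients, matching the final remark above.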

The third source of nonsmooth problems is Lagrangian estimates in mathematical programming problems. Let us
consider a general mathematical programming problem whose constraints are divided into two parts, one having the
form of the membership condition x ∈ X ⊆ E^n and the other determined by a system of equalities: find

f_0* = inf_{x ∈ X} f_0(x),     (14)

under the constraints

f_i(x) = 0,  i = 1, …, m.     (15)
Assume that X is a closed subset of the n-dimensional Euclidean space and f_i(x) are continuous functions defined on X.
To estimate f_0*, let us introduce the Lagrangian function

L(x, u) = f_0(x) + Σ_{i=1}^m u_i f_i(x),

where u = (u_1, …, u_m) is the vector of Lagrangian multipliers, and consider the estimate

ψ(u) = inf_{x ∈ X} L(x, u).

For any admissible x and arbitrary u, we have ψ(u) ≤ f_0(x), whence ψ(u) ≤ f_0*.
The problem of deriving the best estimate of the optimal value of problem (14), (15) in the given class of Lagrangian
estimates reduces to the master problem: find

ψ* = sup_u ψ(u).

The function ψ(u) is concave, being the result of minimization with respect to x ∈ X of a family of functions linear in u.
Assume that ψ(u) is a proper concave function with a nonempty domain dom ψ having interior points, and let u be an
interior point of dom ψ, i.e., u ∈ int dom ψ. Then, by the rules for calculating a subgradient of a function of maximum
applied to ϕ(u) = −ψ(u), the subgradient set G_ϕ(u) can be determined as follows:

G_ϕ(u) = conv { ∪_{x(u) ∈ X(u)} (−F(x(u))) },

where X(u) is the set of all solutions of the local problem inf_{x ∈ X} L(x, u), and F(x(u)) = (f_1(x(u)), …, f_m(x(u))) is
the residual vector corresponding to the solution x(u). Thus, if the local problem has a unique solution x(u), then ϕ is
differentiable at u and its gradient coincides with the vector {−f_i(x(u))}_{i=1}^m. Otherwise, as a rule, the gradient of
ϕ(u) is discontinuous at the respective point.
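This scheme can be illustrated on a toy problem where the inner minimization has a closed form. The sketch below (an assumed example, not from the paper) maximizes ψ(u) for the problem of minimizing x² subject to x − 1 = 0 by supergradient ascent, using the residual f_1(x(u)) as the supergradient of the concave function ψ:

```python
def dual_bound_for_min_xsq():
    """Lagrangian estimate psi(u) = inf_x L(x, u) for the toy problem:
    minimize x^2 subject to x - 1 = 0 (so f0* = 1, attained at x = 1).
    The inner problem inf_x [x^2 + u*(x - 1)] has the closed-form solution
    x(u) = -u/2, and the residual f1(x(u)) = x(u) - 1 is a supergradient
    of psi at u."""
    u, step = 0.0, 0.5
    for _ in range(200):
        x_u = -u / 2.0                # unique solution of the local problem
        residual = x_u - 1.0          # supergradient of psi at u
        u = u + step * residual       # ascent on the concave dual function
    psi = -(u ** 2) / 4.0 - u         # psi(u) in closed form
    return u, psi

u_star, psi_star = dual_bound_for_min_xsq()
```

Here the iteration converges to u* = −2 with ψ(u*) = 1 = f_0*, so the Lagrangian estimate is tight for this convex example; for nonconvex or discrete problems ψ* generally gives only a lower bound.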
The fourth source is minimization problems for functions of maxima that are characteristic of game models, multicriteria
models of optimal scheduling, and operations research. Problems of solving systems of equations and inequalities, and of
estimating coefficients of nonlinear regression, reduce to such problems when the Chebyshev criterion (minimization of
the maximum modulus of the discrepancies) is used.
The fifth source consists of nonlinear programming problems whose solution involves the method of nonsmooth penalty
functions. Nonsmooth penalty functions of a certain form have an undoubted advantage over the commonly used smooth
penalty functions: when nonsmooth penalty functions are used, as a rule, there is no need to let the penalty coefficients
tend to +∞.
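This advantage can be seen on a one-dimensional example. For minimizing x² subject to x ≥ 1, the nonsmooth penalty x² + N·max(0, 1 − x) attains its unconstrained minimum exactly at the constrained solution x* = 1 for any finite N > 2, while the smooth quadratic penalty is biased for every finite N; the example and names below are our own illustration:

```python
import numpy as np

def argmin_on_grid(f, lo=-1.0, hi=3.0, n=400001):
    """Brute-force grid minimizer, sufficient for a 1-D illustration."""
    xs = np.linspace(lo, hi, n)
    return xs[np.argmin(f(xs))]

# constrained problem: minimize x^2 subject to x >= 1 (solution x* = 1;
# the optimal Lagrange multiplier is 2, so any N > 2 gives an exact penalty)
N = 5.0
nonsmooth = lambda x: x**2 + N * np.maximum(0.0, 1.0 - x)     # exact penalty
smooth    = lambda x: x**2 + N * np.maximum(0.0, 1.0 - x)**2  # quadratic penalty

x_ns = argmin_on_grid(nonsmooth)  # lands on x* = 1 for this finite N
x_sm = argmin_on_grid(smooth)     # biased: N/(N+1) < 1; exact only as N -> inf
```

The unconstrained minimization of the nonsmooth penalty is thus completely equivalent to the constrained problem for a finite penalty coefficient, which is exactly the property exploited above in combination with the r-algorithm.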
The sixth source is optimal control problems with continuous and discrete time. The use of the maximum principle or
the discrete maximum principle in many cases leads to minimization problems for functions with a discontinuous gradient.
These problems can be considered as special problems of nonlinear programming, and for their solution schemes of
decomposition or the method of nonsmooth penalty functions can be applied.
The seventh source consists of problems of discrete programming or problems of the mixed discrete-continuous type.
Many such problems can be solved rather successfully with the use of the branch and bound algorithm with deriving
estimates through solution of the dual problem. The dual problem usually turns out to be the problem of minimization of
a convex piecewise linear function with a huge number of "pieces" under simple constraints, i.e., a problem of nonsmooth
optimization.
Finally, functions with a discontinuous gradient can appear directly in models of optimal scheduling, design, or
operations research as a result of piecewise smooth approximation of the technical and economic characteristics of
actual plants.
It should be mentioned also that, from the application point of view, there is no clear boundary between nonsmooth
and smooth functions. In the context of applied mathematics and computing practice, a function with a rapidly varying
gradient is close in its properties to a nonsmooth function. Therefore, the computational methods developed for solving
nonsmooth optimization problems also turn out to be efficient for the optimization of "poor" smooth functions (for
example, ravine-type functions).
Numerous applications of algorithms of nonsmooth optimization to the above-mentioned classes of problems can be
found in monographs of N. Z. Shor [2, 3, 21, 22].

CONCLUSIONS

The paper contains a brief description of the methods of nondifferentiable optimization developed at the Institute of
Cybernetics.
1. The methods of generalized gradient descent, which have initiated a new direction in mathematical programming,
namely, numerical methods of nonsmooth optimization, to which numerous scientific articles and monographs are now
devoted.
2. The subgradient methods with space dilatation in the direction of the subgradient, which converge faster than the
methods of generalized gradient descent. These methods gave optimization theory a unique algorithm, namely, the
method of ellipsoids, whose convergence rate depends only on the dimension of the space. The method of ellipsoids has
made it possible to solve a number of important problems in the complexity theory of mathematical programming.
3. The subgradient methods with space dilatation in the direction of the difference of two successive subgradients (the
r-algorithms). Within this family of methods, rather efficient implementations of the r-algorithms have been obtained.
The number of iterations needed to find the best value f* to within ε-accuracy for functions of n variables is estimated
empirically as N = O(n log(1/ε)). The developed modifications of the r-algorithm are efficient tools for minimizing
nonsmooth convex functions. In the minimization of smooth functions, they have proved competitive with the most
successful implementations of conjugate gradient methods and quasi-Newtonian methods.
The r-algorithm was used in high-dimensional optimization problems and in quasi-block problems with various
schemes of decomposition, for calculation of dual Lagrangian estimates in multiextremal and combinatorial optimization
problems. In actual practice, it was used to solve problems of optimal scheduling, optimal design, synthesis of networks,
restoration of images, ellipsoidal approximation and localization, etc.
V. M. Glushkov, who created and directed the school of nonsmooth optimization at the Institute of Cybernetics, placed
strong emphasis on the development of numerical optimization methods and the importance of their practical
applications. Today, the development of nondifferentiable optimization methods continues actively.

REFERENCES

1. N. Z. Shor, “Application of the gradient-descent method to solution of the network transport problem,” in: Proc. Sci.
Seminar on Theor. Appl. Probl. Cybern. Oper. Res., Kiev, Nauch. Sovet po Kibern. AN USSR [in Russian], Issue 1
(1962), pp. 9–17.
2. N. Z. Shor, Methods of Minimization of Nondifferentiable Functions and their Applications [in Russian], Naukova
Dumka, Kiev (1979).
3. V. S. Mikhalevich, V. A. Trubin, and N. Z. Shor, Optimization Problems of Production and Transport Scheduling:
Models, Methods, and Algorithms [in Russian], Nauka, Moscow (1986).

4. Yu. M. Ermol'ev and N. Z. Shor, "The method of random search for problems of two-stage stochastic programming
and its generalization," Kibernetika, No. 1, 90–92 (1968).
5. I. I. Eremin, “The method of “penalties” in convex programming,” Kibernetika, No. 4, 63–67 (1966).
6. B. T. Polyak, “A general method of solving extremum problems,” Dokl. AN SSSR, 174, No. 1, 33–36 (1967).
7. M. L. Balinski and P. Wolfe (eds.), “Nondifferentiable optimization,” in: Math. Programming Study, 3,
North-Holland, Amsterdam (1975).
8. N. Z. Shor and V. I. Biletskii, “The method of space dilatation for acceleration of convergence in ravine-type
problems,” in: Tr. Semin. Sci. Counc. AN USSR in Cybern. Theory of Optimal Decisions, No. 2 [in Russian] Kiev
(1969), pp. 3–18.
9. N. Z. Shor, “The use of operations of space dilatation in problems of minimization of convex functions,” Kibernetika,
No. 1, 6–12 (1970).
10. N. Z. Shor, “On convergence rate of the generalized gradient descent with space dilatation,” Kibernetika, No. 2, 80–85
(1970).
11. D. B. Yudin and A. S. Nemirovskii, “Information complexity and efficient methods of solution of convex extremum
problems,” in: Economika i Mat. Metody, Issue 2 [in Russian], (1976), pp. 357–369.
12. N. Z. Shor, “The truncation method with space dilatation for solving convex programming problems,” Kibernetika,
No. 1, 94–95 (1977).
13. L. G. Khachiyan, "Polynomial algorithms in linear programming," DAN SSSR, 244, No. 5, 1093–1096 (1979).
14. M. Grotschel, L. Lovasz, and A. Schrijver, Geometric Algorithms and Combinatorial Optimization, Springer-Verlag,
Berlin (1988).
15. A. Bachem, M. Grotschel, and B. Korte (eds.), Mathematical Programming: The State of the Art, Springer-Verlag,
Berlin (1983).
16. N. Z. Shor, "Generalized gradient methods of nondifferentiable optimization employing space dilatation operations,"
in: Mathematical Programming: The State of the Art, Springer-Verlag, Berlin (1983), pp. 501–529.
17. N. Z. Shor and N. G. Zhurbenko, “The minimization method using space dilatation in direction of difference of two
sequential gradients,” Kibernetika, No. 3, 51–59 (1971).
18. N. G. Zhurbenko, "Construction of a set of conjugate gradient methods based on the use of the space dilatation
operator," in: The Theory and Applications of Optimization Methods [in Russian], Inst. Kibern. im. V. M. Glushkova
NAN Ukrainy, Kiev (1998), pp. 12–18.
19. N. G. Zhurbenko, "Quasi-Newtonian algorithms of minimization based on the space dilatation operator," in: Theory
of Optimal Solutions [in Russian], Inst. Kibern. im. V. M. Glushkova NAN Ukrainy, Kiev (1999), pp. 45–50.
20. N. Z. Shor, “Monotone modifications of r-algorithms and their application,” Kibern. Sist. Analiz, No. 6, 74–95
(2002).
21. N. Z. Shor and S. I. Stetsenko, Quadratic Extremum Problems and Nondifferentiable Optimization [in Russian],
Naukova Dumka, Kiev (1989).
22. N. Z. Shor, Nondifferentiable Optimization and Polynomial Problems, Kluwer Acad. Publ., Boston–
Dordrecht–London (1998).
