
Stochastic Optimization Methods

Second Edition

Kurt Marti
Univ. Prof. Dr. Kurt Marti
Federal Armed Forces University Munich
Department of Aerospace Engineering and Technology
85577 Neubiberg/München
Germany
kurt.marti@unibw-muenchen.de

ISBN: 978-3-540-79457-8 e-ISBN: 978-3-540-79458-5

Library of Congress Control Number: 2008925523

© 2008 Springer-Verlag Berlin Heidelberg

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting,
reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9,
1965, in its current version, and permission for use must always be obtained from Springer. Violations are
liable to prosecution under the German Copyright Law.

The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply,
even in the absence of a specific statement, that such names are exempt from the relevant protective laws
and regulations and therefore free for general use.

Cover design: WMXDesign GmbH, Heidelberg

Printed on acid-free paper

Preface

Optimization problems in practice depend mostly on several model parameters, noise factors, uncontrollable parameters, etc., which are not given fixed quantities at the planning stage. Typical examples from engineering and economics/operations research are: material parameters (e.g. elasticity moduli, yield stresses, allowable stresses, moment capacities, and specific gravity), external loadings, friction coefficients, moments of inertia, length of links, mass of links, location of center of gravity of links, manufacturing errors, tolerances, noise terms, demand parameters, technological coefficients in input-output functions, cost factors, etc. Due to several types of stochastic uncertainties (physical uncertainty, economic uncertainty, statistical uncertainty, and model uncertainty) these parameters must be modelled by random variables having a certain probability distribution. In most cases at least certain moments of this distribution are known.
To cope with these uncertainties, a basic procedure in engineer-
ing/economic practice is to replace first the unknown parameters by some
chosen nominal values, e.g. estimates or guesses, of the parameters. Then,
the resulting and mostly increasing deviation of the performance (output,
behavior) of the structure/system from the prescribed performance (out-
put, behavior), i.e., the “tracking error”, is compensated by (online) input
corrections. However, the online correction of a system/structure is often
time-consuming and causes mostly increasing expenses (correction or re-
course costs). Very large recourse costs may arise in case of damages or
failures of the plant. This can be avoided to a large extent by taking into
account at the planning stage the possible consequences of the tracking errors
and the known prior and sample information about the random data of the
problem. Hence, instead of relying on ordinary deterministic parameter opti-
mization methods – based on some nominal parameter values – and applying
some correction actions, stochastic optimization methods should be applied:
By incorporating the stochastic parameter variations into the optimization
process, expensive and increasing online correction expenses can be avoided
or at least reduced to a large extent.

Consequently, for the computation of robust optimal decisions/designs, i.e., optimal decisions which are insensitive with respect to random parameter variations, appropriate deterministic substitute problems must first be formulated. Based on theoretical principles, these substitute problems depend
on probabilities of failure/success and/or on more general expected cost/loss
terms. Since probabilities and expectations are defined by multiple integrals in
general, the resulting often nonlinear and also nonconvex deterministic sub-
stitute problems can be solved only by approximative methods. Hence, the
analytical properties of the substitute problems are examined, and appropri-
ate deterministic and stochastic solution procedures are presented. The aim
of this book is therefore to provide analytical and numerical tools – including
their mathematical foundations – for the approximate computation of robust
optimal decisions/designs as needed in concrete engineering/economic appli-
cations.
Two basic types of deterministic substitute problems occur mostly in
practice:
* Reliability-Based Optimization Problems:
Minimization of the expected primary costs subject to expected recourse
cost constraints (reliability constraints) and remaining deterministic con-
straints, e.g. box constraints. In case of piecewise constant cost functions,
probabilistic objective functions and/or probabilistic constraints occur;
* Expected Total Cost Minimization Problems:
Minimization of the expected total costs (costs of construction, design,
recourse costs, etc.) subject to the remaining deterministic constraints.
Basic methods and tools for the construction of appropriate deterministic sub-
stitute problems of different complexity and for various practical applications
are presented in Chap. 1 and part of Chap. 2 together with fundamental ana-
lytical properties of reliability-based optimization/expected cost minimization
problems. Here, the main problem is the analytical treatment and numerical
computation of the occurring expectations and probabilities depending on
certain input variables.
For this purpose deterministic and stochastic approximation methods are
provided, which are based on Taylor expansion methods, regression and re-
sponse surface methods (RSM), probability inequalities, and differentiation
formulas for probability and expectation value functions. Moreover, first order
reliability methods (FORM), Laplace expansions of integrals, convex approx-
imation/deterministic descent directions/efficient points, (hybrid) stochastic
gradient methods, stochastic approximation procedures are considered. The
mathematical properties of the approximative problems and iterative solution
procedures are described.
After the presentation of the basic approximation techniques in Chap. 2,
in Chap. 3 derivatives of probability and mean value functions are obtained
by the following methods: Transformation method, stochastic completion and
transformation, and orthogonal function series expansion. Chapter 4 shows

how nonconvex deterministic substitute problems can be approximated by convex mean value minimization problems. Depending on the type of loss
functions and the parameter distribution, feasible descent directions can be
constructed at non-efficient points. Here, efficient points, determined often by
linear/quadratic conditions involving first and second order moments of the
parameters, are candidates for optimal solutions to be obtained with much
larger effort. The numerical/iterative solution techniques presented in Chap. 5
are based on hybrid stochastic approximation, methods using not only simple
stochastic (sub) gradients, but also deterministic descent directions and/or
more exact gradient estimators. More exact gradient estimators based on Re-
sponse Surface Methods (RSM) are described in the first part of Chap. 5.
Convergence in the mean square sense is considered, and results on the speed
up of the convergence by using improved step directions at certain iteration
points are given. Various extensions of hybrid stochastic approximation meth-
ods are treated in Chap. 6 on stochastic approximation methods with changing
variance of the estimation error. Here, second order search directions, based
e.g. on the approximation of the Hessian, and asymptotic optimal scalar and
matrix valued step sizes are constructed. Moreover, using different types of
probabilistic convergence concepts, related convergence properties and con-
vergence rates are given. Mathematical definitions and theorems needed here
are given in the Appendix. Applications to the approximate computation of
survival/failure probabilities for technical and economical systems/structures
are given in Chap. 7. The reader of this book needs some basic knowledge in
linear algebra, multivariate analysis and stochastics.
To complete this monograph, the author was supported by several collab-
orators: First of all I thank Ms. Elisabeth Lößl from the Institute for Mathematics and Computer Sciences for the excellent LaTeX typesetting of the
whole manuscript. Without her very precise and careful work, this project
could not have been completed. I also owe thanks to my former collabora-
tors, Dr. Andreas Aurnhammer, for further LaTeX support and for providing
English translation of some German parts of the original manuscript, and to
Thomas Platzer for proof reading of some parts of the book. Last but not
least I am indebted to Springer-Verlag for including the book in the Springer-
program. I thank, especially, the Publishing Director Economics and Manage-
ment Science of Springer-Verlag, Dr. Werner A. Müller, for his great patience
until the completion of this book.

Munich Kurt Marti


August 2004

Preface to the Second Edition

The major change in the second edition of this book is the inclusion of a new,
extended version of Chap. 7 on the approximate computation of probabilities
of survival and failure of technical, economic structures and systems. Based on
a representation of the state function of the structure/system by the minimum
value of a certain optimization problem, approximations of the probability of
survival/failure are obtained by means of polyhedral approximation of the so-called survival domain, certain probability inequalities and discretizations of
the underlying probability distribution of the random model parameters. All
the other chapters have been updated by including some more recent material
and by correcting some typing errors.
The author again thanks Ms. Elisabeth Lößl for the excellent LaTeX-
typesetting of all revisions and completions of the text. Moreover, I thank Dr.
Müller, Vice President Business/Economics and Statistics, Springer-Verlag,
for making it possible to publish a second edition of “Stochastic Optimization
Methods”.

Munich Kurt Marti


March 2008
Contents

Part I Basic Stochastic Optimization Methods

1 Decision/Control Under Stochastic Uncertainty . . . . . . . . . . . 3


1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Deterministic Substitute Problems: Basic Formulation . . . . . . . . 5
1.2.1 Minimum or Bounded Expected Costs . . . . . . . . . . . . . . . 6
1.2.2 Minimum or Bounded Maximum Costs
(Worst Case) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2 Deterministic Substitute Problems in Optimal Decision


Under Stochastic Uncertainty . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1 Optimum Design Problems with Random Parameters . . . . . . . . 9
2.1.1 Deterministic Substitute Problems in Optimal
Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1.2 Deterministic Substitute Problems in Quality
Engineering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2 Basic Properties of Substitute Problems . . . . . . . . . . . . . . . . . . . . 18
2.3 Approximations of Deterministic Substitute Problems
in Optimal Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3.1 Approximation of the Loss Function . . . . . . . . . . . . . . . . 20
2.3.2 Regression Techniques, Model Fitting, RSM . . . . . . . . . . 22
2.3.3 Taylor Expansion Methods . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.4 Applications to Problems in Quality Engineering . . . . . . . . . . . . 29
2.5 Approximation of Probabilities: Probability Inequalities . . . . . . 30
2.5.1 Bonferroni-Type Inequalities . . . . . . . . . . . . . . . . . . . . . . . . 31
2.5.2 Tschebyscheff-Type Inequalities . . . . . . . . . . . . . . . . . . . . . 32
2.5.3 First Order Reliability Methods (FORM) . . . . . . . . . . . . . 37

Part II Differentiation Methods

3 Differentiation Methods for Probability and Risk Functions 43


3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.2 Transformation Method: Differentiation by Using an Integral
Transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.2.1 Representation of the Derivatives by Surface Integrals . . 51
3.3 The Differentiation of Structural Reliabilities . . . . . . . . . . . . . . . 54
3.4 Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.4.1 More General Response (State) Functions . . . . . . . . . . . . 57
3.5 Computation of Probabilities and its Derivatives
by Asymptotic Expansions of Integral of Laplace Type . . . . . . . 62
3.5.1 Computation of Probabilities of Structural Failure
and Their Derivatives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.5.2 Numerical Computation of Derivatives
of the Probability Functions Arising in Chance
Constrained Programming . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.6 Integral Representations of the Probability Function
P (x) and its Derivatives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.7 Orthogonal Function Series Expansions I: Expansions
in Hermite Functions, Case m = 1 . . . . . . . . . . . . . . . . . . . . . . . . . 75
3.7.1 Integrals over the Basis Functions and the Coefficients
of the Orthogonal Series . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
3.7.2 Estimation/Approximation of P (x) and its Derivatives . 82
3.7.3 The Integrated Square Error (ISE) of Deterministic
Approximations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
3.8 Orthogonal Function Series Expansions II: Expansions
in Hermite Functions, Case m > 1 . . . . . . . . . . . . . . . . . . . . . . . . . 89
3.9 Orthogonal Function Series Expansions III: Expansions
in Trigonometric, Legendre and Laguerre Series . . . . . . . . . . . . . 91
3.9.1 Expansions in Trigonometric and Legendre Series . . . . . . 92
3.9.2 Expansions in Laguerre Series . . . . . . . . . . . . . . . . . . . . . . . 92

Part III Deterministic Descent Directions

4 Deterministic Descent Directions and Efficient Points . . . . . . 95


4.1 Convex Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.1.1 Approximative Convex Optimization Problem . . . . . . . . . 99
4.2 Computation of Descent Directions in Case of Normal
Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
4.2.1 Descent Directions of Convex Programs . . . . . . . . . . . . . . 105
4.2.2 Solution of the Auxiliary Programs . . . . . . . . . . . . . . . . . . 108
4.3 Efficient Solutions (Points) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
4.3.1 Necessary Optimality Conditions Without Gradients . . . 116

4.3.2 Existence of Feasible Descent Directions


in Non-Efficient Solutions of (4.9a), (4.9b) . . . . . . . . . . . . 117
4.4 Descent Directions in Case of Elliptically Contoured
Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
4.5 Construction of Descent Directions by Using Quadratic
Approximations of the Loss Function . . . . . . . . . . . . . . . . . . . . . . 121

Part IV Semi-Stochastic Approximation Methods

5 RSM-Based Stochastic Gradient Procedures . . . . . . . . . . . . . . . 129


5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
5.2 Gradient Estimation Using the Response Surface
Methodology (RSM) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
5.2.1 The Two Phases of RSM . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
5.2.2 The Mean Square Error of the Gradient Estimator . . . . 138
5.3 Estimation of the Mean Square (Mean Functional) Error . . . . . 142
5.3.1 The Argument Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
5.3.2 The Criterial Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
5.4 Convergence Behavior of Hybrid Stochastic Approximation
Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
5.4.1 Asymptotically Correct Response Surface Model . . . . . . 148
5.4.2 Biased Response Surface Model . . . . . . . . . . . . . . . . . . . . . 150
5.5 Convergence Rates of Hybrid Stochastic Approximation
Procedures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
5.5.1 Fixed Rate of Stochastic and Deterministic Steps . . . . . . 158
5.5.2 Lower Bounds for the Mean Square Error . . . . . . . . . . . . 169
5.5.3 Decreasing Rate of Stochastic Steps . . . . . . . . . . . . . . . . . 173

6 Stochastic Approximation Methods with Changing


Error Variances . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
6.2 Solution of Optimality Conditions . . . . . . . . . . . . . . . . . . . . . . . . . 178
6.3 General Assumptions and Notations . . . . . . . . . . . . . . . . . . . . . . . 179
6.3.1 Interpretation of the Assumptions . . . . . . . . . . . . . . . . . . . 181
6.3.2 Notations and Abbreviations in this Chapter . . . . . . . . . . 182
6.4 Preliminary Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
6.4.1 Estimation of the Quadratic Error . . . . . . . . . . . . . . . . . . . 183
6.4.2 Consideration of the Weighted Error Sequence . . . . . . . . 185
6.4.3 Further Preliminary Results . . . . . . . . . . . . . . . . . . . . . . . . 188
6.5 General Convergence Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
6.5.1 Convergence with Probability One . . . . . . . . . . . . . . . . . . . 190
6.5.2 Convergence in the Mean . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
6.5.3 Convergence in Distribution . . . . . . . . . . . . . . . . . . . . . . . . 195
6.6 Realization of Search Directions Yn . . . . . . . . . . . . . . . . . . . . . . . . 204
6.6.1 Estimation of G∗ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209

6.6.2 Update of the Jacobian . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210


6.6.3 Estimation of Error Variances . . . . . . . . . . . . . . . . . . . . . . . 216
6.7 Realization of Adaptive Step Sizes . . . . . . . . . . . . . . . . . . . . . . . . . 220
6.7.1 Optimal Matrix Step Sizes . . . . . . . . . . . . . . . . . . . . . . . . . . 221
6.7.2 Adaptive Scalar Step Size . . . . . . . . . . . . . . . . . . . . . . . . . . 227
6.8 A Special Class of Adaptive Scalar Step Sizes . . . . . . . . . . . . . . . 236
6.8.1 Convergence Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
6.8.2 Examples for the Function Qn (r) . . . . . . . . . . . . . . . . . . . . 241
6.8.3 Optimal Sequence (wn ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
6.8.4 Sequence (Kn ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247

Part V Reliability Analysis of Structures/Systems

7 Computation of Probabilities of Survival/Failure


by Means of Piecewise Linearization of the State
Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
7.2 The State Function s∗ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
7.2.1 Characterization of Safe States . . . . . . . . . . . . . . . . . . . . . . 258
7.3 Probability of Safety/Survival . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
7.4 Approximation I of ps , pf : FORM . . . . . . . . . . . . . . . . . . . . . . . . . 262
7.4.1 The Origin of IRν Lies in the Transformed
Safe Domain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262
7.4.2 The Origin of IRν Lies in the Transformed Failure
Domain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
7.4.3 The Origin of IRν Lies on the Limit State Surface . . . . . . 268
7.4.4 Approximation of Reliability Constraints . . . . . . . . . . . . . 269
7.5 Approximation II of ps , pf : Polyhedral Approximation
of the Safe/Unsafe Domain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
7.5.1 Polyhedral Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . 273
7.6 Computation of the Boundary Points . . . . . . . . . . . . . . . . . . . . . . 279
7.6.1 State Function s∗ Represented by Problem A . . . . . . . . . 280
7.6.2 State Function s∗ Represented by Problem B . . . . . . . . . 280
7.7 Computation of the Approximate Probability Functions . . . . . . 282
7.7.1 Probability Inequalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
7.7.2 Discretization Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
7.7.3 Convergent Sequences of Discrete Distributions . . . . . . . 293

Part VI Appendix

A Sequences, Series and Products . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301


A.1 Mean Value Theorems for Deterministic Sequences . . . . . . . . . . 301
A.2 Iterative Solution of a Lyapunov Matrix Equation . . . . . . . . . . . 309

B Convergence Theorems for Stochastic Sequences . . . . . . . . . . . 313


B.1 A Convergence Result of Robbins–Siegmund . . . . . . . . . . . . . . . . 313
B.1.1 Consequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313
B.2 Convergence in the Mean . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316
B.3 The Strong Law of Large Numbers for Dependent Matrix
Sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 318
B.4 A Central Limit Theorem for Dependent Vector Sequences . . . . 319

C Tools from Matrix Calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321


C.1 Miscellaneous . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321
C.2 The v. Mises-Procedure in Case of Errors . . . . . . . . . . . . . . . . . . . 322

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 327

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
1 Decision/Control Under Stochastic Uncertainty

1.1 Introduction

Many concrete problems from engineering, economics, operations research, etc., can be formulated as an optimization problem of the type

min f0 (a, x)   (1.1a)
s.t.
fi (a, x) ≤ 0, i = 1, . . . , mf   (1.1b)
gi (a, x) = 0, i = 1, . . . , mg   (1.1c)
x ∈ D0 .   (1.1d)

Here, the objective (goal) function f0 = f0 (a, x) and the constraint functions fi = fi (a, x), i = 1, . . . , mf , gi = gi (a, x), i = 1, . . . , mg , defined on a joint subset of IRν × IRr , depend on a decision, control or input vector x = (x1 , x2 , . . . , xr )T and a vector a = (a1 , a2 , . . . , aν )T of model parameters. Typical model parameters in technical applications, operations research
and economics are material parameters, external load parameters, cost factors,
technological parameters in input–output operators, demand factors. Further-
more, manufacturing and modeling errors, disturbances or noise factors, etc.,
may occur. Frequent decision, control or input variables are material, topolog-
ical, geometrical and cross-sectional design variables in structural optimiza-
tion [62], forces and moments in optimal control of dynamic systems and
factors of production in operations research and economic design.
The objective function (1.1a) to be optimized describes the aim, the goal
of the modelled optimal decision/design problem or the performance of a
technical, economic system or process to be controlled optimally. Further-
more, the constraints (1.1b)–(1.1d) represent the operating conditions guar-
anteeing a safe structure, a correct functioning of the underlying system,
process, etc. Note that the constraint (1.1d) with a given, fixed convex subset

D0 ⊂ IRr summarizes all (deterministic) constraints being independent of unknown model parameters a, as, e.g., box constraints:

xL ≤ x ≤ xU   (1.1d′)

with given bounds xL , xU .


Important concrete optimization problems, which may be formulated, at
least approximatively, this way are problems from optimal design of mechan-
ical structures and structural systems [2, 62, 130, 138, 146, 148], adaptive
trajectory planning for robots [3, 6, 27, 97, 111, 143], adaptive control of dy-
namic system [144,147], optimal design of economic systems [123], production
planning, manufacturing [73, 118] and sequential decision processes [101], etc.
A basic problem in practice is that the vector of model parameters
a = (a1 , . . . , aν )T is not a given, fixed quantity. Model parameters are of-
ten unknown, only partly known and/or may vary randomly to some extent.
Several techniques have been developed in the recent years in order to
cope with uncertainty with respect to model parameters a. A well known
basic method, often used in engineering practice, is the following two-step
procedure [6, 27, 111, 143, 144]:
(1) Approximation: Replace first the ν-vector a of the unknown or stochastically varying model parameters a1 , . . . , aν by some fixed vector a0 of so-called
nominal values a0l , l = 1, . . . , ν. Apply then an optimal decision (con-
trol) x∗ = x∗ (a0 ) with respect to the resulting approximate optimization
problem

min f0 (a0 , x)   (1.2a)
s.t.
fi (a0 , x) ≤ 0, i = 1, . . . , mf   (1.2b)
gi (a0 , x) = 0, i = 1, . . . , mg   (1.2c)
x ∈ D0 .   (1.2d)

Due to the deviation of the actual parameter vector a from the nominal
vector a0 of model parameters, deviations of the actual state, trajectory
or performance of the system from the prescribed state, trajectory, goal
values occur.
(2) Compensation: The deviation of the actual state, trajectory or perfor-
mance of the system from the prescribed values/functions is compensated
then by online measurement and correction actions (decisions or controls).
Consequently, in general, increasing measurement and correction expenses accrue in the course of time.
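The two-step procedure above can be made concrete on a toy instance of the nominal problem (1.2a)–(1.2d). The cost function, nominal value, actual parameter and box D0 below are all invented for illustration, and the box constraint is handled by a simple grid discretization rather than a genuine optimization routine:

```python
import numpy as np

# Toy instance of (1.2a)-(1.2d): f0(a, x) = (x - a)^2 with the box
# constraint D0 = [0, 2]. All problem data here are invented.
def f0(a, x):
    return (x - a) ** 2

a0 = 1.0                               # chosen nominal parameter value
grid = np.linspace(0.0, 2.0, 2001)     # D0 = [0, 2], discretized

# Step 1 (approximation): optimal decision for the nominal problem
x_star = grid[np.argmin(f0(a0, grid))]

# Step 2 (compensation): the actual parameter deviates from a0, so the
# realized performance deviates from the planned optimum -- this gap is
# the "tracking error" that online corrections must remove
a_actual = 1.3
tracking_error = f0(a_actual, x_star) - f0(a_actual, grid).min()
print(x_star, tracking_error)
```

The residual cost gap f0 (a, x∗ (a0 )) − min_x f0 (a, x) is exactly the quantity that must be compensated online; in this invented instance it equals (1.3 − 1.0)² = 0.09.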
Considerable improvements of this standard procedure can be obtained
by taking into account already at the planning stage, i.e., offline, the mostly
available a priori (e.g., the type of random variability) and sample information

about the parameter vector a. Indeed, based, e.g., on some structural in-
sight, or by parameter identification methods, regression techniques, cal-
ibration methods, etc., in most cases information about the vector a of
model/structural parameters can be extracted. Repeating this information
gathering procedure at some later time points tj > t0 (= initial time point),
j = 1, 2, . . . , adaptive decision/control procedures occur [101].
Based on the inherent random nature of the parameter vector a, the ob-
servation or measurement mechanism, resp., or adopting a Bayesian approach
concerning unknown parameter values [10], here we make the following basic
assumption:
Stochastic Uncertainty: The unknown parameter vector a is a realization

a = a(ω), ω ∈ Ω, (1.3)

of a random ν-vector a(ω) on a certain probability space (Ω, A0 , P), where the probability distribution Pa(·) of a(ω) is known, or it is known that Pa(·)
lies within a given range W of probability measures on IRν . Using a Bayesian
approach, the probability distribution Pa(·) of a(ω) may also describe the
subjective or personal probability of the decision maker, the designer.
Hence, in order to take into account the stochastic variations of the para-
meter vector a, to incorporate the a priori and/or sample information about
the unknown vector a, resp., the standard approach “insert a certain nominal
parameter vector a0 , and correct then the resulting error” must be replaced
by a more appropriate deterministic substitute problem for the basic opti-
mization problem (1.1a)–(1.1d) under stochastic uncertainty.

1.2 Deterministic Substitute Problems: Basic Formulation
The proper selection of a deterministic substitute problem is a decision the-
oretical task, see [78]. Hence, for (1.1a)–(1.1d) we have first to consider the
outcome map
e = e(a, x) := (f0 (a, x), f1 (a, x), . . . , fmf (a, x), g1 (a, x), . . . , gmg (a, x))T ,   (1.4a)
a ∈ IRν , x ∈ IRr (x ∈ D0 ),

and to evaluate then the outcomes e ∈ E ⊂ IR1+mf +mg by means of certain loss or cost functions

γi : E → IR, i = 0, 1, . . . , m,   (1.4b)

with an integer m ≥ 0. For the processing of the numerical outcomes γi (e(a, x)), i = 0, 1, . . . , m, there are two basic concepts:

1.2.1 Minimum or Bounded Expected Costs

Consider the vector of (conditional) expected losses or costs

F(x) = (F0 (x), F1 (x), . . . , Fm (x))T := (Eγ0 (e(a(ω), x)), Eγ1 (e(a(ω), x)), . . . , Eγm (e(a(ω), x)))T , x ∈ IRr ,   (1.5)

where the (conditional) expectation “E” is taken with respect to the time
history A = At , (Aj ) ⊂ A0 up to a certain time point t or stage j. A
short definition of expectations is given in Sect. 2.1, for more details, see, e.g.,
[7, 43, 124].
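Since the expectations in (1.5) are multiple integrals in general, the simplest numerical surrogate for one component Fi (x) is a Monte Carlo sample mean. The cost function and the distribution of a(ω) below are invented for illustration only:

```python
import numpy as np

# Monte Carlo estimate of an expected cost function
# F(x) = E gamma(e(a(w), x)), cf. (1.5). Cost model and the
# distribution of a(w) are invented for illustration.
rng = np.random.default_rng(0)

def gamma_e(a, x):
    # composed cost gamma(e(a, x)); here a simple quadratic loss
    return (a * x - 1.0) ** 2

def F_hat(x, n=100_000):
    a = rng.normal(loc=1.0, scale=0.2, size=n)  # samples of a(w)
    return gamma_e(a, x).mean()                 # sample mean approximates E

print(F_hat(1.0))   # close to Var(a) = 0.04, since E a = 1
```

For x = 1 the exact value is E(a(ω) − 1)² = Var(a) = 0.04; the sample mean converges to it at the usual rate O(n^{−1/2}).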
Having different expected cost or performance functions F0 , F1 , . . . , Fm
to be minimized or bounded, as a basic deterministic substitute problem for
(1.1a)–(1.1d) with a random parameter vector a = a(ω) we may consider the
multi-objective expected cost minimization problem

“min”F(x) (1.6a)

s.t.

x ∈ D0 . (1.6b)

Obviously, a good compromise solution x∗ of this vector optimization problem should have at least one of the following properties [25, 134]:
Definition 1.1 (a) A vector x0 ∈ D0 is called a functional–efficient or
Pareto optimal solution of the vector optimization problem (1.6a), (1.6b)
if there is no x ∈ D0 such that

Fi (x) ≤ Fi (x0 ), i = 0, 1, . . . , m (1.7a)

and
Fi0 (x) < Fi0 (x0 ) for at least one i0 , 0 ≤ i0 ≤ m. (1.7b)

(b) A vector x0 ∈ D0 is called a weak functional–efficient or weak Pareto optimal solution of (1.6a), (1.6b) if there is no x ∈ D0 such that

Fi (x) < Fi (x0 ), i = 0, 1, . . . , m (1.8)
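On a finite candidate set, the dominance relation underlying Definition 1.1 can be checked directly. A small sketch with invented cost vectors (F0 (x), F1 (x)):

```python
import numpy as np

# Definition 1.1 on a finite candidate set: x0 is Pareto optimal iff no
# candidate x satisfies Fi(x) <= Fi(x0) for all i with strict < for some i.
def dominates(Fx, Fx0):
    Fx, Fx0 = np.asarray(Fx), np.asarray(Fx0)
    return bool(np.all(Fx <= Fx0) and np.any(Fx < Fx0))

def pareto_optimal(values):
    # values: list of cost vectors (F0(x), ..., Fm(x)) over a finite D0
    return [i for i, v in enumerate(values)
            if not any(dominates(w, v) for j, w in enumerate(values) if j != i)]

costs = [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0), (2.5, 2.5)]  # invented data
print(pareto_optimal(costs))   # -> [0, 1, 2]: the last point is dominated
```

Here (2.5, 2.5) is dominated by (2.0, 2.0), so only the first three candidates are Pareto optimal.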

(Weak) Pareto optimal solutions of (1.6a), (1.6b) may be obtained now by means of scalarizations of the vector optimization problem (1.6a), (1.6b).
Three main versions are stated in the following:
(1) Minimization of primary expected cost/loss under expected cost constraints

min F0 (x)   (1.9a)

s.t.

Fi (x) ≤ Fimax , i = 1, . . . , m (1.9b)


x ∈ D0 . (1.9c)

Here, F0 = F0 (x) is assumed to describe the primary goal of the design/decision making problem, while Fi = Fi (x), i = 1, . . . , m, describe secondary goals. Moreover, Fimax , i = 1, . . . , m, denote given upper cost/loss bounds.
Remark 1.1 An optimal solution x∗ of (1.9a)–(1.9c) is a weak Pareto
optimal solution of (1.6a), (1.6b).
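A minimal numerical illustration of (1.9a)–(1.9c) with a single secondary cost constraint; both cost functions, the bound F1max and the box D0 are invented, and D0 is again discretized by a grid instead of using an optimization routine:

```python
import numpy as np

# Scalarization (1.9a)-(1.9c) on a discretized box: minimize the primary
# expected cost F0 subject to F1 <= F1max. All data are invented.
F0 = lambda x: (x - 2.0) ** 2   # primary expected cost
F1 = lambda x: x                # secondary expected cost (e.g. recourse)
F1max = 1.5                     # given upper cost bound

grid = np.linspace(0.0, 3.0, 3001)      # D0 = [0, 3]
feasible = grid[F1(grid) <= F1max]
x_star = feasible[np.argmin(F0(feasible))]
print(x_star)
```

The expected cost constraint is active here: the minimizer sits at the boundary x∗ = 1.5 rather than at the unconstrained minimizer x = 2.0 of F0.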
(2) Minimization of the total weighted expected costs
Selecting certain positive weight factors c0 , c1 , . . . , cm , the expected
weighted total costs are defined by
F̃(x) := Σ_{i=0}^{m} ci Fi(x) = Ef(a(ω), x),    (1.10a)

where

f(a, x) := Σ_{i=0}^{m} ci γi(e(a, x)).    (1.10b)
Consequently, minimizing the expected weighted total costs F̃ = F̃(x) subject to the remaining deterministic constraint (1.1d), the following deterministic substitute problem for (1.1a)–(1.1d) occurs:

min Σ_{i=0}^{m} ci Fi(x)    (1.11a)

s.t.

x ∈ D0 . (1.11b)

Remark 1.2 Let ci > 0, i = 0, 1, . . . , m, be any positive weight factors. Then an optimal solution x∗ of (1.11a), (1.11b) is a Pareto optimal solution of (1.6a), (1.6b).
(3) Minimization of the maximum weighted expected costs
Instead of adding weighted expected costs, we may consider the maximum of the weighted expected costs:

F̃(x) := max_{0≤i≤m} ci Fi(x) = max_{0≤i≤m} ci Eγi(e(a(ω), x)).    (1.12)

Here c0, c1, . . . , cm are again positive weight factors. Thus, minimizing F̃ = F̃(x) we have the deterministic substitute problem

min max_{0≤i≤m} ci Fi(x)    (1.13a)
s.t.
x ∈ D0.    (1.13b)
Remark 1.3 Let ci, i = 0, 1, . . . , m, be any positive weight factors. An optimal solution x∗ of (1.13a), (1.13b) is a weak Pareto optimal solution of (1.6a), (1.6b).
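The three scalarizations above can be compared numerically. The following sketch is purely illustrative: the state function e(a, x), both cost terms, the weights, and the distribution of a(ω) are all invented, and the expectations Fi(x) are estimated by plain Monte Carlo sampling over a small design grid.

```python
import random

random.seed(0)

def F(x, i, n=5000):
    # Monte Carlo estimate of the expected cost F_i(x) = E gamma_i(e(a(w), x));
    # the parameter distribution and both cost functions below are invented.
    total = 0.0
    for _ in range(n):
        a = random.gauss(1.0, 0.3)       # random model parameter a(w)
        e = a * x                        # hypothetical state function e(a, x)
        g0 = x                           # primary cost F0: e.g. material volume
        g1 = max(0.0, 2.0 - e) ** 2      # recourse cost F1: penalty if e < 2
        total += (g0, g1)[i]
    return total / n

c0, c1 = 1.0, 5.0                        # positive weight factors
grid = [1.5 + 0.1 * k for k in range(16)]

# Weighted-sum scalarization (1.11a): Pareto optimal by Remark 1.2
x_sum = min(grid, key=lambda x: c0 * F(x, 0) + c1 * F(x, 1))

# Minimax scalarization (1.13a): weakly Pareto optimal by Remark 1.3
x_max = min(grid, key=lambda x: max(c0 * F(x, 0), c1 * F(x, 1)))
```

The two scalarizations generally pick different compromise designs: the weighted sum trades the goals off on average, while the minimax version balances the two weighted costs against each other.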

1.2.2 Minimum or Bounded Maximum Costs (Worst Case)

Instead of taking expectations, we may consider the worst case with respect to the cost variations caused by the random parameter vector a = a(ω). Hence, the random cost function

ω → γi(e(a(ω), x))    (1.14a)

is evaluated by means of

Fisup(x) := ess sup γi(e(a(ω), x)), i = 0, 1, . . . , m.    (1.14b)

Here, ess sup (. . .) denotes the (conditional) essential supremum with respect to the random vector a = a(ω), given information A, i.e., the infimum of the supremum of (1.14a) on sets A ∈ A0 of (conditional) probability one, see, e.g., [124].
Consequently, the vector function F = Fsup(x) is then defined by

Fsup(x) = (F0(x), F1(x), . . . , Fm(x))ᵀ := (ess sup γ0(e(a(ω), x)), ess sup γ1(e(a(ω), x)), . . . , ess sup γm(e(a(ω), x)))ᵀ.    (1.15)

Working with the vector function F = Fsup(x), we then have the vector minimization problem
“min”Fsup (x) (1.16a)
s.t.
x ∈ D0 . (1.16b)
By scalarization of (1.16a), (1.16b) we obtain then again deterministic
substitute problems for (1.1a)–(1.1d) related to the substitute problem (1.6a),
(1.6b) introduced in Sect. 1.2.1.
More details for the selection and solution of appropriate deterministic
substitute problems for (1.1a)–(1.1d) are given in the next sections.
2 Deterministic Substitute Problems in Optimal Decision Under Stochastic Uncertainty

2.1 Optimum Design Problems with Random Parameters

In the optimum design of technical or economic structures/systems two basic classes of criteria appear:
First there is a primary cost function

G0 = G0 (a, x). (2.1a)

Important examples are the total weight or volume of a mechanical structure, the costs of construction, design of a certain technical or economic structure/system, or the negative utility or reward in a general decision situation.
For the representation of the structural/system safety or failure, for the
representation of the admissibility of the state, or for the formulation of the
basic operating conditions of the underlying structure/system certain state
functions
yi = yi (a, x), i = 1, . . . , my , (2.1b)
are chosen. In structural design these functions are also called “limit
state functions” or “safety margins.” Frequent examples are some displace-
ment, stress, load (force and moment) components in structural design, or
production and cost functions in production planning problems, optimal mix
problems, transportation problems, allocation problems and other problems
of economic design.
In (2.1a), (2.1b), the design or input vector x denotes the r-vector of design or input variables x1, x2, . . . , xr, as, e.g., structural dimensions, sizing variables, such as cross-sectional areas, thickness in structural design, or factors of production, actions in economic design. For the design or input vector x one has mostly some basic deterministic constraints, e.g., nonnegativity constraints, box constraints, represented by

x ∈ D,    (2.1c)

where D is a given convex subset of IRr. Moreover, a is the ν-vector of model parameters. In optimal structural design,

a = (p, P)    (2.1d)

is composed of the following two subvectors: P is the m-vector of the acting external loads, e.g., wave, wind loads, payload, etc. Moreover, p denotes the (ν − m)-vector of the further model parameters, as, e.g., material parameters (like strength parameters, yield/allowable stresses, elastic moduli, plastic capacities, etc.) of each member of the structure, the manufacturing tolerances and the weight or more general cost coefficients.
In production planning problems,

a = (A, b, c)    (2.1e)

is composed of the m × r matrix A of technological coefficients, the demand m-vector b and the r-vector c of unit costs.
Based on the my-vector of state functions

y(a, x) := (y1(a, x), y2(a, x), . . . , ymy(a, x))ᵀ,    (2.1f)

the admissible or safe states of the structure/system can be characterized by the condition

y(a, x) ∈ B,    (2.1g)

where B is a certain subset of IRmy; B = B(a) may also depend on some model parameters.
In production planning problems, typical operating conditions are given, cf. (2.1e), by

y(a, x) := Ax − b ≥ 0 or y(a, x) = 0, x ≥ 0.    (2.2a)

In mechanical structures/structural systems, the safety (survival) of the structure/system is described by the operating conditions

yi(a, x) > 0 for all i = 1, . . . , my    (2.2b)

with state functions yi = yi(a, x), i = 1, . . . , my, depending on certain response components of the structure/system, such as displacement, stress, force, moment components. Hence, a failure occurs if and only if the structure/system is in the ith failure mode (failure domain)

yi(a, x) ≤ 0    (2.2c)

for at least one index i, 1 ≤ i ≤ my.


Note. The number my of safety margins or limit state functions yi = yi(a, x), i = 1, . . . , my, may be very large. E.g., in optimal plastic design the limit state functions are determined by the extreme points of the admissible domain of the dual pair of static/kinematic LPs related to the equilibrium and linearized convex yield condition, see Sect. 7.

Basic problems in optimal decision/design are:

(1) Primary (construction, planning, investment, etc.) cost minimization under operating or safety conditions

min G0(a, x)    (2.3a)
s.t.
y(a, x) ∈ B    (2.3b)
x ∈ D.    (2.3c)

Obviously we have B = (0, +∞)my in (2.2b) and B = [0, +∞)my or B = {0} in (2.2a).
(2) Failure or recourse cost minimization under primary cost constraints

“min” γ(y(a, x))    (2.4a)
s.t.
G0(a, x) ≤ Gmax    (2.4b)
x ∈ D.    (2.4c)

In (2.4a), γ = γ(y) is a scalar or vector valued cost/loss function evaluating violations of the operating conditions (2.3b). Depending on the application, these costs are called “failure” or “recourse” costs [58, 59, 98, 120, 137, 138].
As already discussed in Sect. 1, solving problems of the above type, a basic
difficulty is the uncertainty about the true value of the vector a of model
parameters or the (random) variability of a:
In practice, due to several types of uncertainties such as, see [154]:
– Physical uncertainty (variability of physical quantities, like material, loads,
dimensions, etc.)
– Economic uncertainty (trade, demand, costs, etc.)
– Statistical uncertainty (e.g., estimation errors of parameters due to limited
sample data)
– Model uncertainty (model errors)
the ν-vector a of model parameters must be modeled by a random vector

a = a(ω), ω ∈ Ω,    (2.5a)

on a certain probability space (Ω, A0, P) with sample space Ω having elements ω, see (1.3). For the mathematical representation of the corresponding (conditional) probability distribution Pa(·) = P^A_a(·) of the random vector a = a(ω) (given the time history or information A ⊂ A0), two main distribution models are taken into account in practice:
(1) Discrete probability distributions
(2) Continuous probability distributions
In the first case there is a finite or countably infinite number l0 ∈ IN ∪ {∞} of realizations or scenarios al ∈ IRν, l = 1, . . . , l0,

P(a(ω) = al) = αl, l = 1, . . . , l0,    (2.5b)

taken with probabilities αl, l = 1, . . . , l0. In the second case, the probability that the realization a(ω) = a lies in a certain (measurable) subset B ⊂ IRν is described by the multiple integral

P(a(ω) ∈ B) = ∫_B ϕ(a) da    (2.5c)

with a certain probability density function ϕ = ϕ(a) ≥ 0, a ∈ IRν, ∫ ϕ(a) da = 1.
The properties of the probability distribution Pa(·) may be described – fully or in part – by certain numerical characteristics, called parameters of Pa(·). These distribution parameters θ = θh are obtained by considering expectations

θh := Eh(a(ω))    (2.6a)

of (measurable) functions

(h ◦ a)(ω) := h(a(ω))    (2.6b)

composed of the random vector a = a(ω) with certain (measurable) mappings

h : IRν → IRsh, sh ≥ 1.    (2.6c)

According to the type of the probability distribution Pa(·) of a = a(ω), the expectation Eh(a(ω)) is defined by

Eh(a(ω)) = Σ_{l=1}^{l0} h(al) αl in the discrete case (2.5b),
Eh(a(ω)) = ∫_{IRν} h(a) ϕ(a) da in the continuous case (2.5c).    (2.6d)

Further distribution parameters θ are functions

θ = Ψ(θh1, . . . , θhs)    (2.7)

of certain “h-moments” θh1, . . . , θhs of the type (2.6a). Important examples of the type (2.6a), (2.7), resp., are the expectation

ā = Ea(ω) (for h(a) := a, a ∈ IRν)    (2.8a)

and the covariance matrix

Q := E(a(ω) − ā)(a(ω) − ā)ᵀ = Ea(ω)a(ω)ᵀ − ā āᵀ    (2.8b)

of the random vector a = a(ω).
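Both moment formulas are easy to check on a small discrete distribution of the type (2.5b). In the sketch below the scenario values and probabilities are invented; the two expressions in (2.8b) are evaluated side by side.

```python
# Discrete distribution (2.5b) with three scenarios a^l and probabilities alpha_l
# (numbers invented for illustration); verify Q = E[a a^T] - abar abar^T, (2.8b).
scenarios = [((1.0, 2.0), 0.5), ((2.0, 1.0), 0.3), ((0.0, 0.0), 0.2)]

# Expectation vector abar = E a(w), cf. (2.8a)
abar = [sum(p * a[k] for a, p in scenarios) for k in range(2)]

# Covariance matrix via the centered definition ...
Q_centered = [[sum(p * (a[k] - abar[k]) * (a[l] - abar[l]) for a, p in scenarios)
               for l in range(2)] for k in range(2)]
# ... and via the second ("raw moment") form of (2.8b)
Q_raw = [[sum(p * a[k] * a[l] for a, p in scenarios) - abar[k] * abar[l]
          for l in range(2)] for k in range(2)]
```

Both computations give the same matrix, up to floating point rounding.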

Due to the stochastic variability of the random vector a(·) of model parameters, and since the realization a(ω) = a is not available at the decision making stage, the optimal design problem (2.3a)–(2.3c) or (2.4a)–(2.4c) under stochastic uncertainty cannot be solved directly.
Hence, appropriate deterministic substitute problems must be chosen taking into account the randomness of a = a(ω), cf. Sect. 1.2.

2.1.1 Deterministic Substitute Problems in Optimal Design

According to Sect. 1.2, a basic deterministic substitute problem in optimal design under stochastic uncertainty is the minimization of the total expected costs including the expected costs of failure

min cG EG0(a(ω), x) + cf pf(x)    (2.9a)
s.t.
x ∈ D.    (2.9b)

Here,

pf = pf(x) := P(y(a(ω), x) ∉ B)    (2.9c)
is the probability of failure, or the probability that safe functioning of the structure/system is not guaranteed. Furthermore, cG is a certain weight factor, and cf > 0 describes the failure or recourse costs. In the present definition of expected failure costs, constant costs for each realization a = a(ω) of a(·) are assumed. Obviously, it is

pf (x) = 1 − ps (x) (2.9d)

with the probability of safety or survival

ps(x) := P(y(a(ω), x) ∈ B).    (2.9e)

In case (2.2b) we have

pf(x) = P(yi(a(ω), x) ≤ 0 for at least one index i, 1 ≤ i ≤ my).    (2.9f)
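For a given design x, the failure probability (2.9f) can be estimated by simple Monte Carlo sampling of the parameter vector. The sketch below is illustrative only: the two limit state functions (resistance minus load) and the normal distributions of the random parameters are invented.

```python
import random

random.seed(1)

def p_f(x, n=40000):
    """Monte Carlo estimate of (2.9f): failure iff some y_i(a(w), x) <= 0.
    The limit state functions and parameter distributions are invented."""
    fails = 0
    for _ in range(n):
        R = random.gauss(1.0, 0.1)       # random strength parameter
        L = random.gauss(0.5, 0.1)       # random load parameter
        y1 = x * R - L                   # safety margin 1
        y2 = 2.0 * x - L                 # safety margin 2
        if y1 <= 0.0 or y2 <= 0.0:       # failure domain (2.2c)
            fails += 1
    return fails / n

pf_small = p_f(0.4)   # undersized design: high failure probability
pf_large = p_f(1.2)   # generous design: failure probability near zero
```

Increasing the design variable x (e.g. a cross-sectional area) drives the estimated failure probability down, at the price of a higher primary cost.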

The objective function (2.9a) may be interpreted as the Lagrangian (with given cost multiplier cf) of the following reliability based optimization (RBO) problem, cf. [2, 28, 88, 91, 99, 120, 138, 154, 162]:

min EG0(a(ω), x)    (2.10a)
s.t.
pf(x) ≤ αmax    (2.10b)
x ∈ D,    (2.10c)

where αmax > 0 is a prescribed maximum failure probability, e.g., αmax = 0.001, cf. (2.3a)–(2.3c).
The “dual” version of (2.10a)–(2.10c) reads

min pf(x)    (2.11a)
s.t.
EG0(a(ω), x) ≤ Gmax    (2.11b)
x ∈ D    (2.11c)

with a maximum (upper) cost bound Gmax, see (2.4a)–(2.4c).
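A crude but transparent way to attack the RBO problem (2.10a)–(2.10c) is a grid search: estimate pf(x) by Monte Carlo on a grid of candidate designs and keep the cheapest feasible one. Everything in the sketch is invented (a single safety margin y(a, x) = x·R − L, normal parameters, primary cost EG0 = x); it is a toy instance, not the book's solution method.

```python
import random

random.seed(2)

ALPHA_MAX = 0.001   # prescribed maximum failure probability, cf. (2.10b)

def pf_hat(x, n=20000):
    # Monte Carlo estimate of p_f(x) for one invented safety margin
    # y(a, x) = x * R - L with random strength R and random load L.
    fails = sum(1 for _ in range(n)
                if x * random.gauss(1.0, 0.05) - random.gauss(0.5, 0.05) <= 0.0)
    return fails / n

# Primary cost EG0(a(w), x) = x (e.g. volume); find the smallest feasible design
grid = [0.5 + 0.05 * k for k in range(21)]          # candidate designs 0.5 .. 1.5
feasible = [x for x in grid if pf_hat(x) <= ALPHA_MAX]
x_opt = min(feasible)   # minimizes the primary cost among feasible grid points
```

For smooth problems one would of course replace the grid by a proper nonlinear programming method; the grid only makes the structure of (2.10a)–(2.10c) visible.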


Further substitute problems are obtained by considering more general expected failure or recourse cost functions

Γ(x) = Eγ(y(a(ω), x))    (2.12a)

arising from structural systems weakness or failure, or because of false operation. Here,

y(a(ω), x) := (y1(a(ω), x), . . . , ymy(a(ω), x))ᵀ    (2.12b)

is again the random vector of state functions, and

γ : IRmy → IRmγ    (2.12c)

is a scalar or vector valued cost or loss function. In case B = (0, +∞)my or B = [0, +∞)my it is often assumed that γ = γ(y) is a nonincreasing function, hence,

γ(y) ≥ γ(z), if y ≤ z,    (2.12d)

where inequalities between vectors are defined componentwise.

Example 2.1. If γ(y) = 1 for y ∈ Bᶜ (complement of B) and γ(y) = 0 for y ∈ B, then Γ(x) = pf(x).

Example 2.2. Suppose that γ = γ(y) is a nonnegative scalar function on IRmy such that

γ(y) ≥ γ0 > 0 for all y ∉ B    (2.13a)

with a constant γ0 > 0. Then for the probability of failure we find the following upper bound

pf(x) = P(y(a(ω), x) ∉ B) ≤ (1/γ0) Eγ(y(a(ω), x)),    (2.13b)

where the right hand side of (2.13b) is obviously an expected cost function of type (2.12a)–(2.12c). Hence, the condition (2.10b) can be guaranteed by the cost constraint

Eγ(y(a(ω), x)) ≤ γ0 αmax.    (2.13c)

Example 2.3. If the loss function γ(y) is defined by a vector of individual loss functions γi for each state function yi = yi(a, x), i = 1, . . . , my, hence,

γ(y) = (γ1(y1), . . . , γmy(ymy))ᵀ,    (2.14a)

then

Γ(x) = (Γ1(x), . . . , Γmy(x))ᵀ, Γi(x) := Eγi(yi(a(ω), x)), 1 ≤ i ≤ my,    (2.14b)

i.e., the my state functions yi, i = 1, . . . , my, will be treated separately.

Working with the more general expected failure or recourse cost functions
Γ = Γ (x), instead of (2.9a)–(2.9c), (2.10a)–(2.10c) and (2.11a)–(2.11c) we
have the related substitute problems:
(1) Expected total cost minimization

min cG EG0(a(ω), x) + cfᵀ Γ(x)    (2.15a)
s.t.
x ∈ D    (2.15b)

(2) Expected primary cost minimization under expected failure or recourse cost constraints

min EG0(a(ω), x)    (2.16a)
s.t.
Γ(x) ≤ Γmax    (2.16b)
x ∈ D,    (2.16c)

(3) Expected failure or recourse cost minimization under expected primary cost constraints

“min” Γ(x)    (2.17a)
s.t.
EG0(a(ω), x) ≤ Gmax    (2.17b)
x ∈ D.    (2.17c)

Here, cG , cf are (vectorial) weight coefficients, Γ max is the vector of upper loss
bounds, and “min” indicates again that Γ (x) may be a vector valued function.

2.1.2 Deterministic Substitute Problems in Quality Engineering

In quality engineering [109, 113, 114] the aim is to optimize the performance of a certain product or a certain process, and to make the performance insensitive (robust) to any noise to which the product or process may be subjected. Moreover, the production costs (development, manufacturing, etc.) should be bounded or minimized.
Hence, the control (input, design) factors x are then to be set so as to provide
– An optimal mean performance (response)
– A high degree of immunity (robustness) of the performance (response) to noise factors
– Low or minimum production costs
Of course, a certain trade-off between these criteria may be taken into account.
For this aim, the quality characteristic or performance (e.g., dimensions, weight, contamination, structural strength, yield, etc.) of the product/process is described [109] first by a function

y = y(a(ω), x)    (2.18)

depending on a ν-vector a = a(ω) of randomly varying parameters (noise factors) aj = aj(ω), j = 1, . . . , ν, causing quality variations, and the r-vector x of control/input factors xk, k = 1, . . . , r.
The quality loss of the product/process under consideration is measured
then by a certain loss function γ = γ(y). Simple loss functions γ are often
chosen [113, 114] in the following way:
(1) Nominal-is-best
Having a given, finite target or nominal value y ∗ , e.g., a nominal dimension
of a product, the quality loss γ(y) reads

γ(y) := (y − y∗)².    (2.19a)

(2) Smallest-is-best
If the target or nominal value y ∗ is zero, i.e., if the absolute value |y| of
the product quality function y = y(a, x) should be as small as possible,
e.g., a certain product contamination, then the corresponding quality loss
is defined by
γ(y) := y².    (2.19b)
(3) Biggest-is-best
If the absolute value |y| of the product quality function y = y(a, x) should be as large as possible, e.g., the strength of a bar or the yield of a process, then a possible quality loss is given by

γ(y) := (1/y)².    (2.19c)

Obviously, (2.19a) and (2.19b) are convex loss functions. Moreover, since the quality or response function y = y(a, x) takes only positive or only negative values y in many practical applications, (2.19c) also yields a convex loss function in many cases.
Further decision situations may be modeled by choosing more general con-
vex loss functions γ = γ(y).
A high mean product/process level or performance at minimum random quality variations and low or bounded production/manufacturing costs is then achieved [109, 113, 114] by minimization of the expected quality loss function

Γ(x) := Eγ(y(a(ω), x)),    (2.19d)

subject to certain constraints for the input x, as, e.g., constraints for the production/manufacturing costs. Obviously, Γ(x) can also be minimized by maximizing the negative logarithmic mean loss (sometimes called “signal-to-noise ratio” [109]):

SN := − log Γ(x).    (2.19e)
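The nominal-is-best case (2.19a) together with (2.19d) and (2.19e) can be sketched in a few lines. The quality function y(a, x) = a·x, the target value, and the noise distribution below are all invented for illustration; the expectation is estimated by Monte Carlo.

```python
import math
import random

random.seed(3)

def expected_loss(x, n=20000):
    # Expected nominal-is-best quality loss (2.19a)/(2.19d) for an invented
    # quality function y(a, x) = a * x with target y* = 10 and noise factor a.
    y_target = 10.0
    s = 0.0
    for _ in range(n):
        a = random.gauss(1.0, 0.05)      # random noise factor a(w)
        y = a * x                        # quality characteristic (2.18)
        s += (y - y_target) ** 2         # quadratic quality loss (2.19a)
    return s / n

G_off = expected_loss(8.0)     # control factor far from the target
G_on = expected_loss(10.0)     # control factor matching the target on average
SN = -math.log(G_on)           # "signal-to-noise ratio" (2.19e)
```

Note that even at the best setting the expected loss does not vanish: the residual loss is the variance contribution of the noise factor, which is exactly what the robustness criterion tries to keep small.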

Since Γ = Γ(x) is a mean value function as described in the previous sections, for the computation of a robust optimal input vector x∗ one has the following stochastic optimization problem:

min Γ(x)    (2.20a)
s.t.
EG0(a(ω), x) ≤ Gmax    (2.20b)
x ∈ D.    (2.20c)

2.2 Basic Properties of Substitute Problems

As can be seen from the conversion of an optimization problem with random parameters into a deterministic substitute problem, cf. Sects. 2.1.1 and 2.1.2, a central role is played by expectation or mean value functions of the type

Γ(x) = Eγ(y(a(ω), x)), x ∈ D0,    (2.21a)

or, more generally,

Γ(x) = Eg(a(ω), x), x ∈ D0.    (2.21b)

Here, a = a(ω) is a random ν-vector, y = y(a, x) is an my-vector valued function on a certain subset of IRν × IRr, and γ = γ(z) is a real valued function on a certain subset of IRmy.
Furthermore, g = g(a, x) denotes a real valued function on a certain subset of IRν × IRr. In the following we suppose that the expectation in (2.21a), (2.21b) exists and is finite for all input vectors x lying in an appropriate set D0 ⊂ IRr, cf. [11].
The following basic properties of the mean value functions Γ are needed
in the following again and again.
 
Lemma 2.1 (Convexity). Suppose that x → g(a(ω), x) is convex a.s. (almost surely) on a fixed convex domain D0 ⊂ IRr. If Eg(a(ω), x) exists and is finite for each x ∈ D0, then Γ = Γ(x) is convex on D0.

Proof. This property follows [58, 59, 78] directly from the linearity of the expectation operator.

If g = g(a, x) is defined by g(a, x) := γ(y(a, x)), see (2.21a), then the above lemma yields the following result:
  
Corollary 2.1. Suppose that γ is convex and Eγ(y(a(ω), x)) exists and is finite for each x ∈ D0. (a) If x → y(a(ω), x) is linear a.s., then Γ = Γ(x) is convex. (b) If x → y(a(ω), x) is convex a.s., and γ is a convex, monotone nondecreasing function, then Γ = Γ(x) is convex.

It is well known [72] that a convex function is continuous on each open subset of its domain. A general sufficient condition for the continuity of Γ is given next.

Lemma 2.2 (Continuity). Suppose that Eg(a(ω), x) exists and is finite for each x ∈ D0, and assume that x → g(a(ω), x) is continuous at x0 ∈ D0 a.s. If there is a function ψ = ψ(a(ω)) having finite expectation such that

|g(a(ω), x)| ≤ ψ(a(ω)) a.s. for all x ∈ U(x0) ∩ D0,    (2.22)

where U(x0) is a neighborhood of x0, then Γ = Γ(x) is continuous at x0.

Proof. The assertion can be shown by using Lebesgue’s dominated convergence theorem, see, e.g., [78].

For the consideration of the differentiability of Γ = Γ(x), let D denote an open subset of the domain D0 of Γ.

Lemma 2.3 (Differentiability). Suppose that
(1) Eg(a(ω), x) exists and is finite for each x ∈ D0,
(2) x → g(a(ω), x) is differentiable on the open subset D of D0 a.s., and
(3)

‖∇x g(a(ω), x)‖ ≤ ψ(a(ω)), x ∈ D, a.s.,    (2.23a)

where ψ = ψ(a(ω)) is a function having finite expectation. Then the expectation of ∇x g(a(ω), x) exists and is finite, Γ = Γ(x) is differentiable on D and

∇Γ(x) = ∇x Eg(a(ω), x) = E∇x g(a(ω), x), x ∈ D.    (2.23b)

Proof. Considering the difference quotients ∆Γ/∆xk, k = 1, . . . , r, of Γ at a fixed point x0 ∈ D, the assertion follows by means of the mean value theorem, inequality (2.23a) and Lebesgue’s dominated convergence theorem, cf. [58, 59, 78].

Example 2.4. In case (2.21a), under obvious differentiability assumptions concerning γ and y we have ∇x g(a, x) = ∇x y(a, x)ᵀ ∇γ(y(a, x)), where ∇x y(a, x) denotes the Jacobian of y = y(a, x) with respect to x. Hence, if (2.23b) holds, then

∇Γ(x) = E∇x y(a(ω), x)ᵀ ∇γ(y(a(ω), x)).    (2.23c)
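The interchange of expectation and differentiation in (2.23b) is easy to check numerically on a discrete parameter distribution, where Γ(x) can be evaluated exactly. The integrand g and the scenario data below are invented; the derivative of Γ is computed by a central difference and compared with the expectation of the gradient.

```python
import math

# Discrete distribution (2.5b) of the random parameter (scenarios invented)
scenarios = [(0.8, 0.3), (1.0, 0.4), (1.3, 0.3)]

def g(a, x):                             # smooth integrand g(a, x), invented
    return math.exp(a * x) + x ** 2

def dg_dx(a, x):                         # its partial derivative w.r.t. x
    return a * math.exp(a * x) + 2.0 * x

def Gamma(x):                            # Gamma(x) = E g(a(w), x), cf. (2.21b)
    return sum(p * g(a, x) for a, p in scenarios)

x0, h = 0.5, 1e-6
lhs = (Gamma(x0 + h) - Gamma(x0 - h)) / (2.0 * h)   # d/dx of E g (numerically)
rhs = sum(p * dg_dx(a, x0) for a, p in scenarios)   # E of d/dx g, i.e. (2.23b)
```

Both numbers agree up to the discretization error of the central difference, as Lemma 2.3 predicts for this well-behaved integrand.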

2.3 Approximations of Deterministic Substitute Problems in Optimal Design

The main problem in solving the deterministic substitute problems defined above is that the arising probability and expected cost functions pf = pf(x), Γ = Γ(x), x ∈ IRr, are defined by means of multiple integrals over a ν-dimensional space.
Thus, the substitute problems may be solved, in practice, only by some approximative analytical and numerical methods [37, 165]. In the following we consider possible approximations for substitute problems based on general expected recourse cost functions Γ = Γ(x) according to (2.21a) having a real valued convex loss function γ(z). Note that the probability of failure function pf = pf(x) may be approximated from above, see (2.13a), (2.13b), by expected cost functions Γ = Γ(x) having a nonnegative function γ = γ(z) being bounded from below on the failure domain Bᶜ. In the following several basic approximation methods are presented.

2.3.1 Approximation of the Loss Function

Suppose here that γ = γ(y) is a continuously differentiable, convex loss function on IRmy. Let

ȳ(x) := Ey(a(ω), x) = (Ey1(a(ω), x), . . . , Eymy(a(ω), x))ᵀ    (2.24)

denote the expectation of the vector y = y(a(ω), x) of state functions yi = yi(a(ω), x), i = 1, . . . , my.
For an arbitrary continuously differentiable, convex loss function γ we have

γ(y(a(ω), x)) ≥ γ(ȳ(x)) + ∇γ(ȳ(x))ᵀ (y(a(ω), x) − ȳ(x)).    (2.25a)

Thus, taking expectations in (2.25a), we find Jensen’s inequality

Γ(x) = Eγ(y(a(ω), x)) ≥ γ(ȳ(x)),    (2.25b)

which holds for any convex function γ. Using the mean value theorem, we have

γ(y) = γ(ȳ) + ∇γ(ŷ)ᵀ (y − ȳ),    (2.25c)

where ŷ is a point on the line segment between y and ȳ. By means of (2.25b), (2.25c) we get

0 ≤ Γ(x) − γ(ȳ(x)) ≤ E(‖∇γ(ŷ(a(ω), x))‖ · ‖y(a(ω), x) − ȳ(x)‖).    (2.25d)

(a) Bounded gradient
If the gradient ∇γ is bounded on the range of y = y(a(ω), x), x ∈ D, i.e., if

‖∇γ(y(a(ω), x))‖ ≤ ϑmax a.s. for each x ∈ D,    (2.26a)

with a constant ϑmax > 0, then

0 ≤ Γ(x) − γ(ȳ(x)) ≤ ϑmax E‖y(a(ω), x) − ȳ(x)‖, x ∈ D.    (2.26b)

Since t → √t, t ≥ 0, is a concave function, we get

0 ≤ Γ(x) − γ(ȳ(x)) ≤ ϑmax √q(x),    (2.26c)

where

q(x) := E‖y(a(ω), x) − ȳ(x)‖² = tr Q(x)    (2.26d)

is the generalized variance, and

Q(x) := cov(y(a(·), x))    (2.26e)

denotes the covariance matrix of the random vector y = y(a(ω), x). Consequently, the expected loss function Γ(x) can be approximated from above by

Γ(x) ≤ γ(ȳ(x)) + ϑmax √q(x) for x ∈ D.    (2.26f)

(b) Bounded eigenvalues of the Hessian
Considering second order expansions of γ, with a vector ỹ on the line segment between ȳ and y we find

γ(y) − γ(ȳ) = ∇γ(ȳ)ᵀ (y − ȳ) + ½ (y − ȳ)ᵀ ∇²γ(ỹ)(y − ȳ).    (2.27a)

Suppose that the eigenvalues λ of ∇²γ(y) are bounded from below and above on the range of y = y(a(ω), x) for each x ∈ D, i.e.,

0 < λmin ≤ λ(∇²γ(y(a(ω), x))) ≤ λmax < +∞, a.s., x ∈ D,    (2.27b)

with constants 0 < λmin ≤ λmax. Taking expectations in (2.27a), we get

γ(ȳ(x)) + (λmin/2) q(x) ≤ Γ(x) ≤ γ(ȳ(x)) + (λmax/2) q(x), x ∈ D.    (2.27c)
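For a purely quadratic loss the two bounds in (2.27c) coincide: with γ(y) = ‖y‖² one has ∇²γ = 2I, hence λmin = λmax = 2 and Γ(x) = γ(ȳ(x)) + q(x) exactly. The sketch below checks this identity on samples; the Gaussian state functions are invented, and the sample version of the identity holds exactly (up to rounding).

```python
import random

random.seed(4)

# Quadratic loss gamma(y) = ||y||^2: Hessian = 2I, so in (2.27b)
# lambda_min = lambda_max = 2 and both bounds of (2.27c) coincide.
def gamma(y):
    return sum(v * v for v in y)

n = 20000
# Invented 2-dimensional state vector y(a(w), x) for a fixed design x
samples = [(random.gauss(1.5, 0.2), random.gauss(3.0, 0.3)) for _ in range(n)]

Gamma_x = sum(gamma(y) for y in samples) / n                       # E gamma(y)
ybar = [sum(y[k] for y in samples) / n for k in range(2)]          # mean (2.24)
q_x = sum(gamma((y[0] - ybar[0], y[1] - ybar[1])) for y in samples) / n  # (2.26d)

lhs, rhs = Gamma_x, gamma(ybar) + q_x   # the two sides of the tight (2.27c)
```

The identity E‖y‖² = ‖ȳ‖² + E‖y − ȳ‖² behind this check is also the reason why the generalized variance q(x) appears as the natural robustness penalty in (2.28b) and (2.29a).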
Consequently, using (2.26f) or (2.27c), various approximations for the deterministic substitute problems (2.15a), (2.15b), (2.16a)–(2.16c), (2.17a)–(2.17c) may be obtained.
Based on the above approximations of expected cost functions, we state the following two approximates to (2.16a)–(2.16c), (2.17a)–(2.17c), resp., which are well known in robust optimal design:
(1) Expected primary cost minimization under approximate expected failure or recourse cost constraints

min EG0(a(ω), x)    (2.28a)
s.t.
γ(ȳ(x)) + c0 q(x) ≤ Γmax    (2.28b)
x ∈ D,    (2.28c)

where c0 is a scale factor, cf. (2.26f) and (2.27c);

(2) Approximate expected failure or recourse cost minimization under expected primary cost constraints

min γ(ȳ(x)) + c0 q(x)    (2.29a)
s.t.
EG0(a(ω), x) ≤ Gmax    (2.29b)
x ∈ D.    (2.29c)

Obviously, by means of (2.28a)–(2.28c) or (2.29a)–(2.29c) optimal designs x∗ are achieved which
– Yield a high mean performance of the structure/structural system
– Are minimally sensitive or have a limited sensitivity with respect to random parameter variations (material, load, manufacturing, process, etc.)
– Cause only limited costs for design, construction, maintenance, etc.

2.3.2 Regression Techniques, Model Fitting, RSM

In structures/systems with variable quantities, very often one or several functional relationships between the measurable variables can be guessed according to the available data resulting from observations, simulation experiments, etc. Due to limited data, the true functional relationships between the quantities involved are approximated in practice – on a certain limited range – usually by simple functions, as, e.g., multivariate polynomials of a certain (low) degree. The resulting regression metamodel [64] is then obtained by certain fitting techniques, such as weighted least squares or Tschebyscheff minmax procedures for estimating the unknown parameters of the regression model.
An important advantage of these models is then the possibility to predict the values of some variables from knowledge of other variables.
In case of more than one independent variable (predictor or regressor) one has a “multiple regression” task. Problems with more than one dependent or response variable are called “multivariate regression” problems, see [33].
(a) Approximation of state functions. The numerical solution is simplified considerably if one can work with one single state function y = y(a, x). Formally, this is possible by defining the function

ymin(a, x) := min_{1≤i≤my} yi(a, x).    (2.30a)

Indeed, according to (2.2b), (2.2c) the failure of the structure/system can be represented by the condition

ymin(a, x) ≤ 0.    (2.30b)

Fig. 2.1. Loss function γ

Thus, the weakness or failure of the technical or economic device can be evaluated numerically by the function

Γ(x) := Eγ(ymin(a(ω), x))    (2.30c)

with a nonincreasing loss function γ : IR → IR+, see Fig. 2.1.


However, the “min” operator in (2.30a) in general yields a nonsmooth function ymin = ymin(a, x), and the computation of the mean and variance function

ȳmin(x) := Eymin(a(ω), x)    (2.30d)
σ²ymin(x) := V(ymin(a(·), x))    (2.30e)

by means of Taylor expansion with respect to the model parameter vector a at ā = Ea(ω) is not possible directly, cf. Sect. 2.3.3.
According to the definition (2.30a), an upper bound for ȳmin(x) is given by

ȳmin(x) ≤ min_{1≤i≤my} ȳi(x) = min_{1≤i≤my} Eyi(a(ω), x).

Further approximations of ymin(a, x) and its moments can be found by using the representation

min(a, b) = ½ (a + b − |a − b|)

of the minimum of two numbers a, b ∈ IR. E.g., for an even index my we have

ymin(a, x) = min_{i=1,3,...,my−1} min(yi(a, x), yi+1(a, x))
           = min_{i=1,3,...,my−1} ½ (yi(a, x) + yi+1(a, x) − |yi(a, x) − yi+1(a, x)|).
In many cases we may suppose that the state functions yi = yi(a, x), i = 1, . . . , my, are bounded from below, hence,

yi(a, x) > −A, i = 1, . . . , my,

for all (a, x) under consideration with a positive constant A > 0. Thus, defining

ỹi(a, x) := yi(a, x) + A, i = 1, . . . , my,

and therefore

ỹmin(a, x) := min_{1≤i≤my} ỹi(a, x) = ymin(a, x) + A,

we have

ymin(a, x) ≤ 0 if and only if ỹmin(a, x) ≤ A.
Hence, the survival/failure of the system or structure can also be studied by means of the positive function ỹmin = ỹmin(a, x). Using now the theory of power or Hölder means [24], the minimum ỹmin(a, x) of positive functions can be represented also by the limit

ỹmin(a, x) = lim_{λ→−∞} ((1/my) Σ_{i=1}^{my} ỹi(a, x)^λ)^{1/λ}

of the decreasing family of power means

M^[λ](ỹ) := ((1/my) Σ_{i=1}^{my} ỹi^λ)^{1/λ}, λ < 0.

Consequently, for each fixed p > 0 we also have

ỹmin(a, x)^p = lim_{λ→−∞} ((1/my) Σ_{i=1}^{my} ỹi(a, x)^λ)^{p/λ}.
Assuming that the expectation E(M^[λ](ỹ(a(ω), x)))^p exists for an exponent λ = λ0 < 0, by means of Lebesgue’s bounded convergence theorem we get the moment representation

E ỹmin(a(ω), x)^p = lim_{λ→−∞} E((1/my) Σ_{i=1}^{my} ỹi(a(ω), x)^λ)^{p/λ}.

Since t → t^{p/λ}, t > 0, is convex for each fixed p > 0 and λ < 0, by Jensen’s inequality we have the lower moment bound

E ỹmin(a(ω), x)^p ≥ lim_{λ→−∞} ((1/my) Σ_{i=1}^{my} E ỹi(a(ω), x)^λ)^{p/λ}.

Hence, for the pth order moment of ỹmin(a(·), x) we get the approximations

E((1/my) Σ_{i=1}^{my} ỹi(a(ω), x)^λ)^{p/λ} ≥ ((1/my) Σ_{i=1}^{my} E ỹi(a(ω), x)^λ)^{p/λ}

for some λ < 0.

Using regression techniques, Response Surface Methods (RSM), etc., for given vector x, the function a → ymin(a, x) can be approximated [14, 23, 55, 63, 136] by functions ỹ = ỹ(a, x) being sufficiently smooth with respect to the parameter vector a (Figs. 2.2 and 2.3).
In many important cases, for each i = 1, . . . , my, the state functions

(a, x) → yi(a, x)

are bilinear functions. Thus, in this case ymin = ymin(a, x) is a piecewise linear function with respect to a. Fitting a linear or quadratic Response Surface Model [17, 18, 105, 106]

ỹ(a, x) := c(x) + q(x)ᵀ (a − ā) + (a − ā)ᵀ Q(x)(a − ā)    (2.30f)

to a → ymin(a, x), after the selection of appropriate reference points

a^(j) := ā + da^(j), j = 1, . . . , p,    (2.30g)

Fig. 2.2. Approximate smooth failure domain [y(a, x) ≤ 0] for given x
Fig. 2.3. Approximation ỹ(a, x) of ymin(a, x) for given x

with “design” points da^(j) ∈ IRν, j = 1, . . . , p, the unknown coefficients c = c(x), q = q(x) and Q = Q(x) are obtained by minimizing the mean square error

ρ(c, q, Q) := Σ_{j=1}^{p} (ỹ(a^(j), x) − ymin(a^(j), x))²    (2.30h)

with respect to (c, q, Q). Since the model (2.30f) depends linearly on the function parameters (c, q, Q), explicit formulas for the optimal coefficients

c∗ = c∗(x), q∗ = q∗(x), Q∗ = Q∗(x),    (2.30i)

are obtained from this least squares estimation method, see Chap. 5.
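A one-dimensional instance of the fit (2.30f)–(2.30i) can be written out directly: with a single random parameter a, the model ỹ(a) = c + q(a − ā) + Q(a − ā)² is linear in (c, q, Q), so the normal equations of the mean square error (2.30h) are a small linear system. The piecewise linear target below (minimum of two invented linear state functions) and the design points are illustrative.

```python
def fit_quadratic(abar, deltas, f):
    """Least squares fit of y~(a) = c + q*(a - abar) + Q*(a - abar)^2, a one-
    dimensional instance of (2.30f); deltas play the role of the da^(j)."""
    rows = [(1.0, d, d * d) for d in deltas]
    ys = [f(abar + d) for d in deltas]
    # Normal equations A^T A beta = A^T y of the mean square error (2.30h)
    AtA = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    Aty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
    M = [AtA[i] + [Aty[i]] for i in range(3)]
    for i in range(3):                     # Gaussian elimination with pivoting
        p = max(range(i, 3), key=lambda k: abs(M[k][i]))
        M[i], M[p] = M[p], M[i]
        for k in range(i + 1, 3):
            t = M[k][i] / M[i][i]
            M[k] = [mk - t * mi for mk, mi in zip(M[k], M[i])]
    beta = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):                    # back substitution
        s = sum(M[i][j] * beta[j] for j in range(i + 1, 3))
        beta[i] = (M[i][3] - s) / M[i][i]
    return beta                            # optimal (c*, q*, Q*), cf. (2.30i)

# ymin is piecewise linear in a: minimum of two invented linear state functions
ymin = lambda a: min(2.0 - a, 0.5 + a)
c, q, Q = fit_quadratic(0.75, [-0.5, -0.25, 0.0, 0.25, 0.5], ymin)
```

Because the design points are symmetric about ā and the target is symmetric here, the fitted linear coefficient q vanishes and the quadratic term Q < 0 reproduces the concave "roof" shape of the minimum, which is exactly the smoothing effect sketched in Fig. 2.3.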

(b) Approximation of expected loss functions. Corresponding to the approximation (2.30f) of ymin = ymin(a, x), using again least squares techniques, a mean value function Γ(x) = Eγ(y(a(ω), x)), cf. (2.12a), can be approximated at a given point x0 ∈ IRr by a linear or quadratic Response Surface Function

Γ̂(x) := β0 + βIᵀ (x − x0) + (x − x0)ᵀ B(x − x0),    (2.30j)

with scalar, vector and matrix parameters β0, βI, B. In this case estimates y^(i) = Γ̂^(i) of Γ(x^(i)) are needed at some reference points x^(i) = x0 + d^(i), i = 1, . . . , p. Details are given in Chap. 5.

2.3.3 Taylor Expansion Methods

As can be seen above, cf. (2.21a), (2.21b), in the objective and/or in the constraints of substitute problems for optimization problems with random data, mean value functions of the type

Γ(x) := Eg(a(ω), x)

occur. Here, g = g(a, x) is a real valued function on a subset of IRν × IRr, and a = a(ω) is a random ν-vector.
(a) Expansion with respect to a: Suppose that on its domain the function g = g(a, x) has partial derivatives ∇_a^l g(a, x), l = 0, 1, . . . , l_g + 1, up to order l_g + 1. Note that the gradient ∇_a g(a, x) contains the so-called sensitivities ∂g/∂a_j (a, x), j = 1, . . . , ν, of g with respect to the parameter vector a at (a, x). In the same way, the higher order partial derivatives ∇_a^l g(a, x), l > 1, represent the higher order sensitivities of g with respect to a at (a, x).
Taylor expansion of g = g(a, x) with respect to a at ā := Ea(ω) yields

g(a, x) = ∑_{l=0}^{l_g} (1/l!) ∇_a^l g(ā, x) · (a − ā)^l + (1/(l_g + 1)!) ∇_a^{l_g+1} g(â, x) · (a − ā)^{l_g+1},   (2.31a)

where â := ā + ϑ(a − ā), 0 < ϑ < 1, and (a − ā)^l denotes the system of lth order products

∏_{j=1}^ν (a_j − ā_j)^{l_j}

with l_j ∈ IN ∪ {0}, j = 1, . . . , ν, l_1 + l_2 + . . . + l_ν = l. If g = g(a, x) is defined by

g(a, x) := γ( y(a, x) ),

see (2.21a), then the partial derivatives ∇_a^l g of g up to second order read:

∇_a g(a, x) = ∇_a y(a, x)^T ∇γ( y(a, x) )   (2.31b)

∇_a² g(a, x) = ∇_a y(a, x)^T ∇²γ( y(a, x) ) ∇_a y(a, x) + ∇γ( y(a, x) ) · ∇_a² y(a, x),   (2.31c)

where

(∇γ) · ∇_a² y := ( (∇γ)^T ∂²y/(∂a_k ∂a_l) )_{k,l=1,...,ν}.   (2.31d)

Taking expectations in (2.31a), Γ(x) can be approximated, cf. Sect. 2.3.1, by

Γ̃(x) := g(ā, x) + ∑_{l=2}^{l_g} (1/l!) ∇_a^l g(ā, x) · E( a(ω) − ā )^l,   (2.32a)

where E( a(ω) − ā )^l denotes the system of mixed lth order central moments of the random vector a(ω) = ( a_1(ω), . . . , a_ν(ω) )^T; note that the first order term vanishes since E( a(ω) − ā ) = 0. Assuming that the domain of g = g(a, x) is convex with respect to a, we get the error estimate

|Γ(x) − Γ̃(x)| ≤ (1/(l_g + 1)!) E( sup_{0≤ϑ≤1} ‖∇_a^{l_g+1} g( ā + ϑ(a(ω) − ā), x )‖ · ‖a(ω) − ā‖^{l_g+1} ).   (2.32b)

In many practical cases the random parameter ν-vector a = a(ω) has a convex, bounded support, and ∇_a^{l_g+1} g is continuous. Then the L∞-norm

r(x) := (1/(l_g + 1)!) ess sup ‖∇_a^{l_g+1} g( a(ω), x )‖   (2.32c)

is finite for all x under consideration, and (2.32b), (2.32c) yield the error bound

|Γ(x) − Γ̃(x)| ≤ r(x) E‖a(ω) − ā‖^{l_g+1}.   (2.32d)

Remark 2.1. The above described method can be extended to the case of vector valued loss functions γ(z) = ( γ_1(z), . . . , γ_{m_γ}(z) )^T.
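For l_g = 2 the approximation (2.32a) uses only the mean ā and the covariance matrix of a(ω): Γ̃(x) = g(ā, x) + (1/2) ∑_{j,k} ∂²g/(∂a_j ∂a_k)(ā, x) · cov(a_j, a_k). A small numerical sketch; the function g and the distribution of a(ω) below are invented for illustration:

```python
import numpy as np

def taylor_mean(g, hess_g, a_bar, cov):
    """Second order approximation (l_g = 2) of E g(a(w)), cf. (2.32a):
    g(a_bar) + (1/2) sum_jk d^2 g/da_j da_k (a_bar) * Cov(a_j, a_k)."""
    return g(a_bar) + 0.5 * np.sum(hess_g(a_bar) * cov)

# illustration: g(a) = exp(a_1) + a_1 * a_2, a ~ N(a_bar, cov)
g = lambda a: np.exp(a[0]) + a[0] * a[1]
hess_g = lambda a: np.array([[np.exp(a[0]), 1.0], [1.0, 0.0]])
a_bar = np.array([0.0, 1.0])
cov = np.array([[0.04, 0.01], [0.01, 0.09]])

gamma_approx = taylor_mean(g, hess_g, a_bar, cov)

# Monte Carlo check of the true mean value function
rng = np.random.default_rng(0)
s = rng.multivariate_normal(a_bar, cov, size=200_000)
gamma_mc = np.mean(np.exp(s[:, 0]) + s[:, 0] * s[:, 1])
```

Here the approximation agrees with the simulated mean to within the size of the neglected higher order moments, as (2.32d) predicts for small parameter variations.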

(b) Inner expansions: Suppose that Γ(x) is defined by (2.21a), hence,

Γ(x) = Eγ( y(a(ω), x) ).

Linearizing the vector function y = y(a, x) with respect to a at ā, thus,

y(a, x) ≈ y^(1)(a, x) := y(ā, x) + ∇_a y(ā, x)(a − ā),   (2.33a)

the mean value function Γ(x) is approximated by

Γ̃(x) := Eγ( y(ā, x) + ∇_a y(ā, x)( a(ω) − ā ) ).   (2.33b)

Corresponding to (2.24), (2.26e), define

ȳ^(1)(x) := Ey^(1)( a(ω), x ) = y(ā, x)   (2.33c)

Q(x) := cov( y^(1)(a(·), x) ) = ∇_a y(ā, x) cov( a(·) ) ∇_a y(ā, x)^T.   (2.33d)

In case of convex loss functions γ, approximations of Γ̃ and of the corresponding substitute problems based on Γ̃ may now be obtained by applying the methods described in Sect. 2.3.1. Explicit representations for Γ̃ are obtained in case of quadratic loss functions γ.
Error estimates can be derived easily for Lipschitz-continuous or convex loss functions γ. In case of a Lipschitz-continuous loss function γ with Lipschitz constant L > 0, e.g., for sublinear [78, 81] loss functions, using (2.33c) we have

|Γ(x) − Γ̃(x)| ≤ L · E‖y( a(ω), x ) − y^(1)( a(ω), x )‖.   (2.33e)

Applying the mean value theorem [29], under appropriate second order differentiability assumptions, for the right hand side of (2.33e) we find the following stochastic version of the mean value theorem:

E‖y( a(ω), x ) − y^(1)( a(ω), x )‖ ≤ E( ‖a(ω) − ā‖² sup_{0≤ϑ≤1} ‖∇_a² y( ā + ϑ(a(ω) − ā), x )‖ ).   (2.33f)
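The quantities (2.33c), (2.33d) require only the Jacobian ∇_a y(ā, x) and the covariance matrix of a(ω). A minimal sketch of the inner linearization; the state function y : IR³ → IR² below stands in for a concrete design x and is invented for illustration:

```python
import numpy as np

# Inner linearization (2.33a): y(a, x) ~ y(a_bar, x) + J (a - a_bar),
# with J = grad_a y(a_bar, x); moments per (2.33c), (2.33d).
def y(a):
    return np.array([a[0] * a[1] - 2.0, np.sin(a[2]) + a[0]])

def jac_y(a):
    return np.array([[a[1], a[0], 0.0],
                     [1.0, 0.0, np.cos(a[2])]])

a_bar = np.array([1.0, 2.0, 0.0])
cov_a = np.diag([0.01, 0.04, 0.01])

J = jac_y(a_bar)
y1_mean = y(a_bar)        # (2.33c): E y^(1)(a(w), x) = y(a_bar, x)
Q = J @ cov_a @ J.T       # (2.33d): covariance of the linearized state vector

# e.g., for the quadratic loss gamma(y) = ||y||^2 one gets explicitly
gamma_tilde = float(y1_mean @ y1_mean + np.trace(Q))
```

The last line illustrates the remark that Γ̃ has an explicit representation for quadratic loss functions: E‖y^(1)‖² = ‖ȳ^(1)(x)‖² + tr Q(x).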

2.4 Applications to Problems in Quality Engineering

Since the deterministic substitute problems arising in quality engineering are of the same type as in optimal design, the approximation techniques described in Sect. 2.3 can be applied directly to the robust design problem (2.20a)–(2.20c). This is shown in the following for the approximation method described in Sect. 2.3.1.
Hence, the approximation techniques developed in Sect. 2.3.1 are applied now to the objective function

Γ(x) = Eγ( y(a(ω), x) )

of the basic robust design problem (2.20a)–(2.20c). If γ = γ(y) is an arbitrary continuously differentiable convex loss function on IR¹, then we find the following approximations:

(a) Bounded first order derivative
In case that γ has a bounded derivative on the convex support of the quality function, i.e., if

|γ′( y(a(ω), x) )| ≤ ϑ_max < +∞ a.s., x ∈ D,   (2.34a)

then

Γ(x) ≤ γ( ȳ(x) ) + ϑ_max √q(x), x ∈ D,   (2.34b)

where

ȳ(x) := Ey( a(ω), x ),   (2.34c)

q(x) := V( y(a(·), x) ) = E( y(a(ω), x) − ȳ(x) )²   (2.34d)

denote the mean and variance of the quality function y = y( a(ω), x ).

(b) Bounded second order derivative
If the second order derivative γ″ of γ is bounded from above and below, i.e., if

0 < λ_min ≤ γ″( y(a(ω), x) ) ≤ λ_max < +∞ a.s., x ∈ D,   (2.35a)

then

γ( ȳ(x) ) + (λ_min/2) q(x) ≤ Γ(x) ≤ γ( ȳ(x) ) + (λ_max/2) q(x), x ∈ D.   (2.35b)

Corresponding to (2.28a)–(2.28c), (2.29a)–(2.29c), we then have the following "dual" robust optimal design problems:

(1) Expected primary cost minimization under approximate expected failure or recourse cost constraints

min EG0( a(ω), x )   (2.36a)

s.t.

γ( ȳ(x) ) + c0 q(x) ≤ Γ^max   (2.36b)
x ∈ D,   (2.36c)

where Γ^max denotes an upper loss bound.

(2) Approximate expected failure cost minimization under expected primary cost constraints

min γ( ȳ(x) ) + c0 q(x)   (2.37a)

s.t.

EG0( a(ω), x ) ≤ G^max   (2.37b)
x ∈ D,   (2.37c)

where c0 is again a certain scale factor, see (2.34a), (2.34b), (2.35a), (2.35b), and G0 = G0( a(ω), x ) represents the production costs having an upper bound G^max.
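For the nominal-is-best loss γ(y) = (y − y*)² one has γ″ ≡ 2, so λ_min = λ_max = 2 and the two sides of (2.35b) coincide: Γ(x) = γ(ȳ(x)) + q(x) exactly. This can be checked numerically; the distribution of the quality value below is invented for illustration:

```python
import numpy as np

# gamma(y) = (y - y_star)**2, so (2.35b) collapses to
# Gamma(x) = (y_bar(x) - y_star)**2 + q(x), q(x) the variance of y(a(w), x).
rng = np.random.default_rng(1)
y_star = 5.0
y_samples = rng.normal(loc=5.3, scale=0.2, size=400_000)  # hypothetical quality values

gamma_mc = float(np.mean((y_samples - y_star) ** 2))      # Gamma(x) by Monte Carlo
y_bar = float(y_samples.mean())
q = float(y_samples.var())
bound = (y_bar - y_star) ** 2 + q                         # both sides of (2.35b)
```

The agreement here is an algebraic identity (bias-variance decomposition of the quadratic loss), which makes this loss a useful test case for implementations of the bounds.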

2.5 Approximation of Probabilities: Probability Inequalities

In reliability analysis of engineering/economic structures or systems, a main problem is the computation of probabilities

P( ∪_{i=1}^N V_i ) := P( a(ω) ∈ ∪_{i=1}^N V_i )   (2.38a)

or

P( ∩_{j=1}^N S_j ) := P( a(ω) ∈ ∩_{j=1}^N S_j )   (2.38b)

of unions and intersections of certain failure/survival domains (events) V_j, S_j, j = 1, . . . , N. These domains (events) arise from the representation of the structure or system by a combination of certain series and/or parallel substructures/systems. Due to the high complexity of the basic physical relations, several approximation techniques are needed for the evaluation of (2.38a), (2.38b).

2.5.1 Bonferroni-Type Inequalities

In the following V_1, V_2, . . . , V_N denote arbitrary (Borel-)measurable subsets of the parameter space IR^ν, and the abbreviation

P(V) := P( a(ω) ∈ V )   (2.38c)

is used for any measurable subset V of IR^ν.
Starting from the representation of the probability of a union of N events,

P( ∪_{j=1}^N V_j ) = ∑_{k=1}^N (−1)^{k−1} s_{k,N},   (2.39a)

where

s_{k,N} := ∑_{1≤i_1<i_2<...<i_k≤N} P( ∩_{l=1}^k V_{i_l} ),   (2.39b)

we obtain [45] the well known basic Bonferroni bounds

P( ∪_{j=1}^N V_j ) ≤ ∑_{k=1}^ρ (−1)^{k−1} s_{k,N} for ρ ≥ 1, ρ odd,   (2.39c)

P( ∪_{j=1}^N V_j ) ≥ ∑_{k=1}^ρ (−1)^{k−1} s_{k,N} for ρ ≥ 1, ρ even.   (2.39d)

Besides (2.39c), (2.39d), a large number of related bounds of different complexity are available, cf. [45, 153, 155]. Important bounds of first and second degree are given below:
max_{1≤j≤N} q_j ≤ P( ∪_{j=1}^N V_j ) ≤ Q_1   (2.40a)

Q_1 − Q_2 ≤ P( ∪_{j=1}^N V_j ) ≤ Q_1 − max_{1≤l≤N} ∑_{i≠l} q_{il}   (2.40b)

Q_1² / (Q_1 + 2Q_2) ≤ P( ∪_{j=1}^N V_j ) ≤ Q_1.   (2.40c)

The above quantities q_j, q_{ij}, Q_1, Q_2 are defined as follows:

Q_1 := ∑_{j=1}^N q_j with q_j := P(V_j)   (2.40d)

Q_2 := ∑_{j=2}^N ∑_{i=1}^{j−1} q_{ij} with q_{ij} := P(V_i ∩ V_j).   (2.40e)

Moreover, defining

q := (q_1, . . . , q_N)^T, Q := (q_{ij})_{1≤i,j≤N},   (2.40f)

we have

P( ∪_{j=1}^N V_j ) ≥ q^T Q⁻ q,   (2.40g)

where Q⁻ denotes the generalized inverse of Q, cf. [155].
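The bounds (2.40a)–(2.40e) need only the single and pairwise probabilities q_j, q_ij. A Monte Carlo sketch with N = 3 invented failure events V_j = {a_j > 1} for a ~ N(0, I) in IR³:

```python
import numpy as np

rng = np.random.default_rng(2)
a = rng.standard_normal((1_000_000, 3))
V = a > 1.0                                    # indicators of the events V_j

q = V.mean(axis=0)                             # q_j = P(V_j), cf. (2.40d)
Q1 = float(q.sum())
Q2 = float(sum((V[:, i] & V[:, j]).mean()      # (2.40e)
               for j in range(3) for i in range(j)))

p_union = float(V.any(axis=1).mean())          # reference value P(V_1 u V_2 u V_3)

lower_1 = float(q.max())                       # (2.40a), lower bound
lower_2 = Q1 - Q2                              # (2.40b), lower part
lower_3 = Q1 ** 2 / (Q1 + 2.0 * Q2)            # (2.40c), lower bound
upper = Q1                                     # (2.40a)/(2.40c), upper bound
```

For these (independent) events the second degree lower bound Q_1 − Q_2 is already quite sharp, while the first degree bounds only bracket the union probability loosely.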

2.5.2 Tschebyscheff-Type Inequalities

In many cases the survival or feasible domain (event) S = ∩_{i=1}^m S_i is represented by a certain number m of inequality constraints of the type

y_{li} < (≤) y_i(a, x) < (≤) y_{ui}, i = 1, . . . , m,   (2.41a)

as, e.g., operating conditions, behavioral constraints. Hence, for a fixed input, design or control vector x, the event S = S(x) is given by

S := { a ∈ IR^ν : y_{li} < (≤) y_i(a, x) < (≤) y_{ui}, i = 1, . . . , m }.   (2.41b)

Here,

y_i = y_i(a, x), i = 1, . . . , m,   (2.41c)

are certain functions, e.g., response, output or performance functions of the structure or system, defined on (a subset of) IR^ν × IR^r. Moreover, y_{li} < y_{ui}, i = 1, . . . , m, are lower and upper bounds for the variables y_i, i = 1, . . . , m. In case of one-sided constraints some bounds y_{li}, y_{ui} are infinite.

Two-Sided Constraints

If y_{li} < y_{ui}, i = 1, . . . , m, are finite bounds, (2.41a) can be represented by

|y_i(a, x) − y_{ic}| < (≤) ρ_i, i = 1, . . . , m,   (2.41d)

where the quantities y_{ic}, ρ_i, i = 1, . . . , m, are defined by

y_{ic} := (y_{li} + y_{ui})/2, ρ_i := (y_{ui} − y_{li})/2.   (2.41e)

Consequently, for the probability P(S) of the event S, defined by (2.41b), we have

P(S) = P( |y_i( a(ω), x ) − y_{ic}| < (≤) ρ_i, i = 1, . . . , m ).   (2.41f)

Introducing the random variables

ỹ_i( a(ω), x ) := ( y_i( a(ω), x ) − y_{ic} ) / ρ_i, i = 1, . . . , m,   (2.42a)

and the set

B := { y ∈ IR^m : |y_i| < (≤) 1, i = 1, . . . , m },   (2.42b)

with ỹ = (ỹ_i)_{1≤i≤m}, we get

P(S) = P( ỹ( a(ω), x ) ∈ B ).   (2.42c)

Considering any (measurable) function ϕ : IR^m → IR such that

(i) ϕ(y) ≥ 0, y ∈ IR^m,   (2.42d)
(ii) ϕ(y) ≥ ϕ_0 > 0, if y ∉ B,   (2.42e)

with a positive constant ϕ_0, we find the following result:

Theorem 2.1. For any (measurable) function ϕ : IR^m → IR fulfilling conditions (2.42d), (2.42e), the following Tschebyscheff-type inequality holds:

P( y_{li} < (≤) y_i( a(ω), x ) < (≤) y_{ui}, i = 1, . . . , m ) ≥ 1 − (1/ϕ_0) Eϕ( ỹ( a(ω), x ) ),   (2.43)

provided that the expectation in (2.43) exists and is finite.

Proof. If P_{ỹ(a(·),x)} denotes the probability distribution of the random m-vector ỹ = ỹ( a(ω), x ), then

Eϕ( ỹ( a(ω), x ) ) = ∫_{y∈B} ϕ(y) P_{ỹ(a(·),x)}(dy) + ∫_{y∈B^c} ϕ(y) P_{ỹ(a(·),x)}(dy)
≥ ∫_{y∈B^c} ϕ(y) P_{ỹ(a(·),x)}(dy) ≥ ϕ_0 ∫_{y∈B^c} P_{ỹ(a(·),x)}(dy)
= ϕ_0 P( ỹ( a(ω), x ) ∉ B ) = ϕ_0 ( 1 − P( ỹ( a(ω), x ) ∈ B ) ),

which yields the assertion, cf. (2.42c).

Remark 2.2. Note that P(S) ≥ α_s with a given minimum reliability α_s ∈ (0, 1] can be guaranteed by the expected cost constraint

Eϕ( ỹ( a(ω), x ) ) ≤ (1 − α_s) ϕ_0.

Example 2.5. If ϕ = 1_{B^c} is the indicator function of the complement B^c of B, then ϕ_0 = 1 and (2.43) holds with equality.

Example 2.6. For a given positive definite m × m matrix C, define ϕ(y) := y^T C y, y ∈ IR^m. Then, cf. (2.42b), (2.42d), (2.42e),

min_{y∉B} ϕ(y) = min_{1≤i≤m} min( min_{y_i≥1} y^T C y, min_{y_i≤−1} y^T C y ).   (2.44a)

Thus, the lower bound ϕ_0 follows by solving the convex optimization problems arising on the right hand side of (2.44a). Moreover, the expectation Eϕ(ỹ) needed in (2.43) is given, see (2.42a), by

Eϕ(ỹ) = E ỹ^T C ỹ = E tr( C ỹ ỹ^T )
= tr( C (diag ρ)^{−1} ( cov( y(a(·), x) ) + ( ȳ(x) − y_c )( ȳ(x) − y_c )^T ) (diag ρ)^{−1} ),   (2.44b)

where "tr" denotes the trace of a matrix, diag ρ is the diagonal matrix diag ρ := (ρ_i δ_{ij}), y_c := (y_{ic}), see (2.41e), and ȳ = ȳ(x) := ( Ey_i( a(ω), x ) ). Since ‖Q‖ ≤ tr Q ≤ m‖Q‖ for any positive definite m × m matrix Q, an upper bound of Eϕ(ỹ) reads (Fig. 2.4)

Eϕ(ỹ) ≤ m ‖C‖ ‖(diag ρ)^{−1}‖² ‖ cov( y(a(·), x) ) + ( ȳ(x) − y_c )( ȳ(x) − y_c )^T ‖.   (2.44c)

Example 2.7. Assuming in the above Example 2.6 that C = diag(c_{ii}) is a diagonal matrix with positive elements c_{ii} > 0, i = 1, . . . , m, then

min_{y∉B} ϕ(y) = min_{1≤i≤m} c_{ii} > 0,   (2.44d)

Fig. 2.4. Function ϕ = ϕ(y)

and Eϕ(ỹ) is given by

Eϕ(ỹ) = ∑_{i=1}^m c_{ii} E( y_i( a(ω), x ) − y_{ic} )² / ρ_i²
= ∑_{i=1}^m c_{ii} ( σ²_{y_i}(a(·), x) + ( ȳ_i(x) − y_{ic} )² ) / ρ_i².   (2.44e)
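With c_ii = 1 for all i, one has ϕ_0 = 1, and (2.43) together with (2.44e) gives the classical multivariate Tschebyscheff bound P(S) ≥ 1 − ∑_i E(ỹ_i)². A Monte Carlo sketch; the performance functions y_i below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
a = rng.standard_normal((400_000, 2))

# invented performance functions y_i(a, x) for a fixed design x
y = np.stack([2.0 + 0.3 * a[:, 0],
              1.0 + 0.2 * a[:, 0] - 0.2 * a[:, 1]], axis=1)

y_c = np.array([2.0, 1.0])     # interval centres, cf. (2.41e)
rho = np.array([1.0, 1.0])     # interval half-widths

y_tilde = (y - y_c) / rho                                    # (2.42a)
bound = 1.0 - float(np.sum(np.mean(y_tilde ** 2, axis=0)))   # (2.43) with (2.44e)
p_S = float(np.mean(np.all(np.abs(y_tilde) < 1.0, axis=1)))  # reference P(S)
```

As is typical for Tschebyscheff-type inequalities, the bound is valid for any distribution with the given second moments, but may be quite conservative compared with the actual probability.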

One-Sided Inequalities

Suppose that exactly one of the two bounds y_{li}, y_{ui} is infinite for each i = 1, . . . , m. Multiplying the corresponding constraints in (2.41a) by −1, the admissible domain S = S(x), cf. (2.41b), can always be represented by

S(x) = { a ∈ IR^ν : ỹ_i(a, x) < (≤) 0, i = 1, . . . , m },   (2.45a)

where ỹ_i := y_i − y_{ui}, if y_{li} = −∞, and ỹ_i := y_{li} − y_i, if y_{ui} = +∞. If we set ỹ(a, x) := ( ỹ_i(a, x) ) and

B̃ := { y ∈ IR^m : y_i < (≤) 0, i = 1, . . . , m },   (2.45b)

then

P( S(x) ) = P( ỹ( a(ω), x ) ∈ B̃ ).   (2.45c)

Consider also in this case, cf. (2.42d), (2.42e), a function ϕ : IR^m → IR such that

(i) ϕ(y) ≥ 0, y ∈ IR^m,   (2.46a)
(ii) ϕ(y) ≥ ϕ_0 > 0, if y ∉ B̃.   (2.46b)

Then, corresponding to Theorem 2.1, we have this result:

Theorem 2.2 (Markov-type inequality). If ϕ : IR^m → IR is any (measurable) function fulfilling conditions (2.46a), (2.46b), then

P( ỹ( a(ω), x ) < (≤) 0 ) ≥ 1 − (1/ϕ_0) Eϕ( ỹ( a(ω), x ) ),   (2.47)

provided that the expectation in (2.47) exists and is finite.
Remark 2.3. Note that a related inequality was used already in Example 2.2.
Example 2.8. If ϕ(y) := ∑_{i=1}^m w_i e^{α_i y_i} with positive constants w_i, α_i, i = 1, . . . , m, then

inf_{y∉B̃} ϕ(y) = min_{1≤i≤m} w_i > 0   (2.48a)

and

Eϕ( ỹ( a(ω), x ) ) = ∑_{i=1}^m w_i E e^{α_i ỹ_i(a(ω),x)},   (2.48b)

where the expectation in (2.48b) can be computed approximately by Taylor expansion, with ỹ̄_i(x) := Eỹ_i( a(ω), x ):

E e^{α_i ỹ_i} = e^{α_i ỹ̄_i(x)} E e^{α_i (ỹ_i − ỹ̄_i(x))}
≈ e^{α_i ỹ̄_i(x)} ( 1 + (α_i²/2) E( ỹ_i( a(ω), x ) − ỹ̄_i(x) )² )
= e^{α_i ỹ̄_i(x)} ( 1 + (α_i²/2) σ²_{ỹ_i(a(·),x)} ).   (2.48c)

Supposing that ỹ_i = ỹ_i( a(ω), x ) is a normally distributed random variable, then

E e^{α_i ỹ_i} = e^{α_i ỹ̄_i(x)} e^{(1/2) α_i² σ²_{ỹ_i(a(·),x)}}.   (2.48d)
Example 2.9. Consider ϕ(y) := (y − b)^T C (y − b), where, cf. Example 2.6, C is a positive definite m × m matrix and b < 0 a fixed m-vector. In this case we again have

min_{y∉B̃} ϕ(y) = min_{1≤i≤m} min_{y_i≥0} ϕ(y)

and

Eϕ(ỹ) = E( ỹ( a(ω), x ) − b )^T C ( ỹ( a(ω), x ) − b ) = tr( C E( ỹ( a(ω), x ) − b )( ỹ( a(ω), x ) − b )^T ).

Note that

ỹ_i(a, x) − b_i = y_i(a, x) − (y_{ui} + b_i), if y_{li} = −∞,
ỹ_i(a, x) − b_i = y_{li} − b_i − y_i(a, x), if y_{ui} = +∞,

where y_{ui} + b_i < y_{ui} and y_{li} < y_{li} − b_i.

Remark 2.4. The one-sided case can also be reduced approximately to the two-sided case by selecting a sufficiently large, but finite, upper bound ỹ_{ui} ∈ IR (lower bound ỹ_{li} ∈ IR, resp.) if y_{ui} = +∞ (y_{li} = −∞). For multivariate Tschebyscheff inequalities see also [75].

2.5.3 First Order Reliability Methods (FORM)

Special approximation methods for probabilities were developed in structural reliability analysis [19, 21, 54, 104] for the probability of failure p_f = p_f(x) given by (2.9c) with B := (0, +∞)^{m_y}. Hence, see (2.2b),

p_f = 1 − P( y( a(ω), x ) > 0 ).

Applying certain probability inequalities, see Sects. 2.5.1 and 2.5.2, the case with m (> 1) limit state functions is first reduced to the case with one limit state function only. Approximations of p_f are then obtained by linearization or by quadratic approximation of a transformed state function ỹ_i = ỹ_i(ã, x) at a certain "design point" ã* = ã*_i(x) for each i = 1, . . . , m.
Based on the above procedure, according to (2.9f) we have

p_f(x) = P( y_i( a(ω), x ) ≤ 0 for at least one index i, 1 ≤ i ≤ m_y )
≤ ∑_{i=1}^{m_y} P( y_i( a(ω), x ) ≤ 0 ) = ∑_{i=1}^{m_y} p_{fi}(x),   (2.49a)

where

p_{fi}(x) := P( y_i( a(ω), x ) ≤ 0 ).   (2.49b)

Assume now that the random ν-vector a = a(ω) can be represented [104] by

a(ω) = U( ã(ω) ),   (2.50a)

where U : IR^ν → IR^ν is a certain one-to-one transformation in IR^ν, and ã = ã(ω) is an N(0, I)-normally distributed random ν-vector, i.e., with mean vector zero and covariance matrix equal to the identity matrix I. Assuming that the ν-vector a = a(ω) has a continuous probability distribution and considering the conditional distribution functions

H_i = H_i( a_i | a_1, . . . , a_{i−1} )

of a_i = a_i(ω), given a_j(ω) = a_j, j = 1, . . . , i − 1, the random ν-vector z(ω) := T( a(ω) ) defined by the transformation

z_1 := H_1(a_1)
z_2 := H_2(a_2 | a_1)
. . .
z_ν := H_ν(a_ν | a_1, . . . , a_{ν−1}),

cf. [128], has stochastically independent components z_1 = z_1(ω), . . . , z_ν = z_ν(ω), and z = z(ω) is uniformly distributed on the ν-dimensional hypercube [0, 1]^ν. Thus, if Φ = Φ(t) denotes the distribution function of the univariate N(0, 1)-normal distribution, then the inverse ã = U^{−1}(a) of the transformation U can be represented by

ã := ( Φ^{−1}(z_1), . . . , Φ^{−1}(z_ν) )^T, z = T(a).
In the following, let x be a given, fixed r-vector. Defining then

ỹ_i(ã, x) := y_i( U(ã), x ),   (2.50b)

we get

p_{fi}(x) = P( ỹ_i( ã(ω), x ) ≤ 0 ), i = 1, . . . , m_y.   (2.50c)

In the First Order Reliability Method (FORM), cf. [54], the state function ỹ_i = ỹ_i(ã, x) is linearized at a so-called design point ã* = ã*^(i). A design point ã* = ã*^(i) is obtained by projecting the origin ã = 0 onto the limit state surface ỹ_i(ã, x) = 0, see Fig. 2.5. Since ỹ_i(ã*, x) = 0, we get
Fig. 2.5. FORM


ỹ_i(ã, x) ≈ ỹ_i(ã*, x) + ∇_ã ỹ_i(ã*, x)^T (ã − ã*) = ∇_ã ỹ_i(ã*, x)^T (ã − ã*)   (2.50d)

and therefore

p_{fi}(x) ≈ P( ∇_ã ỹ_i(ã*, x)^T ( ã(ω) − ã* ) ≤ 0 ) = Φ( −β_i(x) ).   (2.50e)

Here, the reliability index β_i = β_i(x) is given by

β_i(x) := − ∇_ã ỹ_i(ã*, x)^T ã* / ‖∇_ã ỹ_i(ã*, x)‖.   (2.50f)

Note that β_i(x) is the minimum distance from ã = 0 to the limit state surface ỹ_i(ã, x) = 0 in the ã-domain.
For more details and Second Order Reliability Methods (SORM) based on second order Taylor expansions, see Chap. 7 and [54, 104].
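For a linear state function ỹ(ã) = α^T ã + b in the standard normal ã-space, the linearization (2.50d) is exact, the design point is the orthogonal projection of the origin onto the hyperplane {ỹ = 0}, and (2.50e), (2.50f) give p_f exactly. A minimal sketch; the vector α and the offset b are invented for illustration:

```python
import numpy as np
from math import erf, sqrt

Phi = lambda t: 0.5 * (1.0 + erf(t / sqrt(2.0)))   # N(0,1) distribution function

# invented linear limit state y(a~) = alpha^T a~ + b; y > 0 is the
# safe region, which contains the origin since b > 0
alpha = np.array([1.0, 2.0, -2.0])
b = 9.0

# design point: projection of the origin onto the hyperplane {y = 0}
a_star = -b * alpha / float(alpha @ alpha)
grad = alpha                                       # grad y is constant here
beta = float(-(grad @ a_star) / np.linalg.norm(grad))  # reliability index (2.50f)
p_f = Phi(-beta)                                   # (2.50e), exact in the linear case
```

By construction ‖ã*‖ = β, reflecting the geometric interpretation of β as the minimum distance from the origin to the limit state surface.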
2 Deterministic Substitute Problems in Optimal Decision Under Stochastic Uncertainty

2.1 Optimum Design Problems with Random


Parameters
In the optimum design of technical or economic structures/systems two basic classes of criteria appear:
First there is a primary cost function

G0 = G0 (a, x). (2.1a)

Important examples are the total weight or volume of a mechanical structure,


the costs of construction, design of a certain technical or economic structure/
system, or the negative utility or reward in a general decision situation.
For the representation of the structural/system safety or failure, for the
representation of the admissibility of the state, or for the formulation of the
basic operating conditions of the underlying structure/system certain state
functions
yi = yi (a, x), i = 1, . . . , my , (2.1b)
are chosen. In structural design these functions are also called “limit
state functions” or “safety margins.” Frequent examples are some displace-
ment, stress, load (force and moment) components in structural design, or
production and cost functions in production planning problems, optimal mix
problems, transportation problems, allocation problems and other problems
of economic design.
In (2.1a), (2.1b), the design or input vector x denotes the r-vector of de-
sign or input variables, x1 , x2 , . . . , xr , as, e.g., structural dimensions, sizing
variables, such as cross-sectional areas, thickness in structural design, or fac-
tors of production, actions in economic design. For the design or input vector
x one has mostly some basic deterministic constraints, e.g., nonnegativity
constraints, box constraints, represented by

x ∈ D, (2.1c)

where D is a given convex subset of IR^r. Moreover, a is the ν-vector of model parameters. In optimal structural design,

a = (p, P)   (2.1d)

is composed of the following two subvectors: P is the m-vector of the acting external loads, e.g., wave loads, wind loads, payload, etc. Moreover, p denotes the (ν − m)-vector of the further model parameters, as, e.g., material parameters (like strength parameters, yield/allowable stresses, elastic moduli, plastic capacities, etc.) of each member of the structure, the manufacturing tolerances and the weight or more general cost coefficients.
In production planning problems,

a = (A, b, c) (2.1e)

is composed of the m × r matrix A of technological coefficients, the demand


m-vector b and the r-vector c of unit costs.
Based on the m_y-vector of state functions

y(a, x) := ( y_1(a, x), y_2(a, x), . . . , y_{m_y}(a, x) )^T,   (2.1f)

the admissible or safe states of the structure/system can be characterized by the condition

y(a, x) ∈ B,   (2.1g)
where B is a certain subset of IRmy ; B = B(a) may depend also on some model
parameters.
In production planning problems, typical operating conditions are given,
cf. (2.1e), by

y(a, x) := Ax − b ≥ 0 or y(a, x) = 0, x ≥ 0. (2.2a)

In mechanical structures/structural systems, the safety (survival) of the


structure/system is described by the operating conditions

yi (a, x) > 0 for all i = 1, . . . , my (2.2b)

with state functions yi = yi (a, x), i = 1, . . . , my , depending on certain


response components of the structure/system, such as displacement, stress, force, moment components. Hence, a failure occurs if and only if the structure/system is in the ith failure mode (failure domain)

yi (a, x) ≤ 0 (2.2c)

for at least one index i, 1 ≤ i ≤ my .


Note. The number m_y of safety margins or limit state functions y_i = y_i(a, x), i = 1, . . . , m_y, may be very large. E.g., in optimal plastic design the limit state functions are determined by the extreme points of the admissible domain of the dual pair of static/kinematic LPs related to the equilibrium and the linearized convex yield condition, see Chap. 7.

Basic problems in optimal decision/design are:


(1) Primary (construction, planning, investment, etc.) cost minimization un-
der operating or safety conditions
min G0 (a, x) (2.3a)
s.t.
y(a, x) ∈ B (2.3b)
x ∈ D. (2.3c)
Obviously we have B = (0, +∞)my in (2.2b) and B = [0, +∞)my or B = {0}
in (2.2a).
(2) Failure or recourse cost minimization under primary cost constraints

"min" γ( y(a, x) )   (2.4a)

s.t.

G0(a, x) ≤ G^max   (2.4b)
x ∈ D.   (2.4c)
In (2.4a) γ = γ(y) is a scalar or vector valued cost/loss function evaluating
violations of the operating conditions (2.3b). Depending on the application,
these costs are called “failure” or “recourse” costs [58, 59, 98, 120, 137, 138].
As already discussed in Sect. 1, solving problems of the above type, a basic
difficulty is the uncertainty about the true value of the vector a of model
parameters or the (random) variability of a:
In practice, due to several types of uncertainties such as, see [154]:
– Physical uncertainty (variability of physical quantities, like material, loads,
dimensions, etc.)
– Economic uncertainty (trade, demand, costs, etc.)
– Statistical uncertainty (e.g., estimation errors of parameters due to limited
sample data)
– Model uncertainty (model errors)
the ν-vector a of model parameters must be modeled by a random vector
a = a(ω), ω ∈ Ω, (2.5a)
on a certain probability space (Ω, A_0, P) with sample space Ω having elements ω, see (1.3). For the mathematical representation of the corresponding (conditional) probability distribution P_{a(·)} = P^A_{a(·)} of the random vector a = a(ω) (given the time history or information A ⊂ A_0), two main distribution models are taken into account in practice:

(1) Discrete probability distributions
(2) Continuous probability distributions
(1) Discrete probability distributions
(2) Continuous probability distributions
In the first case there is a finite or countably infinite number l_0 ∈ IN ∪ {∞} of realizations or scenarios a^l ∈ IR^ν, l = 1, . . . , l_0,

P( a(ω) = a^l ) = α_l, l = 1, . . . , l_0,   (2.5b)

taken with probabilities α_l, l = 1, . . . , l_0. In the second case, the probability that the realization a(ω) = a lies in a certain (measurable) subset B ⊂ IR^ν is described by the multiple integral

P( a(ω) ∈ B ) = ∫_B ϕ(a) da   (2.5c)

with a certain probability density function ϕ = ϕ(a) ≥ 0, a ∈ IR^ν, ∫_{IR^ν} ϕ(a) da = 1.
The properties of the probability distribution Pa(·) may be described –
fully or in part – by certain numerical characteristics, called parameters of
P_{a(·)}. These distribution parameters θ = θ_h are obtained by considering expectations

θ_h := Eh( a(ω) )   (2.6a)

of (measurable) functions

(h ∘ a)(ω) := h( a(ω) )   (2.6b)

composed of the random vector a = a(ω) with certain (measurable) mappings

h : IR^ν → IR^{s_h}, s_h ≥ 1.   (2.6c)

According to the type of the probability distribution P_{a(·)} of a = a(ω), the expectation Eh( a(ω) ) is defined by

Eh( a(ω) ) = ∑_{l=1}^{l_0} h(a^l) α_l in the discrete case (2.5b),

Eh( a(ω) ) = ∫_{IR^ν} h(a) ϕ(a) da in the continuous case (2.5c).   (2.6d)
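Both cases of (2.6d) can be sketched in a few lines; the mapping h(a) = a² and the scenario data below are invented for illustration, and the continuous integral is approximated by simple trapezoidal quadrature on a truncated grid:

```python
import numpy as np

h = lambda a: a ** 2                     # example mapping h : IR -> IR (invented)

# discrete case (2.5b): scenarios a^l with probabilities alpha_l
scenarios = np.array([-1.0, 0.5, 2.0])
alphas = np.array([0.2, 0.5, 0.3])
E_discrete = float(np.sum(h(scenarios) * alphas))

# continuous case (2.5c): standard normal density phi(a)
grid = np.linspace(-8.0, 8.0, 20_001)
density = np.exp(-grid ** 2 / 2.0) / np.sqrt(2.0 * np.pi)
f = h(grid) * density
E_continuous = float(np.sum((f[:-1] + f[1:]) * np.diff(grid)) / 2.0)
```

For h(a) = a² and a standard normal density the continuous expectation is the variance, i.e. 1, which the quadrature reproduces to high accuracy.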

Further distribution parameters θ are functions

θ = Ψ( θ_{h_1}, . . . , θ_{h_s} )   (2.7)

of certain "h-moments" θ_{h_1}, . . . , θ_{h_s} of the type (2.6a). Important examples of the type (2.6a), (2.7), resp., are the expectation

ā = Ea(ω) (for h(a) := a, a ∈ IR^ν)   (2.8a)

and the covariance matrix

Q := E( a(ω) − ā )( a(ω) − ā )^T = Ea(ω)a(ω)^T − ā ā^T   (2.8b)

of the random vector a = a(ω).

Due to the stochastic variability of the random vector a(·) of model pa-
rameters, and since the realization a(ω) = a is not available at the decision
making stage, the optimal design problem (2.3a)–(2.3c) or (2.4a)–(2.4c) under
stochastic uncertainty cannot be solved directly.
Hence, appropriate deterministic substitute problems must be chosen tak-
ing into account the randomness of a = a(ω), cf. Sect. 1.2.

2.1.1 Deterministic Substitute Problems in Optimal Design

According to Sect. 1.2, a basic deterministic substitute problem in optimal


design under stochastic uncertainty is the minimization of the total expected
costs including the expected costs of failure
 
min c_G EG0( a(ω), x ) + c_f p_f(x)   (2.9a)

s.t.

x ∈ D.   (2.9b)

Here,

p_f = p_f(x) := P( y( a(ω), x ) ∉ B )   (2.9c)

is the probability of failure, or the probability that a safe function of the structure or system is not guaranteed. Furthermore, c_G is a certain weight factor, and c_f > 0 describes the failure or recourse costs. In the present definition of the expected failure costs, constant costs for each realization a = a(ω) of a(·) are assumed. Obviously,

p_f(x) = 1 − p_s(x)   (2.9d)

with the probability of safety or survival

p_s(x) := P( y( a(ω), x ) ∈ B ).   (2.9e)

In case (2.2b) we have

p_f(x) = P( y_i( a(ω), x ) ≤ 0 for at least one index i, 1 ≤ i ≤ m_y ).   (2.9f)

The objective function (2.9a) may be interpreted as the Lagrangian (with given cost multiplier c_f) of the following reliability based optimization (RBO) problem, cf. [2, 28, 88, 91, 99, 120, 138, 154, 162]:

min EG0( a(ω), x )   (2.10a)

s.t.

p_f(x) ≤ α_max   (2.10b)
x ∈ D,   (2.10c)

where α_max > 0 is a prescribed maximum failure probability, e.g., α_max = 0.001, cf. (2.3a)–(2.3c).
The "dual" version of (2.10a)–(2.10c) reads

min p_f(x)   (2.11a)

s.t.

EG0( a(ω), x ) ≤ G^max   (2.11b)
x ∈ D   (2.11c)

with a maximum (upper) cost bound G^max, see (2.4a)–(2.4c).
Further substitute problems are obtained by considering more general expected failure or recourse cost functions

Γ(x) = Eγ( y( a(ω), x ) )   (2.12a)

arising from structural system weakness or failure, or because of false operation. Here,

y( a(ω), x ) := ( y_1( a(ω), x ), . . . , y_{m_y}( a(ω), x ) )^T   (2.12b)

is again the random vector of state functions, and

γ : IR^{m_y} → IR^{m_γ}   (2.12c)

is a scalar or vector valued cost or loss function. In case B = (0, +∞)^{m_y} or B = [0, +∞)^{m_y} it is often assumed that γ = γ(y) is a nonincreasing function, hence,

γ(y) ≥ γ(z), if y ≤ z,   (2.12d)

where inequalities between vectors are defined componentwise.

Example 2.1. If γ(y) = 1 for y ∈ B^c (complement of B) and γ(y) = 0 for y ∈ B, then Γ(x) = p_f(x).

Example 2.2. Suppose that γ = γ(y) is a nonnegative scalar function on IR^{m_y} such that

γ(y) ≥ γ_0 > 0 for all y ∉ B   (2.13a)

with a constant γ_0 > 0. Then for the probability of failure we find the following upper bound:

p_f(x) = P( y( a(ω), x ) ∉ B ) ≤ (1/γ_0) Eγ( y( a(ω), x ) ),   (2.13b)

where the right hand side of (2.13b) is obviously an expected cost function of type (2.12a)–(2.12c). Hence, the condition (2.10b) can be guaranteed by the cost constraint

Eγ( y( a(ω), x ) ) ≤ γ_0 α_max.   (2.13c)

Example 2.3. If the loss function γ(y) is defined by a vector of individual loss functions γ_i for each state function y_i = y_i(a, x), i = 1, . . . , m_y, hence,

γ(y) = ( γ_1(y_1), . . . , γ_{m_y}(y_{m_y}) )^T,   (2.14a)

then

Γ(x) = ( Γ_1(x), . . . , Γ_{m_y}(x) )^T, Γ_i(x) := Eγ_i( y_i( a(ω), x ) ), 1 ≤ i ≤ m_y,   (2.14b)

i.e., the m_y state functions y_i, i = 1, . . . , m_y, will be treated separately.
Working with the more general expected failure or recourse cost functions Γ = Γ(x), instead of (2.9a)–(2.9c), (2.10a)–(2.10c) and (2.11a)–(2.11c) we have the related substitute problems:

(1) Expected total cost minimization

min c_G EG0( a(ω), x ) + c_f^T Γ(x)   (2.15a)

s.t.

x ∈ D   (2.15b)

(2) Expected primary cost minimization under expected failure or recourse cost constraints

min EG0( a(ω), x )   (2.16a)

s.t.

Γ(x) ≤ Γ^max   (2.16b)
x ∈ D,   (2.16c)

(3) Expected failure or recourse cost minimization under expected primary cost constraints

"min" Γ(x)   (2.17a)

s.t.

EG0( a(ω), x ) ≤ G^max   (2.17b)
x ∈ D.   (2.17c)

Here, c_G, c_f are (vectorial) weight coefficients, Γ^max is the vector of upper loss bounds, and "min" indicates again that Γ(x) may be a vector valued function.

2.1.2 Deterministic Substitute Problems in Quality Engineering

In quality engineering [109,113,114] the aim is to optimize the performance of


a certain product or a certain process, and to make the performance insensitive
(robust) to any noise to which the product, the process may be subjected.
Moreover, the production costs (development, manufacturing, etc.) should be
bounded or minimized.
Hence, the control (input, design) factors x are then to set as to provide
– An optimal mean performance (response)
– A high degree of immunity (robustness) of the performance (response) to
noise factors
– Low or minimum production costs
Of course, a certain trade-off between these criteria may be taken into
account.
For this aim, the quality characteristics or performance (e.g., dimensions,
weight, contamination, structural strength, yield, etc.) of the product/process
is described [109] first by a function
 
y = y( a(ω), x )   (2.18)

depending on a ν-vector a = a(ω) of random varying parameters (noise fac-


tors) aj = aj (ω), j = 1, . . . , ν, causing quality variations, and the r-vector x
of control/input factors xk , k = 1, . . . , r.
The quality loss of the product/process under consideration is measured
then by a certain loss function γ = γ(y). Simple loss functions γ are often
chosen [113, 114] in the following way:
(1) Nominal-is-best
Having a given, finite target or nominal value y ∗ , e.g., a nominal dimension
of a product, the quality loss γ(y) reads

γ(y) := (y − y ∗ )2 . (2.19a)

(2) Smallest-is-best
If the target or nominal value y ∗ is zero, i.e., if the absolute value |y| of
the product quality function y = y(a, x) should be as small as possible,
e.g., a certain product contamination, then the corresponding quality loss
is defined by
γ(y) := y 2 . (2.19b)
(3) Biggest-is-best
If the absolute value |y| of the product quality function y = y(a, x) should
be as large as possible, e.g., the strength of a bar or the yield of a process,
then a possible quality loss is given by

γ(y) := (1/y)².   (2.19c)

Obviously, (2.19a) and (2.19b) are convex loss functions. Moreover, since
the quality or response function y = y(a, x) takes only positive or only nega-
tive values y in many practical applications, also (2.19c) yields a convex loss
function in many cases.
Further decision situations may be modeled by choosing more general con-
vex loss functions γ = γ(y).
A high mean product/process level or performance at minimum random
quality variations and low or bounded production/manufacturing costs is
then achieved [109, 113, 114] by minimizing the expected quality loss function

Γ(x) := Eγ(y(a(ω), x)),   (2.19d)

subject to certain constraints on the input x, as, e.g., constraints on the
production/manufacturing costs. Obviously, Γ(x) can also be minimized by
maximizing the negative logarithmic mean loss (sometimes called the “signal-to-noise
ratio” [109]):

SN := − log Γ(x).   (2.19e)
Since Γ = Γ(x) is a mean value function as described in the previous
sections, for the computation of a robust optimal input vector x∗ one has the
following stochastic optimization problem:

min Γ(x)   (2.20a)

s.t.

EG0(a(ω), x) ≤ Gmax   (2.20b)
x ∈ D.   (2.20c)
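The objective in (2.20a) is an expectation over the noise factors and can be estimated by sampling; a minimal Monte Carlo sketch (all function names and the illustrative linear quality model are our assumptions):

```python
import random

def expected_loss(x, loss, quality, sample_a, n=20000, seed=0):
    """Monte Carlo estimate of Gamma(x) = E gamma(y(a(omega), x)), cf. (2.19d).

    sample_a(rng) draws one realization of the random parameter a(omega)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        a = sample_a(rng)
        total += loss(quality(a, x))
    return total / n

# Illustration: linear quality function y(a, x) = a * x, smallest-is-best
# loss gamma(y) = y^2, and a(omega) ~ N(1, 0.1^2), so the exact value is
# E(a^2) * x^2 = (1 + 0.01) * 4 = 4.04 at x = 2.
gamma_hat = expected_loss(
    x=2.0,
    loss=lambda y: y * y,
    quality=lambda a, x: a * x,
    sample_a=lambda rng: rng.gauss(1.0, 0.1),
)
```

Such a sampling estimate is what the analytical approximations of Sect. 2.3 seek to avoid or cheapen.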
2.2 Basic Properties of Substitute Problems
As can be seen from the conversion of an optimization problem with random
parameters into a deterministic substitute problem, cf. Sects. 2.1.1 and 2.1.2,
a central role is played by expectation or mean value functions of the type

Γ(x) = Eγ(y(a(ω), x)),  x ∈ D0,   (2.21a)

or, more generally,

Γ(x) = Eg(a(ω), x),  x ∈ D0.   (2.21b)

Here, a = a(ω) is a random ν-vector, y = y(a, x) is an my-vector valued
function on a certain subset of IRν × IRr, and γ = γ(z) is a real valued function
on a certain subset of IRmy.
Furthermore, g = g(a, x) denotes a real valued function on a certain subset
of IRν × IRr. In the following we suppose that the expectation in (2.21a), (2.21b)
exists and is finite for all input vectors x lying in an appropriate set D0 ⊂ IRr,
cf. [11].
The following basic properties of the mean value functions Γ are needed
again and again in what follows.

Lemma 2.1. Convexity. Suppose that x → g(a(ω), x) is convex a.s. (almost
surely) on a fixed convex domain D0 ⊂ IRr. If Eg(a(ω), x) exists and is finite
for each x ∈ D0, then Γ = Γ(x) is convex on D0.

Proof. This property follows [58, 59, 78] directly from the linearity of the
expectation operator. □

If g = g(a, x) is defined by g(a, x) := γ(y(a, x)), see (2.21a), then the
above lemma yields the following result:

Corollary 2.1. Suppose that γ is convex and Eγ(y(a(ω), x)) exists and is
finite for each x ∈ D0. (a) If x → y(a(ω), x) is linear a.s., then Γ = Γ(x) is
convex. (b) If x → y(a(ω), x) is convex a.s. and γ is a convex, monotonically
nondecreasing function, then Γ = Γ(x) is convex.

It is well known [72] that a convex function is continuous on each open
subset of its domain. A general sufficient condition for the continuity of Γ is
given next.

Lemma 2.2. Continuity. Suppose that Eg(a(ω), x) exists and is finite for
each x ∈ D0, and assume that x → g(a(ω), x) is continuous at x0 ∈ D0 a.s.
If there is a function ψ = ψ(a(ω)) having finite expectation such that

|g(a(ω), x)| ≤ ψ(a(ω))  a.s. for all x ∈ U(x0) ∩ D0,   (2.22)

where U(x0) is a neighborhood of x0, then Γ = Γ(x) is continuous at x0.

Proof. The assertion can be shown by using Lebesgue’s dominated convergence
theorem, see, e.g., [78]. □

For the consideration of the differentiability of Γ = Γ(x), let D denote an
open subset of the domain D0 of Γ.
Lemma 2.3. Differentiability. Suppose that

(1) Eg(a(ω), x) exists and is finite for each x ∈ D0,
(2) x → g(a(ω), x) is differentiable on the open subset D of D0 a.s., and
(3) ‖∇x g(a(ω), x)‖ ≤ ψ(a(ω)),  x ∈ D, a.s.,   (2.23a)

where ψ = ψ(a(ω)) is a function having finite expectation. Then the expectation
of ∇x g(a(ω), x) exists and is finite, Γ = Γ(x) is differentiable on D,
and

∇Γ(x) = ∇x Eg(a(ω), x) = E∇x g(a(ω), x),  x ∈ D.   (2.23b)

Proof. Considering the difference quotients ∆Γ/∆xk, k = 1, . . . , r, of Γ at a
fixed point x0 ∈ D, the assertion follows by means of the mean value theorem,
inequality (2.23a) and Lebesgue’s dominated convergence theorem, cf.
[58, 59, 78]. □

Example 2.4. In case (2.21a), under obvious differentiability assumptions
concerning γ and y, we have ∇x g(a, x) = ∇x y(a, x)^T ∇γ(y(a, x)), where
∇x y(a, x) denotes the Jacobian of y = y(a, x) with respect to x. Hence, if
(2.23b) holds, then

∇Γ(x) = E∇x y(a(ω), x)^T ∇γ(y(a(ω), x)).   (2.23c)
2.3 Approximations of Deterministic Substitute Problems in Optimal Design
The main problem in solving the deterministic substitute problems defined
above is that the arising probability and expected cost functions pf =
pf (x), Γ = Γ (x), x ∈ IRr , are defined by means of multiple integrals over
a ν-dimensional space.
Thus, the substitute problems may be solved, in practice, only by approximate
analytical and numerical methods [37, 165]. In the following we consider
possible approximations for substitute problems based on general expected
recourse cost functions Γ = Γ(x) according to (2.21a) with a real valued
convex loss function γ(z). Note that the probability of failure function
pf = pf(x) may be approximated from above, see (2.13a), (2.13b), by expected
cost functions Γ = Γ(x) having a nonnegative loss function γ = γ(z) that is
bounded from below on the failure domain Bc. In the following, several basic
approximation methods are presented.

2.3.1 Approximation of the Loss Function

Suppose here that γ = γ(y) is a continuously differentiable, convex loss function
on IRmy. Let

ȳ(x) := Ey(a(ω), x) = (Ey1(a(ω), x), . . . , Eymy(a(ω), x))^T   (2.24)

denote the expectation of the vector y = y(a(ω), x) of state functions
yi = yi(a(ω), x), i = 1, . . . , my.
For an arbitrary continuously differentiable, convex loss function γ we have

γ(y(a(ω), x)) ≥ γ(ȳ(x)) + ∇γ(ȳ(x))^T (y(a(ω), x) − ȳ(x)).   (2.25a)

Thus, taking expectations in (2.25a), we find Jensen’s inequality

Γ(x) = Eγ(y(a(ω), x)) ≥ γ(ȳ(x)),   (2.25b)

which holds for any convex function γ. Using the mean value theorem, we have

γ(y) = γ(ȳ) + ∇γ(ŷ)^T (y − ȳ),   (2.25c)

where ŷ is a point on the line segment between ȳ and y. By means of
(2.25b), (2.25c) we get

0 ≤ Γ(x) − γ(ȳ(x)) ≤ E( ‖∇γ(ŷ(a(ω), x))‖ · ‖y(a(ω), x) − ȳ(x)‖ ).   (2.25d)
(a) Bounded gradient
If the gradient ∇γ is bounded on the range of y = y(a(ω), x), x ∈ D, i.e., if

‖∇γ(y(a(ω), x))‖ ≤ ϑmax  a.s. for each x ∈ D,   (2.26a)

with a constant ϑmax > 0, then

0 ≤ Γ(x) − γ(ȳ(x)) ≤ ϑmax E‖y(a(ω), x) − ȳ(x)‖,  x ∈ D.   (2.26b)
Since t → √t, t ≥ 0, is a concave function, we get

0 ≤ Γ(x) − γ(ȳ(x)) ≤ ϑmax √(q(x)),   (2.26c)

where

q(x) := E‖y(a(ω), x) − ȳ(x)‖² = tr Q(x)   (2.26d)

is the generalized variance, and

Q(x) := cov(y(a(·), x))   (2.26e)

denotes the covariance matrix of the random vector y = y(a(ω), x). Consequently,
the expected loss function Γ(x) can be approximated from above by

Γ(x) ≤ γ(ȳ(x)) + ϑmax √(q(x))  for x ∈ D.   (2.26f)
(b) Bounded eigenvalues of the Hessian
Considering second order expansions of γ, with a vector ỹ on the line segment
between ȳ and y we find

γ(y) − γ(ȳ) = ∇γ(ȳ)^T (y − ȳ) + (1/2)(y − ȳ)^T ∇²γ(ỹ)(y − ȳ).   (2.27a)

Suppose that the eigenvalues λ of ∇²γ(y) are bounded from below and
above on the range of y = y(a(ω), x) for each x ∈ D, i.e.,

0 < λmin ≤ λ(∇²γ(y(a(ω), x))) ≤ λmax < +∞  a.s., x ∈ D,   (2.27b)

with constants 0 < λmin ≤ λmax. Taking expectations in (2.27a), we get

γ(ȳ(x)) + (λmin/2) q(x) ≤ Γ(x) ≤ γ(ȳ(x)) + (λmax/2) q(x),  x ∈ D.   (2.27c)

Consequently, using (2.26f) or (2.27c), various approximations for the
deterministic substitute problems (2.15a), (2.15b), (2.16a)–(2.16c), (2.17a)–
(2.17c) may be obtained.
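For the quadratic loss γ(y) = y² one has γ″ ≡ 2, hence λmin = λmax = 2 and the two bounds in (2.27c) coincide: Γ(x) = γ(ȳ(x)) + q(x) exactly. A small numerical check under an assumed normal sample of the state function:

```python
import random

# Numerical check of (2.27c) for gamma(y) = y^2 in one dimension:
# E y^2 = ybar^2 + Var(y), so lower and upper bound both equal Gamma(x).
rng = random.Random(1)
ys = [rng.gauss(3.0, 0.5) for _ in range(50000)]   # samples of y(a(omega), x)

gamma_mean = sum(y * y for y in ys) / len(ys)      # Gamma(x) = E gamma(y)
ybar = sum(ys) / len(ys)                           # mean ybar(x)
q = sum((y - ybar) ** 2 for y in ys) / len(ys)     # variance q(x)

lower = ybar ** 2 + (2 / 2) * q   # gamma(ybar) + (lambda_min/2) q
upper = ybar ** 2 + (2 / 2) * q   # gamma(ybar) + (lambda_max/2) q
```

For losses with genuinely different curvature bounds, the same computation brackets Γ(x) between the two expressions.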
Based on the above approximations of expected cost functions, we state the
following two approximations to (2.16a)–(2.16c), (2.17a)–(2.17c), resp., which
are well known in robust optimal design:
(1) Expected primary cost minimization under approximate expected failure or
recourse cost constraints

min EG0(a(ω), x)   (2.28a)

s.t.

γ(ȳ(x)) + c0 q(x) ≤ Γmax   (2.28b)
x ∈ D,   (2.28c)

where c0 is a scale factor, cf. (2.26f) and (2.27c);
(2) Approximate expected failure or recourse cost minimization under expected
primary cost constraints

min γ(ȳ(x)) + c0 q(x)   (2.29a)

s.t.

EG0(a(ω), x) ≤ Gmax   (2.29b)
x ∈ D.   (2.29c)

Obviously, by means of (2.28a)–(2.28c) or (2.29a)–(2.29c) optimal designs
x∗ are achieved which
– Yield a high mean performance of the structure/structural system
– Are minimally sensitive or have a limited sensitivity with respect to random
parameter variations (material, load, manufacturing, process, etc.)
– Cause only limited costs for design, construction, maintenance, etc.
2.3.2 Regression Techniques, Model Fitting, RSM

In structures/systems with variable quantities, very often one or several
functional relationships between the measurable variables can be guessed
according to the available data resulting from observations, simulation
experiments, etc. Due to limited data, the true functional relationships
between the quantities involved are approximated in practice – on a certain
limited range – usually by simple functions, e.g., multivariate polynomials of
a certain (low) degree. The resulting regression metamodel [64] is then
obtained by certain fitting techniques, such as weighted least squares or
Tschebyscheff minmax procedures for estimating the unknown parameters of
the regression model.
An important advantage of these models is then the possibility to predict
the values of some variables from knowledge of other variables.
In case of more than one independent variable (predictor or regressor) one
has a “multiple regression” task. Problems with more than one dependent or
response variable are called “multivariate regression” problems, see [33].
(a) Approximation of state functions. The numerical solution is simplified
considerably if one can work with one single state function y = y(a, x).
Formally, this is possible by defining the function

y^min(a, x) := min_{1≤i≤my} yi(a, x).   (2.30a)

Indeed, according to (2.2b), (2.2c), the failure of the structure or system
can be represented by the condition

y^min(a, x) ≤ 0.   (2.30b)
Fig. 2.1. Loss function γ

Thus, the weakness or failure of the technical or economic device can be
evaluated numerically by the function

Γ(x) := Eγ(y^min(a(ω), x))   (2.30c)

with a nonincreasing loss function γ : IR → IR+, see Fig. 2.1.
However, the “min”-operator in (2.30a) yields a nonsmooth function
y^min = y^min(a, x) in general, and the computation of the mean and variance
functions

ȳ^min(x) := E y^min(a(ω), x)   (2.30d)
σ²_{y^min}(x) := V(y^min(a(·), x))   (2.30e)

by means of Taylor expansion with respect to the model parameter vector
a at ā = Ea(ω) is not directly possible, cf. Sect. 2.3.3.
According to the definition (2.30a), an upper bound for ȳ^min(x) is given by

ȳ^min(x) ≤ min_{1≤i≤my} ȳi(x) = min_{1≤i≤my} Eyi(a(ω), x).
Further approximations of y^min(a, x) and its moments can be found by
using the representation

min(a, b) = (1/2)(a + b − |a − b|)

of the minimum of two numbers a, b ∈ IR. E.g., for an even index my we have

y^min(a, x) = min_{i=1,3,...,my−1} min( yi(a, x), yi+1(a, x) )
           = min_{i=1,3,...,my−1} (1/2)( yi(a, x) + yi+1(a, x) − |yi(a, x) − yi+1(a, x)| ).
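The pairwise identity used here is easy to verify numerically:

```python
# Check of min(a, b) = (a + b - |a - b|)/2, the building block used to
# rewrite y_min pairwise for an even number of state functions.
def min_via_abs(a, b):
    return 0.5 * (a + b - abs(a - b))

pairs = [(2.0, 5.0), (-1.0, -3.0), (4.0, 4.0)]
results = [min_via_abs(a, b) for a, b in pairs]
```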
In many cases we may suppose that the state functions yi = yi(a, x),
i = 1, . . . , my, are bounded from below, hence,

yi(a, x) > −A,  i = 1, . . . , my,

for all (a, x) under consideration, with a positive constant A > 0. Thus,
defining

ỹi(a, x) := yi(a, x) + A,  i = 1, . . . , my,

and therefore

ỹ^min(a, x) := min_{1≤i≤my} ỹi(a, x) = y^min(a, x) + A,

we have

y^min(a, x) ≤ 0 if and only if ỹ^min(a, x) ≤ A.

Hence, the survival/failure of the system or structure can also be studied
by means of the positive function ỹ^min = ỹ^min(a, x). Using now the theory
of power or Hölder means [24], the minimum ỹ^min(a, x) of positive functions
can be represented as the limit

ỹ^min(a, x) = lim_{λ→−∞} ( (1/my) Σ_{i=1}^{my} ỹi(a, x)^λ )^{1/λ}

of the decreasing family of power means M^[λ](ỹ) := ( (1/my) Σ_{i=1}^{my} ỹi^λ )^{1/λ},
λ < 0. Consequently, for each fixed p > 0 we also have

( ỹ^min(a, x) )^p = lim_{λ→−∞} ( (1/my) Σ_{i=1}^{my} ỹi(a, x)^λ )^{p/λ}.
Assuming that the expectation E( M^[λ](ỹ(a(ω), x)) )^p exists for an exponent
λ = λ0 < 0, by means of Lebesgue’s bounded convergence theorem
we get the moment representation

E( ỹ^min(a(ω), x) )^p = lim_{λ→−∞} E( (1/my) Σ_{i=1}^{my} ỹi(a(ω), x)^λ )^{p/λ}.

Since t → t^{p/λ}, t > 0, is convex for each fixed p > 0 and λ < 0, by Jensen’s
inequality we have the lower moment bound

E( ỹ^min(a(ω), x) )^p ≥ lim_{λ→−∞} ( (1/my) Σ_{i=1}^{my} E ỹi(a(ω), x)^λ )^{p/λ}.

Hence, for the pth order moment of ỹ^min(a(·), x) we get the approximations

E( (1/my) Σ_{i=1}^{my} ỹi(a(ω), x)^λ )^{p/λ} ≥ ( (1/my) Σ_{i=1}^{my} E ỹi(a(ω), x)^λ )^{p/λ}

for some λ < 0.
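The monotone convergence of the power means to the minimum can be checked numerically; a small sketch (the sample values ỹi are our choices):

```python
# Power mean M^[lambda](y) = ((1/m) sum y_i^lambda)^(1/lambda) of positive
# numbers; it decreases to min_i y_i as lambda -> -infinity.
def power_mean(values, lam):
    m = len(values)
    return (sum(v ** lam for v in values) / m) ** (1.0 / lam)

y = [2.0, 3.0, 5.0]
approx = [power_mean(y, lam) for lam in (-1, -10, -100)]
# approx decreases toward min(y) = 2.0
```

This is the smoothing device behind the moment approximations above: for moderately negative λ the power mean is a smooth surrogate for the nonsmooth minimum.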

Using regression techniques, Response Surface Methods (RSM), etc., for
a given vector x, the function a → y^min(a, x) can be approximated [14, 23,
55, 63, 136] by functions ỹ = ỹ(a, x) that are sufficiently smooth with respect
to the parameter vector a (Figs. 2.2 and 2.3).
In many important cases, for each i = 1, . . . , my, the state functions

(a, x) → yi(a, x)

are bilinear functions. Thus, in this case y^min = y^min(a, x) is a piecewise
linear function with respect to a. Fitting a linear or quadratic Response
Surface Model [17, 18, 105, 106]

ỹ(a, x) := c(x) + q(x)^T (a − ā) + (a − ā)^T Q(x)(a − ā)   (2.30f)

to a → y^min(a, x), after the selection of appropriate reference points

a^(j) := ā + da^(j),  j = 1, . . . , p,   (2.30g)

Fig. 2.2. Approximate smooth failure domain [ỹ(a, x) ≤ 0] for given x
Fig. 2.3. Approximation ỹ(a, x) of y^min(a, x) for given x
with “design” points da^(j) ∈ IRν, j = 1, . . . , p, the unknown coefficients
c = c(x), q = q(x) and Q = Q(x) are obtained by minimizing the mean
square error

ρ(c, q, Q) := Σ_{j=1}^{p} ( ỹ(a^(j), x) − y^min(a^(j), x) )²   (2.30h)

with respect to (c, q, Q). Since the model (2.30f) depends linearly on the
function parameters (c, q, Q), explicit formulas for the optimal coefficients

c∗ = c∗(x),  q∗ = q∗(x),  Q∗ = Q∗(x),   (2.30i)

are obtained from this least squares estimation method, see Chap. 5.
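A minimal one-dimensional sketch of the fit (2.30h): a quadratic model c + q(a − ā) + Q(a − ā)² is fitted to a piecewise linear function a → y^min(a) by solving the 3×3 normal equations (the test function and reference points are our assumptions):

```python
# Least-squares fit of a 1-D quadratic response surface, cf. (2.30f)-(2.30h).
def fit_quadratic(abar, points, values):
    """Solve the 3x3 normal equations for (c, q, Q) by Gaussian elimination."""
    rows = [[1.0, a - abar, (a - abar) ** 2] for a in points]
    # Normal equations A^T A x = A^T b.
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    atb = [sum(r[i] * v for r, v in zip(rows, values)) for i in range(3)]
    # Gaussian elimination with partial pivoting.
    for k in range(3):
        p = max(range(k, 3), key=lambda i: abs(ata[i][k]))
        ata[k], ata[p] = ata[p], ata[k]
        atb[k], atb[p] = atb[p], atb[k]
        for i in range(k + 1, 3):
            f = ata[i][k] / ata[k][k]
            for j in range(k, 3):
                ata[i][j] -= f * ata[k][j]
            atb[i] -= f * atb[k]
    x = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        x[i] = (atb[i] - sum(ata[i][j] * x[j] for j in range(i + 1, 3))) / ata[i][i]
    return x  # (c, q, Q)

def y_min(a):                      # piecewise linear "state function"
    return min(a, 2.0 - a)

pts = [-1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
c, q, Q = fit_quadratic(1.0, pts, [y_min(a) for a in pts])
```

By the symmetry of y_min about ā = 1, the fitted linear coefficient q vanishes and the curvature Q is negative.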

(b) Approximation of expected loss functions. Corresponding to the approximation
(2.30f) of y^min = y^min(a, x), using again least squares techniques,
a mean value function Γ(x) = Eγ(y(a(ω), x)), cf. (2.12a), can be approximated
at a given point x0 ∈ IRr by a linear or quadratic Response
Surface Function

Γ̃(x) := β0 + βI^T (x − x0) + (x − x0)^T B(x − x0),   (2.30j)

with scalar, vector and matrix parameters β0, βI, B. In this case estimates
y^(i) = Γ̂^(i) of Γ(x^(i)) are needed at some reference points x^(i) = x0 + d^(i),
i = 1, . . . , p. Details are given in Chap. 5.

2.3.3 Taylor Expansion Methods

As can be seen above, cf. (2.21a), (2.21b), in the objective and/or in the
constraints of substitute problems for optimization problems with random
data, mean value functions of the type

Γ(x) := Eg(a(ω), x)

occur. Here, g = g(a, x) is a real valued function on a subset of IRν × IRr, and
a = a(ω) is a random ν-vector.
(a) Expansion with respect to a: Suppose that on its domain the function
g = g(a, x) has partial derivatives ∇a^l g(a, x), l = 0, 1, . . . , lg + 1, up to order
lg + 1. Note that the gradient ∇a g(a, x) contains the so-called sensitivities
∂g/∂aj (a, x), j = 1, . . . , ν, of g with respect to the parameter vector a at
(a, x). In the same way, the higher order partial derivatives ∇a^l g(a, x), l > 1,
represent the higher order sensitivities of g with respect to a at (a, x).
Taylor expansion of g = g(a, x) with respect to a at ā := Ea(ω) yields

g(a, x) = Σ_{l=0}^{lg} (1/l!) ∇a^l g(ā, x) · (a − ā)^l + (1/(lg+1)!) ∇a^{lg+1} g(â, x) · (a − ā)^{lg+1},   (2.31a)

where â := ā + ϑ(a − ā), 0 < ϑ < 1, and (a − ā)^l denotes the system of lth
order products

Π_{j=1}^{ν} (aj − āj)^{lj}

with lj ∈ IN ∪ {0}, j = 1, . . . , ν, l1 + l2 + . . . + lν = l. If g = g(a, x) is defined
by

g(a, x) := γ(y(a, x)),

see (2.21a), then the partial derivatives ∇a^l g of g up to second order read:

∇a g(a, x) = ∇a y(a, x)^T ∇γ(y(a, x))   (2.31b)
∇a² g(a, x) = ∇a y(a, x)^T ∇²γ(y(a, x)) ∇a y(a, x) + ∇γ(y(a, x)) · ∇a² y(a, x),   (2.31c)

where

(∇γ) · ∇a² y := ( (∇γ)^T ∂²y/(∂ak ∂al) )_{k,l=1,...,ν}.   (2.31d)

Taking expectations in (2.31a), Γ(x) can be approximated, cf. Sect. 2.3.1, by

Γ̃(x) := g(ā, x) + Σ_{l=2}^{lg} (1/l!) ∇a^l g(ā, x) · E(a(ω) − ā)^l,   (2.32a)
where E(a(ω) − ā)^l denotes the system of mixed lth central moments
of the random vector a(ω) = (a1(ω), . . . , aν(ω))^T. Assuming that the
domain of g = g(a, x) is convex with respect to a, we get the error estimate

|Γ(x) − Γ̃(x)| ≤ (1/(lg+1)!) E( sup_{0≤ϑ≤1} ‖∇a^{lg+1} g(ā + ϑ(a(ω) − ā), x)‖ · ‖a(ω) − ā‖^{lg+1} ).   (2.32b)

In many practical cases the random parameter ν-vector a = a(ω) has a
convex, bounded support, and ∇a^{lg+1} g is continuous. Then the L∞-norm

r(x) := (1/(lg+1)!) ess sup ‖∇a^{lg+1} g(a(ω), x)‖   (2.32c)

is finite for all x under consideration, and (2.32b), (2.32c) yield the error
bound

|Γ(x) − Γ̃(x)| ≤ r(x) E‖a(ω) − ā‖^{lg+1}.   (2.32d)

Remark 2.1. The above described method can be extended to the case of
vector valued loss functions γ(z) = (γ1(z), . . . , γmγ(z))^T.
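For a one-dimensional illustration of the second-order version of (2.32a), take g(a, x) = e^{ax} with normally distributed a, for which Γ(x) is known in closed form (this particular choice of g is our assumption, not from the text):

```python
import math

# Second-order Taylor approximation of Gamma(x) = E g(a(omega), x):
#   Gamma_tilde(x) = g(abar, x) + 0.5 * d2g/da2(abar, x) * Var(a).
# With g(a, x) = exp(a*x) and a ~ N(abar, sigma^2), the exact value is
# Gamma(x) = exp(abar*x + sigma^2 * x^2 / 2).
abar, sigma, x = 0.5, 0.1, 2.0

g = math.exp(abar * x)
d2g = x ** 2 * math.exp(abar * x)             # second a-derivative at abar
gamma_taylor = g + 0.5 * d2g * sigma ** 2     # Gamma_tilde(x), cf. (2.32a)
gamma_exact = math.exp(abar * x + 0.5 * sigma ** 2 * x ** 2)
```

For small parameter variance the truncation error is of the order of the neglected higher central moments, in line with the bound (2.32d).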

(b) Inner expansions: Suppose that Γ(x) is defined by (2.21a), hence,

Γ(x) = Eγ(y(a(ω), x)).

Linearizing the vector function y = y(a, x) with respect to a at ā, thus,

y(a, x) ≈ y^(1)(a, x) := y(ā, x) + ∇a y(ā, x)(a − ā),   (2.33a)

the mean value function Γ(x) is approximated by

Γ̃(x) := Eγ( y(ā, x) + ∇a y(ā, x)(a(ω) − ā) ).   (2.33b)

Corresponding to (2.24), (2.26e), define

ȳ^(1)(x) := Ey^(1)(a(ω), x) = y(ā, x),   (2.33c)
Q̃(x) := cov(y^(1)(a(·), x)) = ∇a y(ā, x) cov(a(·)) ∇a y(ā, x)^T.   (2.33d)

In case of convex loss functions γ, approximations of Γ and the corresponding
substitute problems based on Γ̃ may now be obtained by applying
the methods described in Sect. 2.3.1. Explicit representations for Γ̃ are
obtained in case of quadratic loss functions γ.
Error estimates can easily be derived for Lipschitz-continuous or convex
loss functions γ. In case of a Lipschitz-continuous loss function γ with
Lipschitz constant L > 0, e.g., for sublinear [78, 81] loss functions, using
(2.33b) we have

|Γ(x) − Γ̃(x)| ≤ L · E‖y(a(ω), x) − y^(1)(a(ω), x)‖.   (2.33e)

Applying the mean value theorem [29], under appropriate second order
differentiability assumptions, for the right hand side of (2.33e) we find the
following stochastic version of the mean value theorem:

E‖y(a(ω), x) − y^(1)(a(ω), x)‖ ≤ E( ‖a(ω) − ā‖² sup_{0≤ϑ≤1} ‖∇a² y(ā + ϑ(a(ω) − ā), x)‖ ).   (2.33f)

2.4 Applications to Problems in Quality Engineering
Since the deterministic substitute problems arising in quality engineering are
of the same type as in optimal design, the approximation techniques described
in Sect. 2.3 can be applied directly to the robust design problem (2.20a)–
(2.20c). This is shown in the following for the approximation method described
in Sect. 2.3.1.
Hence, the approximation techniques developed in Sect. 2.3.1 are now applied
to the objective function

Γ(x) = Eγ(y(a(ω), x))

of the basic robust design problem (2.20a)–(2.20c). If γ = γ(y) is an arbitrary
continuously differentiable convex loss function on IR¹, then we find the
following approximations:
(a) Bounded first order derivative
If γ has a bounded derivative on the convex support of the quality function,
i.e., if

|γ′(y(a(ω), x))| ≤ ϑmax < +∞  a.s., x ∈ D,   (2.34a)

then

Γ(x) ≤ γ(ȳ(x)) + ϑmax √(q(x)),  x ∈ D,   (2.34b)

where

ȳ(x) := Ey(a(ω), x),   (2.34c)
q(x) := V(y(a(·), x)) = E(y(a(ω), x) − ȳ(x))²   (2.34d)

denote the mean and the variance of the quality function y = y(a(ω), x).
(b) Bounded second order derivative
If the second order derivative γ″ of γ is bounded from above and below,
i.e., if

0 < λmin ≤ γ″(y(a(ω), x)) ≤ λmax < +∞  a.s., x ∈ D,   (2.35a)

then

γ(ȳ(x)) + (λmin/2) q(x) ≤ Γ(x) ≤ γ(ȳ(x)) + (λmax/2) q(x),  x ∈ D.   (2.35b)

Corresponding to (2.28a)–(2.28c), (2.29a)–(2.29c), we then have the following
“dual” robust optimal design problems:
(1) Expected primary cost minimization under approximate expected failure
or recourse cost constraints

min EG0(a(ω), x)   (2.36a)

s.t.

γ(ȳ(x)) + c0 q(x) ≤ Γmax   (2.36b)
x ∈ D,   (2.36c)

where Γmax denotes an upper loss bound.

(2) Approximate expected failure cost minimization under expected pri-
mary cost constraints
 
min γ y(x) + c0 q(x) (2.37a)

s.t.
 
EG0 a(ω), x ≤ Gmax (2.37b)
x ∈ D, (2.37c)

where c0 is again a certain


 scale
 factor, see (2.34a), (2.34b), (2.35a),
(2.35b), and G0 = G0 a(ω), x represents the production costs having
an upper bound Gmax .

2.5 Approximation of Probabilities: Probability Inequalities

In reliability analysis of engineering/economic structures or systems, a main
problem is the computation of the probabilities

P(∪_{i=1}^{N} Vi) := P( a(ω) ∈ ∪_{i=1}^{N} Vi )   (2.38a)

or

P(∩_{j=1}^{N} Sj) := P( a(ω) ∈ ∩_{j=1}^{N} Sj )   (2.38b)

of unions and intersections of certain failure/survival domains (events) Vj, Sj,
j = 1, . . . , N. These domains (events) arise from the representation of the
structure or system as a combination of certain series and/or parallel
substructures/systems. Due to the high complexity of the basic physical relations,
several approximation techniques are needed for the evaluation of (2.38a),
(2.38b).
2.5.1 Bonferroni-Type Inequalities

In the following V1, V2, . . . , VN denote arbitrary (Borel-)measurable subsets of
the parameter space IRν, and the abbreviation

P(V) := P(a(ω) ∈ V)   (2.38c)

is used for any measurable subset V of IRν.
Starting from the representation of the probability of a union of N events,

P(∪_{j=1}^{N} Vj) = Σ_{k=1}^{N} (−1)^{k−1} s_{k,N},   (2.39a)

where

s_{k,N} := Σ_{1≤i1<i2<...<ik≤N} P( ∩_{l=1}^{k} V_{il} ),   (2.39b)

we obtain [45] the well known basic Bonferroni bounds

P(∪_{j=1}^{N} Vj) ≤ Σ_{k=1}^{ρ} (−1)^{k−1} s_{k,N}  for ρ ≥ 1, ρ odd,   (2.39c)
P(∪_{j=1}^{N} Vj) ≥ Σ_{k=1}^{ρ} (−1)^{k−1} s_{k,N}  for ρ ≥ 1, ρ even.   (2.39d)
Besides (2.39c), (2.39d), a large number of related bounds of different
complexity are available, cf. [45, 153, 155]. Important bounds of first and second
degree are given below:

max_{1≤j≤N} qj ≤ P(∪_{j=1}^{N} Vj) ≤ Q1   (2.40a)
Q1 − Q2 ≤ P(∪_{j=1}^{N} Vj) ≤ Q1 − max_{1≤l≤N} Σ_{i≠l} q_{il}   (2.40b)
Q1² / (Q1 + 2Q2) ≤ P(∪_{j=1}^{N} Vj) ≤ Q1.   (2.40c)

The above quantities qj, q_{ij}, Q1, Q2 are defined as follows:

Q1 := Σ_{j=1}^{N} qj  with qj := P(Vj),   (2.40d)
Q2 := Σ_{j=2}^{N} Σ_{i=1}^{j−1} q_{ij}  with q_{ij} := P(Vi ∩ Vj).   (2.40e)

Moreover, defining

q := (q1, . . . , qN)^T,  Q := (q_{ij})_{1≤i,j≤N},   (2.40f)

we have

P(∪_{j=1}^{N} Vj) ≥ q^T Q^− q,   (2.40g)

where Q^− denotes the generalized inverse of Q, cf. [155].
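The first- and second-degree bounds (2.40a)–(2.40c) can be evaluated on a small discrete example; the sample space and events below are our choices, not from the text:

```python
from itertools import combinations

# Second-degree probability bounds for a union of events, cf. (2.40a)-(2.40e),
# illustrated on a uniform 10-point sample space.
omega = set(range(10))
events = [set(range(0, 5)), set(range(3, 8)), set(range(6, 10))]

def prob(s):
    return len(s) / len(omega)

q = [prob(v) for v in events]
Q1 = sum(q)                                                   # (2.40d)
pairs = {(i, j): prob(events[i] & events[j])
         for i, j in combinations(range(len(events)), 2)}
Q2 = sum(pairs.values())                                      # (2.40e)

p_union = prob(set().union(*events))
lower_1 = max(q)                  # (2.40a) lower bound
lower_2 = Q1 - Q2                 # (2.40b) lower bound
lower_3 = Q1 ** 2 / (Q1 + 2 * Q2) # (2.40c) lower bound
upper = Q1                        # (2.40a)/(2.40c) upper bound
```

Note that the upper bound Q1 may exceed 1 and is then trivial, whereas the second-degree lower bounds can be tight.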

2.5.2 Tschebyscheff-Type Inequalities

In many cases the survival or feasible domain (event) S = ∩_{i=1}^{m} Si is
represented by a certain number m of inequality constraints of the type

yli < (≤) yi(a, x) < (≤) yui,  i = 1, . . . , m,   (2.41a)

as, e.g., operating conditions or behavioral constraints. Hence, for a fixed input,
design or control vector x, the event S = S(x) is given by

S := {a ∈ IRν : yli < (≤) yi(a, x) < (≤) yui, i = 1, . . . , m}.   (2.41b)

Here,

yi = yi(a, x),  i = 1, . . . , m,   (2.41c)

are certain functions, e.g., response, output or performance functions of the
structure or system, defined on (a subset of) IRν × IRr. Moreover, yli < yui,
i = 1, . . . , m, are lower and upper bounds for the variables yi, i = 1, . . . , m.
In case of one-sided constraints some of the bounds yli, yui are infinite.
Two-Sided Constraints

If yli < yui, i = 1, . . . , m, are finite bounds, (2.41a) can be represented by

|yi(a, x) − yic| < (≤) ρi,  i = 1, . . . , m,   (2.41d)

where the quantities yic, ρi, i = 1, . . . , m, are defined by

yic := (yli + yui)/2,  ρi := (yui − yli)/2.   (2.41e)

Consequently, for the probability P(S) of the event S, defined by (2.41b), we
have

P(S) = P( |yi(a(ω), x) − yic| < (≤) ρi, i = 1, . . . , m ).   (2.41f)

Introducing the random variables

ỹi(a(ω), x) := ( yi(a(ω), x) − yic ) / ρi,  i = 1, . . . , m,   (2.42a)

and the set

B := {y ∈ IRm : |yi| < (≤) 1, i = 1, . . . , m},   (2.42b)

with ỹ = (ỹi)_{1≤i≤m}, we get

P(S) = P( ỹ(a(ω), x) ∈ B ).   (2.42c)

Considering any (measurable) function ϕ : IRm → IR such that

(i) ϕ(y) ≥ 0, y ∈ IRm,   (2.42d)
(ii) ϕ(y) ≥ ϕ0 > 0, if y ∉ B,   (2.42e)

with a positive constant ϕ0, we find the following result:

Theorem 2.1. For any (measurable) function ϕ : IRm → IR fulfilling conditions
(2.42d), (2.42e), the following Tschebyscheff-type inequality holds:

P( yli < (≤) yi(a(ω), x) < (≤) yui, i = 1, . . . , m ) ≥ 1 − (1/ϕ0) Eϕ( ỹ(a(ω), x) ),   (2.43)

provided that the expectation in (2.43) exists and is finite.

Proof. If P_{ỹ(a(·),x)} denotes the probability distribution of the random
m-vector ỹ = ỹ(a(ω), x), then

Eϕ( ỹ(a(ω), x) ) = ∫_{y∈B} ϕ(y) P_{ỹ(a(·),x)}(dy) + ∫_{y∈Bc} ϕ(y) P_{ỹ(a(·),x)}(dy)
               ≥ ∫_{y∈Bc} ϕ(y) P_{ỹ(a(·),x)}(dy) ≥ ϕ0 ∫_{y∈Bc} P_{ỹ(a(·),x)}(dy)
               = ϕ0 P( ỹ(a(ω), x) ∉ B ) = ϕ0 ( 1 − P( ỹ(a(ω), x) ∈ B ) ),

which yields the assertion, cf. (2.42c). □

Remark 2.2. Note that P(S) ≥ αs with a given minimum reliability αs ∈ (0, 1]
can be guaranteed by the expected cost constraint

Eϕ( ỹ(a(ω), x) ) ≤ (1 − αs)ϕ0.
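A one-dimensional instance of (2.43): with ϕ(y) = y² and B = (−1, 1) we have ϕ0 = 1, so P(|ỹ| < 1) ≥ 1 − Eỹ². A check by sampling (the distribution of ỹ is our assumption):

```python
import random

# Tschebyscheff-type bound (2.43) for m = 1, phi(y) = y^2, phi_0 = 1:
#   P(|y_tilde| < 1) >= 1 - E y_tilde^2.
rng = random.Random(7)
samples = [rng.gauss(0.0, 0.4) for _ in range(100000)]  # y_tilde(a(omega), x)

prob_in_B = sum(abs(y) < 1.0 for y in samples) / len(samples)
bound = 1.0 - sum(y * y for y in samples) / len(samples)  # 1 - E phi(y_tilde)
```

As usual for Tschebyscheff-type inequalities, the bound is valid but conservative: here it certifies roughly 0.84 while the true probability is close to 0.99.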

Example 2.5. If ϕ = 1_{Bc} is the indicator function of the complement Bc of B,
then ϕ0 = 1 and (2.43) holds with equality.

Example 2.6. For a given positive definite m × m matrix C, define ϕ(y) :=
y^T C y, y ∈ IRm. Then, cf. (2.42b), (2.42d), (2.42e),

min_{y ∉ B} ϕ(y) = min_{1≤i≤m} min( min_{yi≥1} y^T C y, min_{yi≤−1} y^T C y ).   (2.44a)

Thus, the lower bound ϕ0 follows by solving the convex optimization
problems arising on the right hand side of (2.44a). Moreover, the expectation
Eϕ(ỹ) needed in (2.43) is given, see (2.42a), by
Eϕ(ỹ) = E ỹ^T C ỹ = E tr(C ỹ ỹ^T)
      = tr( C (diag ρ)^{−1} ( cov(y(a(·), x)) + (ȳ(x) − yc)(ȳ(x) − yc)^T ) (diag ρ)^{−1} ),   (2.44b)

where “tr” denotes the trace of a matrix, diag ρ := (ρi δij) is the diagonal
matrix with entries ρi, yc := (yic), see (2.41e), and ȳ = ȳ(x) := (Eyi(a(ω), x)).
Since ‖Q‖ ≤ tr Q ≤ m‖Q‖ for any positive semidefinite m × m matrix Q, an
upper bound of Eϕ(ỹ) reads (Fig. 2.4)

Eϕ(ỹ) ≤ m ‖C‖ ‖(diag ρ)^{−1}‖² ‖ cov(y(a(·), x)) + (ȳ(x) − yc)(ȳ(x) − yc)^T ‖.   (2.44c)

Example 2.7. Assuming in the above Example 2.6 that C = diag(cii) is a
diagonal matrix with positive elements cii > 0, i = 1, . . . , m, then

min_{y ∉ B} ϕ(y) = min_{1≤i≤m} cii > 0,   (2.44d)
Fig. 2.4. Function ϕ = ϕ(y)

and Eϕ(ỹ) is given by

Eϕ(ỹ) = Σ_{i=1}^{m} cii E( yi(a(ω), x) − yic )² / ρi²
      = Σ_{i=1}^{m} cii ( σ²_{yi}(a(·), x) + (ȳi(x) − yic)² ) / ρi².   (2.44e)

One-Sided Inequalities

Suppose that exactly one of the two bounds yli, yui is infinite for each
i = 1, . . . , m. Multiplying the corresponding constraints in (2.41a) by −1, the
admissible domain S = S(x), cf. (2.41b), can always be represented by

S(x) = {a ∈ IRν : ỹi(a, x) < (≤) 0, i = 1, . . . , m},   (2.45a)

where ỹi := yi − yui if yli = −∞, and ỹi := yli − yi if yui = +∞. If we set
ỹ(a, x) := (ỹi(a, x)) and

B̃ := {y ∈ IRm : yi < (≤) 0, i = 1, . . . , m},   (2.45b)

then

P(S(x)) = P( ỹ(a(ω), x) ∈ B̃ ).   (2.45c)
Consider also in this case, cf. (2.42d), (2.42e), a function ϕ : IRm → IR such
that

(i) ϕ(y) ≥ 0, y ∈ IRm,   (2.46a)
(ii) ϕ(y) ≥ ϕ0 > 0, if y ∉ B̃.   (2.46b)

Then, corresponding to Theorem 2.1, we have this result:


Theorem 2.2 (Markov-type inequality). If ϕ : IRm → IR is any (mea-
surable) function fulfilling conditions (2.46a), (2.46b), then
    1   
P ỹ a(ω), x < (≤) 0 ≥ 1 − Eϕ ỹ a(ω), x , (2.47)
ϕ0
provided that the expectation in (2.47) exists and is finite.
Remark 2.3. Note that a related inequality was already used in Example 2.2.

Example 2.8. If ϕ(y) := Σ_{i=1}^{m} wi e^{αi yi} with positive constants wi, αi,
i = 1, . . . , m, then

inf_{y ∉ B̃} ϕ(y) = min_{1≤i≤m} wi > 0   (2.48a)

and

Eϕ( ỹ(a(ω), x) ) = Σ_{i=1}^{m} wi E e^{αi ỹi(a(ω), x)},   (2.48b)

where the expectation in (2.48b) can be computed approximately by Taylor
expansion, with μ̃i(x) := Eỹi(a(ω), x):

E e^{αi ỹi} = e^{αi μ̃i(x)} E e^{αi (ỹi − μ̃i(x))}
           ≈ e^{αi μ̃i(x)} ( 1 + (αi²/2) E( ỹi(a(ω), x) − μ̃i(x) )² )
           = e^{αi μ̃i(x)} ( 1 + (αi²/2) σ²_{ỹi}(a(·), x) ).   (2.48c)

Supposing that ỹi = ỹi(a(ω), x) is a normally distributed random variable,
then

E e^{αi ỹi} = e^{αi μ̃i(x)} e^{(αi²/2) σ²_{ỹi}(a(·),x)}.   (2.48d)
Example 2.9. Consider ϕ(y) := (y − b)^T C (y − b), where, cf. Example 2.6, C is
a positive definite m × m matrix and b < 0 a fixed m-vector. In this case we
again have

min_{y ∉ B̃} ϕ(y) = min_{1≤i≤m} min_{yi≥0} ϕ(y)

and

Eϕ(ỹ) = E( ỹ(a(ω), x) − b )^T C ( ỹ(a(ω), x) − b )
      = tr( C E( ỹ(a(ω), x) − b )( ỹ(a(ω), x) − b )^T ).
Note that

ỹi(a, x) − bi = yi(a, x) − (yui + bi)  if yli = −∞,
ỹi(a, x) − bi = (yli − bi) − yi(a, x)  if yui = +∞,

where yui + bi < yui and yli < yli − bi.

Remark 2.4. The one-sided case can also be reduced approximately to the
two-sided case by selecting a sufficiently large but finite upper bound ỹui ∈ IR
(lower bound ỹli ∈ IR, resp.) if yui = +∞ (yli = −∞). For multivariate
Tschebyscheff inequalities see also [75].

2.5.3 First Order Reliability Methods (FORM)

Special approximation methods for probabilities were developed in structural
reliability analysis [19, 21, 54, 104] for the probability of failure pf = pf(x)
given by (2.9c) with B := (0, +∞)^{my}. Hence, see (2.2b),

pf = 1 − P( y(a(ω), x) > 0 ).

Applying certain probability inequalities, see Sects. 2.5.1 and 2.5.2, the case
with m (>1) limit state functions is first reduced to the case with a single limit
state function. Approximations of pf are then obtained by linearization
or by quadratic approximation of a transformed state function ỹi = ỹi(ã, x)
at a certain “design point” ã∗ = ã∗i(x) for each i = 1, . . . , m.
Based on the above procedure, according to (2.9f) we have

pf(x) = P( yi(a(ω), x) ≤ 0 for at least one index i, 1 ≤ i ≤ my )
     ≤ Σ_{i=1}^{my} P( yi(a(ω), x) ≤ 0 ) = Σ_{i=1}^{my} pfi(x),   (2.49a)

where

pfi(x) := P( yi(a(ω), x) ≤ 0 ).   (2.49b)

Assume now that the random ν-vector a = a(ω) can be represented [104] by

a(ω) = U( ã(ω) ),   (2.50a)

where U : IRν → IRν is a certain one-to-one transformation of IRν, and ã = ã(ω) is
an N(0, I)-normally distributed random ν-vector, i.e., with mean vector zero
and covariance matrix equal to the identity matrix I. Assuming that the
ν-vector a = a(ω) has a continuous probability distribution and considering the
conditional distribution functions

Hi = Hi(ai | a1, . . . , ai−1)
of ai = ai(ω), given aj(ω) = aj, j = 1, . . . , i − 1, the random ν-vector
z(ω) := T(a(ω)) defined by the transformation

z1 := H1(a1)
z2 := H2(a2 | a1)
  ⋮
zν := Hν(aν | a1, . . . , aν−1),

cf. [128], has stochastically independent components z1 = z1(ω), . . . , zν =
zν(ω), and z = z(ω) is uniformly distributed on the ν-dimensional hypercube
[0, 1]ν. Thus, if Φ = Φ(t) denotes the distribution function of the univariate
N(0, 1)-normal distribution, then the inverse ã = U^{−1}(a) of the transformation
U can be represented by

ã := ( Φ^{−1}(z1), . . . , Φ^{−1}(zν) )^T,  z = T(a).
In the following, let x be a given, fixed r-vector. Defining then

ỹi(ã, x) := yi( U(ã), x ),   (2.50b)

we get

pfi(x) = P( ỹi(ã(ω), x) ≤ 0 ),  i = 1, . . . , my.   (2.50c)

In the First Order Reliability Method (FORM), cf. [54], the state function
ỹi = ỹi(ã, x) is linearized at a so-called design point ã∗ = ã∗(i). A design point
ã∗ = ã∗(i) is obtained by projecting the origin ã = 0 onto the limit state
surface ỹi(ã, x) = 0, see Fig. 2.5. Since ỹi(ã∗, x) = 0, we get

Fig. 2.5. FORM
ỹ_i(ã, x) ≈ ỹ_i(ã*, x) + ∇_ã ỹ_i(ã*, x)^T (ã − ã*)
          = ∇_ã ỹ_i(ã*, x)^T (ã − ã*)        (2.50d)

and therefore

p_{f i}(x) ≈ P(∇_ã ỹ_i(ã*, x)^T (ã(ω) − ã*) ≤ 0) = Φ(−β_i(x)).        (2.50e)

Here, the reliability index β_i = β_i(x) is given by

β_i(x) := − ∇_ã ỹ_i(ã*, x)^T ã* / ‖∇_ã ỹ_i(ã*, x)‖.        (2.50f)

Note that β_i(x) is the minimum distance from ã = 0 to the limit state surface
ỹ_i(ã, x) = 0 in the ã-domain.
For more details and Second Order Reliability Methods (SORM) based on
second order Taylor expansions, see Chap. 7 and [54, 104].
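For a limit state function that is already linear in ã, the FORM approximation (2.50e), (2.50f) is exact, which allows a simple numerical cross-check. The gradient w and threshold b below are illustrative assumptions:

```python
# Minimal FORM sketch for a *linear* limit state y(a_tilde) = b - w^T a_tilde
# in the standard normal space; failure means y <= 0, i.e. w^T a_tilde >= b.
# For a linear state function (2.50e) is exact: p_f = Phi(-beta), beta = b/||w||.
import math
import random
from statistics import NormalDist

w = [3.0, 4.0]   # gradient of the state function (invented)
b = 10.0         # threshold (invented)
beta = b / math.hypot(*w)          # reliability index, cf. (2.50f)
p_form = NormalDist().cdf(-beta)   # FORM estimate, cf. (2.50e)

# Crude Monte Carlo check of the failure probability
random.seed(1)
n = 200000
fails = sum(
    1 for _ in range(n)
    if w[0] * random.gauss(0, 1) + w[1] * random.gauss(0, 1) >= b
)
p_mc = fails / n
print(beta, p_form, p_mc)
```

For nonlinear state functions the design point has to be found iteratively (projection of the origin onto the limit state surface), and SORM adds curvature corrections.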
4 Deterministic Descent Directions and Efficient Points

4.1 Convex Approximation


According to Sects. 1.2 and 2.1 we consider here deterministic substitute prob-
lems of the type
min F (x) s.t. x ∈ D (4.1a)
with the convex feasible domain D and the expected total cost function

F(x) := E G_0(a(ω), x) + Γ(x),        (4.1b)

where
Γ(x) := E γ(y(a(ω), x)).        (4.1c)

Here, G_0 = G_0(a, x) is the primary cost function (e.g., setup, construction
costs), and y = y(a, x) is the m_y-vector of state functions of the underlying
structure/system/process. The function G_0 = G_0(a, x) and the vector func-
tion y = y(a, x) depend on the ν-vector a of random model parameters and
the r-vector x of decision/design variables. Moreover, γ = γ(y) is a loss func-
tion evaluating violations of the operating conditions y(a, x) ∈ B, see (2.1g).
We assume in this chapter that the expectations in (4.1b), (4.1c) exist and
are finite. Moreover, we suppose that the gradients of F(x), Γ(x) exist and
can be obtained, see formulas (2.23b), (2.23c), by interchanging differentiation
and expectation:

∇F(x) = E ∇_x G_0(a(ω), x) + ∇Γ(x)        (4.2a)

with
∇Γ(x) = E ∇_x y(a(ω), x)^T ∇γ(y(a(ω), x)).        (4.2b)

The main goal of this chapter is to present several methods for the con-
struction of (1) deterministic descent directions h = h(x) for F(x), Γ(x) at
certain points x ∈ IR^r, and (2) so-called efficient points x0 of the stochas-
tic optimization problem (4.1a)–(4.1c). Note that a descent direction for the
function F at a point x is a vector h = h(x) such that

F(x + th) < F(x)  for 0 < t ≤ t_0

with a constant t_0 > 0. The descent directions h = h(x) to be constructed
here do not depend explicitly on the gradients ∇F, ∇Γ of F, Γ, but on some
structural properties of F, Γ. Moreover, efficient points x0 ∈ D are feasible
points of (4.1a)–(4.1c) not admitting any feasible descent direction h(x0)
at x0. Hence, efficient points x0 are candidates for optimal solutions x* of
(4.1a)–(4.1c). For the construction of a descent direction h = h(x) at a point
x = x0, the objective function F = F(x) is first replaced [78] by the mean
value function

F_x0(x) := E(G_0(a(ω), x0) + ∇_x G_0(a(ω), x0)^T (x − x0)) + Γ_x0(x),        (4.3a)

with
Γ_x0(x) := E γ(y(a(ω), x0) + ∇_x y(a(ω), x0)(x − x0)).        (4.3b)

Obviously, F_x0 follows from F by linearization of the primary cost function
G_0 = G_0(a(ω), x) with respect to x at x0, and by "inner linearization" of the
expected recourse cost function Γ = Γ(x) at x0 (Fig. 4.1).
Corresponding to (4.2b), the gradient ∇F_x0(x) is given by

Fig. 4.1. Convex approximation Γ_x0 of Γ at x0


 
∇F_x0(x) = E ∇_x G_0(a(ω), x0) + ∇Γ_x0(x)        (4.3c)

with
∇Γ_x0(x) = E ∇_x y(a(ω), x0)^T ∇γ(y(a(ω), x0) + ∇_x y(a(ω), x0)(x − x0)).        (4.3d)

Under the general assumptions concerning F and Γ, the following basic prop-
erties of F_x0 hold:

Theorem 4.1. (1) If γ is a convex cost function, then Γ_x0 and F_x0 are convex
functions for each x0.
(2) At the point x0 the following equations hold:

Γ_x0(x0) = Γ(x0),  F_x0(x0) = F(x0)        (4.4a)
∇Γ_x0(x0) = ∇Γ(x0),  ∇F_x0(x0) = ∇F(x0).        (4.4b)

(3) Suppose that γ fulfills the Lipschitz condition

|γ(y) − γ(z)| ≤ L ‖y − z‖,  y, z ∈ B        (4.4c)

with a constant L > 0 and a convex set B ⊂ IR^{m_y}. If y = y(a, x) and its
linearization at x0 fulfill

y(a(ω), x) ∈ B,  y(a(ω), x0) + ∇_x y(a(ω), x0)(x − x0) ∈ B  a.s.        (4.4d)

for a point x, then

|Γ(x) − Γ_x0(x)| ≤ L ‖x − x0‖ · E sup_{0≤λ≤1} ‖∇_x y(a(ω), λx + (1 − λ)x0)
                                               − ∇_x y(a(ω), x0)‖.        (4.4e)

Proof. (a) The first term of F_x0 is linear in x, and also the argument of γ in
Γ_x0 is linear in x. Hence, Γ_x0, F_x0 are convex for each convex loss function γ
such that the expectations exist and are finite. (b) The equations in (4.4a),
(4.4b) follow from (4.1b), (4.1c), (4.3a), (4.3b) and from (4.2a), (4.2b), (4.3c),
(4.3d) by putting x = x0. The last part follows by applying the mean value
theorem for vector-valued functions, cf. [29].
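The touching conditions (4.4a) can be checked numerically. In the following sketch the one-dimensional state function y(a, x) = a x² and the loss γ(y) = max(0, y) are invented for illustration; they are not taken from the text:

```python
# Numerical check of Theorem 4.1(2): the inner linearization Gamma_x0 from
# (4.3b) coincides with Gamma at the expansion point x0 (here estimated with
# common random samples, so the two sample means agree exactly).
import random

random.seed(2)
A = [random.gauss(0.0, 1.0) for _ in range(50000)]   # samples of a(omega)

gamma = lambda y: max(0.0, y)          # convex loss (invented)
y = lambda a, x: a * x * x             # state function (invented)
dy_dx = lambda a, x: 2.0 * a * x       # its gradient in x

def Gamma(x):
    return sum(gamma(y(a, x)) for a in A) / len(A)

def Gamma_x0(x, x0):
    # inner linearization (4.3b): gamma(y(a, x0) + y_x(a, x0) * (x - x0))
    return sum(gamma(y(a, x0) + dy_dx(a, x0) * (x - x0)) for a in A) / len(A)

x0 = 1.5
print(Gamma(x0), Gamma_x0(x0, x0))     # equal by construction, cf. (4.4a)
```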

Remark 4.1. (a) If γ = γ(y) is a convex function on IR^{m_y}, then (4.4c) holds
on each bounded closed subset B ⊂ IR^{m_y} with a local Lipschitz constant
L = L(B). (b) An important class of convex loss functions γ having a global
Lipschitz constant L = L_γ are the sublinear functions on IR^{m_y}. A function γ
is called sublinear if it is positively homogeneous and subadditive, hence

γ(λy) = λ γ(y),  λ ≥ 0, y ∈ IR^{m_y}        (4.5a)
γ(y + z) ≤ γ(y) + γ(z),  y, z ∈ IR^{m_y}.        (4.5b)

Obviously, a sublinear function is convex. Furthermore, defining the norm ‖γ‖
of a sublinear function γ on IR^{m_y} by

‖γ‖ := sup{γ(y) : ‖y‖ ≤ 1},        (4.5c)

we have

|γ(y)| ≤ ‖γ‖ · ‖y‖,  y ∈ IR^{m_y}        (4.5d)
|γ(y) − γ(z)| ≤ ‖γ‖ · ‖y − z‖,  y, z ∈ IR^{m_y}.        (4.5e)

Further properties of sublinear functions may be found in [78].
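A simple member of this class is the loss γ(y) = Σ_i w_i max(0, y_i) with weights w_i ≥ 0 (an illustrative choice, not from the text); with respect to the Euclidean norm its norm (4.5c) is ‖w‖, and (4.5a), (4.5b), (4.5e) can be spot-checked on random samples:

```python
# Sketch: gamma(y) = sum_i w_i * max(0, y_i), w_i >= 0, is sublinear, hence
# convex with the global Lipschitz constant ||gamma|| = ||w||, cf. (4.5c)-(4.5e).
import math
import random

w = [2.0, 1.0, 3.0]  # illustrative weights
gamma = lambda y: sum(wi * max(0.0, yi) for wi, yi in zip(w, y))
norm = lambda y: math.sqrt(sum(v * v for v in y))
norm_gamma = norm(w)  # supremum in (4.5c) is attained at y = w / ||w||

random.seed(3)
for _ in range(1000):
    y = [random.uniform(-2, 2) for _ in range(3)]
    z = [random.uniform(-2, 2) for _ in range(3)]
    lam = random.uniform(0, 5)
    # positive homogeneity (4.5a)
    assert abs(gamma([lam * v for v in y]) - lam * gamma(y)) < 1e-9
    # subadditivity (4.5b)
    assert gamma([a + b for a, b in zip(y, z)]) <= gamma(y) + gamma(z) + 1e-9
    # global Lipschitz property (4.5e)
    assert abs(gamma(y) - gamma(z)) <= norm_gamma * norm([a - b for a, b in zip(y, z)]) + 1e-9
print("sublinearity checks passed; ||gamma|| =", norm_gamma)
```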


In the following we suppose now that in (4.1b), (4.1c) differentiation and
expectation can be interchanged. Using Theorem 4.1, for the construction of
descent directions of F at a point x0 we may proceed as follows:
Theorem 4.2. (a) Let x0 ∈ IR^r be a given point. If h = h(x0) is a descent
direction for F_x0 at x0, then h is also a descent direction for F at x0.
(b) Suppose that z is a vector such that F_x0(z) < F_x0(x0). Then h = z − x0
is a descent direction of F_x0 at x0.
(c) If Γ_x0(z) = Γ_x0(x0) for a z ≠ x0, and Γ_x0 is not constant on the line
segment x0z := {λz + (1 − λ)x0 : 0 ≤ λ ≤ 1}, then h = z − x0 is a descent
direction of Γ_x0 at x0.

Proof. (a) Using (4.4b), we have ∇F(x0)^T h = ∇F_x0(x0)^T h < 0. Hence, h is
also a descent direction of F at x0. (b) Since F_x0 is a differentiable, convex
function on IR^r, we obtain

∇F_x0(x0)^T h = ∇F_x0(x0)^T (z − x0) ≤ F_x0(z) − F_x0(x0) < 0.

Thus, h is a descent direction of F_x0 at x0. (c) In this case there is a vector
w = λz + (1 − λ)x0 = x0 + λ(z − x0), 0 < λ < 1, such that Γ_x0(w) < Γ_x0(x0).
Thus,

∇Γ_x0(x0)^T λh = ∇Γ_x0(x0)^T (w − x0) ≤ Γ_x0(w) − Γ_x0(x0) < 0

and therefore ∇Γ_x0(x0)^T h < 0.

Deterministic descent directions h of F at x0 can then be obtained in two
ways:

(1) Separate linear inequality:

Corollary 4.1. For a given point x0 ∈ IR^r, suppose that there is a vector
z ∈ IR^r such that

E ∇_x G_0(a(ω), x0)^T (z − x0) < (≤) 0        (4.6a)
Γ_x0(z) < (=) Γ_x0(x0),        (4.6b)

where in case "=" the function Γ_x0 is not constant on x0z, or (4.6a) holds
with "<". Then h = z − x0 is a descent direction for F at x0.

Proof. The assertion follows from

F_x0(z) − F_x0(x0) = E ∇_x G_0(a(ω), x0)^T (z − x0) + Γ_x0(z) − Γ_x0(x0).

(2) Enlarged loss function:

Defining the enlarged state function ỹ = ỹ(a, x) and the enlarged loss function
γ̃ = γ̃(ỹ) by

ỹ(a, x) := (G_0(a, x), y(a, x)),  γ̃(ỹ) = γ̃(t, y) := t + γ(y),        (4.7a)

we have
F(x) = E γ̃(ỹ(a(ω), x)).        (4.7b)

Deterministic descent directions may then be found by applying Theorems 4.1
and 4.2 directly to the representation (4.7a), (4.7b) of F.
Obviously, the convex approximation F_x0 of F at x0 can be represented by

F_x0(x) = c_0 + c̄^T x + E γ(A(ω)x − b(ω)),        (4.8a)

where

c_0 := E G_0(a(ω), x0) − E ∇_x G_0(a(ω), x0)^T x0        (4.8b)
c̄ := E ∇_x G_0(a(ω), x0).        (4.8c)

Furthermore, (A(ω), b(ω)) is the random m_y × (r + 1) matrix given by

A(ω) := ∇_x y(a(ω), x0)        (4.8d)
b(ω) := −(y(a(ω), x0) − ∇_x y(a(ω), x0) x0).        (4.8e)
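The data (4.8c)–(4.8e) can be assembled from samples of a(ω); the following minimal sketch uses one linear state function y(a, x) = a_1 x_1 + a_2 x_2 − a_3 and a linear primary cost G_0(a, x) = a_1 x_1 + a_2 x_2, both invented for illustration:

```python
# Sketch of the representation (4.8a)-(4.8e) of the convex approximation F_x0.
# With a linear state function the linearization is exact; here m_y = 1, r = 2.
import random

random.seed(4)
samples = [(random.gauss(1, 0.2), random.gauss(2, 0.2), random.gauss(3, 0.5))
           for _ in range(10000)]       # samples of a(omega) (invented law)
x0 = (1.0, 1.0)

def rows(a):
    """Return the single row A(omega) per (4.8d) and b(omega) per (4.8e)."""
    a1, a2, a3 = a
    grad = (a1, a2)                                   # nabla_x y(a, x0)
    y0 = a1 * x0[0] + a2 * x0[1] - a3                 # y(a, x0)
    b = -(y0 - (grad[0] * x0[0] + grad[1] * x0[1]))   # (4.8e); here b = a3
    return grad, b

# Mean cost coefficients (4.8c): c_bar = E nabla_x G0(a, x0) = (E a1, E a2)
c_bar = tuple(sum(s[i] for s in samples) / len(samples) for i in (0, 1))
A_b = [rows(a) for a in samples]
b_mean = sum(b for _, b in A_b) / len(A_b)
print(c_bar, b_mean)
```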

4.1.1 Approximative Convex Optimization Problem

According to the above results, the construction of deterministic descent
directions of F(x) = E G_0(a(ω), x) + E γ(y(a(ω), x)) at a point x0, as well as
the approximate solution of the (nonconvex) optimization problem
(4.1a)–(4.1c), can be reduced to a convex optimization problem

min F(x) s.t. x ∈ D        (4.9a)

with the convex objective

F(x) := c̄^T x + E u(A(ω)x − b(ω)).        (4.9b)
 
Here, c̄ is a given, fixed r-vector of mean cost coefficients, and A(ω), b(ω)
is a random m × (r + 1) matrix. Furthermore, u = u(y) is a convex loss
function such that the expectation in (4.9b) exists and is finite for all x under
consideration. We still suppose that u has the following partial monotonicity
property:

yI ≤ zI , yII = zII ⇒ u(y) ≤ u(z) (4.10a)


yI ≤ zI , yI = zI , yII = zII ⇒ u(y) < u(z), (4.10b)

where y, z ∈ IRm , yI := (yi )i∈J , yII := (yi )i ∈J


| with a certain index set
J ⊂ {1, . . . , m}. Note that inequalities a ≤ b, a < b for vectors a, b ∈ IRm
are defined componentwise.
The construction of descent directions for F at a point x0 depends on
the type of the probability distribution of (A(ω), b(ω)) and the functional
properties of u. Thus, we need the following definitions:

Definition 4.1. The loss function u is called (partly) monotonous increasing
or strictly (partly) monotonous increasing if u fulfills (4.10a), (4.10b), respec-
tively. Let C^J, C^{JJ} denote the sets of all convex functions u on IR^m fulfilling
(4.10a), (4.10b), respectively. Moreover, let Ĉ^J, Ĉ^{JJ} be the sets of all strictly
convex functions fulfilling (4.10a), (4.10b), respectively.
In order to guarantee that the expectations under consideration exist and are
finite, we take loss functions u from the following subclasses:

Definition 4.2. Let C^J(P), C^{JJ}(P), Ĉ^J(P), Ĉ^{JJ}(P) designate the sets of all
u ∈ C^J, C^{JJ}, Ĉ^J, Ĉ^{JJ}, resp., such that the expectation, i.e., the integral with
respect to the probability measure P,

F_0(x) := E u(A(ω)x − b(ω)) = ∫ u(A(ω)x − b(ω)) P(dω)        (4.11)

exists and is finite for all x ∈ IR^r. Finally, let C^J_sep, C^{JJ}_sep, Ĉ^J_sep, Ĉ^{JJ}_sep and
C^J_sep(P), C^{JJ}_sep(P), Ĉ^J_sep(P), Ĉ^{JJ}_sep(P) denote the subsets of all separable
elements (i.e., u(z) := Σ_{i=1}^m u_i(z_i)) of C^J, . . . , Ĉ^{JJ} and C^J(P), . . . , Ĉ^{JJ}(P),
respectively.

Note that C^J ⊃ C^{JJ} ⊃ Ĉ^{JJ} and C^J ⊃ Ĉ^J ⊃ Ĉ^{JJ}, where corresponding
inclusions hold also for the subsets of separable or integrable u.
Solving now problem (4.9a), (4.9b), besides the well-known difficulties in
minimizing mean value functions, the loss function u should be known exactly.
However, in practice there is always some uncertainty about the appropriate
selection of u, e.g., due to difficulties in assigning appropriate penalty costs
u(z) to the deviation z = A(ω)x − b(ω) between the output A(ω)x of the
system x → A(ω)x and the target or upper bound b(ω). Consequently, in
the following we want to modify [76, 77, 81, 83, 84, 90, 93] the approximate
problem (4.9a), (4.9b) in such a way that we need only some obvious functional
properties (e.g., monotonicity, convexity) of u. Using certain moments of the
random vector Ã(ω)x − b̃(ω) := (c(ω)^T x, A(ω)x − b(ω))^T, we find new substitute
problems for (4.9a), (4.9b), and then also for (4.1a)–(4.1c), having the following
properties: (1) only certain moments of (Ã(ω), b̃(ω)) are needed, and (2)
compromise solutions, called efficient solutions of (4.1a)–(4.1c), are obtained
which are valid for a large class of loss functions u.

4.2 Computation of Descent Directions in Case
of Normal Distributions

In the following we suppose that the random m × (n + 1) matrix (A(ω), b(ω))
has a normal distribution with mean

(Ā, b̄) := E(A(ω), b(ω))        (4.12a)

and covariance matrix

Q = cov((A(·), b(·))) = (Q_ij)_{i,j=1,...,m}.        (4.12b)

The (n + 1) × (n + 1) matrix

Q_ij := cov((A_i(·), b_i(·)), (A_j(·), b_j(·))) = Q_ji^T        (4.12c)

denotes the covariance matrix of the ith and the jth rows (A_i(ω), b_i(ω)),
(A_j(ω), b_j(ω)) of (A(ω), b(ω)). Consequently, δ(ω, x) := A(ω)x − b(ω) is
normally distributed with mean Āx − b̄ and covariance matrix

Q_x := cov(A(·)x − b(·)) = (g_ij(x))_{i,j=1,...,m}.        (4.13a)

The functions g_ij(x) are given by

g_ij(x) = x̂^T Q_ij x̂ = x̂^T Q_ji x̂ = x̂^T ((Q_ij + Q_ji)/2) x̂  with  x̂ := (x, −1)^T;        (4.13b)

especially, the variance of δ_i(ω, x) = A_i(ω)x − b_i(ω) is given by

g_ii(x) = var(A_i(·)x − b_i(·)) = x̂^T Q_ii x̂.        (4.13c)
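The identity (4.13c) can be verified by simulation. In the sketch below the row (A_i, b_i) has mean µ and covariance Q_ii = L L^T; the numbers µ and L are illustrative assumptions:

```python
# Sketch of (4.13c): var(A_i(omega) x - b_i(omega)) = x_hat^T Q_ii x_hat,
# where x_hat = (x, -1). Here r = 2, so the row (A_i, b_i) is a 3-vector.
import random

mu = [1.0, 2.0, 0.5]                 # mean of the row (A_i, b_i) (invented)
L = [[0.5, 0.0, 0.0],
     [0.2, 0.4, 0.0],
     [0.1, 0.1, 0.3]]                # factor with Q_ii = L L^T (invented)

def quad_form(Q, v):
    return sum(Q[i][j] * v[i] * v[j] for i in range(3) for j in range(3))

Q_ii = [[sum(L[i][k] * L[j][k] for k in range(3)) for j in range(3)]
        for i in range(3)]
x = [1.0, -2.0]
x_hat = x + [-1.0]
g_ii = quad_form(Q_ii, x_hat)        # (4.13c)

random.seed(5)
deltas = []
for _ in range(100000):
    xi = [random.gauss(0, 1) for _ in range(3)]
    row = [mu[i] + sum(L[i][k] * xi[k] for k in range(3)) for i in range(3)]
    deltas.append(row[0] * x[0] + row[1] * x[1] - row[2])
mean = sum(deltas) / len(deltas)
var = sum((d - mean) ** 2 for d in deltas) / len(deltas)
print(g_ii, var)
```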

Derivative-free descent directions of the objective function F can be ob-
tained, as developed in [83, 84, 90], by using the relations given below. For
the sake of simplicity, we assume first that u ∈ C^J_sep(P). In this case, with
Ā_I := (Ā_i)_{i∈J}, Ā_II := (Ā_i)_{i∉J}, the relations read:

c̄^T x ≥ c̄^T y        (4.14a)
Ā_I x ≥ Ā_I y,  Ā_II x = Ā_II y        (4.14b), (4.14c)
g_ii(x) ≥ g_ii(y),  i = 1, . . . , m.        (4.14d)

Theorem 4.3. (a) If the vectors x, y ∈ IR^r fulfill relations (4.14a)–(4.14d),
then F(x) ≥ F(y) for each u ∈ C^J_sep(P). (b) Let x, y ∈ IR^r be related according
to (4.14a)–(4.14d). Then F(x) > F(y) for each u ∈ C^J_sep(P), C^{JJ}_sep(P), Ĉ^J_sep(P),
resp., if in addition the corresponding one of the following conditions holds:

c̄^T x > c̄^T y        (4.14a′)
Ā_i x > Ā_i y for some i ∈ J        (4.14b′)
g_ii(x) > g_ii(y) for some i, 1 ≤ i ≤ m.        (4.14d′)

Proof. According to the assumptions, u_i is a (strictly) monotonous nonde-
creasing, (strictly) convex function for each i ∈ J. Furthermore, the objective
F reads

F(x) = c̄^T x + Σ_{i=1}^m E u_i(Ā_i x − b̄_i + √(g_ii(x)) ξ_i(ω)),

where ξ_i = ξ_i(ω) is a standard normally distributed random variable. This
representation of F then yields the assertion F(x) ≥ (>) F(y) in the case of
relations (4.14a)–(4.14d), with "=" in (4.14d), and (4.14a′), (4.14b′). The
proof for the general case (4.14d) and case (4.14d′) is contained in the proof
of the next theorem.
Remark 4.2. (a) Obviously, relations (4.14a)–(4.14d), (4.14a′), (4.14b′),
(4.14d′), involving only first and second order moments of (A(ω), b(ω), c(ω)),
are linear or quadratic in y. (b) The constraint "x ∈ D" in (4.1a) is taken
into account by adding to (4.14a)–(4.14d) the condition

y ∈ D,        (4.14e)

where in the following D may be any convex subset of IR^r.
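For the separable loss u(z) = Σ_i max(0, z_i), the one-dimensional expectations appearing in the proof above are available in closed form (E max(0, µ + σξ) = µΦ(µ/σ) + σφ(µ/σ)), so the monotonicity claimed in Theorem 4.3(a) can be checked exactly; the numerical data below are illustrative assumptions:

```python
# Sketch checking Theorem 4.3(a) with the separable loss u(z) = sum_i max(0, z_i),
# an illustrative element of C^J_sep(P). F(x) = c_bar^T x + sum_i E u_i(...) is
# evaluated in closed form via the standard normal cdf Phi and density phi.
import math
from statistics import NormalDist

N = NormalDist()
phi = lambda t: math.exp(-0.5 * t * t) / math.sqrt(2 * math.pi)

def expected_pos_part(mu, sigma):
    """E max(0, mu + sigma * xi) for xi ~ N(0, 1)."""
    if sigma == 0.0:
        return max(0.0, mu)
    return mu * N.cdf(mu / sigma) + sigma * phi(mu / sigma)

def F(c_bar_x, mus, sigmas):
    # mus[i] = A_bar_i x - b_bar_i, sigmas[i] = sqrt(g_ii(x))
    return c_bar_x + sum(expected_pos_part(m, s) for m, s in zip(mus, sigmas))

# x and y related as in (4.14a)-(4.14d): y has smaller means and variances.
Fx = F(c_bar_x=3.0, mus=[1.0, 0.5], sigmas=[2.0, 1.5])
Fy = F(c_bar_x=2.5, mus=[0.8, 0.5], sigmas=[1.5, 1.5])
print(Fx, Fy, Fx >= Fy)
```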



For more general loss functions u ∈ C^J(P), C^{JJ}(P), Ĉ^J(P), Ĉ^{JJ}(P),
resp., relations (4.14a)–(4.14e) must be replaced by the following stronger
conditions:

c̄^T x ≥ c̄^T y        (4.15a)
Ā_I x ≥ Ā_I y,  Ā_II x = Ā_II y        (4.15b), (4.15c)
Q_x ⪰ Q_y (i.e., Q_x − Q_y is positive semidefinite)        (4.15d)
y ∈ D.        (4.15e)

Corresponding to Theorem 4.3, here we have the following result [83, 84]:

Theorem 4.4. (a) If the r-vectors x, y fulfill relations (4.15a)–(4.15d), then
F(x) ≥ F(y) for each u ∈ C^J(P). (b) Let x, y ∈ IR^r be related according
to (4.15a)–(4.15d). Then F(x) > F(y) for each u ∈ C^J(P), C^{JJ}(P), Ĉ^J(P),
resp., if in addition the corresponding one of the following conditions holds:

c̄^T x > c̄^T y        (4.15a′)
Ā_i x > Ā_i y for some i ∈ J        (4.15b′)
Q_x ⪰ Q_y,  Q_x ≠ Q_y.        (4.15d′)
Proof. According to (4.9b), (4.11) we have F(x) = c̄^T x + F_0(x) with F_0(x) =
E u(A(ω)x − b(ω)), where u ∈ C^J(P), C^{JJ}(P), Ĉ^J(P), respectively. Thus, if
(4.15a), (4.15a′), resp., holds, then F(x) ≥ (>) c̄^T y + F_0(x), and in the rest of
the proof we have to consider F_0 only. Defining

Y_x = Y_x(ω) := (A(ω)x − b(ω)) − (Āx − b̄) = (A(ω) − Ā)x − (b(ω) − b̄),

we find that

F_0(x) = E u(Āx − b̄ + Y_x(ω)) ≥ (>) E u(Āy − b̄ + Y_x(ω)) =: F̃_0,

provided that (4.15b), (4.15b′) holds and u ∈ C^J(P), u ∈ C^{JJ}(P), respectively.
Since Y_x(ω) is an N(0, Q_x)-normally distributed random m-vector, the charac-
teristic functions P̂_x^(0), P̂_y^(0) of the distributions P_x^(0), P_y^(0) of Y_x, Y_y, resp., are
given by

P̂_x^(0)(z) := E exp(i Y_x(ω)^T z) = exp(−½ z^T Q_x z),
P̂_y^(0)(z) := E exp(i Y_y(ω)^T z) = exp(−½ z^T Q_y z),  z ∈ IR^m.

Supposing now that (4.15d), (4.15d′), resp., holds, we consider the character-
istic function

K̂(z) := exp(−½ z^T (Q_x − Q_y) z),  z ∈ IR^m,

of the N(0, Q_x − Q_y)-normal distribution K on IR^m. We find

P̂_y^(0)(z) K̂(z) = P̂_x^(0)(z),  z ∈ IR^m,

hence, P_x^(0) is the convolution P_x^(0) = K ∗ P_y^(0) of K and P_y^(0). Defining ũ(z) :=
u(Āy − b̄ + z), z ∈ IR^m, we obtain

F̃_0 = E ũ(Y_x(ω)) = ∫ ũ(z) P_x^(0)(dz) = ∫ ũ(z) (K ∗ P_y^(0))(dz)
    = ∫∫ ũ(v + w) K(dw) P_y^(0)(dv) = ∫ J(v) P_y^(0)(dv),

where J(v) := ∫ ũ(v + w) K(dw), v ∈ IR^m. For any u ∈ C^J(P) we have that

ũ(v + w) − ũ(v) ≥ g^T (v + w − v) = g^T w,  w ∈ IR^m,

for all v ∈ IR^m and a subgradient [78] g ∈ ∂ũ(v). Consequently,

J(v) ≥ ∫ (ũ(v) + g^T w) K(dw) = ũ(v)  for all v ∈ IR^m,

and therefore

F̃_0 = ∫ J(v) P_y^(0)(dv) ≥ ∫ ũ(v) P_y^(0)(dv) = F_0(y).

If u ∈ Ĉ^J(P), then

ũ(v + w) − ũ(v) > g^T w,  w ≠ 0,

for all v ∈ IR^m and a subgradient g ∈ ∂ũ(v). In case of (4.15d′) we get

J(v) = ∫ ũ(v + w) K(dw) > ∫ (ũ(v) + g^T w) K(dw) = ũ(v)  for all v ∈ IR^m,

and therefore

F̃_0 = ∫ J(v) P_y^(0)(dv) > ∫ ũ(v) P_y^(0)(dv) = F_0(y),

which concludes the proof.


There are several sufficient conditions for (4.15d) implying (4.14d). Using,
e.g., Gersgorin’s circle theorem, we find that (4.15d), (4.15d ), is implied by
m  
 
gii (x) − gii (y) ≥ gij (x) − gij (y), i = 1, 2, . . . , m, (4.16a)
j=1
j=i

(4.16a) holds with gii (x) > gii (y) at least once, (4.16a )

respectively. Clearly, (4.16a ) implies (4.14d ).
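The row-dominance condition (4.16a) is easy to test directly, and for m = 2 positive semidefiniteness of Q_x − Q_y can in addition be confirmed by the principal minors. The matrix of differences below is an illustrative assumption:

```python
# Sketch of the sufficient condition (4.16a): if each diagonal gain dominates
# the row sum of off-diagonal changes, the symmetric matrix D = Q_x - Q_y is
# weakly diagonally dominant with nonnegative diagonal, hence positive
# semidefinite by Gershgorin's circle theorem.
def gershgorin_dominant(D):
    m = len(D)
    return all(D[i][i] >= sum(abs(D[i][j]) for j in range(m) if j != i)
               for i in range(m))

# D = (g_ij(x) - g_ij(y))_{i,j} for some pair x, y (m = 2, invented values)
D = [[0.6, -0.3],
     [-0.3, 0.5]]

dominant = gershgorin_dominant(D)     # (4.16a) holds
# For a symmetric 2 x 2 matrix, psd <=> nonnegative diagonal and determinant.
psd = D[0][0] >= 0 and D[1][1] >= 0 and D[0][0] * D[1][1] - D[0][1] * D[1][0] >= 0
print(dominant, psd)
```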



Though, for practical purposes, (4.16a), (4.16a′), with given x, can be
solved for y by certain search techniques, for analytical considerations we
need the following representation of (4.16a): For each i = 1, . . . , m, let Σ_i
designate the set of all (m − 1)-vectors σ_i = (σ_{i1}, . . . , σ_{i,i−1}, σ_{i,i+1}, . . . , σ_{im})^T
such that σ_ij ∈ {−1, +1} for all j ≠ i. Obviously, (4.16a), (4.16a′) is satisfied
if and only if

g_{iσ_i}(x) ≥ g_{iσ_i}(y)  for all σ_i ∈ Σ_i, i = 1, . . . , m,        (4.16b)

(4.16b) holds with g_ii(x) > g_ii(y) for some 1 ≤ i ≤ m,        (4.16b′)

resp., where, cf. (4.13b), (4.13c),

g_{iσ_i}(x) := g_ii(x) − Σ_{j≠i} σ_ij g_ij(x) = x̂^T (Q_ii − Σ_{j≠i} σ_ij Q_ij) x̂
           = x̂^T (Q_ii − Σ_{j≠i} σ_ij (Q_ij + Q_ji)/2) x̂.        (4.16c)

With a given vector x, condition (4.16b), (4.16b′), resp., may contain at
most m 2^{m−1} inequalities for y. However, in practice (4.16b), (4.16b′) contain
much fewer constraints for y, since mostly some of the covariance matrices
Q_ij (= Q_ji^T), j ≠ i, are zero.
A very interesting situation certainly occurs if, for a given i, 1 ≤ i ≤ m,

Q_ii(σ_i) := Q_ii − Σ_{j≠i} σ_ij (Q_ij + Q_ji)/2  is positive (semi)definite
for all σ_i ∈ Σ_i.        (4.16d)

If (4.16d) holds, then, cf. (4.16c),

g_{iσ_i}(y) = ŷ^T Q_ii(σ_i) ŷ

is (strictly) convex for each σ_i ∈ Σ_i, see (4.16b).

4.2.1 Descent Directions of Convex Programs

If the r-vectors x, y are related such that F(x) ≥ F(y), y ≠ x, then h = y − x
is a descent direction for F at x, provided that F(x) > F(y) or F is not
constant on the line segment xy := {λx + (1 − λ)y : 0 ≤ λ ≤ 1} joining x and
y. This suggests the following definition:

Definition 4.3. C^J(P, D) := {u ∈ C^J(P) : F(x) = c̄^T x + E u(A(ω)x − b(ω))
is not constant on arbitrary line segments xy of D}. C^J_sep(P, D) := C^J_sep(P) ∩
C^J(P, D).
106 4 Deterministic Descent Directions and Efficient Points

Obviously, we have

c(ω)T x t
F (x) = Fu (x) := Ev with v := t + u(z). (4.17a)
A(ω)x − b(ω) z

By Lemma 2.2 in [84] we know then that F is constant on a line segment


xy, y = x, if and only if F (x) = F (y) and (“a.s.” means with probability one)
    
u λ A(ω)x − b(ω) + (1 − λ) A(ω)y − b(ω)
   
= λu A(ω)x − b(ω) + (1 − λ)u A(w)y − b(w) a.s., 0 ≤ λ ≤ 1. (4.17b)

For convex functions u being strictly convex with respect to a certain
variable z_ι, 1 ≤ ι ≤ m, (4.17b) has the following consequences:

Lemma 4.1. Let u ∈ C^J(P) be a loss function such that, for some index
1 ≤ ι ≤ m,

u(λz + (1 − λ)w) < λ u(z) + (1 − λ) u(w)  if z_ι ≠ w_ι and 0 < λ < 1.        (4.17c)

If (4.17b) holds, then A_ι(ω)(y − x) = 0 a.s. and R_ιι(y − x) = 0, where R_ιι
denotes the covariance matrix of A_ι(ω).

Corollary 4.2. Let u ∈ C^J(P) be a loss function fulfilling (4.17c). If R_ιι is
regular for some 1 ≤ ι ≤ m, then F = F_u is strictly convex on IR^r.

Thus, we have the following inclusion:

Corollary 4.3. C^J(P, D) ⊃ {u ∈ C^J(P) : there exists at least one index ι,
1 ≤ ι ≤ m, such that R_ιι is regular and (4.17c) holds}.

Of course, a related inclusion also holds for C^J_sep(P, D).
Now we have the following consequences of Theorems 4.3, 4.4:

Corollary 4.4 (Construction of descent directions of F at x without
using derivatives of F).
(a) Let u ∈ C^J_sep(P), u ∈ C^J(P), respectively. If x, y ≠ x, are related according
to (4.14a)–(4.14d), (4.15a)–(4.15d), resp., then h := y − x is a descent
direction for F at x, provided that F is not constant on xy (e.g., c̄^T x > c̄^T y).
(b) If x, y ≠ x, fulfill relations (4.14a)–(4.14d), (4.15a)–(4.15d) and u ∈
C^{JJ}_sep, C^{JJ}, resp., then h = y − x is a descent direction for F at x, pro-
vided that Ā_i x > Ā_i y for some i ∈ J.
(c) Let x, y ≠ x, fulfill (4.14a)–(4.14d), (4.15a)–(4.15d), and consider u ∈
Ĉ^J_sep, Ĉ^J, respectively. Then h = y − x is a descent direction for F at x,
provided that g_ii(x) > g_ii(y) for some i, 1 ≤ i ≤ m, Q_x ≠ Q_y, respectively.

Remark 4.3. (a) Feasible descent directions h = y − x for F at x ∈ D are ob-


tained if besides (4.14a)–(4.14d), (4.15a)–(4.15d) also (4.14e), (4.15e), resp., is
taken into consideration. (b) Besides descent directions for F , in the following
we are also looking for necessary optimality conditions for (4.9a), (4.9b) with-
out using derivatives of F . Obviously, if x∗ is an optimal solution of (4.9a),
(4.9b), then F admits no feasible descent directions at x∗ .

Because of Theorems 4.3, 4.4, Corollary 4.4 and the above remark, we are
looking now for solutions of (4.14a)–(4.14d) and (4.15a)–(4.15d).
Representing first the symmetric (r + 1) × (r + 1) matrix ½ (Q_ij + Q_ji) by

½ (Q_ij + Q_ji) = ( R_ij    d_ij
                    d_ij^T  q_ij ),        (4.18a)

where R_ij is a symmetric r × r matrix, d_ij ∈ IR^r and q_ij ∈ IR, we get

g_ij(x) = x^T R_ij x − 2 x^T d_ij + q_ij,        (4.18b)

see (4.13b). Since g_ii is convex for each i = 1, . . . , m, we find

g_ii(y) ≥ g_ii(x) + ∇g_ii(x)^T (y − x),        (4.18c)

where in (4.18c) the strict inequality ">" holds for y ≠ x if R_ii is posi-
tive definite. Thus, since (4.15a)–(4.15d) implies (4.14a)–(4.14d), we have the
following lemma:

Lemma 4.2 (Necessary conditions for (4.14a)–(4.14d), (4.15a)–(4.15d)).
If y ≠ x fulfills (4.14a)–(4.14d) or (4.15a)–(4.15d) with given x, then
h = y − x satisfies the linear constraints

c̄^T h ≤ 0        (4.19a)
Ā_I h ≤ 0,  Ā_II h = 0        (4.19b), (4.19c)
(R_ii x − d_ii)^T h ≤ 0 (< 0, if R_ii is positive definite),  i = 1, . . . , m.        (4.19d)
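Checking the necessary conditions (4.19a)–(4.19d) for a candidate direction h amounts to a few inner products; all data below (c̄, the rows of Ā_I, Ā_II, and the vectors R_ii x − d_ii) are illustrative assumptions:

```python
# Sketch: verifying the linear necessary conditions (4.19a)-(4.19d) of
# Lemma 4.2 for a candidate direction h.
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def satisfies_4_19(h, c_bar, A_I_rows, A_II_rows, Rx_minus_d, tol=1e-9):
    if dot(c_bar, h) > tol:                               # (4.19a)
        return False
    if any(dot(row, h) > tol for row in A_I_rows):        # (4.19b)
        return False
    if any(abs(dot(row, h)) > tol for row in A_II_rows):  # (4.19c)
        return False
    return all(dot(v, h) <= tol for v in Rx_minus_d)      # (4.19d)

c_bar = [1.0, 1.0]
A_I_rows = [[2.0, 0.0]]       # rows with index in J (invented)
A_II_rows = []                # no equality rows in this toy instance
Rx_minus_d = [[0.5, 0.5]]     # vectors R_ii x - d_ii (invented)
h = [-1.0, 0.5]
print(satisfies_4_19(h, c_bar, A_I_rows, A_II_rows, Rx_minus_d))
```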

Remark 4.4. If the cone C_x of feasible directions for D at x ∈ D is represented
by C_x = {h ∈ IR^r \ {0} : U_x^(1) h ≤ 0, U_x^(2) h < 0}, where U_x^(1), U_x^(2), resp., are
ν_x^(1) × r, ν_x^(2) × r matrices with ν_x^(1), ν_x^(2) ≥ 0, then the remaining relations
(4.14e), (4.15e) can be described by

U_x^(1) h ≤ 0,  U_x^(2) h < 0.        (4.19e), (4.19f)

Note that ν_x^(1) = 0 and ν_x^(2) = 0 if x ∈ int D (:= interior of D).
Using now theorems of the alternative [74], for the existence of solutions
h ≠ 0 of (4.19a)–(4.19f) we obtain the following characterization:

Lemma 4.3. (a) Suppose that R_ii is positive definite for i ∈ K, where ∅ ≠
K ⊂ {1, . . . , m}, or ν_x^(2) > 0. Then either (4.19a)–(4.19f) has a solution
h (≠ 0), or there exist multipliers

γ_i ≥ 0, 0 ≤ i ≤ m,  λ_I ∈ IR_+^{|J|},  λ_II ∈ IR^{m−|J|},  π^(1) ∈ IR_+^{ν_x^(1)},  π^(2) ∈ IR_+^{ν_x^(2)}        (4.20a)

with (γ_i)_{i∈K} ≠ 0 or π^(2) ≠ 0 such that

γ_0 c̄ + Σ_{i=1}^m γ_i (R_ii x − d_ii) + Ā_I^T λ_I + Ā_II^T λ_II + U_x^(1)T π^(1) + U_x^(2)T π^(2) = 0.        (4.20b)

(b) Suppose that R_ii is singular for each i = 1, 2, . . . , m and ν_x^(2) = 0 (i.e.,
U_x^(2) is canceled). Then either (4.19a)–(4.19f) has a solution h (≠ 0) such
that the strict inequality "<" holds at least once in the 1 + |J| + m + ν_x^(1)
inequalities of (4.19a)–(4.19f), or there exist multipliers

γ_i > 0, i = 0, 1, . . . , m,  λ_I > 0,  λ_II ∈ IR^{m−|J|},  π^(1) > 0        (4.20a′)

such that (4.20b) holds.
(c) If ν_x^(2) = 0, then (4.19a)–(4.19f) has a solution h ≠ 0 such that "="
holds everywhere in (4.19a)–(4.19f) if and only if rank Λ_x < r for the
(1 + 2m + ν_x^(1)) × r matrix Λ_x having the rows c̄^T, Ā_i, i = 1, . . . , m,
(R_ii x − d_ii)^T, i = 1, . . . , m, U_xj^(1), j = 1, . . . , ν_x^(1).

4.2.2 Solution of the Auxiliary Programs

For any given x ∈ D, solutions y ≠ x of (4.14a)–(4.14e), (4.15a)–(4.15e)
can be obtained by random search methods, modified Newton methods, and
by solving certain optimization problems related to (4.14a)–(4.14e),
(4.15a)–(4.15e).

Solution of (4.14a)–(4.14e). Here we have the program

min f(y)  s.t. (4.14a)–(4.14c), (4.14e),        (4.21a)

where
f(y) := max{g_ii(y) − g_ii(x) : 1 ≤ i ≤ m}.        (4.21b)

Note. Clearly, also any other selection of functions from {c̄^T y − c̄^T x, Ā_i y −
Ā_i x, i ∈ J, g_ii(y) − g_ii(x), i = 1, . . . , m} can be used to define the objective
function f in (4.21a).
Obviously, (4.21a), (4.21b) is equivalent to

min t        (4.21a′)

s.t.
c̄^T y ≤ c̄^T x        (4.21b′)
Ā_I y ≤ Ā_I x,  Ā_II y = Ā_II x        (4.21c′), (4.21d′)
g_ii(y) − g_ii(x) ≤ t,  i = 1, . . . , m        (4.21e′)
y ∈ D.        (4.21f′)

It is easy to see that (4.21a), (4.21b) and (4.21a′)–(4.21f′) are convex programs
having for x ∈ D always the feasible solution y = x, (y, t) = (x, 0). Optimal
solutions y*, f(y*) ≤ 0, (y*, t*), t* ≤ 0, resp., exist under weak assumptions.
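The convex program (4.21a′)–(4.21f′) can be solved by standard methods; the following sketch applies projected gradient descent to a tiny invented instance (m = 1, r = 2, D = IR², J = ∅), where the single constraint c̄^T y ≤ c̄^T x is handled by Euclidean projection onto the halfspace:

```python
# Sketch: solving (4.21a')-(4.21f') for an invented instance with m = 1,
# g_11(y) = ||y - d||^2 + const, minimized subject to c_bar^T y <= c_bar^T x.
def project_halfspace(y, c, rhs):
    """Euclidean projection of y onto {v : c^T v <= rhs}."""
    viol = c[0] * y[0] + c[1] * y[1] - rhs
    if viol <= 0:
        return y
    nn = c[0] ** 2 + c[1] ** 2
    return [y[0] - viol * c[0] / nn, y[1] - viol * c[1] / nn]

c_bar = [1.0, 1.0]                    # mean cost coefficients (invented)
d = [2.0, 0.0]                        # from the block form (4.18a) (invented)
g11 = lambda y: (y[0] - d[0]) ** 2 + (y[1] - d[1]) ** 2 + 3.0 - (d[0] ** 2 + d[1] ** 2)
x = [0.0, 0.0]
rhs = c_bar[0] * x[0] + c_bar[1] * x[1]   # c_bar^T x

y = list(x)                           # start at the trivial feasible point
for _ in range(500):
    grad = [2 * (y[0] - d[0]), 2 * (y[1] - d[1])]
    y = [y[0] - 0.1 * grad[0], y[1] - 0.1 * grad[1]]
    y = project_halfspace(y, c_bar, rhs)

t_star = g11(y) - g11(x)              # optimal value of (4.21a')
print([round(v, 3) for v in y], round(t_star, 3))
```

Here t* < 0, so Case (i) below applies and h = y* − x is a feasible descent direction.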
Further basic properties of the above program are shown next:
Lemma 4.4. Let x ∈ D be any given feasible point.
(a) If y*, (y*, t*) is optimal in (4.21a), (4.21b), (4.21a′)–(4.21f′), resp., then
y* solves (4.14a)–(4.14d).
(b) If y, (y, t) is feasible in (4.21a), (4.21b), (4.21a′)–(4.21f′), then y solves
(4.14a)–(4.14d), provided that f(y) ≤ 0, t ≤ 0, respectively. If y solves
(4.14a)–(4.14d), then y, (y, t), t_0 ≤ t ≤ 0, is feasible in (4.21a), (4.21b),
(4.21a′)–(4.21f′), resp., where t_0 := f(y) ≤ 0.
(c) (4.14a)–(4.14d) has the unique solution y = x if and only if (4.21a′)–
(4.21f′) has the unique optimal solution (y*, t*) = (x, 0).

Proof. Assertions (a) and (b) are clear. Suppose now that (4.14a)–(4.14d)
has the unique solution y = x. Obviously, (y*, t*) = (x, 0) is then optimal in
(4.21a′)–(4.21f′). Assuming that there is another optimal solution (y**, t**) ≠
(x, 0), t** ≤ 0, of (4.21a′)–(4.21f′), we know that y** solves (4.14a)–(4.14d).
Hence, we have y** = x and therefore t** < 0, which is a contradiction to
y** = x.
Conversely, assume that (4.21a′)–(4.21f′) has the unique optimal solution
(y*, t*) = (x, 0), and suppose that (4.14a)–(4.14d) has a solution y ≠ x. Since
(y, t) with t := f(y) ≤ 0 is feasible in (4.21a′)–(4.21f′), we find 0 = t* ≤ f(y) ≤ 0
and therefore f(y) = 0. Thus, (y, 0) ≠ (x, 0) is also optimal in (4.21a′)–(4.21f′),
in contradiction to the uniqueness of (y*, t*) = (x, 0).

According to Lemma 4.4, solutions y of system (4.14a)–(4.14d) can be
determined by solving (4.21a′)–(4.21f′). In the following we assume that

D = {x ∈ IR^r : g(x) ≤ g_0},        (4.22)

where g(x) = (g_1(x), . . . , g_ν(x))^T is a ν-vector of convex functions on IR^r,
g_0 ∈ IR^ν. Furthermore, suppose that (4.21a′)–(4.21f′) fulfills the Slater condi-
tion for each x ∈ D. This regularity condition holds, e.g. (consider (y, t) :=
(x, 1)), if

g(y) := Gy  with a ν × r matrix G.        (4.23)



The necessary and sufficient optimality conditions for (4.21a′)–(4.21f′) read:

Σ_{i=1}^m γ_i = 1        (4.24a)
Σ_{i=1}^m γ_i (R_ii y − d_ii) = −½ (γ_0 c̄ + Ā_I^T λ_I + Ā_II^T λ_II + (∂g/∂x)(y)^T µ)        (4.24b)
g_ii(y) − g_ii(x) − t ≤ 0,  i = 1, . . . , m        (4.24c)
γ_i (g_ii(y) − g_ii(x) − t) = 0,  γ_i ≥ 0,  i = 1, . . . , m        (4.24d)
c̄^T y − c̄^T x ≤ 0,  γ_0 (c̄^T y − c̄^T x) = 0,  γ_0 ≥ 0        (4.24e)
Ā_I y − Ā_I x ≤ 0,  λ_I^T (Ā_I y − Ā_I x) = 0,  λ_I ≥ 0  (λ_I ∈ IR^{|J|})        (4.24f)
Ā_II y − Ā_II x = 0,  λ_II ∈ IR^{m−|J|}        (4.24g)
g(y) − g_0 ≤ 0,  µ^T (g(y) − g_0) = 0,  µ ≥ 0  (µ ∈ IR^ν).        (4.24h)

Fundamental properties of optimal solutions of (4.21a′)–(4.21f′): For given
x ∈ D, let (y*, t*) be optimal in (4.21a′)–(4.21f′). Hence, t* ≤ 0, (y*, t*)
is characterized by (4.24a)–(4.24h), and y* solves (4.14a)–(4.14d), see
Lemma 4.4a. Moreover, one of the following cases will occur, see Corollary 4.4:

Case (i): t* < 0.
Here we have g_ii(y*) < g_ii(x), 1 ≤ i ≤ m. Thus, y* ≠ x, and h = y* − x is a
feasible descent direction for F at x, provided that u ∈ Ĉ^J_sep(P) or u ∈ C^J_sep(P)
and F = F_u is not constant on xy*.

Case (ii.1): t* = 0 and g_ii(y*) < g_ii(x) for some 1 ≤ i ≤ m, or c̄^T y* < c̄^T x,
or Ā_i y* < Ā_i x for some i ∈ J.
Here, y* ≠ x, and h = y* − x is a feasible descent direction for F at x if
u ∈ Ĉ^J_sep(P), u ∈ C^J_sep(P), u ∈ C^{JJ}_sep(P), respectively.

Case (ii.2): t* = 0 and g_ii(y*) = g_ii(x), 1 ≤ i ≤ m, c̄^T y* = c̄^T x, Ā_i y* =
Ā_i x, i ∈ J, with y* ≠ x.
Then h = y* − x is a feasible descent direction for F at x for all u ∈ C^J_sep(P)
such that F = F_u is not constant on xy*.

Case (ii.3): t* = 0 and (y*, t*) = (x, 0) is the unique optimal solution of
(4.21a′)–(4.21f′).
According to Lemma 4.4c, in Case (ii.3) we know that (4.14a)–(4.14d)
has only the trivial solution y = x. Hence, the construction of a descent
direction h = y − x by means of (4.14a)–(4.14d) and Corollary 4.4 fails at x.
Note that points x ∈ D having this property are candidates for optimal
solutions of (4.9a), (4.9b).
Solution of (4.15a)–(4.15e). Let x ∈ D be a given feasible point. Replacing
for simplicity (4.15d) by (4.16a), for handling system (4.15a)–(4.15e) we get
the following optimization problem:

min f̂(y)  s.t. (4.15a)–(4.15c), (4.15e),        (4.25a)



where

f̂(y) := max{ g_ii(y) − g_ii(x) + Σ_{j≠i} |g_ij(x) − g_ij(y)| : 1 ≤ i ≤ m }        (4.25b)
      = max{ g_{iσ_i}(y) − g_{iσ_i}(x) : σ_i ∈ Σ_i, 1 ≤ i ≤ m },

see the note to program (4.21a), (4.21b). In the above, g_{iσ_i} is defined by
(4.16c). An equivalent version of (4.25a), (4.25b) reads

min t        (4.25a′)

s.t.
c̄^T y ≤ c̄^T x        (4.25b′)
Ā_I y ≤ Ā_I x,  Ā_II y = Ā_II x        (4.25c′), (4.25d′)
g_ii(y) − g_ii(x) + Σ_{j≠i} |g_ij(x) − g_ij(y)| ≤ t,  i = 1, . . . , m        (4.25e′)
y ∈ D,        (4.25f′)

where condition (4.25e′) also has the equivalent form

g_{iσ_i}(y) − g_{iσ_i}(x) ≤ t  for all σ_i ∈ Σ_i, i = 1, . . . , m.        (4.25e′′)

It is easy to see that, for given x ∈ D, program (4.25a), (4.25b), (4.25a′)–
(4.25f′), resp., always has the trivial feasible solution y = x, (y, t) = (x, 0).
Moreover, optimal solutions y*, (y*, t*) of (4.25a), (4.25b), (4.25a′)–(4.25f′),
resp., exist under weak assumptions, where t* = f̂(y*) ≤ 0. If g_ii(y*) = g_ii(x)
for an index i, then g_ij(y*) = g_ij(x) for all j = 1, . . . , m and f̂(y*) = t* = 0.
If condition (4.16d) holds for every i = 1, . . . , m, then (4.25a), (4.25b),
(4.25a′)–(4.25f′) are convex. Further properties of (4.25a), (4.25b), (4.25a′)–
(4.25f′) are given in the following:
Lemma 4.5. Let x ∈ D be a given point and consider system (4.15a)–(4.15c),
(4.15e), (4.16a).
(a) If y∗, (y∗, t∗) is optimal in (4.25a), (4.25b), (4.25a')–(4.25f'), resp., then
y∗ solves (4.15a)–(4.15c), (4.15e), (4.16a).
(b) If y, (y, t) is feasible in (4.25a), (4.25b), (4.25a')–(4.25f'), then y solves
(4.15a)–(4.15c), (4.15e), (4.16a) provided that f̂(y) ≤ 0, t ≤ 0, respec-
tively. If y solves (4.15a)–(4.15c), (4.15e), (4.16a), then y, (y, t), t0 ≤
t ≤ 0, is feasible in (4.25a), (4.25b), (4.25a')–(4.25f'), resp., where
t0 := f̂(y) ≤ 0.
(c) (4.15a)–(4.15c), (4.15e), (4.16a) has the unique solution y = x if and
only if program (4.25a')–(4.25f') has the unique optimal solution (y∗, t∗) =
(x, 0).
(d) If (y, t) is feasible in (4.25a')–(4.25f'), then (y, t) is also feasible in
(4.21a')–(4.21f').

Proof. Assertions (a), (b), (d) are clear, and (c) can be shown as the corre-
sponding assertion (c) in Lemma 4.4.

As was shown above, (4.15a)–(4.15c), (4.15e), (4.16a) can be solved by


means of (4.25a )–(4.25f ). If (4.25a )–(4.25f ) is convex, see (4.16d), then we
may proceed as in (4.21a )–(4.21f ). Indeed, if D is given by (4.22), and we
assume that the Slater condition holds, then the necessary and sufficient con-
ditions for an optimal solution (y ∗ , t∗ ) of (4.25a )–(4.25f ) read:
Σi=1..m Σσi∈Σi γiσi = 1 (4.26a)

Σi=1..m Σσi∈Σi γiσi [(Rii − Σj≠i σij Rij) y − (dii − Σj≠i σij dij)]
 = −½ [γ0 c̄ + ĀIT λI + ĀIIT λII + (∂g/∂x)(y)T µ] (4.26b)

giσi(y) − giσi(x) − t ≤ 0, σi ∈ Σi, i = 1, . . . , m (4.26c)
γiσi (giσi(y) − giσi(x) − t) = 0, γiσi ≥ 0, σi ∈ Σi, i = 1, . . . , m (4.26d)
c̄T y ≤ c̄T x, γ0 (c̄T y − c̄T x) = 0, γ0 ≥ 0 (4.26e)
ĀI y − ĀI x ≤ 0, λIT (ĀI y − ĀI x) = 0, λI ≥ 0 (4.26f)
ĀII y = ĀII x, λII ∈ IRm−|J| (4.26g)
g(y) − g0 ≤ 0, µT (g(y) − g0) = 0, µ ≥ 0. (4.26h)

In the non-convex case the conditions (4.26a)–(4.26h) are still necessary
for an optimal solution (y∗, t∗) of (4.25a')–(4.25f'), provided that (y∗, t∗)
fulfills a certain regularity condition, see, e.g., [47].

Fundamental properties of optimal solutions of (4.25a')–(4.25f'): For given
x ∈ D, let (y∗, t∗) be an optimal solution of (4.25a')–(4.25f'). According to
Lemma 4.5(a) we know that y∗ then solves (4.15a)–(4.15c), (4.15e), (4.16a)
and therefore also (4.15a)–(4.15e). Moreover, one of the following different
cases will occur, cf. Corollary 4.4:
Case i: t∗ < 0.
This yields gii(y∗) < gii(x), i = 1, . . . , m, and Qx ⪰ Qy∗. Thus, y∗ ≠ x,
and h = y∗ − x is a feasible descent direction for F at x, provided that
u ∈ ĈJ(P) or u ∈ C J(P) and F = Fu is not constant on xy∗.
Case ii.1: t∗ = 0 and gii(y∗) < gii(x) for an index i, i = 1, . . . , m, or
c̄T y∗ < c̄T x or Āi y∗ < Āi x for some i ∈ J.
Then y∗ ≠ x, and h = y∗ − x is a feasible descent direction for F at x,
provided that u ∈ ĈJ(P), u ∈ C J(P), u ∈ C JJ(P), respectively.
Case ii.2: t∗ = 0 and Qy∗ = Qx, c̄T y∗ = c̄T x, Āi y∗ = Āi x, i ∈ J, with
y∗ ≠ x.

Here, h = y ∗ − x is a feasible descent direction for all u ∈ C J (P ) such that


F = Fu is not constant on xy ∗ .
Case ii.3: t∗ = 0 and (y∗, t∗) = (x, 0) is the unique optimal solution of
(4.25a')–(4.25f').
According to Lemma 4.5(c) we know that (4.15a)–(4.15c), (4.15e), (4.16a)
has then only the trivial solution y = x. In this case the construction of a
descent direction h = y − x for F by means of (4.15a)–(4.15c), (4.15e), (4.16a)
and Corollary 4.4 fails at x. Points of this type are candidates for optimal
solutions of (4.9a), (4.9b), see Sect. 4.3.

Lemma 4.5(d) and (4.18c) still yield the auxiliary convex program

min t (4.27a)

s.t.

c̄T y ≤ c̄T x (4.27b)


ĀI y ≤ ĀI x, ĀII y = ĀII x (4.27c), (4.27d)
2(Rii x − dii )T (y − x) − t ≤ 0, i = 1, . . . , m (4.27e)
y ∈ D. (4.27f)

Obviously, for given x ∈ D, each feasible solution (y, t) of (4.21a')–(4.21f') or
(4.25a')–(4.25f') is also feasible in (4.27a)–(4.27f). If (4.25a')–(4.25f') is con-
vex, then (4.27e) can be sharpened, see (4.16d), (4.18a), (4.25e'), as follows:

2 [(Rii − Σj≠i σij Rij) x − (dii − Σj≠i σij dij)]T (y − x) − t ≤ 0,
σi ∈ Σi, 1 ≤ i ≤ m. (4.27e')
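Since (4.27a)–(4.27f) is linear in (y, t) once D is polyhedral, any LP solver applies. The following sketch only illustrates the mechanics and is not taken from the text: all data (m = 1, the matrix R11, the vector d11, c̄, and the box-shaped D) are hypothetical. A negative optimal value t∗ signals that h = y∗ − x is a candidate descent direction.

```python
# Minimal sketch of the auxiliary LP (4.27a)-(4.27f) for m = 1 and a
# box-shaped feasible set D; all data below are hypothetical.
import numpy as np
from scipy.optimize import linprog

r = 2
c_bar = np.array([1.0, 1.0])           # cost vector c-bar
R11 = np.eye(r)                        # matrix R_11 of g_11 (hypothetical)
d11 = np.zeros(r)                      # vector d_11 (hypothetical)
x = np.array([1.0, 1.0])               # current feasible point x in D

# Decision vector z = (y_1, ..., y_r, t); objective: minimize t (4.27a).
obj = np.r_[np.zeros(r), 1.0]

# (4.27b): c-bar^T y <= c-bar^T x; (4.27e): 2(R11 x - d11)^T (y - x) - t <= 0
grad = 2.0 * (R11 @ x - d11)
A_ub = np.vstack([np.r_[c_bar, 0.0], np.r_[grad, -1.0]])
b_ub = np.array([c_bar @ x, grad @ x])

# (4.27f): y in D = [0, 2]^2 (hypothetical box); t is free
bounds = [(0.0, 2.0)] * r + [(None, None)]

res = linprog(obj, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
t_star, y_star = res.x[-1], res.x[:r]
h = y_star - x                         # descent direction candidate if t* < 0
```

For these numbers the optimum is (y∗, t∗) = ((0, 0), −4), so t∗ < 0 and h points from x = (1, 1) toward the origin.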

4.3 Efficient Solutions (Points)

Instead of solving approximately the complicated stochastic optimization
problem (4.9a), (4.9b), where one does not always have a unique loss function u, we
may look for an alternative solution concept: Obvious alternatives to optimal
solutions of (4.9a), (4.9b) with a loss function u ∈ Csep J (Ĉsep J, Csep JJ, Ĉsep JJ),
u ∈ C J (ĈJ, C JJ, ĈJJ), resp., are just the points x0 ∈ D such that the above proce-
dure for finding a feasible descent direction for F = Fu fails at x0. Based on
(4.14a)–(4.14e), (4.15a)–(4.15e), resp., and Corollary 4.4, efficient solutions
are defined as follows:
Definition 4.4. A vector x0 ∈ D is called a Csep J-efficient solution of (4.9a),
(4.9b) if there is no y ∈ D such that

c̄T x0 ≥ c̄T y (4.28a)
ĀI x0 ≥ ĀI y, ĀII x0 = ĀII y (4.28b), (4.28c)
gii(x0) ≥ gii(y), i = 1, . . . , m (4.28d)

and

(c̄T x0, ĀI x0, gii(x0), 1 ≤ i ≤ m) ≠ (c̄T y, ĀI y, gii(y), 1 ≤ i ≤ m). (4.28e)

x0 ∈ D is called a strongly (or uniquely) Csep J-efficient solution of (4.9a),
(4.9b) if there is no vector y ∈ D, y ≠ x0, such that (4.28a)–(4.28d) holds.
Let Esep, Esep st, resp., denote the set of all Csep J-, strongly Csep J-efficient
solutions of (4.9a), (4.9b).
We observe that the elements of Esep, Esep st are the Pareto-optimal solutions
[25, 134] of the vector optimization problem

“min” (c̄T y, ĀI y − b̄I, ĀII y − b̄II, b̄II − ĀII y, gii(y), 1 ≤ i ≤ m). (4.28')
 y∈D
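The dominance relation behind Definition 4.4 — (4.28a)–(4.28d) componentwise, plus the "not all equal" condition (4.28e) — is the usual Pareto dominance of the criteria vectors. A minimal sketch with hypothetical criteria values; the ĀII-equality block (4.28c) is omitted for brevity:

```python
import numpy as np

def dominates(cy, cx):
    """y dominates x: (4.28a)-(4.28d) componentwise, criteria differ (4.28e)."""
    cy, cx = np.asarray(cy), np.asarray(cx)
    return bool(np.all(cy <= cx) and np.any(cy < cx))

def efficient_points(criteria):
    """Indices of Pareto-optimal rows of the criteria matrix."""
    return [i for i, ci in enumerate(criteria)
            if not any(dominates(cj, ci)
                       for j, cj in enumerate(criteria) if j != i)]

# Hypothetical criteria vectors (c^T y, A_I y, g_ii(y)) for four candidates:
crit = np.array([[1.0, 2.0, 3.0],
                 [1.0, 2.0, 2.0],   # dominates row 0
                 [0.5, 3.0, 4.0],
                 [2.0, 2.0, 2.0]])  # dominated by row 1
idx = efficient_points(crit)        # Pareto-optimal candidates: rows 1 and 2
```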

Definition 4.5. A vector x0 ∈ D is called a C J-efficient solution of (4.9a),
(4.9b) if there is no y ∈ D such that

c̄T x0 ≥ c̄T y (4.29a)
ĀI x0 ≥ ĀI y, ĀII x0 = ĀII y (4.29b), (4.29c)
giσi(x0) ≥ giσi(y), σi ∈ Σi, i = 1, . . . , m (4.29d)

and

(c̄T x0, ĀI x0, gii(x0), 1 ≤ i ≤ m) ≠ (c̄T y, ĀI y, gii(y), 1 ≤ i ≤ m). (4.29e)

x0 ∈ D is called a strongly (or uniquely) C J-efficient solution of (4.9a), (4.9b)
if there is no y ∈ D, y ≠ x0, such that (4.29a)–(4.29d) holds. Let E, E st, resp.,
denote the set of all C J-, strongly C J-efficient solutions of (4.9a), (4.9b).
Note that the elements of E, E st are the Pareto-optimal solutions of the
multi-objective program

“min” (c̄T y, ĀI y − b̄I, ĀII y − b̄II, b̄II − ĀII y, giσi(y), σi ∈ Σi, 1 ≤ i ≤ m). (4.29')
 y∈D

Discussion of Definitions 4.4, 4.5:
Relations (4.28a)–(4.28d) and (4.14a)–(4.14d) agree. Furthermore, (4.29a)–
(4.29d) and (4.15a)–(4.15c), (4.16b) coincide, where (4.16b) is equivalent to
(4.16a). We have the inclusions

Esep st ⊂ Esep, E st ⊂ E, (4.30a)

where equality in (4.30a) holds under the following conditions:

Lemma 4.6. (a) If gii is strictly convex for at least one index i, 1 ≤ i ≤ m,
then Esep st = Esep.
(b) If giσi is convex for all σi ∈ Σi, i = 1, 2, . . . , m, and gii is strictly convex
for at least one index i, 1 ≤ i ≤ m, then E st = E.
(c) If for arbitrary x, y ∈ D

x = y ⇔ c̄T x = c̄T y, Āx = Āy, gii(x) = gii(y), i = 1, . . . , m, (4.30b)

then Esep st = Esep and E st = E.

Proof. (a) We may suppose, e.g., that g11 is strictly convex. Let then x0 ∈ Esep
and assume that x0 ∉ Esep st. Hence, there exists y ∈ D, y ≠ x0, such that
(4.28a)–(4.28d) holds. Since all gii are convex and g11 is strictly convex, it is
easy to see that η := λx0 + (1 − λ)y fulfills (4.28a)–(4.28e) for each 0 < λ < 1.
This contradiction yields x0 ∈ Esep st and therefore Esep st = Esep, cf. (4.30a). As-
sertion (b) follows in the same way. In order to show (c), let x0 ∈ Esep, x0 ∈ E,
resp., and assume x0 ∉ Esep st, x0 ∉ E st. Consider y ∈ D, y ≠ x0, such that
(4.28a)–(4.28d), (4.29a)–(4.29d), resp., holds. Because of x0 ∈ Esep, x0 ∈ E,
resp., we have

(c̄T x0, ĀI x0, gii(x0), 1 ≤ i ≤ m) = (c̄T y, ĀI y, gii(y), 1 ≤ i ≤ m).

This equation, (4.28c), (4.29c), resp., and (4.30b) yield the contradiction y =
x0. Thus, x0 ∈ Esep st, x0 ∈ E st, resp., and therefore Esep st = Esep and E st = E.

According to Lemmas 4.4(c), 4.5(c), a point x0 ∈ D is strongly Csep J-, strongly
C J-efficient if and only if (4.21a')–(4.21f'), (4.25a')–(4.25f'), resp., with x :=
x0, has the unique optimal solution (y∗, t∗) = (x0, 0). Since (4.29d), being
equivalent to (4.16a), implies (4.28d), we find

Esep ⊂ E, Esep st ⊂ E st. (4.30c)

A special subset of efficient solutions of (4.9a), (4.9b) is given by

E0 := { x0 ∈ D : (4.19a)–(4.19f) has no solution h ≠ 0 }. (4.31)

Indeed, Lemma 4.2 and Remark 4.4 yield the following inclusion:

Corollary 4.5. E0 ⊂ Esep st.
By Lemma 4.3 we have this parametric representation for E0:
Corollary 4.6. (a) If Rii is positive definite for at least one index 1 ≤ i ≤
m, then x ∈ E0 if and only if x ∈ D and there are parameters γi, i =
0, 1, . . . , m, λI, λII and π(1), π(2) such that (4.20a), (4.20b) holds.
(b) Let Rii be singular for each i = 1, . . . , m. If νx(2) > 0 for each x ∈ D, then
x ∈ E0 if and only if x ∈ D and there exist γi, 0 ≤ i ≤ m, λI, λII, π(1), π(2)
such that (4.20a), (4.20b) holds. If νx(2) = 0, x ∈ D, then x ∈ E0 if and
only if x ∈ D, the matrix defined in Lemma 4.3(c) has rank n, or there are
parameters γi, 0 ≤ i ≤ m, λI, λII and π(1) such that (4.20a'), (4.20b) holds.

4.3.1 Necessary Optimality Conditions Without Gradients

As was already discussed in Sect. 4.2.2, Case ii.3, efficient solutions are candi-
dates for optimal solutions x∗ of (4.9a), (4.9b). This will be made more precise
in the following:
Theorem 4.5. If x∗ ∈ argminx∈D F(x) with any u ∈ Ĉsep JJ(P), u ∈ ĈJJ(P),
then x∗ ∈ Esep, x∗ ∈ E, respectively.

Proof. Supposing that x∗ ∉ Esep, x∗ ∉ E, resp., we find y ∈ D such that
(4.28a)–(4.28e), (4.29a)–(4.29e), resp., holds, where x0 := x∗. Thus, (4.14a)–
(4.14e), (4.15a)–(4.15c), (4.15e), (4.16a), resp., hold. Moreover, at least one of
the relations (4.14a'), (4.14b'), (4.14d'), (4.15a'), (4.15b') and (4.16b'), resp.,
holds with x := x∗. Consequently, according to Corollary 4.4 and Remark 4.3a
we have a feasible descent direction for F at x∗. This contradiction to the
optimality of x∗ yields the assertion.

Approximating the elements of Csep J(P), C J(P) by certain elements of
Ĉsep JJ(P), ĈJJ(P), resp., we find the next condition:
JJ

Theorem 4.6. If D is compact, or D is closed and F(x) → +∞ as ‖x‖ →
+∞, then for each u ∈ Csep J(P), u ∈ C J(P) there exists x∗ ∈ argminx∈D F(x)
contained in the closed hull Ēsep, Ē of Esep, E, respectively.

Proof. Consider any u ∈ Csep J(P), u ∈ C J(P), and define uρ ∈ Ĉsep JJ(P),
uρ ∈ ĈJJ(P), resp., by

uρ(z) := u(z) + ρ Σi=1..m vi(zi), z ∈ IRm,

where ρ > 0 and vi = vi(zi) is defined by

vi(zi) = exp(zi), i ∈ J, and vi(zi) = zi² for i ∉ J.

If D is compact, then for each ρ > 0 we find x∗ρ ∈ argminx∈D Fρ(x), where
Fρ(x) := c̄T x + Euρ(A(ω)x − b(ω)), see (4.9b). Furthermore, since D is
compact and x → Σi=1..m Evi(Ai(ω)x − bi(ω)) is convex and therefore continuous
on D, there is a constant C > 0 such that |F(x) − Fρ(x)| ≤ ρC for all x ∈ D.
This yields |minx∈D F(x) − minx∈D Fρ(x)| = |minx∈D F(x) − Fρ(x∗ρ)| ≤ ρC.
Hence, Fρ(x∗ρ) → minx∈D F(x) as ρ ↓ 0. Furthermore, we have |F(x∗ρ) −
Fρ(x∗ρ)| ≤ ρC, ρ > 0, and for any accumulation point x∗ of (x∗ρ) we get

|F(x∗) − minx∈D F(x)| ≤ |F(x∗) − F(x∗ρ)| + |F(x∗ρ) − Fρ(x∗ρ)|
 + |Fρ(x∗ρ) − minx∈D F(x)| ≤ |F(x∗) − F(x∗ρ)| + 2ρC, ρ > 0.

Consequently, since x∗ρ ∈ Esep, x∗ρ ∈ E, resp., for all ρ > 0, cf. Theorem 4.5,
for every accumulation point x∗ of (x∗ρ) we find that x∗ ∈ Ēsep, x∗ ∈ Ē,
resp., and F(x∗) = minx∈D F(x). Suppose now that D is closed and F(x) →
+∞ as ‖x‖ → +∞. Since Fρ(x) ≥ F(x), x ∈ IRn, ρ > 0, we also have that
Fρ(x) → +∞ for each fixed ρ > 0. Hence, there exist optimal solutions x∗0 ∈
argminx∈D F(x) and x∗ρ ∈ argminx∈D Fρ(x) for every ρ > 0. For all 0 < ρ ≤ 1
and with an arbitrary, but fixed x̃ ∈ D we find then x∗ρ ∈ Esep, x∗ρ ∈ E, resp.,
and
F(x∗ρ) ≤ Fρ(x∗ρ) ≤ Fρ(x̃) ≤ F1(x̃), 0 < ρ ≤ 1.

Thus, (x∗ρ) must lie in a bounded set. Let R := max(‖x∗0‖, R0), where
R0 denotes a norm bound of (x∗ρ). If D̃ := {x ∈ D : ‖x‖ ≤ R}, then
minx∈D F(x) = minx∈D̃ F(x) and minx∈D Fρ(x) = minx∈D̃ Fρ(x). Since D̃
is compact, the rest of the proof follows now as in the first part.
For elements of Csep J(P, D), C J(P, D), see Definition 4.3, we have the fol-
lowing condition:
Theorem 4.7. If x∗ ∈ argminx∈D F(x) with any u ∈ Csep J(P, D), u ∈
C J(P, D), then x∗ ∈ Esep st, x∗ ∈ E st, respectively.

Proof. Let u ∈ Csep J(P, D) and assume that x∗ ∉ Esep st. According to
Definition 4.4 we find then y ∈ D, y ≠ x∗, such that (4.28a)–(4.28d) hold
with x0 := x∗. Corollary 4.4a yields that h = y − x∗ is a feasible descent
direction for F at x∗, which is a contradiction to the optimality of x∗. Hence,
x∗ ∈ Esep st. The rest follows in the same way.

4.3.2 Existence of Feasible Descent Directions in Non-Efficient
Solutions of (4.9a), (4.9b)

The preceding Theorems 4.3, 4.4 and Corollary 4.4 yield this final result:

Corollary 4.7. (a) Let x0 ∉ E, x0 ∉ Esep, respectively. Then there exists y ∈ D,
y ≠ x0, such that (4.29a)–(4.29e), (4.28a)–(4.28e), resp., holds. Conse-
quently, F(y) ≤ F(x0) for all u ∈ C J(P), u ∈ Csep J(P), resp., and F(y) <
F(x0) for all u ∈ ĈJJ(P), u ∈ Ĉsep JJ(P), respectively. Hence, h = y − x0 is a
feasible descent direction for F at x0 for all u ∈ ĈJJ(P), u ∈ Ĉsep JJ(P), and for
all u ∈ C J(P), u ∈ Csep J(P), resp., such that F is not constant on x0y.
(b) Let x0 ∉ E st, x0 ∉ Esep st, respectively. Then there is y ∈ D, y ≠ x0,
such that (4.29a)–(4.29d), (4.28a)–(4.28d), resp., holds. Hence, F(y) ≤ F(x0),
and h = y − x0 is a feasible descent direction for F at x0 for all u ∈ C J(P),
u ∈ Csep J(P), such that F is not constant on x0y.

4.4 Descent Directions in Case of Elliptically Contoured
Distributions
In extension of the results in Sect. 4.2, here we suppose that the random
m × (r + 1) matrix (A(ω), b(ω)) has an elliptically contoured probability
distribution, cf. [56, 119, 155]. Members of this family are, e.g., the multivariate
normal, Cauchy, stable, Student t-, inverted Student t-distributions, as well as
their truncations to elliptically contoured sets in IRm(r+1) and the uniform
distribution on elliptically contoured sets in IRm(r+1). The parameters of this
class can be described by an integer order parameter s ≥ 1 and an m × (r + 1)
location matrix
Ξ = (Θ, θ). (4.32a)

Moreover, there are positive semidefinite m(r + 1) × m(r + 1) scale matrices

Q(σ) = (Qij(σ))i,j=1,...,m, σ = 1, . . . , s, (4.32b)

having (r + 1) × (r + 1) submatrices Qij such that

Qji = QijT, i, j = 1, . . . , m. (4.32c)

In the following, M ∈ IRm(r+1) is always represented by an m × (r + 1) matrix.
The characteristic function P̂(A(·),b(·)) = P̂(A(·),b(·))(M) of this class of
distributions is then given by

P̂(A(·),b(·))(M) := E exp(i tr M (A(ω), b(ω))T) (4.33a)
 = exp(i tr M ΞT) exp(−Σσ=1..s g(vecM Q(σ)(vecM)T)),

where
vecM := (M1, M2, . . . , Mm) (4.33b)

denotes the m(r + 1)-vector of the rows M1, . . . , Mm of M. Furthermore, g =
g(t) is a certain nonnegative 1–1-function on IR+. E.g., in the case of multivari-
ate symmetric stable distributions of order s and characteristic exponent α,
0 < α ≤ 2, we have g(t) := ½ tα/2, t ≥ 0, see [119].
2
Consequently, the probability distribution PA(·)x−b(·) of the random m-
vector
A(ω)x − b(ω) = (A(ω), b(ω)) x̂ with x̂ := (xT, −1)T (4.34a)

may be represented by the characteristic function

P̂A(·)x−b(·)(z) = E exp(izT(A(ω)x − b(ω)))
 = E exp(i tr(z x̂T)(A(ω), b(ω))T) = P̂(A(·),b(·))(z x̂T)
 = exp(izT(Θx − θ)) exp(−Σσ=1..s g(zT Qx(σ)z)). (4.34b)

In (4.34b) the positive semidefinite m × m matrices Qx(σ) are defined by

Qx(σ) := (x̂T Qij(σ) x̂)i,j=1,...,m, σ = 1, . . . , s, (4.34c)

cf. (4.32b), (4.32c) and (4.13a)–(4.13c).
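Numerically, (4.34c) is a two-sided block contraction of the scale matrix: with K := Im ⊗ x̂T one has Qx(σ) = K Q(σ) KT, which also makes the positive semidefiniteness of Qx(σ) evident. A numpy sketch with hypothetical dimensions and a randomly generated PSD scale matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
m, r = 3, 2                                 # hypothetical dimensions
n = m * (r + 1)

# A positive semidefinite m(r+1) x m(r+1) scale matrix Q(sigma):
B = rng.normal(size=(n, n))
Q = B.T @ B

x = rng.normal(size=r)
x_hat = np.r_[x, -1.0]                      # x-hat = (x^T, -1)^T, see (4.34a)

# Q_x(sigma) = ( x_hat^T Q_ij(sigma) x_hat )_{i,j=1..m} as K Q K^T
K = np.kron(np.eye(m), x_hat[None, :])      # m x m(r+1), rows carry x_hat
Qx = K @ Q @ K.T

# element-wise check against the block definition (4.34c)
i, j = 1, 2
Qij = Q[i*(r+1):(i+1)*(r+1), j*(r+1):(j+1)*(r+1)]
```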


Corresponding to Sect. 4.2, in the following we consider again the
derivative-free construction of (feasible) descent directions h = y − x for
the objective function F of the convex optimization problem (4.9a), (4.9b),
hence,
F(x) = c̄T x + F0(x),
where
F0(x) = Eu(A(ω)x − b(ω)) with u ∈ C J(P).

Following the proof of Theorem 4.4, for a given vector x and a vector y ≠ x
to be determined, we consider the quotient of the characteristic functions of
A(ω)x − b(ω) and A(ω)y − b(ω), see (4.34a), (4.34b):

P̂A(·)x−b(·)(z) / P̂A(·)y−b(·)(z) = exp(izT(Θx − Θy)) φ(0)x,y(z), (4.35a)

where
φ(0)x,y(z) := exp(−Σσ=1..s [g(zT Qx(σ)z) − g(zT Qy(σ)z)]). (4.35b)
Thus, a necessary condition such that φ(0)x,y is the characteristic function

φ(0)x,y(z) = K̂(0)x,y(z), z ∈ IRm, (4.35c)

of a symmetric probability distribution K(0)x,y reads

Qx(σ) ⪰ Qy(σ), i.e., Qx(σ) − Qy(σ) is positive semidefinite, σ = 1, . . . , s,
(4.36)

for arbitrary functions g = g(t) having the above mentioned properties. Note
that (4.35b), (4.35c) means that

PA0(·)x−b0(·) = K(0)x,y ∗ PA0(·)y−b0(·), (4.37)

where (A0(ω), b0(ω)) := (A(ω), b(ω)) − Ξ.
Based on (4.35a)–(4.35c) and (4.36), corresponding to Theorem 4.4 for
normal distributions P(A(·),b(·)), here we have the following results:

Theorem 4.8. Distribution invariance. For given x ∈ IRr, let y ≠ x fulfill the
relations
c̄T x ≥ c̄T y (4.38a)
Θx = Θy (4.38b)
Qx(σ) = Qy(σ), σ = 1, . . . , s. (4.38c)
Then PA(·)x−b(·) = PA(·)y−b(·), F0(x) = F0(y) for arbitrary u ∈ C(P) := C J(P)
with J = ∅, and F(x) ≥ F(y).

Proof. According to (4.34a), (4.34b), conditions (4.38b), (4.38c) imply
P̂A(·)x−b(·) = P̂A(·)y−b(·), hence, PA(·)x−b(·) = PA(·)y−b(·), and the rest follows
immediately.
Corollary 4.8. Under the above assumptions h = y − x is a descent direction
for F at x, provided that F is not constant on xy.
Theorem 4.9. Suppose that (4.36) is also sufficient for (4.35c). Assume that
the distribution K(0)x,y has zero mean for all x, y under consideration. For given
x ∈ IRr, let y ≠ x be a vector satisfying the relations

c̄T x ≥ c̄T y (c̄T x > c̄T y) (4.39a)
ΘI x ≥ ΘI y (Θi x > Θi y for at least one i ∈ J) (4.39b)
ΘII x = ΘII y (4.39c)
Qx(σ) ⪰ Qy(σ), σ = 1, . . . , s (Qx(σ) ≠ Qy(σ) for at least one
1 ≤ σ ≤ s), (4.39d)

where ΘI := (Θi)i∈J, ΘII := (Θi)i∉J. Then F(x) ≥ (>) F(y) for all u ∈
C J(P), C JJ(P), ĈJ(P), respectively.

Proof. Under the above assumptions (4.37) holds and

F0(x) = Eu(A(ω)x − b(ω)) = Eu(Θx − θ + Yx(ω)),

Yx(ω) := A(ω)x − b(ω) − (Θx − θ). Now the proof of Theorem 4.4 can be
transferred to the present case.

4.5 Construction of Descent Directions by Using
Quadratic Approximations of the Loss Function

Suppose that there exists a closed convex subset Z0 ⊂ IRm such that the loss
function u fulfills the relations

dT V d ≤ dT ∇2u(z) d ≤ dT W d for all z ∈ Z0 and d ∈ IRm, (4.40a)

where V, W are given fixed symmetric m × m matrices. Moreover, suppose

A(ω)x − b(ω) ∈ Z0 for all x ∈ D and ω ∈ Ω. (4.40b)


   
Let (Ā, b̄) = E(A(ω), b(ω)) be the mean of (A(ω), b(ω)), and let

Qij = cov((Ai(·), bi(·)), (Aj(·), bj(·))) (4.41)
 = E((Ai(ω), bi(ω)) − (Āi, b̄i))T ((Aj(ω), bj(ω)) − (Āj, b̄j))

denote the covariance matrix of the ith and jth rows (Ai(·), bi(·)), (Aj(·), bj(·))
of (A(ω), b(ω)). For any symmetric m × m matrix U = (uij) we then define
the symmetric (r + 1) × (r + 1) matrix Û by

Û = E(A(ω) − Ā, b(ω) − b̄)T U (A(ω) − Ā, b(ω) − b̄) (4.42)
 = Σi,j=1..m uij Qij.
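The two expressions in (4.42) can be checked against each other on simulated data: with centered samples, the sample analogue of E[(A − Ā, b − b̄)T U (A − Ā, b − b̄)] coincides exactly with Σi,j uij Qij built from the sample covariance matrices Qij. A numpy sketch with hypothetical sizes:

```python
import numpy as np

rng = np.random.default_rng(1)
m, r, N = 3, 2, 50                     # hypothetical sizes; N samples of (A, b)
M = rng.normal(size=(N, m, r + 1))     # sampled rows of (A(omega), b(omega))
U = rng.normal(size=(m, m))
U = U + U.T                            # symmetric weight matrix U = (u_ij)

D = M - M.mean(axis=0)                 # centered rows (A - Abar, b - bbar)

# first expression in (4.42): sample mean of (.)^T U (.)
U_hat = np.einsum('nik,ij,njl->kl', D, U, D) / N

# second expression in (4.42): sum_{i,j} u_ij Q_ij with sample covariances Q_ij
U_hat2 = np.zeros((r + 1, r + 1))
for i in range(m):
    for j in range(m):
        Qij = D[:, i, :].T @ D[:, j, :] / N
        U_hat2 += U[i, j] * Qij
```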

For a given r-vector x ∈ D we consider now the Taylor expansion of u at

z̄x = Āx − b̄ = E(A(ω)x − b(ω)), (4.43)

hence,
u(z) = u(z̄x) + ∇u(z̄x)T(z − z̄x) + ½ (z − z̄x)T ∇2u(w)(z − z̄x),

where w = z̄x + ϑ(z − z̄x) for some 0 < ϑ < 1. Because of (4.40b) and the
convexity of Z0 it is z̄x ∈ Z0 for all x ∈ D and therefore w ∈ Z0 for all z ∈ Z0.
Hence, (4.40a) yields

u(z̄x) + ∇u(z̄x)T(z − z̄x) + ½ (z − z̄x)T V (z − z̄x) ≤ u(z)
 ≤ u(z̄x) + ∇u(z̄x)T(z − z̄x) + ½ (z − z̄x)T W (z − z̄x) (4.44)
for all x ∈ D and z ∈ Z0.

Putting z := A(ω)x − b(ω) into (4.44), because of (4.40b) we find

u(z̄x) + ∇u(z̄x)T(A(ω)x − b(ω) − z̄x)
 + ½ (A(ω)x − b(ω) − z̄x)T V (A(ω)x − b(ω) − z̄x) ≤ u(A(ω)x − b(ω))
 ≤ u(z̄x) + ∇u(z̄x)T(A(ω)x − b(ω) − z̄x) (4.45)
 + ½ (A(ω)x − b(ω) − z̄x)T W (A(ω)x − b(ω) − z̄x)
for all x ∈ D and all ω ∈ Ω.
Taking now expectations on both sides of (4.45), using (4.42) and x̂ = (xT, −1)T,
we get

u(z̄x) + ½ x̂T V̂ x̂ ≤ Eu(A(ω)x − b(ω)) ≤ u(z̄x) + ½ x̂T Ŵ x̂ (4.46a)
for all x ∈ D.

Furthermore, putting z := Āy − b̄, where y ∈ D, into the second inequality of
(4.44), we find

u(Āy − b̄) ≤ u(z̄x) + ∇u(z̄x)T Ā(y − x) + ½ (y − x)T ĀT W Ā(y − x) (4.46b)
 = u(z̄x) + (ĀT ∇u(z̄x))T (y − x) + ½ (y − x)T (ĀT W Ā)(y − x)
for all x ∈ D and y ∈ D.
Considering now F(y) − F(x) for vectors x, y ∈ D, from (4.46a) we get

F(y) − F(x) = (c̄T y + Eu(A(ω)y − b(ω))) − (c̄T x + Eu(A(ω)x − b(ω)))
 ≤ c̄T(y − x) + ½ ŷT Ŵ ŷ + u(Āy − b̄) − (½ x̂T V̂ x̂ + u(Āx − b̄)).
Hence, we have the following result:
 
Theorem 4.10. If D, A(ω), b(ω) , Z0 and V, W are selected such that
(4.40a), (4.40b) hold, then for all x, y ∈ D it is

1
F (y)−F (x) ≤ c̄T (y −x)+ (ŷ T Ŵ ŷ − x̂T V̂ x̂)+u(Āy − b̄)−u(Āx− b̄). (4.47a)
2

Using also inequality (4.46b), we get this corollary:

Corollary 4.9. Under the assumptions of Theorem 4.10 it holds

F(y) − F(x) ≤ c̄T(y − x) + ½ (ŷT Ŵ ŷ − x̂T V̂ x̂)
 + (ĀT ∇u(Āx − b̄))T (y − x) + ½ (y − x)T ĀT W Ā(y − x). (4.47b)

Represent the (r + 1) × (r + 1) matrix Ŵ by

Ŵ = ( W̃   w )
    ( wT   γ ), (4.47c)

where W̃ is a symmetric r × r matrix, w is an r-vector and γ is a real number.
Then
ŷT Ŵ ŷ = x̂T Ŵ x̂ + (y − x)T W̃ (y − x) + 2 (W̃ x − w)T (y − x),

and therefore

F(y) − F(x) ≤ c̄T(y − x) + ½ x̂T(Ŵ − V̂)x̂ + ½ (y − x)T(W̃ + ĀT W Ā)(y − x)
 + (ĀT ∇u(Āx − b̄) + W̃ x − w)T (y − x)
 = ½ x̂T(Ŵ − V̂)x̂ + (c̄ + ĀT ∇u(Āx − b̄) + W̃ x − w)T (y − x) (4.47d)
 + ½ (y − x)T(W̃ + ĀT W Ā)(y − x).

Thus, F(x + (y − x)) − F(x) is estimated from above by a quadratic form in
h = y − x.
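The algebraic identity for ŷT Ŵ ŷ used above follows from x̂ = (xT, −1)T, ŷ = (yT, −1)T and the block form (4.47c); it is easy to confirm numerically. A sketch with randomly generated, purely hypothetical data:

```python
import numpy as np

rng = np.random.default_rng(2)
r = 4
Wt = rng.normal(size=(r, r))
Wt = Wt + Wt.T                      # W-tilde: symmetric r x r block
w = rng.normal(size=r)              # r-vector block
gamma = 1.7                         # scalar block
# assemble W-hat = [[W-tilde, w], [w^T, gamma]] as in (4.47c)
W_hat = np.block([[Wt, w[:, None]], [w[None, :], np.array([[gamma]])]])

x, y = rng.normal(size=r), rng.normal(size=r)
x_hat, y_hat = np.r_[x, -1.0], np.r_[y, -1.0]

lhs = y_hat @ W_hat @ y_hat
rhs = (x_hat @ W_hat @ x_hat
       + (y - x) @ Wt @ (y - x)
       + 2.0 * (Wt @ x - w) @ (y - x))
```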
Concerning the construction of descent directions of F at x we have this
first immediate consequence from Theorem 4.10:
Theorem 4.11. Suppose that the assumptions of Theorem 4.10 hold. If vec-
tors x, y ∈ D, y ≠ x, are related such that

c̄T x ≥ (>) c̄T y (4.48a)
Āx = Āy (4.48b)
x̂T V̂ x̂ ≥ (>) ŷT Ŵ ŷ, (4.48c)

then F(x) ≥ (>) F(y). Moreover, if F(x) > F(y), or F(x) = F(y) and F is
not constant on xy, then h = y − x is a feasible descent direction of F at x.

Note. If the loss function u is monotone with respect to the partial order
on IRm induced by IRm+, then (4.48b) can be replaced by one of the weaker
relations

Āx ≥ Āy or Āx ≤ Āy. (4.48b')

If u has the partial monotonicity property (4.10a), (4.10b), then (4.48b) can
be replaced by (4.14b), (4.14b'), (4.14c).
The next result follows from (4.47d).
Theorem 4.12. Suppose that the assumptions of Theorem 4.10 hold. If vec-
tors x, y ∈ D, y ≠ x, are related such that

(W̃ + ĀT W Ā)(y − x) = −(c̄ + ĀT ∇u(Āx − b̄) + W̃ x − w) (4.49a)
(c̄ + ĀT ∇u(Āx − b̄) + W̃ x − w)T (y − x) ≤ (<) −x̂T(Ŵ − V̂)x̂, (4.49b)

then F(y) ≤ (<) F(x). Furthermore, if F(y) < F(x), or F(y) = F(x) and F is
not constant on xy, then h = y − x is a feasible descent direction for F at x.

Proof. If (4.49a), (4.49b) holds, then (4.47d) yields

F(y) − F(x) ≤ ½ x̂T(Ŵ − V̂)x̂ + (c̄ + ĀT ∇u(Āx − b̄) + W̃ x − w)T (y − x)
 + ½ (y − x)T(W̃ + ĀT W Ā)(y − x)
 = ½ x̂T(Ŵ − V̂)x̂ + (c̄ + ĀT ∇u(Āx − b̄) + W̃ x − w)T (y − x)
 + (−1) ½ (y − x)T (c̄ + ĀT ∇u(Āx − b̄) + W̃ x − w)
 = ½ x̂T(Ŵ − V̂)x̂ + ½ (c̄ + ĀT ∇u(Āx − b̄) + W̃ x − w)T (y − x)
 ≤ (<) 0.

Discussion of (4.49a), (4.49b): Suppose that

C = W̃ + ĀT W Ā is positive definite, (4.50a)

hence, there are positive numbers α, β such that

α ‖x‖² ≤ xT Cx ≤ β ‖x‖². (4.50b)

According to (4.49a) it is

y − x = −C−1 (c̄ + ĀT ∇u(Āx − b̄) + W̃ x − w),

and (4.49b) reads

(c̄ + ĀT ∇u(Āx − b̄) + W̃ x − w)T C−1 (c̄ + ĀT ∇u(Āx − b̄) + W̃ x − w)
 ≥ (>) x̂T(Ŵ − V̂)x̂

or
aT C−1 a ≥ x̂T(Ŵ − V̂)x̂,
where
a = c̄ + ĀT ∇u(Āx − b̄) + W̃ x − w.

Because of (4.50b) it is

(1/β) ‖a‖² ≤ aT C−1 a ≤ (1/α) ‖a‖².

Thus, we find the following criterion for (4.49b):

(1) If (1/β) ‖c̄ + ĀT ∇u(Āx − b̄) + W̃ x − w‖² ≥ (>) x̂T(Ŵ − V̂)x̂, then (4.49b)
holds.
(2) If x̂T(Ŵ − V̂)x̂ > (1/α) ‖c̄ + ĀT ∇u(Āx − b̄) + W̃ x − w‖², then (4.49b) is not
valid.
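In the special case of deterministic data (Ā, b̄) — so that the covariance-based terms Ŵ, V̂, W̃, w all vanish — and a quadratic loss u(z) = ½ zT H z, the step defined by (4.49a) reduces to a Newton-type step y − x = −C−1 a with C = ĀT H Ā and a = ∇F(x), and (4.49b) holds automatically since C is positive definite. A sketch under exactly these hypothetical assumptions (not the general case treated in the text):

```python
import numpy as np

rng = np.random.default_rng(3)
m, r = 5, 3
A_bar = rng.normal(size=(m, r))        # deterministic A, so W-tilde = 0, w = 0
b_bar = rng.normal(size=m)
c_bar = rng.normal(size=r)
H = np.eye(m)                          # Hessian of the quadratic loss u

def F(x):
    """Expected total cost; no randomness in this special case."""
    z = A_bar @ x - b_bar
    return c_bar @ x + 0.5 * z @ H @ z

x = rng.normal(size=r)
C = A_bar.T @ H @ A_bar                          # C = W-tilde + A^T W A
a = c_bar + A_bar.T @ H @ (A_bar @ x - b_bar)    # a of (4.49b), here = grad F(x)
y = x - np.linalg.solve(C, a)                    # step (4.49a): y - x = -C^{-1} a

grad_y = c_bar + A_bar.T @ H @ (A_bar @ y - b_bar)  # = a + C(y - x) = 0 at y
```

Because F is quadratic and C positive definite, y is the unconstrained minimizer, so F(y) ≤ F(x) with a vanishing gradient at y.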
5 RSM-Based Stochastic Gradient Procedures

5.1 Introduction
According to the methods for converting an optimization problem with a
random parameter vector a = a(ω) into a deterministic substitute problem,
see Chap. 1 and Sects. 2.1.1, 2.1.2, 2.2, a basic problem is the minimization of
the total expected cost function F = F (x) on a closed, convex feasible domain
D for the input or design vector x.
For simplification of notation, here the random total cost function f =
f(a(ω), x) is also denoted by f = f(ω, x), hence,

f(ω, x) := f(a(ω), x).

Thus, the expected total cost minimization problem

minimize Ef (ω, x) s.t. x ∈ D (5.1)

must be solved. In this chapter we consider therefore the approximate so-


lution of the mean value minimization problem (5.1) by a semi-stochastic
approximation method based on the response surface methodology (RSM),
see, e.g., [17, 55, 106]. Hence, f = f (ω, x) denotes a function on IRr depending
also on a random element ω, and “E” is the expectation operator with respect
to the underlying probability space (Ω, A, P ) containing ω. Moreover, D des-
ignates a closed convex subset of IRr . We suppose that the objective function
of (5.1)
F (x) = Ef (ω, x), x ∈ IRr , (5.2)
is defined and finite for every x ∈ IRr . Furthermore, we assume that the
derivatives ∇F (x), ∇2 F (x), . . . exist up to a certain order κ. In some cases, cf.
[11] and Sect. 2.2, we may simply interchange differentiation and integration,
hence,
∇F (x) = E∇x f (ω, x) or ∇F (x) = E∂x f (ω, x), (5.3a)

where ∇xf, ∂xf denote the gradient, subgradient, resp., of f with respect
to x. Corresponding formulas may also hold for higher derivatives of F. How-
ever, in some important practical situations, especially for probability-valued
functions F(x), (5.3a) does not hold. According to Chap. 3 we know that a
gradient representation in integral form

∇F(x) = ∫ G(w, x) µ(dw) (5.3b)

with a certain vector-valued function G = G(w, x) may then be derived, cf.
also [149, 159].
One of the main methods for solving problem (5.1) is the stochastic ap-
proximation procedure [36, 44, 46, 52, 69, 71, 112, 116]

Xn+1 = pD (Xn − ρn Yns ), n = 1, 2, . . . , (5.4a)

where Yns denotes an estimator of the gradient ∇F (x) at x = Xn . Impor-


tant examples are the simple stochastic (sub)gradient or the finite difference
gradient estimator
 
Yns = ∇x f (ωn , Xn ) Yns ∈ ∂x f (ωn , Xn ) , or Yns = G(wn , Xn ) (5.4b)

1  
r
(1) (2)
Yns = f (ωn,j , Xn + cn ej ) − f (ωn,j , Xn − cn ej ) ej (5.4c)
j=1
2cn

(k)
with the unit vectors e1 , . . . , er or IRr . Here, ωn (wn ), n = 1, 2, . . . , ωn,j , k =
1, 2, j = 1, 2, . . . , r, n = 1, 2, . . . , resp., are sequences of independent re-
alizations of the random element ω(w). Using formula (5.4b), (5.4c), resp.,
algorithm (5.4a) represents a Robbins–Monro (RM)-, a Kiefer–Wolfowitz
(KW)-type stochastic approximation procedure. Moreover, pD designates the
projection of IRr onto D, and ρn > 0, n = 1, 2, . . . , denotes the sequence of
step sizes. A standard condition for the selection of (ρn ) is given [34–36] by

Σn=1..∞ ρn = +∞,  Σn=1..∞ ρn² < +∞, (5.4d)

and corresponding conditions for (cn) can be found in [164]. Due to its sto-
chastic nature, algorithms of the type (5.4a) have only a very small asymptotic
convergence rate in general, cf. [164].
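The projected procedure (5.4a) with stochastic gradients (5.4b) and step sizes satisfying (5.4d) can be sketched in a few lines. The toy problem below is entirely hypothetical: f(ω, x) = ‖x − ω‖² with ω ∼ N(µ, I) and a box D, so the minimizer of F over D is µ (assumed interior); with ρn = 1/(2n) the recursion reduces to a running average of the observations.

```python
import numpy as np

rng = np.random.default_rng(4)
mu = np.array([0.5, -1.0])            # E omega; minimizer of F(x) = E||x - omega||^2
lo, hi = -5.0, 5.0                    # D = [-5, 5]^2: for a box, p_D is a clip

def p_D(x):
    """Projection of IR^r onto D."""
    return np.clip(x, lo, hi)

X = np.array([4.0, 4.0])              # starting point X_1
for n in range(1, 2001):
    omega = mu + rng.normal(size=2)   # realization omega_n
    Y = 2.0 * (X - omega)             # stochastic gradient (5.4b) of ||x - omega||^2
    rho = 1.0 / (2.0 * n)             # step sizes satisfying (5.4d)
    X = p_D(X - rho * Y)              # Robbins-Monro iteration (5.4a)
```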
Considerable improvements of (5.4a) can be obtained [87] by
(1) Replacing −Yns at certain iteration points Xn, n ∈ IN1, by an improved
step direction hn of F at Xn, see [79, 80, 84, 86, 90]
(2) Step size control, cf. [85, 89, 117]

Using (1), we get the following hybrid stochastic approximation procedure

Xn+1 = pD(Xn + ρn hn) for n ∈ IN1,  Xn+1 = pD(Xn − ρn Yns) for n ∈ IN2, (5.5)

where IN1, IN2 is a certain partition of the set IN of integers.
Important types of improved step directions hn are:
– (feasible) Deterministic descent directions hn existing at nonstationary/
nonefficient points Xn in problems (5.1) having certain functional proper-
ties, see Sect. 4 and [81–84, 93].
– hn = −Yne, where Yne is a more exact estimator of ∇F(Xn), as, e.g.,
the arithmetic mean of a certain number M = Mn of simple stochastic
gradients Yn,is = ∇xf(ωn,i, Xn), Yn,is ∈ ∂xf(ωn,i, Xn), resp., i = 1, . . . , M,
of the type (5.4b), cf. [86], or the estimator Yne defined by the extended
central difference schemes given by Fabian [39, 41].
By an appropriate selection of the partition IN1 , IN2 we may guarantee
that the stochastic approximation method is accelerated sufficiently, without
using the improved and therefore more expensive step direction hn too often.
Several numerical experiments show that improved step directions hn are most
effective during an initial phase of the algorithm.
Having several generators of deterministic descent directions hn in spe-
cial situations, as, e.g., in stochastic linear programs with appropriate pa-
rameter distributions, see Sect. 4, in the following a general method for the
computation of more exact gradient estimators Yne at an arbitrary iteration
point Xn = x0 is described. This method, which includes as a special case
the well known finite difference schemes, used, e.g., in the KW -type proce-
dures [36,41,164], is based on the following fundamental principle of numerical
differentiation, see, e.g., [22, 48, 51]:
Starting from certain supporting points x(i), i = 1, . . . , p, lying in a neigh-
borhood S of x0, and the corresponding function values y(i) = F(x(i)), i =
1, . . . , p, by interpolation, first an interpolation function (e.g., a polynomial of
degree p − 1) F̂ = F̂(x) is computed. Then, the kth order numerical derivatives
∇̂kF of F at x0 are defined by the analytical derivatives

∇̂kF(x0) := ∇kF̂(x0), k = 0, 1, . . . , κ,

of the interpolation function F̂ at x0. Clearly, in order to guarantee a certain
accuracy of the higher numerical derivatives ∇̂kF(x0), a sufficiently large
number p of interpolation points x(i), i = 1, . . . , p, is needed.
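The interpolation principle above can be sketched for a scalar function: fit the degree p − 1 polynomial through p supporting points near x0 and differentiate the fit at x0. The test function below is hypothetical; np.polyfit plays the role of the interpolation step.

```python
import numpy as np

def interp_derivative(F, x0, radius=0.5, p=4):
    """Fit a degree p-1 polynomial through p points near x0 and
    return the derivative of the fit at x0 (scalar case)."""
    xs = x0 + radius * np.linspace(-1.0, 1.0, p)  # supporting points x^(i)
    ys = F(xs)                                    # function values y^(i) = F(x^(i))
    coeffs = np.polyfit(xs, ys, p - 1)            # interpolation polynomial F-hat
    return np.polyval(np.polyder(coeffs), x0)     # d/dx F-hat at x0

F = lambda x: x ** 3                              # hypothetical test function, F'(x) = 3 x^2
x0 = 1.2
dF = interp_derivative(F, x0)
```

Since F here is itself a cubic, the degree-3 interpolant reproduces it exactly and dF matches F'(x0) = 3·1.2² up to rounding.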

5.2 Gradient Estimation Using the Response Surface
Methodology (RSM)
We consider a subdomain S of the feasible domain D such that
x0 ∈ S ⊂ D, (5.6a)

where x0 is the point, e.g., x0 := Xn, at which an estimate ∇̂F(x0) of ∇F
must be computed. Corresponding to the principle of numerical differentiation
described above, in order to first estimate the objective function F(x) on S,
estimates y(i), i = 1, . . . , p, of the function values F(x(i)), i = 1, . . . , p, are
determined at so-called design points

x(i) = x(i)(x0) ∈ S, i = 1, 2, . . . , p, (5.6b)

chosen by the decision maker [17, 63, 106]. Simple estimators are

y(i) := f(ω(i), x(i)), i = 1, 2, . . . , p, (5.6c)

where ω(1), ω(2), . . . , ω(p) are independent realizations of the random element ω.
The objective function F is then approximated – on S – mostly by a polynomial response surface model F̂(x) = F̂(x|βj, j ∈ J). In practice, usually first and/or second order models are used. The coefficients βj, j ∈ J, are determined by means of regression analysis (mainly least squares estimation), see [23, 55, 63]. Having F̂(x), the RSM-gradient estimator ∇̂F(x0) at x0 is defined by the gradient (with respect to x)

∇̂F(x0) := ∇F̂(x0) (5.6d)

of F̂ at x0. Using higher order polynomial response surface models F̂, the estimates ∇̂ᵏF(x0) of higher order derivatives of F at x0 can be defined in the same way:

∇̂ᵏF(x0) := ∇ᵏF̂(x0), k = 1, . . . , κ. (5.6e)

Remark 5.1. The above gradient estimation procedure (5.6a)–(5.6e) is especially useful if the gradient representation (5.3a) or (5.3b) does not hold (or is not known explicitly) for every x ∈ IR^r, or if the numerical evaluation of ∇x f(ω, x), ∂x f(ω, x), G = G(ω, x), resp., is too complicated.
The following very important example is taken from structural reliability [28, 38, 156]: A decisive role is played, see Sects. 2.1 and 3.1, by the failure probability F(x) = 1 − P(yl ≤ y(a(ω), x) ≤ yu), i.e., the probability that, for a given design vector x (e.g., structural dimensions), the structure does not fulfill the basic behavioral constraints yl ≤ y(a(ω), x) ≤ yu. Here, yl, yu are given lower and upper bounds, y = y(a(ω), x) is an m-vector of displacement and/or stress components of the structure, and a = a(ω) denotes the random vector of structural parameters, e.g., load parameters, yield stresses, elastic moduli. Obviously,

F(x) = Ef(ω, x) with f(ω, x) := 1 − 1B(y(a(ω), x)),

where 1B designates the characteristic function of B = {y ∈ IR^m : yl ≤ y ≤ yu}. Here, (5.3a) does not hold, and the numerical evaluation of differentiation formulas of the type (5.3b), see Chap. 3, may be very complicated – if they are available at all.
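This situation can be sketched with a toy computation (the scalar response y = a·x, the distribution of a(ω) and the bounds are illustrative assumptions, not the book's example): F(x) is estimated as a sample mean of f(ω, x), while ∇x f does not exist since f is an indicator:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(a, x, yl=0.0, yu=1.5):
    """f(omega, x) = 1 - 1_B(y(a(omega), x)) for a toy scalar response
    y = a * x: equals 1 exactly when yl <= y <= yu is violated."""
    y = a * x
    return 1.0 - ((yl <= y) & (y <= yu)).astype(float)

def failure_probability(x, n=200_000):
    """Monte Carlo estimate of F(x) = E f(omega, x) = 1 - P(yl <= y <= yu)."""
    a = rng.normal(1.0, 0.3, size=n)   # random structural parameter a(omega)
    return f(a, x).mean()

p_fail = failure_probability(1.0)
```

Since f(·, x) is piecewise constant in x, only function-value information of this kind is available, which is exactly the setting where the RSM-gradient estimator applies.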
Remark 5.2. In addition to Remark 5.1, the RSM-estimator has the following further advantages:
(a) A priori knowledge on F can easily be incorporated into the selection of an appropriate response surface model. As an important example we mention here the use of so-called reciprocal variables 1/xk or more general intermediate variables tk = tk(xk) besides the decision variables xk, k = 1, . . . , r, in structural optimization, see, e.g., [5, 42].
(b) The present method yields not only estimates of the gradient (and some higher derivatives) of F at x0, but also estimates F̂(x0) of the function value F(x0). Moreover, besides gradient estimates, estimates of function values are the main indicators used to evaluate the performance of an algorithm, see [44]. Indeed, having these indicators, we may check whether the process runs in the "right direction," i.e., decreases (increases, resp.) the mean performance criterion F(x), see [105]. Hence, appropriate stopping criteria can be formulated.
Remark 5.3. In many practical situations the objective function F is nearly quadratic in a certain neighborhood of an optimal solution x∗ of (5.1). Hence, F may then be described by means of second order polynomial models F̂ with high accuracy. Based on second order models F̂ of F, estimates x̂∗ of an optimal solution x∗ of (5.1) may be obtained by solving analytically the approximate program

min F̂(x) s.t. x ∈ D.

Furthermore, Newton-type step directions hn := −∇²F̂(Xn)⁻¹∇F̂(Xn), to be used in procedure (5.5), can be calculated easily.
Remark 5.4. The above remarks show that RSM has many very interesting
features and advantages. But we have also to mention that for problems (5.1)
with large dimensions r, the computational expenses to find a full second order
model F̂ or more accurate gradient estimates may become very large. However,
an improved search direction hn is required in the semi-stochastic approxima-
tion procedure only at iteration points Xn with n ∈ IN1, cf. Sect. 5.1. Furthermore, within RSM there are several suggestions to reduce the estimation expenses. E.g., by group screening, a few significant variables xkl, l = 1, . . . , r1 (< r), can be identified [102]. On the other hand, by decomposition – e.g., in struc-
tural optimization – a large optimization problem that might be intractable
because of the large number of unknowns and constraints combined with a
computationally expensive evaluation of the objective function F can be con-
verted into a set of much smaller and separate, but coordinated subproblems,
see, e.g., [145]. Hence, in many cases the optimization problem (5.1) can be
reduced again to a moderate size.

5.2.1 The Two Phases of RSM

In Phase 1 of RSM, i.e., if the process (Xn) is still far away from an optimal point x∗ of (5.1), then F is estimated on S by the linear empirical model

F̂(x) = β0 + βIᵀ(x − x0), (5.7a)

where
βᵀ = (β0, βIᵀ) = (β0, β1, . . . , βr) (5.7b)

is the (r + 1)-vector of unknown coefficients of the linear model (5.7a). Having estimates y(i) of the function values F(x(i)) at the so-called design points x(i), i = 1, . . . , p, in S, by least squares estimation (LSQ) the following estimate β̂ of β is obtained:

β̂ = (WᵀW)⁻¹Wᵀy. (5.8a)

Here, the p × (r + 1) matrix W has the rows

(1, d(i)ᵀ), i = 1, 2, . . . , p, and y = (y(1), y(2), . . . , y(p))ᵀ, (5.8b)

with d := x − x0 and therefore

d(i) = x(i) − x0, i = 1, 2, . . . , p. (5.8c)

In order to guarantee the regularity of WᵀW, we require that

d(2) − d(1), . . . , d(p) − d(1), p ≥ r + 1, span the IR^r. (5.8d)

Having β̂ by (5.8a), the gradient ∇F of F can be estimated on S by the gradient of (5.7a), hence

∇̂F(x) = β̂I for all x ∈ S. (5.9)
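The Phase-1 estimator (5.8a)–(5.9) can be sketched as follows (an illustrative implementation; the quadratic test function, noise level and axial design below are assumptions, not the book's data):

```python
import numpy as np

rng = np.random.default_rng(1)

def rsm_gradient_phase1(F_noisy, x0, d):
    """Linear RSM gradient estimate: fit F_hat(x) = b0 + bI^T (x - x0) by
    least squares to noisy responses at the design points x0 + d[i]; the
    slope vector bI estimates the gradient of F on S, cf. (5.8a), (5.9)."""
    W = np.hstack([np.ones((len(d), 1)), d])      # rows (1, d(i)^T), cf. (5.8b)
    y = np.array([F_noisy(x0 + di) for di in d])  # estimates y(i), cf. (5.6c)
    beta, *_ = np.linalg.lstsq(W, y, rcond=None)  # beta_hat = (W^T W)^-1 W^T y
    return beta[1:]                               # beta_I: gradient estimate

F_noisy = lambda x: x @ x + 0.01 * rng.normal()   # F(x) = ||x||^2 plus noise
x0 = np.array([1.0, -2.0])
d = 0.1 * np.vstack([np.eye(2), -np.eye(2), np.zeros((1, 2))])  # p = 5, satisfies (5.8d)
g = rsm_gradient_phase1(F_noisy, x0, d)           # true gradient at x0: (2, -4)
```

The axial design with p = 5 points makes WᵀW regular in the sense of (5.8d).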

The accuracy of the estimator (5.9) on S, cf. (5.6a), is studied in the following way: Denote by

ε(i) = ε(i)(ω, x(i)), i = 1, 2, . . . , p, (5.10a)

the stochastic error of the estimate y(i) of F(x(i)), see, e.g., (5.6c). By means of first order Taylor expansion of F at x0 we obtain, cf. (5.8c),

y(i) = F(x(i)) + ε(i) = F(x0) + ∇F(x0)ᵀd(i) + R1(i) + ε(i), (5.10b)

where

R1(i) = ½ d(i)ᵀ ∇²F(x0 + ϑi d(i)) d(i), 0 < ϑi < 1, (5.10c)

denotes the remainder of the first order Taylor expansion. Obviously, if S is a convex subset of the feasible domain D of (5.1), and the Hessian ∇²F(x) of the objective function F of (5.1) is bounded on S, then

|R1(i)| ≤ ½ ‖d(i)‖² sup_{0≤ϑ≤1} ‖∇²F(x0 + ϑd(i))‖ ≤ ½ ‖d(i)‖² sup_{x∈S} ‖∇²F(x)‖. (5.10d)

Thus, representation (5.10b) yields the data model

y = Wβ + R1 + ε, (5.11a)

where W, y are given by (5.8b), and the vectors β, R1 and ε are defined by

β = (F(x0), ∇F(x0)ᵀ)ᵀ, R1 = (R1(1), R1(2), . . . , R1(p))ᵀ, ε = (ε(1), ε(2), . . . , ε(p))ᵀ. (5.11b)

Concerning the p-vector ε of random errors ε(i) , i = 1, . . . , p, we make the


following (standard) assumptions

E(ε|x0 ) = 0, cov(ε|x0 ) = E(εεT |x0 ) = Q := (σi2 δij ), (5.11c), (5.11d)

where δij , i, j = 1, . . . , p, denotes the Kronecker symbol, and

σi2 = σi2 (x0 , d(1) , d(2) , . . . , d(p) ), i = 1, 2, . . . , p. (5.11e)

Putting now the data model (5.11a) into (5.8a), we get

β̂ = (F(x0), ∇F(x0)ᵀ)ᵀ + (WᵀW)⁻¹Wᵀ(R1 + ε). (5.12a)

Hence, for the estimator β̂I of ∇F(x0), cf. (5.9), we find

β̂I = ∇F(x0) + [ (1/p) Σ_{i=1}^p (d(i) − d̄)(d(i) − d̄)ᵀ ]⁻¹ (1/p) (d(1) − d̄, . . . , d(p) − d̄)(R1 + ε), (5.12b)

where d̄ is defined by

d̄ = (1/p) Σ_{i=1}^p d(i), d(i) = x(i) − x0. (5.12c)

Clearly, (5.12b), (5.12c) represent the accuracy of the estimator (5.9) on S.
Note. According to (5.12a), the accuracy of β̂ depends on the deterministic
model error R1 , the stochastic observation error ε and also on the choice of
the design points x(1) , . . . , x(p) , see, e.g., [17, 136].

In Phase 2 of RSM, i.e., if the process (Xn) is approaching a certain neighborhood of a minimal point x∗ of (5.1), see Remark 5.3, then the linear model (5.7a) of F is in general no longer adequate. It is therefore replaced by the more general quadratic empirical model

F̂(x) = β0 + βIᵀ(x − x0) + (x − x0)ᵀB(x − x0), x ∈ S. (5.13a)

The unknown parameters β0, βi, i = 1, . . . , r, in the r-vector βI, and βij, i, j = 1, . . . , r, in the symmetric r × r matrix B are determined again by least squares estimation. The LSQ-estimate

β̂ = (β̂0, β̂I, B̂)

of the ½(r² + 3r + 2)-vector

βᵀ = (β0, β1, β2, . . . , βr, β11, β12, . . . , β1r, β22, β23, . . . , β2r, . . . , β_{r−1,r−1}, β_{r−1,r}, βrr) (5.13b)

of unknown parameters in model (5.13a), denoted for simplicity also by

β = (β0, βI, B), (5.13b′)

is given – in correspondence with formula (5.8a) – again by

β̂ = (WᵀW)⁻¹Wᵀy. (5.14a)

Here the p × ½(r² + 3r + 2) matrix W has the rows

(1, d(i)ᵀ, z(i)ᵀ), i = 1, 2, . . . , p, and y = (y(1), y(2), . . . , y(p))ᵀ, (5.14b)

where d(i) = x(i) − x0, see (5.8c), and, with dk(i) denoting the kth component of d(i),

z(i) = ((d1(i))², 2d1(i)d2(i), . . . , 2d1(i)dr(i), (d2(i))², 2d2(i)d3(i), . . . , 2d2(i)dr(i), . . . , (dr(i))²)ᵀ. (5.14c)

Corresponding to the first phase, the existence of (WᵀW)⁻¹ is guaranteed – for the full quadratic model – by the assumption that the stacked vectors

(d(2) − d(1), z(2) − z(1)), . . . , (d(p) − d(1), z(p) − z(1)) span the IR^{(r²+3r+2)/2}, (5.14d)

see [17, 106], hence,

p ≥ (r + 1)(r + 2)/2 + 1.
Having β̂ = (β̂0, β̂I, B̂), the gradient of F is then estimated on S by the gradient of (5.13a), thus

∇̂F(x) = β̂I + 2B̂(x − x0) for x ∈ S (5.15a)

and therefore

∇̂F(x0) = β̂I. (5.15b)
Note. In practice, at the beginning of Phase 2 only the pure quadratic terms βii(xi − x0i)², i = 1, . . . , r, may be added (successively) to the linear model (5.7a).
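The Phase-2 estimator (5.13a)–(5.15b) can be sketched with the regressor vector (5.14c) built explicitly (the quadratic objective, noise level and random design below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

def z_vector(d):
    """Second-order regressors, cf. (5.14c): squares and doubled cross
    products of d, ordered (d1^2, 2 d1 d2, ..., 2 d1 dr, d2^2, ...), so
    that beta_z . z(d) = d^T B d for a symmetric matrix B."""
    r = len(d)
    return np.array([d[i] * d[j] * (1.0 if i == j else 2.0)
                     for i in range(r) for j in range(i, r)])

def rsm_gradient_phase2(F_noisy, x0, D):
    """Phase-2 RSM estimate: fit the quadratic model (5.13a) by LSQ and
    return the gradient estimate at x0, i.e., the linear coefficients bI."""
    p, r = D.shape
    W = np.hstack([np.ones((p, 1)), D, np.vstack([z_vector(d) for d in D])])
    y = np.array([F_noisy(x0 + d) for d in D])
    beta, *_ = np.linalg.lstsq(W, y, rcond=None)
    return beta[1:1 + r]

F_noisy = lambda x: x[0]**2 + 3*x[0]*x[1] + 2*x[1]**2 + 0.001*rng.normal()
x0 = np.array([1.0, 1.0])            # true gradient: (2x+3y, 3x+4y) = (5, 7)
D = 0.2 * rng.normal(size=(30, 2))   # p = 30 >= (r+1)(r+2)/2 + 1 design points
g = rsm_gradient_phase2(F_noisy, x0, D)
```

Since the toy objective is exactly quadratic, the model bias vanishes here and only the observation noise remains in the estimate.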
For the consideration of the accuracy of the estimator (5.15b) we now use second order Taylor expansion. Corresponding to (5.10b), we obtain

y(i) = F(x(i)) + ε(i) = F(x0) + ∇F(x0)ᵀd(i) + ½ d(i)ᵀ∇²F(x0)d(i) + R2(i) + ε(i), (5.16a)

where

R2(i) = (1/3!) ∇³F(x0 + ϑi d(i)) · (d(i))³, 0 < ϑi < 1, (5.16b)

denotes the remainder of the second order Taylor expansion. Clearly, if S is a convex subset of D and the multilinear form ∇³F(x) of all third order derivatives of F is bounded on S, then

|R2(i)| ≤ (1/6) ‖d(i)‖³ sup_{0≤ϑ≤1} ‖∇³F(x0 + ϑd(i))‖ ≤ (1/6) ‖d(i)‖³ sup_{x∈S} ‖∇³F(x)‖. (5.16c)

Representation (5.16a) yields now the data model

y = W β + R2 + ε. (5.17a)

Here, W, y are given by (5.14b), and the vectors β, R2 , ε are defined, cf. (5.13b),
(5.13b ) for the notation, by

β = (F(x0), ∇F(x0), ½∇²F(x0)), R2 = (R2(1), R2(2), . . . , R2(p))ᵀ, ε = (ε(1), ε(2), . . . , ε(p))ᵀ. (5.17b)
Corresponding to (5.11c)–(5.11e) we also suppose in Phase 2 that

E(ε|x0 ) = 0, cov(ε|x0 ) = E(εεT |x0 ) = Q := (σi2 δij ) (5.17c), (5.17d)


 
σi2 = σi2 x0 , d(1) , d(2) , . . . , d(p) , i = 1, . . . , p. (5.17e)

Putting the data model (5.17a) into (5.14a), we get – see (5.13b), (5.13b′) concerning the notation –

β̂ = (F(x0), ∇F(x0), ½∇²F(x0)) + (WᵀW)⁻¹WᵀR2 + (WᵀW)⁻¹Wᵀε. (5.18a)

Especially, for the estimator β̂I of ∇F(x0), see (5.15a), (5.15b), we find

β̂I = ∇F(x0) + U⁻¹ (1/p) [ (d(1) − d̄, . . . , d(p) − d̄) − ( (1/p) Σ_{i=1}^p (d(i) − d̄)(z(i) − z̄)ᵀ ) Z⁻¹ (z(1) − z̄, . . . , z(p) − z̄) ] (R2 + ε). (5.18b)

The vectors d(i), d̄, z(i) are given by (5.8c), (5.12c), (5.14c), resp., and z̄, Z, U are defined by

z̄ = (1/p) Σ_{i=1}^p z(i), Z = (1/p) Σ_{i=1}^p (z(i) − z̄)(z(i) − z̄)ᵀ, (5.18c), (5.18d)

U = (1/p) Σ_{i=1}^p (d(i) − d̄)(d(i) − d̄)ᵀ − [ (1/p) Σ_{i=1}^p (d(i) − d̄)(z(i) − z̄)ᵀ ] Z⁻¹ [ (1/p) Σ_{i=1}^p (z(i) − z̄)(d(i) − d̄)ᵀ ]. (5.18e)

5.2.2 The Mean Square Error of the Gradient Estimator

According to (5.9) and (5.12b), (5.15b) and (5.18b), resp., in both phases for the gradient estimator ∇̂F(x0) we find

∇̂F(x0) = ∇F(x0) + H(R + ε) (5.19)

with R = R1, R = R2, respectively. Furthermore, H designates the r × p matrix arising in the representation (5.12b), (5.18b), resp., of ∇̂F(x0) = β̂I.

The mean square error V = V(x0) of the estimator ∇̂F(x0),

V := E(‖∇̂F(x0) − ∇F(x0)‖² | x0), (5.20a)

is given, cf. (5.11c), (5.17c), resp., by the sum

V = ‖edet‖² + E(‖estoch‖² | x0) (5.20b)

of the squared bias ‖edet‖² := ‖HR‖² and the variance of ∇̂F(x0),

E(‖estoch‖² | x0) := E(‖Hε‖² | x0).

Since ‖w‖² = tr(wwᵀ) = ‖wwᵀ‖ for a vector w, where tr(M), ‖M‖, resp., denote the trace, the operator norm of a square matrix M, we find

‖edet‖² = ‖HRRᵀHᵀ‖ = sup_{‖w‖=1} wᵀ(HRRᵀHᵀ)w ≤ ‖R‖² ‖HHᵀ‖. (5.20c)

Moreover, using that tr(M) is equal to the sum of the eigenvalues of M, we get

E(‖estoch‖² | x0) = E(tr(HεεᵀHᵀ) | x0) = tr(HQHᵀ) ≤ r sup_{‖w‖=1} (Hᵀw)ᵀ(σi²δij)(Hᵀw)
≤ r max_{1≤i≤p} σi² sup_{‖w‖=1} ‖Hᵀw‖² = r max_{1≤i≤p} σi² ‖HHᵀ‖, (5.20d)

cf. (5.11d), (5.17d), respectively. Consequently, inequalities (5.20c), (5.20d) yield in the first and in the second phase

V ≤ ( ‖R‖² + r max_{1≤i≤p} σi² ) ‖HHᵀ‖. (5.20e)

Here, the matrix product HHᵀ can be computed explicitly: In Phase 1, due to (5.12b) we have

H = H0⁻¹ (1/p) (d(1) − d̄, d(2) − d̄, . . . , d(p) − d̄), (5.21a)

where d(i), d̄ are given by (5.8c), (5.12c), resp., and

H0 = (1/p) Σ_{i=1}^p (d(i) − d̄)(d(i) − d̄)ᵀ. (5.21b)

By an easy calculation, from (5.21a), (5.21b) we get

HHᵀ = (1/p) H0⁻¹ H0 H0⁻¹ = (1/p) H0⁻¹. (5.21c)
 
Assume now that the function value F(x(i)) is estimated ν times at

x(i) = x0 + d(i) for each i = 1, . . . , p0. (5.22a)

Hence, p = νp0 and

d(kp0+i) = d(i), x(kp0+i) = x(i), i = 1, 2, . . . , p0, k = 1, 2, . . . , ν − 1. (5.22b)

Putting

H0(0) := (1/p0) Σ_{i=1}^{p0} (d(i) − d̄(0))(d(i) − d̄(0))ᵀ, d̄(0) := (1/p0) Σ_{i=1}^{p0} d(i), (5.23a)

in case (5.22a), (5.22b) we easily find

HHᵀ = (1/ν)(1/p0) H0(0)⁻¹. (5.23b)
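Likewise, (5.23b) can be checked numerically: replicating each of the p0 design points ν times leaves H0 unchanged but shrinks HHᵀ – and hence the stochastic error term in (5.20e) – by the factor 1/ν (random increments assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)

def HHT(d):
    """H H^T for the Phase-1 matrix H of (5.21a) built from increments d."""
    dc = d - d.mean(axis=0)
    H0 = dc.T @ dc / len(d)
    H = np.linalg.inv(H0) @ dc.T / len(d)
    return H @ H.T

p0, r, nu = 6, 2, 5
d0 = rng.normal(size=(p0, r))                 # p0 distinct increments d(i)
d = np.tile(d0, (nu, 1))                      # each point replicated nu times, cf. (5.22b)
ratio_ok = np.allclose(HHT(d), HHT(d0) / nu)  # (5.23b): H H^T = (1/nu)(1/p0) H0(0)^-1
```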

In Phase 2 the matrix H reads, see (5.18b),

H = H0⁻¹ (1/p) [ (d(1) − d̄, . . . , d(p) − d̄) − ( (1/p) Σ_{i=1}^p (d(i) − d̄)(z(i) − z̄)ᵀ ) Z⁻¹ (z(1) − z̄, . . . , z(p) − z̄) ], (5.24a)

where d(i), d̄, z(i), z̄, Z are given by (5.8c), (5.12c), (5.14c), (5.18c), (5.18d), resp., and H0 is defined, cf. (5.18e), here by

H0 := U. (5.24b)

Corresponding to (5.21c), from (5.24a), (5.24b), also in Phase 2 we get

HHᵀ = (1/p) H0⁻¹. (5.24c)
In case (5.22a), (5.22b) of estimating F ν times at p0 different points, one can easily see that an equation corresponding to (5.23b) holds also in Phase 2. Consequently, from (5.20e), and (5.21c), (5.24c), resp., for the mean square estimation error V we obtain in both phases

V ≤ ( ‖R‖² + r max_{1≤i≤p} σi² ) (1/p) ‖H0⁻¹‖, (5.25a)

where H0 is given by (5.21b), (5.24b), respectively. In case (5.22a), (5.22b) we get

V ≤ ( ‖R‖² + r max_{1≤i≤p} σi² ) (1/(νp0)) ‖H0(0)⁻¹‖, (5.25b)

provided that

σ(kp0+i)² = σi², i = 1, 2, . . . , p0, k = 1, . . . , ν − 1. (5.26)

Finally, we have to consider ‖R‖²: First, in Phase 1 according to (5.10d) and (5.11b) we find, with certain numbers ϑi, 0 < ϑi < 1, i = 1, . . . , p,

‖R1‖² ≤ (1/4) Σ_{i=1}^p ‖d(i)‖⁴ ‖∇²F(x0 + ϑi d(i))‖²
≤ (1/4) Σ_{i=1}^p ‖d(i)‖⁴ sup_{0≤ϑ≤1} ‖∇²F(x0 + ϑd(i))‖²
≤ (1/4) ( Σ_{i=1}^p ‖d(i)‖⁴ ) ( sup_{x∈S} ‖∇²F(x)‖ )², (5.27a)

where the last inequality holds if S, cf. (5.6a), (5.6b), is convex and ‖∇²F(x)‖ is bounded on S. In Phase 2, with (5.16b), (5.17b), we find

‖R2‖² ≤ (1/36) Σ_{i=1}^p ‖d(i)‖⁶ ‖∇³F(x0 + ϑi d(i))‖²
≤ (1/36) Σ_{i=1}^p ‖d(i)‖⁶ sup_{0≤ϑ≤1} ‖∇³F(x0 + ϑd(i))‖²
≤ (1/36) ( Σ_{i=1}^p ‖d(i)‖⁶ ) sup_{x∈S} ‖∇³F(x)‖², (5.27b)

where the last inequality holds under corresponding assumptions as used above. Summarizing (5.27a), (5.27b), we have

‖Rl‖² ≤ (1/((l + 1)!)²) Σ_{i=1}^p ‖d(i)‖^{2(l+1)} sup_{0≤ϑ≤1} ‖∇^{l+1}F(x0 + ϑd(i))‖², l = 1, 2. (5.27c)
Putting (5.27c) into (5.25a), for l = 1, 2 we find

V ≤ [ (1/((l + 1)!)²) (1/p) Σ_{i=1}^p ‖d(i)‖^{2(l+1)} sup_{0≤ϑ≤1} ‖∇^{l+1}F(x0 + ϑd(i))‖² + (r/p) max_{1≤i≤p} σi² ] ‖H0⁻¹‖. (5.28)

We consider now the case

d(i) = µ d̃(i), i = 1, 2, . . . , p, 0 < µ ≤ 1, (5.29a)

where d̃(i), i = 1, . . . , p, are given r-vectors. Obviously,

H0 = µ² H̃0, (5.29b)

where the r × r matrix H̃0 follows from (5.21b), (5.24b), resp., by replacing d(i) by d̃(i), i = 1, . . . , p. Using (5.29a), from (5.28) we get for l = 1, 2

V ≤ [ (µ^{2l}/((l + 1)!)²) (1/p) Σ_{i=1}^p ‖d̃(i)‖^{2(l+1)} sup_{0≤ϑ≤1} ‖∇^{l+1}F(x0 + ϑd̃(i))‖² + (rµ⁻²/p) max_{1≤i≤p} σi² ] ‖H̃0⁻¹‖. (5.30)

5.3 Estimation of the Mean Square (Mean Functional) Error

Considering now the convergence behavior of the hybrid stochastic approximation procedure (5.5) working with the RSM-gradient estimator Yne, we need the following notations.
Let IN1 , IN2 be a given partition of IN, and suppose that Xn+1 = pD (Xn −
ρn Yns ) with a simple gradient estimator Yns , see (5.4b), at each iteration point
Xn with n ∈ IN2 . Furthermore, define for each Xn with n ∈ IN1 the improved
step direction hn = −Yne by the following scheme:
Following Sect. 5.2, at x0 = Xn , n ∈ IN1 , we select r-vectors d(i) , i =
1, . . . , p. In the most general case it is

d(i) = d(i,n) = d(i,n) (Xn ), i = 1, 2, . . . , p = p(n), (5.31a)

i.e., the increments d(1), . . . , d(p) and p may depend on the stage index n and the iteration point Xn. In the simplest case we have

d(i,n) = d(i,0), i = 1, 2, . . . , p = p̄ for all n = 1, 2, . . . , (5.31b)

where d(1,0) , . . . , d(p,0) are p given, fixed r-vectors.


The unknown objective function F is then estimated at

x(i) = x(i,n) = Xn + d(i,n) , i = 1, 2, . . . , p(n), (5.31c)

i.e., we have the estimates

y (i) = y (i,n) = F (x(i,n) ) + ε(i,n) , i = 1, 2, . . . , p(n). (5.31d)



According to (5.8b), (5.14b), resp., we define

yn = (y(1,n), y(2,n), . . . , y(p,n))ᵀ with p = p(n), (5.31e)

and Wn as the matrix with rows (1, d(i,n)ᵀ), i = 1, . . . , p, in Phase 1, and with rows (1, d(i,n)ᵀ, z(i,n)ᵀ), i = 1, . . . , p, in Phase 2, (5.31f)

where

z(i,n) = ((d1(i,n))², 2d1(i,n)d2(i,n), . . . , 2d1(i,n)dr(i,n), (d2(i,n))², 2d2(i,n)d3(i,n), . . . , (dr(i,n))²)ᵀ, i = 1, 2, . . . , p, (5.31g)

cf. (5.14c). Note that in case (5.31b) the matrices Wn are fixed. Finally, the RSM-step directions hn are defined by

hn = −Yne = −∇̂F(Xn) := −β̂(n)I with β̂(n) = (WnᵀWn)⁻¹Wnᵀyn, (5.32)

see (5.8a), (5.9) for Phase 1 and (5.14a), (5.15b) for Phase 2. Of course, we suppose that all matrices WnᵀWn, n = 1, 2, . . . , are regular, see (5.8d), (5.14d).
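The complete hybrid scheme – simple stochastic gradients Yns for n ∈ IN2 and RSM directions Yne on the subsequence IN1 – can be sketched on a toy problem (the objective, the constraint box D, the choice IN1 = {5, 10, 15, . . .} and all constants below are illustrative assumptions, not the book's setting):

```python
import numpy as np

rng = np.random.default_rng(6)

# Toy problem: min F(x) = E ||x - omega||^2 over D = [-5, 5]^2 with
# omega ~ N(x*, I); the unique minimum is x* = (1, 2), grad F(x) = 2(x - x*).
x_star = np.array([1.0, 2.0])
F_noisy = lambda x: np.sum((x - (x_star + rng.normal(size=2))) ** 2)
Y_simple = lambda x: 2.0 * (x - (x_star + rng.normal(size=2)))  # Y_n^s, cf. (5.4b)

def Y_rsm(x0, mu=0.2):
    """Phase-1 RSM direction Y_n^e, cf. (5.32): least squares fit of a
    linear model to noisy function values at p = 5 axial design points."""
    d = mu * np.vstack([np.eye(2), -np.eye(2), np.zeros((1, 2))])
    W = np.hstack([np.ones((len(d), 1)), d])
    y = np.array([F_noisy(x0 + di) for di in d])
    beta, *_ = np.linalg.lstsq(W, y, rcond=None)
    return beta[1:]

proj = lambda x: np.clip(x, -5.0, 5.0)                # projection p_D onto D
X = np.array([4.0, -4.0])
for n in range(1, 2001):
    rho = 2.0 / n                                     # step sizes (5.53), c = 2, q = 0
    Y = Y_rsm(X) if n % 5 == 0 else Y_simple(X)       # n in IN1 vs. n in IN2
    X = proj(X - rho * Y)
```

In this sketch the RSM steps use only function values, i.e., they would remain available even if the direct gradient estimator Yns did not exist, cf. Remark 5.1.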

5.3.1 The Argument Case

Here we consider the behavior of the hybrid stochastic approximation method (5.5) by means of the mean square errors

bn = E‖Xn − x∗‖², n ≥ 1, with an optimal solution x∗ of (5.1). (5.33)

Following [89], in the present case we suppose

k0 ‖x − x∗‖² ≤ (x − x∗)ᵀ(∇F(x) − ∇F(x∗)) ≤ k1 ‖x − x∗‖², (5.34a)
k0² ‖x − x∗‖² ≤ ‖∇F(x) − ∇F(x∗)‖² ≤ k1² ‖x − x∗‖², (5.34b)

for all x ∈ D, where k0, k1 are some constants such that

k1 ≥ k0 > 0. (5.34c)

Moreover, for the mean square estimation error

Vns(Xn) = E(‖Yns − ∇F(Xn)‖² | Xn) (5.35a)

of the simple stochastic gradient Yns according to (5.4b), if (5.3a) holds true, we suppose that

Vns(Xn) ≤ σ1s² + C1s ‖Xn − x∗‖² with constants σ1s, C1s ≥ 0. (5.35b)

For n ∈ IN2 with Yns given by (5.4b) we obtain [89]

bn+1 ≤ (1 − 2k0ρn + (k1² + C1s)ρn²) bn + σ1s² ρn². (5.36)

For n ∈ IN1 we have, see (5.32), Xn+1 = pD(Xn − ρnYne), and therefore [89]

E(‖Xn+1 − x∗‖² | Xn) ≤ ‖Xn − x∗‖² − 2ρn E((Xn − x∗)ᵀ(Yne − ∇F(x∗)) | Xn)
+ ρn² [ E(‖Yne − ∇F(Xn)‖² | Xn) + 2E((Yne − ∇F(Xn))ᵀ(∇F(Xn) − ∇F(x∗)) | Xn) + ‖∇F(Xn) − ∇F(x∗)‖² ]. (5.37)

According to (5.19), in Phase 1 and Phase 2 we get

E(Yne | Xn) = ∇F(Xn) + E(Hn(Rn + εn) | Xn). (5.38a)

In this equation, cf. (5.10c) and (5.11b), (5.16b) and (5.17b), resp., (5.31a)–(5.31g),

Rn = (Rl(1)(Xn, d(1,n)), Rl(2)(Xn, d(2,n)), . . . , Rl(p(n))(Xn, d(p(n),n)))ᵀ, l = 1, 2, εn = (ε(1,n), ε(2,n), . . . , ε(p(n),n))ᵀ, (5.38b)

is the vector of remainders resulting from using first, second order empirical models for F, the vector of stochastic errors in estimating F(x) at Xn + d(i,n), i = 1, 2, . . . , p(n), respectively. Furthermore, Hn is the r × p matrix given by (5.21a), (5.24a), resp., if there we put d(i) := d(i,n), z(i) := z(i,n), i = 1, 2, . . . , p = p(n). Hence,
E(Yne | Xn) = ∇F(Xn) + edet,n,

where, cf. (5.20a)–(5.20c), edet,n = HnRn is the bias resulting from using a first or second order empirical model for F. Consequently,

E((Xn − x∗)ᵀ(Yne − ∇F(x∗)) | Xn) = (Xn − x∗)ᵀ(∇F(Xn) − ∇F(x∗) + edet,n)
≥ k0 ‖Xn − x∗‖² + (Xn − x∗)ᵀ edet,n, (5.39a)

see (5.34a). Moreover,

E((Yne − ∇F(Xn))ᵀ(∇F(Xn) − ∇F(x∗)) | Xn) = edet,nᵀ(∇F(Xn) − ∇F(x∗)), (5.39b)

and (5.34b) yields

‖∇F(Xn) − ∇F(x∗)‖² ≤ k1² ‖Xn − x∗‖². (5.39c)

Finally, (5.25a) implies that

Vne(Xn) := E(‖Yne − ∇F(Xn)‖² | Xn) ≤ ( ‖Rn‖² + r max_{1≤i≤p(n)} σi(Xn)² ) (1/p(n)) ‖Hn0⁻¹‖, (5.39d)
where Hn0 is given by (5.21b), (5.24b), respectively. From (5.37) and (5.39a)–(5.39d) we now have

E(‖Xn+1 − x∗‖² | Xn) ≤ (1 − 2k0ρn + k1²ρn²) ‖Xn − x∗‖²
− 2ρn edet,nᵀ(Xn − x∗) + 2ρn² edet,nᵀ(∇F(Xn) − ∇F(x∗))
+ ρn² ( ‖Rn‖² + r max_{1≤i≤p(n)} σi(Xn)² ) (1/p(n)) ‖Hn0⁻¹‖. (5.40)
 
T T
Estimating next edet
n (Xn − x∗ ) and edet
n ∇F (Xn ) − ∇F (x∗ ) , we get,
see (5.20c), (5.21c), (5.24c), (5.34b)
  /
 det T ∗  −1 1
en (Xn − x ) ≤ Hn0  Rn  · Xn − x∗ , (5.41a)
p(n)
   /
 det T  −1 1
en ∇F (Xn ) − ∇F (x∗ )  ≤ k1 Hn0  Rn  · Xn − x∗ . (5.41b)
p(n)
According to (5.27c) we find

(1/p(n))^{1/2} ‖Rn‖ ≤ (1/(ln + 1)!) ( (1/p(n)) Σ_{i=1}^{p(n)} ‖d(i,n)‖^{2(ln+1)} sup_{0≤ϑ≤1} ‖∇^{ln+1}F(Xn + ϑd(i,n))‖² )^{1/2}. (5.41c)

Here, (ln )n∈IN1 is a sequence with ln ∈ {1, 2} for all n ∈ IN1 such that

{n ∈ IN1 : algorithm (5.5) is in Phase λ} = {n ∈ IN1 : ln = λ}, λ = 1, 2.


(5.41d)

Note. For the transition from Phase 1 to Phase 2 statistical F tests may be
applied. However, since a sequence of tests must be applied, the significance
level is not easy to calculate.
Furthermore, suppose that representation (5.29a) holds, i.e.,

d(i,n) = µn d̃(i,n), 0 < µn ≤ 1, i = 1, 2, . . . , p(n),

where d̃(i,n), i = 1, . . . , p(n), are given r-vectors. Then Hn0 = µn² H̃n0, cf. (5.29b), and (5.41c) yields

‖Hn0⁻¹‖^{1/2} (1/p(n))^{1/2} ‖Rn‖ ≤ ‖H̃n0⁻¹‖^{1/2} (µn^{ln}/(ln + 1)!) ( (1/p(n)) Σ_{i=1}^{p(n)} ‖d̃(i,n)‖^{2(ln+1)} sup_{0≤ϑ≤1} ‖∇^{ln+1}F(Xn + ϑd̃(i,n))‖² )^{1/2}. (5.41e)

Now we assume that

sup_{0≤ϑ≤1} ‖∇^{ln+1}F(Xn + ϑd̃(i,n))‖² ≤ γn² + Γn² ‖Xn − x∗‖², 1 ≤ i ≤ p(n), n ∈ IN1, (5.42)

where γn ≥ 0, Γn ≥ 0, n ∈ IN1, are certain coefficients. Then (5.41e) yields

‖Hn0⁻¹‖^{1/2} (1/p(n))^{1/2} ‖Rn‖ ≤ ‖H̃n0⁻¹‖^{1/2} (µn^{ln}/(ln + 1)!) ( (1/p(n)) Σ_{i=1}^{p(n)} ‖d̃(i,n)‖^{2(ln+1)} )^{1/2} ( γn + Γn ‖Xn − x∗‖ ). (5.43)

Putting (5.41a), (5.41b), (5.41e) and (5.43) into (5.40), with (5.42) we get

E(‖Xn+1 − x∗‖² | Xn) ≤ ( 1 − 2[k0 − Γnµn^{ln}Tn]ρn + (k1 + Γnµn^{ln}Tn)²ρn² ) ‖Xn − x∗‖²
+ ρn² ( (rµn⁻²/p(n)) ‖H̃n0⁻¹‖ max_{1≤i≤p(n)} σi(Xn)² + γn²(µn^{ln}Tn)² )
+ 2γn(ρn + k1ρn²)µn^{ln}Tn ‖Xn − x∗‖, n ∈ IN1. (5.44a)

Here, Tn is given by

Tn = (1/(ln + 1)!) ‖H̃n0⁻¹‖^{1/2} ( (1/p(n)) Σ_{i=1}^{p(n)} ‖d̃(i,n)‖^{2(ln+1)} )^{1/2}. (5.44b)

In the following we suppose, cf. (5.31a)–(5.31d), that

E((y(i,n) − F(x(i,n)))² | Xn) =: σi(Xn)² ≤ σ1F² + C1F ‖Xn − x∗‖², 1 ≤ i ≤ p(n), (5.45)

with constants σ1F, C1F ≥ 0. Taking expectations on both sides of (5.44a), we find the following result:
Theorem 5.1. Assume that (5.34a)–(5.34c), (5.42), (5.45) hold, and use representation (5.29a). Then

bn+1 ≤ { 1 − 2[k0 − Γnµn^{ln}Tn]ρn + [ (k1 + Γnµn^{ln}Tn)² + rC1F (µn⁻²/p(n)) ‖H̃n0⁻¹‖ ] ρn² } bn
+ 2γn(ρn + k1ρn²)µn^{ln}Tn E‖Xn − x∗‖
+ [ rσ1F² (µn⁻²/p(n)) ‖H̃n0⁻¹‖ + γn²(µn^{ln}Tn)² ] ρn² for all n ∈ IN1, (5.46)

where Tn is given by (5.44b).
Remark 5.5. Since the KW-gradient estimator (5.4c) can be represented as a
special (Phase 1-)RSM-gradient estimator, for the estimator Yns according to
(5.4c) we obtain again an error recursion of type (5.46). Indeed, we have only
to set ln = 1 and to use appropriate (KW-)increments d(i,n) := d(i,0) , i =
1, . . . , p, n ≥ 1.

5.3.2 The Criterial Case

Here the performance of (5.5) is evaluated by means of the mean functional


errors
 
bn = E F (Xn ) − F ∗ , n ≥ 1, with the minimum value F ∗ of (5.1).

Under corresponding assumptions, see [89], also in the criterial case an


inequality of the type (5.46) is obtained.

5.4 Convergence Behavior of Hybrid Stochastic Approximation Methods

In the following we compare the hybrid stochastic approximation method (5.5) with the standard stochastic gradient procedure (5.4a), i.e., Xn+1s := pD(Xns − ρnYns), n = 1, 2, . . . . For this purpose we suppose that (5.3a) holds and Yns is given by (5.4b).

5.4.1 Asymptotically Correct Response Surface Model

Suppose that γn = 0 for all n ∈ IN1. Hence, according to (5.42), the sequence of response surface models F̂n is asymptotically correct. In this case inequalities (5.36) and (5.46) yield

bn+1 ≤ (1 − 2K0,ne ρn + K1,ne2 ρn²) bn + σ1,ne2 ρn² for n ∈ IN1,
bn+1 ≤ (1 − 2k0ρn + (k1² + C1s)ρn²) bn + σ1s² ρn² for n ∈ IN2, (5.47a)
where k0, k1² + C1s, σ1s² are the same constants as in (5.36) and

K0,ne = k0 − Γnµn^{ln}Tn, (5.47b)
K1,ne2 = (k1 + Γnµn^{ln}Tn)² + (rC1F/(µn²p(n))) ‖H̃n0⁻¹‖, (5.47c)
σ1,ne2 = (rσ1F²/(µn²p(n))) ‖H̃n0⁻¹‖. (5.47d)
A weak assumption is that

K0,ne ≥ K0e > 0 for all n ∈ IN1, (5.48a)
K1,ne2 ≤ K1e2 < +∞ for all n ∈ IN1, (5.48b)
σ1,ne2 ≤ σ1e2 < +∞ for all n ∈ IN1, (5.48c)

with constants K0e > 0, K1e2 < +∞ and σ1e2 < +∞; note that k0 ≥ K0e, see (5.47b).
Remark 5.6. Suppose that (Γn)n∈IN1, cf. (5.42), is bounded, i.e., 0 ≤ Γn ≤ Γ0 < +∞ for all n ∈ IN1 with a constant Γ0. The assumptions (5.48a)–(5.48c) can be guaranteed, e.g., as follows:
(a) If, for n ∈ IN1, F(x) is estimated ν(n) times at each one of p0 different points Xn + d(i), i = 1, 2, . . . , p0, see (5.22a), (5.22b), then H̃n0 = H̃0(0) with a fixed matrix H̃0(0) for all n ∈ IN1. Since ln ∈ {1, 2} for all n ∈ IN1, we find, cf. (5.44b), Tn ≤ T0 for all n ∈ IN1 with a constant T0 < +∞. Because of 0 < µn ≤ 1, p(n) = ν(n)p0, (5.47b)–(5.47d) yield

K0,ne ≥ k0 − Γ0T0µn,
K1,ne2 ≤ (k1 + Γ0T0)² + (rC1F/µn²) ‖H̃0(0)⁻¹‖, σ1,ne2 ≤ (rσ1F²/µn²) ‖H̃0(0)⁻¹‖.

Thus, conditions (5.48a)–(5.48c) hold for arbitrary ν(n), n ∈ IN1, if

(k0 − K0e)/(Γ0T0) ≥ µn ≥ µ0 > 0 for all n ∈ IN1 with 0 < K0e < k0, 0 < µ0 < (k0 − K0e)/(Γ0T0). (5.48d)

(b) In case (5.31b), i.e., if d̃(i,n) = d̃(i,0), i = 1, 2, . . . , p, n ≥ 1, then H̃n0 = H̃0 for all n ∈ IN1 with a fixed matrix H̃0, and therefore Tn ≤ T0 < +∞ for all n ∈ IN1 with fixed T0, since ln ∈ {1, 2}, n ∈ IN1. Thus, for K0,ne, K1,ne2, σ1,ne2, resp., we obtain the same estimates as above if H̃0(0) is replaced by H̃0. Hence, also in this case, conditions (5.48a)–(5.48c) hold for an arbitrary p if (µn)n∈IN1 is selected according to (5.48d).
According to (5.36), for the mean square errors of (5.4a), (5.4b),

bns := E‖Xns − x∗‖², n = 1, 2, . . . ,

we have the "upper" recurrence relation

bn+1s ≤ (1 − 2k0ρn + (k1² + C1s)ρn²) bns + σ1s² ρn², n = 1, 2, . . . . (5.49)

In addition to (5.35b) assume

Vns(Xn) = E(‖Yns − ∇F(Xn)‖² | Xn) ≥ σ0s² + C0s ‖Xn − x∗‖² (5.50)

with coefficients 0 < σ0s ≤ σ1s, 0 ≤ C0s ≤ C1s. Using (5.34a), (5.34b), we find, see [85, 89], for (bns) also the "lower" recursion

bn+1s ≥ (1 − 2k1ρn + (k0² + C0s)ρn²) bns + σ0s² ρn² for n ≥ n0, (5.51a)

provided that, e.g.,

Xns − ρnYns ∈ D a.s. for all n ≥ n0. (5.51b)

Having the lower recursion (5.51a) for (bns), we can compare algorithms (5.4a), (5.4b) and (5.5) by means of the quotients

bn/bns ≤ bn/B̲ns for n ≥ n0, (5.52a)

where the sequence (B̲ns) of lower bounds B̲ns ≤ bns, n ≥ n0, is defined by

B̲n+1s = (1 − 2k1ρn + (k0² + C0s)ρn²) B̲ns + σ0s² ρn², n ≥ n0, with B̲n0s ≤ bn0s. (5.52b)

In all other cases we consider

bn/B̄ns, n = 1, 2, . . . , (5.52c)

where the sequence of upper bounds B̄ns ≥ bns, n = 1, 2, . . . , defined by

B̄n+1s = (1 − 2k0ρn + (k1² + C1s)ρn²) B̄ns + σ1s² ρn², n = 1, 2, . . . , B̄1s ≥ b1s, (5.52d)

represents the "worst case" of (5.4a), (5.4b). For the standard step sizes

ρn = c/(n + q) with constants c > 0, q ∈ IN ∪ {0}, (5.53)

and with k1 ≥ k0 > 0 we have, see [164],

lim_{n→∞} n · B̲ns = σ0s² c²/(2k1c − 1) for all c > 1/(2k1), (5.54a)
lim_{n→∞} n · B̄ns = σ1s² c²/(2k0c − 1) ≥ σ0s² c²/(2k1c − 1) for all c > 1/(2k0). (5.54b)
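The limit (5.54b) can be checked by iterating the bound recursion (5.52d) directly (the constants k0 = 1, k1² + C1s = 1, σ1s² = 1, c = 2, q = 2 below are assumed for illustration):

```python
# Iterate B_{n+1} = (1 - 2 k0 rho_n + (k1^2 + C1s) rho_n^2) B_n + sigma^2 rho_n^2
# with rho_n = c/(n+q), cf. (5.52d), (5.53); then n * B_n -> sigma^2 c^2 / (2 k0 c - 1).
k0, k1sq_C1s, sigma2, c, q = 1.0, 1.0, 1.0, 2.0, 2
B, N = 1.0, 200_000
for n in range(1, N + 1):
    rho = c / (n + q)
    B = (1.0 - 2.0 * k0 * rho + k1sq_C1s * rho**2) * B + sigma2 * rho**2
limit = sigma2 * c**2 / (2.0 * k0 * c - 1.0)   # (5.54b): here 4/3
scaled = N * B                                 # n * B_n approaches the limit
```

The condition c > 1/(2k0) of (5.54b) is satisfied by this choice of constants.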
For the standard step sizes (5.53) the asymptotic behavior of (bn /bsn ),
s
(bn /B n ), resp., can be determined by means of the methods developed in the
following Sect. 5.5:

Theorem 5.2. Suppose that for the RSM-based stochastic approximation procedure (5.5) the upper error recursion (5.47a) holds, where (5.48a)–(5.48c) is fulfilled, and (ρn) is defined by (5.53). Moreover, suppose that groups of M iterations with the RSM-gradient estimator Yne and N iterations with the simple gradient estimator Yns are taken by turns.

(1) If the lower estimates (5.51a) of (bns) hold, then

lim sup_{n→∞} bn/bns ≤ (σ1e2 M + σ1s² N)/(σ0s² (M + N)) · (2k1c − 1)/(2K0e c − 1) for all c > 1/(2K0e). (5.55a)

(2) If (bn) is compared with (B̄ns), then

lim sup_{n→∞} bn/B̄ns ≤ (σ1e2 M + σ1s² N)/(σ1s² (M + N)) · (2k0c − 1)/(2K0e c − 1) for all c > 1/(2K0e). (5.55b)

Note. It is easy to see that the right hand side of (5.55a), (5.55b), resp., takes any value below 1, provided that σ1e2 and the rate N/(N + M) of steps with a "simple" gradient estimator Yns are sufficiently small.
Instead of applying the standard step size rule (5.53), which leads to Theorem 5.2, we may use the optimal step sizes developed in [89]. Here, estimates of the coefficients appearing in the optimal step size rules may be obtained from second order response surface models, see also the next Chap. 6.

5.4.2 Biased Response Surface Model

Assume now that γn > 0 for all n ∈ IN1. For n ∈ IN1, according to Theorem 5.1, we first have to estimate E‖Xn − x∗‖: For any given sequence (δn)n∈IN1 of positive numbers, we know [164] that

E‖Xn − x∗‖ ≤ δn + (1/δn) bn for all n ∈ IN1. (5.56)

Taking, see (5.47b), (5.48a),

δn := ( 4γn(1 + k1ρn)Tn / (k0 − Γnµn^{ln}Tn) ) · µn^{ln}, n ∈ IN1, (5.57)

according to (5.46), (5.56) and (5.57) we obtain, see (5.47b)–(5.47d),

bn+1 ≤ ( 1 − (3/2)K0,ne ρn + K1,ne2 ρn² ) bn + ( 8(γn(1 + k1ρn)Tn)²(µn^{ln})² / K0,ne ) ρn
+ ( σ1,ne2 + γn²(µn^{ln}Tn)² ) ρn² for all n ∈ IN1. (5.58)

Inequalities (5.36) and (5.58) may be represented jointly by

bn+1 ≤ un bn + vn, n = 1, 2, . . . , (5.59)

where the coefficients un, vn can easily be taken from (5.36) and (5.58). We now want to select (ρn), (µn)n∈IN1 and (p(n))n∈IN1 such that

un ≤ 1 − Aρn for n ≥ n0 with n0 ∈ IN and fixed A > 0. (5.60)

According to (5.36), (5.58) we have to select A > 0 such that

−2k0 + (k1² + C1s)ρn ≤ −A for n ∈ IN2, n ≥ n0, (5.61a)

−(3/2)(k0 − Γnµn^{ln}Tn) + (k1 + Γnµn^{ln}Tn)²ρn + rC1F ‖H̃n0⁻¹‖ ρn/(µn²p(n)) ≤ −A for n ∈ IN1, n ≥ n0. (5.61b)

Assuming that, with fixed numbers K0e > 0, K11² ≥ 0 and K̃12², it is

K0,ne = k0 − Γnµn^{ln}Tn ≥ K0e for all n ∈ IN1, (5.62a)
(k1 + Γnµn^{ln}Tn)² ≤ K11² < +∞ for all n ∈ IN1, (5.62b)
rC1F ‖H̃n0⁻¹‖ ≤ K̃12² < +∞ for all n ∈ IN1, (5.62c)

see (5.47b)–(5.47d), (5.48a)–(5.48c) and Remark 5.6, we find that (5.61b) is implied by

K11² ρn + K̃12² ρn/(µn²p(n)) ≤ (3/2)K0e − A for n ∈ IN1, n ≥ n0. (5.61b′)

Thus, selecting A > 0 such that

0 < A < (3/2)K0e ≤ (3/2)k0 < 2k0, (5.63a)

we find that (5.61a), (5.61b) and therefore also (5.60) hold, provided that

ρn → 0, n → ∞, (5.63b)
ρn/(µn²p(n)) → 0, n → ∞ with n ∈ IN1. (5.63c)

Note that (5.63c) holds for any sequence (p(n))n∈IN1 if

ρn/µn² → 0, n → ∞ with n ∈ IN1. (5.63c′)
Having (5.62a)–(5.62c) and (5.63a)–(5.63c), inequalities (5.59) and (5.60) yield
\[ b_{n+1} \le (1 - A\rho_n)b_n + v_n \quad \text{for all } n \ge n_0. \]
Finally, supposing still that
\[ \sum_{n=1}^{\infty}\rho_n = +\infty \quad \text{and} \quad \sum_{n=1}^{\infty}v_n < +\infty, \tag{5.63d}, (5.63e) \]
the above recursion implies $b_n \to 0$ as $n \to \infty$. Condition (5.63e) demands, with respect to $IN_2$, that
\[ \sum_{n\in IN_2}\rho_n^2 < +\infty. \tag{5.63f} \]
Concerning $IN_1$ we assume that
\[ \gamma_n \le \gamma_0, \quad T_n \le T_0, \quad \|\tilde H_{n0}^{-1}\| \le h_0 \quad \text{for all } n \in IN_1 \tag{5.64} \]
with certain constants $\gamma_0 > 0$, $T_0 > 0$, $h_0 > 0$. Under these assumptions condition (5.63e) holds if – besides (5.63f) – we still demand that
\[ \sum_{n\in IN_1}(\mu_n^{l_n})^2\rho_n < +\infty, \qquad \sum_{n\in IN_1}\frac{\rho_n^2}{\mu_n^2\,p(n)} < +\infty, \tag{5.65a}, (5.65b) \]
\[ \sum_{n\in IN_1}(\mu_n^{l_n})^2\rho_n^2 < +\infty, \tag{5.65c} \]
cf. (5.63b). Since $0 < \mu_n \le 1$, $n \in IN_1$, (5.63f) and (5.65c) hold if $\sum_{n=1}^{\infty}\rho_n^2 < +\infty$.
In summary, in the present case we have the following result:

Theorem 5.3. Suppose that, besides the general assumptions guaranteeing inequalities (5.36) and (5.46), the sequences $(\gamma_n)_{n\in IN_1}$, $(\Gamma_n)_{n\in IN_1}$, $(T_n)_{n\in IN_1}$ and $\big(\|\tilde H_{n0}^{-1}\|\big)_{n\in IN_1}$ are bounded. If $(\rho_n)$, $(\mu_n)_{n\in IN_1}$, $\big(p(n)\big)_{n\in IN_1}$ are selected such that
(1) $0 < \mu_n < \frac{k_0 - K_0^e}{\Gamma_0 T_0}$ for all $n \in IN_1$, where $0 < K_0^e < k_0$ and $\Gamma_0 > 0$, $T_0 > 0$ are upper bounds of $(\Gamma_n)_{n\in IN_1}$, $(T_n)_{n\in IN_1}$,
(2) conditions (5.63b)–(5.63d), (5.63f) and (5.65a)–(5.65c) hold,
then
\[ E\|X_n - x^*\|^2 \to 0 \quad \text{as } n \to \infty. \]

In the literature on stochastic approximation [164] many results are available on the convergence rate of numerical sequences $(\beta_n)$ defined by recursions of the type
\[ \beta_{n+1} \le \Big(1 - \frac{c_n}{n}\Big)\beta_n + \frac{d_n}{n^{\pi+1}}, \quad n = 1, 2, \ldots, \]
with $\liminf_{n\to\infty}c_n = c > \pi$ and $\limsup_{n\to\infty}d_n = d > 0$. Hence, the above recursions on $(b_n)$ still yield the following result, cf. (5.54a), (5.54b):

Theorem 5.4. Let the sequence of step sizes $(\rho_n)$ be given by (5.53). Assume that the general assumptions of Theorem 5.3 are fulfilled.
(1) If $p(n) = O(n)$ and $\mu_n = O(n^{-1/2})$, then $b_n = O(n^{-1})$, provided that $c > \frac{3}{2K_0^e}$.
(2) If $\big(p(n)\big)_{n\in IN_1}$ is bounded and $\mu_n = O(n^{-1/4})$, then $b_n = O(n^{-1/2})$, provided that $c > \frac{3}{2K_0^e}$.

5.5 Convergence Rates of Hybrid Stochastic Approximation Procedures

In order to compare the speed of convergence of the pure stochastic algorithm (5.4a), (5.4b) and the hybrid stochastic approximation algorithm (5.5), we consider the mean square error
\[ b_n = E\|X_n - x^*\|^2, \]
where $x^*$ is an optimal solution of (5.1).

Some preliminaries are needed for the derivation of a recursion for the error sequence $(b_n)$. First, we note that the projection operator $p_D$ from $IR^r$ onto $D$ is defined by
\[ \|x - p_D(x)\| = \inf_{y\in D}\|x - y\| \quad \text{for all } x \in IR^r. \]
It is known [16] that
\[ \|p_D(x) - p_D(y)\| \le \|x - y\| \quad \text{for all } x, y \in IR^r. \tag{5.66} \]
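The non-expansiveness (5.66) is easy to verify numerically for a concrete convex set; the following sketch uses an illustrative box $D = [-1,1]^r$ (the text only requires $D$ to be closed and convex), for which the Euclidean projection is componentwise clipping:

```python
import numpy as np

# Sketch: projection p_D onto the box D = [-1, 1]^r (an illustrative convex
# set, not one taken from the text) and a random-sample check of the
# non-expansiveness property (5.66).
def p_D(x, lo=-1.0, hi=1.0):
    # Componentwise clipping is the exact Euclidean projection onto a box.
    return np.clip(x, lo, hi)

rng = np.random.default_rng(0)
all_nonexpansive = True
for _ in range(1000):
    x, y = rng.normal(size=3), rng.normal(size=3)
    # Check ||p_D(x) - p_D(y)|| <= ||x - y||, cf. (5.66)
    if np.linalg.norm(p_D(x) - p_D(y)) > np.linalg.norm(x - y) + 1e-12:
        all_nonexpansive = False
print(all_nonexpansive)
```

For general closed convex $D$ the projection usually has no closed form and must itself be computed by an optimization routine; the box is chosen here only because its projection is explicit.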

As mentioned in Sect. 5.1, see also Chap. 4, many stochastic optimization problems (5.1) have the following property:

Property (DD). At certain "nonstationary" or "nonefficient" points $x \in D$ there exists a feasible descent direction (DD) $h = h(x)$ of $F$ which can be obtained with much less computational expense than the gradient $\nabla F(x)$. Moreover, $h = h(x)$ is stable with respect to changes of the loss function contained in $f = f(\omega, x)$, see [77, 82, 83].

Consider then the mapping $A: IR^r \to IR^r$ defined by
\[ A(x) = \begin{cases} -h(x), & \text{if (DD) holds at } x, \\ \nabla F(x), & \text{else.} \end{cases} \tag{5.67} \]
In many cases [77, 82, 83] $h(x)$ can be represented by $h(x) = y - x$ with a certain vector $y \in D$. Hence, in this case we get
\[ p_D(X_n + \epsilon_n h_n) = X_n + \epsilon_n h_n, \quad \text{if } 0 < \epsilon_n \le 1, \]
and the projection onto $D$ can then be omitted.


We now suppose that
\[ p_D\big(x^* - \epsilon A(x^*)\big) = x^* \quad \text{for all } \epsilon > 0 \tag{5.68} \]
for every optimal solution $x^*$ of (5.1). This assumption is justified since, for a convex, differentiable objective function $F$, the optimal solutions $x^*$ of (5.1) can be characterized [16] by $p_D\big(x^* - \epsilon\nabla F(x^*)\big) = x^*$ for all $\epsilon > 0$, and on the other hand there is no feasible descent direction $h(x^*)$ at $x^*$.

The hybrid algorithm (5.5) can then be represented by
\[ X_{n+1} = p_D\Big(X_n - \epsilon_n\big(A(X_n) + \xi_n\big)\Big), \quad n = 1, 2, \ldots, \tag{5.69a} \]

where the noise $r$-vector $\xi_n$ is defined as follows:
(1) Hybrid methods with deterministic descent directions $h(X_n)$ and standard gradient estimators $Y_n^s$:
\[ \xi_n := \begin{cases} 0, & \text{if } n \in IN_1, \\ Y_n^s - \nabla F(X_n), & \text{if } n \in IN_2, \end{cases} \tag{5.69b} \]
where $A(x)$ is defined by (5.67).
(2) Hybrid methods with RSM-gradient estimators $Y_n^e$ and simple gradient estimators $Y_n^s$:
\[ \xi_n := \begin{cases} Y_n^e - \nabla F(X_n), & \text{if } n \in IN_1, \\ Y_n^s - \nabla F(X_n), & \text{if } n \in IN_2, \end{cases} \tag{5.69c} \]
where $A(x) := \nabla F(x)$.
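As an illustration only – the objective, the feasible set and all constants below are hypothetical and not taken from the text – the projected hybrid iteration (5.69a) can be sketched for $F(x) = \frac{1}{2}\|x\|^2$ on a box, with odd iterations playing the role of $IN_1$ (exact direction, $\xi_n = 0$) and even iterations the role of $IN_2$ (noisy gradient, cf. (5.69b)):

```python
import numpy as np

# Sketch of the hybrid iteration (5.69a) for the illustrative problem
# F(x) = 0.5*||x||^2 on D = [-1, 1]^r, so x* = 0.  Odd n: deterministic
# step (xi_n = 0); even n: stochastic step with gradient noise.
rng = np.random.default_rng(0)
r, c, q = 3, 2.0, 5                           # sample dimension and step-size constants
X = np.ones(r)
for n in range(1, 3001):
    eps_n = c / (n + q)                       # step size eps_n = c/(n + q)
    grad = X                                  # grad F(x) = x for this sample F
    xi = np.zeros(r) if n % 2 == 1 else rng.normal(scale=0.5, size=r)
    X = np.clip(X - eps_n * (grad + xi), -1.0, 1.0)   # projection p_D onto the box
final_error = float(np.linalg.norm(X))        # should be close to 0
```

The alternation pattern and noise level are sample choices; in the text the sets $IN_1$, $IN_2$ are determined by the availability of cheap descent directions or RSM estimates.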
While for hybrid methods with mixed RSM-/standard gradient estimators $Y_n^e$, $Y_n^s$ a recurrence relation for the error sequence $(b_n)$ was given in Theorem 5.1 and in Sect. 5.4.1, for case (1) we have this result:

Lemma 5.1 (Mixed feasible descent directions $A(X_n)$ and standard gradient estimators $Y_n^s$). Let $x^*$ be an optimal solution. Assume that the optimality condition (5.68) holds and the following conditions are fulfilled:
\[ \|A(x) - A(x^*)\|^2 \le \gamma_1 + \gamma_2\|x - x^*\|^2, \tag{5.70} \]
\[ \big(A(x) - A(x^*)\big)^T(x - x^*) \ge \alpha\|x - x^*\|^2 \tag{5.71} \]
for every $x \in D$, where $\gamma_1 \ge 0$, $\gamma_2 > 0$ and $\alpha > 0$ are some constants. Moreover, for $n \in IN_2$ assume that
\[ E(Y_n^s\,|\,X_n) = \nabla F(X_n) \ \text{a.s. (almost surely)}, \tag{5.72} \]
\[ E\|Y_n^s\|^2 < +\infty. \tag{5.73} \]
Then $b_n = E\|X_n - x^*\|^2$ satisfies the recursion
\[ b_{n+1} \le (1 - 2\alpha\epsilon_n + \gamma_2\epsilon_n^2)b_n + \beta_n\epsilon_n^2, \quad n = 1, 2, \ldots, \tag{5.74a} \]
where $\beta_n$ is defined by
\[ \beta_n = \gamma_1 + E\|\xi_n\|^2. \tag{5.74b} \]

Proof. Using (5.66) and (5.68)–(5.73), we find
\begin{align*}
\|X_{n+1} - x^*\|^2 &= \Big\|p_D\Big(X_n - \epsilon_n\big(A(X_n) + \xi_n\big)\Big) - p_D\big(x^* - \epsilon_n A(x^*)\big)\Big\|^2 \\
&\le \big\|(X_n - x^*) - \epsilon_n\big(A(X_n) - A(x^*)\big) - \epsilon_n\xi_n\big\|^2 \\
&= \|X_n - x^*\|^2 - 2\epsilon_n\big(A(X_n) - A(x^*)\big)^T(X_n - x^*) + \epsilon_n^2\|A(X_n) - A(x^*)\|^2 \\
&\quad - 2\epsilon_n\xi_n^T\Big(X_n - x^* - \epsilon_n\big(A(X_n) - A(x^*)\big)\Big) + \epsilon_n^2\|\xi_n\|^2 \\
&\le \|X_n - x^*\|^2 - 2\alpha\epsilon_n\|X_n - x^*\|^2 + \epsilon_n^2\big(\gamma_1 + \gamma_2\|X_n - x^*\|^2\big) \\
&\quad - 2\epsilon_n\xi_n^T\Big(X_n - x^* - \epsilon_n\big(A(X_n) - A(x^*)\big)\Big) + \epsilon_n^2\|\xi_n\|^2.
\end{align*}
From (5.69b) and (5.72) we obtain $E(\xi_n\,|\,X_n) = 0$ a.s. for every $n$. Hence, from the above inequality we have a.s.
\begin{align*}
E\big(\|X_{n+1} - x^*\|^2\,|\,X_n\big) &\le (1 - 2\alpha\epsilon_n + \gamma_2\epsilon_n^2)\|X_n - x^*\|^2 \\
&\quad + \gamma_1\epsilon_n^2 + \epsilon_n^2\,E\big(\|\xi_n\|^2\,|\,X_n\big).
\end{align*}
Taking expectations on both sides, we get the asserted recursion (5.74a), (5.74b).

We want to briefly discuss the assumptions in the above lemma. First we remark that (5.70) holds if
\[ \|A(x) - A(x^*)\| \le k_1 + k_2\|x - x^*\| \tag{5.75} \]
for some constants $k_1 \ge 0$, $k_2 > 0$. Indeed, since $2t \le 1 + t^2$, (5.75) implies (5.70) with $\gamma_1 = k_1^2 + k_1k_2$, $\gamma_2 = k_1k_2 + k_2^2$.
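The implication (5.75) $\Rightarrow$ (5.70) rests on the elementary bound $2t \le 1 + t^2$; with sample constants $k_1, k_2$ (arbitrary illustrative values) the resulting gap is nonnegative everywhere:

```python
import numpy as np

# Check numerically that (k1 + k2*t)^2 <= gamma1 + gamma2*t^2 with
# gamma1 = k1^2 + k1*k2 and gamma2 = k1*k2 + k2^2; the gap equals
# k1*k2*(1 - t)^2 >= 0, so the bound is tight at t = 1.
k1, k2 = 0.7, 1.3                              # sample constants
g1, g2 = k1**2 + k1*k2, k1*k2 + k2**2
t = np.linspace(0.0, 100.0, 100001)
gap = (g1 + g2 * t**2) - (k1 + k2 * t)**2      # = k1*k2*(1 - t)^2
holds = bool(np.all(gap >= -1e-9))
```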
If inequality (5.71) is satisfied for $A(x) = \nabla F(x)$, then $F$ is called strongly convex at $x^*$. For $A(x) = -h(x)$, $h(x)$ being a feasible descent direction, condition (5.71) reads
\[ h(x)^T(x^* - x) \ge \alpha\|x - x^*\|^2 \quad \text{for } x \in D. \]
Since $h_{opt}(x) = x^* - x$ is the best descent direction at $x$, the above inequality may be regarded as a condition on the quality of $h(x)$ as a descent direction. The remaining conditions (5.72), (5.73) state that the error $\xi_n$ in a stochastic step has mean zero and finite second order moments.
Considering now the convergence behavior of (5.74a), (5.74b), the following type of result is available in the literature, see, e.g., [132, 135, 160, 164].
Lemma 5.2. Let $(b_n)$ be a sequence of nonnegative numbers such that
\[ b_{n+1} \le \Big(1 - \frac{\mu_n}{n}\Big)b_n + \frac{\tau}{n^{\pi+1}}, \quad n \ge n_0, \tag{5.76} \]
for some positive integer $n_0$. If $\lim_{n\to\infty}\mu_n = \mu$ and $\tau > 0$, then
\[ b_n = O(n^{-\pi}), \quad \text{if } \mu > \pi > 0, \tag{5.77a} \]
\[ b_n = O\Big(\frac{\log n}{n^{\pi}}\Big), \quad \text{if } \mu = \pi > 0, \tag{5.77b} \]
\[ b_n = O(n^{-\mu}), \quad \text{if } \pi > \mu > 0. \tag{5.77c} \]
If $(b_n)$ satisfies the recursion
\[ b_{n+1} = \Big(1 - \frac{\mu_n}{n}\Big)b_n + \frac{\tau_n}{n^{\pi+1}}, \quad n > n_0, \tag{5.78} \]
where $\lim_{n\to\infty}\mu_n = \mu > \pi > 0$ and $\lim_{n\to\infty}\tau_n = \tau \ge 0$, then
\[ \lim_{n\to\infty}n^{\pi}b_n = \frac{\tau}{\mu - \pi}. \tag{5.79} \]
Applying Lemma 5.2 to (5.74a), (5.74b), this result is obtained:

Corollary 5.1. Suppose that the assumptions in Lemma 5.1 are fulfilled and, in addition, let
\[ E\|\xi_n\|^2 \le \sigma^2 < +\infty \quad \text{for } n \in IN_2. \tag{5.80} \]
If the step size $\epsilon_n$ is chosen according to $\epsilon_n = \frac{c}{n+q}$, where $c > 0$ and $q \in IN \cup \{0\}$, then $\lim_{n\to\infty}E\|X_n(\omega) - x^*\|^2 = 0$, where the asymptotic rate of convergence is given by (5.77a)–(5.77c) with $\pi = 1$ and $\mu = 2\alpha c$.
Proof. Having (5.80) and $\epsilon_n = \frac{c}{n+q}$, it is easy to see that (5.74a), (5.74b) implies (5.76) for $\mu_n = 2\alpha c\,\frac{n}{n+q} - \gamma_2c^2\,\frac{n}{(n+q)^2}$, $\mu = 2\alpha c$, $\tau = c^2(\gamma_1 + \sigma^2)$ and $\pi = 1$. The corollary now follows from Lemma 5.2.
Assuming now that
\[ \lim_{n\to\infty}\epsilon_n = 0, \]
there is an integer $n_0 > 0$ such that
\[ 0 < 1 - 2\alpha\epsilon_n + \gamma_2\epsilon_n^2 < 1 - a\epsilon_n, \quad n \ge n_0, \tag{5.81} \]
where $a$ is a fixed positive number such that $0 < a < 2\alpha$. Furthermore, assume that (5.80) holds. Then (5.74a), (5.74b) and (5.81) yield the recursion
\[ b_{n+1} \le (1 - a\epsilon_n)b_n + \bar\beta_n\epsilon_n^2, \quad n \ge n_0, \tag{5.82} \]
where $\bar\beta_n$ is given by
\[ \bar\beta_n = \begin{cases} \gamma_1, & \text{if } n \in IN_1, \\ \gamma_1 + \sigma^2, & \text{if } n \in IN_2. \end{cases} \tag{5.83} \]

Next we want to consider the "worst case" in (5.82), i.e., the recursion
\[ B_{n+1} = (1 - a\epsilon_n)B_n + \bar\beta_n\epsilon_n^2, \quad n \ge n_0. \tag{5.84a} \]
If $b_n$ and $B_n$ satisfy the recursions (5.82) and (5.84a), respectively, and if $B_{n_0} \ge b_{n_0}$, then by a simple induction argument we find that
\[ E\|X_n(\omega) - x^*\|^2 = b_n \le B_n \quad \text{for all } n \ge n_0. \tag{5.84b} \]
Corresponding to Corollary 5.1, for the upper bound $B_n$ of the mean square error $b_n$ we have this

Corollary 5.2. Suppose that all assumptions in Lemma 5.1 are fulfilled and (5.80) holds. If $\epsilon_n = \frac{c}{n+q}$ with $c > 0$ and $q \in IN \cup \{0\}$, then for $(B_n)$ we have the asymptotic formulas
\[ B_n = O(n^{-1}), \quad \text{if } ac > 1, \]
\[ B_n = O\Big(\frac{\log n}{n}\Big), \quad \text{if } ac = 1, \]
\[ B_n = O(n^{-ac}), \quad \text{if } ac < 1, \]
as $n \to \infty$. Furthermore, if $ac > 1$ and $IN_1 = \emptyset$, i.e., in the pure stochastic case, we have
\[ \lim_{n\to\infty}nB_n = \frac{c^2(\gamma_1 + \sigma^2)}{ac - 1}. \]

Proof. The assertion follows by applying Lemma 5.2 to (5.84a), (5.84b).
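The last statement of Corollary 5.2 can be illustrated numerically with sample constants (all values below are illustrative only):

```python
# Numerical illustration of lim n*B_n = c^2*(gamma1 + sigma^2)/(a*c - 1) in
# the pure stochastic case IN_1 = empty set of (5.84a).  Sample constants
# with a*c = 1.5 > 1.
gamma1, sigma2, a, c, q = 0.5, 2.0, 1.0, 1.5, 0
Bn = 1.0
for n in range(2, 10 ** 6 + 1):                # start where 1 - a*eps_n > 0
    eps = c / (n + q)
    Bn = (1.0 - a * eps) * Bn + (gamma1 + sigma2) * eps * eps
predicted = c * c * (gamma1 + sigma2) / (a * c - 1.0)   # = 11.25 here
observed = (10 ** 6 + 1) * Bn                  # n*B_n at the final stage
```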

Since by means of Corollaries 5.1 and 5.2 we cannot distinguish between the convergence behavior of the pure stochastic and the hybrid stochastic approximation algorithm, we have to consider recursion (5.84a), (5.84b) in more detail. It turns out that the improvement of the speed of convergence resulting from the use of both stochastic and deterministic directions depends on the rate of stochastic and deterministic steps taken in (5.69a)–(5.69c).

Note. For simplicity of notation, an iteration step $X_n \to X_{n+1}$ using an improved step direction is called a "deterministic step," one using a standard stochastic gradient a "stochastic step."

5.5.1 Fixed Rate of Stochastic and Deterministic Steps

Let us assume that the rate of stochastic and deterministic steps taken in (5.69a)–(5.69c) is fixed and given by $\frac{M}{N}$, i.e., $M$ deterministic and $N$ stochastic steps are taken by turns, beginning with $M$ deterministic steps. Then $B_n$ satisfies recursion (5.84a), (5.84b) with $\bar\beta_n$ given by (5.83), i.e.,
\[ \bar\beta_n = \begin{cases} \gamma_1 & \text{for } n \in IN_1, \\ \gamma_1 + \sigma^2 & \text{for } n \in IN_2, \end{cases} \]
where $IN_1$, $IN_2$ are defined by
\begin{align*}
IN_1 &= \big\{n \in IN : (m-1)(M+N)+1 \le n \le mM + (m-1)N \ \text{for some } m \in IN\big\}, \\
IN_2 &= IN\setminus IN_1 = \big\{n \in IN : \tilde mM + (\tilde m-1)N + 1 \le n \le \tilde m(M+N) \ \text{for some } \tilde m \in IN\big\}.
\end{align*}

We want to compare $B_n$ with the upper bounds $B_n^s$ corresponding to the purely stochastic algorithm (5.4a), (5.4b), i.e., $B_n^s$ shall satisfy (5.84a), (5.84b) with
\[ \bar\beta_n = \gamma_1 + \sigma^2 \quad \text{for all } n \ge n_0. \]
We assume that the step size $\epsilon_n$ is chosen according to
\[ \epsilon_n = \frac{c}{n+q} \]
with a fixed $q \in IN \cup \{0\}$ and $c > 0$ such that $ac \ge 1$. Then the recursions to be considered are
\[ B_{n+1} = \Big(1 - \frac{\vartheta}{n+q}\Big)B_n + \frac{\tilde\beta_n}{(n+q)^2}, \quad n \ge n_0, \tag{5.85} \]
\[ B_{n+1}^s = \Big(1 - \frac{\vartheta}{n+q}\Big)B_n^s + \frac{B}{(n+q)^2}, \quad n \ge n_0, \tag{5.86} \]
where
\[ \tilde\beta_n = \begin{cases} A & \text{for } n \in IN_1, \\ B & \text{for } n \in IN_2, \end{cases} \]
$A = c^2\gamma_1$, $B = c^2(\gamma_1 + \sigma^2)$ and $\vartheta = ac$.

We need two lemmata (see [164] for the first and [127] for the second).

Lemma 5.3. Let $\vartheta \ge 1$ be a real number and
\[ C_{k,n} = \prod_{m=k+1}^{n}\Big(1 - \frac{\vartheta}{m}\Big) \quad \text{for } k, n \in IN,\ k < n. \]
Then for every $\varepsilon > 0$ there is a $k_\varepsilon \in IN$ such that
\[ (1-\varepsilon)\Big(\frac{k}{n}\Big)^{\vartheta} \le C_{k,n} \le (1+\varepsilon)\Big(\frac{k}{n}\Big)^{\vartheta} \]
for all $n > k \ge k_\varepsilon$.
Lemma 5.4. Let the numbers $y_n$ satisfy the recursion
\[ y_{n+1} = U_ny_n + V_n \quad \text{for } n \ge \nu, \]
where $(U_n)$, $(V_n)$ are given sequences of real numbers such that $U_n \ne 0$ for $n \ge \nu$. Then
\[ y_{n+1} = \Bigg(y_\nu + \sum_{j=\nu}^{n}\frac{V_j}{\prod_{m=\nu}^{j}U_m}\Bigg)\prod_{m=\nu}^{n}U_m \quad \text{for } n \ge \nu. \]

We are going to prove the following result:

Theorem 5.5. Let the sequences $(B_n)$ and $(B_n^s)$ be defined by (5.85) and (5.86), respectively. If $\vartheta \ge 1$, then
\[ \lim_{n\to\infty}\frac{B_n}{B_n^s} = Q := \frac{AM + BN}{B(M+N)} = \frac{\gamma_1M + (\gamma_1 + \sigma^2)N}{(\gamma_1 + \sigma^2)(M+N)}. \tag{5.87} \]
Proof. Using the notation of Lemma 5.3, we have
\[ \prod_{m=\nu}^{j}\Big(1 - \frac{\vartheta}{m+q}\Big) = C_{\nu+q-1,\,j+q} \quad \text{for } j \ge \nu \ge 1, \]
and as a simple consequence of (5.85), (5.86) and Lemma 5.4 we note
\[ \frac{B_{n+1}}{B_{n+1}^s} = \frac{\displaystyle B_\nu + \sum_{j=\nu}^{n}\frac{\tilde\beta_j}{(j+q)^2\,C_{\nu+q-1,\,j+q}}}{\displaystyle B_\nu^s + \sum_{j=\nu}^{n}\frac{B}{(j+q)^2\,C_{\nu+q-1,\,j+q}}}, \quad n \ge \nu, \]

for every fixed integer $\nu > n_0$. By Lemma 5.3, for every $0 < \varepsilon < 1$ there is an integer $p = \nu - 1$, $\nu > n_0$, such that
\[ (1-\varepsilon)\Big(\frac{p+q}{j+q}\Big)^{\vartheta} \le C_{p+q,\,j+q} \le (1+\varepsilon)\Big(\frac{p+q}{j+q}\Big)^{\vartheta} \quad \text{for } j > p. \]
Consequently, for $n \ge \nu$ we find the inequality
\[ \frac{\displaystyle B_\nu + \frac{1}{(1+\varepsilon)(p+q)^{\vartheta}}\sum_{j=\nu}^{n}\tilde\beta_j(j+q)^{\vartheta-2}}{\displaystyle B_\nu^s + \frac{B}{(1-\varepsilon)(p+q)^{\vartheta}}\sum_{j=\nu}^{n}(j+q)^{\vartheta-2}} \le \frac{B_{n+1}}{B_{n+1}^s} \le \frac{\displaystyle B_\nu + \frac{1}{(1-\varepsilon)(p+q)^{\vartheta}}\sum_{j=\nu}^{n}\tilde\beta_j(j+q)^{\vartheta-2}}{\displaystyle B_\nu^s + \frac{B}{(1+\varepsilon)(p+q)^{\vartheta}}\sum_{j=\nu}^{n}(j+q)^{\vartheta-2}}. \tag{5.88} \]

Since $\vartheta \ge 1$, the series $\sum_{j=\nu}^{\infty}(j+q)^{\vartheta-2}$ diverges. So the numbers $B_\nu$ and $B_\nu^s$ have no influence on the limit of both sides of the above inequalities (5.88) as $n \to \infty$. Suppose now that we are able to show the equation
\[ \lim_{n\to\infty}\frac{S(\nu,n)}{T(\nu,n)} = \frac{AM+BN}{M+N} = L, \tag{5.89} \]
where
\[ S(\nu,n) = \sum_{j=\nu}^{n}\tilde\beta_j(j+q)^{\vartheta-2}, \qquad T(\nu,n) = \sum_{j=\nu}^{n}(j+q)^{\vartheta-2}. \]
Taking the limit in (5.88) for $n \to \infty$, for every $\varepsilon > 0$ we obtain
\[ \frac{1-\varepsilon}{1+\varepsilon}\cdot\frac{L}{B} \le \liminf_{n\to\infty}\frac{B_{n+1}}{B_{n+1}^s} \le \limsup_{n\to\infty}\frac{B_{n+1}}{B_{n+1}^s} \le \frac{1+\varepsilon}{1-\varepsilon}\cdot\frac{L}{B}. \]

But this can be true only if $\lim_{n\to\infty}\frac{B_{n+1}}{B_{n+1}^s}$ exists and
\[ \lim_{n\to\infty}\frac{B_{n+1}}{B_{n+1}^s} = \frac{L}{B} = \frac{AM+BN}{B(M+N)}, \]
which proves (5.87). Hence, it remains to show that the limit (5.89) exists and is equal to $L$. For this purpose we may assume without loss of generality that

$\nu = m_0(M+N) + 1$ for some integer $m_0$. Furthermore, we assert that the existence of the limit (5.89) follows from the existence of
\[ \lim_{m\to\infty}\frac{S\big(\nu, m(M+N)\big)}{T\big(\nu, m(M+N)\big)}. \tag{5.90} \]
To see this, let $n \ge \nu$ be given and let $m = m_n$ be the uniquely determined integer such that
\[ m(M+N) + 1 \le n < (m+1)(M+N). \]
Since
\[ S\big(\nu,(m+1)(M+N)\big) = S(\nu,n) + S\big(n+1,(m+1)(M+N)\big), \]
for $m(M+N)+1 \le n < (m+1)(M+N)$ a simple computation yields
\begin{align}
&\frac{S\big(\nu,(m+1)(M+N)\big)}{T\big(\nu,(m+1)(M+N)\big)} - \frac{S(\nu,n)}{T(\nu,n)} \tag{5.91} \\
&\qquad = \frac{S\big(n+1,(m+1)(M+N)\big)}{T\big(\nu,(m+1)(M+N)\big)} - \frac{T\big(n+1,(m+1)(M+N)\big)\cdot S(\nu,n)}{T\big(\nu,(m+1)(M+N)\big)\cdot T(\nu,n)}. \notag
\end{align}

Clearly, from the definition of $S(\nu,n)$, $T(\nu,n)$ and $\tilde\beta_j \le B$ for all $j \in IN$ it follows that
\[ \frac{S\big(n+1,(m+1)(M+N)\big)}{T\big(\nu,(m+1)(M+N)\big)} \le B\,\frac{T\big(m(M+N)+1,(m+1)(M+N)\big)}{T\big(\nu,(m+1)(M+N)\big)} \tag{5.92a} \]
and
\[ \frac{T\big(n+1,(m+1)(M+N)\big)\cdot S(\nu,n)}{T\big(\nu,(m+1)(M+N)\big)\cdot T(\nu,n)} \le B\,\frac{T\big(m(M+N)+1,(m+1)(M+N)\big)}{T\big(\nu,(m+1)(M+N)\big)}. \tag{5.92b} \]
Assume now that
\[ \lim_{m\to\infty}\frac{T\big(m(M+N)+1,(m+1)(M+N)\big)}{T\big(\nu,(m+1)(M+N)\big)} = 0. \tag{5.93} \]

Since $n \to \infty$ implies that $m = m_n \to \infty$, from (5.91), the inequalities (5.92a), (5.92b) and assumption (5.93) we obtain
\[ \lim_{n\to\infty}\Bigg(\frac{S\big(\nu,(m_n+1)(M+N)\big)}{T\big(\nu,(m_n+1)(M+N)\big)} - \frac{S(\nu,n)}{T(\nu,n)}\Bigg) = 0. \]
Therefore, if the limit (5.90) exists, then the limit (5.89) exists as well, and both limits are equal.

For $1 \le \vartheta \le 2$ the limit (5.93) follows from
\[ T\big(m(M+N)+1,(m+1)(M+N)\big) \le M+N \]
and the divergence of $T\big(\nu,(m+1)(M+N)\big)$ as $m \to \infty$. For $\vartheta > 2$ we have
\[ T\big(m(M+N)+1,(m+1)(M+N)\big) \le (M+N)\big((m+1)(M+N)+q\big)^{\vartheta-2} \]
and, with $\nu = m_0(M+N)+1$,
\begin{align*}
T\big(\nu,(m+1)(M+N)\big) &= \sum_{k=m_0}^{m}\ \sum_{j=k(M+N)+1}^{(k+1)(M+N)}(j+q)^{\vartheta-2} \\
&\ge (M+N)\sum_{k=m_0}^{m}\big(k(M+N)+q\big)^{\vartheta-2}.
\end{align*}

Consider the monotonously increasing function
\[ u(x) = \big(x(M+N)+q\big)^{\vartheta-2}, \quad x \ge 0. \]
Because of
\[ \int_{k}^{k+1}u(x-1)\,dx \le \int_{k}^{k+1}u(k+1-1)\,dx = u(k), \]
we have that
\begin{align*}
T\big(\nu,(m+1)(M+N)\big) &\ge (M+N)\sum_{k=m_0}^{m}u(k) \ge (M+N)\sum_{k=m_0}^{m}\int_{k}^{k+1}u(x-1)\,dx \\
&= (M+N)\int_{m_0}^{m+1}u(x-1)\,dx \\
&= \frac{1}{\vartheta-1}\Big[\big(m(M+N)+q\big)^{\vartheta-1} - \big((m_0-1)(M+N)+q\big)^{\vartheta-1}\Big].
\end{align*}

Therefore, for $\vartheta > 2$ we find
\[ \frac{T\big(m(M+N)+1,(m+1)(M+N)\big)}{T\big(\nu,(m+1)(M+N)\big)} \le (\vartheta-1)(M+N)\Bigg\{\big(m(M+N)+q\big)\bigg(\frac{m(M+N)+q}{(m+1)(M+N)+q}\bigg)^{\vartheta-2} - \frac{\big((m_0-1)(M+N)+q\big)^{\vartheta-1}}{\big((m+1)(M+N)+q\big)^{\vartheta-2}}\Bigg\}^{-1}. \]
This then yields the limit (5.93).


It remains to show that (5.90), and consequently also (5.89), holds. Again with $\nu = m_0(M+N)+1$ we have
\begin{align}
S\big(\nu, m(M+N)\big) &= \sum_{k=m_0}^{m-1}\ \sum_{j=k(M+N)+1}^{(k+1)(M+N)}\tilde\beta_j(j+q)^{\vartheta-2} \tag{5.94} \\
&= \sum_{k=m_0}^{m-1}\Bigg(A\sum_{j=k(M+N)+1}^{(k+1)M+kN}(j+q)^{\vartheta-2} + B\sum_{j=(k+1)M+kN+1}^{(k+1)(M+N)}(j+q)^{\vartheta-2}\Bigg) = \sum_{k=m_0}^{m-1}\sigma_k, \notag
\end{align}
where $\sigma_k$ is defined by
\[ \sigma_k = A\sum_{i=1}^{M}\big(k(M+N)+i+q\big)^{\vartheta-2} + B\sum_{i=M+1}^{M+N}\big(k(M+N)+i+q\big)^{\vartheta-2}. \]
In the same way we find
\[ T\big(\nu, m(M+N)\big) = \sum_{k=m_0}^{m-1}\tau_k, \tag{5.95} \]
where $\tau_k$ is given by
\[ \tau_k = \sum_{i=1}^{M+N}\big(k(M+N)+i+q\big)^{\vartheta-2}. \]
Define now the functions
\[ f(x) = A\sum_{i=1}^{M}\big(x(M+N)+i+q\big)^{\vartheta-2} + B\sum_{i=M+1}^{M+N}\big(x(M+N)+i+q\big)^{\vartheta-2} \]
and
\[ g(x) = \sum_{i=1}^{M+N}\big(x(M+N)+i+q\big)^{\vartheta-2}. \]
Then $\sigma_k = f(k)$ and $\tau_k = g(k)$.


For $1 \le \vartheta \le 2$ the functions $f$ and $g$ are monotonously nonincreasing. Hence,
\[ \int_{k}^{k+1}f(x)\,dx \le f(k) = \sigma_k \le \int_{k}^{k+1}f(x-1)\,dx \]
and therefore
\[ \int_{m_0}^{m}f(x)\,dx \le \sum_{k=m_0}^{m-1}\sigma_k \le \int_{m_0}^{m}f(x-1)\,dx. \]
Analogously we find
\[ \int_{m_0}^{m}g(x)\,dx \le \sum_{k=m_0}^{m-1}\tau_k \le \int_{m_0}^{m}g(x-1)\,dx. \]
By (5.94) and (5.95), in the case $1 \le \vartheta \le 2$ we now have
\[ \frac{\int_{m_0}^{m}f(x)\,dx}{\int_{m_0}^{m}g(x-1)\,dx} \le \frac{S\big(\nu,m(M+N)\big)}{T\big(\nu,m(M+N)\big)} \le \frac{\int_{m_0}^{m}f(x-1)\,dx}{\int_{m_0}^{m}g(x)\,dx}. \tag{5.96a} \]

If $\vartheta > 2$, then $f$ and $g$ are monotonously increasing and therefore
\[ \int_{k}^{k+1}f(x)\,dx \ge f(k) = \sigma_k \ge \int_{k}^{k+1}f(x-1)\,dx, \]
hence
\[ \int_{m_0}^{m}f(x)\,dx \ge \sum_{k=m_0}^{m-1}\sigma_k \ge \int_{m_0}^{m}f(x-1)\,dx, \]
and in the same way we get
\[ \int_{m_0}^{m}g(x)\,dx \ge \sum_{k=m_0}^{m-1}\tau_k \ge \int_{m_0}^{m}g(x-1)\,dx. \]
Consequently, if $\vartheta > 2$, then by (5.94) and (5.95) we have
\[ \frac{\int_{m_0}^{m}f(x-1)\,dx}{\int_{m_0}^{m}g(x)\,dx} \le \frac{S\big(\nu,m(M+N)\big)}{T\big(\nu,m(M+N)\big)} \le \frac{\int_{m_0}^{m}f(x)\,dx}{\int_{m_0}^{m}g(x-1)\,dx}. \tag{5.96b} \]
Hence, in both cases we have the same type of upper and lower bounds for $\frac{S(\nu,m(M+N))}{T(\nu,m(M+N))}$.
We have to discuss two cases:
(i) Let $\vartheta = 1$. In this case we find
\begin{align*}
\int_{m_0}^{m}f(x-1)\,dx &= \frac{A}{M+N}\sum_{i=1}^{M}\Big[\log\big((m-1)(M+N)+i+q\big) - \log\big((m_0-1)(M+N)+i+q\big)\Big] \\
&\quad + \frac{B}{M+N}\sum_{i=M+1}^{M+N}\Big[\log\big((m-1)(M+N)+i+q\big) - \log\big((m_0-1)(M+N)+i+q\big)\Big] \\
&= \frac{A\cdot M}{M+N}\log(m-1) + \frac{A}{M+N}\sum_{i=1}^{M}\Big[\log\Big(M+N+\frac{i+q}{m-1}\Big) - \log\big((m_0-1)(M+N)+i+q\big)\Big] \\
&\quad + \frac{B\cdot N}{M+N}\log(m-1) + \frac{B}{M+N}\sum_{i=M+1}^{M+N}\Big[\log\Big(M+N+\frac{i+q}{m-1}\Big) - \log\big((m_0-1)(M+N)+i+q\big)\Big]
\end{align*}
and in the same way
\[ \int_{m_0}^{m}g(x)\,dx = \log m + \frac{1}{M+N}\sum_{i=1}^{M+N}\Big[\log\Big(M+N+\frac{i+q}{m}\Big) - \log\big(m_0(M+N)+i+q\big)\Big]. \]
Since in the above integral representations all terms besides $\frac{AM}{M+N}\log(m-1)$, $\frac{BN}{M+N}\log(m-1)$ and $\log m$ remain bounded as $m \to \infty$, we now have
\[ \lim_{m\to\infty}\frac{\int_{m_0}^{m}f(x-1)\,dx}{\int_{m_0}^{m}g(x)\,dx} = \lim_{m\to\infty}\Big(\frac{AM}{M+N} + \frac{BN}{M+N}\Big)\frac{\log(m-1)}{\log m} = \frac{AM+BN}{M+N}. \]
Analogously,
\[ \lim_{m\to\infty}\frac{\int_{m_0}^{m}f(x)\,dx}{\int_{m_0}^{m}g(x-1)\,dx} = \frac{AM+BN}{M+N}, \]
and the theorem is proved for $\vartheta = 1$.


(ii) Let now $\vartheta > 1$. Here we find
\begin{align*}
\int_{m_0}^{m}f(x-1)\,dx &= \frac{A}{(M+N)(\vartheta-1)}\sum_{i=1}^{M}\Big[\big((m-1)(M+N)+i+q\big)^{\vartheta-1} - \big((m_0-1)(M+N)+i+q\big)^{\vartheta-1}\Big] \\
&\quad + \frac{B}{(M+N)(\vartheta-1)}\sum_{i=M+1}^{M+N}\Big[\big((m-1)(M+N)+i+q\big)^{\vartheta-1} - \big((m_0-1)(M+N)+i+q\big)^{\vartheta-1}\Big] \\
&= \frac{Am^{\vartheta-1}}{(M+N)(\vartheta-1)}\sum_{i=1}^{M}\bigg[\Big(\Big(1-\frac{1}{m}\Big)(M+N)+\frac{i+q}{m}\Big)^{\vartheta-1} - \Big(\frac{m_0-1}{m}(M+N)+\frac{i+q}{m}\Big)^{\vartheta-1}\bigg] \\
&\quad + \frac{Bm^{\vartheta-1}}{(M+N)(\vartheta-1)}\sum_{i=M+1}^{M+N}\bigg[\Big(\Big(1-\frac{1}{m}\Big)(M+N)+\frac{i+q}{m}\Big)^{\vartheta-1} - \Big(\frac{m_0-1}{m}(M+N)+\frac{i+q}{m}\Big)^{\vartheta-1}\bigg]
\end{align*}
as well as
\[ \int_{m_0}^{m}g(x)\,dx = \frac{m^{\vartheta-1}}{(M+N)(\vartheta-1)}\sum_{i=1}^{M+N}\bigg[\Big(M+N+\frac{i+q}{m}\Big)^{\vartheta-1} - \Big(\frac{m_0}{m}(M+N)+\frac{i+q}{m}\Big)^{\vartheta-1}\bigg] \]
and therefore
\[ \lim_{m\to\infty}\frac{\int_{m_0}^{m}f(x-1)\,dx}{\int_{m_0}^{m}g(x)\,dx} = \frac{AM(M+N)^{\vartheta-1} + BN(M+N)^{\vartheta-1}}{(M+N)(M+N)^{\vartheta-1}} = \frac{AM+BN}{M+N}. \]
Analogously,
\[ \lim_{m\to\infty}\frac{\int_{m_0}^{m}f(x)\,dx}{\int_{m_0}^{m}g(x-1)\,dx} = \frac{AM+BN}{M+N}, \]
which now completes the proof of our theorem.

Discussion of Theorem 5.5

If $M$ deterministic and $N$ stochastic steps are taken by turns, and the step size $\epsilon_n$ is chosen such that $\epsilon_n = \frac{c}{n+q}$ with $q \in IN$ and $c \ge \frac{1}{a}$, where $0 < a < 2\alpha$, then by Theorem 5.5 we have the asymptotic representation
\[ B_n \approx \Big(1 - \frac{\sigma^2}{\gamma_1+\sigma^2}\,w\Big)B_n^s \quad \text{as } n \to \infty, \]
where $w = \frac{M}{M+N}$ is the percentage of deterministic steps in one complete turn of $M+N$ deterministic and stochastic steps. Hence, here we find that asymptotically the semi-stochastic procedure is $\Big(1 - \frac{\sigma^2}{\gamma_1+\sigma^2}\,w\Big)^{-1}$ times faster than the pure stochastic approximation procedure. This conclusion is studied in more detail in Sect. 5.5.2.
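The limit (5.87) and the speed-up factor above can be checked numerically; all constants in the following sketch are illustrative sample values, not taken from the text:

```python
# Simulate the recursions (5.85), (5.86) with sample parameters and compare
# B_n / B_n^s both with Q from Theorem 5.5 and with 1 - w*sigma^2/(gamma1+sigma^2)
# from the discussion above (the two expressions for Q coincide).
gamma1, sigma2, c, q = 0.5, 2.0, 3.0, 0
A, B = c * c * gamma1, c * c * (gamma1 + sigma2)
M, N, theta = 2, 3, 1.5                        # theta = a*c >= 1 (sample a = 0.5)
Bh = Bs = 1.0                                  # hybrid / pure stochastic upper bounds
for n in range(2, 500001):                     # start where 1 - theta/(n+q) > 0
    in_IN1 = (n - 1) % (M + N) < M             # first M steps of each turn are deterministic
    beta = A if in_IN1 else B
    factor = 1.0 - theta / (n + q)
    Bh = factor * Bh + beta / (n + q) ** 2     # recursion (5.85)
    Bs = factor * Bs + B / (n + q) ** 2        # recursion (5.86)
Q = (A * M + B * N) / (B * (M + N))            # limit from Theorem 5.5
w = M / (M + N)
Q_alt = 1.0 - w * sigma2 / (gamma1 + sigma2)   # same value, as derived above
ratio = Bh / Bs                                # should be close to Q
```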

Comparison of Hybrid Stochastic Approximation Procedures

By the above results we are also able to compare two hybrid stochastic approximation procedures characterized by the parameters $A, B, M, N, \vartheta$ and $\tilde A, \tilde B, \tilde M, \tilde N, \tilde\vartheta$, respectively. Corresponding to these two tuples of parameters, let $B_n$ and $\tilde B_n$, respectively, be the upper bounds of the mean square errors of the hybrid procedures defined by the recursion (5.85). Analogously, let $B_n^s$ and $\tilde B_n^s$, respectively, be the solutions of recursion (5.86) corresponding to the above tuples of parameters. If $\vartheta \ge 1$ and $\tilde\vartheta \ge 1$, then by Theorem 5.5 we know that
\[ \frac{B_n}{B_n^s} \to \frac{AM+BN}{B(M+N)} \quad \text{as } n \to \infty \]
and
\[ \frac{\tilde B_n}{\tilde B_n^s} \to \frac{\tilde A\tilde M+\tilde B\tilde N}{\tilde B(\tilde M+\tilde N)} \quad \text{as } n \to \infty. \]
Moreover, by Corollary 5.2 we know that for $\vartheta > 1$, $\tilde\vartheta > 1$
\[ \frac{B_n^s}{\tilde B_n^s} = \frac{n\cdot B_n^s}{n\cdot\tilde B_n^s} \to \frac{\frac{B}{\vartheta-1}}{\frac{\tilde B}{\tilde\vartheta-1}} = \frac{B}{\tilde B}\cdot\frac{\tilde\vartheta-1}{\vartheta-1} \quad \text{as } n \to \infty. \]
Consequently, we find this


Corollary 5.3. For any two hybrid stochastic approximation procedures char-
acterized by the parameters A, B, M, N, ϑ and Ã, B̃, M̃ , Ñ , ϑ̃, resp., where
ϑ > 1 and ϑ̃ > 1, it holds
AM +BN
Bn M +N ϑ̃ − 1
lim = · .
n→∞ B̃
n
ÃM̃ +B̃ Ñ ϑ−1
M̃ +Ñ

Bn
Proof. For we may write
B̃n
Bn Bn B s 1
lim = s · n· .
n→∞ B̃
n Bn B̃ns B̃n
s B̃n

Hence, by the above considerations we get

Bn Bn Bs 1 AM + BN B ϑ̃ − 1 B̃(M̃ + Ñ )
lim = lim s · lim n · = · · ·
n→∞ B̃
n
n→∞ Bn n→∞ B̃ s
n lim
B̃n B(M + N ) B̃ ϑ − 1 ÃM̃ + B̃ Ñ
s
n→∞ B̃n

which proves our corollary.

Note. If $\tilde M = 0$, $\tilde B = B$ and $\tilde\vartheta = \vartheta$, i.e., if $\tilde B_n = B_n^s$, then from Corollary 5.3 we regain Theorem 5.5.

5.5.2 Lower Bounds for the Mean Square Error

The results obtained above may now be used to find also lower bounds for the mean square error of our hybrid procedure, provided that some further hypotheses are fulfilled. To this end assume that for an optimal solution $x^*$ of (5.1)
\[ \Big\|p_D\Big(X_n - \epsilon_n\big(A(X_n)+\xi_n\big)\Big) - p_D\big(x^* - \epsilon_nA(x^*)\big)\Big\| = \Big\|X_n - \epsilon_n\big(A(X_n)+\xi_n\big) - \big(x^* - \epsilon_nA(x^*)\big)\Big\| \quad \text{for all } n \in IN. \tag{5.97} \]
This holds, e.g., if $\nabla F(x^*) = 0$ and
\[ X_n - \epsilon_n\big(A(X_n)+\xi_n\big) \in D \quad \text{for all } n \in IN, \]
see (5.67) and (5.69a). Now, if (5.97) holds, then we find
\begin{align}
\|X_{n+1} - x^*\|^2 &= \big\|(X_n - x^*) - \epsilon_n\big(A(X_n)-A(x^*)\big) - \epsilon_n\xi_n\big\|^2 \tag{5.98} \\
&= \|X_n - x^*\|^2 - 2\epsilon_n\big(A(X_n)-A(x^*)\big)^T(X_n - x^*) + \epsilon_n^2\big\|A(X_n)-A(x^*)\big\|^2 \notag \\
&\quad - 2\epsilon_n\xi_n^T\Big(X_n - x^* - \epsilon_n\big(A(X_n)-A(x^*)\big)\Big) + \epsilon_n^2\|\xi_n\|^2. \notag
\end{align}

Assume now that for all $x \in D$ the inequalities
\[ \big\|A(x) - A(x^*)\big\|^2 \ge \delta_1 + \delta_2\|x - x^*\|^2, \quad x \ne x^*, \tag{5.99} \]
\[ \big(A(x) - A(x^*)\big)^T(x - x^*) \le \beta\|x - x^*\|^2 \tag{5.100} \]
hold, and for all $n \in IN$ the inequality
\[ E\|\xi_n\|^2 \ge \eta^2 > 0 \tag{5.101} \]
is fulfilled, where $\delta_1 \ge 0$, $\delta_2 > 0$, $\beta > 0$ and $\eta > 0$ are given constants.

Note that these inequalities are opposite to (5.70), (5.71) and (5.80), respectively. Furthermore, (5.99), (5.100) and (5.101) together with (5.70), (5.71) and (5.80) imply that
\[ \delta_1 \le \gamma_1, \quad \delta_2 \le \gamma_2, \quad \alpha \le \beta \quad \text{and} \quad \eta^2 \le \sigma^2. \]

If again (5.72) is valid, i.e., if $E(\xi_n\,|\,X_n) = 0$ a.s. for every $n \in IN_2$, then, using equality (5.98), for $b_n = E\|X_n - x^*\|^2$ we find, in the same way as in the proof of Lemma 5.1, the following recursion opposite to (5.74a), (5.74b) and (5.82):
\begin{align}
b_{n+1} &\ge (1 - 2\beta\epsilon_n + \delta_2\epsilon_n^2)b_n + \epsilon_n^2\big(\delta_1 + E\|\xi_n\|^2\big) \tag{5.102a} \\
&\ge (1 - 2\beta\epsilon_n)b_n + \underline\beta_n\epsilon_n^2, \quad n = 1, 2, \ldots, \notag
\end{align}
where
\[ \underline\beta_n = \begin{cases} \delta_1, & \text{if } n \in IN_1, \\ \delta_1 + \eta^2, & \text{if } n \in IN_2. \end{cases} \tag{5.102b} \]
In contrast to the "worst case" in (5.82), which is represented by the recursion (5.84a), (5.84b), here we consider the "best case" in (5.102a), i.e., the recursion
\[ \underline B_{n+1} = (1 - 2\beta\epsilon_n)\underline B_n + \underline\beta_n\epsilon_n^2, \quad n = 1, 2, \ldots. \tag{5.103} \]
Assuming again that $\lim_{n\to\infty}\epsilon_n = 0$, there is an integer $n_0$ such that (5.81) holds, i.e.,
\[ 0 < 1 - 2\alpha\epsilon_n + \gamma_2\epsilon_n^2 < 1 - a\epsilon_n \quad \text{for all } n \ge n_0, \]
where $0 < a < 2\alpha$, as well as
\[ 1 - 2\beta\epsilon_n > 0 \quad \text{for all } n \ge n_0. \]

Let $B_{n_0} = b_{n_0}$ and $\underline B_{n_0} = b_{n_0}$. Then, according to Sect. 5.5, inequality (5.84a), we know that $b_n \le B_n$ for all $n \ge n_0$, and by a similar simple induction argument we find
\[ \underline B_n \le b_n \le B_n \quad \text{for all } n \ge n_0. \tag{5.104a} \]
Since the pure stochastic procedure is characterized simply by $IN_1 = \emptyset$, we also have
\[ \underline B_n^s \le b_n^s \le B_n^s \quad \text{for all } n \ge n_0, \tag{5.104b} \]
where $b_n^s$ is the mean square error of the pure stochastic algorithm. Furthermore, $B_n^s$ and $\underline B_n^s$ are the solutions of (5.84a) and (5.103), respectively, with $\bar\beta_n = \gamma_1 + \sigma^2$ and $\underline\beta_n = \delta_1 + \eta^2$ for all $n \in IN$, where, of course, $B_{n_0}^s = b_{n_0}^s$ and $\underline B_{n_0}^s = b_{n_0}^s$. Thus, the inequalities (5.104a), (5.104b) yield the estimates
\[ \frac{\underline B_n}{B_n^s} \le \frac{b_n}{b_n^s} \le \frac{B_n}{\underline B_n^s} \quad \text{for all } n \ge n_0. \tag{5.105} \]
Let now the step size $\epsilon_n$ be defined by
\[ \epsilon_n = \frac{c}{n+q}, \quad n = 1, 2, \ldots, \]
where $c > 0$ and $q \in IN \cup \{0\}$ are fixed numbers.



The (hybrid) stochastic algorithms having the lower and upper mean square error estimates $(\underline B_n)$ and $(B_n^s)$ are characterized – see Sect. 5.5.1 – by the parameters
\[ A = c^2\delta_1, \quad B = c^2(\delta_1+\eta^2), \quad M, N, \quad \vartheta = 2\beta c \]
and
\[ \tilde B = c^2(\gamma_1+\sigma^2), \quad \tilde M = 0, \quad \tilde\vartheta = ac, \quad \text{respectively}. \]
Hence, if $c > 0$ is selected such that $ac > 1$ and $2\beta c > 1$, then Corollary 5.3 yields
\[ \lim_{n\to\infty}\frac{\underline B_n}{B_n^s} = \frac{\frac{c^2\delta_1M + c^2(\delta_1+\eta^2)N}{M+N}}{c^2(\gamma_1+\sigma^2)}\cdot\frac{ac-1}{2\beta c-1}. \tag{5.106a} \]
Furthermore, the sequences $(B_n)$, $(\underline B_n^s)$, respectively, are estimates of the mean square error of the (hybrid) stochastic algorithms represented by the parameters
\[ A = c^2\gamma_1, \quad B = c^2(\gamma_1+\sigma^2), \quad M, N, \quad \vartheta = ac \]
and
\[ \tilde B = c^2(\delta_1+\eta^2), \quad \tilde M = 0, \quad \tilde\vartheta = 2\beta c, \quad \text{respectively}. \]
Thus, if $ac > 1$ and $2\beta c > 1$, then Corollary 5.3 yields
\[ \lim_{n\to\infty}\frac{B_n}{\underline B_n^s} = \frac{\frac{c^2\gamma_1M + c^2(\gamma_1+\sigma^2)N}{M+N}}{c^2(\delta_1+\eta^2)}\cdot\frac{2\beta c-1}{ac-1}. \tag{5.106b} \]

Consequently, we now have this next result:

Theorem 5.6. Suppose that the minimization problem (5.1) fulfills the inequalities (5.70), (5.71) and (5.99), (5.100). Furthermore, assume that the noise conditions (5.72), (5.80) and (5.101) hold. If $\epsilon_n = \frac{c}{n+q}$ with $q \in IN \cup \{0\}$ and $c > 0$ such that $ac > 1$, where $0 < a < 2\alpha$, and if (5.97) is fulfilled, then
\[ Q_1 \le \liminf_{n\to\infty}\frac{b_n}{b_n^s} \le \limsup_{n\to\infty}\frac{b_n}{b_n^s} \le Q_2, \tag{5.107} \]
where
\[ Q_1 = \frac{\delta_1M + (\delta_1+\eta^2)N}{(M+N)(\gamma_1+\sigma^2)}\cdot\frac{ac-1}{2\beta c-1}, \tag{5.108a} \]
\[ Q_2 = \frac{\gamma_1M + (\gamma_1+\sigma^2)N}{(M+N)(\delta_1+\eta^2)}\cdot\frac{2\beta c-1}{ac-1}. \tag{5.108b} \]
Proof. Since $0 < a < 2\alpha$ and $\alpha \le \beta$, we have $ac < 2\beta c$ for every $c > 0$ and therefore $2\beta c - 1 > ac - 1 > 0$. The assertion now follows from (5.106a) and (5.106b).

Discussion of Theorem 5.6

(a) Obviously, (5.107) yields $Q_1 \le Q_2$, and, using formulas (5.108a), (5.108b), we find $Q_1 < Q_2$. Moreover, by means of Theorem 5.5 and because of $\delta_1 \le \gamma_1$, $\alpha \le \beta$, $\eta^2 \le \sigma^2$ as well as $1 < ac < 2\alpha c \le 2\beta c$, we obtain
\[ Q_1 \le \frac{\gamma_1M + (\gamma_1+\sigma^2)N}{(M+N)(\gamma_1+\sigma^2)}\cdot\frac{ac-1}{2\beta c-1} = \frac{ac-1}{2\beta c-1}\cdot\lim_{n\to\infty}\frac{B_n}{B_n^s} < \lim_{n\to\infty}\frac{B_n}{B_n^s} \tag{5.109} \]
as well as
\[ Q_2 \ge \frac{\delta_1M + (\delta_1+\eta^2)N}{(M+N)(\delta_1+\eta^2)}\cdot\frac{2\beta c-1}{ac-1} = \frac{2\beta c-1}{ac-1}\cdot\lim_{n\to\infty}\frac{\underline B_n}{\underline B_n^s} > \lim_{n\to\infty}\frac{\underline B_n}{\underline B_n^s}. \tag{5.110} \]
(b) According to (5.107) we have
\[ Q_1b_n^s \le b_n \le Q_2b_n^s \quad \text{as } n \to \infty. \]
This means that asymptotically the hybrid stochastic approximation procedure is at most $\frac{1}{Q_1}$ times faster and at least $\frac{1}{Q_2}$ times faster than the pure stochastic procedure. According to (5.109) we have $\frac{1}{Q_1} > 1$ for the best case. Moreover, $Q_2 < 1$ holds if and only if
\[ \frac{\gamma_1M + (\gamma_1+\sigma^2)N}{(M+N)(\delta_1+\eta^2)} < f := \frac{ac-1}{2\beta c-1}\ (<1). \tag{5.111} \]
Since always
\[ \gamma_1 + \sigma^2 - f(\delta_1+\eta^2) > \gamma_1 + \sigma^2 - \delta_1 - \eta^2 \ge 0, \]
condition (5.111) is equivalent to
\[ \frac{N}{M} < \frac{f(\delta_1+\eta^2) - \gamma_1}{\gamma_1 + \sigma^2 - f(\delta_1+\eta^2)}, \tag{5.112a} \]
which can be fulfilled if and only if
\[ \delta_1 + \eta^2 > \frac{\gamma_1}{f} = \gamma_1\,\frac{2\beta c-1}{ac-1}. \tag{5.112b} \]
(c) If $c$ is selected according to $c > \frac{1}{a_0}$, where $a_0$ is a fixed number such that $0 < a_0 < 2\alpha$, then $ac \ge a_0c > 1$ for all $a$ with $a_0 \le a < 2\alpha$. Hence, if $c > \frac{1}{a_0}$, then Theorem 5.6 holds for all numbers $a$ with $a_0 \le a < 2\alpha$. Consequently, if $c > \frac{1}{a_0}$, where $0 < a_0 < 2\alpha$, then the relations (5.107) with (5.108a) and (5.108b), (5.109) and (5.110) as well as (5.111), (5.112a) and (5.112b) also hold if $a$ is replaced by $2\alpha$.

5.5.3 Decreasing Rate of Stochastic Steps

A further improvement of the speed of convergence is obtained if we have a decreasing rate of stochastic steps taken in (5.69a)–(5.69c). In order to handle this case, we consider the sequence
\[ c_n = n^{\lambda}E\|X_n - x^*\|^2, \quad n = 1, 2, \ldots, \]
for some $\lambda$ with $0 \le \lambda < \bar\lambda$, where $\bar\lambda$ is a given fixed positive number. Then, under the same conditions and using the same methods as in Lemma 5.1, corresponding to (5.74a), (5.74b) we find
\[ c_{n+1} \le \Big(1+\frac{1}{n}\Big)^{\lambda}(1 - 2\alpha\epsilon_n + \gamma_2\epsilon_n^2)c_n + (n+1)^{\lambda}\beta_n\epsilon_n^2, \quad n = 1, 2, \ldots, \tag{5.113a} \]
where again
\[ \beta_n = \gamma_1 + E\|\xi_n\|^2 \tag{5.113b} \]
and $\xi_n$ is defined by (5.69b). We claim that for $n \ge n_1$, $n_1 \in IN$ sufficiently large,
\[ \Big(1+\frac{1}{n}\Big)^{\lambda}(1 - 2\alpha\epsilon_n + \gamma_2\epsilon_n^2) \le 1, \tag{5.114} \]
provided that the step size $\epsilon_n$ is suitably chosen. Indeed, if $\epsilon_n \to 0$ as $n \to \infty$, then there is an integer $n_0$ such that (see (5.81))
\[ 0 < 1 - 2\alpha\epsilon_n + \gamma_2\epsilon_n^2 < 1 - a\epsilon_n, \quad n \ge n_0, \]
for any number $a$ such that $0 < a < 2\alpha$. According to (5.114) and (5.81) we now have to choose $\epsilon_n$ such that
\[ \Big(1+\frac{1}{n}\Big)^{\lambda}(1 - a\epsilon_n) \le 1, \quad n \ge n_1, \tag{5.115} \]
with an integer $n_1 \ge n_0$. Write
\[ \Big(1+\frac{1}{n}\Big)^{\lambda} = 1 + \frac{p_n(\lambda)}{n} \]
 
with $p_n(\lambda) = n\Big(\big(1+\frac{1}{n}\big)^{\lambda} - 1\Big)$. Hence, condition (5.115) has the form
\[ 1 - a\epsilon_n \le \frac{n}{n + p_n(\lambda)}, \quad n \ge n_1. \]
Since $0 \le \lambda < \bar\lambda$ and therefore $p_n(\lambda) \le p_n(\bar\lambda)$ for all $n \in IN$, the latter is true if
\[ 1 - a\epsilon_n \le \frac{n}{n + p_n(\bar\lambda)}, \quad n \ge n_1. \]
Hence, for the step size $\epsilon_n$ we find the condition
\[ \epsilon_n \ge \frac{1}{a}\,\frac{p_n(\bar\lambda)}{n + p_n(\bar\lambda)} = \frac{1}{n}\cdot\frac{1}{a}\cdot\frac{p_n(\bar\lambda)}{1 + \frac{p_n(\bar\lambda)}{n}} \tag{5.116} \]
for all $n$ sufficiently large. Writing
\[ p_n(\bar\lambda) = \frac{\big(1+\frac{1}{n}\big)^{\bar\lambda} - 1}{\frac{1}{n}}, \]
we see that
\[ \lim_{n\to\infty}p_n(\bar\lambda) = \frac{d}{dx}(1+x)^{\bar\lambda}\Big|_{x=0} = \bar\lambda. \]
Therefore
\[ \lim_{n\to\infty}\frac{p_n(\bar\lambda)}{1 + \frac{p_n(\bar\lambda)}{n}} = \bar\lambda, \]
which implies that
\[ \vartheta_1 \le \frac{p_n(\bar\lambda)}{1 + \frac{p_n(\bar\lambda)}{n}} \le \vartheta_2, \quad n = 1, 2, \ldots, \tag{5.117} \]
with some constants $\vartheta_1, \vartheta_2$ such that $0 < \vartheta_1 < \bar\lambda \le \vartheta_2$. If $\bar\lambda \in IN$, then an easy consideration shows that
\[ \frac{p_n(\bar\lambda)}{1 + \frac{p_n(\bar\lambda)}{n}}, \quad n = 1, 2, \ldots, \]
is a monotonously increasing sequence. Hence, for every $\bar\lambda \in IN$ the bounds $\vartheta_1, \vartheta_2$ can be selected according to $\vartheta_1 = \frac{2^{\bar\lambda}-1}{2^{\bar\lambda}}$ and $\vartheta_2 = \bar\lambda$. Consequently, if $\epsilon_n$ is defined by
\[ \epsilon_n = \frac{1}{n}\cdot\frac{1}{a}\cdot\frac{p_n(\bar\lambda)}{1 + \frac{p_n(\bar\lambda)}{n}} \tag{5.118a} \]
or $\epsilon_n$ is given by
\[ \epsilon_n = \frac{c}{n+q} = \frac{c}{n\big(1+\frac{q}{n}\big)} \quad \text{with } q \in IN \text{ and } c > \frac{\vartheta_2}{a}, \tag{5.118b} \]
then there is an integer $n_1 \ge n_0$ such that (5.116) holds for $n \ge n_1$.

Let $\vartheta = \vartheta_2$, $\vartheta = ac$, respectively. Then in both cases we have $\epsilon_n \le \frac{\vartheta}{a}\cdot\frac{1}{n}$, and from (5.113a), (5.113b)–(5.115) it follows that
\[ c_{n+1} \le c_n + (n+1)^{\lambda}\beta_n\epsilon_n^2 \le c_n + \Big(\frac{\vartheta}{a}\Big)^2\frac{(n+1)^{\lambda}}{n^2}\,\beta_n \le c_n + \Big(\frac{n+1}{n}\Big)^{\bar\lambda}\Big(\frac{\vartheta}{a}\Big)^2\beta_nn^{\lambda-2} \le c_n + 2^{\bar\lambda}\Big(\frac{\vartheta}{a}\Big)^2\beta_nn^{\lambda-2} \]
for $n \ge n_1$ and therefore
\[ c_{N+1} - c_{n_1} = \sum_{n=n_1}^{N}(c_{n+1} - c_n) \le 2^{\bar\lambda}\Big(\frac{\vartheta}{a}\Big)^2\sum_{n=n_1}^{N}\beta_nn^{\lambda-2}, \quad N \ge n_1. \tag{5.119} \]
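The quantities $p_n(\bar\lambda)$ and the bounds $\vartheta_1, \vartheta_2$ used above are easy to check numerically; the following sketch uses the sample choice $\bar\lambda = 3$:

```python
# Check the properties of p_n(lam) = n*((1 + 1/n)**lam - 1) used above:
# p_n(lam) -> lam as n -> infinity, and for integer lam the sequence
# p_n/(1 + p_n/n) increases from theta_1 = (2**lam - 1)/2**lam towards
# theta_2 = lam.  lam = 3 is a sample value.
lam = 3.0
def p(n):
    return n * ((1.0 + 1.0 / n) ** lam - 1.0)
vals = [p(n) / (1.0 + p(n) / n) for n in range(1, 10001)]
increasing = all(a < b for a, b in zip(vals, vals[1:]))
theta1 = vals[0]                               # = (2**3 - 1)/2**3 = 7/8
p_large = p(10 ** 8)                           # close to lam = 3
```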

Suppose now that (5.80) also holds, hence $E\|\xi_n\|^2 \le \sigma^2$ for all $n \in IN_2$. Therefore $\beta_n \le \bar\beta_n$, where (see (5.83)) $\bar\beta_n = \gamma_1$ for $n \in IN_1$ and $\bar\beta_n = \gamma_1+\sigma^2$ for $n \in IN_2$. Inequality (5.119) now yields
\[ c_{N+1} - c_{n_1} \le 2^{\bar\lambda}\Big(\frac{\vartheta}{a}\Big)^2\Bigg(\sum_{\substack{n=n_1\\ n\in IN_1}}^{N}\frac{\gamma_1}{n^{2-\lambda}} + \sum_{\substack{n=n_1\\ n\in IN_2}}^{N}\frac{\gamma_1+\sigma^2}{n^{2-\lambda}}\Bigg). \]
Choosing $\bar\lambda \ge 2$, we find this result:

Theorem 5.7. Let the assumptions of Lemma 5.1 be fulfilled, and suppose that (5.80) holds. Moreover, let the step size $\epsilon_n$ be defined by (5.118a) or (5.118b). If $\gamma_1 = 0$ and $IN_1 = IN\setminus IN_2$ with
\[ IN_2 = \{n_k = k^p : k = 1, 2, \ldots\} \tag{5.120} \]
for some fixed integer $p = 1, 2, \ldots$, then
\[ E\|X_n - x^*\|^2 = O(n^{-\lambda}) \quad \text{as } n \to \infty \]
for all $\lambda$ such that $0 \le \lambda < 2 - \frac{1}{p}$.
Proof. Under the given hypotheses we have
\[ c_{N+1} - c_{n_1} \le 2^{\bar\lambda}\Big(\frac{\vartheta}{a}\Big)^2\sigma^2\sum_{n_1\le k^p\le N}\frac{1}{(k^p)^{2-\lambda}} \le 2^{\bar\lambda}\Big(\frac{\vartheta}{a}\Big)^2\sigma^2\sum_{k=1}^{\infty}\frac{1}{k^{p(2-\lambda)}}, \quad N \ge n_1, \]
where $c_n = n^{\lambda}E\|X_n-x^*\|^2$ with $0 \le \lambda < \bar\lambda$ and a fixed $\bar\lambda \ge 2$. Since $2-\frac{1}{p} < \bar\lambda$, and $0 \le \lambda < 2-\frac{1}{p}$ is equivalent to $p(2-\lambda) > 1$, the above series converges. Hence, the sequence $(c_n)$ is bounded, which yields the assertion of our theorem.

Discussion of Theorem 5.7


The condition $\gamma_1 = 0$ means that the operator $A(x)$ defined by (5.67) must satisfy (see (5.70)) the Lipschitz condition

$$\|A(x) - A(x^*)\| \le \sqrt{\gamma_2}\,\|x - x^*\|.$$

Definition (5.120) of $\mathbb{N}_2$ obviously means that we have a decreasing rate $n^{\frac{1}{p}-1}$ of stochastic steps taken in (5.69a), (5.69b).

According to Lemma 5.2, for a pure stochastic approximation procedure we know that $E\|X_n - x^*\|^2 = O(n^{-\kappa})$ with $0 < \kappa \le 1$. Hence, if stochastic and deterministic steps are taken in (5.69a), (5.69b) as described by (5.120), then the speed of convergence may be increased from $O(n^{-\kappa})$ with $0 < \kappa \le 1$ in the pure stochastic case to the order $O(n^{-\lambda})$ with $1 < \lambda < 2 - \frac{1}{p}$ in the hybrid stochastic case. For $p = 1$, $p \to \infty$, respectively, we obtain the pure stochastic, pure deterministic case, respectively.

Denoting by $s_n$ the number of stochastic steps taken in (5.69a), (5.69b) up to the $n$th stage, in the case (5.120) we have

$$s_n \approx n^{1/p}$$

and therefore $\frac{s_n}{n^2} \approx n^{1/p-2}$. Hence, the estimate of the convergence rate given by Theorem 5.7 can also be represented in the form

$$E\|X_n - x^*\|^2 \approx O\Big(\frac{s_n}{n^2}\Big) = O\Big(\frac{r_n}{n}\Big)$$

if $r_n = n^{1/p-1}$ is the rate of stochastic steps taken in (5.69a), (5.69b), cf. the example in [80].
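The effect of thinning out the stochastic steps according to (5.120) can be illustrated numerically. The following is an illustrative toy experiment, not the book's procedure: it minimizes $F(x) = x^2/2$ (so $G(x) = x$, $x^* = 0$), takes a noisy gradient only at the stages $n = k^p$ forming $\mathbb{N}_2$ and an exact gradient at all other stages; the function name and parameters are invented for this sketch.

```python
import random

def hybrid_sa(p, n_steps=20000, seed=1):
    """Toy hybrid procedure for minimizing F(x) = x**2 / 2 on the real line
    (so G(x) = x and x* = 0).  A noisy gradient estimate is used only at the
    stages n = k**p (the set IN2); at all remaining stages (IN1) an exact,
    deterministic gradient step is taken."""
    rng = random.Random(seed)
    noisy_stages = {k ** p for k in range(1, int(n_steps ** (1.0 / p)) + 2)}
    x = 5.0
    for n in range(1, n_steps + 1):
        noise = rng.gauss(0.0, 1.0) if n in noisy_stages else 0.0
        x -= (x + noise) / n          # step size rho_n = 1/n
    return abs(x)

# p = 1 is the pure stochastic case (every step is noisy); a larger p thins
# out the stochastic steps and the residual error decays faster.
err_pure = hybrid_sa(p=1)
err_hybrid = hybrid_sa(p=3)
```

With the same number of stages, the hybrid run typically ends far closer to $x^* = 0$, in line with the rate $O(s_n/n^2)$ above.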
6 Stochastic Approximation Methods with Changing Error Variances

6.1 Introduction
As already discussed in the previous Chap. 5, hybrid stochastic gradient pro-
cedures are very important tools for the iterative solution of stochastic opti-
mization problems as discussed in Chaps. 1–4:

min F (x) s.t. x ∈ D


 
with the expectation value function $F(x) = Ef\big(a(\omega), x\big)$ having the gradient $G(x) = \nabla F(x)$. Moreover, quite similar stochastic approximation procedures
may be used for the solution of systems of nonlinear equations

G(x) = 0

involving a vectorial so-called "regression function" $G = G(x)$ on $\mathbb{R}^\nu$. These algorithms, generating a sequence $(X_n)$ of estimators $X_n$ of a solution $x^*$ of one of the above-mentioned problems, are based on estimators

Yn = G(Xn ) + Zn , n = 0, 1, 2, . . .

of the gradient G of the objective function F or the regression function G at


an iteration point Xn . The resulting iterative procedures of the type

$$X_{n+1} := p_D(X_n - R_n Y_n), \quad n = 0, 1, 2, \ldots,$$

cf. (5.5), are well studied in the literature. Here, pD denotes the projection of
IRν onto the feasible domain D known to contain an (optimal) solution, and
Rn , n = 0, 1, 2, . . . is a sequence of scalar or matrix valued step sizes.
In the extensive literature on stochastic approximation procedures, cf.
[39–41, 60, 61, 125, 164], various sufficient conditions can be found for the
underlying optimization problem, for the regression function, resp., the se-
quence of estimation errors (Zn ) and the sequence of step sizes Rn , n =
0, 1, 2, . . ., such that

Xn → x∗ , n → ∞,
in some probabilistic sense (in the (quadratic) mean, almost sure (a.s.), in
distribution). Furthermore, numerous results concerning the asymptotic dis-
tribution of the random algorithm (Xn ) and the adaptive selection of the step
sizes for increasing the convergence behavior of (Xn ) are available.
As shown in Chap. 5, the convergence behavior of stochastic gradient pro-
cedures may be improved considerably by using
– More exact gradient estimators $Y_n$ at certain iteration stages $n \in \mathbb{N}_1$
and/or
– Deterministic (feasible) descent directions hn = h(Xn ) available at certain
iteration points Xn
While the convergence of hybrid stochastic gradient procedures was examined
in Chap. 5 in terms of the sequence of mean square errors
$$b_n = E\|X_n - x^*\|^2, \quad n = 0, 1, 2, \ldots,$$
in the present Chap. 6, for stochastic approximation algorithms with chang-
ing error variances further probabilistic convergence properties are studied in
detail: Convergence almost sure (a.s.), in the mean, in distribution. Moreover,
the asymptotic distribution is determined, and the adaptive selection of scalar
and matrix valued step sizes is considered for the case of changing error vari-
ances. The results presented in this chapter are taken from the doctoral thesis
of E. Plöchinger [115].
Remark 6.1. Based on the RSM-gradient estimators considered in Chap. 5,
implementations of the present stochastic approximation procedure, espe-
cially the Hessian approximation given in Sect. 6.6, are developed in the the-
sis of J. Reinhart [121], see also [122], for stochastic structural optimization
problems.

6.2 Solution of Optimality Conditions


For a given set $\emptyset \ne D \subset \mathbb{R}^\nu$ and a measurable function $G : D \to \mathbb{R}^\nu$, consider the following problem:

P: Find a vector $x^* \in D$ such that

$$x^* = p_D\big(x^* - r\cdot G(x^*)\big) \quad\text{for all } r > 0. \qquad (6.1)$$

Here, $D$ is a convex and compact set, and for a vector $y \in \mathbb{R}^\nu$, $p_D(y)$ denotes the projection of $y$ onto $D$, determined uniquely by the relation

$$p_D(y) \in D \quad\text{and}\quad \|y - p_D(y)\| = \min_{x\in D}\|y - x\|.$$
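For simple feasible sets the projection $p_D$ defined above has a closed form. A minimal sketch, assuming $D$ is a coordinate box or a Euclidean ball (the helper names are illustrative, not from the text):

```python
def project_box(y, lower, upper):
    """Projection p_D(y) onto the box D = [lower_i, upper_i]^nu: the unique
    nearest point of D, obtained by clipping each coordinate."""
    return [min(max(yi, lo), up) for yi, lo, up in zip(y, lower, upper)]

def project_ball(y, center, radius):
    """Projection p_D(y) onto the Euclidean ball {x : ||x - center|| <= radius}:
    points outside are pulled radially back onto the sphere."""
    d = [yi - ci for yi, ci in zip(y, center)]
    norm = sum(di * di for di in d) ** 0.5
    if norm <= radius:
        return list(y)
    return [ci + radius * di / norm for ci, di in zip(center, d)]

p1 = project_box([2.0, -3.0], [0.0, 0.0], [1.0, 1.0])
p2 = project_ball([3.0, 4.0], [0.0, 0.0], 1.0)
```

For general convex $D$ the projection is itself a small convex optimization problem; the box and ball cases suffice for the illustrations below.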

Problems of this type arise very frequently. E.g., in the special case $G = \nabla F$, where $F : D \to \mathbb{R}$ is a convex and continuously differentiable function, according to [16] condition (6.1) is equivalent to

$$F(x^*) = \min_{x\in D} F(x).$$

Furthermore, for a vector $x^* \in \mathring D$ ($:=$ interior of $D$), condition (6.1) is equivalent to the system of (nonlinear) equations

$$G(x^*) = 0.$$

In many practical situations the function $G(x)$ cannot be determined directly, but we may assume that for every $x \in D$ a stochastic approximate (estimate) $Y$ of $G(x)$ is available. Hence, for the numerical solution of problem (P) we consider here the following stochastic approximation method:

Select a vector $X_1 \in D$, and for $n = 1, 2, \ldots$ determine recursively

$$X_{n+1} := p_D(X_n - R_n \cdot Y_n) \qquad (6.2a)$$

with

$$R_n := \rho_n \cdot M_n \quad\text{and}\quad Y_n := G(X_n) + Z_n. \qquad (6.2b)$$

The matrix step size $R_n = \rho_nM_n$, where $\rho_n$ is a positive number and $M_n$ is a real $\nu\times\nu$ matrix, can be chosen by the user in order to control the algorithm (6.2a), (6.2b). $Y_n$ denotes the search direction, and $Z_n$ is the estimation error occurring in the computation of $G(X_n)$ at step $n$. Furthermore, $\rho_n$, $M_n$ and $Z_n$ are random variables defined on an appropriate probability space $(\Omega, \mathcal{A}, P)$.
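The recursion (6.2a), (6.2b) is easy to sketch in code. The following minimal illustration uses the scalar step size $R_n = \frac{1}{n}I$ (i.e. $\rho_n = 1/n$, $M_n = I$) and a box projection; `grad_oracle` and `project` are hypothetical user-supplied interfaces invented for this sketch, not part of the text:

```python
import random

def stochastic_approximation(grad_oracle, project, x1, n_steps=5000, seed=0):
    """Sketch of recursion (6.2a)/(6.2b) with scalar step size R_n = (1/n) I:
    X_{n+1} = p_D(X_n - R_n * Y_n), where Y_n = G(X_n) + Z_n is the noisy
    estimate returned by grad_oracle(x, rng)."""
    rng = random.Random(seed)
    x = list(x1)
    for n in range(1, n_steps + 1):
        y = grad_oracle(x, rng)                              # Y_n = G(X_n) + Z_n
        x = project([xi - yi / n for xi, yi in zip(x, y)])   # rho_n = 1/n, M_n = I
    return x

# Example: F(x) = 0.5 * ||x||^2 on D = [-1, 1]^2, so G(x) = x and x* = (0, 0);
# the gradient is observed with additive N(0, 0.1) noise.
oracle = lambda x, rng: [xi + rng.gauss(0.0, 0.1) for xi in x]
clip = lambda x: [min(max(xi, -1.0), 1.0) for xi in x]
x_final = stochastic_approximation(oracle, clip, [1.0, -1.0])
```

After a few thousand iterations the iterate sits close to $x^* = (0,0)$, as the convergence results of this chapter predict for step sizes of order $1/n$.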

6.3 General Assumptions and Notations


In order to derive different convergence properties of the sequence (Xn ) de-
fined by (6.2a), (6.2b), in this part we suppose that the following assumptions,
denoted by (U), are fulfilled. Note that these conditions are the standard as-
sumptions which can be found in a similar way in all relevant work on stochastic
approximation:

(a) $D \subset \mathbb{R}^\nu$ is a nonempty, convex and compact set with diameter $\delta_0$.


(b) Let $G : D_0 \to \mathbb{R}^\nu$ denote a vector function, where $D_0 \supset \{x + e : x \in D \text{ and } \|e\| < \varepsilon_0\}$ with a positive number $\varepsilon_0$.
(c) There exist positive numbers $L_0$ and $\mu_0$, as well as functions $H : D \to \mathbb{R}^{\nu\times\nu}$ and $\delta : D \times \mathbb{R}^\nu \to \mathbb{R}^\nu$ such that

$$G(x + e) = G(x) + H(x)\cdot e + \delta(x; e), \qquad (U1a)$$
$$\|\delta(x; e)\| \le L_0 \cdot \|e\|^{1+\mu_0}, \qquad (U1b)$$
$$H(x) \text{ is continuous in } x \qquad (U1c)$$

for all $x \in D$ and $\|e\| < \varepsilon_0$.


(d) There is a unique $x^* \in D$ such that

$$x^* = p_D\big(x^* - r\cdot G(x^*)\big) \quad\text{for all } r > 0. \qquad (U2)$$

(e) For $x^* \in \partial D$ ($:=$ boundary of $D$) a scalar step size is selected, i.e.,

$$R_n = \rho_n \cdot M_n = r_n \cdot I \quad\text{with } r_n \in \mathbb{R}_+. \qquad (U3)$$

(f) Concerning the selection of the matrices $M_n$ in the step sizes $R_n = \rho_n\cdot M_n$, suppose that

$$\sup_n \|M_n\| < \infty \text{ a.s. (almost surely)}, \qquad (U4a)$$

and there is a partition

$$M_n = \widehat{M}_n + \Delta M_n, \qquad (U4b)$$

where

$$\lim_{n\to\infty} \|\Delta M_n\| = 0 \text{ a.s.}, \qquad (U4c)$$
$$u^T\widehat{M}_nH(x)u \ge a\|u\|^2 \qquad (U4d)$$

for each $n \in \mathbb{N}$, $x \in D$ and $u \in \mathbb{R}^\nu$. Here, $a$ is a positive constant.


(g) There exist a probability space $(\Omega, \mathcal{A}, P)$ and $\sigma$-algebras $\mathcal{A}_1 \subset \ldots \subset \mathcal{A}_n \subset \mathcal{A}$ such that the variables $X_n, \rho_n, M_n, \widehat{M}_n$ and $Z_{n-1}$ for $n = 1, 2, \ldots$ are $\mathcal{A}_n$-measurable.
(h) For the estimation error $Z_n$ in (6.2a), (6.2b) we suppose, cf. (5.35a), (5.35b), that

$$n\,\|E_nZ_n\| \longrightarrow 0 \text{ a.s.}, \qquad (U5a)$$
$$E_n\|Z_n - E_nZ_n\|^2 \le \sigma_{0,n}^2 + \sigma_1^2\|X_n - x^*\|^2 \qquad (U5b)$$

for all $n = 1, 2, \ldots$. Here $E_n(\cdot) := E(\cdot \mid \mathcal{A}_n)$ denotes the conditional expectation with respect to $\mathcal{A}_n$, $\sigma_1$ is a constant, and $\sigma_{0,n}$ denotes an $\mathcal{A}_n$-measurable random variable with

$$\sup_n E\,\sigma_{0,n}^2 \le \sigma_0^2 < \infty \qquad (U5c)$$

and a further constant $\sigma_0$.
(i) There exists a disjoint partition $\mathbb{N} = \mathbb{N}^{(1)} \,\dot\cup\, \ldots \,\dot\cup\, \mathbb{N}^{(\kappa)}$ such that for $k = 1, \ldots, \kappa$ the limits

$$q_k := \lim_{n\to\infty} \frac{|\{1, \ldots, n\} \cap \mathbb{N}^{(k)}|}{n} > 0 \qquad (U6)$$

exist and are positive, see Sect. 5.5.1.

In the later Sects. 6.5–6.8 additional assumptions are needed, yielding further convergence properties or implying some of the above assumptions. Assumptions of this type are marked by one of the letters V, W, X and Y, and always hold for the current section.

6.3.1 Interpretation of the Assumptions

According to condition (c), the mapping $G : D \to \mathbb{R}^\nu$ is continuously differentiable and has the Jacobian $H(x)$. Because of (a) and (U1a)–(U1c) there is a number $\beta < \infty$ with

$$\|H(x)\| \le \beta \quad\text{for all } x \in D. \qquad (6.3)$$

Using the mean value theorem, we then have

$$\|G(x) - G(y)\| \le \beta \cdot \|x - y\| \qquad (6.4)$$

for all $x, y \in D$. Hence, $G$ is Lipschitz continuous with the Lipschitz constant $\beta$.

Since the projection operator $p_D(y)$ and the map $G(x)$ are continuous, because of (a) there is a vector $x^* \in D$ fulfilling (U2).

Conditions (U2) and (U3) guarantee that for each $n$ we have

$$x^* = p_D\big(x^* - R_n \cdot G(x^*)\big). \qquad (6.5)$$

Consequently, in the deterministic case, i.e., if $Z_n \equiv 0$, the algorithm (6.2a), (6.2b) stays at $x^*$ provided that $X_n = x^*$ for a certain index $n$.

According to condition (f), the matrices $M_n$ and $\widehat{M}_n$ are bounded, and (U4d) and the mean value theorem imply that

$$\big\langle x - x^*, \widehat{M}_n\big(G(x) - G(x^*)\big)\big\rangle \ge a\|x - x^*\|^2 \qquad (6.6)$$

for all $n$ and $x \in D$, where $\langle\cdot,\cdot\rangle$ denotes the scalar product in $\mathbb{R}^\nu$. Especially, for each $n \in \mathbb{N}$ we have

$$\big\langle X_n - x^*, \widehat{M}_n\big(G(X_n) - G(x^*)\big)\big\rangle \ge \tilde a_n\|X_n - x^*\|^2 \text{ a.s.} \qquad (6.7)$$

with certain $\mathcal{A}_n$-measurable random variables $\tilde a_n \ge a$. Furthermore, (U4d) yields:

$$\widehat{M}_n,\ H(x) \text{ are regular for each } n \in \mathbb{N} \text{ and } x \in D. \qquad (6.8)$$

E.g., condition (U4d) holds in the case of scalar step sizes, provided that

$$\widehat{M}_n = \tilde m_n \cdot I \quad\text{with } \tilde m_n \ge d, \qquad H(x) + H(x)^T \ge 2\alpha I$$

for all $n \in \mathbb{N}$ and $x \in D$, where $d$ and $\alpha$ are positive numbers. These conditions hold in Sects. 6.7.2 and 6.8. The second of the above conditions also implies (U2), and means in the gradient case $G(x) = \nabla F(x)$ that the function $F(x)$ is strongly convex on $D$.

According to (g), the random variables $\rho_n$, $M_n$ and $Z_n$ may depend on the history $X_1, \ldots, X_n$, $Z_1, \ldots, Z_{n-1}$, $\rho_1, \ldots, \rho_n$, $M_1, \ldots, M_n$ of the process (6.2a), (6.2b).

Condition (U5a) concerning the estimation error $Z_n$ holds, e.g., if the search direction $Y_n$ is an unbiased estimator of $G(X_n)$, hence, if $E_nY_n = G(X_n)$. From (U5a)–(U5c) and (a) we obtain that the estimation error has a bounded variance.

Since for different indices $n$ the estimation error $Z_n$ may vary considerably, according to (i) the set of integers $\mathbb{N}$ is partitioned such that the variances of $Z_n$ are approximately equal for large indices $n$ lying in the same part $\mathbb{N}^{(k)}$. Moreover, the numbers $q_1, \ldots, q_\kappa$ represent the proportions of the sets $\mathbb{N}^{(1)}, \ldots, \mathbb{N}^{(\kappa)}$ with respect to $\mathbb{N}$.
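Condition (U6) can be checked empirically for a candidate partition. A small sketch with an illustrative partition (even/odd indices, assumed here purely for demonstration):

```python
def partition_proportion(member, n):
    """Empirical proportion |{1,...,n} ∩ IN^(k)| / n for a part IN^(k)
    described by its membership predicate `member` (cf. condition (U6))."""
    return sum(1 for i in range(1, n + 1) if member(i)) / n

# Illustrative partition: IN^(1) = even indices, IN^(2) = odd indices;
# both proportions q_1 and q_2 tend to 1/2 > 0, so (U6) holds.
q1 = partition_proportion(lambda i: i % 2 == 0, 10**5)
q2 = partition_proportion(lambda i: i % 2 == 1, 10**5)
# By contrast, the perfect cubes {k**3} have proportion tending to 0, so
# they cannot serve as a part IN^(k) with q_k > 0.
q_cubes = partition_proportion(lambda i: round(i ** (1 / 3)) ** 3 == i, 10**5)
```

Sparse index sets such as $\{k^p\}$ with $p \ge 2$ are therefore excluded here, in contrast to the hybrid scheme (5.120) of Chap. 5.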

6.3.2 Notations and Abbreviations in this Chapter

The following notations and abbreviations are used in the next sections:

$\tilde a^{(k)} := \liminf_{n\in \mathbb{N}^{(k)}} \tilde a_n$  (6.9)
$\tilde a := q_1\tilde a^{(1)} + \ldots + q_\kappa\tilde a^{(\kappa)} \ge a$
$\Delta_n := X_n - x^*$ = actual argument error
$\Delta Z_n := Z_n - E_nZ_n$
$\sigma_k^2 := \limsup_{n\in \mathbb{N}^{(k)}} E\,\sigma_{0,n}^2$
$\tau_{n+1} := \tau_n(1 + \gamma_n)$ weight factor, where $\tau_1 := 1$ and $\gamma_n \ge 0$ is a nonnegative number
$\widetilde\Delta_n := \tau_n(X_n - x^*)$ weighted argument error
$G^* := G(x^*)$, $H^* := H(x^*)$
$G_n := G(X_n)$, $H_n := H(X_n)$
$\lambda_{\min}(A)$ := minimum eigenvalue of a symmetric matrix $A$
$\lambda_{\max}(A)$ := maximum eigenvalue of a symmetric matrix $A$
$A^T$ := transpose of $A$
$\|A\| := \sqrt{\lambda_{\max}(A^TA)}$ = norm of the matrix $A$
$A \le B :\iff \lambda_{\min}(B - A) \ge 0$
$\alpha^* := \frac{1}{2}\lambda_{\min}(H^* + H^{*T})$
$\mathbb{M}_{m,n} := \{m+1, \ldots, n\} \cap \mathbb{M}$ = $(m,n)$-section of a set $\mathbb{M} \subset \mathbb{N}$
$a^+ := \max(a, 0)$, $a^- := a^+ - a$
$\bar\Omega_0 := \{\omega \in \Omega : \omega \notin \Omega_0\}$ = complement of the set $\Omega_0 \subset \Omega$
$I_{\Omega_0}$ := characteristic function of $\Omega_0 \subset \Omega$
$E_{\Omega_0}T := E(T\,I_{\Omega_0})$ = expectation of the random variable $T$ restricted to the set $\Omega_0$
limextr := either $\limsup$ or $\liminf$

6.4 Preliminary Results


In order to shorten the proofs of the results given in the subsequent sections,
here some preliminary results are given.

6.4.1 Estimation of the Quadratic Error

For the proof of the convergence of the quadratic argument error $\|X_n - x^*\|^2$ the following auxiliary statements are needed.
Lemma 6.1. For all $n \in \mathbb{N}$ we have

$$E_n\|\Delta_{n+1}\|^2 \le \big\|\Delta_n - R_n(G_n - G^*)\big\|^2 + \|R_n\|^2\big(\|E_nZ_n\|^2 + E_n\|Z_n - E_nZ_n\|^2\big) + 2\|R_n\|\,\|E_nZ_n\|\,\big\|\Delta_n - R_n(G_n - G^*)\big\|. \qquad (6.10)$$

Proof. Because of (6.5), for arbitrary $n$ we get

$$\|\Delta_{n+1}\|^2 = \big\|p_D(X_n - R_nY_n) - p_D(x^* - R_nG^*)\big\|^2 \le \big\|X_n - R_n(G_n + Z_n) - (x^* - R_nG^*)\big\|^2$$
$$= \big\|\big[\Delta_n - R_n(G_n - G^*) - R_nE_nZ_n\big] - R_n(Z_n - E_nZ_n)\big\|^2$$
$$\le \big(\|\Delta_n - R_n(G_n - G^*)\| + \|R_n\|\,\|E_nZ_n\|\big)^2 + \big\|R_n(Z_n - E_nZ_n)\big\|^2 - 2\big\langle \Delta_n - R_n(G_n - G^*) - R_nE_nZ_n,\ R_n(Z_n - E_nZ_n)\big\rangle.$$

Inequality (6.10) is then obtained by taking conditional expectations, since the conditional expectation of the last scalar product vanishes.


The first term on the right-hand side of inequality (6.10) can be estimated as follows:

Lemma 6.2. For each $n$ we find

$$\big\|\Delta_n - R_n(G_n - G^*)\big\|^2 \le \Big(1 - 2\rho_n\big(\tilde a_n - \beta\|\Delta M_n\|\big) + \rho_n^2\|M_n\|^2\beta^2\Big)\|\Delta_n\|^2. \qquad (6.11)$$

Proof. According to (U4b) we have

$$\big\|\Delta_n - \rho_nM_n(G_n - G^*)\big\|^2 \le \|\Delta_n\|^2 + \rho_n^2\|M_n\|^2\|G_n - G^*\|^2 - 2\rho_n\big\langle \Delta_n, \widehat{M}_n(G_n - G^*)\big\rangle + 2\rho_n\|\Delta_n\|\,\|\Delta M_n\|\,\|G_n - G^*\|.$$

Inequality (6.11) now follows by applying (6.4) and (6.7).
For the sequence of weighted errors $\widetilde\Delta_n = \tau_n(X_n - x^*)$ we now have an upper bound which is used several times in the following:
Lemma 6.3. For arbitrary $\varepsilon > 0$ and $n \in \mathbb{N}$ we have

$$E_n\|\widetilde\Delta_{n+1}\|^2 \le \big(1 - \rho_nt_n + t_n^{(1)}\big)\|\widetilde\Delta_n\|^2 + t_n^{(2)}, \qquad (6.12)$$

where

$$t_n := (1+\gamma_n)^2\Big(1 + \frac{g_n}{\varepsilon}\Big)u_n - \frac{\gamma_n}{\rho_n}(2+\gamma_n),$$
$$t_n^{(1)} := (1+\gamma_n)^2\Big(\frac{g_n}{\varepsilon} + \|\rho_nM_n\|^2\sigma_1^2\Big),$$
$$t_n^{(2)} := (1+\gamma_n)^2\Big(g_n(\varepsilon + g_n) + \|\tau_n\rho_nM_n\|^2\sigma_{0,n}^2\Big),$$

and the following definitions are used:

$$u_n := 2\big(\tilde a_n - \beta\|\Delta M_n\|\big) - \rho_n\|M_n\|^2\beta^2, \qquad g_n := \tau_n\rho_n\|M_n\|\,\|E_nZ_n\|.$$

Proof. Because of the inequality

$$2\tau_n\big\|\Delta_n - R_n(G_n - G^*)\big\| \le \varepsilon + \frac{\tau_n^2\big\|\Delta_n - R_n(G_n - G^*)\big\|^2}{\varepsilon},$$

after multiplication with $\tau_{n+1}^2$, inequality (6.10) yields

$$E_n\|\widetilde\Delta_{n+1}\|^2 \le (1+\gamma_n)^2\Big\{\Big(1 + \frac{\tau_n\|R_n\|\,\|E_nZ_n\|}{\varepsilon}\Big)\big\|\Delta_n - R_n(G_n - G^*)\big\|^2\tau_n^2 + \|\tau_nR_n\|^2\big(\|E_nZ_n\|^2 + E_n\|Z_n - E_nZ_n\|^2\big) + \varepsilon\tau_n\|R_n\|\,\|E_nZ_n\|\Big\}.$$

Taking into account (6.11) and (U5b), we find

$$E_n\|\widetilde\Delta_{n+1}\|^2 \le (1+\gamma_n)^2\Big\{\Big(1 + \frac{g_n}{\varepsilon}\Big)(1 - \rho_nu_n)\|\widetilde\Delta_n\|^2 + g_n(\varepsilon + g_n) + (\tau_n\|R_n\|)^2\big(\sigma_{0,n}^2 + \sigma_1^2\|\Delta_n\|^2\big)\Big\}$$
$$= (1+\gamma_n)^2\Big\{\Big(1 - \rho_n\Big(1 + \frac{g_n}{\varepsilon}\Big)u_n + \frac{g_n}{\varepsilon} + \|R_n\|^2\sigma_1^2\Big)\|\widetilde\Delta_n\|^2 + g_n(\varepsilon + g_n) + \|\tau_nR_n\|^2\sigma_{0,n}^2\Big\}.$$

This now proves (6.12).



6.4.2 Consideration of the Weighted Error Sequence

Suppose that

$$\frac{\rho_n}{s_n} \longrightarrow 1 \text{ a.s.}, \qquad (6.13)$$
$$\|M_n - D_n\| \longrightarrow 0 \text{ a.s.} \qquad (6.14)$$

for given deterministic numbers $s_n$ and matrices $D_n$. The weighted error $\widetilde\Delta_n := \tau_n(X_n - x^*)$ can then be represented as follows:

Lemma 6.4. For each integer $n$:

$$\widetilde\Delta_{n+1} = (I - s_nB_n)\widetilde\Delta_n + \tau_ns_nD_n(E_nZ_n - Z_n) + s_n\Delta B_n\widetilde\Delta_n + \tau_ns_n\Delta D_n(E_nZ_n - Z_n) + \tau_{n+1}\big({-R_n}(E_nZ_n + G^*) + \Delta P_n\big), \qquad (6.15)$$

where the variables in Lemma 6.4 are defined as follows:

$$B_n := D_nH^* - \frac{\gamma_n}{s_n}I, \qquad (6.16)$$
$$\Delta B_n := B_n - \frac{1}{s_n}\big((1+\gamma_n)R_n\widetilde H_n - \gamma_nI\big), \qquad (6.17)$$
$$\Delta D_n := \frac{1}{s_n}(1+\gamma_n)R_n - D_n, \qquad (6.18)$$
$$\Delta P_n := p_D(X_n - R_nY_n) - (X_n - R_nY_n), \qquad (6.19)$$
$$\widetilde H_n := \int_0^1 H\big(x^* + t(X_n - x^*)\big)\,dt. \qquad (6.20)$$

Proof. Because of assumptions (U1a)–(U1c), and using the Taylor formula, we get

$$Y_n = G_n + Z_n = G^* + \widetilde H_n\Delta_n + Z_n.$$

Hence, (6.2a), (6.2b) yields

$$\Delta_{n+1} = \Delta_n - R_nY_n + \Delta P_n = (I - R_n\widetilde H_n)\Delta_n + R_n(E_nZ_n - Z_n) - R_n(E_nZ_n + G^*) + \Delta P_n.$$

Multiplying then with $\tau_{n+1}$, (6.15) follows, provided that the following relations are taken into account:

$$(1+\gamma_n)(I - R_n\widetilde H_n) = I - \big((1+\gamma_n)R_n\widetilde H_n - \gamma_nI\big) = I - s_n(B_n - \Delta B_n), \qquad (1+\gamma_n)R_n = s_n(D_n + \Delta D_n).$$

For the further consideration of (6.15) some additional auxiliary lemmas are needed.

Lemma 6.5. For each $n \in \mathbb{N}$ we have

$$\widetilde\Delta_n = \Delta_n^{(0)} + \Delta_n^{(1)} + \Delta_n^{(2)} \qquad (6.21)$$

with $\mathcal{A}_n$-measurable random variables $\Delta_n^{(i)}$, $i = 0, 1, 2$, where

$$\Delta_{n+1}^{(0)} = (I - s_nB_n)\Delta_n^{(0)} + (\tau_ns_n)D_n(E_nZ_n - Z_n), \qquad (6.22)$$
$$\|\Delta_{n+1}^{(1)}\| \le \|I - s_nB_n\|\,\|\Delta_n^{(1)}\| + s_n\|\Delta B_n\|\,\|\widetilde\Delta_n\| + \tau_{n+1}\Big(\rho_n\|M_n\|\big(\|E_nZ_n\| + \|G^*\|\big) + \|\Delta P_n\|\Big), \qquad (6.23)$$
$$E_n\|\Delta_{n+1}^{(2)}\|^2 \le \|I - s_nB_n\|^2\|\Delta_n^{(2)}\|^2 + (\tau_ns_n)^2\|\Delta D_n\|^2\big(\sigma_{0,n}^2 + \sigma_1^2\|\Delta_n\|^2\big). \qquad (6.24)$$

Proof. For $i = 0, 1, 2$, select arbitrary $\mathcal{A}_1$-measurable random variables $\Delta_1^{(i)}$ such that

$$\widetilde\Delta_1 = \Delta_1^{(0)} + \Delta_1^{(1)} + \Delta_1^{(2)}.$$

For $n \ge 1$, define $\Delta_{n+1}^{(0)}$ then by (6.22) and put

$$\Delta_{n+1}^{(1)} := (I - s_nB_n)\Delta_n^{(1)} + s_n\Delta B_n\widetilde\Delta_n + \tau_{n+1}\big({-R_n}(E_nZ_n + G^*) + \Delta P_n\big), \qquad (i)$$
$$\Delta_{n+1}^{(2)} := (I - s_nB_n)\Delta_n^{(2)} + \tau_ns_n\Delta D_n(E_nZ_n - Z_n). \qquad (ii)$$

Relation (6.21) then holds because of Lemma 6.4, and (6.23) follows directly from (i) by taking norms. Moreover, (ii) yields

$$\|\Delta_{n+1}^{(2)}\|^2 \le \|I - s_nB_n\|^2\|\Delta_n^{(2)}\|^2 + (\tau_ns_n)^2\|\Delta D_n\|^2\|E_nZ_n - Z_n\|^2 + 2\tau_ns_n\big\langle (I - s_nB_n)\Delta_n^{(2)},\ \Delta D_n(E_nZ_n - Z_n)\big\rangle.$$

The random variables $\Delta_n^{(2)}$ and $\Delta D_n$ are $\mathcal{A}_n$-measurable. Thus, taking conditional expectations in the above inequality, by means of (U5b) we then get inequality (6.24).
Suppose now that the matrices $D_n$ from (6.14) fulfill the following relations:

$$D_nH^* + (D_nH^*)^T \ge 2a_nI, \qquad (6.25)$$
$$\|D_nH^*\| \le b_n, \qquad (6.26)$$

with certain numbers $a_n, b_n \in \mathbb{R}$. The matrices $B_n$ from (6.16) then fulfill the following inequalities:
6.4 Preliminary Results 187

Lemma 6.6. For each $n \in \mathbb{N}$:

$$B_n + B_n^T \ge 2\Big(a_n - \frac{\gamma_n}{s_n}\Big)I =: 2\hat a_nI,$$
$$\|B_n\|^2 \le \Big(\frac{\gamma_n}{s_n}\Big)^2 - 2\frac{\gamma_n}{s_n}a_n + b_n^2 =: \hat b_n^2,$$
$$\|I - s_nB_n\|^2 \le 1 - 2s_n\Big(\hat a_n - \frac{1}{2}s_n\hat b_n^2\Big) =: 1 - 2s_nu_n,$$
$$\|I - s_nB_n\| \le 1 - s_nu_n.$$

Proof. The assertions follow by simple calculations from the relations (6.16), (6.25) and (6.26), where:

$$\sqrt{1 - 2t} \le 1 - t \quad\text{for } t \le \frac{1}{2}.$$

Some useful properties of the random variables $\Delta B_n$, $\Delta D_n$ and $\Delta P_n$ arising in (6.17), (6.18) and (6.19) are contained in the next lemma:

Lemma 6.7. Assuming that

$$\gamma_n \longrightarrow 0 \quad\text{and}\quad X_n \longrightarrow x^* \in \mathring D \text{ a.s.},$$

we have

(a) $\Delta B_n, \Delta D_n \longrightarrow 0$ a.s.
(b) $G^* = 0$
(c) For each $\omega \in \Omega$, with the exception of a null set, there is an index $n_0(\omega)$ such that $\Delta P_n(\omega) = 0$ for every $n \ge n_0(\omega)$.

Proof. Because of $R_n = \rho_nM_n$, (6.13), (6.14), (6.18) and (U4a) we have

$$\Delta D_n = (1+\gamma_n)\frac{\rho_n}{s_n}M_n - D_n \longrightarrow 0 \text{ a.s.}$$

Furthermore, $X_n \to x^*$ a.s. and (U1a)–(U1c) yield

$$\widetilde H_n := \int_0^1 H\big(x^* + t(X_n - x^*)\big)\,dt \longrightarrow H^* \text{ a.s.}$$

Hence, because of (6.13), (6.14), (6.16) and (6.17) we find

$$\Delta B_n = D_nH^* - (1+\gamma_n)\frac{\rho_n}{s_n}M_n\widetilde H_n \longrightarrow 0 \text{ a.s.}$$

Since $x^* \in \mathring D$, with (U2) we get $G^* = 0$.

Consider now an element $\omega \in \Omega$ with $X_n(\omega) \longrightarrow x^*$. Because of $x^* \in \mathring D$ there is a number $n_0(\omega)$ such that $X_{n+1}(\omega) \in \mathring D$ for all $n \ge n_0(\omega)$. Hence, for $n \ge n_0(\omega)$ we find $p_D\big(X_n(\omega) - R_n(\omega)Y_n(\omega)\big) =: X_{n+1}(\omega) \in \mathring D$, and therefore

$$p_D\big(X_n(\omega) - R_n(\omega)Y_n(\omega)\big) = X_n(\omega) - R_n(\omega)Y_n(\omega),$$

hence, $\Delta P_n(\omega) = 0$.

6.4.3 Further Preliminary Results

Sufficient conditions for $X_{n+1} - X_n \to 0$ a.s. are given in the following:

Lemma 6.8.
(a) $\|X_{n+1} - X_n\| \le \rho_n\|M_n\|\,\|Y_n\|$ for $n = 1, 2, \ldots$.
(b) $\|Y_n\|^2 \le 4\big(\|G^*\|^2 + \|G_n - G^*\|^2 + \|Z_n - E_nZ_n\|^2 + \|E_nZ_n\|^2\big)$ for $n = 1, 2, \ldots$.
(c) For each deterministic sequence of positive numbers $(s_n)$ such that $\sum_{n=1}^{\infty} s_n^2 < \infty$ we have $\sum_{n=1}^{\infty} s_n^2\|Y_n\|^2 < \infty$ a.s.
(d) Let $(s_n)$ denote a sequence according to (c), and assume that $(\rho_n)$ fulfills $\limsup_{n\to\infty} \frac{\rho_n}{s_n} \le 1$ a.s. Then $\sum_{n=1}^{\infty}\|X_{n+1} - X_n\|^2 < \infty$ a.s., and $(X_{n+1} - X_n)$ converges to zero a.s.

Proof. For each $n \in \mathbb{N}$, according to (6.2a), (6.2b) we find

$$\|X_{n+1} - X_n\| = \big\|p_D(X_n - R_nY_n) - p_D(X_n)\big\| \le \|R_nY_n\| \le \rho_n\|M_n\|\,\|Y_n\|, \qquad (i)$$

$$\|Y_n\|^2 = 16\,\Big\|\frac{1}{4}\big(G^* + (G_n - G^*) + (Z_n - E_nZ_n) + E_nZ_n\big)\Big\|^2 \le 4\big(\|G^*\|^2 + \|G_n - G^*\|^2 + \|Z_n - E_nZ_n\|^2 + \|E_nZ_n\|^2\big). \qquad (ii)$$

Due to assumption (a) in Sect. 6.3, inequality (6.4) and (U5a)–(U5c), for each $n \in \mathbb{N}$ we find

$$\|G_n - G^*\|^2 \le \beta^2\|X_n - x^*\|^2 \le \beta^2\delta_0^2,$$
$$\|E_nZ_n\|^2 \longrightarrow 0 \text{ a.s.},$$
$$E\|Z_n - E_nZ_n\|^2 = E\big(E_n\|Z_n - E_nZ_n\|^2\big) \le E\big(\sigma_{0,n}^2 + \sigma_1^2\|X_n - x^*\|^2\big) \le \sigma_0^2 + \sigma_1^2\delta_0^2.$$

Hence, for a sequence $(s_n)$ according to (c) we get

$$\sum_n s_n^2\big(\|G^*\|^2 + \|G_n - G^*\|^2 + \|E_nZ_n\|^2\big) < \infty \text{ a.s.} \qquad (iii)$$

and $\sum_n s_n^2E\|Z_n - E_nZ_n\|^2 < \infty$. This then yields

$$\sum_n s_n^2\|Z_n - E_nZ_n\|^2 < \infty \text{ a.s.} \qquad (iv)$$

Now relations (ii), (iii) and (iv) imply

$$\sum_n s_n^2\|Y_n\|^2 < \infty \text{ a.s.} \qquad (v)$$

Because of (U4a)–(U4d) the sequence $\big(\|M_n\|\big)$ is bounded a.s. Hence, under the assumptions in (d), relation (v) yields

$$\sum_n \rho_n^2\|M_n\|^2\|Y_n\|^2 < \infty \text{ a.s.}$$

Thus, relation (i) now implies the assertion in (d).

Conditions for the selection of the sequence $(\tilde a_n)$ in inequality (6.7) are given in the next lemma:

Lemma 6.9. Suppose that

(i) $X_n \longrightarrow x^*$ a.s.
(ii) $\lim_{n\in \mathbb{N}^{(k)}} \widehat{M}_n = D^{(k)}$ a.s. with deterministic matrices $D^{(k)}$ for $k = 1, \ldots, \kappa$.

Then $(\tilde a_n)$ can be selected such that

$$\tilde a^{(k)} := \liminf_{n\in \mathbb{N}^{(k)}} \tilde a_n \ge \frac{1}{2}\lambda_{\min}\Big(D^{(k)}H(x^*) + \big(D^{(k)}H(x^*)\big)^T\Big) \text{ a.s.}, \quad k = 1, \ldots, \kappa.$$

Proof. Consider an arbitrary $k \in \{1, \ldots, \kappa\}$, put

$$H^* := H(x^*) \quad\text{and}\quad b := \frac{1}{2}\lambda_{\min}\big(D^{(k)}H^* + (D^{(k)}H^*)^T\big),$$

and for $n \in \mathbb{N}^{(k)}$ define

$$\tilde a_n := \begin{cases} \dfrac{\big\langle \Delta_n, \widehat{M}_n\big(G(X_n) - G(x^*)\big)\big\rangle}{\|\Delta_n\|^2}, & \text{for } \Delta_n \ne 0,\\[2mm] b, & \text{for } \Delta_n = 0.\end{cases}$$

According to assumption (U1a), for $\Delta_n \ne 0$ we then find

$$\tilde a_n = \frac{\Delta_n^TD^{(k)}H^*\Delta_n}{\|\Delta_n\|^2} + \frac{\Delta_n^T\Big(\big(\widehat{M}_n - D^{(k)}\big)H^*\Delta_n + \widehat{M}_n\delta(x^*; \Delta_n)\Big)}{\|\Delta_n\|^2}.$$

Applying condition (U1b), we get

$$\tilde a_n \ge b - \Big(\big\|\widehat{M}_n - D^{(k)}\big\|\,\|H^*\| + \big\|\widehat{M}_n\big\|L_0\|\Delta_n\|^{\mu_0}\Big).$$

Finally, due to the assumptions (i), (ii) and (U4a)–(U4d) we find

$$\liminf_{n\in \mathbb{N}^{(k)}} \tilde a_n \ge b \text{ a.s.}$$

6.5 General Convergence Results

In this section the convergence properties of the sequence $(X_n)$ generated by the algorithm (6.2a), (6.2b) are considered in the case of "arbitrary" matrix step sizes of order $\frac{1}{n}$. Especially, theorems are given for the convergence with probability one (a.s.), the rate of convergence, the convergence in the mean and the convergence in distribution. These convergence results are then applied to the algorithms having step sizes as given in Sects. 6.7 and 6.8.

As stated in Sect. 6.3, here the assumptions (U1a)–(U1c) up to (U6) hold, and all further assumptions will be marked by the letter V.

Here, we use the step size $R_n = \rho_nM_n$ with a scalar factor $\rho_n$ such that

$$\lim_{n\to\infty} n\rho_n = 1 \text{ a.s.} \qquad (V1)$$

According to condition (U4a), the sequence $(M_n)$ is bounded a.s. Hence, with probability 1, the step size $R_n$ is of order $\frac{1}{n}$.

6.5.1 Convergence with Probability One

A theorem about the convergence with probability 1 and the convergence rate of the sequence $(X_n)$ is given in the following. Related results can be found in the literature, see, e.g., [107] and [108]. The rate of convergence depends mainly on the positive random variable $\tilde a$ defined in (6.9). Because of (6.7), this random variable can be interpreted as a mean lower bound of the quotients

$$\frac{\big\langle X_n - x^*, \widehat{M}_n\big(G(X_n) - G(x^*)\big)\big\rangle}{\|X_n - x^*\|^2}.$$

Due to the assumptions (U5a)–(U5c), for each number $\lambda \in \big[0, \frac{1}{2}\big)$ we have the relations

$$\sum_n \frac{1}{n^2}\,\sigma_1^2 < \infty, \qquad (6.27)$$
$$\sum_n \frac{1}{n^{2(1-\lambda)}}\,\sigma_{0,n}^2 < \infty \text{ a.s.}, \qquad (6.28)$$
$$\sum_n \frac{1}{n^{1-\lambda}}\,\|E_nZ_n\| < \infty \text{ a.s.} \qquad (6.29)$$

Theorem 6.1. If $\lambda$ is selected such that $0 \le \lambda < \min\big\{\frac{1}{2}, \tilde a\big\}$ a.s., then

$$\lim_{n\to\infty} n^{\lambda}(X_n - x^*) = 0 \text{ a.s.}$$

Proof. Because of (V1), with $\tau_n := n^{\lambda}$ we find

$$\lim_{n\to\infty} \tau_n\rho_nn^{1-\lambda} = 1 \text{ a.s.} \quad\text{and}\quad \tau_{n+1} = (1+\gamma_n)\tau_n \text{ with } n\gamma_n \longrightarrow \lambda.$$

Hence, due to (V1), (6.27), (6.28), (6.29) and (U4a) we get

$$\sum_n \|\rho_nM_n\|^2\sigma_1^2 < \infty \text{ a.s.}, \qquad \sum_n \|\tau_n\rho_nM_n\|^2\sigma_{0,n}^2 < \infty \text{ a.s.}, \qquad \sum_n \tau_n\rho_n\|M_n\|\,\|E_nZ_n\| < \infty \text{ a.s.}$$

For the random variables $t_n$, $t_n^{(1)}$ and $t_n^{(2)}$ occurring in Lemma 6.3 we therefore have

$$\sum_n t_n^{(i)} < \infty \text{ a.s. for } i = 1, 2, \qquad (i)$$
$$t_n - 2(\tilde a_n - \lambda) \longrightarrow 0 \text{ a.s.}$$

This yields $t^{(k)} := \liminf_{n\in \mathbb{N}^{(k)}} t_n = 2(\tilde a^{(k)} - \lambda)$ a.s. for $k = 1, \ldots, \kappa$. According to the assumptions in this theorem and because of (6.9) we obtain

$$\bar t := q_1t^{(1)} + \ldots + q_\kappa t^{(\kappa)} = 2(\tilde a - \lambda) > 0 \text{ a.s.} \qquad (ii)$$

Because of (V1), (i) and (ii), the assertion now follows from Lemma 6.3 and Lemma B.3 in the Appendix.

If we replace in Theorem 6.1 the bound $\tilde a$ by the number $a$ from (U4d), then a weaker result is obtained:

Corollary 6.1. If $0 \le \lambda < \min\big\{\frac{1}{2}, a\big\}$, then $\lim_{n\to\infty} n^{\lambda}(X_n - x^*) = 0$ a.s.

Proof. According to (6.9) we have $\tilde a \ge a$. The assertion now follows from Theorem 6.1.

Since $a > 0$, in Corollary 6.1 we may also select $\lambda = 0$. Hence, we always have

$$X_n \longrightarrow x^* \text{ a.s.} \qquad (6.30)$$

6.5.2 Convergence in the Mean

We now consider the convergence rate of the mean square error $E\|X_n - x^*\|^2$. For the estimation of this error the following lemma is needed:

Lemma 6.10. If $n \in \mathbb{N}$, then

$$E_n\big[(n+1)\|\Delta_{n+1}\|^2\big] \le \Big(1 - \frac{1}{n}s_n^{(1)}\Big)n\|\Delta_n\|^2 + \frac{1}{n}\Big(\big(n\rho_n\|M_n\|\big)^2\sigma_{0,n}^2 + s_n^{(2)}\Big) \qquad (6.31)$$

with $\mathcal{A}_n$-measurable random variables $s_n^{(1)}$ and $s_n^{(2)}$, where

$$s_n^{(1)} - (2\tilde a_n - 1) \longrightarrow 0 \text{ a.s.}, \qquad (6.32)$$
$$s_n^{(2)} \longrightarrow 0 \text{ a.s.} \qquad (6.33)$$

Proof. Putting $\tau_n := \sqrt{n}$, we find

$$\tau_{n+1} = (1+\gamma_n)\tau_n \quad\text{with } n\gamma_n \longrightarrow \frac{1}{2}.$$

According to Lemma 6.3 we select

$$s_n^{(1)} := n\big(\rho_nt_n - t_n^{(1)}\big), \qquad s_n^{(2)} := n\,t_n^{(2)} - \big(n\rho_n\|M_n\|\big)^2\sigma_{0,n}^2.$$

Then (U4a)–(U4d), (U5a)–(U5c) and (V1) yield

$$ng_n = \sqrt{n}\,(n\rho_n)\|M_n\|\,\|E_nZ_n\| \longrightarrow 0 \text{ a.s.}, \qquad n\|\rho_nM_n\|^2\sigma_1^2 \longrightarrow 0 \text{ a.s.}$$

Consequently,

$$t_n - (2\tilde a_n - 1) \longrightarrow 0 \text{ a.s.},$$
$$n\,t_n^{(1)} \longrightarrow 0 \text{ a.s.},$$
$$n\,t_n^{(2)} - \big(n\rho_n\|M_n\|\big)^2\sigma_{0,n}^2 \longrightarrow 0 \text{ a.s.}$$

Thus, due to (V1) also (6.32) and (6.33) hold.



The following theorem shows that the mean square error $E\|X_n - x^*\|^2$ is essentially of order $\frac{1}{n}$, and an upper bound is given for the limit of the sequence $\big(n\,E\|X_n - x^*\|^2\big)$. This bound can be found in the literature in the special case of scalar step sizes and the trivial partition $\mathbb{N} = \mathbb{N}^{(1)}$ (cf. condition (i) from Sect. 6.3), see [26], [164, page 28] and [152]. Also in this result, an important role is played by the random variable $\tilde a$ defined in (6.9).

Theorem 6.2. Let $\tilde a > \frac{1}{2}$ a.s. Then for each $\varepsilon_0 > 0$ there is a set $\Omega_0 \in \mathcal{A}$ such that

(a) $P(\bar\Omega_0) < \varepsilon_0$,

(b) $\displaystyle \limsup_{n\in \mathbb{N}} E_{\Omega_0}\,n\|\Delta_n\|^2 \le \frac{\sum_{k=1}^{\kappa} q_k\big(\tilde b^{(k)}\sigma_k\big)^2}{2\tilde a - 1}. \qquad (6.34)$

Here, $\tilde b^{(1)}, \ldots, \tilde b^{(\kappa)}$ and $\sigma_1, \ldots, \sigma_\kappa$ are constants such that

$$\limsup_{n\in \mathbb{N}^{(k)}} \big\|\widehat{M}_n\big\| \le \tilde b^{(k)} \text{ a.s.} \quad\text{and}\quad \sigma_k^2 = \limsup_{n\in \mathbb{N}^{(k)}} E\,\sigma_{0,n}^2$$

for each $k \in \{1, \ldots, \kappa\}$.

Proof. Because of Lemma 6.10 and the above assumptions, the assertions of this theorem follow directly from Lemma B.4 in the Appendix, provided there we set:

$$T_n := n\|\Delta_n\|^2, \quad A_n := s_n^{(1)}, \quad B_n := \big(n\rho_n\|M_n\|\big)^2, \quad v_n := \sigma_{0,n}^2, \quad C_n := s_n^{(2)}.$$

The following corollary is needed later on:

Corollary 6.2. If $\tilde a > \frac{1}{2}$ a.s., then for each $\varepsilon_0 > 0$ there is a set $\Omega_0 \in \mathcal{A}$ such that

(a) $P(\bar\Omega_0) < \varepsilon_0$

(b) $\displaystyle \limsup_{n\in \mathbb{N}} E_{\Omega_0}\,n\|\Delta_n\|^2 \le \frac{(\tilde b\sigma)^2}{2\tilde a - 1}$

provided the numbers $\tilde b$ and $\sigma$ are selected such that

$$\limsup_{n\in \mathbb{N}} \big\|\widehat{M}_n\big\| \le \tilde b \text{ a.s.} \quad\text{and}\quad \sigma^2 = \limsup_{n\in \mathbb{N}} E\sigma_{0,n}^2.$$

A convergence result in the case of scalar step sizes is contained in this corollary:

Corollary 6.3. Select numbers $d^{(1)}, \ldots, d^{(\kappa)}$ such that

$$\lim_{n\in \mathbb{N}^{(k)}} \widehat{M}_n = d^{(k)}I \text{ a.s. for } k = 1, \ldots, \kappa, \qquad (6.35)$$
$$2d\alpha^* > 1, \qquad (6.36)$$

where $d := q_1d^{(1)} + \ldots + q_\kappa d^{(\kappa)}$ and $\alpha^* := \frac{1}{2}\lambda_{\min}(H^* + H^{*T})$. Then for each $\varepsilon_0 > 0$ there is a set $\Omega_0 \in \mathcal{A}$ such that

(a) $P(\bar\Omega_0) < \varepsilon_0$

(b) $\displaystyle \limsup_{n\in \mathbb{N}} E_{\Omega_0}\,n\|\Delta_n\|^2 \le \frac{\sum_{k=1}^{\kappa} q_k\big(d^{(k)}\sigma_k\big)^2}{2d\alpha^* - 1} \qquad (6.37)$

where again $\sigma_k^2 = \limsup_{n\in \mathbb{N}^{(k)}} E\sigma_{0,n}^2$ for each $k \in \{1, \ldots, \kappa\}$.

Proof. Because of Corollary 6.1 and (6.35) we may apply Lemma 6.9. Hence, we may assume that the sequence $(\tilde a_n)$ in (6.7) fulfills

$$\tilde a^{(k)} := \liminf_{n\in \mathbb{N}^{(k)}} \tilde a_n \ge d^{(k)}\alpha^* \text{ for } k = 1, \ldots, \kappa \text{ a.s.}$$

Due to (6.9) and (6.36) we therefore get the inequality $\tilde a \ge d\alpha^* > \frac{1}{2}$ a.s. The assertion of the above Corollary 6.3 now follows from Theorem 6.2.

For given positive numbers $\alpha^*, q_1, \ldots, q_\kappa, \sigma_1, \ldots, \sigma_\kappa$, the right-hand side of (6.37) is minimal for

$$d^{(k)} = \bigg(\sigma_k^2\,\alpha^*\Big(\frac{q_1}{\sigma_1^2} + \ldots + \frac{q_\kappa}{\sigma_\kappa^2}\Big)\bigg)^{-1} \quad\text{for } k = 1, \ldots, \kappa. \qquad (6.38)$$

Using these values, Corollary 6.3 yields

$$d = \frac{1}{\alpha^*}, \qquad (6.39)$$
$$\limsup_{n\in \mathbb{N}} E_{\Omega_0}\,n\|\Delta_n\|^2 \le \bigg(\alpha^{*2}\Big(\frac{q_1}{\sigma_1^2} + \ldots + \frac{q_\kappa}{\sigma_\kappa^2}\Big)\bigg)^{-1}. \qquad (6.40)$$

For scalar step sizes this now yields some guidelines for an asymptotically optimal selection of the factors $M_n$ in the step sizes $R_n = \rho_nM_n$.

In the special case

$$M_n \longrightarrow d\cdot I \text{ a.s.},$$

the most favorable selection reads $d = \alpha^{*-1}$, which then yields the estimate

$$\limsup_{n} E_{\Omega_0}\,n\|\Delta_n\|^2 \le \frac{\sum_{k=1}^{\kappa} q_k\sigma_k^2}{\alpha^{*2}}. \qquad (6.41)$$

The right-hand side of (6.40), (6.41), resp., contains, divided by $\alpha^{*2}$, the $(q_1, \ldots, q_\kappa)$-weighted harmonic resp. arithmetic mean of the variance limits $\sigma_1^2, \ldots, \sigma_\kappa^2$ of the estimation error sequence $(Z_n)$. Hence, the bound in (6.40) is smaller than that in (6.41).
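The comparison of the bounds (6.40) and (6.41) — weighted harmonic versus weighted arithmetic mean of the variance limits — can be checked numerically. The data below ($q_k$, $\sigma_k^2$, $\alpha^*$) are illustrative values, not taken from the text:

```python
def bound_variance_adapted(alpha_star, q, sigma2):
    """Bound (6.40): inverse of alpha*^2 times the sum q_k / sigma_k^2,
    i.e. the weighted harmonic mean of the variance limits over alpha*^2."""
    return 1.0 / (alpha_star**2 * sum(qk / s2 for qk, s2 in zip(q, sigma2)))

def bound_uniform(alpha_star, q, sigma2):
    """Bound (6.41): weighted arithmetic mean of the variance limits
    divided by alpha*^2."""
    return sum(qk * s2 for qk, s2 in zip(q, sigma2)) / alpha_star**2

# Illustrative data: two parts with q = (0.5, 0.5), variance limits (1, 9).
q, sigma2, alpha_star = (0.5, 0.5), (1.0, 9.0), 2.0
b_adapted = bound_variance_adapted(alpha_star, q, sigma2)  # harmonic-mean bound
b_uniform = bound_uniform(alpha_star, q, sigma2)           # arithmetic-mean bound
```

Since the weighted harmonic mean never exceeds the weighted arithmetic mean, the variance-adapted choice (6.38) always gives the smaller asymptotic bound, with a large gap when the $\sigma_k^2$ differ strongly.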

6.5.3 Convergence in Distribution



In this section it is shown that the weighted error $\sqrt{n}(X_n - x^*)$ converges in distribution if further assumptions on the matrices $M_n$ in the step size $R_n = \rho_nM_n$ and on the estimation error $Z_n$ are made.

Hence, choose the following weighting factors:

$$\tau_n = \sqrt{n}. \qquad (6.42)$$

Now $\tau_{n+1} = (1+\gamma_n)\tau_n$ holds again, where

$$\gamma_n = \frac{\lambda_n}{n}, \quad \lambda_n \longrightarrow \frac{1}{2}. \qquad (6.43)$$

Furthermore, according to Corollary 6.1 it holds without further assumptions that

$$X_n \longrightarrow x^* \text{ a.s.} \qquad (6.44)$$

Now we demand that the matrices $M_n$ fulfill certain limit conditions and, due to condition (U3), that $x^*$ is contained in the interior of the feasible set:

(a) For all $k \in \{1, \ldots, \kappa\}$ there exists a deterministic matrix $D^{(k)}$ with

$$\lim_{n\in \mathbb{N}^{(k)}} M_n = D^{(k)} \text{ a.s.} \qquad (V2)$$

For $n \in \mathbb{N}^{(k)}$ let $D_n := D^{(k)}$.

(b) There exist $a^{(1)}, \ldots, a^{(\kappa)} \in \mathbb{R}$ such that

$$a := q_1a^{(1)} + \ldots + q_\kappa a^{(\kappa)} > \frac{1}{2}, \qquad (V3a)$$
$$D^{(k)}H^* + \big(D^{(k)}H^*\big)^T \ge 2a^{(k)}I. \qquad (V3b)$$

(c) For any solution $x^*$ of (P), cf. (6.1), it holds that

$$x^* \in \mathring D. \qquad (V4)$$

Now some preliminary considerations are made which are useful for proving the already mentioned central limit theorem.

According to (6.44) and assumptions (V2) and (V3a), (V3b), as well as Lemma 6.9, for the bound $\tilde a$ in (6.9) we may assume that

$$\tilde a \ge a > \frac{1}{2} \text{ a.s.} \qquad (6.45)$$

Moreover, due to (V1), in Sect. 6.4.2

$$s_n := \frac{1}{n} \quad\text{for } n = 1, 2, \ldots \qquad (6.46)$$

can be chosen, and (6.43) yields

$$\frac{\gamma_n}{s_n} = \lambda_n \longrightarrow \frac{1}{2}. \qquad (6.47)$$

Now, according to these relations and Lemma 6.6, for the matrices

$$B_n := D_nH^* - \frac{\gamma_n}{s_n}I = D_nH^* - \lambda_nI \qquad (6.48)$$

arising in (6.16), the inequalities

$$\|I - s_nB_n\|^2 \le 1 - \frac{2u_n}{n}, \qquad (6.49)$$
$$\|I - s_nB_n\| \le 1 - \frac{u_n}{n} \qquad (6.50)$$

hold, where

$$u_n := (a_n - \lambda_n) - \frac{1}{2n}\big(\lambda_n^2 - 2\lambda_na_n + b_n^2\big), \qquad (6.51)$$

and for $n \in \mathbb{N}^{(k)}$ we put $a_n := a^{(k)}$ and $b_n := b^{(k)} := \|D^{(k)}H^*\|$. Hence, because of (6.47), for any $k = 1, \ldots, \kappa$ the limit

$$\lim_{n\in \mathbb{N}^{(k)}} u_n = u^{(k)} := a^{(k)} - \frac{1}{2} \qquad (6.52)$$

exists. Moreover, using (6.45), we get

$$\bar u := q_1u^{(1)} + \ldots + q_\kappa u^{(\kappa)} = a - \frac{1}{2} > 0. \qquad (6.53)$$

Stochastic Equivalence
 
By a further examination of the sequence $\widetilde\Delta_n = \sqrt{n}(X_n - x^*)$ we find that it can be replaced by a simpler sequence. This is guaranteed by the following result:

Theorem 6.3. $\big(\sqrt{n}(X_n - x^*)\big)$ is stochastically equivalent to the sequence

$$\Delta_{n+1}^{(0)} = \Big(I - \frac{1}{n}B_n\Big)\Delta_n^{(0)} + \frac{1}{\sqrt{n}}\,D_n(E_nZ_n - Z_n), \quad n = 1, 2, \ldots,$$

where $\Delta_1^{(0)}$ denotes an arbitrary $\mathcal{A}_1$-measurable random variable.

Two stochastic sequences are called stochastically equivalent if their difference converges stochastically to zero.

The reduction onto the basic parts of the sequence $\big(\sqrt{n}(X_n - x^*)\big)$ in Theorem 6.3 is already mentioned in a similar form in [40] (Eq. (2.2.10), $\alpha = 1$). But there the matrix $B_n \equiv \Lambda$ is independent of the index $n$.

Because of Lemma 6.5 it is sufficient for the proof of the theorem to show that the sequences $(\Delta_n^{(1)})$ and $(\Delta_n^{(2)})$ in (6.21) converge stochastically to 0.
(1)
Lemma 6.11. (∆n ) converges stochastically to 0.

Proof. According to the upper remarks and Lemma 6.5 it holds


un 1 -n  + Cn )
Tn+1 ≤ (1 − )Tn + (∆Bn ∆ (i)
n n
with the nonnegative random variables

Tn := ∆(1)
n ,
√    
Cn := n n + 1 ρn Mn  En Zn  + G∗  + ∆Pn  .

Furthermore, according to (V1), (V4), (6.44), (U4a)–(U4d), (U5a)–(U5c) and


Lemma 6.7 it holds
∆Bn , Cn −→ 0 a.s. (ii)
Let ε, δ > 0 be arbitrary positive numbers. According to (ii) and Lemma B.5
in the Appendix there exists a set Ω1 ∈ A having
δ
P(Ω̄1 ) < , (iii)
4
∆Bn , Cn −→ 0 uniformly on Ω1 . (iv)

Due to (6.45) and Corollary 6.2 there exists a constant K < ∞ and a set
Ω2 ∈ A having
δ
P(Ω̄2 ) < , (v)
4
-n  ≤ K.
sup EΩ2 ∆ (vi)
n
198 6 Stochastic Approximation Methods with Changing Error Variances

For $\varepsilon_0 := \delta \varepsilon \bar{u}\,[4(K+1)]^{-1} > 0$, with $\bar{u}$ from (6.53), according to (iv) there exists an index $n_0$ such that for any $n \ge n_0$ on $\Omega_0 := \Omega_1 \cap \Omega_2$ we have

$$\|\Delta B_n\|,\ C_n \le \varepsilon_0.$$

Hence, with (i) and (iv), for $n \ge n_0$ it follows that

$$E_{\Omega_0} T_{n+1} \le \Bigl(1 - \frac{u_n}{n}\Bigr) E_{\Omega_0} T_n + \frac{\varepsilon_0 (K+1)}{n},$$

and therefore, because of (6.52), (6.53) and Lemma A.7, Appendix,

$$\limsup_n E_{\Omega_0} T_n \le \frac{\varepsilon_0 (K+1)}{\bar{u}} = \frac{\delta \varepsilon}{4}.$$

Thus, there exists an index $n_1 \ge n_0$ with

$$E_{\Omega_0} T_n \le \frac{\delta \varepsilon}{2} \quad \text{for } n \ge n_1. \qquad \text{(vii)}$$

Finally, from (iii), (v), the Markov inequality and (vii), for indices $n \ge n_1$ it holds that

$$P(T_n > \varepsilon) \le P(\bar{\Omega}_0) + P\bigl(\{T_n > \varepsilon\} \cap \Omega_0\bigr) \le P(\bar{\Omega}_0) + \frac{E_{\Omega_0} T_n}{\varepsilon} < \frac{\delta}{2} + \frac{\delta}{2} = \delta.$$
Now the corresponding conclusion for the sequence $(\Delta_n^{(2)})$ is shown.

Lemma 6.12. $(\Delta_n^{(2)})$ converges stochastically to 0.

Proof. According to the above remarks and Lemma 6.5 we have

$$E_n T_{n+1} \le \Bigl(1 - \frac{2 u_n}{n}\Bigr) T_n + \frac{1}{n}\bigl(\|\Delta D_n\|^2 \sigma_{0,n}^2 + C_n\bigr),$$

where $T_n := \|\Delta_n^{(2)}\|^2$ and $C_n := \|\Delta D_n\|^2 \sigma_1^2 \|\Delta_n\|^2$. Due to Lemma 6.7 it holds that

$$\|\Delta D_n\|,\ C_n \longrightarrow 0 \quad \text{a.s.}$$

Let $\varepsilon, \delta > 0$ be arbitrary but fixed. Here, choosing $B^{(k)} \equiv C^{(k)} \equiv 0$, Lemma B.4, Appendix, can be applied. Hence, there exists a set $\Omega_0 \in \mathcal{A}$ with

$$P(\bar{\Omega}_0) < \frac{\delta}{2} \quad \text{and} \quad \limsup_n E_{\Omega_0} T_n = 0. \qquad \text{(i)}$$
Hence, there exists an index $n_0$ with

$$E_{\Omega_0} T_n \le \frac{\varepsilon \delta}{2} \quad \text{for } n \ge n_0. \qquad \text{(ii)}$$

As in the previous proof, from (i) and (ii) it now follows that

$$P(T_n > \varepsilon) < \delta \quad \text{for all } n \ge n_0.$$


For the sequence $(\Delta_n^{(0)})$ in Theorem 6.3 the following representation can be given:

$$\Delta_{n+1}^{(0)} = B_{0,n} \Delta_1^{(0)} - \sum_{m=1}^{n} \frac{1}{\sqrt{m}}\, B_{m,n} D_m \Delta Z_m \qquad (6.54)$$

for $n = 1, 2, \dots$. Here,

$$B_{n,n} := I, \qquad B_{m,n} := \Bigl(I - \frac{1}{n} B_n\Bigr) \cdots \Bigl(I - \frac{1}{m+1} B_{m+1}\Bigr) \qquad (6.55)$$

is defined for all indices $m < n$. Finally, according to (6.49) and (6.50), for the matrices $B_{m,n}$ it follows that

$$\|B_{m,n}\| \le \prod_{j=m+1}^{n} \Bigl(1 - \frac{u_j}{j}\Bigr) =: b_{m,n}, \qquad (6.56)$$

$$\|B_{m,n}\|^2 \le \prod_{j=m+1}^{n} \Bigl(1 - \frac{2 u_j}{j}\Bigr) =: \bar{b}_{m,n}. \qquad (6.57)$$

Asymptotic Distribution

Assume for the estimation error $Z_n$ the following conditions:

(a) There are symmetric deterministic matrices $W^{(1)}, \dots, W^{(\kappa)}$ such that for any $k = 1, \dots, \kappa$ it holds either

$$W^{(k)} = \lim_{m \in \mathrm{IN}^{(k)}} E(\Delta Z_m \Delta Z_m^T), \qquad \text{(V5a)}$$

$$\lim_{m \to \infty} E\bigl\|E_m(\Delta Z_m \Delta Z_m^T) - E(\Delta Z_m \Delta Z_m^T)\bigr\| = 0 \qquad \text{(V5b)}$$

or

$$W^{(k)} = \lim_{m \in \mathrm{IN}^{(k)}} E_m(\Delta Z_m \Delta Z_m^T) \quad \text{a.s.} \qquad \text{(V6)}$$

(b) It holds either

$$\lim_{m \to \infty} E\bigl(\|\Delta Z_m\|^2 I_{\{\|\Delta Z_m\|^2 > tm\}}\bigr) = 0 \quad \text{for all } t > 0 \qquad \text{(V7)}$$

or

$$\lim_{m \to \infty} E_m\bigl(\|\Delta Z_m\|^2 I_{\{\|\Delta Z_m\|^2 > tm\}}\bigr) = 0 \quad \text{a.s. for all } t > 0. \qquad \text{(V8)}$$
The convergence in distribution of $(\sqrt{n}(X_n - x^*))$ is then guaranteed by the following result:

Theorem 6.4. The sequence $(\sqrt{n}(X_n - x^*))$ is asymptotically $N(0, V)$-distributed. The covariance matrix $V$ is the unique solution of the Lyapunov matrix equation

$$\Bigl(DH^* - \frac{1}{2} I\Bigr) V + V \Bigl(DH^* - \frac{1}{2} I\Bigr)^T = C, \qquad (6.58)$$

where

$$D := q_1 D^{(1)} + \dots + q_\kappa D^{(\kappa)}, \qquad (6.59)$$

$$C := q_1 D^{(1)} W^{(1)} D^{(1)T} + \dots + q_\kappa D^{(\kappa)} W^{(\kappa)} D^{(\kappa)T}, \qquad (6.60)$$

with the matrices $D^{(1)}, \dots, D^{(\kappa)}$ from condition (V2).

Central limit theorems of the above type can be found in many papers on stochastic approximation (cf., e.g., [1,12,40,70,107,108,131,132,161,163]). However, in these convergence theorems it is always assumed that the matrices $M_n$ or the covariance matrices of the estimation error $Z_n$ converge to a single limit $D$ or $W$, respectively. This special case is contained in Theorem 6.4 if the assumptions (V2), (V3a), (V3b) and (V5a), (V5b)/(V6) are fulfilled for $\kappa = 1$ (cf. Corollary 6.4).
Before proving Theorem 6.4, some lemmas are shown first. For the deterministic matrices

$$A_{m,n} := \frac{1}{\sqrt{m}}\, B_{m,n} D_m \quad \text{for } m \le n, \qquad (6.61)$$

arising in the important representation (6.54), the following result holds:

Lemma 6.13. (a) For $m = 1, 2, \dots$ it holds that $\lim_{n \to \infty} \|A_{m,n}\| = 0$.

(b) $\displaystyle\limsup_{n \to \infty} \sum_{m=1}^{n} \|A_{m,n}\|^2 < \infty$.

(c) There exists $n_0 \in \mathrm{IN}$ with $\|A_{m,n}\| \le \dfrac{2L}{\sqrt{m}}$ for $n_0 \le m \le n$, where $L := \sup_n \|D_n\| < \infty$.

(d) For indices $m \in \mathrm{IN}^{(k)}$, let $W_m := W^{(k)}$. Then

$$\lim_{n \to \infty} \sum_{m=1}^{n} A_{m,n} W_m A_{m,n}^T = V$$

with the matrix $V$ according to (6.58).

Proof. Because of (6.56) and (6.57), for $m \le n$ it holds that

$$\|A_{m,n}\| \le \frac{L}{\sqrt{m}}\, b_{m,n}, \qquad \text{(i)}$$

$$\|A_{m,n}\|^2 \le \frac{L^2}{m}\, \bar{b}_{m,n}. \qquad \text{(ii)}$$

Hence, according to (6.52) and (6.53), Lemma A.6, Appendix, can be applied. Therefore

$$\lim_{n \to \infty} b_{m,n} = 0 \quad \text{for } m = 0, 1, 2, \dots, \qquad \text{(iii)}$$

$$\lim_{n \to \infty} \sum_{m=1}^{n} \frac{1}{m}\, \bar{b}_{m,n} = \frac{1}{2\bar{u}} < \infty, \qquad \text{(iv)}$$

and because of Lemma A.5(a), Appendix, for any $0 < \delta < \bar{u}$ we have

$$b_{m,n} \le \varphi_{m,n}(\bar{u} - \delta) \sim \Bigl(\frac{m}{n}\Bigr)^{\bar{u} - \delta} \le 1. \qquad \text{(v)}$$
From (i) to (v), propositions (a), (b) and (c) are obtained easily. The symmetric matrices $V_1 := 0$ and

$$V_{n+1} := \sum_{m=1}^{n} A_{m,n} W_m A_{m,n}^T \quad \text{for } n = 1, 2, \dots \qquad (6.62)$$

by definition fulfill the equation

$$V_{n+1} = \Bigl(I - \frac{1}{n} B_n\Bigr) V_n \Bigl(I - \frac{1}{n} B_n\Bigr)^T + \frac{1}{n} C_n, \quad n \ge 1, \qquad \text{(vi)}$$

where $C_n := D_n W_n D_n^T$. Hence, because of (6.47) and (6.48), for all $k$ the following limits exist:

$$B^{(k)} := \lim_{n \in \mathrm{IN}^{(k)}} B_n = D^{(k)} H^* - \frac{1}{2} I.$$

Furthermore, according to (V3b) it holds that

$$B^{(k)} + B^{(k)T} \ge 2\Bigl(a^{(k)} - \frac{1}{2}\Bigr) I = 2 u^{(k)} I.$$

Therefore, and because of (6.53), Lemma A.9, Appendix, can be applied to equation (vi). Hence, $V_n \longrightarrow V$, where $V$ is the unique solution of

$$BV + VB^T = C.$$

Here, $B := q_1 B^{(1)} + \dots + q_\kappa B^{(\kappa)}$, and $C$ is given by (6.60). Consequently, $B = DH^* - \frac{1}{2} I$, where $D$ is obtained from (6.59), and part (d) is proven.

Lemma 6.14. (a) If (V5a), (V5b) or (V6) holds, then

$$\sum_{m=1}^{n} A_{m,n}\, E_m(\Delta Z_m \Delta Z_m^T)\, A_{m,n}^T \overset{\text{stoch}}{\longrightarrow} V \qquad \text{(KV)}$$

with the matrix $V$ from (6.58).

(b) If (V7) or (V8) is fulfilled, then for each $\varepsilon > 0$

$$\sum_{m=1}^{n} \|A_{m,n}\|^2\, E_m\bigl(\|\Delta Z_m\|^2 I_{\{\|A_{m,n} \Delta Z_m\| > \varepsilon\}}\bigr) \overset{\text{stoch}}{\longrightarrow} 0 \qquad \text{(KL)}$$

holds.

Proof. Let $\tilde{V}_0 := 0$, and for $n = 1, 2, \dots$

$$\tilde{V}_{n+1} := \sum_{m=1}^{n} A_{m,n}\, E_m(\Delta Z_m \Delta Z_m^T)\, A_{m,n}^T.$$

With the matrix $V_{n+1}$ according to (6.62), for each $n$,

$$\|\tilde{V}_{n+1} - V\| \le \|\tilde{V}_{n+1} - V_{n+1}\| + \|V_{n+1} - V\|, \qquad \text{(i)}$$

$$E\|\tilde{V}_{n+1} - V\| \le E\|\tilde{V}_{n+1} - E\tilde{V}_{n+1}\| + \|E\tilde{V}_{n+1} - V_{n+1}\| + \|V_{n+1} - V\| \qquad \text{(ii)}$$

holds. Furthermore,

$$\|\tilde{V}_{n+1} - V_{n+1}\| \le \sum_{m=1}^{n} \|A_{m,n}\|^2\, \bigl\|E_m(\Delta Z_m \Delta Z_m^T) - W_m\bigr\|, \qquad \text{(iii)}$$

$$E\|\tilde{V}_{n+1} - E\tilde{V}_{n+1}\| \le \sum_{m=1}^{n} \|A_{m,n}\|^2\, E\bigl\|E_m(\Delta Z_m \Delta Z_m^T) - E(\Delta Z_m \Delta Z_m^T)\bigr\| \qquad \text{(iv)}$$

and

$$\|E\tilde{V}_{n+1} - V_{n+1}\| \le \sum_{m=1}^{n} \|A_{m,n}\|^2\, \bigl\|E(\Delta Z_m \Delta Z_m^T) - W_m\bigr\|. \qquad \text{(v)}$$

For any real stochastic sequence $(b_n)$ with $b_n \to 0$ a.s., Lemma 6.13(a),(b) and Lemma A.2, Appendix, provide

$$\sum_{m=1}^{n} \|A_{m,n}\|^2\, b_m \longrightarrow 0 \quad \text{a.s.} \qquad \text{(vi)}$$

Besides, according to Lemma 6.13(d) it is

$$\|V_{n+1} - V\| \longrightarrow 0. \qquad \text{(vii)}$$
Hence, under condition (V5a), (V5b), because of (ii), (iv), (v), (vi) and (vii) we get

$$E\|\tilde{V}_{n+1} - V\| \longrightarrow 0.$$

In addition, under (V6), because of (i), (iii), (vi) and (vii) we have

$$\|\tilde{V}_{n+1} - V\| \longrightarrow 0 \quad \text{a.s.}$$

Therefore, (KV) is fulfilled in both cases.

Select an arbitrary $\varepsilon > 0$. Due to Lemma 6.13(c),

$$\|\Delta Z_m\|^2 I_{\{\|A_{m,n} \Delta Z_m\| > \varepsilon\}} \le \|\Delta Z_m\|^2 I_{\bigl\{\frac{2L}{\sqrt{m}} \|\Delta Z_m\| > \varepsilon\bigr\}}$$

holds for all indices $n_0 \le m \le n$. Hence, for $n > n_0$, with $t := \bigl(\frac{\varepsilon}{2L}\bigr)^2 > 0$,

$$U_n := \sum_{m=1}^{n} \|A_{m,n}\|^2\, E_m\bigl(\|\Delta Z_m\|^2 I_{\{\|A_{m,n} \Delta Z_m\| > \varepsilon\}}\bigr) \le U_{n_0} + \sum_{m=n_0+1}^{n} \|A_{m,n}\|^2\, E_m\bigl(\|\Delta Z_m\|^2 I_{\{\|\Delta Z_m\|^2 > tm\}}\bigr). \qquad \text{(viii)}$$

Therefore, from (vi) and (viii) it follows that $EU_n \to 0$ under assumption (V7) and $U_n \to 0$ a.s. under assumption (V8). Hence, in any case (KL) is fulfilled.
Now we are able to prove Theorem 6.4.

Proof (of Theorem 6.4). Let $(\Delta_n^{(0)})$ be defined according to (6.54) with $\Delta_1^{(0)} := 0$. Thus,

$$\Delta_{n+1}^{(0)} = -\sum_{m=1}^{n} A_{m,n} \Delta Z_m \quad \text{for } n = 1, 2, \dots.$$

Due to Lemma 6.14, all conditions of the limit theorem Lemma B.7 in the Appendix are fulfilled for this sequence. Hence, $(\Delta_n^{(0)})$ is asymptotically $N(0, V)$-distributed. According to Theorem 6.3, this then also holds true for the sequence $(\sqrt{n}(X_n - x^*))$.
In the special case that the matrix sequence $(M_n)$ converges a.s. to a unique limit matrix $\Theta$, we get the following corollary:

Corollary 6.4. Let

$$M_n \longrightarrow \Theta \quad \text{a.s.}, \qquad (6.63)$$

where $\Theta$ denotes a deterministic matrix with

$$\Theta H^* + (\Theta H^*)^T > I. \qquad (6.64)$$

Then, $(\sqrt{n}(X_n - x^*))$ is asymptotically $N(0, V)$-distributed. The covariance matrix $V$ is the unique solution of

$$\Bigl(\Theta H^* - \frac{1}{2} I\Bigr) V + V \Bigl(\Theta H^* - \frac{1}{2} I\Bigr)^T = \Theta W \Theta^T, \qquad (6.65)$$

where

$$W := q_1 W^{(1)} + \dots + q_\kappa W^{(\kappa)}. \qquad (6.66)$$

Proof. In the present case, $\Theta^{(k)} = \Theta$ and $a^{(k)} = a$ for $k = 1, \dots, \kappa$, where $a$ fulfills

$$\Theta H^* + (\Theta H^*)^T \ge 2aI.$$

Due to (6.64), we can choose $a > \frac{1}{2}$. Hence, conditions (V2) and (V3) are fulfilled for the matrix sequence $(M_n)$. The assertion is now obtained from Theorem 6.4.

Special limit theorems are obtained for different limit matrices $\Theta$ in Corollary 6.4:

(a) For $\Theta = H^{*-1}$, condition (6.64) is always fulfilled. Moreover, due to (6.65), the matrix $V$ is given by

$$V = H^{*-1} W (H^{*-1})^T.$$

This special case was already discussed for $\kappa = 1$ in [12, 107, 108, 161] (Corollary 9.1).

(b) Let $\Theta = d \cdot I$, where $d$ is chosen to fulfill $d \cdot (H^* + H^{*T}) > I$. Hence, according to (6.65), for $V$ the following formula is obtained:

$$d \cdot (H^* V + V H^{*T}) - V = d^2 W.$$

This equation for the covariance matrix $V$ was discussed, in the case $\kappa = 1$, in [132] (Theorem 5) and in [40].

The matrix $W$ in (6.66) can be interpreted as the "mean limit covariance matrix" of the estimation error sequence $(Z_n)$.

6.6 Realization of Search Directions Yn

The approach used in this section for constructing the search directions $Y_n$ of method (6.2a), (6.2b) allows adaptive estimation of important unknown quantities such as the Jacobian matrix $H(x^*)$ and the limit covariance matrix of the error $Z_n$. These estimators will be used in Sects. 6.7 and 6.8 for the construction of adaptive step sizes.

J.H. Venter (cf. [161]) takes, in the one-dimensional case ($\nu = 1$), at stage $n$ the search direction

$$Y_n := \frac{1}{2}\bigl(Y_n^{(1)} + Y_n^{(2)}\bigr).$$

Here, $Y_n^{(1)}$ and $Y_n^{(2)}$ are given stochastic approximations of $G(X_n + c_n)$ and $G(X_n - c_n)$, and $(c_n)$ is a suitable deterministic sequence converging to zero. The advantage of this approach is that, using the stochastic difference quotient

$$h_n := \frac{1}{2 c_n}\bigl(Y_n^{(1)} - Y_n^{(2)}\bigr),$$

a known approximation of the derivative of $G$ at $X_n$ is obtained. This difference quotient plays a fundamental role in Venter's adaptive step size.
In [12, 107] and [108] this approach for the determination of search directions was generalized to the multi-dimensional case. There,

$$Y_n := \frac{1}{2\nu} \sum_{j=1}^{2\nu} Y_n^{(j)}$$

is taken as search direction, where the vectors $Y_n^{(i)}$ and $Y_n^{(\nu+i)}$ are given stochastic approximations of the vectors $G(X_n + c_n e_i)$ and $G(X_n - c_n e_i)$, and $e_1, \dots, e_\nu$ denote the canonical unit vectors of $\mathrm{IR}^\nu$. Then,

$$\tilde{H}_n := \frac{1}{2 c_n}\bigl(Y_n^{(1)} - Y_n^{(\nu+1)}, \dots, Y_n^{(\nu)} - Y_n^{(2\nu)}\bigr)$$

is a known estimator of the Jacobian $H_n := H(X_n)$ of $G$ at stage $n$.

The main disadvantage of this approach for $Y_n$ is that at each iteration $n$ exactly $2\nu$ realizations $Y_n^{(1)}, \dots, Y_n^{(2\nu)}$ have to be determined. This can be avoided, in the spirit of the Venter approach, if at stage $n$ only two measurements $Y_n^{(1)}$ and $Y_n^{(2)}$ are determined. Hence, only one column vector

$$h_n := \frac{1}{2 c_n}\bigl(Y_n^{(1)} - Y_n^{(2)}\bigr)$$

of the matrix $H(X_n)$ is estimated. Cycling through all columns, after $\nu$ iterations a new estimator of the whole matrix $H(X_n)$ is obtained.
This possibility is contained in the following approach by means of choosing $l_n \equiv 1$ and $\varepsilon_n \equiv 0$:

For indices $n = 1, 2, \dots$, let $l_n \in \mathrm{IN}_0$ and $\varepsilon_n \in \{0, 1\}$ be integers with

$$\varepsilon_n + l_n > 0. \qquad \text{(W1a)}$$

Furthermore, use at each iteration stage $n \in \mathrm{IN}$

$$Y_n := \frac{1}{2 l_n + \varepsilon_n} \sum_{j=1-\varepsilon_n}^{2 l_n} Y_n^{(j)} \qquad \text{(W1b)}$$

as a stochastic approximation of $G_n := G(X_n)$. Here, $Y_n^{(j)}$ denotes a known stochastic estimator of $G$ at $X_n^{(j)}$, that is,

$$Y_n^{(j)} = G_n^{(j)} + Z_n^{(j)}, \qquad (6.67)$$
where $G_n^{(j)} := G(X_n^{(j)})$, and $Z_n^{(j)}$ denotes the estimation error for $j = 0, \dots, 2 l_n$. The arguments $X_n^{(0)}, \dots, X_n^{(2 l_n)}$ are given by

$$X_n^{(0)} := X_n, \qquad X_n^{(i)} := X_n + c_n d_n^{(i)}, \qquad X_n^{(l_n + i)} := X_n - c_n d_n^{(i)} \qquad \text{(W1c)}$$

for $i = 1, \dots, l_n$, where $(c_n)$ is a positive sequence converging to zero. As direction vectors $d_n^{(i)}$, the unit vectors $e_1, e_2, \dots, e_\nu, e_1, \dots$ of $\mathrm{IR}^\nu$ are chosen one after another. Formally, this leads to

$$d_n^{(i)} = e_{l(n,i)} \quad \text{for } i = 1, \dots, l_n, \qquad \text{(W1d)}$$

where the indices $l(n, i)$ are calculated as follows:

$$l(1, 0) := 0, \qquad l(n, i) := l(n, i-1) \bmod \nu + 1 \ \text{ for } i = 1, \dots, l_n, \qquad l(n+1, 0) := l(n, l_n). \qquad \text{(W1e)}$$
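The cyclic rule (W1e) for selecting unit-vector indices can be sketched in a few lines of Python ($l_n \equiv 2$ and $\nu = 3$ below are arbitrary example choices, not prescribed by the text):

```python
def direction_indices(nu, l_seq):
    """Yield for each stage n the list [l(n,1), ..., l(n,l_n)] of unit-vector
    indices chosen cyclically from 1, ..., nu according to (W1e)."""
    prev = 0                      # l(1,0) := 0
    for ln in l_seq:
        idx = []
        for _ in range(ln):
            prev = prev % nu + 1  # l(n,i) := l(n,i-1) mod nu + 1
            idx.append(prev)
        yield idx                 # carry-over: l(n+1,0) := l(n,l_n)

# With nu = 3 and l_n = 2, the columns are cycled as 1,2 | 3,1 | 2,3:
print(list(direction_indices(3, [2, 2, 2])))  # [[1, 2], [3, 1], [2, 3]]
```

The carry-over of the last index to the next stage is what makes the cycling continue seamlessly across iterations, so that every unit vector is revisited at most every $\nu$ stages.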

Using the above choice of $Y_n$, we get

$$Y_n = G_n + Z_n \qquad (6.68)$$

with the estimation error

$$Z_n := \frac{1}{2 l_n + \varepsilon_n} \sum_{j=1-\varepsilon_n}^{2 l_n} \bigl(G_n^{(j)} + Z_n^{(j)}\bigr) - G_n. \qquad (6.69)$$

In order to guarantee that this total error $Z_n$ fulfills conditions (U5a)–(U5c) in Sect. 6.3, let the error terms $Z_n^{(0)}, \dots, Z_n^{(2 l_n)}$ denote $\mathcal{A}_{n+1}$-measurable random variables which are stochastically independent given $\mathcal{A}_n$. Furthermore, for $j = 0, \dots, 2 l_n$ assume that

$$\|E_n Z_n^{(j)}\| \le \frac{L_1}{n^{\mu_1}} \quad \text{with } \mu_1 > \frac{1}{2}, \qquad \text{(W2a)}$$

$$E_n \|Z_n^{(j)} - E_n Z_n^{(j)}\|^2 \le s_{0,n}^2 + s_1^2 \|X_n^{(j)} - x^*\|^2 \qquad \text{(W2b)}$$

with positive constants $L_1, \mu_1$ and $s_1$ and $\mathcal{A}_n$-measurable random variables $s_{0,n}^2 \ge 0$, for which, with an additional constant $s_0$,

$$\sup_n E s_{0,n}^2 \le s_0^2 \qquad \text{(W2c)}$$

is fulfilled. Hence, due to (W2b) and (W1c),

$$E_n \|Z_n^{(j)} - E_n Z_n^{(j)}\|^2 \le t_{0,n}^2 + t_1^2 \|X_n - x^*\|^2, \qquad (6.70)$$

$$\sup_n E t_{0,n}^2 \le s_0^2 + 2 s_1^2 \sup_n c_n^2 =: t_0^2, \qquad (6.71)$$
if we denote

$$t_{0,n}^2 := s_{0,n}^2 + 2 c_n^2 s_1^2, \qquad (6.72)$$

$$t_1^2 := 2 s_1^2. \qquad (6.73)$$

The error $Z_n$ has further properties:

Lemma 6.15. For each $n \in \mathrm{IN}$, $Z_n$ is $\mathcal{A}_{n+1}$-measurable, and the following properties hold:

(a) $E_n Z_n = \dfrac{1}{2 l_n + \varepsilon_n} \displaystyle\sum_{j=1-\varepsilon_n}^{2 l_n} \bigl(G_n^{(j)} + E_n Z_n^{(j)}\bigr) - G_n$,

(b) $E_n \Delta Z_n \Delta Z_n^T = \dfrac{1}{(2 l_n + \varepsilon_n)^2} \displaystyle\sum_{j=1-\varepsilon_n}^{2 l_n} E_n \Delta Z_n^{(j)} \Delta Z_n^{(j)T}$,

(c) $E_n \|\Delta Z_n\|^2 = \dfrac{1}{(2 l_n + \varepsilon_n)^2} \displaystyle\sum_{j=1-\varepsilon_n}^{2 l_n} E_n \|\Delta Z_n^{(j)}\|^2 \le \dfrac{1}{2 l_n + \varepsilon_n}\bigl(t_{0,n}^2 + t_1^2 \|X_n - x^*\|^2\bigr)$,

(d) $\|E_n Z_n\| \le L_0 c_n^{1+\mu_0} + \dfrac{L_1}{n^{\mu_1}}$ with $L_0$ and $\mu_0$ from (U1b).

Here, the following abbreviations are used in Lemma 6.15 for $n = 1, 2, \dots$ and $j = 0, \dots, 2 l_n$:

$$\Delta Z_n := Z_n - E_n Z_n, \qquad \Delta Z_n^{(j)} := Z_n^{(j)} - E_n Z_n^{(j)}. \qquad (6.74)$$

Proof. Formula (a) is a direct consequence of (6.69). Hence, it holds that

$$\Delta Z_n = \frac{1}{2 l_n + \varepsilon_n} \sum_{j=1-\varepsilon_n}^{2 l_n} \Delta Z_n^{(j)},$$

and therefore, due to the conditional independence of $Z_n^{(0)}, \dots, Z_n^{(2 l_n)}$, we get

$$E_n \Delta Z_n \Delta Z_n^T = \frac{1}{(2 l_n + \varepsilon_n)^2} \sum_{j=1-\varepsilon_n}^{2 l_n} E_n \Delta Z_n^{(j)} \Delta Z_n^{(j)T}.$$

Taking the trace of this matrix, we get

$$E_n \|\Delta Z_n\|^2 = E_n\, \mathrm{trace}(\Delta Z_n \Delta Z_n^T) = \mathrm{trace}(E_n \Delta Z_n \Delta Z_n^T) = \frac{1}{(2 l_n + \varepsilon_n)^2} \sum_{j=1-\varepsilon_n}^{2 l_n} E_n \|\Delta Z_n^{(j)}\|^2,$$

and using (6.70), assertion (c) follows. Due to condition (W2a) we have

$$\Bigl\| \frac{1}{2 l_n + \varepsilon_n} \sum_{j=1-\varepsilon_n}^{2 l_n} E_n Z_n^{(j)} \Bigr\| \le \frac{L_1}{n^{\mu_1}},$$

and therefore, by applying (a), we find

$$\|E_n Z_n\| \le \Bigl\| \frac{2}{2 l_n + \varepsilon_n} \sum_{i=1}^{l_n} \Bigl( \frac{1}{2}\bigl(G_n^{(i)} + G_n^{(l_n+i)}\bigr) - G_n \Bigr) \Bigr\| + \frac{L_1}{n^{\mu_1}} \le \frac{2}{2 l_n + \varepsilon_n} \sum_{i=1}^{l_n} \Bigl\| \frac{1}{2}\bigl(G_n^{(i)} + G_n^{(l_n+i)}\bigr) - G_n \Bigr\| + \frac{L_1}{n^{\mu_1}}.$$

Because of conditions (U1a)–(U1c) and (W1a), (W1b), for each $i = 1, \dots, l_n$,

$$\Bigl\| \frac{1}{2}\bigl(G_n^{(i)} + G_n^{(l_n+i)}\bigr) - G_n \Bigr\| = \Bigl\| \frac{1}{2}\bigl(\delta(X_n; c_n d_n^{(i)}) + \delta(X_n; -c_n d_n^{(i)})\bigr) \Bigr\| \le \frac{L_0}{2}\bigl(\|c_n d_n^{(i)}\|^{1+\mu_0} + \|{-c_n d_n^{(i)}}\|^{1+\mu_0}\bigr) = L_0 c_n^{1+\mu_0}$$

holds. Using the previously obtained inequality, assertion (d) follows.
Due to Lemma 6.15(c), (6.70) and (6.71), conditions (U5b) and (U5c) are fulfilled by means of

$$\sigma_{0,n}^2 := \frac{1}{2 l_n + \varepsilon_n}\, t_{0,n}^2, \qquad \sigma_0 := t_0, \qquad \sigma_1 := t_1. \qquad (6.75)$$

For given positive numbers $c_0$ and $\mu$, choose for any $n = 1, 2, \dots$

$$c_n = \frac{c_0}{n^\mu}, \quad \text{where } \mu > \frac{1}{2(1+\mu_0)}, \qquad \text{(W3a)}$$

with $\mu_0$ from conditions (U1a)–(U1c). Due to Lemma 6.15(d) and (W2a) we get

$$\sqrt{n}\, \|E_n Z_n\| \longrightarrow 0 \quad \text{a.s.} \qquad (6.76)$$

Therefore, the estimation error sequence $(Z_n)$ fulfills all conditions (U5a), (U5b).

For the convergence of the covariance matrices of the estimation error sequence $(Z_n)$ (cf. Lemma 6.18), we need the existence of

$$l^{(k)} := \lim_{n \in \mathrm{IN}^{(k)}} l_n < \infty, \qquad \text{(W4a)}$$

$$\varepsilon^{(k)} := \lim_{n \in \mathrm{IN}^{(k)}} \varepsilon_n \in \{0, 1\} \qquad \text{(W4b)}$$
for the sequences $(l_n)$ and $(\varepsilon_n)$ for each $k = 1, \dots, \kappa$. In particular, this leads to

$$\tilde{l} := \sup_{n \in \mathrm{IN}} l_n < \infty, \qquad (6.77)$$

$$\bar{l} := \lim_{n \to \infty} \frac{l_1 + \dots + l_n}{n} = q_1 l^{(1)} + \dots + q_\kappa l^{(\kappa)}. \qquad (6.78)$$

By means of the approach (W1a), (W1b) for the search directions $Y_n$, it is easy to obtain convergent estimators for $G(x^*)$, $H(x^*)$ as well as for the limit covariance matrix of $(Z_n)$.
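As an illustration of how a search direction is assembled from the measurement points of (W1b), (W1c), consider the following sketch (the oracle `G_noisy` and all concrete numbers are hypothetical; with exact evaluations of a linear $G$, the symmetric points average back to $G(X_n)$):

```python
import numpy as np

def search_direction(G_noisy, x, c_n, dirs, eps_n):
    """Assemble Y_n per (W1b)/(W1c): average noisy evaluations at X_n
    (only if eps_n = 1) and at X_n +/- c_n d for every direction d in dirs.
    G_noisy is a hypothetical measurement oracle returning G(point) + error."""
    pts = ([x] if eps_n == 1 else []) \
        + [x + c_n * d for d in dirs] + [x - c_n * d for d in dirs]
    return sum(G_noisy(p) for p in pts) / len(pts)   # 2 l_n + eps_n points

# Noise-free sanity check: for linear G(x) = A x the +/- points cancel exactly.
A = np.array([[1.0, 2.0], [0.0, 1.0]])
x = np.array([0.5, -1.0])
Y = search_direction(lambda p: A @ p, x, 0.1, [np.array([1.0, 0.0])], 1)
assert np.allclose(Y, A @ x)
```

The averaging over $2 l_n + \varepsilon_n$ points is what reduces the error variance by the factor $1/(2 l_n + \varepsilon_n)$ seen in Lemma 6.15(c).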

6.6.1 Estimation of G∗

The vectors

$$G_{n+1}^* := \frac{1}{n}(Y_1 + \dots + Y_n) \quad \text{for } n = 1, 2, \dots \qquad (6.79)$$

can be interpreted as $\mathcal{A}_{n+1}$-measurable estimators of $G^* := G(x^*)$. This is shown in the next result:

Theorem 6.5. If $X_n \to x^*$ a.s., then $G_n^* \to G^*$ a.s.

Proof. For $n = 1, 2, \dots$ we have

$$G_{n+1}^* = \Bigl(1 - \frac{1}{n}\Bigr) G_n^* + \frac{1}{n}\, Y_n. \qquad \text{(i)}$$

Due to (U5a), our assumption and (6.4), it holds that

$$\|E_n Y_n - G_n\| = \|E_n Z_n\| \longrightarrow 0, \qquad G_n \longrightarrow G^* \quad \text{a.s.},$$

and therefore we get

$$E_n Y_n \longrightarrow G^* \quad \text{a.s.} \qquad \text{(ii)}$$

Moreover, due to (U5b), (U5c) and condition (a) in Sect. 6.3, for each $n$,

$$E\|Y_n - E_n Y_n\|^2 = E\|Z_n - E_n Z_n\|^2 \le \sigma_0^2 + (\sigma_1 \delta_0)^2$$

holds, and therefore

$$\sum_n \frac{1}{n^2}\, E_n \|Y_n - E_n Y_n\|^2 < \infty \quad \text{a.s.} \qquad \text{(iii)}$$

Using (i), (ii) and (iii), the proposition follows from Lemma B.6 in the Appendix.
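The estimator (6.79) is exactly the recursion (i) from the proof; a minimal numerical check of this equivalence, with arbitrary synthetic vectors $Y_n$ in place of real measurements:

```python
import numpy as np

def running_mean_update(G_star, Y, n):
    """One step of recursion (i): G*_{n+1} = (1 - 1/n) G*_n + (1/n) Y_n,
    the recursive form of the average (6.79)."""
    return (1.0 - 1.0 / n) * G_star + Y / n

# Sanity check against the direct average of 50 synthetic observations:
rng = np.random.default_rng(0)
Ys = rng.normal(size=(50, 3))
G = np.zeros(3)
for n, Y in enumerate(Ys, start=1):
    G = running_mean_update(G, Y, n)
assert np.allclose(G, Ys.mean(axis=0))
```

Note that with the weight $1/n$, the influence of the (arbitrary) starting value vanishes after the very first step, which is why the recursion reproduces the plain average exactly.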
6.6.2 Update of the Jacobian

For the numbers $l_n$ in definition (W1a)–(W1e) we now demand

$$l_n > 0 \quad \text{for } n = 1, 2, \dots. \qquad \text{(W5)}$$

Furthermore, for the exponent $\mu$ in (W3a),

$$\mu < \frac{1}{2} \qquad \text{(W3b)}$$

is assumed. Due to the following lemma, the "stochastic central difference quotients"

$$h_n^{(i)} := \frac{1}{2 c_n}\bigl(Y_n^{(i)} - Y_n^{(l_n+i)}\bigr) \quad \text{for } i = 1, \dots, l_n, \qquad (6.80)$$

known at iteration stage $n$, are estimators of $H_n \cdot d_n^{(i)}$, where $H_n := H(X_n)$.

Lemma 6.16. For each $n \in \mathrm{IN}$ and $i \in \{1, \dots, l_n\}$, $h_n^{(i)}$ is $\mathcal{A}_{n+1}$-measurable, and it holds that:

(a) $E_n h_n^{(i)} = \dfrac{1}{2 c_n}\bigl(G_n^{(i)} - G_n^{(l_n+i)}\bigr) + \dfrac{1}{2 c_n}\bigl(E_n Z_n^{(i)} - E_n Z_n^{(l_n+i)}\bigr)$,

(b) $E_n \|h_n^{(i)} - E_n h_n^{(i)}\|^2 \le \dfrac{1}{2 c_n^2}\bigl(t_{0,n}^2 + t_1^2 \|X_n - x^*\|^2\bigr)$,

(c) $\|E_n h_n^{(i)} - H_n d_n^{(i)}\| \longrightarrow 0$.

Proof. Part (a) holds because of $E_n Y_n^{(j)} = G_n^{(j)} + E_n Z_n^{(j)}$ for $j = 1, \dots, 2 l_n$. Furthermore, with $\Delta Z_n^{(j)} := Z_n^{(j)} - E_n Z_n^{(j)}$ we have

$$E_n \|h_n^{(i)} - E_n h_n^{(i)}\|^2 = E_n \Bigl\| \frac{1}{2 c_n}\bigl(\Delta Z_n^{(i)} - \Delta Z_n^{(l_n+i)}\bigr) \Bigr\|^2 = \frac{1}{4 c_n^2}\bigl(E_n \|\Delta Z_n^{(i)}\|^2 + E_n \|\Delta Z_n^{(l_n+i)}\|^2\bigr) \le \frac{1}{2 c_n^2}\bigl(t_{0,n}^2 + t_1^2 \|X_n - x^*\|^2\bigr).$$

Here, the conditional independence of the variables $Z_n^{(j)}$ was taken into account, and (6.70) was used. Now, because of (U1a)–(U1c),

$$G(X_n \pm c_n d_n^{(i)}) = G(X_n) \pm c_n H_n d_n^{(i)} + \delta(X_n; \pm c_n d_n^{(i)}),$$

and therefore, due to (a), (W2a), (U1b) and (W3a), we find

$$\|E_n h_n^{(i)} - H_n d_n^{(i)}\| \le \Bigl\| \frac{1}{2 c_n}\bigl(\delta(X_n; c_n d_n^{(i)}) - \delta(X_n; -c_n d_n^{(i)})\bigr) \Bigr\| + \frac{L_1}{c_n n^{\mu_1}} \le \frac{L_0}{2 c_n}\bigl(\|c_n d_n^{(i)}\|^{1+\mu_0} + \|{-c_n d_n^{(i)}}\|^{1+\mu_0}\bigr) + \frac{L_1}{c_n n^{\mu_1}} = L_0 c_n^{\mu_0} + \frac{L_1}{c_0 n^{\mu_1 - \mu}}.$$

Finally, since $c_n \to 0$, $\mu_0 > 0$ and $\mu_1 > \frac{1}{2} > \mu$, proposition (c) follows.
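The central difference quotient (6.80) is a one-line computation; the sketch below checks it on a hypothetical linear $G(x) = Ax$, for which the quotient recovers the column $H d$ exactly (no noise is added here):

```python
import numpy as np

def central_difference_column(Y_plus, Y_minus, c_n):
    """Stochastic central difference quotient (6.80):
    h_n^(i) = (Y_n^(i) - Y_n^(l_n+i)) / (2 c_n), estimating H(X_n) d_n^(i)."""
    return (Y_plus - Y_minus) / (2.0 * c_n)

# For linear G(x) = A x the Jacobian is A, so the quotient is exact:
A = np.array([[3.0, 1.0], [0.0, 2.0]])
G = lambda x: A @ x
x, c, e1 = np.array([1.0, -1.0]), 1e-3, np.array([1.0, 0.0])
h = central_difference_column(G(x + c * e1), G(x - c * e1), c)
assert np.allclose(h, A @ e1)
```

For a nonlinear $G$, the remainder terms $\delta(X_n; \pm c_n d)$ leave a bias of order $c_n^{\mu_0}$, which is exactly the $L_0 c_n^{\mu_0}$ term appearing in the proof above.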
Since for the directions $d_n^{(i)}$ all unit vectors $e_1, \dots, e_\nu$ are used cyclically, the Jacobian $H^* := DG(x^*)$ can easily be estimated by means of the vectors $h_n^{(i)}$.

For this recursive estimate we start with an $\mathcal{A}_1$-measurable $\nu \times \nu$ matrix $H_1^{(0)}$. In an iteration step $n \ge 1$ we calculate, for $i = 1, \dots, l_n$, the $\mathcal{A}_{n+1}$-measurable matrices

$$H_n^{(i)} := H_n^{(i-1)} + \frac{1}{n}\bigl(h_n^{(i)} - H_n^{(i-1)} d_n^{(i)}\bigr)\, d_n^{(i)T} \qquad (6.81a)$$

and put

$$H_{n+1}^{(0)} := H_n^{(l_n)}. \qquad (6.81b)$$

According to (W1d), (6.81a), (6.81b) leads to

$$h_{n,l}^{(i)} := \begin{cases} \Bigl(1 - \dfrac{1}{n}\Bigr) h_{n,l}^{(i-1)} + \dfrac{1}{n}\, h_n^{(i)}, & \text{for } l = l(n,i), \\[1mm] h_{n,l}^{(i-1)}, & \text{for } l \ne l(n,i), \end{cases}$$

if $H_n^{(i)} = \bigl(h_{n,1}^{(i)}, \dots, h_{n,\nu}^{(i)}\bigr)$.

Hence, at the transition from $H_n^{(i-1)}$ to $H_n^{(i)}$, only the $l(n,i)$th column vector is changed (updated) by averaging with the vector $h_n^{(i)}$ of (6.80).

Due to the following theorem, the matrices $H_n^{(0)}$ are known approximations of $H^*$, provided that $X_n$ converges to $x^*$.

Besides $H_n^{(i)}$, this theorem uses further matrices: Let $\bar{H}_1^{(0)}$ denote an arbitrary matrix, and for $n = 1, 2, \dots$ and $i = 1, \dots, l_n$ let

$$\bar{H}_n^{(i)} := \bar{H}_n^{(i-1)} + \frac{1}{n}\bigl(H_n - \bar{H}_n^{(i-1)}\bigr)\, d_n^{(i)} d_n^{(i)T}, \qquad (6.82a)$$

$$\bar{H}_{n+1}^{(0)} := \bar{H}_n^{(l_n)}. \qquad (6.82b)$$
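The cyclic column update (6.81a), (6.81b) can be sketched as follows; every numerical choice here (the linear $G$, noise level, step count, tolerance) is a hypothetical example with $l_n \equiv 1$, not taken from the text:

```python
import numpy as np

rng = np.random.default_rng(1)
A = np.array([[2.0, 1.0], [0.5, 3.0]])   # hypothetical true Jacobian H*
nu = 2
H = np.zeros((nu, nu))                   # starting matrix H_1^(0)
x = np.array([0.3, -0.2])                # iterate kept fixed for illustration
l_idx = 0                                # l(1,0) := 0

for n in range(1, 2001):
    c_n = 1.0 / n ** 0.3                 # (W3a)/(W3b) with c_0 = 1, mu = 0.3
    l_idx = l_idx % nu + 1               # cyclic rule (W1e), one direction/stage
    d = np.eye(nu)[l_idx - 1]
    # noisy measurements of G(x +/- c_n d) with G(x) = A x:
    Yp = A @ (x + c_n * d) + 0.05 * rng.normal(size=nu)
    Ym = A @ (x - c_n * d) + 0.05 * rng.normal(size=nu)
    h = (Yp - Ym) / (2.0 * c_n)          # central difference (6.80)
    H = H + np.outer(h - H @ d, d) / n   # column update (6.81a)

assert np.linalg.norm(H - A) < 0.5       # H_n^(0) approaches H*
```

Because `d` is a unit vector, the outer-product update touches exactly one column per stage, which is the point of the method: two measurements per iteration suffice instead of $2\nu$.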
Theorem 6.6. (a) $(\bar{H}_n^{(i)})$ is bounded, and it holds that $H_n^{(0)} - \bar{H}_n^{(0)} \longrightarrow 0$ a.s.

(b) If $X_n \to x^*$ a.s., then $H_n^{(0)} \to H^*$ a.s.

Proof. For arbitrary fixed $l \in \{1, \dots, \nu\}$, let $(n_1, i_1), (n_2, i_2), \dots$ denote all ordered index pairs $(n, i)$ for which $d_n^{(i)} = e_l$ is chosen. Hence, $1 \le n_1 \le n_2 \le \dots$, and for each $k$ it holds that $1 \le i_k \le l_{n_k}$ as well as $i_k < i_{k+1}$ if $n_k = n_{k+1}$. Moreover, for $n \ge 1$ and $i \le l_n$ we have

$$d_n^{(i)} = e_l \iff (n, i) \in \bigl\{(n_1, i_1), (n_2, i_2), \dots\bigr\}.$$

Due to (6.81a), (6.81b) and (6.82a), (6.82b), for $k = 1, 2, \dots$ we get

$$V_{k+1} = (1 - \alpha_k) V_k + \alpha_k U_k, \qquad \text{(i)}$$

$$\bar{V}_{k+1} = (1 - \alpha_k) \bar{V}_k + \alpha_k \bar{U}_k, \qquad \text{(ii)}$$
where

$$\alpha_k := \frac{1}{n_k}, \qquad U_k := h_{n_k}^{(i_k)}, \qquad V_k := H_{n_k}^{(i_k - 1)} e_l, \qquad \bar{U}_k := H_{n_k} e_l, \qquad \bar{V}_k := \bar{H}_{n_k}^{(i_k - 1)} e_l.$$

Because of (W1e), at the latest after every $\nu$th iteration stage $n$ the direction vector $e_l$ is chosen. Therefore, $n_{k+1} \le n_k + \nu$ and $n_k \le k\nu$ hold for each index $k$. Consequently, we get

$$\sum_k \alpha_k = \infty. \qquad \text{(iii)}$$

Because of (W1e) and (6.77), at each iteration stage $n$ the direction vector $e_l$ is chosen at most $l_0 := \bigl[\frac{\tilde{l}-1}{\nu}\bigr] + 1$ times, and therefore $k \le l_0 n_k$ for $k = 1, 2, \dots$. Hence, using conditions (W3a), (W3b), we get

$$\sum_k \frac{\alpha_k^2}{c_{n_k}^2} = \frac{1}{c_0^2} \sum_k \frac{1}{n_k^{2-2\mu}} \le \frac{l_0^{2-2\mu}}{c_0^2} \sum_k \frac{1}{k^{2-2\mu}} < \infty. \qquad \text{(iv)}$$

Now, for $k \ge 1$, let $\mathcal{B}_k$ denote the $\sigma$-algebra generated by $\mathcal{A}_{n_k}$ and $U_1, \dots, U_{k-1}$. Obviously, it now holds that:

$$\mathcal{B}_1 \subset \mathcal{B}_2 \subset \dots \subset \mathcal{B}_k \subset \mathcal{A}, \qquad \text{(v)}$$

$$\Delta U_k := U_k - \bar{U}_k \ \text{is } \mathcal{B}_{k+1}\text{-measurable}, \qquad \text{(vi)}$$

$$E(\Delta U_k \mid \mathcal{B}_k) = E_{n_k}(\Delta U_k) = E_{n_k} h_{n_k}^{(i_k)} - H_{n_k} e_l, \qquad \text{(vii)}$$

$$E\bigl(\|\Delta U_k - E(\Delta U_k \mid \mathcal{B}_k)\|^2 \mid \mathcal{B}_k\bigr) = E_{n_k} \|h_{n_k}^{(i_k)} - E_{n_k} h_{n_k}^{(i_k)}\|^2. \qquad \text{(viii)}$$
From (iv), (viii) and Lemma 6.16(b) we have

$$\sum_k \alpha_k^2\, E\bigl(\|\Delta U_k - E(\Delta U_k \mid \mathcal{B}_k)\|^2 \mid \mathcal{B}_k\bigr) \le \sum_k \frac{\alpha_k^2}{2 c_{n_k}^2}\bigl(t_{0,n_k}^2 + t_1^2 \|X_{n_k} - x^*\|^2\bigr) < \infty \quad \text{a.s.} \qquad \text{(ix)}$$

From (vii) and Lemma 6.16(c) we obtain

$$E(\Delta U_k \mid \mathcal{B}_k) \longrightarrow 0 \quad \text{a.s.} \qquad \text{(x)}$$

Equations (i) and (ii) yield

$$V_{k+1} - \bar{V}_{k+1} = (1 - \alpha_k)(V_k - \bar{V}_k) + \alpha_k \Delta U_k.$$

Due to (iii), (v), (vi), (ix) and (x), Lemma B.6 in the Appendix can be applied. Hence, it holds that

$$V_{k+1} - \bar{V}_{k+1} = \bigl(H_{n_k}^{(i_k)} - \bar{H}_{n_k}^{(i_k)}\bigr) e_l \longrightarrow 0 \quad \text{a.s.}$$

Since $l \in \{1, \dots, \nu\}$ was chosen arbitrarily, proposition (a) is shown.

In the case $X_n \to x^*$ a.s., according to (U1c) it also holds that $H_n \to H^*$ a.s. Therefore, because of (ii), (iii) and Lemma B.6 (Appendix),

$$\bar{V}_{k+1} = \bar{H}_{n_k}^{(i_k)} e_l \longrightarrow H^* e_l \quad \text{a.s.}$$

Hence, $\bar{H}_n^{(0)} \to H^*$ a.s., and therefore, because of (a), also $H_n^{(0)} \to H^*$ a.s.
The matrices $H_{n+1}^{(0)}$ from (6.81a), (6.81b) can also be approximated by means of special convex combinations of the Jacobians $H_1, \dots, H_n$. This is shown next.

Theorem 6.7. If $X_{n+1} - X_n \to 0$ a.s., then

$$H_n^{(0)} - \tilde{H}_n^{(0)} \longrightarrow 0 \quad \text{a.s.}$$

Here, the $\mathcal{A}_n$-measurable matrices $\tilde{H}_n^{(i)}$ are chosen as follows: Choose $\tilde{H}_1^{(0)} := H_1^{(0)}$, and for $n = 1, 2, \dots$ as well as $i = 1, \dots, l_n$ let

$$\tilde{H}_n^{(i)} := \tilde{H}_n^{(i-1)} + \frac{1}{n\nu}\bigl(H_n - \tilde{H}_n^{(i-1)}\bigr), \qquad (6.83a)$$

$$\tilde{H}_{n+1}^{(0)} := \tilde{H}_n^{(l_n)}. \qquad (6.83b)$$

Proof. According to Theorem 6.6(a), only the following property has to be shown:

$$\bar{H}_n^{(0)} - \tilde{H}_n^{(0)} \longrightarrow 0 \quad \text{a.s.} \qquad (\ast)$$

A simple computation yields, for $n = 1, 2, \dots$ and $1 \le i \le l_n$,

$$\bar{H}_n^{(i)} - \tilde{H}_n^{(i)} = \Bigl(1 - \frac{1}{n\nu}\Bigr)\bigl(\bar{H}_n^{(i-1)} - \tilde{H}_n^{(i-1)}\bigr) + \frac{1}{n\nu}\bigl(H_n - \bar{H}_n^{(i-1)}\bigr)\bigl(\nu\, d_n^{(i)} d_n^{(i)T} - I\bigr). \qquad \text{(i)}$$

Defining now

$$L(n, i) := l_1 + \dots + l_{n-1} + i, \qquad \alpha_{L(n,i)} := \frac{L(n,i)}{n\nu},$$

$$U_{L(n,i)} := \bigl(H_n - \bar{H}_n^{(i-1)}\bigr)\bigl(\nu\, d_n^{(i)} d_n^{(i)T} - I\bigr), \qquad V_{L(n,i)} := \bar{H}_n^{(i)} - \tilde{H}_n^{(i)},$$

(i) can be rewritten as

$$V_L = \Bigl(1 - \frac{\alpha_L}{L}\Bigr) V_{L-1} + \frac{\alpha_L}{L}\, U_L \qquad \text{(ii)}$$

for $L = 1, 2, \dots$. Due to (6.3) and Theorem 6.6(a), we have

$$(U_L) \ \text{bounded a.s.} \qquad \text{(iii)}$$
and according to (6.78) we get

$$\lim_{L \to \infty} \alpha_L = \frac{\bar{l}}{\nu} =: \alpha > 0. \qquad \text{(iv)}$$

Hence, because of (ii), (iii), (iv) and Lemma A.4, Appendix, it holds that

$$(V_L) \ \text{is bounded, and} \ V_L - V_{L-1} \longrightarrow 0 \quad \text{a.s.} \qquad \text{(v)}$$

Because of (ii), for $k = 0, 1, 2, \dots$ it holds that

$$V_{(k+1)\nu} = p_{0,k} V_{k\nu} + \sum_{m=1}^{\nu} p_{m,k}\, \frac{\alpha_{k\nu+m}}{k\nu+m}\, U_{k\nu+m} = \Bigl(1 - \frac{\alpha}{k}\Bigr) V_{k\nu} + \frac{\alpha}{k}\, \bar{U}_k, \qquad \text{(vi)}$$

if we define

$$p_{m,k} := \begin{cases} \displaystyle\prod_{j=m+1}^{\nu} \Bigl(1 - \frac{\alpha_{k\nu+j}}{k\nu+j}\Bigr), & \text{for } m = 0, \dots, \nu-1, \\[1mm] 1, & \text{for } m = \nu, \end{cases}$$

$$\bar{U}_k := \frac{1}{\nu} \sum_{m=1}^{\nu} p_{m,k}\, \frac{\alpha_{k\nu+m}}{\alpha} \cdot \frac{k\nu}{k\nu+m}\, U_{k\nu+m} + \Bigl((p_{0,k} - 1)\frac{k}{\alpha} + 1\Bigr) V_{k\nu}.$$

Now we show that

$$\bar{U}_k \longrightarrow 0 \quad \text{a.s.} \qquad (\ast\ast)$$

Because of (vi) and Lemma A.4, Appendix, it follows from this that $V_{k\nu} \to 0$ a.s. Hence, if $(\ast\ast)$ holds, then, due to (v), also $(\ast)$ holds true.

First of all, from the definition of Ūk , for each k = 0, 1, 2, . . . the following
approximation is obtained:
+ ν ,
1    
ν
   αkν+m kν 
Ūk  ≤  Ukν+m  + Ukν+m  · pm,k · − 1
ν m=1 m=1
α kν + m
 
 k 
+ (p0,k − 1) + 1 · Vkν . (vii)
α

For matrices
TL(n,i) := Hn − H̄n(i−1) ,
due to “Xn+1 − Xn −→ 0 a.s.,” inequality (6.3) and Theorem 6.6(a), we have

TL+1 − TL −→ 0 a.s. (viii)


(i)
Since for an index L = kν + m direction dn = em is chosen, for each k =
0, 1, 2, . . . holds
$$\Bigl\|\sum_{m=1}^{\nu} U_{k\nu+m}\Bigr\| = \Bigl\|\sum_{m=1}^{\nu} \bigl(T_{k\nu} + (T_{k\nu+m} - T_{k\nu})\bigr)\bigl(\nu e_m e_m^T - I\bigr)\Bigr\| = \Bigl\|\sum_{m=1}^{\nu} (T_{k\nu+m} - T_{k\nu})\bigl(\nu e_m e_m^T - I\bigr)\Bigr\| \le \nu \sum_{m=1}^{\nu} \|T_{k\nu+m} - T_{k\nu}\|. \qquad \text{(ix)}$$

Now, let $\varepsilon \in (0, \alpha)$ denote a given arbitrary number. According to (iv) and (viii) there exists an index $k_0 \in \mathrm{IN}$ such that

$$\alpha - \varepsilon \le \alpha_{k\nu+m} \le \alpha + \varepsilon, \qquad \text{(x)}$$

$$\sum_{m=1}^{\nu} \|T_{k\nu+m} - T_{k\nu}\| \le \varepsilon, \qquad \text{(xi)}$$

$$\Bigl| p_{m,k}\, \frac{\alpha_{k\nu+m}}{\alpha} \cdot \frac{k\nu}{k\nu+m} - 1 \Bigr| \le \varepsilon \qquad \text{(xii)}$$

for all $k \ge k_0$ and $m = 1, \dots, \nu$. From (x), for $k \ge k_0$ we find

$$\Bigl(1 - \frac{\alpha+\varepsilon}{k\nu}\Bigr)^{\nu} \le p_{0,k} \le \Bigl(1 - \frac{\alpha-\varepsilon}{(k+1)\nu}\Bigr)^{\nu}.$$

Since for each $t \in [0, 1]$ the inequality $0 \le (1-t)^{\nu} - (1 - \nu t) \le \frac{\nu(\nu-1)}{2}\, t^2$ holds, for indices $k \ge k_0$ we get

$$-\frac{\alpha+\varepsilon}{k} \le p_{0,k} - 1 \le \frac{\alpha-\varepsilon}{k+1}\Bigl(-1 + \frac{\nu-1}{2} \cdot \frac{\alpha-\varepsilon}{(k+1)\nu}\Bigr)$$

or, respectively,

$$1 - \frac{\alpha+\varepsilon}{\alpha} \le (p_{0,k} - 1)\frac{k}{\alpha} + 1 \le 1 - \frac{k}{k+1} \cdot \frac{\alpha-\varepsilon}{\alpha} + \frac{(\alpha-\varepsilon)^2}{2\alpha(k+1)}.$$

Hence, there exists an index $k_1 \ge k_0$ with

$$\Bigl| (p_{0,k} - 1)\frac{k}{\alpha} + 1 \Bigr| \le \frac{2\varepsilon}{\alpha} \quad \text{for } k \ge k_1. \qquad \text{(xiii)}$$

Using now in (vii) the estimates (ix), (xi), (xii) and (xiii), we finally get, for all $k \ge k_1$,

$$\|\bar{U}_k\| \le \frac{1}{\nu}\Bigl(\nu\varepsilon + \varepsilon \sum_{m=1}^{\nu} \|U_{k\nu+m}\|\Bigr) + \frac{2\varepsilon}{\alpha}\|V_{k\nu}\| \le \varepsilon \cdot \Bigl(1 + \sup_L \|U_L\| + \frac{2}{\alpha} \sup_L \|V_L\|\Bigr) \quad \text{a.s.}$$

Since $\varepsilon > 0$ was chosen arbitrarily, relation $(\ast\ast)$ now follows by means of (iii) and (v).
6.6.3 Estimation of Error Variances

The special approach (W1a)–(W1e) for the search directions $Y_n$ also allows estimation of the limit covariance matrices $W^{(1)}, \dots, W^{(\kappa)}$ of the error sequence $(Z_n)$ in procedure (6.2a), (6.2b). Again the following is assumed:

$$l_n > 0 \quad \text{for } n = 1, 2, \dots. \qquad \text{(W5)}$$

Since at stage $n$ the vectors $Y_n^{(1)}, \dots, Y_n^{(2 l_n)}$ are known,

$$\tilde{Z}_n := \frac{1}{2 l_n} \sum_{i=1}^{l_n} \bigl(Y_n^{(i)} - Y_n^{(l_n+i)}\bigr) \qquad (6.84)$$

can also be calculated. These random vectors behave similarly to the estimation error $Z_n$. This can be seen by comparing Lemma 6.15 with the following result:

Lemma 6.17. For each $n \in \mathrm{IN}$ it holds that:

(a) $E_n \tilde{Z}_n = \dfrac{1}{2 l_n} \displaystyle\sum_{i=1}^{l_n} \bigl(G_n^{(i)} - G_n^{(l_n+i)} + E_n Z_n^{(i)} - E_n Z_n^{(l_n+i)}\bigr)$,

(b) $E_n (\tilde{Z}_n - E_n \tilde{Z}_n)(\tilde{Z}_n - E_n \tilde{Z}_n)^T = \dfrac{1}{(2 l_n)^2} \displaystyle\sum_{j=1}^{2 l_n} E_n \bigl(\Delta Z_n^{(j)} \Delta Z_n^{(j)T}\bigr)$,

(c) $E_n \|\tilde{Z}_n - E_n \tilde{Z}_n\|^2 = \dfrac{1}{(2 l_n)^2} \displaystyle\sum_{j=1}^{2 l_n} E_n \|\Delta Z_n^{(j)}\|^2$,

(d) $\|E_n \tilde{Z}_n\| \longrightarrow 0$.

In the following we put $\Delta \tilde{Z}_n := \tilde{Z}_n - E_n \tilde{Z}_n$.

Proof. From the relation $Y_n^{(j)} = G_n^{(j)} + E_n Z_n^{(j)} + \Delta Z_n^{(j)}$ we obtain the representation in (a) and the formula

$$\Delta \tilde{Z}_n = \frac{1}{2 l_n} \sum_{i=1}^{l_n} \bigl(\Delta Z_n^{(i)} - \Delta Z_n^{(l_n+i)}\bigr).$$

Hence, due to the conditional independence of $Z_n^{(1)}, \dots, Z_n^{(2 l_n)}$, we get

$$E_n \Delta \tilde{Z}_n \Delta \tilde{Z}_n^T = \frac{1}{(2 l_n)^2} \sum_{i=1}^{l_n} E_n \bigl(\Delta Z_n^{(i)} - \Delta Z_n^{(l_n+i)}\bigr)\bigl(\Delta Z_n^{(i)} - \Delta Z_n^{(l_n+i)}\bigr)^T = \frac{1}{(2 l_n)^2} \sum_{i=1}^{l_n} \bigl(E_n \Delta Z_n^{(i)} \Delta Z_n^{(i)T} + E_n \Delta Z_n^{(l_n+i)} \Delta Z_n^{(l_n+i)T}\bigr).$$
This is proposition (b), and by taking the trace we also get (c). According to (U1a)–(U1c) and (6.3), for $i = 1, \dots, l_n$ it follows that

$$\|G_n^{(i)} - G_n^{(l_n+i)}\| = \bigl\|2 c_n H_n d_n^{(i)} + \delta(X_n; c_n d_n^{(i)}) - \delta(X_n; -c_n d_n^{(i)})\bigr\| \le 2 c_n \beta + 2 L_0 c_n^{1+\mu_0}.$$

Because of (W2a) we have $\Bigl\| \dfrac{1}{2 l_n} \displaystyle\sum_{i=1}^{l_n} \bigl(E_n Z_n^{(i)} - E_n Z_n^{(l_n+i)}\bigr) \Bigr\| \le \dfrac{L_1}{n^{\mu_1}}$, and therefore, by means of (a),

$$\|E_n \tilde{Z}_n\| \le c_n\bigl(\beta + L_0 c_n^{\mu_0}\bigr) + \frac{L_1}{n^{\mu_1}}.$$

The right-hand side of this inequality converges to 0, since due to (W2a) and (W3a), (W3b) it holds that $c_n \to 0$ and $\mu_1 > 0$.

Now, for the existence of the limit covariance matrices $W^{(1)}, \dots, W^{(\kappa)}$ of $(Z_n)$ required by (V6), we assume:

Under the assumption "$X_n \to x^*$ a.s.," suppose that the limits

$$\lim_{n \in \mathrm{IN}^{(k)}} E_n \Delta Z_n^{(j_n)} \Delta Z_n^{(j_n)T} = V^{(k)} \quad \text{a.s.} \qquad \text{(W6)}$$

exist for each $k = 1, \dots, \kappa$ and for each sequence $(j_n)$ with $j_n \in \{0, \dots, 2 l_n\}$, $n = 1, 2, \dots$. Here, $V^{(1)}, \dots, V^{(\kappa)}$ are deterministic matrices.

Due to the following lemma, the sequence $(Z_n)$ fulfills condition (V6) from Sect. 6.5.3 with

$$W^{(k)} := \frac{1}{2 l^{(k)} + \varepsilon^{(k)}}\, V^{(k)} \quad \text{for } k = 1, \dots, \kappa. \qquad (6.85)$$

Lemma 6.18. If "$X_n \to x^*$ a.s.," then for each $k = 1, \dots, \kappa$

(a) $\displaystyle\lim_{n \in \mathrm{IN}^{(k)}} E_n \Delta Z_n \Delta Z_n^T = \frac{1}{2 l^{(k)} + \varepsilon^{(k)}}\, V^{(k)}$ a.s.,

(b) $\displaystyle\lim_{n \in \mathrm{IN}^{(k)}} E_n \tilde{Z}_n \tilde{Z}_n^T = \frac{1}{2 l^{(k)}}\, V^{(k)}$ a.s.

Proof. Due to conditions (W4a), (W4b) and (W6), with Lemmas 6.15(b) and 6.17(b) we have

$$\lim_{n \in \mathrm{IN}^{(k)}} E_n \Delta Z_n \Delta Z_n^T = \frac{1}{2 l^{(k)} + \varepsilon^{(k)}}\, V^{(k)} \quad \text{a.s.},$$

$$\lim_{n \in \mathrm{IN}^{(k)}} E_n \Delta \tilde{Z}_n \Delta \tilde{Z}_n^T = \frac{1}{2 l^{(k)}}\, V^{(k)} \quad \text{a.s.}$$
By means of

$$E_n \Delta \tilde{Z}_n \Delta \tilde{Z}_n^T = E_n \tilde{Z}_n \tilde{Z}_n^T - (E_n \tilde{Z}_n)(E_n \tilde{Z}_n)^T, \qquad \bigl\|(E_n \tilde{Z}_n)(E_n \tilde{Z}_n)^T\bigr\| = \|E_n \tilde{Z}_n\|^2,$$

from the second equation and Lemma 6.17(d) we obtain proposition (b).

Because of Lemma 6.18, the following averaging method is proposed for the calculation of the matrices $W^{(1)}, \dots, W^{(\kappa)}$:

At step $n \in \mathrm{IN}^{(k)}$ with unique $k \in \{1, \dots, \kappa\}$, let the matrices $W_n^{(1)}, \dots, W_n^{(\kappa)}$ already be calculated. Furthermore, let

$$W_{n+1}^{(k)} := \Bigl(1 - \frac{1}{n}\Bigr) W_n^{(k)} + \frac{1}{n} \cdot \frac{2 l_n}{2 l_n + \varepsilon_n}\, \tilde{Z}_n \tilde{Z}_n^T, \qquad (6.86a)$$

$$W_{n+1}^{(i)} := W_n^{(i)} \quad \text{for } i \ne k,\ i \in \{1, \dots, \kappa\}. \qquad (6.86b)$$

Here, at the beginning, the matrices $W_1^{(1)}, \dots, W_1^{(\kappa)}$ may be chosen arbitrarily. Hence, if at stage $n$ the index $n$ lies in $\mathrm{IN}^{(k)}$ with a $k \in \{1, \dots, \kappa\}$, only the matrix $W_n^{(k)}$ of $W_n^{(1)}, \dots, W_n^{(\kappa)}$ is updated.

For the convergence of the matrix sequences $(W_n^{(i)})_n$, the following assumption on the fourth moments of the sequences $(Z_n^{(j)})$ is needed:

There exist $\mu_2 \in [0, 1)$ and a random variable $L_2 < \infty$ a.s. such that

$$E_n \|\Delta Z_n^{(j)}\|^4 \le L_2 n^{\mu_2} \quad \text{a.s.} \qquad \text{(W7)}$$

for all $n \in \mathrm{IN}$ and $j = 0, \dots, 2 l_n$.


Hence, initially we get:

Lemma 6.19. $\displaystyle\sum_n \frac{1}{n^2}\, E_n \bigl\|\tilde{Z}_n \tilde{Z}_n^T - E_n \tilde{Z}_n \tilde{Z}_n^T\bigr\|^2 < \infty$ a.s.

Proof. For each $n \in \mathrm{IN}$, with the matrix $C_n := \tilde{Z}_n \tilde{Z}_n^T$ we have

$$E_n \|C_n - E_n C_n\|^2 \le E_n\, \mathrm{tr}\bigl((C_n - E_n C_n)^T (C_n - E_n C_n)\bigr) = \mathrm{tr}\bigl(E_n C_n^T C_n - (E_n C_n)^T (E_n C_n)\bigr)$$
$$\le \mathrm{tr}(E_n C_n^T C_n) = E_n\, \mathrm{tr}(C_n^T C_n) = E_n\, \mathrm{tr}\bigl(\|\tilde{Z}_n\|^2 C_n\bigr) = E_n \|\tilde{Z}_n\|^4 = E_n \|\Delta \tilde{Z}_n + E_n \tilde{Z}_n\|^4$$
$$= E_n \Bigl\| \frac{1}{2 l_n} \sum_{i=1}^{l_n} \bigl(\Delta Z_n^{(i)} - \Delta Z_n^{(l_n+i)}\bigr) + E_n \tilde{Z}_n \Bigr\|^4 \le E_n \Bigl( \frac{1}{2 l_n} \sum_{j=1}^{2 l_n} \|\Delta Z_n^{(j)}\| + \|E_n \tilde{Z}_n\| \Bigr)^4 \le 8 \Bigl( \frac{1}{2 l_n} \sum_{j=1}^{2 l_n} E_n \|\Delta Z_n^{(j)}\|^4 + \|E_n \tilde{Z}_n\|^4 \Bigr),$$

where "tr" denotes the trace of a matrix.

Hence, with (W7) and Lemma 6.17(d), the proposition follows.

Finally, we show that the matrices $W_n^{(1)}, \dots, W_n^{(\kappa)}$ are approximations of the limit covariance matrices $W^{(1)}, \dots, W^{(\kappa)}$:

Theorem 6.8. If "$X_n \to x^*$ a.s.," then for any $k = 1, \dots, \kappa$

$$\lim_{n \in \mathrm{IN}^{(k)}} W_n^{(k)} = \frac{1}{2 l^{(k)} + \varepsilon^{(k)}}\, V^{(k)} = \lim_{n \in \mathrm{IN}^{(k)}} E_n \Delta Z_n \Delta Z_n^T \quad \text{a.s.} \qquad (6.87)$$

Proof. For given fixed $k \in \{1, \dots, \kappa\}$, let $\mathrm{IN}^{(k)} = \{n_1, n_2, \dots\}$ with indices $n_1 < n_2 < \dots$. With the abbreviations

$$C_i := \frac{2 l_{n_i}}{2 l_{n_i} + \varepsilon_{n_i}}\, \tilde{Z}_{n_i} \tilde{Z}_{n_i}^T, \qquad T_i := W_{n_i}^{(k)}, \qquad \mathcal{B}_i := \mathcal{A}_{n_i},$$

by definition we have

$$T_{i+1} = \Bigl(1 - \frac{1}{n_i}\Bigr) T_i + \frac{1}{n_i}\, C_i \quad \text{for } i = 1, 2, \dots.$$

Due to Lemma 6.18 and (W4a), (W4b) it holds that

$$\lim_{i \to \infty} E(C_i \mid \mathcal{B}_i) = \frac{1}{2 l^{(k)} + \varepsilon^{(k)}}\, V^{(k)} = \lim_{n \in \mathrm{IN}^{(k)}} E_n \Delta Z_n \Delta Z_n^T \quad \text{a.s.} \qquad \text{(i)}$$

Because of (U6) we have $\frac{i}{n_i} \to q_k > 0$ for $i \to \infty$, and therefore

$$\sum_i \frac{1}{n_i} = \infty. \qquad \text{(ii)}$$

Lemma 6.19 yields

$$\sum_i \frac{1}{n_i^2}\, E\bigl(\|C_i - E(C_i \mid \mathcal{B}_i)\|^2 \mid \mathcal{B}_i\bigr) < \infty \quad \text{a.s.} \qquad \text{(iii)}$$

Due to (i), (ii) and (iii), Lemma B.6, Appendix, can be applied to the sequence $(T_i)$. Hence, the proposition is shown.
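The averaging rule (6.86a) is a simple recursive update; the following sketch illustrates it for $\kappa = 1$, $l_n \equiv 1$, $\varepsilon_n \equiv 0$ with synthetic Gaussian errors (all concrete numbers are hypothetical), in which case $W_n$ should approach $V/2$, in accordance with (6.85):

```python
import numpy as np

def update_W(W, Z_tilde, n, l_n, eps_n):
    """One step of the averaging rule (6.86a):
    W_{n+1} = (1 - 1/n) W_n + (1/n) * (2 l_n / (2 l_n + eps_n)) * Z~ Z~^T."""
    factor = 2.0 * l_n / (2.0 * l_n + eps_n)
    return (1.0 - 1.0 / n) * W + factor * np.outer(Z_tilde, Z_tilde) / n

# With l_n = 1, eps_n = 0: Z~_n = (Z^(1) - Z^(2)) / 2, so E Z~ Z~^T = V/2.
rng = np.random.default_rng(2)
V = np.array([[1.0, 0.3], [0.3, 0.5]])   # hypothetical error covariance V^(k)
Lc = np.linalg.cholesky(V)
W = np.zeros((2, 2))
for n in range(1, 20001):
    Z1, Z2 = Lc @ rng.normal(size=2), Lc @ rng.normal(size=2)
    W = update_W(W, 0.5 * (Z1 - Z2), n, 1, 0)
assert np.linalg.norm(W - V / 2) < 0.1   # W_n -> V / (2 l + eps) = V/2
```

No extra measurements are needed for this estimator: the vectors $\tilde Z_n$ are built from the same evaluations $Y_n^{(j)}$ that already produced the search direction.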
In a similar way, approximations $w_n^{(1)}, \dots, w_n^{(\kappa)}$ of the mean quadratic limit errors of the sequence $(Z_n)$ can be given:

At step $n \in \mathrm{IN}^{(k)}$ with unique $k \in \{1, \dots, \kappa\}$, let

$$w_{n+1}^{(k)} := \Bigl(1 - \frac{1}{n}\Bigr) w_n^{(k)} + \frac{1}{n} \cdot \frac{2 l_n}{2 l_n + \varepsilon_n}\, \|\tilde{Z}_n\|^2, \qquad (6.88a)$$

$$w_{n+1}^{(i)} := w_n^{(i)} \quad \text{for } i \ne k,\ i \in \{1, \dots, \kappa\}. \qquad (6.88b)$$

Here, initially, arbitrary values $w_1^{(1)}, \dots, w_1^{(\kappa)}$ may be given. Then, the following convergence corollary can be shown:

Corollary 6.5. If "$X_n \to x^*$ a.s.," then for any $k = 1, \dots, \kappa$

$$\lim_{n \in \mathrm{IN}^{(k)}} w_n^{(k)} = \frac{1}{2 l^{(k)} + \varepsilon^{(k)}}\, \mathrm{tr}(V^{(k)}) = \lim_{n \in \mathrm{IN}^{(k)}} E_n \|\Delta Z_n\|^2 \quad \text{a.s.} \qquad (6.89)$$

Proof. If $w_1^{(k)} = \mathrm{tr}(W_1^{(k)})$ for $k = 1, \dots, \kappa$, then for each $n \in \mathrm{IN}$ and $k = 1, \dots, \kappa$,

$$w_n^{(k)} = \mathrm{tr}(W_n^{(k)})$$

with the matrix $W_n^{(k)}$ from (6.86a), (6.86b). Hence, the proposition follows from Theorem 6.8 by taking the trace.

6.7 Realization of Adaptive Step Sizes

Using the estimators for the unknown Jacobian $H^* := H(x^*)$ and the limit covariance matrices $W^{(1)}, \dots, W^{(\kappa)}$ of the sequence of estimation errors $(Z_n)$ developed in Sect. 6.6, it is possible to select the step size $R_n$ in algorithm (6.2a), (6.2b) adaptively such that certain convergence conditions are fulfilled.

For the scalar $\rho_n$ in the step size $R_n = \rho_n M_n$ we suppose, as in Sect. 6.5,

$$\lim_{n \to \infty} n \rho_n = 1 \quad \text{a.s.} \qquad \text{(X1)}$$

Moreover, let the step directions $(Y_n)$ be defined by relations (W1a), (W1b) in Sect. 6.6.

Starting with a matrix $H_0$, and putting $H_1^{(0)} := H_0$, the $\mathcal{A}_n$-measurable matrices $H_n^{(0)}$ can be determined recursively by (6.81a), (6.81b).

Depending on $H_n^{(0)}$, the factor $M_n$ in the step size $R_n$ is selected here such that conditions (U4a)–(U4d) from Sect. 6.3 are fulfilled.

Independent of the convergence behavior of $(X_n)$, for large index $n$ the matrices $H_n^{(0)}$ do not deviate considerably from $\tilde{H}_n^{(0)}$ given by (6.83a), (6.83b). This is shown next:

Lemma 6.20. $\Delta H_n := H_n^{(0)} - \tilde{H}_n^{(0)} \longrightarrow 0$ a.s.

Proof. Because of (X1), the assumption of Lemma 6.8(d) holds. Hence, (X_{n+1} - X_n) converges to zero a.s. Using Theorem 6.7, the assertion follows. □

According to the definition we have Ĥ_n^{(0)} ∈ H, where H denotes the set of all matrices H ∈ IM(ν × ν, IR) such that

H = Σ_{i∈I} λ_i H^{(i)},  I finite,  Σ_{i∈I} λ_i = 1,  λ_i > 0,  and  H^{(i)} ∈ {H(x) : x ∈ D} ∪ {H_0} for i ∈ I.

For the starting matrix H_0 we assume ‖H_0‖ ≤ β with a number β according to (6.3). Then, for all H ∈ H we get

‖H‖ ≤ β.   (6.90)
6.7.1 Optimal Matrix Step Sizes

If in (6.2a), (6.2b) a true matrix step size is used, then due to condition (U3) from Sect. 6.3 we suppose

x* ∈ D̊.   (X2)

In case the covariance matrix of the estimation error Z_n converges towards a unique limit matrix W in the sense of (V5a), (V5b) or (V6), see Sect. 6.5, according to [117] for

M_n := (H*)^{-1}  with  H* := H(x*)

we obtain the minimum asymptotic covariance matrix of the process (√n (X_n - x*)). Since H* is unknown in general, this approach is not possible directly.

However, according to Theorem 6.6, the matrix H_n^{(0)} is a known A_n-measurable approximation of H* at stage n, provided that (X_n) converges to x*. Hence, similar to [12, 107, 108], for n = 1, 2, ... we select

M_n := (H_n^{(0)})^{-1}  if H_n^{(0)} ∈ H_0,  and  M_n := H_0^{-1}  else.   (6.91)

Here, H_0 is a set of matrices H ∈ IM(ν × ν, IR) with appropriate properties. Concerning H(x), H_0 and H_0 we suppose:

(a) There is a positive number α such that

H(x)H(y)^T + H(y)H(x)^T ≥ 2α² I,   (X3a)
H(x)H_0^T + H_0 H(x)^T ≥ 2α² I,   (X3b)
H_0 H_0^T ≥ α² I   (X3c)

for all x, y ∈ D.

(b) There is a positive number α_0 with

H_0 ⊂ H_⊕ := {H ∈ IM(ν × ν, IR) : HH^T ≥ α_0² I}.   (X4)

Assumptions (X3a)–(X3c) are also used in [108] and in [107]. As is shown later, see Lemma 6.22, conditions (X3a)–(X3c) and (X4) guarantee that the sequence (M_n) given by (6.91) fulfills conditions (U4a)–(U4d) in Sect. 6.3.

If G(x) is a gradient function, hence G(x) = ∇F(x) with a function F: D_0 → IR, then (X3a)–(X3c) does not guarantee the convexity of F. Conversely, the Hessian H(x) of a strongly convex function F(x) does not fulfill condition (X3a)–(X3c) in general. Counterexamples can be found in [115].

Some properties of the matrix sets H_⊕ and H are described in the following:

Lemma 6.21.
(a) For a matrix H ∈ IM(ν × ν, IR) the following properties are equivalent:
(i) H ∈ H_⊕
(ii) H is regular and ‖H^{-1}‖ ≤ 1/α_0
(b) For H ∈ H and x ∈ D we have
(i) H(x)H^T + HH(x)^T ≥ 2α² I
(ii) HH^T ≥ α² I
(iii) H is regular and ‖H^{-1}‖ ≤ 1/α
(iv) H^T H ≤ β² I
(v) (H^{-1})(H^{-1})^T ≥ (1/β²) I
(vi) H^{-1}H(x) + (H^{-1}H(x))^T ≥ 2(α²/β²) I
Proof. If H is regular, then

(HH^T)^{-1} = (H^{-1})^T H^{-1},
‖H^{-1}‖² = λ_max((H^{-1})^T (H^{-1})),
HH^T ≥ α_0² I ⟺ (HH^T)^{-1} ≤ (1/α_0²) I.

This yields the equivalence of (i) and (ii) in (a).

Let now H = Σ_{i∈I} λ_i H^{(i)} ∈ H and x ∈ D be given arbitrarily. Because of (X3a)–(X3c) we have

H(x)H^T + HH(x)^T = Σ_{i∈I} λ_i (H(x)H^{(i)T} + H^{(i)}H(x)^T) ≥ 2α² I

and

HH^T = Σ_{i,k∈I} λ_i λ_k H^{(i)} H^{(k)T} ≥ α² I.

Hence, H is regular, and (H^{-1})^T (H^{-1}) = (HH^T)^{-1} ≤ α^{-2} I. This yields ‖H^{-1}‖² = λ_max((H^{-1})^T (H^{-1})) ≤ α^{-2}. Due to (6.90) we find H^T H ≤ λ_max(H^T H) I = ‖H‖² I ≤ β² I and therefore β^{-2} I ≤ (H^T H)^{-1} = (H^{-1})(H^{-1})^T. Because of

H^{-1}H(x) + (H^{-1}H(x))^T = H^{-1} (H(x)H^T + HH(x)^T) (H^{-1})^T,

assertions (i) and (v) of (b) finally yield

H^{-1}H(x) + (H^{-1}H(x))^T ≥ 2α² (H^{-1})(H^{-1})^T ≥ 2(α²/β²) I. □
Using Lemma 6.21, we obtain, as announced, the following result:

Lemma 6.22.
(a) sup_{n∈IN} ‖M_n‖ ≤ max{α_0^{-1}, α^{-1}}.
(b) For each n ∈ IN there are A_n-measurable matrices M̂_n and ΔM_n such that
(i) M_n = M̂_n + ΔM_n
(ii) lim_{n→∞} ‖ΔM_n‖ = 0 a.s.
(iii) u^T M̂_n H(x) u ≥ (α²/β²) ‖u‖² for x ∈ D and u ∈ IR^ν
Proof. Since H_0 ⊂ H_⊕ and H_0 ∈ H, Lemma 6.21(a)(ii) and (b)(iii) for n = 1, 2, ... yield

‖M_n‖ ≤ α_0^{-1}  if H_n^{(0)} ∈ H_0,  and  ‖M_n‖ ≤ α^{-1}  else,

hence the first assertion. Furthermore, according to Lemma 6.20 it is

ΔH_n := H_n^{(0)} - Ĥ_n^{(0)} → 0 a.s.,   (i)

with the matrix Ĥ_n^{(0)} given by (6.83a), (6.83b). Because of Ĥ_n^{(0)} ∈ H for each n ∈ IN, Lemma 6.21(b)(iii) yields

Ĥ_n^{(0)} is regular and ‖(Ĥ_n^{(0)})^{-1}‖ ≤ 1/α.   (ii)

Define now

M̂_n := (Ĥ_n^{(0)})^{-1}  if H_n^{(0)} ∈ H_0,  and  M̂_n := H_0^{-1}  else,   (6.92)

and put ΔM_n := M_n - M̂_n. Because of (i) and (ii), the assumptions of Lemma C.2, Appendix, hold for A_n := Ĥ_n^{(0)} and B_n := H_n^{(0)}. Hence,

ΔM_n → 0 a.s.

Furthermore, according to Lemma 6.21(b)(vi), and because of Ĥ_n^{(0)}, H_0 ∈ H, we get

u^T M̂_n H(x) u ≥ (α²/β²) ‖u‖²

for all n ∈ IN, x ∈ D and u ∈ IR^ν. □


Due to Lemma 6.22, the sequence (M_n) fulfills conditions (U4a)–(U4d) from Sect. 6.3. Consequently, we now have this result:

Theorem 6.9.
(a) X_n → x* a.s.
(b) H_n^{(0)}, Ĥ_n^{(0)} → H* := H(x*) a.s.

Proof. The first assertion follows immediately from Corollary 6.1. The second assertion follows then from (a), Theorem 6.6(b) and Theorem 6.7. □
Supposing still

H* ∈ H̊_0 := interior of H_0,   (X5)

we get the main result of this section:

Theorem 6.10.
(a) M_n → (H*)^{-1} a.s.
(b) n^λ (X_n - x*) → 0 a.s. for each λ ∈ [0, 1/2).
(c) For each ε_0 > 0 there is a set Ω_0 ∈ A such that
(i) P(Ω̄_0) < ε_0
(ii) lim sup_{n∈IN} E_{Ω_0} n‖X_n - x*‖² ≤ ‖(H*)^{-1}‖² Σ_{k=1}^κ q_k σ_k²,
where σ_k² := lim sup_{n∈IN(k)} E σ_{0,n}² for k = 1, ..., κ.
(d) If the covariance matrices of (Z_n) fulfill (V6) and (V8) in Sect. 6.5.3, then (√n (X_n - x*)) has an asymptotic normal distribution with the asymptotic covariance matrix given by

V = (H*)^{-1} (Σ_{k=1}^κ q_k W^{(k)}) ((H*)^{-1})^T.

Proof. According to Theorem 6.9(b) and assumption (X5) there is a (stochastic) index n_0 ∈ IN such that H_n^{(0)} ∈ H_0 for all n ≥ n_0 a.s. Hence, according to the definition, for n ≥ n_0 we get

M_n = (H_n^{(0)})^{-1}  and  M̂_n = (Ĥ_n^{(0)})^{-1}  a.s.

Theorem 6.9(b) then yields

M_n, M̂_n → Θ := (H*)^{-1} a.s.   (i)

Because of Lemma 6.9 and (i) we may suppose that

â := lim inf_{n∈IN} â_n ≥ a = 1   (ii)

for the sequence (â_n) arising in (6.7). Assertion (b) follows by applying Theorem 6.1. Moreover, assertion (c) follows from (i), (ii) and Theorem 6.2. Because of (i), assertion (d) is obtained directly from Corollary 6.4. □

Practical Computation of (H_n^{(0)})^{-1}

In formula (6.91) defining the matrices M_n we have to compute (H_n^{(0)})^{-1}. According to (6.81a), (6.81b), the matrix H_{n+1}^{(0)} := H_n^{(l_n)} follows from H_n^{(0)} by an l_n-fold rank-1 update. Also the inverse of H_n^{(0)} can be obtained by rank-1 updates.

Lemma 6.23. Let n ∈ IN and i ∈ {1, ..., l_n} with a regular H_n^{(i-1)}. Then

(a) det(H_n^{(i)}) = (δ_n^{(i)}/n) det(H_n^{(i-1)}) with δ_n^{(i)} := n - 1 + d_n^{(i)T} (H_n^{(i-1)})^{-1} h_n^{(i)}.
(b) H_n^{(i)} is regular if and only if δ_n^{(i)} ≠ 0. In this case

(H_n^{(i)})^{-1} = (H_n^{(i-1)})^{-1} + (1/δ_n^{(i)}) (d_n^{(i)} - (H_n^{(i-1)})^{-1} h_n^{(i)}) d_n^{(i)T} (H_n^{(i-1)})^{-1}.

Proof. The assertion follows directly from Lemma C.3, Appendix, if there we put

d := d_n^{(i)},  e := (1/n)(h_n^{(i)} - H_n^{(i-1)} d_n^{(i)})

and take into account that ‖d_n^{(i)}‖² = 1. □

Of course, at stage n the matrix (H_{n+1}^{(0)})^{-1} follows from (H_n^{(0)})^{-1} at minimum expense for l_n = 1. Indeed, according to (6.81a), (6.81b) we have

H_{n+1}^{(0)} = H_n^{(0)} + (1/n)(h_n^{(1)} - H_n^{(0)} d_n^{(1)}) d_n^{(1)T}.
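Lemma 6.23 can be turned into code directly; the sketch below (hypothetical function name, assuming ‖d‖ = 1 as in the lemma) updates determinant and inverse after one rank-1 step of the form H + (1/n)(h - Hd)d^T and is checked against a full recomputation:

```python
import numpy as np

def rank1_update_inverse(H_inv, H_det, d, h, n):
    """Sherman-Morrison style update per Lemma 6.23 for
    H_new = H + (1/n) (h - H d) d^T  with ||d|| = 1:
      delta     = n - 1 + d^T H^{-1} h,
      det(H_new)  = (delta / n) * det(H),
      H_new^{-1}  = H^{-1} + (1/delta) (d - H^{-1} h) d^T H^{-1}."""
    delta = n - 1.0 + d @ H_inv @ h
    det_new = (delta / n) * H_det
    H_inv_new = H_inv + np.outer(d - H_inv @ h, d @ H_inv) / delta
    return H_inv_new, det_new

rng = np.random.default_rng(1)
H = np.eye(3) + 0.1 * rng.normal(size=(3, 3))   # regular test matrix
d = rng.normal(size=3); d /= np.linalg.norm(d)  # update direction, ||d|| = 1
h = rng.normal(size=3)
n = 5
H_new = H + np.outer(h - H @ d, d) / n          # the rank-1 step itself
H_inv_new, det_new = rank1_update_inverse(np.linalg.inv(H), np.linalg.det(H), d, h, n)
```

The update costs O(ν²) per step instead of O(ν³) for a fresh inversion, which is the point of the lemma.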

Choice of Set H_0

We still have to clarify how to select the set H_0 occurring in (6.91) such that conditions (X4) and (X5) are fulfilled. Theoretically, H_0 := H_⊕ could be taken. However, this is not possible since the minimum eigenvalue of a matrix HH^T cannot be computed exactly in general. Hence, for the choice of H_0, two practicable cases are presented:

Example 6.1. Let ‖·‖_0 denote any simple matrix norm with ‖I‖_0 = 1. Furthermore, define

H_0 := {H ∈ IM(ν × ν, IR) : det(H) ≠ 0 and ‖H^{-1}‖_0 ≤ b_0},

where b_0 is a positive number such that

‖(H*)^{-1}‖_0 < b_0.   (6.93)

For this set H_0 we have the following result:


Lemma 6.24.
(a) H_0 ⊂ H_⊕ with α_0 := 1/(κ_0 b_0), where κ_0 is such that ‖A‖ ≤ κ_0 ‖A‖_0 for each matrix A ∈ IM(ν × ν, IR).
(b) H* ∈ H̊_0.

Proof. Each matrix H ∈ H_0 is regular, and according to the above assumption we have

‖H^{-1}‖ ≤ κ_0 ‖H^{-1}‖_0 ≤ κ_0 b_0 =: 1/α_0.

Thus, Lemma 6.21(a) yields H ∈ H_⊕. Defining a_0 := ‖(H*)^{-1}‖_0 and ε_0 := (1/b_0)(1 - a_0/b_0) > 0, formula (6.93) yields

0 < a_0/b_0 ≤ 1 - a_0 ε_0  and  ε_0 < 1/a_0.

If ‖H - H*‖_0 ≤ ε_0, then according to Lemma C.1, Appendix, we know that H is regular and

‖H^{-1}‖_0 ≤ a_0/(1 - a_0 ε_0) ≤ b_0.

Hence, {H : ‖H - H*‖_0 ≤ ε_0} ⊂ H_0, or H* ∈ H̊_0. □
In the above, the following norms ‖·‖_0 can be selected:

(a) ‖A‖_0 := max_{i,k≤ν} |a_{i,k}| with κ_0 := ν
(b) ‖A‖_0 := √((1/ν) tr(A^T A)) = √((1/ν) Σ_{i,k≤ν} a_{i,k}²), where κ_0 := √ν
By means of Lemma 6.23 the determinant and the inverse of H_n^{(0)} can be computed easily. Hence, the condition "H_n^{(0)} ∈ H_0" arising in (6.91) can be verified easily.
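A minimal sketch of the membership test of Example 6.1 with the two norms (a) and (b); the function names and the tolerance used for "det(H) ≠ 0" are illustrative assumptions:

```python
import numpy as np

def norm0_max(A):
    """Norm (a): max |a_ik|, with kappa_0 = nu."""
    return np.max(np.abs(A))

def norm0_scaled_frobenius(A):
    """Norm (b): sqrt(tr(A^T A)/nu), with kappa_0 = sqrt(nu); ||I||_0 = 1."""
    nu = A.shape[0]
    return np.sqrt(np.trace(A.T @ A) / nu)

def in_H0(H, b0, norm0=norm0_max):
    """Check H in H_0 = {H : det(H) != 0 and ||H^{-1}||_0 <= b0}."""
    if abs(np.linalg.det(H)) < 1e-12:   # numerical stand-in for det(H) != 0
        return False
    return norm0(np.linalg.inv(H)) <= b0
```

For the identity matrix, both norms return 1, matching the required normalization ‖I‖_0 = 1.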
Example 6.2. For a given positive number β_0 with

β_0 < (det(H*))² / (tr(H* H*^T))^{ν-1},   (6.94)

define H_0 := {H ∈ IM(ν × ν, IR) : (det(H))² / (tr(HH^T))^{ν-1} ≥ β_0}. According to Lemma 6.23 we can easily decide whether H_n^{(0)} ∈ H_0 is fulfilled. Also in this case H_0 fulfills conditions (X4) and (X5):
Lemma 6.25. If ν > 1, then
(a) H_0 ⊂ H_⊕ with α_0 := √(β_0 (ν - 1)^{ν-1}) > 0
(b) H* ∈ H̊_0

Proof. If for a given matrix H ∈ H_0 Lemma C.4, Appendix, is applied to A := HH^T, then

HH^T ≥ det(HH^T) ((ν - 1)/tr(HH^T))^{ν-1} I ≥ β_0 (ν - 1)^{ν-1} I =: α_0² I,

hence, H ∈ H_⊕. Since det(·) and tr(·) are continuous functions on IM(ν × ν, IR), because of (6.94) there is a positive number ε_0 such that

(det(H))² / (tr(HH^T))^{ν-1} ≥ β_0

for each matrix H with ‖H - H*‖ ≤ ε_0. Consequently, also in the present case we have H* ∈ H̊_0. □
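The test of Example 6.2 needs only a determinant and a trace; a small sketch (hypothetical name `in_H0_det_tr`):

```python
import numpy as np

def in_H0_det_tr(H, beta0):
    """Membership test of Example 6.2:
    H is accepted iff det(H)^2 / tr(H H^T)^(nu-1) >= beta0."""
    nu = H.shape[0]
    return np.linalg.det(H) ** 2 / np.trace(H @ H.T) ** (nu - 1) >= beta0

# For H = I in R^{3x3}: det(H)^2 = 1 and tr(H H^T)^(nu-1) = 3^2 = 9,
# so the acceptance ratio equals 1/9.
ratio = 1.0 / 9.0
```

As the text notes, det(H_n^{(0)}) is available cheaply from the rank-1 recursion of Lemma 6.23, so this test costs almost nothing per iteration.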

6.7.2 Adaptive Scalar Step Size

We consider now algorithm (6.2a), (6.2b) with a scalar step size R_n = r_n I = ρ_n M_n, where r_n ∈ IR_+ and ρ_n fulfills (X1). Moreover, we suppose that M_n is defined by

M_n = (π_n / α_n*) I  for n = 1, 2, ... .   (6.95)

Here, π_n and α_n* are positive A_n-measurable random variables such that π_n controls the variance of the estimation error Z_n, and α_n* takes into account the influence of the Jacobian H(x). In the following, π_n and α_n* are selected such that the optimal limit condition (5.83) is fulfilled.

In this section we first assume that the Jacobian H(x) and the random numbers π_n and α_n* fulfill the following conditions:

(a) Suppose that there is a positive number α such that

H(x) + H(x)^T ≥ 2αI  for x ∈ D.   (X6)

(b) There are positive numbers d and d̄ with

d ≤ π_n/α_n* ≤ d̄  a.s. for all n = 1, 2, ... .   (X7)

Corresponding to Lemma 6.22, here we have this result:

Lemma 6.26.
(i) sup_n ‖M_n‖ ≤ d̄,
(ii) u^T M_n H(x) u ≥ dα ‖u‖² for each x ∈ D, u ∈ IR^ν and n ∈ IN.

Proof. According to the definition of M_n, the assertions (i), (ii) follow immediately from the above assumptions (a), (b). □

Put now M̂_n :≡ M_n and ΔM_n :≡ 0. According to Lemma 6.26, also in the present case conditions (U4a)–(U4d) from Sect. 6.3 are fulfilled. Thus, corresponding to the last section, cf. Theorem 6.9, we obtain the following result:
Theorem 6.11.
(a) X_n → x* a.s.
(b) H_n^{(0)} → H* a.s.

Further convergence results are obtained under additional assumptions on the random parameters π_n and α_n*:

(c) There is a positive number α̂ with

lim_{n→∞} α_n* = α̂ a.s.,   (X8)
α̂ < 2α* := λ_min(H* + H*^T).   (X9)

(d) There are positive numbers π^{(1)}, ..., π^{(κ)} such that

lim_{n∈IN(k)} π_n = π^{(k)} a.s., k = 1, ..., κ,   (X10a)
q_1 π^{(1)} + ... + q_κ π^{(κ)} = 1.   (X10b)

Because of (c) and (d), for k = 1, ..., κ we have

lim_{n∈IN(k)} π_n/α_n* = π^{(k)}/α̂ =: d^{(k)} a.s.,   (6.96a)
d := q_1 d^{(1)} + ... + q_κ d^{(κ)} = 1/α̂,   (6.96b)
d α* > 1/2,   (6.96c)

especially

lim_{n∈IN(k)} M_n = d^{(k)} I a.s.   (6.97)

Results on the rate of convergence and the convergence in quadratic mean of (X_n) are obtained now:

Theorem 6.12.
(a) If λ ∈ [0, 1/2), then n^λ (X_n - x*) → 0 a.s.
(b) For each ε_0 > 0 there is a set Ω_0 ∈ A such that
(i) P(Ω̄_0) < ε_0,
(ii) lim sup_{n∈IN} E_{Ω_0} n‖X_n - x*‖² ≤ (Σ_{k=1}^κ q_k (π^{(k)} σ_k)²) / (α̂ (2α* - α̂)),   (6.98)
where σ_1, ..., σ_κ are selected as in Theorem 6.10(c).

Proof. Because of (6.97) and Lemma 6.9 we may assume that

â^{(k)} ≥ d^{(k)} α*,  k = 1, ..., κ.

Hence, due to (6.96b), (6.96c) we have â ≥ dα* > 1/2. The first assertion follows then from Theorem 6.1. Because of (6.96a) and (6.97), assertion (b) is obtained from Corollary 6.3. □

As already mentioned in Sect. 6.5.2, the right hand side of (6.98) attains a minimum for

α̂ = α*,   (6.99a)
π^{(k)} = (σ_k² (q_1/σ_1² + ... + q_κ/σ_κ²))^{-1}, k = 1, ..., κ.   (6.99b)

Corresponding to Theorem 6.10(d), also in the present case we have a central limit theorem:

Theorem 6.13. In addition to the above assumptions, suppose
(i) x* ∈ D̊
(ii) the covariance matrices of (Z_n) fulfill conditions (V6) and (V8) in Sect. 6.5.3.

Then (√n (X_n - x*)) is asymptotically N(0, V)-distributed, where the matrix V is determined by the equation

(1/α̂)(H* V + V H*^T) - V = (1/α̂²) Σ_{k=1}^κ q_k (π^{(k)})² W^{(k)}.   (6.100)

Proof. For k = 1, ..., κ we have d^{(k)} (H* + H*^T) ≥ 2(d^{(k)} α*) I, and because of conditions (6.96b), (6.96c) we obtain

a := q_1 (d^{(1)} α*) + ... + q_κ (d^{(κ)} α*) = d α* > 1/2.

Hence, according to (6.96a) and (6.97), conditions (V2) and (V3a), (V3b) in Sect. 6.5 hold. The assertion follows now from Theorem 6.4. □

Choice of Sequence (π_n)

Of course, the standard definition

π_n :≡ 1, n = 1, 2, ...,   (6.101)

fulfills condition (X10a), (X10b), but does not yield the minimum asymptotic bound of (E_{Ω_0} n‖X_n - x*‖²).

Using the assumptions of Sect. 6.6.3 concerning the estimation error (Z_n), the A_n-measurable random variables w_n^{(1)}, ..., w_n^{(κ)}, given by (6.88a), (6.88b), are approximations of σ_1², ..., σ_κ². Indeed, according to Corollary 6.5 we have

lim_{n∈IN(k)} w_n^{(k)} = lim_{n∈IN(k)} E_n ‖ΔZ_n‖² a.s., k = 1, ..., κ.   (6.102)

For given positive numbers w and w̄ with

w ≤ min_{k≤κ} σ_k² ≤ max_{k≤κ} σ_k² ≤ w̄,   (X11)

for n ∈ IN and k = 1, ..., κ we define

ŵ_n^{(k)} := w_n^{(k)}  if w_n^{(k)} ∈ [w, w̄],
ŵ_n^{(k)} := w  if w_n^{(k)} < w,   (6.103)
ŵ_n^{(k)} := w̄  if w_n^{(k)} > w̄.

Then, for indices n ∈ IN(k) with k ∈ {1, ..., κ} we consider

π_n := (ŵ_n^{(k)} (q_1/ŵ_n^{(1)} + ... + q_κ/ŵ_n^{(κ)}))^{-1}.   (6.104)

Because of (6.102) and (X11), for each n ∈ IN and k = 1, ..., κ we find

w/w̄ ≤ π_n ≤ w̄/w,   (6.105a)
lim_{n∈IN(k)} π_n = (σ_k² (q_1/σ_1² + ... + q_κ/σ_κ²))^{-1} a.s.   (6.105b)

Hence, conditions (X10a), (X10b) hold, and because of the optimality condition (6.99b), this choice of (π_n) yields the minimum asymptotic bound for E_{Ω_0} n‖X_n - x*‖².
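The clipping (6.103) and the weight formula (6.104) fit in a few lines; the sketch below (hypothetical helper `pi_n`) also illustrates that the resulting weights satisfy the normalization corresponding to (X10b):

```python
import numpy as np

def pi_n(w_est, k, q, w_lo, w_hi):
    """Compute pi_n from (6.103)-(6.104): clip the variance estimates
    w_n^(1..kappa) to [w_lo, w_hi] and return
    pi_n = ( w_clip[k] * sum_j q_j / w_clip[j] )^{-1}."""
    w_clip = np.clip(np.asarray(w_est, dtype=float), w_lo, w_hi)
    return 1.0 / (w_clip[k] * np.sum(np.asarray(q) / w_clip))

q = [0.5, 0.5]
# Equal variance estimates give pi_n = 1 in every class,
# recovering the standard choice (6.101).
p_equal = pi_n([2.0, 2.0], 0, q, 1.0, 4.0)
```

With unequal estimates, classes with smaller estimated variance receive a larger π_n, which is exactly the weighting the optimality condition (6.99b) asks for.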

Choice of Sequence (α_n*)

Obviously, at stage n the factor α_n* in the step size (6.95) may be chosen as a function of H_n^{(0)}. Hence, for n = 1, 2, ... we put

α_n* := ϑ_n(H_n^{(0)})  if H_n^{(0)} ∈ H_1 and ϑ_n(H_n^{(0)}) ∈ [ϑ, ϑ̄],  and  α_n* := ϑ_0  else.   (6.106)

Here, H_1 is an appropriate subset of IM(ν × ν, IR), ϑ_n(H) is a real-valued function on H_1, and ϑ_0, ϑ, ϑ̄ denote positive constants with ϑ_0 ∈ [ϑ, ϑ̄].

If the sequence (π_n) is chosen according to (6.104), then (X7) holds with

d := w/(w̄ ϑ̄)  and  d̄ := w̄/(w ϑ).   (6.107)

Consequently, without any further assumptions, the convergence properties according to Theorem 6.11 hold. Now we still require the following:
(e) Suppose

H* ∈ H̊_1 = interior of H_1.   (X12)

(f) In case "H_n^{(0)} → H* a.s." we assume that there is a number α̂ such that

ϑ < α̂ < ϑ̄,   (X13)
ϑ_n(H_n^{(0)}) → α̂ a.s.   (X14)

In the important case (see the following examples)

ϑ_n(H) ≡ ϑ(H) is continuous on H_1,

condition (X14) holds with α̂ = ϑ(H*). In the following we show that (X12), (X13) and (X14) imply condition (X8):

Lemma 6.27. lim_{n→∞} α_n* = α̂ a.s.

Proof. According to Theorem 6.11(b),

H_n^{(0)} → H* a.s.   (i)

Thus, because of (X12), (X13) and (X14) there is an index n_0 such that

H_n^{(0)} ∈ H_1 and ϑ_n(H_n^{(0)}) ∈ [ϑ, ϑ̄]

for all n ≥ n_0 a.s. By definition, for n ≥ n_0 we have

α_n* = ϑ_n(H_n^{(0)}) a.s.

Hence, because of (i) and (X14) we get lim_{n→∞} α_n* = α̂ a.s. □

Examples for ϑ_n(H)

We consider now several examples for the function ϑ_n(H) needed in (6.106) and fulfilling conditions (X14) and (X9). In these cases the convergence properties of Theorems 6.12 and 6.13 hold.

According to (X6) we have

H* ∈ H_+,   (6.108)

provided that H_+ := {H ∈ IM(ν × ν, IR) : H + H^T > 0}. In the following examples we take H_1 := H_+. Hence, condition (X12) holds.

Example 6.3. The function

ϑ^{(1)}(H) := (1/2) λ_min(H + H^T)

is continuous on H_+ and ϑ^{(1)}(H*) = α*. Therefore, ϑ_n(H) :≡ ϑ^{(1)}(H) is an appropriate choice still guaranteeing the optimality condition (6.99a).

Example 6.4. ϑ^{(2)}(H) := (tr((H + H^T)^{-1}))^{-1} is also continuous on H_+, where

ϑ^{(2)}(H) < 1/λ_max((H + H^T)^{-1}) = λ_min(H + H^T)

for each matrix H ∈ H_+, especially ϑ^{(2)}(H*) < 2α*. Hence, also ϑ_n(H) :≡ ϑ^{(2)}(H) is a possible approach.

Example 6.5. According to Lemma C.4, Appendix, on H_+ the inequality ϑ^{(3)}(H) < λ_min(H + H^T) holds with the continuous function

ϑ^{(3)}(H) := det(H + H^T) · ((ν - 1)/(2 tr(H)))^{ν-1}.

Thus, also ϑ_n(H) :≡ ϑ^{(3)}(H) is possible.
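The three functions ϑ^{(1)}, ϑ^{(2)}, ϑ^{(3)} of Examples 6.3–6.5 can be evaluated with standard linear algebra; a sketch on a diagonal test matrix (for which λ_min(H + H^T) = 2, so the bounds ϑ^{(2)}, ϑ^{(3)} < λ_min(H + H^T) are visible):

```python
import numpy as np

def theta1(H):
    """Example 6.3: (1/2) lambda_min(H + H^T); equals alpha* at H = H*."""
    return 0.5 * np.linalg.eigvalsh(H + H.T).min()

def theta2(H):
    """Example 6.4: (tr((H + H^T)^{-1}))^{-1} < lambda_min(H + H^T)."""
    return 1.0 / np.trace(np.linalg.inv(H + H.T))

def theta3(H):
    """Example 6.5: det(H + H^T) * ((nu - 1) / (2 tr(H)))^(nu - 1)."""
    nu = H.shape[0]
    return np.linalg.det(H + H.T) * ((nu - 1) / (2.0 * np.trace(H))) ** (nu - 1)

H = np.diag([1.0, 2.0, 3.0])   # H + H^T = diag(2, 4, 6), lambda_min = 2
```

On this matrix theta1 gives exactly 1, while theta2 = 12/11 and theta3 = 4/3 both stay below the bound 2, as the inequalities above require.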
Example 6.6. For given A_n-measurable random vectors u_n from IR^ν with ‖u_n‖ = 1, n = 1, 2, ..., for H ∈ H_+ we define

ϑ_n^{(4)}(H) := (2 u_n^T (H + H^T)^{-1} u_n)^{-1}.

If u_n is an approximate eigenvector for the maximum eigenvalue (2α*)^{-1} of the matrix (H* + H*^T)^{-1}, then ϑ_n^{(4)}(H*) is an approximation of α*.

Since H* ∈ H_+, according to Theorem 6.11 there is a stochastic index n_0 such that

H_n^{(0)} ∈ H_+, n ≥ n_0 a.s.,   (6.109)

and

(H_n^{(0)} + H_n^{(0)T})^{-1} → (H* + H*^T)^{-1} a.s.   (6.110)

Following the von Mises algorithm for the computation of an eigenvector related to the maximum eigenvalue of (H* + H*^T)^{-1}, the sequence of vectors (u_n) is defined recursively as follows (see Appendix C.2):

(a) Select a vector v_0 ≠ 0, v_0 ∈ IR^ν, and put u_1 := ‖v_0‖^{-1} v_0.
(b) For a given A_n-measurable vector u_n at stage n, define

v_n := (H_n^{(0)} + H_n^{(0)T})^{-1} u_n  if H_n^{(0)} ∈ H_+,  v_n := v_0  else,  and  u_{n+1} := v_n/‖v_n‖.

From (6.109), with probability one for n ≥ n_0 we get

v_n = (H_n^{(0)} + H_n^{(0)T})^{-1} u_n,  ϑ_n^{(4)}(H_n^{(0)}) = 1/(2⟨u_n, v_n⟩),  u_{n+1} = v_n/‖v_n‖.

If the matrices A := (H* + H*^T)^{-1} and A_n := (H_n^{(0)} + H_n^{(0)T})^{-1} fulfill a.s. the assumptions in Lemma C.5(b), cf. Appendix, then due to (6.110) and Lemma C.5 the sequence (ϑ_n^{(4)}(H_n^{(0)})) converges a.s. towards

1/(2 λ_max((H* + H*^T)^{-1})) = α*.

Therefore, ϑ_n(H) :≡ ϑ_n^{(4)}(H) fulfills (X14) with α̂ = α*. Hence, also (X9) holds in this case.
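The coupled iteration for ϑ_n^{(4)} can be sketched as a plain power iteration; here H_n^{(0)} is frozen at a fixed matrix H for illustration, so the iterate converges to α* = (1/2)λ_min(H + H^T):

```python
import numpy as np

# Power (von Mises) iteration on A = (H + H^T)^{-1}:
# u_n converges to the eigenvector of lambda_max(A), and
# theta4 = 1 / (2 <u_n, v_n>) converges to (1/2) lambda_min(H + H^T).
H = np.array([[2.0, 0.3],
              [0.1, 1.0]])
A = np.linalg.inv(H + H.T)
u = np.array([1.0, 1.0])
u /= np.linalg.norm(u)
for _ in range(200):
    v = A @ u
    theta4 = 1.0 / (2.0 * (u @ v))
    u = v / np.linalg.norm(v)

alpha_star = 0.5 * np.linalg.eigvalsh(H + H.T).min()
```

In the actual algorithm A is replaced at each stage by A_n = (H_n^{(0)} + H_n^{(0)T})^{-1}, so only one multiplication (or one triangular solve, see Lemma 6.28(d)) is spent per iteration of (6.2a), (6.2b).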

Some Remarks on the Computation of ϑ_n(H)

If the sequence (α_n*) in (6.95) is chosen according to the above examples, in each iteration step the condition

H_n^{(0)} ∈ H_+ ⟺ H_n^{(0)} + H_n^{(0)T} > 0

must be verified. This holds if and only if (see [142]) the matrix H_n^{(0)} + H_n^{(0)T} has a Cholesky decomposition. Hence, there is a matrix S_n such that

H_n^{(0)} + H_n^{(0)T} = S_n S_n^T,   (6.111a)

where S_n is lower triangular with diagonal elements s_11^{(n)}, ..., s_νν^{(n)},   (6.111b)

s_11^{(n)}, ..., s_νν^{(n)} > 0.   (6.111c)

In [142] an algorithm for the decomposition (6.111a)–(6.111c) is presented. If this algorithm stops without result, then H_n^{(0)} ∉ H_+.

The decomposition (6.111a)–(6.111c) is also useful for the computation of ϑ_n(H_n^{(0)}). This is shown in the following lemma:
Lemma 6.28. For H_n^{(0)} ∈ H_+, let S_n be given by (6.111b). Then

(a) det(H_n^{(0)} + H_n^{(0)T}) = (s_11^{(n)} · ... · s_νν^{(n)})²
(b) (H_n^{(0)} + H_n^{(0)T})^{-1} = (S_n^{-1})^T S_n^{-1}, where S_n^{-1} is lower triangular with diagonal elements t_11^{(n)}, ..., t_νν^{(n)}
(c) tr((H_n^{(0)} + H_n^{(0)T})^{-1}) = Σ_{i≥j} (t_ij^{(n)})²
(d) For given u_n ∈ IR^ν determine v_n := (H_n^{(0)} + H_n^{(0)T})^{-1} u_n as follows:
(i) find w_n with S_n w_n = u_n
(ii) determine v_n such that S_n^T v_n = w_n

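Lemma 6.28 translates directly into a Cholesky-based computation; the following sketch (illustrative matrix) verifies parts (a), (c) and (d) numerically:

```python
import numpy as np

H = np.array([[2.0, 0.5, 0.0],
              [0.1, 1.5, 0.2],
              [0.0, 0.3, 1.0]])
B = H + H.T                       # must be positive definite, i.e. H in H_+
S = np.linalg.cholesky(B)         # B = S S^T, S lower triangular, (6.111a-c)

det_B = np.prod(np.diag(S)) ** 2  # Lemma 6.28(a): product of squared diagonals
T = np.linalg.inv(S)              # lower triangular, entries t_ij (i >= j)
trace_Binv = np.sum(T ** 2)       # Lemma 6.28(c): sum of t_ij^2 over i >= j

u = np.array([1.0, 2.0, 3.0])
w = np.linalg.solve(S, u)         # (d)(i):  S w = u  (forward substitution)
v = np.linalg.solve(S.T, w)       # (d)(ii): S^T v = w (back substitution)
```

If `np.linalg.cholesky` raises `LinAlgError`, the matrix H + H^T is not positive definite, which is exactly the failure mode the text uses to detect H_n^{(0)} ∉ H_+.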
Symmetric H*

If (e.g., in case of a gradient G(x) = ∇F(x))

H* := H(x*) is symmetric,   (X15)

then

α* := (1/2) λ_min(H* + H*^T) = λ_min(H*).   (6.112)

Hence, in the present case the function ϑ_n(H) in (6.106) can be chosen to depend only on H. In addition, conditions (X14) and (X9) can be fulfilled.

Due to (X6) we have

H* ∈ H_{++},   (6.113)

provided that

H_{++} := {H ∈ IM(ν × ν, IR) : det(H) > 0 and tr(H) > 0}.

In the following examples we take H_1 := H_{++}; hence, condition (X12) holds.

Example 6.7. (see Example 6.4) The function

ϑ^{(5)}(H) := 2/tr(H^{-1})

is continuous on H_{++} and ϑ^{(5)}(H*) < 2/λ_max((H*)^{-1}) = 2α*. Thus, ϑ_n(H) :≡ ϑ^{(5)}(H) is a suitable function.

Example 6.8. (see Example 6.5)

ϑ^{(6)}(H) := 2 det(H) · ((ν - 1)/tr(H))^{ν-1}

is also continuous on H_{++}, and Lemma C.4, Appendix, yields ϑ^{(6)}(H*) < 2α*. Hence, also ϑ_n(H) :≡ ϑ^{(6)}(H) is an appropriate choice.

Example 6.9. (see Example 6.6) For H ∈ H_{++} define

ϑ_n^{(7)}(H) := 1/(u_n^T H^{-1} u_n),

where u_1 := ‖v_0‖^{-1} v_0 and (u_n) is given recursively by

v_n := (H_n^{(0)})^{-1} u_n  for H_n^{(0)} ∈ H_{++},  v_n := v_0  else,  and  u_{n+1} := v_n/‖v_n‖,

n = 1, 2, ... . Here, v_0 ≠ 0, v_0 ∈ IR^ν, is an appropriate start vector. According to Lemma C.5, Appendix, in many cases the following equation holds:

lim_{n→∞} ϑ_n^{(7)}(H_n^{(0)}) = 1/lim_{n→∞} ⟨u_n, v_n⟩ = α* a.s.

Thus, also ϑ_n(H) :≡ ϑ_n^{(7)}(H) is a suitable function.

Remark 6.2. Computation of ϑ_n(H): In the above Examples 6.7–6.9 the expressions

tr((H_n^{(0)})^{-1}),  det(H_n^{(0)}),  (H_n^{(0)})^{-1}

must be determined. This can be done very efficiently by means of Lemma 6.23.

Non-optimal ϑ_n(H)

The quantities ϑ_n(H_n^{(0)}) in Examples 6.3–6.9 can be determined only at larger expense, since these examples were chosen such that condition (X9) holds. If this is not demanded, then simpler functions ϑ_n(H) can be taken into account. However, in this case the convergence properties in Theorems 6.12 and 6.13 cannot be guaranteed.

On the other hand, if all conditions except (X9) hold, then we have this result:

Theorem 6.14. We have n^λ (X_n - x*) → 0 a.s. for each λ such that

0 ≤ λ < min{1/2, α*/α̂}.

Proof. We may assume (see the proof of Theorem 6.12) that â ≥ α* α̂^{-1}. The assertion follows then from Theorem 6.1. □

According to Sect. 6.5.2, it would be favorable if the limit α̂ in (X14) fulfills

α̂ ≈ α*.

This condition helps to find appropriate functions ϑ_n(H). Among many other examples, the following one shows how such "non-optimal" functions ϑ_n(H) can be selected.

Example 6.10. The function ϑ^{(8)}(H) := min_{i≤ν} h_ii is continuous on IM(ν × ν, IR), where h_ii denotes the ith diagonal element of H. Hence, define

ϑ_n(H) := ϑ^{(8)}(H),  H_1 := IM(ν × ν, IR).

According to (6.106) we have

α_n* = ϑ^{(8)}(H_n^{(0)})  for ϑ^{(8)}(H_n^{(0)}) ∈ [ϑ, ϑ̄],  and  α_n* = ϑ_0  else,

and in case α̂ := min_{i≤ν} h*_ii ∈ (ϑ, ϑ̄) we get

lim_{n→∞} α_n* = α̂ a.s.

Here, the inequality α̂ ≥ α* holds, but condition (X9) does not hold in general.

Remark 6.3. Reduction of the expense in the computation of (α_n*): Formula (6.106) for the sequence (α_n*) can be modified such that α_n* is changed only at stages n ∈ IM. Here, IM is an arbitrary infinite subset of IN. The advantage of this procedure is that the time-consuming Cholesky decomposition of H_n^{(0)} + H_n^{(0)T} must be calculated only for "some" indices n ∈ IM. Using a sequence (α_n*) as described above, the above convergence properties hold true.

6.8 A Special Class of Adaptive Scalar Step Sizes

In contrast to the step size rule considered in Sect. 6.7.2, in the present section scalar step sizes of the type

R_n = r_n I  with r_n ∈ IR_+   (6.114a)

are studied for algorithm (6.2a), (6.2b), where the A_n-measurable random variables r_n are defined recursively as follows: For n = 1 we select positive A_1-measurable random variables r_1, w_1, and for n > 1 we put

r_n = r̂_{n-1} Q_n(r̂_{n-1}) (w_{n-1}/w_n) K_n.   (6.114b)

Here,

r̂_{n-1} := min{r̄_n, r_{n-1}},   (6.114c)
Q_n(r) = 1/(1 + r v_n(r))  if r ∈ [0, r̄_n],   (6.114d)

with positive A_n-measurable random variables r̄_n, w_n, K_n and v_n(r), r ∈ [0, r̄_n]. We assume that the following a priori assumptions hold:

(A1) For each n > 1 and r ∈ [0, r̄_n] suppose

r̄_0 ≤ inf_n r̄_n ≤ ∞ a.s.,   (Y1)
v ≤ v_n(r) ≤ v̄ a.s.,   (Y2)
w ≤ w_n ≤ w̄ a.s.,   (Y3)
lim_{n→∞} (K_2 · ... · K_n) = P* ∈ (0, ∞) a.s.   (Y4)

with positive numbers r̄_0, v, v̄, w, w̄ and a random variable P*.


As can be seen from the following example, the definition (6.114a)–(6.114d) is motivated by the standard step size r_n = d/n.

Example 6.11. Obviously, r_n = d/n can be represented by

r_n = r_{n-1}/(1 + r_{n-1} (1/d)),  n > 1.

Hence, this corresponds to (6.114a)–(6.114d) with r̄_n := ∞, v_n(r) := 1/d and w_n = K_n := 1.

In (6.114a)–(6.114d), the factor Q_n(r) guarantees the convergence of r_n to 0, the variance of the estimation error Z_n is taken into consideration by w_n, and the starting behavior of r_n can be influenced by K_n.

Concrete examples for Q_n(r), w_n and K_n are given later on. Next, the convergence behavior of (r_n) and (X_n) is studied. We suppose here that the step directions Y_n in algorithm (6.2a), (6.2b) are given by (W1a)–(W1e) in Sect. 6.6.
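A minimal sketch of one step of the recursion (6.114b)–(6.114d) (hypothetical helper `step_size`; the caller supplies v_n, w_n, K_n and r̄_n); with the choices of Example 6.11 it reproduces r_n = d/n exactly:

```python
def step_size(r_prev, r_bar, v_n, w_prev, w_cur, K_n):
    """One step of (6.114b)-(6.114d)."""
    r_hat = min(r_bar, r_prev)                  # (6.114c)
    Q = 1.0 / (1.0 + r_hat * v_n(r_hat))        # (6.114d)
    return r_hat * Q * (w_prev / w_cur) * K_n   # (6.114b)

# Example 6.11: r_bar_n = inf, v_n(r) = 1/d, w_n = K_n = 1  =>  r_n = d/n.
d = 0.5
r = d / 1.0                                     # r_1 = d
for n in range(2, 11):
    r = step_size(r, float("inf"), lambda t: 1.0 / d, 1.0, 1.0, 1.0)
```

After the loop r equals d/10, confirming that the scheme contains the classical 1/n step size as a special case.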

6.8.1 Convergence Properties

The following simple lemma is a key tool for the analysis of the step size rule (6.114a)–(6.114d).

Lemma 6.29. Let t_1, u, u_2, u_3, ... be given positive numbers.
(a) The function φ(t) := t/(1 + tu) is monotonically increasing on [0, ∞) and φ(t) ∈ [0, t) for each t > 0.
(b) If the sequence (t_n) is defined recursively by t_n := t_{n-1}/(1 + t_{n-1} u_n), n = 2, 3, ..., then t_n = t_1/(1 + t_1 (u_2 + ... + u_n)), n = 2, 3, ... .

Proof. The first part is clear, and the formula in (b) follows by induction on n. □

We show now that the step size r_n according to (6.114a)–(6.114d) is of order 1/n.
Lemma 6.30. lim extr_n (n · r_n) ∈ [w/(w̄ v̄), w̄/(w v)] a.s.

Proof. Putting

P_n := K_2 · ... · K_n,  t_n := r̂_n w_n/P_n,  u_n := (P_{n-1}/w_{n-1}) v_n(r̂_{n-1}),

after multiplication with w_n P_n^{-1}, the definition (6.114a)–(6.114d) yields

t_n ≤ t_{n-1}/(1 + t_{n-1} u_n),  n = 2, 3, ... .

Because of Lemma 6.29 we get t_n ≤ t_1/(1 + t_1 (u_2 + ... + u_n)). Hence, (Y2) and (Y3) yield

n r̂_n = n t_n P_n/w_n ≤ n t_1 (1 + t_1 (v/w̄)(P_1 + ... + P_{n-1}))^{-1} P_n/w
= (1/(n t_1) + (v/w̄)(1/n)(P_1 + ... + P_{n-1}))^{-1} P_n/w  a.s.

Therefore, by (Y4) and Lemma A.8, Appendix, we have

lim sup_n (n · r̂_n) ≤ w̄/(w v) a.s.,   (i)

and (r̂_n) converges to zero a.s. Because of (Y1) there exists a stochastic index n_0 with

r_n = r̂_n,  n ≥ n_0 a.s.   (ii)

From (i), (ii) and (6.114a)–(6.114d) we get

lim sup_n (n · r_n) ≤ w̄/(w v) a.s.,
t_n = t_{n-1}/(1 + t_{n-1} u_n),  n ≥ n_0 a.s.

Thus, for n ≥ n_0, Lemma 6.29 yields

t_n = t_{n_0}/(1 + t_{n_0}(u_{n_0+1} + ... + u_n)) a.s.   (6.115)

Applying now (ii), (Y2) and (Y3), we get

n r_n = n t_n P_n/w_n ≥ n t_{n_0} (1 + t_{n_0} (v̄/w)(P_{n_0} + ... + P_{n-1}))^{-1} P_n/w̄
= (1/(n t_{n_0}) + (v̄/w)(1/n)(P_{n_0} + ... + P_{n-1}))^{-1} P_n/w̄  a.s.

Finally, condition (Y4) and Lemma A.8, Appendix, yield

lim inf_n (n · r_n) ≥ w/(v̄ w̄) a.s. □
Obviously, the step size R_n in (6.114a) can be represented by R_n = ρ_n M_n, where

ρ_n := 1/n  and  M_n := (n r_n) I.   (6.116)

Concerning the Jacobian H(x) of G(x) we demand, as in Sect. 6.7.2 (cf. condition (X6)):

(A2) There is a positive number α such that

H(x) + H(x)^T ≥ 2α I  for all x ∈ D.   (Y5)

The sequence (M_n) fulfills conditions (U4a)–(U4d) from Sect. 6.3:

Lemma 6.31.
(a) sup_n ‖M_n‖ < ∞ a.s.
(b) For each n ∈ IN there are A_n-measurable matrices M̂_n and ΔM_n such that
(i) M_n = M̂_n + ΔM_n
(ii) lim_{n→∞} ‖ΔM_n‖ = 0 a.s.
(iii) u^T M̂_n H(x) u ≥ α (w/(w̄ v̄)) ‖u‖² for each x ∈ D and u ∈ IR^ν

Proof. For n ∈ IN define

M̂_n := max{n r_n, w/(w̄ v̄)} I  and  ΔM_n := M_n - M̂_n.   (6.117)

Because of condition (Y5), the assertions follow then immediately from Lemma 6.30. □

Because of Lemma 6.31, corresponding to Theorems 6.9 and 6.11, here we have the following result:

Theorem 6.15.
(a) X_n → x* a.s.
(b) H_n^{(0)} → H* a.s.

In order to derive further convergence properties, one needs again additional a posteriori assumptions on the random functions v_n(r) and the random quantities w_n:

(A3) There is a constant L̄ such that for all n ∈ IN

|v_n(r) - v_n(0)| ≤ L̄ r  for r ∈ [0, r̄_n] a.s.   (Y6)

(A4) If "X_n → x* a.s.," then there are positive numbers α̂ and w^{(1)}, ..., w^{(κ)} with

lim_{n→∞} v_n(0) = α̂ a.s.,   (Y7)
lim_{n∈IN(k)} w_n = w^{(k)} a.s., k = 1, ..., κ.   (Y8)

Now, a stronger form of Lemma 6.30 can be given:

Lemma 6.32. For each k ∈ {1, ..., κ} we have

lim_{n∈IN(k)} n · r_n = (α̂ w^{(k)} (q_1/w^{(1)} + ... + q_κ/w^{(κ)}))^{-1} a.s.

Proof. According to (6.115) and (ii) from the proof of Lemma 6.30 we obtain, for n ≥ n_0,

n r_n = n t_n P_n/w_n = (n t_{n_0}/(1 + t_{n_0}(u_{n_0+1} + ... + u_n))) P_n/w_n
= (1/(n t_{n_0}) + (1/n)(u_{n_0+1} + ... + u_n))^{-1} P_n/w_n.   (i)

Because of Lemma 6.30, (r̂_n) converges to zero a.s., and by Theorem 6.15 we have X_n → x* a.s. Hence, (Y4), (Y6), (Y7) and (Y8) yield

lim_{n∈IN(k)} (P_n, w_n, u_n) = (P*, w^{(k)}, α̂ P*/w^{(k)}) a.s.   (ii)

Having (i) and (ii), with Lemma A.8, Appendix, we obtain

lim_{n∈IN(k)} n r_n = (P* α̂ (q_1/w^{(1)} + ... + q_κ/w^{(κ)}))^{-1} P*/w^{(k)} a.s. □

According to Lemma 6.32 we now have

lim_{n∈IN(k)} M_n = d^{(k)} I a.s., k = 1, ..., κ,   (6.118)
d := q_1 d^{(1)} + ... + q_κ d^{(κ)} = 1/α̂,   (6.119)

where

d^{(k)} := (α̂ w^{(k)} (q_1/w^{(1)} + ... + q_κ/w^{(κ)}))^{-1}, k = 1, ..., κ.   (6.120)

Hence, the constant α̂ has the same meaning as in Sect. 6.7.2, see (6.97) and (6.96b). Thus, corresponding to condition (X9), here we demand:

α̂ < 2α* := λ_min(H* + H*^T).   (Y9)

Corresponding to Theorem 6.12, here we have the following result:

Theorem 6.16.
(a) If λ ∈ [0, 1/2), then n^λ (X_n - x*) → 0 a.s.
(b) For each ε_0 > 0 there is a set Ω_0 ∈ A such that
(i) P(Ω̄_0) < ε_0,
(ii) lim sup_n E_{Ω_0} n‖X_n - x*‖² ≤ (α̂ (2α* - α̂) (q_1/w^{(1)} + ... + q_κ/w^{(κ)})²)^{-1} Σ_{k=1}^κ q_k σ_k²/(w^{(k)})².   (6.121)

Proof. Because of (6.119), (6.120) and (Y9), as in the proof of Theorem 6.12, for the numbers â, d according to (6.9), (6.120), respectively, we have

â ≥ d α* > 1/2.

Based on (6.118)–(6.120), the assertions now follow from Theorem 6.1 and Corollary 6.3. □

According to Sect. 6.5.2, the right hand side of (6.121) attains a minimum at the values

α̂ = α*,   (6.122)
w^{(k)} = σ_k² := lim sup_{n∈IN(k)} E σ_{0,n}², k = 1, ..., κ.   (6.123)

Also the limit properties in Theorem 6.13 can be transferred easily to the present situation:

Theorem 6.17. Under the additional assumptions
(i) x* ∈ D̊
(ii) the covariance matrices of (Z_n) fulfill conditions (V6), (V8) in Sect. 6.5.3,

the sequence (√n (X_n - x*)) has an asymptotic N(0, V)-distribution, where the matrix V is determined by the equation

(1/α̂)(H* V + V H*^T) - V = (α̂ (q_1/w^{(1)} + ... + q_κ/w^{(κ)}))^{-2} Σ_{k=1}^κ q_k W^{(k)}/(w^{(k)})².   (6.124)

Similar to Theorem 6.13, the above result can be shown by using Theorem 6.4, where the numbers d^{(k)} are not given by (6.96a), but by (6.118).

6.8.2 Examples for the Function Q_n(r)

Several formulas are presented now for the function Q_n(r) in (6.114a)–(6.114d) such that conditions (Y1), (Y2) and (Y6) for the sequences (r̄_n) and (v_n(r)) hold true. In each case the function Q_n(r) depends on two A_n-measurable random parameters α_n* and Γ_n. Suppose that α_n*, Γ_n fulfill the following condition:

(A5) There are positive constants ϑ, ϑ̄ and γ̄ such that for n > 1

ϑ ≤ α_n* ≤ ϑ̄ a.s.,   (Y10)
|Γ_n| ≤ γ̄ a.s.   (Y11)

Example 6.12. Standard step size

Corresponding to Example 6.11 we take

Q_n(r) = 1/(1 + r α_n* - r² Γ_n),  r ∈ [0, r̄_n],   (6.125a)
r̄_n := α_n* min{K̄/Γ_n^-, q̄/Γ_n^+}.   (6.125b)

Here, K̄ > 0 and q̄ ∈ (0, 1) are arbitrary constants, and Γ_n^+, Γ_n^- denote the positive and negative part of Γ_n, respectively. The following lemma shows the feasibility of (6.125a), (6.125b):
Lemma 6.33. According to (6.125a), (6.125b) function Qn(r) has the form (6.114d), and for each r ∈ [0, r̄n] we have

(a) vn(r) = αn* − rΓn,
(b) ϑ(1 − q̄) ≤ vn(r) ≤ ϑ̄(1 + K̄) a.s.,
(c) r̄n ≥ r̄₀ := (ϑ/γ̄) min{K̄, q̄} > 0 a.s.
Proof. Because of (Y10), (Y11) and (6.125b), the inequality in (c) holds for
each n > 1 and r ∈ [0, r̄n ]. Furthermore,

ϑ(1 − q̄) ≤ αn*(1 − q̄) ≤ αn* − r̄nΓn⁺ ≤ αn* − rΓn
≤ αn* + r̄nΓn⁻ ≤ αn*(1 + K̄) ≤ ϑ̄(1 + K̄) a.s.,

hence, also (b) holds.
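As a quick numerical illustration (not part of the text), the step size factor of Example 6.12 can be coded directly. The concrete values of K̄, q̄, αn* and Γn below are assumed toy data; the sketch also checks the identity Qn(r) = (1 + r·vn(r))⁻¹ with vn(r) = αn* − rΓn from Lemma 6.33(a).

```python
# Numerical sketch of (6.125a), (6.125b); all parameter values are assumed toy data.
K_bar, q_bar = 2.0, 0.5              # arbitrary constants K̄ > 0, q̄ ∈ (0, 1)
alpha_star_n, Gamma_n = 0.8, -0.3    # assumed values of the A_n-measurable parameters

def pos(t): return max(t, 0.0)       # t⁺
def neg(t): return max(-t, 0.0)      # t⁻

def r_bar(alpha, Gamma):
    # Step size bound (6.125b): r̄n = αn* · min{ K̄/Γn⁻, q̄/Γn⁺ }; parts with zero
    # denominator are dropped (the bound is +∞ if Γn = 0).
    terms = []
    if neg(Gamma) > 0.0: terms.append(K_bar / neg(Gamma))
    if pos(Gamma) > 0.0: terms.append(q_bar / pos(Gamma))
    return alpha * min(terms) if terms else float("inf")

def Q(r, alpha, Gamma):              # (6.125a)
    return 1.0 / (1.0 + r * alpha - r * r * Gamma)

def v(r, alpha, Gamma):              # Lemma 6.33(a)
    return alpha - r * Gamma

rb = r_bar(alpha_star_n, Gamma_n)
print(rb, v(rb, alpha_star_n, Gamma_n))
```

For Γn < 0 only the K̄-part of (6.125b) is active, and vn(r̄n) attains exactly the upper bound αn*(1 + K̄) of Lemma 6.33(b).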


Example 6.13. “Optimal” step size
If EnZn = 0, n = 1, 2, … (cf. Robbins–Monro process), then by Lemma 6.1 and assumption (U5b) for each n ∈ IN we have

En‖Δn+1‖² ≤ ‖Δn − rn(Gn − G*)‖² + rn² En‖Zn‖²
≤ ( 1 − 2αrn + (β² + σ1²)rn² ) ‖Δn‖² + rn² σ²_{0,n}.

Here, α and β are positive numbers with

⟨Xn − x*, Gn − G*⟩ ≥ α‖Xn − x*‖²,  ‖Gn − G*‖ ≤ β‖Xn − x*‖ (see (6.4)).
6.8 A Special Class of Adaptive Scalar Step Sizes 243

Based on the above estimate for the quadratic error ‖Δn‖², in [89] an "optimal" deterministic step size rn is given which is defined recursively as follows:

rn = rn−1 · (1 − rn−1 α) / (1 − rn−1² (β² + σ1²))  for n = 2, 3, ….

This suggests the following function Qn(r): Let q̄ ∈ (0, 1), and for n > 1 define

Qn(r) = (1 − rαn*)/(1 − r²Γn),  r ∈ [0, r̄n],    (6.126a)

r̄n := q̄ · min{ αn*/Γn⁺, 1/αn* }.    (6.126b)
The feasibility of this approach is shown next:
Lemma 6.34. If Qn (r) and r̄n are defined by (6.126a), (6.126b), then Qn (r)
has the form (6.114d), and for each r ∈ [0, r̄n ] we have
(a) vn(r) = (αn* − rΓn)/(1 − rαn*) = αn* − ((Γn − αn*²)/(1 − rαn*)) · r,
(b) (1 − q̄)ϑ ≤ vn(r) ≤ ϑ̄ + q̄(ϑ̄² + γ̄)/((1 − q̄)ϑ) a.s.,
(c) |(Γn − αn*²)/(1 − rαn*)| ≤ (ϑ̄² + γ̄)/(1 − q̄) a.s.,
(d) r̄n ≥ r̄₀ := q̄ · min{ ϑ/γ̄, 1/ϑ̄ } > 0 a.s.
Proof. Assertion (a) follows by a simple computation.
If αn*² < Γn, then (6.126b) yields

0 ≤ (Γn − αn*²)/(1 − rαn*) ≤ (Γn − αn*²)/(1 − q̄αn*²Γn⁻¹) ≤ Γn.

Hence, because of (a) we get

αn* ≥ vn(r) ≥ αn* − rΓn ≥ αn* − r̄nΓn⁺ ≥ αn*(1 − q̄).

Let Γn ≤ αn*². Then (6.126b) yields

0 ≤ (αn*² − Γn)/(1 − rαn*) ≤ (αn*² + Γn⁻)/(1 − q̄),

and therefore with (a)

αn* ≤ vn(r) ≤ αn* + q̄(αn*² + Γn⁻)/((1 − q̄)αn*).

From the above relations and assumptions (Y10), (Y11) we obtain now the assertions in (b), (c) and (d).

Example 6.14. A further approach reads

Qn(r) = 1 − rαn* + r²Γn,  r ∈ [0, r̄n],    (6.127a)

r̄n := min{ K̄αn*/Γn⁻, q̄/(αn*(1 + K̄)), q̄αn*/Γn⁺ },    (6.127b)

where K̄ > 0 and q̄ ∈ (0, 1) are arbitrary constants. In this case we have the
following result:

Lemma 6.35. Let Qn (r) and r̄n be given by (6.127a), (6.127b). Then func-
tion Qn (r) can be represented by (6.114d), and for each r ∈ [0, r̄n ] we have

(a) vn(r) = (αn* − rΓn)/(1 − rαn* + r²Γn) = αn* − ((Γn − αn*(αn* − rΓn))/(1 − r(αn* − rΓn))) · r,
(b) (1 − q̄)ϑ ≤ vn(r) ≤ ϑ̄(1 + K̄)/(1 − q̄) a.s.,
(c) |(Γn − αn*(αn* − rΓn))/(1 − r(αn* − rΓn))| ≤ (γ̄ + ϑ̄²(1 + K̄))/(1 − q̄) a.s.,
(d) r̄n ≥ r̄₀ := min{ K̄ϑ/γ̄, q̄/(ϑ̄(1 + K̄)), q̄ϑ/γ̄ } > 0 a.s.

Proof. Obviously (a) holds. Using (6.127b) we find

un := αn* − rΓn ≥ αn* − r̄nΓn⁺ ≥ αn*(1 − q̄),
un ≤ αn* + r̄nΓn⁻ ≤ αn*(1 + K̄),
Qn(r) = 1 − run ≥ 1 − r̄nαn*(1 + K̄) ≥ 1 − q̄.

Hence, with (a) we get

αn*(1 − q̄) ≤ un ≤ un/(1 − run) = vn(r) ≤ αn*(1 + K̄)/(1 − q̄),

|(Γn − αn*un)/(1 − run)| ≤ (|Γn| + αn*un)/(1 − q̄) ≤ (|Γn| + αn*²(1 + K̄))/(1 − q̄).

Applying now assumptions (Y10) and (Y11), the assertions in (b), (c) and (d)
follow.

Example 6.15. Consider Qn (r) given by

Qn(r) = exp(−rαn* + r²Γn),  r ∈ [0, r̄n],    (6.128a)

r̄n := min{ r̄, q̄αn*/Γn⁺ }.    (6.128b)

In (6.128b) r̄ > 0 and q̄ ∈ (0, 1) are arbitrary constants. The feasibility of this
approach is guaranteed by the next lemma:

Lemma 6.36. Let Qn(r) and r̄n be given by (6.128a), (6.128b). Then Qn(r) can be represented by (6.114d), where for each r ∈ [0, r̄n] we have

(a) vn(r) = (1/r)( exp(rαn* − r²Γn) − 1 ) = αn* + Δαn*(r) · r with

Δαn*(r) := ((αn* − rΓn)/r) · ( (exp(rαn* − r²Γn) − 1)/(rαn* − r²Γn) − 1 ) − Γn,

(b) ϑ(1 − q̄) ≤ vn(r) ≤ ū(1 + (r̄ū/2) exp(r̄ū)) a.s., where ū := ϑ̄ + r̄γ̄,
(c) |Δαn*(r)| ≤ (ū²/2) exp(r̄ū) + γ̄ a.s.,
(d) r̄n ≥ r̄₀ := min{ r̄, q̄ϑ/γ̄ } > 0 a.s.

Proof. Assertion (a) follows by an elementary calculation. According to


(6.128b) we have
αn*(1 − q̄) ≤ αn* − r̄nΓn⁺ ≤ αn* − rΓn =: un ≤ αn* + r̄|Γn|.    (i)

Furthermore, for each t > 0 we may write

exp(t) = 1 + t + (t²/2) exp(τ(t)) with τ(t) ∈ [0, t].    (ii)

Hence, with (a) we get

vn(r) = un (exp(run) − 1)/(run) = un ( 1 + (run/2) exp(τ(run)) ).

Together with (i) we find

αn*(1 − q̄) ≤ vn(r) ≤ un ( 1 + (r̄un/2) exp(r̄un) ),    (iii)

and (i), (ii) yield also the relation

|Δαn*(r)| = |(un/r)( (exp(run) − 1)/(run) − 1 ) − Γn|
= |(un²/2) exp(τ(run)) − Γn| ≤ (un²/2) exp(r̄un) + |Γn|.    (iv)

The assertions in (b), (c) and (d) follow now from (iii), (iv) and the assumptions (Y10), (Y11).

Choice of (αn*)

In the above examples for Qn (r) condition (Y6) holds, and

vn (0) = αn∗ a.s., for all n > 1. (6.129)



Hence, assumption (Y7) is equivalent to


lim αn∗ = α
- a.s. (Y12)
n→∞

Note that (Y12) and (Y9) are exactly the conditions which are required for (αn*) in Sect. 6.7.2 (see (X8) and (X9)). Theorem 6.15 guarantees also in the present case that Xn → x* and Hn^(0) → H* a.s. Thus, the sequence (αn*) can be chosen also according to (6.106). Then (Y10) holds because of (X13), and due to Lemma 6.27 also conditions (Y12) and (Y9) are fulfilled.

Selection of Sequence (Γn )


The An-measurable factors Γn in the above examples for Qn(r) can be chosen arbitrarily, provided that condition (Y11) holds. According to Lemma 6.32, Theorems 6.16 and 6.17, (Γn) does not influence the limit of (nrn) and the asymptotic variance of (√n(Xn − x*)). Of course, the simplest choice is

Γn ≡ 0, n > 1.    (6.130)

A positive Γn means that the step size rn converges to zero more slowly than in the standard case (6.130). Example 6.13 suggests the following choice for (Γn):

Γn := βn², if βn² ≤ γ̄, and Γn := γ₀ else, for n = 2, 3, …,    (6.131)

where βn is an An-measurable estimate of the maximum eigenvalue β* of the matrix (1/2)(H(x*) + H(x*)ᵀ), and γ₀ is a constant such that |γ₀| ≤ γ̄. Since

Hn^(0) → H* := H(x*) a.s.,    (6.132)

see Theorem 6.15(b), the following approximations βn of β* can be used:

(a) βn = ψ^(1)(Hn^(0)) with ψ^(1)(H) := tr(H),    (6.133)
(b) βn = ψ^(2)(Hn^(0)) with ψ^(2)(H) := max_{i≤ν} h_ii,    (6.134)

where h_ii denotes the ith diagonal element of H. Because of (6.132) we have

βn → ψ^(i)(H*) a.s.,    (6.135)

where i = 1 for (6.133) and i = 2 for (6.134). The bound γ̄ in (6.131) is chosen such that

(ψ^(i)(H*))² ≤ γ̄.    (6.136)
Instead of (6.131) we can also take
Γn := αn∗2 , n > 1. (6.137)
In this case very simple expressions are obtained for the step size bound r̄n
in Examples 6.12–6.15.
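The estimates (6.133), (6.134) and the rule (6.131) are easy to code; the matrix H below is an assumed stand-in for the Hessian estimate Hn^(0).

```python
# Sketch of ψ(1), ψ(2) from (6.133), (6.134) and of the choice (6.131) of Γn;
# the symmetric matrix H is assumed example data for Hn(0).
H = [[2.0, 0.5, 0.0],
     [0.5, 1.5, 0.2],
     [0.0, 0.2, 3.0]]

def psi1(H):
    # ψ(1)(H) = tr(H); for a positive semidefinite H this bounds the largest eigenvalue from above
    return sum(H[i][i] for i in range(len(H)))

def psi2(H):
    # ψ(2)(H) = max_i h_ii; for a symmetric H this bounds the largest eigenvalue from below
    return max(H[i][i] for i in range(len(H)))

def Gamma_rule(beta_n, gamma_bar, gamma0=0.0):
    # (6.131): Γn = βn² if βn² ≤ γ̄, else a constant γ0 with |γ0| ≤ γ̄
    return beta_n ** 2 if beta_n ** 2 <= gamma_bar else gamma0

gamma_bar = psi1(H) ** 2            # cf. (6.136): γ̄ chosen with (ψ(i)(H*))² ≤ γ̄
print(psi1(H), psi2(H), Gamma_rule(psi2(H), gamma_bar))
```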

6.8.3 Optimal Sequence (wn )

The factors wn in (6.114a)–(6.114d) are defined in the literature often by

wn ≡ 1, n ≥ 1, (6.138)

cf. Examples 6.11, 6.13. However, this approach does not yield the minimum asymptotic variance bound for √n(Xn − x*), see Theorem 6.16(b).
According to Sect. 6.7.2, the An-measurable random variables ŵn^(1), …, ŵn^(κ), defined by (6.103), fulfill for k = 1, …, κ the following conditions:

w ≤ ŵn^(k) ≤ w̄,    (6.139)

lim_{n∈IN(k)} ŵn^(k) = lim_{n∈IN(k)} En‖Zn − EnZn‖² a.s.    (6.140)

Since the limits w^(1), …, w^(κ) of (wn) from (Y8) should fulfill in the best case (6.123), because of (6.140) we prefer the following choice of (wn): For an integer n ≥ 1 belonging to the set IN(k) with k ∈ {1, …, κ}, put

wn := ŵn^(k).    (6.141)

Then, because of (6.139), (6.140) also conditions (Y3) and (Y8) hold.

6.8.4 Sequence (Kn )

According to Lemma 6.32, Theorems 6.16, 6.17, the positive An-measurable factors Kn in the step size rule (6.114b) play no role in the formulas for the limit of (nrn) and the asymptotic variance of (√n(Xn − x*)).
Selecting (Kn ), we have only to take into account the convergence condi-
tion (Y4). This requires

Kn · Kn+1 · . . . ≈ 1 for “large” n.

Hence, the factors Kn are responsible only for the initial behavior of the step
size rn and the iterates Xn .
By the factor Kn (as well as by wn ) the step size can be enlarged at
rn−1 → rn . This is necessary, e.g., in case of a too small starting step size r1 .
An obvious definition of Kn reads

Kn ≡ 1, n > 1. (6.142)

However, it is preferable to take Kn > 1 as long as algorithm (6.2a), (6.2b)


shows an “almost deterministic” behavior. This holds as long as the influence
of the estimation error Zn−1 to the step direction Yn−1 = Gn−1 + Zn−1 is
small.
Based on the above considerations, some definitions for (Kn ) are presented:

Adaptive Selection of (Kn )

Suppose
Kn = exp(sn² kn), n > 1,    (6.143)

where (sn) is a deterministic sequence and kn denotes an An-measurable random number such that

Σ_n sn² < ∞,    (Y13)

sup_n E|kn| < ∞.    (Y14)

Consequently,

Σ_n E(sn² |kn|) < ∞ and especially Σ_n sn² |kn| < ∞ a.s.

Hence, due to (6.143), sequence (Kn) fulfills condition (Y4). E.g., (sn) can be chosen as follows:

sn = s1/n, n > 1, with s1 > 0.    (6.144)
In the following some examples are given for sequence (kn ).

Example 6.16. According to assumption (a) in Sect. 6.3, the norm of the An -
measurable difference ∆Xn := Xn − Xn−1 is bounded by δ0 . Hence,

(a) kn := ⟨ΔXn, ΔXn−1⟩,    (6.145)
(b) kn := ⟨ΔXn, ΔXn−1⟩⁺,    (6.146)
(c) kn := cos(ΔXn, ΔXn−1) := ⟨ΔXn, ΔXn−1⟩/(‖ΔXn‖ ‖ΔXn−1‖),    (6.147)
(d) kn := cos(ΔXn, ΔXn−1)⁺    (6.148)

fulfill condition (Y14) and are therefore appropriate definitions for (kn ).
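The four choices (6.145)–(6.148), combined with (6.143) and (6.144), can be sketched as follows; the increments ΔXn, ΔXn−1 are assumed example vectors.

```python
import math

# Sketch of the adaptive factors kn from (6.145)-(6.148) and of Kn = exp(sn² kn);
# the increments dx_n = ΔXn and dx_nm1 = ΔXn-1 are assumed example data.
def dot(u, v): return sum(a * b for a, b in zip(u, v))
def norm(u): return math.sqrt(dot(u, u))
def pos(t): return max(t, 0.0)

dx_n, dx_nm1 = [0.3, -0.1], [0.2, 0.05]

k_a = dot(dx_n, dx_nm1)                      # (6.145)
k_b = pos(k_a)                               # (6.146)
k_c = k_a / (norm(dx_n) * norm(dx_nm1))      # (6.147): cos(ΔXn, ΔXn-1)
k_d = pos(k_c)                               # (6.148)

s1, n = 1.0, 10                              # sn = s1/n, cf. (6.144)
K_n = math.exp((s1 / n) ** 2 * k_d)          # (6.143)
print(k_a, k_c, K_n)
```

Since Σ_n sn² < ∞ and the kn are bounded here, Kn → 1, so these factors only influence the initial phase of the iteration, in accordance with (Y4).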

Example 6.17. The factors kn can be chosen also as functions of the approximations Yn−1^(0), …, Yn−1^(2l_{n−1}) of the step direction Yn−1, cf. (W1a)–(W1e) in Sect. 6.6. Thereby we need the following result:

Lemma 6.37. For arbitrary index n ∈ IN and j, k ∈ {0, …, 2ln} we have

E( ‖Yn^(j)‖ ‖Yn^(k)‖ ) ≤ L² := 4( ‖G*‖² + (β² + t1²)δ0² + t0² + L1² ) < ∞.

Here, δ0 denotes the diameter of D, and β, t0, t1 and L1 are given by (6.3), (6.71), (6.73) and (W2a).

Proof. Corresponding to the proof of Lemma 6.8 we find


 
‖Yn^(j)‖² = ‖G* + (Gn^(j) − G*) + (Zn^(j) − EnZn^(j)) + EnZn^(j)‖²
≤ 4( ‖G*‖² + ‖Gn^(j) − G*‖² + ‖Zn^(j) − EnZn^(j)‖² + ‖EnZn^(j)‖² ).

By (6.4), (6.70), (6.71) and (W2a) we have

‖Gn^(j) − G*‖ ≤ β‖Xn^(j) − x*‖,
E‖Zn^(j) − EnZn^(j)‖² ≤ t0² + t1² E‖Δn‖²,
‖EnZn^(j)‖ ≤ L1.

Hence, due to ‖Xn^(j) − x*‖, ‖Δn‖ ≤ δ0 and the above inequalities we get

E‖Yn^(j)‖² ≤ L² as well as E‖Yn^(k)‖² ≤ L².

Applying now Schwarz' inequality, the assertion follows.

Having this lemma, we find that condition (Y14) holds in the following cases:

(a) kn := (1/l_{n−1}) Σ_{i=1}^{l_{n−1}} ⟨Yn−1^(i), Yn−1^(l_{n−1}+i)⟩,    (6.149)
(b) kn := (1/l_{n−1}) Σ_{i=1}^{l_{n−1}} ⟨Yn−1^(i), Yn−1^(l_{n−1}+i)⟩⁺,    (6.150)
(c) kn := ‖Yn−1‖² · I{ min_{i=1,…,l_{n−1}} ⟨Yn−1^(i), Yn−1^(l_{n−1}+i)⟩ > 0 },    (6.151)
(d) kn := (1/(2l_{n−1})) Σ_{j=1}^{2l_{n−1}} ⟨Yn−1^(0), Yn−1^(j)⟩⁺.    (6.152)

Similar examples may be found easily.
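A sketch of the choices (6.149)–(6.152) with assumed direction approximations and l_{n−1} = 2; in (6.151), ‖Y^(0)‖² is used as a stand-in for ‖Yn−1‖².

```python
# Sketch of kn from (6.149)-(6.152); the vectors Y[j] are assumed example
# approximations Y^(0), ..., Y^(2l) of the step direction, with l = l_{n-1} = 2.
def dot(u, v): return sum(a * b for a, b in zip(u, v))
def pos(t): return max(t, 0.0)

l = 2
Y = [[1.0, 0.5],                 # Y^(0)
     [0.9, 0.6], [1.1, 0.4],     # Y^(1), Y^(2)
     [1.0, 0.55], [0.95, 0.45]]  # Y^(3) = Y^(l+1), Y^(4) = Y^(l+2)

k_a = sum(dot(Y[i], Y[l + i]) for i in range(1, l + 1)) / l             # (6.149)
k_b = sum(pos(dot(Y[i], Y[l + i])) for i in range(1, l + 1)) / l        # (6.150)
ind = 1.0 if min(dot(Y[i], Y[l + i]) for i in range(1, l + 1)) > 0 else 0.0
k_c = dot(Y[0], Y[0]) * ind                                             # (6.151), ||Y^(0)||² as surrogate
k_d = sum(pos(dot(Y[0], Y[j])) for j in range(1, 2 * l + 1)) / (2 * l)  # (6.152)
print(k_a, k_b, k_c, k_d)
```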


7
Computation of Probabilities
of Survival/Failure by Means of Piecewise
Linearization of the State Function

7.1 Introduction
In reliability analysis [31,98,138] of technical or economic systems/structures,
the safety or survival of the underlying technical system or structure S is
described first by means of a criterion of the following type:


         ⎧ < 0, a safe state σ of S exists: S is in a safe state, survival of S is guaranteed,
s*(R, L) ⎨ > 0, a safe state σ of S does not exist: S is in an unsafe state, failure occurs (may occur),
         ⎩ = 0, S is in a safe or unsafe state, resp. (depending on the special situation).    (7.1a)

Here,
s∗ = s∗ (R, L) (7.1b)
denotes the so-called (limit) state or performance function of the sys-
tem/structure S depending basically on an
– mR-vector R of material or structural resistance parameters
– mL-vector L of external load factors
In case of economic systems, the "resistance" vector R represents, e.g.,
the transfer capacity of an input–output system, and the “load” vector L may
be interpreted as the vector of demands.
The resistance and load vectors R, L depend on certain noncontrollable
and controllable model parameters, hence,
   
R = R(a(ω), x),  L = L(a(ω), x)    (7.1c)

with a ν-vector a = a(ω) of random model parameters and an r-vector x of (nominal) design or decision variables. Hence, according to (7.1b), (7.1c), we also have the more general relationship

s* = s*(a(ω), x).    (7.1d)

Remark 7.1. Note that system/structure S = Sa(ω),x is represented here parametrically by the random vector a = a(ω) of model parameters and the vector x of design or decision variables.
 
For the definition of the state function s* = s*(a(ω), x), in static technical or economic problems one has, after a possible discretization (e.g., by FEM), the following two basic relations [100]:

(I) The state equation (equilibrium, balance)

T(σ, a(ω), x) = 0, σ ∈ Σ₀.    (7.2a)

Here, σ is the state mσ-vector of the system/structure S, e.g., a stress state vector of a mechanical structure, lying in a range Σ₀ ⊂ IR^{mσ}. Furthermore, for given (a, x), T = T(σ, (a, x)) denotes a linear or nonlinear mapping from IR^{mσ} into IR^{mT} representing, e.g., the structural equilibrium in elastic or plastic structural design [50, 53, 57, 66, 129].
 
Remark 7.2. We assume that for each (a(ω), x) under consideration, relation (7.2a) has at least one solution σ. Depending on the application, (7.2a) may have a unique solution σ = σ(a(ω), x), as in the elastic case [129], or multiple solutions σ, as in the plastic case [57]. Note that condition (7.2a) includes also inequality state conditions T(σ, a(ω), x) ≤ 0.

In case of a linear(ized) system/structure the state equation (7.2a) reads

C(a(ω), x) σ = L(a(ω), x)    (7.2b)

with the equilibrium matrix C = C(a(ω), x).
(II) Admissibility condition (operating condition)
In addition to the state equation (7.2a), for the state vector σ we have the admissibility or operating condition

σ ∈ Σ(a(ω), x) or σ ∈ int Σ(a(ω), x)    (7.3a)

with the admissible (feasible) state domain

Σ(a(ω), x) := { σ ∈ IRⁿ : g(σ, a(ω), x) ≤ 0 }    (7.3b)

depending on the model parameter vector a = a(ω) and the design vector x. Here,

g = g(σ, a(ω), x)    (7.3c)

denotes a vector function of σ, which depends on (a(ω), x), and has appropriate analytical properties.
In many applications or after linearization the function g is affine-linear, hence,

g(σ, a(ω), x) := H(a(ω), x) σ − h(a(ω), x)    (7.3d)

with a matrix (H, h) = ( H(a(ω), x), h(a(ω), x) ).
We assume that the zero state σ = 0 is an interior point of the admissible state domain Σ(a(ω), x), i.e.,

0 ∈ int Σ(a(ω), x) or g(0, a(ω), x) < 0.    (7.3e)

Thus, in case (7.3d) we have

h(a(ω), x) > 0.    (7.3f)
In many cases the feasible state domain Σ(a(ω), x) is a closed convex set containing the zero state as an interior point. In these cases the admissibility condition (7.3a) can be represented also by

π(σ, a(ω), x) ≤ 1, π(σ, a(ω), x) < 1, resp.,    (7.4a)

where π = π(σ, a(ω), x) denotes the Minkowski or distance functional of the admissible state domain Σ = Σ(a(ω), x):

π(σ, a(ω), x) = inf{ λ > 0 : σ/λ ∈ Σ(a(ω), x) }.    (7.4b)

By decomposition, discretization, the state vector σ is often partitioned,

σ = (σi)_{1≤i≤nG},    (7.5a)

into a certain number nG of n0-subvectors σi, i = 1, …, nG; furthermore, the admissible state domain Σ(a(ω), x) is represented then by the product

Σ(a(ω), x) = ⊗_{i=1}^{nG} Σi(a(ω), x)    (7.5b)

of nG factor domains

Σi(a(ω), x) := { σi : gi(σi, a(ω), x) ≤ 0 }    (7.5c)

with the functions

gi = gi(σi, a(ω), x), i = 1, …, nG.    (7.5d)

In the above mentioned convex case, due to (7.5b), the distance functional of Σ(a(ω), x) can be represented by

π(σ, a(ω), x) = max_{1≤i≤nG} πi(σi, a(ω), x),    (7.5e)

where

πi(σi, a(ω), x) := inf{ λ > 0 : σi/λ ∈ Σi(a(ω), x) }    (7.5f)

denotes the Minkowski functional of the factor domain Σi(a(ω), x).

7.2 The State Function s∗

Based on the relations described in the first section, for a given pair of vectors (a(ω), x), the state function s* = s*(a(ω), x) can be defined by the minimum value function of one of the following optimization problems:

Problem A

min s    (7.6a)
s.t.
T(σ, a(ω), x) = 0    (7.6b)
g(σ, a(ω), x) ≤ s·1    (7.6c)
σ ∈ Σ₀,    (7.6d)

where 1 := (1, …, 1)ᵀ. Note that due to Remark 7.2, problem (7.6a)–(7.6d) always has feasible solutions (s, σ) = (s̃, σ̃) with any solution σ̃ of (7.2a) and some sufficiently large s̃ ∈ IR.
Obviously, in case of a linear state equation (7.2b), problem (7.6a)–(7.6d) is convex, provided that g = g(σ, a(ω), x) is a convex vector function of σ, and σ has a convex range Σ₀.

Problem B

If T and g are linear with respect to σ, and the range Σ₀ of σ is a closed convex polyhedron

Σ₀ := { σ : Aσ ≤ b }    (7.7)

with a fixed matrix (A, b), then (7.6a)–(7.6d) can be represented, see (7.2b) and (7.3d), by the linear program (LP):

min s    (7.8a)
s.t.
C(a(ω), x) σ = L(a(ω), x)    (7.8b)
H(a(ω), x) σ − h(a(ω), x) ≤ s·1    (7.8c)
Aσ ≤ b.    (7.8d)

The dual representation of the LP (7.8a)–(7.8d) reads:

max L(a(ω), x)ᵀu − h(a(ω), x)ᵀv − bᵀw    (7.9a)
s.t.
C(a(ω), x)ᵀu − H(a(ω), x)ᵀv − Aᵀw = 0    (7.9b)
1ᵀv = 1    (7.9c)
v, w ≥ 0.    (7.9d)

Remark 7.3. If the state vector σ has full range Σ₀ = IRⁿ, then (7.8d) is canceled, and in (7.9a)–(7.9d) the expressions bᵀw, Aᵀw, w are canceled.
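To make Problem B concrete, the following minimal sketch (all matrices are assumed toy data) treats the elastic case of Remark 7.2 with full range Σ₀ = IRⁿ (Remark 7.3): the state equation (7.8b) then determines σ uniquely, and the smallest s satisfying (7.8c) is simply s* = max_i (H σ − h)_i, so no LP solver is needed.

```python
# Sketch of the state function s* of Problem B for an invertible 2x2 equilibrium
# matrix C and full state range (Remark 7.3); all data are assumed toy values.
def solve2(C, L):
    # Cramer's rule for the 2x2 state equation C σ = L, cf. (7.8b)
    det = C[0][0] * C[1][1] - C[0][1] * C[1][0]
    return [(L[0] * C[1][1] - L[1] * C[0][1]) / det,
            (C[0][0] * L[1] - C[1][0] * L[0]) / det]

def state_function(C, L, H, h):
    sigma = solve2(C, L)
    # smallest s with H σ - h ≤ s·1, cf. (7.8c)
    return max(sum(Hi[j] * sigma[j] for j in range(2)) - hi
               for Hi, hi in zip(H, h))

C = [[4.0, 1.0], [1.0, 3.0]]                  # equilibrium matrix C(a(ω), x)
L = [1.0, 2.0]                                # load vector L(a(ω), x)
H = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]    # admissibility H σ ≤ h, cf. (7.3d)
h = [1.0, 1.0, 0.5]

s_star = state_function(C, L, H, h)
print(s_star)          # s* < 0: a safe state exists, cf. Theorem 7.1(a)
```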
In case of a closed convex admissible state domain Σ(a(ω), x), we also have the admissibility condition (7.4a) with the distance functional π = π(σ, a(ω), x) of Σ(a(ω), x). Then, the state function s* = s*(a(ω), x) can be defined also by the minimum value function of the following optimization problem:

Problem C

min s    (7.10a)
s.t.
T(σ, a(ω), x) = 0    (7.10b)
π(σ, a(ω), x) − 1 ≤ s    (7.10c)
σ ∈ Σ₀.    (7.10d)

Special forms of (7.10a)–(7.10d) are obtained for a linear state equation (7.2b) and/or for special admissible domains Σ(a(ω), x), as, e.g., closed convex polyhedrons, cf. (7.3b), (7.3d),

Σ(a(ω), x) = { σ : H(a(ω), x) σ ≤ h(a(ω), x) }.    (7.11)

If (Hi, hi), i = 1, …, mH, denote the rows of (H, h), then, because of property (7.3f), the distance functional of the set in (7.11) is given by

π(σ, a(ω), x) = inf{ λ > 0 : Hi(a(ω), x) σ/λ ≤ hi(a(ω), x), i = 1, …, mH }
             = inf{ λ > 0 : (1/hi(a(ω), x)) Hi(a(ω), x) σ ≤ λ, i = 1, …, mH }
             = max_{0≤i≤mH} H̃i(a(ω), x) σ,    (7.12a)

where

H̃i(a(ω), x) := 0 for i = 0, and H̃i(a(ω), x) := (1/hi(a(ω), x)) Hi(a(ω), x) for 1 ≤ i ≤ mH.    (7.12b)

7.2.1 Characterization of Safe States

Based on the state equation (I) and the admissibility condition (II), the safety of a system or structure S = Sa(ω),x represented by the pair of vectors (a(ω), x) can be defined as follows:

Definition 7.1. Let a = a(ω), x, resp., be given vectors of model parameters, design variables. A system Sa(ω),x having configuration (a(ω), x) is in a safe state (admits a safe state) if there exists a "safe" state vector σ̃ ∈ Σ₀ fulfilling (i) the state equation (7.2a) or (7.2b), hence,

T(σ̃, a(ω), x) = 0, C(a(ω), x) σ̃ = L(a(ω), x), resp.,    (7.13a)

and (ii) the admissibility condition (7.3a), (7.3b) or (7.4a), (7.4b) in the standard form

g(σ̃, a(ω), x) ≤ 0, π(σ̃, a(ω), x) ≤ 1, resp.,    (7.13b)

or in the strict form

g(σ̃, a(ω), x) < 0, π(σ̃, a(ω), x) < 1, resp.,    (7.13c)

depending on the application. System Sa(ω),x is in an unsafe state if no safe state vector σ̃ exists for Sa(ω),x.

Remark 7.4. If Sa(ω),x is in an unsafe state, then, due to violations of the basic
safety conditions (I) and (II), failures and corresponding damages of Sa(ω),x
may (will) occur. E.g., in mechanical structures (trusses, frames) high external
loads may cause very high internal stresses in certain elements which damage
then the structure, at least partly. Thus, in this case failures and therefore
compensation and/or repair costs occur.
 
By means of the state function s∗ = s∗ a(ω), x , the safety
 or failure of
a system or structure Sa(ω),x having configuration a(ω), x can be described
now as follows:
 
Theorem 7.1. Let s∗ = s∗ a(ω), x be the state function of Sa(ω),x defined
by the minimum function of one of the above optimization problems (A)–(C).
Then, the following basic relations hold:
 
(a) If s∗ a(ω), x < 0, then a safe state vector σ̃ exists fulfilling the strict
admissibility
 condition (7.13c). Hence, Sa(ω),x is in a safe state.

(b) If s a(ω), x > 0, then no safe state vector exists. Hence, Sa(ω),x is in an
unsafe state and failures and damages may occur.  
(c) If a safe state vector σ̃ exists, then s∗ a(ω), x ≤ 0, s∗ a(ω), x < 0, resp.,
corresponding to the standard, the strict admissibility condition (7.13b),
(7.13c).


 
(d) If s*(a(ω), x) ≥ 0, then no safe state vector exists in case of the strict admissibility condition (7.13c).
(e) Suppose that the minimum (s*, σ*) is attained in the selected optimization problem defining the state function s*. If s*(a(ω), x) = 0, then a safe state vector exists in case of the standard admissibility condition (7.13b), and no safe state exists in case of the strict condition (7.13c).
 
Proof. (a) If s*(a(ω), x) < 0, then according to the definition of the minimum value function s*, there exists a feasible point (s̃, σ̃) in Problem (A), (B), (C), resp., such that, cf. (7.6c), (7.8c), (7.10c),

g(σ̃, a(ω), x) ≤ s̃·1, s̃ < 0,
π(σ̃, a(ω), x) − 1 ≤ s̃, s̃ < 0.

Consequently, σ̃ ∈ Σ₀ fulfills the state equation (7.13a) and the strict admissibility condition (7.13c). Hence, according to Definition 7.1, the vector σ̃ is a safe state vector fulfilling the strict admissibility condition (7.13c).
(7.13c).  
(b) Suppose that s∗ a(ω), x > 0, and assume that there exists a safe state
vector σ̃ for Sa(ω),x . Thus, according to Definition 7.1, σ̃ fulfills the state
equation (7.13a) and the admissibility condition (7.13b). Consequently,
(s̃,σ̃) = (0,
 σ̃) is feasible in Problem (A), (B)  or (C), and we obtain
s∗ a(ω), x ≤ 0 in contradiction to s∗ a(ω), x > 0. Thus, no safe state
vector exists in this case.
(c) Suppose that σ̃ is a safe state for Sa(ω),x . According to Definition 7.1,
vector σ̃ fulfills (7.13a) and condition (7.13b), (7.13c), respectively. Thus,
(s̃, σ̃) with s̃ = 0, with a certain s̃ < 0,
 resp., is feasible in Problem (A),

(B)  or (C).
 Hence, we get s a(ω), x ≤ 0 in case of the standard and

s a(ω), x < 0 in case of the strict admissibility condition.
(d) Thisassertion  follows directly from assertion (c).

(e) If
' s a(ω), x = 0, then under the present assumption there exists σ ∗ ∈
∗ ∗ ∗
0 such that (s , σ ) = (0, σ ) is an optimal solution of Problem (A), (B)

or (C). Hence, σ fulfills the state equation (7.13a) and the admissibility
condition (7.13b). Thus, σ ∗ is a safe state vector with respect to the
standard admissibility condition. On the other hand, because of (7.13c),
no safe state vector σ̃ exists in case of the strict admissibility condition.

7.3 Probability of Safety/Survival


In the following we assume that for all design vectors x under consideration the limit state surface

LSx := { a ∈ IRν : s*(a, x) = 0 }    (7.14a)

in IRν has probability measure zero, hence,


   
P( s*(a(ω), x) = 0 ) = 0 for x ∈ D,    (7.14b)

where D ⊂ IRr denotes the set of admissible design vectors x.


Let x ∈ IRr be an arbitrary, but fixed design vector. According to Theorem 7.1, the so-called safe (parameter) domain Bs,x of the system/structure Sa(ω),x is defined by the set

Bs,x := { a ∈ IRν : s*(a, x) < 0 }    (7.15a)

in the space IRν of random parameters a = a(ω). Correspondingly, the unsafe or failure (parameter) domain Bf,x of Sa(ω),x is defined by (Fig. 7.1)

Bf,x := { a ∈ IRν : s*(a, x) > 0 }.    (7.15b)

Thus, the probability of safety or survival ps = ps(x) and the probability of failure pf = pf(x) are given, cf. (7.14a), (7.14b), by

ps(x) := P( Sa(ω),x is in a safe state ) = P( a(ω) ∈ Bs,x )
       = P( s*(a(ω), x) < 0 ) = P( s*(a(ω), x) ≤ 0 ),    (7.16a)

pf(x) := P( Sa(ω),x is in an unsafe state ) = P( a(ω) ∈ Bf,x )
       = P( s*(a(ω), x) > 0 ) = P( s*(a(ω), x) ≥ 0 ).    (7.16b)

Fig. 7.1. Safe/unsafe domain



For a large class of probability distributions Pa(·) of the random vector a = a(ω) of noncontrollable model parameters, e.g., for continuous distributions, a = a(ω) can be represented by a certain 1–1-transformation

a(ω) = Γ⁻¹(z(ω)) or z(ω) = Γ(a(ω))    (7.17)

of a random ν-vector z = z(ω) having a standard normal distribution N(0, I) with zero mean and unit covariance matrix. Known transformations Γ : IRν → IRν are the Rosenblatt transformation, see Sect. 2.5.3, the orthogonal transformation [104], and the Normal-Tail approximation [30].
Inserting (7.17) into the definitions (7.16a), (7.16b), we get, cf. (7.14a), (7.14b),

ps(x) = P( s*Γ(z(ω), x) < 0 ) = P( s*Γ(z(ω), x) ≤ 0 ),    (7.18a)
pf(x) = P( s*Γ(z(ω), x) > 0 ) = P( s*Γ(z(ω), x) ≥ 0 ),    (7.18b)

where the transformed state function s*Γ = s*Γ(z, x) is defined by

s*Γ(z, x) := s*( Γ⁻¹(z), x ).    (7.18c)

Introducing the transformed safe, unsafe domains (Fig. 7.2)

BΓs,x := { z ∈ IRν : s*Γ(z, x) < 0 } = { z ∈ IRν : s*( Γ⁻¹(z), x ) < 0 } = Γ(Bs,x),    (7.19a)
BΓf,x := { z ∈ IRν : s*Γ(z, x) > 0 } = { z ∈ IRν : s*( Γ⁻¹(z), x ) > 0 } = Γ(Bf,x),    (7.19b)

Fig. 7.2. Transformed safe/unsafe domains



we also have

ps(x) = P( z(ω) ∈ BΓs,x ),    (7.20a)
pf(x) = P( z(ω) ∈ BΓf,x ).    (7.20b)

Furthermore, the transformed limit state surface LSΓx is given by

LSΓx := { z ∈ IRν : s*Γ(z, x) = 0 } = { z ∈ IRν : s*( Γ⁻¹(z), x ) = 0 } = Γ(LSx),    (7.21a)

where, see (7.14a), (7.14b),

P( z(ω) ∈ LSΓx ) = P( s*( Γ⁻¹(z(ω)), x ) = 0 ) = P( s*(a(ω), x) = 0 ) = 0.    (7.21b)
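The representation (7.18a) suggests a simple Monte-Carlo check. The scalar model below is purely an assumed illustration: a(ω) is normal with mean mu and standard deviation sd, s*(a, x) = a − x, and a = mu + sd·z with z ~ N(0, 1) is the trivial one-dimensional instance of the transformation (7.17).

```python
import math, random

# Monte-Carlo sketch of ps(x) = P(s*(a(ω), x) < 0) for an assumed scalar model;
# here ps(x) = Φ((x - mu)/sd) is known exactly and serves as a cross-check.
random.seed(0)
mu, sd, x = 1.0, 0.5, 2.0

def s_star(a, x):                  # assumed state function
    return a - x

N = 200_000
hits = sum(s_star(mu + sd * random.gauss(0.0, 1.0), x) < 0.0 for _ in range(N))
ps_mc = hits / N

ps_exact = 0.5 * (1.0 + math.erf((x - mu) / (sd * math.sqrt(2.0))))
print(ps_mc, ps_exact)             # agree up to Monte-Carlo error
```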

7.4 Approximation I of ps, pf : FORM

Let x ∈ D be again an arbitrary, but fixed design vector. The well-known approximation method "FORM" (First Order Reliability Method) is based [54] on the linearization of the state function s*Γ = s*Γ(z, x) at a so-called β-point zx*:

Definition 7.2. A β-point is a vector zx* lying on the limit state surface s*Γ(z, x) = 0 and having minimum distance to the origin 0 of the z-space IRν.
For the computation of zx∗ three different cases have to be taken into account.

7.4.1 The Origin of IRν Lies in the Transformed Safe Domain

Here, we suppose

0 ∈ BΓs,x or s*Γ(0, x) < 0,    (7.22a)

thus

Γ⁻¹(0) ∈ Bs,x.    (7.22b)

Since in many cases

ā = Ea(ω) = Γ⁻¹(0),

(7.22b) can be represented often by

ā ∈ Bs,x or s*(ā, x) < 0.    (7.22c)

Remark 7.5. Condition (7.22c) means that for the design vectors x under con-
sideration the expected parameter vector ā = Ea(ω) lies in the safe parameter
domain Bs,x which is, of course, a minimum requirement for an efficient op-
erating of a system/structure under stochastic uncertainty.

Obviously, in the present case a β-point is the projection of the origin onto the closure B̄Γf,x of the failure domain BΓf,x. Hence, zx* is a solution of the projection problem

min ‖z‖²  s.t.  s*Γ(z, x) ≥ 0.    (7.23)
In the following we suppose that all derivatives of s∗Γ (·, x) under consider-
ation exist.
The Lagrangian L = L(z, λ) = L(z, λ; x) of the above projection problem reads

L(z, λ) := ‖z‖² + λ( −s*Γ(z, x) ).    (7.24)

Furthermore, for a projection zx* one has the following necessary optimality conditions:

∇z L(z, λ) = 2z + λ( −∇z s*Γ(z, x) ) = 0    (7.25a)
(∂L/∂λ)(z, λ) = −s*Γ(z, x) ≤ 0    (7.25b)
λ(∂L/∂λ)(z, λ) = λ( −s*Γ(z, x) ) = 0    (7.25c)
λ ≥ 0.    (7.25d)

Based on the above conditions, we get this result:

Theorem 7.2 (Necessary optimality conditions). Let zx* be a β-point, i.e., an optimal solution of the projection problem (7.23). If zx* is a regular point, i.e.,

if s*Γ(zx*, x) > 0 or    (7.26a)
if ∇z s*Γ(zx*, x) ≠ 0 for s*Γ(zx*, x) = 0,    (7.26b)

then there exists a Lagrange multiplier λ* ∈ IR such that (zx*, λ*) fulfills the K.–T. conditions (7.25a)–(7.25d).

The discussion of the two cases (7.26a), (7.26b) yields the following auxiliary result:

Corollary 7.1. Suppose that zx* is a regular β-point. Then, in the present case (7.22a) we have

s*Γ(zx*, x) = 0,    (7.27a)
zx* = (λ*/2) ∇z s*Γ(zx*, x) ≠ 0    (7.27b)

with a Lagrange multiplier λ* > 0.

Proof. Let zx∗ be a regular β-point. If (7.26a) holds, then (7.25c) yields λ∗ = 0.
Thus, from (7.25a) we obtain zx∗ = 0 and therefore s∗Γ (0, x) > 0 which con-
tradicts assumption (7.22a) in this Sect. 7.4.1. Hence, in case (7.22a) we must
have s*Γ(zx*, x) = 0. Since case (7.26b) is left, we have ∇z s*Γ(zx*, x) ≠ 0 for a regular zx*. A similar argument as above yields λ* > 0, since otherwise we have again zx* = 0 and therefore s*Γ(0, x) = 0, which again contradicts (7.22a). Thus, relation (7.27b) follows from (7.25a).

Suppose now that zx* is a regular β-point for all x ∈ D. Linearizing the state function s*Γ = s*Γ(z, x) at zx*, because of (7.27a) we obtain

s*Γ(z, x) = s*Γ(zx*, x) + ∇z s*Γ(zx*, x)ᵀ(z − zx*) + …
          = ∇z s*Γ(zx*, x)ᵀ(z − zx*) + … .    (7.28)

Using (7.16a) and (7.28), the probability of survival ps = ps(x) can be approximated now by

ps(x) = P( s*Γ(z(ω), x) < 0 ) ≈ p̃s(x),    (7.29a)

where

p̃s(x) := P( ∇z s*Γ(zx*, x)ᵀ(z(ω) − zx*) < 0 )
        = P( ∇z s*Γ(zx*, x)ᵀ z(ω) < ∇z s*Γ(zx*, x)ᵀ zx* ).    (7.29b)

Since z = z(ω) is an N(0, I)-distributed random vector, the random variable in (7.29b), hence,

ξ(ω) := ∇z s*Γ(zx*, x)ᵀ z(ω),

has an N(0, ‖∇z s*Γ(zx*, x)‖²)-normal distribution. Consequently, for p̃s we find

p̃s(x) = Φ( (∇z s*Γ(zx*, x)ᵀ zx* − 0)/‖∇z s*Γ(zx*, x)‖ ) = Φ( ∇z s*Γ(zx*, x)ᵀ zx*/‖∇z s*Γ(zx*, x)‖ ),    (7.29c)

where Φ = Φ(t) denotes the distribution function of the standard N(0, 1)-normal distribution.
Inserting (7.27b) into (7.29c), we get the following representation:

p̃s(x) = Φ( ∇z s*Γ(zx*, x)ᵀ (λ*/2) ∇z s*Γ(zx*, x) / ‖∇z s*Γ(zx*, x)‖ ).    (7.29d)

Using again (7.27b), for the approximative probability p̃s = p̃s(x) we find the following final representation:

Theorem 7.3. Suppose that the origin 0 of IRν lies in the safe domain BΓs,x, and assume that the projection problem (7.23) has regular optimal solutions zx* only. Then, ps(x) ≈ p̃s(x), where

p̃s(x) = Φ( ‖zx*‖ ).    (7.30)
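For an affine transformed state function the FORM approximation (7.30) can be written out in closed form; the data g and b below are assumed toy values. With s*Γ(z, x) = gᵀz + b and b < 0 (condition (7.22a)), the β-point of the projection problem (7.23) is zx* = (−b/‖g‖²)·g, and in this linear case p̃s = Φ(‖zx*‖) is even exact.

```python
import math

# FORM sketch for an affine state function s*Γ(z, x) = gᵀz + b with b < 0;
# g and b are assumed toy data.
def Phi(t):                                      # standard normal distribution function
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

g = [1.0, 2.0, -2.0]                             # ∇z s*Γ, constant in the affine case
b = -6.0                                         # s*Γ(0, x) < 0, cf. (7.22a)

norm_g = math.sqrt(sum(gi * gi for gi in g))
z_beta = [(-b / norm_g ** 2) * gi for gi in g]   # β-point zx*, cf. (7.27b)
beta = math.sqrt(sum(zi * zi for zi in z_beta))  # reliability index ‖zx*‖

ps_tilde = Phi(beta)                             # (7.30)
print(beta, ps_tilde)
```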

As can be seen from the definition (7.29b) of p̃s, the safe domain, cf. (7.19a),

BΓs,x = { z ∈ IRν : s*Γ(z, x) < 0 },

is approximated by the tangent half space

Hx := { z ∈ IRν : ∇z s*Γ(zx*, x)ᵀ(z − zx*) < 0 }    (7.31)

at zx*. If BΓs,x lies in Hx, hence, if (Fig. 7.4)

BΓs,x ⊂ Hx,    (7.32a)

then

ps(x) ≤ p̃s(x).    (7.32b)

On the contrary, if

Hx ⊂ BΓs,x,    (7.33a)

as indicated by Fig. 7.3, then

p̃s(x) ≤ ps(x).    (7.33b)

Fig. 7.3. Origin 0 lies in BΓs,x

Fig. 7.4. The safe domain lies in the tangent half space

7.4.2 The Origin of IRν Lies in the Transformed Failure Domain

Again let x be an arbitrary, but fixed design vector. Opposite to (7.22a), here we assume now that

0 ∈ BΓf,x or s*Γ(0, x) > 0,    (7.34a)

which means that

Γ⁻¹(0) ∈ Bf,x.    (7.34b)

In this case, the β-point zx* is the projection of the origin 0 of IRν onto the closure B̄Γs,x of the safe domain BΓs,x, cf. Fig. 7.5. Thus, zx* is a solution of the projection problem

min ‖z‖²  s.t.  s*Γ(z, x) ≤ 0.    (7.35)

Here, we have the Lagrangian

L(z, λ) = L(z, λ; x) := ‖z‖² + λ s*Γ(z, x),    (7.36)

and the K.–T. conditions for a projection zx* of 0 onto B̄Γs,x read, cf. (7.25a)–(7.25d),

∇z L(z, λ) = 2z + λ∇z s*Γ(z, x) = 0    (7.37a)
(∂L/∂λ)(z, λ) = s*Γ(z, x) ≤ 0    (7.37b)
λ(∂L/∂λ)(z, λ) = λ s*Γ(z, x) = 0    (7.37c)
λ ≥ 0.    (7.37d)

Related to Theorem 7.2, in the present case we have this result:



Fig. 7.5. Origin 0 lies in BΓf,x

Theorem 7.4 (Necessary optimality conditions). Let zx∗ be a β-point, i.e., an optimal solution of the projection problem (7.35). If zx∗ is regular, hence,

if s∗Γ(zx∗, x) < 0 or   (7.38a)
if ∇z s∗Γ(zx∗, x) ≠ 0 for s∗Γ(zx∗, x) = 0,   (7.38b)
then there exists a Lagrange multiplier λ∗ ∈ IR such that (zx∗ , λ∗ ) fulfills the
K.–T. conditions (7.37a)–(7.37d).
Discussing (7.38a), (7.38b), corresponding to Corollary 4.4 we find this auxil-
iary result:
Corollary 7.2. Suppose that zx∗ is a regular point of (7.35). Then, in case
(7.34a) we have
s∗Γ(zx∗, x) = 0   (7.39a)

zx∗ = −(λ∗/2) ∇z s∗Γ(zx∗, x)   (7.39b)

with λ∗ > 0.
Proof. Let zx∗ be a regular β-point. If (7.38a) holds, then from (7.37c) we again get λ∗ = 0 and therefore zx∗ = 0, hence s∗Γ(0, x) < 0, which contradicts assumption (7.34a) of this subsection. Hence, under condition (7.34a) we also have s∗Γ(zx∗, x) = 0. Since only case (7.38b) remains, we get ∇z s∗Γ(zx∗, x) ≠ 0. Moreover, a similar argument as above yields λ∗ > 0, since otherwise we would get zx∗ = 0 and therefore s∗Γ(0, x) = 0. However, this again contradicts assumption (7.34a). Equation (7.39b) now follows from (7.37a).

Proceeding now as in Sect. 7.4.1, we linearize the state function s∗Γ at zx∗ .
Consequently, ps (x) is approximated again by means of formulas (7.29a),
(7.29b), hence,

ps(x) = P( s∗Γ( z(ω), x ) < 0 ) ≈ p̃s(x),

where

p̃s(x) := P( ∇z s∗Γ(zx∗, x)T ( z(ω) − zx∗ ) < 0 ).

In the same way as in Sect. 7.4.1, in the present case we have the following
result:

Theorem 7.5. Suppose that the origin 0 of IRν is contained in the unsafe domain BΓf,x, and assume that the projection problem (7.35) has regular optimal solutions zx∗ only. Then, ps(x) ≈ p̃s(x), where

p̃s(x) := Φ( −‖zx∗‖ ).   (7.40)

Since Φ( −‖zx∗‖ ) ≤ 1/2, the case 0 ∈ BΓf,x or s∗Γ(0, x) > 0 is of minor interest in practice. Moreover, in most practical cases we may assume, cf. (7.22a)–(7.22c), that 0 ∈ BΓs,x.

7.4.3 The Origin of IRν Lies on the Limit State Surface

For the sake of completeness we still consider the last possible case that, for
a given, fixed design vector x (Fig. 7.6), we have

0 ∈ LSx or s∗Γ (0, x) = 0. (7.41)

Obviously, according to Definition 7.2, the β-point is given here by zx∗ = 0.


Hence, the present approximation method yields

p̃s(x) := P( s∗Γ(zx∗, x) + ∇z s∗Γ(zx∗, x)T ( z(ω) − zx∗ ) < 0 )
       = P( ∇z s∗Γ(0, x)T z(ω) < 0 )
       = 1/2 for ∇z s∗Γ(0, x) ≠ 0, and 0 else,   (7.42)

since ξ(ω) = ∇z s∗Γ(0, x)T z(ω) has an N( 0, ‖∇z s∗Γ(0, x)‖² )-normal distribution.

Remark 7.6. Comparing all cases (7.22a), (7.34a) and (7.41), we find that for
approximation purposes only the first case (7.22a) is of practical interest.

Fig. 7.6. (a) 0 ∈ LSx, Case 1. (b) 0 ∈ LSx, Case 2

7.4.4 Approximation of Reliability Constraints

In reliability-based design optimization (RBDO) for the design vector or load


factor x we have probabilistic constraints of the type

ps (x) ≥ αs (7.43)

with a prescribed minimum reliability level αs ∈ (0, 1]. Replacing the proba-
bilistic constraint (7.43) by the approximative condition

p̃s (x) ≥ αs , (7.44a)

in case (7.22a) we obtain, see Theorem 7.3,


 
Φ( ‖zx∗‖ ) ≥ αs

and therefore the deterministic norm condition

‖zx∗‖ ≥ βs := Φ−1(αs)   (7.44b)

for the β-point zx∗ .
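The chain projection problem (7.23) → approximation (7.30) → norm condition (7.44b) can be sketched numerically. The following is a minimal illustration, not taken from the text: it assumes a hypothetical linear transformed limit state s∗Γ(z) = aT z − 4 with a = (1, 2)T and safe domain s∗Γ < 0, and uses scipy's general-purpose solver in place of a dedicated projection routine.

```python
# Minimal FORM sketch: beta-point via the projection problem, then (7.30)
# and the reliability condition (7.44b). Limit state is hypothetical.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

a = np.array([1.0, 2.0])
s_gamma = lambda z: a @ z - 4.0          # s_gamma(0) = -4 < 0: origin is safe

# projection problem: min ||z||^2  s.t.  s_gamma(z) >= 0 (failure domain)
res = minimize(lambda z: z @ z, x0=np.ones(2),
               constraints=[{"type": "ineq", "fun": s_gamma}])
z_star = res.x                           # beta-point z_x^*
beta = np.linalg.norm(z_star)            # reliability index ||z_x^*||
p_tilde = norm.cdf(beta)                 # approximation (7.30)

beta_s = norm.ppf(0.9)                   # (7.44b) with alpha_s = 0.9
print(beta, p_tilde, beta >= beta_s)
```

For a linear limit state the FORM approximation is exact; here β = 4/‖a‖ = 4/√5 ≈ 1.79, so p̃s = Φ(β) ≈ 0.96 and the constraint (7.44b) holds for αs = 0.9.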

7.5 Approximation II of ps, pf: Polyhedral Approximation of the Safe/Unsafe Domain
In the following an extension of the “FORM” technique is introduced which
works with multiple linearizations of the limit state surface LSx . According
to the considerations in Sects. 7.4.1–7.4.3, we may suppose here that (7.22a)–
(7.22c) holds, hence,
0 ∈ BΓs,x or s∗Γ(0, x) < 0.
As can be seen from Sect. 7.4, the β-point zx∗ lies on the boundary of the transformed safe/unsafe domains BΓs,x, BΓf,x. Now, further boundary points b of BΓs,x, BΓf,x can be determined by minimizing or maximizing a linear form

f (z) := cT z, (7.45)

with a given nonzero vector c ∈ IRν, on the closures B̄Γs,x, B̄Γf,x of the safe and unsafe domains BΓs,x, BΓf,x. Hence, instead of the projection problems (7.23) or (7.35), in the following we consider optimization problems of the following type:

min(max) cT z s.t. z ∈ B̄Γs,x (z ∈ B̄Γf,x, resp.)   (7.46a)

or

min(max) cT z s.t. s∗Γ(z, x) ≤ 0 (≥ 0, resp.).   (7.46b)
Clearly, multiplying the vector c by (−1), we may confine ourselves to minimization problems only:

min cT z s.t. s∗Γ(z, x) ≤ 0 ( s∗Γ(z, x) ≥ 0, resp. ).   (7.47)

Definition 7.3. For a given vector c ∈ IRν and a fixed design vector x, let
bc,x denote an optimal solution of (7.47).

Fig. 7.7. Minimization of a linear form f on the safe domain

As indicated in Fig. 7.7, the boundary point bc,x can also be interpreted as a
β-point with respect to a shifted origin 0.
The Lagrangian of (7.47) reads – for both cases –

L(z, λ) = L(z, λ; x) := cT z + λ( ±s∗Γ(z, x) ),   (7.48)

and we have the K.–T. conditions

∇z L(z, λ) = c ± λ∇z s∗Γ(z, x) = 0   (7.49a)
∂L/∂λ (z, λ) = s∗Γ(z, x) ≤ 0 (≥ 0, resp.)   (7.49b)
λ ∂L/∂λ (z, λ) = λ s∗Γ(z, x) = 0   (7.49c)
λ ≥ 0.   (7.49d)

Supposing again that the solutions bc,x of (7.47) are regular, related to
Sect. 7.4, here we have this result:
Theorem 7.6 (Necessary optimality conditions). Let c ≠ 0 be a given ν-vector. If bc,x is a regular optimal solution of (7.47), then there exists λ∗ ∈ IR such that

s∗Γ(bc,x, x) = 0   (7.50a)

∇z s∗Γ(bc,x, x) = ∓ c/λ∗ with λ∗ > 0.   (7.50b)

Proceeding now in a similar way as in Sect. 7.4, the state function s∗Γ = s∗Γ(z, x) is linearized at bc,x. Hence, with (7.50a) we get

s∗Γ(z, x) = s∗Γ(bc,x, x) + ∇z s∗Γ(bc,x, x)T (z − bc,x) + . . .
          = ∇z s∗Γ(bc,x, x)T (z − bc,x) + . . . .   (7.51a)

Then, a first type of approximation

ps (x) ≈ p̃s (x)

is defined again by

p̃s(x) := P( ∇z s∗Γ(bc,x, x)T ( z(ω) − bc,x ) < 0 ).   (7.51b)

As in Sect. 7.4 we obtain

p̃s(x) = P( ∇z s∗Γ(bc,x, x)T z(ω) < ∇z s∗Γ(bc,x, x)T bc,x )
       = Φ( ∇z s∗Γ(bc,x, x)T bc,x / ‖∇z s∗Γ(bc,x, x)‖ ).   (7.51c)

Inserting now (7.50b) into (7.51c), for p̃s (x) we find the following repre-
sentation:

Theorem 7.7. Let c ≠ 0 be a given ν-vector and assume that (7.47) has regular optimal solutions bc,x only. Then the approximation p̃s(x) of ps(x), defined by (7.51b), can be represented by

p̃s(x) = Φ( (∓c)T bc,x / ‖c‖ ).   (7.52)

Proof. From (7.51c) and (7.50b) we obtain (the factor λ∗ cancels)

p̃s(x) = Φ( (∓c/λ∗)T bc,x / ‖∓c/λ∗‖ ) = Φ( (∓c)T bc,x / ‖c‖ ).

Remark 7.7. We observe that for an appropriate selection of the vector c,


approximation (7.52) coincides with approximation (7.30) of ps (x) based on
a β-point. Obviously, also in the present case only an optimal solution bc,x of
the minimization problem (7.47) is needed replacing the projection problem
(7.23).

In the following we assume that for any design vector x and any cost vector c ≠ 0, the vector bc,x indicates an ascent direction for s∗Γ = s∗Γ(z, x) at bc,x, i.e.,

∇z s∗Γ(bc,x, x)T bc,x > 0.   (7.53a)

Depending on the case taken in (7.47), this is equivalent, cf. (7.50b), to the
condition
(∓c)T bc,x > 0, (7.53b)
involving the following two cases:
Case 1 (“–”): Problem (7.47) with “≤,”
Case 2 (“+”): Problem (7.47) with “≥.”
Illustrations are given in Fig. 7.8a, b below.

7.5.1 Polyhedral Approximation

In the following we consider an arbitrary, but fixed design vector x. Selecting


now a certain number J of c-vectors

c1 , c2 , . . . , c j , . . . , cJ , (7.54a)

and solving the related minimization problem (7.47) for c = cj , j = 1, . . . , J,


we obtain the corresponding boundary points

bc1 ,x , bc2 ,x , . . . , bcj ,x , . . . , bcJ ,x . (7.54b)

Remark 7.8 (Parallel computing of the b-points). Computing the boundary points bcj,x, j = 1, . . . , J, for a given design vector x and c-vectors c1, . . . , cJ, we have to solve optimization problem (7.47) for J different cost vectors cj, j = 1, . . . , J, but with an identical constraint. Thus, the boundary points bcj,x, j = 1, . . . , J, and the corresponding constraints

∇z s∗Γ (bcj ,x , x)T (z − bcj ,x ) < 0 (7.55a)

or, cf. (7.50b)


(∓cj )T (z − bcj ,x ) < 0, (7.55b)
j = 1, . . . , J, can be determined simultaneously by means of parallel comput-
ing.
Considering the constraints (7.55a) or (7.55b) first as joint constraints,
we find the following more accurate approximations of the probability of sur-
vival/failure ps , pf .

Case 1: Minimization of Linear Forms on BΓs,x

In this case, see Fig. 7.9, the transformed safe domain BΓs,x is approximated from outside by a convex polyhedron defined by the linear constraints (7.55a) or (7.55b) with “–.” Hence, for ps we get the upper bound

ps (x) ≤ p̃s (x), (7.56a)



Fig. 7.8. (a) Condition (7.53a), (7.53b) in Case 1. (b) Condition (7.53a), (7.53b) in Case 2

where

p̃s(x) = P( ∇z s∗Γ(bcj,x, x)T ( z(ω) − bcj,x ) < 0, j = 1, . . . , J )
       = P( (−cj)T ( z(ω) − bcj,x ) < 0, j = 1, . . . , J ).   (7.56b)

Fig. 7.9. Polyhedral approximation (in Case 1)

Defining then the J × ν matrix A = A(c1, . . . , cJ) and the J-vector b = b(x; c1, . . . , cJ) by

A(c1, . . . , cJ) := (−c1, −c2, . . . , −cJ)T   (7.57a)

b(x; c1, . . . , cJ) := ( (−cj)T bcj,x )_{1≤j≤J},   (7.57b)

p̃s can be represented also by

p̃s(x; c1, . . . , cJ) = P( A(c1, . . . , cJ) z(ω) ≤ b(x; c1, . . . , cJ) ).   (7.57c)

Treating the constraints (7.55a) or (7.55b) separately, for ps we find the larger, but simpler upper bound

ps(x) ≤ p̃s(x),   (7.58a)

where

p̃s(x) := min_{1≤j≤J} p̃s(x; cj)   (7.58b)

and, cf. (7.51b), (7.51c), (7.52),

p̃s(x; cj) = Φ( (−cj)T bcj,x / ‖cj‖ ).   (7.58c)

Case 2: Minimization of Linear Forms on BΓf,x

Here, see Figs. 7.5 and 7.8b, the transformed failure domain BΓf,x is approximated from outside by a convex polyhedron based on the reversed constraints (7.55a) or (7.55b). Thus, for the probability of failure pf we get, cf. (7.33a), (7.33b), the upper bound

pf (x) ≤ p̃f (x) (7.59a)

with

p̃f(x) := P( ∇z s∗Γ(bcj,x, x)T ( z(ω) − bcj,x ) ≥ 0, j = 1, . . . , J )
        = P( cjT ( z(ω) − bcj,x ) ≥ 0, j = 1, . . . , J ).   (7.59b)

Corresponding to (7.57a), (7.57b), here we define the J × ν matrix A = A(c1, . . . , cJ) and the J-vector b = b(x; c1, . . . , cJ) by

A(c1, . . . , cJ) := (c1, c2, . . . , cJ)T   (7.60a)

b(x; c1, . . . , cJ) := ( cjT bcj,x )_{1≤j≤J}.   (7.60b)

Then, p̃f = p̃f(x) can be represented by

p̃f(x) = P( A(c1, . . . , cJ) z(ω) ≥ b(x; c1, . . . , cJ) ).   (7.60c)

Obviously, in the present Case 2

p̃s (x) := 1 − p̃f (x) (7.61a)

defines a lower bound


p̃s (x) ≤ ps (x) (7.61b)
for the probability of survival ps . Working with separate constraints, for pf
we obtain the larger, but simpler upper bound

pf (x) ≤ p̃f (x) (7.62a)

with

p̃f(x) := min_{1≤j≤J} P( cjT ( z(ω) − bcj,x ) ≥ 0 )
        = min_{1≤j≤J} ( 1 − p̃s(x; cj) ) = 1 − max_{1≤j≤J} p̃s(x; cj).   (7.62b)

Here,

p̃s(x; cj) := Φ( cjT bcj,x / ‖cj‖ ).   (7.62c)
7.5 Polyhedral Approximation of the Safe/Unsafe Domain 277

Remark 7.9. Since z = z(ω) has an N (0, I)-normal distribution, the random
vector
y = Az(ω) (7.63a)
with the matrix A = A(c1 , . . . , cJ ) given by (7.57a), (7.60a), resp., has an
N (0, Q)-normal distribution with the J × J covariance matrix

Q = AAT = (ciT cj )i,j=1,...,J . (7.63b)
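Remark 7.9 gives a direct way to evaluate (7.57c) numerically: Az(ω) is N(0, AAT)-distributed, so p̃s is a multivariate normal rectangle probability. The following sketch uses invented data only (J = 2 coordinate directions, boundary points of a spherical safe domain of radius 2, cf. Example 7.1 below):

```python
# Evaluating (7.57c) via the N(0, AA^T) distribution of y = A z(omega).
# All data are hypothetical: nu = J = 2, c^1 = e1, c^2 = e2, and boundary
# points b_{c^j,x} = -R c^j/||c^j|| of a sphere of radius R = 2.
import numpy as np
from scipy.stats import multivariate_normal, norm

c = np.eye(2)                                 # c-vectors as rows
b_points = np.array([[-2.0, 0.0],
                     [0.0, -2.0]])            # b_{c^j,x}
A = -c                                        # (7.57a)
b = np.array([-c[j] @ b_points[j] for j in range(2)])   # (7.57b): b = (2, 2)

Q = A @ A.T                                   # covariance (7.63b), here = I
p_tilde = multivariate_normal(mean=np.zeros(2), cov=Q).cdf(b)
print(p_tilde)
```

Since Q = I in this toy case, the probability factorizes into Φ(2)² ≈ 0.955; for nondiagonal Q (non-orthogonal c-vectors) the same call evaluates the genuinely joint probability, provided Q is nonsingular.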

Example 7.1 (Spherical safe domain). Here we assume that the transformed safe domain BΓs,x is given by

BΓs,x = { z ∈ IRν : ‖z‖E − R(x) =: s∗Γ(z, x) < 0 },   (7.64)

where ‖·‖E denotes the Euclidean norm, and the radius R = R(x) is a function of the design vector x (Fig. 7.10).

In the present case for each c ≠ 0 we obtain, see (7.47) with “≤,”

bc,x = −R(x) c/‖c‖.   (7.65a)

Fig. 7.10. Spherical safe domain (Case 1)

Consequently, for a single vector c we get, according to (7.58c),

p̃s(x; c) = Φ( (−c)T ( −R(x) c/‖c‖ ) / ‖c‖ ) = Φ( R(x) ),   (7.65b)

where, cf. (7.32b),

ps(x) ≤ p̃s(x; c) = Φ( R(x) ).   (7.65c)

Working with J vectors c1, . . . , cJ, we get, see (7.56a), (7.56b),

p̃s(x; c1, . . . , cJ) = P( (−cj)T ( z(ω) − bcj,x ) < 0, j = 1, . . . , J )
                     = P( (−cj)T z(ω) < R(x)‖cj‖, j = 1, . . . , J ).

In the special case J = 2ν and

c1 = e1 , c2 = −e1 , c3 = e2 , c4 = −e2 , . . . , c2ν−1 = eν , c2ν = −eν , (7.66a)

where e1 , e2 , . . . , eν are the unit vectors of IRν (Fig. 7.11), we obtain

Fig. 7.11. Spherical safe domain BΓs,x with c-vectors given by the unit vectors
  
p̃s(x; ±e1, . . . , ±eν) = P( |zj(ω)| < R(x), j = 1, . . . , ν )
   = ∏_{j=1}^{ν} P( −R(x) < zj(ω) < R(x) )
   = ( Φ( R(x) ) − Φ( −R(x) ) )^ν
   = ( 2Φ( R(x) ) − 1 )^ν.   (7.66b)

In the present case of the spherical safe domain (7.64), the exact proba-
bility ps (x) can be determined analytically:
  
ps(x) = P( ‖z(ω)‖ ≤ R(x) )
      = ∫ . . . ∫_{‖z‖≤R(x)} ( 1/(2π)^{ν/2} ) exp( −‖z‖²/2 ) dz.   (7.67a)

Introducing polar coordinates in IRν, by change of variables in (7.67a) we obtain

ps(x) = ( 2^{(2−ν)/2} / Γ(ν/2) ) ∫_0^{R(x)} e^{−r²/2} r^{ν−1} dr,   (7.67b)

where Γ = Γ(t) denotes the Γ-function. In the case ν = 2 we have Γ(1) = 1 and therefore

ps(x) = 1 − e^{−R(x)²/2}.   (7.67c)
Comparing the above approximations, for ν = 2 we have

ps(x) = 1 − e^{−R(x)²/2} ≤ p̃s(x; ±e1, ±e2) = ( 2Φ( R(x) ) − 1 )²
      ≤ p̃s(x; c) = Φ( R(x) ).   (7.67d)
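The ordering (7.67d) is easy to verify numerically; the following sketch (the radius R(x) = 1.5 is chosen arbitrarily for illustration) evaluates the three expressions for ν = 2:

```python
# Numerical check of the ordering (7.67d) for the spherical safe domain,
# nu = 2, with an arbitrary illustrative radius R(x) = 1.5.
import numpy as np
from scipy.stats import norm

R = 1.5
ps_exact = 1.0 - np.exp(-R**2 / 2.0)        # exact probability (7.67c)
ps_box = (2.0 * norm.cdf(R) - 1.0) ** 2     # polyhedral bound (7.66b), J = 4
ps_half = norm.cdf(R)                       # single half-plane bound (7.65c)
print(ps_exact, ps_box, ps_half)
assert ps_exact <= ps_box <= ps_half        # the chain (7.67d)
```

More c-vectors tighten the polyhedral bound toward the exact value, at the cost of additional b-point computations.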

7.6 Computation of the Boundary Points


Let x denote again a given, but fixed design vector. In this section we con-
sider the properties and possible numerical solution of the optimization prob-
lem (7.47) for the computation of the boundary (b-)points bc,x . According to
Sect. 7.5, this problem has the following two versions:
Case 1
min cT z s.t. s∗Γ (z, x) ≤ 0 (7.68)
Case 2
min cT z s.t. s∗Γ (z, x) ≥ 0. (7.69)
Thus, solving (7.68) or (7.69), a central problem is that the constraint
of these problems is not given by an explicit formula, but by means of the
minimum value function of a certain optimization problem. However, in the

following we obtain explicit representations of (7.68) and (7.69), containing


an inner optimization problem for the representation of the constraint, by
taking into account problems (A), (B) and (C) in Sect. 7.2 describing the
state function s∗ .

7.6.1 State Function s∗ Represented by Problem A


For simplicity we assume that Problem A, i.e., (7.6a)–(7.6d), has an optimal solution (s∗, σ∗) for each pair (a, x) of a parameter vector a and a design vector x under consideration. According to the minimum value definition (7.6a)–(7.6d) of s∗ = s∗(a, x), condition s∗(a, x) ≤ 0 holds if and only if there exists a feasible point (s, σ) of (7.6a)–(7.6d) such that s ≤ 0. Since (7.6c) with condition s ≤ 0 is equivalent to g̃(σ, (z, x)) ≤ s1 ≤ 0, problem (7.68) can be described explicitly by

min_{σ,z} cT z   (7.70a)

s.t.

T̃(σ, (z, x)) = 0   (7.70b)
g̃(σ, (z, x)) ≤ 0   (7.70c)
σ ∈ Σ0.   (7.70d)

Here, the functions T̃ = T̃(σ, (z, x)) and g̃ = g̃(σ, (z, x)) are defined, cf. (7.17), by

T̃(σ, (z, x)) := T( σ, (Γ−1(z), x) ),   (7.71a)
g̃(σ, (z, x)) := g( σ, (Γ−1(z), x) ).   (7.71b)

For a similar representation of the related problem (7.69), a representation of


Problem A by means of a dual maximization problem is needed, see the next
Sect. 7.6.2.

7.6.2 State Function s∗ Represented by Problem B


For a given, fixed design vector x we assume again that the LP (7.8a)–(7.8d)
has an optimal solution (s∗ , σ ∗ ) for each (a, x) under consideration. Due to
the minimum value representation of s∗ by (7.8a)–(7.8d), problem (7.68) can
be described explicitly by
min cT z (7.72a)
σ,z
s.t.
 x)σ = L̃(z, x)
C(z, (7.72b)
 x)σ − h̃(z, x) ≤ 0
H(z, (7.72c)
Aσ ≤ b, (7.72d)

where

C̃(z, x) := C( Γ−1(z), x ),  L̃(z, x) := L( Γ−1(z), x )   (7.73a)
H̃(z, x) := H( Γ−1(z), x ),  h̃(z, x) := h( Γ−1(z), x ).   (7.73b)

Since (7.8a)–(7.8d) is an LP, the state function s∗ can be described also by the maximum value function of the dual maximization problem (7.9a)–(7.9d). Hence, the second problem (7.69) with the “≥” constraint can be described explicitly by

min_{u,v,w,z} cT z   (7.74a)

s.t.

L̃(z, x)T u − h̃(z, x)T v − bT w ≥ 0   (7.74b)
C̃(z, x)T u − H̃(z, x)T v − AT w = 0   (7.74c)
1T v = 1   (7.74d)
v, w ≥ 0.   (7.74e)

The results in Sects. 7.6.1 and 7.6.2 can be summarized as follows:

Theorem 7.8 (Explicit representation of the programs for the computation of the b-points).
(a) Problem (7.47) with the constraint “≤,” i.e., problem (7.68), can be represented explicitly by the optimization problem (7.70a)–(7.70d) or (7.72a)–(7.72d).
(b) In the linear case, problem (7.47) with the “≥” constraint, i.e., problem (7.69), can be represented explicitly by the optimization problem (7.74a)–(7.74e).

In practice, often the following properties hold:

C̃(z, x) = C (independent of (z, x))   (7.75a)
L̃(z, x) = L̃(z) (independent of x)   (7.75b)
H̃(z, x) = H (independent of (z, x))   (7.75c)
h̃(z, x) = h(x) (independent of z).   (7.75d)

The above conditions mean that we have a fixed equilibrium matrix C, a random external load L = L( a(ω) ) independent of the design vector x, and fixed material or strength parameters.

Under the above conditions (7.75a)–(7.75d), problem (7.72a)–(7.72d) takes the following simpler form:

min_{σ,z} cT z   (7.76a)

s.t.

Cσ − L̃(z) = 0 (7.76b)
Hσ ≤ h(x) (7.76c)
Aσ ≤ b. (7.76d)

Remark 7.10 (Normally distributed parameter vector a = a(ω)). In this case L̃ = L̃(z) is an affine-linear function. Consequently, (7.76a)–(7.76d) is then an LP in the variables (σ, z).
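Under Remark 7.10 the b-point computation is a plain LP. The sketch below solves one invented instance of the structure (7.76a)–(7.76d): one equilibrium equation C σ = L̃(z) affine in a scalar z, and two-sided bounds on σ playing the roles of Hσ ≤ h(x) and Aσ ≤ b. All numbers are illustrative only.

```python
# LP sketch of (7.76a)-(7.76d) for a b-point, with invented small data:
# nu = 1, sigma in R^2, equilibrium sigma_1 + sigma_2 = 1 + 0.5 z,
# and the bounds -1 <= sigma_i <= 1 written as inequality rows.
import numpy as np
from scipy.optimize import linprog

# variables x = (sigma_1, sigma_2, z); objective: min c^T z with c = 1
obj = [0.0, 0.0, 1.0]
A_eq = [[1.0, 1.0, -0.5]]                    # sigma_1 + sigma_2 - 0.5 z = 1
b_eq = [1.0]
A_ub = [[1, 0, 0], [0, 1, 0],                # sigma <= 1   (H sigma <= h(x))
        [-1, 0, 0], [0, -1, 0]]              # -sigma <= 1  (A sigma <= b)
b_ub = [1.0, 1.0, 1.0, 1.0]
res = linprog(obj, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(None, None)] * 3)     # all variables free
print(res.x)                                 # optimal (sigma*, z); z is the b-point
```

Here the minimum is attained at σ = (−1, −1), z = −6; flipping the sign of the cost (the maximization direction in (7.46a)) gives the opposite boundary point z = 2.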

7.7 Computation of the Approximate Probability Functions
Based on polyhedral approximation, in Sect. 7.5.1 several approximations p̃s
of the survival probability ps are derived. For the simpler case of separate
treatment of the constraints (7.55a) or (7.55b), explicit representations of p̃s
are given already in (7.58a)–(7.58c) and (7.61a), (7.61b), (7.62a)–(7.62c). For
the computation of the more accurate approximative probability functions
p̃s , p̃f given by (7.56b) or (7.57c), (7.59b) or (7.60c), resp., several techniques
are available, such as
(I) Probability inequalities
(II) Discretization of the joint normal distribution N (0, I) and the related
(III) Monte Carlo simulation
In the following we consider methods (I) and (II).

7.7.1 Probability Inequalities

In Case 1 we have, cf. (7.57c),

p̃s(x) = P( A(c1, . . . , cJ) z(ω) ≤ b(x; c1, . . . , cJ) ),

where the J × ν matrix A = A(c1, . . . , cJ) and the J-vector b = b(x; c1, . . . , cJ) are given by (7.57a), (7.57b); furthermore, z = z(ω) is a normally distributed random ν-vector with Ez(ω) = 0 and cov( z(·) ) = I (identity matrix). Due to condition (7.53a), (7.53b) we have

b = b(x; c1, . . . , cJ) > 0.   (7.77)

Thus, we get

p̃s(x; c1, . . . , cJ) = P( Az(ω) ≤ b )
                     = P( bd−1 Az(ω) ≤ 1 ) = P( u(ω) ≤ 1 ),   (7.78a)

where

1 := (1, . . . , 1)T ∈ IRJ   (7.78b)

bd := ( bj δij )_{1≤i,j≤J},  bd−1 = ( (1/bj) δij )_{1≤i,j≤J}   (7.78c)

u(ω) := bd−1 A z(ω),   (7.78d)

where δij denotes the Kronecker symbol. Obviously, the random J-vector u = u(ω) has a joint normal distribution with

Eu(ω) = 0,  cov( u(·) ) = bd−1 A AT bd−1.   (7.78e)

Inequalities for p̃s of the Markov type can be obtained, see, e.g., [100], by selecting a measurable function ϕ = ϕ(u) on IRJ such that

(1) ϕ(u) ≥ 0, u ∈ IRJ   (7.79a)
(2) ϕ(u) ≥ ϕ0 > 0 for u ≰ 1   (7.79b)

with a positive constant ϕ0. If ϕ = ϕ(u) is selected such that (7.79a), (7.79b) hold, then for p̃s we get the lower bound

p̃s(x; c1, . . . , cJ) ≥ 1 − (1/ϕ0) Eϕ( bd−1 Az(ω) ).   (7.80)
We suppose that expectations under consideration exist and are finite. The
approximate probability constraint (7.44a), i.e.,

p̃s (x; c1 , . . . , cJ ) ≥ αs

can be guaranteed then by the condition

Eϕ( bd−1 Az(ω) ) ≤ ϕ0 (1 − αs).   (7.81a)

By means of this technique, lower probability bounds for p̃s and corresponding approximative chance constraints may be obtained very easily, provided that the expectation Eϕ( bd−1 Az(ω) ) can be evaluated efficiently. Obviously, the accuracy of the approximation (7.80) increases with the reliability level αs used in (7.44a):

0 ≤ p̃s − ( 1 − (1/ϕ0) Eϕ( bd−1 Az(ω) ) ) ≤ 1 − αs.   (7.81b)

Moreover, best lower bounds in (7.80) follow by maximizing the right hand
side of (7.80). Some examples for the selection of a function ϕ = ϕ(u), the
related positive constant ϕ0 and the consequences for the expectation term in
inequality (7.80) are given next.

Example 7.2 (Positive definite quadratic forms). Here we suppose that ϕ = ϕ(u) is given by

ϕ(u) := uT Qu, u ∈ IRJ,   (7.82a)

with a positive definite J × J matrix Q such that

uT Qu ≥ ϕ0 > 0 for u ≰ 1,   (7.82b)

where ϕ0 is a certain positive constant.


If the positive definite quadratic form (7.82a) fulfills (7.82b), then for p̃s = p̃s(x; c1, . . . , cJ) we have the lower probability bound (7.80). Moreover, in this case (7.78b)–(7.78e) yields

Eϕ( bd−1 Az(ω) ) = Eϕ( u(ω) ) = E u(ω)T Q u(ω)
  = E tr( Q u(ω) u(ω)T ) = tr( Q cov( u(·) ) )
  = tr( Q bd−1 A AT bd−1 ).   (7.83)

Then, with (7.57a), (7.57b), (7.80) and (7.83) we have the following result:

Theorem 7.9. Suppose that ϕ = ϕ(u) is given by (7.82a), (7.82b) with a positive definite matrix Q and a positive constant ϕ0. In this case

p̃s(x; c1, . . . , cJ) ≥ 1 − (1/ϕ0) tr( Q bd−1 A AT bd−1 ),   (7.84a)

where

bd−1 A AT bd−1 = ( ciT cj / ( (ciT bci,x)(cjT bcj,x) ) )_{i,j=1,...,J}.   (7.84b)

For a given positive definite matrix Q, the best, i.e., largest lower bound ϕ0∗ > 0 in (7.82b), and therefore also in (7.84a), is given by

ϕ0∗ = min{ uT Qu : u ≰ 1 }.   (7.85a)

We also have, cf. Fig. 7.12,

ϕ0∗ = min_{1≤l≤J} min{ uT Qu : ul ≥ 1 },   (7.85b)

where ul denotes the lth component of u. Thus, ϕ0∗ is the minimum

ϕ0∗ = min_{1≤l≤J} ϕ0l∗   (7.85c)

of the minimum values ϕ0l∗ of the convex minimization problems

min uT Qu s.t. 1 − ul ≤ 0,   (7.85d)

l = 1, . . . , J. If (Q−1)ll denotes the lth diagonal element of the inverse Q−1 of Q, then

Fig. 7.12. One-sided inequalities in u-domain

ϕ0l∗ = 1/(Q−1)ll   (7.85e)

and therefore

ϕ0∗ = ϕ0∗(Q) = min_{1≤l≤J} 1/(Q−1)ll.   (7.85f)

In the special case of a diagonal matrix Q = (qjj δij) with positive diagonal elements qjj > 0, j = 1, . . . , J, from (7.85f) we get

ϕ0∗ = min_{1≤j≤J} qjj.   (7.85g)

The largest lower bound in the probability inequality (7.84a) can be obtained now by solving the optimization problem

min_{Q≻0} tr( Q bd−1 A AT bd−1 ) / ϕ0∗(Q),   (7.86)

where the constraint in (7.86) means that Q is a positive definite J × J matrix.


In the special case of a diagonal matrix Q, with W := bd−1 A AT bd−1 we get

tr QW = ∑_{j=1}^{J} qjj wjj

and therefore, since W is at least positive semidefinite (hence wjj ≥ 0),

tr QW / ϕ0∗(Q) = ( ∑_{j=1}^{J} qjj wjj ) / ( min_{1≤j≤J} qjj ) ≥ tr W,

cf. (7.85g). Thus, if we demand in (7.86) that Q is diagonal, then

min_{Q≻0, Q diagonal} tr( Q bd−1 A AT bd−1 ) / ϕ0∗(Q) = tr( bd−1 A AT bd−1 ).   (7.87)
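For diagonal Q, (7.87) therefore yields the easily computable bound p̃s ≥ 1 − tr( bd−1 A AT bd−1 ). A sketch with invented data (J = 2 orthogonal c-vectors, so A = −I, and b = (3, 3) > 0 in accordance with (7.77)):

```python
# Markov-type lower bound with diagonal Q, i.e. the right-hand side of
# (7.87), compared against the exact value of p_tilde_s for toy data.
import numpy as np
from scipy.stats import norm

A = -np.eye(2)                               # two orthogonal c-vectors (7.57a)
b = np.array([3.0, 3.0])                     # b-vector (7.57b), hypothetical
bd_inv = np.diag(1.0 / b)
W = bd_inv @ A @ A.T @ bd_inv
bound = 1.0 - np.trace(W)                    # lower bound from (7.80)/(7.87)

# exact p_tilde_s = P(A z <= b): the two events are independent here
exact = norm.cdf(3.0) ** 2
print(bound, exact)
assert bound <= exact                        # the bound is valid
```

Here the bound is 1 − 2/9 ≈ 0.778 against the exact value ≈ 0.997; as noted after (7.81b), the bound becomes tight only for reliability levels close to 1, i.e., for large entries of b.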

Example 7.3 (Convex quadratic functions). In extension of (7.82a), (7.82b) we consider now the class of convex quadratic functions given by

ϕ(u) := (u − u0)T Q (u − u0), u ∈ IRJ,   (7.88a)

with a positive semidefinite J × J matrix Q and a J-vector u0 such that u0 ≤ 1. Dividing the inequality in (7.79b) by the constant ϕ0 > 0, it is easy to see that we may suppose

ϕ0 = 1.   (7.88b)
Thus, if the matrix/vector parameter (Q, u0) is selected such that

ϕ(u) ≥ 1 for all u ≰ 1,   (7.88c)

then corresponding to (7.80) we have

p̃s(x; c1, . . . , cJ) ≥ 1 − Eϕ( bd−1 Az(ω) ).   (7.89a)

Using (7.78c)–(7.78e) and (7.83), in the present case (7.88a) we get

Eϕ( bd−1 Az(ω) ) = Eϕ( u(ω) )
  = E( u(ω)T Q u(ω) − 2u0T Q u(ω) + u0T Q u0 )
  = tr( Q bd−1 A AT bd−1 ) + u0T Q u0.   (7.89b)

Corresponding to (7.85a), we define ϕ0∗ = ϕ0∗(Q, u0) by

ϕ0∗(Q, u0) := min{ (u − u0)T Q (u − u0) : u ≰ 1 }.   (7.90a)

We have, cf. (7.85b),

ϕ0∗(Q, u0) = min_{1≤l≤J} ϕ0l∗(Q, u0),   (7.90b)

where ϕ0l∗ = ϕ0l∗(Q, u0) denotes the minimum value of the convex program

min (u − u0)T Q (u − u0) s.t. 1 − ul ≤ 0,   (7.90c)

l = 1, . . . , J. As in (7.85d)–(7.85f) we get

ϕ0∗(Q, u0) = min_{1≤l≤J} (1 − u0l)² / (Q−1)ll for Q ≻ 0.   (7.90d)

Fixing a set

B ⊂ { (Q, u0) : Q ⪰ 0, u0 ≤ 1 },   (7.91a)

the best probability bound in (7.89a) is represented by

p̃s(x; c1, . . . , cJ) ≥ 1 − inf{ Eϕ( bd−1 Az(ω) ) : ϕ0∗(Q, u0) ≥ 1, (Q, u0) ∈ B },   (7.91b)

cf. (7.86). Hence, in the present case the following minimization problem is obtained:

min tr( Q bd−1 A AT bd−1 ) + u0T Q u0   (7.92a)

s.t.

ϕ0∗(Q, u0) ≥ 1   (7.92b)
(Q, u0) ∈ B.   (7.92c)

Approximation by Means of Two-Sided Probability Inequalities

Selecting a J-vector d < 1 := (1, . . . , 1)T ∈ IRJ, we may approximate p̃s = p̃s(x; c1, . . . , cJ) by

p̃s(x; c1, . . . , cJ) = P( u(ω) ≤ 1 ) ≥ P( d ≤ u(ω) ≤ 1 ),   (7.93)

where the random J-vector u = u(ω) is again defined by (7.78b)–(7.78e). Since u = u(ω) has an N( 0, bd−1 A AT bd−1 )-normal distribution, we may select d < 1 such that the error of the approximation (lower bound) is small.

Corresponding to (7.79a), (7.79b), in the present case we consider a function ϕ = ϕ(u) on IRJ such that

ϕ(u) ≥ 0 for u ∈ IRJ   (7.94a)
ϕ(u) ≥ 1 for all u ∈ IRJ such that u ∉ [d, 1].   (7.94b)

Then, from (7.93) and (7.78d) we get [100]

p̃s(x; c1, . . . , cJ) ≥ 1 − Eϕ( bd−1 Az(ω) ).   (7.94c)

Example 7.4 (Convex quadratic functions for bounded u-domains). Suppose that the function ϕ = ϕ(u) is defined again by (7.88a), where in the present case (Q, u0) must be selected such that

ϕ(u) ≥ 1 for all u ∉ [d, 1]   (7.95a)

and d < u0 < 1.

Fig. 7.13. Two-sided inequalities in u-domain

As in the above examples we define now, cf. Fig. 7.13,

ϕ0∗(Q, u0) = min{ (u − u0)T Q (u − u0) : u ∉ [d, 1] }.   (7.95b)

We get

ϕ0∗(Q, u0) = min_{1≤l≤J} min( ϕ0la∗(Q, u0), ϕ0lb∗(Q, u0) ),   (7.95c)

where ϕ0la∗ = ϕ0la∗(Q, u0), ϕ0lb∗ = ϕ0lb∗(Q, u0), resp., l = 1, . . . , J, denote the minimum values of the convex optimization problems

min (u − u0)T Q (u − u0) s.t. ul − dl ≤ 0,   (7.95d)
min (u − u0)T Q (u − u0) s.t. 1 − ul ≤ 0.   (7.95e)

If the matrix Q is positive definite, then (7.90d) yields

ϕ0lb∗(Q, u0) = (1 − u0l)² / (Q−1)ll, l = 1, . . . , J,   (7.95f)

and in a similar way we get

ϕ0la∗(Q, u0) = (dl − u0l)² / (Q−1)ll, l = 1, . . . , J.   (7.95g)

Thus, in the case of a positive definite J × J matrix Q we finally get

ϕ0∗(Q, u0) = min_{1≤l≤J} ( 1/(Q−1)ll ) min( (1 − u0l)², (dl − u0l)² ).   (7.95h)

Maximum lower probability bounds in (7.94c) can be obtained then by the


same techniques as described in Example 7.3.

7.7.2 Discretization Methods

The problem is again the computation of the approximate probability function


p̃s given by (7.57c), hence,
 
p̃s(x; c1, . . . , cJ) = P( z(ω) ∈ C(x; c1, . . . , cJ) ).   (7.96a)

Here, the convex set C(x; c1 , . . . , cJ ) in IRν is defined by

C(x; c1 , . . . , cJ ) := {z ∈ IRν : Az ≤ b}, (7.96b)

where the matrix (A, b) is given by (7.57a), (7.57b). Furthermore, z = z(ω) is a random ν-vector having the standard normal distribution N(0, I) with zero mean and unit covariance matrix I. However, in the present case it is also possible to consider directly the basic probability function ps given by (7.18a). Hence,

ps(x) = P( s∗Γ( z(ω), x ) ≤ 0 ) = P( z(ω) ∈ BΓs,x )

with the transformed limit state function s∗Γ = s∗Γ(z, x) or the transformed safe domain BΓs,x.
Based on the representation of ps = ps(x) by a multiple integral [150] in IRν, the N(0, I)-distributed random ν-vector z = z(ω) is approximated now by a random ν-vector z̃ = z̃(ω) having a discrete distribution

Pz̃(·) = µ := ∑_{j=1}^{N} αj εz(j),   (7.97a)

where z(j), j = 1, . . . , N, are the realizations or scenarios of z̃(·) taken in IRν with probabilities αj, j = 1, . . . , N, where

∑_{j=1}^{N} αj = 1, αj > 0, j = 1, . . . , N.   (7.97b)

As an appropriate approximation of the standard normal distribution N(0, I) in IRν, the discrete distribution µ should have the following obvious properties:

∫ z µ(dz) = E z̃(ω) = E z(ω) = 0   (7.97c)

∫ z zT µ(dz) = cov( z̃(·) ) = cov( z(·) ) = I.   (7.97d)

Thus, we have to select the set of realizations and corresponding probabilities

z(1), z(2), . . . , z(j), . . . , z(N)
α1, α2, . . . , αj, . . . , αN   (7.97e)

such that

∑_{j=1}^{N} αj z(j) = 0   (7.98a)
∑_{j=1}^{N} αj z(j) z(j)T = I   (7.98b)
∑_{j=1}^{N} αj = 1   (7.98c)
αj > 0, j = 1, . . . , N.   (7.98d)

Remark 7.11. In order to increase the accuracy of the approximation µ of the


N (0, I)-distribution in IRν , conditions for the third and higher order moments
of z̃ = z̃(ω) can be added to (7.98a)–(7.98d). Moreover, known symmetry
properties of N (0, I) can be used in the selection of the realizations z (j) and
probabilities αj , j = 1, . . . , N , in (7.97e).
Because of the symmetry of the covariance matrix, in system (7.98a)–(7.98d) we have

(a) ν + ν(ν + 1)/2 + 1 equalities
(b) N inequalities

for the Nν + N unknowns (z(j), αj), j = 1, . . . , N, in IRν × IR.
Defining

α := (α1, α2, . . . , αN)T   (7.99a)
Z := ( z(1), z(2), . . . , z(N) )   (7.99b)
(√α)d := diag( √α1, √α2, . . . , √αN )   (7.99c)
1 := (1, 1, . . . , 1)T,   (7.99d)

conditions (7.98a)–(7.98d) can be represented also in the following form:

Zα = 0   (7.98a′)
( Z(√α)d )( Z(√α)d )T = I   (7.98b′)
αT 1 = 1   (7.98c′)
α > 0.   (7.98d′)

In the special case

αj = 1/N, j = 1, . . . , N,   (7.100a)

we get

α = (1/N) 1   (7.100b)
(√α)d = (1/√N) I,   (7.100c)

and (7.98a′)–(7.98d′) is reduced then to

Z1 = 0   (7.100d)
( (1/√N) Z )( (1/√N) Z )T = I.   (7.100e)
Example 7.5. For ν = 2 and N = 3, a possible selection of z(j), αj, j = 1, 2, 3, reads

z(1) = ( −√2, 0 )T,  z(2) = ( 1/√2, √(3/2) )T,  z(3) = ( 1/√2, −√(3/2) )T,

α1 = 1/3, α2 = 1/3, α3 = 1/3.
If N = ν, then (7.100e) means that (1/√N) Z is an orthogonal matrix, which contradicts (7.100d). On the other hand, we can take N = 2ν and select then the pairs (z(j), αj), j = 1, . . . , N, as follows:

z̃(1), z̃(2), . . . , z̃(ν), −z̃(1), −z̃(2), . . . , −z̃(ν)   (7.101a)

α̃1/2, α̃2/2, . . . , α̃ν/2, α̃1/2, α̃2/2, . . . , α̃ν/2.   (7.101b)
Then,

∑_{j=1}^{N} αj z(j) = ∑_{j=1}^{ν} (α̃j/2) z̃(j) + ∑_{j=1}^{ν} (α̃j/2)( −z̃(j) )
                   = ∑_{j=1}^{ν} ( α̃j/2 − α̃j/2 ) z̃(j) = 0,   (7.101c)

∑_{j=1}^{N} αj z(j) z(j)T = ∑_{j=1}^{ν} (α̃j/2) z̃(j) z̃(j)T + ∑_{j=1}^{ν} (α̃j/2)( −z̃(j) )( −z̃(j) )T
                         = ∑_{j=1}^{ν} α̃j z̃(j) z̃(j)T,   (7.101d)

∑_{j=1}^{N} αj = ∑_{j=1}^{ν} α̃j/2 + ∑_{j=1}^{ν} α̃j/2 = ∑_{j=1}^{ν} α̃j.   (7.101e)

Thus, in order to fulfill conditions (7.98a)–(7.98d), the data z̃(1), . . . , z̃(ν), α̃1, . . . , α̃ν must be selected such that

∑_{j=1}^{ν} α̃j z̃(j) z̃(j)T = I   (7.102a)

∑_{j=1}^{ν} α̃j = 1   (7.102b)

α̃j > 0, j = 1, . . . , ν.   (7.102c)

Corresponding to (7.98b′), (7.102a) means that

( √α̃1 z̃(1), . . . , √α̃ν z̃(ν) )( √α̃1 z̃(1), . . . , √α̃ν z̃(ν) )T = I.   (7.103a)

Hence, we have

z̃(j) = ( 1/√α̃j ) d(j), j = 1, 2, . . . , ν,   (7.103b)

where

( d(1), d(2), . . . , d(ν) ) is an orthogonal ν × ν matrix.   (7.103c)

Example 7.6. In case ν = 2,

d(1) = (1/√2)(1, 1)T,  d(2) = (1/√2)(−1, 1)T

are the columns of an orthogonal 2 × 2 matrix. Taking α̃1 = α̃2 = 1/2, from (7.103b) and (7.101a), (7.101b) we get

z(1) = z̃(1) = √2 · (1/√2)(1, 1)T = (1, 1)T
z(2) = z̃(2) = √2 · (1/√2)(−1, 1)T = (−1, 1)T
z(3) = −z̃(1) = (−1, −1)T
z(4) = −z̃(2) = (1, −1)T,

where αj = (1/2) · (1/2) = 1/4, j = 1, . . . , 4.

Example 7.7. Let ν > 1 be an arbitrary integer and define
\[
d^{(j)} := e_j, \qquad \tilde\alpha_j = \frac{1}{\nu}, \quad j = 1, \ldots, \nu,
\]
where e₁, e₂, …, e_ν are the unit vectors of IR^ν. Then, with (7.103b), distribution (7.101a), (7.101b) reads
\[
\begin{array}{cccccc}
\sqrt{\nu}\,e_1, & \ldots, & \sqrt{\nu}\,e_\nu, & -\sqrt{\nu}\,e_1, & \ldots, & -\sqrt{\nu}\,e_\nu \\[2pt]
\dfrac{1}{2\nu}, & \ldots, & \dfrac{1}{2\nu}, & \dfrac{1}{2\nu}, & \ldots, & \dfrac{1}{2\nu}.
\end{array}
\]
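The construction (7.101a), (7.101b) with (7.103b) is easy to check numerically. The following Python sketch (an illustration added by the editor, not part of the original text; NumPy is assumed and the function name is hypothetical) builds the 2ν-point distribution from an orthogonal matrix and verifies the moment conditions (7.101c)–(7.101e):

```python
import numpy as np

def two_nu_point_distribution(D, alpha_t):
    """Build the 2*nu-point distribution (7.101a), (7.101b) from an
    orthogonal nu x nu matrix D = (d^(1), ..., d^(nu)) and weights alpha_t."""
    # z~^(j) = d^(j) / sqrt(alpha~_j), cf. (7.103b)
    Zt = D / np.sqrt(alpha_t)           # columns are the z~^(j)
    Z = np.hstack([Zt, -Zt])            # atoms z^(1), ..., z^(2 nu)
    alpha = np.concatenate([alpha_t, alpha_t]) / 2.0
    return Z, alpha

nu = 3
D = np.linalg.qr(np.random.default_rng(0).normal(size=(nu, nu)))[0]
alpha_t = np.full(nu, 1.0 / nu)
Z, alpha = two_nu_point_distribution(D, alpha_t)

mean = Z @ alpha                        # should be 0, cf. (7.101c)
cov = (Z * alpha) @ Z.T                 # should be I, cf. (7.101d)
print(np.allclose(mean, 0), np.allclose(cov, np.eye(nu)), np.isclose(alpha.sum(), 1.0))
```

Example 7.7 corresponds to the special case `D = np.eye(nu)`.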

7.7.3 Convergent Sequences of Discrete Distributions

Let z̃(·) denote again a random vector in IR^ν having a discrete probability distribution \(P_{\tilde z(\cdot)} = \mu\) according to (7.97a), (7.97b). We then consider sequences of i.i.d. (independent and identically distributed) random ν-vectors
\[
Z_1, Z_2, \ldots, Z_l, \ldots \tag{7.104a}
\]
such that
\[
P_{Z_l} = P_{\tilde z(\cdot)} = \mu, \quad l = 1, 2, \ldots. \tag{7.104b}
\]
Hence, due to (7.97c), (7.97d) we get
\[
EZ_l = E\tilde z = \int z\,\mu(dz) = 0, \tag{7.104c}
\]
\[
\operatorname{cov}(Z_l) = \operatorname{cov}(\tilde z) = \int zz^T\,\mu(dz) = I. \tag{7.104d}
\]
Starting from the discrete distribution µ, we obtain better approximations of the N(0, I)-distribution in IR^ν by considering the distribution of the random ν-vector
\[
S^{(L)} := \frac{1}{\sqrt L}\bigl(Z_1 + Z_2 + \ldots + Z_L\bigr) \quad \text{for } L = 1, 2, \ldots. \tag{7.105}
\]
Due to (7.104c), (7.104d) we have
\[
ES^{(L)} = \frac{1}{\sqrt L}\sum_{l=1}^{L} EZ_l = 0, \tag{7.106a}
\]
\[
\operatorname{cov}\bigl(S^{(L)}\bigr) = E\Bigl(\frac{1}{\sqrt L}\sum_{l=1}^{L} Z_l\Bigr)\Bigl(\frac{1}{\sqrt L}\sum_{l=1}^{L} Z_l\Bigr)^T = \frac{1}{L}\sum_{l=1}^{L} EZ_lZ_l^T = I. \tag{7.106b}
\]

Moreover, the distribution µ^(L) of S^(L) is given by the convolution
\[
\mu^{(L)} := P_{S^{(L)}} = P_{Z_1^{(L)}} * P_{Z_2^{(L)}} * \ldots * P_{Z_L^{(L)}}, \tag{7.106c}
\]
where \(Z_j^{(L)}\), j = 1, …, L, are i.i.d. random ν-vectors having, see (7.97a), (7.97b), (7.97e), the discrete distribution
\[
\begin{pmatrix}
\dfrac{z^{(1)}}{\sqrt L}, & \dfrac{z^{(2)}}{\sqrt L}, & \ldots, & \dfrac{z^{(j)}}{\sqrt L}, & \ldots, & \dfrac{z^{(N)}}{\sqrt L} \\[4pt]
\alpha_1, & \alpha_2, & \ldots, & \alpha_j, & \ldots, & \alpha_N
\end{pmatrix}. \tag{7.106d}
\]
Consequently, if j₁, j₂, …, j_l, …, j_L are arbitrary integers such that
\[
j_l \in \{1, \ldots, N\}, \quad l = 1, \ldots, L, \tag{7.107a}
\]
then the distribution µ^(L) of S^(L) is the discrete distribution taking the realizations (atoms)
\[
s = s(j_1, j_2, \ldots, j_L) := \frac{1}{\sqrt L}\sum_{l=1}^{L} z^{(j_l)} \tag{7.107b}
\]
with the probabilities
\[
\alpha_s := \sum_{s(j_1,\ldots,j_L)=s}\;\prod_{l=1}^{L} \alpha_{j_l}. \tag{7.107c}
\]
Thus, µ^(L), L = 1, 2, …, is a sequence of discrete probability distributions of the type (7.97a)–(7.97d). According to the central limit theorem [13, 110] it is known that
\[
\mu^{(L)} \xrightarrow{\text{weak}} N(0, I), \quad L \to \infty, \tag{7.107d}
\]
where “\(\xrightarrow{\text{weak}}\)” means “weak convergence” of probability measures [13, 110].
An important special case arises if one starts with a random ν-vector z̃ having independent components
\[
\tilde z = (\tilde z_1, \ldots, \tilde z_\nu)^T.
\]
In this case also the components
\[
Z_{l1}, \ldots, Z_{li}, \ldots, Z_{l\nu}, \quad l = 1, 2, \ldots,
\]
of the sequence (Z_l) of i.i.d. random ν-vectors according to (7.104a), (7.104b) are independent. Thus, from (7.105) we obtain that
\[
S^{(L)} = \frac{1}{\sqrt L}\sum_{l=1}^{L} Z_l
\]
has independent components
\[
S_i^{(L)} = \frac{1}{\sqrt L}\sum_{l=1}^{L} Z_{li}, \quad i = 1, \ldots, \nu.
\]
However, this means that
\[
\mu^{(L)} = P_{S^{(L)}} = \bigotimes_{i=1}^{\nu} P_{S_i^{(L)}} \tag{7.108}
\]
is the product of the distributions \(P_{S_i^{(L)}}\) of the components \(S_i^{(L)}\), i = 1, …, ν, of S^(L).
If we assume now that the i.i.d. components z̃_i, i = 1, …, ν, of z̃ have the common distribution
\[
P_{\tilde z_i} = \frac12\,\varepsilon_{-1} + \frac12\,\varepsilon_{+1}, \tag{7.109a}
\]
where ε_z denotes the one-point measure at a point z, then the i.i.d. components \(S_i^{(L)}\), i = 1, …, ν, of S^(L) have, see (7.106a)–(7.106d), (7.107a)–(7.107c), the binomial-type distribution
\[
P_{S_i^{(L)}} = \sum_{l=0}^{L} \binom{L}{l}\frac{1}{2^L}\,\varepsilon_{\frac{2l-L}{\sqrt L}}, \quad i = 1, \ldots, \nu. \tag{7.109b}
\]
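The binomial-type distribution (7.109b) can be generated and checked directly. The sketch below (an editorial illustration, not from the original text; only the standard library and NumPy are used) enumerates the atoms \((2l-L)/\sqrt L\) with their binomial weights and confirms mean 0 and variance 1, in agreement with (7.106a), (7.106b):

```python
import numpy as np
from math import comb

def binomial_type_distribution(L):
    """Atoms and weights of P_{S_i^(L)} in (7.109b): the L-fold convolution
    of (1/2) eps_{-1} + (1/2) eps_{+1}, rescaled by 1/sqrt(L)."""
    atoms = np.array([(2 * l - L) / np.sqrt(L) for l in range(L + 1)])
    probs = np.array([comb(L, l) / 2.0**L for l in range(L + 1)])
    return atoms, probs

atoms, probs = binomial_type_distribution(20)
mean = np.dot(probs, atoms)
var = np.dot(probs, atoms**2)
print(mean, var)
```

As L grows, these discrete one-dimensional marginals approach N(0, 1), which is the content of (7.107d) in this special case.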

The Approximate Probability Functions

If \(B_{s,x}^{\Gamma}\) denotes again the transformed safe domain (in z-space), then the probability function p_s and its approximate p̃_s can be represented by
\[
p_s(x) = P\bigl(z(\omega) \in B_{s,x}^{\Gamma}\bigr) = P\Bigl(s_\Gamma^{*}\bigl(z(\omega), x\bigr) \le 0\Bigr), \tag{7.110a}
\]
\[
\tilde p_s(x) = \tilde p_s(x;\mu) := P\bigl(\tilde z(\omega) \in B_{s,x}^{\Gamma}\bigr) = P\Bigl(s_\Gamma^{*}\bigl(\tilde z(\omega), x\bigr) \le 0\Bigr). \tag{7.110b}
\]
If, besides the random ν-vector z = z(ω), also the safe domain \(B_{s,x}^{\Gamma}\) is approximated, e.g., by a polyhedral set \(\widehat B_{s,x}^{\Gamma}\) such that
\[
\widehat B_{s,x}^{\Gamma} \subset B_{s,x}^{\Gamma} \quad \text{or} \quad B_{s,x}^{\Gamma} \subset \widehat B_{s,x}^{\Gamma},
\]
then we have the approximate probability function
\[
\tilde p_s(x) = \tilde p_s\bigl(x; \widehat B_{s,x}^{\Gamma}, \mu\bigr) := P\bigl(\tilde z(\omega) \in \widehat B_{s,x}^{\Gamma}\bigr). \tag{7.110c}
\]

Fig. 7.14. Discrete approximation of probability distributions

In the present case of approximate random ν-vectors z̃ with a discrete distribution \(P_{\tilde z} = \mu\) according to (7.97a), (7.97b), the approximate probability functions read (Fig. 7.14)
\[
\tilde p_s(x;\mu) = \sum_{s_\Gamma^{*}(z^{(j)},\,x) \le 0} \alpha_j = \sum_{z^{(j)} \in B_{s,x}^{\Gamma}} \alpha_j, \tag{7.111a}
\]
\[
\tilde p_s\bigl(x; \widehat B_{s,x}^{\Gamma}, \mu\bigr) = \sum_{z^{(j)} \in \widehat B_{s,x}^{\Gamma}} \alpha_j. \tag{7.111b}
\]

Using the central limit theorem [13, 49, 110], we get the following convergence result:

Theorem 7.10. Suppose that the approximate discrete distribution µ^(L) is defined as described in Sect. 7.7.3.
(a) If \(P\bigl(s_\Gamma^{*}(z(\omega), x) = 0\bigr) = 0\), then
\[
\tilde p_s\bigl(x; \mu^{(L)}\bigr) \to p_s(x), \quad L \to \infty. \tag{7.112a}
\]
(b) Let \(\widehat B_{s,x}^{\Gamma}\) be a convex approximation of \(B_{s,x}^{\Gamma}\) such that \(P\bigl(z(\omega) \in \partial\widehat B_{s,x}^{\Gamma}\bigr) = 0\), where ∂B denotes the boundary of a set B ⊂ IR^ν. Then
\[
\Bigl|\tilde p_s\bigl(x; \widehat B_{s,x}^{\Gamma}, \mu^{(L)}\bigr) - p_s\bigl(x; \widehat B_{s,x}^{\Gamma}\bigr)\Bigr| \le O\Bigl(\frac{1}{\sqrt L}\Bigr), \quad L \to \infty, \tag{7.112b}
\]
where
\[
p_s\bigl(x; \widehat B_{s,x}^{\Gamma}\bigr) := P\bigl(z(\omega) \in \widehat B_{s,x}^{\Gamma}\bigr).
\]
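The evaluation (7.111a) and the error behavior (7.112b) can be illustrated for a one-dimensional halfspace standing in for the safe domain, where the exact probability is known in closed form. The sketch below is an editorial illustration under these simplifying assumptions (standard library only; the function name is hypothetical):

```python
from math import comb, erf, sqrt

def p_tilde_halfspace(b, L):
    """Approximate P(z_1 <= b), z ~ N(0, I), by summing the weights alpha_j
    of the atoms of mu^(L) (here the binomial-type marginal (7.109b)) that
    lie in the safe domain {z : z_1 <= b}, cf. (7.111a)."""
    total = 0.0
    for l in range(L + 1):
        atom = (2 * l - L) / sqrt(L)
        if atom <= b:
            total += comb(L, l) / 2.0**L
    return total

b, L = 0.5, 400
exact = 0.5 * (1 + erf(b / sqrt(2)))     # Phi(b) for the standard normal
approx = p_tilde_halfspace(b, L)
print(abs(approx - exact))               # small, of order 1/sqrt(L), cf. (7.112b)
```

A general polyhedral \(\widehat B_{s,x}^{\Gamma}\) would be handled the same way: test each atom of µ^(L) for membership and sum the corresponding probabilities.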
A Sequences, Series and Products

In this appendix let IN = IN^(1) ∪ … ∪ IN^(κ) denote a partition of IN fulfilling condition (U6) from Chapter 6.3.

A.1 Mean Value Theorems for Deterministic Sequences

Here, often-used statements about generalized mean values of scalar sequences are given. In this context the following double sequences are of importance. We use the following notations:
\[
\begin{aligned}
F &:= \{(a_n) : a_n \in \mathrm{IR} \text{ for } n \in \mathrm{IN}\},\\
D &:= \{(a_{m,n})_{m,n} : a_{m,n} \in \mathrm{IR} \text{ for } 0 \le m \le n\},\\
D^{+} &:= \{(a_{m,n})_{m,n} \in D : \text{there exists } n_0 \in \mathrm{IN} \text{ with } a_{m,n} > 0 \text{ for } n_0 \le m \le n\},\\
D^{0} &:= \{(a_{m,n})_{m,n} \in D^{+} : \lim_{n\to\infty} a_{m,n} = 0 \text{ for } m = 0, 1, 2, \ldots\}.
\end{aligned}
\]

Two sequences (a_{m,n}) and (b_{m,n}) in D⁺ can be compared if we define:

(a) (a_{m,n}) is not greater than (b_{m,n}), denoted as \(a_{m,n} \preceq b_{m,n}\), if to any number ε > 0 there exists n₀ ∈ IN with
\[
0 < a_{m,n} \le (1+\varepsilon)\,b_{m,n} \quad \text{for all } n_0 \le m \le n.
\]
(b) (a_{m,n}) and (b_{m,n}) are equivalent, denoted as \(a_{m,n} \sim b_{m,n}\), if
\[
a_{m,n} \preceq b_{m,n} \quad \text{and} \quad b_{m,n} \preceq a_{m,n}.
\]
Hence, “\(\preceq\)” is an order, and “∼” is an equivalence relation on D⁺ with the following properties:

Lemma A.1. Let (a_{m,n}), (b_{m,n}), (α_{m,n}) and (β_{m,n}) denote sequences from D⁺.
(a) If \(a_{m,n} \preceq \alpha_{m,n}\) and \(b_{m,n} \preceq \beta_{m,n}\), then
\[
a_{m,n} + b_{m,n} \preceq \alpha_{m,n} + \beta_{m,n}, \qquad a_{m,n}b_{m,n} \preceq \alpha_{m,n}\beta_{m,n},
\]
\[
\frac{a_{m,n}}{\beta_{m,n}} \preceq \frac{\alpha_{m,n}}{b_{m,n}}, \qquad a_{m,n}^{\,s} \preceq \alpha_{m,n}^{\,s} \ \text{ for } s \ge 1.
\]
(b) If \(a_{m,n} \sim \alpha_{m,n}\) and \(b_{m,n} \sim \beta_{m,n}\), then
\[
a_{m,n} + b_{m,n} \sim \alpha_{m,n} + \beta_{m,n}, \qquad a_{m,n}b_{m,n} \sim \alpha_{m,n}\beta_{m,n},
\]
\[
\frac{a_{m,n}}{b_{m,n}} \sim \frac{\alpha_{m,n}}{\beta_{m,n}}, \qquad a_{m,n}^{\,s} \sim \alpha_{m,n}^{\,s} \ \text{ for } s \in \mathrm{IR}.
\]

Proof. In case of (a) there exists to any ε > 0 an index n₀ ∈ IN with
\[
0 < a_{m,n} \le (1+\varepsilon)\alpha_{m,n}, \qquad 0 < b_{m,n} \le (1+\varepsilon)\beta_{m,n} \quad \text{for } n_0 \le m \le n.
\]
Hence, for n₀ ≤ m ≤ n we get
\[
0 < a_{m,n} + b_{m,n} \le (1+\varepsilon)(\alpha_{m,n} + \beta_{m,n}),
\]
\[
0 < a_{m,n}b_{m,n} \le (1+\varepsilon)^2\,\alpha_{m,n}\beta_{m,n},
\]
\[
0 < \frac{a_{m,n}}{\beta_{m,n}} \le (1+\varepsilon)^2\,\frac{\alpha_{m,n}}{b_{m,n}},
\]
\[
0 < a_{m,n}^{\,s} \le (1+\varepsilon)^s\,\alpha_{m,n}^{\,s} \quad \text{for } s \ge 1.
\]
Thus, assertion (a) follows, and therefore also assertion (b).


The following mean value theorem is a modification of a theorem by O. Toeplitz (cf. [65, Sect. 8, Theorem 5]).

Lemma A.2. Let (a_{m,n}), (α_{m,n}) ∈ D⁰ and (β_m) ∈ F, and assume
\[
\lim_{n\to\infty} \sum_{m=1}^{n} a_{m,n} =: a \in (0, \infty).
\]
Defining \(\underline\beta := \liminf_n \beta_n\) and \(\bar\beta := \limsup_n \beta_n\), we get for each index k ∈ IN:
(a) If \(\alpha_{m,n} \preceq a_{m,n}\) and \(\bar\beta \ge 0\), then \(\limsup_n \sum_{m=k}^{n} \alpha_{m,n}\beta_m \le a\bar\beta\).
(b) If \(a_{m,n} \preceq \alpha_{m,n}\) and \(\underline\beta > 0\), then \(a\underline\beta \le \liminf_n \sum_{m=k}^{n} \alpha_{m,n}\beta_m\).
(c) If \(\alpha_{m,n} \sim a_{m,n}\), then \(\operatorname{limextr}_n \sum_{m=k}^{n} \alpha_{m,n}\beta_m \in [a\underline\beta,\, a\bar\beta]\).

In Lemma A.2, limextr denotes lim sup or lim inf.



Proof. Due to (a_{m,n}), (α_{m,n}) ∈ D⁰, we have for any k ∈ IN
\[
\lim_{n} \sum_{m=1}^{k-1} a_{m,n} = 0 \quad \text{and} \quad \lim_{n} \sum_{m=1}^{k-1} \alpha_{m,n}\beta_m = 0. \tag{i}
\]
Hence, all assertions have only to be shown for k = 1.

For the proof of (a) assume \(\bar\beta < \infty\). Hence, for any ε > 0 there exists an index n₀ such that
\[
\beta_m \le \bar\beta + \varepsilon > 0 \quad \text{and} \quad 0 < \alpha_{m,n} \le (1+\varepsilon)a_{m,n}
\]
for all indices m, n with n₀ ≤ m ≤ n. Hence, for n ≥ n₀ we find
\[
\sum_{m=n_0}^{n} \alpha_{m,n}\beta_m \le \sum_{m=n_0}^{n} \alpha_{m,n}(\bar\beta + \varepsilon) \le (1+\varepsilon)(\bar\beta + \varepsilon)\sum_{m=n_0}^{n} a_{m,n}
= (1+\varepsilon)(\bar\beta + \varepsilon)\Bigl(\sum_{m=1}^{n} a_{m,n} - \sum_{m=1}^{n_0-1} a_{m,n}\Bigr).
\]
Now, by means of (i) we get \(\limsup_n \sum_{m=1}^{n} \alpha_{m,n}\beta_m \le (1+\varepsilon)(\bar\beta + \varepsilon)a\). Since ε > 0 was chosen arbitrarily, this shows assertion (a).

Using the assumptions of (b), there exists to any \(\varepsilon \in (0, \underline\beta]\) an index n₀ with
\[
0 \le \underline\beta - \varepsilon \le \beta_m \quad \text{and} \quad 0 \le a_{m,n} \le (1+\varepsilon)\alpha_{m,n}
\]
for n₀ ≤ m ≤ n. Hence, for n ≥ n₀
\[
\sum_{m=n_0}^{n} \alpha_{m,n}\beta_m \ge \frac{\underline\beta - \varepsilon}{1+\varepsilon}\Bigl(\sum_{m=1}^{n} a_{m,n} - \sum_{m=1}^{n_0-1} a_{m,n}\Bigr)
\]
holds. Therefore, again by means of (i), we have \(\liminf_n \sum_{m=1}^{n} \alpha_{m,n}\beta_m \ge (\underline\beta - \varepsilon)(1+\varepsilon)^{-1}a\). Thus, we now obtain statement (b).

In the proof of (c) we consider the following three possible cases:
Case 1: \(\underline\beta > 0\). The relation in (c) is fulfilled due to (a) and (b).
Case 2: \(\bar\beta < 0\). This case is reduced to the previous case by means of the transition from β_n to −β_n.
Case 3: \(\underline\beta \le 0 \le \bar\beta\). Here, from (a) we get
\[
\limsup_n \sum_{m=k}^{n} \alpha_{m,n}\beta_m \le a\bar\beta \quad \text{and} \quad \limsup_n \sum_{m=k}^{n} \alpha_{m,n}(-\beta_m) \le a(-\underline\beta).
\]
Hence, also statement (c) holds.



Given a sequence (a_n) ∈ F, for m, n ∈ IN₀, m ≤ n, we now define
\[
b_{m,n} := \begin{cases} \displaystyle\prod_{j=m+1}^{n} (1 - a_j), & \text{for } m < n \\[4pt] 1, & \text{for } m = n. \end{cases}
\]
About this sequence the following statement can be made:

Lemma A.3. (a) If
(α) there exists an index n₀ with a_n < 1 for n ≥ n₀, and (β) \(\sum_n a_n = \infty\),
then (b_{m,n}) ∈ D⁰ and \(\lim_{n\to\infty} \sum_{m=k}^{n} b_{m,n}a_m = 1\) for k = 1, 2, ….
(b) In case \(\sum_n a_n^2 < \infty\) it is \(b_{m,n} \sim \exp\bigl(-(a_{m+1} + \ldots + a_n)\bigr)\).
(c) If \(\sum_n |a_n| < \infty\), then there exists an index n₂ with
\[
\operatorname{limextr}_n b_{m,n} \in (0, \infty) \quad \text{for } m \ge n_2.
\]

Proof. In all cases (a), (b) and (c) there exists an index n₀ with a_m < 1 for m ≥ n₀. Hence, it holds
\[
0 < b_{m,n} \le \exp\bigl(-(a_{m+1} + \ldots + a_n)\bigr) \quad \text{for } n_0 \le m < n. \tag{i}
\]
Consequently, in case \(\sum_n a_n = \infty\) we have \(\lim_{n\to\infty} b_{m,n} = 0\) for m = 0, 1, 2, … and therefore (b_{m,n}) ∈ D⁰. Hence, for any index k,
\[
\sum_{m=k}^{n} b_{m,n}a_m = \sum_{m=k}^{n} b_{m,n}\bigl(1 - (1 - a_m)\bigr) = 1 - b_{k-1,n} \longrightarrow 1 \quad \text{for } n \longrightarrow \infty.
\]
In case \(\sum_n a_n^2 < \infty\) there exists to each ε > 0 an index n₁ ≥ n₀ with
\[
0 < \exp(-2a_j^2) \le 1 - a_j^2 \ \text{ for } j \ge n_1, \qquad \exp\Bigl(2\sum_{j\ge n_1} a_j^2\Bigr) \le 1 + \varepsilon.
\]
Hence, for n₁ ≤ m < n we find
\[
b_{m,n}\exp(a_{m+1} + \ldots + a_n) \ge b_{m,n}(1 + a_{m+1})\cdots(1 + a_n)
= (1 - a_{m+1}^2)\cdots(1 - a_n^2) \ge \exp\bigl(-2(a_{m+1}^2 + \ldots + a_n^2)\bigr) \ge \frac{1}{1+\varepsilon},
\]
or \(\exp\bigl(-(a_{m+1} + \ldots + a_n)\bigr) \le (1+\varepsilon)b_{m,n}\), respectively. Therefore (b) follows by means of (i).
' '
Now, let \(\sum_n |a_n| < \infty\). Then (a_n) converges to zero and the series \(\sum_n b_n\) converges, if b_n, n = 1, 2, …, is defined by
\[
b_n := \begin{cases} 2a_n, & \text{for } a_n \ge 0 \\ \frac12 a_n, & \text{for } a_n < 0. \end{cases}
\]
There exists an index n₂ ≥ n₀ with
\[
0 < \exp(-b_n) \le 1 - a_n \quad \text{for } n \ge n_2.
\]
Hence, using (i), assertion (c) follows.
A simple application of Lemmas A.2 and A.3 is given next:

Lemma A.4. Let T₁ ∈ IR and (a_n), (b_n) ∈ F with the following properties: (i) there exists an index n₀ with a_n ∈ [0, 1) for n ≥ n₀, (ii) \(\sum_n a_n = \infty\). Then
\[
\operatorname{limextr}_n T_n \in \bigl[\liminf_n b_n,\; \limsup_n b_n\bigr]
\]
for the sequence
\[
T_{n+1} := (1 - a_n)T_n + a_nb_n, \quad n = 1, 2, \ldots.
\]
Proof. By means of complete induction, the formula
\[
T_{n+1} = b_{0,n}T_1 + \sum_{m=1}^{n} (b_{m,n}a_m)b_m \quad \text{for } n = 1, 2, \ldots
\]
is obtained. Hence, the assertion follows from Lemmas A.3(a) and A.2(c).
Now, consider for a given number a ∈ IR the following double sequence in D⁺:
\[
\Phi_{m,n}(a) := \begin{cases} \displaystyle\prod_{j=m+1}^{n} \Bigl(1 - \frac{a}{j}\Bigr), & \text{if } m < n \\[4pt] 1, & \text{if } m = n. \end{cases}
\]
Moreover, let IM denote a subset of IN such that the limit
\[
q := \lim_{n\to\infty} \frac{|\{1, \ldots, n\} \cap \mathrm{IM}|}{n}
\]
exists and is positive. Then we get the following result:
Lemma A.5. For any a ∈ IR it holds:
(a) \(\Phi_{m,n}(a) \sim \bigl(\frac{m}{n}\bigr)^a\),
(b) \(\frac{m}{n} \sim \frac{|\mathrm{IM}_{0,m}|}{|\mathrm{IM}_{0,n}|}\),
(c) \(\Phi_{m,n}(a) \sim \Phi_{|\mathrm{IM}_{0,m}|,|\mathrm{IM}_{0,n}|}(a)\),
(d) if a > 0, then \(\displaystyle\lim_{n\to\infty} \sum_{m\in \mathrm{IM}_{k,n}} \frac{1}{m}\Phi_{m,n}(a) = \frac{q}{a}\) for each k ∈ IN₀.

Proof. Due to the representation
\[
\exp\Bigl(-a\Bigl(\frac{1}{m+1} + \ldots + \frac{1}{n}\Bigr)\Bigr) = \exp\bigl(a(c_m - c_n)\bigr)\Bigl(\frac{m}{n}\Bigr)^a
\]
with the convergent sequence (cf. [65])
\[
c_k := \frac{1}{1} + \ldots + \frac{1}{k} - \ln(k), \quad k = 1, 2, \ldots,
\]
by means of Lemma A.3(b) we have
\[
\Bigl(\frac{m}{n}\Bigr)^a \sim \exp\Bigl(-a\Bigl(\frac{1}{m+1} + \ldots + \frac{1}{n}\Bigr)\Bigr) \sim \Phi_{m,n}(a).
\]
Since the limit q is positive, to any ε > 0 there exists an index n₀ such that
\[
0 < \frac{q}{1+\varepsilon} \le \frac{|\mathrm{IM}_{0,k}|}{k} \le (1+\varepsilon)q \quad \text{for } k \ge n_0.
\]
Hence, for n₀ ≤ m ≤ n we get
\[
\frac{1}{(1+\varepsilon)^2}\,\frac{m}{n} \le \frac{|\mathrm{IM}_{0,m}|}{|\mathrm{IM}_{0,n}|} \le (1+\varepsilon)^2\,\frac{m}{n},
\]
and therefore (b) holds. Due to (a), (b), and Lemma A.1(b) we have
\[
\Phi_{m,n}(a) \sim \Bigl(\frac{m}{n}\Bigr)^a \sim \Bigl(\frac{|\mathrm{IM}_{0,m}|}{|\mathrm{IM}_{0,n}|}\Bigr)^a \sim \Phi_{|\mathrm{IM}_{0,m}|,|\mathrm{IM}_{0,n}|}(a). \tag{i}
\]
For a > 0, due to Lemma A.3(a) we have
\[
\lim_{n\to\infty} \sum_{m\in \mathrm{IM}_{k,n}} \frac{1}{|\mathrm{IM}_{0,m}|}\,\Phi_{|\mathrm{IM}_{0,m}|,|\mathrm{IM}_{0,n}|}(a)
= \frac{1}{a}\lim_{n\to\infty} \sum_{M=|\mathrm{IM}_{0,k}|+1}^{|\mathrm{IM}_{0,n}|} \Phi_{M,|\mathrm{IM}_{0,n}|}(a)\,\frac{a}{M} = \frac{1}{a}. \tag{ii}
\]
Note that the index transformation in (ii) is feasible. Indeed,
\[
m \in \mathrm{IM}_{k,n} := \{k+1, \ldots, n\} \cap \mathrm{IM} \iff |\mathrm{IM}_{0,m}| \in \{|\mathrm{IM}_{0,k}|+1, \ldots, |\mathrm{IM}_{0,n}|\}.
\]
Hence, by means of (i) and the definition of q it is
\[
\frac{1}{|\mathrm{IM}_{0,m}|}\,\Phi_{|\mathrm{IM}_{0,m}|,|\mathrm{IM}_{0,n}|}(a) \sim \frac{1}{qm}\,\Phi_{m,n}(a).
\]
Therefore, (ii) and Lemma A.2(c) now yield assertion (d).



Now let (A_n) ∈ F denote a given sequence with
\[
\lim_{n\in \mathrm{IN}^{(k)}} A_n = A^{(k)} \ \text{ for } k = 1, \ldots, \kappa, \qquad
A := q_1A^{(1)} + \ldots + q_\kappa A^{(\kappa)} > 0.
\]
Here again q₁, …, q_κ denote the limit parts of the sets IN^(1), …, IN^(κ) in IN according to (U6). Then the double sequence
\[
b_{m,n} := \begin{cases} \displaystyle\prod_{j=m+1}^{n} \Bigl(1 - \frac{A_j}{j}\Bigr), & \text{for } m < n \\[4pt] 1, & \text{for } m = n \end{cases}
\]
is in D⁺ and has the following properties:

Lemma A.6. (a) For any δ ∈ (0, A) we have \(\Phi_{m,n}(A+\delta) \preceq b_{m,n} \preceq \Phi_{m,n}(A-\delta)\).
(b) \(\lim_{n\to\infty} b_{m,n} = 0\) for m = 0, 1, 2, ….
(c) For each k ∈ IN₀ it is \(\displaystyle\lim_{n\to\infty} \sum_{m\in \mathrm{IM}_{k,n}} b_{m,n}\frac{1}{m} = \frac{q}{A}\).

Proof. We have the representation
\[
b_{m,n} = \prod_{k=1}^{\kappa}\;\prod_{l\in \mathrm{IN}_{m,n}^{(k)}} \Biggl(1 - \frac{A_l\,\frac{|\mathrm{IN}_{0,l}^{(k)}|}{l}}{|\mathrm{IN}_{0,l}^{(k)}|}\Biggr), \tag{i}
\]
and for arbitrary numbers b₁, …, b_κ ∈ IR it is, due to Lemmas A.1 and A.5,
\[
\prod_{k=1}^{\kappa}\;\prod_{l\in \mathrm{IN}_{m,n}^{(k)}} \Bigl(1 - \frac{b_k}{|\mathrm{IN}_{0,l}^{(k)}|}\Bigr)
= \prod_{k=1}^{\kappa}\;\prod_{L=|\mathrm{IN}_{0,m}^{(k)}|+1}^{|\mathrm{IN}_{0,n}^{(k)}|} \Bigl(1 - \frac{b_k}{L}\Bigr)
= \prod_{k=1}^{\kappa} \Phi_{|\mathrm{IN}_{0,m}^{(k)}|,|\mathrm{IN}_{0,n}^{(k)}|}(b_k)
\sim \prod_{k=1}^{\kappa} \Phi_{m,n}(b_k)
\sim \prod_{k=1}^{\kappa} \Bigl(\frac{m}{n}\Bigr)^{b_k}
\sim \Phi_{m,n}(b_1 + \ldots + b_\kappa). \tag{ii}
\]
For each δ ∈ (0, A) there exists an index n₀ with
\[
\underline b_k := q_kA^{(k)} - \frac{\delta}{\kappa} \;\le\; A_l\,\frac{|\mathrm{IN}_{0,l}^{(k)}|}{l} \;\le\; q_kA^{(k)} + \frac{\delta}{\kappa} =: \bar b_k
\]
for all n₀ ≤ l ∈ IN^(k) and k = 1, …, κ. Hence, due to (i) and (ii),
\[
\Phi_{m,n}(A+\delta) \preceq b_{m,n} \preceq \Phi_{m,n}(A-\delta).
\]


Since A ± δ > 0, Lemma A.5(a) yields
\[
\lim_{n\to\infty} b_{m,n} = 0 \quad \text{for } m = 0, 1, 2, \ldots,
\]
and from Lemmas A.5(d) and A.2 one obtains
\[
\frac{q}{A+\delta} \le \operatorname{limextr}_n \sum_{m\in \mathrm{IM}_{k,n}} b_{m,n}\frac{1}{m} \le \frac{q}{A-\delta}.
\]
This holds true for any δ ∈ (0, A), and therefore (c) holds.
Applying now Lemma A.6, the limit of the sequence
\[
T_{n+1} := \Bigl(1 - \frac{A_n}{n}\Bigr)T_n + \frac{f_n}{n} + g_n \quad \text{for } n = 1, 2, \ldots
\]
can be calculated. Here, T₁ ∈ IR and (f_n), (g_n) ∈ F with
\[
\lim_{n\in \mathrm{IN}^{(k)}} f_n = f^{(k)} \ \text{ for } k = 1, \ldots, \kappa, \qquad \sum_n g_n \ \text{convergent.}
\]
Lemma A.7. For the sequence (T_n) the equation
\[
\lim_{n\to\infty} T_n = \sum_{k=1}^{\kappa} \frac{q_kf^{(k)}}{A}
\]
holds.

Proof. With the abbreviations g₀ := 0 and
\[
F_n := f_n - A_n(g_0 + \ldots + g_{n-1}), \qquad S_n := T_n - (g_0 + \ldots + g_{n-1}),
\]
we get
\[
\begin{aligned}
S_{n+1} &= T_{n+1} - (g_0 + \ldots + g_n)
= T_n + \frac{1}{n}\bigl(f_n - A_nT_n\bigr) + g_n - (g_0 + \ldots + g_n)\\
&= \bigl(T_n - (g_0 + \ldots + g_{n-1})\bigr) + \frac{1}{n}\bigl(F_n - A_nS_n\bigr)
= \Bigl(1 - \frac{A_n}{n}\Bigr)S_n + \frac{1}{n}F_n \quad \text{for } n = 1, 2, \ldots.
\end{aligned}
\]
Hence, by means of complete induction we get
\[
S_{n+1} = b_{0,n}S_1 + \sum_{k=1}^{\kappa}\;\sum_{m\in \mathrm{IN}_{0,n}^{(k)}} b_{m,n}\frac{1}{m}F_m.
\]
Due to the assumptions we have, with \(G := \sum_n g_n\),
\[
\lim_{m\in \mathrm{IN}^{(k)}} F_m = f^{(k)} - A^{(k)}G \quad \text{for } k = 1, \ldots, \kappa.
\]
Now, from Lemmas A.6(b), (c) and A.2(c) follows
\[
\lim_{n\to\infty} S_n = \sum_{k=1}^{\kappa} \frac{q_k}{A}\bigl(f^{(k)} - A^{(k)}G\bigr) = \sum_{k=1}^{\kappa} \frac{q_kf^{(k)}}{A} - G.
\]
Finally, we get our assertion by means of the relation T_n = S_n + (g₀ + … + g_{n−1}).

The following simple mean value theorem is used a few times:

Lemma A.8. Let (f_n) be selected as above. Then
\[
\frac{1}{n}(f_1 + \ldots + f_n) \longrightarrow \sum_{k=1}^{\kappa} q_kf^{(k)}, \quad n \to \infty.
\]
Proof. Due to the representation
\[
T_{n+1} := \frac{1}{n}(f_1 + \ldots + f_n) = \Bigl(1 - \frac{1}{n}\Bigr)T_n + \frac{f_n}{n},
\]
the assertion follows from Lemma A.7.
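Lemma A.8 is easy to illustrate numerically. The sketch below (an editorial illustration; the even/odd partition of IN is a toy choice with limit parts q₁ = q₂ = 1/2, and NumPy is assumed) averages a sequence taking the value f^(1) = 3 on IN^(1) and f^(2) = 1 on IN^(2):

```python
import numpy as np

# Toy partition: IN^(1) = even indices, IN^(2) = odd indices, so q1 = q2 = 1/2.
n = 200000
idx = np.arange(1, n + 1)
f = np.where(idx % 2 == 0, 3.0, 1.0)   # f_n -> f^(1) = 3 on IN^(1), f^(2) = 1 on IN^(2)

mean = f.mean()                         # (1/n)(f_1 + ... + f_n)
target = 0.5 * 3.0 + 0.5 * 1.0          # q1 f^(1) + q2 f^(2)
print(mean, target)
```

The running average converges to the q-weighted combination of the subsequence limits, as the lemma states.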

A.2 Iterative Solution of a Lyapunov Matrix Equation

In Sect. 6.5 (or, more precisely, in the proof of Lemma 6.13) the following recursive matrix equation is of interest: For n = 1, 2, … let
\[
V_{n+1} = \Bigl(I - \frac{1}{n}B_n\Bigr)V_n\Bigl(I - \frac{1}{n}B_n\Bigr)^T + \frac{1}{n}C_n,
\]
where V₁ and C_n denote symmetric and B_n real ν × ν matrices with the following properties:
\[
\lim_{n\in \mathrm{IN}^{(k)}} C_n = C^{(k)}, \qquad \lim_{n\in \mathrm{IN}^{(k)}} B_n = B^{(k)}, \qquad B^{(k)} + B^{(k)T} \ge 2u^{(k)}I
\]
for k = 1, …, κ, where u^(1), …, u^(κ) denote reals with
\[
\bar u := q_1u^{(1)} + \ldots + q_\kappa u^{(\kappa)} > 0.
\]

Convergence of the matrix sequence (V_n) is guaranteed by the following result:

Lemma A.9. The limit \(V := \lim_{n\to\infty} V_n\) exists and is the unique solution of the Lyapunov equation
\[
BV + VB^T = C, \tag{∗}
\]
where we put
\[
B := q_1B^{(1)} + \ldots + q_\kappa B^{(\kappa)} \quad \text{and} \quad C := q_1C^{(1)} + \ldots + q_\kappa C^{(\kappa)}.
\]
For the proof of this theorem we need further auxiliary results. First of all, we state the following remark:

Remark A.1. If for n ∈ IN
\[
B_n + B_n^T \ge 2\tilde a_nI \quad \text{and} \quad \|B_n\| \le \tilde b_n
\]
hold with numbers \(\tilde a_n\), \(\tilde b_n\), then (cf. Lemma 6.6)
\[
\Bigl\|I - \frac{1}{n}B_n\Bigr\|^2 \le 1 - \frac{2u_n}{n} \quad \text{with} \quad u_n := \tilde a_n - \frac{1}{2n}\tilde b_n^2.
\]
Due to the assumptions about (B_n), the sequences \((\tilde a_n)\) and \((\tilde b_n)\) can be chosen such that
\[
\lim_{n\in \mathrm{IN}^{(k)}} u_n = u^{(k)} \quad \text{for } k = 1, \ldots, \kappa.
\]

Lemma A.10. For n ∈ IN let T₁, E_n and F_n denote symmetric ν × ν matrices with
\[
\widehat E_{n+1} := \frac{1}{n}(E_1 + \ldots + E_n) \longrightarrow 0 \quad \text{and} \quad \sum_n \|F_n\| < \infty.
\]
Then T_n → 0, n → ∞, if the sequence (T_n) is defined by
\[
T_{n+1} := \Bigl(I - \frac{1}{n}B_n\Bigr)T_n\Bigl(I - \frac{1}{n}B_n\Bigr)^T + \frac{1}{n}E_n + F_n, \quad n = 1, 2, \ldots.
\]
Proof. With the matrices \(T_1^{(1)} := T_1\), \(T_1^{(2)} := 0\) and
\[
T_{n+1}^{(1)} := \Bigl(I - \frac{1}{n}B_n\Bigr)T_n^{(1)}\Bigl(I - \frac{1}{n}B_n\Bigr)^T + F_n,
\]
\[
T_{n+1}^{(2)} := \Bigl(I - \frac{1}{n}B_n\Bigr)T_n^{(2)}\Bigl(I - \frac{1}{n}B_n\Bigr)^T + \frac{1}{n}E_n,
\]
we can write \(T_n = T_n^{(1)} + T_n^{(2)}\) for n = 1, 2, …. Because of the above remark we now get
\[
\|T_{n+1}^{(1)}\| \le \Bigl(1 - \frac{2u_n}{n}\Bigr)\|T_n^{(1)}\| + \|F_n\|.
\]
Hence, Lemma A.7 yields \(T_n^{(1)} \longrightarrow 0\).
 
Due to \(\widehat E_{n+1} = \bigl(1 - \frac{1}{n}\bigr)\widehat E_n + \frac{1}{n}E_n\) it is
\[
\begin{aligned}
T_{n+1}^{(2)} - \widehat E_{n+1}
&= \Bigl(I - \frac{1}{n}B_n\Bigr)\bigl(T_n^{(2)} - \widehat E_n\bigr)\Bigl(I - \frac{1}{n}B_n\Bigr)^T\\
&\quad + \frac{1}{n}\Bigl[\widehat E_n - \bigl(B_n\widehat E_n + \widehat E_nB_n^T\bigr)\Bigr] + \frac{1}{n^2}\,B_n\widehat E_nB_n^T.
\end{aligned}
\]
Hence, the above remark gives us
\[
\|T_{n+1}^{(2)} - \widehat E_{n+1}\| \le \Bigl(1 - \frac{2u_n}{n}\Bigr)\|T_n^{(2)} - \widehat E_n\|
+ \frac{\|\widehat E_n\|}{n}\Bigl[1 + 2\|B_n\| + \frac{1}{n}\|B_n\|^2\Bigr].
\]
Again using Lemma A.7, \(T_n^{(2)} - \widehat E_n \longrightarrow 0\) follows. Finally, because of the assumptions about E_n, also \(T_n^{(2)}\) converges to 0.

Proof (of Lemma A.9). For a solution V of (∗) it holds
\[
\begin{aligned}
V_{n+1} - V &= \Bigl(I - \frac{1}{n}B_n\Bigr)(V_n - V)\Bigl(I - \frac{1}{n}B_n\Bigr)^T\\
&\quad + \frac{1}{n}\Bigl[C_n - \bigl(B_nV + VB_n^T\bigr)\Bigr] + \frac{1}{n^2}\,B_nVB_n^T.
\end{aligned}
\]
Moreover, due to Lemma A.8 and the above assumptions we obtain
\[
\frac{1}{n}(B_1 + \ldots + B_n) \longrightarrow \sum_{k=1}^{\kappa} q_kB^{(k)} =: B, \qquad
\frac{1}{n}(C_1 + \ldots + C_n) \longrightarrow \sum_{k=1}^{\kappa} q_kC^{(k)} =: C.
\]
Hence, due to Lemma A.10 we have \(V_n \longrightarrow V\), n → ∞.

The following theorem guarantees the uniqueness of the solution V of (∗).
Lemma A.11 (Lyapunov). If −B is stable, that is
Re(λ) > 0 for any eigenvalue λ of B,
then (∗) has an unique solution V .
This theorem can be found, e.g., in [8]. Due to relations
1
Re(λ) ≥ λmin (B + B T ) ≥ ū > 0 for all eigenvalues λ of B,
2
in our situation −B is stable.
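The convergence asserted in Lemma A.9 can be observed numerically. The following Python sketch (an editorial illustration under simplifying assumptions: constant B_n ≡ B and C_n ≡ C, a stable −B, NumPy assumed) runs the recursion and compares its limit with the solution of (∗) obtained by vectorizing the Lyapunov equation:

```python
import numpy as np

rng = np.random.default_rng(1)
nu = 3
B = 2.0 * np.eye(nu) + 0.3 * rng.normal(size=(nu, nu))  # eigenvalues near 2, so -B is stable
C0 = rng.normal(size=(nu, nu))
C = C0 + C0.T + 2.0 * nu * np.eye(nu)                   # symmetric right-hand side

# Exact solution of B V + V B^T = C via vectorization:
# vec(B V + V B^T) = (kron(I, B) + kron(B, I)) vec(V)
K = np.kron(np.eye(nu), B) + np.kron(B, np.eye(nu))
V_exact = np.linalg.solve(K, C.reshape(-1)).reshape(nu, nu)

# Recursion of Lemma A.9 with constant B_n = B, C_n = C
V = np.zeros((nu, nu))
for n in range(1, 50001):
    M = np.eye(nu) - B / n
    V = M @ V @ M.T + C / n

print(np.abs(V - V_exact).max())
```

The error behaves like O(1/n), driven by the \(\frac{1}{n^2}B_nVB_n^T\) term in the proof above.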
B Convergence Theorems for Stochastic Sequences

The following theorems, giving assertions about “almost sure convergence,” “convergence in the mean” and “convergence in distribution” of random sequences, are of fundamental importance for Chap. 6.
Let, as already stated in Sect. 6.3, (Ω, A, P) denote a probability space, A₁ ⊂ A₂ ⊂ … ⊂ A σ-algebras, E_n(·) := E(· | A_n) the conditional expectation with respect to A_n, and IN = IN^(1) ∪ … ∪ IN^(κ) a disjoint partition of IN fulfilling condition (U6).

B.1 A Convergence Result of Robbins–Siegmund

The following lemma, due to H. Robbins and D. Siegmund, cf. [126], gives an assertion about the almost sure convergence of nonnegative random sequences (T_n).
Let T_n, \(\alpha_n^{(1)}\), \(\alpha_n^{(2)}\) and α_n denote nonnegative A_n-measurable random variables with
\[
E_nT_{n+1} \le \bigl(1 + \alpha_n^{(1)}\bigr)T_n - \alpha_n + \alpha_n^{(2)} \quad \text{a.s.}
\]
for n = 1, 2, …, where \(\sum_n \alpha_n^{(i)} < \infty\) a.s. for i = 1, 2.

Lemma B.1. The limit \(\lim_{n\to\infty} T_n < \infty\) exists, and it is \(\sum_n \alpha_n < \infty\) a.s.

B.1.1 Consequences

In the special case
\[
\alpha_n = a_nT_n \quad \text{for } n = 1, 2, \ldots
\]
with nonnegative A_n-measurable random variables a_n fulfilling \(\sum_n a_n = \infty\) a.s., we get this result:

Lemma B.2. T_n → 0, n → ∞ a.s.

Proof. According to Lemma B.1 we know that \(T := \lim_{n\to\infty} T_n\) exists a.s. and \(\sum_n a_nT_n < \infty\) a.s. Hence, due to \(\sum_n a_n = \infty\) a.s., we obtain the proposition.

Now we abandon the nonnegativity condition on a_n.
For n ∈ IN, let S_n ≥ 0 and A_n denote A_n-measurable random variables with
\[
E_nS_{n+1} \le \Bigl(1 - \frac{A_n}{n} + \alpha_n^{(1)}\Bigr)S_n + \alpha_n^{(2)} \quad \text{a.s.}
\]
for n = 1, 2, …, where we demand for (A_n):
(a) \(\lim_{n\in \mathrm{IN}^{(k)}} A_n = A^{(k)}\) a.s. for k = 1, …, κ with numbers A^(1), …, A^(κ) ∈ IR,
(b) A := q₁A^(1) + … + q_κA^(κ) > 0.
Since for large n the sequence (A_n) is positive “in the mean,” (S_n) converges a.s. to zero.

Lemma B.3. S_n → 0, n → ∞ a.s.

Proof. For arbitrary ε ∈ (0, A), let
\[
B_n := q_k\bigl(A^{(k)} - \varepsilon\bigr)\frac{n}{|\mathrm{IN}_{0,n}^{(k)}|}
\]
for indices n ∈ IN^(k), where k ∈ {1, …, κ}. Then, for all k = 1, …, κ and n ∈ IN we get
\[
\lim_{n\in \mathrm{IN}^{(k)}} B_n = A^{(k)} - \varepsilon, \qquad A_n \ge B_n - \Delta B_n,
\]
with the A_n-measurable random variable \(\Delta B_n := \max\{0,\, B_n - A_n\} \ge 0\). Since S_n is nonnegative, it follows that
\[
E_nS_{n+1} \le \Bigl(1 - \frac{B_n}{n} + \beta_n^{(1)}\Bigr)S_n + \alpha_n^{(2)},
\]
where \(\beta_n^{(1)} := \alpha_n^{(1)} + \frac{\Delta B_n}{n}\). For sufficiently large indices n it is ΔB_n = 0 a.s., and therefore \(\sum_n \frac{\Delta B_n}{n} < \infty\) a.s. Hence, it holds
\[
0 \le \beta_n^{(1)}, \qquad \sum_n \beta_n^{(1)} < \infty \quad \text{a.s.}
\]
Furthermore, there exists an index n₀ with
\[
1 - \frac{A - \varepsilon}{n} \ge \frac12, \qquad 1 - \frac{B_n}{n} \ge \frac12 \quad \text{for } n \ge n_0.
\]

Hence, for n ≥ n₀ there exists the A_n-measurable positive random variable
\[
p_n := \frac{1 - \frac{A-\varepsilon}{n}}{1 - \frac{B_n}{n} + \beta_n^{(1)}}
= \frac{1 - \frac{A-\varepsilon}{n}}{1 - \frac{B_n}{n}}\Biggl(1 - \frac{\beta_n^{(1)}}{1 - \frac{B_n}{n} + \beta_n^{(1)}}\Biggr).
\]
Now, let \(P_{n_0} := 1\) and, for n ≥ n₀,
\[
P_{n+1} := p_{n_0}\cdots p_n, \qquad \bar S_n := P_nS_n, \qquad \beta_n^{(2)} := P_{n+1}\alpha_n^{(2)}.
\]
Then \(P_{n+1}\), \(\bar S_n\) and \(\beta_n^{(2)}\) are nonnegative A_n-measurable random variables, where
\[
E_n\bar S_{n+1} \le \Bigl(1 - \frac{A-\varepsilon}{n}\Bigr)\bar S_n + \beta_n^{(2)}, \quad n \ge n_0.
\]
Now, only the following relations have to be verified:
\[
\limsup_n P_n < \infty, \qquad \liminf_n P_n > 0 \quad \text{a.s.}
\]
Using the first inequality it follows that
\[
\sum_n \beta_n^{(2)} < \infty \quad \text{a.s.},
\]
and therefore \((\bar S_n)\) converges to zero a.s. due to Lemma B.2. Hence, due to the second inequality and the definition, also (S_n) converges a.s. to 0.
For n ≥ n₀ we have
\[
\frac{1 - \frac{A-\varepsilon}{n}}{1 - \frac{B_n}{n}}\bigl(1 - 2\beta_n^{(1)}\bigr) \le p_n \le \frac{1 - \frac{A-\varepsilon}{n}}{1 - \frac{B_n}{n}}.
\]

As in the proof of Lemma A.6 it can be shown that for n₀ ≤ m < n it holds
\[
\prod_{j=m+1}^{n} \Bigl(1 - \frac{B_j}{j}\Bigr)
= \prod_{k=1}^{\kappa}\;\prod_{l\in \mathrm{IN}_{m,n}^{(k)}} \Biggl(1 - \frac{B_l\,\frac{|\mathrm{IN}_{0,l}^{(k)}|}{l}}{|\mathrm{IN}_{0,l}^{(k)}|}\Biggr)
= \prod_{k=1}^{\kappa}\;\prod_{l\in \mathrm{IN}_{m,n}^{(k)}} \Biggl(1 - \frac{q_k\bigl(A^{(k)} - \varepsilon\bigr)}{|\mathrm{IN}_{0,l}^{(k)}|}\Biggr)
\sim \Phi_{m,n}(A - \varepsilon),
\]
or, stated in another way,
\[
\prod_{j=m+1}^{n} \frac{1 - \frac{A-\varepsilon}{j}}{1 - \frac{B_j}{j}} \sim 1.
\]

Moreover, according to Lemma A.3(c), for an index n₂ we have
\[
\liminf_n \prod_{j=n_2+1}^{n} \bigl(1 - 2\beta_j^{(1)}\bigr) > 0 \quad \text{a.s.}
\]
The last two relations and the above inequality for p_n finally yield
\[
\operatorname{limextr}_n P_n \in (0, \infty) \quad \text{a.s.}
\]

B.2 Convergence in the Mean

For real A_n-measurable random variables T_n, A_n, B_n, C_n and v_n assume, for any n ∈ IN and k = 1, …, κ,
\[
E_nT_{n+1} \le \Bigl(1 - \frac{A_n}{n}\Bigr)T_n + \frac{1}{n}\bigl(B_nv_n + C_n\bigr) \quad \text{a.s.},
\]
as well as
\[
T_n, B_n, v_n \ge 0,
\]
\[
\liminf_{n\in \mathrm{IN}^{(k)}} A_n \ge A^{(k)}, \qquad \limsup_{n\in \mathrm{IN}^{(k)}} B_n \le B^{(k)}, \qquad \limsup_{n\in \mathrm{IN}^{(k)}} C_n \le C^{(k)} \quad \text{a.s.},
\]
\[
\limsup_{n\in \mathrm{IN}^{(k)}} Ev_n \le w^{(k)}, \qquad A := q_1A^{(1)} + \ldots + q_\kappa A^{(\kappa)} > 0
\]
with real numbers A^(1), …, A^(κ), B^(1), …, B^(κ), C^(1), …, C^(κ) and w^(1), …, w^(κ).
Now, for the sequence (T_n) the following convergence theorem holds:
Now, for sequence (Tn ) the following convergence theorem holds:

Lemma B.4. To any ε0 > 0 there exists a set Ω0 ∈ A such that

(a) P(Ω̄0 ) < ε0 ,


κ
qk (B (k) w(k) + C (k) )
(b) lim sup ETn IΩ0 ≤ .
n∈IN A
k=1

For the proof of this theorem the following theorem of Egorov (cf., e.g., [67]
page 285) is needed:

Lemma B.5. Let U, U₁, U₂, … denote real random variables with
\[
\limsup_n U_n \le U \quad \text{a.s.}
\]
Then to any positive number ε₀ there exists a set Ω₀ ∈ A with the following properties:
(a) \(P(\bar\Omega_0) < \varepsilon_0\),

(b) \(\limsup_n U_n \le U\) is fulfilled uniformly on Ω₀, that is: to any number ε > 0 there exists an index n₀ ∈ IN such that
\[
U_n(\omega) \le U(\omega) + \varepsilon \quad \text{for all } n \ge n_0 \text{ and } \omega \in \Omega_0.
\]
Proof (of Lemma B.4). Let ε₀ > 0 be chosen arbitrarily. According to Lemma B.5 there exists a set Ω₀ ∈ A such that the inequalities
\[
\liminf_{n\in \mathrm{IN}^{(k)}} A_n \ge A^{(k)}, \qquad \limsup_{n\in \mathrm{IN}^{(k)}} B_n \le B^{(k)}, \qquad \limsup_{n\in \mathrm{IN}^{(k)}} C_n \le C^{(k)}
\]
for all k = 1, …, κ hold uniformly on Ω₀. Choose now an arbitrary number ε ∈ (0, A). Then there exists an index n₀ such that for all n ≥ n₀ with n ∈ IN^(k) it holds on Ω₀:
\[
A_n \ge A^{(k)} - \varepsilon =: a_n, \qquad B_n \le B^{(k)} + \varepsilon =: b_n, \qquad C_n \le C^{(k)} + \varepsilon =: c_n.
\]



{An ≥ an } ∩ {Bn ≤ bn } ∩ {Cn ≤ cn }, for n ≥ n0
Ωn :=
Ω , for n < n0

we have
Ω0 ⊂ Ω1 ∩ . . . ∩ Ωn =: Ω (n) ∈ An .
Let now S1 := T1 , and for n ≥ 1 define
 an  7  an  8
Sn+1 := 1 − Sn + Tn+1 − 1 − Tn IΩ (n) .
n n
Then, due to Ω (1) ⊃ Ω (2) ⊃ . . . ⊃ Ω (n) we get

T , on Ω (n)
Sn+1 =  n+1 an 
1 − n Sn , otherwise.

It is Ωn ⊃ Ω (n) ∈ An and therefore


 an  7  an  8
En Sn+1 = 1 − Sn + En Tn+1 − 1 − Tn IΩ (n)
n n
 an  1  an  1
≤ 1− Sn + (Bn vn + Cn )IΩ (n) ≤ 1 − Sn + (bn vn + cn ).
n n n n
Taking expectations, for n = 1, 2, . . . we get
 an  1
ESn+1 ≤ 1 − ESn + (bn Evn + cn ).
n n
Now, applying Lemma A.7 it follows
 
(k)
κ q
k (B + ε)w(k) + C (k) + ε
lim sup ESn ≤ .
n A−ε
k=1

Moreover, because of the above statements, due to S_n ≥ 0 and Ω₀ ⊂ Ω^(n), we find
\[
ES_{n+1} \ge ES_{n+1}I_{\Omega_0} = ET_{n+1}I_{\Omega_0}.
\]
Since ε ∈ (0, A) was chosen arbitrarily, the last two relations yield
\[
\limsup_n ET_nI_{\Omega_0} \le \sum_{k=1}^{\kappa} \frac{q_k\bigl(B^{(k)}w^{(k)} + C^{(k)}\bigr)}{A}.
\]

B.3 The Strong Law of Large Numbers for Dependent Matrix Sequences

For all n ∈ IN let a_n denote real random numbers and T_n, C_n real random µ × ν matrices with the following properties:
\[
T_1 \ A_1\text{-measurable}, \qquad a_n \ A_n\text{-measurable}, \qquad C_n \ A_{n+1}\text{-measurable},
\]
\[
a_n \in [0, 1), \qquad \sum_n a_n = \infty \quad \text{a.s.},
\]
\[
\sum_n a_n^2\,E_n\|C_n - E_nC_n\|^2 < \infty, \qquad E_nC_n \longrightarrow C, \ n \to \infty \quad \text{a.s.},
\]
with a further matrix C. For the averaged sequence
\[
T_{n+1} := (1 - a_n)T_n + a_nC_n, \quad n = 1, 2, \ldots,
\]
the following strong law of large numbers holds:
Lemma B.6. T_n → C, n → ∞ a.s.

Proof. With the abbreviations \(T_1^{(1)} := T_1\), \(T_1^{(2)} := 0\) and
\[
T_{n+1}^{(1)} := (1 - a_n)T_n^{(1)} + a_n\bigl(C_n - E_nC_n\bigr), \qquad
T_{n+1}^{(2)} := (1 - a_n)T_n^{(2)} + a_nE_nC_n,
\]
we write \(T_n = T_n^{(1)} + T_n^{(2)}\) for n = 1, 2, …. Hence, according to Lemma A.4 it is \(\lim_n T_n^{(2)} = C\) a.s. Let now 0 ≠ h ∈ IR^ν be chosen arbitrarily. With \(S_n := \|T_n^{(1)}h\|^2\) we then get
\[
S_{n+1} = (1 - a_n)^2S_n + a_n^2\bigl\|(C_n - E_nC_n)h\bigr\|^2 + 2(1 - a_n)a_n\bigl\langle T_n^{(1)}h,\,(C_n - E_nC_n)h\bigr\rangle.
\]
Therefore, by taking conditional expectations we find
\[
E_nS_{n+1} = (1 - a_n)^2S_n + a_n^2E_n\bigl\|(C_n - E_nC_n)h\bigr\|^2
\le (1 - a_n)S_n + \|h\|^2a_n^2\,E_n\|C_n - E_nC_n\|^2.
\]
Now, due to Lemma B.2 we get S_n → 0 a.s. Since h is chosen arbitrarily, it follows that \(T_n^{(1)} \longrightarrow 0\) a.s. and therefore, due to the above, also T_n → C a.s.
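Lemma B.6 can be simulated directly. The sketch below (an editorial illustration under simplifying assumptions: deterministic gains a_n = n^{-0.6}, so that Σa_n = ∞ and Σa_n² < ∞, with E_nC_n ≡ C and bounded noise variance; NumPy assumed) runs the averaged sequence and checks that it settles near C:

```python
import numpy as np

rng = np.random.default_rng(4)
mu_, nu = 2, 3
C = rng.normal(size=(mu_, nu))                # target matrix, E_n C_n = C
T = np.zeros((mu_, nu))
for n in range(1, 50001):
    a_n = n ** -0.6                           # sum a_n = inf, sum a_n^2 < inf
    C_n = C + rng.normal(size=(mu_, nu))      # noisy observation with unit variance
    T = (1 - a_n) * T + a_n * C_n             # averaged sequence of Lemma B.6
print(np.abs(T - C).max())
```

The residual fluctuation after n steps is of order \(\sqrt{a_n}\), i.e., it vanishes as n grows, in line with the almost sure convergence of the lemma.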

B.4 A Central Limit Theorem for Dependent Vector Sequences

For indices m ≤ n, let A_{m,n} denote real deterministic µ × ν matrices, and let U_n : Ω → IR^ν be A_{n+1}-measurable random vectors with the following properties:
(a) For n = 1, 2, …, assume
\[
E_nU_n \equiv 0 \quad \text{a.s.} \tag{K0}
\]
(b) There exists a deterministic matrix V with
\[
\sum_{m=1}^{n} A_{m,n}E_m\bigl(U_mU_m^T\bigr)A_{m,n}^T \xrightarrow{\text{stoch}} V. \tag{KV}
\]
(c) For arbitrary ε > 0 assume
\[
\sum_{m=1}^{n} \|A_{m,n}\|^2\,E_m\Bigl(\|U_m\|^2\,I_{\{\|A_{m,n}U_m\| > \varepsilon\}}\Bigr) \xrightarrow{\text{stoch}} 0. \tag{KL}
\]
Then the sequence \(\bigl(\sum_{m=1}^{n} A_{m,n}U_m\bigr)_n\) has an asymptotic normal distribution:

Lemma B.7. \(\bigl(\sum_{m=1}^{n} A_{m,n}U_m\bigr)_n\) converges in distribution to a Gaussian distributed random vector U with EU = 0 and cov(U) = V.
Proof. Let U denote an arbitrary N(0, V)-distributed random vector, and let 0 ≠ h ∈ IR^µ be chosen arbitrarily. Now, for m ≤ n define
\[
Y_{m,n} := h^TA_{m,n}U_m, \qquad Y_n := \sum_{m=1}^{n} Y_{m,n}, \qquad Y := h^TU.
\]
Due to the “Cramér–Wold device” (cf., e.g., [43], Theorem 8.7.6), only the convergence of (Y_n) in distribution to Y has to be shown.
Due to (K0), \((Y_{m,n}, A_{m+1})_{m\le n}\) is a martingale difference scheme. Therefore, according to the theorem of Brown (cf. [43], Theorem 9.2.3), the following conditions have to be verified:
\[
\sum_{m=1}^{n} E_mY_{m,n}^2 \xrightarrow{\text{stoch}} h^TVh, \tag{KV1}
\]
\[
\sum_{m=1}^{n} E_m\bigl(Y_{m,n}^2\,I_{\{|Y_{m,n}| > \delta\}}\bigr) \xrightarrow{\text{stoch}} 0 \quad \text{for all } \delta > 0. \tag{KL1}
\]
(KV1) follows from (KV), since \(E_mY_{m,n}^2 = h^TA_{m,n}E_m(U_mU_m^T)A_{m,n}^Th\). Moreover, it is \(|Y_{m,n}| \le \|h\|\,\|A_{m,n}U_m\| \le \|h\|\,\|A_{m,n}\|\,\|U_m\|\), and therefore for δ > 0 we get
\[
E_m\bigl(Y_{m,n}^2\,I_{\{|Y_{m,n}| > \delta\}}\bigr) \le \|h\|^2\,\|A_{m,n}\|^2\,E_m\Bigl(\|U_m\|^2\,I_{\{\|A_{m,n}U_m\| > \delta/\|h\|\}}\Bigr).
\]
Hence, (KL) is sufficient for (KL1).


C Tools from Matrix Calculus

We consider here real matrices of size ν × ν.

C.1 Miscellaneous

First we give inequalities about the error occurring in the inversion of a perturbed matrix, cf. [68].

Lemma C.1 (Perturbation lemma). Let A and ΔA denote matrices such that
\[
A \ \text{is regular and} \ \|A^{-1}\|_0 \cdot \|\Delta A\|_0 < 1,
\]
where ‖·‖₀ is an arbitrary matrix norm such that ‖I‖₀ = 1. Then, also A + ΔA is regular and
\[
\text{(a)} \quad \|(A + \Delta A)^{-1} - A^{-1}\|_0 \le \frac{\|A^{-1}\|_0^2 \cdot \|\Delta A\|_0}{1 - \|A^{-1}\|_0 \cdot \|\Delta A\|_0},
\]
\[
\text{(b)} \quad \|(A + \Delta A)^{-1}\|_0 \le \frac{\|A^{-1}\|_0}{1 - \|A^{-1}\|_0 \cdot \|\Delta A\|_0}.
\]
A consequence of Lemma C.1 is given next:
A consequence of Lemma C.1 is given next:

Lemma C.2. Let (A_n) and (B_n) denote two matrix sequences with the following properties:
(i) A_n regular, \(\|A_n^{-1}\| \le L < \infty\) for n ∈ IN,
(ii) \(\|B_n - A_n\| \longrightarrow 0\).
Then there is an index n₀ such that
(a) B_n is regular for n ≥ n₀,
(b) \(\|B_n^{-1} - A_n^{-1}\| \longrightarrow 0\).

Proof. Due to assumption (ii) there is an index n₀ such that
\[
\|B_n - A_n\| < \frac{1}{L}, \quad n \ge n_0.
\]
Hence, because of (i) we get \(\|A_n^{-1}\|\,\|B_n - A_n\| < 1\). According to Lemma C.1, the matrix B_n is regular and
\[
\|B_n^{-1} - A_n^{-1}\| \le \frac{L^2\|B_n - A_n\|}{1 - L\|B_n - A_n\|}.
\]
Using (ii), assertion (b) follows.

If a rank-1 matrix is added to a regular matrix, then the determinant and the inverse of the new matrix can be computed easily. This is contained in the Sherman–Morrison lemma:

Lemma C.3. If e, d ∈ IR^ν and A is a regular matrix, then
(a) \(\det(A + ed^T) = \det(A)\,\alpha\), where \(\alpha := 1 + d^TA^{-1}e\).
(b) If α ≠ 0, then \((A + ed^T)^{-1} = A^{-1} - \frac{1}{\alpha}\bigl(A^{-1}e\bigr)\bigl(d^TA^{-1}\bigr)\).

The proof of Lemma C.3 can be found, e.g., in [68].
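Both parts of Lemma C.3 are easy to verify numerically. The following Python sketch (an editorial illustration; NumPy assumed, random data chosen so that A is regular and α ≠ 0 with high probability) checks the determinant update and the rank-1 inverse update against direct computation:

```python
import numpy as np

rng = np.random.default_rng(2)
nu = 4
A = rng.normal(size=(nu, nu)) + nu * np.eye(nu)   # diagonally shifted, regular here
e = rng.normal(size=nu)
d = rng.normal(size=nu)

Ainv = np.linalg.inv(A)
alpha = 1.0 + d @ Ainv @ e                        # alpha = 1 + d^T A^{-1} e

# (a) determinant update: det(A + e d^T) = det(A) * alpha
det_lhs = np.linalg.det(A + np.outer(e, d))
det_rhs = np.linalg.det(A) * alpha

# (b) rank-1 inverse update (alpha != 0 for this data)
inv_lhs = np.linalg.inv(A + np.outer(e, d))
inv_rhs = Ainv - np.outer(Ainv @ e, d @ Ainv) / alpha

print(np.isclose(det_lhs, det_rhs), np.allclose(inv_lhs, inv_rhs))
```

The inverse update costs only O(ν²) once A⁻¹ is known, which is the practical point of the lemma.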
A lower bound for the minimum eigenvalue of a positive definite matrix is given next, cf. [9], page 223:

Lemma C.4. If A is a symmetric, positive definite ν × ν matrix, then
\[
\lambda_{\min}(A) > \det(A)\Bigl(\frac{\nu - 1}{\operatorname{tr}(A)}\Bigr)^{\nu-1}.
\]
Proof. If 0 < λ₁ ≤ … ≤ λ_ν denote the eigenvalues of A, then
\[
\operatorname{tr}(A) = \lambda_1 + \ldots + \lambda_\nu, \qquad \det(A) = \lambda_1 \cdots \lambda_\nu.
\]
Hence
\[
\frac{\det(A)}{\lambda_1} = \lambda_2 \cdots \lambda_\nu \le \Bigl(\frac{\lambda_2 + \ldots + \lambda_\nu}{\nu - 1}\Bigr)^{\nu-1} < \Bigl(\frac{\operatorname{tr}(A)}{\nu - 1}\Bigr)^{\nu-1}.
\]
C.2 The v. Mises-Procedure in Case of Errors

For a symmetric, positive semidefinite ν × ν matrix A, the maximum eigenvalue \(\bar\lambda := \lambda_{\max}(A)\) of A can be determined recursively by the v. Mises procedure, cf. [151], page 203:
Given a vector 0 ≠ u₁ ∈ IR^ν, define recursively for n = 1, 2, …
\[
v_n := Au_n \quad \text{and} \quad u_{n+1} := \frac{v_n}{\|v_n\|}.
\]
For a suitable start vector u₁, the sequence (u_n) converges to an eigenvector of the eigenvalue \(\bar\lambda\), and \(\langle u_n, v_n\rangle\) converges to \(\bar\lambda\).
Suppose now that approximations A₁, A₂, … of A are given. Corresponding to the above procedure, for a start vector u₁ ∈ IR^ν with ‖u₁‖ = 1, the sequences (u_n) and (v_n) are determined by the following rule:
\[
v_n := A_nu_n \quad \text{and} \quad u_{n+1} := \frac{v_n}{\|v_n\|} \quad \text{for } n = 1, 2, \ldots. \tag{∗}
\]
We suppose that A and A_n fulfill the following properties:
(a) ν > 1, and 0 ≠ A symmetric and positive semidefinite,
(b) A_n → A.
Based on the following lemma, \(\langle u_n, v_n\rangle\) converges to \(\bar\lambda\), provided that an additional condition holds.
Let \(U := \{x \in \mathrm{IR}^\nu : Ax = \bar\lambda x\}\) denote the eigenspace of the maximum eigenvalue \(\bar\lambda\) of A. Each vector z ∈ IR^ν can then be decomposed uniquely as follows:
\[
z = z^{(1)} + z^{(2)}, \quad \text{where } z^{(1)} \in U \text{ and } z^{(2)} \in U^{\perp} := \{y \in \mathrm{IR}^\nu : \langle y, x\rangle = 0 \text{ for all } x \in U\}.
\]
Now the already mentioned main result of this part of the appendix can be presented:

Lemma C.5. Let (u_n) and (v_n) be determined according to (∗). Then the following assertions are equivalent:
(a) \(\langle u_n, v_n\rangle \longrightarrow \bar\lambda\).
(b) There is an index n₀ ∈ IN such that \(\displaystyle\sup_{n\ge n_0} \|A_n - A\| < \frac{\bar\lambda - \mu}{3}\,\|u_{n_0}^{(1)}\|\).
Here, µ := λ_{ν−1} (cf. the proof of Lemma C.4) or µ := 0 if U = IR^ν. The vector \(u_{n_0}^{(1)}\) denotes the projection of \(u_{n_0}\) onto U.

In the standard case A_n ≡ A the second condition in Lemma C.5 means that
\[
\|u_{n_0}^{(1)}\| > 0 \quad \text{for an index } n_0 \in \mathrm{IN}.
\]
This is the known condition for the convergence of \(\langle u_n, v_n\rangle\) to \(\bar\lambda\).
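The perturbed power iteration (∗) is straightforward to simulate. The sketch below is an editorial illustration (NumPy assumed; the test matrix, its spectrum, and the decaying perturbation ‖A_n − A‖ = O(1/n²) are all choices made here, not taken from the text) showing that the Rayleigh quotient still approaches \(\bar\lambda\) when the matrices are only approximately known:

```python
import numpy as np

rng = np.random.default_rng(3)
nu = 5
Q = np.linalg.qr(rng.normal(size=(nu, nu)))[0]
A = Q @ np.diag([5.0, 3.0, 2.0, 1.0, 0.5]) @ Q.T   # lambda_bar = 5, mu = 3

u = rng.normal(size=nu)
u /= np.linalg.norm(u)
for n in range(1, 201):
    A_n = A + rng.normal(size=(nu, nu)) / n**2      # approximations A_n -> A
    v = A_n @ u                                     # rule (*): v_n = A_n u_n
    u = v / np.linalg.norm(v)                       #           u_{n+1} = v_n / ||v_n||

rayleigh = u @ (A @ u)
print(abs(rayleigh - 5.0))
```

Since the contraction factor per step is essentially q = µ/λ̄ = 0.6 and the errors d_n vanish, t_n → 0 as in Lemma C.7, and the Rayleigh quotient converges to λ̄.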
For the proof of Lemma C.5 we need several auxiliary results. Put \(q := \frac{\mu}{\bar\lambda} \in [0, 1)\), and define for n ∈ IN
\[
t_n := \begin{cases} \dfrac{\|u_n^{(2)}\|}{\|u_n^{(1)}\|}, & \text{if } u_n^{(1)} \ne 0 \\[4pt] \infty, & \text{else,} \end{cases}
\qquad
d_n := \frac{1}{\bar\lambda}\|A_n - A\| \quad \text{and} \quad e_n := \sup_{k\ge n} d_k.
\]

Then \(\|u_n^{(1)}\| = \frac{1}{\sqrt{1 + t_n^2}}\), and the following lemma holds:

Lemma C.6. If
\[
d_n\sqrt{1 + t_n^2} < 1 \iff \|A_n - A\| < \bar\lambda\,\|u_n^{(1)}\|,
\]
then
\[
t_{n+1} \le \frac{qt_n + d_n\sqrt{1 + t_n^2}}{1 - d_n\sqrt{1 + t_n^2}}.
\]
Proof. Because of v_n = A_n u_n, decomposition of v_n yields:

   v_n^(1) = P(A_n u_n) = P(A u_n + (A_n − A) u_n) = λ̄ u_n^(1) + P(A_n − A) u_n,
   v_n^(2) = Q(A_n u_n) = A u_n^(2) + Q(A_n − A) u_n.

Here, P is the projection matrix describing the projection of z ∈ IR^ν onto
z^(1) ∈ U, and Q := I − P. Taking the norm, one obtains

   ‖v_n^(1)‖ ≥ λ̄ ‖u_n^(1)‖ − ‖A_n − A‖,   (i)
   ‖v_n^(2)‖ ≤ µ ‖u_n^(2)‖ + ‖A_n − A‖,   (ii)

if the following relations are taken into account:

   ‖A u_n^(2)‖ ≤ µ ‖u_n^(2)‖,  ‖P‖, ‖Q‖ ≤ 1,  ‖u_n‖ = 1.

From (i), (ii) and the assumption λ̄ ‖u_n^(1)‖ − ‖A_n − A‖ > 0 we finally get
v_n^(1) ≠ 0 and

   t_{n+1} := ‖u_{n+1}^(2)‖ / ‖u_{n+1}^(1)‖ = ‖v_n^(2)‖ / ‖v_n^(1)‖
            ≤ (µ ‖u_n^(2)‖ + ‖A_n − A‖) / (λ̄ ‖u_n^(1)‖ − ‖A_n − A‖)
            = (q t_n + d_n √(1 + t_n²)) / (1 − d_n √(1 + t_n²)).
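The one-step inequality of Lemma C.6 can be checked numerically for a single step of (∗). The matrix, the start vector, and the error term below are illustrative choices, not taken from the text:

```python
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 2.0]])      # lambda_bar = 3, mu = 1
lam, mu = 3.0, 1.0
q = mu / lam
x = np.array([1.0, 1.0]) / np.sqrt(2.0)     # unit eigenvector spanning U
P = np.outer(x, x)                           # projection onto U
Q = np.eye(2) - P                            # projection onto U^perp

def t_of(w):
    """Ratio t = ||w^(2)|| / ||w^(1)|| of the components of w."""
    return np.linalg.norm(Q @ w) / np.linalg.norm(P @ w)

u = np.array([3.0, 1.0]) / np.sqrt(10.0)     # unit vector with u^(1) != 0
E = np.array([[0.0, 0.05], [0.05, 0.0]])     # error term, A_n = A + E
d = np.linalg.norm(E, 2) / lam               # d_n = ||A_n - A|| / lambda_bar

v = (A + E) @ u                              # one step of (*)
u_next = v / np.linalg.norm(v)

t = t_of(u)
s = np.sqrt(1.0 + t * t)
bound = (q * t + d * s) / (1.0 - d * s)      # right-hand side of Lemma C.6
```

With these numbers d·s ≈ 0.019 < 1, so the hypothesis holds and t_of(u_next) stays below the bound.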
In the next step conditions are given such that (t_n) converges to zero:

Lemma C.7. Assume that there is a number T ∈ (0, ∞) and an index n_0 ∈ IN
with

(a) e_{n_0} ≤ ε(T) := (1 − q) T / ((1 + T) √(1 + T²)),
(b) t_{n_0} ≤ T.

Then t_n → 0, n → ∞.
C.2 The v. Mises-Procedure in Case of Errors 325

Proof. By induction we show first that

   t_n ≤ T,  n ≥ n_0.   (∗∗)

Given n ≥ n_0, suppose t_n ≤ T. Because of (a) we have

   d_n √(1 + t_n²) ≤ e_{n_0} √(1 + t_n²) ≤ (1 − q) T √(1 + t_n²) / ((1 + T) √(1 + T²)) ≤ (1 − q) T / (1 + T) < 1.

Hence, Lemma C.6 yields

   t_{n+1} ≤ (q t_n + d_n √(1 + t_n²)) / (1 − d_n √(1 + t_n²)) ≤ (q T + d_n √(1 + t_n²)) / (1 − d_n √(1 + t_n²)) ≤ T,

and (∗∗) holds.
   Since (d_n) converges to zero, there is an index n_1 ≥ n_0 with

   1 − d_n √(1 + T²) ≥ (1 + q)/2,  n ≥ n_1.

By means of Lemma C.6 and (∗∗), for n ≥ n_1 we get

   t_{n+1} ≤ (q t_n + d_n √(1 + T²)) / ((1 + q)/2) = (2q/(1 + q)) t_n + (2√(1 + T²)/(1 + q)) d_n.

Because of 2q/(1 + q) ∈ [0, 1) and d_n → 0, we finally have t_n → 0.
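The contraction argument of this proof can be illustrated by iterating the worst-case bound of Lemma C.6 as a scalar recursion, with a hypothetical error sequence d_n → 0 chosen so that condition (a) holds:

```python
import math

q = 1.0 / 3.0                        # q = mu / lambda_bar, some value in [0, 1)

def eps(t):                          # threshold function eps(T) of Lemma C.7
    return (1.0 - q) * t / ((1.0 + t) * math.sqrt(1.0 + t * t))

T = 1.0
t = T                                # start at the worst case t_{n0} = T
for n in range(1, 401):
    dn = 0.5 * eps(T) / n            # decreasing, so e_{n0} = d_1 <= eps(T)
    s = math.sqrt(1.0 + t * t)
    t = (q * t + dn * s) / (1.0 - dn * s)   # worst case of Lemma C.6
```

As the lemma predicts, t never leaves [0, T] and is driven to zero at the rate of d_n.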
The meaning of “t_n → 0” is shown in the next lemma:

Lemma C.8. The following assertions are equivalent:

(a) ⟨u_n, v_n⟩ → λ̄,
(b) ‖u_n^(1)‖ → 1,
(c) t_n → 0.
Proof. Because of ‖u_n^(1)‖ = (1 + t_n²)^{−1/2} the equivalence of (b) and (c)
is clear. For n ∈ IN we have

   ⟨u_n, v_n⟩ = u_n^T (A + (A_n − A)) u_n
             = (u_n^(1) + u_n^(2))^T A (u_n^(1) + u_n^(2)) + u_n^T (A_n − A) u_n
             = λ̄ ‖u_n^(1)‖² + u_n^(2)T A u_n^(2) + u_n^T (A_n − A) u_n,

and

   |u_n^T (A_n − A) u_n| ≤ ‖A_n − A‖,   0 ≤ u_n^(2)T A u_n^(2) ≤ µ ‖u_n^(2)‖².

Thus, for each n ∈ IN we get

   ‖u_n^(1)‖² − d_n ≤ ⟨u_n, v_n⟩/λ̄ ≤ ‖u_n^(1)‖² + q ‖u_n^(2)‖² + d_n = q + (1 − q) ‖u_n^(1)‖² + d_n.

Due to d_n → 0 and q ∈ [0, 1) we obtain now also the equivalence of (a)
and (b).
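This equivalence can be observed numerically for the perturbed iteration. In the illustrative 2×2 example below (U is one-dimensional, spanned by a unit vector x, so ‖u_n^(1)‖ = |⟨x, u_n⟩|), both ⟨u_n, v_n⟩ − λ̄ and ‖u_n^(1)‖ − 1 become small together:

```python
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 2.0]])      # lambda_bar = 3
lam = 3.0
x = np.array([1.0, 1.0]) / np.sqrt(2.0)     # unit basis of U

u = np.array([1.0, 0.0])
for n in range(1, 201):
    An = A + np.array([[0.0, 1.0], [1.0, 0.0]]) / n**2   # A_n -> A
    v = An @ u
    rayleigh = float(u @ v)                  # <u_n, v_n>
    u1_norm = abs(float(x @ u))              # ||u_n^(1)||, since dim U = 1
    u = v / np.linalg.norm(v)
```

After the loop, `rayleigh` and `u1_norm` hold the last values of the two sequences in Lemma C.8 (a) and (b).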
Now Lemma C.5 can be proved.

Proof (of Lemma C.5). If

   ⟨u_n, v_n⟩ → λ̄,   (i)

then, because of A_n → A and Lemma C.8, there is an index n_0 such that

   sup_{n ≥ n_0} ‖A_n − A‖ < ((λ̄ − µ)/3) ‖u_{n_0}^(1)‖  ⟺  e_{n_0} < ((1 − q)/3) ‖u_{n_0}^(1)‖.   (ii)

Conversely, suppose that this condition holds. Then there are two possibilities:
   Case 1: e_{n_0} = 0. Then u_{n_0}^(1) ≠ 0, t_{n_0} < ∞, and also
e_{n_0} = 0 < ε(t_{n_0}) with the function ε(t) from Lemma C.7. According to this
lemma we get then t_n → 0.
   Case 2: e_{n_0} > 0. Due to (ii) it is

   0 < e_{n_0} ≤ (1 − q)/3 < (1/(2√2)) (1 − q) = ε(1).

Since ε(t) is continuous on [1, ∞) and decreases monotonically to zero, there
is a number T ∈ [1, ∞) such that

   e_{n_0} = ε(T).   (iii)

Because of ε(t) · t ≥ (1 − q)/3 for t ≥ 1, with (ii) and (iii) we find

   ((1 − q)/3) ‖u_{n_0}^(2)‖ ≤ (1 − q)/3 ≤ ε(T) T = e_{n_0} T < ((1 − q)/3) ‖u_{n_0}^(1)‖ T,

and therefore

   t_{n_0} := ‖u_{n_0}^(2)‖ / ‖u_{n_0}^(1)‖ ≤ T.   (iv)

From (iii) and (iv) with Lemma C.7 we obtain again t_n → 0.
   Due to Lemma C.8, the condition “lim_n t_n = 0” implies (i), which concludes
the proof.
Index
β-point, 262 boundary point, 271
1–1-transformation, 261 boundary points, 273
Bounded 1st order derivative, 29
a priori, 4, 5 Bounded 2nd order derivative, 30
actual parameter, 4 Bounded eigenvalues of the Hessian, 21
adaptive control of dynamic system, 4 Bounded gradient, 20
adaptive trajectory planning for robots, box constraints, 4
4
admissibility condition, 254 calibration methods, 5
admissibility of the state, 9 chance constrained programming, 46
approximate discrete distribution, 296 changing error variances, 178
Approximate expected failure cost Compensation, 4
minimization, 30 compromise solution, 6
approximate expected failure or concave function, 21
recourse cost constraints, 21 constraint functions, 3
Approximate expected failure or Continuity, 18
recourse cost minimization, 22 Continuous probability distributions, 12
approximate optimization problem, 4 continuously differentiable, 20
approximate probability function, 295 control (input, design) factors, 16
Approximation, 4 control volume, 46
Approximation of expected loss control volume concept, 46
functions, 26 Convex Approximation, 95
Approximation of state function, 22 convex loss function, 17, 20
asymptotically, 200 convex polyhedron, 273, 276
Convexity, 18
back-transformation, 49, 56 correction expenses, 4
Bayesian approach, 5 cost(s)
behavioral constraints, 32, 43 approximate expected failure or
Biggest-is-best, 17 recourse ... constraints, 21
bilinear functions, 25 expected, 6
Binomial-type distribution, 295 expected weighted total, 7
Bonferroni bounds, 31 factors, 3
Bonferroni-type bounds, 45 failure, 13
boundary (b-)points), 279 function, 5, 14
loss function, 11 divergence, 49
maximum, 8 divergence theorem, 51, 52, 59
maximum weighted expected, 7 domain of integration, 52, 53
minimum production, 16 dual representation, 257
of construction, 9
primary ... constraints, 11 economic uncertainty, 11
recourse, 11, 13 efficient points, 96
total expected, 13 elastic moduli, 54
covariance matrix, 13, 21 elliptically contoured probability
cross-sectional areas, 9 distribution, 118
curvature, 66 essential supremum, 8
expectation, 6, 12, 13
decision theoretical task, 5 expected
decreasing rate of stochastic steps, 173 approximate ... failure cost minimiza-
demand m-vector, 10 tion, 30
demand factors, 3 approximate ... failure or recourse
descent directions, 95 cost minimization, 22
design approximate ... failure or recourse
optimal, 13 cost constraints, 21, 30
optimal ... of economic systems, 4 Approximation of ... loss functions,
optimal ... of mechanical structures, 4 26
optimal structural, 10 cost, 6
optimum, 9 cost minimization problem, 6
point, 37 failure or recourse cost functions, 15
problem, 3 failure or recourse cost constraints, 15
robust ... problem, 29 failure or recourse cost minimization,
robust optimal, 21 16
structural, 9 maximum weighted... costs, 7
structural reliability and, 44 primary cost constraints, 16, 22, 30
variables, 3 primary cost minimization, 15, 21, 30
vector, 9, 44 quality loss, 17
deterministic recourse cost functions, 20
constraint, 7, 9 total ... costs, 13
substitute problem, 5–8, 13, 18, 19, total cost minimization, 15
21 total weighted ... costs, 7
Differentiability, 19 weighted total costs, 7
differentiation, 58 explicit representations, 280
differentiation formula, 46, 48, 52 explicitly, 281
differentiation of parameter-dependent external load parameters, 3
multiple integrals, 46 external load factor, 253
differentiation with respect to xk , 49, 56 extreme points, 11
discrete distribution, 289, 293
Discrete probability distributions, 12 factor
displacement vector, 54 control (input,design), 16
distance functional, 255 cost, 3
distribution demand, 3
probability, 5 noise, 3
distribution parameters, 12 of production, 3, 9
disturbances, 3 weight, 7
failure, 9 higher order partial derivatives, 51
cost constraints, 21, 30 procedure, 131
costs, 11, 13 hypersurface, 52, 53, 60
domain, 10
expected costs of, 13 inequality
mode, 10 Jensen’s, 20
of the structure, 22 Markov-type, 36
probability, 45 Tschebyscheff-type, 33
probability of, 13, 15 initial behavior, 247
failure/survival domains, 31 Inner expansions, 28
finite difference gradient estimator, 130 inner linearization, 96
First Order Reliability Method input r-vector, 43
(FORM), 38 input vector, 3, 9, 18
fluid dynamics, 46 integrable majorant, 49, 50, 55
FORM, 262 integral transformation, 46, 48, 58
function series expansions, 75 interchanging differentiation and
function(s) integration, 48, 55, 59
Approximation of expected loss, 26
Approximation of state, 22 Jensen’s inequality, 20
bilinear, 25
concave, 21 Lagrangian, 14, 263
constraint, 3 Laguerre functions, 75
convex loss, 17 least squares estimation, 26
cost, 14 Legendre functions, 75
cost/loss, 11 limit state function, 9, 10, 45
hermite, 75 limit state surface, 259
Laguerre, 75 limited sensitivity, 22
Legendre, 75 linear form, 270
limit state, 9, 10 linear programs with random data, 46
loss, 14, 20 linearization, 37
mean value, 17, 26, 28 linearize the state function, 268
objective, 3 Linearizing the state function, 264
output, 32 Lipschitz-continuous, 29
performance, 32 Loss Function, 20
primary cost, 9 Lyapunov equation, 310
product quality, 17
quality, 29 manufacturing, 4
recourse cost, 14 manufacturing errors, 3, 54
response, 32 Markov-type, 283
state, 9, 10, 14, 15, 25 Markov-type inequality, 36
trigonometric, 75 material parameters, 3, 54
functional determinant, 55 matrix step size, 179
functional–efficient, 6 maximum costs, 8
Mean Integrated Square Error, 83
gradient, 48, 55
mean limit covariance matrix, 204
bounded, 20
mean performance, 16
Hölder mean, 24 mean square error, 139
Hermite functions, 75 mean value function, 17, 26, 28
mean value theorem, 20, 29 Pareto
mechanical structure, 9, 10 optimal solution, 7
minimum reliability, 34 weak ... optimal solution, 7, 8
minimum asymptotic covariance matrix, weak Pareto optimal, 6
221 Pareto-optimal solutions, 114
minimum distance, 39 partial derivatives, 27
model parameters, 3, 10 partial monotonicity, 100
model uncertainty, 11 performance, 16
modeling errors, 3 performance function, 253
more exact gradient estimators, 131 performance functions, 6
multiple integral, 12, 48 performance variables, 43
multiple linearizations, 270 Phase 1 of RSM, 134
Phase 2 of RSM, 136
Necessary optimality conditions, 263, physical uncertainty, 11
267, 271 Polyhedral approximation, 273
noise, 16 power mean, 24
noise factors, 3 primary cost constraints, 11
nominal vector, 4 primary cost function, 9
Nominal-is-best, 16 primary goal, 7
normal distributed random variable, 36 probabilistic constraints, 269
probability
continuous ... distributions, 12
objective function, 3
density, 47, 54
operating condition, 254
density function, 12, 44
operating conditions, 3, 9, 11
discrete ... distributions, 12
operating, safety conditions, 43
distribution, 5
optimal
function, 43, 46, 48
control, 3
of survival, 54
decision, 3 of failure, 13, 15
design of economic systems, 4 of safety, 13
design of mechanical structures, 4 space, 5
optimal design, 13 subjective or personal, 5
optimal structural design, 10 probability density function, 12
optimum design, 9 probability inequalities, 282
Orthogonal Function Series Expansion, probability of failure, 260
46 probability of safety, 260
outcome map, 5 product quality function, 17
outcomes, 5 production planning, 4
production planning problems, 10
parallel computing, 273 projection
parameter identification, 5 of the origin, 64
parameters projection problem, 263
distribution, 12
external load, 3 quality characteristics, 16
material, 3, 54 quality engineering, 16, 29
model, 3, 4, 10 quality function, 29
stochastic structural, 54 quality loss, 16
technological, 3 quality variations, 16
Index 339

random matrix, 47 sequential decision processes, 4


random parameter vector, 6, 8, 44 sizing variables, 9
random variability, 11 Smallest-is-best, 17
random vector, 13 stable distributions, 119
rate of stochastic and deterministic state equation, 254
steps, 158 state function, 9, 10, 14, 15, 25, 253,
realizations, 12 256
recourse cost functions, 14 statistical uncertainty, 11
recourse costs, 11, 13 stochastic approximation method, 179
reference points, 25 Stochastic Completion and Transforma-
region of integration, 48 tion Method, 46
regression techniques, 5, 25 stochastic completion technique, 73
regular point, 263 stochastic difference quotient, 205
reliability stochastic structural parameters, 54
First Order Reliability Method Stochastic Uncertainty, 5
(FORM), 38 structural optimization, 3
index, 39 structural dimensions, 9
of a stochastic system, 43 structural mechanics, 54
reliability analysis, 30, 253 structural reliability analysis, 37
reliability based optimization, 14, 45 structural reliability and design, 44, 54
Reliability constraints, 45 structural resistance parameters, 253
Reliability maximization, 45 structural systems, 4, 10
Response Surface Methods, 25 structural systems weakness, 14
Response Surface Model, 25 structure
response variables, 43 mechanical, 9
response, output or performance safe, 3
functions, 32 subjective or personal probability, 5
robust design problem, 29 sublinear functions, 97
robust optimal design, 21 surface element, 53
robustness, 16 Surface Integrals, 51
RSM-gradient estimator, 132 systems failure domain, 44

safe domain, 260 Taylor expansion, 27, 36


safe state, 258 Taylor polynomials, 57
safe structure, 3 technological coefficients, 10
safety, 9 technological parameters, 3
safety conditions, 11 thickness, 9
safety margins, 9 total weight, 9
sample information, 4, 5 Transformation Method, 46
scalarization, 6, 8 trigonometric functions, 75
scenarios, 12 Tschebyscheff-type inequality, 33
search directions, 204 two-step procedure, 4
second order expansions, 21
secondary goals, 7 uncertainty, 4
semi-stochastic approximation method, economic, 11
129 model, 11
sensitivities, 27 physical, 11
Sensitivity of reliabilities, 45 statistical, 11
separable, 100 stochastic, 5
340 Index

unsafe domain, 260 nominal, 4


unsafe state, 258 optimization problem, 6
volume, 9
v. Mises procedure, 322
variable weak convergence, 294
design, 3 weak functional–efficient, 6
vector weight factors, 7
input, 3 Worst Case, 8
