
Applied Numerical Optimization

Prof. Alexander Mitsos, Ph.D.

Basic solution methods for unconstrained problems


Solution Methods for Unconstrained Optimization: $\min_{\mathbf{x} \in \mathbb{R}^n} f(\mathbf{x})$

Unconstrained optimization methods:
• Direct methods
• Indirect methods
Indirect Methods – Concept

• First-order necessary conditions:

$\nabla f(\mathbf{x}) = \mathbf{0} \;\Leftrightarrow\; \frac{\partial f}{\partial x_1}\Big|_{\mathbf{x}} = 0 = g_1(\mathbf{x}),\; \frac{\partial f}{\partial x_2}\Big|_{\mathbf{x}} = 0 = g_2(\mathbf{x}),\; \ldots,\; \frac{\partial f}{\partial x_n}\Big|_{\mathbf{x}} = 0 = g_n(\mathbf{x}) \;\Leftrightarrow\; \mathbf{g}(\mathbf{x}) = \mathbf{0}$

a nonlinear system of equations.

• The optimal solution is found by solving the system of equations analytically or numerically
(e.g., by Newton’s method).

• Differentiation and solution of the system of equations are challenging for complex problems!

Solution Methods for Unconstrained Optimization

Unconstrained optimization methods:
• Direct methods
• Indirect methods: the optimal solution is found by solving the system of equations (optimality conditions) $\nabla f(\mathbf{x}) = \mathbf{0}$ analytically or numerically.
Direct Methods – Concept

Idea: construct a convergent sequence $\{\mathbf{x}^{(k)}\}_{k=1}^{\infty}$ which fulfills the following conditions:

$\exists\, \bar{k} \ge 0:\; f(\mathbf{x}^{(k+1)}) < f(\mathbf{x}^{(k)}) \;\;\forall k > \bar{k} \quad\text{and}\quad \lim_{k\to\infty} \mathbf{x}^{(k)} = \mathbf{x}^* \in \mathbb{R}^n$

(Figure: iterates $\mathbf{x}^{(0)}, \mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(5)}$ converging to $\mathbf{x}^*$ in $\Omega = \mathbb{R}^n$.)

Definition: Rate of Convergence

Idea: construct a convergent sequence $\{\mathbf{x}^{(k)}\}_{k=1}^{\infty}$ which fulfills the following conditions:

$\exists\, \bar{k} \ge 0:\; f(\mathbf{x}^{(k+1)}) < f(\mathbf{x}^{(k)}) \;\;\forall k > \bar{k} \quad\text{and}\quad \lim_{k\to\infty} \mathbf{x}^{(k)} = \mathbf{x}^* \in \mathbb{R}^n$

Rate of convergence:
• Linear: if there exists a constant $C \in (0,1)$ such that for sufficiently large $k$:
$\|\mathbf{x}^{(k+1)} - \mathbf{x}^*\| \le C\,\|\mathbf{x}^{(k)} - \mathbf{x}^*\|$
• Order $p$ (often $p = 2$): if there exists a constant $M > 0$ such that
$\|\mathbf{x}^{(k+1)} - \mathbf{x}^*\| \le M\,\|\mathbf{x}^{(k)} - \mathbf{x}^*\|^p$
• Superlinear: if there exists a sequence $c_k$ converging to zero, i.e., $\lim_{k\to\infty} c_k = 0$, such that
$\|\mathbf{x}^{(k+1)} - \mathbf{x}^*\| \le c_k\,\|\mathbf{x}^{(k)} - \mathbf{x}^*\|$
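The three rates can be illustrated with toy error sequences $e_k = \|\mathbf{x}^{(k)} - \mathbf{x}^*\|$ (a sketch; the constants $C = 0.5$ and $M = 1$ are arbitrary illustrative choices):

```python
# Toy error sequences e_k = ||x^(k) - x*|| (illustrative constants).
linear = [0.5]      # linear rate:  e_{k+1} = C * e_k    with C = 0.5
quadratic = [0.5]   # order p = 2:  e_{k+1} = M * e_k^2  with M = 1
for _ in range(5):
    linear.append(0.5 * linear[-1])
    quadratic.append(quadratic[-1] ** 2)

print(linear)     # the error halves in every step
print(quadratic)  # the number of correct digits roughly doubles per step
```

After the same number of steps the quadratic sequence is already near machine precision, which is why the rate of convergence matters in practice.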

Solution Methods for Unconstrained Optimization

Unconstrained optimization methods:
• Direct methods: the optimal solution is found by directly improving the objective function via iterative descent.
• Indirect methods: the optimal solution is found by solving the system of equations (optimality conditions) $\nabla f(\mathbf{x}) = \mathbf{0}$ analytically or numerically.
Direct vs Indirect: Nomenclature not consistent in Literature

• Throughout this class we use "direct" and "indirect":
 "indirect methods": 1. set up the optimality conditions and then 2. try to solve the system of equations (or equations and inequalities)
 "direct methods": directly aim to improve the objective function (or objective function and constraints); these methods hope to eventually satisfy the optimality conditions.
• In the literature there are many alternative uses of the words, including
 exactly the opposite of ours
 "direct": without the use of derivatives; "indirect": using derivatives
 only in the context of dynamic optimization problems:
 "direct": first convert to a nonlinear program
 "indirect": first set up the optimality conditions
 only in the context of constrained problems:
 "direct": only feasible iterates
 "indirect": infeasible iterates are allowed

Solution Methods for Unconstrained Optimization

Unconstrained optimization methods:
• Direct methods: iterative descent
 Line-search approach
 Trust-region approach
• Indirect methods: the optimal solution is found by solving the system of equations (optimality conditions) $\nabla f(\mathbf{x}) = \mathbf{0}$ analytically or numerically.
Check Yourself

• What are direct vs indirect methods?


• Which direct methods did we learn?
• Which convergence rates exist? Why is the convergence rate important?

Applied Numerical Optimization
Prof. Alexander Mitsos, Ph.D.

Line search: basic idea and step length


Direct Methods – Line-Search Approach
Definition (descent direction):
A vector $\mathbf{p}$ is called a descent direction at $\mathbf{x}^{(k)}$ if $\nabla f(\mathbf{x}^{(k)})^T \mathbf{p} < 0$ holds.

Basic algorithm (line search):
1. Choose a descent direction $\mathbf{p}^{(k)}$ such that $\nabla f(\mathbf{x}^{(k)})^T \mathbf{p}^{(k)} < 0$
2. Determine a step length $\alpha_k$
3. Set $\mathbf{x}^{(k+1)} = \mathbf{x}^{(k)} + \alpha_k \mathbf{p}^{(k)}$

(Figure: step from $\mathbf{x}^{(k)}$ along $\alpha_k \mathbf{p}^{(k)}$ to $\mathbf{x}^{(k+1)}$ in $\Omega = \mathbb{R}^n$.)

Open issues:
• How to determine the descent direction $\mathbf{p}^{(k)}$?
• How to calculate the step length $\alpha_k$?

Calculation of Step Length 𝜶𝒌
The exact line-search strategy:
1. Define the one-dimensional function along the descent direction $\mathbf{p}^{(k)}$:
$\phi(\alpha) = f(\mathbf{x}^{(k)} + \alpha\mathbf{p}^{(k)})$
2. Solve the one-dimensional minimization problem
$\min_{\alpha > 0} \phi(\alpha)$

Remarks
1. Naively speaking, it would be ideal to globally minimize $\phi(\alpha)$. Generally, it is very expensive to find this solution, and it is not necessarily worthwhile since the search is only one-dimensional.
2. One could also search for some local solution. But this is often also too expensive (it needs function and/or gradient evaluations at a number of points).
3. Practical strategies (so-called non-exact line searches): find $\alpha$ such that $f(\mathbf{x}^{(k+1)})$ becomes as small as possible with minimal effort.
Practical Line-Search Strategies

(Figure: contour lines of $f(\mathbf{x})$ in the $(x_1, x_2)$-plane with the search path, and the corresponding one-dimensional function $\phi(\alpha) = f(\mathbf{x}^{(k)} + \alpha\mathbf{p}^{(k)})$ plotted over $\alpha$.)
Armijo Condition
Theorem[1]:
Let $f$ be continuously differentiable, $\mathbf{p}^{(k)}$ a descent direction, and let $c_1 \in (0,1)$ be given. Then there exists an $\alpha > 0$ such that for $\phi(\alpha) := f(\mathbf{x}^{(k)} + \alpha\mathbf{p}^{(k)})$ the condition $\phi(\alpha) \le \phi(0) + \alpha c_1 \phi'(0)$ holds.

Geometrical interpretation:

(Figure: $\phi(\alpha)$ together with the lines $\phi_l(\alpha) = \phi(0) + \alpha c_1 \phi'(0)$ for $c_1 = 0, 0.1, 0.25, 0.5, 1$, where $\phi'(0) = \nabla f(\mathbf{x}^{(k)})^T \mathbf{p}^{(k)} < 0$; the feasible domain is shown for $c_1 = 0.25$.)

[1] Nocedal J., Wright S. J., Numerical Optimization, 2nd Edition, Springer, 2006
Simple Line-Search Algorithm
Remarks:
1. The choice of a step length which fulfills the Armijo condition guarantees the descent of $f$:
$\phi'(0) = \nabla f(\mathbf{x}^{(k)})^T \mathbf{p}^{(k)} < 0$ ($\mathbf{p}^{(k)}$ is a descent direction)
$\phi(\alpha) \le \phi(0) + \alpha c_1 \phi'(0) \;\Rightarrow\; \phi(\alpha) < \phi(0)$
⇒ a descent is guaranteed!

2. The choice of $c_1$ is crucial:
• A large $c_1$ leads to small values of $\alpha$, such that $\mathbf{x}^{(k+1)} \approx \mathbf{x}^{(k)}$.
• A small $c_1$ potentially results in a small reduction of $f$ and therefore slower convergence.

Simple line-search algorithm:

choose $\alpha_1 > 0$; $\rho, c_1 \in (0,1)$
set $\alpha = \alpha_1$
repeat $\alpha \leftarrow \rho\alpha$ until $\phi(\alpha) \le \phi(0) + \alpha c_1 \phi'(0)$
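The backtracking loop above can be sketched in a few lines of Python (a sketch; the quadratic test function and the parameter values $\rho = 0.5$, $c_1 = 10^{-4}$ are illustrative choices):

```python
import numpy as np

def backtracking(f, grad_f, x, p, alpha0=1.0, rho=0.5, c1=1e-4):
    """Shrink alpha until the Armijo condition
    phi(alpha) <= phi(0) + alpha*c1*phi'(0) holds."""
    phi0 = f(x)
    dphi0 = grad_f(x) @ p          # phi'(0), negative for a descent direction
    alpha = alpha0
    while f(x + alpha * p) > phi0 + alpha * c1 * dphi0:
        alpha *= rho
    return alpha

# Example: f(x) = x1^2 + 4*x2^2, steepest-descent direction at x = (2, 1)
f = lambda x: x[0]**2 + 4.0 * x[1]**2
g = lambda x: np.array([2.0 * x[0], 8.0 * x[1]])
x = np.array([2.0, 1.0])
p = -g(x)
alpha = backtracking(f, g, x, p)
print(alpha, f(x + alpha * p) < f(x))  # 0.25 True
```

Termination is guaranteed by the Armijo theorem above, since $\mathbf{p}$ is a descent direction.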

Improved Line-Search Algorithm

choose $\alpha_0 > 0$ and $c_1 \in (0,1)$

if $\phi(\alpha_0) \le \phi(0) + \alpha_0 c_1 \phi'(0)$ STOP, else

find a better $\alpha \in (0, \alpha_0)$ through quadratic interpolation of the available data:

$\alpha_1 = -\dfrac{\phi'(0)\,\alpha_0^2}{2\,[\phi(\alpha_0) - \phi(0) - \phi'(0)\,\alpha_0]}$

if $\phi(\alpha_1) \le \phi(0) + \alpha_1 c_1 \phi'(0)$ STOP, else

find a better $\alpha \in (0, \alpha_1)$ through cubic interpolation of the available data (how?)

repeat the procedure of cubic interpolation until the condition is fulfilled
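The quadratic interpolation step fits a parabola to $\phi(0)$, $\phi'(0)$ and $\phi(\alpha_0)$ and returns its minimizer (a sketch; the helper name is illustrative):

```python
def quad_interp_step(phi0, dphi0, alpha0, phi_a0):
    """Minimizer of the quadratic q with q(0)=phi0, q'(0)=dphi0, q(alpha0)=phi_a0."""
    return -dphi0 * alpha0**2 / (2.0 * (phi_a0 - phi0 - dphi0 * alpha0))

# For phi(alpha) = (alpha - 1)^2 the quadratic model is exact, so one
# interpolation step recovers the exact minimizer alpha = 1:
phi = lambda a: (a - 1.0) ** 2
alpha1 = quad_interp_step(phi(0.0), -2.0, 2.0, phi(2.0))
print(alpha1)  # -> 1.0
```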

Wolfe Conditions
Theorem[1]:
Let $f$ be continuously differentiable, $\mathbf{p}^{(k)}$ a descent direction, and $c_1 \in (0,1)$, $c_2 \in (c_1, 1)$. Then there exists an $\alpha > 0$ such that
$\phi(\alpha) \le \phi(0) + \alpha c_1 \phi'(0)$ (Armijo condition)
$\phi'(\alpha) \ge c_2 \phi'(0)$ (slope condition)

Geometric interpretation: the slope condition guarantees a minimum step length!

(Figure: $\phi(\alpha)$ with the slopes $\phi'(0)$, $c_1\phi'(0)$ and $c_2\phi'(0)$, and the resulting feasible values of $\alpha$.)

Relevance: the Wolfe conditions promote convergence to a stationary point[1].
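A minimal check of both Wolfe conditions for a candidate step length might look as follows (a sketch; the helper name and the values $c_1 = 10^{-4}$, $c_2 = 0.9$ are illustrative):

```python
import numpy as np

def wolfe_ok(f, grad_f, x, p, alpha, c1=1e-4, c2=0.9):
    """True iff alpha satisfies the Armijo condition and the slope condition."""
    phi0, dphi0 = f(x), grad_f(x) @ p
    armijo = f(x + alpha * p) <= phi0 + alpha * c1 * dphi0
    slope = grad_f(x + alpha * p) @ p >= c2 * dphi0
    return armijo and slope

# phi(alpha) = (alpha - 1)^2 along p = 1 from x = 0 for f(x) = (x - 1)^2:
f = lambda x: (x[0] - 1.0) ** 2
g = lambda x: np.array([2.0 * (x[0] - 1.0)])
x, p = np.array([0.0]), np.array([1.0])
ok_full = wolfe_ok(f, g, x, p, 1.0)    # full step to the minimizer
ok_tiny = wolfe_ok(f, g, x, p, 1e-6)   # tiny step: Armijo holds, slope fails
print(ok_full, ok_tiny)  # True False
```

The tiny step is rejected by the slope condition only, which is exactly the "minimum step length" role noted above.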

[1] Nocedal J., Wright S. J., Numerical Optimization, 2nd Edition, Springer, 2006
Check Yourself

• Explain the basic ideas of the line-search method.


• What is a descent direction? How is it defined?
• Explain the Armijo rule and its potential drawbacks.
• Explain the Wolfe conditions and their advantage compared to the Armijo rule.

Applied Numerical Optimization
Prof. Alexander Mitsos, Ph.D.

Line search: simple directions


Determination of a Descent Direction: A Toolbox
Line-search approaches differ from each other with respect to the determination of descent direction and step length.

Determine the direction $\mathbf{p}^{(k)}$:
• Steepest-descent direction
• Conjugate directions
• Newton's-step direction
• … special cases

Many gradient methods use a symmetric positive definite matrix $\mathbf{D}^{(k)}$ and calculate
$\mathbf{x}^{(k+1)} = \mathbf{x}^{(k)} - \alpha_k \mathbf{D}^{(k)} \nabla f(\mathbf{x}^{(k)})$
(Extra work: prove that this guarantees descent!)
Steepest-Descent Direction (1)
Taylor series: $f(\mathbf{x}^{(k)} + \alpha\mathbf{p}^{(k)}) = f(\mathbf{x}^{(k)}) + \alpha\,\nabla f(\mathbf{x}^{(k)})^T \mathbf{p}^{(k)} + O(\alpha^2)$

The rate of change of $f$ at $\mathbf{x}^{(k)}$ along the direction $\mathbf{p}^{(k)}$ is the coefficient of the linear term: $\nabla f(\mathbf{x}^{(k)})^T \mathbf{p}^{(k)}$

The unit direction $\mathbf{p}^{(k)}$ with the highest rate of change is the solution of the following problem:

$\min_{\mathbf{p}^{(k)} \in \mathbb{R}^n} \nabla f(\mathbf{x}^{(k)})^T \mathbf{p}^{(k)} \quad \text{s.t.} \quad \|\mathbf{p}^{(k)}\| = 1$

Note that $\nabla f(\mathbf{x}^{(k)})^T \mathbf{p}^{(k)} = \|\nabla f(\mathbf{x}^{(k)})\|\,\|\mathbf{p}^{(k)}\|\cos\theta$

The solution of the problem is achieved for $\cos\theta = -1 \Rightarrow \theta = \pi$

$\Rightarrow \mathbf{p}^{(k)} = -\nabla f(\mathbf{x}^{(k)})\,/\,\|\nabla f(\mathbf{x}^{(k)})\|$

Here the choice of $\mathbf{D}^{(k)}$ is the identity matrix $\mathbf{I}$.

Steepest-Descent Direction (2)

Descent direction: $\nabla f(\mathbf{x}^{(k)})^T \mathbf{p} < 0$

(Figure: contour line $f(\mathbf{x}) = C$ through $\mathbf{x}^{(k)}$, separating $f(\mathbf{x}) > C$ from $f(\mathbf{x}) < C$; a direction $\mathbf{p}^{(1)}$ with $\nabla f(\mathbf{x}^{(k)})^T \mathbf{p}^{(1)} > 0$ (ascent), a direction $\mathbf{p}^{(2)}$ with $\nabla f(\mathbf{x}^{(k)})^T \mathbf{p}^{(2)} < 0$ (descent), and the steepest-descent direction $-\nabla f(\mathbf{x}^{(k)})$. Recall $\nabla f(\mathbf{x}^{(k)})^T \mathbf{p}^{(k)} = \|\nabla f(\mathbf{x}^{(k)})\|\,\|\mathbf{p}^{(k)}\|\cos\theta$.)
Method of Steepest-Descent

Algorithm:
choose $\mathbf{x}^{(0)}$
for $k = 0, 1, \ldots$
  if $\|\nabla f(\mathbf{x}^{(k)})\| \le \varepsilon$ stop, else
  set $\mathbf{p}^{(k)} = -\nabla f(\mathbf{x}^{(k)})$
  determine the step length $\alpha_k$ (e.g., using the Armijo rule)
  set $\mathbf{x}^{(k+1)} = \mathbf{x}^{(k)} + \alpha_k \mathbf{p}^{(k)}$
end for

(Figure: iterates approaching $\mathbf{x}^*$ on "well scaled" and "poorly scaled" contours; successive directions become perpendicular.)

Figure: Edgar T. F., Himmelblau D. M., Optimization of Chemical Processes, 2nd Edition, McGraw-Hill, 2001
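The algorithm above can be sketched with an Armijo backtracking step (a sketch; the poorly scaled quadratic test function is an illustrative choice):

```python
import numpy as np

def steepest_descent(f, grad_f, x0, eps=1e-6, max_iter=10_000):
    """Steepest descent with a simple Armijo backtracking line search."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) <= eps:
            break
        p = -g
        alpha, c1, rho = 1.0, 1e-4, 0.5
        while f(x + alpha * p) > f(x) + alpha * c1 * (g @ p):
            alpha *= rho
        x = x + alpha * p
    return x

# Poorly scaled quadratic: f(x) = x1^2 + 10*x2^2, minimum at the origin
f = lambda x: x[0]**2 + 10.0 * x[1]**2
g = lambda x: np.array([2.0 * x[0], 20.0 * x[1]])
print(steepest_descent(f, g, [3.0, 1.0]))  # close to (0, 0)
```

On this kind of elongated contour the iterates zig-zag, which is exactly the "poorly scaled" behavior shown in the figure.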
Newton’s Descent Direction
Quadratic approximation of $f$ at $\mathbf{x}^{(k)}$:

$m(\mathbf{x}^{(k+1)}) = f(\mathbf{x}^{(k)}) + \nabla f(\mathbf{x}^{(k)})^T(\mathbf{x}^{(k+1)} - \mathbf{x}^{(k)}) + \frac{1}{2}(\mathbf{x}^{(k+1)} - \mathbf{x}^{(k)})^T \nabla^2 f(\mathbf{x}^{(k)})(\mathbf{x}^{(k+1)} - \mathbf{x}^{(k)})$

First-order necessary optimality condition for $m$:

$\mathbf{0} = \nabla m(\mathbf{x}^{(k+1)}) = \nabla f(\mathbf{x}^{(k)}) + \nabla^2 f(\mathbf{x}^{(k)})(\mathbf{x}^{(k+1)} - \mathbf{x}^{(k)})$

$\Rightarrow \mathbf{x}^{(k+1)} = \mathbf{x}^{(k)} - [\nabla^2 f(\mathbf{x}^{(k)})]^{-1}\nabla f(\mathbf{x}^{(k)})$

$\Rightarrow \mathbf{p}^{(k)} = -[\nabla^2 f(\mathbf{x}^{(k)})]^{-1}\nabla f(\mathbf{x}^{(k)})$, the minimum of the quadratic approximation, with $\alpha_k = 1$.

Here the choice of $\mathbf{D}^{(k)}$ is the inverse of the Hessian.

(Figure: comparison of the steepest-descent and Newton directions from the viewpoint of objective-function approximation: the linear and the quadratic approximation of $f$ at $\mathbf{x}^{(k)}$, with $\mathbf{x}^{(k+1)}$ the minimum of the quadratic approximation.)

Newton’s Method

Algorithm:
choose $\mathbf{x}^{(0)}$
for $k = 0, 1, \ldots$
  if $\|\nabla f(\mathbf{x}^{(k)})\| \le \varepsilon$ stop, else
  set $\mathbf{p}^{(k)} = -[\nabla^2 f(\mathbf{x}^{(k)})]^{-1}\nabla f(\mathbf{x}^{(k)})$
  set $\mathbf{x}^{(k+1)} = \mathbf{x}^{(k)} + \mathbf{p}^{(k)}$
end for

Remarks:
1. Line search? $\mathbf{x}^{(k+1)} = \mathbf{x}^{(k)} + \alpha_k\mathbf{p}^{(k)}$ with $\mathbf{p}^{(k)} = -[\nabla^2 f(\mathbf{x}^{(k)})]^{-1}\nabla f(\mathbf{x}^{(k)})$ and $\alpha_k = 1$
2. (+) locally quadratic convergence if $\mathbf{x}^{(k)}$ is close to $\mathbf{x}^*$
   (−) 2nd derivatives & inversion (expensive for large systems of equations)
3. If $f$ is quadratic, the algorithm converges in one iteration.
4. Convergence to a minimum is not guaranteed! Why?
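Remark 3 can be checked directly: for a quadratic objective one full Newton step reaches the minimizer (a sketch; the matrix $A$ and vector $b$ are illustrative):

```python
import numpy as np

def newton(grad_f, hess_f, x0, eps=1e-8, max_iter=50):
    """Pure Newton iteration with full step alpha_k = 1."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) <= eps:
            break
        # Solve H p = -g instead of forming the inverse explicitly
        p = np.linalg.solve(hess_f(x), -g)
        x = x + p
    return x

# Quadratic f(x) = 0.5 x^T A x - b^T x: one Newton step reaches the minimizer
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
g = lambda x: A @ x - b
H = lambda x: A
print(newton(g, H, [10.0, -10.0]))  # equals solve(A, b) after one iteration
```

Note that the iteration only seeks $\nabla f = \mathbf{0}$; without safeguards it may just as well converge to a saddle point or maximum (remark 4).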

Check Yourself

• Explain the basic ideas of the line-search method.


• Explain the steepest descent method.
• What additional requirements does Newton's method put on the objective function?
• Explain the Newton direction. Is it better than other descent directions? Why is the Newton step length equal to one?

Applied Numerical Optimization
Prof. Alexander Mitsos, Ph.D.

Line search: complexity and examples


Complexity Analysis

Nesterov (2004) proves: “In general, optimization problems are unsolvable” *

Let $F$ denote a class of problems, e.g., Lipschitz-continuous functions with Lipschitz constant $L$, i.e., $|f(\mathbf{x}) - f(\mathbf{y})| < L\|\mathbf{x} - \mathbf{y}\|$; $L$ is assumed to be fixed for all $P \in F$.

“Performance of a method 𝑀 on a problem 𝑃 ∈ 𝐹 is the total amount of computational effort that is required by 𝑀
to solve 𝑃.” *

“To solve the problem means to find an approximate solution to 𝑃 with an accuracy ε > 0.” *

For unconstrained problems, the accuracy ε > 0 can be defined as the norm of the objective’s gradient.

* Yurii Nesterov, Introductory Lectures on Convex Optimization – A Basic Course, Kluwer Academic Publishers, (2004)

Complexity Analysis – Measuring Computational Effort

Unit of measurement: query to an oracle

It is assumed that the objective function is unknown and that the algorithm solves the optimization problem by querying an oracle for local information about the unknown objective function. An oracle is simply a "black box" capable of answering any query of the form:
• Given $\mathbf{x}$, return the value $f(\mathbf{x})$ (zeroth-order oracle)
• Given $\mathbf{x}$, return $f(\mathbf{x})$ and the gradient $\nabla f(\mathbf{x})$ (first-order oracle)
• Given $\mathbf{x}$, return $f(\mathbf{x})$, $\nabla f(\mathbf{x})$ and the Hessian $\nabla^2 f(\mathbf{x})$ (second-order oracle)

(Figure: the algorithm sends queries $\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \ldots$ to the black box ("oracle") $f$ and receives $f(\mathbf{x}^{(1)}), f(\mathbf{x}^{(2)}), \ldots$ in return.)

Analytical Complexity: The smallest number of queries to an oracle to solve Problem 𝑃 to accuracy ε. [1]

Arithmetical Complexity: The smallest number of arithmetic operations (including work of the oracle and
work of method), required to solve problem 𝑃 up to accuracy ε. [1]

[1] Yurii Nesterov, Introductory Lectures on Convex Optimization – A Basic Course, Kluwer Academic Publishers, 2004.
Analytical Complexity of Steepest Descent Method
Algorithm:
choose $\mathbf{x}^{(0)}$
for $k = 0, 1, \ldots$
  if $\|\nabla f(\mathbf{x}^{(k)})\| \le \varepsilon$ stop, else
  set $\mathbf{p}^{(k)} = -\nabla f(\mathbf{x}^{(k)})$
  determine the step length $\alpha_k$ (e.g., using the Armijo rule)
  set $\mathbf{x}^{(k+1)} = \mathbf{x}^{(k)} + \alpha_k\mathbf{p}^{(k)}$
end for

• Problem class: $f$ is continuously differentiable and $\nabla f(\mathbf{x})$ is Lipschitz-continuous with fixed Lipschitz constant $L$, i.e., $\|\nabla f(\mathbf{x}) - \nabla f(\mathbf{y})\| < L\|\mathbf{x} - \mathbf{y}\|$

• First-order oracle: returns $f(\mathbf{x})$ and the gradient $\nabla f(\mathbf{x})$

• Worst-case analytical complexity (queries to the oracle): $O\!\left(\frac{1}{\varepsilon^2}\right)$

Rosenbrock Function

$\min_{\mathbf{x}\in\mathbb{R}^2} f(\mathbf{x}) = 100\,(x_2 - x_1^2)^2 + (1 - x_1)^2$

• Solution point is $\mathbf{x}^* = (1,1)^T$ - why?

(Figure: surface plot of $f(\mathbf{x})$ [1] and contour lines of $f(\mathbf{x})$ in the $(x_1, x_2)$-plane.)
[1] https://commons.wikimedia.org/w/index.php?curid=9941741
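The claim can be checked directly: both squared terms vanish at $(1,1)^T$, so $f \ge 0$ with equality exactly there, and the first-order condition holds (a small sketch):

```python
import numpy as np

def rosenbrock(x):
    return 100.0 * (x[1] - x[0]**2)**2 + (1.0 - x[0])**2

def rosenbrock_grad(x):
    return np.array([
        -400.0 * x[0] * (x[1] - x[0]**2) - 2.0 * (1.0 - x[0]),
        200.0 * (x[1] - x[0]**2),
    ])

x_star = np.array([1.0, 1.0])
print(rosenbrock(x_star))        # 0.0: both squared terms vanish
print(rosenbrock_grad(x_star))   # [0. 0.]: first-order condition holds
```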
Illustration of Convergence (1)

$\min_{\mathbf{x}\in\mathbb{R}^2} f(\mathbf{x}) = 100\,(x_2 - x_1^2)^2 + (1 - x_1)^2$

• Steepest descent with Armijo line search

(Figure: iterates on the surface and contour lines of $f(\mathbf{x})$ [1].)
[1] https://commons.wikimedia.org/w/index.php?curid=9941741
Illustration of Convergence (2)

$\min_{\mathbf{x}\in\mathbb{R}^2} f(\mathbf{x}) = 100\,(x_2 - x_1^2)^2 + (1 - x_1)^2$

• Steepest descent with Wolfe line search

(Figure: iterates on the surface and contour lines of $f(\mathbf{x})$ [1].)
[1] https://commons.wikimedia.org/w/index.php?curid=9941741
Analytical Complexity of Newton’s Method

• Problem class: $f$ is twice continuously differentiable and $\nabla^2 f(\mathbf{x})$ is Lipschitz-continuous with fixed Lipschitz constant $L$, i.e., $\|\nabla^2 f(\mathbf{x}) - \nabla^2 f(\mathbf{y})\| < L\|\mathbf{x} - \mathbf{y}\|$

• Second-order oracle: returns $f(\mathbf{x})$, $\nabla f(\mathbf{x})$ and the Hessian $\nabla^2 f(\mathbf{x})$

• Quadratic approximation of $f$ around $\mathbf{x}^{(k)}$, with line search:

$m(\mathbf{x}^{(k+1)}) = f(\mathbf{x}^{(k)}) + \nabla f(\mathbf{x}^{(k)})^T(\mathbf{x}^{(k+1)} - \mathbf{x}^{(k)}) + \frac{1}{2}(\mathbf{x}^{(k+1)} - \mathbf{x}^{(k)})^T\nabla^2 f(\mathbf{x}^{(k)})(\mathbf{x}^{(k+1)} - \mathbf{x}^{(k)})$

• Worst-case analytical complexity: $O\!\left(\frac{1}{\varepsilon^{2-\tau}}\right)$, $1 > \tau > 0$, arbitrary but fixed for a given problem

• Quadratic approximation of $f$ around $\mathbf{x}^{(k)}$ with cubic regularization, with line search:

$m_{\text{regularized}}(\mathbf{x}^{(k+1)}) = m(\mathbf{x}^{(k+1)}) + \frac{1}{3}\sigma_k\|\mathbf{x}^{(k+1)} - \mathbf{x}^{(k)}\|^3$

• Worst-case analytical complexity: $O\!\left(\frac{1}{\varepsilon^{3/2}}\right)$

Illustration of Convergence (3)

$\min_{\mathbf{x}\in\mathbb{R}^2} f(\mathbf{x}) = 100\,(x_2 - x_1^2)^2 + (1 - x_1)^2$

• Modified Newton (Armijo line search; if the Hessian is not positive definite, switch to steepest descent)

(Figure: iterates on the surface and contour lines of $f(\mathbf{x})$ [1].)
[1] https://commons.wikimedia.org/w/index.php?curid=9941741
Check Yourself

• What does the term complexity analysis refer to?


• What is the difference between analytical and arithmetical complexity?
• Which method has better analytical complexity: Newton vs. steepest descent?

Applied Numerical Optimization
Prof. Alexander Mitsos, Ph.D.

Line search: advanced directions


Determination of a Descent Direction: A Toolbox
Line-search approaches differ from each other with respect to the determination of descent direction and step length.

Determine the direction $\mathbf{p}^{(k)}$:
• Steepest-descent direction
• Conjugate directions
• Newton's-step direction (and variants)
• … special cases
Inexact Newton Method (1)

Define: $f^{(k)} := f(\mathbf{x}^{(k)})$, $\mathbf{g}^{(k)} := \nabla f(\mathbf{x}^{(k)})$ and $\mathbf{H}^{(k)} := \nabla^2 f(\mathbf{x}^{(k)})$

From Newton's method:
$\mathbf{x}^{(k+1)} = \mathbf{x}^{(k)} + \mathbf{p}^{(k)}$
$\nabla^2 f(\mathbf{x}^{(k)})\,\mathbf{p}^{(k)} = -\nabla f(\mathbf{x}^{(k)}) \;\Rightarrow\; \mathbf{H}^{(k)}\mathbf{p}^{(k)} = -\mathbf{g}^{(k)}$

Idea:
• The linear equation system $\mathbf{H}^{(k)}\mathbf{p}^{(k)} = -\mathbf{g}^{(k)}$ is solved approximately by an iterative method, e.g., by CG (conjugate gradients) if $\mathbf{H}^{(k)}$ is positive definite.

Comments:
• LU or Cholesky decomposition: very high computational effort!
• Large errors occur for ill-conditioned problems.
• The exact solution is not needed.

Inexact Newton Method (2)

Newton-CG method (Newton's method with the CG method to determine $\mathbf{p}^{(k)}$):

Algorithm:
choose $\mathbf{x}^{(0)}$
for $k = 0, 1, \ldots$
  if $\|\nabla f(\mathbf{x}^{(k)})\| \le \varepsilon$ stop, else
  calculate $\mathbf{g}^{(k)} := \nabla f(\mathbf{x}^{(k)})$ and $\mathbf{H}^{(k)} := \nabla^2 f(\mathbf{x}^{(k)})$
  approximately solve $\mathbf{H}^{(k)}\mathbf{p}^{(k)} = -\mathbf{g}^{(k)}$ for $\mathbf{p}^{(k)}$ with the CG method
  set $\mathbf{x}^{(k+1)} = \mathbf{x}^{(k)} + \alpha_k\mathbf{p}^{(k)}$
end for

Some line-search strategy is needed. (Why?)
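The inner CG solve of $\mathbf{H}^{(k)}\mathbf{p}^{(k)} = -\mathbf{g}^{(k)}$ can be sketched as follows (a textbook CG iteration for a symmetric positive definite matrix; the test matrix is illustrative):

```python
import numpy as np

def cg(H, g, tol=1e-10, max_iter=100):
    """Conjugate-gradient solve of H p = -g for symmetric positive definite H."""
    p = np.zeros_like(g)
    r = -g - H @ p            # residual of H p = -g
    d = r.copy()
    for _ in range(max_iter):
        if np.linalg.norm(r) <= tol:
            break
        Hd = H @ d
        alpha = (r @ r) / (d @ Hd)
        p = p + alpha * d
        r_new = r - alpha * Hd
        beta = (r_new @ r_new) / (r @ r)
        d = r_new + beta * d
        r = r_new
    return p

H = np.array([[4.0, 1.0], [1.0, 3.0]])   # symmetric positive definite
g = np.array([1.0, 2.0])
p = cg(H, g)
print(np.allclose(H @ p, -g))  # True
```

In exact arithmetic CG terminates after at most $n$ iterations; truncating it earlier gives the "inexact" Newton direction.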


Modified Newton Method

Motivation: what if
• $\mathbf{H}^{(k)} := \nabla^2 f(\mathbf{x}^{(k)})$ is singular or almost singular (poorly conditioned)?
• $\mathbf{H}^{(k)}$ is not positive definite?

Idea: replace $\mathbf{H}^{(k)}$ by the approximation $\mathbf{B}^{(k)} = \mathbf{H}^{(k)} + \mathbf{E}^{(k)} \approx \mathbf{H}^{(k)}$ with $\mathbf{E}^{(k)} = \tau_k\mathbf{I}$, $\tau_k \ge 0$ smartly chosen; the method converges to steepest descent for $\tau_k \to \infty$.

$\mathbf{B}^{(k)}\mathbf{p}^{(k)} = -\mathbf{g}^{(k)}$
$\mathbf{x}^{(k+1)} = \mathbf{x}^{(k)} + \alpha_k\mathbf{p}^{(k)}$ ($\alpha_k$ from the line search)

Alternatives exist, e.g., see [1].
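One simple way to choose $\tau_k$ is to increase it until a Cholesky factorization of $\mathbf{B}^{(k)} = \mathbf{H}^{(k)} + \tau_k\mathbf{I}$ succeeds (a sketch; the indefinite test Hessian and the starting value are illustrative):

```python
import numpy as np

def modified_newton_direction(H, g, tau0=1e-3):
    """Add tau*I until B = H + tau*I is positive definite (Cholesky succeeds),
    then solve B p = -g."""
    tau = 0.0
    while True:
        B = H + tau * np.eye(len(g))
        try:
            np.linalg.cholesky(B)   # raises LinAlgError if B is not pos. def.
            break
        except np.linalg.LinAlgError:
            tau = max(2.0 * tau, tau0)
    return np.linalg.solve(B, -g)

# Indefinite Hessian: the plain Newton direction would not guarantee descent
H = np.array([[1.0, 0.0], [0.0, -2.0]])
g = np.array([1.0, 1.0])
p = modified_newton_direction(H, g)
print(g @ p < 0)  # True: the modified direction is a descent direction
```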

[1] Nocedal J., Wright S. J., Numerical Optimization, 2nd Edition, Springer, 2006, Chapter 3
Quasi-Newton Methods (1)

Idea: reduce complexity by a simplified calculation of $\mathbf{H}^{(k)} := \nabla^2 f(\mathbf{x}^{(k)})$ (Davidon):
• replace $\mathbf{H}^{(k)}$ by an approximation $\mathbf{B}^{(k)}$
• instead of calculating $\mathbf{B}^{(k)}$ anew, look for a simple update using information from the last iterations ($\mathbf{g}^{(k)} := \nabla f(\mathbf{x}^{(k)})$, $f^{(k)} := f(\mathbf{x}^{(k)})$)

Approach:
• Consider the quadratic approximation of $f$ at $\mathbf{x}^{(k)}$: $m^{(k)}(\mathbf{p}) = f^{(k)} + \mathbf{g}^{(k)T}\mathbf{p} + \frac{1}{2}\mathbf{p}^T\mathbf{B}^{(k)}\mathbf{p}$, with $\mathbf{B}^{(k)}$ symmetric positive definite.
• First-order optimality condition: $\mathbf{p}^{(k)} = -\mathbf{B}^{(k)\,-1}\mathbf{g}^{(k)}$
• By convexity this is necessary and sufficient for the minimization of $m^{(k)}(\mathbf{p})$.
• Construct the quadratic approximation at $\mathbf{x}^{(k+1)} = \mathbf{x}^{(k)} + \alpha_k\mathbf{p}^{(k)}$:
$m^{(k+1)}(\mathbf{p}) = f^{(k+1)} + \mathbf{g}^{(k+1)T}\mathbf{p} + \frac{1}{2}\mathbf{p}^T\mathbf{B}^{(k+1)}\mathbf{p}$
• What conditions must $\mathbf{B}^{(k+1)}$ satisfy?
Quasi-Newton Methods (2)
Conditions on $\mathbf{B}^{(k+1)}$:

1. The gradient of $m^{(k+1)}$ at $\mathbf{x}^{(k)}$ and at $\mathbf{x}^{(k+1)}$ must be equal to the gradient of $f$:

$\nabla m^{(k+1)}(\mathbf{p}) = \mathbf{g}^{(k+1)} + \mathbf{B}^{(k+1)}\mathbf{p}$

At $\mathbf{x} = \mathbf{x}^{(k+1)}$, i.e., $\mathbf{p} = \mathbf{0}$: we want $\nabla m^{(k+1)}(\mathbf{0}) = \mathbf{g}^{(k+1)}$ (automatically satisfied).

At $\mathbf{x} = \mathbf{x}^{(k)}$, i.e., $\mathbf{p} = -\alpha_k\mathbf{p}^{(k)}$: we want $\nabla m^{(k+1)}(-\alpha_k\mathbf{p}^{(k)}) = \mathbf{g}^{(k)}$

$\Rightarrow \mathbf{g}^{(k+1)} - \alpha_k\mathbf{B}^{(k+1)}\mathbf{p}^{(k)} = \mathbf{g}^{(k)}$
$\Rightarrow \mathbf{B}^{(k+1)}\,\underbrace{\alpha_k\mathbf{p}^{(k)}}_{=\,\mathbf{x}^{(k+1)} - \mathbf{x}^{(k)}} = \mathbf{g}^{(k+1)} - \mathbf{g}^{(k)}$
$\Rightarrow \mathbf{B}^{(k+1)}\mathbf{s}^{(k)} = \mathbf{y}^{(k)}$, where $\mathbf{s}^{(k)} = \mathbf{x}^{(k+1)} - \mathbf{x}^{(k)}$ and $\mathbf{y}^{(k)} = \mathbf{g}^{(k+1)} - \mathbf{g}^{(k)}$

2. Since $\mathbf{B}^{(k+1)}$ is symmetric positive definite: $\mathbf{s}^{(k)T}\mathbf{B}^{(k+1)}\mathbf{s}^{(k)} > 0 \;\;\forall\mathbf{s}^{(k)} \ne \mathbf{0} \;\Rightarrow\; \mathbf{s}^{(k)T}\mathbf{y}^{(k)} > 0$

The Wolfe conditions (line search) guarantee these constraints for all $f$, even when $f$ is non-convex.
Quasi-Newton Methods (3)
Conditions on $\mathbf{B}^{(k+1)}$:
$\mathbf{B}^{(k+1)}\mathbf{s}^{(k)} = \mathbf{y}^{(k)}$ admits many solutions for $\mathbf{B}^{(k+1)}$.

• Unique solution: $\mathbf{B}^{(k+1)}$ should be close to $\mathbf{B}^{(k)}$ in a weighted Frobenius norm:

$\min_{\mathbf{B}} \|\mathbf{B} - \mathbf{B}^{(k)}\|_W \quad \text{s.t.}\quad \mathbf{B}^T = \mathbf{B},\;\; \mathbf{B}\mathbf{s}^{(k)} = \mathbf{y}^{(k)}$

with $\|\mathbf{A}\|_W = \|W^{1/2}\mathbf{A}W^{1/2}\|_F$ for any $W$ such that $W\mathbf{y}^{(k)} = \mathbf{s}^{(k)}$, and $\|\cdot\|_F: \mathbb{R}^{n\times n} \to \mathbb{R}_{\ge 0}$, $\|\mathbf{C}\|_F^2 = \sum_{i=1}^n\sum_{j=1}^n c_{ij}^2$.

$\Rightarrow \mathbf{B}^{(k+1)} = \left(\mathbf{I} - \frac{1}{\mathbf{y}^{(k)T}\mathbf{s}^{(k)}}\mathbf{y}^{(k)}\mathbf{s}^{(k)T}\right)\mathbf{B}^{(k)}\left(\mathbf{I} - \frac{1}{\mathbf{y}^{(k)T}\mathbf{s}^{(k)}}\mathbf{s}^{(k)}\mathbf{y}^{(k)T}\right) + \frac{1}{\mathbf{y}^{(k)T}\mathbf{s}^{(k)}}\mathbf{y}^{(k)}\mathbf{y}^{(k)T}$ → DFP formula

$\Rightarrow \mathbf{B}^{(k+1)\,-1} = \left(\mathbf{I} - \frac{1}{\mathbf{y}^{(k)T}\mathbf{s}^{(k)}}\mathbf{s}^{(k)}\mathbf{y}^{(k)T}\right)\mathbf{B}^{(k)\,-1}\left(\mathbf{I} - \frac{1}{\mathbf{y}^{(k)T}\mathbf{s}^{(k)}}\mathbf{y}^{(k)}\mathbf{s}^{(k)T}\right) + \frac{1}{\mathbf{y}^{(k)T}\mathbf{s}^{(k)}}\mathbf{s}^{(k)}\mathbf{s}^{(k)T}$ → BFGS formula
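The DFP update can be verified numerically against the two conditions derived above, the secant condition $\mathbf{B}^{(k+1)}\mathbf{s}^{(k)} = \mathbf{y}^{(k)}$ and symmetry (a sketch; the vectors are illustrative and satisfy $\mathbf{y}^T\mathbf{s} > 0$):

```python
import numpy as np

def dfp_update(B, s, y):
    """DFP update of the Hessian approximation B from step s and gradient change y."""
    rho = 1.0 / (y @ s)            # requires the curvature condition y^T s > 0
    I = np.eye(len(s))
    V = I - rho * np.outer(y, s)
    return V @ B @ V.T + rho * np.outer(y, y)

B = np.eye(2)                       # common starting approximation B^(0) = I
s = np.array([1.0, 0.5])            # s^(k) = x^(k+1) - x^(k)
y = np.array([2.0, 1.5])            # y^(k) = g^(k+1) - g^(k), here y^T s > 0
B_new = dfp_update(B, s, y)
print(np.allclose(B_new @ s, y))    # True: the secant condition holds
print(np.allclose(B_new, B_new.T))  # True: symmetry is preserved
```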

Check Yourself

• Explain the inexact Newton method.


• What is the main idea of the modified and quasi-Newton methods? Why are these methods advantageous?
• Why is it necessary to introduce a step-length control mechanism (line-search) into modified and quasi-Newton
methods?

Applied Numerical Optimization
Prof. Alexander Mitsos, Ph.D.

Parameter estimation
Regression Problems: Least-Squares Formulation
Example:
Consider a batch reactor with the reaction A → B at constant temperature $T_R$. The reagent concentration $c_A$ is measured at time instants $t_j$. The reaction is of first order, therefore we can write the analytic solution:

$\frac{dc_A}{dt} = -k \cdot c_A(t) \;\rightarrow\; c_A(t) = c_A|_{t=0} \cdot e^{-kt}$

The reaction constant $k$ and the reagent concentration at initial time $c_A(t=0)$ are unknown and should be determined from the measurements.

Optimization formulation, using $x_1 = c_A(t=0)$ and $x_2 = k$:
$c_{A,\text{theoretical}}(t_j) = \varphi(\mathbf{x}, t_j) = x_1 \cdot e^{-x_2 t_j}$ (model)
$c_{A,\text{measured}}(t_j) = y_j$ (measurement)
$\varepsilon_j = y_j - \varphi(\mathbf{x}, t_j), \;\; j = 1, \ldots, m$ (residual / error)

$\min_{\mathbf{x}\in\mathbb{R}^2} \frac{1}{2}\|\boldsymbol{\varepsilon}(\mathbf{x})\|_2^2 = \frac{1}{2}\sum_j \left(y_j - \varphi(\mathbf{x}, t_j)\right)^2$

(Figure: measured values vs. the fitted model for $c_A$ over time $t$.)
Gauss-Newton Method
• $\min_{\mathbf{x}\in\mathbb{R}^2} f(\mathbf{x}) = \min_{\mathbf{x}\in\mathbb{R}^2}\frac{1}{2}\|\boldsymbol{\varepsilon}(\mathbf{x})\|_2^2 = \min_{\mathbf{x}\in\mathbb{R}^2}\frac{1}{2}\boldsymbol{\varepsilon}(\mathbf{x})^T\boldsymbol{\varepsilon}(\mathbf{x})$

• Define the Jacobian of the residuals: $\mathbf{J}(\mathbf{x}) := \nabla\boldsymbol{\varepsilon}(\mathbf{x}) \in \mathbb{R}^{m\times 2}$

$\Rightarrow \nabla f(\mathbf{x}) = \mathbf{J}(\mathbf{x})^T\boldsymbol{\varepsilon}(\mathbf{x})$

$\Rightarrow \nabla^2 f(\mathbf{x}) = \mathbf{J}(\mathbf{x})^T\mathbf{J}(\mathbf{x}) + \sum_{j=1}^m \varepsilon_j(\mathbf{x})\,\nabla^2\varepsilon_j(\mathbf{x})$

• The Hessian can be approximated by the first term in the case of almost linear problems (i.e., $\nabla^2\varepsilon_j(\mathbf{x}) \approx \mathbf{0}$) or good starting values (i.e., small $\varepsilon_j(\mathbf{x})$)

• Newton's direction: $\nabla^2 f(\mathbf{x}^{(k)})\,\mathbf{p}^{(k)} = -\nabla f(\mathbf{x}^{(k)})$

• With the Hessian approximation: $\mathbf{J}^{(k)T}\mathbf{J}^{(k)}\mathbf{p}^{(k)} = -\mathbf{J}^{(k)T}\boldsymbol{\varepsilon}^{(k)}$
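A minimal Gauss-Newton sketch for the batch-reactor model $\varphi(\mathbf{x}, t_j) = x_1 e^{-x_2 t_j}$, fitted to synthetic noise-free data (the true parameters, time grid and starting point are illustrative assumptions):

```python
import numpy as np

def gauss_newton(residual, jac, x0, n_iter=10):
    """Gauss-Newton: solve (J^T J) p = -J^T eps at each iterate (full step)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        J, eps = jac(x), residual(x)
        p = np.linalg.solve(J.T @ J, -J.T @ eps)
        x = x + p
    return x

# Batch-reactor model c_A(t) = x1 * exp(-x2 * t); true parameters (2.0, 0.3)
t = np.linspace(0.0, 10.0, 20)
y = 2.0 * np.exp(-0.3 * t)                    # synthetic measurements
residual = lambda x: y - x[0] * np.exp(-x[1] * t)
jac = lambda x: np.column_stack([
    -np.exp(-x[1] * t),                       # d(eps_j)/d(x1)
    x[0] * t * np.exp(-x[1] * t),             # d(eps_j)/d(x2)
])
print(gauss_newton(residual, jac, [1.8, 0.35]))  # close to [2.0, 0.3]
```

Since the data are noise-free this is a zero-residual problem, for which the Gauss-Newton approximation of the Hessian becomes exact at the solution.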

Remarks on Gauss-Newton Method

• If $\mathbf{J}^{(k)}$ has full rank, $\mathbf{p}^{(k)}$ is always a descent direction:

$\mathbf{p}^{(k)T}\nabla f(\mathbf{x}^{(k)}) = \mathbf{p}^{(k)T}\mathbf{J}^{(k)T}\boldsymbol{\varepsilon}^{(k)} = -\mathbf{p}^{(k)T}\mathbf{J}^{(k)T}\mathbf{J}^{(k)}\mathbf{p}^{(k)} = -\|\mathbf{J}^{(k)}\mathbf{p}^{(k)}\|_2^2 \le 0$

The inequality is strict unless $\mathbf{J}^{(k)}\mathbf{p}^{(k)} = \mathbf{0} \Leftrightarrow \mathbf{J}^{(k)T}\boldsymbol{\varepsilon}^{(k)} = \nabla f^{(k)} = \mathbf{0}$, i.e., at an optimum.

• Along the descent direction $\mathbf{p}^{(k)}$, the step length is determined as per the Wolfe conditions.

• For linear models the Jacobian matrix $\mathbf{J}$ is constant.

• The minimization condition corresponds to the normal equations.

• If $\mathbf{J}^{(k)}$ is singular or almost singular, the descent direction $\mathbf{p}^{(k)}$ is usually not reliable and the method converges very poorly. Quasi-Newton methods are then more efficient.

• It is a local method.

Check Yourself

• Where can least-squares problems be applied?


• When is a least-squares problem linear or nonlinear?
• What is the key idea of the Gauss-Newton method?

Applied Numerical Optimization
Prof. Alexander Mitsos, Ph.D.

Trust region method


Trust Region Method (1)

Idea:

• Approximate $f$ at $\mathbf{x}^{(k)}$ by the quadratic model function $m^{(k)}$:
$m^{(k)}(\mathbf{p}) = f^{(k)} + \mathbf{g}^{(k)T}\mathbf{p} + \frac{1}{2}\mathbf{p}^T\mathbf{B}^{(k)}\mathbf{p}$
where $f^{(k)} = f(\mathbf{x}^{(k)})$, $\mathbf{g}^{(k)} = \nabla f(\mathbf{x}^{(k)})$ and $\mathbf{B}^{(k)}$ is symmetric

• Taylor series: the approximation error is small for small $\|\mathbf{p}\|$

• For each iteration $k = 0, 1, \ldots$ choose a trust-region radius $\Delta^{(k)}$

• Solve the minimization problem $\min_{\mathbf{p}} m^{(k)}(\mathbf{p})$ s.t. $\|\mathbf{p}\| \le \Delta^{(k)}$ and set $\mathbf{p}^{(k)}$ to the solution found

• Set $\mathbf{x}^{(k+1)} = \mathbf{x}^{(k)} + \mathbf{p}^{(k)}$

Trust Region Method (2)

How to update the radius $\Delta^{(k)}$?

• Compare the agreement between the model function $m^{(k)}$ and the objective function $f$ at the previous iterations. Define the contraction rate $\rho_k$ as:

$\rho_k = \frac{f(\mathbf{x}^{(k)}) - f(\mathbf{x}^{(k)} + \mathbf{p}^{(k)})}{m^{(k)}(\mathbf{0}) - m^{(k)}(\mathbf{p}^{(k)})} = \frac{\text{actual reduction}}{\text{predicted reduction}}$

Since $m^{(k)}$ is minimized over a domain containing $\mathbf{0}$: $m^{(k)}(\mathbf{0}) - m^{(k)}(\mathbf{p}^{(k)}) > 0$

If $\rho_k < 0$: reject this step (ascent)
If $\rho_k \approx 1$: increase the radius (good agreement)
If $\rho_k \approx 0$: decrease the radius (poor agreement)

Trust Region Method (3)

Basic algorithm:
choose $\Delta^{(max)} > 0$, $\Delta^{(0)} \in (0, \Delta^{(max)})$ and $\eta \in [0, \tfrac{1}{4})$

for $k = 0, 1, \ldots$
  calculate the direction $\mathbf{p}^{(k)}$ and the contraction rate $\rho_k$
  if $\rho_k < \tfrac{1}{4}$: $\Delta^{(k+1)} = \tfrac{1}{4}\|\mathbf{p}^{(k)}\|$
  else if $\rho_k > \tfrac{3}{4}$ and $\|\mathbf{p}^{(k)}\| = \Delta^{(k)}$: $\Delta^{(k+1)} = \min(2\Delta^{(k)}, \Delta^{(max)})$
  else: $\Delta^{(k+1)} = \Delta^{(k)}$
  if $\rho_k > \eta$: $\mathbf{x}^{(k+1)} = \mathbf{x}^{(k)} + \mathbf{p}^{(k)}$
  else: $\mathbf{x}^{(k+1)} = \mathbf{x}^{(k)}$
Trust Region Method (3)

Remarks:

• $\Delta^{(k)}$ is increased only if $\|\mathbf{p}^{(k)}\|$ reaches the boundary of the domain.

• Strategies for the efficient solution of the minimization problem for $\mathbf{p}^{(k)}$:
 The Cauchy point: minimum along the steepest-descent direction ($-\mathbf{g}^{(k)}$); slow
 The dogleg method: applicable when $\mathbf{B}^{(k)}$ is positive definite; fast (superlinear)
 Steihaug's approach for large sparse matrices

The Dogleg Method
Idea:
• For a large $\Delta^{(k)}$: Newton step, $\mathbf{p}^{(k)} = \mathbf{p}^B = -\mathbf{B}^{(k)\,-1}\mathbf{g}^{(k)}$, where $\mathbf{p}^B$ is the unconstrained minimum of $m^{(k)}$ and $\|\mathbf{p}^B\| \le \Delta^{(k)}$.

• For a small $\Delta^{(k)}$: search for the solution along the direction $-\mathbf{g}^{(k)}$.

• For an intermediate $\Delta^{(k)}$: additionally calculate $\mathbf{p}^U = -\dfrac{\mathbf{g}^{(k)T}\mathbf{g}^{(k)}}{\mathbf{g}^{(k)T}\mathbf{B}^{(k)}\mathbf{g}^{(k)}}\,\mathbf{g}^{(k)}$, where $\mathbf{p}^U$ is the unconstrained minimum of $m^{(k)}$ in the steepest-descent direction.

• Then calculate the search direction along the dogleg path, a piecewise-linear combination of $\mathbf{p}^U$ and $\mathbf{p}^B$:

$\mathbf{p}^{(k)}(\tau) = \begin{cases} \tau\mathbf{p}^U & 0 \le \tau \le 1 \\ \mathbf{p}^U + (\tau - 1)(\mathbf{p}^B - \mathbf{p}^U) & 1 \le \tau \le 2 \end{cases}$

with $\|\mathbf{p}^{(k)}(\tau^*)\| = \Delta^{(k)}$.

(Figure: dogleg path from $\mathbf{x}^{(k)}$ via $\mathbf{p}^U$ toward $\mathbf{p}^B$, intersecting the trust-region boundary of radius $\Delta^{(k)}$ at $\mathbf{p}^{(k)}(\tau^*)$.)
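The dogleg step can be sketched as follows (a sketch for symmetric positive definite $\mathbf{B}^{(k)}$; the test matrix and radius are illustrative):

```python
import numpy as np

def dogleg(B, g, delta):
    """Dogleg step for min m(p) = g.p + 0.5 p.B.p s.t. ||p|| <= delta,
    assuming B is symmetric positive definite."""
    pB = np.linalg.solve(B, -g)                    # full Newton step
    if np.linalg.norm(pB) <= delta:
        return pB
    pU = -(g @ g) / (g @ (B @ g)) * g              # Cauchy point
    if np.linalg.norm(pU) >= delta:
        return delta * pU / np.linalg.norm(pU)     # scaled steepest descent
    # Intersect p = pU + t*(pB - pU) with the boundary ||p|| = delta (t in [0,1])
    d = pB - pU
    a, b, c = d @ d, 2.0 * (pU @ d), pU @ pU - delta**2
    t = (-b + np.sqrt(b**2 - 4.0 * a * c)) / (2.0 * a)
    return pU + t * d

B = np.diag([1.0, 4.0])
g = np.array([1.0, 1.0])
p = dogleg(B, g, delta=0.8)
print(np.isclose(np.linalg.norm(p), 0.8))  # True: step lies on the boundary
```

With a large radius (e.g. delta = 2.0 here) the same routine returns the unconstrained Newton step.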
Illustration of Convergence – Rosenbrock Function

$\min_{\mathbf{x}\in\mathbb{R}^2} f(\mathbf{x}) = 100\,(x_2 - x_1^2)^2 + (1 - x_1)^2$

• Solution point is $\mathbf{x}^* = (1,1)^T$ - why?

(Figure: surface plot of $f(\mathbf{x})$ [1] and contour lines of $f(\mathbf{x})$ in the $(x_1, x_2)$-plane.)
[1] https://commons.wikimedia.org/w/index.php?curid=9941741
Illustration of Convergence – BFGS (Quasi-Newton Method)

$\min_{\mathbf{x}\in\mathbb{R}^2} f(\mathbf{x}) = 100\,(x_2 - x_1^2)^2 + (1 - x_1)^2$

• Matlab trust-region (fminunc)

(Figure: iterates on the surface and contour lines of $f(\mathbf{x})$ [1].)
[1] https://commons.wikimedia.org/w/index.php?curid=9941741
Check Yourself

• Explain the trust-region method.
• Which model problem is solved in trust-region methods?
• How is the trust-region radius updated?
