
Chapter 2 - OPTIMIZATION

G.Anuradha
Contents
• Derivative-based Optimization
– Descent Methods
– The Method of Steepest Descent
– Classical Newton’s Method
– Step Size Determination
• Derivative-free Optimization
– Genetic Algorithms
– Simulated Annealing
– Random Search
– Downhill Simplex Search
What is Optimization?
• Choosing the best element from some set
of available alternatives
• Solving problems in which one seeks to
minimize or maximize a real function
Notation of Optimization
Optimize
    y = f(x1, x2, ..., xn)                                   ----- (1)
subject to
    gj(x1, x2, ..., xn) ≤ / ≥ / = bj,   j = 1, 2, ..., n     ----- (2)
Eqn. (1) is the objective function.
Eqn. (2) is a set of constraints imposed on the solution.
x1, x2, ..., xn are the decision variables.
Note:- The problem is either to maximize or minimize the value of the objective function.
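For instance, a small problem written in this notation (an assumed example for illustration, not one from the slides) is:

```latex
\text{Optimize (minimize)} \quad y = f(x_1, x_2) = x_1^2 + x_2^2
\qquad \text{subject to} \quad g_1(x_1, x_2) = x_1 + x_2 \ge 1
```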
Complicating factors in optimization
1. Existence of multiple decision variables
2. Complex nature of the relationships
between the decision variables and the
associated income
3. Existence of one or more complex
constraints on the decision variables
Types of optimization
• Constrained:- the solution is arrived at by maximizing or minimizing the objective function subject to constraints on the decision variables
• Unconstrained:- no constraints are imposed on the decision variables and differential calculus can be used to analyze them (a small worked example follows)
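A minimal worked instance of the unconstrained case, using an assumed one-variable function to show how differential calculus gives the optimum:

```latex
f(x) = x^2 - 4x + 5, \qquad f'(x) = 2x - 4 = 0 \;\Rightarrow\; x^* = 2, \qquad f''(2) = 2 > 0 \;\Rightarrow\; x^* \text{ is a minimum, } f(x^*) = 1
```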
Examples
Least Square Methods for System
Identification
• System Identification:- Determining a
mathematical model for an unknown system by
observing the input-output data pairs
• System identification is required
– To predict a system's behavior
– To explain the interactions and relationship between
inputs and outputs
– To design a controller
• System identification
– Structure identification
– Parameter identification
Structure identification
• Apply a priori knowledge about the target system to determine a class of models within which the search for the most suitable model is conducted
• y = f(u; θ)
y – model's output
u – input vector
θ – parameter vector
Parameter Identification
• The structure of the model is known and optimization techniques are applied to determine the estimate θ̂ of the parameter vector θ
Block diagram of parameter
identification
Parameter identification
• An input ui is applied to both the system and the model
• The difference between the target system's output yi and the model's output ŷi is used to update the parameter vector θ so as to minimize the difference (a sketch of this update loop follows)
• System identification is not a one-pass process; structure and parameter identification need to be carried out repeatedly
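The update loop described above can be sketched in a few lines of Python; the linear-in-parameters model, the learning rate and the toy data below are my own illustrative assumptions, not part of the slides:

```python
import numpy as np

def identify_parameters(u, y, lr=0.01, n_epochs=200):
    """Fit a simple model y ≈ theta[0] + theta[1]*u by repeatedly
    comparing the model's output with the measured system output."""
    theta = np.zeros(2)                        # initial parameter vector
    for _ in range(n_epochs):
        y_hat = theta[0] + theta[1] * u        # model output for the same inputs u_i
        error = y - y_hat                      # difference between system and model outputs
        # gradient-descent update that reduces the squared difference
        theta[0] += lr * error.mean()
        theta[1] += lr * (error * u).mean()
    return theta

# toy data from an assumed "unknown" system y = 1 + 2u plus noise
u = np.linspace(0, 1, 50)
y = 1.0 + 2.0 * u + 0.05 * np.random.randn(50)
print(identify_parameters(u, y))               # approaches [1, 2]
```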
Classification of Optimization algorithms
• Derivative-based algorithms
• Derivative-free algorithms
Characteristics of derivative-free algorithms
1. Derivative freeness:- rely only on repeated evaluation of the objective function
2. Intuitive guidelines:- concepts are based on nature's wisdom, such as evolution and thermodynamics
3. Slower
4. Flexibility
5. Randomness:- global optimizers
6. Analytic opacity:- knowledge about them is based on empirical studies
7. Iterative nature
Characteristics of derivative-free algorithms
• Stopping condition of iteration:- let k denote an iteration count and fk denote the best objective function value obtained at count k. The stopping condition depends on (see the sketch after this list)
– Computation time
– Optimization goal
– Minimal improvement
– Minimal relative improvement
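A minimal random-search sketch showing how these stopping conditions can be combined; the test function, tolerances and time budget are assumptions chosen for illustration, not values given in the slides:

```python
import time
import numpy as np

def random_search(f, x0, goal=1e-6, min_improvement=1e-9, time_budget=1.0, sigma=0.1):
    """Derivative-free random search using the stopping conditions listed above."""
    x_best = np.asarray(x0, dtype=float)
    f_best = f(x_best)
    start = time.time()
    while True:
        x_new = x_best + sigma * np.random.randn(*x_best.shape)   # random perturbation
        f_new = f(x_new)
        improvement = f_best - f_new
        if f_new < f_best:
            x_best, f_best = x_new, f_new
        if time.time() - start > time_budget:   break   # computation time exceeded
        if f_best <= goal:                      break   # optimization goal reached
        if 0 < improvement < min_improvement:   break   # minimal improvement
    return x_best, f_best

print(random_search(lambda x: float(np.sum(x**2)), [1.0, -2.0]))
```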
Basics of Matrix Manipulation and Calculus
Gradient of a Scalar Function
Jacobian of a Vector Function
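For reference, the standard definitions these two headings refer to (standard results, not reproduced from the slides), for a scalar F(x) and a vector function f(x) = (f1, ..., fm):

```latex
\nabla F(x) = \begin{bmatrix} \dfrac{\partial F}{\partial x_1} & \dfrac{\partial F}{\partial x_2} & \cdots & \dfrac{\partial F}{\partial x_n} \end{bmatrix}^{T},
\qquad
J(x) = \begin{bmatrix}
\dfrac{\partial f_1}{\partial x_1} & \cdots & \dfrac{\partial f_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\dfrac{\partial f_m}{\partial x_1} & \cdots & \dfrac{\partial f_m}{\partial x_n}
\end{bmatrix}
```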
Least Square Estimator
• The method of least squares is a standard approach to the approximate solution of overdetermined systems.
• Least Squares:- the overall solution minimizes the sum of the squares of the errors made in solving every single equation
• Application: Data Fitting (a sketch follows)
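A minimal data-fitting sketch using NumPy's least-squares solver; the straight-line model and the synthetic data are assumptions chosen for illustration:

```python
import numpy as np

# synthetic data from an assumed line y = 2x + 1 with noise
x = np.linspace(0, 5, 20)
y = 2.0 * x + 1.0 + 0.1 * np.random.randn(20)

# design matrix for the model y = a*x + b (overdetermined: 20 equations, 2 unknowns)
A = np.column_stack([x, np.ones_like(x)])

# least-squares solution minimizes the sum of squared residuals ||A @ theta - y||^2
theta, residuals, rank, _ = np.linalg.lstsq(A, y, rcond=None)
print("slope, intercept:", theta)              # close to [2, 1]
```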
Types of Least Squares
• Least Squares
– Linear:- the model is a linear combination of the parameters.
– The model may represent a straight line, a parabola or any other linear combination of basis functions
– Non-Linear:- the parameters appear inside non-linear functions, such as β² or e^(βx).
If the derivatives with respect to the parameters are either constant or depend only on the values of the independent variable, the model is linear; otherwise it is non-linear. (A short example follows.)
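For example (illustrative models of my own choosing, applying the rule above):

```latex
y = \beta_0 + \beta_1 x + \beta_2 x^2
\quad\Rightarrow\quad
\frac{\partial y}{\partial \beta_2} = x^2 \;\text{(depends only on } x\text{): linear least squares}
\\[4pt]
y = \beta_1 e^{\beta_2 x}
\quad\Rightarrow\quad
\frac{\partial y}{\partial \beta_1} = e^{\beta_2 x} \;\text{(depends on } \beta_2\text{): non-linear least squares}
```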
Differences between Linear and Non-Linear Least Squares

Linear                                               Non-Linear
Algorithms do not require initial values             Algorithms require initial values
Globally concave; non-convergence is not an issue    Non-convergence is a common issue
Normally solved using direct methods                 Usually an iterative process
Solution is unique                                   Multiple minima in the sum of squares
Yields unbiased estimates even when errors are       Yields biased estimates
uncorrelated with predictor values
Linear model
Regression Function
Linear model contd…
Using matrix notation
where A is an m×n matrix
Due to noise, a small amount of error is added
Least Square Estimator
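The equations this slide relies on are the standard linear least-squares estimator; restating them here (standard results, using the m×n matrix A and parameter vector θ from above):

```latex
y = A\theta + e,
\qquad
E(\theta) = \sum_{i=1}^{m} e_i^2 = (y - A\theta)^T (y - A\theta),
\qquad
\hat{\theta} = (A^T A)^{-1} A^T y
```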
Problem on Least Square
Estimator
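The worked problem itself is not reproduced in this extract, so here is an assumed small problem showing how the estimator θ̂ = (AᵀA)⁻¹Aᵀy is applied in practice:

```python
import numpy as np

# assumed overdetermined system: 4 observations, 2 unknown parameters
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
y = np.array([6.1, 7.9, 10.2, 11.8])           # noisy measured outputs

# least-squares estimate  theta_hat = (A^T A)^{-1} A^T y
theta_hat = np.linalg.solve(A.T @ A, A.T @ y)
print(theta_hat)                               # roughly [4.15, 1.94]
```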
Derivative Based Optimization
• Deals with gradient-based optimization
techniques, capable of determining search
directions according to an objective
function’s derivative information
• Used in optimizing non-linear neuro-fuzzy models; common techniques include
– Steepest descent
– Conjugate gradient
First-Order Optimality Condition

F(x) = F(x^* + \Delta x) = F(x^*) + \nabla F(x)^T\big|_{x=x^*} \Delta x + \tfrac{1}{2} \Delta x^T \nabla^2 F(x)\big|_{x=x^*} \Delta x + \cdots ,
\qquad \Delta x = x - x^*

For small \Delta x:

F(x^* + \Delta x) \approx F(x^*) + \nabla F(x)^T\big|_{x=x^*} \Delta x

If x* is a minimum, this implies:

\nabla F(x)^T\big|_{x=x^*} \Delta x \ge 0

If \nabla F(x)^T\big|_{x=x^*} \Delta x > 0, then

F(x^* - \Delta x) \approx F(x^*) - \nabla F(x)^T\big|_{x=x^*} \Delta x < F(x^*)

But this would imply that x* is not a minimum. Therefore

\nabla F(x)^T\big|_{x=x^*} \Delta x = 0

Since this must be true for every \Delta x,

\nabla F(x)\big|_{x=x^*} = 0
Second-Order Condition
If the first-order condition is satisfied (zero gradient), then

F(x^* + \Delta x) = F(x^*) + \tfrac{1}{2} \Delta x^T \nabla^2 F(x)\big|_{x=x^*} \Delta x + \cdots

A strong minimum will exist at x* if

\Delta x^T \nabla^2 F(x)\big|_{x=x^*} \Delta x > 0 \quad \text{for any } \Delta x \ne 0.

Therefore the Hessian matrix must be positive definite. A matrix A is positive definite if:

z^T A z > 0 \quad \text{for any } z \ne 0.

This is a sufficient condition for optimality.

A necessary condition is that the Hessian matrix be positive semidefinite. A matrix A is positive semidefinite if:

z^T A z \ge 0 \quad \text{for any } z.
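As a quick check tied to the quadratic used in the examples later in this chapter, its Hessian A = [[2, 2], [2, 4]] is positive definite by Sylvester's criterion (both leading principal minors are positive):

```latex
\det(2) = 2 > 0, \qquad \det\begin{bmatrix} 2 & 2 \\ 2 & 4 \end{bmatrix} = 8 - 4 = 4 > 0
\;\;\Rightarrow\;\; z^T A z > 0 \text{ for all } z \ne 0
```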
Basic Optimization Algorithm

x_{k+1} = x_k + \alpha_k p_k
\quad \text{or} \quad
\Delta x_k = (x_{k+1} - x_k) = \alpha_k p_k

p_k – search direction
\alpha_k – learning rate (step size)
Steepest Descent
Choose the next step so that the function decreases:

F(x_{k+1}) < F(x_k)

For small changes in x we can approximate F(x):

F(x_{k+1}) = F(x_k + \Delta x_k) \approx F(x_k) + g_k^T \Delta x_k
\quad \text{where} \quad
g_k \equiv \nabla F(x)\big|_{x = x_k}

If we want the function to decrease:

g_k^T \Delta x_k = \alpha_k g_k^T p_k < 0

We can maximize the decrease by choosing:

p_k = -g_k
\qquad
x_{k+1} = x_k - \alpha_k g_k
Example

F(x) = x_1^2 + 2 x_1 x_2 + 2 x_2^2 + x_1 , \qquad x_0 = \begin{bmatrix} 0.5 \\ 0.5 \end{bmatrix} , \qquad \alpha = 0.1

\nabla F(x) = \begin{bmatrix} \partial F / \partial x_1 \\ \partial F / \partial x_2 \end{bmatrix}
            = \begin{bmatrix} 2 x_1 + 2 x_2 + 1 \\ 2 x_1 + 4 x_2 \end{bmatrix} ,
\qquad
g_0 = \nabla F(x)\big|_{x = x_0} = \begin{bmatrix} 3 \\ 3 \end{bmatrix}

x_1 = x_0 - \alpha g_0 = \begin{bmatrix} 0.5 \\ 0.5 \end{bmatrix} - 0.1 \begin{bmatrix} 3 \\ 3 \end{bmatrix} = \begin{bmatrix} 0.2 \\ 0.2 \end{bmatrix}

x_2 = x_1 - \alpha g_1 = \begin{bmatrix} 0.2 \\ 0.2 \end{bmatrix} - 0.1 \begin{bmatrix} 1.8 \\ 1.2 \end{bmatrix} = \begin{bmatrix} 0.02 \\ 0.08 \end{bmatrix}
Plot
[Contour plot of F(x) for x_1, x_2 \in [-2, 2] showing the steepest-descent trajectory]
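The same iteration can be reproduced with a few lines of Python (the function, starting point and learning rate are the ones from the example; the loop length is my choice):

```python
import numpy as np

def grad_F(x):
    """Gradient of F(x) = x1^2 + 2*x1*x2 + 2*x2^2 + x1."""
    return np.array([2*x[0] + 2*x[1] + 1, 2*x[0] + 4*x[1]])

x = np.array([0.5, 0.5])
alpha = 0.1
for k in range(3):
    g = grad_F(x)
    x = x - alpha * g          # x_{k+1} = x_k - alpha * g_k
    print(k + 1, x)            # k=1 gives [0.2, 0.2], k=2 gives [0.02, 0.08]
```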
Effect of learning rate
• The larger the learning rate, the more oscillatory the trajectory becomes
• Too large a learning rate makes the algorithm unstable
• An upper limit on the learning rate can be set for quadratic functions
Stable Learning Rates (Quadratic)

F(x) = \tfrac{1}{2} x^T A x + d^T x + c
\qquad
\nabla F(x) = A x + d

x_{k+1} = x_k - \alpha g_k = x_k - \alpha (A x_k + d)
\quad\Rightarrow\quad
x_{k+1} = [I - \alpha A] x_k - \alpha d

Stability is determined by the eigenvalues of [I - \alpha A]:

[I - \alpha A] z_i = z_i - \alpha A z_i = z_i - \alpha \lambda_i z_i = (1 - \alpha \lambda_i) z_i

(\lambda_i – eigenvalue of A; \; (1 - \alpha \lambda_i) – eigenvalue of [I - \alpha A])

Stability requirement:

|1 - \alpha \lambda_i| < 1 \quad\Rightarrow\quad \alpha < \frac{2}{\lambda_i} \quad\Rightarrow\quad \alpha < \frac{2}{\lambda_{max}}
Example

A = \begin{bmatrix} 2 & 2 \\ 2 & 4 \end{bmatrix} ,
\qquad
\lambda_1 = 0.764 , \; z_1 = \begin{bmatrix} 0.851 \\ -0.526 \end{bmatrix} ,
\qquad
\lambda_2 = 5.24 , \; z_2 = \begin{bmatrix} 0.526 \\ 0.851 \end{bmatrix}

\alpha < \frac{2}{\lambda_{max}} = \frac{2}{5.24} = 0.38

[Contour plots of the trajectories for \alpha = 0.37 (convergent) and \alpha = 0.39 (divergent), x_1, x_2 \in [-2, 2]]
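The eigenvalues and the resulting bound can be checked numerically; a small sketch:

```python
import numpy as np

A = np.array([[2.0, 2.0],
              [2.0, 4.0]])
eigvals = np.linalg.eigvalsh(A)           # symmetric matrix -> real eigenvalues
print(eigvals)                            # approximately [0.764, 5.236]
print("stable learning rates: alpha <", 2.0 / eigvals.max())   # about 0.38
```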
Newton's Method

F(x_{k+1}) = F(x_k + \Delta x_k) \approx F(x_k) + g_k^T \Delta x_k + \tfrac{1}{2} \Delta x_k^T A_k \Delta x_k

Take the gradient of this second-order approximation and set it equal to zero to find the stationary point:

g_k + A_k \Delta x_k = 0
\qquad
\Delta x_k = -A_k^{-1} g_k
\qquad
x_{k+1} = x_k - A_k^{-1} g_k
Example

F(x) = x_1^2 + 2 x_1 x_2 + 2 x_2^2 + x_1 , \qquad x_0 = \begin{bmatrix} 0.5 \\ 0.5 \end{bmatrix}

\nabla F(x) = \begin{bmatrix} 2 x_1 + 2 x_2 + 1 \\ 2 x_1 + 4 x_2 \end{bmatrix} ,
\qquad
g_0 = \nabla F(x)\big|_{x = x_0} = \begin{bmatrix} 3 \\ 3 \end{bmatrix} ,
\qquad
A = \begin{bmatrix} 2 & 2 \\ 2 & 4 \end{bmatrix}

x_1 = \begin{bmatrix} 0.5 \\ 0.5 \end{bmatrix} - \begin{bmatrix} 2 & 2 \\ 2 & 4 \end{bmatrix}^{-1} \begin{bmatrix} 3 \\ 3 \end{bmatrix}
    = \begin{bmatrix} 0.5 \\ 0.5 \end{bmatrix} - \begin{bmatrix} 1 & -0.5 \\ -0.5 & 0.5 \end{bmatrix} \begin{bmatrix} 3 \\ 3 \end{bmatrix}
    = \begin{bmatrix} 0.5 \\ 0.5 \end{bmatrix} - \begin{bmatrix} 1.5 \\ 0 \end{bmatrix}
    = \begin{bmatrix} -1 \\ 0.5 \end{bmatrix}
Plot
[Contour plot of F(x) for x_1, x_2 \in [-2, 2]; Newton's method reaches the minimum of this quadratic in one step]
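A sketch of the same Newton step in Python (function, Hessian and starting point are from the example; np.linalg.solve is used instead of forming the inverse explicitly):

```python
import numpy as np

A = np.array([[2.0, 2.0],        # Hessian of the quadratic F(x)
              [2.0, 4.0]])

def grad_F(x):
    return np.array([2*x[0] + 2*x[1] + 1, 2*x[0] + 4*x[1]])

x = np.array([0.5, 0.5])
x = x - np.linalg.solve(A, grad_F(x))   # x_{k+1} = x_k - A^{-1} g_k
print(x)                                # [-1.0, 0.5], the exact minimum of the quadratic
```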
Conjugate Vectors

F(x) = \tfrac{1}{2} x^T A x + d^T x + c

A set of vectors \{p_k\} is mutually conjugate with respect to a positive definite Hessian matrix A if

p_k^T A p_j = 0 , \qquad k \ne j

One set of conjugate vectors consists of the eigenvectors of A:

z_k^T A z_j = \lambda_j z_k^T z_j = 0 , \qquad k \ne j

(The eigenvectors of symmetric matrices are orthogonal.)
For Quadratic Functions

\nabla F(x) = A x + d
\qquad
\nabla^2 F(x) = A

The change in the gradient at iteration k is

\Delta g_k = g_{k+1} - g_k = (A x_{k+1} + d) - (A x_k + d) = A \Delta x_k
\quad \text{where} \quad
\Delta x_k = (x_{k+1} - x_k) = \alpha_k p_k

The conjugacy conditions can be rewritten as

\alpha_k p_k^T A p_j = \Delta x_k^T A p_j = \Delta g_k^T p_j = 0 , \qquad k \ne j

This does not require knowledge of the Hessian matrix.
Forming Conjugate Directions
Choose the initial search direction as the negative of the gradient:

p_0 = -g_0

Choose subsequent search directions to be conjugate:

p_k = -g_k + \beta_k p_{k-1}
\quad \text{where} \quad
\beta_k = \frac{\Delta g_{k-1}^T g_k}{\Delta g_{k-1}^T p_{k-1}}
\quad \text{or} \quad
\beta_k = \frac{g_k^T g_k}{g_{k-1}^T g_{k-1}}
\quad \text{or} \quad
\beta_k = \frac{\Delta g_{k-1}^T g_k}{g_{k-1}^T g_{k-1}}
Conjugate Gradient algorithm
• The first search direction is the negative of the gradient:
p_0 = -g_0
• Select the learning rate to minimize along the line (for quadratic functions):

\alpha_k = -\frac{\nabla F(x)^T\big|_{x = x_k} \, p_k}{p_k^T \, \nabla^2 F(x)\big|_{x = x_k} \, p_k} = -\frac{g_k^T p_k}{p_k^T A_k p_k}

• Select the next search direction as
p_k = -g_k + \beta_k p_{k-1}
Example

F(x) = \tfrac{1}{2} x^T \begin{bmatrix} 2 & 2 \\ 2 & 4 \end{bmatrix} x + \begin{bmatrix} 1 & 0 \end{bmatrix} x ,
\qquad
x_0 = \begin{bmatrix} 0.5 \\ 0.5 \end{bmatrix}

\nabla F(x) = \begin{bmatrix} 2 x_1 + 2 x_2 + 1 \\ 2 x_1 + 4 x_2 \end{bmatrix} ,
\qquad
p_0 = -g_0 = -\nabla F(x)\big|_{x = x_0} = \begin{bmatrix} -3 \\ -3 \end{bmatrix}

\alpha_0 = -\frac{\begin{bmatrix} 3 & 3 \end{bmatrix} \begin{bmatrix} -3 \\ -3 \end{bmatrix}}
               {\begin{bmatrix} -3 & -3 \end{bmatrix} \begin{bmatrix} 2 & 2 \\ 2 & 4 \end{bmatrix} \begin{bmatrix} -3 \\ -3 \end{bmatrix}} = 0.2 ,
\qquad
x_1 = x_0 - \alpha_0 g_0 = \begin{bmatrix} 0.5 \\ 0.5 \end{bmatrix} - 0.2 \begin{bmatrix} 3 \\ 3 \end{bmatrix} = \begin{bmatrix} -0.1 \\ -0.1 \end{bmatrix}
Example (contd.)

g_1 = \nabla F(x)\big|_{x = x_1}
    = \begin{bmatrix} 2 & 2 \\ 2 & 4 \end{bmatrix} \begin{bmatrix} -0.1 \\ -0.1 \end{bmatrix} + \begin{bmatrix} 1 \\ 0 \end{bmatrix}
    = \begin{bmatrix} 0.6 \\ -0.6 \end{bmatrix}

\beta_1 = \frac{g_1^T g_1}{g_0^T g_0}
        = \frac{\begin{bmatrix} 0.6 & -0.6 \end{bmatrix} \begin{bmatrix} 0.6 \\ -0.6 \end{bmatrix}}
               {\begin{bmatrix} 3 & 3 \end{bmatrix} \begin{bmatrix} 3 \\ 3 \end{bmatrix}}
        = \frac{0.72}{18} = 0.04

p_1 = -g_1 + \beta_1 p_0 = -\begin{bmatrix} 0.6 \\ -0.6 \end{bmatrix} + 0.04 \begin{bmatrix} -3 \\ -3 \end{bmatrix} = \begin{bmatrix} -0.72 \\ 0.48 \end{bmatrix}

\alpha_1 = -\frac{g_1^T p_1}{p_1^T A p_1}
         = -\frac{\begin{bmatrix} 0.6 & -0.6 \end{bmatrix} \begin{bmatrix} -0.72 \\ 0.48 \end{bmatrix}}
                 {\begin{bmatrix} -0.72 & 0.48 \end{bmatrix} \begin{bmatrix} 2 & 2 \\ 2 & 4 \end{bmatrix} \begin{bmatrix} -0.72 \\ 0.48 \end{bmatrix}}
         = \frac{0.72}{0.576} = 1.25
Plots

x_2 = x_1 + \alpha_1 p_1 = \begin{bmatrix} -0.1 \\ -0.1 \end{bmatrix} + 1.25 \begin{bmatrix} -0.72 \\ 0.48 \end{bmatrix} = \begin{bmatrix} -1 \\ 0.5 \end{bmatrix}

[Contour plots over x_1, x_2 \in [-2, 2] comparing the Conjugate Gradient trajectory with the Steepest Descent trajectory]
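A compact Python sketch of this conjugate-gradient run (A, d and x0 are from the example; the Fletcher-Reeves form of β is used, matching the worked calculation):

```python
import numpy as np

A = np.array([[2.0, 2.0],
              [2.0, 4.0]])
d = np.array([1.0, 0.0])
x = np.array([0.5, 0.5])

g = A @ x + d                  # gradient of F(x) = 0.5*x^T A x + d^T x
p = -g                         # first direction: negative gradient
for k in range(2):             # an n-dimensional quadratic needs at most n steps
    alpha = -(g @ p) / (p @ A @ p)      # exact line minimization for a quadratic
    x = x + alpha * p
    g_new = A @ x + d
    beta = (g_new @ g_new) / (g @ g)    # Fletcher-Reeves beta
    p = -g_new + beta * p
    g = g_new
    print(k + 1, x)            # step 1: [-0.1, -0.1]; step 2: [-1.0, 0.5]
```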
Step Size Determination
• Deals with line minimization methods and their stopping criteria
– Initial bracketing
– Line searches
• Newton's method
• Secant method
• Sectioning method (a sketch of a sectioning-style search follows)
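As an illustration of a sectioning-style line search, here is a golden-section sketch; the bracketing interval, tolerance and test function are assumptions for illustration, not taken from the slides:

```python
import math

def golden_section(phi, a, b, tol=1e-5):
    """Minimize a unimodal 1-D function phi on [a, b] by repeatedly
    shrinking the bracket (a sectioning line search)."""
    r = (math.sqrt(5) - 1) / 2          # golden-ratio factor, about 0.618
    c, d = b - r * (b - a), a + r * (b - a)
    while (b - a) > tol:
        if phi(c) < phi(d):
            b, d = d, c
            c = b - r * (b - a)
        else:
            a, c = c, d
            d = a + r * (b - a)
    return (a + b) / 2

# line minimization of F along p0 = [-3, -3] from x0 = [0.5, 0.5] (the earlier example)
F = lambda x1, x2: x1**2 + 2*x1*x2 + 2*x2**2 + x1
phi = lambda t: F(0.5 - 3*t, 0.5 - 3*t)     # F(x0 + t*p0)
print(golden_section(phi, 0.0, 1.0))        # about 0.2, matching the exact alpha_0
```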
