
Institute of Aeronautics & Space Studies - Blida University (Algeria)

Fundamentals of
Scientific Computations
and Numerical Linear
Algebra
Dr. BEKHITI Belkacem
% MATLAB code used to generate the cover figure: a sphere whose
% facets are colored by a Hadamard matrix.
clear all, clc, figure
k = 5;
n = 2^k - 1;
theta = pi*(-n:2:n)/n;            % longitudes
phi = (pi/2)*(-n:2:n)'/n;         % latitudes
X = cos(phi)*cos(theta);          % Cartesian coordinates of the grid
Y = cos(phi)*sin(theta);
Z = sin(phi)*ones(size(theta));
colormap([1 0 0; 1 1 1])          % two-color map: red and white
C = hadamard(2^k);                % +/-1 Hadamard matrix selects the colors
surf(X, Y, Z, C)
axis square
2020
Table of Contents

Chapter I: An Introduction to Difference Equations

  Introduction
  First Order Linear Difference Equation
  Second Order Linear Difference Equation
    Real and distinct roots
    Real repeated roots
    Complex conjugate roots
  Higher Order Linear Difference Equations
  Particular Solutions of Linear Difference Equations
    Undetermined Coefficients
    Variation of Parameters
  Linear System of First Order Difference Equations
    Linear Discrete System and Eigenvalues
  Non-Homogeneous Linear Systems of Difference Equations
    Static Forcing Function
    Dynamic Forcing Function
  Difference and Anti-difference Operators
    Basic Properties of the Forward Difference Operator
    The Backward and Central Difference Operators
    The Inverse Difference Operator as a Sum Operator
    Summation by Parts
  The Z-Transform and Linear Time Invariant System
    Definitions and Region of convergence
    The Z Transform of Some Commonly Occurring Functions
    Some Properties of the Z Transform
    Transfer Function (System Function) and DC Gain
    Inverse Z-transform
      Contour Integration
      Partial Fraction Expansion
      Inversion by Power Series
    Discretization (Sampling & Ideal Sampling)
      Numerical Integration methods
      Step invariant method (zero order hold)

Chapter II: Interpolation and Curve Fitting

  Overview
  Interpolation by Lagrange Polynomial
  Interpolation by Newton Polynomial
  Interpolation by Cubic Spline
  Extrapolation
  Curve Fitting and Least Squares
    Straight Line Fit
    Fitting Linear Forms
    Polynomial Fit
    Weighting of Data

Chapter III: Numerical Differentiation and Integration

  Overview
  Finite Difference Approximations
    Differences and Polynomials
  Numerical Integration
    Riemann sum
    Trapezoidal rule
    Simpson's rule
    Numerical Integration as an interpolation problem
  Numerical Solution of Differential Equations
    Euler's Method (Taylor Series Method)
    Heun's Method (Trapezoidal Method)
    Runge–Kutta Method (Order 2 and 4)
    Adams–Bashforth–Moulton Method
    Hamming Method

Chapter IV: Finding Roots of a Complex Function by Local Methods

  Newton–Raphson Method
  Broyden's Method
  Horner's Method
  Bernoulli's Method

Chapter V: Finding Roots of Polynomials by Global Methods (QD-Algorithm)

  Eduard Stiefel's suggestion (~1953)
  Problem and its solution
  Alternative formulations of the problem
  Motivation Example (Proposed by the author)
  Finding the poles of f(z) from the moments
  Rutishauser's QD algorithm
    Column generation version
    Row generation or progressive version
  Starting point of the Algorithm

Chapter VI: Numerical Linear Algebra and Matrix Computations

  Overview on Algebraic Linear systems
    Direct Methods to Solve Equations
    Iterative Methods to Solve Equations
  Direct Methods to Solve Equations
    Gauss Elimination Method
    LU Decomposition Method
    Cholesky's Decomposition Methods
    Other Decompositions: QR and SVD
    Polynomial method
  Iterative Methods to Solve Equations
    The Jacobi method
    Gauss–Seidel method
    Successive Over-relaxation (SOR) method
    The conjugate gradient method
    The Lanczos Method for Non-symmetric Systems
    Broyden's algorithm (Proposed method)
  Eigenvalue Problems
    The Power Method
    The Inverse Iteration Method
    The QR Factorization
      Householder and QR
      Gram–Schmidt and QR
      Un-shifted QR algorithm and Eigenvalues
      Reduction of a Real Matrix to Upper Hessenberg
      Shifted QR algorithm
    The Arnoldi iteration
  Band and sparse matrices

Chapter VII: Introduction to Nonlinear Systems and Numerical Optimization

  Introduction
    Fundamentals and basic definitions
    Gradients and Level Curves
    Vector-valued functions
  Optimization by Newton's Method and Its Variants
    Newton's Method for Scalar functions
    Newton's Method for Vector functions
  Modified Newton Methods
    Quasi-Newton Method (Newton method with rank-one correction)
    Broyden–Fletcher–Goldfarb–Shanno algorithm (rank-one correction)
    Gradient Descent Algorithm
    The Levenberg–Marquardt method
  Convexity and Optimization
  The Random Optimization Methods
    The Random Search Method
    The Random Walk (path) Method
  Monte Carlo Optimization method
    The basic idea of Monte Carlo
    Global optimization algorithm
  Constrained optimization techniques
    Lagrange Multipliers
    Constrained random search method
    Regularized Least Squares & Lagrange Multipliers
  Metaheuristics and Optimization
    Particle Swarm Optimization
    Ant Colony Optimization
    Firefly Optimization Algorithm
    Artificial Bee Colony Algorithm
    Bacterial Foraging Optimization Algorithm
    Grey Wolf Optimizer
    Table of Nature-Inspired Algorithms
  CVX MATLAB package
Numerical Analysis and the Computer
Computer science, or computing, is the study of processes that interact with data
and that can be represented as information in the form of programs. It enables
the use of algorithms to process, store, and communicate digital information. A
computer scientist studies the theory of computation and the practice of
designing software systems. Numerical analysis, on the other hand, is the study
of algorithms that use numerical approximation to solve problems of mathematical
analysis. Naturally, numerical analysis finds application in all areas of
engineering and the physical sciences, but in the twenty-first century the life
sciences, social sciences, medicine, business, and even the arts have also
embraced elements of scientific computation. The growth of computing power
(micro-computers) has revolutionized the use of realistic mathematical models in
science and engineering, and rigorous numerical analysis is required to put these
detailed models to work. For example, ordinary differential equations appear in
celestial mechanics (predicting the motions of planets, stars and galaxies);
numerical linear algebra is important for data analysis; and stochastic
differential equations and Markov chains are essential in simulating living cells
for medicine and biology.

Some problems in numerical analysis can be solved exactly by an algorithm. Such
algorithms are called direct methods: examples include Gaussian elimination and
the QR factorization method for solving systems of linear equations, and the
simplex method in linear programming.

On the other hand, there are many problems that cannot be solved by direct
algorithms; in such cases it may still be possible to solve them using iterative
methods. Such a method begins with an initial guess and successively produces
approximations that converge to the desired solution. Even when direct algorithms
exist, iterative methods are sometimes preferred because they are more efficient
(they may require less time and less computational power while still delivering
a good approximation of the solution), or because they are more stable.

In other cases, continuous problems may need to be replaced by discrete
mathematical problems whose solutions are known to approximate those of the
continuous ones; this process is called "discretization". For example, the
solution of a differential equation is a function, which must be represented by
a finite amount of data, for instance by its values at a finite set of points of
its domain, even though the domain is a continuum.

Again, numerical methods are methods designed for the constructive solution of
mathematical problems requiring particular numerical results, usually on a
computer. A numerical method is a complete and unambiguous set of procedures for
the solution of a problem, together with computable error estimates. The study
and implementation of such methods is the province of numerical analysis.

The most covered topics in this branch of mathematics are:

Difference Equations and Recurrence Relations


Approximation and finite differences
Interpolation and curve fitting
Numerical Integral Equations
Numerical ODEs
Scientific Computing and Algorithms
Numerical Linear Algebra and Matrix analysis
Optimization Theory

In this book we aim to bring the basic concepts of this science closer to
university students. We have prepared the material in accordance with
contemporary international curricula, and we have tried as much as possible to
produce a book that gathers the major topics of this science, relying on the most
important books and a wide range of references on this art. We also sought to
keep the exposition simple and brief, avoiding unnecessary complexity. One
important point must be mentioned: we did not address the study of error and the
speed of convergence, which are two main factors in the methods of numerical
analysis. This is both to avoid lengthening the text and because we wished to
collect in this book as many methods as possible; readers who want to study error
and convergence are strongly recommended to consult the references listed at the
end of the book, where the most important works treat this point in detail.

I should also mention that we concluded the book with a series of nature-inspired
optimization methods, and attached to the book a very large set of algorithms and
MATLAB codes. As for the small number of exercises for students, this may be
improved in the near future.
CHAPTER I:
An Introduction to Difference Equations

Difference equations are equations that involve discrete changes, or differences,
of the unknown function. This is in contrast to differential equations, which
involve instantaneous rates of change, or derivatives, of the unknown function.
Difference equations are the discrete analogs of differential equations; they
appear as mathematical models in situations where the variable takes, or is
assumed to take, only a discrete set of values. As we will see in this chapter,
the theory and solutions of difference equations in many ways parallel the theory
and solutions of differential equations.

The theory of such difference equations is required for the understanding of some
important algorithms to be discussed later on. Furthermore, linear difference equations
are a useful tool in the study of many other processes of numerical analysis; and
numerical linear algebra. Much unnecessary complication is avoided by considering
difference equations with complex coefficients and solutions.

Definition: A difference equation over the set of k-values 0, 1, 2, … is an
equation of the form F(k, y_k, y_{k−1}, …, y_{k−n}) = 0, where F is a given
function, n is some positive integer, and k = 0, 1, 2, …

An n-th order linear difference equation is defined as

a0 y[k] + a1 y[k−1] + ⋯ + an y[k−n] = f[k]

If we define the (backward) shift operator

z⁻¹ : y[k] ⟼ y[k−1] = z⁻¹(y[k])

then the n-th order difference equation becomes
(a0 + a1 z⁻¹ + ⋯ + an z⁻ⁿ)(y[k]) = f[k] ⟺ L(y[k]) = f[k],
where L is a linear mapping (linear operator).

Theorem: A solution of L(y[k]) = f[k] exists if and only if f[k] ∈ ℛ(L), that is,
the forcing function f[k] is an element of the range space of the operator L. The
solution consists of two parts, a homogeneous and a particular solution:
y[k] = y_h[k] + y_p[k]

Proof: In linear algebra (see Algebra Book by BEKHITI 2020) we have seen that a
solution of Ax = b exists if b ∈ ℛ(A), or equivalently ℛ(A) = ℛ([A b]), meaning
that b is a linear combination of the columns of A. Now suppose that a solution
exists and let y_p be a particular solution; then

L(y_p[k]) = f[k]

Let y_h[k] ∈ 𝒩(L) (the null space of L), so that L(y_h[k]) = 0.

Consider now L(y_p[k] + y_h[k]) = L(y_p[k]) + L(y_h[k]) = L(y_p[k]) + 0 = f[k];
then the set of solutions of L(y[k]) = f[k] is

S = { y : y[k] = y_p[k] + y_h[k] }

First let us consider the homogeneous solution y_h[k], which satisfies

L(y_h[k]) = 0 ⟹ y_h[k] ∈ 𝒩(L)

Corollary: The nullity of an n-th order linear difference equation is n; in other
words, dim(𝒩(L)) = n.


Proof: If a0 y[k] + a1 y[k−1] + ⋯ + an y[k−n] = f[k], then the corresponding
linear mapping is L = a0 + a1 z⁻¹ + ⋯ + an z⁻ⁿ. From the fundamental theorem of
algebra it is well known that any n-th order polynomial has n roots, so this
linear mapping can be written as a product of n linear factors, L = L1 L2 … Ln.

Let y_i[k] ∈ 𝒩(L_i), i = 1, 2, …, n. Since the factors commute,

(L1 L2 … Ln)(y1[k]) = (L2 L3 … Ln) L1 (y1[k]) = 0
(L1 L2 … Ln)(y2[k]) = (L1 L3 … Ln) L2 (y2[k]) = 0
⋮
(L1 L2 … Ln)(yn[k]) = (L1 L2 … L_{n−1}) Ln (yn[k]) = 0

Hence each 𝒩(L_i) is contained in 𝒩(L), and we can say that dim(𝒩(L)) = n, which
implies that any basis of 𝒩(L) contains n linearly independent vectors.

Let {y1, y2, … yn} be linearly independent functions of 𝒩(L); this means that

α1 y1[k] + α2 y2[k] + ⋯ + αn yn[k] = 0
α1 y1[k−1] + α2 y2[k−1] + ⋯ + αn yn[k−1] = 0
⋮
α1 y1[k−n+1] + α2 y2[k−n+1] + ⋯ + αn yn[k−n+1] = 0

holds if and only if αi = 0 for all i. In matrix form we write:

[ y1[k]       y2[k]       ⋯   yn[k]     ] [ α1 ]
[ y1[k−1]     y2[k−1]     ⋯   yn[k−1]   ] [ α2 ]  =  0   ⟺   𝕎𝕏 = 0
[   ⋮            ⋮              ⋮       ] [ ⋮  ]
[ y1[k−n+1]   y2[k−n+1]   ⋯   yn[k−n+1] ] [ αn ]
Theorem: A set of n vectors {y1, y2, … yn} from 𝒩(L) is linearly independent if
and only if det 𝕎 ≠ 0. (The matrix 𝕎 is called the Casoratian, the discrete
analog of the Wronskian.)

Proof: It is straightforward to see that if 𝕎 is singular then there exists a
non-zero vector 𝕏 (some αi ≠ 0), which implies linear dependence of the yi[k].

Example: Consider y[k−2] − 3y[k−1] + 2y[k] = 0

▪ Show that (1)^k and (1/2)^k are homogeneous solutions.
▪ Do they form a basis of 𝒩(L)?
▪ What is the form of all homogeneous solutions?

Solution:

▪ y[k] = (1)^k ⟹ (1)^{k−2} − 3(1)^{k−1} + 2(1)^k = 0

  y[k] = (1/2)^k ⟹ (1/2)^{k−2} − 3(1/2)^{k−1} + 2(1/2)^k
                 = (1/2)^{k−2} {1 − 3(1/2) + 2(1/2)²} = 0

▪ 𝕎((1)^k, (1/2)^k) = det[ (1)^k  (1/2)^k ; (1)^{k−1}  (1/2)^{k−1} ]
                    = (1/2)^{k−1} − (1/2)^k = (1/2)^k ((1/2)⁻¹ − 1) = (1/2)^k ≠ 0 for all k

  ⟹ {(1)^k, (1/2)^k} is a basis of 𝒩(L).

▪ The homogeneous solution is a linear combination of the elementary homogeneous
  solutions:

  y_h[k] = C1 (1/2)^k + C2
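As a quick numerical check (a small MATLAB sketch written for this example; the constants C1 and C2 are arbitrary choices), we can march the recurrence 2y[k] = 3y[k−1] − y[k−2] forward from initial values generated by the closed form and confirm that the two sequences agree:

% Verify y[k] = C1*(1/2)^k + C2 against y[k-2] - 3y[k-1] + 2y[k] = 0
C1 = 3; C2 = -2;                 % arbitrary constants
K = 20; k = 0:K;
y_closed = C1*(1/2).^k + C2;     % closed-form homogeneous solution
y = zeros(1,K+1);
y(1:2) = y_closed(1:2);          % initial conditions y[0], y[1]
for m = 3:K+1
    y(m) = (3*y(m-1) - y(m-2))/2;   % recurrence solved for y[k]
end
max(abs(y - y_closed))           % agreement up to round-off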

The simplest first-order difference equation is the homogeneous equation:
y[k−1] + a y[k] = 0 ⟹ (z⁻¹ + a) y[k] = 0

What is the eigenpair of the operator z⁻¹?

Let (λ, y[k]) be an eigenpair of z⁻¹, so z⁻¹(y[k]) = λ y[k] (here λ = −a). Then

z(y[k]) = (1/λ) y[k], i.e. y[k+1] = (1/λ) y[k];   let y[0] = C1.

Now we run the recurrence:

y[1] = (1/λ) y[0] = (1/λ) C1
y[2] = (1/λ) y[1] = (1/λ)² C1
⋮
y[k] = (1/λ)^k C1

The eigenpair of the operator z⁻¹ is (λ, (1/λ)^k), and the solution of the
first-order difference equation is

y[k−1] + a y[k] = 0 ⟹ y_h[k] = C1 (−1/a)^k

A general second-order linear difference equation can be written:

y[k−2] + a1 y[k−1] + a0 y[k] = 0 ⟹ (z⁻² + a1 z⁻¹ + a0) y[k] = 0
⟹ (z⁻¹ − λ1)(z⁻¹ − λ2) y[k] = 0

where λ1, λ2 may be real and distinct, repeated, or complex conjugate.

If (λ, y[k]) is an eigenpair of z⁻¹, then z⁻¹(y[k]) = λ y[k], and

z⁻²(y[k]) = z⁻¹(z⁻¹(y[k])) = λ z⁻¹(y[k]) = λ² y[k]

so

y[k−2] + a1 y[k−1] + a0 y[k] = 0 ⟺ (λ² + a1 λ + a0) y[k] = 0 ⟺ λ² + a1 λ + a0 = 0

We call λ² + a1 λ + a0 the characteristic polynomial, symbolized by

Δ(λ) = λ² + a1 λ + a0

Assume that the characteristic polynomial Δ(λ) has two distinct real roots
λ1 ∈ ℝ and λ2 ∈ ℝ with λ1 ≠ λ2:

(z⁻¹ − λ1)(z⁻¹ − λ2) y[k] = 0 ⟺ (L1 L2)(y[k]) = 0

with eigenpairs (λ1, (1/λ1)^k) and (λ2, (1/λ2)^k). The elementary homogeneous
solutions are (1/λ1)^k and (1/λ2)^k. Notice that

𝕎((1/λ1)^k, (1/λ2)^k) = det[ (1/λ1)^k  (1/λ2)^k ; (1/λ1)^{k−1}  (1/λ2)^{k−1} ] ≠ 0
only if λ1 ≠ λ2 ⟹ {(1/λ1)^k, (1/λ2)^k} is a basis

y_h[k] = C1 (1/λ1)^k + C2 (1/λ2)^k

Assume that the characteristic polynomial Δ(λ) has a repeated real root
λ1 = λ2 = λ ∈ ℝ:

(z⁻¹ − λ)(z⁻¹ − λ) y[k] = (z⁻¹ − λ)² y[k] = 0

Set y1[k] = (z⁻¹ − λ) y[k]; then (z⁻¹ − λ) y1[k] = 0 ⟹ y1[k] = C1 (1/λ)^k.

The sequence y1[k] is a homogeneous solution of (z⁻¹ − λ) y1[k] = 0, and we have

(z⁻¹ − λ) y[k] = y1[k] ⟹ (z⁻¹ − λ) y[k] = C1 (1/λ)^k

Suppose that the general solution of this last equation has the form
y2[k] = C2[k] · (1/λ)^k. Substituting y2[k] into (z⁻¹ − λ) y[k] = y1[k] we get:

(z⁻¹ − λ)(C2[k] (1/λ)^k) = C1 (1/λ)^k ⟺ C2[k−1] (1/λ)^{k−1} − λ C2[k] (1/λ)^k = C1 (1/λ)^k
⟺ C2[k−1] − C2[k] = C1/λ    (a constant difference)
⟺ C2[k] = −(C1/λ) k + C2    (by accumulation)

From the superposition principle of linear systems we have

y_h[k] = y1[k] + y2[k] = {C1 − (C1/λ) k + C2} (1/λ)^k = β1 (1/λ)^k + β2 k (1/λ)^k

Assume that the characteristic polynomial Δ(λ) has two complex conjugate roots
λ1 ∈ ℂ and λ2 ∈ ℂ with λ1 = σ + jω and λ2 = σ − jω:

(z⁻¹ − λ1)(z⁻¹ − λ2) y[k] = 0 ⟹ y[k] = C1 (σ + jω)^{−k} + C2 (σ − jω)^{−k}

From basic complex analysis it is known that

a + jb = r(cos θ + j sin θ) = r e^{jθ}, where r = √(a² + b²) and θ = atan(b/a)
(a + jb)^k = r^k e^{jkθ} = r^k (cos kθ + j sin kθ)

Hence

y[k] = C1/(σ + jω)^k + C2/(σ − jω)^k = C1/(r^k e^{jkθ}) + C2/(r^k e^{−jkθ})
     = (1/r)^k [C1 e^{−jkθ} + C2 e^{jkθ}]
     = (1/r)^k [(C1 + C2) cos(kθ) + j(C2 − C1) sin(kθ)]

and for a real sequence the constants (C1 + C2) and j(C2 − C1) are real. The
basis of 𝒩(L) is therefore

{(1/r)^k cos(kθ), (1/r)^k sin(kθ)},
where r = √(Re(λ1)² + Im(λ1)²) and θ = atan(Im(λ1)/Re(λ1))

Summary:

▪ λ1, λ2 ∈ ℝ with λ1 ≠ λ2: y_h[k] = C1 (1/λ1)^k + C2 (1/λ2)^k
▪ λ1 = λ2 = λ ∈ ℝ: y_h[k] = C1 (1/λ)^k + C2 k (1/λ)^k
▪ λ1 = σ + jω, λ2 = σ − jω ∈ ℂ:
  y_h[k] = (1/r)^k [(C1 + C2) cos(kθ) + (C2 − C1) sin(kθ)],
  where r = √(σ² + ω²) and θ = atan(ω/σ)

Example: Consider the Fibonacci sequence

y[k] = y[k−1] + y[k−2], y[0] = 1 and y[1] = 1

The sequence is {1, 1, 2, 3, 5, 8, 13, 21, 34, …}. Find the general solution of
this difference equation.

Solution:

y[k] = y[k−1] + y[k−2] ⟺ y[k] − y[k−1] − y[k−2] = 0
⟺ (1 − z⁻¹ − z⁻²) y[k] = 0
⟺ Δ(λ) = 1 − λ − λ² = 0

λ1,2 = (1 ± √5)/(−2) are the eigenvalues of (z⁻¹ − λ1)(z⁻¹ − λ2) y[k] = 0, so

y[k] = C1 (2/(√5 − 1))^k + C2 (−2/(√5 + 1))^k

From the initial conditions we know

y[0] = C1 + C2 = 1
y[1] = 2C1/(√5 − 1) − 2C2/(√5 + 1) = 1

⟹ C1 = (3 + √5)/(5 + √5), C2 = (3 − √5)/(5 − √5)

y[k] = ((3 + √5)/(5 + √5)) ((√5 + 1)/2)^k + ((3 − √5)/(5 − √5)) ((1 − √5)/2)^k
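A minimal MATLAB check of this closed form against the Fibonacci recursion (nothing here is from a toolbox; it is a direct transcription of the formulas above):

% Fibonacci: recursion vs. the closed form derived above
K = 25; y = ones(1,K+1);         % y(1) = y[0] = 1, y(2) = y[1] = 1
for m = 3:K+1
    y(m) = y(m-1) + y(m-2);      % y[k] = y[k-1] + y[k-2]
end
k = 0:K; s5 = sqrt(5);
yc = (3+s5)/(5+s5)*((s5+1)/2).^k + (3-s5)/(5-s5)*((1-s5)/2).^k;
max(abs(y - yc))                 % agreement up to round-off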

Example: Express in real terms a general solution of the following difference
equations:

❶ y[k] − y[k−2] = 0
❷ y[k] + y[k−2] = 0
❸ y[k] + 2y[k−1] + y[k−2] = 0

Solution:

❶ y[k] − y[k−2] = 0 ⟺ (1 − z⁻²) y[k] = 0 ⟺ (1 − z⁻¹)(1 + z⁻¹) y[k] = 0
   ⟺ y1[k] = C1 (1)^k = C1 and y2[k] = C2 (−1)^k

   y[k] = y1[k] + y2[k] = C1 + C2 (−1)^k

❷ y[k] + y[k−2] = 0 ⟺ (1 + z⁻²) y[k] = 0 ⟺ (z⁻¹ − i)(z⁻¹ + i) y[k] = 0

   λ1 = j, λ2 = −j, r = 1, θ = π/2 ⟹ y[k] = β1 cos(kπ/2) + β2 sin(kπ/2)

❸ y[k] + 2y[k−1] + y[k−2] = 0 ⟺ (1 + 2z⁻¹ + z⁻²) y[k] = 0 ⟺ (1 + z⁻¹)² y[k] = 0

   λ = λ1 = λ2 = −1 ⟹ y[k] = (C1 + k C2)(−1)^k

Example: Find the general solution of the difference equation

y[k] + 2i y[k−1] + y[k−2] = 0

Solution:

y[k] + 2i y[k−1] + y[k−2] = 0 ⟺ (1 + 2i z⁻¹ + z⁻²) y[k] = 0
⟺ λ1 = −(1 + √2)i and λ2 = −(1 − √2)i

y[k] = C1 (−1/((1 + √2)i))^k + C2 (−1/((1 − √2)i))^k
     = C1 (i/(1 + √2))^k + C2 (i/(1 − √2))^k
     = i^k {C1 (1/(1 + √2))^k + C2 (1/(1 − √2))^k}
Let {λ1, λ2, …, λr} be the distinct roots of Δ(λ) = 0:

Δ(λ) = (λ − λ1)^{m1} (λ − λ2)^{m2} … (λ − λr)^{mr}, with m1 + m2 + ⋯ + mr = n

The difference equation becomes (z⁻ⁿ + a_{n−1} z^{1−n} + ⋯ + a1 z⁻¹ + a0) y[k] = 0,
i.e.

L(y[k]) = (z⁻¹ − λ1)^{m1} (z⁻¹ − λ2)^{m2} … (z⁻¹ − λr)^{mr} y[k] = 0 ⟺ L = L1 L2 … Lr

As we have seen before, if y_i is an elementary homogeneous solution of our
difference equation (i.e. L_i(y_i[k]) = 0), then

(L1 L2 … L_{i−1} L_i L_{i+1} … Lr)(y_i[k]) = (L1 L2 … L_{i−1} L_{i+1} … Lr) L_i (y_i[k]) = 0
⟹ 𝒩(L) = ⨁_{i=1}^{r} 𝒩(L_i)

So we can conclude that:

▪ 𝒩(L) = 𝒩(L1) ⊕ 𝒩(L2) ⊕ … ⊕ 𝒩(Lr), a direct sum of null spaces.
▪ If ℬ is a basis for 𝒩(L) and ℬi for 𝒩(Li), then ℬ = ℬ1 ∪ ℬ2 ∪ … ∪ ℬr.
▪ n(L) = n(L1) + n(L2) + ⋯ + n(Lr), the sum of the nullities.
▪ n(L) = Σ_{i=1}^{r} n(Li) = Σ_{i=1}^{r} mi = n.
▪ 𝒩(Li) ⊂ 𝒩(L), meaning that all vectors of 𝒩(Li) are included in 𝒩(L).

Corollary: If λi is an eigenvalue of L repeated mi times, so that
(z⁻¹ − λi)^{mi} (y[k]) = 0, then

{(1/λi)^k, k (1/λi)^k, …, k^{mi−1} (1/λi)^k}

is a basis for 𝒩(Li).

Proof: Let us prove this by induction.

▪ For the equation (z⁻¹ − λi)(y[k]) = 0 we have y[k] = C1 (1/λi)^k.

▪ For the equation (z⁻¹ − λi)(z⁻¹ − λi)(y2[k]) = 0, with (z⁻¹ − λi)(y1[k]) = 0
  and (z⁻¹ − λi)(y2[k]) = y1[k], we have seen that y1[k] = C1 (1/λi)^k and
  y2[k] = C2 k (1/λi)^k ⟹ {(1/λi)^k, k (1/λi)^k} is a basis.

▪ By induction we obtain that

  {(1/λi)^k, k (1/λi)^k, …, k^{mi−1} (1/λi)^k} is a basis for 𝒩(Li).
As in the case of differential equations, if the difference
equation is forced then there are two methods which can be used to find a particular
solution: the method of undetermined coefficients and the method of variation of
parameters.

Generally, the particular solution can frequently be found by an inspired guess.
Let us consider the following example:

y[k] − y[k−1] − y[k−2] = −k²

We set y[k] = a k² + b k + c, with constants a, b, c to be determined (we force
them so that the guess solves our difference equation):

(a k² + b k + c) − (a(k−1)² + b(k−1) + c) − (a(k−2)² + b(k−2) + c) = −k²
⟹ a(k² − (k−1)² − (k−2)²) + b(k − (k−1) − (k−2)) + c(1 − 1 − 1) = −k²

After simplification we get a(−k² + 6k − 5) + b(3 − k) − c = −k²
⟹ a = 1, b = 6, c = 13, hence

y_p[k] = k² + 6k + 13
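The guess can be confirmed with a few lines of MATLAB (a sketch written for this example):

% Check that y_p[k] = k^2 + 6k + 13 satisfies y[k] - y[k-1] - y[k-2] = -k^2
k = 2:12;
yp = @(k) k.^2 + 6*k + 13;
max(abs(yp(k) - yp(k-1) - yp(k-2) + k.^2))   % identically zero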

Example: Determine the general solution of the difference equation

2y[k] − 3y[k−1] − 2y[k−2] + 3 = 0

Solution: First we determine the homogeneous solution. Here
Δ(λ) = 2 − 3λ − 2λ², whose roots are λ1 = 1/2 and λ2 = −2, so the elementary
solutions are (1/λ1)^k = 2^k and (1/λ2)^k = (−1/2)^k:

y_h[k] = C1 (2)^k + C2 (−1/2)^k

By an inspired guess y_p[k] = C3 ⟹ 2C3 − 3C3 − 2C3 + 3 = 0 ⟹ C3 = 1, so
y[k] = y_p[k] + y_h[k]:

y[k] = 1 + C1 (2)^k + C2 (−1/2)^k
Summary: Given a forced linear time-invariant difference equation
Σ_{i=0}^{n} a_i y[k−i] = x[k], the trial particular solutions are:

  Forcing x[k]                    Trial y_p[k]
  x[k] = K                        y_p[k] = C
  x[k] = A α^k                    y_p[k] = K α^k
  x[k] = α^k k^m                  y_p[k] = α^k (C0 k^m + C1 k^{m−1} + ⋯ + Cm)
  x[k] = cos(ω0 k) or sin(ω0 k)   y_p[k] = C1 cos(ω0 k) + C2 sin(ω0 k)
  x[k] = α^k cos(ω0 k)            y_p[k] = α^k (C1 cos(ω0 k) + C2 sin(ω0 k))

Example: Find a particular solution of the nonhomogeneous difference equation

y[k+2] − 8y[k+1] + 15y[k] = 3(2^k) − 5(3^k)

Solution: As with differential equations, first we find the general solution of
the corresponding homogeneous equation y[k+2] − 8y[k+1] + 15y[k] = 0, since the
solution of this unforced equation can influence the form of a particular
solution. The characteristic equation is Δ(λ) = λ² − 8λ + 15 = 0 ⟹ λ1 = 3 and
λ2 = 5. Homogeneous solution:

y_h[k] = C1 (3^k) + C2 (5^k)

The particular solution is of the form y_p[k] = A(2^k) + B k (3^k); here we
multiplied 3^k by k because 3^k is also a homogeneous solution.

The undetermined coefficients A and B are found by substituting into the original
difference equation and equating coefficients of similar terms. We get A = 1,
B = 5/6:

y[k] = y_h[k] + y_p[k] = (2^k) + (C1 + (5/6)k)(3^k) + C2 (5^k)
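A quick numerical confirmation of the coefficients A = 1 and B = 5/6 (a sketch for this example only):

% Check y_p[k] = 2^k + (5/6)*k*3^k in y[k+2] - 8y[k+1] + 15y[k] = 3*2^k - 5*3^k
k = 0:10;
yp = @(k) 2.^k + (5/6)*k.*3.^k;
lhs = yp(k+2) - 8*yp(k+1) + 15*yp(k);
rhs = 3*2.^k - 5*3.^k;
max(abs(lhs - rhs))              % ~ 0 up to round-off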
Example: Find a particular solution of

y[k] = (5/6) y[k−1] − (1/6) y[k−2] + x[k], with x[k] = 2^k, k ≥ 0

Solution: Take y_p[k] = K 2^k u[k], or y_p[k] = K 2^k for k ≥ 0, where u[k] is
the unit step function. Substituting into the difference equation we obtain

K 2^k u[k] = K((5/6) 2^{k−1} u[k−1] − (1/6) 2^{k−2} u[k−2]) + 2^k u[k]

Evaluating the equation for k ≥ 2 we get

4K = (5/6)(2K) − (1/6)K + 4 ⟹ K = 8/5

y_p[k] = (8/5)(2^k) u[k] and y_h[k] = {C1 (1/3)^k + C2 (1/2)^k} u[k]

y[k] = y_h[k] + y_p[k] = {(8/5)(2^k) + C1 (1/3)^k + C2 (1/2)^k} u[k]

Example: Find the general solution of:

▪ y[k] − 5y[k−1] + 6y[k−2] = 2
▪ 8y[k] − 6y[k−1] + y[k−2] = 2^k
▪ y[k] − y[k−1] − y[k−2] = k²

Solution:

▪ y[k] − 5y[k−1] + 6y[k−2] = 0 ⟹ (z⁻¹ − 1/2)(z⁻¹ − 1/3) = 0
  ⟹ y_h[k] = C1 (2^k) + C2 (3^k)

  y_p[k] = C3; substituting into the equation we obtain C3 = 1
  ⟹ y[k] = 1 + C1 (2^k) + C2 (3^k)

▪ 8y[k] − 6y[k−1] + y[k−2] = 2^k ⟹ (z⁻¹ − 4)(z⁻¹ − 2) = 0
  ⟹ y_h[k] = C1 (1/2)^k + C2 (1/4)^k

  y_p[k] = C3 2^k; substituting into the equation we obtain C3 = 4/21:

  y[k] = (4/21)(2^k) + C1 (1/2)^k + C2 (1/4)^k

▪ y[k] − y[k−1] − y[k−2] = k² ⟹ (z⁻² + z⁻¹ − 1) = 0
  ⟹ y_h[k] = C1 (2/(√5 − 1))^k + C2 (−2/(√5 + 1))^k

  y_p[k] = C3 k² + C4 k + C5; substituting into the equation we obtain C3 = −1,
  C4 = −6 and C5 = −13:

  y[k] = −k² − 6k − 13 + C1 (2/(√5 − 1))^k + C2 (−2/(√5 + 1))^k
Hariche Procedure: There is an orderly methodology for determining the
undetermined coefficients, based on extending the original operator so that the
equation becomes unforced, and thereafter extracting the particular part.

Suppose we have L(y[k]) = x[k] and we can find a linear operator L1 (an
annihilator) such that

L1{L(y[k])} = L1(x[k]) = 0 ⟹ x[k] ∈ 𝒩(L1)

Notice that y[k] is then a homogeneous solution of the operator L1 L, so we can
write

L1 L(y[k]) = L1 L(y1[k] + y2[k]) = L1 L(y1[k]) + L1 L(y2[k]) = 0
⟹ L1(y1[k]) = 0 and L(y2[k]) = 0

It is straightforward to see that the solution y[k] can be decomposed into two
main parts, y[k] = y1[k] + y2[k], where y1[k] ∈ 𝒩(L1) and y2[k] ∈ 𝒩(L).

On the other hand, we observe that y1[k] is a particular solution of L. To see
the correctness of this claim, assume that (y1[k] + y2[k]) is a particular
solution of L, so

L1(x[k]) = 0 ⟹ L1(L(y[k])) = 0 ⟹ L(y1[k] + y2[k]) = x[k]
⟹ L(y1[k]) + L(y2[k]) = x[k]

But we know that y2[k] is a homogeneous solution of L, that is L(y2[k]) = 0;
hence only y1[k] carries the forcing, i.e. L(y1[k]) = x[k], and y1[k] is a
particular solution of L.

Example: Find the general solution of

y[k−1] + y[k−2] = 3(2^k) + 2k

Solution: First we have y[k−1] + y[k−2] = z⁻¹(z⁻¹ + 1) y[k] = 3(2^k) + 2k.

Notice that

(z⁻¹ − 1)²(2k) = 0 and (z⁻¹ − 1/2)(2^k) = 0
⟹ (z⁻¹ − 1)²(z⁻¹ − 1/2){3(2^k) + 2k} = 0

Then

z⁻¹ (z⁻¹ − 1)² (z⁻¹ − 1/2)(z⁻¹ + 1)(y[k]) = 0
⟹ (z⁻¹ − 1)² (z⁻¹ − 1/2)(z⁻¹ + 1)(y[k]) = 0

y[k] = C1 (−1)^k + C2 (2)^k + C3 k + C4

We know the homogeneous solution of y[k−1] + y[k−2] = 0 is y_h[k] = C1 (−1)^k.
Then the particular solution is y_p[k] = C2 (2)^k + C3 k + C4, and all that
remains is the determination of the unknown coefficients C2, C3 and C4.

Substituting y_p[k] into the difference equation we obtain

{C2 (1/2) + C2 (1/2)²}(2)^k + 2C3 k + (2C4 − 3C3) = 3(2)^k + 2k
⟹ C2 = 4, C3 = 1, C4 = 3/2

y[k] = C1 (−1)^k + 4(2)^k + k + 3/2
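As with the previous examples, the extracted particular part can be verified directly in MATLAB (an illustrative check, not part of the procedure itself):

% Check y_p[k] = 4*2^k + k + 3/2 in y[k-1] + y[k-2] = 3*2^k + 2k
k = 2:12;
yp = @(k) 4*2.^k + k + 3/2;
max(abs(yp(k-1) + yp(k-2) - (3*2.^k + 2*k)))   % identically zero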
Summary: some annihilating difference operators

  Sequence y[k]      Annihilator
  y[k] = C           (z⁻¹ − 1) y[k] = 0
  y[k] = k           (z⁻¹ − 1)² y[k] = 0
  y[k] = α^k         (z⁻¹ − 1/α) y[k] = 0
  y[k] = α^k k^m     (z⁻¹ − 1/α)^{m+1} y[k] = 0

Here we will exhibit a particular solution of a linear time-invariant difference
equation under no restriction whatsoever on x[k]. The same method also applies
successfully to difference equations with variable coefficients, provided we know
the general solution of the corresponding homogeneous equation.

❶ Consider the first-order linear time-varying difference equation

y[k] + a[k] y[k−1] = x[k]

Assume that y_h[k] is the homogeneous solution; then the particular solution is
sought in the form

y_p[k] = β[k] y_h[k]

Substituting into the difference equation we get

β[k] y_h[k] + a[k] β[k−1] y_h[k−1] = x[k]
⟺ β[k] y_h[k] + (β[k−1] y_h[k] − β[k−1] y_h[k]) + a[k] β[k−1] y_h[k−1] = x[k]
⟺ β[k] y_h[k] − β[k−1] y_h[k] + β[k−1](y_h[k] + a[k] y_h[k−1]) = x[k]   (the bracket is zero)
⟺ β[k] y_h[k] − β[k−1] y_h[k] = x[k]

β[k] − β[k−1] = x[k]/y_h[k]

and summing up from i = 1 to k, we find

β[k] = β[0] + Σ_{i=1}^{k} x[i]/y_h[i]

Neglecting the term β[0] y_h[k], which is a solution of the homogeneous equation,
we find

y_p[k] = (Σ_{i=1}^{k} x[i]/y_h[i]) y_h[k]

❷ Consider the second-order linear time-varying difference equation

y[k+2] + p[k] y[k+1] + q[k] y[k] = x[k]

Assume that y1[k], y2[k] are the homogeneous solutions; then the particular
solution is sought in the form

y_p[k] = β1[k] y1[k] + β2[k] y2[k]

Let us determine β1[k] and β2[k], and hence y_p[k].

 y_p[k+1] = β1[k+1] y1[k+1] + β2[k+1] y2[k+1]
  = β1[k] y1[k+1] + (β1[k+1] − β1[k]) y1[k+1] + β2[k] y2[k+1] + (β2[k+1] − β2[k]) y2[k+1]
  = β1[k] y1[k+1] + β2[k] y2[k+1] + Δβ1[k] y1[k+1] + Δβ2[k] y2[k+1]

The coefficients are chosen so that Δβ1[k] y1[k+1] + Δβ2[k] y2[k+1] = 0, hence

y_p[k+1] = β1[k] y1[k+1] + β2[k] y2[k+1]

 y_p[k+2] = β1[k+1] y1[k+2] + β2[k+1] y2[k+2]
  = β1[k] y1[k+2] + β2[k] y2[k+2] + Δβ1[k] y1[k+2] + Δβ2[k] y2[k+2]

The coefficients are chosen so that Δβ1[k] y1[k+2] + Δβ2[k] y2[k+2] = x[k], hence

y_p[k+2] = β1[k] y1[k+2] + β2[k] y2[k+2] + x[k]

Substituting into the difference equation we get

y_p[k+2] + p[k] y_p[k+1] + q[k] y_p[k]
  = β1[k] y1[k+2] + β2[k] y2[k+2] + x[k]
  + p[k] β1[k] y1[k+1] + p[k] β2[k] y2[k+1]
  + q[k] β1[k] y1[k] + q[k] β2[k] y2[k]
  = β1[k](y1[k+2] + p[k] y1[k+1] + q[k] y1[k])
  + β2[k](y2[k+2] + p[k] y2[k+1] + q[k] y2[k])
  + x[k]
  = x[k]

because y1[k] and y2[k] are solutions of the homogeneous equation corresponding
to the difference equation. Thus, the proof is complete. ∎

We will now find an explicit formula for the particular solution. The two
conditions

Δβ1[k] y1[k+1] + Δβ2[k] y2[k+1] = 0
Δβ1[k] y1[k+2] + Δβ2[k] y2[k+2] = x[k]

form the linear system

[ y1[k+1]  y2[k+1] ] [ Δβ1[k] ]   [ 0    ]
[ y1[k+2]  y2[k+2] ] [ Δβ2[k] ] = [ x[k] ]

Solving by Cramer's rule,

β1[k+1] − β1[k] = Δβ1[k] = det[ 0  y2[k+1] ; x[k]  y2[k+2] ] / det 𝕎(k+1)
                = −x[k] y2[k+1] / det 𝕎(k+1)

β2[k+1] − β2[k] = Δβ2[k] = det[ y1[k+1]  0 ; y1[k+2]  x[k] ] / det 𝕎(k+1)
                = x[k] y1[k+1] / det 𝕎(k+1)

where the Casoratian det 𝕎(k+1) = y1[k+1] y2[k+2] − y2[k+1] y1[k+2] is never
zero. Summing up from ℓ = 0 to k−1, we find

β1[k] = β1[0] − Σ_{ℓ=0}^{k−1} x[ℓ] y2[ℓ+1] / det 𝕎(ℓ+1)
β2[k] = β2[0] + Σ_{ℓ=0}^{k−1} x[ℓ] y1[ℓ+1] / det 𝕎(ℓ+1)

Neglecting the terms β1[0] y1[k] and β2[0] y2[k], which are solutions of the
homogeneous equation, we find

y_p[k] = −y1[k] Σ_{ℓ=0}^{k−1} x[ℓ] y2[ℓ+1]/det 𝕎(ℓ+1)
       + y2[k] Σ_{ℓ=0}^{k−1} x[ℓ] y1[ℓ+1]/det 𝕎(ℓ+1)
       = Σ_{ℓ=0}^{k−1} x[ℓ] · det[ y1[ℓ+1]  y2[ℓ+1] ; y1[k]  y2[k] ]
                           / det[ y1[ℓ+1]  y2[ℓ+1] ; y1[ℓ+2]  y2[ℓ+2] ]

This formula gives explicitly a particular solution of a second-order linear
non-homogeneous difference equation in terms of the nonhomogeneous term x[k] and
two linearly independent solutions of the corresponding homogeneous equation.

Example: Verify that the sequences y1[k] = 1 and y2[k] = k² are linearly
independent homogeneous solutions of the difference equation

(2k + 1) y[k+2] − 4(k + 1) y[k+1] + (2k + 3) y[k] = (2k + 3)

with y[0] = 0 and y[1] = 1/2. Find the general solution.

Solution: The reader can verify by direct substitution that the sequences are
solutions of the difference equation. The particular solution is

y_p[k] = Σ_{ℓ=0}^{k−1} (2ℓ+3) · det[ 1  (ℓ+1)² ; 1  k² ] / det[ 1  (ℓ+1)² ; 1  (ℓ+2)² ]
       = Σ_{ℓ=0}^{k−1} (2ℓ+3)(k² − (ℓ+1)²)/(2ℓ+3) = Σ_{ℓ=0}^{k−1} (k² − (ℓ+1)²)

y_p[k] = k³ − k(k+1)(2k+1)/6 = (4k³ − 3k² − k)/6

Hence, the general solution is

y[k] = y_h[k] + y_p[k] = C1 + C2 k² + (4k³ − 3k² − k)/6
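Since the initial conditions give C1 = 0 and C2 = 1/2, the solution can be cross-checked by marching the recurrence directly; the MATLAB sketch below (written for this example) does exactly that:

% (2k+1)y[k+2] = 4(k+1)y[k+1] - (2k+3)y[k] + (2k+3), y[0] = 0, y[1] = 1/2
K = 15; y = zeros(1,K+1); y(1) = 0; y(2) = 1/2;
for kk = 0:K-2
    y(kk+3) = (4*(kk+1)*y(kk+2) - (2*kk+3)*y(kk+1) + (2*kk+3)) / (2*kk+1);
end
k = 0:K;
yc = 0 + (1/2)*k.^2 + (4*k.^3 - 3*k.^2 - k)/6;  % C1 + C2*k^2 + y_p[k]
max(abs(y - yc))                 % agreement up to round-off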
Dynamic systems are systems whose states (variables) vary over time and where the
changes follow some rule whereby the future states depend on the past and the
present.

If the system is assumed to change in discrete time steps (hours, days, weeks,
months, quarters, years, decades, …), it can be modeled using difference
equations. If instead time progresses continuously, it can be modeled using
differential equations.

Difference equations appear as mathematical models describing realistic
situations in biology, electric circuits, digital control systems, business,
probability theory, Fibonacci sequences, and others.

First-order systems of difference equations can generally be written as

y[k−1] + Λ y[k] = f[k], where y[k] ∈ ℝⁿ, f[k] ∈ ℝⁿ and Λ ∈ ℝⁿˣⁿ

With the change A = −Λ we get y[k−1] = A y[k] + f[k]. Let us find the homogeneous
solution of this system (i.e. f[k] = 0):

y[k−1] = A y[k] ⟺ (z⁻¹ I − A) y[k] = 0 ⟹ y[k] = A⁻ᵏ C1, with C1 = y[0]

We can check this result by recurrence:

y[0] = A y[1]            y[1] = A⁻¹ y[0]
y[1] = A y[2]      ⟺    y[2] = A⁻¹ y[1] = A⁻² y[0]
⋮                        ⋮
y[k−1] = A y[k]          y[k] = A⁻ᵏ y[0]

Example: Verify that the homogeneous solution of y[k+1] = A y[k] + B u[k] is
given by

y[k] = Aᵏ y[0]

Solution: The reader can verify it by back substitution.

The solution of a first-order system of linear difference equations
y[k+1] = A y[k] + B u[k] can be expressed in terms of the eigenvalues of the
matrix A.

Case 01: Suppose that A has n distinct eigenvalues, and let (λi, xi) be an
eigenpair of A, that is A xi = λi xi. We now examine vectors of the form
yi[k] = (λi)^k xi:

(λi)^k A xi = (λi)^k λi xi ⟺ yi[k+1] = (λi)^{k+1} xi = λi yi[k] = A yi[k]

Thus any equation of the form y[k+1] = A y[k] has solutions yi[k] = (λi)^k xi,
which are called elementary homogeneous solutions. The general homogeneous
solution is a linear combination of them:

y[k] = α1 (λ1)^k x1 + α2 (λ2)^k x2 + ⋯ + αn (λn)^k xn


Example: Find the homogeneous solution of y[k+1] = [0 1; −2 −3] y[k].

Δ(λ) = λ² + 3λ + 2 ⟹ {λ1 = −1, x1 = (1, −1)ᵀ} and {λ2 = −2, x2 = (1, −2)ᵀ}

y[k] = (−1)^k {α1 (1, −1)ᵀ + α2 2^k (1, −2)ᵀ}

Repeat the same thing for the system y[k+1] = [1 3; 2 −1] y[k]:

Δ(λ) = λ² − 7 ⟹ {λ1 = √7, x1 = ((√7+1)/2, 1)ᵀ} and {λ2 = −√7, x2 = ((1−√7)/2, 1)ᵀ}

y[k] = α1 ((√7+1)/2, 1)ᵀ (√7)^k + α2 ((1−√7)/2, 1)ᵀ (−√7)^k
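In MATLAB the same construction is one call to eig; the sketch below (illustrative, with an arbitrary initial state) compares the eigen-expansion y[k] = X Λᵏ X⁻¹ y[0] against the direct matrix power:

% Homogeneous solution of y[k+1] = A*y[k] via the eigen-decomposition
A = [0 1; -2 -3];
[X, D] = eig(A);                 % eigenvectors in X, eigenvalues on diag(D)
y0 = [1; 0];                     % arbitrary initial state
alpha = X \ y0;                  % expansion coefficients of y[0]
k = 7;
y_eig = X * (diag(D).^k .* alpha);
y_pow = A^k * y0;                % direct matrix-power solution
max(abs(y_eig - y_pow))          % ~ 0 up to round-off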

Case 02: Suppose that A has repeated eigenvalues; for simplicity assume that only
one eigenvalue is repeated twice. Let us try the conjectured solution

yi[k] = k (λi)^k x1 + (λi)^k x2

where the vector x1 is an eigenvector of A corresponding to λi, and x2 is any
other vector such that (A − λi I) x2 = λi x1.

Now consider

A yi[k] = k (λi)^k A x1 + (λi)^k A x2 = k (λi)^k (λi x1) + (λi)^k {λi x1 + λi x2}
        = (k+1)(λi)^{k+1} x1 + (λi)^{k+1} x2
        = yi[k+1]

Then the general solution for a 2×2 matrix is

y[k] = β1 λ^k x1 + β2 (k λ^k x1 + λ^k x2), with (A − λI) x2 = λ x1

Example: Find the homogeneous solution of y[k+1] = [0 1; −4 4] y[k].

Δ(λ) = (λ − 2)² ⟹ λ = 2 and x1 = (1, 2)ᵀ

(A − λI) x2 = λ x1 ⟺ ([0 1; −4 4] − [2 0; 0 2]) x2 = 2(1, 2)ᵀ ⟹ x2 = (0, 2)ᵀ

y[k] = β1 λ^k x1 + β2 (k λ^k x1 + λ^k x2)
     = β1 2^k (1, 2)ᵀ + β2 (k 2^k (1, 2)ᵀ + 2^k (0, 2)ᵀ)
Case 03: Suppose that A has complex conjugate eigenvalues; for simplicity
consider only 2×2 matrices. Then the general solution has the same form as the
one presented for the case of distinct real eigenvalues. Of course, in this case
the eigenvalues come in conjugate pairs and the eigenvectors have complex
components.

Example: Find the homogeneous solution of y[k+1] = [3/5 2/5; −8/5 3/5] y[k].

{λ1 = 3/5 + (4/5)i, x1 = (1, 2i)ᵀ} and {λ2 = 3/5 − (4/5)i, x2 = (1, −2i)ᵀ}

y[k] = α1 (3/5 + (4/5)i)^k (1, 2i)ᵀ + α2 (3/5 − (4/5)i)^k (1, −2i)ᵀ

The solution of linear non-homogeneous systems of difference equations can become
very complex. In this section we concentrate on the solution methodology for a
special class, namely systems with a constant non-homogeneous term. (Recall that
we have presented the solution methodology for a single linear non-homogeneous
difference equation whose non-homogeneous term is a polynomial or an exponential
function.)

Consider the linear non-homogeneous system of


difference equations: 𝒚[𝑘 + 1] = 𝑨𝒚[𝑘] + 𝒃 where 𝒃 is a vector whose components are
constants.

The solution is of the form: 𝒚[𝑘] = 𝒚ℎ [𝑘] + 𝒚𝑝 [𝑘] where 𝒚ℎ [𝑘] is the solution of the
associated homogeneous system: 𝒚[𝑘 + 1] = 𝑨𝒚[𝑘] and 𝒚𝑝 [𝑘] is the particular solution.

We have seen that the solution of the associated homogeneous system is: 𝒚ℎ [𝑘] = 𝑨𝑘 𝒚[0]

Assuming that the matrix A has distinct (real) eigenvalues and eigenvectors, the
solution of the associated homogeneous system can be written as:

𝒚ℎ [𝑘] = 𝛼1 (𝜆1 )𝑘 𝐱1 + 𝛼2 (𝜆2 )𝑘 𝐱 2 + ⋯ + 𝛼𝑛 (𝜆𝑛 )𝑘 𝐱𝑛

What about the form of the particular solution 𝒚𝑝 [𝑘] ?

Since the linear system has a constant non-homogeneous term, 𝒃 we conjecture that:

𝒚𝑝 [𝑘] = 𝒅

Thus 𝒚[𝑘] = 𝒚ℎ [𝑘] + 𝒚𝑝 [𝑘] = 𝑨𝑘 𝒚[0] + 𝒅 Substituting into the non-homogeneous system
yields:
𝒚[𝑘 + 1] = 𝑨𝒚[𝑘] + 𝒃 ⟹ 𝑨𝑘+1 𝒚[0] + 𝒅 = 𝑨(𝑨𝑘 𝒚[0] + 𝒅) + 𝒃
⟹ 𝑨𝑘+1 𝒚[0] + 𝒅 = 𝑨𝑘+1 𝒚[0] + 𝑨𝒅 + 𝒃
⟹ 𝒅 = 𝑨𝒅 + 𝒃
⟹ (𝑰 − 𝑨)𝒅 = 𝒃

If the matrix (𝑰 − 𝑨) is invertible, then: 𝒅 = (𝑰 − 𝑨)−1 𝒃


𝒚[𝑘] = 𝒚ℎ [𝑘] + 𝒚𝑝 [𝑘] = 𝑨𝑘 𝒚[0] + (𝑰 − 𝑨)−1 𝒃

Thus, if the matrix A has real eigenvalues of multiplicity one and distinct
corresponding eigenvectors, then the solution of the linear non-homogeneous
system of difference equations y[k+1] = A y[k] + b is:

𝒚[𝑘] = 𝛼1 (𝜆1 )𝑘 𝐱1 + 𝛼2 (𝜆2 )𝑘 𝐱 2 + ⋯ + 𝛼𝑛 (𝜆𝑛 )𝑘 𝐱𝑛 + (𝑰 − 𝑨)−1 𝒃

Example: Find the general solution of the linear non-homogeneous system:

y[k+1] = [2 4; 4 2] y[k] + (1, −1)ᵀ

Solution: First we solve the associated homogeneous system. We need to find the
eigenvalues and eigenvectors of the coefficient matrix:

{λ1 = 6, x1 = (1, 1)ᵀ} and {λ2 = −2, x2 = (−1, 1)ᵀ}

From the theory, we know that the particular solution is:

y_p[k] = d = (I − A)⁻¹ b = (1/3, −1/3)ᵀ

Therefore the general solution is:

y[k] = α1 6^k (1, 1)ᵀ + α2 (−2)^k (−1, 1)ᵀ + (1/3, −1/3)ᵀ
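The constant particular solution is simply a fixed point of the iteration, which MATLAB confirms in two lines (a minimal check for this example):

% Fixed-point check of the particular solution d = (I - A)\b
A = [2 4; 4 2]; b = [1; -1];
d = (eye(2) - A) \ b             % expect d = [1/3; -1/3]
max(abs(d - (A*d + b)))          % d satisfies d = A*d + b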
Note: If λ = 1 is an eigenvalue of the matrix A, then it satisfies the
characteristic equation |λI − A| = 0, so the matrix (I − A) is not invertible and
a new form of the (constant) particular solution y_p[k] = d must be found. The
following example demonstrates the methodology for obtaining the particular
solution when λ = 1 is an eigenvalue of A.

Example: Find the general solution of the linear non-homogeneous system:

y[k+1] = [6 5; −4 −3] y[k] + (1, 1)ᵀ

Solution: First we solve the associated homogeneous system. We need to find the
eigenvalues and eigenvectors of the coefficient matrix:

{λ1 = 1, x1 = (−1, 1)ᵀ} and {λ2 = 2, x2 = (−5/4, 1)ᵀ}

The matrix has eigenvalues of multiplicity one and distinct eigenvectors. Thus
the solution of the associated homogeneous system is:

y_h[k] = α1 (−1, 1)ᵀ + α2 2^k (−5/4, 1)ᵀ

Observe that λ = 1 is an eigenvalue of A, so (I − A) is not invertible, and we
must conjecture a different particular solution. By (tedious) trial and error, we
discover that one type of particular solution that "looks promising" is:

y_p[k] = d + d3 k x1 = (d1, d2)ᵀ + d3 k x1

where d3 is a constant and x1 is the eigenvector corresponding to λ = 1.

Now we need to find the values of d1, d2 and d3. We substitute the conjectured
particular solution into the non-homogeneous system:

d + d3 (k+1) x1 = A(d + d3 k x1) + b

Since x1 is the eigenvector corresponding to λ = 1, we have A x1 = x1, and the
above relation simplifies to:

d − A d = b − d3 x1 ⟺ (I − A) d = b − d3 x1

Substituting the values of b and A, this can be written as:

[−5 −5; 4 4](d1, d2)ᵀ = (1, 1)ᵀ − d3 (−1, 1)ᵀ
⟺ −5d1 − 5d2 − d3 = 1 and 4d1 + 4d2 + d3 = 1
⟺ [−5 −5 −1; 4 4 1](d1, d2, d3)ᵀ = (1, 1)ᵀ

One solution of this underdetermined system is (d1, d2, d3) = (−1, −1, 9).

Therefore the general solution is:
y[k] = α1 (−1, 1)ᵀ + α2 2^k (−5/4, 1)ᵀ + (−1, −1)ᵀ + 9k (−1, 1)ᵀ
1 1 −1 1
Consider the linear non-homogeneous system of difference equations
y[k+1] = A y[k] + B u[k], where B u[k] is a vector whose components are
time-varying. Assume that we are given y[0] = c, that u[k] is some known vector
function, and that y[k] ∈ ℝⁿ, u[k] ∈ ℝᵐ, B ∈ ℝⁿˣᵐ and A ∈ ℝⁿˣⁿ.

The general solution can be determined using recursion:

y[1] = A y[0] + B u[0]
y[2] = A y[1] + B u[1] = A² y[0] + A B u[0] + B u[1]
y[3] = A y[2] + B u[2] = A³ y[0] + A² B u[0] + A B u[1] + B u[2]
⋮
y[k] = Aᵏ y[0] + Σ_{i=1}^{k} A^{i−1} B u[k−i]

where y_h[k] = Aᵏ y[0] and y_p[k] = Σ_{i=1}^{k} A^{i−1} B u[k−i].

Note: The static case is a special instance of this last formula: one only has to
assume B u[k−i] = b, so that y_p[k] = Σ_{i=1}^{k} A^{i−1} b
= (I + A + A² + ⋯ + A^{k−1}) b. On the other hand, we know from the Neumann
(geometric) series that d = (I − A)⁻¹ b = (I + A + A² + ⋯ + A^{k−1} + Aᵏ + ⋯) b,
but the vectors Aⁱ b with i ≥ k are homogeneous solutions; hence only
Σ_{i=1}^{k} A^{i−1} b constitutes the particular solution.
Example: Find the solution of the following difference equation:

y[k−2] + 3y[k−1] + 2y[k] = 0

Solution: With the change of variables x1[k] = y[k] and x2[k] = y[k−1] we get

x1[k−1] = x2[k]
x2[k−1] = −2x1[k] − 3x2[k]
⟺ (x1[k−1], x2[k−1])ᵀ = [0 1; −2 −3] (x1[k], x2[k])ᵀ

Let x[k] = (x1[k], x2[k])ᵀ; we obtain x[k−1] = [0 1; −2 −3] x[k]. We have now
reduced the problem to the solution of a first-order system rather than a
second-order difference equation.

x[k] = [0 1; −2 −3]⁻¹ x[k−1] = [−3/2 −1/2; 1 0] x[k−1]

x[k] = [−3/2 −1/2; 1 0]^k x[0] = α1 (−1)^k (1, −1)ᵀ + α2 (−1/2)^k (1, −2)ᵀ

So the homogeneous solution is y[k] = x1[k] = α1 (−1)^k + α2 (−0.5)^k
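The reduction to a first-order system is easy to test numerically; the sketch below (with an arbitrary initial state, written for this example) simulates x[k] = M x[k−1] and compares y[k] = x1[k] with the closed form:

% Simulate x[k] = M*x[k-1], M = inv([0 1; -2 -3]), and compare y[k] = x1[k]
M = inv([0 1; -2 -3]);           % M = [-3/2 -1/2; 1 0]
x = [1; 1];                      % arbitrary x[0] = (y[0]; y[-1])
K = 12; y = zeros(1,K+1); y(1) = x(1);
for m = 1:K
    x = M*x; y(m+1) = x(1);
end
% recover alpha1, alpha2 from y[0], y[1]:  y[k] = a1*(-1)^k + a2*(-1/2)^k
a = [1 1; -1 -1/2] \ [y(1); y(2)];
k = 0:K;
max(abs(y - (a(1)*(-1).^k + a(2)*(-1/2).^k)))   % ~ 0 up to round-off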

Theorem: (Perron's Theorem & Poincaré's Theorem) Assume that the difference
equation Σ_{i=0}^{n} a_i x[k+i] = 0 has a fundamental set of solutions
{x1[k], x2[k], …, xn[k]} with the property xi[k] = λi^k. Then the characteristic
root λi is given by

λi = xi[k+1]/xi[k], i = 1, 2, …, n

Moreover, if |λn| > |λi| for all i ≠ n, then for x[k] = Σ_{i=1}^{n} βi xi[k]
(with βn ≠ 0),

λn = lim_{k→∞} x[k+1]/x[k]

This last statement is known as Bernoulli's identity, which will be discussed
later.

The simplest operator that can be applied to a sequence is the identity operator
I, which leaves a sequence unchanged: I(u[k]) = u[k]. Almost as simple is the
(forward) shift operator z, which takes each element of a sequence into its
successor: z u[k] = u[k+1] for all relevant k. We will also encounter the
averaging operator A, defined by A(u[k]) = (u[k] + u[k+1])/2. Of more interest to
our present work is the forward difference operator Δ, defined by
Δu[k] = u[k+1] − u[k].

In situations in which the sequence {u_k} arises from a function u defined on a
set of grid points {x_k}, the forward difference operator may also be expressed
as

Δu[k] = u(x_k + h) − u(x_k), h = Δx


Our first correspondence with the infinitesimal calculus may be made by noting
that the difference quotient

(1/h) Δu[k] = (u(x_k + h) − u(x_k))/h

is an approximation to u′(x_k), the derivative du/dx at x_k. This approximation
improves as h decreases.

The second-order difference operator is Δ²u_k = Δ(Δu_k), whose action on the
sequence u_k is the twice-repeated application of the first-order operator Δ, in
exact analogy with the second-order differential operator d²/dx² acting on a
continuous function f(x) as d²f/dx²:

Δ²u_k = Δ(Δu_k) = (u_{k+2} − u_{k+1}) − (u_{k+1} − u_k) = u_{k+2} − 2u_{k+1} + u_k

Higher-order difference operators Δⁿ can be defined in the same manner, and it is
easily seen that this operation is commutative, i.e. Δᵐ(Δⁿ u_k) = Δⁿ(Δᵐ u_k).

Basic Properties of the Forward Difference Operator Δ:

⦁ Δ(α u_k) = α Δ(u_k)
⦁ Δ(u_k + v_k) = Δ(u_k) + Δ(v_k)
⦁ Δ(u_k v_k) = u_{k+1} Δv_k + v_k Δu_k
⦁ Δ(u_k / v_k) = (v_k Δu_k − u_k Δv_k)/(v_k v_{k+1})

Proof: Let us prove only the nontrivial ones.

∆(𝑢𝑘 𝑣𝑘 ) = 𝑢𝑘+1 𝑣𝑘+1 − 𝑢𝑘 𝑣𝑘


= 𝑢𝑘+1 (𝑣𝑘+1 − 𝑣𝑘 ) + 𝑢𝑘+1 𝑣𝑘 − 𝑢𝑘 𝑣𝑘
= 𝑢𝑘+1 (𝑣𝑘+1 − 𝑣𝑘 ) + 𝑣𝑘 (𝑢𝑘+1 − 𝑢𝑘 )
= 𝑢𝑘+1 ∆𝑣𝑘 + 𝑣𝑘 ∆𝑢𝑘

which can also be written in terms of the difference operator ∆ (on 𝑣𝑘 and 𝑢𝑘 only) as

∆(𝑢𝑘 𝑣𝑘 ) = 𝑢𝑘+1 ∆𝑣𝑘 + 𝑣𝑘 ∆𝑢𝑘


= (∆𝑢𝑘 + 𝑢𝑘 )∆𝑣𝑘 + 𝑣𝑘 ∆𝑢𝑘
= ∆𝑢𝑘 ∆𝑣𝑘 + 𝑢𝑘 ∆𝑣𝑘 + 𝑣𝑘 ∆𝑢𝑘

In analogy with the derivative of the quotient f(x)/g(x), g(x) ≠ 0, of two
differentiable functions, we can prove

Δ(u_k / v_k) = u_{k+1}/v_{k+1} − u_k/v_k = (u_{k+1} v_k − u_k v_{k+1})/(v_k v_{k+1})
             = ((u_{k+1} v_k − u_k v_k) + (u_k v_k − u_k v_{k+1}))/(v_k v_{k+1})
             = ((u_{k+1} − u_k) v_k − (v_{k+1} − v_k) u_k)/(v_k v_{k+1})
             = (v_k Δu_k − u_k Δv_k)/(v_k v_{k+1})
The Backward and Central Difference Operators ∇, δ: It is also possible to define
a backward difference operator as follows:

∇u[k] = u[k] − u[k−1], or ∇u[k] = u(x_k) − u(x_k − h)

Again we may note that the quotient

(1/h) ∇u[k] = (u(x_k) − u(x_k − h))/h

is an approximation to u′(x_k) which also improves as h becomes small. The
asymmetry of the forward and backward difference operators Δ and ∇ may be
eliminated by defining a central difference operator
δu_k = u(x_k + h/2) − u(x_k − h/2). The quotient

(1/h) δu_k = (u(x_k + h/2) − u(x_k − h/2))/h

is yet another approximation, which converges to u′(x_k) as h → 0. On the integer
grid,

δ²u_k = δ(δu_k) = u_{k+1} − 2u_k + u_{k−1}

Example: Verify the following identities

⦁ 𝑧(Δ) = Δ(𝑧)
⦁ Δ2 = 𝑧 2 − 2𝑧 + 1
⦁ 𝛿 2 = Δ∇
The Inverse Difference Operator as a Sum Operator: We shall briefly discuss Δ⁻¹,
the inverse of the difference operator Δ, defined by Δ⁻¹Δ(u_k) = u_k, which, as
may be expected, will amount to summation as opposed to the differencing of Δ.
Let Δ⁻¹ u_k = y_k; we shall show that y_k, aside from an arbitrary constant y1,
is a finite summation of the original sequence u_k, as we develop next.

u_k = Δ(Δ⁻¹ u_k) = Δ(y_k) = y_{k+1} − y_k

Now, if we write this equation for u1, u2, …, u_k,

u1 = Δ(Δ⁻¹ u1) = y2 − y1
u2 = Δ(Δ⁻¹ u2) = y3 − y2
⋮
u_k = Δ(Δ⁻¹ u_k) = y_{k+1} − y_k

and add, the left-hand sides sum to u1 + u2 + ⋯ + u_k, while on the right-hand
side the interior terms telescope, leaving y_{k+1} − y1:

u1 + u2 + ⋯ + u_k = y_{k+1} − y1 ⟹ u1 + u2 + ⋯ + u_{k−1} = y_k − y1
⟹ y_k = Δ⁻¹ u_k = y1 + Σ_{i=1}^{k−1} u_i

as the desired finite summation operation of Δ⁻¹. With this result we can
interpret the action of the inverse difference operator Δ⁻¹ on the sequence u_k,
namely y_k = Δ⁻¹ u_k, as the (k−1)-th partial sum of the sequence u_i plus an
arbitrary constant y1.
The latter may remind us of the spirit of the indefinite integration of function of
continuous variable f(x), along with its arbitrary constant of integration

∫ 𝑓(𝑥)𝑑𝑥 = 𝐹(𝑥) + 𝐶

where 𝐹(𝑥) is the antiderivative of f(x), i.e. dF(x)/dx = f(x). Often Δ−1 𝑢𝑘 is called the anti-
difference or sum operator, and for obvious reasons!

⦁ Δ⁻¹(1) = k
⦁ Δ⁻¹(α1 u_k + α2 v_k) = α1 Δ⁻¹(u_k) + α2 Δ⁻¹(v_k)
⦁ Δ⁻¹(a^k) = a^k/(a − 1)

Exercise: The reader is asked to prove the following (assuming a ≠ 1 and
e^a ≠ 1):

⦁ Δ⁻¹ k = k(k−1)/2 + C
⦁ Δ⁻¹ k² = k(k−1)(2k−1)/6 + C
⦁ Δ⁻¹ k³ = (k−1)² k²/4 + C
⦁ Δ⁻¹ (−1)^k = (−1)^{k+1}/2 + C
⦁ Δ⁻¹ (k a^k) = (a^k/(a−1))(k − a/(a−1)) + C
⦁ Δ⁻¹ e^{ak} = e^{ak}/(e^a − 1) + C
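Any of these formulas can be checked numerically by applying the forward difference to the candidate anti-difference; for instance (a minimal MATLAB check of the first item):

% Check Delta^{-1} k = k(k-1)/2 + C: the forward difference of y must return k
k = 0:10;
y = k.*(k-1)/2;                  % candidate anti-difference with C = 0
max(abs(diff(y) - k(1:end-1)))   % diff(y) = Delta y, evaluated at k = 0..9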

Summation by Parts: We will now present a few of the essential elements of the
difference and summation calculus which show a striking correspondence to familiar
properties of the differential and integral calculus.

The important operation of summation by parts may now be illustrated by returning
to the formula for the difference of a product. Summing the identity
Δ(u_k v_k) = u_{k+1} Δv_k + v_k Δu_k between the indices k = 0 and k = N−1, we
have

Σ_{k=0}^{N−1} Δ(u_k v_k) = Σ_{k=0}^{N−1} (u_{k+1} Δv_k + v_k Δu_k)

Expanding the left-hand side and summing term by term, we see that all of the
interior terms of the sum cancel and only two "boundary terms" survive. This
leaves

u_N v_N − u_0 v_0 = Σ_{k=0}^{N−1} (u_{k+1} Δv_k + v_k Δu_k)

Rearranging this expression gives the summation by parts formula

Σ_{k=0}^{N−1} v_k Δu_k = [u_k v_k]_0^N − Σ_{k=0}^{N−1} u_{k+1} Δv_k
We now begin an investigation of the difference equations which can be generated from
the few difference operators which we have already seen. A difference equation is simply
a prescribed relationship between the elements of a sequence. Having posed a
difference equation, the usual aim is to solve it, which means finding a sequence that
satisfies the relationship. It is best to proceed by example and we may also proceed by
analogy to differential equations. Indeed, much of the terminology used for differential
equations carries over directly to difference equations. An analog of the first order,
constant coefficient, non-homogeneous differential equation

u′(x) + a u(x) = f(x) ⟺ (1/h) Δu_k + a u_k = f_k ⟺ (1/h) u_{k+1} + (a − 1/h) u_k = f_k

The differential equation is thus equivalent to u_{k+1} = h f_k + (1 − ha) u_k.
Let us solve this difference equation recursively by setting β = (1 − ha) and
r_k = h f_k, so we obtain

u_{k+1} = r_k + β u_k

Now, writing this equation for i = 1, 2, …, k we get

u1 = r0 + β u0
u2 = r1 + β(r0 + β u0) = β² u0 + r1 + β r0
u3 = r2 + β(β² u0 + r1 + β r0) = β³ u0 + r2 + β r1 + β² r0
⋮
u_k = β^k u0 + (Σ_{i=1}^{k−1} r_{i−1} β^{k−i}) + r_{k−1}

In analogy with the second-order, constant-coefficient, nonhomogeneous
differential equation:

u″(x) + a u′(x) + b u(x) = f(x) ⟺ (1/h²) Δ²u_k + (a/h) Δu_k + b u_k = f_k

Or equivalently, (u_{k+2} − 2u_{k+1} + u_k) + ah(u_{k+1} − u_k) + bh² u_k = h² f_k,
which can be written as

u_{k+2} + (ah − 2) u_{k+1} + (bh² − ah + 1) u_k = h² f_k

The characteristic equation of this difference equation is given by
z² + (ah − 2)z + (bh² − ah + 1) = 0.

We can solve this difference equation as we have seen before.
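For the homogeneous first-order case (f = 0) the recursion u_{k+1} = (1 − ha) u_k can be compared against the exact solution e^{−ax}; the MATLAB sketch below (with arbitrary a and h) shows the O(h) behavior of this forward-difference scheme:

% Forward-difference discretization of u'(x) + a*u(x) = 0, u(0) = 1
a = 2; h = 0.01; K = 500;
u = zeros(1,K+1); u(1) = 1;
for m = 1:K
    u(m+1) = (1 - h*a)*u(m);     % one step of the recurrence
end
x = (0:K)*h;
max(abs(u - exp(-a*x)))          % O(h) discretization error; shrinks with h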

In mathematics and signal processing, the Z-transform converts a discrete-time
signal, which is a sequence of real or complex numbers, into a complex
frequency-domain representation. It can be considered a discrete-time equivalent
of the Laplace transform. This similarity is explored in the theory of time-scale
calculus.

The Z-transform gives a tractable way to solve linear, constant-coefficient
difference equations. The idea contained within the Z-transform is also known in
the mathematical literature as the method of generating functions. From a
mathematical point of view the Z-transform can also be viewed as a Laurent
series, where one views the sequence of numbers under consideration as the
(Laurent) expansion of an analytic function. The Z-transform can be defined as
either a one-sided or two-sided transform.

It is well known that computers are increasingly being integrated into physical
systems. For the computer, time is not continuous; it passes in discrete
intervals. So whenever a computer is being used, it is important to understand
the ramifications of the inherently discrete nature of time. As with the Laplace
transform, we will assume that the functions of interest are equal to zero for
time less than zero.

In this diagram, x(t) represents a continuous-time signal that is sampled every T
seconds; the resulting signal is called x*(t) = x(kT). This represents a
continuous-time signal that is measured by a computer every T seconds, resulting
in a sampled signal.

There are a number of ways to represent the sampling process mathematically. One
that is commonly used is to immediately represent the sampled signal by a series
x[n]. This technique has the advantage of being very simple to understand, but it
hides the connection of the sampled signal to the Laplace transform; we therefore
model the sampled signal as an impulse train:

x*(t) = Σ_{k=0}^{∞} x(kT) δ(t − kT)

Since we now have a time-domain signal, we wish to see what kind of analysis we
can do in a transformed domain. Let us start by taking the Laplace transform of
the sampled signal:

X*(s) = ℒ{x*(t)} = ℒ{Σ_{k=0}^{∞} x[k] δ(t − kT)}

Since each x[k] is a constant, we can (because of linearity) bring the Laplace
transform inside the summation:

X*(s) = Σ_{k=0}^{∞} x[k] ℒ{δ(t − kT)} = Σ_{k=0}^{∞} x[k] e^{−skT}

To simplify the expression a little, we introduce the notation z = e^{sT}, so
that e^{−skT} = z^{−k}. We will call the result the Z-transform and define it as

𝒵(x[k]) = X(z) = Σ_{k=0}^{∞} x[k] z^{−k}
Definition: The Z-transform of a sequence {x[k]} is denoted by X(z) and is
calculated using the formula X(z) = Σ_{k=0}^{∞} x[k] z^{−k}, where z is chosen
such that Σ_{k=0}^{∞} |x[k] z^{−k}| < ∞, which is the convergence condition.

We will also use the following notation to represent a sequence {x[k]} and its
Z-transform X(z): x[k] ⟷ X(z)

Alternatively, in cases where x[k] is defined from −∞ to +∞, the two-sided
transform is defined as X(z) = Σ_{k=−∞}^{+∞} x[k] z^{−k}.

Region of convergence: The region of convergence (ROC) is the set of points in
the complex plane for which the Z-transform summation converges:

ROC = { z : Σ_{k=−∞}^{+∞} |x[k] z^{−k}| < ∞ }

Example: Let us now consider a signal that is a right-sided real exponential,
f[k] = α^k u[k]:

F(z) = Σ_{k=−∞}^{+∞} α^k u[k] z^{−k} = Σ_{k=0}^{∞} (α/z)^k = z/(z − α), |z| > |α|

F(z) converges only if |z| > |α|.

Example: Let us now consider the left-sided real exponential
f[k] = −α^k u[−k − 1]:

F(z) = Σ_{k=−∞}^{+∞} −α^k u[−k−1] z^{−k} = −Σ_{k=−∞}^{−1} (α/z)^k
     = −Σ_{k=1}^{∞} (z/α)^k = −(z/α)/(1 − z/α) = z/(z − α)

but this F(z) converges only if |z| < |α|.

Remark: If you are primarily interested in applications to one-sided signals, then attention is
restricted to causal signals (i.e., signals with zero values for negative time) and the
one-sided Z-transform.

The z-transform is
an important tool in the analysis and design of discrete-time systems. It simplifies the
solution of discrete-time problems by converting LTI difference equations to algebraic
equations and convolution to multiplication. Thus, it plays a role similar to that served
by Laplace transforms in continuous-time problems. Now let us give some commonly
used discrete-time signals, such as the sampled step, the exponential, and the discrete-time
impulse.
❶ The Unit Impulse Function: In discrete time systems the unit impulse is defined
somewhat differently than in continuous time systems.

$$\delta[k] = \begin{cases} 1 & k = 0 \\ 0 & k \neq 0 \end{cases} = u[k] - u[k-1]$$

where 𝑢[𝑘] is the step function defined by:
$$u[k] = \begin{cases} 1 & k \geq 0 \\ 0 & k < 0 \end{cases}$$

$$X(z) = \sum_{k=0}^{\infty} \delta[k]\,z^{-k} = \delta[0]\,z^{0} + \underbrace{\sum_{k=1}^{\infty} \delta[k]\,z^{-k}}_{\text{zero}} = 1$$

❷ The Unit Step Function: The unit step is one when 𝑘 is zero or positive.

$$X(z) = \sum_{k=0}^{\infty} u[k]\,z^{-k} = \sum_{k=0}^{\infty} z^{-k} = \frac{1}{1-z^{-1}} = \frac{z}{z-1}$$

which is only valid for $|z| > 1$. This implies that the z-transform expression we obtain has a region of convergence, outside of which it is not valid.
❸ The Unit Ramp Function: Consider the ramp sequence
$$f[k] = k\,u[k] \;\Longrightarrow\; F(z) = \sum_{k=0}^{\infty} k\,z^{-k} = -z\,\frac{d}{dz}\left(\sum_{k=0}^{\infty} z^{-k}\right) = \frac{z}{(z-1)^2}$$

❹ The Exponential Function: Consider the exponential function
$$f[k] = e^{-\alpha k T} u[k] \;\Longrightarrow\; F(z) = \sum_{k=0}^{\infty} e^{-\alpha k T} u[k]\,z^{-k} = \frac{1}{1 - e^{-\alpha T} z^{-1}} = \frac{z}{z - e^{-\alpha T}}, \qquad |z| > e^{-\alpha T}$$

❺ The Discrete Exponential Function: With the Z-Transform it is more common to get solutions in the form of a power series:
$$f[k] = a^k u[k] \;\Longrightarrow\; F(z) = \sum_{k=0}^{\infty} a^k u[k]\,z^{-k} = \frac{1}{1 - a z^{-1}} = \frac{z}{z-a}, \qquad |z| > |a|$$

Exercise: Find the Z-Transform of the signal $f[k] = \cos(ak)\,u[k]$. Solution:
$$F(z) = \sum_{k=0}^{\infty} \cos(ak)\,u[k]\,z^{-k} = \sum_{k=0}^{\infty} \left(\frac{e^{iak}+e^{-iak}}{2}\right) u[k]\,z^{-k} = \frac{1}{2}\left\{\sum_{k=0}^{\infty} e^{iak} u[k]\,z^{-k} + \sum_{k=0}^{\infty} e^{-iak} u[k]\,z^{-k}\right\}$$
$$\mathcal{Z}(f[k]) = F(z) = \frac{1}{2}\left\{\frac{z}{z-e^{ia}} + \frac{z}{z-e^{-ia}}\right\} = \frac{z\big(z-\cos(a)\big)}{z^2 - 2z\cos(a) + 1}$$

Also it can be checked that if $f[k] = \sin(ak)\,u[k]$ then
$$F(z) = \frac{z\,\sin(a)}{z^2 - 2z\cos(a) + 1}$$

Remarks: For signals such as those above, the Z-Transform $X(z)$ is a rational function of the complex variable $z$, and the same function can also be expressed in terms of $z^{-1}$:
$$X(z) = \frac{n(z)}{d(z)} = \frac{z(z+0.5)}{(z+1)(z+2)} = \frac{1 + 0.5 z^{-1}}{(1+z^{-1})(1+2z^{-1})}$$

Properties of the Region of convergence: Let $X(z) = n(z)/d(z)$ be a rational function. If we define the roots of the numerator polynomial $n(z)$ as zeros of $X(z)$ and the roots of the denominator $d(z)$ as poles of $X(z)$, then the region of convergence has a number of properties that depend on the characteristics of the signal $x[k]$.

▪ The ROC cannot contain any poles.
▪ If 𝑥[𝑘] is a finite-duration sequence, then the ROC is the entire z-plane, except possibly
𝑧 = 0 and/or 𝑧 = ∞.
▪ If 𝑥[𝑘] is a right-sided sequence, then the ROC extends outward from the outermost
pole in 𝑋(𝑧). And the ROC is outside of a given circle.
▪ If 𝑥[𝑘] is a left-sided sequence, then the ROC extends inward from the innermost pole
in 𝑋(𝑧). And the ROC is inside of a given circle.
▪ If 𝑥[𝑘] is a two-sided sequence, the ROC will be a ring in the z-plane that is bounded
on the interior and exterior by a pole

As we found with the Laplace Transform, it will often be easier to work with the Z
Transform if we develop some properties of the
transform itself. The z-transform can be derived from the Laplace transform. Hence, it
transform itself. The z-transform can be derived from the Laplace transform. Hence, it
shares several useful properties with the Laplace transform, which can be stated
without proof. These properties can also be easily proved directly, and the proofs are
left as an exercise for the reader. Proofs are provided for properties that do not
obviously follow from the Laplace transform.

❶ Linearity: As with the Laplace Transform, the Z Transform is linear.

Given 𝑥[𝑘] = 𝑎. 𝑓[𝑘] + 𝑏. 𝑔[𝑘], the following property holds: 𝑋(𝑧) = 𝑎. 𝐹(𝑧) + 𝑏. 𝐺(𝑧)

Example: Find the z-transform of the causal sequence $x[k] = 2u[k] + 4\delta[k]$, $k = 0, 1, \ldots$
$$X(z) = \sum_{k=0}^{\infty} \big(2u[k] + 4\delta[k]\big) z^{-k} = \sum_{k=0}^{\infty} 2u[k]\,z^{-k} + \sum_{k=0}^{\infty} 4\delta[k]\,z^{-k} = \frac{2z}{z-1} + 4 = \frac{6z-4}{z-1}$$
❷ Time Shift (Delay): An important property of the Z Transform is the time shift. To
see why this might be important consider that a discrete-time approximation to a
derivative is given by:

$$\left.\frac{df(t)}{dt}\right|_{t=kT} \approx \frac{f\big((k+1)T\big) - f(kT)}{T} = \frac{f[k+1]-f[k]}{T}$$

Let's examine what effect such a shift has upon the Z Transform. Assume that the
sequence 𝑥[𝑘] has a Z-Transform 𝑋(𝑧), and we want to get the Z-Transform of 𝑥[𝑘 − 𝑛]

$$\mathcal{Z}(x[k-n]) = \sum_{k=0}^{\infty} x[k-n]\,z^{-k} = \sum_{m=-n}^{\infty} x[m]\,z^{-m-n} = z^{-n}\sum_{m=0}^{\infty} x[m]\,z^{-m} = z^{-n} X(z)$$
(using causality, $x[m] = 0$ for $m < 0$).

Example: Find the z-transform of the causal sequence

$$x[k] = \begin{cases} 4, & k = 2, 3, \ldots \\ 0, & \text{otherwise} \end{cases}$$

The given sequence is a step function starting at 𝑘 = 2 rather than 𝑘 = 0 (i.e., it is
delayed by two sampling periods). Using the delay property, we have
$$X(z) = 4\sum_{k=0}^{\infty} u[k-2]\,z^{-k} = 4z^{-2}\sum_{m=0}^{\infty} u[m]\,z^{-m} = 4z^{-2}\,\frac{z}{z-1} = \frac{4}{z(z-1)}$$

❸ Time Advance: Let's explore it in the same way as we did the shift to the
right. Consider the same sequence, 𝑥[𝑘], as before. This time we shift it to the left by 𝑛
samples to get 𝑥[𝑘 + 𝑛].
$$X_{\text{new}}(z) = \sum_{k=0}^{\infty} x[k+n]\,z^{-k} = z^{n} X(z) - \sum_{\ell=0}^{n-1} x[\ell]\,z^{n-\ell}$$
Proof:
$$X_{\text{new}}(z) = \sum_{k=0}^{\infty} x[k+n]\,z^{-k} = \sum_{\ell=n}^{\infty} x[\ell]\,z^{n-\ell} = z^{n}\sum_{\ell=n}^{\infty} x[\ell]\,z^{-\ell}$$
$$= z^{n}\left(\sum_{\ell=0}^{\infty} x[\ell]\,z^{-\ell} - \sum_{\ell=0}^{n-1} x[\ell]\,z^{-\ell}\right) = z^{n} X(z) - \sum_{\ell=0}^{n-1} x[\ell]\,z^{n-\ell}$$

❹ Multiplication by $z_0^k$ in the Time Domain: this property is also known as Z-domain
scaling, and it says that if $x[k] \leftrightarrow X(z)$ with ROC $= R$, then
$$z_0^k\,x[k] \;\leftrightarrow\; \sum_{k=0}^{\infty} z_0^k\,x[k]\,z^{-k} = \sum_{k=0}^{\infty} x[k]\left(\frac{z}{z_0}\right)^{-k} = X\!\left(\frac{z}{z_0}\right) \quad\text{with ROC} = |z_0|\,R$$

❺ Time Scaling: This property deals with the effect on the frequency-domain
representation of a signal if the time variable is altered. The most important concept to
understand for the time scaling property is that signals that are narrow in time will be
broad in frequency and vice versa.
$$x[k/n] \;\leftrightarrow\; \sum_{k=0}^{\infty} x[k/n]\,z^{-k} = \sum_{m=0}^{\infty} x[m]\,z^{-mn} = \sum_{m=0}^{\infty} x[m]\,(z^{n})^{-m} = X(z^{n})$$

❻ Differentiation and Integration in the z-Domain: if $x[k] \leftrightarrow X(z)$ with ROC $= R$, then
$$k\,x[k] \;\leftrightarrow\; \sum_{k=0}^{\infty} k\,x[k]\,z^{-k} = -z\,\frac{d}{dz}\sum_{k=0}^{\infty} x[k]\,z^{-k} = -z\,\frac{d}{dz}X(z) \quad\text{with ROC} = R$$
$$k^{-1} x[k] \;\leftrightarrow\; \sum_{k=1}^{\infty} k^{-1} x[k]\,z^{-k} = -\int \frac{X(z)}{z}\,dz \quad\text{with ROC} = R$$

Proof: The derivative rule is straightforward; let us prove only the integration rule (the $k = 0$ term must vanish for $k^{-1}x[k]$ to be defined):
$$\int \frac{X(z)}{z}\,dz = \int \left(\sum_{k=0}^{\infty} x[k]\,z^{-k-1}\right) dz = \sum_{k=1}^{\infty} -k^{-1} x[k]\,z^{-k} = -\mathcal{Z}\big(k^{-1}x[k]\big)$$

Example: Determine the Z-transform of $x_1[k] = k^2 u[k]$ and $x_2[k] = k^2 a^k u[k]$.

▪ Applying the derivative rule twice we get:
$$k^2 x[k] \;\leftrightarrow\; z\,\frac{d}{dz}\left\{z\,\frac{d}{dz}X(z)\right\} = z^2\,\frac{d^2}{dz^2}X(z) + z\,\frac{d}{dz}X(z)$$
$$x_1[k] = k^2 u[k] \;\leftrightarrow\; z^2\,\frac{d^2}{dz^2}\!\left(\frac{z}{z-1}\right) + z\,\frac{d}{dz}\!\left(\frac{z}{z-1}\right) = \frac{2z^2}{(z-1)^3} - \frac{z}{(z-1)^2} = \frac{z^2+z}{(z-1)^3}$$

▪ Redoing the same process as before, we obtain
$$x_2[k] = k^2 a^k u[k] \;\leftrightarrow\; z^2\,\frac{d^2}{dz^2}\!\left(\frac{z}{z-a}\right) + z\,\frac{d}{dz}\!\left(\frac{z}{z-a}\right) = \frac{2az^2}{(z-a)^3} - \frac{az}{(z-a)^2} = \frac{az(z+a)}{(z-a)^3}$$

❼ Time Reversal: if $x[k] \leftrightarrow X(z)$ with ROC $= R$, then
$$x[-k] \;\leftrightarrow\; \sum_{k=-\infty}^{\infty} x[-k]\,z^{-k} = \sum_{k=-\infty}^{\infty} x[k]\left(\frac{1}{z}\right)^{-k} = X\!\left(\frac{1}{z}\right) \quad\text{with ROC} = \frac{1}{R}$$

Therefore, a pole (or zero) in $X(z)$ at $z = z_k$ moves to $1/z_k$ after time reversal. The
relationship ROC $= 1/R$ indicates the inversion of $R$, reflecting the fact that a right-sided
sequence becomes left-sided if time-reversed, and vice versa.

❽ Time Forward Difference: if 𝑥[𝑘] ⟷ 𝑋(𝑧) with ROC = 𝑅 then

Forward: $\Delta x_k = x[k+1] - x[k] \;\leftrightarrow\; (z-1)X(z) - z\,x[0]$
Backward: $\nabla x_k = x[k] - x[k-1] \;\leftrightarrow\; (1-z^{-1})X(z)$
❾ Convolution: The convolution product of two signals is an inner operation denoted
by a star and defined by
$$y[k] = x[k] \star h[k] = h[k] \star x[k] = \sum_{i=-\infty}^{\infty} h[k-i]\,x[i] = \sum_{i=-\infty}^{\infty} x[k-i]\,h[i]$$

Now the Z-transform of this convolution is $Y(z) = \mathcal{Z}(x[k] \star h[k]) = X(z)H(z)$. This
relationship plays a central role in the analysis and design of discrete-time LTI systems,
in analogy with the continuous-time case.

Proof: Applying the definition of the Z-transform to the convolution product and
interchanging the order of summation, we get
$$Y(z) = \mathcal{Z}(x[k] \star h[k]) = \sum_{k=-\infty}^{\infty}\left(\sum_{i=-\infty}^{\infty} x[k-i]\,h[i]\right) z^{-k} = \sum_{i=-\infty}^{\infty}\left(h[i]\sum_{k=-\infty}^{\infty} x[k-i]\,z^{-k}\right)$$
We know that $\sum_{k=-\infty}^{\infty} x[k-i]\,z^{-k} = z^{-i}X(z)$, hence
$$Y(z) = \sum_{i=-\infty}^{\infty} h[i]\,z^{-i}X(z) = X(z)\sum_{i=-\infty}^{\infty} h[i]\,z^{-i} = X(z)H(z)$$

❿ Time Accumulation: if $x[k] \leftrightarrow X(z)$ with ROC $= R$, then
$$\mathcal{Z}\!\left(\sum_{i=-\infty}^{k} x[i]\right) = \frac{1}{1-z^{-1}}\,X(z)$$
Proof: It is well-known that
$$x[k] \star u[k] = \sum_{i=-\infty}^{\infty} u[k-i]\,x[i] = \sum_{i=-\infty}^{k} x[i] \;\Longrightarrow\; \mathcal{Z}\!\left(\sum_{i=-\infty}^{k} x[i]\right) = \mathcal{Z}(x[k] \star u[k]) = \frac{X(z)}{1-z^{-1}}$$

⓫ Conjugation: if $x[k] \leftrightarrow X(z)$ with ROC $= R$, then $x^\star[k] \leftrightarrow X^\star(z^\star)$ with ROC $= R$:
$$\sum_{k=0}^{\infty} x^\star[k]\,z^{-k} = \left(\sum_{k=0}^{\infty} x[k]\,(z^\star)^{-k}\right)^{\!\star} = X^\star(z^\star)$$

⓬ Parseval's Theorem:
$$\sum_{k=-\infty}^{+\infty} x_1[k]\,x_2^\star[k] = \frac{1}{2\pi j}\oint X_1(z)\,X_2^\star\!\left(\frac{1}{z^\star}\right) z^{-1}\,dz$$

Proof: Let $x_1[k] \leftrightarrow X_1(z)$ and $x_2[k] \leftrightarrow X_2(z)$. The proof of this identity is based on the
inverse relation; later on we will see the inverse formula of the Z-transform. For now
assume that the inverse formula is known:
$$x_1[k] = \frac{1}{2\pi j}\oint X_1(z)\,z^{k-1}\,dz$$
$$\sum_{k=-\infty}^{+\infty} x_1[k]\,x_2^\star[k] = \sum_{k=-\infty}^{+\infty}\left\{\frac{1}{2\pi j}\oint X_1(z)\,z^{k-1}\,dz\right\} x_2^\star[k] = \frac{1}{2\pi j}\oint X_1(z)\left(\sum_{k=-\infty}^{+\infty} x_2^\star[k]\,z^{k-1}\right) dz$$
$$= \frac{1}{2\pi j}\oint X_1(z)\left(\sum_{k=-\infty}^{+\infty} x_2^\star[k]\left(\frac{1}{z}\right)^{-k}\right) z^{-1}\,dz = \frac{1}{2\pi j}\oint X_1(z)\left(\sum_{k=-\infty}^{+\infty} x_2[k]\left(\frac{1}{z^\star}\right)^{-k}\right)^{\!\star} z^{-1}\,dz$$
$$= \frac{1}{2\pi j}\oint X_1(z)\,X_2^\star\!\left(\frac{1}{z^\star}\right) z^{-1}\,dz$$

⓭ Multiplication in Time: It gives the change in the Z-domain of the signal when
multiplication takes place. With $x_1[k] \leftrightarrow X_1(z)$ and $x_2[k] \leftrightarrow X_2(z)$:
$$x_1[k]\,x_2[k] \;\leftrightarrow\; \frac{1}{2\pi j}\oint X_1(v)\,X_2\!\left(\frac{z}{v}\right) v^{-1}\,dv$$

Proof: Let $x_1[k] \leftrightarrow X_1(z)$ and $x_2[k] \leftrightarrow X_2(z)$:
$$\sum_{k=0}^{\infty} x_1[k]\,x_2[k]\,z^{-k} = \sum_{k=0}^{\infty}\left(\frac{1}{2\pi j}\oint X_1(v)\,v^{k-1}\,dv\right) x_2[k]\,z^{-k} = \frac{1}{2\pi j}\oint X_1(v)\left(\sum_{k=0}^{\infty} x_2[k]\,v^{k-1} z^{-k}\right) dv$$
$$= \frac{1}{2\pi j}\oint X_1(v)\left(\sum_{k=0}^{\infty} x_2[k]\left(\frac{z}{v}\right)^{-k}\right) v^{-1}\,dv = \frac{1}{2\pi j}\oint X_1(v)\,X_2\!\left(\frac{z}{v}\right) v^{-1}\,dv$$

⓮ Initial Value Theorem: If $x[k]$ is causal (i.e., there are no positive powers of $z$ in its
Z-transform $X(z)$), then the initial value theorem can be written as
$$\lim_{k\to 0} x[k] = \lim_{z\to\infty} X(z)$$
Proof: Let $X(z) = \sum_{k=0}^{\infty} x[k]\,z^{-k} = x[0] + x[1]z^{-1} + x[2]z^{-2} + \cdots \;\Longrightarrow\; \lim_{z\to\infty} X(z) = x[0] = \lim_{k\to 0} x[k]$

⓯ Final Value Theorem: The final value theorem allows us to calculate the limit of a
sequence as $k$ tends to infinity, if one exists, from the z-transform of the sequence.

The Final Value Theorem states that if $x[k]$ is causal and all the poles of $(1-z^{-1})X(z)$ are
inside the unit circle, then its final value, denoted $x(\infty)$, can be written as
$$x(\infty) = \lim_{k\to\infty} x[k] = \lim_{z\to 1}(1-z^{-1})X(z) = \lim_{z\to 1}(z-1)X(z)$$

Proof: Start with the definition of the z-transform
$$X(z) = \lim_{N\to\infty}\sum_{k=0}^{N} x[k]\,z^{-k}$$
and also use the time-advance theorem of the z-transform
$$z\big(X(z) - x[0]\big) = \lim_{N\to\infty}\sum_{k=0}^{N} x[k+1]\,z^{-k}$$
Now subtract the first from the second:
$$(z-1)X(z) - z\,x[0] = \lim_{N\to\infty}\sum_{k=0}^{N}\big(x[k+1] - x[k]\big)z^{-k}$$
Then take the limit as $z \to 1$ (the sum telescopes):
$$\lim_{z\to 1}(z-1)X(z) - x[0] = \lim_{N\to\infty} x[N+1] - x[0]$$
So
$$\lim_{z\to 1}(z-1)X(z) = \lim_{N\to\infty} x[N+1] = x(\infty)$$

Alternatively, using the time-delay theorem of the z-transform
$$z^{-1}X(z) = \lim_{N\to\infty}\sum_{k=0}^{N} x[k-1]\,z^{-k}$$
we can get to
$$(1-z^{-1})X(z) = \lim_{N\to\infty}\left(\sum_{k=0}^{N} x[k]\,z^{-k} - \sum_{k=0}^{N} x[k-1]\,z^{-k}\right)$$
and, assuming $x[k] = 0$ for $k < 0$ and taking the limit as $z \to 1$,
$$\lim_{z\to 1}(1-z^{-1})X(z) = \lim_{N\to\infty} x[N] - x[-1] = \lim_{N\to\infty} x[N] = x(\infty)$$

The main pitfall of the theorem is that there are important cases where the limit does
not exist. The two main cases are as follows:

1. An unbounded sequence i.e. poles are outside the unit circle.


2. An oscillatory sequence i.e. poles are on the unit circle.

The reader is cautioned against blindly using the final value theorem, because this can
yield misleading results.

Example: Verify the final value theorem using the z-transforms of a growing and a
decaying exponential sequence and their limits as $k$ tends to infinity (take $|a| > 1$, so
that $a^k$ grows and $a^{-k}$ decays).

Solution: The z-transform pairs of the two exponential sequences are
$$\begin{cases} x_1[k] = a^k u[k] \\ x_2[k] = a^{-k} u[k] \end{cases} \;\Longleftrightarrow\; a^k u[k] \leftrightarrow X_1(z) = \frac{z}{z-a} \quad\&\quad a^{-k} u[k] \leftrightarrow X_2(z) = \frac{az}{az-1}$$

Blindly applying the final value theorem gives
$$x_1(\infty) = \lim_{z\to 1}(z-1)X_1(z) = \lim_{z\to 1}\frac{z(z-1)}{z-a} = 0$$
$$x_2(\infty) = \lim_{z\to 1}(z-1)X_2(z) = \lim_{z\to 1}\frac{az(z-1)}{az-1} = 0$$
For the first signal $x_1[k]$ the theorem cannot even be applied: we have a growing
sequence without a limit (i.e., an unbounded sequence, with a pole at $z = a$ outside the
unit circle), which violates the conditions of the theorem, so the value 0 returned by the
formula is meaningless.

For the second signal $x_2[k]$ the result is true.

⓰ Periodic Functions: If a function $x_1[k]$ is identically zero except for $0 \le k < T$, then
for the periodic function $x[k]$ defined by
$$x[k] = \sum_{m=0}^{\infty} x_1[k - mT]$$
the Z-transform $X(z)$ is given by
$$X(z) = \frac{1}{1-z^{-T}}\left(\sum_{k=0}^{\infty} x_1[k]\,z^{-k}\right) = \frac{1}{1-z^{-T}}\,X_1(z)$$

Proof: Let us define $x_\ell[k] = x_1[k - (\ell-1)T]$; then
$$x[k] = x_1[k] + x_2[k] + x_3[k] + \cdots = x_1[k] + x_1[k-T]u[k-T] + x_1[k-2T]u[k-2T] + \cdots$$
Applying the Z-transform and the time-shifting property, we get
$$X(z) = X_1(z) + z^{-T}X_1(z) + z^{-2T}X_1(z) + \cdots = X_1(z)\big(1 + z^{-T} + z^{-2T} + \cdots\big) = \frac{X_1(z)}{1-z^{-T}}$$

The theory of linear difference equations closely parallels the theory of linear
differential equations. The major distinction between these two mathematical models is
that a differential equation expresses a relationship between continuous input and
continuous output functions, whereas a difference equation expresses a relationship
between discrete input and discrete output functions (sequences).

We considered a discrete-time LTI system for which input 𝑥[𝑘] and output 𝑦[𝑘] satisfy
the general linear constant-coefficient difference equation of the form
$$\sum_{i=0}^{n} a_i\,y[k-i] = \sum_{i=0}^{m} b_i\,x[k-i]$$

The transfer function $H(z)$ of a system is defined as the Z-transform of its output
$y[k]$ divided by the Z-transform of its forcing function (input) $x[k]$; it is a rational
function of $z$, the ratio of a numerator and a denominator polynomial.

Applying the z-transform, with its time-shift and linearity properties, to this difference
equation, we obtain
$$\left(\sum_{i=0}^{n} a_i z^{-i}\right) Y(z) = \left(\sum_{i=0}^{m} b_i z^{-i}\right) X(z) \;\Longrightarrow\; H(z) = \frac{Y(z)}{X(z)} = \frac{\sum_{i=0}^{m} b_i z^{-i}}{\sum_{i=0}^{n} a_i z^{-i}}$$
Hence, 𝐻(𝑧) is always rational.

The DC gain is the ratio of the steady-state output to the steady-state input, where the
input is taken as a unit step. The DC gain is an important parameter, especially in
control applications.
$$Y(z) = H(z)X(z) = H(z)\,\frac{z}{z-1}$$
Using the final value theorem we obtain
$$x[\infty] = \lim_{z\to 1}(1-z^{-1})X(z) = \lim_{z\to 1}(1-z^{-1})\frac{z}{z-1} = 1$$
$$y[\infty] = \lim_{z\to 1}(1-z^{-1})Y(z) = \lim_{z\to 1}(1-z^{-1})H(z)\frac{z}{z-1} = H(1)$$
The DC gain is therefore
$$\text{DC gain} = \frac{y[\infty]}{x[\infty]} = H(1) = \lim_{z\to 1}(1-z^{-1})Y(z)$$
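Since $H(1)$ is obtained by setting $z = 1$, the DC gain of a rational $H(z) = \sum b_i z^{-i} / \sum a_i z^{-i}$ is simply the sum of the numerator coefficients divided by the sum of the denominator coefficients. A minimal sketch, with hypothetical coefficient vectors chosen only for illustration:

b = [1 0.5];             % assumed numerator coefficients (in powers of z^-1)
a = [1 -0.3];            % assumed denominator coefficients
dcGain = sum(b)/sum(a)   % H(1) = 1.5/0.7, approximately 2.1429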

As mentioned earlier, the Z-transformation is carried out to simplify time-domain
calculations. After completing all the calculations in the Z-domain, we have to map the
results back to the time domain for them to be of use.

❶ Contour Integration: We now discuss how to obtain the sequence $x[k]$ by contour
integration, given its Z-transform. Recall that the two-sided Z-transform is defined by
$$X(z) = \sum_{k=-\infty}^{\infty} x[k]\,z^{-k}$$
Let us multiply both sides by 𝑧 𝑛−1 and integrate over a closed contour within ROC of
𝑋(𝑧); let the contour enclose the origin. We have

$$\oint_{\mathcal{C}} X(z)\,z^{n-1}\,dz = \oint_{\mathcal{C}}\left(\sum_{k=-\infty}^{\infty} x[k]\,z^{-k}\right) z^{n-1}\,dz$$
where 𝒞 denotes the closed contour within ROC, taken in a counterclockwise direction.
As the curve 𝒞 is inside ROC, the sum converges on every part of 𝒞 and, as a result, the
integral and the sum on the right-hand side can be interchanged. The above equation
becomes

$$\oint_{\mathcal{C}} X(z)\,z^{n-1}\,dz = \sum_{k=-\infty}^{\infty} x[k]\oint_{\mathcal{C}} z^{n-1-k}\,dz$$

Now we make use of the Cauchy integral theorem, according to which

$$\frac{1}{2\pi j}\oint_{\mathcal{C}} z^{n-1-k}\,dz = \begin{cases} 1 & k = n \\ 0 & k \neq n \end{cases}$$

Here, $\mathcal{C}$ is any contour that encloses the origin. Using the above equation, the right-hand
side becomes $2\pi j\,x[n]$ and hence we obtain the formula
$$\frac{1}{2\pi j}\oint_{\mathcal{C}} X(z)\,z^{n-1}\,dz = \sum_{k=-\infty}^{\infty} x[k]\left(\frac{1}{2\pi j}\oint_{\mathcal{C}} z^{n-1-k}\,dz\right) = x[n]$$
$$x[n] = \frac{1}{2\pi j}\oint_{\mathcal{C}} X(z)\,z^{n-1}\,dz$$

❷ Partial Fraction Expansion: In case the given Z-transform is more complicated, it is
decomposed into a sum of standard fractions, and one can then calculate the overall
inverse Z-transform using the linearity property. We will restrict our attention to
inversion of Z-transforms that are proper.

Because the inverse Z-transform of $z/(z-a)$ is $a^k u[k]$, we split $H(z)$ into partial
fractions, as given below:
$$H(z) = \frac{H_1}{z-a_1} + \frac{H_2}{z-a_2} + \cdots + \frac{H_m}{z-a_m}$$
where we have assumed that $H(z)$ has $m$ simple poles at $a_1, a_2, \ldots, a_m$. The coefficients
$H_1, H_2, \ldots, H_m$ are known as residues at the corresponding poles. The residues are
calculated using the formula $H_i = \lim_{z\to a_i}(z-a_i)H(z)$, $i = 1, 2, \ldots, m$.

In the case of repeated poles (i.e., each root $a_i$ is repeated $\ell_i$ times):
$$H(z) = \sum_{i=1}^{r}\sum_{j=1}^{\ell_i} \frac{H_{ij}}{(z-a_i)^{j}}$$
where
$$H_{ij} = \frac{1}{(\ell_i - j)!}\lim_{z\to a_i}\frac{d^{\ell_i - j}}{dz^{\ell_i - j}}\Big[(z-a_i)^{\ell_i} H(z)\Big]$$

Example: Obtain the inverse of $H(z)$, defined by
$$H(z) = \frac{11z^2 - 15z + 6}{(z-2)(z-1)^2}$$
We begin with the partial fraction expansion:
$$H(z) = \frac{H_{11}}{z-2} + \frac{H_{21}}{z-1} + \frac{H_{22}}{(z-1)^2}$$
$$H_{11} = \lim_{z\to 2}(z-2)H(z) = \lim_{z\to 2}\frac{11z^2 - 15z + 6}{(z-1)^2} = 20$$

Multiplying $H(z)$ by $(z-1)^2$, we obtain the following equation:
$$(z-1)^2 H(z) = \frac{11z^2 - 15z + 6}{z-2} = \frac{H_{11}(z-1)^2}{z-2} + H_{21}(z-1) + H_{22}$$
Substituting $z = 1$, we obtain $H_{22} = -2$. On differentiating once with respect to $z$ and
substituting $z = 1$, we obtain
$$H_{21} = \lim_{z\to 1}\frac{(22z-15)(z-2) - (11z^2 - 15z + 6)}{(z-2)^2} = -9$$
$$H(z) = \frac{20}{z-2} - \frac{9}{z-1} - \frac{2}{(z-1)^2} = \frac{20z^{-1}}{1-2z^{-1}} - \frac{9z^{-1}}{1-z^{-1}} - \frac{2z^{-2}}{(1-z^{-1})^2}$$
$$h[k] = \big(20 \times 2^{k-1} - 9 - 2(k-1)\big)u[k-1]$$
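The residues above can be cross-checked with MATLAB's residue function, which performs exactly this expansion; a minimal sketch (residue expects polynomial coefficients in descending powers, and for the repeated pole the last returned residue belongs to the squared term):

b = [11 -15 6];            % numerator 11z^2 - 15z + 6
a = poly([2 1 1]);         % denominator (z-2)(z-1)^2
[r, p, k] = residue(b, a)  % r = [20; -9; -2], p = [2; 1; 1], k = []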

The above method is known as inversion. It is also sometimes called realization, because
it is through this methodology that transfer functions can be realized in practice.

Example: Obtain the inverse of the proper function $H(z)$, defined by
$$H(z) = \frac{z^3 - z^2 + 3z - 1}{(z-1)(z^2-z+1)}$$
Because it is proper, we first divide the numerator by the denominator and obtain
$$H(z) = 1 + \frac{z(z+1)}{(z-1)(z^2-z+1)} = 1 + G(z)$$

As $G(z)$ has a zero at the origin, we can divide by $z$:
$$\frac{G(z)}{z} = \frac{z+1}{(z-1)(z^2-z+1)} = \frac{z+1}{(z-1)(z-e^{j\pi/3})(z-e^{-j\pi/3})}$$
Note that complex poles or complex zeros, if any, always occur in conjugate pairs for
real sequences. The residues work out to
$$\frac{G(z)}{z} = \frac{2}{z-1} - \frac{1}{z-e^{j\pi/3}} - \frac{1}{z-e^{-j\pi/3}}$$
We multiply through by $z$ and invert:
$$H(z) = 1 + \frac{2z}{z-1} - \frac{z}{z-e^{j\pi/3}} - \frac{z}{z-e^{-j\pi/3}} \;\Longrightarrow\; h[k] = \delta[k] + \left(2 - 2\cos\frac{\pi k}{3}\right)u[k]$$

Sometimes it helps to work directly in powers of $z^{-1}$. We illustrate this in the next
example.

Example: Obtain the inverse of the proper function $H(z)$, defined by
$$H(z) = \frac{3 - \frac{5}{6}z^{-1}}{\left(1 - \frac{1}{4}z^{-1}\right)\left(1 - \frac{1}{3}z^{-1}\right)}, \qquad |z| > \frac{1}{3}$$
There are two poles, one at $z = 1/3$ and one at $z = 1/4$. As the ROC lies outside the outermost
pole, the inverse transform is a right-sided sequence:
$$H(z) = \frac{3 - \frac{5}{6}z^{-1}}{\left(1 - \frac{1}{4}z^{-1}\right)\left(1 - \frac{1}{3}z^{-1}\right)} = \frac{H_1}{1 - \frac{1}{4}z^{-1}} + \frac{H_2}{1 - \frac{1}{3}z^{-1}}$$
It is straightforward to see that $H_1 = 1$ and $H_2 = 2$:
$$H(z) = \frac{1}{1 - \frac{1}{4}z^{-1}} + \frac{2}{1 - \frac{1}{3}z^{-1}} \;\Longrightarrow\; h[k] = \left\{\left(\frac{1}{4}\right)^k + 2\left(\frac{1}{3}\right)^k\right\}u[k]$$
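When working in powers of $z^{-1}$, the same expansion is produced by residuez; a sketch, assuming the Signal Processing Toolbox is available (the ordering of the returned poles may differ):

b = [3 -5/6];                    % numerator 3 - (5/6)z^-1
a = conv([1 -1/4], [1 -1/3]);    % denominator (1 - z^-1/4)(1 - z^-1/3)
[r, p, k] = residuez(b, a)       % r = [1; 2] at p = [1/4; 1/3] (up to ordering)
% h[k] = r(1)*p(1)^k + r(2)*p(2)^k for k >= 0, as found above.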

Example: Determine the inverse Z-transform of

$$H(z) = \frac{z^2 + 2z}{(z-2)(z+1)^2}, \qquad 1 < |z| < 2$$

As the degree of the numerator polynomial is less than that of the denominator
polynomial, and as it has a zero at the origin, first divide by z and do a partial fraction
expansion:
$$\frac{H(z)}{z} = \frac{z+2}{(z-2)(z+1)^2} = \frac{H_{11}}{z-2} + \frac{H_{12}}{z+1} + \frac{H_{22}}{(z+1)^2}$$

After computation as before we get
$$\frac{H(z)}{z} = \frac{4/9}{z-2} - \frac{4/9}{z+1} - \frac{1/3}{(z+1)^2}$$
Because the ROC is given to be the annulus $1 < |z| < 2$, and because it cannot contain the
poles, the terms must be split as follows:
$$H(z) = \underbrace{-\frac{4}{9}\,\frac{z}{z+1} - \frac{1}{3}\,\frac{z}{(z+1)^2}}_{|z|>1} + \underbrace{\frac{4}{9}\,\frac{z}{z-2}}_{|z|<2}$$
We obtain the inverse as
$$h[k] = \left(\frac{k}{3} - \frac{4}{9}\right)(-1)^k u[k] - \frac{4}{9}\,2^k u[-k-1]$$

Example: Determine the inverse Z-transform of
$$H(z) = \frac{1}{z^2 + (\sqrt{2}-1)z - \sqrt{2}}$$
$$\frac{H(z)}{z} = \frac{1}{z\big(z^2 + (\sqrt{2}-1)z - \sqrt{2}\big)} = \frac{\alpha}{z} + \frac{\beta}{z-1} + \frac{\gamma}{z+\sqrt{2}}$$
After evaluation of $\alpha$, $\beta$ and $\gamma$ we get
$$\alpha = -\frac{1}{\sqrt{2}}, \qquad \beta = \frac{1}{1+\sqrt{2}}, \qquad \gamma = \frac{1}{2+\sqrt{2}}$$
Therefore the partial fraction expansion is
$$H(z) = -\frac{1}{\sqrt{2}} + \left(\frac{1}{1+\sqrt{2}}\right)\frac{z}{z-1} + \left(\frac{1}{2+\sqrt{2}}\right)\frac{z}{z+\sqrt{2}}$$
$$h[k] = -\frac{1}{\sqrt{2}}\,\delta[k] + \left\{\frac{1}{1+\sqrt{2}} + \frac{1}{2+\sqrt{2}}\left(-\sqrt{2}\right)^k\right\}u[k]$$

❸ Inversion by Power Series: Now we present another method of inversion. In this
method, both the numerator and the denominator are written in powers of $z^{-1}$ and we
divide the former by the latter through long division. Because we obtain the result as a
power series, this method is known as the power series method. We illustrate it with an
example.

Example: Determine the inverse Z-transform of
$$H(z) = \log(1 + a z^{-1}), \qquad |z| > |a|$$
$$\log(1+v) = \sum_{i=1}^{\infty} \frac{(-1)^{i+1} v^i}{i}, \;|v| < 1 \;\Longrightarrow\; \log(1 + a z^{-1}) = \sum_{k=1}^{\infty} \frac{(-1)^{k+1} a^k z^{-k}}{k}, \;|z| > |a|$$
$$h[k] = \begin{cases} (-1)^{k+1}\dfrac{a^k}{k} & k > 0 \\ 0 & k \le 0 \end{cases} \;\Longrightarrow\; h[k] = -\frac{(-a)^k}{k}\,u[k-1]$$
Example: Determine the inverse Z-transform of $H(z) = e^{z^{-1}}$.

Let us use the power series expansion of $e^{z^{-1}}$:
$$H(z) = e^{z^{-1}} = e^{1/z} = \sum_{k=0}^{\infty} \frac{1}{k!}\,z^{-k}$$
Comparing this to the Z-transform formula we get
$$h[k] = \frac{1}{k!}\,u[k]$$
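For a rational $H(z)$, the long division required by the power-series method can be delegated to MATLAB: the first N coefficients of the series are just the first N samples of the impulse response. A minimal sketch, reusing the rational $H(z)$ of the earlier example:

b = [3 -5/6]; a = conv([1 -1/4], [1 -1/3]);   % H(z) from the earlier example
N = 6;
h = filter(b, a, [1 zeros(1, N-1)])           % first N terms of (1/4)^k + 2*(1/3)^k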

In most applications, a continuous signal ℎ(𝑡) is sampled periodically, every 𝑇 seconds,
to generate a discrete signal denoted by ℎ⋆(𝑡). The discrete signal is subsequently fed
into a linear plant or transfer function. This process is called uniform sampling and can
be represented by a switch, called a sampler, which closes every 𝑇 sec for a duration of
∆𝑡 sec, as shown.
The output of the sampler can be expressed as the product of the input ℎ(𝑡) and a
periodic pulse train 𝑃(𝑡), that is,

ℎ⋆ (𝑡) = ℎ(𝑡)𝑃(𝑡)

Corresponding waveforms are shown in the figure. The amplitudes of the pulses are
modulated by ℎ(𝑡). The width ∆𝑡 of the pulses is assumed to be relatively small compared
to the sampling time 𝑇. This type of sampling is referred to as non-ideal, because the
pulses, and thus the samples, are of finite width ∆𝑡.

If we assume that, in the limit, the pulse width approaches zero, the sampler output
becomes a train of impulse or Dirac delta functions. This type of uniform sampling is
referred to as ideal impulse sampling. In this case, the sampled signal is represented by
a train of impulse functions whose areas correspond to the sampled values of the
signal. The impulse train is defined by $P(t) = \sum_{k=0}^{\infty}\delta(t-kT)$, and the sampled signal is
given by
$$h^\star(t) = h(t)P(t) = h(t)\sum_{k=0}^{\infty}\delta(t-kT)$$

Before going deeper into the subject, let us answer the following questions:
Why is discretization needed? And is the discretization method unique or not?

Finding the discrete equivalent of continuous system is desired when digital computers
are a part of the system, so the data needed to be computerized must be sampled.

There are many methods for finding the discrete equivalent of a signal or system; among them we mention:

▪ Numerical integration (i.e. Forward, Backward, Bilinear methods)


▪ Pole zero matching method
▪ Step invariant method (zero order hold)
▪ Impulse invariant method (impulse modulation)
Remark: The method of equivalence is called emulation

Numerical Integration: The topic of numerical integration of differential equations is
quite complex, and only the elementary techniques are presented here:
$$\dot{x}(t) \approx \frac{x[k+1]-x[k]}{T} \qquad \text{(the Forward rule)}$$
$$\dot{x}(t) \approx \frac{x[k]-x[k-1]}{T} \qquad \text{(the Backward rule)}$$
$$\frac{1}{2}\Big(\dot{x}(kT) + \dot{x}\big((k+1)T\big)\Big) \approx \frac{x[k+1]-x[k]}{T} \qquad \text{(the Bilinear rule)}$$

The operation can be carried out directly on the system function (transfer function) if
one translates the above equations into the frequency domain:
$$sX(s) \approx \left(\frac{z-1}{T}\right)X(z) \;\Longleftrightarrow\; s = \frac{z-1}{T} \qquad \text{(the Forward rule)}$$
$$sX(s) \approx \left(\frac{z-1}{Tz}\right)X(z) \;\Longleftrightarrow\; s = \frac{z-1}{Tz} \qquad \text{(the Backward rule)}$$
$$\left(\frac{z+1}{2}\right)sX(s) \approx \left(\frac{z-1}{T}\right)X(z) \;\Longleftrightarrow\; s = \frac{2}{T}\,\frac{z-1}{z+1} \qquad \text{(the Bilinear rule)}$$

Remark: the trapezoidal rule is also called the Tustin's method or bilinear
transformation.

Now it is interesting to see how the stable region of the s-plane is mapped to the z-plane.

Forward method: $\dot{x} \approx \dfrac{x_{k+1}-x_k}{T}$, or equivalently $s = \dfrac{z-1}{T}$.

Explanation: $z = sT + 1$, i.e. $s = (z-1)/T$; let $z = \sigma + j\omega$. Then
$$\text{continuous stable} \;\Longrightarrow\; \mathrm{Re}(s) < 0 \;\Longrightarrow\; \mathrm{Re}\!\left(\frac{\sigma-1}{T} + \frac{j\omega}{T}\right) < 0 \;\Longrightarrow\; \sigma < 1$$
The stable half-plane maps to the half-plane $\mathrm{Re}(z) < 1$, which extends beyond the unit
circle. This means that we can have a continuous stable system whose forward-rule
discretization is unstable; the implication goes only one way:
Discrete Stable $\Longrightarrow$ Continuous Stable.
Backward method: $\dot{x} \approx \dfrac{x_k - x_{k-1}}{T}$, or equivalently $s = \dfrac{z-1}{Tz}$.

Explanation: $z^{-1} = 1 - sT$, i.e. $s = (z-1)/(Tz)$; let $z = \sigma + j\omega$. Then
$$\text{continuous stable} \;\Longrightarrow\; \mathrm{Re}(s) < 0 \;\Longrightarrow\; \mathrm{Re}\!\left(\frac{(\sigma-1)+j\omega}{T(\sigma+j\omega)}\right) < 0 \;\Longrightarrow\; \underbrace{\left(\sigma - \tfrac{1}{2}\right)^2 + \omega^2 < \left(\tfrac{1}{2}\right)^2}_{\text{disc equation}}$$
Notice that the obtained disc lies inside the unit circle, so we can interpret this result as
the implication: Continuous Stable $\Longrightarrow$ Discrete Stable. In other words, we can have a
stable discrete system whose continuous counterpart is unstable.

Bilinear transformation: $s = \dfrac{2}{T}\,\dfrac{z-1}{z+1}$.

Explanation: $s = 2(z-1)/\big(T(z+1)\big)$; let $z = \sigma + j\omega$ and $s = a + jb$. Then
$$\text{continuous stable} \;\Longrightarrow\; \mathrm{Re}(s) < 0 \;\Longrightarrow\; \mathrm{Re}\!\left(\frac{(\sigma-1)+j\omega}{(\sigma+1)+j\omega}\right) < 0 \;\Longrightarrow\; \underbrace{\sigma^2 + \omega^2 < 1}_{\text{disc equation}}$$
$$\text{discrete stable} \;\Longrightarrow\; |z| < 1 \;\Longrightarrow\; \left|\frac{2+Ts}{2-Ts}\right| < 1 \;\Longrightarrow\; \underbrace{a < 0}_{\text{LHP}} \;\Longrightarrow\; \text{continuous stable}$$
Notice that the obtained disc is exactly the unit disc, so we can interpret this result as
the equivalence: Continuous Stable $\Longleftrightarrow$ Discrete Stable.

Example: Use the previous methods to find the $H(z)$ equivalent to the given $H(s)$:
$$H(s) = \frac{a}{s+a}$$
$$\text{The Forward rule:}\quad H(z) = \frac{a}{\frac{z-1}{T}+a} = \frac{aT}{z+(aT-1)}$$
$$\text{The Backward rule:}\quad H(z) = \frac{a}{\frac{z-1}{Tz}+a} = \frac{aTz}{(aT+1)z-1}$$
$$\text{The Bilinear rule:}\quad H(z) = \frac{a}{\frac{2}{T}\frac{z-1}{z+1}+a} = \frac{aT(z+1)}{(aT+2)z+(aT-2)}$$
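The three results can be tabulated numerically; the sketch below (with a = 3 and T = 0.01 assumed only for illustration) builds the coefficient vectors of each H(z) from the formulas just derived and confirms that all three discretizations have unit DC gain:

a = 3; T = 0.01;
bFw = a*T;           aFw = [1, a*T - 1];            % forward rule
bBw = [a*T, 0];      aBw = [a*T + 1, -1];           % backward rule
bTu = [a*T, a*T];    aTu = [a*T + 2, a*T - 2];      % bilinear (Tustin) rule
dc = [sum(bFw)/sum(aFw), sum(bBw)/sum(aBw), sum(bTu)/sum(aTu)]  % all equal 1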

clear all, clc
Ts = 0.01; t1 = 0:Ts:4; a = 3; u = ones(1,length(t1));   % unit-step input
n1 = length(u)-1;
y1(1) = 0; y2(1) = 0; y3(1) = 0;
a1 = -(a*Ts-2)/(a*Ts+2); a2 = (a*Ts)/(a*Ts+2);
for k = 1:n1
    y1(k+1) = -(a*Ts-1)*y1(k) + a*Ts*u(k);                    % forward rule
    y2(k+1) = 1/(a*Ts+1)*y2(k) + ((a*Ts)/(a*Ts+1))*u(k+1);    % backward rule
    y3(k+1) = a1*y3(k) + a2*u(k+1) + a2*u(k);                 % bilinear rule
end
figure
hold on
plot(t1(1:n1),y1(1:n1),'-','linewidth',3)
plot(t1(1:n1),y2(1:n1),'-','linewidth',3)
plot(t1(1:n1),y3(1:n1),'-','linewidth',3)
grid on

Step invariant method (zero-order hold): The basic idea behind the step-invariance
method is to choose for the discrete system a step response that matches the step
response of the analog system; this is why it is called step invariant.
$$y(t) = \mathcal{L}^{-1}\left\{\frac{H(s)}{s}\right\} \qquad \text{(continuous system)}$$
$$y[k] = \mathcal{Z}^{-1}\left\{\frac{H(z)}{1-z^{-1}}\right\} \qquad \text{(discrete system)}$$
$$y[k] = y(t)\big|_{t\to kT} = \mathcal{L}^{-1}\left\{\frac{H(s)}{s}\right\}\bigg|_{t\to kT}$$
This means that
$$H(z) = (1-z^{-1})\,\mathcal{Z}\!\left(\mathcal{L}^{-1}\left\{\frac{H(s)}{s}\right\}\bigg|_{t\to kT}\right)$$

Remark: the step invariance method is also referred to as ZOH equivalence.

Explanation ($1^{st}$ method): Let us define $h_0(t) = u(t) - u(t-T)$:
$$H_0(s) = \frac{1}{s} - \frac{e^{-sT}}{s} = \frac{1-e^{-sT}}{s}$$
$$H(z) = \mathcal{Z}\!\left(\mathcal{L}^{-1}\{H_0(s)H(s)\}\big|_{t\to kT}\right) = \mathcal{Z}\!\left(\mathcal{L}^{-1}\left\{(1-e^{-sT})\frac{H(s)}{s}\right\}\bigg|_{t\to kT}\right)$$
We know that $z = e^{sT}$, so
$$H(z) = (1-z^{-1})\,\mathcal{Z}\!\left(\mathcal{L}^{-1}\left\{\frac{H(s)}{s}\right\}\bigg|_{t\to kT}\right)$$

Remark: the advantage of setting the input to a step $u[k]$ is that a step function is
invariant to the ZOH, i.e. $u(t)$ is also a step every sampling period (unaffected by the ZOH).

Explanation ($2^{nd}$ method):
$$H_0(s) = \mathcal{L}\{h_0(t)\} = \frac{1-e^{-sT}}{s}$$
Before commencing the development we set up the following notation:
$$G(s) = \frac{H(s)}{s}, \qquad g(t) = \mathcal{L}^{-1}\{G(s)\}, \qquad \mathcal{L}^{-1}\{e^{-sT}G(s)\} = g(t-T)$$
$$\mathcal{L}^{-1}\{H_0(s)H(s)\} = \mathcal{L}^{-1}\left\{\frac{H(s)}{s} - e^{-sT}\frac{H(s)}{s}\right\} = g(t) - g(t-T)$$
$$H(z) = \mathcal{Z}\!\left(\mathcal{L}^{-1}\{H_0(s)H(s)\}\big|_{t=kT}\right) = \mathcal{Z}\big(g(t) - g(t-T)\big|_{t=kT}\big) = (1-z^{-1})\,\mathcal{Z}\big(g(t)\big|_{t=kT}\big)$$
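For $H(s) = a/(s+a)$ this recipe gives a closed form: $G(s) = a/(s(s+a)) = 1/s - 1/(s+a)$, so $g(t) = 1 - e^{-at}$ and $H(z) = (1 - e^{-aT})/(z - e^{-aT})$. A sketch comparing this hand result with c2d, assuming the Control System Toolbox is available (a and T are illustrative values):

a = 3; T = 0.01;
Hz_hand = tf(1 - exp(-a*T), [1 -exp(-a*T)], T)   % (1-e^{-aT})/(z-e^{-aT})
Hz_zoh  = c2d(tf(a, [1 a]), T, 'zoh')            % should match Hz_hand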
Example: MATLAB Code for the ZOH design

clear all, close all, clc,

tStart = 0; tStep = 1e-3; tEnd = 1;


t = tStart:tStep:tEnd; % Time domain

x = @(t)(sin(2*pi*t)); % Original signal


% x = @(t)(4*exp(3*t));

Ts = 3e-2; % Sampling period

zohImpl = ones(1,Ts/tStep); % Impulse ZOH ℎ0 (𝑡)

nSamples = tEnd/Ts;
samples = 0:nSamples; % Samples

xSampled = zeros(1,length(t));
xSampled(1:Ts/tStep:end) = x(samples*Ts); % Sampled signal

% Convolution with impulse response


xZoh1 = conv(zohImpl,xSampled);
xZoh1 = xZoh1(1:length(t));

figure(3);
hold all;
plot(t,x(t),'-r','linewidth',3);
stem(t,xSampled);
plot(t,xZoh1,'-b','linewidth',3);
Other programs can be written along the same lines for the design of the ZOH block.
CHAPTER II:

Interpolation and Curve Fitting

There are two topics to be dealt with in this chapter, namely,
interpolation and curve fitting. Interpolation is to connect discrete data points in a
plausible way so that one can get reasonable estimates of data points between the given
points. The interpolation curve goes through all data points. Curve fitting, on the other
hand, is to find a curve that could best indicate the trend of a given set of data. The
curve does not have to go through the data points. In some cases, the data may have
different accuracy/reliability/uncertainty and we need the weighted least-squares curve
fitting to process such data.

If we estimate the values of the unknown function at the points that are inside/outside
the range of collected data points, we call it the interpolation/extrapolation.

For a given set of 𝑁 + 1 data points {(𝑥0, 𝑦0), (𝑥1, 𝑦1), . . . , (𝑥𝑁, 𝑦𝑁)}, we want to find the
coefficients of an 𝑁th-degree polynomial function to match them:
𝑃𝑁 (𝑥) = 𝑎0 + 𝑎1 𝑥 + 𝑎2 𝑥 2 + ⋯ + 𝑎𝑁 𝑥 𝑁

The coefficients can be obtained by solving the following system of linear equations.

𝑎0 + 𝑥0 𝑎1 + 𝑥02 𝑎2 + ⋯ + 𝑥0𝑁 𝑎𝑁 = 𝑦0
𝑎0 + 𝑥1 𝑎1 + 𝑥12 𝑎2 + ⋯ + 𝑥1𝑁 𝑎𝑁 = 𝑦1
.................................................
𝑎0 + 𝑥𝑁 𝑎1 + 𝑥𝑁2 𝑎2 + ⋯ + 𝑥𝑁𝑁 𝑎𝑁 = 𝑦𝑁

But, as the number of data points increases, so does the number of unknown variables
and equations; consequently, the system may not be easy to solve. That is why we look for
alternatives for getting the coefficients {𝑎0, 𝑎1, . . . , 𝑎𝑁}. One of the alternatives is to make
use of the Lagrange polynomials (a direct numerical solution of the system above is
sketched first, below).
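Before turning to Lagrange polynomials, here is a minimal sketch of the direct approach just described, solving the Vandermonde system (fine for small N, but increasingly ill-conditioned as N grows); the data points are the ones reused in the Lagrange code examples below:

x = [-2 -1 1 2]'; y = [-6 0 0 6]';
c = vander(x) \ y;        % polynomial coefficients in descending powers of x
polyval(c, 1.5)           % interpolant (here x^3 - x) at x = 1.5, giving 1.875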

Let us define a polynomial of degree 𝑛 associated with each point 𝑥𝑗 as
$$P_j(x) = A_j (x-x_0)(x-x_1)(x-x_2)\cdots(x-x_{j-1})(x-x_{j+1})\cdots(x-x_n)$$
where 𝐴𝑗 is a constant, and where the factor $(x - x_j)$ is omitted. The defined polynomial
may be written in shorter notation as
$$P_j(x) = A_j \prod_{\substack{i=0 \\ i\neq j}}^{n} (x - x_i)$$

When 𝑥 is equal to any of the 𝑥𝑖 corresponding to a data point (except 𝑥𝑗), the value of
this polynomial is zero, since the factor $(x - x_i)$ vanishes there. However, when $x = x_j$,
the value of the polynomial is not zero, since the factor $(x - x_j)$ is missing. Thus, if we
denote one of the data points as $x_k$,
$$P_j(x_k) = \begin{cases} 0 & k \neq j \\[4pt] A_j \prod_{\substack{i=0\\ i\neq j}}^{n}(x_k - x_i) & k = j \end{cases}$$
If $A_j$ is defined as
$$A_j = \frac{1}{\prod_{\substack{i=0\\ i\neq j}}^{n}(x_j - x_i)} \qquad\text{then}\qquad P_j(x_k) = \begin{cases} 0 & k \neq j \\ 1 & k = j \end{cases}$$

Thus the functions $P_j(x)$ form a set of $n^{th}$-degree polynomials, defined in such a way
that each one is zero at every data point except the single point $x_k$ with $k = j$. The
$P_j(x)$ are called Lagrange polynomials.

We now form the following linear combination of the $P_j(x)$: $P_n(x) = \sum_{j=0}^{n} f(x_j)\,P_j(x)$. Since
it is a linear combination of $n^{th}$-degree polynomials, it is also an $n^{th}$-degree polynomial. If
we select any one of the points at which data are available, say $x_2$, then

𝑃𝑛 (𝑥2 ) = 𝑓(𝑥0 )𝑃0 (𝑥2 ) + 𝑓(𝑥1 )𝑃1 (𝑥2 ) + 𝑓(𝑥2 )𝑃2 (𝑥2 ) + ⋯ + 𝑓(𝑥𝑛 )𝑃𝑛 (𝑥2 )

But since each of the 𝑃𝑗 (𝑥2 ) is zero except for 𝑃2 (𝑥2 ) which equals 1,

𝑃𝑛 (𝑥2 ) = 𝑓(𝑥2 )𝑃2 (𝑥2 ) = 𝑓(𝑥2 )

At 𝑥2 ,this 𝑛𝑡ℎ degree polynomial yields the value 𝑓(𝑥2 ). It can easily be seen that for any
of the data points 𝑥𝑖 , the polynomial becomes 𝑓(𝑥𝑖 ). Thus the polynomial 𝑃𝑛 (𝑥) is the
desired polynomial of degree 𝑛 which exactly fits the 𝑛 + 1 data points.

Example: Interpolation with this polynomial is best illustrated with an example.
Consider the following set of data:

i        0   1   2   3
𝒙𝒊       1   2   4   8
𝒇(𝒙𝒊)    1   3   7   11
Suppose we wish to interpolate for 𝑓(7). Then

𝑃3 (7) = 1𝑃0 (7) + 3𝑃1 (7) + 7𝑃2 (7) + 11𝑃3 (7)

(7 − 2)(7 − 4)(7 − 8)
𝑃0 (7) = = 0.71429
(1 − 2)(1 − 4)(1 − 8)

(7 − 1)(7 − 4)(7 − 8)
𝑃1 (7) = = −1.5
(2 − 1)(2 − 4)(2 − 8)

(7 − 1)(7 − 2)(7 − 8)
𝑃2 (7) = = 1.25
(4 − 1)(4 − 2)(4 − 8)

$$P_3(7) = \frac{(7-1)(7-2)(7-4)}{(8-1)(8-2)(8-4)} = 0.53571$$
$$f(7) \approx P_3(7) = 0.71429 + 3(-1.5) + 7(1.25) + 11(0.53571) = 10.85710$$

This is the value of the interpolating polynomial at $x = 7$.
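The hand computation can be cross-checked with polyfit, which fits the unique third-degree polynomial through the four points (a quick sketch):

xi = [1 2 4 8]; fi = [1 3 7 11];
p  = polyfit(xi, fi, 3);     % interpolating cubic (degree = number of points - 1)
polyval(p, 7)                % returns approximately 10.8571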

Remark

1. It should be noted that the polynomial interpolation of this type can be dangerous
toward the center of the regions where the independent variable is widely spaced.
2. One of the alternatives is to make use of the Lagrange polynomials as:

𝑃𝑛 (𝑥2 ) = 𝑓(𝑥0 )𝑃0 (𝑥2 ) + 𝑓(𝑥1 )𝑃1 (𝑥2 ) + 𝑓(𝑥2 )𝑃2 (𝑥2 ) + ⋯ + 𝑓(𝑥𝑛 )𝑃𝑛 (𝑥2 )
𝑛

𝑃𝑛 (𝑥) = ∑ 𝑓(𝑥𝑗 ) 𝑃𝑗 (𝑥)


𝑗=0
With

$$P_j(x) = \frac{\prod_{\substack{i=0\\ i\neq j}}^{n}(x-x_i)}{\prod_{\substack{i=0\\ i\neq j}}^{n}(x_j-x_i)} = \prod_{\substack{i=0\\ i\neq j}}^{n}\frac{x-x_i}{x_j-x_i}$$

Now, we have the MATLAB routine “lagranp()” which finds us the coefficients of
Lagrange polynomial together with each Lagrange coefficient polynomial 𝑃𝑗 (𝑥). In order
to understand this routine, you should know that MATLAB deals with polynomials as
their coefficient vectors arranged in descending order and the multiplication of two
polynomials corresponds to the convolution of the coefficient vectors

function [l,L] = lagranp(x,y)


%Input : x = [x0 x1 ... xN], y = [y0 y1 ... yN]
%Output: l = Lagrange polynomial coefficients of degree N
% L = Lagrange coefficient polynomial
N = length(x)-1; %the degree of polynomial
l = 0;
for m = 1:N + 1
P = 1;
for k = 1:N + 1
if k ~= m, P = conv(P,[1 -x(k)])/(x(m)-x(k));
end
end
L(m,:) = P; %Lagrange coefficient polynomial
l = l + y(m)*P; %Lagrange polynomial
end
%---------------------------------------------------------------------%
%do_lagranp.m
x = [-2 -1 1 2]; y = [-6 0 0 6]; % given data points
l = lagranp(x,y) % find the Lagrange polynomial
xx = [-2: 0.02 : 2]; yy = polyval(l,xx); %interpolate for [-2,2]
clf, plot(xx,yy,'b', x,y,'*') %plot the graph
You can use the following code directly without function

clear all, clc,

x = [-2 -1 1 2]; y = [-6 0 0 6]; % given data points


xx = [-2: 0.02 : 2];

N = length(x)-1; %the degree of polynomial


l = 0;
for m = 1:N + 1
P = 1;
for k = 1:N + 1
if k ~= m, P = conv(P,[1 -x(k)])/(x(m)-x(k));
end
end
L(m,:) = P; %Lagrange coefficient polynomial
l = l + y(m)*P; %Lagrange polynomial
end

yy = polyval(l,xx); %interpolate for [-2,2]


plot(xx,yy,'linewidth',3)
hold on
stem(x,y)
grid on
We make the MATLAB program “do_lagranp.m” to use the routine “lagranp()” for finding
the third-degree polynomial 𝑙3 (𝑥) which matches the four given points

{(−2,−6), (−1, 0), (1, 0), (2, 6)}

and to check if the graph of 𝑙3 (𝑥) really passes the four points. The results from
running this program are depicted in Fig.

Example: Consider the following set of data:

i        0   1   2
𝒙𝒊       1   2   3
𝒇(𝒙𝒊)    1   2   2
Suppose we wish to interpolate for 𝑓(1.5). Then

𝑃2 (𝑥) = 1𝑃0 (𝑥) + 2𝑃1 (𝑥) + 2𝑃2 (𝑥)

(𝑥 − 2)(𝑥 − 3) 1 2
𝑃0 (𝑥) = = (𝑥 − 5𝑥 + 6)
(1 − 2)(1 − 3) 2

(𝑥 − 1)(𝑥 − 3)
𝑃1 (𝑥) = = −𝑥 2 + 4𝑥 − 3
(2 − 1)(2 − 3)

(𝑥 − 1)(𝑥 − 2) 1 2
𝑃2 (𝑥) = = (𝑥 − 3𝑥 + 2)
(3 − 1)(3 − 2) 2

1 2 1
𝑓(𝑥) ≈ (𝑥 − 5𝑥 + 6) + 2(−𝑥 2 + 4𝑥 − 3) + 2 [ (𝑥 2 − 3𝑥 + 2)]
2 2
That is
1 5
𝑓(𝑥) ≈ (− 𝑥 2 + 𝑥 − 1)
2 2

𝑓(1.5) ≈ 1.625

This is the value of the interpolating polynomial at 𝑥 = 1.5

Example: Consider the following set of discrete data of the function $f(x) = \sin\!\left(\frac{\pi x}{2}\right)(x^2+3)$:

i        0    1   2   3
𝒙𝒊       -1   1   2   3
𝒇(𝒙𝒊)    -4   4   0   -12

We want to find a Lagrange polynomial that fit the tabulated data such that:

𝑃(−1) = −4, 𝑃(1) = 4, 𝑃(2) = 0, and 𝑃(3) = −12,

The next figure shows exactly how a Lagrange polynomial (𝑑𝑒𝑔𝑟𝑒𝑒 < 3)fit the given data.
Although Lagrange's method is conceptually simple, it does not lend itself to an efficient
algorithm. A better computational procedure is obtained with Newton's method. The
Newton method provides a means of developing an nth-order interpolation polynomial
without requiring the solution of a set of simultaneous linear equations. Here the
interpolating polynomial is written in the form
$$n_N(x) = a_0 + a_1(x-x_0) + a_2(x-x_0)(x-x_1) + \cdots = n_{N-1}(x) + a_N(x-x_0)(x-x_1)\cdots(x-x_{N-1}), \qquad n_0(x) = a_0$$

In order to derive a formula to find the successive coefficients {𝑎0 , 𝑎1 , . . . , 𝑎𝑁 } that make
this equation accommodate the data points, we will determine 𝑎0 and 𝑎1 so that

𝑛1 (𝑥) = 𝑛0 (𝑥) + 𝑎1 (𝑥 − 𝑥0 )

Matches the first two data points (𝑥0 , 𝑦0 ) and (𝑥1 , 𝑦1 ). We need to solve the two equations

𝑛1 (𝑥0 ) = 𝑎0 + 𝑎1 (𝑥0 − 𝑥0 ) = 𝑦0

𝑛1 (𝑥1 ) = 𝑎0 + 𝑎1 (𝑥1 − 𝑥0 ) = 𝑦1

To get
$$a_0 = y_0, \qquad a_1 = \frac{y_1 - a_0}{x_1 - x_0} = \frac{y_1 - y_0}{x_1 - x_0} = df_0$$

Starting from this first-degree Newton polynomial, we can proceed to the second degree
Newton polynomial: 𝑛2 (𝑥) = 𝑛1 (𝑥) + 𝑎2 (𝑥 − 𝑥0 )(𝑥 − 𝑥1 ) = 𝑎0 + 𝑎1 (𝑥 − 𝑥0 ) + 𝑎2 (𝑥 − 𝑥0 )(𝑥 − 𝑥1 )
Which, with the same coefficients 𝑎0 𝑎𝑛𝑑 𝑎1 , still matches the first two data points (𝑥0 , 𝑦0 )
and (𝑥1 , 𝑦1 ), since the additional (third) term is zero at (𝑥0 , 𝑦0 ) and (𝑥1 , 𝑦1 ). This is to say
that the additional polynomial term does not disturb the matching of previous existing
data. Therefore, given the additional matching condition for the third data point (𝑥2 , 𝑦2 ),
we only have to solve 𝑛2 (𝑥) = 𝑎0 + 𝑎1 (𝑥2 − 𝑥0 ) + 𝑎2 (𝑥2 − 𝑥0 )(𝑥2 − 𝑥1 )
for only one more coefficient $a_2$, to get
$$a_2 = \frac{y_2 - a_0 - a_1(x_2-x_0)}{(x_2-x_0)(x_2-x_1)} = \frac{y_2 - y_0 - \frac{y_1-y_0}{x_1-x_0}(x_2-x_0)}{(x_2-x_0)(x_2-x_1)}$$
$$a_2 = \frac{\dfrac{y_2-y_1}{x_2-x_1} - \dfrac{y_1-y_0}{x_1-x_0}}{x_2-x_0} = \frac{df_1 - df_0}{x_2-x_0} = d^2 f_0$$

Generalizing these results yields the formula to get the 𝑁 𝑡ℎ coefficient 𝑎𝑁 of the Newton
polynomial function as

$$a_N = \frac{d^{N-1}f_1 - d^{N-1}f_0}{x_N - x_0} = d^N f_0$$

This is the divided difference, which can be obtained successively from the second row
of Table.

𝑥𝑘   𝑦𝑘   𝑑𝑓𝑘                         𝑑²𝑓𝑘                          𝑑³𝑓𝑘
𝑥0   𝑦0   𝑑𝑓0 = (𝑦1−𝑦0)/(𝑥1−𝑥0)     𝑑²𝑓0 = (𝑑𝑓1−𝑑𝑓0)/(𝑥2−𝑥0)     𝑑³𝑓0 = (𝑑²𝑓1−𝑑²𝑓0)/(𝑥3−𝑥0)
𝑥1   𝑦1   𝑑𝑓1 = (𝑦2−𝑦1)/(𝑥2−𝑥1)     𝑑²𝑓1 = (𝑑𝑓2−𝑑𝑓1)/(𝑥3−𝑥1)
𝑥2   𝑦2   𝑑𝑓2 = (𝑦3−𝑦2)/(𝑥3−𝑥2)
𝑥3   𝑦3

function [n,DD] = newtonp(x,y)


%Input : x = [x0 x1 ... xN]
% y = [y0 y1 ... yN]
%Output: n = Newton polynomial coefficients of degree N
N = length(x)-1;
DD = zeros(N+1,N + 1);
DD(1:N+1,1) = y';
for k = 2:N+1
for m = 1: N+2 - k %Divided Difference Table
DD(m,k) = (DD(m+1,k-1) - DD(m,k-1))/(x(m+k-1)- x(m));
end
end
a = DD(1,:);
n = a(N+1);
for k = N:-1:1
n = [n a(k)] - [0 n*x(k)];
end
You can use the following code directly without function

clear all,clc
x = [-2 -1 1 2]; y = [-6 0 0 6]; xx = [-2: 0.02 : 2];

N = length(x)-1; DD = zeros(N+1,N+1); DD(1:N+1,1) = y';


for k = 2:N+1
for m = 1: N+2 - k %Divided Difference Table
DD(m,k) = (DD(m+1,k-1) - DD(m,k-1))/(x(m+k-1)- x(m));
end
end
a = DD(1,:); n = a(N+1);
for k = N:-1:1
n = [n a(k)] - [0 n*x(k)];
end

yy = polyval(n,xx); %interpolate for [-2,2]


plot(xx,yy,'linewidth',3)
hold on
stem(x,y)
grid on

The advantage of the Newton Method is that it does not require the solution of 𝑛
simultaneous equations with 𝑛 unknowns.

Example: Suppose we are to find a Newton polynomial matching the following data
points {(−2, −6), (−1,0), (1,0), (2,6), (4,60)}. From these data points, we construct the
divided difference table as

𝑥𝑘   𝑦𝑘   𝑑𝑓𝑘                      𝑑²𝑓𝑘                   𝑑³𝑓𝑘                     𝑑⁴𝑓𝑘
−2   −6   (0−(−6))/(−1−(−2)) = 6   (0−6)/(1−(−2)) = −2    (2−(−2))/(2−(−2)) = 1    (1−1)/(4−(−2)) = 0
−1    0   (0−0)/(1−(−1)) = 0       (6−0)/(2−(−1)) = 2     (7−2)/(4−(−1)) = 1
 1    0   (6−0)/(2−1) = 6          (27−6)/(4−1) = 7
 2    6   (60−6)/(4−2) = 27
 4   60

and then use this table to get the Newton polynomial as follows:

𝑛(𝑥) = 𝑦0 + 𝐷𝑓0 (𝑥 − 𝑥0 ) + 𝐷2 𝑓0 (𝑥 − 𝑥0 )(𝑥 − 𝑥1 ) + 𝐷3 𝑓0 (𝑥 − 𝑥0 )(𝑥 − 𝑥1 )(𝑥 − 𝑥2 ) + 0

= −6 + 6(𝑥 − (−2)) − 2(𝑥 − (−2))(𝑥 − (−1)) + (𝑥 − (−2))(𝑥 − (−1))(𝑥 − 1)

= 𝑥3 − 𝑥
We might begin with not necessarily the first data point, but, say, the third one (1,0),
and proceed as follows to end up with the same result.

𝑛(𝑥) = 𝑦2 + 𝐷𝑓2 (𝑥 − 𝑥2 ) + 𝐷2 𝑓2 (𝑥 − 𝑥2 )(𝑥 − 𝑥3 ) + 𝐷3 𝑓2 (𝑥 − 𝑥2 )(𝑥 − 𝑥3 )(𝑥 − 𝑥4 ) + 0

= 0 + 6(𝑥 − 1) + 7(𝑥 − 1)(𝑥 − 2) + (𝑥 − 1)(𝑥 − 2)(𝑥 − 4)

= 𝑥3 − 𝑥

Example: Fit a quadratic polynomial to the following set of data using the Newton
interpolation method. Using this polynomial, approximate the function 𝑓(𝑥) when 𝑥 is
equal to 2.2 and 2.8.

i        0   1   2
𝒙𝒊       1   2   3
𝒚(𝒙𝒊)    1   8   27

𝑛(𝑥) = 𝑎0 + 𝑎1 (𝑥 − 𝑥0 ) + 𝑎2 (𝑥 − 𝑥0 )(𝑥 − 𝑥1 )

where
$$a_0 = y_0, \qquad a_1 = df_0 = \frac{y_1-y_0}{x_1-x_0}, \qquad a_2 = d^2 f_0 = \frac{df_1 - df_0}{x_2-x_0}, \qquad df_1 = \frac{y_2-y_1}{x_2-x_1}$$

Or we can write
$$a_0 = y_0, \qquad a_1 = df_0 = \frac{y_1-y_0}{x_1-x_0}, \qquad a_2 = \frac{y_2 - y_0 - \frac{y_1-y_0}{x_1-x_0}(x_2-x_0)}{(x_2-x_0)(x_2-x_1)}$$

Hence, the polynomial is given by: 𝑛(𝑥) = 1 + 7(𝑥 − 1) + 6(𝑥 − 1)(𝑥 − 2) = 6𝑥 2 − 11𝑥 + 6

For 𝑥 = 2.2, 𝑦(2.2) = 𝑛(2.2) = 1 + 7(2.2 − 1) + 6(2.2 − 1)(2.2 − 2) = 10.84


For 𝑥 = 2.8, 𝑦(2.8) = 𝑛(2.8) = 1 + 7(2.8 − 1) + 6(2.8 − 1)(2.8 − 2) = 22.24

One of the difficulties with conventional polynomial interpolation, particularly if the
polynomial is of high order, is the highly inflected or ''wiggly'' character which the
interpolating polynomial can assume. A smoother interpolating function can usually be
produced by mechanical means such as a French curve or, more to the point of this
discussion, by forcing a flexible bar to pass through the desired points. The mathematical
analog of this flexible elastic bar is the cubic spline function.
Cubic spline interpolation is a special case for Spline interpolation that is used very
often to avoid the problem of Runge's phenomenon (a problem of oscillation). This
method gives an interpolating polynomial that is smoother and has smaller error than
some other interpolating polynomials such as Lagrange polynomial and Newton
polynomial. In spline interpolation the interpolant is a piecewise polynomial called a
spline. Interpolation using cubic spline is currently very popular, particularly for
interpolation in relatively noise free tables of physical properties.

The construction of a cubic spline interpolating function can be briefly described as
follows. We are given a series of points $x_i$ ($i = 0, 1, 2, \ldots, n$), which are in general not
evenly spaced, and the corresponding functional values $f(x_i)$. Now consider two arbitrary
adjacent points $x_i$ and $x_{i+1}$. We wish to fit a cubic to these two points and use this cubic
as the interpolating function between them. We denote this cubic as:
$$F_i(x) = a_0 + a_1 x + a_2 x^2 + a_3 x^3 \qquad (x_i \le x \le x_{i+1})$$

There are clearly 4 unknown constants and only two conditions are immediately
obvious, namely that 𝐹𝑖 (𝑥𝑖 ) = 𝑓(𝑥𝑖 ) and 𝐹𝑖 (𝑥𝑖+1 ) = 𝑓(𝑥𝑖+1 ). We are free to choose the two
remaining conditions as we like, to accomplish our desired objective of ‘‘smoothness.’’
The most effective approach is to match the first and second derivatives (and thus the
slope and curvature) of $F_i(x)$ to those of the cubic $F_{i-1}(x)$ used for interpolation on the
adjacent interval $(x_{i-1} \le x \le x_i)$. If this procedure is carried out for all
intervals in the region 𝑥0 ≤ 𝑥 ≤ 𝑥𝑛 ( with special treatment at the end points as we will
discuss later ), then an approximating function for the region will have been
constructed. Consisting of the set of cubics 𝐹𝑖 (𝑥) (𝑖 = 0,1,2, … , 𝑛 − 1)

We denote this approximating function for the entire region as g(𝑥) and call it a cubic
spline.

To actually construct g(𝑥), it is convenient to note that due to the matching of second
derivatives of the cubics at each point 𝑥𝑖 , the second derivative of g(𝑥) is continuous
over the entire region 𝑥0 ≤ 𝑥 ≤ 𝑥𝑛 . This second derivative might appear as shown in fig.
Notice that the second derivative varies linearly over each interval ( the second
derivative of cubic is straight line ). Due to this linearity, the second derivative at any
point 𝑥, where 𝑥𝑖 ≤ 𝑥 ≤ 𝑥𝑖+1, is given by

$$\frac{g''(x) - g''(x_i)}{x - x_i} = \frac{g''(x_{i+1}) - g''(x_i)}{x_{i+1} - x_i} = \text{slope}$$
$$g''(x) = g''(x_i) + \big[g''(x_{i+1}) - g''(x_i)\big]\frac{x - x_i}{x_{i+1} - x_i}$$

Integrating this equation twice and applying the condition that g(𝑥𝑖 ) = 𝑓(𝑥𝑖 ) 𝑎𝑛𝑑
g(𝑥𝑖+1 ) = 𝑓(𝑥𝑖+1 ), we find that for 𝑥𝑖 ≤ 𝑥 ≤ 𝑥𝑖+1 .

$$g(x) = F_i(x) = \frac{g''(x_i)}{6}\left[\frac{(x_{i+1}-x)^3}{\Delta x_i} - \Delta x_i (x_{i+1}-x)\right] + \frac{g''(x_{i+1})}{6}\left[\frac{(x-x_i)^3}{\Delta x_i} - \Delta x_i (x-x_i)\right]$$
$$\qquad + f(x_i)\left[\frac{x_{i+1}-x}{\Delta x_i}\right] + f(x_{i+1})\left[\frac{x-x_i}{\Delta x_i}\right]$$

Where ∆𝑥𝑖 = 𝑥𝑖+1 − 𝑥𝑖 , this equation provides the interpolating cubics over each interval
for 𝑖 = 0,1,2, … , 𝑛 − 1 . Since the second derivatives g ′′ (𝑥𝑖 ) 𝑖 = 0,1,2, … , 𝑛 are still
unknown, these must be evaluated before we can use this equation.

The second derivatives can be found by using the derivative matching conditions:
$$F'_i(x_i) = F'_{i-1}(x_i), \qquad F''_i(x_i) = F''_{i-1}(x_i)$$

Applying these conditions to equation g(𝑥) = 𝐹𝑖 (𝑥) for 𝑖 = 0,1,2, … , 𝑛 − 1 and collecting
terms yields a set of simultaneous equations of the form:

$$\left[\frac{\Delta x_{i-1}}{\Delta x_i}\right] g''(x_{i-1}) + \left[\frac{2(x_{i+1}-x_{i-1})}{\Delta x_i}\right] g''(x_i) + g''(x_{i+1}) = 6\left[\frac{f(x_{i+1}) - f(x_i)}{(\Delta x_i)^2} - \frac{f(x_i) - f(x_{i-1})}{\Delta x_i\,\Delta x_{i-1}}\right]$$
If the $x_i$ are evenly spaced with spacing $\Delta x$, then the last equation simplifies considerably
and becomes
$$g''(x_{i-1}) + 4g''(x_i) + g''(x_{i+1}) = 6\left[\frac{f(x_{i+1}) - 2f(x_i) + f(x_{i-1})}{(\Delta x)^2}\right]$$

Whichever form the equations take, there are $n-1$ equations in the $n+1$ unknowns
$g''(x_0), g''(x_1), \ldots, g''(x_n)$. The two necessary additional equations are obtained by
specifying conditions on $g''(x_0)$ and $g''(x_n)$. It is usually simply specified that
$$g''(x_0) = 0, \qquad g''(x_n) = 0$$
(this choice gives the so-called natural cubic spline).

Example: We will illustrate cubic spline interpolation with an example. Suppose the
following unevenly spaced tabulated function is given:

i     𝒙𝒊    𝒇(𝒙𝒊)
0     1     4
1     4     9
2     6     15
3     9     7
4     10    3

We will approximate 𝑓(5) by interpolation on a natural cubic spline. First we set
$g''(1) = g''(10) = 0$. Now writing the equation for $i = 1$:
$$\left[\frac{4-1}{6-4}\right](0) + \left[\frac{2(6-1)}{6-4}\right]g''(4) + g''(6) = 6\left[\frac{15-9}{(6-4)^2} - \frac{9-4}{(6-4)(4-1)}\right]$$

Or 5g ′′ (4) + g ′′ (6) = 4

Similarly, we find for 𝑖 = 2 that 0.66667g ′′ (4) + 3.33333g ′′ (6) + g ′′ (9) = −11.33333

And for 𝑖 = 3,

3g ′′ (6) + 8g ′′ (9) = −8

In matrix form we can write
$$\begin{bmatrix} 5 & 1 & 0 \\ 0.66667 & 3.33333 & 1 \\ 0 & 3 & 8 \end{bmatrix}\begin{bmatrix} g''(4) \\ g''(6) \\ g''(9) \end{bmatrix} = \begin{bmatrix} 4 \\ -11.33333 \\ -8 \end{bmatrix}$$

Solving these three equations simultaneously, we find

g ′′ (4) = 1.56932, g ′′ (6) = −3.84661, g ′′ (9) = 0.44248

Since we wish to approximate 𝑓(5), we must use the cubic $F_1(x)$, which is appropriate
for the interval $4 \le x \le 6$. Hence
$$F_1(5) = \frac{g''(4)}{6}\left[\frac{(6-5)^3}{6-4} - (6-4)(6-5)\right] + \frac{g''(6)}{6}\left[\frac{(5-4)^3}{6-4} - (6-4)(5-4)\right] + 9\left[\frac{6-5}{6-4}\right] + 15\left[\frac{5-4}{6-4}\right] = 12.56932$$

This is the desired interpolated value.


Polynomial extrapolation (interpolating outside the range of the data points) is
dangerous. As an example, consider the figure. There are six data points, shown as
circles. The fifth-degree interpolating polynomial is represented by the solid line. The
interpolant looks fine within the range of data points, but drastically departs from the
obvious trend when 𝑥 > 12. Extrapolating y at 𝑥 = 14, for example, would be absurd in
this case.

If extrapolation cannot be avoided, the following two measures can be useful:

⦁ Plot the data and visually verify that the extrapolated value makes sense.
⦁ Use a low-order polynomial based on nearest-neighbor data points. A linear or
quadratic interpolant, for example, would yield a reasonable estimate of 𝑦(14) for the
data in Fig.

If the data are obtained from experiments,


they typically contain a significant amount of random noise due to measurement errors.
The task of curve fitting is to find a smooth curve that fits the data points “on the
average.” This curve should have a simple form (e.g. a low-order polynomial), so as to
not reproduce the noise.

Noting that many data are susceptible to some error, we don’t have to try to find a
function passing exactly through every point. Instead of pursuing the exact matching at
every data point, we look for an approximate function (not necessarily a polynomial)
that describes the data points as a whole with the smallest error in some sense, which
is called the curve fitting.

As a reasonable means, we consider the least-squares (LS) approach of minimizing the
sum of squared errors, where the error is described by the vertical distance from the
data points to the curve. We will look over various types of fitting functions in this section.
Linear regression
attempts to model the relationship between two variables by fitting a linear equation to
observed data. One variable is considered to be an explanatory variable, and the other
is considered to be a dependent variable.

Method 1: If there is some theoretical basis on which we believe the relationship
between the two variables to be $\theta_1 x + \theta_0 = y$, we should set up the following system of
equations from the collection of many experimental data:
$$\begin{aligned} \theta_1 x_1 + \theta_0 &= y_1 \\ \theta_1 x_2 + \theta_0 &= y_2 \\ &\;\;\vdots \\ \theta_1 x_M + \theta_0 &= y_M \end{aligned} \qquad\text{or}\qquad A\boldsymbol{\theta} = \mathbf{y} \;\;\text{with}\;\; A = \begin{bmatrix} x_1 & 1 \\ x_2 & 1 \\ \vdots & \vdots \\ x_M & 1 \end{bmatrix}, \;\; \boldsymbol{\theta} = \begin{bmatrix} \theta_1 \\ \theta_0 \end{bmatrix}, \;\; \mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_M \end{bmatrix}$$

Noting that this corresponds to the overdetermined case (the number of equations
exceeds the number of parameters), we resort to the least-squares (LS) solution, which
is based on minimizing the objective function:
$$J = \|\mathbf{e}\|^2 = \|A\boldsymbol{\theta} - \mathbf{y}\|^2 = [A\boldsymbol{\theta} - \mathbf{y}]^T[A\boldsymbol{\theta} - \mathbf{y}] = \boldsymbol{\theta}^T A^T A\boldsymbol{\theta} - \boldsymbol{\theta}^T A^T\mathbf{y} - \mathbf{y}^T A\boldsymbol{\theta} + \mathbf{y}^T\mathbf{y}$$
$J$ is minimum if
$$\frac{\partial J}{\partial\boldsymbol{\theta}} = 0 \;\Longrightarrow\; \boldsymbol{\theta}_{\text{estimate}} = [A^T A]^{-1}A^T\mathbf{y}$$

Remarks: Here we have $m$ equations with $n$ unknowns; this is the overdetermined case.
From the linear algebra point of view, the system of equations $A\mathbf{x} = \mathbf{b}$ has a solution if
$\mathbf{b} \in \mathrm{range}(A)$, in other words if $\mathbf{b}$ is a linear combination of the columns of $A$, because
$$A\mathbf{x} = \mathbf{b} \;\Longleftrightarrow\; [\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_n]\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = \mathbf{b} \;\Longleftrightarrow\; \mathbf{b} = \mathbf{a}_1 x_1 + \mathbf{a}_2 x_2 + \cdots + \mathbf{a}_n x_n \;\Longrightarrow\; \mathbf{b} \in \mathrm{range}(A)$$
Or, equivalently, $\mathrm{rank}[A \,\vdots\, \mathbf{b}] = \mathrm{rank}[A]$. If $\mathrm{rank}[A \,\vdots\, \mathbf{b}] = \mathrm{rank}[A]$, the system has either a
unique solution or infinitely many solutions.

Least squares is concerned with the case where no exact solution exists.
Question: Since we cannot find $\mathbf{x}$ satisfying $A\mathbf{x} = \mathbf{b}$, can we find $\mathbf{x}$ such that $A\mathbf{x}$ is as
close as possible to $\mathbf{b}$? Writing the equation as $A\mathbf{x} + \mathbf{E} = \mathbf{b}$, we want the $\mathbf{x}$ that
minimizes the distance between $A\mathbf{x}$ and $\mathbf{b}$, i.e. the smallest $\mathbf{E}$. Using the orthogonality
principle, the optimal $\mathbf{x}$ is the one such that $\mathbf{E} \perp A\mathbf{x} \Rightarrow \mathbf{E}^T A\mathbf{x} = 0$, or $[\mathbf{b} - A\mathbf{x}]^T A\mathbf{x} = 0$:
$$[\mathbf{b} - A\mathbf{x}]^T A\mathbf{x} = 0 \;\Longrightarrow\; [\mathbf{b}^T - \mathbf{x}^T A^T]A\mathbf{x} = 0 \;\Longrightarrow\; \mathbf{b}^T A\mathbf{x} = \mathbf{x}^T A^T A\mathbf{x} \;\Longrightarrow\; \mathbf{b}^T A = \mathbf{x}^T A^T A$$
Now, if $\mathrm{inv}(A^T A)$ exists, one can write $\mathbf{x} = [A^T A]^{-1}A^T\mathbf{b}$. Note that $A$ is an $m \times n$ matrix,
so $A$ itself has no inverse.

Assume $\mathrm{rank}(A) = n$; then $\mathrm{rank}[(A^T A)_{n\times n}] = n$, i.e. full rank. Therefore $(A^T A)_{n\times n}$ has an
inverse and $\mathbf{x} = (A^T A)^{-1}A^T\mathbf{b}$. The optimal value of $\mathbf{x}$ is $\mathbf{x}^*$, where
$$\mathbf{x}^* = (A^T A)^{-1}A^T\mathbf{b}$$

Example: We have a system of single input single output given by a linear equation

𝑦 = 𝑚𝑡 + 𝑐
𝑡: is the input of the system and 𝑦: is the output of the system

After a certain experiment, the input-output data are recorded as:

𝑖     0   1   2
𝑡𝑖    2   3   4
𝑦𝑖    3   4   15
Fit the straight line to the given data.

Solution we can rewrite the above table into a system of linear equation as:

2𝑚 + 𝑐 = 3 2 1 𝑚 3
{ 3𝑚 + 𝑐 = 4 𝑜𝑟 [ 3 1 ][ ] = [ 4 ]
4𝑚 + 𝑐 = 15 4 1 𝑐 15

It is of the form 𝐴𝑥 = 𝑏
2 1 3 𝑚
Where 𝐴 = [ 3 1 ] , 𝑏 = [ 4 ], 𝑎𝑛𝑑 𝑥=[ ]
4 1 15 𝑐

We want to get the optimal value of $x$ for the best fit. Since $\mathrm{rank}(A) < \mathrm{rank}[A \,\vdots\, b]$, the
system has no exact solution, but the best approximate one is:
$$\mathbf{x}^* = [A^T A]^{-1}A^T\mathbf{b}$$
$$\mathbf{x}^* = \begin{bmatrix} m^* \\ c^* \end{bmatrix} = [A^T A]^{-1}A^T\mathbf{b} = \begin{bmatrix} 29 & 9 \\ 9 & 3 \end{bmatrix}^{-1}\begin{bmatrix} 2 & 3 & 4 \\ 1 & 1 & 1 \end{bmatrix}\begin{bmatrix} 3 \\ 4 \\ 15 \end{bmatrix} = \begin{bmatrix} 6 \\ -\frac{32}{3} \end{bmatrix} \;\Longrightarrow\; y = 6t - \frac{32}{3}$$
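In MATLAB the same least-squares solution is returned by the backslash operator, which handles the overdetermined system directly (a quick sketch):

A = [2 1; 3 1; 4 1]; b = [3; 4; 15];
x = A \ b          % x = [6; -32/3], i.e. y = 6t - 32/3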
Method 2: Fitting a straight line $f(x) = a + bx$ to data is also known as linear regression.
In this case the function to be minimized is
$$J(a,b) = \sum_{i=1}^{n}(y_i - a - b x_i)^2$$
$$\frac{\partial J}{\partial a} = \sum_{i=1}^{n} -2(y_i - a - b x_i) = 2\left(-\sum_{i=1}^{n} y_i + na + b\sum_{i=1}^{n} x_i\right) = 0$$
$$\frac{\partial J}{\partial b} = \sum_{i=1}^{n} -2(y_i - a - b x_i)x_i = 2\left(-\sum_{i=1}^{n} x_i y_i + a\sum_{i=1}^{n} x_i + b\sum_{i=1}^{n} x_i^2\right) = 0$$

Dividing both equations by $2n$ and rearranging terms, we get
$$\bar{y} = a + b\bar{x} \qquad\text{and}\qquad a\bar{x} + \left(\frac{1}{n}\sum_{i=1}^{n} x_i^2\right)b = \frac{1}{n}\sum_{i=1}^{n} x_i y_i$$
where
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$$

are the mean values of the x and y data. The solution for the parameters is

$$a = \frac{\bar{y}\sum x_i^2 - \bar{x}\sum x_i y_i}{\sum x_i^2 - n\bar{x}^2}, \qquad b = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2}$$
These expressions are susceptible to round off errors (the two terms in each numerator
as well as in each denominator can be roughly equal). It is better to compute the
parameters from
$$b = \frac{\sum y_i(x_i - \bar{x})}{\sum x_i(x_i - \bar{x})} \qquad\text{and}\qquad a = \bar{y} - b\bar{x}$$
Example: Fit a straight line to the data shown and compute the standard deviation.

x 0.0 1.0 2.0 2.5 3.0


y 2.9 3.7 4.1 4.4 5.0

Solution The averages of the data are


$$\bar{x} = \frac{1}{5}\sum_{i=1}^{5} x_i = \frac{0.0+1.0+2.0+2.5+3.0}{5} = 1.7$$
$$\bar{y} = \frac{1}{5}\sum_{i=1}^{5} y_i = \frac{2.9+3.7+4.1+4.4+5.0}{5} = 4.02$$

The intercept a and slope b of the interpolant can now be determined

$$b = \frac{\sum y_i(x_i-\bar{x})}{\sum x_i(x_i-\bar{x})} = \frac{2.9(-1.7)+3.7(-0.7)+4.1(0.3)+4.4(0.8)+5.0(1.3)}{0.0(-1.7)+1.0(-0.7)+2.0(0.3)+2.5(0.8)+3.0(1.3)} = 0.6431$$
$$a = \bar{y} - b\bar{x} = 4.02 - 1.7(0.6431) = 2.927$$


Therefore, the regression line is f (x) = 2.927 + 0.6431x, which is shown in the figure
together with the data points.

We start the evaluation of the standard deviation by computing the residuals:

y 2.9 3.7 4.1 4.4 5.0


f (x) 2.927 3.570 4.213 4.535 4.856
y –f (x) -0.027 0.130 -0.113 -0.135 0.144

The sum of the squares of the residuals is
$$J(a,b) = \sum_{i=1}^{n}\big(y_i - f(x_i)\big)^2 = (-0.027)^2 + (0.130)^2 + (-0.113)^2 + (-0.135)^2 + (0.144)^2 = 0.06936$$

so that the standard deviation becomes

$$\sigma = \sqrt{\frac{J}{n-m}} = \sqrt{\frac{0.06936}{5-2}} = 0.1520$$

Consider the least-squares fit of the linear form
$$f(x) = a_1 f_1(x) + a_2 f_2(x) + a_3 f_3(x) + \cdots + a_m f_m(x) = \sum_{j=1}^{m} a_j f_j(x)$$

where each $f_j(x)$ is a predetermined function of $x$, called a basis function. Substitution
into the cost function yields
$$J = \sum_{i=1}^{n}\left(y_i - \sum_{j=1}^{m} a_j f_j(x_i)\right)^2$$
Thus
$$\frac{\partial J}{\partial a_k} = 0 \;\Longrightarrow\; -2\left\{\sum_{i=1}^{n}\left(y_i - \sum_{j=1}^{m} a_j f_j(x_i)\right) f_k(x_i)\right\} = 0, \qquad k = 1, 2, \ldots, m$$
Dropping the constant $(-2)$ and interchanging the order of summation, we get
$$\sum_{j=1}^{m}\left(\sum_{i=1}^{n} f_j(x_i)\,f_k(x_i)\right) a_j = \sum_{i=1}^{n} f_k(x_i)\,y_i, \qquad k = 1, 2, \ldots, m$$

In matrix notation these equations are $\mathbf{A}\boldsymbol{\theta} = \mathbf{y}$, where
$$\mathbf{A}_{kj} = \sum_{i=1}^{n} f_j(x_i)\,f_k(x_i), \qquad \mathbf{y}_k = \sum_{i=1}^{n} f_k(x_i)\,y_i$$
The equations $\mathbf{A}\boldsymbol{\theta} = \mathbf{y}$, known as the normal equations of the least-squares fit, can be
solved by many methods. Note that the coefficient matrix is symmetric, i.e., $\mathbf{A}_{kj} = \mathbf{A}_{jk}$,
and square; solving them is equivalent to the solution written before as
$\boldsymbol{\theta}_{\text{estimate}} = [A^T A]^{-1}A^T\mathbf{y}$ in terms of the original data matrix.

Example: Find the least-squares solution of the following linear system:
$$\begin{cases} x_1 + x_2 + x_3 = 4 \\ -x_1 + x_2 + x_3 = 0 \\ -x_2 + x_3 = 1 \\ x_1 + x_3 = 2 \end{cases} \;\Longleftrightarrow\; A\mathbf{x} = B \;\Longleftrightarrow\; \begin{bmatrix} 1 & 1 & 1 \\ -1 & 1 & 1 \\ 0 & -1 & 1 \\ 1 & 0 & 1 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} 4 \\ 0 \\ 1 \\ 2 \end{bmatrix}$$

so $A$ has rank 3. Since the augmented matrix $[A \,\vdots\, B]$ has rank 4, the linear system is
inconsistent. There is a unique least-squares solution in this case. To find it, first compute
$$A^T A = \begin{bmatrix} 3 & 0 & 1 \\ 0 & 3 & 1 \\ 1 & 1 & 4 \end{bmatrix} \qquad\text{and}\qquad [A^T A]^{-1} = \frac{1}{30}\begin{bmatrix} 11 & 1 & -3 \\ 1 & 11 & -3 \\ -3 & -3 & 9 \end{bmatrix}$$
Hence the least-squares solution is
$$X = [A^T A]^{-1}A^T\mathbf{y} = \frac{1}{5}\begin{bmatrix} 8 \\ 3 \\ 6 \end{bmatrix} \qquad\text{or}\qquad x_1 = \frac{8}{5},\; x_2 = \frac{3}{5},\; x_3 = \frac{6}{5}$$

Example: Find the parameters 𝐴 and 𝐵 of the given function to fit the tabulated data
with minimum error using least square
𝜋𝑥
g(𝑥) = 𝐴 + 𝐵sin ( ) and let f(𝑥) = g(𝑥) + Noise
10
where:

𝑥 0 1.0 1.5 2.3 2.5 4.0 5.1 6.0 6.5 7.0 8.1 9
𝑓(𝑥) 0.2 0.8 2.5 2.5 3.5 4.3 3.0 5.0 3.5 2.4 1.3 2.0
𝑥 9.3 11.0 11.3 12.1 13.1 14.1 15.5 16.0 17.5 17.8 19.0 20.0
𝑓(𝑥) −0.3 −1.3 −3.0 −4.0 −4.9 −4.0 −5.2 −3.0 −3.5 −1.6 −1.4 −0.1
$$E = \sum_{i=1}^{n}\left[A + B\sin\!\left(\frac{\pi x_i}{10}\right) - f(x_i)\right]^2$$
The parameters which can be varied to minimize $E$ are $A$ and $B$. We obtain two equations
in $A$ and $B$ by setting:
$$\frac{\partial E}{\partial A} = 0 \qquad\text{and}\qquad \frac{\partial E}{\partial B} = 0$$
$$\frac{\partial E}{\partial A} = 2\sum_{i=1}^{n}\left[A + B\sin\!\left(\frac{\pi x_i}{10}\right) - f(x_i)\right] = 0$$
$$\frac{\partial E}{\partial B} = 2\sum_{i=1}^{n}\left[A + B\sin\!\left(\frac{\pi x_i}{10}\right) - f(x_i)\right]\sin\!\left(\frac{\pi x_i}{10}\right) = 0$$

Collecting terms, the equations finally become


$$nA + \left[\sum_{i=1}^{n}\sin\!\left(\frac{\pi x_i}{10}\right)\right]B = \sum_{i=1}^{n} f(x_i)$$
$$\left[\sum_{i=1}^{n}\sin\!\left(\frac{\pi x_i}{10}\right)\right]A + \left[\sum_{i=1}^{n}\sin^2\!\left(\frac{\pi x_i}{10}\right)\right]B = \sum_{i=1}^{n} f(x_i)\sin\!\left(\frac{\pi x_i}{10}\right)$$

For this particular problem, $n = 24$. Calculating the coefficients, we find

24𝐴 + 1.1328096𝐵 = −1.2999996 & 1.1328096𝐴 + 11.053666𝐵 = 47.515395

Using Gauss-Jordan elimination yields $A = -0.25831225$ and $B = 4.3250821$. The
approximating function is
$$g(x) = -0.25831225 + 4.3250821\,\sin\!\left(\frac{\pi x}{10}\right)$$

Example: Find the parameters 𝛼, 𝛽 and 𝛾 of the given function to fit the tabulated data
with minimum error using least square

g(𝑥) = 𝛼 cos(𝑥) + 𝛽 sin(𝑥) + 𝛾𝑥 2

f(1:8) f(9:16) f(17:24) f(25:31)


-5.8686 -20.8674 55.3619 45.5355
-18.2064 -11.2163 61.0673 38.1746
-26.6086 -2.8404 64.0433 29.0937
-31.9865 8.4867 65.1248 23.0077
-34.7729 19.1240 65.3033 14.3336
-35.0717 30.3340 63.2208 9.1920
-33.2271 39.8808 57.3186 4.6803
-27.7786 48.2515 52.8815

clear all, clc, x = [-3: 0.2 : 3]; n=length(x); r=1.2*rand(1,n);


g=35*cos(x)+40*sin(x)+5*x.^2+r;
A=[]; Y=[];
for i=1:n
v=[cos(x(i)) sin(x(i)) x(i)*x(i)]; A=[A;v];
Y=[Y;g(i)];
end
theta=pinv(A)*Y
Now we can plot the estimated curve with the original one to verify the result:

yy = theta(1)*cos(x) + theta(2)*sin(x) + theta(3)*x.^2 + r;
plot(x,yy,'linewidth',3)
hold on
plot(x,g,'linewidth',3)   % plot the original noisy data g for comparison
grid on

Problem: assume that you have a tabulated data of some function but this data is not
arranged orderly along the x-axis, how to rearrange them using MATLAB?

clear all, clc, x=10*rand(200,1);


y=(1/(3*sqrt(pi)))*exp(-((x-5).^2)/9) + 0.02*rand(200,1);
%-----------------------------------
A=[x,y]; r=[3 6 13 14 21 24 30 31 33 35 39 40 41 44 49 53 55 56];
A(r, :) = []; B = sortrows(A);
%-----------------------------------
plot(B(:,1),B(:,2),'o','linewidth',3)
grid on
hold on
%-----------------------------------
p=polyfit(B(:,1),B(:,2),6);
F1 = polyval(p,B(:,1));
%-----------------------------------
plot(B(:,1),F1,'-b','linewidth',3)
grid on
A commonly used linear form is a polynomial. If the degree of the polynomial is m − 1, we have f(x) = ∑_{j=1}^{m} a_j x^{j−1}. Here the basis functions are

f_j(x) = x^{j−1},    j = 1, 2, …, m


so that

𝐀_kj = ∑_{i=1}^{n} xᵢ^{j+k−2}        𝐲_k = ∑_{i=1}^{n} xᵢ^{k−1} yᵢ

or in matrix form we can write 𝑨𝜽 = 𝒚 with

𝑨 = [ n          ∑xᵢ        ∑xᵢ²   …  ∑xᵢ^{m−1} ;
      ∑xᵢ        ∑xᵢ²       ∑xᵢ³   …  ∑xᵢ^{m}   ;
      ⋮                                ⋮         ;
      ∑xᵢ^{m−1}  ∑xᵢ^{m}    …          ∑xᵢ^{2m−2} ]        𝒚 = [ ∑yᵢ ;  ∑xᵢyᵢ ;  ⋮ ;  ∑xᵢ^{m−1}yᵢ ]

where ∑ stands for ∑_{i=1}^{n}. The normal equations become progressively ill-conditioned
with increasing m. Fortunately, this is of little practical consequence, because only low-
order polynomials are useful in curve fitting. Polynomials of high order are not
recommended, because they tend to reproduce the noise inherent in the data.

Example: A certain experiment yields the following data:

x -1 0 1 2
y 0 1 3 9

It is suspected that y is a quadratic function of x. Use the Method of Least Squares to


find the quadratic function that best fits the data. Suppose that the function is:

𝑦 = 𝑎 + 𝑏𝑥 + 𝑐𝑥 2

We need to find a least squares solution of the linear system

{ a − b + c = 0
  a = 1
  a + b + c = 3
  a + 2b + 4c = 9

Again the linear system is inconsistent. Here

𝑨 = [1 −1 1; 1 0 0; 1 1 1; 1 2 4]        𝑩 = [0; 1; 3; 9]
and A has rank 3. We find that

[𝑨ᵀ𝑨] = [4 2 6; 2 6 8; 6 8 18]    and    [𝑨ᵀ𝑨]⁻¹ = (1/80)[44 12 −20; 12 36 −20; −20 −20 20]

The unique least squares solution is therefore

𝜽 = [𝑨ᵀ𝑨]⁻¹𝑨ᵀ𝒚 = (1/20)[11; 33; 25]

that is, a = 11/20, b = 33/20, c = 5/4. Hence the quadratic function that best fits the
data is
11 33 5
𝑦= + 𝑥 + 𝑥2
20 20 4

Example: Given the following noisy data

𝑥 2.10 6.22 7.17 10.52 13.68


𝑓(𝑥) 2.90 3.83 5.98 5.71 7.74

Fit a straight line to this data by using least squares.


For 5 data points and l = 1 (a first-degree polynomial), the least-squares equations are

[ 5    ∑xᵢ  ;  ∑xᵢ   ∑xᵢ² ] [a₀; a₁] = [ ∑f(xᵢ) ;  ∑xᵢf(xᵢ) ]        (all sums over i = 1…5)

Each element of these equations can now be computed:

∑_{i=1}^{5} xᵢ = 2.10 + 6.22 + 7.17 + 10.52 + 13.68 = 39.69

∑_{i=1}^{5} xᵢ² = (2.10)² + (6.22)² + (7.17)² + (10.52)² + (13.68)² = 392.3201

∑_{i=1}^{5} f(xᵢ) = 2.90 + 3.83 + 5.98 + 5.71 + 7.74 = 26.16

∑_{i=1}^{5} xᵢf(xᵢ) = (2.10)(2.90) + (6.22)(3.83) + (7.17)(5.98) + (10.52)(5.71) + (13.68)(7.74) = 238.7416
The set of equations is

[5 39.69; 39.69 392.3201][a₀; a₁] = [26.16; 238.7416]

[a₀; a₁] = [5 39.69; 39.69 392.3201]⁻¹ [26.16; 238.7416] = [2.038392; 0.4023190]

The required straight line is g(x) = 2.038392 + 0.4023190x

Example: Fit a second order polynomial to the following data

𝑥 0 0.5 1.0 1.5 2.0 2.5


𝑓(𝑥) 0 0.25 1.0 2.25 4.0 6.25

Since the order is 2 (j = 2), the matrix form to solve is

[ n     ∑xᵢ    ∑xᵢ²  ]   [a₀]     [ ∑f(xᵢ)    ]
[ ∑xᵢ   ∑xᵢ²   ∑xᵢ³  ]   [a₁]  =  [ ∑xᵢf(xᵢ)  ]        (sums over i = 1…n)
[ ∑xᵢ²  ∑xᵢ³   ∑xᵢ⁴  ]   [a₂]     [ ∑xᵢ²f(xᵢ) ]

Now plug in the given data, 𝑛 = 6.

Before we go on...what answers do you expect for the coefficients after looking at
the data?
∑xᵢ = 7.5          ∑f(xᵢ) = 13.75
∑xᵢ² = 13.75       ∑xᵢf(xᵢ) = 28.125
∑xᵢ³ = 28.125      ∑xᵢ²f(xᵢ) = 61.1875
∑xᵢ⁴ = 61.1875
The set of equations is

[6 7.5 13.75; 7.5 13.75 28.125; 13.75 28.125 61.1875][a₀; a₁; a₂] = [13.75; 28.125; 61.1875]

[a₀; a₁; a₂] = inv([6 7.5 13.75; 7.5 13.75 28.125; 13.75 28.125 61.1875]) * [13.75; 28.125; 61.1875]

or Gaussian elimination gives us the solution for the coefficients

[a₀; a₁; a₂] = [0; 0; 1]    ⟹    g(x) = x²

This fits the data exactly. That is, g(𝑥) = 𝑓(𝑥) since 𝑓(𝑥) = 𝑥 2

Remark: Now we’ll try some ‘noisy’ data

𝑥 = [0 0.5 1.0 1.5 2.0 2.5]


𝑦 = [0.0674, −0.9156, 1.6253, 3.0377, 3.3535, 7.9409]

The resulting system to solve is:

[a₀; a₁; a₂] = [6 7.5 13.75; 7.5 13.75 28.125; 13.75 28.125 61.1875]⁻¹ [15.1093; 32.2834; 71.276] = [−0.1812; −0.3221; 1.3537]
So our fitted second order function is:

g(𝑥) = −0.1812 − 0.3221𝑥 + 1.3537𝑥 2

There are occasions when confidence in the accuracy of data varies from point to point. For example, the instrument taking the measurements may be more sensitive in a certain range of data. Sometimes the data represent the results of several experiments, each carried out under different circumstances. Under these conditions we may want to assign a confidence factor, or weight, to each data point and minimize the sum of the squares of the weighted residuals rᵢ = wᵢ[yᵢ − f(xᵢ)], where wᵢ are the weights. Hence the function to be minimized is

𝑱(a₁, a₂, …, a_m) = ∑_{i=1}^{n} wᵢ² [yᵢ − f(xᵢ)]²
This procedure forces the fitting function 𝑓(𝑥) closer to the data points that have higher
weights.

▪ If the fitting function is the straight line f(x) = a + bx, then:

𝑱(a, b) = ∑_{i=1}^{n} wᵢ² [yᵢ − a − bxᵢ]²
The conditions for minimizing 𝑱 are ∂𝑱/∂𝜽 = 0, i.e.

∂𝑱/∂a = ∑_{i=1}^{n} −2wᵢ²(yᵢ − a − bxᵢ) = 0
∂𝑱/∂b = ∑_{i=1}^{n} −2wᵢ²(yᵢ − a − bxᵢ)xᵢ = 0
or

∑wᵢ²yᵢ = a∑wᵢ² + b∑wᵢ²xᵢ

∑wᵢ²xᵢyᵢ = a∑wᵢ²xᵢ + b∑wᵢ²xᵢ²

Dividing the first equation by ∑wᵢ² and introducing the weighted averages

x̂ = ∑wᵢ²xᵢ / ∑wᵢ²        ŷ = ∑wᵢ²yᵢ / ∑wᵢ²

we obtain

a = ŷ − bx̂

Solving for b yields, after some algebra,

b = ∑wᵢ²yᵢ(xᵢ − x̂) / ∑wᵢ²xᵢ(xᵢ − x̂)
▪ A special application of weighted linear regression arises in fitting exponential functions to data. Consider as an example the fitting function

𝑓(𝑥) = 𝑎𝑒 𝑏𝑥

Normally, the least-squares fit would lead to equations that are nonlinear in a and b.
But if we fit ln 𝑦 rather than 𝑦, the problem is transformed to linear regression: fit the
function
𝐹(𝑥) = ln 𝑓(𝑥) = ln 𝑎 + 𝑏𝑥

to the data points (𝑥𝑖 , ln 𝑦𝑖 ), i = 1, 2, . . . , n. This simplification comes at a price: least


square fit to the logarithm of the data is not the same as least-squares fit to the original
data. The residuals of the logarithmic fit are

𝑅𝑖 = ln 𝑦𝑖 − 𝐹(𝑥𝑖 ) = ln 𝑦𝑖 − ln 𝑎 − 𝑏𝑥𝑖

whereas the residuals used in fitting the original data are

𝑟𝑖 = [𝑦𝑖 − 𝑓(𝑥𝑖 )] = 𝑦𝑖 − 𝑎𝑒 𝑏𝑥𝑖

If the residuals rᵢ are sufficiently small (rᵢ ≪ yᵢ), we can use the approximation ln(1 − rᵢ/yᵢ) ≈ −rᵢ/yᵢ, so that

Rᵢ = ln(yᵢ/f(xᵢ)) = −ln(1 − rᵢ/yᵢ) ≈ rᵢ/yᵢ

We can now see that by minimizing ∑Rᵢ², we inadvertently introduced the weights 1/yᵢ. This effect can be negated if we apply the weights wᵢ = yᵢ when fitting F(x) to the points (xᵢ, ln yᵢ); that is, by minimizing

𝑱 = ∑_{i=1}^{n} yᵢ² Rᵢ²

Example: Determine the parameters a and b so that 𝑓(𝑥) = 𝑎𝑒 𝑏𝑥 fits the following data
in the least-squares sense.

𝑥 1.2 2.8 4.3 5.4 6.8 7.9


𝑦 7.5 16.1 38.9 67.0 146.6 266.2

Use two different methods: (1) fit ln 𝑦𝑖 ; and (2) fit ln 𝑦𝑖 with weights 𝑤𝑖 = 𝑦𝑖 . Compute
the standard deviation in each case.

Solution of Part (1) The problem is to fit the function ln(𝑎𝑒 𝑏𝑥 ) = ln 𝑎 + 𝑏𝑥 to the data

𝑥 1.2 2.8 4.3 5.4 6.8 7.9


ln 𝑦 2.015 2.779 3.661 4.205 4.988 5.584
We are now dealing with linear regression, where the parameters to be found are A = ln a and b:

x̅ = (1/6)∑_{i=1}^{6} xᵢ = 4.733        z̅ = (1/6)∑_{i=1}^{6} zᵢ = 3.872

b = ∑zᵢ(xᵢ − x̅)/∑xᵢ(xᵢ − x̅) = 16.716/31.153 = 0.5366        A = z̅ − bx̅ = 1.3323

Therefore, a = e^A = 3.790 and the fitting function becomes f(x) = 3.790e^{0.5366x}. The plots of f(x) and the data points are shown in the figure.

Here is the computation of standard deviation:

𝑦          7.50    16.10   38.90   67.00   146.60   266.20
𝑓(𝑥)       7.21    17.02   38.07   68.69   145.60   262.72
𝑦 − 𝑓(𝑥)   0.29   −0.92     0.83   −1.69     1.00     3.48

𝑱(a, b) = ∑_{i=1}^{n} (yᵢ − f(xᵢ))² = 17.59

σ = √(𝑱/(n − m)) = 2.10
As pointed out before, this is an approximate solution of the stated problem, since we
did not fit 𝑦𝑖 , but ln 𝑦𝑖 . Judging by the plot, the fit seems to be good.

Solution of Part (2): We again fit ln(ae^{bx}) = ln a + bx to z = ln y, but this time the weights wᵢ = yᵢ are used. The weighted averages of the data are (recall that we fit z = ln y)

x̂ = ∑yᵢ²xᵢ / ∑yᵢ² = 7.474        ẑ = ∑yᵢ²zᵢ / ∑yᵢ² = 5.353

This yields

b = ∑yᵢ²zᵢ(xᵢ − x̂) / ∑yᵢ²xᵢ(xᵢ − x̂) = 0.5440
ln a = ẑ − bx̂ = 1.287

Therefore a = e^{ln a} = 3.622, so that the fitting function is f(x) = 3.622e^{0.5440x}. As expected, this result is somewhat different from that obtained in Part (1).

The computations of the residuals and standard deviation are as follows:

𝑦          7.50    16.10   38.90   67.00   146.60   266.20
𝑓(𝑥)       6.96    16.61   37.56   68.33   146.33   266.20
𝑦 − 𝑓(𝑥)   0.54   −0.51     1.34   −1.33     0.27     0.00

𝑱(a, b) = ∑_{i=1}^{n} (yᵢ − f(xᵢ))² = 4.186

σ = √(𝑱/(n − m)) = 1.023
Observe that the residuals and standard deviation are smaller than in Part (1),
indicating a better fit, as expected.
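Both parts of this example can be reproduced in a few MATLAB lines, using the (weighted) averages derived above (a minimal sketch):

clear all, clc,
x = [1.2 2.8 4.3 5.4 6.8 7.9]; y = [7.5 16.1 38.9 67.0 146.6 266.2];
z = log(y);
% Part (1): plain linear regression on (x, ln y)
b1 = sum(z.*(x-mean(x)))/sum(x.*(x-mean(x))); a1 = exp(mean(z)-b1*mean(x));
% Part (2): weights w_i = y_i, i.e. weighted averages with w_i^2 = y_i^2
xh = sum(y.^2.*x)/sum(y.^2); zh = sum(y.^2.*z)/sum(y.^2);
b2 = sum(y.^2.*z.*(x-xh))/sum(y.^2.*x.*(x-xh)); a2 = exp(zh-b2*xh);
[a1 b1; a2 b2] % approximately [3.790 0.5366; 3.622 0.5440]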
CHAPTER III:
Numerical Differentiation and Integration
Numerical differentiation deals with the following problem: we are given
the function y = f (x) and wish to obtain one of its derivatives at the point x = xk.
Numerical differentiation is not a particularly accurate process. It suffers from a conflict
between round off errors (due to limited machine precision) and errors inherent in
interpolation. For this reason, a derivative of a function can never be computed with the
same precision as the function itself.

Numerical integration, also known as quadrature, is intrinsically a much more accurate procedure than numerical differentiation. Quadrature approximates the definite integral ∫ₐᵇ f(x)dx by the sum ∑_{i=1}^{n} Aᵢf(xᵢ), where the nodal abscissas xᵢ and weights Aᵢ
depend on the particular rule used for the quadrature. All rules of quadrature are
derived from polynomial interpolation of the integrand. Therefore, they work best if f (x)
can be approximated by a polynomial. Methods of numerical integration studied here
are characterized by equally spaced abscissas, and include well-known methods such
as the trapezoidal rule and Simpson’s rule. They are most useful if f (x) has already
been computed at equal intervals, or can be computed at low cost.

The derivation of the finite difference approximations for the derivatives of f(x) is based on forward and backward Taylor series expansions of f(x) about x, such as

f(x + h) = f(x) + h df/dx + (h²/2!)(d²f/dx²) + (h³/3!)(d³f/dx³) + (h⁴/4!)(d⁴f/dx⁴) + ⋯ = ∑_{k=0}^{∞} (hᵏ/k!)(dᵏf/dxᵏ)

f(x − h) = f(x) − h df/dx + (h²/2!)(d²f/dx²) − (h³/3!)(d³f/dx³) + (h⁴/4!)(d⁴f/dx⁴) − ⋯ = ∑_{k=0}^{∞} (−1)ᵏ(hᵏ/k!)(dᵏf/dxᵏ)

Solving the above equations for f′(x) we get

df/dx = [f(x + h) − f(x)]/h + 𝒪(h) = ∆f(x)/h + 𝒪(h)

df/dx = [f(x) − f(x − h)]/h + 𝒪(h) = ∇f(x)/h + 𝒪(h)
As we have seen in Chapter I, we can compute the 2nd derivative as

d²f/dx² = ∆²f(x)/h² + 𝒪(h) = (1/h²){f(x + 2h) − 2f(x + h) + f(x)} + 𝒪(h)
d²f/dx² = ∇²f(x)/h² + 𝒪(h) = (1/h²){f(x) − 2f(x − h) + f(x − 2h)} + 𝒪(h)
These expressions are called non-central forward and non-central backward finite
difference approximations.

We can derive the approximations for higher derivatives in the same manner:

dⁿf/dxⁿ = ∆ⁿf(x)/hⁿ + 𝒪(h)    and    dⁿf/dxⁿ = ∇ⁿf(x)/hⁿ + 𝒪(h)

with the recurrence formulas ∆ⁿf(x) = ∆(∆ⁿ⁻¹f(x)) and ∇ⁿf(x) = ∇(∇ⁿ⁻¹f(x)).

For derivatives up to the fourth order the forward and backward difference expressions
of 𝒪(ℎ) are tabulated below

Coefficients of forward difference approximations 𝒪(h):
              f_k   f_{k+1}  f_{k+2}  f_{k+3}  f_{k+4}
h f′(xᵢ)      −1     1
h² f⁽²⁾(xᵢ)    1    −2        1
h³ f⁽³⁾(xᵢ)   −1     3       −3        1
h⁴ f⁽⁴⁾(xᵢ)    1    −4        6       −4        1

Coefficients of backward difference approximations 𝒪(h):
              f_{k−4}  f_{k−3}  f_{k−2}  f_{k−1}  f_k
h f′(xᵢ)                                  −1       1
h² f⁽²⁾(xᵢ)                       1       −2       1
h³ f⁽³⁾(xᵢ)              −1       3       −3       1
h⁴ f⁽⁴⁾(xᵢ)     1        −4       6       −4       1

The difference representations for derivatives we have obtained thus far are of 𝒪(h). More accurate expressions may be found by simply taking more terms in the Taylor series expansion. Consider the following example:

df/dx = [f(x + h) − f(x)]/h + 𝒪(h) = [f(x + h) − f(x)]/h − (h/2)(d²f/dx²) − (h²/6)(d³f/dx³) + ⋯

Substituting d²f/dx² = [f(x + 2h) − 2f(x + h) + f(x)]/h² − h f⁽³⁾(x) + ⋯ into the first-derivative formula we obtain

df/dx = [f(x + h) − f(x)]/h − (h/2)[(f(x + 2h) − 2f(x + h) + f(x))/h² − h f⁽³⁾(x) + ⋯] − (h²/6)(d³f/dx³) + ⋯

Collecting terms we get

df/dx = (1/2h)(−f(x + 2h) + 4f(x + h) − 3f(x)) + 𝒪(h²)
𝑑𝑥 2ℎ
We have thus found difference representation for the first derivative which is accurate
to 𝒪(ℎ2 ). The higher derivatives of the same accuracy are given below

Coefficients of forward difference approximations 𝒪(h²):
               f_k   f_{k+1}  f_{k+2}  f_{k+3}  f_{k+4}  f_{k+5}
2h f′(xᵢ)      −3     4       −1
h² f⁽²⁾(xᵢ)     2    −5        4       −1
2h³ f⁽³⁾(xᵢ)   −5    18      −24       14       −3
h⁴ f⁽⁴⁾(xᵢ)     3   −14       26      −24       11       −2

Coefficients of backward difference approximations 𝒪(h²):
               f_{k−5}  f_{k−4}  f_{k−3}  f_{k−2}  f_{k−1}  f_k
2h f′(xᵢ)                                  1       −4       3
h² f⁽²⁾(xᵢ)                       −1       4       −5       2
2h³ f⁽³⁾(xᵢ)              3      −14      24      −18       5
h⁴ f⁽⁴⁾(xᵢ)     −2       11      −24      26      −14       3
The average of the forward and backward difference methods, when the data points are equally spaced, is called the central difference method. This method has a truncation error of second order, which provides more accuracy in the approximation of the first derivative.

The central difference method is more accurate than the forward and backward methods; the reason is simply that a central difference scheme is designed to cancel more terms in the Taylor series expansion than the forward and backward schemes, in turn reducing the truncation error.

Taking the average of df/dx = ∆f(x)/h + 𝒪(h) and df/dx = ∇f(x)/h + 𝒪(h) we get

df/dx = ∆f(x)/2h + ∇f(x)/2h + 𝒪(h²) = [f(x + h) − f(x − h)]/2h + 𝒪(h²)

This is the first central difference approximation for 𝑓 ′ (𝑥). The term 𝒪(ℎ2 ) reminds us
that the truncation error behaves as ℎ2 . Notice that

f(x + h) + f(x − h) = 2f(x) + h²f⁽²⁾(x) + (h⁴/12)f⁽⁴⁾(x) + ⋯

d²f/dx² = [f(x + h) − 2f(x) + f(x − h)]/h² + 𝒪(h²)
Central difference approximations for other derivatives can be obtained in a similar
manner.
dⁿf/dxⁿ = [∇ⁿf(x + (n/2)h) + ∆ⁿf(x − (n/2)h)] / 2hⁿ + 𝒪(h²)            when n = 2k is even

dⁿf/dxⁿ = [∇ⁿf(x + ((n−1)/2)h) + ∆ⁿf(x − ((n−1)/2)h)] / 2hⁿ + 𝒪(h²)    when n = 2k + 1 is odd

Coefficients of central difference approximations 𝒪(h²):
               f_{k−2}  f_{k−1}  f_k   f_{k+1}  f_{k+2}
2h f′(xᵢ)                −1       0      1
h² f⁽²⁾(xᵢ)               1      −2      1
2h³ f⁽³⁾(xᵢ)    −1        2       0     −2        1
h⁴ f⁽⁴⁾(xᵢ)      1       −4       6     −4        1

The higher derivatives of central difference for accuracy 𝒪(h⁴) are given below (note the prefactors 12h, 12h², 8h³ and 6h⁴ required by these coefficients):

Coefficients of central difference approximations 𝒪(h⁴):
                f_{k−3}  f_{k−2}  f_{k−1}  f_k   f_{k+1}  f_{k+2}  f_{k+3}
12h f′(xᵢ)                 1       −8       0      8       −1
12h² f⁽²⁾(xᵢ)             −1       16     −30     16       −1
8h³ f⁽³⁾(xᵢ)     1        −8       13       0    −13        8       −1
6h⁴ f⁽⁴⁾(xᵢ)    −1        12      −39      56    −39       12       −1


Example: Find the forward representation for f′(x) which is of 𝒪(h³).

Solution: We will use the forward difference (Taylor series expansion)

f(x + h) = f(x) + h df/dx + (h²/2!)(d²f/dx²) + (h³/3!)(d³f/dx³) + (h⁴/4!)(d⁴f/dx⁴) + ⋯

and it is very well known that

d²f/dx² = (1/h²){f(x + 2h) − 2f(x + h) + f(x)} − h f⁽³⁾(x) + ⋯
d³f/dx³ = (1/h³){f(x + 3h) − 3f(x + 2h) + 3f(x + h) − f(x)} + 𝒪(h)

Making these substitutions (the −h f⁽³⁾ correction carried by the second difference combines with the h³/3! term into −(h³/3)f⁽³⁾), we obtain

f_{k+1} = f_k + h df/dx + (1/2!){f_{k+2} − 2f_{k+1} + f_k} − (1/3){f_{k+3} − 3f_{k+2} + 3f_{k+1} − f_k} + 𝒪(h⁴)

Therefore,

df/dx = (2f_{k+3} − 9f_{k+2} + 18f_{k+1} − 11f_k)/6h + 𝒪(h³)
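A quick numerical check of this representation (a minimal sketch, using f(x) = eˣ, whose derivative is known exactly):

clear all, clc, f = @(x) exp(x); x = 1;
for h = [0.2 0.1 0.05]
    df = (2*f(x+3*h) - 9*f(x+2*h) + 18*f(x+h) - 11*f(x))/(6*h);
    fprintf('h = %5.3f   error = %10.3e\n', h, df - exp(1)) % error shrinks like h^3
end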
Example: Find the fifth backward difference representation which is of 𝒪(ℎ)

From the recurrence scheme for differences, the fifth difference can be expressed as

∇5 𝑓𝑘 = ∇(∇4 𝑓𝑘 ) = 𝑓𝑘 − 4𝑓𝑘−1 + 6𝑓𝑘−2 − 4𝑓𝑘−3 + 𝑓𝑘−4 − (𝑓𝑘−1 − 4𝑓𝑘−2 + 6𝑓𝑘−3 − 4𝑓𝑘−4 + 𝑓𝑘−5 )
= 𝑓𝑘 − 5𝑓𝑘−1 + 10𝑓𝑘−2 − 10𝑓𝑘−3 + 5𝑓𝑘−4 − 𝑓𝑘−5

and

d⁵f/dx⁵ = ∇⁵f_k / h⁵ + 𝒪(h)
Example: Find the central difference representation of 𝒪(ℎ2 ) for the fifth derivative.

Solution:

∆5 𝑓𝑘 = 𝑓𝑘+5 − 5𝑓𝑘+4 + 10𝑓𝑘+3 − 10𝑓𝑘+2 + 5𝑓𝑘+1 − 𝑓𝑘


∇5 𝑓𝑘 = 𝑓𝑘 − 5𝑓𝑘−1 + 10𝑓𝑘−2 − 10𝑓𝑘−3 + 5𝑓𝑘−4 − 𝑓𝑘−5

The central difference rule for odd n = 2k + 1 is given by

dⁿf/dxⁿ = [∇ⁿf(x + ((n−1)/2)h) + ∆ⁿf(x − ((n−1)/2)h)] / 2hⁿ + 𝒪(h²)

Applying this rule to our case we obtain

d⁵f/dx⁵ = (∆⁵f_{k−2} + ∇⁵f_{k+2}) / 2h⁵ + 𝒪(h²)
        = (f_{k+3} − 4f_{k+2} + 5f_{k+1} − 5f_{k−1} + 4f_{k−2} − f_{k−3}) / 2h⁵ + 𝒪(h²)
2ℎ5
Example: The following function represents a physical data taken at equally spaced
intervals
𝑥 0 0.5 1.0 1.50 2.0 2.50 3.00
𝑓(𝑥) 1.00 0.80 0.20 0.25 0.31 0.38 0.44

Find 𝑓 ′ (1.5) to 𝒪(0.5)2

clear all, clc, x = [0 0.5 1.0 1.50 2.0 2.50 3.00];
y = [1.00 0.80 0.20 0.25 0.31 0.38 0.44]; xx = 0:.05:3;
pp=polyfit(x,y,6); F1 = polyval(pp,x); % degree 6: seven points determine a sixth-degree polynomial
F2 = interp1(x,y,xx ,'spline');
%-----------------------------------
plot(x,F1,'o','linewidth',3)
grid on
hold on
plot(xx,F2,'-b','linewidth',3)

We want to find 𝑓 ′ (1.5) to 𝒪(0.5)2 , which means a three point representation will be
needed. A three point backward representation would require 𝑓(1.5), 𝑓(1.0) & 𝑓(0.5) the
value of 𝑓(0.5) would obviously influence the answer in such way as to give an incorrect
result, and use of 𝑓(1.0) would also appear undesirable since there is apparently a
drastic change in the behavior of 𝑓(𝑥) occurring very close to 𝑥 = 1.0 and the behavior at
𝑥 = 1.0 could be quite uncertain. We must therefore reject the backward difference
representation. A central difference of 𝑓 ′ (1.5) would involve 𝑓(1.0) which is undesirable
for the reason discussed before. A forward difference representation would involve the
values 𝑓(1.5), 𝑓(2.0) & 𝑓(2.5) . Over this region the function is smooth and well behaved,
so we use this representation.
f′(1.5) = [−f(2.5) + 4f(2.0) − 3f(1.5)] / (2(0.5)) + 𝒪(0.5)² = 0.11 + 𝒪(0.5)²
2(0.5)2

clear all, clc, x = [0 0.5 1.0 1.50 2.0 2.50 3.00];
y = [1.00 0.80 0.20 0.25 0.31 0.38 0.44]; xx = 0:.05:2;
h=0.5; dF=[]; n= length(x)-2;
for i=1:n
    df=(-y(i+2)+4*y(i+1)-3*y(i))/(2*h); % forward difference of O(h^2)
    dF=[dF, df] ;
end
%-----------------------------------
x1=x(1:n); y1= dF;
pp=polyfit(x1,y1,4); dF1= polyval(pp,x1) ; % degree 4: only five sample points
dF2 = interp1(x1,y1,xx ,'spline');
plot(x1,dF1,'o','linewidth',3)
grid on
hold on
plot(xx,dF2,'linewidth',3)
F = polyval(pp,1.5)

Example: Given the following equally spaced data

𝑥 0 1.0 2.0 3.0 4.0


𝑓(𝑥) 30 33 28 12 -22

Find 𝑓 ′ (0), 𝑓 ′ (2), 𝑓 ′ (4), and 𝑓 ′′ (0), to 𝒪(ℎ)2


At 𝑥 = 0 a forward difference representation must be used since no points are available
in the backward direction.

f′(0) = [−f(2) + 4f(1.0) − 3f(0)] / (2(1)) + 𝒪(1)² = 7 + 𝒪(1)²

At x = 2, we have a choice of several representations; we arbitrarily select the central difference method:

f′(2) = [f(3) − f(1)] / (2(1)) + 𝒪(1)² = −10.5 + 𝒪(1)²

At x = 4, a backward difference representation must be employed:

f′(4) = [3f(4) − 4f(3) + f(2)] / (2(1)) + 𝒪(1)² = −43.0 + 𝒪(1)²

For f″(0), a forward representation of 𝒪(h²) is needed (from the table given earlier):

f″(0) = [2f(0) − 5f(1) + 4f(2) − f(3)] / (1)² + 𝒪(1)² = 60 − 165 + 112 − 12 = −5 + 𝒪(1)²
2(1)2

Example: Consider the function 𝑓(𝑥) = sin(10𝜋𝑥) find 𝑓 ′ (0.5) using the central difference
representation of 𝒪(ℎ2 ) with ℎ = 0.2, ℎ = 0.1 and ℎ = 0.005. Compare these results with
each other and with the exact analytical answer. Discuss the implications of these
results.

Solution:

at h = 0.2:     f′(0.5) = [sin(10π(0.5 + 0.2)) − sin(10π(0.5 − 0.2))] / (2(0.2)) + 𝒪(0.2)² ≅ −1.2246e−15
at h = 0.1:     f′(0.5) = [sin(10π(0.5 + 0.1)) − sin(10π(0.5 − 0.1))] / (2(0.1)) + 𝒪(0.1)² ≅ 1.2246e−15
at h = 0.005:   f′(0.5) = [sin(10π(0.5 + 0.005)) − sin(10π(0.5 − 0.005))] / (2(0.005)) + 𝒪(0.005)² ≅ −31.287

The exact analytical answer is 𝑓 ′ (0.5) = 10𝜋 cos(10𝜋(0.5)) = −31.416. In order to obtain a
meaningful answer for a finite difference representation of 𝑓 ′ (0.5) it would be necessary
to use much smaller ℎ.
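The behaviour described above is easy to observe by sweeping the step size (a minimal sketch):

clear all, clc, f = @(x) sin(10*pi*x); exact = 10*pi*cos(10*pi*0.5);
for h = [0.2 0.1 0.05 0.01 0.005 0.001]
    df = (f(0.5+h) - f(0.5-h))/(2*h); % central difference of O(h^2)
    fprintf('h = %7.4f   f''(0.5) = %10.4f   error = %10.3e\n', h, df, df-exact)
end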

Difference expressions for derivatives and polynomials have some distinct relationships which can be very useful. The error term for the nth difference involves only derivatives of order n + 1 or higher. Thus if we consider a polynomial of order n, the nth difference representation taken anywhere along this polynomial will be constant and exactly equal to the nth derivative, regardless of the spacing h. This knowledge may be used to get some idea of how well a given polynomial will fit data obtained at a series of equally spaced points of the independent variable.

Example: The following data represent a polynomial

𝑥 0 1 2 3 4 5
𝑓(𝑥) 1.00 0.50 8.00 35.50 95.00 198.50

▪ Of what order is this polynomial


▪ Determine the coefficients of this polynomial.
Solution: we will use the forward difference (backward difference could be used as well)

∆f₀ = f₁ − f₀ = 0.5 − 1 = −0.5          ∆²f₀ = ∆f₁ − ∆f₀ = 8
∆f₁ = f₂ − f₁ = 8.0 − 0.5 = 7.5         ∆²f₁ = ∆f₂ − ∆f₁ = 20       ∆³f₀ = ∆²f₁ − ∆²f₀ = 12.0
∆f₂ = f₃ − f₂ = 35.5 − 8.0 = 27.5       ∆²f₂ = ∆f₃ − ∆f₂ = 32       ∆³f₁ = ∆²f₂ − ∆²f₁ = 12.0
∆f₃ = f₄ − f₃ = 95.0 − 35.5 = 59.5      ∆²f₃ = ∆f₄ − ∆f₃ = 44       ∆³f₂ = ∆²f₃ − ∆²f₂ = 12.0
∆f₄ = f₅ − f₄ = 198.5 − 95 = 103.5

The third difference is constant

d³f/dx³ = ∆³f(x)/h³ + 𝒪(h) = 12.0    ⟹    f(x) = a₃x³ + a₂x² + a₁x + a₀
To determine the coefficients let us use the least-squares (pseudo-inverse) solution of the Vandermonde system 𝑽𝐚 = 𝐟:

𝑽 = [1  xᵢ  xᵢ²  xᵢ³],  i = 1, …, 6,    𝐚 = [a₀; a₁; a₂; a₃],    𝐟 = [f(x₁); …; f(x₆)]    ⟹    𝐚 = 𝑽⁺𝐟 = [𝑽ᵀ𝑽]⁻¹𝑽ᵀ𝐟
3

clear all, clc,


x = [0 1 2 3 4 5]; y = [1.00 0.50 8.00 35.50 95.00 198.50]; A=[];
for i=1:length(x)
V=[1 x(i) (x(i))^2 (x(i))^3]; A=[A;V];
end
theta=pinv(A)*y'
z= flip(theta'); z0=[0:1:5];
y=polyval(z,z0) % verification

The general form of numerical integration of a function f(x) over some interval [a, b] is a weighted sum of the function values at a finite number (N + 1) of sample points (nodes), referred to as 'quadrature':

I(f(x)) = ∫_{a=x₀}^{b=x_N} f(x)dx ≅ ∑_{k=0}^{N} w_k f(x_k)    with    w_k = x_{k+1} − x_k  for  k = 0, 1, …, N − 1

Riemann sum: a certain kind of approximation of an integral by a finite sum; the area under the curve is approximated by a sequence of rectangular strips of fixed width w_k = h = (b − a)/N and height f(x_k):

I(f(x)) ≅ ((b − a)/N) ∑_{k=0}^{N−1} f(x_k)
We divide the interval between x₀ = a and x_N = b into N equally spaced subranges, and sum the areas of the rectangles of width h and height f_k, where k = 0, 1, …, N − 1:

I(f(x)) = ∫ₐᵇ f(x)dx ≅ h ∑_{k=0}^{N−1} f(a + kh) = h ∑_{k=0}^{N−1} f_k

Here, the sample points are equally spaced for the midpoint rule, the trapezoidal rule,
and Simpson’s rule.

Fig: zero-order, first-order and second-order approximations (midpoint, trapezoidal and Simpson's rules)

Figure above shows the integrations over two segments by the midpoint rule, the
trapezoidal rule, and Simpson’s rule, which are referred to as Newton–Cotes formulas
for being based on the approximate polynomial and are implemented by the following
formulas.
〈midpoint rule〉:     ∫_{x_k}^{x_{k+1}} f(x)dx ≅ h f((x_k + x_{k+1})/2),                        h = x_{k+1} − x_k

〈trapezoidal rule〉:  ∫_{x_k}^{x_{k+1}} f(x)dx ≅ (h/2){f(x_k) + f(x_{k+1})},                    h = x_{k+1} − x_k

〈Simpson's rule〉:    ∫_{x_{k−1}}^{x_{k+1}} f(x)dx ≅ (h/3){f(x_{k−1}) + 4f(x_k) + f(x_{k+1})},   h = (x_{k+1} − x_{k−1})/2

These three integration rules are based on approximating the target function (integrand) by a zeroth-, first- and second-degree polynomial, respectively. Since the first two integrations are obvious, we are going to derive just Simpson's rule.

As an interpolation problem, let us try to find the second-degree polynomial which fits the data (−h, f_{k−1}), (0, f_k) and (h, f_{k+1}):

p₂(x) = c₁x² + c₂x + c₃ ≅ f(x)    at    x_{k−1} = −h,  x_k = 0  and  x_{k+1} = h

Matching the points (−h, f_{k−1}), (0, f_k) and (h, f_{k+1}), we should solve the following set of equations:

p₂(−h) = c₁h² − c₂h + c₃ ≅ f_{k−1}
p₂(0)  = c₃ ≅ f_k                        ⟺    [1 −h h²; 1 0 0; 1 h h²][c₃; c₂; c₁] = [f_{k−1}; f_k; f_{k+1}]
p₂(h)  = c₁h² + c₂h + c₃ ≅ f_{k+1}

c₃ = f_k,    c₂ = (f_{k+1} − f_{k−1})/2h    and    c₁ = (1/h²)((f_{k+1} + f_{k−1})/2 − f_k)
Integrating the second-degree polynomial p₂(x) with these coefficients from x = −h to x = h yields

∫₋ₕ⁺ʰ p₂(x)dx = ∫₋ₕʰ (c₁x² + c₂x + c₃)dx = [⅓c₁x³ + ½c₂x² + c₃x]₋ₕʰ = (2c₁/3)h³ + 2c₃h
             = (2h/3)((f_{k+1} + f_{k−1})/2 − f_k + 3f_k) = (h/3){f_{k−1} + 4f_k + f_{k+1}}

This is the Simpson integration formula.

Trapezoidal rule: In order to get the formulas for numerical integration of a function
𝑓(𝑥) over some interval [𝑎, 𝑏], we divide the interval into 𝑁 segments of equal length
ℎ = (𝑏 − 𝑎)/𝑁 so that the nodes (sample points) can be expressed as {𝑥 = 𝑎 + 𝑘ℎ, 𝑘 =
0, 1, 2, … , 𝑁}. Then we have the numerical integration of 𝑓 (𝑥) over [𝑎, 𝑏] by the
trapezoidal rule as
T(f, h) = ∫ₐᵇ f(x)dx = ∑_{k=0}^{N−1} ∫_{x_k}^{x_{k+1}} f(x)dx ≅ (h/2)((f₀ + f₁) + (f₁ + f₂) + ⋯ + (f_{N−2} + f_{N−1}) + (f_{N−1} + f_N))
        = (h/2)(f₀ + f_N + 2∑_{k=1}^{N−1} f_k) = h((f(a) + f(b))/2 + ∑_{k=1}^{N−1} f_k)

Example: write a MATLAB code to compute the integral ∫ₐᵇ sin(x²)dx by the trapezoidal rule.
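A possible solution (a minimal sketch; the limits a and b are assumptions, any values work):

clear all, clc, a = 0; b = 2; N = 1000; h = (b-a)/N; % a, b, N chosen arbitrarily
x = a:h:b; f = sin(x.^2);
T = h*((f(1)+f(end))/2 + sum(f(2:end-1)))            % composite trapezoidal rule
% verification: integral(@(x) sin(x.^2), a, b)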
Simpson's rule: On the other hand, we have the numerical integration of f(x) over [a, b] by Simpson's rule with an even number of segments N = 2m as

S(f, h) = ∫ₐᵇ f(x)dx = ∫_{x₀}^{x₂} f(x)dx + ∫_{x₂}^{x₄} f(x)dx + ⋯ + ∫_{x_{N−2}}^{x_N} f(x)dx = ∑_{j=0}^{N/2−1} ∫_{x_{2j}}^{x_{2j+2}} f(x)dx
        = (h/3){(f₀ + 4f₁ + f₂) + (f₂ + 4f₃ + f₄) + ⋯ + (f_{N−2} + 4f_{N−1} + f_N)}
        = (h/3){f₀ + 4f₁ + 2f₂ + 4f₃ + ⋯ + 2f_{2m−2} + 4f_{2m−1} + f_{2m}}

that is,

S(f, h) = ∫ₐᵇ f(x)dx = (h/3){f(a) + f(b) + 2∑_{k=1}^{m−1} f_{2k} + 4∑_{k=1}^{m} f_{2k−1}}

Example: write a MATLAB code to compute the integral ∫ₐᵇ √(1 + x²)dx by Simpson's rule.
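A possible solution (a minimal sketch; N must be even for Simpson's rule):

clear all, clc, a = 0; b = 2; N = 1000; h = (b-a)/N; % N = 2m even
x = a:h:b; f = sqrt(1+x.^2);
S = (h/3)*(f(1) + f(end) + 2*sum(f(3:2:end-2)) + 4*sum(f(2:2:end-1)))
% verification: integral(@(x) sqrt(1+x.^2), a, b)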
Remark: There is an important relationship between the above methods: if T(f, 2h) and M(f, 2h) denote the trapezoidal and midpoint rules applied on the same panels [x_{2k}, x_{2k+2}] of width 2h, then

S(f, h) = (T(f, 2h) + 2M(f, 2h))/3

where S(f, h) is Simpson's rule, T is the trapezoidal rule and M is the midpoint rule. Indeed,

S(f, h) = (h/3){f(a) + f(b) + 2∑_{k=1}^{m−1} f_{2k} + 4∑_{k=1}^{m} f_{2k−1}}
        = (1/3)·2h{(f(a) + f(b))/2 + ∑_{k=1}^{m−1} f_{2k}} + (2/3)·2h∑_{k=1}^{m} f_{2k−1} = (1/3)T(f, 2h) + (2/3)M(f, 2h)

Example: write a MATLAB code to compute the integral ∫ₐᵇ √(1 + x²)dx by the midpoint rule.
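A possible solution (a minimal sketch): the midpoint rule samples the function at the centre of each of the N subintervals.

clear all, clc, a = 0; b = 2; N = 1000; h = (b-a)/N;
xm = a+h/2 : h : b-h/2;      % midpoints of the N subintervals
M = h*sum(sqrt(1+xm.^2))     % composite midpoint rule
% verification: integral(@(x) sqrt(1+x.^2), a, b)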

We divide the range of integration [a, b] into equal intervals of length h and denote the abscissas of the resulting nodes by x₁, x₂, …, x_n. Next we approximate f(x) by a polynomial of degree n − 1 that intersects all the nodes. Lagrange's form of this polynomial is

p_{n−1}(x) = ∑_{k=1}^{n} ℓ_k(x) f(x_k)

where ℓ_k(x) are the cardinal functions defined in the last chapter. Therefore, an approximation to the integral ∫ₐᵇ f(x)dx is

∫ₐᵇ f(x)dx ≅ ∫ₐᵇ p_{n−1}(x)dx = ∑_{k=1}^{n} (f(x_k) ∫ₐᵇ ℓ_k(x)dx) = ∑_{k=1}^{n} w_k f(x_k)

where

w_k = ∫ₐᵇ ℓ_k(x)dx,    k = 1, 2, …, n
This equation is known as the Newton–Cotes formula. Now if we take only one strip of the total area and let a = x_k and b = x_{k+1}, then we obtain the classical examples of these formulas, which are the trapezoidal rule (p₁(x)) and Simpson's rule (p₂(x)).

Let us derive the trapezoidal rule from the Newton–Cotes formula. It is well known that the trapezoidal rule is a first-order approximation, so the interpolating polynomial is of order one (i.e. n = 2):

p₁(x) = ∑_{k=1}^{2} ℓ_k(x)f(x_k) = ℓ₁(x)f(a) + ℓ₂(x)f(b) = ((x − x₂)/(x₁ − x₂))f(a) + ((x − x₁)/(x₂ − x₁))f(b)
      = ((b − x)/h)f(a) + ((x − a)/h)f(b)
When the polynomial p₁(x) is an interpolant of the function f(x), they share the same area under their curves, so

∫ₐᵇ f(x)dx ≅ ∫ₐᵇ p₁(x)dx = (1/2h)[−(b − x)²f(a) + (x − a)²f(b)]ₐᵇ = ((b − a)²/2h)(f(a) + f(b))

∫ₐᵇ f(x)dx ≅ (h/2)(f(a) + f(b)) = (h/2)(f_k + f_{k+1})

This is known as the trapezoidal rule. It represents the area of the trapezoid shown in
figure.

Differential equations are mathematical descriptions of how the variables and their derivatives (rates of change) with respect to one or more independent variables affect each other in a dynamical way.
Their solutions show us how the dependent variable(s) will change with the
independent variable(s). Many problems in natural sciences and engineering fields are
formulated into a scalar differential equation or a vector differential equation—that is, a
system of differential equations.

In this section, we look into several methods of obtaining the numerical solutions to
ordinary differential equations

▪ Euler's Method: (Taylor Series Method)


▪ Heun's Method: (Trapezoidal Method)
▪ Runge–Kutta Method:
▪ Predictor–Corrector Method:

⦁ Adams–Bashforth–Moulton method (with modification formulas)


⦁ Hamming method (with modification formulas)
Euler's Method (Taylor Series Method): When talking about the numerical
solutions to ODEs, everyone starts with the Euler’s method, since it is easy to
understand and simple to program. Even though its low accuracy keeps it from being
widely used for solving ODEs, it gives us a clue to the basic concept of numerical
solution for a differential equation simply and clearly. Let us consider

𝐱̇ (𝑡) = 𝐟(𝑡, 𝐱(𝑡)) with 𝐱(𝑡0 ) = 𝐱 0

This is a first-order vector differential equation which is equivalent to a high-order


scalar differential equation. The Euler's algorithm can be described by

𝐱 𝑘+1 = 𝐱 𝑘 + ℎ𝐟(𝑡𝑘 , 𝐱 𝑘 ) with 𝐱(𝑡0 ) = 𝐱 0

While the Taylor series methods are a useful starting point for understanding more sophisticated methods, they are not of much computational use. The first-order method is too inaccurate, as it corresponds to simple linear extrapolation, while higher-order methods require the calculation of many partial derivatives.

Example: write a MATLAB code to solve the following ordinary differential equation

𝑥̇ = −0.9 sin(0.5|𝑡|) |𝑥(𝑡)|

clear all, clc, h=0.01; n=2000; t0=0.01;


t=t0:h:t0+n*h; yy(1)= 0.1; y=[];
for i=1:n
f(i)=0.9*sin(-0.5*abs(t(i)))*abs(yy(i)) ;
yy(i+1)= yy(i) + h*f(i); y=[y, yy(i)];
end
plot(t(1:length(y)),y,'b','linewidth',3)
hold on

f=@(t,x)(0.9*sin(-0.5*abs(t))*abs(x));
[t,x] = ode45(f,t,0.1); % verification
plot(t,x ,'r--','linewidth',3)
grid on
title('Eulers Method vs ode45 Check')

Example: write a MATLAB code to solve the following ODE 𝑥̈ (𝑡) = 1 − 𝑥̇ (𝑡)𝑥(𝑡)

clear all, clc, h=0.01; n=5000; t0=0; t=t0:h:t0+n*h; x(1,:)=[0.1;0.2];


f=@(t,x)[1-x(1)*x(2);x(1)];
for i=1:n
x(i+1,:)= x(i,:) + h*(f(t(i),x(i,:)))';
end
plot(t(1:length(x)),x,'b','linewidth',3)
grid on
figure
plot(x(:,1),x(:,2),'b','linewidth',3) % Phase plane plot
grid on
Example: write a MATLAB code to solve the next system of ODEs

ẋ₁(t) = x₁(t) − x₂²(t)
ẋ₂(t) = x₁²(t) − x₂²(t)

clear all, clc, h=0.01; n=5000; t0=0; t=t0:h:t0+n*h; x(1,:)=[0.1;0.2];

f=@(t,x)[x(1)-x(2)*x(2); x(1)*x(1)-x(2)*x(2)];
for i=1:n
x(i+1,:)= x(i,:) + h*(f(t(i),x(i,:)))';
end
plot(t(1:length(x)),x,'b','linewidth',3)
grid on
hold on
figure
plot(x(:,1),x(:,2),'b','linewidth',3) % "plot" parametric relationship
grid on
Example: write a MATLAB code to solve the next forced system of ODEs

ẋ₁(t) = −x₂(t) + x₁(t)(1 − x₁²(t) − x₂²(t)) + cos(3t)
ẋ₂(t) = x₁(t) + x₂(t)(1 − x₁²(t) − x₂²(t)) + (3/5)sin(2t)
clear all, clc, h=0.01; n=5000; t0=0; t=t0:h:t0+n*h;
f=@(t,x)[-x(2)+x(1)*(1-x(1)*x(1)-x(2)*x(2))+cos(3*t);x(1)+x(2)*(1-
x(1)*x(1)-x(2)*x(2))+(3/5)*sin(2*t)]; x(1,:)= [0.1;0.2];
for i=1:n
x(i+1,:)= x(i,:) + h*(f(t(i),x(i,:)))';
end
plot(t(1:length(x)),x,'b','linewidth',3)
grid on, hold on
figure
plot(x(:,1),x(:,2),'b','linewidth',3) % "plot" parametric relationship
grid on

Heun's Method (Trapezoidal Method): Another method of solving a first-order


vector differential equation like 𝐱̇ (𝑡) = 𝐟(𝑡, 𝐱(𝑡)) comes from integrating both sides of the
equation.
∫_{t_k}^{t_{k+1}} 𝐱̇(t)dt = 𝐱(t_{k+1}) − 𝐱(t_k) = ∫_{t_k}^{t_{k+1}} 𝐟(t, 𝐱(t))dt    with    𝐱(t₀) = 𝐱₀

If we assume that the value of the (derivative) function 𝐟(t, 𝐱(t)) is constant as 𝐟(t_k, 𝐱_k) within one time step [t_k, t_{k+1}), this becomes

𝐱(t_{k+1}) − 𝐱(t_k) = ∫_{t_k}^{t_{k+1}} 𝐟(t, 𝐱(t))dt ≅ 𝐟(t_k, 𝐱_k) ∫_{t_k}^{t_{k+1}} dt = h𝐟(t_k, 𝐱_k)

If we use the trapezoidal rule instead, it becomes

𝐱_{k+1} − 𝐱_k = (h/2){𝐟(t_k, 𝐱_k) + 𝐟(t_{k+1}, 𝐱_{k+1})}
But, the right-hand side of this equation has 𝐱 𝑘+1 , which is unknown at 𝑡𝑘 . To resolve
this problem, we replace the 𝐱 𝑘+1 on the right-hand side by the following approximation:

𝐱 𝑘+1 = 𝐱 𝑘 + ℎ𝐟(𝑡𝑘 , 𝐱 𝑘 )

so that it becomes

𝐱_{k+1} = 𝐱_k + (h/2){𝐟(t_k, 𝐱_k) + 𝐟(t_{k+1}, 𝐱_k + h𝐟(t_k, 𝐱_k))}

This is Heun's method. It is a kind of predictor-and-corrector method in that it predicts the value of 𝐱_{k+1} by Euler's algorithm at t_k and then corrects the predicted value at t_{k+1}:

𝐱_{k+1} = 𝐱_k + (h/2){𝐊₁ + 𝐊₂}
𝐊₁ = 𝐟(t_k, 𝐱_k)
𝐊₂ = 𝐟(t_k + h, 𝐱_k + h𝐊₁)

The truncation error of Heun’s method is 𝒪(ℎ2 ) (proportional to ℎ2 ), while the error of
Euler’s method is 𝒪(ℎ). This Heun’s method is implemented in the MATLAB routine

clear all, close all, clc,


h=0.001; t=0:h:5; x=zeros(1,length(t)); x(1)=2; n=length(t)-1;
xd=@(t,x)(cos(t)-abs(x)^(-3/2)); %insert function to be solved
for i = 1:n
k1 = xd(t(i),x(i)); k2 = xd(t(i)+ h,x(i)+k1*h);
x(i+1) = x(i)+((k1+k2)/2)*h;
end
[t,x_check] = ode45(xd,t,2); % verification
plot(t,x ,'b','linewidth',3)
hold on
plot(t,x_check ,'r--','linewidth',3)
title('Heun''s Method vs ode45 Check')
grid on
Example: write a MATLAB code to solve the next forced system of ODEs

ẋ₁(t) = −x₂(t) − x₁(t)(x₁²(t) + x₂²(t)) + (3/4)e^{−t/2}
ẋ₂(t) = x₁(t) − x₂(t)(x₁²(t) + x₂²(t)) + (2/5)sin((3/4)πt)

clear all,clc, h=0.01; n=5000; t0=0; t=t0:h:t0+n*h; x(1,:)=[0.1;0.2];

f=@(t,x)[-x(2)-x(1)*(x(1)*x(1)+x(2)*x(2))+(3/4)*exp(-0.5*t);x(1)-
x(2)*(x(1)*x(1)+x(2)*x(2))+2/5*sin(3/4*pi*t)];

for i=1:n
k1 = f(t(i),x(i,:));
k2 = f(t(i)+ h,x(i,:)+k1*h);
x(i+1,:) = x(i,:)+((k1+k2)/2)'*h;
end
plot(t(1:length(x)),x,'b','linewidth',3)
grid on
hold on
figure
plot(x(:,1),x(:,2),'b','linewidth',3) % "plot" parametric relationship
grid on

Example: write a MATLAB code to solve the next non-forced system of ODEs

ẋ₁(t) = −x₁(t) + x₂(t) − x₁(t)(x₂(t) − x₁(t))
ẋ₂(t) = −x₁(t) − x₂(t) + 2x₁²(t)x₂(t)
clear all,clc, h=0.01; n=5000; t0=0; t=t0:h:t0+n*h; x(1,:)= [0.1;0.2];
f=@(t,x)[-x(1)+x(2)-x(1)*(x(2)-x(1)); -x(1)-x(2)+2*x(1)*x(1)*x(2)];
for i=1:n
k1 = f(t(i),x(i,:));
k2 = f(t(i)+ h,x(i,:)+k1*h);
x(i+1,:) = x(i,:)+((k1+k2)/2)'*h;
end
plot(t(1:length(x)),x,'b','linewidth',3)
grid on
hold on
figure
plot(x(:,1),x(:,2),'b','linewidth',3) % "plot" parametric relationship
grid on

Example: write a MATLAB code to solve the next non-forced system of ODEs (matching the code below)

ẋ₁(t) = x₂(t) − x₁(t)x₂²(t)
ẋ₂(t) = −x₁(t) + x₁(t)x₂²(t)

clear all, clc, h=0.01; t0=0; t=t0:h:100;


f=@(t,x)[x(2)-x(1)*x(2)*x(2); -x(1)+x(2)*x(1)*x(2)];x(1,:)=[0.01;0.02];
for i=1:length(t)-1
k1 = f(t(i),x(i,:));
k2 = f(t(i)+ h,x(i,:)+k1*h);
x(i+1,:) = x(i,:)+((k1+k2)/2)'*h;
end
plot(t(1:length(x)),x,'b','linewidth',3)
grid on
hold on
figure
plot(x(:,1),x(:,2),'b','linewidth',3) % "plot" parametric relationship
grid on
Runge–Kutta Method: Although Heun’s method is a little better than the Euler’s
method, it is still not accurate enough for most real-world problems. The fourth-order
Runge–Kutta (RK4) method having a truncation error of 𝒪(ℎ4 ) is one of the most widely
used methods for solving differential equations. Its algorithm is described below.


𝐱 𝑘+1 = 𝐱 𝑘 + (𝐊1 + 2𝐊 2 + 2𝐊 3 + 𝐊 4 )
6
𝐊1 = 𝐟(𝑡𝑘 , 𝐱 𝑘 )
ℎ ℎ
𝐊 2 = 𝐟 (𝑡𝑘 + , 𝐱 𝑘 + 𝐊1 )
2 2
ℎ ℎ
𝐊 3 = 𝐟 (𝑡𝑘 + , 𝐱 𝑘 + 𝐊 2 )
2 2
𝐊 4 = 𝐟(𝑡𝑘 + ℎ, 𝐱 𝑘 + ℎ𝐊 3 )

The first equation is the core of the RK4 method, which may be obtained by substituting Simpson's rule

∫_{t_k}^{t_{k+1}} f(x)dx ≅ (h₁/3){f_k + 4f_{k+1/2} + f_{k+1}}    with    h₁ = (t_{k+1} − t_k)/2 = h/2

into the integral form 𝐱_{k+1} − 𝐱_k = ∫_{t_k}^{t_{k+1}} 𝐟(t, 𝐱(t))dt; we get

𝐱_{k+1} = 𝐱_k + (h/6)(𝐟_k + 2𝐟_{k+1/2} + 2𝐟_{k+1/2} + 𝐟_{k+1})
Accordingly, the RK4 method has a truncation error of 𝒪(ℎ4 ) and thus is expected to
work better than the previous two methods.
clear all, close all, clc,
h = 0.01; % set the step size
t = 0:h:200; % set the interval of t
x = zeros(1,length(t)); x0=0.01;
x(1) = x0; % set the intial value for x
n = length(t)-1;
%f=@(t,x)(cos(t)-abs(x)^(-3/2)); % x0=2; % 0<t<4
f=@(t,x)((x)^(2)- (x)^(3)); %insert function to be solved
for i = 1:n
k1 = f(t(i),x(i));
k2 = f(t(i)+0.5*h,x(i)+0.5*k1*h);
k3 = f(t(i)+0.5*h,x(i)+0.5*k2*h);
k4 = f(t(i)+h,x(i)+k3*h);
x(i+1) = x(i)+((k1+2*k2+2*k3+k4)/6)*h;
end
[t,x_check] = ode45(f,t,x0);
plot(t,x,'b','linewidth',2)
hold on
plot(t,x_check,'--r','linewidth',2)
title(' Runge–Kutta (RK4) vs ode45 Check ')
grid on

Example: write a MATLAB code to solve the next forced system of ODEs

ẋ₁(t) = −x₂(t) + x₁(t)(1 − x₁²(t) − x₂²(t))
ẋ₂(t) = x₁(t) + x₂(t)(1 − x₁²(t) − x₂²(t)) + 3/4

clear all, clc, h=0.1; n=10000; t0=0; t=t0:h:t0+n*h;


f=@(t,x)[-x(2)+x(1)*(1-x(1)*x(1)-x(2)*x(2)); x(1)+x(2)*(1-x(1)*x(1)-
x(2)*x(2))+3/4]; x(1,:)= [0.01;0.02];
for i=1:n
k1 = f(t(i),x(i,:));
k2 = f(t(i)+0.5*h,x(i,:)+0.5*k1*h);
k3 = f(t(i)+0.5*h,x(i,:)+0.5*k2*h);
k4 = f(t(i)+h,x(i,:)+k3*h);
x(i+1,:) = x(i,:)+((k1+2*k2+2*k3+k4)'/6)*h;
end
plot(t(1:length(x)),x,'b','linewidth',3)
grid on
hold on
[t,x] = ode23(f,t,[0.01;0.02]); % verification
plot(t,x,'r--','linewidth',3)
title('Runge–Kutta (RK4) vs ode23 Check')
grid on
figure
plot(x(:,1),x(:,2),'b','linewidth',3) % "plot" parametric relationship
grid on
Example: write a MATLAB code to solve the next system of ODEs (matching the code below)

ẋ₁(t) = x₁²(t) + x₂²(t) − 1
ẋ₂(t) = 1 − x₁²(t) − x₂²(t)

clear all, close all, clc,


h = 0.01; % set the step size
f=@(t,x)[-1+(x(1))^(2)+(x(2))^(2);1-(x(2))^(2)-(x(1))^(2)];
x0=[0.01; 0.02];
x(1,:) = x0; i=0;
for t = 0:h:100
i=i+1;
t1(i)=t;
f1(i,:)=f(t, x(i,:));
k1 = f(t, x(i,:));
k2 = f(t+0.5*h, x(i,:)+0.5*k1*h);
k3 = f(t+0.5*h, x(i,:)+0.5*k2*h);
k4 = f(t+h, x(i,:)+k3*h);
x(i+1,:) = x(i,:)+(k1' + 2*k2'+ 2*k3' + k4')*(h/6);
end
plot(t1,x(1:length(t1),:),'b','linewidth',2)
hold on
[t,x_check] = ode45(f,t1,x0);
plot(t1,x_check,'--r','linewidth',2)
title(' Runge–Kutta (RK4)Method vs ode45 Check ')
grid on
figure
plot(x(:,1), x(:,2),'k','linewidth',2) % "plot" parametric relationship
grid on
Example: write a MATLAB code to solve the next forced system (Missile-Targeting)

ẋ₁(t) = −x₁²(t) − x₂(t) + x₃²(t)
ẋ₂(t) = x₁²(t) + x₂²(t) − x₃(t)
ẋ₃(t) = x₁²(t) − x₂(t) − x₃²(t) + (9.81/10)u(t)

assume that u(t) = 1 for t ≥ 0 and u(t) = 0 for t < 0.

clear all, close all, clc, h = 0.01;


f=@(t,x)[-(x(1))^2-x(2)+(x(3))^2;...
(x(1))^2-x(3)+(x(2))^2;(x(1))^2-x(2)-(x(3))^2+9.81/10];
x0=[0; 0; 1]; x(1,:) = x0; i=0;
for t = 0:h:100
i=i+1;
t1(i)=t;
f1(i,:)=f(t, x(i,:));
k1 = f(t, x(i,:));
k2 = f(t+0.5*h, x(i,:)+0.5*k1*h);
k3 = f(t+0.5*h, x(i,:)+0.5*k2*h);
k4 = f(t+h, x(i,:)+k3*h);
x(i+1,:) = x(i,:)+(k1' + 2*k2'+ 2*k3' + k4')*(h/6);
end
plot(t1,x(1:length(t1),:),'b','linewidth',2)
hold on
grid on
title(' Runge–Kutta (RK4)')
grid on
figure
plot3(x(:,1), x(:,2), x(:,3),'k','linewidth',2)
grid on
Adams–Bashforth–Moulton Method: The Adams–Bashforth–Moulton (ABM) method consists of two steps. The first step is to approximate 𝐟(t, 𝐱(t)) by the (Lagrange) polynomial of degree 3 matching the four points {(t_{k−3}, 𝐟_{k−3}), (t_{k−2}, 𝐟_{k−2}), (t_{k−1}, 𝐟_{k−1}), (t_k, 𝐟_k)} and substitute the polynomial into the integral form of the differential equation to get a predicted estimate of 𝐱_{k+1}:

𝐩_{k+1} = 𝐱_k + (h/24)(−9𝐟_{k−3} + 37𝐟_{k−2} − 59𝐟_{k−1} + 55𝐟_k)
The second step is to repeat the same work with the updated four points

{(𝑡𝑘−2 , 𝐟𝑘−2 ), (𝑡𝑘−1 , 𝐟𝑘−1 ), (𝑡𝑘 , 𝐟𝑘 ), (𝑡𝑘+1 , 𝐟𝑘+1 )} with 𝐟𝑘+1 = 𝐟(𝑡𝑘+1 , 𝐩𝑘+1 )

to get a corrected estimate of 𝐱_{k+1}:

𝐜_{k+1} = 𝐱_k + (h/24)(𝐟_{k−2} − 5𝐟_{k−1} + 19𝐟_k + 9𝐟_{k+1})
In order to enhance the estimate, the following modifications were proposed:

Predictor:  𝐩_{k+1} = 𝐱_k + (h/24)(−9𝐟_{k−3} + 37𝐟_{k−2} − 59𝐟_{k−1} + 55𝐟_k)
Modifier:   𝐦_{k+1} = 𝐩_{k+1} + (251/270)(𝐜_k − 𝐩_k)
Corrector:  𝐜_{k+1} = 𝐱_k + (h/24)(𝐟_{k−2} − 5𝐟_{k−1} + 19𝐟_k + 9𝐟(t_{k+1}, 𝐦_{k+1}))
            𝐱_{k+1} = 𝐜_{k+1} − (19/270)(𝐜_{k+1} − 𝐩_{k+1})
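A minimal MATLAB sketch of the basic ABM predictor-corrector (without the modification formulas); the test ODE and its initial value are assumptions, and the first three steps are generated with RK4:

clear all, clc, h = 0.01; t = 0:h:5;
f = @(t,x) cos(t) - x;                  % example ODE (an assumption)
x = zeros(size(t)); x(1) = 0; F = zeros(size(t)); F(1) = f(t(1),x(1));
for k = 1:3                             % RK4 start-up values
    k1 = f(t(k),x(k));             k2 = f(t(k)+h/2, x(k)+h/2*k1);
    k3 = f(t(k)+h/2, x(k)+h/2*k2); k4 = f(t(k)+h, x(k)+h*k3);
    x(k+1) = x(k) + h*(k1+2*k2+2*k3+k4)/6; F(k+1) = f(t(k+1),x(k+1));
end
for k = 4:length(t)-1
    p = x(k) + (h/24)*(-9*F(k-3)+37*F(k-2)-59*F(k-1)+55*F(k));        % predictor
    x(k+1) = x(k) + (h/24)*(F(k-2)-5*F(k-1)+19*F(k)+9*f(t(k+1),p));   % corrector
    F(k+1) = f(t(k+1),x(k+1));
end
plot(t,x,'b','linewidth',2), grid on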
Hamming Method: In this section, we introduce just the algorithm of the Hamming method, summarized in the next equations:

Predictor:  𝐩_{k+1} = 𝐱_{k−3} + (4h/3)(2𝐟_{k−2} − 𝐟_{k−1} + 2𝐟_k)
Modifier:   𝐦_{k+1} = 𝐩_{k+1} + (112/121)(𝐜_k − 𝐩_k)
Corrector:  𝐜_{k+1} = (1/8)(9𝐱_k − 𝐱_{k−2}) + (3h/8)(−𝐟_{k−1} + 2𝐟_k + 𝐟(t_{k+1}, 𝐦_{k+1}))
            𝐱_{k+1} = 𝐜_{k+1} − (9/121)(𝐜_{k+1} − 𝐩_{k+1})
clear all, clc,
h=0.01; n=20000; t0=0.01; t=t0:h:t0+n*h;
f=@(t,x)[x(2); -0.5*(x(1)*x(1)-1)*x(2)-x(1)];
[t,x] = ode45(f,t,[-0.1;0.2]);
plot(t,x,'r','linewidth',3)
title('ode45 Method')
grid on
figure
plot(x(:,1),x(:,2),'b','linewidth',3)

clear all, clc,


h=0.01; n=20000; t0=0.01; t=t0:h:t0+n*h;
f=@(t,x)[x(2); -0.5*x(2)-2*x(1)-x(1)*x(1)];
[t,x] = ode45(f,t,[-0.1;0.2]);
plot(t,x,'r','linewidth',3)
title('ode45 Method')
grid on
figure
plot(x(:,1),x(:,2),'b','linewidth',3)
CHAPTER IV:
Finding Roots of a Complex Function by Local Methods

A common problem encountered in engineering analysis is this: given a function f(x), determine the values of x for which f(x) = 0. The solutions (values of x) are known as the roots of the equation f(x) = 0, or the zeroes of the function f(x).

The Newton–Raphson algorithm is the best-known method of finding roots for a good
reason: it is simple and fast. The only drawback of the method is that it uses the
derivative 𝑓 ′ (𝑥) of the function as well as the function f (x) itself. Therefore, the Newton–
Raphson method is usable only in problems where 𝑓 ′ (𝑥) can be readily computed.

The Newton–Raphson formula can be derived from the Taylor series expansion of f (x)
about x:
𝑓(𝑥𝑖+1 ) = 𝑓(𝑥𝑖 ) + 𝑓 ′ (𝑥𝑖 )(𝑥𝑖+1 − 𝑥𝑖 ) + 𝒪(𝑥𝑖+1 − 𝑥𝑖 )2

If x_{i+1} is a root of f(x) = 0 then f(xᵢ) + f′(xᵢ)(x_{i+1} − xᵢ) + 𝒪(x_{i+1} − xᵢ)² = 0. Assuming that xᵢ is close to x_{i+1}, we can drop the term 𝒪(x_{i+1} − xᵢ)² and solve for x_{i+1}. The result is
the Newton–Raphson formula
𝑓(𝑥𝑖 )
𝑥𝑖+1 = 𝑥𝑖 − ′
𝑓 (𝑥𝑖 )

Remark: we are not interested by the study of convergence and speed of the
algorithms, because it is well known in textbooks.
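A minimal sketch of the Newton-Raphson iteration, applied to the same cubic used in the secant example below (the initial guess is an assumption):

clear all, clc, x = 2.5;              % initial guess (an assumption)
f  = @(x) x^3 - 6*x^2 + 11*x - 6;
df = @(x) 3*x^2 - 12*x + 11;
for k = 1:50
    x = x - f(x)/df(x);               % Newton-Raphson update
end
x % converges to one of the roots 1, 2, 3 (here: 1)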

Newton's method requires the evaluation of the derivative f′(x) of the given function f(x), and this itself is a problem. To overcome this obstacle let us replace the derivative by its approximation

f′(xᵢ) ≅ [f(xᵢ) − f(x_{i−1})] / (xᵢ − x_{i−1})

This results in the secant formula

x_{i+1} = xᵢ − (xᵢ − x_{i−1})f(xᵢ) / (f(xᵢ) − f(x_{i−1}))    or equivalently    x_{i+1} = (x_{i−1}f(xᵢ) − xᵢf(x_{i−1})) / (f(xᵢ) − f(x_{i−1}))

clear all, clc, x1=2.5; x2=3.5; s=1;


while abs(s) > eps
f1=x1^3-6*x1^2+11* x1-6; f2= x2^3-6*x2^2+11* x2-6;
x= x2 - ((x2-x1)*f2)/(f2-f1);
x1=x2; x2=x; s=x2-x1;
end
x
Let us now consider the n-dimensional version of the same problem, namely 𝐟(𝐱) = 0.

The solution of n simultaneous, nonlinear equations is a much more formidable task


than finding the root of a single equation. The trouble is the lack of a reliable method
for bracketing the solution vector x. Therefore, we cannot provide the solution algorithm
with a guaranteed good starting value of x, unless such a value is suggested by the
physics of the problem. The simplest and the most effective means of computing x is
the Newton– Raphson method. It works well with simultaneous equations, provided
that it is supplied with a good starting point. There are other methods that have better
global convergence characteristics, but all of them are variants of the Newton–Raphson
method.

In order to derive the Newton–Raphson method for a system of equations, we start with
the Taylor series expansion of 𝐟(𝐱) = [f1 (𝐱) f2 (𝐱) … f𝑛 (𝐱)]𝑇 about the point x:

𝐟(𝐱 + ∆𝐱) = 𝐟(𝐱) + 𝑱(𝐱)∆𝐱 + 𝓞(‖∆𝐱‖²)

where 𝑱(𝐱) is the Jacobian matrix (of size 𝑛 × 𝑛) made up of the partial derivatives

𝜕𝑓𝑖
𝑱(𝐱) = [𝐽𝑖𝑗 ] = [ ]
𝜕𝑥𝑗

Let us now assume that x is the current approximation of the solution of 𝐟(𝐱) = 𝟎, and
let 𝐱 + ∆𝐱 be the improved solution. To find the correction ∆𝐱, we set 𝐟(𝐱 + ∆𝐱) = 0. The
result is a set of linear equations for ∆𝐱 :
∆𝐱 = −(𝑱(𝐱))⁻¹𝐟(𝐱)    or equivalently    𝐱_{k+1} = 𝐱_k − (𝑱(𝐱_k))⁻¹𝐟(𝐱_k)

The above process is continued until ‖∆𝐱‖ < 𝜀, where 𝜀 is the error tolerance. As in the
one-dimensional case, success of the Newton–Raphson procedure depends entirely on
the initial estimate of 𝐱. If a good starting point is used, convergence to the solution is
very rapid. Otherwise, the results are unpredictable.

Example: Determine the points of intersection between the circle x² + y² = 3 and the hyperbola xy = 1. This means

f₁(x, y) = x² + y² − 3 = 0
f₂(x, y) = xy − 1 = 0
⟹    𝑱(x, y) = [2x 2y; y x]    ⟺    (∆x; ∆y) = [2x 2y; y x]⁻¹ (3 − x² − y²; 1 − xy)

clear all, clc, x1=0.5; x2=1; s=1;


while abs(s) > eps
f1=x1^2 + x2^2-3;
f2= x1*x2-1;
J=[2*x1 2*x2; x2 x1];
x= [x1; x2] - inv(J)*[f1; f2];
s= x -[x1; x2]; x1=x(1,:); x2=x(2,:);
end
x
Besides Newton's method, other methods are available for the solution of nonlinear systems. There is, for example, a multidimensional version of the secant method called Broyden's method. Instead of inverting the Jacobian 𝑱(𝐱) on every iteration, this may be avoided using some modifications of the Newton–Raphson iterations.

In this section we introduce the most used secant approximation to the Jacobian, proposed by C. Broyden. The algorithm, analogous to Newton's method, but substituting this approximation for the analytic Jacobian, is called Broyden's method.

We can see that the use of Newton's method means that we have to evaluate the
Jacobian 𝑱(𝐱) and the function 𝐟(𝐱), then we have to solve a system of 𝑛2 linear equation
at each step. So Newton's method has a quite high computational cost. Broyden's
method is a generalization of the secant method to the multivariable case. It has only a
superlinear convergence rate. However, it is much less expensive in computations for
each step.

In Newton's method we know that 𝑱(𝐱_k)∆𝐱_k = 𝐟(𝐱_{k+1}) − 𝐟(𝐱_k) = ∆𝐟(𝐱_k), and on the other hand we know that (∆𝐱_k)ᵀ∆𝐱_k = ‖∆𝐱_k‖², so

𝑱(𝐱_k)∆𝐱_k = ∆𝐟(𝐱_k)    ⟺    (𝑱(𝐱_k) − 𝑱(𝐱_{k−1}))∆𝐱_k = ((∆𝐟(𝐱_k) − 𝑱(𝐱_{k−1})∆𝐱_k)/‖∆𝐱_k‖²)(∆𝐱_k)ᵀ∆𝐱_k
                        ⟺    𝑱(𝐱_k) = 𝑱(𝐱_{k−1}) + ((∆𝐟(𝐱_k) − 𝑱(𝐱_{k−1})∆𝐱_k)/‖∆𝐱_k‖²)(∆𝐱_k)ᵀ

The computation of the inverse needs a lot of memory space, and this can be avoided if we calculate iteratively the inverse of the matrix 𝑱(𝐱_k) at each step. This can be accomplished by using the Sherman–Morrison–Woodbury formula

(𝑨_k + 𝐮_k𝐯_kᵀ)⁻¹ = 𝑨_k⁻¹ − (𝑨_k⁻¹𝐮_k𝐯_kᵀ𝑨_k⁻¹)/(1 + 𝐯_kᵀ𝑨_k⁻¹𝐮_k)    with    1 + 𝐯_kᵀ𝑨_k⁻¹𝐮_k ≠ 0

Let us define 𝑨_k = 𝑱(𝐱_{k−1}), 𝐮_k = (∆𝐟(𝐱_k) − 𝑱(𝐱_{k−1})∆𝐱_k)/‖∆𝐱_k‖², 𝐯_k = ∆𝐱_k and 𝑩_k = (𝑱(𝐱_k))⁻¹; then it can be verified that:

𝑩_k = 𝑩_{k−1} + ((∆𝐱_k − 𝑩_{k−1}∆𝐟(𝐱_k))/((∆𝐱_k)ᵀ𝑩_{k−1}∆𝐟(𝐱_k)))(∆𝐱_k)ᵀ𝑩_{k−1}

Algorithm (Broyden):
Data: 𝐟(𝐱), 𝐱₀, 𝐟(𝐱₀) and 𝑩₀
Result: 𝐱_k
begin:
    𝐱_{k+1} = 𝐱_k − 𝑩_k𝐟(𝐱_k)
    𝒔_k = 𝐱_{k+1} − 𝐱_k
    𝐲_k = 𝐟(𝐱_{k+1}) − 𝐟(𝐱_k)
    𝑩_{k+1} = 𝑩_k + ((𝒔_k − 𝑩_k𝐲_k)/(𝒔_kᵀ𝑩_k𝐲_k)) 𝒔_kᵀ𝑩_k
end
Example: (Author's idea) Determine the solution of the following matrix equation

𝑭(𝑿) = 𝑿𝑨1 − 𝑨4 𝑿 − 𝑿𝑨2 𝑿 + 𝑨3 = 𝟎

Where: 𝑿 ∈ ℝ2×2 is the variable matrix to be determined and 𝑨𝑖 ∈ ℝ2×2 are constant
matrices given by:

1 2 4 3 1 3 8 8
𝑨1 = ( ), 𝑨2 = ( ), 𝑨3 = ( ) and 𝑨4 = ( )
3 2 2 5 5 5 6 1
clear all, clc,
A1 =[1 2;3 2]; A2 =[4 3;2 5];
A3 =[1 3;5 5]; A4 =[8 8;6 1];

X0=eye(2,2)-0.1*rand(2,2); B=inv(100*eye(4,4)); % initialization (X0 near the identity)

for k=1:1000

y0=X0*A1 - A4*X0 - X0*A2*X0 + A3; % 𝑭(𝑿𝑘 )


f0=[y0(:,1);y0(:,2)]; % 𝐟𝑘 = vec(𝑭(𝑿𝑘 ))
x0=[X0(:,1); X0(:,2)]; % 𝐱 𝑘 = vec(𝑿𝑘 )
x1=x0-B*f0; % vec(𝑿𝑘+1 ) = vec(𝑿𝑘 ) − 𝑱𝑘 −1 . vec(𝑭(𝑿𝑘 ))
X1=[x1(1:2,:) x1(3:4,:)]; % construction of 𝑿𝑘 from 𝐱 𝑘

y1=X1*A1 - A4*X1 - X1*A2*X1 + A3; % 𝑭(𝑿𝑘+1 )


f1=[y1(:,1);y1(:,2)]; % 𝐟𝑘+1 = vec(𝑭(𝑿𝑘+1 ))
x1=[X1(:,1); X1(:,2)]; % 𝐱 𝑘+1 = vec(𝑿𝑘+1 )

y=f1-f0; % 𝐲𝑘 = 𝐟𝑘+1 − 𝐟𝑘
s=x1-x0; % 𝒔𝑘 = 𝐱 𝑘+1 − 𝐱 𝑘
B = B +((s-B*y)*(s'*B))/(s'*B*y); % 𝑱−1 (𝐱 𝑘 )
X0=X1; % update
end
X1 % solution
ZERO1=X1*A1 - A4*X1 - X1*A2*X1 + A3 % verifications

The result is

X1 = [ 0.7970   0.8191          ZERO1 = [ 2.2204e-16   4.4409e-16
      −0.4120  −0.2950 ]                  0            0          ]
Horner's method is a technique to evaluate polynomials quickly. It needs n multiplications and n additions to evaluate a polynomial. It is also a nested algorithmic scheme that can factorize a polynomial into a product of n linear factors. This scheme (Horner's method) is based on Euclidean synthetic long division. (The wording of this method is the author's property.)

Theorem: Let a(λ) = ∑_{i=0}^{n} aᵢλ^{n−i} be a polynomial of degree n defined on the complex field ℂ → ℂ[λ], where aᵢ are constant coefficients and λ is a complex variable. The sequence {r(k)}_{k=0}^{∞} converges iteratively to an exact solution of a(λ) = 0, λ = lim_{k→∞} r(k), if and only if

bᵢ(k) = aᵢ + b_{i−1}(k)r(k),    i = 1, 2, …, n − 1
r(k + 1) = −(b_{n−1}(k))⁻¹ a_n,    with b₀ = a₀,    k = 1, 2, …

Proof: (The theorem's wording is the author's property.) Using the remainder theorem, if we divide a(λ) by the linear factor (λ − r) we get a(λ) = (λ − r)b(λ) + a(r), meaning that

a(λ) = (λ − r)(b₀λ^{n−1} + b₁λ^{n−2} + ⋯ + b_{n−1}) + a(r)

If we set b_n = a(r) and expand this equation, we get:

a(λ) = b₀λⁿ + (b₁ − b₀r)λ^{n−1} + (b₂ − b₁r)λ^{n−2} + ⋯ + (b_{n−1} − b_{n−2}r)λ + b_n

Identifying the coefficients of the different powers of λ we get:

aᵢ = bᵢ − b_{i−1}r  and  a₀ = b₀,  i = 1, 2, …, n    ⟺    bᵢ = aᵢ + b_{i−1}r

Since r is a root of a(λ), we have b_n = a(r) = 0, and from this last equation we can deduce that b_n = a_n + b_{n−1}r = 0, in other words r = −(b_{n−1})⁻¹a_n. If we iterate the last obtained equations we arrive at:

bᵢ(k) = aᵢ + b_{i−1}(k)r(k),    i = 1, 2, …, n − 1
r(k + 1) = −(b_{n−1}(k))⁻¹ a_n,    with b₀ = a₀,    k = 1, 2, …

Based on the iterative Horner scheme we can repeat the process many times to get a solution λ = lim_{k→∞} r(k), and the theorem is proved. ∎

Algorithm: (Horner's Method)
1   Enter the number of iterations N
2   Enter the degree n and the polynomial coefficients aᵢ ∈ ℝ
3   r(0) ∈ ℝ initial guess;  b₀ = b₀(0) = a₀ = 1
4   For k = 0 : N
5       For i = 1 : n − 1
6           bᵢ(k) = aᵢ + b_{i−1}(k)r(k);  b_{i−1}(k) = bᵢ(k)
7       End
8       r(k + 1) = −(b_{n−1}(k))⁻¹ a_n
9       r(k) ← r(k + 1)
10  End
Example: Determine all roots of the following equation
𝑝(𝑧) = 𝑎0 𝑧 5 + 𝑎1 𝑧 4 + 𝑎2 𝑧 3 + 𝑎3 𝑧 2 + 𝑎4 𝑧 + 𝑎5 = 0
With a0=1; a1=-15; a2=85; a3=-225; a4=274; a5=-120;

clear all, clc,


a0=1; a1=-15; a2=85; a3=-225; a4=274; a5=-120;
a=[a1, a2, a3, a4, a5]; n=length(a); b0=a0;
r=0.5; % r=3.5; r=4.5;
for k=1:50
b1=a1+r*b0; b2=a2+r*b1; b3=a3+r*b2; b4=a4+r*b3;
r=-a5*inv(b4);
end
x1=r
zero1=a0*x1^5 + a1*x1^4 + a2*x1^3 + a3*x1^2 + a4*x1 + a5

When you get the first root, repeat the process using long division (deflation) until you get the complete set; a compact loop version is sketched after these code blocks.

b0; b1; b2; b3; b4; c0=b0;


for i=1:100
c1=b1+r*c0; c2=b2+r*c1; c3=b3+r*c2;
r=-b4*inv(c3);
end
x2=r

c0; c1; c2; c3; d0=c0;


for i=1:100
d1=c1+r*d0; d2=c2+r*d1;
r=-c3*inv(d2);
end
x3=r

d0; d1; d2; e0=d0;


for i=1:100
e1=d1+r*e0;
r=-d2*inv(e1);
end
x4=r

e0; e1; f0=e0;


for i=1:50
f1=e1+r*f0;
r=-e1*inv(f0);
end
x5=r
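The whole process can also be written as one compact loop (a sketch): find a root by the iteration of the theorem, deflate by synthetic division, and repeat. Convergence of the fixed-point iteration depends on the initial guess and can be slow near clustered roots.

clear all, clc, a = [1 -15 85 -225 274 -120]; % coefficients a0, a1, ..., an
r_all = [];
while length(a) > 1
    r = 0.5;                        % initial guess (an assumption)
    for k = 1:200
        b = a(1);                   % synthetic (Horner) division by (z - r)
        for i = 2:length(a)-1
            b(i) = a(i) + r*b(i-1);
        end
        r = -a(end)/b(end);         % r(k+1) = -(b_{n-1}(k))^{-1} a_n
    end
    r_all(end+1) = r;               % store the converged root
    a = b;                          % deflated polynomial coefficients
end
r_all % for this polynomial: the roots 1, 2, 3, 4, 5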
Bernoulli's method, in particular, is one which yields all
dominant zeros of a polynomial. By a dominant zero we mean a zero whose modulus is
not exceeded by the modulus of any other zero.

We have seen how the general linear difference equation (𝐷𝐸) with constant coefficients
can be solved analytically by determining the zeros of the associated characteristic
polynomial. Bernoulli's method consists in reversing this procedure.

The polynomial whose zeros are sought is considered the characteristic polynomial of
some difference equation and this associated difference equation is solved numerically.

Consider the complex polynomial

p(z) = a₀zᴺ + a₁z^{N−1} + ⋯ + a_{N−1}z + a_N = zᴺ(a₀ + a₁z⁻¹ + ⋯ + a_{N−1}z^{1−N} + a_Nz⁻ᴺ)

whose coefficients may be complex and which has N distinct zeros z₁, z₂, …, z_N. Now let us see what happens when we solve the difference equation a₀x[n] + a₁x[n − 1] + ⋯ + a_Nx[n − N] = 0, which has p(z) as its characteristic polynomial. The solution X = {x_n} (whatever its starting values are) must be representable in the form x[n] = C₁z₁ⁿ + C₂z₂ⁿ + ⋯ + C_Nz_Nⁿ

where 𝐶1 , 𝐶2 , … , 𝐶𝑁 are suitable constants. To proceed further we make two assumptions:

𝑖) The polynomial 𝑝(𝑧) has a single dominant zero, 𝑧1 : |𝑧1 | > |𝑧𝑘 | for 𝑘 = 2,3, … , 𝑁

𝑖𝑖) The starting values are such that the dominant zero is represented in the solution,
i.e., 𝐶1 ≠ 0. We now consider the ratio of two consecutive values of the solution
sequence {𝑥𝑛 }. we find

x[n + 1]/x[n] = (C₁z₁ⁿ⁺¹ + C₂z₂ⁿ⁺¹ + ⋯ + C_Nz_Nⁿ⁺¹)/(C₁z₁ⁿ + C₂z₂ⁿ + ⋯ + C_Nz_Nⁿ)
             = z₁ · [1 + (C₂/C₁)(z₂/z₁)ⁿ⁺¹ + ⋯ + (C_N/C₁)(z_N/z₁)ⁿ⁺¹] / [1 + (C₂/C₁)(z₂/z₁)ⁿ + ⋯ + (C_N/C₁)(z_N/z₁)ⁿ]

We know that

|z_k/z₁| < 1 for k = 2, 3, …, N    ⟹    lim_{n→∞} (z_k/z₁)ⁿ = 0    ⟹    lim_{n→∞} x[n + 1]/x[n] = z₁

It is recommended that we agree on the following starting point

𝑥[0] = 1; 𝑥[1 − 𝑁] = 0, 𝑥[2 − 𝑁] = 0, … , 𝑥[−1] = 0

Example: Determine the dominant root of the following polynomial using Bernoulli's
Method.

𝑝(𝑧) = 𝑧 3 − 6𝑧 2 + 11𝑧 − 6 = 0
clear all, clc

a1=-6; a2=11; a3=-6;


u1=0; u2=0; u3=1; % initialization
u=[]; q1=[];

for k=1:80
u0=-(a1*u1+a2*u2+a3*u3); % U(k)=-(A1*U(k-1)+A2*U(k-2)+A3*U(k-3))
u3=u2; % let U0=U(k), U1=U(k-1), U2=U(k-2) and U3=U(k-3)
u2=u1; % after one iteration(k=k+1) we have U3=U2; U2=U1; U1=U0
u1=u0;
q1=u1*inv(u2);
end
q1

Example: Determine the dominant root of the following polynomial using Bernoulli's Method.

p(z) = (z − 3)²(z + 1)² = z⁴ − 4z³ − 2z² + 12z + 9 = 0

clear all, clc

a1=-4; a2=-2; a3=12; a4=9;


u1=0; u2=0; u3=0; u4=1; % initialization
u=[]; q1=[];

for k=1:200
u0=-(a1*u1+a2*u2+a3*u3 + a4*u4);
u4=u3; u3=u2; u2=u1; u1=u0;
q1=u1*inv(u2);
end
q1
CHAPTER V:
Finding Roots of Polynomials by Global Methods (QD-Algorithm)

The QR algorithm (and, by implication, the QD) is one of the ten Algorithms that Changed the World.

When founding the Institute of Applied Mathematics at the ETH in Zurich in 1948, Eduard Stiefel hired Heinz Rutishauser, who had just finished his dissertation in complex analysis, as a research assistant. Rutishauser was hired to help construct a digital electronic computer and to explore and develop numerical methods for using it. In 1952 he finished his Habilitation thesis, in which he developed a compiler, and became a private lecturer. After that, around 1953, on Stiefel's suggestion, Rutishauser (1954) approached the key problem of determining the poles of the following rational function given by a power series

f(z) = n(z)/d(z) = (1/z)∑_{i=0}^{∞} hᵢz⁻ⁱ = h₀/z + h₁/z² + h₂/z³ + ⋯

The application he had in mind was the following. Assume that 𝑨 is an n × n matrix and that 𝑪 and 𝑩 are two n-vectors. Then, for hᵢ = 𝑪𝑨ⁱ𝑩, the series above is the Taylor expansion at ∞ of

f(z) = ⟨𝑪ᵀ, (z𝑰 − 𝑨)⁻¹𝑩⟩ = (1/z)⟨𝑪ᵀ, (𝑰 − 𝑨/z)⁻¹𝑩⟩

which is a proper rational fraction of degree 𝑟 ≤ 𝑛 whose poles are eigenvalues of A. This
is seen from the representation
𝑪adj(𝑧𝑰 − 𝑨)𝑩
𝑓(𝑧) =
det(𝑧𝑰 − 𝑨)

which also reveals that only the numerator depends on 𝑩 and 𝑪 unless some zeros and
poles cancel. This application to the matrix eigenvalue problem was the starting point
and the target of Rutishauser’s investigation. He called the coefficients ℎ𝑖 Schwarz
constants, but today they are referred to as moments in numerical linear algebra and as
Markov parameters in systems and control theory, where the sequence of moments is
the impulse response of the linear time-invariant discrete-time single-input single-
output control system given by the state matrix 𝑨 and the vectors 𝑩 and 𝑪. So Stiefel’s
proposal for Rutishauser was to determine the eigenvalues of 𝑨 given the sequence of
moments. Rutishauser (1954) called his algorithm the quotient-difference algorithm or,
briefly, the QD algorithm. Nowadays, the abbreviation in lower-case letters, qd
algorithm, is widely used.

Remark: The rational proper function f (i.e. f(∞) = 0) can be considered as a generating function of the moments hᵢ, and the poles of f are eigenvalues of 𝑨.

Schwarz constants (= moments = Markov parameters)

We know now that Stiefel's proposal was actually a bad one, because the problem of determining the eigenvalues from the moments is typically extremely ill-conditioned (see Gautschi (1968)). Rutishauser became aware of this ill-conditioning and of a better solution of the matrix eigenvalue problem: namely, using the Lanczos (1950) algorithm for reducing the matrix to tridiagonal form and then applying the progressive form of his qd algorithm.

Clearly, there are several equivalent problems:

• Find eigenvalues of 𝑨.
• Find poles of the generating (rational) function f.
• Find zeros of the denominator polynomial of f (Bernoulli).

In theory, the problem had been solved before by

• Jacques Hadamard (1892) (his PhD thesis!),
• de Montessus de Ballore (1902/1905),
• A.C. Aitken (1926/1931).

But none of them had an efficient algorithm. Rutishauser cites Hadamard and Aitken,
but never de Montessus de Ballore, who proved the convergence of Padé approximants
with fixed denominator degree.

Daniel Bernoulli was the first to lay the cornerstone of the method for finding a greatest root. Dénes König established, about 150 years later, that the analogous result holds for any power series of an analytic function with a single simple pole on the boundary of the disk of convergence. Soon after that, the French mathematician Jacques Hadamard (1865–1963) solved in his thesis the problem of finding all the poles of f from the moments, by a beautiful procedure that is, however, very ill-suited to computer implementation.

Consider the second-order difference equation h_{i+2} + d₁h_{i+1} + d₂hᵢ = 0. If we define d₁ = −(λ₁ + λ₂) and d₂ = λ₁λ₂, then this difference equation becomes h_{i+2} − (λ₁ + λ₂)h_{i+1} + λ₁λ₂hᵢ = 0. Its characteristic polynomial (from the z-transformation of the sequence) is d(z) = z² − (λ₁ + λ₂)z + λ₁λ₂ = 0.

Now, it was known to Daniel Bernoulli (1732) and D. König (1884–1944) that, if d(z) has a unique zero λ₁ of maximum modulus (and hence the series converges for |z| > |λ₁|), then the solution {hᵢ} of the difference equation satisfies

λ₁ = lim_{i→∞} h_{i+1}/hᵢ = lim_{i→∞} h_{i+2}/h_{i+1} = lim_{i→∞} h_{i+3}/h_{i+2}

Now, based on Bernoulli's idea, let us try to parametrize the second solution:

h_{i+2} − (λ₁ + λ₂)h_{i+1} + λ₁λ₂hᵢ = 0    ⟺    λ₂ = lim_{i→∞} { (h_{i+2} − λ₁h_{i+1}) / (h_{i+1} − λ₁hᵢ) }

Replacing λ₁ by the ratios h_{i+3}/h_{i+2} and h_{i+2}/h_{i+1}, respectively, gives

λ₂ = lim_{i→∞} { (h_{i+2} − (h_{i+3}/h_{i+2})h_{i+1}) / (h_{i+1} − (h_{i+2}/h_{i+1})hᵢ) } = lim_{i→∞} (h_{i+1}/h_{i+2}) · { (h_{i+2}² − h_{i+1}h_{i+3}) / (h_{i+1}² − hᵢh_{i+2}) }

or, written with Hankel determinants (both prefactors tend to 1/λ₁),

λ₂ = lim_{i→∞} ( |h_{i+1} h_{i+2}; h_{i+2} h_{i+3}| / h_{i+1} ) / ( |hᵢ h_{i+1}; h_{i+1} h_{i+2}| / hᵢ )

Hadamard and Aitken: Suppose 𝑓 is a proper rational function 𝑛(𝑧)/𝑑(𝑧) of degree 𝑟 with explicitly known denominator

$$d(z) = d_0 z^r + d_1 z^{r-1} + \cdots + d_{r-1} z + d_r$$

By the Cayley-Hamilton theorem it is well known that the matrix 𝑨 satisfies its own characteristic equation: 𝑑(𝑨) = 𝑑0𝑨^𝑟 + 𝑑1𝑨^{𝑟−1} + ⋯ + 𝑑𝑟−1𝑨 + 𝑑𝑟𝑰 = 𝟎. Multiplying this formula by 𝑪 on the left and by 𝑩 on the right, we get 𝑑0𝑪𝑨^𝑟𝑩 + 𝑑1𝑪𝑨^{𝑟−1}𝑩 + ⋯ + 𝑑𝑟−1𝑪𝑨𝑩 + 𝑑𝑟𝑪𝑩 = 0. In terms of Markov parameters (i.e. ℎ𝑖 = 𝑪𝑨^𝑖𝑩) we can express it as

𝑑0 ℎ𝑟 + 𝑑1 ℎ𝑟−1 + ⋯ + 𝑑𝑟−1 ℎ1 + 𝑑𝑟 ℎ0 = 0

Shifting this sequence by 𝑘 steps (i.e. ℎ𝑖+𝑘 = 𝑪𝑨^{𝑖+𝑘}𝑩) we obtain

𝑑0 ℎ𝑟+𝑘 + 𝑑1 ℎ𝑟+𝑘−1 + ⋯ + 𝑑𝑟−1 ℎ𝑘+1 + 𝑑𝑟 ℎ𝑘 = 0
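This recursion is easy to verify numerically. The following sketch (a random realization is assumed; poly() returns the characteristic coefficients 𝑑0, …, 𝑑𝑟) checks that the shifted Markov parameters annihilate the denominator coefficients:

M = rand(3); A = M*diag([1 2 3])/M;   % r = 3, assumed eigenvalues 1,2,3
B = rand(3,1); C = rand(1,3);
d = poly(A);                          % d = [d0 d1 d2 d3]
h = arrayfun(@(i) C*(A^i)*B, 0:10);   % h(j+1) stores h_j
k = 2;
residual = d*h(k+4:-1:k+1)'           % d0*h_{k+3}+d1*h_{k+2}+d2*h_{k+1}+d3*h_k ~ 0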

This recursion depends only on 𝑑(𝑧). The numerator polynomial 𝑛(𝑧) of degree 𝑟 − 1
matches the first 𝑟 terms of the Laurent series of 𝑓(𝑧)𝑑(𝑧), that is,

$$f(z)d(z) - n(z) = \mathcal{O}\!\left(\frac{1}{z}\right) \quad \text{as } z\to\infty$$

It was known to Daniel Bernoulli (1700–1782) that, if 𝑑(𝑧) has a unique zero 𝑧1 of maximum modulus (and hence the series converges for |𝑧| > |𝑧1|), then the solution {ℎ𝑖} of the difference equation satisfies

$$z_1 = \lim_{i\to\infty}\frac{h_{i+1}}{h_i}$$

This is Bernoulli’s method for finding such a greatest root (see Aitken, 1926). In 1892 Hadamard considered the double sequence of Hankel determinants $H_k^{(i)}$:

$$H_k^{(i)} = \begin{vmatrix} h_i & h_{i+1} & \cdots & h_{i+k-1}\\ h_{i+1} & h_{i+2} & \cdots & h_{i+k}\\ \vdots & \vdots & \ddots & \vdots\\ h_{i+k-1} & h_{i+k} & \cdots & h_{i+2k-2}\end{vmatrix}, \qquad (k = 1,2,\ldots,n;\;\; i = 1,2,\ldots)$$

and, adapted to our situation, established the following main result.


Theorem (Hadamard 1892): Assume that the series ℎ0𝑧^{−1} + ℎ1𝑧^{−2} + ℎ2𝑧^{−3} + ⋯ represents a rational function whose 𝑟 poles, counted including multiplicity, are ordered so that |𝜆1| ≥ |𝜆2| ≥ ⋯ ≥ |𝜆𝑟−1| ≥ |𝜆𝑟|.

If 1 ≤ 𝑘 ≤ 𝑟 and |𝜆𝑘+1| < 𝛬 < |𝜆𝑘|, or if 𝑘 = 𝑟 and 𝛬 < |𝜆𝑟|, then

$$H_k^{(i)} = C_k\,(\lambda_1\lambda_2\cdots\lambda_k)^i\left[1+\mathcal{O}\!\left(\left(\frac{\Lambda}{|\lambda_k|}\right)^{\!i}\right)\right] \quad \text{as } i\to\infty, \qquad C_k = \text{constant}$$

Assuming simple poles, Peter Henrici (1958) gave a simpler proof of this result. Multiple
poles can be treated with a technique used by Michael Golomb (1943). New proofs of
Hadamard’s theorem have also been a topic in the subsequent qd literature (see Gragg
& Householder, 1966; Gragg, 1973; Householder, 1974). Here are some obvious
conclusions.

Corollary: Under the assumptions of the previous theorem, we have that $H_{r+1}^{(i)} = 0$, $H_1^{(i)} = h_i$ and $H_0^{(i)} = 1$ (∀𝑖). Moreover,

▪ If f has 𝑟 simple poles, then $H_k^{(i)} \neq 0$ (𝑘 = 1, …, 𝑟) for large enough 𝑖.
▪ If |𝜆𝑘+1| < |𝜆𝑘| then
$$\lim_{i\to\infty}\frac{H_k^{(i+1)}}{H_k^{(i)}} = \lambda_1\lambda_2\cdots\lambda_k$$
▪ If |𝜆𝑘+1| < |𝜆𝑘| < |𝜆𝑘−1| then
$$q_k^{(i)} = \frac{H_k^{(i+1)}}{H_k^{(i)}}\cdot\frac{H_{k-1}^{(i)}}{H_{k-1}^{(i+1)}} \longrightarrow \lambda_k \quad \text{as } i\to\infty$$

Naturally, Hadamard was only interested in exact relationships, not in computation. Thus the motivation for developing an efficient algorithm was missing, though, in fact, he had the key in his hands. Aitken (1926, 1931) used and proved what is now called the “Jacobi identity” (the “theorem of compound determinants”), which is a relation between Hankel determinants:

$$\left(H_k^{(i)}\right)^2 = H_k^{(i-1)}H_k^{(i+1)} - H_{k+1}^{(i-1)}H_{k-1}^{(i+1)}$$

Even if it had also been known to Hadamard, it was Aitken who used it to build up — from the left or from the top — the table of Hankel determinants. For instance, the rhombus rule centered at $H_2^{(1)}$ reads

$$\left(H_2^{(1)}\right)^2 = H_2^{(0)}H_2^{(2)} - H_3^{(0)}H_1^{(2)}$$

From this table the Jacobi identity can be considered as a rhombus rule.
clear all, clc,

M=rand(4,4); A= M*diag([-1 -2 -3 -4])*inv(M);


B=[1;2;-1;1]; C=[1,0,-2,3];
n=min(size(A));
i=100;
% Give the values of k=1,2,3,…,r
k=2
for l=1:k
for m=1:k
H(l,m)=C*(A^(i+(l+m)-2))*B;
end
end
H
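Building on the script above, one can go further and form the Hadamard/Aitken ratio of Hankel determinants, which tends to the 𝑘-th eigenvalue. This continuation is our own sketch (the helper Hk and the chosen indices are illustrative assumptions):

h  = arrayfun(@(j) C*(A^j)*B, 0:25);                  % moments h_0,...,h_25
Hk = @(i,k) det(hankel(h(i+1:i+k), h(i+k:i+2*k-1)));  % Hankel determinant H_k^(i)
i0 = 10; k = 2;
qk = (Hk(i0+1,k)/Hk(i0,k))*(Hk(i0,k-1)/Hk(i0+1,k-1))  % tends to lambda_2 = -3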

Rutishauser (1954) was aware of the work of Hadamard (1892), Aitken (1926, 1931) and Lanczos (1950) when he worked on Stiefel’s problem. It seems that, in the second half of 1952 or early in 1953, he took Aitken’s work and improved it in a significant way. The key result was his qd algorithm. Rutishauser (1954) reformulated Aitken’s approach: instead of computing the $H_k^{(i)}$ table, he headed directly for recurrences for

$$q_k^{(i)} = \frac{H_k^{(i+1)}}{H_k^{(i)}}\cdot\frac{H_{k-1}^{(i)}}{H_{k-1}^{(i+1)}} \qquad\text{and}\qquad e_k^{(i)} = \frac{H_{k+1}^{(i)}}{H_k^{(i)}}\cdot\frac{H_{k-1}^{(i+1)}}{H_k^{(i+1)}}$$

Now, multiplying Jacobi’s identity centered at $H_{k-1}^{(i+1)}$, namely,

$$\left(H_{k-1}^{(i+1)}\right)^2 = H_{k-1}^{(i)}H_{k-1}^{(i+2)} - H_k^{(i)}H_{k-2}^{(i+2)}$$

by

$$\frac{H_k^{(i+1)}}{H_{k-1}^{(i+1)}\,H_k^{(i)}\,H_{k-1}^{(i+2)}}$$

we can turn the first term on the right-hand side into $q_k^{(i)}$ as follows:

$$\frac{H_{k-1}^{(i+1)}H_k^{(i+1)}}{H_k^{(i)}H_{k-1}^{(i+2)}} = q_k^{(i)} - \frac{H_{k-2}^{(i+2)}H_k^{(i+1)}}{H_{k-1}^{(i+1)}H_{k-1}^{(i+2)}} \qquad (\mathrm{I})$$

Likewise, we write down Jacobi’s identity centered at $H_k^{(i+1)}$, namely,

$$\left(H_k^{(i+1)}\right)^2 = H_k^{(i)}H_k^{(i+2)} - H_{k+1}^{(i)}H_{k-1}^{(i+2)}$$

and multiply it by

$$\frac{H_{k-1}^{(i+1)}}{H_k^{(i)}\,H_{k-1}^{(i+2)}\,H_k^{(i+1)}}$$

This turns the first term on the right-hand side into $q_k^{(i+1)}$:

$$\frac{H_{k-1}^{(i+1)}H_k^{(i+1)}}{H_k^{(i)}H_{k-1}^{(i+2)}} = q_k^{(i+1)} - \frac{H_{k-1}^{(i+1)}H_{k+1}^{(i)}}{H_k^{(i)}H_k^{(i+1)}} \qquad (\mathrm{II})$$

Since the left-hand sides of (I) and (II) coincide, and since the subtracted terms are precisely $e_{k-1}^{(i+1)}$ and $e_k^{(i)}$ (by the definitions above), we can conclude from (I) and (II) that $q_k^{(i)} + e_k^{(i)} = q_k^{(i+1)} + e_{k-1}^{(i+1)}$; and from the definitions of $q_k^{(i)}$ and $e_k^{(i)}$ it is readily verified that $q_{k+1}^{(i)} e_k^{(i)} = q_k^{(i+1)} e_k^{(i+1)}$.

Those last relations are the rhombus rules (called so by Stiefel, 1955) defining the qd algorithm. Rutishauser (1954) suggested writing down the quantities $e_k^{(i)}$ and $q_k^{(i)}$ in a triangular scheme called a qd scheme (also known as a qd table). Recall that, at this time, the simple computations needed to build up this scheme were normally done on a desk calculator, and so a suitable scheme to write down the numbers obtained was most useful.
For building up the table column-wise from left to right:

$$\left.\begin{aligned} e_k^{(i)} &= \left(q_k^{(i+1)} - q_k^{(i)}\right) + e_{k-1}^{(i+1)}\\[2pt] q_{k+1}^{(i)} &= q_k^{(i+1)}\,\frac{e_k^{(i+1)}}{e_k^{(i)}} \end{aligned}\;\right\} \qquad k = 1,2,\ldots,r$$

Unfortunately, as outlined, this algorithm is in general unstable, i.e., oversensitive to rounding errors, and useless numerically. This fact is related to the occurrence of a division of two small quantities, which can have large relative errors. (Recall that the e-columns tend to zero.)

A more stable way of constructing the qd scheme is obtained by building up the table
row-wise, from top to bottom (progressive qd algorithm):

$$\left.\begin{aligned} q_k^{(i+1)} &= \left(e_k^{(i)} - e_{k-1}^{(i+1)}\right) + q_k^{(i)}\\[2pt] e_k^{(i+1)} &= e_k^{(i)}\,\frac{q_{k+1}^{(i)}}{q_k^{(i+1)}} \end{aligned}\;\right\} \qquad k = 1,2,\ldots,r$$

Written in this form, the rules can be used to construct the qd scheme row by row.

Remark: Rutishauser (1954) indicates that the name rhombus rules may have been
suggested to him by Stiefel.
In the case of a proper rational fraction f of exact degree 𝑟, $e_0^{(i)} = e_r^{(i)} = 0$ holds for all 𝑖, and thus the table is not defined beyond the 𝑟-th e-column. Assuming that all the poles of f have different moduli, Rutishauser (1954) could readily conclude from Aitken's work (see Aitken, 1926) that

$$\lim_{i\to\infty} q_k^{(i)} = \lambda_k \qquad\text{and}\qquad \lim_{i\to\infty} e_k^{(i)} = 0 \qquad (k = 1,2,\ldots,r)$$

This behaviour means that the original qd algorithm for building up the table from its
first column is a computational disaster as the first formula inevitably leads to the
cancellation of leading digits. In contrast, the progressive form is a version that is still
of importance. It avoids the highly ill-conditioned computation of the table from the
moments (see, e.g., Gautschi, 1968, 1982).
Example: (column generation version) Using Bernoulli's method we get the dominant zero, and then we remove it by the long division algorithm 𝑝(𝑧) = (𝑧 − 𝑧1)𝑞(𝑧). Thereafter, we try to get the new dominant zero of the quotient polynomial and repeat the process until we find all zeros. We shall now discuss a modern extension of Bernoulli's method, which has the advantage of providing simultaneous approximations to all zeros at a time.

$$p(z) = z^2 - z - 1 = z^2(1 - z^{-1} - z^{-2}) \;\Longrightarrow\; (1 - z^{-1} - z^{-2})x[i] = 0 \;\Longrightarrow\; x[i] = x[i-1] + x[i-2]$$

With initial conditions 𝑥[−1] = 0, 𝑥[0] = 1. Then the column generation is

 i    x[i]   e0^(i)   q1^(i) = x[i+1]/x[i]    e1^(i)        q2^(i)        e2^(i)
 0     1       0          1.000000
 1     1       0          2.000000             1.000000
 2     2       0          1.500000            -0.500000     -1.000000
 3     3       0          1.666667             0.166667     -0.500000     -1e-05
 4     5       0          1.600000            -0.066667     -0.666667     -1e-05
 5     8       0          1.625000            -0.009615     -0.600000     -2e-05
 6    13       0          1.615385             0.003663     -0.624975     -2.5e-05
 7    21       0          1.619048             0.001401     -0.615409     -4.9e-05
 8    34       0          1.617647            -0.000171     -0.619243
 9    55       0             ↓                     ↓             ↓
                         (1+√5)/2                  0         (1−√5)/2

(In the hand-written qd scheme the e-entries are staggered half a row between the q-entries; they are aligned with the rows here only for readability.)

Example: (row generation or progressive version) If the scheme is to be generated row by row, a first couple of rows must somehow be obtained. The following initialization shows how this is accomplished. Let 𝑎0, 𝑎1, 𝑎2, …, 𝑎𝑟 be constants, all different from zero:

$$q_1^{(0)} = -a_1 a_0^{-1}, \qquad q_k^{(0)} = 0 \quad (k = 2,3,\ldots)$$
$$e_1^{(0)} = a_2 a_1^{-1}, \qquad e_2^{(0)} = a_3 a_2^{-1}, \qquad \ldots, \qquad e_{r-1}^{(0)} = a_r a_{r-1}^{-1}$$

Consider the elements thus generated as the first two rows of a qd scheme, and generate further rows by means of progressive generation, using the side conditions $e_0^{(i)} = e_r^{(i)} = 0$.

Remark: Later on we will see why we have started the algorithm with these special values.
clear all, clc,
% 4th order polynomial associated with the difference equation
% a0*x(k) + a1*x(k-1) + a2*x(k-2) + a3*x(k-3) + a4*x(k-4) = 0
%--------------------------------------%
% syms s %
% expand((s-4)*(s-1)*(s-2)*(s-3)) %
%--------------------------------------%
a0=1; a1=-10; a2=35; a3=-50; a4=24;
Q1=[-(a1/a0), 0 , 0 , 0]; E1=[0 , (a2/a1), (a3/a2) , (a4/a3) , 0];
%------------------------------------------------------------------%
for i=1:20
E2=[]; Q2=[];
for k=1:4
q2(k)=(E1(k+1)-E1(k))+ Q1(k); Q2=[Q2,q2(k)];
end
for k=1:3
e2(k)=(Q2(k+1)/Q2(k))* E1(k+1); E2=[E2,e2(k)];
end
E2=[0 , E2 , 0];
E1=E2; Q1=Q2;
end
E1
solution=Q1(1:4)

Before talking about the initial conditions of the algorithm, let us discuss some fundamentals of rational function realization. It is well known that any polynomial 𝑑(𝑧) = 𝑎0𝑧^𝑟 + 𝑎1𝑧^{𝑟−1} + ⋯ + 𝑎𝑟 (with 𝑎0 = 1) has its own companion matrix of the form

$$\boldsymbol{A}_c = \begin{pmatrix} -a_1 & 1 & 0 & \cdots & 0\\ -a_2 & 0 & 1 & & 0\\ \vdots & 0 & 0 & \ddots & \vdots\\ -a_{r-1} & \vdots & \vdots & & 1\\ -a_r & 0 & \cdots & 0 & 0 \end{pmatrix}$$

where the eigenvalues of 𝑨𝑐 are the zeros of the polynomial 𝑑(𝑧).

Now assume that we are interested in the study of a system described by a proper rational function 𝑓(𝑧) = 𝑛(𝑧)/𝑑(𝑧); the response of this system to some forcing input will be

$$y(z) = \frac{n(z)}{d(z)}\,u(z)$$

Let us put $x_1(z) = \left(d(z)\right)^{-1}u(z)$, so that $y(z) = n(z)x_1(z)$; it is then convenient to factorize the denominator into a product of linear terms:

$$x_1(z) = \left(d(z)\right)^{-1}u(z) \;\Longleftrightarrow\; (z-q_r)(z-q_{r-1})\cdots(z-q_2)(z-q_1)\,x_1(z) = u(z)$$
Take the following change of variables:

$$x_2(z) = (z-q_1)x_1(z), \qquad x_3(z) = (z-q_2)x_2(z), \qquad \ldots, \qquad u(z) = (z-q_r)x_r(z)$$

This set of equations can be rewritten in the more compact state-space form

$$\mathbf{x}[k+1] = \boldsymbol{A}_p\mathbf{x}[k] + \boldsymbol{B}_p\mathbf{u}[k]$$

where

$$\boldsymbol{A}_p = \begin{pmatrix} q_1 & 1 & 0 & \cdots & 0\\ 0 & q_2 & 1 & & 0\\ \vdots & 0 & q_3 & \ddots & \vdots\\ 0 & \vdots & \vdots & & 1\\ 0 & 0 & \cdots & 0 & q_r \end{pmatrix}, \qquad \boldsymbol{B}_p = \begin{pmatrix}0\\0\\\vdots\\0\\1\end{pmatrix} \qquad\text{and}\qquad \mathbf{x}[k] = \begin{pmatrix}x_1[k]\\x_2[k]\\\vdots\\x_r[k]\end{pmatrix}$$

Remark: From this development we conclude that the matrix 𝑨𝑝 is another companion form and is equivalent to the matrix 𝑨𝑐; moreover, the eigenvalues of 𝑨𝑝 are the zeros of 𝑑(𝑧), or equivalently the poles of 𝑓(𝑧).
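As a quick sanity check (our own illustrative sketch with assumed roots), the bidiagonal matrix 𝑨𝑝 built from the factors (𝑧 − 𝑞𝑘) indeed has the 𝑞𝑘 as its eigenvalues:

q = [-1 -2 -3];                              % assumed zeros of d(z)
Ap = diag(q) + diag(ones(numel(q)-1,1),1);   % upper bidiagonal realization A_p
eig(Ap)'                                     % returns -3 -2 -1, the poles of f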

From this fact of equivalence, and based on the progressive qd algorithm, Rutishauser conjectured the existence of a nested set of tridiagonal matrices

$$\boldsymbol{T}_r^{(i)} = \begin{pmatrix} q_1^{(i)} & 1 & & & 0\\ q_1^{(i)}e_1^{(i)} & q_2^{(i)}+e_1^{(i)} & 1 & & \\ & q_2^{(i)}e_2^{(i)} & q_3^{(i)}+e_2^{(i)} & \ddots & \\ & & \ddots & \ddots & 1\\ 0 & & & q_{r-1}^{(i)}e_{r-1}^{(i)} & q_r^{(i)}+e_{r-1}^{(i)} \end{pmatrix} \qquad\text{with}\qquad \lim_{i\to\infty}\boldsymbol{T}_r^{(i)} = \boldsymbol{A}_p$$
(𝑖) (𝑖)
such that the characteristic polynomial of $\boldsymbol{T}_r^{(i)}$, which is $d_r^{(i)}(z)$, converges to the denominator of 𝑓(𝑧), that is $d(z) = \lim_{i\to\infty} d_r^{(i)}(z)$.

Since Rutishauser was interested in the limit of the zeros of $d_r^{(i)}(z)$ as 𝑖 → ∞, it was natural to look at $\boldsymbol{T}_r^{(i)}$. Clearly, $\boldsymbol{T}_r^{(i)}$ has the LR decomposition

$$\boldsymbol{T}_r^{(i)} = \boldsymbol{L}_r^{(i)}\boldsymbol{R}_r^{(i)}$$

With
1 0 ⋯ () 1 0 ⋯ 0
0 ⋯ 0 𝑞1𝑖 () ⋯
()
𝑒1𝑖 1 0 0 𝑞2𝑖 1 0
0 ()
(𝑖)
𝑳𝑟 = ⋮
()
𝑒2𝑖 1 ⋱ ⋮ and
(𝑖)
𝑹𝑟 = ⋮ 0 𝑞3𝑖 ⋱ ⋮
0 ⋮ ⋮ 0 0 ⋮ 1
⋮ 𝑖 () ⋮ ⋮
( 0 0 ⋯ 𝑒𝑟−1 1) ( 0 0 0 𝑞(𝑟𝑖) )

At some historic moment in 1954, Rutishauser must have realized that his progressive qd algorithm can be interpreted as computing this LR factorization $\boldsymbol{T}_r^{(i)} = \boldsymbol{L}_r^{(i)}\boldsymbol{R}_r^{(i)}$ and then forming

$$\boldsymbol{R}_r^{(i)}\boldsymbol{L}_r^{(i)} = \begin{pmatrix} q_1^{(i)}+e_1^{(i)} & 1 & & & 0\\ q_2^{(i)}e_1^{(i)} & q_2^{(i)}+e_2^{(i)} & 1 & & \\ & q_3^{(i)}e_2^{(i)} & q_3^{(i)}+e_3^{(i)} & \ddots & \\ & & \ddots & \ddots & 1\\ 0 & & & q_r^{(i)}e_{r-1}^{(i)} & q_r^{(i)} \end{pmatrix} = \boldsymbol{T}_r^{(i+1)}$$

So, the qd algorithm consists of performing the step $\boldsymbol{T}_r^{(i)} = \boldsymbol{L}_r^{(i)}\boldsymbol{R}_r^{(i)} \rightsquigarrow \boldsymbol{T}_r^{(i+1)} = \boldsymbol{R}_r^{(i)}\boldsymbol{L}_r^{(i)}$, called the LR transformation, which is a similarity transformation: $\boldsymbol{T}_r^{(i+1)} = \boldsymbol{R}_r^{(i)}\boldsymbol{T}_r^{(i)}\left(\boldsymbol{R}_r^{(i)}\right)^{-1}$.
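The whole LR eigenvalue iteration can be mimicked in a few MATLAB lines. The following is a minimal sketch (the tridiagonal test matrix and the iteration count are assumed for illustration) that performs repeated LR steps with a hand-coded Doolittle factorization, without pivoting, and watches the diagonal converge to the eigenvalues:

T = diag([5 4 3 2]) + diag(ones(3,1),1) + diag(0.3*ones(3,1),-1);
n = size(T,1);
for it = 1:80
    L = eye(n); U = T;
    for k = 1:n-1                          % Doolittle LU without pivoting
        L(k+1:n,k) = U(k+1:n,k)/U(k,k);
        U(k+1:n,:) = U(k+1:n,:) - L(k+1:n,k)*U(k,:);
    end
    T = U*L;                               % LR step: T^(i+1) = R*L
end
eig_estimates = diag(T)'                   % approaches the eigenvalues of T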

Now let us go back to the matrix 𝑨𝑐 and factorize it into a special form:

$$\boldsymbol{A}_c = \begin{pmatrix} 1 & 0 & \cdots & & 0\\ 0 & 1 & & & \vdots\\ \vdots & & \ddots & & \\ & & & 1 & 0\\ 0 & \cdots & 0 & a_r a_{r-1}^{-1} & 1 \end{pmatrix} \begin{pmatrix} -a_1 & 1 & 0 & \cdots & 0\\ -a_2 & 0 & 1 & & 0\\ \vdots & & & \ddots & \vdots\\ -a_{r-1} & & & & 1\\ 0 & 0 & \cdots & 0 & -a_r a_{r-1}^{-1} \end{pmatrix}$$

Notice that the matrix 𝑨𝑐 has been decomposed into a product of two matrices; continuing this process, we obtain the following decomposition:

$$\boldsymbol{A}_c = \boldsymbol{L}_{r-2}\cdots\boldsymbol{L}_1\boldsymbol{L}_0\boldsymbol{R}_0$$

where

$$\boldsymbol{R}_0 = \begin{pmatrix} -a_1 & 1 & & 0\\ & -a_2a_1^{-1} & \ddots & \\ & & \ddots & 1\\ 0 & & & -a_ra_{r-1}^{-1} \end{pmatrix}, \qquad \boldsymbol{L}_0 = \begin{pmatrix} 1 & & & 0\\ a_2a_1^{-1} & 1 & & \\ & & \ddots & \\ 0 & & & 1 \end{pmatrix},$$

$$\boldsymbol{L}_1 = \begin{pmatrix} 1 & & & 0\\ 0 & 1 & & \\ & a_3a_2^{-1} & \ddots & \\ 0 & & & 1 \end{pmatrix}, \qquad \text{etc.}, \qquad \boldsymbol{L}_{r-2} = \begin{pmatrix} 1 & & & 0\\ & \ddots & & \\ & & 1 & \\ 0 & & a_ra_{r-1}^{-1} & 1 \end{pmatrix}$$
Now if we look at $\boldsymbol{T}_r^{(0)}$ and compare it with 𝑹0 we get

$$\boldsymbol{T}_r^{(0)} = \boldsymbol{R}_0 \;\Longrightarrow\; \begin{cases} q_1^{(0)} = -a_1a_0^{-1}, & q_k^{(0)} = 0 \quad (k = 2,3,\ldots)\\ e_{i-1}^{(0)} = a_ia_{i-1}^{-1}, & (i = 2,3,\ldots) \end{cases}$$

This is the reason behind the selection of the starting point.


CHAPTER VI:
Numerical Linear Algebra and Matrix Computations
Solving a system of linear equations
is one of the most frequent tasks in numerical computing. The reason is twofold:
historically, many phenomena in physics and engineering have been modeled by linear
differential equations, since they are much easier to analyze than nonlinear ones. In
addition, even when the model is nonlinear, the problem is often solved iteratively as a
sequence of linear problems, e.g., by Newton’s method. Thus, it is important to be
able to solve linear equations efficiently and robustly, and to understand how
numerical artifacts affect the quality of the solution.

The solution of linear systems can be divided into two main categories, explicit and implicit. In this chapter we are going to concentrate on the implicit solution (i.e. numerical or approximate solution). Before introducing the numerical methods for linear algebraic systems, we first give an overview of Cramer's rule, which is a formula used by generations of mathematicians to write down explicit solutions of linear systems.

Theorem (Cramer’s Rule): For det(𝑨) ≠ 0, the linear system 𝑨𝐱 = 𝒃 has the unique solution 𝑥𝑖 = det(𝑨𝑖)/det(𝑨), 𝑖 = 1, 2, …, 𝑛, where 𝑨𝑖 is the matrix obtained from 𝑨 by replacing the 𝑖-th column by 𝒃. The determinant can be computed using the Laplace expansion: for each row 𝑖 we have $\det(\boldsymbol{A}) = \sum_{j=1}^{n} a_{ij}(-1)^{i+j}\det(\boldsymbol{M}_{ij})$, where 𝑴𝑖𝑗 denotes the (𝑛 − 1) × (𝑛 − 1) submatrix obtained by deleting row 𝑖 and column 𝑗 of the matrix 𝑨.

The following recursive Matlab program computes a determinant using the Laplace
Expansion for the first row:

function d=DetLaplace(A);
% DETLAPLACE determinant using Laplace expansion
% d=DetLaplace(A); computes the determinant d of the matrix A
% using the Laplace expansion for the first row.
n=length(A);
if n==1;
d=A(1,1);
else
d=0; v=1;
for j=1:n
M1j=[A(2:n,1:j-1) A(2:n,j+1:n)];
d=d+v*A(1,j)*DetLaplace(M1j);
v=-v;
end
end
The following Matlab program computes the solution of a linear system with Cramer's
rule:

function x=Cramer(A,b);
% CRAMER solves a linear system with Cramer’s rule
% x=Cramer(A,b); Solves the linear system Ax=b using Cramer’s
% rule. The determinants are computed using the function DetLaplace.

n=length(b);
detA=DetLaplace(A);

for i=1:n
AI=[A(:,1:i-1), b, A(:,i+1:n)];
x(i)=DetLaplace(AI)/detA;
end
x = x(:);
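A quick usage sketch (the small test system is ours): the function can be checked against MATLAB's backslash operator.

A = [2 1; 4 17]; b = [-3; 9];
x_cramer = Cramer(A,b)      % uses DetLaplace internally
x_exact  = A\b              % both give x = (-2, 1)'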

Cramer's rule looks simple and even elegant, but for computational purposes it is a disaster: the computational effort with the Laplace expansion is 𝒪(𝑛!), while with Gaussian elimination (introduced in the next sections) it is 𝒪(𝑛³). Furthermore, the numerical accuracy in finite precision arithmetic is very poor.

Indeed, if the determinants are evaluated by the recursive relation, the computational
effort of Cramer's rule is of the order of (𝑛 + 1)! flops and therefore turns out to be
unacceptable even for small dimensions of 𝑨 (for instance, a computer able to perform
109 flops per second would take 9.6 × 1047 years to solve a linear system of only 50
equations).

Below are the most popular existing methods for solving linear algebraic problems.

Direct Methods to Solve Equations

▪ Gauss Elimination Method
▪ Gauss–Jordan Elimination Method
▪ Matrix Inverse Method (Polynomial method or Faddeev–LeVerrier algorithm)
▪ Decomposition (Factorization)
⦁ LU Decomposition (Factorization): Triangularization
⦁ Other Decompositions (Factorizations): Cholesky, QR, and SVD

Iterative Methods to Solve Equations

▪ The Jacobi method
▪ The Gauss–Seidel method and relaxation technique
▪ The Conjugate gradient method
▪ The inversion-free, derivative-free method (proposed by the author)
Because of the very poor computational efficiency of Cramer's rule, numerical methods that are alternatives to it have been developed. They are called direct methods if they yield the solution of the system in a finite number of steps, and iterative if they require (theoretically) an infinite number of steps. We note from now on that the choice between a direct and an iterative method does not depend only on the theoretical efficiency of the scheme, but also on the particular type of matrix, on memory storage requirements and, finally, on the architecture of the computer.

Gauss elimination, also known as row reduction, is an algorithm in linear algebra for solving a system of linear equations. It is usually understood as a sequence of operations performed on the corresponding matrix of coefficients. This method can also be used to find the rank of a matrix, to calculate the determinant of a matrix, and to calculate the inverse of an invertible square matrix. The method is named after Carl Friedrich Gauss (1777–1855).

The process of row reduction makes use of elementary row operations, and can be
divided into two parts. The first part (sometimes called forward elimination) reduces a
given system to row echelon form, from which one can tell whether there are no
solutions, a unique solution, or infinitely many solutions. The second part (sometimes
called back substitution) continues to use row operations until the solution is found; in
other words, it puts the matrix into reduced row echelon form.

Example: convert 𝑨 into echelon form using the elementary row operations on [𝑨 𝒃]

In this section we shall be concerned with deriving algorithms for solving 𝑨𝐱 = 𝒃. The
algorithms have in common the idea of factoring 𝑨 in the form 𝑨 = 𝑹1 𝑹2 … 𝑹𝑘 where
each matrix 𝑹𝑖 is so simple that the equation 𝑹𝒚 = 𝒄 can be easily solved. The equation
𝑨𝐱 = 𝒃 can then be solved by taking 𝒚0 = 𝒃 and for 𝑖 = 1,2, . . . , 𝑘 computing 𝒚𝑖 as the
solution of the equation 𝑹𝑖 𝒚𝑖 = 𝒚𝑖−1 . Obviously 𝒚𝑘 is the desired solution 𝐱. The
matrices 𝑹𝑖 arising in the factorization of 𝑨 are often triangular. Hence the first thing
will be devoted to algorithms for solving triangular systems, the second will be devoted
to algorithms for factoring 𝑨, and the last will apply these factorizations to the solution
of linear systems.

Example: consider the system below, with 𝑨 = 𝑳𝑼:

$$\boldsymbol{A}\mathbf{x} = \boldsymbol{b},\;\; \boldsymbol{A} = \boldsymbol{L}\boldsymbol{U} \;\Longrightarrow\; \begin{cases}\boldsymbol{L}\mathbf{y} = \boldsymbol{b}\\ \boldsymbol{U}\mathbf{x} = \mathbf{y}\end{cases}$$

Once the matrices 𝑳 and 𝑼 have been computed, solving the linear system consists only of solving successively the two triangular systems:

$$\begin{pmatrix}2 & 1\\ 4 & 17\end{pmatrix}\mathbf{x} = \begin{pmatrix}-3\\ 9\end{pmatrix} \;\Longleftrightarrow\; \begin{pmatrix}1 & 0\\ 2 & 3\end{pmatrix}\begin{pmatrix}2 & 1\\ 0 & 5\end{pmatrix}\mathbf{x} = \begin{pmatrix}-3\\ 9\end{pmatrix} \;\Longleftrightarrow\; \begin{cases}\begin{pmatrix}1 & 0\\ 2 & 3\end{pmatrix}\mathbf{y} = \begin{pmatrix}-3\\ 9\end{pmatrix}\\[6pt] \begin{pmatrix}2 & 1\\ 0 & 5\end{pmatrix}\mathbf{x} = \mathbf{y}\end{cases}$$

The first system is very easy to solve: $\begin{pmatrix}1 & 0\\ 2 & 3\end{pmatrix}\mathbf{y} = \begin{pmatrix}-3\\ 9\end{pmatrix} \Rightarrow \mathbf{y} = \begin{pmatrix}-3\\ 5\end{pmatrix}$. After getting the solution of the first system we substitute into the second one to obtain

$$\begin{pmatrix}2 & 1\\ 0 & 5\end{pmatrix}\mathbf{x} = \begin{pmatrix}-3\\ 5\end{pmatrix} \;\Longrightarrow\; \mathbf{x} = \begin{pmatrix}-2\\ 1\end{pmatrix}$$

The Gauss elimination method provides only one upper triangular system, obtained by elementary operations:

$$\boldsymbol{A}\mathbf{x} = \boldsymbol{b} \;\leftrightarrow\; \boldsymbol{U}\mathbf{x} = \mathbf{y}$$

Algorithm for Gauss Elimination Method:

% Elimination phase
clear all, clc, A=10*rand(10,10) + 10*eye(10,10);
b=100*rand(10,1); n=length(b);
A1=A; b1=b; % storing A & b before the elimination process go on
for k=1:n-1
for i=k+1:n
if A(i,k)==0
continue % nothing to eliminate in this row ('break' would wrongly skip the remaining rows)
else
lambda = A(i,k)/A(k,k);
A(i,k+1:n) = A(i,k+1:n) - lambda*A(k,k+1:n);
b(i)= b(i) - lambda*b(k);
end
end
end
% Back substitution phase
for k=n:-1:1
b(k) = (b(k) - A(k,k+1:n)*b(k+1:n))/A(k,k);
end
x = b
zero=A1*x-b1

 Computing the inverse of a matrix and solving simultaneous equations are related tasks. Solving 𝑨𝐱 = 𝒃 via the inverse reads

$$\boldsymbol{A}\mathbf{x} = \boldsymbol{b} \;\Longrightarrow\; \boldsymbol{A}^{-1}\boldsymbol{A}\mathbf{x} = \boldsymbol{A}^{-1}\boldsymbol{b} \;\Longrightarrow\; \mathbf{x} = \boldsymbol{A}^{-1}\boldsymbol{b}$$

Inversion of large matrices should be avoided whenever possible due to its high cost; when the inverse is genuinely needed we should use a numerical method such as the augmentation method (i.e. [𝑨 𝑰] ⟶ [𝑰 𝑩] where 𝑩 = 𝑨^{−1}) to compute the inverse of 𝑨 while avoiding the calculation of det(𝑨). This is a fun way to find the inverse of a matrix: play around with the rows (adding, multiplying or swapping) until matrix 𝑨 becomes the identity matrix 𝑰. From linear algebra we know that by elementary operations (row echelon form) we get

$$\underbrace{\boldsymbol{E}_k\cdots\boldsymbol{E}_2\boldsymbol{E}_1}_{\text{elementary operations}}\boldsymbol{A} = \boldsymbol{I} \;\Longleftrightarrow\; \boldsymbol{A}^{-1} = \underbrace{\boldsymbol{E}_k\cdots\boldsymbol{E}_2\boldsymbol{E}_1}_{\text{elementary operations}}\boldsymbol{I}$$

$$[\boldsymbol{A}\;\;\boldsymbol{I}] \longrightarrow [\boldsymbol{I}\;\;\boldsymbol{A}^{-1}]$$

clear all, clc, A1=10*rand(10,10) + 10*eye(10,10); [m,n] = size(A1);


A = [A1, eye([m, n])]; % augmented matrix
for k = 1:m
A(k,k:end) = A(k,k:end)/A(k,k);
A([1:k-1,k+1:end],k:end) = ...
A([1:k-1,k+1:end],k:end) - A([1:k-1,k+1:end],k)*A(k,k:end);
end
IA1 = A(:,n+1:end);
Identity=IA1*A1

The 𝑳𝑼 decomposition (factorization) of a nonsingular (square) matrix 𝑨 means expressing the matrix as the product of a lower triangular matrix 𝑳 and an upper triangular matrix 𝑼. After decomposing 𝑨, it is easy to solve the equations 𝑨𝐱 = 𝒃: we first rewrite the equations as 𝑳𝑼𝐱 = 𝒃. Upon using the notation 𝑼𝐱 = 𝐲, the equations become 𝑳𝐲 = 𝒃, which can be solved for 𝐲 by forward substitution. Then 𝑼𝐱 = 𝐲 yields 𝐱 by the back substitution process.

To take a close look at the 𝑳𝑼 decomposition, we consider a 3 × 3 nonsingular matrix:

𝑎11 𝑎12 𝑎13 1 0 0 𝑢11 𝑢12 𝑢13


𝑨 = (𝑎21 𝑎22 𝑎23 ) = 𝑳𝑼 = (ℓ21 1 0) ( 0 𝑢22 𝑢23 ) with
𝑎31 𝑎32 𝑎33 ℓ31 ℓ32 1 0 0 𝑢33

1 0 0 𝑢11 𝑢12 𝑢13 𝑢11 𝑢12 𝑢13


𝑳𝑼 = (ℓ21 1 0) ( 0 𝑢22 𝑢23 ) = (ℓ21 𝑢11 ℓ21 𝑢12 + 𝑢22 ℓ21 𝑢13 + 𝑢23 )
ℓ31 ℓ32 1 0 0 𝑢33 ℓ31 𝑢11 ℓ31 𝑢12 + ℓ32 𝑢22 ℓ31 𝑢13 + ℓ32 𝑢23 + 𝑢33

From this we deduce the initial values 𝑢1𝑖 = 𝑎1𝑖, 𝑖 = 1,2,…,𝑛, and

$$\begin{cases} \ell_{ij} = \dfrac{1}{u_{jj}}\left(a_{ij} - \displaystyle\sum_{k=1}^{j-1}\ell_{ik}u_{kj}\right), & j < i\\[10pt] u_{ij} = a_{ij} - \displaystyle\sum_{k=1}^{i-1}\ell_{ik}u_{kj}, & j \ge i \end{cases}$$

It is usual practice to store the multipliers in the lower triangular portion of the
coefficient matrix, replacing the coefficients as they are eliminated (ℓ𝑖𝑗 replacing 𝑎𝑖𝑗 ).
The diagonal elements of 𝑳 do not have to be stored, since it is understood that each of
them is unity. The final form of the coefficient matrix would thus be the following
mixture of 𝑳 and 𝑼:
𝑢11 𝑢12 𝑢13
𝑬1 𝑬2 … 𝑬𝑘 𝑨 = [𝑳\𝑼] = (ℓ21 𝑢22 𝑢23 )
ℓ31 ℓ32 𝑢33

clear all, clc, A=10*rand(4,4) + 10*eye(4,4); n = size(A,1);


A1=A; % storing A before the decomposition process go on
n = size(A,1); % Obtain number of rows (should equal number of columns)
L = eye(n); % Start L off as identity
for k = 1:n
% For each row k, access columns from k+1 to the end and divide by
% the diagonal coefficient at A(k,k)
L(k+1:n,k) = A(k+1:n,k)/A(k,k);
% For each row k+1 to the end, perform Gaussian elimination
% In the end, A will contain U
for l=k+1:n
A(l,:) = A(l,:) - L(l,k)*A(k,:);
end
end
U = A
Zero=A1-L*U

Alternative derivation: In the LU decomposition method we convert the matrix 𝑨 to echelon form by using the Gauss elimination method. The code starts from the first row and, for each pivot, finds the factor by which we need to multiply the current row and subtract it from the rows below it, so as to zero the entries of the lower triangle; these factors are exactly the entries of 𝑳.

clear all, clc, A=10*rand(4,4) + 10*eye(4,4); n = size(A,1);


A1=A; % storing A before the decomposition process go on
for j = 1:n
for i =j:n-1
t = A(i+1,j)/A(j,j);
A(i+1,:) = A(i+1,:)-t*A(j,:);
F(i+1,j) = t;
end
end
U= A
L=F; L(:,n)=zeros(n,1);
for i = 1:n
L(i,i)=1;
end
L
Zero=A1-L*U
Example: Write a MATLAB code which carries out the solution phase (forward and
back substitutions). It is assumed that the original coefficient matrix has been
decomposed, so that the input is 𝑨 = [𝑳\𝑼].

The forward and back substitution algorithm:


⦁ let 𝑷𝑨 = 𝑳𝑼 so that 𝑷𝑨𝐱 = 𝑳𝑼𝐱 = 𝑷𝒃 with 𝑷 = 𝑬1 𝑬2 … 𝑬𝑘
⦁ solve 𝑳𝐲 = 𝑷𝒃, store 𝐲
⦁ solve 𝑼𝐱 = 𝐲, store 𝐱

clear all, clc, A=10*rand(4,4) + 10*eye(4,4); n = size(A,1);


b=100*rand(4,1);
A1=A; b1=b; % storing A before the decomposition process go on

for j = 1:n-1
for i = j+1:n
A(i,j) = A(i,j)/ A(j,j) ;
A(i,j+1:n) = A(i,j+1:n)- A(i,j)* A(j,j+1:n);
end
end

for i = 1:n
for j= 1:n
if i==j
L(i,i)=1; U(i,i) = A(i,i);
elseif i>j
L(i,j)= A(i,j); U(i,j)=0;
else
L(i,j)= 0; U(i,j)= A(i,j);
end
end
end
L
U
Zero=A1-L*U

% Forward substitution: solve L*y = b (y overwrites b)
for k = 2:n
b(k)= b(k) - L(k,1:k-1)*b(1:k-1);
end
% Back substitution: solve U*x = y
for k = n:-1:1
b(k) = (b(k) - U(k,k+1:n)*b(k+1:n))/U(k,k);
end
x = b
Zero=A1*x-b1
In linear algebra, the Cholesky decomposition or Cholesky factorization is a decomposition of a Hermitian, positive-definite matrix into the product of a lower triangular matrix and its conjugate transpose, which is useful for efficient numerical solutions. It was discovered by André-Louis Cholesky for real matrices. When it is applicable, the Cholesky decomposition is roughly twice as efficient as the 𝑳𝑼 decomposition for solving systems of linear equations.

Choleski’s decomposition 𝑨 = 𝑳𝑳ᵀ has two limitations:

▪ Since the matrix product 𝑳𝑳ᵀ is symmetric, Choleski’s decomposition requires 𝑨 to be symmetric.
▪ The decomposition process involves taking square roots of certain combinations of the elements of 𝑨. It can be shown that square roots of negative numbers can be avoided only if 𝑨 is positive definite.

$$\ell_{ij} = \frac{1}{\ell_{jj}}\left(a_{ij} - \sum_{k=1}^{j-1}\ell_{ik}\ell_{jk}\right)\;(j < i), \qquad \ell_{ii} = \sqrt{a_{ii} - \sum_{k=1}^{i-1}\ell_{ik}^2}, \qquad i = 1,2,\ldots,n$$

The initial values are defined by $\ell_{11} = \sqrt{a_{11}}$ and $\ell_{j1} = a_{j1}/\ell_{11}$.

clear all, clc,


A=10*rand(4,4)+10*eye(4,4); A=0.5*(A+A'); % Hermitian positive-definite
n = size(A,1);
A1=A; % storing A before the decomposition process go on

L(1,1)= sqrt(A(1,1));
L(2:n,1)= A(2:n,1)/L(1,1);
for i = 2:n, k=1:i-1,
L(i,i)= sqrt(A(i,i)-sum(L(i,k).^2));
for j = i+1:n,
L(j,i)= (A(j,i)-sum(L(j,k).*L(i,k)))/L(i,i);
end
end
L*L'
Zero=A1-L*L'

There are several other matrix decompositions, such as the QR decomposition and the singular value decomposition (SVD). Instead of looking into the details of these algorithms, we will simply survey the MATLAB built-in functions implementing these decompositions.

 The qr function performs the orthogonal-triangular decomposition of a matrix. This factorization is useful for both square and rectangular matrices. It expresses the matrix as the product of a real orthonormal or complex unitary matrix and an upper triangular matrix (see the Algebra book by BEKHITI B 2020). In MATLAB you have [Q,R] = qr(A).
clear all, clc,

% classical Gram-Schmidt orthogonalization


% This algorithm computes the classical Gram-Schmidt
% orthogonalization of the vectors in the columns of the matrix A

A =10*rand(4,4); [m,n]=size(A); R=zeros(n); Q=A;

for k=1:n,

for i=1:k-1,
R(i,k)=Q(:,i)'*Q(:,k);
end
% (merging the two i-loops above into a single loop yields modified Gram-Schmidt)
for i=1:k-1
Q(:,k)=Q(:,k)-R(i,k)*Q(:,i);
end
R(k,k)=norm(Q(:,k)); Q(:,k)=Q(:,k)/R(k,k);

end

R
Q

We use the QR decomposition to obtain the eigenvalues of a matrix. The method is iterative and builds an upper triangular matrix. The eigenvalues appear as the diagonal terms of this upper triangular matrix. These values are found to be in agreement with those given by the MATLAB built-in function eig. This basic algorithm does not work in the case of complex eigenvalues. (Robert W Hornbeck 1975, pp. 244-245)

clear all, clc, A=10*rand(4,4) + 10*eye(4,4); [m,n]=size(A);

M=A;
for i=1:100
[Q,R] = qr(M);
M=R*Q; % similarity transformation: R*Q = Q'*M*Q
end
diag(M) % eigenvalue estimates (diagonal of the converged iterate)
eig(A)

Some History on QR Algorithm: The 𝑄𝑅 Algorithm is today the most widely used
algorithm for computing eigenvalues and eigenvectors of dense matrices. It allows us to
compute the roots of the characteristic polynomial of a matrix, a problem for which
there is no closed form solution in general for matrices of size bigger than 4 × 4, with a
complexity comparable to solving a linear system with the same matrix.
Heinz Rutishauser discovered some 50 years ago, when testing a computer, that the iteration 𝑨0 = 𝑨 with 𝑨 ∈ ℝ^{𝑛×𝑛},

$$\boldsymbol{A}_i = \boldsymbol{L}_i\boldsymbol{R}_i \;\;(\text{Gauss } LU \text{ decomposition}), \qquad \boldsymbol{A}_{i+1} = \boldsymbol{R}_i\boldsymbol{L}_i, \qquad i = 0,1,2,\ldots
$$

produces a sequence of matrices 𝑨𝑖 which, under certain conditions, converge to an


upper triangular matrix which contains on the diagonal the eigenvalues of the original
matrix 𝑨. In 1961 J. G. F. Francis (USA) and Vera Kublanovskaya (USSR) presented
independently a variant of this iteration, based on the 𝑄𝑅 decomposition instead of the
Gauss 𝐿𝑈 decomposition (called by Rutishauser at that time the 𝐿𝑅 algorithm, because
the Gauss triangular decomposition was known in German as “Zerlegung in eine Links-
und Rechtsdreieckmatrix”). The 𝑄𝑅 Algorithm is listed as one of the top ten algorithms
of the last century.

 SVD (singular value decomposition) expresses an 𝑚 × 𝑛 matrix 𝑨 in the form 𝑨 = 𝑼𝑺𝑽ᵀ, where 𝑼 is an orthogonal (unitary) 𝑚 × 𝑚 matrix, 𝑽 is an orthogonal (unitary) 𝑛 × 𝑛 matrix, and 𝑺 is a real diagonal 𝑚 × 𝑛 matrix having the singular values of 𝑨 (the square roots of the eigenvalues of 𝑨ᵀ𝑨) in decreasing order on its diagonal.
This is implemented by the MATLAB built-in function svd().

A = [1 2;2 3;3 5]; % a rectangular matrix


[U,S,V] = svd(A) % Singular Value Decomposition
err = U*S*V'-A % to check if the result is right

In mathematics (linear algebra), the Faddeev–LeVerrier algorithm is a recursive method to calculate the coefficients of the characteristic polynomial 𝑝(𝜆) = det(𝜆𝑰𝑛 − 𝑨) of a square matrix 𝑨, named after Dmitry Konstantinovich Faddeev and Urbain Le Verrier. Calculation of this polynomial yields the eigenvalues of 𝑨 as its roots; as a matrix polynomial in the matrix 𝑨 itself, it vanishes by the fundamental Cayley–Hamilton theorem.

This method can be used to compute all the Eigenvalues, vectors, inverse, etc., of any
given matrix.
$$\begin{aligned} \boldsymbol{B}_1 &= \boldsymbol{A} & P_1 &= \mathrm{Trace}(\boldsymbol{B}_1)\\ \boldsymbol{B}_2 &= \boldsymbol{A}(\boldsymbol{B}_1 - P_1\boldsymbol{I}) & P_2 &= \tfrac{1}{2}\mathrm{Trace}(\boldsymbol{B}_2)\\ \boldsymbol{B}_3 &= \boldsymbol{A}(\boldsymbol{B}_2 - P_2\boldsymbol{I}) & P_3 &= \tfrac{1}{3}\mathrm{Trace}(\boldsymbol{B}_3)\\ &\;\;\vdots\\ \boldsymbol{B}_{n-1} &= \boldsymbol{A}(\boldsymbol{B}_{n-2} - P_{n-2}\boldsymbol{I}) & P_{n-1} &= \tfrac{1}{n-1}\mathrm{Trace}(\boldsymbol{B}_{n-1})\\ \boldsymbol{B}_n &= \boldsymbol{A}(\boldsymbol{B}_{n-1} - P_{n-1}\boldsymbol{I}) & P_n &= \tfrac{1}{n}\mathrm{Trace}(\boldsymbol{B}_n) \end{aligned}$$

If 𝑨 is nonsingular then the inverse of 𝑨 can be determined by 𝑨−1 = (𝑩𝑛−1 − 𝑃𝑛−1 𝑰)/𝑃𝑛 .
clear all, clc, A=10*rand(4,4) + 10*eye(4,4); [m,n]=size(A);
b=100*rand(4,1); t=0; B=A; I=eye(n,n);

p=trace(B) % Initial value of trace to start algorithm

% Calculations of (Bi & pi) from B2 to Bn-1 & from p2 to pn-1


for k=2:n-1
B=A*(B-p*I); p=trace(B)/k;
end
Bn1=B; pn1=p;

Bn=A*(Bn1 - pn1*I); pn=trace(Bn)/n;

% inverse of A & a solution x

disp('your matrix inverse is:')


IA=(Bn1-pn1*I)/pn
disp('your solution is:')
x= IA*b
disp('verification Zero=A*x-b:')
Zero=A*x-b

Iterative methods formally yield the solution x of a linear system after an infinite number of steps. At each step they require the computation of the residual of the system. In the case of a full matrix, their computational cost is therefore of the order of 𝑛² operations for each iteration, to be compared with an overall cost of the order of (2/3)𝑛³ operations needed by direct methods. Iterative methods can therefore become competitive with direct methods provided the number of iterations that are required to converge (within a prescribed tolerance) is either independent of 𝑛 or scales sub-linearly with respect to 𝑛.

So far we have considered direct methods for solution of a system of linear equations.
For sparse matrices, it may not be possible to take advantage of sparsity while using
direct methods, since the process of elimination can make the zero elements nonzero,
unless the zero elements are in a certain well defined pattern. Hence, the number of
arithmetic operations as well as the storage requirement may be the same for sparse
and filled matrices. This requirement may be prohibitive for large matrices and in those
cases, it may be worthwhile to consider the iterative methods.

The convergence condition for iterative algorithms: In general we have assumed that the matrix can be split as 𝑨 = 𝑯 + 𝑮 with 𝑯^{−1} existing, so the system 𝑨𝐱 = 𝒃 is equivalent to 𝐱 = 𝑯^{−1}(𝒃 − 𝑮𝐱). If we define 𝑸 = −𝑯^{−1}𝑮 and 𝒓 = 𝑯^{−1}𝒃, the iterative process can be defined as

$$\mathbf{x}_{k+1} = \boldsymbol{r} + \boldsymbol{Q}\mathbf{x}_k$$

The recursive evaluation of this formula with an initial starting point 𝐱0 gives

$$\mathbf{x}_1 = \boldsymbol{r} + \boldsymbol{Q}\mathbf{x}_0, \qquad \mathbf{x}_2 = (\boldsymbol{I}+\boldsymbol{Q})\boldsymbol{r} + \boldsymbol{Q}^2\mathbf{x}_0, \qquad \ldots, \qquad \mathbf{x}_k = \left(\sum_{i=0}^{k-1}\boldsymbol{Q}^i\right)\boldsymbol{r} + \boldsymbol{Q}^k\mathbf{x}_0$$

If $(\boldsymbol{I}-\boldsymbol{Q})^{-1}$ exists, then we can write $\sum_{i=0}^{k-1}\boldsymbol{Q}^i = (\boldsymbol{I}-\boldsymbol{Q})^{-1}(\boldsymbol{I}-\boldsymbol{Q}^k)$, therefore

$$\mathbf{x}_k = (\boldsymbol{I}-\boldsymbol{Q})^{-1}(\boldsymbol{I}-\boldsymbol{Q}^k)\boldsymbol{r} + \boldsymbol{Q}^k\mathbf{x}_0$$

For a nontrivial result, we want 𝐱𝑘 to converge independently of the initial state 𝐱0; this means that $\lim_{k\to\infty}\boldsymbol{Q}^k = \boldsymbol{0}$, or $|\lambda_{max}(\boldsymbol{Q})| < 1$, hence

$$\lim_{k\to\infty}\mathbf{x}_k = \lim_{k\to\infty}\left[(\boldsymbol{I}-\boldsymbol{Q})^{-1}(\boldsymbol{I}-\boldsymbol{Q}^k)\boldsymbol{r} + \boldsymbol{Q}^k\mathbf{x}_0\right] = (\boldsymbol{I}-\boldsymbol{Q})^{-1}\boldsymbol{r} = (\boldsymbol{I}+\boldsymbol{H}^{-1}\boldsymbol{G})^{-1}\boldsymbol{H}^{-1}\boldsymbol{b} = (\boldsymbol{H}+\boldsymbol{G})^{-1}\boldsymbol{b} = \boldsymbol{A}^{-1}\boldsymbol{b}$$

The convergence condition is $\lim_{k\to\infty}\boldsymbol{Q}^k = \boldsymbol{0}$ (i.e. the spectral radius of 𝑸 is less than 1); a simple sufficient condition is $\max_{1\le i\le n}\left\{\sum_{j=1}^{n}|\boldsymbol{Q}(i,j)|\right\} < 1$.
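The condition is easy to check numerically. The sketch below (a diagonally dominant test matrix is assumed) builds the iteration matrix 𝑸 for the Jacobi splitting 𝑯 = diag(𝑨), 𝑮 = 𝑨 − 𝑯 and evaluates its spectral radius:

A = 10*eye(5) + rand(5);          % diagonally dominant, so Jacobi will converge
H = diag(diag(A)); G = A - H;
Q = -H\G;                         % iteration matrix Q = -inv(H)*G
rho = max(abs(eig(Q)))            % spectral radius < 1 => convergence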

We can write the matrix 𝑨 in the form 𝑨 = 𝑫 + 𝑳 + 𝑼, where 𝑫 is a diagonal matrix, and 𝑳 and 𝑼 are, respectively, lower and upper triangular matrices with zeros on the diagonal. Then the system of equations can be written as

𝑫𝐱 = −(𝑳 + 𝑼)𝐱 + 𝒃 or 𝐱 = −𝑫−1 (𝑳 + 𝑼)𝐱 + 𝑫−1 𝒃

Here we have assumed that all diagonal elements of A are nonzero. This last equation
can be used to define an iterative process, which generates the next approximation 𝐱 𝑘+1
using the previous one on the right-hand side 𝐱 𝑘+1 = −𝑫−1 (𝑳 + 𝑼)𝐱 𝑘 + 𝑫−1 𝒃.

In numerical linear algebra, the Jacobi method is an iterative algorithm for determining
the solutions of a strictly diagonally dominant system of linear equations. A sufficient
(but not necessary) condition for the method to converge is that the matrix 𝑨 is strictly
or irreducibly diagonally dominant. Strict row diagonal dominance means that for each
row, the absolute value of the diagonal term is greater than the sum of absolute values
of other terms. The Jacobi method sometimes converges even if these conditions are not
satisfied.

This iterative process is known as the Jacobi iteration or the method of simultaneous
displacements. The latter name follows from the fact that every element of the solution
vector is changed before any of the new elements are used in the iteration. Hence, both
𝐱 𝑘+1 and 𝐱 𝑘 need to be stored separately. The iterative procedure can be easily
expressed in the component form as

$$x_{k+1}(i) = \frac{1}{a(i,i)}\left(b(i) - \sum_{\substack{j=1\\ j\neq i}}^{n} a(i,j)\,x_k(j)\right)$$
Algorithm1: The Matrix-based formula

clear all, clc, A =rand(10,10); b=20*rand(10,1);


A = 0.5*(A+A');
A = A + 10*eye(10);

D= diag(A); M= diag(D);
n=max(size(A)); m=min(size(b)); x0=rand(n,m);
for k=1: 10
% x1= inv(M)*(b-(A-M)*x0); % Jacobi method
x1= x0 + inv(M)*(b-A*x0); % Jacobi method 'Alternative writing'
x0=x1;
end
x1
ZERO1=A*x1-b

Important note: the matrix form of the Jacobi method is only for explanation and is not a practical method, since it contains the matrix inverse, which is the very thing we sought to avoid.

Algorithm2: The element-based formula

clear all, clc, A =rand(10,10); b=100*rand(10,1);


A = 0.5*(A+A'); A = A + 10*eye(10);
n=max(size(A)); m=min(size(b)); x0=rand(n,1); tol=1e-5;
% the first iteration
for j = 1:n
x(j)=((b(j)-A(j,[1:j-1,j+1:n])*x0([1:j-1,j+1:n]))/A(j,j));
end
x1 = x';
% the next iterations
k = 1;
while norm(x1-x0,1)>tol
for j = 1:n
x_ny(j)=((b(j)-A(j,[1:j-1,j+1:n])*x1([1:j-1,j+1:n]))/A(j,j));
end
x0 = x1; x1 = x_ny';
k = k + 1;
end
k, x = x1, ZERO1=A*x-b

In numerical linear algebra, the Gauss–Seidel method, also known as the Liebmann method or the method of successive displacement, is an iterative method used to solve a system of linear equations. It is named after the German mathematicians Carl Friedrich Gauss and Philipp Ludwig von Seidel, and is similar to the Jacobi method. Though it can be applied to any matrix with non-zero elements on the diagonal, convergence is only guaranteed if the matrix is either strictly diagonally dominant, or symmetric and positive definite.
We can write the matrix 𝑨 in the form 𝑨 = 𝑳 + 𝑼, where 𝑳 and 𝑼 are respectively, lower
and strictly upper triangular matrices. Then the system of equations can be written as

𝑳𝐱 = 𝒃 − 𝑼𝐱 or 𝐱 = 𝑳−1 𝒃 − 𝑳−1 𝑼𝐱

This last equation can be used to define an iterative process, which generates the next
approximation 𝐱 𝑘+1 using the previous one on the right-hand side

𝐱 𝑘+1 = 𝑳−1 (𝒃 − 𝑼𝐱 𝑘 )

The convergence properties of the Gauss–Seidel method are dependent on the matrix 𝑨.
Namely, the procedure is known to converge if either:

𝑨 is symmetric positive-definite, or
𝑨 is strictly or irreducibly diagonally dominant.

The Gauss–Seidel method sometimes converges even if these conditions are not
satisfied.

Algorithm1: (Gauss–Seidel) The Matrix-based formula

clear all, clc,


A =rand(10,10); b=100*rand(10,1);
A = 0.5*(A+A');
A = A + 10*eye(10);

L = tril(A); U=A-L; % L1 = tril(A,-1) strictly lower


n=max(size(A)); m=min(size(b)); x0=rand(n,m);
for k=1: 10
x1= inv(L)*(b-(U)*x0); % Gauss–Seidel method
x0=x1;
end
x1
ZERO1=A*x1-b

Important note: the matrix form of the Gauss–Seidel method is only for explanation and is not a practical method, since it contains the matrix inverse, which is the very thing we sought to avoid.

However, by taking advantage of the triangular form of 𝑳, the elements of 𝐱 𝑘+1 can be
computed sequentially using forward substitution:

$$x_{k+1}(i) = \frac{1}{a(i,i)}\left(b(i) - \sum_{j=1}^{i-1}a(i,j)\,x_{k+1}(j) - \sum_{j=i+1}^{n}a(i,j)\,x_k(j)\right)$$

The procedure is generally continued until the changes made by an iteration are below
some tolerance, such as a sufficiently small residual.

The element-wise formula for the Gauss–Seidel method is extremely similar to that of
the Jacobi method. The computation of 𝑥𝑘+1 (𝑖) uses the elements of 𝐱 𝑘+1 that have
already been computed, and only the elements of 𝐱 𝑘 that have not been computed in
the 𝑘 + 1 iteration. This means that, unlike the Jacobi method, only one storage vector
is required as elements can be overwritten as they are computed, which can be
advantageous for very large problems.

Algorithm2: (Gauss–Seidel) The element-based formula

clear all, clc, A =rand(10,10); b=100*rand(10,1);


A = 0.5*(A+A'); A = A + 10*eye(10);
n=max(size(A)); m=min(size(b)); x=rand(n,m); iters=20;

for i=1:iters
for j = 1:size(A,1)
x(j) = (1/A(j,j)) * (b(j) - A(j,:)*x + A(j,j)*x(j));
end
end
x
ZERO1=A*x-b

The successive over-relaxation method (SOR) is derived from the Gauss-Seidel method by introducing an “extrapolation” parameter ω, resulting in faster convergence. It was devised simultaneously by David M. Young, Jr. and by Stanley P. Frankel in 1950 for the purpose of automatically solving linear systems on digital computers. Over-relaxation methods had been used before the work of Young and Frankel.

The component 𝑥𝑖 is computed as for Gauss-Seidel but then averaged with its previous
value.
$$x_{k+1}(i) = (1-\omega)\,x_k(i) + \frac{\omega}{a(i,i)}\left(b(i) - \sum_{j=1}^{i-1}a(i,j)\,x_{k+1}(j) - \sum_{j=i+1}^{n}a(i,j)\,x_k(j)\right)$$

We first mention that one cannot choose the relaxation parameter arbitrarily if one
wants to obtain a convergent method. This general, very elegant result is due to Kahan
from his PhD thesis 1958. For convergence of SOR it is necessary to choose 0 < ω < 2.

clear all, clc, A =rand(10,10); b=100*rand(10,1);


A = 0.5*(A+A'); A = A + 10*eye(10); w=2/3;

n=max(size(A)); m=min(size(b)); x=rand(n,m); iters=20;

for i=1:iters
for j = 1:size(A,1)
x(j) = (1-w)*x(j) + (w/A(j,j))*(b(j) - A(j,:)*x + A(j,j)*x(j));
end
end
x
ZERO1=A*x-b
Remark: The optimal choice of the relaxation parameter ω is given by Young in his 1950 thesis.
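For completeness, Young's classical formula (valid for consistently ordered matrices) reads ω_opt = 2/(1 + √(1 − ρ_J²)), where ρ_J is the spectral radius of the Jacobi iteration matrix. The following sketch (test matrix assumed) evaluates it:

A = 10*eye(5) + rand(5);
D = diag(diag(A));
QJ = -D\(A - D);                   % Jacobi iteration matrix
rhoJ = max(abs(eig(QJ)));
w_opt = 2/(1 + sqrt(1 - rhoJ^2))   % optimal relaxation parameter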

Remark: The Gauss-Seidel method 𝐱𝑘+1 = 𝐟(𝐱𝑘+1, 𝐱𝑘) is very close to the simple iterative method 𝐱𝑘+1 = 𝐟(𝐱𝑘) and differs only in that at any iteration we use the already-calculated components of this iteration. This method can be used for both linear and nonlinear systems, and its convergence depends on the choice of the starting point and the Jacobian of the mapping defined by the system.

Gauss–Seidel method 𝐱𝑘+1 = 𝐟(𝐱𝑘+1, 𝐱𝑘):
  𝑥1(𝑘+1) = f1(𝑥1(𝑘), 𝑥2(𝑘), 𝑥3(𝑘))
  𝑥2(𝑘+1) = f2(𝑥1(𝑘+1), 𝑥2(𝑘), 𝑥3(𝑘))
  𝑥3(𝑘+1) = f3(𝑥1(𝑘+1), 𝑥2(𝑘+1), 𝑥3(𝑘))

Simple iterative method (Jacobi) 𝐱𝑘+1 = 𝐟(𝐱𝑘):
  𝑥1(𝑘+1) = f1(𝑥1(𝑘), 𝑥2(𝑘), 𝑥3(𝑘))
  𝑥2(𝑘+1) = f2(𝑥1(𝑘), 𝑥2(𝑘), 𝑥3(𝑘))
  𝑥3(𝑘+1) = f3(𝑥1(𝑘), 𝑥2(𝑘), 𝑥3(𝑘))

Example: Solve the following set of nonlinear equations by the Gauss-Seidel method

$$\begin{cases} 27x + e^x\cos(y) - 0.12z = 3\\ -0.2x^2 + 37y + 3xz = 6\\ x^2 - 0.2y\sin(x) + 29z = -4 \end{cases} \;\Longleftrightarrow\; \begin{cases} x_{k+1} = \frac{1}{27}\left(3 + 0.12z_k - e^{x_k}\cos(y_k)\right)\\[3pt] y_{k+1} = \frac{1}{37}\left(6 - 3x_{k+1}z_k + 0.2(x_{k+1})^2\right)\\[3pt] z_{k+1} = \frac{1}{29}\left(-4 + 0.2y_{k+1}\sin(x_{k+1}) - (x_{k+1})^2\right) \end{cases}$$

Start with 𝑥 = 𝑦 = 𝑧 = 1.

clear all, clc, x0=1; y0=1; z0=1; s=1; k=1;

while s>0.01
x1=(1/27)*(3+0.12*z0-exp(x0)*cos(y0));
y1=(1/37)*(6-3*x1*z0+ 0.2*x1*x1);
z1=(1/29)*(-4+0.2*y1*sin(x1)-x1*x1); % note: 1/29 (not 1/27), from the third equation

v1= x1-x0; v2= y1-y0; v3= z1-z0; v=[v1; v2; v3]; s=norm(v);
x0=x1; y0=y1; z0=z1;
k=k+1;
end
k
x=x1, y=y1, z=z1,

The conjugate gradient method is a mathematical technique that can be useful for the optimization of both linear and non-linear systems. This technique is generally used as an iterative algorithm; however, it can be used as a direct method, and it will produce a numerical solution. Generally this method is used for very large systems where it is not practical to solve with a direct method. This method was developed by Magnus Hestenes and Eduard Stiefel. For the method to converge, it is required that the matrix 𝑨 be symmetric and positive definite.
The preconditioned conjugate gradient method is used nowadays as a standard method in many software libraries, and CG is the starting point of Krylov methods, which are listed among the top ten algorithms of the last century.

Consider the problem of finding the vector 𝐱 that minimizes the scalar function (which is called the energy of the system)

$$f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T\boldsymbol{A}\mathbf{x} - \boldsymbol{b}^T\mathbf{x} \qquad\text{with}\qquad \nabla f(\mathbf{x}) = \frac{1}{2}(\boldsymbol{A}^T+\boldsymbol{A})\mathbf{x} - \boldsymbol{b} = \boldsymbol{A}\mathbf{x} - \boldsymbol{b}$$

where the matrix 𝑨 is symmetric and positive definite. Because f(𝐱) is minimized when its gradient ∇f = 𝑨𝐱 − 𝒃 is zero, we see that minimization is equivalent to solving 𝑨𝐱 = 𝒃.

Gradient methods accomplish the minimization by iteration, starting with an initial vector 𝐱0. Each iterative cycle 𝑘 computes a refined solution

$$\mathbf{x}_{k+1} = \mathbf{x}_k + \alpha_k\mathbf{s}_k$$

The step length 𝛼𝑘 is chosen so that 𝐱𝑘+1 minimizes f(𝐱𝑘+1) in the search direction 𝐬𝑘. That is, 𝐱𝑘+1 must satisfy 𝑨𝐱 = 𝒃, so

$$\boldsymbol{A}(\mathbf{x}_k + \alpha_k\mathbf{s}_k) = \boldsymbol{b}$$

Introducing the residual 𝐫𝑘 = 𝒃 − 𝑨𝐱𝑘, this last equation becomes 𝛼𝑘𝑨𝐬𝑘 = 𝐫𝑘. Premultiplying both sides by 𝐬𝑘ᵀ and solving for 𝛼𝑘, we obtain

$$\alpha_k = \frac{\mathbf{s}_k^T\mathbf{r}_k}{\mathbf{s}_k^T\boldsymbol{A}\mathbf{s}_k} = \frac{\left(\nabla f(\mathbf{x}_k)\right)^T\nabla f(\mathbf{x}_k)}{\left(\nabla f(\mathbf{x}_k)\right)^T\boldsymbol{A}\,\nabla f(\mathbf{x}_k)}$$

We are still left with the problem of determining the search direction 𝐬𝑘. Intuition tells us to choose 𝐬𝑘 = −∇𝑓 = 𝐫𝑘, since this is the direction of the largest negative change in 𝑓(𝐱). The resulting procedure is known as the method of steepest descent or a gradient method. Summarizing, the gradient method can be described as follows: given 𝐱0 ∈ ℝⁿ, for 𝑘 = 0, 1, … until convergence, compute

▪ $\mathbf{r}_k = \boldsymbol{b} - \boldsymbol{A}\mathbf{x}_k$
▪ $\alpha_k = \dfrac{\mathbf{s}_k^T\mathbf{r}_k}{\mathbf{s}_k^T\boldsymbol{A}\mathbf{s}_k} = \dfrac{\mathbf{r}_k^T\mathbf{r}_k}{\mathbf{r}_k^T\boldsymbol{A}\mathbf{r}_k}$
▪ $\mathbf{x}_{k+1} = \mathbf{x}_k + \alpha_k\mathbf{r}_k$

It is not a popular algorithm due to its slow convergence. In order to increase the speed of convergence, researchers proposed a correction of the search direction.
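Before improving the search direction, here is a minimal sketch of the plain steepest-descent iteration above (the small SPD test system is assumed):

A = [4 1; 1 3]; b = [1; 2]; x = zeros(2,1);
for k = 1:100
    r = b - A*x;                 % residual = negative gradient
    alpha = (r'*r)/(r'*A*r);     % optimal step along r
    x = x + alpha*r;
end
x, residual_norm = norm(A*x - b) % residual ~ 0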

Now the improved version of the steepest descent algorithm is based on the so called A-
conjugacy. A-conjugacy means that a set of nonzero vectors {𝐬0 , 𝐬1 , … , 𝐬𝑛−1 } are conjugate
with respect to the symmetric positive definite matrix A. That is 𝐬𝑖𝑇 𝑨𝐬𝑗 = 0 ∀ 𝑖 ≠ 𝑗. A set
of 𝑛 such vectors are linearly independent and hence span the whole space ℝ𝑛 . The
reason why such A-conjugate sets are important is that we can minimize our quadratic
function f(𝐱) in 𝑛 steps by successively minimizing it along each of the directions. Since
the set of A-conjugate vectors acts as a basis for ℝ𝑛 .

There are several ways to choose such a set. The eigenvectors of 𝑨 form an A-conjugate set, but finding the eigenvectors is a task requiring a lot of computation, so we had better find another strategy. A second alternative is to modify the usual Gram-Schmidt orthogonalization process. This is also not optimal, as it requires storing all the directions. In order to search optimally through a complete set of linearly independent vectors we use the more efficient "conjugate gradient method", which is based on the recursive formula $\mathbf{s}_{k+1} = \mathbf{r}_{k+1} + \beta_k\mathbf{s}_k$; we try to find the constant 𝛽𝑘 for which any two successive search directions are conjugate (noninterfering) to each other, meaning $\mathbf{s}_{k+1}^T\boldsymbol{A}\mathbf{s}_k = 0$.
Substituting $\mathbf{s}_{k+1}$ into $\mathbf{s}_{k+1}^T\boldsymbol{A}\mathbf{s}_k = 0$ we get $\left(\mathbf{r}_{k+1}^T + \beta_k\mathbf{s}_k^T\right)\boldsymbol{A}\mathbf{s}_k = 0$, which yields

$$\beta_k = -\frac{\mathbf{r}_{k+1}^T\boldsymbol{A}\mathbf{s}_k}{\mathbf{s}_k^T\boldsymbol{A}\mathbf{s}_k}$$

Here is the outline of the conjugate gradient algorithm:

▪ Choose 𝐱0 (any vector will do, but one close to the solution results in fewer iterations)
▪ 𝐫0 ← 𝒃 − 𝑨𝐱0 and 𝐬0 ← 𝐫0
do with 𝑘 = 0, 1, 2, …
▪ $\alpha_k \leftarrow \dfrac{\mathbf{s}_k^T\mathbf{r}_k}{\mathbf{s}_k^T\boldsymbol{A}\mathbf{s}_k}$
▪ $\mathbf{x}_{k+1} \leftarrow \mathbf{x}_k + \alpha_k\mathbf{s}_k$
▪ $\mathbf{r}_{k+1} \leftarrow \boldsymbol{b} - \boldsymbol{A}\mathbf{x}_{k+1}$; if ‖𝐫𝑘+1‖ ≤ 𝜀 exit loop (convergence criterion; 𝜀 is the error tolerance)
▪ $\beta_k \leftarrow -\dfrac{\mathbf{r}_{k+1}^T\boldsymbol{A}\mathbf{s}_k}{\mathbf{s}_k^T\boldsymbol{A}\mathbf{s}_k}$
▪ $\mathbf{s}_{k+1} = \mathbf{r}_{k+1} + \beta_k\mathbf{s}_k$
end do

clear all, clc, A =100*rand(10,10); A = 0.5*(A+A'); A = A + 5*eye(10);


b=100*rand(10,1); n=max(size(A)); m=min(size(b));
x=10*rand(n,m); r=b-A*x; s=r; % A=[4 -1 1;-1 4 -2;1 -2 4]; b=[12;-1;5];

for k=1: length(b)


alpha = (s'*r)/(s'*A*s);
x = x + alpha*s; r = b - A*x;
if sqrt(dot(r,r))< 1.0e-15
return
else
beta=-(r'*A*s)/(s'*A*s); s = r + beta*s;
end
end
x
ZERO1=A*x-b
Many current research problems are reduced to large systems of linear differential equations which, in turn, must be solved. Among the most common industrial problems which give rise to the need for solving such a system is the vibration problem, in which the corresponding matrix is real and symmetric; but generally we come across problems in which the matrix is nonsymmetric and complex. Among the iterative methods for solving large linear systems 𝑨𝐱 = 𝒃 with a sparse or possibly structured non-symmetric matrix 𝑨, those that are based on the Lanczos process feature short recurrences for the generation of the Krylov space. This means low cost and low memory requirements. A Lanczos, conjugate gradient-like, method for the iterative solution of the equation is one of the most popular and well known.

■ How KRYLOV Subspaces Come Into Play! In 1931 A. N. KRYLOV published a paper
entitled "On the numerical solution of the equation by which the frequency of small
oscillations is determined in technical problems", of course in this work KRYLOV was not
thinking in terms of projection processes, and he was not interested in solving a linear
system. Motivated by an application in the analysis of oscillations of mechanical
systems, he constructed a method for computing the minimal polynomial of a matrix.
Algebraically, his method is based on the following important fact.

Given 𝑨 ∈ 𝔽𝑛×𝑛 and a nonzero vector 𝐯 ∈ 𝔽𝑛 , consider the Krylov sequence {𝐯, 𝑨𝐯, 𝑨2 𝐯, … }
generated by 𝑨 and 𝐯. There then exists a uniquely defined integer 𝑑 = 𝑑(𝑨, 𝐯), so that
the vectors {𝐯, 𝑨𝐯, 𝑨2 𝐯, … 𝑨𝑑−1 𝐯} are linearly independent, and the vectors
2 𝑑
{𝐯, 𝑨𝐯, 𝑨 𝐯, … 𝑨 𝐯} are linearly dependent.

By construction, there exist scalars $\gamma_0, \ldots, \gamma_{d-1}$ with $\boldsymbol{A}^d\mathbf{v} = \sum_{k=0}^{d-1}\gamma_k\boldsymbol{A}^k\mathbf{v}$. In polynomial notation we can write $p(\boldsymbol{A})\mathbf{v} = \mathbf{0}$, where $p(\lambda) = \lambda^d - \sum_{k=0}^{d-1}\gamma_k\lambda^k$. The polynomial 𝑝(𝜆) is called the minimal polynomial of 𝐯 with respect to 𝑨.
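Krylov's observation can be tried directly: the grade 𝑑 = 𝑑(𝑨, 𝐯) is the first power for which 𝑨^𝑑𝐯 becomes linearly dependent on the previous Krylov vectors. A small sketch (𝑨 and 𝐯 are assumed; the rank test is our illustrative device):

A = diag([1 2 3]); v = [1; 1; 0];   % v has no component along the 3rd eigenvector
K = v;
while rank([K, A*K(:,end)]) > size(K,2)
    K = [K, A*K(:,end)];            % append the next Krylov vector
end
d = size(K,2)                       % here d = 2 < n = 3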

Krylov's observation can be rephrased in the following way. For each matrix 𝑨 and vector 𝐯, the Krylov subspace is defined by 𝒦𝑑(𝑨, 𝐯) = span{𝐯, 𝑨𝐯, 𝑨²𝐯, … 𝑨^{𝑑−1}𝐯}. The most important iterative techniques for solving large-scale linear systems 𝑨𝐱 = 𝒃 are based on projection processes onto the Krylov subspace. In short, this technique approximates 𝑨^{−1}𝒃 by 𝑝(𝑨)𝒃, where 𝑝 is a good polynomial.

A general projection method for solving the linear system 𝑨𝐱 = 𝒃 is a method which seeks an approximate solution from an affine subspace 𝐱𝑚 = 𝐱0 + 𝒦𝑚 of dimension 𝑚 by imposing the Petrov–Galerkin condition 𝒓 = (𝒃 − 𝑨𝐱𝑚) ⊥ ℒ𝑚, where ℒ𝑚 is another 𝑚-dimensional subspace. Here, 𝐱0 represents an arbitrary initial guess to the solution. A Krylov subspace method is a method for which 𝒦𝑚 is a Krylov subspace.

𝒦𝑚 (𝑨, 𝒓0 ) = span{𝒓0 , 𝑨𝒓0 , 𝑨2 𝒓0 , … 𝑨𝑚−1 𝒓0 }

with 𝒓0 = (𝒃 − 𝑨𝐱0). The different versions of Krylov subspace methods arise from different choices of the subspace ℒ𝑚 and from the ways in which the system is preconditioned.
Example: Two broad choices for ℒ𝑚 give rise to some of the best-known techniques:

⦁ ℒ𝑚 = 𝒦𝑚(𝑨, 𝒓0): Full Orthogonalization Method (FOM)
⦁ ℒ𝑚 = 𝑨𝒦𝑚(𝑨, 𝒓0): Generalized Minimal Residual method (GMRES)

An important fact: Though the non-symmetric linear system is often the motivation for
applying the Lanczos algorithm, the operation the algorithm primarily performs is
tridiagonalization of a matrix, but what about the tridiagonalization process?

For every real square matrix 𝑨 ∈ ℝ^{𝑛×𝑛} there exists a nonsingular real matrix 𝑿 for which 𝑻 = 𝑿^{−1}𝑨𝑿 ∈ ℝ^{𝑛×𝑛} is a tridiagonal matrix, and under certain conditions 𝑿 is uniquely determined. Simple proofs of the existence and uniqueness of this transformation are presented in a paper by Angelika Bunse-Gerstner (1982).

However, the use of Krylov subspace method does not guarantee the non-singularity of
the matrix 𝑿, since "𝑚 ≤ 𝑛" the dimension of Krylov subspace is less or equal to the
dimension of the matrix 𝑨. Therefore, we can write 𝑻 = 𝐖 𝑇 𝑨𝐕 ∈ ℝ𝑚×𝑚 with 𝐖, 𝐕 ∈ ℝ𝑛×𝑚 .

■ The Nonsymmetric Lanczos Algorithm: For a nonsymmetric matrix 𝑨, the Lanczos algorithm constructs, starting with two vectors 𝐯1 and 𝐰1, a pair of biorthogonal bases {𝐯1, 𝐯2, …, 𝐯𝑚} and {𝐰1, 𝐰2, …, 𝐰𝑚} for the two Krylov subspaces 𝒦𝑚(𝑨, 𝐯1) and ℒ𝑚(𝑨ᵀ, 𝐰1), where

𝒦𝑚(𝑨, 𝐯1) = span{𝐯1, 𝑨𝐯1, …, 𝑨^{𝑚−1}𝐯1} and ℒ𝑚(𝑨ᵀ, 𝐰1) = span{𝐰1, 𝑨ᵀ𝐰1, …, (𝑨ᵀ)^{𝑚−1}𝐰1}.

Two sets {𝐯𝑖 } & {𝐰𝑖 } of vectors satisfying (𝐰𝑖 𝑇 𝐯𝑗 = 0 ∀ 𝑖 ≠ 𝑗) are said to be biorthogonal
and can be obtained through the following algorithm: (Biswa Nath Datta 2003)

Initialization: Scale the vectors 𝐯 and 𝐰 to get the vectors 𝐯1 and 𝐰1 such that 𝐰1ᵀ𝐯1 = 1. Set 𝛽1 = 0, 𝛾1 = 0, 𝐰0 = 𝐯0 = 0.
begin: For 𝑘 = 1, 2, …, 𝑚 do

$$\begin{aligned} \alpha_k &= \mathbf{w}_k^T\boldsymbol{A}\mathbf{v}_k\\ \boldsymbol{\mu}_{k+1} &= \boldsymbol{A}\mathbf{v}_k - \alpha_k\mathbf{v}_k - \beta_k\mathbf{v}_{k-1}\\ \boldsymbol{\eta}_{k+1} &= \boldsymbol{A}^T\mathbf{w}_k - \alpha_k\mathbf{w}_k - \gamma_k\mathbf{w}_{k-1}\\ \gamma_{k+1} &= \sqrt{\left|\boldsymbol{\eta}_{k+1}^T\boldsymbol{\mu}_{k+1}\right|}\,; \qquad \beta_{k+1} = \frac{\boldsymbol{\eta}_{k+1}^T\boldsymbol{\mu}_{k+1}}{\gamma_{k+1}}\\ \mathbf{w}_{k+1} &= \frac{\boldsymbol{\eta}_{k+1}}{\beta_{k+1}}\,; \qquad \mathbf{v}_{k+1} = \frac{\boldsymbol{\mu}_{k+1}}{\gamma_{k+1}} \end{aligned}$$
end

■ Proof of the Nonsymmetric Lanczos Algorithm: Suppose 𝑨 ∈ ℝ^{𝑛×𝑛} and that biorthogonal rectangular matrices 𝐖𝑘, 𝐕𝑘 ∈ ℝ^{𝑛×𝑘} exist so that

$$\begin{cases}\boldsymbol{A}\mathbf{V}_k = \mathbf{V}_k\boldsymbol{T}_k\\ \mathbf{W}_k^T\boldsymbol{A} = \boldsymbol{T}_k\mathbf{W}_k^T\end{cases} \qquad\text{where}\qquad \boldsymbol{T}_k = \begin{pmatrix}\alpha_1 & \beta_2 & & 0\\ \gamma_2 & \alpha_2 & \ddots & \\ & \ddots & \ddots & \beta_k\\ 0 & & \gamma_k & \alpha_k\end{pmatrix} \quad\text{and}\quad \mathbf{W}_k^T\mathbf{V}_k = \boldsymbol{I}_k$$

With the column partitioning 𝐕𝑘 = [𝐯1, 𝐯2, …, 𝐯𝑘] and 𝐖𝑘 = [𝐰1, 𝐰2, …, 𝐰𝑘], we find upon comparing columns in 𝑨𝐕𝑘 = 𝐕𝑘𝑻𝑘 and 𝑨ᵀ𝐖𝑘 = 𝐖𝑘𝑻𝑘ᵀ that

$$\begin{aligned}\boldsymbol{A}\mathbf{v}_k &= \beta_k\mathbf{v}_{k-1} + \alpha_k\mathbf{v}_k + \gamma_{k+1}\mathbf{v}_{k+1}, & \beta_1\mathbf{v}_0 &= 0\\ \boldsymbol{A}^T\mathbf{w}_k &= \gamma_k\mathbf{w}_{k-1} + \alpha_k\mathbf{w}_k + \beta_{k+1}\mathbf{w}_{k+1}, & \gamma_1\mathbf{w}_0 &= 0\end{aligned} \qquad k = 1,2,\ldots,m$$

These equations, together with the biorthogonality condition 𝐖𝑘ᵀ𝐕𝑘 = 𝑰,

$$\begin{cases}\mathbf{w}_i^T\mathbf{v}_j = 0, & i \neq j\\ \mathbf{w}_i^T\mathbf{v}_j = 1, & i = j\end{cases}$$

imply $\mathbf{w}_k^T\boldsymbol{A}\mathbf{v}_k = \beta_k\mathbf{w}_k^T\mathbf{v}_{k-1} + \alpha_k\mathbf{w}_k^T\mathbf{v}_k + \gamma_{k+1}\mathbf{w}_k^T\mathbf{v}_{k+1} \;\Longrightarrow\; \alpha_k = \mathbf{w}_k^T\boldsymbol{A}\mathbf{v}_k$ and

$$\gamma_{k+1}\mathbf{v}_{k+1} = \boldsymbol{\mu}_{k+1} = \boldsymbol{A}\mathbf{v}_k - \alpha_k\mathbf{v}_k - \beta_k\mathbf{v}_{k-1}$$
$$\beta_{k+1}\mathbf{w}_{k+1} = \boldsymbol{\eta}_{k+1} = \boldsymbol{A}^T\mathbf{w}_k - \alpha_k\mathbf{w}_k - \gamma_k\mathbf{w}_{k-1}$$

There is some flexibility in choosing the scale factors 𝛽𝑘 and 𝛾𝑘. Note that

$$1 = \mathbf{w}_{k+1}^T\mathbf{v}_{k+1} = \frac{\boldsymbol{\eta}_{k+1}^T\boldsymbol{\mu}_{k+1}}{\beta_{k+1}\gamma_{k+1}}$$

It follows that once 𝛾𝑘 is specified, 𝛽𝑘 is given by $\beta_k = (\boldsymbol{\eta}_k^T\boldsymbol{\mu}_k)/\gamma_k$; with the "canonical" choice $\gamma_k = \sqrt{|\boldsymbol{\eta}_k^T\boldsymbol{\mu}_k|}$ we obtain the above algorithm.
If $\boldsymbol{T}_k$ is the tridiagonal matrix above, then the situation at the bottom of the loop is summarized by the equations

$$\boldsymbol{A}[\mathbf{v}_1,\ldots,\mathbf{v}_k] = [\mathbf{v}_1,\ldots,\mathbf{v}_k]\boldsymbol{T}_k + [0\;\;0\;\cdots\;\gamma_{k+1}\mathbf{v}_{k+1}]$$
$$\boldsymbol{A}^T[\mathbf{w}_1,\ldots,\mathbf{w}_k] = [\mathbf{w}_1,\ldots,\mathbf{w}_k]\boldsymbol{T}_k^T + [0\;\;0\;\cdots\;\beta_{k+1}\mathbf{w}_{k+1}]$$

If 𝛾𝑘+1 = 0, then the iteration terminates and span{𝐯1, 𝐯2, …, 𝐯𝑘} is an invariant subspace for 𝑨. If 𝛽𝑘+1 = 0, then the iteration also terminates and span{𝐰1, 𝐰2, …, 𝐰𝑘} is an invariant subspace for 𝑨ᵀ. However, if neither of these conditions is true and 𝜼𝑘ᵀ𝝁𝑘 = 0, then the tridiagonalization process ends without any invariant subspace information. This is called serious breakdown. See Wilkinson (1965, p. 389) for an early discussion of the matter.

Remark: If the algorithm does not break down before completion of 𝑛 steps, then,
defining 𝐕𝑛 = [𝐯1 , 𝐯2 , … , 𝐯𝑛 ] and 𝐖𝑛 = [𝐰1 , 𝐰2 , … , 𝐰𝑛 ] with 𝐖𝑛 𝑇 𝐕𝑛 = 𝑰.

we obtain $\boldsymbol{T} = \mathbf{W}_n^T\boldsymbol{A}\mathbf{V}_n$, $\boldsymbol{A}\mathbf{V}_n = \mathbf{V}_n\boldsymbol{T} + \gamma_{n+1}\mathbf{v}_{n+1}\mathbf{e}_n^T$ and $\boldsymbol{A}^T\mathbf{W}_n = \mathbf{W}_n\boldsymbol{T}^T + \beta_{n+1}\mathbf{w}_{n+1}\mathbf{e}_n^T$, where 𝑻 is the tridiagonal matrix

$$\boldsymbol{T} = \mathrm{tridiag}(\alpha_1,\ldots,\alpha_n;\;\beta_2,\ldots,\beta_n;\;\gamma_2,\ldots,\gamma_n) = \begin{pmatrix}\alpha_1 & \beta_2 & & 0\\ \gamma_2 & \alpha_2 & \ddots & \\ & \ddots & \ddots & \beta_n\\ 0 & & \gamma_n & \alpha_n\end{pmatrix} \in \mathbb{R}^{n\times n}$$

■ Derivation of the solution: Assume that the real general matrix 𝑨 ∈ ℝ^{𝑛×𝑛} is transformed to its equivalent form 𝑻 = 𝑾ᵀ𝑨𝑽, such that 𝑻 ∈ ℝ^{𝑚×𝑚} is represented in a Krylov space of dimension 𝑚, 𝑽 = [𝐯1, 𝐯2, … 𝐯𝑚] ∈ ℝ^{𝑛×𝑚} is a matrix whose column vectors form a basis of 𝒦𝑚, and 𝑾 = [𝐰1, 𝐰2, … 𝐰𝑚] ∈ ℝ^{𝑛×𝑚} is a matrix whose column vectors form a basis of ℒ𝑚, with 𝑽𝑾ᵀ = 𝑰. Let us start the generation of the Krylov space with the following initial vector:

$$\mathbf{v}_1 = \frac{\boldsymbol{b}-\boldsymbol{A}\mathbf{x}_0}{\|\boldsymbol{b}-\boldsymbol{A}\mathbf{x}_0\|_2} = \frac{\boldsymbol{r}_0}{\|\boldsymbol{r}_0\|_2} = \boldsymbol{V}\boldsymbol{e}_1 \quad\text{with}\quad \boldsymbol{e}_1 = [1,0,\ldots,0]^T \in \mathbb{R}^m \;\Longrightarrow\; \boldsymbol{r}_0 = \|\boldsymbol{r}_0\|_2\boldsymbol{V}\boldsymbol{e}_1$$

The exact solution is given by $\mathbf{x} = \boldsymbol{A}^{-1}\boldsymbol{b} = (\boldsymbol{A}^{-1}\boldsymbol{b} - \boldsymbol{A}^{-1}\boldsymbol{r}_0) + \boldsymbol{A}^{-1}\boldsymbol{r}_0 = \mathbf{x}_0 + \boldsymbol{A}^{-1}\boldsymbol{r}_0$. The idea is to approximate $\boldsymbol{A}^{-1}\boldsymbol{r}_0$ by $p(\boldsymbol{A})\boldsymbol{r}_0$, where 𝑝 is a good polynomial:

$$\mathbf{x} = \mathbf{x}_0 + \boldsymbol{A}^{-1}\boldsymbol{r}_0 = \mathbf{x}_0 + \|\boldsymbol{r}_0\|_2\boldsymbol{A}^{-1}\boldsymbol{V}\boldsymbol{e}_1 = \mathbf{x}_0 + \|\boldsymbol{r}_0\|_2\boldsymbol{V}\boldsymbol{W}^T\boldsymbol{A}^{-1}\boldsymbol{V}\boldsymbol{e}_1 = \mathbf{x}_0 + \|\boldsymbol{r}_0\|_2\boldsymbol{V}\boldsymbol{T}^{-1}\boldsymbol{e}_1$$

Let us define a new vector 𝐲 as the solution of the system $\boldsymbol{T}\mathbf{y} = \|\boldsymbol{r}_0\|_2\boldsymbol{e}_1$; then

$$\mathbf{x} = (\mathbf{x}_0 + \boldsymbol{V}\mathbf{y}) \in \mathbf{x}_0 + \mathcal{K}_m$$

Once the vector 𝐲 is obtained from the system $\boldsymbol{T}\mathbf{y} = \|\boldsymbol{r}_0\|_2\boldsymbol{e}_1$, we can construct the solution.

■ Termination Criterion: The above algorithm depends on a parameter 𝑚 which is the dimension of the Krylov subspace. In practice it is desirable to select 𝑚 in a dynamic fashion. This would be possible if the residual norm of the solution 𝐱𝑚 were available inexpensively (without having to compute 𝐱𝑚 itself). Then the algorithm can be stopped at the appropriate step using this information. The following proposition gives a result in this direction.

The residual vector of the approximate solution 𝐱𝑘 computed by the Lanczos algorithm is such that

$$\boldsymbol{b}-\boldsymbol{A}\mathbf{x}_k = \boldsymbol{b}-\boldsymbol{A}(\mathbf{x}_0+\boldsymbol{V}_k\mathbf{y}_k) = \boldsymbol{r}_0 - \boldsymbol{A}\boldsymbol{V}_k\mathbf{y}_k = \|\boldsymbol{r}_0\|_2\boldsymbol{V}_k\boldsymbol{e}_1 - (\boldsymbol{V}_k\boldsymbol{T}_k\mathbf{y}_k + \gamma_{k+1}\mathbf{v}_{k+1}\boldsymbol{e}_k^T\mathbf{y}_k) = -\gamma_{k+1}\mathbf{v}_{k+1}\boldsymbol{e}_k^T\mathbf{y}_k$$

Then the tolerance error 𝜀 is defined by

$$\varepsilon = \frac{\|\boldsymbol{r}_k\|_2}{\|\boldsymbol{r}_0\|_2} = \gamma_{k+1}\left|\boldsymbol{e}_k^T\mathbf{y}_k\right| \times \frac{\|\mathbf{v}_{k+1}\|_2}{\|\boldsymbol{r}_0\|_2}$$

Remark: It is of great importance to know how to solve $\boldsymbol{T}_k\mathbf{y}_k = \boldsymbol{z}_k$ with $\boldsymbol{z}_k = \|\boldsymbol{r}_0\|_2\boldsymbol{e}_1$. Since 𝑻𝑘 is tridiagonal, it admits the bidiagonal factorization

$$\boldsymbol{T}_k = \begin{pmatrix}\alpha_1 & \beta_2 & & 0\\ \gamma_2 & \alpha_2 & \ddots & \\ & \ddots & \ddots & \beta_k\\ 0 & & \gamma_k & \alpha_k\end{pmatrix} = \begin{pmatrix}1 & & & 0\\ \ell_2 & 1 & & \\ & \ddots & \ddots & \\ 0 & & \ell_k & 1\end{pmatrix}\begin{pmatrix}d_1 & \mu_2 & & 0\\ & d_2 & \ddots & \\ & & \ddots & \mu_k\\ 0 & & & d_k\end{pmatrix} = \boldsymbol{L}_k\boldsymbol{R}_k$$

Hence the equation $\boldsymbol{T}_k\mathbf{y}_k = \boldsymbol{L}_k(\boldsymbol{R}_k\mathbf{y}_k) = \boldsymbol{L}_k\boldsymbol{\vartheta}_k = \boldsymbol{z}_k$ can be solved recursively by a forward and a back substitution.
clear all, clc, A=10*rand(7,7); b=100*rand(7,1); n =length(b);
x0= rand(n,1); toll=0.0001; r0=b-A*x0; nres0=norm(r0,2);
V=r0/nres0; W=V; gamma(1)=0; beta(1)=0; k=1; nres=1;
while k <= n && nres > toll
vk=V(:,k); wk=W(:,k);
if k==1, vk1=0*vk; wk1=0*wk;
else, vk1=V(:,k-1); wk1= W(:,k-1);
end
alpha(k)=wk'*A*vk;
tildev=A*vk-alpha(k)*vk-beta(k)*vk1;
tildew=A'*wk-alpha(k)*wk-gamma(k)*wk1;
gamma(k+1)=sqrt(abs(tildew'*tildev)); % gamma(k+1)=tildez'*tildev;

if gamma(k+1) == 0, k=n+2;
else
beta(k+1)=tildew'*tildev/gamma(k+1);
W=[W,tildew/beta(k+1)];
V=[V,tildev/gamma(k+1)];
end
if k<n+2
if k==1
Tk = alpha;
else
Tk=diag(alpha)+diag(beta(2:k),1)+diag(gamma(2:k),-1);
end
yk=Tk\(nres0*[1,0*[1:k-1]]');
xk=x0+V(:,1:k)*yk;
nres=abs(gamma(k+1)*[0*[1:k-1],1]*yk)*norm(V(:,k+1),2)/nres0;
k=k+1;
else
return
end
end
m=k-1; % The Krylov space dimension
A, Tk, eig(A), eig(Tk), xk,
Zero1=A-V(:,1:m)*Tk*W(:,1:m)' % verification1
Zero2= A*xk-b % verification2

Example: Let us execute the program with the following matrix:
A =
1.250321 3.209004 4.593262 2.748218 9.173651 1.236518
4.137617 5.958772 7.751354 5.337032 6.935770 3.988718
3.362769 1.916382 9.758420 3.546427 6.875781 3.475287
2.461339 5.068413 2.549812 7.292116 6.241738 2.997823
3.688499 7.626944 8.882880 5.214819 6.937188 3.948593
8.288752 0.028778 8.265965 4.578493 0.596674 8.944608

The result will be


Tk =
25.6608 9.7823 0 0 0 0
9.7823 8.8551 2.3272 0 0 0
0 2.3272 -4.8674 -4.8879 0 0
0 0 4.8879 3.4774 -0.7855 0
0 0 0 0.7855 5.5885 -0.8947
0 0 0 0 0.8947 1.4270

■ Link between Lanczos Method and Krylov Space: From the above algorithm we
know that
$$\begin{aligned}\boldsymbol{A}\mathbf{v}_k &= \beta_k\mathbf{v}_{k-1} + \alpha_k\mathbf{v}_k + \gamma_{k+1}\mathbf{v}_{k+1}, & \beta_1\mathbf{v}_0 &= 0\\ \boldsymbol{A}^T\mathbf{w}_k &= \gamma_k\mathbf{w}_{k-1} + \alpha_k\mathbf{w}_k + \beta_{k+1}\mathbf{w}_{k+1}, & \gamma_1\mathbf{w}_0 &= 0\end{aligned} \qquad k = 1,2,\ldots,m$$

Back substitution gives

$$\mathbf{v}_k = \sum_{i=1}^{k}c_i\boldsymbol{A}^{i-1}\mathbf{v}_1 \qquad\text{and}\qquad \mathbf{w}_k = \sum_{i=1}^{k}d_i(\boldsymbol{A}^T)^{i-1}\mathbf{w}_1, \qquad k = 1,2,\ldots,m$$

so that

$$\mathcal{K}_m(\boldsymbol{A},\mathbf{v}_1) = \mathrm{span}\{\mathbf{v}_1,\mathbf{v}_2,\ldots,\mathbf{v}_m\} = \mathrm{span}\{\mathbf{v}_1,\boldsymbol{A}\mathbf{v}_1,\ldots,\boldsymbol{A}^{m-1}\mathbf{v}_1\}$$
$$\mathcal{L}_m(\boldsymbol{A}^T,\mathbf{w}_1) = \mathrm{span}\{\mathbf{w}_1,\mathbf{w}_2,\ldots,\mathbf{w}_m\} = \mathrm{span}\{\mathbf{w}_1,\boldsymbol{A}^T\mathbf{w}_1,\ldots,(\boldsymbol{A}^T)^{m-1}\mathbf{w}_1\}$$

Notes on Iterative Methods: Iterative, or indirect, methods start with an initial guess of the solution x and then repeatedly improve the solution until the change in x becomes negligible. Since the required number of iterations can be very large, the indirect methods are, in general, slower than their direct counterparts. A serious drawback of iterative methods is that they do not always converge to the solution. It can be shown that convergence is guaranteed only if the coefficient matrix is diagonally dominant. The initial guess for x plays no role in determining whether convergence takes place: if the procedure converges for one starting vector, it will do so for any starting vector. The initial guess affects only the number of iterations that are required for convergence.

In numerical optimization, the Broyden algorithm is an iterative method for solving unconstrained nonlinear optimization problems. The optimization problem is to minimize 𝑓(𝐱), where 𝐱 is a vector in ℝ𝑛 and 𝑓 is a differentiable scalar function. There are no constraints on the values that 𝐱 can take. The algorithm begins at an initial estimate 𝐱0 of the optimal value and proceeds iteratively to get a better estimate at each stage. The development of this algorithm is detailed in the fourth chapter. This method converges without requiring conditions such as symmetry, positive definiteness, or diagonal dominance.

Algorithm:
Data: 𝐟(𝐱) = 𝑨𝐱 − 𝒃, 𝐱0, 𝐟(𝐱0) and 𝑩0
Result: 𝐱𝑘
begin:
    𝐱𝑘+1 = (𝑰 − 𝑩𝑘 𝑨)𝐱𝑘 + 𝑩𝑘 𝒃
    𝒔𝑘 = 𝐱𝑘+1 − 𝐱𝑘
    𝐲𝑘 = 𝑨(𝐱𝑘+1 − 𝐱𝑘) = 𝑨𝒔𝑘
    𝑩𝑘+1 = 𝑩𝑘 + ((𝒔𝑘 − 𝑩𝑘 𝐲𝑘)/(𝒔𝑘^T 𝑩𝑘 𝐲𝑘)) 𝒔𝑘^T 𝑩𝑘
end
This formulation is the author's own; to the best of our knowledge, this algorithm has not previously been applied to linear systems.

clear all, clc, V= diag([-1 -2 -3 -4 -5 -6 -7 -8 -9 -10]);


M=rand(10,10); A=M*V*inv(M); b=10*rand(10,1); n=max(size(A));
m=min(size(b)); x0=zeros(n,m); B=10*eye(n,n); I=eye(n,n);
for k=1:50
x1=(I-B*A)*x0+B*b; % x1 = x0 - B*(A*x0-b)
y=A*(x1-x0);
s=x1-x0;
B = B + ((s-B*y)*(s'*B))/(s'*B*y);
x0=x1;
end
x1, ZERO1=A*x1-b, ZERO2= eye(10) - B*A % verification

Many practical problems in engineering and physics


lead to eigenvalue problems. Eigenvalue problems in science and engineering are often
formulated at the level of the differential equation, but the essential features of
eigenvalue problems can be studied at the matrix level.

The matrix eigenvalue problem can be stated as 𝑨𝐱 = 𝜆𝐱 where 𝑨 is a given 𝑛 × 𝑛 matrix.


The problem is to find the scalar 𝜆 and the vector 𝐱. Rewriting the previous equation in
the form (𝑨 − 𝜆𝑰)𝐱 = 𝟎 it becomes apparent that we are dealing with a system of n
homogeneous equations. An obvious solution is the trivial one 𝐱 = 𝟎. A nontrivial
solution can exist only if the determinant of the coefficient matrix vanishes; that is, if
|𝑨 − 𝜆𝑰| = 0. Expansion of the determinant leads to the polynomial equation known as
the characteristic equation Δ(𝜆) = ∏𝑛𝑖=1(𝜆 − 𝜆𝑖 ) = ∑𝑛𝑖=0 𝑎𝑖 𝜆𝑛−𝑖 which has the roots 𝜆𝑖 with
𝑖 = 1,2, … , 𝑛, called the eigenvalues of the matrix 𝑨. The solutions 𝐱 𝑖 of (𝑨 − 𝜆𝑖 𝑰)𝐱 = 𝟎 are
known as the eigenvectors.

Eigenvalue problems that originate from physical problems often end up with a
symmetric 𝑨. This is fortunate, because symmetric eigenvalue problems are much
easier to solve than their non-symmetric counterparts.

For many real situations, the eigenvalue problem does not arise in the standard form
𝑨𝐱 = 𝜆𝐱 but rather in the form 𝑨𝐱 = 𝜆𝑩𝐱 Where 𝑨 & 𝑩 are two symmetric matrices. It is
much more convenient if the equation 𝑨𝐱 = 𝜆𝑩𝐱 can be converted to the standard form.

Case 1: if 𝑩 is nonsingular then 𝑨𝐱 = 𝜆𝑩𝐱 ⟺ (𝑩−1 𝑨)𝐱 = 𝜆𝐱


Case 2: if 𝑨 is nonsingular then 𝑨𝐱 = 𝜆𝑩𝐱 ⟺ (𝑨−1 𝑩)𝐱 = 𝛾𝐱 with 𝛾 = 1/𝜆
Case 3: if 𝑩 is positive definite then 𝑩 can be written in the Cholesky decomposition 𝑩 = 𝑳𝑳𝑇

𝑨𝐱 = 𝜆𝑩𝐱 ⟺ 𝑳−1 𝑨𝐱 = 𝜆𝑳−1 𝑩𝐱 = 𝜆𝑳𝑇 𝐱


⟺ 𝑳−1 𝑨(𝑳−𝑇 𝑳𝑇 )𝐱 = 𝜆𝑳𝑇 𝐱
⟺ (𝑳−1 𝑨𝑳−𝑇 )(𝑳𝑇 𝐱) = 𝜆(𝑳𝑇 𝐱)

If we define 𝑯 = 𝑳−1 𝑨𝑳−𝑇 and 𝐳 = 𝑳𝑇 𝐱 we get: 𝑨𝐱 = 𝜆𝑩𝐱 ⟺ 𝑯𝐳 = 𝜆𝐳, which is in standard


form. Moreover, the matrix 𝑯 = 𝑳−1 𝑨𝑳−𝑇 has the same eigenvalues as the original problem.
Case 4: if 𝑨 is positive definite then 𝑨 can be written in Cholesky decomposition 𝑨 = 𝑳𝑳𝑇

𝑨𝐱 = 𝜆𝑩𝐱 ⟺ (𝑳−1 𝑩𝑳−𝑇 )𝐳 = 𝛾𝐳 with 𝛾 = 1/𝜆
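The following minimal MATLAB sketch illustrates Case 3 on assumed random test data (a symmetric 𝑨 and a symmetric positive definite 𝑩):

% Sketch: reduce A*x = lambda*B*x to standard form H*z = lambda*z (Case 3)
n = 5; A = rand(n); A = A + A';        % symmetric A (assumed test data)
M = rand(n); B = M*M' + n*eye(n);      % B symmetric positive definite
L = chol(B,'lower');                   % Cholesky: B = L*L'
H = L\A/L';                            % H = inv(L)*A*inv(L'), symmetric
[Z,D] = eig(H);                        % standard eigenproblem H*z = lambda*z
X = (L')\Z;                            % recover generalized eigenvectors x = inv(L')*z
[sort(diag(D)) sort(eig(A,B))]         % the two columns should agree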

The main algorithms for actually computing eigenvalues and eigenvectors are the 𝑄𝐷 algorithm (by Heinz Rutishauser), the power method, the Householder transformation, the 𝐿𝑅 algorithm of Rutishauser, and the powerful 𝑄𝑅 algorithm of Francis.

The power method is very good at approximating the extremal


eigenvalues of the matrix, that is, the eigenvalues having largest and smallest modulus, denoted by 𝜆1 and 𝜆𝑛 respectively, as well as their associated eigenvectors. Solving such a problem is of great interest in several real-life applications (geoseismics, machine and structural vibrations, electric network analysis, quantum mechanics, …) where the
computation of 𝜆𝑛 (and its associated eigenvector 𝐱 𝑛 ) arises in the determination of the
proper frequency (and the corresponding fundamental mode) of a given physical system.

Let 𝑨 ∈ ℂ𝑛×𝑛 be a diagonalizable matrix and let 𝑿 = [𝐱1 , … , 𝐱 𝑛 ] ∈ ℂ𝑛×𝑛 be the matrix of its
eigenvectors 𝐱 𝑖 , for 𝑖 = 1, . . . , 𝑛. Let us also suppose that the eigenvalues of 𝑨 are ordered
as |𝜆1 | > |𝜆2 | > ⋯ > |𝜆𝑛 | where 𝜆1 has algebraic multiplicity equal to 1. Under these
assumptions, 𝜆1 is called the dominant eigenvalue of matrix 𝑨.

Given an arbitrary initial vector 𝒒0 ∈ ℂ𝑛 of unit Euclidean norm, consider for 𝑘 = 1, 2, …


the following iteration based on the computation of powers of matrices, commonly
known as the power method:
𝒛𝑘 = 𝑨𝒒𝑘−1
{𝒒𝑘 = 𝒛𝑘 /‖𝒛𝑘 ‖2
𝜂𝑘 = 𝒒𝑘 𝐻 𝑨𝒒𝑘

Let us analyze the convergence properties of this method. By induction on 𝑘 one can
check that
𝒒𝑘 = 𝑨^𝑘 𝒒0 / ‖𝑨^𝑘 𝒒0‖2 ,   𝑘 ≥ 1

This relation explains the role played by the powers of A in the method. Because A is
diagonalizable, its eigenvectors 𝐱 𝑖 form a basis of ℂ𝑛 ; it is thus possible to represent 𝒒0
as

𝒒0 = ∑_{𝑖=1}^{𝑛} 𝛼𝑖 𝐱𝑖 ,   𝛼𝑖 ∈ ℂ , 𝑖 = 1, 2, … , 𝑛

Moreover, since 𝐴𝐱 𝑖 = 𝜆𝑖 𝐱 𝑖 , we have 𝑨𝑘 𝒒0 = ∑𝑛𝑖=1 𝛼𝑖 𝑨𝑘 𝐱 𝑖 = ∑𝑛𝑖=1 𝛼𝑖 𝜆𝑘𝑖 𝐱 𝑖


𝑨^𝑘 𝒒0 = 𝛼1 𝜆1^𝑘 (𝐱1 + ∑_{𝑖=2}^{𝑛} (𝛼𝑖/𝛼1)(𝜆𝑖/𝜆1)^𝑘 𝐱𝑖)

Since |𝜆𝑖/𝜆1| < 1 for 𝑖 = 2, … , 𝑛, as 𝑘 increases the vector 𝑨^𝑘 𝒒0 tends to assume an
increasingly significant component in the direction of the eigenvector 𝐱1

𝑨^𝑘 𝒒0 ≈ 𝛼1 𝜆1^𝑘 𝐱1   for 𝑘 large, hence

lim_{𝑘→∞} 𝒒𝑘 = lim_{𝑘→∞} 𝑨^𝑘 𝒒0 / ‖𝑨^𝑘 𝒒0‖2 = 𝛼1 𝜆1^𝑘 𝐱1 / (𝛼1 𝜆1^𝑘 ‖𝐱1‖2) = 𝐱1/‖𝐱1‖2

As 𝑘 → ∞, the vector 𝒒𝑘 thus aligns itself along the direction of eigenvector 𝐱1 . Therefore
the sequence of Rayleigh quotients 𝜂𝑘 will converge to 𝜆1 .

lim_{𝑘→∞} 𝜂𝑘 = lim_{𝑘→∞} 𝒒𝑘^𝐻 𝑨 𝒒𝑘 = (𝐱1/‖𝐱1‖2)^𝐻 𝑨 (𝐱1/‖𝐱1‖2) = 𝜆1 (𝐱1^𝐻 𝐱1)/(‖𝐱1‖2)² = 𝜆1

and the convergence will be faster when the ratio |𝜆2 /𝜆1 | is smaller.

Example: the matrix 𝑨 = [4 5; 6 5] needs only 6 steps (the ratio |𝜆2/𝜆1| is 0.1), while the matrix 𝑨 = [−4 10; 7 5] needs 68 steps (the ratio is 0.9).
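The ratios themselves are easy to check (a two-line sketch):

A1 = [4 5; 6 5];   sort(abs(eig(A1)),'descend')   % 10 and 1  -> ratio 0.1
A2 = [-4 10; 7 5]; sort(abs(eig(A2)),'descend')   % 10 and 9  -> ratio 0.9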

clear all, clc,


% Power iteration for the dominant eigen-pair
% Input: A, square matrix not necessarily symmetric
% N: number of iterations
% Output: lambda, sequence of eigenvalue approximations (vector)
% x final eigenvector approximation

A=randi(10,4); A=0.5*(A'+A); A1=A; N=10; L=eig(A)


n = length(A); x = randn(n,1); x = x/norm(x,2);
for k = 1:N
q = A*x;
x = q/norm(q,2);
lambda(k) = x'*A*x;
end
lambda(N)
x

Similarly, the left eigenvector can be obtained using the following algorithm

𝒘𝑘 = 𝑨^T 𝒑𝑘−1 ;  𝒑𝑘 = 𝒘𝑘/‖𝒘𝑘‖2 ;  𝜂𝑘 = 𝒑𝑘^T 𝑨^T 𝒑𝑘   ⟺   𝒗𝑘 = 𝒑𝑘−1^T 𝑨 ;  𝒑𝑘^T = 𝒗𝑘/‖𝒗𝑘‖2 ;  𝜂𝑘 = 𝒑𝑘^T 𝑨 𝒑𝑘

clear all, clc, A=randi(10,4); A=0.5*(A'+A); A1=A; N=20; L=eig(A)


n = length(A); y = randn(1,n); y = y/norm(y,inf);
for k = 1:N
y = (y*A)/norm(y*A,2);
lambda(k) = y*A*y';
end
lambda(N)
y
Finding Other Eigenvectors: We can simply remove the dominant direction from the matrix and repeat the process (deflation).

Assume that the matrix 𝑨 is diagonalizable and let 𝐱 𝑖 , 𝐲𝑖 𝑇 be the right and left
eigenvectors respectively, means that 𝑨𝐱 𝑖 = 𝜆𝑖 𝐱 𝑖 , 𝐲𝑖 𝑇 𝑨 = 𝜆𝑖 𝐲𝑖 𝑇 , then by using the
spectral decomposition theorem

𝑨 = 𝑿𝜦𝒀^T = [𝐱1 𝐱2 … 𝐱𝑛] 𝜦 [𝐲1^T; 𝐲2^T; … ; 𝐲𝑛^T] = [𝜆1𝐱1 𝜆2𝐱2 … 𝜆𝑛𝐱𝑛][𝐲1^T; … ; 𝐲𝑛^T] = ∑_{𝑖=1}^{𝑛} 𝜆𝑖 𝐱𝑖 𝐲𝑖^T   with   ∑_{𝑖=1}^{𝑛} 𝐱𝑖 𝐲𝑖^T = 𝑰

Now use power iteration to find 𝐱1 and 𝜆1 then let 𝑨2 ← 𝑨 − 𝜆1 𝐱1 𝐲1 𝑇 repeat power
iteration on 𝑨2 to find 𝐱 2 and 𝜆2 continue like this for 𝜆3 , . . . , 𝜆𝑛 .

Remark: This deflation is a good approximation only when the matrix 𝑨 is symmetric, i.e. 𝑨 = 𝑿𝜦𝑿^T. When 𝑨 is not symmetric the method may fail.

In order to also obtain the inverse of the symmetric matrix 𝑨, we use the following iteration: 𝑨 ← 𝑨 − 𝜆𝑖 𝐱𝑖 𝐱𝑖^T + (1/𝜆𝑖)𝐱𝑖 𝐱𝑖^T. (Proposed by the author.)

clear all, clc, M=10*rand(4,4);


A=M*diag([-10 -20 -30 -40])*inv(M) ; A=0.5*(A+A'); B=A; L1=eig(A),
n = length(A); N=100; L2=[]; X=[];
for i = 1:n
x = rand(n,1); x=x/norm(x,2);
for k = 1:N
x = (B*x)/norm((B*x),2);
lambda(k) = x'*B*x;
end
x1= x; X=[X,x1]; % zero= B*x - lambda(N)*x;
s=lambda(N); L2=[L2;s];
B = B - s*x1*x1'+ (1/s)*x1*x1';
end
L2, X, % All the eigenvalues and all eigenvectors of A
IA=A*B % B return to the inverse of A

Inverse iteration is an algorithm to compute the


smallest eigenvalue (in modulus) of a symmetric matrix A:

Choose 𝐱 0
for 𝑘 = 1,2, … , 𝑚 (until convergence)
solve 𝑨𝐱 𝑘+1 = 𝐱 𝑘
normalize 𝐱 𝑘+1 ∶= 𝐱 𝑘+1 /‖𝐱 𝑘+1 ‖
𝜆𝑘 = 𝐱𝑘+1^T 𝑨 𝐱𝑘+1 / (𝐱𝑘+1^T 𝐱𝑘+1)
end
For large matrices, one can save operations if we compute the 𝐿𝑈 decomposition of the
matrix 𝑨 only once. The iteration is performed using the factors 𝑳 and 𝑼. This way, each
iteration needs only 𝑂(𝑛2 ) operations, instead of 𝑂(𝑛3 ) with the program above.
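A minimal sketch of this economy, on an assumed symmetric test matrix: factor once, then reuse the triangular factors at every step.

% Inverse iteration reusing one LU factorization (sketch)
A = randi(10,5); A = 0.5*(A + A');     % symmetric test matrix (assumed)
[L,U,P] = lu(A);                       % factor once: O(n^3)
x = randn(5,1); x = x/norm(x);
for k = 1:50
    x = U\(L\(P*x));                   % each solve costs only O(n^2)
    x = x/norm(x);
end
lambda_min = x'*A*x                    % eigenvalue of smallest modulus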

clear all, clc,

% Shifted inverse iteration for the closest eigenpair.


% Input: A is square matrix
% s value close to desired eigenvalue (complex scalar)
% N number of iterations
% Output: gamma sequence of eigenvalue approximations (vector)
% x final eigenvector approximation
A=randi(10,4); A=0.5*(A'+A); L=eig(A)
n = length(A); x = randn(n,1); x = x/norm(x,inf); s=max(L);
B = A - s*eye(n); [L,U] = lu(B); N=20;
for k = 1:N
y = U\(L\x);
[normy,m] = max(abs(y));
gamma(k) = x(m)/y(m) + s;
x = y/normy;
end

We now show how Householder transformations can be


used to compute 𝑄𝑅 factorization. The 𝑄𝑅 factorization of an 𝑚 × 𝑛 matrix 𝑨 is given by
𝑨 = 𝑸𝑹 where 𝑸 ∈ ℝ𝑚×𝑚 is orthogonal and 𝑹 ∈ ℝ𝑚×𝑛 is upper triangular. In this section
we assume 𝑚 = 𝑛. We will see that if 𝑨 has full column rank, then the first 𝑛 columns of 𝑸 form an orthonormal basis for the range (column space) of 𝑨. Thus, calculation of the 𝑄𝑅 factorization is one way to compute an orthonormal basis for a set of vectors. This computation can be arranged in several ways. We give methods based on Householder reflections and on the Gram-Schmidt orthogonalization process; a numerically more stable variant called modified Gram-Schmidt is also discussed.

■ Householder and QR: a Householder matrix is a special form of linear


transformation 𝑯 which was suggested by Alston Scott Householder in 1958. If the transformation matrices 𝑯𝑖 are chosen to be unitary, then the condition numbers associated with the systems 𝑯𝑖 𝑨𝐱 = 𝑯𝑖 𝒃 do not change (so they certainly don't get worse). Furthermore, the matrices 𝑯𝑖 should be chosen so that the matrices 𝑨𝑖 become
simpler. As was suggested by Householder, this can be accomplished in the following
manner. The unitary matrix 𝑯 is chosen to be 𝑯 = 𝑰 − 2𝒘𝒘𝑇 with 𝒘𝑇 𝒘 = 1 𝒘 ∈ ℂ𝒏

This matrix is Hermitian: 𝑯𝑇 = (𝑰 − 2𝒘𝒘𝑇 )𝑇 = 𝑰 − 2𝒘𝒘𝑇 = 𝑯 and unitary, that is 𝑯𝑇 𝑯 = 𝑰


and therefore involutory: 𝑯2 = 𝑰.

As any vector 𝐯 can be normalized, 𝒘 = 𝐯/‖𝐯‖, the Householder matrix defined above can be written as:

𝑯 = 𝑰 − 2𝒘𝒘^T = 𝑰 − 2𝐯𝐯^T/‖𝐯‖²
We define specifically a vector 𝐯 = 𝐱 − ‖𝐱‖𝐞1 with 𝐱 is any n-dimensional vector and 𝐞1 is
the first standard basis vector with all elements equals zero except the first one. The
norm squared of this vector can be found to be

‖𝐯‖2 = 𝐯 𝑇 𝐯 = (𝐱 − ‖𝐱‖𝐞1 )𝑻 (𝐱 − ‖𝐱‖𝐞1 ) = 2‖𝐱‖(‖𝐱‖ − 𝑥1 )

Notice that 𝑯𝐱 = 𝑑𝐞1 with 𝑑 = ‖𝐱‖. Let us prove this claim:

𝑯𝐱 = (𝑰 − 2𝐯𝐯^T/‖𝐯‖²)𝐱 = 𝐱 − 2𝐯(𝐯^T𝐱)/‖𝐯‖² = 𝐱 − 2𝐯(‖𝐱‖² − ‖𝐱‖𝑥1)/(2‖𝐱‖(‖𝐱‖ − 𝑥1)) = 𝐱 − 𝐯 = ‖𝐱‖𝐞1

We see that all elements in 𝐱 except the first one are eliminated to zero. This feature of
the Householder transformation is the reason why it is widely used.

Let 𝐜𝑖 be the columns of the matrix 𝑨 = [𝐜1 𝐜2 … 𝐜𝑛 ], it is very easy to observe that if we let
𝐯 = 𝐜1 − ‖𝐜1 ‖𝐞1 then

                                                            (⋆  ⋆  ⋯  ⋆)
(𝑰 − 2𝐯𝐯^T/‖𝐯‖²)𝐜1 = ‖𝐜1‖𝐞1  ⟹  𝑯1𝑨 = [‖𝐜1‖𝐞1 ⋮ 𝐜̂2 ⋮ ⋯ ⋮ 𝐜̂𝑛] = (0          )
                                                            (⋮    𝑨′   )
                                                            (0          )

We then construct a second transformation matrix 𝑯2 and apply it, 𝑯2(𝑯1𝑨):

𝑯2 = ( 1   𝟎^T                 )
     ( 𝟎   𝑰 − 2𝐯2𝐯2^T/‖𝐯2‖²   )    with 𝐯2 corresponding to the 2nd column of the matrix 𝑨′
If we redo the process 𝑘 times we obtain 𝑨𝑘 = 𝑯𝑘 𝑨𝑘−1 ⟹ 𝑨𝑘 = 𝑯𝑘 … 𝑯2 𝑯1 𝑨0 . A matrix
𝑨 = 𝑨0 can be reduced step by step using these unitary "Householder matrices" 𝑯𝑘 into
an upper triangular matrix 𝑨𝑛−1 = 𝑯𝑛−1 … 𝑯2 𝑯1 𝑨0 = 𝑹.

The Householder reduction of a matrix to triangular form requires about 2𝑛3 /3


operations. In this process an 𝑛 × 𝑛 unitary matrix 𝑯 = 𝑯𝑛−1 … 𝑯2 𝑯1 consisting of
Householder matrices 𝑯, and an 𝑛 × 𝑛 upper triangular matrix 𝑹 are determined so that

𝑯𝑨 = 𝑹 or 𝑨 = 𝑯−1 𝑹 = 𝑸𝑹

An upper triangular matrix 𝑹 such that 𝑨𝑹−1 = 𝑸 is a matrix with orthonormal columns can also be produced directly by applying Gram-Schmidt orthogonalization to the columns of 𝑨 = [𝐜1 𝐜2 … 𝐜𝑛].

Remark: Since 𝑯 is orthogonal we have ‖𝑯𝐱‖2 = ‖𝐱‖2 = |𝑑| ⟹ 𝑑 = ±‖𝐱‖. We can still choose the sign of 𝑑, and we choose it such that no cancellation occurs in computing 𝐯 = 𝐱 − 𝑑𝐞1:

𝑑 = +‖𝐱‖ if 𝑥1 < 0 ,   𝑑 = −‖𝐱‖ if 𝑥1 ≥ 0
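A tiny numeric illustration (hypothetical values) of the cancellation this sign choice avoids:

x = [1; 1e-8];
x(1) - norm(x)   % computed 0: the true value ~ -5e-17 is lost to cancellation
x(1) + norm(x)   % ~ 2, computed to full relative accuracy (choice d = -||x||)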
Algorithm:

At the start let 𝑨 = 𝑸1𝑹1 with 𝑹1 = 𝑨 and 𝑸1 = 𝑰
for 𝑘 = 1: 𝑚 − 1
    𝐱 = zeros(𝑚,1); 𝐱(𝑘: 𝑚) = 𝑹𝑘(𝑘: 𝑚, 𝑘); 𝑑 = −sign(𝐱(𝑘))‖𝐱‖;
    𝐯 = 𝐱; 𝐯(𝑘) = 𝐱(𝑘) − 𝑑; 𝑟 = ‖𝐯‖;
    𝑹𝑘+1 = (𝑰 − 2𝐯𝐯^T/𝑟²)𝑹𝑘 ;   𝑸𝑘+1 = (𝑰 − 2𝐯𝐯^T/𝑟²)𝑸𝑘
end

Example:
× × × × × × × × × × × × × × × × × × × ×
× × × × × 𝑯1 0 × × × × 𝑯2 0 × × × × 𝑯3 0 × × × ×
× × × × × → 0 × × × × → 0 0 × × × → 0 0 × × ×
× × × × × 0 × × × × 0 0 × × × 0 0 0 × ×
(× × × × ×) (0 × × × ×) (0 0 × × ×) (0 0 0 × ×)

% QR factorization by Householder reflections.


% Input: A is m-by-n matrix
% Output: Q,R A=QR, Q m-by-m orthogonal, R m-by-n upper triangular
clear all, clc, A=randi(10,4);[m,n] = size(A); Q = eye(m); R=A;
for k = 1:n
z = R(k:m,k);
v = [ -sign(z(1))*norm(z) - z(1); -z(2:end)];
P = eye(m-k+1) - 2*(v*v')/(v'*v); % HH reflection
R(k:m,:) = P*R(k:m,:);
Q(k:m,:) = P*Q(k:m,:);
end
Q = Q';
R = triu(R); % enforce exact triangularity

An alternative way of programming the Householder Transformations


% QR Factorization Using Householder Transformations
clear all, clc, A=10*rand(4,4) + 10*eye(4,4); [m,n]=size(A);
R=A; %Start with R=A
Q=eye(m); %Set Q as the identity matrix
for k=1:m-1
x=zeros(m,1); x(k:m,1)=R(k:m,k);
d=-sign(x(k))*norm(x); v=x; v(k)=x(k)-d; r=norm(v);
if r~=0, w=v/r; u=2*R'*w;
R=R-w*u'; %Product PR
Q=Q-2*Q*w*w'; %Product QP
end
end
Zero=A-Q*R
Identity=Q*Q'
■ Gram-Schmidt and QR: There are several ways to compute the QR factorization of a
matrix. Householder's method can be used to compute both types of QR factorizations
(i.e. the standard 𝑄𝑅 and the economy size 𝑄𝑅). On the other hand, the classical Gram-
Schmidt (CGS) and the modified Gram-Schmidt (MGS) compute 𝑸 ∈ ℝ𝑛×𝑛 and 𝑹 ∈ ℝ𝑛×𝑛
such that 𝑨 = 𝑸𝑹. The MGS has better numerical properties than the CGS. We will not
discuss them in detail here. The readers are referred to the book Biswa Datta (1995, pp.
339-343) and Bekhiti B (2020, pp. 104-108).

We now discuss two alternative methods that can be used to compute the thin 𝑄𝑅 factorization 𝑨 = 𝑸𝑹 = [𝑸1 𝑸2][𝑹1 ; 𝟎] = 𝑸1𝑹1 directly.

Let us define 𝑨 = [𝒂1 … 𝒂𝑘 … 𝒂𝑛] and 𝑸 = [𝒒1 𝒒2 … 𝒒𝑛], where 𝒂𝑖 and 𝒒𝑖 are the columns of 𝑨 and 𝑸 respectively. Comparing the 𝑘th columns in 𝑨 = 𝑸𝑹 we conclude that

𝑨 = 𝑸𝑹 ⇔ 𝒂𝑘 = ∑_{𝑖=1}^{𝑘} 𝑟𝑖𝑘 𝒒𝑖 ∈ span{𝒒1, 𝒒2, … , 𝒒𝑘}

If 𝑟𝑎𝑛𝑘(𝑨) = 𝑛, then this equation can be solved for 𝒒𝑘 :

𝒒𝑘 = (𝒂𝑘 − ∑_{𝑖=1}^{𝑘−1} 𝑟𝑖𝑘 𝒒𝑖) / 𝑟𝑘𝑘

Thus, we can think of 𝒒𝑘 as a unit vector in the direction of 𝒛𝑘 = 𝒂𝑘 − ∑_{𝑖=1}^{𝑘−1} 𝑟𝑖𝑘 𝒒𝑖. We choose 𝑟𝑖𝑘 = 𝒒𝑖^T 𝒂𝑘 for 𝑖 = 1,2, … , 𝑘 − 1 and 𝑟𝑘𝑘 = ‖𝒛𝑘‖. This leads to the classical Gram-Schmidt (CGS) algorithm for computing 𝑨 = 𝑸1𝑹1.

% Classical Gram-Schmidt algorithm


clear all, clc, A =10*rand(4,4); [m,n]=size(A); R=zeros(n); Q=A;
for k=1:n,
for i=1:k-1,
R(i,k)=Q(:,i)'*Q(:,k);
end
for i=1:k-1,
Q(:,k)=Q(:,k)-R(i,k)*Q(:,i);
end
R(k,k)=norm(Q(:,k)); Q(:,k)=Q(:,k)/R(k,k);
end
R, Q,

Unfortunately, the CGS method has very poor numerical properties in that there is typically a severe loss of orthogonality among the computed 𝒒𝑖. Interestingly, a rearrangement of the calculation, known as modified Gram-Schmidt (MGS), yields a much sounder computational procedure. In the 𝑘th step of MGS, the 𝑘th column of 𝑸 (denoted by 𝒒𝑘) and the 𝑘th row of 𝑹 (denoted by 𝒓𝑘^T) are determined (see Gene H. Golub and Charles F. Van Loan 1996).
% Modified Gram-Schmidt Orthogonalization

clear all, clc, A =10*rand(4,4); [m,n]=size(A); R=zeros(n); Q=A;

for k=1:n,
for i=1:k-1,
R(i,k)=Q(:,i)'*Q(:,k);
Q(:,k)=Q(:,k)-R(i,k)*Q(:,i);
end
R(k,k)=norm(Q(:,k)); Q(:,k)=Q(:,k)/R(k,k);
end

R, Q,

■ Un-shifted QR algorithm and Eigenvalues: The triangular decomposition is based


on the fact that virtually any matrix can be factorized into a product of lower and upper
triangular matrices. Thus if we denote the original matrix 𝑯 as 𝑯1 we can write

𝑯1 = 𝑳1 𝑹1

And the upper triangular matrix can be expressed as 𝑹1 = 𝑳1−1 𝑯1. If we now multiply
this equation on the right by 𝑳1 we get 𝑹1 𝑳1 = 𝑳1−1 𝑯1 𝑳1

In other words, the reverse multiplication 𝑹1 𝑳1 is a similarity transformation of 𝑯1 and


thus preserve the eigenvalues of 𝑯1 . Now let we define a new matrix 𝑯2 = 𝑹1 𝑳1 and
decompose as was done with 𝑯1 : 𝑯2 = 𝑳2 𝑹2

And then compute 𝑯3: 𝑯3 = 𝑹2𝑳2 = 𝑳3𝑹3. Continuing the process, 𝑯𝑘 = 𝑳𝑘^{−1} 𝑯𝑘−1 𝑳𝑘. Under certain conditions on the original matrix 𝑯, as 𝑘 → ∞ the matrix 𝑯𝑘 approaches an upper triangular matrix with the eigenvalues in decreasing order of magnitude on the main diagonal.

𝑯𝑘 = 𝑳𝑘^{−1} 𝑯𝑘−1 𝑳𝑘 ⟺ 𝑯𝑘 = (𝑳𝑘 … 𝑳2𝑳1)^{−1} 𝑯1 (𝑳𝑘 … 𝑳2𝑳1)

When 𝑘 → ∞ we have 𝑯𝑘 = 𝑯𝑘−1 = constant, which means lim_{𝑘→∞} 𝑳𝑘 = 𝑰, or equivalently

lim_{𝑘→∞} 𝑯𝑘 = lim_{𝑘→∞} 𝑳𝑘 𝑹𝑘 = 𝑹𝑘 = constant

Finally we deduce that 𝑯𝑘 will tend to an upper triangular matrix, and from linear
algebra we know that the eigenvalues of an upper triangular matrix are the elements of
the main diagonal. The eigenvalues of 𝑯1 appear as the diagonal terms of this upper-
triangular matrix because 𝑯𝑘 and 𝑯1 are similar to each other.

We use the 𝑄𝑅-decomposition and 𝐿𝑅-decomposition (Nowadays it is known as 𝐿𝑈) to


obtain simultaneously the complete set of eigenvalues of a matrix. The method is
iterative and builds an upper-triangular matrix. The eigenvalues appear as the diagonal
terms of this upper-triangular matrix. These values are found to be in agreement with
those given by the MATLAB built-in function eig. This algorithm does not work in the case of complex eigenvalues (J. Stoer, R. Bulirsch 1991, pp. 330).
John Francis' idea in 1961 for computing the eigenvalues of 𝑨 is (without any bells or
whistles) surprisingly simple. We will refer to this as the unshifted 𝑄𝑅 algorithm. It
looks like this:

Set 𝑨1 = 𝑨,
for 𝑘 = 1,2, . .. (until convergence)
Compute 𝑨𝑘 = 𝑸𝑘 𝑹𝑘
Set 𝑨𝑘+1 = 𝑹𝑘 𝑸𝑘
end

That is, compute the 𝑄𝑅 factorization of 𝑨, then reverse the factors, then compute the
𝑄𝑅 factorization of the result, before reversing the factors, and so on.

It turns out that the sequence 𝑨1 , 𝑨2 ,…, have the same eigenvalues and for any large
integer 𝑘 the matrix 𝑨𝑘 is usually close to being upper-triangular. Since the eigenvalues
of an upper-triangular matrix lie on its diagonal, the iteration above will allow us to
read off the eigenvalues of A from the diagonal entries of 𝑨𝑘 . Once we have the
eigenvalues, the eigenvectors can be computed, for example, by an inverse power
iteration.
clear all, clc, A =rand(100,100); M=A;
% n=7; C=diag(ones(n-1,1),1); A=4*diag(ones(n,1))+C+C'; M=A
% u = [1 -15 85 -225 274 -120]; A = compan(u); [m,n]=size(A); M=A;
for i=1:500;
[Q,R] = qr (M); % [Q,R] = lu (M);
M=R*Q;
end
R; eig(A) ;
spy(abs(M)>1e-4),
Warning: The sequence 𝑨1, 𝑨2, …, computed by the unshifted QR algorithm above does not always converge. For example, consider the matrix 𝑨 = 𝑨1 = [0 1; 1 0]. In this example, 𝑨𝑘 = 𝑨1 for all 𝑘 and the unshifted 𝑄𝑅 algorithm stagnates. Below, we will fix that with Wilkinson shifts. (Note that Rayleigh quotient shifts do not fix this example.)
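A quick sketch confirming the stagnation:

A = [0 1; 1 0];
for k = 1:5, [Q,R] = qr(A); A = R*Q; end
A   % still anti-diagonal: the eigenvalues +1 and -1 are never revealed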

■ Reduction of a Real Matrix to Upper Hessenberg: If the constraints of a linear


algebra problem do not allow a general matrix to be conveniently reduced to a
triangular one, reduction to Hessenberg form is often the next best thing. In fact,
reduction of any matrix to a Hessenberg form can be achieved in a finite number of
steps (for example, through Householder's algorithm of unitary similarity transforms).
Subsequent reduction of Hessenberg matrix to a triangular matrix can be achieved
through iterative procedures, such as shifted QR-factorization. In eigenvalue
algorithms, the Hessenberg matrix can be further reduced to a triangular matrix
through Shifted QR-factorization combined with deflation steps.

It is well known how any matrix can be transformed into an upper Hessenberg form
𝑨𝐻 = 𝑸𝑻 𝑨 𝑸 (i.e. An upper Hessenberg matrix is also called an almost upper triangular
matrix) by an orthogonal similarity transformation. To find all the eigenvalues of the
original matrix, then the 𝑄𝑅 algorithm can be applied to this upper Hessenberg matrix.

              (⋆  ⋆  ⋆  ⋯  ⋆)
              (⋆  ⋆  ⋆  ⋯  ⋆)
𝑨𝐻 = 𝑸^T𝑨𝑸 =  (   ⋆  ⋆  ⋯  ⋆)
              (      ⋱  ⋱  ⋮)
              (         ⋆  ⋆)

Example:
clear all, clc, A =rand(8,8);
H = hess(A)
spy(abs(H)>1e-4),

H =
0.1150 -0.9702 -0.7154 0.2833 0.1072 -0.0011 0.3260 0.3508
-1.7483 2.8283 1.2499 -0.0915 -0.1518 0.0410 0.4229 0.4616
0 1.0766 0.7553 -0.3348 -0.6033 0.2523 0.2128 0.3057
0 0 0.5410 -0.1267 -0.1155 -0.1975 0.0631 0.1153
0 0 0 -0.2507 0.1048 -0.0259 -0.4680 -0.0875
0 0 0 0 0.4812 0.6869 0.1871 0.0145
0 0 0 0 0 -0.3801 -0.1686 -0.0030
0 0 0 0 0 0 -0.2471 0.0837

Remark: A Hessenberg matrix consists of an upper triangular form with one additional band of elements adjacent to the main diagonal, and no more.
The Householder reduction to Hessenberg form
H=Q'*A*Q

clear all, clc, A =10*rand(8,8); [m,n]=size(A); Q=eye(n); M=A;


for k = 1:n-2
v = A(k+1:n,k);
alpha = -norm(v);
if (v(1)<0) alpha = -alpha; end
v(1) = v(1) - alpha;
v = v/norm(v);
A(k+1:n,k+1:n) = A(k+1:n,k+1:n) - 2*v*(v.'*A(k+1:n,k+1:n));
A(k+1,k) = alpha; A(k+2:n,k) = 0;
A(1:n,k+1:n) = A(1:n,k+1:n) - 2*(A(1:n,k+1:n)*v)*v.';
Q(1:n,(k+1):n)=Q(1:n,(k+1):n)-2*((Q(1:n,(k+1):n))*v)*(v.');
end
H=A, Q,
eig(A), eig(M),

Again: A matrix structure that is close to upper triangular form and that is preserved
by the QR algorithm is the Hessenberg form.

Now we are ready to compute simultaneously the eigenvalues of the matrix 𝑨 by


applying the QR algorithm on the obtained Hessenberg matrix. After this our faster QR
algorithm with look like this:

Hessenberg-QR algorithm:

Set 𝑨1 = 𝑸𝐻^T 𝑨 𝑸𝐻 = upper-Hessenberg
for 𝑘 = 1,2, … (until convergence)
    Compute 𝑨𝑘 = 𝑸𝑘 𝑹𝑘
    Set 𝑨𝑘+1 = 𝑹𝑘 𝑸𝑘
end

MATLAB code for Hessenberg-QR:

for i=1:500;
    [Q,R] = qr(H);
    H=R*Q;
end
R
spy(abs(H)),

We have seen how the QR algorithm for computing the Schur form of a matrix A can be
executed more economically if the matrix 𝑨 is first transformed to Hessenberg form.
Now we want to show how the convergence of the Hessenberg QR algorithm can be
improved dramatically by introducing (spectral) shifts into the algorithm.

■ Shifted QR algorithm: The QR algorithm computes the real Schur form of a matrix,
a canonical form that displays eigenvalues but not eigenvectors. Consequently,
additional computations usually must be performed if information regarding invariant
subspaces is desired. The convergence of QR algorithm depends on the ratio between
the eigenvalues 𝜆1/𝜆2, so in order to accelerate the rate of convergence James Wilkinson introduced a shift factor 𝜇 such that 𝛿 = (𝜆1 − 𝜇)/(𝜆2 − 𝜇) ≤ (𝜆1/𝜆2). Oversimplifying: shifts are used to obtain faster convergence; in other words, if 𝜆1, 𝜆2 are two eigenvalues and you shift by 𝜇, then the magic ratio is 𝛿 = (𝜆1 − 𝜇)/(𝜆2 − 𝜇) and the convergence goes faster whenever 𝜇 is close to 𝜆1. If we define a new matrix 𝑯𝑘 = 𝑨𝑘 − 𝜇𝑘𝑰 then the eigenvalues of 𝑯𝑘 are 𝜆𝑖 − 𝜇𝑘 and the ratio becomes (𝜆𝑘 − 𝜇𝑘)/(𝜆𝑘−1 − 𝜇𝑘−1); the algorithm is as follows

Set 𝑨1 = 𝑸𝑇𝐻 𝑨𝑸𝐻 = upper-Hessenberg,


for 𝑘 = 1,2, … (until convergence)
Compute a shift 𝜇𝑘 = 𝑨𝑘 (𝑛, 𝑛)
Compute (𝑨𝑘 − 𝜇𝑘 𝑰) = 𝑸𝑘 𝑹𝑘
Set 𝑨𝑘+1 = 𝑹𝑘 𝑸𝑘 + 𝜇𝑘 𝑰
end

One can still check that this algorithm preserves the upper-Hessenberg structure and
produces a sequence of similar matrices. The idea of the shift is to quickly make
𝑨𝑘 (𝑛, 𝑛 − 1) converge to zero. A reasonable choice of the shift is the Rayleigh quotient,
where 𝜇𝑘 = 𝑨𝑘 (𝑛, 𝑛), because we would like 𝜇𝑘 to be an estimate for eigenvalue of 𝑨.

An even better approximation to the eigenvalue, which is known as Wilkinson's shift, is obtained by considering the last 2 × 2 block: Wilkinson's shift is defined to be the eigenvalue of the matrix [𝑎𝑛−1 ⋆ ; 𝜀 𝑎𝑛] that is closer to 𝑎𝑛.

% QR algorithm with Wilkinson shift for eigenvalues.


% Input: square matrix A
% Output: gamma sequence of approximations to one eigenvalue (vector)
clear all, clc, m=5; n=5 ; A = triu(rand(m,n),-1); L=eig(A)
for k = length(A):-1:1
% QR iteration
s = 0; I = eye(k);
while sum( abs(A(k,1:k-1)) ) > eps
[Q,R] = qr(A-s*eye(k));
A = R*Q + s*eye(k);
b = -(A(k-1,k-1)+ A(k,k));
c = A(k-1,k-1)*A(k,k) - A(k-1,k)*A(k,k-1);
mu = roots([1 b c]);
[temp,j] = min(abs(mu-A(k,k)));
s = mu(j);
end
% Deflation
d(k) = A(k,k);
A = A(1:k-1,1:k-1);
end
d' % all eigenvalue
% Alternative of Hessenberg QR algorithm with Rayleigh quotient shift

% HESS1 Reduce a matrix to Hessenberg form.


% Input: A is n-by-n matrix % Output: H is m*n upper Hessenberg form

clear all, clc, A=randi(10,4); A=0.5*(A'+A); [n,n]=size(A); L=eig(A)


H = A;
for k = 1:n-1
Q = eye(n);
z = H(k+1:n,k);
v = [-sign(z(1))*norm(z)-z(1); -z(2:end) ];
P = eye(n-k) - 2*(v*v')/(v'*v); % HH reflection
Q(k+1:n,k+1:n) = P;
H = Q*H*Q';
end
H, H = triu(H,-1) % enforce exact triangularity
%----------------------------------------------------------%
% QR algorithm with shifts after applying Hessenberg reduction

A= H;
for n = length(A):-1:1
% QR iteration
while sum(abs(A(n,1:n-1)))>eps
s = A(n,n);
[Q,R] = qr(A-s*eye(n));
A = R*Q + s*eye(n);
end
% Deflation
d(n) = A(n,n);
A = A(1:n-1,1:n-1);
end
d = sort(d,'descend') % all eigenvalues in decreasing order

In numerical linear algebra, the Arnoldi iteration is an


eigenvalue algorithm and an important example of an iterative method. Arnoldi finds an
approximation to the eigenvalues and eigenvectors of general (possibly non-Hermitian)
matrices by constructing an orthonormal basis of the Krylov subspace, which makes it
particularly useful when dealing with large sparse matrices.

The Arnoldi method belongs to a class of linear algebra algorithms that give a partial
result after a small number of iterations, in contrast to so-called direct methods which
must complete to give any useful results (see for example, Householder transformation).
The partial result in this case being the first few vectors of the basis the algorithm is
building. When applied to Hermitian matrices it reduces to the Lanczos algorithm. The
Arnoldi iteration was invented by W. E. Arnoldi in 1951.
One way to extend the Lanczos process to unsymmetric matrices is due to Arnoldi
(1951) and revolves around the Hessenberg reduction 𝑸𝑇 𝑨𝑸 = 𝑯. In particular, if
𝑸 = [𝒒1 , … , 𝒒𝑛 ] and we compare columns in 𝑨𝑸 = 𝑸𝑯 , then
𝑨𝒒𝑘 = ∑_{𝑖=1}^{𝑘+1} ℎ𝑖𝑘 𝒒𝑖 ,   1 ≤ 𝑘 ≤ 𝑛 − 1

Isolating the last term in the summation gives


ℎ(𝑘 + 1, 𝑘) 𝒒𝑘+1 = 𝑨𝒒𝑘 − ∑_{𝑖=1}^{𝑘} ℎ𝑖𝑘 𝒒𝑖 ≜ 𝒓𝑘   (residual)

where ℎ𝑖𝑘 = 𝒒𝑖 𝑇 𝑨𝒒𝑘 for 𝑖 = 1: 𝑘. It follows that if 𝒓𝑘 ≠ 0, then 𝒒𝑘+1 is specified by

𝒒𝑘+1 = 𝒓𝑘 /ℎ(𝑘 + 1, 𝑘)

where ℎ(𝑘 + 1, 𝑘) = ‖𝒓𝑘 ‖2 . These equations define the Arnoldi process and in strict
analogy to the symmetric Lanczos process we obtain:

𝑟0 = 𝑞1 , ℎ10 = 1, 𝑘 = 0,
while (if ℎ(𝑘 + 1, 𝑘) ≠ 0)
𝒒𝑘+1 = 𝒓𝑘 /ℎ(𝑘 + 1, 𝑘)
𝑘 = 𝑘 + 1
𝒓𝑘 = 𝑨𝒒𝑘
for 𝑖 = 1: 𝑘
ℎ𝑖𝑘 = 𝒒𝑖 𝑇 𝒓𝑘
𝒓𝑘 = 𝒓𝑘 − ℎ𝑖𝑘 𝒒𝑖
end
ℎ(𝑘 + 1, 𝑘) = ‖𝒓𝑘 ‖2
end
% Arnoldi iteration for Krylov subspaces.
% Input: A is a square matrix (n by n)
% u is initial vector and m is the number of iterations
% Output: Q orthonormal basis of Krylov space (n by m+1)
% H upper Hessenberg matrix, A*Q(:,1:m)=Q*H(m+1 by m)
clear all, clc, A=randi(10,4); n=length(A); m=n-1; Q=zeros(n,m+1);
H=zeros(m+1,m); u=rand(n,1); Q(:,1) = u/norm(u);
for j = 1:m
r = A*Q(:,j);
for i = 1:j
H(i,j) = Q(:,i)'*r;
r = r - H(i,j)*Q(:,i);
end
H(j+1,j) = norm(r);
Q(:,j+1) = r/H(j+1,j);
end
Q, triu(Q'*A*Q,-1) % enforce exact triangularity
Interpretation: How can Arnoldi's algorithm be interpreted as generating Krylov subspaces, and what is the relationship between them? From the algorithm we know that 𝒒𝑘+1 = 𝒓𝑘/ℎ(𝑘 + 1, 𝑘) and, up to the orthogonalization terms, 𝒓𝑘 = 𝑨𝒒𝑘 ⟹ 𝒒𝑘+1 = 𝛼𝑘 𝑨𝒒𝑘 with 𝛼𝑘 = 1/ℎ(𝑘 + 1, 𝑘)

⟹ 𝒒𝑘+1 = (𝛼1 𝛼2 … 𝛼𝑘) 𝑨^𝑘 𝒒1 = 𝛽𝑘 𝑨^𝑘 𝒒1

We assume that 𝒒1 is a given unit 2-norm starting vector. The 𝒒𝑘 are called the Arnoldi
vectors and they define an orthonormal basis for the Krylov subspace 𝐾𝑛 (𝑨, 𝒒1 , 𝑘):

𝐾𝑛 (𝑨, 𝒒1 , 𝑘) = span{𝒒1 , 𝒒2 , . . . , 𝒒𝑘 } = span{𝒒1 , 𝑨𝒒1 , . . . , 𝑨𝑘−1 𝒒1 }.

The situation after 𝑘 𝑡ℎ step is summarized by the 𝑘 𝑡ℎ step Arnoldi factorization

𝑨𝑸𝑘 = 𝑸𝑘 𝑯𝑘 + 𝒓𝑘 𝒆𝑇𝑘

where 𝑸𝑘 = [𝒒1 , . . . , 𝒒𝑘 ], 𝒆𝑘 = 𝑰(: , 𝑘), and


      (ℎ11   ℎ12   ⋯   ℎ1,𝑘−1      ℎ1𝑘 )
      (ℎ21   ℎ22        ℎ2,𝑘−1      ℎ2𝑘 )
𝑯𝑘 =  ( 0    ℎ32   ⋱      ⋮          ⋮  )
      ( ⋮          ⋱   ℎ𝑘−1,𝑘−1     ⋮  )
      ( 0    ⋯     0   ℎ𝑘,𝑘−1      ℎ𝑘𝑘 )

If 𝒓𝑘 = 0, then the columns of 𝑸𝑘 define an invariant subspace and 𝜆(𝑯𝑘 ) ⊂ 𝜆(𝑨).


Otherwise, the focus is on how to extract information about 𝑨′𝑠 eigensystem from the
Hessenberg matrix 𝑯𝑘 and the matrix 𝑸𝑘 of Arnoldi vectors.

The matrix 𝑯 = 𝑸𝑇 𝑨𝑸 can be interpreted as the representation in the basis {𝒒1 , . . . , 𝒒𝑛 } of


the orthogonal projection of 𝑨 onto 𝐾𝑛 .

The Arnoldi iteration has two roles ❶ the basis of many of the iterative algorithms of
numerical linear algebra ❷ find eigenvalues of non-Hermitian matrices (i.e. using QR).

In numerical analysis and scientific computing, a


sparse matrix or sparse array is a matrix in which most of the elements are zero. By
contrast, if most of the elements are nonzero, then the matrix is considered dense. The
number of zero-valued elements divided by the total number of elements (e.g., 𝑚 × 𝑛 for
an 𝑚 × 𝑛 matrix) is called the sparsity of the matrix (which is equal to 1 minus the
density of the matrix). Using those definitions, a matrix will be sparse when its sparsity
is greater than 0.5.
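For instance, a short sketch computing the sparsity of one of MATLAB's gallery matrices:

S = gallery('poisson',10);          % 100-by-100 sparse matrix
sparsity = 1 - nnz(S)/numel(S)      % about 0.95, so S is sparse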

Large sparse matrices often appear in scientific or engineering applications when


solving partial differential equations. Conceptually, sparsity corresponds to systems
with few pairwise interactions. Consider a line of balls connected by springs from one to
the next: this is a sparse system as only adjacent balls are coupled. By contrast, if the
same line of balls had springs connecting each ball to all other balls, the system would
correspond to a dense matrix (ref: Wikipedia). The concept of sparsity is useful in
combinatorics and application areas such as network theory, which have a low density
of significant data or connections. Here are some plots of the most well-known sparse matrices (see the MATLAB toolbox).

close all; clear all; clc,
A = gallery('poisson',10);      % sparse block tridiagonal matrix
[L,U,P] = lu(A);
figure; spy(L);
figure; spy(U);
figure; spy(A);
B = gallery('wathen',10,10);    % sparse random finite element matrix
[M,V,Q] = lu(B);
figure; spy(M);
figure; spy(V);
figure; spy(B);
% now try with incomplete LU (luinc)
[L,U,P] = luinc(A,1e-2);
figure; spy(L);
figure; spy(U);
[M,V,Q] = luinc(B,1e-2);
figure; spy(M);
figure; spy(V);
𝑠𝑝𝑦(𝑺) plots the sparsity pattern of matrix 𝑺. Nonzero values are colored while zero
values are white. The plot displays the number of nonzeros in the matrix, 𝑛𝑧 = 𝑛𝑛𝑧(𝑺).

𝐴 = 𝑔𝑎𝑙𝑙𝑒𝑟𝑦(′𝑝𝑜𝑖𝑠𝑠𝑜𝑛′, 𝑛) returns the sparse block tridiagonal matrix of order 𝑛2 resulting


from discretizing Poisson's equation with the 5-point operator on an n-by-n mesh.

𝐴 = 𝑔𝑎𝑙𝑙𝑒𝑟𝑦(′𝑤𝑎𝑡ℎ𝑒𝑛′, 𝑛𝑥, 𝑛𝑦) returns a sparse, random, n-by-n finite element matrix
where 𝑛 = 3 ⋆ 𝑛𝑥 ⋆ 𝑛𝑦 + 2 ⋆ 𝑛𝑥 + 2 ⋆ 𝑛𝑦 + 1. Matrix A is precisely the “consistent mass
matrix” for a regular nx-by-ny grid of 8-node (serendipity) elements in two dimensions.
A is symmetric, positive definite for any (positive) values of the “density” 𝑟ℎ𝑜(𝑛𝑥 , 𝑛𝑦 ),
which is chosen randomly. (see MATLAB toolBox Sparse Matrix Reordering)

load barbellgraph.mat
S = A + speye(size(A)); pct = 100/numel(A);
spy(S)
title('A Sparse Symmetric Matrix')
nz = nnz(S);
xlabel(sprintf('Nonzeros = %d (%.3f%%)',nz,nz*pct));
CHAPTER VII:
Introduction to Nonlinear Systems
and Numerical Optimization
(Plus Metaheuristics and Evolutionary Algorithms)
Introduction to Nonlinear Systems and
Numerical Optimization
Optimization problems arise in almost every field, where numerical
information is processed (Science, Engineering, Mathematics, Economics, Commerce,
etc.). In Science, optimization problems arise in data fitting, in variational principles,
and in the solution of differential and integral equations by expansion methods.
Engineering applications are in design problems, which usually have constraints in the
sense that variables cannot take arbitrary values. For example, while designing a bridge
an engineer will be interested in minimizing the cost, while maintaining certain
minimum strength for the structure. Even the strength of materials used will have a
finite range depending on what is available in the market. Such problems with
constraints are more difficult to handle than the simple unconstrained optimization
problems, which very often arise in scientific work. In most problems, we assume the
variables to be continuously varying, but some problems require the variables to take
discrete values (H M Antia 1995).

Mathematically speaking, optimization is the minimization or maximization of a function


subject or not to constraints on its variables (parameters). In order to solve any optimization problem numerically, nowadays there is a wide variety of algorithms at our disposal. As we already saw in the previous chapters, these algorithms start from some initial guess of the parameters, and then generate a sequence of iterates which terminates when either no more progress can be made, or when it seems that a solution point has been approximated with sufficient accuracy. The main difference between optimization algorithms is the way in which they pass from one iteration to another.

Mainly there are two different strategies for computing next iteration from the previous
one which are used most frequently in nowadays available optimization algorithms. The
first one is the line search strategy, in which the algorithm chooses a direction 𝒅𝑘 and then searches along this direction for a lower function value. The second one is called
the trust region strategy in which the information gathered about the objective function
is used to construct a model function whose behavior near the current iterate is trusted
to be similar enough to the actual function. Then the algorithm searches for the
minimizer of the model function inside the trust region.

Although most optimization problems require the global minimum to be found, most of the methods that we are going to describe here will only find a local minimum. The function has a local minimum at a point where it assumes the lowest value in a small neighborhood of the point which is not at the boundary of that neighborhood. To find a global minimum we normally try several widely differing starting points and take the best of the resulting local minima. In this chapter, we consider methods for minimizing or maximizing a function of several variables, that is, finding those values of the coordinates for which the function takes on the minimum or the maximum value.

Definition A continuous function 𝑓: ℝ𝑛 ⟶ ℝ is said to be continuously differentiable at 𝐱 ∈ ℝ𝑛 if (𝜕𝑓/𝜕𝑥𝑖)(𝐱) exists and is continuous, 𝑖 = 1, . . . , 𝑛; the gradient of 𝑓 at 𝐱 is then defined as

∇𝑓(𝐱) = [𝜕𝑓/𝜕𝑥1   𝜕𝑓/𝜕𝑥2   …   𝜕𝑓/𝜕𝑥𝑛]^T

The function 𝑓 is said to be continuously differentiable in an open region 𝑫 ⊂ ℝ𝑛, denoted 𝑓 ∈ 𝐶¹(𝑫), if it is continuously differentiable at every point in 𝑫.

Lemma Let 𝑓: ℝ𝑛 ⟶ ℝ be continuously differentiable in an open convex set 𝑫 ⊂ ℝ𝑛. Then, for 𝐱 ∈ 𝑫 and any nonzero perturbation 𝒑 ∈ ℝ𝑛, the directional derivative of 𝑓 at 𝐱 in the direction of 𝒑 is defined by

𝐷𝒑𝑓(𝐱) = (𝜕𝑓/𝜕𝐱) ⋅ 𝒑 = lim_{𝜀→0} (𝑓(𝐱 + 𝜀𝒑) − 𝑓(𝐱))/𝜀 = (∇𝑓(𝐱))^T 𝒑

For any 𝐱, 𝐱 + 𝒑 ∈ 𝑫,

𝑓(𝐱 + 𝒑) − 𝑓(𝐱) = ∫_0^1 (∇𝑓(𝐱 + 𝑡𝒑))^T 𝒑 𝑑𝑡 = ∫_𝐱^{𝐱+𝒑} (∇𝑓(𝐳))^T 𝑑𝐳

and there exists 𝒛 ∈ (𝐱, 𝐱 + 𝒑) such that 𝑓(𝐱 + 𝒑) − 𝑓(𝐱) = (∇𝑓(𝐳))^T 𝒑

Example: Let 𝑓: ℝ² ⟶ ℝ, 𝑓(𝐱) = 𝑥1² − 2𝑥1 + 3𝑥1𝑥2² + 4𝑥2³, 𝐱𝑐 = (1, 1)^T, 𝒑 = (−2, 1)^T. Then

∇𝑓(𝐱) = (2𝑥1 − 2 + 3𝑥2² ; 6𝑥1𝑥2 + 12𝑥2²)

𝑓(𝐱𝑐) = 6, 𝑓(𝐱𝑐 + 𝒑) = 23, ∇𝑓(𝐱𝑐) = (3, 18)^T

If we let 𝑔(𝑡) = 𝑓(𝐱𝑐 + 𝑡𝒑) = 𝑓(1 − 2𝑡, 1 + 𝑡) = 6 + 12𝑡 + 7𝑡² − 2𝑡³, the reader can verify that 𝑓(𝐱 + 𝒑) − 𝑓(𝐱) = (∇𝑓(𝐳))^T 𝒑 is true for 𝐳 = 𝐱𝑐 + 𝑡𝒑 with 𝑡 = (7 − √19)/6 ≈ 0.44
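A short numeric sanity check of this example (a sketch reusing the function, point, and direction above):

f = @(x) x(1)^2 - 2*x(1) + 3*x(1)*x(2)^2 + 4*x(2)^3;
g = @(x) [2*x(1) - 2 + 3*x(2)^2; 6*x(1)*x(2) + 12*x(2)^2];   % gradient
xc = [1;1]; p = [-2;1]; t = (7 - sqrt(19))/6;
f(xc + p) - f(xc)        % = 17
g(xc + t*p)'*p           % also 17: mean value theorem at z = xc + t*p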
Example: Computing directional derivatives.
Let 𝑧 = 14 − 𝑥 2 − 𝑦 2 and let 𝑃 = (1,2). Find
the directional derivative of f, at 𝑃, in the
following directions:

1. toward the point Q=(3,4) ,


2. in the direction of ⟨2,−1⟩, and
3. toward the origin.

■ The surface is plotted in the figure above, where the point 𝑃 = (1,2) is indicated in the 𝑥𝑦-plane as well as the point (1,2,9) which lies on the surface of 𝑓. We find that

𝜕𝑓/𝜕𝑥 |_{(1,2)} = −2𝑥|_{𝑥=1} = −2 ,   𝜕𝑓/𝜕𝑦 |_{(1,2)} = −2𝑦|_{𝑦=2} = −4

Let 𝑢1 be the unit vector that points from the point (1,2) to the point 𝑄 = (3,4), as shown in the figure. The vector 𝑃𝑄 = ⟨2,2⟩; the unit vector in this direction is 𝑢1 = ⟨1/√2, 1/√2⟩. Thus the directional derivative of 𝑓 at (1,2) in the direction of 𝑢1 is

𝐷𝑢1𝑓(𝐱) = (∇𝑓(𝐱))^T 𝑢1 = (−2  −4)(1/√2 ; 1/√2) = −3√2 ≅ −4.24

Thus the instantaneous rate of change in moving from the point (1,2,9) on the surface in the direction of 𝑢1 (which points toward the point 𝑄) is about −4.24. Moving in this direction moves one steeply downward.

■ We seek the directional derivative in the direction of ⟨2, −1⟩. The unit vector in this direction is 𝑢2 = ⟨2/√5, −1/√5⟩. Thus the directional derivative of 𝑓 at (1,2) in the direction of 𝑢2 is 𝐷𝑢2𝑓(𝐱) = (∇𝑓(𝐱))^T 𝑢2 = 0. Starting on the surface of 𝑓 at (1,2) and moving in the direction of ⟨2, −1⟩ (or 𝑢2) results in no instantaneous change in 𝑧-value.

■ At 𝑃 = (1,2), the direction towards the origin is given by the vector ⟨−1, −2⟩; the unit vector in this direction is 𝑢3 = ⟨−1/√5, −2/√5⟩. The directional derivative of 𝑓 at 𝑃 in the direction of the origin is 𝐷𝑢3𝑓(𝐱) = (∇𝑓(𝐱))^T 𝑢3 = 10/√5 ≅ 4.47. Moving towards the origin means "walking uphill" quite steeply, with an initial slope of about 4.47.

Note: The symbol "∇" is named "nabla,'' derived from the Greek name of a Jewish harp.
Oddly enough, in mathematics the expression ∇f is pronounced "del f.''

a. ∇(𝑓g) = 𝑓∇g + g∇𝑓
b. ∇(𝑓/g) = (g∇𝑓 − 𝑓∇g)/g²
c. ∇((𝑓(𝑥, 𝑦))^𝑛) = 𝑛 𝑓(𝑥, 𝑦)^{𝑛−1} ∇𝑓
Gradients and Level Curves: In this section, we use the gradient and the chain rule to
investigate horizontal and vertical slices of a surface of the form 𝑧 = g( 𝑥, 𝑦). To begin
with, if 𝑘 is constant, then g(𝑥, 𝑦) = 𝑘 is called the level curve of g( 𝑥, 𝑦) of level k and is
the intersection of the horizontal plane z = k and the surface 𝑧 = g( 𝑥, 𝑦). In particular,
g(𝑥, 𝑦) = 𝑘 is a curve in the xy-plane.

The gradient vectors are perpendicular to the level sets, so they always point in the direction of steepest change from one level set toward another. But how would you represent that? The answer is the concept of gradient flow: the gradient, the level sets, and the direction of motion fit together to describe flow along a surface, much like a liquid or a rolling object.
Theorem Consider a function 𝑓: ℝ𝑛 ⟶ ℝ, and suppose 𝑓 is of class 𝐶¹. For some constant 𝑐, consider the level set 𝑆 = {𝒙 ∈ ℝ𝑛 : 𝑓(𝒙) = 𝑐}. Then, for any point 𝒙0 in 𝑆, the gradient ∇𝑓(𝒙0) is perpendicular to 𝑆.

Proof: We need to show that any vector 𝒂 which is tangent to 𝑆 at 𝒙0 is perpendicular to ∇𝑓(𝒙0). If 𝒂 is tangent to 𝑆, we can find a parametrized curve 𝒙(𝑡) lying in 𝑆 such that 𝒙0 = 𝒙(𝑡0) and 𝒙′(𝑡0) = 𝒂. We will show that ∇𝑓(𝒙0) is perpendicular to 𝒂 = 𝒙′(𝑡0).

By the definition of 𝑆, and since 𝒙(𝑡) lies in 𝑆, 𝑓(𝒙(𝑡)) = 𝑐 for all 𝑡. Differentiating both sides of this identity, and using the chain rule on the left side, we obtain ∇𝑓(𝒙(𝑡)) ⋅ 𝒙′(𝑡) = 0.

Plugging in 𝑡 = 𝑡0, this gives us ∇𝑓(𝒙(𝑡0)) ⋅ 𝒙′(𝑡0) = 0, which we can rewrite as

∇𝑓(𝒙0) ⋅ 𝒙′(𝑡0) = 0 ⟺ ∇𝑓(𝒙0) ⊥ 𝒙′(𝑡0) ⟺ ∇𝑓(𝒙0) ⊥ 𝒂

Thus, we have shown that ∇𝑓(𝒙0) is perpendicular to the level set 𝑆. ■
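A tiny numeric illustration of the theorem, using the assumed example 𝑓(𝑥, 𝑦) = 𝑥² + 𝑦²:

% On the level circle x^2 + y^2 = 25 through (3,4), a tangent direction
% is (-4,3) while the gradient is (2x,2y) = (6,8); they are orthogonal.
grad = [6; 8]; tangent = [-4; 3];
dot(grad, tangent)    % = 0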

Definition A continuously differentiable function 𝑓: ℝ𝑛 ⟶ ℝ is said to be twice continuously differentiable at 𝐱 ∈ ℝ𝑛 if (𝜕²𝑓/𝜕𝑥𝑖𝜕𝑥𝑗)(𝐱) exists and is continuous, 1 ≤ 𝑖, 𝑗 ≤ 𝑛; the Hessian of 𝑓 at 𝐱 is then defined as the 𝑛 × 𝑛 matrix whose (𝑖, 𝑗) element is

∇²𝑓(𝐱)ᵢⱼ = 𝜕²𝑓(𝐱)/𝜕𝑥𝑖𝜕𝑥𝑗 ,   1 ≤ 𝑖, 𝑗 ≤ 𝑛

The function 𝑓 is said to be twice continuously differentiable in an open region 𝑫 ⊂ ℝ𝑛, denoted 𝑓 ∈ 𝐶²(𝑫), if it is twice continuously differentiable at every point in 𝑫.

Lemma Let the function 𝑓: ℝ𝑛 ⟶ ℝ be twice continuously differentiable in an open convex set 𝑫 ⊂ ℝ𝑛. Then, for 𝐱 ∈ 𝑫 and any nonzero perturbation 𝒑 ∈ ℝ𝑛, the second directional derivative of 𝑓 at 𝐱 in the direction of 𝒑 is defined by

𝐷²𝒑𝒑𝑓(𝐱) = lim_{𝜀→0} (𝐷𝒑𝑓(𝐱 + 𝜀𝒑) − 𝐷𝒑𝑓(𝐱))/𝜀 = 𝒑^T(∇²𝑓(𝐱))𝒑 = 𝒑^T 𝑯(𝐱)𝒑

For any 𝐱, 𝐱 + 𝒑 ∈ 𝑫, there exists 𝒛 ∈ (𝐱, 𝐱 + 𝒑), i.e. 𝒛 = 𝐱 + 𝑡𝒑 with 𝑡 ∈ (0,1), such that

𝑓(𝐱 + 𝒑) − 𝑓(𝐱) = (∇𝑓(𝐱))^T 𝒑 + ½ 𝒑^T(∇²𝑓(𝒛))𝒑 = (∇𝑓(𝐱))^T 𝒑 + ½ 𝒑^T 𝑯(𝐱 + 𝑡𝒑)𝒑
Remark: The Hessian 𝑯(𝐳) is always symmetric as long as 𝑓 is twice continuously differentiable. This is the reason we were interested in symmetric matrices in previous chapters.

(𝑯(𝐱 + 𝑡𝒉)𝒉) ⋅ 𝒉 = ⟨𝑯(𝐱 + 𝑡𝒉)𝒉, 𝒉⟩ = (𝑯(𝐱 + 𝑡𝒉)𝒉)^T 𝒉 = 𝒉^T 𝑯(𝐱 + 𝑡𝒉) 𝒉

⟹ 𝑓(𝐱 + 𝒉) = 𝑓(𝐱) + ∇𝑓(𝐱) ⋅ 𝒉 + ½ (𝑯(𝐱 + 𝑡𝒉)𝒉) ⋅ 𝒉

Example Let 𝑓, 𝐱𝑐, and 𝒑 be given by

𝑓(𝐱) = 𝑥1² − 2𝑥1 + 3𝑥1𝑥2² + 4𝑥2³ ,   𝐱𝑐 = (1, 1)^T ,   𝒑 = (−2, 1)^T

Then

∇²𝑓(𝐱) = (2   6𝑥2 ; 6𝑥2   6𝑥1 + 24𝑥2) ⟹ ∇²𝑓(𝐱𝑐) = (2  6 ; 6  30)

The reader can verify that

𝑓(𝐱 + 𝒑) − 𝑓(𝐱) = (∇𝑓(𝐱))^T 𝒑 + ½ 𝒑^T 𝑯(𝐱 + 𝑡𝒑)𝒑

is true for 𝒛 = 𝐱𝑐 + 𝑡𝒑, 𝑡 = 1/3.

The lemma suggests that we might model the function 𝑓 around a point 𝐱𝑐 by the quadratic model 𝑚(𝐱𝑐 + 𝒑) = 𝑓(𝐱𝑐) + (∇𝑓(𝐱𝑐))^T 𝒑 + ½ 𝒑^T 𝑯(𝐱𝑐)𝒑, and this is precisely what we will do. In fact, it shows that the error in this model is given by

𝜀 = 𝑓(𝐱𝑐 + 𝒑) − 𝑚(𝐱𝑐 + 𝒑) = ½ 𝒑^T(𝑯(𝒛) − 𝑯(𝐱𝑐))𝒑

for some 𝒛 ∈ (𝐱𝑐, 𝐱𝑐 + 𝒑).

Vector-valued functions: Now let us proceed to the less simple case of 𝐹: ℝ𝑛 ⟶ ℝ𝑚, where 𝑚 = 𝑛 in the nonlinear simultaneous equations problem and 𝑚 > 𝑛 in the nonlinear least-squares problem. It will be convenient to have the special notation 𝒆𝑖^T for the 𝑖th row of the identity matrix (there should be no confusion with the natural log base). Since the value of the 𝑖th component function of 𝐹 can be written 𝑓𝑖(𝐱) = 𝒆𝑖^T 𝐹(𝐱), consistency with the product rule makes it necessary that 𝑓𝑖′(𝐱) = 𝒆𝑖^T 𝐹′(𝐱), the 𝑖th row of 𝐹′(𝐱). Thus 𝐹′(𝐱) must be an 𝑚 × 𝑛 matrix whose 𝑖th row is ∇𝑓𝑖(𝐱)^T. The following definition makes this official.

Definition A continuous function 𝐹: ℝ𝑛 ⟶ ℝ𝑚 is continuously differentiable at


𝐱 ∈ ℝ𝑛 if each component function 𝑓𝑖 , 𝑖 = 1, … , 𝑚 is continuously differentiable at 𝐱. The
derivative of 𝐹 at 𝐱 is sometimes called the Jacobian (matrix) of 𝐹 at 𝐱, and its transpose
is sometimes called the gradient of 𝐹 at 𝐱. The common notations are:
𝐹′(𝐱) ∈ ℝ^{𝑚×𝑛} ,   𝐹′(𝐱) = 𝑱(𝐱) = ∇𝐹(𝐱)^T   with   𝐹′(𝐱)ᵢⱼ = 𝜕𝑓𝑖(𝐱)/𝜕𝑥𝑗
𝐹 is said to be continuously differentiable in an open set 𝑫 ⊂ ℝ𝑛 , denoted 𝐹 ∈ 𝐶 1 (𝐷), if F
is continuously differentiable at every point in 𝑫.
Example Let 𝐹: ℝ² ⟶ ℝ², 𝑓1 = 𝑒^{𝑥1} − 𝑥2, 𝑓2 = 𝑥1² − 2𝑥2. Then

𝑱(𝐱) = ∇𝐹(𝐱)^T = (𝑒^{𝑥1}   −1 ; 2𝑥1   −2)

For the remainder of this chapter we will denote the Jacobian matrix of 𝐹 at 𝐱 by 𝑱(𝐱).
Also, we will often speak of the Jacobian of 𝐹 rather than the Jacobian matrix of 𝐹 at 𝐱.

Remark: The Jacobian of a vector-valued function in several variables generalizes the


gradient of a scalar-valued function in several variables, which in turn generalizes the
derivative of a scalar-valued function of a single variable. In other words, the Jacobian
matrix of a scalar-valued function in several variables is (the transpose of) its gradient
and the gradient of a scalar-valued function of a single variable is its derivative.

An important fact: Now comes the big difference from real-valued functions: there is
no mean value theorem for continuously differentiable vector-valued functions. That is,
in general there may not exist 𝐳 ∈ ℝ𝑛 such that 𝐹(𝐱 + 𝒑) = 𝐹(𝐱) + 𝑱(𝐳)𝒑. Intuitively the
reason is that, although each function 𝑓𝑖 satisfies 𝑓𝑖 (𝐱 + 𝒑) = 𝑓𝑖 (𝐱) + ∇𝑓𝑖 (𝐳𝑖 )𝑇 𝒑 , the points
𝐳𝑖 , may differ. For example, consider the function of the example before. There is no
𝐳 ∈ ℝ𝑛 for which 𝐹(1,1) = 𝐹(0,0) + 𝑱(𝐳)(1,1)𝑇 as this would require

(𝑒^{𝑥1} − 𝑥2 ; 𝑥1² − 2𝑥2)|_{𝐱=(1,1)} = (𝑒^{𝑥1} − 𝑥2 ; 𝑥1² − 2𝑥2)|_{𝐱=(0,0)} + (𝑒^{𝑧1}  −1 ; 2𝑧1  −2)(1 ; 1)  ⇔  (𝑒 − 1 ; −1) = (1 ; 0) + (𝑒^{𝑧1} − 1 ; 2𝑧1 − 2)

or 𝑒^{𝑧1} = 𝑒 − 1 and 2𝑧1 = 1, which is clearly impossible. Although the standard mean value theorem does not hold, we will be able to replace it in our analysis by Newton's theorem and the triangle inequality for line integrals.

Those results are given below. The integral of a vector valued function of a real variable
can be interpreted as the vector of Riemann integrals of each component function.

Lemma Let 𝐹: ℝ𝑛 ⟶ ℝ𝑚 be continuously differentiable in an open convex set 𝑫 ⊂ ℝ𝑛 . For


any 𝐱, 𝐱 + 𝒑 ∈ 𝑫,
𝐹(𝐱 + 𝒑) − 𝐹(𝐱) = (∫_0^1 𝑱(𝐱 + 𝑡𝒑)𝑑𝑡) 𝒑 = ∫_𝐱^{𝐱+𝒑} 𝐹′(𝐳)𝑑𝐳

Proof: The proof comes right from the definition of 𝐹 ′ (𝐳) and a component by-
component application of Newton's formula 𝑓𝑖 (𝐱 + 𝒑) = 𝑓𝑖 (𝐱) + ∇𝑓𝑖 (𝐳𝑖 )𝑇 𝒑 .■

Now we might think about a best model or the best linear approximation of the function
𝐹 around a point 𝐱 𝑐 , means that we model 𝐹(𝐱 𝑐 + 𝒑) by the affine model

𝑀(𝐱 𝑐 + 𝒑) = 𝐹(𝐱 𝑐 ) + 𝑱(𝐱 𝑐 )𝒑

and this is what we will do. To produce a bound on the difference between 𝐹(𝐱 𝑐 + 𝒑) and
𝑀(𝐱 𝑐 + 𝒑), we need to make an assumption about the continuity of 𝑱(𝐱 𝑐 ) just as we did
in scalar-valued-functions in the section before.
Definition Let the two integers 𝑚, 𝑛 > 0, 𝑮: ℝ𝑛 ⟶ ℝ𝑚×𝑛 , 𝐱 ∈ ℝ𝑛 , let ‖•‖ be a norm on
ℝ𝑛 , and ‖•‖𝑮 a norm on ℝ𝑚×𝑛 . 𝑮 is said to be Lipschitz continuous at 𝐱 if there exists an
open set 𝑫 ⊂ ℝ𝑛 , 𝐱 ∈ 𝑫, and a constant 𝛾 such that for all 𝐲 ∈ 𝑫,

‖𝑮(𝐱) − 𝑮(𝐲)‖𝑮 ≤ 𝛾‖𝐱 − 𝐲‖

The constant 𝛾 is called a Lipschitz constant for 𝑮 at 𝐱. For any specific 𝑫 containing 𝐱
for which the given inequality holds, 𝑮 is said to be Lipschitz continuous at 𝐱 in the
neighborhood 𝑫. If this inequality holds for every 𝐱 ∈ 𝑫, then 𝑮 ∈ 𝐿𝑖𝑝𝛾 (𝑫).

Note that the value of 𝛾 depends on the norms ‖•‖ & ‖•‖𝑮 , but the existence of 𝛾 does
not.

Lemma Let 𝐹: ℝ𝑛 ⟶ ℝ𝑚 be continuously differentiate in the open convex set 𝑫 ⊂ ℝ𝑛 ,


𝐱 ∈ 𝑫, and let 𝑱 be Lipschitz continuous at 𝐱 in the neighborhood 𝑫, using a vector norm
and the induced matrix operator norm and the constant 𝛾. Then, for any 𝐱 + 𝒑 ∈ 𝑫,
𝛾
‖𝐹(𝐱 + 𝒑) − 𝐹(𝐱) − 𝑱(𝐱)𝒑‖ ≤ ‖𝒑‖2
2

Proof:

𝐹(𝐱 + 𝒑) − 𝐹(𝐱) − 𝑱(𝐱)𝒑 = ∫_0^1 (𝑱(𝐱 + 𝑡𝒑) − 𝑱(𝐱)) 𝒑 𝑑𝑡

Using the triangle inequality for line integrals, the definition of a matrix operator norm, and the Lipschitz continuity of 𝑱 at 𝐱 in neighborhood 𝑫, we obtain

‖𝐹(𝐱 + 𝒑) − 𝐹(𝐱) − 𝑱(𝐱)𝒑‖ ≤ ∫_0^1 ‖𝑱(𝐱 + 𝑡𝒑) − 𝑱(𝐱)‖ ‖𝒑‖ 𝑑𝑡 ≤ ∫_0^1 𝛾‖𝑡𝒑‖ ‖𝒑‖ 𝑑𝑡 = 𝛾‖𝒑‖² ∫_0^1 𝑡 𝑑𝑡 = (𝛾/2)‖𝒑‖²

Using Lipschitz continuity, we can obtain a useful bound on the error in the
approximate affine model.

Lemma Let 𝐹, 𝑱 satisfy the conditions of the previous lemma. Then, for any 𝐯, 𝐮 ∈ 𝑫,
‖𝐯 − 𝐱‖ + ‖𝐮 − 𝐱‖
‖𝐹(𝐯) − 𝐹(𝐮) − 𝑱(𝐱)(𝐯 − 𝐮)‖ ≤ 𝛾 ‖𝐯 − 𝐮‖
2
If we assume that 𝑱(𝐱)−1 exists. Then there exists 𝜀 > 0, 0 < 𝛼 < 𝛽 , such that

𝛼‖𝐯 − 𝐮‖ ≤ ‖𝐹(𝐯) − 𝐹(𝐮)‖ ≤ 𝛽‖𝐯 − 𝐮‖

for all 𝐯, 𝐮 ∈ 𝑫 for which max{‖𝐯 − 𝐱‖, ‖𝐮 − 𝐱‖} ≤ 𝜀


Proof: Using the previous lemma we can write

‖𝐹(𝐯) − 𝐹(𝐮) − 𝑱(𝐱)(𝐯 − 𝐮)‖ ≤ 𝛾 ((‖𝐯 − 𝐱‖ + ‖𝐮 − 𝐱‖)/2) ‖𝐯 − 𝐮‖

Using the triangle inequality,

‖𝐹(𝐯) − 𝐹(𝐮)‖ ≤ ‖𝑱(𝐱)(𝐯 − 𝐮)‖ + ‖𝐹(𝐯) − 𝐹(𝐮) − 𝑱(𝐱)(𝐯 − 𝐮)‖
             ≤ (‖𝑱(𝐱)‖ + (𝛾/2)(‖𝐯 − 𝐱‖ + ‖𝐮 − 𝐱‖)) ‖𝐯 − 𝐮‖
             ≤ (‖𝑱(𝐱)‖ + 𝛾𝜀) ‖𝐯 − 𝐮‖

Let us define 𝛽 = ‖𝑱(𝐱)‖ + 𝛾𝜀; then ‖𝐹(𝐯) − 𝐹(𝐮)‖ ≤ 𝛽‖𝐯 − 𝐮‖. Similarly,

‖𝐹(𝐯) − 𝐹(𝐮)‖ ≥ ‖𝑱(𝐱)(𝐯 − 𝐮)‖ − ‖𝐹(𝐯) − 𝐹(𝐮) − 𝑱(𝐱)(𝐯 − 𝐮)‖
             ≥ ((1/‖𝑱(𝐱)⁻¹‖) − (𝛾/2)(‖𝐯 − 𝐱‖ + ‖𝐮 − 𝐱‖)) ‖𝐯 − 𝐮‖
             ≥ ((1/‖𝑱(𝐱)⁻¹‖) − 𝛾𝜀) ‖𝐯 − 𝐮‖

Let us define 𝛼 = (1/‖𝑱(𝐱)⁻¹‖) − 𝛾𝜀 > 0, so 𝛽‖𝐯 − 𝐮‖ ≥ ‖𝐹(𝐯) − 𝐹(𝐮)‖ ≥ 𝛼‖𝐯 − 𝐮‖. ■

Summary on Jacobian approximation: If 𝐹 is differentiable at a point 𝒂 in ℝ𝑛 , then its


differential is represented by 𝑱(𝒂). In this case, the linear transformation represented by
𝑱(𝒂) is the best linear approximation of 𝐹 near the point 𝒂, in the sense that

𝐹(𝐱) − 𝐹(𝒂) = 𝑱(𝒂)(𝐱 − 𝒂) + 𝑜(‖𝐱 − 𝒂‖) as ‖𝐱 − 𝒂‖ ⟶ 0

where 𝑜(‖𝐱 − 𝒂‖) is a quantity that approaches zero much faster than the distance between 𝐱 and 𝒂 does as 𝐱 approaches 𝒂.

In the preceding section we saw that the Jacobian, gradient, and Hessian will be useful
quantities in forming models of multivariable nonlinear functions. In many
applications, however, these derivatives are not analytically available. In this section we
introduce the formulas used to approximate these derivatives by finite differences, and
the error bounds associated with these formulas. The choice of finite-difference stepsize
in the presence of finite-precision arithmetic and the use of finite-difference derivatives
in our algorithms are discussed in (J. R Dennis, Jr. & Robert B. Schnabel 1993).

Frequently we deal with problems where the nonlinear function is itself the result of a
computer simulation, or is given by a long and messy algebraic formula, and so it is
often the case that analytic derivatives are not readily available although the function is
several times continuously differentiable. Therefore it is important to have algorithms
that work effectively in the absence of analytic derivatives.
 In the case when 𝐹: ℝ𝑛 ⟶ ℝ𝑚, it is reasonable to use the same idea as in one variable to approximate the (𝑖, 𝑗)th component of 𝑱(𝐱) by the forward difference approximation

𝑎ᵢⱼ(𝐱) = (𝑓𝑖(𝐱 + ℎ𝐞𝑗) − 𝑓𝑖(𝐱))/ℎ

where 𝐞𝑗 denotes the 𝑗th unit vector. This is equivalent to approximating the 𝑗th column of 𝑱(𝐱) = [𝐴1(𝐱) 𝐴2(𝐱) … 𝐴𝑛(𝐱)] by

𝐴𝑗(𝐱) = (𝐹(𝐱 + ℎ𝐞𝑗) − 𝐹(𝐱))/ℎ
Again, one would expect ‖𝐴𝑗(𝐱) − (𝑱(𝐱))𝑗‖ = 𝒪(ℎ) for ℎ sufficiently small. In terms of the ℓ1 norm we can write

‖𝑨(𝐱) − 𝑱(𝐱)‖1 ≤ (𝛾/2)|ℎ|

because we have seen that

‖𝐹(𝐱 + ℎ𝐞𝑗) − 𝐹(𝐱) − 𝑱(𝐱)ℎ𝐞𝑗‖ ≤ (𝛾/2) ℎ² ‖𝐞𝑗‖² = (𝛾/2)|ℎ|²

Dividing both sides by |ℎ| gives

‖𝐴𝑗(𝐱) − (𝑱(𝐱))𝑗‖ ≤ (𝛾/2)|ℎ|

Since the ℓ1 norm of a matrix is the maximum of the ℓ1 norms of its columns, ‖𝑨(𝐱) − 𝑱(𝐱)‖1 ≤ (𝛾/2)|ℎ| follows immediately. ■
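A minimal sketch of this column-by-column forward-difference Jacobian, reusing the earlier test function 𝐹 as assumed data:

F  = @(x) [exp(x(1)) - x(2); x(1)^2 - 2*x(2)];
Jx = @(x) [exp(x(1)) -1; 2*x(1) -2];           % analytic Jacobian
x = [1; 1]; h = 1e-6; A = zeros(2,2);
for j = 1:2
    e = zeros(2,1); e(j) = 1;
    A(:,j) = (F(x + h*e) - F(x))/h;            % forward-difference column
end
norm(A - Jx(x), 1)                             % O(h), as the bound predicts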

 When the nonlinear problem is minimization of a function 𝑓: ℝ𝑛 ⟶ ℝ , we may need


to approximate the gradient ∇𝑓(𝐱) and/or the Hessian ∇2 𝑓(𝐱). Approximation of the
gradient is just a special case of the approximation of 𝑱(𝐱) discussed above, with 𝑚 = 1.

In some cases, finite-precision arithmetic causes us to seek a more accurate finite-


difference approximation using the central difference approximation. Notice that this
approximation requires twice as many evaluations of 𝑓 as forward differences.

One can prove that if 𝒂 ∈ ℝ𝑛 is the central difference model of ∇𝑓(𝐱), then 𝑎𝑖 given by

𝑎𝑖(𝐱) = (𝑓(𝐱 + ℎ𝐞𝑖) − 𝑓(𝐱 − ℎ𝐞𝑖))/(2ℎ)

satisfies |𝑎𝑖(𝐱) − (∇𝑓(𝐱))𝑖| ≤ (𝛾/6)ℎ² ⟺ ‖𝒂(𝐱) − ∇𝑓(𝐱)‖∞ ≤ (𝛾/6)ℎ²

On some occasions ∇𝑓(𝐱) is analytically available but ∇²𝑓(𝐱) is not. In this case, ∇²𝑓(𝐱) can be approximated by applying forward differences column-wise, 𝑨𝑗 = (∇𝑓(𝐱 + ℎ𝐞𝑗) − ∇𝑓(𝐱))/ℎ, followed by 𝑨̂ = (𝑨 + 𝑨^T)/2, since the approximation to ∇²𝑓(𝐱) should be symmetric.

If ∇𝑓(𝐱) is not available, it is possible to approximate ∇²𝑓(𝐱) using only values of 𝑓(𝐱):

𝐴ᵢⱼ = (𝑓(𝐱 + ℎ𝐞𝑖 + ℎ𝐞𝑗) − 𝑓(𝐱 + ℎ𝐞𝑖) − 𝑓(𝐱 + ℎ𝐞𝑗) + 𝑓(𝐱))/ℎ² ⟹ |𝐴ᵢⱼ − (∇²𝑓(𝐱))ᵢⱼ| ≤ (5𝛾/3)ℎ
In this section we derive the first and second-order necessary and sufficient conditions
for a point 𝐱 ⋆ to be a local minimizer of a continuously differentiable function 𝑓: ℝ𝑛 ⟶ ℝ,
𝑛 > 1. Naturally, these conditions will be a key to our algorithms for the unconstrained
minimization problem.

Lemma Let 𝑓: ℝ𝑛 ⟶ ℝ be continuously differentiable in an open convex set 𝑫 ⊂ ℝ𝑛 . Then


𝐱 ⋆ ∈ 𝑫 can be a local minimizer of 𝑓 only if ∇𝑓(𝐱 ⋆ ) = 𝟎.

Proof: As in the one-variable case, a proof by contradiction is better than a direct proof: if ∇𝑓(𝐱⋆) ≠ 𝟎, then 𝑓 decreases along the direction 𝒑 = −∇𝑓(𝐱⋆) from 𝐱⋆, contradicting local minimality.

A class of algorithms called descent methods are characterized by a direction vector 𝒑 such that 𝒑^T∇𝑓(𝐱) < 0, e.g. 𝒑 = −∇𝑓(𝐱).

Theorem Let 𝑓: ℝ𝑛 ⟶ ℝ be twice continuously differentiable in the open convex set 𝑫 ⊂ ℝ𝑛, and assume there exists 𝐱⋆ ∈ 𝑫 such that ∇𝑓(𝐱⋆) = 𝟎. If ∇²𝑓(𝐱⋆) is positive definite, then 𝐱⋆ is a local minimizer of 𝑓. If ∇²𝑓 is Lipschitz continuous at 𝐱⋆, then 𝐱⋆ can be a local minimizer of 𝑓 only if ∇²𝑓(𝐱⋆) is positive semidefinite.

The necessary conditions for 𝐱⋆ to be a local maximizer of 𝑓 are simply
■ ∇𝑓(𝐱⋆) = 𝟎
■ ∇²𝑓(𝐱⋆) is negative semidefinite.

It is important to understand the shapes of multivariable quadratic functions: they are


strictly convex or convex, respectively bowl- or trough-shaped in two dimensions, if 𝑯 is
positive definite or positive semidefinite; they are strictly concave or concave (turn the
bowl or trough upside down) if 𝑯 is negative definite or negative semidefinite; and they
are saddle-shaped (in n dimensions) if 𝑯 is indefinite.

𝑓(𝐱 + 𝒑) − 𝑓(𝐱) = (∇𝑓(𝐱))^T 𝒑 + ½ 𝒑^T 𝑯(𝐱 + 𝑡𝒑)𝒑

When the Hessian matrix is positive definite, by definition 𝒑^T 𝑯(𝐱 + 𝑡𝒑)𝒑 > 0 for any 𝒑 ≠ 𝟎. Therefore, at a stationary point 𝐱 (where ∇𝑓(𝐱) = 𝟎) we have 𝑓(𝐱 + 𝒑) − 𝑓(𝐱) = (1/2)𝒑^T 𝑯(𝐱 + 𝑡𝒑)𝒑 > 0, which means that 𝐱 must be a local minimum. Similarly, when the Hessian matrix is negative definite, 𝐱 is a local maximum. Finally, when 𝑯 has both positive and negative eigenvalues, the point is a saddle point.
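In MATLAB this classification reduces to inspecting the Hessian's eigenvalues (a sketch reusing the Hessian of the earlier example):

H = [2 6; 6 30];     % Hessian at (1,1) of the earlier example
e = eig(H)           % both positive -> positive definite -> local minimum
% all negative -> local maximum; mixed signs -> saddle point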
Gradient-based methods use
the gradient to search for the minimum point of an objective function. Such gradient-
based optimization methods are supposed to reach a point at which the gradient is
(close to) zero. In this context, the optimization of an objective function f(x) is equivalent
to finding a zero of its gradient g(x), which in general is a vector-valued function of a
vector-valued independent variable x. Therefore, if we have the gradient function g(x) of
the objective function f(x), we can solve the system of nonlinear equations g(x) = 0 to get
the minimum of f(x) by using the Newton method explained in chapter 4.

Let 𝑓: ℝ𝑛 ⟶ ℝ be twice continuously


differentiable in the open convex set 𝑫 ⊂ ℝ𝑛 . The Newton method tries to go straightly
to the zero of the gradient of the approximate objective function, means that we try to
solve the following nonlinear system of equations 𝐠(𝐱) = ∇𝑓(𝐱) = 𝟎.

𝐠(𝐱) = 𝟎 ⟺ 𝐠(𝐱𝑘) + (∇𝐠(𝐱)|_{𝐱𝑘})(𝐱 − 𝐱𝑘) = 𝟎
       ⟺ 𝐠(𝐱𝑘) + 𝐇(𝐱𝑘)(𝐱 − 𝐱𝑘) = 𝟎
       ⟺ 𝐱 = 𝐱𝑘 − (𝐇(𝐱𝑘))⁻¹ 𝐠(𝐱𝑘)

which leads to the updating rule 𝐱𝑘+1 = 𝐱𝑘 − (𝐇(𝐱𝑘))⁻¹ 𝐠(𝐱𝑘)

Example: Given the Rosenbrock function 𝑓(𝐱) = 100(𝑥2 − 𝑥1²)² + (1 − 𝑥1)², find the extremum of 𝑓(𝐱) with the starting point 𝐱 = [0.01 0.02]. This function is severely ill-conditioned near the minimizer (1,1) (which is the unique stationary point).

syms x1 x2; f=100*(x2-x1^2)^2+(1-x1)^2;


J=jacobian(f,[x1,x2]) % Gradient computation
H=jacobian(J,[x1,x2]) % Hessian computation

%-----------------------------------------------------------%
% f .......... objective function
% J .......... gradient of the objective function
% H .......... Hessian of the objective function
%-----------------------------------------------------------%
clear all, clc, i=1; x(i,:)=[0.01 0.02]; tol=0.001;
f=@(x)100*(x(2)-x(1)^2)^2+(1-x(1))^2;
J=@(x)[-400*(x(2)-x(1)^2)*x(1)+2*x(1)-2; 200*x(2)-200*x(1)^2];
H=@(x)[1200*x(1)^2-400*x(2)+2 -400*x(1);-400*x(1) 200];

while norm(J(x(i,:)))>tol
d=(inv(H(x(i,:)) + 0.5*eye(2,2))*J(x(i,:)))';
x(i+1,:)=x(i,:)-d;
i=i+1;
end
Iterations=i
x=(x(i,:))' % x=[0.989793 0.9797];
fmax=f(x)
Example: Let 𝑓(𝐱) = √(1 + 𝑥12 ) + √(1 + 𝑥22 ), find the extremum value of 𝑓(𝐱) with the
following starting point 𝐱 = [1 1].

syms x1 x2; f=sqrt(1+x1^2)+sqrt(1+x2^2);


J=jacobian(f,[x1,x2]) % Gradient computation
H=jacobian(J,[x1,x2]) % Hessian computation

clear all, clc, i=1; x(i,:)=[1 1];

f=@(x)sqrt(1+x(1)^2)+sqrt(1+x(2)^2);
J=@(x)[x(1)/sqrt(x(1)^2+1);x(2)/sqrt(x(2)^2+1)];
H=@(x)diag([1/(x(1)^2+1)^1.5,1/(x(2)^2+1)^1.5]);

while abs(x(i,:)*J(x(i,:)))>0.001
d=(inv(H(x(i,:))+0.5*eye(2,2))*J(x(i,:)))';
x(i+1,:)=x(i,:)-d;
i=i+1;
end
Iterations=i
x=(x(i,:))'
fmax=f(x)

Example: Let 𝑓(𝑥1 , 𝑥2 ) = (𝑥1 − 2)4 + (𝑥1 − 2)2 𝑥22 + (𝑥2 + 1)2 , which has its minimum at 𝐱 ⋆ = (2, −1)𝑇 . The algorithm is started from 𝐱 0 = (1, 1)𝑇 , and we use the following finite-difference approximations:

𝐠(𝐱) = ∇𝑓(𝐱) ≅ (1/ℎ) [𝑓(𝐱 + ℎ𝐞1 ) − 𝑓(𝐱) ; 𝑓(𝐱 + ℎ𝐞2 ) − 𝑓(𝐱)],  𝑯(𝐱) = ∇2 𝑓(𝐱) ≅ (1/ℎ2 ) [𝐻11 𝐻12 ; 𝐻21 𝐻22 ]

𝐻11 = 𝑓(𝐱 + 2ℎ𝐞1 ) − 2𝑓(𝐱 + ℎ𝐞1 ) + 𝑓(𝐱),  𝐻22 = 𝑓(𝐱 + 2ℎ𝐞2 ) − 2𝑓(𝐱 + ℎ𝐞2 ) + 𝑓(𝐱)

𝐻12 = 𝐻21 = 𝑓(𝐱 + ℎ𝐞1 + ℎ𝐞2 ) − 𝑓(𝐱 + ℎ𝐞1 ) − 𝑓(𝐱 + ℎ𝐞2 ) + 𝑓(𝐱).

Before starting the algorithm, let us visualize this surface in space.

clear all, clc, [x1,x2] = meshgrid(-4:0.4:6,-4:0.4:6);


f=(x1- 2).^4 + ((x1- 2).^2).*(x2).^2 + (x2 + 1).^2;
s=surf(x1,x2,f)
direction = [0 0 1];
rotate(s,direction,-25)
After the execution of the program given below we obtain

Iterations = 9
Jacobian =

-4.0933e-11
2.2080e-09

x =

1.9950
-1.0050

fmax = 5.0001e-05

clear all, clc, i=1; x(i,:)=[1 1]; h=0.01; J=1; tol=0.00001;


f=@(x)(x(1)- 2)^4 + ((x(1)- 2)^2)*(x(2))^2 + (x(2) + 1)^2;

while norm(J)>tol

x1(i,:)= x(i,:) + h*[1 0]; x2(i,:)= x(i,:) + h*[0 1];


J1=(f(x1(i,:)) - f(x(i,:)))/h; J2=(f(x2(i,:)) - f(x(i,:)))/h;
J=[J1;J2]; % gradient computation

x11(i,:)= x(i,:) + 2*h*[1 0];


x12(i,:)= x(i,:) + h*[1 0] + h*[0 1];
x22(i,:)= x(i,:) + 2*h*[0 1];

H(1,1)=(f(x11(i,:))-2*f(x1(i,:))+ f(x(i,:)))/h^2;
H(1,2)=(f(x12(i,:))-f(x1(i,:))-f(x2(i,:))+f(x(i,:)))/h^2;
H(2,1)=H(1,2); % symmetric cross term
H(2,2)=(f(x22(i,:))-2*f(x2(i,:))+f(x(i,:)))/h^2;

x(i+1,:)=x(i,:)-(inv(H)*J)';
i=i+1;

end

Iterations=i
Gradient =J
x=(x(i,:))'
fmax=f(x)
Example: Consider the Freudenstein
and Roth test function

𝑓(𝐱) = (𝑓1 (𝐱))2 + (𝑓2 (𝐱))2 , 𝐱 ∈ ℝ2 ,

Where

𝑓1 (𝐱) = −13 + 𝑥1 + ((5 − 𝑥2 )𝑥2 − 2)𝑥2 ,

𝑓2 (𝐱) = −29 + 𝑥1 + ((𝑥2 + 1)𝑥2 − 14)𝑥2 .

Show that the function f has three stationary points. Find them and prove that one is a global minimizer, one is a strict local minimum and the third is a saddle point. You should use the stopping criterion ‖∇𝑓(𝐱)‖ ≤ 10−5 . The algorithm should be run four times, from the following four starting points:

x(i,:)=[-50 7]; x(i,:)=[20 7]; x(i,:)=[20 -18]; x(i,:)=[5 -10];

Solution: Let us first see the plot of this surface.

clear all, clc, [x1,x2] = meshgrid(-4:0.4:6,-4:0.4:6);


f1=-13+x1+((5-x2).*x2-2).*x2; f2=-29+ x1+((x2+1).*x2-14).*x2;
z=f1.^2+f2.^2; s=surf(x1,x2,z)
direction = [0 0 1]; rotate(s,direction,35)

We can also use MATLAB symbolic code to obtain the gradient and the Hessian of this function.

syms x1 x2; f1=-13+x1+((5-x2)*x2-2)*x2; f2=-29+ x1+((x2+1)*x2-14)*x2;


f=f1^2+f2^2;
J=jacobian(f,[x1,x2]) % gradient computation
H=jacobian(J,[x1,x2]) % Hessian computation

When we run the program from each starting point, we obtain the stationary points that were asked for.
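A minimal driver for this exercise might look as follows; this is only a sketch, in which the gradient is the analytic one implied by the symbolic computation above, the Hessian is approximated by differencing the gradient, and the damping term 1e-8, the step h and the iteration cap are illustrative choices that are not part of the original statement:

% Newton search from the four prescribed starting points
clear all, clc
f1=@(x)-13+x(1)+((5-x(2))*x(2)-2)*x(2);
f2=@(x)-29+x(1)+((x(2)+1)*x(2)-14)*x(2);
f=@(x)f1(x)^2+f2(x)^2;
J=@(x)[2*f1(x)+2*f2(x);
       2*f1(x)*(10*x(2)-3*x(2)^2-2)+2*f2(x)*(3*x(2)^2+2*x(2)-14)];
starts=[-50 7; 20 7; 20 -18; 5 -10];
for s=1:4
    x=starts(s,:)'; it=0; h=1e-6;
    while norm(J(x))>1e-5 && it<500
        % finite-difference Hessian obtained by differencing the gradient
        H=[(J(x+h*[1;0])-J(x))/h, (J(x+h*[0;1])-J(x))/h];
        x=x-(H+1e-8*eye(2))\J(x); it=it+1;
    end
    fprintf('start %d: x=(%.4f, %.4f), f=%.6f, iters=%d\n',s,x(1),x(2),f(x),it)
end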

Example: Given a Rosenbrock function 𝑓(𝐱) = 100(𝑥2 − 𝑥12 )2 + (1 − 𝑥1 )2, find the solution
of 𝑓(𝐱) = 0 using only the gradient (i.e. without use of Hessian).

It is very well-known that

𝑓(𝐱 + 𝜹) − 𝑓(𝐱) = (∇𝑓(𝐱))𝑇 𝜹 + (1/2) 𝜹𝑇 𝑯(𝐱 + 𝑡𝜹)𝜹 + 𝒪(‖𝜹‖3 ) = (∇𝑓(𝐱))𝑇 𝜹 + 𝒪(‖𝜹‖2 )

Taking the first-order approximation 𝑓(𝐱 + 𝜹) − 𝑓(𝐱) ≅ (∇𝑓(𝐱))𝑇 𝜹 and iterating this equation, we obtain 𝑓(𝐱 𝑘+1 ) − 𝑓(𝐱 𝑘 ) = (∇𝑓(𝐱 𝑘 ))𝑇 ∆𝐱 𝑘 . Assuming that as 𝑘 goes to infinity the solution is reached, i.e. 𝑓(𝐱 𝑘+1 ) = 0, then

−𝑓(𝐱 𝑘 ) = (∇𝑓(𝐱 𝑘 ))𝑇 ∆𝐱 𝑘 ⟺ ∆𝐱 𝑘 = −(∇𝑓(𝐱 𝑘 )(∇𝑓(𝐱 𝑘 ))𝑇 )−1 ∇𝑓(𝐱 𝑘 )𝑓(𝐱 𝑘 )

In order to avoid singularity in the matrix inversion, we add a regularization term:

∆𝐱 𝑘 = 𝐱 𝑘+1 − 𝐱 𝑘 = −(∇𝑓(𝐱 𝑘 )(∇𝑓(𝐱 𝑘 ))𝑇 + 𝜆𝑰)−1 ∇𝑓(𝐱 𝑘 )𝑓(𝐱 𝑘 ),  0 < 𝜆 < 1

clear all, clc, i=1; x(i,:)=[0 0.02]; delta=[1;1]; tol=0.001;


f=@(x)100*(x(2)-x(1)^2)^2+(1-x(1))^2;
J=@(x)[-400*(x(2)-x(1)^2)*x(1)+2*x(1)-2; 200*x(2)-200*x(1)^2];
while norm(delta)>tol
delta=-inv(J(x(i,:))*(J(x(i,:)))'+ 0.7*eye(2,2))*J(x(i,:));
x(i+1,:)=x(i,:) + (delta)'*f(x(i,:));
i=i+1;
end
Iterations=i
x=(x(i,:))'
fmax=f(x)

It can be observed, comparing with the Newton update 𝐱 𝑘+1 = 𝐱 𝑘 − (𝑯(𝐱))−1 𝐠(𝐱), that

𝐱 𝑘+1 = 𝐱 𝑘 − (∇𝑓(𝐱 𝑘 )(∇𝑓(𝐱 𝑘 ))𝑇 + 𝜆𝑰)−1 ∇𝑓(𝐱 𝑘 )𝑓(𝐱 𝑘 ) ≅ 𝐱 𝑘 − (𝑯(𝐱))−1 ∇𝑓(𝐱 𝑘 )

⟹ 𝑯(𝐱) ≅ (∇𝑓(𝐱 𝑘 )(∇𝑓(𝐱 𝑘 ))𝑇 + 𝜆𝑰)/𝑓(𝐱 𝑘 )

In practical terms this means that, once the first derivatives are computed, we can also compute part of the Hessian matrix at the same computational cost. The possibility of computing the Hessian matrix “for free” once the Jacobian (i.e. gradient) is available is a distinctive feature of least squares problems. This approximation is adopted in many applications since it provides an evaluation of the Hessian matrix without computing any second derivatives of the objective function.

Example: Given 𝑓(𝐱) = (𝑥12 − 2𝑥2 ) exp(−𝑥12 − 𝑥22 − 𝑥1 𝑥2 ), find the solution of 𝑓(𝐱) = 0 using only the gradient (i.e. without use of the Hessian). In this example we will use the approximate value of ∇𝑓(𝐱) rather than the analytic one.

clear all, clc, i=1; x(i,:)=[0.1,0.2]; delta=[1;1]; h=0.0001;


f=@(x)(x(1)^2-2*x(2))*exp(-x(1)^2-x(2)^2-x(1)*x(2)); tol=0.001;
% f=@(x)100*(x(2)-x(1)^2)^2+(1-x(1))^2;
while norm(delta)>tol
x1(i,:)= x(i,:) + h*[1 0]; x2(i,:)= x(i,:) + h*[0 1];
J1=(f(x1(i,:)) - f(x(i,:)))/h; J2=(f(x2(i,:)) - f(x(i,:)))/h;
J=[J1;J2];
delta=-inv(J*J'+ 0.7*eye(2,2))*J;
x(i+1,:)=x(i,:) + (delta)'*f(x(i,:)); i=i+1;
end
Iterations=i
x=(x(i,:))'
fmax=f(x)
Example: Given 𝑓(𝐱) = 𝑥1 exp(−𝑥12 − 𝑥22 ), find the minimizer of 𝑓(𝐱) using the approximate values of the gradient and the Hessian.

Before starting the algorithm, let us visualize this surface in space.

[x,y] = meshgrid([-2:.25:2]);
z = x.*exp(-x.^2-y.^2);
% Plotting the Z-values of the function on which the level
% sets have to be projected
z1 = x.^2+y.^2;
% Plot your contour
[cm,c]=contour(x,y,z,30);
% Plot the surface on which the level sets have to be projected
s=surface(x,y,z1,'EdgeColor',[.8 .8 .8],'FaceColor','none')
% Get the handles to the children, i.e. the contour lines of the contour
% (note: this relies on pre-R2014b handle graphics; newer MATLAB releases
% represent contour objects differently)
cv=get(c,'children');
% Extract the (X,Y) for each of the contours and recalculate the
% Z-coordinates on the surface on which to be projected.
for i=1:length(cv)
cc = cv(i);
xd=get(cc,'XData');
yd=get(cc,'Ydata');
zd=xd.^2+yd.^2;
set(cc,'Zdata',zd);
end
grid off
view(-15,25)
colormap cool

After the execution of the program given below we obtain
Iterations = 11

Jacobian =

-4.1814e-06
-8.6410e-06

x =

1.8844
3.4178

fmax = 4.5702e-07
clear all, clc, i=1; x(i,:)=[1 1]; h=0.01; J=1; tol=0.00001;
f=@(x)x(1)*exp(-x(1)^2-x(2)^2);
while norm(J)>tol

x1(i,:)= x(i,:) + h*[1 0]; x2(i,:)= x(i,:) + h*[0 1];


J1=(f(x1(i,:)) - f(x(i,:)))/h; J2=(f(x2(i,:)) - f(x(i,:)))/h;
J=[J1;J2];

x11(i,:)= x(i,:) + 2*h*[1 0];


x12(i,:)= x(i,:) + h*[1 0] + h*[0 1];
x22(i,:)= x(i,:) + 2*h*[0 1];

H(1,1)=(f(x11(i,:))-2*f(x1(i,:))+ f(x(i,:)))/h^2;
H(1,2)=(f(x12(i,:))-f(x1(i,:))-f(x2(i,:))+f(x(i,:)))/h^2;
H(2,1)=H(1,2); % symmetric cross term
H(2,2)=(f(x22(i,:))-2*f(x2(i,:))+f(x(i,:)))/h^2;

x(i+1,:)=x(i,:)-(inv(H)*J)';
i=i+1;

end
Iterations=i
Jacobian=J
x=(x(i,:))'
fmax=f(x)

Example: Develop the Taylor series of a two-variable objective function 𝑓(𝑥1 , 𝑥2 ) with an error of 𝒪(‖𝜹‖3 ):

𝑓(𝑥1 + 𝛿1 , 𝑥2 + 𝛿2 ) = 𝑓(𝑥1 , 𝑥2 ) + (𝜕𝑓/𝜕𝑥1 )𝛿1 + (𝜕𝑓/𝜕𝑥2 )𝛿2 + (1/2)[(𝜕 2 𝑓/𝜕𝑥12 )𝛿12 + 2(𝜕 2 𝑓/𝜕𝑥1 𝜕𝑥2 )𝛿1 𝛿2 + (𝜕 2 𝑓/𝜕𝑥22 )𝛿22 ] + 𝒪(‖𝜹‖3 )

= 𝑓(𝑥1 , 𝑥2 ) + [𝜕𝑓/𝜕𝑥1  𝜕𝑓/𝜕𝑥2 ] (𝛿1 ; 𝛿2 ) + (1/2)[𝛿1  𝛿2 ] [𝜕 2 𝑓/𝜕𝑥12  𝜕 2 𝑓/𝜕𝑥1 𝜕𝑥2 ; 𝜕 2 𝑓/𝜕𝑥1 𝜕𝑥2  𝜕 2 𝑓/𝜕𝑥22 ] (𝛿1 ; 𝛿2 ) + 𝒪(‖𝜹‖3 )

In compact form we can write 𝑓(𝐱 + 𝜹) − 𝑓(𝐱) = (∇𝑓(𝐱))𝑇 𝜹 + (1/2)𝜹𝑇 𝑯(𝐱 + 𝑡𝜹)𝜹 + 𝒪(‖𝜹‖3 )

Let 𝑭: ℝ𝑛 ⟶ ℝ𝑚 be continuously differentiable in the open convex set 𝑫 ⊂ ℝ𝑛 . The practical problem in the vector case is to solve simultaneously the set of nonlinear equations 𝑭(𝐱) = 𝟎. We have seen before that

𝑭(𝐱 + 𝜹) ≅ 𝑭(𝐱) + 𝑱(𝐱)𝜹 ⟺ 𝑭(𝐱 𝑘 + 𝜹𝑘 ) ≅ 𝑭(𝐱 𝑘 ) + 𝑱(𝐱 𝑘 )𝜹𝑘

When 𝑘 → ∞ assume that 𝐱 𝑘+1 = 𝐱 𝑘 + 𝜹𝑘 is a solution to 𝑭(𝐱) = 𝟎, that is 𝑭(𝐱 𝑘+1 ) = 𝟎, so that

𝜹𝑘 = 𝐱 𝑘+1 − 𝐱 𝑘 = −(𝑱(𝐱 𝑘 ))−1 𝑭(𝐱 𝑘 ) ⟹ 𝐱 𝑘+1 = 𝐱 𝑘 − (𝑱(𝐱 𝑘 ))−1 𝑭(𝐱 𝑘 )

In most practical problems the analytic expression of the Jacobian matrix 𝑱(𝐱 𝑘 ) is not available, so we need to approximate it by finite differences. In the next program we will see how to do this. To make things clearer, consider the following second order system 𝑭: ℝ2 ⟶ ℝ2 , for which

𝑱(𝐱 𝑘 ) = [𝜕𝑓1 /𝜕𝑥1  𝜕𝑓1 /𝜕𝑥2 ; 𝜕𝑓2 /𝜕𝑥1  𝜕𝑓2 /𝜕𝑥2 ] ≅ (1/ℎ) [𝑓1 (𝐱 + ℎ𝐞1 ) − 𝑓1 (𝐱)  𝑓1 (𝐱 + ℎ𝐞2 ) − 𝑓1 (𝐱); 𝑓2 (𝐱 + ℎ𝐞1 ) − 𝑓2 (𝐱)  𝑓2 (𝐱 + ℎ𝐞2 ) − 𝑓2 (𝐱)]

Example: write a MATLAB code to solve the following nonlinear system of equations

f1=@(x)(x(1)^2+x(2)^2 -1); f2=@(x)(x(1)^2-x(2));

using the approximate method and take x(i,:)= [0.1 0.2] as starting point.

% Solve the nonlinear system F(x) = 0 using Newton's method


% Vectors x and x0 are row vectors (for display purposes)
% function F returns a column vector, [f1(x), ..fn(x)]'
% stop if norm of change in solution vector is less than tol
% solve J(x)y = - F(x) using Matlab solver
clear all, clc, i=1; x(i,:)= [0.1 0.2]; tol=0.0001; maxit=100; h=0.01;
f1=@(x)(x(1)^2+x(2)^2 -1); f2=@(x)(x(1)^2-x(2)); dif=1;

while (dif >= tol) && (i<=maxit)


x1(i,:)= x(i,:) + h*[1 0]; x2(i,:)= x(i,:) + h*[0 1];
J11=(f1(x1(i,:)) - f1(x(i,:)))/h; J12=(f1(x2(i,:)) - f1(x(i,:)))/h;
J21=(f2(x1(i,:)) - f2(x(i,:)))/h; J22=(f2(x2(i,:)) - f2(x(i,:)))/h;
J=[J11 J12;J21 J22]; F=[f1(x(i,:));f2(x(i,:))];

x(i+1,:) = x(i,:) -(inv(J)*F)';


dif = norm(x(i+1,:) - x(i,:));
i = i + 1;
end
x, Iterations=i, F

x =

    0.1000    0.2000
    3.5715    0.7390
    1.8755    0.6244
    1.1046    0.6181
    0.8333    0.6180
    0.7878    0.6180
    0.7862    0.6180
    0.7862    0.6180

Iterations = 8

F =

   1.8483e-05
   1.8483e-05

In many optimization problems we may come across the problem of a singular Jacobian matrix; to overcome this we use the Gauss–Newton method:

𝐱 𝑘+1 = 𝐱 𝑘 − (𝑱𝑇 (𝐱 𝑘 )𝑱(𝐱 𝑘 ))−1 𝑱𝑇 (𝐱 𝑘 )𝑭(𝐱 𝑘 )

A minimal sketch of this iteration is given below.
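The following minimal sketch applies this Gauss–Newton step, with a finite-difference Jacobian, to the two-equation system of the previous example (the starting point, step size h and tolerance are illustrative choices):

% Gauss-Newton iteration with a finite-difference Jacobian
clear all, clc, x=[0.1; 0.2]; h=0.01; tol=1e-6;
F=@(x)[x(1)^2+x(2)^2-1; x(1)^2-x(2)];
for it=1:100
    % Jacobian by forward differences, one column per coordinate
    Jm=[(F(x+h*[1;0])-F(x))/h, (F(x+h*[0;1])-F(x))/h];
    dx=-(Jm'*Jm)\(Jm'*F(x));   % Gauss-Newton step via the normal equations
    x=x+dx;
    if norm(dx)<tol, break; end
end
x, it, F(x)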
• If equations have one or two unknowns, graphical methods can be used to solve them. If there are too many unknowns, graphical methods are not suitable and other methods should be tried. In this section, graphical solution methods for equations with one or two unknowns are proposed, and the advantages and disadvantages of graphical methods are summarized.

Example: use MATLAB to visualize the intersection of the following surfaces centered at the origin (assuming that the parameters are specified):

(𝑥/𝑎)2 + (𝑦/𝑏)2 + (𝑧/𝑐)2 = 𝑑
𝑧 = 𝛼𝑥 2 + 𝛽𝑦 2 + 𝛾
𝑥 2 + 𝑦 2 + 𝑧 2 = 𝜆

clear all, clc, a=3; b=2; c=1; imax=50; jmax=50;

for i=1:imax+1
theta=2*pi*(i-1)/ imax;
for j=1:jmax+1
phi =2*pi*(j-1)/ jmax;
x(i,j) = a*cos(theta); y(i,j) = b*sin(theta)*cos(phi);
z(i,j) = c*sin(theta)*sin(phi);
end
end
s=surf(x,y,z), hold on % Plot of ellipsoid

a=0.5; b=-1; c=1; imax=50; jmax=50;


for i=1:imax+1
xi=-2+4*(i-1)/imax;
for j=1:jmax+1
eta =-2+4*(j-1)/jmax;
x(i,j) = xi; y(i,j) = eta;
z(i,j) = a*x(i,j)^2+b*y(i,j)^2+c;
end
end
s=surf(x,y,z), hold on % Plot of hyperbolic surface

a=2; b=2; c=2; imax=50; jmax=50;


for i=1:imax+1
theta=2*pi*(i-1)/ imax;
for j=1:jmax+1
phi =2*pi*(j-1)/ jmax;
x(i,j) = a*cos(theta); y(i,j) = b*sin(theta)*cos(phi);
z(i,j) = c*sin(theta)*sin(phi);
end
end
s=surf(x,y,z), axis equal % Plot of sphere
% Intersection of Three Surfaces
clear all, clc, i=1; x(i,:)=[1 1 1]; tol=1.e-6; maxit=100; dif=1;
f1=@(x)(x(1)^2+x(2)^2+x(3)^2-14);
f2=@(x)(x(1)^2+2*x(2)^2-9);
f3=@(x)(x(1)-3*x(2)^2+ x(3)^2);

while (dif >= tol) && (i<=maxit)


J=[2*x(i,1) 2*x(i,2) 2*x(i,3); 2*x(i,1) 4*x(i,2) 0; 1 -6*x(i,2) 2*x(i,3)];
F=[f1(x(i,:));f2(x(i,:));f3(x(i,:))];

x(i+1,:) = x(i,:)-(inv(J)*F)';
dif = norm(x(i+1,:)-x(i,:));
i = i + 1;
end
x, Iterations=i, F

The main disadvantage of Newton's method, even


when regularized to ensure global convergence, is the need to calculate 𝑛(𝑛 + 1)/2
second derivatives. Hence, it may be better to approximate the Hessian matrix using
the value of the function and the gradient vector 𝐠(𝐱 𝑘 ). The simplest technique is to use
a finite difference approximation. Once again this matrix may not be positive definite,
thus requiring modifications as discussed earlier. Further, evaluating the finite
difference approximation requires 𝑛 + 1 evaluations of the gradient vector, which could
be very expensive.
The above disadvantages can be avoided if some updating procedure similar to that in
the Broyden's method can be given for the Hessian matrix or its inverse, such that the
matrix is assured to be positive definite.

Idea behind quasi-Newton methods: construct an approximation of the inverse Hessian using information gathered during the descent process. In this section we will show how the inverse Hessian can be built up from gradient information obtained at various points.

𝐱 𝑘+1 = 𝐱 𝑘 − 𝐁𝑘 𝐠(𝐱 𝑘 ) where 𝐁𝑘 is an approximation of (𝐇(𝐱 𝑘 ))−1

To avoid the computation of (𝐇(𝐱 𝑘 ))−1 , the quasi-Newton methods use an approximation in place of the true inverse. Let 𝐁0 , 𝐁1 , 𝐁2 , … be successive approximations of the inverse of the Hessian.

Suppose first that the Hessian matrix of the objective function is constant and independent of 𝐱 𝑖 for 0 ≤ 𝑖 ≤ 𝑘. In other words, the objective function is quadratic, with
Hessian 𝐇(𝐱) = 𝑸 for all 𝐱, where 𝑸 = 𝑸𝑇 . Then, if 𝐱 𝑘+1 is an optimizer of 𝑓 we get
𝐠(𝐱 𝑘+1 ) = 0 so

𝐠(𝐱 𝑘+1 ) − 𝐠(𝐱 𝑘 ) = 𝑸(𝐱 𝑘+1 − 𝐱 𝑘 ) ⟺ ∆𝐠(𝐱 𝑘 ) = 𝑸∆𝐱 𝑘 ⟺ ∆𝐠(𝐱𝑘 ) = 𝑸𝒑𝑘 ⟺ 𝒑𝑇𝑘 ∆𝐠(𝐱 𝑘 ) = 𝒑𝑇𝑘 𝑸𝒑𝑘

We start with a real symmetric positive definite matrix 𝐁0 . Note that given k, the
matrix 𝑸−1 satisfies
𝑸−1 ∆𝐠(𝐱𝑖 ) = ∆𝐱 𝑖 0 ≤ 𝑖 ≤ 𝑘

Therefore, we also impose the requirement that the approximation of the inverse Hessian satisfy
𝐁𝑘+1 ∆𝐠(𝐱 𝑖 ) = ∆𝐱 𝑖 = 𝒑𝑖 0 ≤ 𝑖 ≤ 𝑘

If n steps are involved, then moving in n directions 𝒑0 , 𝒑1 , 𝒑2 , … 𝒑𝑛−1 yields

𝐁𝑛 ∆𝐠(𝐱 0 ) = 𝒑0
𝐁𝑛 ∆𝐠(𝐱1 ) = 𝒑1

𝐁𝑛 ∆𝐠(𝐱𝑛−1 ) = 𝒑𝑛−1

If we define 𝒒𝑘 = ∆𝐠(𝐱𝑘 ) then this set of equations can be represented as

𝐁𝑛 [𝒒0 , 𝒒1 , 𝒒2 , … 𝒒𝑛−1 ] = [𝒑0 , 𝒑1 , … 𝒑𝑛−1 ]

Note that 𝑸 satisfies: 𝑸−1 [𝒒0 , 𝒒1 , 𝒒2 , … 𝒒𝑛−1 ] = [𝒑0 , 𝒑1 , … 𝒑𝑛−1 ]. Therefore, if the matrix
[𝒒0 , 𝒒1 , 𝒒2 , … 𝒒𝑛−1 ] is nonsingular, then 𝑸−1 is determined uniquely after n steps, via

𝑸−1 = 𝐁𝑛 = [𝒑0 , 𝒑1 , … 𝒑𝑛−1 ][𝒒0 , 𝒒1 , 𝒒2 , … 𝒒𝑛−1 ]−1

This means that if n linearly independent directions 𝒑𝑖 and corresponding 𝒒𝑖 are known,
then 𝑸−1 is uniquely determined.
We will construct successive approximations 𝐁𝑘 to 𝑸−1 based on the data obtained from the first k steps, such that

𝐁𝑘+1 = [𝒑0 , 𝒑1 , … 𝒑𝑘 ][𝒒0 , 𝒒1 , 𝒒2 , … 𝒒𝑘 ]−1

After n linearly independent steps we would then have 𝐁𝑛 = 𝑸−1 . Let us seek the update in the rank one correction form 𝐁𝑘+1 = 𝐁𝑘 + 𝛼𝑘 𝐮𝑘 𝐮𝑘𝑇 . We need a good 𝛼𝑘 ∈ ℝ and a good 𝐮𝑘 ∈ ℝ𝑛 .

Theorem Given 𝑘 + 1 linearly independent direction 𝒑0 , 𝒑1 , 𝒑2 , … 𝒑𝑘 with corresponding


gradients 𝒒0 , 𝒒1 , 𝒒2 , … 𝒒𝑘 such that 𝐁𝑘+1 [𝒒0 , 𝒒1 , … 𝒒𝑘 ] = [𝒑0 , 𝒑1 , … 𝒑𝑘 ] and if an update of
rank one 𝐁𝑘+1 = 𝐁𝑘 + 𝛼𝑘 𝐮𝑘 𝐮𝑇𝒌 is carried out then

𝐁𝑘+1 = 𝐁𝑘 + (𝒑𝑘 − 𝐁𝑘 𝒒𝑘 )(𝒑𝑘 − 𝐁𝑘 𝒒𝑘 )𝑇 / (𝒒𝑘𝑇 (𝒑𝑘 − 𝐁𝑘 𝒒𝑘 ))

Proof: We already know that 𝐁𝑘+1 [𝒒0 , 𝒒1 , … 𝒒𝑘 ] = [𝒑0 , 𝒑1 , … 𝒑𝑘 ] and 𝐁𝑘+1 = 𝐁𝑘 + 𝛼𝑘 𝐮𝑘 𝐮𝑇𝑘 .
Therefore,
𝒑𝑘 = 𝐁𝑘+1 𝒒𝑘 = (𝐁𝑘 + 𝛼𝑘 𝐮𝑘 𝐮𝑇𝑘 )𝒒𝑘 = 𝐁𝑘 𝒒𝑘 + 𝛼𝑘 𝐮𝑘 𝐮𝑇𝑘 𝒒𝑘 ⟹ 𝒑𝑘 − 𝐁𝑘 𝒒𝑘 = 𝛼𝑘 𝐮𝑘 𝐮𝑇𝑘 𝒒𝑘

(𝒑𝑘 − 𝐁𝑘 𝒒𝑘 )(𝒑𝑘 − 𝐁𝑘 𝒒𝑘 )𝑇 /𝛼𝑘 = 𝛼𝑘 (𝐮𝑘 𝐮𝑘𝑇 𝒒𝑘 )(𝒒𝑘𝑇 𝐮𝑘 𝐮𝑘𝑇 ) = 𝛼𝑘 𝐮𝑘 (𝐮𝑘𝑇 𝒒𝑘 )2 𝐮𝑘𝑇

which can be written as

(𝒑𝑘 − 𝐁𝑘 𝒒𝑘 )(𝒑𝑘 − 𝐁𝑘 𝒒𝑘 )𝑇 / (𝛼𝑘 (𝐮𝑘𝑇 𝒒𝑘 )2 ) = 𝛼𝑘 𝐮𝑘 𝐮𝑘𝑇 = 𝐁𝑘+1 − 𝐁𝑘

Multiplying 𝒑𝑘 − 𝐁𝑘 𝒒𝑘 = 𝛼𝑘 𝐮𝑘 𝐮𝑘𝑇 𝒒𝑘 on the left by 𝒒𝑘𝑇 we get

𝒒𝑘𝑇 𝒑𝑘 = 𝒒𝑘𝑇 𝐁𝑘 𝒒𝑘 + 𝛼𝑘 𝒒𝑘𝑇 𝐮𝑘 𝐮𝑘𝑇 𝒒𝑘 = 𝒒𝑘𝑇 𝐁𝑘 𝒒𝑘 + 𝛼𝑘 (𝒒𝑘𝑇 𝐮𝑘 )2 ⟹ 𝛼𝑘 (𝐮𝑘𝑇 𝒒𝑘 )2 = 𝒒𝑘𝑇 (𝒑𝑘 − 𝐁𝑘 𝒒𝑘 )

Finally, replacing this result in the update formula, we obtain

𝐁𝑘+1 − 𝐁𝑘 = (𝒑𝑘 − 𝐁𝑘 𝒒𝑘 )(𝒑𝑘 − 𝐁𝑘 𝒒𝑘 )𝑇 / (𝒒𝑘𝑇 (𝒑𝑘 − 𝐁𝑘 𝒒𝑘 )) ■

Algorithm: [Modified Newton method with rank 1 correction]

begin: k=1:n (until convergence)
	𝛼𝑘 = arg min𝛾 𝑓(𝐱 𝑘 − 𝛾𝐁𝑘 𝐠(𝐱 𝑘 )) with 𝐠(𝐱 𝑘 ) = ∇𝑓(𝐱 𝑘 )
	𝐱 𝑘+1 = 𝐱 𝑘 − 𝛼𝑘 𝐁𝑘 𝐠(𝐱 𝑘 )
	𝒑𝑘 = 𝐱 𝑘+1 − 𝐱 𝑘 and 𝒒𝑘 = 𝐠(𝐱 𝑘+1 ) − 𝐠(𝐱 𝑘 )
	𝐁𝑘+1 = 𝐁𝑘 + (𝒑𝑘 − 𝐁𝑘 𝒒𝑘 )(𝒑𝑘 − 𝐁𝑘 𝒒𝑘 )𝑇 / (𝒒𝑘𝑇 (𝒑𝑘 − 𝐁𝑘 𝒒𝑘 ))
end
Remark: The scalar 𝛼𝑘 is the smallest nonnegative value of 𝑎 that locally minimizes 𝑓 along the search direction starting from 𝐱 𝑘 . There are many alternative line-search rules to choose 𝛼𝑘 along the ray 𝑆𝑘 = {𝐱 𝑘+1 = 𝐱 𝑘 + 𝑎𝒑𝑘 | 𝑎 > 0}, namely the Armijo rule, the Goldstein rule, the Wolfe rule, the strong Wolfe rule, etc. In this work we do not study these rules in detail; a minimal backtracking sketch is given below.
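For completeness, here is a minimal backtracking (Armijo-type) line search written as a separate MATLAB function (to be saved as armijo.m); the initial step, the shrink factor rho and the sufficient-decrease constant c are illustrative values, not prescribed by the text:

% Backtracking line search: shrink a until the Armijo sufficient-decrease
% condition f(x + a*p) <= f(x) + c*a*g(x)'*p is satisfied
function a = armijo(f, g, x, p)
a = 1.0; rho = 0.5; c = 1e-4;                 % illustrative parameters
while f(x + a*p) > f(x) + c*a*(g(x)'*p) && a > 1e-10
    a = rho*a;                                 % reduce the step length
end
end

Such a routine could replace the inline Armijo loops used in the programs that follow.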

clc;clear; x(1,:)= [0.1 0.2]; B = eye(2,2); p= [0.1;0.2]; q= [1;2];


% objective function, its gradient and Hessian
f = @(x1,x2) -4*x1 - 2*x2 - x1.^2 + 2*x1.^4 - 2*x1.*x2 + 3*x2.^2;
g = @(x)[-4-2*x(1)+8*x(1)^3-2*x(2); -2-2*x(1)+6*x(2)];
%Hessian = @(x) [-2+24*x(1)^2, -2; -2, 6];
i=1; tol=0.001; alpha=0.19; % fixed step size (i.e. not optimal one)
while norm(g(x(i,:)))>=tol
x(i+1,:) = x(i,:) - alpha*(B*g(x(i,:)))';
B = B + ((p-B*q)*(p-B*q)')/(q'*(p-B*q));
p = (x(i+1,:)- x(i,:))';
q = g(x(i+1,:)) - g(x(i,:));
i=i+1;
end
x(end,:)
q
i
% plot contour lines
f = @(x1,x2) -4*x1 - 2*x2 - x1.^2 + 2*x1.^4 - 2*x1.*x2 + 3*x2.^2;
[x, y] = meshgrid(-0.25:0.01:1.75, -0.25:0.0025:1.75);
contour(x,y,f(x,y),[-4.34 -4.3 -4.2 -4.1 -4.0 -3 -2 -1
0],'ShowText','On'), hold on; grid on;
In numerical optimization, the
Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm is an iterative method for solving
unconstrained nonlinear optimization problems. The BFGS method belongs to quasi-
Newton methods, a class of hill-climbing optimization techniques that seek a stationary
point of a (preferably twice continuously differentiable) function. For such problems, a
necessary condition for optimality is that the gradient be zero. Newton's method and
the BFGS methods are not guaranteed to converge unless the function has a quadratic
Taylor expansion near an optimum. However, BFGS can have acceptable performance
even for non-smooth optimization instances.

In Quasi-Newton methods, the Hessian matrix of second derivatives is not computed.


Instead, the Hessian matrix is approximated using updates specified by gradient
evaluations (or approximate gradient evaluations). The BFGS method is one of the most
popular members of this class.

The optimization problem is to minimize 𝑓(𝐱), where 𝐱 is a vector in ℝ𝑛 , and 𝑓 is a


differentiable scalar function. There are no constraints on the values that 𝐱 can take.
The algorithm begins at an initial estimate for the optimal value 𝐱 0 and proceeds
iteratively to get a better estimate at each stage.

The search direction 𝒑𝑘 at stage k is given by the solution of the analogue of the Newton
equation: 𝑯𝑘 𝒑𝑘 = −∇𝑓(𝐱 𝑘 ) where 𝑯𝑘 is an approximation to the Hessian matrix, which
is updated iteratively at each stage, and ∇𝑓(𝐱 𝑘 ) is the gradient of the function evaluated
at 𝐱 𝑘 . A line search in the direction 𝒑𝑘 is then used to find the next point 𝐱 𝑘+1 by
minimizing 𝑓(𝐱 𝑘 + 𝛾𝒑𝑘 ) over the scalar 𝛾 > 0. The quasi-Newton condition imposed on
the update of 𝑯𝑘 is
∇𝑓(𝐱 𝑘+1 ) − ∇𝑓(𝐱 𝑘 ) = 𝑯𝑘+1 (𝐱 𝑘+1 − 𝐱 𝑘 )

Let 𝒚𝑘 = ∇𝑓(𝐱 𝑘+1 ) − ∇𝑓(𝐱 𝑘 ) and 𝒔𝑘 = 𝐱 𝑘+1 − 𝐱 𝑘 then 𝑯𝑘+1 satisfies 𝑯𝑘+1 𝒔𝑘 = 𝒚𝑘 which is the
secant equation. The curvature condition 𝒔𝑇𝑘 𝑯𝑘+1 𝒔𝑘 = 𝒔𝑇𝑘 𝒚𝑘 > 0 should be satisfied for
𝑯𝑘+1 to be positive definite. If the function is not strongly convex, then the condition has
to be enforced explicitly.

Instead of requiring the full Hessian matrix at the point 𝐱 𝑘+1 to be computed as 𝑯𝑘+1,
the approximate Hessian at stage k is updated by the addition of two matrices:

𝐇𝑘+1 = 𝐇𝑘 + 𝐔𝑘 + 𝐕𝑘 = 𝐇𝑘 + 𝛼(𝐮𝑘 𝐮𝑇𝑘 ) + 𝛽(𝐯𝑘 𝐯𝑘𝑇 )

Both 𝐔𝑘 and 𝐕𝑘 are symmetric rank-one matrices, but their sum is a rank-two update
matrix. Imposing the secant condition, 𝑯𝑘+1 𝒔𝑘 = 𝒚𝑘 Choosing 𝐮𝑘 = 𝒚𝑘 and 𝐯𝑘 = 𝐇𝑘 𝒔𝑘 , we
can obtain:
𝛼 = 1/(𝐲𝑘𝑇 𝒔𝑘 ),  𝛽 = −1/(𝒔𝑘𝑇 𝑯𝑘 𝒔𝑘 )

Finally, we substitute 𝛼 and 𝛽 into 𝐇𝑘+1 = 𝐇𝑘 + 𝛼𝐮𝑘 𝐮𝑘𝑇 + 𝛽𝐯𝑘 𝐯𝑘𝑇 and get the update equation of 𝐇𝑘+1 :

𝐇𝑘+1 = 𝐇𝑘 + (𝐲𝑘 𝐲𝑘𝑇 )/(𝐲𝑘𝑇 𝒔𝑘 ) − (𝐇𝑘 𝒔𝑘 𝒔𝑘𝑇 𝐇𝑘𝑇 )/(𝒔𝑘𝑇 𝐇𝑘 𝒔𝑘 )

Algorithm: [BFGS algorithm with rank 2 correction]

Initialization: 𝐱 0 and 𝑯0
begin: k=1:n (until convergence)
	get 𝒑𝑘 by solving 𝑯𝑘 𝒑𝑘 = −∇𝑓(𝐱 𝑘 )
	get 𝛼𝑘 such that 𝛼𝑘 = arg min𝛾 𝑓(𝐱 𝑘 + 𝛾𝒑𝑘 )
	𝒔𝑘 = 𝛼𝑘 𝒑𝑘 and update 𝐱 𝑘+1 = 𝐱 𝑘 + 𝒔𝑘
	𝐲𝑘 = ∇𝑓(𝐱 𝑘+1 ) − ∇𝑓(𝐱 𝑘 )
	𝐇𝑘+1 = 𝐇𝑘 + (𝐲𝑘 𝐲𝑘𝑇 )/(𝐲𝑘𝑇 𝒔𝑘 ) − (𝐇𝑘 𝒔𝑘 𝒔𝑘𝑇 𝐇𝑘𝑇 )/(𝒔𝑘𝑇 𝐇𝑘 𝒔𝑘 )
end

The functional 𝑓(𝐱 𝑘 ) denotes the objective function to be minimized. Convergence can
be checked by observing the norm of the gradient, ‖∇𝑓(𝐱 𝑘 )‖2 ≤ 𝜀. In order to avoid the
inversion of 𝑯𝑘 at each step we apply the Sherman–Morrison formula

(𝑨 + 𝐮𝐯 𝑇 )−1 = 𝑨−1 − (𝑨−1 𝐮𝐯 𝑇 𝑨−1 )/(1 + 𝐯 𝑇 𝑨−1 𝐮)

We get

𝐁𝑘+1 = (𝑰 − (𝒔𝑘 𝐲𝑘𝑇 )/(𝒔𝑘𝑇 𝐲𝑘 )) 𝐁𝑘 (𝑰 − (𝐲𝑘 𝒔𝑘𝑇 )/(𝒔𝑘𝑇 𝐲𝑘 )) + (𝒔𝑘 𝒔𝑘𝑇 )/(𝒔𝑘𝑇 𝐲𝑘 )
     = 𝐁𝑘 + ((𝒔𝑘𝑇 𝐲𝑘 + 𝐲𝑘𝑇 𝐁𝑘 𝐲𝑘 )(𝒔𝑘 𝒔𝑘𝑇 ))/(𝒔𝑘𝑇 𝐲𝑘 )2 − (𝐁𝑘 𝐲𝑘 𝒔𝑘𝑇 + 𝒔𝑘 𝐲𝑘𝑇 𝐁𝑘 )/(𝒔𝑘𝑇 𝐲𝑘 )

Algorithm: [BFGS algorithm with rank 2 correction – without inversion]

Initialization: 𝐱 0 and 𝐁0
begin: k=1:n (until convergence)
	get the direction 𝒑𝑘 = −𝐁𝑘 ∇𝑓(𝐱 𝑘 )
	get 𝛼𝑘 such that 𝛼𝑘 = arg min𝛾 𝑓(𝐱 𝑘 + 𝛾𝒑𝑘 )
	𝒔𝑘 = 𝛼𝑘 𝒑𝑘 and update 𝐱 𝑘+1 = 𝐱 𝑘 + 𝒔𝑘
	𝐲𝑘 = ∇𝑓(𝐱 𝑘+1 ) − ∇𝑓(𝐱 𝑘 )
	𝐁𝑘+1 = 𝐁𝑘 + ((𝒔𝑘𝑇 𝐲𝑘 + 𝐲𝑘𝑇 𝐁𝑘 𝐲𝑘 )(𝒔𝑘 𝒔𝑘𝑇 ))/(𝒔𝑘𝑇 𝐲𝑘 )2 − (𝐁𝑘 𝐲𝑘 𝒔𝑘𝑇 + 𝒔𝑘 𝐲𝑘𝑇 𝐁𝑘 )/(𝒔𝑘𝑇 𝐲𝑘 )
end

Remark: In general, the finite difference approximations of the Hessian are more
expensive than the secant condition updates. (Walter Gander and Martin J Gander)
clear all, clc, tol=10^-4; x(:,1)= [0.8624 0.1456]; z=[]; B=eye(2,2);
f=@(x)x(1)^2-x(1)*x(2)-3*x(2)^2+5; J=@(x)[2*x(1)-x(2);-x(1)-6*x(2)];
% f=@(x)2*x(1)^2+3*x(2)^2+4*x(3)^2-10; J=@(x)[4*x(1);6*x(2);8*x(3)];
% f=@(x)3*sin(x(1))+exp(x(2)); J=@(x)[3*cos(x(1));exp(x(2))];
i=1; %matlab starts counting at 1
while and(norm(J(x(:,i)))>0.001,i<500)
p(:,i)=-B*J(x(:,i));
%------------------------------------------------------------%
% armijo method for alpha determination
alp=0.01; % initial step
rho=0.01; c=0.01; % rho and c are in (0,1);
x1=x(:,i); x2=x(:,i)+alp*p(:,i);
f2=f(x2); f1=f(x(:,i));
while and(f2>f1+c*alp*(J(x(:,i)))'*p(:,i), alp>10^-6)
alp=rho*alp;
f2=f(x(:,i)+alp*p(:,i));
end
%------------------------------------------------------------%
s=alp*p(:,i); x(:,i+1)=x(:,i) + s; y=J(x(:,i+1))-J(x(:,i));
B = B + ((s'*y + y'*B*y)/(s'*y)^2)*(s*s') -(B*y*s'+ (s*y')*B)/(s'*y);
i=i+1;
end
x(:,end), fmax=f(x(:,end)), Gradient=J(x(:,end))

Gradient descent is a first-order iterative


optimization algorithm for finding a local minimum of a differentiable function. To find
a local minimum of a function using gradient descent, we take steps proportional to the
negative of the gradient (or approximate gradient) of the function at the current point.

Gradient descent is based on the observation that if the scalar multi-variable function
𝑓(𝐱) is defined and differentiable in a neighborhood of a point 𝒂, then 𝑓(𝐱) decreases
fastest if one goes from 𝒂 in the direction of the negative gradient of 𝑓(𝐱) at 𝒂, −∇𝑓 (𝒂).
It follows that, if 𝐚𝑘+1 = 𝐚𝑘 − 𝛼∇𝑓(𝐚𝑘 ) for 𝛼 > 0 small enough, then 𝑓(𝐚𝑘 ) ≥ 𝑓(𝐚𝑘+1 ). In other words, the term 𝛼∇𝑓(𝐚𝑘 ) is subtracted from 𝐚𝑘 because we want to move against the gradient, toward the local minimum.

With this observation in mind, one starts with a guess 𝐱 0 for a local minimum of 𝑓(𝐱),
and considers the sequence 𝐱1 , 𝐱 2 , …, such that 𝐱 𝑘+1 = 𝐱 𝑘 − 𝛼𝑘 ∇𝑓(𝐱 𝑘 ).

Note that the value of the step size 𝛼𝑘 is allowed to change at every iteration. With
certain assumptions on the function 𝑓(𝐱) (for example, 𝑓(𝐱) convex and ∇𝑓(𝐱) Lipschitz)
and particular choices of 𝛼𝑘 convergence to a local minimum can be guaranteed.
According to the Wolfe conditions, or to the Barzilai–Borwein method, the step size can be chosen as

𝛼𝑘 = (𝐱 𝑘 − 𝐱 𝑘−1 )𝑇 (∇𝑓(𝐱 𝑘 ) − ∇𝑓(𝐱 𝑘−1 )) / ‖∇𝑓(𝐱 𝑘 ) − ∇𝑓(𝐱 𝑘−1 )‖2  or  𝛼𝑘 = (∇𝑓(𝐱 𝑘 ))𝑇 ∇𝑓(𝐱 𝑘 ) / ((∇𝑓(𝐱 𝑘 ))𝑇 𝐇(𝐱 𝑘 )∇𝑓(𝐱 𝑘 ))

A small sketch of the Barzilai–Borwein variant is given after the algorithm below.
Algorithm: [Gradient descent algorithm]
Initialization: 𝐱 0 and 𝛼0
begin: k=1:n (until convergence)

	𝛼𝑘 = (∇𝑓(𝐱 𝑘 ))𝑇 ∇𝑓(𝐱 𝑘 ) / ((∇𝑓(𝐱 𝑘 ))𝑇 𝐇(𝐱 𝑘 )∇𝑓(𝐱 𝑘 ))
	𝐱 𝑘+1 = 𝐱 𝑘 − 𝛼𝑘 ∇𝑓(𝐱 𝑘 )

end
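As promised above, here is a minimal sketch of gradient descent with the (first) Barzilai–Borwein step size; the test function, starting point and tolerance are illustrative choices:

% Gradient descent with the Barzilai-Borwein step size
clear all, clc
J=@(x)[2*x(1)+x(2); x(1)+6*x(2)];  % gradient of f = x1^2 + x1*x2 + 3*x2^2
x=[3;3]; a=0.01;                   % starting point and initial step
xold=x; x=x-a*J(x);                % one plain gradient step to initialize
while norm(J(x))>1e-8
    s=x-xold; y=J(x)-J(xold);      % differences of iterates and gradients
    a=(s'*y)/(y'*y);               % Barzilai-Borwein step length
    xold=x; x=x-a*J(x);
end
x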

Understanding gradient descent: Suppose you are on the peak of a mountain and you want to reach a lake which is at the lowest point of the mountain. Which way will you move? The simplest way would be to check, at the point where you are standing, the direction in which the ground descends the most, and to start moving that way. There is a high chance that this path will lead you to the lake. This is what is depicted in the accompanying picture.

Graphically it can be visualized as follows: the peaks (red areas) represent regions with high cost, whereas the lowest points (blue areas) are regions with minimum cost or loss. In any optimization or deep learning problem, we try to find a model function that gives predictions having the least loss in comparison to the actual values. Suppose our model function has two parameters; then, mathematically, we wish to find the optimal values of the parameters 𝜃1 and 𝜃2 that would minimize our loss. The loss (𝐽(𝜃)) surface shown in the figure tells us how our algorithm would perform if we chose particular values for the parameters. Here 𝜃1 and 𝜃2 are the x and y axes, while the loss is plotted on the z axis. The gradient descent rule states that the direction in which we should move should be at 180 degrees to the gradient; in other words, we move opposite to the gradient.

clear all, clc, i=1; x(i,:)=[3 3]; a=0.01;


f=@(x)x(1)^2+ x(1)*x(2)+3*x(2)^2; z=@(x1,x2)x1.^2 + x1.*x2 + 3*x2.^2;
J=@(x)[2*x(1)+ x(2);x(1)+6*x(2)]; H=@(x)[2 1;1 6];
while norm(J(x(i,:)))>0.000001
a= (J(x(i,:)))'*J(x(i,:))/((J(x(i,:)))'*H(x(i,:))*J(x(i,:))); % exact step
x(i+1,:)=x(i,:)-a*(J(x(i,:)))';
%ezcontour(z,[-5 5 -5 5]); axis equal; hold on
%plot([x(i+1,1) x(i,1)], [x(i+1,2) x(i,2)],'ko-'); hold on; refresh
i=i+1;
end
Iterations=i
Gradient=J(x(end,:))
x=(x(i,:))'
fmax=f(x)

• A case of remarkable interest, where the parameter 𝛼𝑘 can be computed exactly, is the problem of minimizing the quadratic model

𝑓(𝐱 𝑘 + 𝜹) = 𝑓(𝐱 𝑘 ) + (∇𝑓(𝐱 𝑘 ))𝑇 𝜹 + (1/2) 𝜹𝑇 𝑯(𝐱 𝑘 )𝜹

At a fixed instant of time 𝐱 𝑘 = constant this quadratic function can be considered as

𝜙(𝜹) = (1/2) 𝜹𝑇 𝑨𝜹 + 𝒃𝑇 𝜹 + 𝒄  with  𝑨 = 𝑯(𝐱 𝑘 ), 𝒃 = ∇𝑓(𝐱 𝑘 ) and 𝒄 = 𝑓(𝐱 𝑘 )

In order to minimize the new objective function we consider the gradient ∇𝜙(𝜹) = 𝑨𝜹 + 𝒃.
In terms of 𝐱 𝑘+1 we obtain ∇𝜙(𝐱 𝑘+1 ) = ∇𝑓(𝐱 𝑘+1 ) = 𝑨𝐱 𝑘+1 + 𝒃. As a consequence, all
gradient-like iterative methods developed in the previous chapter for linear systems,
can be extended to solve nonlinear minimization problems.

In particular, having fixed a descent direction 𝒑𝑘 = (𝐱 𝑘+1 − 𝐱 𝑘 )/𝛼𝑘 , we can determine the
optimal value of the acceleration parameter 𝛼𝑘 , in such a way as to find the point where
the function f, restricted to the direction 𝒑𝑘 , is minimized 𝛼𝑘 = arg min𝛼 𝑓(𝐱 𝑘 + 𝛼𝒑𝑘 ).
Setting to zero the directional derivative, we get:

0 = 𝑑𝑓(𝐱 𝑘 + 𝛼𝑘 𝒑𝑘 )/𝑑𝛼𝑘 = ∑𝑖=1,…,𝑛 (𝜕𝑓/𝜕𝑥𝑖 )(𝐱 𝑘 + 𝛼𝑘 𝒑𝑘 ) 𝒑𝑘 (𝑖) = (∇𝑓(𝐱 𝑘 + 𝛼𝑘 𝒑𝑘 ))𝑇 𝒑𝑘

But we have seen that ∇𝑓(𝐱 𝑘 + 𝛼𝑘 𝒑𝑘 ) = 𝑨(𝐱 𝑘 + 𝛼𝑘 𝒑𝑘 ) + 𝒃 = (𝑨𝐱 𝑘 + 𝒃) + 𝛼𝑘 𝑨𝒑𝑘

Therefore 𝑑𝑓(𝐱 𝑘 + 𝛼𝑘 𝒑𝑘 )/𝑑𝛼𝑘 = (𝛼𝑘 𝒑𝑘𝑇 𝑨 + (𝑨𝐱 𝑘 + 𝒃)𝑇 )𝒑𝑘 = 0

If we now define 𝒓𝑘 = −(𝑨𝐱 𝑘 + 𝒃) we obtain

𝛼𝑘 = 𝒓𝑘𝑇 𝒑𝑘 / (𝒑𝑘𝑇 𝑨𝒑𝑘 )

The plain gradient method is not a popular algorithm due to its slow convergence; a minimal sketch with this exact step length is given below. In order to increase the speed of convergence, a correction of the search direction called the conjugate gradient algorithm has been proposed (see the details in the previous chapter).
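Here is the promised minimal sketch of steepest descent with the exact step 𝛼𝑘 = 𝒓𝑘𝑇 𝒑𝑘 /(𝒑𝑘𝑇 𝑨𝒑𝑘 ) on a small quadratic; the matrix 𝑨, the vector 𝒃 and the tolerance are illustrative choices:

% Steepest descent with exact line search on phi(x) = 0.5*x'*A*x + b'*x
clear all, clc
A=[2 1; 1 6]; b=[-4; -2];   % illustrative symmetric positive definite data
x=[3; 3];                   % starting point
r=-(A*x+b);                 % residual = negative gradient
while norm(r)>1e-10
    p=r;                    % steepest descent direction
    alpha=(r'*p)/(p'*A*p);  % exact minimizer along p
    x=x+alpha*p;
    r=-(A*x+b);
end
x, check=-A\b               % compare with the exact minimizer -inv(A)*b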

The conjugate direction or conjugate gradient method only requires a simple


modification of the gradient method, with a remarkable increase in the convergence
rate. It is as simple to program as the gradient method. Fletcher Reeves (FR) extends
the linear conjugate gradient method to nonlinear functions by incorporating two
changes:

■ For the step length 𝛼𝑘 , (which minimizes f along the search direction 𝒑𝑘 ), we perform
a line search that identifies the approximate minimum of the nonlinear function f along
the search direction 𝒑𝑘 .

■ The residual 𝒓𝑘 (𝒓𝑘 = −(𝒃 + 𝑨𝐱 𝑘 )), which is the gradient of function f has to be
replaced by the gradient of the nonlinear objective function.

Remark: An appropriate step length effecting sufficient decrease could be chosen from
one of the various known methods such as the Armijo, the Goldstein or the Wolfe’s
conditions. Moreover, if f is a strongly convex quadratic function and 𝛼𝑘 is the exact
minimizer of the function f, then the FR algorithm becomes specifically the linear
conjugate gradient algorithm.

The search direction is modified slightly so that (𝒑𝑘 − 𝛽𝑘 𝒑𝑘−1 ) = −∇𝑓(𝐱 𝑘 ), that is

𝒑𝑘 = 𝛽𝑘 𝒑𝑘−1 − ∇𝑓(𝐱 𝑘 ), which, since 𝒑𝑘−1 = (𝐱 𝑘 − 𝐱 𝑘−1 )/𝛼𝑘−1 , expresses 𝐱 𝑘+1 = 𝐱 𝑘 + 𝛼𝑘 𝒑𝑘 in terms of 𝐱 𝑘 and 𝐱 𝑘−1

and the scalar 𝛽𝑘 is given by the equation 𝛽𝑘 = ‖∇𝑓(𝐱 𝑘 )‖2 /‖∇𝑓(𝐱 𝑘−1 )‖2 , where k is the iteration index. The conjugate gradient method is implemented in MATLAB below.

Algorithm: [Nonlinear conjugate gradient (Fletcher–Reeves) algorithm]

Initialization: 𝐱 0 and 𝒑0 = −∇𝑓(𝐱 0 )
begin: k=0:n (until convergence)

	get 𝛼𝑘 such that 𝛼𝑘 = arg min𝛼 𝑓(𝐱 𝑘 + 𝛼𝒑𝑘 )
	𝐱 𝑘+1 = 𝐱 𝑘 + 𝛼𝑘 𝒑𝑘
	𝛽𝑘+1 = ‖∇𝑓(𝐱 𝑘+1 )‖2 /‖∇𝑓(𝐱 𝑘 )‖2
	make the update 𝒑𝑘+1 = −∇𝑓(𝐱 𝑘+1 ) + 𝛽𝑘+1 𝒑𝑘

end
clear all, clc, tol=10^-5; x(:,1)=10*rand(2,1);
f=@(x)x(1)^2+x(1)*x(2)+3*x(2)^2+100;
J=@(x)[2*x(1)+x(2);x(1)+6*x(2)]; p(:,1)=-J(x(:,1));
i=1; % matlab starts counting at 1
finalX = x ; % initialize the vector
finalf =f(x(:,1)); z=[];

while and(norm(J(x(:,i)))>0.001,i<500)
%-------------------------------------------------%
% Armijo method for alpha determination
alp=0.01; % initial step
rho=0.01; c=0.02; % rho and c are in (0,1);
x1=x(:,i); x2=x(:,i)+alp*p(:,i);
f2=f(x2); f1=f(x(:,i));
while and(f2>f1+c*alp*(J(x(:,i)))'*p(:,i), alp>10^-6)
alp=rho*alp;
f2=f(x(:,i)+alp*p(:,i));
end
%-------------------------------------------------%
x(:,i+1)=x(:,i) + alp*p(:,i);
beta=((J(x(:,i+1)))'*J(x(:,i+1)))/((J(x(:,i)))'*J(x(:,i)));
p(:,i+1)=-J(x(:,i+1)) + beta*p(:,i);
i=i+1;
z=[z,f(x(:,i))];
end
Iter=i
xmax=x(:,end)
fmax=f(x(:,end))
Gradient=J(x(:,end))

%-------------------------------------------------%

figure(1)
X= x(1,1:end-1); Y= x(2,1:end-1); Z= z;
plot3(X,Y,Z ,'bo-','linewidth',0.1);
hold on

figure(2)
[X,Y]= meshgrid([-3:0.5:3]) ;
Z=X.^2+X.*Y+3*Y.^2+5;
S=mesh(X,Y,Z); %plotting the surface
title('Subrats Pics'), xlabel('x'), ylabel('y')
To address the shortcomings of the original Newton method, several variations of the technique were suggested to guarantee convergence to a local minimum. One of the most important variations is the Levenberg–Marquardt method. This method effectively uses a step that is a combination of the Newton step and the steepest descent step:

𝐱 𝑘+1 = 𝐱 𝑘 − (𝑯(𝐱 𝑘 ))−1 ∇𝑓(𝐱 𝑘 )  (Newton)
𝐱 𝑘+1 = 𝐱 𝑘 − 𝜇∇𝑓(𝐱 𝑘 )  (steepest descent)
⟹ 𝐱 𝑘+1 = 𝐱 𝑘 − (𝑯(𝐱 𝑘 ) + 𝜇𝑰)−1 ∇𝑓(𝐱 𝑘 )  (Levenberg–Marquardt)

where 𝜇 is a positive scalar and 𝑰 ∈ ℝ𝑛×𝑛 is the identity matrix. Notice that in last
equation if 𝜇 is small enough, the Hessian matrix 𝑯(𝐱 𝑘 ) dominates and the method
becomes effectively a Newton’s step. If the parameter 𝜇 is large enough, the matrix 𝜇𝑰
dominates and the method is approximately in the steepest descent direction. By
increasing 𝜇, the inverse matrix becomes small in norm and subsequently the norm of
the step taken ‖𝐱 𝑘+1 − 𝐱 𝑘 ‖ becomes smaller. It follows that the parameter 𝜇 controls
also the step size.

One interesting mathematical property of this approach is that adding the matrix 𝜇𝑰 to
the Hessian matrix increases each eigenvalue of this matrix by 𝜇. If the matrix 𝑯(𝐱 𝑘 ) is
not positive semi-definite then adding 𝜇 to each eigenvalue makes them more positive.
The value of 𝜇 can be increased until all the eigenvalues are positive thus guaranteeing
that the step (𝐱 𝑘+1 − 𝐱 𝑘 ) is a descent step. The Levenberg–Marquardt approach starts
each iteration with a very small value of 𝜇, thus giving effectively the Newton’s step. If
an improvement in the objective function is achieved, the new point is accepted.
Otherwise, the value of 𝜇 is increased until a reduction in the objective function is
obtained.

Example: Find the minimum of the function

𝑓(𝐱 𝑘 ) = 1.5𝑥12 + 2𝑥22 + 1.5𝑥32 + 𝑥1 𝑥3 + 2𝑥2 𝑥3 − 3𝑥1 − 𝑥3

starting from the point 𝐱 0 = [3.0 − 7.0 0]𝑇 . Utilize the MATLAB software

% The Levenberg Marquardt Method


clear all, clc,
f=@(x)1.5*x(1)^2+2*x(2)^2+1.5*x(3)^2+x(1)*x(3)+2*x(2)*x(3)-3*x(1)-x(3);
G=@(x)[3*x(1)+x(3)-3; 4*x(2)+2*x(3); 3*x(3)+ x(1) + 2*x(2)-1];
H=@(x)[3 0 1;0 4 2;1 2 3];

n=3; %Number of Parameters


x0=[3 -7 0]'; %This is the starting point
f0=f(x0); %initial function value
G0=G(x0); %initial gradient
G0Norm=norm(G0); %get the old gradient norm
H0=H(x0); %initial Hessian
I=eye(n); %This is the identity matrix
while (G0Norm>1.0e-5) %repeat until gradient is small enough
u=0.001; %initialize trust region parameter
DescentFlag=0; %flag signaling if a descent step
while(DescentFlag==0) %repeat until descent direction found
M=H0+u*I; %Marquardt Matrix
dx=-1.0*inv(M)*G0;
x=x0+dx; %get the new trial point
fNew=f(x); %calculate new value
if(fNew<f0) %a descent step?
DescentFlag=1.0; %set success flag
else
u=u*4; %Increase Mu
end
end
dx =x-x0; %get the new step
StepNorm=norm(dx); %get the step norm
GNew=G(x); %get new gradient
HNew =H(x); %get new Hessian
%now we swap parameters
x0=x; G0=GNew; H0=HNew;
G0Norm=norm(GNew);
end
x=x0, f=f(x0), G=G(x0), Ndx=StepNorm,

The algorithm terminated in only one iteration. The exact solution for this problem is
𝐱 ⋆ = [1.0 0.0 0.0]𝑇 with a minimum value of 𝑓(𝐱 ⋆ ) = −1.50.

Convexity The concept of convexity is fundamental in optimization. Many practical


problems possess this property, which generally makes them easier to solve both in
theory and practice. The term “convex” can be applied both to sets and to functions. A
set 𝑆 ⊂ ℝ𝑛 is a convex set if the straight line segment connecting any two points in 𝑆 lies
entirely inside 𝑆. Formally, for any two points 𝐱 ∈ 𝑆 and 𝐲 ∈ 𝑆, we have 𝛼𝐲 + (1 − 𝛼)𝐱 ∈ 𝑆
for all 𝛼 ∈ [0, 1]. The function 𝑓 is a convex function if its domain 𝑆 is a convex set and if
for any two points 𝐱 and 𝐲 in 𝑆, the following property is satisfied:

𝑓(𝛼𝐲 + (1 − 𝛼)𝐱) ≤ 𝛼𝑓(𝐲) + (1 − 𝛼)𝑓(𝐱)


Or
𝑓(𝛼(𝐲 − 𝐱) + 𝐱) ≤ 𝛼(𝑓(𝐲 ) − 𝑓(𝐱)) + 𝑓(𝐱)

𝑓(𝛼(𝐲 − 𝐱) + 𝐱) − 𝑓(𝐱) ≤ 𝛼(𝑓(𝐲) − 𝑓(𝐱))

As 𝛼 → 0, the Taylor series of 𝑓(𝛼(𝐲 − 𝐱) + 𝐱) yields

𝑓(𝐱) + 𝛼(∇𝑓(𝐱))𝑇 (𝐲 − 𝐱) − 𝑓(𝐱) ≤ 𝛼(𝑓(𝐲) − 𝑓(𝐱)) ⟺ (∇𝑓(𝐱))𝑇 (𝐲 − 𝐱) ≤ 𝑓(𝐲) − 𝑓(𝐱)

We conclude that (∇𝑓(𝐱))𝑇 (𝐲 − 𝐱) ≥ 0 ⟹ 𝑓(𝐲) ≥ 𝑓(𝐱); a quick numerical check of this first-order characterization is sketched below.
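A minimal numerical check of the inequality 𝑓(𝐲) ≥ 𝑓(𝐱) + (∇𝑓(𝐱))𝑇 (𝐲 − 𝐱) for a convex function follows; the test function and the random sampling are illustrative choices:

% Verify the first-order convexity condition on random point pairs
clear all, clc
f=@(x)x(1)^2+3*x(2)^2;      % a convex test function
J=@(x)[2*x(1); 6*x(2)];     % its gradient
ok=true;
for trial=1:1000
    x=randn(2,1); y=randn(2,1);
    if f(y) < f(x) + J(x)'*(y-x) - 1e-12   % small slack for round-off
        ok=false; break
    end
end
ok                          % stays true when no violation is found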
Theorem: Suppose that 𝑓 is differentiable function in a convex optimization problem.
Let 𝛀 denote the feasible set. Then 𝐱 is optimal if and only if 𝐱 ∈ 𝛀 and
𝑇
(∇𝑓(𝐱)) (𝐲 − 𝐱) ≥ 0

Proof: the Taylor series yields

𝑓(𝐲) = 𝑓(𝐱) + (∇𝑓(𝐱))𝑇 (𝐲 − 𝐱) + (1/2) 𝐝𝑇 𝑯(𝐱 + 𝛼𝐝)𝐝  with 𝛼 ∈ (0,1] and 𝐝 = 𝐲 − 𝐱

Now if 𝑯(𝐱) is positive semidefinite everywhere for 𝐱 ∈ 𝛀, then 𝐝𝑇 𝑯(𝐱 + 𝛼𝐝)𝐝 ≥ 0 and so

𝑓(𝐲) ≥ 𝑓(𝐱) + (∇𝑓(𝐱))𝑇 (𝐲 − 𝐱)

Now if 𝐱 ⋆ is a minimizer of 𝑓(𝐱) then 𝑓(𝐱 ⋆ ) ≤ 𝑓(𝐲) ∀𝐲 ∈ 𝛀, which leads to

(∇𝑓(𝐱 ⋆ ))𝑇 (𝐲 − 𝐱 ⋆ ) ≥ 0

When the feasible set is all of ℝ𝑛 , this last inequality reduces to (∇𝑓(𝐱 ⋆ ))𝑇 𝐳 ≥ 0 ∀𝐳 ∈ ℝ𝑛 , which in turn is equivalent to ∇𝑓(𝐱 ⋆ ) = 0. ■

Theorem: Any locally optimal point of a convex optimization problem is also (globally)
optimal.

Random search (RS) is a family of


numerical optimization methods that do not require the gradient of the problem being optimized; hence RS can be used on functions that are not continuous or differentiable. Such optimization methods are also known as direct-search, derivative-
free, or black-box methods. The name "random search" is attributed to Rastrigin who
made an early presentation of RS along with basic mathematical analysis.

Random search (RS) belongs to the fields of Global


Stochastic Optimization. Random search is a direct search method as it does not
require derivatives to search a continuous domain. To implement the method we need a
pseudorandom number generator. Fortunately, such pseudo-random number
generators with uniform distribution are implemented on most compilers. In order to
limit the search procedure to a confined space, we can impose on the decision variables
certain limits of the form:

𝐱 𝐨𝐩𝐭 = arg min𝐱 𝑓(𝑥1 , 𝑥2 , … , 𝑥𝑛 ) with 𝑎𝑘 ≤ 𝑥𝑘 (𝑖) ≤ 𝑏𝑘 , 𝑘 = 1,2, … , 𝑛

Where k indexes the variables and i the iterations. The MATLAB function rand generates a random number in the unit interval (0,1) at each call; let us call this number 𝑟 = rand. Since the search must be done in the interval (𝑎𝑘 , 𝑏𝑘 ) for each variable 𝑥𝑘 , we want the random number to be generated within this range. For this reason, the following transformation must be made: 𝑥𝑘 (𝑖) = 𝑎𝑘 + 𝑟𝑘 (𝑖)(𝑏𝑘 − 𝑎𝑘 ), 𝑘 = 1,2, … , 𝑛
clc;clear; nVar = 2; % the number of decision variables
N= 10000; % Number of random generated points
epsilon = 1e-3; % the convergence factor
a=zeros(1,nVar); b=zeros(1,nVar); % pre-allocation of vectors a and b
for i=1:nVar, a(i)=-1.50; b(i)=1.50; end % set-up of the search limits
fMin = 1e6; % initialize fMin
fPrecedent = fMin;
for i=1:N % global search procedure
x1 = a(1)+ rand*(b(1)-a(1)); % random generation: variable x1
x2 = a(2)+ rand*(b(2)-a(2)); % random generation: variable x2
f=@(x,y)2*x+y+(x.^2-y.^2)+(x-y.^2).^2; % The objective function
func =f(x1,x2);
if (func<fMin)
fMin = func; x1Min = x1; x2Min = x2;
if abs(fMin - fPrecedent)<=epsilon
break;
else
fPrecedent= fMin;
end, end, end
x1=x1Min, x2=x2Min, fMin =fMin(end)
J=@(x,y)[4*x-2*y^2+2;-4*x*y+4*y^3-2*y+1];
Jmin=J(x1, x2), fmin=f(x1, x2),

>> x1 = -0.2267
>> x2 = -0.7629
>> Jmin = [-0.070971; 0.057800]
>> fmin = -1.0929
The search efficiency depends on the number n of randomly generated points within the
search domain (𝐚𝑘 , 𝐛𝑘 ).

Random Walk is an algorithm that provides


random paths in a graph. A random walk means that we start at one node, choose a
neighbor to navigate to at random or based on a provided probability distribution, and
then do the same from that node, keeping the resulting path in a list.

To find the solution to the minimization problem, the random path method uses an
iterative relationship of the form 𝐱(𝑖 + 1) = 𝐱(𝑖) + 𝛼𝑖 𝐬(𝑖) & 𝑖 = 1,2, … , 𝑛 where i is an
iterative index, 𝐱(𝑖) is the vector of the decision variables, 𝛼𝑖 is a step size, at iteration i
called acceleration factor in the 𝐬(𝑖) direction, and 𝐬(𝑖) is the vector of the minimization
direction. The search procedure starts from a randomly chosen point. Whatever this
start point is, we have to reach the same solution. The coordinates of the minimization
direction vector 𝐬𝑘 , are randomly chosen using the rand function.
Algorithm: [Random Walk algorithm]
Step 1: choose 𝐱(0) and 𝑁max
set 𝑖 = 1
Step 2: for each iteration 𝑖 do
𝐬(𝑖) = random vector
Step 3: 𝐱(𝑖 + 1) = 𝐱(𝑖) + 𝛼𝑖 𝐬(𝑖)

𝛼𝑖 is determined to minimize 𝑓(𝐱(𝑖 + 1)) & 𝑖 ← 𝑖 + 1


If 𝑖 < 𝑁max go to step 2
else stop (iteration exceeded)
end
end

Remark: The convergence of this algorithm is slow and not guaranteed in general, it is
dependent strongly on the convexity of the objective function.

%f=@(x)2*x(1)+x(2)+(x(1).^2+x(2).^2)+(x(1)+x(2).^2).^2;
%J=@(x)[4*x(1)+2*x(2)^2+2;4*x(1)*x(2)+4*x(2)^3+2*x(2)+1];
%---------------------------------------------------------------%
clear all, clc, n=0; nMax=5; xzero=rand(2,1); epsilon=1e-4; alfa0=0.01;
f=@(x)2*x(1)+x(2)+(x(1).^2-x(2).^2)+(x(1)-x(2).^2).^2;
a = -1.0 ; b = 1.0; % the range for s
F0=f(xzero); Fprecedent=F0; % the function value at the start point
f0=F0; s=rand(2,1); alfa = alfa0; increment = alfa0;
xone = xzero + alfa*s; % generate a next iteration Xl
F1 = f(xone); % the objective function value in Xl
Factual = F1;
i=1; % initialize the counter i
go = true; % variable 'go' remains 'true' as long as
% the convergence criteria are not fulfilled
while go
while (Factual>=Fprecedent)
s = rand(2,1); s = a*[1;1] + s*(b-a); % generate a random direction s
xone = xzero + alfa*s;
F1 = f(xone); Factual = F1;
end
i=i+1; f1=F1;
while (Factual<Fprecedent)
Fprecedent = Factual;
alfa = alfa + increment;
xone = xzero + alfa*s; F1 = f(xone);
end
deltaF = abs(F1-Fprecedent); F0 = Factual; xzero = xone; alfa = alfa0;
if(abs(f0-f1)<=epsilon) n = n + 1; end
f0 =f1;
if(n==nMax) go = false; break; end
end
J=@(x)[4*x(1)-2*x(2)^2+2;-4*x(1)*x(2)+4*x(2)^3-2*x(2)+1];
xone, Factual, Jmin=J(xone), fmin=f(xone),

Methods for which the search principle


is based on random numbers are generally called the Monte Carlo methods. At present,
random numbers are often replaced by computer-generated pseudo-random numbers
by randomization procedures. The Monte Carlo method has been successfully applied
to solving linear equation systems, to calculating the inverse matrix, to evaluating
multiple integrals, to solving the Dirichlet problem, to solving functional equations of a
variety of types and so on. It has also been used in the field of nuclear physics.

The Monte Carlo method is based on the following principle: if the best option is
needed, it should be tried "at random" many times and then the best option found
between those attempts chosen. If there are enough different attempts, the best option
found will almost certainly be an optimal global value. This method is valid both
mathematically and intuitively. The advantages of the method are both its simplicity
and its universality. But it has the disadvantage of being too slow.

The Monte Carlo idea: It is preferred to explain


the Monte Carlo principle (method) by an
illustration or exemplification of a simple integral
calculation problem. For this, the following
integral will be considered:
𝑦 = ∫₀¹ √(1 − 𝑥 2 ) 𝑑𝑥 = ∫₀^(𝜋/2) sin2 (𝜃) 𝑑𝜃 = 𝜋/4

The function under the integral represents a 90° circle arc, which can be inscribed in a square whose edge size is the unit. Obviously, the area of the square is 𝑆 = 1, and the area of the quarter-circle is 𝐴 = 𝜋𝑆/4.

Let’s pretend that we don’t know the value of 𝜋. To calculate it, we will generate a large
number 𝑁 of random points in the unit square. By 𝑛 we will denote the number of
points lying inside the quarter-circle. As you will certainly agree, with large 𝑁 the ratio
𝑛/𝑁 must be very similar to the ratio of 𝐴/𝑆. And that’s all! From the equation 𝑛/𝑁 = 𝐴/𝑆
we can easily express 𝜋 = 4𝑛/𝑁. This proportion becomes more and more accurate as the number 𝑁 of uniformly distributed points generated on the square area becomes higher.
The corresponding MATLAB program is presented below.
clear all, clc, nmax = 5000;
x = rand(nmax,1); y = rand(nmax,1); x1=x-0.5; y1=y-0.5;
r = sqrt(x1.^2+y1.^2) ;
% get logicals
inside = r<=0.5; outside = r>0.5;
% plot
plot(x1(inside),y1(inside),'b.');
hold on
plot(x1(outside),y1(outside),'r.');
axis equal
% get pi value
thepi = 4*sum(inside)/nmax;
fprintf('%8.4f\n',thepi)

In the following, a global optimization algorithm


applicable to solving nonconvex problems is proposed. It is effective even if a problem
has multiple local optima. Let us imagine that our objective function has the equation:

𝑓(𝑥, 𝑦) = −0.02 sin(𝑥 + 4𝑦) − 0.2 cos(2𝑥 + 3𝑦) − 0.2 sin(2𝑥 − 𝑦) + 0.4 cos(𝑥 − 2𝑦)

While the objective function depends on two variables 𝑥 and 𝑦, its graphical
representation is a surface. From Fig. it is evident that this surface has many peaks
and valleys, interpretable as many local minimum (or maximum) points, depending on
the problem scope. Usually, a numerical optimization procedure risks ending in a local
optimum point instead of an absolute minimum point.
% This program draws the mesh of a multimodal
% function that depends on two variables
clear all, clc
f=@(x1,x2)-0.02*sin(x1+4*x2)-0.2*cos(2*x1+3*x2)-0.2*sin(2*x1-x2)+0.4*cos(x1-2*x2);
a1=-2.5;a2=-2.5; b1=2.5;b2=2.5; increment1=0.1; increment2=0.1;
n1=(b1-a1)/increment1; n2=(b2-a2)/increment2; fGraph = zeros(n1,n2);
x1 = a1;
for i1 = 1:n1
x2 = a2;
for i2 = 1:n2
fGraph (i1,i2)=f(x1,x2);
x2 = x2 + increment2;
end
x1 = x1 + increment1;
end
mesh(fGraph) ; % drawing of fGraph
clear all, clc

f=@(x1,x2)-0.02*sin(x1+4*x2)-0.2*cos(2*x1+3*x2)-0.2*sin(2*x1-x2)+0.4*cos(x1-2*x2);

n=10000; % set the number of random numbers generated


gridNumber = 250; % set the grid numbers for local search
isd=20; % set the interval size divisor for local search
a1=-3; a2=-3; b1=3;b2=3; % the initial search limits on each axis
delta = 1.0e3; % set the absolute value of difference df(x)
epsilon = 1.0e-4; % set the convergence criterion value
minF = 1.e20; % set the minimum initial value of f

%------% open a text file to save results %------%


fp = fopen('results.txt','w');
fprintf(fp, ' GLOBAL OPTIMIZATION METHOD\n\n');
fprintf(fp, ' the size of random numbers n = %d\n',n);
fprintf(fp, ' grid numbers for local search : %d\n', gridNumber);
fprintf(fp, ' interval size divisor for local');
fprintf(fp, ' search : %d\n', isd);
fprintf(fp, ' initial search limits on each axis\n');
fprintf(fp, ' a1 = %5.2f b1 = %5.2f\n', a1 , b1);
fprintf(fp, ' a2 = %5.2f b2 = %5.2f\n', a2 , b2);
fprintf(fp, ' the absolute difference between');
fprintf(fp, ' function values :\n');
fprintf(fp, ' delta = %f\n' , delta);
fprintf(fp, ' the convergence criterion value :\n');
fprintf(fp, ' epsilon = %f\n', epsilon );
%------% Global search (by a Monte Carlo method) %------%
for i = 1:n
x1 = a1 + (b1 - a1)*rand;
x2 = a2 + (b2 - a2)*rand;
func = f(x1,x2);
if func<minF, minF = func; x1_min=x1; x2_min=x2; end
end
precF = minF;

%------% Local search (by a Multi-Grid method) %------%


ls1 = x1_min - abs(b1-a1)/isd; % left limit, x1 axis
if ls1<a1, ls1 = a1; end
ld1 = x1_min + abs(b1-a1)/isd; % right limit, x1 axis
if ld1>b1, ld1 = b1; end
ls2 = x2_min - abs(b2-a2)/isd; % left limit, x2 axis
if ls2<a2, ls2 = a2; end
ld2 = x2_min + abs(b2-a2)/isd; % right limit, x2 axis
if ld2>b2, ld2 = b2; end

while delta>epsilon
%------% The block for xl variable (keep x2=constant) %------%
x2 = x2_min; x1 = ls1;
increment = abs(ld1-ls1)/gridNumber;
while x1<=ld1
func = f(x1,x2);
if func<minF, minF = func; x1_min=x1; end
x1 = x1 + increment;
end

ls1 = x1_min - increment;


if ls1<a1, ls1 = a1; end
ld1 = x1_min + increment;
if ld1>b1, ld1 = b1; end
%------% The block for x2 variable (keep x1=constant) %------%
x1 = x1_min; x2 = ls2; increment = abs(ld2-ls2)/gridNumber; % restart x2 at the left bound
while x2<=ld2
func = f(x1,x2);
if func<minF, minF = func; x2_min = x2; end
x2 = x2 + increment;
end

ls2 = x2_min - increment;


if ls2<a2, ls2 = a2; end
ld2 = x2_min + increment;
if ld2>b2, ld2 = b2; end

actF = minF;
% check the convergence criterion
delta=abs(actF-precF);
precF = actF;
end

% draw the surface graph


x1 =-3:0.1:3 ; x2 =-3 :0.1:3;
[x1,x2] = meshgrid (x1,x2); % create the meshgrid
F = f(x1,x2); contour (x1,x2,F,15);

% mark the optimum solution


hold on;
scatter(x1_min, x2_min,'markerfacecolor', 'r');
hold off;

fprintf (fp, ' \n\n the minimum value : Fmin =');


fprintf (fp, ' %f\n',minF);
fprintf (fp, ' the coordinates of the optimal');
fprintf( fp, ' point : \n ' );
fprintf (fp, ' x1 = %f x2 = %f\n' , x1_min, x2_min);
fclose (fp); % close the text file

x1_min, x2_min, minF

Explanation: Initially the global search


procedure is designed based on Monte Carlo
principles, which means that a number n of
pairs of points (𝑥𝑖 , 𝑦𝑖 ) are randomly generated
inside the search area, defined by the search
limits −3 ≤ 𝑥, 𝑦 ≤ 3. From these points, the point (𝑥, 𝑦) that has the minimum value is thought to have the best chance of being placed near the true optimum point. This statement becomes more accurate as the number 𝑛 of points randomly generated inside the search area increases. The search for this point, with value min𝐹 and coordinates (𝑥1𝑚𝑖𝑛 , 𝑥2𝑚𝑖𝑛 ), is done by the Monte Carlo method.

Starting from this point, a local search procedure is designed. This procedure is based
on the Grid method. First thing to do is to define a neighborhood of the starting point
on each axis. This neighborhood should be set for each axis separately. Then along
each axis a search of the minimum point in that direction is made successively.
In this particular case, the local search along the 𝑥1 axis starts from the left bound of the neighborhood, that is 𝑙𝑠1, while 𝑥2 is kept constant. Once the minimum point along the 𝑥1 axis is found, it is kept constant, while the local search is performed along the 𝑥2 axis. When the search along all axes is finished, one iteration is over. The value of the objective function at this point is compared with the value of the similar point at the previous iteration. If the difference between these two values, denoted delta, is less than an initially set precision factor called epsilon, then the search stops; otherwise it continues. ■ (Ancau Mircea 2019)

Example Find a solution for the two-dimensional optimization problem:

f=@(x1,x2)log((1+(x1-4/3).^2)+3*(x1+x2-(x1).^3).^2);
x1 =-2:0.1:2 ; x2 =-2:0.1:2;

Applying this global optimization algorithm, we get


x1_min = 1.3376
x2_min = 1.0555
minF = 1.8193e-05
In many optimization problems, the variables are interrelated by physical laws like the conservation of mass or energy, Kirchhoff’s voltage and current laws, and other system equalities that must be satisfied. In effect, in these problems certain equality constraints of the form ℎ𝑖 (𝐱) = 0 for 𝐱 ∈ 𝛀, where 𝑖 = 1, 2, … , 𝑝, must be satisfied before the problem can be considered solved. In other optimization problems a collection of inequality constraints might be imposed on the variables or parameters to ensure physical realizability, reliability, compatibility, or even to simplify the modeling of the problem. For example, the power dissipation might become excessive if a particular current in a circuit exceeds a given upper limit, or the circuit might become unreliable if another current is reduced below a lower limit; the mass of an element in a specific chemical reaction must be positive, and so on. In these problems, a collection of inequality constraints of the form 𝑔𝑗 (𝐱) ≥ 0 for 𝐱 ∈ 𝛀, where 𝑗 = 1, 2, … , 𝑞, must be satisfied before the optimization problem can be considered solved.

An optimization problem may entail a set of equality constraints and possibly a set of
inequality constraints. If this is the case, the problem is said to be a constrained
optimization problem. The most general constrained optimization problem can be
expressed mathematically as
minimize 𝑓(𝐱), for 𝐱 ∈ 𝛀
subject to: ℎ𝑖 (𝐱) = 0
𝑔𝑗 (𝐱) ≥ 0

A problem that does not entail any equality or inequality constraints is said to be an
unconstrained optimization problem. Constrained optimization is usually much more
difficult than unconstrained optimization, as might be expected. Consequently, the
general strategy that has evolved in recent years towards the solution of constrained
optimization problems is to reformulate constrained problems as unconstrained
optimization problems. When the objective function and all the constraints are linear
functions of 𝐱, the problem is a linear programming problem. Problems of this type are
probably the most widely formulated and solved of all optimization problems,
particularly in control system, management, financial, and economic applications.
Nonlinear programming problems, in which at least some of the constraints or the
objective are nonlinear functions, tend to arise naturally in the physical sciences and
engineering, and are becoming more widely used in control system, management and
economic sciences as well.
Several branches of mathematical programming are of much interest for the
optimization problems, namely, linear, integer, quadratic, nonlinear, and dynamic
programming. Each one of these branches of mathematical programming consists of the
theory and application of a collection of optimization techniques that are suited to a
specific class of optimization problems.

In mathematical optimization, the method of Lagrange


multipliers is a strategy for finding the local maxima and minima of a function subject
to equality constraints (i.e., subject to the condition that one or more equations have to
be satisfied exactly by the chosen values of the variables). It is named after the
mathematician Joseph-Louis Lagrange. The basic idea is to convert a constrained
problem into a form such that the derivative test of an unconstrained problem can still
be applied. The relationship between the gradient of the function and the gradients of the constraints is encoded in what is known as the Lagrangian function.

The method can be summarized as follows: in order to find the maximum or minimum
of a function 𝑓(𝐱) subjected to the equality constraint g(𝐱) = 0, form the Lagrangian
function 𝐿(𝐱 , 𝜆) = 𝑓(𝐱) − 𝜆g(𝐱) and find the stationary points 𝐱 = 𝐱 ⋆ of 𝐿(𝐱 , 𝜆) such that
∇𝐿(𝐱 ⋆ , 𝜆) = 0. Further, the method of Lagrange multipliers is generalized by the Karush–
Kuhn–Tucker conditions, which can also take into account inequality constraints of the
form ℎ(𝐱) ≤ 𝑐.

Often the Lagrange multipliers have an interpretation as some quantity of interest. For
example, consider
minimize 𝑓(𝐱), for 𝐱 ∈ 𝛀
{
subject to: g 𝑖 (𝐱) = 𝑐𝑖
The Lagrangian function is 𝐿(𝐱 , 𝝀) = 𝑓(𝐱) + ∑𝑖=1,…,𝑝 𝜆𝑖 (𝑐𝑖 − g 𝑖 (𝐱)). Then 𝜆𝑘 = 𝜕𝐿/𝜕𝑐𝑘 . So, 𝜆𝑘 is
the rate of change of the quantity being optimized as a function of the constraint
parameter. The relationship between the gradient of the function and gradients of the
constraints is:
𝐿(𝐱 , 𝜆) = 𝑓(𝐱) − 𝜆g(𝐱) ⟹ ∇𝑓(𝐱) = 𝜆∇g(𝐱).

Example: Suppose we wish to maximize the


objective function 𝑓(𝑥, 𝑦) = 𝑥 + 𝑦 subject to the
constraint 𝑥 2 + 𝑦 2 = 1. The feasible set is the unit
circle, and the level sets of 𝑓 are diagonal lines
(with slope −1), so we can see graphically that the
maximum and the minimum occur at (√2/2 , √2/2) and (−√2/2 , −√2/2).
For the method of Lagrange multipliers, the
constraint is g(𝐱) = 𝑥 2 + 𝑦 2 − 1 = 0 hence 𝐿(𝐱 , 𝜆) = (𝑥 + 𝑦) + 𝜆(𝑥 2 + 𝑦 2 − 1). Now we can
calculate the gradient:
∇𝐿(𝐱 , 𝜆) = (𝜕𝐿/𝜕𝑥 , 𝜕𝐿/𝜕𝑦 , 𝜕𝐿/𝜕𝜆)𝑇 = (1 + 2𝜆𝑥, 1 + 2𝜆𝑦, 𝑥 2 + 𝑦 2 − 1)𝑇

and therefore ∇𝐿(𝐱 , 𝜆) = 0 ⟹ {1 + 2𝜆𝑥 = 0, 1 + 2𝜆𝑦 = 0, 𝑥 2 + 𝑦 2 − 1 = 0}. Notice that the last equation is the original constraint. The first two equations yield 𝑥 = 𝑦 = −1/(2𝜆) with 𝜆 ≠ 0. By substituting into the last equation we have 2𝜆2 − 1 = 0 ⟹ 𝜆 = ±√2/2, which implies that the
stationary points of 𝐿 are (√2/2, √2/2) and (−√2/2, −√2/2). Evaluating the objective
function f at these points yields 𝑓 = ±√2.
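This stationary-point computation can also be checked symbolically in MATLAB (a minimal sketch using the Symbolic Math Toolbox; the variable names are illustrative):

% Solve grad L = 0 for f = x + y on the unit circle
syms x y lam
L = (x + y) + lam*(x^2 + y^2 - 1);             % Lagrangian
S = solve(gradient(L, [x y lam]) == 0, [x y lam]);
stationary = [S.x S.y S.lam]                    % the two stationary points
fvals = S.x + S.y                               % objective values +/- sqrt(2)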

Example: Now we modify the objective function of


the previous Example so that we minimize 𝑓(𝑥, 𝑦) =
(𝑥 + 𝑦)2 again along the circle g(𝐱) = 𝑥 2 + 𝑦 2 − 1 = 0.
Now the level sets of 𝑓 are still lines of slope −1, and
the points on the circle tangent to these level sets
are again (√2/2, √2/2) and (−√2/2, −√2/2). These
tangency points are maxima of 𝑓. On the other
hand, the minima occur on the level set for 𝑓 = 0
(since by its construction 𝑓 cannot take negative
values), at (√2/2, −√2/2) and (−√2/2, √2/2), where
the level curves of 𝑓 are not tangent to the constraint. The condition that ∇𝑓 = 𝜆∇g
correctly identifies all four points as extrema; the minima are characterized in
particular by 𝜆 = 0.

Remark: In optimal control theory, the Lagrange multipliers are interpreted as costate
variables, and Lagrange multipliers are reformulated as the minimization of the
Hamiltonian, in Pontryagin's minimum principle.

Example: Determine the Lagrange multipliers for the optimization problem

minimize 𝑓(𝐱), for 𝐱 ∈ 𝛀


{
subject to: 𝑨𝐱 = 𝒃

Where 𝑨 ∈ ℝ𝑝×𝑛 is assumed to have full row rank. Also discuss the case where the
constraints are nonlinear.

Solution: in this case we have 𝐿(𝐱 , 𝝀) = 𝑓(𝐱) − 𝝀𝑇 g(𝐱) with g(𝐱) = 𝑨𝐱 − 𝒃 = 0 and 𝝀 = [𝜆1 ⋯ 𝜆𝑝 ]𝑇 . Let us define g new (𝐱, 𝝀) = 𝝀𝑇 g(𝐱) = 𝝀𝑇 (𝑨𝐱 − 𝒃). Taking the gradient of this new function with respect to 𝐱 gives ∇g new = 𝑨𝑇 𝝀.

On the other hand,

g new = 𝑓(𝐱) − 𝐿(𝐱 , 𝝀) ⟹ ∇g new = ∇𝑓(𝐱) − ∇𝐱 𝐿(𝐱 , 𝝀) = 𝑨𝑇 𝝀

Evaluating this expression at the stationary point 𝐱 = 𝐱 ⋆ , where ∇𝐱 𝐿(𝐱 ⋆ , 𝝀) = 𝟎, we obtain

∇𝑓(𝐱 ⋆ ) = 𝑨𝑇 𝝀

From basic linear algebra it is very well known that, since 𝑨 has full row rank, the Lagrange multipliers are then uniquely determined:

𝝀 = [𝜆1 ⋯ 𝜆𝑝 ]𝑇 = (𝑨𝑨𝑇 )−1 𝑨∇𝑓(𝐱 ⋆ ) = (𝑨𝑇 )+ ∇𝑓(𝐱 ⋆ )

For the case of nonlinear equality constraints, a similar conclusion can be reached in terms of the Jacobian of the constraints. If we let 𝑱e = [∇g1 (𝐱) … ∇g 𝑝 (𝐱)]𝑇 then the Lagrange multipliers are uniquely determined as 𝝀 = (𝑱e𝑇 )+ ∇𝑓(𝐱 ⋆ ).

Example Solve the problem


minimize 𝑓(𝐱) = (1/2) 𝐱 𝑇 𝑯𝐱 + 𝐱 𝑇 𝒑, for 𝐱 ∈ 𝛀
subject to: 𝑨𝐱 = 𝒃

Where 𝑯 ≻ 0 and 𝑨 ∈ ℝ𝑝×𝑛 is assumed to have full row rank.

Solution: We know that ∇𝑓(𝐱) = 𝑯𝐱 + 𝒑, so at the optimum 𝑯𝐱 ⋆ + 𝒑 = 𝑨𝑇 𝝀. In order to eliminate the unknown 𝐱 ⋆ , multiply both sides by 𝑨𝑯−1 : 𝑨𝑯−1 𝑨𝑇 𝝀 = 𝑨𝐱 ⋆ + 𝑨𝑯−1 𝒑 = 𝒃 + 𝑨𝑯−1 𝒑, hence

𝝀 = (𝑨𝑯−1 𝑨𝑇 )−1 (𝑨𝑯−1 𝒑 + 𝒃)


The Lagrange multiplier problem is difficult to solve analytically in general; therefore we try to solve such problems numerically using computers.

Remark: assume that we are dealing with the problem of optimization such that
minimize 𝑓(𝐱), for 𝐱 ∈ 𝛀
{
subject to: 𝐠(𝐱) = 𝟎

The Karush–Kuhn–Tucker conditions state that ∇𝐿(𝐱 , 𝝀) = 0, which can be written in the form

𝐿(𝐱 , 𝝀) = 𝑓(𝐱) + 𝝀𝑇 𝐠(𝐱) ⟺ 𝒉(𝐱 , 𝝀) = (𝜕𝐿/𝜕𝐱 ; 𝜕𝐿/𝜕𝝀) = (∇𝑓(𝐱) + 𝑱𝑇 𝝀 ; 𝐠(𝐱)) = (𝟎 ; 𝟎)

where 𝑱 is the Jacobian of the vector 𝐠(𝐱). In the case where 𝑓(𝐱) = (1/2)𝐱 𝑇 𝑯𝐱 + 𝐱 𝑇 𝒑 and 𝐠(𝐱) = 𝑨𝐱 − 𝒃, this becomes the linear (saddle point) system

𝒉(𝐱 , 𝝀) = (𝑯𝐱 + 𝒑 + 𝑨𝑇 𝝀 ; 𝑨𝐱 − 𝒃) = 𝟎 ⟺ [𝑯 𝑨𝑇 ; 𝑨 𝟎] (𝐱 ; 𝝀) = (−𝒑 ; 𝒃)

A sketch of solving this KKT system numerically is given below.
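A minimal MATLAB sketch of such an equality-constrained quadratic program follows; the particular 𝑯, 𝒑, 𝑨 and 𝒃 are illustrative choices:

% Solve min 0.5*x'*H*x + p'*x subject to A*x = b via the KKT system
clear all, clc
H=[3 0; 0 4]; p=[-3; 0];    % illustrative positive definite H and linear term
A=[1 1]; b=1;               % one linear equality constraint
KKT=[H A'; A 0];            % saddle-point (KKT) matrix
sol=KKT\[-p; b];            % solve for [x; lambda]
x=sol(1:2), lambda=sol(3)
% check: H*x + p + A'*lambda and A*x - b should both be (near) zero
residual=[H*x+p+A'*lambda; A*x-b]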

Random search for constrained problems: assume that we are dealing with the following optimization problem:

minimize 𝑓(𝐱), for 𝐱 ∈ 𝛀
subject to: ℎ𝑖 (𝐱) = 0, 1 ≤ 𝑖 ≤ 𝑝
            g 𝑗 (𝐱) ≤ 0, 1 ≤ 𝑗 ≤ 𝑚

The basic concept in random search approaches is to randomly generate points in the
parameter space. Only feasible points satisfying g 𝑗 (𝐱) ≤ 0, 1 ≤ 𝑗 ≤ 𝑚 are considered,
while non-feasible points with at least one g 𝑗 (𝐱) > 0 for some j are rejected. The
algorithm keeps track of the feasible random point with the least value of the objective
function. This requires checking, at every iteration, if the newly generated feasible point
has a better objective function value than the best achieved so far.
The main disadvantage of this algorithm is that a large number of objective function
calculations may be required especially for problems with large n. The following
example illustrates this technique.

Example: Find a solution for the two-dimensional constrained optimization problem:

        minimize 𝑥₁² + 𝑥₂²
        subject to: 𝑥₁ − 𝑥₂² − 4 ≥ 0
                    𝑥₁ − 10 ≤ 0

%The Random Search Approach for constraint problems % MATLAB M8.1


clear all, clc,
f=@(x)x(1)^2+x(2)^2; g=@(x)[4-x(1)+x(2)^2;x(1)-10];
n=2; %This is the Number of Parameters
m=2; %This is the Number of Constraints
Ub=[10 10]'; %upper values
Lb=[-10 -10]'; %lower values
f0=1.0e9; %select a large initial value for the minimum
N=100000; %maximum number of allowed iterations
k=0; %iteration counter
while(k<N) %repeat until maximum number of iteration
r=1.2*rand(n,1); %get a vector of random variables
x=Lb + r.*(Ub-Lb); %Get new random point
f1=f(x); %get new objective function value
gg=g(x); %get the value of ALL constraints at the new point
if and((f1<f0),(max(gg)<0)) %is there an improvement and is the
%new point feasible?
x0=x; %adjust best value
f0=f1;
end
k=k+1; %increment the iteration counter
end

iterations=k,
BestPosition=x0,
fmin=f0, %best (minimum) objective value found

The point returned by the random optimization algorithm is 𝐱 = [4.175048 0.048896].

Example: Find a solution for the two-dimensional constrained optimization problem:

        minimize 𝑥₁ + 𝑥₂
        subject to: 𝑥₁² + 𝑥₂² − 1 ≤ 0

The above program gives 𝑥₁ ≈ 𝑥₂ ≈ −0.7056, 𝑓 = −1.4134 (the analytic optimum is
𝑥₁ = 𝑥₂ = −√2/2 with 𝑓 = −√2 ≈ −1.4142).


Example: Find a solution for the two-dimensional constrained optimization problem:

        minimize log(1 + (𝑥₁ − 4/3)² + 3(𝑥₁ + 𝑥₂ − 𝑥₁³)²)
        subject to: 𝑥₁² + 𝑥₂² − 4 ≤ 0
                    −1 ≤ 𝑥₁, 𝑥₂ ≤ 1

The above program gives

    𝑥₁ = 0.5995, 𝑥₂ = −0.3714, 𝑓 = 0.4311

clear all, clc,


f=@(x)log((1+(x(1)-4/3).^2)+3*(x(1)+x(2)-(x(1)).^3).^2);
g=@(x)x(1)^2+x(2)^2-4; % Contraints
n=2; %This is the Number of Parameters
m=2; %This is the Number of Constraints
Ub=[1 1]'; %upper values
Lb=[-1 -1]'; %lower values
f0=1.0e9; %select a large initial value for the minimum
N=100000; %maximum number of allowed iterations
k=0; %iteration counter
while(k<N) %repeat until maximum number of iteration
r=0.8*rand(n,1); %get a vector of random variables
x=Lb + r.*(Ub-Lb); %Get new random point
f1=f(x); %get new objective function value
gg=g(x); %get the value of ALL constraints at the new point
if and((f1<f0),(max(gg)<=0)) %is there an improvement and is the
%new point feasible?
x0=x; %adjust best value
f0=f1;
end
k=k+1; %increment the iteration counter
end
iterations=k, BestPosition=x0, fmin=f0,

Regularized Least Squares: There are several situations in which the least squares
solution of 𝑨𝐱 = 𝒃 does not give rise to a good estimate of the "true" vector 𝐱.
For example, when 𝑨 is underdetermined, that is, when there are fewer equations than
variables, there are several optimal solutions to the least squares problem, and it is
unclear which of these optimal solutions is the one that should be considered. In these
cases, some type of prior information on 𝐱 should be incorporated into the optimization
model. One way to do this is to consider a penalized problem in which a regularization
function 𝑅(·) is added to the objective function. The regularized least squares (RLS)
problem has the form
    RLS:   min_𝐱 ‖𝑨𝐱 − 𝒃‖² + 𝜆𝑅(𝐱)

The positive constant 𝜆 is the regularization parameter. As 𝜆 gets larger, more weight is
given to the regularization function. In many cases, the regularization is taken to be
quadratic. In particular, 𝑅(𝐱) = ‖𝑫𝐱‖2 where 𝑫 ∈ ℝ𝑝×𝑛 is a given matrix. The quadratic
regularization function aims to control the norm of 𝑫𝐱 and is formulated as follows:

    min_𝐱 ‖𝑨𝐱 − 𝒃‖² + 𝜆‖𝑫𝐱‖²

To find the optimal solution of this problem, note that it can be equivalently written as

    min_𝐱 {𝑓RLS(𝐱) ≡ 𝐱ᵀ(𝑨ᵀ𝑨 + 𝜆𝑫ᵀ𝑫)𝐱 − 2𝒃ᵀ𝑨𝐱 + ‖𝒃‖²}


Since the Hessian of the objective function is ∇2 𝑓RLS = 2(𝑨𝑻 𝑨 + 𝜆𝑫𝑻 𝑫) ≽ 0, it follows by
previous theorems that any stationary point is a global minimum point. The stationary
points are those satisfying ∇𝑓RLS (𝐱) = 0, that is, (𝑨𝑻 𝑨 + 𝜆𝑫𝑻 𝑫)𝐱 = 𝑨𝑻 𝒃.

Therefore, if 𝑨𝑻 𝑨 + 𝜆𝑫𝑻 𝑫 ≻ 𝟎, then the RLS solution is given by

𝐱 = (𝑨𝑻 𝑨 + 𝜆𝑫𝑻 𝑫)−1 𝑨𝑻 𝒃
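
In MATLAB this is a one-line solve (a sketch, with 𝑨, 𝒃, 𝑫 and the scalar lambda
assumed given; backslash is preferred over forming an explicit inverse):

% Regularized least squares solution
x_rls = (A'*A + lambda*(D'*D)) \ (A'*b);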

Example: Let 𝑨 ∈ ℝ³ˣ³, 𝒃 ∈ ℝ³ and 𝑩 ∈ ℝ²ˣ³ be given by

    𝑨 = 𝑩ᵀ𝑩 + 10⁻³𝑰 = [ 2+10⁻³  3  4 ; 3  5+10⁻³  7 ; 4  7  10+10⁻³ ],

    𝑩 = [ 1 1 1 ; 1 2 3 ],    𝒃 = [ 20.0019 ; 34.0004 ; 48.0202 ]

The purpose is to find the best approximate solution of 𝑨𝐱 = 𝒃, knowing that the true
vector is 𝐱true = [1 2 3]ᵀ (here 𝒃 is a slightly perturbed version of 𝑨𝐱true).

The matrix 𝑨 is in fact of a full column rank since its eigenvalues are all positive (which
can be checked, for example, by the MATLAB command eig(𝑨)), and the simple least
squares solution is given by 𝐱 𝐿𝑆 , whose value can be computed by

clear all, clc


B=[1,1,1;1,2,3];
A=B'*B+0.001*eye(3);
b=[20.0019,34.0004,48.0202]'
xLS=inv(A'*A)*A'*b

𝐱 𝐿𝑆 = [4.5316 − 5.1036 6.5612]𝑇

𝐱LS is rather far from the true vector 𝐱true. One difference between the solutions is that
the squared norm ‖𝐱LS‖² = 90.1855 is much larger than the correct squared norm
‖𝐱true‖² = 14. In order to control the norm of the solution we will add the quadratic
regularization function ‖𝐱‖2. The regularized solution will thus have the form

𝐱 = (𝑨𝑻 𝑨 + 𝜆𝑰𝑻 𝑰)−1 𝑨𝑻 𝒃


Picking the regularization parameter as 𝜆 = 1, the RLS solution becomes

clear all, clc


B=[1,1,1;1,2,3];
A=B'*B+0.001*eye(3);
b=[20.0019,34.0004,48.0202]'
xRLS=inv(A'*A+eye(3))*A'*b

𝐱RLS = [1.1763 2.0318 2.8872]ᵀ

which is a much better estimate for 𝐱 𝑡𝑟𝑢𝑒 than 𝐱 𝐿𝑆 .

Denoising: One application area in which regularization is commonly used is


denoising. Suppose that a noisy measurement of a signal 𝐱 ∈ ℝ𝑛 is given:
𝒃=𝐱+𝐰
Here 𝐱 is an unknown signal, 𝐰 is an unknown noise vector, and 𝒃 is the known
measurements vector. The denoising problem is the following: Given 𝒃, find a “good”
estimate of 𝐱. The least squares problem associated with the approximate equations
𝐱 ≈ 𝒃 is
    min_𝐱 ‖𝐱 − 𝒃‖²

However, the optimal solution of this problem is obviously x = b, which is meaningless.


This is a case in which the least squares solution is not informative even though the
associated matrix—the identity matrix—is of a full column rank. To find a more
relevant problem, we will add a regularization term. For that, we need to exploit some a
priori information on the signal. For example, we might know in advance that the signal
is smooth in some sense. In that case, it is very natural to add a quadratic penalty,
which is the sum of the squares of the differences of consecutive components of the
vector; that is, the regularization function is
    𝑅(𝐱) = ∑_{𝑖=1}^{𝑛−1} (𝑥ᵢ − 𝑥ᵢ₊₁)²

This quadratic function can also be written as 𝑅(𝐱) = ‖𝑳𝐱‖², where 𝑳 ∈ ℝ⁽ⁿ⁻¹⁾ˣⁿ is given
by

    𝑳 = [ 1 −1  0 ⋯  0  0
          0  1 −1 ⋯  0  0
          ⋮        ⋱     ⋮
          0  0  0 ⋯  1 −1 ]
The resulting regularized least squares problem is (with 𝜆 a given regularization
parameter)
    min_𝐱 ‖𝐱 − 𝒃‖² + 𝜆‖𝑳𝐱‖²

and its optimal solution is given by 𝐱 𝑅𝐿𝑆 = (𝑰 + 𝜆𝑳𝑻 𝑳)−1 𝒃


Example: Consider the signal 𝐱 ∈ ℝ300 constructed by the following MATLAB
commands:

clear all, clc


t=linspace(0,4,300)';
x=sin(t)+t.*(cos(t).^2);

Essentially, this is the signal given by 𝑥ᵢ = sin(4(𝑖−1)/299) + (4(𝑖−1)/299)cos²(4(𝑖−1)/299),
where 𝑖 = 1, 2, . . . , 300. A normally distributed noise with zero mean and standard
deviation of 0.05 was added to each of the components:

randn('seed',314);
b=x+0.05*randn(300,1);

The true and noisy signals are given in Figure, which was constructed by the MATLAB
commands

subplot(1,2,1);
plot(1:300,x,'LineWidth',2);
subplot(1,2,2);
plot(1:300,b,'LineWidth',2);

In order to denoise the signal 𝒃, we look at the optimal solution of the RLS problem, for
four different values of the regularization parameter: 𝜆 = 1, 10,100, 1000.

The original true signal is denoted by a dotted line. As can be seen in the next Figure,
as 𝜆 gets larger, the RLS solution becomes smoother.
For 𝜆 = 10 the RLS solution is a rather good estimate of the original vector 𝐱. For
𝜆 = 100 we get a smoother RLS signal, but evidently it is less accurate than 𝐱 𝑅𝐿𝑆 (10),
especially near the boundaries. The RLS solution for 𝜆 = 1000 is very smooth, but it is a
rather poor estimate of the original signal. In any case, it is evident that the parameter
𝜆 is chosen via a trade-off between data fidelity (closeness of 𝐱 to 𝒃) and smoothness
(size of 𝑳𝐱). The four plots were produced by the MATLAB commands

L=zeros(299,300);
for i=1:299
L(i,i)=1;
L(i,i+1)=-1;
end
x_rls=(eye(300)+1*L'*L)\b;
x_rls=[x_rls,(eye(300)+10*L'*L)\b];
x_rls=[x_rls,(eye(300)+100*L'*L)\b];
x_rls=[x_rls,(eye(300)+1000*L'*L)\b];

figure(2)
for j=1:4
subplot(2,2,j);
plot(1:300,x_rls(:,j),'LineWidth',2);
hold on
plot(1:300,x,':r','LineWidth',2);
hold off
title(['\lambda=',num2str(10^(j-1))]);
end
Metaheuristic Optimization: Most real-world optimizations are highly nonlinear and
multimodal, under various complex constraints. Different objectives are often
conflicting. Even for a single objective, optimal solutions may sometimes not exist at
all. In general, finding an optimal solution or even sub-optimal solutions is not an easy
task. This work aims to introduce the fundamentals of metaheuristic optimization, as
well as some popular metaheuristic algorithms. Metaheuristic algorithms are becoming
an important part of modern optimization. A wide range of metaheuristic algorithms
have emerged over the last two decades, and many metaheuristics such as particle
swarm optimization are becoming increasingly popular. Despite their popularity, the
mathematical analysis of these algorithms lags behind. Convergence analysis still
remains unsolved for the majority of metaheuristic algorithms, while efficiency analysis
is equally challenging.

Problem formulation: In general, an optimization problem can be written as

    minimize 𝑓₁(𝐱), 𝑓₂(𝐱), …, 𝑓ₘ(𝐱),    𝐱 = [𝑥₁, 𝑥₂, …, 𝑥ₙ]

subject to
    ℎⱼ(𝐱) = 0, (𝑗 = 1, 2, . . . , 𝐽)
    gₖ(𝐱) ≤ 0, (𝑘 = 1, 2, . . . , 𝐾)

where 𝑓₁, . . . , 𝑓ₘ are the objectives, while ℎⱼ and gₖ are the equality and inequality
constraints, respectively. In the case when 𝑚 = 1, it is called single-objective
optimization. When 𝑚 ≥ 2, it becomes a multi-objective problem whose solution
strategy is different from those for a single objective. In general, all the functions 𝑓ᵢ, ℎⱼ
and gₖ are nonlinear. In the special case when all these functions are linear, the
optimization problem becomes a linear programming problem which can be solved
using the standard simplex method (Dantzig 1963). Metaheuristic optimization
concerns more generalized, nonlinear optimization problems. It is worth pointing out
that the above minimization problem can also be formulated as a maximization
problem if 𝑓ᵢ is replaced with −𝑓ᵢ.
Derivative-free algorithms do not use any derivative information but the values of the
function itself. Some functions may have discontinuities or it may be expensive to
calculate derivatives accurately, and thus derivative-free algorithms become very
useful.

From a different perspective, optimization algorithms can be classified into trajectory-


based and population-based. A trajectory-based algorithm typically uses a single agent
or one solution at a time, which will trace out a path as the iterations continue.
Optimization algorithms can also be classified as deterministic or stochastic. If an
algorithm works in a mechanical deterministic manner without any random nature, it
is called deterministic. For such an algorithm, it will reach the same final solution if we
start with the same initial point. Evolutionary algorithms such as particle swarm
optimization (PSO), ant colony optimization (ACO) and their variants are good examples
of stochastic algorithms.

Search capability can also be a basis for algorithm classification. In this case,
algorithms can be divided into local and global search algorithms. Local search
algorithms typically converge towards a local optimum, not necessarily (often not) the
global optimum, and such an algorithm is often deterministic and has no ability to
escape from local optima. On the other hand, for global optimization, local search
algorithms are not suitable, and global search algorithms should be used. Modern
metaheuristic algorithms in most cases tend to be suitable for global optimization,
though not always successful or efficient.

Algorithms with stochastic components were often referred to as heuristic in the past,
though the recent literature tends to refer to them as metaheuristics. We will follow
Glover's convention and call all modern nature-inspired algorithms metaheuristics
(Glover 1986, Glover and Kochenberger 2003). Loosely speaking, heuristic means to find
or to discover by trial and error. Here meta- means beyond or higher level, and
metaheuristics generally perform better than simple heuristics. In addition, all
metaheuristic algorithms use a certain tradeoff of randomization and local search.
Quality solutions to difficult optimization problems can be found in a reasonable
amount of time, but there is no guarantee that optimal solutions can be reached. It is
hoped that these algorithms work most of the time, but not all the time. Almost all
metaheuristic algorithms tend to be suitable for global optimization.

Particle swarm optimization (PSO) was developed


by Kennedy and Eberhart in 1995, based on swarm behavior observed in nature such
as fish and bird schooling. Since then, PSO has generated a lot of attention, and now
forms an exciting, ever-expanding research subject in the field of swarm intelligence.
PSO has been applied to almost every area in optimization, computational intelligence,
and design/scheduling applications.
PSO searches the space of an objective function by adjusting the trajectories of
individual agents, called particles. Each particle traces a piecewise path which can be
modelled as a time-dependent positional vector. The movement of a swarming particle
consists of two major components: a stochastic component and a deterministic
component. Each particle is attracted toward the position of the current global best
𝐠 𝑏𝑒𝑠𝑡 and its own best known location 𝐱 𝑏𝑒𝑠𝑡 , while exhibiting at the same time a tendency
to move randomly.

When a particle finds a location that is better than any previously found locations, then
it updates this location as the new current best for particle 𝑖 . There is a current best for
all particles at any time 𝑡 at each iteration. The aim is to find the global best among all
the current best solutions until the objective no longer improves or after a certain
number of iterations.

Let 𝐱 𝑖 and 𝐯𝑖 be the position and velocity vectors, respectively, of particle 𝑖. The new
velocity vector is determined by the following formula

𝐯𝑖𝑘+1 = 𝜔𝐯𝑖𝑘 + 𝛼𝜺1 × (𝐱 𝑖𝑏𝑒𝑠𝑡 − 𝐱 𝑖𝑘 ) + 𝛽𝜺2 × (𝐠 𝑏𝑒𝑠𝑡 − 𝐱 𝑖𝑘 ) 𝑘: number of iterations

where 𝜺1 and 𝜺2 are two random vectors, and each entry takes a value between 0 and 1.
The parameters 𝛼 and 𝛽 are the learning parameters or acceleration constants, which
are typically set to 𝛼 ≈ 𝛽 ≈ 2. 𝜔(𝑘) is the inertia function, which takes a value between
0 and 1. In the simplest case, the inertia function can be taken as a constant, typically
𝜔 ∈ [0.5, 0.9]. This is equivalent to introducing a virtual mass to stabilize the motion of
the particles, and thus the algorithm is expected to converge more quickly.

The initial locations of all particles should be distributed relatively uniformly so that
they can sample over most regions, which is especially important for multimodal
problems. The initial velocity of a particle can be set to zero, that is, 𝐯𝑖𝑘=0 = 0 . The new
position can then be updated by the formula 𝐱 𝑖𝑘+1 = 𝐱 𝑖𝑘 + 𝐯𝑖𝑘+1

As the iterations proceed, the particle system swarms and may converge towards a
global optimum.

Algorithm: [Particle Swarm Optimization]


Initialize particles
Do until maximum iterations or minimum error criteria
For each particle
Calculate Data fitness value
If the fitness value is better than pBest
Set pBest = current fitness value
If pBest is better than gBest
Set gBest = pBest
For each particle
Calculate particle Velocity
Use gBest and Velocity to update particle Data
Exercise: Write a MATLAB code to search the optimum value of the following objective
functions

▪ 𝑓(𝑥, 𝑦) = 3 sin(𝑥) + e^𝑦,               −4 ≤ 𝑥, 𝑦 ≤ 4
▪ 𝑓(𝑥, 𝑦) = 100(𝑦 − 𝑥²)² + (1 − 𝑥)²,      −10 ≤ 𝑥, 𝑦 ≤ 10

Using the following inertia function

    𝜔(iter) = 𝜔max − ((𝜔max − 𝜔min)/Maxiter) × iter
clear all, clc,
[X,Y] = meshgrid(-4:0.5:4,-4:0.5:4);
Z = 3*sin(X)+exp(Y); surf(X,Y,Z); figure

wmax=0.9; wmin=0.4; c1=1.49; c2=1.49;


itermax=50; xmin=[-2 -2]; xmax=[2 2];
n=20; m=2; % n=Number of Particles and n=Number of variables
v=zeros(m,n); rand('state',0);
for i=1:n
for j=1:m
x(j,i)=xmin(j)+rand*(xmax(j)-xmin(j));
end
% fun_marge(i)=100*(x(2,i)-x(1,i)^2)^2+(1-x(1,i))^2;
fun_marge(i)=3*sin(x(1,i))+exp(x(2,i));
end
xbest=x; fbest=fun_marge; fgbest=min(fun_marge);
gbest=x(:,find(fun_marge==fgbest));

for iter=1:itermax
w=wmax-(wmax-wmin)*iter/itermax;
for i=1:n
v(:,i)=w*v(:,i)+c1*rand*(xbest(:,i)-x(:,i))+c2*rand*(gbest-x(:,i));
x(:,i)=x(:,i)+v(:,i);
for jj=1:m
if x(jj,i)>xmax(jj)
x(jj,i)=xmax(jj);
end
if x(jj,i)<xmin(jj)
x(jj,i)=xmin(jj);
end
end
% fun_marge(i)=100*(x(2,i)-x(1,i)^2)^2+(1-x(1,i))^2;
fun_marge(i)=3*sin(x(1,i))+exp(x(2,i));
if fun_marge(i) < fbest(i)
xbest(:,i)=x(:,i);
fbest(i)=fun_marge(i);
end
if fun_marge(i) < fgbest
gbest=x(:,i); fgbest=fun_marge(i);
end
end
result(iter)=fgbest;
end
fprintf(' the optimal value is %3.4f\n', gbest)
fprintf(' the minimum value of func is %3.4f\n', fgbest)
plot([1:itermax], result,'--r','linewidth',1.5)
xlabel('Iteration'), ylabel('Function'), grid on
the optimal value is: -1.5708
the optimal value is: -2.0000
the minimum value of func is: -2.8647
(i.e. the minimizer on the box is (𝑥, 𝑦) = (−𝜋/2, −2) with 𝑓 = −3 + e⁻² ≈ −2.8647)

Exercise: Write a MATLAB code to search the optimum value of the following objective
functions

▪ 𝑓(𝑥, 𝑦) = 𝑥² − 𝑦²,                          −10 ≤ 𝑥, 𝑦 ≤ 10
▪ 𝑓(𝑥) = 𝑥⁴ − 14𝑥³ + 60𝑥² − 70𝑥,              −10 ≤ 𝑥 ≤ 10
▪ 𝑓(𝑥, 𝑦) = 𝑥 sin(4𝑥) + 1.1𝑦 sin(𝑦),          −10 ≤ 𝑥, 𝑦 ≤ 10
▪ 𝑓(𝐱) = (𝑥 + 10𝑦)² + 5(𝑧 − 𝑤)² + (𝑦 − 2𝑧)⁴ + 10(𝑥 − 2𝑤)⁴

Alternatives of PSO: There are many variations which extend the standard algorithm.
The standard particle swarm optimization uses both the current global best 𝐠 𝑏𝑒𝑠𝑡 and
the individual best 𝐱ᵢbest. The reason for using the individual best is primarily to
increase the diversity in the quality solutions; however, this diversity can be simulated
by the randomness alone. Consequently, there is no compelling reason for using the
individual best.
A simplified version which could accelerate the convergence of the algorithm is to use
the global best only. Thus, in the accelerated particle swarm optimization, the velocity
vector is generated by

𝐯𝑖𝑘+1 = 𝐯𝑖𝑘 + 𝛼 × (𝜺1 − 0.5𝐞) + 𝛽(𝐠 𝑏𝑒𝑠𝑡 − 𝐱 𝑖𝑘 ) with 𝐞𝑇 = [1 1 1 … 1]

In order to increase the convergence even further, we can also write the update of the
location in a single step
𝐱 𝑖𝑘+1 = (1 − 𝛽)𝐱 𝑖𝑘 + 𝛽𝐠 𝑏𝑒𝑠𝑡 + 𝛼 × (𝜺1 − 0.5𝐞)

A further accelerated PSO is to reduce the randomness as iterations proceed. This
means that we can use a monotonically decreasing function such as

    𝛼 = 𝛼₀e^(−𝛾𝑘),  or  𝛼 = 𝛼₀𝛾ᵏ, (𝛾 < 1)
clear all, clc,
[X,Y] = meshgrid(-4:0.5:4,-4:0.5:4);
Z = 3*sin(X)+exp(Y); surf(X,Y,Z); figure

wmax=0.9; wmin=0.4; c1=0.3; c2=0.49;


itermax=50; xmin=[-2 -2]; xmax=[2 2];
n=20; m=2; % n=Number of Particles and n=Number of variables
v=zeros(m,n); rand('state',0);
for i=1:n
for j=1:m
x(j,i)=xmin(j)+rand*(xmax(j)-xmin(j));
end
% fun_marge(i)=100*(x(2,i)-x(1,i)^2)^2+(1-x(1,i))^2;
fun_marge(i)=3*sin(x(1,i))+exp(x(2,i));
end
xbest=x; fbest=fun_marge; fgbest=min(fun_marge);
gbest=x(:,find(fun_marge==fgbest));

for iter=1:itermax
w=wmax-(wmax-wmin)*iter/itermax;
for i=1:n
v(:,i)=v(:,i) + c1*(rand-0.5) + c2*(gbest-x(:,i));
x(:,i)=(1-c2)*x(:,i) + c2*gbest + c1*(rand-0.5);
for jj=1:m
if x(jj,i)>xmax(jj)
x(jj,i)=xmax(jj);
end
if x(jj,i)<xmin(jj)
x(jj,i)=xmin(jj);
end
end
% fun_marge(i)=100*(x(2,i)-x(1,i)^2)^2+(1-x(1,i))^2;
fun_marge(i)=3*sin(x(1,i))+exp(x(2,i));
if fun_marge(i) < fbest(i)
xbest(:,i)=x(:,i);
fbest(i)=fun_marge(i);
end
if fun_marge(i) < fgbest
gbest=x(:,i); fgbest=fun_marge(i);
end
end
result(iter)=fgbest;
end
fprintf(' the optimal value is %3.4f\n', gbest)
fprintf(' the minimum value of func is %3.4f\n', fgbest)
plot([1:itermax], result,'--r','linewidth',1.5)
xlabel('Iteration'), ylabel('Function'), grid on
Example: Write a MATLAB code to search by PSO the optimum value of

    𝑓 = 2𝑥² − 3𝑦² + 4𝑧² + 2,    −10 ≤ 𝑥, 𝑦, 𝑧 ≤ 10

clear all, clc, c1=0.3; c2=0.49;


itermax=50; xmin=10*[-2 -2 -2]; xmax=10*[2 2 2];
n=20; m=3; % n=Number of Particles and n=Number of variables
v=zeros(m,n); rand('state',0);
for i=1:n
for j=1:m
x(j,i)=xmin(j)+rand*(xmax(j)-xmin(j));
end
fun_marge(i)=2*(x(1,i))^2-3*(x(2,i))^2+4*(x(3,i))^2+2;
end
xbest=x; fbest=fun_marge; fgbest=min(fun_marge);
gbest=x(:,find(fun_marge==fgbest));

for iter=1:itermax
for i=1:n
v(:,i)=v(:,i) + c1*(rand-0.5) + c2*(gbest-x(:,i));
x(:,i)=(1-c2)*x(:,i) + c2*gbest + c1*(rand-0.5);
for jj=1:m
if x(jj,i)>xmax(jj)
x(jj,i)=xmax(jj);
end
if x(jj,i)<xmin(jj)
x(jj,i)=xmin(jj);
end
end
fun_marge(i)=2*(x(1,i))^2-3*(x(2,i))^2+4*(x(3,i))^2+2;
if fun_marge(i) < fbest(i)
xbest(:,i)=x(:,i);
fbest(i)=fun_marge(i);
end
if fun_marge(i) < fgbest
gbest=x(:,i); fgbest=fun_marge(i);
end
end
result(iter)=fgbest;
end
fprintf(' the optimal value is %3.4f\n', gbest)
fprintf(' the minimum value of func is %3.4f\n', fgbest)

plot([1:itermax], result,'--r','linewidth',1.5)
xlabel('Iteration'), ylabel('Function'), grid on
As already mentioned, swarm intelligence is a relatively new approach to problem
solving that takes inspiration from the social behaviors of insects and of other animals.
In particular, ants have inspired a number of methods and techniques, among which
the most studied and the most successful is the general-purpose optimization
technique known as ant colony optimization. Ant colony optimization (ACO) takes
inspiration from the foraging behavior of some ant species. These ants deposit
pheromone on the ground in order to mark some favorable path that should be followed
by other members of the colony. Ant colony optimization exploits a similar mechanism
for solving optimization problems.

Ant colony optimization (ACO), introduced by Marco Dorigo 1991 in his doctoral
dissertation, is a class of optimization algorithms modeled on the actions of an ant
colony. ACO is a probabilistic technique useful in problems that deal with finding better
paths through graphs. Artificial 'ants'—simulation agents—locate optimal solutions by
moving through a parameter space representing all possible solutions. Natural ants lay
down pheromones directing each other to resources while exploring their environment.
The simulated 'ants' similarly record their positions and the quality of their solutions,
so that in later simulation iterations more ants locate better solutions.

Procedure: The ants construct the solutions as follows. Each ant starts from a
randomly selected city (node or vertex). Then, at each construction step it moves along
the edges of the graph. Each ant keeps a memory of its path, and in subsequent steps
it chooses among the edges that do not lead to vertices that it has already visited. An
ant has constructed a solution once it has visited all the vertices of the graph. At each
construction step, an ant probabilistically chooses the edge to follow among those that
lead to yet unvisited vertices. The probabilistic rule is biased by pheromone values and
heuristic information: the higher the pheromone and the heuristic value associated to
an edge, the higher the probability an ant will choose that particular edge. Once all the
ants have completed their tour, the pheromone on the edges is updated. Each of the
pheromone values is initially decreased by a certain percentage. Each edge then
receives an amount of additional pheromone proportional to the quality of the solutions
to which it belongs (there is one solution per ant). The solution construction process is
stochastic and is biased by a pheromone model, that is, a set of parameters associated
with graph components (either nodes or edges) whose values are modified at runtime by
the ants.
Set parameters, initialize pheromone trails
SCHEDULE_ACTIVITIES

Construct Ant Solutions (Generate a random population of 𝑚 ants (solution)).


For every individual ant ascertain the best position according to the objective function.
Get the best ant in search space.
Restore (Update) the pheromone-trail.
Verify if the termination is true.
END_SCHEDULE_ACTIVITIES

This procedure is repeatedly applied until a termination criterion is satisfied.

Parametrization: Let us say the number of cities is 𝑛, the number of ants is 𝑚, the
distance between the 𝑖-th and 𝑗-th cities is 𝑑ᵢⱼ (𝑖, 𝑗 = 1, 2, …, 𝑛), and the concentration
of pheromone on edge (𝑖, 𝑗) at time 𝑡 is 𝜏ᵢⱼ(𝑡). At the initial time, the pheromone
concentration between cities is equal to 𝜏ᵢⱼ(0) = 𝐶 (𝐶 is a constant), and the probability
that ant 𝑘 chooses edge (𝑖, 𝑗) is expressed by 𝑝ᵢⱼᵏ, with the formula

    𝑝ᵢⱼᵏ = (𝜏ᵢⱼ(𝑡))^𝛼 (𝜂ᵢⱼ(𝑡))^𝛽 / ∑_{𝑠∈𝑁(𝑥ₖ)} (𝜏ᵢₛ(𝑡))^𝛼 (𝜂ᵢₛ(𝑡))^𝛽

The parameter 𝜂ᵢⱼ(𝑡) = 1/𝑑ᵢⱼ is the heuristic information, which indicates the degree of
expectation of ants moving from the 𝑖-th to the 𝑗-th city. 𝑁(𝑥ₖ) (𝑘 = 1, 2, …, 𝑚) denotes
the set of cities that ant 𝑘 may still visit. Furthermore, 𝛼 and 𝛽 are positive real
parameters whose values determine the relative importance of pheromone versus
heuristic information. When all ants complete a cycle, they update the pheromone
according to the formula

    𝜏ᵢⱼ(𝑡) ⟵ (1 − 𝜌)𝜏ᵢⱼ(𝑡) + ∆𝜏ᵢⱼ,    with    ∆𝜏ᵢⱼ = ∑_{𝑘=1}^{𝑚} ∆𝜏ᵢⱼᵏ

where 𝜌 ∈ (0,1] is a parameter called the evaporation rate (i.e. the pheromone decay
coefficient), (1 − 𝜌) is called the pheromone residual factor, and ∆𝜏ᵢⱼ is the quantity of
pheromone concentration released by the 𝑘-th ant on the edge (𝑖, 𝑗). In the basic ACO,
only the positive-feedback pheromone concentration is usually updated. In order to
update the pheromone concentration in the search process we use

    ∆𝜏ᵢⱼᵏ = { 𝑄/𝐿ₖ   if the 𝑘-th ant traverses edge (𝑖, 𝑗) in its tour
            { 0       otherwise

where 𝑄 is a constant that represents the total amount of pheromone released once by
an ant, and 𝐿ₖ is the tour length of the 𝑘-th ant.
clear all, clc,
%LB=20*[-1 -1 -1]; UB=20*[1 1 1]; nvars=size(LB,2);
%f=@(x)2*x(1)^2-3*x(2)^2+4*x(3)^2+2; % Ant-cost
LB=20*[-1 -1]; UB=20*[1 1]; nvars=size(LB,2);
f=@(x)3*sin(x(1))+exp(x(2));
MaxTour=100; % Number of Tours
piece=500; % Number of pieces (cities)
max_assign=50; % MaxValue of assign
ants=50; % Number of Ants
poz_ph=0.5; % PositivePheremone
neg_ph=0.2; % NegativePheremone
lambda=0.95; % EvaporationParameter
ph=0.05; % Pheromone

pher=ones(piece,nvars);
indis=zeros(ants,nvars);
costs=zeros(ants,1);
cost_general=zeros(max_assign,(nvars+1));

deger=zeros(piece,nvars); deger(1,:)=LB;

for i=2:piece
for j=1:nvars
deger(i,j)=deger(i-1,j) + (UB(j)-LB(j))/(piece-1);
end
end
assign=0;
while (assign<max_assign)
for i=1:ants % FINDING THE PARAMETERS OF VALUE
prob = pher.*rand(piece,nvars);
for j=1:nvars
indis(i,j) = find(prob(:,j) == max(prob(:,j)));
end
temp=zeros(1,nvars);
for j=1:nvars
temp(j)=deger(indis(i,j),j);
end
costs(i) = f(temp); % LOCAL UPDATING
deltalocal = zeros(piece,nvars);
% Creates Matrix Contain the Pheremones Deposited for Local Updating
for j=1:nvars
deltalocal(indis(i,j),j)=(poz_ph*ph/(costs(i)));
end
pher = pher + deltalocal;
end
best_ant= min(find(costs==min(costs)));
worst_ant = min(find(costs==max(costs)));
deltapos = zeros(piece,nvars);
deltaneg = zeros(piece,nvars);
for j=1:nvars
deltapos(indis(best_ant,j),j)=(ph/(costs(best_ant)));
% UPDATING PHER OF nvars
deltaneg(indis(worst_ant,j),j)=-(neg_ph*ph/(costs(worst_ant)));
% NEGATIVE UPDATING PHER OF worst path
end
delta = deltapos + deltaneg;
pher = pher.^lambda + delta;
assign=assign + 1; % Update general cost matrix
for j=1:nvars
cost_general (assign,j)=deger(indis(best_ant,j),j);
end
cost_general (assign,nvars+1)=costs(best_ant);
xlabel Tour
title('Change in Cost Value. Red: Means, Blue: Best')
hold on
plot(assign, mean(costs), '.r');
plot(assign, costs(best_ant), '.b');
end
list_cost=sortrows(cost_general,nvars+1);
for j=1:nvars
x(j)=list_cost(1,j);
end
x1=x', fmin=f(x1)
The Firefly Algorithm (FA) was developed by
Xin-She Yang (Yang 2008) and is based on the flashing patterns and behavior of
fireflies. In essence, FA uses the following three idealized rules:

⦁ Fireflies are unisex (one firefly will be attracted to other fireflies regardless of their sex)
⦁ The attractiveness is proportional to the brightness and both decrease as the distance
between two fireflies increases. Thus for any two flashing fireflies, the brighter firefly
will attract the other one. If neither one is brighter, then a random move is performed.
⦁ The brightness of a firefly is determined by the landscape of the objective function.

As a firefly's attractiveness is proportional to the light intensity seen by adjacent


fireflies, we can now define the variation of attractiveness 𝛽 with the distance 𝑟 by
𝛽 = 𝛽₀e^(−𝛾𝑟²), where 𝛽₀ is the attractiveness at 𝑟 = 0. The movement of a firefly 𝑖,
attracted to another more attractive (brighter) firefly 𝑗, is determined by

    𝐱ᵢᵏ⁺¹ = 𝐱ᵢᵏ + 𝛽₀e^(−𝛾𝑟ᵢⱼ²)(𝐱ⱼᵏ − 𝐱ᵢᵏ) + 𝛼 × 𝐞ᵢᵏ

where 𝛾 is the light absorption coefficient, which can be in the range [0.01, 100], and 𝑟ᵢⱼ
is the line-of-sight distance between the fireflies. The second term 𝛽₀e^(−𝛾𝑟ᵢⱼ²)(𝐱ⱼᵏ − 𝐱ᵢᵏ)
is due to the attraction. The third term 𝛼 × 𝐞ᵢᵏ is a randomization with 𝛼 being the
randomization parameter, and 𝐞ᵢᵏ is a vector of random numbers drawn from a
Gaussian distribution or uniform distribution at time 𝑘. If 𝛽₀ = 0, it becomes a simple
random walk. Furthermore, the randomization 𝐞ᵢᵏ can easily be extended to other
distributions such as Lévy flights.
clear all, clc, c1=0.8; c2=0.7; gama=20;
itermax=50; xmin=10*[-2 -2]; xmax=10*[2 2];
n=50; m=2; % n=Number of Particles and n=Number of variables
rand('state',0); % v=zeros(m,n);
for i=1:n
for j=1:m
x(j,i)=xmin(j)+rand*(xmax(j)-xmin(j));
end
fun_marge(i)=3*sin(x(1,i))+exp(x(2,i));
end
xbest=x; fbest=fun_marge; fgbest=min(fun_marge);
gbest=x(:,find(fun_marge==fgbest));

for iter=1:itermax
for i=1:n
for j=1:i
r= norm(x(:,j)-x(:,i));
x(:,i)=x(:,i)+c2*(exp(-gama*r^2))*(x(:,j)-x(:,i))+c1*(randn-0.5);
end
for jj=1:m
if x(jj,i)>xmax(jj)
x(jj,i)=xmax(jj);
end
if x(jj,i)<xmin(jj)
x(jj,i)=xmin(jj);
end
end
fun_marge(i)=3*sin(x(1,i))+exp(x(2,i));
if fun_marge(i) < fbest(i)
xbest(:,i)=x(:,i);
fbest(i)=fun_marge(i);
end
if fun_marge(i) < fgbest
gbest=x(:,i); fgbest=fun_marge(i);
end
end
result(iter)=fgbest;
end
fprintf(' the optimal value is %3.4f\n', gbest)
fprintf(' the minimum value of func is %3.4f\n', fgbest)
plot([1:itermax], result,'--r','linewidth',1.5)
xlabel('Iteration'), ylabel('Function'), grid on

It can be shown that the limiting case 𝛾 → 0 corresponds to the standard Particle
Swarm Optimization (PSO). In fact, if the inner loop (for j) is removed and x(:,j) is
replaced by the current global best gbest, then FA essentially becomes the standard PSO.
In computer science and operations research, the artificial bee colony algorithm (ABC)
is an optimization algorithm based on the intelligent foraging behavior of a honey bee
swarm, proposed by Derviş Karaboğa (Erciyes University) in 2005. In the ABC model,
the colony consists of three groups of bees: employed bees, onlookers and scouts. It is
assumed that there is only one artificial employed bee for each food source. In other
words, the number of employed bees in the colony is equal to the number of food
sources around the hive. Employed bees go to their food source, come back to the hive
and dance in this area. The employed bee whose food source has been abandoned
becomes a scout and starts to search for a new food source. Onlookers watch the
dances of employed bees and choose food sources depending on the dances.

Notes: employed bees associated with specific food sources, onlooker bees watching the
dance of employed bees within the hive to choose a food source, and scout bees
searching for food sources randomly. Both onlookers and scouts are also called
unemployed bees.

The main steps of the algorithm are given below:


▪ Initial food sources are produced for all employed bees
▪ REPEAT
▪ Each employed bee goes to a food source in her memory and determines a
neighbouring source, then evaluates its nectar amount and dances in the hive.
▪ Each onlooker watches the dance of employed bees and chooses one of their
sources depending on the dances, and then goes to that source. After choosing a
neighbor around that, she evaluates its nectar amount.
▪ Abandoned food sources are determined and are replaced with the new food
sources discovered by scouts.
▪ The best food source found so far is registered.
▪ UNTIL (requirements are met)

Initialization Phase: All the vectors of the population of food sources, 𝐱 𝑘 , are initialized
by scout bees and control parameters are set. Since each food source, 𝐱 𝑘 , is a solution
vector to the optimization problem, each 𝐱 𝑘 vector holds 𝑛 variables, (𝐱 𝑘 (𝑖), 𝑖 = 1. . . 𝑛),
which are to be optimized so as to minimize the objective function.

𝐱 𝑘 (𝑖) = 𝒍𝑖 + rand(0,1) × (𝒖𝑖 − 𝒍𝑖 )

where 𝒍𝑖 and 𝒖𝑖 are the lower and upper bound of the parameter 𝐱 𝑘 (𝑖) , respectively.

Employed Bees Phase: Employed bees search for new food sources (𝐯𝑘 ) having more
nectar within the neighbourhood of the food source (𝐱 𝑘 ) in their memory.

𝐯𝑘 (𝑖) = 𝐱 𝑘 (𝑖) + 𝛼𝑘 (𝑖)(𝐱𝑘 (𝑖) − 𝐱 𝑚 (𝑖))


where 𝐱 𝑚 is a randomly selected food source, and 𝛼𝑘 (𝑖) is a random number within the
range [−𝛽, 𝛽]. After producing the new food source 𝐯𝑘 , its fitness is calculated and a
greedy selection is applied between 𝐯𝑘 and 𝐱 𝑘 .

The fitness value of the solution, fit(𝐱ₖ), might be calculated for minimization problems
using the following formula

    fit(𝐱ₖ) = { 1/(1 + f(𝐱ₖ))    if f(𝐱ₖ) ≥ 0
              { 1 + |f(𝐱ₖ)|      if f(𝐱ₖ) < 0
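
In MATLAB this rule can be written as a vectorized one-line helper (a sketch; the
handle name fitfun is our own choice):

% Piecewise fitness rule above, written as an anonymous function
fitfun = @(fx) (fx >= 0).*(1./(1 + fx)) + (fx < 0).*(1 + abs(fx));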

Onlooker Bees Phase: Unemployed bees consist of two groups of bees: onlooker bees
and scouts. Employed bees share their food source information with onlooker bees
waiting in the hive and then onlooker bees probabilistically choose their food sources
depending on this information. In ABC, an onlooker bee chooses a food source
depending on the probability values calculated using the fitness values provided by
employed bees. For this purpose, a fitness based selection technique can be used, such
as the roulette wheel selection method (Goldberg, 1989).

The probability value 𝑝ₖ with which 𝐱ₖ is chosen by an onlooker bee can be calculated
by using the expression

    𝑝ₖ = fit(𝐱ₖ) / ∑_{𝑘=1}^{𝑁} fit(𝐱ₖ)

After a food source 𝐱 𝑘 for an onlooker bee is probabilistically chosen, a neighbourhood


source 𝐯𝑘 is determined by using equation, and its fitness value is computed. As in the
employed bees phase, a greedy selection is applied between 𝐯𝑘 and 𝐱 𝑘 . Hence, more
onlookers are recruited to richer sources and positive feedback behavior appears.

Scout Bees Phase: The unemployed bees who choose their food sources randomly are
called scouts. Employed bees whose solutions cannot be improved through a
predetermined number of trials, specified by the user of the ABC algorithm and called
“limit” or “abandonment criteria” herein, become scouts and their solutions are
abandoned. Then, the converted scouts start to search for new solutions, randomly. For
instance, if solution 𝐱 𝑘 has been abandoned, the new solution discovered by the scout
who was the employed bee of 𝐱 𝑘 can be defined by 𝐱 𝑘 (𝑖) = 𝒍𝑖 + rand(0,1) × (𝒖𝑖 − 𝒍𝑖 ). Hence
those sources which are initially poor or have been made poor by exploitation are
abandoned and negative feedback behavior arises to balance the positive feedback.

Exercise: Write a MATLAB code to search the optimum value of the following objective
functions

    𝑓(𝐱) = 3 sin(𝑥) + e^𝑦,                     −5 ≤ 𝑥, 𝑦 ≤ 5
    𝑓(𝐱) = 2𝑥² + 3𝑦² + 4𝑧² + 5𝑤² + 10,         −5 ≤ 𝑥, 𝑦, 𝑧, 𝑤 ≤ 5
clc;
clear;
close all;
%% Problem Definition
% CostFunction=@(x)2*x(1)^2+3*x(2)^2+4*x(3)^2+5*x(4)^2+10;
f=@(x)3*sin(x(1))+exp(x(2)); % CostFunction
nVar=2; % Number of Decision Variables
VarSize=[1 nVar]; % Decision Variables Matrix Size
VarMin=-5; % Decision Variables Lower Bound
VarMax= 5; % Decision Variables Upper Bound

%% ABC Settings
MaxIt=500; % Maximum Number of Iterations
nPop=500; % Population Size (Colony Size)
nOnlooker=nPop; % Number of Onlooker Bees
L=round(0.6*nVar*nPop); % Abandonment Limit Parameter (Trial Limit)
a=1; % Acceleration Coefficient Upper Bound

%% Initialization
% Empty Bee Structure
empty_bee.Position=[]; empty_bee.Cost=[];
pop=repmat(empty_bee,nPop,1); % Initialize Population Array
BestSol.Cost=inf; % Initialize Best Solution Ever Found

% Create Initial Population


for i=1:nPop
pop(i).Position=VarMin+(VarMax-VarMin)*rand(1,nVar); % sample within [VarMin,VarMax]
pop(i).Cost=f(pop(i).Position);
if pop(i).Cost<=BestSol.Cost
BestSol=pop(i);
end
end

C=zeros(nPop,1); % Abandonment Counter


BestCost=zeros(MaxIt,1); % Array to Hold Best Cost Values

%% ABC Main Loop


for it=1:MaxIt
% Recruited Bees (Employed Bees Phase)
for i=1:nPop
K=[1:i-1 i+1:nPop]; % Choose k randomly, not equal to i
k=K(randi([1 numel(K)]));
phi=a*rand(1,nVar); % Define Acceleration Coeff.

% New Bee Position


newbee.Position=pop(i).Position+phi.*(pop(i).Position-pop(k).Position);
newbee.Cost=f(newbee.Position); % Evaluation
% Comparison
if newbee.Cost<=pop(i).Cost
pop(i)=newbee;
else
C(i)=C(i)+1;
end

end

% Calculate Fitness Values and Selection Probabilities


F=zeros(nPop,1);
MeanCost = mean([pop.Cost]);

for i=1:nPop
F(i) = exp(-pop(i).Cost/MeanCost); % Convert Cost to Fitness
end
P=F/sum(F);

% Onlooker Bees (Onlooker Bees Phase)


for m=1:nOnlooker

%-----------------------------------------------
% Select Source Site by Roulette Wheel Selection
%-----------------------------------------------
r=rand;
CP=cumsum(P); % cumulative probabilities (do not reuse C, the abandonment counter)
i=find(r<=CP,1,'first');
%-----------------------------------------------

K=[1:i-1 i+1:nPop]; % Choose k randomly, not equal to i


k=K(randi([1 numel(K)]));
phi=a*rand(1,nVar); % Define Acceleration Coeff.

% New Bee Position


newbee.Position=pop(i).Position+phi.*(pop(i).Position-pop(k).Position);
newbee.Cost=f(newbee.Position); % Evaluation

% Comparison

if newbee.Cost<=pop(i).Cost
pop(i)=newbee;
else
C(i)=C(i)+1;
end
end
% Scout Bees (Scout Bees Phase)
for i=1:nPop
if C(i)>=L
pop(i).Position=VarMin+(VarMax-VarMin)*rand(1,nVar); % re-initialize within bounds
pop(i).Cost=f(pop(i).Position);
C(i)=0;
end
end
% Update Best Solution Ever Found
for i=1:nPop
if pop(i).Cost<=BestSol.Cost
BestSol=pop(i);
end
end
BestCost(it)=BestSol.Cost; % Store Best Cost Ever Found

% Display Iteration Information


disp(['Iteration ' num2str(it) ': Best Cost = ' num2str(BestCost(it))]);
end

%% Results
BestSol

figure;
%plot(BestCost,'LineWidth',2);
semilogy(BestCost,'LineWidth',2);
xlabel('Iteration'); ylabel('Best Cost');
grid on;
The Bacteria Foraging Optimization Algorithm (BFOA), proposed by Passino, is a
newcomer to the family of nature-inspired optimization algorithms. For over the last
five decades, optimization algorithms like Genetic Algorithms (GAs), Evolutionary
Programming (EP), and Evolutionary Strategies (ES), which draw their inspiration from
evolution and natural genetics, have been dominating the realm of optimization
algorithms. Recently, natural swarm-inspired algorithms like Particle Swarm
Optimization (PSO) and Ant Colony Optimization (ACO) have found their way into this
domain and proved their effectiveness. Following the same trend of swarm-based
algorithms, Passino proposed the BFOA. Application of the group foraging strategy of a
swarm of E. coli bacteria in multi-optimal function optimization is the key idea of the
new algorithm. Bacteria search for nutrients in a manner that maximizes the energy
obtained per unit time. An individual bacterium also communicates with others by
sending signals. A bacterium takes foraging decisions after considering these two
factors. The process in which a bacterium moves by taking small steps while searching
for nutrients is called chemotaxis, and the key idea of BFOA is mimicking the
chemotactic movement of virtual bacteria in the problem search space.

Now suppose that we want to find the minimum of the cost function 𝑱(𝜽) where 𝜽 ∈ ℜ𝑝
(i.e. 𝜽 is a 𝑝-dimensional vector of real numbers), and we do not have measurements or
an analytical description of the gradient ∇𝑱(𝜽). BFOA mimics the four principal
mechanisms observed in a real bacterial system: chemotaxis, swarming,
reproduction, and elimination-dispersal to solve this non-gradient optimization
problem. A virtual bacterium is actually one trial solution (may be called a search-
agent) that moves on the functional surface (see Figure above) to locate the global
optimum.
Flow diagram illustrating the bacterial foraging optimization algorithm
Generic algorithm of BFO

Randomly distribute initial values for 𝜃ᵢ, 𝑖 = 1, 2, . . . , 𝑆, across the optimization
domain. Compute the initial cost function value for each bacterium 𝑖 as 𝐽ᵢ, and the
initial total cost with swarming effect as 𝐽ᵢ,sw.

for Elimination-dispersal loop do


for Reproduction loop do
for Chemotaxis loop do
for Bacterium i do

Tumble: Generate a random vector 𝜑 as a unit-length random direction.
Move: Let 𝜃new = 𝜃ᵢ + 𝑐𝜑 and compute the corresponding 𝐽new. Let
𝐽new,sw = 𝐽new + 𝐽cc(𝜃new, 𝜽)
Swim: Let 𝑚 = 0
while 𝑚 < 𝑁s do
let 𝑚 = 𝑚 + 1
if 𝐽new,sw < 𝐽ᵢ,sw then
Let 𝜃ᵢ = 𝜃new and compute the corresponding 𝐽ᵢ and 𝐽ᵢ,sw.
Let 𝜃new = 𝜃ᵢ + 𝑐𝜑 and compute the corresponding 𝐽(𝜃new).
Let 𝐽new,sw = 𝐽new + 𝐽cc(𝜃new, 𝜽)
else
let 𝑚 = 𝑁s
end
end
end
end
Sort the bacteria in order of ascending cost 𝐽sw. The 𝑆r = 𝑆/2 bacteria with the highest
𝐽 values die, and the other 𝑆r bacteria with the best values split into two copies each.
Update the values of 𝐽 and 𝐽sw accordingly.
end
Eliminate and disperse the bacteria to random locations on the optimization domain
with probability 𝑝ed. Update the corresponding 𝐽 and 𝐽sw.
end
clc; clear; close all;
f=@(x)x(1)^2+x(1)*x(2)+3*x(2)^2+10; % The objective function
%Initialization
p=2; % dimension of search space
s=100; % The number of bacteria
Nc=100; % Number of chemotactic steps
Ns=8; % Limits the length of a swim
Nre=4; % The number of reproduction steps
Ned=2; % The number of elimination-dispersal events
Sr=s/2; % The number of bacteria reproductions (split) per generation
Ped=0.25; % Probability that each bacteria will be eliminated/dispersed
c(:,1)=0.05*ones(s,1); % the run length
for m=1:s % the initital posistions
P(1,:,1,1,1)= 50*rand(s,1)';
P(2,:,1,1,1)= .2*rand(s,1)';
%P(3,:,1,1,1)= .2*rand(s,1)';
end
% Main loop
for ell=1:Ned %Elimination and dispersal loop
for K=1:Nre %Reprodution loop
for j=1:Nc % swim/tumble(chemotaxis)loop
for i=1:s
J(i,j,K,ell)=f(P(:,i,j,K,ell));
% Tumble
Jlast=J(i,j,K,ell);
Delta(:,i)=(2*round(rand(p,1))-1).*rand(p,1);
P(:,i,j+1,K,ell)=P(:,i,j,K,ell)+c(i,K)*Delta(:,i)/sqrt(Delta(:,i)'*Delta(:,i)); % This adds a unit vector in the random direction

% Swim (for bacteria that seem to be headed in the right direction)


J(i,j+1,K,ell)=f(P(:,i,j+1,K,ell));
m=0; % Initialize counter for swim length
while m<Ns
m=m+1;
if J(i,j+1,K,ell)<Jlast
Jlast=J(i,j+1,K,ell);
P(:,i,j+1,K,ell)=P(:,i,j+1,K,ell)+c(i,K)*Delta(:,i)/sqrt(Delta(:,i)'*Delta(:,i));
J(i,j+1,K,ell)=f(P(:,i,j+1,K,ell));
else
m=Ns ;
end
end
J(i,j,K,ell)=Jlast;
sprintf('The value of iteration i=%3.0f, j=%3.0f, K=%3.0f, ell=%3.0f',i,j,K,ell);
end % Go to next bacterium
x = P(1,:,j,K,ell);
y = P(2,:,j,K,ell);
clf
plot(x, y , 'h')
grid on
axis([-5 5 -5 5]);
pause(.1)
end % Go to the next chemotactic

%Reprodution
Jhealth=sum(J(:,:,K,ell),2); % Set the health of each of the S bacteria
[Jhealth,sortind]=sort(Jhealth); % Sorts the nutrient concentration
P(:,:,1,K+1,ell)=P(:,sortind,Nc+1,K,ell);
c(:,K+1)=c(sortind,K); % keeps the chemotaxis parameters with each bacterium at the next generation

%Split the bacteria (reproduction)


for i=1:Sr
P(:,i+Sr,1,K+1,ell)=P(:,i,1,K+1,ell); % The least fit do not reproduce; the most fit split into two identical copies
c(i+Sr,K+1)=c(i,K+1);
end
end % Go to next reproduction

%Eliminatoin and dispersal


for m=1:s
if Ped>rand % Generate random number: disperse this bacterium to a random location
P(1,m,1,1,ell+1)= 50*rand;
P(2,m,1,1,ell+1)= .2*rand;
%P(3,m,1,1,ell+1)= .2*rand;
else
P(:,m,1,1,ell+1)=P(:,m,1,Nre+1,ell); % Bacteria that are not dispersed
end
end
end % Go to next elimination and disperstal

%Report
reproduction = J(:,[1:Ns,Nre,Ned]);
[jlastreproduction,O]=min(reproduction,[],2); %minf for each bacterial
[Y,I] = min(jlastreproduction)
pbest=P(:,I,O(I,:),K,ell)
plot([1:s],jlastreproduction)
xlabel('Iteration'), ylabel('Function')
The GWO algorithm, proposed by Mirjalili et al. in 2014, mimics the leadership
hierarchy and hunting mechanism of grey wolves in nature. Four types of grey wolves,
namely alpha, beta, delta, and omega, are employed for simulating the leadership
hierarchy. In addition, three main steps of hunting, searching for prey, encircling prey,
and attacking prey, are implemented to perform optimization.

Mathematical model: The hunting technique and the social hierarchy of grey wolves
are mathematically modeled in order to design GWO and perform optimization. The
proposed mathematical models of the social hierarchy, tracking, encircling, and
attacking prey are as follows:

■ Social hierarchy In order to mathematically model


the social hierarchy of wolves when designing GWO, we
consider the fittest solution as the alpha (α).
Consequently, the second and third best solutions are
named beta (β) and delta (δ) respectively. The rest of the
candidate solutions are assumed to be omega (ω). In the
GWO algorithm the hunting (optimization) is guided by
α, β, and δ. The ω wolves follow these three wolves.

■ Encircling prey: As mentioned above, grey wolves encircle prey during the hunt. In
order to mathematically model the encircling behavior the following equations are
proposed:

    𝑿(𝑡 + 1) = 𝑿ₚ(𝑡) − 𝑨 · 𝑫    with    𝑫 = |𝑪 · 𝑿ₚ(𝑡) − 𝑿(𝑡)|

where 𝑡 indicates the current iteration, 𝑨 and 𝑪 are coefficient vectors, 𝑿ₚ(𝑡) is the
position vector of the prey, and 𝑿(𝑡) indicates the position vector of a grey wolf. The
vectors 𝑨 and 𝑪 are calculated as follows:

    𝑨 = 2𝒂 · 𝒓₁ − 𝒂,    𝑪 = 2𝒓₂

where the components of 𝒂 are linearly decreased from 2 to 0 over the course of the
iterations and 𝒓₁, 𝒓₂ are random vectors in [0, 1].

■ Hunting: Grey wolves have the ability to recognize the location of prey and encircle
them. The hunt is usually guided by the alpha. The beta and delta might also
participate in hunting occasionally. However, in an abstract search space we have no
idea about the location of the optimum (prey). In order to mathematically simulate the
hunting behavior of grey wolves, we suppose that the alpha (best candidate solution)
beta, and delta have better knowledge about the potential location of prey. Therefore,
we save the first three best solutions obtained so far and oblige the other search agents
(including the omegas) to update their positions according to the position of the best
search agent. The following formulas are proposed in this regard.

    𝑫_α = |𝑪₁ · 𝑿_α − 𝑿|,    𝑿₁ = 𝑿_α − 𝑨₁ · 𝑫_α
    𝑫_β = |𝑪₂ · 𝑿_β − 𝑿|,    𝑿₂ = 𝑿_β − 𝑨₂ · 𝑫_β
    𝑫_δ = |𝑪₃ · 𝑿_δ − 𝑿|,    𝑿₃ = 𝑿_δ − 𝑨₃ · 𝑫_δ

    𝑿(𝑡 + 1) = (𝑿₁ + 𝑿₂ + 𝑿₃)/3

With these equations, a search agent updates its position according to alpha, beta, and
delta in an 𝑛-dimensional search space. In addition, the final position will be in a
random place within a circle which is defined by the positions of alpha, beta, and delta
in the search space. In other words, alpha, beta, and delta estimate the position of the
prey, and the other wolves update their positions randomly around the prey.

% Grey Wolf Optimizer


clear all, clc

SearchAgents_no=20;
Max_iter=200;
dim=4;
lb=-0.25*ones(1,dim); ub=0.25*ones(1,dim);
%fobj=@(x)(x(1)-1)^2+(x(2)-2)^2+(x(3)-3)^2+(x(4)-4)^2+(x(5)-5)^2;
%fobj=@(x)3*sin(x(1))+exp(x(2)); dim=2;
fobj=@(x)2*x(1)^2+3*x(2)^2+4*x(3)^2+5*x(4)^2+10;

% initialize alpha, beta, and delta_pos


Alpha_pos=zeros(1,dim);
Alpha_score=inf; %change this to -inf for maximization problems
Beta_pos=zeros(1,dim);
Beta_score=inf; %change this to -inf for maximization problems
Delta_pos=zeros(1,dim);
Delta_score=inf; %change this to -inf for maximization problems

%---------------------------------------------------------------------%
%Initialize the positions of search agents
%---------------------------------------------------------------------%

Boundary_no= size(ub,2); % number of boundaries

% If the boundaries of all variables are equal and the user enters a single
% number for both ub and lb
if Boundary_no==1
Positions=rand(SearchAgents_no,dim).*(ub-lb)+lb;
end

% If each variable has a different lb and ub


if Boundary_no>1
for i=1:dim
ub_i=ub(i);
lb_i=lb(i);
Positions(:,i)=rand(SearchAgents_no,1).*(ub_i-lb_i)+lb_i;
end
end
%---------------------------------------------------------------------%
Convergence_curve=zeros(1,Max_iter);
l=0; % Loop counter
% Main loop
while l<Max_iter
for i=1:size(Positions,1)
% Return back the search agents that go beyond the boundaries
Flag4ub=Positions(i,:)>ub; Flag4lb=Positions(i,:)<lb;
Positions(i,:)=(Positions(i,:).*(~(Flag4ub+Flag4lb)))+ub.*Flag4ub+lb.*F
lag4lb;
% Calculate objective function for each search agent
fitness=fobj(Positions(i,:));
% Update Alpha, Beta, and Delta
if fitness<Alpha_score
Alpha_score=fitness; % Update alpha
Alpha_pos=Positions(i,:);
end
if fitness>Alpha_score && fitness<Beta_score
Beta_score=fitness; % Update beta
Beta_pos=Positions(i,:);
end
if fitness>Alpha_score && fitness>Beta_score && fitness<Delta_score
Delta_score=fitness; % Update delta
Delta_pos=Positions(i,:);
end
end
a=2-l*((2)/Max_iter); % a decreases linearly from 2 to 0
% Update the Position of search agents including omegas
for i=1:size(Positions,1)
for j=1:size(Positions,2)
r1=rand();r2=rand(); % random numbers in [0,1]
A1=2*a*r1-a; C1=2*r2;
D_alpha=abs(C1*Alpha_pos(j)-Positions(i,j));
X1=Alpha_pos(j)-A1*D_alpha;
r1=rand(); r2=rand();
A2=2*a*r1-a; C2=2*r2;
D_beta=abs(C2*Beta_pos(j)-Positions(i,j));
X2=Beta_pos(j)-A2*D_beta;
r1=rand(); r2=rand();
A3=2*a*r1-a; C3=2*r2;
D_delta=abs(C3*Delta_pos(j)-Positions(i,j));
X3=Delta_pos(j)-A3*D_delta;
Positions(i,j)=(X1+X2+X3)/3;
end
end
l=l+1;
Convergence_curve(l)=Alpha_score;
end
%-----------------%
X=Alpha_pos        % best position found (the alpha wolf)
F0=fobj(Alpha_pos) % best objective value
%-----------------%
% Plot the objective surface (meaningful only for a two-variable objective,
% e.g. the commented fobj above with dim=2)
%-----%

if dim==2
x=-3:0.5:3; y=-3:0.5:3;
L=length(x);
f=zeros(L,L);
for i=1:L
for j=1:L
f(i,j)=fobj([x(i),y(j)]);
end
end
surfc(x,y,f,'LineStyle','none');
end

Nature Inspired algorithms include:


▪ Artificial Bee Colony Algorithm, ▪ Ant Colony Optimisation
▪ Firefly Algorithm, ▪ Swarm Optimisation
▪ Social Spider Algorithm, ▪ Fractal Stochastic Optimization
▪ Bat Algorithm, ▪ Rat Swarm Optimizer
▪ Strawberry Algorithm, ▪ Fish School Search Optimizer
▪ Plant Propagation Algorithm, ▪ The Grey Wolf Optimizer
▪ Seed Based Plant Propagation Algorithm ▪ Bacterial Foraging Optimization
▪ Genetic Algorithm, ▪ Harmony search Algorithm
▪ Simulated Annealing, ▪ Coronavirus herd immunity optimizer.

%-------------------------------------------------------------------------------------------------------%
Applications of Swarm Intelligence: Swarm Intelligence-based techniques can be
used in a number of applications. The U.S. military is investigating swarm techniques
for controlling unmanned vehicles. The European Space Agency is thinking about an
orbital swarm for self-assembly and interferometry. NASA is investigating the use of
swarm technology for planetary mapping. A 1992 paper by M. Anthony Lewis and
George A. Bekey discusses the possibility of using swarm intelligence to control
nanobots within the body for the purpose of killing cancer tumors. Conversely al-Rifaie
and Aber have used stochastic diffusion search to help locate tumours. Swarm
intelligence has also been applied for data mining. Ant based models are further subject
of modern management theory.

%-------------------------------------------------------------------------------------------------------%
CVX: is a MATLAB-based modeling system for convex optimization. It was created by
Michael Grant and Stephen Boyd. This MATLAB package is in fact an interface to other
convex optimization solvers such as SeDuMi and SDPT3. We will explore here some of
the basic features of the software, but a more comprehensive and complete guide can
be found at the CVX website (CVXr.com). The basic structure of a CVX program is as
follows:

cvx_begin
{variables declaration}
minimize({objective function}) or maximize({objective function})
subject to
{constraints}
cvx_end

CVX accepts only convex functions as objective and constraint functions. There are
several basic convex functions, called “atoms,” which are embedded in CVX.
Example: Suppose that we wish to solve the least squares problem

minimize ‖𝑨𝐱 − 𝒃‖2


subject to 𝑪𝐱 = 𝒅
‖𝐱 ‖∞ ≤ 𝑒
m = 20; n = 10; p = 4;
A = randn(m,n); b = randn(m,1);
C = randn(p,n); d = randn(p,1); e = rand;
cvx_begin
variable x(n)
minimize(norm(A*x-b,2))
subject to
C*x ==d
norm(x,Inf)<=e
cvx_end

Example: Suppose that we wish to write a CVX code that solves the convex
optimization problem

    minimize √(𝑥₁² + 𝑥₂² + 1) + 2 max{𝑥₁, 𝑥₂, 0}
    subject to: |𝑥₁| + |𝑥₂| + 𝑥₁²/𝑥₂ ≤ 5
                1/𝑥₂ + 𝑥₁⁴ ≤ 10
                𝑥₂ ≥ 1
                𝑥₁ ≥ 0
cvx_begin
variable x(2)
minimize(norm([x;1])+2*max(max(x(1),x(2)),0))
subject to
norm(x,1)+quad_over_lin(x(1),x(2))<=5
inv_pos(x(2))+x(1)^4<=10
x(2)>=1
x(1)>=0
cvx_end

Example: Let us use an example to illustrate how a metaheuristic works. The design of
a compressional and tensional spring involves three design variables: wire diameter 𝑥1 ,
coil diameter 𝑥2 , and the length of the coil 𝑥3 . This optimization problem can be
written as
    minimize 𝑓(𝐱) = 𝑥₁²𝑥₂(2 + 𝑥₃),

subject to the following constraints


    g₁(𝐱) = 1 − 𝑥₂³𝑥₃/(71785𝑥₁⁴) ≤ 0
    g₂(𝐱) = (4𝑥₂² − 𝑥₁𝑥₂)/(12566(𝑥₁³𝑥₂ − 𝑥₁⁴)) + 1/(5108𝑥₁²) − 1 ≤ 0
    g₃(𝐱) = 1 − 140.45𝑥₁/(𝑥₂²𝑥₃) ≤ 0
    g₄(𝐱) = (𝑥₁ + 𝑥₂)/1.5 − 1 ≤ 0

The bounds on the variables are 0.05 ≤ 𝑥₁ ≤ 2.0, 0.25 ≤ 𝑥₂ ≤ 1.3, 2.0 ≤ 𝑥₃ ≤ 15.0.

For a trajectory-based metaheuristic algorithm such as simulated annealing, an initial


guess, say 𝐱₀ = (1.0, 1.0, 14.0), is used. Then, the next move is generated and accepted
depending on whether it improves or not, possibly with a probability. For a population-
based metaheuristic algorithm such as PSO, a set of n vectors are generated initially.
Then, the values of the objective functions are compared and the current best solution
is found. Iterations proceed until a certain stopping criterion is met.
The following best solution can be found easily

𝐱 ⋆ = (0.051690 0.356750 11.28126), 𝑓(𝐱 ⋆ ) = 0.012665.
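
The reported optimum can be checked directly (a quick sketch using the objective
defined above):

% Evaluate the spring-design objective at the reported best solution
f = @(x) x(1)^2*x(2)*(2 + x(3));
xs = [0.051690 0.356750 11.28126];
f(xs)   % returns approximately 0.012665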

Metaheuristics have been used in many applications such as engineering design


optimization (Glover and Kochenberger 2003, Talbi 2008, Yang 2010). It is an area of
active research, and there is no doubt that more metaheuristic algorithms and new
applications will emerge in the future.
Armijo Backtracking Line Search: Among the many line search techniques, one of the
most successful ones is the Armijo backtracking line search, which starts with a
reasonably large line search parameter 𝛼𝑖𝑛𝑖𝑡 , and then reduces it until the function
value at the new position is sufficiently reduced relative to the value at the old position:
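
A minimal sketch of such a loop (an illustrative implementation; the function name,
argument list and constants below are our own choices, with c1 the sufficient-decrease
constant and rho the shrink factor):

function alpha = armijo_backtrack(f, x, d, g, alpha_init, rho, c1)
% Armijo backtracking line search (sketch)
% f: objective handle, x: current point, d: descent direction,
% g: gradient of f at x; typical values: rho = 0.5, c1 = 1e-4
alpha = alpha_init;            % start with a reasonably large step
fx = f(x);
while f(x + alpha*d) > fx + c1*alpha*(g'*d)
    alpha = rho*alpha;         % sufficient decrease failed: shrink and retry
end
end

For example, for f(x) = xᵀx at x = (1, 1)ᵀ with d = −∇f(x), the call
armijo_backtrack(@(x)x'*x, [1;1], -[2;2], [2;2], 1, 0.5, 1e-4) returns a step size
satisfying the sufficient-decrease condition.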

References:
[1] Heinz Rutishauser, Martin Gutknecht - Lectures on Numerical Mathematics, Birkhäuser Basel (1990)
[2] Walter Gander, Martin J. Gander, Felix Kwok - Scientific Computing: An Introduction using Maple and MATLAB, Springer International Publishing (2010)
[3] Alfio Quarteroni, Riccardo Sacco, Fausto Saleri - Numerical Mathematics, Springer-Verlag Berlin Heidelberg (2007)
[4] H. M. Antia - Numerical Methods for Scientists and Engineers, Tata Institute of Fundamental Research, Bombay (1995)
[5] Eugene Isaacson, Herbert Bishop Keller - Analysis of Numerical Methods, John Wiley & Sons (1966)
[6] Robert W. Hornbeck - Numerical Methods (1975)
[7] John H. Mathews - Numerical Methods Using MATLAB, Third Edition (1991)
[8] Dingyü Xue - Solving Optimization Problems with MATLAB® (2020)
[9] Amir Beck - Introduction to Nonlinear Optimization: Theory, Algorithms, and Applications with MATLAB, Society for Industrial and Applied Mathematics and the Mathematical Optimization Society (2014)
[10] Mircea Ancău - Practical Optimization with MATLAB, Cambridge Scholars Publishing (2019)
[11] Andreas Antoniou, Wu-Sheng Lu - Practical Optimization: Algorithms and Engineering Applications, Springer Science+Business Media, LLC (2007)
[12] Cesar Lopez - MATLAB Optimization Techniques, Apress (2014)
[13] Mohamed Bakr - Nonlinear Optimization in Electrical Engineering with Applications in MATLAB®, The Institution of Engineering and Technology (2013)
[14] Jorge Nocedal, Stephen J. Wright - Numerical Optimization, Springer Science+Business Media, LLC (2006)
[15] P. Venkataraman - Applied Optimization with MATLAB Programming, Wiley-Interscience (2001)
[16] Xin-She Yang - Introduction to Mathematical Optimization: From Linear Programming to Metaheuristics, Cambridge International Science Publishing (2008)
[17] S. Boyd, L. Vandenberghe - Convex Optimization, Cambridge University Press (2004)
[18] J. E. Dennis, Robert B. Schnabel - Numerical Methods for Unconstrained Optimization and Nonlinear Equations, Society for Industrial and Applied Mathematics (1996)
[19] Gene H. Golub, Charles F. Van Loan - Matrix Computations, The Johns Hopkins University Press, Baltimore and London (1996)
[20] F. B. Hildebrand - Introduction to Numerical Analysis, Second Edition (1974)
[21] J. H. Wilkinson, C. Reinsch - Handbook for Automatic Computation, Vol. II: Linear Algebra (1971)
[22] J. H. Wilkinson - The Algebraic Eigenvalue Problem, Oxford University Press (1965)
[23] Magnus R. Hestenes - Optimization Theory: The Finite Dimensional Case, John Wiley & Sons, Inc. (1975)
[24] R. M. Johnson - Linear Differential and Difference Equations: A Systems Approach for Mathematicians and Engineers (1997)
[25] Alston S. Householder - Principles of Numerical Analysis, McGraw-Hill Book Company, Inc. (1953)
[26] S. D. Conte, C. de Boor - Elementary Numerical Analysis: An Algorithmic Approach (1980)
[27] G. W. Stewart - Matrix Algorithms, Volume I: Basic Decompositions, Society for Industrial and Applied Mathematics (1998)
[28] G. W. Stewart - Matrix Algorithms, Volume II: Eigensystems, Society for Industrial and Applied Mathematics (2001)
[29] Abdul J. Jerri - Linear Difference Equations with Discrete Transform Methods, Springer Science+Business Media Dordrecht (1996)
[30] Peter Lancaster, Kęstutis Šalkauskas - Curve and Surface Fitting: An Introduction, Elsevier (1986)
[31] Charles F. Van Loan - Introduction to Scientific Computing: A Matrix-Vector Approach Using MATLAB, Prentice Hall (1996)
[32] Biswa Nath Datta - Numerical Linear Algebra and Applications, Brooks/Cole Pub. Co. (1995)
Appendices: Leverrier-Faddeev Algorithm; Sylvester, Lyapunov and Riccati Equations

Appendix A
Proof of the Leverrier-Faddeev Algorithm
It is very well known that ℒ{e^{𝑨t}} = (s𝑰 − 𝑨)^{-1} = 𝑵(s)/p(s) = (1/p(s))(𝑵₁s^{n-1} + 𝑵₂s^{n-2} + ⋯ + 𝑵ₙ)

where p(s) = s^n + a₁s^{n-1} + a₂s^{n-2} + ⋯ + aₙ and the adjugate matrix 𝑵(s) is a matrix polynomial in s of degree n − 1 with n × n constant coefficient matrices 𝑵₁, … , 𝑵ₙ.

The above equation can be re-written as (s𝑰 − 𝑨)𝑵(s) = p(s)𝑰; expanding this product we get 𝑵₁s^n + (𝑵₂ − 𝑨𝑵₁)s^{n-1} + (𝑵₃ − 𝑨𝑵₂)s^{n-2} + ⋯ + (𝑵ₙ − 𝑨𝑵ₙ₋₁)s − 𝑨𝑵ₙ = (s^n + a₁s^{n-1} + ⋯ + aₙ)𝑰.

By equating identical powers we obtain:


𝑵₁ = 𝑰
𝑵₂ = 𝑨𝑵₁ + a₁𝑰
𝑵₃ = 𝑨𝑵₂ + a₂𝑰
⋮
𝟎 = 𝑨𝑵ₙ + aₙ𝑰

If the coefficients a₁, … , aₙ of the characteristic polynomial p(s) were known, the last equation would constitute an algorithm for computing the matrices 𝑵₁, … , 𝑵ₙ. Leverrier and Faddeev proposed a recursive algorithm which computes the 𝑵ᵢ and aᵢ in parallel, even if the coefficients aᵢ are not known in advance.

Important Note: Although the Leverrier-Faddeev method has been extensively covered in most books on linear system theory, the majority of these books unfortunately do not give a proof of the coefficient formulas.

To complete the proof, consider the diagonalization 𝑨 = 𝑽𝚲𝑽^{-1}, from which

e^{𝑨t} = 𝑽 diag(e^{λ₁t}, … , e^{λₙt}) 𝑽^{-1} ⟹ trace(e^{𝑨t}) = ∑_{i=1}^{n} e^{λᵢt}
Using the Laplace transform we get

trace(ℒ{e^{𝑨t}}) = ∑_{i=1}^{n} 1/(s − λᵢ) = p′(s)/p(s)
On the other hand we know that

d/dt e^{𝑨t} = 𝑨e^{𝑨t} ⟹ sℒ{e^{𝑨t}} − 𝑰 = 𝑨ℒ{e^{𝑨t}}, using ℒ{f′(t)} = sF(s) − f(0)

Taking the trace of the last equation we get

s p′(s)/p(s) − n = trace(𝑨ℒ{e^{𝑨t}}) = trace(𝑨𝑵(s)/p(s)) = (1/p(s)) trace(𝑨𝑵(s))

which can be written as s p′(s) − n p(s) = trace(𝑨𝑵(s)). This is equivalent to

−a₁s^{n-1} − 2a₂s^{n-2} − ⋯ − (n − 1)aₙ₋₁s − naₙ = tr(𝑨𝑵₁)s^{n-1} + tr(𝑨𝑵₂)s^{n-2} + ⋯ + tr(𝑨𝑵ₙ)

Comparing coefficients in the equation yields the relations

aₖ = −(1/k) tr(𝑨𝑵ₖ),  k = 1, 2, … , n
Finally we obtain:

(s𝑰 − 𝑨)^{-1} = (𝑵₁s^{n-1} + 𝑵₂s^{n-2} + ⋯ + 𝑵ₙ)/(s^n + a₁s^{n-1} + a₂s^{n-2} + ⋯ + aₙ)

where

𝑵₁ = 𝑰,  a₁ = −tr(𝑨𝑵₁)
𝑵₂ = 𝑨𝑵₁ + a₁𝑰,  a₂ = −(1/2) tr(𝑨𝑵₂)
𝑵₃ = 𝑨𝑵₂ + a₂𝑰,  a₃ = −(1/3) tr(𝑨𝑵₃)
⋮
𝑵ₙ = 𝑨𝑵ₙ₋₁ + aₙ₋₁𝑰,  aₙ = −(1/n) tr(𝑨𝑵ₙ)
𝟎 = 𝑨𝑵ₙ + aₙ𝑰  (terminal check)
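A short MATLAB sketch of this recursion follows; it computes the coefficients aₖ, checks the terminal identity 𝟎 = 𝑨𝑵ₙ + aₙ𝑰, and compares against MATLAB's poly.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Leverrier-Faddeev recursion (a sketch)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
clear all, clc, A = 10*rand(4,4); n = size(A,1); I = eye(n);
N = I; a = zeros(1,n);
for k = 1:n
    a(k) = -trace(A*N)/k;        % a_k = -(1/k) tr(A N_k)
    Nnext = A*N + a(k)*I;        % N_{k+1} = A N_k + a_k I
    if k < n, N = Nnext; end
end
Check = Nnext                    % should be the zero matrix: A N_n + a_n I
p = [1 a]
poly(A)                          % compare with MATLAB's characteristic polynomial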

Modified Leverrier-Faddeev algorithm: Let us assume that

(s𝑰 − 𝑨)^{-1}𝑨 = 𝑩(s)/p(s) ⟺ (s𝑰 − 𝑨)𝑩(s) = p(s)𝑨

Following the same procedure as before we obtain

𝑩₁ = 𝑨,  p₁ = tr(𝑩₁)
𝑩₂ = 𝑨(𝑩₁ − p₁𝑰),  p₂ = (1/2) tr(𝑩₂)
⋮
𝑩ₙ₋₁ = 𝑨(𝑩ₙ₋₂ − pₙ₋₂𝑰),  pₙ₋₁ = (1/(n−1)) tr(𝑩ₙ₋₁)
𝑩ₙ = 𝑨(𝑩ₙ₋₁ − pₙ₋₁𝑰),  pₙ = (1/n) tr(𝑩ₙ)
And 𝑩𝑛 = 𝑝𝑛 𝑰. If 𝑨 is nonsingular then the inverse of 𝑨 can be determined by

𝑨−1 = (𝑩𝑛−1 − 𝑝𝑛−1 𝑰)/𝑝𝑛
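The following sketch implements this modified recursion and recovers 𝑨^{-1} as stated; the final norm is only a residual check, not part of the algorithm.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Matrix inverse via the modified Leverrier-Faddeev
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
clear all, clc, A = 10*rand(4,4); n = size(A,1); I = eye(n);
B = A; p = trace(B);
for k = 2:n
    Bprev = B; pprev = p;
    B = A*(Bprev - pprev*I);     % B_k = A (B_{k-1} - p_{k-1} I)
    p = trace(B)/k;              % p_k = (1/k) tr(B_k)
end
Ainv = (Bprev - pprev*I)/p;      % A^{-1} = (B_{n-1} - p_{n-1} I)/p_n
norm(Ainv - inv(A))              % residual check: should be near zero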


Searching for the Maximum Value of an Array: Sometimes, given an array of numbers in systems engineering, the smallest or largest value needs to be identified quickly. There are several built-in ways to find a minimum or maximum value in an array in MATLAB. Here are three simple algorithms that compute these values.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Simple Methods for Find Max value of Vector array
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
clear all, clc, Time =[1 2 3 4 101 6 7 8 100];
Max=Time(1); k=1;
while (k<=length(Time))
if (Time(k)> Max)
Max = Time(k);
end
k=k+1;
end
Max

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Simple Methods for Find Max value of Vector array
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
clear all, clc, Time =rand(1,10)
Max=Time(1);
for k=1:length(Time)
if (Time(k)> Max)
Max = Time(k);
end
end
Max

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Simple Methods for Find Max value of Matrix
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
clear all, clc, A =[1 2 3;4 101 6;7 8 100];
Maxvalue=-inf;
Row=0; Column=0;
for i=1:size(A,1)
for j=1:size(A,2)
if A(i,j)> Maxvalue
Maxvalue= A(i,j);
Row=i; Column=j;
end
end
end
Maxvalue

The Left/Right and Up/Down Flip Operator: The swapping (flip) operator is very important in signal processing and in engineering generally. A left/right flip reverses the entries within each row: the columns are preserved, but appear in the opposite order.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Simple Methods For right/left reflection
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
clear all, clc, A =10*rand(5,5); [m,n]=size(A);
for i=1:m
for j=1:n
B(i,j)= A(i,end-j+1);
end
end
A, B

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Simple Methods for Up/Down reflection
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
clear all, clc, A =10*rand(5,5), [m,n]=size(A);
for i=1:m
for j=1:n
C(i,j)= A(end-i+1,j);
end
end
A, C

Upper/Lower Triangular (Sum Decomposition): Any matrix can be decomposed into the sum of a strictly lower triangular matrix, a strictly upper triangular matrix, and a diagonal matrix. Here is an algorithm that performs this decomposition.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Simple Method for Upper/Lower Triangular (Sum Decomposition)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
clear all, clc, A =10*rand(3,3)
[m,n]=size(A); C=zeros(m,n); B=zeros(m,n); D=zeros(m,n);
for i=1:m
for j=1:n
D(i,i)= A(i,i);        % diagonal part
if i>j
B(i,j)= A(i,j);        % strictly lower triangular part
elseif i<j
C(i,j)= A(i,j);        % strictly upper triangular part
end
end
end
C,B,D,                 % A = B + C + D

Sorting Operator: arranges the elements of a collection in ascending or descending order.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Simple Sorting Methods for Vectors "in descending order"
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
clear all, clc, V=10*rand(1,4), V1=V; n = length(V);
while n ~= 0
nn = 0;
for ii = 1:n-1
if V(ii) > V(ii+1)
[V(ii+1),V(ii)] = deal(V(ii), V(ii+1));
nn = ii;
end
end
n = nn;
end
V1, V

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Simple Sorting Methods for Matrices (Rows) "in descending order"
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
clear all, clc, A=10*rand(4,4); B = zeros(size(A)); n = size(A,2);
for iRow = 1:size(A, 1)
v = A(iRow, :);
while any(diff(v) > 0) % Test if sorted in descending order
v = v(randperm(n,n)); % If not, shuffle the elements randomly
end
B(iRow,:) = v;
end
A,B

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Simple Sorting Methods for Matrices (COLUMNS)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
clear all, clc, A=10*rand(4,4); B = zeros(size(A)); n = size(A,2);
for iCOL = 1:size(A, 2)
v = A(:, iCOL);
while any(diff(v) > 0) % Test if sorted in descending order
v = v(randperm(n, n)); % If not, shuffle the elements randomly
end
B(:, iCOL) = v;
end
A, B

Singular Value Decomposition: In linear algebra, the singular value decomposition (SVD) is a factorization of a real or complex matrix that generalizes the eigendecomposition of a square normal matrix to any m × n matrix via an extension of the polar decomposition. Here is an algorithm that computes this decomposition.

clear all, clc, A=10*rand(2,4); %A=[2 4 1 3; 0 0 2 1];

% Get V from the eigenvectors of A'*A
[Vecs, Vals] = eig(A'*A);
[~, P] = sort(sum(Vals), 'descend');
V = Vecs(:,P);

% Get Sigma
singularValues = sum(Vals(:,P)).^0.5;
sigma=zeros(size(A));
m=nnz(find(singularValues>=0.001));
for i=1:m
sigma(i, i) = singularValues(i);
end

% Get U consistently with V via u_i = A*v_i/sigma_i (computing U and V
% from two independent eig calls can give sign-inconsistent eigenvectors,
% in which case U*sigma*V' would not reproduce A)
U = zeros(size(A,1));
for i=1:m
U(:,i) = A*V(:,i)/singularValues(i);
end
A
U*sigma*V'
%[U S V]=svd(A)

Solving linear systems by QR: Given a system of linear equations of the form 𝑨𝐱 = 𝒃, assume that 𝑨 is decomposed in the QR form

𝑨𝐱 = 𝒃 ⟺ 𝑸𝑹𝐱 = 𝒃 ⟺ {𝑸𝐲 = 𝒃 and 𝑹𝐱 = 𝐲} ⟺ {𝐲 = 𝑸^T𝒃 and 𝑹𝐱 = 𝐲} ⟺ 𝑹𝐱 = 𝑸^T𝒃 = 𝒃_new

The matrix 𝑹 is an upper triangular thus the last equation can be solved by back substitution.

clear all, clc, A=10*rand(10,10); b=10*rand(10,1);


n=size(A,1); b1=b; [Q R]=qr(A); b=Q'*b; % 𝑹𝐱 = 𝒃𝑛𝑒𝑤 = 𝑸𝑇 𝒃

x=zeros(n,1);
for j=n:-1:1
if (R(j,j)==0)
error('Matrix is singular!');
end
x(j)=b(j)/R(j,j);
b(1:j-1)=b(1:j-1)-R(1:j-1,j)*x(j);
end
A*x-b1
Appendix B
The main object of this section is to introduce a famous
theorem in linear algebra that provides a nice answer to the following question: if we
wish to use only orthogonal (or unitary) matrices as similarity transformation matrices,
what is the simplest form to which a given matrix 𝑨 can be transformed? It would be nice
if we could say something like “diagonal” or “Jordan canonical form”. Unfortunately,
neither is possible. However, upper triangular matrices are very nice special forms of
matrices. In particular, we can see the eigenvalues of an upper triangular matrix at a
glance. That makes the Schur-decomposition theorem extremely attractive.

In the mathematical discipline of linear algebra, the Schur decomposition (Schur triangular form) or Schur triangularization, named after Issai Schur, is a matrix decomposition. It allows one to write an arbitrary complex matrix as unitarily equivalent to an upper triangular matrix whose diagonal elements are the eigenvalues of the original matrix: 𝑸^H𝑨𝑸 = 𝑻, where 𝑸 is a unitary matrix (𝑸^H𝑸 = 𝑰) and 𝑻 is upper triangular. In the real Schur form, 𝑻 is quasi-triangular with diagonal blocks that are either 1 × 1 (corresponding to real eigenvalues) or 2 × 2 (corresponding to complex conjugate eigenvalues). See J. H. (Jan) Brandts 2001.

The Schur decomposition of a given matrix is numerically computed by the QR algorithm or its variants. The Schur algorithm consists of two separate stages. First, by means of a similarity transformation, the original matrix is transformed in a finite number of steps to Hessenberg form (in the Hermitian/symmetric case, to real tridiagonal form). This first stage of the algorithm prepares its second stage, called triangularization.

Remark 1: Hessenberg decomposition is the first step in the Schur algorithm. Although every square matrix has a Schur decomposition, in general this decomposition is not unique.

Remark 2: If a real matrix 𝑨 is transformable to an upper triangular form, and only by means of the QR-algorithm, then the Hessenberg and Schur forms are the same.

Remark 3: Issai Schur (1875-1941) was a Russian mathematician who worked in Germany for most of his life. He studied at the University of Berlin, obtained his doctorate in 1901, became a lecturer in Zurich in 1903 and a professor in 1919. He was a student of Ferdinand Georg Frobenius.

Consider for the moment a QR-factorization of the matrix 𝑨 = 𝑸₁𝑹₁, where 𝑸₁^H𝑸₁ = 𝑰 and 𝑹₁ is upper triangular. We now reverse the order of the product of 𝑸₁ and 𝑹₁ and eliminate 𝑹₁:

𝑹₁𝑸₁ = 𝑸₁^H𝑨𝑸₁

Since 𝑸1𝐻 𝑨𝑸1 is a similarity transformation of 𝑨, therefore the matrix 𝑹1 𝑸1 has the same
eigenvalues as 𝑨. More importantly, we have seen before that by repeating this process,
the matrix 𝑹𝑘 𝑸𝑘 will become closer and closer to upper triangular, such that we
eventually can read off the eigenvalues from the diagonal. That is, the QR-method
generates a sequence of matrices 𝑨ₖ initiated with 𝑨 = 𝑨₀ = 𝑸₁𝑹₁ and given by 𝑨ₖ = 𝑹ₖ𝑸ₖ, where 𝑸ₖ and 𝑹ₖ represent a QR-factorization of 𝑨ₖ₋₁ = 𝑸ₖ𝑹ₖ. If we repeat the process k times we obtain the upper triangular matrix 𝑻 = 𝑸ₖ^H ⋯ 𝑸₂^H𝑸₁^H 𝑨 𝑸₁𝑸₂ ⋯ 𝑸ₖ = 𝑸^H𝑨𝑸.
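A small demonstration of this iteration is sketched below for a matrix with distinct real eigenvalues (for which the unshifted iteration converges); production codes use Hessenberg reduction and shifts.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Unshifted QR iteration (a sketch)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
clear all, clc, M = 10*rand(4,4); A = M*diag([-1 -2 -3 -4])*inv(M);
Ak = A; Qtot = eye(4);
for k = 1:200
    [Q,R] = qr(Ak);
    Ak = R*Q;                  % A_{k+1} = R_k Q_k = Q_k' A_k Q_k
    Qtot = Qtot*Q;             % accumulate the overall similarity
end
Ak                              % nearly upper triangular
norm(Qtot'*A*Qtot - Ak)         % verifies T = Q' A Q
sort(diag(Ak)), sort(eig(A))    % diagonal of T vs MATLAB's eig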
Theorem [Schur Form]: Let 𝑨 ∈ ℂ^{n×n}. Then there exists a unitary matrix 𝑸 (not unique) such that 𝑸^H𝑨𝑸 = 𝑻 is upper triangular (not unique), with the eigenvalues of 𝑨 as diagonal elements in any required order.

Proof: Use induction on 𝑛, the size of the matrix. For 𝑛 = 1, there is nothing to prove.
For 𝑛 > 1, assume that all (𝑛 − 1) × (𝑛 − 1) matrices are unitarily similar to an upper-
triangular matrix, and consider an 𝑛 × 𝑛 matrix 𝑨. Suppose that (𝜆, 𝐱) is an eigenpair
for 𝑨, and suppose that 𝐱 has been normalized so that ‖𝐱‖2 = 1.

We can construct an elementary reflector 𝑹 (with 𝑹² = 𝑰, or 𝑹 = 𝑹^H = 𝑹^{-1}) with the property that 𝑹𝐱 = 𝒆₁ or, equivalently, 𝐱 = 𝑹𝒆₁:

𝑹 = 𝑰 − 2𝐮𝐮^H/(𝐮^H𝐮) with 𝐮 = 𝐱 ± μ‖𝐱‖𝒆₁, where μ = 1 if x₁ is real and μ = x₁/|x₁| if x₁ is not real; then 2𝐮^H𝐱 = 𝐮^H𝐮 ⟹ 𝑹𝐱 = 𝒆₁

Thus 𝐱 is the first column of 𝑹, so 𝑹 = (𝐱 | 𝑽), and

𝑹𝑨𝑹 = 𝑹𝑨(𝐱 | 𝑽) = 𝑹(λ𝐱 | 𝑨𝑽) = (λ𝒆₁ | 𝑹𝑨𝑽) = ( λ  𝐱^H𝑨𝑽 ; 𝟎  𝑽^H𝑨𝑽 )
Since 𝑽^H𝑨𝑽 is (n − 1) × (n − 1), the induction hypothesis ensures that there exists a unitary matrix 𝑸 such that 𝑸^H(𝑽^H𝑨𝑽)𝑸 = 𝑻₁ is upper triangular. If 𝑼 = 𝑹 ( 1  𝟎 ; 𝟎  𝑸 ), then 𝑼 is unitary (because 𝑼^H = 𝑼^{-1}), and

𝑼^H𝑨𝑼 = ( λ  𝐱^H𝑨𝑽𝑸 ; 𝟎  𝑸^H𝑽^H𝑨𝑽𝑸 ) = ( λ  𝐱^H𝑨𝑽𝑸 ; 𝟎  𝑻₁ ) = 𝑻

is upper triangular. Since similar matrices have the same eigenvalues, and since the
eigenvalues of a triangular matrix are its diagonal entries, the diagonal entries of 𝑻 must
be the eigenvalues of 𝑨. ■

Transformation from Schur to Jordan Form: Let the matrix 𝑱 be the Jordan form of 𝑨, so that 𝑨 = 𝑽𝑱𝑽^{-1}. Now apply the QR algorithm to the matrix (𝑽^{-1})^H so that (𝑽^{-1})^H = 𝑸𝑹, and define a new matrix 𝑻 = 𝑸^H𝑨𝑸; this new matrix is in lower triangular form.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% RATIONAL TRANSFORMATION FROM SCHUR TO JORDAN FORM
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
clear all, clc,
% M1=10*rand(5,5); A=M1*diag([-1 -2 -3 -4 -5])*inv(M1);
A=[3 5 4 6 1;4 5 6 7 7;3 3 2 7 8;4 2 8 7 7;3 6 9 5 1];
[V D]=eig(A); W=(inv(V))'; % inv(V)*A*V
[Q R]=qr(W);
T=Q'*A*Q;
T = tril(T)
Q

If instead we select 𝑽 = 𝑸𝑹, then 𝑻 = 𝑸^H𝑨𝑸 is in upper triangular form.


clear all, clc, M1=10*rand(5,5); A=M1*diag([-1 -2 -3 -4 -5])*inv(M1);
% A=10*rand(5,5);
[V D]=eig(A); % inv(V)*A*V
[Q R]=qr(V);
T=Q'*A*Q;
T = triu(T)
[V1 D1]=eig(T);
[Vnew,Dnew] = cdf2rdf(V,D);
T= inv(Vnew)*Dnew*Vnew
Q
%---------------- verifications by MATLAB instructions ----------------%
[Q2 T2]=schur(A);
T2
Q2

Generalized Schur Decomposition: Given square matrices 𝑨 and 𝑩, the generalized Schur decomposition factorizes both matrices as 𝑨 = 𝑸𝑺𝒁^H and 𝑩 = 𝑸𝑻𝒁^H, where 𝑸 and 𝒁 are unitary, and 𝑺 and 𝑻 are upper triangular. The generalized Schur decomposition is also sometimes called the QZ-decomposition.

The generalized eigenvalues 𝜆 that solve the generalized eigenvalue problem 𝑨𝐱 = 𝜆𝑩𝐱
(where 𝐱 is an unknown nonzero vector) can be calculated as the ratio of the diagonal
elements of 𝑺 to those of 𝑻. That is, using subscripts to denote matrix elements, the 𝑖 𝑡ℎ
generalized eigenvalue 𝜆𝑖 satisfies 𝜆𝑖 = 𝑺𝑖𝑖 /𝑻𝑖𝑖 .

𝑨𝐱 = λ𝑩𝐱 ⟺ (𝑨 − λ𝑩)𝐱 = 𝟎
⟺ 𝐱 ∈ 𝒩(𝑨 − λ𝑩)
⟺ det(𝑨 − λ𝑩) = 0
⟺ det(𝑸) det(𝑺 − λ𝑻) det(𝒁^H) = 0
⟺ λᵢ = 𝑺ᵢᵢ/𝑻ᵢᵢ if 𝑻ᵢᵢ ≠ 0

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Generalized Schur decomposition Or QZ decomposition
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
clear all, clc, M1=10*rand(5,5); M2=10*rand(5,5);
A=M1*diag([-1 -2 -3 -4 -5])*inv(M1);
B=M2*diag([1 2 3 4 5])*inv(M2);
n=size(A,1); [AA,BB,Q,Z] = qz(A,B) % MATLAB instruction
lambda=zeros(n,n);
for i=1:n
lambda(i,i)=AA(i,i)/BB(i,i);
end
lambda
%---------------- verifications by MATLAB instructions ----------------%
[V, lambda]=eig(A,B);
lambda
Theorem (Generalized Real Schur Decomposition): If 𝑨 and 𝑩 are in ℝ^{n×n}, then there exist orthogonal matrices 𝑸 and 𝒁 such that 𝑸^T𝑨𝒁 is upper quasi-triangular and 𝑸^T𝑩𝒁 is upper triangular.

Proof: (See Stewart 1972). ■

If 𝑺 is triangular, the diagonal elements of 𝑺 and 𝑻,

𝚲𝛼 = diag(diag(𝑺)) and 𝚲𝛽 = diag(diag(𝑻))

are the generalized eigenvalues that satisfy

𝑨𝑽𝚲_β = 𝑩𝑽𝚲_α and 𝚲_β𝑾^H𝑨 = 𝚲_α𝑾^H𝑩

The eigenvalues produced by 𝜆 = eig(𝑨, 𝑩) are the element-wise ratios of 𝚲𝛼 (𝑖, 𝑖) and
𝚲𝛽 (𝑖, 𝑖).
𝜆𝑖 = 𝚲𝛼 (𝑖, 𝑖)/𝚲𝛽 (𝑖, 𝑖)

If 𝑺 is not triangular, it is necessary to further reduce the 2-by-2 blocks to obtain the
eigenvalues of the full system.

clear all, clc,


A =[2 4 3 4 1;2 4 3 1 5;2 2 5 2 3;4 5 5 3 3;5 2 4 1 3];
B =[3 5 2 2 3;4 5 5 5 3;1 4 5 2 3;5 2 2 5 5;2 4 4 1 4];
[AA,BB,Q,Z,V,W] = qz(A,B);
alpha = diag(AA)
beta = diag(BB)
zero1=A*V*diag(beta)- B*V*diag(alpha)
zero2=diag(beta)*W'*A - diag(alpha)*W'*B
lambda = alpha./beta
lambda = eig(A,B) % verifications

Sylvester Equation: In mathematics, in the field of control theory, a Sylvester equation is a matrix equation of the form 𝑨𝑿 + 𝑿𝑩 = 𝑪. Given matrices 𝑨, 𝑩, and 𝑪, the problem is to find the possible matrices 𝑿 that obey this equation. 𝑨 and 𝑩 must be square matrices of sizes n and m respectively, and then 𝑿 and 𝑪 both have n rows and m columns.

A Sylvester equation has a unique solution for 𝑿 exactly when there are no common
eigenvalues of 𝑨 and −𝑩. More generally, the equation 𝑨𝑿 + 𝑿𝑩 = 𝑪 has been considered
as an equation of bounded operators on a (possibly infinite-dimensional) Banach space.

If we introduce the Sylvester operator 𝑺: ℂ^{n×m} ⟶ ℂ^{n×m} defined by 𝑺𝑿 = 𝑨𝑿 + 𝑿𝑩, then we may write Sylvester's equation in the form 𝑺𝑿 = 𝑪.

The operator 𝑺 is linear in 𝑿. Hence, our problem reduces to that of determining when 𝑺
is nonsingular. It turns out that the nonsingularity depends on the spectra of 𝑨 and 𝑩.
Specifically, we have the following theorem.
Theorem: Given matrices 𝑨 ∈ ℂ^{n×n} and 𝑩 ∈ ℂ^{m×m}, the Sylvester equation 𝑨𝑿 + 𝑿𝑩 = 𝑪 has a unique solution 𝑿 ∈ ℂ^{n×m} for any 𝑪 ∈ ℂ^{n×m} if and only if 𝑨 and −𝑩 do not share any eigenvalue.

Proof: To prove the necessity of the condition, let (λ, 𝐱) be a right eigenpair of 𝑨 and let (μ, 𝐲) be a left eigenpair of −𝑩. Let 𝑿 = 𝐱𝐲^H. Then

𝑺𝑿 = 𝑨𝐱𝐲^H + 𝐱𝐲^H𝑩 = λ𝐱𝐲^H − μ𝐱𝐲^H = (λ − μ)𝑿

Hence if we can choose λ − μ = 0 (that is, if 𝑨 and −𝑩 have a common eigenvalue), then 𝑿 is a "null vector" of 𝑺, and 𝑺 is singular. This proves the necessity of the condition.

To prove the converse, we use the fact that an operator is nonsingular if every linear
system in the operator has a solution. Consider the system 𝑺𝑿 = 𝑪. The first step is to
transform this system into a more convenient form. Let 𝑻 = 𝑽𝐻 𝑩𝑽 be a Schur
decomposition of 𝑩. Then Sylvester's equation can be written in the form

𝑨(𝑿𝑽) + 𝑿𝑽(𝑽𝐻 𝑩𝑽) = 𝑪𝑽

If we set 𝒀 = 𝑿𝑽 and 𝑫 = 𝑪𝑽, we may write the transformed equation in the form

𝑨𝒀 + 𝒀𝑻 = 𝑫
Let us partition this system in the form

𝑨(𝒚₁, 𝒚₂, … ) + (𝒚₁, 𝒚₂, … ) ( t₁₁ t₁₂ t₁₃ ⋯ ; 0 t₂₂ t₂₃ ⋯ ; 0 0 t₃₃ ⋯ ; ⋮ ⋮ ⋮ ) = (𝒅₁, 𝒅₂, … )

In general, suppose that we have found 𝒚₁, 𝒚₂, … , 𝒚ₖ₋₁. From the kth column of this equation we get

(𝑨 + tₖₖ𝑰)𝒚ₖ = 𝒅ₖ − ∑_{i=1}^{k-1} tᵢₖ𝒚ᵢ

The right-hand side of this equation is well defined, and the matrix on the left is nonsingular, since −tₖₖ is an eigenvalue of −𝑩 and therefore, by hypothesis, not an eigenvalue of 𝑨. Hence 𝒚ₖ is well defined. We have thus found a solution 𝒀 of the equation 𝑨𝒀 + 𝒀𝑻 = 𝑫, and hence 𝑿 = 𝒀𝑽^H is a solution of 𝑺𝑿 = 𝑪, which proves the sufficiency of the condition. Q.E.D. ■

Alternative proof: Assume that σ is an eigenvalue of the operator 𝑺, that is 𝑺𝑿 = σ𝑿, or equivalently (𝑨 − σ𝑰)𝑿 + 𝑿𝑩 = 𝑨𝑿 + 𝑿(𝑩 − σ𝑰) = 𝟎, which means that σ = λ − μ, where λ and μ are eigenvalues of 𝑨 and −𝑩 respectively:

𝑿 ⟼ (𝑨 − λ𝑰)𝑿 + 𝑿(𝑩 + μ𝑰) = 𝟎

Now the operator 𝑺 is singular only if σ = 0, which means that 𝑨 and −𝑩 share a common eigenvalue. ■

Corollary: The eigenvalues of 𝑺 are the sums of the eigenvalues of 𝑨 and 𝑩.
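This corollary can be checked numerically with the Kronecker (vec) representation of 𝑺 used later in this appendix; the sketch below compares the eigenvalues of that matrix with all pairwise sums (agreement is up to the ordering of complex pairs).

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Numerical check: eigenvalues of the Sylvester operator
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
clear all, clc, A = rand(3,3); B = rand(2,2);
G = kron(eye(2),A) + kron(B.',eye(3));   % matrix of X -> AX + XB in the vec basis
e1 = sort(eig(G));
[la,mu] = meshgrid(eig(A),eig(B));
e2 = sort(la(:)+mu(:));                  % all sums lambda_i(A) + mu_j(B)
norm(e1-e2)                              % should be near zero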


Certain matrix equations arise naturally in linear control and system theory. Among those frequently encountered in the analysis and design of linear systems are the Lyapunov equations

𝑨𝑿 + 𝑿𝑨^T = 𝑪  (continuous-time systems)
𝑨𝑿𝑨^T − 𝑿 = 𝑪  (discrete-time systems)

which are special cases of another classical matrix equation, known as the Sylvester equation:

𝑨𝑿 + 𝑿𝑩 = 𝑪  (continuous-time systems)
𝑨𝑿𝑩 − 𝑿 = 𝑪  (discrete-time systems)

The Sylvester equations also arise in a wide variety of applications. For example, the
numerical solution of elliptic boundary value problems can be formulated in terms of the
solution of the Sylvester equation (Starke and Niethammer 1991). The solution of the
Sylvester equation is also needed in the block diagonalization of a matrix by a similarity
transformation (see Datta 1995) and Golub and Van Loan (1996). Once a matrix is
transformed to a block diagonal form using a similarity transformation, the block
diagonal form can then be conveniently used to compute the matrix exponential 𝑒 𝑨𝑡 .

From the algebraic point of view it is well known that if the matrices 𝑨 and 𝑩 are stable then the unique solution of the continuous-time Sylvester equation is given by

𝑿 = −∫₀^∞ e^{𝑨t}𝑪e^{𝑩t} dt ⟺ 𝑿 = −∫₀^∞ 𝑫(t) dt ≈ −∑_{k=0}^{∞} 𝑫(kTₛ)Tₛ ≅ −∑_{k=0}^{∞} (𝑰 + 𝑨Tₛ)^k(𝑪Tₛ)(𝑰 + 𝑩Tₛ)^k

clear all, clc, Ts=0.01; M1=randi(5,3); M2=randi(7,3); M3=randi(10,3);


A =M1*diag([-1 -2 -3])*inv(M1); B =M2*diag([-4 -5 -6])*inv(M2);
C =M3*diag([1 2 3])*inv(M3); n=size(A,1); I=eye(n,n); N=1000;
EA=I + Ts*A; EB=I + Ts*B; CT=Ts*C; X=zeros(n,n);
for k=0:N
X=X - (EA^k)*(CT)*(EB^k);
end
X
Zero=A*X+X*B-C % self-related verifications
XMATLAB = sylvester(A,B,C) % verifications using MATLAB

Another MATLAB code:

clear all, clc, Ts=0.001;


M1=0.1*randi(5,3); M2=0.1*randi(7,3); M3=0.1*randi(10,3);
A =M1*diag([-1 -2 -3])*inv(M1); B =M2*diag([-4 -5 -6])*inv(M2);
C =M3*diag([1 2 3])*inv(M3); n=size(A,1); I=eye(n,n); N=20;
t=0:Ts:N; X=zeros(n,n);
for k=1:length(t)
X=X - (expm(A*t(k)))*(Ts*C)*(expm(B*t(k)));
end
X
XMATLAB = sylvester(A,B,C) % verifications using MATLAB
Remark: This method is not practical, since it is very costly due to numerical integration
and it relies on the stability of matrices 𝑨 and 𝑩.

Alternatively, it is also possible to rearrange the equation in a linear form using the KRON function as follows:

𝑨𝑿 + 𝑿𝑩 = 𝑪 ⟺ ((𝑰 ⊗ 𝑨) + (𝑩^T ⊗ 𝑰)) vec(𝑿) = vec(𝑪)

where 𝐱 = vec(𝑿) and 𝐜 = vec(𝑪) contain the concatenated columns of 𝑿 and 𝑪. The solution of this equation exists and is unique if and only if the matrix 𝑮 = (𝑰 ⊗ 𝑨) + (𝑩^T ⊗ 𝑰) is nonsingular. The following lines of code provide an example in MATLAB:

clear all, clc, Ts=0.001;


A=[1 1;2 3]; B=[1 2;3 4]; C=-[1 3;1 1];
G=kron(eye(2),A)+kron(B',eye(2));
c=reshape(C,4,1);
x=inv(G)*c;
X=reshape(x,2,2)
XMATLAB = sylvester(A,B,C) % verifications using MATLAB

Remark: This method remains not recommended considering the cost of the inverse
matrix and the cost of transforming the system into a linear equation.

A classical algorithm for the numerical solution of the Sylvester equation is the Bartels-Stewart algorithm (1971), which consists of transforming 𝑨 and 𝑩 into Schur form by the QR-algorithm, and then solving the resulting triangular system via back-substitution. The Bartels-Stewart algorithm has a computational cost of 𝒪(n³) arithmetic operations. It was the first numerically stable method that could be systematically applied to solve such equations. In 1979, G. Golub, C. Van Loan and S. Nash introduced an improved version of the algorithm, known as the Hessenberg-Schur algorithm.

The Bartels–Stewart algorithm computes 𝑿 by applying the following steps:

▪ Compute the real Schur decompositions 𝑹 = 𝑼^H𝑨𝑼 and 𝑻 = 𝑽^H𝑩𝑽.
▪ Matrices 𝑹 and 𝑻 are block upper triangular, with diagonal blocks of size 1 × 1 (corresponding to real eigenvalues) or 2 × 2 (corresponding to complex eigenvalues).
▪ Set 𝑫 = 𝑼^H𝑪𝑽.
▪ Solve the simplified system 𝑹𝒀 + 𝒀𝑻 = 𝑫, where 𝒀 = 𝑼^H𝑿𝑽. This can be done using forward substitution on the blocks. Specifically, if tₖ₊₁,ₖ = 0, then

(𝑹 + tₖₖ𝑰)𝒚ₖ = 𝒅ₖ − ∑_{i=1}^{k-1} tᵢₖ𝒚ᵢ

▪ In the case tₖ₊₁,ₖ ≠ 0, the columns [𝒚ₖ; 𝒚ₖ₊₁] should be concatenated and solved for simultaneously:

( 𝑹 + tₖₖ𝑰  tₘₖ𝑰 ; tₖₘ𝑰  𝑹 + tₘₘ𝑰 ) ( 𝒚ₖ ; 𝒚ₘ ) = ( 𝒅ₖ ; 𝒅ₘ ) − ∑_{i=1}^{k-1} ( tᵢₖ𝒚ᵢ ; tᵢₘ𝒚ᵢ )  with m = k + 1

Algorithm: Bartels-Stewart
input: 𝑨 ∈ ℝ^{n×n}, 𝑩 ∈ ℝ^{m×m}, 𝑪 ∈ ℝ^{n×m}
output: 𝑿 ∈ ℝ^{n×m}, the solution of 𝑨𝑿 + 𝑿𝑩 = 𝑪
▪ Compute the Schur reductions 𝑨 = 𝑼𝑹𝑼^H and 𝑩 = 𝑽𝑻𝑽^H;
▪ Compute 𝑫 = 𝑼^H𝑪𝑽;
▪ if tₖ₊₁,ₖ = 0 for all k, then find 𝒀 using (𝑹 + tₖₖ𝑰)𝒚ₖ = 𝒅ₖ − ∑_{i=1}^{k-1} tᵢₖ𝒚ᵢ
▪ else find 𝒀 using
( 𝑹 + tₖₖ𝑰  tₘₖ𝑰 ; tₖₘ𝑰  𝑹 + tₘₘ𝑰 )( 𝒚ₖ ; 𝒚ₘ ) = ( 𝒅ₖ ; 𝒅ₘ ) − ∑_{i=1}^{k-1} ( tᵢₖ𝒚ᵢ ; tᵢₘ𝒚ᵢ ), with m = k + 1
▪ end
▪ Compute 𝑿 = 𝑼𝒀𝑽^H;

Example: write a MATLAB code to solve the Sylvester equation by the Bartels-Stewart algorithm. Assume that the matrices 𝑨 and −𝑩 have distinct real eigenvalues.

clear all, clc, % Dr. BEKHITI Belkacem 21/10/2020


M1=10*rand(5,5); A=M1*diag([-1 -2 -3 -4 -5])*inv(M1);
M2=10*rand(5,5); B=M2*diag([-1 -2 -3 -4 -5])*inv(M2);
M3=10*rand(5,5); C=M3*diag([1.5 2.5 3.5 4.5 5.5])*inv(M3);

[U R]=schur(A,'real'); [V T]=schur(B,'real'); F=U'*C*V; T


n=size(A,1); I=eye(n,n); d=zeros(n,1); Y=zeros(n,n);
for k=1:n
Y(:,k)=inv(R+T(k,k)*I)*(F(:,k)-Y(:,1:k-1)*T(1:k-1,k));
end

Zero=R*Y + Y*T - F
X=U*Y*V'
Zero=A*X + X*B - C
Example: write a MATLAB code to solve the Sylvester equation by Bartels–Stewart
algorithm. Assume that 𝑨 and −𝑩 are of distinct eigenvalues (May be complex conjugate).

clear all, clc, % Dr. BEKHITI Belkacem 21/10/2020


A=10*rand(5,5); B=10*rand(5,5); C=10*rand(5,5);
[U R]=schur(A,'real'); [V T]=schur(B,'real'); F=U'*C*V; T
n=size(A,1); I=eye(n,n); Y=zeros(n,n);
W = [diag(T,-1);0];
k=1;
while k<=n       % a while loop is used so that k can be advanced by 2
if W(k)==0       % 1x1 diagonal block of T: solve a single column
Y(:,k)=inv(R+T(k,k)*I)*(F(:,k)-Y(:,1:k-1)*T(1:k-1,k));
k=k+1;
else             % 2x2 block: solve columns k and k+1 simultaneously
H=[R+T(k,k)*I T(k+1,k)*I;T(k,k+1)*I R+T(k+1,k+1)*I];
G=[Y(:,1:k-1)*T(1:k-1,k);Y(:,1:k-1)*T(1:k-1,k+1)];
Z=inv(H)*([F(:,k);F(:,k+1)]-G);
Y(:,k)=Z(1:n); Y(:,k+1)=Z(n+1:end);
k=k+2;           % skip column k+1, already solved with column k
end, end
Zero1=R*Y + Y*T - F
X=U*Y*V'
Zero2=A*X + X*B - C

(Proposed Algorithm) The best-known and very widely used numerical method for small and dense problems is the Hessenberg-Schur method of Golub, Nash, and Van Loan. The Hessenberg-Schur method is an efficient implementation of the Bartels-Stewart method proposed earlier, based on the reduction of both matrices to Schur form. Unfortunately, these methods are not practical for large and sparse problems. In this section, we introduce a new iterative scheme based on fixed-point iteration. The scheme requires the solution of a linear system with multiple right-hand sides at each iteration, which can be solved by using block Krylov subspace methods for linear systems (see C. Brezinski (2002), Y. Saad (1996)). Our method does not require the solution of a low-dimensional Sylvester equation at every iteration.

The main idea to solve the Sylvester equations is to write this equation as a block linear
system and then use some suitable iterative scheme. This can be accomplished by the
following change of variable: 𝑨𝑿 = 𝒁 where 𝒁 = 𝑪 − 𝑿𝑩. This possibility generates the
iterative method:

Algorithm: Block iterative method
input: 𝑨 ∈ ℝ^{n×n}, 𝑩 ∈ ℝ^{m×m}, 𝑪 ∈ ℝ^{n×m}, 𝑿₀ ∈ ℝ^{n×m}
output: 𝑿 ∈ ℝ^{n×m}, the solution of 𝑨𝑿 + 𝑿𝑩 = 𝑪
▪ for k = 1, 2, … until convergence do
  if ‖𝑨^{-1}‖‖𝑩‖ < 1 then solve 𝑨𝑿ₖ₊₁ = 𝑪 − 𝑿ₖ𝑩;
  elseif ‖𝑨^{-1}‖‖𝑩‖ > 1 then solve 𝑿ₖ₊₁𝑩 = 𝑪 − 𝑨𝑿ₖ;
  else display('Try other method');
▪ end
Convergence: For the convergence we proceed as follows. From the algorithm we have 𝑿ₖ = 𝑨^{-1}(𝑪 − 𝑿ₖ₋₁𝑩), and from the Sylvester equation we know that 𝑿 = 𝑨^{-1}(𝑪 − 𝑿𝑩). Combining these two equations it follows that 𝑿 − 𝑿ₖ = 𝑨^{-1}(𝑿 − 𝑿ₖ₋₁)𝑩 and so,

𝑿 − 𝑿ₖ = 𝑨^{-k}(𝑿 − 𝑿₀)𝑩^k

Therefore ‖𝑿 − 𝑿ₖ‖ ≤ (‖𝑨^{-1}‖ × ‖𝑩‖)^k × ‖𝑿 − 𝑿₀‖. Since ‖𝑨^{-1}‖ × ‖𝑩‖ < 1, the sequence 𝑿ₖ converges to 𝑿 as k tends to infinity. On the other hand, if 𝑬ₖ = 𝑿 − 𝑿ₖ then ‖𝑬ₖ‖ ≤ ‖𝑨^{-1}‖ × ‖𝑩‖ × ‖𝑬ₖ₋₁‖; hence the sequence 𝑿ₖ converges q-linearly to the solution.

Now if ‖𝑨−1 ‖ × ‖𝑩‖ > 1 we should use the following iteration 𝑿𝑘+1 𝑩 = 𝑪 − 𝑨𝑿𝑘 .

clear all, clc, % Dr. BEKHITI Belkacem 21/10/2020


M1=10*rand(2,2); A=M1*diag([-1 -2])*inv(M1);
M2=10*rand(2,2); B=2*M2*diag([-3 -4])*inv(M2);
M3=10*rand(2,2); C=M3*diag([1.5 2.5])*inv(M3);
n=size(A,1); X=rand(n,n);
for k=1:10
if norm(A)>norm(B)
X=inv(A)*(C-X*B);
else
X=(C-A*X)*inv(B);
end, end
X
Zero=A*X + X*B - C
XMATLAB = sylvester(A,B,C)

(The Best Proposed Algorithm) Here we present a new iterative scheme for the large-scale solution of the well-known Sylvester equation. The proposed scheme is based on Broyden's method (a quasi-Newton method) and can make good use of recently developed methods for solving block linear systems. For the details of the mathematical proof of the method see Chapter IV.

Algorithm: Broyden's method for Sylvester equations
input: 𝑨₁ ∈ ℝ^{n×n}, 𝑨₂ ∈ ℝ^{m×m}, 𝑨₃ ∈ ℝ^{n×m}, 𝑿₀ ∈ ℝ^{n×m}, 𝑩, ε
output: 𝑿 ∈ ℝ^{n×m}, the solution of 𝑨₁𝑿 + 𝑿𝑨₂ = 𝑨₃
▪ while toll > ε do
  𝑭₀ = 𝑨₁𝑿₀ + 𝑿₀𝑨₂ − 𝑨₃;
  𝐟₀ = vec(𝑭₀); 𝐱₀ = vec(𝑿₀); 𝐱₁ = 𝐱₀ − 𝑩𝐟₀;
  𝑿₁ = vec^{-1}(𝐱₁);
  𝑭₁ = 𝑨₁𝑿₁ + 𝑿₁𝑨₂ − 𝑨₃;
  𝐟₁ = vec(𝑭₁); 𝐱₁ = vec(𝑿₁); 𝐬 = 𝐱₁ − 𝐱₀; 𝐲 = 𝐟₁ − 𝐟₀;
  𝑩 = 𝑩 + (𝐬 − 𝑩𝐲)(𝐬^T𝑩)/(𝐬^T𝑩𝐲);
  𝑿₀ = 𝑿₁; k = k + 1;
  toll = ‖𝐬‖;
▪ end

clear all, clc, % Dr. BEKHITI Belkacem 21/10/2020


A1=10*rand(5,5); A2=10*rand(5,5); A3=10*rand(5,5);
n=size(A1,1); I=eye(n,n); k=1; toll=1;
X0=I-0.1*rand(n,n); B=inv(100*eye(n^2,n^2)); % initialization
while toll>1e-15
y0=X0*A1 + A2*X0 - A3;
f0=reshape(y0,n^2,1); % f0=[y0(:,1);y0(:,2)]; vectorisation of y0
x0=reshape(X0,n^2,1); % x0=[X0(:,1); X0(:,2)]; vectorization of X0
x1=x0-B*f0;
X1=reshape(x1,n,n); % X1=[x1(1:2,:) x1(3:4,:)]; Inverse of vec
y1=X1*A1 + A2*X1 - A3;
f1=reshape(y1,n^2,1); % f1=[y1(:,1);y1(:,2)]; vectorisation of y1
x1=reshape(X1,n^2,1); % x1=[X1(:,1); X1(:,2)]; vectorisation of X1
y=f1-f0; s=x1-x0;
B = B + ((s-B*y)*(s'*B))/(s'*B*y);
X0=X1;
k=k+1;
toll=norm(s);
end
X1
ZERO1 = X1*A1 + A2*X1 - A3

Example (Application): Considered for the computational comparison of the described methods will be the discretization of the well-known Poisson equation on the unit square:

∂²u/∂x² + ∂²u/∂y² = f(x, y),  (x, y) ∈ [−1, 1]²,

with homogeneous Dirichlet boundary conditions, u(±1, y) = u(x, ±1) = 0, and the right-hand side f(x, y) chosen such that the reference solution u(x, y) is given by

u(x, y) = (1 − x²)(1 − y²) cos(2xy)

% Dr. BEKHITI Belkacem 21/10/2020
clear all, clc,
[x,y] = meshgrid([-1:.05:1]);
z=(1-x.^2).*(1-y.^2).*cos(20.*x.*y);
s=surf(x,y,z),
axis equal
as visualized in the figure. The most common approach is discretization by a five-point stencil on an (n + 1) × (n + 1) equispaced grid, so this will be considered first. The discretization results in a sparse Sylvester (Lyapunov) equation. The detailed results of the comparison can be found in Gerhardus Petrus Kirsten (2018).

Example (Roth's removal rule): Given two square matrices 𝑨 and 𝑩, of size n and m, and a matrix 𝑪 of size n × m, one can ask when the following two square matrices of size n + m are similar to each other:

( 𝑨 𝑪 ; 𝟎 𝑩 )  and  ( 𝑨 𝟎 ; 𝟎 𝑩 )

The answer is that these two matrices are similar exactly when there exists a matrix 𝑿 such that 𝑨𝑿 − 𝑿𝑩 = 𝑪. In other words, 𝑿 is a solution to a Sylvester equation. This is known as Roth's removal rule.

One easily checks one direction: if 𝑨𝑿 − 𝑿𝑩 = 𝑪 then

( 𝑰 𝑿 ; 𝟎 𝑰 )( 𝑨 𝑪 ; 𝟎 𝑩 )( 𝑰 −𝑿 ; 𝟎 𝑰 ) = ( 𝑨 𝟎 ; 𝟎 𝑩 )
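A quick numerical illustration of the rule follows (assuming 𝑨 and 𝑩 share no eigenvalue, so that sylvester can be called with −𝑩 to solve 𝑨𝑿 − 𝑿𝑩 = 𝑪):

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Numerical illustration of Roth's removal rule
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
clear all, clc, n = 4; m = 3;
A = rand(n,n); B = -rand(m,m) - eye(m); C = rand(n,m);
X = sylvester(A,-B,C);             % AX + X(-B) = C, i.e. AX - XB = C
S = [eye(n) X; zeros(m,n) eye(m)]; % the block similarity transformation
Zero = S*[A C; zeros(m,n) B]*inv(S) - [A zeros(n,m); zeros(m,n) B]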
Generalized Sylvester Equations: The term generalized refers to a very wide class of equations, which includes systems of matrix equations, bilinear equations and problems where the coefficient matrices are rectangular. We start with the most common form of generalized Sylvester equation, namely

𝑨𝑿𝑩 + 𝑪𝑿𝑫 = 𝑬

Which differs from the previous equation for the occurrence of coefficient matrices on
both sides of the unknown solution 𝑿. If 𝑩 and 𝑪 are both nonsingular, left
multiplication by 𝑪−1 and right multiplication by 𝑩−1 lead to a standard Sylvester
equation, with the same solution matrix 𝑿. In case either 𝑩 or 𝑪 are ill-conditioned, such
a transformation may lead to severe instabilities.

The case of singular 𝑩 and 𝑪, especially for 𝑩 = 𝑪⊤ and 𝑨 = 𝑫⊤ has an important role in
the solution of differential-algebraic equations and descriptor systems.

A natural extension of the Bartels-Stewart method can be implemented for numerically solving 𝑨𝑿𝑩 + 𝑪𝑿𝑫 = 𝑬 when the dimensions are small; this was discussed in Peter Benner (2013) and T. Penzl (1998), where the starting point is a QZ decomposition of the pencils (𝑨, 𝑪) and (𝑫, 𝑩), followed by the solution of a sequence of small (1-by-1 or 2-by-2) generalized Sylvester equations, which is performed by using their Kronecker form.

The problem can be recast as a standard Sylvester equation even when 𝑩 and 𝑪 are ill-conditioned: one could consider using a specifically selected α ∈ ℝ (or α ∈ ℂ) such that the two matrices 𝑪 + α𝑨 and 𝑩 − α𝑫 are better conditioned and the solution uniqueness is ensured, and rewrite 𝑨𝑿𝑩 + 𝑪𝑿𝑫 = 𝑬 as the following equivalent generalized Sylvester matrix equation,

𝑨𝑿(𝑩 − α𝑫) + (𝑪 + α𝑨)𝑿𝑫 = 𝑬 ⟺ 𝑨₁𝑿 + 𝑿𝑨₂ = 𝑨₃
Other generalizations of the Sylvester equation have attracted the attention of many
researchers. In some cases the standard procedure for their solution consists in solving
a (sequence of) related standard Sylvester equation(s), so that the computational core is
the numerical solution of the latter by means of some of the procedures discussed in
previous sections. We thus list here some of the possible generalizations more often
encountered and employed in real applications. We start by considering the case when
the two coefficient matrices can be rectangular. This gives the following equation:

𝑨𝑿 + 𝒀𝑩 = 𝑪

Where 𝑿, 𝒀 are both unknown, and 𝑨, 𝑩 and 𝑪 are all rectangular matrices of conforming
dimensions. Equations of this type arise in control theory, for instance in output
regulation with internal stability, where the matrices are in fact polynomial matrices
(see, e.g., H. K. Wimmer 1996).

The two-sided version of previous equation is given by 𝑨𝑿𝑩 + 𝑪𝒀𝑫 = 𝑬 and this is an
example of more complex bilinear equations.

A typical generalization is given by the following bilinear equation:

𝑨𝑿𝑩 + 𝑪𝑿𝑫 = 𝑬𝒀 + 𝑭

where the pair (𝑿, 𝒀) is to be determined, and 𝑿 occurs in two different terms.

Example: write a MATLAB code to solve the generalized Sylvester equation by the
proposed Broyden's algorithm.

clear all, clc, % Dr. BEKHITI Belkacem 21/10/2020


A1=10*rand(6,6); A2=10*rand(6,6); A3=10*rand(6,6); A4=10*rand(6,6);
A5=10*rand(6,6); n=size(A1,1); I=eye(n,n); k=1; toll=1;
X0=I-0.1*rand(n,n); B=inv(100*eye(n^2,n^2)); % initialization
while toll>1e-15
y0= A1*X0*A2 + A3*X0*A4 - A5;
f0=reshape(y0,n^2,1); % f0=[y0(:,1);y0(:,2)]; vectorisation of y0
x0=reshape(X0,n^2,1); % x0=[X0(:,1); X0(:,2)]; vectorization of X0
x1=x0-B*f0;
X1=reshape(x1,n,n); % X1=[x1(1:2,:) x1(3:4,:)]; Inverse of vec
y1= A1*X1*A2 + A3*X1*A4 - A5;
f1=reshape(y1,n^2,1); % f1=[y1(:,1);y1(:,2)]; vectorisation of y1
x1=reshape(X1,n^2,1); % x1=[X1(:,1); X1(:,2)]; vectorisation of X1
y=f1-f0;
s=x1-x0;
B = B + ((s-B*y)*(s'*B))/(s'*B*y);
X0=X1;
k=k+1;
toll=norm(s);
end
X1
ZERO1= A1*X1*A2 + A3*X1*A4 - A5
The Lyapunov equation occurs in many branches of control theory, such as stability analysis and optimal control. This and related equations are named after the Russian mathematician Aleksandr Lyapunov.

The continuous Lyapunov equation: 𝑨𝑿 + 𝑿𝑨^T = 𝑸 or 𝑿𝑨 + 𝑨^T𝑿 = 𝑸
The discrete Lyapunov equation: 𝑨𝑿𝑨^T − 𝑿 = 𝑸 or 𝑨^T𝑿𝑨 − 𝑿 = 𝑸

where 𝑸 is a Hermitian (symmetric) matrix. Since the Lyapunov equation is a special case of the Sylvester equation, the following corollary is immediate.

Corollary: Let λ₁, λ₂, … , λₙ be the eigenvalues of 𝑨. Then the Lyapunov equation 𝑿𝑨 + 𝑨^T𝑿 = 𝑸 has a unique solution 𝑿 if and only if λᵢ + λⱼ ≠ 0, i = 1, … , n; j = 1, … , n.

A similar result holds on the uniqueness of the solution 𝑿 of the discrete Lyapunov equation 𝑨^T𝑿𝑨 − 𝑿 = 𝑸: letting λ₁, … , λₙ be the eigenvalues of 𝑨, this equation has a unique solution 𝑿 if and only if λᵢλⱼ ≠ 1, i = 1, … , n; j = 1, … , n.

If 𝑨 is stable, the analytical solution 𝑿 can be written as

Continuous Lyapunov equation: 𝑿 = −∫₀^∞ e^{𝑨t}𝑸e^{𝑨^Tt} dt or 𝑿 = −∫₀^∞ e^{𝑨^Tt}𝑸e^{𝑨t} dt

Discrete Lyapunov equation: 𝑿 = −∑_{k=0}^{∞} 𝑨^k𝑸(𝑨^T)^k or 𝑿 = −∑_{k=0}^{∞} (𝑨^T)^k𝑸𝑨^k

Example: write a MATLAB code to solve the Continuous Lyapunov equation by the
numerical integration algorithm.

clear all, clc,


M1=10*rand(3,3); A =M1*diag([-1 -2 -3])*inv(M1);
M2=10*rand(3,3); C =M2*diag([3 4 5])*inv(M2) ; Q= C*C';
n=size(A,1); I=eye(n,n); Ts=0.001;
EA=I + Ts*A; QT=Ts*Q; X=zeros(n,n); k=0; e=1;

while e>1e-15
X=X - (EA^k)*(QT)*((EA')^k);
e=norm(A*X+X*A'-Q);
k=k+1;
end

X
Zero = A*X+X*A'-Q
XMATLAB = lyap(A,-Q) % verifications using MATLAB
Zero=A*XMATLAB + XMATLAB*A'-Q
Example: write a MATLAB code to solve the Discrete Lyapunov equation by the
numerical integration algorithm.

clear all, clc,


M1=10*rand(3,3); A =M1*diag([0.1 0.2 0.3])*inv(M1);
M2=10*rand(3,3); C =M2*diag([3 4 5])*inv(M2) ; Q= C*C';
n=size(A,1); X=zeros(n,n); k=0; e=1;

while e>1e-10
X=X - (A^k)*(Q)*((A')^k);
e=norm(A*X*A'-X-Q);
k=k+1;
end

X
Zero = A*X*A'-X-Q
XMATLAB = dlyap(A,-Q) % verifications using MATLAB
Zero=A*XMATLAB*A' - XMATLAB -Q

Remark: the numerical integration algorithm is not recommended because it is costly, numerically unstable, and applicable only to stable matrices. All the algorithms introduced previously are also applicable to the Lyapunov equation.

An algebraic Riccati equation is a type of nonlinear equation that arises in the context of infinite-horizon optimal control problems in continuous time or discrete time. A typical algebraic Riccati equation is similar to one of the following:

The continuous-time algebraic Riccati equation: 𝑨^T𝑿 + 𝑿𝑨 − 𝑿𝑩𝑹^{-1}𝑩^T𝑿 + 𝑸 = 𝟎
The discrete-time algebraic Riccati equation: 𝑿 = 𝑨^T𝑿𝑨 − (𝑨^T𝑿𝑩)(𝑹 + 𝑩^T𝑿𝑩)^{-1}(𝑩^T𝑿𝑨) + 𝑸

𝑿 is the unknown n × n symmetric matrix and 𝑨, 𝑩, 𝑸, 𝑹 are known real coefficient matrices. Though generally this equation can have many solutions, it is usually specified that we want to obtain the unique stabilizing solution, if such a solution exists.

The name Riccati is given to these equations because of their relation to the Riccati
differential equation.
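Before the algorithms below, note that MATLAB's Control System Toolbox (if available) offers the built-in solver care for the continuous-time equation; a small reference check is sketched here with illustrative data.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Reference solution of the CARE with the built-in care
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
clear all, clc
A = [0 1; -2 -3]; B = [0; 1]; Q = eye(2); R = 1;
[X,L,G] = care(A,B,Q,R);    % stabilizing solution, closed-loop poles, gain
X
Zero = A'*X + X*A - X*B*inv(R)*B'*X + Q   % residual of the CARE
eig(A - B*G)                % closed-loop eigenvalues: open left half-plane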

Remark: Riccati's equation is a nonlinear equation that appears in the optimal control of continuous and discrete linear systems. In practice it calls for a parametric rather than a purely numerical solution, and some limitations in optimal control require an analytical parametrization; nevertheless, we suggest below some numerical algorithms through which this equation can be solved under certain conditions.
Newton's method can also be used for solving equations of the kind 𝑭(𝑿) = 𝟎, where 𝑭: 𝓥 → 𝓥 is a differentiable operator in a Banach space (we are interested only in the case in which 𝓥 is ℂ^{m×n}). The sequence is defined by

𝑿ₖ₊₁ = 𝑿ₖ − (𝑭′_{𝑿ₖ})^{-1}𝑭(𝑿ₖ),  𝑿ₖ ∈ 𝓥

where 𝑭′_{𝑿ₖ} is the Fréchet derivative of 𝑭 at the point 𝑿ₖ.

The Fréchet derivative is a derivative defined on Banach spaces. Named after Maurice
Fréchet, it is commonly used to generalize the derivative of a real-valued function of a
single real variable to the case of a vector-valued function of multiple real variables, and
to define the functional derivative used widely in the calculus of variations.

The Fréchet derivative at a point 𝑿 ∈ ℂ^{m×n} is a linear mapping

𝑳(𝑿): ℂ^{m×n} → ℂ^{m×n}, 𝑬 ⟼ 𝑳(𝑿, 𝑬)

such that for all 𝑬 ∈ ℂ^{m×n}: 𝑭(𝑿 + 𝑬) − 𝑭(𝑿) − 𝑳(𝑿, 𝑬) = o(‖𝑬‖); it therefore describes the first-order effect on 𝑭 of perturbations in 𝑿. In practical computation it is preferable to avoid constructing and inverting 𝑭′_{𝑿ₖ} explicitly. Thus, a better way to compute one step of Newton's method is to define the Newton increment 𝑯ₖ ≔ 𝑿ₖ₊₁ − 𝑿ₖ and to solve the linear matrix equation in the unknown 𝑯ₖ in order to get 𝑿ₖ₊₁:

𝑭′_{𝑿ₖ}𝑯ₖ = −𝑭(𝑿ₖ) and 𝑿ₖ₊₁ = 𝑿ₖ + 𝑯ₖ

The convergence of the method in Banach spaces is less straightforward than in the
scalar case, and it is described by the Newton–Kantorovich theorem.

Consider the Riccati operator 𝓡(𝑿) = 𝑿𝑨 + 𝑨^T𝑿 − 𝑿𝑩𝑿 + 𝑪. The Fréchet derivative of 𝓡 at a point 𝑿 is the linear part of 𝓡(𝑿 + 𝑯) − 𝓡(𝑿):

𝓡′(𝑿)𝑯 = (𝑨^T − 𝑿𝑩)𝑯 + 𝑯(𝑨 − 𝑩𝑿)

Thus, the kth step of Newton's method for the continuous-time algebraic Riccati equation consists in solving the Sylvester equation

𝓡′(𝑿ₖ)𝑯ₖ = −𝓡(𝑿ₖ) ⇔ (𝑨^T − 𝑿ₖ𝑩)𝑯ₖ + 𝑯ₖ(𝑨 − 𝑩𝑿ₖ) = −𝓡(𝑿ₖ)

in the unknown 𝑯ₖ, and setting 𝑿ₖ₊₁ = 𝑿ₖ + 𝑯ₖ. Observe that if 𝑿ₖ is a Hermitian matrix, then the last equation is a Lyapunov equation. Thus, if 𝑿₀ is Hermitian and the sequence {𝑿ₖ}ₖ is well defined, then 𝑿ₖ is Hermitian for each k.

The corresponding code is reported in the listing below; this code makes use of the function sylvester for the solution of a Sylvester matrix equation, as described before.

The standard results on the convergence of Newton’s method in Banach spaces yield
locally quadratic convergence in a neighborhood of the stabilizing solution 𝑿+ . This
property guarantees that the method is self-correcting, that is, a small perturbation
introduced at some step 𝑘 of the iteration does not affect the convergence.
clear all, clc,
% This algorithm solves: C+XA+A'X-XBX=0 by means of Newton’s method.
% A,B & C: matrix coefficients.
% X0: initial approximation & X: the solution
A=10*rand(4,4); B=10*rand(4,4); C=10*rand(4,4); X0= rand(4,4);
tol = 1e-13; kmax = 80; X = X0; err = 1; k=0;
while err>tol && k<kmax
RX = C + X*A + A'*X - X*B*X;
H =sylvester(A'-X*B,A-B*X,-RX) ;
X=X+H;
err =norm(H,1)/norm(X,1); k=k+1;
end
if k == kmax
disp('Warning: reached the maximum number of iterations')
end
Zero=C + X*A + A'*X - X*B*X

The function sign(z) is defined for a nonimaginary complex number z as the nearest square root of unity. Its matrix counterpart can be defined for any matrix 𝑾 with no purely imaginary eigenvalues, relying on the Jordan canonical form of 𝑾,

𝑱 = 𝑽^{-1}𝑾𝑽 = ( 𝑱₊ 𝟎 ; 𝟎 𝑱₋ )

where we have grouped the Jordan blocks so that the eigenvalues of 𝑱₊ have positive real part, while the eigenvalues of 𝑱₋ have negative real part. We define the matrix sign of 𝑾 as

sign(𝑾) = 𝑽 ( 𝑰_p 𝟎 ; 𝟎 −𝑰_q ) 𝑽^{-1}

where p is the size of 𝑱₊ and q is the size of 𝑱₋. Observe that according to this definition, sign(𝑾) is a matrix function. From the last equation it follows that sign(𝑾) − 𝑰 has rank q, while sign(𝑾) + 𝑰 has rank p.

Theorem: Let the continuous-time algebraic Riccati equation 𝓡(𝑿) = 𝑪 + 𝑿𝑨 + 𝑫𝑿 − 𝑿𝑩𝑿 = 𝟎 have a stabilizing solution 𝑿₊, namely σ(𝑨 − 𝑩𝑿₊) ⊂ ℂ⁻ or sign(𝑨 − 𝑩𝑿₊) = −𝑰, and let 𝑯 = ( 𝑨 −𝑩 ; −𝑪 −𝑫 ) be the corresponding Hamiltonian matrix. Partition sign(𝑯) + 𝑰 = [𝑾₁ 𝑾₂], where 𝑾₁, 𝑾₂ ∈ ℂ^{2n×n}. Then 𝑿₊ such that 𝓡(𝑿₊) = 𝟎 is the unique solution of the overdetermined system 𝑾₂𝑿₊ = −𝑾₁.

Proof: From the facts that

𝑨𝑽 = 𝑽𝚲 ⟺ f(𝑨)𝑽 = 𝑽f(𝚲) and 𝑯 [ 𝑰 ; 𝑿₊ ] = [ 𝑰 ; 𝑿₊ ](𝑨 − 𝑩𝑿₊),

if we let f(z) = sign(z) + 1, one has

[𝑾₁ 𝑾₂] [ 𝑰 ; 𝑿₊ ] = (sign(𝑯) + 𝑰) [ 𝑰 ; 𝑿₊ ] = [ 𝑰 ; 𝑿₊ ](sign(𝑨 − 𝑩𝑿₊) + 𝑰) = 𝟎 ⟹ 𝑾₂𝑿₊ = −𝑾₁
On the other hand, (sign(𝑯) + 𝑰) [ 𝑰 𝟎 ; 𝑿₊ 𝑰 ] = [𝟎 𝑾₂], and since the left-hand side of the latter equality has rank n, 𝑾₂ has full rank and 𝑿₊ is the unique solution of the Riccati equation. ■

Once the sign of 𝑯 is computed, in order to get the required solution it is enough to solve
the overdetermined system. This task can be accomplished by using the standard
algorithms for solving an overdetermined system, such as the QR factorization of 𝑾2 .

Computing the matrix sign function: The simplest iteration is obtained by Newton's method applied to 𝑿² − 𝑰 = 𝟎, which is appropriate since 𝑺 = sign(𝑾) satisfies 𝑺² − 𝑰 = 𝟎. The resulting iteration is 𝑿ₖ₊₁ = (1/2)(𝑿ₖ + 𝑿ₖ^{-1}), whose convergence properties are synthesized in the following result.

Theorem: If 𝑾 ∈ ℂ^{n×n} has no purely imaginary eigenvalues, then the sequence

𝑿ₖ₊₁ = (1/2)(𝑿ₖ + 𝑿ₖ^{-1}),  k = 0, 1, …

with 𝑿₀ = 𝑾 converges quadratically to 𝑺 = sign(𝑾). Moreover,

‖𝑿ₖ₊₁ − 𝑺‖ ≤ (1/2)‖𝑿ₖ^{-1}‖‖𝑿ₖ − 𝑺‖²

for any operator norm.
The iteration 𝑿ₖ₊₁ = (1/2)(𝑿ₖ + 𝑿ₖ^{-1}), together with the termination criterion ‖𝑿ₖ₊₁ − 𝑿ₖ‖ ≤ ε for some norm and a tolerance ε, provides a rough algorithm for the sign function.
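A rough sketch of this iteration, with the termination criterion just described, is given below for a test matrix with known inertia (the example matrix is an illustrative choice).

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Newton iteration for the matrix sign function
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
clear all, clc, M = rand(4,4); W = M*diag([-1 -2 3 4])*inv(M);
X = W; err = 1;
while err > 1e-12
    Xnew = 0.5*(X + inv(X));     % X_{k+1} = (X_k + X_k^{-1})/2
    err  = norm(Xnew - X,1);
    X = Xnew;
end
S = X
norm(S^2 - eye(4))               % S is an involution: S^2 = I
eig(S)                           % eigenvalues +1 and -1 (here 2 of each)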

There is a scaling technique which dramatically accelerates the convergence. The idea is
simple but effective: Since sign(𝑾) = sign(𝑐𝑾) for any 𝑐 > 0, the limit of the sequence
{𝑿𝑘 }𝑘 does not change if at each step one “scales” the matrix 𝑿𝑘 in the following way:

clear all, clc,


% This Algo solves: C+XA+DX-XBX=0 by means of the matrix sign function
% A, B, C, D: matrix coefficients & X is the solution of: C+XA+DX-XBX=0
A=10*rand(4,4); B=10*rand(4,4); C=10*rand(4,4); D=10*rand(4,4);
H=[A,-B;-C,-D]; n=size(H,1); tol=1e-13; kmax=1000; err=1; SH=H; k=0;
while err>tol && k<kmax
[L,U,P] =lu(SH);
c=abs(prod(diag(U)))^(-1/n); % Byers’ determinantal scaling method
SH = SH*c;
Z = L\P;
Z = (c*U)\Z;
err =norm(SH-Z,1)/norm(SH,1);
SH =(SH + Z)*0.5;
k=k+1;
end
[n,m] =size(B); W = SH + eye(m+n);
X = -W(1:m+n,n+1:n+m)\W(1:m+n,1:n)
Zero=C + X*A + D*X - X*B*X
Remark: Convergence of the matrix sign iteration is not guaranteed for every randomly generated problem, so the algorithm does not always converge; accordingly, the script may need to be executed several times.

Reliable software for solving matrix equations has been available for a long time, due to its fundamental role in control applications; in particular, the SLICE library was made available already in 1986, and LAPACK ("Linear Algebra Package") was initially released in 1992. Most recent versions of MATLAB also rely on calls to SLICOT routines within the control-related toolboxes. SLICOT includes a large variety of codes for model reduction and nonlinear problems on sequential and parallel architectures.
