Fundamentals of Scientific Computations and Numerical Linear Algebra
Dr. BEKHITI Belkacem
2020
On the other hand, many problems are not solved by direct algorithms; in such cases it may be possible to solve them using iterative methods. Such a method begins with an initial guess and successively computes better and better approximations to the desired solution. Even when direct algorithms exist, iterative methods are sometimes preferred because they are more efficient (they may require less time and less computational power while still giving a good approximation of the solution) or because they are more stable.
In this book we aim to bring the basic concepts of this science closer to university students. We have prepared the material in accordance with contemporary international curricula, and we have tried as much as possible to produce a book that collects the major topics of this science, relying on the most important books and a wide range of references on the subject. We have also sought to keep the wording simple and concise, avoiding needless complexity. One important point must be mentioned: we did not address the study of error and the speed of convergence, which are two main factors in the methods of numerical analysis. This was done to avoid lengthening the book, since our aim was to collect as many of the existing methods as possible; readers who wish to study error and convergence are strongly encouraged to consult the references at the end of the book, where these points are treated in detail.
We should also mention that the book concludes with a series of natural optimization methods, and that it is accompanied by a very large set of algorithms and MATLAB codes. As for the small number of exercises for students, this may be improved in the near future.
CHAPTER I: An Introduction to Difference Equations
The theory of such difference equations is required for the understanding of some
important algorithms to be discussed later on. Furthermore, linear difference equations
are a useful tool in the study of many other processes of numerical analysis; and
numerical linear algebra. Much unnecessary complication is avoided by considering
difference equations with complex coefficients and solutions.
z⁻¹ : I ⟶ I,  y[k] ⟼ y[k − 1] = z⁻¹(y[k])
Then the 𝑛𝑡ℎ order DE becomes (𝑎0 + 𝑎1 𝑧 −1 + ⋯ + 𝑎𝑛 𝑧 −𝑛 )(𝑦[𝑘]) = 𝑓[𝑘] ⟺ 𝐿(𝑦[𝑘]) = 𝑓[𝑘]
where 𝐿 is: linear mapping or linear operator.
Theorem: A solution of L(y[k]) = f[k] exists if and only if f[k] ∈ ℛ(L), that is, the forcing function f[k] is an element of the range space of the operator L. When it exists, the solution is the sum of two parts, a homogeneous and a particular solution: y[k] = y_h[k] + y_p[k].
Proof: In linear algebra (see the Algebra Book by BEKHITI 2020) we have seen that a solution of Ax = b exists if b ∈ ℛ(A), or equivalently ℛ(A) = ℛ([A b]), which means that b lies in the column space of A. Now suppose that a solution exists and let y_p be a particular solution; then
We can say that dim(𝒩(L)) = n, which implies that any basis of 𝒩(L) contains n linearly independent vectors.
Proof: It is straightforward to see that if 𝕎 is singular matrix then there exist a non-
zero vector 𝕏 or 𝛼𝑖 ≠ 0 which implies linear dependence of 𝑦𝑖 [𝑘].
⟹ {(1)^k, (1/2)^k} is a basis of 𝒩(L)
▪ The homogeneous solution is a linear combination of the elementary homogeneous solutions:
y_h[k] = C1 (1/2)^k + C2
The Eigen pair of the operator 𝑧 −1 is (𝜆, (1/𝜆)𝑘 ) and the solution of 1st order DE is given
by
y[k − 1] + a y[k] = 0 ⟹ y_h[k] = C1 (−1/a)^k
⟹ 𝑧 −2 (𝑦[𝑘]) = 𝑧 −1 (𝑧 −1 (𝑦[𝑘]))
⟹ 𝑧 −2 (𝑦[𝑘]) = 𝜆𝑧 −1 (𝑦[𝑘]) = 𝜆2 𝑦[𝑘]
∆(𝜆) = (𝜆2 + 𝑎1 𝜆1 + 𝑎0 )
(z⁻¹ − λ1)(z⁻¹ − λ2) y[k] = 0 ⟺ (L1 L2)(y[k]) = 0,  with eigenpairs (λ1, (1/λ1)^k) and (λ2, (1/λ2)^k)
The elementary homogenous solutions are (1/𝜆1 )𝑘 and (1/𝜆2 )𝑘 . Notice that
𝕎((1/λ1)^k, (1/λ2)^k) = det[ (1/λ1)^k, (1/λ2)^k ; (1/λ1)^{k−1}, (1/λ2)^{k−1} ] ≠ 0 only if λ1 ≠ λ2 ⟹ {(1/λ1)^k, (1/λ2)^k} is a basis, and
y_h[k] = C1 (1/λ1)^k + C2 (1/λ2)^k
Set y1[k] = (z⁻¹ − λ) y[k]; then (z⁻¹ − λ) y1[k] = 0 ⟹ y1[k] = C1 (1/λ)^k
(z⁻¹ − λ) y[k] = y1[k] ⟹ (z⁻¹ − λ) y[k] = C1 (1/λ)^k
Suppose that the general solution of the last DE is given by 𝑦2 [𝑘] = 𝐶2 [𝑘]. (1/𝜆)𝑘
(z⁻¹ − λ)(C2[k]·(1/λ)^k) = C1 (1/λ)^k ⟺ C2[k−1]·(1/λ)^{k−1} − λ C2[k]·(1/λ)^k = C1 (1/λ)^k
⟺ C2[k−1] − C2[k] = C1/λ   (a constant difference)
⟺ C2[k] = −(C1/λ) k + C2   (by accumulation)
From the superposition principle of linear systems we have
y_h[k] = y1[k] + y2[k] = {C1 − (C1/λ) k + C2}(1/λ)^k = β1 (1/λ)^k + β2 k (1/λ)^k
(a + jb) = r(cos θ + j sin θ) = r e^{jθ},  where r = √(a² + b²) and θ = atan(b/a)
(a + jb)^k = r^k e^{jkθ} = r^k (cos(kθ) + j sin(kθ))
y[k] = C1 (1/(σ + jω))^k + C2 (1/(σ − jω))^k = C1/(σ + jω)^k + C2/(σ − jω)^k
     = C1/(r^k e^{jkθ}) + C2/(r^k e^{−jkθ}) = (1/r)^k [C1 e^{−jkθ} + C2 e^{jkθ}]
     = (1/r)^k [(C1 + C2) cos(kθ) + j(C2 − C1) sin(kθ)]
{(1/r)^k cos(kθ), (1/r)^k sin(kθ)},  where r = √(Re(λ1)² + Im(λ1)²) and θ = atan(Im(λ1)/Re(λ1))
Summary:
The sequence is {1, 1, 2, 3, 5, 8, 13, 21, 34, …} (the Fibonacci sequence, y[k] = y[k − 1] + y[k − 2]). Find the general solution of this DE.
Solution:
λ1,2 = (1 ± √5)/(−2) are the eigenvalues of (z⁻¹ − λ1)(z⁻¹ − λ2) y[k] = 0
y[k] = C1 (2/(√5 − 1))^k + C2 (−2/(√5 + 1))^k
From the initial conditions we know
y[0] = C1 + C2 = 1  and  y[1] = 2C1/(√5 − 1) − 2C2/(√5 + 1) = 1
⟹ [ 2/(√5 − 1), −2/(√5 + 1) ; 1, 1 ](C1; C2) = (1; 1) ⟹ C1 = (3 + √5)/(5 + √5),  C2 = (3 − √5)/(5 − √5)
y[k] = ((3 + √5)/(5 + √5)) ((√5 + 1)/2)^k + ((3 − √5)/(5 − √5)) ((1 − √5)/2)^k
❶ 𝑦[𝑘] − 𝑦[𝑘 − 2] = 0
❷ 𝑦[𝑘] + 𝑦[𝑘 − 2] = 0
❸ 𝑦[𝑘] + 2𝑦[𝑘 − 1] + 𝑦[𝑘 − 2] = 0
Solution:
❶ y[k] − y[k−2] = 0 ⟺ (1 − z⁻²) y[k] = 0 ⟺ (1 − z⁻¹)(1 + z⁻¹) y[k] = 0 ⟺ y1[k] = C1(1)^k = C1 and y2[k] = C2(−1)^k
❷ y[k] + y[k−2] = 0 ⟺ (1 + z⁻²) y[k] = 0:  λ1 = j, λ2 = −j, r = 1, θ = π/2 ⟹ y[k] = β1 cos(kπ/2) + β2 sin(kπ/2)
❸ y[k] + 2y[k−1] + y[k−2] = 0 ⟺ (1 + 2z⁻¹ + z⁻²) y[k] = 0 ⟺ (1 + z⁻¹)² y[k] = 0 ⟹ y[k] = (C1 + C2 k)(−1)^k
In the general case the operator factorizes as
L(y[k]) = (z⁻¹ − λ1)^{m1} (z⁻¹ − λ2)^{m2} ⋯ (z⁻¹ − λr)^{mr} y[k] = 0 ⟺ L = L1 L2 ⋯ Lr
(L1 L2 ⋯ L_{i−1} L_i L_{i+1} ⋯ Lr)(y_i[k]) = (L1 L2 ⋯ L_{i−1} L_{i+1} ⋯ Lr) L_i(y_i[k]) = 0 ⟹ 𝒩(L) = ⊕_{i=1}^{r} 𝒩(L_i)
▪ For the equation (z⁻¹ − λ_i)(y[k]) = 0 ⟹ y[k] = C1 (1/λ_i)^k
▪ For the equation (z⁻¹ − λ_i)²(y2[k]) = 0, set y1[k] = (z⁻¹ − λ_i)(y2[k]); then (z⁻¹ − λ_i)(y1[k]) = 0 and (z⁻¹ − λ_i)(y2[k]) = y1[k].
We have seen: y1[k] = C1 (1/λ_i)^k and y2[k] = C2 k (1/λ_i)^k ⟹ {(1/λ_i)^k, k(1/λ_i)^k} is a basis
▪ By induction we obtain
{(1/λ_i)^k, k(1/λ_i)^k, …, k^{m_i − 1}(1/λ_i)^k} is a basis for 𝒩(L_i)
As in the case of differential equations, if the difference
equation is forced then there are two methods which can be used to find a particular
solution: the method of undetermined coefficients and the method of variation of
parameters.
𝑦𝑝 [𝑘] = 𝑘 2 + 6𝑘 + 13
y[k] = 1 + C1(−1)^k + C2(1/4)^k
Summary: Given a forced linear time-invariant difference equation Σ_{i=0}^{n} a_i y[k − i] = x[k], the form of the particular solution is chosen according to the forcing term:
x[k] = K                       ⟹ y_p[k] = C
x[k] = A α^k                   ⟹ y_p[k] = K α^k
x[k] = α^k k^m                 ⟹ y_p[k] = α^k (C0 k^m + C1 k^{m−1} + ⋯ + Cm)
x[k] = cos(ω0 k) or sin(ω0 k)  ⟹ y_p[k] = C1 cos(ω0 k) + C2 sin(ω0 k)
x[k] = α^k cos(ω0 k)           ⟹ y_p[k] = α^k (C1 cos(ω0 k) + C2 sin(ω0 k))
y[k] = y_h[k] + y_p[k] = (5/6)(2^k) + (C1 + k)(3^k) + C2(5^k)
Example: Find a particular solution of
y[k] = (5/6) y[k − 1] − (1/6) y[k − 2] + x[k]  with  x[k] = 2^k,  k ≥ 0
Solution: 𝑦𝑝 [𝑘] = 𝐾2𝑘 𝑢[𝑘] or 𝑦𝑝 [𝑘] = 𝐾2𝑘 𝑘 ≥ 0 where: 𝑢[𝑘] is a step function.
Substituting into the DE we obtain
K 2^k u[k] = K((5/6) 2^{k−1} u[k−1] − (1/6) 2^{k−2} u[k−2]) + 2^k u[k]
For k ≥ 2:  4K = (5/6)·2K − (1/6)K + 4 ⟹ K = 8/5
y_p[k] = (8/5)(2^k) u[k]  and  y_h[k] = {C1 (1/3)^k + C2 (1/2)^k} u[k]
y[k] = y_h[k] + y_p[k] = {(8/5)(2^k) + C1 (1/3)^k + C2 (1/2)^k} u[k]
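A small MATLAB sketch verifying that y_p[k] = (8/5)·2^k satisfies the recursion for k ≥ 2 (only the particular solution derived above is checked):
k  = 2:10;
yp = @(k) (8/5)*2.^k;
lhs = yp(k);
rhs = (5/6)*yp(k-1) - (1/6)*yp(k-2) + 2.^k;   % right-hand side of the DE
disp(max(abs(lhs - rhs)))                      % ~0 up to round-off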
▪ y[k] − 5y[k−1] + 6y[k−2] = 2 ⟹ (z⁻¹ − 1/2)(z⁻¹ − 1/3) = 0 ⟹ y_h[k] = C1(2^k) + C2(3^k)
y_p[k] = C3; substituting into the DE we obtain C3 = 1 ⟹ y[k] = 1 + C1(2^k) + C2(3^k)
▪ 8y[k] − 6y[k−1] + y[k−2] = 2^k ⟹ (z⁻¹ − 4)(z⁻¹ − 2) = 0 ⟹ y_h[k] = C1(1/2)^k + C2(1/4)^k
y_p[k] = (4/21) 2^k ⟹ y[k] = (4/21)(2^k) + C1(1/2)^k + C2(1/4)^k
▪ y[k] − y[k−1] − y[k−2] = k² ⟹ (z⁻² + z⁻¹ − 1) = 0 ⟹ y_h[k] = C1(2/(√5 − 1))^k + C2(−2/(√5 + 1))^k
Suppose we have L(y[k]) = x[k] and we can find a linear operator L1 that annihilates the forcing term, i.e. L1(x[k]) = 0. Notice that y[k] is then a homogeneous solution of the operator L1 L, so we can write
L1 L(y[k]) = L1 L(y1[k] + y2[k]) = L1 L(y1[k]) + L1 L(y2[k]) = 0 ⟹ L1 L(y1[k]) = 0 and L(y2[k]) = 0
It is straightforward to see that the solution y[k] can be decomposed into two main parts, namely y[k] = y1[k] + y2[k] where y1[k] ∈ 𝒩(L1) and y2[k] ∈ 𝒩(L). On the other hand, we observe that y1[k] is a particular solution of L. To see the correctness of this claim, assume that (y1[k] + y2[k]) is a particular solution of L; but we know that y2[k] is a homogeneous solution of L, that is L(y2[k]) = 0, hence y1[k] alone is a particular solution of L (i.e. L(y1[k]) = x[k]).
Notice that
(z⁻¹ − 1)²(2k) = 0  and  (z⁻¹ − 1/2)(2^k) = 0 ⟹ (z⁻¹ − 1)²(z⁻¹ − 1/2){3(2^k) + 2k} = 0
Then
z⁻¹(z⁻¹ − 1)²(z⁻¹ − 1/2)(z⁻¹ + 1)(y[k]) = 0 ⟹ (z⁻¹ − 1)²(z⁻¹ − 1/2)(z⁻¹ + 1)(y[k]) = 0
y[k] = C        ⟹ (z⁻¹ − 1) y[k] = 0
y[k] = k        ⟹ (z⁻¹ − 1)² y[k] = 0
y[k] = α^k      ⟹ (z⁻¹ − 1/α) y[k] = 0
y[k] = α^k k^m  ⟹ (z⁻¹ − 1/α)^{m+1} y[k] = 0
❶ Consider the first order linear time varying difference equation given by
Assume that 𝑦ℎ [𝑘] is the homogeneous solution then the particular solution is
β[k] − β[k − 1] = x[k]/y_h[k]
❷ Consider the second order linear time varying difference equation given by
Assume that 𝑦1 [𝑘], 𝑦2 [𝑘] are the homogeneous solutions then the particular solution is
𝑦𝑝 [𝑘 + 1] = 𝛽1 [𝑘 + 1]𝑦1 [𝑘 + 1] + 𝛽2 [𝑘 + 1]𝑦2 [𝑘 + 1]
= 𝛽1 [𝑘]𝑦1 [𝑘 + 1] + (𝛽1 [𝑘 + 1] − 𝛽1 [𝑘])𝑦1 [𝑘 + 1] + 𝛽2 [𝑘]𝑦2 [𝑘 + 1] + (𝛽2 [𝑘 + 1] − 𝛽2 [𝑘])𝑦2 [𝑘 + 1]
= 𝛽1 [𝑘]𝑦1 [𝑘 + 1] + 𝛽2 [𝑘]𝑦2 [𝑘 + 1] + ∆𝛽1 [𝑘]𝑦1 [𝑘 + 1] + ∆𝛽2 [𝑘]𝑦2 [𝑘 + 1]
𝑦𝑝 [𝑘 + 1] = 𝛽1 [𝑘]𝑦1 [𝑘 + 1] + 𝛽2 [𝑘]𝑦2 [𝑘 + 1]
𝑦𝑝 [𝑘 + 2] = 𝛽1 [𝑘 + 1]𝑦1 [𝑘 + 2] + 𝛽2 [𝑘 + 1]𝑦2 [𝑘 + 2]
= 𝛽1 [𝑘]𝑦1 [𝑘 + 2] + (𝛽1 [𝑘 + 1] − 𝛽1 [𝑘])𝑦1 [𝑘 + 2] + 𝛽2 [𝑘]𝑦2 [𝑘 + 2] + (𝛽2 [𝑘 + 1] − 𝛽2 [𝑘])𝑦2 [𝑘 + 2]
= 𝛽1 [𝑘]𝑦1 [𝑘 + 2] + 𝛽2 [𝑘]𝑦2 [𝑘 + 2] + ∆𝛽1 [𝑘]𝑦1 [𝑘 + 2] + ∆𝛽2 [𝑘]𝑦2 [𝑘 + 2]
Because 𝑦1 [𝑘] and 𝑦2 [𝑘] are solutions of the homogeneous equation corresponding to
the DE. Thus, the proof is complete.
With the Casoratian det 𝕎 is never zero. and summing up from ℓ = 0 to 𝑘 − 1, we find
β1[k] = β1[0] − Σ_{ℓ=0}^{k−1} x[ℓ] y2[ℓ+1]/det 𝕎(ℓ+1),   β2[k] = β2[0] + Σ_{ℓ=0}^{k−1} x[ℓ] y1[ℓ+1]/det 𝕎(ℓ+1)
neglecting the term 𝛽1 [0]𝑦1 [𝑘] & 𝛽2 [0]𝑦2 [𝑘] which is a solution of the homogeneous
equation, we find
y_p[k] = −y1[k] Σ_{ℓ=0}^{k−1} x[ℓ] y2[ℓ+1]/det 𝕎(ℓ+1) + y2[k] Σ_{ℓ=0}^{k−1} x[ℓ] y1[ℓ+1]/det 𝕎(ℓ+1)
       = Σ_{ℓ=0}^{k−1} x[ℓ] · det[ y1[ℓ+1], y2[ℓ+1] ; y1[k], y2[k] ] / det[ y1[ℓ+1], y2[ℓ+1] ; y1[ℓ+2], y2[ℓ+2] ]
Example: Verify that the sequences 𝑦1 [𝑘] = 1 and 𝑦2 [𝑘] = 𝑘 2 are linearly independent
homogeneous solutions of the difference equation
Solution: The reader can verify by direct substitution that the sequences are solutions
of the DE.
y_p[k] = Σ_{ℓ=0}^{k−1} x[ℓ] (k² − (ℓ + 1)²)/det 𝕎(ℓ + 1) = k³ − k(k + 1)(2k + 1)/6 = (4k³ − 3k² − k)/6
If the system is assumed to change in discrete time steps (hours, days, weeks, months,
quarters, years, decades, …), it can be modeled using difference equations. If instead
time progresses continuously, it can be modeled using differential equations.
Let we take the following change 𝑨 = −𝜦 we get 𝐲[𝑘 − 1] = 𝑨𝐲[𝑘] + 𝐟[𝑘], let we find the
homogeneous solution of this system (i.e. 𝐟[𝑘] = 𝟎)
Example: Verify that the homogeneous solution of 𝒚[𝑘 + 1] = 𝑨𝒚[𝑘] + 𝑩𝐮[𝑘] is given by
𝒚[𝑘] = 𝑨𝑘 𝒚[0]
Case01: suppose that 𝑨 has 𝑛-distinct eigenvalues, and let (𝜆𝑖 , 𝐱 𝑖 ) be an eigen-pair of 𝑨
that is 𝑨𝐱 𝑖 = 𝜆𝑖 𝐱 𝑖 , now we try to examine the vectors of the form 𝐲𝑖 [𝑘] = (𝜆𝑖 )𝑘 𝐱 𝑖 .
Then any first order system of linear difference equations has eigenvectors of the form
𝐲𝑖 [𝑘] = (𝜆𝑖 )𝑘 𝐱𝑖 . In other word we can say: any equation of the form 𝒚[𝑘 + 1] = 𝑨𝒚[𝑘] has a
solution 𝐲𝑖 [𝑘] = (𝜆𝑖 )𝑘 𝐱𝑖 , where are called elementary homogeneous solutions. The
general form of the homogeneous solution is the linear combination of them
Repeat the same thing for the system y[k + 1] = [1 3; 2 −1] y[k]
Δ(λ) = λ² − 7 ⟹ {λ1 = √7, x1 = ((√7 + 1)/2; 1)} and {λ2 = −√7, x2 = ((1 − √7)/2; 1)}
y[k] = α1 ((√7 + 1)/2; 1)(√7)^k + α2 ((1 − √7)/2; 1)(−√7)^k
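A MATLAB sketch comparing the modal expansion with direct propagation for this example (the initial condition is an arbitrary assumption):
A  = [1 3; 2 -1];
y0 = [1; 0];                          % arbitrary initial condition (assumed)
[V,D] = eig(A);                        % columns of V are eigenvectors
alpha = V\y0;                          % coordinates of y0 in the eigenbasis
k = 7;
yModal  = V*(diag(D).^k .* alpha);     % sum_i alpha_i * lambda_i^k * x_i
yDirect = A^k * y0;
disp(norm(yModal - yDirect))           % ~0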
Case02: suppose that 𝑨 has repeated eigenvalues, and for seeking simplicity let we
assume that only one eigenvalue is repeated twice. Let's try the conjectured solution
Now consider the conjectured solution
y[k] = β1 λ^k x1 + β2 (k λ^k x1 + λ^k x2),  with (A − λI)x2 = λ x1
Example: Find the homogeneous solution of y[k + 1] = [0 1; −4 4] y[k]
Δ(λ) = (λ − 2)² ⟹ λ = 2 and x1 = (1; 2)
(A − λI)x2 = λ x1 ⟺ ([0 1; −4 4] − [2 0; 0 2]) x2 = 2(1; 2) ⟹ x2 = (0; 2)
y[k] = β1 λ^k x1 + β2 (k λ^k x1 + λ^k x2) = β1 2^k (1; 2) + β2 (k 2^k (1; 2) + 2^k (0; 2))
Case03: suppose that 𝑨 has complex conjugate eigenvalues, and for seeking simplicity
let we assume that only 2 × 2 matrices. Then the general solution has the same form as
the one presented for the case of distinct real eigenvalues. Of course in this case, the
eigenvalues come as conjugate pairs and the eigenvectors are complex components.
Example: Find the homogeneous solution of y[k + 1] = [3/5 2/5; −8/5 3/5] y[k]
{λ1 = 3/5 + (4/5)i, x1 = (1; 2i)} and {λ2 = 3/5 − (4/5)i, x2 = (1; −2i)}
y[k] = α1 (3/5 + (4/5)i)^k (1; 2i) + α2 (3/5 − (4/5)i)^k (1; −2i)
The
solution of linear non-homogeneous systems of difference equations can become very
complex. In this section, we concentrate on the solution methodology of special class,
namely systems with constants non-homogeneous term. (Recall, that we have
presented the solution methodology for single linear non-homogeneous difference
equation, where the non-homogeneous term is a polynomial or an exponential
function).
The solution is of the form: 𝒚[𝑘] = 𝒚ℎ [𝑘] + 𝒚𝑝 [𝑘] where 𝒚ℎ [𝑘] is the solution of the
associated homogeneous system: 𝒚[𝑘 + 1] = 𝑨𝒚[𝑘] and 𝒚𝑝 [𝑘] is the particular solution.
We have seen that the solution of the associated homogeneous system is: 𝒚ℎ [𝑘] = 𝑨𝑘 𝒚[0]
Assuming that matrix 𝑨 have distinct (real) eigenvalues, and eigenvectors, the solution
of the associated homogeneous system can be written as:
Since the linear system has a constant non-homogeneous term, 𝒃 we conjecture that:
𝒚𝑝 [𝑘] = 𝒅
Thus 𝒚[𝑘] = 𝒚ℎ [𝑘] + 𝒚𝑝 [𝑘] = 𝑨𝑘 𝒚[0] + 𝒅 Substituting into the non-homogeneous system
yields:
𝒚[𝑘 + 1] = 𝑨𝒚[𝑘] + 𝒃 ⟹ 𝑨𝑘+1 𝒚[0] + 𝒅 = 𝑨(𝑨𝑘 𝒚[0] + 𝒅) + 𝒃
⟹ 𝑨𝑘+1 𝒚[0] + 𝒅 = 𝑨𝑘+1 𝒚[0] + 𝑨𝒅 + 𝒃
⟹ 𝒅 = 𝑨𝒅 + 𝒃
⟹ (𝑰 − 𝑨)𝒅 = 𝒃
Thus, if matrix 𝑨 have real eigenvalues of multiplicity one and distinct corresponding
eigenvectors, then the solution of the linear non-homogeneous systems of difference
equations: 𝒚[𝑘 + 1] = 𝑨𝒚[𝑘] + 𝒃 is:
y[k + 1] = [2 4; 4 2] y[k] + (1; −1)
Solution: First we solve the associated homogeneous system. We need to find the
eigenvalues and eigenvectors of the coefficient matrix.
{λ1 = 6, x1 = (1; 1)} and {λ2 = −2, x2 = (−1; 1)}
From theory, we know that the particular solution is:
y_p[k] = d = (I − A)⁻¹ b = (1/3)(1; −1)
Therefore the general solution is:
y[k] = α1 6^k (1; 1) + α2 (−2)^k (−1; 1) + (1/3)(1; −1)
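A one-line MATLAB check of the constant particular solution for this example (a sketch; it verifies the fixed-point relation d = A d + b):
A = [2 4; 4 2];  b = [1; -1];
d = (eye(2) - A)\b             % the constant particular solution
disp(norm(d - (A*d + b)))      % ~0, so y_p[k] = d indeed satisfies the system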
Note: If λ = 1 is an eigenvalue of the matrix A, then it satisfies the characteristic equation
|λI − A| = 0; thus the matrix (I − A) is not invertible, and a new form of the (constant)
particular solution 𝒚𝑝 [𝑘] = 𝒅 must be found. The following example demonstrates the
methodology for obtaining the particular solution for the case that 𝜆 = 1 is an
eigenvalue of matrix 𝑨.
y[k + 1] = [6 5; −4 −3] y[k] + (2; −1)
Solution: First we solve the associated homogeneous system. We need to find the
eigenvalues and eigenvectors of the coefficient matrix.
{λ1 = 1, x1 = (−1; 1)} and {λ2 = 2, x2 = (−5/4; 1)}
The matrix has eigenvalues of multiplicity one, and distinct eigenvectors. Thus the
solution the associated homogeneous system is:
y_h[k] = α1 (1)^k (−1; 1) + α2 2^k (−5/4; 1)
Observe that 𝜆 = 1 is an eigenvalue of matrix 𝑨 thus (𝑰 − 𝑨) is not invertible. Thus we
must conjecture a different particular solution. By (tedious) trial and error, we discover
that one type of a particular solution, that “looks promising” is:
y_p[k] = d + d3 k x1 = (d1; d2) + d3 k x1
𝒅 − 𝑨𝒅 = 𝒃 − 𝑑3 𝐱1 ⟺ (𝑰 − 𝑨)𝒅 = 𝒃 − 𝑑3 𝐱1
Substituting the values for 𝒃 and 𝑨, the above can be written as:
[−5 −5; 4 4](d1; d2) = (2; −1) − d3(−1; 1) ⟺ {−5d1 − 5d2 − d3 = 2,  4d1 + 4d2 + d3 = −1}
Adding the two equations gives d1 + d2 = −1 and hence d3 = 3; a convenient choice is d1 = d2 = −1/2, so
(d1; d2) = (−1/2; −1/2) and d3 = 3
Therefore the general solution is: y[k] = α1 (−1; 1) + α2 2^k (−5/4; 1) + (−1/2; −1/2) + 3k(−1; 1)
Consider the linear non-homogeneous system of
difference equations: 𝒚[𝑘 + 1] = 𝑨𝒚[𝑘] + 𝑩𝒖[𝑘] where 𝑩𝒖[𝑘] is a vector whose components
are time varying. Assume that we are given 𝒚[0] = 𝒄 and 𝒖[𝑘] is some known vector
function and 𝐲[𝑘] ∈ ℝ𝑚 , 𝐮[𝑘] ∈ ℝ𝑚 , 𝑩 ∈ ℝ𝑛×𝑚 𝑎𝑛𝑑 𝑨 ∈ ℝ𝑛×𝑛 .
y[k] = A^k y[0] + Σ_{i=1}^{k} A^{i−1} B u[k − i]
Note: The static case is a special form of this last one: assume Bu[k − i] = b, then y_p[k] = Σ_{i=1}^{k} A^{i−1} b = (I + A + A² + ⋯ + A^{k−1})b. On the other hand, by the Neumann (geometric) series d = (I − A)⁻¹b = (I + A + A² + ⋯ + A^{k−1} + A^k + ⋯)b, but the vectors A^i b for i ≥ k are homogeneous solutions, hence only Σ_{i=1}^{k} A^{i−1} b is the particular solution.
Example: Find the solution of the following difference equation
Solution: If we take the following change of variables 𝑥1 [𝑘] = 𝑦[𝑘] & 𝑥2 [𝑘] = 𝑦[𝑘 − 1] we
gat
x1[k − 1] = x2[k]  and  x2[k − 1] = −2x1[k] − 3x2[k] ⟺ (x1[k−1]; x2[k−1]) = [0 1; −2 −3](x1[k]; x2[k])
Let x[k] = (x1[k]; x2[k]); we obtain x[k − 1] = [0 1; −2 −3] x[k]. Now we have reduced the problem
to the solution of first order system rather than second order difference equation.
x[k] = [0 1; −2 −3]⁻¹ x[k − 1] = [−3/2 −1/2; 1 0] x[k − 1]
x[k] = [−3/2 −1/2; 1 0]^k x[0] = α1 (−1)^k (1; −1) + α2 (−1/2)^k (1; −2)
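A MATLAB sketch checking this modal form against direct matrix powers (the initial condition is an arbitrary assumption):
M  = inv([0 1; -2 -3]);           % equals [-3/2 -1/2; 1 0]
x0 = [2; 1];                      % arbitrary initial condition (assumed)
V  = [1 1; -1 -2];                % eigenvectors for eigenvalues -1 and -1/2
alpha = V\x0;
k = 6;
xModal  = alpha(1)*(-1)^k*V(:,1) + alpha(2)*(-1/2)^k*V(:,2);
disp(norm(M^k*x0 - xModal))       % ~0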
Theorem: (Perron’s Theorem & Poincar´e’s Theorem) Assume that the difference
equation ∑𝑛𝑖=0 𝑎𝑖 𝑥[𝑘 + 𝑖] = 0 has a fundamental set of solutions {𝑥1 [𝑘], 𝑥2 [𝑘], . . . , 𝑥𝑛 [𝑘]} with
the property: 𝑥𝑖 [𝑘] = 𝜆𝑘𝑖 Then the characteristic root 𝜆𝑖 is given by
λ_i = x_i[k + 1]/x_i[k],   i = 1, 2, …, n
and for the dominant root
λ_n = lim_{k→∞} x[k + 1]/x[k]
This last statement is known as the Bernoulli's identity which will be discussed later.
In situations in which the sequence {𝑢𝑘 } arises from a function 𝑢 defined on a set of grid
points {𝑥𝑘 }, the forward difference operator may also be expressed as
The second order difference operator is ∆2 𝑢𝑘 = ∆(∆𝑢𝑘 ), whose action on the sequence 𝑢𝑘
is the twice repeated operation of the first order operator ∆, in exact analogy of the
second order differential operator it 𝑑2 /𝑑𝑥 2 when it acts on the continuous function 𝑓(𝑥)
as 𝑑 2 𝑓/𝑑𝑥 2 ,
∆2 𝑢𝑘 = ∆(∆𝑢𝑘 ) = (𝑢𝑘+2 − 𝑢𝑘+1 ) − (𝑢𝑘+1 − 𝑢𝑘 ) = 𝑢𝑘+2 − 2𝑢𝑘+1 + 𝑢𝑘
Higher-order difference operators Δⁿ can be defined in the same manner, and it is easily seen that this operation is commutative, i.e., Δ^m(Δ^n u_k) = Δ^n(Δ^m u_k)
⦁ ∆(𝛼𝑢𝑘 ) = 𝛼∆(𝑢𝑘 )
⦁ ∆(𝑢𝑘 + 𝑣𝑘 ) = ∆(𝑢𝑘 ) + ∆(𝑣𝑘 )
⦁ ∆(𝑢𝑘 𝑣𝑘 ) = 𝑢𝑘+1 ∆𝑣𝑘 + 𝑣𝑘 ∆𝑢𝑘
⦁ Δ(u_k/v_k) = (v_k Δu_k − u_k Δv_k)/(v_k v_{k+1})
which can also be written in terms of the difference operator ∆ (on 𝑣𝑘 and 𝑢𝑘 only) as
The quotient (1/h) ∇u_k = (u(x_k) − u(x_k − h))/h
is an approximation to 𝑢′ (𝑥𝑘 ) which also improves as ℎ becomes small. The asymmetry
of the forward and backward difference operators ∆ and ∇ may be eliminated by defining
a central difference operator: 𝛿𝑢𝑘 = 𝑢(𝑥𝑘 + ℎ/2) − 𝑢(𝑥𝑘 − ℎ/2). The quotient
(1/h) δu_k = (u(x_k + h/2) − u(x_k − h/2))/h
is yet another approximation which converges to 𝑢′ (𝑥𝑘 ) as ℎ → 0.
⦁ 𝑧(Δ) = Δ(𝑧)
⦁ Δ2 = 𝑧 2 − 2𝑧 + 1
⦁ 𝛿 2 = Δ∇
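A tiny MATLAB sketch checking the identity Δ² = z² − 2z + 1 (i.e. Δ²u_k = u_{k+2} − 2u_{k+1} + u_k) on an arbitrary test sequence:
u   = (0:10).^2 + 3;                        % arbitrary test sequence (assumed)
lhs = diff(u, 2);                           % Delta applied twice
rhs = u(3:end) - 2*u(2:end-1) + u(1:end-2); % u_{k+2} - 2u_{k+1} + u_k
disp(max(abs(lhs - rhs)))                   % 0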
The Inverse Difference Operator as a Sum Operator: We shall briefly discuss Δ−1, the
inverse of the difference operator Δ, Δ−1 Δ(𝑢𝑘 ) = 𝑢𝑘 and which, as may be expected,
will amount to summation as opposed to the differencing of Δ. Let Δ−1 𝑢𝑘 = 𝑦𝑘 , we shall
show that 𝑦𝑘 , aside from an arbitrary constant 𝑦1 , is a finite summation of the original
sequence 𝑢𝑘 as we will develop next.
𝑢1 = Δ(Δ−1 𝑢1 ) = 𝑦2 − 𝑦1
𝑢2 = Δ(Δ−1 𝑢2 ) = 𝑦3 − 𝑦2
⋮
𝑢𝑘 = Δ(Δ−1 𝑢𝑘 ) = 𝑦𝑘+1 − 𝑦𝑘
then add, the left hand sides will add up to 𝑢1 + 𝑢2 + ⋯ + 𝑢𝑘 , Adding the right hand side,
we get 𝑦𝑘+1 − 𝑦1
u1 + u2 + ⋯ + u_k = y_{k+1} − y1 ⟹ u1 + u2 + ⋯ + u_{k−1} = y_k − y1 ⟹ y_k = Δ⁻¹u_k = y1 + Σ_{i=1}^{k−1} u_i
as the desired finite summation operation of Δ−1. With this result we can interpret the
action of the inverse difference operator Δ−1 on the sequence 𝑢𝑘 , namely 𝑦𝑘 = Δ−1 𝑢𝑘 as
the 𝑘 − 1 partial sum, of the sequence 𝑢𝑖 plus an arbitrary constant 𝑦1 .
The latter may remind us of the spirit of the indefinite integration of function of
continuous variable f(x), along with its arbitrary constant of integration
∫ 𝑓(𝑥)𝑑𝑥 = 𝐹(𝑥) + 𝐶
where 𝐹(𝑥) is the antiderivative of f(x), i.e. dF(x)/dx = f(x). Often Δ−1 𝑢𝑘 is called the anti-
difference or sum operator, and for obvious reasons!
⦁ Δ⁻¹(1) = k
⦁ Δ⁻¹(α1 u_k + α2 v_k) = α1 Δ⁻¹(u_k) + α2 Δ⁻¹(v_k)
⦁ Δ⁻¹(a^k) = a^k/(a − 1)
⦁ Δ⁻¹ k = k(k − 1)/2 + C          ⦁ Δ⁻¹ k² = k(k − 1)(2k − 1)/6 + C
⦁ Δ⁻¹ k³ = (k − 1)² k²/4 + C      ⦁ Δ⁻¹ (−1)^k = (−1)^{k+1}/2 + C
⦁ Δ⁻¹(k a^k) = (a^k/(a − 1))(k − a/(a − 1)) + C      ⦁ Δ⁻¹ e^{ak} = e^{ak}/(e^a − 1) + C
Summation by Parts: We will now present a few of the essential elements of the
difference and summation calculus which show a striking correspondence to familiar
properties of the differential and integral calculus.
Expanding the left hand side and summing term by term, we see that all of the interior
terms of the sum cancel and only two "boundary terms" survive. This leaves
u′(x) + a u(x) = f(x) ⟺ (1/h)Δu_k + a u_k = f_k ⟺ (1/h) u_{k+1} + (a − 1/h) u_k = f_k
The differential equation is equivalent to 𝑢𝑘+1 = ℎ𝑓𝑘 + (1 − ℎ𝑎)𝑢𝑘 . Let we solve this
difference equation recursively by setting 𝛽 = (1 − ℎ𝑎) and 𝑟𝑘 = ℎ𝑓𝑘 so we obtain
𝑢𝑘+1 = 𝑟𝑘 + 𝛽𝑢𝑘
𝑢1 = 𝑟0 + 𝛽𝑢0
𝑢2 = 𝑟1 + 𝛽(𝑟0 + 𝛽𝑢0 ) = 𝛽 2 𝑢0 + 𝑟1 + 𝛽𝑟0
𝑢3 = 𝑟2 + 𝛽(𝛽 2 𝑢0 + 𝑟1 + 𝛽𝑟0 ) = 𝛽 3 𝑢0 + 𝑟2 + 𝛽𝑟1 + 𝛽 2 𝑟0
⋮
u_k = β^k u_0 + (Σ_{i=1}^{k−1} r_{i−1} β^{k−i}) + r_{k−1}
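A short MATLAB sketch of this recursion (forward Euler) for the assumed test problem u′ + 2u = 1 with u(0) = 0, compared against the exact solution:
a = 2; h = 0.01; N = 500; u0 = 0;          % assumed values
beta = 1 - h*a;
u = zeros(1, N+1); u(1) = u0;
for k = 1:N
    fk = 1;                                % f evaluated at x_k (constant here, assumed)
    u(k+1) = h*fk + beta*u(k);             % u_{k+1} = r_k + beta*u_k with r_k = h*f_k
end
x = (0:N)*h;
plot(x, u, x, 0.5*(1 - exp(-2*x)))         % numerical vs exact solution of u' + 2u = 1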
In mathematics
and signal processing, the Z-transform converts a discrete-time signal, which is a
sequence of real or complex numbers, into a complex frequency-domain representation.
It can be considered as a discrete-time equivalent of the Laplace transform. This
similarity is explored in the theory of time-scale calculus.
There are a number of ways to represent the sampling process mathematically. One commonly used way is to represent the sampled signal directly by a sequence x[n]. This technique has the advantage of being very simple to understand, but it does not make the connection between the sampled signal and the Laplace transform obvious. To make that connection explicit, the sampled signal is represented as a train of weighted impulses:
x⋆(t) = Σ_{k=0}^{∞} x(kT) δ(t − kT)
Since we now have a time domain signal, we wish to see what kind of analysis we can
do in a transformed domain. Let's start by taking the Laplace Transform of the
sampled signal:
X⋆(s) = ℒ{x⋆(t)} = ℒ{Σ_{k=0}^{∞} x[k] δ(t − kT)}
Since 𝑥[𝑘] is a constant, we can (because of Linearity) bring the Laplace Transform
inside the summation.
X⋆(s) = Σ_{k=0}^{∞} x[k] ℒ{δ(t − kT)} = Σ_{k=0}^{∞} x[k] e^{−skT}
To simplify the expression a little bit, we will use the notation z = e^{sT}, so that e^{−skT} = z^{−k}. We will call the resulting expression the Z-transform and define it as X(z) = Σ_{k=0}^{∞} x[k] z^{−k}.
We will also use the following notation to represent a sequence {𝑥[𝑘]} and its Z-
transform 𝑋(𝑧): 𝑥[𝑘] ⟷ 𝑋(𝑧)
Region of convergence: The region of convergence (ROC) is the set of points in the
complex plane for which the Z-transform summation converges.
Formally, ROC = {z : Σ_{k=−∞}^{+∞} |x[k] z^{−k}| < ∞}.
Example Let us now consider a signal that is the real exponential: f[𝑘] = 𝛼 𝑘 𝑢[𝑘]
F(z) = Σ_{k=−∞}^{+∞} α^k u[k] z^{−k} = Σ_{k=0}^{∞} (α/z)^k = z/(z − α),   |z| > |α|
Example Let us now consider the left-sided real exponential: f[𝑘] = −𝛼 𝑘 𝑢[−𝑘 − 1]
F(z) = Σ_{k=−∞}^{+∞} −α^k u[−k − 1] z^{−k} = −Σ_{k=−∞}^{−1} (α/z)^k = −Σ_{k=1}^{∞} (z/α)^k = −(z/α) Σ_{k=0}^{∞} (z/α)^k = z/(z − α),   |z| < |α|
Remark: If you are primarily interested in application to one-sided signals, then the z-
transform used is restricted to causal signals (i.e., signals with zero values for negative
time) and the one-sided z-transform.
The z-transform is
an important tool in the analysis and design of discrete-time systems. It simplifies the
solution of discrete-time problems by converting LTI difference equations to algebraic
equations and convolution to multiplication. Thus, it plays a role similar to that served
by Laplace transforms in continuous-time problems. Now let us give some of commonly
used discrete-time signals such as the sampled step, exponential, and the discrete time
impulse.
❶ The Unit Impulse Function: In discrete time systems the unit impulse is defined
somewhat differently than in continuous time systems.
δ[k] = {1 for k = 0;  0 for k ≠ 0} = u[k] − u[k − 1]
X(z) = Σ_{k=0}^{∞} δ[k] z^{−k} = δ[0] z^0 + Σ_{k=1}^{∞} δ[k] z^{−k} = 1   (the second term is zero)
❷ The Unit Step Function: The unit step is one when 𝑘 is zero or positive.
X(z) = Σ_{k=0}^{∞} u[k] z^{−k} = Σ_{k=0}^{∞} z^{−k} = 1/(1 − z⁻¹) = z/(z − 1)
f[k] = k u[k] ⟹ F(z) = Σ_{k=0}^{∞} k z^{−k} = −z d/dz (Σ_{k=0}^{∞} z^{−k}) = z/(z − 1)²
Exercise: Find the Z-Transform of the following signal f[𝑘] = cos(𝑎𝑘) 𝑢[𝑘] Solution:
F(z) = Σ_{k=0}^{∞} cos(ak) u[k] z^{−k} = Σ_{k=0}^{∞} ((e^{iak} + e^{−iak})/2) z^{−k} = (1/2){Σ_{k=0}^{∞} e^{iak} z^{−k} + Σ_{k=0}^{∞} e^{−iak} z^{−k}}
𝒵(f[k]) = F(z) = (1/2){z/(z − e^{ia}) + z/(z − e^{−ia})} = z(z − cos(a))/(z² − 2z cos(a) + 1)
Similarly, for f[k] = sin(ak) u[k]:  F(z) = z sin(a)/(z² − 2z cos(a) + 1)
Remarks: the Z-Transform 𝑋(𝑧) is always a rational function of the complex variable z,
and the same complex function can also be expressed in terms of 𝑧 −1
Given 𝑥[𝑘] = 𝑎. 𝑓[𝑘] + 𝑏. 𝑔[𝑘], the following property holds: 𝑋(𝑧) = 𝑎. 𝐹(𝑧) + 𝑏. 𝐺(𝑧)
Example: Find the z-transform of the causal sequence 𝑥[𝑘] = 2𝑢[𝑘] + 4𝛿[𝑘] 𝑘 = 0,1, …
X(z) = Σ_{k=0}^{∞} (2u[k] + 4δ[k]) z^{−k} = Σ_{k=0}^{∞} 2u[k] z^{−k} + Σ_{k=0}^{∞} 4δ[k] z^{−k} = 2z/(z − 1) + 4 = (6z − 4)/(z − 1)
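If the Symbolic Math Toolbox is available, the result can be checked with ztrans (a sketch; over k ≥ 0 the step u[k] is identically 1, so it is entered as the constant 2, and the impulse contributes 4 directly):
syms k z
X = ztrans(sym(2), k, z) + 4;   % 2*u[k] gives 2z/(z-1); 4*delta[k] gives 4
simplify(X)                     % (6*z - 4)/(z - 1)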
❷ Time Shift (Delay): An important property of the Z Transform is the time shift. To
see why this might be important consider that a discrete-time approximation to a
derivative is given by:
Let's examine what effect such a shift has upon the Z Transform. Assume that the
sequence 𝑥[𝑘] has a Z-Transform 𝑋(𝑧), and we want to get the Z-Transform of 𝑥[𝑘 − 𝑛]
If x[k] ⟷ X(z), then for a causal sequence x[k − n] ⟷ z^{−n} X(z). For example, consider the delayed sequence
x[k] = {4 for k = 2, 3, …;  0 otherwise} = 4u[k − 2] ⟷ 4z^{−2} · z/(z − 1) = 4/(z(z − 1))
❸ Time Advance: Let's explore it in the same way as we did the shift to the
right. Consider the same sequence, 𝑥[𝑘], as before. This time we shift it to the left by 𝑛
samples to get 𝑥[𝑘 + 𝑛].
X_new(z) = Σ_{k=0}^{∞} x[k + n] z^{−k} = z^n X(z) − Σ_{ℓ=0}^{n−1} x[ℓ] z^{n−ℓ}
Proof:
X_new(z) = Σ_{k=0}^{∞} x[k + n] z^{−k} = Σ_{ℓ=n}^{∞} x[ℓ] z^{n−ℓ} = z^n Σ_{ℓ=n}^{∞} x[ℓ] z^{−ℓ} = z^n (X(z) − Σ_{ℓ=0}^{n−1} x[ℓ] z^{−ℓ}) = z^n X(z) − Σ_{ℓ=0}^{n−1} x[ℓ] z^{n−ℓ}
❹ Multiplication by z0^k in the time domain: this property is also known as Z-domain scaling, and it says that if x[k] ⟷ X(z) with ROC = R then
z0^k x[k] ⟷ Σ_{k=0}^{∞} z0^k x[k] z^{−k} = Σ_{k=0}^{∞} x[k] (z/z0)^{−k} = X(z/z0),   with ROC = |z0| R
❺ Time Scaling: This property deals with the effect on the frequency-domain
representation of a signal if the time variable is altered. The most important concept to
understand for the time scaling property is that signals that are narrow in time will be
broad in frequency and vice versa.
Proof: The derivative rule is straightforward; let us prove only the integration rule.
∫ (X(z)/z) dz = ∫ (Σ_{k=0}^{∞} x[k] z^{−k−1}) dz = Σ_{k=1}^{∞} −k⁻¹ x[k] z^{−k} = −𝒵(k⁻¹ x[k])
k² x[k] ⟷ z d/dz{z d/dz X(z)} = z² d²/dz² X(z) + z d/dz X(z)
x1[k] = k² u[k] ⟷ z² d²/dz²(z/(z − 1)) + z d/dz(z/(z − 1)) = 2z²/(z − 1)³ − z/(z − 1)² = (z² + z)/(z − 1)³
x2[k] = k² a^k u[k] ⟷ z² d²/dz²(z/(z − a)) + z d/dz(z/(z − a)) = 2az²/(z − a)³ − az/(z − a)² = az(z + a)/(z − a)³
Therefore, a pole (or zero) in 𝑋(𝑧) at = 𝑧𝑘 , moves to 1/𝑧𝑘 , after time reversal. The
relationship ROC = 1/𝑅 indicates the inversion of 𝑅, reflecting the fact that a right-sided
sequence becomes left-sided if time-reversed, and vice versa.
Now the Z-transform of this convolution is 𝑌(𝑧) = (𝑥[𝑘] ⋆ ℎ[𝑘]) = 𝑋(𝑧)𝐻(𝑧). This
relationship plays a central role in the analysis and design of discrete-time LTI systems,
in analogy with the continuous-time case.
We know that Σ_{k=−∞}^{∞} x[k − i] z^{−k} = z^{−i} X(z), and hence Y(z) = Σ_i h[i] z^{−i} X(z) = H(z) X(z).
⓬ Parseval's Theorem:
Σ_{k=−∞}^{+∞} x1[k] x2⋆[k] = (1/2πj) ∮ X1(z) X2⋆(1/z⋆) z⁻¹ dz
Proof: Let x1[k] ⟷ X1(z) and x2[k] ⟷ X2(z). The proof of this identity is based on the inverse relation; the inversion formula of the Z-transform is derived later, so assume for now that
x1[k] = (1/2πj) ∮ X1(z) z^{k−1} dz
Σ_{k=−∞}^{+∞} x1[k] x2⋆[k] = Σ_k {(1/2πj) ∮ X1(z) z^{k−1} dz} x2⋆[k] = (1/2πj) ∮ X1(z) (Σ_k x2⋆[k] z^{k−1}) dz
= (1/2πj) ∮ X1(z) (Σ_k x2⋆[k] (1/z)^{−k}) z⁻¹ dz = (1/2πj) ∮ X1(z) (Σ_k x2[k] (1/z⋆)^{−k})⋆ z⁻¹ dz
= (1/2πj) ∮ X1(z) X2⋆(1/z⋆) z⁻¹ dz
Similarly, for the product of two sequences:
x1[k] x2[k] ⟷ (1/2πj) ∮ X1(v) X2(z/v) v⁻¹ dv
⓮ Initial value theorem: If x[k] is causal (so that its Z-transform X(z) contains no positive powers of z), then the initial value theorem can be written as
lim 𝑥[𝑘] = lim 𝑋(𝑧)
𝑘→0 𝑧→∞
Proof: X(z) = Σ_{k=0}^{∞} x[k] z^{−k} = x[0] + x[1]z⁻¹ + x[2]z⁻² + ⋯ ⟹ lim_{z→∞} X(z) = x[0] = lim_{k→0} x[k]
⓯ Final value theorem: The final value theorem allows us to calculate the limit of a sequence as k tends to infinity, if one exists, from the z-transform of the sequence. It states that if x[k] is causal and all the poles of (1 − z⁻¹)X(z) lie inside the unit circle, then its final value, denoted x(∞), is
x(∞) = lim_{k→∞} x[k] = lim_{z→1} (1 − z⁻¹) X(z)
The main pitfall of the theorem is that there are important cases where the limit does
not exist. The two main cases are as follows:
The reader is cautioned against blindly using the final value theorem, because this can
yield misleading results.
Example: Verify the final value theorem using the z-transform of a decaying
exponential sequence and its limit as 𝑘 tends to infinity.
x1[k] = a^k u[k] ⟷ X1(z) = z/(z − a)   and   x2[k] = a^{−k} u[k] ⟷ X2(z) = az/(az − 1)
X(z) = (1/(1 − z^{−T})) Σ_{k=0}^{∞} x1[k] z^{−k} = X1(z)/(1 − z^{−T})
X(z) = X1(z) + z^{−T} X1(z) + z^{−2T} X1(z) + ⋯ = X1(z)(1 + z^{−T} + z^{−2T} + ⋯) = X1(z)/(1 − z^{−T})
We considered a discrete-time LTI system for which input 𝑥[𝑘] and output 𝑦[𝑘] satisfy
the general linear constant-coefficient difference equation of the form
Σ_{i=0}^{n} a_i y[k − i] = Σ_{i=0}^{m} b_i x[k − i]
The transfer function of a system 𝐻(𝑧) is defined as the Z-transform of its output
𝑦[𝑘] divided by the Z-transform of its forcing function 𝑥[𝑘], so it is a complex rational
function of two polynomials named numerator and denominator.
Applying the z-transform and using the time-shift property and the linearity property of
the z-transform to this difference equation, we obtain
(Σ_{i=0}^{n} a_i z^{−i}) Y(z) = (Σ_{i=0}^{m} b_i z^{−i}) X(z) ⟹ H(z) = Y(z)/X(z) = (Σ_{i=0}^{m} b_i z^{−i}) / (Σ_{i=0}^{n} a_i z^{−i})
Hence, 𝐻(𝑧) is always rational.
The DC gain is that ratio of the steady-state output to the steady-state input, especially
we take the input as unit step. The DC gain is an important parameter, especially in
control applications.
Y(z) = H(z) X(z) = H(z) · z/(z − 1)
Using the final value theorem we obtain
x[∞] = lim_{z→1}(1 − z⁻¹) X(z) = lim_{z→1}(1 − z⁻¹) z/(z − 1) = 1
y[∞] = lim_{z→1}(1 − z⁻¹) Y(z) = lim_{z→1}(1 − z⁻¹) H(z) z/(z − 1) = H(1)
The DC gain is
DC gain = y[∞]/x[∞] = H(1) = lim_{z→1}(1 − z⁻¹) Y(z)
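A small MATLAB sketch of the DC gain for an assumed example transfer function H(z) = (z + 0.5)/(z² − 0.9z + 0.2):
num = [1 0.5]; den = [1 -0.9 0.2];
dc = polyval(num,1)/polyval(den,1)      % H(1) = 1.5/0.3 = 5
% with the Control System Toolbox, dcgain(tf(num,den,1)) gives the same value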
Contour Integration We now discuss how to obtain the sequence 𝑥[𝑘] by contour
integration, given its Z-transform. Recall that the two-sided Z-transform is defined by
X(z) = Σ_{k=−∞}^{∞} x[k] z^{−k}
Let us multiply both sides by 𝑧 𝑛−1 and integrate over a closed contour within ROC of
𝑋(𝑧); let the contour enclose the origin. We have
∮_𝒞 X(z) z^{n−1} dz = ∮_𝒞 (Σ_{k=−∞}^{∞} x[k] z^{−k}) z^{n−1} dz
where 𝒞 denotes the closed contour within ROC, taken in a counterclockwise direction.
As the curve 𝒞 is inside ROC, the sum converges on every part of 𝒞 and, as a result, the
integral and the sum on the right-hand side can be interchanged. The above equation
becomes
∮_𝒞 X(z) z^{n−1} dz = Σ_{k=−∞}^{∞} x[k] ∮_𝒞 z^{n−1−k} dz
By the Cauchy integral theorem, (1/2πj)∮_𝒞 z^{n−1−k} dz = 1 if k = n and 0 if k ≠ n, where 𝒞 is any contour that encloses the origin. Hence the right-hand side reduces to 2πj x[n], and we obtain the inversion formula
x[n] = (1/2πj) ∮_𝒞 X(z) z^{n−1} dz
Because the inverse Z-transform of 𝑧/(𝑧 − 𝑎) is given by 𝑎𝑘 𝑢[𝑘], we split 𝐻(𝑧) into partial
fractions, as given below:
H(z) = H1/(z − a1) + H2/(z − a2) + ⋯ + Hm/(z − am)
where we have assumed that H(z) has m simple poles at a1, a2, …, am. The coefficients H1, H2, …, Hm are known as the residues at the corresponding poles. The residues are calculated using the formula H_i = lim_{z→a_i} (z − a_i) H(z),  i = 1, 2, …, m
For repeated poles,  H(z) = Σ_{i=1}^{r} Σ_{j=1}^{ℓ_i} H_ij/(z − a_i)^j
where
H_ij = (1/(ℓ_i − j)!) lim_{z→a_i} d^{ℓ_i−j}/dz^{ℓ_i−j} [(z − a_i)^{ℓ_i} H(z)]
H(z) = (11z² − 15z + 6)/((z − 2)(z − 1)²)
H11 = lim_{z→2}(z − 2) H(z) = lim_{z→2}(11z² − 15z + 6)/(z − 1)² = 20
H(z) = 20/(z − 2) − 9/(z − 1) − 2/(z − 1)² = 20z⁻¹/(1 − 2z⁻¹) − 9z⁻¹/(1 − z⁻¹) − 2z⁻²/(1 − z⁻¹)²
H(z) = (z³ − z² + 3z − 1)/((z − 1)(z² − z + 1))
Because the numerator and denominator have the same degree, we first divide the numerator by the denominator and obtain
H(z) = 1 + z(z + 1)/((z − 1)(z² − z + 1)) = 1 + G(z)
G(z)/z = (z + 1)/((z − 1)(z² − z + 1)) = (z + 1)/((z − 1)(z − e^{jπ/3})(z − e^{−jπ/3}))
Note that complex poles or complex zeros, if any, would always occur in conjugate pairs
for real sequences.
G(z)/z = 2/(z − 1) − 1/(z − e^{jπ/3}) − 1/(z − e^{−jπ/3})
We cross-multiply by z and invert:
H(z) = 1 + 2z/(z − 1) − z/(z − e^{jπ/3}) − z/(z − e^{−jπ/3}) ⟹ h[k] = δ[k] + (2 − 2cos(πk/3)) u[k]
Sometimes it helps to work directly in powers of z−1. We illustrate this in the next
example.
H(z) = (3 − (5/6)z⁻¹)/((1 − (1/4)z⁻¹)(1 − (1/3)z⁻¹)),   |z| > 1/3
There are two poles, one at 𝑧 = 1/3 and one at 1/4 . As ROC lies outside the outermost
pole, the inverse transform is a right handed sequence:
H(z) = (3 − (5/6)z⁻¹)/((1 − (1/4)z⁻¹)(1 − (1/3)z⁻¹)) = H1/(1 − (1/4)z⁻¹) + H2/(1 − (1/3)z⁻¹)
H(z) = 1/(1 − (1/4)z⁻¹) + 2/(1 − (1/3)z⁻¹) ⟹ h[k] = {(1/4)^k + 2(1/3)^k} u[k]
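MATLAB's residuez performs exactly this partial-fraction expansion in powers of z⁻¹ (a sketch for the example above):
b = [3 -5/6];                  % numerator  3 - (5/6) z^-1
a = conv([1 -1/4],[1 -1/3]);   % denominator (1 - z^-1/4)(1 - z^-1/3)
[r,p,kdir] = residuez(b,a)     % r = [1; 2] (up to ordering), p = [1/4; 1/3]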
H(z) = (z² + 2z)/((z − 2)(z + 1)²),   1 < |z| < 2
As the degree of the numerator polynomial is less than that of the denominator
polynomial, and as it has a zero at the origin, first divide by z and do a partial fraction
expansion:
H(z)/z = (z + 2)/((z − 2)(z + 1)²) = H11/(z − 2) + H12/(z + 1) + H22/(z + 1)²
Because the ROC is given to be the annulus 1 < |z| < 2, and because it does not contain the poles, the terms must be inverted as follows:
H(z) = −(4/9) z/(z + 1) − (1/3) z/(z + 1)²   (inverted for |z| > 1)   + (4/9) z/(z − 2)   (inverted for |z| < 2)
We obtain the inverse as
h[k] = (k/3 − 4/9)(−1)^k u[k] − (4/9) 2^k u[−k − 1]
H(z)/z = 1/(z(z² + (√2 − 1)z − √2)) = α/z + β/(z − 1) + γ/(z + √2)
α = −1/√2,   β = 1/(1 + √2),   γ = 1/(2 + √2)
Therefore the partial fraction expansion is
H(z) = −1/√2 + (1/(1 + √2)) z/(z − 1) + (1/(2 + √2)) z/(z + √2)
h[k] = −(1/√2) δ[k] + {1/(1 + √2) + (1/(2 + √2))(−√2)^k} u[k]
❷ Inversion by Power Series Now we will present another method of inversion. In this,
both the numerator and the denominator are written in powers of 𝑧 −1 and we divide the
former by the latter through long division. Because we obtain the result in a power
series, this method is known as the power series method. We illustrate it with an
example.
H(z) = log(1 + a z⁻¹) = Σ_{k=1}^{∞} (−1)^{k+1} (a^k/k) z^{−k} ⟹ h[k] = {(−1)^{k+1} a^k/k for k > 0; 0 for k ≤ 0} = −((−a)^k/k) u[k − 1]
Example: Determine the inverse Z-transform of H(z) = e^{z⁻¹}.
Let us use the power series expansion for e^{z⁻¹}:
H(z) = e^{z⁻¹} = e^{1/z} = Σ_{k=0}^{∞} (1/k!) z^{−k} ⟹ h[k] = (1/k!) u[k]
ℎ⋆ (𝑡) = ℎ(𝑡)𝑃(𝑡)
If we assume that, in the limit, the pulse width approaches zero, the sampler output
becomes a train of impulse or Dirac delta functions. This type of uniform sampling is
referred to as ideal impulse sampling. In this case, the sampled signal is represented by
a train of impulse functions whose areas correspond to the sampled values of the
signal. The impulse train is defined by P(t) = Σ_{k=0}^{∞} δ(t − kT), and the sampled signal is given by
h⋆(t) = h(t) P(t) = h(t) Σ_{k=0}^{∞} δ(t − kT)
Before go deeper into the subject let we answer the following question
Why discretization is needed? And is the discretization method unique or not?
Finding the discrete equivalent of continuous system is desired when digital computers
are a part of the system, so the data needed to be computerized must be sampled.
There are many methods for finding a discrete equivalent; among them we mention the following rules:
ẋ(t) ≅ (x[k+1] − x[k])/T          (the forward rule)
ẋ(t) ≅ (x[k] − x[k−1])/T          (the backward rule)
(1/2)(ẋ(t) + zẋ(t)) ≅ (x[k+1] − x[k])/T     (the bilinear rule)
The operation can be carried out directly on the system function (transfer function) if
one translates the above equations into the frequency domain.
sX(s) ≅ ((z − 1)/T) X(z)        ⟺ s = (z − 1)/T            (the forward rule)
sX(s) ≅ ((z − 1)/(Tz)) X(z)     ⟺ s = (z − 1)/(Tz)          (the backward rule)
((z + 1)/2) sX(s) ≅ ((z − 1)/T) X(z) ⟺ s = (2/T)(z − 1)/(z + 1)   (the bilinear rule)
Remark: the trapezoidal rule is also called the Tustin's method or bilinear
transformation.
Now it is interesting to see how the stable region of the s-plane is mapped to the z-plane.
Forward method: s = (z − 1)/T, i.e. ẋ ≅ (x_{k+1} − x_k)/T. Here Discrete Stable ⟹ Continuous Stable; a stable continuous system may be mapped to an unstable discrete one.
Backward method: s = (z − 1)/(Tz), i.e. ẋ ≅ (x_k − x_{k−1})/T. Here Continuous Stable ⟹ Discrete Stable; that is, we can have a continuous unstable system whose discrete equivalent is stable. Writing z = σ + jω:
continuous stable ⟹ Re(s) < 0 ⟹ Re(((σ − 1) + jω)/(T(σ + jω))) < 0 ⟹ (σ − 1/2)² + ω² < (1/2)²   (a disc)
Notice that the obtained disc is located inside the unit circle, so we can interpret this result by the implication Continuous Stable ⟹ Discrete Stable; in other words, we can have a stable discrete system even when the continuous one is unstable.
Bilinear transformation: s = (2/T)(z − 1)/(z + 1). Here Continuous Stable ⟺ Discrete Stable (the two are equivalent). Indeed, writing z = σ + jω:
continuous stable ⟹ Re(s) < 0 ⟹ Re(((σ − 1) + jω)/((σ + 1) + jω)) < 0 ⟹ σ² + ω² < 1   (the unit disc)
discrete stable ⟹ |z| < 1 ⟹ |(2 + Ts)/(2 − Ts)| < 1 ⟹ Re(s) < 0 ⟹ continuous stable
Notice that the obtained disc is exactly the unit disc, so Continuous Stable ⟺ Discrete Stable.
Example: Using the previous methods, find the H(z) equivalent to the given H(s) = a/(s + a).
The forward rule:  H(z) = a/((z − 1)/T + a) = aT/(z + (aT − 1))
The backward rule: H(z) = a/((z − 1)/(Tz) + a) = aTz/((aT + 1)z − 1)
The bilinear rule: H(z) = a/((2/T)(z − 1)/(z + 1) + a) = aT(z + 1)/((aT + 2)z + (aT − 2))
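A MATLAB sketch building the three discrete equivalents from the formulas above, for assumed values a = 2 and T = 0.1; the bilinear one can be cross-checked against c2d(...,'tustin') from the Control System Toolbox:
a = 2; T = 0.1;                                    % assumed values
Hf   = tf(a*T, [1, a*T-1], T);                     % forward:  aT/(z + aT - 1)
Hb   = tf([a*T, 0], [a*T+1, -1], T);               % backward: aTz/((aT+1)z - 1)
Hbil = tf(a*T*[1 1], [a*T+2, a*T-2], T);           % bilinear: aT(z+1)/((aT+2)z + (aT-2))
Htustin = c2d(tf(a, [1 a]), T, 'tustin');          % should match Hbil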
Step invariant method (zero order hold): The basic idea behind the step invariance
method is to choose a step response for discrete system that is similar to the step
response of analog system, and this is the reason why we call it step invariant.
y(t) = ℒ⁻¹{H(s)/s}                      (continuous system)
y[k] = 𝒵⁻¹{H(z)/(1 − z⁻¹)}              (discrete system)
y[k] = y(t)|_{t→kT} = ℒ⁻¹{H(s)/s}|_{t→kT}
This means that
H(z) = (1 − z⁻¹) 𝒵(ℒ⁻¹{H(s)/s}|_{t→kT})
H₀(s) = 1/s − e^{−sT}/s = (1 − e^{−sT})/s
H(z) = 𝒵(ℒ⁻¹{H₀(s)H(s)}|_{t→kT}) = 𝒵(ℒ⁻¹{(1 − e^{−sT}) H(s)/s}|_{t→kT})
We know that z = e^{sT}, then
H(z) = (1 − z⁻¹) 𝒵(ℒ⁻¹{H(s)/s}|_{t→kT})
Remark: the advantage of setting the input to a step 𝑢[𝑘] is that: a step function is
invariant to ZOH i.e. 𝑢(𝑡) is also a step every sampling period (unaffected by ZOH).
ℒ⁻¹{H₀(s)H(s)} = ℒ⁻¹{H(s)/s − e^{−sT} H(s)/s} = ℒ⁻¹{G(s) − e^{−sT} G(s)} = g(t) − g(t − T),  with G(s) = H(s)/s
H(z) = 𝒵(ℒ⁻¹{H₀(s)H(s)}|_{t=kT}) = 𝒵(g(t) − g(t − T)|_{t=kT}) = (1 − z⁻¹) 𝒵(g(t)|_{t=kT})
Example: MATLAB code for the ZOH design (the signal and sampling values below are illustrative assumptions)
tStep = 0.001; Ts = 0.1; tEnd = 2;         % simulation step, sampling period, end time (assumed)
t = 0:tStep:tEnd;                          % continuous time grid
x = @(tau) sin(2*pi*tau);                  % example continuous signal (assumed)
nSamples = tEnd/Ts;
samples = 0:nSamples;                      % sample indices
xSampled = zeros(1,length(t));
xSampled(1:Ts/tStep:end) = x(samples*Ts);  % sampled signal placed on the fine grid
xZoh1 = interp1(samples*Ts, x(samples*Ts), t, 'previous');  % zero-order-hold reconstruction
figure(3); hold all;
plot(t,x(t),'-r','linewidth',3);
stem(t,xSampled);
plot(t,xZoh1,'-b','linewidth',3);
Other programs for the design of ZOH block
CHAPTER II:
If we estimate the values of the unknown function at the points that are inside/outside
the range of collected data points, we call it the interpolation/extrapolation.
The coefficients can be obtained by solving the following system of linear equations.
𝑎0 + 𝑥0 𝑎1 + 𝑥02 𝑎2 + ⋯ + 𝑥0𝑁 𝑎𝑁 = 𝑦0
𝑎0 + 𝑥1 𝑎1 + 𝑥12 𝑎2 + ⋯ + 𝑥1𝑁 𝑎𝑁 = 𝑦1
.................................................
𝑎0 + 𝑥𝑁 𝑎1 + 𝑥𝑁2 𝑎2 + ⋯ + 𝑥𝑁𝑁 𝑎𝑁 = 𝑦𝑁
But, as the number of data points increases, so does the number of unknown variables
and equations, consequently, it may be not so easy to solve. That is why we look for
alternatives to get the coefficients {𝑎0 , 𝑎1 , . . . , 𝑎𝑁 } . One of the alternatives is to make use
of the Lagrange polynomials
Where: 𝐴𝑗 is a constant, and where the factor is 𝑥 − 𝑥𝑗 omitted. The defined polynomial
may be written in a shorter notation as
P_j(x) = A_j ∏_{i=0, i≠j}^{n} (x − x_i)
When 𝑥 is equal to any of the 𝑥𝑖 corresponding to a data points (except 𝑥𝑗 ) the value of
this polynomial is zero since the factor 𝑥𝑖 − 𝑥𝑖 will be zero. However, when 𝑥 = 𝑥𝑗
The value of polynomial is not zero since the factor 𝑥 − 𝑥𝑗 is missing. Thus if we denote
one of the data points as 𝑥𝑘
P_j(x_k) = {0 for k ≠ j;   A_j ∏_{i=0, i≠j}^{n} (x_k − x_i) for k = j}
If A_j is defined as A_j = 1/∏_{i=0, i≠j}^{n}(x_j − x_i), then P_j(x_k) becomes P_j(x_k) = {0 for k ≠ j; 1 for k = j}
Thus functions 𝑃𝑗 (𝑥) the are the set of 𝑛𝑡ℎ degree polynomials, defined in such a way
that each one passes through zero at each of the data points except for the one point
𝑥𝑘 where 𝑘 = 𝑗 . the 𝑃𝑗 (𝑥) are called Lagrange polynomials.
We now form the following linear combination of the 𝑃𝑗 (𝑥): 𝑃𝑛 (𝑥) = ∑𝑛𝑗=0 𝑓(𝑥𝑗 ) 𝑃𝑗 (𝑥). Since
is a linear combination of 𝑛𝑡ℎ degree polynomials, it is also an 𝑛𝑡ℎ degree polynomial. If
we select any one of the points at which data are available, say 𝑥2 then
𝑃𝑛 (𝑥2 ) = 𝑓(𝑥0 )𝑃0 (𝑥2 ) + 𝑓(𝑥1 )𝑃1 (𝑥2 ) + 𝑓(𝑥2 )𝑃2 (𝑥2 ) + ⋯ + 𝑓(𝑥𝑛 )𝑃𝑛 (𝑥2 )
But since each of the 𝑃𝑗 (𝑥2 ) is zero except for 𝑃2 (𝑥2 ) which equals 1,
At 𝑥2 ,this 𝑛𝑡ℎ degree polynomial yields the value 𝑓(𝑥2 ). It can easily be seen that for any
of the data points 𝑥𝑖 , the polynomial becomes 𝑓(𝑥𝑖 ). Thus the polynomial 𝑃𝑛 (𝑥) is the
desired polynomial of degree 𝑛 which exactly fits the 𝑛 + 1 data points.
P0(7) = (7 − 2)(7 − 4)(7 − 8)/((1 − 2)(1 − 4)(1 − 8)) = 0.71429
P1(7) = (7 − 1)(7 − 4)(7 − 8)/((2 − 1)(2 − 4)(2 − 8)) = −1.5
P2(7) = (7 − 1)(7 − 2)(7 − 8)/((4 − 1)(4 − 2)(4 − 8)) = 1.25
P3(7) = (7 − 1)(7 − 2)(7 − 4)/((8 − 1)(8 − 2)(8 − 4)) = 0.53571
f(7) ≈ 1·(0.71429) + 3·(−1.5) + 7·(1.25) + 11·(0.53571) = 10.85710
Remark
1. It should be noted that the polynomial interpolation of this type can be dangerous
toward the center of the regions where the independent variable is widely spaced.
2. One of the alternatives is to make use of the Lagrange polynomials as:
𝑃𝑛 (𝑥2 ) = 𝑓(𝑥0 )𝑃0 (𝑥2 ) + 𝑓(𝑥1 )𝑃1 (𝑥2 ) + 𝑓(𝑥2 )𝑃2 (𝑥2 ) + ⋯ + 𝑓(𝑥𝑛 )𝑃𝑛 (𝑥2 )
P_j(x) = ∏_{i=0, i≠j}^{n}(x − x_i) / ∏_{i=0, i≠j}^{n}(x_j − x_i) = ∏_{i=0, i≠j}^{n} (x − x_i)/(x_j − x_i)
Now, we have the MATLAB routine “lagranp()” which finds us the coefficients of
Lagrange polynomial together with each Lagrange coefficient polynomial 𝑃𝑗 (𝑥). In order
to understand this routine, you should know that MATLAB deals with polynomials as
their coefficient vectors arranged in descending order and the multiplication of two
polynomials corresponds to the convolution of the coefficient vectors
and to check if the graph of 𝑙3 (𝑥) really passes the four points. The results from
running this program are depicted in Fig.
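The routine itself is not reproduced here; the following is a minimal sketch in the same spirit (coefficient vectors in descending order, products by conv):
function L = lagran_sketch(x, y)
n = length(x);  L = zeros(1, n);            % coefficients of the interpolant
for j = 1:n
    Pj = 1;                                 % build P_j(x) = prod (x - x_i)/(x_j - x_i)
    for i = [1:j-1, j+1:n]
        Pj = conv(Pj, [1, -x(i)]) / (x(j) - x(i));
    end
    L = L + y(j) * Pj;                      % accumulate f(x_j) * P_j(x)
end
end
For instance, lagran_sketch([-2 -1 1 2], [-6 0 0 6]) returns [1 0 -1 0], i.e. the polynomial x³ − x that passes through those four points.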
P0(x) = (x − 2)(x − 3)/((1 − 2)(1 − 3)) = (1/2)(x² − 5x + 6)
P1(x) = (x − 1)(x − 3)/((2 − 1)(2 − 3)) = −x² + 4x − 3
P2(x) = (x − 1)(x − 2)/((3 − 1)(3 − 2)) = (1/2)(x² − 3x + 2)
f(x) ≈ (1/2)(x² − 5x + 6) + 2(−x² + 4x − 3) + 2[(1/2)(x² − 3x + 2)]
That is
f(x) ≈ −(1/2)x² + (5/2)x − 1,   and so   f(1.5) ≈ 1.625
We want to find a Lagrange polynomial that fit the tabulated data such that:
The next figure shows exactly how a Lagrange polynomial (𝑑𝑒𝑔𝑟𝑒𝑒 < 3)fit the given data.
Although Lagrange’s method is
conceptually simple, it does not lend itself to an efficient algorithm. Better
computational procedure is obtained with Newton’s method, The Newton Method
provides a means of developing an nth-order interpolation polynomial without requiring
the solution of a set of simultaneous linear equations. Where the interpolating
polynomial is written in the form
𝑛𝑁 (𝑥) = 𝑎0 + 𝑎1 (𝑥 − 𝑥0 ) + 𝑎2 (𝑥 − 𝑥0 )(𝑥 − 𝑥1 ) + ⋯
In order to derive a formula to find the successive coefficients {𝑎0 , 𝑎1 , . . . , 𝑎𝑁 } that make
this equation accommodate the data points, we will determine 𝑎0 and 𝑎1 so that
𝑛1 (𝑥) = 𝑛0 (𝑥) + 𝑎1 (𝑥 − 𝑥0 )
Matches the first two data points (𝑥0 , 𝑦0 ) and (𝑥1 , 𝑦1 ). We need to solve the two equations
𝑛1 (𝑥0 ) = 𝑎0 + 𝑎1 (𝑥0 − 𝑥0 ) = 𝑦0
𝑛1 (𝑥1 ) = 𝑎0 + 𝑎1 (𝑥1 − 𝑥0 ) = 𝑦1
To get
a0 = y0   and   a1 = (y1 − a0)/(x1 − x0) = (y1 − y0)/(x1 − x0) = df0
Starting from this first-degree Newton polynomial, we can proceed to the second degree
Newton polynomial: 𝑛2 (𝑥) = 𝑛1 (𝑥) + 𝑎2 (𝑥 − 𝑥0 )(𝑥 − 𝑥1 ) = 𝑎0 + 𝑎1 (𝑥 − 𝑥0 ) + 𝑎2 (𝑥 − 𝑥0 )(𝑥 − 𝑥1 )
Which, with the same coefficients 𝑎0 𝑎𝑛𝑑 𝑎1 , still matches the first two data points (𝑥0 , 𝑦0 )
and (𝑥1 , 𝑦1 ), since the additional (third) term is zero at (𝑥0 , 𝑦0 ) and (𝑥1 , 𝑦1 ). This is to say
that the additional polynomial term does not disturb the matching of previous existing
data. Therefore, given the additional matching condition for the third data point (𝑥2 , 𝑦2 ),
we only have to solve 𝑛2 (𝑥) = 𝑎0 + 𝑎1 (𝑥2 − 𝑥0 ) + 𝑎2 (𝑥2 − 𝑥0 )(𝑥2 − 𝑥1 )
for only one more coefficient 𝑎2 to get
a2 = (y2 − a0 − a1(x2 − x0))/((x2 − x0)(x2 − x1)) = (y2 − y0 − ((y1 − y0)/(x1 − x0))(x2 − x0))/((x2 − x0)(x2 − x1))
   = ((y2 − y1)/(x2 − x1) − (y1 − y0)/(x1 − x0))/(x2 − x0) = (df1 − df0)/(x2 − x0) = d²f0
Generalizing these results yields the formula to get the 𝑁 𝑡ℎ coefficient 𝑎𝑁 of the Newton
polynomial function as
𝑑𝑁−1 𝑓1 − 𝑑 𝑁−1 𝑓0
𝑎𝑁 = = 𝑑 𝑁 𝑓0
(𝑥𝑁 − 𝑥0 )
This is the divided difference, which can be obtained successively from the second row
of Table.
x_k   y_k   df_k                      d²f_k                       d³f_k
x0    y0    df0 = (y1 − y0)/(x1 − x0)   d²f0 = (df1 − df0)/(x2 − x0)   d³f0 = (d²f1 − d²f0)/(x3 − x0)
x1    y1    df1 = (y2 − y1)/(x2 − x1)   d²f1 = (df2 − df1)/(x3 − x1)
x2    y2    df2 = (y3 − y2)/(x3 − x2)
x3    y3
clear all, clc
x = [-2 -1 1 2]; y = [-6 0 0 6]; xx = [-2: 0.02 : 2];   % data points and evaluation grid
n = polyfit(x,y,3); yy = polyval(n,xx);                  % cubic through the four points (x^3 - x)
plot(x,y,'o',xx,yy)                                      % visual check of the interpolant
The advantage of the Newton Method is that it does not require the solution of 𝑛
simultaneous equations with 𝑛 unknowns.
Example: Suppose we are to find a Newton polynomial matching the following data
points {(−2, −6), (−1,0), (1,0), (2,6), (4,60)}. From these data points, we construct the
divided difference table as
x_k   y_k   df_k                      d²f_k                   d³f_k                    d⁴f_k
−2    −6    (0−(−6))/(−1−(−2)) = 6    (0−6)/(1−(−2)) = −2     (2−(−2))/(2−(−2)) = 1    (1−1)/(4−(−2)) = 0
−1     0    (0−0)/(1−(−1)) = 0        (6−0)/(2−(−1)) = 2      (7−2)/(4−(−1)) = 1
 1     0    (6−0)/(2−1) = 6           (27−6)/(4−1) = 7
 2     6    (60−6)/(4−2) = 27
 4    60
and then use this table to get the Newton polynomial as follows:
n(x) = −6 + 6(x + 2) − 2(x + 2)(x + 1) + 1·(x + 2)(x + 1)(x − 1) + 0·(x + 2)(x + 1)(x − 1)(x − 2) = x³ − x
We might begin with not necessarily the first data point but, say, the third one (1,0), and proceed in the same way to end up with the same result, n(x) = x³ − x.
𝑛(𝑥) = 𝑎0 + 𝑎1 (𝑥 − 𝑥0 ) + 𝑎2 (𝑥 − 𝑥0 )(𝑥 − 𝑥1 )
Where
a0 = y0,   a1 = df0 = (y1 − y0)/(x1 − x0),   a2 = d²f0 = (df1 − df0)/(x2 − x0),   with df1 = (y2 − y1)/(x2 − x1)
Or we can write
a0 = y0,   a1 = df0 = (y1 − y0)/(x1 − x0),   a2 = (y2 − y0 − ((y1 − y0)/(x1 − x0))(x2 − x0))/((x2 − x0)(x2 − x1))
Hence, the polynomial is given by: 𝑛(𝑥) = 1 + 7(𝑥 − 1) + 6(𝑥 − 1)(𝑥 − 2) = 6𝑥 2 − 11𝑥 + 6
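A compact MATLAB sketch of Newton's divided-difference construction and evaluation for the five-point data set above:
x = [-2 -1 1 2 4];  y = [-6 0 0 6 60];
n = length(x);  c = y;                       % c will end up holding a0, a1, ..., a_{n-1}
for j = 2:n
    for i = n:-1:j
        c(i) = (c(i) - c(i-1)) / (x(i) - x(i-j+1));   % divided differences in place
    end
end
xx = 3;  p = c(n);                           % evaluate n(x) by nested multiplication
for i = n-1:-1:1, p = p*(xx - x(i)) + c(i); end
disp(p)                                      % x^3 - x at x = 3 gives 24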
There are clearly 4 unknown constants and only two conditions are immediately
obvious, namely that 𝐹𝑖 (𝑥𝑖 ) = 𝑓(𝑥𝑖 ) and 𝐹𝑖 (𝑥𝑖+1 ) = 𝑓(𝑥𝑖+1 ). We are free to choose the two
remaining conditions as we like, to accomplish our desired objective of ‘‘smoothness.’’
The most effective approach is to match the first and second derivative
(𝑎𝑛𝑑 𝑡ℎ𝑢𝑠 𝑡ℎ𝑒 𝑠𝑙𝑜𝑝𝑒 𝑎𝑛𝑑 𝑐𝑢𝑟𝑣𝑎𝑡𝑢𝑟𝑒) of 𝐹𝑖 (𝑥) to those of the cubic 𝐹𝑖−1 (𝑥) used for
interpolation of adjacent interval (𝑥𝑖 ≤ 𝑥 ≤ 𝑥𝑖+1 ). If this procedure is carried out for all
intervals in the region 𝑥0 ≤ 𝑥 ≤ 𝑥𝑛 ( with special treatment at the end points as we will
discuss later ), then an approximating function for the region will have been
constructed. Consisting of the set of cubics 𝐹𝑖 (𝑥) (𝑖 = 0,1,2, … , 𝑛 − 1)
We denote this approximating function for the entire region as g(𝑥) and call it a cubic
spline.
To actually construct g(𝑥), it is convenient to note that due to the matching of second
derivatives of the cubics at each point 𝑥𝑖 , the second derivative of g(𝑥) is continuous
over the entire region 𝑥0 ≤ 𝑥 ≤ 𝑥𝑛 . This second derivative might appear as shown in fig.
Notice that the second derivative varies linearly over each interval ( the second
derivative of cubic is straight line ). Due to this linearity, the second derivative at any
point 𝑥, where 𝑥𝑖 ≤ 𝑥 ≤ 𝑥𝑖+1, is given by
Integrating this equation twice and applying the condition that g(𝑥𝑖 ) = 𝑓(𝑥𝑖 ) 𝑎𝑛𝑑
g(𝑥𝑖+1 ) = 𝑓(𝑥𝑖+1 ), we find that for 𝑥𝑖 ≤ 𝑥 ≤ 𝑥𝑖+1 .
Where ∆𝑥𝑖 = 𝑥𝑖+1 − 𝑥𝑖 , this equation provides the interpolating cubics over each interval
for 𝑖 = 0,1,2, … , 𝑛 − 1 . Since the second derivatives g ′′ (𝑥𝑖 ) 𝑖 = 0,1,2, … , 𝑛 are still
unknown, these must be evaluated before we can use this equation.
The second derivative can be found by using the derivative matching conditions:
Applying these conditions to equation g(𝑥) = 𝐹𝑖 (𝑥) for 𝑖 = 0,1,2, … , 𝑛 − 1 and collecting
terms yields a set of simultaneous equations of the form:
Whether the equations are of this form, there are 𝑛 − 1 equations in the 𝑛 + 1 unknowns
g ′′ (𝑥0 ), g ′′ (𝑥1 ), … , g ′′ (𝑥𝑛 ). The two necessary additional equations are obtained by
specifying conditions on g ′′ (𝑥0 ) and g ′′ (𝑥𝑛 ). It is usually simply specified that
g ′′ (𝑥0 ) = 0
g ′′ (𝑥𝑛 ) = 0
Or 5g ′′ (4) + g ′′ (6) = 4
Similarly, we find for 𝑖 = 2 that 0.66667g ′′ (4) + 3.33333g ′′ (6) + g ′′ (9) = −11.33333
And for 𝑖 = 3,
3g ′′ (6) + 8g ′′ (9) = −8
Since we wish to approximate𝑓(5), then we must use the cubic 𝐹1 (𝑥) which appropriate
for the interval 4 ≤ 𝑥 ≤ 6. Hence
⦁ Plot the data and visually verify that the extrapolated value makes sense.
⦁ Use a low-order polynomial based on nearest-neighbor data points. A linear or
quadratic interpolant, for example, would yield a reasonable estimate of 𝑦(14) for the
data in Fig.
Noting that many data are susceptible to some error, we don’t have to try to find a
function passing exactly through every point. Instead of pursuing the exact matching at
every data point, we look for an approximate function (not necessarily a polynomial)
that describes the data points as a whole with the smallest error in some sense, which
is called the curve fitting.
θ1 x1 + θ0 = y1,  θ1 x2 + θ0 = y2,  …,  θ1 xM + θ0 = yM,   or Aθ = y with
A = [x1 1; x2 1; ⋮ ⋮; xM 1],   θ = [θ1; θ0],   y = [y1; y2; ⋮; yM]
Noting that this apparently corresponds to the over determined case ie: (number of
equations more than the number of parameters). We resort to the least-squares (LS)
solution, which bases on the minimization of the objective function:
= 𝜽𝑇 𝐴𝑇 𝐴𝜽 − 𝜽𝑇 𝐴𝑇 𝒚 − 𝒚𝑇 𝐴𝜽 + 𝒚𝑇 𝒚
J is minimum if
∂J/∂θ = 0 ⇒ θ_estimate = [AᵀA]⁻¹ Aᵀ y,   i.e.   θ̂ = [θ̂1; θ̂0] = [AᵀA]⁻¹ Aᵀ y
Remarks: Here we have m equations with n unknowns; this is the overdetermined case. From the linear algebra point of view, the system of equations Ax = b has a solution if b ∈ range(A), in other words if b is a linear combination of the columns of A, because
Ax = b ⇔ [a1, a2, …, an][x1; x2; ⋮; xn] = b ⇔ b = a1 x1 + a2 x2 + ⋯ + an xn ⇒ b ∈ range(A)
Or equivalently rank[A ⋮ b] = rank[A]. If rank[A ⋮ b] = rank[A], then there is either a unique solution or infinitely many solutions.
Least square is concerned with the case where no exact solution exists.
Question: Since we cannot find 𝒙 satisfying 𝑨𝒙 = 𝒃, can we find 𝒙 such that 𝑨𝒙 is as
close as possible to 𝒃 and the equation rewriting as 𝑨𝒙 + 𝑬 = 𝒃 we want 𝒙 that minimize
the distance between 𝑨𝒙 and 𝒃 ⇒ smallest 𝑬 using orthogonal principle then the optimal 𝒙
is the one such that 𝑬 ⊥ 𝑨𝒙 ⇒ 𝑬𝑻 𝑨𝒙 = 𝟎 or [𝒃 − 𝑨𝒙]𝑇 𝑨𝒙 = 𝟎
Now one can write that if 𝑖𝑛𝑣(𝑨𝑻 𝑨) exist then 𝒙 = [𝑨𝑻 𝑨]−𝟏 𝑨𝑻 𝒃. Note that 𝑨 is 𝑚 × 𝑛 matrix
⇒ there is no invers
𝒙∗ = (𝑨𝑻 𝑨)−𝟏 𝑨𝑻 𝒃
Example: We have a system of single input single output given by a linear equation
𝑦 = 𝑚𝑡 + 𝑐
𝑡: is the input of the system and 𝑦: is the output of the system
Solution we can rewrite the above table into a system of linear equation as:
2m + c = 3,  3m + c = 4,  4m + c = 15,   or   [2 1; 3 1; 4 1][m; c] = [3; 4; 15]
It is of the form 𝐴𝑥 = 𝑏
where A = [2 1; 3 1; 4 1],  b = [3; 4; 15],  and  x = [m; c]
We want to get the optimal value of 𝑥 for best fitting 𝑟𝑎𝑛𝑘(𝐴) < 𝑟𝑎𝑛𝑘[𝐴 ⋮ 𝑏] the system
has no exact solution, but the best approximate one is:
𝒙∗ = [𝑨𝑻 𝑨]−𝟏 𝑨𝑻 𝒃
x* = [m*; c*] = [AᵀA]⁻¹ Aᵀ b = [29 9; 9 3]⁻¹ [2 3 4; 1 1 1] [3; 4; 15] = [6; −32/3] ⟹ y = 6t − 32/3
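The same fit is obtained with MATLAB's backslash operator (a sketch using the data of this example):
t = [2; 3; 4];  y = [3; 4; 15];
A = [t, ones(size(t))];
theta = A\y                  % returns [6; -10.667], i.e. y = 6t - 32/3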
Method 2 Fitting a straight line 𝑓(𝑥) = 𝑎 + 𝑏𝑥 to data is also known as linear regression.
In this case the function to be minimized is
J = Σ_{i=1}^{n} (y_i − a − b x_i)²
where x̄ = (1/n)Σx_i and ȳ = (1/n)Σy_i are the mean values of the x and y data. The solution for the parameters is
a = (ȳ Σx_i² − x̄ Σx_i y_i)/(Σx_i² − n x̄²),   b = (Σx_i y_i − n x̄ ȳ)/(Σx_i² − n x̄²)
These expressions are susceptible to round off errors (the two terms in each numerator
as well as in each denominator can be roughly equal). It is better to compute the
parameters from
b = Σ y_i(x_i − x̄) / Σ x_i(x_i − x̄)   and   a = ȳ − b x̄
Example: Fit a straight line to the data shown and compute the standard deviation.
σ = √(J/(n − m)) = √(0.06936/(5 − 2)) = 0.1520
J = Σ_{i=1}^{n} (y_i − Σ_{j=1}^{m} a_j f_j(x_i))²
Thus
∂J/∂a_k = 0 ⇒ −2 {Σ_{i=1}^{n} (y_i − Σ_{j=1}^{m} a_j f_j(x_i)) f_k(x_i)} = 0,   k = 1, 2, …, m
Dropping the constant (−2) and interchanging the order of summation, we get
Σ_{j=1}^{m} (Σ_{i=1}^{n} f_j(x_i) f_k(x_i)) a_j = Σ_{i=1}^{n} f_k(x_i) y_i,   k = 1, 2, …, m
Equations 𝐀𝜽 = 𝐲, known as the normal equations of the least-squares fit, can be solved
with many methods. Note that the coefficient matrix is symmetric, i.e., 𝐀 𝒌𝒋 = 𝐀 𝒋𝒌 . As
what we have seen before 𝐀𝜽 = 𝐲 ⇒ 𝜽𝒆𝒔𝒕𝒊𝒎𝒂𝒕𝒆 = [𝑨𝑻 𝑨]−𝟏 𝑨𝑻 𝒚
Example: Find the least squares solution of the following linear system:
x1 + x2 + x3 = 4,  −x1 + x2 + x3 = 0,  −x2 + x3 = 1,  x1 + x3 = 2
⟺ Ax = B ⟺ [1 1 1; −1 1 1; 0 −1 1; 1 0 1][x1; x2; x3] = [4; 0; 1; 2]
so A has rank 3. Since the augmented matrix [A ⋮ B] has rank 4, the linear system is inconsistent. There is a unique least squares solution in this case. To find it, first compute
AᵀA = [3 0 1; 0 3 1; 1 1 4]   and   [AᵀA]⁻¹ = (1/30)[11 1 −3; 1 11 −3; −3 −3 9]
so that x̂ = [AᵀA]⁻¹AᵀB = (1/30)[48; 18; 36] = [1.6; 0.6; 1.2]
Example: Find the parameters 𝐴 and 𝐵 of the given function to fit the tabulated data
with minimum error using least square
g(x) = A + B sin(πx/10),  and let f(x) = g(x) + noise
where:
𝑥 0 1.0 1.5 2.3 2.5 4.0 5.1 6.0 6.5 7.0 8.1 9
𝑓(𝑥) 0.2 0.8 2.5 2.5 3.5 4.3 3.0 5.0 3.5 2.4 1.3 2.0
𝑥 9.3 11.0 11.3 12.1 13.1 14.1 15.5 16.0 17.5 17.8 19.0 20.0
𝑓(𝑥) −0.3 −1.3 −3.0 −4.0 −4.9 −4.0 −5.2 −3.0 −3.5 −1.6 −1.4 −0.1
E = Σ_{i=1}^{n} [A + B sin(πx_i/10) − f(x_i)]²
The parameters which can be varied to minimize E are A and B. We obtain two equations in A and B by setting ∂E/∂A = 0 and ∂E/∂B = 0:
∂E/∂A = 2 Σ_{i=1}^{n} [A + B sin(πx_i/10) − f(x_i)] = 0
∂E/∂B = 2 Σ_{i=1}^{n} [A + B sin(πx_i/10) − f(x_i)] sin(πx_i/10) = 0
Example: Find the parameters 𝛼, 𝛽 and 𝛾 of the given function to fit the tabulated data
with minimum error using least square
% assumed synthetic data; theta is the least-squares estimate on [cos(x) sin(x) x.^2]
x = (0:0.2:10)'; r = 0.3*randn(size(x));
y = 2*cos(x) - sin(x) + 0.05*x.^2 + r;        % measurements (assumed true parameters)
Phi = [cos(x) sin(x) x.^2];
theta = (Phi'*Phi)\(Phi'*y);                   % estimate of [alpha; beta; gamma]
yy = theta(1)*cos(x)+theta(2)*sin(x)+theta(3)*x.^2;
plot(x,yy,'linewidth',3), hold on
plot(x,y,'linewidth',3), grid on
Problem: assume that you have a tabulated data of some function but this data is not
arranged orderly along the x-axis, how to rearrange them using MATLAB?
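One simple answer, sketched here with x and y denoting the tabulated data vectors, uses sort():
[xs, idx] = sort(x);     % ascending x
ys = y(idx);             % reorder y consistently with x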
A_kj = Σ_{i=1}^{n} x_i^{j+k−2},   y_k = Σ_{i=1}^{n} x_i^{k−1} y_i
or in matrix form we can write Aθ = y with
A = [n, Σx_i, Σx_i², …, Σx_i^m;
     Σx_i, Σx_i², Σx_i³, …, Σx_i^{m+1};
     ⋮
     Σx_i^m, Σx_i^{m+1}, …, Σx_i^{2m}]
y = [Σy_i; Σx_i y_i; ⋮; Σx_i^m y_i]
Where ∑ stands for ∑𝑛𝑖=1 .The normal equations become progressively ill-conditioned
with increasing m. Fortunately, this is of little practical consequence, because only low-
order polynomials are useful in curve fitting. Polynomials of high order are not
recommended, because they tend to reproduce the noise inherent in the data.
x -1 0 1 2
y 0 1 3 9
𝑦 = 𝑎 + 𝑏𝑥 + 𝑐𝑥 2
𝑎 − 𝑏 + 𝑐=0
𝑎 =1
{
𝑎 + 𝑏 + 𝑐=3
𝑎 + 2𝑏 + 4𝑐 = 9
1 −1 1 0
1 0 0 1
𝑨=[ ] 𝑩=[ ]
1 1 1 3
1 2 4 9
and A has rank 3. We find that
AᵀA = [4 2 6; 2 6 8; 6 8 18]   and   [AᵀA]⁻¹ = (1/80)[44 12 −20; 12 36 −20; −20 −20 20]
θ = [AᵀA]⁻¹ Aᵀ y = (1/20)[11; 33; 25]
that is, a = 11/20, b = 33/20, c = 5/4. Hence the quadratic function that best fits the
data is
y = 11/20 + (33/20) x + (5/4) x²
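The same quadratic fit is obtained with polyfit (a sketch; coefficients are returned in descending powers):
x = [-1 0 1 2];  y = [0 1 3 9];
p = polyfit(x, y, 2)        % returns [1.25 1.65 0.55] = [5/4, 33/20, 11/20]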
For a straight-line fit the normal equations read
[n, Σx_i; Σx_i, Σx_i²][a0; a1] = [Σf(x_i); Σx_i f(x_i)]
and with the tabulated data (n = 5):
[a0; a1] = [5, 39.69; 39.69, 392.3201]⁻¹ [26.16; 238.7416] = [2.038392; 0.4023190]
For a second-order fit the normal equations are
[n, Σx_i, Σx_i²; Σx_i, Σx_i², Σx_i³; Σx_i², Σx_i³, Σx_i⁴][a0; a1; a2] = [Σf(x_i); Σx_i f(x_i); Σx_i² f(x_i)]
Before we go on... what answers do you expect for the coefficients after looking at the data?
Σ_{i=1}^{5} x_i⁴ = 61.1875
The set of equations is [6, 7.5, 13.75; 7.5, 13.75, 28.125; 13.75, 28.125, 61.1875][a0; a1; a2] = [13.75; 28.125; 61.1875], whose solution is a0 = 0, a1 = 0, a2 = 1 ⟹ g(x) = x². This fits the data exactly; that is, g(x) = f(x), since f(x) = x².
[a0; a1; a2] = [6, 7.5, 13.75; 7.5, 13.75, 28.125; 13.75, 28.125, 61.1875]⁻¹ [15.1093; 32.2834; 71.276] = [−0.1812; −0.3221; 1.3537]
So our fitted second-order function is: g(x) = −0.1812 − 0.3221x + 1.3537x²
∂J/∂θ = 0 ⇒ ∂J/∂a = Σ_{i=1}^{n} −2 w_i²(y_i − a − b x_i) = 0   and   ∂J/∂b = Σ_{i=1}^{n} −2 w_i²(y_i − a − b x_i) x_i = 0
Or
Σ w_i² y_i = a Σ w_i² + b Σ w_i² x_i
Σ w_i² x_i y_i = a Σ w_i² x_i + b Σ w_i² x_i²
Dividing the first equation by Σ w_i² and introducing the weighted averages
x̂ = Σ w_i² x_i / Σ w_i²,   ŷ = Σ w_i² y_i / Σ w_i²
we obtain
𝑎 = 𝑦̂ − 𝑏𝑥̂
solving for 𝑏 yields after some algebra
∑ 𝑤𝑖 2 𝑦𝑖 (𝑥𝑖 − 𝑥̅ )
𝑏=
∑ 𝑤𝑖 2 𝑥𝑖 (𝑥𝑖 − 𝑥̅ )
A special application of weighted linear regression arises in fitting exponential
functions to data. Consider as an example the fitting function
𝑓(𝑥) = 𝑎𝑒 𝑏𝑥
Normally, the least-squares fit would lead to equations that are nonlinear in a and b.
But if we fit ln 𝑦 rather than 𝑦, the problem is transformed to linear regression: fit the
function
𝐹(𝑥) = ln 𝑓(𝑥) = ln 𝑎 + 𝑏𝑥
𝑅ᵢ = ln 𝑦ᵢ − 𝐹(𝑥ᵢ) = ln 𝑦ᵢ − ln 𝑎 − 𝑏𝑥ᵢ
If the residuals 𝑟ᵢ = 𝑦ᵢ − 𝑓(𝑥ᵢ) are sufficiently small (𝑟ᵢ ≪ 𝑦ᵢ), we can use the approximation
ln(1 − 𝑟ᵢ/𝑦ᵢ) ≈ −𝑟ᵢ/𝑦ᵢ
so that
𝑅ᵢ ≈ 𝑟ᵢ/𝑦ᵢ
We can now see that by minimizing ∑𝑅ᵢ², we inadvertently introduced the weights 1/𝑦ᵢ.
This effect can be negated if we apply the weights 𝑦ᵢ when fitting 𝐹(𝑥) to (𝑥ᵢ, ln 𝑦ᵢ); that is, by minimizing
𝑱 = ∑_{i=1}^{n} 𝑦ᵢ²𝑅ᵢ²
Example: Determine the parameters a and b so that 𝑓(𝑥) = 𝑎𝑒 𝑏𝑥 fits the following data
in the least-squares sense.
Use two different methods: (1) fit ln 𝑦𝑖 ; and (2) fit ln 𝑦𝑖 with weights 𝑤𝑖 = 𝑦𝑖 . Compute
the standard deviation in each case.
Solution of Part (1): The problem is to fit the straight line ln(𝑎𝑒^{𝑏𝑥}) = ln 𝑎 + 𝑏𝑥 to the log-transformed data. The linear fit gives ln 𝑎 and 𝑏 = 0.5366; therefore 𝑎 = 𝑒^{ln 𝑎} = 3.790 and the fitting function becomes 𝑓(𝑥) = 3.790𝑒^{0.5366𝑥}. The plots of 𝑓(𝑥) and the data points are shown in the figure.
𝑱(𝑎, 𝑏) = ∑_{i=1}^{n}(𝑦ᵢ − 𝑓(𝑥ᵢ))² = 17.59,   𝜎 = √(𝑱/(𝑛 − 𝑚)) = 2.10
As pointed out before, this is an approximate solution of the stated problem, since we
did not fit 𝑦𝑖 , but ln 𝑦𝑖 . Judging by the plot, the fit seems to be good.
Solution of Part (2): We again fit ln(𝑎𝑒^{𝑏𝑥}) = ln 𝑎 + 𝑏𝑥 to 𝑧 = ln 𝑦, but this time the weights 𝑤ᵢ = 𝑦ᵢ are used. The weighted averages of the data are (recall that we fit 𝑧 = ln 𝑦)
𝑥̂ = ∑𝑦ᵢ²𝑥ᵢ/∑𝑦ᵢ² = 7.474,   𝑧̂ = ∑𝑦ᵢ²𝑧ᵢ/∑𝑦ᵢ² = 5.353
This yields
𝑏 = ∑𝑦ᵢ²𝑧ᵢ(𝑥ᵢ − 𝑥̂)/∑𝑦ᵢ²𝑥ᵢ(𝑥ᵢ − 𝑥̂) = 0.5440
ln 𝑎 = 𝑧̂ − 𝑏𝑥̂ = 1.287
ln𝑎
Therefore 𝑎 = 𝑒 = 3. 622 so that the fitting function is 𝑓(𝑥) = 3.622𝑒 0.5440𝑥 . As expected,
this result is somewhat different from that obtained in Part (1).
𝑱(𝑎, 𝑏) = ∑_{i=1}^{n}(𝑦ᵢ − 𝑓(𝑥ᵢ))² = 4.186,   𝜎 = √(𝑱/(𝑛 − 𝑚)) = 1.023
Observe that the residuals and standard deviation are smaller than in Part (1),
indicating a better fit, as expected.
CHAPTER III:
Numerical Differentiation and Integration
Numerical differentiation deals with the following problem: we are given
the function y = f (x) and wish to obtain one of its derivatives at the point x = xk.
Numerical differentiation is not a particularly accurate process. It suffers from a conflict
between round off errors (due to limited machine precision) and errors inherent in
interpolation. For this reason, a derivative of a function can never be computed with the
same precision as the function itself.
𝑓(𝑥 − ℎ) = 𝑓(𝑥) − ℎ(𝑑𝑓/𝑑𝑥) + (ℎ²/2!)(𝑑²𝑓/𝑑𝑥²) − (ℎ³/3!)(𝑑³𝑓/𝑑𝑥³) + (ℎ⁴/4!)(𝑑⁴𝑓/𝑑𝑥⁴) − ⋯ = ∑_{k=0}^{∞}(−1)ᵏ(ℎᵏ/𝑘!)(𝑑ᵏ𝑓/𝑑𝑥ᵏ)
𝑑²𝑓/𝑑𝑥² = Δ²𝑓(𝑥)/ℎ² + 𝒪(ℎ) = (1/ℎ²){𝑓(𝑥 + 2ℎ) − 2𝑓(𝑥 + ℎ) + 𝑓(𝑥)} + 𝒪(ℎ)
𝑑²𝑓/𝑑𝑥² = ∇²𝑓(𝑥)/ℎ² + 𝒪(ℎ) = (1/ℎ²){𝑓(𝑥) − 2𝑓(𝑥 − ℎ) + 𝑓(𝑥 − 2ℎ)} + 𝒪(ℎ)
These expressions are called non-central forward and non-central backward finite
difference approximations.
We can derive the approximations for higher derivatives in the same manner.
𝑑ⁿ𝑓/𝑑𝑥ⁿ = Δⁿ𝑓(𝑥)/ℎⁿ + 𝒪(ℎ)   and   𝑑ⁿ𝑓/𝑑𝑥ⁿ = ∇ⁿ𝑓(𝑥)/ℎⁿ + 𝒪(ℎ)
with the recurrence formulas Δⁿ𝑓(𝑥) = Δ(Δⁿ⁻¹𝑓(𝑥)) and ∇ⁿ𝑓(𝑥) = ∇(∇ⁿ⁻¹𝑓(𝑥))
For derivatives up to the fourth order the forward and backward difference expressions
of 𝒪(ℎ) are tabulated below
Coefficients of forward difference approximations 𝓞(𝒉) Coefficients of backward difference approximations 𝓞(𝒉)
The difference representation for derivatives we have thus far obtained are of 𝒪(ℎ). More
accurate expressions may be found by simply taking more terms in the Taylor series
expansion. Consider the following example
Make the substitution 𝑑²𝑓/𝑑𝑥² = Δ²𝑓(𝑥)/ℎ² + 𝒪(ℎ) into the first-derivative Taylor formula; we obtain
𝑑𝑓/𝑑𝑥 = [𝑓(𝑥 + ℎ) − 𝑓(𝑥)]/ℎ − (ℎ/2)[(𝑓(𝑥 + 2ℎ) − 2𝑓(𝑥 + ℎ) + 𝑓(𝑥))/ℎ² − ℎ𝑓⁽³⁾(𝑥) + ⋯] − (ℎ²/6)(𝑑³𝑓/𝑑𝑥³) − ⋯
Collecting terms we get
𝑑𝑓/𝑑𝑥 = (1/2ℎ)(−𝑓(𝑥 + 2ℎ) + 4𝑓(𝑥 + ℎ) − 3𝑓(𝑥)) + 𝒪(ℎ²)
We have thus found a difference representation of the first derivative which is accurate to 𝒪(ℎ²). Higher derivatives of the same accuracy follow in the same way; for the first derivative the coefficients are
2ℎ𝑓′(𝑥ᵢ) = −3𝑓ₖ + 4𝑓ₖ₊₁ − 𝑓ₖ₊₂ (forward)       2ℎ𝑓′(𝑥ᵢ) = 𝑓ₖ₋₂ − 4𝑓ₖ₋₁ + 3𝑓ₖ (backward)
Coefficients of forward and backward difference approximations of 𝒪(ℎ²)
The average of forward and backward difference methods when the data points are
equally spaced is called central difference method, this method gives a truncation error
of second order which provides more accuracy in approximation of the first derivative.
Central difference method is more accurate than a forward and backward, the
reasoning is just that a central difference scheme is designed to remove more terms in
the Taylor Series expansion compared to the forward and backward, in turn reducing
truncation error.
Let us take the average of 𝑑𝑓/𝑑𝑥 = Δ𝑓(𝑥)/ℎ + 𝒪(ℎ) and 𝑑𝑓/𝑑𝑥 = ∇𝑓(𝑥)/ℎ + 𝒪(ℎ). We get
𝑑𝑓/𝑑𝑥 = [Δ𝑓(𝑥) + ∇𝑓(𝑥)]/2ℎ + 𝒪(ℎ²) = [𝑓(𝑥 + ℎ) − 𝑓(𝑥 − ℎ)]/2ℎ + 𝒪(ℎ²)
This is the first central difference approximation for 𝑓 ′ (𝑥). The term 𝒪(ℎ2 ) reminds us
that the truncation error behaves as ℎ2 . Notice that
𝑓(𝑥 + ℎ) + 𝑓(𝑥 − ℎ) = 2𝑓(𝑥) + ℎ²𝑓⁽²⁾(𝑥) + (ℎ⁴/12)𝑓⁽⁴⁾(𝑥) + ⋯
so that
𝑑²𝑓/𝑑𝑥² = [𝑓(𝑥 + ℎ) − 2𝑓(𝑥) + 𝑓(𝑥 − ℎ)]/ℎ² + 𝒪(ℎ²)
Central difference approximations for other derivatives can be obtained in a similar
manner.
𝑑ⁿ𝑓/𝑑𝑥ⁿ = [∇ⁿ𝑓(𝑥 + (𝑛/2)ℎ) + Δⁿ𝑓(𝑥 − (𝑛/2)ℎ)]/(2ℎⁿ) + 𝒪(ℎ²)   when 𝑛 = 2𝑘 is even
𝑑ⁿ𝑓/𝑑𝑥ⁿ = [∇ⁿ𝑓(𝑥 + ((𝑛−1)/2)ℎ) + Δⁿ𝑓(𝑥 − ((𝑛−1)/2)ℎ)]/(2ℎⁿ) + 𝒪(ℎ²)   when 𝑛 = 2𝑘 + 1 is odd
The higher central-difference formulas of accuracy 𝒪(ℎ⁴) are obtained in the same way and collected in coefficient tables. A higher-order forward representation can also be derived directly from the Taylor series: starting from
𝑓(𝑥 + ℎ) = 𝑓(𝑥) + ℎ(𝑑𝑓/𝑑𝑥) + (ℎ²/2!)(𝑑²𝑓/𝑑𝑥²) + (ℎ³/3!)(𝑑³𝑓/𝑑𝑥³) + (ℎ⁴/4!)(𝑑⁴𝑓/𝑑𝑥⁴) + ⋯
we substitute 𝑑²𝑓/𝑑𝑥² = [𝑓(𝑥 + 2ℎ) − 2𝑓(𝑥 + ℎ) + 𝑓(𝑥)]/ℎ² + 𝒪(ℎ) and the very well-known representation
𝑑³𝑓/𝑑𝑥³ = (1/ℎ³){𝑓(𝑥 + 3ℎ) − 3𝑓(𝑥 + 2ℎ) + 3𝑓(𝑥 + ℎ) − 𝑓(𝑥)} + 𝒪(ℎ)
into the above expression. Therefore, after collecting terms,
𝑑𝑓/𝑑𝑥 = [2𝑓ₖ₊₃ − 9𝑓ₖ₊₂ + 18𝑓ₖ₊₁ − 11𝑓ₖ]/(6ℎ) + 𝒪(ℎ³)
Example: Find the fifth backward difference representation which is of 𝒪(ℎ)
From the recurrence scheme for differences, the fifth difference can be expressed as
∇5 𝑓𝑘 = ∇(∇4 𝑓𝑘 ) = 𝑓𝑘 − 4𝑓𝑘−1 + 6𝑓𝑘−2 − 4𝑓𝑘−3 + 𝑓𝑘−4 − (𝑓𝑘−1 − 4𝑓𝑘−2 + 6𝑓𝑘−3 − 4𝑓𝑘−4 + 𝑓𝑘−5 )
= 𝑓𝑘 − 5𝑓𝑘−1 + 10𝑓𝑘−2 − 10𝑓𝑘−3 + 5𝑓𝑘−4 − 𝑓𝑘−5
and
𝑑⁵𝑓/𝑑𝑥⁵ = ∇⁵𝑓ₖ/ℎ⁵ + 𝒪(ℎ)
Example: Find the central difference representation of 𝒪(ℎ2 ) for the fifth derivative.
Solution:
𝑑ⁿ𝑓/𝑑𝑥ⁿ = [∇ⁿ𝑓(𝑥 + ((𝑛−1)/2)ℎ) + Δⁿ𝑓(𝑥 − ((𝑛−1)/2)ℎ)]/(2ℎⁿ) + 𝒪(ℎ²)   when 𝑛 = 2𝑘 + 1 is odd
We want to find 𝑓 ′ (1.5) to 𝒪(0.5)2 , which means a three point representation will be
needed. A three point backward representation would require 𝑓(1.5), 𝑓(1.0) & 𝑓(0.5) the
value of 𝑓(0.5) would obviously influence the answer in such way as to give an incorrect
result, and use of 𝑓(1.0) would also appear undesirable since there is apparently a
drastic change in the behavior of 𝑓(𝑥) occurring very close to 𝑥 = 1.0 and the behavior at
𝑥 = 1.0 could be quite uncertain. We must therefore reject the backward difference
representation. A central difference of 𝑓 ′ (1.5) would involve 𝑓(1.0) which is undesirable
for the reason discussed before. A forward difference representation would involve the
values 𝑓(1.5), 𝑓(2.0) & 𝑓(2.5) . Over this region the function is smooth and well behaved,
so we use this representation.
𝑓′(1.5) = [−𝑓(2.5) + 4𝑓(2.0) − 3𝑓(1.5)]/(2(0.5)) + 𝒪(0.5)² = 0.22 + 𝒪(0.5)²
Example: Consider the function 𝑓(𝑥) = sin(10𝜋𝑥) find 𝑓 ′ (0.5) using the central difference
representation of 𝒪(ℎ2 ) with ℎ = 0.2, ℎ = 0.1 and ℎ = 0.005. Compare these results with
each other and with the exact analytical answer. Discuss the implications of these
results.
Solution:
at ℎ = 0.2:    𝑓′(0.5) = [sin(10𝜋(0.5 + 0.2)) − sin(10𝜋(0.5 − 0.2))]/(2(0.2)) + 𝒪(0.2)² ≅ −1.2246e−15
at ℎ = 0.1:    𝑓′(0.5) = [sin(10𝜋(0.5 + 0.1)) − sin(10𝜋(0.5 − 0.1))]/(2(0.1)) + 𝒪(0.1)² ≅ 1.2246e−15
at ℎ = 0.005:  𝑓′(0.5) = [sin(10𝜋(0.5 + 0.005)) − sin(10𝜋(0.5 − 0.005))]/(2(0.005)) + 𝒪(0.005)² ≅ −31.287
The exact analytical answer is 𝑓 ′ (0.5) = 10𝜋 cos(10𝜋(0.5)) = −31.416. In order to obtain a
meaningful answer for a finite difference representation of 𝑓 ′ (0.5) it would be necessary
to use much smaller ℎ.
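A few MATLAB lines reproduce these numbers and show how the accuracy improves as h shrinks (a verification sketch, not part of the worked example):
f  = @(x) sin(10*pi*x);
df = @(x,h) (f(x+h) - f(x-h))./(2*h);     % first central difference, O(h^2)
h  = [0.2 0.1 0.005 0.001];
approx = df(0.5, h)                        % approx -> [~0  ~0  -31.287  -31.41]
exact  = 10*pi*cos(10*pi*0.5)              % exact value -31.416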
𝑥 0 1 2 3 4 5
𝑓(𝑥) 1.00 0.50 8.00 35.50 95.00 198.50
𝑑³𝑓/𝑑𝑥³ = Δ³𝑓(𝑥)/ℎ³ + 𝒪(ℎ) = 12.0 + 𝒪(ℎ) ⟹ 𝑓(𝑥) = 𝑎₃𝑥³ + 𝑎₂𝑥² + 𝑎₁𝑥 + 𝑎₀
To determine the coefficients let us use the least-squares (pseudo-inverse) solution of the Vandermonde system
[1 𝑥ᵢ 𝑥ᵢ² 𝑥ᵢ³]_{i=1..6} [𝑎₀; 𝑎₁; 𝑎₂; 𝑎₃] = [𝑓(𝑥₁); … ; 𝑓(𝑥₆)]   ⟹   [𝑎₀; 𝑎₁; 𝑎₂; 𝑎₃] = ([1 𝑥ᵢ 𝑥ᵢ² 𝑥ᵢ³])⁺ [𝑓(𝑥₁); … ; 𝑓(𝑥₆)]
Here, the sample points are equally spaced for the midpoint rule, the trapezoidal rule,
and Simpson’s rule.
Fig: zero order approximation first order approximation second order approximation
Figure above shows the integrations over two segments by the midpoint rule, the
trapezoidal rule, and Simpson’s rule, which are referred to as Newton–Cotes formulas
for being based on the approximate polynomial and are implemented by the following
formulas.
⟨midpoint rule⟩:    ∫_{𝑥ₖ}^{𝑥ₖ₊₁} 𝑓(𝑥)𝑑𝑥 ≅ ℎ𝑓((𝑥ₖ + 𝑥ₖ₊₁)/2),   ℎ = 𝑥ₖ₊₁ − 𝑥ₖ
⟨trapezoidal rule⟩: ∫_{𝑥ₖ}^{𝑥ₖ₊₁} 𝑓(𝑥)𝑑𝑥 ≅ (ℎ/2){𝑓(𝑥ₖ) + 𝑓(𝑥ₖ₊₁)},   ℎ = 𝑥ₖ₊₁ − 𝑥ₖ
⟨Simpson's rule⟩:   ∫_{𝑥ₖ₋₁}^{𝑥ₖ₊₁} 𝑓(𝑥)𝑑𝑥 ≅ (ℎ/3){𝑓(𝑥ₖ₋₁) + 4𝑓(𝑥ₖ) + 𝑓(𝑥ₖ₊₁)},   ℎ = (𝑥ₖ₊₁ − 𝑥ₖ₋₁)/2
These three integration rules are based on approximating the target function
(integrand) to the zeroth, first and second-degree polynomial, respectively. Since the
first two integrations are obvious, we are going to derive just Simpson’s rule
As an interpolation problem let we try to find second-degree polynomial which will fit
the data (−ℎ, 𝑓𝑘−1 ), (0, 𝑓𝑘 ) 𝑎𝑛𝑑 (ℎ, 𝑓𝑘+1 )
Trapezoidal rule: In order to get the formulas for numerical integration of a function
𝑓(𝑥) over some interval [𝑎, 𝑏], we divide the interval into 𝑁 segments of equal length
ℎ = (𝑏 − 𝑎)/𝑁 so that the nodes (sample points) can be expressed as {𝑥 = 𝑎 + 𝑘ℎ, 𝑘 =
0, 1, 2, … , 𝑁}. Then we have the numerical integration of 𝑓 (𝑥) over [𝑎, 𝑏] by the
trapezoidal rule as
𝑇(𝑓, ℎ) = ∫ₐᵇ 𝑓(𝑥)𝑑𝑥 = ∑_{k=0}^{N−1} ∫_{𝑥ₖ}^{𝑥ₖ₊₁} 𝑓(𝑥)𝑑𝑥 ≅ (ℎ/2)((𝑓₀ + 𝑓₁) + (𝑓₁ + 𝑓₂) + ⋯ + (𝑓_{N−2} + 𝑓_{N−1}) + (𝑓_{N−1} + 𝑓_N))
        = (ℎ/2)(𝑓₀ + 𝑓_N + 2∑_{k=1}^{N−1} 𝑓ₖ) = ℎ((𝑓(𝑎) + 𝑓(𝑏))/2 + ∑_{k=1}^{N−1} 𝑓ₖ)
𝑏
Example: write a MATLAB code to compute the integral ∫𝑎 sin(𝑥 2 ) 𝑑𝑥 by trapezoidal rule
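A possible solution sketch (the limits a, b and the number of segments N are assumed values):
a = 0; b = 2; N = 1000;                            % assumed limits and segment count
h = (b-a)/N;  x = a:h:b;  f = sin(x.^2);
T = h*((f(1)+f(end))/2 + sum(f(2:end-1)))          % composite trapezoidal rule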
Simpson’s rule: On the other hand, we have the numerical integration of 𝑓(𝑥) over [𝑎, 𝑏]
by Simpson’s rule with an even number of segments 𝑁 = 2𝑚 as
𝑆(𝑓, ℎ) = ∫ₐᵇ 𝑓(𝑥)𝑑𝑥 ≅ (ℎ/3){𝑓(𝑎) + 𝑓(𝑏) + 2∑_{k=1}^{m−1} 𝑓₂ₖ + 4∑_{k=1}^{m} 𝑓₂ₖ₋₁}
𝑏
Example: write a MATLAB code to compute the integral ∫𝑎 √1 + 𝑥 2 𝑑𝑥 by Simpson’s rule
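A possible solution sketch (a, b and the even number of segments N are assumed):
a = 0; b = 1; N = 1000;                            % N must be even
h = (b-a)/N;  x = a:h:b;  f = sqrt(1 + x.^2);
S = h/3*(f(1) + f(end) + 4*sum(f(2:2:end-1)) + 2*sum(f(3:2:end-2)))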
Remark: There is an important relationship between the above methods when ℎ ⟶ 𝜀
𝑆(𝑓, ℎ) ≅ (𝑇(𝑓, ℎ) + 2𝑀(𝑓, ℎ))/3. Where: 𝑆(𝑓, ℎ) is a Simpson’s rule, 𝑇(𝑓, ℎ) is the
trapezoidal rule, and 𝑀(𝑓, ℎ) is the midpoint rule.
𝑆(𝑓, ℎ) = (ℎ/3){𝑓(𝑎) + 𝑓(𝑏) + 2∑_{k=1}^{m−1} 𝑓₂ₖ + 4∑_{k=1}^{m} 𝑓₂ₖ₋₁}
        = (2ℎ/3){(𝑓(𝑎) + 𝑓(𝑏))/2 + ∑_{k=1}^{m−1} 𝑓₂ₖ} + (2/3)(2ℎ)∑_{k=1}^{m} 𝑓₂ₖ₋₁
        = (𝑇(𝑓, 2ℎ) + 2𝑀(𝑓, 2ℎ))/3
since 𝑇(𝑓, 2ℎ) is exactly the trapezoidal sum on the coarse grid 𝑥₀, 𝑥₂, … , 𝑥₂ₘ and 𝑀(𝑓, 2ℎ) = 2ℎ∑_{k=1}^{m} 𝑓₂ₖ₋₁ is the midpoint sum on the same coarse grid (the odd-indexed nodes are its midpoints).
𝑏
Example: write a MATLAB code to compute the integral ∫𝑎 √1 + 𝑥 2 𝑑𝑥 by midpoint rule.
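A possible solution sketch (a, b and N are assumed):
a = 0; b = 1; N = 1000;
h = (b-a)/N;  xm = a+h/2 : h : b-h/2;              % midpoints of the N segments
M = h*sum(sqrt(1 + xm.^2))                         % composite midpoint rule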
where ℓₖ(𝑥) are the cardinal functions defined in the last chapter. Therefore, an approximation to the integral ∫ₐᵇ 𝑓(𝑥)𝑑𝑥 is
∫ₐᵇ 𝑓(𝑥)𝑑𝑥 ≅ ∫ₐᵇ ∑_{k=1}^{n} 𝑓(𝑥ₖ)ℓₖ(𝑥)𝑑𝑥 = ∑_{k=1}^{n} 𝑤ₖ𝑓(𝑥ₖ)
where
𝑤ₖ = ∫ₐᵇ ℓₖ(𝑥)𝑑𝑥,   𝑘 = 1, 2, … , 𝑛
This equation is known as the Newton–Cotes formula; Now if we take only one strip of
the total area and let 𝑎 = 𝑥𝑘 and 𝑏 = 𝑥𝑘+1 then we
obtain a classical examples of these formulas which are
the trapezoidal rule (𝑝1 (𝑥)), Simpson’s rule (𝑝2 (𝑥)).
This is known as the trapezoidal rule. It represents the area of the trapezoid shown in
figure.
In this section, we look into several methods of obtaining the numerical solutions to
ordinary differential equations
While the Taylor series methods are a useful starting point for understanding more sophisticated methods, they are not of much computational use. The first-order method is too inaccurate, since it corresponds to simple linear extrapolation, while higher-order methods require the calculation of many partial derivatives.
Example: write a MATLAB code to solve the ordinary differential equation 𝑥̇(𝑡) = 0.9 sin(−0.5|𝑡|)|𝑥(𝑡)| by Euler's method.
clear all, clc,
h=0.01; t=0:h:20; x=zeros(size(t)); x(1)=0.1; n=length(t)-1;
f=@(t,x)(0.9*sin(-0.5*abs(t))*abs(x));
for i=1:n, x(i+1)=x(i)+h*f(t(i),x(i)); end   % Euler steps
plot(t,x,'b','linewidth',3), hold on
[t,xc] = ode45(f,t,0.1); % verification
plot(t,xc,'r--','linewidth',3)
grid on
title('Eulers Method vs ode45 Check')
Example: write a MATLAB code to solve the ODE 𝑥̈(𝑡) = 1 − 𝑥̇(𝑡)𝑥(𝑡) by Euler's method (state form 𝑥₁ = 𝑥, 𝑥₂ = 𝑥̇).
clear all, clc,
h=0.01; t=0:h:20; n=length(t)-1;
x=zeros(n+1,2); x(1,:)=[0 0];            % initial conditions
f=@(t,x)[x(2); 1-x(2)*x(1)];             % state-space form of the ODE
for i=1:n
x(i+1,:)= x(i,:) + h*(f(t(i),x(i,:)))';
end
plot(t(1:length(x)),x,'b','linewidth',3)
grid on
hold on
figure
plot(x(:,1),x(:,2),'b','linewidth',3) % "plot" parametric relationship
grid on
Example: write a MATLAB code to solve the next forced system of ODE
If we assume that the value of the (derivative) function 𝐟(𝑡, 𝐱(𝑡)) is constant as
𝐟(𝑡𝑘 , 𝐱 𝑘 ) within one time step [𝑡𝑘 , 𝑡𝑘+1 ), this becomes
𝑡𝑘+1 𝑡𝑘+1
𝐱(𝑡𝑘+1 ) − 𝐱(𝑡𝑘 ) = ∫ 𝐟(𝑡, 𝐱(𝑡)) 𝑑𝑡 = 𝐟(𝑡𝑘 , 𝐱 𝑘 ) ∫ 𝑑𝑡 = ℎ𝐟(𝑡𝑘 , 𝐱 𝑘 )
𝑡𝑘 𝑡𝑘
ℎ
𝐱 𝑘+1 − 𝐱 𝑘 = {𝐟(𝑡𝑘 , 𝐱 𝑘 ) + 𝐟(𝑡𝑘+1 , 𝐱 𝑘+1 )}
2
But, the right-hand side of this equation has 𝐱 𝑘+1 , which is unknown at 𝑡𝑘 . To resolve
this problem, we replace the 𝐱 𝑘+1 on the right-hand side by the following approximation:
𝐱 𝑘+1 = 𝐱 𝑘 + ℎ𝐟(𝑡𝑘 , 𝐱 𝑘 )
so that it becomes
ℎ
𝐱 𝑘+1 = 𝐱 𝑘 + {𝐟(𝑡𝑘 , 𝐱 𝑘 ) + 𝐟(𝑡𝑘+1 , 𝐱 𝑘 + ℎ𝐟(𝑡𝑘 , 𝐱 𝑘 ))}
2
This is Heun's method; it is a kind of predictor-corrector method in that it predicts the value of 𝐱ₖ₊₁ by Euler's algorithm at 𝑡ₖ and then corrects the predicted value at 𝑡ₖ₊₁.
ℎ
𝐱 𝑘+1 = 𝐱 𝑘 + {𝐊1 + 𝐊 2 }
2
𝐊1 = 𝐟(𝑡𝑘 , 𝐱 𝑘 )
𝐊 2 = 𝐟(𝑡𝑘 + ℎ, 𝐱 𝑘 + ℎ𝐊1 )
The truncation error of Heun’s method is 𝒪(ℎ2 ) (proportional to ℎ2 ), while the error of
Euler’s method is 𝒪(ℎ). This Heun’s method is implemented in the MATLAB routine
𝑥̇₁(𝑡) = −𝑥₂(𝑡) − 𝑥₁(𝑡)(𝑥₁²(𝑡) + 𝑥₂²(𝑡)) + (3/4)𝑒^{−𝑡/2}
𝑥̇₂(𝑡) = 𝑥₁(𝑡) − 𝑥₂(𝑡)(𝑥₁²(𝑡) + 𝑥₂²(𝑡)) + (2/5)sin((3/4)𝜋𝑡)
f=@(t,x)[-x(2)-x(1)*(x(1)*x(1)+x(2)*x(2))+(3/4)*exp(-0.5*t);x(1)-
x(2)*(x(1)*x(1)+x(2)*x(2))+2/5*sin(3/4*pi*t)];
h=0.01; t=0:h:30; n=length(t)-1;
x=zeros(n+1,2); x(1,:)=[0 0];               % initial conditions
for i=1:n
k1 = f(t(i),x(i,:));
k2 = f(t(i)+h, x(i,:)+(k1*h)');             % predictor (Euler) step
x(i+1,:) = x(i,:)+((k1+k2)/2)'*h;           % corrector (average of the two slopes)
end
plot(t(1:length(x)),x,'b','linewidth',3)
grid on
hold on
figure
plot(x(:,1),x(:,2),'b','linewidth',3) % "plot" parametric relationship
grid on
Example: write a MATLAB code to solve the next non-forced system of ODE
𝐱ₖ₊₁ = 𝐱ₖ + (ℎ/6)(𝐊₁ + 2𝐊₂ + 2𝐊₃ + 𝐊₄)
𝐊₁ = 𝐟(𝑡ₖ, 𝐱ₖ)
𝐊₂ = 𝐟(𝑡ₖ + ℎ/2, 𝐱ₖ + (ℎ/2)𝐊₁)
𝐊₃ = 𝐟(𝑡ₖ + ℎ/2, 𝐱ₖ + (ℎ/2)𝐊₂)
𝐊₄ = 𝐟(𝑡ₖ + ℎ, 𝐱ₖ + ℎ𝐊₃)
The first equation is the core of the RK4 method, which may be obtained by substituting Simpson's rule
∫_{𝑡ₖ}^{𝑡ₖ₊₁} 𝑓(𝑥)𝑑𝑥 = (ℎ₁/3){𝑓ₖ + 4𝑓ₖ₊₁/₂ + 𝑓ₖ₊₁}   with   ℎ₁ = (𝑥ₖ₊₁ − 𝑥ₖ)/2 = ℎ/2
into the integral form 𝐱ₖ₊₁ − 𝐱ₖ = ∫_{𝑡ₖ}^{𝑡ₖ₊₁} 𝐟(𝑡, 𝐱(𝑡))𝑑𝑡; we get
𝐱ₖ₊₁ = 𝐱ₖ + (ℎ/6)(𝐟ₖ + 2𝐟ₖ₊₁/₂ + 2𝐟ₖ₊₁/₂ + 𝐟ₖ₊₁)
6
Accordingly, the RK4 method has a truncation error of 𝒪(ℎ4 ) and thus is expected to
work better than the previous two methods.
clear all, close all, clc,
h = 0.01; % set the step size
t = 0:h:200; % set the interval of t
x = zeros(1,length(t)); x0=0.01;
x(1) = x0; % set the intial value for x
n = length(t)-1;
%f=@(t,x)(cos(t)-abs(x)^(-3/2)); % x0=2; % 0<t<4
f=@(t,x)((x)^(2)- (x)^(3)); %insert function to be solved
for i = 1:n
k1 = f(t(i),x(i));
k2 = f(t(i)+0.5*h,x(i)+0.5*k1*h);
k3 = f(t(i)+0.5*h,x(i)+0.5*k2*h);
k4 = f(t(i)+h,x(i)+k3*h);
x(i+1) = x(i)+((k1+2*k2+2*k3+k4)/6)*h;
end
[t,x_check] = ode45(f,t,x0);
plot(t,x,'b','linewidth',2)
hold on
plot(t,x_check,'--r','linewidth',2)
title(' Runge–Kutta (RK4) vs ode45 Check ')
grid on
Example: write a MATLAB code to solve the next forced system of ODE
{(𝑡ₖ₋₂, 𝐟ₖ₋₂), (𝑡ₖ₋₁, 𝐟ₖ₋₁), (𝑡ₖ, 𝐟ₖ), (𝑡ₖ₊₁, 𝐟ₖ₊₁)} with 𝐟ₖ₊₁ = 𝐟(𝑡ₖ₊₁, 𝐩ₖ₊₁)
𝐜ₖ₊₁ = 𝐱ₖ + (ℎ/24)(𝐟ₖ₋₂ − 5𝐟ₖ₋₁ + 19𝐟ₖ + 9𝐟ₖ₊₁)
In order to enhance the estimate, the following modifications were proposed (the Adams–Bashforth–Moulton predictor–corrector scheme):
Predictor:  𝐩ₖ₊₁ = 𝐱ₖ + (ℎ/24)(−9𝐟ₖ₋₃ + 37𝐟ₖ₋₂ − 59𝐟ₖ₋₁ + 55𝐟ₖ)
Modifier:   𝐦ₖ₊₁ = 𝐩ₖ₊₁ + (251/270)(𝐜ₖ − 𝐩ₖ)
Corrector:  𝐜ₖ₊₁ = 𝐱ₖ + (ℎ/24)(𝐟ₖ₋₂ − 5𝐟ₖ₋₁ + 19𝐟ₖ + 9𝐟(𝑡ₖ₊₁, 𝐦ₖ₊₁))
            𝐱ₖ₊₁ = 𝐜ₖ₊₁ − (19/270)(𝐜ₖ₊₁ − 𝐩ₖ₊₁)
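A minimal MATLAB sketch of this predictor–corrector pair for a scalar ODE (the test equation, step size and interval are assumptions; the first three steps are started with RK4, and the modifier terms are left out for brevity since they require p_k and c_k from the previous step):
f = @(t,x) -x + sin(t);                     % assumed test equation dx/dt = f(t,x)
h = 0.01; t = 0:h:10; n = length(t);
x = zeros(1,n); x(1) = 1;  F = zeros(1,n);  F(1) = f(t(1),x(1));
for k = 1:3                                  % RK4 start-up for the first three steps
    k1 = f(t(k),x(k));  k2 = f(t(k)+h/2, x(k)+h*k1/2);
    k3 = f(t(k)+h/2, x(k)+h*k2/2);  k4 = f(t(k)+h, x(k)+h*k3);
    x(k+1) = x(k) + h*(k1+2*k2+2*k3+k4)/6;  F(k+1) = f(t(k+1),x(k+1));
end
for k = 4:n-1
    p = x(k) + h/24*(-9*F(k-3) + 37*F(k-2) - 59*F(k-1) + 55*F(k));   % predictor
    c = x(k) + h/24*(F(k-2) - 5*F(k-1) + 19*F(k) + 9*f(t(k+1),p));   % corrector
    x(k+1) = c;  F(k+1) = f(t(k+1),x(k+1));
end
plot(t,x), grid on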
Hamming Method: In this section, we introduce just the algorithm of the Hamming
method summarized in the next equations
Predictor:  𝐩ₖ₊₁ = 𝐱ₖ₋₃ + (4ℎ/3)(2𝐟ₖ₋₂ − 𝐟ₖ₋₁ + 2𝐟ₖ)
Modifier:   𝐦ₖ₊₁ = 𝐩ₖ₊₁ + (112/121)(𝐜ₖ − 𝐩ₖ)
Corrector:  𝐜ₖ₊₁ = (1/8)(9𝐱ₖ − 𝐱ₖ₋₂) + (3ℎ/8)(−𝐟ₖ₋₁ + 2𝐟ₖ + 𝐟(𝑡ₖ₊₁, 𝐦ₖ₊₁))
            𝐱ₖ₊₁ = 𝐜ₖ₊₁ − (9/121)(𝐜ₖ₊₁ − 𝐩ₖ₊₁)
clear all, clc,
h=0.01; n=20000; t0=0.01; t=t0:h:t0+n*h;
f=@(t,x)[x(2); -0.5*(x(1)*x(1)-1)*x(2)-x(1)];
[t,x] = ode45(f,t,[-0.1;0.2]);
plot(t,x,'r','linewidth',3)
title('ode45 Method')
grid on
figure
plot(x(:,1),x(:,2),'b','linewidth',3)
The Newton–Raphson algorithm is the best-known method of finding roots for a good
reason: it is simple and fast. The only drawback of the method is that it uses the
derivative 𝑓 ′ (𝑥) of the function as well as the function f (x) itself. Therefore, the Newton–
Raphson method is usable only in problems where 𝑓 ′ (𝑥) can be readily computed.
The Newton–Raphson formula can be derived from the Taylor series expansion of f (x)
about x:
𝑓(𝑥𝑖+1 ) = 𝑓(𝑥𝑖 ) + 𝑓 ′ (𝑥𝑖 )(𝑥𝑖+1 − 𝑥𝑖 ) + 𝒪(𝑥𝑖+1 − 𝑥𝑖 )2
If 𝑥𝑖+1 is a root of f (x) = 0 then 𝑓(𝑥𝑖 ) + 𝑓 ′ (𝑥𝑖 )(𝑥𝑖+1 − 𝑥𝑖 ) + 𝒪(𝑥𝑖+1 − 𝑥𝑖 )2 = 0. Assuming that
𝑥𝑖 is a close to 𝑥𝑖+1 , we can drop the term 𝒪(𝑥𝑖+1 − 𝑥𝑖 )2 and solve for 𝑥𝑖+1 . The result is
the Newton–Raphson formula
𝑥ᵢ₊₁ = 𝑥ᵢ − 𝑓(𝑥ᵢ)/𝑓′(𝑥ᵢ)
Remark: we are not interested by the study of convergence and speed of the
algorithms, because it is well known in textbooks.
Newton's method requires the evaluation of the derivative 𝑓′(𝑥) of the given function 𝑓(𝑥), and this itself is a problem; to overcome this obstacle let us replace the derivative by its approximation
𝑓′(𝑥ᵢ) ≈ [𝑓(𝑥ᵢ) − 𝑓(𝑥ᵢ₋₁)]/(𝑥ᵢ − 𝑥ᵢ₋₁)
In order to derive the Newton–Raphson method for a system of equations, we start with
the Taylor series expansion of 𝐟(𝐱) = [f1 (𝐱) f2 (𝐱) … f𝑛 (𝐱)]𝑇 about the point x:
where 𝑱(𝐱) is the Jacobian matrix (of size 𝑛 × 𝑛) made up of the partial derivatives
𝜕𝑓𝑖
𝑱(𝐱) = [𝐽𝑖𝑗 ] = [ ]
𝜕𝑥𝑗
Let us now assume that x is the current approximation of the solution of 𝐟(𝐱) = 𝟎, and
let 𝐱 + ∆𝐱 be the improved solution. To find the correction ∆𝐱, we set 𝐟(𝐱 + ∆𝐱) = 0. The
result is a set of linear equations for Δ𝐱, namely 𝑱(𝐱)Δ𝐱 = −𝐟(𝐱), i.e.
Δ𝐱 = −(𝑱(𝐱))⁻¹𝐟(𝐱)   or equivalently   𝐱ₖ₊₁ = 𝐱ₖ − (𝑱(𝐱ₖ))⁻¹𝐟(𝐱ₖ)
The above process is continued until ‖∆𝐱‖ < 𝜀, where 𝜀 is the error tolerance. As in the
one-dimensional case, success of the Newton–Raphson procedure depends entirely on
the initial estimate of 𝐱. If a good starting point is used, convergence to the solution is
very rapid. Otherwise, the results are unpredictable.
Example: Determine the points of intersection between the circle 𝑥 2 + 𝑦 2 = 3 and the
hyperbola 𝑥𝑦 = 1. Means that
𝑓₁(𝑥, 𝑦) = 𝑥² + 𝑦² − 3 = 0
𝑓₂(𝑥, 𝑦) = 𝑥𝑦 − 1 = 0
⟹ 𝑱(𝑥, 𝑦) = [2𝑥 2𝑦; 𝑦 𝑥] ⟺ [Δ𝑥; Δ𝑦] = [2𝑥 2𝑦; 𝑦 𝑥]⁻¹[3 − 𝑥² − 𝑦²; 1 − 𝑥𝑦]
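A short MATLAB sketch of the Newton–Raphson iteration for this system (the starting point is an arbitrary assumption):
F = @(v)[v(1)^2 + v(2)^2 - 3; v(1)*v(2) - 1];     % f1 and f2
J = @(v)[2*v(1) 2*v(2); v(2) v(1)];               % Jacobian matrix
v = [0.5; 1.5];                                   % initial guess
for k = 1:50
    dv = -J(v)\F(v);                              % solve J*dv = -F
    v  = v + dv;
    if norm(dv) < 1e-12, break, end
end
v                                                 % one intersection point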
We can see that the use of Newton's method means that we have to evaluate the Jacobian 𝑱(𝐱) and the function 𝐟(𝐱), and then solve a system of 𝑛 linear equations in 𝑛 unknowns at each step, so Newton's method has a quite high computational cost. Broyden's method is a generalization of the secant method to the multivariable case. It has only a superlinear convergence rate; however, it is much less expensive in computations at each step.
In Newton's method we know that 𝑱(𝐱ₖ)Δ𝐱ₖ = 𝐟(𝐱ₖ₊₁) − 𝐟(𝐱ₖ) = Δ𝐟(𝐱ₖ), and on the other side we know that (Δ𝐱ₖ)ᵀΔ𝐱ₖ = ‖Δ𝐱ₖ‖², so
𝑱(𝐱ₖ)Δ𝐱ₖ = Δ𝐟(𝐱ₖ) ⟺ (𝑱(𝐱ₖ) − 𝑱(𝐱ₖ₋₁))Δ𝐱ₖ = [(Δ𝐟(𝐱ₖ) − 𝑱(𝐱ₖ₋₁)Δ𝐱ₖ)/‖Δ𝐱ₖ‖²](Δ𝐱ₖ)ᵀΔ𝐱ₖ
⟺ 𝑱(𝐱ₖ) = 𝑱(𝐱ₖ₋₁) + [(Δ𝐟(𝐱ₖ) − 𝑱(𝐱ₖ₋₁)Δ𝐱ₖ)/‖Δ𝐱ₖ‖²](Δ𝐱ₖ)ᵀ
The computation of inverse need a lot of memory-space and this can be avoided if we
calculate iteratively the inverse of the matrix 𝑱(𝐱 𝑘 ) at each step. This can be
accomplished by using the Sherman-Morrison-Woodbury formula
(𝑨ₖ + 𝐮ₖ𝐯ₖᵀ)⁻¹ = 𝑨ₖ⁻¹ − (𝑨ₖ⁻¹𝐮ₖ𝐯ₖᵀ𝑨ₖ⁻¹)/(1 + 𝐯ₖᵀ𝑨ₖ⁻¹𝐮ₖ)   with   1 + 𝐯ₖᵀ𝑨ₖ⁻¹𝐮ₖ ≠ 0
Let us define 𝑨ₖ = 𝑱(𝐱ₖ₋₁), 𝐮ₖ = (Δ𝐟(𝐱ₖ) − 𝑱(𝐱ₖ₋₁)Δ𝐱ₖ)/‖Δ𝐱ₖ‖², 𝐯ₖ = Δ𝐱ₖ and 𝑩ₖ = (𝑱(𝐱ₖ))⁻¹;
then it can be verified that:
𝑩ₖ = 𝑩ₖ₋₁ + [(Δ𝐱ₖ − 𝑩ₖ₋₁Δ𝐟(𝐱ₖ))/((Δ𝐱ₖ)ᵀ𝑩ₖ₋₁Δ𝐟(𝐱ₖ))](Δ𝐱ₖ)ᵀ𝑩ₖ₋₁
Algorithm (Broyden's method):
Data: 𝐟(𝐱), 𝐱₀, 𝐟(𝐱₀) and 𝑩₀
Result: 𝐱ₖ
begin: for 𝑘 = 0, 1, 2, …
𝐱ₖ₊₁ = 𝐱ₖ − 𝑩ₖ𝐟(𝐱ₖ)
𝐲ₖ = 𝐟(𝐱ₖ₊₁) − 𝐟(𝐱ₖ),   𝒔ₖ = 𝐱ₖ₊₁ − 𝐱ₖ
𝑩ₖ₊₁ = 𝑩ₖ + [(𝒔ₖ − 𝑩ₖ𝐲ₖ)/(𝒔ₖᵀ𝑩ₖ𝐲ₖ)]𝒔ₖᵀ𝑩ₖ
end
Example: (Author's idea) Determine the solution 𝑿 ∈ ℝ²ˣ² of the matrix equation 𝑿𝑨₁ − 𝑨₄𝑿 − 𝑿𝑨₂𝑿 + 𝑨₃ = 𝟎, where the 𝑨ᵢ ∈ ℝ²ˣ² are constant matrices given by
𝑨₁ = [1 2; 3 2],  𝑨₂ = [4 3; 2 5],  𝑨₃ = [1 3; 5 5],  𝑨₄ = [8 8; 6 1]
clear all, clc,
A1 =[1 2;3 2]; A2 =[4 3;2 5];
A3 =[1 3;5 5]; A4 =[8 8;6 1];
f=@(x) reshape(reshape(x,2,2)*A1 - A4*reshape(x,2,2) ...
       - reshape(x,2,2)*A2*reshape(x,2,2) + A3, 4,1);  % residual as a 4-vector
x0=rand(4,1); f0=f(x0); B=eye(4);                      % initial guess and initial B0
for k=1:1000
x1=x0-B*f0;                          % Broyden step x(k+1)=x(k)-B(k)*f(x(k))
f1=f(x1);
y=f1-f0;                             % 𝐲𝑘 = 𝐟𝑘+1 − 𝐟𝑘
s=x1-x0;                             % 𝒔𝑘 = 𝐱 𝑘+1 − 𝐱 𝑘
B = B +((s-B*y)*(s'*B))/(s'*B*y);    % update of the approximate inverse Jacobian
x0=x1; f0=f1;                        % update
end
X1=reshape(x1,2,2)                   % solution
ZERO1=X1*A1 - A4*X1 - X1*A2*X1 + A3  % verifications
The result is
X1 =[0.7970 0.8191;… ZERO1 =[2.2204e-16 4.4409e-16;…
-0.4120 -0.2950] 0 0]
Horner's method is a technique to evaluate polynomials quickly. It needs 𝑛 multiplications and 𝑛 additions to evaluate a polynomial of degree 𝑛. It is also a nested algorithmic scheme that can factorize a polynomial into a product of 𝑛 linear factors; this scheme (Horner's method) is based on Euclidean synthetic long division. (Wording of this method is the author's property.)
Theorem: Let 𝑎(𝜆) = ∑_{i=0}^{n} 𝑎ᵢ𝜆^{n−i} be a polynomial of degree 𝑛 defined over the complex field ℂ → ℂ[𝜆], where the 𝑎ᵢ are constant coefficients and 𝜆 is a complex variable. The sequence of variables {𝑟(𝑘)}_{k=0}^{∞} converges iteratively to an exact solution of 𝑎(𝜆) = 0 if and only if 𝜆 = lim_{k→∞} 𝑟(𝑘), with 𝑟(𝑘) generated by the recursion derived in the proof below.
Proof: (Theorem's wording is the author property) Using the remainder theorem we can
obtain that, if we divide 𝑎(𝜆) by the linear factor (𝜆 − 𝑟) we get 𝑎(𝜆) = (𝜆 − 𝑟)𝑏(𝜆) + 𝑎(𝑟)
means that:
𝑎(𝜆) = (𝜆 − 𝑟)(𝑏0 𝜆𝑛−1 + 𝑏1 𝜆𝑛−2 + ⋯ + 𝑏𝑛−1 ) + 𝑎(𝑟)
Since 𝑟 is a root of 𝑎(𝜆), we have 𝑏ₙ = 𝑎(𝑟) = 0, and from this last equation we can deduce that 𝑏ₙ = 𝑎ₙ + 𝑏ₙ₋₁𝑟 = 0, in other words 𝑟 = −(𝑏ₙ₋₁)⁻¹𝑎ₙ. If we iterate the last obtained equation, i.e. 𝑟(𝑘 + 1) = −(𝑏ₙ₋₁(𝑟(𝑘)))⁻¹𝑎ₙ, then based on this iterative Horner scheme we can repeat the process until the solution 𝜆 = lim_{k→∞} 𝑟(𝑘) is reached, and the theorem is proved.
When you get the first root repeat the process using long division until you get the
complete set.
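A minimal MATLAB sketch of the synthetic (Horner) division used here: given the coefficients of a(λ) and a root r, it returns the deflated quotient b(λ) and the remainder a(r); the function name is an illustrative choice.
function [b, remainder] = horner_deflate(a, r)
% HORNER_DEFLATE synthetic division of a(lambda) by (lambda - r)
% a = [a0 a1 ... an] in descending powers; b = quotient; remainder = a(r)
n = length(a);
b = zeros(1, n-1);
b(1) = a(1);
for k = 2:n-1
    b(k) = a(k) + r*b(k-1);        % nested (Horner) recurrence b_k = a_k + r*b_{k-1}
end
remainder = a(n) + r*b(n-1);       % value of the polynomial at r
end
For example, horner_deflate([1 -6 11 -6], 3) returns b = [1 -3 2] (i.e. z² − 3z + 2) and remainder 0, so the process can then be repeated on the deflated polynomial.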
We have seen how the general linear difference equation (𝐷𝐸) with constant coefficients
can be solved analytically by determining the zeros of the associated characteristic
polynomial. Bernoulli's method consists in reversing this procedure.
The polynomial whose zeros are sought is considered the characteristic polynomial of
some difference equation and this associated difference equation is solved numerically.
Suppose the polynomial 𝑝(𝑧) = 𝑎₀𝑧^N + 𝑎₁𝑧^{N−1} + ⋯ + 𝑎_N, whose coefficients may be complex, has 𝑁 distinct zeros 𝑧₁, 𝑧₂, … , 𝑧_N. Now let us see what happens when we solve the following DE 𝑎₀𝑥[𝑛] + 𝑎₁𝑥[𝑛 − 1] + ⋯ + 𝑎_N𝑥[𝑛 − 𝑁] = 0, which has 𝑝(𝑧) as its characteristic polynomial. The solution 𝑋 = {𝑥ₙ} (whatever its starting values are) must be representable in the form 𝑥[𝑛] = 𝐶₁𝑧₁ⁿ + 𝐶₂𝑧₂ⁿ + ⋯ + 𝐶_N𝑧_Nⁿ. We assume that:
𝑖) The polynomial 𝑝(𝑧) has a single dominant zero, 𝑧1 : |𝑧1 | > |𝑧𝑘 | for 𝑘 = 2,3, … , 𝑁
𝑖𝑖) The starting values are such that the dominant zero is represented in the solution, i.e., 𝐶₁ ≠ 0. We now consider the ratio of two consecutive values of the solution sequence {𝑥ₙ}; we find
𝑥[𝑛 + 1]/𝑥[𝑛] = 𝑧₁(𝐶₁ + 𝐶₂(𝑧₂/𝑧₁)^{n+1} + ⋯)/(𝐶₁ + 𝐶₂(𝑧₂/𝑧₁)ⁿ + ⋯)
We know that
|𝑧ₖ/𝑧₁| < 1 for 𝑘 = 2, 3, … , 𝑁 ⟹ lim_{n→∞}(𝑧ₖ/𝑧₁)ⁿ = 0 ⟹ lim_{n→∞} 𝑥[𝑛 + 1]/𝑥[𝑛] = 𝑧₁
Example: Determine the dominant root of the following polynomial using Bernoulli's
Method.
𝑝(𝑧) = 𝑧 3 − 6𝑧 2 + 11𝑧 − 6 = 0
clear all, clc
a0=1; a1=-6; a2=11; a3=-6;     % p(z)=z^3-6z^2+11z-6 (monic, so a0=1)
u1=1; u2=0; u3=0;              % arbitrary starting values of the difference equation
for k=1:80
u0=-(a1*u1+a2*u2+a3*u3); % U(k)=-(A1*U(k-1)+A2*U(k-2)+A3*U(k-3))
u3=u2; % let U0=U(k), U1=U(k-1), U2=U(k-2) and U3=U(k-3)
u2=u1; % after one iteration(k=k+1) we have U3=U2; U2=U1; U1=U0
u1=u0;
q1=u1*inv(u2); % ratio of two consecutive terms -> dominant root
end
q1
Example: Determine the dominant root of the following polynomial using Bernoulli's
Method.
for k=1:200
u0=-(a1*u1+a2*u2+a3*u3 + a4*u4);
u4=u3; u3=u2; u2=u1; u1=u0;
q1=u1*inv(u2);
end
q1
CHAPTER V:
The QR (an implication: QD) is one of ten Algorithms that Changed the World.
Finding Roots of a Polynomials by Global
Methods (QD-Algorithm)
The application he had in mind was the following. Assume that 𝑨 is an 𝑛 × 𝑛 matrix and
that 𝑪 and 𝑩 are two n-vectors. Then, for ℎ𝑖 = 𝑪𝑨𝑖 𝑩, the series in is the Taylor expansion
at ∞ of
𝑓(𝑧) = ⟨𝑪ᵀ, (𝑧𝑰 − 𝑨)⁻¹𝑩⟩ = (1/𝑧)⟨𝑪ᵀ, (𝑰 − 𝑨/𝑧)⁻¹𝑩⟩
which is a proper rational fraction of degree 𝑟 ≤ 𝑛 whose poles are eigenvalues of A. This
is seen from the representation
𝑪adj(𝑧𝑰 − 𝑨)𝑩
𝑓(𝑧) =
det(𝑧𝑰 − 𝑨)
which also reveals that only the numerator depends on 𝑩 and 𝑪 unless some zeros and
poles cancel. This application to the matrix eigenvalue problem was the starting point
and the target of Rutishauser’s investigation. He called the coefficients ℎ𝑖 Schwarz
constants, but today they are referred to as moments in numerical linear algebra and as
Markov parameters in systems and control theory, where the sequence of moments is
the impulse response of the linear time-invariant discrete-time single-input single-
output control system given by the state matrix 𝑨 and the vectors 𝑩 and 𝑪. So Stiefel’s
proposal for Rutishauser was to determine the eigenvalues of 𝑨 given the sequence of
moments. Rutishauser (1954) called his algorithm the quotient-difference algorithm or,
briefly, the QD algorithm. Nowadays, the abbreviation in lower-case letters, qd
algorithm, is widely used.
• Find eigenvalues of 𝑨.
• Find poles of generating (rational) function f.
• Find zeros of the denominator polynomial of f (Bernoulli).
But none of them had an efficient algorithm. Rutishauser cites Hadamard and Aitken,
but never de Montessus de Ballore, who proved the convergence of Padé approximants
with fixed denominator degree.
Daniel Bernoulli laid the cornerstone of the method of finding the greatest root; Dénes König (1884) established more than 150 years later that the analogous result holds for any power series of an analytic function with a single simple pole on the boundary of the disk of convergence. Soon after that the French mathematician Jacques Hadamard (1865–1963), in his thesis, solved the problem of finding all the poles of f from the moments by a beautiful procedure that is, however, very ill-suited to computer implementation.
Now, it was known to Daniel Bernoulli (1732) and D. König (1884–1944) that, if 𝑑(𝑧) has a unique zero 𝜆₁ of maximum modulus (and hence the series converges for |𝑧| > |𝜆₁|), then the solution {ℎᵢ} of the difference equation satisfies
𝜆₁ = lim_{i→∞} ℎᵢ₊₁/ℎᵢ = lim_{i→∞} ℎᵢ₊₂/ℎᵢ₊₁ = lim_{i→∞} ℎᵢ₊₃/ℎᵢ₊₂
Now, based on Bernoulli's idea, let us try to parametrize the second solution:
ℎᵢ₊₂ − (𝜆₁ + 𝜆₂)ℎᵢ₊₁ + 𝜆₁𝜆₂ℎᵢ = 0 ⟺ 𝜆₂ = lim_{i→∞} {(ℎᵢ₊₂ − 𝜆₁ℎᵢ₊₁)/(ℎᵢ₊₁ − 𝜆₁ℎᵢ)}
Replacing 𝜆₁ by the consecutive ratios ℎᵢ₊₃/ℎᵢ₊₂ and ℎᵢ₊₂/ℎᵢ₊₁ gives
𝜆₂ = lim_{i→∞} {(ℎᵢ₊₂ − (ℎᵢ₊₃/ℎᵢ₊₂)ℎᵢ₊₁)/(ℎᵢ₊₁ − (ℎᵢ₊₂/ℎᵢ₊₁)ℎᵢ)} = lim_{i→∞} (ℎᵢ₊₁/ℎᵢ₊₂)·{det[ℎᵢ₊₁ ℎᵢ₊₂; ℎᵢ₊₂ ℎᵢ₊₃]/det[ℎᵢ ℎᵢ₊₁; ℎᵢ₊₁ ℎᵢ₊₂]}
⟺ 𝜆₂ = lim_{i→∞} (det[ℎᵢ₊₁ ℎᵢ₊₂; ℎᵢ₊₂ ℎᵢ₊₃]/ℎᵢ₊₁)/(det[ℎᵢ ℎᵢ₊₁; ℎᵢ₊₁ ℎᵢ₊₂]/ℎᵢ)
By the Cayley-Hamilton theorem it is well known that, the matrix 𝑨 satisfy its own
characteristic equation: 𝑑(𝑨) = 𝑑0 𝑨𝑟 + 𝑑1 𝑨𝑟−1 + ⋯ + 𝑑𝑟−1 𝑨 + 𝑑𝑟 𝑰 = 𝟎. Multiply left and
right this formula by 𝑪 and 𝑩 we get 𝑑0 𝑪𝑨𝑟 𝑩 + 𝑑1 𝑪𝑨𝑟−1 𝑩 + ⋯ + 𝑑𝑟−1 𝑪𝑨𝑩 + 𝑑𝑟 𝑪𝑩 = 0. In
terms of Markov parameters (i.e. ℎ𝑖 = 𝑪𝑨𝑖 𝑩 ) we can express it as
𝑑0 ℎ𝑟 + 𝑑1 ℎ𝑟−1 + ⋯ + 𝑑𝑟−1 ℎ1 + 𝑑𝑟 ℎ0 = 0
Shifting this sequence to the left or to the right by 𝑘 steps (i.e. ℎᵢ₊ₖ = 𝑪𝑨^{i+k}𝑩) we obtain 𝑑₀ℎᵣ₊ₖ + 𝑑₁ℎᵣ₊ₖ₋₁ + ⋯ + 𝑑ᵣ₋₁ℎₖ₊₁ + 𝑑ᵣℎₖ = 0.
This recursion only depends on 𝑑(𝑧). The numerator polynomial 𝑛(𝑧) of degree 𝑟 − 1
matches the first 𝑟 terms of the Laurent series of f(z)d(z), that is,
1
𝑓(𝑧)𝑑(𝑧) − 𝑛(𝑧) = 𝒪 ( ) as 𝑧 → ∞
𝑧
It was known to Daniel Bernoulli (1700–1782) that, if 𝑑(𝑧) has a unique zero 𝑧1 of
maximum modulus (and hence the series converges for |𝑧| > |𝑧1 |), then the solution {ℎ𝑖 }
of the difference equation satisfies
𝑧₁ = lim_{i→∞} ℎᵢ₊₁/ℎᵢ
This is Bernoulli’s method for finding such a greatest root (see Aitken, 1926). In 1892
Hadamard considered the double sequence of Hankel determinants 𝐻ₖ^{(i)}
𝐻ₖ^{(i)} = det[ℎᵢ ℎᵢ₊₁ ⋯ ℎᵢ₊ₖ₋₁; ℎᵢ₊₁ ℎᵢ₊₂ ⋯ ℎᵢ₊ₖ; ⋮ ⋮ ⋱ ⋮; ℎᵢ₊ₖ₋₁ ℎᵢ₊ₖ ⋯ ℎᵢ₊₂ₖ₋₂],   (𝑘 = 1, 2, … , 𝑛;  𝑖 = 1, 2, …)
and showed that, for simple poles,
𝐻ₖ^{(i)} = 𝐶ₖ(𝜆₁𝜆₂ ⋯ 𝜆ₖ)^i [1 + 𝒪((|𝜆ₖ₊₁|/|𝜆ₖ|)^i)]   as 𝑖 → ∞, with 𝐶ₖ a constant
Assuming simple poles, Peter Henrici (1958) gave a simpler proof of this result. Multiple
poles can be treated with a technique used by Michael Golomb (1943). New proofs of
Hadamard’s theorem have also been a topic in the subsequent qd literature (see Gragg
& Householder, 1966; Gragg, 1973; Householder, 1974). Here are some obvious
conclusions.
Corollary: Under the assumptions of the previous theorem, we have that 𝐻ᵣ₊₁^{(i)} = 0, 𝐻₁^{(i)} = ℎᵢ and 𝐻₀^{(i)} = 1 (∀𝑖). Moreover,
▪ If f has 𝑟 simple poles, then 𝐻ₖ^{(i)} ≠ 0 (𝑘 = 1, … , 𝑟) for large enough 𝑖.
▪ If |𝜆ₖ₊₁| < |𝜆ₖ| then lim_{i→∞} 𝐻ₖ^{(i+1)}/𝐻ₖ^{(i)} = 𝜆₁𝜆₂ ⋯ 𝜆ₖ
▪ If |𝜆ₖ₊₁| < |𝜆ₖ| < |𝜆ₖ₋₁| then 𝑞ₖ^{(i)} = (𝐻ₖ^{(i+1)}/𝐻ₖ^{(i)})·(𝐻ₖ₋₁^{(i)}/𝐻ₖ₋₁^{(i+1)}) ⟶ 𝜆ₖ as 𝑖 → ∞
Rhombus rule: Jacobi's determinant identity gives relations between neighbouring Hankel determinants, for instance
(𝐻₂^{(1)})² = 𝐻₂^{(0)}𝐻₂^{(2)} − 𝐻₃^{(0)}𝐻₁^{(2)}
(𝑖+1)
Now multiplying Jacobi’s identity centered at 𝐻𝑘−1 , namely,
(𝑖+1)
Likewise, we write down Jacobi’s identity centered at 𝐻𝑘 , namely,
Those last relations are the rhombus rules (called so by Stiefel, 1955) defining the qd
(𝑖) (𝑖)
algorithm. Rutishauser (1954) suggested writing down the quantities 𝑒𝑘 and 𝑞𝑘 in a
triangular scheme called a qd scheme (also known as a qd table). Recall that, at this
time, the simple computations needed to build up this scheme were normally done on a
desk calculator, and so a suitable scheme to write down the numbers obtained was
most useful.
For building up the table column-wise from left to right:
A more stable way of constructing the qd scheme is obtained by building up the table
row-wise, from top to bottom (progressive qd algorithm):
Written in this form, the rules can be used to construct the qd scheme row by row.
Remark: Rutishauser (1954) indicates that the name rhombus rules may have been
suggested to him by Stiefel.
In the case of a proper rational fraction f of exact degree 𝑟, 𝑒₀^{(i)} = 𝑒ᵣ^{(i)} = 0 holds for all 𝑖, and thus the table is not defined beyond the 𝑟-th e-column. Assuming that all the poles of f have different moduli, Rutishauser (1954) could readily conclude from Aitken's work (see Aitken, 1926) that
lim_{i→∞} 𝑞ₖ^{(i)} = 𝜆ₖ   and   lim_{i→∞} 𝑒ₖ^{(i)} = 0   (𝑘 = 1, 2, … , 𝑟)
This behaviour means that the original qd algorithm for building up the table from its
first column is a computational disaster as the first formula inevitably leads to the
cancellation of leading digits. In contrast, the progressive form is a version that is still
of importance. It avoids the highly ill-conditioned computation of the table from the
moments (see, e.g., Gautschi, 1968, 1982).
Example: (column-generation version) By using Bernoulli's method we get the dominant zero, and then we remove it by the long-division algorithm 𝑝(𝑧) = (𝑧 − 𝑧₁)𝑞(𝑧). Thereafter, we try to get the new dominant zero of the deflated polynomial and repeat the process until we find all zeros. We shall now discuss a modern extension of Bernoulli's method which has the advantage of providing simultaneous approximations to all zeros at a time.
𝑝(𝑧) = 𝑧² − 𝑧 − 1 = 𝑧²(1 − 𝑧⁻¹ − 𝑧⁻²) ⟹ (1 − 𝑧⁻¹ − 𝑧⁻²)𝑥[𝑖] = 0 ⟹ 𝑥[𝑖] = 𝑥[𝑖 − 1] + 𝑥[𝑖 − 2]
 𝑖    𝑥[𝑖]   𝑒₀^{(i)}   𝑞₁^{(i)} = 𝑥[𝑖+1]/𝑥[𝑖]   𝑒₁^{(i)}      𝑞₂^{(i)}      𝑒₂^{(i)}
 0     1      0          1.000                    1.000        −1.000
 1     1      0          2.000                   −0.500        −0.500       ≈ −10⁻⁵
 2     2      0          1.500                    0.1666       −0.666       ≈ −10⁻⁵
 3     3      0          1.6667                  −0.0666       −0.600       ≈ −2×10⁻⁵
 4     5      0          1.6000                    0.0250      −0.624975    ≈ −2.5×10⁻⁵
 5     8      0          1.6250                  −0.009615     −0.615409    ≈ −4.9×10⁻⁵
 6    13      0          1.615385                  0.003663    −0.619243
 7    21      0          1.619048                 −0.001401        ↓
 8    34      0          1.6176470                    ↓        (1 − √5)/2
 9    55      0              ↓
                       (1 + √5)/2
Consider the elements thus generated as the first two rows of a QD scheme, and generate further rows by means of progressive generation, using the side conditions 𝑒₀^{(i)} = 𝑒ᵣ^{(i)} = 0.
Remark: Later on we will see why we have started the algorithm by these special values
clear all, clc,
% 4th order polynomial associated with the difference equation
% a0*x(k)+ a1*x(k-1)+ a2*x(k-2)+ a3*x(k-3) + a4*x(k-4)
%--------------------------------------%
% syms s %
% expand((s-4)*(s-1)*(s-2)*(s-3)) %
%--------------------------------------%
a0=1; a1=-10; a2=35; a3=-50; a4=24;
Q1=[-(a1/a0), 0 , 0 , 0]; E1=[0 , (a2/a1), (a3/a2) , (a4/a3) , 0];
%------------------------------------------------------------------%
for i=1:20
E2=[]; Q2=[];
for k=1:4
q2(k)=(E1(k+1)-E1(k))+ Q1(k); Q2=[Q2,q2(k)];
end
Q2;
for k=1:3
e2(k)=(Q2(k+1)/Q2(k))* E1(k+1); E2=[E2,e2(k)];
end
E2=[0 , E2 , 0];
E1=E2; Q1=Q2;
end
E1
solution=Q1(1:4)
Now assume that we are interested in the study of a system described by a proper rational function 𝑓(𝑧) = 𝑛(𝑧)/𝑑(𝑧); the response of this system to some forcing input is then
𝑦(𝑧) = (𝑛(𝑧)/𝑑(𝑧))𝑢(𝑧)
Let us put 𝑥₁(𝑧) = (𝑑(𝑧))⁻¹𝑢(𝑧), so that 𝑦(𝑧) = 𝑛(𝑧)𝑥₁(𝑧); it is therefore convenient to factorize the denominator into a product of linear terms as
𝑥₁(𝑧) = (𝑑(𝑧))⁻¹𝑢(𝑧) ⟺ (𝑧 − 𝑞ᵣ)(𝑧 − 𝑞ᵣ₋₁) ⋯ (𝑧 − 𝑞₂)(𝑧 − 𝑞₁)𝑥₁(𝑧) = 𝑢(𝑧)
Take the change of variables 𝑥ⱼ₊₁(𝑧) = (𝑧 − 𝑞ⱼ)𝑥ⱼ(𝑧), 𝑗 = 1, … , 𝑟 − 1; written in the time domain this is the state-space model 𝐱[𝑘 + 1] = 𝑨ₚ𝐱[𝑘] + 𝑩ₚ𝑢[𝑘], where
𝑨ₚ = [𝑞₁ 1 0 ⋯ 0; 0 𝑞₂ 1 ⋯ 0; ⋮ 0 𝑞₃ ⋱ ⋮; 0 ⋮ ⋮ ⋱ 1; 0 0 ⋯ 0 𝑞ᵣ],   𝑩ₚ = [0; 0; ⋮; 0; 1],   𝐱[𝑘] = [𝑥₁[𝑘]; 𝑥₂[𝑘]; ⋮; 𝑥ᵣ[𝑘]]
Remark: From this development we conclude that the matrix 𝑨𝑝 is another companion
form and is equivalent to the matrix 𝑨𝑐 , moreover eigenvalues of 𝑨𝑝 are zeros of 𝑑(𝑧) or
equivalently poles of 𝑓(𝑧).
Based on this equivalence and on the progressive qd-algorithm, Rutishauser conjectured the existence of a nested set of tridiagonal matrices
𝑻ᵣ^{(i)} = [ 𝑞₁^{(i)}              1                       0          ⋯      0
             𝑞₁^{(i)}𝑒₁^{(i)}      𝑞₂^{(i)}+𝑒₁^{(i)}        1          ⋯      0
             0                    𝑞₂^{(i)}𝑒₂^{(i)}         𝑞₃^{(i)}+𝑒₂^{(i)}  ⋱      ⋮
             ⋮                     ⋱                       ⋱          ⋱      1
             0                    ⋯          0     𝑞ᵣ₋₁^{(i)}𝑒ᵣ₋₁^{(i)}      𝑞ᵣ^{(i)}+𝑒ᵣ₋₁^{(i)} ]
with lim_{i→∞} 𝑻ᵣ^{(i)} = 𝑨ₚ, such that the characteristic polynomial 𝑑ᵣ^{(i)}(𝑧) of 𝑻ᵣ^{(i)} converges to the denominator of 𝑓(𝑧), that is 𝑑(𝑧) = lim_{i→∞} 𝑑ᵣ^{(i)}(𝑧).
Since Rutishauser was interested in the limit of the zeros of 𝑑ᵣ^{(i)}(𝑧) as 𝑖 → ∞, it was natural to look at 𝑻ᵣ^{(i)}. Clearly, 𝑻ᵣ^{(i)} has the LR decomposition 𝑻ᵣ^{(i)} = 𝑳ᵣ^{(i)}𝑹ᵣ^{(i)} with
𝑳ᵣ^{(i)} = [1 0 ⋯ 0; 𝑒₁^{(i)} 1 0 ⋯; 0 𝑒₂^{(i)} 1 ⋱; ⋮ ⋱ ⋱ 0; 0 ⋯ 𝑒ᵣ₋₁^{(i)} 1]   and   𝑹ᵣ^{(i)} = [𝑞₁^{(i)} 1 0 ⋯ 0; 0 𝑞₂^{(i)} 1 ⋯ 0; ⋮ 0 𝑞₃^{(i)} ⋱ ⋮; ⋮ ⋱ 1; 0 0 ⋯ 0 𝑞ᵣ^{(i)}]
⋯
At some historic moment in 1954, Rutishauser must have realized that his progressive
(𝑖) (𝑖) (𝑖)
qd algorithm can be interpreted as computing this LR factorization 𝑻𝑟 = 𝑳𝑟 𝑹𝑟 and
then forming
𝑹ᵣ^{(i)}𝑳ᵣ^{(i)} = [ 𝑞₁^{(i)}+𝑒₁^{(i)}      1                      0       ⋯    0
                     𝑞₂^{(i)}𝑒₁^{(i)}       𝑞₂^{(i)}+𝑒₂^{(i)}       1       ⋯    0
                     0                     𝑞₃^{(i)}𝑒₂^{(i)}        𝑞₃^{(i)}+𝑒₃^{(i)}  ⋱   ⋮
                     ⋮                      ⋱                      ⋱       ⋱    1
                     0           ⋯          0         𝑞ᵣ^{(i)}𝑒ᵣ₋₁^{(i)}        𝑞ᵣ^{(i)} ]   = 𝑻ᵣ^{(i+1)}
⋯
So, the qd algorithm consists of performing the step 𝑻ᵣ^{(i)} = 𝑳ᵣ^{(i)}𝑹ᵣ^{(i)} ⟿ 𝑻ᵣ^{(i+1)} = 𝑹ᵣ^{(i)}𝑳ᵣ^{(i)}, called the LR transformation, which is a similarity transformation: 𝑻ᵣ^{(i+1)} = 𝑹ᵣ^{(i)}𝑻ᵣ^{(i)}(𝑹ᵣ^{(i)})⁻¹.
Now let us go back to the matrix 𝑨𝑐 and factorize it into a special form. Eliminating the subdiagonal entries one column at a time decomposes 𝑨𝑐 into a product of elementary unit lower-triangular factors and an upper bidiagonal matrix,
𝑨𝑐 = 𝑳ᵣ₋₂ ⋯ 𝑳₁𝑳₀𝑹₀ ⟺ 𝑨𝑐 ∼ 𝑹₀
where
𝑹₀ = [−𝑎₁ 1 0 ⋯ 0; 0 −𝑎₂𝑎₁⁻¹ 1 ⋯ 0; ⋮ 0 −𝑎₃𝑎₂⁻¹ ⋱ ⋮; ⋮ ⋱ 1; 0 0 ⋯ 0 −𝑎ᵣ𝑎ᵣ₋₁⁻¹]
𝑳₀ = [1 0 0 ⋯ 0; 𝑎₂𝑎₁⁻¹ 1 0 ⋯ 0; ⋮ 0 1 ⋱ ⋮; 0 ⋯ ⋯ ⋯ 1],   𝑳₁ = [1 0 ⋯ 0; 0 1 0 ⋯; ⋮ 𝑎₃𝑎₂⁻¹ 1 ⋱; 0 ⋯ 0 1],   etc.,
the last elementary factor carrying 𝑎ᵣ𝑎ᵣ₋₁⁻¹ in its last subdiagonal position.
Now if we look at 𝑻ᵣ^{(0)} and compare it with 𝑹₀ we get
𝑻ᵣ^{(0)} = 𝑹₀ ⟹ 𝑞₁^{(0)} = −𝑎₁𝑎₀⁻¹ and 𝑞ₖ^{(0)} = 0 for 𝑘 = 2, 3, …;   𝑒ᵢ₋₁^{(0)} = 𝑎ᵢ𝑎ᵢ₋₁⁻¹, 𝑖 = 2, 3, …
The solution of linear systems can be divided into two main categories, named explicit and implicit solutions. Here in this chapter we are going to concentrate on the implicit solution (i.e. the numerical or approximate solution). Before introducing the numerical methods for linear algebraic systems we give an overview of Cramer's rule, which is a formula used by generations of mathematicians to write down explicit solutions of linear systems.
Theorem (Cramer’s Rule) For det(𝑨) ≠ 0, the linear system 𝑨𝐱 = 𝒃 has the unique
solution 𝑥𝑖 = det(𝑨𝑖 ) / det(𝑨) , 𝑖 = 1, 2, . . . , 𝑛 where 𝑨𝑖 is the matrix obtained from 𝑨 by
replacing column 𝒂: 𝑖 by 𝒃. And the determinant can be computed using the Laplace
Expansion. For each row 𝑖 we have det(𝑨) = ∑𝑛𝑗=1 𝑎𝒊𝒋 (−1)𝑖+𝑗 det(𝑴𝑖𝑗 )
where 𝑴𝑖𝑗 denotes the (𝑛 − 1) × (𝑛 − 1) submatrix obtained by deleting row 𝑖 and
column 𝑗 of the matrix 𝑨.
The following recursive Matlab program computes a determinant using the Laplace
Expansion for the first row:
function d=DetLaplace(A);
% DETLAPLACE determinant using Laplace expansion
% d=DetLaplace(A); computes the determinant d of the matrix A
% using the Laplace expansion for the first row.
n=length(A);
if n==1;
d=A(1,1);
else
d=0; v=1;
for j=1:n
M1j=[A(2:n,1:j-1) A(2:n,j+1:n)];
d=d+v*A(1,j)*DetLaplace(M1j);
v=-v;
end
end
The following Matlab program computes the solution of a linear system with Cramer's
rule:
function x=Cramer(A,b);
% CRAMER solves a linear Sytem with Cramer’s rule
% x=Cramer(A,b); Solves the linear system Ax=b using Cramer’s
% rule. The determinants are computed using the function DetLaplace.
n=length(b);
detA=DetLaplace(A);
for i=1:n
AI=[A(:,1:i-1), b, A(:,i+1:n)];
x(i)=DetLaplace(AI)/detA;
end
x = x(:);
Cramer's rule looks simple and even elegant, but for computational purposes it is a
disaster, the computational effort with Laplace expansion is 𝒪(𝑛!), while with Gaussian
elimination (introduced in the next sections), it is 𝒪(𝑛3 ). Furthermore, the numerical
accuracy due to finite precision arithmetic is very poor.
Indeed, if the determinants are evaluated by the recursive relation, the computational
effort of Cramer's rule is of the order of (𝑛 + 1)! flops and therefore turns out to be
unacceptable even for small dimensions of 𝑨 (for instance, a computer able to perform
109 flops per second would take 9.6 × 1047 years to solve a linear system of only 50
equations).
Here you are given the most popular existing methods for solving linear algebraic problems.
The process of row reduction makes use of elementary row operations, and can be
divided into two parts. The first part (sometimes called forward elimination) reduces a
given system to row echelon form, from which one can tell whether there are no
solutions, a unique solution, or infinitely many solutions. The second part (sometimes
called back substitution) continues to use row operations until the solution is found; in
other words, it puts the matrix into reduced row echelon form.
Example: convert 𝑨 into echelon form using the elementary row operations on [𝑨 𝒃]
In this section we shall be concerned with deriving algorithms for solving 𝑨𝐱 = 𝒃. The
algorithms have in common the idea of factoring 𝑨 in the form 𝑨 = 𝑹1 𝑹2 … 𝑹𝑘 where
each matrix 𝑹𝑖 is so simple that the equation 𝑹𝒚 = 𝒄 can be easily solved. The equation
𝑨𝐱 = 𝒃 can then be solved by taking 𝒚0 = 𝒃 and for 𝑖 = 1,2, . . . , 𝑘 computing 𝒚𝑖 as the
solution of the equation 𝑹𝑖 𝒚𝑖 = 𝒚𝑖−1 . Obviously 𝒚𝑘 is the desired solution 𝐱. The
matrices 𝑹𝑖 arising in the factorization of 𝑨 are often triangular. Hence the first thing
will be devoted to algorithms for solving triangular systems, the second will be devoted
to algorithms for factoring 𝑨, and the last will apply these factorizations to the solution
of linear systems.
Once the matrices 𝑳 and 𝑼 have been computed, solving the linear system consists only
of solving successively the two triangular systems
(2 1; 4 17)𝐱 = (−3; 9) ⟺ (1 0; 2 3)(2 1; 0 5)𝐱 = (−3; 9) ⟺ { (1 0; 2 3)𝐲 = (−3; 9)  and  (2 1; 0 5)𝐱 = 𝐲 }
The first system is very easy to solve: (1 0; 2 3)𝐲 = (−3; 9) ⟹ 𝐲 = (−3; 5). After getting the solution of the first system we substitute it into the second one and obtain
(2 1; 0 5)𝐱 = (−3; 5) ⟹ 𝐱 = (−2; 1)
The Gauss elimination method provides only one upper triangular system: 𝑨𝐱 = 𝒃 ⟶ 𝑼𝐱 = 𝐲
Elementary Operations
% Elimination phase
clear all, clc, A=10*rand(10,10) + 10*eye(10,10);
b=100*rand(10,1); n=length(b);
A1=A; b1=b; % storing A & b before the elimination process go on
for k=1:n-1
for i=k+1:n
if A(i,k)==0
break
else
lambda = A(i,k)/A(k,k);
A(i,k+1:n) = A(i,k+1:n) - lambda*A(k,k+1:n);
b(i)= b(i) - lambda*b(k);
end
end
end
% Back substitution phase
for k=n:-1:1
b(k) = (b(k) - A(k,k+1:n)*b(k+1:n))/A(k,k);
end
x = b
zero=A1*x-b1
Computing the inverse of a matrix and solving simultaneous equations are related tasks. The most economical way to invert an 𝑛 × 𝑛 matrix 𝑨 is to solve the 𝑛 systems 𝑨𝐱ⱼ = 𝐞ⱼ, 𝑗 = 1, … , 𝑛 (i.e. 𝑨𝑿 = 𝑰 ⟹ 𝑿 = 𝑨⁻¹), reusing the same factorization of 𝑨 for every right-hand side.
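A small sketch of this idea, reusing one LU factorization for all the unit right-hand sides (the test matrix is an assumption):
A = 10*rand(5,5) + 10*eye(5);        % assumed test matrix
n = size(A,1);  I = eye(n);  Ainv = zeros(n);
[L,U,P] = lu(A);                     % factor once
for j = 1:n
    y = L\(P*I(:,j));                % forward substitution
    Ainv(:,j) = U\y;                 % back substitution gives column j of inv(A)
end
norm(A*Ainv - eye(n))                % should be of the order of machine precision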
From this writing we deduce that the initial values are 𝑢₁ᵢ = 𝑎₁ᵢ, 𝑖 = 1, 2, … , 𝑛, and
ℓᵢⱼ = (1/𝑢ⱼⱼ)(𝑎ᵢⱼ − ∑_{k=1}^{j−1} ℓᵢₖ𝑢ₖⱼ),   𝑗 < 𝑖
𝑢ᵢⱼ = 𝑎ᵢⱼ − ∑_{k=1}^{i−1} ℓᵢₖ𝑢ₖⱼ,   𝑗 ≥ 𝑖
It is usual practice to store the multipliers in the lower triangular portion of the
coefficient matrix, replacing the coefficients as they are eliminated (ℓ𝑖𝑗 replacing 𝑎𝑖𝑗 ).
The diagonal elements of 𝑳 do not have to be stored, since it is understood that each of
them is unity. The final form of the coefficient matrix would thus be the following
mixture of 𝑳 and 𝑼:
𝑢11 𝑢12 𝑢13
𝑬1 𝑬2 … 𝑬𝑘 𝑨 = [𝑳\𝑼] = (ℓ21 𝑢22 𝑢23 )
ℓ31 ℓ32 𝑢33
for j = 1:n-1
for i = j+1:n
A(i,j) = A(i,j)/ A(j,j) ;
A(i,j+1:n) = A(i,j+1:n)- A(i,j)* A(j,j+1:n);
end
end
for i = 1:n
for j= 1:n
if i==j
L(i,i)=1; U(i,i) = A(i,i);
elseif i>j
L(i,j)= A(i,j); U(i,j)=0;
else
L(i,j)= 0; U(i,j)= A(i,j);
end
end
end
L
U
Zero=A1-L*U
% Forward substitution (solve L*y = b, overwriting b with y)
for k = 2:n
b(k)= b(k) - L(k,1:k-1)*b(1:k-1);
end
% Back substitution (solve U*x = y)
for k = n:-1:1
b(k) = (b(k) - U(k,k+1:n)*b(k+1:n))/U(k,k);
end
x = b
Zero=A1*x-b1
In linear algebra, the Cholesky
decomposition or Cholesky's factorization is a decomposition of a Hermitian, positive-
definite matrix into the product of a lower triangular matrix and its conjugate
transpose, which is useful for efficient numerical solutions, It was discovered by André-
Louis Cholesky for real matrices. When it is applicable, the Cholesky decomposition is
roughly twice as efficient as the 𝑳𝑼 decomposition for solving systems of linear
equations.
ℓᵢⱼ = (1/ℓⱼⱼ)(𝑎ᵢⱼ − ∑_{k=1}^{j−1} ℓᵢₖℓⱼₖ),  𝑗 < 𝑖,   and   ℓᵢᵢ = √(𝑎ᵢᵢ − ∑_{k=1}^{i−1} ℓᵢₖ²),  𝑖 = 1, 2, … , 𝑛
L(1,1)= sqrt(A(1,1));
L(2:n,1)= A(2:n,1)/L(1,1);
for i = 2:n, k=1:i-1,
L(i,i)= sqrt(A(i,i)-sum(L(i,k).^2));
for j = i+1:n,
L(j,i)= (A(j,i)-sum(L(j,k).*L(i,k)))/L(i,i);
end
end
L*L'
Zero=A1-L*L'
[m,n]=size(A); Q=A; R=zeros(n,n); % classical Gram-Schmidt on the columns of A
for k=1:n,
for i=1:k-1,
R(i,k)=Q(:,i)'*Q(:,k);
end
% remove these two lines for
for i=1:k-1, % modified-Gram-Schmidt
Q(:,k)=Q(:,k)-R(i,k)*Q(:,i);
end
R(k,k)=norm(Q(:,k)); Q(:,k)=Q(:,k)/R(k,k);
end
R
Q
M=A
for i=1:100;
[Q,R] = qr(M);
M=R*Q;
end
R
eig(A)
Some History on QR Algorithm: The 𝑄𝑅 Algorithm is today the most widely used
algorithm for computing eigenvalues and eigenvectors of dense matrices. It allows us to
compute the roots of the characteristic polynomial of a matrix, a problem for which
there is no closed form solution in general for matrices of size bigger than 4 × 4, with a
complexity comparable to solving a linear system with the same matrix.
Heinz Rutishauser discovered some 50 years ago, when testing a computer, that the iteration 𝑨₀ = 𝑨 with 𝑨 ∈ ℝⁿˣⁿ,
𝑨ᵢ = 𝑳ᵢ𝑹ᵢ (Gauss LU decomposition),   𝑨ᵢ₊₁ = 𝑹ᵢ𝑳ᵢ,   𝑖 = 0, 1, 2, …
converges, under suitable conditions, to a triangular matrix whose diagonal entries are the eigenvalues of 𝑨.
This method can be used to compute all the Eigenvalues, vectors, inverse, etc., of any
given matrix.
𝑩₁ = 𝑨,                    𝑃₁ = Trace(𝑩₁)
𝑩₂ = 𝑨(𝑩₁ − 𝑃₁𝑰),          𝑃₂ = (1/2)Trace(𝑩₂)
𝑩₃ = 𝑨(𝑩₂ − 𝑃₂𝑰),          𝑃₃ = (1/3)Trace(𝑩₃)
⋮
𝑩ₙ₋₁ = 𝑨(𝑩ₙ₋₂ − 𝑃ₙ₋₂𝑰),    𝑃ₙ₋₁ = (1/(𝑛−1))Trace(𝑩ₙ₋₁)
𝑩ₙ = 𝑨(𝑩ₙ₋₁ − 𝑃ₙ₋₁𝑰),      𝑃ₙ = (1/𝑛)Trace(𝑩ₙ)
If 𝑨 is nonsingular then the inverse of 𝑨 can be determined by 𝑨⁻¹ = (𝑩ₙ₋₁ − 𝑃ₙ₋₁𝑰)/𝑃ₙ.
clear all, clc, A=10*rand(4,4) + 10*eye(4,4); [m,n]=size(A);
b=100*rand(4,1); t=0; B=A; I=eye(n,n);
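The listing above only prepares the data; a hedged sketch of the missing loop, which follows the recursion B_k = A(B_{k-1} − P_{k-1}I), P_k = Trace(B_k)/k, and the inverse formula stated above:
P = trace(B);                        % P1 = Trace(B1), with B1 = A
for k = 2:n
    Bprev = B;  Pprev = P;
    B = A*(Bprev - Pprev*I);         % B_k = A*(B_{k-1} - P_{k-1}*I)
    P = trace(B)/k;                  % P_k = Trace(B_k)/k
end
Ainv = (Bprev - Pprev*I)/P;          % A^{-1} = (B_{n-1} - P_{n-1}*I)/P_n
x = Ainv*b                           % solution of A*x = b
norm(A*x - b)                        % verification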
So far we have considered direct methods for solution of a system of linear equations.
For sparse matrices, it may not be possible to take advantage of sparsity while using
direct methods, since the process of elimination can make the zero elements nonzero,
unless the zero elements are in a certain well defined pattern. Hence, the number of
arithmetic operations as well as the storage requirement may be the same for sparse
and filled matrices. This requirement may be prohibitive for large matrices and in those
cases, it may be worthwhile to consider the iterative methods.
𝐱 𝑘 = (𝑰 − 𝑸)−1 (𝑰 − 𝑸𝑘 )𝒓 + 𝑸𝑘 𝐱 0
The convergence condition is lim_{k→∞} 𝑸ᵏ = 𝟎, which is guaranteed, for example, when max_{1≤i≤n}{∑_{j=1}^{n}|𝑸(𝑖, 𝑗)|} < 1 (i.e. ‖𝑸‖∞ < 1).
Here we have assumed that all diagonal elements of A are nonzero. This last equation
can be used to define an iterative process, which generates the next approximation 𝐱 𝑘+1
using the previous one on the right-hand side 𝐱 𝑘+1 = −𝑫−1 (𝑳 + 𝑼)𝐱 𝑘 + 𝑫−1 𝒃.
In numerical linear algebra, the Jacobi method is an iterative algorithm for determining
the solutions of a strictly diagonally dominant system of linear equations. A sufficient
(but not necessary) condition for the method to converge is that the matrix 𝑨 is strictly
or irreducibly diagonally dominant. Strict row diagonal dominance means that for each
row, the absolute value of the diagonal term is greater than the sum of absolute values
of other terms. The Jacobi method sometimes converges even if these conditions are not
satisfied.
This iterative process is known as the Jacobi iteration or the method of simultaneous
displacements. The latter name follows from the fact that every element of the solution
vector is changed before any of the new elements are used in the iteration. Hence, both
𝐱 𝑘+1 and 𝐱 𝑘 need to be stored separately. The iterative procedure can be easily
expressed in the component form as
𝑥ₖ₊₁(𝑖) = (1/𝑎(𝑖, 𝑖))(𝑏(𝑖) − ∑_{j=1, j≠i}^{n} 𝑎(𝑖, 𝑗)𝑥ₖ(𝑗))
Algorithm1: The Matrix-based formula
D= diag(A); M= diag(D);
n=max(size(A)); m=min(size(b)); x0=rand(n,m);
for k=1: 10
% x1= inv(M)*(b-(A-M)*x0); % Jacobi method
x1= x0 + inv(M)*(b-A*x0); % Jacobi method 'Alternative writing'
x0=x1;
end
x1
ZERO1=A*x1-b
Important remark: the matrix form of the Jacobi method is given only for explanation; it is not a practical implementation, since it involves the inverse, which is precisely what we are trying to avoid.
𝑳𝐱 = 𝒃 − 𝑼𝐱 or 𝐱 = 𝑳−1 𝒃 − 𝑳−1 𝑼𝐱
This last equation can be used to define an iterative process, which generates the next
approximation 𝐱 𝑘+1 using the previous one on the right-hand side
𝐱 𝑘+1 = 𝑳−1 (𝒃 − 𝑼𝐱 𝑘 )
The convergence properties of the Gauss–Seidel method are dependent on the matrix 𝑨.
Namely, the procedure is known to converge if either:
𝑨 is symmetric positive-definite, or
𝑨 is strictly or irreducibly diagonally dominant.
The Gauss–Seidel method sometimes converges even if these conditions are not
satisfied.
Important remark: the matrix form of the Gauss–Seidel method is given only for explanation; it is not a practical implementation, since it involves the inverse, which is precisely what we are trying to avoid.
However, by taking advantage of the triangular form of 𝑳, the elements of 𝐱 𝑘+1 can be
computed sequentially using forward substitution:
𝑥ₖ₊₁(𝑖) = (1/𝑎(𝑖, 𝑖))(𝑏(𝑖) − ∑_{j=1}^{i−1} 𝑎(𝑖, 𝑗)𝑥ₖ₊₁(𝑗) − ∑_{j=i+1}^{n} 𝑎(𝑖, 𝑗)𝑥ₖ(𝑗))
The procedure is generally continued until the changes made by an iteration are below
some tolerance, such as a sufficiently small residual.
The element-wise formula for the Gauss–Seidel method is extremely similar to that of
the Jacobi method. The computation of 𝑥𝑘+1 (𝑖) uses the elements of 𝐱 𝑘+1 that have
already been computed, and only the elements of 𝐱 𝑘 that have not been computed in
the 𝑘 + 1 iteration. This means that, unlike the Jacobi method, only one storage vector
is required as elements can be overwritten as they are computed, which can be
advantageous for very large problems.
for i=1:iters
for j = 1:size(A,1)
x(j) = (1/A(j,j)) * (b(j) - A(j,:)*x + A(j,j)*x(j));
end
end
x
ZERO1=A*x-b
The component 𝑥𝑖 is computed as for Gauss-Seidel but then averaged with its previous
value.
𝑖−1 𝑛
𝜔
𝑥𝑘+1 (𝑖) = (1 − 𝜔)𝑥𝑘 (𝑖) + (𝑏(𝑖) − ∑ 𝑎(𝑖, 𝑗)𝑥𝑘+1 (𝑗) − ∑ 𝑎(𝑖, 𝑗)𝑥𝑘 (𝑗))
𝑎(𝑖, 𝑖)
𝑗=1 𝑗=𝑖+1
We first mention that one cannot choose the relaxation parameter arbitrarily if one
wants to obtain a convergent method. This general, very elegant result is due to Kahan
from his PhD thesis 1958. For convergence of SOR it is necessary to choose 0 < ω < 2.
for i=1:iters
for j = 1:size(A,1)
x(j) = (1-w)*x(j) + (w/A(j,j))*(b(j) - A(j,:)*x + A(j,j)*x(j));
end
end
x
ZERO1=A*x-b
Remark: The optimal choice of the relaxation parameter ω, is given by Young in his
thesis 1950.
Remark: Gauss-Seidel method 𝐱 𝑘+1 = 𝐟(𝐱𝑘+1 , 𝐱 𝑘 ) is very close to the simple iterative
method 𝐱 𝑘+1 = 𝐟(𝐱𝑘 ) and differs only in that: at any iteration we use just the calculated
components of this iteration. This method can be used for both linear and nonlinear
systems, and its convergence depends on the choice of the starting point and the
Jacobian of mapping, defined by the system.
Gauss − Seidel method 𝐱 𝑘+1 = 𝐟(𝐱 𝑘+1 , 𝐱 𝑘 ) Simple iterative method (Jacobi) 𝐱𝑘+1 = 𝐟(𝐱 𝑘 )
𝑥1 (𝑘 + 1) = f1 (𝑥1 (𝑘), 𝑥2 (𝑘), 𝑥3 (𝑘)) 𝑥1 (𝑘 + 1) = f1 (𝑥1 (𝑘), 𝑥2 (𝑘), 𝑥3 (𝑘))
𝑥2 (𝑘 + 1) = f2 (𝑥1 (𝑘 + 1), 𝑥2 (𝑘), 𝑥3 (𝑘)) 𝑥2 (𝑘 + 1) = f2 (𝑥1 (𝑘), 𝑥2 (𝑘), 𝑥3 (𝑘))
𝑥3 (𝑘 + 1) = f3 (𝑥1 (𝑘 + 1), 𝑥2 (𝑘 + 1), 𝑥3 (𝑘)) 𝑥3 (𝑘 + 1) = f3 (𝑥1 (𝑘), 𝑥2 (𝑘), 𝑥3 (𝑘))
Example: Solve the following set of nonlinear equations by the Gauss-Seidel method
27𝑥 + 𝑒ˣcos(𝑦) − 0.12𝑧 = 3
−0.2𝑥² + 37𝑦 + 3𝑥𝑧 = 6
𝑥² − 0.2𝑦 sin(𝑥) + 29𝑧 = −4
⟺
𝑥ₖ₊₁ = (1/27)(3 + 0.12𝑧ₖ − 𝑒^{𝑥ₖ}cos(𝑦ₖ))
𝑦ₖ₊₁ = (1/37)(6 − 3𝑥ₖ₊₁𝑧ₖ + 0.2(𝑥ₖ₊₁)²)
𝑧ₖ₊₁ = (1/29)(−4 + 0.2𝑦ₖ₊₁sin(𝑥ₖ₊₁) − (𝑥ₖ₊₁)²)
Start by 𝑥 = 𝑦 = 𝑧 = 1
x0=1; y0=1; z0=1; s=1; k=0;               % starting point and iteration counter
while s>0.01
x1=(1/27)*(3+0.12*z0-exp(x0)*cos(y0));
y1=(1/37)*(6-3*x1*z0+ 0.2*x1*x1);
z1=(1/29)*(-4+0.2*y1*sin(x1)-x1*x1);      % divisor 29 comes from the third equation
v1= x1-x0; v2= y1-y0; v3= z1-z0; v=[v1; v2; v3]; s=norm(v);
x0=x1; y0=y1; z0=z1;
k=k+1;
end
k
x=x1, y=y1, z=z1,
Consider the problem of finding the vector 𝐱 that minimizes the scalar function (which
is called the energy of system)
f(𝐱) = (1/2)𝐱ᵀ𝑨𝐱 − 𝒃ᵀ𝐱   with   ∇𝑓(𝐱) = (1/2)(𝑨ᵀ + 𝑨)𝐱 − 𝒃 = 𝑨𝐱 − 𝒃
where the matrix 𝑨 is symmetric and positive definite. Because f(𝐱) is minimized when
its gradient ∇f = 𝑨𝐱 − 𝒃 is zero, we see that minimization is equivalent to solving 𝑨𝐱 = 𝒃.
𝐱 𝑘+1 = 𝐱 𝑘 + 𝛼𝑘 𝐬𝑘
The step length 𝛼𝑘 is chosen so that 𝐱 𝑘+1 minimizes f(𝐱 𝑘+1 ) in the search direction 𝐬𝑘 .
That is, 𝐱 𝑘+1 must satisfy 𝑨𝐱 = 𝒃 so
𝑨(𝐱 𝑘 + 𝛼𝑘 𝐬𝑘 ) = 𝒃
We are still left with the problem of determining the search direction 𝐬𝑘 . Intuition tells
us to choose 𝐬𝑘 = −∇𝑓 = 𝐫𝑘 , since this is the direction of the largest negative change in
𝑓(𝐱). The resulting procedure is known as the method of steepest descent or a gradient
method. Summarizing, the gradient method can be described as follows: given 𝐱 0 ∈ ℝ𝑛 ,
for 𝑘 = 0, 1, … until convergence, compute
▪ 𝐫ₖ = 𝒃 − 𝑨𝐱ₖ
▪ 𝛼ₖ = 𝐬ₖᵀ𝐫ₖ/(𝐬ₖᵀ𝑨𝐬ₖ) = 𝐫ₖᵀ𝐫ₖ/(𝐫ₖᵀ𝑨𝐫ₖ)
▪ 𝐱ₖ₊₁ = 𝐱ₖ + 𝛼ₖ𝐫ₖ
It is not a popular algorithm, due to its slow convergence. In order to increase the speed of convergence, a correction of the search direction has been proposed.
Now the improved version of the steepest descent algorithm is based on the so called A-
conjugacy. A-conjugacy means that a set of nonzero vectors {𝐬0 , 𝐬1 , … , 𝐬𝑛−1 } are conjugate
with respect to the symmetric positive definite matrix A. That is 𝐬𝑖𝑇 𝑨𝐬𝑗 = 0 ∀ 𝑖 ≠ 𝑗. A set
of 𝑛 such vectors are linearly independent and hence span the whole space ℝ𝑛 . The
reason why such A-conjugate sets are important is that we can minimize our quadratic
function f(𝐱) in 𝑛 steps by successively minimizing it along each of the directions. Since
the set of A-conjugate vectors acts as a basis for ℝ𝑛 .
There are several ways to choose such a set. The eigenvectors of 𝑨 form an A-conjugate set, but finding the eigenvectors is a task requiring a lot of computation, so we had better find another strategy. A second alternative is to modify the usual Gram-Schmidt orthogonalization process. This is also not optimal, as it requires storing all the directions. In order to search optimally through a complete set of linearly independent vectors we use the more efficient algorithm, the conjugate gradient method, which is based on the recursive formula 𝐬ₖ₊₁ = 𝐫ₖ₊₁ + 𝛽ₖ𝐬ₖ, and we try to find the constant 𝛽ₖ for which any two successive search directions are conjugate (noninterfering) to each other, meaning 𝐬ₖ₊₁ᵀ𝑨𝐬ₖ = 0.
Substituting 𝐬ₖ₊₁ = 𝐫ₖ₊₁ + 𝛽ₖ𝐬ₖ into 𝐬ₖ₊₁ᵀ𝑨𝐬ₖ = 0 we get (𝐫ₖ₊₁ᵀ + 𝛽ₖ𝐬ₖᵀ)𝑨𝐬ₖ = 0, which yields
𝛽ₖ = −𝐫ₖ₊₁ᵀ𝑨𝐬ₖ/(𝐬ₖᵀ𝑨𝐬ₖ)
▪ Choose 𝐱 0 (any vector will do, but one close to solution results in fewer iterations)
▪ 𝐫0 ← 𝒃 − 𝑨𝐱 0 & 𝐬0 ← 𝐫0
do with 𝑘 = 0, 1, 2, …
▪ 𝛼ₖ ← 𝐬ₖᵀ𝐫ₖ/(𝐬ₖᵀ𝑨𝐬ₖ)
▪ 𝐱ₖ₊₁ ← 𝐱ₖ + 𝛼ₖ𝐬ₖ
▪ 𝐫ₖ₊₁ ← 𝒃 − 𝑨𝐱ₖ₊₁;  if ‖𝐫ₖ₊₁‖ ≤ 𝜀 exit loop (convergence criterion; 𝜀 is the error tolerance)
▪ 𝛽ₖ ← −𝐫ₖ₊₁ᵀ𝑨𝐬ₖ/(𝐬ₖᵀ𝑨𝐬ₖ)
▪ 𝐬ₖ₊₁ ← 𝐫ₖ₊₁ + 𝛽ₖ𝐬ₖ
end do
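A compact MATLAB sketch of this conjugate gradient algorithm; 𝑨 must be symmetric positive definite, so the test matrix below is an assumed SPD construction:
n = 50;  M = rand(n);  A = M'*M + n*eye(n);     % assumed SPD test matrix
b = rand(n,1);  x = zeros(n,1);                 % starting vector x0
r = b - A*x;  s = r;  tol = 1e-10;
for k = 1:n
    As    = A*s;
    alpha = (s'*r)/(s'*As);
    x     = x + alpha*s;
    r     = b - A*x;
    if norm(r) <= tol, break, end
    beta  = -(r'*As)/(s'*As);
    s     = r + beta*s;
end
norm(A*x - b)                                   % residual check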
■ How KRYLOV Subspaces Come Into Play! In 1931 A. N. KRYLOV published a paper
entitled "On the numerical solution of the equation by which the frequency of small
oscillations is determined in technical problems", of course in this work KRYLOV was not
thinking in terms of projection processes, and he was not interested in solving a linear
system. Motivated by an application in the analysis of oscillations of mechanical
systems, he constructed a method for computing the minimal polynomial of a matrix.
Algebraically, his method is based on the following important fact.
Given 𝑨 ∈ 𝔽𝑛×𝑛 and a nonzero vector 𝐯 ∈ 𝔽𝑛 , consider the Krylov sequence {𝐯, 𝑨𝐯, 𝑨2 𝐯, … }
generated by 𝑨 and 𝐯. There then exists a uniquely defined integer 𝑑 = 𝑑(𝑨, 𝐯), so that
the vectors {𝐯, 𝑨𝐯, 𝑨2 𝐯, … 𝑨𝑑−1 𝐯} are linearly independent, and the vectors
{𝐯, 𝑨𝐯, 𝑨²𝐯, … , 𝑨^𝑑𝐯} are linearly dependent.
Krylov's observation can be rephrased in the following way. For each matrix 𝑨 and
vector 𝐯, the Krylov subspaces is defined by 𝒦𝑑 (𝑨, 𝐯) = span{𝐯, 𝑨𝐯, 𝑨2 𝐯, … 𝑨𝑑−1 𝐯}. Most
important iterative techniques for solving large scale linear systems 𝑨𝐱 = 𝒃 are based on
the projection processes onto the Krylov subspace. In short this technique approximate
𝑨−1 𝒃 by 𝑝(𝑨)𝒃 where 𝑝 is a good polynomial.
A general projection method for solving linear system 𝑨𝐱 = 𝒃 is a method which seek an
approximate solution from an affine subspace 𝐱 𝑚 = 𝐱 0 + 𝒦𝑚 of dimension 𝑚 by imposing
the Petrov–Galerkin condition 𝒓 = (𝒃 − 𝑨𝐱 𝑚 ) ⊥ ℒ𝑚 where ℒ𝑚 is another 𝑚 dimensional
subspace. Here, 𝐱 0 represents an arbitrary initial guess to the solution. Krylov subspace
method is a method for which 𝒦𝑚 is a Krylov subspace.
An important fact: Though the non-symmetric linear system is often the motivation for
applying the Lanczos algorithm, the operation the algorithm primarily performs is
tridiagonalization of a matrix, but what about the tridiagonalization process?
For every real square matrix 𝑨 ∈ ℝ𝑛×𝑛 there exists a nonsingular real matrix 𝑿, for
which 𝑻 = 𝑿−1 𝑨𝑿 ∈ ℝ𝑛×𝑛 is a tridiagonal matrix, and under certain conditions 𝑿 is
uniquely determined. Simple proofs for the existence and uniqueness of this
transformation are presented in a paper Angelika Bunse-Gerstner 1982.
However, the use of Krylov subspace method does not guarantee the non-singularity of
the matrix 𝑿, since "𝑚 ≤ 𝑛" the dimension of Krylov subspace is less or equal to the
dimension of the matrix 𝑨. Therefore, we can write 𝑻 = 𝐖 𝑇 𝑨𝐕 ∈ ℝ𝑚×𝑚 with 𝐖, 𝐕 ∈ ℝ𝑛×𝑚 .
Two sets {𝐯𝑖 } & {𝐰𝑖 } of vectors satisfying (𝐰𝑖 𝑇 𝐯𝑗 = 0 ∀ 𝑖 ≠ 𝑗) are said to be biorthogonal
and can be obtained through the following algorithm: (Biswa Nath Datta 2003)
Initialization: Scale the vectors 𝐯 and 𝐰 to get the vectors 𝐯1 and 𝐰1 such that
𝐰1 𝑇 𝐯1 = 1. Set 𝛽1 = 0, 𝛾1 = 0, 𝐰0 = 𝐯0 = 0.
begin: For 𝑘 = 1, 2, . . . , 𝑚 do
𝛼𝑘 = 𝐰𝑘 𝑇 𝑨𝐯𝑘
𝝁𝑘+1 = 𝑨𝐯𝑘 − 𝛼𝑘 𝐯𝑘 − 𝛽𝑘 𝐯𝑘−1
𝜼𝑘+1 = 𝑨𝑇 𝐰𝑘 − 𝛼𝑘 𝐰𝑘 − 𝛾𝑘 𝐰𝑘−1
𝛾ₖ₊₁ = √|𝜼ₖ₊₁ᵀ𝝁ₖ₊₁|;   𝛽ₖ₊₁ = 𝜼ₖ₊₁ᵀ𝝁ₖ₊₁/𝛾ₖ₊₁
𝐰ₖ₊₁ = 𝜼ₖ₊₁/𝛽ₖ₊₁;   𝐯ₖ₊₁ = 𝝁ₖ₊₁/𝛾ₖ₊₁
end
𝐰ᵢᵀ𝐯ⱼ = 0 for 𝑖 ≠ 𝑗   and   𝐰ᵢᵀ𝐯ᵢ = 1
There is some flexibility in choosing the scale factors 𝛽ₖ and 𝛾ₖ. Note that
1 = 𝐰ₖ₊₁ᵀ𝐯ₖ₊₁ = 𝜼ₖ₊₁ᵀ𝝁ₖ₊₁/(𝛽ₖ₊₁𝛾ₖ₊₁)
It follows that once 𝛾ₖ is specified, 𝛽ₖ is given by 𝛽ₖ = (𝜼ₖᵀ𝝁ₖ)/𝛾ₖ; with the "canonical" choice 𝛾ₖ = √|𝜼ₖᵀ𝝁ₖ| we obtain the above algorithm.
If 𝑻ₖ = [𝛼₁ 𝛽₂ 0 ⋯; 𝛾₂ 𝛼₂ ⋱ ⋮; ⋱ ⋱ 𝛽ₖ; 0 ⋯ 𝛾ₖ 𝛼ₖ] then the situation at the bottom of the loop is summarized by the equations
𝑨[𝐯₁, … , 𝐯ₖ] = [𝐯₁, … , 𝐯ₖ]𝑻ₖ + [0 0 ⋯ 𝛾ₖ₊₁𝐯ₖ₊₁]
𝑨ᵀ[𝐰₁, … , 𝐰ₖ] = [𝐰₁, … , 𝐰ₖ]𝑻ₖ + [0 0 ⋯ 𝛽ₖ₊₁𝐰ₖ₊₁]
Remark: If the algorithm does not break down before completion of 𝑛 steps, then,
defining 𝐕𝑛 = [𝐯1 , 𝐯2 , … , 𝐯𝑛 ] and 𝐖𝑛 = [𝐰1 , 𝐰2 , … , 𝐰𝑛 ] with 𝐖𝑛 𝑇 𝐕𝑛 = 𝑰.
■ Derivation of the solution: Assuming that the real general matrix 𝑨 ∈ ℝ𝑛×𝑛 is
transformed to its equivalent form 𝑻 = 𝑾𝑇 𝑨𝑽 such that 𝑻 ∈ ℝ𝑚×𝑚 is presented in Krylov
space of dimension 𝑚, 𝑽 = [𝐯1 , 𝐯2 , … 𝐯𝑚 ] ∈ ℝ𝑛×𝑚 is a matrix whose column-vectors form a
basis of 𝒦𝑚 and 𝑾 = [𝐰1 , 𝐰2 , … 𝐰𝑚 ] ∈ ℝ𝑛×𝑚 is a matrix whose column-vectors form a
basis of ℒ𝑚 with 𝑽𝑾𝑇 = 𝑰.
Let us start the generation of the Krylov space with the following initial vector
𝐯₁ = (𝒃 − 𝑨𝐱₀)/‖𝒃 − 𝑨𝐱₀‖₂ = 𝒓₀/‖𝒓₀‖₂ = 𝑽𝒆₁   with   𝒆₁ = [1, 0, … , 0]ᵀ ∈ ℝᵐ ⟹ 𝒓₀ = ‖𝒓₀‖₂𝑽𝒆₁
The exact solution can then be written as
𝐱 = 𝐱₀ + 𝑨⁻¹𝒓₀
  = 𝐱₀ + 𝑨⁻¹‖𝒓₀‖₂𝑽𝒆₁
  = 𝐱₀ + ‖𝒓₀‖₂𝑽𝑾ᵀ𝑨⁻¹𝑽𝒆₁
  = 𝐱₀ + ‖𝒓₀‖₂𝑽𝑻⁻¹𝒆₁
Let us define a new vector 𝐲 such that 𝐲 is a solution of the system 𝑻𝐲 = ‖𝒓₀‖₂𝒆₁
𝐱 = (𝐱 0 + 𝑽𝐲) ∈ 𝐱 0 + 𝒦𝑚
Once the vector 𝐲 is obtained from the system 𝑻𝐲 = ‖𝒓0 ‖2 𝒆1 we can construct the
solution.
The residual vector of the approximate solution 𝐱 𝑘 computed by the Lanczos Algorithm
is such that
𝒃 − 𝑨𝐱 𝑘 = 𝒃 − 𝑨(𝐱 0 + 𝑽𝑘 𝐲𝑘 )
= 𝒓0 − 𝑨𝑽𝑘 𝐲𝑘
= ‖𝒓0 ‖2 𝑽𝑘 𝒆1 − (𝑽𝑘 𝑻𝑘 𝐲𝑘 + 𝛾𝑘+1 𝐯𝑘+1 𝒆𝑇𝑘 𝐲𝑘 )
= −𝛾𝑘+1 𝐯𝑘+1 𝒆𝑇𝑘 𝐲𝑘
ε = ‖𝒓_k‖₂ / ‖𝒓₀‖₂ = 𝛾_{k+1} |𝒆_kᵀ𝐲_k| · ‖𝐯_{k+1}‖₂ / ‖𝒓₀‖₂
𝑻_k = ( 𝛼₁ 𝛽₂ ; 𝛾₂ 𝛼₂ ⋱ ; ⋱ ⋱ 𝛽_k ; 𝛾_k 𝛼_k ) = ( 1 ; ℓ₂ 1 ; ⋱ ⋱ ; ℓ_k 1 )( 𝑑₁ 𝜇₂ ; 𝑑₂ ⋱ ; ⋱ 𝜇_k ; 𝑑_k ) = 𝑳_k 𝑹_k
Hence the equation 𝑻_k𝐲_k = 𝑳_k(𝑹_k𝐲_k) = 𝑳_k𝝑_k = 𝒛_k can be solved recursively by a forward
substitution with 𝑳_k followed by a back substitution with 𝑹_k.
clear all, clc, A=10*rand(7,7); b=100*rand(7,1); n =length(b);
x0= rand(n,1); toll=0.0001; r0=b-A*x0; nres0=norm(r0,2);
V=r0/nres0; W=V; gamma(1)=0; beta(1)=0; k=1; nres=1;
while k <= n && nres > toll
vk=V(:,k); wk=W(:,k);
if k==1, vk1=0*vk; wk1=0*wk;
else, vk1=V(:,k-1); wk1= W(:,k-1);
end
alpha(k)=wk'*A*vk;
tildev=A*vk-alpha(k)*vk-beta(k)*vk1;
tildew=A'*wk-alpha(k)*wk-gamma(k)*wk1;
gamma(k+1)=sqrt(abs(tildew'*tildev)); % gamma(k+1)=tildez'*tildev;
if gamma(k+1) == 0, k=n+2;
else
beta(k+1)=tildew'*tildev/gamma(k+1);
W=[W,tildew/beta(k+1)];
V=[V,tildev/gamma(k+1)];
end
if k<n+2
if k==1
Tk = alpha;
else
Tk=diag(alpha)+diag(beta(2:k),1)+diag(gamma(2:k),-1);
end
yk=Tk\(nres0*[1,0*[1:k-1]]');
xk=x0+V(:,1:k)*yk;
nres=abs(gamma(k+1)*[0*[1:k-1],1]*yk)*norm(V(:,k+1),2)/nres0;
k=k+1;
else
return
end
end
m=k-1; % The Krylov space dimension
A, Tk, eig(A), eig(Tk), xk,
Zero1=A-V(:,1:m)*Tk*W(:,1:m)' % verification1
Zero2= A*xk-b % verification2
■ Link between Lanczos Method and Krylov Space: From the above algorithm we
know that
𝑨𝐯_k = 𝛽_k𝐯_{k−1} + 𝛼_k𝐯_k + 𝛾_{k+1}𝐯_{k+1} ,   𝛽₁𝐯₀ = 𝟎
𝑨ᵀ𝐰_k = 𝛾_k𝐰_{k−1} + 𝛼_k𝐰_k + 𝛽_{k+1}𝐰_{k+1} ,   𝛾₁𝐰₀ = 𝟎 ,      with k = 1, 2, … , m
Notes on Iterative Methods: Iterative, or indirect methods, start with an initial guess of
the solution x and then repeatedly improve the solution until the change in x becomes
negligible. Since the required number of iterations can be very large, the indirect
methods are, in general, slower than their direct counterparts. A serious drawback of
iterative methods is that they do not always converge to the solution. It can be shown
that convergence is guaranteed only if the coefficient matrix is diagonally dominant. The
initial guess for x plays no role in determining whether convergence takes place: if the
procedure converges for one starting vector, it will do so for any starting vector. The
initial guess affects only the number of iterations that are required for convergence.
Algorithm:
Data: 𝐟(𝐱) = (𝑨𝐱 − 𝒃), 𝐱 0 , 𝐟(𝐱 0 ) and 𝑩0
Result: 𝐱 𝑘
begin: (for k = 0, 1, 2, … until convergence)
        𝐱_{k+1} = (𝑰 − 𝑩_k𝑨)𝐱_k + 𝑩_k𝒃
        𝒔_k = 𝐱_{k+1} − 𝐱_k
        𝐲_k = 𝑨(𝐱_{k+1} − 𝐱_k) = 𝑨𝒔_k
        𝑩_{k+1} = 𝑩_k + ( (𝒔_k − 𝑩_k𝐲_k) / (𝒔_kᵀ𝑩_k𝐲_k) ) 𝒔_kᵀ𝑩_k
end
This algorithm is the author's own contribution and, to the best of our knowledge, it has
not previously been applied to linear systems.
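A minimal MATLAB sketch of this iteration is given below. The test data, the choice 𝑩₀ = 𝑰, and the safeguard on the denominator 𝒔_kᵀ𝑩_k𝐲_k are illustrative assumptions, not part of the algorithm statement above.
% Sketch of the above iteration for A*x = b (illustrative test data and choices)
clear all, clc, n=6; A=rand(n)+n*eye(n); b=rand(n,1);   % assumed test system
x=zeros(n,1); B=eye(n);                                 % assumed B0 = I
for k=1:200
    xnew = (eye(n)-B*A)*x + B*b;        % x_{k+1} = (I - B_k A) x_k + B_k b
    s = xnew - x;  y = A*s;             % s_k and y_k = A (x_{k+1} - x_k)
    den = s'*B*y;
    if abs(den)>1e-12                   % skip the update if the denominator is tiny
        B = B + ((s - B*y)/den)*(s'*B); % rank-one update of B_k
    end
    x = xnew;
    if norm(b-A*x)<1e-10, break; end
end
x, residual = norm(b-A*x)
As noted above, convergence is not guaranteed in general; the sketch simply stops when the residual becomes small or after a fixed number of iterations.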
Eigenvalue problems that originate from physical problems often end up with a
symmetric 𝑨. This is fortunate, because symmetric eigenvalue problems are much
easier to solve than their non-symmetric counterparts.
For many real situations, the eigenvalue problem does not arise in the standard form
𝑨𝐱 = 𝜆𝐱 but rather in the form 𝑨𝐱 = 𝜆𝑩𝐱 Where 𝑨 & 𝑩 are two symmetric matrices. It is
much more convenient if the equation 𝑨𝐱 = 𝜆𝑩𝐱 can be converted to the standard form.
The main algorithms for actually computing eigenvalues and eigenvectors are the 𝑄𝐷
algorithm (by Heinz Rutishauser ), the power method, the Householder Transformation,
the 𝐿𝑅 algorithm of Rutishauser and the powerful 𝑄𝑅 algorithm of Francis.
Let 𝑨 ∈ ℂ𝑛×𝑛 be a diagonalizable matrix and let 𝑿 = [𝐱1 , … , 𝐱 𝑛 ] ∈ ℂ𝑛×𝑛 be the matrix of its
eigenvectors 𝐱 𝑖 , for 𝑖 = 1, . . . , 𝑛. Let us also suppose that the eigenvalues of 𝑨 are ordered
as |𝜆1 | > |𝜆2 | > ⋯ > |𝜆𝑛 | where 𝜆1 has algebraic multiplicity equal to 1. Under these
assumptions, 𝜆1 is called the dominant eigenvalue of matrix 𝑨.
Let us analyze the convergence properties of this method. By induction on 𝑘 one can
check that
𝒒_k = 𝑨^k𝒒₀ / ‖𝑨^k𝒒₀‖₂ ,   k ≥ 1
This relation explains the role played by the powers of A in the method. Because A is
diagonalizable, its eigenvectors 𝐱 𝑖 form a basis of ℂ𝑛 ; it is thus possible to represent 𝒒0
as
𝒒₀ = Σ_{i=1}^{n} 𝛼_i𝐱_i ,   𝛼_i ∈ ℂ ,   i = 1, 2, … , n
Since |𝜆_i/𝜆₁| < 1 for i = 2, … , n, as k increases the vector 𝑨^k𝒒₀ acquires an increasingly
significant component in the direction of the eigenvector 𝐱₁ (assuming 𝛼₁ ≠ 0): for large k,
𝑨^k𝒒₀ ≈ 𝛼₁𝜆₁^k𝐱₁
and therefore
lim_{k→∞} 𝒒_k = lim_{k→∞} 𝑨^k𝒒₀ / ‖𝑨^k𝒒₀‖₂ = 𝛼₁𝜆₁^k𝐱₁ / (𝛼₁𝜆₁^k‖𝐱₁‖₂) = 𝐱₁ / ‖𝐱₁‖₂
As 𝑘 → ∞, the vector 𝒒𝑘 thus aligns itself along the direction of eigenvector 𝐱1 . Therefore
the sequence of Rayleigh quotients 𝜂𝑘 will converge to 𝜆1 .
lim_{k→∞} 𝜂_k = lim_{k→∞} 𝒒_kᴴ𝑨𝒒_k = (𝐱₁/‖𝐱₁‖₂)ᴴ 𝑨 (𝐱₁/‖𝐱₁‖₂) = 𝐱₁ᴴ𝜆₁𝐱₁ / ‖𝐱₁‖₂² = 𝜆₁
and the convergence will be faster when the ratio |𝜆2 /𝜆1 | is smaller.
Example: the matrix 𝑨 = ( 4 5 ; 6 5 ) needs only 6 steps (the ratio |𝜆₂/𝜆₁| is 0.1), while the
matrix 𝑨 = ( −4 10 ; 7 5 ) needs 68 steps (the ratio is 0.9).
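A compact MATLAB sketch of the power iteration with the Rayleigh-quotient estimate 𝜂_k, using the first matrix of the example, may be written as follows (the tolerance and iteration cap are assumptions):
% Power iteration with Rayleigh-quotient estimate (illustrative sketch)
clear all, clc, A=[4 5;6 5]; q=rand(2,1); q=q/norm(q); tol=1e-10;
for k=1:200
    z = A*q;  q = z/norm(z);      % q_k = A^k q_0 / ||A^k q_0||_2
    eta = q'*A*q;                 % Rayleigh quotient, converges to lambda_1
    if norm(A*q - eta*q) < tol, break; end
end
k, eta, q                         % dominant eigenvalue (here 10) and its eigenvector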
Similarly, the left eigenvector can be obtained using the following algorithm
𝒘_k = 𝑨ᵀ𝒑_{k−1} ,   𝒑_k = 𝒘_k/‖𝒘_k‖₂ ,   𝜂_k = 𝒑_kᵀ𝑨ᵀ𝒑_k      ⟺      𝒗_k = 𝒑_{k−1}ᵀ𝑨 ,   𝒑_kᵀ = 𝒗_k/‖𝒗_k‖₂ ,   𝜂_k = 𝒑_kᵀ𝑨𝒑_k
Assume that the matrix 𝑨 is diagonalizable and let 𝐱 𝑖 , 𝐲𝑖 𝑇 be the right and left
eigenvectors respectively, means that 𝑨𝐱 𝑖 = 𝜆𝑖 𝐱 𝑖 , 𝐲𝑖 𝑇 𝑨 = 𝜆𝑖 𝐲𝑖 𝑇 , then by using the
spectral decomposition theorem
𝑨 = 𝑿𝜦𝒀ᵀ = [𝐱₁ 𝐱₂ … 𝐱_n] 𝜦 [𝐲₁ᵀ; 𝐲₂ᵀ; … ; 𝐲_nᵀ] = [𝜆₁𝐱₁ 𝜆₂𝐱₂ … 𝜆_n𝐱_n][𝐲₁ᵀ; 𝐲₂ᵀ; … ; 𝐲_nᵀ] = Σ_{i=1}^{n} 𝜆_i𝐱_i𝐲_iᵀ   with   Σ_{i=1}^{n} 𝐱_i𝐲_iᵀ = 𝑰
Now use power iteration to find 𝐱1 and 𝜆1 then let 𝑨2 ← 𝑨 − 𝜆1 𝐱1 𝐲1 𝑇 repeat power
iteration on 𝑨2 to find 𝐱 2 and 𝜆2 continue like this for 𝜆3 , . . . , 𝜆𝑛 .
Remark: This deflation procedure is a good approximation only when the matrix 𝑨 is symmetric,
i.e. 𝑨 = 𝑿𝜦𝑿ᵀ (so that 𝐲_i = 𝐱_i). When 𝑨 is not symmetric the method may fail.
In order to obtain also the inverse of the symmetric matrix 𝑨 we use the following
iteration 𝑨 ← 𝑨 − 𝜆𝑖 𝐱 𝑖 𝐱 𝑖 𝑇 + (1/𝜆𝑖 )𝐱 𝑖 𝐱 𝑖 𝑇 . (Proposed by the author)
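For a symmetric matrix, the deflation 𝑨 ← 𝑨 − 𝜆_i𝐱_i𝐱_iᵀ described above can be combined with the power iteration to recover all eigenvalues one by one. The following MATLAB sketch illustrates this; the test matrix and the fixed inner iteration count are assumptions.
% Power iteration with deflation for a symmetric matrix (illustrative sketch)
clear all, clc, n=4; M=rand(n); A=M+M';   % assumed symmetric test matrix
B=A; lam=zeros(n,1); X=zeros(n,n);
for i=1:n
    q=rand(n,1); q=q/norm(q);
    for k=1:500, q=B*q; q=q/norm(q); end  % power iteration on the deflated matrix
    lam(i)=q'*B*q; X(:,i)=q;
    B = B - lam(i)*(q*q');                % deflation: remove the eigenpair just found
end
[sort(lam) sort(eig(A))]                  % the two columns should agree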
Choose 𝐱 0
for 𝑘 = 1,2, … , 𝑚 (until convergence)
solve 𝑨𝐱 𝑘+1 = 𝐱 𝑘
normalize 𝐱 𝑘+1 ∶= 𝐱 𝑘+1 /‖𝐱 𝑘+1 ‖
        𝜆_k = 𝐱_{k+1}ᵀ𝑨𝐱_{k+1} / (𝐱_{k+1}ᵀ𝐱_{k+1})
end
For large matrices, one can save operations if we compute the 𝐿𝑈 decomposition of the
matrix 𝑨 only once. The iteration is performed using the factors 𝑳 and 𝑼. This way, each
iteration needs only 𝑂(𝑛2 ) operations, instead of 𝑂(𝑛3 ) with the program above.
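As an illustration of this remark, the inverse iteration sketched above can reuse a single LU factorization of 𝑨 in every step; the test matrix and the fixed number of iterations below are assumptions.
% Inverse power iteration reusing one LU factorization (illustrative sketch)
clear all, clc, n=5; A=rand(n)+n*eye(n); x=rand(n,1); x=x/norm(x);
[L,U,P]=lu(A);                      % factor A only once: P*A = L*U
for k=1:100
    x = U\(L\(P*x));                % solve A*x_{k+1} = x_k with the stored factors
    x = x/norm(x);                  % normalize
    lam = (x'*A*x)/(x'*x);          % Rayleigh quotient
end
lam, min(abs(eig(A)))               % |lam| approaches the smallest eigenvalue modulus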
As any vector 𝐯 can be normalized 𝐯 = 𝐯/‖𝐯‖ to have a unit norm, the Householder
matrix defined above can be written as:
𝑯 = 𝑰 − 2𝒘𝒘ᵀ = 𝑰 − 2 𝐯𝐯ᵀ/‖𝐯‖²
We define specifically a vector 𝐯 = 𝐱 − ‖𝐱‖𝐞₁, where 𝐱 is any n-dimensional vector and 𝐞₁ is
the first standard basis vector, with all elements equal to zero except the first one. The
norm squared of this vector is ‖𝐯‖² = 2(‖𝐱‖² − ‖𝐱‖x₁), and with this choice a short computation
gives 𝑯𝐱 = ‖𝐱‖𝐞₁. We see that all elements in 𝐱 except the first one are eliminated to zero.
This feature of the Householder transformation is the reason why it is widely used.
Let 𝐜𝑖 be the columns of the matrix 𝑨 = [𝐜1 𝐜2 … 𝐜𝑛 ], it is very easy to observe that if we let
𝐯 = 𝐜1 − ‖𝐜1 ‖𝐞1 then
(𝑰 − 2𝐯𝐯ᵀ/‖𝐯‖²) 𝐜₁ = ‖𝐜₁‖𝐞₁ ⟹ 𝑯₁𝑨 = [‖𝐜₁‖𝐞₁ ⋮ 𝐜̂₂ ⋮ ⋯ ⋮ 𝐜̂_n] = ( ⋆ ⋆ ⋯ ⋆ ; 0 ⋆ ⋯ ⋆ ; ⋮ ⋮ ⋱ ⋮ ; 0 ⋆ ⋯ ⋆ ) = ( ‖𝐜₁‖  ⋆ᵀ ; 𝟎  𝑨′ )
𝑯₂ = ( 1  𝟎ᵀ ; 𝟎  𝑰 − 2𝐯₂𝐯₂ᵀ/‖𝐯₂‖² ) ,  where 𝐯₂ is built in the same way from the first column of the
reduced matrix 𝑨′ (i.e. from the second column of the current matrix below the first row).
If we redo the process k times we obtain 𝑨_k = 𝑯_k𝑨_{k−1} ⟹ 𝑨_k = 𝑯_k ⋯ 𝑯₂𝑯₁𝑨₀. A matrix
𝑨 = 𝑨₀ can be reduced step by step using these unitary "Householder matrices" 𝑯_k into
an upper triangular matrix 𝑨_{n−1} = 𝑯_{n−1} ⋯ 𝑯₂𝑯₁𝑨₀ = 𝑹.
𝑯𝑨 = 𝑹 or 𝑨 = 𝑯−1 𝑹 = 𝑸𝑹
Example:
× × × × × × × × × × × × × × × × × × × ×
× × × × × 𝑯1 0 × × × × 𝑯2 0 × × × × 𝑯3 0 × × × ×
× × × × × → 0 × × × × → 0 0 × × × → 0 0 × × ×
× × × × × 0 × × × × 0 0 × × × 0 0 0 × ×
(× × × × ×) (0 × × × ×) (0 0 × × ×) (0 0 0 × ×)
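The elimination pattern of the example can be programmed directly. The following MATLAB sketch builds 𝑸 and 𝑹 by accumulating Householder reflectors; the test matrix is arbitrary, and the sign convention 𝐯 = 𝐱 + sign(x₁)‖𝐱‖𝐞₁ (rather than 𝐱 − ‖𝐱‖𝐞₁ as in the text) is used only for numerical stability.
% QR factorization by Householder reflectors (illustrative sketch)
clear all, clc, A=rand(5,5); [m,n]=size(A); R=A; Q=eye(m);
for k=1:n
    x = R(k:m,k); s = sign(x(1)); if s==0, s=1; end
    v = x; v(1) = v(1) + s*norm(x);                 % Householder vector
    v = v/norm(v);
    R(k:m,k:n) = R(k:m,k:n) - 2*v*(v'*R(k:m,k:n));  % R <- H_k * R
    Q(:,k:m)   = Q(:,k:m)   - 2*(Q(:,k:m)*v)*v';    % Q <- Q * H_k
end
norm(Q*R-A), norm(Q'*Q-eye(m))                      % verification: A = QR, Q orthogonal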
We now discuss two alternative methods that can be used to compute the thin QR
factorization 𝑨 = 𝑸𝑹 = [𝑸₁ 𝑸₂] ( 𝑹₁ ; 𝟎 ) = 𝑸₁𝑹₁ directly.
𝒒_k = ( 𝒂_k − Σ_{i=1}^{k−1} r_{ik}𝒒_i ) / r_{kk}
Thus, we can think of 𝒒_k as a unit vector in the direction of 𝒛_k = 𝒂_k − Σ_{i=1}^{k−1} r_{ik}𝒒_i. We
choose r_{ik} = 𝒒_iᵀ𝒂_k for i = 1, 2, … , k − 1 and r_{kk} = ‖𝒛_k‖. This leads to the classical Gram-
Schmidt (CGS) algorithm for computing 𝑨 = 𝑸₁𝑹₁.
Unfortunately, the CGS method has very poor numerical properties in that there is
typically a severe loss of orthogonality among the computed 𝒒𝑖 . Interestingly, a
rearrangement of the calculation, known as modified Gram Schmidt (MGS), yields a
much sounder computational procedure. In the k-th step of MGS, the k-th column of 𝑸
(denoted by 𝒒_k) and the k-th row of 𝑹 (denoted by 𝒓_kᵀ) are determined. (see Gene H.
Golub and Charles F. Van Loan 1996).
% Modified Gram-Schmidt Orthogonalization
A = rand(6,4); [m,n] = size(A);          % test matrix (any full-rank A, assumed here)
Q = A; R = zeros(n,n);                    % Q is overwritten column by column
for k=1:n,
    for i=1:k-1,
        R(i,k)=Q(:,i)'*Q(:,k);
        Q(:,k)=Q(:,k)-R(i,k)*Q(:,i);
    end
    R(k,k)=norm(Q(:,k)); Q(:,k)=Q(:,k)/R(k,k);
end
R, Q,
𝑯1 = 𝑳1 𝑹1
And the upper triangular matrix can be expressed as 𝑹1 = 𝑳1−1 𝑯1. If we now multiply
this equation on the right by 𝑳1 we get 𝑹1 𝑳1 = 𝑳1−1 𝑯1 𝑳1
𝑯_k = 𝑳_k⁻¹𝑯_{k−1}𝑳_k ⟺ 𝑯_k = (𝑳_k ⋯ 𝑳₂𝑳₁)⁻¹ 𝑯₁ (𝑳_k ⋯ 𝑳₂𝑳₁)
Finally we deduce that 𝑯𝑘 will tend to an upper triangular matrix, and from linear
algebra we know that the eigenvalues of an upper triangular matrix are the elements of
the main diagonal. The eigenvalues of 𝑯1 appear as the diagonal terms of this upper-
triangular matrix because 𝑯𝑘 and 𝑯1 are similar to each other.
Set 𝑨1 = 𝑨,
for 𝑘 = 1,2, . .. (until convergence)
Compute 𝑨𝑘 = 𝑸𝑘 𝑹𝑘
Set 𝑨𝑘+1 = 𝑹𝑘 𝑸𝑘
end
That is, compute the 𝑄𝑅 factorization of 𝑨, then reverse the factors, then compute the
𝑄𝑅 factorization of the result, before reversing the factors, and so on.
It turns out that the sequence 𝑨1 , 𝑨2 ,…, have the same eigenvalues and for any large
integer 𝑘 the matrix 𝑨𝑘 is usually close to being upper-triangular. Since the eigenvalues
of an upper-triangular matrix lie on its diagonal, the iteration above will allow us to
read off the eigenvalues of A from the diagonal entries of 𝑨𝑘 . Once we have the
eigenvalues, the eigenvectors can be computed, for example, by an inverse power
iteration.
clear all, clc, A =rand(100,100); M=A;
% n=7; C=diag(ones(n-1,1),1); A=4*diag(ones(n,1))+C+C'; M=A
% u = [1 -15 85 -225 274 -120]; A = compan(u); [m,n]=size(A); M=A;
for i=1:500;
[Q,R] = qr (M); % [Q,R] = lu (M);
M=R*Q;
end
R; eig(A) ;
spy(abs(M)>1e-4),
Warning: The sequence 𝑨₁, 𝑨₂, … computed by the unshifted QR algorithm above does
not always converge. For example, consider the matrix 𝑨 = 𝑨₁ = ( 0 1 ; 1 0 ). In this example,
𝑨_k = 𝑨₁ for all k and the unshifted QR algorithm stagnates. Below, we will fix that with
Wilkinson shifts. (Note that Rayleigh quotient shifts do not fix this example.)
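One common way to realize a Wilkinson-type shift is to take, at each step, the eigenvalue of the trailing 2-by-2 block that is closest to the last diagonal entry, instead of the Rayleigh shift s = A(n,n) used in the code further below. A small sketch on the stagnating example above (the iteration cap and tolerance are assumptions):
% Wilkinson-type shift on the stagnating 2x2 example (illustrative sketch)
clear all, clc, A=[0 1;1 0]; n=length(A);
for iter=1:20
    ev = eig(A(n-1:n,n-1:n));             % eigenvalues of the trailing 2x2 block
    [~,idx] = min(abs(ev - A(n,n)));
    s = ev(idx);                          % shift: the eigenvalue closest to A(n,n)
    [Q,R] = qr(A - s*eye(n));
    A = R*Q + s*eye(n);                   % one shifted QR step
    if abs(A(n,n-1)) < 1e-12, break; end
end
A                                         % the eigenvalues +1 and -1 now appear on the diagonal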
It is well known how any matrix can be transformed into an upper Hessenberg form
𝑨𝐻 = 𝑸𝑻 𝑨 𝑸 (i.e. An upper Hessenberg matrix is also called an almost upper triangular
matrix) by an orthogonal similarity transformation. To find all the eigenvalues of the
original matrix, then the 𝑄𝑅 algorithm can be applied to this upper Hessenberg matrix.
𝑨_H = 𝑸ᵀ𝑨𝑸 = ( ⋆ ⋆ ⋆ ⋯ ⋆ ; ⋆ ⋆ ⋆ ⋯ ⋆ ; 0 ⋆ ⋆ ⋯ ⋆ ; ⋮ ⋱ ⋱ ⋱ ⋮ ; 0 ⋯ 0 ⋆ ⋆ )
Example:
clear all, clc, A =rand(8,8);
H = hess(A)
spy(abs(H)>1e-4),
H =
0.1150 -0.9702 -0.7154 0.2833 0.1072 -0.0011 0.3260 0.3508
-1.7483 2.8283 1.2499 -0.0915 -0.1518 0.0410 0.4229 0.4616
0 1.0766 0.7553 -0.3348 -0.6033 0.2523 0.2128 0.3057
0 0 0.5410 -0.1267 -0.1155 -0.1975 0.0631 0.1153
0 0 0 -0.2507 0.1048 -0.0259 -0.4680 -0.0875
0 0 0 0 0.4812 0.6869 0.1871 0.0145
0 0 0 0 0 -0.3801 -0.1686 -0.0030
0 0 0 0 0 0 -0.2471 0.0837
Remark: A Hessenberg matrix consists of an upper triangular form with one additional band
of elements immediately below the main diagonal, and no more.
The Householder reduction to Hessenberg form
H=Q'*A*Q
Again: A matrix structure that is close to upper triangular form and that is preserved
by the QR algorithm is the Hessenberg form.
We have seen how the QR algorithm for computing the Schur form of a matrix A can be
executed more economically if the matrix 𝑨 is first transformed to Hessenberg form.
Now we want to show how the convergence of the Hessenberg QR algorithm can be
improved dramatically by introducing (spectral) shifts into the algorithm.
■ Shifted QR algorithm: The QR algorithm computes the real Schur form of a matrix,
a canonical form that displays eigenvalues but not eigenvectors. Consequently,
additional computations usually must be performed if information regarding invariant
subspaces is desired. The convergence of the QR algorithm depends on ratios between
eigenvalues such as 𝜆₁/𝜆₂, so in order to accelerate the rate of convergence Jim Wilkinson
proposed introducing a shift factor 𝜇 such that δ = (𝜆₁ − 𝜇)/(𝜆₂ − 𝜇) ≤ 𝜆₁/𝜆₂.
Oversimplifying: shifts are used in order to obtain faster convergence. In other words, if
𝜆₁, 𝜆₂ are two eigenvalues and you shift, then the relevant ratio becomes δ = (𝜆₁ − 𝜇)/(𝜆₂ − 𝜇),
and the convergence speed increases whenever 𝜇 is close to 𝜆₁.
If we define a new matrix 𝑯_k = 𝑨_k − 𝜇_k𝑰, then the eigenvalues of 𝑯_k are 𝜆_k − 𝜇_k and the
ratio becomes (𝜆_k − 𝜇_k)/(𝜆_{k−1} − 𝜇_{k−1}); the algorithm is then as follows.
One can still check that this algorithm preserves the upper-Hessenberg structure and
produces a sequence of similar matrices. The idea of the shift is to quickly make
𝑨𝑘 (𝑛, 𝑛 − 1) converge to zero. A reasonable choice of the shift is the Rayleigh quotient,
where 𝜇𝑘 = 𝑨𝑘 (𝑛, 𝑛), because we would like 𝜇𝑘 to be an estimate for eigenvalue of 𝑨.
A= H;
for n = length(A):-1:1
% QR iteration
while sum(abs(A(n,1:n-1)))>eps
s = A(n,n);
[Q,R] = qr(A-s*eye(n));
A = R*Q + s*eye(n);
end
% Deflation
d(n) = A(n,n);
A = A(1:n-1,1:n-1);
end
d = sort(d,'descend')   % sort the computed eigenvalues in decreasing order
The Arnoldi method belongs to a class of linear algebra algorithms that give a partial
result after a small number of iterations, in contrast to so-called direct methods which
must complete to give any useful results (see for example, Householder transformation).
The partial result in this case being the first few vectors of the basis the algorithm is
building. When applied to Hermitian matrices it reduces to the Lanczos algorithm. The
Arnoldi iteration was invented by W. E. Arnoldi in 1951.
One way to extend the Lanczos process to unsymmetric matrices is due to Arnoldi
(1951) and revolves around the Hessenberg reduction 𝑸𝑇 𝑨𝑸 = 𝑯. In particular, if
𝑸 = [𝒒1 , … , 𝒒𝑛 ] and we compare columns in 𝑨𝑸 = 𝑸𝑯 , then
𝑨𝒒_k = Σ_{i=1}^{k+1} h_{ik}𝒒_i ,   1 ≤ k ≤ n − 1
𝒒𝑘+1 = 𝒓𝑘 /ℎ(𝑘 + 1, 𝑘)
where ℎ(𝑘 + 1, 𝑘) = ‖𝒓𝑘 ‖2 . These equations define the Arnoldi process and in strict
analogy to the symmetric Lanczos process we obtain:
𝑟0 = 𝑞1 , ℎ10 = 1, 𝑘 = 0,
while (if ℎ(𝑘 + 1, 𝑘) ≠ 0)
𝒒𝑘+1 = 𝒓𝑘 /ℎ(𝑘 + 1, 𝑘)
        𝑘 = 𝑘 + 1
𝒓𝑘 = 𝑨𝒒𝑘
for 𝑖 = 1: 𝑘
ℎ𝑖𝑘 = 𝒒𝑖 𝑇 𝒓𝑘
𝒓𝑘 = 𝒓𝑘 − ℎ𝑖𝑘 𝒒𝑖
end
ℎ(𝑘 + 1, 𝑘) = ‖𝒓𝑘 ‖2
end
% Arnoldi iteration for Krylov subspaces.
% Input: A is a square matrix (n by n)
% u is initial vector and m is the number of iterations
% Output: Q orthonormal basis of Krylov space (n by m+1)
% H upper Hessenberg matrix, A*Q(:,1:m)=Q*H(m+1 by m)
clear all, clc, A=randi(10,4); n=length(A); m=n-1; Q=zeros(n,m+1);
H=zeros(m+1,m); u=rand(n,1); Q(:,1) = u/norm(u);
for j = 1:m
r = A*Q(:,j);
for i = 1:j
H(i,j) = Q(:,i)'*r;
r = r - H(i,j)*Q(:,i);
end
H(j+1,j) = norm(r);
Q(:,j+1) = r/H(j+1,j);
end
Q, triu(Q'*A*Q,-1) % display H: zero the entries below the first subdiagonal (exact Hessenberg form)
Interpretation: How can Arnoldi's algorithm be interpreted as generating Krylov
subspaces, and what is the relationship between them? From the algorithm we know
that: 𝒒𝑘+1 = 𝒓𝑘 /ℎ(𝑘 + 1, 𝑘) and 𝒓𝑘 = 𝑨𝒒𝑘 ⟹ 𝒒𝑘+1 = 𝛼𝑘 𝑨𝒒𝑘 with 𝛼𝑘 = 1/ℎ(𝑘 + 1, 𝑘)
⟹ 𝒒𝑘 = (𝛼1 𝛼2 … 𝛼𝑘 )𝑨𝑘 𝒒1 = 𝛽𝑘 𝑨𝑘 𝒒1
We assume that 𝒒1 is a given unit 2-norm starting vector. The 𝒒𝑘 are called the Arnoldi
vectors and they define an orthonormal basis for the Krylov subspace 𝐾𝑛 (𝑨, 𝒒1 , 𝑘):
𝑨𝑸𝑘 = 𝑸𝑘 𝑯𝑘 + 𝒓𝑘 𝒆𝑇𝑘
The Arnoldi iteration has two roles ❶ the basis of many of the iterative algorithms of
numerical linear algebra ❷ find eigenvalues of non-Hermitian matrices (i.e. using QR).
𝐴 = 𝑔𝑎𝑙𝑙𝑒𝑟𝑦(′𝑤𝑎𝑡ℎ𝑒𝑛′, 𝑛𝑥, 𝑛𝑦) returns a sparse, random, n-by-n finite element matrix
where 𝑛 = 3 ⋆ 𝑛𝑥 ⋆ 𝑛𝑦 + 2 ⋆ 𝑛𝑥 + 2 ⋆ 𝑛𝑦 + 1. Matrix A is precisely the “consistent mass
matrix” for a regular nx-by-ny grid of 8-node (serendipity) elements in two dimensions.
A is symmetric, positive definite for any (positive) values of the “density” 𝑟ℎ𝑜(𝑛𝑥 , 𝑛𝑦 ),
which is chosen randomly. (see MATLAB toolBox Sparse Matrix Reordering)
load barbellgraph.mat
S = A + speye(size(A)); pct = 100/numel(A);
spy(S)
title('A Sparse Symmetric Matrix')
nz = nnz(S);
xlabel(sprintf('Nonzeros = %d (%.3f%%)',nz,nz*pct));
CHAPTER VII:
Introduction to Nonlinear Systems
and Numerical Optimization
(Plus Metaheuristics and Evolutionary Algorithms)
Introduction to Nonlinear Systems and
Numerical Optimization
Optimization problems arise in almost every field, where numerical
information is processed (Science, Engineering, Mathematics, Economics, Commerce,
etc.). In Science, optimization problems arise in data fitting, in variational principles,
and in the solution of differential and integral equations by expansion methods.
Engineering applications are in design problems, which usually have constraints in the
sense that variables cannot take arbitrary values. For example, while designing a bridge
an engineer will be interested in minimizing the cost, while maintaining certain
minimum strength for the structure. Even the strength of materials used will have a
finite range depending on what is available in the market. Such problems with
constraints are more difficult to handle than the simple unconstrained optimization
problems, which very often arise in scientific work. In most problems, we assume the
variables to be continuously varying, but some problems require the variables to take
discrete values (H M Antia 1995).
Mainly there are two different strategies for computing next iteration from the previous
one which are used most frequently in nowadays available optimization algorithms. The
first one is the line search strategy in which the algorithm chooses a direction 𝒅𝑘 and
then searches along this direction for the lower function value. The second one is called
the trust region strategy in which the information gathered about the objective function
is used to construct a model function whose behavior near the current iterate is trusted
to be similar enough to the actual function. Then the algorithm searches for the
minimizer of the model function inside the trust region.
Most optimization problems require the global minimum to be found, but most of the
methods that we are going to describe here will only find a local minimum. The function
has a local minimum at a point where it assumes the lowest value in a small
neighborhood of the point, which is not at the boundary of that neighborhood. To find a
global minimum we normally try several different starting points, or resort to the randomized
global search procedures described later in this chapter, and keep the best local minimum found.
In this chapter, we consider methods for minimizing or maximizing a function of several
variables, that is, finding those values of the coordinates for which the function takes
on the minimum or the maximum value.
Definition A continuous function f: ℝⁿ ⟶ ℝ is said to be continuously differentiable at 𝐱 ∈ ℝⁿ
if (∂f/∂x_i)(𝐱) exists and is continuous for i = 1, . . . , n; the gradient of f at 𝐱 is then defined as
∇f(𝐱) = [ ∂f/∂x₁   ∂f/∂x₂   …   ∂f/∂xₙ ]ᵀ
and there exists 𝒛 ∈ (𝐱, 𝐱 + 𝒑) such that f(𝐱 + 𝒑) − f(𝐱) = (∇f(𝒛))ᵀ𝒑.
Example: Let f: ℝ² ⟶ ℝ, f(𝐱) = x₁² − 2x₁ + 3x₁x₂² + 4x₂³, 𝐱_c = (1, 1)ᵀ, 𝒑 = (−2, 1)ᵀ. Then
∇f(𝐱) = ( 2x₁ − 2 + 3x₂² ; 6x₁x₂ + 12x₂² )
If we let g(t) = f(𝐱_c + t𝒑) = f(1 − 2t, 1 + t) = 6 + 12t + 7t² − 2t³, the reader can
verify that f(𝐱 + 𝒑) − f(𝐱) = (∇f(𝒛))ᵀ𝒑 is true for 𝒛 = 𝐱_c + t𝒑 with t = (7 − √19)/6 ≈ 0.44.
Example: Computing directional derivatives.
Let 𝑧 = 14 − 𝑥 2 − 𝑦 2 and let 𝑃 = (1,2). Find
the directional derivative of f, at 𝑃, in the
following directions:
■ The surface is plotted in Figure above, where the point 𝑃 = (1,2) is indicated in the
𝑥, 𝑦_plane as well as the point (1,2,9) which lies on the surface of f. We find that
∂f/∂x |_(1,2) = −2x |_{x=1} = −2 ,      ∂f/∂y |_(1,2) = −2y |_{y=2} = −4
Let 𝑢⃗₁ be the unit vector that points from the point (1,2) to the point Q = (3,4), as shown
in the figure. The vector PQ⃗ = ⟨2,2⟩; the unit vector in this direction is 𝑢⃗₁ = ⟨1/√2, 1/√2⟩.
Thus the directional derivative of f at (1,2) in the direction of 𝑢⃗₁ is
D_{𝑢⃗₁}f(𝐱) = (∇f(𝐱))ᵀ𝑢⃗₁ = (−2)(1/√2) + (−4)(1/√2) = −3√2 ≅ −4.24
Thus the instantaneous rate of change in moving from the point (1,2,9) on the surface
in the direction of 𝑢⃗₁ (which points toward the point Q) is about −4.24. Moving in this
direction moves one steeply downward.
■ We seek the directional derivative in the direction of ⟨2,−1⟩. The unit vector in this
direction is 𝑢⃗₂ = ⟨2/√5, −1/√5⟩. Thus the directional derivative of f at (1,2) in the
direction of 𝑢⃗₂ is D_{𝑢⃗₂}f(𝐱) = (∇f(𝐱))ᵀ𝑢⃗₂ = 0. Starting on the surface of f at (1,2) and
moving in the direction of ⟨2,−1⟩ (or 𝑢⃗₂) results in no instantaneous change in z-value.
■ At P = (1,2), the direction towards the origin is given by the vector ⟨−1,−2⟩; the unit
vector in this direction is 𝑢⃗₃ = ⟨−1/√5, −2/√5⟩. The directional derivative of f at P in the
direction of the origin is D_{𝑢⃗₃}f(𝐱) = (∇f(𝐱))ᵀ𝑢⃗₃ = 10/√5 ≅ 4.47. Moving towards the origin
means "walking uphill" quite steeply, with an initial slope of about 4.47.
Note: The symbol "∇" is named "nabla,'' derived from the Greek name of a Jewish harp.
Oddly enough, in mathematics the expression ∇f is pronounced "del f.''
The gradient vectors are perpendicular to the level sets, so the gradient at a point always
indicates the direction of steepest change from that level set toward neighbouring level sets.
How can this local information be assembled into a global picture? The answer is the concept
of gradient flow, in which these quantities fit together to describe motion along a surface,
much like a liquid or a rolling object.
Theorem: Consider a function f: ℝⁿ ⟶ ℝ, and suppose f is of class C¹. For some
constant c, consider the level set S = {𝒙⃗ ∈ ℝⁿ : f(𝒙⃗) = c}. Then, for any point 𝒙⃗₀ in S, the
gradient ∇f(𝒙⃗₀) is perpendicular to S.
Proof: Let 𝒙⃗(t) be any smooth curve that lies in S and passes through 𝒙⃗₀ at t = t₀. By the
definition of S, and since 𝒙⃗(t) lies in S, f(𝒙⃗(t)) = c for all t. Differentiating both sides of this
identity, and using the chain rule on the left side, we obtain ∇f(𝒙⃗(t)) · 𝒙⃗′(t) = 0.
Plugging in t = t₀, this gives us ∇f(𝒙⃗(t₀)) · 𝒙⃗′(t₀) = 0, which we can rewrite as ∇f(𝒙⃗₀) ⊥ 𝒙⃗′(t₀).
Since this holds for the tangent vector of every curve in S through 𝒙⃗₀, the gradient is perpendicular to S. ■
∇²f(𝐱)_{ij} = ∂²f(𝐱)/∂x_i∂x_j ,   1 ≤ i, j ≤ n
⟹ f(𝐱 + 𝒉) = f(𝐱) + ∇f(𝐱) · 𝒉 + ½ (𝑯(𝐱 + t𝒉)𝒉) · 𝒉
Let again f(𝐱) = x₁² − 2x₁ + 3x₁x₂² + 4x₂³, 𝐱_c = (1, 1)ᵀ, 𝒑 = (−2, 1)ᵀ.
Then
∇²f(𝐱) = ( 2   6x₂ ; 6x₂   6x₁ + 24x₂ ) ⟹ ∇²f(𝐱_c) = ( 2 6 ; 6 30 )
Lemma suggests that we might model the function f around a point 𝐱_c by the quadratic
model
m(𝐱_c + 𝒑) = f(𝐱_c) + (∇f(𝐱_c))ᵀ𝒑 + ½ 𝒑ᵀ𝑯(𝐱_c)𝒑
and this is precisely what we will do. In fact, it shows that the error in this model is given by
ε = f(𝐱_c + 𝒑) − m(𝐱_c + 𝒑) = ½ 𝒑ᵀ(𝑯(𝒛) − 𝑯(𝐱_c))𝒑
for some 𝒛 ∈ (𝐱_c, 𝐱_c + 𝒑).
For the remainder of this chapter we will denote the Jacobian matrix of 𝐹 at 𝐱 by 𝑱(𝐱).
Also, we will often speak of the Jacobian of 𝐹 rather than the Jacobian matrix of 𝐹 at 𝐱.
An important fact: Now comes the big difference from real-valued functions: there is
no mean value theorem for continuously differentiable vector-valued functions. That is,
in general there may not exist 𝐳 ∈ ℝ𝑛 such that 𝐹(𝐱 + 𝒑) = 𝐹(𝐱) + 𝑱(𝐳)𝒑. Intuitively the
reason is that, although each function 𝑓𝑖 satisfies 𝑓𝑖 (𝐱 + 𝒑) = 𝑓𝑖 (𝐱) + ∇𝑓𝑖 (𝐳𝑖 )𝑇 𝒑 , the points
𝐳𝑖 , may differ. For example, consider the function of the example before. There is no
𝐳 ∈ ℝⁿ for which F(1,1) = F(0,0) + 𝑱(𝐳)(1,1)ᵀ, as this would require
( e^{x₁} − x₂ ; x₁² − 2x₂ )|_{𝐱=(1,1)} = ( e^{x₁} − x₂ ; x₁² − 2x₂ )|_{𝐱=(0,0)} + ( e^{z₁}  −1 ; 2z₁  −2 )( 1 ; 1 )
⇔ ( e − 1 ; −1 ) = ( 1 ; 0 ) + ( e^{z₁} − 1 ; 2z₁ − 2 ),
and no single value of z₁ can satisfy both components.
Those results are given below. The integral of a vector valued function of a real variable
can be interpreted as the vector of Riemann integrals of each component function.
Proof: The proof comes right from the definition of 𝐹 ′ (𝐳) and a component by-
component application of Newton's formula 𝑓𝑖 (𝐱 + 𝒑) = 𝑓𝑖 (𝐱) + ∇𝑓𝑖 (𝐳𝑖 )𝑇 𝒑 .■
Now we might think about a best model, or the best linear approximation, of the function
F around a point 𝐱_c; that is, we model F(𝐱_c + 𝒑) by the affine model M(𝐱_c + 𝒑) = F(𝐱_c) + 𝑱(𝐱_c)𝒑,
and this is what we will do. To produce a bound on the difference between F(𝐱_c + 𝒑) and
𝑀(𝐱 𝑐 + 𝒑), we need to make an assumption about the continuity of 𝑱(𝐱 𝑐 ) just as we did
in scalar-valued-functions in the section before.
Definition Let the two integers m, n > 0, 𝑮: ℝⁿ ⟶ ℝ^{m×n}, 𝐱 ∈ ℝⁿ, let ‖•‖ be a norm on
ℝⁿ, and ‖•‖_𝑮 a norm on ℝ^{m×n}. 𝑮 is said to be Lipschitz continuous at 𝐱 if there exists an
open set 𝑫 ⊂ ℝⁿ, 𝐱 ∈ 𝑫, and a constant γ such that for all 𝐲 ∈ 𝑫,
‖𝑮(𝐲) − 𝑮(𝐱)‖_𝑮 ≤ γ‖𝐲 − 𝐱‖.
The constant γ is called a Lipschitz constant for 𝑮 at 𝐱. For any specific 𝑫 containing 𝐱
for which this inequality holds, 𝑮 is said to be Lipschitz continuous at 𝐱 in the
neighborhood 𝑫. If the inequality holds for every 𝐱 ∈ 𝑫, then 𝑮 ∈ Lip_γ(𝑫).
Note that the value of 𝛾 depends on the norms ‖•‖ & ‖•‖𝑮 , but the existence of 𝛾 does
not.
Proof: Using the integral form of the remainder and the Lipschitz continuity of 𝑱,
‖F(𝐱 + 𝒑) − F(𝐱) − 𝑱(𝐱)𝒑‖ = ‖∫₀¹ (𝑱(𝐱 + t𝒑) − 𝑱(𝐱))𝒑 dt‖ ≤ ∫₀¹ γ‖t𝒑‖‖𝒑‖ dt = (γ/2)‖𝒑‖². ■
Using Lipschitz continuity, we can obtain a useful bound on the error in the
approximate affine model.
Lemma Let F, 𝑱 satisfy the conditions of the previous lemma. Then, for any 𝐯, 𝐮 ∈ 𝑫,
‖F(𝐯) − F(𝐮) − 𝑱(𝐱)(𝐯 − 𝐮)‖ ≤ γ ( (‖𝐯 − 𝐱‖ + ‖𝐮 − 𝐱‖) / 2 ) ‖𝐯 − 𝐮‖
If we assume in addition that 𝑱(𝐱)⁻¹ exists, then there exist ε > 0 and 0 < α < β such that
α‖𝐯 − 𝐮‖ ≤ ‖F(𝐯) − F(𝐮)‖ ≤ β‖𝐯 − 𝐮‖ whenever ‖𝐯 − 𝐱‖ ≤ ε and ‖𝐮 − 𝐱‖ ≤ ε. (Here and below,
𝒪(‖𝐱 − 𝒂‖) denotes a quantity that approaches zero much faster than the distance between
𝐱 and 𝒂 does as 𝐱 approaches 𝒂.)
In the preceding section we saw that the Jacobian, gradient, and Hessian will be useful
quantities in forming models of multivariable nonlinear functions. In many
applications, however, these derivatives are not analytically available. In this section we
introduce the formulas used to approximate these derivatives by finite differences, and
the error bounds associated with these formulas. The choice of finite-difference stepsize
in the presence of finite-precision arithmetic and the use of finite-difference derivatives
in our algorithms are discussed in (J. R Dennis, Jr. & Robert B. Schnabel 1993).
Frequently we deal with problems where the nonlinear function is itself the result of a
computer simulation, or is given by a long and messy algebraic formula, and so it is
often the case that analytic derivatives are not readily available although the function is
several times continuously differentiable. Therefore it is important to have algorithms
that work effectively in the absence of analytic derivatives.
In the case when 𝐹: ℝ𝑛 ⟶ ℝ𝑚 , it is reasonable to use the same idea as in one variable
to approximate the (𝑖, 𝑗)𝑡ℎ component of 𝑱(𝐱) by the forward difference approximation
a_{ij}(𝐱) = ( f_i(𝐱 + h𝐞_j) − f_i(𝐱) ) / h
where 𝐞_j denotes the j-th unit vector. This is equivalent to approximating the j-th column
of 𝑱(𝐱) by the forward difference ( F(𝐱 + h𝐞_j) − F(𝐱) ) / h.
One can prove that if 𝒂 ∈ ℝⁿ is the best (central difference) model of ∇f(𝐱), then the
components a_i given by
a_i(𝐱) = ( f(𝐱 + h𝐞_i) − f(𝐱 − h𝐞_i) ) / (2h)
satisfy
|a_i(𝐱) − (∇f(𝐱))_i| ≤ (γ/6) h²  ⟺  ‖𝒂(𝐱) − ∇f(𝐱)‖_∞ ≤ (γ/6) h²
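These formulas are easy to check numerically. In the MATLAB sketch below, the test function (the one used in the earlier gradient example), the evaluation point, and the stepsize h are assumptions for the illustration.
% Forward and central difference approximations of the gradient (illustrative)
clear all, clc
f = @(x) x(1)^2 - 2*x(1) + 3*x(1)*x(2)^2 + 4*x(2)^3;   % assumed test function
x = [1;1]; h = 1e-5; n = length(x);
gf = zeros(n,1); gc = zeros(n,1);
for i = 1:n
    e = zeros(n,1); e(i) = 1;
    gf(i) = (f(x+h*e) - f(x))/h;            % forward difference, O(h) accurate
    gc(i) = (f(x+h*e) - f(x-h*e))/(2*h);    % central difference, O(h^2) accurate
end
g_exact = [2*x(1)-2+3*x(2)^2; 6*x(1)*x(2)+12*x(2)^2];  % analytic gradient for comparison
[gf gc g_exact]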
On some occasions ∇𝑓(𝐱) is analytically available but ∇2 𝑓(𝐱) is not. In this case, ∇2 𝑓(𝐱)
can be approximated by applying forward difference 𝑨𝑖 = (∇𝑓(𝐱 + ℎ𝐞𝑗 ) − ∇𝑓(𝐱)) /ℎ,
followed by 𝐴̂ = (𝐴 + 𝐴𝑇 )/2 , since the approximation to ∇2 𝑓(𝐱) should be symmetric.
If ∇𝑓(𝐱) is not available, it is possible to approximate ∇2 𝑓(𝐱) using only values of 𝑓(𝐱).
Proof: As in the one-variable case, a proof by contradiction is better than a direct proof
A class of algorithms called descent methods are characterized by the direction vector 𝒑
such that 𝒑𝑇 ∇𝑓(𝐱) < 0 or 𝒑 = −∇𝑓(𝐱).
The second-order necessary conditions for 𝐱⋆ to be a local minimizer of f are simply
■ ∇f(𝐱⋆) = 𝟎
■ ∇²f(𝐱⋆) is positive semidefinite
(for a local maximizer the Hessian must instead be negative semidefinite, and strict definiteness together with ∇f = 𝟎 is sufficient).
f(𝐱 + 𝒑) − f(𝐱) = (∇f(𝐱))ᵀ𝒑 + ½ 𝒑ᵀ𝑯(𝐱 + t𝒑)𝒑 + 𝒪(‖𝒑‖³)
When the Hessian matrix is positive definite, by definition 𝒑ᵀ𝑯(𝐱 + t𝒑)𝒑 > 0 for any
𝒑 ≠ 𝟎. Therefore, at a stationary point (where ∇f(𝐱) = 𝟎) we have f(𝐱 + 𝒑) − f(𝐱) = (1/2)𝒑ᵀ𝑯(𝐱 + t𝒑)𝒑 > 0,
which means that 𝐱 must be a local minimum. Similarly, when the Hessian matrix is negative definite, 𝐱 is
a local maximum. Finally, when 𝑯 has both positive and negative eigenvalues, the point
is a saddle point.
Those methods use
the gradient to search for the minimum point of an objective function. Such gradient-
based optimization methods are supposed to reach a point at which the gradient is
(close to) zero. In this context, the optimization of an objective function f(x) is equivalent
to finding a zero of its gradient g(x), which in general is a vector-valued function of a
vector-valued independent variable x. Therefore, if we have the gradient function g(x) of
the objective function f(x), we can solve the system of nonlinear equations g(x) = 0 to get
the minimum of f(x) by using the Newton method explained in chapter 4.
Example: Given the Rosenbrock function f(𝐱) = 100(x₂ − x₁²)² + (1 − x₁)², find the
extremum of f(𝐱) starting from the point 𝐱 = [0.01 0.02]. This function is
severely ill-conditioned near the minimizer (1,1) (which is the unique stationary point).
%-----------------------------------------------------------%
% f .......... objective function
% J .......... gradient of the objective function
% H .......... Hessian of the objective function
%-----------------------------------------------------------%
clear all, clc, i=1; x(i,:)=[0.01 0.02]; tol=0.001;
f=@(x)100*(x(2)-x(1)^2)^2+(1-x(1))^2;
J=@(x)[-400*(x(2)-x(1)^2)*x(1)+2*x(1)-2; 200*x(2)-200*x(1)^2];
H=@(x)[1200*x(1)^2-400*x(2)+2 -400*x(1);-400*x(1) 200];
while norm(J(x(i,:)))>tol
d=(inv(H(x(i,:)) + 0.5*eye(2,2))*J(x(i,:)))';
x(i+1,:)=x(i,:)-d;
i=i+1;
end
Iterations=i
x=(x(i,:))' % x=[0.989793 0.9797];
fmax=f(x)
Example: Let 𝑓(𝐱) = √(1 + 𝑥12 ) + √(1 + 𝑥22 ), find the extremum value of 𝑓(𝐱) with the
following starting point 𝐱 = [1 1].
clear all, clc, i=1; x(i,:)=[1 1];   % starting point as stated above
f=@(x)sqrt(1+x(1)^2)+sqrt(1+x(2)^2);
J=@(x)[x(1)/sqrt(x(1)^2+1);x(2)/sqrt(x(2)^2+1)];
H=@(x)diag([1/(x(1)^2+1)^1.5,1/(x(2)^2+1)^1.5]);
while abs(x(i,:)*J(x(i,:)))>0.001
d=(inv(H(x(i,:))+0.5*eye(2,2))*J(x(i,:)))';
x(i+1,:)=x(i,:)-d;
i=i+1;
end
Iterations=i
x=(x(i,:))'
fmax=f(x)
Example: Let 𝑓(𝑥1 , 𝑥2 ) = (𝑥1 − 2)4 + ((𝑥1 − 2)2 )𝑥22 + (𝑥2 + 1)2 , which has its minimum
at 𝐱 ⋆ = (2, −1)𝑇 . Algorithm, started from 𝐱 0 = (1, 1)𝑇 , and we use the following
approximations
H₁₁ = ( f(𝐱 + 2h𝐞₁) − 2f(𝐱 + h𝐞₁) + f(𝐱) )/h² ,    H₂₂ = ( f(𝐱 + 2h𝐞₂) − 2f(𝐱 + h𝐞₂) + f(𝐱) )/h²
H₁₂ = H₂₁ = ( f(𝐱 + h𝐞₁ + h𝐞₂) − f(𝐱 + h𝐞₁) − f(𝐱 + h𝐞₂) + f(𝐱) )/h².
Before starting the algorithm, let us visualize the plot of this surface in space.
Iterations = 9
Jacobian =
-4.0933e-11
2.2080e-09
x =
1.9950
-1.0050
fmax = 5.0001e-05
clear all, clc, i=1; x(i,:)=[1 1]; h=0.01; tol=0.001; J=[1;1];   % stepsize and tolerance (assumed)
f=@(x)(x(1)-2)^4+((x(1)-2)^2)*x(2)^2+(x(2)+1)^2;
while norm(J)>tol
    % shifted evaluation points for the finite differences
    x1(i,:)=x(i,:)+[h 0];    x2(i,:)=x(i,:)+[0 h];
    x11(i,:)=x(i,:)+[2*h 0]; x22(i,:)=x(i,:)+[0 2*h]; x12(i,:)=x(i,:)+[h h];
    J=[(f(x1(i,:))-f(x(i,:)))/h; (f(x2(i,:))-f(x(i,:)))/h];   % forward-difference gradient
    H(1,1)=(f(x11(i,:))-2*f(x1(i,:))+ f(x(i,:)))/h^2;
    H(1,2)=(f(x12(i,:))-f(x1(i,:))-f(x2(i,:))+f(x(i,:)))/h^2;
    H(2,1)=H(1,2);
    H(2,2)=(f(x22(i,:))-2*f(x2(i,:))+f(x(i,:)))/h^2;
    x(i+1,:)=x(i,:)-(inv(H)*J)';
    i=i+1;
end
Iterations=i
Gradient =J
x=(x(i,:))'
fmax=f(x)
Example: Consider the Freudenstein
and Roth test function
Where
Show that the function f has three stationary points. Find them and prove that one is a
global minimizer, one is a strict local minimum and the third is a saddle point. You
should use the stopping criteria ‖∇𝑓(𝑥)‖ ≤ 10−5. The algorithm should be employed four
times on the following four starting points:
Also we can use MATLAB code to see the Gradient and the Hessian of this function.
When we run the program we get the points as what have been asked.
Example: Given a Rosenbrock function 𝑓(𝐱) = 100(𝑥2 − 𝑥12 )2 + (1 − 𝑥1 )2, find the solution
of 𝑓(𝐱) = 0 using only the gradient (i.e. without use of Hessian).
−f(𝐱_k) = (∇f(𝐱_k))ᵀ Δ𝐱_k ⟺ Δ𝐱_k = −( ∇f(𝐱_k)(∇f(𝐱_k))ᵀ )⁻¹ ∇f(𝐱_k) f(𝐱_k)
And in order to avoid the singularity in the matrix inversion, let us add a regularization term:
Δ𝐱_k = 𝐱_{k+1} − 𝐱_k = −( ∇f(𝐱_k)(∇f(𝐱_k))ᵀ + λ𝑰 )⁻¹ ∇f(𝐱_k) f(𝐱_k) ,   0 < λ < 1
It can be observed by comparison with the Newton step 𝐱_{k+1} = 𝐱_k − (𝑯(𝐱))⁻¹𝐠(𝐱) that
𝐱_{k+1} = 𝐱_k − ( ∇f(𝐱_k)(∇f(𝐱_k))ᵀ + λ𝑰 )⁻¹ ∇f(𝐱_k) f(𝐱_k) ≅ 𝐱_k − (𝑯(𝐱))⁻¹ ∇f(𝐱_k)
⟹ 𝑯(𝐱) ≅ ( ∇f(𝐱_k)(∇f(𝐱_k))ᵀ + λ𝑰 ) / f(𝐱_k)
It practically means that, once the first derivatives are computed, we can also compute
part of the Hessian matrix for the same computational cost. The possibility to compute
“for free” the Hessian matrix once the Jacobian (i.e. Gradient) is available represents a
distinctive feature of least squares problems. This approximation is adopted in many
applications as it provides an evaluation of the Hessian matrix without computing any
second derivatives of the objective function.
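A minimal sketch of the regularized "gradient-only" step Δ𝐱_k = −(∇f(𝐱_k)(∇f(𝐱_k))ᵀ + λ𝑰)⁻¹∇f(𝐱_k)f(𝐱_k) is given below. A simple test function with a known root of f is assumed so that the sketch converges reliably; the value of λ and the stopping rule are also assumptions.
% Regularized gradient-only iteration for solving f(x) = 0 (illustrative sketch)
clear all, clc, lam=0.1; tol=1e-8;
f = @(x)(x(1)-1)^2 + (x(2)-2)^2;         % assumed test function, f = 0 at (1,2)
g = @(x)[2*(x(1)-1); 2*(x(2)-2)];        % its gradient
x = [5;-3];
for k = 1:500
    gk = g(x);
    x  = x - ((gk*gk' + lam*eye(2))\gk)*f(x);   % dx = -(g g' + lam*I)^{-1} g f(x)
    if abs(f(x)) < tol, break; end
end
k, x, fval = f(x)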
Example: Given f(𝐱) = (x₁² − 2x₂)e^(−x₁²−x₂²−x₁x₂), find the solution of f(𝐱) = 0 using only
the gradient (i.e. without use of the Hessian). Here in this example we will use the
approximate value of ∇f(𝐱) rather than the analytic one.
Before starting the algorithm let we visualize the plot of this surface in space
[x,y] = meshgrid([-2:.25:2]);
z = x.*exp(-x.^2-y.^2);
% Plotting the Z-values of the function on which the level
% sets have to be projected
z1 = x.^2+y.^2;
% Plot your contour
[cm,c]=contour(x,y,z,30);
% Plot the surface on which the level sets have to be projected
s=surface(x,y,z1,'EdgeColor',[.8 .8 .8],'FaceColor','none')
% Get the handle to the children i.e the contour lines of the contour
cv=get(c,'children');
% Extract the (X,Y) for each of the contours and recalculate the
% Z-coordinates on the surface on which to be projected.
for i=1:length(cv)
cc = cv(i);
xd=get(cc,'XData');
yd=get(cc,'Ydata');
zd=xd.^2+yd.^2;
set(cc,'Zdata',zd);
end
grid off
view(-15,25)
colormap cool
Jacobian =
-4.1814e-06
-8.6410e-06
x =
1.8844
3.4178
fmax = 4.5702e-07
clear all, clc, i=1; x(i,:)=[1 1]; h=0.01; J=1; tol=0.00001;
f=@(x)x(1)*exp(-x(1)^2-x(2)^2);
while norm(J)>tol
    % shifted evaluation points for the finite differences
    x1(i,:)=x(i,:)+[h 0];    x2(i,:)=x(i,:)+[0 h];
    x11(i,:)=x(i,:)+[2*h 0]; x22(i,:)=x(i,:)+[0 2*h]; x12(i,:)=x(i,:)+[h h];
    J=[(f(x1(i,:))-f(x(i,:)))/h; (f(x2(i,:))-f(x(i,:)))/h];   % forward-difference gradient
    H(1,1)=(f(x11(i,:))-2*f(x1(i,:))+ f(x(i,:)))/h^2;
    H(1,2)=(f(x12(i,:))-f(x1(i,:))-f(x2(i,:))+f(x(i,:)))/h^2;
    H(2,1)=H(1,2);
    H(2,2)=(f(x22(i,:))-2*f(x2(i,:))+f(x(i,:)))/h^2;
    x(i+1,:)=x(i,:)-(inv(H)*J)';
    i=i+1;
end
Iterations=i
Jacobian=J
x=(x(i,:))'
fmax=f(x)
Example: Develop the Taylor series of two-variables objective function 𝑓(𝑥1 , 𝑥2 ) with an
error of 𝒪(‖𝜹‖3 )
f(x₁ + δ₁, x₂ + δ₂) = f(x₁, x₂) + ( ∂f/∂x₁ δ₁ + ∂f/∂x₂ δ₂ ) + ½( ∂²f/∂x₁² δ₁² + 2 ∂²f/∂x₁∂x₂ δ₁δ₂ + ∂²f/∂x₂² δ₂² ) + 𝒪(‖𝜹‖³)
= f(x₁, x₂) + [ ∂f/∂x₁  ∂f/∂x₂ ]( δ₁ ; δ₂ ) + ½ [ δ₁  δ₂ ]( ∂²f/∂x₁²  ∂²f/∂x₁∂x₂ ; ∂²f/∂x₁∂x₂  ∂²f/∂x₂² )( δ₁ ; δ₂ ) + 𝒪(‖𝜹‖³)
In compact form we can write f(𝐱 + 𝜹) − f(𝐱) = (∇f(𝐱))ᵀ𝜹 + ½ 𝜹ᵀ𝑯(𝐱 + t𝜹)𝜹 + 𝒪(‖𝜹‖³)
Let 𝑭: ℝ𝑛 ⟶ ℝ𝑚 be a continuously
differentiable in the open convex set 𝑫 ⊂ ℝ𝑛 . The practical problem in the vector case is
to solve the simultaneously the set of nonlinear equations 𝑭(𝐱) = 𝟎. In before we have
seen that
𝑭(𝐱 + 𝜹) = 𝑭(𝐱) + 𝑱(𝐱)𝜹 ⟺ 𝑭(𝐱 𝑘 + 𝜹) = 𝑭(𝐱 𝑘 ) + 𝑱(𝐱 𝑘 )𝜹𝑘
𝑱(𝐱_k) = ( ∂f₁/∂x₁  ∂f₁/∂x₂ ; ∂f₂/∂x₁  ∂f₂/∂x₂ ) ≅ (1/h) ( f₁(𝐱 + h𝐞₁) − f₁(𝐱)   f₁(𝐱 + h𝐞₂) − f₁(𝐱) ; f₂(𝐱 + h𝐞₁) − f₂(𝐱)   f₂(𝐱 + h𝐞₂) − f₂(𝐱) )
Example: write a MATLAB code to solve the following nonlinear system of equations
using the approximate method and take x(i,:)= [0.1 0.2] as starting point.
Example: using MATLAB to visualize the intersection of the following surfaces centered
at origin (assuming that the parameters are specified)
(x/a)² + (y/b)² + (z/c)² = d
z = αx² + βy² + γ
x² + y² + z² = λ
for i=1:imax+1
theta=2*pi*(i-1)/ imax;
for j=1:jmax+1
phi =2*pi*(j-1)/ jmax;
x(i,j) = a*cos(theta); y(i,j) = b*sin(theta)*cos(phi);
z(i,j) = c*sin(theta)*sin(phi);
end
end
s=surf(x,y,z), hold on % Plot of ellipsoid
x(i+1,:) = x(i,:)-(inv(J)*F)';
dif = norm(x(i+1,:)-x(i,:));
i = i + 1;
end
x, Iterations=i, F
Suppose first that the Hessian matrix of the objective function is constant and
independent of 𝐱 𝑘 for 0 ≤ 𝑖 ≤ 𝑘. In other words, the objective function is quadratic, with
Hessian 𝐇(𝐱) = 𝑸 for all 𝐱, where 𝑸 = 𝑸𝑇 . Then, if 𝐱 𝑘+1 is an optimizer of 𝑓 we get
𝐠(𝐱 𝑘+1 ) = 0 so
𝐠(𝐱 𝑘+1 ) − 𝐠(𝐱 𝑘 ) = 𝑸(𝐱 𝑘+1 − 𝐱 𝑘 ) ⟺ ∆𝐠(𝐱 𝑘 ) = 𝑸∆𝐱 𝑘 ⟺ ∆𝐠(𝐱𝑘 ) = 𝑸𝒑𝑘 ⟺ 𝒑𝑇𝑘 ∆𝐠(𝐱 𝑘 ) = 𝒑𝑇𝑘 𝑸𝒑𝑘
We start with a real symmetric positive definite matrix 𝐁0 . Note that given k, the
matrix 𝑸−1 satisfies
𝑸−1 ∆𝐠(𝐱𝑖 ) = ∆𝐱 𝑖 0 ≤ 𝑖 ≤ 𝑘
Therefore, we also impose the requirement that the approximation of the Hessian
satisfy
𝐁𝑘+1 ∆𝐠(𝐱 𝑖 ) = ∆𝐱 𝑖 = 𝒑𝑖 0 ≤ 𝑖 ≤ 𝑘
𝐁𝑛 ∆𝐠(𝐱 0 ) = 𝒑0
𝐁𝑛 ∆𝐠(𝐱1 ) = 𝒑1
⋮
𝐁𝑛 ∆𝐠(𝐱𝑛−1 ) = 𝒑𝑛−1
Note that 𝑸 satisfies 𝑸⁻¹[𝒒₀, 𝒒₁, 𝒒₂, … 𝒒_{n−1}] = [𝒑₀, 𝒑₁, … 𝒑_{n−1}]. Therefore, if the matrix
[𝒒₀, 𝒒₁, 𝒒₂, … 𝒒_{n−1}] is nonsingular, then 𝑸⁻¹ is determined uniquely after n steps via
𝑸⁻¹ = [𝒑₀, 𝒑₁, … 𝒑_{n−1}][𝒒₀, 𝒒₁, 𝒒₂, … 𝒒_{n−1}]⁻¹.
This means that if n linearly independent directions 𝒑_i and corresponding 𝒒_i are known,
then 𝑸⁻¹ is uniquely determined.
We will construct successive approximations 𝐁𝑘 to 𝑸−1 based on data obtained from the
first k steps such that:
𝐁𝑘+1 = [𝒑0 , 𝒑1 , … 𝒑𝑘 ][𝒒0 , 𝒒1 , 𝒒2 , … 𝒒𝑘 ]−1
After n linearly independent steps we would then have 𝐁𝑛 = 𝑸−1 . We want an update on
𝐁𝑘 such that:
𝐁𝑘+1 = [𝒑0 , 𝒑1 , … 𝒑𝑘 ][𝒒0 , 𝒒1 , 𝒒2 , … 𝒒𝑘 ]−1
Let us find the update in this form [Rank one correction] 𝐁𝑘+1 = 𝐁𝑘 + 𝛼𝑘 𝐮𝑘 𝐮𝑇𝒌 . We need a
good 𝛼𝑘 ∈ ℝ and good 𝐮𝑘 ∈ ℝ𝑛 .
𝐁_{k+1} = 𝐁_k + ( 𝒑_k − 𝐁_k𝒒_k )( 𝒑_k − 𝐁_k𝒒_k )ᵀ / ( 𝒒_kᵀ( 𝒑_k − 𝐁_k𝒒_k ) )
Proof: We already know that 𝐁𝑘+1 [𝒒0 , 𝒒1 , … 𝒒𝑘 ] = [𝒑0 , 𝒑1 , … 𝒑𝑘 ] and 𝐁𝑘+1 = 𝐁𝑘 + 𝛼𝑘 𝐮𝑘 𝐮𝑇𝑘 .
Therefore,
𝒑_k = 𝐁_{k+1}𝒒_k = (𝐁_k + α_k𝐮_k𝐮_kᵀ)𝒒_k = 𝐁_k𝒒_k + α_k𝐮_k(𝐮_kᵀ𝒒_k) ⟹ 𝒑_k − 𝐁_k𝒒_k = α_k𝐮_k(𝐮_kᵀ𝒒_k)
so 𝐮_k is proportional to 𝒑_k − 𝐁_k𝒒_k; substituting this back and using the condition 𝐁_{k+1}𝒒_k = 𝒑_k to fix the scalar yields
𝐁_{k+1} − 𝐁_k = ( 𝒑_k − 𝐁_k𝒒_k )( 𝒑_k − 𝐁_k𝒒_k )ᵀ / ( 𝒒_kᵀ( 𝒑_k − 𝐁_k𝒒_k ) ) ■
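The property claimed above can be illustrated numerically: feeding n pairs (𝒑_i, 𝒒_i = 𝑸𝒑_i) built from linearly independent directions into the rank-one update reproduces 𝑸⁻¹ after n steps. The symmetric positive definite test matrix and the choice 𝒑_i = 𝐞_i below are assumptions for the sketch.
% Rank-one (SR1-type) update reproducing Q^{-1} after n steps (illustrative)
clear all, clc, n=4; M=rand(n); Q=M'*M+n*eye(n);  % assumed SPD test matrix
B=eye(n);
for k=1:n
    p = zeros(n,1); p(k)=1;              % directions p_k = e_k (linearly independent)
    q = Q*p;                             % exact "quadratic" data q_k = Q p_k
    u = p - B*q;
    B = B + (u*u')/(q'*u);               % B_{k+1} = B_k + (p-Bq)(p-Bq)'/(q'(p-Bq))
end
norm(B - inv(Q))                         % numerically zero, provided no denominator vanished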
Remark: The scalar α_k is the smallest nonnegative value of a that locally minimizes
f along the chosen search direction starting from 𝐱_k. There are many alternative line-search
rules to choose α_k along the ray S_k = {𝐱_{k+1} = 𝐱_k + a𝒑_k | a > 0}, namely the Armijo rule,
the Goldstein rule, the Wolfe rule, the strong Wolfe rule, etc. We do not pursue this matter
further here.
The search direction 𝒑𝑘 at stage k is given by the solution of the analogue of the Newton
equation: 𝑯𝑘 𝒑𝑘 = −∇𝑓(𝐱 𝑘 ) where 𝑯𝑘 is an approximation to the Hessian matrix, which
is updated iteratively at each stage, and ∇𝑓(𝐱 𝑘 ) is the gradient of the function evaluated
at 𝐱 𝑘 . A line search in the direction 𝒑𝑘 is then used to find the next point 𝒑𝑘+1 by
minimizing 𝑓(𝐱 𝑘 + 𝛾𝒑𝑘 ) over the scalar 𝛾 > 0. The quasi-Newton condition imposed on
the update of 𝑯𝑘 is
∇𝑓(𝐱 𝑘+1 ) − ∇𝑓(𝐱 𝑘 ) = 𝑯𝑘+1 (𝐱 𝑘+1 − 𝐱 𝑘 )
Let 𝒚𝑘 = ∇𝑓(𝐱 𝑘+1 ) − ∇𝑓(𝐱 𝑘 ) and 𝒔𝑘 = 𝐱 𝑘+1 − 𝐱 𝑘 then 𝑯𝑘+1 satisfies 𝑯𝑘+1 𝒔𝑘 = 𝒚𝑘 which is the
secant equation. The curvature condition 𝒔𝑇𝑘 𝑯𝑘+1 𝒔𝑘 = 𝒔𝑇𝑘 𝒚𝑘 > 0 should be satisfied for
𝑯𝑘+1 to be positive definite. If the function is not strongly convex, then the condition has
to be enforced explicitly.
Instead of requiring the full Hessian matrix at the point 𝐱 𝑘+1 to be computed as 𝑯𝑘+1,
the approximate Hessian at stage k is updated by the addition of two matrices:
Both 𝐔_k and 𝐕_k are symmetric rank-one matrices, but their sum is a rank-two update
matrix. Imposing the secant condition 𝑯_{k+1}𝒔_k = 𝒚_k and choosing 𝐮_k = 𝒚_k and 𝐯_k = 𝐇_k𝒔_k, we
can obtain
α = 1/(𝐲_kᵀ𝒔_k) ,      β = −1/(𝒔_kᵀ𝑯_k𝒔_k)
Finally, we substitute α and β into 𝐇_{k+1} = 𝐇_k + α𝐮_k𝐮_kᵀ + β𝐯_k𝐯_kᵀ and get the update
equation of 𝐇_{k+1} (the BFGS update):
𝐇_{k+1} = 𝐇_k + 𝐲_k𝐲_kᵀ/(𝐲_kᵀ𝒔_k) − 𝐇_k𝒔_k𝒔_kᵀ𝐇_kᵀ/(𝒔_kᵀ𝐇_k𝒔_k)
The functional 𝑓(𝐱 𝑘 ) denotes the objective function to be minimized. Convergence can
be checked by observing the norm of the gradient, ‖∇𝑓(𝐱 𝑘 )‖2 ≤ 𝜀. In order to avoid the
inversion of 𝑯𝑘 at each step we apply the Sherman–Morrison formula
(𝑨 + 𝐮𝐯ᵀ)⁻¹ = 𝑨⁻¹ − 𝑨⁻¹𝐮𝐯ᵀ𝑨⁻¹ / (1 + 𝐯ᵀ𝑨⁻¹𝐮)
We get
𝐁_{k+1} = ( 𝑰 − 𝒔_k𝐲_kᵀ/(𝒔_kᵀ𝐲_k) ) 𝐁_k ( 𝑰 − 𝐲_k𝒔_kᵀ/(𝒔_kᵀ𝐲_k) ) + 𝒔_k𝒔_kᵀ/(𝒔_kᵀ𝐲_k)
        = 𝐁_k + ( 𝒔_kᵀ𝐲_k + 𝐲_kᵀ𝐁_k𝐲_k )( 𝒔_k𝒔_kᵀ ) / (𝒔_kᵀ𝐲_k)² − ( 𝐁_k𝐲_k𝒔_kᵀ + 𝒔_k𝐲_kᵀ𝐁_k ) / (𝒔_kᵀ𝐲_k)
Remark: In general, the finite difference approximations of the Hessian are more
expensive than the secant condition updates. (Walter Gander and Martin J Gander)
clear all, clc, tol=10^-4; x(:,1)= [0.8624 0.1456]; z=[]; B=eye(2,2);
f=@(x)x(1)^2-x(1)*x(2)-3*x(2)^2+5; J=@(x)[2*x(1)-x(2);-x(1)-6*x(2)];
% f=@(x)2*x(1)^2+3*x(2)^2+4*x(3)^2-10; J=@(x)[4*x(1);6*x(1);8*x(3)];
% f=@(x)3*sin(x(1))+exp(x(2)); J=@(x)[3*cos(x(1));exp(x(2))];
i=1; %matlab starts counting at 1
while and(norm(J(x(:,i)))>0.001,i<500)
p(:,i)=-B*J(x(:,i));
%------------------------------------------------------------%
% armijo method for alpha determination
alp=0.01; % initial step
rho=0.01; c=0.01; % rho and c are in (0,1);
x1=x(:,i); x2=x(:,i)+alp*p(:,i);
f2=f(x2); f1=f(x(:,i));
while and(f2>f1+c*alp*(J(x(:,i)))'*p(:,i), alp>10^-6);
alp=rho*alp;
f2=f(x(:,i)+alp*p(:,i));
end
%------------------------------------------------------------%
s=alp*p(:,i); x(:,i+1)=x(:,i) + s; y=J(x(:,i+1))-J(x(:,i));
B = B + ((s'*y + y'*B*y)/(s'*y)^2)*(s*s') -(B*y*s'+ (s*y')*B)/(s'*y);
i=i+1;
end
x(:,end), fmax=f(x(:,end)), Gradient=J(x(:,end))
Gradient descent is based on the observation that if the scalar multi-variable function
𝑓(𝐱) is defined and differentiable in a neighborhood of a point 𝒂, then 𝑓(𝐱) decreases
fastest if one goes from 𝒂 in the direction of the negative gradient of 𝑓(𝐱) at 𝒂, −∇𝑓 (𝒂).
It follows that, if 𝐚𝑘+1 = 𝐚𝑘 − 𝛼∇𝑓(𝐚𝑘 ) for 𝛼 ∈ ℝ small enough, then 𝑓(𝐚𝑘 ) ≥ 𝑓(𝐚𝑘+1 ) In
other words, the term 𝛼∇𝑓(𝐚𝑘 ) is subtracted from 𝐚𝑘 because we want to move against
the gradient, toward the local minimum.
With this observation in mind, one starts with a guess 𝐱 0 for a local minimum of 𝑓(𝐱),
and considers the sequence 𝐱1 , 𝐱 2 , …, such that 𝐱 𝑘+1 = 𝐱 𝑘 − 𝛼𝑘 ∇𝑓(𝐱 𝑘 ).
Note that the value of the step size 𝛼𝑘 is allowed to change at every iteration. With
certain assumptions on the function 𝑓(𝐱) (for example, 𝑓(𝐱) convex and ∇𝑓(𝐱) Lipschitz)
and particular choices of 𝛼𝑘 convergence to a local minimum can be guaranteed.
According to the Wolfe conditions, or the Barzilai–Borwein method,
α_k = ( (𝐱_k − 𝐱_{k−1})ᵀ(∇f(𝐱_k) − ∇f(𝐱_{k−1})) ) / ‖∇f(𝐱_k) − ∇f(𝐱_{k−1})‖²    or    α_k = ( (∇f(𝐱_k))ᵀ∇f(𝐱_k) ) / ( (∇f(𝐱_k))ᵀ𝐇(𝐱_k)∇f(𝐱_k) )
Algorithm: [Gradient descent algorithm]
Initialization: 𝐱 0 and 𝛼0
begin: k=1:n (untile converge)
𝐱 𝑘+1 = 𝐱 𝑘 − 𝛼𝑘 ∇𝑓(𝐱 𝑘 )
        α_k = ( (∇f(𝐱_k))ᵀ∇f(𝐱_k) ) / ( (∇f(𝐱_k))ᵀ𝐇(𝐱_k)∇f(𝐱_k) )
end
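A MATLAB sketch of this gradient-descent scheme, using the step α_k = (∇fᵀ∇f)/(∇fᵀ𝐇∇f) on an assumed quadratic test function, is given below.
% Steepest descent with the Rayleigh step on a quadratic test function (sketch)
clear all, clc, tol=1e-8;
A=[3 1;1 2]; b=[-1;2];                 % assumed quadratic: f(x) = 0.5*x'*A*x - b'*x
g = @(x)A*x - b;                       % gradient of the quadratic (Hessian H = A)
x = [10;-10];
for k=1:1000
    gk = g(x);
    if norm(gk)<tol, break; end
    alpha = (gk'*gk)/(gk'*A*gk);       % alpha_k = g'g / (g'Hg)
    x = x - alpha*gk;                  % x_{k+1} = x_k - alpha_k * grad f(x_k)
end
k, x, xexact = A\b                     % compare with the exact minimizer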
The peaks represent regions with high cost with red areas whereas the lowest point
with blue areas are regions with minimum cost or loss. In any Optimization & Deep
Learning problems, we try to find a model function that gives prediction having least
loss in comparison to actual value.
Suppose our model function has two parameters then, mathematically we wish to find
the optimal values of parameters 𝜃1 and 𝜃2 that would minimize our loss. The loss (𝐽(𝜃))
space shown in the above figure tells us how our algorithm would perform if we would
choose a particular value for a parameter. Here the 𝜃1 and 𝜃2 are our x and y axis while
the loss is plotted corresponding to the z axis. The Gradient Descent rule states that the
direction in which we should move should be 180 degrees with the gradient, in other
words moving opposite to the gradient.
In order to minimize the new objective function we consider the gradient ∇𝜙(𝜹) = 𝑨𝜹 + 𝒃.
In term of 𝐱 𝑘+1we obtain ∇𝜙(𝐱 𝑘+1 ) = ∇𝑓(𝐱 𝑘+1 ) = 𝑨𝐱 𝑘+1 + 𝒃. As a consequence, all
gradient-like iterative methods developed in the previous chapter for linear systems,
can be extended to solve nonlinear minimization problems.
In particular, having fixed a descent direction 𝒑𝑘 = (𝐱 𝑘+1 − 𝐱 𝑘 )/𝛼𝑘 , we can determine the
optimal value of the acceleration parameter 𝛼𝑘 , in such a way as to find the point where
the function f, restricted to the direction 𝒑𝑘 , is minimized 𝛼𝑘 = arg min𝛼 𝑓(𝐱 𝑘 + 𝛼𝒑𝑘 ).
Setting to zero the directional derivative, we get
0 = d/dα_k f(𝐱_k + α_k𝒑_k) = Σ_{i=1}^{n} ( ∂f/∂x_i (𝐱_k + α_k𝒑_k) ) · ∂/∂α_k ( 𝐱_k(i) + α_k𝒑_k(i) ) = (∇f(𝐱_k + α_k𝒑_k))ᵀ𝒑_k
But we have seen that ∇f(𝐱_k + α_k𝒑_k) = 𝑨(𝐱_k + α_k𝒑_k) + 𝒃 = (𝑨𝐱_k + 𝒃) + α_k𝑨𝒑_k.
Therefore d/dα_k f(𝐱_k + α_k𝒑_k) = ( α_k𝒑_kᵀ𝑨 + (𝑨𝐱_k + 𝒃)ᵀ )𝒑_k = 0, which gives the optimal step
α_k = −(𝑨𝐱_k + 𝒃)ᵀ𝒑_k / (𝒑_kᵀ𝑨𝒑_k) = 𝒓_kᵀ𝒑_k / (𝒑_kᵀ𝑨𝒑_k).
■ For the step length 𝛼𝑘 , (which minimizes f along the search direction 𝒑𝑘 ), we perform
a line search that identifies the approximate minimum of the nonlinear function f along
the search direction 𝒑𝑘 .
■ The residual 𝒓𝑘 (𝒓𝑘 = −(𝒃 + 𝑨𝐱 𝑘 )), which is the gradient of function f has to be
replaced by the gradient of the nonlinear objective function.
Remark: An appropriate step length effecting sufficient decrease could be chosen from
one of the various known methods such as the Armijo, the Goldstein or the Wolfe’s
conditions. Moreover, if f is a strongly convex quadratic function and 𝛼𝑘 is the exact
minimizer of the function f, then the FR algorithm becomes specifically the linear
conjugate gradient algorithm.
and the 𝛽𝑘 scalar is given by the equation: 𝛽𝑘 = |∇𝑓(𝐱 𝑘 )|2 /|∇𝑓(𝐱 𝑘−1 )|2 where k is an
iterative index. The following algorithm of the conjugate gradient method is made on
MATLAB
clear all, clc, tol=10^-5; x(:,1)=10*rand(2,1);
f=@(x)x(1)^2+x(1)*x(2)+3*x(2)^2+100;
J=@(x)[2*x(1)+x(2);x(1)+6*x(2)]; p(:,1)=-J(x(:,1));
i=1; % matlab starts counting at 1
finalX = x ; % initialize the vector
finalf =f(x(:,1)); z=[];
while and(norm(J(x(:,i)))>0.001,i<500)
%-------------------------------------------------%
% Armijo method for alpha determination
alp=0.01; % initial step
rho=0.01; c=0.02; % rho and c are in (0,1);
x1=x(:,i); x2=x(:,i)+alp*p(:,i);
f2=f(x2); f1=f(x(:,i));
while and(f2>f1+c*alp*(J(x(:,i)))'*p(:,i), alp>10^-6);
alp=rho*alp;
f2=f(x(:,i)+alp*p(:,i));
end
%-------------------------------------------------%
x(:,i+1)=x(:,i) + alp*p(:,i);
beta=((J(x(:,i+1)))'*J(x(:,i+1)))/((J(x(:,i)))'*J(x(:,i)));
p(:,i+1)=-J(x(:,i+1)) + beta*p(:,i);
i=i+1;
z=[z,f(x(:,i))];
end
Iter=i
xmax=x(:,end)
fmax=f(x(:,end))
Gradient=J(x(:,end))
%-------------------------------------------------%
figure(1)
X= x(1,1:end-1); Y= x(2,1:end-1); Z= z;
plot3(X,Y,Z ,'bo-','linewidth',0.1);
hold on
figure(2)
[X,Y]= meshgrid([-3:0.5:3]) ;
Z=X.^2+X.*Y+3*Y.^2+100;   % surface of the objective (elementwise powers)
S=mesh(X,Y,Z); %plotting the surface
title('Subrats Pics'), xlabel('x'), ylabel('y')
To address the shortcomings of the original
Newton method, several variations of the technique were suggested to guarantee
convergence to a local minimum. One of the most important variations is the
Levenberg–Marquardt method. This method effectively uses a step that is a combination
between the Newton method and the steepest descent method. The step taken by this
method is given by
𝐱_{k+1} = 𝐱_k − ( 𝑯(𝐱_k) + 𝜇𝑰 )⁻¹∇f(𝐱_k)
where 𝜇 is a positive scalar and 𝑰 ∈ ℝⁿˣⁿ is the identity matrix. Notice that in the last
equation if 𝜇 is small enough, the Hessian matrix 𝑯(𝐱 𝑘 ) dominates and the method
becomes effectively a Newton’s step. If the parameter 𝜇 is large enough, the matrix 𝜇𝑰
dominates and the method is approximately in the steepest descent direction. By
increasing 𝜇, the inverse matrix becomes small in norm and subsequently the norm of
the step taken ‖𝐱 𝑘+1 − 𝐱 𝑘 ‖ becomes smaller. It follows that the parameter 𝜇 controls
also the step size.
One interesting mathematical property of this approach is that adding the matrix 𝜇𝑰 to
the Hessian matrix increases each eigenvalue of this matrix by 𝜇. If the matrix 𝑯(𝐱 𝑘 ) is
not positive semi-definite then adding 𝜇 to each eigenvalue makes them more positive.
The value of 𝜇 can be increased until all the eigenvalues are positive thus guaranteeing
that the step (𝐱 𝑘+1 − 𝐱 𝑘 ) is a descent step. The Levenberg–Marquardt approach starts
each iteration with a very small value of 𝜇, thus giving effectively the Newton’s step. If
an improvement in the objective function is achieved, the new point is accepted.
Otherwise, the value of 𝜇 is increased until a reduction in the objective function is
obtained.
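A minimal sketch of the Levenberg–Marquardt strategy just described, applied to the Rosenbrock function used earlier, is shown below; the starting point, the initial value of μ, and the factors used to increase or decrease μ are illustrative assumptions.
% Levenberg-Marquardt step control on the Rosenbrock function (illustrative sketch)
clear all, clc
f = @(x)100*(x(2)-x(1)^2)^2 + (1-x(1))^2;
g = @(x)[-400*(x(2)-x(1)^2)*x(1)+2*x(1)-2; 200*(x(2)-x(1)^2)];
H = @(x)[1200*x(1)^2-400*x(2)+2, -400*x(1); -400*x(1), 200];
x = [-1.2;1]; mu = 1e-3;                   % assumed start point and initial mu
for k = 1:500
    if norm(g(x)) < 1e-8, break; end
    p = -(H(x) + mu*eye(2))\g(x);          % LM step: (H + mu*I) p = -grad f
    if f(x+p) < f(x)
        x = x + p; mu = mu/10;             % success: accept step, move towards Newton
    else
        mu = mu*10;                        % failure: increase mu (steepest-descent-like step)
    end
end
k, x, fmin = f(x)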
starting from the point 𝐱 0 = [3.0 − 7.0 0]𝑇 . Utilize the MATLAB software
The algorithm terminated in only one iteration. The exact solution for this problem is
𝐱 ⋆ = [1.0 0.0 0.0]𝑇 with a minimum value of 𝑓(𝐱 ⋆ ) = −1.50.
Theorem: Any locally optimal point of a convex optimization problem is also (globally)
optimal.
Where k is the index of the variable and i the iteration number. The MATLAB
function rand generates a uniformly distributed random number in (0,1) at each call; let us call
this number r = rand. Since the search must be done in the interval (a_k, b_k) for each
variable x_k, we want the random number to be mapped into this range. For this
reason, the following transformation is made: x_k(i) = a_k + r_k(i)(b_k − a_k) ,  k = 1, 2, … , n
clc;clear; nVar = 2; % the number of decision variables
N= 10000; % Number of random generated points
epsilon = 1e-3; % the convergence factor
a=zeros(1,nVar); b=zeros(1,nVar); % pre-allocation of vectors a and b
for i=1:nVar, a(i)=-1.50; b(i)=1.50; end % set-up of the search limits
fMin = 1e6; % initialize fMin
fPrecedent = fMin;
for i=1:N % global search procedure
x1 = a(1)+ rand*(b(1)-a(1)); % random generation: variable x1
x2 = a(2)+ rand*(b(2)-a(2)); % random generation: variable x2
f=@(x,y)2*x+y+(x.^2-y.^2)+(x-y.^2).^2; % The objective function
func =f(x1,x2);
if (func<fMin)
fMin = func; x1Min = x1; x2Min = x2;
if abs(fMin - fPrecedent)<=epsilon
break;
else
fPrecedent= fMin;
end, end, end
x1=x1Min, x2=x2Min, fMin =fMin(end)
J=@(x,y)[4*x-2*y^2+2;-4*x*y+4*y^3-2*y+1];
Jmin=J(x1, x2), fmin=f(x1, x2),
To find the solution to the minimization problem, the random path method uses an
iterative relationship of the form 𝐱(𝑖 + 1) = 𝐱(𝑖) + 𝛼𝑖 𝐬(𝑖) & 𝑖 = 1,2, … , 𝑛 where i is an
iterative index, 𝐱(𝑖) is the vector of the decision variables, 𝛼𝑖 is a step size, at iteration i
called acceleration factor in the 𝐬(𝑖) direction, and 𝐬(𝑖) is the vector of the minimization
direction. The search procedure starts from a randomly chosen point. Whatever this
start point is, we have to reach the same solution. The coordinates of the minimization
direction vector 𝐬𝑘 , are randomly chosen using the rand function.
Algorithm: [Random Walk algorithm]
Step 1: chose 𝐱(0) and 𝑁max
set 𝑖 = 1
Step 2: for each iteration 𝑖 do
𝐬(𝑖) = random vector
Step 3: 𝐱(𝑖 + 1) = 𝐱(𝑖) + 𝛼𝑖 𝐬(𝑖)
Remark: The convergence of this algorithm is slow and not guaranteed in general, it is
dependent strongly on the convexity of the objective function.
%f=@(x)2*x(1)+x(2)+(x(1).^2+x(2).^2)+(x(1)+x(2).^2).^2;
%J=@(x)[4*x(1)+2*x(2)^2+2;4*x(1)*x(2)+4*x(2)^3+2*x(2)+1];
%---------------------------------------------------------------%
clear all, clc, n=0; nMax=5; xzero=rand(2,1); epsilon=1e-4; alfa0=0.01;
f=@(x)2*x(1)+x(2)+(x(1).^2-x(2).^2)+(x(1)-x(2).^2).^2;
a = -1.0 ; b = 1.0; % the range for s
F0=f(xzero); Fprecedent=F0; % the function value at the start point
f0=F0; s=rand(2,1); alfa = alfa0; increment = alfa0;
xone = xzero + alfa*s; % generate a next iteration Xl
F1 = f(xone); % the objective function value in Xl
Factual = F1;
i=1; % initialize the counter i
go = true; % variable 'go' remains 'true' as long as
% the convergence criteria are not fulfilled
while go
while (Factual>=Fprecedent)
s = rand(2,1); s = a*[1;1] + s*(b-a); % generate a random direction s
xone = xzero + alfa*s;
F1 = f(xone); Factual = F1;
end
i=i+1; f1=F1;
while (Factual<Fprecedent)
Fprecedent = Factual;
alfa = alfa + increment;
xone = xzero + alfa*s; F1 = f(xone);
end
deltaF = abs(F1-Fprecedent); F0 = Factual; xzero = xone; alfa = alfa0;
if(abs(f0-f1)<=epsilon) n = n + 1; end
f0 =f1;
if(n==nMax) go = false; break; end
end
J=@(x)[4*x(1)-2*x(2)^2+2;-4*x(1)*x(2)+4*x(2)^3-2*x(2)+1];
xone, Factual, Jmin=J(xone), fmin=f(xone),
The Monte Carlo method is based on the following principle: if the best option is
needed, it should be tried "at random" many times and then the best option found
between those attempts chosen. If there are enough different attempts, the best option
found will almost certainly be an optimal global value. This method is valid both
mathematically and intuitively. The advantages of the method are both its simplicity
and its universality. But it has the disadvantage of being too slow.
Let’s pretend that we don’t know the value of 𝜋. To calculate it, we will generate a large
number 𝑁 of random points in the unit square. By 𝑛 we will denote the number of
points lying inside the quarter-circle. As you will certainly agree, with large 𝑁 the ratio
𝑛/𝑁 must be very similar to the ratio of 𝐴/𝑆. And that’s all! From the equation 𝑛/𝑁 = 𝐴/𝑆
we can easily express that 𝜋 = 4𝑛/𝑁. This proportion becomes valid as the number 𝑁 of
uniformly distributed points on the square area generated and in the meantime
becomes higher.
The corresponding MATLAB program is presented below.
clear all, clc, nmax = 5000;
x = rand(nmax,1); y = rand(nmax,1); x1=x-0.5; y1=y-0.5;
r = sqrt(x1.^2+y1.^2) ;
% get logicals
inside = r<=0.5; outside = r>0.5;
% plot
plot(x1(inside),y1(inside),'b.');
hold on
plot(x1(outside),y1(outside),'r.');
axis equal
% get pi value
thepi = 4*sum(inside)/nmax;
fprintf('%8.4f\n',thepi)
𝑓(𝑥, 𝑦) = −0.02 sin(𝑥 + 4𝑦) − 0.2 cos(2𝑥 + 3𝑦) − 0.2 sin(2𝑥 − 𝑦) + 0.4 cos(𝑥 − 2𝑦)
While the objective function depends on two variables 𝑥 and 𝑦, its graphical
representation is a surface. From Fig. it is evident that this surface has many peaks
and valleys, interpretable as many local minimum (or maximum) points, depending on
the problem scope. Usually, a numerical optimization procedure risks ending in a local
optimum point instead of an absolute minimum point.
% This program draws the mesh of a multimodal
% function that depends on two variables
clear all, clc
f=@(x1,x2)-0.02*sin(x1+4*x2)-0.2*cos(2*x1+3*x2)-0.2*sin(2*x1-
x2)+0.4*cos(x1-2*x2);
a1=-2.5;a2=-2.5; b1=2.5;b2=2.5; increment1=0.1; increment2=0.1;
n1=(b1-a1)/increment1; n2=(b2-a2)/increment2; fGraph = zeros(n1,n2);
x1 = a1;
for i1 = 1:n1
x2 = a2;
for i2 = 1:n2
fGraph (i1,i2)=f(x1,x2);
x2 = x2 + increment2;
end
x1 = x1 + increment1;
end
mesh(fGraph) ; % drawing of fGraph
clear all, clc
f=@(x1,x2)-0.02*sin(x1+4*x2)-0.2*cos(2*x1+3*x2)-0.2*sin(2*x1-x2)+0.4*cos(x1-2*x2);
% initialization (starting point, neighbourhood bounds and grid size are assumed values)
x1_min=0; x2_min=0;                    % starting point of the local search
ls1=-2.5; ld1=2.5; ls2=-2.5; ld2=2.5;  % left/right bounds of the neighbourhood on each axis
gridNumber=100; epsilon=1e-4;          % grid resolution and precision factor
minF=f(x1_min,x2_min); precF=minF; delta=1;
while delta>epsilon
    %------% The block for x1 variable (keep x2=constant) %------%
    x2 = x2_min; x1 = ls1;
    increment = abs(ld1-ls1)/gridNumber;
    while x1<=ld1
        func = f(x1,x2);
        if func<minF, minF = func; x1_min=x1; end
        x1 = x1 + increment;
    end
    %------% The block for x2 variable (keep x1=constant) %------%
    x1 = x1_min; x2 = ls2;
    increment = abs(ld2-ls2)/gridNumber;
    while x2<=ld2
        func = f(x1,x2);
        if func<minF, minF = func; x2_min=x2; end
        x2 = x2 + increment;
    end
    actF = minF;
    % check the convergence criterion
    delta=abs(actF-precF);
    precF = actF;
end
x1_min, x2_min, minF
Starting from this point, a local search procedure is designed. This procedure is based
on the Grid method. First thing to do is to define a neighborhood of the starting point
on each axis. This neighborhood should be set for each axis separately. Then along
each axis a search of the minimum point in that direction is made successively.
In this particular case, the local search along the 𝑥1 axis starts from the left bound of
the neighborhood, that is 𝑙𝑠1, while 𝑥2 is kept constant. Once the minimum point along
𝑥1 axis is found, it is kept constant, while the local search is performed along the 𝑥2
axis. When the search along all axes is finished, one iteration is over. The
value of the objective function at this point is compared with the value of the similar
point at the previous iteration. If the difference between these two values, denoted delta,
is less than a precision factor called epsilon, set initially, then the search stops, and
continues otherwise.■ (Ancau Mircea 2019)
f=@(x1,x2)log((1+(x1-4/3).^2)+3*(x1+x2-(x1).^3).^2);
x1 =-2:0.1:2 ; x2 =-2:0.1:2;
An optimization problem may entail a set of equality constraints and possibly a set of
inequality constraints. If this is the case, the problem is said to be a constrained
optimization problem. The most general constrained optimization problem can be
expressed mathematically as
minimize 𝑓(𝐱), for 𝐱 ∈ 𝛀
subject to: ℎ𝑖 (𝐱) = 0
𝑔𝑗 (𝐱) ≥ 0
A problem that does not entail any equality or inequality constraints is said to be an
unconstrained optimization problem. Constrained optimization is usually much more
difficult than unconstrained optimization, as might be expected. Consequently, the
general strategy that has evolved in recent years towards the solution of constrained
optimization problems is to reformulate constrained problems as unconstrained
optimization problems. When the objective function and all the constraints are linear
functions of 𝐱, the problem is a linear programming problem. Problems of this type are
probably the most widely formulated and solved of all optimization problems,
particularly in control system, management, financial, and economic applications.
Nonlinear programming problems, in which at least some of the constraints or the
objective are nonlinear functions, tend to arise naturally in the physical sciences and
engineering, and are becoming more widely used in control system, management and
economic sciences as well.
Several branches of mathematical programming are of much interest for the
optimization problems, namely, linear, integer, quadratic, nonlinear, and dynamic
programming. Each one of these branches of mathematical programming consists of the
theory and application of a collection of optimization techniques that are suited to a
specific class of optimization problems.
The method can be summarized as follows: in order to find the maximum or minimum
of a function 𝑓(𝐱) subjected to the equality constraint g(𝐱) = 0, form the Lagrangian
function 𝐿(𝐱 , 𝜆) = 𝑓(𝐱) − 𝜆g(𝐱) and find the stationary points 𝐱 = 𝐱 ⋆ of 𝐿(𝐱 , 𝜆) such that
∇𝐿(𝐱 ⋆ , 𝜆) = 0. Further, the method of Lagrange multipliers is generalized by the Karush–
Kuhn–Tucker conditions, which can also take into account inequality constraints of the
form ℎ(𝐱) ≤ 𝑐.
Often the Lagrange multipliers have an interpretation as some quantity of interest. For
example, consider
minimize 𝑓(𝐱), for 𝐱 ∈ 𝛀, subject to: g_i(𝐱) = c_i, i = 1, … , p.
The Lagrangian function is 𝐿(𝐱, 𝝀) = 𝑓(𝐱) + ∑_{i=1}^{p} 𝜆_i(c_i − g_i(𝐱)). Then 𝜆_k = ∂𝐿/∂c_k. So, 𝜆_k is
the rate of change of the quantity being optimized as a function of the constraint
parameter. The relationship between the gradient of the function and gradients of the
constraints is:
𝐿(𝐱 , 𝜆) = 𝑓(𝐱) − 𝜆g(𝐱) ⟹ ∇𝑓(𝐱) = 𝜆∇g(𝐱).
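For instance, to minimize 𝑓(𝑥, 𝑦) = 𝑥² + 𝑦² subject to g(𝑥, 𝑦) = 𝑥 + 𝑦 − 𝑐 = 0, the condition ∇𝑓 = 𝜆∇g gives 2𝑥 = 𝜆 and 2𝑦 = 𝜆 together with 𝑥 + 𝑦 = 𝑐, so 𝑥⋆ = 𝑦⋆ = 𝑐/2 and 𝜆 = 𝑐. The optimal value is 𝑓(𝐱⋆) = 𝑐²/2, whose derivative with respect to the constraint parameter 𝑐 is exactly 𝜆, illustrating the interpretation above.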
Remark: In optimal control theory, the Lagrange multipliers are interpreted as costate
variables, and Lagrange multipliers are reformulated as the minimization of the
Hamiltonian, in Pontryagin's minimum principle.
where 𝑨 ∈ ℝ^{p×n} is assumed to have full row rank. (The case where the constraints are nonlinear is discussed afterwards.)
Solution: In this case we have 𝐿(𝐱, 𝝀) = 𝑓(𝐱) − 𝝀^T g(𝐱) = 𝑓(𝐱) − 𝝀^T(𝑨𝐱 − 𝒃) with g(𝐱) = 𝑨𝐱 − 𝒃 = 𝟎. Let us define g_new(𝐱, 𝝀) = 𝝀^T g(𝐱) = 𝝀^T(𝑨𝐱 − 𝒃). Taking the gradient of this new function with respect to 𝐱 gives ∇g_new(𝐱) = 𝑨^T𝝀.
Solution: We know that ∇𝑓(𝐱) = 𝑯𝐱 + 𝒑, so that 𝝀 = (𝑨^T)^+∇𝑓(𝐱⋆) = (𝑨𝑨^T)^{−1}𝑨(𝑯𝐱⋆ + 𝒑). In order to eliminate 𝐱⋆, note that the stationarity condition gives 𝑨^T𝝀 = 𝑯𝐱⋆ + 𝒑; multiplying both sides by 𝑨𝑯^{−1} we get 𝑨𝑯^{−1}𝑨^T𝝀 = 𝑨𝐱⋆ + 𝑨𝑯^{−1}𝒑 = 𝒃 + 𝑨𝑯^{−1}𝒑.
Remark: assume that we are dealing with the optimization problem
minimize 𝑓(𝐱), for 𝐱 ∈ 𝛀, subject to: 𝐠(𝐱) = 𝟎.
The Karush–Kuhn–Tucker conditions state that ∇𝐿(𝐱, 𝝀) = 𝟎, which can be written in the form
𝐿(𝐱, 𝝀) = 𝑓(𝐱) + 𝝀^T𝐠(𝐱)  ⟺  𝒉(𝐱, 𝝀) = [∂𝐿(𝐱, 𝝀)/∂𝐱; ∂𝐿(𝐱, 𝝀)/∂𝝀] = [∇𝑓(𝐱) + 𝑱^T𝝀; 𝐠(𝐱)] = [𝟎; 𝟎]
where 𝑱 is the Jacobian of the vector 𝐠(𝐱). In the case when 𝑓(𝐱) = (1/2)𝐱^T𝑯𝐱 + 𝐱^T𝒑 and 𝐠(𝐱) = 𝑨𝐱 − 𝒃,
𝒉(𝐱, 𝝀) = [∇𝑓(𝐱) + 𝑱^T𝝀; 𝐠(𝐱)] = [𝑯𝐱 + 𝒑 + 𝑨^T𝝀; 𝑨𝐱 − 𝒃] = 𝟎  ⇔  [𝑯 𝑨^T; 𝑨 𝟎][𝐱; 𝝀] = [−𝒑; 𝒃]
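To make the last relation concrete, the following MATLAB sketch (with a hypothetical 𝑯, 𝒑, 𝑨 and 𝒃) solves an equality-constrained quadratic program directly through its KKT linear system:
% minimize 0.5*x'*H*x + p'*x  subject to  A*x = b, via the KKT system above
H = [2 0; 0 2];  p = [-2; -5];      % hypothetical quadratic objective
A = [1 1];       b = 1;             % hypothetical equality constraint
K = [H A'; A zeros(size(A,1))];     % KKT matrix
sol = K\[-p; b];                    % solve [H A'; A 0][x; lambda] = [-p; b]
x_opt  = sol(1:2)                   % minimizer (here [-0.25; 1.25])
lambda = sol(3)                     % Lagrange multiplier (here 2.5)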
The basic concept in random search approaches is to randomly generate points in the
parameter space. Only feasible points satisfying g 𝑗 (𝐱) ≤ 0, 1 ≤ 𝑗 ≤ 𝑚 are considered,
while non-feasible points with at least one g 𝑗 (𝐱) > 0 for some j are rejected. The
algorithm keeps track of the feasible random point with the least value of the objective
function. This requires checking, at every iteration, if the newly generated feasible point
has a better objective function than the best value achieved so far.
The main disadvantage of this algorithm is that a large number of objective function
calculations may be required especially for problems with large n. The following
example illustrates this technique.
iterations=k,
BestPosition=x0,
fmax=f0,
minimize  log(1 + (𝑥₁ − 4/3)² + 3(𝑥₁ + 𝑥₂ − 𝑥₁³)²),
subject to: 𝑥₁² + 𝑥₂² − 4 ≤ 0,   −1 ≤ 𝑥₁, 𝑥₂ ≤ 1
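The listing below is a minimal sketch (the number of trial points and other parameter values are illustrative) of how such a random search can be coded in MATLAB for the problem above:
clear all, clc
f  = @(x1,x2) log(1+(x1-4/3).^2+3*(x1+x2-x1.^3).^2);
g  = @(x1,x2) x1.^2+x2.^2-4;          % inequality constraint g <= 0
N  = 1e5;                             % number of random trial points
fbest = inf;  xbest = [NaN NaN];
for k = 1:N
    x1 = -1+2*rand;  x2 = -1+2*rand;  % sample the box -1 <= x1,x2 <= 1
    if g(x1,x2) <= 0                  % keep only feasible points
        fx = f(x1,x2);
        if fx < fbest
            fbest = fx;  xbest = [x1 x2];
        end
    end
end
iterations = N, BestPosition = xbest, fmin = fbest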
The positive constant 𝜆 is the regularization parameter. As 𝜆 gets larger, more weight is
given to the regularization function. In many cases, the regularization is taken to be
quadratic. In particular, 𝑅(𝐱) = ‖𝑫𝐱‖2 where 𝑫 ∈ ℝ𝑝×𝑛 is a given matrix. The quadratic
regularization function aims to control the norm of 𝑫𝐱 and is formulated as follows:
min_𝐱  f_RLS(𝐱) = ‖𝑨𝐱 − 𝒃‖² + 𝜆‖𝑫𝐱‖²
Since the Hessian of the objective function is ∇2 𝑓RLS = 2(𝑨𝑻 𝑨 + 𝜆𝑫𝑻 𝑫) ≽ 0, it follows by
previous theorems that any stationary point is a global minimum point. The stationary
points are those satisfying ∇𝑓RLS (𝐱) = 0, that is, (𝑨𝑻 𝑨 + 𝜆𝑫𝑻 𝑫)𝐱 = 𝑨𝑻 𝒃.
𝑨 = 𝑩^T𝑩 + 10^{−3}𝑰 = [2+10^{−3} 3 4; 3 5+10^{−3} 7; 4 7 10+10^{−3}],   𝑩 = [1 1 1; 1 2 3],   𝒃 = [20.0019; 34.0004; 48.0202]
The purpose is to find the best approximate solution of 𝑨𝐱 = 𝒃. Knowing that the exact
solution is 𝐱 𝑡𝑟𝑢𝑒 = [1 2 3]𝑇 .
The matrix 𝑨 is in fact of a full column rank since its eigenvalues are all positive (which
can be checked, for example, by the MATLAB command eig(𝑨)), and the simple least
squares solution is given by 𝐱 𝐿𝑆 , whose value can be computed by
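for instance with the commands below (a sketch reusing the data given above):
B = [1 1 1; 1 2 3];
A = B'*B + 1e-3*eye(3);
b = [20.0019; 34.0004; 48.0202];
x_LS = (A'*A)\(A'*b)     % least squares solution (here simply A\b, since A is invertible)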
𝐱 𝐿𝑆 is rather far from the true vector 𝐱 𝑡𝑟𝑢𝑒 . One difference between the solutions is that
the squared norm ‖𝐱_LS‖² = 90.1855 is much larger than the correct squared norm
‖𝐱_true‖² = 14. In order to control the norm of the solution we will add the quadratic
regularization function ‖𝐱‖². The regularized solution will thus have the form
𝐱_RLS(𝜆) = (𝑨^T𝑨 + 𝜆𝑰)^{−1}𝑨^T𝒃
This quadratic function can also be written as 𝑅(𝐱) = ‖𝑳𝐱‖2 , where 𝑳 ∈ ℝ(𝑛−1)×𝑛 is given
by
𝑳 = [1 −1 0 ⋯ 0 0; 0 1 −1 ⋯ 0 0; ⋮ ⋱ ⋮; 0 0 0 ⋯ 1 −1]
The resulting regularized least squares problem is (with 𝜆 a given regularization
parameter)
min_𝐱  ‖𝐱 − 𝒃‖² + 𝜆‖𝑳𝐱‖²
randn('seed',314);
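% (x is the true signal of length 300; its construction is omitted in this excerpt)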
b=x+0.05*randn(300,1);
The true and noisy signals are given in Figure, which was constructed by the MATLAB
commands
subplot(1,2,1);
plot(1:300,x,'LineWidth',2);
subplot(1,2,2);
plot(1:300,b,'LineWidth',2);
In order to denoise the signal 𝒃, we look at the optimal solution of the RLS problem, for
four different values of the regularization parameter: 𝜆 = 1, 10,100, 1000.
The original true signal is denoted by a dotted line. As can be seen in the next Figure,
as 𝜆 gets larger, the RLS solution becomes smoother.
For 𝜆 = 10 the RLS solution is a rather good estimate of the original vector 𝐱. For
𝜆 = 100 we get a smoother RLS signal, but evidently it is less accurate than 𝐱 𝑅𝐿𝑆 (10),
especially near the boundaries. The RLS solution for 𝜆 = 1000 is very smooth, but it is a
rather poor estimate of the original signal. In any case, it is evident that the parameter
𝜆 is chosen via a trade-off between data fidelity (closeness of 𝐱 to 𝒃) and smoothness
(size of 𝑳𝐱). The four plots were produced by the MATLAB commands
L=zeros(299,300);
for i=1:299
L(i,i)=1;
L(i,i+1)=-1;
end
x_rls=(eye(300)+1*L'*L)\b;
x_rls=[x_rls,(eye(300)+10*L'*L)\b];
x_rls=[x_rls,(eye(300)+100*L'*L)\b];
x_rls=[x_rls,(eye(300)+1000*L'*L)\b];
figure(2)
for j=1:4
subplot(2,2,j);
plot(1:300,x_rls(:,j),'LineWidth',2);
hold on
plot(1:300,x,':r','LineWidth',2);
hold off
title(['\lambda=',num2str(10^(j-1))]);
end
Most real-world optimizations are highly
nonlinear and multimodal, under various complex constraints. Different objectives are
often conflicting. Even for a single objective, sometimes, optimal solutions may not exist
at all. In general, finding an optimal solution or even sub-optimal solutions is not an
easy task. This work aims to introduce the fundamentals of metaheuristic optimization,
as well as some popular metaheuristic algorithms. Metaheuristic algorithms are
becoming an important part of modern optimization. A wide range of metaheuristic
algorithms have emerged over the last two decades, and many metaheuristics such as
particle swarm optimization are becoming increasingly popular. Despite their popularity,
mathematical analysis of these algorithms lags behind. Convergence analysis still
remains unsolved for the majority of metaheuristic algorithms, while efficiency analysis
is equally challenging.
minimize 𝑓₁(𝐱), 𝑓₂(𝐱), … , 𝑓_m(𝐱),   𝐱 ∈ ℝ^n,
subject to: ℎ_j(𝐱) = 0 (j = 1, … , J),   𝑔_k(𝐱) ≤ 0 (k = 1, … , K),
where 𝑓₁, … , 𝑓_m are the objectives, while ℎ_j and 𝑔_k are the equality and inequality constraints, respectively. In the case when 𝑚 = 1, it is called single-objective optimization. When 𝑚 ≥ 2, it becomes a multi-objective problem whose solution strategy is different from those for a single objective. In general, all the functions 𝑓_i, ℎ_j and 𝑔_k are nonlinear. In the special case when all these functions are linear, the optimization problem becomes a linear programming problem which can be solved using the standard simplex method (Dantzig 1963).
Metaheuristic optimization concerns more generalized, nonlinear optimization
problems. It is worth pointing out that the above minimization problem can also be
formulated as a maximization problem if 𝑓𝑖 is replaced with −𝑓𝑖 .
Derivative-free algorithms do not use any derivative information but the values of the
function itself. Some functions may have discontinuities or it may be expensive to
calculate derivatives accurately, and thus derivative-free algorithms become very
useful.
Search capability can also be a basis for algorithm classification. In this case,
algorithms can be divided into local and global search algorithms. Local search
algorithms typically converge towards a local optimum, not necessarily (often not) the
global optimum, and such an algorithm is often deterministic and has no ability to
escape from local optima. On the other hand, for global optimization, local search
algorithms are not suitable, and global search algorithms should be used. Modern
metaheuristic algorithms in most cases tend to be suitable for global optimization,
though not always successful or efficient.
Algorithms with stochastic components were often referred to as heuristic in the past,
though the recent literature tends to refer to them as metaheuristics. We will follow
Glover's convention and call all modern nature-inspired algorithms metaheuristics
(Glover 1986, Glover and Kochenberger 2003). Loosely speaking, heuristic means to find
or to discover by trial and error. Here meta- means beyond or higher level, and
metaheuristics generally perform better than simple heuristics. In addition, all
metaheuristic algorithms use a certain tradeoff of randomization and local search.
Quality solutions to difficult optimization problems can be found in a reasonable
amount of time, but there is no guarantee that optimal solutions can be reached. It is
hoped that these algorithms work most of the time, but not all the time. Almost all
metaheuristic algorithms tend to be suitable for global optimization.
When a particle finds a location that is better than any previously found locations, then
it updates this location as the new current best for particle 𝑖 . There is a current best for
all particles at any time 𝑡 at each iteration. The aim is to find the global best among all
the current best solutions until the objective no longer improves or after a certain
number of iterations.
Let 𝐱 𝑖 and 𝐯𝑖 be the position and velocity vectors, respectively, of particle 𝑖. The new
velocity vector is determined by the following formula
𝐯_i^{k+1} = 𝜔(k)𝐯_i^k + 𝛼𝜺₁ ⊙ (𝐠_best − 𝐱_i^k) + 𝛽𝜺₂ ⊙ (𝐱_i^best − 𝐱_i^k)
where 𝜺1 and 𝜺2 are two random vectors, and each entry takes a value between 0 and 1.
The parameters 𝛼 and 𝛽 are the learning parameters or acceleration constants, which
are typically equal to, say, 𝛼 ≈ 𝛽 ≈ 2. 𝜔(𝑘) is the inertia function, which takes a value between
0 and 1. In the simplest case, the inertia function can be taken as a constant, typically
𝜔 ∈ [0.5 0.9]. This is equivalent to introducing a virtual mass to stabilize the motion of
the particles, and thus the algorithm is expected to converge more quickly.
The initial locations of all particles should be distributed relatively uniformly so that
they can sample over most regions, which is especially important for multimodal
problems. The initial velocity of a particle can be set to zero, that is, 𝐯𝑖𝑘=0 = 0 . The new
position can then be updated by the formula 𝐱 𝑖𝑘+1 = 𝐱 𝑖𝑘 + 𝐯𝑖𝑘+1
As the iterations proceed, the particle system swarms and may converge towards a
global optimum.
▪ 𝑓(𝑥, 𝑦) = 3 sin(𝑥) + 𝑒^𝑦,   −4 ≤ 𝑥, 𝑦 ≤ 4
▪ 𝑓(𝑥, 𝑦) = 100(𝑦 − 𝑥²)² + (1 − 𝑥²)²,   −10 ≤ 𝑥, 𝑦 ≤ 10
𝜔(iter) = 𝜔_max − ((𝜔_max − 𝜔_min)/Max_iter) × iter
clear all, clc,
[X,Y] = meshgrid(-4:0.5:4,-4:0.5:4);
Z = 3*sin(X)+exp(Y); surf(X,Y,Z); figure
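% --- (assumed initialization; the original listing omits these settings, the
% ---  values below mirror the firefly listing later in this chapter) ---
c1=2; c2=2; wmax=0.9; wmin=0.5;            % learning factors and inertia bounds
itermax=50; xmin=[-4 -4]; xmax=[4 4];      % iterations and search box
n=50; m=2;                                  % n = particles, m = variables
rand('state',0); v=zeros(m,n);
for i=1:n
    for j=1:m
        x(j,i)=xmin(j)+rand*(xmax(j)-xmin(j));
    end
    fun_marge(i)=3*sin(x(1,i))+exp(x(2,i));
end
xbest=x; fbest=fun_marge; fgbest=min(fun_marge);
gbest=x(:,find(fun_marge==fgbest,1));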
for iter=1:itermax
w=wmax-(wmax-wmin)*iter/itermax;
for i=1:n
v(:,i)=w*v(:,i)+c1*rand*(xbest(:,i)-x(:,i))+c2*rand*(gbest-x(:,i));
x(:,i)=x(:,i)+v(:,i);
for jj=1:m
if x(jj,i)>xmax(jj)
x(jj,i)=xmax(jj);
end
if x(jj,i)<xmin(jj)
x(jj,i)=xmin(jj);
end
end
% fun_marge(i)=100*(x(2,i)-x(1,i)^2)^2+(1-x(1,i))^2;
fun_marge(i)=3*sin(x(1,i))+exp(x(2,i));
if fun_marge(i) < fbest(i)
xbest(:,i)=x(:,i);
fbest(i)=fun_marge(i);
end
if fun_marge(i) < fgbest
gbest=x(:,i); fgbest=fun_marge(i);
end
end
result(iter)=fgbest;
end
fprintf(' the optimal value is %3.4f\n', gbest)
fprintf(' the minimum value of func is %3.4f\n', fgbest)
plot([1:itermax], result,'--r','linewidth',1.5)
xlabel('Iteration'), ylabel('Function'), grid on
the optimal value is: 1.5708
the optimal value is: 2.0000
the minimum value of func is: -2.8647
Exercise: Write a MATLAB code to search the maximum value of the following objective
functions
▪ 𝑓(𝑥, 𝑦) = 𝑥² − 𝑦²,   −10 ≤ 𝑥, 𝑦 ≤ 10
▪ 𝑓(𝑥) = 𝑥⁴ − 14𝑥³ + 60𝑥² − 70𝑥,   −10 ≤ 𝑥 ≤ 10
▪ 𝑓(𝑥, 𝑦) = 𝑥 sin(4𝑥) + 1.1𝑦 sin(𝑦),   −10 ≤ 𝑥, 𝑦 ≤ 10
▪ 𝑓(𝐱) = (𝑥 + 10𝑦)² + 5(𝑧 − 𝑤)² + (𝑦 − 2𝑧)⁴ + 10(𝑥 − 2𝑤)⁴
Alternatives of PSO: There are many variations which extend the standard algorithm.
The standard particle swarm optimization uses both the current global best 𝐠 𝑏𝑒𝑠𝑡 and
the individual best 𝐱_i^best. The reason for using the individual best is primarily to increase
the diversity of the solutions; however, this diversity can be simulated using randomness
alone. Subsequently, there is no compelling reason for using the individual best.
A simplified version which could accelerate the convergence of the algorithm is to use
the global best only. Thus, in the accelerated particle swarm optimization, the velocity
vector is generated by
𝐯_i^{k+1} = 𝐯_i^k + 𝛼(𝜺₁ − 0.5𝐞) + 𝛽(𝐠_best − 𝐱_i^k)
In order to increase the convergence even further, we can also write the update of the
location in a single step
𝐱 𝑖𝑘+1 = (1 − 𝛽)𝐱 𝑖𝑘 + 𝛽𝐠 𝑏𝑒𝑠𝑡 + 𝛼 × (𝜺1 − 0.5𝐞)
𝛼 = 𝛼0 𝑒 −𝛾𝑘 ; or 𝛼 = 𝛼0 𝛾 𝑘 , ( 𝛾 < 1)
clear all, clc,
[X,Y] = meshgrid(-4:0.5:4,-4:0.5:4);
Z = 3*sin(X)+exp(Y); surf(X,Y,Z); figure
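% (assumed: the parameters and the swarm are initialized exactly as in the
%  standard PSO listing above: c1, c2, wmax, wmin, itermax, xmin, xmax, n, m,
%  x, v, xbest, fbest, gbest, fgbest)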
for iter=1:itermax
w=wmax-(wmax-wmin)*iter/itermax;
for i=1:n
v(:,i)=v(:,i) + c1*(rand-0.5) + c2*(gbest-x(:,i));
x(:,i)=(1-c2)*x(:,i) + c2*gbest + c1*(rand-0.5);
for jj=1:m
if x(jj,i)>xmax(jj)
x(jj,i)=xmax(jj);
end
if x(jj,i)<xmin(jj)
x(jj,i)=xmin(jj);
end
end
% fun_marge(i)=100*(x(2,i)-x(1,i)^2)^2+(1-x(1,i))^2;
fun_marge(i)=3*sin(x(1,i))+exp(x(2,i));
if fun_marge(i) < fbest(i)
xbest(:,i)=x(:,i);
fbest(i)=fun_marge(i);
end
if fun_marge(i) < fgbest
gbest=x(:,i); fgbest=fun_marge(i);
end
end
result(iter)=fgbest;
end
fprintf(' the optimal value is %3.4f\n', gbest)
fprintf(' the minimum value of func is %3.4f\n', fgbest)
plot([1:itermax], result,'--r','linewidth',1.5)
xlabel('Iteration'), ylabel('Function'), grid on
Example: Write a MATLAB code to search by PSO the maximum value of
𝑓 = 2𝑥² − 3𝑦² + 4𝑧² + 2,   −10 ≤ 𝑥, 𝑦, 𝑧 ≤ 10
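% (assumed: the swarm is initialized as in the previous PSO listings, here with
%  m=3 variables and bounds xmin=-10*[1 1 1], xmax=10*[1 1 1])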
for iter=1:itermax
for i=1:n
v(:,i)=v(:,i) + c1*(rand-0.5) + c2*(gbest-x(:,i));
x(:,i)=(1-c2)*x(:,i) + c2*gbest + c1*(rand-0.5);
for jj=1:m
if x(jj,i)>xmax(jj)
x(jj,i)=xmax(jj);
end
if x(jj,i)<xmin(jj)
x(jj,i)=xmin(jj);
end
end
fun_marge(i)=2*(x(1,i))^2-3*(x(2,i))^2+4*(x(3,i))^2+2;
if fun_marge(i) < fbest(i)
xbest(:,i)=x(:,i);
fbest(i)=fun_marge(i);
end
if fun_marge(i) < fgbest
gbest=x(:,i); fgbest=fun_marge(i);
end
end
result(iter)=fgbest;
end
fprintf(' the optimal value is %3.4f\n', gbest)
fprintf(' the minimum value of func is %3.4f\n', fgbest)
plot([1:itermax], result,'--r','linewidth',1.5)
xlabel('Iteration'), ylabel('Function'), grid on
As already mentioned, swarm intelligence is a relatively new approach to problem solving that takes inspiration from social behaviors of insects and of other animals. In particular, ants have inspired a number of methods and techniques among which the most studied and the most successful is the general-purpose optimization technique known as ant colony optimization. Ant colony optimization (ACO) takes inspiration from the foraging behavior of some ant species. These ants deposit pheromone on the ground in order to mark some favorable path that should be followed by other members of the colony. Ant colony optimization exploits a similar mechanism for solving optimization problems.
Ant colony optimization (ACO), introduced by Marco Dorigo 1991 in his doctoral
dissertation, is a class of optimization algorithms modeled on the actions of an ant
colony. ACO is a probabilistic technique useful in problems that deal with finding better
paths through graphs. Artificial 'ants'—simulation agents—locate optimal solutions by
moving through a parameter space representing all possible solutions. Natural ants lay
down pheromones directing each other to resources while exploring their environment.
The simulated 'ants' similarly record their positions and the quality of their solutions,
so that in later simulation iterations more ants locate better solutions.
Procedure: The ants construct the solutions as follows. Each ant starts from a
randomly selected city (node or vertex). Then, at each construction step it moves along
the edges of the graph. Each ant keeps a memory of its path, and in subsequent steps
it chooses among the edges that do not lead to vertices that it has already visited. An
ant has constructed a solution once it has visited all the vertices of the graph. At each
construction step, an ant probabilistically chooses the edge to follow among those that
lead to yet unvisited vertices. The probabilistic rule is biased by pheromone values and
heuristic information: the higher the pheromone and the heuristic value associated to
an edge, the higher the probability an ant will choose that particular edge. Once all the
ants have completed their tour, the pheromone on the edges is updated. Each of the
pheromone values is initially decreased by a certain percentage. Each edge then
receives an amount of additional pheromone proportional to the quality of the solutions
to which it belongs (there is one solution per ant). The solution construction process is
stochastic and is biased by a pheromone model, that is, a set of parameters associated
with graph components (either nodes or edges) whose values are modified at runtime by
the ants.
Set parameters, initialize pheromone trails
SCHEDULE_ACTIVITIES
   ConstructAntSolutions
   UpdatePheromones
   DaemonActions   (optional)
END_SCHEDULE_ACTIVITIES
Parametrization: Let us say the number of cities is 𝑛, the number of ants is 𝑚, the distance between the 𝑖-th and 𝑗-th cities is 𝑑_ij (𝑖, 𝑗 = 1, 2, … , 𝑛), and the concentration of pheromone on edge (𝑖, 𝑗) at time 𝑡 is 𝜏_ij(𝑡). At the initial time the pheromone concentration between cities is equal, 𝜏_ij(0) = 𝐶 (𝐶 a constant), and the probability that ant 𝑘 moves from city 𝑖 to city 𝑗 is expressed by 𝑝_ij^k, given by the formula
𝑝_ij^k = (𝜏_ij(𝑡))^α (𝜂_ij(𝑡))^β / ∑_{s∈N(x_k)} (𝜏_is(𝑡))^α (𝜂_is(𝑡))^β
The parameter 𝜂𝑖𝑗 (𝑡) = 1/𝑑𝑖𝑗 is heuristic information, which indicates the degree of
expectation of ants from 𝑖 𝑡ℎ to the 𝑗 𝑡ℎ city. 𝑁(𝑥𝑘 ) (𝑘 = 1, 2 … , 𝑚) indicates that ant 𝑘 is to
visit the urban set. Furthermore, 𝛼 and 𝛽 are positive real parameters whose values
determine the relative importance of pheromone versus heuristic information. When all
ants complete a cycle, they update the pheromone according to formula
𝜏_ij(𝑡+1) ⟵ (1 − 𝜌)𝜏_ij(𝑡) + Δ𝜏_ij   with   Δ𝜏_ij = ∑_{k=1}^{m} Δ𝜏_ij^k,   Δ𝜏_ij^k = Q/L_k if ant 𝑘 used edge (𝑖, 𝑗), and 0 otherwise,
where 𝑄 is a constant that represents the total amount of pheromone released once by
an ant and 𝐿_k is the tour length of the 𝑘-th ant.
clear all, clc,
%LB=20*[-1 -1 -1]; UB=20*[1 1 1]; nvars=size(LB,2);
%f=@(x)2*x(1)^2-3*x(2)^2+4*x(3)^2+2; % Ant-cost
LB=20*[-1 -1]; UB=20*[1 1]; nvars=size(LB,2);
f=@(x)3*sin(x(1))+exp(x(2));
MaxTour=100; % Number of Tours
piece=500; % Number of pieces (cities)
max_assign=50; % MaxValue of assign
ants=50; % Number of Ants
poz_ph=0.5; % PositivePheremone
neg_ph=0.2; % NegativePheremone
lambda=0.95; % EvaporationParameter
ph=0.05; % Pheromone
pher=ones(piece,nvars);
indis=zeros(ants,nvars);
costs=zeros(ants,1);
cost_general=zeros(max_assign,(nvars+1));
deger=zeros(piece,nvars); deger(1,:)=LB;
for i=2:piece
for j=1:nvars
deger(i,j)=deger(i-1,j) + (UB(j)-LB(j))/(piece-1);
end
end
assign=0;
while (assign<max_assign)
for i=1:ants % FINDING THE PARAMETERS OF VALUE
prob = pher.*rand(piece,nvars);
for j=1:nvars
indis(i,j) = find(prob(:,j) == max(prob(:,j)));
end
temp=zeros(1,nvars);
for j=1:nvars
temp(j)=deger(indis(i,j),j);
end
costs(i) = f(temp); % LOCAL UPDATING
deltalocal = zeros(piece,nvars);
% Creates Matrix Contain the Pheremones Deposited for Local Updating
for j=1:nvars
deltalocal(indis(i,j),j)=(poz_ph*ph/(costs(i)));
end
pher = pher + deltalocal;
end
best_ant= min(find(costs==min(costs)));
worst_ant = min(find(costs==max(costs)));
deltapos = zeros(piece,nvars);
deltaneg = zeros(piece,nvars);
for j=1:nvars
deltapos(indis(best_ant,j),j)=(ph/(costs(best_ant)));
% UPDATING PHER OF nvars
deltaneg(indis(worst_ant,j),j)=-(neg_ph*ph/(costs(worst_ant)));
% NEGATIVE UPDATING PHER OF worst path
end
delta = deltapos + deltaneg;
pher = pher.^lambda + delta;
assign=assign + 1; % Update general cost matrix
for j=1:nvars
cost_general (assign,j)=deger(indis(best_ant,j),j);
end
cost_general (assign,nvars+1)=costs(best_ant);
xlabel Tour
title('Change in Cost Value. Red: Means, Blue: Best')
hold on
plot(assign, mean(costs), '.r');
plot(assign, costs(best_ant), '.b');
end
list_cost=sortrows(cost_general,nvars+1);
for j=1:nvars
x(j)=list_cost(1,j);
end
x1=x', fmax=f(x1)
The Firefly Algorithm (FA) was developed by
Xin-She Yang (Yang 2008) and is based on the flashing patterns and behavior of
fireflies. In essence, FA uses the following three idealized rules:
⦁ Fireflies are unisex (one firefly will be attracted to other fireflies regardless of their sex)
⦁ The attractiveness is proportional to the brightness and both decrease as the distance
between two fireflies increases. Thus for any two flashing fireflies, the brighter firefly
will attract the other one. If neither one is brighter, then a random move is performed.
⦁ The brightness of a firefly is determined by the landscape of the objective function.
𝐱_i^{k+1} = 𝐱_i^k + 𝛽₀e^{−𝛾r_ij²}(𝐱_j^k − 𝐱_i^k) + 𝛼𝐞_i^k
where 𝛾 is the light absorption coefficient, which can be in the range [0.01, 100], and r_ij is the
line-of-sight distance between the fireflies. The second term 𝛽₀e^{−𝛾r_ij²}(𝐱_j^k − 𝐱_i^k) is due to
the attraction. The third term 𝛼𝐞_i^k is a randomization with 𝛼 being the randomization
parameter, and 𝐞_i^k is a vector of random numbers drawn from a Gaussian distribution
or a uniform distribution at time k. If 𝛽₀ = 0, it becomes a simple random walk.
Furthermore, the randomization 𝐞𝑘𝑖 can easily be extended to other distributions such
as Lévy flights.
clear all, clc, c1=0.8; c2=0.7; gama=20;
itermax=50; xmin=10*[-2 -2]; xmax=10*[2 2];
n=50; m=2; % n=Number of Particles and n=Number of variables
rand('state',0); % v=zeros(m,n);
for i=1:n
for j=1:m
x(j,i)=xmin(j)+rand*(xmax(j)-xmin(j));
end
fun_marge(i)=3*sin(x(1,i))+exp(x(2,i));
end
xbest=x; fbest=fun_marge; fgbest=min(fun_marge);
gbest=x(:,find(fun_marge==fgbest));
for iter=1:itermax
for i=1:n
for j=1:i
r= norm(x(:,j)-x(:,i));
x(:,i)=x(:,i)+c2*(exp(-gama*r^2))*(x(:,j)-x(:,i))+c1*(randn-0.5);
end
for jj=1:m
if x(jj,i)>xmax(jj)
x(jj,i)=xmax(jj);
end
if x(jj,i)<xmin(jj)
x(jj,i)=xmin(jj);
end
end
fun_marge(i)=3*sin(x(1,i))+exp(x(2,i));
if fun_marge(i) < fbest(i)
xbest(:,i)=x(:,i);
fbest(i)=fun_marge(i);
end
if fun_marge(i) < fgbest
gbest=x(:,i); fgbest=fun_marge(i);
end
end
result(iter)=fgbest;
end
fprintf(' the optimal value is %3.4f\n', gbest)
fprintf(' the minimum value of func is %3.4f\n', fgbest)
plot([1:itermax], result,'--r','linewidth',1.5)
xlabel('Iteration'), ylabel('Function'), grid on
It can be shown that the limiting case 𝛾 → 0 corresponds to the standard Particle
Swarm Optimization (PSO). In fact, if the inner loop (for j) is removed and x(:,j) is
replaced by the current global best, then FA essentially becomes the standard PSO.
In computer science and operations research,
the artificial bee colony algorithm (ABC) is an optimization algorithm based on the
intelligent foraging behavior of honey bee swarm,
proposed by Derviş Karaboğa (Erciyes University) in
2005. In the ABC model, the colony consists of three
groups of bees: employed bees, onlookers and scouts.
It is assumed that there is only one artificial
employed bee for each food source. In other words,
the number of employed bees in the colony is equal
to the number of food sources around the hive.
Employed bees go to their food source and come
back to hive and dance on this area. The employed
bee whose food source has been abandoned becomes a scout and starts to search for
finding a new food source. Onlookers watch the dances of employed bees and choose
food sources depending on dances.
Notes: employed bees associated with specific food sources, onlooker bees watching the
dance of employed bees within the hive to choose a food source, and scout bees
searching for food sources randomly. Both onlookers and scouts are also called
unemployed bees.
Initialization Phase: All the vectors of the population of food sources, 𝐱 𝑘 , are initialized
by scout bees and control parameters are set. Since each food source, 𝐱 𝑘 , is a solution
vector to the optimization problem, each 𝐱 𝑘 vector holds 𝑛 variables, (𝐱 𝑘 (𝑖), 𝑖 = 1. . . 𝑛),
which are to be optimized so as to minimize the objective function.
𝐱_k(𝑖) = 𝒍_i + rand(0,1) × (𝒖_i − 𝒍_i)
where 𝒍_i and 𝒖_i are the lower and upper bound of the parameter 𝐱_k(𝑖), respectively.
Employed Bees Phase: Employed bees search for new food sources (𝐯_k) having more
nectar within the neighbourhood of the food source (𝐱_k) in their memory, typically through
a move of the form 𝐯_k(𝑖) = 𝐱_k(𝑖) + 𝜑_k(𝑖)(𝐱_k(𝑖) − 𝐱_j(𝑖)), where 𝐱_j is a randomly selected
neighbouring source and 𝜑_k(𝑖) is a random number in [−1, 1].
The fitness value of the solution, fit(𝐱 𝑘 ), might be calculated for minimization problems
using the following formula
fit(𝐱_k) = { 1/(1 + f(𝐱_k))   if f(𝐱_k) ≥ 0
           { 1 + |f(𝐱_k)|    if f(𝐱_k) < 0
Onlooker Bees Phase: Unemployed bees consist of two groups of bees: onlooker bees
and scouts. Employed bees share their food source information with onlooker bees
waiting in the hive and then onlooker bees probabilistically choose their food sources
depending on this information. In ABC, an onlooker bee chooses a food source
depending on the probability values calculated using the fitness values provided by
employed bees. For this purpose, a fitness based selection technique can be used, such
as the roulette wheel selection method (Goldberg, 1989).
The probability value 𝑝𝑘 with which 𝐱 𝑘 is chosen by an onlooker bee can be calculated
by using the expression given in equation
𝑝_k = fit(𝐱_k) / ∑_{j=1}^{N} fit(𝐱_j)
Scout Bees Phase: The unemployed bees who choose their food sources randomly are
called scouts. Employed bees whose solutions cannot be improved through a
predetermined number of trials, specified by the user of the ABC algorithm and called
“limit” or “abandonment criteria” herein, become scouts and their solutions are
abandoned. Then, the converted scouts start to search for new solutions, randomly. For
instance, if solution 𝐱 𝑘 has been abandoned, the new solution discovered by the scout
who was the employed bee of 𝐱 𝑘 can be defined by 𝐱 𝑘 (𝑖) = 𝒍𝑖 + rand(0,1) × (𝒖𝑖 − 𝒍𝑖 ). Hence
those sources which are initially poor or have been made poor by exploitation are
abandoned and negative feedback behavior arises to balance the positive feedback.
Exercise: Write a MATLAB code to search the maximum value of the following objective
functions
𝑓(𝐱) = 3 sin(𝑥) + 𝑒^𝑦,   −5 ≤ 𝑥, 𝑦 ≤ 5
𝑓(𝐱) = 2𝑥² + 3𝑦² + 4𝑧² + 5𝑤² + 10,   −5 ≤ 𝑥, 𝑦, 𝑧, 𝑤 ≤ 5
clc;
clear;
close all;
%% Problem Definition
% CostFunction=@(x)2*x(1)^2+3*x(2)^2+4*x(3)^2+5*x(4)^2+10;
f=@(x)3*sin(x(1))+exp(x(2)); % CostFunction
nVar=2; % Number of Decision Variables
VarSize=[1 nVar]; % Decision Variables Matrix Size
VarMin=-5; % Decision Variables Lower Bound
VarMax= 5; % Decision Variables Upper Bound
%% ABC Settings
MaxIt=500; % Maximum Number of Iterations
nPop=500; % Population Size (Colony Size)
nOnlooker=nPop; % Number of Onlooker Bees
L=round(0.6*nVar*nPop); % Abandonment Limit Parameter (Trial Limit)
a=1; % Acceleration Coefficient Upper Bound
%% Initialization
% Empty Bee Structure
empty_bee.Position=[]; empty_bee.Cost=[];
pop=repmat(empty_bee,nPop,1); % Initialize Population Array
BestSol.Cost=inf; % Initialize Best Solution Ever Found
% Create the initial population of food sources
% (assumed; the original listing omits this loop)
for i=1:nPop
    pop(i).Position=VarMin+rand(VarSize).*(VarMax-VarMin);
    pop(i).Cost=f(pop(i).Position);
    if pop(i).Cost<=BestSol.Cost
        BestSol=pop(i);
    end
end
C=zeros(nPop,1);          % Abandonment (trial) counters
BestCost=zeros(MaxIt,1);  % Best cost found at each iteration
%% ABC Main Loop
% (the employed-bee phase is omitted in this excerpt; it mirrors the
%  onlooker-bee phase below)
for it=1:MaxIt
MeanCost = mean([pop.Cost]);
for i=1:nPop
F(i) = exp(-pop(i).Cost/MeanCost); % Convert Cost to Fitness
end
P=F/sum(F);
%-----------------------------------------------
% Onlooker Bees Phase: select a source site by roulette wheel selection
%-----------------------------------------------
for m=1:nOnlooker
r=rand;
CP=cumsum(P);            % renamed from C so as not to clash with the trial counters
i=find(r<=CP,1,'first');
%-----------------------------------------------
% Generate a candidate source in the neighbourhood of source i
% (the standard ABC move with a random partner source k; this step is
%  assumed here, the original listing omits it)
K=[1:i-1, i+1:nPop]; k=K(randi(numel(K)));
phi=a*(2*rand(VarSize)-1);
newbee.Position=pop(i).Position+phi.*(pop(i).Position-pop(k).Position);
newbee.Cost=f(newbee.Position);
% Comparison (greedy selection)
if newbee.Cost<=pop(i).Cost
pop(i)=newbee;
else
C(i)=C(i)+1;
end
end
% Scout Bees (Scout Bees Phase)
for i=1:nPop
if C(i)>=L
pop(i).Position=VarMin+rand(1,nVar)*(VarMax-VarMin); % re-initialize inside the bounds
pop(i).Cost=f(pop(i).Position);
C(i)=0;
end
end
% Update Best Solution Ever Found
for i=1:nPop
if pop(i).Cost<=BestSol.Cost
BestSol=pop(i);
end
end
BestCost(it)=BestSol.Cost; % Store Best Cost Ever Found
end % of the main ABC loop
%% Results
BestSol
figure;
%plot(BestCost,'LineWidth',2);
semilogy(BestCost,'LineWidth',2);
xlabel('Iteration'); ylabel('Best Cost');
grid on;
Bacteria Foraging Optimization Algorithm (BFOA), proposed by Passino, is a newcomer to the family of nature-inspired optimization algorithms. For over the last five decades, optimization algorithms like Genetic Algorithms (GAs), Evolutionary Programming (EP) and Evolutionary Strategies (ES), which draw their inspiration from evolution and natural genetics, have been dominating the realm of optimization algorithms. Recently natural swarm-inspired algorithms like Particle Swarm Optimization (PSO) and Ant Colony Optimization (ACO) have found their way into this domain and proved their effectiveness. Following the same trend of swarm-based algorithms, Passino proposed the BFOA. Application of the group foraging strategy of a swarm of E. coli bacteria in multi-optimal function optimization is the key idea of the new algorithm. Bacteria search for nutrients in a manner to maximize the energy obtained per unit time. An individual bacterium also communicates with others by sending signals. A bacterium takes foraging decisions after considering these two factors. The process in which a bacterium moves by taking small steps while searching for nutrients is called chemotaxis, and the key idea of BFOA is mimicking the chemotactic movement of virtual bacteria in the problem search space.
Now suppose that we want to find the minimum of the cost function 𝑱(𝜽) where 𝜽 ∈ ℜ𝑝
(i.e. 𝜽 is a 𝑝-dimensional vector of real numbers), and we do not have measurements or
an analytical description of the gradient ∇𝑱(𝜽). BFOA mimics the four principal
mechanisms observed in a real bacterial system: chemotaxis, swarming,
reproduction, and elimination-dispersal to solve this non-gradient optimization
problem. A virtual bacterium is actually one trial solution (may be called a search-
agent) that moves on the functional surface (see Figure above) to locate the global
optimum.
Flow diagram illustrating the bacterial foraging optimization algorithm
Generic algorithm of BFO
%Reprodution
Jhealth=sum(J(:,:,K,ell),2); % Set the health of each of the S bacteria
[Jhealth,sortind]=sort(Jhealth); % Sorts the nutrient concentration
P(:,:,1,K+1,ell)=P(:,sortind,Nc+1,K,ell);
c(:,K+1)=c(sortind,K); % keeps the chemotaxis parameters with each bacterium at the next generation
%Report
reproduction = J(:,[1:Ns,Nre,Ned]);
[jlastreproduction,O]=min(reproduction,[],2); %minf for each bacterial
[Y,I] = min(jlastreproduction)
pbest=P(:,I,O(I,:),K,ell)
plot([1:S],jlastreproduction)   % S = number of bacteria
xlabel('Iteration'), ylabel('Function')
The GWO algorithm, proposed by Mirjalili et al. in 2014, mimics the leadership hierarchy and hunting mechanism of grey wolves in nature. Four types of grey wolves, namely alpha, beta, delta, and omega, are employed for simulating the leadership hierarchy. In addition, three main steps of hunting, searching for prey, encircling prey, and attacking prey, are implemented to perform optimization.
Mathematical model: The hunting technique and the social hierarchy of grey wolves
are mathematically modeled in order to design GWO and perform optimization. The
proposed mathematical models of the social hierarchy, tracking, encircling, and
attacking prey are as follows:
■ Encircling prey: As mentioned above, grey wolves encircle prey during the hunt. In order to mathematically model encircling behavior the following equations are proposed:
𝑿(𝑡 + 1) = 𝑿_p(𝑡) − 𝑨·𝑫   with   𝑫 = |𝑪·𝑿_p(𝑡) − 𝑿(𝑡)|
where 𝑡 indicates the current iteration, 𝑨 and 𝑪 are coefficient vectors, 𝑿_p(𝑡) is the position vector of the prey, and 𝑿(𝑡) indicates the position vector of a grey wolf. The vectors 𝑨 and 𝑪 are calculated as follows:
𝑨 = 2𝒂·𝒓₁ − 𝒂,   𝑪 = 2𝒓₂
where the components of 𝒂 are linearly decreased from 2 to 0 over the course of iterations and 𝒓₁, 𝒓₂ are random vectors in [0, 1].
■ Hunting: Grey wolves have the ability to recognize the location of prey and encircle
them. The hunt is usually guided by the alpha. The beta and delta might also
participate in hunting occasionally. However, in an abstract search space we have no
idea about the location of the optimum (prey). In order to mathematically simulate the
hunting behavior of grey wolves, we suppose that the alpha (best candidate solution)
beta, and delta have better knowledge about the potential location of prey. Therefore,
we save the first three best solutions obtained so far and oblige the other search agents
(including the omegas) to update their positions according to the position of the best
search agent. The following formulas are proposed in this regard.
𝑫_α = |𝑪₁·𝑿_α − 𝑿|,   𝑿₁ = 𝑿_α − 𝑨₁·𝑫_α
𝑫_β = |𝑪₂·𝑿_β − 𝑿|,   𝑿₂ = 𝑿_β − 𝑨₂·𝑫_β
𝑫_δ = |𝑪₃·𝑿_δ − 𝑿|,   𝑿₃ = 𝑿_δ − 𝑨₃·𝑫_δ
𝑿(𝑡 + 1) = (𝑿₁ + 𝑿₂ + 𝑿₃)/3
With these equations, a search agent updates its position according to alpha, beta, and
delta in an n-dimensional search space. In addition, the final position would be in a
random place within a circle which is defined by the positions of alpha, beta, and delta
in the search space. In other words, alpha, beta, and delta estimate the position of the
prey, and the other wolves update their positions randomly around the prey.
SearchAgents_no=20;
Max_iter=200;
dim=4;
lb=-0.25*ones(1,dim); ub=0.25*ones(1,dim);
%fobj=@(x)(x(1)-1)^2+(x(2)-2)^2+(x(3)-3)^2+(x(4)-4)^2+(x(5)-5)^2;
%fobj=@(x)3*sin(x(1))+exp(x(2)); dim=2;
fobj=@(x)2*x(1)^2+3*x(2)^2+4*x(3)^2+5*x(4)^2+10;
%---------------------------------------------------------------------%
%Initialize the positions of search agents
%---------------------------------------------------------------------%
% If the boundaries of all variables are equal and the user enters a single
% number for both ub and lb
Boundary_no=size(ub,2);   % number of boundary pairs (assumed; omitted in the excerpt)
if Boundary_no==1
    Positions=rand(SearchAgents_no,dim).*(ub-lb)+lb;
else
    for i=1:dim
        Positions(:,i)=rand(SearchAgents_no,1).*(ub(i)-lb(i))+lb(i);
    end
end
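% (note: the surface plot below applies to a two-variable objective such as the
%  commented fobj above; the four-variable fobj cannot be evaluated at [x(i),y(j)])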
x=-3:0.5:3; y=-3:0.5:3;
L=length(x);
f=[];
for i=1:L
for j=1:L
f(i,j)=fobj([x(i),y(j)]);
end
end
surfc(x,y,f,'LineStyle','none');
%-------------------------------------------------------------------------------------------------------%
Applications of Swarm Intelligence: Swarm Intelligence-based techniques can be
used in a number of applications. The U.S. military is investigating swarm techniques
for controlling unmanned vehicles. The European Space Agency is thinking about an
orbital swarm for self-assembly and interferometry. NASA is investigating the use of
swarm technology for planetary mapping. A 1992 paper by M. Anthony Lewis and
George A. Bekey discusses the possibility of using swarm intelligence to control
nanobots within the body for the purpose of killing cancer tumors. Conversely al-Rifaie
and Aber have used stochastic diffusion search to help locate tumours. Swarm
intelligence has also been applied for data mining. Ant based models are further subject
of modern management theory.
%-------------------------------------------------------------------------------------------------------%
CVX: is a MATLAB-based modeling system for convex optimization. It was created by
Michael Grant and Stephen Boyd. This MATLAB package is in fact an interface to other
convex optimization solvers such as SeDuMi and SDPT3. We will explore here some of
the basic features of the software, but a more comprehensive and complete guide can
be found at the CVX website (CVXr.com). The basic structure of a CVX program is as
follows:
cvx_begin
{variables declaration}
minimize({objective function}) or maximize({objective function})
subject to
{constraints}
cvx_end
CVX accepts only convex functions as objective and constraint functions. There are
several basic convex functions, called “atoms,” which are embedded in CVX.
Example: Suppose that we wish to solve the least squares problem
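A minimal sketch of such a program (assuming the data matrix 𝑨, the vector 𝒃 and the dimension n are already defined in the workspace) is:
cvx_begin
    variable x(n)
    minimize( norm(A*x - b) )
cvx_end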
Example: Suppose that we wish to write a CVX code that solves the convex
optimization problem
Example: Let us use an example to illustrate how a metaheuristic works. The design of
a compressional and tensional spring involves three design variables: wire diameter 𝑥1 ,
coil diameter 𝑥2 , and the length of the coil 𝑥3 . This optimization problem can be
written as
minimize 𝑓(𝐱) = 𝑥₁²𝑥₂(2 + 𝑥₃),
Leverrier-Faddeev Algorithm.
Sylvester, Lyapunov and Riccati
equations
Appendix A
Proof of the Leverrier-Faddeev Algorithm
It is very well-known that: ℒ{e^{𝑨t}} = (𝑠𝑰 − 𝑨)^{−1} = 𝑵(𝑠)/𝑝(𝑠) = (1/𝑝(𝑠))(𝑵₁𝑠^{n−1} + 𝑵₂𝑠^{n−2} + ⋯ + 𝑵_n)
where 𝑝(𝑠) = 𝑠 𝑛 + 𝑎1 𝑠 𝑛−1 + 𝑎2 𝑠 𝑛−2 + ⋯ + 𝑎𝑛 and the adjugate matrix 𝑵(𝑠) is a matrix
polynomial in 𝑠 of degree 𝑛 − 1 with 𝑛 × 𝑛 constant coefficient matrices 𝑵1 , … , 𝑵𝑛 .
The above equation can be re-written as (𝑠𝑰 − 𝑨)𝑵(𝑠) = 𝑝(𝑠)𝑰; expanding this multiplication we
get 𝑵₁𝑠^n + (𝑵₂ − 𝑨𝑵₁)𝑠^{n−1} + (𝑵₃ − 𝑨𝑵₂)𝑠^{n−2} + ⋯ + (𝟎 − 𝑨𝑵_n) = (𝑠^n + 𝑎₁𝑠^{n−1} + ⋯ + 𝑎_n)𝑰.
If the coefficients 𝑎1 , … , 𝑎𝑛 of the characteristic polynomial 𝑝(𝑠) were known, last equation
would then constitute an algorithm for computing the matrices 𝑵1 , … , 𝑵𝑛 . Leverrier-Faddeev
proposed a recursive algorithm which will compute 𝑵𝑖 and 𝑎𝑖 in parallel, even if the coefficients
𝑎𝑖 are not known in advance.
Important Note: Although the Leverrier-Faddeev method has been extensively covered in
most books on linear system theory, unfortunately the majority of these books do not give a
proof of the coefficient formulas.
𝑠 𝑝′(𝑠)/𝑝(𝑠) − 𝑛 = trace(𝑨 ℒ{e^{𝑨t}}) = trace(𝑨 𝑵(𝑠)/𝑝(𝑠)) = (1/𝑝(𝑠)) trace(𝑨𝑵(𝑠))
Which can be written as: 𝑠𝑝′ (𝑠) − 𝑛𝑝(𝑠) = trace(𝑨𝑵(𝑠)). This is equivalent to
−𝑎1 𝑠 𝑛−1 − 2𝑎2 𝑠 𝑛−2 − ⋯ − (𝑛 − 1)𝑎𝑛−1 𝑠 − 𝑛𝑎𝑛 = tr(𝑨𝑵1 )𝑠 𝑛−1 + tr(𝑨𝑵2 )𝑠 𝑛−2 + ⋯ + tr(𝑨𝑵𝑛 )
(𝑠𝑰 − 𝑨)^{−1} = (𝑵₁𝑠^{n−1} + 𝑵₂𝑠^{n−2} + ⋯ + 𝑵_n)/(𝑠^n + 𝑎₁𝑠^{n−1} + 𝑎₂𝑠^{n−2} + ⋯ + 𝑎_n)
where
𝑵₁ = 𝑰,                      𝑎₁ = −tr(𝑨𝑵₁)
𝑵₂ = 𝑨𝑵₁ + 𝑎₁𝑰,              𝑎₂ = −(1/2) tr(𝑨𝑵₂)
𝑵₃ = 𝑨𝑵₂ + 𝑎₂𝑰,              𝑎₃ = −(1/3) tr(𝑨𝑵₃)
⋮                            ⋮
𝑵_n = 𝑨𝑵_{n−1} + 𝑎_{n−1}𝑰,   𝑎_n = −(1/n) tr(𝑨𝑵_n)
𝟎 = 𝑨𝑵_n + 𝑎_n𝑰
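As an illustration, the recursion above can be coded in a few lines of MATLAB (the test matrix is only a hypothetical example):
% Leverrier-Faddeev recursion: characteristic polynomial coefficients of A
clear all, clc
A = [3 5 4; 4 5 6; 3 3 2];           % hypothetical test matrix
n = size(A,1);
N = eye(n);                           % N1 = I
a = zeros(1,n);
for k = 1:n
    AN   = A*N;                       % A*Nk
    a(k) = -trace(AN)/k;              % ak = -(1/k) tr(A*Nk)
    N    = AN + a(k)*eye(n);          % N(k+1) = A*Nk + ak*I
end
p = [1 a]                             % characteristic polynomial coefficients
poly(A)                               % check with the built-in MATLAB function
norm(N)                               % final N = A*Nn + an*I should be (numerically) zero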
(𝑠𝑰 − 𝑨)^{−1}𝑨 = 𝑩(𝑠)/𝑝(𝑠) ⟺ (𝑠𝑰 − 𝑨)𝑩(𝑠) = 𝑝(𝑠)𝑨
𝑩₁ = 𝑨,                              𝑝₁ = tr(𝑩₁)
𝑩₂ = 𝑨(𝑩₁ − 𝑝₁𝑰),                    𝑝₂ = (1/2) tr(𝑩₂)
⋮                                    ⋮
𝑩_{n−1} = 𝑨(𝑩_{n−2} − 𝑝_{n−2}𝑰),     𝑝_{n−1} = (1/(n−1)) tr(𝑩_{n−1})
𝑩_n = 𝑨(𝑩_{n−1} − 𝑝_{n−1}𝑰),         𝑝_n = (1/n) tr(𝑩_n)
And 𝑩_n = 𝑝_n𝑰. If 𝑨 is nonsingular then the inverse of 𝑨 can be determined by 𝑨^{−1} = (𝑩_{n−1} − 𝑝_{n−1}𝑰)/𝑝_n.
__________________________________________
Searching for Maximum value of an Array: Sometimes, given an array of numbers in system
engineering, the smallest or largest value needs to be identified — and quickly!
There are several built-in ways to find a minimum or maximum value from an array in
MATLAB. Here three algorithms that can in fastest way compute these values
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Simple Methods for Find Max value of Vector array
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
clear all, clc, Time =[1 2 3 4 101 6 7 8 100];
Max=Time(1); k=1;
while (k<=length(Time))
if (Time(k)> Max)
Max = Time(k);
end
k=k+1;
end
Max
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Simple Methods for Find Max value of Vector array
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
clear all, clc, Time =rand(1,10)
Max=Time(1);
for k=1:length(Time)
if (Time(k)> Max)
Max = Time(k);
end
end
Max
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Simple Methods for Find Max value of Matrix
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
clear all, clc, A =[1 2 3;4 101 6;7 8 100];
Maxvalue=-inf;
Row=0; Column=0;
for i=1:size(A,1)
for j=1:size(A,2)
if A(i,j)> Maxvalue
Maxvalue= A(i,j);
Row=i; Column=j;
end
end
end
Maxvalue
The Left/right and Up/Down Flip Operator: The swapping (Flip) operator is very important in
signal processing and even in engineering. It returns to flip the entries in each row in the
left/right direction. Columns are preserved, but appear in a different order than before.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Simple Methods For right/left reflection
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
clear all, clc, A =10*rand(5,5); [m,n]=size(A);
for i=1:m
for j=1:n
B(i,j)= A(i,end-j+1);
end
end
A, B
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Simple Methods for Up/Down reflection
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
clear all, clc, A =10*rand(5,5), [m,n]=size(A);
for i=1:m
for j=1:n
C(i,j)= A(end-i+1,j);
end
end
A, C
Upper/Lower Triangular (Sum Decomposition): Any matrix can be decomposed into a sum of
upper and lower triangular matrices. Here an algorithm that does this decomposition.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Simple Method for Upper/Lower Triangular (Sum Decomposition)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
clear all, clc, A =10*rand(3,3)
[m,n]=size(A); C=zeros(m,n); B=zeros(m,n); D=zeros(m,n);
for i=1:m
for j=1:n
D(i,i)= A(i,i);
if i>j
B(i,j)= A(i,j);
C(i,j)= C(i,j);
elseif i<j
C(i,j)= A(i,j);
B(i,j)= B(i,j);
end
end
end
C,B,D,
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Simple Sorting Methods for Vectors "in descending order"
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
clear all, clc, V=10*rand(1,4), V1=V; n = length(V);
while n ~= 0
nn = 0;
for ii = 1:n-1
if V(ii) < V(ii+1)   % swap when out of descending order
[V(ii+1),V(ii)] = deal(V(ii), V(ii+1));
nn = ii;
end
end
n = nn;
end
V1, V
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Simple Sorting Methods for Matrices (Rows) "in descending order"
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
clear all, clc, A=10*rand(4,4); B = zeros(size(A)); n = size(A,2);
for iRow = 1:size(A, 1)
v = A(iRow, :);
while any(diff(v) > 0) % Test if sorted in descending order
v = v(randperm(n,n)); % If not, shuffle the elements randomly
end
B(iRow,:) = v;
end
A,B
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Simple Sorting Methods for Matrices (COLUMNS)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
clear all, clc, A=10*rand(4,4); B = zeros(size(A)); n = size(A,2);
for iCOL = 1:size(A, 2)
v = A(:, iCOL);
while any(diff(v) > 0) % Test if sorted in descending order
v = v(randperm(n, n)); % If not, shuffle the elements randomly
end
B(:, iCOL) = v;
end
A, B
% Get U, V
[Vecs, Vals] = eig(A*A');
[~, P] = sort(sum(Vals), 'descend');
U = Vecs(:, P);
% Get Sigma
singularValues = sum(Vals(:,P)).^0.5;
sigma=zeros(size(A));
m=nnz(find(singularValues>=0.001));
for i=1:m
sigma(i, i) = singularValues(i);
end
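% --- (assumed completion: the original listing omits the computation of V) ---
% Get V, with columns consistent in sign with U (only the m significant
% singular directions are needed for the reconstruction below)
V = zeros(size(A,2));
for i = 1:m
    V(:,i) = (A'*U(:,i))/singularValues(i);
end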
A
U*sigma*V'
%[U S V]=svd(A)
Solving linear systems by QR: Given a system of linear equations of the form 𝑨𝐱 = 𝒃 and
assume that 𝑨 is decomposed in the 𝑸𝑹 form
𝑨𝐱 = 𝒃 ⟺ 𝑸𝑹𝐱 = 𝒃 ⟺ {𝑸𝐲 = 𝒃, 𝑹𝐱 = 𝐲} ⟺ {𝐲 = 𝑸^T𝒃, 𝑹𝐱 = 𝐲} ⟺ 𝑹𝐱 = 𝑸^T𝒃 = 𝒃_new
The matrix 𝑹 is an upper triangular thus the last equation can be solved by back substitution.
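% (assumed setup for the back-substitution below; the original listing omits it)
clear all, clc, n=5; A=10*rand(n,n); b1=10*rand(n,1); b=b1;
[Q,R]=qr(A); b=Q'*b;               % b_new = Q'*b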
x=zeros(n,1);
for j=n:-1:1
if (R(j,j)==0)
error('Matrix is singular!');
end
x(j)=b(j)/R(j,j);
b(1:j-1)=b(1:j-1)-R(1:j-1,j)*x(j);
end
A*x-b1
Appendix B
The main object of this section is to introduce a famous
theorem in linear algebra that provides a nice answer to the following question: if we
wish to use only orthogonal (or unitary) matrices as similarity transformation matrices,
what is the simplest form to which a given matrix 𝑨 can be transformed? It would be nice
if we could say something like “diagonal” or “Jordan canonical form”. Unfortunately,
neither is possible. However, upper triangular matrices are very nice special forms of
matrices. In particular, we can see the eigenvalues of an upper triangular matrix at a
glance. That makes the Schur-decomposition theorem extremely attractive.
Remark:1 Hessenberg decomposition is the first step in Schur algorithm. Although every
square matrix has a Schur-decomposition, in general this decomposition is not unique.
Remark:3 Issai Schur (1875-1941) was a Russian mathematician who worked in
Germany for most of his life. He studied at the University of Berlin, obtained his
doctorate in 1901, became a lecturer in Zurich in 1903 and a professor in 1919. He was a
student of Ferdinand Georg Frobenius.
Consider for the moment a QR-factorization of the matrix 𝑨 = 𝑸1 𝑹1 , where 𝑸1𝐻 𝑸1 = 𝑰 and
𝑹1 is upper triangular. We will now reverse the order of multiplication product of 𝑸1 and
𝑹1 and eliminate 𝑹1 ,
𝑹1 𝑸1 = 𝑸1𝐻 𝑨𝑸1
Since 𝑸1𝐻 𝑨𝑸1 is a similarity transformation of 𝑨, therefore the matrix 𝑹1 𝑸1 has the same
eigenvalues as 𝑨. More importantly, we have seen before that by repeating this process,
the matrix 𝑹𝑘 𝑸𝑘 will become closer and closer to upper triangular, such that we
eventually can read off the eigenvalues from the diagonal. That is, the QR-method
generates a sequence of matrices 𝑨_k initiated with 𝑨 = 𝑨₀ = 𝑸₁𝑹₁ and given by 𝑨_k = 𝑹_k𝑸_k,
where 𝑸_k and 𝑹_k represent a QR-factorization of 𝑨_{k−1} = 𝑸_k𝑹_k. If we repeat the process k
times we obtain 𝑻 = 𝑸_k^H ⋯ 𝑸₂^H𝑸₁^H 𝑨 𝑸₁𝑸₂ ⋯ 𝑸_k = 𝑸^H𝑨𝑸, an upper triangular matrix.
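The following MATLAB lines give a minimal sketch of this unshifted QR iteration (the test matrix is hypothetical, built with known real eigenvalues so that the iteration converges):
clear all, clc
M = rand(4); A = M*diag([1 2 3 4])/M;   % hypothetical matrix with eigenvalues 1,2,3,4
Ak = A;
for k = 1:300
    [Qk,Rk] = qr(Ak);
    Ak = Rk*Qk;                          % similarity transform Qk'*Ak*Qk
end
diag(Ak)'                                % approximate eigenvalues of A
eig(A)'                                  % verification with MATLAB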
Theorem [Schur Form]: Let 𝑨 ∈ ℂ^{n×n}. Then there exists a unitary matrix 𝑸 (not unique)
such that 𝑸^H𝑨𝑸 = 𝑻 is upper triangular (not unique), with the eigenvalues of 𝑨 as diagonal
elements in any required order.
Proof: Use induction on 𝑛, the size of the matrix. For 𝑛 = 1, there is nothing to prove.
For 𝑛 > 1, assume that all (𝑛 − 1) × (𝑛 − 1) matrices are unitarily similar to an upper-
triangular matrix, and consider an 𝑛 × 𝑛 matrix 𝑨. Suppose that (𝜆, 𝐱) is an eigenpair
for 𝑨, and suppose that 𝐱 has been normalized so that ‖𝐱‖2 = 1.
𝑹 = 𝑰 − 2(𝐮𝐮^H)/(𝐮^H𝐮)   with   𝐮 = 𝐱 ± 𝜇‖𝐱‖𝒆₁,   𝜇 = {1 if x₁ is real; x₁/|x₁| if x₁ is not real}
⟹ 2𝐮^H𝐱 = 𝐮^H𝐮 ⟹ 𝑹𝐱 = 𝒆₁
𝐔^H𝑨𝐔 = [𝜆  𝐱^H𝑨𝑽𝑸; 𝟎  𝑸^H𝐕^H𝑨𝑽𝑸] = [𝜆  𝐱^H𝑨𝑽𝑸; 𝟎  𝑻₁] = 𝑻
is upper triangular. Since similar matrices have the same eigenvalues, and since the
eigenvalues of a triangular matrix are its diagonal entries, the diagonal entries of 𝑻 must
be the eigenvalues of 𝑨. ■
Transformation from Schur to Jordan Form: Let the matrix 𝑱 be the Jordan form of 𝑨
so that 𝑨 = 𝑽𝑱𝑽−1 . Now we apply the QR algorithm on the matrix (𝑽−1 )𝐻 so that (𝑽−1 )𝐻 =
𝑸𝑹 and define a new matrix 𝑻 = 𝑸𝐻 𝑨𝑸, this new matrix is a lower triangular form.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% RATIONAL TRANSFORMATION FROM SCHUR TO JORDAN FORM
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
clear all, clc,
% M1=10*rand(5,5); A=M1*diag([-1 -2 -3 -4 -5])*inv(M1);
A=[3 5 4 6 1;4 5 6 7 7;3 3 2 7 8;4 2 8 7 7;3 6 9 5 1];
[V D]=eig(A); W=(inv(V))'; % inv(V)*A*V
[Q R]=qr(W);
T=Q'*A*Q;
T = tril(T)
Q
The generalized eigenvalues 𝜆 that solve the generalized eigenvalue problem 𝑨𝐱 = 𝜆𝑩𝐱
(where 𝐱 is an unknown nonzero vector) can be calculated as the ratio of the diagonal
elements of 𝑺 to those of 𝑻. That is, using subscripts to denote matrix elements, the 𝑖 𝑡ℎ
generalized eigenvalue 𝜆𝑖 satisfies 𝜆𝑖 = 𝑺𝑖𝑖 /𝑻𝑖𝑖 .
𝑨𝐱 = 𝜆𝑩𝐱 ⟺ (𝑨 − 𝜆𝑩)𝐱 = 𝟎
⟺ 𝐱 ∈ 𝒩(𝑨 − 𝜆𝑩)
⟺ det(𝑨 − 𝜆𝑩) = 0
⟺ det(𝑸) det(𝑺 − 𝜆𝑻) det(𝒁𝐻 ) = 0
⟺ 𝜆𝑖 = 𝑺𝑖𝑖 /𝑻𝑖𝑖 if 𝑻𝑖𝑖 ≠ 0
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Generalized Schur decomposition Or QZ decomposition
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
clear all, clc, M1=10*rand(5,5); M2=10*rand(5,5);
A=M1*diag([-1 -2 -3 -4 -5])*inv(M1);
B=M2*diag([1 2 3 4 5])*inv(M2);
n=size(A,1); [AA,BB,Q,Z] = qz(A,B) % MATLAB instruction
lambda=zeros(n,n);
for i=1:n
lambda(i,i)=AA(i,i)/BB(i,i);
end
lambda
%---------------- verifications by MATLAB instructions ----------------%
[V, lambda]=eig(A,B);
lambda
Theorem: (Generalized Real Schur Decomposition) If 𝑨 and 𝑩 are in ℝ^{n×n} then there
exist orthogonal matrices 𝑸 and 𝒁 such that 𝑸^T𝑨𝒁 is upper quasi-triangular and 𝑸^T𝑩𝒁 is
upper triangular.
𝑨𝑽𝚲_β = 𝑩𝑽𝚲_α,   𝚲_β𝑾𝑨 = 𝚲_α𝑾𝑩
The eigenvalues produced by 𝜆 = eig(𝑨, 𝑩) are the element-wise ratios of 𝚲_α(i, i) and 𝚲_β(i, i):
𝜆_i = 𝚲_α(i, i)/𝚲_β(i, i)
If 𝑺 is not triangular, it is necessary to further reduce the 2-by-2 blocks to obtain the
eigenvalues of the full system.
A Sylvester equation has a unique solution for 𝑿 exactly when there are no common
eigenvalues of 𝑨 and −𝑩. More generally, the equation 𝑨𝑿 + 𝑿𝑩 = 𝑪 has been considered
as an equation of bounded operators on a (possibly infinite-dimensional) Banach space.
The operator 𝑺: 𝑿 ⟼ 𝑨𝑿 + 𝑿𝑩 is linear in 𝑿. Hence, our problem reduces to that of determining when 𝑺
is nonsingular. It turns out that the nonsingularity depends on the spectra of 𝑨 and 𝑩.
Specifically, we have the following theorem.
Theorem: Given matrices 𝑨 ∈ ℂ^{n×n} and 𝑩 ∈ ℂ^{m×m}, the Sylvester equation 𝑨𝑿 + 𝑿𝑩 = 𝑪
has a unique solution 𝑿 ∈ ℂ^{n×m} for any 𝑪 ∈ ℂ^{n×m} if and only if 𝑨 and −𝑩 do not share any
eigenvalue.
Proof: To prove the necessity of the condition, let (𝜆, 𝐱) be a right eigenpair of 𝑨 and let
(𝜇, 𝐲) be a left eigenpair of −𝑩. Let 𝑿 = 𝐱𝐲^H. Then
𝑺𝑿 = 𝑨𝑿 + 𝑿𝑩 = 𝜆𝐱𝐲^H − 𝜇𝐱𝐲^H = (𝜆 − 𝜇)𝑿.
Hence if we can choose 𝜆 − 𝜇 = 0, that is, if 𝑨 and −𝑩 have a common eigenvalue, then
𝑿 is a "null vector" of 𝑺, and 𝑺 is singular. This proves the "only if" part of the theorem.
To prove the converse, we use the fact that an operator is nonsingular if every linear
system in the operator has a solution. Consider the system 𝑺𝑿 = 𝑪. The first step is to
transform this system into a more convenient form. Let 𝑻 = 𝑽𝐻 𝑩𝑽 be a Schur
decomposition of 𝑩. Then Sylvester's equation can be written in the form
If we set 𝒀 = 𝑿𝑽 and 𝑫 = 𝑪𝑽, we may write the transformed equation in the form
𝑨𝒀 + 𝒀𝑻 = 𝑫
Let us partition this system by columns, 𝒀 = [𝒚₁, … , 𝒚_m] and 𝑫 = [𝒅₁, … , 𝒅_m]; the first column gives (𝑨 + 𝑡₁₁𝑰)𝒚₁ = 𝒅₁, which determines 𝒚₁.
In general, suppose that we have found 𝒚1 , 𝒚2 , … 𝒚𝑘−1 From the 𝑘 𝑡ℎ column of this
equation we get
(𝑨 + 𝑡_kk𝑰)𝒚_k = 𝒅_k − ∑_{i=1}^{k−1} 𝑡_ik𝒚_i
The right-hand side of this equation is well defined and the matrix on the left is non-
singular, since −𝑡_kk is an eigenvalue of −𝑩 and hence, by assumption, not an eigenvalue of 𝑨.
Hence 𝒚_k is well defined. We have found a solution 𝒀 of the equation 𝑨𝒀 + 𝒀𝑻 = 𝑫.
Hence 𝑿 = 𝒀𝑽^H is a solution of 𝑺𝑿 = 𝑪, justifying the "if" part of the theorem. Q.E.D. ■
Which are special cases of another classical matrix equation, known as the Sylvester
equations:
𝑨𝑿 + 𝑿𝑩 = 𝑪 continuous − time systems
𝑨𝑿𝑩 − 𝑿 = 𝑪 discrete − time system𝑠
The Sylvester equations also arise in a wide variety of applications. For example, the
numerical solution of elliptic boundary value problems can be formulated in terms of the
solution of the Sylvester equation (Starke and Niethammer 1991). The solution of the
Sylvester equation is also needed in the block diagonalization of a matrix by a similarity
transformation (see Datta 1995) and Golub and Van Loan (1996). Once a matrix is
transformed to a block diagonal form using a similarity transformation, the block
diagonal form can then be conveniently used to compute the matrix exponential 𝑒 𝑨𝑡 .
From the algebraic point of view it is well-known that if the matrices 𝑨 and 𝑩
are stable then the unique solution of the continuous-time Sylvester equation is given by
𝑿 = −∫₀^∞ e^{𝑨t} 𝑪 e^{𝑩t} dt.
An alternative is to vectorize the equation, 𝑮𝐱 = ((𝑰 ⊗ 𝑨) + (𝑩^T ⊗ 𝑰))𝐱 = 𝐜, where 𝐱 = vec(𝑿) and 𝐜 = vec(𝑪) contain the concatenated columns of 𝑿 and 𝑪. The solution of this equation exists and is unique if and only if the matrix 𝑮 = (𝑰 ⊗ 𝑨) + (𝑩^T ⊗ 𝑰) is nonsingular. The following lines of code provide an example in MATLAB:
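A minimal sketch of this vectorized approach (with hypothetical data) is:
% Solve AX + XB = C through the Kronecker/vec formulation described above
n=4; m=3;
A=randn(n,n); B=randn(m,m); C=randn(n,m);   % hypothetical data
G=kron(eye(m),A)+kron(B.',eye(n));          % (I (x) A) + (B^T (x) I)
x=G\C(:);                                   % solve G*vec(X) = vec(C)
X=reshape(x,n,m);
norm(A*X+X*B-C)                             % residual check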
Remark: This method is nevertheless not recommended, considering the cost of factorizing
the large (nm) × (nm) matrix 𝑮 and the cost of transforming the system into a linear equation.
▪ In case when: 𝑡𝑘+1,𝑘 ≠ 0, columns [𝒚𝑘 ; 𝒚𝑘+1 ] should be concatenated and solved for
simultaneously.
[(𝑹 + 𝑡_kk𝑰)  𝑡_mk𝑰;  𝑡_km𝑰  (𝑹 + 𝑡_mm𝑰)] [𝒚_k; 𝒚_m] = [𝒅_k; 𝒅_m] − ∑_{i=1}^{k−1} [𝑡_ik𝒚_i; 𝑡_im𝒚_i]   with   𝑚 = 𝑘 + 1
Algorithm: Bartels–Stewart
input: 𝑨 ∈ ℝ𝑛×𝑛 , 𝑩 ∈ ℝ𝑚×𝑚 , 𝑪 ∈ ℝ𝑛×𝑚
output: 𝑿 ∈ ℝ𝑛×𝑚 , the solution of 𝑨𝑿 + 𝑿𝑩 = 𝑪
▪ Compute the Schur reduction 𝑨 = 𝑼𝐻 𝑹𝑼 and 𝑩 = 𝑽𝐻 𝑻𝑽;
▪ Compute 𝑫 = 𝑼𝐻 𝑪𝑽 ;
▪ if 𝑡_{k+1,k} = 0 for all 𝑘 then find 𝒀 using (𝑹 + 𝑡_kk𝑰)𝒚_k = 𝒅_k − ∑_{i=1}^{k−1} 𝑡_ik𝒚_i
▪ else
▪ Find 𝒀 using
[(𝑹 + 𝑡_kk𝑰)  𝑡_mk𝑰;  𝑡_km𝑰  (𝑹 + 𝑡_mm𝑰)] [𝒚_k; 𝒚_m] = [𝒅_k; 𝒅_m] − ∑_{i=1}^{k−1} [𝑡_ik𝒚_i; 𝑡_im𝒚_i]   with   𝑚 = 𝑘 + 1
▪ end
▪ Compute 𝑿 = 𝑼𝒀𝑽𝐻 ;
Zero=R*Y + Y*T - F
X=U*Y*V'
Zero=A*X + X*B - C
Example: write a MATLAB code to solve the Sylvester equation by Bartels–Stewart
algorithm. Assume that 𝑨 and −𝑩 are of distinct eigenvalues (May be complex conjugate).
The main idea to solve the Sylvester equations is to write this equation as a block linear
system and then use some suitable iterative scheme. This can be accomplished by the
following change of variable: 𝑨𝑿 = 𝒁 where 𝒁 = 𝑪 − 𝑿𝑩. This possibility generates the
iterative method
𝑨𝑿_{k+1} = 𝑪 − 𝑿_k𝑩,   i.e.   𝑿_{k+1} = 𝑨^{−1}(𝑪 − 𝑿_k𝑩),
from which one obtains
𝑿 − 𝑿𝑘 = 𝑨−𝑘 ( 𝑿 − 𝑿0 )𝑩𝑘
Therefore ‖𝑿 − 𝑿𝑘 ‖ ≤ (‖𝑨−1 ‖ × ‖𝑩‖)𝑘 × ‖𝑿 − 𝑿0 ‖. Since ‖𝑨−1 ‖ × ‖𝑩‖ < 1 then the sequence
𝑿𝑘 converges to 𝑿 when 𝑘 tends to infinity. On the other hand if 𝑬𝑘 = 𝑿 − 𝑿𝑘 then
‖𝑬𝑘 ‖ ≤ ‖𝑨−1 ‖ × ‖𝑩‖ × ‖𝑬𝑘−1 ‖, hence the sequence 𝑿𝑘 converges q-linearly to the solution.
Now if ‖𝑨−1 ‖ × ‖𝑩‖ > 1 we should use the following iteration 𝑿𝑘+1 𝑩 = 𝑪 − 𝑨𝑿𝑘 .
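The following MATLAB lines give a minimal sketch (with hypothetical data satisfying the contraction condition) of this fixed-point iteration:
% Fixed-point iteration A*X(k+1) = C - X(k)*B, valid when norm(inv(A))*norm(B) < 1
clear all, clc
A = [4 1; 0 5];  B = [0.5 0; 0.2 0.4];  C = [1 2; 3 4];   % hypothetical data
X = zeros(size(C));
for k = 1:200
    X = A\(C - X*B);          % one fixed-point sweep
end
norm(A*X + X*B - C)           % residual check
sylvester(A,B,C)              % verification with MATLAB (R2014a or later)
X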
(The Best Proposed Algorithm) Here we present a new iterative scheme for
large-scale solution of the well-known Sylvester equation. The proposed scheme is based
on Broyden's method (which is a quasi-Newton method) and can make good use of the
recently developed methods for solving block linear systems. For the detail of
mathematical proof of the method see Chapter IV.
▪ while toll > 𝜀 do
   𝑭₀ = 𝑨₁𝑿₀ + 𝑿₀𝑨₂ − 𝑨₃;
   𝐟₀ = vec(𝑭₀);  𝐱₀ = vec(𝑿₀);  𝐱₁ = 𝐱₀ − 𝑩𝐟₀;
   𝑿₁ = vec^{−1}(𝐱₁);
   𝑭₁ = 𝑨₁𝑿₁ + 𝑿₁𝑨₂ − 𝑨₃;
   𝐟₁ = vec(𝑭₁);  𝐱₁ = vec(𝑿₁);  𝐬 = 𝐱₁ − 𝐱₀;  𝐲 = 𝐟₁ − 𝐟₀;
   𝑩 = 𝑩 + ((𝐬 − 𝑩𝐲)(𝐬^T𝑩))/(𝐬^T𝑩𝐲);
   𝑿₀ = 𝑿₁;  𝑘 = 𝑘 + 1;
   toll = ‖𝐬‖;
▪ end
With homogeneous Dirichlet boundary conditions, such that 𝑢(±1,0) = 𝑢(0, ±1) = 0, and
the right-hand side 𝑓(𝑥, 𝑦) chosen such that the reference solution 𝑢(𝑥, 𝑦) is given by
[x,y] = meshgrid([-1:.05:1]);
z=(1-x.^2).*(1-y.^2).*cos(20.*x.*y);
s=surf(x,y,z),
axis equal
as visualised in Figure. The most common approach is the discretization by a five-point
stencil on an (𝑛 + 1) × (𝑛 + 1) equi-spaced grid, therefore this will be considered first. The
discretization results in a sparse Sylvester (Lyapunov) equation.
The detailed results of the comparison can be found in Gerhardus Petrus Kirsten (2018).
Example: (Roth's removal rule) Given two square matrices 𝑨 and 𝑩, of size 𝑛 and 𝑚,
and a matrix 𝑪 of size 𝑛 × 𝑚, one can ask when the following two square matrices of
size 𝑛 + 𝑚 are similar to each other: [𝑨 𝑪; 𝟎 𝑩] and [𝑨 𝟎; 𝟎 𝑩]. The answer is that these two
matrices are similar exactly when there exists a matrix 𝑿 such that 𝑨𝑿 − 𝑿𝑩 = 𝑪. In other
words, 𝑿 is a solution to a Sylvester equation. This is known as Roth's removal rule.
One easily checks one direction: If 𝑨𝑿 − 𝑿𝑩 = 𝑪 then [𝑰 𝑿; 𝟎 𝑰][𝑨 𝑪; 𝟎 𝑩][𝑰 −𝑿; 𝟎 𝑰] = [𝑨 𝟎; 𝟎 𝑩].
The term generalized
refers to a very wide class of equations, which includes systems of matrix equations,
bilinear equations and problems where the coefficient matrices are rectangular. We start
with the most common form of generalized Sylvester equation, namely
𝑨𝑿𝑩 + 𝑪𝑿𝑫 = 𝑬
Which differs from the previous equation for the occurrence of coefficient matrices on
both sides of the unknown solution 𝑿. If 𝑩 and 𝑪 are both nonsingular, left
multiplication by 𝑪−1 and right multiplication by 𝑩−1 lead to a standard Sylvester
equation, with the same solution matrix 𝑿. In case either 𝑩 or 𝑪 are ill-conditioned, such
a transformation may lead to severe instabilities.
The case of singular 𝑩 and 𝑪, especially for 𝑩 = 𝑪⊤ and 𝑨 = 𝑫⊤ has an important role in
the solution of differential-algebraic equations and descriptor systems.
The problem can be recast as a standard Sylvester equation even if ill-conditioned 𝑩 and
𝑪, one could consider using a specifically selected 𝛼 ∈ ℝ (or 𝛼 ∈ ℂ) such that the two
matrices 𝑪 + 𝛼𝑨 and 𝑩 − 𝛼𝑫 are better conditioned and the solution uniqueness is
ensured, and rewrite 𝑨𝑿𝑩 + 𝑪𝑿𝑫 = 𝑬 as the following equivalent generalized Sylvester
matrix equation,
𝑨𝑿(𝑩 − 𝛼𝑫) + (𝑪 + 𝛼𝑨)𝑿𝑫 = 𝑬 ⟺ 𝑨1 𝑿 + 𝑿𝑨2 = 𝑨3
Other generalizations of the Sylvester equation have attracted the attention of many
researchers. In some cases the standard procedure for their solution consists in solving
a (sequence of) related standard Sylvester equation(s), so that the computational core is
the numerical solution of the latter by means of some of the procedures discussed in
previous sections. We thus list here some of the possible generalizations more often
encountered and employed in real applications. We start by considering the case when
the two coefficient matrices can be rectangular. This gives the following equation:
𝑨𝑿 + 𝒀𝑩 = 𝑪
where 𝑿 and 𝒀 are both unknown, and 𝑨, 𝑩 and 𝑪 are all rectangular matrices of conforming dimensions. Equations of this type arise in control theory, for instance in output regulation with internal stability, where the matrices are in fact polynomial matrices (see, e.g., H. K. Wimmer 1996).
The two-sided version of the previous equation is 𝑨𝑿𝑩 + 𝑪𝒀𝑫 = 𝑬. An example of a more complex bilinear equation is
𝑨𝑿𝑩 + 𝑪𝑿𝑫 = 𝑬𝒀 + 𝑭
where the pair (𝑿, 𝒀) is to be determined and 𝑿 occurs in two different terms.
Example: write a MATLAB code to solve the generalized Sylvester equation by the
proposed Broyden's algorithm.
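A hedged sketch of such a code is given below (it is not one of the book's listings): the Broyden iteration of the pseudocode above is applied directly to the vectorized residual 𝐟(𝐱) = vec(𝑨𝑿𝑩 + 𝑪𝑿𝑫 − 𝑬); the test matrices and the scaled initial inverse-Jacobian guess 𝑱 are assumptions chosen so that the iteration starts in a contractive regime.
clear all, clc,
n = 4; m = 3;                                     % assumed sizes
A = eye(n)+0.05*randn(n); C = eye(n)+0.05*randn(n);   % assumed, mildly perturbed test matrices
B = eye(m)+0.05*randn(m); D = eye(m)+0.05*randn(m);
E = randn(n,m);
X0 = zeros(n,m);  J = 0.5*eye(n*m);               % J approximates the inverse Jacobian (here ~ (2I)^{-1})
eps0 = 1e-10; kmax = 500; k = 0; toll = 1;
while toll > eps0 && k < kmax
    F0 = A*X0*B + C*X0*D - E;  f0 = F0(:);  x0 = X0(:);
    x1 = x0 - J*f0;            X1 = reshape(x1,n,m);
    F1 = A*X1*B + C*X1*D - E;  f1 = F1(:);
    s = x1 - x0;  y = f1 - f0;
    J = J + ((s - J*y)*(s'*J))/(s'*J*y);          % Broyden rank-one update of the inverse Jacobian
    X0 = X1;  k = k + 1;  toll = norm(s);
end
X = X0
Residual = norm(A*X*B + C*X*D - E,'fro')          % small if the iteration has converged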
The following result concerns the uniqueness of the solution 𝑿 of the discrete Lyapunov equation 𝑨⊤𝑿𝑨 − 𝑿 = 𝑸: a unique solution exists if and only if 𝜆𝑖𝜆𝑗 ≠ 1 for every pair of eigenvalues 𝜆𝑖, 𝜆𝑗 of 𝑨.
Example: write a MATLAB code to solve the Continuous Lyapunov equation by the
numerical integration algorithm.
clear all, clc,                           % assumed initialization (not part of the original listing)
A = rand(4,4) - 5*eye(4); Q = rand(4,4); Q = Q + Q';   % stable A, symmetric Q (assumptions)
h = 0.1; EA = expm(A*h);                  % one-step propagator e^{Ah}
QT = integral(@(t) expm(A*t)*Q*expm(A'*t), 0, h, 'ArrayValued', true, 'AbsTol', 1e-13, 'RelTol', 1e-12);
X = zeros(4); k = 0; e = 1; kmax = 2000;
while e>1e-10 && k<kmax                   % tolerance relaxed from 1e-15 to the quadrature accuracy; kmax guard added
X=X - (EA^k)*(QT)*((EA')^k);              % X = -sum_k e^{Akh} (int_0^h e^{At} Q e^{A't} dt) e^{A'kh}
e=norm(A*X+X*A'-Q);
k=k+1;
end
X
Zero = A*X+X*A'-Q
XMATLAB = lyap(A,-Q) % verification using MATLAB
Zero=A*XMATLAB + XMATLAB*A'-Q
Example: write a MATLAB code to solve the Discrete Lyapunov equation by the
numerical integration algorithm.
clear all, clc,                              % assumed initialization (not part of the original listing)
A = rand(4,4); A = 0.9*A/max(abs(eig(A)));   % spectral radius < 1 (assumption)
Q = rand(4,4); Q = Q + Q';                   % symmetric right-hand side (assumption)
X = zeros(4); k = 0; e = 1;
while e>1e-10
X=X - (A^k)*(Q)*((A')^k);                    % X = -sum_k A^k Q (A')^k
e=norm(A*X*A'-X-Q);
k=k+1;
end
X
Zero = A*X*A'-X-Q
XMATLAB = dlyap(A,-Q) % verification using MATLAB
Zero=A*XMATLAB*A' - XMATLAB -Q
All of the algorithms introduced previously are also applicable to the Lyapunov equation.
An algebraic Riccati
equation is a type of nonlinear equation that arises in the context of infinite-horizon
optimal control problems in continuous time or discrete time.
The name Riccati is given to these equations because of their relation to the Riccati
differential equation.
Remark: The Riccati equation is a nonlinear equation that appears in the optimal control of continuous-time and discrete-time linear systems. In practice one often needs a parametric (analytical) solution rather than a purely numerical one, and some limitations in optimal control call for an analytical parametrization; nevertheless, we suggest here some numerical algorithms through which this equation can be solved under certain conditions.
Newton's method can also be used for solving equations of the kind 𝑭(𝑿) = 𝟎, where 𝑭: 𝓥 ⟶ 𝓥 is a differentiable operator on a Banach space (we are interested only in the case in which 𝓥 is ℂ^(𝑚×𝑛)). The sequence is defined by
𝑿𝑘+1 = 𝑿𝑘 − (𝑭′𝑿𝑘)⁻¹𝑭(𝑿𝑘),   𝑿𝑘 ∈ 𝓥
The Fréchet derivative is a derivative defined on Banach spaces. Named after Maurice
Fréchet, it is commonly used to generalize the derivative of a real-valued function of a
single real variable to the case of a vector-valued function of multiple real variables, and
to define the functional derivative used widely in the calculus of variations.
The Fréchet derivative of 𝑭 at 𝑿 is the linear mapping 𝑬 ⟼ 𝑳(𝑿, 𝑬) such that for all 𝑬 ∈ ℂ^(𝑚×𝑛): 𝑭(𝑿 + 𝑬) − 𝑭(𝑿) − 𝑳(𝑿, 𝑬) = 𝑜(‖𝑬‖); it therefore describes the first-order effect on 𝑭 of perturbations in 𝑿. In practical computations it is preferable to avoid constructing and explicitly inverting 𝑭′𝑿𝑘. Thus, a better way to compute one step of Newton's method is to define the Newton increment 𝑯𝑘 ≔ 𝑿𝑘+1 − 𝑿𝑘 and to solve the linear matrix equation 𝑭′𝑿𝑘[𝑯𝑘] = −𝑭(𝑿𝑘) in the unknown 𝑯𝑘 in order to get 𝑿𝑘+1 = 𝑿𝑘 + 𝑯𝑘.
The convergence of the method in Banach spaces is less straightforward than in the
scalar case, and it is described by the Newton–Kantorovich theorem.
Consider the Riccati operator 𝓡(𝑿) = 𝑿𝑨 + 𝑨⊤𝑿 − 𝑿𝑩𝑿 + 𝑪. The Fréchet derivative of 𝓡 at a point 𝑿 is 𝓡′𝑿[𝑯] = 𝑯(𝑨 − 𝑩𝑿) + (𝑨⊤ − 𝑿𝑩)𝑯. Thus, the 𝑘th step of Newton's method for the continuous-time algebraic Riccati equation consists in solving the Sylvester equation
(𝑨⊤ − 𝑿𝑘𝑩)𝑯𝑘 + 𝑯𝑘(𝑨 − 𝑩𝑿𝑘) = −𝓡(𝑿𝑘)
and then setting 𝑿𝑘+1 = 𝑿𝑘 + 𝑯𝑘.
The corresponding code is reported in the listing below; it makes use of the function sylvester for the solution of a Sylvester matrix equation, as described before.
The standard results on the convergence of Newton’s method in Banach spaces yield
locally quadratic convergence in a neighborhood of the stabilizing solution 𝑿+ . This
property guarantees that the method is self-correcting, that is, a small perturbation
introduced at some step 𝑘 of the iteration does not affect the convergence.
clear all, clc,
% This algorithm solves: C+XA+A'X-XBX=0 by means of Newton’s method.
% A,B & C: matrix coefficients.
% X0: initial approximation & X: the solution
A=10*rand(4,4); B=10*rand(4,4); C=10*rand(4,4); X0= rand(4,4);
tol = 1e-13; kmax = 80; X = X0; err = 1; k=0;
while err>tol && k<kmax
RX = C + X*A + A'*X - X*B*X;
H =sylvester(A'-X*B,A-B*X,-RX) ;
X=X+H;
err =norm(H,1)/norm(X,1); k=k+1;
end
if k == kmax
disp('Warning: reached the maximum number of iterations')
end
Zero=C + X*A + A'*X - X*B*X
Let 𝑾 be a square matrix with no eigenvalues on the imaginary axis, and let its Jordan decomposition be
𝑱 = 𝑽⁻¹𝑾𝑽 = (𝑱+ 𝟎 ; 𝟎 𝑱−)
where we have grouped the Jordan blocks so that the eigenvalues of 𝑱+ have positive real
part, while the eigenvalues of 𝑱− have negative real part. We define the matrix sign of 𝑾
as
sign(𝑾) = 𝑽 (𝑰𝑝 𝟎 ; 𝟎 −𝑰𝑞) 𝑽⁻¹
where 𝑝 is the size of 𝑱+ and 𝑞 is the size of 𝑱−. Observe that, according to this definition, sign(𝑾) is a matrix function. From the last equation it follows that sign(𝑾) − 𝑰 has rank 𝑞, while sign(𝑾) + 𝑰 has rank 𝑝.
Theorem: Let the continuous-time algebraic Riccati equation 𝓡(𝑿) = 𝑪 + 𝑿𝑨 + 𝑫𝑿 − 𝑿𝑩𝑿 have a stabilizing solution 𝑿+, namely 𝜎(𝑨 − 𝑩𝑿+) ⊂ ℂ− or sign(𝑨 − 𝑩𝑿+) = −𝑰, and let 𝑯 = (𝑨 −𝑩 ; −𝑪 −𝑫) be the corresponding Hamiltonian matrix. Partition sign(𝑯) + 𝑰 = [𝑾1 𝑾2], where 𝑾1, 𝑾2 ∈ ℂ^(2𝑛×𝑛). Then 𝑿+ such that 𝓡(𝑿+) = 𝟎 is the unique solution of the overdetermined system 𝑾2𝑿+ = −𝑾1.
Once the sign of 𝑯 is computed, in order to get the required solution it is enough to solve
the overdetermined system. This task can be accomplished by using the standard
algorithms for solving an overdetermined system, such as the QR factorization of 𝑾2 .
Computing the matrix sign function: The simplest iteration is obtained by Newton's method applied to 𝑿² − 𝑰 = 𝟎, which is appropriate since sign(𝑾) satisfies (sign(𝑾))² − 𝑰 = 𝟎. The resulting iteration is 𝑿𝑘+1 = ½(𝑿𝑘 + 𝑿𝑘⁻¹), with 𝑿0 = 𝑾, whose convergence properties are summarized in the following result.
If 𝑾 has no purely imaginary eigenvalues, the iterates converge quadratically to 𝑺 = sign(𝑾) and satisfy
‖𝑿𝑘+1 − 𝑺‖ ≤ ½ ‖𝑿𝑘⁻¹‖ ‖𝑿𝑘 − 𝑺‖²
for any operator norm.
The iteration 𝑿𝑘+1 = ½(𝑿𝑘 + 𝑿𝑘⁻¹), together with the termination criterion ‖𝑿𝑘+1 − 𝑿𝑘‖ ≤ 𝜀 for some norm and a tolerance 𝜀, provides a rough algorithm for computing the sign function.
There is a scaling technique which dramatically accelerates the convergence. The idea is simple but effective: since sign(𝑾) = sign(𝑐𝑾) for any 𝑐 > 0, the limit of the sequence {𝑿𝑘}𝑘 does not change if, at each step, one "scales" the iterate as 𝑿𝑘 ← 𝑐𝑘𝑿𝑘 (for a suitably chosen 𝑐𝑘 > 0) before the Newton step; a frequently used choice is the determinantal scaling 𝑐𝑘 = |det(𝑿𝑘)|^(−1/𝑁), where 𝑁 is the size of 𝑾.
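The following sketch (test data assumed; the determinantal scaling above is used, which is one common choice and not necessarily the only possible one) computes sign(𝑯) by the scaled Newton iteration and then recovers the stabilizing Riccati solution from the overdetermined system 𝑾2𝑿 = −𝑾1 of the theorem above.
n = 4;
A = randn(n); G = randn(n); B = G*G';             % assumed data: B = G*G' >= 0
Cq = randn(n); Cq = Cq*Cq';                       % assumed C >= 0, so a stabilizing solution exists
H = [A -B; -Cq -A'];                              % Hamiltonian matrix (here D = A')
X = H; tol = 1e-12;
for k = 1:100
    c  = abs(det(X))^(-1/(2*n));                  % determinantal scaling
    Xn = 0.5*(c*X + inv(c*X));                    % scaled Newton step
    if norm(Xn-X,1) <= tol*norm(Xn,1), X = Xn; break, end
    X = Xn;
end
S  = X;                                           % S = sign(H)
W  = S + eye(2*n);  W1 = W(:,1:n);  W2 = W(:,n+1:2*n);
Xp = -(W2\W1);                                    % least-squares solution of W2*X = -W1
Residual = norm(Cq + Xp*A + A'*Xp - Xp*B*Xp,'fro')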
Reliable software for solving matrix equations has been available for a long time, owing to their fundamental role in control applications; in particular, the SLICE library was made available as early as 1986, and LAPACK ("Linear Algebra Package") was initially released in 1992. Recent versions of MATLAB also rely on calls to SLICOT routines within the control-related toolboxes. SLICOT includes a large variety of codes for model reduction and nonlinear problems on sequential and parallel architectures;