
Budapest University of Technology and Economics
Faculty of Transportation and Vehicle Engineering

Dr. Bicsák György,
Sziroczák Dávid, PhD CEng,
Aaron Latty, CEng

NUMERICAL METHODS

XY Kiadó
BME Közlekedésmérnöki és Járműmérnöki Kar

Educational material supported by EMMI under grant No. 32708-2/2017/INTFIN

© Dr. Bicsák György, Sziroczák Dávid PhD CEng, Aaron Latty CEng, 2018

ISBN XXX-XXX-XXX-XXX-X

Published by XY Kiadó Kft.

Printed at the printing house of XY Bt.

Contents
Contents............................................................................................................... 2
1 Introduction ................................................................................................. 5
2 Errors and Approximations ......................................................................... 7
2.1 Errors and Their Sources: .................................................................... 7
2.1.1 Binary and Hexadecimal Numbers .............................................. 9
2.1.2 Floating Point and Round-off Errors ..................... 10
2.1.3 Absolute and Relative Errors ...................................................... 11
2.1.4 Error Limits .................................................................................. 13
2.1.5 Error Propagation........................................................................ 13
2.1.6 Subtracting Two Nearly Equal Numbers ......................14
2.2 Iterative Methods .............................................................................. 16
2.2.1 Basic Iterative Method ............................................................... 18
2.2.2 Convergence Rate ...................................................................... 18
3 Single Variable Equations, System of Equations ..................................... 20
3.1 Solution of Single Variable Equations ........................ 21
3.1.1 Bisection method ....................................................................... 22
3.1.2 Newton-Raphson Method ......................................................... 24
3.1.3 Secant method ........................................................................... 27
3.1.4 Regula Falsi Method................................................................... 30
3.1.5 Fixed-point method.................................................................... 33
3.1.6 Finding multiple roots ................................................................ 34
3.2 Solution of System Equations ........................................................... 40
3.2.1 Gauss elimination method ..........................................................41
3.2.2 LU factorization methods .......................................................... 44
3.2.3 Iterative solutions ...................................................................... 45
3.2.4 Systems of nonlinear equations ................................................. 51
4 Curve Fitting Methods............................................................................... 56
4.1 Regression ......................................................................................... 56

4.1.1 Linear Regression ....................................................................... 58
4.1.2 Linearization of Non-linear Input Data...................................... 62
4.1.3 Polynomial Regression ............................................................... 64
4.2 Finite difference ................................................................................ 70
4.2.1 Factorial Polynomials ................................................................. 75
4.2.2 Anti-differentiation .................................................................... 76
4.3 Interpolation ...................................................................................... 79
4.3.1 Lagrange Interpolation .............................................................. 80
4.3.2 Newton Divided-Difference Interpolation ................................ 84
4.3.3 Spline Interpolation [3] .............................................................. 92
5 Numerical Derivation................................................................................107
5.1 2-Point Backward Difference .......................................................... 108
5.2 2-Point Forward Difference............................................................. 108
5.3 2-Point Central Difference ............................................................... 109
5.4 3-Points Backward Difference ......................................................... 110
5.5 Numerical Difference Formulas ........................................................ 111
6 Numerical Integration .............................................................................. 116
6.1 Newton-Cotes Formulas .................................................................. 118
6.1.1 Rectangular Rule ....................................................................... 118
6.1.2 Trapezoidal Rule ........................................................................ 123
6.2 Quadrature Formulas ....................................................................... 125
6.2.1 1/3 Simpson Rule ........................................................................ 127
6.2.2 3/8 Simpson Rule .......................................................................129
6.3 Romberg Integration .......................................................................134
6.4 Gaussian Quadrature ........................................................................ 137
6.5 Improper Integrals ...........................................................................139
6.6 Numerical Integration of Bivariate Functions ................................ 140
7 Numerical Solution of Differential Equations – Initial Value Problems 144
7.1 One-Step Methods ...........................................................................145
7.1.1 Euler’s Method ..........................................................................145

7.1.2 Second Order Taylor Method .................................................. 148
5.1. Runge-Kutta Methods ......................................................................150
7.1.3 Second Order Runge-Kutta Methods....................................... 151
7.1.4 Built in Functions in MATLAB.................................................... 153
8 Partial Differential Equations ................................................................... 157
8.1 Partial Differential Equations generally .......................................... 157
8.2 Elliptic Partial Differential Equations .............................................. 158
8.2.2 Neumann Problem ....................................................................167
8.2.3 Mixed problem ......................................................................... 169
8.2.4 Elliptic Partial Differential Equations in More Complex Regions .. 170
8.3 Parabolic Partial Differential Equations .......................................... 173
8.3.1 Finite-Difference Method ......................................................... 174
8.3.2 Crank-Nicolson Method ............................................................ 177
8.4 Hyperbolic Partial Differential Equations ....................................... 180
References ....................................................................................................... 185

1 Introduction
Although a wide range of mathematical knowledge is included in the
university curriculum, its practical use is often not immediately clear, or
only becomes clear during later stages of study. Furthermore, in engineering
practice many analytical solutions are very complicated, and the effort
needed to follow them through may not be reflected in the final outcome, as
the level of accuracy or sophistication they provide is often not required to
solve the practical problem at hand. Instead of looking for a "correct",
mathematically perfectly accurate solution, we can often settle for a solution
that, although not 100% accurate, satisfies our goals and expectations. This
group of problems also includes problems that cannot be reasonably defined in
closed mathematical terms, but which can be approximated very well using
numerical methods such as the finite element, finite difference or finite
volume methods. This is the philosophy behind the development and usage of
the various numerical methods, and that is the topic of this set of lecture
notes.
The aim of developing and using numerical methods, instead of relying on
complicated and expensive analytical solution methods, is to approximate the
desired exact solution using simpler, arithmetic operations, to the required
level of accuracy. Many numerical methods, when understood properly, might
seem very basic, even primitive, but it is essential to understand the
underlying logic, that is, why a given method can be used to solve our
problems. Complicated mathematics can obscure the underlying solution method,
hence most examples in these notes are kept simple, primarily to illustrate
the method.
Naturally, even numerical methods don't make all of our problems disappear:
the more accurately we want to approximate the theoretical solution, the more
computation we need. This can be prohibitively expensive, for example for
very large simulations. It is worth noting that running simulations is
usually still a lot cheaper and more convenient, and in some sense provides
more insight into the problem, than setting up physical tests. In order to
make the understanding of more complicated engineering simulation tools (FEA,
CFD, etc.) easier in further studies, we will investigate the errors involved
in numerical procedures: what their causes are, what their effects on the
solution are, and how they propagate. The rest of the chapters are divided
into the following main topics:
• Solution of equations and system of equations
• Approximation and interpolation
• Numerical differentiation and integration
• Numerical solution of differential equations
• Numerical solution of partial differential equations

The literature on numerical methods is very broad today; in particular, there
are many practically-oriented books available in English. Although most
sources touch on the theoretical background, they often do not describe the
theory in depth. This set of notes aims to describe the most commonly
encountered numerical methods in engineering, from both the theoretical and
the practical point of view. It may well be that you, the students studying
these methods, will never come across them directly in engineering practice,
and will happily solve all problems using the four basic mathematical
operators. However, some of you will have to deal with complex mathematical
and/or numerical problems in your future careers, and these notes aim to give
all of you an introduction to the world of numerical methods.
In terms of software choices, in engineering practice a great deal of
problems can be solved using simple spreadsheet tools such as Microsoft
Excel. Many of the solutions presented in these notes were solved or
demonstrated using Excel. It should be noted that Excel is actually
underappreciated by many users: knowing the more advanced functions and VBA
scripting would enable an engineer to solve most problems with this "simple"
office tool, without investing in expensive technical software. MATLAB is
also a popular choice for engineering use; with its user-friendly yet
powerful computing capability, it is practically an industry standard for
prototyping and new developments. Many chapters in the notes show examples
implemented in MATLAB.
The authors wish you a good time and good luck in your studies in the coming
semester!

2 Errors and Approximations
Numerical methods are procedures in which a mathematical problem is not
necessarily solved in an exact way; rather, the solution converges to the
accurate solution within a desired tolerance in a finite number of successive
approximations. Although simpler calculations can be performed even with a
calculator, a computer is required to use most numerical methods. Even so,
the speed and cost of the solution is still better than trying to find an
analytical way.
Numerical methods themselves are usually described as strongly mathematical
algorithms: a clear set of instructions describing the operations and
functions and their precise order. When applying these methods to some given
starting conditions (the description of the problem), the methods will arrive
at the desired solution (with the desired accuracy). This is where the term
"acceptable" comes into play. Since numerical methods only approximate the
exact solution, we need to define an acceptance criterion for the goodness of
the approximation. Also note that the approximate nature also applies to the
input data, as relying on data with excessive noise can have a great effect
on the accuracy of the algorithm. This will be described in the following
chapter in more detail. The second problem to consider is that, since we are
only approximating the solution itself, we need to be aware of the order of
errors in the solution process, and what effect they can have on the results.
To understand this, let us first investigate the various types of errors,
their sources and their propagation.

2.1 Errors and Their Sources:


During model creation, many steps can introduce errors into the procedure.
Models are always a projection of reality and are almost always simplified,
thus introducing these simplification errors into our calculations is
unavoidable. In many cases it is exactly this error that allows us to use a
mathematical model to describe complex real-life phenomena. During model
creation, the following errors can be defined [1]:
1. Modelling errors
Modelling errors are defined as the difference between the behaviour
of reality and the behaviour of the approximate model. For example,
for a static problem using trusses we assume that the trusses are
rigid and don't deform, while in reality there is definitely some
deformation.

2. Measurement errors
Measurement errors were discussed in previous subjects. When
measuring, it is inevitable that errors and noise are introduced into
the measuring process and thus into the solution. For this reason it
is often said: "one measurement is no measurement".
3. Expression/function errors
Expression errors are introduced to make the handling of mathematical
equations possible, usually by neglecting small terms. For example,
in the case of a Taylor or Fourier series we would only use the first
few terms of the series, or when calculating the volumetric thermal
expansion coefficient from the linear one.
4. Discretisation errors
The source of this error is when continuous functions, space,
etc. are mapped to a discrete representation. For example
continuous functions are represented with discrete values, a
function might be differentiated numerically, and the integral
represented as a sum of products.
5. Roundoff and digital representation error
Storing data in a computer inevitably distorts the numerical values
due to the digital representation of the numbers; for example,
numbers can be represented as integers, or as single or double
precision floating point numbers, which each have their own precision
and limitations. Roundoff errors are introduced, for example, when
irrational numbers such as π are represented, as it is impossible to
store their infinite number of digits, so the numerical
representation must be truncated at some suitably large number of
digits.
6. Evaluation error
Although a converged mathematical model may produce solutions that
seem accurate (taking all other mentioned sources of errors into
account), due to the definition of boundary conditions, model
components and various effects, the results may still not be
accurate; they still do not match reality. This error, like the
previous ones, might or might not be acceptable; however, the main
issue is that we might not be able to perceive it without rigorous
comparison to tests, and thus it can introduce major problems when
accurate results are desired.

The steps of model creation and the sources of errors between them are
shown in Figure 1.

Figure 1: Model creation and the error sources between the steps [1]

2.1.1 Binary and Hexadecimal Numbers


The decimal (10 based) system of numbers is the one people are familiar with
for every day use. Without additional unnecessary explanation, the numbers
in this system are represented as:
$642 = [6 \cdot 10^2 + 4 \cdot 10^1 + 2 \cdot 10^0]_{10}$
For simplicity, especially for large and small numbers, we also use the
following decimal representation:
$\pm 0,d_1 d_2 \ldots d_m \cdot 10^k, \qquad 1 \le d_1 \le 9, \; 0 \le d_2, d_3, \ldots, d_m \le 9$
For example: $642 = 6,42 \cdot 10^2$
Where the first term is the significand, the second is the power term.
This representation is what we refer to as the normal or floating point
representation. It is worth mentioning here that there is a small but
potentially annoying difference between the English engineering world and the
German/Hungarian notation. In Hungary and the German-speaking world the
decimal mark is a comma ',', while the English-speaking world uses a decimal
point '.'. (Actually, the ',' is used as a digit group separator in English
notation, so for example 1,425,124.67 in English notation is 1425124,67 in
Hungarian, which is also a typical source of confusion.) Most software in
existence today uses the English decimal point, however there are many tools
where regional settings affect the expected decimal mark, and when using the
wrong one the software will not be able to interpret our input; Microsoft
Excel is a typical example. Also note that you need to be aware of this when
creating reports for different cultures.
In computers, due to the Boolean logic used in their architecture and thus in
their calculations, not the decimal but the binary number system is used.
In the binary system, numbers are represented in the following way:

$642 = [1 \cdot 2^9 + 0 \cdot 2^8 + 1 \cdot 2^7 + 0 \cdot 2^6 + 0 \cdot 2^5 + 0 \cdot 2^4 + 0 \cdot 2^3 + 0 \cdot 2^2 + 1 \cdot 2^1 + 0 \cdot 2^0]_2$
So the binary form is: $642 = [1010000010]_2$
Also commonly used in programming is the hexadecimal (16 based) system.
Using similar logic:
$642 = [2 \cdot 16^2 + 8 \cdot 16^1 + 2 \cdot 16^0]_{16} = [282]_{16}$
Since we have run out of Arabic numerals to represent the 16 symbols, in
addition to the digits 0..9 the letters A..F will also appear, for example:
$30 = [1 \cdot 16^1 + 14 \cdot 16^0]_{16} = [1E]_{16}$
What is the point of using the hexadecimal system, one might ask? It is
essentially an extension of the binary system: since $2^4 = 16$, for every
quartet of binary digits we can allocate one hexadecimal symbol, for example
$D = [1101]_2$ or $9 = [1001]_2$.
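As a quick check, MATLAB's built-in conversion functions reproduce the representations above (a minimal sketch; the value 642 is the example used in this section):

% Convert the decimal number 642 to its binary and hexadecimal forms
n = 642;
binStr = dec2bin(n)     % returns '1010000010'
hexStr = dec2hex(n)     % returns '282'
% Convert back to decimal to verify the round trip
bin2dec(binStr)         % returns 642
hex2dec(hexStr)         % returns 642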

2.1.2 Floating Point and Round-off Errors


Since computers can only store and display a limited number of digits, it is
sensible to store numbers in a way where we only use a given (and fixed)
number of digits regardless of the numerical value. Accordingly, there are
two number representations in computing: fixed point and floating point
representation. In the first case, it is decided when storing the numbers how
many digits we would like to keep, for example: 4,0000; -129,1012; 0,0023. In
this case all four fractional digits are used even if we want to store a
whole (integer) number. In the case of a floating point representation, we
store a given number of digits of the significand, for example:
$4 \cdot 10^0; \quad -1,2910 \cdot 10^2; \quad 2,3000 \cdot 10^{-3}$
In its general format, the floating point representation can be expressed in
the following form:
$(-1)^S \, M \, p^k$ (2.1.)

Where $S$ is the sign, $M$ is the mantissa (or significand), $p$ is the base
and $k$ is the exponent (together $p^k$ is the characteristic or power term).
It is also required that $1/p \le M \le 1$.
Different architectures can use different floating point representations. The
Institute of Electrical and Electronics Engineers (IEEE) defined the IEEE 754
standard, which has been in use for the past 30 years. Using this, a 32-bit
(single precision) number can be represented as: [2]
$\text{number} = (-1)^S (1.M)(2^{k-127})$ (2.2.)

The sign is 1 bit long (0 for positive, 1 for negative), followed by the
8-bit exponent (characteristic), which defines the place of the binary point
in the number. The exponent is stored in a biased form: when the exponent
field is $k$ bits long, the bias is $e = 2^{k-1} - 1$; in the case of single
precision, $e = 127$.
The next field is the mantissa, a 23-bit value stored in normalised form. The
leading digit is always 1, so it is not stored, only the fractional part. The
represented number can be calculated as follows:
$\text{number} = (-1)^S (1 + M)(2^{k-127})$ (2.3.)
where $S$ is the sign bit, $M$ is the mantissa, $k$ is the stored exponent
(characteristic) and $e$ is the bias. The number of bits reserved for the
mantissa defines the accuracy of the number representation, while the size of
the characteristic sets the magnitude of the numbers that can be
represented. [2]

Example
Which binary floating point number is displayed using 32 bits in the following
format:
1 10000110 10100000000000000000000 ?

Solution:
Following the previous logic:
$S = 1$
$k = [10000110]_2 = 134_{10}$
$M = [0.101]_2 = 0,625_{10}$
so:
$x = (-1)^1 \cdot 1,625 \cdot 2^{134-127} = -1,625 \cdot 128 = -208$

So behind the scary representation of 1 10000110 10100000000000000000000, the
more understandable -208 is hiding. (Note that for the computer, the first
representation is the more understandable one!)
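The same conversion can be checked directly in MATLAB; a minimal sketch, assuming the 32-bit pattern is the one given in the example:

% Interpret the 32-bit pattern from the example as an IEEE 754 single
bits  = '11000011010100000000000000000000';       % sign, exponent, mantissa
value = typecast(uint32(bin2dec(bits)), 'single')  % returns -208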

2.1.3 Absolute and Relative Errors


In engineering practice, the errors introduced during measurements, model
creation, simulation post-processing and similar activities need to be
quantified, regardless of the source of the error. To quantify errors, we
define two types: absolute and relative errors. When the measured (simulated)
value is denoted as $\tilde{x}$ and the accurate value as $x$, then the two
types of errors are defined as:
Absolute error:
$\delta_a = x - \tilde{x}$ (2.4.)

Relative error:
$\delta_r = \frac{\text{absolute error}}{\text{real value}} = \frac{\delta_a}{x} = \frac{x - \tilde{x}}{x}, \quad x \ne 0$ (2.5.)
When the real value is 0, the relative error is not defined. Most of the time
relative errors have higher significance than absolute errors, so in most cases
we evaluate only relative errors.

Example
Let’s assume two cases: first we measure electric current, where 𝑥̃1 =
0,004 𝐴, while the real value 𝑥1 = 0,005 𝐴. In the second case, we measure
electric potential (voltage), where the measured value is 𝑥̃2 = 1315 𝑉, while
real value is 𝑥2 = 1331 𝑉. What is the absolute and relative error in the two
measurements?

Solution:
Absolute errors:
$\delta_{a1} = x_1 - \tilde{x}_1 = 0,005\,A - 0,004\,A = 0,001\,A$
$\delta_{a2} = x_2 - \tilde{x}_2 = 1331\,V - 1315\,V = 16\,V$
Based on this, 0,001 A sounds a lot lower than 16 V, so is that a better
measurement? Let's investigate the two in terms of relative errors!
$\delta_{r1} = \frac{x_1 - \tilde{x}_1}{x_1} = \frac{0,005\,A - 0,004\,A}{0,005\,A} = 0,2$
$\delta_{r2} = \frac{x_2 - \tilde{x}_2}{x_2} = \frac{1331\,V - 1315\,V}{1331\,V} = 0,012$
In this case we can see that for the current, the 0,001 A absolute error
corresponds to a 20% relative error, while when measuring the voltage, the
16 V absolute error results in only a 1,2% relative error. So it is not just
the difference that matters, but how large the difference is in the grand
scheme of things.

2.1.4 Error Limits
When looking at a single measurement error, the absolute error provides us
with some information. However, when we are looking at multiple measurements
(and we always should be!), it is beneficial to average the measurements and
look at the absolute value of the absolute error. The next question is: what
is the highest level of absolute error we can expect; what are our error
limits? While the bottom limit of the absolute error is clearly 0, the upper
limit can be anything, even infinity. Due to this, we can identify an upper
limit for the acceptable absolute error as:

$|\delta_a| = |x - \tilde{x}| \le \alpha$ (2.6.)

Keep in mind that $\alpha$ is not an estimate of the error from the
measurements, but rather an acceptance limit! A similar limit can be defined
for the acceptable relative error as:
$|\delta_r| = \frac{|x - \tilde{x}|}{|x|} \le \beta$ (2.7.)

2.1.5 Error Propagation


After discussing the different types of errors, the next question is how
these errors "evolve" during our procedure, and how they affect the final
results of the calculation/simulation. Considering that most numerical
methods rely mainly on the four basic arithmetic operations (addition,
subtraction, multiplication, division), these effects can easily be
investigated and quantified for such operations.
Let us assume that $\tilde{x}_1$ and $\tilde{x}_2$ are the approximations of
the real values $x_1$ and $x_2$, respectively. In this case, the absolute and
relative error limits of these pairs are:
$|\delta_{a1}| \le \alpha_1; \quad |\delta_{a2}| \le \alpha_2; \quad |\delta_{r1}| \le \beta_1; \quad |\delta_{r2}| \le \beta_2$ (2.8.)

The upper absolute error limit in the case of addition and subtraction is the
sum of the individual limits:
$|\delta_a| = |(x_1 \pm x_2) - (\tilde{x}_1 \pm \tilde{x}_2)| \le \alpha_1 + \alpha_2$ (2.9.)

In the case of multiplication, the propagation is expressed in terms of the
relative errors:
$|\delta_r| = \left|\frac{x_1 x_2 - \tilde{x}_1 \tilde{x}_2}{x_1 x_2}\right| \le \beta_1 + \beta_2$ (2.10.)
This can easily be seen: assuming a two-term product, we can write:
$|\delta_r| = \left|\frac{x_1 x_2 - \tilde{x}_1 \tilde{x}_2}{x_1 x_2}\right| = \left|\frac{x_1 x_2 - [x_1 - \delta_{a1}][x_2 - \delta_{a2}]}{x_1 x_2}\right| = \left|\frac{\delta_{a1} x_2 + \delta_{a2} x_1 - \delta_{a1}\delta_{a2}}{x_1 x_2}\right|$ (2.11.)
In this last equation we can use the approximation $\delta_{a1}\delta_{a2} \approx 0$,
as this product is orders of magnitude smaller than the other terms. Using
this simplification we arrive at:
$|\delta_r| = \left|\frac{\delta_{a1} x_2 + \delta_{a2} x_1}{x_1 x_2}\right| = \left|\frac{\delta_{a1}}{x_1} + \frac{\delta_{a2}}{x_2}\right| \le \left|\frac{\delta_{a1}}{x_1}\right| + \left|\frac{\delta_{a2}}{x_2}\right| \le \beta_1 + \beta_2$ (2.12.)
Which proves our point.
Similarly, without proof this time, the relative error propagation of
division is:
$|\delta_r| = \left|\frac{x_1/x_2 - \tilde{x}_1/\tilde{x}_2}{x_1/x_2}\right| \le \beta_1 + \beta_2$ (2.13.)
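The product rule (2.12.) can be checked numerically in MATLAB; a minimal sketch with arbitrarily chosen example values (not taken from the text above):

% Exact values and their approximations (arbitrary example values)
x1 = 2.00;   x1t = 1.98;                 % about 1% relative error
x2 = 5.00;   x2t = 5.10;                 % about 2% relative error
beta1 = abs((x1 - x1t)/x1);
beta2 = abs((x2 - x2t)/x2);
% Relative error of the product stays below the bound beta1 + beta2
relErrProduct = abs((x1*x2 - x1t*x2t)/(x1*x2))
bound = beta1 + beta2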

2.1.6 Subtracting Two Nearly Equal Numbers


During numerical calculations, two relatively simple operations can result in
very large inaccuracies:
• Dividing by very small numbers
• Subtracting two nearly equal numbers
Of course, when the second case happens in the denominator of a fraction, it
leads back to the first case. Let's take two numbers, $S_1$ and $S_2$, and
assume that in floating point representation their first k digits match:
$FL(S_1) = 0,d_1 d_2 \ldots d_k a_{k+1} \ldots a_m \times 10^p$
$FL(S_2) = 0,d_1 d_2 \ldots d_k b_{k+1} \ldots b_m \times 10^p$
Obviously, the higher k is, the closer $S_1$ and $S_2$ are to each other.
After the subtraction we arrive at the following form:
$FL(FL(S_1) - FL(S_2)) = 0,c_{k+1} \ldots c_m \times 10^{p-k}$
where $c_{k+1} \ldots c_m$ are the results of the subtractions
$(a_{k+1} - b_{k+1}) \ldots (a_m - b_m)$.
What this equation represents is that the mantissa is reduced to $m-k$
digits, as opposed to the original $S_1$ and $S_2$ numbers, where the
mantissa is $m$ digits long. Due to this, however, the importance of the
round-off error increases, and from this point on it will affect our
calculations. This effect can be eliminated by reformulating the problem.
Let's look at an example.

Example
Let’s assume the following second order equation: 𝑥 2 + 50 + 4 = 0, with
roots at 𝑥1 = −0,080128411 and 𝑥2 = −49,91987159.

Solution:
It is known that second order equations have an explicit solution formula:
$x_{1,2} = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$
In this case $b^2 \gg 4ac$, so for the term under the root
$\sqrt{b^2 - 4ac} \cong b$. So when calculating $x_1$ we are subtracting
nearly equal numbers. In the case of 4-digit rounding, the root is calculated
as:
$FL(x_1) = \frac{-50,00 + \sqrt{50,00^2 - 4(1,000)(4,000)}}{2(1,000)} = \frac{-50,00 + 49,88}{2,000} = -0,060$
Likewise the second root:
$FL(x_2) = \frac{-50,00 - \sqrt{50,00^2 - 4(1,000)(4,000)}}{2(1,000)} = \frac{-50,00 - 49,88}{2,000} = -49,94$
These values show the following relative errors:
$|\delta_{r1}| = \frac{|x_1 - FL(x_1)|}{|x_1|} \cong 0,2503 = 25,03\%$
$|\delta_{r2}| = \frac{|x_2 - FL(x_2)|}{|x_2|} \cong 0,0004 = 0,04\%$
So $x_1$ carries a much higher relative error than $x_2$, for which two
nearly equal numbers are added rather than subtracted, so there is less of a
problem. The main issue, though, is the fact that the same formula for the
same problem can give 0,04% or 25% relative error, even for a simple problem
like this.
Let’s try to approach the problem in a different way. In addition to the explicit
solution, we also know that an equation written as 𝑎𝑥 2 + 𝑏𝑥 + 𝑐 = 0 can be
rearranged to produce 𝑥1 𝑥2 = 𝑐. So from one solution we can calculate the
other as:
𝑐 4,000
𝐹𝐿(𝑥1 ) = = = −0,080
𝑎𝐹𝐿(𝑥2 ) (1,000)(−49,94)
In which case the relative error is:
$|\delta_{r1}| = \frac{|x_1 - FL(x_1)|}{|x_1|} \cong 0,0016 = 0,16\%$

So we have achieved a drastic improvement in accuracy (the relative error
dropped by 24,87 percentage points) by introducing a very simple (but
clever!) change compared to the original formula.
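The same phenomenon can be reproduced in MATLAB without simulating 4-digit arithmetic, by picking coefficients extreme enough that the cancellation is visible even in double precision. The equation $x^2 + 10^8 x + 1 = 0$ used below is an assumed illustrative example, not the one from the text:

% Catastrophic cancellation in the quadratic formula, x^2 + 1e8*x + 1 = 0
a = 1;  b = 1e8;  c = 1;
d = sqrt(b^2 - 4*a*c);

x1_naive  = (-b + d)/(2*a);     % subtracting two nearly equal numbers
x2        = (-b - d)/(2*a);     % no cancellation in this root
x1_stable = c/(a*x2);           % reformulated using x1*x2 = c/a

x1_exact     = -1e-8;           % the small root, to high accuracy
relErrNaive  = abs((x1_exact - x1_naive)/x1_exact)   % tens of percent
relErrStable = abs((x1_exact - x1_stable)/x1_exact)  % near machine precision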

2.2 Iterative Methods


Numerical methods consist of a set of predefined algebraic and logical
mathematical operations, performed in an exact order, which result in an
approximation of a given problem's exact analytical solution. This system of
operations, which sets the content and order of the operations and leads to
the required solution, is called an algorithm.
To describe an algorithm, the most efficient way is to present it as code.
The code contains the inputs, the required operations, and generates the
outputs. We use two main notations in the pseudocode: the end of a given step
is marked with a full stop (.), while an ongoing step is marked with a
semicolon (;). We characterise an algorithm as stable when a small
disturbance in the input results in a small change in the output; if this is
not the case, we consider the algorithm unstable.
By iterative methods we mean procedures that start from an initial guess and
then successively compute better and better approximations, until the
solution is deemed acceptable. The use of these iterative methods can be very
broad, from the solution of algebraic equations, through solving systems of
equations, up to more complicated matters such as differential equations.
This set of lecture notes (and the subject you are studying) demonstrates a
variety of these methods and their applicability.
One of the most important questions for iterative methods is the stopping
condition. Generally we use two criteria:
1. When some condition becomes true
2. When we have surpassed a certain number of iterations, amount of time, etc.

The stopping conditions need to check whether the calculated approximate
solution is within the acceptable tolerance of the real solution. In reality,
however, the precise value is not always available, so oftentimes the
stopping condition is based on the relative change between the solutions
acquired in successive iterations. This is referred to as convergence, and it
is the desirable stopping condition. However, due to algorithmic errors,
software bugs, poor initial conditions, etc., the solution might not
converge, thus we need to limit the maximum number of iterations, the run
time or something similar, otherwise our code will try to run forever. The
implication of that for standalone code is not necessarily huge, as it can be
terminated, but numerical algorithms can also be part of something bigger, in
which case a poorly written, iteration-unconstrained algorithm can crash a
whole system!

Example: [3]
Let's look at a mathematical example. Let's approximate the value of $e^{-2}$
to 7 digits, with a tolerance of $\epsilon = 10^{-6}$!
Solution:
Let’s define the first 𝑛 + 1 members using the Maclaurin series, which is
written for the 𝑓(𝑥) = 𝑒 𝑥 function using an 𝑛th order Taylor polynomial:
𝑛
1
𝑇𝑛 (𝑥) = ∑ 𝑥 𝑖
𝑖!
𝑖=0
When approximating 𝑒 −2 with the lowest number of n, then:
|𝑒 −2 − 𝑇𝑛 (−2)| < 𝜀
Which is also the stopping condition. The solution algorithm then can be
defined as:

Input: $x = 2$, $\varepsilon = 10^{-6}$, $N = 20$ (value to calculate, precision, iteration limit)
Output: approximation of $e^{-2}$ with $\varepsilon$ accuracy, or an "error" message
Step 1: Set initially: $n = 0$
Step 2: $Tval = e^{-x}$ (real reference value); $Term = 1$ (current term); $Psum = 0$ (initial sum); $Sgn = 1$ (alternating sign, initial value)
Step 3: While $n \le N$, do Steps 4-5
Step 4: $Psum = Psum + Sgn \cdot Term / n!$; if $|Tval - Psum| < \varepsilon$ then Output($n$, $Psum$), Stop (stopping condition)
Step 5: $n = n + 1$; $Sgn = -Sgn$; $Term = Term \cdot x$ (next term)
Step 6: Output(failure), Stop (if no convergence within $N$ iterations)
End

With 7-digit precision, $e^{-2} = 0,1353353$. If the limit on the number of
iterations (here 20) is reached without the tolerance being met, the program
returns the failure message, as convergence within a fixed number of steps is
not guaranteed in general. If the difference between the exact value and the
approximate value is within the prescribed tolerance, the program stops; in
this case the outputs are n and the function approximation. If we try the
algorithm in MATLAB or Excel VBA, we will see that 14 iteration steps are
enough to achieve convergence, so n = 13. In this case the approximate value
is $e^{-2} \cong 0,1353351$, which corresponds to an absolute error of
$\delta_a = 0,2 \cdot 10^{-6}$.
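A minimal MATLAB sketch of the algorithm above (variable names follow the pseudocode; as in the example, the stopping test compares against the known exact value):

% Approximate e^(-2) with the truncated Maclaurin series of e^x at x = -2
x    = 2;            % we evaluate e^(-x)
epsT = 1e-6;         % tolerance
N    = 20;           % iteration limit

Tval = exp(-x);      % reference value
Term = 1;            % current power x^n
Psum = 0;            % partial sum
Sgn  = 1;            % alternating sign (-1)^n
n    = 0;
converged = false;

while n <= N
    Psum = Psum + Sgn*Term/factorial(n);
    if abs(Tval - Psum) < epsT
        fprintf('Converged: n = %d, approximation = %.7f\n', n, Psum);
        converged = true;
        break
    end
    n    = n + 1;
    Sgn  = -Sgn;
    Term = Term*x;
end
if ~converged
    disp('Failure: no convergence within the iteration limit');
end

Running this reproduces the result quoted above: the loop stops at n = 13.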

2.2.1 Basic Iterative Method


The basic iterative methods rely on repeated substitution. For example, if
there is a function $f(x)$ and an initial value $x_0$, then the values
$x_1, x_2, \ldots, x_{n+1}$ can be generated as:
$x_{n+1} = f(x_n)$ (2.14.)
where $n = 0, 1, 2, \ldots$ and $x_0$ is known.


There are essentially three possible outcomes of the algorithm:
- The iterations converge quickly
- The iterations converge slowly
- The iterations do not converge

How the program runs is affected by the character of the function $f(x)$ and
by the initial value $x_0$.
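A tiny MATLAB illustration of repeated substitution; the choice $f(x) = \cos(x)$ and the starting value are assumptions made here purely so that the iteration happens to converge:

% Repeated substitution x_{n+1} = f(x_n) with f(x) = cos(x), x0 = 1
f = @(x) cos(x);
x = 1.0;                 % initial value x0
for n = 1:50
    x = f(x);            % substitute the previous value back into f
end
x                        % tends to the solution of x = cos(x), about 0.7391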

2.2.2 Convergence Rate


Let’s take a {𝑥𝑛 } series, which converges to 𝑥. Then the nth iteration steps the
absolute error is:
𝛿𝑎,𝑛 = 𝑥 − 𝑥𝑛 , 𝑛 = 0,1,2, …
If there exist a number $R$ and a constant $K$ for which the following holds
(writing $e_n = \delta_{a,n}$ for the error in step $n$):
$\lim_{n\to\infty} \frac{|e_{n+1}|}{|e_n|^R} = K$ (2.15.)
then we say that the iterative algorithm's rate of convergence is $R$. There
are two basic types of convergence, linear and quadratic. Convergence is
linear when $R = 1$, so:
$\lim_{n\to\infty} \frac{|e_{n+1}|}{|e_n|} = K \neq 0$ (2.16.)
Convergence is quadratic when $R = 2$:
$\lim_{n\to\infty} \frac{|e_{n+1}|}{|e_n|^2} = K \neq 0$ (2.17.)

The convergence rate is not always an integer number; for example the Secant
method's convergence rate is $\frac{1}{2}(1+\sqrt{5}) \cong 1,618$. [3]

Example: [3]
Calculate the convergence rate of the following problem:
$x_{n+1} = \left(\frac{1}{2}\right) x_n, \quad n = 0, 1, 2, \ldots, \quad x_0 = 1$

Solution:
Since $x_n = \left(\frac{1}{2}\right)^n \to 0$ as $n \to \infty$, the limit
is obviously $x = 0$. In this case, the absolute error in the nth step is:
$\delta_{a,n} = x - x_n = 0 - \left(\frac{1}{2}\right)^n = -\left(\frac{1}{2}\right)^n$
First let's investigate what happens if $R = 1$. In this case, using (2.15.):
$\lim_{n\to\infty} \frac{|e_{n+1}|}{|e_n|^R} = \lim_{n\to\infty} \frac{\left|-\left(\frac{1}{2}\right)^{n+1}\right|}{\left|-\left(\frac{1}{2}\right)^{n}\right|} = \frac{1}{2} \neq 0$
So $R = 1$ satisfies the definition, and the iteration converges linearly.
Since $R$ satisfies the criterion set in (2.15.), there is no need to
investigate other values.
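The limit in (2.16.) can also be estimated numerically; a minimal MATLAB sketch for the example sequence above:

% Estimate the linear convergence constant K for x_{n+1} = x_n/2, x0 = 1
x      = 1;                  % x0
xLimit = 0;                  % known limit of the sequence
eOld   = abs(xLimit - x);
for n = 0:9
    x    = x/2;              % one iteration step
    eNew = abs(xLimit - x);  % absolute error after the step
    fprintf('n = %2d   |e_{n+1}|/|e_n| = %.4f\n', n, eNew/eOld);
    eOld = eNew;
end
% every printed ratio is 0.5000, i.e. K = 1/2 with R = 1 (linear convergence)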

3 Single Variable Equations, System of Equations
Solving single variable equations is perhaps the most common numerical
activity. Of interest is the value of a single variable, or the values of
multiple variables, that satisfy the desired equality (or equalities). If the
equations can be rearranged and brought to a closed form, the problem becomes
trivial. However, not all variables can be expressed conveniently, or it
might not be possible at all. A typical example is the equation governing the
flow in a rocket nozzle:
$\frac{A}{A^{*}} = \left(\frac{\gamma+1}{2}\right)^{-\frac{\gamma+1}{2(\gamma-1)}} \frac{\left(1+\frac{\gamma-1}{2}M^{2}\right)^{\frac{\gamma+1}{2(\gamma-1)}}}{M}$ (3.1.)

While the area ratio $A/A^*$ can be calculated directly if the Mach number
$M$ is known at a given nozzle cross-section, it is often the exact opposite
that we want: given a nozzle with known geometry (area ratio), what Mach
number do we get? In this case, expressing $M$ from the above equation as a
function of the area ratio is not directly possible (or if it is, the
solution is not trivial at all!). In these cases we can rely on numerical
methods to approximate a value of $M$ that satisfies the equation to a good
enough accuracy.
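As an illustration of the inverse problem, equation (3.1.) can be handed to a numerical root finder directly. A minimal sketch using MATLAB's built-in fzero; the values $\gamma = 1.4$, the area ratio of 2.0 and the supersonic initial guess of 2.0 are assumed example inputs:

% Find the Mach number belonging to a given area ratio A/A*, equation (3.1.)
gamma      = 1.4;      % ratio of specific heats (air)
areaRatio  = 2.0;      % example A/A* value
AoverAstar = @(M) ((gamma+1)/2)^(-(gamma+1)/(2*(gamma-1))) .* ...
             (1 + (gamma-1)/2*M.^2).^((gamma+1)/(2*(gamma-1))) ./ M;

% Root of A/A*(M) - areaRatio; the initial guess selects the supersonic branch
M = fzero(@(M) AoverAstar(M) - areaRatio, 2.0)   % approximately 2.2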

Figure 2: 1/x class function

In all cases it is important to consider the algorithm from a practical point
of view. Solving for the roots of equations is normally part of a larger set
of methods or a larger program, but even if standalone, it is desired that
the algorithm definitely comes to some conclusion. For this reason it is good
practice to define an iteration and/or time limit for the method, especially
if the continuity of the function cannot be guaranteed. As can be seen in
Figure 2, while the function satisfies the criterion that the function values
at the lower and upper bounds have different signs, there is no actual
numerical solution to this equation in the shown interval. If untreated, a
computer could potentially go into an infinite loop, trying to find the
non-existent solution.
In the following chapter, the solution of single variable equations will be
investigated, with some of the basic root finding methods demonstrated, and
briefly discussed. As the second part of the chapter, the solution of systems
of equations will be investigated, and the methodology discussed.

3.1 Solution of Single Variable Equations

Solving single variable equations can be done using a large variety of
methods, however all of them require some initial steps and assumptions,
which will be discussed here before the specific methodologies are presented.
Initially, all equations need to be rearranged so that one side of the
equation becomes zero, i.e. into the standard format:
$f(x) = 0$ (3.2.)
Any equation, regardless of complexity, can be rearranged into this zero
form, for example by simply subtracting one side of the equation from the
other.
As a general rule, continuity of the function is desired, at least “near” the root
of interest. It is very difficult to define “near” as it will be shown, that different
algorithms might operate on largely different sized regions. Also, due to the
numerical nature of the algorithms, and the digitized representation of
numbers, discontinuities are not necessarily a problem, as all parts of the
function are not evaluated during the solution process. Regardless, “nice”,
continuous, many times differentiable functions are beneficial, to avoid issues
arising from discontinuities. Fortunately, most processes encountered in
nature or everyday life have these properties, and as such will lend themselves
for simple and efficient numerical root finding.
Regarding finding the exact roots it needs to be mentioned, that as with all
other things numerical, the exact value of the root can never be found. This is
due to the fact, that numbers are represented in a digitised manner by the

21
computer. So at least, the accuracy of any numerical procedure performed by
the computer is limited by the machine’s precision. Today most good
numerical software use double precision variables, which represent numbers
encoded with 64 bits of memory, which are accurate up to 15-17 decimal digits.
This precision is more than enough for all but the most demanding of
applications; even so, calculating so many digits is a complete waste of time
and resources for most applications. For example if we are calculating the
tons of steel to be purchased, why go down to the microgram level, when the
foundry would be able to measure only in kg at best?
Based on these, to find roots efficiently, it is always important to set the
required tolerance for the procedure. For most applications 3 or 4 significant
digits are enough, so a relative tolerance of 1E-3 or 1E-4 is acceptable. This
obviously requires some a priori understanding of the problem at hand, but
adaptivity could also be included in the root finding algorithm, so that the
tolerance is adapted depending on the problem. In this chapter most examples
will use an absolute tolerance of 1E-4 for convergence.

3.1.1 Bisection method

The bisection method is a simple root finding method which relies on
successively subdividing a given interval in which the root is predicted to
lie, until sufficient convergence is achieved. Probably one of the first
documented root finding methods, it is known by different names such as the
"interval halving", "binary search" or "dichotomy" method. The method is very
simple and robust, but it is not among the most efficient, as the convergence
rate of the algorithm is only linear.
The method requires a continuous function f(x) on a given closed interval
[a, b]. It is essential that the function values have different signs at the
two ends of the interval for this method to converge.
The method implements the root finding using the following steps:
1. Evaluate the function at the centre of the [a,b] interval. This point will
   be called $x_i$, where the subscript i indicates the iteration number, and
   the function value is $f(x_i)$.
2. If the function value is within the prescribed tolerance, i.e.
   $|f(x_i)| < \varepsilon$, we have converged to the solution.
3. If there is no convergence yet, then find out which end of the interval
   has a sign differing from $f(x_i)$, and form a new, smaller interval using
   $x_i$ and the appropriate end of [a,b].
4. Repeat from step 1 until convergence is achieved.

The method is shown graphically in Figure 3. The orange coloured bars
represent the calculation interval, which is halved at every subsequent
iteration. It can also be seen that a fairly high required precision can
result in many additional iterations, which might or might not be justified,
depending on the application.

Figure 3: Bisection method interval at iterations

Example
The method is implemented in MATLAB on an arbitrary polynomial function:
$f(x) = 0.8x^3 - 4.0x^2 - 12.0x - 21.0$

Solution:
%Script C2BisectionMethod.m
%Script to use the bisection method to solve single
%variable equations
%Function to be evaluated
function fValue = f_at_x(x)
fValue = 0.8*x^3-4.0*x^2-12.0*x-21.0;
end
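%Note: in MATLAB, local functions used by a script must be placed at the
%end of the script file (R2016b or newer) or saved in their own file
%(f_at_x.m); the definition is shown at the top here only for readability.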

%Limits of interval: a - lower; b - upper
a = 0.0;
b = 10.0;
%Tolerance
eps = 1e-3;
%Maximum limit on iterations
iMax = 1e12;
%Counter of current iteration number
iCounter = 1;

%Initial evaluation of the mid interval
xi = (a+b)/2.0;
%Initial evaluation of the function
fxi = f_at_x(xi);

%The method will run until both of the following are true:
%  - the absolute value of the function at the mid interval is not
%    smaller than the tolerance
%  - the number of iterations is smaller than the limit
while ~(abs(fxi) < eps) && iCounter < iMax
%If the signs of f(a) and f(xi) are the same, then we
% know b must have the different sign!
if sign(f_at_x(a)) == sign(fxi)
a = xi;
else
b = xi;
end
%Update the function value and iteration counter
xi = (a+b)/2.0;
fxi = f_at_x(xi);
iCounter = iCounter+1;
end

%Solution of the method
result = (a+b)/2.0
%Function value at the solution
fValue = f_at_x(result)

3.1.2 Newton-Raphson Method

The Newton or Newton-Raphson method is a classic root finding algorithm for
real-valued functions. The method is named after Isaac Newton and Joseph
Raphson. It is based on iterative approximations of the root of the function.
The method relies on the Taylor series expansion of the function, so whether
the function is continuous and differentiable in the region near the root,
the starting point of the method and the overall behaviour of the function
all have implications on convergence. The convergence of the method, at least
near the solution, is quadratic, which is considered quite efficient.
The Taylor series representation of the function $f(x)$ near a given point
$x_i$ is defined as:
$f(x) = f(x_i) + f'(x_i)(x - x_i) + \frac{f''(x_i)}{2!}(x - x_i)^2 + \ldots$ (3.3.)
The method implements the following procedure:
1. If in the i-th (latest) iteration the function value is within the
   prescribed tolerance, i.e. $|f(x_i)| < \varepsilon$, we have converged to
   the solution.
2. If not, the next point where the function will be evaluated needs to be
   calculated. If we neglect all terms above first order in the Taylor
   series, it can be written that $f'(x_i) \approx \frac{f(x_i)}{\Delta x}$.
   Using this, the next point at which to evaluate the function is estimated
   as $x_{i+1} = x_i - \frac{f(x_i)}{f'(x_i)}$. The subtracted term
   represents an increment from the current point $x_i$ towards the next
   point $x_{i+1}$ along the slope of the curve at $x_i$.
3. Repeat from step 1.
Note that the classic Newton-Raphson method does not unconditionally
guarantee a solution. For example, starting from a point where
$f'(x_i) = 0$ (e.g. a local extremum), the division
$\frac{f(x_i)}{f'(x_i)}$ cannot be evaluated. Also, the method can enter
infinite cycles or show divergent behaviour. Of critical importance in most
cases, especially for complicated functions, is the starting point of the
iterations.
It can be noted that more than the first derivative could be included in the
algorithm, which would speed up the method's convergence, however the
limitations would still apply.

Example
The MATLAB code that implements a version of the Newton-Raphson method is
shown in the following. Note that the evaluation of the local derivative can
be done in a variety of ways; the current implementation is based on a
central difference approximation. In this example a root of the following
function is required:
$f(x) = 0.8x^3 - 4.0x^2 - 12.0x - 21.0$

Solution:
%Script C2NewtonRaphson.m
%Script to use the Newton-Raphson method to solve single variable equations

%Function to be evaluated
function fValue = f_at_x(x)
fValue = 0.8*x^3-4.0*x^2-12.0*x-21.0;
end

%Initial point:
x0 = 10.0;
%Step size used to calculate local differential
xStep = 1e-5;

%Tolerance
eps = 1e-3;
%Maximum limit on iterations
iMax = 1e12;
%Counter of current iteration number
iCounter = 1;

%Initial point
xi = x0;
%Initial evaluation of the function
fxi = f_at_x(xi);

%The method will run until both of the following are true:
%  - the absolute value of the function is not smaller than the tolerance
%  - the number of iterations is smaller than the limit
while ~(abs(fxi) < eps) && iCounter < iMax
%Calculate the local derivative numerically.
%Note there are many ways to evaluate the derivative!
diffX = (f_at_x(xi+xStep)-f_at_x(xi-xStep))/(2.0*xStep);

%Work out the better approximation of the root
xi = xi - fxi/diffX;
%Update the function value and iteration counter
fxi = f_at_x(xi);
iCounter = iCounter+1;
end

%Solution of the method
result = xi
%Function value at the solution
fValue = f_at_x(result)

Figure 4 shows an example of how the Newton-Raphson method converges on a
solution from a given starting point, x = 10, as defined in the MATLAB
script. In this case, 5 iterations are required for convergence. Starting
from a different point, such as x = -5, it would take 17 iterations, and some
starting points might never yield a converged solution at all.


Figure 4: Convergence of the Newton-Raphson method

3.1.3 Secant method

The Secant method can be thought of as a variant of the Newton method. In
this algorithm, two initial estimates of the root are taken, then one of them
is improved by moving along a secant line of the curve. The procedure is then
iterated until convergence is reached.
There are a number of benefits of using the Secant method over the Newton
method. The most important is that the method does not require the
computation of the local derivative; depending on the type of function used,
differentiating it might be computationally expensive, or might even be
impossible. Based on this, the Secant method is searching along the secant
line, as opposed to the tangent line used by the Newton-Raphson method.
Provided good starting points are selected, the Secant method is generally
more stable than the Newton-Raphson, and has a lower chance of diverging
or showing wildly oscillating behaviour. Although more iterations are
generally needed than with Newton-Raphson, each iteration can be
significantly cheaper because the derivative is not evaluated.
The following algorithm describes the Secant method:
1. Assume two initial estimates for the root, the points $x_{-1}$ and $x_0$.
2. If in the i-th (latest) iteration the function value is within the
   prescribed tolerance, i.e. $|f(x_i)| < \varepsilon$, we have converged to
   the solution.
3. If not, then calculate a better estimate of the root using the following
   formula:
$x_{i+1} = x_i - \dfrac{f(x_i)}{\dfrac{f(x_{i-1}) - f(x_i)}{x_{i-1} - x_i}}$ (3.4.)
4. Repeat from step 2.

Example

The following MATLAB script implements the Secant method on the familiar
polynomial function:
$f(x) = 0.8x^3 - 4.0x^2 - 12.0x - 21.0$

Solution:
%Script C2Secant.m
%Script to use the Secant method to solve single variable equations

%Function to be evaluated
function fValue = f_at_x(x)

fValue = 0.8*x^3-4.0*x^2-12.0*x-21.0;
end

%Initial points:
xm1 = 10.0;
x0 = 9.5;

%Tolerance
eps = 1e-3;
%Maximum limit on iterations
iMax = 1e12;
%Counter of current iteration number
iCounter = 1;

%Initial point
xim1 = xm1;
xi = x0;
%Initial evaluations of the function
fxim1 = f_at_x(xim1);
fxi=f_at_x(xi);

%The method will run until both of the following are true:
%  - the absolute value of the function is not smaller than the tolerance
%  - the number of iterations is smaller than the limit
while ~(abs(fxi) < eps) && iCounter < iMax

%Work out the better approximation of the root


xip1 = xi - fxi/((fxim1-fxi)/(xim1-xi));
%Update the function value and iteration counter
xim1=xi;
fxim1=fxi;
xi= xip1;
fxi = f_at_x(xi);
iCounter = iCounter+1;
end

%Solution of the method
result = xi
%Function value at the solution
fValue = f_at_x(result)

Figure 5 shows the convergence of the Secant method from the starting point
pair of 10 and 9.5. It can be seen that in this case, 5 iterations of the method
provided the root with the requested accuracy. To compare, a less fortunate
pair, such as -4.5 and -5 would require 111 iterations.

Figure 5: Convergence of the Secant method

3.1.4 Regula Falsi Method

Regula Falsi, or False Position methods are a family of bracketing root finding
methods, that can be thought of as an advanced version of the Bisection
method. In the case of the Bisection method, the interval is always halved,
and thus very little information is actually used from the function’s behaviour.
The Regula Falsi relies on the calculation of the secant line of the function, and
improving the root estimation based on the zero crossing of this secant line.
This additional information results in more educated guessing than the
bisection method, which increases the convergence rate of the algorithm,
although at the expense of additional calculations. Similar to the bisection
method, it requires an [a b] interval where the function crosses the 0 value for
it to work.
The Regula Falsi method can be described using the following algorithm:

1. Evaluate the function at the ends of the [a,b] interval. The two end
   points are named $x_0 = a$ and $x_1 = b$, and the function values are
   $f(x_0)$ and $f(x_1)$.
2. The secant line between $x_i$ and $x_{i+1}$ (using their function values)
   is evaluated, and its zero crossing point $x_{i+2}$ is calculated as:
$x_{i+2} = x_i - f(x_i)\,\dfrac{x_{i+1} - x_i}{f(x_{i+1}) - f(x_i)} = \dfrac{x_i f(x_{i+1}) - x_{i+1} f(x_i)}{f(x_{i+1}) - f(x_i)}$ (3.5.)
3. If the function value $f(x_{i+2})$ is within the prescribed tolerance,
   i.e. $|f(x_{i+2})| < \varepsilon$, we have converged to the solution and
   $x_{i+2}$ is the desired root.
4. If there is no convergence yet, then find out in which sub-interval the
   root lies, based on the sign difference between $f(x_i)$ and $f(x_{i+2})$.
   Define the new [a,b] interval based on the result.
5. Repeat from step 2 until convergence is achieved.

Example
An implementation of the Regula Falsi method in MATLAB using the usual
polynomial function is shown below. Convergence is shown graphically in
Figure 6.
$f(x) = 0.8x^3 - 4.0x^2 - 12.0x - 21.0$

Solution:
%Script C2RegulaFalsi.m
%Script to use the Regula Falsi method to solve single variable equations

%Function to be evaluated
function fValue = f_at_x(x)
fValue = 0.8*x^3-4.0*x^2-12.0*x-21.0;
end

%Limits of interval:
%a: lower; b: upper
a = -5.0;
b = 10.0;

%Tolerance
eps = 1e-3;
%Maximum limit on iterations
iMax = 1e12;
%Counter of current iteration number
iCounter = 1;

%Initial evaluation of the secant zero crossing
xi = (a*f_at_x(b)-b*f_at_x(a))/(f_at_x(b)-f_at_x(a));
%Initial evaluation of the function at the secant zero crossing
fxi = f_at_x(xi);

%The method will run until both of the following are true:
%  - the absolute value of the function is not smaller than the tolerance
%  - the number of iterations is smaller than the limit
while ~(abs(fxi) < eps) && iCounter < iMax
%If the signs of f(a) and f(xi) are the same, then we know b must have
%the different sign!
if sign(f_at_x(a)) == sign(fxi)
a = xi;
else
b = xi;
end
%Update the function value and iteration counter
xi = (a*f_at_x(b)-b*f_at_x(a))/(f_at_x(b)-f_at_x(a));
fxi = f_at_x(xi);
iCounter = iCounter+1;
end

%Solution of the method
result = xi
%Function value at the solution
fValue = f_at_x(result)


Figure 6: Convergence of the Regula Falsi method

3.1.5 Fixed-point method

The Fixed-point method is a recursive iteration method, where a supplementary
function $x = g(x)$ is chosen to iteratively solve the $f(x) = 0$ problem. By
definition, $p$ is a fixed point of the function $g(x)$ if $g(p) = p$. Using
this, it can be written, for example:
$g(x) = x - f(x)$ (3.6.)
from which it can be deduced that if $g(x)$ has a fixed point at $p$, then
$f(x) = x - g(x)$ (3.7.)
has a zero at 𝑝. Thus, finding the root of 𝑓(𝑥) is substituted with iteratively
finding a fixed point of 𝑔(𝑥). Interestingly, the Newton method is a form of
Fixed-point root finding method, where 𝑔(𝑥) includes the first derivative of
𝑓(𝑥).
The method has a strict criterion for stability: the absolute value of the
derivative of the function $g(x)$ needs to be below 1 for all $x$ values
considered during the iteration, i.e. $|g'(x)| < 1$. Depending on the sign
and value of $g'(x)$, a “spiral” or “staircase” like convergence (or
divergence) can be observed.
The method can be described with the following algorithm.
1. Rewrite the original 𝑓(𝑥) = 0 problem into 𝑥 = 𝑔(𝑥) in a way, where
a fixed point of 𝑔(𝑥) is also a solution of 𝑓(𝑥) = 0.
2. Assume 𝑥0 as the initial best estimation of the root of 𝑓(𝑥) (and the
fixed point of 𝑔(𝑥)).
3. Generate a better approximation of the fixed point: 𝑥𝑛+1 = 𝑔(𝑥𝑛 )
4. Iterate until desired convergence of the fixed point is achieved or
terminate if divergent behaviour is identified.
5. The acquired fixed point of $g(x)$ is the desired root of the original
$f(x)$ function.
It can be seen that in addition to the initial estimation, the choice of 𝑔(𝑥) also
plays a significant role in whether the solution converges or not, and also in
the speed of convergence if it does. The choice, however, is not obvious,
especially as information about $f(x)$ might not even be available. Generally,
Fixed-point methods are simpler to investigate than the possible numerical
methods that are based on them, so by studying the behaviour of the Fixed-
point method, it is possible to improve the performance of other root finding
methods, or develop new ones.
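A minimal MATLAB sketch of the method. The equation $f(x) = x^2 - 2 = 0$ and the particular choice $g(x) = (x + 2/x)/2$ are assumptions made for this illustration, not taken from the text above:

% Fixed-point iteration for f(x) = x^2 - 2 = 0, rewritten as x = g(x)
g    = @(x) (x + 2./x)/2;   % one possible g(x); its fixed point is sqrt(2)
x    = 1.0;                 % initial estimate x0
tol  = 1e-10;
iMax = 100;

for n = 1:iMax
    xNew = g(x);                  % x_{n+1} = g(x_n)
    if abs(xNew - x) < tol        % stop on the change between iterates
        break
    end
    x = xNew;
end
xNew                              % approximately sqrt(2) = 1.4142...

This particular $g(x)$ is in fact the Newton iteration for $f(x) = x^2 - 2$, illustrating the remark above that the Newton method is a form of fixed-point iteration.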

3.1.6 Finding multiple roots

So far it has been assumed that the problem in question has a solution, and
only a single solution, thus one root. When considering real problems this is
not necessarily true; a function can have any number of roots, even an
infinite number, such as functions of the type $f(x) = \sin(1/x)$.
Furthermore, a given root might not be just a simple root, but can have a
multiplicity of 2, 3 or more; for example in the case of $f(x) = (x-1)^3$
there is one root, $x = +1$, but with a multiplicity of 3.

Finding a multiple root numerically is more difficult than finding a simple
one with methods such as the Newton or Secant methods. The effect of the
multiplicity of the root is to reduce the convergence rate of the method.
Generally, the Newton method will show linear convergence near a root of
multiplicity $m$, with rate $\lambda = \frac{m-1}{m}$, where $m \ge 1$. It
can be seen that for m > 2 (provided it is feasible to use) the Bisection
method provides faster convergence than the Newton method.
The numerical methods demonstrated here all rely on some form of iteration,
and start with the assumption that the initial estimate is somewhere "near"
the desired root. Following this logic, in order to find multiple roots,
their approximate locations need to be identified and the iterations started
"near" the roots. A common way of identifying the root locations is to define
a starting point on the x axis and move along the positive or negative
direction. When the sign of the function changes (as long as the function is
continuous), it is an indication that there is a root nearby. Once these sign
change regions are identified, an appropriate method can be chosen to
converge on the root with the desired accuracy. Note the behaviour of some of
the tools, such as the Newton-Raphson: there is a chance of going wildly off
the desired track and exploring regions other than the desired root region.
This can be an issue, as although the approximate root location was
identified, the actual root returned might be found somewhere else. The
Bisection or the Regula Falsi methods don't suffer from this issue, as they
have already bracketed the root, and stay within the bracket throughout the
solving process.
Although the search algorithm sounds simple, a fully automatic method without a priori information on the function and the roots would be difficult to create. A mathematical algorithm relying on incrementing the variable has no information on the expected scale of the problem: is the user searching on the nano (10⁻⁹) or the giga (10⁺⁹) scale? Choosing the wrong scale would either result in steps that are too small and a very slow (to the point of being unacceptable) algorithm, or in steps that are too large, which would skip roots. In these examples we can have a “feel” for the size of the problem; however, if the problem behaves as a black box, and/or is too expensive to evaluate enough times to acquire information, the multiple-root finding algorithm needs further refinement to provide the required performance.

Example

The following example shows how a version of the multiple root finding
algorithm can be implemented on an example polynomial function. The
function is plotted in Figure 7. It can be seen that the function has 6 roots (counting multiplicity):

 𝑥 = +3.5
 𝑥 = −2.0
 𝑥 = +2.1
 𝑥 = −4.0
 𝑥 = +1.4 (with a multiplicity m = 2)

Solution:

Figure 7: Example function for multi-root finding

The iteration will be started from 𝑥 = 0, and will look for roots on either side
of the starting position. Since we were able to graph the given function, and
have information about the roots and their rough positions, we can use that information to set the step size and other parameters for an efficient algorithm. To zoom in
on the individual roots, we will use the Regula Falsi method as demonstrated
before.
The MATLAB implementation of the script is shown below:
%Script C2FindMultipleRoots.m
%Script to find multiple roots of a given equation

%Define starting position:
x0 = 0.0;

%Define expected number of roots:
%Note that in this case we know one of the roots is double!
nRoot = 5;
%Number of roots already found
nRootsFound = 0;
%Search step size
xStep = 0.025;
%Root finding tolerance
rTol = 1.0e-4;
%Maximum number of search steps, to prevent the algorithm getting stuck
iMax = 1e6;
iCounter = 1;
%Vector of roots found, in order of finding:
vRoots = [];
%Iteration when root was found, for diagnostics
itRootFound = [];
%Left position to check from
lPos = x0;
%Right position to check from
rPos = x0;

while (nRootsFound < nRoot) && iCounter < iMax
    %Check if there is a sign change to the left. If yes, search for the
    %root. If not, increment the position.
    if sign(multiRootFun(lPos)) ~= sign(multiRootFun(lPos-xStep))
        %Signs are not equal, so possible root position!
        vRoots = [vRoots; C2RegulaFalsiFun(lPos-xStep, lPos, rTol)];
        itRootFound = [itRootFound; iCounter];
        nRootsFound = nRootsFound + 1;
    end
    %Increment the left position
    lPos = lPos - xStep;
    %Repeat for the right side search direction
    if sign(multiRootFun(rPos)) ~= sign(multiRootFun(rPos+xStep))
        %Signs are not equal, so possible root position!
        vRoots = [vRoots; C2RegulaFalsiFun(rPos, rPos+xStep, rTol)];
        itRootFound = [itRootFound; iCounter];
        nRootsFound = nRootsFound + 1;
    end
    %Increment the right position
    rPos = rPos + xStep;
    %Increment counter and go to the next iteration step
    iCounter = iCounter + 1;
end
%Write out results and the iteration values
vRoots
itRootFound

%The function whose roots are sought (local function, placed at the end
%of the script as required by MATLAB)
function fValue = multiRootFun(x)
fValue = (x-3.5)*(x+2)*(x-2.1)*(x+4)*(x-1.4)^2;
end

When this script is executed on the example function shown before, it will not
find all the roots, and will terminate due to reaching the iteration count limit.
Why is this happening? Investigating the function closer, it can be seen that in
the vicinity of the double root, the function doesn’t actually change signs. A
zoomed in view of the function is shown in Figure 8.

Figure 8: Zoomed in image near the double root of the function

Because the function doesn't change sign near this root, the multi-root finding algorithm cannot recognize it as a place to search, unless it stumbles upon the exact numerical value of the root (where the signum is zero, so different from before), which is quite unlikely. Furthermore, an appropriate root-finding method needs to be chosen, as for example both the Bisection and the Regula Falsi methods require the interval ends to have different signs in order to work. Arguably, if the location were identified as a potential position for a root, the Newton-Raphson iteration would very quickly converge to the root within the required numerical precision.

Figure 9: Multi roots search example function without multiple roots

If we consider a different function, for example altering the current one by removing the double root (shown in Figure 9), the algorithm rapidly finds all roots.
Root value | Iteration number when the root is found
 1.4000    |  57
-2.0000    |  81
 2.1000    |  85
 3.5000    | 141
-4.0000    | 161
Table 1: Root value vs iteration number
It can be seen that the multiple root finding process has found all 5 roots of
the function and required 161 iterations (plus the iterations of the Regula-Falsi
method when zooming in on the local roots). By acquiring more information about the problem at hand, the number of iterations can be reduced; for example, when we know that no two roots lie too close to each other, the search step can be doubled, effectively halving the number of search steps. Keep in mind, however, that without a priori information about the function there is no guarantee that these multiple-root finding algorithms will return all, or for that matter any, of the desired roots.

3.2 Solution of System Equations

This chapter describes various methods to solve linear systems of equations.


In engineering practice, the majority of problems faced can be described with
a set of linear equations sharing common variables between them. Even if the
problem behaviour is nonlinear, many times it can be approximated by
linearizing the problem locally. These sets of equations together are referred
to as a linear system of equations. At the end of the chapter strategies for
solving nonlinear systems are also discussed.
To consistently solve the linear systems of equations, the equations need to
satisfy the requirements of linear independence. Consistent solution means
that a single set of solutions can be found that satisfies the system of
equations. In this case it can be shown that to solve for N variables, N linearly
independent equations need to be defined. If the equations are not independent, then at least one of them can be derived as a linear combination of the others. As a consequence, when the equations are not independent there is no unique solution, as the system doesn't contain enough information to determine all the variables; such systems are called underdetermined (and, if the equations contradict each other, inconsistent).
The systems of equations can be arranged into a matrix form. Consider the
following set of three equations:
3𝑥 + 2𝑦 − 4𝑧 = 3 (3.8.)

2𝑥 − 5𝑧 = −4 (3.9.)

−2𝑥 + 3𝑦 + 2𝑧 = 5 (3.10.)
Which can be rewritten and expanded in the following form:
3𝑥 + 2𝑦 − 4𝑧 = +3
2𝑥 + 0𝑦 − 5𝑧 = −4 (3.11.)
−2𝑥 + 3𝑦 + 2𝑧 = +5
From here the equations can be rewritten as a matrix–vector multiplication:
$$\begin{bmatrix} 3 & 2 & -4 \\ 2 & 0 & -5 \\ -2 & 3 & 2 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 3 \\ -4 \\ 5 \end{bmatrix} \quad (3.12.)$$
By generalizing the equation, defining the coefficient matrix $\bar{\bar{A}}$, the solution vector $\bar{x}$ and the vector of known coefficients $\bar{b}$, we can write:
$$\bar{\bar{A}}\,\bar{x} = \bar{b} \quad (3.13.)$$
Which can be rearranged by multiplying with the inverse of $\bar{\bar{A}}$ from the left:
$$\bar{\bar{A}}^{-1}\bar{\bar{A}}\,\bar{x} = \bar{\bar{I}}\,\bar{x} = \bar{x} = \bar{\bar{A}}^{-1}\bar{b} \quad (3.14.)$$
This is one of the most common problems in engineering or scientific calculations. What we are looking for, in order to solve the linear system of equations, are methods to generate the inverse of the coefficient matrix, $\bar{\bar{A}}^{-1}$, or at least an acceptable approximation of it. The following methods show different approaches to generating this matrix.
Note that large-scale matrix inversion is a problem in its own right, one of significant interest to the scientific community. Many engineering software packages, such as Finite Element Method codes, some Computational Fluid Dynamics software, image processing and compression algorithms, and many more rely on some form of matrix inversion. Most of the basic methods shown here become prohibitively expensive as the size of the coefficient matrix grows, as the methods contain unnecessary and repeated steps. Nevertheless, they illustrate the basic ideas of solving linear systems of equations.
Also note that many software packages provide some form of built-in matrix inversion algorithm. In MATLAB, the Y = inv(X) command generates the inverse of a square matrix X. (Note that the documentation suggests using the x = A\b command or the linsolve function for solving linear systems of equations instead!) In Microsoft Excel, you can use the MINVERSE function on an array of cells to generate the inverse, and MMULT to multiply the matrices and vectors together.
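For illustration, a short snippet comparing these options on the example system defined above:

A = [3 2 -4; 2 0 -5; -2 3 2];
b = [3; -4; 5];

x1 = inv(A)*b;        % explicit inverse: works, but slower and less accurate
x2 = A\b;             % recommended: solves the linear system directly
x3 = linsolve(A, b);  % equivalent alternative suggested by the documentation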

3.2.1 Gauss elimination method

The Gauss elimination method is the most visual of all the methods, and for
small systems of equations it is a method that can be performed by hand if
required. It is also called the row-reduction method, as it operates directly on
the rows of the coefficient matrix.
There are two similar methods, the Gaussian and Gauss-Jordan elimination.
They both rely on 3 types of elementary row operations that are used to alter
the system of equations, but the outcome of the procedure is different: the
Gaussian elimination stops when an upper triangle matrix is produced out of
the 𝐴̿ coefficient matrix (or non-reduced row echelon form) while the Gauss-
Jordan procedure continues until the completely reduced row echelon form
is reached. The reduced form, in the case of a consistent system of equations, is the identity matrix. Computationally it is often desirable not to perform all the row operations, as algorithmically the gains of having the reduced format are overshadowed by the additional computational expense.
In order to solve the 𝐴̿𝑥̅ = 𝑏̅ type equations, the system is written in the
following augmented matrix format:
[𝐴̿|𝑏̅] (3.15.)
Which in the case of our example system is:
$$\left[\begin{array}{ccc|c} 3 & 2 & -4 & 3 \\ 2 & 0 & -5 & -4 \\ -2 & 3 & 2 & 5 \end{array}\right] \quad (3.16.)$$
As the systems of equations are linear, we know that any scalar multiple or
addition/subtraction of the equations will also be valid linear equations.
Furthermore, as the order of the equations is arbitrary, there are three elementary operations that can be used by the Gaussian elimination, which are the following:
1. Swap any two rows in the matrix
2. Multiply any row with a scalar, non-zero number
3. Add a multiple of one row to any other row

Example

In the case of the example matrix (3.16.), the following steps can be taken to
solve:
1. Divide row 1 by 3:
$$\left[\begin{array}{ccc|c} 1 & 0.666 & -1.333 & 1 \\ 2 & 0 & -5 & -4 \\ -2 & 3 & 2 & 5 \end{array}\right]$$
2. Add −2 times row 1 to row 2, and add 2 times row 1 to row 3:
$$\left[\begin{array}{ccc|c} 1 & 0.666 & -1.333 & 1 \\ 0 & -1.333 & -2.333 & -6 \\ 0 & 4.333 & -0.666 & 7 \end{array}\right]$$
3. Divide row 2 by −1.333:
$$\left[\begin{array}{ccc|c} 1 & 0.666 & -1.333 & 1 \\ 0 & 1 & 1.75 & 4.5 \\ 0 & 4.333 & -0.666 & 7 \end{array}\right]$$
4. Add −0.666 times row 2 to row 1, and add −4.333 times row 2 to row 3:
$$\left[\begin{array}{ccc|c} 1 & 0 & -2.5 & -2 \\ 0 & 1 & 1.75 & 4.5 \\ 0 & 0 & -8.25 & -12.5 \end{array}\right]$$
5. Divide row 3 by −8.25:
$$\left[\begin{array}{ccc|c} 1 & 0 & -2.5 & -2 \\ 0 & 1 & 1.75 & 4.5 \\ 0 & 0 & 1 & 1.5152 \end{array}\right]$$
6. Add 2.5 times row 3 to row 1 and add −1.75 times row 3 to row 2:
$$\left[\begin{array}{ccc|c} 1 & 0 & 0 & 1.7879 \\ 0 & 1 & 0 & 1.8485 \\ 0 & 0 & 1 & 1.5152 \end{array}\right]$$
In this example, the solution is of the Gauss-Jordan form, and the solution
represents:
𝑥 = 1.7879
𝑦 = 1.8485
𝑧 = 1.5152
The fact that there is a consistent solution shows that the original system of equations was linearly independent.
In MATLAB, the following simple implementation shows the Gauss-Jordan elimination. Note that important features, such as checking for division by zero and checking for consistency, are omitted, but they should be an important part of any real algorithm.
% Script C2_System_GaussElimination
% Script to show the Gauss-Jordan elimination

% Note that the method does not check for consistency,


% neither in terms of size nor linear independence.
% Also 0 element checks are not included.

% Define the coefficient matrices:


A = [3 2 -4;2 0 -5;-2 3 2]
b = [3;-4;5]

%For interest, random matrices can be generated to test


% algorithm
% A = rand(3,3)
% b = rand(3,1)
% Define the matrix on which the elimination is
% performed:
M = [A b]

% Calculation loop starts. Note, process is not


% iterative, it has finite steps!
% Iterate through the diagonal:
for i=1:length(A)
    % Find the correct row and divide by the pivot element
    % in the diagonal
    M(i,:) = M(i,:)/M(i,i);
    % Iterate through the number of rows:
    for j=1:length(A)
        % Only subtract if it's not the pivot row
        if i ~= j
            M(j,:) = M(j,:) - M(j,i)/M(i,i)*M(i,:);
        end
    end
end

% Print M to screen
M

% The solution vector is the last column of M


x = M(:,end)

% Check equations, must produce numerically 0 if


% procedure is correct:
A*x-b

3.2.2 LU factorization methods

LU factorization is short for Lower–Upper factorization. Essentially the algorithm breaks down the coefficient matrix $\bar{\bar{A}}$ into a lower and an upper triangular matrix:
$$\bar{\bar{A}} = \bar{\bar{L}}\,\bar{\bar{U}} \quad (3.17.)$$
The benefit of this form is that, since there are no elements above (or below) the diagonal of the two matrices, systems of the following forms are very efficient to solve:
$$\bar{\bar{U}}\,\bar{x} = \bar{b} \quad \text{or} \quad \bar{\bar{L}}\,\bar{x} = \bar{b} \quad (3.18.)$$
This is because in the matrix row with only a single non-zero element the solution is directly available. Using that value, the next row can be quickly solved, as there is only one unknown left; then the two known values are substituted into the third row, and so on. This procedure is called forward (or back) substitution, depending on whether the matrix is lower or upper triangular.
There are certain conditions under which the plain LU factorization doesn't work even if $\bar{\bar{A}}$ is non-singular, for example when the first element of the matrix is 0. For this reason, it is common to use a version of the LU factorization where a permutation matrix $\bar{\bar{P}}$ is included to swap the rows around:
$$\bar{\bar{A}} = \bar{\bar{P}}\,\bar{\bar{L}}\,\bar{\bar{U}} \quad (3.19.)$$
This factorization exists if, and only if, $\bar{\bar{A}}$ is non-singular. Because of this property and its efficiency, the LU factorization with row pivoting is the most common way systems of equations are solved by a computer. The LU factorization requires in the order of $\tfrac{2}{3}n^3$ operations (where n is the row count of $\bar{\bar{A}}$). The additional computational expense of breaking the coefficient matrix down into the two triangular matrices pays off during the later part of the solution. Compare this to the Gauss elimination, where each element in every row is operated on multiple times during the procedure.
Once the factorization is complete, the solution can be found as:
$$\bar{x} = \bar{\bar{A}}^{-1}\bar{b} = \left(\bar{\bar{P}}\,\bar{\bar{L}}\,\bar{\bar{U}}\right)^{-1}\bar{b} = \bar{\bar{U}}^{-1}\bar{\bar{L}}^{-1}\bar{\bar{P}}^{-1}\bar{b} = \bar{\bar{U}}^{-1}\bar{\bar{L}}^{-1}\bar{\bar{P}}^{T}\bar{b} \quad (3.20.)$$
This is a very efficient procedure, since applying the permutation matrix has little cost, and the inverse of a triangular matrix is also a triangular matrix, so the forward and back substitution methods can be used to solve the system very efficiently.
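As a brief illustration, MATLAB's built-in lu function performs this factorization with row pivoting; the snippet below reuses the example system from earlier in the chapter (MATLAB returns the factors in the convention P*A = L*U):

A = [3 2 -4; 2 0 -5; -2 3 2];
b = [3; -4; 5];

[L, U, P] = lu(A);   % factorize with partial (row) pivoting: P*A = L*U
y = L\(P*b);         % forward substitution on the lower triangular system
x = U\y;             % back substitution on the upper triangular system
% x should match the Gauss-Jordan result [1.7879; 1.8485; 1.5152]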

3.2.3 Iterative solutions

This section deals with a set of methods where the solution of the linear system of equations is arrived at not by a deterministic but by an iterative method. The benefit of an iterative method is its low computational cost per iteration, in the order of $n^2$ operations, which for high n values is considerably smaller than, for example, the $\tfrac{2}{3}n^3$ operations of an LU factorization, one of the most efficient direct methods.
The general form of the iterative methods is the following:
𝐴̿𝑥̅ = [𝑄̿ − 𝑃̿]𝑥̅ = 𝑏̅ (3.21.)
Thus the original 𝐴̿ coefficient matrix is split into 𝑄̿ and 𝑃̿ matrix components.
With these the initial equation can be expressed as:
𝑄̿ 𝑥̅ = 𝑏̅ + 𝑃̿ 𝑥̅ (3.22.)
From here using a similar concept as for finding fixed points of a single
equation, an initial 𝑥̅ 0 estimation of the solution is assumed and then iterated
as:
𝑄̿ 𝑥̅ 𝑘+1 = 𝑏̅ + 𝑃̿𝑥̅ 𝑘 𝑘 = 0,1,2, … (3.23.)

And if 𝑄̿ is invertible (non-singular) the next approximation of the solution can
be expressed as:
𝑥̅ 𝑘+1 = 𝑄̿ −1 𝑏̅ + 𝑄̿ −1 𝑃̿𝑥̅ 𝑘 𝑘 = 0,1,2, … (3.24.)
As can be seen, so far no restrictions have been placed on the different matrices, except that $\bar{\bar{Q}}^{-1}$ must exist. In theory any decomposition of the coefficient matrix works; however, unless special types of matrices are used, the inversion of $\bar{\bar{Q}}$ can be as costly as directly solving the original equation, making the iterative procedure inefficient and thus pointless. The examples provided in this section use specific, easily invertible matrix decompositions which enable efficient solution of the equations with the iterative method.
3.2.3.1 Vector and matrix norms
For the iterative solutions, the convergence of the solutions needs to be
tracked, as unlike the case of the direct methods, it is not known how many
iterations it will take to find a solution.
One way to monitor the convergence is to assess how close (or similar) two
vectors are to each other. This can be done using vector norms.
Consider a vector:
𝑏̅ = {𝑏1 , 𝑏2 , … , 𝑏𝑛 } (3.25.)
There are 3 different norms that are defined over the vector:
 The $l_1$ norm, $\|\bar{b}\|_1$, is the sum of the absolute values of all vector components:
$$\|\bar{b}\|_1 = \sum_{i=1}^{n} |b_i| = |b_1| + |b_2| + \dots + |b_n| \quad (3.26.)$$
 The $l_\infty$ norm, $\|\bar{b}\|_\infty$, is the maximum of the absolute values of all vector components:
$$\|\bar{b}\|_\infty = \max\left(|b_1|, |b_2|, \dots, |b_n|\right) \quad (3.27.)$$
 The $l_2$ norm, $\|\bar{b}\|_2$, is the square root of the sum of squares of all vector components:
$$\|\bar{b}\|_2 = \sqrt{\sum_{i=1}^{n} b_i^2} = \sqrt{b_1^2 + b_2^2 + \dots + b_n^2} \quad (3.28.)$$

The following properties are applicable to the calculated norms of vectors:

 ‖𝑏̅‖ ≥ 0 and 0 is possible if and only if vector is the null vector
 ‖𝛼𝑏̅‖ = |𝛼|‖𝑏̅‖ where 𝛼 is a scalar multiplier
 ‖𝑏̅ + 𝑐̅‖ ≤ ‖𝑏̅‖ + ‖𝑐̅‖ for any two vectors
Three norms of a matrix can be defined in a similar manner: the 1-norm is the column-wise equivalent of the vector operation, the infinity norm is the row-wise equivalent, and the 2-norm is calculated over all elements of the n × n square matrix $\bar{\bar{A}} = [a_{ij}]$:
 The $l_1$ norm, $\|\bar{\bar{A}}\|_1$, is the largest of the column sums of absolute values:
$$\|\bar{\bar{A}}\|_1 = \max_{1 \le j \le n} \left\{ \sum_{i=1}^{n} |a_{ij}| \right\} \quad (3.29.)$$
 The $l_\infty$ norm, $\|\bar{\bar{A}}\|_\infty$, is the largest of the row sums of absolute values:
$$\|\bar{\bar{A}}\|_\infty = \max_{1 \le i \le n} \left\{ \sum_{j=1}^{n} |a_{ij}| \right\} \quad (3.30.)$$
 The $l_2$ norm or Euclidean norm (in this element-wise form also called the Frobenius norm), $\|\bar{\bar{A}}\|_2$, is the square root of the sum of squares of all matrix elements:
$$\|\bar{\bar{A}}\|_2 = \sqrt{\sum_{j=1}^{n} \sum_{i=1}^{n} a_{ij}^2} \quad (3.31.)$$
The matrix norms have properties similar to those of the vector norms:
 ‖𝐴̿‖ ≥ 0 and 0 is possible if and only if matrix is the null matrix
 ‖𝛼𝐴̿‖ = |𝛼|‖𝐴̿‖ where 𝛼 is a scalar multiplier
 ‖𝐴̿ + 𝐵̿‖ ≤ ‖𝐴̿‖ + ‖𝐵̿‖ for any two n x n matrices
 ‖𝐴̿𝐵̿‖ ≤ ‖𝐴̿‖‖𝐵̿‖ for any two n x n matrices

MATLAB and similar software packages have built-in functions to calculate the various norms of vectors and matrices. In MATLAB, the norm(v,p) function can be used to calculate the various norms, where p can be 1, 2 or Inf, representing the three types of norms shown here.
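For instance, a quick check of the vector norms of a small vector:

b = [3; -4; 5];
n1   = norm(b, 1)     % l1 norm:         |3| + |-4| + |5|       = 12
n2   = norm(b, 2)     % l2 norm:         sqrt(9 + 16 + 25)      = 7.0711
nInf = norm(b, Inf)   % l-infinity norm: max(|3|, |4|, |5|)     = 5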
In this section these norms will be used to track the convergence of the
iterative solutions and stop them when sufficient convergence is achieved.

3.2.3.2 The Jacobi method
The Jacobi method first decomposes the coefficient matrix $\bar{\bar{A}}$ into diagonal, strictly lower-triangular and strictly upper-triangular matrices $\bar{\bar{D}}$, $\bar{\bar{L}}$ and $\bar{\bar{U}}$ in the following manner:
$$\bar{\bar{D}} = \begin{bmatrix} a_{11} & 0 & \dots & 0 \\ 0 & a_{22} & \dots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \dots & a_{nn} \end{bmatrix} \quad (3.32.)$$
$$\bar{\bar{L}} = \begin{bmatrix} 0 & 0 & \dots & 0 \\ a_{21} & 0 & \dots & 0 \\ \vdots & & \ddots & \vdots \\ a_{n1} & a_{n2} & \dots & 0 \end{bmatrix} \quad (3.33.)$$
$$\bar{\bar{U}} = \begin{bmatrix} 0 & a_{12} & \dots & a_{1n} \\ 0 & 0 & \dots & a_{2n} \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \dots & 0 \end{bmatrix} \quad (3.34.)$$
This way the $\bar{\bar{Q}}$ and $\bar{\bar{P}}$ matrices can be expressed from $\bar{\bar{A}} = \bar{\bar{D}} + \bar{\bar{L}} + \bar{\bar{U}}$, where the matrices used for the iteration are $\bar{\bar{Q}} = \bar{\bar{D}}$ and $\bar{\bar{P}} = -[\bar{\bar{L}} + \bar{\bar{U}}]$, which gives the specific form of the Jacobi iteration:
$$\bar{x}^{\,k+1} = \bar{\bar{D}}^{-1}\bar{b} + \bar{\bar{D}}^{-1}\left(-(\bar{\bar{L}} + \bar{\bar{U}})\right)\bar{x}^{\,k}, \qquad k = 0,1,2,\dots \quad (3.35.)$$

Example

The following MATLAB script implements the Jacobi iteration. This particular implementation tracks convergence through the residual norm ||A·x_k − b||, i.e. how far the current vector is from exactly satisfying the system, as opposed to tracking the incremental change of the solution vector between iterations. (Directly computing the true solution for comparison would be far too expensive for a large problem, such as a very large CFD model.) Note that convergence of the iteration strongly depends on the structure of the coefficient matrix, but this is not discussed here any further.
Also note that the script serves demonstration purposes, so some features are sub-optimal, such as unnecessary matrix additions or not pre-allocating the size of the solution matrix, but these have been left in for better readability.
% Script C2_System_Jacobi.m
% Script to solve linear system of equations using the
% Jacobi method

%Coefficient matrices:
A = [3 2 -4;2 4 -5;-2 3 2]
b = [3;-4;5]
% Try random matrices. Note the high values in the
% diagonal aid convergence

% A = rand(4,4)+diag([1,2,3,4]);
% b = rand(4,1);

% Tolerance
eps = 1e-3;
% Maximum limit on iterations
iMax = 100;
% Counter of current iteration number
iCounter = 1;

% Define an initial estimate of the solution


x0 = zeros(length(A),1);

% Create the D, L and U matrices:


D = diag(diag(A));
% At matrix is the remains of A when the diagonal is
% taken off
At = A-D;
L = tril(At);
U = triu(At);

% x matrix will contain all the solution vectors


% throughout the iteration
x = [x0];
% Norm to check
cNorm = norm(A*x(:,1)-b);

% The method will run until the error norm is below the
% accepted tolerance or the
% number of iterations exceed the set limit
while ~(cNorm < eps) && iCounter < iMax
xNew = inv(D)*(b-(L+U)*x(:,iCounter));
x = [x xNew];
iCounter = iCounter + 1;
cNorm = norm(A*x(:,iCounter)-b);
end

A
b
% Result of the iteration
result = x(:,end)
% Compare to direct solution result:
directResult = linsolve(A,b)

3.2.3.3 Gauss-Seidel method
The Gauss-Seidel method can be regarded as an improved version of the Jacobi method. It speeds up the solution because it uses more information about the solution than the Jacobi method does. The Jacobi method relies only on the information from the kth iteration, but the Gauss-Seidel method also uses the components of the solution vector already computed in the (k+1)th iteration; and since the solution is assumed to be converging, this gives a better estimate and thus a faster solution.
In the Gauss-Seidel iteration the matrices are defined as follows:
$$\bar{\bar{Q}} = \bar{\bar{D}} + \bar{\bar{L}} \quad \text{and} \quad \bar{\bar{P}} = -\bar{\bar{U}} \quad (3.36.)$$
Using these matrices, the specific iteration form can be written as:
$$\bar{x}^{\,k+1} = \left[\bar{\bar{D}} + \bar{\bar{L}}\right]^{-1}\bar{b} - \left[\bar{\bar{D}} + \bar{\bar{L}}\right]^{-1}\bar{\bar{U}}\,\bar{x}^{\,k}, \qquad k = 0,1,2,\dots \quad (3.37.)$$

Example

The method is implemented in MATLAB using the following script.


% Script C2_System_GaussSeidel.m
% Script to solve linear system of equations using the
% Gauss-Seidel method

% Coefficient matrices:
A = [3 2 -4;2 4 -5;-2 3 2]
b = [3;-4;5]

% Try random
% A = rand(4,4)+diag([1,2,3,4]);
% b = rand(4,1);

% Tolerance
eps = 1e-3;
% Maximum limit on iterations
iMax = 200;
% Counter of current iteration number
iCounter = 1;

% Define an initial estimate of the solution


x0 = zeros(length(A),1);

% Create the D, L and U matrices:


D = diag(diag(A));
% At matrix is the remains of A when the diagonal is

% taken off
At = A-D;
L = tril(At);
U = triu(At);

% x matrix will contain all the solution vectors


% throughout the iteration
x = [x0];
% Norm to check
cNorm = norm(A*x(:,1)-b);

% The method will run until the error norm is below the
% accepted tolerance or the number of iterations exceed
% the set limit
while ~(cNorm < eps) && iCounter < iMax
xNew = inv(D+L)*(b-U*x(:,iCounter));
x = [x xNew];
iCounter = iCounter + 1;
cNorm = norm(A*x(:,iCounter)-b);
end

% Result of the iteration


result = x(:,end)
% Compare to direct solution result:
directResult = linsolve(A,b)

For suitably structured problems and good initial estimates the Gauss-Seidel method converges in fewer iterations than the Jacobi method, for example 8 vs 13 iterations respectively for the system above. With a worse starting point and a badly structured matrix, however, the iteration count of the Gauss-Seidel method can end up more than 100 iterations higher than that of the Jacobi method. The convergence of these methods is not investigated further here, but the reader is encouraged to consult the appropriate literature.

3.2.4 Systems of nonlinear equations

So far, the chapter has only discussed the solution of linear system of
equations. Linear systems offer great simplification, as due to the rules of
linearity, the actual equations can be simplified into a matrix algebra problem,
for which robust solutions exist.

51
In the case of a system of non-linear equations such simplifications cannot be made, and solving them is therefore more difficult, both in terms of computational cost and from an algorithmic point of view.
This chapter illustrates how Newton's method, demonstrated earlier for single-variable functions, is applied to a more general case. First a system of 2 equations is solved, then the extension of the method to a higher number of variables is discussed.
Let’s define the two variables as 𝑥 and 𝑦 and the two functions in the
following format:
𝑓1 (𝑥, 𝑦) = 0 (3.38.)

𝑓2 (𝑥, 𝑦) = 0 (3.39.)
Similarly to the single-variable solution, the concept of the Newton iteration relies on the Taylor-series expansion of the functions around a given point (x₁, y₁) and an assumed point (x₂, y₂), which is the solution of the system (if such a solution even exists!):
$$f_1(x_2, y_2) = f_1(x_1, y_1) + \left.\frac{\partial f_1}{\partial x}\right|_{(x_1,y_1)} (x_2 - x_1) + \left.\frac{\partial f_1}{\partial y}\right|_{(x_1,y_1)} (y_2 - y_1) + \dots \quad (3.40.)$$
$$f_2(x_2, y_2) = f_2(x_1, y_1) + \left.\frac{\partial f_2}{\partial x}\right|_{(x_1,y_1)} (x_2 - x_1) + \left.\frac{\partial f_2}{\partial y}\right|_{(x_1,y_1)} (y_2 - y_1) + \dots \quad (3.41.)$$
Since the left-hand sides are zero at the solution, these equations can be rewritten as:
$$\left.\frac{\partial f_1}{\partial x}\right|_{(x_1,y_1)} \Delta x + \left.\frac{\partial f_1}{\partial y}\right|_{(x_1,y_1)} \Delta y = 0 - f_1(x_1, y_1) \quad (3.42.)$$
$$\left.\frac{\partial f_2}{\partial x}\right|_{(x_1,y_1)} \Delta x + \left.\frac{\partial f_2}{\partial y}\right|_{(x_1,y_1)} \Delta y = 0 - f_2(x_1, y_1) \quad (3.43.)$$
Which can be written in matrix notation:
$$\begin{bmatrix} \dfrac{\partial f_1}{\partial x} & \dfrac{\partial f_1}{\partial y} \\[4pt] \dfrac{\partial f_2}{\partial x} & \dfrac{\partial f_2}{\partial y} \end{bmatrix}_{(x_1,y_1)} \begin{bmatrix} \Delta x \\ \Delta y \end{bmatrix} = \begin{bmatrix} -f_1 \\ -f_2 \end{bmatrix}_{(x_1,y_1)} \quad (3.44.)$$

Essentially what we have achieved with this, is that the original non-linear
system of equation was linearized in the vicinity of the solution, by taking only
the first order term from the Taylor series. As we have seen during the
discussion on linear systems, the existence of a solution depends on the singularity of the coefficient matrix. This coefficient matrix is referred to as the Jacobian of the non-linear system, and is written as:
$$J(f_1, f_2)_{(x_1,y_1)} = \begin{bmatrix} \dfrac{\partial f_1}{\partial x} & \dfrac{\partial f_1}{\partial y} \\[4pt] \dfrac{\partial f_2}{\partial x} & \dfrac{\partial f_2}{\partial y} \end{bmatrix}_{(x_1,y_1)} \quad (3.45.)$$
The singularity of the system can easily be checked by calculating the determinant of the Jacobian. If it is non-zero, the matrix can be inverted and the result found. A singular Jacobian is analogous to a point with zero slope on the function in the single-variable case.
The results of this system of linear equations, Δx and Δy, denote the steps we need to take in order to converge towards the solution. As shown in the single-variable case, the Newton method can converge very quickly when started near the solution; however, it can also oscillate widely and even diverge away from the solution.

Example

The implementation of the Newton method for solving nonlinear system of


equations is shown as a MATLAB script below:
% Script C2_System_Nonlinear_Newton.m
% Script to solve a non-linear system of equations using
% the Newton method

% Initial point:
x0 = 2.0;
y0 = -4.0;
% Step size used to calculate the local differentials
xStep = 1e-5;
yStep = 1e-5;
% Tolerance
eps = 1e-3;
% Maximum limit on iterations
iMax = 1e2;
% Counter of current iteration number
iCounter = 1;

% Initial point
xi = x0;
yi = y0;
% Initial evaluation of the functions
f1xiyi = f1_at_xy(xi,yi);
f2xiyi = f2_at_xy(xi,yi);

diagMatrix = [xi yi f1xiyi f2xiyi];

% The method will run while both of the following hold:
% the larger of the two function value magnitudes is not below the
% tolerance, and the number of iterations is below the limit
while ~(max(abs(f1xiyi),abs(f2xiyi)) < eps) && iCounter < iMax
    % Generate the Jacobian matrix numerically, using central differences
    % Note there are many ways to evaluate the differentials!
    dF1dx = (f1_at_xy(xi+xStep,yi)-f1_at_xy(xi-xStep,yi))/(2.0*xStep);
    dF2dx = (f2_at_xy(xi+xStep,yi)-f2_at_xy(xi-xStep,yi))/(2.0*xStep);
    dF1dy = (f1_at_xy(xi,yi+yStep)-f1_at_xy(xi,yi-yStep))/(2.0*yStep);
    dF2dy = (f2_at_xy(xi,yi+yStep)-f2_at_xy(xi,yi-yStep))/(2.0*yStep);
    fValueVector = [-f1xiyi; -f2xiyi];
    mJacobian = [dF1dx dF1dy; dF2dx dF2dy];
    % Solve the linearized system J*[dx;dy] = -f for the step
    deltaXandY = mJacobian\fValueVector;
    % Work out the better approximation of the root
    xi = xi + deltaXandY(1);
    yi = yi + deltaXandY(2);
    % Update the function values and the iteration counter
    f1xiyi = f1_at_xy(xi,yi);
    f2xiyi = f2_at_xy(xi,yi);
    iCounter = iCounter+1;
    diagMatrix = [diagMatrix; xi yi f1xiyi f2xiyi];
end
% Solution of the method
xi
yi
% Function value at the solution
f1_at_xy(xi,yi)
f2_at_xy(xi,yi)

% Functions to be evaluated (local functions, placed at the end of the
% script as required by MATLAB)
function fValue = f1_at_xy(x,y)
fValue = -6.1*x^3 + 2.3*y^2 - 11.25;
end

function fValue = f2_at_xy(x,y)
fValue = x^2 + 3.25*y^3 - 4.11;
end

Figure 10 shows the iteration steps the Newton method took to solve the
provided nonlinear system of equations. It can be seen that although the
procedure started “near” the solution, the method went quite far from this
neighbourhood, and in this particular case, returned and provided a solution.
This particular run took 40 iterations to converge within 1E-4 tolerance on the
solution. All the benefits and drawbacks discussed in the single variable case
are still valid for the systems of equations.
The example above was shown for a two-variable system of nonlinear equations. Following the same logic, the method can be extended to any number of variables: using the Taylor-series approximation, each non-linear equation is differentiated with respect to each variable and the Jacobian matrix is formed. Then, provided the Jacobian is non-singular, the step vector required to improve the estimate of the solution can be calculated in the usual way.
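A minimal MATLAB sketch of this generalization is given below. The function name NewtonSystem, the handle fVec and the central-difference Jacobian are assumptions made for illustration; this is not part of the script above.

% Sketch (assumption): Newton iteration for a system of n nonlinear
% equations. fVec must accept and return n-by-1 column vectors.
function x = NewtonSystem(fVec, x0, tol, iMax)
h = 1e-5;                        % step for the central-difference Jacobian
x = x0(:);
n = length(x);
for iter = 1:iMax
    f = fVec(x);
    if max(abs(f)) < tol
        break;                   % all residuals below tolerance: converged
    end
    J = zeros(n);
    for j = 1:n                  % build the Jacobian column by column
        e = zeros(n,1); e(j) = h;
        J(:,j) = (fVec(x+e) - fVec(x-e)) / (2*h);
    end
    x = x + J\(-f);              % solve J*delta = -f and take the step
end
end

For the two-variable example above this could be called, for instance, as x = NewtonSystem(@(v) [-6.1*v(1)^3+2.3*v(2)^2-11.25; v(1)^2+3.25*v(2)^3-4.11], [2; -4], 1e-3, 100).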

Figure 10: The Newton iteration steps solving the nonlinear systems of
equations

4 Curve Fitting Methods
In real life engineering problems, in most cases some kind of input data has to
be processed, analysed and evaluated, based on which the given system, tool,
unit or model behaviour can be concluded. These input data are mostly
generated during some kind of experiment: some kind of physical quantity
has to be converted to a numerical value in some sort of way. For example:
we can measure the flow velocity, the pressure and the temperature of a fluid, and knowing the material properties and the flow domain the mass flow rate can be calculated. Similarly, measuring the voltage and current of an electrical circuit gives the circuit's electrical resistance. Or the expansion/compression of a spring indicates the force applied to it, if the characteristics of the spring are known. From these examples we can conclude two things: it is really rare that the actually needed physical quantity can be measured directly; and because the experiments are designed and executed mostly by humans, there is generally a measurement error/scattering in the data (you know: one measurement is not measurement).
Without a doubt, the best way to start evaluating the input data is to plot them in a suitably chosen diagram. In this way their behaviour can be observed and their characteristics identified: whether they are linear, quadratic or exponential, or perhaps follow a normal distribution or are completely stochastic. Of course, their classification should be based on sound analysis, so the scope of this chapter is to introduce the curve fitting methods applied to the initial input data and the mathematical theory behind them. The fitted curves can be linear or polynomial, since these are the most widely used characteristics in engineering.
The most important message of the chapter is the following: we can either fit curves exactly onto the input points or only approximate them, so there are two main methods:
- regression
- interpolation.

4.1 Regression
Regression was first worked out by Francis Galton, and the word 'regression' itself means: return to the average. Essentially, regression is nothing else than inspecting how the result depends on the change of the explanatory variable. Many examples can be given, like how the traffic of an airport depends on the GDP per capita, or how the torque of a motor depends on the engine speed, or how the chance of getting a grade better than 2 depends on studying more in the Numerical Methods class. Indeed, these are just illustrative examples: in real life the traffic of an airport is also influenced by the location, the tourism, …, in the case of the motor the quality of the fuel or the load also influences the torque, …, but the example of studying, well, that is actually a single variable problem.
The point is that in regression analysis we want to find the connection between two or more variables and try to model the behaviour. There are two basic regression types: linear and nonlinear.
It is essential to understand that the regression process is only an approximation: we have a set of points and we would like to fit a curve close to them, but not exactly onto them. Therefore, the fitted curve does not have to pass through the input points. If we want to find a curve that actually passes through every single input point, we are talking about interpolation. The basic difference between regression and interpolation can be seen in Figure 11.


Figure 11: The main difference between regression and interpolation

In general, we can state that regression should be used if a large number of initial points is given, or if there is scattering between them (for example due to an inaccurate experiment). Regression should also be used if the characteristics of the input data are well known (like the motor torque).

4.1.1 Linear Regression
The simplest form of regression is linear regression, when a straight line is fitted to the input points. This is a frequent problem in engineering, because linear models are widely used in every field. The goal of linear regression is to find the equation of a curve of the following form:
𝑦 = 𝑎1 𝑥 + 𝑎0 (4.1.)

which best fits the n given points (x1, y1), …, (xn, yn). As a first step, it is recommended to plot the data points in a Cartesian coordinate system, so we can observe whether they indeed show linear characteristics. For example, if it can easily be seen that the data points have cubic characteristics, we can still fit a linear curve to them, but the result will probably be useless.

Figure 12: Linear set of points with initial errors [3]

Our goal is to determine the a1 and a0 coefficients so that the distance (error) between the fitted curve and the initial points is minimal (see Figure 12). This error is considered as an absolute error at every x_i point, as follows:
$$e_i = y_i - (a_1 x_i + a_0) \quad (4.2.)$$
These error values have to be calculated over the whole interval in order to find the optimal a1 and a0 values.
Nevertheless, the next question arises: how can we decide which solution is the best? We cannot simply sum the errors, since the negative and positive values would cancel each other, resulting in an inaccurate solution. To eliminate this problem, we can take either the absolute value of each error or its square, and then carry out the summation. In both cases we do not have to be afraid of the negative signs, since the result is positive in every case. However, if we only take the absolute values of the errors, we will not get a unique solution by minimising the following quantity:
$$E = \sum_{i=1}^{n} |e_i| = \sum_{i=1}^{n} \left| y_i - (a_1 x_i + a_0) \right| \quad (4.3.)$$
As the following example shows: if we have 4 input points and fit two different straight lines to them with the same total error, we get two equally acceptable results, illustrated by the dashed and continuous lines (see Figure 13).

Figure 13: Two linear fittings with the same total error (absolute values are
summed)

To get rid of this effect, the least-squares method was invented, which has become the most widely used regression method. In this case the sum of the squared errors is calculated, as follows:
$$E = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left[ y_i - (a_1 x_i + a_0) \right]^2 \quad (4.4.)$$
The least-squares method combines the advantages of the former approaches:

- error values with different signs do not cancel each other,
- the total error is always positive,
- there is always a unique solution for the a1 and a0 coefficients,
- small errors get a smaller weight in the summation than large errors, due to the squaring.

The solution of the least-squares method:
The aim is still to find a linear curve with the equation y = a1·x + a0 that best fits the input points (x1, y1), …, (xn, yn), and at the same time minimises the total error calculated in equation (4.4.). The total error E depends on a1 and a0 quadratically, so it has a minimum, located where the partial derivatives with respect to a1 and a0 are both zero:
$$\frac{\partial E}{\partial a_1} = -2\sum_{i=1}^{n} x_i\left[y_i - (a_1 x_i + a_0)\right] = 0 \;\;\Rightarrow\;\; \sum_{i=1}^{n} x_i\left[y_i - (a_1 x_i + a_0)\right] = 0$$
$$\frac{\partial E}{\partial a_0} = -2\sum_{i=1}^{n} \left[y_i - (a_1 x_i + a_0)\right] = 0 \;\;\Rightarrow\;\; \sum_{i=1}^{n} \left[y_i - (a_1 x_i + a_0)\right] = 0 \quad (4.5.)$$
By expanding and rearranging these equations, we can express two equations which determine the values of a1 and a0:
$$\left(\sum_{i=1}^{n} x_i\right) a_0 + \left(\sum_{i=1}^{n} x_i^2\right) a_1 = \sum_{i=1}^{n} x_i y_i$$
$$n\,a_0 + \left(\sum_{i=1}^{n} x_i\right) a_1 = \sum_{i=1}^{n} y_i \quad (4.6.)$$

Applying Cramer's rule, the solution will be:
$$a_0 = \frac{\left(\sum_{i=1}^{n} x_i^2\right)\left(\sum_{i=1}^{n} y_i\right) - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} x_i y_i\right)}{n\left(\sum_{i=1}^{n} x_i^2\right) - \left(\sum_{i=1}^{n} x_i\right)^2}$$
$$a_1 = \frac{n\left(\sum_{i=1}^{n} x_i y_i\right) - \left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n\left(\sum_{i=1}^{n} x_i^2\right) - \left(\sum_{i=1}^{n} x_i\right)^2} \quad (4.7.)$$
Although equations (4.7.) look terrible, if we use MATLAB to solve them the problem is not so tragic anymore.

Example [3]

Let’s see an example for linear regression in MATLAB:

function [a1 a0] = LinearReg(x,y)


% LinearReg uses linear least-squares approximation
% to find a linear curve in the form: y = a1*x + a0
% x, y are n-dimensional row or column vectors of
% data, a1 and a0 are the coefficients that we are
% looking for
n = length(x);
% first calculate the necessary summations
Sumx = sum(x); Sumy = sum(y); Sumxx = sum(x.*x);
Sumxy = sum(x.*y);
den = n*Sumxx - Sumx^2;
% then calculate the coefficients
a1 = (n*Sumxy - Sumx*Sumy)/den;
a0 = (Sumxx*Sumy - Sumxy*Sumx)/den;
% Plot the data and the line fit
l = zeros(n,1);
for i = 1:n,
l(i) = a1*x(i)+ a0;
% Calculating n points on the line
end
plot(x,y,'o')
hold on
plot(x,l)
xlabel('x'); ylabel('y'); title('Linear regression')
fprintf('The equation of the linear: y = %dx +
%d.',a1,a0)
end

Test the function with the following input data:


xi yi
0 7.6
0.2 8.4
0.4 8.2
0.6 8.6
0.8 8.5
1.0 8.6
1.2 8.9

We can run our function in the Command Window:
>> x = 0:0.2:1.2;
>> y = [7.6 8.4 8.2 8.6 8.5 8.6 8.9];
>> [a1 a0] = LinearReg(x,y)
The equation of the linear: y = 8.214286e-01x + 7.907143e+00.

The plot can be seen in the next figure:

4.1.2 Linearization of Non-linear Input Data


If the connection between the initial points is non-linear, there are two options available to solve the problem: polynomial regression can be carried out (see the following chapter), or the non-linear data can be linearized and then linear regression applied. There are three function types commonly used for linearization: the exponential function, the power function and the saturation function. In this subchapter we go through these three techniques.

4.1.2.1 Exponential Function


The exponential function can be written in the following form:
$$y = a e^{bx} \quad (4.8.)$$
where 𝑎 and 𝑏 are constants.
Since differentiating the exponential function gives the same exponential function multiplied by a constant, this technique is suitable for describing quantities where the rate of change of the quantity and the quantity itself are directly proportional, like radioactive decay, or the flame speed described by the Vibe combustion model.

Figure 14: Three linearization techniques for the three function types: a,d – exponential; b,e – power; c,f – saturation [3]

In order to linearize the exponential function, only the natural logarithm of equation (4.8.) has to be taken:
$$\ln y = bx + \ln a \quad (4.9.)$$
Thus, plotting ln y as a function of x gives a line with intercept ln a and slope b, as diagrams (a) and (d) of Figure 14 show.
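A minimal MATLAB sketch of this technique, reusing the LinearReg function defined earlier (the data values below are made up purely for illustration):

% Made-up, roughly exponential measurement data (assumption, for demo only)
x = 0:0.5:3;
y = [1.9 3.2 5.1 8.4 13.8 22.5 36.9];    % behaves roughly like y = 2*exp(x)

% Linearize: fit a straight line to (x, ln y)
[b, lnA] = LinearReg(x, log(y));         % slope is b, intercept is ln(a)
a = exp(lnA);
fprintf('Fitted exponential: y = %g*exp(%g*x)\n', a, b)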

4.1.2.2 Power Function


The second function type is the power function:
$$y = a x^{b} \quad (4.10.)$$
where a and b are constants. For the linearization the common (base-10) logarithm is taken, which gives:
$$\log y = b \log x + \log a \quad (4.11.)$$
That is, log y has to be plotted as a function of log x to get a line with slope b and intercept log a, as diagrams (b) and (e) of Figure 14 illustrate.

4.1.2.3 Saturation Function


The saturation function can be written as:
$$y = \frac{x}{a x + b} \quad (4.12.)$$
where a and b are constants.
In order to linearize it, we take the reciprocal of the function:
$$\frac{1}{y} = b\left(\frac{1}{x}\right) + a \quad (4.13.)$$
So 1/y is plotted as a function of 1/x, giving a line with slope b and intercept a, as demonstrated by diagrams (c) and (f) of Figure 14.
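The same idea works for the saturation function; a short sketch with made-up data:

% Made-up saturation-type data (assumption, for demo only),
% generated roughly from y = x/(0.5*x + 1)
x = [0.5 1 2 3 4 6 8];
y = [0.40 0.66 1.00 1.20 1.33 1.50 1.60];

% Linearize: fit a straight line to (1/x, 1/y)
[b, a] = LinearReg(1./x, 1./y);          % slope is b, intercept is a
fprintf('Fitted saturation function: y = x/(%g*x + %g)\n', a, b)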

4.1.3 Polynomial Regression


Indeed, there are numerous situations when linear regression is simply not a suitable solution, especially if the input data have parabolic, cubic, etc. characteristics. In these cases it is recommended to apply polynomial regression. If the approach of chapter 4.1.1 is generalized to n points (x1, y1), …, (xn, yn), to which we want to fit an m-th degree polynomial, then the expression we are looking for is:
𝑦 = 𝑎𝑚 𝑥 𝑚 + 𝑎𝑚−1 𝑥 𝑚−1 + ⋯ + 𝑎2 𝑥 2 + 𝑎1 𝑥 + 𝑎0 (4.14.)
The total error can be written as:
$$E = \sum_{i=1}^{n}\left[y_i - \left(a_m x_i^m + a_{m-1} x_i^{m-1} + \dots + a_2 x_i^2 + a_1 x_i + a_0\right)\right]^2 \quad (4.15.)$$
The coefficients (a_m, …, a_1, a_0) can be determined similarly to the linear case: E has to be differentiated with respect to each coefficient, and the derivatives set equal to zero. This yields the following system of equations:
$$\frac{\partial E}{\partial a_0} = -2\sum_{i=1}^{n}\left[y_i - \left(a_m x_i^m + a_{m-1} x_i^{m-1} + \dots + a_1 x_i + a_0\right)\right] = 0$$
$$\frac{\partial E}{\partial a_1} = -2\sum_{i=1}^{n} x_i\left[y_i - \left(a_m x_i^m + a_{m-1} x_i^{m-1} + \dots + a_1 x_i + a_0\right)\right] = 0$$
$$\frac{\partial E}{\partial a_2} = -2\sum_{i=1}^{n} x_i^2\left[y_i - \left(a_m x_i^m + a_{m-1} x_i^{m-1} + \dots + a_1 x_i + a_0\right)\right] = 0 \quad (4.16.)$$
$$\vdots$$
$$\frac{\partial E}{\partial a_m} = -2\sum_{i=1}^{n} x_i^m\left[y_i - \left(a_m x_i^m + a_{m-1} x_i^{m-1} + \dots + a_1 x_i + a_0\right)\right] = 0$$

This system can be rearranged into m + 1 linear equations, from which the coefficients a_m, …, a_2, a_1, a_0 can be calculated:
$$n\,a_0 + \left(\sum_{i=1}^{n} x_i\right) a_1 + \left(\sum_{i=1}^{n} x_i^2\right) a_2 + \dots + \left(\sum_{i=1}^{n} x_i^m\right) a_m = \sum_{i=1}^{n} y_i$$
$$\left(\sum x_i\right) a_0 + \left(\sum x_i^2\right) a_1 + \left(\sum x_i^3\right) a_2 + \dots + \left(\sum x_i^{m+1}\right) a_m = \sum x_i y_i$$
$$\left(\sum x_i^2\right) a_0 + \left(\sum x_i^3\right) a_1 + \left(\sum x_i^4\right) a_2 + \dots + \left(\sum x_i^{m+2}\right) a_m = \sum x_i^2 y_i \quad (4.17.)$$
$$\vdots$$
$$\left(\sum x_i^m\right) a_0 + \left(\sum x_i^{m+1}\right) a_1 + \left(\sum x_i^{m+2}\right) a_2 + \dots + \left(\sum x_i^{2m}\right) a_m = \sum x_i^m y_i$$
(all sums run over i = 1, …, n)

Of course, in practice it is really rare that m is higher than 2 or 3, but theoretically there is no upper limit.

Example [3]

Let’s define a MATLAB function for quadratic regression.


function [a2 a1 a0] = QuadraticRegression(x,y)
%
% QuadraticRegression uses quadratic least-squares
% approximation to fit a data by a 2nd-degree
% polynomial in the form y = a2*x^2 + a1*x + a0.
%
% [a2 a1 a0] = QuadraticRegression(x,y) where
% x, y are n-dimensional row or column vectors of
% data, a2, a1 and a0 are the coefficients that

% describe the quadratic fit.
n = length(x);
Sumx = sum(x); Sumy = sum(y);
Sumx2 = sum(x.^2); Sumx3 = sum(x.^3); Sumx4 =
sum(x.^4);
Sumxy = sum(x.*y); Sumx2y = sum(x.*x.*y);
% Form the coefficient matrix and the vector of
% right-hand sides
A =[n Sumx Sumx2;Sumx Sumx2 Sumx3;Sumx2 Sumx3 Sumx4];
b =[Sumy;Sumxy;Sumx2y];
w = A\b; % Solve
a2 = w(3); a1 = w(2); a0 = w(1);
% Plot the data and the quadratic fit
xx = linspace(x(1),x(end));
% Generate 100 points for plotting purposes
p = zeros(100,1); % Pre-allocate
for i = 1:100
p(i)= a2*xx(i)^2 + a1*xx(i)+ a0;
% Calculate 100 points
end
plot(x,y,'o')
hold on
plot(xx,p)
xlabel('x'); ylabel('y'); title('Quadratic
Regression')
fprintf('The equation of the parabolic: y = %dx^2 +
%dx + %d.',a2,a1,a0)

Test the function with the following points:


>> x2 = 0:0.4:1.6;
>> y2 = [2.8 3.05 3.42 4.7 6.6];
>> [a2 a1 a0] = QuadraticRegression(x2,y2)
The equation of the parabolic: y = 1.879464e+00x^2 + -6.946429e-01x +
2.865429e+00.

The result plot looks like the following:

Example [3]

Define a cubic regression function in MATLAB:


function [a3 a2 a1 a0]= CubicRegression(x,y)
%
% CubicRegression uses cubic least-squares
% approximation to fit a data by a 3rd-degree
% polynomial y = a3*x^3 + a2*x^2 + a1*x + a0.
%
% [a3 a2 a1 a0]= CubicRegression(x,y) where x, y are
% n-dimensional row or column vectors of data,
%
% a3, a2, a1 and a0 are the coefficients that
% describe the cubic fit.
%
n = length(x);
Sumx = sum(x); Sumy = sum(y);
Sumx2 = sum(x.^2); Sumx3 = sum(x.^3); Sumx4 =
sum(x.^4);
Sumx5 = sum(x.^5); Sumx6 = sum(x.^6);
Sumxy = sum(x.*y); Sumx2y = sum(y.*x.^2);
Sumx3y = sum(y.*x.^3);
% Defining the usual vectors and matrices
A =[n Sumx Sumx2 Sumx3; Sumx Sumx2 Sumx3 Sumx4;
Sumx2 Sumx3 Sumx4 Sumx5; Sumx3 Sumx4 Sumx5 Sumx6];
b =[Sumy;Sumxy;Sumx2y;Sumx3y];
w = A\b; % Solution

a3 = w(4); a2 = w(3); a1 = w(2); a0 = w(1);
% Plotting
xx = linspace(x(1),x(end));
p = zeros(100,1);
for i = 1:100,
p(i)= a3*xx(i)^3 + a2*xx(i)^2 + a1*xx(i) + a0;
end
plot(xx,p,x,y,'o')
hold on
xlabel('x'); ylabel('y'); title('Cubic Regression')
fprintf('The equation of the polynomial is: y = %dx^3 + %dx^2 + %dx + %d.',a3,a2,a1,a0)
end

Let's see how large the difference between the two methods is for the following points:
>> x3 = 0:0.4:1.6;
>> y3 = [2.95 3.2 3.42 4.8 6.91];

By running both functions, the following equations are the results:


The equation of the polynomial is: y = 9.895833e-01x^3 + -1.964286e-01x^2 +
2.559524e-01x + 2.973143e+00.
The equation of the polynomial is: y = 2.178571e+00x^2 + -1.105714e+00x +
3.049143e+00.

Built-in MATLAB functions

Indeed, MATLAB has its own built-in functions that make our problems easier. The two most commonly used ones are:

POLYFIT – fit a polynomial to data

“P = POLYFIT(X,Y,N) finds the coefficients of a polynomial P(X) of degree N


that fits the data Y best in a least-squares sense. P is a row vector of length
N+1 containing the polynomial coefficients in descending powers, P(1)* X^N +
P(2)*X^(N−1) + ⋯ + P(N)*X + P(N+1).”

POLYVAL – evaluate a polynomial value

“y = POLYVAL(p,x) returns the value of a polynomial of degree n evaluated at


x. The input argument p is a vector of length n+1 whose elements are the
coefficients in descending powers of the polynomial to be evaluated.”

Example: MATLAB built-in functions [3]


We type in some input data and then use the built-in function and the already
defined CubicRegression function to see the differences:
>> x = 0:0.3:1.2;
>> y = [3.5 4.9 6.1 7.2 11.2];
>> P = polyfit(x,y,3)
P=

9.5679 -13.1746 8.4484 3.4586

Then use our neat CubicRegression function:


>> xi = linspace(0,1.2); % Generating 100 points for plotting
>> yi = polyval(P,xi); % Evaluate the polynomial at these points
>> plot(xi,yi)
>> hold on
>> plot(x,y,'o')
>> [a3 a2 a1 a0] = CubicRegression(x,y)
a3 =

9.5679
a2 =
-13.1746
a1 =
8.4484
a0 =
3.4586

This is what we are supposed to get in the diagram:

It can be seen that the coefficients are the same in both cases; also, the curve does not fit the input data perfectly, since there is a small deviation.

4.2 Finite difference


One of the most important aspects of numerical analysis is finite differencing, since curve fitting, data smoothing, numerical differentiation and integration are all based on finite difference calculations. So let's see the theoretical side of the topic.
A continuous function y = f(x) is given, and the values x0, x1, x2, …, xn are available as input points, where x0 ≤ x ≤ xn. The function can be evaluated over the whole interval and takes the following values:

Table 2: The 𝑓(𝑥) function, interpreted in 𝑥0 ≤ 𝑥 ≤ 𝑥𝑛 domain

Let x_i and x_j be two, not necessarily consecutive, values in the x0 ≤ x ≤ xn domain. Then the first difference can be written as:
$$f(x_i, x_j) = \frac{f(x_i) - f(x_j)}{x_i - x_j} \quad (4.18.)$$
Similarly, the second difference:
$$f(x_i, x_j, x_k) = \frac{f(x_i, x_j) - f(x_j, x_k)}{x_i - x_k} \quad (4.19.)$$
The third, fourth, … differences can be evaluated in the same way.
It is practical to summarise the differences in a difference table, where every difference is written in between the two differences it was formed from, as demonstrated in the following table:

Table 3: The difference table for 𝑓(𝑥) function

This can be slightly confusing, so let’s see an example to understand the


problem.

Example

Calculate the difference table until the fourth difference for the following 𝑥
values: 0, 1, 3, 4, 7, 9, if the function is : 𝑦 = 𝑓(𝑥) = 𝑥 3

Solution
Create the table below. The first column contains the x values, the second column the f(x) function values, and from the third to the sixth columns we write the differences. Using the former expression, we can calculate, for instance, the first difference between x = 1 and x = 3:
$$\frac{1 - 27}{1 - 3} = 13$$
Then one of the second differences:
$$\frac{37 - 93}{3 - 7} = 14$$
Likewise, one of the third differences:
$$\frac{4 - 8}{0 - 4} = 1$$
Finally, the fourth difference:
$$\frac{1 - 1}{0 - 7} = 0$$
The difference table looks like the following:

It is interesting to observe, that the third differences are equal, and the fourth
differences are all zero.

Let us see, how we could generalise this method.

Generally, the x values are located at the same distance from each other, i.e. they are equidistant. In this particular case the denominators are equal to each other, so they can be omitted; as a result, the differences can be calculated simply as differences of the function values.
So, if the spacing between the x values is a constant h, then a generic x_k value can be written in the following way:
𝑥𝑘 = 𝑥0 + 𝑘ℎ 𝑤ℎ𝑒𝑟𝑒 𝑘 = ⋯ − 2, −1, 0, 1, 2, … (4.20.)
Based on that, the first differences can be expressed by introducing Δ
difference operator:
∆𝑓𝑘 = 𝑓𝑘+1 − 𝑓𝑘 (4.21.)
Similarly, the second differences are:
∆2 𝑓𝑘 = ∆(∆𝑓𝑘 ) = ∆𝑓𝑘+1 − ∆𝑓𝑘 (4.22.)
By generalising the solution to every positive 𝑛 value:
∆𝑛 𝑓𝑘 = ∆(∆𝑛−1 𝑓𝑘 ) = ∆𝑛−1 𝑓𝑘+1 − ∆𝑛−1 𝑓𝑘 (4.23.)
Of course, the difference operator meets with the rules of the exponents:
∆𝑚 (∆𝑛 𝑓𝑘 ) = ∆𝑚+𝑛 𝑓𝑘 (4.24.)
Based on that, the new difference table can be created as follows:

Table 4: The general form of difference table

Example

The following 𝑥 values are given: 1,2,3,4,5,6,7,8.


The formerly used, 𝑓(𝑥) = 𝑥 3 function is interpreted and we look for the
difference table.

Solution:
Create the difference table as in the former example. It can be seen, that the
3rd differences are equal to each other again, and the 4th differences are still
zeros.
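This behaviour is easy to check in MATLAB: the built-in diff function generates exactly the columns of such an equidistant difference table.

x = 1:8;
f = x.^3;          % function values f(x) = x^3 at equidistant points (h = 1)

d1 = diff(f)       % first differences
d2 = diff(f, 2)    % second differences
d3 = diff(f, 3)    % third differences: all equal (6 = 3! for this cubic)
d4 = diff(f, 4)    % fourth differences: all zero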

The coefficients appearing in the repeated differences are the binomial coefficients, so equation (4.23.) can be expanded into the compact form:
$$\Delta^n f_k = \sum_{j=0}^{n} (-1)^j \binom{n}{j} f_{k+n-j} \quad (4.25.)$$
Written out term by term, the final expression is:
$$\Delta^n f_k = f_{k+n} - n f_{k+n-1} + \frac{n(n-1)}{2!} f_{k+n-2} + \dots + (-1)^{n-1} n f_{k+1} + (-1)^n f_k \quad (4.26.)$$

Using k = 0 and n = 1, 2, 3, 4, the expression is much friendlier:
$$\Delta f_0 = f_1 - f_0$$
$$\Delta^2 f_0 = f_2 - 2f_1 + f_0$$
$$\Delta^3 f_0 = f_3 - 3f_2 + 3f_1 - f_0 \quad (4.27.)$$
$$\Delta^4 f_0 = f_4 - 4f_3 + 6f_2 - 4f_1 + f_0$$

4.2.1 Factorial Polynomials
Factorial polynomials (factorial powers) can be defined as:
$$(x)^{(n)} = x(x-1)(x-2)\dots(x-n+1)$$
$$(x)^{-(n)} = \frac{1}{(x+1)(x+2)\dots(x+n)} \quad (4.28.)$$
Using the Δ difference operator again, their differences follow a power-rule-like form:
$$\Delta (x)^{(n)} = n\,(x)^{(n-1)}$$
$$\Delta (x)^{-(n)} = -n\,(x)^{-(n+1)} \quad (4.29.)$$
Sometimes creating a factorial polynomial is the best way to handle a problem. To create one, we write the polynomial in the following, Maclaurin-series-like form:
$$p_n(x) = a_0 + a_1 (x)^{(1)} + a_2 (x)^{(2)} + \dots + a_n (x)^{(n)} \quad (4.30.)$$
where the a_k coefficients have to be determined. Substituting x = 0 into the expression immediately gives:
$$a_0 = p_n(0) \quad (4.31.)$$
The a_1 coefficient can be obtained by taking the difference of equation (4.30.):
$$\Delta p_n(x) = a_1 + 2a_2 (x)^{(1)} + 3a_3 (x)^{(2)} + \dots + n a_n (x)^{(n-1)} \quad (4.32.)$$
Then, setting x equal to zero, we get:
$$a_1 = \Delta p_n(0) \quad (4.33.)$$
Taking the difference again gives:
$$\Delta^2 p_n(x) = 2 \cdot 1\, a_2 + 3 \cdot 2\, a_3 (x)^{(1)} + \dots + n(n-1)\, a_n (x)^{(n-2)} \quad (4.34.)$$
With x = 0, a_2 becomes:
$$a_2 = \frac{\Delta^2 p_n(0)}{2 \cdot 1} = \frac{\Delta^2 p_n(0)}{2!} \quad (4.35.)$$
The general form:
$$a_j = \frac{\Delta^j p_n(0)}{j!} \qquad \forall\, j = 0, 1, 2, \dots, n \quad (4.36.)$$
Having established the calculation process, we may well ask: what is this good for? The answer is simple: this is one of the easiest ways to generate a difference table.
There are 3 major steps in the method:
1. The polynomial p_n(x) is divided by x, which gives a quotient q_0(x) and a residue r_0; the latter is equal to the a_0 coefficient. Thus:
$$p_n(x) = r_0 + x\,q_0(x) \quad (4.37.)$$
2. Dividing q_0(x) by (x − 1) gives the quotient q_1(x) and the residue r_1:
$$q_0(x) = r_1 + (x-1)\,q_1(x) \quad (4.38.)$$
Substituting this expression into equation (4.37.) and using the original equation (4.30.) we can write:
$$p_n(x) = r_0 + x\left[r_1 + (x-1)q_1(x)\right] = r_0 + r_1 (x)^{(1)} + x(x-1)\,q_1(x) \quad (4.39.)$$
3. Dividing q_1(x) by (x − 2) gives q_2(x) and the residue r_2, the latter now being equal to a_2, so:
$$q_1(x) = r_2 + (x-2)\,q_2(x) \quad (4.40.)$$
Substituting back into the former equation:
$$p_n(x) = r_0 + r_1 (x)^{(1)} + x(x-1)\left[r_2 + (x-2)q_2(x)\right] = r_0 + r_1 (x)^{(1)} + r_2 (x)^{(2)} + x(x-1)(x-2)\,q_2(x) \quad (4.41.)$$
We continue this process, each new quotient being one degree lower than the previous one; the algorithm can be repeated (n + 1) times in total.

This gives us the general form of the factorial polynomial:
$$p_n(x) = r_0 + r_1 (x)^{(1)} + r_2 (x)^{(2)} + \dots + r_{n-1} (x)^{(n-1)} + r_n (x)^{(n)} \quad (4.42.)$$
where
$$r_j = a_j = \frac{\Delta^j p_n(0)}{j!} \qquad \forall\, j = 0, 1, 2, \dots, n \quad (4.43.)$$
or, in simplified form:
$$\Delta^j p_n(0) = j!\, r_j \quad (4.44.)$$
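As an illustration of this division procedure, the short MATLAB sketch below computes the r_j coefficients by repeated synthetic division using the built-in deconv function (the variable names are chosen for this illustration only). It is applied to the polynomial p(x) = x^4 − 5x^3 + 3x + 4, which also appears in the example of the next subchapter.

% Sketch: convert an ordinary polynomial into factorial-polynomial form
% by repeated synthetic division with (x - 0), (x - 1), (x - 2), ...
p = [1 -5 0 3 4];          % coefficients of x^4 - 5x^3 + 3x + 4 (descending)
n = length(p) - 1;         % degree of the polynomial
r = zeros(1, n+1);         % factorial coefficients r_0 ... r_n
q = p;
for k = 0:n-1
    [q, res] = deconv(q, [1 -k]);   % divide the current quotient by (x - k)
    r(k+1) = res(end);              % the remainder is the coefficient r_k
end
r(n+1) = q;                % the last quotient is a constant: the leading r_n
r                          % expected: [4 -1 -8 1 1]
% Differences at zero then follow from (4.44.): Delta^j p(0) = j! * r_j
deltas = factorial(0:n) .* r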

4.2.2 Anti-differentiation
If only the first difference of a given function is known, then in order to recover the function itself the opposite of the differencing has to be carried out, the so-called anti-differentiation. We might be suspicious, since this really sounds like integration, and it actually is, but from a different point of view: our goal right now is to carry out an integration-like process without the general integration rules, relying only on difference calculus. Going back to equation (4.29.), the anti-difference can be given in the following general form:
$$\Delta^{-1}(x)^{(n)} = \frac{(x)^{(n+1)}}{n+1} \quad (4.45.)$$
Let's see why this equation looks so familiar!

Example

Anti-differentiate the following polynomial:


𝑝(𝑥) = 𝑥 4 − 5𝑥 3 + 3𝑥 + 4

Solution:
Create the factorial polynomial form, as described earlier (division, quotient and residue formation, repetition):
$$p_n(x) = 4 - (x)^{(1)} - 8(x)^{(2)} + (x)^{(3)} + (x)^{(4)}$$
Then create the difference table:
∆0 𝑝(0) = 0! ∙ 4 = 4
∆1 𝑝(0) = 1! ∙ (−1) = −1
∆2 𝑝(0) = 2! ∙ (−8) = −16
∆3 𝑝(0) = 3! ∙ 1 = 6
∆4 𝑝(0) = 4! ∙ 1 = 24
∆5 𝑝(0) = 5! ∙ 0 = 0
Insert these values into the table, then fill in the next column, and so on. Repeat the steps until the table is complete.
From the table the anti-differentiation formula can be written as:
$$\Delta^{-1} p_n(x) = \frac{(x)^{(5)}}{5} + \frac{(x)^{(4)}}{4} - 8\,\frac{(x)^{(3)}}{3} - \frac{(x)^{(2)}}{2} + 4(x)^{(1)} + C$$
where C is a constant.
In fact, we've actually implemented integration.
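As a quick numerical sanity check (the anonymous functions fp, F and p below are introduced only for this illustration), the forward difference of the anti-difference found above should reproduce the original polynomial:

fp = @(x, n) prod(x - (0:n-1));    % factorial power (x)^(n) for scalar x
F  = @(x) fp(x,5)/5 + fp(x,4)/4 - 8*fp(x,3)/3 - fp(x,2)/2 + 4*fp(x,1);
p  = @(x) x.^4 - 5*x.^3 + 3*x + 4;

xs = 0:6;
% F(x+1) - F(x) should equal p(x) at every point; the result is all zeros
check = arrayfun(@(x) F(x+1) - F(x), xs) - p(xs)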

The example above gives us a feel for how the sum of the polynomial p_n(x) over equidistant points spanning the interval α ≤ x ≤ α + (n − 1)h can be obtained:
$$\sum_{x=\alpha}^{\alpha+(n-1)h} p_n(x) = p_n(\alpha) + p_n(\alpha + h) + p_n(\alpha + 2h) + \dots + p_n\left[\alpha + (n-1)h\right] \quad (4.46.)$$
In a more compact form we can write:
$$\sum_{x=\alpha}^{\alpha+(n-1)h} p_n(x) = \left.\Delta^{-1} p_n(x)\right|_{\alpha}^{\alpha+nh} \quad (4.47.)$$
This, in fact, means that sums of function values can be evaluated in complete analogy with the well-known definite integral formula $\int_a^b f(x)\,dx = F(b) - F(a)$, and conversely the definite integral can be approximated as a sum of function values. This realisation will be extremely important in the later chapters.

4.3 Interpolation
So far we have discussed how a set of input points can be approximated with a function, when the function does not necessarily have to pass through every single point, only the error between the function values and the given points has to be minimal. This was reasonable, since experimental data usually show light (or strong) scattering between the points; and in the case of a large number of input points it would be an unnecessarily hard constraint to force the function through every point – this would require a really large computational capacity, and the result would probably not be in the expected range anyway.
There are, however, many situations when the input points mean a constraint to the function, because we need the curve to touch every point – for example, when we draw a 3-dimensional model in a CAD software, we define points and then connect them with some kind of curve. In this case, an interpolation method is applied. By definition, interpolation is a numerical method which gives and describes a connection between two sets of numbers; it provides the unknown values
of a function based on the known values. All of the interpolation methods
have to fulfil three conditions:
- it has to pass through the given input points,
- it must fulfil further special conditions, like slope or curvature,
- it applies polynomials for ease of use.

Therefore, if the mathematical problem has to be defined, then it can be said that if a function's 𝑓: ℝ → ℝ values are known:
𝑦𝑖 = 𝑓(𝑥𝑖),   𝑖 = 1, 2, … , 𝑛 + 1    (4.48.)
at the points 𝑎 ≤ 𝑥1 ≤ 𝑥2 ≤ ⋯ ≤ 𝑥𝑛 ≤ 𝑥𝑛+1 ≤ 𝑏 (𝑛 + 1 points), then we actually have to find a function which satisfies the interpolation condition:
𝑝(𝑥𝑖) = 𝑦𝑖 ,   𝑖 = 1, 2, … , 𝑛 + 1    (4.49.)
In this problem, the data points are {𝑥𝑖}_{𝑖=1}^{𝑛+1}. If the interpolation conditions are well determined and are satisfied, then the 𝑝(𝑥) polynomial will expectedly approximate the 𝑓(𝑥) function over the interval determined by the data points with an acceptable error.
If the interpolation function is used to approximate values outside the (𝑥1, 𝑥𝑛+1) interval, then we speak about extrapolation.
During the interpolation process, using the 𝑛 + 1 data points, the following 𝑛-th order polynomial is used:
𝑝(𝑥) = 𝑎𝑛+1 𝑥^𝑛 + 𝑎𝑛 𝑥^{𝑛−1} + ⋯ + 𝑎3 𝑥^2 + 𝑎2 𝑥 + 𝑎1    (4.50.)
In order to determine the polynomial, there are several methods. The two
most commonly used theoretical methods are the Lagrange and Newton
method. In practice, maybe the most widespread interpolation method is the
Spline interpolation. These methods are discussed in the following
subchapters.

4.3.1 Lagrange Interpolation


In order to interpret the general form of the Lagrange interpolation polynomial, first consider the case when the interpolation polynomial has to pass through two points, (𝑥1, 𝑦1) and (𝑥2, 𝑦2), and has the following form:

𝑝1 (𝑥) = 𝐿1 (𝑥)𝑦1 + 𝐿2 (𝑥)𝑦2 (4.51.)

where 𝐿1 (𝑥) and 𝐿2 (𝑥) are the Lagrange coefficients. They can be given as:
𝐿1(𝑥) = (𝑥 − 𝑥2)/(𝑥1 − 𝑥2),   𝐿2(𝑥) = (𝑥 − 𝑥1)/(𝑥2 − 𝑥1)    (4.52.)
Since 𝐿1(𝑥1) = 1 and 𝐿1(𝑥2) = 0, while 𝐿2(𝑥1) = 0 and 𝐿2(𝑥2) = 1, it follows that 𝑝1(𝑥1) = 𝑦1 and 𝑝1(𝑥2) = 𝑦2, namely the interpolating polynomial is a straight line that passes through both points, as can be seen in Figure 15.

Figure 15: 1st degree Lagrange interpolation [3]

In the second-order case, the Lagrange interpolating polynomial that passes


through three points (𝑥1 , 𝑦1 ), (𝑥2 , 𝑦2 ) and (𝑥3 , 𝑦3 ) has the following form:
𝑝2 (𝑥) = 𝐿1 (𝑥)𝑦1 + 𝐿2 (𝑥)𝑦2 + 𝐿3 (𝑥)𝑦3 (4.53.)
where the Lagrange coefficients:
𝐿1(𝑥) = (𝑥 − 𝑥2)(𝑥 − 𝑥3)/[(𝑥1 − 𝑥2)(𝑥1 − 𝑥3)],   𝐿2(𝑥) = (𝑥 − 𝑥1)(𝑥 − 𝑥3)/[(𝑥2 − 𝑥1)(𝑥2 − 𝑥3)],
𝐿3(𝑥) = (𝑥 − 𝑥1)(𝑥 − 𝑥2)/[(𝑥3 − 𝑥1)(𝑥3 − 𝑥2)]    (4.54.)
Then 𝐿1(𝑥1) = 𝐿2(𝑥2) = 𝐿3(𝑥3) = 1, while 𝐿𝑖(𝑥𝑗) = 0 for all 𝑖 ≠ 𝑗. Indeed, this means that 𝑝2(𝑥1) = 𝑦1, 𝑝2(𝑥2) = 𝑦2 and 𝑝2(𝑥3) = 𝑦3, namely the second order polynomial goes through the three points (𝑥1, 𝑦1), (𝑥2, 𝑦2) and (𝑥3, 𝑦3), as Figure 16 demonstrates.

Figure 16: 2nd order Lagrange interpolating polynomial [3]

The pattern can be figured out from these two examples, which is the
following: the 𝑛-th order Lagrange polynomial that passes through 𝑛 + 1
points (𝑥1 , 𝑦1 ), … , (𝑥𝑛+1 , 𝑦𝑛+1 ) can be written as follows:
𝑝𝑛(𝑥) = 𝐿1(𝑥)𝑦1 + ⋯ + 𝐿𝑛+1(𝑥)𝑦𝑛+1 = ∑_{𝑖=1}^{𝑛+1} 𝐿𝑖(𝑥)𝑦𝑖    (4.55.)
where the Lagrange coefficients:
𝐿𝑖(𝑥) = ∏_{𝑗=1, 𝑗≠𝑖}^{𝑛+1} (𝑥 − 𝑥𝑗)/(𝑥𝑖 − 𝑥𝑗)    (4.56.)
Let's see an example of how this method can be used to solve a problem analytically.

Example [3]

By using Lagrange method, find the interpolation polynomial of the input data
in the following table, then use this polynomial to find the interpolated value
at 𝑥 = 0,3.
𝒊 𝒙𝒊 𝒚𝒊
1 0,1 0,12
2 0,5 0,47
3 0,9 0,65

Solution:
The three Lagrange interpolation coefficients are:
𝐿1(𝑥) = (𝑥 − 𝑥2)(𝑥 − 𝑥3)/[(𝑥1 − 𝑥2)(𝑥1 − 𝑥3)] = (𝑥 − 0,5)(𝑥 − 0,9)/[(0,1 − 0,5)(0,1 − 0,9)]
𝐿2(𝑥) = (𝑥 − 𝑥1)(𝑥 − 𝑥3)/[(𝑥2 − 𝑥1)(𝑥2 − 𝑥3)] = (𝑥 − 0,1)(𝑥 − 0,9)/[(0,5 − 0,1)(0,5 − 0,9)]
𝐿3(𝑥) = (𝑥 − 𝑥1)(𝑥 − 𝑥2)/[(𝑥3 − 𝑥1)(𝑥3 − 𝑥2)] = (𝑥 − 0,1)(𝑥 − 0,5)/[(0,9 − 0,1)(0,9 − 0,5)]
When the coefficients are known, the 2nd order polynomial can be written as:
𝑝2(𝑥) = 𝐿1(𝑥)𝑦1 + 𝐿2(𝑥)𝑦2 + 𝐿3(𝑥)𝑦3 =
= (0,12)(𝑥 − 0,5)(𝑥 − 0,9)/[(0,1 − 0,5)(0,1 − 0,9)] + (0,47)(𝑥 − 0,1)(𝑥 − 0,9)/[(0,5 − 0,1)(0,5 − 0,9)] + (0,65)(𝑥 − 0,1)(𝑥 − 0,5)/[(0,9 − 0,1)(0,9 − 0,5)] =
= −0,5312𝑥^2 + 1,1937𝑥 + 0,0059
By using this polynomial the interpolated value at 𝑥 = 0,3 is 𝑝2 (0,3) = 0,3162.

Example

Let’s define a MATLAB function that can find the Lagrange interpolating
polynomial, fits to set of data, also can interpolate the polynomial’s value at a
specified point:

function yi = LagrangeInterp(x,y,xi)
%
% LagrangeInterp finds the Lagrange interpolating
% polynomial that fits the data (x,y) and uses it to
% find the interpolated value at xi.
% yi = LagrangeInterp(x,y,xi) where x, y are n-
% dimensional row or column vectors of data,
% xi is a specified point, yi is the interpolated
% value at xi.
n = length(x);
L = zeros(1,n);
for i = 1:n,
L(i) = 1;
for j = 1:n,
if j ~= i,
L(i) = L(i)* (xi - x(j))/(x(i) - x(j));
end
end
end
yi = sum(y.* L);

How does it work? Define a set of points; we are interested in the function's value at the point 0.3. The function gives the following result:

>> x = [0.1 0.5 0.9];
>> y = [0.13 0.45 0.66];

>> yi = LagrangeInterp(x,y,0.3)
yi =
0.3037

Disadvantages of Lagrange interpolation method


Unfortunately, when going from one degree to the next, no information from the lower-degree polynomial is stored or reused in the construction of the new, higher-degree polynomial. This can cause some headache in the following two situations:
- when we do not know the degree of the interpolation polynomial in advance,
- when additional points are added to the input data.

4.3.2 Newton Divided-Difference Interpolation


We can construct the Newton interpolating polynomials recursively in the
following way:
𝑝1 (𝑥) = 𝑎1 + 𝑎2 (𝑥 − 𝑥1 )
𝑝2 (𝑥) = 𝑎1 + 𝑎2 (𝑥 − 𝑥1 ) + 𝑎3 (𝑥 − 𝑥1 )(𝑥 − 𝑥2 )
= 𝑝1 (𝑥) + 𝑎3 (𝑥 − 𝑥1 )(𝑥 − 𝑥2 )
(4.57.)
𝑝3 (𝑥) = 𝑝2 (𝑥) + 𝑎4 (𝑥 − 𝑥1 )(𝑥 − 𝑥2 )(𝑥 − 𝑥3 )
⋮
𝑝𝑛(𝑥) = 𝑝𝑛−1(𝑥) + 𝑎𝑛+1(𝑥 − 𝑥1)(𝑥 − 𝑥2) … (𝑥 − 𝑥𝑛)
where 𝑎1 , 𝑎2 , … , 𝑎𝑛+1 coefficients can be determined as it is introduced next:
Since 𝑝1 (𝑥) must agree with the input points (𝑥1 , 𝑦1 ) and (𝑥2 , 𝑦2 ), thus:
𝑝1 (𝑥1 ) = 𝑦1 , 𝑝1 (𝑥2 ) = 𝑦2 (4.58.)
because of that:
𝑎1 + 𝑎2(𝑥1 − 𝑥1) = 𝑦1
𝑎1 + 𝑎2(𝑥2 − 𝑥1) = 𝑦2    =>   𝑎1 = 𝑦1 ,   𝑎2 = (𝑦2 − 𝑦1)/(𝑥2 − 𝑥1)    (4.59.)
Similarly, 𝑝2(𝑥) must agree with the input points (𝑥1, 𝑦1), (𝑥2, 𝑦2) and (𝑥3, 𝑦3), so that:
𝑝2 (𝑥1 ) = 𝑦1 , 𝑝2 (𝑥2 ) = 𝑦2 , 𝑝2 (𝑥3 ) = 𝑦3 (4.60.)
𝑎1 and 𝑎2 coefficients still have to be determined in the same way as earlier
(see equation (4.59.)), and 𝑎3 can be calculated as:

𝑎3 = [ (𝑦3 − 𝑦2)/(𝑥3 − 𝑥2) − (𝑦2 − 𝑦1)/(𝑥2 − 𝑥1) ] / (𝑥3 − 𝑥1)    (4.61.)
If we continue this process, all the missing coefficients can be determined by using Newton's divided differences. For two points (𝑥1, 𝑦1) and (𝑥2, 𝑦2), the first divided difference can be given as the slope of the line connecting the two points, which is:
𝑓[𝑥2, 𝑥1] = (𝑦2 − 𝑦1)/(𝑥2 − 𝑥1) = 𝑎2    (4.62.)
If three points were considered:
𝑓[𝑥3, 𝑥2, 𝑥1] = (𝑓[𝑥3, 𝑥2] − 𝑓[𝑥2, 𝑥1])/(𝑥3 − 𝑥1) = [ (𝑦3 − 𝑦2)/(𝑥3 − 𝑥2) − (𝑦2 − 𝑦1)/(𝑥2 − 𝑥1) ] / (𝑥3 − 𝑥1) = 𝑎3    (4.63.)
For a general case, the method can be written in the following form:
𝑓[𝑥𝑘+1, 𝑥𝑘, … , 𝑥2, 𝑥1] = (𝑓[𝑥𝑘+1, 𝑥𝑘, … , 𝑥2] − 𝑓[𝑥𝑘, 𝑥𝑘−1, … , 𝑥1]) / (𝑥𝑘+1 − 𝑥1) = 𝑎𝑘+1    (4.64.)
Therefore, we can also give Newton's divided-difference interpolating polynomial for 𝑛 + 1 points (𝑥1, 𝑦1), … , (𝑥𝑛+1, 𝑦𝑛+1) as:
𝑝𝑛(𝑥) = 𝑎1 + 𝑎2(𝑥 − 𝑥1) + 𝑎3(𝑥 − 𝑥1)(𝑥 − 𝑥2) + ⋯ + 𝑎𝑛+1(𝑥 − 𝑥1)(𝑥 − 𝑥2) … (𝑥 − 𝑥𝑛)    (4.65.)
where the coefficients 𝑎1, … , 𝑎𝑛+1 are, in order, 𝑦1, 𝑓[𝑥2, 𝑥1], … , 𝑓[𝑥𝑛+1, … , 𝑥1]; for determining their values, the best solution is to apply difference tables.

Example

Find the 4th order Newton’s divided-difference polynomial for the following
input points, then use the interpolation polynomial to find the interpolated
value at 𝑥 = 0,7.
𝒊 𝒙𝒊 𝒚𝒊
1 0 0
2 0,1 0,1210
3 0,2 0,2258
4 0,5 0,4650
5 0,8 0,6249

Solution:
The first task is to create the difference table, which can be seen below.
Based on equation (4.64.), the 4th order Newton’s divided-difference
interpolating polynomial is denoted by:
𝑝4 (𝑥) = 𝑎1 + 𝑎2 (𝑥 − 𝑥1 ) + 𝑎3 (𝑥 − 𝑥1 )(𝑥 − 𝑥2 )
+ 𝑎4 (𝑥 − 𝑥1 )(𝑥 − 𝑥2 )(𝑥 − 𝑥3 )
+ 𝑎5 (𝑥 − 𝑥1 )(𝑥 − 𝑥2 )(𝑥 − 𝑥3 )(𝑥 − 𝑥4 ) =
= 0 + 1,21(𝑥 − 0) − 0,81(𝑥 − 0)(𝑥 − 0,1) + 0,3664(𝑥 − 0)(𝑥 − 0,1)(𝑥 − 0,2) −
−0,1254(𝑥 − 0)(𝑥 − 0,1)(𝑥 − 0,2)(𝑥 − 0,5) =
= −0,1254𝑥 4 + 0,4667𝑥 3 − 0,9412𝑥 2 + 1,2996𝑥
By using the polynomial, the interpolated value can be calculated easily:
𝑝4 (0,7) = 0,5784
The difference table:
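The printed table itself is not reproduced here; as an added sketch (not part of the original text), the divided differences of the five input points can be generated in MATLAB, and the first entry of each column gives the coefficients 𝑎1, … , 𝑎5 of the Newton form (small deviations in the last digits may appear compared to the rounded values used above):

x = [0 0.1 0.2 0.5 0.8];
y = [0 0.1210 0.2258 0.4650 0.6249];
n = length(x);
D = zeros(n);                 % column j stores the (j-1)-th divided differences
D(:,1) = y(:);
for j = 2:n
    for i = 1:n-j+1
        D(i,j) = (D(i+1,j-1) - D(i,j-1)) / (x(i+j-1) - x(i));
    end
end
a = D(1,:)                    % first row: coefficients a1,...,a5 of the Newton form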

Example: MATLAB built-in functions [3]

In practice, it works in the following way: MATLAB has the INTERP1 function:

yi = interp1(x,y,xi,method), where the vectors x and y are the vectors holding


the x and the y coordinates of points to be interpolated, respectively, xi is a
vector holding points of evaluation, like yi = f(xi) and method is an optional
string specifying an interpolation method. We have four options for the
method:

- nearest neighbour interpolation method – ‘nearest’
- linear interpolation method – ‘linear’
- cubic spline interpolation method – ‘spline’
- Shape-preserving piecewise cubic interpolation – ‘pchip’
Next, we will see some examples both for the built-in and for user-specified
functions and their operation mode.

At first, create a function which calculates the divided differences based on a set of data:

function a = divdiff(x, y)
% Divided differences based on points stored in
% arrays x and y.
n = length(x);
for k=1:n-1
y(k+1:n) = (y(k+1:n) - y(k))./(x(k+1:n) - x(k));
end
a = y(:);

Then, create a function which evaluates the interpolating polynomial at the supplied points:
function [yi, a] = Newtonpol(x, y, xi)

% Values yi of the interpolating polynomial at the


% points xi. Coordinates of the points of
% interpolation are stored in vectors x and y.
% Horner's method is used to evaluate a polynomial.
% Second output parameter a holds coefficients
% of the interpolating polynomial in Newton's form.

a = divdiff(x, y);
n = length(a);
val = a(n)*ones(1,length(xi));
for m = n-1:-1:1
val = (xi - x(m)).*val + a(m);
end
yi = val(:);

To see how they work, define the points (x,y) and the points where the interpolant is evaluated (xi, yi):

>> x = 0:pi/5:pi;
>> y = sin(2.*x);
>> xi = 0:pi/100:pi;
>> yi = interp1(x,y,xi,'nearest');
>> plot(x, y, 'o', xi, yi), title('Piecewise constant interpolant of y = sin(2x)')
The script gives us the following result:

By trying out the shape-preserving cubic interpolation method, the diagram takes a significantly different form:

>> yi = interp1(x, y, xi, 'pchip');
>> plot(x, y, 'o', xi, yi), title('Cubic interpolant of y = sin(2x)')

Now try out the Newtonpol function:

>> [yi, a] = Newtonpol(x, y, xi);

>> plot(x, y, 'o', xi, yi), title('Quintic interpolant of y = sin(2x)')

The result is:

It is good to keep in mind that the interpolation process does not always produce a sequence of polynomials that converge uniformly to the interpolated function as the degree of the interpolating polynomial tends to infinity. One of the best-known examples is the following: the function g(x) = 1/(1+x²) is given over the interval −5 ≤ 𝑥 ≤ 5, and it is interpolated at n+1 evenly spaced points (xk = -5 + 10k/n, k = 0,1,…,n). Create a script which generates both g(x) and its interpolating polynomial pn(x).

% Script showint.m
% Plot of the function 1/(1 + x^2) and its
% interpolating polynomial of degree n.
m=input('Enter number of interpolating polynomials');
for k=1:m
n = input('Enter degree of the interpolating polynomial ');
hold on
x = linspace(-5,5,n+1);

y = 1./(1 + x.*x);
z = linspace(-5.5,5.5);
t = 1./(1 + z.^2);
h1_line = plot(z,t,'-.');
set(h1_line, 'LineWidth',1.25)
t = Newtonpol(x,y,z);
h2_line = plot(z,t,'r');
set(h2_line,'LineWidth',1.3,'Color',[0 0 0])
axis([-5.5 5.5 -.5 1])
title(sprintf('Example of divergence (n =
%2.0f)',n))
xlabel('x')
ylabel('y')
legend('y = 1/(1+x^2)','interpolant')
hold off
end

By running this script, try out the parameters m = 1 and n = 9. The result is the following:

Another example is a function that finds the Newton divided-difference interpolating polynomial fitted to a set of data, and can also calculate the interpolated value at a specified point.

function yi = NewtonInterp(x,y,xi)
%
% NewtonInterp finds the Newton divided-difference
% interpolating polynomial that fits the data (x,y)
% and uses it to find the interpolated value at xi.
% yi = NewtonInterp(x,y,xi) where x, y are n-
% dimensional row or column vectors of data,
% xi is a specified point, yi is the interpolated
% value at xi.
n = length(x);
a = zeros(1,n);
a(1) = y(1);
DivDiff = zeros(1,n-1);
for i = 1:n-1,
DivDiff(i,1) = (y(i+1) - y(i))/(x(i+1) - x(i));
end
for j = 2:n-1,
for i = 1:n-j,
DivDiff(i,j) = (DivDiff(i+1,j-1) - DivDiff(i,j-1))/(x(j+i) - x(i));
end
end
for k = 2:n,
a(k) = DivDiff(1,k-1);
end
yi = a(1);
xprod = 1;
for m = 2:n,
xprod = xprod*(xi - x(m-1));
yi = yi + a(m)*xprod;
end

Let’s see how it works:


>> x = [0.4 0.5 0.6 0.7];
>> y = [0.921061 0.877583 0.825336 0.764842];
>> format long
>> yi = NewtonInterp(x,y,0.52)
yi = 0.867818416000000

Although the script looks messier, the function itself is less capable, since it can handle only a scalar xi value, not a vector.

4.3.3 Spline Interpolation [3]
Up to this point, to interpolate 𝑛 + 1 points, an 𝑛-th degree polynomial has been used. For example, to interpolate 11 input points, using the Lagrange interpolation we get a 10th degree polynomial, which is not really practical, and furthermore is really ugly. This means that the former interpolation methods work well only for a low number of points; for 101 or 1001 input points it is not a good solution to use a 100th or 1000th degree polynomial, it would not make much sense. Additionally, considering 11 points and the 10th degree polynomial, due to the peaks generated by the high-degree polynomial, the interpolation method can make relevant errors, as can be seen in Figure 17. This means that the higher the number of initial points, the higher the degree of the required polynomial, and thus the higher the error of the interpolation method.

In order to eliminate this phenomenon, the idea was born to avoid using one global polynomial to interpolate a large number of initial points, and to use instead several low-degree polynomials, interpreted over sub-intervals including one or more data points. These low-degree polynomials are the so-called splines. Originally, splines were flexible, thin strips – rulers, mostly made from special wood or steel – used by naval architects and drafters. The two endpoints of the strip were fixed, and in the middle section weights were applied, creating curves with different curvatures. At that time it was mathematically challenging to describe these functions, but as soon as the mathematical background became available, splines became perhaps the most commonly used interpolation tools.

Figure 17: Interpolating 11 points with a 10th degree polynomial and with using a
cubic spline

The most commonly applied spline type is the cubic spline, which gives an accurate description and a smooth transition between two sub-intervals, as can be seen in Figure 17. The same figure illustrates the advantages of the cubic spline over the high-degree polynomials.

4.3.3.1 Linear Splines


The simplest approximation is given by the linear splines, since in this case first degree polynomials – straight lines – are used to connect the input points. So, if 4 input points are given, (𝑥1, 𝑦1), (𝑥2, 𝑦2), (𝑥3, 𝑦3) and (𝑥4, 𝑦4), to describe the linear splines we have to go back to the Lagrange method and consider the following equations:
𝑆1(𝑥) = 𝑦1 (𝑥 − 𝑥2)/(𝑥1 − 𝑥2) + 𝑦2 (𝑥 − 𝑥1)/(𝑥2 − 𝑥1),   𝑥1 ≤ 𝑥 ≤ 𝑥2
𝑆2(𝑥) = 𝑦2 (𝑥 − 𝑥3)/(𝑥2 − 𝑥3) + 𝑦3 (𝑥 − 𝑥2)/(𝑥3 − 𝑥2),   𝑥2 ≤ 𝑥 ≤ 𝑥3    (4.66.)
𝑆3(𝑥) = 𝑦3 (𝑥 − 𝑥4)/(𝑥3 − 𝑥4) + 𝑦4 (𝑥 − 𝑥3)/(𝑥4 − 𝑥3),   𝑥3 ≤ 𝑥 ≤ 𝑥4
If we take a look at the linear splines in Figure 18, they look really familiar: we have met them already. Actually, the linear splines have already been discussed in subchapter 4.3.1. Consequently, we already know their drawbacks: linear splines are not capable of creating a smooth transition between sub-intervals, and additionally they create a breakpoint at every point.

Figure 18: Interpolating of 4 points with linear splines and a 3rd degree
polynomial

This behaviour comes from the fact that the linear splines' first derivatives (slopes) are not equal to each other at the interior points (if this were a constraint, the interpolation problem could not be solved with linear splines). That is why our next constraint will be the equality of the first derivatives, even though first degree polynomials can then no longer be used. So, in the following, higher degree polynomials will be applied over the sub-intervals. Of course, our next problem will be the equality of the second derivatives, but this will be discussed after we have dealt with the first derivatives.

4.3.3.2 Quadratic Splines


The quadratic splines are basically parabolic polynomials, applied on the input points over every single sub-interval to interpolate the missing function values. Assuming that there are 𝑛 + 1 input points, (𝑥1, 𝑦1), … , (𝑥𝑛+1, 𝑦𝑛+1), which can be divided into 𝑛 sub-intervals, 𝑛 quadratic polynomials have to be constructed, as shown in Figure 19. The quadratic polynomials have the following general form:
𝑆𝑖(𝑥) = 𝑎𝑖 𝑥^2 + 𝑏𝑖 𝑥 + 𝑐𝑖 ,   𝑖 = 1,2, … , 𝑛    (4.67.)
where 𝑎𝑖, 𝑏𝑖, 𝑐𝑖 (𝑖 = 1,2, … , 𝑛) are the unknown coefficients to be determined. Since we have 𝑛 polynomials, each one with 3 coefficients, 3𝑛 unknown coefficients have to be determined. In order to calculate every one of them, exactly 3𝑛 equations have to be constructed.

Figure 19: Quadratic splines

In order to determine the 3𝑛 equations the following conditions have to be


settled:

Function values at the endpoints:

The first polynomial 𝑆1(𝑥) must pass through the point (𝑥1, 𝑦1), and the last one 𝑆𝑛(𝑥) has to pass through (𝑥𝑛+1, 𝑦𝑛+1):
𝑆1(𝑥1) = 𝑦1 = 𝑎1 𝑥1^2 + 𝑏1 𝑥1 + 𝑐1
𝑆𝑛(𝑥𝑛+1) = 𝑦𝑛+1 = 𝑎𝑛 𝑥𝑛+1^2 + 𝑏𝑛 𝑥𝑛+1 + 𝑐𝑛    (4.68.)
Function values at intermediate points:

In the intermediate points 2 conditions must be fulfilled: the polynomials have


to pass through the points, and the neighbouring polynomials have to be
equal in the following points:

𝑆1(𝑥2) = 𝑦2 ,  𝑆2(𝑥3) = 𝑦3 , … ,  𝑆𝑛−1(𝑥𝑛) = 𝑦𝑛
𝑆2(𝑥2) = 𝑦2 ,  𝑆3(𝑥3) = 𝑦3 , … ,  𝑆𝑛(𝑥𝑛) = 𝑦𝑛    (4.69.)
Each of these rows generates 𝑛 − 1 equations. Therefore, the general form is the following:
𝑆𝑖(𝑥𝑖+1) = 𝑦𝑖+1 = 𝑎𝑖 𝑥𝑖+1^2 + 𝑏𝑖 𝑥𝑖+1 + 𝑐𝑖 ,   𝑖 = 1,2, … , 𝑛 − 1
𝑆𝑖(𝑥𝑖) = 𝑦𝑖 = 𝑎𝑖 𝑥𝑖^2 + 𝑏𝑖 𝑥𝑖 + 𝑐𝑖 ,   𝑖 = 2,3, … , 𝑛    (4.70.)
Values of the first derivatives in the intermediate points:

In the intermediate points, the values of the first derivatives must agree,
which can be written mathematically in the following form (creating 𝑛 − 1
equations):

𝑆′1 (𝑥2 ) = 𝑆′2 (𝑥2 ), 𝑆′2 (𝑥3 ) = 𝑆′3 (𝑥3 ), … , 𝑆 ′ 𝑛−1 (𝑥𝑛 ) = 𝑆′𝑛 (𝑥𝑛 ) (4.71.)
In general form:

𝑆′𝑖 (𝑥𝑖+1 ) = 𝑆′𝑖+1 (𝑥𝑖+1 ), 𝑖 = 1,2, … , 𝑛 − 1 (4.72.)


Since 𝑆 ′ 𝑖 (𝑥) = 2𝑎𝑖 𝑥 + 𝑏𝑖 , the former equation in another form:

2𝑎𝑖 𝑥𝑖+1 + 𝑏𝑖 = 2𝑎𝑖+1 𝑥𝑖+1 + 𝑏𝑖+1 , 𝑖 = 1,2, … , 𝑛 − 1 (4.73.)


If these equations are summed up, we have created 2 + (𝑛 − 1) + (𝑛 − 1) + (𝑛 − 1) = 3𝑛 − 1 equations, which is almost enough, but we still miss one more equation to have an exact solution.

The second derivative's value at the left endpoint is zero:

This means:

𝑆′′1 (𝑥1 ) = 0 = 2𝑎1 (4.74.)


Finally, we have created the necessary 3𝑛 equations. By stating that 𝑎1 = 0 the problem is significantly simplified: the remaining 3𝑛 − 1 unknown coefficients are provided by a system of 3𝑛 − 1 equations.

Example

Determine the quadratic splines for the following input points:

𝑥𝑖 𝑦𝑖
2 5
3 2,3
5 5,1
7,5 1,5

Solution:
Because we have 4 input points, 𝑛 = 3, thus 3 quadratic splines have to be
determined with the total of 9 unknown coefficients. First of all we know that
𝑎1 = 0. The rest 8 equations are given by solving equations (4.68.)-(4.73.).
Based on equation (4.68.):
𝑎1 (2)2 + 𝑏1 (2) + 𝑐1 = 5
𝑎3 (7,5)2 + 𝑏3 (7,5) + 𝑐3 = 1,5
By using equation (4.70.):
𝑎1 (3)2 + 𝑏1 (3) + 𝑐1 = 2,3
𝑎2 (5)2 + 𝑏2 (5) + 𝑐2 = 5,1
𝑎2 (3)2 + 𝑏2 (3) + 𝑐2 = 2,3
𝑎3 (5)2 + 𝑏3 (5) + 𝑐3 = 5,1
Finally, applying equation (4.73.):
2𝑎1 (3) + 𝑏1 = 2𝑎2 (3) + 𝑏2
2𝑎2 (5) + 𝑏2 = 2𝑎3 (5) + 𝑏3
To simplify the problem, transform the linear system of equations into matrix form, with the unknown vector [𝑏1 𝑐1 𝑎2 𝑏2 𝑐2 𝑎3 𝑏3 𝑐3]^T:

[ 2   1   0    0   0    0      0    0 ] [𝑏1]   [  5   ]
[ 0   0   0    0   0  56,25   7,5   1 ] [𝑐1]   [ 1,5  ]
[ 3   1   0    0   0    0      0    0 ] [𝑎2]   [ 2,3  ]
[ 0   0  25    5   1    0      0    0 ] [𝑏2] = [ 5,1  ]
[ 0   0   9    3   1    0      0    0 ] [𝑐2]   [ 2,3  ]
[ 0   0   0    0   0   25      5    1 ] [𝑎3]   [ 5,1  ]
[ 1   0  −6   −1   0    0      0    0 ] [𝑏3]   [  0   ]
[ 0   0  10    1   0  −10     −1    0 ] [𝑐3]   [  0   ]
Solving the system of equation, we will have the following coefficients:
𝑎1 = 0; 𝑎2 = 2,05; 𝑎3 = −2,776;
𝑏1 = −2,7; 𝑏2 = −15; 𝑏3 = 33,26;
𝑐1 = 10,4; 𝑐2 = 28,85; 𝑐3 = −91,8
As the final step, construct the quadratic spline equations:
𝑆1(𝑥) = −2,7𝑥 + 10,4,   2 ≤ 𝑥 ≤ 3

𝑆2 (𝑥) = 2,05𝑥 2 − 15𝑥 + 28,85 3 ≤ 𝑥 ≤ 5
𝑆3 (𝑥) = −2,776𝑥 2 + 33,26𝑥 − 91,8 5 ≤ 𝑥 ≤ 7,5
By plotting the solution of the example:
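
The original figure is not reproduced here; as an illustrative sketch (added here, not part of the original worked example), the 8×8 system above can be assembled, solved and plotted directly in MATLAB, where the unknown vector is [𝑏1 𝑐1 𝑎2 𝑏2 𝑐2 𝑎3 𝑏3 𝑐3] and 𝑎1 = 0 is imposed in advance:

A = [ 2  1   0   0  0    0    0   0;     % S1(2)   = 5
      0  0   0   0  0  56.25 7.5  1;     % S3(7.5) = 1.5
      3  1   0   0  0    0    0   0;     % S1(3)   = 2.3
      0  0  25   5  1    0    0   0;     % S2(5)   = 5.1
      0  0   9   3  1    0    0   0;     % S2(3)   = 2.3
      0  0   0   0  0   25    5   1;     % S3(5)   = 5.1
      1  0  -6  -1  0    0    0   0;     % S1'(3)  = S2'(3)
      0  0  10   1  0  -10   -1   0];    % S2'(5)  = S3'(5)
rhs = [5; 1.5; 2.3; 5.1; 2.3; 5.1; 0; 0];
z = A\rhs;       % expected: -2.7  10.4  2.05  -15  28.85  -2.776  33.26  -91.8
% Plot the three parabolic segments together with the input points
x1 = linspace(2,3);    S1 = z(1)*x1 + z(2);
x2 = linspace(3,5);    S2 = z(3)*x2.^2 + z(4)*x2 + z(5);
x3 = linspace(5,7.5);  S3 = z(6)*x3.^2 + z(7)*x3 + z(8);
plot([2 3 5 7.5],[5 2.3 5.1 1.5],'o', x1,S1, x2,S2, x3,S3)
title('Quadratic splines of the example')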

4.3.3.3 Cubic Splines


Considering cubic splines, the individual sub-intervals are interpolated by
using cubic polynomials. Assuming 𝑛 + 1 input points that are
(𝑥1 , 𝑦1 ), … , (𝑥𝑛+1 , 𝑦𝑛+1 ), 𝑛 sub-intervals are created, meaning 𝑛 cubic
polynomials. Each polynomial can be written in general form as follows:
𝑆𝑖(𝑥) = 𝑎𝑖(𝑥 − 𝑥𝑖)^3 + 𝑏𝑖(𝑥 − 𝑥𝑖)^2 + 𝑐𝑖(𝑥 − 𝑥𝑖) + 𝑑𝑖 ,   𝑖 = 1,2, … , 𝑛    (4.75.)
where 𝑎𝑖, 𝑏𝑖, 𝑐𝑖, 𝑑𝑖 (𝑖 = 1,2, … , 𝑛) are the unknown coefficients. Since there are 𝑛 polynomials, each one consisting of 4 coefficients, in total 4𝑛 equations have to be written up. These equations are generated similarly as before:
constraints have to be defined.
The spline passes through the start, end and intermediate points, furthermore
the adjacent splines have the same value:
𝑆1 (𝑥1 ) = 𝑦1 , 𝑆𝑛 (𝑥𝑛+1 ) = 𝑦𝑛+1
𝑆𝑖+1 (𝑥𝑖+1 ) = 𝑆𝑖 (𝑥𝑖+1 ), 𝑖 = 1,2, … , 𝑛 − 1 (4.76.)
𝑆𝑖 (𝑥𝑖 ) = 𝑦𝑖 , 𝑖 = 2,3, … , 𝑛

The first derivatives also have the same values in the intermediate points
(creating 𝑛 − 1 equations):
𝑆𝑖′(𝑥𝑖+1) = 𝑆𝑖+1′(𝑥𝑖+1),   𝑖 = 1,2, … , 𝑛 − 1    (4.77.)
The second derivatives are also equal to each other in the intermediate points
(creating 𝑛 − 1 equations):
𝑆𝑖′′(𝑥𝑖+1) = 𝑆𝑖+1′′(𝑥𝑖+1),   𝑖 = 1,2, … , 𝑛 − 1    (4.78.)
So, in total 4𝑛 − 2 equations have been defined so far, two more are missing.
These missing equations can be written up by defining boundary conditions:
how does the spline start out from the start point, and how does it get to the
endpoint. Generally, there are 2 kinds of boundary conditions.
Clamped boundary conditions:
The first and last splines (𝑆1 and 𝑆𝑛 ) starting from (𝑥1 , 𝑦1 ) and getting to
(𝑥𝑛+1 , 𝑦𝑛+1 ) are clamped as:
𝑆1′ (𝑥1 ) = 𝑝, 𝑆𝑛′ (𝑥𝑛+1 ) = 𝑞 (4.79.)
Free boundary conditions:
This is a little bit tricky, because here the curvatures are defined as boundary
conditions:
𝑆1′′ (𝑥1 ) = 0, 𝑆𝑛′′ (𝑥𝑛+1 ) = 0 (4.80.)
Indeed, applying clamped boundary conditions gives a more accurate approximation, because it contains more information about the spline than the free boundary conditions do. Of course, in order to use clamped boundary conditions, we need to know more about the splines, since in this case we define the boundary slopes manually.

4.3.3.4 Constructing cubic splines with clamped boundary


conditions
Equations (4.75.)-(4.78.), completed by equation (4.79.), define the 𝑎𝑖, 𝑏𝑖, 𝑐𝑖, 𝑑𝑖 (𝑖 = 1,2, … , 𝑛) coefficients. Going straightforward, the 𝑑𝑖 coefficients can be determined first, since substituting 𝑥 = 𝑥𝑖 into equation (4.75.) gives 𝑆𝑖(𝑥𝑖) = 𝑑𝑖 (𝑖 = 1,2, … , 𝑛). Furthermore, based on equation (4.76.), 𝑆𝑖(𝑥𝑖) = 𝑦𝑖 (𝑖 = 1,2, … , 𝑛), thus:

𝑑𝑖 = 𝑦𝑖 , 𝑖 = 1,2, … , 𝑛 (4.81.)

Calculate the distance between the points (the steps): ℎ𝑖 = 𝑥𝑖+1 − 𝑥𝑖 (𝑖 = 1,2, … , 𝑛). Substituting this into equation (4.76.), keeping in mind that 𝑆𝑖+1(𝑥𝑖+1) = 𝑑𝑖+1, it can be written that:

𝑑𝑖+1 = 𝑎𝑖 ℎ𝑖3 + 𝑏𝑖 ℎ𝑖2 + 𝑐𝑖 ℎ𝑖 + 𝑑𝑖 , 𝑖 = 1,2, … , 𝑛 − 1 (4.82.)


By defining 𝑑𝑛+1 = 𝑦𝑛+1 , the former equation will be valid for the full range
𝑖 = 1,2, … , 𝑛, since 𝑆𝑛 (𝑥𝑛+1 ) = 𝑦𝑛+1. Thus:

𝑑𝑖+1 = 𝑎𝑖 ℎ𝑖3 + 𝑏𝑖 ℎ𝑖2 + 𝑐𝑖 ℎ𝑖 + 𝑑𝑖 , 𝑖 = 1,2, … , 𝑛 (4.83.)


The next step is to define the first derivative of 𝑆𝑖 (𝑥) based on equation
(4.77.):

𝑐𝑖+1 = 3𝑎𝑖 ℎ𝑖2 + 2𝑏𝑖 ℎ𝑖 + 𝑐𝑖 , 𝑖 = 1,2, … , 𝑛 − 1 (4.84.)


If we stated that 𝑐𝑛+1 = 𝑆𝑛′ (𝑥𝑛+1 ), the former equation would be valid for the
full range of the interval 𝑖 = 1,2, … , 𝑛, so:

𝑐𝑖+1 = 3𝑎𝑖 ℎ𝑖2 + 2𝑏𝑖 ℎ𝑖 + 𝑐𝑖 , 𝑖 = 1,2, … , 𝑛 (4.85.)


Only the second derivative of 𝑆𝑖 (𝑥) is missing, which can be written as:

2𝑏𝑖+1 = 6𝑎𝑖 ℎ𝑖 + 2𝑏𝑖 , 𝑖 = 1,2, … , 𝑛 − 1 (4.86.)


By defining 𝑏𝑛+1 = (1/2) 𝑆𝑛′′(𝑥𝑛+1), the former equation becomes valid for the full range 𝑖 = 1,2, … , 𝑛, meaning:

𝑏𝑖+1 = 3𝑎𝑖 ℎ𝑖 + 𝑏𝑖 , 𝑖 = 1,2, … , 𝑛 (4.87.)


The goal is to express the 𝑏𝑖 (𝑖 = 1,2, … , 𝑛 + 1) values from the system of equations. In order to do so, 𝑎𝑖 = (𝑏𝑖+1 − 𝑏𝑖)/(3ℎ𝑖) is expressed from equation (4.87.) and substituted into equations (4.83.) and (4.85.) with 𝑖 = 1,2, … , 𝑛. This way we get:

𝑑𝑖+1 = (1/3)(2𝑏𝑖 + 𝑏𝑖+1)ℎ𝑖^2 + 𝑐𝑖ℎ𝑖 + 𝑑𝑖 ,   𝑖 = 1,2, … , 𝑛    (4.88.)
and

𝑐𝑖+1 = (𝑏𝑖 + 𝑏𝑖+1 )ℎ𝑖 + 𝑐𝑖 , 𝑖 = 1,2, … , 𝑛 (4.89.)


Expressing 𝑐𝑖 from equation (4.88.):

𝑐𝑖 = (𝑑𝑖+1 − 𝑑𝑖)/ℎ𝑖 − (1/3)(2𝑏𝑖 + 𝑏𝑖+1)ℎ𝑖    (4.90.)
By changing 𝑖 to 𝑖 − 1:

𝑐𝑖−1 = (𝑑𝑖 − 𝑑𝑖−1)/ℎ𝑖−1 − (1/3)(2𝑏𝑖−1 + 𝑏𝑖)ℎ𝑖−1    (4.91.)
The same step is done to equation (4.89.):

𝑐𝑖 = (𝑏𝑖−1 + 𝑏𝑖 )ℎ𝑖−1 + 𝑐𝑖−1 (4.92.)


Now substitute equations (4.90.) and (4.91.) back into equation (4.92.). This way, for 𝑖 = 2,3, … , 𝑛 we can write:

𝑏𝑖−1ℎ𝑖−1 + 2𝑏𝑖(ℎ𝑖 + ℎ𝑖−1) + 𝑏𝑖+1ℎ𝑖 = 3(𝑑𝑖+1 − 𝑑𝑖)/ℎ𝑖 − 3(𝑑𝑖 − 𝑑𝑖−1)/ℎ𝑖−1    (4.93.)
Actually, now we can write a system of equations in which only the 𝑏𝑖 (𝑖 = 1,2, … , 𝑛 + 1) are the unknown coefficients, since the 𝑑𝑖 (𝑖 = 1,2, … , 𝑛 + 1) are the input points, while the ℎ𝑖 (𝑖 = 1,2, … , 𝑛) are determined by the spacing of the data. Still, equation (4.93.) generates only 𝑛 − 1 equations for 𝑛 + 1 unknown coefficients, so 2 more equations have to be generated. These are determined by the boundary conditions.

At first, solve equation (4.90.) for 𝑖 = 1:

𝑐1 = (𝑑2 − 𝑑1)/ℎ1 − (1/3)(2𝑏1 + 𝑏2)ℎ1    (4.94.)
Because earlier it was determined that 𝑐1 = 𝑆1′ (𝑥1 ) = 𝑝, the equation can be
transformed to:

(2𝑏1 + 𝑏2)ℎ1 = 3(𝑑2 − 𝑑1)/ℎ1 − 3𝑝    (4.95.)
From equation (4.89.):

𝑐𝑛+1 = (𝑏𝑛 + 𝑏𝑛+1 )ℎ𝑛 + 𝑐𝑛 (4.96.)


Furthermore, we already know that 𝑐𝑛+1 = 𝑆𝑛′ (𝑥𝑛+1 ) = 𝑞, so:

𝑐𝑛 = 𝑞 − (𝑏𝑛 + 𝑏𝑛+1 )ℎ𝑛 (4.97.)


In equation (4.90.) using 𝑖 = 𝑛 expression we get:

𝑐𝑛 = (𝑑𝑛+1 − 𝑑𝑛)/ℎ𝑛 − (1/3)(2𝑏𝑛 + 𝑏𝑛+1)ℎ𝑛    (4.98.)
Substituting equation (4.97.) back to the former expression we get:

(2𝑏𝑛+1 + 𝑏𝑛)ℎ𝑛 = −3(𝑑𝑛+1 − 𝑑𝑛)/ℎ𝑛 + 3𝑞    (4.99.)
Finally, by combining equations (4.99.), (4.95.) and (4.93.), a system of
equations is generated with 𝑛 + 1 equations and 𝑛 + 1 unknown coefficients
(𝑏𝑖 , 𝑖 = 1,2, … , 𝑛 + 1). This way 𝑏𝑖 can be expressed from the following
system:

(2𝑏1 + 𝑏2)ℎ1 = 3(𝑑2 − 𝑑1)/ℎ1 − 3𝑝
𝑏𝑖−1ℎ𝑖−1 + 2𝑏𝑖(ℎ𝑖 + ℎ𝑖−1) + 𝑏𝑖+1ℎ𝑖 = 3(𝑑𝑖+1 − 𝑑𝑖)/ℎ𝑖 − 3(𝑑𝑖 − 𝑑𝑖−1)/ℎ𝑖−1 ,   𝑖 = 2,3, … , 𝑛    (4.100.)
(2𝑏𝑛+1 + 𝑏𝑛)ℎ𝑛 = −3(𝑑𝑛+1 − 𝑑𝑛)/ℎ𝑛 + 3𝑞
A tridiagonal system (the coefficient matrix has non-zero elements only in the
main diagonal, and the first diagonal below and above the main diagonal) has
been generated with a clear solution. As soon as the 𝑏𝑖 coefficients are determined, the 𝑐𝑖 coefficients can be calculated from equation (4.90.), and finally, using equation (4.87.), we can get the 𝑎𝑖 coefficients. With that, the task is finished. Although this process seems complicated and long, for a large number of input points it actually gives a significantly faster and more accurate result than all the other interpolation methods.
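
To make the procedure concrete, the following user function is a sketch added here as an illustration (the name ClampedCubicSpline is chosen only for this example), using the notation of equations (4.87.), (4.90.) and (4.100.): it assembles the tridiagonal system for the 𝑏 coefficients and then recovers 𝑐 and 𝑎, while the 𝑑 coefficients are simply the input 𝑦 values. Calling it as [a,b,c,d] = ClampedCubicSpline([2 3 5 7.5],[5 2.3 5.1 1.5],-1,1) should reproduce the coefficients of the example that follows.

function [a,b,c,d] = ClampedCubicSpline(x, y, p, q)
% Sketch: coefficients of the clamped cubic splines
% S_i(t) = a_i(t-x_i)^3 + b_i(t-x_i)^2 + c_i(t-x_i) + d_i,  i = 1..n,
% with prescribed end slopes S_1'(x_1) = p and S_n'(x_(n+1)) = q.
n = length(x) - 1;              % number of sub-intervals
h = diff(x);  h = h(:);         % h_i = x_(i+1) - x_i
d = y(:);                       % d_i = y_i
A = zeros(n+1);  r = zeros(n+1,1);
A(1,1) = 2*h(1);  A(1,2) = h(1);             % first row of (4.100.)
r(1) = 3*(d(2)-d(1))/h(1) - 3*p;
for i = 2:n                                  % middle rows of (4.100.)
    A(i,i-1) = h(i-1);
    A(i,i)   = 2*(h(i) + h(i-1));
    A(i,i+1) = h(i);
    r(i) = 3*(d(i+1)-d(i))/h(i) - 3*(d(i)-d(i-1))/h(i-1);
end
A(n+1,n) = h(n);  A(n+1,n+1) = 2*h(n);       % last row of (4.100.)
r(n+1) = -3*(d(n+1)-d(n))/h(n) + 3*q;
b = A\r;                                     % b_1 ... b_(n+1)
i = (1:n)';
c = (d(i+1)-d(i))./h - (2*b(i)+b(i+1)).*h/3; % c_i from (4.90.)
a = (b(i+1)-b(i))./(3*h);                    % a_i from (4.87.)
b = b(1:n);  d = d(1:n);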

Example

Solve the interpolation problem of the previous quadratic-spline example using cubic splines with clamped boundary conditions, if the boundary slopes are:
𝑝 = −1;  𝑞 = 1
Solution:
Since we have 4 input points, due to this 𝑛 = 3, so 3 cubic polynomials have
to be determined in the following form:
𝑆𝑖 (𝑥) = 𝑎𝑖 (𝑥 − 𝑥𝑖 )3 + 𝑏𝑖 (𝑥 − 𝑥𝑖 )2 + 𝑐𝑖 (𝑥 − 𝑥𝑖 ) + 𝑑𝑖 , 𝑖 = 1,2,3

Using the former deduction, first the 𝑏𝑖 coefficients are determined from the following system:
(2𝑏1 + 𝑏2)ℎ1 = 3(𝑑2 − 𝑑1)/ℎ1 − 3𝑝
𝑏1ℎ1 + 2𝑏2(ℎ2 + ℎ1) + 𝑏3ℎ2 = 3(𝑑3 − 𝑑2)/ℎ2 − 3(𝑑2 − 𝑑1)/ℎ1
𝑏2ℎ2 + 2𝑏3(ℎ3 + ℎ2) + 𝑏4ℎ3 = 3(𝑑4 − 𝑑3)/ℎ3 − 3(𝑑3 − 𝑑2)/ℎ2
(2𝑏4 + 𝑏3)ℎ3 = −3(𝑑4 − 𝑑3)/ℎ3 + 3𝑞
Because 𝑑𝑖 values are equal to the input points:
𝑑1 = 5; 𝑑2 = 2,3; 𝑑3 = 5,1; 𝑑4 = 1,5
Similarly: ℎ1 = 1; ℎ2 = 2; ℎ3 = 2,5. By substituting with 𝑝 = −1 and 𝑞 = 1,
the system of equation can be written as:
[ 2   1   0    0  ] [𝑏1]   [ −5,1 ]
[ 1   6   2    0  ] [𝑏2] = [ 12,3 ]
[ 0   2   9   2,5 ] [𝑏3]   [−8,52 ]
[ 0   0  2,5   5  ] [𝑏4]   [ 7,32 ]
which is a tridiagonal system, and by solving it
𝑏1 = −4,3551; 𝑏2 = 3,6103; 𝑏3 = −2,5033; 𝑏4 = 2,7157
Next, 𝑐𝑖 values are calculated:
𝑐1 = (𝑑2 − 𝑑1)/ℎ1 − (1/3)(2𝑏1 + 𝑏2)ℎ1 = −1
𝑐2 = (𝑑3 − 𝑑2)/ℎ2 − (1/3)(2𝑏2 + 𝑏3)ℎ2 = −1,7449
𝑐3 = (𝑑4 − 𝑑3)/ℎ3 − (1/3)(2𝑏3 + 𝑏4)ℎ3 = 0,4691

Finally, the 𝑎𝑖 values are determined:
𝑎1 = (𝑏2 − 𝑏1)/(3ℎ1) = 2,6551
𝑎2 = (𝑏3 − 𝑏2)/(3ℎ2) = −1,0189
𝑎3 = (𝑏4 − 𝑏3)/(3ℎ3) = 0,6959

By using the calculated coefficients, the 3 cubic splines can be written as:
𝑆1 (𝑥) = 2,6551(𝑥 − 2)3 − 4,3551(𝑥 − 2)2 − (𝑥 − 2) + 5, 2 ≤ 𝑥 ≤ 3
𝑆2 (𝑥) = −1,0189(𝑥 − 3)3 + 3,6103(𝑥 − 3)2 − 1,7449(𝑥 − 3) + 2,3, 3 ≤ 𝑥 ≤ 5
𝑆3 (𝑥) = 0,6959(𝑥 − 5)3 − 2,5033(𝑥 − 5)2 + 0,4691(𝑥 − 5) + 5,1, 5 ≤ 𝑥 ≤ 7,5
Look at the input points and the quadratic and cubic interpolating splines and their characteristics, illustrated by the next diagram. It can be clearly seen that the cubic splines pass through the points more smoothly, with smaller peaks, providing a more accurate solution.

4.3.3.5 Constructing Cubic splines with free boundary


conditions
As it was discussed earlier, “free boundary conditions” do not mean the total lack of boundary conditions, but the conditions 𝑆1′′(𝑥1) = 0 and 𝑆𝑛′′(𝑥𝑛+1) = 0. As 𝑆1′′(𝑥) = 6𝑎1(𝑥 − 𝑥1) + 2𝑏1, the first condition gives 𝑏1 = 0. Based on the former deduction 𝑏𝑛+1 = (1/2) 𝑆𝑛′′(𝑥𝑛+1), so the second condition is 𝑏𝑛+1 = 0. By using equation (4.93.) and the two conditions, a system of equations can be constructed with 𝑛 + 1 equations and 𝑛 + 1 unknown coefficients, which can be solved for the 𝑏𝑖 (𝑖 = 2,3, … , 𝑛):

𝑏1 = 0
𝑏𝑖−1ℎ𝑖−1 + 2𝑏𝑖(ℎ𝑖 + ℎ𝑖−1) + 𝑏𝑖+1ℎ𝑖 = 3(𝑑𝑖+1 − 𝑑𝑖)/ℎ𝑖 − 3(𝑑𝑖 − 𝑑𝑖−1)/ℎ𝑖−1 ,   𝑖 = 2,3, … , 𝑛    (4.101.)
𝑏𝑛+1 = 0
By calculating 𝑏𝑖 values, the rest of the unknown coefficients can be
calculated in the same way, like in the case of clamped boundary conditions:
𝑑𝑖 s (𝑖 = 1,2, … , 𝑛 + 1) are equal to the input points, ℎ𝑖 s (𝑖 = 1,2, … , 𝑛) are
determined by the spacing, 𝑏𝑖 s are provided by equation (4.101.), 𝑐𝑖 s from
equation (4.90.) and 𝑎𝑖 s from equation (4.87.):

𝑎𝑖 = (𝑏𝑖+1 − 𝑏𝑖)/(3ℎ𝑖),   𝑖 = 1,2, … , 𝑛    (4.102.)

Example

The input points of the already known example are used again, but now interpolated with cubic splines with free boundary conditions.

Solution:
Because free boundary conditions are used: 𝑏1 = 0, 𝑏4 = 0. Consequently,
equation (4.101.) is friendlier now:
𝑏1 = 0
6𝑏2 + 2𝑏3 = 12,3 → 𝑏2 = 2,5548
2𝑏2 + 9𝑏3 = −8,52 → 𝑏3 = −1,5144
𝑏4 = 0
Then calculate the 𝑐𝑖 values:
𝑐1 = (𝑑2 − 𝑑1)/ℎ1 − (1/3)(2𝑏1 + 𝑏2)ℎ1 = −3,5516
𝑐2 = (𝑑3 − 𝑑2)/ℎ2 − (1/3)(2𝑏2 + 𝑏3)ℎ2 = −0,9968
𝑐3 = (𝑑4 − 𝑑3)/ℎ3 − (1/3)(2𝑏3 + 𝑏4)ℎ3 = 1,0840
Finally, the 𝑎𝑖 values are determined:
𝑎1 = (𝑏2 − 𝑏1)/(3ℎ1) = 0,8516
𝑎2 = (𝑏3 − 𝑏2)/(3ℎ2) = −0,6782
𝑎3 = (𝑏4 − 𝑏3)/(3ℎ3) = 0,2019
So, the 3 cubic splines are:
𝑆1(𝑥) = 0,8516(𝑥 − 2)^3 − 3,5516(𝑥 − 2) + 5;   2 ≤ 𝑥 ≤ 3
𝑆2(𝑥) = −0,6782(𝑥 − 3)^3 + 2,5548(𝑥 − 3)^2 − 0,9968(𝑥 − 3) + 2,3;   3 ≤ 𝑥 ≤ 5
𝑆3(𝑥) = 0,2019(𝑥 − 5)^3 − 1,5144(𝑥 − 5)^2 + 1,0840(𝑥 − 5) + 5,1;   5 ≤ 𝑥 ≤ 7,5
The difference between the clamped and free boundary conditions can be
observed in the following diagram:

5 Numerical Derivation
In everyday engineering problems, it is quite common that a known (or interpolated) function's slope or curvature has to be determined. Of course, higher derivatives could be needed too, but generally the first two derivatives are looked for. Contrary to the integration process, differentiation is relatively easy and fewer rules are applied; in most cases the calculation can even be carried out analytically for simpler functions and polynomials. In the case of more complex functions, on the other hand, the conventional method is not always efficient, since numerical methods can provide results in a shorter time with less computational effort. By discretizing the function, the derivatives of a function can be determined at certain points. This is an even more reasonable approach if input points are available rather than a function. Of course, the derivatives can also be calculated by applying an interpolation method on the input points and then differentiating the interpolated function, but there are more efficient methods, too. In this chapter, we are focusing on these latter solutions.
The point of the finite difference formulas is that at a given point 𝑥𝑖 the derivatives are approximated by using the neighbouring function values. In order to calculate the differences, Taylor series are used. Four examples are introduced in this chapter, but the rest can be derived in the same way. The differences between the formulas come from the utilization of different function values, which can originate from the left- or right-hand side of the point 𝑥𝑖. Accordingly, we call them forward, backward or central formulas.
The basis of numerical differentiation is calculus which is the study of
continuous change. Generally speaking, differentiation is concerning the rates
of change of curves while integration is concerning the summation of
quantities under curves. It has a wide range of applications.
The difference approximation can be defined by:
∆𝑦 𝑓(𝑥𝑖 + ∆𝑥) − 𝑓(𝑥𝑖 )
= (5.1.)
∆𝑥 ∆𝑥
where 𝑦 and 𝑓(𝑥) are alternative representations of the dependent variable
and 𝑥 is the independent variable. If we let ∆𝑥 approach zero, the difference
becomes a derivative:
𝑑𝑦 𝑓(𝑥𝑖 + ∆𝑥) − 𝑓(𝑥𝑖 )
= lim (5.2.)
𝑑𝑥 ∆𝑥→0 ∆𝑥
where 𝑑𝑦/𝑑𝑥 (sometimes designated as 𝑦′ or 𝑓′(𝑥𝑖)) is the first derivative of 𝑦 with respect to 𝑥 evaluated at 𝑥𝑖.
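
As a quick numerical illustration of equations (5.1.)–(5.2.) (a small sketch added here), the difference quotient of f(x) = sin x at x_i = 1 approaches the exact derivative cos(1) as Δx decreases:

f  = @(x) sin(x);
xi = 1;
for dx = [0.1 0.01 0.001 0.0001]
    approx = (f(xi+dx) - f(xi)) / dx;          % difference approximation (5.1.)
    fprintf('dx = %7.4f   dy/dx ~ %9.6f   error = %.2e\n', ...
            dx, approx, abs(approx - cos(xi)));
end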

5.1 2-Point Backward Difference
The value of 𝑓(𝑥𝑖−1) is approximated by generating the Taylor series around the point 𝑥𝑖, using the step ℎ = 𝑥𝑖 − 𝑥𝑖−1, in the following form:
𝑓(𝑥𝑖−1) = 𝑓(𝑥𝑖) − ℎ𝑓′(𝑥𝑖) + (1/2!)ℎ^2𝑓′′(𝑥𝑖) − (1/3!)ℎ^3𝑓′′′(𝑥𝑖) + ⋯    (5.3.)
Keeping only the linear term in the expression we get:
𝑓(𝑥𝑖−1) = 𝑓(𝑥𝑖) − ℎ𝑓′(𝑥𝑖) + (1/2!)ℎ^2𝑓′′(𝜁)    (5.4.)
where the third term is the remainder, with 𝑥𝑖−1 ≤ 𝜁 ≤ 𝑥𝑖. By solving equation (5.4.) it can be written:
𝑓′(𝑥𝑖) = (𝑓(𝑥𝑖) − 𝑓(𝑥𝑖−1))/ℎ + (1/2!)ℎ𝑓′′(𝜁)    (5.5.)
where the second term gives the truncation error. By neglecting this term the
approximation of the first derivative can be expressed. Of course, do not
forget that the truncation error is strongly dependent of the order of ℎ, and
can be expressed as 𝑂(ℎ), so the two-point backward difference equation can
be written as:

𝑓′(𝑥𝑖) = (𝑓(𝑥𝑖) − 𝑓(𝑥𝑖−1))/ℎ + 𝑂(ℎ)    (5.6.)
Since 𝜁 is not known exactly, neither is 𝑂(ℎ). But the smaller ℎ becomes, the smaller and more negligible 𝑂(ℎ) is.

5.2 2-Point Forward Difference


Similarly to the previous case, the value of 𝑓(𝑥𝑖+1) is approximated by generating the Taylor series around 𝑥𝑖 with the step ℎ = 𝑥𝑖+1 − 𝑥𝑖:
𝑓(𝑥𝑖+1) = 𝑓(𝑥𝑖) + ℎ𝑓′(𝑥𝑖) + (1/2!)ℎ^2𝑓′′(𝑥𝑖) + (1/3!)ℎ^3𝑓′′′(𝑥𝑖) + ⋯    (5.7.)
Keeping only the linear term:
𝑓(𝑥𝑖+1) = 𝑓(𝑥𝑖) + ℎ𝑓′(𝑥𝑖) + (1/2!)ℎ^2𝑓′′(𝜁)    (5.8.)
where the third term is still the remainder, and 𝑥𝑖 ≤ 𝜁 ≤ 𝑥𝑖+1 . By solving the
equation:

𝑓′(𝑥𝑖) = (𝑓(𝑥𝑖+1) − 𝑓(𝑥𝑖))/ℎ − (1/2!)ℎ𝑓′′(𝜁)    (5.9.)
The second term is still the truncation error. By neglecting it, similarly like in
the case of the backward formula, the 2-points forward formula can be
written as:

𝑓′(𝑥𝑖) = (𝑓(𝑥𝑖+1) − 𝑓(𝑥𝑖))/ℎ + 𝑂(ℎ)    (5.10.)
5.3 2-Point Central Difference


The story is the same: we are still using Taylor-series, but from now on the
quadratic term is also used.

𝑓(𝑥𝑖−1) = 𝑓(𝑥𝑖) − ℎ𝑓′(𝑥𝑖) + (1/2!)ℎ^2𝑓′′(𝑥𝑖) − (1/3!)ℎ^3𝑓′′′(𝜁);   𝑥𝑖−1 ≤ 𝜁 ≤ 𝑥𝑖    (5.11.)
and
𝑓(𝑥𝑖+1) = 𝑓(𝑥𝑖) + ℎ𝑓′(𝑥𝑖) + (1/2!)ℎ^2𝑓′′(𝑥𝑖) + (1/3!)ℎ^3𝑓′′′(𝜑);   𝑥𝑖 ≤ 𝜑 ≤ 𝑥𝑖+1    (5.12.)
By subtracting the first equation from the second we get:

𝑓(𝑥𝑖+1) − 𝑓(𝑥𝑖−1) = 2ℎ𝑓′(𝑥𝑖) + (1/3!)ℎ^3[𝑓′′′(𝜁) + 𝑓′′′(𝜑)]    (5.13.)
Finally, 𝑓′(𝑥𝑖 ) can be expressed easily:

𝑓′(𝑥𝑖) = (𝑓(𝑥𝑖+1) − 𝑓(𝑥𝑖−1))/(2ℎ) + 𝑂(ℎ^2)    (5.14.)
Okay, but what is this whole thing about? As can be observed, the truncation error depends on ℎ^2, so it will be significantly lower than in the case of the forward or backward difference formulas; this is therefore a more accurate approximation. Let's suppose that 𝑥1, 𝑥2, … , 𝑥𝑛 are given. Then the 2-point backward formula cannot be used at the point 𝑥1, since no data is available at 𝑥0; but it can be used at all the other points with 𝑂(ℎ) truncation error. The same thing is true for the 2-point forward difference: it can be used at every point except 𝑥𝑛, since there is no 𝑥𝑛+1. Similarly, the 2-point central difference cannot be used at 𝑥1 and 𝑥𝑛, but anywhere else, and with 𝑂(ℎ^2) truncation error. The properties of the difference formulas can be seen in Figure 20. These rules should be kept in mind in the case of such problems.

Figure 20: Representation of 2-Points Backward, Forward and Central


Difference formulas [3]

5.4 3-Points Backward Difference


Approximating the first derivative from 3 points starts with the same principle as in the case of the 2-point formulas: generating the Taylor series, expressing the value of 𝑓(𝑥𝑖−1) around the point 𝑥𝑖:
𝑓(𝑥𝑖−1) = 𝑓(𝑥𝑖) − ℎ𝑓′(𝑥𝑖) + (1/2!)ℎ^2𝑓′′(𝑥𝑖) − (1/3!)ℎ^3𝑓′′′(𝜁);   𝑥𝑖−1 ≤ 𝜁 ≤ 𝑥𝑖    (5.15.)
Then, 𝑓(𝑥𝑖−2) is approximated around the point 𝑥𝑖:
𝑓(𝑥𝑖−2) = 𝑓(𝑥𝑖) − (2ℎ)𝑓′(𝑥𝑖) + (1/2!)(2ℎ)^2𝑓′′(𝑥𝑖) − (1/3!)(2ℎ)^3𝑓′′′(𝜑);   𝑥𝑖−2 ≤ 𝜑 ≤ 𝑥𝑖    (5.16.)

By subtracting 4 times equation (5.15.) from equation (5.16.), the following expression can be written:
𝑓(𝑥𝑖−2) − 4𝑓(𝑥𝑖−1) = −3𝑓(𝑥𝑖) + 2ℎ𝑓′(𝑥𝑖) + (4/3!)ℎ^3𝑓′′′(𝜁) − (8/3!)ℎ^3𝑓′′′(𝜑)    (5.17.)
Then, 𝑓 ′ (𝑥𝑖 ) is expressed:

𝑓′(𝑥𝑖) = (𝑓(𝑥𝑖−2) − 4𝑓(𝑥𝑖−1) + 3𝑓(𝑥𝑖))/(2ℎ) − (1/3)ℎ^2𝑓′′′(𝜁) + (2/3)ℎ^2𝑓′′′(𝜑)    (5.18.)
Since the truncation error is still quadratically dependent on ℎ, the simpler form of the formula is:
𝑓′(𝑥𝑖) = (𝑓(𝑥𝑖−2) − 4𝑓(𝑥𝑖−1) + 3𝑓(𝑥𝑖))/(2ℎ) + 𝑂(ℎ^2)    (5.19.)
So, the 3-point backward difference formula approximates the first derivative at 𝑥𝑖 based on the function values at 𝑥𝑖, 𝑥𝑖−1 and 𝑥𝑖−2.

5.5 Numerical Difference Formulas


The second, third derivatives can be expressed similarly from the function
values with 𝑥𝑖 initial points, using Taylor-series. So, the derivation processes
are not detailed here, but the most commonly used formulas are summarised
in Table 5.

First derivative, 2-point formulas
Forward difference:   𝑦𝑖′ = (𝑦𝑖+1 − 𝑦𝑖)/ℎ
Central difference:   𝑦𝑖′ = (𝑦𝑖+1 − 𝑦𝑖−1)/(2ℎ)
Backward difference:  𝑦𝑖′ = (𝑦𝑖 − 𝑦𝑖−1)/ℎ
First derivative, 3-point formula
Forward difference:   𝑦𝑖′ = (−𝑦𝑖+2 + 4𝑦𝑖+1 − 3𝑦𝑖)/(2ℎ)
First derivative, 4-point formula
Central difference:   𝑦𝑖′ = (−𝑦𝑖+2 + 8𝑦𝑖+1 − 8𝑦𝑖−1 + 𝑦𝑖−2)/(12ℎ)
Second derivative, 3-point formulas
Forward difference:   𝑦𝑖′′ = (𝑦𝑖+2 − 2𝑦𝑖+1 + 𝑦𝑖)/ℎ^2
Central difference:   𝑦𝑖′′ = (𝑦𝑖+1 − 2𝑦𝑖 + 𝑦𝑖−1)/ℎ^2
Backward difference:  𝑦𝑖′′ = (𝑦𝑖 − 2𝑦𝑖−1 + 𝑦𝑖−2)/ℎ^2
Second derivative, 4-point formula
Forward difference:   𝑦𝑖′′ = (2𝑦𝑖 − 5𝑦𝑖+1 + 4𝑦𝑖+2 − 𝑦𝑖+3)/ℎ^2
Second derivative, 5-point formula
Central difference:   𝑦𝑖′′ = (−𝑦𝑖+2 + 16𝑦𝑖+1 − 30𝑦𝑖 + 16𝑦𝑖−1 − 𝑦𝑖−2)/(12ℎ^2)
Third derivative, 4-point formula
Forward difference:   𝑦𝑖′′′ = (𝑦𝑖+3 − 3𝑦𝑖+2 + 3𝑦𝑖+1 − 𝑦𝑖)/ℎ^3
Table 5: Numerical difference formulas
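
As an added sketch, a few of the formulas of Table 5 can be checked numerically on f(x) = e^(−x) sin(x/2) (the function used in the next example), comparing the 2-point forward, backward and central estimates of the first derivative at x_i = 1.4 with h = 0.2 against the analytical derivative:

h = 0.2;
x = 1.2:h:1.8;
y = exp(-x).*sin(x/2);                     % f(x) = e^(-x)*sin(x/2)
i = 2;                                     % evaluate the formulas at x_i = 1.4
fwd   = (y(i+1) - y(i)) / h;               % 2-point forward difference
bwd   = (y(i)   - y(i-1)) / h;             % 2-point backward difference
cent  = (y(i+1) - y(i-1)) / (2*h);         % 2-point central difference
exact = exp(-x(i))*(0.5*cos(x(i)/2) - sin(x(i)/2));   % analytical f'(x_i)
fprintf('forward %.4f  backward %.4f  central %.4f  exact %.4f\n', ...
        fwd, bwd, cent, exact)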

Example: MATLAB built-in functions

There are two functions in MATLAB that make our lives easier: the diff and
polyder.

Y = diff(X) calculates differences between adjacent elements of X along the


first array dimension whose size does not equal 1:
 If X is a vector of length m, then Y = diff(X) returns a vector of length m-1. The elements of Y are the differences between adjacent elements of X:
Y = [X(2)-X(1) X(3)-X(2) ... X(m)-X(m-1)]
 If X is a nonempty, nonvector p-by-m matrix, then Y = diff(X) returns a matrix of size (p-1)-by-m, whose elements are the differences between the rows of X:
Y = [X(2,:)-X(1,:); X(3,:)-X(2,:); ... X(p,:)-X(p-1,:)]
 If X is a 0-by-0 empty matrix, then Y = diff(X) returns a 0-by-0 empty matrix.

If X is a vector, the result will be also a vector, but one element shorter
than X.
By entering diff(X,n), the function executes differentiation recursively
n times, calculating the nth difference.

k = polyder(p) returns the derivative of the
 polynomial represented by the coefficients in p: 𝑘(𝑥) = 𝑑/𝑑𝑥 𝑝(𝑥)
 product of the polynomials a and b: 𝑘(𝑥) = 𝑑/𝑑𝑥 [𝑎(𝑥)𝑏(𝑥)]
 quotient of the polynomials a and b: 𝑞(𝑥)/𝑑(𝑥) = 𝑑/𝑑𝑥 [𝑎(𝑥)/𝑏(𝑥)]

Example [3]

Consider f(x) = e^(−x) sin(x/2), and x = 1.2, 1.4, 1.6, 1.8.

Solution:

>> h=0.2;
>> x = 1.2:h:1.8;
>> y = [0.1701 0.1589 0.1448 0.1295];
% Values of f at the discrete x values
>> y_prime = diff(y)./h
y_prime =
-0.0560 -0.0705 -0.0765
>> y_sec = diff(y,2)
y_sec =
-0.0029 -0.0012

The polyder function can be used as follows:


>> P = [2 0 -1 3];
>> polyder(P)
ans =
6 0 -1

A univariate function f(x) is given, and we are interested in an approximate value of f'(x). In the following we create an algorithm which computes a sequence of approximate values of the derivative in question, using the following finite difference approximation of f'(x):

𝑓(𝑥 + ℎ) − 𝑓(𝑥 − ℎ)
𝑓′(𝑥) ≈
2ℎ
where h is the initial stepsize.
The solution has two phases:
1: computing a sequence of approximations to f'(x) using several values of h,
2: utilising Richardson's extrapolation.
Consider the function 𝑓(𝑥) = 𝑒^(−𝑥²); the comparison should be done against the exact values of f'(x) at 𝑥 = 0.1, 0.2, … , 1.0. Furthermore, h = 0.01 and n = 10.

Phase 1:
function der = numder(fun, x, h, n, varargin)
% Approximation der of the first order derivative, at
% the point x, of a function named by the string fun.
% Parameters h and n are user supplied values of the
% initial stepsize and the number of performed
% iterations in the Richardson extrapolation.
% For functions that depend on parameters their values
% must follow the parameter n.
d = [];
for i=1:n
s = (feval(fun,x+h,varargin{:})-feval(fun,x-
h,varargin{:}))/(2*h);
d = [d;s];
h = .5*h;
end
l = 4;
for j=2:n
s = zeros(n-j+1,1);
s = d(j:n) + diff(d(j-1:n))/(l - 1);
d(j:n) = s;
l = 4*l;
end
der = d(n);

Phase 2:
function testnder(h, n)
% Test file for the function numder. The initial
% stepsize is h and the number of iterations is n.
% Function to be tested is f(x) = exp(-x^2).
format long

disp(' x numder exact')
disp(sprintf('\n_____________________________________
_______________'))
for x=.1:.1:1
s1 = numder('exp2', x, h, n);
s2 = derexp2(x);
disp(sprintf('%1.14f %1.14f %1.14f',x,s1,s2))
end

Create the exact derivative of the investigated function as well (the integrand function exp2, which returns exp(-x.^2), must also exist as an M-file; it is listed later in the Newton-Cotes example):


function y = derexp2(x)
% First order derivative of f(x) = exp(-x^2).
y = -2*x.*exp(-x.^2);

So the result is:

>> testnder(0.01, 10)


x numder exact
___________________________________________________
0.10000000000000 -0.19800996675001 -0.19800996674983
0.20000000000000 -0.38431577566308 -0.38431577566093
0.30000000000000 -0.54835871116311 -0.54835871116274
0.40000000000000 -0.68171503117432 -0.68171503117297
0.50000000000000 -0.77880078306967 -0.77880078307140
0.60000000000000 -0.83721159128023 -0.83721159128524
0.70000000000000 -0.85767695185697 -0.85767695185818
0.80000000000000 -0.84366787846708 -0.84366787846888
0.90000000000000 -0.80074451919839 -0.80074451920129
1.00000000000000 -0.73575888234189 -0.73575888234288

6 Numerical Integration
The process of integration is a widely used mathematical tool, which can occur in a wide range of different engineering problems, since calculating the area underneath a curve is fundamental to performance calculations and numerous other applications (thermal state changes, determining torque from shear forces, etc.). At the same time, it is also a common problem that the function itself, which determines the curve, is unknown – for example when evaluating experimental data –, so either an approximation can be applied and the computed curve used to perform the analyses, or a numerical integration method can be applied, which skips the approximation step and uses the input data directly to determine the searched area. Additionally, we should not forget the fact that integrating special functions is not easy at all, and there are also functions that cannot be integrated analytically.
As a reminder, the definite integral means the following problem:
∫_𝑎^𝑏 𝑓(𝑥)𝑑𝑥 = 𝐹(𝑏) − 𝐹(𝑎)    (6.1.)
where 𝐹(𝑥) is the primitive function of 𝑓(𝑥); the relationship between them is 𝐹′(𝑥) = 𝑓(𝑥). Without the boundaries 𝑎 and 𝑏 the problem is called an indefinite integral, since the number of solutions is infinite: 𝐹(𝑥) + 𝑐, 𝑐 ∈ ℝ (differentiating the constant term we get zero every time). In the case of the definite integral the 𝑓(𝑥) function is integrated over the [𝑎, 𝑏] interval, and the result is the exact area under the investigated curve. This latter problem is needed regularly in engineering practice.
The point of numerical integration is that instead of finding the primitive function of a curve (function), a set of discrete points and the related function values are combined in a determined way to approximate the actual value of the integral. So, the input points have to be discretized and the function values have to be considered. The latter can come from an experiment, or can be calculated if the function itself is known. Indeed, the advantages of this solution are that it is much easier than the original way (I mean really!), and usually the computational requirements and the number of operations to be performed are significantly lower, which can be really handy in the case of a more complex simulation or when using a large data set. This approach is often referred to as the strip method or the histogram method.

Example: MATLAB Built-in functions

One function is available in the current version of MATLAB that is capable of numerically integrating univariate functions: integral. In older versions, quad and quad8 had the same capability and operated in a similar way.

q = integral(fun,xmin,xmax) numerically integrates function fun from xmin


to xmax using global adaptive quadrature and default error tolerances.
q = integral(fun,xmin,xmax,Name,Value) specifies additional options with one or more Name,Value pair arguments. For example, specify 'Waypoints' followed by a vector of real or complex numbers to indicate specific points for the integrator to use.
The input parameters are:
- ’fun’: string, containing the name of the function to be integrated
- xmin and xmax: lower and upper limit
- Name: argument name
- Value: corresponding value

For example, consider the function 𝑓(𝑥) = 𝑒^(−𝑥²)(ln 𝑥)²:

>> fun = @(x) exp(-x.^2).*log(x).^2;
>> q = integral(fun,0,2)

Or if we want to create a function with a parameter, 𝑓(𝑥) = 1/(𝑥^3 − 2𝑥 − 𝑐):

>> fun = @(x,c) 1./(x.^3-2*x-c);


>> q = integral(@(x)fun(x,5),0,2)
q=
-0.460501533846733

6.1 Newton-Cotes Formulas
The most commonly used numerical integration formulas are the Newton-
Cotes formulas, which can be divided into two categories: opened and closed
formulas. The main difference between them is relatively easy-to-understand:
in the case of open formulas the endpoints of the interval are not used during
the computation, while the closed formulas incorporate the endpoints, too.
An open formula is, for example, the Gauss quadrature formula, while the trapezoidal and Simpson rules are closed formulas.
The initial idea of the Newton-Cotes formulas is really simple: instead of integrating the real function, it is replaced with a simple polynomial (a zeroth, first, second or third degree polynomial), which is then integrated. This process is possible in two ways: if the function itself is available, then input points are generated by discretization, a curve is fitted by interpolation, and finally that curve is integrated; or if the points are already available, then the first step can be skipped.

6.1.1 Rectangular Rule


One of the easiest methods is the rectangular rule. In this case, the integral value ∫_𝑎^𝑏 𝑓(𝑥)𝑑𝑥 is approximated by creating a rectangle between the endpoints 𝑎 and 𝑏, which gives the width of the rectangle: 𝑏 − 𝑎. The only question is how to determine the height of the rectangle. Actually, the rectangular rule immediately gives 3 possibilities: we can take the function value at the left-hand endpoint of the interval, at the right-hand endpoint, or at the midpoint of the interval. That is why they are called the left-hand rule, the right-hand rule and the midpoint rule. Although the method itself is really simple, it can be assumed that the accuracy is not the best. As Figure 21 illustrates, if the [𝑎, 𝑏] interval is wide and, additionally, the 𝑓(𝑥) function has significant changes over it, then all of the rectangular rules give high error. So, if we need precise integral values, the rectangular rule is not the best solution; it can give a rough estimation of the order of the result, but used globally over the interval it is not the best choice: the solution should not be accepted.
However, this might give us an idea of how to increase the accuracy: if this method were applied not only once (globally), but multiple times along the interval, by subdividing it, the solution's accuracy might increase. This method is the so-called composite rectangular method.

Figure 21: Illustration of the rectangular rule

First of all, the [𝑎, 𝑏] interval is divided into 𝑛 pieces, where
𝑎 = 𝑥0 < 𝑥1 < ⋯ < 𝑥𝑛−1 < 𝑥𝑛 = 𝑏
so subintervals of width ∆𝑥𝑖 are generated. Then the integral over the whole interval can be interpreted as the sum of the contributions over the subintervals, and can be approximated as:
∫_𝑎^𝑏 𝑓(𝑥)𝑑𝑥 ≈ ∑_{𝑖=1}^{𝑛} 𝑓(𝑥𝑖^*)∆𝑥𝑖    (6.2.)
The most important question is, of course, how to choose the 𝑥𝑖^* value for each subinterval – this is exactly why the rectangular rule has been introduced. The three simplest choices look like the following:
- Left-hand endpoint:
𝐿𝑛 = ∑_{𝑖=1}^{𝑛} 𝑓(𝑥𝑖−1)∆𝑥𝑖    (6.3.)
- Right-hand endpoint:
𝑅𝑛 = ∑_{𝑖=1}^{𝑛} 𝑓(𝑥𝑖)∆𝑥𝑖    (6.4.)
- Midpoint:
𝑆𝑀𝑛 = ∑_{𝑖=1}^{𝑛} 𝑓((𝑥𝑖−1 + 𝑥𝑖)/2)∆𝑥𝑖    (6.5.)
Generally, the simpler methods and formulas are preferred, so let us transform these formulas into a more convenient form. If the [𝑎, 𝑏] interval is divided equidistantly, meaning the 𝑛 subintervals have the same width (ℎ = (𝑏 − 𝑎)/𝑛), the formulas have friendlier expressions:
- Left-hand endpoint:
𝐿𝑛 = ((𝑏 − 𝑎)/𝑛) ∑_{𝑖=0}^{𝑛−1} 𝑓(𝑥𝑖)    (6.6.)
- Right-hand endpoint:
𝑅𝑛 = ((𝑏 − 𝑎)/𝑛) ∑_{𝑖=1}^{𝑛} 𝑓(𝑥𝑖)    (6.7.)
- Midpoint:
𝑆𝑀𝑛 = ((𝑏 − 𝑎)/𝑛) ∑_{𝑖=1}^{𝑛} 𝑓((𝑥𝑖−1 + 𝑥𝑖)/2)    (6.8.)

Figure 22: An example for the left-hand and right-hand rule 1

1http://www.inf.u-szeged.hu/~kgelle/sites/default/files/upload/11_numerikus_integralas_0.pdf

Without proving it mathematically, the composite rectangular rule has the following error for the left-hand (and right-hand) rule:
𝐸 = [(1/2)(𝑏 − 𝑎)𝑓̄′] ℎ = 𝑂(ℎ)    (6.9.)
where 𝑓̄′ stands for the average value of 𝑓′ over the full interval, obtained from the Taylor series based error estimation.
The composite rectangular method has the following error for the midpoint rule:
𝐸 = [(1/24)(𝑏 − 𝑎)𝑓̄′′] ℎ^2 = 𝑂(ℎ^2)    (6.10.)
where 𝑓̄′′ is the estimated (average) value of 𝑓′′ over the full [𝑎, 𝑏] interval.
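
A short sketch (added here as an illustration) of the composite left-hand, right-hand and midpoint rules (6.6.)–(6.8.) applied to ∫_1^2 (1/x) dx = ln 2 on an equidistant grid:

f = @(x) 1./x;
a = 1;  b = 2;  n = 20;
x = linspace(a, b, n+1);                  % n equal sub-intervals
h = (b - a)/n;
L = h * sum(f(x(1:n)));                   % left-hand rule  (6.6.)
R = h * sum(f(x(2:n+1)));                 % right-hand rule (6.7.)
M = h * sum(f((x(1:n) + x(2:n+1))/2));    % midpoint rule   (6.8.)
fprintf('left %.6f  right %.6f  midpoint %.6f  exact %.6f\n', L, R, M, log(2))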

Example: Built-in MATLAB functions

MATLAB contains a built-in function which approximates the integral of f(x) by using the trapezoidal method:

Q = trapz(Y) returns the approximate integral of Y via the trapezoidal


method with unit spacing. The size of Y determines the dimension to
integrate along:
 If Y is a vector, then trapz(Y) is the approximate integral of Y.
 If Y is a matrix, then trapz(Y) integrates over each column and
returns a row vector of integration values.
 If Y is a multidimensional array, then trapz(Y) integrates over the
first dimension whose size does not equal 1. The size of this
dimension becomes 1, and the sizes of other dimensions remain
unchanged

For example, if we want to approximate ∫_1^2 (1/𝑥)𝑑𝑥 with n = 5 and n = 10 points, we use:

>> x5=linspace(1,2,5); x10=linspace(1,2,10);


>> y5=1./x5; y10=1./x10;
>> area5=trapz(x5,y5), area10=trapz(x10,y10)
area5 =
0.697023809523809

area10 =
0.693917602005837
Similarly, approximating 𝑓(𝑡) = ∫_0^𝑡 𝑒^(−𝜏²)𝑑𝜏 when n = 10 and t = 2:
>> t=linspace(0,2,10); y=exp(-t.^2); area=trapz(t,y)
area =
0.881782381062206

Example [3]

Create the closed type Newton-Cotes formula as a MATLAB user function, and
define the different methods’ error.

Solution:
The nodes of the n-point formula are defined as 𝑥𝑘 = 𝑎 + (𝑘 − 1)ℎ, 𝑘 = 1,2, … , 𝑛, where ℎ = (𝑏 − 𝑎)/(𝑛 − 1), 𝑛 > 1. The weights of the quadrature formula are determined from the condition that the following equation is satisfied for the monomials 𝑓(𝑥) = 1, 𝑥, … , 𝑥^(𝑛−1):
∫_𝑎^𝑏 𝑓(𝑥)𝑑𝑥 = ∑_{𝑘=1}^{𝑛} 𝑤𝑘 𝑓(𝑥𝑘)
So, create the functions:

function [s, w, x] = cNCqf(fun, a, b, n, varargin)


% Numerical approximation s of the definite integral
% of f(x). fun is a string containing the name of the
% integrand f(x). Integration is over the interval
% [a, b]. Method used:
% n-point closed Newton-Cotes quadrature formula.
% The weights and the nodes of the quadrature formula
% are stored in vectors w and x, respectively.
if n < 2
error(' Number of nodes must be greater than 1')
end
x = (0:n-1)/(n-1);
f = 1./(1:n);
V = vander(x);
V = rot90(V);

w = V\f';
w = (b-a)*w;
x = a + (b-a)*x;
x = x';
s = feval(fun,x,varargin{:});
s = w'*s;

So, create the following function:

function w = exp2(x)
% The weight function w of the Gauss-Hermite
% quadrature formula.
w = exp(-x.^2);

Then:
>> approx_v = [];
>> for n =2:4
approx_v = [approx_v; (2/sqrt(pi))*cNCqf('exp2', 0, 1, n)];
end
>> approx_v
approx_v =
0.771743332258054
0.843102830042981
0.842890571431721
>> exact_v = erf(1)
exact_v =
0.842700792949715

The errors of the different methods are clearly visible.

6.1.2 Trapezoidal Rule


As it was seen in the previous subchapter, depending on the changes of the function, the rectangular rules can over- or underestimate the integral value. So it seems reasonable to take the average of the left-hand and right-hand rules, which is actually nothing else but the area of a trapezoid defined by four points. This way, a straight line is defined between the function values 𝑓(𝑎) and 𝑓(𝑏): which is actually a first degree polynomial. The area under the line can be calculated really simply. The average of the left-hand and right-hand rules is:

𝑇𝑛 = (𝐿𝑛 + 𝑅𝑛)/2    (6.11.)
From which the geometrical interpretation (the straight line connecting the two endpoints):
𝐴 = 𝑓(𝑎) + ((𝑓(𝑏) − 𝑓(𝑎))/(𝑏 − 𝑎))(𝑥 − 𝑎)    (6.12.)
This could lead to the same problem as in the earlier case: looking for a global
solution over one big interval ([𝑎, 𝑏]) leads to a relatively big error (see Figure
23 left-hand side). The solution is also similar, like in the previous subchapter:
the same rule should be applied over more, narrower subintervals, so we can
write the rule for these individual pieces (Figure 23 right-hand side):
𝐴𝑖 = ((𝑓(𝑥𝑖−1) + 𝑓(𝑥𝑖))/2) ∆𝑥𝑖    (6.13.)
This way, the expression of the composite trapezoidal rule can be generated:
𝑇𝑛 = ∑_{𝑖=1}^{𝑛} ((𝑓(𝑥𝑖−1) + 𝑓(𝑥𝑖))/2) ∆𝑥𝑖    (6.14.)
If the subdivision of the interval is done equidistantly, keeping in mind that 𝑥𝑖+1 − 𝑥𝑖 = ℎ (𝑖 = 1,2, … , 𝑛), then the formula is:
𝑇𝑛 = (ℎ/2) ∑_{𝑖=1}^{𝑛} [𝑓(𝑥𝑖) + 𝑓(𝑥𝑖+1)] = (ℎ/2)[𝑓(𝑎) + 2𝑓(𝑥2) + 2𝑓(𝑥3) + ⋯ + 2𝑓(𝑥𝑛) + 𝑓(𝑏)]    (6.15.)

Figure 23: The simple (left) and composite (right) trapezoidal rules

The error can be written as:


𝐸 = [−(1/12)(𝑏 − 𝑎)𝑓̄′′] ℎ^2 = 𝑂(ℎ^2)    (6.16.)

where 𝑓̄′′ is the estimated value of 𝑓′′ over the interval [𝑎, 𝑏]. Therefore, the error 𝑂(ℎ^2) is comparable with that of the midpoint method and superior to the rectangular rules using the endpoints, whose error is 𝑂(ℎ).

Example

Create a function, which uses the composite trapezoidal rule to estimate the
value of a definite integral.

Solution:
function I = TrapComp(f,a,b,n)
%
% TrapComp estimates the value of the integral of
% f(x) from a to b by using the composite trapezoidal
% rule applied to n equal-length subintervals.
% I = TrapComp(f,a,b,n) where f is an inline function
% representing the integrand, a and b are the limits
% of integration, n is the number of equal-length
% subintervals in [a,b], I is the integral estimate.
%
h = (b-a)/n; I = 0;
x = a:h:b;
for i = 2:n,
I = I + 2*f(x(i));
end
I = I + f(a) + f(b);
I = I*h/2;

The execution is:


>> f = inline('1/(x+2)');
>> I = TrapComp(f, -1,1,8)
I=
1.103210678210678
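For reference, the exact value of this integral is ∫_{−1}^{1} 1/(x+2) dx = ln 3 ≈ 1.0986, so the composite trapezoidal estimate above with n = 8 carries an error of roughly 0.4 %. A quick check in the command window (verification sketch only):

>> I_exact = log(3)
I_exact =
   1.098612288668110
>> rel_err = abs(I - I_exact)/I_exact*100;   % about 0.42 % for n = 8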

6.2 Quadrature Formulas


With the trapezoidal rule we have reached first-degree polynomials as interpolating functions, so the next step is to apply higher-degree polynomials, from which a higher accuracy can be expected. In this chapter, the 1/3 and 3/8 Simpson methods are introduced, which fit quadratic and cubic polynomials to the input points. Of course, the goal is still the same: to develop a fast, easy-to-use method for numerical integration.
Before constructing the Simpson rules, it is important to clarify: from now on
the used formulas are quadrature formulas. The definite integral of an 𝑓
function over [𝑎, 𝑏] interval can be approximated with a quadrature
formula, if:
- the input points (base points) are known, or can be calculated
(𝑥1 , 𝑥2 , … , 𝑥𝑛 ), for which 𝑎 = 𝑥1 < 𝑥2 < ⋯ < 𝑥𝑛 = 𝑏,
- a 𝑤𝑖 weight is defined for each 𝑥𝑖 ,
- in this case the quadrature formula is:
Q_n(f) = ∑_{i=1}^{n} w_i f(x_i)    (6.17.)
Thus, it is the sum of the weighted function values taken at the base points.

A quadrature formula can be expressed in the following way:

- the base points (x_1, …, x_n) are defined,
- the value of the quadrature formula is the integral over [a, b] of the Lagrange polynomial fitted to the points (x_i, f(x_i)).
If this form is used, the expression is called an interpolating quadrature formula. Since the Lagrange polynomial is already known, the formula can be written in the following way:
∑_{i=1}^{n} f(x_i) L_i(x)    (6.18.)

where L_i(x) is the ith Lagrange base polynomial. The integral value is:
∫_a^b ∑_{i=1}^{n} f(x_i) L_i(x) dx = ∑_{i=1}^{n} ( f(x_i) ∫_a^b L_i(x) dx )    (6.19.)
That is, the interpolating quadrature formula is nothing else than a quadrature formula with the weights w_i = ∫_a^b L_i(x) dx.
Newton-Cotes formulas

Since the quality and distribution of the basepoints are essential in the case of interpolation, the same is true for numerical integration. In the case of Lagrange interpolation the weights can be generated in a straightforward way.

If the investigated interval [𝑎, 𝑏] is divided equally, meaning


equidistantly, and the endpoints are also basepoints, then the applied
interpolating quadrature formula is called Newton-Cotes formula.

Composite Quadrature formulas

The accuracy of a numerical integration process does not necessarily improve by increasing the number of basepoints (in most cases it does, but not always). Just like in the case of interpolation, high-degree polynomials can be capricious. In order to eliminate this phenomenon, it is recommended to divide the interval [a, b] into n subintervals, to which the quadrature formula can be applied one by one (like the composite trapezoidal or either of the Simpson rules). These methods are the so-called composite quadrature formulas.

As a side note, this chapter could be much more detailed, since the theoretical fundamentals are quite involved, but this book focuses rather on the engineering applications.

6.2.1 1/3 Simpson Rule


Contrary to the rectangular or trapezoidal rules, the 1/3 Simpson rule
applies quadratic polynomials to interpolate the basepoints. Three
points are required to fit a parabolic curve on them, that are 𝑥1 = 𝑎,
𝑥2 = (𝑎 + 𝑏)/2 and 𝑥3 = 𝑏, as Figure 24 left-hand side demonstrates.

Figure 24: The simple (left) and composite (right) 1/3 Simpson rules

The simplest method to fit a parabolic curve on three points is to use the
second degree Lagrange interpolating polynomial that can be constructed as
follows:
p_2(x) = f(x_1) (x − x_2)(x − x_3)/[(x_1 − x_2)(x_1 − x_3)] + f(x_2) (x − x_1)(x − x_3)/[(x_2 − x_1)(x_2 − x_3)] + f(x_3) (x − x_1)(x − x_2)/[(x_3 − x_1)(x_3 − x_2)]    (6.20.)

So, the formula to be integrated is:
∫_a^b f(x) dx ≈ ∫_a^b p_2(x) dx    (6.21.)

By substituting x_1 = a, x_2 = (a + b)/2, x_3 = b into the polynomial p_2(x) and carrying out the integration from a to b, we get the following expression, which is called the simple 1/3 Simpson rule:
∫_a^b f(x) dx ≈ (h/3) [f(x_1) + 4f(x_2) + f(x_3)]    (6.22.)
where h = (b − a)/2. The name of the method comes from the 1/3 constant, of course.
In the case of the composite 1/3 Simpson rule the interval [a, b] is divided into n subintervals, for which n + 1 (!) points are used: a = x_1, x_2, …, x_n, x_{n+1} = b. Although the widths of the subintervals need not be equal, an equidistant division (h = (b − a)/n) makes the process much easier. Since three points are required to construct a parabolic curve, it is essential that the 1/3 Simpson rule can only be applied if the number of subintervals is even. This way the composite 1/3 Simpson rule can be written in the following way:
∫_a^b f(x) dx ≈ (h/3)[f(x_1) + 4f(x_2) + f(x_3)] + (h/3)[f(x_3) + 4f(x_4) + f(x_5)] + ⋯ + (h/3)[f(x_{n−1}) + 4f(x_n) + f(x_{n+1})]    (6.23.)

As can be seen, in the sum the function values with even indexes (x_2, x_4, …, x_n) are counted with weight four, the function values with odd indexes (x_3, x_5, …, x_{n−1}) with weight two, while the first and last function values are counted only once. So, the 1/3 Simpson rule can be expressed in a more compact form:
∫_a^b f(x) dx ≈ (h/3) {f(x_1) + 4 ∑_{i=2,4,6,…}^{n} f(x_i) + 2 ∑_{j=3,5,7,…}^{n−1} f(x_j) + f(x_{n+1})}    (6.24.)

Finally, let us see the main reason why the 1/3 Simpson rule has become so widespread. To answer this question, we have to determine the accuracy of the method, so let us look at the error term:
E = [−(1/180)(b − a) f̄^{(4)}] h⁴ = O(h⁴)    (6.25.)
where f̄^{(4)} is the estimated value of f^{(4)} over the full interval [a, b]. This means the 1/3 Simpson rule has an accuracy two orders higher in h (O(h⁴)) than the composite trapezoidal rule (O(h²)), which makes it really favourable.

6.2.2 3/8 Simpson Rule


This is a development of the 1/3 Simpson rule, in which not quadratic but cubic polynomials are applied to interpolate the f(x) function values. In order to construct the third-degree polynomial, four equally spaced points are required: x_1 = a, x_2 = (2a + b)/3, x_3 = (a + 2b)/3, x_4 = b, where h = (b − a)/3, as Figure 25 demonstrates.
Figure 25: The 3/8 Simpson Rule

The third-degree Lagrange polynomial constructed on the 4 basepoints can be written in the following form:
p_3(x) = f(x_1) (x − x_2)(x − x_3)(x − x_4)/[(x_1 − x_2)(x_1 − x_3)(x_1 − x_4)] + f(x_2) (x − x_1)(x − x_3)(x − x_4)/[(x_2 − x_1)(x_2 − x_3)(x_2 − x_4)] + f(x_3) (x − x_1)(x − x_2)(x − x_4)/[(x_3 − x_1)(x_3 − x_2)(x_3 − x_4)] + f(x_4) (x − x_1)(x − x_2)(x − x_3)/[(x_4 − x_1)(x_4 − x_2)(x_4 − x_3)]    (6.26.)

By substituting this polynomial back into the integral to be evaluated:
∫_a^b f(x) dx ≈ ∫_a^b p_3(x) dx    (6.27.)
If the values x_1, x_2, x_3, x_4 are expressed with the parameters a and b in the polynomial p_3(x), then the 3/8 Simpson rule is obtained:
∫_a^b f(x) dx ≈ (3h/8) [f(x_1) + 3f(x_2) + 3f(x_3) + f(x_4)]    (6.28.)
Of course, the name comes from the 3/8 constant multiplier.
As before, applying the rule globally over one large interval is not accurate enough, thus the 3/8 Simpson method should also be applied as a composite algorithm, where the investigated interval [a, b] is divided by n + 1 base points: a = x_1, x_2, …, x_n, x_{n+1} = b. It is expedient to apply an equidistant division. Since the 3/8 Simpson method applies third-degree polynomials and 4 points are required to construct each of them, the number of subintervals generated during the division of [a, b] has to be divisible by three. After the division, the areas of the individual subintervals can be summed as:
∫_a^b f(x) dx ≈ (3h/8)[f(x_1) + 3f(x_2) + 3f(x_3) + f(x_4)] + (3h/8)[f(x_4) + 3f(x_5) + 3f(x_6) + f(x_7)] + ⋯ + (3h/8)[f(x_{n−2}) + 3f(x_{n−1}) + 3f(x_n) + f(x_{n+1})]    (6.29.)
If the terms in the previous expression are merged, it can be seen that the terms with indexes 2, 5, 8, … and 3, 6, 9, … are summed and multiplied by 3, while the terms with indexes 4, 7, 10, … are summed and multiplied by 2. So, the expression can be written in a more compact form as follows:
∫_a^b f(x) dx ≈ (3h/8) {f(x_1) + 3 ∑_{i=2,5,8,…}^{n−1} [f(x_i) + f(x_{i+1})] + 2 ∑_{j=4,7,10,…}^{n−2} f(x_j) + f(x_{n+1})}    (6.30.)

This is the general form of the 3/8 Simpson method, from which the error of the method can be determined as:
E = [−(1/80)(b − a) f̄^{(4)}] h⁴ = O(h⁴)    (6.31.)
where f̄^{(4)} is the estimated value of f^{(4)} over the full interval [a, b]. So, compared to the trapezoidal method, which has O(h²) accuracy, the 3/8 Simpson method gives O(h⁴) accuracy, which is comparable with the accuracy of the 1/3 Simpson method.

Example [3]

Create the closed type Newton-Cotes formula as a MATLAB user function, and
define the different methods’ error.

Solution:
The nodes of the n-point formula are defined as x_k = a + (k − 1)h, k = 1, 2, …, n, where h = (b − a)/(n − 1), n > 1. The weights of the quadrature formula are determined from the conditions that the following equations are satisfied for the monomials f(x) = 1, x, …, x^{n−1}:
∫_a^b f(x) dx = ∑_{k=1}^{n} w_k f(x_k)
So, create the function:
function [s, w, x] = cNCqf(fun, a, b, n, varargin)
% Numerical approximation s of the definite integral
% of f(x). fun is a string containing the name of the
% integrand f(x).
% Integration is over the interval [a, b].
% Method used:
% n-point closed Newton-Cotes quadrature formula.
% The weights and the nodes of the quadrature formula
% are stored in vectors w and x, respectively.
if n < 2
error(' Number of nodes must be greater than 1')
end
x = (0:n-1)/(n-1);
f = 1./(1:n);
V = vander(x);
V = rot90(V);
w = V\f';
w = (b-a)*w;
x = a + (b-a)*x;
x = x';
s = feval(fun,x,varargin{:});
s = w'*s;

The accuracy of the formulas is checked on the error function:
erf(x) = (2/√π) ∫_0^x e^{−t²} dt
Fortunately, there is a built-in function in MATLAB which evaluates it: erf(x). The integral is approximated at x = 1 using the closed Newton-Cotes quadrature formulas with n = 2 (trapezoidal rule), n = 3 (Simpson's rule), and n = 4 (Boole's rule).
So, create the following function:
function w = exp2(x)
% The weight function w of the Gauss-Hermite
% quadrature formula.
w = exp(-x.^2);

Then:
>> approx_v = [];
>> for n =2:4
approx_v = [approx_v; (2/sqrt(pi))*cNCqf('exp2', 0, 1, n)];
end
>> approx_v
approx_v =
0.771743332258054
0.843102830042981
0.842890571431721
>> exact_v = erf(1)
exact_v =
0.842700792949715

The errors of different methods are clearly visible.
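To make the comparison explicit, the deviations from the exact value can also be printed directly (a quick check using the variables already in the workspace):

>> errors = approx_v - exact_v
% roughly -7.1e-2, 4.0e-4 and 1.9e-4 for n = 2, 3 and 4, respectively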

Example [3]
Evaluate the following integral with the 1/3 Simpson rule, using n = 8:
∫_{−1}^{1} 1/(x + 2) dx

Solution:
function I = Simpson(f,a,b,n)
%
% Simpson estimates the value of the integral of f(x)
% from a to b by using the composite Simpson’s 1/3
% rule applied to n equal-length subintervals.
%
% I = Simpson(f,a,b,n) where
%
% f is an inline function representing the integrand,
% a, b are the limits of integration,
% n is the (even) number of subintervals,
%
% I is the integral estimate.
h = (b-a)/n;
x = a:h:b;
I = 0;
for i = 1:2:n,
I = I + f(x(i)) + 4*f(x(i+1)) + f(x(i+2));
end
I = I*h/3;    % Scale the weighted sum by h/3

Now create the analysed function, and use the defined function:
>> f = inline('1/(x+2)');
>> I = Simpson(f, -1,1,8)
I =
1.098725348725349

6.3 Romberg Integration


In the previous subchapter, we only focused on those integration methods,
where the functions were represented by tabular data. But in several cases,
not the input points are available, but the equation of the function is known.
When the 𝑓(𝑥) function’s analytical form is available, two numerical
approximations can be combined in order to get a third, more accurate
estimation. The combination is done by using the Richardson’s extrapolation.
The errors of trapezoidal rule (equation (6.16.)) and 1/3 Simpson rule
(equation (6.25.)) were introduced already. In both cases, the precision can
be increased by using higher 𝑛 values in [𝑎, 𝑏] interval, but of course, it takes
more computational effort, too. The Romberg integration gives a more
precise estimation than these two methods, but using less computational
efforts.
The theory starts with Richardson's extrapolation, which states that two approximations with O(h²) error can be combined to create an estimate with O(h⁴) error. Likewise, two estimates with O(h⁴) error can be combined into an approximation with O(h⁶) error; that is, Richardson's extrapolation increases the (even) order of accuracy of an estimate by two.
So, going back to the trapezoidal error (equation (6.16.)), the expression can
be written as:
I ≈ I_h − C h²    (6.32.)
since f̄'' is independent of h. If the trapezoidal rule is used with two different subinterval widths, the errors are:
I ≈ I_{h1} − C h_1²,    I ≈ I_{h2} − C h_2²    (6.33.)
C can be eliminated, so the integral estimate is:
I ≈ [(h_1/h_2)² I_{h2} − I_{h1}] / [(h_1/h_2)² − 1]    (6.34.)
This approximation has an error of O(h⁴). Using the choice h_1 = h and h_2 = h/2, and knowing that the error of the method is O(h⁴), we can write:
I = (4/3) I_{h/2} − (1/3) I_h + O(h⁴)    (6.35.)
The same deduction can be carried out for methods with different error
orders. In general form, the extrapolation can be written as:
I_{i,j} = [4^{j−1} I_{i+1,j−1} − I_{i,j−1}] / (4^{j−1} − 1)    (6.36.)
Using this formula, a scheme can be created where the entries I_{1,1}, I_{2,1}, …, I_{m,1} are written in the first column; these are composite trapezoidal estimates with n, 2n, …, 2^{m−1}·n subintervals. The next column contains the entries I_{1,2}, I_{2,2}, …, I_{m−1,2}, each of which combines two successive entries of the first column and gives a more accurate approximation. The process is continued until the last column, which contains only one entry: I_{1,m}. This is the so-called Romberg iteration scheme.

Figure 26: Romberg iteration scheme

Example [3]

Create a function that uses Romberg integral estimation.

Solution:
The defined function is:
function I = Romberg(f,a,b,n,n_levels)
%
% Romberg uses the Romberg integration scheme to find
% integral estimates at different levels of accuracy.
%
% I = Romberg(f,a,b,n,n_levels) where
%
% f is an inline function representing the integrand,
% a and b are the limits of integration,
% n is the initial number of equal-length
% subintervals in [a,b],
% n_levels is the number of accuracy levels,
%
% I is the matrix of integral estimates.
%
I = zeros(n_levels,n_levels); % Pre-allocate
% Calculate the first-column entries by using the
% composite trapezoidal rule, where the number of
% subintervals is doubled going from one element
% to the next.
for i = 1:n_levels,
n_intervals = 2^(i-1)*n;
I(i,1) = TrapComp(f,a,b,n_intervals);
end
% Starting with the second level, use the Romberg scheme to
% generate the remaining entries of the table.
for j = 2:n_levels,
for i = 1:n_levels - j+1,
I(i,j) = (4^(j-1)*I(i+1,j-1)-I(i,j-1))/(4^(j-1)-1);
end
end

For testing, use a simple function: f(x) = (x² + 3x)² over the interval [0, 1], with n = 2 and 3 levels of accuracy:
>> format long
>> f = inline('(x^2+3*x)^2');
>> I = Romberg(f,0,1,2,3)
I =
5.531250000000 4.700520833333 4.700000000000
4.908203125000 4.700032552083 0
4.752075195313 0 0
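The most accurate entry is I(1,3), which can be compared with the exact value ∫_0^1 (x² + 3x)² dx = 1/5 + 3/2 + 3 = 4.7. In newer MATLAB releases the same check can also be made numerically (verification sketch only):

>> f_check = @(x) (x.^2 + 3*x).^2;   % vectorized form of the test integrand
>> integral(f_check, 0, 1)            % returns 4.7 up to machine precision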

6.4 Gaussian Quadrature


The Gaussian quadrature applies a weighted sum of values of function 𝑓(𝑥) at
numerous points in interval [𝑎, 𝑏] to estimate the integral. Of course, the
weights and points are determined in a way to minimise the error.
The Gauss formulas estimate:
∫_a^b p(x) f(x) dx ≈ ∑_{k=1}^{n} w_k f(x_k)    (6.37.)

where 𝑝(𝑥) is the weight function. The most common weight functions are
the following:
weight p(x)        interval [a,b]    Quadrature name
1                  [−1, 1]           Gauss-Legendre
1/√(1 − x²)        [−1, 1]           Gauss-Chebyshev
e^(−x)             [0, ∞)            Gauss-Laguerre
e^(−x²)            (−∞, ∞)           Gauss-Hermite
Table 6: The most commonly used Gaussian Quadrature weight functions
The weights of the Gauss formulas are all positive and the nodes are the roots
of the class of polynomials that are orthogonal, with respect to the given
weight function p(x), on the associated interval.
For example, the Gauss-Legendre quadrature formula is written as:
∫_{−1}^{1} f(x) dx ≈ ∑_{k=1}^{n} w_k f(x_k)    (6.38.)
This equation contains 2𝑛 unknowns: 𝑛 nodes and 𝑛 weights. So, the missing
equations can be generated by fitting the integral for functions 𝑓(𝑥) =
1, 𝑥, 𝑥 2 , … , 𝑥 2𝑛−1 . The weights are calculated as:
w_k = ∫_{−1}^{1} ∏_{j=1, j≠k}^{n} (x − x_j)/(x_k − x_j) dx    (6.39.)
The 𝑥1 , 𝑥2 , … , 𝑥𝑛 Gauss nodes are the zeros of the 𝑛th degree Legendre
polynomial.

Example [3]

Create a function, which uses Gauss-Legendre or Gauss-Chebyshev formula


with n nodes.

Solution:
function [s, w, x] = Gquad1(fun, a, b, n, type, varargin)
% Numerical integration using either the Gauss-
% Legendre (type = 'L') or the Gauss-Chebyshev (type
% = 'C') quadrature with n (n > 0) nodes.
% fun is a string representing the name of the
% function that is integrated from a to b. For the
% Gauss - Chebyshev quadrature it is assumed that
% a = -1 and b = 1.
% The output parameters s, w, and x hold the computed
% approximation of the integral, list of weights, and
% the list of nodes, respectively.
d = zeros(1,n-1);
if type == 'L'
k = 1:n-1;
d = k./(2*k - 1).*sqrt((2*k - 1)./(2*k + 1));
fc = 2;
J = diag(d,-1) + diag(d,1);
[u,v] = eig(J);
[x,j] = sort(diag(v));
w = (fc*u(1,:).^2)';
w = w(j)';
w = 0.5*(b - a)*w;
x = 0.5*((b - a)*x + a + b);
else
x = cos((2*(1:n) - (2*n + 1))*pi/(2*n))';
w(1:n) = pi/n;
end
f = feval(fun,x,varargin{:});

s = w*f(:);
w = w';

Let’s see how it works:


>> approx_v = [];
for n=2:8
approx_v = [approx_v; (2/sqrt(pi))*Gquad1('exp2', 0, 1, n, 'L')];
end
>> approx_v
approx_v =
0.842441892522547
0.842690018484511
0.842701171316200
0.842700786127333
0.842700793037422
0.842700792948825
0.842700792949722
>> exact_v = erf(1)
exact_v =
0.842700792949715

6.5 Improper Integrals


In those cases where the limits (a and b) of an integral ∫_a^b f(x) dx are infinite, we are talking about an improper integral. These integrals have the following forms:
∫_a^∞ f(x) dx (a > 0);    ∫_{−∞}^{−b} f(x) dx (b > 0);    ∫_{−∞}^{∞} f(x) dx    (6.40.)
Let us consider the first case. When the integrand decreases to zero quickly enough, the integral can be treated using the following substitution:
x = 1/σ and dx = −(1/σ²) dσ    (6.41.)
In this case:
∫_a^∞ f(x) dx = ∫_{1/a}^{0} f(1/σ)(−1/σ²) dσ = ∫_0^{1/a} (1/σ²) f(1/σ) dσ    (6.42.)

The problem is now caused by the transformed integrand, which may be singular at the lower limit. To overcome this problem, a Newton-Cotes formula that avoids the endpoints (like the composite midpoint rule) can be used.

Example

Approximate the following integral:
∫_2^∞ (sin x)/x² dx

Solution:
In MATLAB, by using h = 0.0125 to estimate the value, the solution is the
following:

>> f = inline('sin(1/x)'); h = 0.0125; x = 0:h:0.5;


>> n = length(x) - 1; m = zeros(n,1); I = 0;
>> for i = 1:n,
m(i) = (x(i+1) + x(i))/2;
I = I + f(m(i));
end
>> I = I*h;
>> I
I =
0.027687112368230
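As a rough cross-check (only a sketch, and assuming a MATLAB release in which integral accepts infinite limits), the original improper integral can be evaluated adaptively. Because the transformed integrand sin(1/σ) oscillates faster and faster as σ → 0, the simple midpoint sum above should only be expected to land in the right neighbourhood of the true value (which is close to 0.03):

>> g = @(x) sin(x)./x.^2;
>> I_check = integral(g, 2, Inf);   % adaptive quadrature over the infinite interval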

6.6 Numerical Integration of Bivariate Functions


For the approximation of the double integral
∬_D f(x, y) dx dy
where D = {(x, y): a ≤ x ≤ b, c ≤ y ≤ d} is the domain of integration, the integral2 function can be used in MATLAB.

q = integral2(fun,xmin,xmax,ymin,ymax) approximates the integral of the
function z = fun(x,y) over the planar region xmin ≤ x ≤ xmax and
ymin(x) ≤ y ≤ ymax(x).
q = integral2(fun,xmin,xmax,ymin,ymax,Name,Value) specifies additional
options with one or more Name,Value pair arguments.

Example

Calculate the integral for 𝑓(𝑥, 𝑦) = 𝑒 −𝑥𝑦 sin(𝑥𝑦); −1 ≤ 𝑥 ≤ 1, 0 ≤ 𝑦 ≤ 1.

Solution:
function z = esin(x,y)
z = exp(-x.*y).*sin(x.*y);

>> result = dblquad('esin', -1, 1, 0, 1)


result =
-0.221768529427044
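The call above uses the older dblquad routine. Since the text introduces integral2, the same result should also be obtainable with it (in releases where integral2 is available); the following is only an equivalent sketch written with an anonymous function instead of the esin file:

>> result2 = integral2(@(x,y) exp(-x.*y).*sin(x.*y), -1, 1, 0, 1);
% result2 agrees with the dblquad result above to within the default tolerances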

Example [3]

Create another function, which has either semi-infinite or bi-infinite interval,


and approximate Euler’s gamma function.

Solution:
The MATLAB function for the integration is the following:
function [s, w, x] = Gquad2(fun, n, type, varargin)
% Numerical integration using either the Gauss-
% Laguerre(type = 'L') or the Gauss-Hermite (type =
% 'H') with n (n > 0) nodes.
% fun is a string containing the name of the function
% that is integrated.
% The output parameters s, w, and x hold the computed
% approximation of the integral, list of weights, and
% the list of nodes, respectively.
if type == 'L'
d = -(1:n-1);

f = 1:2:2*n-1;
fc = 1;
else
d = sqrt(.5*(1:n-1));
f = zeros(1,n);
fc = sqrt(pi);
end
J = diag(d,-1) + diag (f) + diag(d,1);
[u,v] = eig(J);
[x,j] = sort(diag(v));
w = (fc*u(1,:).^2)';
w = w(j);
f = feval(fun,x,varargin{:});
s = w'*f(:);

Euler's gamma function is:
Γ(t) = ∫_0^∞ e^{−x} x^{t−1} dx,  (t > −1)
We will use the Gquad2 function with the setting of ’L’. But, define another
function to compute the gamma function using Gauss-Laguerre quadratures.

function y = mygamma(t)
% Value(s) y of the Euler's gamma function evaluated
% at t (t > -1).
td = t - fix(t);
if td == 0
n = ceil(t/2);
else
n = ceil(abs(t)) + 10;
end
y = Gquad2('pow',n,'L',t-1);
A simple power function:
function z = pow(x, e)
% Power function z = x^e
z = x.^e;

For testing, create the following script:


% Script testmyg.m
format long

disp(' t mygamma gamma')
disp(sprintf('\n_____________________________________________________'))
for t=1:.1:2
s1 = mygamma(t);
s2 = gamma(t);
disp(sprintf('%1.14f %1.14f %1.14f',t,s1,s2))
end

The result is the following:


>> testmyg
t mygamma gamma
_____________________________________________________
1.00000000000000 1.00000000000000 1.00000000000000
1.10000000000000 0.95470549811707 0.95135076986687
1.20000000000000 0.92244757458893 0.91816874239976
1.30000000000000 0.90150911731168 0.89747069630628
1.40000000000000 0.89058495940663 0.88726381750308
1.50000000000000 0.88871435840715 0.88622692545276
1.60000000000000 0.89522845323377 0.89351534928769
1.70000000000000 0.90971011289336 0.90863873285329
1.80000000000000 0.93196414951082 0.93138377098024
1.90000000000000 0.96199632935381 0.96176583190739
2.00000000000000 1.00000000000000 1.00000000000000

7 Numerical Solution of Differential Equations – Initial
Value Problems
When solving ordinary differential equations (ODE) numerically, the most common approach is a bottom-up one. This involves evaluating the highest-order derivative and, step by step, working back to the function values for each value of the independent variable (most commonly time). Once one state of the system is solved by this approach, all the other states can be calculated by varying the independent variable and implementing a time-marching solution. Note that it is referred to as time marching even though the independent variable might not actually be defined as time.
Due to the nature of the differentiation process and of its reverse, integration, these equations have an infinite number of solutions, as the integration constant can take an arbitrary value and the differential equation is still satisfied. To ensure that a specific solution is arrived at, an nth order differential equation requires n additional boundary conditions to allow the definition of these integration constants. By far the most common approach is when all these boundary conditions are provided at the beginning of the system; these can be referred to as starting conditions. Defining the constants in either the first or the last state of the system would be referred to as a boundary-value problem (BVP). Note however that the integration constants can be defined in any state of the system, and they need not be defined in the same state. If they are all defined in the same state, we call the problem an initial-value problem (IVP), otherwise they are just general boundary conditions.
Generally, an nth order ordinary differential equation can be described as the
following:
𝑦 (𝑛) = 𝑓(𝑥, 𝑦, 𝑦 ′ , 𝑦 ′′ , … , 𝑦 (𝑛−1) ) (7.1.)
with the corresponding n boundary conditions (in this case, for simplicity, defined as starting conditions):
y^{(i)}(x_0) = y_i;  i = 0, …, n − 1    (7.2.)
and the independent variable x of the system is bound by both sides as:
𝑥0 ≤ 𝑥 ≤ 𝑥𝑚 (7.3.)
For many of the methods presented, it is going to be assumed that there is a uniform spacing between the independent variable points, such that x_j = x_{j−1} + Δx for all m ≥ j > 0. When numerically solving the equation, the objective is to provide an acceptable estimate of the function value y and its derivatives at all points of the independent variable.
There are many available numerical applications which offer powerful solver
algorithms. The sequential time marching repeat nature of the solution
process lends itself to visually straightforward implementation in spreadsheet
programs, such as Microsoft Excel. Packages such as MATLAB offer various
forms of ODE solvers, catering for many levels of precision and performance
requirements and for various system behaviour. There are various other
packages available in many programming languages, some are freely
available, such as ODEPACK in Fortran, Boost in C++, various packages in Java,
SciPy and similar.

7.1 One-Step Methods


One step methods rely solely on the previously computed 𝑥𝑖 , 𝑦𝑖 (and
derivatives of y) solutions to extrapolate the function values and estimate the
solutions in the i+1th state of the system. The exact manner of extrapolation
varies between the individual methods, and the following sections will
present some of the most commonly used methods available.

7.1.1 Euler’s Method


Euler’s method is the simplest of the One-Step methods. The main benefits
are simplicity, easy implementation and quick calculations per timestep.
However, the accuracy is low, and the method can easily become unstable
depending on the system behaviour and step size chosen. For this reason,
Euler’s method is rarely used in most applications where accuracy is desired.
Euler’s extrapolation method relies on the truncated form of the Taylor series
expression about the ith state of the system; all but the first order linear terms
in the series expansion are neglected. Thus, the Euler’s method extrapolates
the function using the local tangent in the ith state to predict the system state
in the i+1th. Formally this can be written as:
𝑦𝑖+1 = 𝑦𝑖 + ∆𝑥 𝑦𝑖 ′ (7.4.)
Where in the case of a first order ODE:
𝑦𝑖 ′ = 𝑓(𝑥𝑖 , 𝑦𝑖 ) (7.5.)
Note in the case of a higher order ODE, the solution can be traced back as a
system of first order ODEs.
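For example (a standard reduction, stated here only as an illustration), a second-order equation y'' = f(x, y, y') with y(x_0) = y_0 and y'(x_0) = y'_0 can be rewritten by introducing u_1 = y and u_2 = y', giving the first-order system
u_1' = u_2,  u_2' = f(x, u_1, u_2),  u_1(x_0) = y_0,  u_2(x_0) = y'_0,
which can then be advanced component-wise with any of the one-step methods below.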

Example [3]

Since Euler's method is not built into MATLAB, create the function:


function y = EulerODE(f,x,y0)
%
% EulerODE uses Euler's method to solve a first-order
% ODE given in the form y' = f(x,y) subject to
% initial condition y0.
%
% y = EulerODE(f,x,y0) where
% f is an inline function representing f(x,y),
% x is a vector representing the mesh points,
% y0 is a scalar representing the initial value of y,
%
% y is the vector of solution estimates at the mesh
% points.
y = 0*x; % Pre-allocate
y(1) = y0; h = x(2)-x(1);
for n = 1:length(x)-1
y(n+1) = y(n)+ h*f(x(n),y(n));
end

Then, in order to test it, let’s solve the following Initial-Value Problem:
y' + y = 2x,  y(0) = 1,  0 ≤ x ≤ 1
So, we create the following script: Euler_01.m
% Testing of EulerODE function by solving
% y' + y = 2x, y(0) = 1, 0 <= x <= 1.
% The exact solution is y_exact(x)=2x+3*exp(-x)-2.
disp(' x yEuler yExact')
h = 0.1; x = 0:h:1; y0 = 1;
f = inline('-y+2*x','x','y');
yEuler = EulerODE(f,x,y0);
yExact = inline('2*x+3*exp(-x)-2');
for k = 1:length(x)
x_coord = x(k);
yE = yEuler(k);
yEx = yExact(x(k));
fprintf('%6.2f %11.6f %11.6f\n',x_coord,yE,yEx)
end

By running it, the result is:

>> Euler_01
x yEuler yExact
0.00 1.000000 1.000000
0.10 0.900000 0.914512
0.20 0.830000 0.856192
0.30 0.787000 0.822455
0.40 0.768300 0.810960
0.50 0.771470 0.819592
0.60 0.794323 0.846435
0.70 0.834891 0.889756
0.80 0.891402 0.947987
0.90 0.962261 1.019709
1.00 1.046035 1.103638

It can be seen that the biggest relative error is approximately 6.2% at x = 0.7. By reducing the step size h this error can also be reduced: for example, at h = 0.05 the largest relative error is 3% at x = 0.65. The calculation of the percent relative errors at all x_i can be carried out by creating the next script: Euler_error

disp('   x     yEuler      yExact     e_local   e_global')
h = 0.1; x = 0:h:1; y0 = 1; f = inline('-y+2*x','x','y');
yEuler = EulerODE(f,x,y0); yExact = inline('2*x+3*exp(-x)-2');
ytilda = 0*x; ytilda(1) = y0;
for n = 1:length(x)-1
ytilda(n+1) = yExact(x(n)) + h*f(x(n),yExact(x(n)));
end
for k = 1:length(x)
x_coord = x(k);
yE = yEuler(k);
yEx = yExact(x(k));
e_local = (yEx-ytilda(k))/yEx*100;
e_global = (yEx-yE)/yEx*100;
fprintf('%6.2f %11.6f %11.6f %6.2f %6.2f\n',x_coord,yE,yEx,e_local,e_global)
end

By executing it:

>> Euler_error
x yEuler yExact e_local e_global
0.00 1.000000 1.000000 0.00 0.00
0.10 0.900000 0.914512 1.59 1.59
0.20 0.830000 0.856192 1.53 3.06
0.30 0.787000 0.822455 1.44 4.31
0.40 0.768300 0.810960 1.33 5.26
0.50 0.771470 0.819592 1.19 5.87
0.60 0.794323 0.846435 1.04 6.16
0.70 0.834891 0.889756 0.90 6.17
0.80 0.891402 0.947987 0.76 5.97
0.90 0.962261 1.019709 0.64 5.63
1.00 1.046035 1.103638 0.53 5.22

As a further note on the stability and accuracy of Euler's method, it can easily be seen that for a linear y function, Euler's method would provide the exact solution with zero error. From here it can also be seen that the further the local behaviour of the y function is from linear, the greater the difference between the extrapolation and the true behaviour. Intuitively this can be imagined as trying to follow a curved line with straight line segments set at specific angles. The less local curvature the line has, or the shorter the straight segments are (small Δx), the better the method approximates the true solution. High curvature with large timesteps (over the method's stability limit) will result in oscillatory divergent behaviour, usually causing overflow and likely breaking the software execution.

7.1.2 Second Order Taylor Method


After investigating Euler's method, this chapter will investigate higher order methods based on the Taylor series expansion. Euler's method keeps the first order (linear) terms from the series, thus it is the lowest usable order (a 0th order method wouldn't work, as the function value wouldn't change!). The chapter will present the Second Order Taylor Method, however the same logic can be followed to extend the methodology to 3rd, 4th, and arbitrarily higher orders. Keep in mind that, just like the convergence of the Taylor series, higher order Taylor methods will result in more and more expensive calculations, with diminishing returns in accuracy from the higher order terms.

The Second Order Taylor Method keeps the first two orders from the Taylor’s
series expansion, so the linear and the quadratic components. As such, the
estimation of the next system state can be written as:
y_{i+1} = y_i + Δx·y_i' + (1/2!) Δx²·y_i''    (7.6.)
where
y_i' = f(x_i, y_i),  y_i'' = f'(x_i, y_i)    (7.7.)
It can be seen, that the method only differs from the Euler’s method (First
Order Taylor Method) by the additional second order term. Following the
analogy in the previous chapter, the second order term allows us to
approximate a curved line with a parabola (2nd order curve) rather than a
straight line (1st order curve). Again, for low curvatures, and small steps, the
parabola can be a good approximation, and approximates the solution well.
In the case of high and changing curvatures, the method either needs small
timesteps, or has to include the higher order terms for better fit.
For the sake of completeness, a general kth order Taylor Method would be of
the following form:
y_{i+1} = y_i + Δx·y_i' + (1/2!) Δx²·y_i'' + ⋯ + (1/k!) Δx^k·y_i^{(k)}    (7.8.)
where
y_i' = f(x_i, y_i),  y_i'' = f'(x_i, y_i), …, y_i^{(k)} = f^{(k−1)}(x_i, y_i)    (7.9.)
The following section will demonstrate the use of the Second Order Taylor
Method in MATLAB on an example problem.

Example [3]

Continuing the example from the previous sub-chapter, the following script
calculates and represents the relative error of the Second Order Taylor
Method and compares the values to the Euler Method solution:

Euler_Taylor.m
disp('   x     yEuler      yTaylor2    e_Euler   e_Taylor2')
h = 0.1; x = 0:h:1; y0 = 1;
f = inline('-y+2*x','x','y'); fp = inline('y-2*x+2','x','y');
yEuler = EulerODE(f,x,y0); yExact = inline('2*x+3*exp(-x)-2');

yTaylor2 = 0*x; yTaylor2(1) = y0;
for n = 1:length(x)-1
yTaylor2(n+1) = yTaylor2(n)+h*(f(x(n),yTaylor2(n))+(1/2)*h*fp(x(n),yTaylor2(n)));
end
for k = 1:length(x)
x_coord = x(k);
yE = yEuler(k);
yEx = yExact(x(k));
yT = yTaylor2(k);
e_Euler = (yEx-yE)/yEx*100;
e_Taylor2 = (yEx-yT)/yEx*100;
fprintf('%6.2f %11.6f %11.6f %6.2f %6.2f\n',x_coord,yE,yT,e_Euler,e_Taylor2)
end

The result is:


>> Euler_Taylor
x yEuler yTaylor2 e_Euler e_Taylor2
0.00 1.000000 1.000000 0.00 0.00
0.10 0.900000 0.915000 1.59 -0.05
0.20 0.830000 0.857075 3.06 -0.10
0.30 0.787000 0.823653 4.31 -0.15
0.40 0.768300 0.812406 5.26 -0.18
0.50 0.771470 0.821227 5.87 -0.20
0.60 0.794323 0.848211 6.16 -0.21
0.70 0.834891 0.891631 6.17 -0.21
0.80 0.891402 0.949926 5.97 -0.20
0.90 0.962261 1.021683 5.63 -0.19
1.00 1.046035 1.105623 5.22 -0.18

Runge-Kutta Methods

The Runge-Kutta methods form a family of methods which achieve results with accuracy comparable to the Taylor-series based methods, however they do not require the calculation of the derivatives of f. Calculating derivatives can be computationally very expensive, or downright impossible, for example when the function is not differentiable at the given point. For this reason there is a definite benefit in using methods that do not rely on the derivatives. The solution process of these methods is a time marching method similar to the previously shown algorithms, however in this case it can be expressed as:

𝑦𝑖+1 = 𝑦𝑖 + ∆𝑥𝜑𝑖 (7.10.)
where as before
𝑦 ′ = 𝑓(𝑥, 𝑦), 𝑦(𝑥0 ) = 𝑦0 (7.11.)
but the extrapolation term used to approximate the next state of system, 𝜑𝑖
is not the derivative 𝑓𝑖 rather some approximated slope function. The main
difference between the methods shown in this chapter is how this function is
defined.

7.1.3 Second Order Runge-Kutta Methods


The Second Order Runge-Kutta Method (or RK2) relies on 2 points defined between every state of the system to approximate the slope of the extrapolation function. This means that although the accuracy (or the error) of the method is of the same order as the equivalent order Taylor method, the RK2 requires two function evaluations to estimate the next system state. While this sounds more expensive, keep in mind that if the derivative is not available for a Taylor method, it has to be evaluated numerically, which can add considerable expense, so the extra evaluation of RK2 can be very beneficial in the end. As a side note, there are higher order RK methods, in which the number of function evaluations corresponds to the order of the method; for example RK4 (such as the ODE4 solver in MATLAB) requires 4 evaluations of the function to extrapolate to the new state.
The function itself is defined as:
𝜑(𝑥𝑖 , 𝑦𝑖 ) = 𝜑𝑖 = 𝑎1 𝑘1 + 𝑎2 𝑘2 (7.12.)
With this it can be written that:
𝑦𝑖+1 = 𝑦𝑖 + ∆𝑥(𝑎1 𝑘1 + 𝑎2 𝑘2 ) (7.13.)
Where the following are by definition:
𝑘1 = 𝑓(𝑥𝑖 , 𝑦𝑖 ) = 𝑓𝑖 (7.14.)

𝑘2 = 𝑓(𝑥𝑖 + 𝑏1 ∆𝑥, 𝑦𝑖 + 𝑐11 𝑘1 ∆𝑥) (7.15.)


The constants in the equations, 𝑎1 , 𝑎2 , 𝑏1 and 𝑐11 vary between the different
implementations of the Second Order Runge-Kutta methods. The actual
coefficients can be determined by expressing the Taylor series expansion of
equation (7.13.).
Only the first 3 terms are kept, which results in 3 equations for 4 unknowns. Hence one of the coefficients can be chosen freely, and it is this choice that differentiates the various RK2 methods. Omitting the derivation, the coefficients satisfy the following relations:
a_1 + a_2 = 1,  a_2 b_1 = 1/2,  a_2 c_11 = 1/2    (7.16.)
In the following sections, the most common RK2 methods, and the constant
values used are described.

7.1.3.1 Improved Euler’s Method


The Improved Euler’s Method is a RK2 method, where 𝑎2 = 1 is assumed.
From these the other constants are determined as:
a_1 = 0,  b_1 = 1/2,  c_11 = 1/2    (7.17.)
Inserting the constants into equation (7.13.):
y_{i+1} = y_i + Δx·k_2    (7.18.)
and
k_1 = f(x_i, y_i)    (7.19.)
k_2 = f(x_i + Δx/2, y_i + k_1 Δx/2)    (7.20.)

7.1.3.2 Heun’s Method


Heun's Method is also in the RK2 family. It assumes a_2 = 1/2 and from that the following are determined:
a_1 = 1/2,  b_1 = 1,  c_11 = 1    (7.21.)
This gives the following specific form to the RK2 equations:
y_{i+1} = y_i + (Δx/2)(k_1 + k_2)    (7.22.)
And
𝑘1 = 𝑓(𝑥𝑖 , 𝑦𝑖 ) (7.23.)

𝑘2 = 𝑓(𝑥𝑖 + ∆𝑥, 𝑦𝑖 + 𝑘1 ∆𝑥) (7.24.)


Heun’s method is implemented in MATLAB, and can be accessed using the
ODE2 solver.

7.1.3.3 Ralston’s Method
Ralston's Method assumes a_2 = 2/3, from which a_1 = 1/3, b_1 = 3/4 and c_11 = 3/4, which gives the following specific implementation:
y_{i+1} = y_i + (Δx/3)(k_1 + 2k_2)    (7.25.)
and
k_1 = f(x_i, y_i)    (7.26.)
k_2 = f(x_i + (3/4)Δx, y_i + (3/4)k_1 Δx)    (7.27.)
Table 7 provides a summary of the methods, and the constants used.
Method 𝑎1 𝑎2 𝑏1 𝑐11
Improved Euler’s 0 1 1/2 1/2
Heun’s 1/2 1/2 1 1
Ralston’s 1/3 2/3 3/4 3/4
Table 7: Second Order Runge-Kutta coefficients
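As an illustration of how the table translates into code, a generic RK2 stepper might be sketched as below. RK2ODE is a hypothetical helper (it is not a built-in MATLAB function and is not used elsewhere in this book); it mirrors the structure of the EulerODE function shown earlier, and calling it with a1 = 1/2, a2 = 1/2, b1 = 1, c11 = 1 would reproduce Heun's method.

function y = RK2ODE(f,x,y0,a1,a2,b1,c11)
% RK2ODE sketches a generic second-order Runge-Kutta solver for
% y' = f(x,y), y(x(1)) = y0, on the mesh points stored in vector x.
% The coefficients (a1,a2,b1,c11) select the RK2 variant (see Table 7).
y = 0*x; % Pre-allocate
y(1) = y0; h = x(2)-x(1);
for n = 1:length(x)-1
k1 = f(x(n),y(n));
k2 = f(x(n)+b1*h,y(n)+c11*k1*h);
y(n+1) = y(n)+h*(a1*k1+a2*k2);
end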

7.1.4 Built in Functions in MATLAB


MATLAB offers a selection of built-in functions to solve initial value problems.
The following list shows a selection of the available solvers:
- ode23 – Explicit Runge-Kutta (2,3) formula
- ode45 – Explicit Runge-Kutta (4,5) formula
- ode113 – Adams-Bashforth-Moulton solver
- ode15s – Numerical differentiation formula based solver
- ode23s – Modified Rosenbrock formula of order 2 based solver

The syntax to use in MATLAB is the following:


[t,y] = ode23(odefun,tspan,y0), where tspan = [t0 tf], integrates the system
of differential equations y’ = f(t,y) from t0 to tf with initial conditions y0. Each
row in the solution array y corresponds to a value returned in column vector t.
All MATLAB ordinary differential equation solvers can solve systems of
equations of the form y' = f(t,y), or problems that involve a mass matrix, M(t,y)y' =
f(t,y). The solvers all use similar syntaxes. The ode23s solver can only solve
problems with a mass matrix if the mass matrix is
constant. ode15s and ode23t can solve problems with a mass matrix that is

singular, known as differential-algebraic equations (DAEs). Specify the mass
matrix using the Mass option of odeset.

Example [3]

Find the numerical solution of y at t =0, 0.25, 0.5, 0.75, 1 to the following
problem:
𝑦 ′ = −2𝑡𝑦 2
with the initial condition of y(0) = 1. Use both ode23 and ode45 solvers.

Solution:
The exact solution of this problem is:
y(t) = 1/(1 + t²)
In MATLAB, at first let’s create the equation:
function dy = eq1(t,y)
% The m-file for the ODE y' = -2ty^2.
dy = -2*t.*y(1).^2;

Then, compute the results:


>> tspan = [0 .25 .5 .75 1]; y0 = 1;
>> [t1 y1] = ode23('eq1', tspan, y0);
>> [t2 y2] = ode45('eq1', tspan, y0);
>> [t1 y1 y2]
ans =
0 1.0000 1.0000
0.2500 0.9412 0.9412
0.5000 0.8000 0.8000
0.7500 0.6400 0.6400
1.0000 0.5000 0.5000

Example [3]

Consider the following Initial-Value Problem and solve it with Heun's Method:
y' + y = 2x,  y(0) = 1,  0 ≤ x ≤ 1,  h = 0.1
We create a user defined function, which uses the Heun’s method to solve the
initial value problem:

Solution:
function y = HeunODE(f,x,y0)
%
% HeunODE uses Heun's method to solve a first-order
% ODE given in the form y' = f(x,y) subject to
% initial condition y0.
%
% y = HeunODE(f,x,y0) where
% f is an inline function representing f(x,y),
% x is a vector representing the mesh points,
% y0 is a scalar representing the initial value of y,
% y is the vector of solution estimates at the mesh
% points.
y = 0*x; % Pre-allocate
y(1) = y0; h = x(2)- x(1);
for n = 1:length(x)-1,
k1 = f(x(n),y(n));
k2 = f(x(n)+h,y(n)+h*k1);
y(n+1) = y(n)+h*(k1+k2)/2;
end

Let's solve the same problem as earlier, this time using Heun's Method:
Heun_01.m
% Testing of HeunODE function by solving
% y' + y = 2x, y(0) = 1, 0 <= x <= 1.
% The exact solution is y_exact(x) = 2x + 3*exp(-x) - 2.
disp(' x yHeun yExact')
h = 0.1; x = 0:h:1; y0 = 1;
f = inline('-y+2*x','x','y');
yHeun = HeunODE(f,x,y0);
yExact = inline('2*x+3*exp(-x)-2');
for k = 1:length(x)
x_coord = x(k);
yH = yHeun(k);
yEx = yExact(x(k));
fprintf('%6.2f %11.6f %11.6f\n',x_coord,yH,yEx)
end

The solution is:


>> Heun_01
x yHeun yExact

0.00 1.000000 1.000000
0.10 0.915000 0.914512
0.20 0.857075 0.856192
0.30 0.823653 0.822455
0.40 0.812406 0.810960
0.50 0.821227 0.819592
0.60 0.848211 0.846435
0.70 0.891631 0.889756
0.80 0.949926 0.947987
0.90 1.021683 1.019709
1.00 1.105623 1.103638

It is interesting to see how the solutions compare to the built-in MATLAB


functions. When the same script is augmented to include 2 commonly used
MATLAB IVP solver functions ODE23 and ODE45, the following results can be
obtained.

x yHeun yExact ODE23 ODE45


0.00 1.000000 1.000000 1.000000 1.000000
0.10 0.915000 0.914512 0.914744 0.914512
0.20 0.857075 0.856192 0.857585 0.856192
0.30 0.823653 0.822455 0.824025 0.822455
0.40 0.812406 0.810960 0.812372 0.810960
0.50 0.821227 0.819592 0.820861 0.819592
0.60 0.848211 0.846435 0.847576 0.846435
0.70 0.891631 0.889756 0.890782 0.889756
0.80 0.949926 0.947987 0.948909 0.947987
0.90 1.021683 1.019709 1.020538 1.019709
1.00 1.105623 1.103638 1.103599 1.103638
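The augmentation itself is not shown here; a minimal sketch (assuming the variables f, x, y0, yHeun and yExact from Heun_01.m are in the workspace, and that the built-in solvers are asked for the solution at the same mesh points) could look like this:

[~, y23] = ode23(@(t,y) f(t,y), x, y0);   % ode23/ode45 return the solution at the
[~, y45] = ode45(@(t,y) f(t,y), x, y0);   % points of x when tspan has several entries
for k = 1:length(x)
fprintf('%6.2f %11.6f %11.6f %11.6f %11.6f\n', ...
x(k), yHeun(k), yExact(x(k)), y23(k), y45(k))
end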

It can be seen that while our simple implementation of Heun's Method gives a good approximation of the result, it overestimates the exact solution in this example by up to 0.21%. Depending on the application, this relatively small error can be ignored, or a different method can be used. ODE23, for example, solves the IVP with at most 0.19% error, and ODE45 matches the exact solution to within the requested precision. The trade-off is always between the precision and the computing resources allocated to the problem.

8 Partial Differential Equations
The problem of solving one or more partial differential equations occurs in many fields of engineering, like thermodynamics, fluid mechanics, applied mechanics (e.g. elasticity), or electromagnetic theory. The most commonly known PDE problems are Laplace's equation, the wave equation and the heat equation.
The main challenge with partial differential equations (PDE) is that their analytical solution requires advanced mathematical methods, and in some cases it is not even possible to find a solution in closed form, unlike for many ordinary differential equations. Mostly, this is the reason why it is generally easier to approximate the solutions by using a simple and efficient numerical method. Although there are numerous methods available for solving PDEs, mostly the finite-difference methods have become widely used.

8.1 Partial Differential Equations in General


In the case of PDEs the equation involves a dependent-variable function with at least two independent variables and the function's partial derivatives. The highest order of derivative determines the order of the PDE, marked with n. We distinguish between linear PDEs (where the dependent variable and its derivatives appear only to the first degree) and nonlinear ones. If the PDE includes only terms containing the dependent variable and its derivatives, then we are dealing with a homogeneous PDE; of course, if there are other terms, it is a non-homogeneous PDE.
The general form of a second order linear PDE is the following:
A ∂²u/∂x² + B ∂²u/∂x∂y + C ∂²u/∂y² + D ∂u/∂x + E ∂u/∂y + F·u = G    (8.1.)
This form is commonly simplified by using the following brief notations for the partial derivatives:
u_x = ∂u/∂x;  u_xx = ∂²u/∂x²;  u_xy = ∂²u/∂x∂y;  u_yy = ∂²u/∂y²    (8.2.)
So, in the simplified form the general PDE is:
𝐴𝑢𝑥𝑥 + 𝐵𝑢𝑥𝑦 + 𝐶𝑢𝑦𝑦 + 𝐷𝑢𝑥 + 𝐸𝑢𝑦 + 𝐹𝑢 = 𝐺 (8.3.)
where 𝐴, 𝐵, 𝐶, 𝐷, 𝐸, 𝐹, 𝐺 are functions of 𝑥 and 𝑦.
The spatial coordinates determine the dimension of a PDE, but the time does
not. So, 𝑢 = 𝑢(𝑥, 𝑦) is a two-dimensional PDE, while 𝑢 = 𝑢(𝑥, 𝑦, 𝑡) is also a
two-dimensional PDE.

The classification of PDEs is based on the sign of the discriminant:
Δ_PDE = B² − 4AC    (8.4.)
There are three classes (a quick check on standard examples follows the list):
- if Δ_PDE < 0, it is an elliptic PDE (e.g. Laplace's or Poisson's equation),
- if Δ_PDE = 0, it is a parabolic PDE (e.g. the heat equation),
- if Δ_PDE > 0, it is a hyperbolic PDE (e.g. the 1D wave equation).
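As a quick check of the classification on the standard examples (with the time variable t playing the role of y in the general form above):
- Laplace's equation u_xx + u_yy = 0: A = C = 1, B = 0, so Δ_PDE = −4 < 0 (elliptic);
- heat equation αu_xx − u_t = 0: A = α, B = C = 0, so Δ_PDE = 0 (parabolic);
- 1D wave equation c²u_xx − u_tt = 0: A = c², B = 0, C = −1, so Δ_PDE = 4c² > 0 (hyperbolic).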

8.2 Elliptic Partial Differential Equations


Solving an elliptic PDE can be done in three main ways:

- the solution is looked for in a particular region of 𝑥𝑦- plane, where


the unknown function is prescribed, which is the Dirichlet problem,
𝜕𝑢
- a boundary value problem, where the normal derivative of 𝑢𝑛 = 𝜕𝑛 is
given on the boundary, referred by the Neumann problem,
- when the function 𝑢 is specified on certain parts of the boundary and
𝑢𝑛 is specified on the others, we are talking about a mixed problem.

In this subchapter these three problems are introduced.

8.2.1.1 Dirichlet Problem


The typical example for a Dirichlet problem is a 2D heat flow in steady state,
when the temperature is known along the boundary of the region. As an initial
equation, the Poisson’s equation for 2 dimensions can be written as:
𝑢𝑥𝑥 + 𝑢𝑦𝑦 = 𝑓(𝑥, 𝑦) (8.5.)
The considered region is rectangular in this case, with equidistant vertical and
horizontal grid lines of distance ℎ. The intersections of grid lines are mesh
points. The mesh points, located on the boundary, are the boundary points,
while inside the region they are called interior mesh points. During this
method, the approximation of u is computed at the interior mesh points. An arbitrary mesh point has the coordinates (x, y) = (ih, jh), but for simplicity it is labelled (i, j), as Figure 27 represents. At this point, the approximated value is u_{ij} and the forcing function value is f_{ij}.

Figure 27: Rectangular region grid with a 5-point molecule [3]

The three-point central difference formula, which gives the approximation of


the second-order partial derivatives can be written as:
[u_{i−1,j} − 2u_{ij} + u_{i+1,j}]/h² + [u_{i,j−1} − 2u_{ij} + u_{i,j+1}]/h² = f_{ij}    (8.6.)
By multiplying with h² we get:
u_{i−1,j} + u_{i+1,j} + u_{i,j−1} + u_{i,j+1} − 4u_{ij} = h² f_{ij}    (8.7.)
This form of equation (8.7.) gives the connection between the solution u at (i, j) and at the neighbouring points for the Poisson equation. In the same way, for the Laplace equation it can be written:
u_{i−1,j} + u_{i+1,j} + u_{i,j−1} + u_{i,j+1} − 4u_{ij} = 0    (8.8.)
Which is the difference equation for Laplace’s equation.
The outcome is a linear system of algebraic equations in both cases, where the number of equations (and of unknowns) equals the number of interior mesh points in the grid. Indeed, in the case of n interior mesh points the linear system is Au = b, where the coefficient matrix is A_{n×n}, the unknowns are stored in the vector u_{n×1}, while the known quantities generate b_{n×1}. In practice this means that if a mesh point has a connection to a boundary point, the boundary condition is moved to the right-hand side of the equation, blending into the vector b. In the case of the Poisson equation, aside from the boundary conditions, the h² f_{ij} values also affect the vector b. In order to achieve better accuracy, a large number of mesh points is recommended, causing the matrix A to be large but sparse, with at most five nonzero elements in each row.

These kinds of linear equations are usually solved with indirect methods, like
Gauss-Seidel iterative method.

Example [3]

Consider a rectangular plate of dimensions 2 × 1 with a steady-state temperature distribution inside. The lower edge has a temperature profile of sin(πx/2), while the other three edges are kept at zero. The mesh size is h = 0.5. Compute the approximate u values at the interior mesh points and their relative errors.

Solution:
The exact solution is:
u(x, y) = [1/sinh(π/2)] · sin(πx/2) · sinh(π(1 − y)/2)

Create a user defined function in MATLAB:

function U = DirichletPDE(x,y,f,uleft,uright,ubottom,utop)
% DirichletPDE numerically solves an elliptic PDE with
% Dirichlet boundary conditions over a rectangular
% region.
%
% U = DirichletPDE(x,y,f,uleft,uright,ubottom,utop) where
%
% x is the 1-by-m vector of mesh points in the x
% direction,
% y is the n-by-1 vector of mesh points in the y
% direction,
% f is the inline function defining the forcing
% function which is in terms of x and y, namely,
% f(x,y),
% ubottom(x),utop(x),uright(y),uleft(y) are the
% functions defining the boundary conditions,
%
% U is the solution at the interior mesh points.
m = size(x,2); n = size(y,1); N = (m-2)*(n-2);
A = diag(-4*ones(N,1)); % Create diagonal matrix
A = A + diag(diag(A,n-2)+1,n-2); % Add n-2 diagonal
A = A + diag(diag(A,2-n)+1,2-n); % Add 2-n diagonal
d1 = ones(N-1,1); % Create vector of ones
d1(n-2:n-2:end) = 0; % Insert zeros
A = A + diag(d1,1); % Add upper diagonal
A = A + diag(d1,-1); % Add lower diagonal

[X Y] = meshgrid(x(2:end-1),y(end-1:-1:2)); % Create mesh
h = x(2)-x(1);
%Define boundary conditions
for i = 2:m-1
utopv(i-1) = utop(x(i));
ubottomv(i-1) = ubottom(x(i));
end
for i = 1:n
uleftv(i) = uleft(y(n+1-i));
urightv(i) = uright(y(n+1-i));
end
% Build vector b
b = 0; % Initialize vector b

for i = 1:N
b(i) = h^2*f(X(i),Y(i));
end

b(1:n-2:N) = b(1:n-2:N)-utopv;
b(n-2:n-2:N) = b(n-2:n-2:N)-ubottomv;
b(1:n-2) = b(1:n-2)-uleftv(2:n-1);
b(N-(n-3):N) = b(N-n+3:N)-urightv(2:n-1);
u = A\b'; % Solve the system
U = reshape(u,n-2,m-2);
U = [utopv;U;ubottomv];
U = [uleftv' U urightv'];
[X Y] = meshgrid(x,y(end:-1:1));
surf(X,Y,U); % 3D plot of the numerical results
xlabel('x');ylabel('y');

Now run the function with the following parameters:

x = 0:0.5:2; % x must be 1-by-m


y = 0:0.5:1; y = y'; % y must be n-by-1
f = inline('0','x','y');
ubottom = inline('sin(pi*x/2)');
utop = inline('0','x'); uleft = inline('0','y');
uright = inline('0','y');
U = DirichletPDE(x,y,f,uleft,uright,ubottom,utop)

U =

0 0 0 0 0
0 0.2735 0.3867 0.2735 0
0 0.7071 1.0000 0.7071 0
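The example statement also asks for the relative errors at the interior mesh points; these can be checked against the exact solution given above. The following is only a sketch, run after the call above so that x, y and U are still in the workspace; with this coarse mesh (h = 0.5) the interior errors come out at roughly a few per cent.

uexact = @(x,y) sin(pi*x/2).*sinh(pi*(1-y)/2)/sinh(pi/2);
[X, Y] = meshgrid(x, y(end:-1:1));        % same orientation as U (first row is y = 1)
Uex = uexact(X, Y);
relerr = abs(U(2:end-1,2:end-1) - Uex(2:end-1,2:end-1)) ...
./ Uex(2:end-1,2:end-1) * 100             % percent relative error at the interior points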

Let's see what the difference is if the mesh size is decreased to h = 0.1.

x = 0:0.1:2;
y = 0:0.1:1; y = y';
f = inline('0','x','y');
ubottom = inline('sin(pi*x/2)');
utop = inline('0','x'); uleft = inline('0','y');
uright = inline('0','y');
U = DirichletPDE(x,y,f,uleft,uright,ubottom,utop)

8.2.1.2 Alternating Direction Implicit Methods


Although the linear system of equations described above is an efficient method, from the viewpoint of computational effort it would be even more efficient if the coefficient matrix were tridiagonal, with at most three non-zero elements in each row, rather than a large sparse matrix. The Alternating Direction Implicit (ADI) methods were developed to reach this goal.
To describe this method, consider the same Laplace equation on a rectangular grid, using an h-sized mesh with N interior points in each row and M interior points in each column. By rearranging equation (8.8.), the jth row elements (u_{i−1,j}, u_{ij}, u_{i+1,j}) will be on the left-hand side of the equation, while the ith column elements (u_{i,j−1}, u_{ij}, u_{i,j+1}) are on the right-hand side, in the following way:
𝑢𝑖−1,𝑗 − 4𝑢𝑖𝑗 + 𝑢𝑖+1,𝑗 = −𝑢𝑖,𝑗−1 − 𝑢𝑖,𝑗+1 (8.9.)
Another rearranging way is:
𝑢𝑖,𝑗−1 − 4𝑢𝑖𝑗 + 𝑢𝑖,𝑗+1 = −𝑢𝑖−1,𝑗 − 𝑢𝑖+1,𝑗 (8.10.)
In this case, the 𝑖th column elements are in the left-hand side, the 𝑗th row
elements are on the right-hand side. During solving the ADI methods, each
iterative step is separated to two sub-steps: at first equation (8.9.) is solved in
every row of the mesh, then equation (8.10.) is solved in every column. The
most commonly applied method to solve this problem is the Peaceman-
Rachford Alternating Direction Implicit (PRADI) Method.

8.2.1.3 Peaceman-Rachford Alternating Direction Implicit
Method
During the PRADI method, as an initial step an arbitrary value u_{ij}^{(0)} is used as a starting value at each interior mesh point. As the first sub-step, the u_{ij} values are updated from row to row using equation (8.9.) – in the jth row (j = 1, 2, …, M):
u_{i−1,j}^{(0.5)} − 4u_{i,j}^{(0.5)} + u_{i+1,j}^{(0.5)} = −u_{i,j−1}^{(0)} − u_{i,j+1}^{(0)},  i = 1, 2, …, N    (8.11.)
Of course, the values that are given by the boundary conditions are not affected by the iteration steps and remain unchanged during the computation. For any row j, N equations are generated by equation (8.11.), thus in total M·N equations are constructed. In this case, the coefficient matrix is tridiagonal, which can be solved more efficiently, for example by using the Thomas method. At this moment the iteration step is only halfway through; the second sub-step is still needed.
The second half of the iteration step solves equation (8.10.) for each column, using the u_{ij}^{(0.5)} values as initial values and then updating them – for the ith column (i = 1, 2, …, N):
u_{i,j−1}^{(1)} − 4u_{i,j}^{(1)} + u_{i,j+1}^{(1)} = −u_{i−1,j}^{(0.5)} − u_{i+1,j}^{(0.5)},  j = 1, 2, …, M    (8.12.)
Now, this equation generates M equations for each column. Of course, the boundary conditions do not change during the method. In total, M·N equations are generated, which can be solved similarly to the linear system generated for the rows. After finding the solution, the first iteration step is completed. The further iteration steps are executed in the same way. The method is continued until convergence has been reached, or until a terminating condition has been fulfilled, such as the minimum difference (tolerance) between two successive solution matrices.

Example [3]

Consider a Dirichlet problem of the same type as in the previous example and solve it with MATLAB, using the default parameter values.

Solution:
Create the PRADI method’s MATLAB code.
function [U,k] = PRADI(x,y,f,uleft,uright,ubottom,utop,tol,kmax)
% PRADI numerically solves an elliptic PDE with

% Dirichlet boundary conditions over a rectangular
% region using the Peaceman-Rachford alternating
% direction implicit method.
%
% [U,k] = PRADI(x,y,f,uleft,uright,ubottom,utop,tol,
% kmax) where
% x is the 1-by-m vector of mesh points in the
% x direction,
% y is the n-by-1 vector of mesh points in the
% y direction,
% f is the inline function defining the forcing
% function,
% ubottom,uleft,utop,uright are the functions
% defining the boundary conditions,
% tol is the tolerance used for convergence
% (default = 1e-4),
% kmax is the maximum number of iterations
% (default = 50),
% U is the solution at the mesh points,
% k is the number of full iterations needed to meet
% the tolerance.
% Note: The default starting value at all mesh points
% is 0.5.
if nargin<9 || isempty(kmax), kmax = 50; end
if nargin<8 || isempty(tol), tol = 1e-4; end
[X Y] = meshgrid(x(2:end-1),y(2:end-1));
% Create mesh grid
m = size(X,2); n = size(X,1); N = m*n;
u = 0.5*ones(n,m); % Starting values
h = x(2)-x(1); % Mesh size
% Define boundary conditions
for i = 2:m+1
utopv(i-1) = utop(x(i));
ubottomv(i-1) = ubottom(x(i));
end
for i = 1:n+2
uleftv(i) = uleft(y(i));
urightv(i) = uright(y(i));
end
U = [ubottomv;u;utopv]; U = [uleftv' U urightv'];
% Generate matrix A1 (first half) and A2 (second half).
A = diag(-4*ones(N,1));
d1 = diag(A,1)+1; d1(m:m:N-1) = 0;
d2 = diag(A,-1)+1; d2(n:n:N-1) = 0;
A2 = diag(d2,1)+diag(d2,-1)+A;
A1 = diag(d1,1)+diag(d1,-1)+A;
U1 = U;
for i = 1:N % Initialize vector b
b0(i) = h^2*f(X(i),Y(i));

end
b0 = reshape(b0,n,m);
for k = 1:kmax
% First half
b = b0-U1(1:end-2,2:end-1)-U1(3:end,2:end-1);
b(:,1) = b(:,1)-U(2:end-1,1);
b(:,end) = b(:,end)-U(2:end-1,end);
b = reshape(b',N,1);
u = ThomasMethod(A1,b);
% Tridiagonal system - Thomas method
u = reshape(u,m,n);
U1 = [U(1,2:end-1);u';U(end,2:end-1)];
U1 = [U(:,1) U1 U(:,end)];
% second half
b = b0-U1(2:end-1,1:end-2)-U1(2:end-1,3:end);
b(1,:) = b(1,:)-U(1,2:end-1);
b(end,:) = b(end,:)-U(end,2:end-1);
b = reshape(b,N,1);
u = ThomasMethod(A2,b);
% Tridiagonal system - Thomas method
u = reshape(u,n,m);
U2 = [U(1,2:end-1);u;U(end,2:end-1)];
U2 = [U(:,1) U2 U(:,end)];
if norm(U2-U1,inf)<=tol, break, end;
U1 = U2;
end
[X Y] = meshgrid(x,y);
U = U1;
for i = 1:n+2
W(i,:) = U(n-i+3,:);
YY(i) = Y(n-i+3);
end
U = W; Y = YY;
surf(X,Y,U);
xlabel('x');ylabel('y');

To run this function, a function for Thomas method is also needed, which can
be constructed easily as follows:
function x = ThomasMethod(A,b)
% ThomasMethod uses Thomas method to find the solution
% vector x of a tridiagonal system Ax = b.
% x = ThomasMethod(A,b) where
% A is a tridiagonal n-by-n coefficient matrix,
% b is the n-by-1 vector of the right-hand sides,
% x is the n-by-1 solution vector.
n = size(A,1);
d = diag(A); % Vector of diagonal entries of A
l = [0;diag(A, -1)]; % Vector of lower diagonal elements

u = [diag(A,1);0]; % Vector of upper diagonal elements
u(1) = u(1)/d(1); b(1) = b(1)/d(1); % First equation
for i = 2:n-1 % The next n-2 equations
den = d(i) - u(i-1)*l(i);
if den == 0
x = 'failure, division by zero';
return
end
u(i)= u(i)/den; b(i) = (b(i)-b(i-1)*l(i))/den;
end
b(n) = (b(n)-b(n-1)*l(n))/(d(n)-u(n-1)*l(n));
% Last equation
x(n) = b(n);
for i = n-1: -1:1
x(i) = b(i) - u(i)*x(i+1);
end
x = x';
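As a quick, purely illustrative sanity check (the test values below are not from the original example), the function can be tried on a small tridiagonal system:

% Illustrative test of ThomasMethod on a small tridiagonal system
A = [2 -1 0; -1 2 -1; 0 -1 2];
b = [1; 0; 1];
x = ThomasMethod(A,b)   % expected solution [1; 1; 1], since A*[1;1;1] = b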

Now execute the PRADI function:


x = 0:1:3;
y = 0:1:3; y = y';
f = inline('0','x','y');
ubottom = inline('sin(pi*x/3)');
utop = inline('0','x'); uleft = inline('0','y');
uright = inline('0','y');
[U, k] = PRADI(x,y,f,uleft,uright,ubottom,utop)

U =
0 0 0 0
0 0.1083 0.1083 0
0 0.3248 0.3248 0
0 0.8660 0.8660 0
k =
5

If kmax is set to 1, the result will be as follows:


[U, k] = PRADI(x,y,f,uleft,uright,ubottom,utop,1e-4,1)
U =
0 0 0 0
0 0.1325 0.1325 0
0 0.3635 0.3635 0
0 0.8660 0.8660 0
k =
1

8.2.2 Neumann Problem
Until this point, we assumed that the values of 𝑢 are known on the boundary. The
Neumann problem is characterised by missing information on the boundary, meaning
the boundary values themselves form part of the unknown vector.
Considering the same Laplace’s equation, the difference equation at the mesh
point (3,3) is the following:
𝑢23 + 𝑢32 − 4𝑢33 + 𝑢43 + 𝑢34 = 0 (8.13.)
where 𝑢23 , 𝑢32 , 𝑢33 are the interior mesh points, while 𝑢43 , 𝑢34 are the
boundary points, as Figure 28 shows.

Figure 28: Illustration of the Neumann problem [3]

The interior mesh points form part of the unknown vector, and since the boundary
values are not available either, they too become part of the unknown vector. In
fact, the same problem occurs when equation (8.9.) is applied close to the boundary
points. Since the equation can only be written for interior points, in Figure 28
only 9 equations can be constructed, despite there being 25 unknowns. Thus, more
equations must be generated by applying equation (8.9.) to each marked boundary
point. For the two boundary points adjacent to (3,3), the equations are:

$$u_{24} + u_{44} + u_{35} + u_{33} - 4u_{34} = 0 \qquad (8.14.)$$
$$u_{42} + u_{44} + u_{53} + u_{33} - 4u_{43} = 0$$

This way, two quantities have been introduced ($u_{35}$ and $u_{53}$) that have to
be eliminated, since they are not part of the grid. This is done by using the
information on the vertical and horizontal boundary segments and extending the grid
beyond the boundary region. Since at point (3,4) the derivative
$\partial u/\partial y = g(x) = g_{34}$ is available, applying the two-point central
difference formula gives:

$$\left.\frac{\partial u}{\partial y}\right|_{(3,4)} = g_{34} = \frac{u_{35} - u_{33}}{2h} \;\;\Longrightarrow\;\; u_{35} = u_{33} + 2hg_{34} \qquad (8.15.)$$
$$\left.\frac{\partial u}{\partial x}\right|_{(4,3)} = k_{43} = \frac{u_{53} - u_{33}}{2h} \;\;\Longrightarrow\;\; u_{53} = u_{33} + 2hk_{43} \qquad (8.16.)$$
By substituting these expressions into equation (8.14.), two more equations can be
generated that contain only interior mesh points and boundary points. To continue
this process, it has to be assumed that Laplace's equation is valid beyond the
rectangular region, at least in the exterior area that contains the newly produced
points (such as $u_{35}$ and $u_{53}$). Using this method, as many equations can be
generated as the total number of interior mesh points and boundary points together;
in the example illustrated in Figure 28, 25 unknowns and 25 equations.
Whether a Neumann problem has a solution is determined by the normal derivatives
prescribed on the different boundary segments: the solution exists if the line
integral of the normal derivative around the boundary is zero, otherwise it does
not:

$$\oint_C \frac{\partial u}{\partial n}\,ds = 0 \qquad (8.17.)$$

Example [3]

Consider the following Neumann problem, illustrated in the figure below, with grid
spacing ℎ = 1. Is there a solution?

Solution:
The problem includes 2 interior points and 10 boundary points, 12 unknowns in
total. In order to determine whether there is a solution, evaluate equation (8.17.),

where the bottom edge is denoted 𝐿1, the right and left edges are 𝐿2 and 𝐿4, and
the top edge is 𝐿3 (going counterclockwise):

$$\oint_C \frac{\partial u}{\partial n}\,ds = \int_{L_1} \frac{\partial u}{\partial n}\,ds + \int_{L_2} \frac{\partial u}{\partial n}\,ds + \int_{L_3} \frac{\partial u}{\partial n}\,ds + \int_{L_4} \frac{\partial u}{\partial n}\,ds =$$
$$= \int_0^4 x\,dx + \int_0^2 y\,dy - \int_0^4 (x+1)\,dx - \int_0^2 (y-1)\,dy = -1 \ne 0$$
So, this problem does not have a solution. Using the earlier solution method, a
12 × 12 system of equations would be generated with a coefficient matrix that is
singular.

8.2.3 Mixed problem


When the values of 𝑢 are known along some segments of the boundary, while the
normal derivatives ($u_n = \partial u/\partial n$) are known on the other segments,
we are talking about a mixed problem. The solution in this case is similar to that
of the Neumann problem, but less complex, since $u_n$ has to be dealt with only on
certain segments of the boundary. Consequently, the region does not have to be
extended everywhere, and thus the linear system that describes the problem is
smaller.

Example [3]

Consider the following problem and solve it with a grid spacing of ℎ = 1.

Solution:
The upper, lower, and left edges carry Dirichlet boundary conditions, so no
extension is needed there; the grid only has to be extended at the (3,1) mesh
point, where the unknown value 𝑢41 is created. Equation (8.8.) is applied first at
the interior mesh points:
(1,1) 𝑢01 + 𝑢10 + 𝑢21 + 𝑢12 − 4𝑢11 = 0
(2,1) 𝑢11 + 𝑢20 + 𝑢31 + 𝑢22 − 4𝑢21 = 0
(3,1) 𝑢21 + 𝑢30 + 𝑢41 + 𝑢32 − 4𝑢31 = 0
With the two-point central difference formula, 𝑢41 can be eliminated:

$$\left.\frac{\partial u}{\partial x}\right|_{(3,1)} = [2y]_{(3,1)} = 2 = \frac{u_{41} - u_{21}}{2} \;\;\Longrightarrow\;\; u_{41} = u_{21} + 4$$
By using the boundary values and substituting with 𝑢41 = 𝑢21 + 4, we can
write:

$$\begin{aligned} u_{21} + 2 - 4u_{11} &= 0 \\ u_{11} + u_{31} + 8 - 4u_{21} &= 0 \\ u_{21} + (u_{21} + 4) + 18 - 4u_{31} &= 0 \end{aligned} \;\Longrightarrow\; \begin{bmatrix} -4 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 2 & -4 \end{bmatrix}\begin{bmatrix} u_{11} \\ u_{21} \\ u_{31} \end{bmatrix} = \begin{bmatrix} -2 \\ -8 \\ -22 \end{bmatrix} \;\Longrightarrow\; \begin{aligned} u_{11} &= 1.5769 \\ u_{21} &= 4.3077 \\ u_{31} &= 7.6538 \end{aligned}$$
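The small system above can also be verified directly with MATLAB's backslash operator; the following lines are only a quick check of the values just obtained:

% Verification of the 3-by-3 system of the mixed problem
A = [-4 1 0; 1 -4 1; 0 2 -4];
b = [-2; -8; -22];
u = A\b    % approximately [1.5769; 4.3077; 7.6538]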

8.2.4 Elliptic Partial Differential Equations in More Complex Regions
Up to this point, the elliptic PDEs have been considered over simple, rectangular
regions. In these cases the grid can easily be adjusted to the geometry, so that
some of the mesh points lie on the boundary. In reality, however, the geometry is
usually more complex, which results in a boundary that does not cross the grid at
the mesh points.
When Laplace's equation has to be solved over the region illustrated in Figure 29,
the boundary is not regular, since it includes a curved segment passing through
points 𝐴 and 𝐵, which are not mesh points. Points 𝑀 and 𝑄 cause no problem and can
be treated as before, but point 𝑃 has to be treated differently, because its
neighbouring points 𝐴 and 𝐵 are not on the grid. The aim is to derive expressions
at mesh point 𝑃 for 𝑢𝑥𝑥(𝑃) and 𝑢𝑦𝑦(𝑃) in order to obtain a new form of the PDE.

Figure 29: Illustration of an irregular boundary at a region; left: grid, right:
mesh points [3]

So, if the distance between 𝐴 and 𝑃 is 𝛼ℎ, while the distance between 𝐵 and 𝑃 is
𝛽ℎ, the Taylor series expansion of 𝑢 about point 𝑃 can be written at the four
points 𝐴, 𝐵, 𝑀, 𝑁; for 𝑢(𝑀) and 𝑢(𝐴):

$$u(M) = u(P) - h\frac{\partial u(P)}{\partial x} + \frac{1}{2!}h^2\frac{\partial^2 u(P)}{\partial x^2} - \cdots \qquad (8.18.)$$

$$u(A) = u(P) + (\alpha h)\frac{\partial u(P)}{\partial x} + \frac{1}{2!}(\alpha h)^2\frac{\partial^2 u(P)}{\partial x^2} + \cdots \qquad (8.19.)$$

Multiplying equation (8.18.) by 𝛼 and adding it to equation (8.19.) cancels the
first-derivative terms; neglecting all terms of order higher than ℎ² gives:

$$u(A) + \alpha u(M) \cong (\alpha+1)u(P) + \alpha(\alpha+1)\frac{h^2}{2}\frac{\partial^2 u(P)}{\partial x^2} \qquad (8.20.)$$

Rearranging the equation:

$$\frac{\partial^2 u(P)}{\partial x^2} = \frac{2}{h^2}\left[\frac{1}{\alpha(\alpha+1)}u(A) + \frac{1}{\alpha+1}u(M) - \frac{1}{\alpha}u(P)\right] \qquad (8.21.)$$

For 𝑢(𝐵) and 𝑢(𝑁) a similar equation can be derived:

$$\frac{\partial^2 u(P)}{\partial y^2} = \frac{2}{h^2}\left[\frac{1}{\beta(\beta+1)}u(B) + \frac{1}{\beta+1}u(N) - \frac{1}{\beta}u(P)\right] \qquad (8.22.)$$
By adding these two equations, we get:

$$u_{xx}(P) + u_{yy}(P) = \frac{2}{h^2}\left[\frac{1}{\alpha(\alpha+1)}u(A) + \frac{1}{\beta(\beta+1)}u(B) + \frac{1}{\alpha+1}u(M) + \frac{1}{\beta+1}u(N) - \left(\frac{1}{\alpha}+\frac{1}{\beta}\right)u(P)\right] \qquad (8.23.)$$

For Laplace's equation the left-hand side is zero, so multiplying through by ℎ² we
obtain:

$$\frac{2}{\alpha(\alpha+1)}u(A) + \frac{2}{\beta(\beta+1)}u(B) + \frac{2}{\alpha+1}u(M) + \frac{2}{\beta+1}u(N) - \frac{2(\alpha+\beta)}{\alpha\beta}u(P) = 0 \qquad (8.24.)$$

In order to obtain the general form, using the notation on the right-hand side of
Figure 29, the difference equation is:

$$\frac{2}{\alpha(\alpha+1)}u_{i+1,j} + \frac{2}{\beta(\beta+1)}u_{i,j+1} + \frac{2}{\alpha+1}u_{i-1,j} + \frac{2}{\beta+1}u_{i,j-1} - \frac{2(\alpha+\beta)}{\alpha\beta}u_{i,j} = 0 \qquad (8.25.)$$

The same method can be applied to Poisson's equation. This final equation can be
used at any mesh point where any of the neighbouring points is not located on the
grid.
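For given distance ratios the coefficients of equation (8.25.) can be evaluated numerically. The short script below is only an illustration; it uses the 𝛼 and 𝛽 values that appear in the next example:

% Coefficients of the irregular-boundary difference equation (8.25.)
alpha = 1; beta = 2/3;
cA = 2/(alpha*(alpha+1)); cB = 2/(beta*(beta+1));   % off-grid neighbours
cM = 2/(alpha+1);         cN = 2/(beta+1);          % on-grid neighbours
cP = 2*(alpha+beta)/(alpha*beta);                   % centre point
[cA cB cM cN cP]    % = [1.0  1.8  1.0  1.2  5.0]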

Example [3]

Solve the Laplace equation over the region illustrated below. The slanting part of
the boundary is described as 𝑦 = −(2/3)𝑥 + 2.

Solution:
The method starts again with the Dirichlet-type difference equation at the mesh
points (1,1), (2,1), (1,2):

$$1 + 4 + u_{12} + u_{21} - 4u_{11} = 0$$
$$u_{11} + 7 + 9.5 + u_{22} - 4u_{21} = 0$$
$$1 + u_{11} + \tfrac{1}{3} + u_{22} - 4u_{12} = 0$$
For point (2,2) equation (8.25.) has to be used. From the equation of the slanting
segment, the vertical distance of point 𝐴 from point (2,2) is 1/3; with ℎ = 1/2
this gives 𝛽 = 2/3, while 𝛼 = 1. This way:

$$9 + \frac{2}{\frac{2}{3}\cdot\frac{5}{3}}(2) + u_{12} + \frac{2}{\frac{5}{3}}u_{21} - \frac{2\left(1+\frac{2}{3}\right)}{\frac{2}{3}}u_{22} = 0$$
Combining this with the three equations above and simplifying:

$$\begin{bmatrix} -4 & 1 & 1 & 0 \\ 1 & -4 & 0 & 1 \\ 1 & 0 & -4 & 1 \\ 0 & 1.2 & 1 & -5 \end{bmatrix}\begin{bmatrix} u_{11} \\ u_{21} \\ u_{12} \\ u_{22} \end{bmatrix} = \begin{bmatrix} -5 \\ -16.5 \\ -1.33 \\ -12.6 \end{bmatrix} \;\Longrightarrow\; \begin{aligned} u_{11} &= 3.3354 \\ u_{21} &= 6.0666 \\ u_{12} &= 2.2749 \\ u_{22} &= 4.4310 \end{aligned}$$
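The resulting 4-by-4 system can again be checked with the backslash operator; the snippet below merely reproduces the values above:

% Verification of the 4-by-4 system of the irregular-boundary example
A = [-4 1 1 0; 1 -4 0 1; 1 0 -4 1; 0 1.2 1 -5];
b = [-5; -16.5; -1.33; -12.6];
u = A\b    % approximately [3.3354; 6.0666; 2.2749; 4.4310]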

8.3 Parabolic Partial Differential Equations


The parabolic differential equations can be solved mainly with two techniques: the
finite-difference (FD) method and the Crank-Nicolson (CN) method. Their goal is
similar to before: deriving a suitable difference equation that generates the
solution estimates at the mesh points. The main problem with parabolic PDEs is that
divergence cannot be eliminated completely, not even by refining the grid size; to
ensure convergence, additional conditions have to be satisfied.
One of the best-known examples of a parabolic PDE is the 1D heat equation in the
following form:

$$C\,\frac{\partial u}{\partial t} = \frac{\partial^2 u}{\partial x^2} \qquad (8.26.)$$
where 𝐶 is a constant.

8.3.1 Finite-Difference Method
The 1D heat equation can describe a wire of length 𝐿 whose ends are kept at zero
temperature and whose initial temperature along the wire is prescribed by 𝑓(𝑥). In
this case the simplified initial-boundary value problem is:

$$u_t = \alpha^2 u_{xx} \quad (\alpha = \text{constant} > 0), \qquad 0 \le x \le L, \; t \ge 0$$
$$u(0,t) = 0 = u(L,t) \qquad (8.27.)$$
$$u(x,0) = f(x)$$
As Figure 30 illustrates, the grid generated for solving the 1D heat equation has a
spacing of ℎ in the 𝑥-direction and 𝑘 in the 𝑡-direction.
As a first step, the partial derivatives in equation (8.27.) have to be replaced
with their finite-difference (FD) approximations. For the 𝑢𝑥𝑥 term a three-point
central difference formula can be applied, while for the 𝑢𝑡 term a two-point
forward difference formula must be used: although it is less accurate than the
central difference formula, at 𝑡 = 0 only forward progression is possible, since
there is no information available for 𝑡 < 0. The transformed equation is then:

$$\frac{1}{k}\left(u_{i,j+1} - u_{i,j}\right) = \frac{\alpha^2}{h^2}\left(u_{i-1,j} - 2u_{i,j} + u_{i+1,j}\right) \qquad (8.28.)$$

Figure 30: The generated grid in order to solve the 1D heat equation [3]

If 𝑢𝑖−1,𝑗, 𝑢𝑖,𝑗 and 𝑢𝑖+1,𝑗 are all known, 𝑢𝑖,𝑗+1 can be computed at the next time
level as:

$$u_{i,j+1} = \left[1 - \frac{2k\alpha^2}{h^2}\right]u_{i,j} + \frac{k\alpha^2}{h^2}\left(u_{i-1,j} + u_{i+1,j}\right) \qquad (8.29.)$$

This equation can be written in a simplified form:

$$u_{i,j+1} = (1 - 2r)u_{i,j} + r\left(u_{i-1,j} + u_{i+1,j}\right), \qquad r = \frac{k\alpha^2}{h^2} \qquad (8.30.)$$

which is known as the difference equation of the 1D heat equation using the FD
method.
The accuracy of the FD method improves as ℎ and 𝑘 tend to zero. Equation (8.30.)
can be considered stable and convergent if

$$r = \frac{k\alpha^2}{h^2} \le \frac{1}{2} \qquad (8.31.)$$

otherwise it is divergent and unstable.
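As a small illustration of condition (8.31.), the value of 𝑟 for the parameters used in the example below can be computed directly (a sketch only; the variable names are arbitrary):

% Stability check of the explicit FD scheme
alpha = 0.5; h = 0.25; k = 0.1;
r = k*alpha^2/h^2    % r = 0.4 <= 0.5, so the scheme is stable on this mesh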

Example [3]

Solve the 1D heat equation for a laterally insulated wire of length 𝐿 = 1 with
𝛼 = 0.5. The two ends of the wire are kept at zero temperature, while the initial
temperature is 𝑓(𝑥) = 10 sin 𝜋𝑥. Compute the temperature 𝑢(𝑥, 𝑡) for 0 ≤ 𝑥 ≤ 1 and
0 ≤ 𝑡 ≤ 0.5 on the mesh generated with ℎ = 0.25 and 𝑘 = 0.1, assuming all parameter
values are in consistent physical units.

Solution:
The exact solution is:
$$u(x,t) = (10\sin\pi x)\,e^{-0.25\pi^2 t}$$
For the numerical solution the following user-defined function can be
constructed in MATLAB:

function u = Heat1DFD(t,x,u,alpha)
%
% Heat1DFD numerically solves the one-dimensional heat
% equation, with zero boundary conditions, using the
% finite-difference method.
%
% u = Heat1DFD(t,x,u,alpha) where
%
% t is the row vector of times to compute,
% x is the column vector of x positions to compute,
% u is the column vector of initial temperatures for
% each value in x,
% alpha is a given parameter of the PDE,
%
% u is the solution at the mesh points.
u = u(:); % u must be a column vector
k = t(2)-t(1);
h = x(2)-x(1);
r = (alpha/h)^2*k;
if r > 0.5
warning('Method is unstable and divergent. Results will be inaccurate.')
end
i = 2:length(x)-1;
for j = 1:length(t)-1
u(i,j+1) = (1-2*r)*u(i,j) + r*(u(i-1,j)+u(i+1,j));
end

To execute the function, run the following commands:


t = 0:0.1:0.5;
x = 0:0.25:1; x = x';
u = 10.*sin(pi*x);
u = Heat1DFD(t,x,u,0.5)

As a result, we will receive:


u =

         0         0         0         0         0         0
    7.0711    5.4142    4.1456    3.1742    2.4304    1.8610
   10.0000    7.6569    5.8627    4.4890    3.4372    2.6318
    7.0711    5.4142    4.1456    3.1742    2.4304    1.8610
    0.0000         0         0         0         0         0
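Since the exact solution is known here, the numerical values can be compared against it. The lines below are only an optional check and assume the variables t, x and u from the commands above are still in the workspace:

% Optional accuracy check against the exact solution
[T,X] = meshgrid(t,x);
uexact = 10.*sin(pi*X).*exp(-0.25*pi^2*T);
err = max(abs(u(:) - uexact(:)))   % maximum absolute error at the mesh points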

The results and grid are visualised as follows:

8.3.2 Crank-Nicolson Method
The main problem with the finite-difference method is that the condition
$r = \frac{k\alpha^2}{h^2} \le \frac{1}{2}$ can impose a significant computational
burden: for example, when ℎ = 0.2 and 𝛼 = 1, the requirement 𝑟 ≤ 1/2 forces 𝑘 to be
0.02 or lower, leading to a very large number of time steps. Furthermore, if the
mesh size is halved to ℎ = 0.1, the number of time steps increases by a factor of
4. So, in order to decrease 𝑟, either ℎ must be increased, resulting in decreased
accuracy, or 𝑘 must be decreased, increasing the computational requirements.
To overcome this problem, the Crank-Nicolson (CN) method provides a solution
without a restriction on $r = \frac{k\alpha^2}{h^2}$. To achieve that, a six-point
computational molecule is employed, as opposed to the four-point molecule used with
the FD method.
The starting point is equation (8.28.), from which the CN difference equation can
be derived: on the right-hand side two second-difference terms are written, one for
the 𝑗th time row and one for the (𝑗+1)st time row, each multiplied by 𝛼²/(2ℎ²):

$$\frac{1}{k}\left(u_{i,j+1} - u_{i,j}\right) = \frac{\alpha^2}{2h^2}\left(u_{i-1,j} - 2u_{i,j} + u_{i+1,j}\right) + \frac{\alpha^2}{2h^2}\left(u_{i-1,j+1} - 2u_{i,j+1} + u_{i+1,j+1}\right) \qquad (8.32.)$$

Then, multiplying by 2𝑘, letting $r = \frac{k\alpha^2}{h^2}$, and rearranging the
higher time row terms to the left-hand side, we can write:

$$2(1+r)u_{i,j+1} - r\left(u_{i-1,j+1} + u_{i+1,j+1}\right) = 2(1-r)u_{i,j} + r\left(u_{i-1,j} + u_{i+1,j}\right), \qquad r = \frac{k\alpha^2}{h^2} \qquad (8.33.)$$
This is the difference equation of the 1D heat equation using the CN method.
The process starts at the initial time level, where the initial temperature 𝑓(𝑥)
provides the three terms on the right-hand side: 𝑢𝑖−1,𝑗, 𝑢𝑖,𝑗 and 𝑢𝑖+1,𝑗; the values
at the next time level are unknown. With 𝑛 non-boundary mesh points in each row,
this yields an 𝑛 × 𝑛 system of equations with a tridiagonal coefficient matrix.
Solving this system gives the temperature values in the next time row, and the
process is then repeated to approximate the solution at all desired mesh points.

Example [3]

The 1D heat equation considered earlier (8.6 Example) is used again. Find the
approximate values of 𝑢(𝑥, 𝑡) at the mesh points using the CN method and compare
the results with those of the FD method.

Solution:
Create the following MATLAB function:
function u = Heat1DCN(t,x,u,alpha)
%
% Heat1DCN numerically solves the one-dimensional heat
% equation, with zero boundary conditions, using the
% Crank-Nicolson method.
%
% u = Heat1DCN(t,x,u,alpha) where
% t is the row vector of times to compute,
% x is the column vector of x positions to compute,
% u is the column vector of initial temperatures for
% each value in x,
% alpha is a given parameter of the PDE,
%
% u is the solution at the mesh points.
u = u(:); % u must be a column vector
k = t(2)-t(1); h = x(2)-x(1); r = (alpha/h)^2*k;
% Compute A
n = length(x);
A = diag(2* (1+r)*ones(n-2,1));
A = A + diag(diag(A,-1)-r,-1);
A = A + diag(diag(A,1)-r, 1);
% Compute B
B = diag(2*(1-r)*ones(n-2,1));
B = B + diag(diag(B,-1) +r,-1);

B = B + diag(diag(B,1) +r,1);
C = A\B;
i = 2:length(x)-1;
for j = 1:length(t)-1
u(i,j+1) = C*u(i,j);
end

Now execute it:


t = 0:0.1:0.5;
x = 0:0.25:1; x=x';
u = 10.*sin(pi*x);
u = Heat1DCN(t,x,u,0.5)

The result is:


u =

0 0 0 0 0 0
7.0711 5.5880 4.4159 3.4897 2.7578 2.1794
10.0000 7.9026 6.2451 4.9352 3.9001 3.0821
7.0711 5.5880 4.4159 3.4897 2.7578 2.1794
0.0000 0 0 0 0 0

By visualising the results we get:

In these two examples the stability condition on 𝑟 is satisfied, so both the CN and
FD methods were successful, but CN provided more accurate results.
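The comparison can also be made quantitative. Assuming the exact solution given earlier and the two functions defined above, a rough error check might look like this:

% Rough comparison of FD and CN errors against the exact solution
t = 0:0.1:0.5; x = (0:0.25:1)';
uexact = 10.*sin(pi*x)*exp(-0.25*pi^2*t);   % 5-by-6 matrix of exact values
errFD = max(max(abs(Heat1DFD(t,x,10.*sin(pi*x),0.5) - uexact)))
errCN = max(max(abs(Heat1DCN(t,x,10.*sin(pi*x),0.5) - uexact)))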

8.4 Hyperbolic Partial Differential Equations
In order to introduce the numerical solution method of hyperbolic PDEs,
consider the following boundary-value problem:
$$u_{tt} = C^2 u_{xx}$$
$$u(x,0) = f(x)$$
$$u_t(x,0) = \phi(x) \qquad (8.34.)$$
$$u(0,t) = \psi_1(t)$$
$$u(1,t) = \psi_2(t)$$
for 0 ≤ 𝑡 ≤ 𝑇, which actually models the transverse (1D) vibrations of a stretched
string. Replacing the 𝑢𝑥𝑥 and 𝑢𝑡𝑡 terms in the first equation with three-point
central difference approximations, we get:

$$u_{xx} = \frac{1}{h^2}\left(u_{i-1,j} - 2u_{i,j} + u_{i+1,j}\right) + O(h^2), \qquad u_{tt} = \frac{1}{k^2}\left(u_{i,j-1} - 2u_{i,j} + u_{i,j+1}\right) + O(k^2) \qquad (8.35.)$$
where 𝑥 = 𝑖ℎ, 𝑖 = 0,1,2, … and 𝑡 = 𝑗𝑘, 𝑗 = 0,1,2, ….

Figure 31: The applied grid for solving the 1D wave equation [3]

By substituting $\tilde{r} = \left(\frac{k\alpha}{h}\right)^2$, multiplying by 𝑘²
and rearranging the equation for 𝑢𝑖,𝑗+1, we can write:

$$u_{i,j+1} = -u_{i,j-1} + 2(1-\tilde{r})u_{i,j} + \tilde{r}\left(u_{i-1,j} + u_{i+1,j}\right), \qquad \tilde{r} = \left(\frac{k\alpha}{h}\right)^2 \qquad (8.36.)$$

According to this, the function values at the 𝑗th and (𝑗−1)th time levels are
needed to compute the function values at the (𝑗+1)th time level. This difference
equation is known as the finite-difference form of the 1D wave equation. Methods of
this kind are called three-level difference schemes. If the terms in equation
(8.36.) were expanded as Taylor series and simplified, it could be seen that the
truncation error is 𝑂(𝑘² + ℎ²). Additionally, the formula works well if 𝑟̃ ≤ 1,
which is also the stability condition.
For the initial problem – the first equation of (8.34.) – there are two implicit
finite difference schemes:
$$\frac{u_{i,j+1} - 2u_{i,j} + u_{i,j-1}}{k^2} = \frac{C^2}{2h^2}\left[\left(u_{i+1,j+1} - 2u_{i,j+1} + u_{i-1,j+1}\right) + \left(u_{i+1,j-1} - 2u_{i,j-1} + u_{i-1,j-1}\right)\right] \qquad (8.37.)$$

and

$$\frac{u_{i,j+1} - 2u_{i,j} + u_{i,j-1}}{k^2} = \frac{C^2}{4h^2}\left[\left(u_{i+1,j+1} - 2u_{i,j+1} + u_{i-1,j+1}\right) + 2\left(u_{i+1,j} - 2u_{i,j} + u_{i-1,j}\right) + \left(u_{i+1,j-1} - 2u_{i,j-1} + u_{i-1,j-1}\right)\right] \qquad (8.38.)$$
These implicit formulas give good approximations for all 𝑟̃ values.

Example [3]

Solve the following PDE:


$$\frac{\partial^2 u}{\partial t^2} = \frac{\partial^2 u}{\partial x^2}$$

with 𝑢(0, 𝑡) = 0, 𝑢(1, 𝑡) = 0 for 𝑡 > 0, and $\frac{\partial u}{\partial t}(x,0) = 0$, $u(x,0) = \sin^3(\pi x)$ for all 𝑥 ∈ [0, 1].

Solution:
This example has an exact solution which is:
$$u(x,t) = \frac{3}{4}\sin\pi x\,\cos\pi t - \frac{1}{4}\sin 3\pi x\,\cos 3\pi t$$

Apply the explicit formula (equation (8.36.)), which requires 𝑟̃ ≤ 1, with ℎ = 0.25
and 𝑘 = 0.2. Thus $\tilde{r} = (0.2/0.25)^2 = 0.64$, so the stability condition is
satisfied. If $u_{ij} = u(ih, jk)$, then the boundary and initial conditions are:

$$u_{0,j} = 0, \qquad u_{4,j} = 0$$
$$u_{i,0} = \sin^3(\pi i h), \quad i = 1, 2, 3, 4$$
$$u_{i,1} - u_{i,-1} = 0 \;\Rightarrow\; u_{i,-1} = u_{i,1}$$
By substituting the calculated 𝑟̃, the difference equation becomes:

$$u_{i,j+1} = -u_{i,j-1} + 0.64\left(u_{i-1,j} + u_{i+1,j}\right) + 2(0.36)u_{i,j}$$

For the first time row, because $u_{i,-1} = u_{i,1}$, this reduces to
$u_{i,1} = 0.32(u_{i-1,0} + u_{i+1,0}) + 0.36u_{i,0}$, so it can be calculated:

$$u_{1,1} = 0.32(u_{0,0} + u_{2,0}) + 0.36u_{1,0} = 0.32(0+1) + 0.36(0.3537) = 0.4473$$
The exact value of the function is 𝑢(0.25, 0.2) = 0.4838.
By repeating the same steps, we get:
$$u_{2,1} = 0.32(0.3537 + 0.3537) + 0.36(1.0) = 0.5867$$
while the exact value is 0.5296.
As a final step:
$$u_{3,1} = 0.32(1.0 + 0) + 0.36(0.3537) = 0.4473$$
where the exact value is 0.4838.
The computations could be continued for 𝑗 = 1,2, …
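The hand computation of the first time row can be reproduced with a few MATLAB lines; the script below is only a sketch of this single step, with arbitrary variable names:

% First time row of the explicit wave scheme, using u(i,-1) = u(i,1)
h = 0.25; k = 0.2; r = (k/h)^2;     % r = 0.64
x = 0:h:1;
u0 = sin(pi*x).^3;                  % initial displacement u(i,0)
u1 = zeros(size(u0));
i = 2:length(x)-1;
u1(i) = r/2*(u0(i-1) + u0(i+1)) + (1-r)*u0(i);
u1    % compare with the hand-computed values u11, u21, u31 above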

Example [3]

Take an elastic string of length 𝐿 = 2 with 𝛼 = 1, fixed at both ends. The initial
displacement of the string is given as 𝑓(𝑥) = 5 sin(𝜋𝑥/2), while the initial
velocity is zero (𝑔(𝑥) = 0). Find the displacement 𝑢(𝑥, 𝑡) of the string for
0 ≤ 𝑥 ≤ 𝐿 and 0 ≤ 𝑡 ≤ 2, when ℎ = 𝑘 = 0.4.

Solution:
The exact solution is the following:
𝜋𝑥 𝜋𝑡
𝑢(𝑥, 𝑡) = 5 sin cos
2 2
First, calculate 𝑟̃:

$$\tilde{r} = \left(\frac{k\alpha}{h}\right)^2 = 1$$
The FD approach for solving the wave equation with zero boundary conditions and
prescribed initial displacement and velocity can be implemented as:

function u = Wave1DFD(t,x,u,ut,alpha)
%
% Wave1DFD numerically solves the one-dimensional wave
% equation, with zero boundary conditions, using the
% finite-difference method.
%
% u = Wave1DFD(t,x,u,ut,alpha) where
% t is the row vector of times to compute,
% x is the column vector of x positions to compute,
% u is the column vector of initial displacements
% for each value in x,
% ut is the column vector of initial velocities for
% each value in x,
% alpha is a given parameter of the PDE,
% u is the solution at the mesh points.
u = u(:); % u must be a column vector
ut = ut(:); % ut must be a column vector
k = t(2)-t(1);
h = x(2)-x(1);
r = (k*alpha/h)^2;
if r>1
warning('Method is unstable and divergent. Results will be inaccurate.')
end
i = 2:length(x)-1;
u(i,2) = (1-r)*u(i,1) + r/2*(u(i-1,1) + u(i+1,1)) + k*ut(i);
for j = 2:length(t)-1
u(i,j+1) = -u(i,j-1) + 2*(1-r)*u(i,j) + r*(u(i-1,j) + u(i+1,j));
end

Using the user-defined function, the results can be obtained and plotted as follows:
t = 0:0.1:2;
x = 0:0.1:2; x = x';
u = 5.*sin(pi*x/2);
ut = zeros(length(x),1);
u = Wave1DFD(t,x,u,ut,1);
plot(x,u)
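Since the exact solution is known, an optional accuracy check is possible as well; the lines below assume the variables t, x and u from the commands above are still in the workspace:

% Optional check against the exact solution u(x,t) = 5 sin(pi*x/2) cos(pi*t/2)
[T,X] = meshgrid(t,x);
uexact = 5*sin(pi*X/2).*cos(pi*T/2);
err = max(abs(u(:) - uexact(:)))   % maximum absolute error at the mesh points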

The displacement of the elastic string looks like the following:

References
1. Faragó, I. and R. Horváth, Numerikus Módszerek, F. Miklós, Editor. 2011,
BME TTK Matematika Intézet: Budapest.
2. Dr. Nyakóné dr. Juhász Katalin, Dr. Terdik György, Biró Piroska, and Dr.
Kátai Zoltán, Bevezetés az Informatikába. 2011, Debreceni Egyetem,
Informatikai Kar; Sapientia EMTE, Műszaki és Humántudományok Kar:
Kempelen Farkas Hallgatói Információs Központ.
3. Esfandiari, R.S., Numerical Methods for Engineers and Scientists Using
MATLAB. 2013: Taylor & Francis Group, LLC.

<<10-20 line summary for the back cover of the book>>

This book was written for the curriculum of the Faculty of Transportation and
Vehicle Engineering at the Budapest University of Technology and Economics, fitted
to the Stipendium Hungaricum Scholarship Programme. The aim of developing and using
numerical methods, instead of relying on complicated and expensive analytical
solution methods, is to approximate the desired accurate solution with simpler
arithmetical equations, at the required level of accuracy. After discussing the
different errors that affect a numerical method's result, and their sources, the
chapters introduce the solution of equations and systems of equations, followed by
the two main curve fitting approaches, approximation and interpolation. These
chapters are followed by numerical differentiation and integration methods, and
then the numerical solution of ordinary and partial differential equations is
outlined. Since MATLAB is a popular choice for engineering use, with its
user-friendly yet powerful computing capability, and is practically an industry
standard for prototyping and new developments, many chapters in these notes show
examples implemented in MATLAB.

