https://www.scribd.com/doc/53637571/ELEC3030Notesv1
Numerical Methods
1. Introduction
Most mathematical problems in engineering and physics as well as in many other
disciplines cannot be solved exactly (analytically) because of their complexity. They have to be
tackled either by using analytical approximations, that is by choosing a simplified (and then only
approximately correct) analytical formulation that can then be solved exactly, or by using
numerical methods. In both cases, approximations are involved but this need not represent a
disadvantage since in most practical cases, the parameters used in the mathematical model (the
basic equations) will only be known approximately, by measurements or otherwise.
Furthermore, if the accuracy is high there is no practical advantage of having an exact solution,
when only a few decimal figures will have practical significance.
An ideal property sought in the design of numerical methods is that they should be “exact
in the limit”; that is, if we refine the discretization, take more terms in a series, or perform a
larger number of iterations, etc. (depending on the nature of the method), the result should only
improve, approaching the exact solution asymptotically. Of course, not all methods will have
this property.
Since approximations will be present, an important aspect of using numerical methods is
being able to determine the magnitude of the error involved. Naturally, we cannot know the
error itself: that would imply knowing the exact solution! But we can determine error bounds,
or limits, so we can guarantee that the error cannot exceed a certain measure.
There are several sources for these errors. One of them could be an inaccurate
mathematical model; that is, the mathematical description of the problem does not represent
the physical situation correctly. Possible reasons for this are:
Incomplete or incorrect theory
Idealizations/Approximations/Uncertainty
The second reason could be due, for example, to idealization of the geometry of the problem or
of the properties of the materials involved, and to uncertainty in the values of material
parameters, etc.
This type of error concerns the mathematical model, and the only way to reduce it is
to improve the mathematical description of the physical problem. It does not concern the
numerical calculation. It is, though, an important part of the more general problem of “numerical
or computer modelling”.
Errors concerning the numerical calculations can be of three kinds:
i) Blunder: Mistakes, computer faults or programming bugs. We cannot really dismiss this
kind of error. In practice, the possibility of its occurrence must always be considered, in particular
when the result is not known approximately in advance and there is then no means of evaluating
the ‘reasonableness’ of the obtained result.
ii) Truncation error: This is an approximation error, where for example the value of a
function is calculated with a finite number of terms of an infinite series, or an integral is
estimated by the sum of a finite number of trapezoidal areas over (non-infinitesimal) segments.
This is also called discretization error in some cases. It is important to have an estimate or a
bound (limit) for this type of error.
iii) Round off error: Arithmetic calculations can almost never be carried out with
E763 (part 2) Numerical Methods page 2
complete accuracy. Most numbers have infinite decimal representation, which must be rounded.
But even if the data in a problem can be initially expressed exactly by finite decimal
representation, divisions may introduce numbers that must be rounded; multiplication will also
introduce more digits. This type of error has a random character that makes it difficult to deal
with.
So far the error we have been discussing is the absolute error, defined by:
error = true value – approximation. A problem with this definition is that it doesn’t take into
account the magnitude of the value being measured; for example, an absolute error of 1 cm has a
very different significance in the length of a 100 m bridge and in that of a 10 cm bolt. Another
definition, which reflects this significance, is the relative error, defined as: relative error =
absolute error/true value. In the previous case, the bridge and the bolt have relative errors of
10^–4 and 0.1 respectively; or in percent, 0.01% and 10%.
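The two definitions are easy to check numerically. The short sketch below uses the bridge and bolt figures from the text (lengths in metres, with the hypothetical 1 cm measurement error in each case); the helper names abs_error and rel_error are just illustrative:

```python
def abs_error(true_value, approx):
    """Absolute error: |true value - approximation|."""
    return abs(true_value - approx)

def rel_error(true_value, approx):
    """Relative error: absolute error divided by |true value|."""
    return abs_error(true_value, approx) / abs(true_value)

# A 1 cm error on a 100 m bridge vs on a 10 cm bolt:
print(rel_error(100.0, 100.0 - 0.01))  # about 1e-4, i.e. 0.01%
print(rel_error(0.10, 0.10 - 0.01))    # about 0.1, i.e. 10%
```

The same absolute error thus produces relative errors three orders of magnitude apart.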
2. Machine representation of numbers
Not only do numbers have to be truncated to be represented (for manual calculation or in a
machine), but all the intermediate results are subjected to the same restriction (over the
complete calculation process). Of the continuum of real numbers, only a discrete set can be
represented. Additionally, only a limited range of numbers can be represented in a given
machine. Also, even if two numbers can be represented, it is possible that the result of an
arithmetic operation between them is not representable. Finally, there will be occasions where
results fall outside the range of representable numbers (underflow or overflow errors).
Example: Some computers accept real numbers in a floating-point format with, say, two digits
for the exponent; that is, in the form:
    ± 0.xxxxxx E ±xx    (mantissa, then exponent)
which can only represent numbers in the range 10^–100 ≤ y < 10^99.
This is a normalized form, where the mantissa is defined such that:
    0.1 ≤ mantissa < 1
We will see this in more detail next.
Floating-Point Arithmetic and Round-off Error
Any number can be represented in a normalised format of the form x = ± a×10^b in a
decimal representation. We can even use any other number as the exponent base, giving the
more general form x = ± a×β^b; common examples are binary, octal and hexadecimal
representations. In all of these, a is called the mantissa, a number between 1/β and 1 (between
0.1 and 1 in the decimal case), which can be infinite in length (for example for π, or for 1/3 or
1/9), β is the base (10 for decimal representation) and b is the exponent (positive or negative),
which can also be infinite.
If we want to represent these numbers in a computer or use them in calculations, we need
to truncate both the mantissa (to a certain number of places) and the exponent, which limits the
range (or size) of numbers that can be considered.
Now, if A is the set of numbers exactly representable in a given machine, the question
arises of how to represent a number x not belonging to A (x ∉ A).
This is encountered not only when reading data into a computer, but also when
representing intermediate results in the computer during a calculation. Results of the elementary
arithmetic operations between two numbers need not belong to A. Let’s see first how a number
is represented (truncated) in a machine.
A machine representation can in most cases be obtained by rounding:
    x → fl(x)
Here and from now on, fl(x) will represent the truncated form of x (that is, with a
truncated mantissa and limited exponent), and not just its representation in normalized
floating-point format.
For example, in a computer with a t = 4 digit representation for the mantissa:
    fl(π) = 0.3142E1
    fl(0.142853) = 0.1429E0
    fl(14.28437) = 0.1428E2
In general, x is first represented in a normalized form (floating-point format): x = a×10^b
(if we consider a decimal system), where a is the mantissa and b the exponent, with
1 > |a| ≥ 10^–1 (so that the first digit of a after the decimal point is not 0).
Then suppose that the decimal representation of |a| is given by:
    |a| = 0.α1α2α3α4...αi...    with 0 ≤ αi ≤ 9, α1 ≠ 0    (2.1)
(where a can have infinitely many digits!)
Then, we form:
    a' = 0.α1α2···αt            if 0 ≤ αt+1 ≤ 4
       = 0.α1α2···αt + 10^–t    if αt+1 ≥ 5        (2.2)
That is, only t digits are kept in the mantissa and the last one is rounded: αt is
incremented by 1 if the next digit αt+1 ≥ 5, and all digits after αt are deleted.
The machine representation (truncated form of x) will be:
    fl(x) = sign(x) · a' × 10^b    (2.3)
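The rounding rule (2.2)–(2.3) can be sketched with Python's decimal module, whose context precision plays the role of t; this is only an illustration (the function name fl and the use of ROUND_HALF_UP to mimic the αt+1 ≥ 5 rule are our choices, not part of the notes):

```python
from decimal import Decimal, Context, ROUND_HALF_UP

def fl(x, t):
    """Round x to a t-digit mantissa, mimicking rules (2.2)-(2.3)."""
    ctx = Context(prec=t, rounding=ROUND_HALF_UP)
    return ctx.plus(Decimal(repr(x)))   # plus() applies the context rounding

print(fl(3.14159265, 4))  # 3.142, i.e. fl(pi) = 0.3142E1
print(fl(0.142853, 4))    # 0.1429
print(fl(14.28437, 4))    # 14.28, i.e. 0.1428E2
```

These reproduce the three t = 4 examples given above.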
The exponent b must also be limited. So a floating-point system will be characterized by
the parameters: t, the length of the mantissa; β, the base (10 in the decimal case, as above); and
L and U, the limits for the exponent b: L ≤ b ≤ U.
To clarify some of the features of the floating-point representation, let’s examine the
system (β, t, L, U) = (10, 2, –1, 2).
The mantissa can have only two digits, so there are only 90 possible forms for the
mantissa (excluding 0 and negative numbers):
    0.10, 0.11, 0.12, . . . , 0.98, 0.99
The exponent can vary from –1 to 2, so it can only take 4 possible values: –1, 0, 1 and 2.
Then, including now zero and negative numbers, this system can represent exactly only
2×90×4 + 1 = 721 numbers: the set of floating-point numbers is finite.
The smallest positive number in the system is 0.10×10^–1 = 0.01 and the largest is
0.99×10^2 = 99.
We can also see that the spacing between numbers in the set varies; the numbers are not equally
spaced. At the small end, consecutive numbers such as 0.010, 0.011, 0.012, etc. are spaced
0.001 apart (and there is a gap of 0.01 between 0 and the smallest positive number), while at the
other extreme, 97, 98, 99, the spacing is 1: in absolute terms the spacing grows by orders of
magnitude across the range.
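The counting and spacing claims above can be verified by brute force. The sketch below enumerates the toy system (β, t, L, U) = (10, 2, –1, 2) using exact rationals, so no binary rounding interferes:

```python
from fractions import Fraction

# Enumerate every value in the toy system (beta, t, L, U) = (10, 2, -1, 2):
# two-digit normalized mantissas 0.10 ... 0.99, exponents -1 ... 2,
# both signs, plus zero.
numbers = {Fraction(0)}
for m in range(10, 100):          # mantissa 0.10 ... 0.99 as m/100
    for b in range(-1, 3):        # exponents -1, 0, 1, 2
        value = Fraction(m, 100) * Fraction(10) ** b
        numbers.add(value)
        numbers.add(-value)

print(len(numbers))                    # 721 = 2*90*4 + 1
positives = sorted(x for x in numbers if x > 0)
print(positives[0], positives[-1])     # 1/100 (= 0.01) and 99
print(positives[1] - positives[0])     # 1/1000: spacing just above zero
print(positives[-1] - positives[-2])   # 1: spacing at the top of the range
```

The enumeration confirms both the count of 721 and the non-uniform spacing.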
In a system like this, we are interested in the relative error produced when any number is
truncated in order to be represented in the system. (We are here excluding the underflow or
overflow errors produced when we try to represent a number which is outside the range defined
by L and U, for example 1000 or even 100 in the example above.)
It can be shown that the relative error of fl(x) is bounded by:
    |fl(x) − x| / |x| ≤ 5×10^–t = eps    (2.4)
This limit eps is defined here as the machine precision.
Demonstration:
From (2.1) and (2.2), the normalized decimal representation of x and its truncated
floating-point form, the maximum possible difference between the two forms is 5 at the decimal
position t+1, that is:
    |fl(x) − x| ≤ 5×10^–(t+1) × 10^b
Also, since |x| ≥ 0.1×10^b, we have 1/|x| ≤ 10×10^–b, and combining the two we obtain
condition (2.4).
From equation (2.4) we can write:
    fl(x) = x(1 + ε)    (2.5)
where |ε| ≤ eps, for all numbers x. The quantity (1 + ε) in (2.5) cannot be distinguished from
1 in this machine representation, and the maximum value of |ε| is eps. So, we can also define
the machine precision eps as the smallest positive machine number g for which 1 + g > 1.
Definition (machine precision):
    eps = min { g : g + 1 > 1, g > 0 }    (2.6)
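Definition (2.6) can be evaluated experimentally on the machine at hand. A minimal sketch for IEEE double precision, restricting the search to powers of two (which is how machine epsilon is conventionally quoted):

```python
import sys

# Halve g until 1 + g is no longer distinguishable from 1,
# following definition (2.6) restricted to powers of two.
g = 1.0
while 1.0 + g / 2 > 1.0:
    g = g / 2

print(g)                       # 2**-52, about 2.22e-16 for IEEE doubles
print(sys.float_info.epsilon)  # Python reports the same machine epsilon
```

The loop stops when 1 + g/2 rounds back to exactly 1, leaving g as the smallest power of two still distinguishable from zero when added to 1.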
Error in basic operations
The result of arithmetic operations between machine numbers will also have to be
represented as machine numbers. Then, for each arithmetic operation, and assuming that x and
y are already machine numbers, we will have:
    x + y  →  fl(x + y) = (x + y)(1 + ε1)    (2.7)
    x − y  →  fl(x − y) = (x − y)(1 + ε2)    (2.8)
    x * y  →  fl(x * y) = (x * y)(1 + ε3)    (2.9)
    x / y  →  fl(x / y) = (x / y)(1 + ε4)    (2.10)
with all |εi| ≤ eps.
If x and y are not floating-point numbers (machine numbers) they will have to be converted
first, giving:
    x + y  →  fl(x + y) = fl(fl(x) + fl(y))
and similarly for the rest.
Let’s examine, for example, the subtraction of two such numbers, z = x – y, ignoring
higher-order error terms:
    fl(z) = fl(fl(x) – fl(y))
          = (x(1 + ε1) – y(1 + ε2))(1 + ε3)
          = ((x – y) + xε1 – yε2)(1 + ε3)
          = (x – y) + xε1 – yε2 + (x – y)ε3
Then:
    |fl(z) − z| / |z| = |ε3 + (xε1 − yε2)/(x − y)| ≤ eps (1 + (|x| + |y|)/|x − y|)
We can see that if x approaches y the relative error can blow up, especially for large values of x
and y. The maximal error bounds are pessimistic, and in practical calculations errors might tend
to cancel. For example, in adding 20000 numbers rounded to, say, 4 decimal places, the
maximum error will be 0.5×10^–4 × 20000 = 1 (imagining the maximum absolute truncation of
0.00005 in every case), while it is extremely improbable that this case occurs. From a statistical
point of view, one can expect that in about 90% of the cases the error will not exceed 0.005.
Example
Let’s compute the difference between a = 1200 and b = 1194 using a floating-point
system with a 3-digit mantissa:
    fl(a – b) = fl(fl(1200) – fl(1194)) = fl(0.120×10^4 – 0.119×10^4)
              = 0.001×10^4 = 10
where the correct value is 6, giving a relative error of 0.667 (or 66.7%).
The machine precision for this system is eps = 5×10^–t, and the error bound above gives a
limit for the relative error of:
    eps (1 + (|a| + |b|)/|a − b|) = 0.005 × (1 + 2394/6) = 2.0
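This example can be replayed directly, using a 3-digit decimal context to stand in for the 3-digit mantissa (a sketch with Python's decimal module; the context object is our stand-in for the hypothetical machine):

```python
from decimal import Decimal, Context, ROUND_HALF_UP

ctx = Context(prec=3, rounding=ROUND_HALF_UP)  # t = 3 digit mantissa

a = ctx.plus(Decimal(1200))   # fl(1200) = 0.120E4
b = ctx.plus(Decimal(1194))   # fl(1194) = 0.119E4
diff = ctx.subtract(a, b)

print(diff)                        # 1E+1, i.e. 10, although the true difference is 6
print(abs(diff - Decimal(6)) / 6)  # relative error about 0.667 (66.7%)
```

The cancellation of the leading digits leaves only the (already rounded) trailing digit, exactly as in the worked example.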
Error propagation in calculations
One of the important tasks in computer modelling is to find algorithms where the error
propagation remains bounded.
In this context, an algorithm is defined as a finite sequence of elementary operations that
prescribe how to calculate the solution to a problem from given input data (as in a sequence of
computer instructions). Problems can arise when one is not careful, as is shown in the following
example:
Example
Assume that we want to calculate the sum of three floating-point numbers a, b and c.
This has to be done in sequence, that is, using one of the following two algorithms:
i) (a + b) + c or
ii) a + (b + c)
If the numbers are in floating-point format with t = 8 digits and their values are, for example:
    a =  0.23371258E–4
    b =  0.33678429E2
    c = –0.33677811E2
the two algorithms will give the results:
    i)  0.64100000E–3
    ii) 0.64137126E–3
The exact result (which needs 14 digits to calculate) is 0.641371258E–3.
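The two orderings can be reproduced with an 8-digit decimal context (a sketch; the minus signs, lost in the scanned copy, are restored here as the results imply: a is of order 10^–4 and c is negative):

```python
from decimal import Decimal, getcontext

getcontext().prec = 8  # t = 8 digit floating-point arithmetic

a = Decimal('0.23371258E-4')
b = Decimal('0.33678429E2')
c = Decimal('-0.33677811E2')

print((a + b) + c)   # 0.000641: three trailing digits of a are lost when a + b is rounded
print(a + (b + c))   # 0.00064137126: close to the exact 0.000641371258
```

Adding the two nearly cancelling large numbers first (algorithm ii) keeps the small number a from being swamped.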
Exercise 2.1
Show, using an error analysis, why case ii) gives a more accurate result for the numbers of
the example above. Neglect higher-order error terms; that is, products of the form ε1ε2.
Example
Determine the error propagation in the calculation of y = (x − a)^2 using floating-point
arithmetic, by two different algorithms, when x and a are already floating-point numbers.
a) Direct calculation: y = (x − a)^2
    fl(y) = [(x − a)(1 + ε1)]^2 (1 + ε2)
    fl(y) = (x − a)^2 (1 + ε1)^2 (1 + ε2)
Preserving only first-order error terms:
    fl(y) = (x − a)^2 (1 + 2ε1)(1 + ε2)
    fl(y) = (x − a)^2 (1 + 2ε1 + ε2)
then:
    fl(y) − y = (x − a)^2 (2ε1 + ε2), or
    ∆y = (fl(y) − y)/y = 2ε1 + ε2
We can see that the relative error in the calculation of y using this algorithm is given by
2ε1 + ε2, so it is less than 3·eps.
b) Using the expanded form: y = x^2 − 2ax + a^2
    fl(y) = [(x^2(1 + ε1) − 2ax(1 + ε2))(1 + ε3) + a^2(1 + ε4)](1 + ε5)
That is, we take the square of x first (with its error), subtract the product 2ax (with its
error) adding the error of the subtraction, and then add the last term with its error and the
corresponding error due to that addition. Expanding, and keeping only first-order error terms,
we get:
    fl(y) = [(x^2 − 2ax) + x^2 ε1 − 2ax ε2 + (x^2 − 2ax) ε3 + a^2(1 + ε4)](1 + ε5)
    fl(y) = (x^2 − 2ax + a^2) + x^2(ε1 + ε3) − 2ax(ε2 + ε3) + a^2 ε4 + (x^2 − 2ax + a^2) ε5
from where we can write:
    ∆y = (fl(y) − y)/y = ε5 + [x^2/(x − a)^2](ε1 + ε3) − [2ax/(x − a)^2](ε2 + ε3) + [a^2/(x − a)^2] ε4
and we can see that there will be problems with this calculation if (x – a)^2 is too small compared
with either x^2 or a^2. The first term above is bounded by eps, while the others are eps
multiplied by the amplification factors x^2/(x − a)^2, 2ax/(x − a)^2 and a^2/(x − a)^2. For
example, if x = 15 and a = 14, the three amplification factors will be respectively 225, 420 and
196, which gives a total error bound of (1 + 450 + 840 + 196)·eps = 1487·eps, compared with
3·eps from algorithm a).
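The contrast between the two algorithms is easy to observe in double precision. The sketch below picks 15.000001 and 14.999999 (by analogy with x = 15, a = 14, but close enough that (x − a)^2 is tiny compared with x^2), and uses exact rational arithmetic as the reference; the function names are ours:

```python
from fractions import Fraction

def direct(x, a):
    return (x - a) ** 2               # algorithm a): y = (x - a)^2

def expanded(x, a):
    return x * x - 2 * a * x + a * a  # algorithm b): y = x^2 - 2ax + a^2

x, a = 15.000001, 14.999999
# Exact rational reference, computed from the same double-precision inputs:
exact = (Fraction(x) - Fraction(a)) ** 2

err_direct = abs(Fraction(direct(x, a)) - exact) / exact
err_expanded = abs(Fraction(expanded(x, a)) - exact) / exact
print(float(err_direct))    # tiny: x - a is computed exactly here, one rounding remains
print(float(err_expanded))  # many orders of magnitude larger, as the amplification factors predict
```

The expanded form subtracts numbers of order 225 to produce a result of order 10^–12, so the roundings of the large intermediates dominate the answer.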
Exercise 2.2
For y = a – b, compare the error bounds when a and b are, and are not, already defined as
floating-point numbers.
Exercise 2.3
Determine the error propagation characteristics of two algorithms to calculate:
a) y = x^2 − a^2, and b) y = (x − 1)^3.
Assume in both cases that x and a are floating-point numbers.
3. Root Finding: Solution of nonlinear equations
This is a problem of very common occurrence in engineering and physics. We’ll study in
this section several numerical methods to find roots of equations. The methods can be broadly
classified as “bracketing methods” and “open methods”. In the first case, the solution lies inside
an interval limited by a lower and an upper bound, and the methods always converge as the
iterations progress. In contrast, the open methods are based on procedures that require one or
two starting points which do not necessarily enclose the root. Because of this, they sometimes
diverge; however, when they do converge, they usually do so faster than the bracketing
methods.
Bracketing Methods
The Bisection Method
If a function is continuous in the interval [a, b] and f(a) and f(b) are of different sign
(f(a)f(b)<0), then at least one root of f(x) will lie in that interval.
The bisection method is an iterative procedure that starts with two points where f(x) has
different sign, that is, that “bracket” a root of the function.
We now define the point
    c = (a + b)/2    (3.1)
as the midpoint of the interval [a, b], and examine the sign of f(c). There are now three
possibilities: f(c) = 0, in which case the solution is c; f(c) has the same sign as f(a); or f(c) has
the same sign as f(b). The next interval in which to search for the root will be either [a, c] or
[c, b], according to whether f(a) and f(c), or f(c) and f(b), are of different sign (equivalently, we
search for the case f(a)f(c) < 0 or f(c)f(b) < 0). The process then continues, each time halving
the size of the search interval and “bracketing” the solution, as shown in Figs. 3.1 and 3.2.
CONVERGENCE AND ERROR
Since we know that the solution lies in the interval [a, b] the absolute error for c is
bounded by (b−a)/2, then, we can see that after n iterations, halving the interval in each iteration,
the search interval would have reduced in length to (b – a)/2
n
, so the maximum error after n
iterations is:
n
n
a b
c
2
−
≤ − α (3.2)
Fig. 3.1 Fig. 3.2
a
n
b
n
c
n
a
n+1
b
n+1
c
n+1
a
b c
page 7 E763 (part 2) Numerical Methods
This is, taking the square of x first (with its error) and subtracting the product 2ax (with its error) and the error in the subtraction, and then, adding the last term with its error and the corresponding error due to that addition. Solving this, keeping only first order error terms, we get:

fl(y) = [(x^2 − 2ax) + x^2 ε_1 − 2ax ε_2 + (x^2 − 2ax) ε_3 + a^2 (1 + ε_4)] (1 + ε_5)

fl(y) = (x^2 − 2ax + a^2) + x^2 (ε_1 + ε_3) − 2ax (ε_2 + ε_3) + a^2 ε_4 + (x^2 − 2ax + a^2) ε_5

from where we can write:

Δy = (fl(y) − y)/y = ε_5 + [x^2/(x − a)^2] (ε_1 + ε_3) − [2ax/(x − a)^2] (ε_2 + ε_3) + [a^2/(x − a)^2] ε_4

and we can see that there will be problems with this calculation if (x − a)^2 is too small compared with either x^2 or a^2. The first term above is bounded by eps, while the others will be eps multiplied by the corresponding amplification factor, e.g. x^2/(x − a)^2. For example, if x = 15 and a = 14, the three amplification factors will be respectively 225, 420 and 196, which gives a total error bound of (1 + 450 + 840 + 196)eps = 1487eps, compared with 3 eps from algorithm a).
Exercise 2.2
For y = a − b, compare the error bounds when a and b are, and are not, already defined as floating-point numbers.
Exercise 2.3
Determine the error propagation characteristics of two algorithms to calculate
a) y = x^2 − a^2 and b) y = (x − 1)^3. Assume in both cases that x and a are floating point numbers.
3. Root Finding: Solution of nonlinear equations
This is a problem of very common occurrence in engineering and physics. We’ll study in
this section several numerical methods to find roots of equations. The methods can be basically
classified as “bracketing methods” and “open methods”. In the first case, the solution lies inside
an interval limited by a lower and an upper bound. The methods are always convergent as the
iterations progress. In contrast, the open methods are based on procedures that require one or two starting points, which do not necessarily enclose the root. Because of this, they sometimes diverge. However, when they do converge, they usually do so faster than the bracketing methods.
Bracketing Methods
The Bisection Method
If a function is continuous in the interval [a, b] and f(a) and f(b) are of different sign
(f(a)f(b)<0), then at least one root of f(x) will lie in that interval.
The bisection method is an iterative procedure that starts with two points where f(x) has
different sign, that is, that “bracket” a root of the function.
We now define the point

c = (a + b)/2    (3.1)
as the midpoint of the interval [a, b], and examine the sign of f(c). There are now three possibilities: f(c) = 0, in which case the solution is c; f(c) positive; or f(c) negative. The next interval in which to search for the root will be either [a, c] or [c, b], if a and c or c and b give values of f of different sign, respectively. (Equivalently, we can search for the case f(a)f(c) < 0 or f(c)f(b) < 0.)
The process then continues, each time halving the size of the search interval and “bracketing”
the solution as shown in Figs. 3.1 and 3.2.
CONVERGENCE AND ERROR
Since we know that the solution lies in the interval [a, b], the absolute error for c is bounded by (b − a)/2. After n iterations, halving the interval each time, the search interval will have been reduced in length to (b − a)/2^n, so the maximum error after n iterations is:

|c_n − α| ≤ (b − a)/2^n    (3.2)
Fig. 3.1 and Fig. 3.2: successive search intervals [a_n, b_n] with midpoints c_n, and the next interval [a_{n+1}, b_{n+1}] with midpoint c_{n+1}.
where α is the exact position of the root and c_n is the n-th approximation found by this method. Furthermore, if we want to find the solution with a tolerance ε (that is, |c_n − α| ≤ ε), we can
calculate the maximum number of iterations required from the expression above. Naturally, if at
one stage the solution lies at the middle of the current interval the search finishes early.
An approximate relative error (or percent error) at iteration n+1 can be defined as:

ε = (c_{n+1} − c_n)/c_{n+1}

but from the figure above we can see that c_{n+1} − c_n = (b_{n+1} − a_{n+1})/2, and since c_{n+1} = (b_{n+1} + a_{n+1})/2, the relative error at iteration n+1 can be written as:

ε = (b_{n+1} − a_{n+1})/(b_{n+1} + a_{n+1})    (3.3)
This expression can also be used to stop iterations.
Exercise 3.1
Demonstrate that the number of iterations required to achieve a tolerance ε is the smallest integer that satisfies:

n ≥ (log(b − a) − log ε)/log 2    (3.4)
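As a quick numerical check of (3.4), sketched in Python rather than the Matlab used in the example below (the function name is ours):

```python
import math

def bisection_iterations(a, b, tol):
    # Smallest integer n with (b - a)/2**n <= tol, from eq. (3.4)
    return math.ceil((math.log(b - a) - math.log(tol)) / math.log(2))

# For the example below (interval [0, 1], tolerance 1e-6):
n = bisection_iterations(0.0, 1.0, 1e-6)   # n = 20, matching the 20 rows of the table
```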
Example
The function f(x) = cos(3x) has one root in the interval [0, 1]. The following simple Matlab program implements the bisection method to find this root.
a=0; b=1;      %limits of the search interval [a,b]
eps=1e-6;      %sets tolerance to 1e-6
fa=ff(a);
fb=ff(b);
if(fa*fb>0)
  disp('root not in the interval selected')
else
  n=ceil((log(b-a)-log(eps))/log(2)); %rounded up to closest integer
  disp('Iteration number      a          b          c')
  for i=1:n
    c=a+0.5*(b-a);    %c is set as the midpoint between a and b
    disp([sprintf('%8d',i),'  ',sprintf('%15.8f',a,b,c)])
    fc=ff(c);
    if(fa*fc)<0       %the root is between a and c
      b=c;
      fb=fc;
    elseif(fa*fc)>0   %the root is between c and b
      a=c;
      fa=fc;
    else return
    end
  end
end
together with the function definition:
function y=ff(x)
%****************************
y=cos(3*x);
%****************************
And the corresponding results are:
Iteration number a b c
1 0.00000000 1.00000000 0.50000000
2 0.50000000 1.00000000 0.75000000
3 0.50000000 0.75000000 0.62500000
4 0.50000000 0.62500000 0.56250000
5 0.50000000 0.56250000 0.53125000
6 0.50000000 0.53125000 0.51562500
7 0.51562500 0.53125000 0.52343750
8 0.52343750 0.53125000 0.52734375
9 0.52343750 0.52734375 0.52539063
10 0.52343750 0.52539063 0.52441406
11 0.52343750 0.52441406 0.52392578
12 0.52343750 0.52392578 0.52368164
13 0.52343750 0.52368164 0.52355957
14 0.52355957 0.52368164 0.52362061
15 0.52355957 0.52362061 0.52359009
16 0.52359009 0.52362061 0.52360535
17 0.52359009 0.52360535 0.52359772
18 0.52359772 0.52360535 0.52360153
19 0.52359772 0.52360153 0.52359962
20 0.52359772 0.52359962 0.52359867
Provided that the solution lies in the initial interval, and since the search interval is
continually divided by two, we can see that this method will always converge to the solution and
will find it within a required precision in a finite number of iterations.
However, due to the rather blind choice of solution (it is always chosen as the middle of
the interval), the error doesn’t vary monotonically. For the previous example f(x) = cos(3x) = 0:
Iteration
number      c           f(c)        error %
 1    0.50000000    0.07073720   -4.50703414
 2    0.75000000   -0.62817362   43.23944878
 3    0.62500000   -0.29953351   19.36620732
 4    0.56250000   -0.11643894    7.42958659
 5    0.53125000   -0.02295166    1.46127622
 6    0.51562500    0.02391905   -1.52287896
 7    0.52343750    0.00048383   -0.03080137
 8    0.52734375   -0.01123469    0.71523743
 9    0.52539063   -0.00537552    0.34221803
10    0.52441406   -0.00244586    0.15570833
11    0.52392578   -0.00098102    0.06245348
12    0.52368164   -0.00024860    0.01582605
13    0.52355957    0.00011762   -0.00748766
14    0.52362061   -0.00006549    0.00416920
15    0.52359009    0.00002606   -0.00165923
16    0.52360535   -0.00001971    0.00125498
17    0.52359772    0.00000317   -0.00020212
18    0.52360153   -0.00000827    0.00052643
19    0.52359962   -0.00000255    0.00016215
20    0.52359867    0.00000031   -0.00001998
We can see that the error is not continually decreasing although in the end it has to be small.
This is due to the rather “brute force” nature of the algorithm. The approximation to the solution
is chosen blindly as the midpoint of the interval without any attempt at guessing its position inside the interval. For example, if at some iteration n the magnitudes of f(a_n) and f(b_n) are very different, say |f(a_n)| >> |f(b_n)|, it is likely that the solution is closer to b than to a, if the function is smooth.
A possible way to improve on this is to select the point c by interpolating between the values at a and b. This is called the “regula falsi” method, or method of false position.
Regula Falsi Method
In this case, the next point is obtained by linear interpolation between the values at a and b.
From the figure (left), we can see that:

f(b)/f(a) = (c − b)/(c − a)    (3.5)

from where

c = (a f(b) − b f(a))/(f(b) − f(a))

or alternatively:

c = a + f(a)(a − b)/(f(b) − f(a))    (3.6)
Fig. 3.3
The algorithm is the same as the bisection method except for the calculation of the point c.
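The same bracketing loop with the interpolated point (3.6) can be sketched in Python (an illustrative implementation, not the code used to generate the tables below; the notes' own examples use Matlab):

```python
from math import cos

def regula_falsi(f, a, b, tol=1e-6, max_iter=100):
    # Bracketing loop as in bisection, but c interpolates f linearly (eq. 3.6)
    fa, fb = f(a), f(b)
    if fa * fb > 0:
        raise ValueError('root not bracketed in [a, b]')
    c = a
    for _ in range(max_iter):
        c_old = c
        c = a + fa * (a - b) / (fb - fa)   # eq. (3.6)
        fc = f(c)
        if fc == 0 or abs(c - c_old) < tol:
            return c
        if fa * fc < 0:                    # root between a and c
            b, fb = c, fc
        else:                              # root between c and b
            a, fa = c, fc
    return c

root = regula_falsi(lambda x: cos(3 * x), 0.0, 1.0)
# first iterates: 0.50251446, 0.53237237, ... as in the table below
```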
In this case, for the same function (f(x) = cos(3x) = 0), the solution, within the same absolute tolerance (10^−6), is found in only 4 iterations:
Iteration number a b c
0 0.00000000 1.00000000 0.00000000
1 0.50251446 1.00000000 0.50251446
2 0.50251446 0.53237237 0.53237237
3 0.52359536 0.53237237 0.52359536
4 0.52359536 0.52359878 0.52359878
and for the error:
Iter.
number c f(c) error %
1    0.50251446    0.06321078   -4.02680812
2    0.53237237   -0.02631774    1.67563307
3    0.52359536    0.00001025   -0.00065258
4    0.52359878    0.00000000    0.00000008
We can see that the error decreases much more rapidly than in the bisection method. The size of
the interval also decreases more rapidly. In this case the successive values are:
Iter.
number    b−a          b−a in bisection method
1 0.49748554 0.5
2 0.02985791 0.25
3 0.00877701 0.125
4 0.00000342 0.0625
However, this is not necessarily always the case, and on some occasions the search interval can remain large. In particular, one of the limits can remain stuck while the other converges to the solution. In that case the length of the interval tends to a finite value instead of converging to zero.
In the following example, for the function f(x) = x^10 − 1, the solution requires 70 iterations to reach a tolerance of 10^−6 with the regula falsi method, while only 24 are needed with the bisection method. We can also see that the right side of the interval remains stuck at 1.3 and the size of the interval will tend to 0.3 in the limit instead of converging to zero. The figure shows the interpolating lines at each iteration. The corresponding approximations are the points where these lines cross the x-axis.
Fig. 3.4 Standard regula falsi Fig. 3.5 Modified regula falsi
Modified regula falsi method
A modified version of the regula falsi method that improves this situation consists of an algorithm that detects when one of the limits gets stuck and then divides the function value retained at that limit by 2, changing the slope of the interpolating line. For the same example, this modified version reaches the solution in 13 iterations. The sequence of approximations and interpolating lines is shown in the figure above (right).
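One common realisation of this idea is the Illinois-type variant sketched below in Python (illustrative; the notes do not give the exact code used for the figures): when the same endpoint is retained in two successive iterations, the function value stored there is divided by 2.

```python
def modified_regula_falsi(f, a, b, tol=1e-6, max_iter=100):
    # Regula falsi with the stuck-endpoint fix: halve the retained f value
    fa, fb = f(a), f(b)
    side = 0
    c = a
    for _ in range(max_iter):
        c = (a * fb - b * fa) / (fb - fa)   # interpolated point
        fc = f(c)
        if abs(fc) < tol:
            return c
        if fa * fc < 0:
            b, fb = c, fc
            if side == -1:                  # endpoint a kept twice in a row
                fa /= 2
            side = -1
        else:
            a, fa = c, fc
            if side == 1:                   # endpoint b kept twice in a row
                fb /= 2
            side = 1
    return c

root = modified_regula_falsi(lambda x: x ** 10 - 1, 0.0, 1.3)
```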
Open Methods
These methods start with only one or two points, which do not necessarily bracket the root. One of the simplest is the fixed point iteration.
Fixed point iteration
This method is also called one-point iteration or successive substitution. For an equation of the form f(x) = 0, the method simply consists of rearranging it into the form x = g(x).
Example
For the function 0.5x^2 − 1.1x + 0.505 = 0, the algorithm can be set as x = 0.5x^2 − 0.1x + 0.505, or

x = ((x − 0.1)^2 + 1)/2

With an initial guess introduced in the right hand side, a new value of x is obtained and the iteration can continue.
Starting from the value x_0 = 0.5, the successive values are:
Iteration number x error %
1 0.58000000 13.79310345
2 0.61520000 5.72171651
3 0.63271552 2.76830889
4 0.64189291 1.42973889
5 0.64682396 0.76234834
6 0.64950822 0.41327570
7 0.65097964 0.22603166
8 0.65178928 0.12421806
9 0.65223571 0.06844503
10 0.65248214 0.03776824
11 0.65261826 0.02085727
12 0.65269347 0.01152336
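The iteration above takes only a few lines; a Python sketch (the function names are ours):

```python
def fixed_point(g, x0, tol=1e-6, max_iter=100):
    # Successive substitution x <- g(x) until two iterates agree within tol
    x = x0
    for _ in range(max_iter):
        x_new = g(x)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x

g = lambda x: ((x - 0.1) ** 2 + 1) / 2   # rearrangement of 0.5x^2 - 1.1x + 0.505 = 0
x = fixed_point(g, 0.5)                  # first iterates: 0.58, 0.6152, ... as in the table
```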
Fig. 3.6 Plot of y = x and y = g(x)    Fig. 3.7 Close-up showing the successive approximations
E763 (part 2) Numerical Methods page 14
Convergence
This is a very simple method but convergence is not guaranteed. The following figures show situations where the method converges and where it diverges.
In cases (a) and (b) the method converges, while in cases (c) and (d) it diverges.
Fig. 3.8 Four different cases: (a) convergence, (b) convergence, (c) divergence, (d) divergence
From Fig. 3.8 a-d we can see that it is relatively easy to determine when the method will converge, so the best way of ensuring success is to plot the functions y = g(x) and y = x. More rigorously, we can also see that for convergence to occur the slope of g(x) should be less than that of x in the region of search; that is, |g'(x)| < 1.
If divergence is predicted, a different way of rewriting the problem f(x) = 0 in the form x = g(x) needs to be found that satisfies the condition above.
For example, for the function f(x) = 3x^2 + 3x − 1 = 0, with a solution at x_0 = 0.2637626, we can separate it in the following two forms:

(a) x = g(x) = (1 − 3x^2)/3    and    (b) x = g(x) = 3x^2 + 4x − 1

In the first case, g'(x) = −2x, so g'(x_0) = −0.5275252, while in the second case, g'(x) = 6x + 4, so g'(x_0) = 5.5825757.
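This pair of rearrangements can be checked numerically; a Python sketch (illustrative):

```python
ga = lambda x: (1 - 3 * x ** 2) / 3    # |g'| < 1 near the root: converges
gb = lambda x: 3 * x ** 2 + 4 * x - 1  # |g'| > 1 near the root: diverges

def iterate(g, x0, n):
    x = x0
    for _ in range(n):
        x = g(x)
    return x

xa = iterate(ga, 0.3, 50)   # settles at the root 0.2637626...
xb = iterate(gb, 0.3, 10)   # runs away from it
```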
Fig. 3.9(a) x = g(x) = (1 − 3x^2)/3    Fig. 3.9(b) x = g(x) = 3x^2 + 4x − 1
Fig. 3.9 illustrates the main deficiency of this method. Convergence often depends on how the
problem is formulated. Additionally, divergence can also occur if the initial guess is not
sufficiently close to the solution.
Newton-Raphson Method
This is one of the most widely used methods for root finding. It also needs only one point to start the iterations but, unlike the fixed point iteration, it will converge to the solution provided the function is monotonically varying in the region of interest.
Starting from a point x_0, a tangent to the function f(x) (a line with the slope of the derivative of f) is extrapolated to find the point where it crosses the x-axis, providing a new approximation. The same procedure is repeated until the error tolerance is achieved. The method needs repeated evaluation of the function and its derivative, and an appropriate stopping criterion is the value of the function at the successive approximations: f(x_n).
Fig. 3.10 The tangent at x_0, with slope f'(x_0), crosses the x-axis at x_1.
From Fig. 3.10 we can see that at stage n:

f'(x_n) = f(x_n)/(x_n − x_{n+1})

so the next approximation is found as:

x_{n+1} = x_n − f(x_n)/f'(x_n)    (3.7)
Example
For the same function as in the previous examples, f(x) = cos(3x) = 0, and for the same tolerance of 10^−6, the solution is found in 3 iterations starting from x_0 = 0.3 (after 3 iterations the accuracy is better than 10^−8). Starting from 0.5, only 2 iterations are sufficient.
Iteration number x f(x)
0 0.30000000 0.62160997
1 0.56451705 -0.12244676
2 0.52339200 0.00062033
3 0.52359878 0.00000000
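The update (3.7) for this example can be sketched in Python (the stopping test on |f(x_n)| follows the criterion suggested above):

```python
from math import cos, sin

def newton(f, fprime, x0, tol=1e-6, max_iter=50):
    # x_{n+1} = x_n - f(x_n)/f'(x_n)   (eq. 3.7)
    x = x0
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) < tol:
            return x
        x = x - fx / fprime(x)
    return x

root = newton(lambda x: cos(3 * x), lambda x: -3 * sin(3 * x), 0.3)
# iterates: 0.3, 0.56451705, 0.52339200, 0.52359878, as in the table
```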
The method can also be derived from the Taylor series expansion. This also provides a useful
insight on the rate of convergence of the method.
Considering the Taylor expansion truncated to the first order (see Appendix):

f(x_{i+1}) = f(x_i) + f'(x_i)(x_{i+1} − x_i) + (f^(2)(ξ)/2!)(x_{i+1} − x_i)^2    (3.8)
Considering now the exact solution x_r and the Taylor expansion evaluated at this point:

f(x_r) = 0 = f(x_i) + f'(x_i)(x_r − x_i) + (f^(2)(ξ)/2!)(x_r − x_i)^2
and reordering (assuming a single root, so the first derivative ≠ 0):

x_r − x_i = −f(x_i)/f'(x_i) − (f^(2)(ξ)/(2! f'(x_i)))(x_r − x_i)^2    (3.9)
Using now (3.7) for x_{i+1}: x_{i+1} = x_i − f(x_i)/f'(x_i), and substituting in (3.9) gives:

x_r − x_{i+1} = −(f^(2)(ξ)/(2! f'(x_i)))(x_r − x_i)^2    (3.10)
The error at stage i can be written as the difference between x_r and x_i: E_i = (x_r − x_i); then, from (3.10) we can write:

E_{i+1} = −(f^(2)(ξ)/(2 f'(x_i))) E_i^2
Assuming convergence, both ξ and x_i should eventually become x_r, so the previous equation can be rearranged in the form:
E_{i+1} = −(f^(2)(x_r)/(2 f'(x_r))) E_i^2    (3.11)
We can see that the relation between the errors of successive iterations is quadratic: on each Newton-Raphson iteration, the number of correct decimal places should roughly double. This is what is called quadratic convergence.
Although the convergence rate is generally quite good, there are cases that show poor or no convergence. One example is when there is an inflexion point near the root, in which case the iteration values can progressively diverge from the solution. Another is when the root is a multiple root, that is, when the first derivative is also zero there.
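The quadratic error relation (3.11) can be observed numerically. A Python sketch follows; f(x) = x^2 − 2 is our illustrative choice, because for f(x) = cos(3x) the second derivative vanishes at the root and convergence there is even faster than quadratic:

```python
# Newton-Raphson for f(x) = x^2 - 2 reduces to x <- (x + 2/x)/2
root = 2 ** 0.5
x = 1.0
errors = []
for _ in range(4):
    x = (x + 2.0 / x) / 2.0
    errors.append(abs(x - root))

# By (3.11), E_{i+1}/E_i^2 should approach |f''(r)/(2 f'(r))| = 1/(2*sqrt(2)) ~ 0.354
ratios = [errors[i + 1] / errors[i] ** 2 for i in range(3)]
```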
Stopping the iterations
It can be demonstrated that the absolute error (difference between the current approximation and the correct value) can be written as a multiple of the difference between the last two consecutive approximations:

E_n = (α − x_n) = C_n (x_n − x_{n−1})

and the constant C_n tends to 1 as the method converges. Then, a convenient criterion for stopping the iterations is when:

C_n |x_n − x_{n−1}| < ε,    (3.12)

choosing a small value for C_n, usually 1.
The Secant Method
One of the difficulties of the Newton-Raphson method is the need to evaluate the derivative. This can be very inconvenient or difficult in some cases. However, the derivative can be approximated using a finite difference expression, for example the backward difference:

f'(x_i) ≅ (f(x_i) − f(x_{i−1}))/(x_i − x_{i−1})    (3.13)

If we now substitute this in the expression for the Newton-Raphson iterations, the following equation is obtained:

x_{i+1} = x_i − f(x_i)(x_i − x_{i−1})/(f(x_i) − f(x_{i−1}))    (3.14)

which is the formula for the secant method.
Example
Use the secant method to find the root of f(x) = e^−x − x. Start with the estimates x_{−1} = 0 and x_0 = 1. The exact result is: 0.56714329…
First iteration:

x_{−1} = 0, f(x_{−1}) = 1.0
x_0 = 1, f(x_0) = −0.63212

then x_1 = 1 − (−0.63212)(1 − 0)/(−0.63212 − 1) = 0.61270,  ε = 8%

Second iteration:

x_0 = 1, f(x_0) = −0.63212
x_1 = 0.61270, f(x_1) = −0.07081

then x_2 = 0.61270 − (−0.07081)(0.61270 − 1)/(−0.07081 − (−0.63212)) = 0.56383832
Note that in this case the two points are on the same side of the root (not bracketing it).
Using Excel, a simple calculation can be made giving:
E763 (part 2) Numerical Methods page 18
 i    x_i           f(x_i)         error %
−1    0              1
 0    1             −0.63212
 1    0.612700047   −0.070814271   8.032671349
 2    0.563838325    0.005182455   0.582738902
 3    0.567170359   −4.24203E−05   0.004772880
 4    0.567143307   −2.53813E−08   2.92795E−06
 5    0.567143290    1.24234E−13   7.22401E−08
The secant method doesn’t require the evaluation of the derivative of the function as the Newton-Raphson method does, but it still suffers from the same problems. The convergence of the method is similar to that of Newton’s and, similarly, it has severe problems if the derivative is zero or near zero in the region of interest.
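As a rough illustration, the secant iteration (3.14) can be sketched in a few lines of Python; the function and parameter names here are our own, not part of the notes.

```python
import math

def secant(f, x_prev, x_curr, tol=1e-12, max_iter=50):
    """Secant iteration (3.14): each new estimate replaces the oldest point."""
    for _ in range(max_iter):
        f_prev, f_curr = f(x_prev), f(x_curr)
        # x_{i+1} = x_i - f(x_i)(x_i - x_{i-1}) / (f(x_i) - f(x_{i-1}))
        x_next = x_curr - f_curr * (x_curr - x_prev) / (f_curr - f_prev)
        if abs(x_next - x_curr) < tol:
            return x_next
        x_prev, x_curr = x_curr, x_next
    return x_curr

# the example above: f(x) = exp(-x) - x, starting from 0 and 1
root = secant(lambda x: math.exp(-x) - x, 0.0, 1.0)
print(round(root, 8))  # 0.56714329
```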
Multiple Roots
We have seen that some of the methods have poor convergence if the derivative is very small or zero. For higher order zeros (multiple roots), the function is zero and so are the first n−1 derivatives (n is the order of the root). In this case the Newton-Raphson method (and the secant method) will converge poorly.
We can notice however, that if the function f(x) has a multiple root at x = α, the function:

g(x) = f(x) / f'(x)    (3.15)

has a simple root at x = α (if the root of f is of order n, the root of the derivative is of order n−1). We can then use the standard Newton-Raphson method on the function g(x).
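A minimal sketch of this idea (the test function and names are illustrative choices, not from the notes): apply Newton-Raphson to g = f/f', using g' = 1 − f·f''/f'².

```python
def newton_on_g(f, df, d2f, x0, tol=1e-10, max_iter=100):
    """Newton-Raphson applied to g(x) = f(x)/f'(x): a multiple root of f
    becomes a simple root of g, restoring quadratic convergence."""
    x = x0
    for _ in range(max_iter):
        fx, dfx = f(x), df(x)
        if dfx == 0.0:
            break  # at a multiple root f and f' vanish together; stop here
        g = fx / dfx
        dg = 1.0 - fx * d2f(x) / dfx**2   # derivative of f/f'
        step = g / dg
        x -= step
        if abs(step) < tol:
            break
    return x

# illustrative case: f(x) = (x - 1)^2 has a double root at x = 1
root = newton_on_g(lambda x: (x - 1)**2,
                   lambda x: 2*(x - 1),
                   lambda x: 2.0,
                   x0=1.8)
print(abs(root - 1.0) < 1e-8)  # True
```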
Exercise 3.2
Use the Newton-Raphson method to find a root of the function f(x) = 1 − x e^(1−x). Start the iterations with x_0 = 0.
Extrapolation and Acceleration
Observing how the iterations proceed in the methods we have just studied, one can expect some
advantages if an extrapolated value is calculated from the iterations in order to improve the next
guess and accelerate the convergence.
One of the best methods for extrapolation and acceleration is Aitken’s method. Considering the sequence of values x_n, the extrapolated sequence can be constructed by:

x̄_n = x_n + α (x_n − x_{n−1})   where   α = −(x_n − x_{n−1}) / (x_n − 2 x_{n−1} + x_{n−2})    (3.16)
We can use this expression embedded in the fixed point iteration, for example, in the form:
Starting from a value for x_0, the first two iterates are found using the standard method:
x1=g(x0);
x2=g(x1);
Now, we can use Aitken’s extrapolation in a repeated form:
alpha=-(x2-x1)/(x2-2*x1+x0)
xbar=x2+alpha*(x2-x1)
and now we can refresh the initial guess:
x0=xbar
and restart the iterations.
Similarly for the Newton-Raphson method, where the evaluations of x1 and x2 are replaced by the corresponding forms for the NR method and the function and its derivative need to be calculated at each stage:
f0=f(x0)   % calculation of the function
df0=df(x0) % calculation of the derivative
x1=x0-f0/df0
f1=f(x1)
df1=df(x1)
x2=x1-f1/df1
alpha=-(x2-x1)/(x2-2*x1+x0)
xbar=x2+alpha*(x2-x1)
x0=xbar
Example
For the function f(x) = cos(3x), using the fixed-point method with and without acceleration gives the results that follow. The iterations are started with x_0 = 0.5 and the alternative form g(x) = (2x + cos(3x))/2 is used.
Iter.    Fixed Point method          Accelerated FP method
number   x            error %        x            error %
1        0.5353686    4.50703414     0.52359385   1.12230396
2        0.51771753   2.24787104     0.52359878   0.00023538
3        0.52653894   1.12323492     0.52359878   0
4        0.52212875   0.56153005
5        0.52433378   0.28075410
6        0.52323127   0.14037569
7        0.52378253   0.07018767
8        0.5235069    0.03509381
9        0.52364471   0.01754690
10       0.52357581   0.00877345
11       0.52361026   0.00438673
12       0.52359303   0.00219336
13       0.52360165   0.00109668
14       0.52359734   0.00054834
15       0.52359949   0.00027417
16       0.52359842   0.00013709
17       0.52359896   0.00006854
18       0.52359869   0.00003427
19       0.52359882   0.00001714
20       0.52359875   0.00000857
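The accelerated column above can be reproduced with a short sketch (variable names are ours): apply Aitken extrapolation after every pair of fixed-point steps and restart from the extrapolated value.

```python
import math

def g(x):
    # g(x) = (2x + cos(3x))/2; its fixed point satisfies cos(3x) = 0
    return (2*x + math.cos(3*x)) / 2

def aitken_fixed_point(g, x0, tol=1e-12, max_iter=100):
    """Fixed-point iteration accelerated with Aitken extrapolation (3.16)."""
    for _ in range(max_iter):
        x1 = g(x0)
        x2 = g(x1)
        denom = x2 - 2*x1 + x0
        if denom == 0.0:                     # sequence has already converged
            return x2
        xbar = x2 - (x2 - x1)**2 / denom     # Aitken extrapolation
        if abs(xbar - x2) < tol:
            return xbar
        x0 = xbar                            # refresh the guess and restart
    return x0

root = aitken_fixed_point(g, 0.5)
print(abs(root - math.pi/6) < 1e-10)  # True: the root of cos(3x) is pi/6
```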
Higher Order Methods
All the methods we have seen consist of approximating the function using a straight line and
looking for the intersection of that line with the axis. An obvious extension of this idea is to use
a higher order approximation, for example, fitting a second order curve (a parabola) to the
function and looking for the zero of this instead. A method that implements this is the Muller
method.
Muller’s method
Using three points, we can find the equation of a parabola that fits the function and then, find the
zeros of the parabola.
For example, for the function y = x^6 − 2 (used to find the value of the sixth root of 2), the function and the approximating parabola are shown in the figure. In this case, the 3 points are chosen as x_1 = 0.5, x_2 = 1.5 and x_3 = 1.0.
The parabola passing through the points (x_1, y_1), (x_2, y_2) and (x_3, y_3) can be written as:

y = y_3 + c_2 (x − x_3) + d_1 (x − x_3)(x − x_2)

where  c_1 = (y_2 − y_1)/(x_2 − x_1),  c_2 = (y_3 − y_2)/(x_3 − x_2)  and  d_1 = (c_2 − c_1)/(x_3 − x_1).
Fig. 3.11: the function y = x^6 − 2 and the approximating parabola through x_1, x_2, x_3, with the new estimate x_4.
Solving for the zero closest to x_3 gives:

x_4 = x_3 − 2 y_3 / (s + sign(s) √(s^2 − 4 y_3 d_1))    (3.17)

where s = c_2 + d_1 (x_3 − x_2).
The results of the iterations are:

Iteration number   x            error
1                  1.07797900   0.07797900
2                  1.37190697   0.29392797
3                  1.37170268   0.00020429
4                  1.37148553   0.00021716
5                  1.17938786   0.19209767
6                  1.14443703   0.03495083
7                  1.12268990   0.02174712
8                  1.12242784   0.00026206
9                  1.12246205   0.00003420
10                 1.12246205   0.00000000
The Muller method requires three points to start the iterations but it doesn’t require the evaluation of derivatives as the Newton-Raphson method does. It can also be used to find complex roots.
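A sketch of one possible implementation (real arithmetic only, discarding the oldest point each step, so intermediate iterates may differ from the table above):

```python
import math

def muller(f, x1, x2, x3, tol=1e-12, max_iter=50):
    """Muller's method: fit a parabola through three points and take
    its zero closest to x3 (eq. 3.17) as the next estimate."""
    for _ in range(max_iter):
        y1, y2, y3 = f(x1), f(x2), f(x3)
        c1 = (y2 - y1) / (x2 - x1)
        c2 = (y3 - y2) / (x3 - x2)
        d1 = (c2 - c1) / (x3 - x1)
        s = c2 + d1 * (x3 - x2)
        disc = math.sqrt(s*s - 4*y3*d1)          # real roots assumed here
        x4 = x3 - 2*y3 / (s + math.copysign(disc, s))
        if abs(x4 - x3) < tol:
            return x4
        x1, x2, x3 = x2, x3, x4                  # drop the oldest point
    return x3

# the example above: y = x^6 - 2, starting points 0.5, 1.5, 1.0
root = muller(lambda x: x**6 - 2, 0.5, 1.5, 1.0)
print(abs(root - 2**(1/6)) < 1e-10)  # True
```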
4. Interpolation and Approximation
There are many occasions when there is a need to convert discrete data, from measurements or calculations, into a smooth function. This could be necessary, for example, to obtain estimates of the values between data points. It may also be necessary if one wants to represent a function by a simpler one that approximates its behaviour. Two approaches to this task can be used. Interpolation consists of fitting a smooth curve exactly to the data points, creating a curve that spans these points. In approximation, the curve doesn’t necessarily fit the data points exactly but it approximates the overall behaviour of the data, following an established criterion.
Lagrange Interpolation
The basic interpolation problem can be formulated as:
Given a set of nodes {x_i, i = 0,…,n} and corresponding data values {y_i, i = 0,…,n}, find the polynomial p(x) of degree less than or equal to n such that p(x_i) = y_i.
Consider the family of functions:

L_i^(n)(x) = ∏_{j=0, j≠i}^{n} (x − x_j)/(x_i − x_j),   i = 0, 1, …, n    (4.18)
We can see that they are polynomials of order n and have the property (interpolatory condition):

L_i^(n)(x_j) = δ_ij = { 1 if i = j;  0 if i ≠ j }    (4.19)
Then, if we define the polynomial by:

p_n(x) = Σ_{k=0}^{n} y_k L_k^(n)(x)    (4.20)
then:

p_n(x_i) = Σ_{k=0}^{n} y_k L_k^(n)(x_i) = y_i    (4.21)

The uniqueness of this interpolation polynomial can also be demonstrated (that is, that there is only one polynomial of order n or less that satisfies this condition).
Lagrange Polynomials
In more detail, from the general definition (4.18), the equation for the first order polynomial (straight line) passing through two points (x_1, y_1) and (x_2, y_2) is:

p(x) = L_1^(1) y_1 + L_2^(1) y_2 = ((x − x_2)/(x_1 − x_2)) y_1 + ((x − x_1)/(x_2 − x_1)) y_2    (4.22)
The second order polynomial (parabola) passing through three points is:

p_2(x) = L_1^(2) y_1 + L_2^(2) y_2 + L_3^(2) y_3

p_2(x) = ((x − x_2)(x − x_3))/((x_1 − x_2)(x_1 − x_3)) y_1 + ((x − x_1)(x − x_3))/((x_2 − x_1)(x_2 − x_3)) y_2 + ((x − x_1)(x − x_2))/((x_3 − x_1)(x_3 − x_2)) y_3    (4.23)
In general, we can see that the interpolation polynomials have the form given in (4.20) for any order. Each of the Lagrange interpolation functions L_k^(n) associated with each of the nodes x_k (given in general by (4.18)) is:

L_k^(n)(x) = ((x − x_1)(x − x_2)…(x − x_{k−1})(x − x_{k+1})…(x − x_n)) / ((x_k − x_1)(x_k − x_2)…(x_k − x_{k−1})(x_k − x_{k+1})…(x_k − x_n)) = N(x)/D    (4.24)

The denominator has the same form as the numerator, and D = N(x_k).
Example
Find the interpolating polynomial that passes through the three points (x_1, y_1) = (−2, 4), (x_2, y_2) = (0, 2) and (x_3, y_3) = (2, 8).
Substituting in (4.20), or more specifically, (4.23):

p_2(x) = ((x − 0)(x − 2))/((−2 − 0)(−2 − 2)) · 4 + ((x + 2)(x − 2))/((0 + 2)(0 − 2)) · 2 + ((x + 2)(x − 0))/((2 + 2)(2 − 0)) · 8

p_2(x) = ((x^2 − 2x)/8) · 4 − ((x^2 − 4)/4) · 2 + ((x^2 + 2x)/8) · 8

p_2(x) = x^2 + x + 2
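A direct transcription of (4.20) into code (a sketch; the function names are ours), checked against the example above:

```python
def lagrange_eval(xd, yd, x):
    """Evaluate the Lagrange interpolating polynomial (4.20) at x."""
    n = len(xd)
    total = 0.0
    for k in range(n):
        L = 1.0                          # build L_k(x) as the product (4.18)
        for j in range(n):
            if j != k:
                L *= (x - xd[j]) / (xd[k] - xd[j])
        total += yd[k] * L
    return total

xd, yd = [-2.0, 0.0, 2.0], [4.0, 2.0, 8.0]
print(lagrange_eval(xd, yd, 1.0))  # 4.0, matching p2(1) = 1 + 1 + 2
```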
Fig. 4.1 shows the complete interpolating polynomial p_2(x) and the three Lagrange interpolation polynomials L_k^(2)(x), k = 1, 2, 3, corresponding to each of the nodal points. Notice that the function corresponding to one node has the value 1 at that node and 0 at the other two.
Exercise 4.1
Find the 5th order Lagrange interpolation polynomial to fit the data: xd = {0, 1, 2, 3, 4, 5} and yd = {2, 1, 3.4, 3.8, 5.8, 4.8}.
Exercise 4.2
Show that an arbitrary polynomial of order n can be represented exactly by

p(x) = Σ_{i=0}^{n} p(x_i) L_i(x)

using an arbitrary set of (distinct) data points x_i.
Newton Interpolation
It can be easily demonstrated that the polynomial interpolating a set of points is unique (Exercise 4.2), and the Lagrange method allows us to find it. The Newton interpolation method eventually gives the same result but it can be more convenient in some cases. In particular, it is simpler to extend the interpolation by adding extra points, which in the Lagrange method would require a total recalculation of the interpolation functions.
The general form of a polynomial used to interpolate n + 1 data points is:

f(x) = b_0 + b_1 x + b_2 x^2 + b_3 x^3 + … + b_n x^n    (4.25)

Newton’s method, like Lagrange’s, gives us a procedure to find the coefficients b_i.
From Fig. 4.2, we can write:

(y − y_1)/(x − x_1) = (y_2 − y_1)/(x_2 − x_1)

which can be rearranged as:

y = y_1 + ((y_2 − y_1)/(x_2 − x_1)) (x − x_1)    (4.26)

This is Newton’s expression for the first order interpolation polynomial (or linear interpolation).
Fig. 4.2
That is, the Newton form of the equation of a straight line that passes through two points (x_1, y_1) and (x_2, y_2) is:

p(x) = a_0 + a_1 (x − x_1);   where:  a_0 = y_1,  a_1 = (y_2 − y_1)/(x_2 − x_1)    (4.27)
Similarly, the general expression for a second order polynomial passing through the 3 data points (x_1, y_1), (x_2, y_2) and (x_3, y_3) can be written as:

p_2(x) = b_0 + b_1 x + b_2 x^2

or, rearranging:

p_2(x) = a_0 + a_1 (x − x_1) + a_2 (x − x_1)(x − x_2)    (4.28)
Substituting the values for the 3 points, we get, after some rearrangement:

a_0 = y_1
a_1 = (y_2 − y_1)/(x_2 − x_1)
a_2 = [ (y_3 − y_2)/(x_3 − x_2) − (y_2 − y_1)/(x_2 − x_1) ] / (x_3 − x_1)    (4.29)
The individual terms in the above expression are usually called “divided differences” and denoted by the symbol D. That is,

Dy_i = (y_{i+1} − y_i)/(x_{i+1} − x_i),   D^2 y_i = (Dy_{i+1} − Dy_i)/(x_{i+2} − x_i),   D^3 y_i = (D^2 y_{i+1} − D^2 y_i)/(x_{i+3} − x_i),   etc.

In this notation, (4.29) takes the form:  a_0 = y_1,  a_1 = Dy_1  and  a_2 = D^2 y_1.
The general form of Newton interpolation polynomials is then an extension of (4.27) and (4.28):

p_n(x) = a_0 + a_1 (x − x_1) + a_2 (x − x_1)(x − x_2) + …    (4.30)

or,

p_n(x) = a_0 + Σ_{i=1}^{n} a_i W_i(x)   with   W_i(x) = ∏_{j=1}^{i} (x − x_j)    (4.31)

with the coefficients:  a_0 = y_1,  a_i = D^i y_1
Example
We can consider the previous example of finding the interpolating polynomial that passes through the three points (x_1, y_1) = (−2, 4), (x_2, y_2) = (0, 2) and (x_3, y_3) = (2, 8).
In this case it is usual and convenient to arrange the calculations in a table with the following quantities in each column:
i    x_i   y_i   Dy_i                              D^2 y_i
1    −2    4
                 Dy_1 = (2 − 4)/(0 − (−2)) = −1
                                                   D^2 y_1 = (3 − (−1))/(2 − (−2)) = 1
2    0     2
                 Dy_2 = (8 − 2)/(2 − 0) = 3
3    2     8
Then, the coefficients are: a_0 = 4, a_1 = −1 and a_2 = 1, and the polynomial is:

p_2(x) = 4 − (x + 2) + (x + 2)(x − 0) = x^2 + x + 2

Note that it is the same polynomial found using Lagrange interpolation.
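The divided-difference table and the nested evaluation of (4.30) translate directly into code (a sketch with our own function names), checked against the example above:

```python
def newton_coefficients(xd, yd):
    """Divided differences: returns [a_0, a_1, ..., a_n] with a_i = D^i y_1."""
    n = len(xd)
    a = list(yd)
    for order in range(1, n):
        # work backwards so lower-order differences are still available
        for i in range(n - 1, order - 1, -1):
            a[i] = (a[i] - a[i - 1]) / (xd[i] - xd[i - order])
    return a

def newton_eval(xd, a, x):
    """Evaluate p_n(x) = a_0 + a_1(x-x_1) + a_2(x-x_1)(x-x_2) + ... in nested form."""
    result = a[-1]
    for i in range(len(a) - 2, -1, -1):
        result = result * (x - xd[i]) + a[i]
    return result

xd, yd = [-2.0, 0.0, 2.0], [4.0, 2.0, 8.0]
a = newton_coefficients(xd, yd)
print(a)                      # [4.0, -1.0, 1.0], as in the table above
print(newton_eval(xd, a, 3))  # 14.0 (= 3^2 + 3 + 2)
```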
One important property of Newton’s construction of the interpolating polynomial is what makes it easy to extend by including more points. If additional points are included, the new higher order polynomial can easily be constructed from the previous one:

p_{n+1}(x) = p_n(x) + a_{n+1} W_{n+1}(x)   with   W_{n+1}(x) = W_n(x)(x − x_{n+1})   and   a_{n+1} = D^{n+1} y_1    (4.32)
In this way, it has many similarities with the Taylor expansion, where additional terms increase
the order of the polynomial. These similarities allow a treatment of the error in the same way as
it is done with Taylor expansions.
Exercise 4.3
Find the 5th order interpolation polynomial to fit the data: xd = {0, 1, 2, 3, 4, 5} and yd = {2, 1,
3.4, 3.8, 5.8, 4.8}, this time using Newton interpolation.
Some Practical Notes:
In some of the examples seen here we have used equally spaced data. This is by no means
necessary and the data can have any distribution. However, if the data is equally spaced, simpler
expressions for the divided differences can be derived (not done here).
Both in the Lagrange and the Newton methods, the interpolating function is obtained in a form that is different from the standard polynomial expression (4.25). The coefficients of this expression can be obtained from the Lagrange or Newton forms by setting up a system of equations for the known values at the data points. However, the resultant system is notoriously ill-conditioned (see the chapter on Matrix Computations) and the results can be highly inaccurate.
____________________________
One of the problems with interpolation of data points is that this technique is very sensitive to noisy data. A very small change in the values of the data can lead to a drastic change in the interpolating function. This is illustrated in the following example:
Fig. 4.3 shows the interpolating polynomial (in blue) for the data:
xd = {0, 1, 2, 3, 4, 5}
yd = {2, 1, 3, 4.8, 5.8, 4.8}
Now, if we add two more points with a slight amount of noise: xd’ = {2.2, 2.7} and yd’ = {3.5, 4.25} (shown with the filled black markers), the new interpolation polynomial (red line) shows a dramatic difference from the first one.
Fig. 4.3
Hermite Interpolation
The problem arises because the extra points force a higher degree polynomial, and this can have a more oscillatory behaviour. Another approach, one that avoids this problem, is to use data for the derivative of the function too. If we also ask for the derivative values to be matched at the nodes, the oscillations will be prevented. This is done with “Hermite interpolation”. The development is rather similar to that of Newton’s method but more complicated due to the involvement of the derivative values. It can also be constructed easily with the help of a table (as in Newton’s) and divided differences. We will not cover the details of the derivation here but simply the procedure to find it.
The table is similar to that for the Newton interpolation but we enter the data points twice (see below) and the derivative values are placed between repeated data points, in alternate rows, as the first divided differences. The initial setup for 2 points is marked with red circles.
 i   x_i   y_i    Dy_i                           D2y_i                         D3y_i
 1   x1    y1
                  y1'
 1   x1    y1                                    A = (Dy_1 − y1')/(x2 − x1)
                  Dy_1 = (y2 − y1)/(x2 − x1)                                   C = (B − A)/(x2 − x1)
 2   x2    y2                                    B = (y2' − Dy_1)/(x2 − x1)
                  y2'
 2   x2    y2
The corresponding interpolating polynomial is:
H(x) = y1 + y1'(x − x1) + A(x − x1)² + C(x − x1)²(x − x2)        (4.33)
The coefficients of the successive terms are marked in the table with blue squares.
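As a concrete illustration of the two-point procedure, the sketch below builds the divided differences Dy_1, A, B, C from the table and evaluates (4.33). It is written in Python rather than Matlab, and the function name is an illustrative choice, not part of the notes:

```python
def hermite_two_point(x1, y1, dy1, x2, y2, dy2):
    """Two-point Hermite interpolant of eq. (4.33).

    Dy1 is the ordinary first divided difference; A, B, C are the
    derivative-aware divided differences from the table above."""
    Dy1 = (y2 - y1) / (x2 - x1)
    A = (Dy1 - dy1) / (x2 - x1)
    B = (dy2 - Dy1) / (x2 - x1)
    C = (B - A) / (x2 - x1)

    def H(x):
        return y1 + dy1 * (x - x1) + A * (x - x1) ** 2 + C * (x - x1) ** 2 * (x - x2)

    return H

# Example: interpolate y = x^2 (so y' = 2x) between x = 0 and x = 1;
# a cubic Hermite interpolant reproduces this quadratic exactly.
H = hermite_two_point(0.0, 0.0, 0.0, 1.0, 1.0, 2.0)
print(H(0.5))  # 0.25
```

Because the interpolant matches both values and slopes at the two nodes, it is exact for any polynomial of degree up to 3.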
Fig. 4.4 shows the function y(x) = sin(πx)
(+) compared with the interpolated curves
using the Hermite, Newton and cubic splines
methods (see next section).
The data points are marked by circles.
Newton interpolation results in a polynomial
of order 4 while Hermite gives a polynomial
of order 7.
(If 7 points are used for the Newton
interpolation the results are quite similar.)
Fig. 4.4 (legend: cubic splines, Newton, Hermite)
Another approach to avoid the oscillations present when using high order polynomials is to
use lower order polynomials to interpolate subsets of the data and assemble the overall
approximating function piecewise. This is what is called “spline interpolation”.
Spline Interpolation
Any piecewise interpolation of data by low-order functions is called spline interpolation, and the
simplest and most widely used is piecewise linear interpolation, or simply, joining consecutive
data points by straight lines.
First Order Splines
If we have a set of data points (x_i, y_i), i = 1, …, n, the first-order splines (straight lines)
can be defined as:
f_1(x) = y_1 + m_1(x − x_1);   x_1 ≤ x ≤ x_2        (4.34)
f_2(x) = y_2 + m_2(x − x_2);   x_2 ≤ x ≤ x_3        (4.35)
…
f_i(x) = y_i + m_i(x − x_i);   x_i ≤ x ≤ x_{i+1}        (4.36)
with the slopes given by:
m_i = (y_{i+1} − y_i)/(x_{i+1} − x_i)
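A minimal sketch of this piecewise linear scheme in Python (rather than Matlab; the function name and the use of bisection to locate the interval are illustrative choices, not part of the notes):

```python
import bisect

def linear_spline(xd, yd):
    """Piecewise-linear interpolant: f_i(x) = y_i + m_i (x - x_i)
    on [x_i, x_{i+1}], with m_i the slope between consecutive points."""
    m = [(yd[i + 1] - yd[i]) / (xd[i + 1] - xd[i]) for i in range(len(xd) - 1)]

    def f(x):
        # locate the interval containing x (clamped to the end intervals)
        i = min(max(bisect.bisect_right(xd, x) - 1, 0), len(xd) - 2)
        return yd[i] + m[i] * (x - xd[i])

    return f

f = linear_spline([0.0, 1.0, 2.0], [0.0, 2.0, 1.0])
print(f(0.5), f(1.5))  # 1.0 1.5
```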
Quadratic and Cubic Splines
First-order splines are straightforward. To fit a line in each interval between data points, we
need only 2 pieces of data: the data values at the 2 points. If we now want to fit a higher-order
function, for example a second-order polynomial (a parabola), we need to determine 3
coefficients, and in general, for an m-th order spline we need m+1 equations per interval.
If we want a smooth function over the complete domain, we should impose continuity of
the representation in contiguous intervals, and require as many derivatives as possible to be
continuous at the adjoining points.
For n+1 points x_i, i = 0, …, n, there are n intervals and consequently n functions to
determine. If we use quadratic splines (second order), their equations will be of the form
f_i(x) = a_i x² + b_i x + c_i, and we need 3 equations per interval to find the 3 coefficients,
that is, a total of 3n equations.
We can establish the following conditions to be satisfied:
1) The values of the functions corresponding to adjacent intervals must be equal at the
common interior nodes (no discontinuities at the nodes).
This can be represented by:
f_i(x_i) = a_i x_i² + b_i x_i + c_i = y_i  and  f_{i+1}(x_i) = a_{i+1} x_i² + b_{i+1} x_i + c_{i+1} = y_i        (4.37)
for i = 1, …, n−1; that is, at the node x_i, the boundary between intervals i and i+1, the
functions defined in each of these intervals must coincide with the data value. This gives us a
total of 2n−2 equations (there are 2 equations in the above line and n−1 interior points).
2) The first and last functions must pass through the first and last points (end points). This
gives us another 2 equations.
3) The first derivative at the interior nodes must be continuous.
This can be written as f_i'(x_i) = f_{i+1}'(x_i) for i = 1, …, n−1. Then:
2 a_i x_i + b_i = 2 a_{i+1} x_i + b_{i+1}        (4.38)
and we have another n−1 equations.
All these give us a total of 3n−1 equations when we need 3n. An additional condition must then
be established. Usually this can be chosen at the end points, for example, stating that a_1 = 0
(this corresponds to asking for the second derivative to be zero at the first point, and results
in the first two points being joined by a straight line).
For example, for a set of 4 data points, we can establish the necessary equations as listed above,
giving a total of 8 equations for 8 unknowns (having fixed a_1 = 0 already). This can be solved
by the matrix techniques that we will study later.
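The conditions above can be assembled into a linear system and solved directly. The following sketch (in Python with NumPy rather than Matlab; the function name and test data are hypothetical) builds the 3n × 3n system for quadratic splines, including the extra condition a_1 = 0:

```python
import numpy as np

def quadratic_spline_coeffs(xd, yd):
    """Solve for the coefficients (a_i, b_i, c_i) of the quadratic splines
    f_i(x) = a_i x^2 + b_i x + c_i, using the three conditions in the text
    plus a_1 = 0. Returns an (n, 3) array, one row per interval."""
    n = len(xd) - 1                       # number of intervals
    A = np.zeros((3 * n, 3 * n))
    rhs = np.zeros(3 * n)
    row = 0
    # 1) both neighbouring parabolas pass through each interior data point
    for i in range(1, n):
        for j in (i - 1, i):              # intervals left and right of xd[i]
            A[row, 3 * j:3 * j + 3] = [xd[i] ** 2, xd[i], 1.0]
            rhs[row] = yd[i]
            row += 1
    # 2) first and last parabolas pass through the end points
    A[row, 0:3] = [xd[0] ** 2, xd[0], 1.0]; rhs[row] = yd[0]; row += 1
    A[row, -3:] = [xd[n] ** 2, xd[n], 1.0]; rhs[row] = yd[n]; row += 1
    # 3) first derivatives (2 a x + b) match at the interior points
    for i in range(1, n):
        A[row, 3 * (i - 1):3 * (i - 1) + 2] = [2 * xd[i], 1.0]
        A[row, 3 * i:3 * i + 2] = [-2 * xd[i], -1.0]
        row += 1
    # extra condition: a_1 = 0 (first piece is a straight line)
    A[row, 0] = 1.0
    return np.linalg.solve(A, rhs).reshape(n, 3)

coeffs = quadratic_spline_coeffs([0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 0.0, 1.0])
print(coeffs[0])   # a_1 = 0, so the first interval is the straight line y = x
```

For 4 points this is exactly the 8-equations-for-8-unknowns system described above (written here with 9 unknowns and a_1 = 0 as a ninth equation, which is equivalent).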
Quadratic splines have some shortcomings that are not present in cubic splines, so these
are preferred. However, the calculation of cubic splines is even more cumbersome than that of
quadratic splines. In this case, the function and the first and second derivatives are continuous
at the nodes.
Because of their popularity, cubic splines are commonly found in computer libraries; for
example, Matlab has a standard function that calculates them.
To illustrate the advantages of this technique, Fig. 4.5 shows the cubic interpolation splines
(green line) for the same function described in the last example, compared with the highly
oscillatory result of a single polynomial interpolation through the 8 data points (which gives a
polynomial of order 7).
Using the built-in Matlab function, the code for plotting this graph is simply:

xd=[0,1,2,2.2,2.7,3,4,5]';
yd=[2,1,3,3.5,4.25,4.8,5.8,4.8]';
x=0:0.05:5;
y=spline(xd,yd,x);
plot(x,y,'g','LineWidth',2.5)
hold on
plot(xd,yd,'ok','MarkerSize',10,'MarkerFaceColor','w','LineWidth',2)

Fig. 4.5
Note that the drawing of the 7th-order interpolating polynomial (red line in Fig. 4.5) is not
included in this piece of code and that the last line simply draws the markers.
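For readers working in Python rather than Matlab, an approximately equivalent sketch uses SciPy’s CubicSpline; its default “not-a-knot” end condition matches that of Matlab’s spline (plotting is omitted here):

```python
import numpy as np
from scipy.interpolate import CubicSpline

xd = np.array([0, 1, 2, 2.2, 2.7, 3, 4, 5], dtype=float)
yd = np.array([2, 1, 3, 3.5, 4.25, 4.8, 5.8, 4.8])
cs = CubicSpline(xd, yd)        # default bc_type='not-a-knot'
x = np.linspace(0, 5, 101)
y = cs(x)                       # spline values on a fine grid
print(float(cs(0.0)), float(cs(5.0)))  # the spline passes through the end points
```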
Exercise 4.4
Using Matlab, plot the function f(x) = 0.1 x e^(1.2 sin 2x) in the interval [0, 4] and construct
a cubic spline interpolation using the values of the function at the points xd = 0:0.5:4 (in
Matlab notation). Use Excel to construct a table of divided differences for this function at the
points xd and find the coefficients of the Newton interpolation polynomial. Use Matlab to plot
the corresponding polynomial in the same figure as the splines and compare the results.
If the main objective is to create a smooth function to represent the data, it is sometimes
preferable to choose a function that doesn’t necessarily pass exactly through the data points but
approximates their overall behaviour. This is what is called “approximation”. The problems here
are how to choose the approximating function and what is considered the best choice.
Approximation
There are many different ways to approximate a function and you have seen some of them
in detail already. Taylor expansions, least squares curve fitting and Fourier series are examples
of this.
Methods like the “least squares” look for a single function or polynomial to approximate
the desired function. Another approach, of which the Fourier series is an example, consists of
page 27 E763 (part 2) Numerical Methods
First Order Splines
If we have a set of data points: (x
i
, y
i
), i = 1, , n; the first order splines (straight lines) can
be defined as:
2 1 1 1 1
); ( ) ( x x x x x m y x f ≤ ≤ − + = (4.34)
3 2 2 2 2
); ( ) ( x x x x x m y x f ≤ ≤ − + = (4.35)
…
1
); ( ) (
+
≤ ≤ − + =
i i i i i
x x x x x m y x f (4.36)
with the slopes given by:
i i
i i
i
x x
y y
m
−
−
=
+
+
1
1
Quadratic and Cubic Splines
First order splines are straightforward. To fit a line in each interval between data points, we
need only 2 pieces of data: the data values at the 2 points. If we now want to fit a higher order
function, for example a second order polynomial, a parabola, we need to determine 3
coefficients, and in general, for an morder spline we need m+1 equations.
If we want a smooth function, over the complete domain, we should impose continuity of
the representation in contiguous intervals and as many derivatives as possible to be continuous
too at the adjoining points.
For n+1 points, x
i
, i = 0,..,n there are n intervals and consequently, n functions to
determine. If we use quadratic splines (second order) their equations will be of the form:
c x b x a x f
i i i
+ + =
2
) ( and we need 3 equations per interval to find the 3 coefficients, that is, a
total of 3n equations.
We can establish the following conditions to be satisfied:
1) The values of the functions corresponding to adjacent intervals must be equal at the
common interior nodes (no discontinuities at the nodes).
This can be represented by:
i
x x
i i i i i i
c x b x a c x b x a x f
=
+ + +
+ + = + + =
1 1
2
1
2
) ( (4.37)
for i = 1,…,n, that is, at the node x
i
, the boundary between interval i and i+1, the functions
defined in each of these intervals must coincide. This gives us a total of 2n−2 equations
(there are 2 equations in the above line and n−1 interior points).
2) The first and last functions must pass through the first and last points (end points). This
gives us another 2 equations
3) The first derivative at the interior nodes must be continuous.
This can be written as: ) ( ' ) ( '
1 i i i i
x f x f
+
= for i = 2,…,n. Then:
1
2
1
2
2 2
+ +
+ = +
i i i i i i
b x a b x a (4.38)
and we have another n−1 equations.
All these give us a total of 3n−1 equations when we need 3n. An additional condition must then
be established. Usually this can be chosen at the end points, for example, stating that a
1
= 0
(This corresponds to asking for the second derivative to be zero at the first point and results in
the two first points joined by a straight line).
For example, for a set of 4 data points, we can establish the necessary equations as listed above,
giving a total of 8 equations for 8 unknowns (having fixed a
1
= 0 already). This can be solved
by the matrix techniques that we will study later.
Quadratic splines have some shortcomings that are not present in cubic splines, so these
E763 (part 2) Numerical Methods page 28
are preferred. However, their calculation is even more cumbersome than that of quadratic
splines. In this case, the function and the first and second derivatives are continuous at the
nodes.
Because of their popularity, cubic splines are commonly found in computer libraries and for
example, Matlab has a standard function that calculates them.
To illustrate the advantages of this technique,
Fig. 4.5 shows the cubic interpolation splines
(green line) for the same function described
in the last example, and compared with the
highly oscillatory result for a single
polynomial interpolation for the 8 data points
(that gives a polynomial of order 7.
Using the builtin Matlab function, the code
for plotting this graph is simply:
0 1 2 3 4 5
0
1
2
3
4
5
6
xd=[0,1,2,2.2,2.7,3,4,5]'; Fig. 4.5
yd=[2,1,3,3.5,4.25,4.8,5.8,4.8]';
x=0:0.05:5;
y=spline(xd,yd,x);
plot(x,y,'g','Linewidth',2.5)
plot(xd,yd,'ok','MarkerSize',10,'MarkerFaceColor','w','LineWidth',2)
Note that the drawing of the 7order interpolating polynomial (red line in Fig. 4.5) is not
included in this piece of code and that the last line is simply to draw the markers.
Exercise 4.4
Using Matlab plot the function
x
xe x f
2
sin 2 . 1
1 . 0 ) ( = in the interval [0, 4] and construct a cubic
spline interpolation using the values of the function at the points xd = 0: 0.5: 4 (in Matlab
notation). Use Excel to construct a table of divided differences for this function at the points xd
and find the coefficients of the Newton interpolation polynomial. Use Matlab to plot the
corresponding polynomial in the same figure as the splines and compare the results
If the main objective is to create a smooth function to represent the data, it is some times
preferable to choose a function that doesn’t necessarily pass exactly through the data but
approximates its overall behaviour. This is what is called “approximation”. The problems here
are how to choose the approximating function and what is considered the best choice.
Approximation
There are many different ways to approximate a function and you have seen some of them
in detail already. Taylor expansions, least squares curve fitting and Fourier series are examples
of this.
Methods like the “least squares” look for a single function or polynomial to approximate
the desired function. Another approach, of which the Fourier series is an example, consists of
using a family of simple functions to build an expansion that approximates the given function.
The problem then is to find the appropriate set of coefficients for that expansion. Taylor series
are somewhat related; the main difference is that while the other methods based on expansions
attempt to find an overall approximation, Taylor series are meant to approximate the function at
one particular point and in its close vicinity.
Least Squares Curve Fitting
One of the most common problems of approximation is fitting a curve to experimental data. In
this case, the usual objective is to find the curve that approximates the data while minimizing
some measure of the error. In the method of least squares, the measure chosen is the sum of the
squares of the differences between the data and the approximating curve.
If the data is given by (x_i, y_i), i = 1, …, n, we can define the error of the approximation by:
E = Σ_{i=1}^{n} (y_i − f(x_i))²        (4.39)
The commonest choices for the approximating functions are polynomials: straight lines,
parabolas, etc. Then, the minimization of the error will give the necessary relations to determine
the coefficients.
Approximation by a Straight Line
The equation of a straight line is f(x) = a + bx, so the error to minimise is:
E(a, b) = Σ_{i=1}^{n} (y_i − (a + b x_i))²        (4.40)
The error is a function of the parameters a and b that define the straight line. Then, the
minimization of the error can be achieved by making the derivatives of E with respect to a and b
equal to zero. These conditions give:
∂E/∂a = −2 Σ_{i=1}^{n} (y_i − (a + b x_i)) = 0   ⇒   Σ_{i=1}^{n} y_i − n a − b Σ_{i=1}^{n} x_i = 0        (4.41)
∂E/∂b = −2 Σ_{i=1}^{n} x_i (y_i − (a + b x_i)) = 0   ⇒   Σ_{i=1}^{n} x_i y_i − a Σ_{i=1}^{n} x_i − b Σ_{i=1}^{n} x_i² = 0        (4.42)
which can be simplified to:
n a + b Σ_{i=1}^{n} x_i = Σ_{i=1}^{n} y_i        (4.43)
and
a Σ_{i=1}^{n} x_i + b Σ_{i=1}^{n} x_i² = Σ_{i=1}^{n} x_i y_i        (4.44)
Solving the system for a and b gives:
a = (Σ x_i² Σ y_i − Σ x_i Σ x_i y_i) / (n Σ x_i² − (Σ x_i)²)   and
b = (n Σ x_i y_i − Σ x_i Σ y_i) / (n Σ x_i² − (Σ x_i)²)        (4.45)
(all sums taken over i = 1, …, n).
Example
Fitting a straight line to the data given by:
xd = {0, 0.2, 0.8, 1, 1.2, 1.9, 2, 2.1, 2.95, 3} and
yd = {0.01, 0.22, 0.76, 1.03, 1.18, 1.94, 2.01, 2.08, 2.9, 2.95}
Then, Σ xd_i = 15.15; Σ yd_i = 15.08; Σ xd_i² = 32.8425 and Σ xd_i yd_i = 32.577
(sums over i = 1, …, 10), and the parameters of the straight line are:
a = 0.01742473648290 and
b = 0.98387806172746
Fig. 4.6 shows the approximating straight line (red) together with the curve for Newton
interpolation (black) and for cubic splines (blue). The problems occurring with the use of
higher-order polynomial interpolation are also evident.
Fig. 4.6
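The closed-form expressions (4.45) are straightforward to implement directly. A short sketch (in Python rather than Matlab; the function name is an illustrative choice), applied to the example data, reproduces the quoted parameters:

```python
def line_fit(xd, yd):
    """Least-squares straight line f(x) = a + b x via the closed forms (4.45)."""
    n = len(xd)
    Sx, Sy = sum(xd), sum(yd)
    Sxx = sum(x * x for x in xd)
    Sxy = sum(x * y for x, y in zip(xd, yd))
    denom = n * Sxx - Sx ** 2
    a = (Sxx * Sy - Sx * Sxy) / denom
    b = (n * Sxy - Sx * Sy) / denom
    return a, b

xd = [0, 0.2, 0.8, 1, 1.2, 1.9, 2, 2.1, 2.95, 3]
yd = [0.01, 0.22, 0.76, 1.03, 1.18, 1.94, 2.01, 2.08, 2.9, 2.95]
a, b = line_fit(xd, yd)
print(a, b)   # ≈ 0.01742, 0.98388, as quoted in the example
```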
Example
The data (xd, yd) shown in Fig. 4.7 appears to behave in an exponential manner. Then, defining
the variable zd_i = log(yd_i), zd should vary linearly with xd. We can then fit a straight line
to the pair of variables (xd, zd). If the fitting function is z(x), then y = e^{z(x)} is a
least-squares fit to the original data (strictly, it minimises the squared error of the
logarithms rather than of the original data).
Fig. 4.7
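The log-transform idea can be sketched as follows (in Python; since the data of Fig. 4.7 is not tabulated in the notes, the data here is synthetic, generated from the hypothetical model y = 2 e^{0.8x}, which the fit then recovers):

```python
import math

def exp_fit(xd, yd):
    """Fit y ≈ exp(a + b x) by fitting a straight line to (x, log y).
    Note: this minimises the squared error of the logarithms, not of y."""
    zd = [math.log(y) for y in yd]
    n = len(xd)
    Sx, Sz = sum(xd), sum(zd)
    Sxx = sum(x * x for x in xd)
    Sxz = sum(x * z for x, z in zip(xd, zd))
    denom = n * Sxx - Sx ** 2
    a = (Sxx * Sz - Sx * Sxz) / denom
    b = (n * Sxz - Sx * Sz) / denom
    return a, b

# Synthetic data lying exactly on y = 2 e^{0.8 x}
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [2 * math.exp(0.8 * x) for x in xs]
a, b = exp_fit(xs, ys)
print(math.exp(a), b)   # ≈ 2.0 and 0.8
```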
Second Order Approximation
If a second-order curve is used for the approximation, f(x) = a + bx + cx², there are 3
parameters to find and the error is given by:
E(a, b, c) = Σ_{i=1}^{n} (y_i − (a + b x_i + c x_i²))²        (4.45)
Making the derivatives of E with respect to a, b and c equal to zero will give the necessary 3
equations for the coefficients of the parabola. The expressions are similar to those found for the
straight line fit although more complicated.
Matlab has standard functions to perform least squares approximations with polynomials of any
order. If the data is given by (xd, yd), and m is the desired order of the polynomial to fit, the
function:
coeff = polyfit(xd, yd, m)
returns the coefficients of the polynomial.
If now x is the array of values where the approximation is wanted,
y = polyval(coeff, x)
returns the values of the polynomial fit at the points x.
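An approximately equivalent workflow in Python uses NumPy’s polyfit and polyval, which mirror the Matlab functions described above (the data here is hypothetical):

```python
import numpy as np

xd = np.array([0.0, 1.0, 2.0, 3.0])
yd = np.array([1.0, 3.0, 7.0, 13.0])   # hypothetical data lying on y = x^2 + x + 1
coeff = np.polyfit(xd, yd, 2)          # order-2 least-squares fit
y = np.polyval(coeff, np.array([0.5, 1.5]))   # evaluate the fit where wanted
print(coeff)                           # ≈ [1, 1, 1], highest power first
```

Since the four points lie exactly on a parabola, the least-squares fit reproduces it.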
Exercise 4.5
Find the coefficients of a second-order polynomial that approximates the function s(x) = e^x in
the interval [−1, 1] in the least-squares sense, using the points x_d = [−1, 0, 1]. Plot the
function and the approximating polynomial together with the Taylor approximation (Taylor series
truncated to second order) for comparison.
Approximation using Continuous Functions
The same idea of “least squares”, that is, trying to minimize an error expression in the least-
squares sense, can be applied to the approximation of continuous functions. Imagine that instead of
discrete data sets we have a rather complicated analytical expression representing the function of
interest. This case is rather different from all we have seen previously because now there is
certainty about the value of the desired variable at every point, but it might be desirable to have
a simpler expression representing this behaviour, at least in a certain domain, for further analysis
purposes. For example, if the function is s(x) and we want to approximate it using a second-order
polynomial, we can formulate the problem as minimizing the square of the difference over the
domain of interest, that is:
E = ∫_Ω (a x² + b x + c − s(x))² dx        (4.46)
Again, asking for the derivatives of E with respect to a, b and c to be zero will give us the
necessary equations to find these coefficients. There is one problem though: the resultant matrix
problem, particularly for large systems (high-order polynomials), is normally very ill-
conditioned, so some special precautions need to be taken.
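One way to see the resulting equations concretely: differentiating (4.46) with respect to a, b and c gives a 3×3 linear system whose entries are integrals of powers of x. The sketch below (in Python; the function name and the trapezoidal quadrature are illustrative choices) solves it numerically:

```python
import numpy as np

def continuous_ls_parabola(s, lo, hi, num=2001):
    """Least-squares fit of a x^2 + b x + c to s(x) on [lo, hi], per (4.46).
    Setting dE/da = dE/db = dE/dc = 0 gives the normal equations M p = r,
    with M_jk = ∫ x^(j+k) dx and r_j = ∫ x^j s(x) dx (basis 1, x, x^2)."""
    x = np.linspace(lo, hi, num)

    def trap(y):                      # trapezoidal rule for ∫ y dx
        return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2)

    moments = [trap(x ** k) for k in range(5)]   # ∫ x^k dx, k = 0..4
    M = np.array([[moments[j + k] for k in range(3)] for j in range(3)])
    r = np.array([trap(x ** j * s(x)) for j in range(3)])
    c, b, a = np.linalg.solve(M, r)              # coefficients of 1, x, x^2
    return a, b, c

# Sanity check: fitting s(x) = x^2 on [-1, 1] should recover a=1, b=0, c=0.
a, b, c = continuous_ls_parabola(lambda x: x ** 2, -1.0, 1.0)
print(round(a, 4), round(b, 4), round(c, 4))
```

For a low-order fit like this the 3×3 moment matrix is well behaved; the ill-conditioning mentioned above becomes serious as the polynomial order grows.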
Approximation using Orthogonal Functions
We can extend the same ideas of least squares to the approximation of a given function by an
expansion in terms of a set of “basis functions”. You are already familiar with this idea from the
use of Fourier expansions to approximate functions, but the concept is not restricted to use of
sines and cosines as basis functions.
First of all, we call a base a family of functions that satisfies a set of properties. Similarly
to what you know of vectors, for example, the set of unit vectors x̂, ŷ and ẑ constitutes a base
in 3D space because any vector in that space can be expressed as a combination of these three.
Naturally, we don’t actually need these vectors to be perpendicular to each other, but that helps
(as we’ll see later). However, the base must be complete; that is, no vector in 3D escapes this
representation. For example, if we only consider the vectors x̂ and ŷ, no combination of them
can possibly have a component along the ẑ axis and the resultant vectors are restricted to the xy
plane. In the same sense, we want a set of basis functions to be complete (as sines and cosines
are) in order that any function can have its representation by an expansion. The difference now
is that, unlike the 3D space, we need an infinite set (that is the dimension of the space of
functions).
So, if we select a complete set of basis functions φ_k(x), we can represent our function by:
f(x) ≈ f̃_n(x) = Σ_{k=1}^{n} c_k φ_k(x)        (4.47)
an expansion truncated to n terms.
In this context, the error of this approximation is the difference between the exact and the
approximate functions, r_n(x) = f(x) − f̃_n(x), and we can use the least-squares ideas again,
seeking to minimise the norm of this error, that is, the quantity (error residual):
R_n = ||r_n(x)||² = ∫_Ω (f(x) − f̃_n(x))² dx        (4.48)
with the subscript n because the error residual above is also a function of the truncation level.
This concept of norm is analogous to that of the norm or modulus of a vector:
||v||² = v · v = Σ_{i=1}^{N} v_i².
In order to extend it to functions we need to introduce the inner product of functions
(analogous to the dot product of vectors):
If we have two functions f and g defined over the same domain Ω, their inner product is
the quantity:
⟨f, g⟩ = ∫_Ω f(x) g(x) dx        (4.49)
The inner product satisfies the following properties:
1. ⟨f, f⟩ > 0 for all nonzero functions
2. ⟨f, g⟩ = ⟨g, f⟩ for all functions f and g
3. ⟨f, αg₁ + βg₂⟩ = α⟨f, g₁⟩ + β⟨f, g₂⟩ for all functions f, g₁, g₂ and scalars α and β.
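The inner product and its first two properties are easy to check numerically. A small sketch (in Python; the function name and the trapezoidal quadrature are illustrative choices):

```python
import numpy as np

def inner(f, g, lo=-1.0, hi=1.0, num=2001):
    """<f, g> = ∫ f(x) g(x) dx over [lo, hi], per eq. (4.49),
    approximated with the trapezoidal rule."""
    x = np.linspace(lo, hi, num)
    y = f(x) * g(x)
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2)

f = lambda x: np.sin(np.pi * x)
g = lambda x: x ** 2
print(inner(f, f) > 0)                          # property 1: <f, f> > 0
print(abs(inner(f, g) - inner(g, f)) < 1e-12)   # property 2: symmetry
```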
Note: The above definition of inner product is sometimes extended using a weighting function
page 31 E763 (part 2) Numerical Methods
Second Order Approximation
If a second order curve is used for the approximation:
2
) ( cx bx a x f + + = , there are 3 parameters
to find and the error is given by:
( )
∑
=
+ + − =
n
i
i
cx bx a y c b a E
1
2
2
) ( ) , , ( (4.45)
Making the derivatives of E with respect to a, b and c equal to zero will give the necessary 3
equations for the coefficients of the parabola. The expressions are similar to those found for the
straight line fit although more complicated.
Matlab has standard functions to perform least squares approximations with polynomials of any
order. If the data is given by (xd, yd), and m is the desired order of the polynomial to fit, the
function:
coeff = polyfit(xd, yd, m)
returns the coefficients of the polynomial.
If now x is the array of values where the approximation is wanted,
y = polyval(coeff, x)
returns the values of the polynomial fit at the point x.
Exercise 4.5
Find the coefficients of a second order polynomial that approximates the function
x
e x s = ) ( .in
the interval [−1, 1] in the least squares sense, using the points x
d
= [−1, 0, 1]. Plot the function
and the approximating polynomial together with the Taylor approximation (Taylor series
truncated to second order) for comparison.
Approximation using Continuous Functions
The same idea of “least squares”, that is, to try to minimize an error expression in the least
squares sense, can be used to the approximation of continuous functions. Imagine that instead of
discrete data sets we have a rather complicated analytical expression representing the function of
interest. This case is rather different to all we have seen previously because now there is a
certainty of the value of the desired variable at every point, but it might be desirable to have a
simpler expression representing this behaviour, at least in a certain domain, for further analysis
purposes. For example is the function is s(x) and we want to approximate it using a second order
polynomial, we can formulate the problem as minimizing the square of the difference over the
domain of interest, that is:
∫
Ω
− + + = dx x s c bx ax E
2 2
)) ( ( (4.46)
Again, asking for the derivatives of E with respect to a, b and c to be zero, will give us the
necessary equations to find these coefficients. There is one problem though, the resultant matrix
problem, particularly for large systems (high order polynomials) is normally very ill
conditioned, so some special precautions need to be taken.
E763 (part 2) Numerical Methods page 32
Approximation using Orthogonal Functions
We can extend the same ideas of least squares to the approximation of a given function by an
expansion in terms of a set of “basis functions”. You are already familiar with this idea from the
use of Fourier expansions to approximate functions, but the concept is not restricted to use of
sines and cosines as basis functions.
First of all, we call a basis a family of functions that satisfies a set of properties. By analogy with what you know about vectors: the set of unit vectors x̂, ŷ and ẑ constitutes a basis of 3D space because any vector in that space can be expressed as a combination of these three. Naturally, we don't actually need these vectors to be perpendicular to each other, but that helps (as we'll see later). However, the basis must be complete; that is, no vector in 3D escapes this representation. For example, if we only consider the vectors x̂ and ŷ, no combination of them can possibly have a component along the ẑ axis, and the resultant vectors are restricted to the xy plane. In the same sense, we want a set of basis functions to be complete (as the sines and cosines are) so that any function can be represented by an expansion. The difference now is that, unlike the 3D space, we need an infinite set (that is the dimension of the space of functions).
So, if we select a complete set of basis functions $\phi_k(x)$, we can represent our function by:

$f(x) \approx \tilde f_n(x) = \sum_{k=1}^{n} c_k \phi_k(x)$    (4.47)
an expansion truncated to n terms.
In this context, the error of this approximation is the difference between the exact and the approximate functions, $r_n(x) = f(x) - \tilde f_n(x)$, and we can use the least squares ideas again, seeking to minimise the norm of this error; that is, the quantity (error residual):

$R_n = \| r_n(x) \|^2 = \int_\Omega (f(x) - \tilde f_n(x))^2 \, dx$    (4.48)

with the subscript n because the error residual above is also a function of the truncation level.
This concept of norm is analogous to that of the norm or modulus of a vector:

$\| \vec{v} \|^2 = \vec{v} \cdot \vec{v} = \sum_{i=1}^{N} v_i^2$
In order to extend it to functions we need to introduce the inner product of functions (analogous to the dot product of vectors):
If we have two functions f and g defined over the same domain Ω, their inner product is
the quantity:
$\langle f, g \rangle = \int_\Omega f(x) g(x) \, dx$    (4.49)
The inner product satisfies the following properties:
1. $\langle f, f \rangle > 0$ for all nonzero functions f.
2. $\langle f, g \rangle = \langle g, f \rangle$ for all functions f and g.
3. $\langle f, \alpha g_1 + \beta g_2 \rangle = \alpha \langle f, g_1 \rangle + \beta \langle f, g_2 \rangle$ for all functions f, g1 and g2 and scalars α and β.
Note: The above definition of inner product is sometimes extended using a weighting function w(x) in the form: $\langle f, g \rangle = \int_\Omega w(x) f(x) g(x) \, dx$, and provided that w(x) is nonnegative it satisfies all the required properties.
In a similar form as with the dot product of vectors, $\| f(x) \|^2 = \langle f(x), f(x) \rangle$.
Using the inner product definition, the error expression (4.48) can be written as:

$R_n = \int_\Omega (f(x) - \tilde f_n(x))^2 \, dx = \| f(x) - \tilde f_n(x) \|^2 = \| f(x) \|^2 - 2 \langle f(x), \tilde f_n(x) \rangle + \| \tilde f_n(x) \|^2$    (4.50)
and if we write $\tilde f_n(x) = \sum_{k=1}^{n} c_k \phi_k(x)$ as above, we get:

$R_n = \| f(x) \|^2 - 2 \sum_{k=1}^{n} c_k \langle f(x), \phi_k(x) \rangle + \sum_{j=1}^{n} \sum_{k=1}^{n} c_j c_k \langle \phi_j(x), \phi_k(x) \rangle$    (4.51)
We can see that the error residual $R_n$ is a function of the coefficients $c_k$ of the expansion, and then, to find those values that minimize this error, we can set the derivatives of $R_n$ with respect to $c_k$ equal to zero for all k. That is:

$\frac{\partial R_n}{\partial c_k} = 0$ for k = 1, …, n.    (4.52)
The first term is independent of $c_k$, so it will not contribute, and the other two will yield the general equation:

$\sum_{j=1}^{n} c_j \langle \phi_j(x), \phi_k(x) \rangle = \langle f(x), \phi_k(x) \rangle$ for k = 1, …, n.    (4.53)
Writing this down in detail gives:

(k = 1)  $c_1 \langle \phi_1, \phi_1 \rangle + c_2 \langle \phi_2, \phi_1 \rangle + \cdots + c_n \langle \phi_n, \phi_1 \rangle = \langle f, \phi_1 \rangle$
(k = 2)  $c_1 \langle \phi_1, \phi_2 \rangle + c_2 \langle \phi_2, \phi_2 \rangle + \cdots + c_n \langle \phi_n, \phi_2 \rangle = \langle f, \phi_2 \rangle$
  ⋮
(k = n)  $c_1 \langle \phi_1, \phi_n \rangle + c_2 \langle \phi_2, \phi_n \rangle + \cdots + c_n \langle \phi_n, \phi_n \rangle = \langle f, \phi_n \rangle$
which can be written as a matrix problem of the form: Φc = s, where the matrix Φ contains all
the inner products (in all combinations), the vector c is the list of coefficients and s is the list of
values in the right hand side (which are all known: we can calculate them all).
We can find the coefficients by solving the system of equations, but we can see that this will be a much easier task if all crossed products of basis functions yielded zero; that is, if

$\langle \phi_k(x), \phi_j(x) \rangle = 0$ for all j ≠ k    (4.54)

This is what is called orthogonality, and it is a very useful property of the functions in a basis. (Similar to what happens with a basis of perpendicular vectors: all dot products between different members of the basis are zero.)
Then, if the basis functions $\phi_k(x)$ are orthogonal, the solution of the system above is straightforward:

$c_k = \frac{\langle f(x), \phi_k(x) \rangle}{\langle \phi_k(x), \phi_k(x) \rangle}$    (4.55)
Remark:
You can surely recognise here the properties of the sine and cosine functions involved in Fourier expansions: they form a complete set, they are orthogonal ($\langle \phi_k(x), \phi_j(x) \rangle = 0$ for k ≠ j, where $\phi_k$ is either $\sin k\pi x$ or $\cos k\pi x$), and the coefficients of the expansions are given by the same expression (4.55) above.
Fourier expansions are convenient, but sinusoidal functions are not the only or the simplest possibility. There are many other sets of orthogonal functions that can form a basis; in particular, we are interested in polynomials because of the convenience of calculations.
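The projection formula (4.55) can be sketched numerically. In this illustrative sketch (not from the notes), f(x) = |x| and a three-term cosine basis are arbitrary example choices, and the inner products of (4.49) are evaluated by the trapezoidal rule on [−1, 1].

```python
import numpy as np

# Sketch of (4.55): c_k = <f, phi_k> / <phi_k, phi_k>, with the inner
# product (4.49) evaluated by the trapezoidal rule on [-1, 1].
x = np.linspace(-1.0, 1.0, 20001)
dx = x[1] - x[0]

def inner(u, v):
    # <u, v> = integral of u(x) v(x) over [-1, 1] (trapezoidal rule)
    w = u * v
    return float(np.sum(0.5 * (w[1:] + w[:-1])) * dx)

f = np.abs(x)                          # function to approximate (example)
basis = [np.ones_like(x),              # phi_0 = 1
         np.cos(np.pi * x),            # phi_1 = cos(pi x)
         np.cos(2 * np.pi * x)]        # phi_2 = cos(2 pi x)

coeffs = [inner(f, phi) / inner(phi, phi) for phi in basis]
```

For this f the exact values are c_0 = 1/2, c_1 = −4/π² and c_2 = 0, which the quadrature reproduces closely.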
Families of Orthogonal Polynomials
There are many different families of polynomials that can be used in this manner. Some are
more useful than others in particular problems. They often originate as solutions of differential equations.
1. Legendre Polynomials
They are orthogonal polynomials in the interval [−1, 1], with weighting function 1. That is,

$\langle P_i(x), P_j(x) \rangle = \int_{-1}^{1} P_i(x) P_j(x) \, dx = 0$ if i ≠ j.

They are usually normalised so that $P_n(1) = 1$, and their norm in this case is:

$\| P_n(x) \|^2 = \langle P_n(x), P_n(x) \rangle = \int_{-1}^{1} P_n(x)^2 \, dx = \frac{2}{2n + 1}$    (4.56)
The first few are:

$P_0(x) = 1$
$P_1(x) = x$
$P_2(x) = (3x^2 - 1)/2$
$P_3(x) = (5x^3 - 3x)/2$
$P_4(x) = (35x^4 - 30x^2 + 3)/8$
$P_5(x) = (63x^5 - 70x^3 + 15x)/8$
$P_6(x) = (231x^6 - 315x^4 + 105x^2 - 5)/16$
$P_7(x) = (429x^7 - 693x^5 + 315x^3 - 35x)/16$
etc.
Fig. 4.8 Legendre polynomials
In general they can be defined by the expression:

$P_n(x) = \frac{(-1)^n}{2^n n!} \frac{d^n}{dx^n} (1 - x^2)^n$    (4.57)
They also satisfy the recurrence relation:

$(n + 1) P_{n+1}(x) = (2n + 1)\, x P_n(x) - n P_{n-1}(x)$    (4.58)
2. Chebyshev Polynomials
The general, compact definition of these polynomials is:

$T_n(x) = \cos(n \cos^{-1} x)$    (4.59)
and they satisfy the following orthogonality condition:

$\langle T_i(x), T_j(x) \rangle = \int_{-1}^{1} \frac{T_i(x) T_j(x)}{\sqrt{1 - x^2}} \, dx = \begin{cases} 0 & \text{if } i \neq j \\ \pi/2 & \text{if } i = j \neq 0 \\ \pi & \text{if } i = j = 0 \end{cases}$    (4.60)

That is, they are orthogonal in the interval [−1, 1] with the weighting function $w(x) = 1/\sqrt{1 - x^2}$.
They are characterised by having all their oscillations of the same amplitude, within the interval [−1, 1]; also, all their zeros occur in the same interval.
The first few Chebyshev polynomials are:

$T_0(x) = 1$
$T_1(x) = x$
$T_2(x) = 2x^2 - 1$
$T_3(x) = 4x^3 - 3x$
$T_4(x) = 8x^4 - 8x^2 + 1$
$T_5(x) = 16x^5 - 20x^3 + 5x$
$T_6(x) = 32x^6 - 48x^4 + 18x^2 - 1$
$T_7(x) = 64x^7 - 112x^5 + 56x^3 - 7x$
etc.
Fig. 4.9 Chebyshev polynomials
They can also be constructed from the recurrence relation:

$T_{n+1}(x) = 2x T_n(x) - T_{n-1}(x)$ for n ≥ 1.    (4.61)
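As a sketch, the recurrence (4.61) can be used to generate the coefficient arrays and cross-checked against the compact definition (4.59); the evaluation point 0.3 is an arbitrary test value.

```python
import numpy as np

# Chebyshev polynomials from the recurrence (4.61):
# T_{n+1}(x) = 2x T_n(x) - T_{n-1}(x), coefficients in ascending powers.
def chebyshev_coeffs(nmax):
    T = [np.array([1.0]), np.array([0.0, 1.0])]      # T0 = 1, T1 = x
    for n in range(1, nmax):
        a = 2.0 * np.concatenate(([0.0], T[n]))      # 2x T_n
        a[:len(T[n - 1])] -= T[n - 1]                # subtract T_{n-1}
        T.append(a)
    return T

T = chebyshev_coeffs(7)

def eval_poly(coef, x):
    return sum(c * x**i for i, c in enumerate(coef))
```

Both routes give the same values, since $\cos(n \cos^{-1} x)$ and the recurrence define the same polynomials.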
These are not the only possibilities. Other families of polynomials commonly used are the Hermite polynomials, which are orthogonal over the complete real axis with weighting function $\exp(-x^2)$, and the Laguerre polynomials, orthogonal in [0, ∞) with weighting function $e^{-x}$.
Example
Approximate the function $f(x) = \frac{1}{1 + a^2 x^2}$ with a = 4 in the interval [−1, 1] in the least squares sense using Legendre polynomials up to order 8.

The approximation is $\tilde f(x) = \sum_{k=0}^{8} c_k P_k(x)$ and the coefficients are:

$c_k = \frac{1}{\| P_k(x) \|^2} \int_{-1}^{1} f(x) P_k(x) \, dx$ with $\| P_k(x) \|^2 = \frac{2}{2k + 1}$.
From the expression above we can see that the calculation of the coefficients will involve the integrals:
$I_m = \int_{-1}^{1} \frac{x^m}{1 + a^2 x^2} \, dx$

which satisfy the recurrence relation:

$I_m = \frac{2}{(m - 1) a^2} - \frac{I_{m-2}}{a^2}$ with $I_0 = \frac{2}{a} \tan^{-1} a$
We can also see that, due to symmetry, $I_m = 0$ for m odd (the integral of an odd function over the interval [−1, 1]). Because of this, only the even-numbered coefficients are needed (the odd coefficients of the expansion are zero).
The coefficients are then:

$c_0 = \frac{1}{2} \int_{-1}^{1} f(x) \, dx = \frac{1}{2} \int_{-1}^{1} \frac{dx}{1 + a^2 x^2} = \frac{1}{2} I_0$

$c_2 = \frac{5}{2} \int_{-1}^{1} P_2(x) f(x) \, dx = \frac{5}{4} \int_{-1}^{1} \frac{3x^2 - 1}{1 + a^2 x^2} \, dx = \frac{5}{4} (3 I_2 - I_0)$

$c_4 = \frac{9}{2} \int_{-1}^{1} P_4(x) f(x) \, dx = \frac{9}{16} (35 I_4 - 30 I_2 + 3 I_0)$

$c_6 = \frac{13}{2} \int_{-1}^{1} P_6(x) f(x) \, dx = \frac{13}{32} (231 I_6 - 315 I_4 + 105 I_2 - 5 I_0)$

$c_8 = \frac{17}{2} \int_{-1}^{1} P_8(x) f(x) \, dx = \frac{17}{256} (6435 I_8 - 12012 I_6 + 6930 I_4 - 1260 I_2 + 35 I_0)$
The results are shown in Fig. 4.10, where the red line corresponds to the approximating curve. The green line at the bottom shows the absolute value of the difference between the two curves. The total error of the approximation (the integral of this difference divided by the integral of the original curve) is 10.5%.

Fig. 4.10
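The recurrence for $I_m$ and the first few coefficients can be sketched directly; the numerical values below are produced by this sketch, stepping the recurrence from $I_0$ (even m only, since the odd integrals vanish).

```python
import math

# Sketch of the example: I_m = integral of x^m / (1 + a^2 x^2) over [-1, 1],
# via the recurrence I_m = 2/((m-1) a^2) - I_{m-2}/a^2, I_0 = (2/a) atan(a).
a = 4.0

def I(m):
    val = (2.0 / a) * math.atan(a)          # I_0
    for k in range(2, m + 1, 2):            # step the recurrence (even m only)
        val = 2.0 / ((k - 1) * a * a) - val / (a * a)
    return val

c0 = I(0) / 2.0
c2 = (5.0 / 4.0) * (3.0 * I(2) - I(0))
c4 = (9.0 / 16.0) * (35.0 * I(4) - 30.0 * I(2) + 3.0 * I(0))
```

As a consistency check, $I_2$ can also be obtained directly from $I_2 = (2 - I_0)/a^2$, which the recurrence reproduces.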
Exercise 4.6
Use cosine functions ($\cos n\pi x$) instead of the Legendre polynomials (in the form of a Fourier series) and compare the results.
Remarks:
We can observe in the figure of the example above that the error oscillates through the domain. This is typical of least squares approximation, where the overall error is minimised.
In this context, Chebyshev polynomials are the best possible choice: the error obtained with their approximation is smaller than with any other polynomial of the same degree.
Furthermore, these polynomials have other useful properties. We have seen before the interpolation of data by higher order polynomials using equally spaced data points, and the problems that this causes were quite clear. One can then ask if there is a different distribution of points that helps in minimising the error. The answer is yes: the optimum arrangement is to locate the data points on the zeros of the Chebyshev polynomial of the order necessary to give the required number of data points. These occur at:

$x_k = \cos\left( \frac{2k - 1}{2n} \pi \right)$ for k = 1, …, n.
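The node formula can be sketched in a few lines (the choice n = 5 is illustrative); the resulting points are precisely the zeros of $T_n$.

```python
import math

# Chebyshev interpolation nodes: the zeros of T_n,
# x_k = cos((2k - 1) pi / (2n)), k = 1, ..., n
def chebyshev_nodes(n):
    return [math.cos((2 * k - 1) * math.pi / (2 * n)) for k in range(1, n + 1)]

nodes = chebyshev_nodes(5)
```

Note that the nodes cluster towards the ends of [−1, 1], which is what suppresses the oscillations seen with equally spaced interpolation points.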
Approximation to a Point
The approximation methods we have studied so far attempt to reach a “global” approximation to a function; that is, to minimise the error over a complete domain. This might be desirable in many cases, but there could be others where it is more important to achieve a highly accurate approximation at one particular point of the domain and in its close vicinity. One such method is the Taylor approximation, where the function is approximated by a Taylor polynomial.
Taylor Polynomial Approximation
Among polynomial approximation methods, the Taylor polynomial gives the maximum possible “order of contact” between the function and the polynomial; that is, the n-th order Taylor polynomial agrees with the function and with its first n derivatives at the point of contact.
The Taylor polynomial for approximating a function f(x) at a point a is given by:

$p(x) = f(a) + f'(a)(x - a) + \frac{f''(a)}{2!} (x - a)^2 + \cdots + \frac{f^{(n)}(a)}{n!} (x - a)^n$    (4.62)
and the error incurred is given by:

$\frac{f^{(n+1)}(\xi)}{(n + 1)!} (x - a)^{n+1}$

where ξ is a point between a and x. (See Appendix.)
Fig. 4.11 shows the function (blue line):

$y(x) = \sin \pi x + \cos 2\pi x$

with the Taylor approximation (red line) using a Taylor polynomial of order 9 (the Taylor series truncated at order 9) and, for comparison, the Newton approximation using 9 equally spaced points (indicated with * and a black line), giving a single polynomial of order 9. While the Taylor approximation is very good in the vicinity of zero (9 derivatives are also matched there), it deteriorates badly away from zero.

Fig. 4.11
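The local-versus-global behaviour can be sketched numerically. This illustrative sketch (not from the notes) builds the order-9 Maclaurin coefficients of y(x) = sin πx + cos 2πx from the standard sine and cosine series rather than by symbolic differentiation.

```python
import math

# Order-9 Maclaurin polynomial of y(x) = sin(pi x) + cos(2 pi x):
# sin(pi x) contributes to the odd powers, cos(2 pi x) to the even powers.
def maclaurin_coeffs(order):
    c = []
    for j in range(order + 1):
        if j % 2 == 1:   # sin(pi x): (-1)^((j-1)/2) pi^j / j!
            c.append((-1) ** ((j - 1) // 2) * math.pi ** j / math.factorial(j))
        else:            # cos(2 pi x): (-1)^(j/2) (2 pi)^j / j!
            c.append((-1) ** (j // 2) * (2 * math.pi) ** j / math.factorial(j))
    return c

def p(x, c):
    return sum(ck * x**k for k, ck in enumerate(c))

c = maclaurin_coeffs(9)
y = lambda x: math.sin(math.pi * x) + math.cos(2 * math.pi * x)
```

Near x = 0 the polynomial matches y to many digits, while at x = 1 it is off by an error much larger than the function itself, exactly the behaviour seen in Fig. 4.11.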
If a function varies rapidly or has a pole, polynomial approximations will not be able to achieve a high degree of accuracy. In that case an approximation using rational functions will be better: the polynomial in the denominator provides the means to represent rapid variations and poles.
Padé Approximation
Padé approximants are rational functions (ratios of polynomials) that fit the value of a function and a number of its derivatives at one point. They usually provide an approximation that is better than that of Taylor polynomials, in particular in the case of functions containing poles.
A Padé approximation (or Padé approximant) to a function f(x) that can be represented by a Taylor series in [a, b] is a ratio between two polynomials $P_m(x)$ and $Q_n(x)$ of orders m and n respectively:

$f(x) \approx R_m^n(x) = \frac{P_m(x)}{Q_n(x)} = \frac{a_0 + a_1 x + a_2 x^2 + \cdots + a_m x^m}{b_0 + b_1 x + b_2 x^2 + \cdots + b_n x^n}$    (4.63)
For simplicity, we’ll consider only approximations to a function at x = 0. For other values, a
simple transformation of variables can be used.
If the Taylor approximation at x = 0 (Maclaurin series) to f(x) is:

$t(x) = \sum_{i=0}^{k} c_i x^i = c_0 + c_1 x + c_2 x^2 + \cdots + c_k x^k$ with k = m + n    (4.64)
we can write:

$t(x) \approx R_m^n(x) = \frac{P_m(x)}{Q_n(x)}$ or $t(x) Q_n(x) = P_m(x)$.    (4.65)
Considering now this equation in its expanded form:

$(c_0 + c_1 x + c_2 x^2 + \cdots + c_k x^k)(b_0 + b_1 x + b_2 x^2 + \cdots + b_n x^n) = a_0 + a_1 x + a_2 x^2 + \cdots + a_m x^m$    (4.66)
we can establish a system of equations to find the coefficients of P and Q. First of all, we can force this condition to apply at x = 0 (exact matching of the function at x = 0). This will give:

$t(0) Q_n(0) = P_m(0)$ or: $c_0 b_0 = a_0$    (4.67)

but, since the ratio R doesn't change if we multiply the numerator and denominator by any number, we can choose the value $b_0 = 1$, and this gives us the value of $a_0$.
Taking now the first derivative of (4.66) will give:

$(c_1 + 2 c_2 x + \cdots + k c_k x^{k-1})(b_0 + b_1 x + \cdots + b_n x^n) + (c_0 + c_1 x + \cdots + c_k x^k)(b_1 + 2 b_2 x + \cdots + n b_n x^{n-1}) = a_1 + 2 a_2 x + \cdots + m a_m x^{m-1}$    (4.68)
and again, forcing this equation to be satisfied at x = 0 (exact matching of the first derivative) gives:

$c_1 b_0 + c_0 b_1 = a_1$    (4.69)

which gives an equation relating the coefficients $a_1$ and $b_1$:

$a_1 - c_0 b_1 = c_1 b_0$    (4.70)
To continue with the process of taking derivatives of (4.66), we can first establish some general formulae. If we call g(x) the product on the left hand side of (4.66), we can apply the product rule repeatedly to form the derivatives, giving:
$g'(x) = t'(x) Q(x) + t(x) Q'(x)$
$g''(x) = t''(x) Q(x) + 2 t'(x) Q'(x) + t(x) Q''(x)$
⋮
$g^{(i)}(x) = \sum_{j=0}^{i} \frac{i!}{j! (i - j)!} t^{(j)}(x) Q^{(i-j)}(x)$    (4.71)

and since we are interested in the values of the derivatives at x = 0, we have:

$g^{(i)}(0) = \sum_{j=0}^{i} \frac{i!}{j! (i - j)!} t^{(j)}(0) Q^{(i-j)}(0)$    (4.72)
The first derivative of a polynomial, say $Q_n(x)$, is $b_1 + 2 b_2 x + \cdots + n b_n x^{n-1}$, so the second derivative will be $2 b_2 + (2 \cdot 3) b_3 x + \cdots + n(n - 1) b_n x^{n-2}$, and so on. These derivatives evaluated at x = 0 are, successively:

$b_1,\; 2 b_2,\; (2 \cdot 3) b_3,\; (2 \cdot 3 \cdot 4) b_4,\; \ldots,\; j!\, b_j$
Then, we can write (4.72) as:

$g^{(i)}(0) = \sum_{j=0}^{i} \frac{i!}{j! (i - j)!}\, j!\, c_j\, (i - j)!\, b_{i-j} = i! \sum_{j=0}^{i} c_j b_{i-j}$

and equating this to the i-th derivative of $P_m(x)$ evaluated at x = 0, that is, $i!\, a_i$, gives $a_i = \sum_{j=0}^{i} c_j b_{i-j}$, but since $b_0 = 1$ we can finally write:

$a_i - \sum_{j=0}^{i-1} c_j b_{i-j} = c_i$ for i = 1, …, k (k = m + n),    (4.73)

where we take the coefficients $a_i = 0$ for i > m and $b_i = 0$ for i > n.
Example
Consider the function $f(x) = \frac{1}{\sqrt{1 - x}}$. This function has a pole at x = 1 and polynomial approximations will not perform well.
The Taylor coefficients of this function are given by:

$c_j = \frac{1 \cdot 3 \cdot 5 \cdots (2j - 1)}{2^j\, j!}$

So the first five are: $c_0 = 1$, $c_1 = 1/2$, $c_2 = 3/8$, $c_3 = 5/16$, $c_4 = 35/128$, which gives a polynomial of order 4.
From the equations above we can calculate the Padé coefficients. We choose m = n = 2, so k = m + n = 4.
Taking $b_0 = 1$, $c_0 b_0 = a_0$ gives $a_0 = 1$.
The other equations (from requiring the derivatives to fit) are:
$a_i - \sum_{j=0}^{i-1} c_j b_{i-j} = c_i$ for i = 1, …, k (k = m + n)

and in detail:

$a_1 - b_1 c_0 = c_1$
$a_2 - b_1 c_1 - b_2 c_0 = c_2$
$a_3 - b_1 c_2 - b_2 c_1 - b_3 c_0 = c_3$
$a_4 - b_1 c_3 - b_2 c_2 - b_3 c_1 - b_4 c_0 = c_4$

but for m = n = 2 we have $a_3 = a_4 = b_3 = b_4 = 0$ and the system can be written as:

$a_1 - b_1 c_0 = c_1$
$a_2 - b_1 c_1 - b_2 c_0 = c_2$
$-b_1 c_2 - b_2 c_1 = c_3$
$-b_1 c_3 - b_2 c_2 = c_4$

and we can see that the last two equations can be solved for the coefficients $b_i$. Rewriting this system:

$c_2 b_1 + c_1 b_2 = -c_3$
$c_3 b_1 + c_2 b_2 = -c_4$
In general, when m = n = k/2 as in this case, the matrix that defines the system will be of the form:

$\begin{pmatrix} r_0 & r_1 & r_2 & \cdots & r_{n-1} \\ r_{-1} & r_0 & r_1 & \cdots & r_{n-2} \\ r_{-2} & r_{-1} & r_0 & \cdots & r_{n-3} \\ \vdots & & & \ddots & \vdots \\ r_{-n+1} & r_{-n+2} & \cdots & & r_0 \end{pmatrix}$

This is a special kind of matrix called a Toeplitz matrix, which has the same element along each diagonal, so it is defined by a total of 2n − 1 numbers. There are special, efficient methods for solving systems with this kind of matrix.
Solving the system gives:

$a = [1, -3/4, 1/16]$ and $b = [1, -5/4, 5/16]$

giving:

$R_2^2(x) = \frac{a_0 + a_1 x + a_2 x^2}{b_0 + b_1 x + b_2 x^2} = \frac{1 - 0.75 x + 0.0625 x^2}{1 - 1.25 x + 0.3125 x^2}$
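The worked example can be checked with a short sketch that solves the 2×2 system for $b_1$, $b_2$ and then recovers $a_0$, $a_1$, $a_2$ from (4.73), using the Taylor coefficients of $1/\sqrt{1 - x}$ listed above.

```python
import numpy as np

# Check of the worked example: solve
#   c2 b1 + c1 b2 = -c3,   c3 b1 + c2 b2 = -c4
# for b1, b2, then a_i = sum_j c_j b_{i-j} (with b0 = 1).
c = [1.0, 1.0 / 2, 3.0 / 8, 5.0 / 16, 35.0 / 128]   # Taylor coefficients of 1/sqrt(1-x)

A = np.array([[c[2], c[1]],
              [c[3], c[2]]])
b1, b2 = np.linalg.solve(A, [-c[3], -c[4]])
b = [1.0, b1, b2]
a = [c[0],
     c[1] + c[0] * b[1],
     c[2] + c[1] * b[1] + c[0] * b[2]]
```

The solution reproduces a = [1, −3/4, 1/16] and b = [1, −5/4, 5/16], i.e. the coefficients of the ratio quoted above.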
page 39 E763 (part 2) Numerical Methods
) ( ) ( ' ) ( ' ) ( ) ( ' x q x t x q x t x g + =
) ( ) ( ' ' ) ( ' ) ( ' 2 ) ( ' ' ) ( ) ( ' ' x q x t x q x t x q x t x g + + =
L (4.71)
∑
=
−
−
=
i
j
j i j i
x Q x t
j i j
i
x g
0
) ( ) ( ) (
) ( ) (
)! ( !
!
) (
and since we are interested on the values of the derivatives at x = 0, we have:
∑
=
−
−
=
i
j
j i j i
Q t
j i j
i
g
0
) ( ) ( ) (
) 0 ( ) 0 (
)! ( !
!
) 0 ( (4.72)
The first derivative of a polynomial, say, ) (x Q
n
is
1 2
1
2 b x b x nb
n
n
+ + +
−
L , so the second
derivative will be:
2 3
2
2 3 2 ) 1 ( b x b x nb n
n
n
+ ⋅ + + −
−
L and so on. Then these derivatives
evaluated at x = 0 are successively:
j
b j b b b b ! , , , ) 4 3 2 ( , ) 3 2 ( , 2 ,
4 3 2 1
L ⋅ ⋅ ⋅
Then, we can write (4.72) as:
∑ ∑
=
−
=
−
= −
−
=
i
j
j i j
i
j
j i j
i
b c i b j i c j
j i j
i
g
0 0
) (
! )! ( !
)! ( !
!
) 0 ( and equating
this to the ith derivative of ) (x P
m
evaluated at x = 0, that is,
i
a i! gives:
∑
=
−
=
i
j
j i j i
b c a
0
, but since 1
0
= b we can finally write:
i
i
j
j i j i
c b c a = −
∑
−
=
−
1
0
for i = 1, …, k, (k = m + n). (4.73)
where we take the coefficients 0 =
i
a for m i > and 0 =
i
b for n i > .
Example
Consider the function
x
x f
−
=
1
1
) ( . This function has a pole at x = 1 and polynomial
approximations will not perform well.
The Taylor coefficients of this function are given by:
! 2
) 1 2 ( 5 3 1
j
j
c
j
j
− ⋅ ⋅
=
L
So the first five are: 1
0
= c ,
2
1
1
= c ,
8
3
2
= c ,
16
5
3
= c ,
128
35
4
= c which gives a polynomial of
order 4.
From the equations above we can calculate the Padé coefficients. We choose: m = n =2, so
k = m + n = 4.
Taking 1
0
= b ,
0 0 0
a b c = gives 1
0
= a
The other equations (from asking the derivatives to fit) are:
E763 (part 2) Numerical Methods page 40
i
i
j
j i j i
c b c a = −
∑
−
=
−
1
0
for i = 1, …, k, (k = m + n)
and in detail:
4 1 3 2 2 3 1 4 0 4
3 1 2 2 1 3 0 3
2 1 1 2 0 2
1 1 0 1
c b c b c b c b c a
c b c b c b c a
c b c b c a
c b c a
= − − − −
= − − −
= − −
= −
but for m = n =2, we have 0
4 3 4 3
= = = = b b a a and the system can be written as:
4 1 3 2 2
3 1 2 2 0
2 1 1 2 0 2
1 1 0 1
0 0 0
0 0 0
c b c b c
c b c b c
c b c b c a
c b c a
= − −
= − −
= − −
= −
and we can see that the second set of equations can be solved for the coefficients b
i
. Rewriting
this system:
4 2 2 1 3
3 2 0 1 2
c b c b c
c b c b c
− = +
− = +
In general, when n = m = k/2 as in this case, the matrix that defines the system will be of the form:

$$\begin{pmatrix}
r_0 & r_{-1} & r_{-2} & \cdots & r_{1-n}\\
r_1 & r_0 & r_{-1} & \cdots & r_{2-n}\\
r_2 & r_1 & r_0 & \cdots & r_{3-n}\\
\vdots & \vdots & \vdots & \ddots & \vdots\\
r_{n-1} & r_{n-2} & r_{n-3} & \cdots & r_0
\end{pmatrix}$$

This is a special kind of matrix called a Toeplitz matrix, which has the same element along each diagonal, so an n×n Toeplitz matrix is defined by a total of 2n−1 numbers. There are efficient methods for solving systems with this special structure.
Solving the system will give:

$$a = [1,\ -3/4,\ 1/16] \quad\text{and}\quad b = [1,\ -5/4,\ 5/16]$$

giving:

$$R_2^2(x) = \frac{a_0 + a_1 x + a_2 x^2}{b_0 + b_1 x + b_2 x^2} = \frac{1 - 0.75x + 0.0625x^2}{1 - 1.25x + 0.3125x^2}$$
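As a cross-check, the small system above can be solved directly. The sketch below (our own plain-Python code, not part of the notes; it uses the standard `fractions` module for exact arithmetic) rebuilds c_0…c_4, solves the 2×2 system for b_1, b_2 by Cramer's rule, and back-substitutes for the a_i:

```python
from fractions import Fraction as F

def taylor_coeff(j):
    """c_j = 1*3*5*...*(2j-1) / (2^j j!) for f(x) = 1/sqrt(1-x)."""
    num, den = 1, 2 ** j
    for i in range(1, j + 1):
        num *= 2 * i - 1   # odd-number product
        den *= i           # factorial factor
    return F(num, den)

c = [taylor_coeff(j) for j in range(5)]   # 1, 1/2, 3/8, 5/16, 35/128

# Solve  c2*b1 + c1*b2 = -c3,  c3*b1 + c2*b2 = -c4  by Cramer's rule
det = c[2] * c[2] - c[1] * c[3]
b1 = (-c[3] * c[2] + c[1] * c[4]) / det
b2 = (-c[2] * c[4] + c[3] * c[3]) / det
b = [F(1), b1, b2]

# Back-substitute: a_i = c_i + sum_{j<i} c_j b_{i-j}
a = [c[0],
     c[1] + c[0] * b[1],
     c[2] + c[0] * b[2] + c[1] * b[1]]

print(a)   # [1, -3/4, 1/16]
print(b)   # [1, -5/4, 5/16]
```

Exact rational arithmetic removes any rounding question from the check; the printed coefficients agree with the solution quoted above.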
Fig. 4.12 shows the function $f(x) = \dfrac{1}{\sqrt{1-x}}$ (blue line), the Padé approximant (red line)

$$R_2^2(x) = \frac{1 - 0.75x + 0.0625x^2}{1 - 1.25x + 0.3125x^2}$$

and the Taylor polynomial up to 4th order:

$$P(x) = c_0 + c_1 x + c_2 x^2 + c_3 x^3 + c_4 x^4$$

with a green line.
Fig. 4.12
It is clear that the Padé approximant gives a better fit, particularly closer to the singularity. Increasing the order, the Padé approximation gets closer to the singularity. Fig. 4.13 shows the approximations for m = n = 1, 2, 3 and 4. The poles closest to 1 of these functions are:

$$R_1^1(x):\ 1.333 \qquad R_2^2(x):\ 1.106 \qquad R_3^3(x):\ 1.052 \qquad R_4^4(x):\ 1.031$$

Fig. 4.13
We can see that even $R_1^1(x)$, a ratio of linear functions of x, gives a better approximation than the 4th order Taylor polynomial.
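This comparison is easy to reproduce numerically. A quick sketch (our own code; the evaluation point x = 0.9 is an arbitrary choice close to the singularity):

```python
def f(x):
    # the function being approximated, 1/sqrt(1 - x)
    return 1.0 / (1.0 - x) ** 0.5

def taylor4(x):
    # 4th-order Taylor polynomial: 1 + x/2 + 3x^2/8 + 5x^3/16 + 35x^4/128
    return 1 + x / 2 + 3 * x**2 / 8 + 5 * x**3 / 16 + 35 * x**4 / 128

def pade22(x):
    # Pade approximant R(x) = (1 - 0.75x + 0.0625x^2) / (1 - 1.25x + 0.3125x^2)
    return (1 - 0.75 * x + 0.0625 * x**2) / (1 - 1.25 * x + 0.3125 * x**2)

x = 0.9
print(f(x), taylor4(x), pade22(x))
# near the singularity the Pade value stays much closer to f than the Taylor value
```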
Exercise 4.7
Using the Taylor (McLaurin) expansion of f(x) = cos x, truncated to order 4:

$$t(x) = \sum_{k=0}^{4} c_k x^k = 1 - \frac{x^2}{2} + \frac{x^4}{24}$$

find the Padé approximant:

$$R_2^2(x) = \frac{P_2(x)}{Q_2(x)} = \frac{a_2 x^2 + a_1 x + a_0}{b_2 x^2 + b_1 x + b_0}$$
5. MATRIX COMPUTATIONS
We have seen earlier that a number of issues arise when we consider errors in the
calculations dealing with machine numbers. When matrices are involved, the problems of
accuracy of representation, error propagation and sensitivity of the solutions to small variations
in the data are much more important.
Before discussing any methods of solving matrix equations, we consider first the rather
fundamental matrix property of ‘condition number’.
‘Condition’ of a Matrix
We have seen that multiplying or dividing two floating-point numbers gives an error of the order of the 'last preserved bit'. If, say, two numbers are held to 8 decimal digits, the resulting product (or quotient) will effectively have its least significant bit 'truncated' and therefore have a relative uncertainty of ±10⁻⁸.
By contrast, with matrices and vectors, multiplying (that is, evaluating y = Ax) or
‘dividing’ (that is, solving Ax = y for x) can lose in some cases ALL significant figures!
Before examining this problem we have to define matrix and vector norms.
VECTOR AND MATRIX NORMS
To introduce the idea of ‘length’ into vectors and matrices, we have to consider norms:
VECTOR NORMS
If x^T = (x_1, x_2, …, x_n) is a real or complex vector, a general norm is denoted by ‖x‖_N and is defined by:

$$\|\mathbf{x}\|_N = \left(\sum_{i=1}^{n} |x_i|^N\right)^{1/N}$$ (5.1)

So the usual "Euclidean norm", or "length", is ‖x‖_2 or simply ‖x‖:

$$\|\mathbf{x}\| = \sqrt{x_1^2 + \cdots + x_n^2}$$ (5.2)

Other norms are used, e.g. ‖x‖_1 and ‖x‖_∞, the latter corresponding to the greatest |x_i| in magnitude (show this as an exercise).
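These norms are straightforward to evaluate directly; a small sketch (our own helper, plain Python, with `float('inf')` selecting the maximum-magnitude norm):

```python
def vector_norm(x, N):
    """General N-norm of (5.1); N = float('inf') gives the max-magnitude norm."""
    if N == float('inf'):
        return max(abs(xi) for xi in x)
    return sum(abs(xi) ** N for xi in x) ** (1.0 / N)

x = [3.0, -4.0, 0.0]
print(vector_norm(x, 1))             # 7.0
print(vector_norm(x, 2))             # 5.0  (the Euclidean length)
print(vector_norm(x, float('inf')))  # 4.0  (greatest magnitude)
```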
MATRIX NORM
If A is an n-by-n real or complex matrix, its norm is defined by:

$$\|A\|_N = \max_{\mathbf{x}\neq 0}\frac{\|A\mathbf{x}\|_N}{\|\mathbf{x}\|_N}$$ (5.3)

the maximum being taken over any choice of the vector x ≠ 0. According to our choice of N in defining the vector norms by (5.1), we have the corresponding ‖A‖_1, ‖A‖_2, ‖A‖_∞, the Euclidean N = 2 being the commonest. Note that (ignoring the question "how do we find its value"), for given A and N, ‖A‖ has some specific numerical value ≥ 0.
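The ∞-norm case of (5.3) has a closed form, the maximum absolute row sum; for N = 2 there is no such elementary formula, but sampling the ratio ‖Ax‖/‖x‖ over random vectors gives a lower bound on the maximum. A sketch (our own code; the 2×2 matrix is the one used in the numerical example further below):

```python
import random

def inf_norm(A):
    # induced infinity-norm: maximum absolute row sum
    return max(sum(abs(a) for a in row) for row in A)

def norm2_lower_bound(A, trials=2000, seed=0):
    # sample ||Ax||_2 / ||x||_2 over random x: never exceeds the true 2-norm
    rng = random.Random(seed)
    n = len(A[0])
    best = 0.0
    for _ in range(trials):
        x = [rng.uniform(-1, 1) for _ in range(n)]
        Ax = [sum(a * xi for a, xi in zip(row, x)) for row in A]
        nx = sum(xi * xi for xi in x) ** 0.5
        best = max(best, (sum(v * v for v in Ax) ** 0.5) / nx)
    return best

A = [[100.0, 99.0], [99.0, 98.0]]
print(inf_norm(A))            # 199.0
print(norm2_lower_bound(A))   # a lower bound; the true 2-norm is about 198.0
```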
‘Condition’ of a Linear System Ax = y
This is in the context of Hadamard's general concept of a 'well-posed problem', which is roughly one where the result is not too sensitive to small changes in the problem specification.
Definition: The problem of finding x, satisfying Ax = y, is well posed or well conditioned if:
(i) a unique x satisfies Ax = y, and
(ii) small changes in either A or y result in small changes in x.
For a quantitative measure of “how well conditioned” a problem is, we need to estimate
the amount of variation in x when y varies and/or the variation in x when A changes slightly or
the corresponding changes in y when either (or both) x and A vary.
Suppose A is fixed, but y changes slightly to y + δy, with the associated x changing to x + δx. We have:

$$A\mathbf{x} = \mathbf{y}$$ (5.4)

and so: A(x + δx) = y + δy.
Subtracting gives:

$$A\,\delta\mathbf{x} = \delta\mathbf{y} \quad\text{or}\quad \delta\mathbf{x} = A^{-1}\,\delta\mathbf{y}$$ (5.5)
From our definition (5.3), we must have for any A and z:

$$\frac{\|A\mathbf{z}\|}{\|\mathbf{z}\|} \le \|A\| \quad\text{and so}\quad \|A\mathbf{z}\| \le \|A\|\cdot\|\mathbf{z}\|$$ (5.6)

Taking the norm of both sides of (5.5) and using inequality (5.6) gives:

$$\|\delta\mathbf{x}\| = \|A^{-1}\,\delta\mathbf{y}\| \le \|A^{-1}\|\cdot\|\delta\mathbf{y}\|$$ (5.7)

Taking the norm of (5.4) and using inequality (5.6) gives:

$$\|\mathbf{y}\| = \|A\mathbf{x}\| \le \|A\|\cdot\|\mathbf{x}\|$$ (5.8)

Finally, multiplying corresponding sides of (5.7) and (5.8) and dividing by ‖y‖‖x‖ gives our fundamental result:

$$\frac{\|\delta\mathbf{x}\|}{\|\mathbf{x}\|} \le \|A\|\cdot\|A^{-1}\|\,\frac{\|\delta\mathbf{y}\|}{\|\mathbf{y}\|}$$ (5.9)

For any square matrix A we introduce its condition number and define:

$$\text{cond}(A) = \|A\|\cdot\|A^{-1}\|$$ (5.10)

We note that a 'good' condition number is small, near to 1.
Relevance of the Condition Number
The quantity ‖δy‖/‖y‖ can be interpreted as a measure of relative uncertainty in the vector y. Similarly, ‖δx‖/‖x‖ is the associated relative uncertainty in the vector x.
From equations (5.9) and (5.10), cond(A) gives an upper bound (worst case) factor of degradation of precision between y and x = A⁻¹y. Note that if we rewrite the equations from (5.4) for A⁻¹ instead of A (and reversing x and y), equation (5.9) will be the same but with x and y reversed, and (5.10) would remain unchanged. These two equations therefore give the important result that:
If A denotes the precise transformation y = Ax and δx, δy are related small changes in x and y, the ratio:

$$\frac{\|\delta\mathbf{x}\|/\|\mathbf{x}\|}{\|\delta\mathbf{y}\|/\|\mathbf{y}\|}$$

must lie between 1/cond(A) and cond(A).
Numerical Example
Here is an example using integers for total precision. Suppose:

$$A = \begin{pmatrix} 100 & 99 \\ 99 & 98 \end{pmatrix}$$

We then have (Ax = y):

$$A\begin{pmatrix} 1000 \\ -1000 \end{pmatrix} = \begin{pmatrix} 1000 \\ 1000 \end{pmatrix}$$

Shifting x slightly gives:

$$A\begin{pmatrix} 1001 \\ -999 \end{pmatrix} = \begin{pmatrix} 1199 \\ 1197 \end{pmatrix}$$

Alternatively, shifting y slightly:

$$A\begin{pmatrix} 803 \\ -801 \end{pmatrix} = \begin{pmatrix} 1001 \\ 999 \end{pmatrix}$$

So a small change in y can cause a big change in x or vice versa. We have this clear moral, concerning any matrix multiplication or (effectively) inversion:
For given A, either multiplying (Ax) or 'dividing' (A⁻¹y) can be catastrophic, the degree of catastrophe depending on cond(A) and on the 'direction' of change in x or y.
In the above example, cond(A) is about 4×10⁴.
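The numbers in this example can be verified exactly: det(A) = 100·98 − 99² = −1, so A⁻¹ is itself an integer matrix, and the ∞-norm condition number comes out directly. A sketch (our own code):

```python
# A and its exact inverse (det = -1, so the inverse also has integer entries)
A = [[100, 99], [99, 98]]
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]            # -1
Ainv = [[A[1][1] // det, -A[0][1] // det],
        [-A[1][0] // det, A[0][0] // det]]             # [[-98, 99], [99, -100]]

def matvec(M, v):
    return [sum(m * vi for m, vi in zip(row, v)) for row in M]

def inf_norm(M):
    # induced infinity-norm: maximum absolute row sum
    return max(sum(abs(m) for m in row) for row in M)

print(matvec(A, [1000, -1000]))        # [1000, 1000]
print(matvec(A, [1001, -999]))         # [1199, 1197]
print(matvec(A, [803, -801]))          # [1001, 999]
print(inf_norm(A) * inf_norm(Ainv))    # cond(A) in the infinity-norm: 199 * 199 = 39601
```

The ∞-norm condition number 199² = 39601 confirms the order of magnitude quoted in the notes.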
Matrix Computations
We now consider methods for solving matrix equations. The most common problems encountered are of the form:

$$A\mathbf{x} = \mathbf{y}$$ (5.11)

or

$$A\mathbf{x} = 0, \text{ requiring } \det(A) = 0$$ (5.12)

(we will consider this a special case of (5.11)) or

$$A\mathbf{x} = k^2 B\mathbf{x}$$ (5.13)

where A (and B) are known n×n matrices and x and k² are unknown.
Usually B (and sometimes A) is positive definite (meaning that x^T Bx > 0 for all x ≠ 0). A is sometimes complex (and also B), but numerically the difference is straightforward, and so we will consider only real matrices.
Types of Matrices A and B (Sparsity Patterns)
We can classify the problems (or the matrices) according to their sparsity (that is, the proportion of zero entries). This has a strong bearing on the choice of solution method. The main categories are: dense, where most of the matrix elements are nonzero, and sparse, where a large proportion of them are zero. Among the sparse, we can distinguish several types:
Banded matrices: all nonzero elements are grouped in a band around the main diagonal (fixed and variable band).
Arbitrarily sparse matrices.
Finally, sparse matrices with any pattern, but where the elements are not stored; that is, elements are 'generated' or calculated each time they are needed in the solving algorithm.

    * * * * 0 0
    * * * * * 0
    * * * * * *
    * * * * * *
    0 * * * * *
    0 0 * * * *

Zeros and nonzeros in a band matrix of semi-bandwidth 4
We can also distinguish between different solution methods, basically:
DIRECT, where the solution emerges in a finite number of calculations (if we temporarily ignore round-off error due to finite word-length).
INDIRECT or iterative, where a step-by-step procedure converges towards the correct solution.
Indirect methods can be specially suited to sparse matrices (especially when the order is large) as they can often be implemented without the need to store the entire matrix A (or intermediate forms of matrices) in high-speed memory.
All the common direct routines are available in software libraries and in books and journals, most commonly in Fortran or Fortran 90/95, but also some in C.
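The remark about indirect methods and sparse storage can be made concrete: if A is held as (row, column, value) triples of its nonzeros, the product Ax that an iterative method needs at every step touches only those entries. A sketch (our own minimal coordinate-format representation, not a library API):

```python
def sparse_matvec(triples, x, n):
    """y = A x, where A is given as (row, col, value) triples of its nonzeros."""
    y = [0.0] * n
    for i, j, v in triples:
        y[i] += v * x[j]     # only the stored nonzeros are ever visited
    return y

# the 3x3 matrix [[2, 0, 1], [0, 3, 0], [1, 0, 4]] stored as 5 triples
A = [(0, 0, 2.0), (0, 2, 1.0), (1, 1, 3.0), (2, 0, 1.0), (2, 2, 4.0)]
print(sparse_matvec(A, [1.0, 1.0, 1.0], 3))   # [3.0, 3.0, 5.0]
```

For a matrix with nz nonzeros this costs nz multiplications per product, independent of how large n² is.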
Direct Methods of Solving Ax = y
– Gauss elimination or LU decomposition
The classic method for solving (5.11) is Gauss elimination. For example, given the system:
$$\begin{pmatrix} 1 & 4 & 7 \\ 2 & 5 & 8 \\ 3 & 6 & 11 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}$$ (5.14)

we subtract 2 times the first row from the second row, and then we subtract 3 times the first row from the third row, to give:

$$\begin{pmatrix} 1 & 4 & 7 \\ 0 & -3 & -6 \\ 0 & -6 & -10 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} 1 \\ -1 \\ -2 \end{pmatrix}$$ (5.15)

and then subtracting 2 times the second row from the third row gives:

$$\begin{pmatrix} 1 & 4 & 7 \\ 0 & -3 & -6 \\ 0 & 0 & 2 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} 1 \\ -1 \\ 0 \end{pmatrix}$$ (5.16)

The steps from (5.14) to (5.16) are termed 'triangulation' or 'forward-elimination'. The triangular form of the left-hand matrix of (5.16) is crucial; it allows the next steps.
The third row immediately gives:

$$x_3 = 0$$ (5.17a)

and substitution into row 2 gives:

$$(-3)x_2 + 0 = -1 \quad\text{and so:}\quad x_2 = 1/3$$ (5.17b)

and then into row 1 gives:

$$x_1 + 4(1/3) + 0 = 1 \quad\text{and so:}\quad x_1 = -1/3$$ (5.17c)
The steps through (5.17) are termed 'back-substitution'. We now ignore a complication of 'pivoting', a technique sometimes required to improve numerical stability, which consists of changing the order of rows and columns.
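The triangulation and back-substitution steps can be sketched as follows (our own plain-Python code; exact rationals, and no pivoting, so it assumes a zero never appears on the diagonal, as in this example):

```python
from fractions import Fraction as F

def gauss_solve(A, y):
    """Forward elimination then back-substitution (no pivoting)."""
    n = len(A)
    A = [[F(v) for v in row] for row in A]    # work on exact rational copies
    y = [F(v) for v in y]
    for k in range(n - 1):                    # triangulation
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]             # multiplier for row i
            for j in range(k, n):
                A[i][j] -= m * A[k][j]
            y[i] -= m * y[k]
    det = F(1)
    for k in range(n):                        # determinant = product of diagonal
        det *= A[k][k]
    x = [F(0)] * n
    for i in range(n - 1, -1, -1):            # back-substitution
        s = sum(A[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (y[i] - s) / A[i][i]
    return x, det

x, det = gauss_solve([[1, 4, 7], [2, 5, 8], [3, 6, 11]], [1, 1, 1])
print(x)     # [-1/3, 1/3, 0], matching (5.17)
print(det)   # 1 * (-3) * 2 = -6
```

The determinant comes out for free as the product of the diagonal of the triangulated matrix, as noted in point 2 below.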
Important points about this algorithm:
1. When performed on a dense matrix (or on a sparse matrix but not taking advantage of the zeros), computing time is proportional to n³ (n: order of the matrix). This means that doubling the order of the matrix will increase computation time by up to 8 times!
2. The determinant comes immediately as the product of the diagonal elements of (5.16).
3. Algorithms that take advantage of the special 'band' and 'variable-band' structures are very straightforward, obtained by changing the limits of the loops when performing row or column operations, plus some 'bookkeeping'. For example, in a matrix of 'semi-bandwidth' 4, the first column has nonzero elements only in the first 4 rows, as in the figure. Then only those 4 numbers need storing, and only the 3 elements below the diagonal need 'eliminating' in the first column.
4. Oddly, it turns out that, in our context, one should NEVER form the inverse matrix A⁻¹ in order to solve Ax = y for x. Even if the solution is needed for a number of different right-hand-side vectors y, it is better to 'keep a record' of the triangular form of (5.16) and back-substitute as necessary.
5. Other methods very similar to Gauss elimination are due to Crout and Choleski. The latter is (only) for use with symmetric matrices. Its advantage is that time AND storage are half those of the orthodox Gauss method.
6. There are variations in the implementation of the basic method developed to take advantage of the special types of sparse matrices encountered in some cases, for example when solving problems using finite differences or finite elements. One of these is the frontal method; here, elimination takes place in a carefully controlled manner, with intermediate results being kept in backing store. Another variation consists of only storing the nonzero elements of the matrix, at the expense of a great deal of bookkeeping and reordering (renumbering) of rows and columns through the process, in the search for the best compromise between numerical stability and fill-in.
Iterative Methods of Solving Ax = y
We will outline 3 methods: (1) Gradient methods and its most popular version, the conjugate
gradient method (2a) the Jacobi (or simultaneous displacement) and (2b) the closely related
GaussSeidel (or successive displacement) algorithm.
1.– Gradient Methods
Although known and used for decades, these methods came, in the 1980s, to be adopted as among the most popular iterative algorithms for solving linear systems of equations Ax = y. The fundamental idea of this approach is to introduce an error residual Ax – y for some trial vector x and proceed to minimize the residual with respect to the components of x.
The equation to be solved for x:
A x = y (5.18)
can be recast as: finding x to minimize the ‘errorresidual’, a column vector r defined as a
function of x by:
r = y – A x (5.19)
The general idea of this kind of method is to search for the solution (the minimum of the error residual) in a multidimensional space (of the components of the vector x), starting from a point x₀ and choosing a direction to move in. The optimum distance to move along that direction can then be calculated.
In the steepest descent method, the simplest form of this approach, these directions are chosen as those of the gradient of the error residual at each iteration point. Because of this, consecutive directions will be orthogonal to each other. In 2-D (see figure below) this means that at every step we have to change direction at a right angle to the previous one, but this will not always allow us to reach the minimum, or at least not in an efficient way.
The norm ‖r‖ of this residual vector is an obvious choice for the quantity to minimise; in practice one uses ‖r‖², the square of the norm of the residual, which is non-negative, is only zero when the error is zero, and involves no square roots. However, using (5.19) gives (if A is symmetric: A^T = A):

$$\|\mathbf{r}\|^2 = (\mathbf{y} - A\mathbf{x})^T(\mathbf{y} - A\mathbf{x}) = \mathbf{x}^T A A \mathbf{x} - 2\mathbf{x}^T A \mathbf{y} + \mathbf{y}^T\mathbf{y}$$ (5.20)
which is rather awkward to compute, because of the product AA. Another possible choice of error functional (measure of the error), valid for the case where the matrix A is positive definite, and which is also minimised at the correct solution, is h² = r^T A⁻¹ r (instead of r^T r as in (5.20)). This gives a simpler form:
$$h^2 = \mathbf{r}^T A^{-1}\mathbf{r} = (\mathbf{y} - A\mathbf{x})^T A^{-1}(\mathbf{y} - A\mathbf{x}) = \mathbf{y}^T A^{-1}\mathbf{y} - 2\mathbf{x}^T\mathbf{y} + \mathbf{x}^T A\mathbf{x}$$ (5.21)

Because the first term in (5.21) is independent of the variables and will play no part in the minimisation, we can drop it, and we finally have:

$$h^2 = \mathbf{x}^T A\mathbf{x} - 2\mathbf{x}^T\mathbf{y}$$ (5.22)
The method proceeds by evaluating the error functional at a point x₀, choosing a direction (in this case, the direction of the gradient of the error functional), and finding the minimum value of the functional along that line. That is, if p gives the direction, the line is defined by all the points (x + αp), where α is a scalar parameter. The next step is to find the value of α that minimizes the error. This gives the next point x₁. (Since in this case α is the only variable, it is simple to calculate the gradient of the error functional as a function of α and find the corresponding minimum.)

Fig. 5.1
Several variations appear at this point. It would seem obvious that the best direction to choose is that of the gradient (its negative, or downwards), and that is the choice taken in the conventional "steepest-gradient" or "steepest descent" method. In this case the consecutive directions are all perpendicular to each other, as illustrated in the figure above. However, as mentioned earlier, this leads to poor convergence due to the discrete nature of the steps.
The more efficient and popular "conjugate gradient method" looks instead for a direction which is 'A-orthogonal' (or conjugate) to the previous one (p^T Aq = 0 instead of p^T q = 0).
Exercise 5.1:
Show that the value of α that minimizes the error functional along the line (x + αp), in the two cases mentioned above (using the squared norm of the residual as error functional, or the functional h²), for a symmetric matrix A, is given respectively by:

$$\alpha = \frac{\mathbf{p}^T A(\mathbf{y} - A\mathbf{x})}{\|A\mathbf{p}\|^2} \qquad\text{and}\qquad \alpha = \frac{\mathbf{p}^T(\mathbf{y} - A\mathbf{x})}{\mathbf{p}^T A\mathbf{p}}$$
A useful feature of this method (as can be observed from the expressions above) is that reference to the matrix A is only via simple matrix-vector products; for a given matrix A and vectors y, x_i and p_i, we need only form A times a vector (Ax or Ap_i) and A^T times a vector. These can be formed from a given sparse A without unnecessary multiplications (of nonzeros) or storage. An even greater advantage of the conjugate gradient method is that, in exact arithmetic, it is guaranteed to converge in at most n steps, where n is the order of the matrix.
The Conjugate Gradient Algorithm
The basic steps of the algorithm are as follows:
1) Choose a starting point x₀.
2) Choose a direction to move in. In the first step we choose that of the gradient of h², and this coincides with the direction of the residual r:

$$\nabla h^2 = \nabla(\mathbf{x}^T A\mathbf{x} - 2\mathbf{x}^T\mathbf{y}) = 2(A\mathbf{x}_0 - \mathbf{y}) = -2\mathbf{r}_0$$

so we choose p₀ = r₀.
3) Calculate the distance to move, the parameter α:

$$\alpha = \frac{\mathbf{p}^T(\mathbf{y} - A\mathbf{x})}{\mathbf{p}^T A\mathbf{p}}$$

using p₀ and x₀.
4) Calculate the new point:

$$\mathbf{x}_k = \mathbf{x}_{k-1} + \alpha\,\mathbf{p}_{k-1}$$

5) Calculate the new residual:

$$\mathbf{r}_k = \mathbf{y} - A(\mathbf{x}_{k-1} + \alpha\,\mathbf{p}_{k-1}) = \mathbf{r}_{k-1} - \alpha A\mathbf{p}_{k-1}$$

6) Calculate the next direction. Here is where the methods differ. For the conjugate gradient method the new vector p is not along the gradient of r but instead:

$$\mathbf{p}_k = \mathbf{r}_k + \beta\,\mathbf{p}_{k-1} \qquad\text{where}\qquad \beta = \frac{\mathbf{r}_k^T\mathbf{r}_k}{\mathbf{r}_{k-1}^T\mathbf{r}_{k-1}}$$
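The six steps can be assembled as follows (our own sketch in plain Python, not from the notes; it assumes A is symmetric positive definite, and in exact arithmetic it would terminate in at most n steps):

```python
def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def conjugate_gradient(A, y, x, tol=1e-12):
    r = [yi - axi for yi, axi in zip(y, matvec(A, x))]    # step 2: r0
    p = r[:]                                               # p0 = r0
    rr = dot(r, r)
    for _ in range(len(y)):                                # at most n steps
        Ap = matvec(A, p)
        alpha = rr / dot(p, Ap)                            # step 3 (p^T r = r^T r here)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]      # step 4
        r = [ri - alpha * api for ri, api in zip(r, Ap)]   # step 5
        rr_new = dot(r, r)
        if rr_new < tol:
            break
        beta = rr_new / rr                                 # step 6
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x

A = [[4.0, 1.0], [1.0, 3.0]]   # symmetric positive definite
y = [1.0, 2.0]
x = conjugate_gradient(A, y, [0.0, 0.0])
print(x)   # close to the exact solution [1/11, 7/11]
```

On this 2×2 system the loop reaches the exact solution (up to rounding) in two iterations, illustrating the at-most-n-steps property.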
More robust versions of the algorithm use a preliminary 'preconditioning' of the matrix A, to alleviate the problem that the condition number of the system in (5.20) is the square of the condition number of A, a serious problem if A is not 'safely' positive definite. This leads to the popular 'PCCG' or pre-conditioned conjugate gradient algorithm, as a complete package, and to many more variations in implementation that can be found as commercial packages.
2.– Jacobi and Gauss-Seidel Methods
Two algorithms that are simple to implement are the closely related Jacobi (simultaneous displacement) and Gauss-Seidel (successive displacement or 'relaxation') methods.
a) Jacobi Method or Simultaneous Displacement
Suppose the set of equations for solution is:

$$\begin{aligned}
a_{11}x_1 + a_{12}x_2 + a_{13}x_3 &= y_1\\
a_{21}x_1 + a_{22}x_2 + a_{23}x_3 &= y_2\\
a_{31}x_1 + a_{32}x_2 + a_{33}x_3 &= y_3
\end{aligned}$$ (5.23)

This can be reorganised to:

$$x_1 = (y_1 - a_{12}x_2 - a_{13}x_3)/a_{11}$$ (5.24a)
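Iterating this reorganisation, with every component of the new x computed from the old x (hence 'simultaneous displacement'), gives the Jacobi scheme. A minimal sketch (our own code; the example matrix is diagonally dominant, a standard sufficient condition for this iteration to converge):

```python
def jacobi(A, y, x, iters=100):
    """Jacobi iteration: x_i <- (y_i - sum_{j != i} a_ij x_j) / a_ii."""
    n = len(y)
    for _ in range(iters):
        # every component uses the OLD x: simultaneous displacement
        x = [(y[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
             for i in range(n)]
    return x

A = [[4.0, 1.0, 1.0],
     [1.0, 5.0, 2.0],
     [1.0, 2.0, 6.0]]          # diagonally dominant, so Jacobi converges
y = [6.0, 8.0, 9.0]
x = jacobi(A, y, [0.0, 0.0, 0.0])
print(x)   # converges to the solution [1.0, 1.0, 1.0]
```

Gauss-Seidel differs only in using each newly computed component immediately within the same sweep, which usually speeds up convergence.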
E763 (part 2) Numerical Methods page 48
h² = rᵀA⁻¹r = (y − Ax)ᵀA⁻¹(y − Ax) = yᵀA⁻¹y − 2xᵀy + xᵀAx (5.21)
or, because the first term in (5.21) is independent of the variables and will play no part in the minimisation, we can drop it and we finally have:
h² = xᵀAx − 2xᵀy (5.22)
The method proceeds by evaluating the error functional at a point x₀, choosing a direction, in this case, the direction of the gradient of the error functional, and finding the minimum value of the functional along that line. That is, if p gives the direction, the line is defined by all the points (x + αp), where α is a scalar parameter. The next step is to find the value of α that minimizes the error. This gives the next point x₁. (Since in this case α is the only variable, it is simple to calculate the gradient of the error functional as a function of α and find the corresponding minimum.)
[Fig. 5.1: iteration path in the (u, v) plane showing the successive points x₀, x₁, …]
Several variations appear at this point. It would seem obvious that the best direction to choose is that of the gradient (its negative, or downwards) and that is the choice taken in the conventional "steepest-gradient" or "steepest descent" method. In this case the consecutive directions are all perpendicular to each other, as illustrated in the figure above. However, as mentioned earlier, this leads to poor convergence due to the discrete nature of the steps.
The more efficient and popular "Conjugate Gradient Method" instead looks for a direction which is 'A-orthogonal' (or conjugate) to the previous one (pᵀAq = 0 instead of pᵀq = 0).
Exercise 5.1:
Show that the value of α that minimizes the error functional along the line (x + αp) in the two cases mentioned above (using the squared norm of the residual as error functional, or the functional h²), for a symmetric matrix A, is given respectively by:
α = pᵀA(y − Ax)/‖Ap‖²  and  α = pᵀ(y − Ax)/(pᵀAp)
A useful feature of this method (as can be observed from the expressions above) is that reference to the matrix A is only via simple matrix products; for given values of A, y, xᵢ and pᵢ, we need only form A times a vector (Ax or Apᵢ) and Aᵀ times a vector. These can be formed from a given sparse A without unnecessary multiplication (of nonzeros) or storage. An even greater advantage of the conjugate gradient method is that it is guaranteed to converge in at most n steps, where n is the order of the matrix.
The Conjugate Gradient Algorithm
The basic steps of the algorithm are as follows:
1) Choose a starting point x₀.
2) Choose a direction to move. In the first step we choose that of the gradient of h², which coincides with the direction of the residual r:
∇h² = ∇(xᵀAx − 2xᵀy) = 2(Ax₀ − y) = −2r₀
so we choose: p₀ = r₀.
3) Calculate the distance to move – the parameter α:
α = pᵀ(y − Ax)/(pᵀAp), using p₀ and x₀.
4) Calculate the new point:
xₖ = xₖ₋₁ + αpₖ₋₁
5) Calculate the new residual:
rₖ = y − A(xₖ₋₁ + αpₖ₋₁) = (y − Axₖ₋₁) − αApₖ₋₁ = rₖ₋₁ − αApₖ₋₁
6) Calculate the next direction. Here is where the methods differ. For the conjugate gradient method the new vector p is not along the gradient of r but instead:
pₖ = rₖ + βpₖ₋₁, where β = (rₖᵀrₖ)/(rₖ₋₁ᵀrₖ₋₁)
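The steps above can be sketched as a short program. The following is a minimal pure-Python illustration for a small symmetric positive-definite system; the 3×3 matrix and right-hand side are illustrative choices, not taken from the notes:

```python
# Conjugate gradient sketch for A x = y, with A symmetric positive-definite.
# The matrix A and right-hand side y below are illustrative assumptions.

def matvec(A, v):
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def conjugate_gradient(A, y, x0, tol=1e-12, max_iter=50):
    x = list(x0)
    r = [yi - ai for yi, ai in zip(y, matvec(A, x))]      # r0 = y - A x0
    p = list(r)                                           # p0 = r0
    rr = dot(r, r)
    for _ in range(max_iter):
        if rr < tol:                                      # residual small enough
            break
        Ap = matvec(A, p)
        alpha = rr / dot(p, Ap)                           # step length along p
        x = [xi + alpha * pi for xi, pi in zip(x, p)]     # new point
        r = [ri - alpha * api for ri, api in zip(r, Ap)]  # new residual
        rr_new = dot(r, r)
        beta = rr_new / rr                                # conjugacy coefficient
        p = [ri + beta * pi for ri, pi in zip(r, p)]      # next direction
        rr = rr_new
    return x

A = [[4.0, -1.0, 1.0],
     [-1.0, 6.0, 2.0],
     [1.0, 2.0, 5.0]]
y = [4.0, 7.0, 8.0]          # chosen so the exact solution is (1, 1, 1)
x = conjugate_gradient(A, y, [0.0, 0.0, 0.0])
```

Since the matrix is of order 3, the iteration reaches the exact solution in at most 3 steps, as stated above.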
More robust versions of the algorithm use a preliminary 'preconditioning' of the matrix A, to alleviate the problem that the condition number of (5.20) is the square of the condition number of A – a serious problem if A is not 'safely' positive-definite. This leads to the popular 'PCCG' or Pre-Conditioned Conjugate Gradient algorithm, available as a complete package, and to many more variations in implementation that can be found in commercial packages.
2.– Jacobi and Gauss-Seidel Methods
Two algorithms that are simple to implement are the closely related Jacobi (simultaneous displacement) and Gauss-Seidel (successive displacement or 'relaxation') methods.
a) Jacobi Method or Simultaneous Displacement
Suppose the set of equations for solution are:
a₁₁x₁ + a₁₂x₂ + a₁₃x₃ = y₁
a₂₁x₁ + a₂₂x₂ + a₂₃x₃ = y₂ (5.23)
a₃₁x₁ + a₃₂x₂ + a₃₃x₃ = y₃
This can be reorganised to:
x₁ = (y₁ – a₁₂x₂ – a₁₃x₃)/a₁₁ (5.24a)
x₂ = (y₂ – a₂₃x₃ – a₂₁x₁)/a₂₂ (5.24b)
x₃ = (y₃ – a₃₁x₁ – a₃₂x₂)/a₃₃ (5.24c)
Suppose we had the vector x⁽⁰⁾ = [x₁, x₂, x₃]⁽⁰⁾, and substituted it into the right-hand side of (5.24), to yield on the left-hand side the new vector x⁽¹⁾ = [x₁, x₂, x₃]⁽¹⁾. Successive substitutions will give the sequence of vectors:
x⁽⁰⁾, x⁽¹⁾, x⁽²⁾, x⁽³⁾, . . .
Because (5.24) is merely a rearrangement of the equations for solution, the 'correct' solution substituted into (5.24) must be self-consistent, i.e. yield itself! The sequence will:
either converge to the correct solution,
or diverge.
b) Gauss-Seidel Method or Successive Displacement
Note that when eq. (5.24a) is applied, a 'new' value of x₁ becomes available, which could be used instead of the 'previous' x₁ value when solving (5.24b), and similarly for x₁ and x₂ when applying (5.24c). This is the Gauss-Seidel or successive displacement iterative scheme, illustrated here with an example, showing that the computer program (in FORTRAN 90) is barely more complicated than writing down the equations.
! Example of successive displacement
x1 = 0.
x2 = 0.
x3 = 0.                        ! Equations being solved are:
do i = 1, 10
  x1 = (4. + x2 - x3)/4.       !  4*x1 -   x2 +   x3 = 4
  x2 = (9. - 2.*x3 - x1)/6.    !    x1 + 6*x2 + 2*x3 = 9
  x3 = (2. + x1 + 2.*x2)/5.    !   -x1 - 2*x2 + 5*x3 = 2
  print *, x1, x2, x3
enddo
stop
end
1 1.333333 1.133334
1.05 0.9472222 0.9888889
0.9895833 1.00544 1.000093
1.001337 0.9997463 1.000166
0.9998951 0.9999621 0.9999639
0.9999996 1.000012 1.000005
1.000002 0.9999981 0.9999996
0.9999996 1 1
1 1 1
1 1 1
Whether the algorithm converges or not depends on the matrix A and (surprisingly) not on the right-hand-side vector y of (5.23). Convergence does not even depend on the starting value of the vector, which only affects the number of iterations needed.
We will skip over any formal proof of convergence. But to give the sharp criteria for convergence, first one splits A as:
A = L + D + U
where L, D and U are the lower triangular, diagonal and upper triangular parts of A. Then the schemes converge if and only if all the eigenvalues of the matrix:
D⁻¹(U + L)   {for simultaneous displacement}
or
(D + L)⁻¹U   {for successive displacement}
lie within the unit circle.
For applications, it is simpler to use some sufficient (but not necessary!) conditions, when possible, such as:
(a) If A is symmetric and positive-definite, then successive displacement converges.
(b) If, in addition, aᵢⱼ < 0 for all i ≠ j, then simultaneous displacement also converges.
Condition (a) is a commonly satisfied condition. Usually, the successive method converges in about half the computer time of the simultaneous one but, strictly, there are matrices where one method converges and the other does not, and vice versa.
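The behaviour described here can be checked numerically. The sketch below (a pure-Python illustration, using the same example system as the FORTRAN program above) counts the iterations each scheme needs to reach a given tolerance:

```python
# Jacobi (simultaneous) vs Gauss-Seidel (successive) displacement on the
# example system from the notes:
#    4*x1 -   x2 +   x3 = 4
#      x1 + 6*x2 + 2*x3 = 9
#     -x1 - 2*x2 + 5*x3 = 2
# Exact solution: x1 = x2 = x3 = 1.

def jacobi_step(x1, x2, x3):
    # All updates use the 'previous' values (simultaneous displacement).
    return ((4.0 + x2 - x3) / 4.0,
            (9.0 - x1 - 2.0 * x3) / 6.0,
            (2.0 + x1 + 2.0 * x2) / 5.0)

def gauss_seidel_step(x1, x2, x3):
    # Each 'new' value is used immediately (successive displacement).
    x1 = (4.0 + x2 - x3) / 4.0
    x2 = (9.0 - x1 - 2.0 * x3) / 6.0
    x3 = (2.0 + x1 + 2.0 * x2) / 5.0
    return x1, x2, x3

def iterate(step, tol=1e-6, max_iter=1000):
    x = (0.0, 0.0, 0.0)
    for n in range(1, max_iter + 1):
        x_new = step(*x)
        if max(abs(a - b) for a, b in zip(x, x_new)) < tol:
            return x_new, n
        x = x_new
    return x, max_iter

x_j, n_j = iterate(jacobi_step)
x_gs, n_gs = iterate(gauss_seidel_step)
```

Both schemes converge here (the matrix is diagonally dominant), and the successive scheme needs noticeably fewer iterations, consistent with the remark above.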
Matrix Eigenvalue Problems
The second class of matrix problem that occurs frequently in the numerical solution of differential equations, as well as in many other areas, is the matrix eigenvalue problem. On many occasions the matrices will be large and very sparse; in others they will be dense, and normally the solution methods will have to take these characteristics into account in order to achieve a solution efficiently. There are a number of different methods to solve matrix eigenvalue problems involving dense or sparse matrices, each with different characteristics and better adapted to different types of matrices and requirements.
Choice of method
The choice of method will depend on the characteristics of the problem (the type of the matrices) and on the solution requirements. For example, most methods suitable for dense matrices calculate all eigenvectors and eigenvalues of the system. However, in many problems arising from the numerical solution of PDEs one is only interested in one or just a few eigenvalues and/or eigenvectors. Also, in many cases the matrices will be large and sparse.
In what follows we will concentrate on methods which are suitable for sparse matrices (of course the same methods can be applied to dense matrices).
The problem to solve can have two different forms:
Ax = λx (standard eigenvalue problem) (5.25)
Ax = λBx (generalized eigenvalue problem) (5.26)
Sometimes the generalized eigenvalue problem can be converted into the form (5.25) simply by premultiplying by the inverse of B:
B⁻¹Ax = λx
however, if A and B are symmetric, the new matrix on the left-hand side (B⁻¹A) will have lost this property. Instead, it is preferable to decompose (factorise) the matrix B in the form B = LLᵀ (Cholesky factorisation – possible if B is positive definite). Substituting in (5.26) and premultiplying by L⁻¹ gives:
L⁻¹Ax = λLᵀx
and since L⁻ᵀLᵀ = I, the identity matrix (where L⁻ᵀ denotes (Lᵀ)⁻¹),
(L⁻¹AL⁻ᵀ)(Lᵀx) = λ(Lᵀx); putting y = Lᵀx and Ã = L⁻¹AL⁻ᵀ gives:
Ãy = λy
The matrix Ã is symmetric if A and B are symmetric, and the eigenvalues are not modified. The eigenvectors x can be obtained from y simply by back-substitution. However, if the matrices A and B are sparse, this method is not convenient because Ã will generally be dense. In the case of sparse matrices, it is more convenient to solve the generalized problem directly.
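The reduction above can be sketched in pure Python for a small pair of matrices; the 2×2 matrices A and B are illustrative assumptions, and the Cholesky and triangular-solve helpers are written out so the example is self-contained:

```python
# Reduce A x = lambda B x to standard form via B = L L^T (Cholesky).
# The 2x2 matrices are illustrative assumptions, not taken from the notes.

def cholesky(B):
    # Lower-triangular L with B = L L^T (requires B positive definite).
    n = len(B)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = (B[i][i] - s) ** 0.5
            else:
                L[i][j] = (B[i][j] - s) / L[j][j]
    return L

def forward_solve(L, b):
    # Solve L z = b for lower-triangular L.
    n = len(b)
    z = [0.0] * n
    for i in range(n):
        z[i] = (b[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i]
    return z

def lower_solve_matrix(L, M):
    # X = L^{-1} M, solving L x = m for each column m of M.
    n, cols = len(M), len(M[0])
    X = [[0.0] * cols for _ in range(n)]
    for j in range(cols):
        col = forward_solve(L, [M[i][j] for i in range(n)])
        for i in range(n):
            X[i][j] = col[i]
    return X

def transpose(M):
    return [list(row) for row in zip(*M)]

A = [[2.0, 1.0], [1.0, 3.0]]   # symmetric A (illustrative)
B = [[4.0, 1.0], [1.0, 2.0]]   # symmetric positive-definite B (illustrative)
L = cholesky(B)
M = lower_solve_matrix(L, A)                               # L^{-1} A
A_tilde = transpose(lower_solve_matrix(L, transpose(M)))   # (L^{-1} M^T)^T = M L^{-T}
```

As stated above, Ã comes out symmetric (so a standard symmetric eigensolver can be applied to it), and the eigenvectors of the original problem are recovered by back-substitution from y = Lᵀx.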
Solution Methods
We can again classify solution methods as:
1. Direct (or transformation) methods:
i) Jacobi rotations – converts the matrix of (5.25) to diagonal form
ii) QR (or QZ for complex) – converts the matrices to triangular form
iii) Conversion to tridiagonal – normally to be followed by either i) or ii)
All these methods give all the eigenvalues. We will not examine any of these in detail.
2. Vector iteration methods:
These are better suited for sparse matrices and also for the case when only a few eigenvalues and eigenvectors are needed.
The Power Method or Direct Iteration
For the standard problem (5.25) the algorithm consists of the repeated multiplication of a starting vector by the matrix A. This can be seen to produce a reinforcement of the component of the trial vector along the direction of the eigenvector whose eigenvalue is largest in absolute value, causing the trial vector to converge gradually to that eigenvector. The algorithm can be described schematically by:
1. Choose a starting vector.
2. Premultiply by A.
3. Normalize.
4. Check convergence: if not converged, return to step 2; if converged, stop.
The normalization step is necessary because otherwise the iteration vector can grow
indefinitely in length over the iterations.
How does this algorithm work?
If φᵢ, i = 1, …, N are the eigenvectors of A, we can write any vector of N components as a superposition of them (they constitute a basis in the space of N dimensions). In particular, for the starting vector:
x₀ = Σᵢ₌₁ᴺ αᵢφᵢ (5.27)
When we multiply by A we get:
x̃₁ = Ax₀
or:
x̃₁ = A Σᵢ₌₁ᴺ αᵢφᵢ = Σᵢ₌₁ᴺ αᵢAφᵢ = Σᵢ₌₁ᴺ αᵢλᵢφᵢ (5.28)
If λ₁ is the eigenvalue of largest absolute value, we can also write this as:
x̃₁ = λ₁ Σᵢ₌₁ᴺ αᵢ(λᵢ/λ₁)φᵢ
This is then normalized by:
x₁ = x̃₁/‖x̃₁‖
Then, after n iterations of multiplication by A and normalization we will get:
xₙ = C Σᵢ₌₁ᴺ αᵢ(λᵢ/λ₁)ⁿφᵢ (5.29)
where C is a normalization constant.
From this expression (5.29) we can see that, since |λ₁| > |λᵢ| for all i ≠ 1, the coefficients of all the φᵢ for i ≠ 1 will tend to zero as n increases, and the vector xₙ will gradually converge to φ₁.
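A minimal sketch of the power method follows (pure Python; the 3×3 tridiagonal test matrix, a smaller version of the matrix used in the exercises below, is an illustrative choice). The eigenvalue estimate uses the quotient xᵀAx/xᵀx:

```python
# Power method sketch: repeated premultiplication by A with normalization.
# The 3x3 tridiagonal matrix (eigenvalues -4, -4 +/- sqrt(2)) is an
# illustrative assumption; its dominant eigenvalue is -(4 + sqrt(2)).

def matvec(A, v):
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def power_method(A, x0, n_iter=200):
    x = list(x0)
    for _ in range(n_iter):
        y = matvec(A, x)                 # premultiply by A
        s = max(abs(v) for v in y)       # normalization constant (infinity norm)
        x = [v / s for v in y]           # normalize
    Ax = matvec(A, x)
    # eigenvalue estimate from the quotient x^T A x / x^T x
    lam = sum(u * v for u, v in zip(x, Ax)) / sum(v * v for v in x)
    return lam, x

A = [[-4.0, 1.0, 0.0],
     [1.0, -4.0, 1.0],
     [0.0, 1.0, -4.0]]
lam, x = power_method(A, [1.0, 0.0, 0.0])
```

Note that for a negative dominant eigenvalue the normalized vector flips sign on each iteration, but the quotient estimate of λ₁ is unaffected.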
This method (the power method) finds the dominant eigenvector, that is, the eigenvector that corresponds to the largest eigenvalue (in absolute value). To find the eigenvalue, we can see that premultiplying (5.25) by the transpose of φᵢ gives:
φᵢᵀAφᵢ = λᵢφᵢᵀφᵢ
where φᵢᵀAφᵢ and φᵢᵀφᵢ are scalars, so we can write:
λᵢ = (φᵢᵀAφᵢ)/(φᵢᵀφᵢ) (5.30)
This is known as the Rayleigh quotient and can be used to obtain the eigenvalue from the eigenvector. This expression has interesting properties: if we only know an estimate of the eigenvector, (5.30) will give an estimate of the eigenvalue with a higher order of accuracy than that of the eigenvector itself.
Exercise 5.2
Write a short program to calculate the dominant eigenvector and the corresponding eigenvalue of a matrix like that in (7.14) but of order 7, using the power method. Terminate the iterations when the relative difference between two successive estimates of the eigenvalue is within a tolerance of 10⁻⁶.
Exercise 5.3 (advanced)
Using the same reasoning used to explain the power method for the standard eigenvalue problem ((5.27)–(5.29)), develop the corresponding algorithm for the generalized eigenvalue problem Ax = λBx.
The power method in the form presented above can only be used to find the dominant eigenvector. However, modified forms of the basic idea can be developed to find any eigenvector of the matrix. One of these modifications is inverse iteration.
Inverse Iteration
For the system Ax = λx we can write:
A⁻¹x = (1/λ)x
from which we can see that the eigenvalues of A⁻¹ are 1/λ, the reciprocals of the eigenvalues of A, while the eigenvectors are the same. In this form, if we are interested in finding the eigenvector of A corresponding to the smallest eigenvalue in absolute value (closest to zero), we can notice that for that eigenvalue λ its reciprocal is the largest, and so it can be found by using the power method on the matrix A⁻¹.
The procedure then is as follows:
1. Choose a starting vector x₀.
2. Find x̃₁ = A⁻¹x₀, or instead solve Ax̃₁ = x₀ (avoiding the explicit calculation of A⁻¹).
3. Normalize: x₁ = Cx̃₁ (C is a normalization factor chosen so that ‖x₁‖ = 1).
4. Repeat from step 2.
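The procedure can be sketched as follows (pure Python; the small dense Gaussian-elimination solver stands in for the Appendix routines mentioned in the exercise below, and the 3×3 test matrix is an illustrative choice):

```python
# Inverse iteration sketch: apply the power method to A^{-1} by solving
# A y = x at each step instead of forming A^{-1} explicitly.
# The 3x3 tridiagonal matrix (eigenvalues -4, -4 +/- sqrt(2)) is an
# illustrative assumption; its smallest-magnitude eigenvalue is -4 + sqrt(2).

def solve(A, b):
    # Gaussian elimination with partial pivoting on an augmented copy.
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            f = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= f * M[k][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def inverse_iteration(A, x0, n_iter=100):
    x = list(x0)
    for _ in range(n_iter):
        y = solve(A, x)                  # solve A y = x (step 2)
        s = max(abs(v) for v in y)
        x = [v / s for v in y]           # normalize (step 3)
    Ax = [sum(a * v for a, v in zip(row, x)) for row in A]
    lam = sum(u * v for u, v in zip(x, Ax)) / sum(v * v for v in x)
    return lam, x

A = [[-4.0, 1.0, 0.0],
     [1.0, -4.0, 1.0],
     [0.0, 1.0, -4.0]]
lam, x = inverse_iteration(A, [1.0, 1.0, 1.0])
```

The eigenvalue estimate uses the Rayleigh quotient (5.30) of the converged vector.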
Exercise 5.4
Write a program to calculate, by inverse iteration, the smallest eigenvalue of the tridiagonal matrix A of order 7 whose elements are –4 on the main diagonal and 1 on the subdiagonals. You can use the algorithms given in section 1 of the Appendix for the solution of the linear system of equations.
An important extension of the inverse iteration method allows finding any eigenvalue of the spectrum (the spectrum of a matrix is its set of eigenvalues), not just the one smallest in absolute value.
Shifted Inverse Iteration
Suppose that the system Ax = λx has the set of eigenvalues {λᵢ}. If we construct the matrix Ã as:
Ã = A − σI
where I is the identity matrix and σ is a real number, we have:
Ãx = Ax − σx = λx − σx = (λ − σ)x (5.31)
and we can see that the matrix Ã has the same eigenvectors as A and its eigenvalues are {λᵢ − σ}, that is, the same eigenvalues as A but shifted by σ. Then, if we apply the inverse iteration method to the matrix Ã, the procedure will yield the eigenvalue (λᵢ − σ) closest to zero; that is, we can find the eigenvalue λᵢ of A closest to the real number σ.
Rayleigh Iteration
Another extension of this method is known as Rayleigh iteration. In this case, the Rayleigh quotient is used to calculate an estimate of the eigenvalue at each iteration, and the shift is updated with this value.
The convergence rate of the power method depends on the relative value of the coefficients of each eigenvector in the expansion of the trial vector during the iterations (as in (5.27)), and these are governed by the ratios λᵢ/λ₁; convergence is fastest when these ratios are smallest in magnitude, as we can see from (5.29). The same reasoning applied to the shifted inverse iteration method leads to the conclusion that the convergence rate will be fastest when the shift is chosen as close as possible to the target eigenvalue. In this form, Rayleigh iteration converges faster than ordinary shifted inverse iteration.
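Both variants can be sketched together (pure Python; the 3×3 test matrix, the starting vector and the shift are illustrative choices). One caveat worth noting: Rayleigh iteration converges to whichever eigenpair the iteration locks onto, which is not always the eigenvalue nearest the initial shift:

```python
# Shifted inverse iteration with (A - sigma I); setting rayleigh=True
# updates the shift with the Rayleigh quotient at every step.
# The 3x3 matrix (eigenvalues -4, -4 +/- sqrt(2)) is an illustrative choice.

def solve(A, b):
    # Gaussian elimination with partial pivoting on an augmented copy.
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            f = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= f * M[k][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def shifted_inverse(A, sigma, x0, rayleigh=False, n_iter=100):
    n = len(A)
    x = list(x0)
    lam = sigma
    for _ in range(n_iter):
        shift = lam if rayleigh else sigma
        S = [[A[i][j] - (shift if i == j else 0.0) for j in range(n)]
             for i in range(n)]
        try:
            y = solve(S, x)              # inverse iteration step with A - shift*I
        except ZeroDivisionError:
            break                        # shift hit an eigenvalue; x has converged
        s = max(abs(v) for v in y)
        x = [v / s for v in y]
        Ax = [sum(a * v for a, v in zip(row, x)) for row in A]
        new_lam = sum(u * v for u, v in zip(x, Ax)) / sum(v * v for v in x)
        if abs(new_lam - lam) < 1e-12:   # successive estimates agree: stop
            lam = new_lam
            break
        lam = new_lam
    return lam

A = [[-4.0, 1.0, 0.0],
     [1.0, -4.0, 1.0],
     [0.0, 1.0, -4.0]]
x0 = [1.0, 2.0, 3.0]   # must not be orthogonal to the target eigenvector
lam_shifted = shifted_inverse(A, -3.5, x0)                  # fixed shift
lam_rayleigh = shifted_inverse(A, -3.5, x0, rayleigh=True)  # updated shift
```

With the fixed shift σ = −3.5, the iteration returns −4, the eigenvalue of A closest to σ, as stated above; the Rayleigh variant reaches one of the eigenvalues in far fewer iterations.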
Exercise 5.5
Write a program using shifted inverse iteration to find the eigenvalue of the matrix A of the previous exercise which lies closest to 3.5.
Then write a modified version of this program using a Rayleigh-quotient update of the shift in every iteration (Rayleigh iteration). Compare the convergence of the two procedures (by the number of iterations needed to achieve the same tolerance for the relative difference between successive estimates of the eigenvalue). Use a tolerance of 10⁻⁶ in both programs.
6. Numerical Differentiation and Integration
Numerical Differentiation
For a function of one variable f(x), the derivative at a point x = a is defined as:
f′(a) = lim(h→0) [f(a + h) − f(a)]/h (6.1)
This suggests that, choosing a small value for h, the derivative can be reasonably approximated by the forward difference:
f′(a) ≈ [f(a + h) − f(a)]/h (6.2)
An equally valid approximation is the backward difference:
f′(a) ≈ [f(a) − f(a − h)]/h (6.3)
Graphically, we can see the meaning of each of these expressions in Fig. 6.1. The derivative at x_c is the slope of the tangent to the curve at the point C. The slopes of the chords between the points A and C, and B and C, are the values of the backward and forward difference approximations to the derivative, respectively. We can see that a better approximation to the derivative is obtained from the slope of the chord between points A and B, labelled "central difference" in Fig. 6.1.
[Fig. 6.1: a curve through the points A = (x_a, y_a), C = (x_c, y_c) and B = (x_b, y_b), with chords illustrating the backward, forward and central difference approximations]
We can understand this better by analysing the error in each approximation with the use of Taylor expansions. Considering the expansions for the points a + h and a − h:
f(a + h) = f(a) + f′(a)h + f⁽²⁾(a)h²/2! + f⁽³⁾(a)h³/3! + ⋯ (6.4)
f(a − h) = f(a) − f′(a)h + f⁽²⁾(a)h²/2! − f⁽³⁾(a)h³/3! + ⋯ (6.5)
where in both cases the remainder (error) can be represented by a term of the form f⁽⁴⁾(ξ)h⁴/4! (see Appendix), so we could write instead
f(a + h) = f(a) + f′(a)h + f⁽²⁾(a)h²/2! + f⁽³⁾(a)h³/3! + O(h⁴)
where the symbol O(h⁴) means: "a term of the order of h⁴".
Truncating (6.4) to first order we can then see that f(a + h) = f(a) + f′(a)h + O(h²), so for the forward difference formula we have:
f′(a) = [f(a + h) − f(a)]/h + O(h) (6.6)
and we can see that the error of this approximation is of the order of h. A similar result is obtained for the backward difference.
We can also see that, subtracting (6.5) from (6.4) and discarding terms of order h³ and higher, we can obtain a better approximation:
f(a + h) − f(a − h) = 2f′(a)h + O(h³)
from which we obtain the central difference formula:
f′(a) = [f(a + h) − f(a − h)]/(2h) + O(h²) (6.7)
which has an error of the order of h².
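These error orders are easy to verify numerically. In the sketch below (an illustrative choice of f(x) = eˣ at a = 1, not taken from the notes), halving h roughly halves the forward-difference error but divides the central-difference error by about four:

```python
# Forward (6.2) vs central (6.7) difference for f(x) = exp(x) at a = 1.
# Halving h: an O(h) error halves, an O(h^2) error is divided by ~4.
from math import exp

def forward_diff(f, a, h):
    return (f(a + h) - f(a)) / h

def central_diff(f, a, h):
    return (f(a + h) - f(a - h)) / (2.0 * h)

a = 1.0
exact = exp(a)                       # derivative of exp is exp itself
steps = (1e-2, 5e-3)                 # h and h/2
errs_f = [abs(forward_diff(exp, a, h) - exact) for h in steps]
errs_c = [abs(central_diff(exp, a, h) - exact) for h in steps]
ratio_f = errs_f[0] / errs_f[1]      # ~2 for an O(h) formula
ratio_c = errs_c[0] / errs_c[1]      # ~4 for an O(h^2) formula
```

Note that h cannot be made arbitrarily small in floating-point arithmetic: below some threshold the rounding error in f(a + h) − f(a) dominates the truncation error.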
Expressions (6.2) and (6.3) are "two-point" formulae for the first derivative. Many more expressions, with different degrees of accuracy, can be constructed by a similar procedure using more points. For example, the Taylor expansions for the function at a number of points can be combined to eliminate all the terms except the desired derivative.
Example:
Considering the 3 points x₀, x₁ = x₀ + h and x₂ = x₀ + 2h, and taking the Taylor expansions at x₁ and x₂, we can construct a three-point forward difference formula:
f(x₁) = f(x₀) + f′(x₀)h + f⁽²⁾(x₀)h²/2! + O(h³) (6.8)
f(x₂) = f(x₀) + 2f′(x₀)h + 4f⁽²⁾(x₀)h²/2! + O(h³) (6.9)
Multiplying (6.8) by 4 and subtracting (6.9) to eliminate the second derivative term:
4f(x₁) − f(x₂) = 3f(x₀) + 2f′(x₀)h + O(h³)
from which we can extract the first derivative:
f′(x₀) = [−3f(x₀) + 4f(x₁) − f(x₂)]/(2h) + O(h²) (6.10)
Exercise 6.1
Considering the 3 points x
0
, x
1
= x
0
− h and x
2
= x
0
− 2h and the Taylor expansions at x
1
and x
2
find a threepoint backward difference formula. What is the order of the error?
Exercise 6.2
E763 (part 2) Numerical Methods page 56
6. Numerical Differentiation and Integration
Numerical Differentiation
For a function of one variable f(x), the derivative at a point x = a is defined as:
f'(a) = \lim_{h \to 0} \frac{f(a+h) - f(a)}{h}    (6.1)
This suggests that, by choosing a small value of h, the derivative can be reasonably approximated by the forward difference:
f'(a) \approx \frac{f(a+h) - f(a)}{h}    (6.2)
An equally valid approximation can be the backward difference:
f'(a) \approx \frac{f(a) - f(a-h)}{h}    (6.3)
Graphically, we can see the meaning of each of these expressions in Fig. 6.1.
The derivative at x_c is the slope of the tangent to the curve at the point C. The slopes of the chords between the points A and C, and B and C, are the values of the backward and forward difference approximations to the derivative, respectively. We can see that a better approximation to the derivative is obtained by the slope of the chord between points A and B, labelled "central difference" in Fig. 6.1.
Fig. 6.1: The backward, forward and central difference approximations to the derivative at x_c, shown as the slopes of the chords through the points A, B and C.
We can understand this better by analysing the error in each approximation using Taylor expansions.
Considering the expansions for the points a+h and a−h:
f(a+h) = f(a) + h f'(a) + \frac{h^2}{2!} f^{(2)}(a) + \frac{h^3}{3!} f^{(3)}(a) + \cdots    (6.4)

f(a-h) = f(a) - h f'(a) + \frac{h^2}{2!} f^{(2)}(a) - \frac{h^3}{3!} f^{(3)}(a) + \cdots    (6.5)

where in both cases the remainder (error) can be represented by a term of the form \frac{f^{(4)}(\xi)}{4!} h^4 (see Appendix), so we could write instead
f(a+h) = f(a) + h f'(a) + \frac{h^2}{2!} f^{(2)}(a) + \frac{h^3}{3!} f^{(3)}(a) + O(h^4)

where the symbol O(h^4) means: "a term of the order of h^4".
Truncating (6.4) to first order we can then see that f(a+h) = f(a) + h f'(a) + O(h^2), so for the forward difference formula we have:

f'(a) = \frac{f(a+h) - f(a)}{h} + O(h)    (6.6)
and we can see that the error of this approximation is of the order of h. A similar result is
obtained for the backward difference.
We can also see that, subtracting (6.5) from (6.4) and discarding terms of order h^3 and higher, we can obtain a better approximation:

f(a+h) - f(a-h) = 2h f'(a) + O(h^3)

from which we obtain the central difference formula:

f'(a) = \frac{f(a+h) - f(a-h)}{2h} + O(h^2)    (6.7)

which has an error of the order of h^2.
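The difference in convergence rate can be checked numerically. The following is an illustrative sketch (the test function e^x and the step sizes are chosen arbitrarily here): shrinking h by a factor of 10 should reduce the forward difference error by about 10 and the central difference error by about 100.

```python
import math

def forward_diff(f, a, h):
    # Forward difference (6.2)/(6.6): error of order h
    return (f(a + h) - f(a)) / h

def central_diff(f, a, h):
    # Central difference (6.7): error of order h^2
    return (f(a + h) - f(a - h)) / (2 * h)

exact = math.exp(1.0)  # derivative of e^x at x = 1
for h in (1e-1, 1e-2, 1e-3):
    ef = abs(forward_diff(math.exp, 1.0, h) - exact)
    ec = abs(central_diff(math.exp, 1.0, h) - exact)
    print(f"h={h:g}  forward error={ef:.2e}  central error={ec:.2e}")
```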
Expressions (6.2) and (6.3) are "two-point" formulae for the first derivative. Many more expressions, with different degrees of accuracy, can be constructed using a similar procedure and more points. For example, the Taylor expansions for the function at a number of points can be used to eliminate all the terms except the desired derivative.
Example:
Considering the 3 points x_0, x_1 = x_0 + h and x_2 = x_0 + 2h and taking the Taylor expansions at x_1 and x_2, we can construct a three-point forward difference formula:

f(x_1) = f(x_0) + h f'(x_0) + \frac{h^2}{2!} f^{(2)}(x_0) + O(h^3)    (6.8)

f(x_2) = f(x_0) + 2h f'(x_0) + \frac{4 h^2}{2!} f^{(2)}(x_0) + O(h^3)    (6.9)

Multiplying (6.8) by 4 and subtracting (6.9) to eliminate the second derivative term:

4 f(x_1) - f(x_2) = 3 f(x_0) + 2h f'(x_0) + O(h^3)

from which we can extract the first derivative:

f'(x_0) = \frac{-3 f(x_0) + 4 f(x_1) - f(x_2)}{2h} + O(h^2)    (6.10)
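A one-sided formula like (6.10) is useful at the boundary of a data set, where no point to the left of x_0 is available. A minimal sketch (the test function is chosen arbitrarily):

```python
import math

def three_point_forward(f, x0, h):
    # Three-point forward difference (6.10): O(h^2), uses only x0, x0+h, x0+2h
    return (-3*f(x0) + 4*f(x0 + h) - f(x0 + 2*h)) / (2*h)

# d/dx sin(x) at 0.5 is cos(0.5) = 0.87758...
print(three_point_forward(math.sin, 0.5, 1e-3))
```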
Exercise 6.1
Considering the 3 points x_0, x_1 = x_0 − h and x_2 = x_0 − 2h and the Taylor expansions at x_1 and x_2, find a three-point backward difference formula. What is the order of the error?
Exercise 6.2
Using the Taylor expansions for f(a+h) and f(a−h), show that a suitable formula for the second derivative is:

f^{(2)}(a) \approx \frac{f(a-h) - 2 f(a) + f(a+h)}{h^2}    (6.11)

Show also that the error is O(h^2).
Exercise 6.3
Use the Taylor expansions for f(a+h), f(a−h), f(a+2h) and f(a−2h) to show that the following are formulae for f'(a) and f^{(2)}(a), and that both have an error of the order of h^4:

f'(a) \approx \frac{-f(a+2h) + 8 f(a+h) - 8 f(a-h) + f(a-2h)}{12 h}

f^{(2)}(a) \approx \frac{-f(a+2h) + 16 f(a+h) - 30 f(a) + 16 f(a-h) - f(a-2h)}{12 h^2}
Expressions for the derivatives can also be found using other methods. For example, if the
function is interpolated with a polynomial using, say, n points, the derivative (first, second, etc)
can be estimated by calculating the derivative of the interpolating polynomial at the point of
interest.
Example:
Considering the 3 points x_1, x_2 and x_3 with x_1 < x_2 < x_3 (this time, not necessarily equally spaced) and respective function values y_1, y_2 and y_3, we can use the Lagrange interpolation polynomial to approximate f(x):

f(x) \approx L(x) = L_1(x) y_1 + L_2(x) y_2 + L_3(x) y_3    (6.12)

where:

L_1(x) = \frac{(x - x_2)(x - x_3)}{(x_1 - x_2)(x_1 - x_3)},  L_2(x) = \frac{(x - x_1)(x - x_3)}{(x_2 - x_1)(x_2 - x_3)}  and  L_3(x) = \frac{(x - x_1)(x - x_2)}{(x_3 - x_1)(x_3 - x_2)}    (6.13)

so the first derivative can be approximated by:

f'(x) \approx L'(x) = L_1'(x) y_1 + L_2'(x) y_2 + L_3'(x) y_3

which is:

f'(x) \approx \frac{2x - x_2 - x_3}{(x_1 - x_2)(x_1 - x_3)} y_1 + \frac{2x - x_1 - x_3}{(x_2 - x_1)(x_2 - x_3)} y_2 + \frac{2x - x_1 - x_2}{(x_3 - x_1)(x_3 - x_2)} y_3.

This general expression will give the value of the derivative at any of the points x_1, x_2 or x_3. For example, at x_1:

f'(x_1) \approx \frac{2 x_1 - x_2 - x_3}{(x_1 - x_2)(x_1 - x_3)} y_1 + \frac{x_1 - x_3}{(x_2 - x_1)(x_2 - x_3)} y_2 + \frac{x_1 - x_2}{(x_3 - x_1)(x_3 - x_2)} y_3    (6.14)
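Formula (6.14) is convenient when the samples are not equally spaced. A sketch, using f(x) = x^2 as a check (a second-order polynomial is interpolated exactly, so the derivative comes out exact up to rounding):

```python
def deriv_at_x1(x1, x2, x3, y1, y2, y3):
    # First derivative at x1 from (6.14); the points need not be equally spaced
    t1 = (2*x1 - x2 - x3) / ((x1 - x2) * (x1 - x3)) * y1
    t2 = (x1 - x3) / ((x2 - x1) * (x2 - x3)) * y2
    t3 = (x1 - x2) / ((x3 - x1) * (x3 - x2)) * y3
    return t1 + t2 + t3

# f(x) = x^2 sampled at 1.0, 1.3, 2.1; f'(1.0) = 2 exactly
print(deriv_at_x1(1.0, 1.3, 2.1, 1.0, 1.69, 4.41))
```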
Exercise 6.4
Show that if the points are equally spaced by the distance h in (6.12) and the expression is evaluated at x_1, x_2 and x_3, the expression reduces respectively to the 3-point forward difference formula (6.10), the central difference and the 3-point backward difference formulae.
Central Difference Expressions for Different Derivatives
The following expressions are central difference approximations for the derivatives:
i)  f'(x_i) = \frac{f(x_{i+1}) - f(x_{i-1})}{2h}

ii)  f''(x_i) = \frac{f(x_{i+1}) - 2 f(x_i) + f(x_{i-1})}{h^2}

iii)  f'''(x_i) = \frac{f(x_{i+2}) - 2 f(x_{i+1}) + 2 f(x_{i-1}) - f(x_{i-2})}{2 h^3}

iv)  f^{(4)}(x_i) = \frac{f(x_{i+2}) - 4 f(x_{i+1}) + 6 f(x_i) - 4 f(x_{i-1}) + f(x_{i-2})}{h^4}
Naturally, many more expressions can be developed using more points and/or different methods.
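Expressions iii) and iv), for instance, can be sketched directly (the step size and test function here are arbitrary; for f(x) = e^x every derivative is e^x, so both results should be close to 1 at x = 0):

```python
import math

def d3_central(f, x, h):
    # Central difference iii) for the third derivative, O(h^2)
    return (f(x + 2*h) - 2*f(x + h) + 2*f(x - h) - f(x - 2*h)) / (2 * h**3)

def d4_central(f, x, h):
    # Central difference iv) for the fourth derivative, O(h^2)
    return (f(x + 2*h) - 4*f(x + h) + 6*f(x) - 4*f(x - h) + f(x - 2*h)) / h**4

print(d3_central(math.exp, 0.0, 0.05), d4_central(math.exp, 0.0, 0.05))
```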
Exercise 6.5
Derive expressions iii) and iv) above. What is the order of the error for each of the 4 expressions
above?
Partial Derivatives
For a function of two variables f(x, y), the partial derivative \frac{\partial f(x,y)}{\partial x} is defined as:

\frac{\partial f(x,y)}{\partial x} = \lim_{h \to 0} \frac{f(x+h, y) - f(x, y)}{h}

which, clearly, is a function of y.
Again, we can approximate this expression by a difference, assuming that h is sufficiently small. Then, for example, a central difference expression for \frac{\partial f(x,y)}{\partial x} is:

\frac{\partial f(x,y)}{\partial x} \approx \frac{f(x+h, y) - f(x-h, y)}{2h}    (6.15)

Similarly,

\frac{\partial f(x,y)}{\partial y} \approx \frac{f(x, y+h) - f(x, y-h)}{2h}
In this form, the gradient of f is given by:

\nabla f(x,y) = \frac{\partial f(x,y)}{\partial x} \hat{x} + \frac{\partial f(x,y)}{\partial y} \hat{y} \approx \frac{1}{2h} \left[ (f(x+h,y) - f(x-h,y)) \hat{x} + (f(x,y+h) - f(x,y-h)) \hat{y} \right]
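A sketch of (6.15) applied componentwise (the step size and the test function are chosen only for illustration):

```python
def grad_central(f, x, y, h=1e-5):
    # Central-difference approximation to the gradient of f(x, y), as in (6.15)
    fx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    fy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return fx, fy

# f(x, y) = x^2 y has gradient (2xy, x^2); at (1.5, 2.0) this is (6.0, 2.25)
fx, fy = grad_central(lambda x, y: x * x * y, 1.5, 2.0)
print(fx, fy)
```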
Exercise 6.6
Using central difference formulae for the second derivatives, derive an expression for the Laplacian (\nabla^2 f) of a scalar function f.
Numerical Integration
In general, numerical integration methods approximate the definite integral of a function f by a
weighted sum of function values at several points in the interval of integration. In general these
methods are called “quadrature” and there are different methods to choose the points and the
weights.
Trapezoid Rule
The simplest method to approximate the integral of a function is the trapezoid rule. In this case,
the interval of integration is divided into a number of subintervals on which the function is
simply approximated by a straight line as shown in the figure below.
The integral (area under the curve) is then approximated by the sum of the areas of the trapezoids based on each subinterval. The area of the trapezoid with base in the interval [x_i, x_{i+1}] is:

(x_{i+1} - x_i) \frac{f_i + f_{i+1}}{2}

and the total area is then the sum of all the terms of this form.

Fig. 6.2: The trapezoid rule: the function is approximated by straight-line segments over the subintervals [x_i, x_{i+1}].

If we denote by h_i = x_{i+1} - x_i the width of the subinterval [x_i, x_{i+1}], the integral can be approximated by:

\int_{x_1}^{x_n} f(x)\,dx \approx \sum_{i=1}^{n-1} \frac{1}{2} h_i (f_i + f_{i+1})    (6.16)
If all the subintervals are of the same width (the points are equally spaced), (6.16) reduces to:

\int_{x_1}^{x_n} f(x)\,dx \approx h \left( \frac{f_1 + f_n}{2} + \sum_{i=2}^{n-1} f_i \right)    (6.17)
Exercise 6.7
Using the trapezoid rule:
(a) Calculate the integral of exp(−x^2) between 0 and 2.
(b) Calculate the integral of 1/x between 1 and 2.
In both cases use 10 and 100 subintervals (11 and 101 points respectively).
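A minimal sketch of the equally spaced rule (6.17), which could be used for this exercise (for comparison, the exact value of the second integral is ln 2 ≈ 0.6931):

```python
import math

def trapezoid(f, a, b, n):
    # Composite trapezoid rule (6.17): n equal subintervals, n + 1 points
    h = (b - a) / n
    s = 0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n))
    return h * s

for n in (10, 100):
    print(n, trapezoid(lambda x: 1.0 / x, 1.0, 2.0, n))  # tends to ln 2
```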
It can be shown that the error incurred with the application of the trapezoid rule in one interval is
given by the term:
E = -\frac{(b-a)^3}{12} f^{(2)}(\xi)    (6.18)
where ξ is a point inside the interval and the error is defined as the difference between the exact
integral (I) and the area of the trapezoid (A): E = I − A.
If this is applied to the Trapezoid rule using a number of subintervals, the error term changes to:
page 61 E763 (part 2) Numerical Methods
E = -\frac{(b-a) h^2}{12} f^{(2)}(\xi_h)    (6.19)

where now \xi_h is a point in the complete interval [a, b] and depends on h. Considering the error as the sum of the individual errors in each subinterval, we can write this term in the form:
E = -\frac{h^3}{12} \sum_{i=1}^{n} f^{(2)}(\xi_i) = -\frac{h^2}{12} \sum_{i=1}^{n} h f^{(2)}(\xi_i)    (6.20)

where \xi_i are points in each subinterval. Expression (6.20), in the limit when n \to \infty and h \to 0, corresponds to the integral of f^{(2)} over the interval [a, b]. Then, we can write (6.20) as:
E \approx -\frac{h^2}{12} (f'(b) - f'(a))    (6.21)
Two options are open now: we can use this term to estimate the error incurred (or, equivalently, to determine the number of equally spaced points required for a given precision), or we can include this term in the calculation to form a corrected form of the trapezoid rule:

\int_a^b f(x)\,dx \approx h \left( \frac{f_1 + f_n}{2} + \sum_{i=2}^{n-1} f_i \right) - \frac{h^2}{12} (f'(b) - f'(a))    (6.22)
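The corrected rule (6.22) needs the derivative at the end points only, yet raises the order of the error. A sketch on the integral of 1/x over [1, 2], where f'(x) = −1/x^2 (the example values are illustrative):

```python
import math

def corrected_trapezoid(f, fprime, a, b, n):
    # Trapezoid rule (6.17) with the end-point correction term from (6.22)
    h = (b - a) / n
    s = 0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n))
    return h * s - h * h / 12.0 * (fprime(b) - fprime(a))

approx = corrected_trapezoid(lambda x: 1.0 / x, lambda x: -1.0 / x**2, 1.0, 2.0, 10)
print(approx, math.log(2.0))  # the correction brings 10 subintervals close to ln 2
```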
Simpson’s Rule
In the case of the trapezoid rule, the function is approximated by a straight line and this can be
done repeatedly by subdividing the interval. A higher degree of accuracy using the same
number of subintervals can be obtained with a better approximation than the straight line. For
example, choosing a quadratic approximation could give a better result. This is Simpson's rule.
Consider the function f(x) and the interval [a, b]. Defining the points x_0, x_1 and x_2 as:

x_0 = a,  x_1 = \frac{a + b}{2},  x_2 = b,  and defining  h = \frac{b - a}{2}
Using Lagrange interpolation to generate a second order polynomial approximation to f(x) gives, as in (6.12):

f(x) \approx f(x_0) L_0(x) + f(x_1) L_1(x) + f(x_2) L_2(x)

where:

L_0(x) = \frac{(x - x_1)(x - x_2)}{(x_0 - x_1)(x_0 - x_2)} = \frac{1}{2h^2} (x - x_1)(x - x_2)

L_1(x) = \frac{(x - x_0)(x - x_2)}{(x_1 - x_0)(x_1 - x_2)} = -\frac{1}{h^2} (x - x_0)(x - x_2)

and  L_2(x) = \frac{(x - x_0)(x - x_1)}{(x_2 - x_0)(x_2 - x_1)} = \frac{1}{2h^2} (x - x_0)(x - x_1)
Then,

\int_a^b f(x)\,dx \approx \int_a^{a+2h} \left( f(x_0) L_0(x) + f(x_1) L_1(x) + f(x_2) L_2(x) \right) dx    (6.23)
or

\int_a^b f(x)\,dx \approx f(x_0) \int_a^{a+2h} L_0(x)\,dx + f(x_1) \int_a^{a+2h} L_1(x)\,dx + f(x_2) \int_a^{a+2h} L_2(x)\,dx
The individual integrals are:

\int_a^{a+2h} L_0(x)\,dx = \frac{1}{2h^2} \int_a^{a+2h} (x - x_1)(x - x_2)\,dx

and with the change of variable x \to t = x - x_1:

\int_a^{a+2h} L_0(x)\,dx = \frac{1}{2h^2} \int_{-h}^{h} t (t - h)\,dt = \frac{1}{2h^2} \left( \frac{t^3}{3} - \frac{h t^2}{2} \right) \Big|_{-h}^{h} = \frac{h}{3}
Similarly,

\int_a^{a+2h} L_1(x)\,dx = \frac{4h}{3}  and  \int_a^{a+2h} L_2(x)\,dx = \frac{h}{3}
Then, substituting in (6.23):

\int_a^b f(x)\,dx \approx \frac{h}{3} \left( f(a) + 4 f(a+h) + f(a+2h) \right)    (6.24)
Example
Use of Simpson's rule to calculate the integral:

\int_{0.25}^{1.25} (\sin \pi x + 0.5)\,dx

We have: x_0 = a = 0.25, x_2 = x_0 + 2h = b = 1.25, then x_1 = x_0 + h = 0.75 and h = 0.5.

Fig. 6.3: The function (in blue) and the 2nd order Lagrange interpolation through x_0, x_1 and x_2 used to calculate the integral with Simpson's rule.

The exact value of this integral is:

\int_{0.25}^{1.25} (\sin \pi x + 0.5)\,dx = 0.950158158

Applying Simpson's rule to this function gives:

\int_{0.25}^{1.25} (\sin \pi x + 0.5)\,dx \approx \frac{h}{3} \left( f(a) + 4 f(a+h) + f(a+2h) \right) = 0.97140452
As with the trapezoid rule, higher accuracy can be obtained by subdividing the interval of integration and adding the results of the integrals over each subinterval.
Exercise 6.8
Write down the expression for the composite Simpson's rule using n subintervals and use it to calculate the integral \int_{0.25}^{1.25} (\sin \pi x + 0.5)\,dx of the example above using 10 subintervals.
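One possible coding of the composite rule is sketched below (deriving the expression itself is the point of the exercise; this is just a numerical check, and the even-n restriction is a property of pairing subintervals):

```python
import math

def simpson_composite(f, a, b, n):
    # Composite Simpson's rule; n (the number of subintervals) must be even
    if n % 2 != 0:
        raise ValueError("n must be even")
    h = (b - a) / n
    s = f(a) + f(b)
    s += 4.0 * sum(f(a + i * h) for i in range(1, n, 2))  # odd interior points
    s += 2.0 * sum(f(a + i * h) for i in range(2, n, 2))  # even interior points
    return h / 3.0 * s

f = lambda x: math.sin(math.pi * x) + 0.5
print(simpson_composite(f, 0.25, 1.25, 10))  # much closer to 0.950158158
```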
The Midpoint Rule
Another simple method to approximate a definite integral is the midpoint rule, where the function is simply approximated by a constant value over the interval: the value of the function at the midpoint. In this way, the integral of f(x) in the interval [a, b] is approximated by the area of the rectangle of base (b − a) and height f(c), where c = (a + b)/2.
Consider the integral:

I = \int_a^b f(x)\,dx    (6.25)

and the Taylor expansion for the function f, truncated to first order and centred on the midpoint of the interval [a, b]:

p_1(x) = f(c) + (x - c) f'(c) + O(h^2)    (6.26)

where c = (a + b)/2.
The error term in (6.26) is actually R_1(x) = \frac{1}{2} (x - c)^2 f^{(2)}(\xi), so the error in (6.25) when p_1(x) is used instead of f(x) is given by:

E = \int_a^b R_1(x)\,dx

and applying the Integral Mean Value Theorem to this (see Appendix), we have:

E = \frac{1}{2} f^{(2)}(\xi) \int_a^b (x - c)^2\,dx = \frac{1}{6} f^{(2)}(\xi) (x - c)^3 \Big|_a^b = \frac{1}{6} f^{(2)}(\xi) \left( (b - c)^3 - (a - c)^3 \right)
but since c = a + h and c = b − h, where h = (b − a)/2, we have:

(b - c)^3 - (a - c)^3 = \frac{(b - a)^3}{4}

and the error is:

E = \frac{1}{24} (b - a)^3 f^{(2)}(\xi)    (6.27)
which is half of the estimate for the single interval trapezoid rule.
We can then write:

\int_a^b f(x)\,dx = (b - a) f(c) + \frac{1}{24} (b - a)^3 f^{(2)}(\xi)    (6.28)

for some \xi in the interval [a, b].
Similarly to the case of the trapezoid rule and Simpson's rule, the midpoint rule can be used in a composite manner, after subdividing the interval of integration into a number N of
subintervals to achieve higher precision. In that case, the expression for the integral becomes:
\int_a^b f(x)\,dx = h \sum_{i=1}^{N} f(c_i) + \frac{(b - a) h^2}{24} f^{(2)}(\eta),   h = \frac{b - a}{N}    (6.29)
where c_i are the midpoints of each of the N subintervals and \eta is a point between a and b.
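A sketch of the composite form (6.29), again on the integrand of the earlier Simpson example:

```python
import math

def midpoint(f, a, b, n):
    # Composite midpoint rule (6.29): sample at the centre of each subinterval
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

f = lambda x: math.sin(math.pi * x) + 0.5
for n in (2, 10):
    print(n, midpoint(f, 0.25, 1.25, n))  # approaches 0.950158158
```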
Exercise 6.9
Use the Midpoint Rule to calculate the integral \int_{0.25}^{1.25} (\sin \pi x + 0.5)\,dx using 2 and 10 subintervals. Compare the result with that of Simpson's rule in the example above.
Gaussian Quadrature
In the trapezoid, Simpson's and midpoint rules, the definite integral of a function f(x) is approximated by the exact integral of a polynomial that approximates the function. In all these cases, the evaluation points are chosen arbitrarily, often equally spaced, and the weighting coefficients are then determined by the choice of method. However, it is rather clear that the precision attained depends on the position of these points, giving another route to optimisation.
Considering again the general approach to the approximation of the definite integral, the problem can be written in the form:

G_n(f) = \sum_{i=1}^{n} w_i^n f(x_i^n) \approx \int_{-1}^{1} f(x)\,dx    (6.30)
(The interval of integration is here chosen as [ −1,1], but obviously, any other interval can be
mapped into this by a change of variable.)
The objective now is to find, for a given n, the best choice of evaluation points x_i^n (called here "Gauss points") and weights w_i^n to get maximum precision in the approximation. Compared to the criterion for the trapezoid and Simpson's rules, this is equivalent to trying to find, for a fixed n, the choice of x_i^n and w_i^n such that the approximation (6.30) is exact for a polynomial of degree N, with N (> n) as large as possible. (That is, we go beyond the degree of approximation of the previous rules.)
This is equivalent to saying:

\int_{-1}^{1} x^k\,dx = \sum_{i=1}^{n} w_i^n (x_i^n)^k   for k = 0, 1, 2, …, N    (6.31)
with N as large as possible (note the equal sign now). We can restrict the problem to these functions because any polynomial is just a superposition of terms of the form x^k, so if the integral is exact for each of them, it will be exact for any polynomial containing those terms. Expression (6.31) is a system of equations that the (unknown) Gauss points and weights need to satisfy. The problem is then to find the x_i^n and w_i^n (n of each). This is a nonlinear problem that cannot be solved directly. We also have to determine the value of N. It can be shown that this number is N = 2n − 1. This is also rather natural, since (6.31) consists of N + 1 equations and we need to find 2n parameters.
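For example, for n = 2 the Gauss points turn out to be ±1/√3 with weights 1, and this rule integrates x^k exactly for k = 0, …, 3 (N = 2n − 1 = 3), even though it uses only two function evaluations. A sketch:

```python
import math

def gauss2(f):
    # Two-point Gauss-Legendre rule on [-1, 1]: points +/- 1/sqrt(3), weights 1
    x = 1.0 / math.sqrt(3.0)
    return f(-x) + f(x)

# Check (6.31): exact integrals of x^k over [-1, 1] are 2, 0, 2/3, 0 for k = 0..3
for k in range(4):
    exact = 2.0 / (k + 1) if k % 2 == 0 else 0.0
    print(k, gauss2(lambda t: t**k), exact)
```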
Finding the weights
For a set of weights and Gauss points, let's consider the Lagrange interpolation polynomials associated to each of the Gauss points:

L_j^n(x) = \prod_{k=1, k \neq j}^{n} \frac{x - x_k^n}{x_j^n - x_k^n}    (6.32)

then, since the expression (6.31) should be exact for any polynomial up to order N = 2n − 1, and L_j^n(x) is of order n − 1, we have:
\int_{-1}^{1} L_j^n(x)\,dx = \sum_{i=1}^{n} w_i^n L_j^n(x_i^n)    (6.33)

but since the L_j^n(x) are interpolation polynomials, L_j^n(x_i^n) = \delta_{ij} (that is, they are 1 if i = j and 0 otherwise), all the terms in the sum in (6.33) are zero except for i = j and we have:

w_j^n = \int_{-1}^{1} L_j^n(x)\,dx    (6.34)

With this, we have the weights for a given set of Gauss points. We now have to find the best choice for these.
If P(x) is an arbitrary polynomial of degree \leq 2n − 1, we can write P(x) = P_n(x) Q(x) + R(x), where Q and R are respectively the quotient and remainder polynomials of the division of P by P_n. P_n(x) is of order n, and then Q and R are of degree n − 1 or less. Then we have:

\int_{-1}^{1} P(x)\,dx = \int_{-1}^{1} P_n(x) Q(x)\,dx + \int_{-1}^{1} R(x)\,dx    (6.35)
If we now define the polynomial P_n(x) by its roots and choose these as the Gauss points:

P_n(x) = \prod_{i=1}^{n} (x - x_i^n),    (6.36)

then the integral of the product P_n(x) Q(x), which is a polynomial of degree \leq 2n − 1, must be given exactly by the quadrature expression:

\int_{-1}^{1} P_n(x) Q(x)\,dx = \sum_{i=1}^{n} w_i^n P_n(x_i^n) Q(x_i^n)    (6.37)
but since the Gauss points are the roots of P_n(x), (6.37) must be zero; that is:

\int_{-1}^{1} P_n(x) Q(x)\,dx = 0    (6.38)

for all polynomials Q(x) of degree n − 1 or less. This implies that P_n(x) must be a member of a
E763 (part 2) Numerical Methods page 64
subintervals to achieve higher precision. In that case, the expression for the integral becomes:
) (
24
) (
) ( ) (
) 2 (
2
1
η f
a b h
c f h dx x f
N
i
i
b
a
−
+ =
∑
∫
=
,
N
a b
h
−
= (6.29)
where c
i
are the midpoints of each of the n subintervals and η is a point between a and b.
Exercise 6.9
Use the Midpoint Rule to calculate the integral:
∫
+
25 . 1
25 . 0
) 5 . 0 (sin dx x π using 2 and 10 subintervals.
compare the result with that of the Simpson’s rule in the example above.
Gaussian Quadrature
In the trapezoid, Simpson's and midpoint rules, the definite integral of a function f(x) is approximated by the exact integral of a polynomial that approximates the function. In all these cases, the evaluation points are chosen arbitrarily, often equally spaced, and the weighting coefficients are then determined by the choice of method. However, it is rather clear that the precision attained depends on the position of these points, giving another route to optimisation. Considering again the general approach to the approximation of the definite integral, the problem can be written in the form:

G_n(f) = Σ_{i=1}^{n} w_i^n f(x_i^n) ≈ ∫_{−1}^{1} f(x) dx    (6.30)

(The interval of integration is here chosen as [−1, 1], but obviously any other interval can be mapped into this one by a change of variable.)
The objective now is to find, for a given n, the best choice of evaluation points x_i^n (called here "Gauss points") and weights w_i^n to get maximum precision in the approximation. Compared to the criterion for the trapezoid and Simpson's rules, this is equivalent to trying to find, for a fixed n, the best choice of x_i^n and w_i^n such that the approximation (6.30) is exact for polynomials of degree N, with N (> n) as large as possible. (That is, we go beyond the degree of approximation of the previous rules.)
This is equivalent to saying:

∫_{−1}^{1} x^k dx = Σ_{i=1}^{n} w_i^n (x_i^n)^k    for k = 0, 1, 2, …, N    (6.31)
with N as large as possible (note the equal sign now). We can restrict the problem to these functions because any polynomial is just a superposition of terms of the form x^k, so if the integral is exact for each of them, it will be exact for any polynomial containing those terms. Expression (6.31) is a system of equations that the (unknown) Gauss points and weights need to satisfy. The problem is then to find the x_i^n and w_i^n (n of each). This is a nonlinear problem that cannot be solved directly. We also have to determine the value of N. It can be shown that N = 2n − 1. This is rather natural, since (6.31) consists of N + 1 equations and we need to find 2n parameters.
page 65 E763 (part 2) Numerical Methods
Finding the weights
For a given set of Gauss points, let's consider the Lagrange interpolation polynomials associated with each of them:

L_j^n(x) = ∏_{k=1, k≠j}^{n} (x − x_k^n)/(x_j^n − x_k^n)    (6.32)

then, since expression (6.31) should be exact for any polynomial up to degree N = 2n − 1, and L_j^n(x) is of degree n − 1, we have:
∫_{−1}^{1} L_j^n(x) dx = Σ_{i=1}^{n} w_i^n L_j^n(x_i^n)    (6.33)
but since the L_j^n(x) are interpolation polynomials, L_j^n(x_i^n) = δ_ij (that is, they are 1 if i = j and 0 otherwise), all the terms in the sum in (6.33) are zero except that for i = j, and we have:

∫_{−1}^{1} L_i^n(x) dx = w_i^n    (6.34)

With this, we have the weights for a given set of Gauss points. We now have to find the best choice for these points.
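Equation (6.34) can be checked numerically: given any set of nodes, the weights are the integrals of the corresponding Lagrange polynomials. A minimal sketch using numpy's Polynomial class (the function name is illustrative):

```python
import numpy as np

def lagrange_weights(nodes):
    """Quadrature weights w_i = integral over [-1, 1] of the i-th Lagrange
    polynomial built on the given nodes, as in (6.34)."""
    n = len(nodes)
    weights = []
    for i in range(n):
        # Build L_i(x) = prod over k != i of (x - x_k) / (x_i - x_k).
        L = np.polynomial.Polynomial([1.0])
        for k in range(n):
            if k != i:
                L *= np.polynomial.Polynomial([-nodes[k], 1.0]) / (nodes[i] - nodes[k])
        # Integrate L_i over [-1, 1] via its antiderivative.
        I = L.integ()
        weights.append(I(1.0) - I(-1.0))
    return weights

# For the two-point Gauss nodes +-1/sqrt(3), this reproduces the weights 1, 1.
w = lagrange_weights([-1 / np.sqrt(3), 1 / np.sqrt(3)])
```

The same routine applied to equally spaced nodes would instead reproduce Newton–Cotes weights, which is why the choice of nodes, treated next, is the essential step.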
If P(x) is an arbitrary polynomial of degree ≤ 2n − 1, we can write P(x) = P_n(x)Q(x) + R(x), where Q and R are respectively the quotient and remainder polynomials of the division of P by P_n. P_n(x) is of degree n, so Q and R are of degree n − 1 or less. Then we have:

∫_{−1}^{1} P(x) dx = ∫_{−1}^{1} P_n(x)Q(x) dx + ∫_{−1}^{1} R(x) dx    (6.35)
If we now define the polynomial P_n(x) by its roots and choose these as the Gauss points:

P_n(x) = ∏_{i=1}^{n} (x − x_i^n),    (6.36)
then the integral of the product P_n(x)Q(x), which is a polynomial of degree ≤ 2n − 1, must be given exactly by the quadrature expression:

∫_{−1}^{1} P_n(x)Q(x) dx = Σ_{i=1}^{n} w_i^n P_n(x_i^n) Q(x_i^n)    (6.37)
but since the Gauss points are the roots of P_n(x), (6.37) must be zero; that is:

∫_{−1}^{1} P_n(x)Q(x) dx = 0    (6.38)

for all polynomials Q(x) of degree n − 1 or less. This implies that P_n(x) must be a member of a family of orthogonal polynomials¹. In particular, Legendre polynomials are a good choice because they are orthogonal in the interval [−1, 1] with the weighting function w(x) = 1.
Going back now to the integral of the arbitrary polynomial P(x) of degree ≤ 2n − 1 and equation (6.35), we have that if we choose P_n(x) as in (6.36), then (6.38) is satisfied and (6.35) reduces to:

∫_{−1}^{1} P(x) dx = ∫_{−1}^{1} R(x) dx    (6.39)
but since R(x) is of degree n − 1 or less, interpolation using the Lagrange polynomials for the n points gives the exact representation of R (see Exercise 4.2). That is:

R(x) = Σ_{i=1}^{n} R(x_i^n) L_i^n(x)    exactly.    (6.40)
Then,

∫_{−1}^{1} R(x) dx = ∫_{−1}^{1} Σ_{i=1}^{n} R(x_i^n) L_i^n(x) dx = Σ_{i=1}^{n} R(x_i^n) ∫_{−1}^{1} L_i^n(x) dx    (6.41)

but we have seen in (6.34) that the integral of the Lagrange interpolation polynomial for point i is the value of w_i^n, so:
∫_{−1}^{1} R(x) dx = Σ_{i=1}^{n} w_i^n R(x_i^n)
Now, since P(x) = P_n(x)Q(x) + R(x) and P_n(x_i^n) = 0 (see (6.36)), we have P(x_i^n) = R(x_i^n), and from (6.39):

∫_{−1}^{1} P(x) dx = Σ_{i=1}^{n} w_i^n P(x_i^n)

which tells us that the integral of the arbitrary polynomial P(x) of degree ≤ 2n − 1 can be calculated exactly using as Gauss points x_i^n the zeros of the n-th order Legendre polynomial, together with the weights w_i^n determined by (6.34), that is, the integral over the interval [−1, 1] of the Lagrange polynomial corresponding to the Gauss point x_i^n.
Back to the start then: we now have a set of n Gauss points and weights that yield the exact evaluation of the integral of any polynomial up to degree 2n − 1. These should also give the best result for the integral of an arbitrary function f:

¹ Remember that for orthogonal polynomials in [−1, 1], all the roots lie in [−1, 1], and the polynomials satisfy ∫_{−1}^{1} p_i(x) p_j(x) dx = c_i δ_ij (with c_i a normalisation constant). Additionally, ∫_{−1}^{1} p_i(x) q(x) dx = 0 for any polynomial q of degree ≤ i − 1.
∫_{−1}^{1} f(x) dx ≈ Σ_{i=1}^{n} w_i^n f(x_i^n)    (6.42)
Gauss nodes and weights for different orders are given in the following table.

Gaussian Quadrature: Nodes and Weights

n    Nodes x_i^n                                    Weights w_i^n
1    0.0                                            2.0
2    ±√3/3 = ±0.577350269189                        1.0
3    0                                              8/9 = 0.888888888889
     ±√15/5 = ±0.774596669241                       5/9 = 0.555555555556
4    ±√(525 − 70√30)/35 = ±0.339981043585           (18 + √30)/36 = 0.652145154863
     ±√(525 + 70√30)/35 = ±0.861136311594           (18 − √30)/36 = 0.347854845137
5    0                                              128/225 = 0.568888888889
     ±(1/3)√(5 − 2√(10/7)) = ±0.538469310106        (322 + 13√70)/900 = 0.478628670499
     ±(1/3)√(5 + 2√(10/7)) = ±0.906179845939        (322 − 13√70)/900 = 0.236926885056
6    ±0.238619186                                   0.4679139
     ±0.661209386                                   0.3607616
     ±0.932469514                                   0.1713245
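The tabulated values can be reproduced with numpy, which generates Gauss-Legendre nodes and weights for any order; here for n = 4:

```python
import numpy as np

# leggauss returns the nodes (roots of the n-th Legendre polynomial, in
# ascending order) together with the corresponding weights.
nodes, weights = np.polynomial.legendre.leggauss(4)

print(nodes)    # +-0.339981043585 and +-0.861136311594
print(weights)  # 0.652145154863 and 0.347854845137, in matching positions
```

The same call with other orders reproduces the rest of the table.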
Example
For the integral ∫_{−1}^{1} e^{−x} sin(5x) dx, the results of the calculation using Gauss quadrature are listed in the table:

n    Integral            Error %
2    −0.307533965529     227.06
3    0.634074857001      −161.97
4    0.172538331616      28.71
5    0.247736352452      −2.35
6    0.241785750244      0.10

The error is calculated by comparison with the exact value 0.24203832101745. The results with few Gauss points are not very good because the function varies strongly in the interval of integration. However, for n = 6, the error is very small.
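The example can be reproduced with a short script; `gauss_integrate` is an illustrative helper (not from the notes) that applies (6.42) directly, using numpy's node and weight generator:

```python
import numpy as np

def gauss_integrate(f, n):
    """n-point Gauss-Legendre approximation to the integral of f
    over [-1, 1], i.e. the sum in (6.42)."""
    nodes, weights = np.polynomial.legendre.leggauss(n)
    return float(np.sum(weights * f(nodes)))

f = lambda x: np.exp(-x) * np.sin(5 * x)
exact = 0.24203832101745
for n in (2, 3, 4, 5, 6):
    approx = gauss_integrate(f, n)
    print(n, approx, 100 * (exact - approx) / exact)
```

The printed errors shrink rapidly with n, as in the table above.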
Exercise 6.10
Compare the results of the example above with those of Simpson's and the trapezoid rules for
n subintervals.
Change of Variable
The above procedure was developed for definite integrals over the interval [−1, 1]. Integrals over other intervals can be calculated after a change of variable. For example, if the integral to calculate is:

∫_a^b f(t) dt
To change from the interval [a, b] to [−1, 1] the following change of variable can be made:

x ← (2t − (a + b))/(b − a)    or    t → ((b − a)/2) x + (a + b)/2,    so    dt = ((b − a)/2) dx
Then, the integral can be approximated by:

∫_a^b f(t) dt ≈ ((b − a)/2) Σ_{i=1}^{n} w_i^n f( ((b − a)/2) x_i^n + (a + b)/2 )    (6.43)
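A sketch of (6.43) in Python (the function name is illustrative): it maps the tabulated nodes from [−1, 1] onto [a, b] and scales the result by (b − a)/2:

```python
import numpy as np

def gauss_integrate_ab(f, a, b, n):
    """n-point Gauss-Legendre quadrature on [a, b] using the change of
    variable t = (b - a)/2 * x + (a + b)/2, as in (6.43)."""
    x, w = np.polynomial.legendre.leggauss(n)
    t = 0.5 * (b - a) * x + 0.5 * (a + b)
    return 0.5 * (b - a) * float(np.sum(w * f(t)))

# The integral of sin(t) over [0, pi] is exactly 2; six points already
# recover it to many decimal places.
approx = gauss_integrate_ab(np.sin, 0.0, np.pi, 6)
```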
Exercise 6.11
Use Gaussian quadrature to calculate ∫_{0.25}^{1.25} sin(πx + 0.5) dx.
7. Solution of Partial Differential Equations
The Finite Difference Method: Solution of Differential Equations by
Finite Differences
We will study here a direct method to solve differential equations which is based on the use of
finite differences. This consists of the direct substitution of derivatives by finite differences and
thus the problem reduces to an algebraic form.
One Dimensional Problems:
Let’s consider a problem given by the equation:
Lf = g (7.1)
where L is a linear differential operator and g(x) is a known function. The problem is to find the function f(x) satisfying equation (7.1) over a given region (interval [a, b]), subject to some boundary conditions at a and b.
The basic idea is to substitute the derivatives by appropriate difference formulae like those
seen in section 6. This will convert the problem into a system of algebraic equations.
In order to apply the difference approximations systematically, we proceed as follows:
–– First, we divide the interval [a, b] into N equal subintervals of length h = (b − a)/N, defining the points x_i = a + ih.
–– Next, we approximate all derivatives in the operator L by appropriate difference formulae (h must be sufficiently small – N large – to do this accurately).
–– Finally, we formulate the corresponding difference equation at each point x_i. This generates a linear system of N − 1 equations in the N − 1 unknown values f_i = f(x_i).
Example:
If we take Lf = g to be a general second order differential equation:

c f''(x) + d f'(x) + e f(x) = g(x)    (7.2)

with constant coefficients c, d and e, x ∈ [a, b], and boundary conditions:

known values f(a) at a and f(b) at b.    (7.3)

Using (6.11) and (6.7) at point i:

f_i'' ≈ (f_{i−1} − 2f_i + f_{i+1})/h²    and    f_i' ≈ (f_{i+1} − f_{i−1})/(2h)    (7.4)
the equation at point i becomes:

c (f_{i−1} − 2f_i + f_{i+1})/h² + d (f_{i+1} − f_{i−1})/(2h) + e f_i = g_i    (7.5)
or:

(c − dh/2) f_{i−1} + (−2c + eh²) f_i + (c + dh/2) f_{i+1} = g_i h²    (7.6)

for all i except i = 1 and i = N − 1, where the known boundary values enter: for i = 1, f_{i−1} = f(a), and for i = N − 1, f_{i+1} = f(b).
This can be written as a matrix problem of the form A f = g, where f = {f_i} and g = {h² g_i} are vectors of order N − 1. The matrix A has only 3 elements per row:
a_{i,i−1} = c − dh/2,    a_{i,i} = −2c + eh²,    a_{i,i+1} = c + dh/2
Exercise 7.1
Formulate the algorithm to solve a general second order differential equation over the interval [a, b], with Dirichlet boundary conditions at a and b (known values of f at a and b). Write a short computer program to implement it and use it to solve:

f'' + 2f' + 5f = e^{−x} sin(x)  in [0, 5];  f(0) = f(5) = 0
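A minimal sketch of such a program (a dense solver for simplicity; the names and the use of numpy are illustrative, not part of the notes). It assembles the tridiagonal system (7.6) and moves the known boundary values to the right-hand side:

```python
import numpy as np

def solve_bvp(c, d, e, g, a, b, fa, fb, N):
    """Finite-difference scheme (7.6) for c f'' + d f' + e f = g on [a, b]
    with Dirichlet values f(a) = fa, f(b) = fb, using N subintervals."""
    h = (b - a) / N
    x = a + h * np.arange(1, N)         # interior points x_1 ... x_{N-1}
    lower = c - d * h / 2               # coefficient of f_{i-1}
    diag = -2 * c + e * h**2            # coefficient of f_i
    upper = c + d * h / 2               # coefficient of f_{i+1}
    A = (np.diag(np.full(N - 1, diag))
         + np.diag(np.full(N - 2, lower), -1)
         + np.diag(np.full(N - 2, upper), 1))
    rhs = h**2 * g(x)
    rhs[0] -= lower * fa                # known boundary values go to the RHS
    rhs[-1] -= upper * fb
    return x, np.linalg.solve(A, rhs)

x, f = solve_bvp(1.0, 2.0, 5.0, lambda t: np.exp(-t) * np.sin(t),
                 0.0, 5.0, 0.0, 0.0, 100)
```

As a sanity check with a known solution: f'' = 2 on [0, 1] with f(0) = 0, f(1) = 1 has solution f = x², and the scheme reproduces the nodal values essentially exactly, since central differences are exact for quadratics.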
Two Dimensional Problems
We consider now the problem Lf = g in 2 dimensions, where f is a function of 2 variables, say x and y. In this case, we need to approximate derivatives in both x and y in L. To do this, we superimpose a regular grid over the domain of the problem and (as for the 1-D case) we only consider the values of f and g at the nodes of the grid.
Example:
Referring to the figure below, the problem consists of finding the potential distribution between the inner and outer conducting surfaces of square cross-section when a fixed voltage is applied between them. The equation describing this problem is the Laplace equation ∇²φ = 0, with the boundary conditions φ = 0 on the outer conductor and φ = 1 on the inner conductor.
To approximate

∇²φ = ∂²φ/∂x² + ∂²φ/∂y²    (7.7)
we can choose for convenience the same spacing h for x and y. Ignoring the symmetry of the problem, the whole cross-section is discretized (as in Fig. 8.1). With this choice, there are 56 free nodes (for which the potential is not known).
Fig. 8.1 Finite difference mesh for solution of the electrostatic field in square coaxial line (φ = 0 on the outer conductor, φ = 1 on the inner conductor; the 56 free nodes are numbered 1–56).
On this mesh, we only consider the unknown internal nodal values, and only those nodes are numbered. An internal point of the grid, x_i, labelled O in the figure, is surrounded by four others, which for convenience are labelled N, S, E and W. For this point we can approximate the derivatives in (7.7) by:

∂²φ/∂x² ≈ (φ_W − 2φ_O + φ_E)/h²    and    ∂²φ/∂y² ≈ (φ_N − 2φ_O + φ_S)/h²    (7.8)
Then the equation ∇²φ = 0 becomes simply:

∇²φ ≈ (φ_N + φ_S + φ_E + φ_W − 4φ_O)/h² = 0,  or  φ_N + φ_S + φ_E + φ_W − 4φ_O = 0    (7.9)
Formulating this equation for each point of the grid and using the boundary conditions where appropriate, we end up with a system of N equations, one for each of the N free points of the grid. Applying equation (7.9) to point 1 of the mesh gives:

0 + φ_10 + φ_2 + 0 − 4φ_1 = 0    (7.10)

and to point 2:

0 + φ_11 + φ_3 + φ_1 − 4φ_2 = 0    (7.11)
A typical interior point such as 11 gives:

φ_2 + φ_20 + φ_12 + φ_10 − 4φ_11 = 0    (7.12)
and one near the inner conductor, 13, will give:

φ_4 + 1 + φ_14 + φ_12 − 4φ_13 = 0    (7.13)
In this way, we can assemble all 56 equations from the 56 mesh points of the figure, in terms of the 56 unknowns. The resulting 56 equations can be expressed as:

A x = y    (7.14a)

or

⎡ −4   1   ⋯   1        ⎤ ⎡ φ_1  ⎤   ⎡  0 ⎤
⎢  1  −4   1   ⋯   1    ⎥ ⎢ φ_2  ⎥   ⎢  0 ⎥
⎢      1  −4   ⋯   1    ⎥ ⎢ φ_3  ⎥   ⎢  0 ⎥
⎢            ⋱          ⎥ ⎢  ⋮   ⎥ = ⎢  ⋮ ⎥
⎢                       ⎥ ⎢ φ_13 ⎥   ⎢ −1 ⎥
⎢          ⋯  −4   1    ⎥ ⎢  ⋮   ⎥   ⎢  ⋮ ⎥
⎣              1  −4    ⎦ ⎣ φ_56 ⎦   ⎣  0 ⎦
    (7.14b)
The unknown vector x of (7.14a) is simply (φ_1, φ_2, …, φ_56)^T.
The right-hand-side vector y of eqs. (7.14) consists of zeros except for the −1's coming from equations like (7.13), corresponding to points next to the inner conductor (at potential 1); namely, from points 12 to 17, 20, 21, …, 41 to 45.
The 56×56 matrix A has mostly zero elements, except for '−4' on the diagonal, and either two, three or four '+1' entries somewhere else on each matrix row, the number and distribution depending on the geometric node location.
One has to be careful not to confuse the row-and-column numbers of the 2-D array A with the 'x and y coordinates' of the physical problem. Each number 1 to 56 of the mesh points in the figure above corresponds precisely to the row number of matrix A and the row number of column vector x.
Equation (7.14) is standard, as given in (5.11), and could be solved with a standard library package, with a Gauss elimination solver routine, or better, a sparse or band-matrix version of Gauss elimination. Better still would be Gauss-Seidel or successive displacement. Section 2 in the Appendix shows details of such a solution.
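The Gauss-Seidel sweep for the 5-point stencil (7.9) can be sketched on a plain square domain (simpler than the coaxial mesh of Fig. 8.1; the grid size, sweep count and boundary data below are illustrative): each free node is repeatedly replaced by the average of its four neighbours, using the freshest available values.

```python
import numpy as np

# phi = 1 on the top edge, phi = 0 on the other three edges.
n = 21                                    # n x n grid, boundary included
phi = np.zeros((n, n))
phi[0, :] = 1.0                           # Dirichlet data on the top row

for sweep in range(1000):                 # fixed sweep count for simplicity
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            # (7.9) solved for phi_O: the average of the four neighbours.
            phi[i, j] = 0.25 * (phi[i - 1, j] + phi[i + 1, j]
                                + phi[i, j - 1] + phi[i, j + 1])

centre = phi[n // 2, n // 2]
```

By symmetry (superposing the four problems with the hot side rotated to each edge, which sum to the constant solution 1), the converged value at the centre node is exactly 0.25.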
Enforcing Boundary Conditions
In the description so far we have seen the implementation of the Dirichlet boundary condition; that is, a condition where the values of the desired function are known at the edges of the region of interest (the ends of the interval in 1-D, or the boundary of a 2-D or 3-D region). This has been implemented in a straightforward way in equations (7.10), (7.11) and (7.13).
Frequently, other types of boundary conditions appear; for example, when the derivatives of the function are known (instead of the values themselves) at the boundary – this is the Neumann condition. On occasions, a mixed condition will apply; something like:
f(r) + ∂f/∂n = K, where the second term is the normal derivative (the derivative along the direction normal to the boundary). This type of condition appears in some radiation problems (Sommerfeld radiation condition).
We will see next some ways of dealing with these boundary conditions in the context of finite difference calculations.
For example, in the case of the 'square coaxial problem' studied earlier, we can see that the solution will have symmetry properties which make the calculation of the potential over the complete cross-section unnecessary. In fact, only one eighth of the cross-section is actually needed. The new region of interest can be one eighth of the complete cross-section (the shaded region) or one quarter of the cross-section, to avoid oblique lines. In either case, the new boundary conditions needed on the dashed lines that limit the new region of interest are of the Neumann type: ∂φ/∂n = 0, since the lines of constant potential will be perpendicular to those edges (n represents the normal to the boundary).

Fig. 8.2 Reduced region of interest, with φ = 0 on the outer conductor, φ = 1 on the inner conductor and ∂φ/∂n = 0 on the symmetry lines.
We will need a different strategy to deal with these conditions.
For this condition it is more convenient to define the mesh in a different manner. If we place the boundary at half the node spacing from the start of the mesh, as in the figure below, we can implement the normal derivative condition in a simple form. (Note that the node numbers used in the next figure do not correspond to the mesh numbering defined earlier for the whole cross-section.)
The boundary condition that applies at point b (not actually part of the mesh) is ∂φ/∂n = 0. We can approximate this by using a central difference between the point 1 and the auxiliary point a outside the mesh. Then:

∂φ/∂n |_b = 0 = (φ_a − φ_1)/h,  and then,  φ_a = φ_1.

Fig. 8.3 Boundary placed half a spacing from the first row of nodes; the auxiliary point a is the mirror image of node 1 across the boundary point b.
So the corresponding equation for point 1 is φ_N + φ_S + φ_E + φ_W − 4φ_O = 0. Substituting values we have:

0 + φ_10 + φ_1 + φ_2 − 4φ_1 = 0,  or  φ_2 + φ_10 − 3φ_1 = 0
A Dirichlet condition can also be implemented in a similar form, if necessary. For example, if part of a straight boundary is subjected to a Neumann condition and the rest to a Dirichlet condition (as, for example, in the case of a conductor that does not cover the whole boundary), the Dirichlet condition will have to be implemented in a mesh separated half of the spacing from the edge.
Exercise 7.2:
Consider Fig. 8.4 and the points 5 and 15, with the point a on the boundary (not in the mesh). Use Taylor expansions at points 5 and 15, at distances h/2 and 3h/2 from a, to show that a condition that can be applied when the function φ has a fixed value V at the boundary and we also want the normal derivative of φ to be zero there is:

9φ_5 − φ_15 = 8V.

Fig. 8.4 Mesh near a boundary with φ = V and ∂φ/∂n = 0; point a lies on the boundary, at h/2 from node 5 and 3h/2 from node 15.
Using the results of Exercise 7.2, the difference equation corresponding to the discretization of the Laplace equation at point 5, where the boundary condition is φ = V, will be:

φ_N + φ_4 − 4φ_5 + φ_6 + φ_15 = 0,  but φ_N = φ_5 and also φ_15 = 9φ_5 − 8V,  so finally:

φ_4 + 6φ_5 + φ_6 = 8V
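The relation from Exercise 7.2 can be sanity-checked numerically: for any function with φ = V and zero normal slope at the boundary, 9φ(h/2) − φ(3h/2) equals 8V up to third order in h, and exactly so for a quadratic. The test function below is illustrative:

```python
V = 1.0
h = 0.1
phi = lambda y: V + 3.0 * y**2          # phi(0) = V and phi'(0) = 0
lhs = 9 * phi(h / 2) - phi(3 * h / 2)   # nodes 5 and 15 sit at h/2 and 3h/2
# For a quadratic, the relation 9*phi_5 - phi_15 = 8*V holds exactly:
# the y**2 contributions at h/2 and 3h/2 cancel term by term.
```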
Exercise 7.3
Using Taylor expansions, derive an equation to implement the Neumann condition using five points along a line normal to the edge, at distances 0, h, 2h, 3h and 4h.

Exercise 7.4 (Advanced)
Using two points like 11 and 12 in the figure of Exercise 7.2, find the equation corresponding to the implementation of a radiation condition of the form φ + p ∂φ/∂n = K, where p is a constant with the same dimensions as h.
Repeat using 3 points along the same line, and use the extra degrees of freedom to increase the degree of approximation (eliminating second and third derivatives).
Example: Heat conduction (diffusion) in a uniform rod.
A uniform rod of length L is initially at temperature 0. It is then connected to a heat source at temperature 1 at one end, while the other end is attached to a sink at temperature 0. Find the temperature distribution along the rod as time varies. We need the time-domain (transient) solution. In the steady state we would expect a linear distribution of temperature between both ends.
The equation describing the temperature variation in space and time, assuming a heat conductivity of 1, is the normalized diffusion equation:

∂²u/∂x² = ∂u/∂t    (7.15)

where u(x, t) is the temperature. The boundary and initial conditions are:
page 73 E763 (part 2) Numerical Methods
f (r) +
∂ f
∂n
= K, where the second term refers to the normal derivative (or derivative along the
normal direction to the boundary. This type of condition appears in some radiation problems
(Sommerfeld radiation condition).
We will see next some forms of dealing with these boundary conditions in the context of
finite difference calculations.
For example in the case of the ‘square coaxial problem’ studied earlier, we can see that the
solution will have symmetry properties which makes unnecessary the calculation of the potential
over the complete crosssection. In fact only one eighth of the crosssection is actually needed.
The new region of interest can be one eighth
of the complete crosssection: the shaded region
or one quarter of the crosssection, to avoid
oblique lines. In any case, the new boundary
conditions needed on the dashed lines that limit
the new region of interest are of the Neumann
type:
∂φ
∂n
= 0 since the lines of constant
potential will be perpendicular to those edges. (n
represents the normal to the boundary).
φ = 0
φ = 1
∂φ
∂n
= 0
Fig. 8.2
We will need a different strategy to deal with these conditions.
For this condition it is more convenient to define the mesh in a different manner. If we
place the boundary at half the node spacing from the start of the mesh as in the figure below, we
can implement the normal derivative condition in a simple form:
(Note that the node numbers used in the next figure do not correspond to the mesh numbering
defined earlier for the whole crosssection).
φ = 0
10
1
a
2
∂φ
∂n
= 0
b
The boundary condition that applies at point b (not
actually part of the mesh) is:
∂φ
∂n
= 0 We can
approximate this by using a central difference
with the point 1 and the auxiliary point a outside
the mesh. Then:
∂φ
∂n
b
= 0 =
φ
a
−φ
1
h
and then,
φ
a
= φ
1
.
Fig. 8.3
So the corresponding equation for point 1 is: φ
N
+φ
S
+ φ
E
+ φ
W
− 4φ
0
= 0
Substituting values we have:
0 +φ
10
+φ
1
+φ
2
− 4φ
1
= 0
or φ
2
+ φ
10
− 3φ
1
= 0
A Dirichlet condition can also be implemented in a similar form, if necessary. For example, if part of a straight boundary is subjected to a Neumann condition and the rest to a Dirichlet condition (as for example in the case of a conductor that does not cover the whole boundary), a Dirichlet condition will have to be implemented in a mesh placed half a spacing from the edge.
E763 (part 2) Numerical Methods page 74
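The half-spacing ghost-node idea can be checked on a 1-D analogue. The sketch below (Python, used here only for illustration; the model problem −φ'' = 1 and the mesh size are arbitrary choices, not part of the notes) places nodes at x_i = (i + 1/2)h, applies ∂φ/∂x = 0 at x = 0 through a mirrored ghost value φ_a = φ_0, and a Dirichlet value at x = 1 through a ghost defined by the half-spacing average:

```python
import numpy as np

# 1-D analogue of the half-spacing mesh: nodes at x_i = (i + 1/2)h,
# Neumann condition phi'(0) = 0 half a spacing before node 0, and
# Dirichlet condition phi(1) = 0 half a spacing after the last node.
# Model problem: -phi'' = 1, exact solution phi = (1 - x^2)/2.
M = 50
h = 1.0 / M
x = (np.arange(M) + 0.5) * h

A = np.zeros((M, M))
b = np.ones(M) * h**2
for i in range(M):
    A[i, i] = 2.0
    if i > 0:
        A[i, i - 1] = -1.0
    if i < M - 1:
        A[i, i + 1] = -1.0
# Neumann at x = 0: ghost value phi_a = phi_0 (central difference of phi' = 0)
A[0, 0] = 1.0
# Dirichlet at x = 1, half a spacing beyond node M-1: the average of the last
# node and the ghost equals the boundary value V = 0, so ghost = -phi_{M-1}.
A[M - 1, M - 1] = 3.0

phi = np.linalg.solve(A, b)
exact = 0.5 * (1.0 - x**2)
print(np.max(np.abs(phi - exact)))   # small: the scheme is second-order accurate
```

Note how the ghost values only modify the diagonal entries of the first and last rows, exactly as the equation φ_2 + φ_10 − 3φ_1 = 0 modified the standard five-point formula above.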
Exercise 7.2:
Consider Fig. 8.4 and the points 5 and 15, with the point a on the boundary (not in the mesh). Use Taylor expansions at points 5 and 15, at distances h/2 and 3h/2 from a, to show that a condition that can be applied when the function φ has a fixed value V at the boundary and we also want the normal derivative of φ to be zero there is:

9φ_5 − φ_15 = 8V

Fig. 8.4 (mesh nodes 1–6 and 11–15 near the boundary, with the point a on the boundary between nodes 5 and 15, and labels φ = V and ∂φ/∂n = 0)
Using the results of Exercise 7.2, the difference equation corresponding to the discretization of the Laplace equation at point 5, where the boundary condition is φ = V, will be:

φ_N + φ_4 − 4φ_5 + φ_6 + φ_15 = 0,  but φ_N = φ_5 and also φ_15 = 9φ_5 − 8V, so finally:

φ_4 + 6φ_5 + φ_6 = 8V
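The combination 9φ_5 − φ_15 = 8V can be verified numerically before doing the Taylor algebra. The short check below (Python, for illustration only; the test function is an arbitrary smooth choice with φ(0) = V and φ'(0) = 0) shows the residual of the formula shrinking like h³:

```python
import numpy as np

# Check of the condition of Exercise 7.2: for smooth phi with phi(a) = V and
# phi'(a) = 0 at the boundary point a, 9*phi(h/2) - phi(3h/2) = 8V + O(h^3).
V = 2.0
phi = lambda s: V + 3.0 * s**2 + 5.0 * s**3   # phi(0) = V, phi'(0) = 0

errs = []
for h in (0.1, 0.05, 0.025):
    err = abs(9.0 * phi(0.5 * h) - phi(1.5 * h) - 8.0 * V)
    errs.append(err)
    print(h, err)   # the error shrinks roughly 8x each time h is halved
```

The h² terms cancel exactly (9·(h/2)² = (3h/2)²), which is why the error is third order.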
Exercise 7.3
Using Taylor expansions, derive an equation to implement the Neumann condition using
five points along a line normal to the edge and at distances 0, h, 2h, 3h, and 4h.
Exercise 7.4 (Advanced)
Using two points like 11 and 12 in the figure of Exercise 7.2, find the equation corresponding to the implementation of a radiation condition of the form: φ + p ∂φ/∂n = K, where p is a constant with the same dimensions as h.
Repeat using 3 points along the same line and use the extra degree of freedom to increase the degree of approximation (eliminating second and third derivatives).
Example: Heat conduction (diffusion) in a uniform rod.
A uniform rod of length L is initially at temperature 0. It is then connected to a heat source at temperature 1 at one end, while the other is attached to a sink at temperature 0. Find the temperature distribution along the rod as time varies. We need the time-domain (transient) solution. In the steady state we would expect a linear distribution of temperature between both ends.
The equation describing the temperature variation in space and time, assuming a heat
conductivity of 1 is the normalized diffusion equation:
∂²u/∂x² = ∂u/∂t   (7.15)
where u(x,t) is the temperature. The boundary and initial conditions are:
u(0, t) = 0  and  u(L, t) = 1  for all t;   u(x, 0) = 0 for x < L
We can discretize the solution space (x, t) with a regular grid with spacings ∆x and ∆t. The solution will be sought at positions x_i, i = 1, ..., M−1 (leaving out of the calculation the points at the ends of the rod), so ∆x = L/M, and at times t_n, n = 0, 1, 2, ...
Using this discretization, the boundary and initial conditions become:

u_0^n = 0  and  u_M^n = 1  for all n;   u_i^0 = 0 for i = 0, ..., M−1   (7.16)
We now discretize equation (7.15), converting the derivatives into differences:
For the second space derivative at position i and time n, we can use the central difference formula:

∂²u/∂x²|_{i,n} = (u_{i−1}^n − 2u_i^n + u_{i+1}^n)/∆x²   (7.17)
In the right hand side of (7.15), we need to approximate the time derivative at the same point
(i,n). We could use the forward difference:
∂u/∂t ≈ (u_i^{n+1} − u_i^n)/∆t   (7.18)
and in this case we get:
(u_{i−1}^n − 2u_i^n + u_{i+1}^n)/∆x² = (u_i^{n+1} − u_i^n)/∆t   (7.19)
as the difference equation.
Rearranging:

u_i^{n+1} = (∆t/∆x²) u_{i−1}^n + (1 − 2∆t/∆x²) u_i^n + (∆t/∆x²) u_{i+1}^n   (7.20)
This gives us a way of calculating the temperature u at point i and time n+1 as a function of u at points i−1, i and i+1 at time n. We have then a time-stepping algorithm.

Fig. 8.5 (stencil: the values at points i−1, i and i+1 at time level n determine the value at point i at level n+1)
Equation (7.20) is valid for n = 0,...., N and i = 1, ...., M–1.
Special cases: for i = 1, u_{i−1}^n = 0; and for i = M−1, u_{i+1}^n = 1 for all n (at all times).
If we call b = 1 − 2∆t/∆x² and c = ∆t/∆x², we can rewrite (7.20) as:
u_1^{n+1} = b u_1^n + c u_2^n
u_2^{n+1} = c u_1^n + b u_2^n + c u_3^n
⋮
u_{M−1}^{n+1} = c u_{M−2}^n + b u_{M−1}^n + c   (7.21)
which can be written in matrix form as u^{n+1} = A u^n + v, where the matrix A and the vector v are:
A =
b c ⋅ ⋅ ⋅
c b c ⋅ ⋅
⋅ c b c ⋅
⋅ ⋅ ⋅ ⋅ ⋅
⋅ ⋅ ⋅ c b

and v = (0, 0, ⋅⋅⋅, 0, c)^T   (7.22)
u^n is the vector containing all values u_i^n. It is known at n = 0, so the matrix equation (7.21) can be solved for all successive time steps.
Problems with this formulation:
– It is unstable unless ∆t ≤ ∆x²/2 (that is, c ≤ 0.5 or b ≥ 0). This is due to the rather poor approximation to the time derivative provided by the forward difference (7.18). A better scheme can be constructed using the central difference in the RHS of (7.15) instead of the forward difference.
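The explicit scheme (7.21) and its stability limit can be illustrated with a short program (a Python sketch, not part of the notes; the values of M, c and the step counts are arbitrary choices). With c ≤ 0.5 the solution relaxes to the expected linear steady state; with c > 0.5 it blows up:

```python
import numpy as np

def step_explicit(u, c):
    """One forward-difference time step of (7.20)/(7.21) for the rod, with
    u(0) = 0 and u(L) = 1 held fixed (u holds the M-1 interior values)."""
    b = 1.0 - 2.0 * c
    un = np.empty_like(u)
    un[0] = b * u[0] + c * u[1]                      # left neighbour is 0
    un[1:-1] = c * u[:-2] + b * u[1:-1] + c * u[2:]
    un[-1] = c * u[-2] + b * u[-1] + c               # right neighbour is 1
    return un

M = 20
u = np.zeros(M - 1)              # initial temperature 0 everywhere inside

# Stable case: c = dt/dx^2 = 0.4 <= 0.5
for _ in range(2000):
    u = step_explicit(u, 0.4)
x = np.arange(1, M) / M
print(np.max(np.abs(u - x)))     # close to the linear steady state u = x/L

# Unstable case: c = 0.6 > 0.5 -- the solution blows up
v = np.zeros(M - 1)
for _ in range(200):
    v = step_explicit(v, 0.6)
print(np.max(np.abs(v)))         # grows without bound
```

The instability appears even though the initial data and boundary values are perfectly smooth: the highest-frequency grid mode is amplified at every step once b < −1 in magnitude terms.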
Using the central difference for the time derivative: the Crank-Nicolson method
A central difference approximation to the RHS of (7.15) would be:

∂u/∂t|_{i,n} = (u_i^{n+1} − u_i^{n−1})/(2∆t)
This would involve values of u at three time steps: n−1, n and n+1, which would be undesirable. A solution to this is to shrink the gap between the two values, taking instead of the difference between u_i^{n+1} and u_i^{n−1} the difference between u_i^{n+1} and u_i^n. However, if we consider this a 'central' difference, the derivative must be evaluated at the central point, which corresponds to the value n+1/2:
∂u/∂t|_{i,n+1/2} = (u_i^{n+1} − u_i^n)/∆t   (7.23)
We then need to evaluate the left hand side of (7.15) at the same time, but this is not on the grid!
We will have:
∂²u/∂x²|_{i,n+1/2} = (u_{i−1}^{n+1/2} − 2u_i^{n+1/2} + u_{i+1}^{n+1/2})/∆x²   (7.24)
Since the values of u are restricted to positions in the grid, we have to approximate the values in
the RHS of (7.24). Since those values are at the centre of the intervals, we can approximate
them by the average between the neighbouring grid points:
u_i^{n+1/2} ≈ (u_i^n + u_i^{n+1})/2,  and similarly for i−1 and i+1.   (7.25)
We can now substitute (7.25) into (7.24) and evaluate the equation (7.15) at time n+1/2. After
rearranging this gives:
u_{i−1}^{n+1} − 2d u_i^{n+1} + u_{i+1}^{n+1} = −u_{i−1}^n + 2e u_i^n − u_{i+1}^n   (7.26)
where d = 1 + ∆x²/∆t and e = 1 − ∆x²/∆t.
This form of treating the derivative (evaluating the equation between grid points in order to have a central difference approximation for the first-order derivative) is called the Crank-Nicolson method, and it has several advantages over the previous formulation (using the forward difference).
Fig. 8.6 shows the scheme of calculation: the points or values involved in each calculation. The dark spot represents the position where the equation is evaluated. This method provides a second order approximation for both derivatives and, very importantly, it is also unconditionally stable.

Fig. 8.6 Crank-Nicolson scheme (values at points i−1, i and i+1 at both time levels n and n+1; the equation is evaluated at the intermediate level n+1/2)
We can now write (7.26) for each value of i as in (7.21), considering the special cases at the ends
of the rod and at t = 0, and write the corresponding matrix form. In this case we will get:
A u^{n+1} = B u^n − 2v   (7.27)
where:

A =
−2d 1 ⋅ ⋅
1 −2d 1 ⋅
⋅ ⋅ ⋅ ⋅
⋅ ⋅ 1 −2d

B =
2e −1 ⋅ ⋅
−1 2e −1 ⋅
⋅ ⋅ ⋅ ⋅
⋅ ⋅ −1 2e

and v = (0, ⋅⋅⋅, 0, 1)^T
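The Crank-Nicolson step (7.27) now requires a linear solve at each step, but any time step can be used. A Python sketch (for illustration only; M, ∆t and the number of steps are arbitrary choices) of the rod problem:

```python
import numpy as np

# Crank-Nicolson time stepping (7.27) for the rod: A u^{n+1} = B u^n - 2v,
# with d = 1 + dx^2/dt and e = 1 - dx^2/dt.
M, L = 20, 1.0
dx = L / M
dt = 0.01                      # no stability restriction on dt here
d = 1.0 + dx**2 / dt
e = 1.0 - dx**2 / dt

n = M - 1                      # number of interior unknowns
A = np.diag(np.full(n, -2.0 * d)) + np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)
B = np.diag(np.full(n, 2.0 * e)) - np.diag(np.ones(n - 1), 1) - np.diag(np.ones(n - 1), -1)
v = np.zeros(n)
v[-1] = 1.0                    # boundary value u(L) = 1 enters the last row

u = np.zeros(n)                # initial temperature 0 inside the rod
for _ in range(2000):
    u = np.linalg.solve(A, B @ u - 2.0 * v)

x = np.arange(1, M) * dx
print(np.max(np.abs(u - x / L)))   # approaches the linear steady state
```

In practice the tridiagonal system would be solved with a banded solver rather than a dense one; the dense solve is kept here only to stay close to the matrix notation of (7.27).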
Example
Consider now a parabolic equation in 2 space dimensions and time, like the Schroedinger
equation or the diffusion equation in 2D:
∂²u/∂x² + ∂²u/∂y² = a ∂u/∂t   (7.28)
For example, this could represent the temperature distribution over a 2-dimensional plate. Let's consider a square plate of length 1 in x and y and the following boundary conditions:
a) u(0,y,t) = 1
b) u(1,y,t) = 0
c) u(x,0,t) = 1
d) u(x,1,t) = 0
e) u(x,y,0) = 0 for x > 0 and y > 0
That is, the sides x = 0 and y = 0 are kept at
temperature 1 at all times while the sides x = 1
and y = 1 are kept at u = 0.
The whole plate, except the sides x = 0 and y = 0
are at temperature 0 at t = 0.
The following diagram shows the discretization for the x coordinate:

x:  0  1  ⋅⋅⋅  i−1  i  i+1  ⋅⋅⋅  R  R+1
There are R+2 points with R unknown values (i = 1, ..., R). The two extreme points: i = 0 and i
= R+1 correspond to x = 0 and x = 1, respectively.
The discretization of the y coordinate can be made in a similar form. For convenience, we can also use R+2 points, of which only R are unknown (j = 1, ..., R).
Discretization of time is done by considering t = n∆t, where ∆t is the time step.
Approximation of the time derivative (first order):
As in the previous example, in order to use the central difference and only two time levels, we use the Crank-Nicolson method. That is, equation (7.28) will be evaluated half way between time grid points.
This will give:

∂u/∂t|_{i,j,n+1/2} = (u_{i,j}^{n+1} − u_{i,j}^n)/∆t   (7.29)
We need to approximate the LHS at the same time using the average of the values at n and n+1:

∇²u^{(n+1/2)} = (1/2)(∇²u^{(n)} + ∇²u^{(n+1)})
Applying this and (7.29) to (7.28) we get:

∇²u^{(n+1)} + ∇²u^{(n)} = (2a/∆t)(u^{(n+1)} − u^{(n)})
or, rewriting:

∇²u^{(n+1)} − (2a/∆t) u^{(n+1)} = −(∇²u^{(n)} + (2a/∆t) u^{(n)})   (7.30)
where u is still a continuous function of position.
Approximation of the space derivative (second order):
Using the central difference for the second order derivatives and using ∆x = ∆y = h, we
get:
∇²u = (1/h²)(u_N + u_S + u_W + u_E − 4u_O)
and using this in (7.30):
u_{i−1,j}^{n+1} + u_{i+1,j}^{n+1} + u_{i,j−1}^{n+1} + u_{i,j+1}^{n+1} − (4 + 2ah²/∆t) u_{i,j}^{n+1}
= −u_{i−1,j}^n − u_{i+1,j}^n − u_{i,j−1}^n − u_{i,j+1}^n + (4 − 2ah²/∆t) u_{i,j}^n   (7.31)
Defining now a node numbering over the grid and a vector u^{(n)} containing all the unknown values (for all i and j), (7.31) can be written as a matrix equation of the form:

A u^{(n+1)} = B u^{(n)}   (7.32)
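The assembly of (7.31) into matrix form can be sketched as below (Python, for illustration only; the choices a = 1, R = 10, ∆t, the node numbering k = (j−1)R + (i−1), and the names A, B, g are all assumptions made here, not part of the notes). Note that with the nonzero Dirichlet values on the sides x = 0 and y = 0, the known boundary values also contribute a constant vector g, so the step actually solved here is A u^{(n+1)} = B u^{(n)} + g:

```python
import numpy as np

# Assemble (7.31) for the plate: R x R interior unknowns, node k = (j-1)*R + (i-1).
# Boundary values: u = 1 on the sides x = 0 and y = 0, u = 0 on x = 1 and y = 1.
R = 10
h = 1.0 / (R + 1)
dt, a = 0.005, 1.0
alpha = 4.0 + 2.0 * a * h**2 / dt
beta = 4.0 - 2.0 * a * h**2 / dt

N = R * R
A = np.zeros((N, N)); B = np.zeros((N, N)); g = np.zeros(N)
for j in range(1, R + 1):
    for i in range(1, R + 1):
        k = (j - 1) * R + (i - 1)
        A[k, k] = -alpha
        B[k, k] = beta
        for (ii, jj) in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
            if 1 <= ii <= R and 1 <= jj <= R:      # interior neighbour: unknown
                kk = (jj - 1) * R + (ii - 1)
                A[k, kk] = 1.0
                B[k, kk] = -1.0
            else:                                   # boundary neighbour: known value
                ub = 1.0 if (ii == 0 or jj == 0) else 0.0
                g[k] += -2.0 * ub                   # appears on both time levels

u = np.zeros(N)                                     # initial condition u = 0 inside
for _ in range(200):
    u = np.linalg.solve(A, B @ u + g)
print(u.max())    # the plate warms up from the two hot sides
```

For realistic grid sizes A and B are large and sparse, and a sparse factorization of A (computed once, reused every step) would replace the dense solve.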
Exercise 7.5
A length L of transmission line is terminated at both ends with a short circuit. A unit
impulse voltage is applied in the middle of the line at time t = 0.
The voltage φ(x, t) along the line satisfies the following differential equation:
∂²φ/∂x² − p² ∂²φ/∂t² = 0

with the boundary and initial conditions:

φ(0, t) = φ(L, t) = 0,  φ(x, t) = 0 for t < 0  and  φ(L/2, 0) = 1
a) By discretizing both the coordinate x and time t as x_i = (i − 1)∆x, i = 1, 2, ..., R+1 and t_m = m∆t, m = 0, 1, 2, ..., use finite differences to formulate the numerical solution of the equation above. Show that the problem reduces to a matrix problem of the form:

Φ^{m+1} = A Φ^m + B Φ^{m−1}

where A and B are matrices and Φ^m is a vector containing the voltages at each of the discretization points x_i at time t_m.
b) Choosing R = 7, find the matrices A and B giving the values of their elements, taking
special care at the edges of the grid. Show the discretized equation corresponding to points at
the edge of the grid (for i = 1 or R+1) and consider the boundary condition φ(0) = φ(L) = 0.
c) How would the matrices A and B change if the boundary condition was changed to ∂φ/∂x = 0 at x = 0, L (corresponding in this case to an open circuit)? Show the discretized equation corresponding to one of the edge points and propose a way to transform it so it contains only values corresponding to points in the defined grid.
8. Solution of Boundary Value Problems
A physical problem is often described by a differential equation of the form:
L u(x) = s(x)  on a region Ω.   (8.1)

where L is a linear differential operator, u(x) is a function to be found, s(x) is a known function, and x is the position vector of any point in Ω (its coordinates).
We also need to impose boundary conditions on the values of u and/or its derivatives on
Γ, the boundary of Ω, in order to have a unique solution to the problem.
The boundary conditions can be written in general as:

B u(x) = t(x)  on Γ.   (8.2)

with B a linear differential operator and t(x) a known function. So we will have for example:
B = 1 :  u(x) = t(x) : known values of u on Γ (Dirichlet condition)
B = ∂/∂n :  ∂u(x)/∂n = t(x) : fixed values of the normal derivative (Neumann condition)
B = ∂/∂n + k :  ∂u/∂n + ku = t(x) : mixed condition (radiation condition)
A problem described in this form is known as a boundary value problem (differential equation
+ boundary conditions).
Strong and Weak Solutions to Boundary Value Problems
Strong Solution:
This is the direct approach to the solution of the problem, as specified above in (8.1)(8.2);
for example, the finite difference solution method studied earlier is of this type.
Weak Solution:
This is an indirect approach. Instead of trying to solve the problem directly, we can reformulate it as the search for a function that satisfies some conditions also satisfied by the solution of the original problem (8.1)-(8.2). With a proper definition of these conditions the search will lead to the correct and unique solution.
Before expanding on the details of this search, we need to introduce the idea of the inner product between two functions. This will allow us to quantify (put a number to) how close one function is to another, or how big or small an error function is.
The commonest definition of the inner product between two functions f and g is:

f, g = ∫ f g dΩ   for real functions f and g, or
f, g = ∫ f g* dΩ   for complex functions f and g   (8.3)
In general, the inner product between two functions will be a real number obtained by
global operations between the functions over the domain and satisfying some defining properties
(for real functions):
i)  f, g = g, f
ii)  αf + βg, h = α f, h + β g, h   (α and β scalars)   (8.4)
iii)  f, f = 0 if and only if f = 0
From this definition we can deduce that:
If for a function r, we have that r, h = 0 for any choice of h, then r = 0.
We will try to use this property for testing an error residual r.
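The inner product and its defining properties can be checked numerically. The sketch below (Python, for illustration only; the domain [0, 1], the trapezoidal quadrature and the two sample functions are arbitrary choices) approximates f, g = ∫ f g dΩ and verifies symmetry and linearity:

```python
import numpy as np

# Discrete approximation of the inner product f, g = integral of f*g over
# the domain, here [0, 1] with trapezoidal quadrature weights.
x = np.linspace(0.0, 1.0, 2001)
dx = x[1] - x[0]
w = np.full_like(x, dx)
w[0] = w[-1] = 0.5 * dx

def inner(fx, gx):
    return float(np.sum(fx * gx * w))

f = np.sin(np.pi * x)
g = x * (1.0 - x)

print(inner(f, g))    # equals inner(g, f): symmetry, property i)
print(inner(f, f))    # positive, and zero only if f is identically zero
```

The same discrete inner product is what is actually computed when the weighted residual equations later in this chapter are evaluated by quadrature.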
We are now in a position to formulate the weak solution of (8.1)-(8.2) as:

Given the function s(x) and an arbitrary function h(x) in Ω, find the function u(x) that satisfies:

L u, h = s, h   or   L u − s, h = 0   (8.5)

for any choice of the function h.
Approximate Solutions to the Weak Formulation
The Rayleigh-Ritz Method
In this case, instead of trying to find the exact solution u(x) that satisfies (8.5), we only look for an approximation. We can choose a set of basis functions and assume that the wanted function, or rather, a suitable approximation to it, can be represented as an expansion in terms of the basis functions (as for example with Fourier expansions using sines and cosines).
Now, once we have chosen the basis functions, the unknown is no longer the function u, but its expansion coefficients; that is: numbers! So the problem (8.5) is converted from "finding a function u among all possible functions defined on Ω" into the search for a set of numbers. We now need some method to find these numbers.
We will see two methods to solve (8.5) using the Rayleigh-Ritz procedure above:
Weighted Residuals
Variational Method
Weighted Residuals
For this approach the problem (8.5) is rewritten in the form:

r = Lu – s = 0   (8.6)

where the function r is the residual or error: the difference between the LHS and the RHS of (8.1) when we try any function in place of u. This residual will only be zero when the function we use is the correct solution to the problem.
We can now look at the solution of (8.6) as an optimization (or minimization) problem; that is: "Find the function u such that r = 0", or in an approximate way: "Find the function u such that r is as small as possible", and here we need to be able to measure how 'small' the error (or residual) is.
Using these ideas on the weak formulation, this is now transformed into:

Given the function s(x) and an arbitrary function h(x) in Ω, find the function u(x) that satisfies:

r, h = 0  (where r = Lu – s)   (8.7)

for any choice of the function h.
Applying the Rayleigh-Ritz procedure, we can put:

u(x) = Σ_{j=1}^{N} d_j b_j(x)   (8.8)

where d_j, j = 1, ..., N are scalar coefficients and b_j(x), j = 1, ..., N are a set of linearly independent basis functions (normally called trial functions – chosen).
We now need to introduce the function h. For this, we can also use an expansion, in general using another set of functions w_i(x), for example:

h(x) = Σ_{i=1}^{N} c_i w_i(x)   (8.9)

where c_i, i = 1, ..., N are scalar coefficients and w_i(x), i = 1, ..., N are a set of linearly independent expansion functions for h (normally called weighting functions – also chosen).
Using now (8.9) in (8.7) we get:

r, h  =  r, Σ_{i=1}^{N} c_i w_i  =  Σ_{i=1}^{N} c_i r, w_i  = 0   (8.10)

for any choice of the coefficients c_i.
Since (8.10) has to be satisfied for any choice of the coefficients c_i (equivalent to saying 'any choice of h'), we can deduce that:

r, w_i = 0  for all i, i = 1, ..., N   (8.11)

which is simpler than (8.10). (It comes from the choice: c_i = 1 and all others = 0, in sequence.)
So, we can conclude that we don't need to test the residual against any (and all) possible functions h; we only need to test it against the chosen weighting functions w_i, i = 1, ..., N.
Now, we can expand the function u in terms of the trial (basis) functions as in (8.8) and use this expansion in r = Lu – s. With this, (8.11) becomes:

r, w_i  =  L u − s, w_i  =  L Σ_{j=1}^{N} d_j b_j − s, w_i  = 0  for all i

or

Σ_{j=1}^{N} d_j L b_j, w_i − s, w_i = 0  for all i   (8.12)
Note that in this expression the only unknowns are the coefficients d
j
.
We can rewrite this expression (8.12) as:

  Σ_{j=1}^{N} a_ij d_j = s_i  for all i, i = 1, ... , N

which can be put in matrix notation as:

  A d = s        (8.13)

where A = {a_ij}, d = {d_j} and s = {s_i}, with a_ij = ⟨L b_j, w_i⟩ and s_i = ⟨s, w_i⟩.
Since the trial functions b_j are known, the matrix elements a_ij are all known, and the
problem has been reduced to solving (8.13), that is, finding the coefficients d_j of the
expansion of the function u.
Different choices of the trial functions and the weighting functions define the different
variants by which this method is known:
i) w_i = b_i  –––––––>  Galerkin method
ii) w_i(x) = δ(x − x_i)  –––––––>  Point-matching method. This is equivalent to asking
for the residual to be zero at a fixed number of points in the domain.
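The assembly of (8.13) can be sketched in a few lines. This is a hedged illustration only: the model problem (L = −d²/dx² on [0, 1] with s = 1 and zero boundary values), the sine trial functions and the quadrature rule are all choices made here for demonstration, not part of the notes.

```python
import numpy as np

# Model problem (assumed for illustration): L u = -u'' = s, with
# s(x) = 1 on [0, 1] and u(0) = u(1) = 0; exact solution u = x(1 - x)/2.
# Trial functions b_j(x) = sin(j*pi*x) satisfy the boundary conditions;
# Galerkin choice of weighting functions: w_i = b_i.

N = 7                                  # number of trial functions
M = 4000                               # midpoint-rule quadrature points
h = 1.0 / M
x = (np.arange(M) + 0.5) * h

def b(j):                              # trial function b_j on the grid
    return np.sin(j * np.pi * x)

def Lb(j):                             # L b_j = -b_j'' (known analytically)
    return (j * np.pi) ** 2 * b(j)

s = np.ones(M)                         # source term s(x) = 1

# Assemble a_ij = <L b_j, w_i> and s_i = <s, w_i> (inner product = integral)
A = np.array([[h * np.sum(Lb(j) * b(i)) for j in range(1, N + 1)]
              for i in range(1, N + 1)])
rhs = np.array([h * np.sum(s * b(i)) for i in range(1, N + 1)])

d = np.linalg.solve(A, rhs)            # solve A d = s, eq. (8.13)

# Reconstruct u(0.5) from the expansion; the exact value is 0.125
u_mid = sum(d[j - 1] * np.sin(j * np.pi * 0.5) for j in range(1, N + 1))
print(round(u_mid, 4))
```

With the Galerkin choice w_i = b_i the matrix A here happens to be diagonal (the sines are orthogonal), but the same assembly works for any choice of weighting functions.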
Variational Method
The central idea of this method is to find a functional* (or variational expression)
associated with the boundary value problem, such that the solution of the BV problem leads to
a stationary value of the functional. Combining this idea with the Rayleigh-Ritz procedure we
can develop a systematic solution method.
Before introducing numerical techniques to solve variational formulations, let’s examine
an example to illustrate the nature of the variational approach:
* A functional is simply a function of a function over a complete domain which gives as a result
a number. For example: J(φ) = ∫_Ω φ² dΩ. This expression gives a numerical value for each
function φ we use in J(φ). Note that J is not a function of x.
Example:
The problem is to find the path of light travel between two points. We can start using a
variational approach, and for this we need a variational expression related to the problem. In
particular, for this case we can use Fermat’s Principle, which says that ‘light travels through
the quickest path’ (note that this is not necessarily the shortest). We can formulate this
statement in the form:

  time = min ∫_{P1}^{P2} ds/v(s)        (8.14)

where v(s) is the velocity point by point along the path. We are not really interested in the
actual time taken to go from P1 to P2, but in the conditions that this minimum imposes on the
path s(x,y) itself. That is, we want to find the actual path between these points that minimizes
this time. For the velocity we can write v = c/n, where n(x,y) is the refractive index.
Let’s consider first a uniform medium, that is, one with n uniform (constant). In that case,
(8.14) becomes:

  time = (n/c) min ∫_{P1}^{P2} ds = (n/c) min(path length)        (8.15)

The integral in this case reduces to the actual length of the path, so the above statement asks
for the path of ‘minimum length’, the shortest path between P1 and P2. Obviously, the solution
in this case is the straight line joining the points. However, you can see that (8.14) can also
be applied to an inhomogeneous medium, where the minimization process should lead to the actual
trajectory.
Extending the example a little further, let’s consider an interface between two media with
different refractive indices. Without loss of generality, we can consider a situation like that
of Fig. 8.7: P1 in the medium of index n1, P2 in the medium of index n2, and the interface along
the y-axis. From above, we know that in each medium the path will be a straight line, but what
are the coordinates of the point P0 on the y-axis (interface) where both straight lines meet?
Applying (8.14) gives:

  time = (1/c) min [ n1 ∫_{P1}^{P0} ds + n2 ∫_{P0}^{P2} ds ]        (8.16)

Fig. 8.7
Both integrals correspond to the respective lengths of the two branches of the total path, but
we don’t know the coordinate y0 of the point P0. We can rewrite (8.16) in the form:
  time = (1/c) min [ n1 √(x1² + (y1 − y0)²) + n2 √(x2² + (y2 − y0)²) ]        (8.17)

where the only variable (unknown) is y0. To find it we need to find the minimum, and for this
we do:

  d(time)/dy0 = 0 = n1 [−(y1 − y0)] / √(x1² + (y1 − y0)²) + n2 [−(y2 − y0)] / √(x2² + (y2 − y0)²)        (8.18)
Now, from the figure we can observe that the right hand side can be written in terms of the
angles of incidence and refraction as:

  n1 sin α1 − n2 sin α2 = 0        (8.19)

as the condition the point P0 must satisfy, and we know this is right because (8.19) is the
familiar Snell’s law.
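The same result can be checked numerically: minimize the travel time (8.17) directly and verify that the optimum satisfies (8.19). The geometry below (P1 = (−1, 1), P2 = (1, −1), n1 = 1, n2 = 1.5) is an assumed example, and the golden-section search is just one convenient minimizer.

```python
import math

# Geometry assumed for illustration: interface along the y-axis,
# P1 = (-1, 1) in the medium of index n1, P2 = (1, -1) in that of n2,
# and the crossing point P0 = (0, y0).
n1, n2 = 1.0, 1.5

def travel_time(y0):                    # (8.17) with the factor 1/c dropped
    L1 = math.hypot(1.0, 1.0 - y0)      # length of branch P1 -> P0
    L2 = math.hypot(1.0, -1.0 - y0)     # length of branch P0 -> P2
    return n1 * L1 + n2 * L2

# Golden-section search for the minimizing y0 in [-1, 1]
lo, hi = -1.0, 1.0
g = (math.sqrt(5.0) - 1.0) / 2.0
for _ in range(100):
    m1 = hi - g * (hi - lo)
    m2 = lo + g * (hi - lo)
    if travel_time(m1) < travel_time(m2):
        hi = m2
    else:
        lo = m1
y0 = 0.5 * (lo + hi)

# Check Snell's law (8.19) at the optimum: n1 sin(a1) = n2 sin(a2)
sin1 = (1.0 - y0) / math.hypot(1.0, 1.0 - y0)
sin2 = (y0 + 1.0) / math.hypot(1.0, -1.0 - y0)
print(n1 * sin1, n2 * sin2)             # the two sides nearly coincide
```

Since n2 > n1, the crossing point moves below the midpoint of the interface, exactly as refraction toward the normal predicts.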
So, the general idea of this variational approach is to formulate the problem in a form where we
look for a stationary condition (maximum, minimum, inflexion point) of some parameter which
depends on the desired solution. (In the problem above, that parameter was the time, which
depends on the actual path travelled.)
An important property of a variational approach is that precisely because the solution
function produces a stationary value of the functional, the functional is rather insensitive to
small perturbations (approximations) of the solution. This is a very desirable property,
particularly for the application of numerical methods, where all solutions are only approximate.
To illustrate this property, let’s analyse another example:
Example
Consider the problem of finding the natural resonant frequencies of a vibrating string (a
string on a guitar). For this problem, an appropriate variational expression is the following
(do not worry about where this comes from; there is a proof in the Appendix):

  k² = s.v. [ ∫_a^b (dy/dx)² dx / ∫_a^b y² dx ]        (8.20)

where s.v. denotes the stationary value. The above expression corresponds to the k-numbers, or
resonant frequencies, of a string vibrating freely and attached at its ends at a and b.
For simplicity, let’s change the limits to −a and a. The first mode of oscillation has the
form:

  y = A cos(πx/2a),  then  dy/dx = −(Aπ/2a) sin(πx/2a)

Using this exact solution in (8.20), we get for the first mode (prove this):

  k² = π²/4a²,  then  k = π/2a ≈ 1.571/a
Now, to show how a ‘bad’ approximation of the function y can still give a rather acceptable
value for k, let’s try a simple triangular shape (instead of the correct sinusoidal shape of the
vibrating string):
  y = A(1 + x/a)  for x < 0
  y = A(1 − x/a)  for x > 0

Fig. 8.8
then

  dy/dx = A/a  for x < 0;  dy/dx = −A/a  for x > 0

and using these values in (8.20) gives (prove this):

  k² = 3/a²,  then  k ≈ 1.732/a
which is not too bad considering how coarse the approximation to y(x) is. If instead of the
triangular shape we try a second order (parabolic) shape:

  y = A(1 − (x/a)²),  then  dy/dx = −(2A/a²) x

Substituting these values in (8.20) now gives:

  k² = 2.5/a²,  then  k ≈ 1.581/a
which is a rather good approximation.
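These three Rayleigh-quotient evaluations are easy to reproduce numerically. The sketch below, with a = 1 and A = 1 (values chosen only for this check; the quotient is independent of the amplitude A), evaluates (8.20) by midpoint quadrature for the exact cosine, the triangular shape and the parabolic shape:

```python
import math

# Rayleigh quotient (8.20) evaluated by midpoint quadrature on [-a, a].
# a = 1 and A = 1 are values chosen only for this check.
a = 1.0
M = 20000
h = 2.0 * a / M
xs = [-a + (i + 0.5) * h for i in range(M)]    # midpoint abscissae

def k(y, dy):                                  # square root of the quotient
    num = sum(dy(x) ** 2 for x in xs) * h
    den = sum(y(x) ** 2 for x in xs) * h
    return math.sqrt(num / den)

shapes = {
    "cosine":   (lambda x: math.cos(math.pi * x / (2 * a)),
                 lambda x: -(math.pi / (2 * a)) * math.sin(math.pi * x / (2 * a))),
    "triangle": (lambda x: 1 + x / a if x < 0 else 1 - x / a,
                 lambda x: 1 / a if x < 0 else -1 / a),
    "parabola": (lambda x: 1 - (x / a) ** 2,
                 lambda x: -2 * x / a ** 2),
}
for name, (y, dy) in shapes.items():
    print(name, round(k(y, dy), 3))    # ~1.571, ~1.732, ~1.581
```

The stationarity of the quotient is what makes even the crude triangle land within 10% of the exact k.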
Now, how can we use this method systematically? As said before, the use of the Rayleigh-Ritz
procedure permits us to construct a systematic numerical method. In summary, we can specify
the necessary steps as:

  BV problem: L u = s  ––––––––>  find a variational expression J(u)
  Use Rayleigh-Ritz: u = Σ_{j=1}^{N} d_j b_j and insert it in J(u)
  Find the stationary value of J(u) = J({d_j}), that is, find the coefficients d_j
  Reconstruct u = Σ_{j=1}^{N} d_j b_j;  then u is the solution of the BV problem

We will skip here the problem of actually finding the variational expression corresponding
to a boundary value problem, simply saying that there are systematic methods to find them. We
will be concerned here with how to solve a problem once we already have a variational
formulation.
Example
For the problem of the square coaxial (or square capacitor) seen earlier, the BV problem is
defined by the Laplace equation ∇²φ = 0 with some boundary conditions (L = ∇², u = φ, s = 0).
An appropriate functional (variational expression) for this case is:

  J(φ) = ∫_Ω (∇φ)² dΩ        (given)  (8.21)
Using Rayleigh-Ritz:

  J(φ) = ∫_Ω [∇ Σ_{j=1}^{N} d_j b_j(x, y)]² dΩ = ∫_Ω [Σ_{j=1}^{N} d_j ∇b_j(x, y)]² dΩ

  J(φ) = Σ_{i=1}^{N} Σ_{j=1}^{N} d_i d_j ∫_Ω ∇b_i(x, y) · ∇b_j(x, y) dΩ        (8.22)

This can be written as the matrix expression J(d) = dᵀ A d, where d = {d_j} and
a_ij = ∫_Ω ∇b_i · ∇b_j dΩ.
Now, find the stationary value:

  ∂J/∂d_i = 0  for all i, i = 1, ... , N

so, applying it to (8.22) (and using the symmetry of A):

  ∂J/∂d_i = 2 Σ_{j=1}^{N} a_ij d_j = 0,  for all i, i = 1, ... , N

and the problem reduces to the matrix equation:

  A d = 0        (8.23)

Solving the system of equations (8.23) we obtain the coefficients d_j, and the unknown
function can be obtained as:

  u(x, y) = Σ_{j=1}^{N} d_j b_j(x, y)
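As a minimal sketch of this procedure, consider the 1D analogue of (8.21), J(φ) = ∫ (dφ/dx)² dx on [0, 1], with boundary values φ(0) = 0 and φ(1) = 1 assumed here for illustration, expanded in piecewise-linear ‘hat’ basis functions (each equal to 1 at one node and 0 at the others; these are the shape functions of the finite element section below). Minimizing J over the free nodal values gives a system like (8.23), except that the rows for the free nodes pick up a right-hand side from the fixed boundary values:

```python
import numpy as np

# 1D analogue (assumed for illustration) of minimizing J = ∫ (dphi/dx)^2 dx
# on [0, 1] with phi(0) = 0 and phi(1) = 1, using hat basis functions.

n = 11                          # nodes x = 0, 0.1, ..., 1
h = 1.0 / (n - 1)
# a_ij = ∫ (db_i/dx)(db_j/dx) dx for uniform hats: 2/h diagonal, -1/h off
A = (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h

phi = np.zeros(n)
phi[0], phi[-1] = 0.0, 1.0      # fixed boundary nodal values
free = slice(1, n - 1)
# Stationarity for the free nodes: sum_j a_ij phi_j = 0, i.e. the system
# A_ff phi_f = -(columns of A for the boundary nodes) * boundary values
rhs = -A[free, 0] * phi[0] - A[free, -1] * phi[-1]
phi[free] = np.linalg.solve(A[free, free], rhs)

print(phi)                      # the linear ramp from 0 to 1
```

The computed nodal values fall on the straight ramp from 0 to 1, the exact solution of the 1D Laplace equation with these boundary values.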
We can see that both methods, weighted residuals and the variational method, transform
the BV problem into an algebraic (matrix) problem.
One of the first steps in the implementation of either method is the choice of appropriate
expansion functions to use in the Rayleigh-Ritz procedure: the basis (trial) functions and the
weighting functions.
The finite element method provides a simple way to construct these functions and to
implement these methods.
9. FINITE ELEMENTS
As the name suggests, the finite element method is based on the division of the domain of
interest Ω into ‘elements’, or small pieces E_i, that cover Ω completely but without
intersections. They constitute a tessellation (tiling) of Ω:

  ∪_i E_i = Ω;  E_i ∩ E_j = ∅  (i ≠ j)
Over the subdivided domain we apply the methods discussed earlier (either weighted
residuals or variational). The basis functions are defined locally in each element, and because
each element is small, these functions can be very simple and still constitute, overall, a good
approximation of the desired function. In this form, inside each element E_e the wanted function
u(x) in (8.1) is represented by a local approximation ũ_e(x), valid only in element number e:
E_e. The complete function u(x) over the whole domain Ω is then simply approximated by the
addition of all the local pieces:

  u(x) ≈ Σ_e ũ_e(x)
An important characteristic of this method is that it is ‘exact in the limit’; that is, the
degree of approximation can only improve as the number of elements increases, the solution
gradually and monotonically converging to the exact value. In this form, a solution can always
be obtained to any degree of approximation, provided sufficient computer resources are available.
One dimensional problems
Let’s consider first a one dimensional case with Ω as the interval [a, b]. We first divide Ω
into N subintervals, not necessarily equal in size, defined by N+1 nodes x_i, as shown in the
figure. The wanted function u(x) is then locally approximated in each subinterval by a simple
function, for example a straight line (a function of the form ax + b), defined differently in
each subinterval. The total approximation ũ(x) is then the superposition of all these locally
defined functions, a piecewise linear approximation to u(x). The amount of error in this
approximation will depend on the size of each element (and consequently on the total number of
elements) and, more importantly, on their size in relation to the local variation of the desired
function.
Fig. 9.1 Piecewise linear approximation        Fig. 9.2 Shape functions
The function ũ(x), approximation to u(x), is the addition of the locally defined functions
ũ_e(x), which are only nonzero in the subinterval e. Now, these local functions ũ_e(x) can be
defined as the superposition of interpolation functions N_i(x) and N_{i+1}(x), as shown in the
figure above (right). From the figure we can see that the function ũ_e(x), the local
approximation to u(x) in element e, can be written as:

  ũ_e(x) = u_i N_i(x) + u_{i+1} N_{i+1}(x)        (9.1)
Now, if we consider the neighbouring subintervals and the associated interpolation
functions as in the figure below, we can extend the definition of these functions to form the
triangular shapes shown there:

Fig. 9.3 Shape functions and approximation over two adjacent intervals
With this definition, we can write for the function ũ(x) in the complete domain Ω:

  u(x) ≈ ũ(x) = Σ_{e=1}^{Ne} ũ_e(x) = Σ_{i=1}^{Np} u_i N_i(x)        (9.2)

(Np is the number of nodes, Ne is the number of elements). The functions N_i(x) are defined as
the (double-sided) interpolation functions at node i, i = 1, ... , Np, so that N_i(x) = 1 at
node i and 0 at all other nodes.
This form of expanding u(x) has a very useful property:
— The trial functions N_i(x), known in the FE method as shape functions, are
interpolation functions; as a consequence, the coefficients u_i of the expansion (9.2)
are the nodal values, that is, the values of the wanted function at the nodal points.
Solving the resultant matrix equation will give these values directly.
Exercise 9.1:
Examine this figure and that of the previous page and show that indeed (9.1) is valid in the
element (x_i, x_{i+1}) and so (9.2) is valid over the full domain.
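A small sketch of this interpolation property (the mesh and nodal values below are arbitrary example choices): each hat function N_i(x) is 1 at its own node and 0 at the others, so the expansion (9.2) reproduces the nodal values exactly.

```python
# 1D hat shape functions N_i(x) on an arbitrary (non-uniform) mesh.
# Mesh and nodal values are example choices, not from the notes.

nodes = [0.0, 0.3, 0.5, 0.9, 1.0]

def N(i, x):
    xi = nodes[i]
    left = nodes[i - 1] if i > 0 else None
    right = nodes[i + 1] if i < len(nodes) - 1 else None
    if left is not None and left <= x <= xi:
        return (x - left) / (xi - left)       # rising flank
    if right is not None and xi <= x <= right:
        return (right - x) / (right - xi)     # falling flank
    return 0.0                                # zero away from node i

def u_tilde(x, u):                            # the expansion (9.2)
    return sum(u[i] * N(i, x) for i in range(len(nodes)))

u = [1.0, 4.0, 2.0, 0.5, 3.0]                 # nodal values (arbitrary)
print([u_tilde(xn, u) for xn in nodes])       # reproduces u at the nodes
print(u_tilde(0.4, u))                        # halfway: average of 4 and 2
```

Between nodes the expansion is the straight line joining the two neighbouring nodal values, which is exactly the piecewise linear approximation of Fig. 9.1.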
Example
Consider the boundary value problem:

  d²y/dx² + k²y = 0  for x ∈ [a, b], with y(a) = y(b) = 0

This corresponds, for example, to the equation describing the envelope of the transverse
displacement of a vibrating string attached at both ends. A suitable variational expression
(corresponding to the resonant frequencies) is the following, as seen in (8.20) and in the
Appendix:
  J = k² = ∫_a^b (dy/dx)² dx / ∫_a^b y² dx        (given)  (9.3)

Using now the expansion:

  y = Σ_{j=1}^{N} y_j N_j(x)        (9.4)

where N is the number of nodes, y_j are the (unknown) nodal values and N_j(x) are the shape
functions.
From (9.3) we have k² = k²(y_j) = Q/R, where:

  Q = ∫_a^b [d/dx Σ_{j=1}^{N} y_j N_j(x)]² dx = ∫_a^b [Σ_{j=1}^{N} y_j dN_j/dx]² dx
    = ∫_a^b Σ_{j=1}^{N} Σ_{k=1}^{N} y_j (dN_j/dx) y_k (dN_k/dx) dx        (9.5)

and

  R = ∫_a^b [Σ_{j=1}^{N} y_j N_j(x)]² dx = ∫_a^b Σ_{j=1}^{N} Σ_{k=1}^{N} y_j N_j(x) y_k N_k(x) dx        (9.6)
To find the stationary value, we impose:

  dk²/dy_i = 0  for each y_i, i = 1, ... , N        (9.7)

But since k² = Q/R, then (denoting by primes the derivatives with respect to y_i):

  dk²/dy_i = (Q′R − QR′)/R²,  so  dk²/dy_i = 0 ⇒ Q′R = QR′  and  Q′ = (Q/R) R′ = k² R′

so finally:

  dQ/dy_i = k² dR/dy_i  for all y_i, i = 1, ... , N        (9.8)
We now have to evaluate these derivatives:

  dQ/dy_i = ∫_a^b (dN_i/dx) [Σ_{k=1}^{N} y_k dN_k/dx] dx + ∫_a^b [Σ_{j=1}^{N} y_j dN_j/dx] (dN_i/dx) dx

or

  dQ/dy_i = 2 ∫_a^b (dN_i/dx) [Σ_{j=1}^{N} y_j dN_j/dx] dx = 2 Σ_{j=1}^{N} y_j [∫_a^b (dN_i/dx)(dN_j/dx) dx]

which can be written as:

  dQ/dy_i = 2 Σ_{j=1}^{N} a_ij y_j,  where  a_ij = ∫_a^b (dN_i/dx)(dN_j/dx) dx        (9.9)
For the second term:

  dR/dy_i = ∫_a^b N_i(x) [Σ_{j=1}^{N} y_j N_j(x)] dx + ∫_a^b [Σ_{k=1}^{N} y_k N_k(x)] N_i(x) dx

or

  dR/dy_i = 2 Σ_{j=1}^{N} y_j ∫_a^b N_i(x) N_j(x) dx = 2 Σ_{j=1}^{N} b_ij y_j,
  with  b_ij = ∫_a^b N_i(x) N_j(x) dx        (9.10)
Replacing (9.9) and (9.10) in (9.8), we can write the matrix equation:

  A y = k² B y        (9.11)

where A = {a_ij}, B = {b_ij} and y = {y_j} is the vector of nodal values.
Equation (9.11) is a matrix eigenvalue problem. The solution will give the eigenvalues k²
and the corresponding solution vectors y (lists of nodal values of the function y(x)).
The matrix elements are:

  a_ij = ∫_a^b (dN_i/dx)(dN_j/dx) dx  and  b_ij = ∫_a^b N_i N_j dx        (9.12)

The shape functions N_i and N_j are only nonzero in the vicinity of nodes i and j respectively
(see the figure below). So a_ij = b_ij = 0 if N_i and N_j do not overlap; that is, if
j ≠ i−1, i, i+1. The matrices A and B are therefore tridiagonal: they have no more than 3
nonzero elements per row.
[Figure: the neighbouring shape functions N_i and N_j, centred at nodes i and j of the mesh]
Exercise 9.2:
Define generically the triangular functions N_i, integrate (9.12) and calculate the values of
the matrix elements a_ij and b_ij.
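As a hedged sketch of the whole procedure, the fragment below assembles A and B on a uniform mesh over [0, 1] with y(0) = y(1) = 0 (so only interior nodes appear) and solves (9.11). It uses the standard closed forms for uniform hat functions with spacing h, namely a_ii = 2/h, a_{i,i±1} = −1/h, b_ii = 2h/3, b_{i,i±1} = h/6 (these are what Exercise 9.2 asks you to derive):

```python
import numpy as np

# Uniform mesh on [0, 1], y(0) = y(1) = 0, so only interior nodes enter.
# Closed forms of (9.12) for uniform hats of spacing h (cf. Exercise 9.2):
#   a_ii = 2/h, a_{i,i±1} = -1/h;  b_ii = 2h/3, b_{i,i±1} = h/6
Ni = 19                                   # number of interior nodes
h = 1.0 / (Ni + 1)
A = (2.0 * np.eye(Ni) - np.eye(Ni, k=1) - np.eye(Ni, k=-1)) / h
B = (2.0 * np.eye(Ni) / 3.0 + (np.eye(Ni, k=1) + np.eye(Ni, k=-1)) / 6.0) * h

# Generalized eigenproblem A y = k^2 B y (9.11), solved here via B^-1 A
eigvals = np.linalg.eigvals(np.linalg.solve(B, A))
k = np.sqrt(np.sort(eigvals.real))

print(k[:3])                              # approximately pi, 2*pi, 3*pi
```

The smallest computed k-values approximate π, 2π, 3π, the exact resonant k-numbers of a unit-length string, and they converge to the exact values as the mesh is refined.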
Two Dimensional Finite Elements
In this case, a two dimensional region Ω is subdivided into smaller pieces or ‘elements’,
and the subdivision satisfies the same properties as in the one-dimensional case; that is, the
elements cover the region of interest completely and there are no intersections (no overlapping).
The most common way of subdividing a 2D region is by using triangles. Quadrilateral elements
are also used and have some useful properties, but by far the most common, versatile and easy
to use are triangles with straight sides. There are well-developed methods to produce an
appropriate meshing (or subdivision) of a 2D region into triangles, and they have maximum
flexibility to accommodate intricate shapes of the region of interest Ω.
The process of calculation follows the same route as in the 1D case. Now, a function of
two variables u(x,y) is approximated by shape functions N_j(x,y) defined as interpolation
functions over one element (in this case a triangle). This approximation is given by:
page 91 E763 (part 2) Numerical Methods
J = k
2
=
dy
dx

\

.

2
dx
a
b
∫
y
2
dx
a
b
∫
(given) (9.3)
Using now the expansion: y = y
j
N
j
(x)
j =1
N
∑
(9.4)
where N is the number of nodes, y
j
are the nodal values (unknown) and N
j
(x) are the shape
functions.
From (9.3) we have: k
2
= k
2
(y
j
) =
Q
R
where:
Q =
d
dx
y
j
N
j
(x)
j=1
N
∑

\

.


2
dx
a
b
∫
= y
j
dN
j
dx
j=1
N
∑
2
dx
a
b
∫
= y
j
dN
j
dx
y
k
dN
k
dx
k=1
N
∑
j =1
N
∑
dx
a
b
∫
(9.5)
and
R = y
j
N
j
(x)
j =1
N
∑
2
dx
a
b
∫
= y
j
N
j
(x) y
k
N
k
(x)
k=1
N
∑
j=1
N
∑
dx
a
b
∫
(9.6)
To find the stationary value, we do:
dk
2
dy
i
= 0 for each y
i
, i = 1, ... , N (9.7)
But since k
2
=
Q
R
, then
dk
2
dy
i
=
Q
'
R− QR
'
R
2
so
dk
2
dy
i
= 0 ⇒Q
'
R = QR
'
and
Q
'
=
Q
R
R
'
= k
2
R
'
so finally:
dQ
dy
i
= k
2
dR
dy
i
for all y
i
, i = 1, ... , N (9.8)
We now have to evaluate these derivatives:
dQ
dy
i
=
dN
i
dx
y
k
dN
k
dx
k=1
N
∑

\

.

dx
a
b
∫
+ y
j
dN
j
dx
j=1
N
∑

\

.


dN
i
dx
dx
a
b
∫
or
dQ
dy
i
= 2
dN
i
dx
y
j
dN
j
dx
j =1
N
∑

\

.


dx
a
b
∫
= 2 y
j
dN
i
dx
dN
j
dx
dx
a
b
∫

\

.


j =1
N
∑
which can be written as:
dQ
dy
i
= 2 a
ij
y
j
j =1
N
∑
where a
ij
=
dN
i
dx
dN
j
dx
dx
a
b
∫
(9.9)
For the second term:
dR
dy
i
= N
i
(x) y
j
N
j
(x)
j =1
N
∑

\

.


dx
a
b
∫
+ y
k
N
k
(x)
k=1
N
∑

\

.

N
i
(x) dx
a
b
∫
E763 (part 2) Numerical Methods page 92
or
dR
dy
i
= 2 y
j
N
i
(x)N
j
(x)dx
a
b
∫
j =1
N
∑
= 2 b
ij
y
j
j =1
N
∑
with b
ij
= N
i
(x)N
j
(x)dx
a
b
∫
(9.10)
Replacing (9.9) and (9.10) in (9.8), we can write the matrix equation:
A y = k
2
B y (9.11)
where A = { a
ij
} , B = { b
ij
} and y = { y
j
} is the vector of nodal values.
Equation (9.11) is a matrix eigenvalue problem. The solution will give the eigenvalue k
2
and the corresponding solution vector y (list of nodal values of the function y(x)).
The matrix elements are:

$$a_{ij} = \int_a^b \frac{dN_i}{dx} \frac{dN_j}{dx}\, dx \quad \text{and} \quad b_{ij} = \int_a^b N_i N_j\, dx \qquad (9.12)$$
The shape functions N_i and N_j are only nonzero in the vicinity of nodes i and j respectively (see figure below). So a_ij = b_ij = 0 if N_i and N_j do not overlap; that is, if j ≠ i–1, i, i+1. The matrices A and B are therefore tridiagonal: they have no more than three nonzero elements per row.
[Figure: triangular (hat) shape functions N_i and N_j centred at nodes i and j; neighbouring functions overlap only over shared elements.]
Exercise 9.2:
Define generically the triangular functions N_i, integrate (9.12) and calculate the value of the matrix elements a_ij and b_ij.
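To see how the pieces above fit together, here is a minimal sketch in Python (assuming NumPy is available) that assembles the tridiagonal matrices of (9.11) for y'' + k^2 y = 0 on [0, π] with y = 0 at both ends, using uniform linear elements. The element values used (±1/h for a_ij, h/3 and h/6 for b_ij) are those Exercise 9.2 asks you to derive; the mesh size is an arbitrary choice for illustration.

```python
import numpy as np

# Assemble A and B of (9.11) for y'' + k^2 y = 0 on [0, pi],
# y(0) = y(pi) = 0, with uniform linear ("hat") elements.
# Per element: int N_i' N_j' dx = +1/h (i=j) or -1/h (i!=j),
#              int N_i N_j  dx = h/3  (i=j) or  h/6 (i!=j).
n_el = 20                      # number of elements (illustrative)
h = np.pi / n_el               # uniform element length
n = n_el - 1                   # interior nodes only (Dirichlet BC)

A = np.zeros((n, n))           # {a_ij}, tridiagonal
B = np.zeros((n, n))           # {b_ij}, tridiagonal
for i in range(n):
    A[i, i] = 2.0 / h          # each interior node is shared by 2 elements
    B[i, i] = 2.0 * h / 3.0
    if i + 1 < n:              # coupling with the neighbouring node
        A[i, i + 1] = A[i + 1, i] = -1.0 / h
        B[i, i + 1] = B[i + 1, i] = h / 6.0

# Generalised eigenproblem A y = k^2 B y; the smallest k^2 should
# approximate the exact value 1 (eigenfunction y = sin x).
k2 = np.min(np.linalg.eigvals(np.linalg.solve(B, A)).real)
print(k2)
```

Note that the computed k^2 lies slightly above 1 and approaches it as the mesh is refined, illustrating the "exact in the limit" property.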
Two Dimensional Finite Elements

In this case, a two dimensional region Ω is subdivided into smaller pieces or 'elements', and the subdivision satisfies the same properties as in the one-dimensional case; that is, the elements cover the region of interest completely and there are no intersections (no overlapping). The most common way of subdividing a 2D region is into triangles. Quadrilateral elements are also used and have some useful properties, but by far the most common, versatile and easiest to use are triangles with straight sides. There are well-developed methods to produce an appropriate meshing (or subdivision) of a 2D region into triangles, and these have maximum flexibility to accommodate intricate shapes of the region of interest Ω.
The process of calculation follows the same route as in the 1D case. Now, a function of two variables u(x,y) is approximated by shape functions N_j(x,y) defined as interpolation functions over one element (in this case a triangle). This approximation is given by:

$$u(x,y) \approx \sum_{j=1}^{N} u_j N_j(x,y) \qquad (9.13)$$

where N is the number of nodes in the mesh and the coefficients u_j are the nodal values (unknown). The N_j(x,y) are the shape functions defined for every node of the mesh.
Fig. 9.4

The figure shows a rectangular region in the xy-plane subdivided into triangles, with the corresponding piecewise planar approximation to a function u(x,y) plotted along the vertical axis. Note that the approximation is composed of flat 'tiles' that fit exactly along the edges, so the approximation is continuous over the entire region but its derivatives are not. The approximation shown in the figure uses first order functions, that is, pieces of planes (flat tiles). Other types are also possible but require defining more nodes in each triangle. (While a plane is totally defined by 3 points, e.g. the nodal values, a second order surface will need 6 points, for example the values at the 3 vertices and at 3 midside points.)
For a first order approximation, the function u(x,y) is approximated in each triangle by a function of the form (first order in x and y):

$$\tilde{u}(x,y) = p + qx + ry = (1\ \ x\ \ y) \begin{pmatrix} p \\ q \\ r \end{pmatrix} \qquad (9.14)$$

where p, q and r are constants with different values in each triangle. Similarly to the one-dimensional case, this function can be written in terms of shape functions (interpolation polynomials):

$$u(x,y) \approx \tilde{u}(x,y) = u_1 N_1(x,y) + u_2 N_2(x,y) + u_3 N_3(x,y) \qquad (9.15)$$

for a triangle with nodes numbered 1, 2 and 3, with coordinates (x_1, y_1), (x_2, y_2) and (x_3, y_3).
The shape functions N_i are such that N_i = 1 at node i and 0 at all the others. It can be shown that the function N_i satisfying this property is:

$$N_i(x,y) = \frac{1}{2A} \left( a_i + b_i x + c_i y \right) \qquad (9.16)$$

where A is the area of the triangle and

a_1 = x_2 y_3 - x_3 y_2    b_1 = y_2 - y_3    c_1 = x_3 - x_2
a_2 = x_3 y_1 - x_1 y_3    b_2 = y_3 - y_1    c_2 = x_1 - x_3
a_3 = x_1 y_2 - x_2 y_1    b_3 = y_1 - y_2    c_3 = x_2 - x_1
See demonstration in the Appendix.
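The coefficients above are easy to compute in practice. The following is a minimal sketch in Python (assuming NumPy is available) that evaluates a_i, b_i, c_i and the area for one triangle and checks the defining property N_i = 1 at node i and 0 at the others; the vertex coordinates and the helper names `shape_coeffs` and `N` are arbitrary choices for illustration.

```python
import numpy as np

# Example triangle: arbitrary vertices, listed counter-clockwise.
verts = np.array([[0.0, 0.0], [2.0, 0.0], [0.5, 1.5]])

def shape_coeffs(verts):
    """Return (A, coeffs) with coeffs[i] = (a_i, b_i, c_i) as in (9.16)."""
    coeffs = []
    for i in range(3):
        (xj, yj), (xk, yk) = verts[(i + 1) % 3], verts[(i + 2) % 3]
        coeffs.append((xj * yk - xk * yj,   # a_i = x_j y_k - x_k y_j
                       yj - yk,             # b_i = y_j - y_k
                       xk - xj))            # c_i = x_k - x_j
    # The a_i sum to twice the (signed) area of the triangle.
    area = 0.5 * sum(a for a, _, _ in coeffs)
    return area, coeffs

def N(i, x, y, area, coeffs):
    """First order shape function N_i(x, y) = (a_i + b_i x + c_i y)/2A."""
    a, b, c = coeffs[i]
    return (a + b * x + c * y) / (2.0 * area)

area, coeffs = shape_coeffs(verts)
# Check the interpolation property: N_i(node j) = 1 if i == j else 0.
for i in range(3):
    for j in range(3):
        val = N(i, verts[j][0], verts[j][1], area, coeffs)
        assert abs(val - (1.0 if i == j else 0.0)) < 1e-12
```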
The function N_1 defined in (9.16) and shown below corresponds to the shape function (interpolation function) for the node numbered 1 in the triangle shown. Now, the same node can be a vertex of other neighbouring triangles in which we will also define a corresponding shape function for node 1 (with expressions like (9.16) but with different values of the constants a, b and c, i.e. a different orientation of the plane), building up the complete shape function for node 1: N_1. This is shown in the next figure for a node number i, which belongs to five triangles.
[Fig. 9.5: the facets u_1 N_1 and u_2 N_2 over one triangle with nodes 1, 2, 3. Fig. 9.6: the combined surface u_1 N_1 + u_2 N_2 + u_3 N_3 over the triangle. Fig. 9.7: the complete (pyramid-shaped) shape function for a node i belonging to five triangles.]
Joining all the facets of N_i, for each of the triangles that contain node i, we can refer to this function as N_i(x,y) and then, considering all the nodes of the mesh, the function u can be written as:

$$u(x,y) = \sum_{j=1}^{N} u_j N_j(x,y) \qquad (9.17)$$
We can now use this expansion for substitution in a variational expression or in the
weighted residuals expression, to obtain the corresponding matrix problem for the expansion
coefficients. An important advantage of this method is that these coefficients are precisely the
nodal values of the wanted function u, so the result is obtained immediately when solving the
matrix problem. An additional advantage is the sparsity of the resultant matrices.
MATRIX SPARSITY

Considering the global shape function for node i, shown in the figure above, we can see that it is zero at all other nodes of the mesh. If we now consider the global shape function corresponding to another node, say j, we can see that products of the form N_i N_j, or of derivatives of these functions, which will appear in the definition of the matrix elements (as seen in the 1-D case), will almost always be zero, except when the nodes i and j are either the same node or immediate neighbours, so that there is an overlap (see figure above). This implies that the corresponding matrices will be very sparse, which is very convenient in terms of computer requirements.
Example:
If we consider a simple mesh:

[Figure: a rectangular mesh with 12 nodes (numbered 1-12) and 12 triangles (numbered 1-12).]

The matrix sparsity pattern results:

[Figure: the 12 × 12 sparsity pattern of the assembled matrix; for instance, triangle 2 (nodes 2, 5 and 6) contributes to the entries coupling those three nodes, and row 11 receives contributions from triangles 10, 11 and 12.]
Example:
For the problem of finding the potential distribution in the square coaxial, or equivalently, the temperature distribution (in steady state) between the two square section surfaces, an appropriate variational expression is:

$$J = \int_\Omega (\nabla\phi)^2\, d\Omega \qquad (9.18)$$
and substituting (9.17):

$$J = \int_\Omega \left( \nabla \sum_{j=1}^{N} \phi_j N_j \right)^2 dx\,dy = \int_\Omega \left( \sum_{j=1}^{N} \phi_j \nabla N_j \right)^2 dx\,dy$$

which can be rewritten as:

$$J = \int_\Omega \left( \sum_{i=1}^{N} \phi_i \nabla N_i \right) \cdot \left( \sum_{j=1}^{N} \phi_j \nabla N_j \right) dx\,dy$$

or

$$J = \sum_{i=1}^{N} \sum_{j=1}^{N} \phi_i \phi_j \int_\Omega \nabla N_i \cdot \nabla N_j\, dx\,dy = \Phi^T A \Phi \qquad (9.19)$$

where A = {a_ij}, with

$$a_{ij} = \int_\Omega \nabla N_i \cdot \nabla N_j\, dx\,dy$$

and Φ = {φ_j}.

Note that again the coefficients a_ij can be calculated, and the only unknowns are the nodal values φ_j.
We now have to find the stationary value; that is, we set:

$$\frac{\partial J}{\partial \phi_i} = 0 \quad \text{for } i = 1, \ldots, N$$

which gives:

$$\frac{\partial J}{\partial \phi_i} = 2 \sum_{j=1}^{N} a_{ij} \phi_j = 0 \quad \text{for } i = 1, \ldots, N, \quad \text{that is:} \quad A\Phi = 0 \qquad (9.20)$$

Then, the resultant equation is (9.20): AΦ = 0. We need to evaluate first the elements of the matrix A. For this, we can consider the integral over the complete domain Ω as the sum of the integrals over each element Ω_k, k = 1, …, Ne (Ne elements in the mesh).
$$a_{ij} = \int_\Omega \nabla N_i \cdot \nabla N_j\, dx\,dy = \sum_{k=1}^{Ne} \int_{\Omega_k} \nabla N_i \cdot \nabla N_j\, dx\,dy \qquad (9.21)$$
Before calculating these values, let's consider the matrix sparsity; that is, let's see which elements of A are actually nonzero. As discussed earlier (page 45), a_ij will only be nonzero if the nodes i and j are both in the same triangle. In this way, the sum in (9.21) will only extend to at most two triangles for each combination of i and j.

Inside one triangle, the shape function N_i(x,y) defined for the node i is:
$$N_i(x,y) = \frac{1}{2A}(a_i + b_i x + c_i y); \quad \text{then:} \quad \nabla N_i = \frac{\partial N_i}{\partial x}\hat{x} + \frac{\partial N_i}{\partial y}\hat{y} = \frac{1}{2A}(b_i \hat{x} + c_i \hat{y})$$
And then:

$$a_{ij} = \sum_{k=1}^{Ne} \frac{1}{4A_k^2} \int_{\Omega_k} (b_i b_j + c_i c_j)\, dx\,dy = \sum_{k=1}^{Ne} \frac{1}{4A_k} (b_i b_j + c_i c_j) \qquad (9.22)$$

where the sum will only have a few terms (for those values of k corresponding to the triangles containing both nodes i and j). The values of A_k (the area of triangle k) and of b_i, b_j, c_i and c_j will be different for each triangle concerned.
In particular, considering for example the element a_47 corresponding to the mesh in the previous figure, the sum extends over the triangles containing nodes 4 and 7; that is, triangles number 5 and 6:

$$a_{47} = \frac{1}{4A_5}\left( b_4^{(5)} b_7^{(5)} + c_4^{(5)} c_7^{(5)} \right) + \frac{1}{4A_6}\left( b_4^{(6)} b_7^{(6)} + c_4^{(6)} c_7^{(6)} \right)$$
In this case, the integral in (9.22) reduces simply to the area of the triangle and the calculations are easy. However, in other cases the integration can be complicated because it has to be done over a triangle at an arbitrary position and orientation in the x–y plane.
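The element-by-element assembly described above can be sketched in a few lines of Python (assuming NumPy is available). The two-triangle mesh of the unit square below is an illustrative choice, not the mesh from the notes; the loop adds the contribution (b_i b_j + c_i c_j)/(4A_k) of each triangle to the global matrix A.

```python
import numpy as np

# Illustrative mesh: the unit square split into two triangles.
nodes = np.array([[0., 0.], [1., 0.], [1., 1.], [0., 1.]])
triangles = [(0, 1, 2), (0, 2, 3)]           # node indices per element

A = np.zeros((len(nodes), len(nodes)))       # global matrix {a_ij}
for tri in triangles:
    (x1, y1), (x2, y2), (x3, y3) = nodes[list(tri)]
    b = [y2 - y3, y3 - y1, y1 - y2]          # b_i as defined for (9.16)
    c = [x3 - x2, x1 - x3, x2 - x1]          # c_i as defined for (9.16)
    area = 0.5 * ((x2 - x1) * (y3 - y1) - (x3 - x1) * (y2 - y1))
    for i in range(3):                       # element contribution (9.22)
        for j in range(3):
            A[tri[i], tri[j]] += (b[i] * b[j] + c[i] * c[j]) / (4 * area)

# Sanity check: since the N_j sum to 1, their gradients sum to zero,
# so every row of A must sum to zero (constant fields cost no energy).
assert np.allclose(A.sum(axis=1), 0.0)
```

In a real solver the boundary nodes would then be eliminated (Dirichlet conditions) before solving AΦ = 0 with the prescribed boundary values moved to the right-hand side.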
For example, consider a variational expression like:

$$J = k^2 = \frac{\int (\nabla\phi)^2\, d\Omega}{\int \phi^2\, d\Omega} \qquad \text{(corresponding to (9.3) for 1D problems)} \qquad (9.23)$$

This results in an eigenvalue problem AΦ = k^2 BΦ, where the matrix B has the elements:

$$b_{ij} = \int_\Omega N_i N_j\, d\Omega \qquad (9.24)$$
The elements of A are the same as in (9.21).
Exercise 9.3:
Apply the Rayleigh-Ritz procedure to expression (9.23) and show that indeed the matrix elements a_ij and b_ij have the values given in (9.21) and (9.24).
For the case of integrals like those in (9.24):

$$b_{ij} = \int_\Omega N_i N_j\, d\Omega = \sum_{k=1}^{Ne} \int_{\Omega_k} N_i N_j\, d\Omega$$

The sum is over all triangles containing nodes i and j. For example, for the element 5,6 in the previous mesh, these will be triangles 2 and 7 only.
If we take triangle 7, its contribution to this element is the term:

$$b_{56} = \frac{1}{4A_7^2} \int_{\Omega_7} (a_5 + b_5 x + c_5 y)(a_6 + b_6 x + c_6 y)\, dx\,dy$$

or

$$b_{56} = \frac{1}{4A_7^2} \int_{\Omega_7} \left[ a_5 a_6 + (a_5 b_6 + a_6 b_5)x + (a_5 c_6 + a_6 c_5)y + b_5 b_6 x^2 + (b_5 c_6 + b_6 c_5)xy + c_5 c_6 y^2 \right] dx\,dy$$
Integrals like these need to be calculated for every pair of nodes in every triangle of the mesh. These calculations can be cumbersome if attempted directly. However, it is much simpler to use a transformation of coordinates into a local system. This has the advantage that the integrals can be calculated for just one model triangle and the result then converted back to the x and y coordinates. The most common system of local coordinates used for this purpose is the triangle area coordinates.
Triangle Area Coordinates

For a triangle as in the figure, any point inside can be specified by the coordinates x and y or, for example, by the coordinates ξ_1 and ξ_2. These coordinates are defined with value 1 at one node and zero at the opposite side, varying linearly between these limits. For the sake of symmetry, we can define 3 coordinates (ξ_1, ξ_2, ξ_3), when obviously only two are independent. A formal definition of these coordinates can be made in terms of the areas of the triangles shown in the figure.

[Fig. 9.8: triangle with vertices 1 (1,0,0), 2 (0,1,0) and 3 (0,0,1); lines of constant ξ_1 (from ξ_1 = 0 on side 2-3 to ξ_1 = 1 at vertex 1) and of constant ξ_2; a point P inside.]
The advantage of using this local coordinate system is that the actual shape of the triangle is not important. Any point inside is defined by its proportional distance to the sides, irrespective of the triangle's shape. In this way, calculations can be made in the model triangle using this system and then mapped back to the global coordinates.
If A_1, A_2 and A_3 are the areas of the triangles formed by P and two of the vertices of the triangle, the area coordinates are defined as the ratios:

$$\xi_i = \frac{A_i}{A}$$

where A is the area of the triangle. Note that the triangle marked with the dotted line has the same area as A_1 (same base, same height), so the line of constant ξ_1 is the one marked in the figure.

[Fig. 9.9: triangle with vertices 1, 2, 3; the sub-triangles of areas A_1, A_2 and A_3 formed by an interior point P; a dotted line of constant ξ_1.]
Since A_1 + A_2 + A_3 = A, we have that ξ_1 + ξ_2 + ξ_3 = 1, which shows their linear dependence.
N
i
(x, y) =
1
2A
(a
i
+ b
i
x + c
i
y); Then: ∇N
i
=
∂N
i
∂x
ˆ
x +
∂N
i
∂y
ˆ
y =
1
2A
(b
i
ˆ
x + c
i
ˆ
y )
And then:
a
ij
=
1
4A
k
2
(b
i
b
j
+ c
i
c
j
) dx dy
Ω
k
∫
k=1
Ne
∑
=
1
4A
k
(b
i
b
j
+ c
i
c
j
)
k=1
Ne
∑
(9.2)
where the sum will only have a few terms (for those values of k corresponding to the triangles
containing nodes i and j. The values of A
k
, the area of the triangle and b
i
, b
j
, c
i
and c
j
will be
different for each triangle concerned.
In particular, considering for example the element a
4 7
corresponding to the mesh in the
previous figure, the sum extends over the triangles containing nodes 4 and 7; that is triangles
number 5 and 6:
a
4 7
=
1
4A
5
b
4
(5)
b
7
( 5)
+ c
4
( 5)
c
7
( 5)
( )
+
1
4A
6
b
4
( 6)
b
7
( 6)
+ c
4
( 6)
c
7
( 6)
( )
In this case, the integral in (9.22) reduces simply to the area of the triangle and the
calculations are easy. However, in other cases the integration can be complicated because they
have to be done over the triangle, which is at any position and orientation in the x–y plane.
For example, solving a variational expression like:
J = k
2
=
(∇φ)
2
dΩ
∫
φ
2
dΩ
∫
(corresponding to (9.3) for 1D problems) (9.23)
This results in an eigenvalue problem AΦ = k
2
BΦ where the matrix B has the elements:
b
ij
= N
i
N
j
dΩ
Ω
∫
(9.24)
The elements of A are the same as in (9.24).
Exercise 9.3:
Apply the RayleighRitz procedure to expression (9.23) and show that indeed the matrix
elements a
ij
and b
ij
have the values given in (9.21) and (9.24).
For the case of integrals like those in (126): b
ij
= N
i
N
j
dΩ
Ω
∫
= N
i
N
j
dΩ
Ω
k
∫
k=1
Ne
∑
The sum is over all triangles containing nodes i and j. For example for the element 5,6 in the
previous mesh, these will be triangles 2 and 7 only.
If we take triangle 7, its contribution to this element is the term:
b
56
=
1
4A
7
(a
5
+ b
5
x + c
5
y)(a
6
+ b
6
x + c
6
y) dx dy
Ω
7
∫
E763 (part 2) Numerical Methods page 98
or
b
56
=
1
4A
7
[a
5
a
6
+(a
5
b
6
+ a
6
b
5
)x + (a
5
c
6
+ a
6
c
5
)y + b
5
b
6
x
2
Ω
7
∫
+(b
5
c
6
+ b
6
c
5
)xy + c
5
c
6
y
2
] dx dy
Integrals like these need to be calculated for every pair of nodes in every triangle of the mesh.
These calculations can be cumbersome is attempted directly. However, it is much simpler to use
a transformation of coordinates, into a local system. This has the advantage that the integrals
can be calculated for just one model triangle and then the result converted back to the x and y
coordinates. The most common system of local coordinates used for this purpose is the triangle
area coordinates.
Triangle Area Coordinates
For a triangle as in the figure, any point inside can
be specified by coordinates x and y or for example,
by the coordinates ξ
1
and ξ
2
.
These coordinates are defined with value 1 at
one node and zero at the opposite side, varying
linearly between these limits. For the sake of
symmetry, we can define 3 coordinates: (ξ
1
,ξ
2
,ξ
3
)
when obviously, only two are independent.
A formal definition of these coordinates can
be made in terms of the area of the triangles shown
in the figure.
P
1 (1,0,0)
ξ1=1
ξ1=0
ξ1
ξ2=0
3 (0.0.1)
ξ2
ξ2=1
2
(0.1.0)
Fig. 9.8
The advantage of using this local coordinate system is that the actual shape of the triangle
is not important. Any point inside is defined by its proportional distance to the sides,
irrespective of triangle shape. In this way, calculations can be made in model triangle using this
system and then, mapped back to the global coordinates.
If A
1
, A
2
and A
3
are the areas of the triangles
formed by P and two of the vertices of the
triangle, the area coordinates are defined as the
ratio:
ξ
i
=
A
i
A
where A is the area of the triangle.
Note that the triangle marked with the
dotted line has the same area as A
1
(same base,
same height), so the line of constant ξ
1
is the one
marked in the figure.
1
3
2
A
1
A
3
A
2
line of constant ξ
1
Fig. 9.9
Since A
1
+ A
2
+ A
3
= A we have that ξ
1
+ ξ
2
+ ξ
3
=1, which shows their linear
dependence.
The area of each of these triangles, for example A_1, can be calculated using (see Appendix):

$$A_1 = \frac{1}{2} \det \begin{pmatrix} 1 & x & y \\ 1 & x_2 & y_2 \\ 1 & x_3 & y_3 \end{pmatrix}$$

where x and y are the coordinates of the point P. Evaluating this determinant gives:

$$A_1 = \frac{1}{2} \left[ (x_2 y_3 - x_3 y_2) + (y_2 - y_3)x + (x_3 - x_2)y \right]$$
and using the definitions in the Appendix:

$$A_1 = \frac{1}{2}(a_1 + b_1 x + c_1 y)$$

Then:

$$\xi_1 = \frac{A_1}{A} = \frac{1}{2A}(a_1 + b_1 x + c_1 y), \quad \text{or} \quad \xi_1 = N_1(x,y) \qquad (9.25)$$

These coordinates therefore vary in the same way as the shape functions, which is quite convenient for calculations.
Expression (9.25) also gives us the required relationship between the local coordinates (ξ_1, ξ_2, ξ_3) and the global coordinates (x,y); that is, the expression we need to convert (x,y) into (ξ_1, ξ_2, ξ_3). We can also find the inverse relation, that is, (x,y) in terms of (ξ_1, ξ_2, ξ_3). This is all we need to convert from one system of coordinates to the other.
For the inverse relationship, we have from the Appendix:

$$(N_1\ \ N_2\ \ N_3) = (\xi_1\ \ \xi_2\ \ \xi_3) = (1\ \ x\ \ y) \begin{pmatrix} 1 & x_1 & y_1 \\ 1 & x_2 & y_2 \\ 1 & x_3 & y_3 \end{pmatrix}^{-1}$$

from where:

$$(1\ \ x\ \ y) = (\xi_1\ \ \xi_2\ \ \xi_3) \begin{pmatrix} 1 & x_1 & y_1 \\ 1 & x_2 & y_2 \\ 1 & x_3 & y_3 \end{pmatrix}$$

and expanding:

$$1 = \xi_1 + \xi_2 + \xi_3$$
$$x = x_1 \xi_1 + x_2 \xi_2 + x_3 \xi_3 \qquad (9.26)$$
$$y = y_1 \xi_1 + y_2 \xi_2 + y_3 \xi_3$$
Equations (9.25) and (9.26) will allow us to change from one system to the other.
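The coordinate change can be sketched directly from these two relations. The following is a minimal Python sketch (assuming NumPy is available); the triangle and the helper names `to_area` and `to_xy` are arbitrary choices for illustration.

```python
import numpy as np

# Arbitrary example triangle with vertices (x_i, y_i).
verts = np.array([[0.0, 0.0], [3.0, 0.0], [1.0, 2.0]])
M = np.hstack([np.ones((3, 1)), verts])      # rows (1, x_i, y_i)

def to_area(x, y):
    """(xi1, xi2, xi3) = (1, x, y) M^{-1}, as in (9.25)."""
    return np.array([1.0, x, y]) @ np.linalg.inv(M)

def to_xy(xi):
    """(1, x, y) = (xi1, xi2, xi3) M, as in (9.26); drop the leading 1."""
    return (np.asarray(xi) @ M)[1:]

xi = to_area(1.5, 0.5)
assert abs(xi.sum() - 1.0) < 1e-12           # xi1 + xi2 + xi3 = 1
assert np.allclose(to_xy(xi), [1.5, 0.5])    # round trip recovers (x, y)
# A vertex maps to a unit coordinate, e.g. vertex 1 -> (1, 0, 0):
assert np.allclose(to_area(0.0, 0.0), [1.0, 0.0, 0.0])
```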
Finally, the evaluation of integrals can now be made in terms of the local coordinates in the usual way:

$$\int_{\Omega_k} f(x,y)\, dx\,dy = \int_{\Omega_k} f(\xi_1, \xi_2, \xi_3)\, |J|\, d\xi_1\, d\xi_2$$

where |J| is the Jacobian of the transformation from (ξ_1, ξ_2) to (x,y). In this case it is simply 2A, so the expression we need to use to transform integrals is:

$$\int_{\Omega_k} f(x,y)\, dx\,dy = 2A \int_{\Omega_k} f(\xi_1, \xi_2, \xi_3)\, d\xi_1\, d\xi_2 \qquad (9.27)$$
Example:
The integral (9.24) of the previous exercise is difficult to calculate in terms of x and y for an arbitrary triangle, and needs to be calculated separately for each pair of nodes in each triangle of the mesh. Transforming to (local) area coordinates, this is much simpler:

$$\int_{\Omega_k} N_i N_j\, dx\,dy = 2A \int_{\Omega_k} \xi_i \xi_j\, d\xi_1\, d\xi_2 \qquad (9.28)$$
To determine the limits of integration, we can see that the total area of the triangle can be covered by ribbons like that shown in the figure, with a width dξ_1 and length ξ_2. Now, we can note that at the left end of this ribbon ξ_3 = 0, so there ξ_2 = 1 − ξ_1. Moving the ribbon from the side 2-3 of the triangle to the vertex 1 (ξ_1 changing from 0 to 1) will cover the complete triangle, so the limits of integration must be:

ξ_2: from 0 to 1 − ξ_1, and ξ_1: from 0 to 1.

[Fig. 9.10: the reference triangle with vertex 1 (ξ_1 = 1) and side 2-3 (ξ_1 = 0); a ribbon of width dξ_1 running from ξ_2 = 0 to ξ_2 = 1 − ξ_1.]
So the integral of (9.28) becomes:

$$\int_{\Omega_k} N_i N_j\, dx\,dy = 2A \int_0^1 \int_0^{1-\xi_1} \xi_i \xi_j\, d\xi_2\, d\xi_1 \qquad (9.29)$$
Note that the above conclusion about integration limits is valid for any integral over the
triangle, not only the one used above.
We can now calculate these integrals. Taking first the case where i ≠ j:
a) choosing i = 1 and j = 2 (This is an arbitrary choice – you can check that any other
choice, e.g. 1,3 will give the same result).
$$I_{ij} = 2A \int_0^1 \int_0^{1-\xi_1} \xi_1 \xi_2\, d\xi_2\, d\xi_1 = A \int_0^1 \xi_1 (1-\xi_1)^2\, d\xi_1 = A\left( \frac{1}{2} - \frac{2}{3} + \frac{1}{4} \right) = \frac{A}{12}$$
b) For i = j, choosing i = 1:

$$I_{ii} = 2A \int_0^1 \int_0^{1-\xi_1} \xi_1^2\, d\xi_2\, d\xi_1 = 2A \int_0^1 \xi_1^2 (1-\xi_1)\, d\xi_1 = 2A\left( \frac{1}{3} - \frac{1}{4} \right) = \frac{A}{6}$$
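As a quick numerical sanity check of these two results, the integrals over the reference triangle can be approximated with a simple midpoint rule on a fine grid. This is a hedged sketch in plain Python (taking A = 1, so the quantities computed should approach 1/12 and 1/6); the grid resolution is an arbitrary choice.

```python
# Midpoint-rule check of 2A * int xi1*xi2 and 2A * int xi1^2
# over the triangle 0 <= xi2 <= 1 - xi1, 0 <= xi1 <= 1, with A = 1.
n = 1000                      # subdivisions per axis (illustrative)
d = 1.0 / n
I12 = 0.0                     # should approach A/12 = 1/12
I11 = 0.0                     # should approach A/6  = 1/6
for i in range(n):
    xi1 = (i + 0.5) * d       # cell midpoint in xi1
    for j in range(n):
        xi2 = (j + 0.5) * d   # cell midpoint in xi2
        if xi1 + xi2 < 1.0:   # keep only cells inside the triangle
            I12 += 2.0 * xi1 * xi2 * d * d
            I11 += 2.0 * xi1 * xi1 * d * d

assert abs(I12 - 1.0 / 12.0) < 1e-3
assert abs(I11 - 1.0 / 6.0) < 1e-3
```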
Once calculated in this form, the result can be used for any triangle irrespective of the
shape and position. We can see that for this integral, only the area A will change when applied
to different triangles.
Higher Order Shape Functions

In most cases involving second order differential equations, the resultant weighted residuals expression or the corresponding variational expression can be written without involving second order derivatives. In these cases, first order shape functions will be fine. However, if this is not possible, other types of elements/shape functions must be used. (Note that the second order derivatives of a first order function are zero everywhere.) We can also choose to use different shape functions even if we could use first order polynomials, for example, to get higher accuracy with fewer elements.
Second Order Shape Functions:

To define a first order polynomial in x and y (planar shape functions) we needed 3 degrees of freedom (the nodal values of u completely define the approximation). If we use second order shape functions, we need to fit this type of surface over each triangle. To do this uniquely, we need six degrees of freedom: either the 3 nodal values and the derivatives of u at those points, or the values at six points on the triangle. (This is analogous to fitting a second order curve over an interval: while a straight line is completely defined by 2 points, a second order curve needs 3.) Choosing the easier option, we form triangles with 6 points: the 3 vertices and 3 midside points:
[Figure: six-node triangle; vertices 1 (1,0,0), 2 (0,1,0) and 3 (0,0,1); midside nodes 4 (1/2, 1/2, 0), 5 (0, 1/2, 1/2) and 6 (1/2, 0, 1/2).]

The figure shows a triangle with 6 points identified by their triangle coordinates.
We now need to specify the shape functions. If we take first the function N_1(ξ_1, ξ_2, ξ_3), corresponding to node 1, we know that it should be 1 there and zero at every other node. That is, it must be 0 at ξ_1 = 0 (nodes 2, 3 and 5) and also at ξ_1 = 1/2 (nodes 4 and 6) (see Fig. 9.11). Then, we can simply write:

$$N_1 = \xi_1 (2\xi_1 - 1) \qquad (9.30)$$
In the same form we can write the corresponding shape functions for the other vertices (nodes 2 and 3). For the midside nodes, for example node 4, we have that N_4 should be zero at all nodes except 4, where its value is 1. We can see that all other nodes are either on the side 2-3 (where ξ_1 = 0) or on the side 3-1 (where ξ_2 = 0). So the function N_4 should be:

$$N_4 = 4\xi_1 \xi_2 \qquad (9.31)$$
The following figure shows the two types of second order shape functions, one for vertices and one for midside points.

[Fig. 9.12: the vertex shape function N_3 over the six-node triangle. Fig. 9.13: the midside shape function N_4.]
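The defining property of (9.30)-(9.31), "1 at its own node, 0 at the other five", is easy to verify directly in area coordinates. A minimal Python sketch, with the node numbering taken from the six-node figure above (vertices 1-3, midside nodes 4, 5, 6); the helper names are arbitrary:

```python
# Six nodes of the reference triangle in area coordinates
# (xi1, xi2, xi3): vertices 1-3, then midside nodes 4 (between
# 1 and 2), 5 (between 2 and 3) and 6 (between 3 and 1).
nodes = {1: (1, 0, 0), 2: (0, 1, 0), 3: (0, 0, 1),
         4: (0.5, 0.5, 0), 5: (0, 0.5, 0.5), 6: (0.5, 0, 0.5)}

def vertex_N(i, xi):
    """Vertex shape function, e.g. N_1 = xi1 (2 xi1 - 1) from (9.30)."""
    return xi[i - 1] * (2 * xi[i - 1] - 1)

def midside_N4(xi):
    """Midside shape function N_4 = 4 xi1 xi2 from (9.31)."""
    return 4 * xi[0] * xi[1]

# Each function is 1 at its own node and 0 at the other five.
for node, xi in nodes.items():
    assert vertex_N(1, xi) == (1 if node == 1 else 0)
    assert midside_N4(xi) == (1 if node == 4 else 0)
```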
Exercise 9.4:
For the mesh of second order triangles of the figure, find the corresponding sparsity pattern.

[Fig. 9.14: a mesh of second order triangles with nodes numbered 1-12.]
Exercise 9.5:
Use a similar reasoning to that used to define the second order shape functions (9.30)-(9.31) to find the third order shape functions in terms of triangle area coordinates. Note that in this case there are 3 different types of functions.

[Fig. 9.15: a ten-node third-order triangle; vertices 1 (1,0,0), 2 (0,1,0) and 3 (0,0,1); two nodes on each side at one-third spacings, e.g. (2/3, 1/3, 0) and (1/3, 2/3, 0); an interior node 10 at (1/3, 1/3, 1/3).]
page 101 E763 (part 2) Numerical Methods
Once calculated in this form, the result can be used for any triangle irrespective of the
shape and position. We can see that for this integral, only the area A will change when applied
to different triangles.
Higher Order Shape Functions
In most cases involving second order differential equations, the resultant weighted
residuals expression or the corresponding variational expression can be written without
involving second order derivatives. In these cases, first order shape functions will be fine.
However, if this is not possible, other types of elements/shape functions must be used. (Note that
second order derivatives of a first order function will be zero everywhere.) We can also choose
to use different shape functions even if we could use first order polynomials, for example, to get
higher accuracy with fewer elements.
Second Order Shape Functions:
To define a first order polynomial in x and y (planar shape functions) we needed 3 degrees
of freedom (the nodal values of u will completely define the approximation). If we use second
order shape functions, we need to fit this type of surface over each triangle. To do this uniquely,
we need six degrees of freedom: either the 3 nodal values and the derivatives of u at those points
or the values at six points on the triangle. (This is analogous to trying to fit a second order curve
to an interval – while a straight line is completely defined by 2 points, a second order curve will
need 3). Choosing the easier option, we form triangles with 6 points: the 3 vertices and 3
midside points:
[Fig. 9.11: a triangle with 6 points identified by their triangle coordinates; vertices 1, 2 and 3 at (1 0 0), (0 1 0) and (0 0 1), and midside nodes 4, 5 and 6 at (1/2 1/2 0), (0 1/2 1/2) and (1/2 0 1/2).]
We need to specify now the shape functions. If we take first the function N_1(ξ_1, ξ_2, ξ_3), corresponding to node 1, we know that it should be 1 there and zero at every other node. That is, it must be 0 at ξ_1 = 0 (nodes 2, 3 and 5) and also at ξ_1 = 1/2 (nodes 4 and 6). Then, we can simply write:

N_1 = ξ_1 (2ξ_1 − 1)    (9.30)
In the same form we can write the corresponding shape functions for the other vertices (nodes 2 and 3). For the midside nodes, for example node 4, we have that N_4 should be zero at all nodes except 4, where its value is 1. We can see that all other nodes are either on the side 2–3 (where ξ_1 = 0) or on the side 3–1 (where ξ_2 = 0). So the function N_4 should be:

N_4 = 4 ξ_1 ξ_2    (9.31)
The following figures show the two types of second order shape functions, one for vertices and one for midside points.

[Fig. 9.12: Shape function N_3(x).]    [Fig. 9.13: Shape function N_4(x).]
Exercise 9.4:
For the mesh of second order triangles of the figure, find the corresponding sparsity pattern.

[Fig. 9.14: a mesh of second order triangles with nodes numbered 1 to 12.]
Exercise 9.5:
Use a similar reasoning as that used to define the second order shape functions (9.30) and (9.31) to find the third order shape functions in terms of triangle area coordinates. Note that in this case there are 3 different types of functions.
[Fig. 9.15: a third order triangle with vertices 1, 2 and 3 at (1 0 0), (0 1 0) and (0 0 1), two nodes at the third points of each side (with coordinates such as (2/3 1/3 0) and (1/3 2/3 0)), and an interior node 10 at (1/3 1/3 1/3).]
APPENDIX
1. Taylor theorem
For a continuous function we have ∫_a^x f'(t) dt = f(x) − f(a); then, we can write:

f(x) = f(a) + ∫_a^x f'(t) dt  or  f(x) = f(a) + R_0(x)    (A1.1)

where the remainder is R_0(x) = ∫_a^x f'(t) dt.
We can now integrate R_0 by parts using:

u = f'(t),  du = f^(2)(t) dt;  dv = dt,  v = −(x − t)

giving:

R_0(x) = ∫_a^x f'(t) dt = −(x − t) f'(t) |_a^x + ∫_a^x (x − t) f^(2)(t) dt

which gives, after solving and substituting in (A1.1):
f(x) = f(a) + f'(a)(x − a) + ∫_a^x (x − t) f^(2)(t) dt

or

f(x) = f(a) + f'(a)(x − a) + R_1(x)    (A1.2)
We can also integrate R_1 by parts, using this time:

u = f^(2)(t),  du = f^(3)(t) dt;  dv = (x − t) dt,  v = −(x − t)^2/2

which gives:

R_1(x) = ∫_a^x (x − t) f^(2)(t) dt = −f^(2)(t) (x − t)^2/2 |_a^x + ∫_a^x (x − t)^2/2 f^(3)(t) dt

and again, after substituting in (A1.2), gives:

f(x) = f(a) + f'(a)(x − a) + f^(2)(a) (x − a)^2/2 + ∫_a^x (x − t)^2/2 f^(3)(t) dt
Proceeding in this way we get the expansion:

f(x) = f(a) + f'(a)(x − a) + f^(2)(a) (x − a)^2/2 + f^(3)(a) (x − a)^3/3! + ··· + f^(n)(a) (x − a)^n/n! + R_n    (A1.3)
where the remainder can be written as:

R_n(x) = ∫_a^x f^(n+1)(t) (x − t)^n/n! dt    (A1.4)
To find a more useful form for the remainder we need to invoke some general mathematical theorems:
First Theorem for the Mean Value of Integrals
If the function g(t) is continuous and integrable in the interval [a, x], then there exists a point ξ between a and x such that:

∫_a^x g(t) dt = g(ξ)(x − a).
This says simply that the integral can be represented by an average value g(ξ) of the function times the length of the interval. Because this average value must lie between the minimum and the maximum values and the function is continuous, there will be a point ξ for which the function takes this value.
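As a concrete illustration (an example chosen here, not from the text): for g(t) = t^2 on [0, 1] the integral is 1/3, so the theorem guarantees a point ξ with g(ξ) = 1/3, namely ξ = 1/√3:

```python
import math

# Illustration of the first mean value theorem for integrals with
# g(t) = t^2 on [a, x] = [0, 1]: the integral is 1/3, so there must be
# a point xi in (a, x) with g(xi) = 1/3, i.e. xi = 1/sqrt(3).

def integral_t2(a, x):
    """Exact integral of t^2 from a to x."""
    return (x**3 - a**3) / 3.0

a, x = 0.0, 1.0
average = integral_t2(a, x) / (x - a)   # the value g(xi) must take
xi = math.sqrt(average)                 # solve xi^2 = average on [0, 1]

assert a < xi < x
assert abs(xi**2 * (x - a) - integral_t2(a, x)) < 1e-12
```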
And in a more complex form:
Second Theorem for the Mean Value of Integrals
Now if the functions g and h are continuous and integrable in the interval and h does not change
sign in the interval, there exists a point ξ between a and x such that:
∫_a^x g(t) h(t) dt = g(ξ) ∫_a^x h(t) dt
If we now use this second theorem on expression (A1.4), with g(t) = f^(n+1)(t) and h(t) = (x − t)^n/n!, we get:

R_n(x) = f^(n+1)(ξ) ∫_a^x (x − t)^n/n! dt

which can be integrated, giving:

R_n(x) = f^(n+1)(ξ) (x − a)^(n+1)/(n + 1)!    (A1.5)
for the remainder of the Taylor expansion, where ξ is a point between a and x.
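A quick numerical check of (A1.3) and (A1.5) (a sketch with f = exp and a = 0, chosen because every derivative of exp is exp itself, so the unknown factor f^(n+1)(ξ) is bracketed by its values at the two endpoints):

```python
import math

# Numerical check of the Taylor expansion (A1.3) and remainder (A1.5)
# for f(x) = exp(x) about a = 0, where f^(n+1) = exp as well.

def taylor_exp(x, n):
    """n-th order Taylor polynomial of exp about 0."""
    return sum(x**k / math.factorial(k) for k in range(n + 1))

x, n = 0.5, 4
actual_remainder = math.exp(x) - taylor_exp(x, n)

# (A1.5): R_n(x) = f^(n+1)(xi) (x - a)^(n+1)/(n+1)! for some xi in (a, x),
# so for exp the remainder lies between the two endpoint evaluations.
lo = math.exp(0) * x**(n + 1) / math.factorial(n + 1)
hi = math.exp(x) * x**(n + 1) / math.factorial(n + 1)
assert lo <= actual_remainder <= hi
```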
page A1 E763 (part 2) Numerical Methods Appendix
APPENDIX
1. Taylor theorem
For a continuous function we have ) ( ) ( ) ( ' a f x f dt t f
x
a
− =
∫
, then, we can write:
∫
+ =
x
a
dt t f a f x f ) ( ' ) ( ) ( or ) ( ) ( ) (
0
x R a f x f + = (A1.1)
where the reminder is
∫
=
x
a
dt t f x R ) ( ' ) (
0
.
We can now integrate R
0
by parts using:
) (
) ( ) ( '
) 2 (
t x v dt dv
dt t f du t f u
− − = =
= =
giving:
∫ ∫
− + − − = =
x
a
x
a
x
a
dt t f t x t f t x dt t f x R ) ( ) ( ) ( ' ) ( ) ( ' ) (
) 2 (
0
which gives after solving and
substituting in (A1.1):
∫
− + − + =
x
a
dt t f t x a f a x a f x f ) ( ) ( ) ( ' ) ( ) ( ) (
) 2 (
or
) ( ) )( ( ' ) ( ) (
1
x R a x a f a f x f + − + = (A1.2)
We can also integrate R
1
by parts using this time:
2
) (
) (
) ( ) (
2
) 3 ( ) 2 (
t x
v dt t x dv
dt t f du t f u
−
− = − =
= =
which gives:
∫ ∫
− + − − = − =
x
a
x
a
x
a
dt t f t x t x
t f
dt t f t x x R ) ( ) ( ) (
2
) (
) ( ) ( ) (
) 3 ( 2 2
) 2 (
) 2 (
1
and again, after
substituting in (A1.2) gives:
∫
− + − + − + =
x
a
dt t f t x a x
a f
a x a f a f x f ) ( ) ( ) (
2
) (
) )( ( ' ) ( ) (
) 3 ( 2 2
) 2 (
Proceeding in this way we get the expansion:
n
n
n
R a x
n
a f
a x
a f
a x
a f
a x a f a f x f + − + + − + − + − + = ) (
!
) (
) (
! 3
) (
) (
2
) (
) )( ( ' ) ( ) (
) (
3
) 3 (
2
) 2 (
L (A1.3)
where the reminder can be written as:
∫
+
−
=
x
a
n
n
n
dt t f
n
t x
x R ) (
!
) (
) (
) 1 (
(A1.4)
To find a more useful for for the reminder we need to invoke some general mathematical
theorems:
E763 (part 2) Numerical Methods Appendix page A2
First Theorem for the Mean Value of Integrals
If the function ) (t g is continuous and integrable in the interval [a, x], then there exists a point ξ
between a and x such that: ) )( ( ) ( a x g dt t g
x
a
− =
∫
ξ .
This says simply that the integral can be represented by an average value of the function ) (ξ g
times the length of the interval. Because this average value must be between the minimum and
the maximum values and the function is continuous, there will be a point ξ for which the
function has this value.
And in a more complex form:
Second Theorem for the Mean Value of Integrals
Now if the functions g and h are continuous and integrable in the interval and h does not change
sign in the interval, there exists a point ξ between a and x such that:
∫ ∫
=
x
a
x
a
dt t h g dt t h t g ) ( ) ( ) ( ) ( ξ
If we now use this second theorem on expression (A1.4), with: ) ( ) (
) 1 (
t f t g
n+
= and
!
) (
) (
n
t x
t h
n
−
= , we get:
∫
−
=
+
x
a
n
n
n
dt
n
t x
f x R
!
) (
) ( ) (
) 1 (
ξ which can be integrated, giving:
1
) 1 (
) (
)! 1 (
) (
) (
+
+
−
+
=
n
n
n
t x
n
f
x R
ξ
(A1.5)
for the reminder of the Taylor expansion, where ξ is appoint between a and x.
2. Implementation of Gaussian Elimination
Referring to the steps (24)–(26) to eliminate the nonzero entries in the lower triangle of the matrix, we can see that (in step (25)) to eliminate the unknown x_1 from equation i (i = 2, …, N), that is, to eliminate the matrix elements a_i1, we need to subtract from eqn. i equation 1 times a_i1/a_11. With this, all the elements of column 1 except a_11 will be zero. After that, to remove x_2 from equations 3 to N (or elements a_i2, i = 3, …, N, as in (26)), we need to scale the new equation 2 by the factor a_i2/a_22 and subtract it from row i, i = 3, …, N. The pattern now repeats in the same form, so the steps for triangularizing matrix A (keeping just the upper triangle) can be implemented by the following code (in Matlab):
for k=1:n-1
  v(k+1:n)=A(k+1:n,k)/A(k,k); % find multipliers
  for i=k+1:n
    A(i,k+1:n)=A(i,k+1:n)-v(i)*A(k,k+1:n);
  end
end
In fact, this can be simplified by eliminating the second loop, noting that all operations on rows can be performed simultaneously. The simpler version of the code is then:

for k=1:n-1
  v(k+1:n)=A(k+1:n,k)/A(k,k); % find multipliers
  A(k+1:n,k+1:n)=A(k+1:n,k+1:n)-v(k+1:n)*A(k,k+1:n);
end
U=triu(A); % this function simply puts zeros in the lower triangle
The factorization is completed by calculating the lower triangular matrix L. The complete procedure can be implemented as follows:

function [L,U] = GE(A)
%
% Computes the LU factorization of matrix A
% input: matrix A
% output: matrices L and U
%
[n,n]=size(A);
for k=1:n-1
  A(k+1:n,k)=A(k+1:n,k)/A(k,k); % find multipliers
  A(k+1:n,k+1:n)=A(k+1:n,k+1:n)-A(k+1:n,k)*A(k,k+1:n);
end
L=eye(n,n) + tril(A,-1);
U=triu(A);
This function can be used to find the LU factors of a matrix A using dense storage. The function eye(n,n) returns the identity matrix of order n, and tril(A,-1) gives a lower triangular matrix with the elements of A in the lower triangle, excluding the diagonal, and setting all others to zero (l_ij = a_ij if j ≤ i−1, and 0 otherwise).
The solution of a linear system of equations Ax = b is completed (after the LU factorization of the matrix A is performed) with a forward substitution using matrix L, followed by a backward substitution using the matrix U. Forward substitution consists of finding the first unknown x_1 directly as b_1/l_11, substituting this value in the second equation to find x_2, and so on. This can be implemented by:
function x = LTriSol(L,b)
%
% Solves the triangular system Lx = b by forward substitution
%
n=length(b);
x=zeros(n,1); % a vector of zeros to start
for j=1:n-1
  x(j)=b(j)/L(j,j);
  b(j+1:n)=b(j+1:n)-x(j)*L(j+1:n,j);
end
x(n)=b(n)/L(n,n);
Backward substitution can be implemented in a similar form; this time the unknowns are found from the end upwards:

function x = UTriSol(U,b)
%
% Solves the triangular system Ux = b by backward substitution
%
n=length(b);
x=zeros(n,1);
for j=n:-1:2 % from n to 2, one by one
  x(j)=b(j)/U(j,j);
  b(1:j-1)=b(1:j-1)-x(j)*U(1:j-1,j);
end
x(1)=b(1)/U(1,1);
With these functions the solution of the system of equations Ax = b can be performed in
three steps by the code:
[L,U] = GE(A);
y = LTriSol(L,b);
x = UTriSol(U,y);
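For comparison, the same three-step solve can be sketched in Python/NumPy (a translation of GE, LTriSol and UTriSol above; like the Matlab version it performs no pivoting, so it assumes the elimination never meets a zero pivot):

```python
import numpy as np

# LU factorization without pivoting, then forward and backward
# substitution, mirroring the Matlab functions GE, LTriSol, UTriSol.

def ge(A):
    """Return (L, U) with L unit lower triangular and L @ U == A."""
    A = A.astype(float).copy()
    n = A.shape[0]
    for k in range(n - 1):
        A[k+1:, k] /= A[k, k]                             # multipliers
        A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:]) # rank-1 update
    return np.eye(n) + np.tril(A, -1), np.triu(A)

def forward(L, b):
    """Solve Lx = b by forward substitution."""
    b = b.astype(float).copy()
    n = len(b)
    x = np.zeros(n)
    for j in range(n - 1):
        x[j] = b[j] / L[j, j]
        b[j+1:] -= x[j] * L[j+1:, j]
    x[-1] = b[-1] / L[-1, -1]
    return x

def backward(U, b):
    """Solve Ux = b by backward substitution, from the end upwards."""
    b = b.astype(float).copy()
    n = len(b)
    x = np.zeros(n)
    for j in range(n - 1, 0, -1):
        x[j] = b[j] / U[j, j]
        b[:j] -= x[j] * U[:j, j]
    x[0] = b[0] / U[0, 0]
    return x

A = np.array([[4., 1., 0.], [1., 4., 1.], [0., 1., 4.]])
b = np.array([1., 2., 3.])
L, U = ge(A)
x = backward(U, forward(L, b))
assert np.allclose(A @ x, b)
```

The outer-product update in ge mirrors the vectorized Matlab rank-1 update A(k+1:n,k+1:n) - A(k+1:n,k)*A(k,k+1:n).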
Exercise
Use the functions GE, LTriSol and UTriSol to solve the system of equations
generated by the finite difference modelling of the square coaxial structure, given by equation
(7.14). You will have to complete first the matrix A given only schematically in (7.14), after
applying the boundary conditions. Note that because of the geometry of the structure, not all
rows will have the same pattern.
To input the matrix it might be useful to start with the Matlab command:
A = triu(tril(ones(n,n),1),-1)-5*eye(n,n)
This will generate a tridiagonal matrix of order n with −4 in the main diagonal and 1 in the two adjacent diagonals. After that, you will have to adjust any differences between this matrix and A. Compare the results with those obtained by Gauss-Seidel.
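The same starting matrix can be built with NumPy, whose tril and triu take the same diagonal-offset argument as Matlab's (a sketch for a small n):

```python
import numpy as np

# NumPy equivalent of the Matlab starting matrix: a tridiagonal matrix
# of order n with -4 on the main diagonal and 1 on the two adjacent
# diagonals.  tril(.,1) keeps everything up to the first superdiagonal;
# triu(.,-1) then keeps everything from the first subdiagonal upwards.
n = 5
A = np.triu(np.tril(np.ones((n, n)), 1), -1) - 5 * np.eye(n)

assert np.all(np.diag(A) == -4)
assert np.all(np.diag(A, 1) == 1) and np.all(np.diag(A, -1) == 1)
assert np.count_nonzero(A) == n + 2 * (n - 1)   # strictly tridiagonal
```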
3. Solution of Finite Difference Equations by GaussSeidel
Equation (7.14) is the matrix equation derived by applying the finite differences method to
the Laplace equation. Solving this system of equation by the method of GaussSeidel, as shown
in (5.23) and (5.24), consists of taking the original simultaneous equations, putting all diagonal
matrix terms on the lefthandside as in (5.24) and effectively putting them in a DO loop (after
initializing the ‘starting vector’).
Equation (7.9) is the typical equation, and putting the diagonal term on the lefthandside
gives:
φ_O = (1/4) (φ_N + φ_S + φ_E + φ_W)
We could write 56 lines of code (one per equation) or, even simpler, use subscripted variables inside a loop. In this case the elements of A are all either 0, +1 or −4, and are easily “generated during the algorithm” rather than actually stored in an array. This simplifies the computer program, and instead of A, the only array needed holds the current value of the vector elements:

x^T = (φ_1, φ_2, ..., φ_56).
The program can be simplified further by keeping the vector of 56 unknowns x in a 2D array z(11,11), to be identified spatially with the 2D Cartesian coordinates of the physical problem (see figure). For example, z(3,2) stores the value of φ_10. There is obviously scope to improve efficiency, since with this arrangement we store values corresponding to nodes with known, fixed voltage values, including all those nodes inside the inner conductor. None of these actually need to be stored, but doing so makes the program simpler. In this case, the program (in old Fortran 77) can be as simple as:
c Gauss-Seidel to solve Laplace between square inner & outer
c conductors. Note that this applies to any potential problem,
c including potential (voltage) distribution or steady state
c current flow or heat flow.
c
dimension z(11,11)
data z/121*0./
do i=4,8
do j=4,8
z(i,j)=1.
enddo
enddo
c
do n=1,30
do i=2,10
do j=2,10
if(z(i,j).lt.1.) z(i,j)=.25*(z(i-1,j)+z(i+1,j)+
$ z(i,j-1)+z(i,j+1))
enddo
enddo
c in each iteration print values in the 3rd row
write(*,6) n,(z(3,j),j=1,11)
enddo
c
write(*,7) z
6 format(1x,i2,11f7.4)
7 format('FINAL RESULTS=',//(/1x,11f7.4))
stop
end
In the program, first z(i,j) is initialized with zeros, then the values corresponding to the inner conductor are set to one (1 V). After this, the iterations start (to a maximum of 30) and the Gauss-Seidel equations are solved.
In order to check the convergence, the values of the potentials in one intermediate row (the third) are printed after every iteration. We can see that after 19 iterations there are no more changes (within 4 decimals). Naturally, a more efficient monitoring of convergence can be implemented whereby the changes are monitored, either on a point-by-point basis or as the norm of the difference, and the iterations are stopped when this value is within a prefixed precision.
The results are:
1 0.0000 0.0000 0.0000 0.2500 0.3125 0.3281 0.3320 0.3330 0.0833 0.0208 0.0000
2 0.0000 0.0000 0.1250 0.3750 0.4492 0.4717 0.4785 0.4181 0.1895 0.0699 0.0000
3 0.0000 0.0469 0.2109 0.4473 0.5225 0.5472 0.5399 0.4737 0.2579 0.1098 0.0000
4 0.0000 0.0908 0.2690 0.4922 0.5653 0.5865 0.5742 0.5082 0.3016 0.1369 0.0000
5 0.0000 0.1229 0.3076 0.5205 0.5902 0.6084 0.5944 0.5299 0.3293 0.1542 0.0000
6 0.0000 0.1446 0.3326 0.5381 0.6047 0.6211 0.6067 0.5435 0.3465 0.1650 0.0000
7 0.0000 0.1586 0.3484 0.5488 0.6133 0.6288 0.6143 0.5519 0.3572 0.1716 0.0000
8 0.0000 0.1675 0.3582 0.5553 0.6184 0.6334 0.6190 0.5572 0.3637 0.1756 0.0000
9 0.0000 0.1729 0.3641 0.5592 0.6215 0.6362 0.6218 0.5604 0.3676 0.1780 0.0000
10 0.0000 0.1763 0.3678 0.5616 0.6234 0.6379 0.6236 0.5624 0.3700 0.1794 0.0000
11 0.0000 0.1783 0.3700 0.5631 0.6245 0.6390 0.6247 0.5636 0.3713 0.1802 0.0000
12 0.0000 0.1795 0.3713 0.5639 0.6253 0.6396 0.6254 0.5643 0.3722 0.1807 0.0000
13 0.0000 0.1802 0.3721 0.5645 0.6257 0.6400 0.6258 0.5647 0.3727 0.1810 0.0000
14 0.0000 0.1807 0.3726 0.5648 0.6259 0.6403 0.6260 0.5649 0.3729 0.1812 0.0000
15 0.0000 0.1810 0.3729 0.5650 0.6261 0.6404 0.6261 0.5651 0.3731 0.1813 0.0000
16 0.0000 0.1811 0.3731 0.5651 0.6262 0.6405 0.6262 0.5652 0.3732 0.1813 0.0000
17 0.0000 0.1812 0.3732 0.5652 0.6263 0.6406 0.6263 0.5652 0.3733 0.1813 0.0000
18 0.0000 0.1813 0.3732 0.5652 0.6263 0.6406 0.6263 0.5652 0.3733 0.1814 0.0000
19 0.0000 0.1813 0.3733 0.5653 0.6263 0.6406 0.6263 0.5653 0.3733 0.1814 0.0000
20 0.0000 0.1814 0.3733 0.5653 0.6263 0.6406 0.6263 0.5653 0.3733 0.1814 0.0000
21 0.0000 0.1814 0.3733 0.5653 0.6263 0.6406 0.6263 0.5653 0.3733 0.1814 0.0000
22 0.0000 0.1814 0.3733 0.5653 0.6263 0.6406 0.6263 0.5653 0.3733 0.1814 0.0000
23 0.0000 0.1814 0.3733 0.5653 0.6263 0.6406 0.6263 0.5653 0.3734 0.1814 0.0000
24 0.0000 0.1814 0.3733 0.5653 0.6263 0.6406 0.6263 0.5653 0.3734 0.1814 0.0000
25 0.0000 0.1814 0.3734 0.5653 0.6263 0.6406 0.6263 0.5653 0.3734 0.1814 0.0000
26 0.0000 0.1814 0.3734 0.5653 0.6263 0.6406 0.6263 0.5653 0.3734 0.1814 0.0000
27 0.0000 0.1814 0.3734 0.5653 0.6263 0.6406 0.6263 0.5653 0.3734 0.1814 0.0000
28 0.0000 0.1814 0.3734 0.5653 0.6263 0.6406 0.6263 0.5653 0.3734 0.1814 0.0000
29 0.0000 0.1814 0.3734 0.5653 0.6263 0.6406 0.6263 0.5653 0.3734 0.1814 0.0000
30 0.0000 0.1814 0.3734 0.5653 0.6263 0.6406 0.6263 0.5653 0.3734 0.1814 0.0000
FINAL RESULTS=
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
0.0000 0.0907 0.1848 0.2615 0.2994 0.3099 0.2994 0.2615 0.1814 0.0907 0.0000
0.0000 0.1814 0.3734 0.5653 0.6263 0.6406 0.6263 0.5653 0.3734 0.1814 0.0000
0.0000 0.2615 0.5653 1.0000 1.0000 1.0000 1.0000 1.0000 0.5653 0.2615 0.0000
0.0000 0.2994 0.6263 1.0000 1.0000 1.0000 1.0000 1.0000 0.6263 0.2994 0.0000
0.0000 0.3099 0.6406 1.0000 1.0000 1.0000 1.0000 1.0000 0.6406 0.3099 0.0000
0.0000 0.2994 0.6263 1.0000 1.0000 1.0000 1.0000 1.0000 0.6263 0.2994 0.0000
0.0000 0.2615 0.5653 1.0000 1.0000 1.0000 1.0000 1.0000 0.5653 0.2615 0.0000
0.0000 0.1814 0.3734 0.5653 0.6263 0.6406 0.6263 0.5653 0.3734 0.1814 0.0000
0.0000 0.0907 0.1848 0.2615 0.2994 0.3099 0.2994 0.2615 0.1814 0.0907 0.0000
0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
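The Fortran program and the norm-based stopping test suggested above can be sketched in Python (a hedged translation: 0-based indices, a boolean mask of fixed nodes instead of the z < 1 test, and the largest point-wise change per sweep as the stopping criterion):

```python
import numpy as np

# Gauss-Seidel for the square coaxial problem on the same 11x11 grid:
# outer boundary fixed at 0 V, inner conductor (Fortran indices 4-8,
# i.e. Python 3:8) fixed at 1 V.
n = 11
z = np.zeros((n, n))
fixed = np.zeros((n, n), dtype=bool)
fixed[0, :] = fixed[-1, :] = fixed[:, 0] = fixed[:, -1] = True
z[3:8, 3:8] = 1.0
fixed[3:8, 3:8] = True

for sweep in range(200):
    change = 0.0
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            if not fixed[i, j]:
                new = 0.25 * (z[i-1, j] + z[i+1, j] + z[i, j-1] + z[i, j+1])
                change = max(change, abs(new - z[i, j]))
                z[i, j] = new
    if change < 1e-6:   # norm-based stopping test
        break

# The third row should reproduce the printed values (0.1814, 0.3734, ...).
assert abs(z[2, 1] - 0.1814) < 5e-4
assert abs(z[2, 2] - 0.3734) < 5e-4
```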
4. A variational formulation for the wave equation
Equation (8.20) in the example on the use of the variational method shows a variational formulation for the one-dimensional wave equation. In the following we will prove that indeed both problems are equivalent.
Equation (8.20) stated:
k^2 = s.v.  ∫_a^b (dy/dx)^2 dx / ∫_a^b y^2 dx    (A4.1)
Let’s examine a variation of k^2 produced by a small perturbation δy on the solution y:

k^2(y + δy) = k^2 + δk^2 = ∫_a^b (d(y + δy)/dx)^2 dx / ∫_a^b (y + δy)^2 dx
And rewriting:

(k^2 + δk^2) ∫_a^b (y + δy)^2 dx = ∫_a^b (d(y + δy)/dx)^2 dx    (A4.2)
Expanding and neglecting higher order variations:

(k^2 + δk^2) ∫_a^b (y^2 + 2yδy) dx = ∫_a^b [(dy/dx)^2 + 2(dy/dx)(dδy/dx)] dx

or

k^2 ∫_a^b y^2 dx + 2k^2 ∫_a^b yδy dx + δk^2 ∫_a^b y^2 dx = ∫_a^b (dy/dx)^2 dx + 2 ∫_a^b (dy/dx)(dδy/dx) dx    (A4.3)
and now using (A4.1):

2k^2 ∫_a^b yδy dx + δk^2 ∫_a^b y^2 dx = 2 ∫_a^b (dy/dx)(dδy/dx) dx    (A4.4)
Now, since we want k^2 to be stationary about the solution function y, we make δk^2 = 0, and we examine what conditions this imposes on the function y:

k^2 ∫_a^b yδy dx = ∫_a^b (dy/dx)(dδy/dx) dx    (A4.5)
Integrating the RHS by parts:

k^2 ∫_a^b yδy dx = [(dy/dx) δy]_a^b − ∫_a^b δy (d^2y/dx^2) dx
Or rearranging:

∫_a^b δy (d^2y/dx^2 + k^2 y) dx = [(dy/dx) δy]_a^b    (A4.6)
Since δy is arbitrary, (A4.6) can only be valid if both sides are zero. That means that y should satisfy the differential equation:

d^2y/dx^2 + k^2 y = 0    (A4.7)

and either of the boundary conditions:

dy/dx = 0 at a and b,  or  δy = 0 at a and b (fixed values of y at the ends).    (A4.8)
Summarizing, we can see that imposing the condition of stationarity of (A4.1) with respect to small variations of the function y leads to y satisfying the differential equation (A4.7), which is the wave equation, together with either of the boundary conditions (A4.8); that is, either fixed values of y at the ends (Dirichlet B.C.) or zero normal derivative (Neumann B.C.).
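The result can be checked numerically (a sketch with the hand-picked solution y = sin(πx) on [0, 1], which satisfies (A4.7) with k = π and Dirichlet ends): the quotient (A4.1) returns π², and perturbing y changes it only to second order in the perturbation:

```python
import math

# Evaluate the quotient (A4.1) with the trapezoidal rule, then verify
# the stationarity: a perturbation of size eps changes the quotient by
# O(eps^2) only.

def rayleigh(y, dy, a=0.0, b=1.0, m=2000):
    """Quotient (A4.1); dy must be the derivative of y."""
    h = (b - a) / m
    xs = [a + k * h for k in range(m + 1)]
    def trap(f):
        return h * (sum(f(x) for x in xs) - 0.5 * (f(a) + f(b)))
    return trap(lambda x: dy(x)**2) / trap(lambda x: y(x)**2)

k2 = rayleigh(lambda x: math.sin(math.pi * x),
              lambda x: math.pi * math.cos(math.pi * x))
assert abs(k2 - math.pi**2) < 1e-3          # recovers k^2 = pi^2

# Perturb by eps*sin(2 pi x), also zero at the ends: the first-order
# change vanishes, so the shift is of order eps^2.
eps = 1e-3
k2p = rayleigh(lambda x: math.sin(math.pi * x) + eps * math.sin(2 * math.pi * x),
               lambda x: math.pi * math.cos(math.pi * x)
                         + 2 * eps * math.pi * math.cos(2 * math.pi * x))
assert abs(k2p - k2) < 10 * eps**2 * math.pi**2
```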
page A9 E763 (part 2) Numerical Methods Appendix
4. A variational formulation for the wave equation
Equation (8.20) in the example on the use of the variational method shows a variational
formulation for the onedimensional wave equation. In the following we will prove that indeed
both problems are equivalent.
Equation (8.20) stated:
k
2
= s.v.
dy
dx

\

.

2
dx
a
b
∫
y
2
dx
a
b
∫
(A4.1)
Let’s examine a variation of k
2
produced by a small perturbation δy on the solution y:
k
2
(y + δy) = k
2
+ δk
2
=
d(y + δy)
dx

\

.

2
dx
a
b
∫
(y +δy)
2
dx
a
b
∫
And rewriting:
k
2
+δk
2
( )
(y +δy)
2
dx
a
b
∫
=
d(y + δy)
dx

\

.

2
dx
a
b
∫
(A4.2)
Expanding and neglecting higher order variations:
$$\left(k^2 + \delta k^2\right)\int_a^b \left(y^2 + 2y\,\delta y\right) dx = \int_a^b \left[\left(\frac{dy}{dx}\right)^2 + 2\,\frac{dy}{dx}\frac{d\,\delta y}{dx}\right] dx$$
or
$$k^2\int_a^b y^2\,dx + 2k^2\int_a^b y\,\delta y\,dx + \delta k^2\int_a^b y^2\,dx = \int_a^b \left(\frac{dy}{dx}\right)^2 dx + 2\int_a^b \frac{dy}{dx}\frac{d\,\delta y}{dx}\,dx \qquad\text{(A4.3)}$$
and now using (A4.1):
$$2k^2\int_a^b y\,\delta y\,dx + \delta k^2\int_a^b y^2\,dx = 2\int_a^b \frac{dy}{dx}\frac{d\,\delta y}{dx}\,dx \qquad\text{(A4.4)}$$
Now, since we want $k^2$ to be stationary about the solution function $y$, we set $\delta k^2 = 0$ and examine what conditions this imposes on the function $y$:

$$k^2\int_a^b y\,\delta y\,dx = \int_a^b \frac{dy}{dx}\frac{d\,\delta y}{dx}\,dx \qquad\text{(A4.5)}$$
Integrating the RHS by parts:

$$k^2\int_a^b y\,\delta y\,dx = \left.\frac{dy}{dx}\,\delta y\right|_a^b - \int_a^b \delta y\,\frac{d^2y}{dx^2}\,dx$$

Or rearranging:

$$\int_a^b \delta y\left(\frac{d^2y}{dx^2} + k^2 y\right) dx = \left.\frac{dy}{dx}\,\delta y\right|_a^b \qquad\text{(A4.6)}$$
Since δy is arbitrary, (A4.6) can only be valid if both sides are zero. That means that y should
satisfy the differential equation:
$$\frac{d^2y}{dx^2} + k^2 y = 0 \qquad\text{(A4.7)}$$
and either of the boundary conditions:

$$\frac{dy}{dx} = 0 \ \text{at}\ a \ \text{and}\ b \qquad\text{or}\qquad \delta y = 0 \ \text{at}\ a \ \text{and}\ b \ \text{(fixed values of } y \text{ at the ends)} \qquad\text{(A4.8)}$$
Summarizing, we can see that imposing the condition of stationarity of (A4.1) with respect to small variations of the function y leads to y satisfying the differential equation (A4.7), which is the wave equation, together with either of the boundary conditions (A4.8); that is, either fixed values of y at the ends (Dirichlet B.C.) or zero normal derivative (Neumann B.C.).
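The stationarity just proved can also be checked numerically: for an exact eigenfunction of (A4.7) with fixed ends, the quotient (A4.1) should return k². The following Python sketch (an illustration only; the notes themselves contain no code here) evaluates the quotient with the trapezoidal rule for y = sin(πx) on [0, 1], for which k² = π².

```python
import math

def rayleigh_quotient(y, dy, a, b, n=10000):
    # k^2 = integral of (dy/dx)^2 divided by integral of y^2, both by the
    # trapezoidal rule; the common step factor h cancels in the ratio
    num = sum(dy(a + i * (b - a) / n) ** 2 for i in range(n + 1)) \
          - 0.5 * (dy(a) ** 2 + dy(b) ** 2)
    den = sum(y(a + i * (b - a) / n) ** 2 for i in range(n + 1)) \
          - 0.5 * (y(a) ** 2 + y(b) ** 2)
    return num / den

# y = sin(pi x) satisfies (A4.7) with k = pi and y = 0 at both ends
k2 = rayleigh_quotient(lambda x: math.sin(math.pi * x),
                       lambda x: math.pi * math.cos(math.pi * x), 0.0, 1.0)
print(k2)   # close to pi^2 = 9.8696...
```

A perturbed trial function y + δy changes the quotient only to second order in δy, which is exactly the stationarity shown above.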
page A11 E763 (part 2) Numerical Methods Appendix
5. Area of a Triangle
For a triangle with nodes 1, 2 and 3 with coordinates $(x_1, y_1)$, $(x_2, y_2)$ and $(x_3, y_3)$:
[Figure: the triangle 1–2–3 inscribed in its bounding rectangle; the three leftover corner triangles are labelled A, B and C.]
The area of the triangle is: A = Area of rectangle – area(A) – area(B) – area(C)
Area of rectangle $= (x_2 - x_1)(y_3 - y_2) = x_2 y_3 + x_1 y_2 - x_1 y_3 - x_2 y_2$
Area (A) $= \frac{1}{2}(x_2 - x_3)(y_3 - y_2) = \frac{1}{2}\left(x_2 y_3 + x_3 y_2 - x_2 y_2 - x_3 y_3\right)$
Area (B) $= \frac{1}{2}(x_3 - x_1)(y_3 - y_1) = \frac{1}{2}\left(x_3 y_3 - x_3 y_1 - x_1 y_3 + x_1 y_1\right)$
Area (C) $= \frac{1}{2}(x_2 - x_1)(y_1 - y_2) = \frac{1}{2}\left(x_2 y_1 - x_2 y_2 - x_1 y_1 + x_1 y_2\right)$
Then, the area of the triangle is:
$$A = \frac{1}{2}\left[(x_2 y_3 - x_3 y_2) + (x_3 y_1 - x_1 y_3) + (x_1 y_2 - x_2 y_1)\right] \qquad\text{(A5.1)}$$
which can be written as:

$$A = \frac{1}{2}\det\begin{pmatrix} 1 & x_1 & y_1 \\ 1 & x_2 & y_2 \\ 1 & x_3 & y_3 \end{pmatrix} \qquad\text{(A5.2)}$$
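As a quick check of (A5.1)/(A5.2), a small Python function (an illustration, not part of the notes); note that the sign of A depends on the node ordering — counter-clockwise numbering gives a positive area:

```python
def triangle_area(p1, p2, p3):
    # A = (1/2) det [[1, x1, y1], [1, x2, y2], [1, x3, y3]]   (A5.2)
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    return 0.5 * ((x2 * y3 - x3 * y2)
                  + (x3 * y1 - x1 * y3)
                  + (x1 * y2 - x2 * y1))

print(triangle_area((0, 0), (4, 0), (0, 3)))   # 6.0 (base 4, height 3)
```

Reversing the node order (clockwise numbering) flips the sign of the result, which is why finite-element meshes keep a consistent node ordering.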
6. Shape Functions (Interpolation Functions)
The function u is approximated in each triangle by a first order function (a plane). This
will be given by an expression of the form: p + qx + ry which can also be written as a vector
product (equation 9.14).
$$\tilde{u}(x,y) = p + qx + ry = \begin{pmatrix} 1 & x & y \end{pmatrix}\begin{pmatrix} p \\ q \\ r \end{pmatrix} \qquad\text{(A6.1)}$$
Evaluating this expression at each node of a triangle (with nodes numbered 1, 2 and 3):
$$\left.\begin{aligned} u_1 &= p + qx_1 + ry_1 \\ u_2 &= p + qx_2 + ry_2 \\ u_3 &= p + qx_3 + ry_3 \end{aligned}\right\}\;\Rightarrow\; \begin{pmatrix} u_1 \\ u_2 \\ u_3 \end{pmatrix} = \begin{pmatrix} 1 & x_1 & y_1 \\ 1 & x_2 & y_2 \\ 1 & x_3 & y_3 \end{pmatrix}\begin{pmatrix} p \\ q \\ r \end{pmatrix} \qquad\text{(A6.2)}$$
And from here we can calculate the value of the constants p, q and r in terms of the nodal values
and the coordinates of the nodes:
$$\begin{pmatrix} p \\ q \\ r \end{pmatrix} = \begin{pmatrix} 1 & x_1 & y_1 \\ 1 & x_2 & y_2 \\ 1 & x_3 & y_3 \end{pmatrix}^{-1}\begin{pmatrix} u_1 \\ u_2 \\ u_3 \end{pmatrix} \qquad\text{(A6.3)}$$
Replacing (A6.3) in (A6.1):

$$\tilde{u}(x,y) = \begin{pmatrix} 1 & x & y \end{pmatrix}\begin{pmatrix} 1 & x_1 & y_1 \\ 1 & x_2 & y_2 \\ 1 & x_3 & y_3 \end{pmatrix}^{-1}\begin{pmatrix} u_1 \\ u_2 \\ u_3 \end{pmatrix} \qquad\text{(A6.4)}$$
and comparing with (9.15), written as a vector product:
$$u(x,y) \approx \tilde{u}(x,y) = u_1 N_1(x,y) + u_2 N_2(x,y) + u_3 N_3(x,y) = \begin{pmatrix} N_1 & N_2 & N_3 \end{pmatrix}\begin{pmatrix} u_1 \\ u_2 \\ u_3 \end{pmatrix} \qquad\text{(A6.5)}$$
we have finally:
$$\begin{pmatrix} N_1 & N_2 & N_3 \end{pmatrix} = \begin{pmatrix} 1 & x & y \end{pmatrix}\begin{pmatrix} 1 & x_1 & y_1 \\ 1 & x_2 & y_2 \\ 1 & x_3 & y_3 \end{pmatrix}^{-1} \qquad\text{(A6.6)}$$
Solving the right hand side (inverting the matrix and multiplying), gives the expression for
each shape function (9.16):
$$N_i(x,y) = \frac{1}{2A}\left(a_i + b_i x + c_i y\right) \qquad\text{(A6.7)}$$
where A is the area of the triangle and
$$\begin{aligned} a_1 &= x_2 y_3 - x_3 y_2 & \qquad b_1 &= y_2 - y_3 & \qquad c_1 &= x_3 - x_2 \\ a_2 &= x_3 y_1 - x_1 y_3 & b_2 &= y_3 - y_1 & c_2 &= x_1 - x_3 \\ a_3 &= x_1 y_2 - x_2 y_1 & b_3 &= y_1 - y_2 & c_3 &= x_2 - x_1 \end{aligned} \qquad\text{(A6.8)}$$
Note that from (A5.1), the area of the triangle can be written as:
$$A = \frac{1}{2}\left(a_1 + a_2 + a_3\right) \qquad\text{(A6.9)}$$
E763 (part 2) Numerical Methods
page 2
complete accuracy. Most numbers have an infinite decimal representation, which must be rounded. Even if the data in a problem can initially be expressed exactly by a finite decimal representation, divisions may introduce numbers that must be rounded, and multiplication will introduce more digits. This type of error has a random character that makes it difficult to deal with.

So far the error we have been discussing is the absolute error, defined by: error = true value – approximation. A problem with this definition is that it doesn't take into account the magnitude of the value being measured; for example, an absolute error of 1 cm has a very different significance in the length of a 100 m bridge or of a 10 cm bolt. Another definition, which reflects this significance, is the relative error, defined as: relative error = absolute error / true value. In the previous case, the bridge and the bolt have relative errors of 10⁻⁴ and 0.1 respectively; or in percent, 0.01% and 10%.
2. Machine representation of numbers
Not only do numbers have to be truncated to be represented (for manual calculation or in a machine), but all intermediate results are subject to the same restriction (over the complete calculation process). Of the continuum of real numbers, only a discrete set can be represented. Additionally, only a limited range of numbers can be represented in a given machine. Even if two numbers can be represented, the result of an arithmetic operation between them may not be representable, and there will be occasions where results fall outside the range of representable numbers (underflow or overflow errors).

Example: Some computers accept real numbers in a floating-point format with, say, two digits for the exponent; that is, in the form:

    ± 0.xxxxxx E ±xx    (mantissa and exponent)

which can only represent numbers y in the range 10⁻¹⁰⁰ ≤ |y| < 10⁹⁹. This is a normalized form, where the mantissa is defined such that 0.1 ≤ mantissa < 1. We will see this in more detail next.
Floating-Point Arithmetic and Round-off Error
Any number can be represented in a normalised format of the form x = ±a×10ᵇ in a decimal representation. We can even use any other number as the exponent base, giving the more general form x = ±a×βᵇ; common examples are binary, octal and hexadecimal representations. In all of these, a is called the mantissa, a number between 0.1 and 1 (in base β, between β⁻¹ and 1), which can be infinite in length (for example for π, or for 1/3 or 1/9); β is the base (10 for decimal representation) and b is the exponent (positive or negative), which can also be arbitrarily large. If we want to represent these numbers in a computer or use them in calculations, we need to truncate both the mantissa (to a certain number of places) and the exponent, which limits the range (or size) of the numbers that can be considered.

Now, if A is the set of numbers exactly representable in a given machine, the question arises of how to represent a number x not belonging to A (x ∉ A). This is encountered not only when reading data into a computer, but also when
representing intermediate results in the computer during a calculation: results of the elementary arithmetic operations between two numbers need not belong to A. Let's see first how a number is represented (truncated) in a machine. A machine representation can in most cases be obtained by rounding:

    x ⟶ fl(x)

Here and from now on, fl(x) will represent the truncated form of x (that is, with a truncated mantissa and limited exponent), and not just its representation in normalized floating-point format. For example, in a computer with t = 4 digit representation for the mantissa: fl(π) = 0.3142E1, fl(0.142853) = 0.1429E0, fl(14.28437) = 0.1428E2. In general, x is first represented in a normalized form (floating-point format): x = a×10ᵇ (if we consider a decimal system), where a is the mantissa and b the exponent, with 10⁻¹ ≤ a < 1 (so that the first digit of a after the decimal point is not 0). Then suppose that the decimal representation of a is given by:
    a = 0.α₁α₂α₃α₄…αᵢ…   with 0 ≤ αᵢ ≤ 9, α₁ ≠ 0   (2.1)

(where a can have infinitely many digits!)
Then, we form:

    a′ = 0.α₁α₂⋯αₜ          if 0 ≤ αₜ₊₁ ≤ 4
    a′ = 0.α₁α₂⋯αₜ + 10⁻ᵗ   if αₜ₊₁ ≥ 5          (2.2)
That is, only t digits are kept in the mantissa and the last one is rounded up: αt is incremented by 1 if the next digit αt+1 ≥ 5 and all digits after αt are deleted. The machine representation (truncated form of x) will be:
    fl(x) = sign(x) · a′ · 10ᵇ   (2.3)
The exponent b must also be limited. So a floating-point system will be characterized by the parameters: t, the length of the mantissa; β, the base (10 in the decimal case above); and L and U, the limits for the exponent b: L ≤ b ≤ U. To clarify some of the features of the floating-point representation, let's examine the system (β, t, L, U) = (10, 2, −1, 2). The mantissa can have only two digits, so there are only 90 possible forms for the mantissa (excluding 0 and negative numbers): 0.10, 0.11, 0.12, …, 0.98, 0.99.
The exponent can vary from −1 to 2, so it can only take 4 possible values: −1, 0, 1 and 2. Then, including now zero and negative numbers, this system can represent exactly only 2×90×4 + 1 = 721 numbers: the set of floating-point numbers is finite. The smallest positive number in the system is 0.10×10⁻¹ = 0.01
and the largest is 0.99×10² = 99. We can also see that the spacing between numbers in the set varies; the numbers are not equally spaced. At the small end (near 0): 0, 0.01, 0.02, etc., the spacing is 0.01, while at the other extreme: 97, 98, 99, the spacing is 1 (100 times bigger in absolute terms). In a system like this, we are interested in the relative error produced when any number is truncated in order to be represented in the system. (We are here excluding the underflow or overflow errors produced when we try to represent a number which is outside the range defined by L and U, for example 1000 or even 100 in the example above.) It can be shown that the relative error of fl(x) is bounded by:
    |fl(x) − x| / |x| ≤ 5×10⁻ᵗ = eps   (2.4)
This limit eps is defined here as the machine precision.
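The finiteness of such a system can be checked directly by enumerating the example system (β, t, L, U) = (10, 2, −1, 2) from above. This Python sketch is an illustration, not part of the notes:

```python
# Enumerate the example system (beta, t, L, U) = (10, 2, -1, 2):
# two-digit normalized mantissas 0.10 ... 0.99, exponents -1 ... 2.
numbers = {0.0}
for m in range(10, 100):              # mantissa 0.10 ... 0.99
    for e in range(-1, 3):            # exponent -1, 0, 1, 2
        v = (m / 100.0) * 10.0 ** e
        numbers.add(v)
        numbers.add(-v)

print(len(numbers))                       # 721: the set is finite
print(min(x for x in numbers if x > 0))   # smallest positive: 0.10 x 10^-1 = 0.01
print(max(numbers))                       # largest: 0.99 x 10^2 = 99
```

Listing the positive values in order also makes the uneven spacing visible: 0.01 steps near zero, steps of 1 between 97 and 99.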
Demonstration: From (2.1) and (2.2) — the normalized decimal representation of x and its truncated floating-point form — the maximum possible difference between the two forms is 5 at decimal position t+1 of the mantissa, that is:

    |fl(x) − x| ≤ 5×10⁻⁽ᵗ⁺¹⁾×10ᵇ

Also, since |x| ≥ 0.1×10ᵇ, we have 1/|x| ≤ 10×10⁻ᵇ, and multiplying the two bounds we obtain condition (2.4).
From equation (2.4) we can write:

    fl(x) = x(1 + ε)   (2.5)

where |ε| ≤ eps, for all numbers x. The quantity (1 + ε) in (2.5) cannot be distinguished from 1 in this machine representation, and the maximum value of ε is eps. So we can also define the machine precision eps as the smallest positive machine number g for which 1 + g > 1. Definition:

    eps = min{ g : g + 1 > 1, g > 0 }   (2.6)
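Definition (2.6) can be evaluated directly on a real machine. Note that most machines use binary, not the decimal model of the notes, so the quantity found below is the binary machine precision 2⁻⁵² of IEEE double precision; a Python illustration:

```python
# Find the smallest power of two g with 1 + g > 1 (definition 2.6,
# applied to the machine's own binary double-precision arithmetic)
g = 1.0
while 1.0 + g / 2 > 1.0:
    g /= 2
print(g)   # 2**-52, about 2.22e-16, for IEEE double precision
```

The loop stops because 1 + 2⁻⁵³ rounds back to exactly 1 in double precision.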
Error in basic operations
The result of arithmetic operations between machine numbers will also have to be represented as machine numbers. Then, for each arithmetic operation, and assuming that x and y are already machine numbers, we will have:

    x + y ⟶ fl(x + y) = (x + y)(1 + ε₁)   (2.7)
    x − y ⟶ fl(x − y) = (x − y)(1 + ε₂)   (2.8)
    x·y   ⟶ fl(x·y)   = (x·y)(1 + ε₃)     (2.9)
    x/y   ⟶ fl(x/y)   = (x/y)(1 + ε₄)     (2.10)

with all |εᵢ| ≤ eps.
If x and y are not floating-point numbers (machine numbers), they will have to be converted first, giving: x + y ⟶ fl(x + y) = fl(fl(x) + fl(y)), and similarly for the rest. Let's examine for example the subtraction of two such numbers, z = x − y, ignoring higher order error terms:

    fl(z) = fl(fl(x) − fl(y)) = (x(1 + ε₁) − y(1 + ε₂))(1 + ε₃)
          = ((x − y) + xε₁ − yε₂)(1 + ε₃) ≈ (x − y) + xε₁ − yε₂ + (x − y)ε₃

Then:

    |fl(z) − z| / |z| = |ε₃ + (xε₁ − yε₂)/(x − y)| ≤ eps (1 + (|x| + |y|)/|x − y|)

We can see that if x approaches y the relative error can blow up, especially for large values of x and y.

The maximal error bounds are pessimistic, and in practical calculations errors tend to cancel. For example, in adding 20000 numbers rounded to, say, 4 decimal places, the maximum error will be 0.5×10⁻⁴×20000 = 1 (imagining the maximum absolute truncation of 0.00005 in every case), while it is extremely improbable that this case occurs. From a statistical point of view, one can expect that in about 90% of the cases the error will not exceed 0.005.
Example: Let's compute the difference between a = 1200 and b = 1194 using a floating-point system with a 3-digit mantissa:

    fl(a − b) = fl(fl(1200) − fl(1194)) = fl(0.120×10⁴ − 0.119×10⁴) = 0.001×10⁴ = 10

where the correct value is 6, giving a relative error of 0.667 (or 66.7%). The machine precision for this system is eps = 5×10⁻³, and the error bound above gives a limit for the relative error of eps (1 + (a + b)/(a − b)) = 2.0.
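The example can be reproduced with Python's decimal module standing in for the 3-digit machine (an illustration, not part of the notes):

```python
from decimal import Context, ROUND_HALF_UP

def fl(x, t=3):
    # round x to a t-digit mantissa, as in (2.2)-(2.3)
    return float(Context(prec=t, rounding=ROUND_HALF_UP).create_decimal(repr(x)))

a, b = 1200.0, 1194.0
diff = fl(fl(a) - fl(b))     # fl(1194) = 1190, so the difference comes out as 10
rel = abs(diff - 6.0) / 6.0  # relative error against the true difference 6
print(diff, rel)             # 10.0 and about 0.667
```

All of the damage is done when 1194 is rounded to 1190; the subtraction itself is then exact, which is typical of cancellation errors.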
Error propagation in calculations
One of the important tasks in computer modelling is to find algorithms where the error propagation remains bounded. In this context, an algorithm is defined as a finite sequence of elementary operations that prescribe how to calculate the solution to a problem from given input data (as in a sequence of computer instructions). Problems can arise when one is not careful, as is shown in the following example:
Example: Assume that we want to calculate the sum of three floating-point numbers a, b, c. This has to be done in sequence, that is, using one of the next two algorithms:

    i) (a + b) + c    or    ii) a + (b + c)

If the numbers are in floating-point format with t = 8 digits and their values are, for example:

    a = 0.23371258E−4,  b = 0.33678429E2,  c = −0.33677811E2

the two algorithms will give the results:

    i)  0.64100000E−3
    ii) 0.64137126E−3

The exact result (which needs 14 digits to calculate) is 0.641371258E−3.
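The same run can be reproduced with an 8-digit decimal context in Python (an illustration; the decimal module plays the role of the t = 8 machine):

```python
from decimal import Decimal, Context, ROUND_HALF_UP

ctx = Context(prec=8, rounding=ROUND_HALF_UP)    # t = 8 digit mantissa

a = Decimal('0.23371258E-4')
b = Decimal('0.33678429E2')
c = Decimal('-0.33677811E2')

r1 = ctx.add(ctx.add(a, b), c)   # i)  (a + b) + c
r2 = ctx.add(a, ctx.add(b, c))   # ii) a + (b + c)
print(r1)   # 0.000641      -- only three correct digits
print(r2)   # 0.00064137126 -- all eight digits correct
```

In i), the small number a is swallowed by the large b before the cancellation; in ii), the cancellation b + c happens first and is exact, so a's digits survive.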
Exercise 2.1: Show, using an error analysis, why case ii) gives a more accurate result for the numbers of the example above. Neglect higher order error terms, that is, products of the form ε₁ε₂.

Example: Determine the error propagation in the calculation of y = (x − a)² using floating-point arithmetic, by two different algorithms, when x and a are already floating-point numbers.

a) Direct calculation: y = (x − a)²

    fl(y) = [(x − a)(1 + ε₁)]²(1 + ε₂) = (x − a)²(1 + ε₁)²(1 + ε₂)

and preserving only first order error terms:

    fl(y) = (x − a)²(1 + 2ε₁)(1 + ε₂) ≈ (x − a)²(1 + 2ε₁ + ε₂)

then:

    fl(y) − y = (x − a)²(2ε₁ + ε₂),  or  Δy = (fl(y) − y)/y = 2ε₁ + ε₂
We can see that the relative error in the calculation of y using this algorithm is given by 2ε₁ + ε₂, so it is less than 3 eps.

b) Using the expanded form: y = x² − 2ax + a²

    fl(y) = [(x²(1 + ε₁) − 2ax(1 + ε₂))(1 + ε₃) + a²(1 + ε₄)](1 + ε₅)

This is, taking the square of x first (with its error), subtracting the product 2ax (with its error and the error in the subtraction), and then adding the last term with its error and the corresponding error due to that addition. Keeping only first order error terms:

    fl(y) = x² − 2ax + a² + x²(ε₁ + ε₃) − 2ax(ε₂ + ε₃) + a²ε₄ + (x² − 2ax + a²)ε₅

from where we can write:

    Δy = (fl(y) − y)/y = ε₅ + a²ε₄/(x − a)² + x²(ε₁ + ε₃)/(x − a)² − 2ax(ε₂ + ε₃)/(x − a)²

and we can see that there will be problems with this calculation if (x − a)² is too small compared with either x² or a². The first term is bounded by eps, while each of the others is eps multiplied by an amplification factor x²/(x − a)², 2ax/(x − a)² or a²/(x − a)². For example, if x = 15 and a = 14, the three amplification factors will be respectively 225, 420 and 196, which gives a total error bound of (1 + 450 + 840 + 196) eps = 1487 eps, compared with 3 eps from algorithm a).

Exercise 2.2: For y = a − b, compare the error bounds when a and b are and are not already defined as floating-point numbers.

Exercise 2.3: Determine the error propagation characteristics of two algorithms to calculate a) y = x² − a² and b) y = (x − 1)³. Assume in both cases that x and a are floating-point numbers.
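The contrast between the two algorithms can be made visible with a 3-digit decimal context; the operands here (x = 4.57, a = 4.56) are chosen for illustration and are not from the notes:

```python
from decimal import Decimal, Context, ROUND_HALF_UP

ctx = Context(prec=3, rounding=ROUND_HALF_UP)    # 3-digit machine
x, a = Decimal('4.57'), Decimal('4.56')          # (x - a)^2 = 0.0001 exactly

# a) direct: (x - a)^2 -- the subtraction is exact, no amplification
direct = ctx.power(ctx.subtract(x, a), 2)

# b) expanded: x^2 - 2ax + a^2 -- each rounded term swamps the tiny result
x2 = ctx.power(x, 2)                                    # 20.9
two_ax = ctx.multiply(ctx.multiply(Decimal(2), a), x)   # 41.7
a2 = ctx.power(a, 2)                                    # 20.8
expanded = ctx.add(ctx.subtract(x2, two_ax), a2)

print(direct, expanded)   # 0.0001 versus 0: a 100% relative error in b)
```

Here (x − a)² = 10⁻⁴ while x² ≈ a² ≈ 20.9, so the amplification factors of algorithm b) are of the order of 2×10⁵ and the rounded expansion loses the result entirely.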
3. Root Finding: Solution of nonlinear equations

This is a problem of very common occurrence in engineering and physics. We'll study in this section several numerical methods to find roots of equations. The methods can be basically classified as "bracketing methods" and "open methods". In the first case, the solution lies inside an interval limited by a lower and an upper bound that "bracket" a root of the function; these methods are always convergent as the iterations progress. In contrast, the open methods are based on procedures that require one starting point, or two that do not necessarily enclose the root. Because of this, sometimes they diverge; however, when they do converge, they usually do so faster than the bracketing methods.

Bracketing Methods

The Bisection Method

If a function is continuous in the interval [a, b] and f(a) and f(b) are of different sign (f(a)f(b) < 0), then at least one root of f(x) will lie in that interval. The bisection method is an iterative procedure that starts with two points where f(x) has different sign. We now define the point

    c = (a + b)/2   (3.1)

as the midpoint of the interval [a, b], and examine the sign of f(c). There are now three possibilities: f(c) = 0, in which case the solution is c; f(c) is positive; or f(c) is negative. The next interval where to search for the root will be either [a, c] or [c, b], if a and c or c and b are of different sign, respectively. (Equivalently, we can search for the case f(a)f(c) < 0 or f(c)f(b) < 0.) The process then continues, each time halving the size of the search interval and "bracketing" the solution, as shown in Figs. 3.1 and 3.2.

[Figs. 3.1 and 3.2: the interval a, c, b and one halving step aₙ, cₙ, bₙ → aₙ₊₁, cₙ₊₁, bₙ₊₁.]

3.2 CONVERGENCE AND ERROR

Since we know that the solution lies in the interval [a, b], the absolute error for c is bounded by (b − a)/2. Halving the interval in each iteration, after n iterations the search interval will have reduced in length to (b − a)/2ⁿ, so the maximum error after n iterations is:

    |α − cₙ| ≤ (b − a)/2ⁿ   (3.2)

where α is the exact position of the root and cₙ is the nth approximation found by this method. Naturally, if at one stage the solution lies at the middle of the current interval, the search finishes early. An approximate relative error (or percent error) at iteration n+1 can be defined as:

    ε = |cₙ₊₁ − cₙ| / cₙ₊₁

but from the figure above we can see that cₙ₊₁ − cₙ = (bₙ₊₁ − aₙ₊₁)/2, and since cₙ₊₁ = (bₙ₊₁ + aₙ₊₁)/2, the relative error at iteration n+1 can be written as:

    ε = (bₙ₊₁ − aₙ₊₁) / (bₙ₊₁ + aₙ₊₁)   (3.3)

This expression can also be used to stop the iterations. Furthermore, if we want to find the solution with a tolerance ε (that is, |α − cₙ| ≤ ε), we can calculate the maximum number of iterations required from the expression above.

Exercise 3.1: Demonstrate that the number of iterations required to achieve a tolerance ε is the integer that satisfies:

    n ≥ (log(b − a) − log ε) / log 2   (3.4)

Example: The function f(x) = cos(3x) has one root in the interval [0, 1]. The following simple Matlab program implements the bisection method to find this root.

    a=0; b=1;          %limits of the search interval [a,b]
    eps=1e-6;          %sets tolerance to 1e-6
    fa=ff(a); fb=ff(b);
    if(fa*fb>0)
      disp('root not in the interval selected')
    else
      n=ceil((log(b-a)-log(eps))/log(2));  %rounded up to closest integer
      disp('Iteration number       a          b          c')
      for i=1:n
        c=a+0.5*(b-a);    %c is set as the midpoint between a and b
        disp([sprintf('%8d',i),' ',sprintf('%15.8f',a,b,c)])
        fc=ff(c);
        if(fa*fc)<0       %the root is between a and c
          b=c; fb=fc;
        elseif(fa*fc)>0   %the root is between c and b
          a=c; fa=fc;
        else
          return
        end
      end
    end

together with the function definition:

    function y=ff(x)
    y=cos(3*x);

The corresponding results are (first and last iterations shown; the root is α = 0.52359878…):

    Iteration number       a            b            c
        1             0.00000000   1.00000000   0.50000000
        2             0.50000000   1.00000000   0.75000000
        3             0.50000000   0.75000000   0.62500000
        4             0.50000000   0.62500000   0.56250000
        …
       20             0.52359772   0.52359962   0.52359867

Provided that the solution lies in the initial interval, this method will always converge to the solution and will find it within a required precision in a finite number of iterations. For the successive approximations c, the percent error is:

    Iteration number       c          error %
        1             0.50000000    4.50703414
        2             0.75000000   43.23944878
        3             0.62500000   19.36620732
        4             0.56250000    7.42958659
        5             0.53125000    1.46127622
        …

However, due to the rather blind choice of the solution (it is always chosen as the middle of the interval), the error doesn't vary monotonically.
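For readers without Matlab, the same bisection algorithm, with the iteration count taken from (3.4), can be sketched in Python (an illustration, not from the notes):

```python
import math

def bisect(f, a, b, tol=1e-6):
    if f(a) * f(b) > 0:
        raise ValueError('root not in the interval selected')
    # number of halvings so that (b - a)/2**n <= tol, equation (3.4)
    n = math.ceil((math.log(b - a) - math.log(tol)) / math.log(2))
    for _ in range(n):
        c = a + 0.5 * (b - a)
        if f(a) * f(c) <= 0:      # root between a and c
            b = c
        else:                     # root between c and b
            a = c
    return a + 0.5 * (b - a), n

root, n = bisect(lambda x: math.cos(3 * x), 0.0, 1.0)
print(n, root)    # 20 iterations; the root is pi/6 = 0.52359878...
```

With b − a = 1 and tol = 10⁻⁶, equation (3.4) gives n = 20, matching the Matlab run above.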
We can see that the error is not continually decreasing, although in the end it has to be small. This is due to the rather "brute force" nature of the algorithm: the approximation to the solution is chosen blindly as the midpoint of the interval, without any attempt at guessing its position inside the interval. A possible way to improve it is to select the point c by interpolating the values at a and b.

Regula Falsi Method

In this case, the next point is obtained by linear interpolation between the values at a and b. This is called the "regula falsi" method, or method of false position. In particular, if at some iteration n the magnitudes of f(aₙ) and f(bₙ) are very different, say |f(aₙ)| >> |f(bₙ)|, and the function is smooth, it is likely that the solution is closer to b than to a.

[Fig. 3.3: the straight line joining (a, f(a)) and (b, f(b)) crosses the x-axis at c.]

From the figure, we can see that:

    f(b)/(c − b) = f(a)/(c − a)

from where

    c = (a f(b) − b f(a)) / (f(b) − f(a))   (3.5)

or alternatively:

    c = a + f(a)(a − b) / (f(b) − f(a))   (3.6)

The algorithm is the same as the bisection method, except for the calculation of the point c. In this case, for the same function (f(x) = cos(3x) = 0), the solution, within the same absolute tolerance (10⁻⁶), is found in only 4 iterations:

    Iteration number       a            b            c            f(c)
        0             0.00000000   1.00000000   0.50251446    0.06321078
        1             0.50251446   1.00000000   0.53237237   −0.02631774
        2             0.50251446   0.53237237   0.52359536    0.00001025
        3             0.52359536   0.53237237   0.52359878   −0.00000008
We can see that the error decreases much more rapidly than in the bisection method (error %: 4.02680812, 1.67563307, 0.00065258, 0.00000342). The size of the interval also decreases more rapidly: b − a = 0.49748554, 0.02985791, 0.00877701, … against 0.5, 0.25, 0.125, 0.0625 for the bisection method.

However, this is not necessarily always the case, and on some occasions the search interval can remain large. In particular, one of the limits can remain stuck while the other converges to the solution. In that case, the length of the interval tends to a finite value instead of converging to zero. In the following example, for the function f(x) = x¹⁰ − 1 on the interval [0, 1.3], the solution requires 70 iterations to reach a tolerance of 10⁻⁶ with the regula falsi method, while only 24 are needed with the bisection method: the right side of the interval remains stuck at 1.3, and the size of the interval tends to a finite limit instead of converging to zero. The figure shows the interpolating lines at each iteration; the corresponding approximations are the points where these lines cross the x-axis.

[Fig. 3.4: standard regula falsi for f(x) = x¹⁰ − 1. Fig. 3.5: modified regula falsi for the same function.]

Modified regula falsi method

A modified version of the regula falsi method that improves this situation consists of an algorithm that detects when one of the limits gets stuck, and then divides the function value retained at that limit by 2, changing the slope of the approximating line. For the same example, this modified version reaches the solution in 13 iterations.
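A Python sketch of the standard regula falsi iteration, using (3.5) for c and the change in c as stopping criterion (an illustration, not from the notes):

```python
import math

def regula_falsi(f, a, b, tol=1e-6, max_iter=100):
    fa, fb = f(a), f(b)
    c_old = a
    for i in range(1, max_iter + 1):
        c = (a * fb - b * fa) / (fb - fa)     # equation (3.5)
        if abs(c - c_old) < tol:
            return c, i
        c_old = c
        fc = f(c)
        if fa * fc < 0:                       # root between a and c
            b, fb = c, fc
        else:                                 # root between c and b
            a, fa = c, fc
    return c, max_iter

root, its = regula_falsi(lambda x: math.cos(3 * x), 0.0, 1.0)
print(root)    # converges to pi/6 in a handful of iterations
```

Trying the same routine on x¹⁰ − 1 over [0, 1.3] reproduces the stuck-limit behaviour described above.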
Open Methods

These methods start with only one point, or two points that do not necessarily bracket the root.

Fixed point iteration

This method, one of the simplest, is also called one-point iteration or successive substitution. For a function of the form f(x) = 0, the method simply consists of rearranging it into the form:

    x = g(x)

Example: For the function 0.5x² − 1.1x + 0.505 = 0, the algorithm can be set as:

    x = g(x) = ((x − 0.1)² + 1)/2

With an initial guess introduced in the right hand side, a new value of x is obtained and the iteration can continue. Starting from the value x₀ = 0.5, the successive values are:

    Iteration number       x          error %
        1             0.58000000   13.79310345
        2             0.61520000    5.72171651
        3             0.63271552    2.76830889
        4             0.64189291    1.42973889
        5             0.64682396    0.76234834
        6             0.64950822    0.41327570
        7             0.65097964    0.22603166
        8             0.65178928    0.12421806
        9             0.65223571    0.06844503
       10             0.65248214    0.03776824
       11             0.65261826    0.02085727
       12             0.65269347    0.01152336

[Fig. 3.6: plot of y = x and y = g(x). Fig. 3.7: close-up showing the successive approximations converging to the crossing point near x = 0.65.]
Convergence

This is a very simple method, but solutions are not guaranteed; convergence often depends on how the problem is formulated. The following figures show situations where the method converges and where it diverges.

[Fig. 3.8: four different cases — (a) and (b) convergence, (c) and (d) divergence.]

From Fig. 3.8 (a)–(d) we can see that it is relatively easy to determine when the method will converge: for convergence to occur, the slope of g(x) should be less than that of x in the region of search; in a more rigorous way, |g′(x)| < 1. The best way of ensuring success is to plot the functions y = g(x) and y = x.

For example, for the function f(x) = 3x² + 3x − 1 = 0, with a solution at x₀ = 0.2637626, we can separate it in the following two forms:

    (a) x = g(x) = (1 − 3x²)/3    and    (b) x = g(x) = 3x² + 4x − 1

In the first case, g′(x) = −2x and then g′(x₀) = −0.5275252, while in the second case, g′(x) = 6x + 4 and g′(x₀) = 5.5825757: the first form can be expected to converge, the second to diverge. Additionally, divergence can also occur if the initial guess is not sufficiently close to the solution. If divergence is predicted, a different form of rewriting the problem f(x) = 0 as x = g(x) needs to be found that satisfies the condition above.
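The two rearrangements can be tried directly; as the slope condition predicts, form (a) converges while form (b) diverges. A Python sketch (illustration only):

```python
import math

def fixed_point(g, x0, tol=1e-6, max_iter=200):
    x = x0
    for i in range(1, max_iter + 1):
        x_new = g(x)
        if abs(x_new - x) < tol:
            return x_new, i
        x = x_new
    raise RuntimeError('no convergence after %d iterations' % max_iter)

# form (a): x = (1 - 3x^2)/3, with |g'(x)| about 0.53 < 1 near the root
root, its = fixed_point(lambda x: (1 - 3 * x * x) / 3, 0.5)
print(root)    # about 0.2637626, the root of 3x^2 + 3x - 1 = 0
```

Form (b), x = 3x² + 4x − 1, run from the same starting point, moves away from the root at each step, as |g′| ≈ 5.58 > 1 predicts.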
[Fig. 3.9(a): iteration with x = g(x) = (1 − 3x²)/3. Fig. 3.9(b): iteration with x = g(x) = 3x² + 4x − 1, which illustrates the main deficiency of this method.]

Newton-Raphson Method

This is one of the most used methods for root finding. It also needs only one point to start the iterations but, as a difference with the fixed point iteration, it will converge to the solution provided the function is monotonically varying in the region of interest. Starting from a point x₀, a tangent to the function f(x) (a line with the slope of the derivative of f) is extrapolated to find the point where it crosses the x-axis, providing a new approximation. The same procedure is repeated until the error tolerance is achieved. The method needs repeated evaluation of the function and its derivative, and an appropriate stopping criterion is the value of the function at the successive approximations, f(xₙ).

[Fig. 3.10: the tangent at (x₀, f(x₀)), with slope f′(x₀), crosses the x-axis at x₁.]

From Fig. 3.10 we can see that at stage n:

    f′(xₙ) = f(xₙ) / (xₙ − xₙ₊₁)

then the next approximation is found as:

    xₙ₊₁ = xₙ − f(xₙ)/f′(xₙ)   (3.7)

Example: For the same function as in the previous examples, f(x) = cos(3x) = 0, and for the same tolerance of 10⁻⁶, the solution is found in 3 iterations starting from x₀ = 0.3 (after 3 iterations the accuracy is better than 10⁻⁸); starting from 0.5, only 2 iterations are sufficient.

    Iteration number       x            f(x)
        0             0.30000000    0.62160997
        1             0.56451705   −0.12244676
        2             0.52339200    0.00062033
        3             0.52359878    0.00000000

The method can also be derived from the Taylor series expansion, which also provides a useful insight into the rate of convergence of the method. Considering the Taylor expansion truncated to the first order (see Appendix):

    f(xᵢ₊₁) = f(xᵢ) + f′(xᵢ)(xᵢ₊₁ − xᵢ) + f⁽²⁾(ξ)/2! (xᵢ₊₁ − xᵢ)²   (3.8)

Considering now the exact solution xᵣ and the Taylor expansion evaluated at this point:

    f(xᵣ) = 0 = f(xᵢ) + f′(xᵢ)(xᵣ − xᵢ) + f⁽²⁾(ξ)/2! (xᵣ − xᵢ)²

and reordering (assuming a single root, so that the first derivative ≠ 0):

    xᵣ = xᵢ − f(xᵢ)/f′(xᵢ) − f⁽²⁾(ξ)/(2! f′(xᵢ)) (xᵣ − xᵢ)²   (3.9)

Using now (3.7) for xᵢ₊₁ and substituting in (3.9) gives:

    xᵣ = xᵢ₊₁ − f⁽²⁾(ξ)/(2! f′(xᵢ)) (xᵣ − xᵢ)²   (3.10)

The error at stage i can be written as the difference between xᵣ and xᵢ: Eᵣ = (xᵣ − xᵢ); then (3.10) reads Eᵣ₊₁ = −f⁽²⁾(ξ)/(2 f′(xᵢ)) Eᵣ². Assuming convergence, both ξ and xᵢ should eventually become xᵣ, so the previous equation can be rearranged in the form:
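The Newton-Raphson iteration (3.7) in Python (an illustration; here the stopping test uses the change in x rather than f(xₙ)):

```python
import math

def newton(f, df, x0, tol=1e-10, max_iter=50):
    x = x0
    for i in range(1, max_iter + 1):
        x_new = x - f(x) / df(x)       # equation (3.7)
        if abs(x_new - x) < tol:
            return x_new, i
        x = x_new
    raise RuntimeError('no convergence')

root, its = newton(lambda x: math.cos(3 * x),
                   lambda x: -3 * math.sin(3 * x), 0.3)
print(root)   # pi/6 to machine accuracy in very few iterations
```

Starting from 0.3, the first iterates reproduce the table above: 0.56451705, 0.52339200, 0.52359878.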
    Ei+1 = −f''(xr) Ei²/(2 f'(xr))    (3.11)

We can see that the relation between the errors of successive iterations is quadratic. This is what is called quadratic convergence: on each Newton-Raphson iteration the number of correct decimal figures should roughly double.

Although the convergence rate is generally quite good, there are cases that show poor or no convergence. One example is when there is an inflexion point near the root; in that case the iteration values will progressively diverge from the solution. Another case is when the root is a multiple root, that is, when the first derivative is also zero at the root.

Stopping the iterations
It can be demonstrated that the absolute error (the difference between the current approximation and the correct value) can be written as a multiple of the difference between the last two consecutive approximations:

    En = (xα − xn) = Cn(xn − xn−1)

where the constant Cn tends to 1 as the method converges. Then, choosing a small value for Cn, usually 1, a convenient criterion for stopping the iterations is:

    Cn |xn − xn−1| < ε    (3.12)

The Secant Method
One of the difficulties of the Newton-Raphson method is the need to evaluate the derivative, which can be very inconvenient or difficult in some cases. However, the derivative can be approximated using a finite difference expression, for example the backward difference:

    f'(xi) ≅ (f(xi) − f(xi−1))/(xi − xi−1)    (3.13)

If we now substitute this in the expression for the Newton-Raphson iterations, the following equation is obtained:

    xi+1 = xi − f(xi)(xi − xi−1)/(f(xi) − f(xi−1))    (3.14)

which is the formula for the secant method.

Example
Use the secant method to find the root of f(x) = e−x − x. Start with the estimates x−1 = 0 and x0 = 1. The exact result is 0.56714329…

First iteration: x−1 = 0, f(x−1) = 1;  x0 = 1, f(x0) = −0.63212
then x1 = 1 − (−0.63212)(1 − 0)/(−0.63212 − 1) = 0.61270,  ε = 8%

Second iteration: x0 = 1, f(x0) = −0.63212;  x1 = 0.61270, f(x1) = −0.07081
then x2 = 0.61270 − (−0.07081)(0.61270 − 1)/(−0.07081 + 0.63212) = 0.563838325

Note that in this case the two starting points are on the same side of the root (not bracketing it). Using Excel, a simple calculation gives the results in the table below.
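The same calculation is easy to reproduce programmatically. A minimal Python sketch of the secant iteration (3.14) (the notes use Excel; names are illustrative):

```python
import math

def secant(f, x_prev, x_curr, tol=1e-10, max_iter=50):
    """Secant iteration, eq. (3.14): the derivative in Newton-Raphson is
    replaced by a backward difference through the last two iterates."""
    for _ in range(max_iter):
        f_prev, f_curr = f(x_prev), f(x_curr)
        x_next = x_curr - f_curr * (x_curr - x_prev) / (f_curr - f_prev)
        if abs(x_next - x_curr) < tol:     # stopping criterion (3.12) with Cn = 1
            return x_next
        x_prev, x_curr = x_curr, x_next
    return x_curr

# f(x) = e^(-x) - x, starting points 0 and 1 (same side of the root)
root = secant(lambda x: math.exp(-x) - x, 0.0, 1.0)
print(root)   # close to 0.56714329...
```

The first two steps reproduce the hand calculation above (x1 = 0.61270, x2 = 0.563838…).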
    i    xi            f(xi)           error %
    −1   0              1
    0    1             −0.63212
    1    0.612700047   −0.070814271    8.0326713
    2    0.563838325    0.005182455    0.582738902
    3    0.567170359   −4.24203E−05    0.004772880
    4    0.567143307   −1.22401E−08    2.92795E−06
    5    0.567143290    2.24234E−13    1.53813E−08

The secant method doesn't require the evaluation of the derivative of the function as the Newton-Raphson method does, but its convergence is similar to that of Newton-Raphson and, similarly, it suffers from the same problems: it has severe difficulties if the derivative is zero or near zero in the region of interest.

Multiple Roots
We have seen that some of the methods have poor convergence if the derivative is very small or zero at the root. For higher order zeros (multiple roots), the function is zero and so are the first n−1 derivatives, where n is the order of the root. In this case the Newton-Raphson method (and also the secant method) will converge poorly. We can notice, however, that if the function f(x) has a multiple root at x = α, the function:

    g(x) = f(x)/f'(x)    (3.15)

has a simple root at x = α (if the root of f is of order n, the root of the derivative is of order n−1). We can then use the standard Newton-Raphson method on the function g(x) instead of f(x).

Exercise 3.2
Use the Newton-Raphson method to find a root of the function f(x) = 1 − x e^(1−x). Start the iterations with x0 = 0.

Extrapolation and Acceleration
Observing how the iterations proceed in the methods we have just studied, one can expect some advantages if an extrapolated value is calculated from the iterates in order to improve the next guess and accelerate the convergence. One of the best methods for extrapolation and acceleration is Aitken's method. Considering the sequence of values xn, the extrapolated sequence can be constructed by:

    x̄n = xn + α(xn − xn−1)  where  α = (xn−1 − xn)/(xn − 2xn−1 + xn−2)    (3.16)

We can use this expression embedded in the fixed point iteration, for example, in a repeated form. Starting from a value x0, the first two iterates are found using the standard method, x1 = g(x0) and x2 = g(x1), and then:

    alpha = (x1-x2)/(x2-2*x1+x0)
    xbar = x2+alpha*(x2-x1)

after which we refresh the initial guess, x0 = xbar, and restart the iterations.
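The repeated Aitken scheme of (3.16) can be sketched in a few lines of Python (the notes give it in Matlab; function names here are illustrative), applied to a fixed point rearrangement of cos(3x) = 0:

```python
import math

def aitken_fixed_point(g, x0, tol=1e-10, max_iter=100):
    """Fixed point iteration accelerated with Aitken extrapolation (3.16):
    take two ordinary iterates, extrapolate, then restart from xbar."""
    for _ in range(max_iter):
        x1 = g(x0)
        x2 = g(x1)
        denom = x2 - 2*x1 + x0
        if denom == 0:                     # sequence already converged
            return x2
        alpha = (x1 - x2) / denom          # as in (3.16)
        xbar = x2 + alpha * (x2 - x1)      # extrapolated value
        if abs(xbar - x2) < tol:
            return xbar
        x0 = xbar                          # refresh the guess and restart
    return x0

# g(x) = (2x + cos(3x))/2, a rearrangement of cos(3x) = 0; root at pi/6
root = aitken_fixed_point(lambda x: (2*x + math.cos(3*x)) / 2, 0.5)
```

With x0 = 0.5 the first extrapolation already lands within about 10−5 of pi/6, far ahead of the plain fixed point iteration.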
Similarly for the Newton-Raphson method, where the evaluations of x1 and x2 are replaced by the corresponding NR updates; the function and its derivative need to be calculated at each stage:

    f0=f(x0)      % calculation of the function
    df0=df(x0)    % calculation of the derivative
    x1=x0-f0/df0
    f1=f(x1)
    df1=df(x1)
    x2=x1-f1/df1
    alpha=(x1-x2)/(x2-2*x1+x0)
    xbar=x2+alpha*(x2-x1)
    x0=xbar

Example
For the function f(x) = cos(3x), using the fixed point method with and without acceleration, with the alternative form g(x) = (2x + cos(3x))/2, gives the results that follow. The iterations are started with x0 = 0.5.

    Iter. number    Fixed Point method x    error %
    1               0.53536860              2.24787104
    2               0.51771753              1.12323492
    3               0.52653894              0.56153005
    4               0.52212875              0.28075410
    5               0.52433378              0.14037569
    6               0.52323127              0.07018767
    7               0.52378253              0.03509381
    8               0.52350690              0.01754690
    9               0.52364471              0.00877345
    10              0.52357581              0.00438673
    11              0.52361026              0.00219336
    12              0.52359303              0.00109668
    13              0.52360165              0.00054834
    14              0.52359734              0.00027417
    15              0.52359949              0.00013709
    16              0.52359842              0.00006854
    17              0.52359896              0.00003427
    18              0.52359869              0.00001714
    19              0.52359882              0.00000857
    20              0.52359875              —

With acceleration the improvement is dramatic: the error drops to 0.12230396% after the first extrapolation and 0.00023538% after the second, and after the third the result 0.52359878 is exact to the eight decimals shown.

Higher Order Methods
All the methods we have seen so far consist of approximating the function by a straight line and looking for the intersection of that line with the x-axis.
An obvious extension of this idea is to use a higher order approximation: fitting a second order curve (a parabola) to the function and looking for the zero of this instead. A method that implements this is Muller's method.

Muller's method
Using three points, we can find the equation of a parabola that fits the function and then find the zeros of the parabola. For example, for the function y = x⁶ − 2 (used to find the value of 2^(1/6)), the three points are chosen as x1 = 0.5, x2 = 1.0 and x3 = 1.5; the function and the approximating parabola are shown in Fig. 3.11.

[Fig. 3.11: the function y = x⁶ − 2 with the parabola through x1, x2, x3 and its zero x4.]

The parabola passing through the points (x1, y1), (x2, y2) and (x3, y3) can be written as:

    y = y3 + c2(x − x3) + d1(x − x3)(x − x2)

where

    c1 = (y2 − y1)/(x2 − x1),  c2 = (y3 − y2)/(x3 − x2)  and  d1 = (c2 − c1)/(x3 − x1)

Solving for the zero closest to x3 gives:

    x4 = x3 − 2y3/(s + sign(s)·sqrt(s² − 4 y3 d1))    (3.17)

where s = c2 + d1(x3 − x2). Over 10 iterations the method converges to 1.12246205 (= 2^(1/6)): the first iteration gives x4 = 1.07797900, and by the final iterations the value 1.12246205 is repeated to the eight decimals shown.

Muller's method requires three points to start the iterations, but it doesn't require the evaluation of derivatives as Newton-Raphson does. It can also be used to find complex roots.
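A minimal Python sketch of (3.17), restricted to the real-root case (the complex case would need complex arithmetic; names are illustrative, not from the notes):

```python
import math

def muller(f, x1, x2, x3, tol=1e-10, max_iter=50):
    """Muller's method: fit a parabola through the last three points and
    take its zero closest to x3 (eq. 3.17) as the next estimate."""
    for _ in range(max_iter):
        y1, y2, y3 = f(x1), f(x2), f(x3)
        c1 = (y2 - y1) / (x2 - x1)
        c2 = (y3 - y2) / (x3 - x2)
        d1 = (c2 - c1) / (x3 - x1)
        s = c2 + d1 * (x3 - x2)
        disc = math.sqrt(s*s - 4*y3*d1)          # real-root case only
        x4 = x3 - 2*y3 / (s + math.copysign(disc, s))
        if abs(x4 - x3) < tol:
            return x4
        x1, x2, x3 = x2, x3, x4                  # slide the three points
    return x3

# y = x^6 - 2, starting points 0.5, 1.0, 1.5; root 2**(1/6) = 1.12246...
root = muller(lambda x: x**6 - 2, 0.5, 1.0, 1.5)
```

The first step lands on 1.07798, as in the worked example above.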
4. Interpolation and Approximation

There are many occasions when there is a need to convert discrete data, from measurements or calculations, into a smooth function. This could be necessary, for example, to obtain estimates of the values between data points. It could also be necessary if one wants to represent a function by a simpler one that approximates its behaviour. Two approaches to this task can be used: interpolation and approximation. Interpolation consists of fitting a smooth curve exactly to the data points, creating a curve that spans these points. In approximation, the curve doesn't necessarily fit the data points exactly but approximates the overall behaviour of the data, following an established criterion.

Lagrange Interpolation
The basic interpolation problem can be formulated as: given a set of nodes {xi, i = 0,…,n} and corresponding data values {yi, i = 0,…,n}, find the polynomial p(x) of degree less than or equal to n such that p(xi) = yi, i = 0,…,n.

Lagrange Polynomials
In more detail, consider the family of functions:

    Li⁽ⁿ⁾(x) = ∏_{j=0, j≠i}^{n} (x − xj)/(xi − xj),  i = 0,…,n    (4.18)

We can see that they are polynomials of order n and have the property (interpolatory condition):

    Li⁽ⁿ⁾(xj) = δij = 1 if i = j, 0 if i ≠ j    (4.19)

Then, if we define the polynomial:

    pn(x) = Σ_{k=0}^{n} yk Lk⁽ⁿ⁾(x)    (4.20)

we have:

    pn(xi) = Σ_{k=0}^{n} yk Lk⁽ⁿ⁾(xi) = yi    (4.21)

The uniqueness of this interpolation polynomial can also be demonstrated (that is, there is only one polynomial of order n or less that satisfies this condition). In particular, from the general definition (4.18), the equation of the first order polynomial (straight line) passing through two points (x1, y1) and (x2, y2) is:

    p1(x) = L1⁽¹⁾y1 + L2⁽¹⁾y2 = (x − x2)/(x1 − x2) y1 + (x − x1)/(x2 − x1) y2    (4.22)

The second order polynomial (parabola) passing through three points is:

    p2(x) = L1⁽²⁾y1 + L2⁽²⁾y2 + L3⁽²⁾y3
    p2(x) = (x − x2)(x − x3)/((x1 − x2)(x1 − x3)) y1 + (x − x1)(x − x3)/((x2 − x1)(x2 − x3)) y2 + (x − x1)(x − x2)/((x3 − x1)(x3 − x2)) y3    (4.23)

In general, each of the Lagrange interpolation functions Lk⁽ⁿ⁾ associated with a node xk (given in general by (4.18)) is:
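Definition (4.20) translates directly into code. A Python sketch (illustrative, not from the notes) evaluating the Lagrange interpolation polynomial for a parabola through (−2, 4), (0, 2), (2, 8), i.e. p2(x) = x² + x + 2:

```python
def lagrange_interpolate(xd, yd, x):
    """Evaluate the Lagrange interpolation polynomial (4.20) at x:
    p(x) = sum_k y_k L_k(x), with each L_k given by the product (4.18)."""
    n = len(xd)
    total = 0.0
    for k in range(n):
        L = 1.0
        for j in range(n):
            if j != k:
                L *= (x - xd[j]) / (xd[k] - xd[j])   # basis polynomial L_k
        total += yd[k] * L
    return total

# Parabola through (-2, 4), (0, 2), (2, 8): p2(x) = x^2 + x + 2
print(lagrange_interpolate([-2, 0, 2], [4, 2, 8], 1.0))   # → 4.0
```

Note the O(n²) cost per evaluation; for repeated evaluation at many points the Newton form discussed below is usually more convenient.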
    Lk⁽ⁿ⁾(x) = N(x)/D = (x − x1)(x − x2)⋯(x − xk−1)(x − xk+1)⋯(x − xn) / ((xk − x1)(xk − x2)⋯(xk − xk−1)(xk − xk+1)⋯(xk − xn))    (4.24)

The denominator has the same form as the numerator, and D = N(xk).

Example
Find the interpolating polynomial that passes through the three points (x1, y1) = (−2, 4), (x2, y2) = (0, 2) and (x3, y3) = (2, 8). Substituting in (4.23):

    p2(x) = (x − 0)(x − 2)/((−2 − 0)(−2 − 2)) · 4 + (x + 2)(x − 2)/((0 + 2)(0 − 2)) · 2 + (x + 2)(x − 0)/((2 + 2)(2 − 0)) · 8

    p2(x) = (x² − 2x)/8 · 4 + (x² − 4)/(−4) · 2 + (x² + 2x)/8 · 8 = L1⁽²⁾y1 + L2⁽²⁾y2 + L3⁽²⁾y3

    p2(x) = x² + x + 2

[Fig. 4.1: the interpolating polynomial p2(x) and the three Lagrange interpolation polynomials L1⁽²⁾(x), L2⁽²⁾(x), L3⁽²⁾(x).]

Fig. 4.1 shows the complete interpolating polynomial p2(x) and the three Lagrange interpolation polynomials Lk⁽²⁾(x), k = 1, 2, 3, corresponding to each of the nodal points. Notice that the function corresponding to one node has value 1 at that node and 0 at the other two.

Exercise 4.1
Find the 5th order Lagrange interpolation polynomial to fit the data: xd = {0, 1, 2, 3, 4, 5} and yd = {2, 1, 4, 4, 3, 8}.

Exercise 4.2
Show that an arbitrary polynomial of order n can be represented exactly by p(x) = Σ_{i=0}^{n} p(xi) Li(x), using an arbitrary set of (distinct) data points xi.

Newton Interpolation
It can be easily demonstrated that the polynomial interpolating a set of points is unique (Exercise 4.2), and the Lagrange method allows us to find it. The Newton interpolation method gives eventually the same result, but it can be more convenient in some cases. In particular, it is simpler to extend the interpolation by adding extra points, which in the Lagrange method would require a total recalculation of the interpolation functions.
The general form of a polynomial used to interpolate n + 1 data points is:

    f(x) = b0 + b1x + b2x² + b3x³ + ⋯ + bnxⁿ

Newton's method, like Lagrange's, gives us a procedure to find the coefficients bi.

[Fig. 4.2: a straight line through (x1, y1) and (x2, y2), with a general point (x, y) on it.]

From Fig. 4.2, by similar triangles:

    (y − y1)/(x − x1) = (y2 − y1)/(x2 − x1)    (4.25)

which can be rearranged as:

    y = y1 + (y2 − y1)/(x2 − x1) (x − x1)    (4.26)

This is Newton's expression for the first order interpolation polynomial (or linear interpolation). That is, the Newton form of the equation of a straight line that passes through two points (x1, y1) and (x2, y2) is:

    p(x) = a0 + a1(x − x1)    (4.27)

where a0 = y1 and a1 = (y2 − y1)/(x2 − x1).

Similarly, the general expression for a second order polynomial passing through the 3 data points (x1, y1), (x2, y2) and (x3, y3) can be written as p2(x) = b0 + b1x + b2x², or, rearranging:

    p2(x) = a0 + a1(x − x1) + a2(x − x1)(x − x2)    (4.28)

Substituting the values for the 3 points we get, after some rearrangement:

    a0 = y1
    a1 = (y2 − y1)/(x2 − x1)    (4.29)
    a2 = [ (y3 − y2)/(x3 − x2) − (y2 − y1)/(x2 − x1) ] / (x3 − x1)

The individual terms in the above expressions are usually called "divided differences" and denoted by the symbol D:

    Dyi = (yi+1 − yi)/(xi+1 − xi),  D²yi = (Dyi+1 − Dyi)/(xi+2 − xi),  D³yi = (D²yi+1 − D²yi)/(xi+3 − xi),  etc.

In this form, a0 = y1, a1 = Dy1 and a2 = D²y1.

The general form of the Newton interpolation polynomial is then an extension of (4.27) and (4.28):

    pn(x) = a0 + a1(x − x1) + a2(x − x1)(x − x2) + ⋯

or:

    pn(x) = a0 + Σ_{i=1}^{n} ai Wi(x)  with  Wi(x) = ∏_{j=1}^{i} (x − xj)    (4.30)

with the coefficients:

    a0 = y1,  ai = Dⁱy1    (4.31)

One important property of Newton's construction of the interpolating polynomial is what makes it easy to extend, including more points: if additional points are included, the new higher order polynomial can be easily constructed from the previous one:

    pn+1(x) = pn(x) + an+1Wn+1(x)  with  Wn+1(x) = Wn(x)(x − xn+1)  and  an+1 = Dⁿ⁺¹y1    (4.32)

In this way it has many similarities with the Taylor expansion, where additional terms increase the order of the polynomial. These similarities allow a treatment of the error in the same way as is done with Taylor expansions.

Example
We can consider the previous example of finding the interpolating polynomial that passes through the three points (x1, y1) = (−2, 4), (x2, y2) = (0, 2) and (x3, y3) = (2, 8), this time using Newton interpolation. In this case it is usual and convenient to arrange the calculations in a table with the following quantities in each column:

    xi    yi    Dyi                                D²yi
    −2    4
                Dy1 = (2 − 4)/(0 − (−2)) = −1
    0     2                                        (Dy2 − Dy1)/(x3 − x1) = (3 − (−1))/(2 − (−2)) = 1
                Dy2 = (8 − 2)/(2 − 0) = 3
    2     8

Then the coefficients are a0 = 4, a1 = −1 and a2 = 1, and the polynomial is:

    p2(x) = 4 − (x + 2) + (x + 2)(x − 0) = x² + x + 2

Note that it is the same polynomial found using Lagrange interpolation.

Exercise 4.3
Find the 5th order Newton interpolation polynomial to fit the data: xd = {0, 1, 2, 3, 4, 5} and yd = {2, 1, 4, 4, 3, 8}.
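The divided-difference table and the Newton form (4.30)-(4.31) can be sketched in Python (the notes use Excel for the table; function names are illustrative):

```python
def newton_coefficients(xd, yd):
    """Build the divided-difference table column by column and return
    [y1, Dy1, D^2 y1, ...], the coefficients a_i of (4.30)-(4.31)."""
    n = len(xd)
    col = list(yd)                  # current column of divided differences
    coeffs = [col[0]]
    for order in range(1, n):
        col = [(col[i+1] - col[i]) / (xd[i+order] - xd[i])
               for i in range(n - order)]
        coeffs.append(col[0])       # top entry of each column is a_i
    return coeffs

def newton_eval(coeffs, xd, x):
    """Evaluate p(x) = a0 + a1(x-x1) + a2(x-x1)(x-x2) + ... (nested form)."""
    result = coeffs[-1]
    for a, xi in zip(reversed(coeffs[:-1]), reversed(xd[:len(coeffs)-1])):
        result = result * (x - xi) + a
    return result

# The worked example: points (-2, 4), (0, 2), (2, 8)
c = newton_coefficients([-2, 0, 2], [4, 2, 8])
print(c)   # the coefficients a0 = 4, a1 = -1, a2 = 1
```

Adding a further data point only appends one coefficient, in line with (4.32); the earlier coefficients are unchanged.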
3.8.25} (shown with the filled black markers). This is by no means necessary and the data can have any distribution. 5} yd = {2. the resultant system is notoriously illconditioned (see chapter on Matrix Computations) and the results can be highly inaccurate. the interpolating function is obtained in a form that is different from the standard polynomial expression (4. We will not cover here the details of the derivation but simply. 4. 4. Some Practical Notes: In some of the examples seen here we have used equally spaced data.8. A very small change in the values of the data can lead to a drastic change in the interpolating function. 2. Another approach that avoids this problem is to use data for the derivative of the function too. The development is rather similar to that of the Newton’s method but more complicated due to the involvement of the derivative values.3 Hermite Interpolation The problem arises because the extra points force a higher degree polynomial and this can have a higher oscillatory behaviour. simpler expressions for the divided differences can be derived (not done here). 5.5. 4.4. This is done with the “Hermite Interpolation”. 3.25). 4. However.7} and yd’ = {3. 2. 3. ____________________________ One of the problems with interpolation of data point is that this technique is very sensitive to noisy data. if the data is equally spaced. Both in the Lagrange and the Newton methods. The table is similar to that for the Newton interpolation but we enter the data points twice (see . 6 5 4 3 2 1 0 0 1 2 3 4 5 Fig. 4.8. If we also ask for the derivative values to be matched at the nodes.8}.page 25 E763 (part 2) Numerical Methods 3. 4.2. 5. 4.3 shows the interpolating polynomial (in blue) for the data: xd = {0. if we add two more points with a slight amount of noise: xd’ = {2. It can also be constructed easily with the help of a table (as in Newton’s) and divided differences. 
the new interpolation polynomial (red line) shows a dramatic difference to the first one. this time using Newton interpolation. 1. The coefficients of this expression can be obtained from the Lagrange or Newton forms by setting a system of equations for the known values at the data point. the oscillations will be prevented. However.8. the procedure to find it. This is illustrated in the following example: Fig.8} Now. 1.
For two points the table is:

    i    xi    yi    Dyi                           D²yi                         D³yi
    1    x1    y1
                     y1'
         x1    y1                                  A = (Dy1 − y1')/(x2 − x1)
                     Dy1 = (y2 − y1)/(x2 − x1)                                  C = (B − A)/(x2 − x1)
    2    x2    y2                                  B = (y2' − Dy1)/(x2 − x1)
                     y2'
         x2    y2

The corresponding interpolating polynomial is:

    H2(x) = y1 + y1'(x − x1) + A(x − x1)² + C(x − x1)²(x − x2)    (4.33)

The coefficients of the successive terms are the top-diagonal entries of the table (marked in the original figure with blue squares; the initial set-up for the 2 points is marked with red circles).

Fig. 4.4 shows the function y(x) = sin(πx) (marked +) compared with the interpolated curves using Hermite, Newton and cubic spline interpolation (see next section). The data points are marked by circles. Newton interpolation results in a polynomial of order 4, while Hermite gives a polynomial of order 7. (If 7 points are used for the Newton interpolation, the results are quite similar.)

[Fig. 4.4: sin(πx) and its Hermite, Newton and cubic-spline interpolants.]

Spline Interpolation
Another approach to avoid the oscillations present when using high order polynomials is to use lower order polynomials to interpolate subsets of the data and assemble the overall approximating function piecewise. This is what is called "spline interpolation". Any piecewise interpolation of data by low order functions is called spline interpolation, and the simplest and most widely used is piecewise linear interpolation, joining consecutive data points by straight lines.

First Order Splines
If we have a set of data points (xi, yi), i = 0,…,n, there are n intervals and consequently n functions to determine. The first order splines (straight lines) can be defined as:

    f(x) = y1 + m1(x − x1),  x1 ≤ x ≤ x2    (4.34)
    f(x) = y2 + m2(x − x2),  x2 ≤ x ≤ x3    (4.35)
    …
    f(x) = yi + mi(x − xi),  xi ≤ x ≤ xi+1    (4.36)

with the slopes given by:

    mi = (yi+1 − yi)/(xi+1 − xi)

If we want a smooth function, we should impose continuity of the representation in contiguous intervals, and as many derivatives as possible should be continuous at the adjoining points.

Quadratic and Cubic Splines
First order splines are straightforward: to fit a line in each interval between data points we need only 2 pieces of data, the data values at the 2 points. If we now want to fit a higher order function, for example a second order polynomial (a parabola), we need to determine 3 coefficients. If we use quadratic splines (second order), their equations will be of the form:

    fi(x) = ai x² + bi x + ci

and we need 3 equations per interval to find the 3 coefficients; for n+1 points, a total of 3n equations. We can establish the following conditions to be satisfied:

1) The values of the functions corresponding to adjacent intervals must be equal at the common interior nodes (no discontinuities at the nodes). That is, at the node xi, the boundary between interval i and i+1, the functions defined in each of these intervals must coincide:

    f(xi) = ai xi² + bi xi + ci = ai+1 xi² + bi+1 xi + ci+1    (4.37)

at each of the n−1 interior points. This gives us a total of 2n−2 equations (there are 2 equations in the line above, at each of the n−1 interior points).

2) The first and last functions must pass through the first and last points (end points). This gives us another 2 equations.

3) The first derivative at the interior nodes must be continuous, that is, fi'(xi) = fi+1'(xi) at the interior nodes:

    2ai xi + bi = 2ai+1 xi + bi+1    (4.38)

and we have another n−1 equations.

All these give us a total of 3n−1 equations when we need 3n. An additional condition must then be established; usually this is chosen at the end points, for example stating that a1 = 0 (this corresponds to asking for the second derivative to be zero at the first point, and results in the first two points being joined by a straight line). For example, for a set of 4 data points we can establish the necessary equations as listed above, giving a total of 8 equations for 8 unknowns (having fixed a1 = 0 already). This can be solved by the matrix techniques that we will study later.

Quadratic splines have some shortcomings that are not present in cubic splines, so cubic splines are preferred.
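First order splines (4.34)-(4.36) amount to locating the interval containing x and applying the local straight line. A Python sketch (illustrative names, not from the notes):

```python
import bisect

def linear_spline(xd, yd, x):
    """First order spline: find the interval containing x and evaluate
    f(x) = y_i + m_i (x - x_i), with the slope m_i from the data.
    Assumes xd is sorted; x outside [xd[0], xd[-1]] is extrapolated
    with the nearest end segment."""
    i = bisect.bisect_right(xd, x) - 1
    i = max(0, min(i, len(xd) - 2))            # clamp to a valid interval
    m = (yd[i+1] - yd[i]) / (xd[i+1] - xd[i])  # slope in interval i
    return yd[i] + m * (x - xd[i])

xd = [0, 1, 2, 3, 4, 5]
yd = [2, 1, 4, 4, 3, 8]
print(linear_spline(xd, yd, 1.5))   # halfway between y=1 and y=4 → 2.5
```

The function is continuous at the nodes but its first derivative is not, which is exactly the motivation for the quadratic and cubic splines discussed above.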
Their calculation is even more cumbersome than that of quadratic splines, but because of their popularity, cubic splines are commonly found in computer libraries; Matlab, for example, has a standard function that calculates them. In cubic splines, the function and the first and second derivatives are continuous at the nodes.

To illustrate the advantages of this technique, Fig. 4.5 shows the cubic interpolation splines (green line) for the same data described in the last example, compared with the highly oscillatory result of a single polynomial interpolation for the 8 data points (which gives a polynomial of order 7). Using the built-in Matlab function, the code for plotting this graph is simply:

    xd=[0 ... 5]';      % the 8 data abscissae of Fig. 4.3
    yd=[2 ... 8]';      % the corresponding data values
    x=0:0.05:5;
    y=spline(xd,yd,x);
    plot(x,y,'g','LineWidth',2)
    plot(xd,yd,'ok','MarkerFaceColor','w','MarkerSize',5)

Note that the drawing of the 7th order interpolating polynomial (red line in Fig. 4.5) is not included in this piece of code, and that the last line is simply to draw the markers.

[Fig. 4.5: cubic spline interpolation (green) versus a single 7th order interpolating polynomial (red) for the 8 data points.]

Exercise 4.4
Using Matlab, plot the function f(x) = 0.1x e^(1.2 sin x) in the interval [0, 4] and construct a cubic spline interpolation using the values of the function at the points xd = 0:0.5:4 (in Matlab notation). Use Excel to construct a table of divided differences for this function at the points xd and find the coefficients of the Newton interpolation polynomial. Use Matlab to plot the corresponding polynomial in the same figure as the splines and compare the results.

Approximation
If the main objective is to create a smooth function to represent the data, it is sometimes preferable to choose a function that doesn't necessarily pass exactly through the data points but approximates its overall behaviour. This is what is called "approximation". The problems here are how to choose the approximating function and what is considered the best choice. There are many different ways to approximate a function, and you have seen some of them in detail already: Taylor expansions, least squares curve fitting and Fourier series are examples of this. Methods like "least squares" look for a single function or polynomial to approximate the desired function. Another approach, of which the Fourier series is an example, consists of using a family of simple functions to build an expansion that approximates the given function; the problem then is to find the appropriate set of coefficients for that expansion.
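Outside Matlab, a cubic spline can be implemented directly. A Python sketch of a *natural* cubic spline (an assumption for simplicity: second derivative zero at both ends, whereas Matlab's spline uses not-a-knot end conditions by default; names are illustrative):

```python
import bisect

def cubic_spline(xd, yd):
    """Natural cubic spline through (xd, yd): solve the tridiagonal
    system for the nodal second derivatives M_i (Thomas algorithm),
    then return a piecewise-cubic evaluator."""
    n = len(xd) - 1
    h = [xd[i+1] - xd[i] for i in range(n)]
    # rows: a*M[i-1] + b*M[i] + c*M[i+1] = d; end rows give M_0 = M_n = 0
    a = [0.0]*(n+1); b = [1.0]*(n+1); c = [0.0]*(n+1); d = [0.0]*(n+1)
    for i in range(1, n):
        a[i] = h[i-1]
        b[i] = 2.0*(h[i-1] + h[i])
        c[i] = h[i]
        d[i] = 6.0*((yd[i+1]-yd[i])/h[i] - (yd[i]-yd[i-1])/h[i-1])
    for i in range(1, n+1):              # forward elimination
        w = a[i]/b[i-1]
        b[i] -= w*c[i-1]
        d[i] -= w*d[i-1]
    M = [0.0]*(n+1)
    M[n] = d[n]/b[n]
    for i in range(n-1, -1, -1):         # back substitution
        M[i] = (d[i] - c[i]*M[i+1])/b[i]

    def s(x):
        i = min(max(bisect.bisect_right(xd, x) - 1, 0), n - 1)
        up, lo = xd[i+1] - x, x - xd[i]
        return ((M[i]*up**3 + M[i+1]*lo**3)/(6*h[i])
                + (yd[i]/h[i] - M[i]*h[i]/6)*up
                + (yd[i+1]/h[i] - M[i+1]*h[i]/6)*lo)
    return s

# Sanity check: data on a straight line is reproduced exactly
s = cubic_spline([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```

By construction the function, its first derivative and its second derivative are continuous at the interior nodes, which is the property of cubic splines mentioned above.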
Least Squares Curve Fitting
One of the most common problems of approximation is fitting a curve to experimental data. In this case, the usual objective is to find the curve that approximates the data while minimizing some measure of the error. The main difference here is that, while the other methods based on expansions attempt to find an overall approximation, Taylor series are meant to approximate the function at one particular point and its close vicinity.

If the data is given by (xi, yi), i = 1,…,n, we can define the error of the approximation by:

    E = Σ_{i=1}^{n} (yi − f(xi))²    (4.39)

In the method of least squares, the measure chosen is the sum of the squares of the differences between the data and the approximating curve. The commonest choices for the approximating functions are polynomials: straight lines, parabolas, etc. The minimization of the error will then give the necessary relations to determine the coefficients.

Approximation by a Straight Line
The equation of a straight line is f(x) = a + bx, so the error to minimise is:

    E(a, b) = Σ_{i=1}^{n} (yi − (a + bxi))²    (4.40)

The error is a function of the parameters a and b that define the straight line, so the minimization of the error can be achieved by making the derivatives of E with respect to a and b equal to zero. These conditions give:

    ∂E/∂a = −2 Σ (yi − (a + bxi)) = 0  ⇒  Σ yi − na − b Σ xi = 0    (4.41)
    ∂E/∂b = −2 Σ (yi − (a + bxi)) xi = 0  ⇒  Σ xiyi − a Σ xi − b Σ xi² = 0    (4.42)

which can be simplified to:

    na + b Σ xi = Σ yi    (4.43)
    a Σ xi + b Σ xi² = Σ xiyi    (4.44)

Solving the system for a and b gives:

    a = (Σxi² Σyi − Σxi Σxiyi) / (n Σxi² − (Σxi)²)  and  b = (n Σxiyi − Σxi Σyi) / (n Σxi² − (Σxi)²)    (4.45)

Example
Fitting a straight line to the 10 data points (xd, yd) shown in Fig. 4.6, with Σ xdi² = 32.8425 and Σ xdi ydi = 32.577, the parameters of the straight line are a = 0.01742473648290 and b = 0.98387806172746. Fig. 4.6 shows the approximating straight line (red) together with the curve for Newton interpolation (black) and for cubic splines (blue). The problems occurring with the use of higher order polynomial interpolation are also evident.

[Fig. 4.6: least squares line fit, compared with Newton interpolation and cubic splines.]

Example
The data (xd, yd) shown in Fig. 4.7 appears to behave in an exponential manner. In that case, defining the variable zdi = log(ydi), zd should vary linearly with xd. We can then fit a straight line to the pair of variables (xd, zd); if the fitting function is the line z(x), then the function

    y = e^z

is a least squares fit to the original data.

[Fig. 4.7: exponential-looking data and the fitted curve y = e^z(x).]
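The normal equations (4.43)-(4.44) and their solution (4.45) can be sketched directly in Python (illustrative, not from the notes):

```python
def least_squares_line(xd, yd):
    """Fit f(x) = a + b x by the normal equations (4.43)-(4.44),
    solved in closed form as in (4.45)."""
    n = len(xd)
    sx = sum(xd)
    sy = sum(yd)
    sxx = sum(x*x for x in xd)
    sxy = sum(x*y for x, y in zip(xd, yd))
    denom = n*sxx - sx*sx              # common denominator of (4.45)
    a = (sxx*sy - sx*sxy) / denom
    b = (n*sxy - sx*sy) / denom
    return a, b

# Points generated exactly from y = 1 + 2x are recovered exactly
a, b = least_squares_line([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)
```

For the exponential example above, the same routine would simply be applied to (xd, log(yd)).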
Second Order Approximation
If a second order curve f(x) = a + bx + cx² is used for the approximation, there are 3 parameters to find and the error is given by:

    E(a, b, c) = Σ_{i=1}^{n} (yi − (a + bxi + cxi²))²

Making the derivatives of E with respect to a, b and c equal to zero will give us the necessary 3 equations for the coefficients of the parabola. The expressions are similar to those found for the straight line fit, although more complicated.

Matlab has standard functions to perform least squares approximations with polynomials of any order. If the data is given by (xd, yd), and m is the desired order of the polynomial to fit, the function:

    coeff = polyfit(xd, yd, m)

returns the coefficients of the polynomial. If now x is the array of values where the approximation is wanted,

    y = polyval(coeff, x)

returns the values of the polynomial fit at the points x. There is one problem though: the resultant matrix problem, particularly for large systems (high order polynomials), is normally very ill-conditioned, so some special precautions need to be taken.

Approximation using Continuous Functions
The same idea of "least squares", that is, to try to minimize an error expression in the least squares sense, can be used for the approximation of continuous functions. This case is rather different to all we have seen previously because now there is certainty about the value of the desired variable at every point, but it might be desirable to have a simpler expression representing this behaviour, at least in a certain domain, for further analysis purposes. Imagine that instead of discrete data sets we have a rather complicated analytical expression representing the function of interest. For example, if the function is s(x) and we want to approximate it using a second order polynomial, we can formulate the problem as minimizing the square of the difference over the domain of interest Ω, that is:

    E = ∫_Ω (ax² + bx + c − s(x))² dx    (4.46)

Again, asking for the derivatives of E with respect to a, b and c to be zero will give the necessary equations to find these coefficients.

Exercise 4.5
Find the coefficients of a second order polynomial that approximates the function s(x) = e^x in the interval [−1, 1] in the least squares sense. Plot the function and the approximating polynomial together with the Taylor approximation (Taylor series truncated to second order) for comparison.
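One way to solve (4.46) numerically is to build the 3×3 normal equations, using the exact integrals of the monomials and numerical quadrature for the right-hand sides. The following Python sketch is an illustration, not the notes' method: the Simpson rule and the small Gaussian elimination are implementation choices.

```python
import math

def lsq_parabola(s, lo, hi, n=1000):
    """Continuous least squares fit of c0 + c1 x + c2 x^2 to s(x) on [lo, hi].
    Normal equations: sum_j c_j * int(x^(i+j)) = int(x^i * s(x)), i = 0,1,2."""
    def mono(k):                         # exact integral of x^k over [lo, hi]
        return (hi**(k+1) - lo**(k+1)) / (k + 1)
    def simpson(g):                      # composite Simpson rule, n even
        h = (hi - lo) / n
        total = g(lo) + g(hi)
        for j in range(1, n):
            total += (4 if j % 2 else 2) * g(lo + j*h)
        return total * h / 3
    A = [[mono(i + j) for j in range(3)] for i in range(3)]
    b = [simpson(lambda x, i=i: x**i * s(x)) for i in range(3)]
    # Gaussian elimination with partial pivoting on the 3x3 system
    for col in range(3):
        p = max(range(col, 3), key=lambda r: abs(A[r][col]))
        A[col], A[p] = A[p], A[col]
        b[col], b[p] = b[p], b[col]
        for r in range(col + 1, 3):
            f = A[r][col] / A[col][col]
            for cc in range(col, 3):
                A[r][cc] -= f * A[col][cc]
            b[r] -= f * b[col]
    c = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):                  # back substitution
        c[r] = (b[r] - sum(A[r][cc]*c[cc] for cc in range(r + 1, 3))) / A[r][r]
    return c                             # [c0, c1, c2]

c = lsq_parabola(math.exp, -1.0, 1.0)
print(c)
```

For s(x) = e^x on [−1, 1] (Exercise 4.5) the analytical answer is c1 = 3/e and c2 = (15/4)(e − 7/e), which the quadrature reproduces to high accuracy.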
Approximation using Orthogonal Functions
We can extend the same ideas of least squares to the approximation of a given function by an expansion in terms of a set of "basis functions". You are already familiar with this idea from the use of Fourier expansions to approximate functions, but the concept is not restricted to the use of sines and cosines as basis functions.

In this context, we call a base a family of functions that satisfies a set of properties. First of all, the base must be complete. For example, the set of unit vectors x̂, ŷ and ẑ constitutes a base in the 3D space because any vector in that space can be expressed as a combination of these three; no vector in 3D escapes this representation. If we only consider the vectors x̂ and ŷ, no combination of them can possibly have a component along the ẑ axis, and the resultant vectors are restricted to the xy-plane. In the same sense, we want a set of basis functions to be complete (as sines and cosines are) in order that any function can be represented by an expansion. Similarly to what you know of vectors, we don't actually need these basis members to be perpendicular to each other, but that helps (as we'll see later). The difference now is that, unlike the 3D space, we need an infinite set (that is the dimension of the space of functions).

So, if we select a complete set of basis functions φk(x), we can represent our function by an expansion truncated to n terms:

    f(x) ≈ f̃n(x) = Σ_{k=1}^{n} ck φk(x)    (4.47)

Naturally, the error of this approximation is the function difference between the exact and the approximate functions, rn(x) = f(x) − f̃n(x), and we can use the least squares ideas again, seeking to minimise the norm of this error (the error residual):

    Rn = ‖rn(x)‖² ≡ ∫_Ω (f(x) − f̃n(x))² dx    (4.48)

with the subscript n because the error residual above is also a function of the truncation level.

This concept of norm is analogous to that of the norm or modulus of a vector, ‖v‖ = √(v·v) = √(Σ vi²). In order to extend it to functions we need to introduce the inner product of functions (analogous to the dot product of vectors): if we have two functions f and g defined over the same domain Ω, their inner product is the quantity:

    ⟨f, g⟩ = ∫_Ω f(x) g(x) dx    (4.49)

The inner product satisfies the following properties:

1. ⟨f, f⟩ > 0 for all nonzero functions f
2. ⟨f, g⟩ = ⟨g, f⟩ for all functions f and g
3. ⟨f, αg1 + βg2⟩ = α⟨f, g1⟩ + β⟨f, g2⟩ for all functions f, g1, g2 and scalars α and β

Note: the above definition of inner product is sometimes extended using a weighting function w(x), in the form ⟨f, g⟩ = ∫_Ω w(x) f(x) g(x) dx; provided that w(x) is nonnegative, it satisfies all the required properties.
w(x), in the form $\langle f, g \rangle = \int_\Omega w(x) f(x) g(x) \, dx$, and provided that w(x) is nonnegative it satisfies all the required properties.

Using the inner product definition, the error expression (4.48) can be written as:

   $R_n = \| f(x) - \tilde{f}_n(x) \|^2 = \int_\Omega (f(x) - \tilde{f}_n(x))^2 \, dx = \| f(x) \|^2 - 2 \langle f(x), \tilde{f}_n(x) \rangle + \| \tilde{f}_n(x) \|^2$   (4.50)

and if we write $\tilde{f}_n(x) = \sum_{k=1}^{n} c_k \phi_k(x)$ as above, we get:

   $R_n = \| f(x) \|^2 - 2 \sum_{k=1}^{n} c_k \langle f(x), \phi_k(x) \rangle + \sum_{k=1}^{n} \sum_{j=1}^{n} c_k c_j \langle \phi_k(x), \phi_j(x) \rangle$   (4.51)

We can see that the error residual Rn is a function of the coefficients ck of the expansion, and then, to find the values that minimize this error, we can make the derivatives of Rn with respect to ck equal to zero for all k. That is:

   $\frac{\partial R_n}{\partial c_k} = 0$ for k = 1, …, n.   (4.52)

The first term in (4.51) is independent of ck, so it will not contribute, and the other two will yield the general equation:

   $\sum_{j=1}^{n} c_j \langle \phi_k(x), \phi_j(x) \rangle = \langle f(x), \phi_k(x) \rangle$ for k = 1, …, n.   (4.53)

Writing this down in detail gives:

   (k = 1)  ⟨φ1, φ1⟩ c1 + ⟨φ1, φ2⟩ c2 + … + ⟨φ1, φn⟩ cn = ⟨f, φ1⟩
   (k = 2)  ⟨φ2, φ1⟩ c1 + ⟨φ2, φ2⟩ c2 + … + ⟨φ2, φn⟩ cn = ⟨f, φ2⟩
   …
   (k = n)  ⟨φn, φ1⟩ c1 + ⟨φn, φ2⟩ c2 + … + ⟨φn, φn⟩ cn = ⟨f, φn⟩

which can be written as a matrix problem of the form: Φc = s, where the matrix Φ contains all the inner products (in all combinations), the vector c is the list of coefficients and s is the list of values in the right hand side (which are all known: we can calculate them all).

We can find the coefficients by solving the system of equations, but we can see that this will be a much easier task if all crossed products of basis functions yielded zero, that is, if

   $\langle \phi_k(x), \phi_j(x) \rangle = 0$ for all j ≠ k   (4.54)

This is what is called orthogonality and is a very useful property of the functions in a base. (Similar to what happens with a base of perpendicular vectors: all dot products between
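As a concrete illustration of the matrix problem Φc = s, the sketch below builds the Gram matrix for the non-orthogonal basis {1, x, x²} on [0, 1] and solves the normal equations (4.53) for the least-squares fit of f(x) = x³. The basis, function and interval are assumptions chosen for the example, not taken from the notes; with this choice the inner products have the closed forms ⟨xⁱ, xʲ⟩ = 1/(i+j+1) and sₖ = 1/(k+4).

```python
# Least-squares approximation of f(x) = x^3 on [0, 1] with basis {1, x, x^2}:
# assemble the Gram matrix Phi and right-hand side s, then solve Phi c = s.
n = 3
Phi = [[1.0 / (i + j + 1) for j in range(n)] for i in range(n)]  # <x^i, x^j>
s = [1.0 / (k + 4) for k in range(n)]                            # <x^3, x^k>

def solve(A, b):
    """Gaussian elimination with partial pivoting, for small dense systems."""
    A = [row[:] for row in A]
    b = b[:]
    m = len(b)
    for k in range(m):
        p = max(range(k, m), key=lambda r: abs(A[r][k]))  # pivot row
        A[k], A[p] = A[p], A[k]
        b[k], b[p] = b[p], b[k]
        for r in range(k + 1, m):
            factor = A[r][k] / A[k][k]
            for col in range(k, m):
                A[r][col] -= factor * A[k][col]
            b[r] -= factor * b[k]
    x = [0.0] * m
    for k in range(m - 1, -1, -1):
        x[k] = (b[k] - sum(A[k][c] * x[c] for c in range(k + 1, m))) / A[k][k]
    return x

c = solve(Phi, s)
# Consistency check: the solved coefficients reproduce the right-hand side.
residual = max(abs(sum(Phi[i][j] * c[j] for j in range(n)) - s[i])
               for i in range(n))
```

For this particular fit the exact answer is known to be c = (1/20, −3/5, 3/2), so the sketch can be checked by hand.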
different members of the base are zero.)

Then, if the basis functions $\phi_k(x)$ are orthogonal, the solution of the system above is straightforward:

   $c_k = \frac{\langle f(x), \phi_k(x) \rangle}{\langle \phi_k(x), \phi_k(x) \rangle}$   (4.55)

Remark: You can surely recognise here the properties of the sine and cosine functions involved in Fourier expansions: they form a complete set and, in particular, they are orthogonal: $\langle \phi_k(x), \phi_j(x) \rangle = 0$ for j ≠ k, where φk is either sin kπx or cos kπx, and the coefficients of the expansions are given by the same expression (4.55) above.

There are many other sets of orthogonal functions that can form a base. Fourier expansions are convenient, but sinusoidal functions are not the only or the simplest possibility; in particular, we are interested in polynomials because of the convenience of calculations.

Families of Orthogonal Polynomials

There are many different families of polynomials that can be used in this manner. They often originate as solutions of some differential equations. Some are more useful than others in particular problems.

1. Legendre Polynomials

These are orthogonal polynomials in the interval [−1, 1] with weighting function 1. That is:

   $\langle P_i(x), P_j(x) \rangle = \int_{-1}^{1} P_i(x) P_j(x) \, dx = 0$ if i ≠ j.

They are usually normalised so that Pn(1) = 1, and their norm in this case is:

   $\langle P_n(x), P_n(x) \rangle = \int_{-1}^{1} P_n(x) P_n(x) \, dx = \frac{2}{2n+1}$   (4.56)

The first few are:

   P0(x) = 1
   P1(x) = x
   P2(x) = (3x² − 1)/2
   P3(x) = (5x³ − 3x)/2
   P4(x) = (35x⁴ − 30x² + 3)/8
   P5(x) = (15x − 70x³ + 63x⁵)/8
   P6(x) = (−5 + 105x² − 315x⁴ + 231x⁶)/16
   P7(x) = (−35x + 315x³ − 693x⁵ + 429x⁷)/16
   etc.

Fig. 4.8 Legendre polynomials

In general they can be defined by the expression:

   $P_n(x) = \frac{(-1)^n}{2^n n!} \frac{\partial^n}{\partial x^n} (1 - x^2)^n$   (4.57)

They also satisfy the recurrence relation:

   $(n+1) P_{n+1}(x) = (2n+1) x P_n(x) - n P_{n-1}(x)$   (4.58)
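The recurrence (4.58) gives a convenient way to evaluate Pn, and the orthogonality and norm 2/(2n+1) can be checked numerically. The sketch below does this; Simpson's rule and the node count are implementation choices of this example, not part of the notes.

```python
# Evaluate Legendre polynomials by the recurrence
# (n+1) P_{n+1} = (2n+1) x P_n - n P_{n-1}, then check <P2,P3> = 0 and
# <P3,P3> = 2/7 by numerical integration over [-1, 1].

def legendre(n, x):
    """P_n(x) from the three-term recurrence, with P_0 = 1, P_1 = x."""
    p0, p1 = 1.0, x
    if n == 0:
        return p0
    for k in range(1, n):
        p0, p1 = p1, ((2 * k + 1) * x * p1 - k * p0) / (k + 1)
    return p1

def simpson(f, a, b, m=2000):
    """Composite Simpson's rule with m (even) subintervals."""
    h = (b - a) / m
    s = f(a) + f(b) + sum((4 if i % 2 else 2) * f(a + i * h)
                          for i in range(1, m))
    return s * h / 3

ip_23 = simpson(lambda x: legendre(2, x) * legendre(3, x), -1.0, 1.0)
ip_33 = simpson(lambda x: legendre(3, x) ** 2, -1.0, 1.0)
```

Since the integrands are low-order polynomials, the quadrature is essentially exact here, so ip_23 should vanish and ip_33 should equal 2/(2·3+1) = 2/7.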
2. Chebyshev Polynomials

The general, compact definition of these polynomials is:

   $T_n(x) = \cos(n \cos^{-1} x)$   (4.59)

and they satisfy the following orthogonality condition:

   $\langle T_i(x), T_j(x) \rangle = \int_{-1}^{1} \frac{T_i(x) T_j(x)}{\sqrt{1 - x^2}} \, dx = \begin{cases} 0 & \text{if } i \neq j \\ \pi/2 & \text{if } i = j \neq 0 \\ \pi & \text{if } i = j = 0 \end{cases}$   (4.60)

That is, they are orthogonal in the interval [−1, 1] with the weighting function $w(x) = 1/\sqrt{1 - x^2}$. They are characterised by having all their oscillations of the same amplitude in the interval [−1, 1], and all their zeros also occur in the same interval.

The first few Chebyshev polynomials are:

   T0(x) = 1
   T1(x) = x
   T2(x) = 2x² − 1
   T3(x) = 4x³ − 3x
   T4(x) = 8x⁴ − 8x² + 1
   T5(x) = 16x⁵ − 20x³ + 5x
   T6(x) = 32x⁶ − 48x⁴ + 18x² − 1
   T7(x) = 64x⁷ − 112x⁵ + 56x³ − 7x
   etc.

Fig. 4.9 Chebyshev polynomials

They can also be constructed from the recurrence relation:

   $T_{n+1}(x) = 2x T_n(x) - T_{n-1}(x)$ for n ≥ 1.   (4.61)

These are not the only possibilities. Other families of polynomials commonly used are the Hermite polynomials, which are orthogonal over the complete real axis with weighting function $\exp(-x^2)$, and the Laguerre polynomials, orthogonal in [0, ∞) with weighting function $e^{-x}$.

Example
Approximate the function $f(x) = \frac{1}{1 + a^2 x^2}$, with a = 4, in the interval [−1, 1] in the least squares sense using Legendre polynomials up to order 8. The approximation is $\tilde{f}(x) = \sum_k c_k P_k(x)$ and the coefficients are:

   $c_k = \frac{1}{\| P_k(x) \|^2} \int_{-1}^{1} f(x) P_k(x) \, dx$   with   $\| P_k(x) \|^2 = \frac{2}{2k+1}$
From the expression above we can see that the calculation of the coefficients will involve the integrals:

   $I_m = \int_{-1}^{1} \frac{x^m}{1 + a^2 x^2} \, dx$

which satisfy the recurrence relation:

   $I_m = \frac{2}{(m-1) a^2} - \frac{I_{m-2}}{a^2}$   with   $I_0 = \frac{2}{a} \tan^{-1} a$

We can also see that, due to symmetry, Im = 0 for m odd (the integral of an odd function over the interval [−1, 1]); because of this, the odd coefficients of the expansion are zero and only even numbered coefficients are necessary. The coefficients are then:

   $c_0 = \frac{1}{2} \int_{-1}^{1} f(x) \, dx = \frac{1}{2} I_0$
   $c_2 = \frac{5}{2} \int_{-1}^{1} P_2(x) f(x) \, dx = \frac{5}{4} (3 I_2 - I_0)$
   $c_4 = \frac{9}{2} \int_{-1}^{1} P_4(x) f(x) \, dx = \frac{9}{16} (35 I_4 - 30 I_2 + 3 I_0)$
   $c_6 = \frac{13}{2} \int_{-1}^{1} P_6(x) f(x) \, dx = \frac{13}{32} (231 I_6 - 315 I_4 + 105 I_2 - 5 I_0)$
   $c_8 = \frac{17}{2} \int_{-1}^{1} P_8(x) f(x) \, dx = \frac{17}{256} (6435 I_8 - 12012 I_6 + 6930 I_4 - 1260 I_2 + 35 I_0)$

The results are shown in Fig. 4.10, where the red line corresponds to the approximating curve and the green line at the bottom shows the absolute value of the difference between the two curves.

Fig. 4.10

The total error of the approximation (the integral of this difference divided by the integral of the original curve) is 10.5%.

Exercise 4.6
Use cosine functions (cos(nπx)) instead of Legendre polynomials (in the form of a Fourier series) and compare the results.

Remarks:
We can observe in the figure of the example above that the error oscillates through the domain. This is typical of least squares approximation, where the overall error is minimised. In this context, Chebyshev polynomials are the best possible choice: the error obtained with a Chebyshev approximation is the least possible with any other polynomial up to the same degree.
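The recurrence for Im can be cross-checked against direct numerical integration. The sketch below does this for a = 4 and reproduces c2 = (5/4)(3I2 − I0); Simpson's rule and the node count are implementation choices of this example, not part of the notes.

```python
# Check the recurrence I_m = 2/((m-1) a^2) - I_{m-2}/a^2 (with
# I_0 = (2/a) atan a) against numerical integration, for a = 4.
import math

a = 4.0
f = lambda x: 1.0 / (1.0 + a * a * x * x)

def simpson(g, lo, hi, m=4000):
    """Composite Simpson's rule with m (even) subintervals."""
    h = (hi - lo) / m
    s = g(lo) + g(hi) + sum((4 if i % 2 else 2) * g(lo + i * h)
                            for i in range(1, m))
    return s * h / 3

I = {0: (2.0 / a) * math.atan(a)}
for m in (2, 4, 6, 8):
    I[m] = 2.0 / ((m - 1) * a * a) - I[m - 2] / (a * a)

# Each I_m from the recurrence should match the integral of x^m f(x).
errs = [abs(I[m] - simpson(lambda x, m=m: x ** m * f(x), -1.0, 1.0))
        for m in I]
c2 = 1.25 * (3 * I[2] - I[0])     # c2 = (5/4)(3 I2 - I0)
```

With a = 4 this gives c2 ≈ −0.5153, a coefficient of the order-8 Legendre expansion of the example.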
Furthermore, these polynomials have other useful properties. We have seen before the interpolation of data by higher order polynomials using equally spaced data points, and the problems that this causes were quite clear. One can then ask if there is a different distribution of points that helps in minimising the error. The answer is yes: the optimum arrangement is to locate the data points on the zeros of the Chebyshev polynomial of the order necessary to give the required number of data points. These occur for:

   $x_k = \cos\left( \frac{2k - 1}{2n} \pi \right)$   for k = 1, …, n.

Approximation to a Point

The approximation methods we have studied so far attempt to reach a "global" approximation to a function, that is, to minimise the error over a complete domain. This might be desirable in many cases, but there could be others where it is more important to achieve a high approximation at one particular point of the domain and in its close vicinity. One such method is the Taylor approximation, where the function is approximated by a Taylor polynomial.

Taylor Polynomial Approximation

Among polynomial approximation methods, the Taylor polynomial gives the maximum possible "order of contact" between the function and the polynomial: the n-order Taylor polynomial agrees with the function and with its n derivatives at the point of contact.

The Taylor polynomial for approximating a function f(x) at a point a is given by:

   $p(x) = f(a) + f'(a)(x - a) + \frac{f''(a)}{2!}(x - a)^2 + \cdots + \frac{f^{(n)}(a)}{n!}(x - a)^n$   (4.62)

and the error incurred is given by (see Appendix):

   $\frac{f^{(n+1)}(\xi)}{(n+1)!} (x - a)^{n+1}$   where ξ is a point between a and x.

Fig. 4.11 shows the function (blue line) $y(x) = \sin \pi x + \cos 2\pi x$ with the Taylor approximation (red line) using a Taylor polynomial of order 9 (the Taylor series truncated at order 9). For comparison, the Newton approximation using 9 equally spaced points (indicated with * and a black line), giving a single polynomial of order 9, is also shown. While the Taylor approximation is very good in the vicinity of zero (9 derivatives are also matched there), it deteriorates badly away from zero.

Fig. 4.11

If a function varies rapidly or has a pole, polynomial approximations will not be able to achieve a high degree of accuracy. In that case an approximation using rational functions will be better.
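The behaviour shown in Fig. 4.11 can be reproduced numerically. The sketch below sums all the Maclaurin terms of degree ≤ 9 of y(x) = sin πx + cos 2πx; the test points 0.1 and 1.0 are arbitrary choices for illustration.

```python
# Order-9 Maclaurin approximation of y(x) = sin(pi x) + cos(2 pi x):
# excellent near the contact point x = 0, poor away from it.
import math

def taylor9(x):
    """Sum of all Maclaurin terms of degree <= 9 of sin(pi x) + cos(2 pi x)."""
    s = 0.0
    for k in range(5):      # sin terms: degrees 1, 3, 5, 7, 9
        s += (-1) ** k * (math.pi * x) ** (2 * k + 1) / math.factorial(2 * k + 1)
    for k in range(5):      # cos terms: degrees 0, 2, 4, 6, 8
        s += (-1) ** k * (2 * math.pi * x) ** (2 * k) / math.factorial(2 * k)
    return s

y = lambda x: math.sin(math.pi * x) + math.cos(2 * math.pi * x)
err_near = abs(taylor9(0.1) - y(0.1))   # tiny near x = 0
err_far = abs(taylor9(1.0) - y(1.0))    # large at x = 1
```

The error at x = 0.1 is below 10⁻⁶, while at x = 1 the truncated cosine series alone contributes an error of order 20.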
Padé Approximation

Padé approximants are rational functions, or ratios of polynomials, that fit the value of a function and a number of its derivatives at one point. They usually provide an approximation that is better than that of the Taylor polynomials, in particular in the case of functions containing poles: the polynomial in the denominator provides the facilities for rapid variations and poles.

A Padé approximation to a function f(x) that can be represented by a Taylor series in [a, b] (or Padé approximant) is a ratio between two polynomials Pm(x) and Qn(x) of orders m and n respectively:

   $R_n^m(x) = \frac{P_m(x)}{Q_n(x)} = \frac{a_m x^m + \cdots + a_2 x^2 + a_1 x + a_0}{b_n x^n + \cdots + b_2 x^2 + b_1 x + b_0} \approx f(x)$   (4.63)

For simplicity, we will consider only approximations to a function at x = 0; for other values, a simple transformation of variables can be used.

If the Taylor approximation at x = 0 (Maclaurin series) to f(x) is:

   $t(x) = \sum_{i=0}^{k} c_i x^i = c_k x^k + \cdots + c_2 x^2 + c_1 x + c_0$   with k = m + n   (4.64)

we can write $t(x) \approx R_n^m(x) = P_m(x)/Q_n(x)$, or

   $t(x) Q_n(x) = P_m(x)$   (4.65)

Considering now this equation in its expanded form:

   $(c_k x^k + \cdots + c_1 x + c_0)(b_n x^n + \cdots + b_1 x + b_0) = a_m x^m + \cdots + a_2 x^2 + a_1 x + a_0$   (4.66)

we can establish a system of equations to find the coefficients of P and Q.

First of all, we can force this condition to apply at x = 0 (exact matching of the function at x = 0). This will give:

   $t(0) Q_n(0) = P_m(0)$,  or:  $c_0 b_0 = a_0$   (4.67)

Since the ratio R does not change if we multiply the numerator and denominator by any number, we can choose the value b0 = 1, and this gives us the value of a0.

Taking now the first derivative of (4.66) will give:

   $(k c_k x^{k-1} + \cdots + 2 c_2 x + c_1)(b_n x^n + \cdots + b_1 x + b_0) + (c_k x^k + \cdots + c_1 x + c_0)(n b_n x^{n-1} + \cdots + 2 b_2 x + b_1) = m a_m x^{m-1} + \cdots + 2 a_2 x + a_1$   (4.68)

and again, forcing this equation to be satisfied at x = 0 (exact matching of the first derivative) gives:

   $c_1 b_0 + c_0 b_1 = a_1$   (4.69)

that is, an equation relating the coefficients a1 and b1:

   $a_1 - c_0 b_1 = c_1 b_0$   (4.70)

To continue with the process of taking derivatives of (4.66), we can first establish some general formulae. If we call g(x) the product on the left hand side of (4.66), we can apply the product rule repeatedly to form the derivatives, giving:
   $g'(x) = t(x) Q'(x) + t'(x) Q(x)$
   $g''(x) = t(x) Q''(x) + 2 t'(x) Q'(x) + t''(x) Q(x)$
   …
   $g^{(i)}(x) = \sum_{j=0}^{i} \frac{i!}{j! (i-j)!} t^{(j)}(x) Q^{(i-j)}(x)$   (4.71)

and since we are interested in the values of the derivatives at x = 0, we have:

   $g^{(i)}(0) = \sum_{j=0}^{i} \frac{i!}{j! (i-j)!} t^{(j)}(0) Q^{(i-j)}(0)$   (4.72)

The first derivative of a polynomial, say Qn(x), is $n b_n x^{n-1} + \cdots + 2 b_2 x + b_1$, so the second derivative will be $(n-1) n b_n x^{n-2} + \cdots + 2 \cdot 3 b_3 x + 2 b_2$, and so on. These derivatives evaluated at x = 0 are successively b1, 2b2, (2·3)b3, (2·3·4)b4, …, in general $Q^{(j)}(0) = j! \, b_j$; similarly $t^{(j)}(0) = j! \, c_j$ and $P_m^{(i)}(0) = i! \, a_i$. Then we can write (4.72) as:

   $g^{(i)}(0) = \sum_{j=0}^{i} \frac{i!}{j! (i-j)!} \, j! \, c_j \, (i-j)! \, b_{i-j} = i! \sum_{j=0}^{i} c_j b_{i-j}$

and equating this to the ith derivative of Pm(x) evaluated at x = 0, that is $i! \, a_i$, gives $a_i = \sum_{j=0}^{i} c_j b_{i-j}$; but since b0 = 1 we can finally write:

   $a_i - \sum_{j=0}^{i-1} c_j b_{i-j} = c_i$   for i = 1, …, k (k = m + n)   (4.73)

where we take the coefficients ai = 0 for i > m and bi = 0 for i > n. From these equations we can calculate the Padé coefficients.

Example
Consider the function $f(x) = 1/\sqrt{1 - x}$. This function has a pole at x = 1 and polynomial approximations will not perform well. The Taylor coefficients of this function are given by:

   $c_j = \frac{1 \cdot 3 \cdot 5 \cdots (2j - 1)}{2^j \, j!}$

so the first five are: c0 = 1, c1 = 1/2, c2 = 3/8, c3 = 5/16, c4 = 35/128. We choose m = n = 2, so k = m + n = 4, which corresponds to a Taylor polynomial of order 4. Taking b0 = 1, the equation c0b0 = a0 gives a0 = 1. The other equations (from asking the derivatives to fit) come from (4.73):

   a1 − c0 b1 = c1
   a2 − c1 b1 − c0 b2 = c2
   a3 − c2 b1 − c1 b2 − c0 b3 = c3
   a4 − c3 b1 − c2 b2 − c1 b3 − c0 b4 = c4

but for m = n = 2 we have a3 = a4 = b3 = b4 = 0, and the system can be written as:

   a1 − c0 b1 = c1
   a2 − c1 b1 − c0 b2 = c2
   − c2 b1 − c1 b2 = c3
   − c3 b1 − c2 b2 = c4

We can see that the last two equations can be solved on their own for the coefficients bi. Rewriting this subsystem:

   c2 b1 + c1 b2 = −c3
   c3 b1 + c2 b2 = −c4

In general, when n = m = k/2 as in this case, the matrix that defines the subsystem will be of the form:

   [ r0        r1        r2        …  r_{n−1} ]
   [ r−1       r0        r1        …  r_{n−2} ]
   [ r−2       r−1       r0        …  r_{n−3} ]
   [ …         …         …         …  …       ]
   [ r_{−n+1}  r_{−n+2}  r_{−n+3}  …  r0      ]

This is a special kind of matrix called a Toeplitz matrix, which has the same element along each diagonal, so an n×n Toeplitz matrix is defined by a total of 2n−1 numbers. There are methods for solving systems with this matrix efficiently.

Solving the system gives: a = [1, −3/4, 1/16] and b = [1, −5/4, 5/16], that is:

   $R_2^2(x) = \frac{a_0 + a_1 x + a_2 x^2}{b_0 + b_1 x + b_2 x^2} = \frac{1 - 0.75 x + 0.0625 x^2}{1 - 1.25 x + 0.3125 x^2}$

Fig. 4.12 shows the function $f(x) = 1/\sqrt{1 - x}$ (blue line), the Padé approximant above (red line) and the Taylor polynomial up to 4th order, $P(x) = c_0 + c_1 x + c_2 x^2 + c_3 x^3 + c_4 x^4$ (green line).

Fig. 4.12

It is clear that the Padé approximant gives a better fit, particularly closer to the pole. We can see that even $R_1^1(x)$, a ratio of linear functions of x, gives a better approximation than the 4th order Taylor polynomial. Fig. 4.13 shows the approximations for m = n = 1, 2, 3 and 4: increasing the order, the Padé approximation gets closer to the singularity. The poles closest to 1 of these functions are:

   $R_1^1(x)$: 1.333    $R_2^2(x)$: 1.106    $R_3^3(x)$: 1.052    $R_4^4(x)$: 1.031

Fig. 4.13

Exercise 4.7
Using the Taylor (Maclaurin) expansion of f(x) = cos x truncated to order 4:

   $t(x) = \sum_{k=0}^{4} c_k x^k = 1 - \frac{x^2}{2} + \frac{x^4}{24}$

find the Padé approximant:

   $R_2^2(x) = \frac{P_2(x)}{Q_2(x)} = \frac{a_2 x^2 + a_1 x + a_0}{b_2 x^2 + b_1 x + b_0}$
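The small Padé system of the example can be solved exactly in rational arithmetic. The sketch below recovers a = [1, −3/4, 1/16] and b = [1, −5/4, 5/16] from the Taylor coefficients of 1/√(1−x); the use of Cramer's rule on the 2×2 subsystem is an implementation choice.

```python
# Pade [2/2] coefficients of f(x) = 1/sqrt(1-x) from its Taylor
# coefficients c_j = (1*3*...*(2j-1)) / (2^j j!), with b0 = 1.
from fractions import Fraction as F

c = [F(1), F(1, 2), F(3, 8), F(5, 16), F(35, 128)]   # c0..c4

# Subsystem:  c2 b1 + c1 b2 = -c3 ,  c3 b1 + c2 b2 = -c4  (Cramer's rule)
det = c[2] * c[2] - c[1] * c[3]
b1 = (-c[3] * c[2] + c[4] * c[1]) / det
b2 = (-c[4] * c[2] + c[3] * c[3]) / det

# Back-substitute for the numerator coefficients.
a0 = c[0]
a1 = c[1] + c[0] * b1                 # a1 = c1 + c0 b1
a2 = c[2] + c[1] * b1 + c[0] * b2     # a2 = c2 + c1 b1 + c0 b2
```

Evaluating the resulting approximant at x = 0.5 already matches √2 to about three decimals, against an exact value of 1.41421.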
5. MATRIX COMPUTATIONS

We have seen earlier that a number of issues arise when we consider errors in the calculations dealing with machine numbers: the problems of accuracy of representation, error propagation and sensitivity of the solutions to small variations in the data. When matrices are involved, these are much more important. Before discussing any methods of solving matrix equations, we consider first the rather fundamental matrix property of 'condition number'.

'Condition' of a Matrix

We have seen that multiplying or dividing two floating-point numbers gives an error of the order of the 'last preserved bit'. If, say, two numbers are held to 8 decimal digits, the resulting product (or quotient) will effectively have its least significant bit 'truncated' and therefore have a relative uncertainty of ±10⁻⁸. By contrast, with matrices and vectors, multiplying (that is, evaluating y = Ax) or 'dividing' (that is, solving Ax = y for x) can lose in some cases ALL significant figures! Before examining this problem we have to define matrix and vector norms.

VECTOR AND MATRIX NORMS

To introduce the idea of 'length' into vectors and matrices, we have to consider norms.

VECTOR NORMS

If xᵀ = (x1, x2, …, xn) is a real or complex vector, a general norm is denoted by $\| x \|_N$ and is defined by:

   $\| x \|_N = \left( \sum_{i=1}^{n} |x_i|^N \right)^{1/N}$   (5.1)

So the usual "Euclidian norm", or "length", is $\| x \|_2$ or simply $\| x \|$:

   $\| x \| = \sqrt{x_1^2 + \cdots + x_n^2}$   (5.2)

Other norms are used, the Euclidian N = 2 being the commonest, e.g. $\| x \|_1$ and $\| x \|_\infty$, the latter corresponding to the greatest in magnitude xi (show this as an exercise).

MATRIX NORM

If A is an n-by-n real or complex matrix, we denote its norm, defined by:

   $\| A \|_N = \max_{x \neq 0} \frac{\| A x \|_N}{\| x \|_N}$   for any choice of the vector x   (5.3)

According to our choice of N in defining the vector norms by (5.1), we have the corresponding $\| A x \|_1$, $\| A x \|_2$, $\| A x \|_\infty$. Note that (ignoring the question "how do we find its value"), for given A and N, $\| A \|$ has some specific numerical value ≥ 0.
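A quick sketch of the three common vector norms for the arbitrary example vector x = (3, −4), for which the values are easy to check by hand: ||x||₁ = 7, ||x||₂ = 5, ||x||∞ = 4.

```python
# The N = 1, 2 and infinity norms of (5.1)-(5.2) for x = (3, -4).
x = [3.0, -4.0]
norm1 = sum(abs(v) for v in x)            # sum of magnitudes
norm2 = sum(v * v for v in x) ** 0.5      # Euclidian length
norm_inf = max(abs(v) for v in x)         # greatest magnitude component
```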
'Condition' of a Linear System Ax = y

This is in the context of Hadamard's general concept of a 'well-posed problem'.

Definition: The problem of finding x, satisfying Ax = y, is well posed or well conditioned if: (i) a unique x satisfies Ax = y, and (ii) small changes in either A or y result in small changes in x.

For a quantitative measure of "how well conditioned" a problem is, which is roughly one where the result is not too sensitive to small changes in the problem specification, we need to estimate the amount of variation in x when y varies, and/or the variation in x when A changes slightly, or the corresponding changes in y when either (or both) x and A vary.

Suppose A is fixed, but y changes slightly to y + δy, with the associated x changing to x + δx. We have:

   Ax = y,   A(x + δx) = y + δy   (5.4)

Subtracting gives:

   A δx = δy   or   δx = A⁻¹ δy   (5.5)

From our definition (5.3), we must have for any A and z:

   $\frac{\| A z \|}{\| z \|} \leq \| A \|$   and so   $\| A z \| \leq \| A \| \cdot \| z \|$   (5.6)

Taking the norm of both sides of (5.5) and using inequality (5.6) gives:

   $\| \delta x \| = \| A^{-1} \delta y \| \leq \| A^{-1} \| \cdot \| \delta y \|$   (5.7)

Taking the norm of (5.4) and using (5.6) gives:

   $\| y \| = \| A x \| \leq \| A \| \cdot \| x \|$   (5.8)

Finally, multiplying corresponding sides of (5.7) and (5.8) and dividing by $\| x \| \, \| y \|$ gives our fundamental result:

   $\frac{\| \delta x \|}{\| x \|} \leq \| A \| \, \| A^{-1} \| \, \frac{\| \delta y \|}{\| y \|}$   (5.9)

For any square matrix A we introduce its condition number, defined as:

   cond(A) = $\| A \| \cdot \| A^{-1} \|$   (5.10)

We note that a 'good' condition number is small, near to 1.
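These bounds can be illustrated with the 2×2 integer matrix used in the numerical example below. The sketch uses the infinity norm, for which ||A||∞ is the maximum absolute row sum (a standard result assumed here, not derived in the notes); since det(A) = −1, the inverse has integer entries and everything is exact.

```python
# cond(A) in the infinity norm for A = [[100, 99], [99, 98]], and the
# large output changes produced by small input changes.
A = [[100.0, 99.0], [99.0, 98.0]]
Ainv = [[-98.0, 99.0], [99.0, -100.0]]        # exact inverse (det = -1)

matvec = lambda M, v: [M[0][0] * v[0] + M[0][1] * v[1],
                       M[1][0] * v[0] + M[1][1] * v[1]]
norm_inf = lambda M: max(abs(M[0][0]) + abs(M[0][1]),
                         abs(M[1][0]) + abs(M[1][1]))

y0 = matvec(A, [1000.0, -1000.0])    # base case: y = (1000, 1000)
y1 = matvec(A, [1001.0, -999.0])     # 0.1% change in x -> ~20% change in y
x1 = matvec(Ainv, [1001.0, 999.0])   # 0.1% change in y -> ~20% change in x
cond = norm_inf(A) * norm_inf(Ainv)  # 199 * 199 = 39601
```

The observed degradation factor (about 200 each way) stays within the worst-case bound cond(A) ≈ 4×10⁴.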
Relevance of the Condition Number

The quantity $\| \delta y \| / \| y \|$ can be interpreted as a measure of the relative uncertainty in the vector y; similarly, $\| \delta x \| / \| x \|$ is the associated relative uncertainty in the vector x. Equations (5.9) and (5.10) therefore give the important result that:

If A denotes the precise transformation y = Ax, and δx, δy are related small changes in x and y, the ratio $(\| \delta x \| / \| x \|) / (\| \delta y \| / \| y \|)$ must lie between 1/cond(A) and cond(A).

Note that if we rewrite the equations from (5.4) for A⁻¹ instead of A (reversing x and y), equation (5.9) will be the same but with x and y reversed, and (5.10) would remain unchanged. We have this clear moral, concerning any matrix multiplication or (effectively) inversion: for given A, cond(A) gives an upper bound (worst case) factor of degradation of precision between y and x = A⁻¹y. Either multiplying (Ax) or 'dividing' (A⁻¹y) can be catastrophic, the degree of catastrophe depending on cond(A) and on the 'direction' of change in x or y.

Numerical Example

Here is an example using integers for total precision. Suppose:

   A = [ 100   99 ]
       [  99   98 ]

We then have Ax = y:

   A (1000, −1000)ᵀ = (1000, 1000)ᵀ

Shifting x slightly gives:

   A (1001, −999)ᵀ = (1199, 1197)ᵀ

Alternatively, shifting y slightly:

   A (803, −801)ᵀ = (1001, 999)ᵀ

So a small change in y can cause a big change in x, or vice versa. In the above example, cond(A) is about 40,000.

Matrix Computations

We now consider methods for solving matrix equations. The most common problems encountered are of the form:

   A x = y   (5.11)

or

   A x = 0,  requiring det(A) = 0   (5.12)

(we will consider this a special case of (5.11)), or
   A x = k² B x   (5.13)

where A (and B) are known n×n matrices and x and k² are unknown. A is sometimes complex (and also B), but numerically the difference is straightforward, and so we will consider only real matrices. Usually B (and sometimes A) is positive definite (meaning that xᵀBx > 0 for all x).

Types of Matrices A and B (Sparsity Patterns)

We can classify the problems (or the matrices) according to their sparsity, that is, the proportion of zero entries. This will have a strong relevance on the choice of solution method. The main categories are: dense, where most of the matrix elements are nonzero, and sparse, where a large proportion of them are zero. Among the sparse, we can distinguish several types:

Banded matrices: all nonzero elements are grouped in a band around the main diagonal (fixed and variable band). For example, in a matrix of 'semi-bandwidth' 4, the first column has nonzero elements only in the first 4 rows, and then only those 4 numbers need storing.

Zeros and nonzeros in a band matrix of semi-bandwidth 4

Arbitrarily sparse: a sparse matrix with any pattern.

We can also distinguish between different solution methods, basically:

DIRECT, where the solution emerges in a finite number of calculations (if we temporarily ignore round-off error due to finite wordlength), and

INDIRECT or iterative, where a step-by-step procedure converges towards the correct solution.

All the common direct routines are available in software libraries and in books and journals, most commonly in Fortran or Fortran90/95, but also some in C. Indirect methods can be specially suited to sparse matrices (especially when the order is large) as they can often be implemented without the need to store the entire matrix A (or intermediate forms of matrices) in high speed memory. In this form, the elements are not stored but are 'generated' or calculated each time they are needed in the solving algorithm.

Direct Methods of Solving Ax = y – Gauss Elimination or LU Decomposition

The classic solution method of (5.11) is the Gauss method. For example, given the system:

   [ 1  4   7 ] [x1]   [1]
   [ 2  5   8 ] [x2] = [1]   (5.14)
   [ 3  6  11 ] [x3]   [1]

we subtract 2 times the first row from the second row, and then we subtract 3 times the first row from the third row, to give:

   [ 1   4    7 ] [x1]   [ 1]
   [ 0  −3   −6 ] [x2] = [−1]   (5.15)
   [ 0  −6  −10 ] [x3]   [−2]

and then subtracting 2 times the second row from the third row gives:

   [ 1   4   7 ] [x1]   [ 1]
   [ 0  −3  −6 ] [x2] = [−1]   (5.16)
   [ 0   0   2 ] [x3]   [ 0]

The steps from (5.14) to (5.16) are termed 'triangulation' or 'forward elimination'. The third row immediately gives:

   x3 = 0   (5.17a)

and substitution into row 2 gives (−3)x2 + 0 = −1, and so:

   x2 = 1/3   (5.17b)

and then into row 1, x1 + 4(1/3) + 0 = 1, and so:

   x1 = −1/3   (5.17c)

The steps through (5.17) are termed 'back-substitution'. We have here ignored the complication of 'pivoting', a technique sometimes required to improve numerical stability, which consists of changing the order of rows and columns, with some 'bookkeeping'.

Important points about this algorithm:

1. When performed on a dense matrix (or on a sparse matrix, but not taking advantage of the zeros), computing time is proportional to n³ (n: order of the matrix). This means that doubling the order of the matrix will increase computation time by up to 8 times!

2. Oddly, it turns out that, in our context, one should NEVER find the inverse matrix A⁻¹ in order to solve Ax = y for x. Even if it needs doing for a number of different right-hand-side vectors, it is better to 'keep a record' of the triangular form of (5.16) and back-substitute as necessary.

3. The triangular form of the left-hand matrix of (5.16) is crucial: it allows the next steps.

4. The determinant comes immediately as the product of the diagonal elements of (5.16).

5. Algorithms that take advantage of the special 'band' and 'variable-band' structures are very straightforward, by changing the limits of the loops when performing row or column operations.

6. Other methods very similar to Gauss are due to Crout and Choleski. The latter is (only) for use with symmetric matrices. Its advantage is that time AND storage are half that of the orthodox Gauss.
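The elimination steps (5.14) to (5.17) can be transcribed directly. The sketch below (without pivoting, which this small system does not need) reproduces the triangular form (5.16), the determinant 1·(−3)·2 = −6 and the solution (−1/3, 1/3, 0).

```python
# Forward elimination and back-substitution for the system (5.14).
A = [[1.0, 4.0, 7.0], [2.0, 5.0, 8.0], [3.0, 6.0, 11.0]]
y = [1.0, 1.0, 1.0]
n = 3

for k in range(n):                       # triangulation (forward elimination)
    for r in range(k + 1, n):
        m = A[r][k] / A[k][k]
        for col in range(k, n):
            A[r][col] -= m * A[k][col]
        y[r] -= m * y[k]

det = A[0][0] * A[1][1] * A[2][2]        # product of the diagonal

x = [0.0] * n                            # back-substitution
for k in range(n - 1, -1, -1):
    x[k] = (y[k] - sum(A[k][c] * x[c] for c in range(k + 1, n))) / A[k][k]
```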
There are variations in the implementation of the basic method, developed to take advantage of the special types of sparse matrices encountered in some cases, for example when solving problems using finite differences or finite elements. One of these is the frontal method, where elimination takes place in a carefully controlled manner, with intermediate results being kept in backing store, at the expense of a great deal of 'bookkeeping' and reordering (renumbering) of rows and columns through the process, in the search for the best compromise between numerical stability and fill-in. Another variation consists of only storing the 'nonzero' elements of the matrix.

Iterative Methods of Solving Ax = y

We will outline 3 methods: (1) gradient methods and their most popular version, the conjugate gradient method; (2a) the Jacobi (or simultaneous displacement) and (2b) the closely related Gauss-Seidel (or successive displacement) algorithm.

1. Gradient Methods

Although known and used for decades, it was in the 1980s that the conjugate gradient method came to be adopted as one of the most popular iterative algorithms for solving linear systems of equations A x = y. The fundamental idea of this approach is to introduce an error residual A x − y for some trial vector x and proceed to minimize the residual with respect to the components of x. The equation to be solved for x:

   A x = y   (5.18)

can be recast as: finding x to minimize the 'error residual', a column vector r defined as a function of x by:

   r = y − A x   (5.19)

The norm $\| r \|$ of this residual vector is an obvious choice for the quantity to minimise. The square of the norm, using (5.19), gives (if A is symmetric: Aᵀ = A):

   $\| r \|_2^2 = (y - Ax)^T (y - Ax) = x^T A A x - 2 x^T A y + y^T y$   (5.20)

which is rather awkward to compute, because of the product AA. Another possible choice of error functional (measure of the error), valid for the case where the matrix A is positive definite and which is also minimised at the correct solution, is the functional h² = rᵀA⁻¹r (instead of rᵀr as in (5.20)), which is not negative, is only zero when the error is zero, and involves no square roots to calculate. Expanding it:

   $h^2 = r^T A^{-1} r = (y - Ax)^T A^{-1} (y - Ax) = y^T A^{-1} y - 2 x^T y + x^T A x$   (5.21)

and, because the first term in (5.21) is independent of the variables and will play no part in the minimisation, we can drop it. This gives a simpler form:

   $h^2 = x^T A x - 2 x^T y$   (5.22)

The general idea of this kind of method is to search for the solution (the minimum of the error residual) in a multidimensional space (of the components of the vector x). The method proceeds by evaluating the error functional at a point x0, choosing a direction to move, and finding the minimum value of the functional along that line. That is, if p gives the direction, the line is defined by all the points (x + αp), where α is a scalar parameter. The next step is to find the value of α that minimizes the error along the line. (Since in this case α is the only variable, it is simple to calculate the gradient of the error functional as a function of α and find the corresponding minimum.) This gives the next point x1.

Several variations appear at this point. It would seem obvious that the best direction to choose is that of the gradient (its negative, or downwards), and that is the choice taken in the conventional "steepest-gradient" or "steepest descent" method. However, this will not always allow us to reach the minimum, or at least not in an efficient way: the convergence is poor due to the discrete nature of the steps. The consecutive directions are mutually orthogonal, and then there will be no more than n different directions; in 2D (see Fig. 5.1) this means that every time we have to make a change of direction at a right angle to the previous one.

Fig. 5.1

The more efficient and popular "conjugate gradient method" looks instead for a direction which is 'A-orthogonal' (or conjugate) to the previous one (pᵀAq = 0 instead of pᵀq = 0).

Exercise 5.1: Show that the value of α that minimizes the error functional along the line (x + αp) in the two cases mentioned above (using the squared norm of the residual as error functional, or the functional h²) is given respectively by:

   $\alpha = \frac{p^T A (y - Ax)}{\| A p \|^2}$   and   $\alpha = \frac{p^T (y - Ax)}{p^T A p}$
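A minimal sketch of steepest descent using the second form of α with p = r at every step; the example matrix and right-hand side are assumptions chosen for illustration. Note that consecutive residuals (and hence consecutive directions) come out orthogonal, exactly as claimed above.

```python
# Steepest descent on a small symmetric positive definite system,
# with the optimum step alpha = (r.r) / (r.A r) along p = r.
A = [[4.0, 1.0], [1.0, 3.0]]
y = [1.0, 2.0]

mv = lambda v: [A[0][0] * v[0] + A[0][1] * v[1],
                A[1][0] * v[0] + A[1][1] * v[1]]
dot = lambda u, v: u[0] * v[0] + u[1] * v[1]

x = [0.0, 0.0]
r = [y[0], y[1]]                  # r0 = y - A x0, with x0 = 0
orth = []                         # |r_k . r_{k+1}| at each step
for _ in range(50):
    Ar = mv(r)
    alpha = dot(r, r) / dot(r, Ar)
    x = [x[0] + alpha * r[0], x[1] + alpha * r[1]]
    r_new = [r[0] - alpha * Ar[0], r[1] - alpha * Ar[1]]
    orth.append(abs(dot(r, r_new)))
    r = r_new

residual = max(abs(v) for v in r)
```

For this well-conditioned 2×2 example the residual shrinks by a roughly constant factor per step, so 50 iterations drive it to rounding level.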
An even better advantage of the conjugate gradient method is that it is guaranteed to converge in at most n steps.page 49 E763 (part 2) Numerical Methods A useful feature of this method (as can be observed from the expressions above) is that reference to the matrix A is only via simple matrix products. for given values of the matrices A. The Conjugate Gradient Algorithm The basic steps of the algorithm are as follow: 1) Choose a starting point x0. a) Jacobi Method or Simultaneous Displacement Suppose the set of equations for solution are: a11x1 + a12x2 + a13x3 = y1 a21x1 + a22x2 + a23x3 = y2 a31x1 + a32x2 + a33x3 = y3 This can be reorganised to: (5. Here is where the methods differ. These can be formed from a given sparse A without unnecessary multiplication (of nonzeros) or storage. In the first step we choose that of the gradient of h2: and this coincides with the direction of the residual r: ∇h 2 = ∇(xT Ax 0 − 2xT y ) = 2( Ax 0 − y ) = −2r0 0 so we choose: p 0 = r0 . – a serious problem if A is not ‘safely’ positivedefinite. to alleviate the problem that the condition number of (5.23) x1 = ( y1 – a12x2 – a13x3)/a11 (5. where n is the order of the matrix. we need only form A times a vector (Ax or Api) and AT times a vector.24a) . y. xi and pi. 2. p k = rk + β p k −1 where β = T rk rk T rk −1rk −1 More robust versions of the algorithm use a preliminary ‘preconditioning’ of the matrix A. For the conjugate gradient the new vector p is not along the gradient of r but instead: p T (y − Ax ) p T As using p0 and x0. 2) Choose direction to move. as a complete package and to many more variations in implementation that can be found as commercial packages.20) is the square of the condition number of A. 3) 4) 5) 6) Calculate the distance to move – parameter α: α = Calculate the new point: x k = x k −1 + α p k −1 Calculate the new residual: rk = y − A(x k −1 + α p k −1 ) = y − Ax k −1 + α p k −1 = rk −1 + α p k −1 Calculate the next direction. 
This leads to the popular ‘PCCG’ or PreConditionedConjugateGradient algorithm.– Jacobi and GaussSeidel Methods Two algorithms that are simple to implement are the closely related Jacobi (simultaneous displacement) and GaussSeidel (successive displacement or ‘relaxation’).
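The conjugate gradient steps above can be sketched compactly. This is an illustrative Python sketch (the notes' own examples are in FORTRAN; numpy and the function name are assumptions, not part of the notes); note that only products of A with a vector are needed, as remarked above:

```python
import numpy as np

def conjugate_gradient(A, y, tol=1e-10):
    """Conjugate gradient sketch for symmetric positive-definite A."""
    n = len(y)
    x = np.zeros(n)
    r = y - A @ x              # initial residual
    p = r.copy()               # first direction: along the gradient (residual), step 2
    for _ in range(n):         # guaranteed to converge in at most n steps
        Ap = A @ p
        alpha = (p @ r) / (p @ Ap)    # distance to move, step 3
        x = x + alpha * p             # new point, step 4
        r_new = r - alpha * Ap        # new residual, step 5
        if np.linalg.norm(r_new) < tol:
            break
        beta = (r_new @ r_new) / (r @ r)   # conjugacy coefficient, step 6
        p = r_new + beta * p               # next (A-orthogonal) direction
        r = r_new
    return x

A = np.array([[4., 1.], [1., 3.]])
y = np.array([1., 2.])
x = conjugate_gradient(A, y)
print(x)
```

For this 2x2 system the iteration terminates after (at most) two steps, in agreement with the n-step convergence property.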
Jacobi and Gauss-Seidel Methods
Two algorithms that are simple to implement are the closely related Jacobi (simultaneous displacement) and Gauss-Seidel (successive displacement or 'relaxation') methods.

a) Jacobi Method or Simultaneous Displacement
Suppose the set of equations for solution is:

a11 x1 + a12 x2 + a13 x3 = y1
a21 x1 + a22 x2 + a23 x3 = y2   (5.23)
a31 x1 + a32 x2 + a33 x3 = y3

This can be reorganised to:

x1 = (y1 - a12 x2 - a13 x3)/a11   (5.24a)
x2 = (y2 - a23 x3 - a21 x1)/a22   (5.24b)
x3 = (y3 - a31 x1 - a32 x2)/a33   (5.24c)
Suppose we had the vector x(0) = [x1, x2, x3](0), and substituted it into the right-hand side of (5.24), to yield on the left-hand side the new vector x(1) = [x1, x2, x3](1). Successive substitutions will give the sequence of vectors:
x(0), x(1), x(2), x(3), . . . .
Because (5.24) is merely a rearrangement of the equations for solution, the 'correct' solution substituted into (5.24) must be self-consistent, i.e. yield itself! The sequence will either converge to the correct solution or diverge.
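A minimal sketch of the simultaneous-displacement iteration, written in Python as an illustration (the notes' own examples are in FORTRAN), applied to the same 3x3 system used in the Gauss-Seidel example below:

```python
import numpy as np

# System: 4x1 - x2 + x3 = 4;  x1 + 6x2 + 2x3 = 9;  -x1 - 2x2 + 5x3 = 2
A = np.array([[4., -1., 1.], [1., 6., 2.], [-1., -2., 5.]])
y = np.array([4., 9., 2.])

x = np.zeros(3)                      # starting value (only affects iteration count)
off_diag = A - np.diag(np.diag(A))   # A with its diagonal removed
for _ in range(50):
    # every component is updated from the *previous* vector: simultaneous displacement
    x = (y - off_diag @ x) / np.diag(A)
print(x)
```

The iterates converge to the solution (1, 1, 1), as the Gauss-Seidel run below also does.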
b) Gauss-Seidel Method or Successive Displacement
Note that when eq. (5.24a) is applied, a 'new' value of x1 becomes available, which could be used instead of the 'previous' x1 value when solving (5.24b); and similarly for x1 and x2 when applying (5.24c). This is the Gauss-Seidel or successive displacement iterative scheme, illustrated here with an example, showing that the computer program (in FORTRAN 90) is barely more complicated than writing down the equations.
      ! Example of successive displacement
      x1 = 0.
      x2 = 0.
      x3 = 0.
      do i = 1, 10                   ! Equations being solved are:
        x1 = (4. + x2 - x3)/4.       !  4*x1 -   x2 +   x3 = 4
        x2 = (9. - 2.*x3 - x1)/6.    !    x1 + 6*x2 + 2*x3 = 9
        x3 = (2. + x1 + 2.*x2)/5.    !   -x1 - 2*x2 + 5*x3 = 2
        print *, x1, x2, x3
      enddo
      stop
      end

The output, one line per iteration, converges to the solution (1, 1, 1):

      x1          x2          x3
      1           1.333333    1.133334
      1.05        0.9472222   0.9888889
      0.9895833   1.00544     1.000093
      1.001337    0.9997463   1.000166
      0.9998951   0.9999621   0.9999639
      0.9999996   1.000012    1.000005
      1.000002    0.9999981   0.9999996
      0.9999996   1           1
      1           1           1
      1           1           1
Whether the algorithm converges or not depends on the matrix A and (surprisingly) not on the right-hand-side vector y of (5.23). Convergence does not even depend on the 'starting value' of the vector, which only affects the necessary number of iterations. We will skip over any formal proof of convergence, but to give the sharp criteria for convergence, first one splits A as:
A=L+D+U
where L, D and U are the lower triangular, diagonal and upper triangular parts of A. Then the schemes converge if and only if all the eigenvalues of the matrix:

D^-1 (U + L)      {for simultaneous displacement}
(D + L)^-1 U      {for successive displacement}

lie within the unit circle. For applications, it is simpler to use some sufficient (but not necessary!) conditions, when possible, such as:
(a) If A is symmetric and positive-definite, then successive displacement converges.
(b) If, in addition, a_ij < 0 for all i ≠ j, then simultaneous displacement also converges.
Condition (a) is a commonly satisfied condition. Usually, the successive method converges in about half the computer time of the simultaneous one, but, strictly, there are matrices where one method converges and the other does not, and vice versa.
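The sharp criterion can be checked numerically. As an illustrative sketch (Python with numpy, an assumption of this example, not part of the original notes), for the 3x3 test matrix used earlier, both iteration matrices have all eigenvalues inside the unit circle, so both schemes converge:

```python
import numpy as np

A = np.array([[4., -1., 1.], [1., 6., 2.], [-1., -2., 5.]])
D = np.diag(np.diag(A))   # diagonal part
L = np.tril(A, -1)        # strictly lower triangular part
U = np.triu(A, 1)         # strictly upper triangular part

# spectral radii of the two iteration matrices
rho_simultaneous = max(abs(np.linalg.eigvals(np.linalg.inv(D) @ (U + L))))
rho_successive = max(abs(np.linalg.eigvals(np.linalg.inv(D + L) @ U)))
print(rho_simultaneous, rho_successive)   # both below 1: both schemes converge
```

The smaller the spectral radius, the faster the convergence, which is consistent with the remark that the successive method usually converges in about half the time.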
Matrix Eigenvalue Problems
The second class of matrix problem that occurs frequently in the numerical solution of differential equations, as well as in many other areas, is the matrix eigenvalue problem. On many occasions the matrices will be large and very sparse; in others they will be dense, and normally the solution methods will have to take these characteristics into account in order to achieve a solution in an efficient way. There are a number of different methods to solve matrix eigenvalue problems involving dense or sparse matrices, each of them with different characteristics and better adapted to different types of matrices and requirements.
Choice of method
The choice of method will depend on the characteristics of the problem (the type of the matrices) and on the solution requirements. For example, most methods suitable for dense matrices calculate all eigenvectors and eigenvalues of the system. However, in many problems arising from the numerical solution of PDEs one is only interested in one or just a few eigenvalues and/or eigenvectors. Also, in many cases the matrices will be large and sparse. In what follows we will concentrate on methods which are suitable for sparse matrices (of course the same methods can be applied to dense matrices). The problem to solve can have two different forms:
Ax = λx      (standard eigenvalue problem)      (5.25)

Ax = λBx     (generalized eigenvalue problem)   (5.26)
Sometimes the generalized eigenvalue problem can be converted into the form (5.25) simply by premultiplying by the inverse of B:
B −1Ax = λ x
however, even if A and B are symmetric, the new matrix on the left-hand side (B^-1 A) will have lost this property. Instead, it is preferable to decompose (factorise) the matrix B in the form B = LL^T (Cholesky factorisation, possible if B is positive definite). Substituting in (5.26) and premultiplying by L^-1 will give:
L^-1 A x = λ L^T x

and since (L^-1)^T L^T = I, the identity matrix:

L^-1 A (L^-1)^T L^T x = λ L^T x

Putting L^T x = y and Ã = L^-1 A (L^-1)^T gives:

Ã y = λ y
The matrix Ã is symmetric if A and B are symmetric, and the eigenvalues are not modified. The eigenvectors x can be obtained from y simply by back-substitution. However, if the matrices A and B are sparse, this method is not convenient because Ã will generally be dense. In the case of sparse matrices, it is more convenient to solve the generalized problem directly.
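For small dense symmetric matrices, the reduction above can be verified numerically. This is an illustrative sketch (Python with numpy is an assumption of this example, not part of the notes); the specific matrices A and B are arbitrary symmetric positive-definite test data:

```python
import numpy as np

A = np.array([[2., 1.], [1., 3.]])
B = np.array([[4., 1.], [1., 2.]])     # symmetric positive definite

L = np.linalg.cholesky(B)              # Cholesky factorisation: B = L L^T
Linv = np.linalg.inv(L)
A_tilde = Linv @ A @ Linv.T            # A~ = L^-1 A (L^-1)^T, still symmetric

lam, Y = np.linalg.eigh(A_tilde)       # standard problem A~ y = lambda y
X = np.linalg.solve(L.T, Y)            # back-substitution L^T x = y recovers x
print(lam)                             # eigenvalues of the generalized problem
```

Each recovered pair (λ, x) satisfies the original generalized problem Ax = λBx, confirming that the eigenvalues are not modified by the reduction.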
Solution Methods
We can again classify solution methods as:
1. Direct (or transformation) methods:
   i)   Jacobi rotations: converts the matrix of (5.25) to diagonal form
   ii)  QR (or QZ for complex): converts the matrices to triangular form
   iii) Conversion to tridiagonal form: normally to be followed by either i) or ii)

All these methods give all the eigenvalues. We will not examine any of these in detail.
2. Vector Iteration Methods:
These are better suited for sparse matrices and also for the case when only a few eigenvalues and eigenvectors are needed.
The Power Method (or Direct Iteration)
For the standard problem (5.25) the algorithm consists of the repeated multiplication of a starting vector by the matrix A. This can be seen to produce a reinforcement of the component of the trial vector along the direction of the eigenvector of largest absolute eigenvalue, causing the trial vector to converge gradually to that eigenvector. The algorithm can be described schematically by:
Choose starting vector → Premultiply by A → Normalize → Check convergence: if OK, STOP; if not OK, repeat from the multiplication step.

The normalization step is necessary because otherwise the iteration vector can grow indefinitely in length over the iterations.

How does this algorithm work? If φi, i = 1, ..., N are the eigenvectors of A, we can write any vector of N components as a superposition of them (they constitute a basis in the space of N dimensions). In particular, for the starting vector:

x0 = Σ_{i=1..N} αi φi   (5.27)

When we multiply by A we get x̃1 = A x0, or:

x̃1 = A Σ αi φi = Σ αi A φi = Σ αi λi φi   (5.28)

If λ1 is the eigenvalue of largest absolute value, we can also write this as:

x̃1 = λ1 Σ_{i=1..N} αi (λi/λ1) φi

This is then normalized by: x1 = x̃1 / ||x̃1||. Then, after n iterations of multiplication by A and normalization, we will get:

xn = C Σ_{i=1..N} αi (λi/λ1)^n φi   (5.29)

where C is a normalization constant. From this expression we can see that since |λ1| ≥ |λi| for all i, the coefficient of every φi with i ≠ 1 will tend to zero as n increases, and the vector xn will gradually converge to φ1.
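The loop above is only a few lines of code. This is an illustrative Python sketch (numpy and the test matrix are assumptions of the example, not part of the notes); the eigenvalue is estimated with the Rayleigh quotient introduced just below:

```python
import numpy as np

def power_method(A, iters=200):
    """Power method sketch: repeated multiplication by A plus normalization."""
    x = np.ones(A.shape[0])                # starting vector
    for _ in range(iters):
        x_new = A @ x                      # premultiply by A
        x = x_new / np.linalg.norm(x_new)  # normalize
    lam = (x @ A @ x) / (x @ x)            # Rayleigh quotient estimate of lambda_1
    return lam, x

A = np.array([[2., 1.], [1., 3.]])
lam, v = power_method(A)
print(lam)   # dominant eigenvalue of A, here (5 + sqrt(5))/2
```

The convergence rate is governed by the ratio |λ2/λ1|, in agreement with expression (5.29).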
This method (the power method) finds the dominant eigenvector, that is, the eigenvector that corresponds to the largest eigenvalue (in absolute value). To find the eigenvalue, we can see that premultiplying (5.25) by the transpose of φi will give:

φi^T A φi = λi φi^T φi

where φi^T A φi and φi^T φi are scalars, so we can write:

λi = (φi^T A φi) / (φi^T φi)   (5.30)

This is known as the Rayleigh quotient and can be used to obtain the eigenvalue from the eigenvector. This expression has interesting properties. In particular, if we only know an estimate of the eigenvector, (5.30) will give an estimate of the eigenvalue with a higher order of accuracy than that of the eigenvector itself.

Exercise 5.2: Write a short program to calculate the dominant eigenvector and the corresponding eigenvalue of a matrix like that in (7.14) but of order 7, using the power method. Terminate the iterations when the relative difference between two successive estimates of the eigenvalue is within a tolerance of 10^-6.

Exercise 5.3 (advanced): Using the same reasoning used to explain the power method for the standard eigenvalue problem, (5.27)-(5.29), develop the corresponding algorithm for the generalized eigenvalue problem Ax = λBx.

The power method in the form presented above can only be used to find the dominant eigenvector. However, modified forms of the basic idea can be developed to find any eigenvector of the matrix. One of these modifications is the inverse iteration.

Inverse Iteration
For the system Ax = λx we can write: (1/λ) x = A^-1 x, from where we can see that the eigenvalues of A^-1 are 1/λ, the reciprocals of the eigenvalues of A, and the eigenvectors are the same. Then, if we are interested in finding the eigenvector of A corresponding to the smallest eigenvalue in absolute value (closest to zero), we can notice that for that eigenvalue λ its reciprocal is the largest, and so it can be found using the power method on the matrix A^-1. The procedure then is as follows:

1. Choose a starting vector x0.
2. Find x̃1 = A^-1 x0, or instead solve: A x̃1 = x0 (avoiding the explicit calculation of A^-1).
3. Normalize: x1 = C x̃1 (C is a normalization factor to have ||x1|| = 1).
4. Restart.

Exercise 5.4: Write a program and calculate by inverse iteration the smallest eigenvalue of the tridiagonal matrix A of order 7 where the elements are -4 in the main diagonal and 1 in the subdiagonals. You can use the algorithms given in section 1 of the Appendix for the solution of the linear system of equations. Use a tolerance of 10^-6.

Shifted Inverse Iteration
An important extension of the inverse iteration method allows finding any eigenvalue of the spectrum (spectrum of a matrix: the set of its eigenvalues), not just the one of smallest absolute value. Suppose that the system Ax = λx has the set of eigenvalues {λi}. If we construct the matrix Ã as Ã = A - σI, where I is the identity matrix and σ is a real number, we have:

Ã x = Ax - σx = λx - σx = (λ - σ) x   (5.31)

and we can see that the matrix Ã has the same eigenvectors as A and its eigenvalues are {λ - σ}; that is, the same eigenvalues as A but shifted by σ. Then, if we apply the inverse iteration method to the matrix Ã, the procedure will yield the eigenvalue (λi - σ) closest to zero; that is, we can find the eigenvalue λi of A closest to the real number σ.

Since the convergence rate of the power method depends on the relative value of the coefficients of each eigenvector in the expansion of the trial vector during the iterations (as in (5.27)), and these are affected by the ratio between the eigenvalues λi and λ1, the convergence will be fastest when this ratio is largest, as we can see from (5.29). The same reasoning applied to the shifted inverse iteration method leads to the conclusion that the convergence rate will be fastest when the shift is chosen as close as possible to the target eigenvalue.

Rayleigh Iteration
Another extension of this method is known as Rayleigh iteration. In this case, the Rayleigh quotient is used to calculate an estimate of the eigenvalue at each iteration, and the shift is updated using this value. In this form, the Rayleigh iteration has faster convergence than the ordinary shifted inverse iteration.

Exercise 5.5: Write a program using shifted inverse iteration to find the eigenvalue of the matrix A of the previous exercise which lies closest to 3.5. Then, write a modified version of this program using a Rayleigh-quotient update of the shift in every iteration (Rayleigh iteration). Compare the convergence of both procedures (by the number of iterations needed to achieve the same tolerance for the relative difference between successive estimates of the eigenvalue). Use a tolerance of 10^-6 in both programs.
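The inverse and shifted inverse iterations can be sketched together, since the latter is just the former applied to A - σI. This is an illustrative Python sketch (numpy is an assumption of the example), using the order-7 tridiagonal test matrix of Exercise 5.4; note that the spectrum of that matrix is entirely negative, so the shifted run below targets -3.5 rather than the 3.5 of Exercise 5.5:

```python
import numpy as np

n = 7
# tridiagonal matrix: -4 on the main diagonal, 1 on the subdiagonals
A = (-4.0 * np.eye(n) +
     np.diag(np.ones(n - 1), 1) +
     np.diag(np.ones(n - 1), -1))

def shifted_inverse_iteration(A, sigma=0.0, iters=100):
    """Inverse iteration on A - sigma*I; sigma = 0 gives plain inverse iteration."""
    M = A - sigma * np.eye(A.shape[0])
    x = np.ones(A.shape[0])
    for _ in range(iters):
        x = np.linalg.solve(M, x)      # solve M x_new = x: no explicit inverse
        x = x / np.linalg.norm(x)      # normalize
    return (x @ A @ x) / (x @ x), x    # Rayleigh quotient on the original A

lam_small, _ = shifted_inverse_iteration(A, sigma=0.0)   # smallest |lambda|
lam_near, _ = shifted_inverse_iteration(A, sigma=-3.5)   # eigenvalue nearest -3.5
print(lam_small, lam_near)
```

For this matrix the eigenvalues are known in closed form, -4 + 2 cos(kπ/8) for k = 1, ..., 7, which provides a convenient check.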
6. Numerical Differentiation and Integration

Numerical Differentiation
For a function of one variable f(x), the derivative at a point x = a is defined as:

f'(a) = lim_{h→0} [f(a + h) - f(a)] / h   (6.1)

This suggests that, choosing a small value for h, the derivative can be reasonably approximated by the forward difference:

f'(a) ≈ [f(a + h) - f(a)] / h   (6.2)

An equally valid approximation can be the backward difference:

f'(a) ≈ [f(a) - f(a - h)] / h   (6.3)

Graphically, we can see the meaning of each of these expressions in Fig. 6.1. The derivative at xc is the slope of the tangent to the curve at the point C. The slopes of the chords between the points A and C, and B and C, are the values of the backward and forward difference approximations to the derivative, respectively. We can see that a better approximation to the derivative is obtained by the slope of the chord between points A and B, labelled "central difference" in Fig. 6.1.

[Fig. 6.1: a curve through the points A (xa, ya), C (xc, yc) and B (xb, yb), with the forward, central and backward difference chords drawn]

We can understand this better by analysing the error in each approximation through the use of Taylor expansions. Considering the expansions for the points a + h and a - h:

f(a + h) = f(a) + f'(a) h + f''(a) h^2/2! + f'''(a) h^3/3! + ...   (6.4)
f(a - h) = f(a) - f'(a) h + f''(a) h^2/2! - f'''(a) h^3/3! + ...   (6.5)

where in both cases the remainder (error) can be represented by a term of the form f^(4)(ξ) h^4/4! (see Appendix).
Truncating (6.4) to first order we can then see that for the forward difference formula we have:

f(a + h) = f(a) + f'(a) h + O(h^2),   so   f'(a) = [f(a + h) - f(a)] / h + O(h)   (6.6)

where the symbol O(h^n) means "a term of the order of h^n". We can see that the error of this approximation is of the order of h. A similar result is obtained for the backward difference.

We can also see that subtracting (6.5) from (6.4) and discarding terms of order h^3 and higher we can obtain a better approximation:

f(a + h) - f(a - h) = 2 f'(a) h + O(h^3)

from where we obtain the central difference formula:

f'(a) = [f(a + h) - f(a - h)] / (2h) + O(h^2)   (6.7)

which has an error of the order of h^2. Many more expressions with different degrees of accuracy can be constructed using a similar procedure and using more points: the Taylor expansions for the function at a number of points can be used to eliminate all the terms except the desired derivative.

Example: Considering the 3 points x0, x1 = x0 + h and x2 = x0 + 2h and taking the Taylor expansions at x1 and x2, we can construct a three-point forward difference formula:

f(x1) = f(x0) + f'(x0) h + f''(x0) h^2/2! + O(h^3)   (6.8)
f(x2) = f(x0) + f'(x0) 2h + f''(x0) 4h^2/2! + O(h^3)   (6.9)

Multiplying (6.8) by 4 and subtracting (6.9) to eliminate the second derivative term:

4 f(x1) - f(x2) = 3 f(x0) + 2 f'(x0) h + O(h^3)

from where we can extract for the first derivative:

f'(x0) = [-3 f(x0) + 4 f(x1) - f(x2)] / (2h) + O(h^2)   (6.10)

Exercise 6.1: Considering the 3 points x0, x1 = x0 - h and x2 = x0 - 2h and the Taylor expansions at x1 and x2, find a three-point backward difference formula. What is the order of the error?

Exercise 6.2: Using the Taylor expansions for f(a + h) and f(a - h), show that a suitable formula for the second derivative is:

f''(a) ≈ [f(a - h) - 2 f(a) + f(a + h)] / h^2   (6.11)

Show also that the error is O(h^2).

Exercise 6.3: Use the Taylor expansions for f(a + h), f(a - h), f(a + 2h) and f(a - 2h) to show that the following are formulae for f'(a) and f''(a), and that both have an error of the order of h^4:

f'(a) ≈ [f(a - 2h) - 8 f(a - h) + 8 f(a + h) - f(a + 2h)] / (12h)
f''(a) ≈ [-f(a - 2h) + 16 f(a - h) - 30 f(a) + 16 f(a + h) - f(a + 2h)] / (12 h^2)

Expressions for the derivatives can also be found using other methods. For example, if the function is interpolated with a polynomial using, say, n points, the derivative (first, second, etc) can be estimated by calculating the derivative of the interpolating polynomial at the point of interest.

Example: Considering the 3 points x1, x2 and x3 with x1 < x2 < x3 (this time, not necessarily equally spaced) and respective function values y1, y2 and y3, we can use the Lagrange interpolation polynomial to approximate y(x):

f(x) ≈ L(x) = L1(x) y1 + L2(x) y2 + L3(x) y3   (6.12)

where:

L1(x) = (x - x2)(x - x3) / [(x1 - x2)(x1 - x3)]
L2(x) = (x - x1)(x - x3) / [(x2 - x1)(x2 - x3)]   (6.13)
L3(x) = (x - x1)(x - x2) / [(x3 - x1)(x3 - x2)]

so the first derivative can be approximated by f'(x) ≈ L'(x) = L1'(x) y1 + L2'(x) y2 + L3'(x) y3, which is:

f'(x) ≈ [(2x - x2 - x3) / ((x1 - x2)(x1 - x3))] y1 + [(2x - x1 - x3) / ((x2 - x1)(x2 - x3))] y2 + [(2x - x1 - x2) / ((x3 - x1)(x3 - x2))] y3

This general expression will give the value of the derivative at any of the points x1, x2 or x3. For example, at x1:

f'(x1) ≈ [(2x1 - x2 - x3) / ((x1 - x2)(x1 - x3))] y1 + [(x1 - x3) / ((x2 - x1)(x2 - x3))] y2 + [(x1 - x2) / ((x3 - x1)(x3 - x2))] y3   (6.14)

Exercise 6.4: Show that if the points are equally spaced by the distance h in (6.12) and the expression is evaluated at x1, x2 or x3, the expression reduces respectively to the 3-point forward difference formula (6.10), the central difference, and the 3-point backward difference formulae.
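The different error orders of the forward and central formulae are easy to observe numerically. This is an illustrative Python sketch (the function sin and the point a = 1 are arbitrary test choices, not from the notes):

```python
import math

def forward(f, a, h):
    return (f(a + h) - f(a)) / h            # eq. (6.2), error O(h)

def central(f, a, h):
    return (f(a + h) - f(a - h)) / (2 * h)  # eq. (6.7), error O(h^2)

f, a = math.sin, 1.0
exact = math.cos(a)
for h in (0.1, 0.05, 0.025):
    print(h, abs(forward(f, a, h) - exact), abs(central(f, a, h) - exact))
```

Halving h roughly halves the forward-difference error but quarters the central-difference error, confirming the O(h) and O(h^2) behaviour.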
Central Difference Expressions for Different Derivatives
The following expressions are central difference approximations for the derivatives:

i)   f'(xi) = [f(xi+1) - f(xi-1)] / (2h)
ii)  f''(xi) = [f(xi+1) - 2 f(xi) + f(xi-1)] / h^2
iii) f'''(xi) = [f(xi+2) - 2 f(xi+1) + 2 f(xi-1) - f(xi-2)] / (2 h^3)
iv)  f^(4)(xi) = [f(xi+2) - 4 f(xi+1) + 6 f(xi) - 4 f(xi-1) + f(xi-2)] / h^4

Naturally, many more expressions can be developed using more points and/or different methods.

Exercise 6.5: Derive expressions iii) and iv) above. What is the order of the error for each of the 4 expressions above?

Partial Derivatives
For a function of two variables f(x, y), the partial derivative with respect to x is defined as:

∂f(x, y)/∂x = lim_{h→0} [f(x + h, y) - f(x, y)] / h

Again, we can approximate this expression by a difference, assuming that h is sufficiently small; for example, a central difference expression for ∂f/∂x is:

∂f(x, y)/∂x ≈ [f(x + h, y) - f(x - h, y)] / (2h)   (6.15)

Similarly, ∂f(x, y)/∂y ≈ [f(x, y + h) - f(x, y - h)] / (2h). Then, the gradient of f is given by:

∇f(x, y) ≈ (1/(2h)) [ (f(x + h, y) - f(x - h, y)) x̂ + (f(x, y + h) - f(x, y - h)) ŷ ]

Exercise 6.6: Using central difference formulae for the second derivatives, derive an expression for the Laplacian (∇²f) of a scalar function f.

Numerical Integration
In general, numerical integration methods approximate the definite integral of a function f by a weighted sum of function values at several points in the interval of integration. In general these
methods are called "quadrature" methods, and they differ in how the points and the weights are chosen.

Trapezoid Rule
The simplest method to approximate the integral of a function is the trapezoid rule. In this case, the interval of integration is divided into a number of subintervals, on which the function is simply approximated by a straight line, as shown in Fig. 6.2. The integral (area under the curve) is then approximated by the sum of the areas of the trapezoids based on each subinterval.

[Fig. 6.2: the function approximated by straight-line segments over the subintervals around the points x(i-1), x(i), x(i+1)]

The area of the trapezoid with base in the interval [xi, xi+1] is:

(xi+1 - xi)(fi + fi+1) / 2

and the total area is then the sum of all the terms of this form. If we denote hi = (xi+1 - xi), the integral can be approximated by:

∫_{x1}^{xn} f(x) dx ≈ (1/2) Σ_{i=1..n-1} hi (fi + fi+1)   (6.16)

If all the subintervals are of the same width h (the points are equally spaced), (6.16) reduces to:

∫_{x1}^{xn} f(x) dx ≈ h [ (f1 + fn)/2 + Σ_{i=2..n-1} fi ]   (6.17)

Exercise 6.7: Using the trapezoid rule: (a) calculate the integral of exp(-x^2) between 0 and 2; (b) calculate the integral of 1/x between 1 and 2. In both cases use 10 and 100 subintervals (11 and 101 points respectively).

It can be shown that the error incurred with the application of the trapezoid rule in one interval is given by the term:

E = -[(b - a)^3 / 12] f''(ξ)   (6.18)

where ξ is a point inside the interval and the error is defined as the difference between the exact integral (I) and the area of the trapezoid (A): E = I - A. If this is applied to the trapezoid rule using a number of subintervals of [a, b], the error term changes to:

E = -[(b - a) h^2 / 12] f''(ξh)   (6.19)

where now ξh is a point in the complete interval [a, b] and depends on h. Considering the error as the sum of the individual errors in each subinterval, we can write (6.19) in the form:

E = -(h^2/12) Σ_{i=1..n} h f''(ξi)   (6.20)

where the ξi are points in each subinterval. The sum in (6.20), in the limit when n → ∞ and h → 0, corresponds to the integral of f'' over the interval [a, b]. Then, we can write (6.20) as:

E ≈ -(h^2/12) [f'(b) - f'(a)]   (6.21)

Two options are open now. We can use this term to estimate the error incurred or, equivalently, to determine the number of equally spaced points required for a given precision; or we can include this term in the calculation to form a corrected form of the trapezoid rule:

∫_a^b f(x) dx ≈ h [ (f1 + fn)/2 + Σ_{i=2..n-1} fi ] - (h^2/12) [f'(b) - f'(a)]   (6.22)

Simpson's Rule
In the case of the trapezoid rule, the function is approximated by a straight line, and this can be done repeatedly by subdividing the interval. A higher degree of accuracy using the same number of subintervals can be obtained with a better approximation than the straight line; for example, choosing a quadratic approximation could give a better result. This is Simpson's rule.

Consider the function f(x) and the interval [a, b]. Define the points x0, x1 and x2 as:

x0 = a,   x1 = (a + b)/2,   x2 = b,   with   h = (b - a)/2

Using Lagrange interpolation to generate a second order polynomial approximation to f(x) gives, as in (6.12):

f(x) ≈ f(x0) L0(x) + f(x1) L1(x) + f(x2) L2(x)

where:

L0(x) = (x - x1)(x - x2) / [(x0 - x1)(x0 - x2)] = (1/(2h^2)) (x - x1)(x - x2)
L1(x) = (x - x0)(x - x2) / [(x1 - x0)(x1 - x2)] = -(1/h^2) (x - x0)(x - x2)
L2(x) = (x - x0)(x - x1) / [(x2 - x0)(x2 - x1)] = (1/(2h^2)) (x - x0)(x - x1)

Then:

∫_a^b f(x) dx ≈ ∫_a^{a+2h} [f(x0) L0(x) + f(x1) L1(x) + f(x2) L2(x)] dx   (6.23)
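The composite trapezoid formula (6.17) is short to program. This is an illustrative Python sketch (the notes' own programs are in FORTRAN), applied to part (b) of Exercise 6.7, whose exact value is ln 2:

```python
import math

def trapezoid(f, a, b, n):
    """Composite trapezoid rule with n equal subintervals, cf. eq. (6.17)."""
    h = (b - a) / n
    s = 0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n))
    return h * s

approx = trapezoid(lambda x: 1.0 / x, 1.0, 2.0, 100)
print(approx, math.log(2))   # the two values agree closely
```

With 100 subintervals the error is of the order of h^2 = 10^-4 times a moderate constant, consistent with the estimate (6.19).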
25 0. ∫ L1( x)dx = a ∫ L2 ( x)dx = 3 h Then.25 Use of the Simpson’s rule to calculate the integral: We have: x0 = a = 0.5 The figure shows the function (in blue) and the 2nd order Lagrange interpolation used to calculate the integral with the Simpson’s rule.25 ∫ (sin πx + 0.5)dx ≈ 3 ( f (a) + 4 f (a + h) + f (a + 2h)) = 0.97140452 h .3 1.5 1 1.24) Example 1.23): ∫ f ( x)dx ≈ 3 ( f (a) + 4 f (a + h) + f (a + 2h)) a b h (6.25 ∫ (sin πx + 0.25 . The exact value of this integral is: 1.5)dx = 0. 6.25.E763 (part 2) Numerical Methods page 62 or ∫ f ( x)dx ≈ f ( x0 ) ∫ L0 ( x)dx + f ( x1 ) ∫ L1( x)dx + f ( x2 ) ∫ L2 ( x)dx a a a a a+2h a b a + 2h a+2h a+2h The individual integrals are: ∫ L0 ( x)dx = 2h 2 ∫ ( x − x1)( x − x2 )dx a 1 a+2h and with the change of variable: x → t = x − x1 a+2h a ∫ L0 ( x)dx = a+2h 1 2h 2 −h ∫ t (t − h)dt = 4h and 3 h 2 1 t3 −ht 2 3 2 2h a+2h a h = −h h 3 Similarly. x2 = x0 + 2h = b = 1.25 ∫ (sin πx + 0.25 1 0. substituting in (6.950158158 0 x2 x0 x1 Applying the Simpson’s rule to this function gives: 0 0.5)dx 0. then x1 = x0 + h = 0.75 and h = 0.5 Fig.
As with the trapezoid rule, higher accuracy can be obtained by subdividing the interval of integration and adding the results of the integrals over each subinterval.

Exercise 6.8: Write down the expression for the composite Simpson's rule using n subintervals and use it to calculate the integral ∫_{0.25}^{1.25} (sin πx + 0.5) dx of the example above using 10 subintervals.

The Midpoint Rule
Another simple method to approximate a definite integral is the midpoint rule, where the function is simply approximated by a constant value over the interval: the value of the function at the midpoint. Consider the integral:

I = ∫_a^b f(x) dx   (6.25)

and the Taylor expansion for the function f, truncated to first order and centred at the midpoint c = (a + b)/2 of the interval [a, b]:

p1(x) = f(c) + (x - c) f'(c) + O(h^2),   where h = (b - a)/2   (6.26)

In this form, the integral of f(x) in the interval [a, b] is approximated by the area of the rectangle of base (b - a) and height f(c). The error term in (6.26) is actually:

R1(x) = (1/2)(x - c)^2 f''(ξ)

so the error in (6.25) when p1(x) is used instead of f(x) is given by E = ∫_a^b R1(x) dx and, applying the Integral Mean Value Theorem to this (see Appendix), we have:

E = (1/2) f''(ξ) ∫_a^b (x - c)^2 dx = (1/6) f''(ξ) [(x - c)^3]_a^b = (1/6) f''(ξ) [(b - c)^3 - (a - c)^3]   (6.27)

but since c = a + h and c = b - h, we have (b - c)^3 - (a - c)^3 = 2h^3 = (b - a)^3/4, and the error is:

E = (1/24)(b - a)^3 f''(ξ)

which is half of the estimate for the single interval trapezoid rule. We can write then:

∫_a^b f(x) dx = (b - a) f(c) + (1/24)(b - a)^3 f''(ξ)   (6.28)

for some ξ in the interval [a, b]. Similarly to the case of the trapezoid rule and Simpson's rule, the midpoint rule can be used in a composite manner, after subdividing the interval of integration into a number N of
subintervals; the expression for the integral then becomes:

∫_a^b f(x) dx = h Σ_{i=1..N} f(ci) + [h^2 (b - a) / 24] f''(η),   h = (b - a)/N   (6.29)

where the ci are the midpoints of each of the N subintervals and η is a point between a and b.

Exercise 6.9: Use the midpoint rule to calculate the integral ∫_{0.25}^{1.25} (sin πx + 0.5) dx using 2 and 10 subintervals; compare the result with that of Simpson's rule in the example above.

Gaussian Quadrature
In the trapezoid, Simpson's and midpoint rules, the definite integral of a function f(x) is approximated by the exact integral of a polynomial that approximates the function. In all these cases, the evaluation points are chosen arbitrarily, often equally spaced, and the weighting coefficients are then determined by the choice of method. However, it is rather clear that the precision attained is dependent on the position of these points, giving then another route to optimisation.

Considering again the general approach to the approximation of the definite integral, the problem can be written in the form:

Gn(f) = Σ_{i=1..n} wi^n f(xi^n) ≈ ∫_{-1}^{1} f(x) dx   (6.30)

(The interval of integration is here chosen as [-1, 1], but obviously any other interval can be mapped into this one by a change of variable.) The objective now is to find, for a given n, the best choice of evaluation points xi^n (called here "Gauss points") and weights wi^n to get maximum precision in the approximation; this is equivalent to trying to find, for a fixed n, the choice of xi^n and wi^n for which the approximation (6.30) is exact for a polynomial of degree N, with N (> n) as large as possible. (That is, we go beyond the degree of approximation of the previous rules.)

We can simplify the problem to the monomials x^k, because any polynomial is just a superposition of terms of the form x^k; so if the integral is exact for each of them, it will be exact for any polynomial containing those terms. This is equivalent to saying:

∫_{-1}^{1} x^k dx = Σ_{i=1..n} wi^n (xi^n)^k   for k = 0, 1, 2, ..., N   (6.31)

Expression (6.31) is a system of equations that the (unknown) Gauss points and weights need to satisfy. This is a nonlinear problem that cannot be solved directly. It consists of N + 1 equations, and we need to find 2n parameters, the xi^n and wi^n (n of each), with N as large as possible (note the equal sign now). It can be shown that this number is N = 2n - 1.
Finding the Weights
For a given set of Gauss points, consider the Lagrange interpolation polynomials of order n associated with each of the Gauss points:

Lj^n(x) = Π_{k=1..n, k≠j} (x - xk^n) / (xj^n - xk^n)   (6.32)

Since expression (6.31) should be exact for any polynomial up to order N = 2n - 1, and Lj^n(x) is of order n, we must have:

∫_{-1}^{1} Lj^n(x) dx = Σ_{i=1..n} wi^n Lj^n(xi^n)   (6.33)

but since the Lj^n(x) are interpolation polynomials, Lj^n(xi^n) = δij (that is, they are 1 if i = j and 0 otherwise), so all the terms in the sum in (6.33) are zero except for i = j, and we have:

∫_{-1}^{1} Li^n(x) dx = wi^n   (6.34)

With this, we have the weights for a given set of Gauss points. We now have to find the best choice for the points themselves.

If P(x) is an arbitrary polynomial of degree ≤ 2n - 1, and Pn(x) is a polynomial of order n, we can write: P(x) = Pn(x) Q(x) + R(x), where Q and R are respectively the quotient polynomial and remainder polynomial of the division of P by Pn; both are of degree n - 1 or less. Then we have:

∫_{-1}^{1} P(x) dx = ∫_{-1}^{1} Pn(x) Q(x) dx + ∫_{-1}^{1} R(x) dx   (6.35)

If we now define the polynomial Pn(x) by its roots and choose these as the Gauss points:

Pn(x) = Π_{i=1..n} (x - xi^n)   (6.36)

then, since the quadrature should be exact for polynomials of degree up to 2n - 1, the integral of the product Pn(x) Q(x), which is a polynomial of degree ≤ 2n - 1, must be given exactly by the quadrature expression:

∫_{-1}^{1} Pn(x) Q(x) dx = Σ_{i=1..n} wi^n Pn(xi^n) Q(xi^n)   (6.37)

but since the Gauss points are the roots of Pn(x), all the terms in the sum in (6.37) must be zero; that is:

∫_{-1}^{1} Pn(x) Q(x) dx = 0   (6.38)

for all polynomials Q(x) of degree n - 1 or less. This implies that Pn(x) must be a member of a family of orthogonal polynomials(*). Legendre polynomials are a good choice because they are orthogonal in the interval [-1, 1] with a weighting function w(x) = 1, and all their roots lie in [-1, 1].

Additionally, since P(x) = Pn(x) Q(x) + R(x) and Pn(xi^n) = 0 (see (6.36)), we have P(xi^n) = R(xi^n). But since R(x) is of degree n - 1 or less, the interpolation using Lagrange polynomials for the n points will give the exact representation of R (see Exercise 4.2). That is:

R(x) = Σ_{i=1..n} R(xi^n) Li(x)   exactly   (6.39)

Then:

∫_{-1}^{1} R(x) dx = ∫_{-1}^{1} Σ R(xi^n) Li(x) dx = Σ R(xi^n) ∫_{-1}^{1} Li(x) dx

but we have seen before, in (6.34), that the integral over the interval [-1, 1] of the Lagrange polynomial of order n corresponding to the Gauss point xi^n is the value of wi^n, so:

∫_{-1}^{1} R(x) dx = Σ_{i=1..n} wi^n R(xi^n)   (6.40)

Going back now to the integral of the arbitrary polynomial P(x) of degree ≤ 2n - 1: if we choose Pn(x) as in (6.36), then (6.38) is satisfied and (6.35) reduces to:

∫_{-1}^{1} P(x) dx = ∫_{-1}^{1} R(x) dx

and then, from (6.39), (6.40) and P(xi^n) = R(xi^n):

∫_{-1}^{1} P(x) dx = Σ_{i=1..n} wi^n P(xi^n)   (6.41)

which tells us that the integral of the arbitrary polynomial P(x) of order up to 2n - 1 can be calculated exactly using the set of Gauss points xi^n, the zeros of the nth order Legendre polynomial, and the weights wi^n determined by (6.34). These should also give the best result for the integral of an arbitrary function f:

(*) Remember that for orthogonal polynomials in [-1, 1] with a weighting function w(x) = 1: ∫_{-1}^{1} pi(x) pj(x) dx = δij, and ∫_{-1}^{1} pi(x) q(x) dx = 0 for any polynomial q of degree ≤ i - 1.
page 67 E763 (part 2) Numerical Methods

 ∫_{−1}^{1} f(x) dx ≈ ∑_{i=1}^{n} w_i^n f(x_i^n)   (6.42)

Gauss nodes and weights for different orders are given in the following table.

Gaussian Quadrature: Nodes and Weights

 n  Nodes x_i^n                                  Weights w_i^n
 1  0.0                                          2.0
 2  ±√3/3 = ±0.577350269189                      1.0
 3  0.0                                          8/9 = 0.888888888889
    ±√15/5 = ±0.774596669241                     5/9 = 0.555555555556
 4  ±√(525 − 70√30)/35 = ±0.339981043585         (18 + √30)/36 = 0.652145154863
    ±√(525 + 70√30)/35 = ±0.861136311594         (18 − √30)/36 = 0.347854845137
 5  0.0                                          128/225 = 0.568888888889
    ±√(5 − 2√(10/7))/3 = ±0.538469310106         (322 + 13√70)/900 = 0.478628670499
    ±√(5 + 2√(10/7))/3 = ±0.906179845939         (322 − 13√70)/900 = 0.236926885056
 6  ±0.238619186                                 0.4679139
    ±0.661209386                                 0.3607616
    ±0.932469514                                 0.1713245

Example
For the integral ∫_{−1}^{1} e^{−x} sin 5x dx, the results of the calculation using Gauss quadrature are listed in the table. The error is calculated compared with the exact value: 0.24203832101745.

 n  Integral           |Error| %
 2  −0.307533965529    227
 3  0.634074857001     162
 4  0.172538331616     28.7
 5  0.247736352452     2.35
 6  0.241785750244     0.10

For n = 6, the error is very small. The results with few Gauss points are not very good because the function varies strongly in the interval of integration.

Exercise 6.10
Compare the results of the example above with those of the Simpson's and Trapezoid Rules for
n subintervals.

Exercise 6.11
Use Gaussian quadrature to calculate:

 ∫_{0.25}^{1.25} (sin πx + 0.5) dx

Change of Variable
The above procedure was developed for definite integrals over the interval [−1, 1]. Integrals over other intervals can be calculated after a change of variables. For example, if the integral to calculate is:

 ∫_a^b f(t) dt

to change from the interval [a, b] to [−1, 1] the following change of variable can be made:

 x ← (t − (a + b)/2) / ((b − a)/2) or t → ((b − a)/2) x + (a + b)/2, so dt = ((b − a)/2) dx

Then, the integral can be approximated by:

 ∫_a^b f(t) dt ≈ ((b − a)/2) ∑_{i=1}^{n} w_i^n f( ((b − a)/2) x_i^n + (a + b)/2 )   (6.43)
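The recipe of (6.43) with the tabulated nodes and weights can be sketched as a short program. This is an illustrative sketch, not part of the original notes; it uses the 5-point row of the table above, so it should integrate polynomials up to degree 9 exactly.

```python
import math

# 5-point Gauss-Legendre nodes and weights on [-1, 1] (from the table above)
NODES = [0.0, 0.538469310106, -0.538469310106, 0.906179845939, -0.906179845939]
WEIGHTS = [128/225, 0.478628670499, 0.478628670499, 0.236926885056, 0.236926885056]

def gauss_quad(f, a, b):
    """Approximate the integral of f over [a, b] using eq. (6.43):
    scale and shift the reference nodes from [-1, 1] to [a, b]."""
    half = (b - a) / 2
    mid = (a + b) / 2
    return half * sum(w * f(half * x + mid) for x, w in zip(NODES, WEIGHTS))
```

For example, `gauss_quad(lambda x: x**8, -1, 1)` reproduces 2/9 to roughly the precision of the tabulated nodes, and the integral of Exercise 6.11 can be checked against its analytical value 0.5 + √2/π.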
page 69 E763 (part 2) Numerical Methods

7. Solution of Partial Differential Equations

The Finite Difference Method: Solution of Differential Equations by Finite Differences

We will study here a direct method to solve differential equations which is based on the use of finite differences. This consists of the direct substitution of derivatives by finite differences, and thus the problem reduces to an algebraic form. The basic idea is to substitute the derivatives by appropriate difference formulae like those seen in section 6. This will convert the problem into a system of algebraic equations.

One Dimensional Problems:
Let's consider a problem given by the equation:

 Lf = g   (7.1)

where L is a linear differential operator and g(x) is a known function. The problem is to find the function f(x) satisfying equation (7.1) over a given region (interval) [a, b], subject to boundary conditions at a and b: known values f(a) and f(b).

In order to apply systematically the difference approximations we proceed as follows:
–– First, we divide the interval [a, b] into N equal subintervals of length h = (b − a)/N, defining the points xi = a + ih.
–– Next, we approximate all derivatives in the operator L by appropriate difference formulae (h must be sufficiently small – N large – to do this accurately).
–– Finally, we formulate the corresponding difference equation at each point xi. This will generate a linear system of N − 1 equations in the N − 1 unknown values fi = f(xi).

Example: If we take Lf = g to be a general second order differential equation with constant coefficients c, d and e:

 c f''(x) + d f'(x) + e f(x) = g(x)   (7.2)

Using the difference formulae of section 6 at point i:

 f_i'' ≈ (f_{i−1} − 2f_i + f_{i+1})/h²   (7.3)  and  f_i' ≈ (f_{i+1} − f_{i−1})/(2h)   (7.4)

results in the equation at point i:

 c (f_{i−1} − 2f_i + f_{i+1})/h² + d (f_{i+1} − f_{i−1})/(2h) + e f_i = g_i   (7.5)
E763 (part 2) Numerical Methods page 70

or:

 (c − dh/2) f_{i−1} + (−2c + eh²) f_i + (c + dh/2) f_{i+1} = h² g_i   (7.6)

for all i except i = 1 and N − 1, where the known boundary values enter:
for i = 1: f_0 = f(a), and for i = N − 1: f_N = f(b).

This can be written as a matrix problem of the form A f = g, where f = {f_i} and g = {h²g_i} are vectors of order N − 1. The matrix A has only 3 elements per row:

 a_{i,i−1} = c − dh/2,  a_{i,i} = −2c + eh²,  a_{i,i+1} = c + dh/2

Exercise 7.1
Formulate the algorithm to solve a general second order differential equation over the interval [a, b], with Dirichlet boundary conditions at a and b (known values of f at a and b). Write a short computer program to implement it and use it to solve:

 f'' + 2f' + 5f = e^{−x} sin x in [0, 5], f(0) = f(5) = 0

Two Dimensional Problems
We consider now the problem Lf = g in 2 dimensions, where f is a function of 2 variables, say x and y. In this case, we need to approximate derivatives in x and y in L. To do this, we superimpose a regular grid over the domain of the problem and (as for the 1–D case) we only consider the values of f and g at the nodes of the grid.

Example: Referring to the figure below, the problem consists of finding the potential distribution between the inner and outer conducting surfaces of square cross-section when a fixed voltage is applied between them. The equation describing this problem is the Laplace equation: ∇²φ = 0, with the boundary conditions φ = 0 on the outer conductor and φ = 1 on the inner conductor. To approximate

 ∇²φ = ∂²φ/∂x² + ∂²φ/∂y²   (7.7)

we can choose for convenience the same spacing h for x and y. Ignoring the symmetry of the problem, the whole cross-section is discretized (as in Fig. 8.1). With this choice, there are 56 free nodes (for which the potential is not known).
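The 1-D scheme of (7.6) can be sketched as a program, as Exercise 7.1 asks. The sketch below (not from the original notes) builds the tridiagonal system and solves it with forward elimination and back substitution (the Thomas algorithm); the function name and argument order are my own choices.

```python
def solve_bvp(c, d, e, g, a, b, fa, fb, N):
    """Finite-difference solution of c f'' + d f' + e f = g(x) on [a, b],
    f(a) = fa, f(b) = fb, using eq. (7.6) and the Thomas algorithm
    for the resulting tridiagonal system.  Returns [f_1, ..., f_{N-1}]."""
    h = (b - a) / N
    lo = c - d * h / 2          # a_{i,i-1}
    di = -2 * c + e * h * h     # a_{i,i}
    up = c + d * h / 2          # a_{i,i+1}
    n = N - 1                   # number of unknowns f_1 .. f_{N-1}
    rhs = [h * h * g(a + (i + 1) * h) for i in range(n)]
    rhs[0] -= lo * fa           # move the known boundary values to the RHS
    rhs[-1] -= up * fb
    # Thomas algorithm: forward elimination ...
    cp = [0.0] * n
    dp = [0.0] * n
    cp[0] = up / di
    dp[0] = rhs[0] / di
    for i in range(1, n):
        m = di - lo * cp[i - 1]
        cp[i] = up / m
        dp[i] = (rhs[i] - lo * dp[i - 1]) / m
    # ... and back substitution
    f = [0.0] * n
    f[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        f[i] = dp[i] - cp[i] * f[i + 1]
    return f
```

A quick check: for f'' = −2 on [0, 1] with f(0) = f(1) = 0, the exact solution f = x(1 − x) is quadratic, so the central difference (7.3) reproduces it exactly at the nodes.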
page 71 E763 (part 2) Numerical Methods

Fig. 8.1 Finite difference mesh for solution of the electrostatic field in the square coaxial line (φ = 0 on the outer boundary, φ = 1 on the inner conductor; the 56 free nodes are numbered 1 to 56).

On this mesh, we only consider the unknown internal nodal values and only those nodes are numbered. An internal point of the grid, labelled O in the figure, is surrounded by four others, which for convenience are labelled N, S, E and W. For this point we can approximate the derivatives in (7.7) by:

 ∂²φ/∂x² ≈ (φW − 2φO + φE)/h²  and  ∂²φ/∂y² ≈ (φN − 2φO + φS)/h²   (7.8)

Then the equation ∇²φ = 0 becomes simply:

 ∇²φ ≈ (φN + φS + φE + φW − 4φO)/h² = 0   (7.9)

or

 φN + φS + φE + φW − 4φO = 0

Formulating this equation for each point in the grid and using the boundary conditions where appropriate, we end up with a system of N equations, one for each of the N free points of the grid. Applying equation (7.9) to point 1 of the mesh gives:

 0 + φ10 + φ2 + 0 − 4φ1 = 0   (7.10)

and to point 2:

 0 + φ11 + φ3 + φ1 − 4φ2 = 0   (7.11)

A typical interior point such as 11 gives:

 φ2 + φ20 + φ12 + φ10 − 4φ11 = 0   (7.12)
E763 (part 2) Numerical Methods page 72

and one near the inner conductor, for example 13, will give:

 φ4 + 1 + φ14 + φ12 − 4φ13 = 0   (7.13)

corresponding to points next to the inner conductor with potential 1. In this way, we can assemble all 56 equations from the 56 mesh points of the figure, in terms of the 56 unknowns. The resulting 56 equations can be expressed as:

 A x = y   (7.14a)

or

 ⎡ −4  1  ⋯  1  ⋯       ⎤ ⎡ φ1 ⎤   ⎡  0 ⎤
 ⎢  1 −4  1  ⋯  1  ⋯    ⎥ ⎢ φ2 ⎥   ⎢  0 ⎥
 ⎢     1 −4  ⋱          ⎥ ⎢ φ3 ⎥   ⎢  0 ⎥
 ⎢        ⋱             ⎥ ⎢  ⋮ ⎥ = ⎢  ⋮ ⎥   (7.14b)
 ⎢           ⋱          ⎥ ⎢ φ13⎥   ⎢ −1 ⎥
 ⎢            ⋯ −4  1   ⎥ ⎢  ⋮ ⎥   ⎢  ⋮ ⎥
 ⎣               1 −4   ⎦ ⎣ φ56⎦   ⎣  0 ⎦

The unknown vector x of (7.14a) is simply (φ1, φ2, …, φ56)ᵀ. The right-hand-side vector y consists of zeros except for the –1's coming from equations like (7.13), that is, from the points adjacent to the inner conductor: points 12 to 17, 20, 21, …, 41 to 45.

The 56×56 matrix A has mostly zero elements, except for '–4' on the diagonal and either two, three or four '+1' somewhere else on each matrix row, the number and distribution depending on the geometric node location. Each number 1 to 56 of the mesh points in the figure above corresponds precisely to the row number of matrix A and the row number of column vector x. One has to be careful not to confuse the row-and-column numbers of the 2-D array A with the 'x and y coordinates' of the physical problem.

Equation (7.14) is a standard matrix problem and could be solved with a standard library package, with a Gauss elimination solver routine, or better, a sparse or band-matrix version of Gauss elimination. Better still would be the Gauss–Seidel or successive displacement method, as seen in section 5. Section 2 in the Appendix shows details of such a solution.

Enforcing Boundary Conditions
In the description so far we have seen the implementation of the Dirichlet boundary condition, that is, a condition where the values of the desired function are known at the edges of the region of interest (ends of the interval in 1-D or boundary of a 2-D or 3-D region). This has been implemented in a straightforward way in equations (7.10), (7.11) and (7.13). Frequently, other types of boundary conditions appear, for example when the derivatives of the function are known (instead of the values themselves) at the boundary: this is the Neumann condition. On occasions, a mixed condition will apply, something like:
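The Gauss–Seidel solution of the 5-point equation (7.9) can be sketched directly on the grid, without assembling the matrix A at all. The sketch below (my own illustration, on a plain rectangular grid rather than the 56-node coaxial mesh) sweeps the free nodes, replacing each by the average of its four neighbours until the updates become negligible.

```python
def gauss_seidel_laplace(phi, tol=1e-6, max_iter=10000):
    """Solve the 5-point Laplace equation (7.9) by Gauss-Seidel sweeps:
    each interior node is repeatedly replaced by the average of its four
    neighbours until the largest change in a sweep falls below tol.
    phi : 2-D list of nodal values; the outermost rows/columns hold the
          prescribed (Dirichlet) boundary values and are never modified."""
    ny, nx = len(phi), len(phi[0])
    for _ in range(max_iter):
        delta = 0.0
        for j in range(1, ny - 1):
            for i in range(1, nx - 1):
                new = 0.25 * (phi[j-1][i] + phi[j+1][i]
                              + phi[j][i-1] + phi[j][i+1])
                delta = max(delta, abs(new - phi[j][i]))
                phi[j][i] = new
        if delta < tol:
            break
    return phi
```

For instance, on a square grid whose boundary values vary linearly from 1 on the left edge to 0 on the right, the interior converges to the same linear potential, which satisfies Laplace's equation exactly.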
page 73 E763 (part 2) Numerical Methods

 f(r) + ∂f/∂n = K on Γ

where the second term refers to the normal derivative (the derivative along the normal direction n to the boundary). This type of condition appears in some radiation problems (Sommerfeld radiation condition). We will see next some forms of dealing with these boundary conditions in the context of finite difference calculations.

For example, in the case of the 'square coaxial problem' studied earlier, we can see that the solution will have symmetry properties which make it unnecessary to calculate the potential over the complete cross-section. The new region of interest can be one quarter of the cross-section or, in fact, only one eighth (the shaded region in Fig. 8.2). In that case, the new boundary conditions needed on the dashed lines that limit the new region of interest are of the Neumann type:

 ∂φ/∂n = 0

since the lines of constant potential will be perpendicular to those edges.

Fig. 8.2 Symmetry of the square coaxial line (φ = 1 inner, φ = 0 outer): the shaded region is the eighth of the cross-section actually needed, bounded by symmetry lines on which ∂φ/∂n = 0.

We will need a different strategy to deal with these conditions. For this condition it is more convenient to define the mesh in a different manner, to avoid oblique lines; if necessary, one quarter of the cross-section can be used instead. If we place the boundary at half the node spacing from the start of the mesh, as in the figure below, we can implement the normal derivative condition in a simple form. (Note that the node numbers used in the next figure do not correspond to the mesh numbering defined earlier for the whole cross-section.)

Fig. 8.3 Mesh placed with the boundary (point b) half a node spacing from the first mesh line; a is an auxiliary point outside the mesh, opposite node 1, with node 10 below node 1 and node 2 to its right.

The boundary condition that applies at point b (not actually part of the mesh) is ∂φ/∂n = 0. We can approximate this by using a central difference between the point 1 and the auxiliary point a outside the mesh:

 ∂φ/∂n|_b = (φa − φ1)/h = 0 and then φa = φ1

So the corresponding equation for point 1 is φN + φS + φE + φW − 4φO = 0, and substituting values (with φa = φ1 taking the place of the outside neighbour) we have:

 0 + φ10 + φ1 + φ2 − 4φ1 = 0 or φ2 + φ10 − 3φ1 = 0

A Dirichlet condition can also be implemented in a similar form. For example, if part of a straight boundary is subjected to a Neumann condition and the rest to a Dirichlet condition (as for example in the case of a conductor that does not cover the whole boundary), a Dirichlet condition will have to be implemented in a mesh separated half of the
E763 (part 2) Numerical Methods page 74

spacing from the edge.

Fig. 8.4 Mesh for a boundary partly under a Neumann condition (∂φ/∂n = 0) and partly under a Dirichlet condition (φ = V); nodes 1–6 lie on the first mesh line, nodes 11–15 behind them, and the point a lies on the boundary between nodes 5 and 15.

Exercise 7.2
Consider Fig. 8.4 and the points 5 and 15 together with the point a on the boundary (not in the mesh). Use Taylor expansions at points 5 and 15, at distances h/2 and 3h/2 from a, to show that a condition that can be applied when the function φ has a fixed value V at the boundary and we also want the normal derivative of φ to be zero there is: 9φ5 − φ15 = 8V.

Using the results of Exercise 7.2, the difference equation corresponding to the discretization of the Laplace equation at point 5, where the boundary condition is φ = V, will be:

 φN + φ4 − 4φ5 + φ6 + φ15 = 0

but φN = φ5 and also φ15 = 9φ5 − 8V, so finally:

 φ4 + 6φ5 + φ6 = 8V

Exercise 7.3
Using Taylor expansions, derive an equation to implement the Neumann condition using five points along a line normal to the edge and at distances 0, h, 2h, 3h and 4h. Repeat using 3 points along the same line and use the extra degree of freedom to increase the degree of approximation (eliminating second and third derivatives).

Exercise 7.4 (Advanced)
Using two points like 11 and 12 in the figure of Exercise 7.2, find the equation corresponding to the implementation of a radiation condition of the form:

 φ + p ∂φ/∂n = K

where p is a constant with the same dimensions as h.

Example: Heat conduction (diffusion) in a uniform rod.
A uniform rod of length L is initially at temperature 0. It is then connected to a heat source at temperature 1 at one end, while the other end is attached to a sink at temperature 0. Find the temperature distribution along the rod as time varies. We need the time-domain solution (transient); in the steady state we would expect a linear distribution of temperature between both ends. The equation describing the temperature variation in space and time, assuming a heat conductivity of 1, is the normalized diffusion equation:

 ∂²u/∂x² = ∂u/∂t   (7.15)

where u(x,t) is the temperature. The boundary and initial conditions are:
page 75 E763 (part 2) Numerical Methods

 u(0,t) = 0, u(L,t) = 1 for all t, and u(x,0) = 0 for x < L   (7.16)

We can discretize the solution space (x,t) with a regular grid with spacings ∆x and ∆t. The solution will be sought at positions xi, i = 1, …, M−1 (leaving out of the calculation the points at the ends of the rod, so ∆x = L/M), and at times tn, n = 0, 1, 2, …. Writing u_i^n = u(xi, tn), the boundary and initial conditions become:

 u_0^n = 0 and u_M^n = 1 for all n, and u_i^0 = 0 for i = 0, …, M−1

We now discretize equation (7.15), converting the derivatives into differences. For the space derivative at time n and position i, we can use the central difference formula:

 ∂²u/∂x²|_{i,n} = (u_{i−1}^n − 2u_i^n + u_{i+1}^n)/∆x²   (7.17)

For the time derivative, we need to approximate it at the same point (i,n). We could use the forward difference:

 ∂u/∂t ≈ (u_i^{n+1} − u_i^n)/∆t   (7.18)

and in this case we get as the difference equation:

 (u_{i−1}^n − 2u_i^n + u_{i+1}^n)/∆x² = (u_i^{n+1} − u_i^n)/∆t   (7.19)

Rearranging:

 u_i^{n+1} = (∆t/∆x²) u_{i−1}^n + (1 − 2∆t/∆x²) u_i^n + (∆t/∆x²) u_{i+1}^n   (7.20)

This gives us a form of calculating the temperature u at point i and time n+1 as a function of u at points i−1, i and i+1 at time n. We have then a time-stepping algorithm.

Fig. 8.5 Grid points involved in the explicit update (7.20): the value at (i, n+1) is obtained from the values at (i−1, n), (i, n) and (i+1, n).

Equation (7.20) is valid for n = 0, 1, 2, …, N and i = 1, …, M−1. Special cases: for i = 1: u_0^n = 0, and for i = M−1: u_M^n = 1 for all n (at all times). If we call b = 1 − 2∆t/∆x² and c = ∆t/∆x², we can rewrite (7.20) as:

 u_1^{n+1} = b u_1^n + c u_2^n
 u_2^{n+1} = c u_1^n + b u_2^n + c u_3^n
 ⋯
 u_{M−1}^{n+1} = c u_{M−2}^n + b u_{M−1}^n + c   (7.21)

which can be written in matrix form as u^{n+1} = A u^n + v, where the matrix A and the vector v
E763 (part 2) Numerical Methods page 76

are:

 A = ⎡ b c        ⎤      ⎡ 0 ⎤
     ⎢ c b c      ⎥      ⎢ 0 ⎥
     ⎢   ⋱ ⋱ ⋱    ⎥  v = ⎢ ⋮ ⎥   (7.22)
     ⎢     c b c  ⎥      ⎢ 0 ⎥
     ⎣       c b  ⎦      ⎣ c ⎦

u is the vector containing all values u_i^n. It is known at n = 0, so the matrix equation (7.21) can be solved for all successive time steps.

Problems with this formulation:
– It is unstable unless ∆t ≤ ∆x²/2 (c ≤ 0.5 or b ≥ 0).
This is due to the rather poor approximation to the time derivative provided by the forward difference (7.18). A better scheme can be constructed using the central difference for the time derivative instead of the forward difference.

Using the central difference for the time derivative: the Crank–Nicolson method
A central difference approximation to the RHS of (7.15) would be:

 ∂u/∂t|_{i,n} = (u_i^{n+1} − u_i^{n−1})/(2∆t)   (7.23)

This would involve values of u at three time steps: n−1, n and n+1, which would be undesirable. A solution to this is to shrink the gap between the two values, having instead of the difference between u_i^{n+1} and u_i^{n−1}, the difference between u_i^{n+1} and u_i^n. However, if we consider this a 'central' difference, the derivative must be evaluated at the central point, which corresponds to the value n+1/2:

 ∂u/∂t|_{i,n+1/2} = (u_i^{n+1} − u_i^n)/∆t

We then need to evaluate the left-hand side of (7.15) at time n+1/2, but this is not on the grid! We will have:

 ∂²u/∂x²|_{i,n+1/2} = (u_{i−1}^{n+1/2} − 2u_i^{n+1/2} + u_{i+1}^{n+1/2})/∆x²   (7.24)

Since the values of u are restricted to positions in the grid, we have to approximate the values in the RHS of (7.24). Since those values are at the centre of the intervals, we can approximate them by the average between the neighbouring grid points:

 u_i^{n+1/2} ≈ (u_i^n + u_i^{n+1})/2   (7.25)

and similarly for i−1 and i+1. We can now substitute (7.25) into (7.24) and evaluate equation (7.15) at time n+1/2. After rearranging this gives:

 u_{i−1}^{n+1} − 2d u_i^{n+1} + u_{i+1}^{n+1} = −u_{i−1}^n + 2e u_i^n − u_{i+1}^n   (7.26)
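The explicit time-stepping of (7.20)–(7.21) can be sketched in a few lines. This is an illustrative sketch (not from the notes) of the rod problem with u(0) = 0, u(L) = 1; note the stability requirement ∆t ≤ ∆x²/2 discussed above.

```python
def heat_explicit(M, L, dt, nsteps):
    """Explicit (forward-difference) time stepping of the rod problem,
    eq. (7.20): u(0)=0, u(L)=1, u=0 initially.
    Stable only for dt <= dx**2 / 2."""
    dx = L / M
    c = dt / dx**2            # the 'c' of eq. (7.21)
    b = 1 - 2 * c             # the 'b' of eq. (7.21)
    u = [0.0] * (M + 1)       # u[0..M]; u[0]=0, u[M]=1 are the boundary values
    u[M] = 1.0
    for _ in range(nsteps):
        u = ([0.0]
             + [c*u[i-1] + b*u[i] + c*u[i+1] for i in range(1, M)]
             + [1.0])
    return u
```

Stepped long enough with a stable ∆t, the profile settles to the expected linear steady-state distribution between 0 and 1.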
page 77 E763 (part 2) Numerical Methods

where

 d = 1 + ∆x²/∆t and e = 1 − ∆x²/∆t

This form of treating the derivative (evaluating the equation between grid points in order to have a central difference approximation for the first order derivative) is called the Crank–Nicolson method and has several advantages over the previous formulation (using the forward difference). This method provides a second order approximation for both derivatives and also, very important, it is unconditionally stable.

Fig. 8.6 Crank–Nicolson scheme: the points or values involved in each calculation. The dark spot at (i, n+1/2) represents the position where the equation is evaluated, between the two time levels n and n+1.

We can now write (7.26) for each value of i as in (7.21), considering the special cases at the ends of the rod and at t = 0, and write the corresponding matrix form. In this case we will get:

 A u^{n+1} = B u^n − 2v   (7.27)

where:

 A = ⎡ −2d  1        ⎤      B = ⎡ 2e −1        ⎤      ⎡ 0 ⎤
     ⎢  1 −2d  1     ⎥          ⎢ −1 2e −1     ⎥  v = ⎢ ⋮ ⎥
     ⎢     ⋱  ⋱  ⋱   ⎥          ⎢    ⋱  ⋱  ⋱   ⎥      ⎢ 0 ⎥
     ⎣        1 −2d  ⎦          ⎣       −1 2e  ⎦      ⎣ 1 ⎦

Example
Consider now a parabolic equation in 2 space dimensions and time, like the Schroedinger equation or the diffusion equation in 2-D:

 ∂²u/∂x² + ∂²u/∂y² = a ∂u/∂t   (7.28)

For example, this could represent the temperature distribution over a 2-dimensional plate. Let's consider a square plate of length 1 in x and y and the following boundary conditions:

a) u(0,y,t) = 1
b) u(x,0,t) = 1
c) u(1,y,t) = 0
d) u(x,1,t) = 0
e) u(x,y,0) = 0 for x > 0 and y > 0

That is, the sides x = 0 and y = 0 are kept at temperature 1 at all times while the sides x = 1 and y = 1 are kept at u = 0. The whole plate, except the sides x = 0 and y = 0, is at temperature 0 at t = 0. The following diagram shows the discretization for the x coordinate:

 x: 0 1 … i−1 i i+1 … R R+1
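The Crank–Nicolson stepping of (7.26)–(7.27) can be sketched as follows (my own illustration, again for the rod with u(0) = 0, u(L) = 1; the tridiagonal system A u^{n+1} = rhs is solved each step with the Thomas algorithm). Unlike the explicit scheme, a time step far beyond ∆x²/2 is perfectly usable.

```python
def heat_crank_nicolson(M, L, dt, nsteps):
    """Crank-Nicolson stepping of the rod problem via eq. (7.26):
    u[i-1] - 2d*u[i] + u[i+1] (at n+1) = -u[i-1] + 2e*u[i] - u[i+1] (at n),
    with d = 1 + dx^2/dt, e = 1 - dx^2/dt, u(0)=0, u(L)=1."""
    dx = L / M
    d = 1 + dx*dx/dt
    e = 1 - dx*dx/dt
    n = M - 1                      # unknowns u_1 .. u_{M-1}
    u = [0.0] * (M + 1)
    u[M] = 1.0
    for _ in range(nsteps):
        rhs = [-u[i-1] + 2*e*u[i] - u[i+1] for i in range(1, M)]
        rhs[-1] -= 1.0             # move the known u(L)=1 at level n+1 to the RHS
        # Thomas algorithm for the constant tridiagonal (1, -2d, 1)
        cp = [0.0]*n; dp = [0.0]*n
        cp[0] = 1 / (-2*d)
        dp[0] = rhs[0] / (-2*d)
        for i in range(1, n):
            m = -2*d - cp[i-1]
            cp[i] = 1 / m
            dp[i] = (rhs[i] - dp[i-1]) / m
        new = [0.0]*n
        new[-1] = dp[-1]
        for i in range(n-2, -1, -1):
            new[i] = dp[i] - cp[i]*new[i+1]
        u = [0.0] + new + [1.0]
    return u
```

With ∆x = 0.1 a time step of ∆t = 0.05 (ten times the explicit stability limit) still converges smoothly to the linear steady state, illustrating the unconditional stability claimed above.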
E763 (part 2) Numerical Methods page 78

That is, we use R+2 points in each direction, of which R are unknown (i = 1, …, R). The two extreme points, i = 0 and i = R+1, correspond to x = 0 and x = 1 respectively. The discretization of the y coordinate can be made in similar form, with R unknown values (j = 1, …, R). Discretization of time is done by considering t = n∆t, where ∆t is the time step.

Approximation of the time derivative (first order): As in the previous example, we use the Crank–Nicolson method; that is, equation (7.28) will be evaluated half way between time grid points, in order to use the central difference with only two time levels:

 ∂u/∂t|_{i,j,n+1/2} = (u_{i,j}^{n+1} − u_{i,j}^n)/∆t   (7.29)

We need to approximate the LHS at the same time, using the average of the values at n and n+1:

 ∇²u^{(n+1/2)} = (∇²u^{(n)} + ∇²u^{(n+1)})/2

Applying this and (7.29) to (7.28) we get:

 (∇²u^{(n)} + ∇²u^{(n+1)})/2 = (a/∆t)(u^{(n+1)} − u^{(n)})   (7.30)

or, rewriting:

 ∇²u^{(n+1)} − (2a/∆t) u^{(n+1)} = −∇²u^{(n)} − (2a/∆t) u^{(n)}

where u is still a continuous function of position.

Approximation of the space derivative (second order): Using the central difference for the second order derivatives, with ∆x = ∆y = h, we get:

 ∇²u = (uN + uS + uW + uE − 4uO)/h²

and using this in (7.30):

 u_{i−1,j}^{n+1} + u_{i+1,j}^{n+1} + u_{i,j−1}^{n+1} + u_{i,j+1}^{n+1} − (4 + 2ah²/∆t) u_{i,j}^{n+1}
  = −[ u_{i−1,j}^n + u_{i+1,j}^n + u_{i,j−1}^n + u_{i,j+1}^n − (4 − 2ah²/∆t) u_{i,j}^n ]   (7.31)

Defining now a node numbering over the grid and a vector u^{(n)} containing all the unknown values (for all i and j), equation (7.31) can be written as a matrix equation of the form:

 A u^{(n+1)} = B u^{(n)}   (7.32)
page 79 E763 (part 2) Numerical Methods

Exercise 7.5
A length L of transmission line is terminated at both ends with a short circuit. A unit impulse voltage is applied in the middle of the line at time t = 0. The voltage φ(x,t) along the line satisfies the wave equation:

 ∂²φ/∂x² − p² ∂²φ/∂t² = 0

with the boundary and initial conditions:

 φ(0,t) = φ(L,t) = 0, φ(x,t) = 0 for t < 0, and φ(L/2, 0) = 1

a) By discretizing both the coordinate x and time t as xi = (i − 1)∆x, i = 1, 2, …, R+1 and tm = m∆t, m = 0, 1, 2, …, use finite differences to formulate the numerical solution of the equation above, taking special care at the edges of the grid. Show that the problem reduces to a matrix problem of the form:

 Φ^{m+1} = A Φ^m + B Φ^{m−1}

where A and B are matrices and Φ^m is a vector containing the voltages at each of the discretization points xi at time tm. Show the discretized equation corresponding to points at the edge of the grid (for i = 1 or R+1) and consider the boundary condition φ(0) = φ(L) = 0.

b) Choosing R = 7, find the matrices A and B, giving the values of their elements.

c) How would the matrices A and B change if the boundary condition was changed to ∂φ/∂x = 0 at x = 0 and x = L (corresponding in this case to an open circuit)? Show the discretized equation corresponding to one of the edge points and propose a way to transform it so that it contains only values corresponding to points in the defined grid.
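A single time step of the two-level recursion asked for in part a) can be sketched as follows. This is my own leapfrog sketch (not the exercise solution): replacing both second derivatives by central differences in the wave equation and solving for the newest level gives φ^{m+1} in terms of φ^m and φ^{m−1}; r2 = ∆t²/(p∆x)² is the assumed Courant-number parameter.

```python
def wave_step(phi_now, phi_prev, r2):
    """One leapfrog step for the discretized wave equation:
    (phi[i-1] - 2*phi[i] + phi[i+1])/dx^2
        = p^2 * (phi^{m+1} - 2*phi^m + phi^{m-1})[i] / dt^2,
    with r2 = dt**2 / (p*dx)**2 and short-circuit ends phi[0] = phi[-1] = 0."""
    R1 = len(phi_now)
    new = [0.0] * R1               # end values stay pinned at 0
    for i in range(1, R1 - 1):
        new[i] = (2*phi_now[i] - phi_prev[i]
                  + r2*(phi_now[i-1] - 2*phi_now[i] + phi_now[i+1]))
    return new
```

With r2 = 1 the unit impulse at the centre splits into two unit pulses travelling one node per step in opposite directions, as expected of the wave equation.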
E763 (part 2) Numerical Methods page 80

8. Solution of Boundary Value Problems

A physical problem is often described by a differential equation of the form:

 Lu(x) = s(x) on a region Ω   (8.1)

where L is a linear differential operator, u(x) is a function to be found, s(x) is a known function and x is the position vector of any point in Ω (coordinates). We also need to impose boundary conditions on the values of u and/or its derivatives on Γ, the boundary of Ω, in order to have a unique solution to the problem. The boundary conditions can be written in general as:

 Bu(x) = t(x) on Γ   (8.2)

with B a linear differential operator and t(x) a known function. So we will have, for example:

 B = 1:    u(x) = t(x): known values of u on Γ (Dirichlet condition)
 B = ∂/∂n:   ∂u(x)/∂n = t(x): fixed values of the normal derivative (Neumann condition)
 B = ∂/∂n + k:  ∂u/∂n + ku = t(x): mixed condition (Radiation condition)

A problem described in this form is known as a boundary value problem (differential equation + boundary conditions).

Strong and Weak Solutions to Boundary Value Problems

Strong Solution: This is the direct approach to the solution of the problem, as specified above in (8.1)–(8.2). For example, the finite difference solution method studied earlier is of this type.

Weak Solution: This is an indirect approach. Instead of trying to solve the problem directly, we can reformulate it as the search for a function that satisfies some conditions also satisfied by the solution to the original problem (8.1)–(8.2). With a proper definition of these conditions the search will lead to the correct and unique solution.

Before expanding on the details of this search, we need to introduce the idea of inner product between two functions. This will allow us to quantify (put a number to) how close a function is to another, or how big or small an error function is. The commonest definition of inner product between two functions f and g is:

 ⟨f, g⟩ = ∫_Ω f g dΩ for real functions f and g, or ⟨f, g⟩ = ∫_Ω f g* dΩ for complex functions   (8.3)

In general, the inner product between two functions will be a real number obtained by global operations between the functions over the domain, satisfying some defining properties (for real functions):

 i) ⟨f, g⟩ = ⟨g, f⟩
 ii) ⟨αf + βg, h⟩ = α⟨f, h⟩ + β⟨g, h⟩, with α and β scalars
 iii) ⟨f, f⟩ = 0 if and only if f = 0   (8.4)

From this definition we can deduce that: if for a function r we have ⟨r, h⟩ = 0 for any choice of the function h, then r = 0. We will try to use this property for testing an error residual r. We are now in a position to formulate the weak solution of (8.1)–(8.2) as:

 Given the function s(x) and an arbitrary function h(x) in Ω, find the function u(x) that satisfies:
 ⟨Lu, h⟩ = ⟨s, h⟩, or ⟨Lu − s, h⟩ = 0, for any choice of the function h   (8.5)

Approximate Solutions to the Weak Formulation
In this case, instead of trying to find the exact solution u(x) that satisfies (8.5), we only look for an approximation, or rather, a suitable approximation to it. We can choose a set of basis functions and assume that the wanted function can be represented as an expansion in terms of the basis functions (as for example with Fourier expansions using sines and cosines). With this, the unknowns are no longer the function u, but its expansion coefficients, that is: numbers! So the problem (8.5) is converted from "finding a function u among all possible functions defined on Ω" into the search for a set of numbers, once we have chosen the basis functions. We now need some method to find these numbers: the Rayleigh–Ritz procedure. We will see two methods to solve (8.5) using this procedure:

 – Weighted Residuals
 – Variational Method

The Rayleigh–Ritz Method

Weighted Residuals
For this approach the problem (8.5) is rewritten in the form:

 r = Lu − s = 0   (8.6)

where the function r is the residual, or error: the difference between the LHS and the RHS of (8.1) when we try any function in place of u. This residual
E763 (part 2) Numerical Methods page 82

will only be zero when the function we use is the correct solution to the problem. We can now look at the solution to (8.6) as an optimization (or minimization) problem, that is: "Find the function u such that r = 0", or in an approximate way: "Find the function u such that r is as small as possible". Here we need to be able to measure how 'small' the error (or residual) is. Using these ideas on the weak formulation, this is now transformed into:

 Given the function s(x) and an arbitrary function h(x) in Ω, find the function u(x) that satisfies:
 ⟨r, h⟩ = 0 (where r = Lu − s)   (8.7)

for any choice of the function h.

Applying the Rayleigh–Ritz procedure, we can put:

 u(x) = ∑_{j=1}^{N} d_j b_j(x)   (8.8)

where d_j, j = 1, …, N are scalar coefficients and b_j(x), j = 1, …, N are a set of linearly independent expansion functions (normally called trial functions; these are chosen). We now need to introduce the function h. For this, we can also use an expansion, in general using another set of functions w_i(x), for example:

 h(x) = ∑_{i=1}^{N} c_i w_i(x)   (8.9)

where c_i, i = 1, …, N are scalar coefficients and w_i(x), i = 1, …, N are a set of linearly independent expansion functions for h (normally called weighting functions; also chosen).

Using now (8.9) in (8.7) we get:

 ⟨r, h⟩ = ⟨r(x), ∑_{i=1}^{N} c_i w_i(x)⟩ = ∑_{i=1}^{N} c_i ⟨r(x), w_i(x)⟩ = 0   (8.10)

for any choice of the coefficients c_i. Since (8.10) has to be satisfied for any choice of the coefficients c_i (equivalent to saying 'any choice of h'), we can deduce that:

 ⟨r, w_i⟩ = 0 for all i = 1, …, N   (8.11)

which is simpler than (8.10). (It comes from the choice c_i = 1 and all others = 0, for each i in sequence.) That is, we can conclude that we don't need to test the residual against any (and all) possible function h; we only need to test it against the chosen weighting functions w_i.
page 83 E763 (part 2) Numerical Methods

Now, we can expand the function u in terms of the trial (basis) functions as in (8.8) and use this expansion in r: r = Lu − s. With this, (8.11) becomes:

 ⟨r, w_i⟩ = ⟨L ∑_{j=1}^{N} d_j b_j(x) − s(x), w_i(x)⟩ = 0 for all i

or

 ∑_{j=1}^{N} d_j ⟨Lb_j, w_i⟩ = ⟨s, w_i⟩ for all i: i = 1, …, N   (8.12)

Note that in this expression the only unknowns are the coefficients d_j. We can rewrite (8.12) in matrix notation as:

 A d = s   (8.13)

where A = {a_ij}, d = {d_j} and s = {s_i}, with a_ij = ⟨Lb_j, w_i⟩ and s_i = ⟨s, w_i⟩. Since the trial functions b_j are known, the matrix elements a_ij are all known and the problem has been reduced to solving (8.13), finding the coefficients d_j of the expansion of the function u.

Different choices of the trial functions and the weighting functions define the different variants by which this method is known:

 i) w_i = b_i  ––––––> method of Galerkin
 ii) w_i(x) = δ(x − x_i) ––––––> Point matching method

The latter is equivalent to asking for the residual to be zero at a fixed number of points in the domain.

Variational Method
The central idea of this method is to find a functional* (or variational expression) associated to the boundary value problem, for which the solution of the BV problem leads to a stationary value of the functional. Combining this idea with the Rayleigh–Ritz procedure we can develop a systematic solution method. Before introducing numerical techniques to solve variational formulations, let's examine an example to illustrate the nature of the variational approach:

* A functional is simply a function of a function over a complete domain, which gives as a result a number. For example: J(φ) = ∫_Ω φ² dΩ. Note that J is not a function of x.
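The Galerkin machinery of (8.12)–(8.13) can be sketched on a small model problem. This is my own illustration (not from the notes): take L u = −u'' on [0, 1] with u(0) = u(1) = 0 and choose trial = weighting functions b_j = sin(jπx). For this basis L b_j = (jπ)² b_j, so the matrix A of (8.13) is diagonal and each coefficient follows from two inner products, evaluated here with a simple midpoint rule.

```python
import math

def inner(f, g, n=2000):
    """<f, g> = integral of f*g over [0, 1], by the midpoint rule."""
    h = 1.0 / n
    return h * sum(f((k + 0.5)*h) * g((k + 0.5)*h) for k in range(n))

def galerkin_poisson(s, N):
    """Galerkin solution of -u'' = s on [0,1], u(0)=u(1)=0,
    with trial = weighting functions b_j = sin(j*pi*x) (eqs. 8.12-8.13).
    Here L b_j = (j*pi)^2 b_j, so A is diagonal and
    d_j = <s, b_j> / <L b_j, b_j>.  Returns [d_1, ..., d_N]."""
    d = []
    for j in range(1, N + 1):
        bj = lambda x, j=j: math.sin(j*math.pi*x)
        Lbj = lambda x, j=j: (j*math.pi)**2 * math.sin(j*math.pi*x)
        d.append(inner(s, bj) / inner(Lbj, bj))
    return d   # u(x) is approximated by sum_j d_j sin(j*pi*x)
```

Choosing s = π² sin(πx), whose exact solution is u = sin(πx), the procedure recovers d_1 = 1 with all other coefficients vanishing.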
E763 (part 2) Numerical Methods page 84

Example: The problem is to find the path of light travel between two points. For this case we can use Fermat's Principle, which says that 'light travels through the quickest path' (note that this is not necessarily the shortest). We can formulate this statement in the form:

 time = min ∫_{P1}^{P2} ds/v(s)   (8.14)

where v(s) is the velocity point by point along the path. We are not really interested in the actual time taken to go from P1 to P2, but in the conditions that this minimum imposes on the path s(x,y). That is, we want to find the actual path between these points that minimizes this time. In particular, (8.14) can be applied to an inhomogeneous medium, where the refractive index n(x,y) varies with position, and the minimization process should lead to the actual trajectory.

Let's consider first a uniform medium. We can write for the velocity v = c/n, where n(x,y) is the refractive index, in this case uniform (constant). Then (8.14) becomes:

 time = min (n/c) ∫_{P1}^{P2} ds = (n/c) min(path length)   (8.15)

The integral in this case reduces to the actual length of the path, so the above statement asks for the path of 'minimum length', the shortest path between P1 and P2. Obviously, the solution in this case is the straight line joining the points.

Extending the example a little more, let's consider an interface between two media with different refractive indices. From above, we know that in each medium the path will be a straight line, but what are the coordinates of the point on the y-axis (interface) where both straight lines meet? Without losing generality, we can consider a situation like that of the figure. Applying (8.14) gives:

 time = min (1/c) [ n1 ∫_{P1}^{P0} ds + n2 ∫_{P0}^{P2} ds ]   (8.16)

Fig. 8.7 Refraction at the interface between two media: straight paths from P1 to a point P0 on the y-axis and from P0 to P2.

Both integrals correspond to the respective lengths of the two branches of the total path. We know the coordinates of P1 and P2, but we don't know the coordinate y0 of the point P0. We can rewrite (8.16) in the form:

 time = min (1/c) [ n1 √(x1² + (y1 − y0)²) + n2 √(x2² + (y2 − y0)²) ]   (8.17)

where the only variable (unknown) is y0. To find the path we need to find the minimum, and for this
18) Now. we looked for the time. from the figure we can observe that the right hand side can be written in terms of the angles of incidence and refraction as: (8.19) n1 sin α 1 − n2 sin α 2 = 0 as the condition the point P0 must satisfy. let’s change the limits to –a and a. to show how a ‘bad’ approximation of the function y can still give a rather acceptable value for k. An important property of a variational approach is that precisely because the solution function produces a stationary value of the functional. the general idea of this variational approach is to formulate the problem in a form that we look for a stationary condition (maximum. but there is a proof in the Appendix): b 2 dy ∫ dx dx 2 (8.19) is the familiar Snell’s law.v. (Like in the above problem.page 85 E763 (part 2) Numerical Methods we do: d −(y1 − y0 ) (time) = 0 = n1 + n2 2 dy 0 x1 + (y1 − y0 )2 2 x2 + (y2 − y0 )2 −(y2 − y0 ) (8. this is rather insensitive to small perturbations of the solution (approximations). The first mode of oscillation has the form: dy Aπ πx πx y = Acos then =− sin dx 2a 2a 2a Using this exact solution in (99): we get for the first mode (prove this): k = 2 π2 4a 2 then k= π 2a ≈ 1. For this problem. an appropriate variational expression is (do not worry about where this comes from.20) k = s. a b ∫y a 2 dx The above expression corresponds to the k–number or resonant frequencies of a string vibrating freely and attached at the ends at a and b. inflexion point) on some parameter which depends on the desired solution. and we know this is right because (8. which depends on the actual path travelled). So. particularly for the application of numerical methods where all solutions are only approximate. minimum. To illustrate this property. let’s try a simple triangular shape (instead of the correct sinusoidal shape of the vibrating string): . 
let’s analyse another example: Example Consider the problem of finding the natural resonant frequencies of a vibrating string (a chord in a guitar). This is a very desirable property.571 a Now. For simplicity.
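Returning for a moment to the refraction example, the minimisation (8.17)-(8.19) is easy to check numerically. The sketch below (in Python; the notes' own listings use MATLAB, and the specific values n1 = 1, n2 = 1.5 and the point coordinates are illustrative assumptions, not taken from the figure) scans for the minimising y0 by golden-section search and verifies that Snell's law (8.19) holds there:

```python
import math

# Illustrative geometry (assumed values): P1 at horizontal distance x1 and
# height y1 on one side of the interface, P2 at distance x2 and height y2
# on the other; the meeting point is P0 = (0, y0) on the interface.
n1, n2 = 1.0, 1.5
x1, y1 = 2.0, 0.0
x2, y2 = 1.0, 3.0
c = 1.0                       # speed of light; its value does not move the minimum

def travel_time(y0):
    # Equation (8.17): two straight branches meeting at (0, y0)
    l1 = math.sqrt(x1**2 + (y1 - y0)**2)
    l2 = math.sqrt(x2**2 + (y2 - y0)**2)
    return (n1*l1 + n2*l2) / c

# Golden-section search on [y1, y2] (travel_time is convex in y0)
lo, hi = y1, y2
g = (math.sqrt(5) - 1) / 2
for _ in range(200):
    m1, m2 = hi - g*(hi - lo), lo + g*(hi - lo)
    if travel_time(m1) < travel_time(m2):
        hi = m2
    else:
        lo = m1
y0 = (lo + hi) / 2

# Check Snell's law (8.19): n1 sin(alpha1) = n2 sin(alpha2) at the minimum
sin_a1 = (y0 - y1) / math.sqrt(x1**2 + (y0 - y1)**2)
sin_a2 = (y2 - y0) / math.sqrt(x2**2 + (y2 - y0)**2)
print(n1*sin_a1, n2*sin_a2)
```

Any other choice of media and endpoints gives the same agreement: the stationarity condition of the time functional is exactly Snell's law.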
Now, to show how a 'bad' approximation of the function y can still give a rather acceptable value for k, let's try a simple triangular shape instead of the correct sinusoidal shape of the vibrating string (Fig. 8.8):

    y = A(1 + x/a) for x < 0,   y = A(1 - x/a) for x > 0
    dy/dx = A/a for x < 0,      dy/dx = -A/a for x > 0

Fig. 8.8: Triangular trial shape for the vibrating string.

Using these values in (8.20) gives (prove this):

    k^2 = 3/a^2   so   k ≈ 1.732/a

which is not too bad considering how coarse the approximation to y(x) is. If instead of the triangular shape we try a second order (parabolic) shape:

    y = A(1 - (x/a)^2)   so   dy/dx = -2A x/a^2

substituting these values in (8.20) now gives:

    k^2 = 2.5/a^2   so   k ≈ 1.581/a

which is a rather good approximation (recall the exact value 1.571/a).

In summary, once we already have a variational formulation, the use of the Rayleigh-Ritz procedure permits us to construct a systematic numerical method. Now, how can we use this method systematically? As said before, we can specify the necessary steps as:

    BV problem:  L u = s   ------->   Find variational expression J(u)
    Use Rayleigh-Ritz:  u = sum over j = 1..N of d_j b_j
    Insert in J(u), that is: find the stationary value of J(u) = J({d_j}), i.e. find the coefficients d_j
    Reconstruct  u = sum over j = 1..N of d_j b_j   <-------   u is the solution of the BV problem

We will skip here the problem of actually finding the corresponding variational expression for a boundary value problem, simply saying that there are systematic methods to find them. We will be concerned here with how to solve a problem once a variational expression is available.
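The three trial-function results above are easy to reproduce numerically. The following sketch (Python; the choice a = 1 and the midpoint integration rule are illustrative assumptions) evaluates the Rayleigh quotient (8.20) for the exact, triangular and parabolic shapes:

```python
import math

a = 1.0   # half-length of the interval [-a, a] (illustrative choice)

def rayleigh_k(y, dy, n=20000):
    # k^2 = ∫ (dy/dx)^2 dx / ∫ y^2 dx over [-a, a]  (eq. 8.20),
    # evaluated with a simple midpoint rule
    h = 2*a/n
    num = den = 0.0
    for i in range(n):
        x = -a + (i + 0.5)*h
        num += dy(x)**2 * h
        den += y(x)**2 * h
    return math.sqrt(num/den)

k_exact = rayleigh_k(lambda x: math.cos(math.pi*x/(2*a)),
                     lambda x: -(math.pi/(2*a))*math.sin(math.pi*x/(2*a)))
k_tri   = rayleigh_k(lambda x: 1 - abs(x)/a,
                     lambda x: -math.copysign(1/a, x))
k_par   = rayleigh_k(lambda x: 1 - (x/a)**2,
                     lambda x: -2*x/a**2)
print(k_exact, k_tri, k_par)   # ≈ 1.571/a, 1.732/a, 1.581/a
```

Note how a first order (10%) error in the trial shape produces only a few percent error in k: this is the insensitivity of the stationary value at work.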
Example: For the problem of the square coaxial (or square capacitor) seen earlier, the BV problem is defined by the Laplace equation:

    ∇²φ = 0   with some boundary conditions   (L = ∇², u = φ, s = 0)

An appropriate functional (variational expression) for this case is:

    J(φ) = ∫ over Ω of (∇φ)² dΩ   (given)        (8.21)

Using Rayleigh-Ritz, φ = sum over j = 1..N of d_j b_j(x,y):

    J(φ) = ∫ over Ω of |∇ sum_j d_j b_j(x,y)|² dΩ = ∫ over Ω of sum_i sum_j d_i d_j ∇b_i(x,y)·∇b_j(x,y) dΩ

This can be written as the matrix expression J(d) = dᵀA d, where:

    a_ij = ∫ over Ω of ∇b_i·∇b_j dΩ        (8.22)

Now, find the stationary value:

    ∂J/∂d_i = 0   for all i = 1, ..., N

so, applying it to (8.22):

    ∂J/∂d_i = sum over j = 1..N of a_ij d_j = 0,   for all i = 1, ..., N

And the problem reduces to the matrix equation:

    A d = 0        (8.23)

Solving the system of equations (8.23), we obtain the coefficients d_j, and the unknown function can be obtained as:

    u(x,y) = sum over j = 1..N of d_j b_j(x,y)

We can see that both methods, the weighted residuals and the variational method, transform the BV problem into an algebraic, matrix problem. One of the first steps in the implementation of either method is the choice of appropriate expansion functions to use in the Rayleigh-Ritz procedure: basis or trial functions and weighting functions. The finite element method provides a simple way to construct these functions and to implement these methods.
9. FINITE ELEMENTS

As the name suggests, the finite element method is based on the division of the domain of interest Ω into 'elements', or small pieces E_i that cover Ω completely but without intersections; they constitute a tessellation (tiling) of Ω:

    union of E_i = Ω,   E_i ∩ E_j = ∅ for i ≠ j

Over the subdivided domain we apply the methods discussed earlier (either weighted residuals or variational). The basis functions are defined locally in each element and, because each element is small, these functions can be very simple and still constitute overall a good approximation of the desired function. In this form, inside each element E_e the wanted function u(x) is represented by a local approximation ũ_e(x) valid only in element number e. The complete function u(x) over the whole domain Ω is then simply approximated by the addition of all the local pieces:

    u(x) ≈ sum over e of ũ_e(x)

An important characteristic of this method is that it is 'exact-in-the-limit': provided the availability of computer resources, a solution can always be obtained to any degree of approximation, with the solution gradually and monotonically converging to the exact value. The degree of approximation can only improve when the number of elements increases.

One dimensional problems

Let's consider first a one dimensional case, with Ω as the interval [a, b]. We first divide Ω into N subintervals (not necessarily equal in size!), defined by N+1 nodes x_i. The wanted function u(x) is then locally approximated in each subinterval by a simple function, for example a straight line: a function of the type ax + b, defined differently in each subinterval, as shown in the figure. The total approximation ũ(x) is then the superposition of all these locally defined functions, that is, a piecewise linear approximation to u(x). The amount of error in this approximation will depend on the size of each element (and consequently on the total number of elements) and, more importantly, on their size in relation to the local variation of the desired function.

Fig. 9.1: Piecewise linear approximation to u(x) over the nodes a = x1, x2, ..., b.  Fig. 9.2: The interpolation functions Ni(x) and Ni+1(x) over one element.

Shape functions

The local functions ũ_e(x), which are nonzero only in the subinterval e, can be defined as the superposition of interpolation functions N_i(x) and N_i+1(x), as shown in the figure above (right). From the figure we can see that the function ũ_e(x), the local approximation to u(x) in element e, can be written as:

    ũ_e(x) = u_i N_i(x) + u_i+1 N_i+1(x)        (9.1)

Now, if we consider the neighbouring subintervals and the associated interpolation functions, we can extend the definition of these functions to form the triangular ('hat') shapes below:

Fig. 9.3: Shape functions and approximation over two adjacent intervals.

With this definition, we can write for the function u(x) in the complete domain Ω:

    u(x) ≈ ũ(x) = sum over e = 1..Ne of ũ_e(x) = sum over i = 1..Np of u_i N_i(x)        (9.2)

(Np is the number of nodes, Ne is the number of elements.)

Exercise 9.1: Examine this figure and that of the previous page and show that indeed (9.1) is valid in the element (x_i, x_i+1), and so (9.2) is valid over the full domain.

The functions N_i(x), known in the FE method as shape functions, are defined as the (double-sided) interpolation functions at node i, i = 1, ..., Np, so N_i(x) = 1 at node i and 0 at all other nodes. This form of expanding u(x) has a very useful property: the coefficients of the expansion, u_i in (9.2), are the nodal values, that is, the values of the wanted function at the nodal points. Solving the resultant matrix equation will give these values directly.

Example: Consider the boundary value problem:

    d²y/dx² + k²y = 0   for x in [a, b],   with y(a) = y(b) = 0

This corresponds, for example, to the equation describing the envelope of the transverse displacement of a vibrating string attached at both ends. A suitable variational expression (corresponding to the resonant frequencies) is the following, as seen in (8.20) and in the Appendix:
    J = k² = [ ∫ from a to b of (dy/dx)² dx ] / [ ∫ from a to b of y² dx ]   (given)        (9.3)

Using now the expansion:

    y = sum over j = 1..N of y_j N_j(x)        (9.4)

where N is the number of nodes, y_j are the (unknown) nodal values and N_j(x) are the shape functions, we can write k² = k²({y_j}) = Q/R, where, from (9.3), we have:

    Q = ∫ from a to b of [ d/dx sum_j y_j N_j(x) ]² dx = ∫ from a to b of ( sum_j y_j dN_j/dx )( sum_k y_k dN_k/dx ) dx        (9.5)

and

    R = ∫ from a to b of [ sum_j y_j N_j(x) ]² dx = ∫ from a to b of ( sum_j y_j N_j(x) )( sum_k y_k N_k(x) ) dx        (9.6)

To find the stationary value, we do:

    dk²/dy_i = 0   for each y_i, i = 1, ..., N        (9.7)

But since k² = Q/R, then:

    dk²/dy_i = (Q'R - QR')/R²

so dk²/dy_i = 0 implies Q'R = QR', that is, Q' = (Q/R)R' = k²R'; so finally:

    dQ/dy_i = k² dR/dy_i   for all y_i, i = 1, ..., N        (9.8)

We now have to evaluate these derivatives:

    dQ/dy_i = ∫ from a to b of (dN_i/dx)( sum_k y_k dN_k/dx ) dx + ∫ from a to b of ( sum_j y_j dN_j/dx )(dN_i/dx) dx

or

    dQ/dy_i = 2 ∫ from a to b of (dN_i/dx)( sum_j y_j dN_j/dx ) dx

which can be written as:

    dQ/dy_i = 2 sum over j = 1..N of a_ij y_j   where   a_ij = ∫ from a to b of (dN_i/dx)(dN_j/dx) dx        (9.9)

For the second term:

    dR/dy_i = ∫ from a to b of N_i(x)( sum_j y_j N_j(x) ) dx + ∫ from a to b of ( sum_k y_k N_k(x) ) N_i(x) dx
or

    dR/dy_i = 2 sum over j = 1..N of y_j ∫ from a to b of N_i(x) N_j(x) dx = 2 sum_j b_ij y_j   with   b_ij = ∫ from a to b of N_i(x) N_j(x) dx        (9.10)

Replacing (9.9) and (9.10) in (9.8), we can write the matrix equation:

    A y = k² B y        (9.11)

where A = {a_ij}, B = {b_ij} and y = {y_j} is the vector of nodal values. Equation (9.11) is a matrix eigenvalue problem. The solution will give the eigenvalue k² and the corresponding solution vector y (the list of nodal values of the function y(x)).

The matrix elements are:

    a_ij = ∫ from a to b of (dN_i/dx)(dN_j/dx) dx   and   b_ij = ∫ from a to b of N_i N_j dx        (9.12)

The shape functions N_i and N_j are only nonzero in the vicinity of nodes i and j respectively (see figure below), so a_ij = b_ij = 0 if N_i and N_j do not overlap, that is, if j ≠ i-1, i, i+1. As a consequence, the matrices A and B are tridiagonal: they will have no more than 3 nonzero elements per row.

Fig.: The hat functions Ni and Nj, nonzero only on (x_i-1, x_i+1) and (x_j-1, x_j+1) respectively.

Exercise 9.2: Define generically the triangular functions N_i, integrate (9.12) and calculate the values of the matrix elements a_ij and b_ij.

Two Dimensional Finite Elements

In this case, a two dimensional region Ω is subdivided into smaller pieces or 'elements', and the subdivision satisfies the same properties as in the one-dimensional case: the elements cover the region of interest completely and there are no intersections (no overlapping). The most common form of subdividing a 2D region is by using triangles. There are well-developed methods to produce an appropriate meshing (or subdivision) of a 2D region into triangles, and triangles have maximum flexibility to accommodate intricate shapes of the region of interest Ω. Quadrilateral elements are also used and they have some useful properties, but by far the most common, versatile and easier to use are triangles with straight sides. The process of calculation follows the same route as in the 1D case: a function of two variables u(x,y) is approximated by shape functions N_j(x,y), defined as interpolation functions over one element (in this case a triangle).
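Before moving on to two dimensions, the complete 1D procedure, assembling the tridiagonal matrices of (9.9)-(9.10) element by element and solving the eigenvalue problem (9.11) for the lowest resonance, can be sketched in code. This is a Python sketch (the notes' own listings are in MATLAB); the interval length L = 1, the mesh of 30 equal elements, and the standard element integrals for linear hat functions (the 1/h and h/6 factors, which Exercise 9.2 asks you to derive) are assumptions of the sketch:

```python
import math

# 1D FE solution of  y'' + k^2 y = 0,  y(0) = y(L) = 0, with linear hat functions.
L = 1.0            # interval length (illustrative choice)
n = 30             # number of equal elements (illustrative choice)
h = L / n
m = n - 1          # interior nodes only: the Dirichlet end nodes are removed

A = [[0.0]*m for _ in range(m)]   # stiffness: a_ij = ∫ Ni' Nj' dx   (eq. 9.9)
B = [[0.0]*m for _ in range(m)]   # mass:      b_ij = ∫ Ni Nj dx     (eq. 9.10)
for e in range(n):                # element e joins interior nodes e-1 and e
    for (p, i) in ((0, e-1), (1, e)):
        for (q, j) in ((0, e-1), (1, e)):
            if 0 <= i < m and 0 <= j < m:
                A[i][j] += (1.0/h) * (1 if p == q else -1)
                B[i][j] += (h/6.0) * (2 if p == q else 1)

def solve(M, b):
    # dense Gaussian elimination, no pivoting (fine for this SPD system)
    M = [row[:] for row in M]; b = b[:]
    nn = len(b)
    for k in range(nn - 1):
        for i in range(k + 1, nn):
            f = M[i][k] / M[k][k]
            for j in range(k, nn):
                M[i][j] -= f * M[k][j]
            b[i] -= f * b[k]
    x = [0.0]*nn
    for i in range(nn - 1, -1, -1):
        s = sum(M[i][j]*x[j] for j in range(i + 1, nn))
        x[i] = (b[i] - s) / M[i][i]
    return x

# Inverse iteration on A y = k^2 B y: repeatedly solve A z = B y,
# which converges to the smallest eigenvalue k^2 (the first resonance)
y = [1.0]*m
for _ in range(50):
    By = [sum(B[i][j]*y[j] for j in range(m)) for i in range(m)]
    z = solve(A, By)
    norm = math.sqrt(sum(v*v for v in z))
    y = [v/norm for v in z]
By = [sum(B[i][j]*y[j] for j in range(m)) for i in range(m)]
Ay = [sum(A[i][j]*y[j] for j in range(m)) for i in range(m)]
k2 = sum(y[i]*Ay[i] for i in range(m)) / sum(y[i]*By[i] for i in range(m))
print(math.sqrt(k2))   # close to the exact first resonance pi/L
```

Note the tridiagonal structure of A and B appearing automatically from the overlap rule a_ij = b_ij = 0 for j ≠ i-1, i, i+1, and that the computed k converges to π/L as the number of elements grows, illustrating the exact-in-the-limit property.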
This approximation is given by:

    u(x,y) ≈ ũ(x,y) = sum over j = 1..N of u_j N_j(x,y)        (9.13)

where N is the number of nodes in the mesh and the coefficients u_j are the (unknown) nodal values.

Fig. 9.4: A rectangular region in the xy-plane subdivided into triangles, with the corresponding piecewise planar approximation to a function u(x,y) plotted along the vertical axis.

The approximation shown in the figure uses first order functions, that is, pieces of planes (flat tiles): the function u(x,y) is approximated in each triangle by a function of the form (first order in x and y):

    ũ(x,y) = p + qx + ry = (1 x y)(p q r)ᵀ        (9.14)

where p, q and r are constants with different values in each triangle. Note that the approximation is composed of flat 'tiles' that fit exactly along the edges, so the approximation is continuous over the entire region, but its derivatives are not. Similarly to the one dimensional case, this function can be written in terms of shape functions (interpolation polynomials):

    ũ(x,y) = u1 N1(x,y) + u2 N2(x,y) + u3 N3(x,y)        (9.15)

for a triangle with nodes numbered 1, 2 and 3, with coordinates (x1, y1), (x2, y2) and (x3, y3). Other types of approximation are also possible, but they require defining more nodes in each triangle: while a plane is totally defined by 3 points, a second order surface, for example, will need 6 points (for example, the values at the 3 vertices and at 3 midside points).
The shape functions N_i are such that N_i = 1 at node i and 0 at all the others. It can be shown that the function N_i satisfying this property is:

    N_i(x,y) = (a_i + b_i x + c_i y)/(2A)        (9.16)

where A is the area of the triangle and:

    a1 = x2 y3 - x3 y2    b1 = y2 - y3    c1 = x3 - x2
    a2 = x3 y1 - x1 y3    b2 = y3 - y1    c2 = x1 - x3
    a3 = x1 y2 - x2 y1    b3 = y1 - y2    c3 = x2 - x1

(See the demonstration in the Appendix.) The function N1 defined in (9.16) and shown below corresponds to the shape function (interpolation function) for the node numbered 1 in the triangle shown.

Fig. 9.5: The local contributions u1 N1, u2 N2 and u3 N3 over one triangle.  Fig. 9.6: Their sum u1 N1 + u2 N2 + u3 N3, a plane over the triangle.

Now, the same node can be a vertex of other neighbouring triangles, in which we will also define a corresponding shape function for node 1 (with expressions like (9.16) but with different values of the constants a, b and c, giving a different orientation of the plane), building up the complete shape function for node 1: N1. This is shown in the next figure for a node number i which belongs to five triangles.

Fig. 9.7: The complete (pyramid-shaped) shape function for a node i shared by five triangles.
Joining all the facets of N_i, we can refer to this function as the global shape function N_i(x,y); we can see that it is zero at all other nodes of the mesh. Then, considering all the nodes of the mesh, the function u can be written as:

    u(x,y) = sum over j = 1..N of u_j N_j(x,y),   for j = 1, ..., N        (9.17)

We can now use this expansion for substitution in a variational expression or in the weighted residuals expression, to obtain the corresponding matrix problem for the expansion coefficients. An important advantage of this method is that these coefficients are precisely the nodal values of the wanted function u, so the result is obtained immediately when solving the matrix problem.

MATRIX SPARSITY

An additional advantage is the sparsity of the resultant matrices, which is very convenient in terms of computer requirements. Considering the global shape functions for two nodes i and j, we can see that products of the form N_i N_j, or of derivatives of these functions, which will appear in the definition of the matrix elements (as seen in the 1D case), will almost always be zero, except when the nodes i and j are either the same node or immediate neighbours, so that there is an overlap (see figure above). This implies that the corresponding matrices will be very sparse.

Example: If we consider a simple mesh of 12 nodes and 12 triangles, as in the figure, the matrix sparsity pattern results as shown: for instance, triangle 2, with nodes 2, 5 and 6, contributes to the entries coupling those three nodes, and node 11 is coupled only to the nodes it shares a triangle with (through triangles 10, 11 and 12).

Fig.: A simple 12-node, 12-triangle mesh and the resulting sparsity pattern of the matrix.
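The sparsity pattern can be generated mechanically from the mesh connectivity: an entry (i, j) is nonzero only if nodes i and j appear together in some triangle. A short Python sketch (the triangle list below is a small assumed example mesh, not the exact mesh of the figure):

```python
# Nonzero pattern of an FE matrix: (i, j) is nonzero only when nodes i and j
# share a triangle. The mesh below is an assumed example for illustration.
triangles = [
    (1, 2, 5), (2, 5, 6), (2, 3, 6), (3, 6, 7),
    (5, 6, 9), (6, 9, 10), (6, 7, 10), (7, 10, 11),
]

nonzero = set()
for tri in triangles:
    for i in tri:
        for j in tri:
            nonzero.add((i, j))     # includes the diagonal entries (i, i)

nodes = sorted({i for tri in triangles for i in tri})
for i in nodes:
    row = ''.join('x' if (i, j) in nonzero else '.' for j in nodes)
    print(i, row)
```

The printed pattern is symmetric (overlap is a symmetric relation) and each row has only a handful of nonzero entries, exactly the behaviour exploited by sparse matrix storage.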
Example: For the problem of finding the potential distribution in the square coaxial (or, equivalently, the steady-state temperature distribution between two square section surfaces), an appropriate variational expression is:

    J = ∫ over Ω of (∇φ)² dΩ        (9.18)

Substituting (9.17):

    J = ∫ over Ω of |∇ sum_j φ_j N_j|² dx dy = ∫ over Ω of ( sum_i φ_i ∇N_i )·( sum_j φ_j ∇N_j ) dx dy

which can be rewritten as:

    J = sum_i sum_j φ_i φ_j ∫ over Ω of ∇N_i·∇N_j dx dy = Φᵀ A Φ        (9.19)

where A = {a_ij}, with a_ij = ∫ over Ω of ∇N_i·∇N_j dx dy, and Φ = {φ_j}. We now have to find the stationary value, that is, put ∂J/∂φ_i = 0:

    ∂J/∂φ_i = 0 = 2 sum_j a_ij φ_j   for i = 1, ..., N

then:

    A Φ = 0        (9.20)

In this way, the coefficients a_ij can be calculated, and the only unknowns are the nodal values φ_j. We need to evaluate first the elements of the matrix A. For this, we can consider the integral over the complete domain Ω as the sum of the integrals over each element Ω_k, k = 1, ..., Ne (Ne elements in the mesh):

    a_ij = ∫ over Ω of ∇N_i·∇N_j dx dy = sum over k = 1..Ne of ∫ over Ω_k of ∇N_i·∇N_j dx dy        (9.21)

Before calculating these values, let's see which elements of A are actually nonzero, that is, let's consider the matrix sparsity. As discussed earlier, a_ij will only be nonzero if the nodes i and j are both in the same triangle, so the sum in (9.21) will only extend to at most two triangles for each combination of i and j. Inside one triangle, the shape function N_i(x,y) defined for the node i is:
    N_i(x,y) = (a_i + b_i x + c_i y)/(2A)

Then:

    ∇N_i = (∂N_i/∂x) x̂ + (∂N_i/∂y) ŷ = (b_i x̂ + c_i ŷ)/(2A)

so the integrand ∇N_i·∇N_j = (b_i b_j + c_i c_j)/(4A²) is constant over the triangle; the integral in (9.21) reduces simply to the area of the triangle, and the calculations are easy:

    a_ij = sum over k = 1..Ne of (b_i b_j + c_i c_j)/(4A_k²) ∫ over Ω_k of dx dy = sum over k = 1..Ne of (b_i b_j + c_i c_j)/(4A_k)        (9.22)

where the sum will only have a few terms (those values of k corresponding to the triangles containing both nodes i and j). The values of A_k, the area of the triangle, and of b_i, b_j, c_i, c_j will be different for each triangle concerned. For example, considering the element a_47 corresponding to the mesh in the previous figure, the sum extends over the triangles containing nodes 4 and 7, that is, triangles number 5 and 6:

    a_47 = (1/4A5)(b4⁽⁵⁾b7⁽⁵⁾ + c4⁽⁵⁾c7⁽⁵⁾) + (1/4A6)(b4⁽⁶⁾b7⁽⁶⁾ + c4⁽⁶⁾c7⁽⁶⁾)

However, in other cases the integration can be complicated, because it has to be done over a triangle which is at any position and orientation in the x-y plane. This is the case for integrals like:

    b_ij = ∫ over Ω of N_i N_j dΩ = sum over k = 1..Ne of ∫ over Ω_k of N_i N_j dΩ

which arise, for example, when solving a variational expression like:

    J = k² = [ ∫ over Ω of (∇φ)² dΩ ] / [ ∫ over Ω of φ² dΩ ]   (corresponding to (9.3) for 1D problems)        (9.23)

This results in an eigenvalue problem A Φ = k² B Φ, where the matrix B has the elements:

    b_ij = ∫ over Ω of N_i N_j dΩ        (9.24)

The elements of A are the same as in (9.22).

Exercise 9.3: Apply the Rayleigh-Ritz procedure to expression (9.23) and show that indeed the matrix elements a_ij and b_ij have the values given in (9.22) and (9.24).
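The evaluation (9.22) is straightforward to code: for each triangle, compute the coefficients b_i, c_i and the area, then accumulate (b_i b_j + c_i c_j)/4A. A sketch for a single triangle (Python; the vertex coordinates are an arbitrary assumed example, listed counterclockwise):

```python
# Local 3x3 stiffness matrix of one first-order triangle, using (9.16) and (9.22):
# b1 = y2 - y3, c1 = x3 - x2 (and cyclic permutations), a_ij = (bi*bj + ci*cj)/(4A)
x = [0.0, 2.0, 0.5]   # assumed vertex coordinates, counterclockwise order
y = [0.0, 0.3, 1.5]

a = [x[1]*y[2] - x[2]*y[1], x[2]*y[0] - x[0]*y[2], x[0]*y[1] - x[1]*y[0]]
b = [y[1] - y[2], y[2] - y[0], y[0] - y[1]]
c = [x[2] - x[1], x[0] - x[2], x[1] - x[0]]
A = 0.5*abs((x[1]-x[0])*(y[2]-y[0]) - (x[2]-x[0])*(y[1]-y[0]))   # triangle area

def N(i, px, py):
    # shape function (9.16) evaluated at the point (px, py)
    return (a[i] + b[i]*px + c[i]*py)/(2*A)

K = [[(b[i]*b[j] + c[i]*c[j])/(4*A) for j in range(3)] for i in range(3)]
for row in K:
    print(row)
```

Two built-in checks of the formulas: N_i equals 1 at vertex i and 0 at the other two, and each row of K sums to zero, because b1 + b2 + b3 = c1 + c2 + c3 = 0 (equivalently, N1 + N2 + N3 = 1 has zero gradient).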
In particular, for the element b_56 of the previous mesh, the sum extends only over the triangles containing nodes 5 and 6: triangles 2 and 7. If we take triangle 7, its contribution to this element is the term:

    b56⁽⁷⁾ = (1/4A7²) ∫ over Ω7 of (a5 + b5 x + c5 y)(a6 + b6 x + c6 y) dx dy

or, expanded:

    b56⁽⁷⁾ = (1/4A7²) ∫ over Ω7 of [ a5 a6 + (a5 b6 + a6 b5)x + (a5 c6 + a6 c5)y + b5 b6 x² + (b5 c6 + b6 c5)xy + c5 c6 y² ] dx dy

Integrals like these need to be calculated for every pair of nodes in every triangle of the mesh. These calculations can be cumbersome if attempted directly. However, it is much simpler to use a transformation of coordinates into a local system: the calculations can be made in a model triangle using this system and then mapped back to the global coordinates. This has the advantage that the integrals can be calculated for just one model triangle and the result then converted back to the x and y coordinates, irrespective of the triangle shape. The most common system of local coordinates used for this purpose is the triangle area coordinates.

Triangle Area Coordinates

For a triangle as in the figure, any point inside can be specified by the coordinates x and y or, for example, by the coordinates ξ1 and ξ2: any point inside is defined by its proportional distance to the sides. These coordinates are defined with value 1 at one node and zero at the opposite side, varying linearly between these limits. For the sake of symmetry, we can define 3 coordinates (ξ1, ξ2, ξ3), of which obviously only two are independent.

A formal definition of these coordinates can be made in terms of the areas of the triangles shown in the figure. If A1, A2 and A3 are the areas of the triangles formed by a point P and two of the vertices of the triangle, the area coordinates are defined as the ratios:

    ξ_i = A_i/A

where A is the area of the triangle. Since A1 + A2 + A3 = A, we have that ξ1 + ξ2 + ξ3 = 1.

Fig. 9.8: The sub-triangle areas A1, A2 and A3 defined by an interior point P. The triangle marked with the dotted line has the same area as A1 (same base, same height), so the line of constant ξ1 is the one marked in the figure.

Fig. 9.9: The vertices have coordinates (1,0,0), (0,1,0) and (0,0,1); the sides are the lines ξ1 = 0, ξ2 = 0 and ξ3 = 0, showing the linear variation of the coordinates.

The advantage of using this local coordinate system is that the actual shape of the triangle is not important.

The area of each of these triangles, for example A1, can be calculated using (see Appendix):

    A1 = (1/2) det | 1  x   y  |
                   | 1  x2  y2 |
                   | 1  x3  y3 |

where x and y are the coordinates of the point P. Evaluating this determinant gives:

    A1 = (1/2)[ (x2 y3 - x3 y2) + (y2 - y3)x + (x3 - x2)y ]

and, using the definitions in the Appendix:

    A1 = (1/2)(a1 + b1 x + c1 y)

Then:

    ξ1 = A1/A = (1/2A)(a1 + b1 x + c1 y)   that is   ξ1 = N1(x,y)        (9.25)

We have then that these coordinates vary in the same form as the shape functions, which is quite convenient for calculations. Expression (9.25) also gives us the required relationship between the local coordinates (ξ1, ξ2, ξ3) and the global coordinates (x, y), that is, the expression we need to convert coordinates (x, y) into (ξ1, ξ2, ξ3). We can also find the inverse relation. From the Appendix we have:

    (N1 N2 N3) = (ξ1 ξ2 ξ3) = (1 x y) | 1  x1  y1 |⁻¹
                                      | 1  x2  y2 |
                                      | 1  x3  y3 |

from where:

    (1 x y) = (ξ1 ξ2 ξ3) | 1  x1  y1 |
                          | 1  x2  y2 |
                          | 1  x3  y3 |

and, expanding:

    1 = ξ1 + ξ2 + ξ3
    x = x1 ξ1 + x2 ξ2 + x3 ξ3        (9.26)
    y = y1 ξ1 + y2 ξ2 + y3 ξ3

Equations (9.25) and (9.26) will allow us to change from one system to the other; this is all we need to convert between the two systems of coordinates. Finally, the evaluation of integrals can be made now in terms of the local coordinates in the usual way:

    ∫ over Ω_k of f(x,y) dx dy = ∫ over Ω_k of f(ξ1, ξ2, ξ3) |J| dξ1 dξ2

where J is the Jacobian of the transformation from (ξ1, ξ2) to (x, y); in this case |J| = 2A, so the expression we need to use to transform integrals is:
    ∫ over Ω_k of f(x,y) dx dy = 2A ∫ over Ω_k of f(ξ1, ξ2, ξ3) dξ1 dξ2        (9.27)

Example: The integral b_ij = ∫ over Ω_k of N_i N_j dx dy of the previous exercise is difficult to calculate in terms of x and y for an arbitrary triangle, and needs to be calculated separately for each pair of nodes in each triangle of the mesh. Transforming to (local) area coordinates, it is much simpler:

    ∫ over Ω_k of N_i N_j dx dy = 2A ∫ over Ω_k of ξ_i ξ_j dξ1 dξ2        (9.28)

To determine the limits of integration, we can note that the total area of the triangle can be covered by ribbons like that shown in the figure, with a width dξ1 and length ξ2. At the left end of this ribbon ξ3 = 0, so there ξ2 = 1 - ξ1. Moving the ribbon from the side 2-3 of the triangle to the vertex 1 (ξ1 changing from 0 to 1) will cover the complete triangle, so the limits of integration must be:

    ξ2: 0 to 1 - ξ1   and   ξ1: 0 to 1

Fig. 9.10: The model triangle in (ξ1, ξ2) coordinates: a ribbon of width dξ1 at constant ξ1 extends from ξ2 = 0 to ξ2 = 1 - ξ1.

So the integral of (9.28) results in:

    ∫ over Ω_k of N_i N_j dx dy = 2A ∫ from 0 to 1 ∫ from 0 to 1-ξ1 of ξ_i ξ_j dξ2 dξ1        (9.29)

Note that the above conclusion about integration limits is valid for any integral over the triangle, not only the one used above. We can now calculate these integrals, taking first the case where i ≠ j:

a) Choosing i = 1 and j = 2 (this is an arbitrary choice; you can check that any other choice, e.g. 1 and 3, will give the same result):

    I_ij = 2A ∫ from 0 to 1 of ξ1 ∫ from 0 to 1-ξ1 of ξ2 dξ2 dξ1 = A ∫ from 0 to 1 of ξ1(1 - ξ1)² dξ1 = A(1/2 - 2/3 + 1/4) = A/12

b) For i = j, choosing i = 1:

    I_ii = 2A ∫ from 0 to 1 of ξ1² ∫ from 0 to 1-ξ1 of dξ2 dξ1 = 2A ∫ from 0 to 1 of ξ1²(1 - ξ1) dξ1 = 2A(1/3 - 1/4) = A/6
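These closed-form results are easy to verify by evaluating (9.29) with a simple midpoint sum over the model triangle. A Python sketch (A = 1 and the grid resolution are assumed, illustrative choices; any A only scales the result):

```python
# Check  2A ∫∫ ξ1 ξ2 dξ2 dξ1 = A/12  and  2A ∫∫ ξ1^2 dξ2 dξ1 = A/6
# over the model triangle 0 <= ξ2 <= 1 - ξ1, 0 <= ξ1 <= 1 (eq. 9.29)
A = 1.0        # triangle area (illustrative; it only scales the result)
n = 500        # strips in the ξ1 direction
h = 1.0/n

I12 = I11 = 0.0
for i in range(n):
    x1 = (i + 0.5)*h              # ξ1 at the midpoint of the strip
    top = 1.0 - x1                # upper limit for ξ2 (the line ξ3 = 0)
    m = int(top/h) + 1
    hy = top/m
    for j in range(m):
        x2 = (j + 0.5)*hy         # ξ2 at the midpoint
        I12 += x1*x2 * h*hy
        I11 += x1*x1 * h*hy
print(2*A*I12, A/12)
print(2*A*I11, A/6)
```

Once calculated in this form, the same values serve for every triangle of the mesh, which is exactly the point of working in the model triangle.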
Once calculated in this form, only the area A will change when applied to different triangles: the result can be used for any triangle, irrespective of its shape and position.

Higher Order Shape Functions

In most cases involving second order differential equations, the resultant weighted residuals expression or the corresponding variational expression can be written without involving second order derivatives, and first order shape functions will be fine. However, if this is not possible, other types of elements/shape functions must be used. (Note that second order derivatives of a first order function will be zero everywhere.) We can also choose to use different shape functions even if we could use first order polynomials, for example to get higher accuracy with fewer elements.

Second Order Shape Functions

To define a first order polynomial in x and y (planar shape functions) we needed 3 degrees of freedom: the 3 nodal values of u define the approximation completely. If we use second order shape functions, we need to fit a second order surface over each triangle. To do this uniquely, we need six degrees of freedom: either the 3 nodal values and the derivatives of u at those points, or the values at six points on the triangle. (This is analogous to trying to fit a second order curve to an interval: while a straight line is completely defined by 2 points, a second order curve will need 3.) Choosing the easier option, we form triangles with 6 points: the 3 vertices and 3 midside points.

Fig. 9.11: A triangle with 6 nodes identified by their triangle coordinates: the vertices 1 (1,0,0), 2 (0,1,0) and 3 (0,0,1), and the midside nodes 4 (1/2,1/2,0), 5 (0,1/2,1/2) and 6 (1/2,0,1/2).

We need to specify now the shape functions. If we take first the function N1(ξ1, ξ2, ξ3), corresponding to node 1, we know that it should be 1 there and zero at every other node. We can see that it must be 0 at ξ1 = 0 (nodes 2, 3 and 5) and also at ξ1 = 1/2 (nodes 4 and 6). Then, we can simply write:

    N1 = ξ1(2ξ1 - 1)        (9.30)

In the same form we can write the corresponding shape functions for the other vertices (nodes 2 and 3). For the midside nodes, for example node 4, we have that N4 should be zero at all nodes except 4, where its value is 1. We can see that all the other nodes are either on the side 2-3 (where ξ1 = 0) or on the side 3-1 (where ξ2 = 0). So the function N4 should be:

    N4 = 4ξ1ξ2        (9.31)
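A quick check (a Python sketch, using exact rational arithmetic) that the six second-order functions built this way interpolate correctly, i.e. each equals 1 at its own node and 0 at the other five:

```python
from fractions import Fraction as F

# Second order shape functions in area coordinates, from (9.30)-(9.31):
# vertex nodes use ξi(2ξi - 1); midside nodes use 4 ξi ξj
def shapes(x1, x2, x3):
    return [x1*(2*x1 - 1), x2*(2*x2 - 1), x3*(2*x3 - 1),   # nodes 1, 2, 3
            4*x1*x2, 4*x2*x3, 4*x3*x1]                     # nodes 4, 5, 6

half = F(1, 2)
nodes = [(F(1), F(0), F(0)), (F(0), F(1), F(0)), (F(0), F(0), F(1)),
         (half, half, F(0)), (F(0), half, half), (half, F(0), half)]

for i, node in enumerate(nodes):
    vals = shapes(*node)
    print(i + 1, vals)       # row i has a 1 in position i and 0 elsewhere
```

The six functions also sum to 1 at every point of the triangle (using ξ1 + ξ2 + ξ3 = 1), so a constant function is reproduced exactly, just as with the first order elements.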
The following figures show the two types of second order shape functions, one for vertices and one for midside points.

Fig. 9.12: Shape function N3 (vertex type).  Fig. 9.13: Shape function N4 (midside type).

Exercise 9.4: For the mesh of second order triangles of the figure (nodes 1 to 12), find the corresponding sparsity pattern.

Fig. 9.14: A mesh of second order triangles with 12 nodes.

Third order shape functions can be defined in the same way, on triangles with 10 nodes: the 3 vertices, two nodes along each side (at 1/3 and 2/3 of its length, e.g. (2/3,1/3,0) and (1/3,2/3,0)), and one interior node (1/3,1/3,1/3). Note that in this case there are 3 different types of functions.

Fig. 9.15: A triangle with 10 nodes for third order shape functions, labelled with their area coordinates.

Exercise 9.5: Use a similar reasoning as that used to define the second order shape functions (9.30)-(9.31) to find the third order shape functions in terms of triangle area coordinates.
APPENDIX

1. Taylor theorem

For a continuous function we have ∫ from a to x of f'(t)dt = f(x) - f(a); then, we can write:

    f(x) = f(a) + ∫ from a to x of f'(t)dt   or   f(x) = f(a) + R0(x)        (A1.1)

where the remainder is R0(x) = ∫ from a to x of f'(t)dt.

We can now integrate R0 by parts using:

    u = f'(t)      du = f⁽²⁾(t)dt
    dv = dt        v = -(x - t)

giving:

    R0(x) = ∫ from a to x of f'(t)dt = [-(x - t)f'(t)] from a to x + ∫ from a to x of (x - t)f⁽²⁾(t)dt

which gives, after solving and substituting in (A1.1):

    f(x) = f(a) + f'(a)(x - a) + ∫ from a to x of (x - t)f⁽²⁾(t)dt   or   f(x) = f(a) + f'(a)(x - a) + R1(x)        (A1.2)

We can also integrate R1 by parts, using this time:

    u = f⁽²⁾(t)       du = f⁽³⁾(t)dt
    dv = (x - t)dt    v = -(x - t)²/2

which gives:

    R1(x) = ∫ from a to x of (x - t)f⁽²⁾(t)dt = f⁽²⁾(a)(x - a)²/2 + ∫ from a to x of ((x - t)²/2) f⁽³⁾(t)dt

and again, after substituting in (A1.2):

    f(x) = f(a) + f'(a)(x - a) + f⁽²⁾(a)(x - a)²/2 + ∫ from a to x of ((x - t)²/2) f⁽³⁾(t)dt

Proceeding in this way we get the expansion:

    f(x) = f(a) + f'(a)(x - a) + f⁽²⁾(a)(x - a)²/2 + f⁽³⁾(a)(x - a)³/3! + ... + f⁽ⁿ⁾(a)(x - a)ⁿ/n! + Rn        (A1.3)

where the remainder can be written as:

    Rn(x) = ∫ from a to x of ((x - t)ⁿ/n!) f⁽ⁿ⁺¹⁾(t)dt        (A1.4)

To find a more useful form for the remainder, we need to invoke some general mathematical theorems.

First Theorem for the Mean Value of Integrals: If the function g(t) is continuous and integrable in the interval [a, x], then there exists a point ξ between a and x such that:

    ∫ from a to x of g(t)dt = g(ξ)(x - a)

This says simply that the integral can be represented by an average value g(ξ) of the function times the length of the interval. Because this average value must be between the minimum and the maximum values, and the function is continuous, there will be a point ξ for which the function takes this value.

And in a more complex form:

Second Theorem for the Mean Value of Integrals: If the functions g and h are continuous and integrable in the interval, and h does not change sign in the interval, then there exists a point ξ between a and x such that:

    ∫ from a to x of g(t)h(t)dt = g(ξ) ∫ from a to x of h(t)dt

If we now use this second theorem on expression (A1.4), with g(t) = f⁽ⁿ⁺¹⁾(t) and h(t) = (x - t)ⁿ/n!, we get:

    Rn(x) = f⁽ⁿ⁺¹⁾(ξ) ∫ from a to x of ((x - t)ⁿ/n!) dt

which can be integrated, giving:

    Rn(x) = f⁽ⁿ⁺¹⁾(ξ)(x - a)ⁿ⁺¹/(n + 1)!        (A1.5)

for the remainder of the Taylor expansion, where ξ is a point between a and x.
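The remainder form (A1.5) can be checked numerically. For f(x) = eˣ with a = 0, every derivative is eᵗ, so Rn must equal e^ξ xⁿ⁺¹/(n+1)! for some 0 < ξ < x, and hence must lie between the values obtained with ξ = 0 and ξ = x. A Python sketch (the choices x = 1, n = 5 are illustrative):

```python
import math

# Check (A1.5) for f = exp, a = 0, x = 1, n = 5:
# Rn = f(x) - sum_{m=0}^{n} f^(m)(a)(x-a)^m / m!  must lie between
# the two extreme values of e^xi * x^(n+1)/(n+1)! for xi in (0, x).
a, x, n = 0.0, 1.0, 5
partial = sum(x**m / math.factorial(m) for m in range(n + 1))
Rn = math.exp(x) - partial

lo = x**(n + 1) / math.factorial(n + 1)                  # xi = a = 0
hi = math.exp(x) * x**(n + 1) / math.factorial(n + 1)    # xi = x
print(lo, Rn, hi)
```

Increasing n makes all three numbers shrink like 1/(n+1)!, which is why the Taylor series of eˣ converges for every x.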
k+1:n)=A(k+1:n. we need to subtract from eqn.1). to remove x2 from equations 3 to N (or elements ai2 . we need to scale the new equation 2 by the factor ai2/a22 and subtract it from row i.U] = GE(A) % % Computes LU factorization of matrix A % input: matrix A % output: matrices L and U % [n.k+1:n). to eliminate the matrix elements ai1.n) returns the identity matrix of order n and tril(A.k+1:n)v(i)*A(k. % find multipliers for i=k+1:n A(i. % find multipliers A(k+1:n. % find multipliers A(k+1:n. The function eye(n. After that.k)=A(k+1:n. for k=1:n1 A(k+1:n.k+1:n). end L=eye(n.k).k)/A(k. U=triu(A). % This function simply puts zeros in the lower triangle The factorization is completed by calculating the lower triangular matrix L. Implementation of Gaussian Elimination Referring to the steps (24) . – that is. The simpler version of the code is then: for k=1:n1 v(k+1:n)=A(k+1:n.1) gives a lower triangular matrix with the elements of A in the lower triangle. i equation 1 times ai1/a11. excluding the diagonal.n) + tril(A. so the steps for triangularizing matrix A (keeping just the upper triangle) can be implemented by the following code (in Matlab): for k=1:n1 v(k+1:n)=A(k+1:n.(26) to eliminate the nonzero entries in the lower triangle of the matrix. This function can be used to find the LU factors of a matrix A using dense storage.k). i = 3 – N.k+1:n). end U=triu(A).k+1:n)v(k+1:n)*A(k. The pattern now repeats in the same form. this can be simplified eliminating the second loop. The complete procedure can be implemented as follow: function [L.k)/A(k. we can see that (in step (25)) to eliminate the unknown x1 from equation i (i = 2 – N).page A3 E763 (part 2) Numerical Methods Appendix 2.k+1:n)A(k+1:n.k+1:n)=A(i. end end In fact. i = 3 – N as in (26)).k). by noting that all operations on rows can be performed simultaneously.k)/A(k. With this.k)*A(k.n]=size(A). and . all the elements of column 1 except a11 will be zero.k+1:n)=A(k+1:n.
The matrix L holds the multipliers below the diagonal (lij = aij if j ≤ i−1), ones on the diagonal, and zeros elsewhere. The solution of a linear system of equations Ax = b is then completed (after the LU factorization of the matrix A is performed) with a forward substitution using the matrix L, followed by a backward substitution using the matrix U. Forward substitution consists of finding the first unknown x1 directly as b1/l11, substituting this value in the second equation to find x2, and so on. This can be implemented by:

    function x = LTriSol(L,b)
    %
    % Solves the triangular system Lx = b by forward substitution
    %
    n=length(b);
    x=zeros(n,1);    % a vector of zeros to start
    for j=1:n-1
       x(j)=b(j)/L(j,j);
       b(j+1:n)=b(j+1:n)-x(j)*L(j+1:n,j);
    end
    x(n)=b(n)/L(n,n);

Backward substitution can be implemented in a similar form; this time the unknowns are found from the end upwards:

    function x = UTriSol(U,b)
    %
    % Solves the triangular system Ux = b by backward substitution
    %
    n=length(b);
    x=zeros(n,1);
    for j=n:-1:2     % from n to 2, one by one
       x(j)=b(j)/U(j,j);
       b(1:j-1)=b(1:j-1)-x(j)*U(1:j-1,j);
    end
    x(1)=b(1)/U(1,1);

With these functions the solution of the system of equations Ax = b can be performed in three steps by the code:

    [L,U] = GE(A);
    y = LTriSol(L,b);
    x = UTriSol(U,y);

Exercise
Use the functions GE, LTriSol and UTriSol to solve the system of equations generated by the finite difference modelling of the square coaxial structure. You will have to complete first the matrix A, given only schematically in (7.14). Note that because of the geometry of the structure, after applying the boundary conditions, not all rows will have the same pattern. To input the matrix it might be useful to start with the Matlab command:

    A = triu(tril(ones(n,n),1),-1) - 5*eye(n,n)
This will generate a tridiagonal matrix of order n with -4 in the main diagonal and 1 in the two subdiagonals. After that, you will have to adjust any differences between this matrix and A. Compare the results with those obtained by Gauss-Seidel.
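The whole pipeline suggested by the exercise (build the starter matrix, factorize, then forward and backward substitution) can be sketched in Python/NumPy as follows; the function names `ltrisol`/`utrisol` simply mirror the Matlab ones, n = 5 is an illustrative size, and no pivoting is performed:

```python
import numpy as np

def ltrisol(L, b):
    """Forward substitution for Lx = b (L lower triangular)."""
    b = np.array(b, dtype=float)       # work on a copy
    n = len(b)
    x = np.zeros(n)
    for j in range(n - 1):
        x[j] = b[j] / L[j, j]
        b[j+1:] -= x[j] * L[j+1:, j]   # remove x[j] from the remaining equations
    x[-1] = b[-1] / L[-1, -1]
    return x

def utrisol(U, b):
    """Backward substitution for Ux = b (U upper triangular)."""
    b = np.array(b, dtype=float)
    n = len(b)
    x = np.zeros(n)
    for j in range(n - 1, 0, -1):
        x[j] = b[j] / U[j, j]
        b[:j] -= x[j] * U[:j, j]       # remove x[j] from the equations above
    x[0] = b[0] / U[0, 0]
    return x

n = 5
# starter matrix, as in the Matlab one-liner: tridiagonal with -4 on the
# main diagonal and 1 on the two subdiagonals
A = np.triu(np.tril(np.ones((n, n)), 1), -1) - 5 * np.eye(n)

# LU factorization without pivoting (same elimination as the GE function)
M = A.copy()
for k in range(n - 1):
    M[k+1:, k] /= M[k, k]
    M[k+1:, k+1:] -= np.outer(M[k+1:, k], M[k, k+1:])
L, U = np.eye(n) + np.tril(M, -1), np.triu(M)

b = np.ones(n)
x = utrisol(U, ltrisol(L, b))
print(np.allclose(A @ x, b))  # True
```

Note that each substitution routine works on a copy of b, since the right-hand side is modified as the eliminated unknowns are substituted back.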
3. Solution of Finite Difference Equations by Gauss-Seidel

Equation (7.14) is the matrix equation derived by applying the finite differences method to the Laplace equation. Note that this applies to any potential problem, including potential (voltage) distribution, steady state current flow or heat flow. Solving this system of equations by the method of Gauss-Seidel, as shown in (5.23) and (5.24), consists of taking the original simultaneous equations, putting all diagonal matrix terms on the left-hand side as in (5.24), and effectively putting them in a DO loop (after initializing the 'starting vector'). In this case the elements of A are all either 0, +1 or -4, and are easily generated during the algorithm rather than actually stored in an array; none of them actually need to be stored. Instead of A, the only array needed holds the current value of the vector elements: xT = (φ1, φ2, ..., φ56). Equation (7.9) is the typical equation; putting the diagonal term on the left-hand side gives:

$$\varphi_O = \tfrac{1}{4}\left(\varphi_N + \varphi_S + \varphi_E + \varphi_W\right)$$

We could write 56 lines of code (one per equation) or, even simpler, use subscripted variables inside a loop. The program can be simplified further by keeping the vector of 56 unknowns x in a 2-D array z(11,11), to be identified spatially with the 2-D Cartesian coordinates of the physical problem (see figure). For example, z(3,2) stores the value of φ10. There is obviously scope to improve efficiency, since with this arrangement we store values corresponding to nodes with known, fixed voltage values, including all those nodes inside the inner conductor, but doing so makes the program simpler. With this, the program (in old Fortran 77) can be as simple as:

    c     Gauss-Seidel to solve Laplace between square
    c     inner & outer conductors
          dimension z(11,11)
          data z/121*0./
    c     set the nodes of the inner conductor to 1 V
          do i=4,8
            do j=4,8
              z(i,j)=1.
            enddo
          enddo
    c     iterate (to a maximum of 30 iterations)
          do n=1,30
            do i=2,10
              do j=2,10
                if(z(i,j).lt.1.) z(i,j)=.25*(z(i-1,j)+z(i+1,j)+
         $        z(i,j-1)+z(i,j+1))
              enddo
            enddo
    c       in each iteration print values in the 3rd row
            write(*,6) n,(z(3,j),j=1,11)
          enddo
          write(*,7) z
        6 format(1x,i2,11f7.4)
        7 format('FINAL RESULTS=',//(/1x,11f7.4))
          stop
          end

In the program, first z(i,j) is initialized with zeros, then the values corresponding to the inner conductor are set to one (1 V). After this, the iterations start (to a maximum of 30) and the Gauss-Seidel equations are solved. In order to check the convergence, the values of potentials in one intermediate row (the third) are printed after every iteration.

[The full printout, one line of 11 potentials per iteration followed by the final 11 x 11 grid, is abridged here.] We can see that after 19 iterations there are no more changes (within 4 decimals); the third row settles at:

    0.0000 0.1814 0.3734 0.5653 0.6263 0.6406 0.6263 0.5653 0.3734 0.1814 0.0000

and the FINAL RESULTS block shows the full 11 x 11 grid of potentials, with 1.0000 at the inner-conductor nodes and 0.0000 on the outer boundary. Naturally, a more efficient monitoring of convergence can be implemented, whereby the changes are monitored, either on a point-by-point basis or as the norm of the difference, and the iterations are stopped when this value is within a prefixed precision.
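The same iteration, with the norm-of-the-change stopping test suggested above, can be sketched in Python (an illustrative transcription of the Fortran program, not part of the original notes; the grid indices are translated from the Fortran 1-based convention and the 10⁻⁴ tolerance is chosen to match the 4-decimal printout):

```python
import numpy as np

# Python sketch of the Fortran Gauss-Seidel program above, with the
# maximum-change stopping test instead of a fixed iteration count.
z = np.zeros((11, 11))
z[3:8, 3:8] = 1.0        # inner conductor at 1 V (Fortran indices 4..8)
fixed = z == 1.0         # nodes with prescribed potential

for it in range(1, 31):  # at most 30 iterations, as in the program
    old = z.copy()
    for i in range(1, 10):
        for j in range(1, 10):
            if not fixed[i, j]:
                z[i, j] = 0.25 * (z[i-1, j] + z[i+1, j] + z[i, j-1] + z[i, j+1])
    if np.max(np.abs(z - old)) < 1e-4:   # change below the prefixed precision
        break

print(it, np.round(z[2], 4))  # iterations used and the potentials in the 3rd row
```

Because updated values are used as soon as they are available within a sweep, this is Gauss-Seidel rather than Jacobi iteration; the third row converges to the same values printed by the Fortran program.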
4. A variational formulation for the wave equation

Equation (8.20) in the example on the use of the variational method shows a variational formulation for the one-dimensional wave equation. In the following we will prove that indeed both problems are equivalent. Equation (8.20) stated:

$$k^2 = \mathrm{s.v.}\ \frac{\displaystyle\int_a^b \left(\frac{dy}{dx}\right)^2 dx}{\displaystyle\int_a^b y^2\,dx} \qquad (A4.1)$$

Let's examine the variation of k² produced by a small perturbation δy on the solution y:

$$k^2(y+\delta y) = k^2 + \delta k^2 = \frac{\displaystyle\int_a^b \left(\frac{d(y+\delta y)}{dx}\right)^2 dx}{\displaystyle\int_a^b (y+\delta y)^2\,dx}$$

And rewriting:

$$(k^2 + \delta k^2)\int_a^b (y+\delta y)^2\,dx = \int_a^b \left(\frac{d(y+\delta y)}{dx}\right)^2 dx \qquad (A4.2)$$

Expanding and neglecting higher order variations:

$$(k^2 + \delta k^2)\int_a^b \left(y^2 + 2y\,\delta y\right) dx = \int_a^b \left[\left(\frac{dy}{dx}\right)^2 + 2\,\frac{dy}{dx}\frac{d\delta y}{dx}\right] dx$$

or

$$k^2\int_a^b y^2\,dx + 2k^2\int_a^b y\,\delta y\,dx + \delta k^2\int_a^b y^2\,dx = \int_a^b \left(\frac{dy}{dx}\right)^2 dx + 2\int_a^b \frac{dy}{dx}\frac{d\delta y}{dx}\,dx \qquad (A4.3)$$

and now, using (A4.1):

$$2k^2\int_a^b y\,\delta y\,dx + \delta k^2\int_a^b y^2\,dx = 2\int_a^b \frac{dy}{dx}\frac{d\delta y}{dx}\,dx \qquad (A4.4)$$

Now, since we want k² to be stationary about the solution function y, we make δk² = 0 and examine what conditions this imposes on the function y:

$$k^2\int_a^b y\,\delta y\,dx = \int_a^b \frac{dy}{dx}\frac{d\delta y}{dx}\,dx \qquad (A4.5)$$

Integrating the RHS by parts:

$$k^2\int_a^b y\,\delta y\,dx = \left.\frac{dy}{dx}\,\delta y\right|_a^b - \int_a^b \delta y\,\frac{d^2 y}{dx^2}\,dx$$

Or, rearranging:

$$\int_a^b \delta y\left(\frac{d^2 y}{dx^2} + k^2 y\right) dx = \left.\delta y\,\frac{dy}{dx}\right|_a^b \qquad (A4.6)$$

Since δy is arbitrary, (A4.6) can only be valid if both sides are zero. That means that y should satisfy the differential equation:

$$\frac{d^2 y}{dx^2} + k^2 y = 0 \qquad (A4.7)$$

which is the wave equation, and any of the boundary conditions:

$$\frac{dy}{dx} = 0 \ \text{at } a \text{ and } b \qquad \text{or} \qquad \delta y = 0 \ \text{at } a \text{ and } b \ \text{(fixed values of } y \text{ at the ends)} \qquad (A4.8)$$

Summarizing, we can see that imposing the condition of stationarity of (A4.1) with respect to small variations of the function y leads to y satisfying the differential equation (A4.7), which is the wave equation, and any of the boundary conditions (A4.8): that is, either fixed values of y at the ends (Dirichlet B.C.) or zero normal derivative (Neumann B.C.).
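The stationarity result can be illustrated numerically: for a function that satisfies (A4.7) with Dirichlet ends, the quotient (A4.1) should return k². The Python sketch below (not part of the original notes; y = sin(πx) on [0, 1] is an illustrative choice, for which k = π) evaluates both integrals with the midpoint rule:

```python
import math

# Check of (A4.1): y = sin(pi x) on [0, 1] satisfies y'' + k^2 y = 0
# with y = 0 at both ends, so the quotient should give k^2 = pi^2.
N = 10000
h = 1.0 / N
num = 0.0   # integral of (dy/dx)^2
den = 0.0   # integral of y^2
for i in range(N):
    t = (i + 0.5) * h                               # midpoint rule
    num += (math.pi * math.cos(math.pi * t))**2 * h
    den += math.sin(math.pi * t)**2 * h
print(num / den)  # close to pi^2 = 9.8696...
```

Perturbing y slightly changes the quotient only to second order in the perturbation, which is precisely the stationarity proved above.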
5. Area of a Triangle

For a triangle with nodes 1, 2 and 3 with coordinates (x1, y1), (x2, y2) and (x3, y3):

[Figure: the triangle inscribed in the rectangle spanned by its nodes, with the three corner triangles outside it labelled A, B and C.]

The area of the triangle is:

A = area of rectangle - area(A) - area(B) - area(C)

with:

$$\text{Area of rectangle} = (x_2 - x_1)(y_3 - y_2) = x_2y_3 + x_1y_2 - x_1y_3 - x_2y_2$$
$$\text{Area(A)} = \tfrac{1}{2}(x_2 - x_3)(y_3 - y_2) = \tfrac{1}{2}\left(x_2y_3 + x_3y_2 - x_2y_2 - x_3y_3\right)$$
$$\text{Area(B)} = \tfrac{1}{2}(x_3 - x_1)(y_3 - y_1) = \tfrac{1}{2}\left(x_3y_3 - x_3y_1 - x_1y_3 + x_1y_1\right)$$
$$\text{Area(C)} = \tfrac{1}{2}(x_2 - x_1)(y_1 - y_2) = \tfrac{1}{2}\left(x_2y_1 - x_2y_2 - x_1y_1 + x_1y_2\right)$$

Then, the area of the triangle is:

$$A = \tfrac{1}{2}\left[(x_2y_3 - x_3y_2) + (x_3y_1 - x_1y_3) + (x_1y_2 - x_2y_1)\right] \qquad (A5.1)$$

which can be written as:

$$A = \frac{1}{2}\det\begin{pmatrix} 1 & x_1 & y_1 \\ 1 & x_2 & y_2 \\ 1 & x_3 & y_3 \end{pmatrix} \qquad (A5.2)$$
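Formula (A5.1) transcribes directly into code; the short Python sketch below (an illustration, not part of the original notes) evaluates it for a 3-4-5 right triangle:

```python
# Direct transcription of the area formula (A5.1) / (A5.2).
def tri_area(p1, p2, p3):
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    return 0.5 * ((x2*y3 - x3*y2) + (x3*y1 - x1*y3) + (x1*y2 - x2*y1))

print(tri_area((0, 0), (4, 0), (0, 3)))  # right triangle with legs 4 and 3: 6.0
```

A point worth noting is that the determinant gives a signed area: it is positive when the nodes are numbered anticlockwise and negative when they are numbered clockwise.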
6. Shape Functions (Interpolation Functions)

The function u is approximated in each triangle by a first order function (a plane). This will be given by an expression of the form p + qx + ry, which can also be written as a vector product (equation 9.14):

$$u(x,y) \approx \tilde u(x,y) = p + qx + ry = \begin{pmatrix}1 & x & y\end{pmatrix}\begin{pmatrix} p \\ q \\ r \end{pmatrix} \qquad (A6.1)$$

Evaluating this expression at each node of a triangle (with nodes numbered 1, 2 and 3):

$$u_1 = p + qx_1 + ry_1$$
$$u_2 = p + qx_2 + ry_2$$
$$u_3 = p + qx_3 + ry_3$$

$$\Rightarrow\quad \begin{pmatrix} u_1 \\ u_2 \\ u_3 \end{pmatrix} = \begin{pmatrix} 1 & x_1 & y_1 \\ 1 & x_2 & y_2 \\ 1 & x_3 & y_3 \end{pmatrix}\begin{pmatrix} p \\ q \\ r \end{pmatrix} \qquad (A6.2)$$

And from here we can calculate the value of the constants p, q and r in terms of the nodal values and the coordinates of the nodes:

$$\begin{pmatrix} p \\ q \\ r \end{pmatrix} = \begin{pmatrix} 1 & x_1 & y_1 \\ 1 & x_2 & y_2 \\ 1 & x_3 & y_3 \end{pmatrix}^{-1}\begin{pmatrix} u_1 \\ u_2 \\ u_3 \end{pmatrix} \qquad (A6.3)$$

Replacing (A6.3) in (A6.1):

$$\tilde u(x,y) = \begin{pmatrix}1 & x & y\end{pmatrix}\begin{pmatrix} 1 & x_1 & y_1 \\ 1 & x_2 & y_2 \\ 1 & x_3 & y_3 \end{pmatrix}^{-1}\begin{pmatrix} u_1 \\ u_2 \\ u_3 \end{pmatrix} \qquad (A6.4)$$

and comparing with (9.15):

$$\tilde u(x,y) = u_1 N_1(x,y) + u_2 N_2(x,y) + u_3 N_3(x,y) = \begin{pmatrix}N_1 & N_2 & N_3\end{pmatrix}\begin{pmatrix} u_1 \\ u_2 \\ u_3 \end{pmatrix} \qquad (A6.5)$$

we have finally:

$$\begin{pmatrix}N_1 & N_2 & N_3\end{pmatrix} = \begin{pmatrix}1 & x & y\end{pmatrix}\begin{pmatrix} 1 & x_1 & y_1 \\ 1 & x_2 & y_2 \\ 1 & x_3 & y_3 \end{pmatrix}^{-1} \qquad (A6.6)$$

Solving the right hand side (inverting the matrix and multiplying) gives the expression for each shape function (9.16):

$$N_i(x,y) = \frac{1}{2A}\left(a_i + b_i x + c_i y\right) \qquad (A6.7)$$

where A is the area of the triangle and:

$$\begin{array}{lll}
a_1 = x_2y_3 - x_3y_2 \qquad & b_1 = y_2 - y_3 \qquad & c_1 = x_3 - x_2 \\
a_2 = x_3y_1 - x_1y_3 & b_2 = y_3 - y_1 & c_2 = x_1 - x_3 \\
a_3 = x_1y_2 - x_2y_1 & b_3 = y_1 - y_2 & c_3 = x_2 - x_1
\end{array} \qquad (A6.8)$$

Note that from (A5.1), the area of the triangle can be written as:

$$A = \tfrac{1}{2}\left(a_1 + a_2 + a_3\right) \qquad (A6.9)$$
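The defining property of the shape functions is that Nᵢ equals 1 at node i and 0 at the other two nodes, which is what makes (A6.5) an interpolation through the nodal values. The Python sketch below (an illustration, not part of the original notes; the triangle with nodes (0,0), (2,0), (0,1) is an arbitrary example) evaluates the coefficients (A6.8) and checks this property:

```python
# Coefficients (A6.8) for an example triangle with nodes (0,0), (2,0), (0,1),
# checking the interpolation property N_i = 1 at node i and 0 at the others.
x = [0.0, 2.0, 0.0]
y = [0.0, 0.0, 1.0]
a = [x[1]*y[2] - x[2]*y[1], x[2]*y[0] - x[0]*y[2], x[0]*y[1] - x[1]*y[0]]
b = [y[1] - y[2], y[2] - y[0], y[0] - y[1]]
c = [x[2] - x[1], x[0] - x[2], x[1] - x[0]]
A = 0.5 * (a[0] + a[1] + a[2])            # area of the triangle, from (A6.9)

def N(i, px, py):
    """Shape function (A6.7): N_i = (a_i + b_i x + c_i y) / (2A)."""
    return (a[i] + b[i]*px + c[i]*py) / (2.0 * A)

for i in range(3):
    print([N(i, x[j], y[j]) for j in range(3)])  # row i: 1.0 at node i, 0.0 elsewhere
```

The same property also guarantees that the three shape functions sum to 1 everywhere on the triangle, so constant fields are represented exactly.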