
INTRODUCTION TO NUMERICAL ANALYSIS

WITH C PROGRAMS

Attila Máté
Brooklyn College of the City University of New York
July 2004
Last Revised: August 2008

© 2004, 2008 by Attila Máté
Typeset by AMS-TEX



CONTENTS

Preface  v
1. Floating point numbers  1
2. Absolute and relative errors  8
3. Roundoff and truncation errors  10
4. The remainder term in Taylor's formula  10
5. Lagrange interpolation  13
6. Newton interpolation  18
7. Hermite interpolation  22
8. The error of the Newton interpolation polynomial  24
9. Finite differences  27
10. Newton interpolation with equidistant points  29
11. The Lagrange form of Hermite interpolation  31
12. Solution of nonlinear equations  35
13. Programs for solving nonlinear equations  37
14. Newton's method for polynomial equations  45
15. Fixed-point iteration  52
16. Aitken's acceleration  56
17. The speed of convergence of Newton's method  60
18. Numerical differentiation of tables  61
19. Numerical differentiation of functions  63
20. Simple numerical integration formulas  67
21. Composite numerical integration formulas  70
22. Adaptive integration  75
23. Adaptive integration with Simpson's rule  79
24. The Euler-Maclaurin summation formula  83
25. Romberg integration  93
26. Integrals with singularities  99
27. Numerical solution of differential equations  102
28. Runge-Kutta methods  105
29. Predictor-Corrector methods  114
30. Recurrence equations  125
31. Numerical instability  130
32. Gaussian elimination  132
33. The Doolittle and Crout algorithms. Equilibration  149
34. Determinants  151
35. Positive definite matrices  158
36. Jacobi and Gauss-Seidel iterations  163
37. Cubic splines  166
38. Overdetermined systems of linear equations  173
39. The power method to find eigenvalues  183
40. The inverse power method  195
41. Wielandt's deflation  202
42. Similarity transformations and the QR algorithm  210
43. Spectral radius and Gauss-Seidel iteration  220
44. Orthogonal polynomials  226
45. Gauss's quadrature formula  231
46. The Chebyshev polynomials  233
47. Boundary value problems and the finite element method  238
48. The heat equation  244
Bibliography  247
List of symbols  249
Subject index  251


PREFACE

These notes started as a set of handouts to the students while teaching a course on introductory
numerical analysis in the fall of 2003 at Brooklyn College of the City University of New York. The
notes rely on my experience going back over 25 years of teaching this course. Many of the methods
are illustrated by complete C programs, including instructions on how to compile these programs in a
Linux environment. These programs can be found at

http://www.sci.brooklyn.cuny.edu/~mate/numanal/numanal_progs.tar.gz

They do run, but many of them are works in progress, and may need improvements. While the
programs are original, they benefited from studying computer implementations of numerical methods
in various sources, such as [AH], [CB], and [PTVF]. In particular, we heavily relied on the
array-handling techniques in C described, and placed in the public domain, by [PTVF]. In many cases,
there are only a limited number of ways certain numerical methods can be efficiently programmed;
nevertheless, we believe we did not violate anybody's copyright (such belief is also expressed in
[PTVF, p. xvi] by the authors about their work).
New York, New York, July 2004
Last Revised: August 6, 2008
Attila Máté


1. FLOATING POINT NUMBERS

A floating point number in binary arithmetic is a number of the form 2^e · 0.m, where e and m are integers written in binary (that is, base 2) notation. Here e is called the exponent and m is called the mantissa, and the dot in 0.m is the binary "decimal" point; that is, if the digits of m are m1, . . . , mk (mi = 0 or 1), then 0.m = m1·2⁻¹ + m2·2⁻² + · · · + mk·2⁻ᵏ. In the standard form of a floating point number it is assumed that m1 = 1 unless all digits of m are 0; a nonzero number in nonstandard form can usually be brought into standard form by lowering the exponent and shifting the mantissa (the only time this cannot be done is when this would bring the exponent e below its lower limit — see next). In the IEEE standard,¹ the mantissa m of a (single precision) floating point number is 23 bits. To store the exponent, 8 bits are needed; for the exponent e we have −126 ≤ e ≤ 127. One more bit is needed to store the sign, so altogether 32 bits are needed to store a single precision floating point number. Thus a single precision floating point number roughly corresponds to a decimal number having seven significant digits. Many programming languages define double precision numbers and even long double precision numbers.

As a general rule, when an arithmetic operation is performed, the number of significant digits of the result is the same as that of the operands (assuming that both operands have the same number of significant digits). An important exception to this rule is when two numbers of about the same magnitude are subtracted. For example, if x = .7235523 and y = .7235291, both having seven significant digits, then the difference x − y = .0000232 has only three significant digits. This phenomenon is called loss of precision. Whenever possible, one must avoid this kind of loss of precision; when evaluating an algebraic expression, this is often possible by rewriting the expression. For example, when solving the quadratic equation x² − 300x + 1 = 0, the quadratic formula gives two solutions:

   x = (300 + √89996)/2   and   x = (300 − √89996)/2.

There is no problem with calculating the first root, but with the second root there is clearly a loss of significance, since √89996 ≈ 299.993. It is easy to overcome this problem in the present case: the second root can be calculated as

   x = (300 − √89996)/2 · (300 + √89996)/(300 + √89996) = 2/(300 + √89996) ≈ .003,333,370,37,

and the loss of significance is avoided.

1 IEEE is short for Institute of Electrical and Electronics Engineers.
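To see the cancellation on an actual machine, the short program below — a minimal sketch, not part of the original notes (the file and variable names are ours) — evaluates the smaller root both ways in single precision; compile it with gcc, linking the mathematics library with -lm.

   #include <stdio.h>
   #include <math.h>

   int main(void)
   {
      float b = 300.0f;                 /* solving x^2 - 300x + 1 = 0 */
      float d = sqrtf(b*b - 4.0f);      /* sqrt(89996), about 299.993 */
      float naive = (b - d) / 2.0f;     /* subtracts nearly equal numbers */
      float rewritten = 2.0f / (b + d); /* algebraically equal, no cancellation */
      printf("naive:     %.10e\n", (double) naive);
      printf("rewritten: %.10e\n", (double) rewritten);
      return 0;
   }

The naive form retains only a few correct digits in single precision, while the rewritten form is accurate to nearly the full seven digits.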

Problems

1. Calculate x − √(x² − 1) for x = 256,000 with 10 significant digit accuracy. Avoid the loss of significant digits.

Solution. We cannot use the expression given directly, since x and √(x² − 1) are too close, and their subtraction will result in a loss of precision. To avoid this, note that

   x − √(x² − 1) = (x − √(x² − 1)) · (x + √(x² − 1))/(x + √(x² − 1)) = 1/(x + √(x² − 1)).

To do the numerical calculation, it is easiest to first write that x = y · 10⁵, where y = 2.56. Then

   1/(x + √(x² − 1)) = 1/(y + √(y² − 10⁻¹⁰)) · 10⁻⁵ ≈ 1.953,125,000 · 10⁻⁶.

2. Calculate √(x² + 1) − x for x = 1,000,000 with 6 significant digit accuracy. Avoid the loss of significant digits.

Solution. We cannot use the expression given directly, since √(x² + 1) and x are too close, and their subtraction will result in a loss of precision. To avoid this, note that

   √(x² + 1) − x = (√(x² + 1) − x) · (√(x² + 1) + x)/(√(x² + 1) + x) = 1/(√(x² + 1) + x).

To do the numerical calculation, it is easiest to first write that x = y · 10⁶, where y = 1. Then

   1/(√(x² + 1) + x) = 1/(√(y² + 10⁻¹²) + y) · 10⁻⁶ ≈ 5.000,000 · 10⁻⁷.

3. Show how to avoid the loss of significance when solving the equation x² − 1000x − 1 = 0.

4. Calculate sin 0.245735 − sin 0.245712 with 10 significant digit accuracy. (Note that angles are measured in radians here.) Hint: Rather than evaluating the sines of the angles given, it may be better to use the formula

   sin x − sin y = 2 cos((x + y)/2) sin((x − y)/2)

with x = 0.245735 and y = 0.245712. The sine of a small angle may be more precisely evaluated using a Taylor series than by using the sine function on your calculator.

5. Evaluate e^0.0002 − 1 on your calculator. Hint: The best way to avoid the loss of precision is to use the Taylor series approximation of eˣ. Using only two terms and the Lagrange remainder term, we have

   eˣ = 1 + x + x²/2 + (x³/6) e^ξ,

where ξ is between 0 and x. With x = .0002 = 2 · 10⁻⁴ we have x³ = 8 · 10⁻¹², and, noting that e^ξ is very close to one, this formula will give accurate results up to 11 decimal places (i.e., the error is less than 5 · 10⁻¹²).
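The following sketch (ours, not from the notes) carries out the hint to Problem 4 above; it evaluates the difference of the two sines directly and through the product formula. In double precision both results agree to many digits — the identity matters when the working precision is only a little larger than the size of the cancelled digits.

   #include <stdio.h>
   #include <math.h>

   int main(void)
   {
      double x = 0.245735, y = 0.245712;
      double direct  = sin(x) - sin(y);  /* cancellation-prone subtraction */
      double product = 2.0 * cos((x + y) / 2.0) * sin((x - y) / 2.0);
      printf("direct:  %.15e\n", direct);
      printf("product: %.15e\n", product);
      return 0;
   }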

6. Find 1 − e^(−0.00003) with 10 decimal digit accuracy.

Solution. Using the calculated value of e^(−0.00003) here would lead to an unnecessary and unacceptable loss of accuracy, since this value is too close to 1. It is much better to use the Taylor series of eˣ with x = −3 · 10⁻⁵:

   1 − eˣ = 1 − Σ_{n=0}^{∞} xⁿ/n! = −x − x²/2 − x³/6 − · · · .

For x = −3 · 10⁻⁵ this becomes an alternating series:

   3 · 10⁻⁵ − (9 · 10⁻¹⁰)/2 + (27 · 10⁻¹⁵)/6 − · · · .

When summing finitely many terms of an alternating series, the error will be smaller than the first omitted term. Since we allow an error no larger than 5 · 10⁻¹¹, the third term here can safely be omitted. Thus,

   1 − e^(−0.00003) ≈ 3 · 10⁻⁵ − (9 · 10⁻¹⁰)/2 = .000,029,999,55.

7. Find 1 − cos 0.008 with 10 decimal digit accuracy.

Solution. Of course, 0.008 means radians here. Using the calculated value of cos 0.008 would again lead to an unacceptable loss of precision, since this value is too close to 1; using the Taylor series of cos x gives a more accurate result:

   cos x = Σ_{n=0}^{∞} (−1)ⁿ x²ⁿ/(2n)! = 1 − x²/2! + x⁴/4! − x⁶/6! + · · · .

For |x| ≤ 1 this is an alternating series, and so, when summing finitely many of its terms, the absolute value of the error will be less than that of the first omitted term. With x = 0.008, writing x = 8 · 10⁻³, we have

   x⁶/6! < 10⁻¹² · (1/720) = 10⁻¹² · 0.00139 < 1.39 · 10⁻¹⁵,

and so this term can safely be omitted. Thus, with sufficient precision we have

   1 − cos x ≈ x²/2! − x⁴/4! = 32 · 10⁻⁶ − (512/3) · 10⁻¹² ≈ 0.0000319998293.
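The computation of Problem 6 is easy to mechanize. The sketch below (ours) sums the first terms of the alternating series and compares the result with the C99 library function expm1, which computes eˣ − 1 accurately for small x; the naive form 1 − exp(x) is shown for contrast. Compile with -lm.

   #include <stdio.h>
   #include <math.h>

   /* Sum -x - x^2/2 - x^3/6 for 1 - e^x with x = -3e-5; the error of a
      truncated alternating series is below the first omitted term. */
   int main(void)
   {
      double x = -3.0e-5;
      double term = -x;                  /* first term of the series, -x */
      double sum = 0.0;
      int n;
      for (n = 1; n <= 3; n++) {
         sum += term;
         term *= x / (n + 1);            /* -x^(n+1)/(n+1)! from -x^n/n! */
      }
      printf("Taylor: %.15e\n", sum);
      printf("expm1:  %.15e\n", -expm1(x));   /* library value of 1 - e^x */
      printf("naive:  %.15e\n", 1.0 - exp(x));
      return 0;
   }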

8. Calculate

   Σ_{n=1}^{9999} 1/n².

Note. Euler proved that

   Σ_{n=1}^{∞} 1/n² = π²/6.

We have π²/6 ≈ 1.644,934,066,848,2. If we calculate the sum Σ_{n=1}^{9,999,999} 1/n² in the forward order (i.e., by adding the terms for n = 1, then n = 2, then n = 3, . . . ), we obtain 1.644,725,322,723, and if we calculate it in the backward order (i.e., by adding the terms for n = 9,999,999, then n = 9,999,998, then n = 9,999,997, . . . ), we obtain 1.644,933,938,980. The former differs from the value of π²/6 by 2.08744 · 10⁻⁴, while the latter differs from it by 1.27868 · 10⁻⁷, showing that the latter approximation is much better.

To explain this, assume that the calculation uses floating point numbers with 10 significant digits in decimal arithmetic, and consider the sum in Problem 8. The issue is that the proper order to calculate this sum is

   1/9999² + 1/9998² + 1/9997² + 1/9996² + · · · + 1/2² + 1/1²,

and not

   1/1² + 1/2² + 1/3² + 1/4² + · · · + 1/9998² + 1/9999².

When one calculates the latter sum, after the first term is added, the size of the sum is at least 1 = 10⁰, and so all digits with place value smaller than 10⁻⁹ will be lost. Now, 1/9999² = 1.000,200,030,004 · 10⁻⁸, so when one gets to this last term, only two significant digits of this term can be added. When calculating the former sum, the first term added is 1/9999², and all its ten significant digits can be taken into account. Of course, when adding ten thousand numbers, the error in these additions will accumulate; however, it makes a significant difference that the additions initially were performed with greater precision. Whenever practical, one should first add the smaller numbers, so as to reduce the accumulated error. This also means that the result of adding many floating point numbers depends on the order of the additions: floating point addition is not an associative operation.

A computer experiment. The above results were produced on a computer running Fedora Core 5 Linux, using the compiler GCC,² version gcc (GCC) 4.1.0 20060304 (Red Hat 4.1.0-3). The program addfw.c carried out the addition in the forward order:

    1  #include <stdio.h>
    2  #include <math.h>
    3
    4  main()
    5  {
    6     float sum, term;
    7     int i;
    8     sum=0.0;
    9
   10     for(i=1; i < 10000000; i++) {
   11        term = (float) i;
   12        term = term * term;
   13        term = 1/term;
   14        sum += term;
   15     }
   16     printf("The sum is %.12f\n", sum);
   17  }

2 GCC originally stood for the GNU C Compiler (version 1.0 was released in 1987); now that it can also handle many languages in addition to C, GCC stands for the GNU Compiler Collection.

Here the number at the beginning of each line is a line number, and it is not a part of the file. In line 6, float is the type of a single precision floating point number, giving about seven significant decimal digits of precision (the type double gives about 14 significant digits, and long double about 21 significant digits).³ In calculating the ith term of the series to be added, in line 11 the variable i is converted to floating point before its square is calculated in line 12; calculating the square of an integer would cause an overflow for large values of i.

The program addbw.c was used to carry out the addition in the backward order:

    1  #include <stdio.h>
    2  #include <math.h>
    3
    4  main()
    5  {
    6     float sum, term;
    7     int i;
    8     sum=0.0;
    9
   10     for(i=9999999; i >= 1; i--) {
   11        term = (float) i;
   12        term = term * term;
   13        term = 1/term;
   14        sum += term;
   15     }
   16     printf("The sum is %.12f\n", sum);
   17  }

Finally, in calculating π²/6, we used the fact that the polar angle of the point (−1, 0) in the plane is π, and we used the C function atan2l to do the calculation, in the file pi.c:

    1  #include <stdio.h>
    2  #include <math.h>
    3
    4  main()
    5  {
    6     long double pisqo6;
    7     double toprint;
    8     pisqo6 = atan2l(0.0, -1.0);
    9     pisqo6 *= pisqo6;
   10     pisqo6 /= (long double) 6.0;
   11     toprint = (double) pisqo6;
   12     /* The conversion of the long double value
   13        pisqo6 to the double value toprint is
   14        necessary, since printf apparently is
   15        unable to handle long double variables;
   16        this may be a bug */
   17     printf(" pi^2/6 is %.12f\n", toprint);
   18  }

3 These values are only rough guesses, and are not based on the technical documentation of the GCC compiler.

In this last program, the long double variable pisqo6 is used to calculate π²/6. In line 8, the function atan2l (the final "l" stands for "long," indicating that this function takes its arguments as long double floating point numbers and gives its result as a long double) computes π as the polar angle of the point (−1, 0), given by atan2(0.0, -1.0).⁴ In line 11, the long double variable pisqo6 is cast to (i.e., converted to) a double variable, since the function printf does not seem to work with long double variables.⁵

In compiling these programs without optimization, we used the following makefile:

    1  all : addfw addbw pi
    2  addfw : addfw.o
    3  	gcc -o addfw -s -O0 addfw.o -lm
    4  addfw.o : addfw.c
    5  	gcc -c -O0 addfw.c
    6  addbw : addbw.o
    7  	gcc -o addbw -s -O0 addbw.o -lm
    8  addbw.o : addbw.c
    9  	gcc -c -O0 addbw.c
   10  pi : pi.o
   11  	gcc -o pi -s -O4 pi.o -lm
   12  pi.o : pi.c
   13  	gcc -c -O4 pi.c
   14
   15  clean : addfw addbw pi
   16  	rm *.o

This makefile does the compilation of the programs after typing the command

   $ make all

Here the dollar sign $ at the beginning of the line is the prompt given by the computer to type a command; in an actual computer, the prompt is most likely different (in fact, the prompt is highly customizable, meaning that the user can easily change the prompt in a way she likes). The makefile contains a number of targets on the left-hand side of the colons in lines 1, 2, 4, 6, 8, 10, 12, and 15. These targets are often names of files to be produced (as in lines 2 and 4, etc.), but they can be abstract objectives to be accomplished (as in line 1 — to produce all files — and in line 15 — to clean up the directory by removing unneeded files). On the right-hand side of the colon there are file names or other targets needed for (producing) the target. After each line naming the target, there are zero or more lines of rules describing how to produce the target; each rule starts with a tab character (thus the rules in lines 3, 5, etc., each start with a tab character, and not with a number of spaces). When a target is invoked, it is checked whether the target is newer than the files needed to produce it (as described on the right-hand side of the colon). If the target is older than at least one of the files needed to produce it, the rule is applied to produce a new target; no action is taken if the target is already present and up to date.⁶ No rule is needed to produce the target all; what is needed for all to be produced is that the three files mentioned after the colon should be present.

4 The atan2 function is more practical for computer calculations than the arctan function of mathematics. The meaning of the atan2(y,x) function is to give the polar angle of the point (x,y) in the plane; arctan y can be calculated as atan2(y, 1.0).
5 This bug has been fixed in newer releases of GCC; see http://gcc.gnu.org/ml/gcc-bugs/2006-11/msg02440.html — the version we used was gcc 4.1.0.
6 The idea is that the tasks described in the rules should not be performed unnecessarily.

For example, the rule needed to produce the target addfw.o is the command

   $ gcc -c -O0 addfw.c

This command specifies that the file addfw.c should be compiled with optimization level 0, as specified by the -O0 option (here the first letter is a capital "oh" and the second letter is 0); this means that no optimization is to be done. The rule used to produce the target addfw is

   $ gcc -o addfw -s -O0 addfw.o -lm

Here the -o option specifies that the resulting file should be called addfw, and the -lm option specifies that the object file addfw.o (i.e., the compiled program produced from the C program addfw.c) should be linked to the mathematics library (denoted by the letter m) of the C compiler. The file addfw is an executable program, and one can run it simply by typing the command

   $ addfw

When one works on this program, after each change in one of the C program files one can type the command

   $ make all

to recompile the executable programs. Only those programs will be recompiled for which recompilation is needed, that is, only those programs for which the source file (the C program), or one of the source files, have been changed. After finishing work on the program, one can type

   $ make clean

which will remove all files of the form *.o, that is, those files whose name ends in .o (* is called a wild card character, which can be replaced with any string of characters).

If the above programs are compiled with optimization, using the makefile

    1  all : addfw addbw pi
    2  addfw : addfw.o
    3  	gcc -o addfw -s -O4 addfw.o -lm
    4  addfw.o : addfw.c
    5  	gcc -c -O4 addfw.c
    6  addbw : addbw.o
    7  	gcc -o addbw -s -O4 addbw.o -lm
    8  addbw.o : addbw.c
    9  	gcc -c -O4 addbw.c
   10  pi : pi.o
   11  	gcc -o pi -s -O4 pi.o -lm
   12  pi.o : pi.c
   13  	gcc -c -O4 pi.c
   14
   15  clean : addfw addbw pi
   16  	rm *.o

where it is now indicated by the option -O4 (minus, capital "oh," four) that optimization level 4 is being used, the result will be different: the output of the programs addfw and addbw will be identical,

   1.644,933,966,848.

This differs from the value 1.644,934,066,848 of π²/6 by about 1.00000 · 10⁻⁷. This means that the optimizing compiler notices the problem with adding the terms of the sequence in the forward order, and corrects this problem. It must be doing something more complicated than just adding the terms of the series in reverse order, since the result is somewhat better than what would be obtained by doing the latter.⁷

7 What the optimizing compiler gcc actually does is not clear to me.
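A standard further remedy for accumulated rounding error, not used in the notes, is compensated (Kahan) summation, which carries the rounding error of each addition along in a correction variable. The sketch below (ours; the variable names are conventional for this technique) applies it to the same series in forward order; it should be compiled without aggressive optimization (e.g., with -O0), since otherwise the compensation steps may be simplified away.

   #include <stdio.h>

   int main(void)
   {
      float sum = 0.0f, c = 0.0f;   /* c accumulates the lost low-order bits */
      int i;
      for (i = 1; i < 10000000; i++) {
         float term = 1.0f / ((float) i * (float) i);
         float y = term - c;        /* corrected term */
         float t = sum + y;         /* low-order digits of y may be lost here */
         c = (t - sum) - y;         /* what was actually added, minus y */
         sum = t;
      }
      printf("Kahan forward sum: %.12f\n", (double) sum);
      return 0;
   }

Even in the unfavorable forward order, this recovers nearly full single-precision accuracy.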

2. ABSOLUTE AND RELATIVE ERRORS

The absolute error when calculating a quantity x is the difference x − x̄, where x̄ is the calculated value of x. The relative error is defined as (x − x̄)/x. Since x is usually not known, one usually estimates the relative error by the quantity (x − x̄)/x̄; so if the size of the absolute error is known, in this way one can calculate also the relative error.⁸

The error in evaluating a function f(x, y) can be estimated as follows. We have

   f(x, y) − f(x̄, ȳ) ≈ (∂f(x̄, ȳ)/∂x)(x − x̄) + (∂f(x̄, ȳ)/∂y)(y − ȳ).

The right-hand side represents the approximate error in the function evaluation f(x̄, ȳ). We will apply this with f(x, y) replaced with xy and x/y, respectively. If we write Δx = x − x̄ and Δy = y − ȳ, then for multiplication we obtain

   E(xy) ≈ y Δx + x Δy;

writing E(·) for the error, this means

   E(xy)/(xy) ≈ Δx/x + Δy/y.

In other words, the relative error of multiplication is approximately the sum of the relative errors of the factors. For division, we get a similar result:

   E(x/y) ≈ (1/y) Δx − (x/y²) Δy,

so

   | E(x/y) / (x/y) | ≤ | Δx/x | + | Δy/y |;

that is, the relative error of the division is roughly⁹ less than the sum of the absolute values of the relative errors of the divisor and the dividend.

The condition of a function describes how sensitive a function evaluation is to relative errors in its argument. It is "defined" as

   max { | f(x) − f(x̄) | / | f(x) |  ÷  | x − x̄ | / | x |  :  x − x̄ is "small" } ≈ | x f′(x) / f(x) |,

where x is the fixed point where the evaluation of f is desired, and x̄ runs over values close to x. This is not really a definition (as the quotation marks around the word defined indicate), since it is not explained what small means; it is meant only as an approximate description.
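To make the approximate description concrete, the sketch below (ours, anticipating Problem 3 below) compares the estimate |x f′(x)/f(x)| for f(x) = tan x near x = 1 with a directly measured ratio of relative errors; the perturbation h = 10⁻⁶ is an arbitrary "small" choice. Compile with -lm.

   #include <stdio.h>
   #include <math.h>

   int main(void)
   {
      double x = 1.0;
      double h = 1.0e-6;                       /* a "small" perturbation of x */
      /* f(x) = tan x, f'(x) = 1/cos^2 x */
      double approx = fabs(x * (1.0 / (cos(x) * cos(x))) / tan(x));
      double ratio  = fabs((tan(x + h) - tan(x)) / tan(x)) / fabs(h / x);
      printf("condition estimate |x f'/f|: %.6f\n", approx);
      printf("measured error ratio:        %.6f\n", ratio);
      return 0;
   }

Both numbers come out close to 2.2, so a relative error in the argument of tan near 1 is roughly doubled in the result.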
Problems
1. Evaluate cos(2x + y²) for x = 2 ± 0.03 and y = 1 ± 0.07. (The arguments of the cosine are radians, not degrees.)
8 Of course, one does not usually know the exact value of the error, since to know the exact error and the calculated value is to know x itself, since x = x̄ + ε, where ε is the error.
9 The word roughly has to be here, since the equation on which this derivation is based is only an approximate equation. Thus, the sign ≤ above must not be taken literally.

Solution. Writing f(x, y) = cos(2x + y²), we estimate the error of f by its total differential

   df(x, y) = (∂f(x, y)/∂x) dx + (∂f(x, y)/∂y) dy = −2 sin(2x + y²) dx − 2y sin(2x + y²) dy,

where x = 2, y = 1, and dx = ±0.03 and dy = ±0.07.¹⁰ Hence

   |df(x, y)| ≤ |−2 sin(2x + y²)| |dx| + |−2y sin(2x + y²)| |dy| ≈ 2 · 0.959 · 0.03 + 2 · 0.959 · 0.07 = 0.192.

There is no problem with the actual calculation, since there is no serious loss of precision: with x = 2 and y = 1 we have cos(2x + y²) = cos 5 ≈ 0.284. Thus

   f(x, y) ≈ 0.284 ± 0.192.

2. Evaluate

   s = x (1 − cos(y/x))

for x = 30 ± 3 and y = 20 ± 2.

Solution. There is no problem with the actual calculation, since there is no serious loss of precision:

   30 · (1 − cos(20/30)) ≈ 6.42338.

The real question is, how accurate is this result? Writing f(x, y) = x (1 − cos(y/x)), we estimate the error of f by its total differential¹¹

   df(x, y) = (∂f(x, y)/∂x) dx + (∂f(x, y)/∂y) dy = (1 − cos(y/x) − (y/x) sin(y/x)) dx + sin(y/x) dy,

where x = 30, y = 20, and dx = ±3 and dy = ±2. Thus

   |df(x, y)| ≤ |1 − cos(y/x) − (y/x) sin(y/x)| |dx| + |sin(y/x)| |dy| ≈ 0.198 · 3 + 0.618 · 2 ≈ 1.8.

Thus f(x, y) ≈ 6.42 ± 1.8.

3. Find the condition of y = tan x near x = 1.

4. Find the condition of y = x³ near x = 2.

10 It is more natural to write Δx and Δy for the errors of x and y, but in the total differential above one customarily uses dx and dy.
11 It is more natural to write Δx and Δy for the errors of x and y, but in the total differential above one customarily uses dx and dy.
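The total-differential estimates above mechanize easily. The following sketch (ours) reproduces the numbers in the solution of Problem 1; compile with -lm.

   #include <stdio.h>
   #include <math.h>

   int main(void)
   {
      double x = 2.0, dx = 0.03;
      double y = 1.0, dy = 0.07;
      double value = cos(2.0*x + y*y);
      /* |df| <= |f_x||dx| + |f_y||dy| with f_x = -2 sin(2x+y^2),
         f_y = -2y sin(2x+y^2) */
      double s = sin(2.0*x + y*y);
      double bound = fabs(-2.0*s)*dx + fabs(-2.0*y*s)*dy;
      printf("f = %.3f +- %.3f\n", value, bound);   /* 0.284 +- 0.192 */
      return 0;
   }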

3. ROUNDOFF AND TRUNCATION ERRORS

The main types of errors are roundoff error and truncation error. The former refers to the fact that real numbers, which are usually infinite decimals, are represented by a finite number of digits in the computer. The latter refers to the error committed in a calculation when an infinite process is replaced by a finite process of calculation. A simple example of the latter is when the sum of an infinite series is evaluated by adding finitely many terms of the series; more generally, continuous processes are represented in the computer by finite, discrete processes, resulting in truncation error. A third type of error, error resulting from a mistake or blunder, is an altogether different matter: such errors ought to be avoided in so far as possible, whereas roundoff and truncation errors cannot be avoided.

4. THE REMAINDER TERM IN TAYLOR'S FORMULA

In Taylor's formula, a function f(x) is approximated by a polynomial

   Σ_{k=0}^{n} (f^{(k)}(a)/k!) (x − a)^k.

The goodness of this approximation can be measured by the remainder term Rn(x, a), defined as

   Rn(x, a) =def f(x) − Σ_{k=0}^{n} (f^{(k)}(a)/k!) (x − a)^k.

To estimate Rn(x, a), we need the following lemma.

Lemma. Let n ≥ 1 be an integer. Let U be an open interval in R and let f : U → R be a function that is n + 1 times differentiable. Given any b ∈ U, we have

   (1)   (d/dx) Rn(b, x) = − f^{(n+1)}(x) (b − x)^n / n!

for every x ∈ U.

Proof. We have

   Rn(b, x) = f(b) − Σ_{k=0}^{n} f^{(k)}(x) (b − x)^k / k! = f(b) − f(x) − Σ_{k=1}^{n} f^{(k)}(x) (b − x)^k / k!.

We separated out the term for k = 0 since we are going to use the product rule for differentiation, and the term for k = 0 involves no product. Differentiating, we obtain

   (d/dx) Rn(b, x) = −f′(x) − Σ_{k=1}^{n} ( f^{(k+1)}(x) (b − x)^k / k! − f^{(k)}(x) (b − x)^{k−1} / (k − 1)! ).

Writing A_k = f^{(k+1)}(x) (b − x)^k / k! for k with 0 ≤ k ≤ n, the sum on the right-hand side equals

   Σ_{k=1}^{n} (A_k − A_{k−1}) = (A_1 − A_0) + (A_2 − A_1) + · · · + (A_n − A_{n−1}) = A_n − A_0
      = f^{(n+1)}(x) (b − x)^n / n! − f′(x).

Substituting this in the above equation, we obtain

   (d/dx) Rn(b, x) = −f′(x) − ( f^{(n+1)}(x) (b − x)^n / n! − f′(x) ) = − f^{(n+1)}(x) (b − x)^n / n!,

as we wanted to show.

Corollary 1. Let n ≥ 1 be an integer, let U be an open interval in R, and let f : U → R be a function that is n + 1 times differentiable. For any a, b ∈ U with a ≠ b there is a ξ ∈ (a, b) (if a < b) or ξ ∈ (b, a) (if b < a) such that

   (2)   Rn(b, a) = f^{(n+1)}(ξ) (b − a)^{n+1} / (n + 1)!.

Proof. For the sake of simplicity, we will assume that a < b.¹² We have b − a ≠ 0, and so the equation

   (3)   Rn(b, a) = K (b − a)^{n+1} / (n + 1)!

can be solved for K. Let K be the real number for which this equation is satisfied, and write

   φ(x) = Rn(b, x) − K (b − x)^{n+1} / (n + 1)!.

Then φ is differentiable in U; as differentiability implies continuity, φ is continuous on the interval [a, b] and differentiable in (a, b).¹³ As φ(a) = 0 by the choice of K, and φ(b) = 0 trivially, we can use Rolle's Theorem to obtain the existence of a ξ ∈ (a, b) such that φ′(ξ) = 0. Using (1), we can see that

   0 = φ′(ξ) = − f^{(n+1)}(ξ) (b − ξ)^n / n! − K · ( −(n + 1)(b − ξ)^n / (n + 1)! ).

Noting that (n + 1)/(n + 1)! = 1/n! and keeping in mind that ξ ≠ b, we obtain K = f^{(n+1)}(ξ) from here. Thus the result follows from (3).

Note. Here ξ is some number in the interval (a, b), but this number is not known; this form of the remainder is called Lagrange's Remainder Term of the Taylor Series.

The above argument can be carried out in somewhat more generality. Let g(x) be a function that is continuous on [a, b] and differentiable on (a, b); assume, further, that g′(x) ≠ 0 for x ∈ (a, b), and g(b) = 0. Then g(a) ≠ 0: indeed, by the Mean-Value Theorem,

   −g(a)/(b − a) = (g(b) − g(a))/(b − a) = g′(η)

for some η ∈ (a, b), and g′(η) ≠ 0 by our assumptions. Instead of (3), determine K now such that

   (4)   Rn(b, a) = K g(a);

as g(a) ≠ 0, it is possible to find such a K. Write

   φ(x) = Rn(b, x) − K g(x).

We have φ(a) = 0 by the choice of K, and φ(b) = 0 since Rn(b, b) = 0 and g(b) = 0 (the latter by our assumptions). As φ is differentiable on (a, b), there is a ξ ∈ (a, b) such that φ′(ξ) = 0. By (1) we have

   0 = φ′(ξ) = − f^{(n+1)}(ξ) (b − ξ)^n / n! − K g′(ξ).

As g′(ξ) ≠ 0 by our assumptions, we can determine K from this equation. Substituting the value of K so obtained into (4), we obtain

   Rn(b, a) = − ( f^{(n+1)}(ξ) (b − ξ)^n / n! ) · g(a)/g′(ξ).

In the argument we again assumed that a < b, but this was unessential; if b < a, the same result holds, except that one needs to write (b, a) and [b, a] for the intervals mentioned (the roles of a and b should not otherwise be interchanged). Note that the function g can depend on a and b. We can restate the result just obtained in the following

Corollary 2. Let n ≥ 1 be an integer, let U be an open interval in R, and let f : U → R be a function that is n + 1 times differentiable. For any a, b ∈ U with a < b, let g_{a,b}(x) be a function that is continuous on [a, b] and differentiable on (a, b), such that g′_{a,b}(x) ≠ 0 for x ∈ (a, b) and g_{a,b}(b) = 0. Then there is a ξ ∈ (a, b) such that

   Rn(b, a) = − ( f^{(n+1)}(ξ) (b − ξ)^n / n! ) · g_{a,b}(a)/g′_{a,b}(ξ).

Taking g_{a,b}(x) = (b − x)^r for an integer r with 0 < r ≤ n + 1, we obtain

   Rn(b, a) = f^{(n+1)}(ξ) (b − ξ)^{n−r+1} (b − a)^r / (r · n!).

This is called the Roche–Schlömilch Remainder Term of the Taylor Series; it is important to realize that the value of ξ here depends on r. Taking r = n + 1, we get formula (2), Lagrange's Remainder Term. Taking r = 1, we obtain

   Rn(b, a) = f^{(n+1)}(ξ) (b − ξ)^n (b − a) / n!;

this is called Cauchy's Remainder Term of the Taylor Series. This result is given in [B], without attributions; see also [M]. The Wikipedia entry [Wiki] also discusses the result (especially under the subheading Mean value theorem).

The different forms of the remainder term of the Taylor series are useful in different situations. For example, Lagrange's Remainder Term is convenient for establishing the convergence of the Taylor series of eˣ, sin x, and cos x on the whole real line, but it is not suited to establishing the convergence of the Taylor series of (1 + x)^α, where α is an arbitrary real number. The Taylor series of this last function is convergent on the interval (−1, 1), and on this interval it does converge to the function (1 + x)^α (this series is called the Binomial Series); this can be established by using Cauchy's Remainder Term.

12 This assumption is never really used except that it helps us avoid circumlocutions such as "ξ ∈ (a, b) if a < b, or ξ ∈ (b, a) if b < a."
13 Naturally, φ is differentiable also at a and b, but this is not needed for the rest of the argument.

5. LAGRANGE INTERPOLATION

Let x1, x2, . . . , xn be distinct real numbers, and let y1, y2, . . . , yn also be real numbers, these not necessarily distinct. The task of polynomial interpolation is to find a polynomial P(x) such that P(xi) = yi for i with 1 ≤ i ≤ n. It is not too hard to prove that there is a unique polynomial of degree n − 1 satisfying these conditions, while a polynomial of lower degree usually cannot be found. The polynomial interpolation problem was first solved by Newton; a different, elegant solution was later found by Lagrange. Here we consider the latter solution. Lagrange considers the polynomials¹⁴

   li(x) = Π_{j=1, j≠i}^{n} (x − xj)/(xi − xj).

It is clear that li(x) is a polynomial of degree n − 1, since the numbers in the denominator do not depend on x. Further, for any integer j with 1 ≤ j ≤ n we have

   li(xj) = 1 if j = i,   and   li(xj) = 0 if j ≠ i.

Indeed, if x = xi then each of the fractions in the product expressing li(x) is 1, and if x = xj with j ≠ i then one of the fractions in this product has a zero numerator. Hence the polynomial P(x) defined as

   P(x) = Σ_{i=1}^{n} yi li(x)

satisfies the requirements, that is, P(xi) = yi for i with 1 ≤ i ≤ n.

Example. Find a polynomial P(x) of degree 3 such that P(−1) = 7, P(2) = 3, P(4) = −2, and P(6) = 8.

Solution. Writing x1 = −1, x2 = 2, x3 = 4, and x4 = 6, we have

   l1(x) = (x − 2)(x − 4)(x − 6)/((−1 − 2)(−1 − 4)(−1 − 6)) = −(1/105)(x − 2)(x − 4)(x − 6),
   l2(x) = (x + 1)(x − 4)(x − 6)/((2 + 1)(2 − 4)(2 − 6)) = (1/24)(x + 1)(x − 4)(x − 6),
   l3(x) = (x + 1)(x − 2)(x − 6)/((4 + 1)(4 − 2)(4 − 6)) = −(1/20)(x + 1)(x − 2)(x − 6),
   l4(x) = (x + 1)(x − 2)(x − 4)/((6 + 1)(6 − 2)(6 − 4)) = (1/56)(x + 1)(x − 2)(x − 4).

Thus, the polynomial P(x) can be written as

   P(x) = 7 l1(x) + 3 l2(x) − 2 l3(x) + 8 l4(x)
        = −(7/105)(x − 2)(x − 4)(x − 6) + (3/24)(x + 1)(x − 4)(x − 6)
          + (2/20)(x + 1)(x − 2)(x − 6) + (8/56)(x + 1)(x − 2)(x − 4)
        = −(1/15)(x − 2)(x − 4)(x − 6) + (1/8)(x + 1)(x − 4)(x − 6)
          + (1/10)(x + 1)(x − 2)(x − 6) + (1/7)(x + 1)(x − 2)(x − 4).

14 These polynomials are sometimes called Lagrange fundamental polynomials.

Problems

1. Find the Lagrange interpolation polynomial P(x) such that P(1) = −3, P(3) = −1, P(4) = 3.

Solution. Write x1 = 1, x2 = 3, x3 = 4. We have

   l1(x) = (x − x2)(x − x3)/((x1 − x2)(x1 − x3)) = (x − 3)(x − 4)/((1 − 3)(1 − 4)) = (1/6)(x − 3)(x − 4),
   l2(x) = (x − x1)(x − x3)/((x2 − x1)(x2 − x3)) = (x − 1)(x − 4)/((3 − 1)(3 − 4)) = −(1/2)(x − 1)(x − 4),
   l3(x) = (x − x1)(x − x2)/((x3 − x1)(x3 − x2)) = (x − 1)(x − 3)/((4 − 1)(4 − 3)) = (1/3)(x − 1)(x − 3).

Thus,

   P(x) = P(1) l1(x) + P(3) l2(x) + P(4) l3(x)
        = −3 · (1/6)(x − 3)(x − 4) + (−1) · (−(1/2))(x − 1)(x − 4) + 3 · (1/3)(x − 1)(x − 3)
        = −(1/2)(x − 3)(x − 4) + (1/2)(x − 1)(x − 4) + (x − 1)(x − 3) = x² − 3x − 1.

2. Find the Lagrange interpolation polynomial P(x) such that P(−2) = 3, P(1) = 2, P(2) = 4.

Solution. Write x1 = −2, x2 = 1, x3 = 2. We have

   l1(x) = (x − 1)(x − 2)/((−2 − 1)(−2 − 2)) = (1/12)(x − 1)(x − 2),
   l2(x) = (x + 2)(x − 2)/((1 + 2)(1 − 2)) = −(1/3)(x + 2)(x − 2),
   l3(x) = (x + 2)(x − 1)/((2 + 2)(2 − 1)) = (1/4)(x + 2)(x − 1).

Thus,

   P(x) = P(−2) l1(x) + P(1) l2(x) + P(2) l3(x)
        = 3 · (1/12)(x − 1)(x − 2) + 2 · (−(1/3))(x + 2)(x − 2) + 4 · (1/4)(x + 2)(x − 1)
        = (1/4)(x − 1)(x − 2) − (2/3)(x + 2)(x − 2) + (x + 2)(x − 1).

3. Find a polynomial P(x) such that P(−2) = 3, P(−1) = 5, P(2) = 7, P(4) = 3, and P(5) = −1.

Error of the Lagrange interpolation. In describing the Lagrange interpolation polynomial, one usually writes

   p(x) = Π_{i=1}^{n} (x − xi),

since this allows one to write the polynomials li(x) in a simple way; namely,

   (1)   li(x) = p(x) / ( p′(xi)(x − xi) ).

Indeed, using the product rule for differentiation,

   p′(x) = Σ_{k=1}^{n} Π_{j=1, j≠k}^{n} (x − xj).

If one substitutes x = xi, all the terms of the sum except the one with k = i contain a zero factor, so we obtain

   (2)   p′(xi) = Π_{j=1, j≠i}^{n} (xi − xj).

Substituting this into our original definition of li(x), formula (1) follows. This formula, as stated, cannot be used at x = xi, but the fraction on the right-hand side can clearly be reduced by dividing both the numerator and the denominator by x − xi, after which the formula will be correct even for x = xi.

The interpolation problem is usually used to approximate a function f(x) by a polynomial P(x) of degree n − 1 such that P(xi) = f(xi) for i with 1 ≤ i ≤ n. An important question is how to estimate the error of this approximation at a point x. To find the error f(x) − P(x) at x, assume x is different from any of the points xi (1 ≤ i ≤ n), since for x = xi the error is clearly 0. Write

   F(t) = f(t) − P(t) − (f(x) − P(x)) p(t)/p(x).

Clearly, F(t) = 0 for t = x and for t = xi (1 ≤ i ≤ n).¹⁵ Assuming f is differentiable n times, we can see that F is also differentiable n times. Since F(t) is known to have n + 1 distinct zeros in the closed interval spanned by x and the xi's (namely x and the xi's), Rolle's theorem gives a zero of F′(t) between any two adjacent zeros of F(t); hence F′(t) must have at least n zeros in the open interval spanned by x and the xi's. Applying Rolle's theorem again, F″(t) must have at least n − 1 zeros, F‴(t) at least n − 2 zeros, and so on. Repeating this argument n times, we obtain that F⁽ⁿ⁾(t) has at least one zero in the open interval spanned by x and the xi's (1 ≤ i ≤ n). Let ξ be such a zero. Then we have

   0 = F⁽ⁿ⁾(ξ) = f⁽ⁿ⁾(ξ) − (f(x) − P(x)) · n!/p(x).

Indeed, P(t), being a polynomial of degree n − 1, has zero nth derivative, while p(t), being a polynomial of the form tⁿ + lower order terms, has nth derivative n!. Solving this for the error f(x) − P(x) of the interpolation at x, we obtain

   f(x) − P(x) = f⁽ⁿ⁾(ξ) p(x) / n!.

This is the error formula for the Lagrange interpolation. Of course, the value of ξ is not known, but it is known to lie in the interval spanned by x, x1, . . . , xn; for this reason, this formula only allows one to estimate the error, and not to calculate it exactly.

15 This means that the polynomial Q(t) = P(t) + (f(x) − P(x)) p(t)/p(x) is the interpolation polynomial for f(t) at t = x and t = xi (1 ≤ i ≤ n); that is, Q(t) interpolates f(t) at the additional point t = x in addition to the points where P(t) interpolates f(t).

Problems

4. Estimate the error of Lagrange interpolation when interpolating f(x) = √x at x = 5 when using the interpolation points x1 = 1, x2 = 3, and x3 = 4.

Solution. Noting that the third derivative of √x equals (3/8) x^{−5/2}, the error at x = 5 is

   E(5) = ((5 − 1)(5 − 3)(5 − 4)/3!) · (3/8) ξ^{−5/2} = ξ^{−5/2}/2

according to the error formula of the Lagrange interpolation, where ξ is some number in the interval spanned by x, x1, x2, and x3, i.e., in the interval (1, 5). Clearly, the right-hand side is smallest for ξ = 5 and largest for ξ = 1. Noting that 5^{5/2} = 25√5, we have

   1/(50√5) < E(5) < 1/2.

We have strict inequalities, since the values ξ = 1 and ξ = 5 are not allowed.

5. Estimate the error of interpolating ln x at x = 3 with an interpolation polynomial with base points x = 1, x = 2, x = 4, and x = 6.

Solution. We have n = 4, p(x) = (x − 1)(x − 2)(x − 4)(x − 6), and

   d⁴ ln x / dx⁴ = −6/x⁴.

So the error equals

   E = (−6/ξ⁴) · (3 − 1)(3 − 2)(3 − 4)(3 − 6)/4! = −3/(2ξ⁴).

The value of ξ is not known, but it is known that 1 < ξ < 6. Hence

   −3/2 < E < −3/(2 · 6⁴) = −1/864.

We have strict inequalities, since the values ξ = 1 and ξ = 6 are not allowed.

6. Estimate the error of Lagrange interpolation when interpolating f(x) = 1/x at x = 2 when using the interpolation points x1 = 1, x2 = 4, and x3 = 5.

Solution. Noting that the third derivative of 1/x equals −6/x⁴, for the error at x = 2 we have

   E(2) = (f‴(ξ)/3!) (2 − 1)(2 − 4)(2 − 5) = (−6/ξ⁴) · 6/6 = −6/ξ⁴

according to the error formula of the Lagrange interpolation, with some ξ between 1 and 5. Clearly, the right-hand side is smallest for ξ = 1 and largest for ξ = 5, in the interval (1, 5). Thus we have

   −6 < E(2) < −6/625.

We have strict inequalities, since the values ξ = 1 and ξ = 5 are not allowed.

7. Estimate the error of interpolating eˣ at x = 2 with an interpolation polynomial with base points x = 1, x = 3, and x = 4.

8. Estimate the error of interpolating sin x at x = 3 with an interpolation polynomial with base points x = −1, x = 1, and x = 4.

9. Assume a differentiable function f has n zeros in the interval (a, b). Explain why f′ has at least n − 1 zeros in the same interval.

Solution. The n zeros of f give rise to n − 1 intervals determined by adjacent zeros. By Rolle's Theorem, in any open interval determined by adjacent zeros of f there is at least one zero of f′, thus giving rise to n − 1 zeros of f′.

The result is also true if one takes multiplicities into account. One says that f has a k-fold zero at α if f(α) = f′(α) = . . . = f^{(k−1)}(α) = 0.¹⁶ A k-fold zero of f gives rise to a (k − 1)-fold zero of f′ at the same location. Assuming that f has n zeros, some possibly multiple, one could carefully use this fact to compute the number of zeros¹⁷ of f′, but the calculation is unnecessarily messy. It is easier to note that if one scatters a k-fold zero, i.e., replaces it by k nearby single zeros, the total count of n zeros will not change;¹⁸ call the function so changed g. The cluster of k single zeros of g will give rise to k − 1 zeros of g′, just as a k-fold zero of f gave rise to k − 1 zeros of f′, so the count of zeros of g′ will not change either. As we already know that g′ has (at least) n − 1 zeros, the same will also be true for f′.

16 This definition is motivated by the case of polynomials: if a polynomial P(x) has the form P(x) = (x − α)^k Q(x), then it is easy to see that P(x) has a k-fold zero at α in the sense of the above definition.
17 More precisely, to give a lower bound for the number of zeros.
18 One need not think up an expression for a function g that will have scattered zeros near the multiple zeros of f, or reflect how to do this analytically. One can simply redraw the graph of f near the multiple zero, and argue "geometrically."
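Lagrange's formula translates directly into C. The following sketch (our function names; not one of the programs distributed with the notes) evaluates the interpolation polynomial of the worked example above at a single point.

   #include <stdio.h>

   /* Evaluate the Lagrange interpolation polynomial with nodes x[0..n-1]
      and values y[0..n-1] at the point t. */
   double lagrange(const double *x, const double *y, int n, double t)
   {
      double p = 0.0;
      int i, j;
      for (i = 0; i < n; i++) {
         double li = 1.0;               /* build l_i(t) as a product */
         for (j = 0; j < n; j++)
            if (j != i)
               li *= (t - x[j]) / (x[i] - x[j]);
         p += y[i] * li;
      }
      return p;
   }

   int main(void)
   {
      double x[] = { -1.0, 2.0, 4.0, 6.0 };    /* the example above */
      double y[] = {  7.0, 3.0, -2.0, 8.0 };
      printf("P(4) = %f\n", lagrange(x, y, 4, 4.0));   /* should print -2 */
      return 0;
   }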

6. NEWTON INTERPOLATION

Newton, in his solution to the interpolation problem, looked for the interpolation polynomial in the form

   P(x) = A1 + A2(x − x1) + A3(x − x1)(x − x2) + · · · + An(x − x1)(x − x2) . . . (x − xn−1)
        = Σ_{i=1}^{n} Ai Π_{j=1}^{i−1} (x − xj).

As before, assume that the numbers x1, x2, . . . , xn are pairwise distinct; given a function f(x), we want to determine the coefficients Ai in such a way that P(xi) = f(xi) for each i with 1 ≤ i ≤ n. This equation with x = x1 gives A1 = f(x1). Hence

   f(x2) = P(x2) = f(x1) + A2(x2 − x1),

and so

   A2 = (f(x2) − f(x1))/(x2 − x1) = (f(x1) − f(x2))/(x1 − x2).

For arbitrary distinct real numbers t1, t2, . . . , define the divided differences recursively as

   (1)   f[t1] = f(t1),
   (2)   f[t1, t2] = (f[t1] − f[t2])/(t1 − t2),

and, in general,

   (3)   f[t1, t2, . . . , ti] = (f[t2, . . . , ti] − f[t1, . . . , ti−1])/(ti − t1)

for i > 2. If we write

   (4)   f_[t2,...,ti−1](t) = f[t, t2, . . . , ti−1],

then we have

   (5)   f[t1, t2, . . . , ti] = f_[t2,...,ti−1][ti, t1] = (f_[t2,...,ti−1][ti] − f_[t2,...,ti−1][t1])/(ti − t1).¹⁹

If we write f_[ ] = f, then this is true even for i = 2. Instead of using (1)–(3) to define divided differences, we can define them by (1), (2), (4), and (5);²⁰ we will take this point of view in what follows, in that we will consider divided differences also for functions other than f, not just the specific function f for which we are discussing the Newton interpolation polynomial. To see that using these equations for the definition implies (3), first note that, assuming the above formulas define the divided differences for an arbitrary function, we obtain, using (2), (4), and (5),

   f[t1, t2, . . . , ti] = (f[ti, t2, . . . , ti−1] − f[t1, t2, . . . , ti−1])/(ti − t1)
                        = (f[t2, . . . , ti] − f[t1, . . . , ti−1])/(ti − t1),

showing that (3) also holds.

We can show by induction that

   (6)   Ai = f[x1, x2, · · · , xi].

Indeed, assume this result is true for all i with i < n; we will show it for i = n. In view of this assumption, the polynomial interpolating f at the points xi for 1 ≤ i ≤ n − 1 is

   Pn−1(x) = Σ_{i=1}^{n−1} f[x1, · · · , xi] Π_{j=1}^{i−1} (x − xj),

and the polynomial

   P(x) = Σ_{i=1}^{n−1} f[x1, · · · , xi] Π_{j=1}^{i−1} (x − xj) + An Π_{j=1}^{n−1} (x − xj)

still interpolates f(x) at the points xi for 1 ≤ i ≤ n − 1, since the last term vanishes at these points. We need to determine An in such a way that the equation P(xn) = f(xn) also holds. Clearly,

   P(x) = An x^{n−1} + lower order terms.

For each integer k, with Q(x) = x^{k+1} we have

   (7)   Q[x, a] = (x^{k+1} − a^{k+1})/(x − a) = Σ_{j=0}^{k} x^{k−j} a^j = x^k + lower order terms

for x ≠ a. Hence, taking divided differences of P(x) m times for 0 ≤ m ≤ n, we have²¹

   P[x, x2, x3, . . . , xm] = An x^{n−m} + lower order terms.

In fact, this is true for m = 1 if we note that the left-hand side for m = 1 is P(x).²² Assume m > 1 and assume that this formula is true with m − 1 replacing m. Taking divided differences and using (7), we obtain that

   P[x, x2, . . . , xm] = P_[x2,...,xm−1][x, xm] = An x^{n−m} + lower order terms.

This establishes the assertion, and hence the formula is true for all m with 1 ≤ m ≤ n. Using this formula with m = n, we obtain that P[x, x2, x3, . . . , xn] = An. This is true for every x; with x = x1, this shows that P[x1, x2, . . . , xn] = An. Now, f[x1, x2, . . . , xn] = P[x1, x2, . . . , xn], because the left-hand side is calculated by using the values f(x1), . . . , f(xn), while the right-hand side is calculated by using the values P(x1), . . . , P(xn), and these values are in turn the same. Hence f[x1, x2, . . . , xn] = An. This completes the proof of (6) (for i = n; for i < n its validity was assumed as the induction hypothesis).

To summarize, the Newton interpolation polynomial at the points x1, x2, . . . , xn is

   P(x) = Σ_{i=1}^{n} f[x1, x2, · · · , xi] Π_{j=1}^{i−1} (x − xj).

Permuting the arguments of a divided difference. Divided differences are independent of the order in which the points are listed.

Theorem. Let x1, . . . , xk be distinct points, and let xi1, . . . , xik be a permutation of these points (i.e., a listing of the same points in a different order). Let f be a function defined at these points. Then

   f[x1, x2, . . . , xk] = f[xi1, xi2, . . . , xik];

that is, a divided difference does not change if we rearrange the points.

Proof. It is not hard to prove this by induction. But another, simpler argument is based on the observation that the leading term of the interpolating polynomial to f using the base points x1, . . . , xk is f[x1, x2, . . . , xk] x^{k−1}, while the leading term of the interpolating polynomial to f using the base points xi1, . . . , xik is f[xi1, xi2, . . . , xik] x^{k−1}. As these two interpolating polynomials are the same, the coefficients of their leading terms must agree.

Problems

1. Find the Newton interpolation polynomial P(x) of degree 3 such that P(1) = 2, P(2) = 4, P(4) = 6, P(5) = 9.

Solution. We have

   P[1, 2] = (P[2] − P[1])/(2 − 1) = (4 − 2)/(2 − 1) = 2,
   P[2, 4] = (P[4] − P[2])/(4 − 2) = (6 − 4)/(4 − 2) = 1,
   P[4, 5] = (P[5] − P[4])/(5 − 4) = (9 − 6)/(5 − 4) = 3,
   P[1, 2, 4] = (P[2, 4] − P[1, 2])/(4 − 1) = (1 − 2)/(4 − 1) = −1/3,
   P[2, 4, 5] = (P[4, 5] − P[2, 4])/(5 − 2) = (3 − 1)/(5 − 2) = 2/3,
   P[1, 2, 4, 5] = (P[2, 4, 5] − P[1, 2, 4])/(5 − 1) = (2/3 − (−1/3))/(5 − 1) = 1/4.

We can summarize these values in a divided difference table:

   x   f[.]   f[.,.]   f[.,.,.]   f[.,.,.,.]
   1    2
                2
   2    4               −1/3
                1                   1/4
   4    6                2/3
                3
   5    9

Thus,

   P(x) = P[1] + P[1, 2](x − 1) + P[1, 2, 4](x − 1)(x − 2) + P[1, 2, 4, 5](x − 1)(x − 2)(x − 4)
        = 2 + 2(x − 1) − (1/3)(x − 1)(x − 2) + (1/4)(x − 1)(x − 2)(x − 4).

Observe that we can take these points in any other order; since, as is easily seen, P[x1, . . . , xi] does not change if we permute its arguments, we can use already calculated divided differences to write the result in different ways. For example,

   P(x) = P[4] + P[4, 2](x − 4) + P[4, 2, 1](x − 4)(x − 2) + P[4, 2, 1, 5](x − 4)(x − 2)(x − 1)
        = 6 + 1 · (x − 4) − (1/3)(x − 4)(x − 2) + (1/4)(x − 4)(x − 2)(x − 1).

2. Find the Newton interpolation polynomial P(x) of degree 3 such that P(2) = 3, P(4) = 5, P(7) = 2, P(8) = 3.

19 To make sense of this last formula, note that, according to (4), f_[t2,...,ti−1] is itself a function of a single variable, so its first-order divided differences are given by (1) and (2).
20 The latter is used also for i = 1.
21 In the argument that follows, we need to assume that x is different from x2, x3, . . . , xn, but we may allow x = x1.
22 Here we take [x2, x3, . . . , xm] for m = 1 to be the empty tuple; accordingly, for m = 1 we have P_[x2,...,xm] = P_[ ] and P[x, x2, . . . , xm] = P[x] = P(x).

7. HERMITE INTERPOLATION

Hermite interpolation refers to interpolation where at some of the interpolation points not only the value of the function, but also some of its derivatives, are prescribed. In the most general formulation of the problem, assume we are given the points x1, . . . , xn, and for each i with 1 ≤ i ≤ n, let a number mi > 0 be given. Given a function f, we are looking for a polynomial P such that

   (1)   P^{(l)}(xi) = f^{(l)}(xi)   (1 ≤ i ≤ n, and 0 ≤ l < mi).

Intuitively, the number mi specifies how many values (including the value of the function itself and perhaps some of its derivatives) are prescribed at the point xi. The lowest degree of the polynomial for which these equations are solvable is one less than the number of equations; that is, the degree of P must be at most

   N = Σ_{i=1}^{n} mi − 1.

The Newton interpolation polynomial gives a fairly easy way of solving this problem. The idea is to replace each interpolation point xi by mi distinct, closely situated points, and take the limiting case when these points all approach xi. To figure out how this works, we need the following simple

Lemma. Given an interval [a, b], let f be a function that is continuous in [a, b] and has at least n derivatives in the interval (a, b). Let x0, x1, . . . , xn be n + 1 distinct points in the interval [a, b]. Then there is a number ξ ∈ (a, b) such that

   f[x0, x1, . . . , xn] = f⁽ⁿ⁾(ξ)/n!.

Proof. Let P be the Newton interpolation polynomial interpolating f at the points x0, x1, . . . , xn; that is,

   P(x) = Σ_{i=0}^{n} f[x0, · · · , xi] Π_{j=0}^{i−1} (x − xj).

Then the function f(x) − P(x) has at least n + 1 zeros in the interval [a, b]. Applying Rolle's theorem repeatedly, similarly as in the argument used to evaluate the error of the Lagrange interpolation, we find that f⁽ⁿ⁾(x) − P⁽ⁿ⁾(x) has at least one zero in (a, b). Writing ξ for such a zero,

   f⁽ⁿ⁾(ξ) − P⁽ⁿ⁾(ξ) = 0.

Here P is a polynomial of degree n; hence its nth derivative is n! times its leading coefficient, that is,

   P⁽ⁿ⁾(x) = n! f[x0, x1, . . . , xn].

That is,

   f⁽ⁿ⁾(ξ) − n! f[x0, x1, . . . , xn] = 0.

This is clearly equivalent to the equation we wanted to prove.

When x1 and xi are different, we can still use the formula

   f[x1, x2, · · · , xi] = (f[x2, . . . , xi] − f[x1, . . . , xi−1])/(xi − x1)

to define the divided difference on the left-hand side even when there are coincidences among the points x2, . . . , xi−1; in this way we can extend the definition of the divided difference to the case when several points agree. Motivated by this, we can define the n-fold divided difference f[x, x, . . . , x] (x taken n times) as

   f[x, . . . , x] = f^{(n−1)}(x)/(n − 1)!,

provided that the derivative on the right-hand side exists. One can also show that, with this extended definition of divided differences, the Lemma remains valid even if the points x0, . . . , xn are not all distinct, except that one needs to assume appropriate differentiability of f even at the endpoints a or b in case these points are repeated in the divided difference f[x0, x1, . . . , xn]. The proof is identical to the proof given above.

Taking this into account, the Newton interpolation formula allows us to write the Hermite interpolation polynomial in a simple way: use the same points as in the Newton interpolation polynomial, but repeat the point xi mi times, so as to get the agreement described in (1). We will omit a rigorous proof of this result, but a proof is not too hard to give.

Problems

1. Find the Newton–Hermite interpolation polynomial P(x) such that P(2) = 3, P′(2) = 6, P″(2) = 4, P(4) = 5, P′(4) = 7.

Solution. We have P[2] = 3 and P[4] = 5. Further,

   P[2, 2] = P′(2)/1! = 6,   P[2, 2, 2] = P″(2)/2! = 2,   P[4, 4] = P′(4)/1! = 7,

and

   P[2, 4] = (P[4] − P[2])/(4 − 2) = (5 − 3)/(4 − 2) = 1,
   P[2, 2, 4] = (P[2, 4] − P[2, 2])/(4 − 2) = (1 − 6)/(4 − 2) = −5/2,
   P[2, 4, 4] = (P[4, 4] − P[2, 4])/(4 − 2) = (7 − 1)/(4 − 2) = 3,
   P[2, 2, 2, 4] = (P[2, 2, 4] − P[2, 2, 2])/(4 − 2) = (−5/2 − 2)/(4 − 2) = −9/4,
   P[2, 2, 4, 4] = (P[2, 4, 4] − P[2, 2, 4])/(4 − 2) = (3 − (−5/2))/(4 − 2) = 11/4,
   P[2, 2, 2, 4, 4] = (P[2, 2, 4, 4] − P[2, 2, 2, 4])/(4 − 2) = (11/4 − (−9/4))/(4 − 2) = 20/8 = 5/2.

Hence we can write the Newton–Hermite interpolation polynomial as

   P(x) = P[2] + P[2, 2](x − 2) + P[2, 2, 2](x − 2)² + P[2, 2, 2, 4](x − 2)³ + P[2, 2, 2, 4, 4](x − 2)³(x − 4)
        = 3 + 6(x − 2) + 2(x − 2)² − (9/4)(x − 2)³ + (5/2)(x − 2)³(x − 4).

Note that, in evaluating the coefficients here, the order of the points does not make any difference. We can step through the points in any other order; for example,

   P(x) = P[2] + P[2, 4](x − 2) + P[2, 4, 2](x − 2)(x − 4) + P[2, 4, 2, 4](x − 2)²(x − 4)
          + P[2, 4, 2, 4, 2](x − 2)²(x − 4)²
        = 3 + (x − 2) − (5/2)(x − 2)(x − 4) + (11/4)(x − 2)²(x − 4) + (5/2)(x − 2)²(x − 4)².

2. Find the Newton–Hermite interpolation polynomial P(x) such that P(1) = 3, P′(1) = 6, P(3) = 6, P′(3) = 5, P″(3) = 8, P(4) = 5.
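The divided-difference recursion of Section 6 — which, with repeated nodes, also drives the Newton–Hermite form — can be computed in place. The sketch below (ours, not one of the programs distributed with the notes) builds the coefficients for Problem 1 of Section 6 and evaluates the Newton form by nested multiplication.

   #include <stdio.h>

   /* Turn coef[i] = f(x_i) into the Newton coefficients
      f[x1], f[x1,x2], ..., f[x1,...,xn], in place. */
   void divdiff(const double *x, double *coef, int n)
   {
      int i, j;
      for (j = 1; j < n; j++)
         for (i = n - 1; i >= j; i--)
            coef[i] = (coef[i] - coef[i-1]) / (x[i] - x[i-j]);
   }

   /* Evaluate the Newton form at t by Horner-like nesting. */
   double newton_eval(const double *x, const double *coef, int n, double t)
   {
      double p = coef[n-1];
      int i;
      for (i = n - 2; i >= 0; i--)
         p = p * (t - x[i]) + coef[i];
      return p;
   }

   int main(void)
   {
      double x[]    = { 1.0, 2.0, 4.0, 5.0 };   /* Problem 1 of Section 6 */
      double coef[] = { 2.0, 4.0, 6.0, 9.0 };
      divdiff(x, coef, 4);
      /* coef is now 2, 2, -1/3, 1/4, matching the divided difference table */
      printf("P(5) = %f\n", newton_eval(x, coef, 4, 5.0));  /* should print 9 */
      return 0;
   }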

8. THE ERROR OF THE NEWTON INTERPOLATION POLYNOMIAL

The Newton interpolation formula describes the same polynomial as the Lagrange interpolation formula, so the error term for the Newton interpolation formula should be the same. Yet the Newton interpolation formula allows us to contribute to the discussion of the error of interpolation. Given distinct base points x1, . . . , xn, the Newton interpolation polynomial P(t) on these points can be written as

   P(t) = Σ_{i=1}^{n} f[x1, · · · , xi] Π_{j=1}^{i−1} (t − xj).

If we add the point x to the interpolation points, the polynomial Q(t) interpolating f at the points x, x1, . . . , xn can be written as

   Q(t) = P(t) + f[x1, · · · , xn, x] Π_{j=1}^{n} (t − xj).

Clearly, Q(x) = f (x), since x is one of the interpolation points for Q(t). Thus, the error
E(x) = f (x) − P (x)
at x of the interpolating polynomial P can be written as
   (1)   E(x) = Q(x) − P(x) = f[x, x1, · · · , xn] Π_{j=1}^{n} (x − xj),

where, noting that the order of arguments is immaterial in the divided differences, we wrote x as
the first argument. In view of the Lemma in Section 7 on Hermite interpolation, we have
   (2)   E(x) = (f⁽ⁿ⁾(ξ)/n!) Π_{j=1}^{n} (x − xj)

for some ξ in the open interval spanned by the points x, x1 , . . . , xn , assuming that f is n times
differentiable in this interval and is continuous in the closed interval spanned by these points.
Sometimes we need to differentiate the error term in (2). Since the dependence of ξ on x is not
known, this cannot be done directly. However, formula (1) can provide a starting point. In fact, we
have
Lemma. Let n ≥ 2 be an integer, and let x1, . . . , xn be such that x1 < x2 < . . . < xn. Assume, further, that f is continuous in [x1, xn] and f^{(n+1)} exists in (x1, xn). Let x ∈ (x1, xn), and assume that x ≠ xi for 1 ≤ i ≤ n. Then

   (d/dx) f[x, x1, . . . , xn] = f^{(n+1)}(η)/(n + 1)!

for some η ∈ (x1, xn).
Remark. The assumption x1 < x2 < . . . < xn is just another way of saying that the points
x1 , . . . , xn are distinct. This assumption can be relaxed; we can still obtain the same conclusion
if we only assume that the two extreme points (say x1 and xn ) are different from the others. To
establish this, the proof needs only minor modifications; in particular, the Lemma of the section
Hermite interpolation needs to be extended to the situation when there may be coincidences among
the points (this lemma was stated before we defined divided differences for the case when there are
coincidences among the points). Similarly, the restriction that the point x is different from any of the
xi with 1 < i < n can also be dropped. On the other hand, we cannot allow that the extreme points
coincide with any other points, since in that case, just in order to define the divided differences we
need differentiability at the extreme points, and this is not assumed. At the price of requiring that
f (n+1) exist in the closed interval spanned by x, x1 , . . . , xn , we can extend the result to the case
when x does not belong to the interval spanned by x1 , . . . , xn . Only simple modifications are needed
in the proof to establish this.
Proof. We need to redo some of the material on Newton-Hermite interpolation rigorously.23
Let x, t ∈ (x1 , xn ) with x 6= t, x 6= xi , and t 6= xi for 1 ≤ i ≤ n, and let Qx,t (u) be the interpolation
polynomial that interpolates f (u) at the points x, t, and x1 , . . . xn . Write
P (u) = lim Qx,t (u).
t→x

23 In the earlier discussion, we accepted without proof that if some interpolation points coincide in the Newton
interpolation formula, then we obtain an interpolating polynomial to f that agrees with certain derivatives of f at
those multiple interpolating points. Here we need a rigorous proof of a special case of this statement.

26

Introduction to Numerical Analysis

First, we need to show that this limit exists. To this end, note that for any k with 1 ≤ k ≤ n the
limit
f [t, x1 , . . . xk ] − f [x, x1 , . . . xk ]
∂f [z, x1 , . . . , xk ]

Ck = lim f [t, x, x1 , . . . xk ] = lim
=

t→x
t→x
t−x
∂z
z=x

exists.24 Indeed, f [z, x1 , . . . , xn ] can be written as a fraction with the numerator being a polynomial
of z and f (z), with coefficients depending on x1 , . . . xn , and f (x1 ), . . . , f (xn ), and the denomimator
being a product of differences of the form z − xi and xi − xj for 1 ≤ i, j < k; since none of these
differences is zero, and since f (z) is differentiable, the existence of the (partial) derivative on the
right-hand follows.
Thus, in the Newton interpolation formula25
Qx,t (u) = f [x] + f [x, t](u − x) + (u − x)(u − t)

n
X

f [x, t, x1 , . . . , xk ]

k−1
Y
j=1

k=1

(u − xj )

the coefficients converge when t → x, and in fact, we have
P (u) = lim Qx,t (u) = f (x) + f ′ (x)(u − x) + (u − x)2
t→x

n
X

Ck

k=1

k−1
Y
j=1

(u − xj )

with the Ck ’s just defined. It is clear from here that P (x) = f (x) and P ′ (x) = f ′ (x). Furthermore,
we have Qx,t (xi ) = f (xi ) for any i with 1 ≤ i ≤ n, and so P (xi ) = f (xi ).
The rest of the argument is essentially the repetition of the proof of the Lemma in Section 7 on
Hermite interpolation. The function f (u)−P (u) has at least n+2 zeros, counting multiplicities (one
double zero at x, and zeros at x1 , . . . , xn . Therefore its n + 1st derivative must have at least one
zero, say η, in the interval (x1 , xn ), according to repeated applications of Rolle’s Theorem. When
differentiating P (u) n + 1 times, only its leading term, Cn un+1 , survives, where Cn was defined
above. Thus, we have
f (n+1) (u) − P (n+1) (u) = f (n+1) (u) − (n + 1)! Cn .
This equals zero for u = η, showing that
Cn =

1
f (n+1) (η).
(n + 1)!

This, with the definition of Cn gives the desired result.
Remark. Seemingly, the following simpler argument would have worked. We have
d
f [x, x1 , . . . , xn ] = lim f [x, t, x1 , . . . , xn ]
t→x
dx
Now, let {tm }∞
m=1 be a sequence of elements of the interval (x1 , xn ) that tends to x. We have
f [x, tm , x1 , . . . , xn ] =

1
f (n+1) (ηn )
(n + 1)!

24 We used partial derivatives with a new variable z to emphasize the point that x is kept fixed in the argument,
but we could have written
d
Ck =
f [x, x1 , . . . , xk ]
dx
equally well.
25 We often use the fact that in the divided differences, the order of the arguments is immaterial; so, while above
we used the order t, x, next we use the order x, t.

and assume x ∈ (x1 . Proof. where even the differentiability of f is not assumed. xn ) depending on m (and x. forward shift. then f (n) (ξ) is not determined by (2). xn ]. [Ral]. x2 . one needs to assume that f (n+1) exists and is continuous in the interval [x1 . then ξ depends on x. E. Using the Binomial Theorem. let xk = x0 + kh for every integer. called. We have d (n) f (n+1) (η) f (ξ) = dx n+1 for some η ∈ [x1 . . We will introduce the operators I. . . xn ). (∇f )(x) = f (x) − f (x − h). Ralston). The sequence ηm does have a convergent subsequence {mj }∞ j=1 . we will write ∆f (x) instead of (∆f )(x). Remark.g. and backward difference operators with the following meanings: (If )(x) = f (x). (Ef )(x) = f (x + h). for any integer k ≥ 0 we have ∆k+1 f (x) = ∆(∆k f (x)) = ∆k f (x + h) − ∆k f (x). (∆f )(x) = f (x + h) − f (x). xn ] and that f (n+1) exists in [x1 . Theorem (A. x1 . in turn. . . Assume f is continuous in [x1 . Even if the limit. but x is kept fixed in this argument). and ∇. Let ξ ∈ (x1 . We will drop some of these parentheses. of this subsequence belongs to (x1 . FINITE DIFFERENCES We will consider interpolation when the interpolation points are equally spaces. These operators will multiply in a natural way. ∆0 f (x) = f (x) and. This is the main reason that x 6= xi is not allowed in the Theorem. . say η. xn ) be such that formula (2) for the error of the interpolation polynomial at x is satisfied with ξ. In order to make this argument work. it is not guaranteed that lim f (n+1) (ηmj ) = f (n+1) (η) j→∞ unless one assumes that f (n+1) is continuous. ∆ = (E − I) = (E + (−1)I) = i i=0 26 See A. xn ]. .9. < xn . forward difference. 9. but the limit of this subsequence may be x1 or xn . e. The result is a direct consequence of the Lemma just proved by noting that f (n) (ξ) = n!f [x. ∆. We will write E −1 for the inverse of the operator E: (E −1 f )(x) = f (x − h). xn ]. For example. Given h > 0. one can write one can write that ∆ = E − I. Ralston.. and assume that x 6= xi for 1 ≤ i ≤ n.26 Let the points x1 < x2 < . in view of (1) and (2) above. In a natural way. xn ). Finite differences 27 for some ηm ∈ (x1 . If x = xi for some i with 1 < i < n. identity. xn ] according to the definition of ξ. we have n   X n n n n (−1)n−i E i .

We have ∆3 f (x) = (E − I)3 f (x) = (E 3 − 3E 2 + 3E − I)f (x) = f (x + 3h) − 3f (x + 2h) + 3f (x + h) − f (x).28 Introduction to Numerical Analysis Applying this to a function f we obtain n n n ∆ f = (E − I) f (x) = (E + (−1)I) f (x) = n   X n i i=0 n−i (−1) n   X n (−1)n−i f (x + ih). but it is not hard to justify the operator calculus on algebraic grounds. In justifying the Binomial Theorem above for operators. . and f (x + 3h). E f (x) = i i=0 i This calculation appears to be purely formal. As ∇ and E −1 −n f (x) = n   X n i i=1 (−1)i ∇i f (x). Hence the above formula can also be written as f (x − nh) = E −n f (x) = n   X n i=1 i (−1)i ∆i f (x − ih). the only thing one needs to notice that the operators E and I commute (because EI = IE = E). we have ∇i f (x) = ∆i E −i f (x) = ∆i f (x − ih). let A and B be two operators acting on functions on the real line. E = (I + (−1)∇) = i i=1 Applying this to f . the operators will form a noncommutative ring. In fact. f (x + h). There are other uses of the Binomial Theorem. Solution. we obtain f (x − nh) = E We have ∇ = E −1 ∆. def r[(AB)f ](x) = [A(Bf )](x). The standard arguments used to establish the Binomial Theorem shows that we have n   X n i n−i AB (A + B)n = i i=1 whenever the operators A and B commute. Hence n   X n ∆i E n = (∆ + I)n = i i=1 Applying this to a function f . Problem 3 1. Similarly. f (x + 2h). we have E = ∆ + I. E −1 = (I − ∇). For example. let r be a real number. Thus n   X n −n n (−1)i ∇i . and let f be a function on reals. This is called Newton’s forward formula. Writing def def [(A + B)f ](x) = [Af ](x) + [Bf ](x). we obtain f (x + nh) = E n f (x) = n   X n i i=1 ∆i f (x). Express ∆ f (x) as a linear combination of f (x). [(rA)f ](x) = [Af ](x). commute.

10. Newton interpolation with equidistant points.

29

10. NEWTON INTERPOLATION WITH EQUIDISTANT POINTS.
Let a point x0 be given, and let h > 0. Consider the points xt = x0 + th for a real t with
def
def
−∞ < t < ∞. Given a function f on reals, write ft = f (xt ) and ∆k ft = ∆k f (xt ). We will
be particularly be interested in integer values of t, and we will consider interpolating f on the
consecutive points xi , xi+1 , . . . , xi+n for integers n ≥ 0 and k. Denoting by P the interpolating
def

def

polynomial on these points, we will write Pt = P (xt ) and ∆k Pt = ∆k P (xt ), similarly as for f .
In order to calculate the Newton interpolation polynomial, first we show that
(1)

f [xi , xi+1 , . . . , xi+k ] =

1
∆k fi .
k!hk

for integers k ≥ 0 and i.
In fact, it is easy to show this by induction on k. For k = 0 the result is obvious:
f [xi ] = f (xi ) = fi =

1
∆0 fi ,
0!h0

since 0!=1 and ∆0 is the identity operator. Now, assume that k ≥ 1 and that (1) holds with k − 1
replacing k. Then, using the definition of divided differences, we have
f [xi+1 , . . . , xi+k ] − f [xi , xi+1 , . . . , xi+k−1 ]
xi+k − xi
1
k−1
k−1
(∆
fi+1 − ∆
fi )
1
1
(k−1)!hk−1
=
∆∆k−1 fi =
∆k fi
=
kh
k!hk
k!hk

f [xi , xi+1 , . . . , xi+k ] =

Thus (1) is established.
Using this, the Newton interpolation formula on the points x0 , x1 , . . . , xn can be written as
follows.
Pt = P (x0 + th) =

n
X

f [x0 , x1 , . . . xi ]

i−1
Y

j=0

i=0

(xt − xj ) =

i−1
n
X
1 i Y

f
(t − j)h
0
i!hi
j=0
i=0

Since h occurs in each factor of the product on the right-hand side, this produces a factor of hi ,
which can be canceled against 1/hi before this product. We define the binomial coefficient ti as 

i−1
i−1
t def Y t − j
1 Y
=
=
(t − j)
i
i−j
i! j=0
j=0

(2)

for real t and nonnegative integer i; note that this definition agrees with the customary definition of
the binomial coefficient ti if t is also an integer. With this, the above formula becomes
Pt =

n  
X
t
i=0

i

∆i f0 .

This is Newton’s forward formula. This formula uses the interpolation points x0 , x1 , . . . , xn .
If t is an integer with 0 ≤ t ≤ n, then xt is one of the interpolation points; that is, we have
ft = f (xt ) = P (xt ) = Pt . Hence
ft =

n  
X
t
i=0

i

∆i f0

if t is an integer with 0 ≤ t ≤ n

30

Introduction to Numerical Analysis

Recall that when interpolating through a sequence of points, one can take these points in any order
in the Newton interpolation formula. Assume that we step through some of the points xj in the
order xi0 , xi1 , . . . , xin . Then Newton’s interpolation formula becomes
Pt =

n
X

f [xi0 , xi1 , . . . xik ]

k−1
Y
j=0

k=0

(xt − xij ) =

n
X

f [xi0 , xi1 , . . . xik ]hk

k−1
Y
j=0

k=0

(t − ij ).

If we want to be able to use formula (1) for the divided differences, we need to make sure that the
points xi0 , xi1 , . . . , xik are consecutive for any integer k with 0 ≤ k ≤ n.27 That is, for k > 0 we
must have
ik = max(i0 , . . . ik−1 ) + 1 or ik = min(i0 , . . . ik−1 ) − 1.
We have
{i0 , . . . ik } = {j : ik − k ≤ j ≤ ik } and

{i0 , . . . ik−1 } = {j : ik − k ≤ j ≤ ik − 1}

in the former case, and
{i0 , . . . ik } = {j : ik ≤ j ≤ ik + k} and

{i0 , . . . ik−1 } = {j : ik + 1 ≤ j ≤ ik + k}

in the latter case.28 Since the value of the divided differences is independent of the order the points
are listed, it is easy to see from (1) and (2) that we have
f [xi0 , xi1 , . . . xik ]

k−1
Y
j=0

(t − ij ) = h

k  

t − ik + k
∆k fik −k ,
k

the former case, and
f [xi0 , xi1 , . . . xik ]

k−1
Y
j=0

(t − ij ) = h

k  

t − ik − 1
∆k fik ,
k

Note that these formulas work even in the case k = 0, since

Pt = 


t − ik + k


∆k fik −k

n 
k

X 

 
t − ik − 1

∆k fik
k

k=0 

u
0 

= 1 for any real u.29 Thus, we have



if ik = max(i0 , . . . ik−1 ) + 1 or k = 0, 


if ik = min(i0 , . . . ik−1 ) − 1.

.





With ik = k for each k with 0 ≤ k ≤ n this gives Newton’s forward formula above. The choice
ik = −k for each k with 0 ≤ k ≤ n gives Newton’s backward formula:
Pt = 

n 
X
t+k−1

k=0
27 Of

k

∆k f−k .

course, this requirement is vacuous for k = 0.
equalities are equalities of sets, meaning that on either side the same integers are listed, even though the
order they are listed is probably not the same.
29 The empty product in (2) is 1 by convention. Thus, in the formula next, case k = 0 could have been subsumed
in either of the alternatives given in the formula, since the binomial coefficient is 1 for k = 0 in both cases.
28 These

11. The Lagrange form of Hermite interpolation

31

The choice i0 = 0, i2m−1 = m and i2m = −m m ≥ 1, i.e., when the sequence i0 , i1 , . . . start out as
0, 1, −1, 2, −2, 3 gives the formula
Pt = f0 +    

n 
X
t+m−1
t+m−1
∆2m−1 f−m+1 +
∆2m f−m
2m − 1
2m
m=0

if the points x0 ,. . . , x2n are used to interpolate, or
Pt = f0 +

n−1
X

m=0  

    

t+n−1
t+m−1
t+m−1
2m−1
2m
∆2n−1 f−n+1

f−m+1 +
∆ f−m +
2n − 1
2m − 1
2m

if the points x0 ,. . . , x2n−1 are used to interpolate (n > 0). This is called Gauss’s forward formula.
It starts out as 
 
    

t
t
t+1
t+1
2
3
Pt = f0 +
∆f0 +
∆ f−1 +
∆ f−1 +
∆4 f−2 + . . . .
1
2
3
4
The choice i0 = 0, i2m−1 = −m and i2m = m m ≥ 1, i.e., when the sequence i0 , i1 , . . . start out as
0, −1, 1, −2, 2, −3,. . . gives the formula    

n 
X
t+m−1
t+m
2m−1
2m

f−m +
∆ f−m
Pt = f0 +
2m − 1
2m
m=0
if the points x0 ,. . . , x2n are used to interpolate, or
Pt = f0 +

n−1
X

m=0      

t+n−1
t+m−1
t+m−1
∆2n−1 f−n
∆2m−1 f−m +
∆2m f−m
2n − 1
2m − 1
2m

if the points x0 ,. . . , x2n−1 are used to interpolate (n > 0). This is called Gauss’s backward formula.
It starts out as 
      

t
t+1
t+1
t+2
2
3
Pt = f0 +
∆f−1 +
∆ f−1 +
∆ f−2 +
∆4 f−2 + . . . .
1
2
3
4

11. THE LAGRANGE FORM OF HERMITE INTERPOLATION
We consider the following special case of Hermite interpolation. Let n and r be a positive integer
with r ≤ n, and let x1 , x2 , . . . , xn be distinct points. We want to find a polynomial P (x) such
that P (x) = f (x) for i with 1 ≤ x ≤ n, and P ′ (x) = f ′ (x) for i with 1 ≤ x ≤ r. There are n + r
conditions here, and the lowest degree polynomial satisfying these condition has degree n + r − 1.
For this polynomial we have
(1)

P (x) =

n
X

f (xj )hj (x) +

j=1

where, writing
ljn (x) =

n
Y
x − xi
x
− xi
i=1 j
i6=j

r
X

¯ j (x),
f ′ (xj )h

j=1

(1 ≤ j ≤ n),

32

Introduction to Numerical Analysis

ljr (x) =

r
Y
x − xi
x
− xi
i=1 j

(1 ≤ j ≤ r),

i6=j

and
pn (x) =

n
Y

i=1

(x − xi ) and pr (x) =

r
Y

i=1

(x − xi ),

we have
(2)

hj (x) =

( 



(xj ) ljr (x)ljn (x)
(xj ) + ljn
(1 − (x − xj ) ljr

(x)
ljn (x) pprr(x
j)

if 1 ≤ j ≤ r,
if r < j ≤ n.

and
¯ j (x) = (x − xj )ljr (x)ljn (x)
h

(3)

if

1 ≤ j ≤ r.

To see this, first note that we have
(4)

hj (xi ) = 0 if i 6= j

(1 ≤ i, j ≤ n).

This is because ljn (xi ) = 0 in this case, and both expressions on the of (2) are multiplied by ljn (x).
Further, note that
(5)

hj (xj ) = 1

(1 ≤ j ≤ n).

This is immediate by (2) in case r < j ≤ n, since ljn (xj ) = 1. In case 1 ≤ j ≤ r, by (2) we have
hj (xj ) = ljr (xj )ljn (xj ) = 1 · 1 = 1.
Next, observe that
(6)

h′j (xi ) = 0

(1 ≤ i ≤ r and 1 ≤ j ≤ n).

Indeed, if r < j ≤ n, then ljn (x) = 0 and pr (x) = 0 for x = xi ; that is, each of these polynomials is
divisible by (x − xi ). Hence hj (x) is divisible by (x − xi )2 according to (2); therefore h′j (xi ) = 0 as
we wanted to show. A similar argument works in case 1 ≤ j ≤ r and i 6= j. In this case ljr (x) and
ljn (x) are each divisible by (x − xi ). Hence hj (x) is again divisible by (x − xi )2 according to (2);
therefore h′j (xi ) = 0 again.
Finally, consider the case i = j (when 1 ≤ j ≤ r). We have 



h′j (x) = −ljr
(xj ) − ljn
(xj ) ljr (x)ljn (x)




+ (1 − (x − xj )(ljr
(xj ) + ljn
(xj ))ljr
(x)ljn (x))




+ (1 − (x − xj )(ljr
(xj ) + ljn
(xj ))ljr (x)ljn
(x)).

Substituting x = xj and noting that ljr (xj ) = 1 and ljn (xj ) = 1, it follows that

thus (6) follows. 





h′j (xi ) = h′j (xj ) = −ljr
(xj ) − ljn
(xj ) + ljr
(xj ) + ljn
(xj ) = 0;

11. The Lagrange form of Hermite interpolation

33

Further, note that30
¯ ′ (xi ) = δij
h
j

(7)

(1 ≤ i, j ≤ r)

Indeed, according to (3) we have
¯ j (x) = ljr (x)ljn (x) + (x − xj )l′ (x)ljn (x) + (x − xj )ljr (x)l′ (x)
h
jr
jn
For x = xj this gives

¯ ′ (xj ) = ljr (xj )ljn (xj ) = 1 · 1 = 1.
h
j

¯ ′ (xj ) = 0, because ljr (xi ) =
On the other hand, for x = xi with i 6= j the above equality implies h
j
ljn (xi ) = 0 in this case. Thus (7) follows.
Using (1) and (4)–(7), we can easily conclude that P (xi ) = f (xi ) for i with 1 ≤ i ≤ n and
P ′ (xi ) = f ′ (xi ) for i with 1 ≤ i ≤ r. Thus P is indeed the interpolation polynomial we were looking
for.
Problems
1. Find the Hermite interpolating polynomial P (x) in the Lagrange form such that P (1) = 3,
P ′ (1) = 5, P (3) = 4, P ′ (3) = 2, and P (4) = 7.
Solution. Writing x1 = 1, x2 = 3, and x3 = 4, we have n = 3, r = 2, and
p3 (x) = (x − 1)(x − 3)(x − 4)

and p2 (x) = (x − 1)(x − 3).

First, we are going to calculate h1 (x). We have
(x − 3)(x − 4)
(x − 3)(x − 4)
=
,
(1 − 3)(1 − 4)
6

.

2x − 7 .

.

5 (x − 4) + (x − 3) .

.

l13 (1) = . ′ = =− .

6 6 .

We have (x − 1)(x − 4) (x − 1)(x − 4) =− . (3 − 1)(3 − 4) 2 . 2 ′ ′ h1 (x) = (1 − (x − 1)(l13 (1) + l12 (1))l13 (x)l12 (x)      x − 3 (x − 3)(x − 4) (4x − 1)(x − 3)2 (x − 4) 5 1 − =− . = 1 − (x − 1) − − 6 2 2 6 36 Next. we are going to calculate h2 (x). 6 l13 (x) = x=1 and l12 (x) = Thus x−3 x−3 =− 1−3 2 x=1 1 ′ and l12 (1) = − .

.

(x − 1) + (x − 4) .

.

2x − 5 .

.

1 ′ l23 (3) = − =− =− . .

.

ij called Kronecker’s delta. is defined to be 1 is i = j and 0 if i 6= j. 2 2 2 x=3 x=3 l23 (x) = 30 δ . .

(Listing the points in the natural order as x1 = 2. and P ′ (5) = 6. r = 2. Of course it is possible to modify formula (1) in an appropriate way and then take the points in the natural order. we may take n = 3. Find the Hermite interpolating polynomial P (x) in the Lagrange form such that P (2) = 5. p2 (4) (4 − 1)(4 − 3) (4 − 1)(4 − 3) 9 ¯ 1 (x) is also simple: The formula for h 3 ¯ 1 (x) = (x − 1)l12 (x)l13 (x) = (x − 1) · x − 3 · (x − 3)(x − 4) = − (x − 1)(x − 3) (x − 4) . P (4) = 3. h3 (x). x3 = 5 would not work. x2 = 5. P (5) = 4. P ′ (2) = 2. is simpler: h3 (x) = l33 (x) (x − 1)(x − 3) (x − 1)(x − 3) (x − 1)2 (x − 3)2 p2 (x) = · = . hence the calculation of. In order to use (1).) . 2 2 2 2 4 The formula for. h 1 − 3 (1 − 3)(1 − 4) 12 ¯ 2 (x): Similarly for h 2 ¯ 2 (x) = (x − 3)l22 (x)l23 (x) = (x − 3) · x − 1 · (x − 1)(x − 4) = − (x − 1) (x − 3)(x − 4) . Hint. x2 = 4.34 Introduction to Numerical Analysis and l22 (3) = x−1 x−1 = 3−1 2 and ′ l22 (3) = 1 . and x1 = 2. 2 Thus ′ ′ h2 (x) = (1 − (x − 3)(l23 (3) + l22 (3))l23 (x)l22 (x)      (x − 1)(x − 4) (x − 1)2 (x − 4) x−1 1 1 − =− = 1 − (x − 1) − + . h 3 − 1 (3 − 1)(3 − 4) 4 So 2 ¯ 1 (x) + 2h ¯ 2 (x) = −3 (4x − 1)(x − 3) (x − 4) P (x) = 3h1 (x) + 4h2 (x) + 7h3 (x) + 5h 36 (x − 1)2 (x − 4) (x − 1)2 (x − 3)2 −4 +7 4 9 (x − 1)2 (x − 3)(x − 4) (x − 1)(x − 3)3 (x − 4) −2 = −5 12 4 (4x − 1)(x − 3)2 (x − 4) (x − 1)2 (x − 3)2 =− − (x − 1)2 (x − 4) + 7 12 9 (x − 1)(x − 3)3 (x − 4) (x − 1)2 (x − 3)(x − 4) −5 − . x3 = 4. since the derivatives of P are specified at the first r points of the given n points. 12 2 2.

x2 − x1 Solving this for y = 0 and writing x = x3 for the solution. and if f (x1 )f (x3 ) > 0. The equation of the secant line is y − f (x2 ) = f (x2 ) − f (x1 ) (x − x2 ). if x3 = (x1 +x2 )/2. i. the line through the points (x1 . we obtain x2 = x1 − f (x1 ) . f (x2 ) − f (x1 ) This equation can also be written as x3 = x1 f (x2 ) − x2 f (x1 ) . There is no guarantee that all roots are found this way. x2 ). The secant method. Namely. in the former expression. Solution of nonlinear equations 35 12. because each expression contains a number of subtractions where the result is likely be small. SOLUTION OF NONLINEAR EQUATIONS Bisection.12. then there must be a solution in the interval (x3 . One would expect that this correction term in small in comparison to x2 . the equation f (x) = 0 must have at least one solution in the interval (x1 . and f (x1 )f (x3 ) < 0 then one of these solutions must be in the interval (x1 . Assume f is a continuous function. Both expressions for x3 may be affected by losses of precision. of precision occurs when two nearly equal numbers are subtracted. f ′ (x1 ) This process can be repeated to get successively better approximations to the root.e. a root can be localized in successively smaller intervals. only the correction term 32 Loss −f (x2 ) x2 − x1 f (x2 ) − f (x1 ) is affected by losses of precision. but there is no need to calculate the derivative. Solving this for y = 0. Given two approximations x1 and x2 to the root. Newton’s method.. f (x1 )) and (x2 . by the Intermediate Value Theorem. because of the loss of precision. so the loss of precision is less important than it would be if the latter formula were used. . Such a solution may be approximated by successive halving of the interval. and f (x1 )f (x2 ) < 0. the secant line. we obtain x3 = x2 − f (x2 ) x2 − x1 . the former equation is preferred in numerical calculations to the latter. However. f (x2 ) − f (x1 ) however.32 31 This is just a simple way of saying that f (x1 ) and f (x2 ) have different signs.31 Then. x2 ). The secant method usually is somewhat slower than Newton’s method. since both of the intervals (x1 . and writing x = x2 for the solution. While Newton’s method is usually considerably faster than bisection. If f is differentiable and x1 is an approximate solution of the equation f (x) = 0 then a better approximation might be obtained by drawing the tangent line to f at x1 and taking the new approximation x2 to be the point where the tangent line intersects the x axis. By repeating this halving of the interval. x2 ) may contains roots of the equation f (x) = 0. of course if f (x1 )f (x3 ) = 0 then x3 itself is a solution. The equation of the tangent line to f at x1 is y − f (x1 ) = f ′ (x)(x − x1 ). f (x2 )) is drawn. x3 ). and the place x3 where this line intersects the x axis is taken to be the next approximation. x3 ) and (x3 . its disadvantage is that the derivative of f needs to be calculated.

if f (xn−2 )f (xn ) < 0 and f (xn−1 )f (xn ) > 0. and then calculated xn+1 by using the formula (1) xn+1 = xn − f (xn ) xn − xi f (xn ) − f (xi ) (this is just the formula above for x3 with i.” write y¯n−2 = αf (xn−2 ). and n+1. The method can be speeded up by using formula (1) for the first step (even if f (x1 )f (x2 ) > 0) with 1. . f (xn )). Geometrically. Then one forms the sequence x1 . This will make the method converge relatively slowly.06667. . On the other hand. n + 1 replacing 1. 2. n. here y¯n−2 = f (xn−2 )/2. b] and for any λ with 0 < λ < 1 we have f (λx + (1 − λ)t ≤ λf (x) + (1 − λ)f (t). 3. and x2 = x1 − 33 f f (x1 ) 1 x3 − 12x1 + 8 =3+ = x1 − 1 2 ≈ 3. . and take i to be the largest integer with 1 ≤ j < i such that f (xn )f (xi ) < 0. i.e. b] if for any x. and so the above formula will use x1 instead of a later. xn have already been calculated for n ≥ 2. one takes α = 1/2. the function is called concave (in elementary calculus courses. y¯n−2 ) and (xn . respectively). . f (x1 ) < 0 < f (x2 ). f (xn )) can be written as y¯n−2 − f (xn ) (x − xn ). presumably better approximation of the root. In successive steps one uses formula (1) calculate xn+1 with i = n−1 unless we have f (xn−2 )f (xn ) < 0 and f (xn−1 )f (xn ) > 0.33 say. ′ f (x1 ) 3x1 − 12 15 is convex on an interval [a. 2.36 Introduction to Numerical Analysis When iterating the secant method. then all the successive iterates x2 . Such a function in elementary calculus courses is called concave up. and further. We have f ′ (x) = 3x2 − 12. . then one uses a modified step as follows: xn+1 is taken to be the intersection of the straight line through the points (xn−2 . y − f (xn ) = xn−2 − xn Solving this for y = 0 and writing x = xn+1 for the solution. we will have i = 1. . and x1 < x2 . x2 . An approximate solution of the equation f (x) = 0 with f (x) = x3 − 12x + 8 is x = 3. this means that the chord between x and t lies above the graph of the function. The equation of the line through the points (xn−2 . as follows. Problems 1. . Perform one step of Newton’s method to find the next approximation. lie to the left of the root. . when xn+1 is calculated by the above formula. In the simplest choice. y¯n−2 − f (xn ) As mentioned. Assuming that f is continuous. we obtain xn+1 = xn − f (xn ) xn−2 − xn . then x1 and x2 enclose a solution of the equation f (x) = 0. αf (xn−2 )) and (xn . 3 replacing i. where different choices of the parameter α give rise to different methods. x1 = 3. To get the equation for the “Illinois step. t ∈ [a. n. If instead we have the opposite inequality f (λx + (1 − λ)t ≥ λf (x) + (1 − λ)f (t). x3 . one starts with two approximations x1 and x2 to the root of f (x) = 0 such that f (x1 )f (x2 ) < 0 (which is just a short way of saying that f (x1 ) and f (x2 ) have opposite signs). one used the term “concave down. x3 . Assuming that x1 . Solution. for the Illinois method. when the chord is below the graph. That is. The trouble with this method is that if f is convex. the method so obtained is called the Illinois method. . but the term “concave up” is not used in the mathematical literature.”) . ..

x We discuss these programs in a Linux environment (their discussion in any Unix environment would be virtually identical). f (xn−1 )) and (xn .h> double funct(double x) { double value. xn+1 is the intersection (xn−2 .14792. or if f (xn−2 ). to be discussed later. Solution. The simplest way of compiling these programs is to write a makefile. Explain the Illinois modification of the secant method. and for n > 3 it is performed when f (xn−1 ) and f (xn ) have different signs. Using decimal points (as in writing 1.c. the makefile is called (surprise) makefile (this is the default name for a makefile). When solving the equation f (x) = 0. . they are line numbers that are occasionally useful in a line-by-line discussion of such a file. and further approximations x3 . In the secant step. return(value). Using Newton’s method with x0 = 3 as a starting point. The program for bisection is in the file bisect.o bisect. i.o bisect. The secant step is performed in case of n = 3 (in this case the Illinois step cannot be performed.13. value = 1. =3− 1 ′ f (x0 ) −2/3 2 x0 − 1 The actual solution is approximately 3. 3.0. Programs for solving nonlinear equations 37 2. Consider the equation f (x) = 0 with f (x) = 2 − x + ln x. In the present case. f (xn )). In the Illinois step. one starts with two approximations x1 and x2 for the solution of the equation. We have x1 = x0 − f (x0 ) 2 − x0 + ln x0 3 + 3 ln 3 −1 + ln 3 = x0 − = ≈ 3. f (xn )) with the x axis. The numbers at the beginning of these lines are not part of the file. The Illinois step is performed otherwise. . f (xn−1 ). and f (xn ) all have the same sign. f (xn−2 )/2) and (xn ... xn+1 is the intersection of the line connecting the points (xn−1 .c will contain the definition of the function f :34 1 2 3 4 5 6 7 8 #include <math. } Here pow(x. with the following content: 1 all: bisect 2 bisect : funct. since the Illinois step requires three earlier approximations). 13. x4 .o 3 gcc -o bisect -s -O4 funct.e.14619. when n > 3 and f (xn−1 ) and f (xn ) have the same sign and f (xn−2 ) has a different sign. we indicate that the number is a floating point constant. the methods discussed above are used to solve the equation f (x) = 0.0 instead of 1). are generated.y) is C’s way of writing xy . where 1 f (x) = − 2x .x).0/x-pow(2.o -lm 34 The Perl programming language was used to mark up the computer files in this section for AMS-TEX typesetting . The following file funct. find the next approximation to the solution of the equation. PROGRAMS FOR SOLVING NONLINEAR EQUATIONS In the following C programs. Solution.

c 7 gcc -c -O4 bisect. since it is a compiler and not a loader option. by typing $ bisect Here the dollar sign $ is the customary way of denoting the command line prompt of the computer. 36 The .o and bisect.o : bisect. a quirk of the make command. The option -s gives is passed to the loader36 to strip all symbols from the compiled program.h> double funct(double). loader links together the object programs funct.e. and you can set the prompt to be almost anything at will. even though in actual life the command line prompt is usually different. say. or if it predates at least one of the source files (i. Clearly.o. and we will discuss the file bisect.38 Introduction to Numerical Analysis 4 funct. One important point. the file bisect to be made depends on the files funct.c next: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 #include <stdio. that the first character of line three is a tab character (which on the screen looks like eight spaces).c. double bisect(double (*fnct)(double).o.o : funct. double xtol. but our description of it is probably correct.c has already been described. The function funct() is defined separately.h> #include <math. and the program can be run by typing its name on the command line.o. if the source files have not changed since the last creation of the target file. that is. the rule always must start with a tab character. The option -O4 is probably useless on this line. in fact. The option -O4 (the first character is “Oh” and not “zero”) specifies the optimization level. int *converged).c 6 bisect. The -o option gives the name bisect to the file produced by the compiler. main() { /* This program implements the bisection method for solving an equation funct(x)=0. and the mathematics library invoked by the -lm option. int maxits. This file contains the compiled program. The option -s of gcc is undocumented as of version 2.o suffix indicates that these are object files. and the mathematics library (described as -lm at the end of the line). Apparently only linking and no compilation is done by line 3 of the makefile. Lines 4 and 6 contain the dependencies for creating the object files funct. These latter two files need to be written by the programmer.c 5 gcc -c -O4 funct. thereby making the compiled program more compact.o and bisect.c Line 1 here describes the file to make. The command on this line invokes the gcc compiler (the GNU C compiler) to link the already created programs funct.96. Line 3 contains the rule used to create the target. already compiled programs – read on).o (the . the file funct.c and bisect. */ 35 Unix is highly customizable. The file on the left of the colon is called the target target file. there is no need to create the target file again. The compiler option -c means compile but do not link the assembled source files.. namely. double x0. $ make all on the command line. if at least one of the source files has been change since the target file has last been created).o and bisect. When running the make command. the file bisect. and the files on the right are called the source files.35 The second line contains the dependencies. int *noofits. the target file is created only if it does not already exist.o and bisect. double *fatroot. double x1. and lines 5 and 7 describe the commands issued to the GNU C compiler to create these files from the source files funct. by typing. that is.

f0. (*itcount)++) { 51 x = (x0+x1)/2. */ 60 printf("Iteration number %2u: ".12f and f(x)=% . 1. 17 int its. 20 &its.12f. root. double x1. fx = fnct(x).0.12f\n". 19 50. 48 return 0. 61 printf("x=% . double xtol. 54 else 55 { x0 = x. success. 37 double x0. 24 } 25 else if ( success ) { 26 printf("The root is %. fx. 46 if ( (f0>=0 && fx>=0) || (f0<=0 && fx<=0) ) { 47 *converged = 2. double *fatroot. *itcount < maxits. 28 printf("%u iterations were used to find " 29 " the root\n". 62 if ( x1-x0 <= xtol ) { 63 *converged = 1. 49 } 50 for (*itcount = 0. 33 } 34 } 35 36 double bisect(double (*fnct)(double). root. 45 fx = fnct(x1). 52 if ( (f0<=0 && fx>=0) || (f0>=0 && fx<=0) ) 53 x1 = x. 64 break. withintol=0. 65 } 39 . 39 int *itcount. fx). These lines 58 should be deleted from a program of 59 practical utility. Programs for solving nonlinear equations 15 const double tol=5e-10. 16 double x. x. *itcount). fx. 30 } 31 else { 32 printf("The method cannot find the root. fx). The value of " 27 " the function\nat the root is %. 38 int maxits. 21 if ( success == 2 ) { 22 printf("The function has the same signs at " 23 "both endpoints. root. 44 f0 = fnct(x0). 43 *converged = 0. int *converged) 40 { 41 double x.12f.13.\n". &fx. f0 = fx. tol.\n"). &success). 18 root=bisect(&funct. 42 int iterating=1.01. } 56 /* The next two lines are included so as to monitor 57 the progress of the calculation. its). 0.\n").

Note the output statements in lines 60–61.505000000000 Iteration number 1: x= 0.000000011477 f(x)= 0.641318359375 Iteration number 10: x= 0.641185737550 Iteration number 25: x= 0.641185741238 Iteration number 28: x= 0.641185744004 Iteration number 30: x= 0.641167297363 Iteration number 15: x= 0.561074663602 f(x)=-0.641137084961 Iteration number 14: x= 0.641185826063 Iteration number 23: x= 0.641189956665 Iteration number 17: x= 0.044232545092 f(x)=-0.636484375000 Iteration number 7: x= 0. &funct in line 18.640351562500 Iteration number 8: x= 0. since they may need to be restored for debugging).641185235977 Iteration number 20: x= 0.002933217996 f(x)=-0.000170969726 f(x)= 0.010624951276 f(x)= 0. and the bisection method itself is described in the function bisect through lines 36–65. which is called by the address of the current function parameter. The program itself is a fairly straightforward implementation of the bisection method.641197509766 Iteration number 13: x= 0. We wrote &funct for the sake of clarity.000000004998 f(x)= 0. the number of iterations is limited to 50 is the calling statement in lines 18–20 to avoid an infinite loop.641185744926 Iteration number 27: x= 0.000000001480 f(x)= 0.642285156250 Iteration number 9: x= 0. these were included in the program only for the purposes of illustration.641182403564 Iteration number 16: x= 0. these lines should be deleted (or.000383300305 f(x)=-0.000005103837 f(x)= 0.000000701193 f(x)=-0.000465872209 f(x)= 0. The printout of the program is as follows: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 Iteration number 0: x= 0. . so we could have written simple funct in line 18.690625000000 Iteration number 4: x= 0. better yet.641186180115 Iteration number 18: x= 0.752500000000 Iteration number 2: x= 0.016594102551 f(x)= 0.000000128096 f(x)=-0.644218750000 Iteration number 6: x= 0.001232872504 f(x)= 0.01 and x1 = 1.000000027396 f(x)=-0.641185767055 Iteration number 24: x= 0. Normally.355806027449 f(x)= 0. The value of and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and the f(x)= 0.000014799045 f(x)=-0. The first parameter of bisect in line 36 is a pointer double (*fnct)(double) to a function.000001786675 f(x)= 0.641076660156 Iteration number 12: x= 0.641184291840 Iteration number 19: x= 0.000000000139 function 37 The address operator & with functions is superfluous.000011738180 f(x)=-0.000001530481 f(x)= 0.641185944080 Iteration number 22: x= 0.641185744465 The root 0.166018770776 f(x)=-0. *fatroot = fnct(x). Note the additional parameter success to indicate the whether the program was successful in finding the root. return root.628750000000 Iteration number 3: x= 0.641185744696.641185752302 Iteration number 26: x= 0. Further. since the name funct already refers to the address of this function.063871145198 f(x)=-0. Lines 10 through 34 here is the calling program. commented out.000000286548 f(x)=-0.640834960938 Iteration number 11: x= 0.000000079226 f(x)= 0.000064813802 f(x)= 0.000000024435 f(x)=-0.000000001759 f(x)= 0.40 Introduction to Numerical Analysis 66 67 68 69 70 } } root = (x0+x1)/2.000041335881 f(x)= 0.641185743082 Iteration number 29: x= 0.659687500000 Iteration number 5: x= 0.641185708046 Iteration number 21: x= 0.37 The program is called with the initial values x1 = 0.003858575571 f(x)=-0.

its derivative also needs to be calculated.0 ? (x) : (-(x))) double funct(double). in addition to the function. &success).c: 1 2 3 4 5 6 7 8 #include <math.c.c gcc -c -O4 funct.o -lm funct. tol. } The program itself is contained in the file newton. The program to do this constitutes the file dfunct.0/(x*x)-pow(2.o dfunct.o newton.c itself is as follows: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 #include <stdio. &its. root. fx. int its. double *fx.c dfunct. .o newton. double xtol.x)*log(2.13. &fx. Programs for solving nonlinear equations 33 at the root is 0. &dfunct. */ const double tol=5e-10.c newton.c gcc -c -O4 bisect. value = -1.c The file newton. 50. 34 30 iterations were used to find 41 the root For Newton’s method. double startingval.o dfunct. main() { /* This program implements the Newton’s method for solving an equation funct(x)=0. int *itcount.c gcc -c -O4 newton. double (*deriv)(double). int *outcome).o : funct. double newton(double (*fnct)(double). 3. The function funct() and its derivative dfunct() is defined separately. The makefile to compile this program is as follows: 1 2 3 4 5 6 7 8 9 10 all: newton newton : funct. double dfunct(double).o : dfunct.0).000000000139.h> double dfunct(double x) { double value.o : newton. double x.o gcc -o newton -s -O4 funct.h> #include <math.0. int maxits.0. return(value).c gcc -c -O4 dfunct.h> #define absval(x) ((x) >= 0. root = newton(&funct. success.

with initial approximation x1 = 3. *outcome = 0. Finally. double xtol. int *outcome) { double x. *itcount). given as 5 · 10−10 by the constant tol specified in line 19. for (*itcount = 0. *fx). (*itcount)++) { dfx = deriv(x).12f\n". This program produces the following output: . printf("x=% . } } return x. */ printf("Iteration number %2u: ". /* too flat */ break. the value 2 indicates that the root has been successfully calculated within the required tolerance. /* The next two lines are included so as to monitor the progress of the calculation. } double newton(double (*fnct)(double). root. double (*deriv)(double). printf("%u iterations were used to find " " the root\n". } *fx = fnct(x). The value of " " the function\nat the root is %. specified in the calling statement in line 23) has been exceeded. int *itcount.\n". The value 1 indicates that the tangent line is too flat at the current approximation (in which case the next approximation cannot be calculated with sufficient precision). } else if (success == 0) { printf("The maximum number of iterations has been reached\n"). assumedzero=1e-20. dfx.42 Introduction to Numerical Analysis 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 } if ( success == 2 ) { printf("The root is %. if ( absval(deriv(x))<=assumedzero ) { *outcome = 1. fx).12f and f(x)=% . double startingval. x). x = x+dx.12f. } else if (success == 1) { printf("The derivative is too flat at %. located in lines 39–67. The variable success keeps track of the outcome of the calculation. *itcount < maxits. /* returning the value of the root */ } The calling statement calls in lines 22-24 calls the bisection function.12f\n".12f. dx = -*fx/dfx. /* within tolerance */ *fx = fnct(x). the value 0 indicates that the maximum number of iterations (50 at present. if ( absval(dx)/(absval(x)+assumedzero) <= xtol ) { *outcome = 2. int maxits. dx. break. x = startingval. its). These lines should be deleted from a program of practical utility. x. double *fx.

root. tol. fold. 1.641185733066 and Iteration number 5: x= 0. &success). root. is given in the file secant.000000000000 function The Illinois method. The function funct() is defined separately. */ const double tol=5e-10. 50. &its. double x1. int *itcount. 43 . success.12f. f1. } } double secant(double (*fnct)(double). its).13. if ( success == 2 ) { printf("The root %.\n". fx. f2. fbar. double x. double secant(double (*fnct)(double). int maxits.000000000000. int maxits. fx). The value of " " the function\nat the root is %. int *outcome). xnew. f0. } else if (success == 0) { printf("The maximum number of iterations has been reached\n"). assumedzero=1e-20. &fx.641185744505. double x0. main() { /* This program implements the Illinois variant of the secant method for solving an equation funct(x)=0.000000000000 and Iteration number 1: x= 1.c: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 #include <stdio.644576458341 and Iteration number 2: x= 0. a modification of the secant method. root.0 ? (x) : (-(x))) double funct(double). dx. double x0. double xtol. Programs for solving nonlinear equations 1 2 3 4 5 6 7 8 9 Iteration number 0: x= 3.651829979669 and Iteration number 3: x= 0.641185744505 and The root 0.000000040191 f(x)= 0.518501258529 f(x)=-0.000380944220 f(x)= 0. double xtol. printf("%u iterations were used to find" " the root\n". alpha.641077330609 and Iteration number 4: x= 0. 5 iterations were used to find the root f(x)=-7.12f.h> #include <math.01. dfx.666666666667 f(x)=-2. int *outcome) { double xold.037017477051 f(x)= 0. double *fatroot. int its. The value of the at the root is -0.0. double *fatroot.h> #define absval(x) ((x) >= 0. double x1. int *itcount. 0. root = secant(&funct.

Eventually. } return root. the user will find it out either by having the maximum number of iterations exceeded or by receiving a floating point exception (overflow. break. The output of this program is as follows: 1 2 3 4 5 6 7 8 9 10 11 Iteration Iteration Iteration Iteration Iteration Iteration Iteration Iteration Iteration Iteration Iteration number 0: number 1: number 2: number 3: number 4: number 5: number 6: number 7: number 8: number 9: number 10: x= x= x= x= x= x= x= x= x= x= x= 1. } else { fbar = alpha*fold.656019294352 0.210870175492 f(x)=-0.000000000000 0.641186340346 0. so the root cannot be calculated. x1 = xnew.000000001479 f(x)= 0.641185744505 and and and and and and and and and and and f(x)=-1. for (*itcount = 0.000000001476 f(x)=-0. } if ( absval(xnew-x1)/(absval(x1)+assumedzero) <= xtol ) { *outcome = 2. x1.930673220217 f(x)= 0.642029942985 0. The value 1 for success is no longer used.584619671930 0. The calling statement in lines 20–22 specifies the starting values x1 = 0. */ printf("Iteration number %2u: ".000002093441 f(x)=-0.002963594600 f(x)= 0.641185744085 0.002550386255 f(x)=-0. x0 = x1. } xold = x0.01 and x2 = 1 in line 20. /* returning the value of the root */ Here we still use the value 2 for the variable success (see line 23) to indicate that the method was successful.44 Introduction to Numerical Analysis 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 } *outcome = 0. fold = f0.000000000000 .12f\n". xnew= x1-f1*(xold-x1)/(fbar-f1). printf("x=% . /* This called the Illinois method. *fatroot = fnct(root). if ( *itcount == 0 || f0<=0 && f1>=0 || f0>=0 && f1<=0 || fold<=0 && f1<=0 || fold>=0 && f1>=0 ) { xnew = x1-f1*(x1-x0)/(f1-f0). f1 = fnct(xnew).641185744926 0. */ f0 = fnct(x0).12f and f(x)=%15.5. because of dividing by a number too close to zero). f1 = fnct(x1). but at the price of some additional calculations.990099311353 0. f1).971140749370 0.976322026870 f(x)=-0.640460359614 0.000000000000 f(x)=-0. alpha = 0. one could add the option of telling the user that the secant line is too flat. the same starting values we used for bisection. /* within tolerance */ root = xnew. *itcount < maxits. (*itcount)++) { /* The next two lines are included so as to monitor the progress of the calculation. There are other. These lines should be deleted from a program of practical utility. f0 = f1. more complicated choices for alpha. *itcount).051383435810 f(x)=-0.

14.o bisect. ((a0 x + a1 )x + a2 )x .o dfunct. lines 3. The value of the function 13 at the root is -0.o -lm funct. . and there is no space character before the first letter in these lines. .o bisect. 7.o gcc -o secant -s -O4 funct. 5.c bisect. If we write n−1 X bk xn−1−k .o newton. NEWTON’S METHOD FOR POLYNOMIAL EQUATIONS Given the polynomial P (x) = n X ak xn−k = a0 xn + a1 xn−1 + . . .o secant.o newton. using the above expression for Q(x). Then P (x0 ) = bn .o dfunct.000000000000.641185744505. This method of calculating the value of a polynomial is called Horner’s rule.o : dfunct.o -lm newton : funct.o : newton.o gcc -o newton -s -O4 funct. Newton’s method for polynomial equations 45 12 The root 0.c newton.c gcc -c -O4 secant. + an k=0 = (. 13. the right-hand side can be written as (x − x0 )Q(x) + bn = (x − x0 ) = n−1 X k=0 bk xn−k + bn − n−1 X bk xn−k−1 x0 = n−1 X k=0 k=0 n−1 X k=0 bk xn−k−1 + bn = x n X k=0 bk xn−k−1 + bn − x0 bk xn−k − n X k=1 bk−1 xn−k x0 . 15. . n−1 X k=0 bk xn−k−1 . it is easy to calculate the value of P (x0 ): put (1) b0 = a0 and bk = ak + bk−1 x0 for k with 0 < k ≤ n.o : secant. Indeed. 14. + an−1 )x + an . 17 here start with a tab character.o secant.c dfunct. . can be used to compile all these programs: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 all: bisect newton secant bisect : funct. 14 10 iterations were used to find the root A single makefile.o : bisect. usually called makefile.o : funct. 11.c gcc -c -O4 newton. 9.c As we pointed out before.c gcc -c -O4 bisect.o gcc -o bisect -s -O4 funct.c gcc -c -O4 dfunct.o -lm secant : funct.c gcc -c -O4 funct.c secant. Q(x) = k=0 it is easy to see that (2) P (x) = (x − x0 )Q(x) + bn .

and the quotient P1 (x) is called the deflated polynomial. Hence the derivative of P (x) can also be easily evaluated at x0 (since we only need to evaluate Q(x0 ) for this. 2) These equations also allow us to evaluate the coefficients of the polynomial Q(x). If we make the assumption that b−1 = 0 (b−1 hax not been defined so far). . 1) Equations (1) allow a speedy evaluation of the polynomial P (x) at x0 . we obtain P ′ (x) = (x − x0 )Q′ (x) + Q(x). since we assumed that b−1 = 0). we can use Newton’s method to approximate the zero of P (x). the term bn was incorporated into the first sum. in obtaining the last equality. The reason for doing this is to eliminate the errors resulting from the fact that the coefficients of P1 (x) are only approximate values. Once a root α0 of P (x) = 0 is found. Division by the factor x − α0 corresponding to the root α0 is called deflation. one can look for further roots of the equation by looking for zeros of the polynomial P (x)/(x−α0 ). and deflates a polynomial several times. we can extend the summation in the second sum on the right-hand side to k = 0: n X n−k bk x k=0 = n X − n X n−k bk−1 x x0 = n X (bk xn−k − bk−1 xn−k x0 ) k=0 k=0 (bk − bk−1 x0 )xn−k = k=0 n X ak xn−k = P (x). using the same method). double *fx. 38 Last but one. 8 *outcome = 0. Finally. int maxits. int *outcome) 6 { 7 double x. 4 double startingval.h" 2 3 double newton(double (*fnct)(double). dfx. one can use this approximation as the starting value for Newton’s method to find the root of the equation P (x) = 0. An illustration how this works is given in the following program to solve a polynomial equation. assumedzero=1e-20. These reflections allow us to draw the following conclusions.c is essentially the same as the program performing Newton’s method above (only some fprint statements used to monitor the progress of the calculation were deleted): 1 #include "newton. and in the second sum. k=0 The penultimate38 equality holds because we have ak = bk − bk−1 x0 (0 ≤ k ≤ n) according to (1) (the case k = 0 also follows. double (*deriv)(double). taking the derivative of (2). Here one can observe that 3) the coefficients of P (x)/(x−α0 ) can be evaluated by using equations (1) with α0 replacing x0 . If one continues this procedure. we obtain P ′ (x0 ) = Q(x0 ). The file newton. 9 x = startingval. dx. Once an approximate zero β1 of the deflated polynomial is found (by Newton’s method. k was replaced with k + 1 (so. Substituting x = x0 . Hence. double xtol.46 Introduction to Numerical Analysis Here. 5 int *itcount. in the second sum on the right-hand side. This is because the bk ’s are the coefficients of the quotient of dividing x − x0 into P (x). the errors might accumulate to an unacceptable level unless each time one goes back to the original polynomial to refine the approximate root. say). k goes from 1 to n instead of going from 0 to n − 1).

18 if ( absval(dx)/(absval(x)+assumedzero) <= xtol ) { 19 *outcome = 2. 22 } 23 } 24 return x. double startingval. double *fx.h: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 #include #include #include #include <stdio. int *outcome). double x.c: 1 2 3 4 5 6 7 8 9 10 11 12 #include <math. Newton’s method for polynomial equations 47 10 for (*itcount = 0. 21 break.h> <math. and to deflate it (i. } As has been explained above. The function evalp to evaluate polynomials is given in the file horner.14. /* within tolerance */ 20 *fx = fnct(x). *itcount < maxits. double b[]). /* too flat */ 14 break.0 ? (x) : (-(x))) double funct(double).h> <float. int n. The vector b is . double origfunct(double).h> double evalp(double a[]. int *itcount.. double (*deriv)(double). int k. double newton(double (*fnct)(double). evaluate its derivative. double b[]) { double value. The function evalp() takes returns the value of a polynomial at x with coefficient degree n and vector a (with a[n] being the coefficient of the highest degree term). /* returning the value of the root */ 25 } The first line here invokes the header file newton.h> <stdlib. to divide it by the linear factor corresponding to the root) once a root is known. } return b[n]. (*itcount)++) { 11 dfx = deriv(x). k++ ){ b[k]=a[k]+b[k-1]*x. 12 if ( absval(deriv(x))<=assumedzero ) { 13 *outcome = 1. double x. k<=n. b[0]=a[0]. 15 } 16 *fx = fnct(x). int n. Horner’s method can be used to evaluate a polynomial. double xtol. int maxits. double evalp(double a[]. for ( k=1. double dfunct(double).e. x = x+dx. 17 dx = -*fx/dfx.h> #define absval(x) ((x) >= 0. double dorigfunct(double).

for (k=0. break. root. n = nn. &its. &success). 50. tol. a[20]. } while ( n > 0 ) { root = newton(&funct. b[20].h" double aa[20]. s) !=EOF. int its. } if ( success == 2 ) { printf("Root number %d is %.12f. } nn--. nn++) { aa[nn]=strtod(s. count. . coefffile=fopen("coeffs". "r"). for (k=0. char s[20]. if ( n < nn && success == 2 ) { root = newton(&origfunct. n--.\n".b).n. The function funct() and its derivative dfunct() is defined separately. double x. &dfunct. fx. k++) { a[k] = aa[k]. &fx. k<=n. tol. &its. } else { printf("The procedure does not converge\n").root. n. fscanf(coefffile.12f. The value of " " the function\nat the root is %. for (nn=0.NULL). 50.0. main() { /* This program implements the Newton’s method for solving an equation funct(x)=0. extern double aa[]. fx).48 Introduction to Numerical Analysis used to store the coefficients of the polynomial Q(x) in equation (2) above. success.c: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 #include "newton. root. } evalp(a. &fx. k<=n. */ const double tol=5e-10. random. extern int nn. n. k++) { a[k] = b[k]. k. &success). FILE *coefffile. &dorigfunct. y. The main program is contained in the file callnewton. "%s". nn-n+1. 0. root. int nn. a[].

extern int nn. but possibly less readable. The numbers in this file are first read as character strings. b). extern int n. n. evalp(aa. The integer nn will store the degree of the original polynomial. nn. x. The functions in lines 53–78 are used to provide an interface between the function evalp() mentioned above. double b[20]. c[20]. n. } In lines 4–5. and the coefficients are read into the array aa[] in lines 19–22. 40 the library function fscan seems to handle numbers of type float well. b). and will decrease gradually as more roots are found). return evalp(b. and it is stored in the integer nn. x. x. n-1. return evalp(b. x. and in line 21 they are converted to double40 At the same time. and a will contain those of the deflated polynomial. The calculation of the roots and the deflation of the polynomial is performed in the loop in lines 39 This was done so that the program implementing Newton’s method could be used without change. . because these variables are needed by several functions in this file. } double dorigfunct(double x) { extern double aa[]. double b[20]. since it was incremented at the end of the for loop in line 20. return evalp(aa. extern int nn. the array will aa contain the coefficients of the original polynomial. c). but not those of type double. this file is opened for reading in line 18. x. c[20]. nn needs to be decremented in line 23. so reading the numbers as character strings and then converting them appears to be safer. so that the functions providing these interfaces are not needed. solution would be to rewrite this program. return evalp(a. In lines 27–27 the array aa[] is copied into the array a[] since initially the deflated polynomial is the same as the original polynomial (nn is also copied to n in line 24). b).14. nn. and perhaps more efficient. the variables are declared outside the functions. } double dfunct(double x) { extern double a[]. double b[20]. An alternative. nn-1. implementing Horner’s rule.39 The functions origfunct() and dorigfunct() evaluate the original polynomial and its derivative. n will store the degree of the deflated polynomial (this will initially be nn. and the functions funct() used by the file newton. while the functions funct() and funct() evaluate the deflated polynomial and its derivative. x. extern int n. c). The coefficients of the polynomial are contained in the file coeffs. double b[20].c implementing Newton’s method. the degree of the polynomial is calculated by counting the number of the coefficients read. } double funct(double x) { extern double a[]. evalp(a. Newton’s method for polynomial equations 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 } 49 } } double origfunct(double x) { extern double aa[]. b).

and not a part of the file. b2 = a2 + b1 x0 = 3 + 1 · (−2) = 1. and then value of the polynomial being considered will be c2 . there . P ′ (−2) = c2 = 7. and x0 = −2. then the result is printed out in lines 32–26. It is easy to check that this result is correct. The value of the function is 0. If this calculation is successful then in lines 32–34 this root is refined by calling Newton’s for the original polynomial with the the root of the deflated polynomial just calculated being used as initial value. For higher degree polynomials.0 4.0 to calculate the next root of the deflated polynomial. a2 = 3. so there is a good chance that the program will terminate unsuccessfully.300000000000. we have b0 = a0 and bk = ak + bk−1 x0 for k with 0 < k ≤ 3. c1 = b1 + c0 x0 = 1 + 2 · (−2) = −3. If the determination of the current root is successful.430000000000. there is no guarantee that Newton’s method will find the root of a polynomial.50 Introduction to Numerical Analysis 28–51. c0 = b0 = 2. Problems 3 1. c2 = b2 + c1 x0 = 1 + (−3) · (−2) = 7.000000000000. we did not need to calculate b3 . 2 is 4. If at any time in the process. If at any time during the calculation. Actually. Using Horner’s rule. the loop is broken in line 44. The value of the function is -0. Of course. a3 = 4. b0 = a0 = 2. 3 is -5. too large a value is used to approximate the root. since in case n=nn the deflated polynomial agrees with the original polynomial. Evaluate the derivative of P (x) = 2x + 5x2 + 3x + 4 at x = −2 using Horner’s method.7168 The 1 at the beginning of the line is a line number. b1 = a1 + b0 x0 = 5 + 2 · (−2) = 1. In this way one could make it likely that Newton’s method calculates the real roots of a polynomial. There is no real saving when the calculation is done for a polynomial of such low degree. a1 = 5. Newton’s method could be terminated and restarted with a new value. There are ways to estimate the size of the roots of a polynomial. Several measures could be taken to prevent this. The value of the function is 0.200000000000. Newton’s method is called with initial value 0.33 -19. Solution. That is. The roots were successfully determined. this can be calculated by first calculating the coefficients c0 = b0 and ck = bk + ck−1 for k with 0 < k ≤ 2. Show the details of your calculation. The derivative as x = −2 is the value for x = −2 of the polynomial b0 x2 + b1 x + b2 . since it is not used in calculating the derivative.733. Therefore. This is done only for n<nn. Further. Newton’s method is unsuccessful.000000000000. The program was run with the following coefficient file coeffs: 1 1. and the printout was the following 1 2 3 4 5 6 Root number at the root Root number at the root Root number at the root 1 is -3. -74. b3 = a3 + b2 x0 = 4 + 1 · (−2) = 2. That is.000000000000. In line 28. We have a0 = 2.

Solution. That is. 3. By another use of Horner’s method. The derivative as x = 2 is the value for x = 2 of the polynomial b0 x2 + b1 x + b2 . Solution. b2 = a2 + b1 x0 = 6 + (−2) · 2 = 2. and x0 = 2. Note: The coefficients of the polynomial Q(x) can be produced by Horner’s method. we can evaluate Q(x0 ). a2 = 6. is that the formal differentiation of polynomials can be avoided. Therefore. . Another advantage of the method.15. 2. c0 = b0 = 1. a1 = −4. Another advantage of the method. It is easy to check that this result is correct. we did not need to calculate b3 . c2 = b2 + c1 x0 = 2 + 0 · 2 = 2. and then value of the polynomial being considered will be c2 . we have b0 = a0 and bk = ak + bk−1 x0 for k with 0 < k ≤ 3. For higher degree polynomials. this can be calculated by first calculating the coefficients c0 = b0 and ck = bk + ck−1 for k with 0 < k ≤ 2. Further. There is no real saving when the calculation is done for a polynomial of such low degree. let x0 and r be a numbers. is that the formal differentiation of polynomials can be avoided. Show the details of your calculation. We have P ′ (x) = Q(x) + (x − x0 )Q′ (x) simply be using the product rule for differentiation. especially for computers. Actually. c1 = b1 + c0 x0 = −2 + 1 · 2 = 0. b1 = a1 + b0 x0 = −4 + 1 · 2 = −2. Fixed-point iteration 51 is definitely a saving in calculation. we obtain that P ′ (x0 ) = Q(x0 ). We have a0 = 1. b0 = a0 = 1. P ′ (2) = c2 = 2. b3 = a3 + b2 x0 = 4 + 2 · 2 = 8. Let P and Q be polynomials. especially for computers. This provides an efficient way to evaluate P ′ (x) on computers without using symbolic differentiation. a3 = 4. That is. since it is not used in calculating the derivative. Evaluate the derivative of P (x) = x3 − 4x2 + 6x + 4 at x = 2 using Horner’s method. there is definitely a saving in calculation. and assume that P (x) = (x − x0 )Q(x) + r. Show that P ′ (x0 ) = Q(x0 ). Using Horner’s rule. Substituting x = x0 .

15. FIXED-POINT ITERATION

The equation x = f(x) can often be solved by starting with a value x = x_1 and using the simple iteration

   x_{n+1} = f(x_n).

This method is called fixed-point iteration, and it is important for theoretical reasons. This is partly because other methods can often be reformulated in terms of fixed-point iteration. For example, when using Newton's method

   x_{n+1} = x_n - \frac{g(x_n)}{g'(x_n)}

to solve the equation g(x) = 0, this can be considered as using fixed-point iteration to solve the equation

   x = x - \frac{g(x)}{g'(x)}.

Furthermore, variants of fixed-point iteration are useful for solving certain systems of linear equations arising in practice (e.g., on account of cubic splines).

A simple result describing the solvability of equations by fixed-point iteration is the following:

Theorem. Assume x = c is a solution of the equation x = f(x). Assume further that there are numbers r > 0 and q with 0 <= q < 1 such that |f'(x)| <= q for all x with c - r < x < c + r. Then, starting with any value x_1 \in (c - r, c + r) and putting x_{n+1} = f(x_n) for n >= 1, the sequence {x_n} converges to c.

Proof. Let n >= 1, and assume that x_n \in (c - r, c + r). By the Mean-Value Theorem of differentiation we have

   f(x_n) - c = f(x_n) - f(c) = f'(\xi_n)(x_n - c),

where the first equality used the assumption that c is a root of x = f(x); here \xi_n is some number between x_n and c. Clearly, we have \xi_n \in (c - r, c + r), so |f'(\xi_n)| <= q. Therefore, noting that x_{n+1} = f(x_n), we have

   (1)  |x_{n+1} - c| <= q |x_n - c|.

Hence x_{n+1} \in (c - r, c + r). Thus, given that x_1 \in (c - r, c + r) by assumption, we can conclude by induction that x_n \in (c - r, c + r) for all positive integers n; hence inequality (1) is valid for all positive integers n, and thus

   |x_n - c| <= q^{n-1} |x_1 - c|

for all positive integers n. As q^{n-1} converges to 0 when n tends to infinity, it follows that x_n converges to c.

When one tries to use this result in practice, the interval (c - r, c + r) of course cannot be known exactly. However, the fact that |f'(x)| < q in an interval for some q with 0 <= q < 1 makes it reasonable to try to use fixed-point iteration to find a solution of the equation x = f(x) in this interval. A partial converse to this is the following

Theorem. Assume x = c is a solution of the equation x = f(x). Assume further that there are numbers a, b with a < c < b such that |f'(x)| >= 1 for all x with a < x < b. Then, starting with any value x_1 and putting x_{n+1} = f(x_n) for n >= 1, the sequence {x_n} does not converge to c unless x_k = c for some positive integer k.

Proof. Assuming x_n converges to c, we must have a positive integer N such that x_n \in (a, b) for every n >= N. So, for any n >= N we have, by the Mean-Value Theorem,

   f(x_n) - c = f(x_n) - f(c) = f'(\xi_n)(x_n - c)

for some \xi_n \in (a, b). Thus, noting that x_{n+1} = f(x_n) and that |f'(\xi_n)| >= 1, we obtain that |x_{n+1} - c| >= |x_n - c| for all n >= N. Thus we have |x_n - c| >= |x_N - c| for all n >= N. Hence x_n cannot converge to c unless x_N = c. Of course, if x_k = c for some positive integer k, then x_n = c for all n >= k.

However, it is very unlikely that we accidentally end up with x_k = c in practice. So, if |f'(x)| > 1 near the solution of the equation x = f(x), using fixed-point iteration to find the solution should be considered hopeless.

Fixed-point iteration is easy to implement on computer. In a C program doing fixed-point iteration to solve the equation x = e^{1/x}, the function funct() is defined separately. The file funct.c implements the simple program calculating the function e^{1/x}:

   1 #include <math.h>
   2
   3 double funct(double x)
   4 {
   5    double value;
   6    value = exp(1.0/x);
   7    return(value);
   8 }

The file fixedpoint.c contains the program:

   1 #include <stdio.h>
   2 #include <math.h>
   3
   4 #define absval(x) ((x) >= 0.0 ? (x) : (-(x)))
   5
   6 double funct(double);
   7 double fixedpoint(double (*fnct)(double), double startingval,
   8    double xtol, int maxits, int *itcount, int *outcome);
   9
  10 main()
  11 {
  12 /* This program implements fixed-point iteration for
  13    solving an equation x=funct(x). The function funct()
  14    is defined separately. */
  15    const double tol=5e-10;
  16    double root;
  17    int its, success;
  18    root = fixedpoint(&funct, 1.0, tol, 50, &its, &success);
  19    if ( success == 2 ) {
  20       printf("The root is %.12f\n", root);
  21       printf("%u iterations were used to find"
  22          " the root\n", its);
  23    }
  24    else if ( success == 0 ) {
  25       printf("The maximum number of iterations"
  26          " has been reached\n");
  27    }
  28 }
  29 double fixedpoint(double (*fnct)(double), double startingval,
  30    double xtol, int maxits, int *itcount, int *outcome)
  31 {
  32    double x, oldx, assumedzero=1e-20;
  33    *outcome = 0;
  34    x = startingval;
  35    for (*itcount = 0; *itcount < maxits; (*itcount)++) {
  36       oldx = x;
  37       x = fnct(x);
  38       /* The next line is included so as to monitor
  39          the progress of the calculation. This line
  40          should be deleted from a program of
  41          practical utility. */
  42       printf("Iteration number %2u: ", *itcount);
  43       printf("x=% .12f\n", x);
  44       if ( absval(x-oldx)/(absval(oldx)+assumedzero) <= xtol ) {
  45          *outcome = 2;   /* within tolerance */
  46          break;
  47       }
  48    }
  49    return x;   /* returning the value of the root */
  50 }

The calling statement on line 18 uses the starting value x = 1.0, and limits the number of iterations to 50. The ideas used in the implementation of fixed-point iteration are similar to those used in earlier programs, such as newton.c. The printout of the program is as follows:

   1 Iteration number  0: x= 2.718281828459
   2 Iteration number  1: x= 1.444667861010
   3 Iteration number  2: x= 1.998107789671
   4 Iteration number  3: x= 1.649502126004
   5 Iteration number  4: x= 1.833530851751
   6 Iteration number  5: x= 1.725291093281
   7 Iteration number  6: x= 1.785346181250
   8 Iteration number  7: x= 1.750874646996
   9 Iteration number  8: x= 1.770289539914
  10 Iteration number  9: x= 1.759235513562
  11 Iteration number 10: x= 1.765490799373
  12 Iteration number 11: x= 1.761938693392
  13 Iteration number 12: x= 1.763951807726
  14 Iteration number 13: x= 1.762809621275
  15 Iteration number 14: x= 1.763457255891
  16 Iteration number 15: x= 1.763089906433
  17 Iteration number 16: x= 1.763298230825
  18 Iteration number 17: x= 1.763180076095
  19 Iteration number 18: x= 1.763247085165
  20 Iteration number 19: x= 1.763209080909
  21 Iteration number 20: x= 1.763230634603
  22 Iteration number 21: x= 1.763218410517
  23 Iteration number 22: x= 1.763225343308
  24 Iteration number 23: x= 1.763221411417
  25 Iteration number 24: x= 1.763223641361
  26 Iteration number 25: x= 1.763222376662
  27 Iteration number 26: x= 1.763223093927
  28 Iteration number 27: x= 1.763222687135
  29 Iteration number 28: x= 1.763222917845
  30 Iteration number 29: x= 1.763222786999
  31 Iteration number 30: x= 1.763222861208
  32 Iteration number 31: x= 1.763222819121
  33 Iteration number 32: x= 1.763222842990
  34 Iteration number 33: x= 1.763222829453
  35 Iteration number 34: x= 1.763222837130
  36 Iteration number 35: x= 1.763222832776
  37 Iteration number 36: x= 1.763222835246
  38 Iteration number 37: x= 1.763222833845
  39 Iteration number 38: x= 1.763222834639
  40 The root is 1.763222834639
  41 38 iterations were used to find the root

Problems

1. Rearrange the equation x + 1 = tan x so that it can be solved by fixed-point iteration.

Solution. In order to solve the equation by fixed-point iteration, it needs to be written in the form x = f(x), and we also must have |f'(x)| < q with some q < 1 near the solution. There are several ways to write the above equation in the form x = f(x); for example,

   x = tan x - 1.

This will definitely not work, since the derivative of the right-hand side is 1 + tan^2 x, and this is always at least 1, and greater than 1 except at x = 0. On the other hand, one can also write

   x = arctan(x + 1) + k\pi,

where k can be any integer (positive, negative, or zero). There are infinitely many solutions of this equation: it is easy to see by inspecting the graphs of y = x and y = arctan(x+1) + k\pi that they intersect exactly once for each value of k. The derivative of the right-hand side is

   \frac{1}{(x+1)^2 + 1},

and for any value of x this is at most 1; near each solution of the equation it is in fact less than or equal to 1/2. Thus this form is well-suited for fixed-point iteration. For each value of k, one can use the starting value x = 0, for example.

2. The equation e^x - x - 2 = 0 has one positive solution and one negative solution. Rearrange the equation so that each of these solutions can be found by fixed-point iteration.

Solution. To find the negative solution, write x = f(x) with f(x) = e^x - 2. Then f'(x) = e^x, and so, for any number c < 0 we have 0 < f'(x) <= e^c < 1 for x <= c. Thus one can find the negative solution by using any starting point x_1 < 0 and using the iteration

   x_{n+1} = e^{x_n} - 2

(actually x_1 = 0 would also work, since then x_2 = -1). To find the positive solution, write x = f(x) with f(x) = \ln(x + 2). Then we have f'(x) = 1/(x+2), and so for any c > -1 we have

   0 < \frac{1}{x+2} <= \frac{1}{c+2} < 1

for any x > c. So, one can find the positive solution using any starting point x_1 > -1 and the iteration

   x_{n+1} = \ln(x_n + 2);

one might as well use the starting point x_1 = 1 also in this case. The approximate solutions are -1.84141 and 1.14619.
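The two rearrangements of Problem 2 are easy to try on a computer. The following minimal sketch (not one of the program files above) iterates both of them a fixed number of times, without the stopping test used in fixedpoint.c:

#include <stdio.h>
#include <math.h>

int main(void)
{
   double x = 0.0;   /* starting value for the negative root */
   double y = 1.0;   /* starting value for the positive root */
   int n;
   for (n = 0; n < 50; n++) {
      x = exp(x) - 2.0;   /* x = e^x - 2 */
      y = log(y + 2.0);   /* x = ln(x+2) */
   }
   printf("negative root: % .5f\n", x);   /* about -1.84141 */
   printf("positive root: % .5f\n", y);   /* about  1.14619 */
   return 0;
}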

16. AITKEN'S ACCELERATION

Let \alpha be the solution of an equation, and let x_n be the nth approximation to this solution. Write \epsilon_n = x_n - \alpha for the error of the nth approximation. In case we have \lim_{n\to\infty} x_n = \alpha, we say that the order of convergence of the approximation is \lambda if

   (1)  \lim_{n\to\infty} \frac{|\epsilon_{n+1}|}{|\epsilon_n|^\lambda} = C

for some constant C \ne 0, with the additional requirement of C < 1 in case \lambda = 1.^41 In case \lambda = 1, 2, 3 we also talk of linear, quadratic, or cubic convergence, respectively. Of course, the limit in (1) may not exist, or it may be 0 for each value of \lambda for which it exists. We do not intend to give a more inclusive or more precise definition of the order of convergence here. Intuitively, the higher the order of convergence of a method, the faster the convergence is. For example, in case of linear convergence (with C = 1/10), with each step one may expect one additional significant decimal digit in the solution, while with quadratic convergence the number of significant decimal digits would roughly double with each additional step.

With fixed-point iteration, one usually expects linear convergence. In fact, if \alpha = f(\alpha) and x_{n+1} = f(x_n), then

   x_{n+1} - \alpha = f(x_n) - f(\alpha) = f'(\xi_n)(x_n - \alpha)

for some \xi_n between x_n and \alpha, where the second equation is justified by the Mean-Value Theorem (under suitable differentiability assumptions), so that

   \frac{x_{n+1} - \alpha}{x_n - \alpha} = f'(\xi_n) \approx f'(\alpha),

provided that f' is continuous at \alpha and x_n (and so also \xi_n) is close enough to \alpha. Thus, in case 0 < |f'(\alpha)| < 1 one would have linear convergence. If one knows that we are dealing with linear convergence, there is a method, called Aitken's acceleration, to speed up this convergence.

41 If \lambda < 1 we obviously have divergence. In case \lambda = 1 and C = 1 we may or may not have convergence; when it converges, the convergence is usually much slower than in case C < 1.

Assume we have

   \frac{x_{n+1} - \alpha}{x_n - \alpha} = f'(\xi_n) \approx C

for some C with -1 < C < 1.^42 Then we have

   \frac{x_{n+2} - \alpha}{x_{n+1} - \alpha} \approx \frac{x_{n+1} - \alpha}{x_n - \alpha},

since both sides are approximately equal to C. If one replaces the approximate equality here by equality, with x_{n+3} replacing \alpha, this equation becomes exact:

   \frac{x_{n+2} - x_{n+3}}{x_{n+1} - x_{n+3}} = \frac{x_{n+1} - x_{n+3}}{x_n - x_{n+3}}.

Thus

   (x_{n+2} - x_{n+3})(x_n - x_{n+3}) = (x_{n+1} - x_{n+3})^2.

Determine x_{n+3}, the next approximation to \alpha, from this:

   (2)  x_{n+3} = x_n - \frac{(x_{n+1} - x_n)^2}{x_{n+2} - 2x_{n+1} + x_n} = \frac{x_n x_{n+2} - x_{n+1}^2}{x_{n+2} - 2x_{n+1} + x_n}.

When one wants to solve the equation x = f(x) with fixed-point iteration combined with Aitken's acceleration, one starts with an approximate solution; given an n >= 0 divisible by 3, one takes x_{n+1} = f(x_n), x_{n+2} = f(x_{n+1}), and then one calculates x_{n+3} according to (2). That is, one does two fixed-point iteration steps, and then one does one Aitken acceleration step. When doing so, one uses the middle member of the equations in (2), since the right-hand side can cause greater loss of precision due to the subtraction of quantities of approximately the same size.^43

Next we will discuss a C program implementation of the method. As in the fixed-point iteration example, we will use the method to find the solution of the equation x = e^{1/x}. The program defining this function, contained in the file funct.c, is the same as above:

   1 #include <math.h>
   2
   3 double funct(double x)
   4 {
   5    double value;
   6    value = exp(1.0/x);
   7    return(value);
   8 }

As for the program performing Aitken acceleration, this time we broke up the calling program and the program performing the work into two separate files and a third header file (so as not to have to repeat the same declarations in the other files). The header file aitken.h is as follows:

   1 #include <stdio.h>
   2 #include <math.h>
   3
   4 #define absval(x) ((x) >= 0.0 ? (x) : (-(x)))
   5 #define square(x) ((x)*(x))
   6
   7 double funct(double);
   8 double aitken(double (*fnct)(double), double startingval,
   9    double xtol, int maxits, int *itcount, int *outcome);

42 Strictly speaking, this assumption is stronger than the assumption of linear convergence, since equation (1) has absolute values (which is really necessary there, since \epsilon_n^\lambda may not be defined for noninteger values of \lambda if \epsilon_n is negative). However, in case of linear convergence the relation usually holds without absolute values.
43 This is because in the middle member the loss of precision is likely to occur only in the fraction that is being subtracted from x_n; since this fraction represents a small correction to the value of x_n, this loss of precision is not as fatal as the loss that might occur on the right-hand side.

The calling program, in the file callaitken.c, is very similar to the calling program of the fixed-point iteration discussed before:

   1 #include "aitken.h"
   2
   3 main()
   4 {
   5 /* This program implements fixed-point iteration
   6    combined with Aitken's acceleration for solving
   7    an equation x=funct(x). */
   8    const double tol=5e-10;
   9    double root;
  10    int its, success;
  11    root = aitken(&funct, 1.0, tol, 50, &its, &success);
  12    if ( success == 2 ) {
  13       printf("The root is %.12f\n", root);
  14       printf("%u iterations were used to find"
  15          " the root\n", its);
  16    }
  17    else if ( success == 0 ) {
  18       printf("The maximum number of iterations"
  19          " has been reached\n");
  20    }
  21 }

The difference here is that, instead of including the function declarations and the standard include statements, all these are taken from the file aitken.h, included in line 1. Line 11 of the program uses the starting value 1 for the iteration, the same starting value that was used in the fixed-point iteration. The part of the program that performs the main work is contained in the file aitken.c:

   1 #include "aitken.h"
   2
   3 double aitken(double (*fnct)(double), double startingval,
   4    double xtol, int maxits, int *itcount, int *outcome)
   5 {
   6    double x, oldx, x0, x1, assumedzero=1e-20;
   7    int i=0;
   8    *outcome = 0;
   9    x = startingval;
  10
  11    for (*itcount = 0; *itcount < maxits; (*itcount)++) {
  12       oldx = x;
  13       switch(i) {
  14          case 0:
  15             x0 = x; x = fnct(x);
  16             break;
  17          case 1:
  18             x1 = x; x = fnct(x);
  19             break;
  20          case 2:
  21             x = x0 - square(x1-x0)/(x-2*x1+x0);
  22             break;
  23       }
  24       /* The next line is included so as to monitor
  25          the progress of the calculation. This line
  26          should be deleted from a program of
  27          practical utility. */
  28       printf("Iteration number %2u: ", *itcount);
  29       printf("x=% .12f\n", x);
  30       if ( absval(x-oldx)/(absval(oldx)+assumedzero) <= xtol ) {
  31          *outcome = 2;   /* within tolerance */
  32          break;
  33       }
  34       i++;
  35       if ( i==3 ) { i=0; }
  36    }
  37    return x;   /* returning the value of the root */
  38 }

The program is similar to the program performing fixed-point iteration. In the switch statement in lines 13–23 it is decided whether a fixed-point iteration step (in case *itcount is of the form 3k or 3k+1 for some integer k) or the Aitken acceleration step (in case *itcount is of the form 3k+2) should be used to calculate x, the next approximation.

The advantage of breaking up the program into all these files is that most files will never have to be changed when performing the calculation with different functions. In fact, only the files funct.c and callaitken.c need to be changed: the former when the function f in the equation x = f(x) to be solved is changed, and the latter when one wants to change the starting value, the precision of the solution required (described by the parameter tol), or the maximum number of iterations permitted. It would be easy to arrange for reading these parameters on the command line, and then one would never need to change the calling program if one only wants to solve one equation; however, often solving an equation is only a part of a larger calculation, and in this case the calling program might still have to be changed.

The following makefile can be used to compile the programs scattered in the files above:

   1 all: aitken
   2 aitken : funct.o callaitken.o aitken.o
   3    gcc -o aitken -s -O4 callaitken.o funct.o \
   4       aitken.o -lm
   5 funct.o : funct.c aitken.h
   6    gcc -c -O4 funct.c
   7 aitken.o : aitken.c aitken.h
   8    gcc -c -O4 aitken.c
   9 callaitken.o : callaitken.c aitken.h
  10    gcc -c -O4 callaitken.c

The backslash \ at the end of line 3 is the standard way of breaking up a line in Unix. The backslash quotes the newline character at the end of the line, and when quoted, the newline character no longer signifies the end of a command; that is, the command started in line 3 continues in line 4. Line 4 begins with three space characters; they are not required, but they make the layout more appealing.

The printout of the program is as follows:

   1 Iteration number  0: x= 2.718281828459
   2 Iteration number  1: x= 1.444667861010
   3 Iteration number  2: x= 1.986829971168
   4 Iteration number  3: x= 1.654194746033
   5 Iteration number  4: x= 1.830380270380
   6 Iteration number  5: x= 1.769373837327
   7 Iteration number  6: x= 1.759749886372
   8 Iteration number  7: x= 1.765197485602
   9 Iteration number  8: x= 1.763228455409
  10 Iteration number  9: x= 1.763219646420
  11 Iteration number 10: x= 1.763224642370
  12 Iteration number 11: x= 1.763222834357
  13 Iteration number 12: x= 1.763222834349
  14 The root is 1.763222834349
  15 12 iterations were used to find the root

One might compare this to fixed-point iteration, where 38 iterations were used to find the solution of the same equation, using the same starting value and requiring the same precision.

Problem

1. Explain how Aitken's acceleration for fixed-point iteration works.

Solution. In solving the equation f(x) = x, one starts out with an approximation x_0 to the solution. Each time, one does two steps of fixed-point iteration and then one step of Aitken acceleration. That is, if n is divisible by 3, then one takes x_{n+1} = f(x_n) and x_{n+2} = f(x_{n+1}), and

   x_{n+3} = x_n - \frac{(x_{n+1} - x_n)^2}{x_{n+2} - 2x_{n+1} + x_n}.

17. THE SPEED OF CONVERGENCE OF NEWTON'S METHOD

In solving the equation f(x) = 0, Newton's method starts with an approximate solution x_0, and for n >= 0 one puts

   (1)  x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}.

Let \alpha be a solution of the above equation, i.e., f(\alpha) = 0, and assume that \alpha is not a multiple root, i.e., that f'(\alpha) \ne 0. Assume, further, that f is twice differentiable in an open interval containing \alpha and x_n. By the Taylor expansion of f at x_n with the Lagrange remainder term we have

   f(\alpha) = f(x_n) + (\alpha - x_n)f'(x_n) + \frac{f''(\xi_n)}{2}(\alpha - x_n)^2

with some \xi_n between x_n and \alpha. Noting that f(\alpha) = 0, this equation can be written as

   0 = f(x_n) + (\alpha - x_n)f'(x_n) + \frac{f''(\xi_n)}{2}(\alpha - x_n)^2.

Assuming that f' is continuous near \alpha and that x_n is close enough to \alpha, we will also have f'(x_n) \ne 0. Dividing by f'(x_n) (recall that f'(x_n) \ne 0 by our assumptions) and rearranging the equation, we obtain

   x_n - \frac{f(x_n)}{f'(x_n)} - \alpha = \frac{f''(\xi_n)}{2f'(x_n)}(\alpha - x_n)^2.

Using equation (1), this leads to

   x_{n+1} - \alpha = \frac{f''(\xi_n)}{2f'(x_n)}(x_n - \alpha)^2,

that is,

   \frac{x_{n+1} - \alpha}{(x_n - \alpha)^2} = \frac{f''(\xi_n)}{2f'(x_n)}.

Assuming that \lim_{n\to\infty} x_n = \alpha, we can see that also \lim_{n\to\infty} \xi_n = \alpha (since \xi_n is between x_n and \alpha). Assuming, further, that f'' is continuous at \alpha, it follows that \lim_{n\to\infty} f''(\xi_n) = f''(\alpha). Therefore,

   \lim_{n\to\infty} \frac{x_{n+1} - \alpha}{(x_n - \alpha)^2} = \frac{f''(\alpha)}{2f'(\alpha)}.

Hence Newton's method has quadratic convergence for simple roots.
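The quadratic convergence just derived is easy to observe numerically. The following short illustration (not one of the book's program files) applies Newton's method to f(x) = x^2 - 2, whose positive root is \sqrt{2}; the number of correct digits roughly doubles at each step until machine precision is reached.

#include <stdio.h>
#include <math.h>

int main(void)
{
   double x = 2.0, alpha = sqrt(2.0);
   int n;
   for (n = 0; n < 6; n++) {
      x = x - (x*x - 2.0)/(2.0*x);   /* x_{n+1} = x_n - f(x_n)/f'(x_n) */
      printf("x_%d = %.15f   error = %.2e\n",
             n+1, x, fabs(x - alpha));
   }
   return 0;
}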

18. NUMERICAL DIFFERENTIATION OF TABLES

When differentiating a function given by a table (for example, of experimental data), one can approximate the derivative by a polynomial interpolation formula:

   f(x) = P(x) + E(x),

where P(x) is the interpolation polynomial (given either in the Lagrange or the Newton form), and, for the error term, we have

   (1)  E(x) = \frac{f^{(n)}(\xi)}{n!}\, p(x) = f[x, x_1, \dots, x_n]\, p(x),

where

   p(x) = \prod_{j=1}^{n} (x - x_j)

and \xi is a point in the open interval spanned by the points x, x_1, \dots, x_n; see (1) and (2) in Section 8 on the error term of the Newton interpolation polynomial. Thus f'(x) = P'(x) + E'(x). There is no difficulty here in determining P'(x). As for E'(x), we have

   (2)  E'(x) = \frac{d}{dx}\Big(\frac{f^{(n)}(\xi)}{n!}\Big)\, p(x) + \frac{f^{(n)}(\xi)}{n!}\, p'(x).

The first term on the right-hand side here presents some difficulties, since the dependence of \xi on x is not known; the only thing that is known is that \xi is somewhere in the interval spanned by the points x, x_1, \dots, x_n. If x is one of the interpolation points, we have p(x) = 0, so the first term will be zero anyway. Hence, at any one of the interpolation points, we have

   E'(x) = \frac{f^{(n)}(\xi)}{n!}\, p'(x).

Assume the points x_1, \dots, x_n are distinct, and x belongs to the open interval spanned by x_1, \dots, x_n. Assume, further, that f is differentiable n + 1 times. Then E'(x) can be estimated even if x \ne x_i for every i with 1 <= i <= n. Using the second expression in (1), the right-hand side of (2) can also be written as

   E'(x) = \frac{d\,f[x, x_1, \dots, x_n]}{dx}\, p(x) + f[x, x_1, \dots, x_n]\, p'(x)
         = f^{(n+1)}(\eta)\,\frac{p(x)}{(n+1)!} + f^{(n)}(\xi)\,\frac{p'(x)}{n!},

where the second equality holds for some \xi and \eta in the open interval spanned by the numbers x, x_1, \dots, x_n, according to (1) (for \xi) and to the Lemma in Section 8 on the error term of the Newton interpolation polynomial (for \eta).

Another, more complicated, argument is needed to give an error estimate for the derivative of f if f is differentiable only n times, where n is the number of interpolation points. Instead of restricting ourselves to the first derivative, we will discuss the kth derivative for k with 1 <= k < n. Assuming that f is n times differentiable, E(x) must also be differentiable n times.^44 Furthermore, E(x) = 0 at each of the interpolation points x_1, \dots, x_n. Therefore, by Rolle's Theorem, its kth derivative E^{(k)} has at least n - k zeros in the interval spanned by the points x_1, \dots, x_n. Let these points be \eta_1, \dots, \eta_{n-k}. We have

   f^{(k)}(x) = P^{(k)}(x) + E^{(k)}(x).

Here P^{(k)}(x) is a polynomial of degree n - k - 1, and the error term E^{(k)}(x) is zero at the points \eta_1, \dots, \eta_{n-k}; that is, P^{(k)}(x) interpolates f^{(k)}(x) at these points. Therefore, the error of this interpolation can be expressed as the error of the Lagrange interpolation^45 of the function f^{(k)}(x) at the points \eta_1, \dots, \eta_{n-k}:

   E^{(k)}(x) = \frac{1}{(n-k)!}\,\frac{d^{n-k}}{dt^{n-k}} f^{(k)}(t)\Big|_{t=\xi} \cdot \prod_{i=1}^{n-k}(x - \eta_i)
              = \frac{f^{(n)}(\xi)}{(n-k)!} \prod_{i=1}^{n-k}(x - \eta_i),

where \xi is in the interval spanned by x and by the points \eta_1, \dots, \eta_{n-k}. The points \eta_1, \dots, \eta_{n-k} are not known here, but they are known to be located in the interval spanned by x_1, \dots, x_n, and they do not depend on x (since they are the zeros of E^{(k)}(t) guaranteed by Rolle's theorem).

In fact, a little more precise information can be obtained about the location of the \eta_i's. Assuming x_1 < x_2 < \dots < x_n and \eta_1 < \eta_2 < \dots < \eta_{n-k}, we can say that

   x_i < \eta_i < x_{i+k}

for i with 1 <= i <= n - k. This can be proved by induction on k. For l with 0 <= l < n write \eta_{l,i} with 1 <= i <= n - l for the zeros of E^{(l)}(x) guaranteed by Rolle's Theorem.^46 Then \eta_{0,i} = x_i (1 <= i <= n) and \eta_i = \eta_{k,i} (1 <= i <= n - k). Clearly, x_i < \eta_{1,i} < x_{i+1}, so for k = 1 the above claim about the location of the \eta_i's is correct. Assuming the assertion is valid with k - 1 replacing k, for k > 1 we have x_i < \eta_{k-1,i} < x_{i+k-1} and x_{i+1} < \eta_{k-1,i+1} < x_{i+k}. Since \eta_{k-1,i} < \eta_{k,i} < \eta_{k-1,i+1}, it follows that x_i < \eta_{k,i} < x_{i+k}.

44 Because E(x) = f(x) - P(x).
45 Using the term Lagrange here is somewhat of a misnomer, since the Newton interpolation polynomial is the same as the Lagrange interpolation polynomial, written in a different way, and the Newton interpolation has the same error. The name Lagrange was used only because this error formula was derived on account of a discussion of the Lagrange interpolation formula; the same formula could have been derived on account of the Newton interpolation formula.
46 E^{(l)}(x) may have more zeros than those guaranteed by Rolle's theorem, since their presence is not assured. For example, if E'(x) has more than one zero between x_1 and x_2, simply disregard all but one of them.
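As a concrete instance of differentiating a table: at an interior tabular point, the derivative of the parabola interpolating three consecutive equally spaced entries is (f_{i+1} - f_{i-1})/(2h), and by the formula above its error at that point is of order h^2. The following sketch (the table data are made up for the illustration, here values of x^3 at step h = 0.1) applies this rule:

#include <stdio.h>

int main(void)
{
   double h = 0.1;
   /* table of f(x) = x^3 at x = 0.2, 0.3, 0.4, 0.5, 0.6 */
   double f[] = {0.008, 0.027, 0.064, 0.125, 0.216};
   int i;
   for (i = 1; i <= 3; i++) {
      double x = 0.2 + i*h;
      /* derivative of the interpolating parabola at x_i */
      double deriv = (f[i+1] - f[i-1])/(2.0*h);
      printf("f'(%.1f) is approximately %.4f\n", x, deriv);
   }
   return 0;
}

For example, it gives f'(0.3) as approximately 0.28, while the true value is 0.27; the error is of the size h^2 predicted by the error term.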

19. NUMERICAL DIFFERENTIATION OF FUNCTIONS

Given a function f that is differentiable sufficiently many times, for given x and h > 0 we have, according to Taylor's formula (with the Lagrange remainder term),

   f(x+h) = \sum_{k=0}^{m} \frac{f^{(k)}(x)}{k!} h^k + f^{(m+1)}(\xi_1) \frac{h^{m+1}}{(m+1)!}

and

   f(x-h) = \sum_{k=0}^{m} \frac{f^{(k)}(x)}{k!} (-h)^k + f^{(m+1)}(\xi_2) \frac{(-h)^{m+1}}{(m+1)!},

where m >= 0 is an integer, \xi_1 is between x and x + h, and \xi_2 is between x and x - h. Assume m > 0 is odd, say m = 2n + 1, and subtract these formulas. Since every term for even k will cancel, we obtain

   f(x+h) - f(x-h) = 2\sum_{k=0}^{n} \frac{f^{(2k+1)}(x)}{(2k+1)!} h^{2k+1}
      + \big(f^{(2n+2)}(\xi_1) - f^{(2n+2)}(\xi_2)\big) \frac{h^{2n+2}}{(2n+2)!}.

Divide both sides by 2h, and rearrange the equation so as to isolate f'(x) on the left-hand side. Writing

   -c_k = \frac{f^{(2k+1)}(x)}{(2k+1)!}

and

   -\big(f^{(2n+2)}(\xi_1) - f^{(2n+2)}(\xi_2)\big) \frac{h^{2n+1}}{2\,(2n+2)!} = O(h^{2n+1}),^{47}

we obtain

   (1)  f'(x) = \frac{f(x+h) - f(x-h)}{2h} + \sum_{k=1}^{n} c_k h^{2k} + O(h^{2n+1})
             = \frac{f(x+h) - f(x-h)}{2h} + c_1 h^2 + c_2 h^4 + \dots + c_n h^{2n} + O(h^{2n+1}).

The assumption needed to make this formula valid is that f be differentiable 2n + 2 times near^48 x (n is a nonnegative integer here); this requirement can be weakened somewhat. We will consider this formula for a fixed value of x, while the value of h changes.

When using this formula to estimate f'(x), one would like to choose h as small as possible. The limitation in choosing h too small is the loss of precision when calculating f(x+h) - f(x-h). Assume the absolute value of the error in calculating f(t) for t near x is \delta; the error of formula (1), when one stops after the main term, is about |c_1|h^2. Hence the total error in calculating f'(x) will be

   \lesssim \frac{\delta}{h} + |c_1| h^2,

where the first term is the error resulting from the numerical evaluation of (f(x+h) - f(x-h))/(2h).

47 In a notation introduced by Edmund Landau in the early 20th century, O(f(x)) denotes a function g(x) such that the ratio g(x)/f(x) stays bounded (for certain values of x understood from the context), and o(f(x)) denotes a function g(x) such that the ratio g(x)/f(x) tends to zero (when x tends to a limit understood from the context, usually when x tends to infinity or x tends to zero). When using the notation O(f(x)) or o(f(x)), one usually assumes that f(x) is positive in the region considered. In the present case, we use the notation O(h^{2n+1}) for h close to zero, i.e., when h \to 0.
48 That is, in an open interval containing x - h and x + h.

The above expression for the error assumes its minimum when its derivative with respect to h is 0, that is, when

   -\frac{\delta}{h^2} + 2|c_1| h = 0,

i.e., when

   h = \sqrt[3]{\frac{\delta}{2|c_1|}};

the error in this case is

   \lesssim (2|c_1|)^{1/3}\,\delta^{2/3} + \sqrt[3]{\frac{\delta^2 |c_1|}{4}}.

While c_1 here is not usually known (though it may be possible to estimate it), this should make it clear that there are limitations to how small h can be chosen in a practical calculation. If no estimate for c_1 is available, it is not unreasonable to take c_1 = 1 in the above formula to find the optimal value of h.

Richardson extrapolation. For fixed x, write

   F_1(h) = \frac{f(x+h) - f(x-h)}{2h}.

Then (1) above can be written as

   f'(x) = F_1(h) + c_1 h^2 + c_2 h^4 + \dots + c_n h^{2n} + O(h^{2n+1}).

The same formula with 2h replacing h says

   f'(x) = F_1(2h) + 4c_1 h^2 + 16c_2 h^4 + \dots + 2^{2n} c_n h^{2n} + O(h^{2n+1}).

If we take four times the first of these formulas and subtract the second one, the term corresponding to h^2 will drop out:

   3f'(x) = 4F_1(h) - F_1(2h) - 12c_2 h^4 - \dots - (2^{2n} - 4)c_n h^{2n} + O(h^{2n+1}).

That is, writing

   F_2(h) = \frac{4F_1(h) - F_1(2h)}{3}   and   c_{2,k} = \frac{4 - 2^{2k}}{3} \cdot c_k,

we have

   f'(x) = F_2(h) + c_{2,2} h^4 + c_{2,3} h^6 + \dots + c_{2,n} h^{2n} + O(h^{2n+1}).

If we write this equation with 2h instead of h, we obtain

   f'(x) = F_2(2h) + 16c_{2,2} h^4 + 64c_{2,3} h^6 + \dots + 2^{2n} c_{2,n} h^{2n} + O(h^{2n+1}).

Multiplying the first equation by 16, subtracting the second equation, and dividing by 15, we obtain

   f'(x) = F_3(h) + c_{3,3} h^6 + \dots + c_{3,n} h^{2n} + O(h^{2n+1}),

where

   F_3(h) = \frac{16F_2(h) - F_2(2h)}{15}   and   c_{3,k} = \frac{16 - 2^{2k}}{15} \cdot c_{2,k}.

This process, called Richardson extrapolation, can be continued along for a while, but eventually the procedure may become unstable; that is, once the size of the roundoff error is about equal to the size of the truncation error, continuing the procedure further would produce less accurate results.
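A minimal sketch of the procedure just described may be helpful; it is not one of the book's program files, and the test function e^x is chosen only so that the exact derivative is known. F_1 is the central difference quotient, and F_2 and F_3 eliminate the h^2 and h^4 terms:

#include <stdio.h>
#include <math.h>

double f(double x) { return exp(x); }

double F1(double x, double h)
{
   return (f(x+h) - f(x-h))/(2.0*h);
}

int main(void)
{
   double x = 1.0, h = 0.1;
   double F1h  = F1(x, h), F12h = F1(x, 2.0*h), F14h = F1(x, 4.0*h);
   double F2h  = (4.0*F1h  - F12h)/3.0;    /* eliminates h^2 term */
   double F22h = (4.0*F12h - F14h)/3.0;
   double F3h  = (16.0*F2h - F22h)/15.0;   /* eliminates h^4 term */
   printf("F1(h) = %.12f\n", F1h);
   printf("F2(h) = %.12f\n", F2h);
   printf("F3(h) = %.12f\n", F3h);
   printf("exact = %.12f\n", exp(1.0));    /* f'(1) = e */
   return 0;
}

Each successive column agrees with the exact value e = 2.718281828459 in roughly two more decimal digits than the previous one.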

Higher derivatives. In the section on Hermite interpolation we showed that

   f[x_0, \dots, x_n] = \frac{f^{(n)}(\xi)}{n!}

for some \xi in the interval spanned by the numbers x_0, \dots, x_n, provided f is differentiable n times in this interval. This formula can be used to estimate the nth derivative of a function:

   f^{(n)}(x) \approx n!\, f[x_0, \dots, x_n],

where one usually takes the points x_0, \dots, x_n to be equidistant, i.e., x_k = x_0 + kh for some h > 0 (0 <= k <= n), and x is taken to be the midpoint of the interval (x_0, x_n), i.e., x = x_0 + nh/2. Using the forward difference operator \Delta f(t) = f(t+h) - f(t) and the forward shift operator Ef(x) = f(x+h), this formula can be written as

   f^{(n)}(x) \approx \frac{\Delta^n f(x_0)}{h^n} = \frac{\Delta^n E^{-n/2} f(x)}{h^n}

according to formula (1) in Section 10 on Newton interpolation with equidistant points.^49 For example, for n = 1 this formula gives

   f'(x) \approx \frac{f(x + h/2) - f(x - h/2)}{h},

which is the same formula we had above, with h/2 replacing h; Taylor's formula can be used to estimate the error of this formula. With n = 2, the same formula gives

   f''(x) \approx \frac{f(x+h) - 2f(x) + f(x-h)}{h^2}.

Using Taylor's formula for f(x+h) and f(x-h) involving the first two derivatives and using the remainder term with the third derivative (i.e., using the Taylor expansions given at the beginning of this section with m = 2), we obtain

   f''(x) = \frac{f(x+h) - 2f(x) + f(x-h)}{h^2} + \frac{h}{6}\big(f'''(\xi_1) - f'''(\xi_2)\big)

with some \xi_1 \in (x - h, x) and \xi_2 \in (x, x + h). If higher order derivatives also exist, one can again use Richardson extrapolation to obtain a more accurate result without having to use a smaller value of h.

49 The second equation uses the observation that x = x_0 + nh/2, so x_0 = x - nh/2, and hence f(x_0) = E^{-n/2} f(x).
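The n = 2 case above is easy to try out. The following brief sketch (illustrative only; the test function sin x is an assumption made for the example, with f''(1) = -sin 1 known exactly) shows the error of the second difference quotient shrinking like h^2:

#include <stdio.h>
#include <math.h>

int main(void)
{
   double x = 1.0, h;
   for (h = 0.1; h >= 0.001; h /= 10.0) {
      /* f''(x) estimated by (f(x+h)-2f(x)+f(x-h))/h^2 */
      double d2 = (sin(x+h) - 2.0*sin(x) + sin(x-h))/(h*h);
      printf("h=%.3f  f''(1) is approximately % .8f\n", h, d2);
   }
   printf("exact value: % .8f\n", -sin(1.0));
   return 0;
}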

Problems

1. Consider the formula

   \bar f(x, h) = \frac{f(x+h) - f(x-h)}{2h}

to approximate the derivative of a function f. Assume we are able to evaluate f with about 5 decimal precision, and assume, further, that f'''(1) \approx 1. What is the best value of h to approximate the derivative?

Solution. We have

   f'(x) = \frac{f(x+h) - f(x-h)}{2h} - \frac{f'''(x)}{3!} h^2 + O(h^3).

We are able to evaluate f(x) with 5 decimal precision, i.e., with an error of 5 \cdot 10^{-6}. Thus, the (absolute value of the maximum) error in evaluating (f(x+h) - f(x-h))/(2h) is 5 \cdot 10^{-6}/h. So the absolute value of the total error (roundoff error plus truncation error) in evaluating f'(x) is

   \frac{5 \cdot 10^{-6}}{h} + \frac{|f'''(x)|}{6} h^2 \approx \frac{5 \cdot 10^{-6}}{h} + \frac{h^2}{6},

as f'''(x) \approx 1. The derivative of the right-hand side with respect to h is

   -\frac{5 \cdot 10^{-6}}{h^2} + \frac{h}{3}.

Equating this with 0 gives the place of minimum error, when h^3 = 15 \cdot 10^{-6}, i.e., h \approx 0.0246.

2. Consider the formula

   \bar f(x, h) = \frac{f(x+h) - f(x-h)}{2h}

to approximate the derivative of a function f. Assume we are able to evaluate f with about 6 decimal precision, and assume, further, that f'''(1) \approx 1. What is the best value of h to approximate the derivative?

Solution. We have

   f'(x) = \frac{f(x+h) - f(x-h)}{2h} - \frac{f'''(x)}{3!} h^2 + O(h^3).

We are able to evaluate f(x) with 6 decimal precision, i.e., with an error of 5 \cdot 10^{-7}. Thus, the (absolute value of the maximum) error in evaluating (f(x+h) - f(x-h))/(2h) is 5 \cdot 10^{-7}/h. So the absolute value of the total error (roundoff error plus truncation error) in evaluating f'(x) is

   \frac{5 \cdot 10^{-7}}{h} + \frac{|f'''(x)|}{6} h^2 \approx \frac{5 \cdot 10^{-7}}{h} + \frac{h^2}{6},

as f'''(x) \approx 1. The derivative of the right-hand side with respect to h is

   -\frac{5 \cdot 10^{-7}}{h^2} + \frac{h}{3}.

Equating this with 0 gives the place of minimum error, when h^3 = 15 \cdot 10^{-7}, i.e., h \approx 0.0114.

3. Given a certain function f, we are using the formula

   \bar f(x, h) = \frac{f(x+h) - f(x-h)}{2h}

to approximate its derivative. We have \bar f(1, 0.1) = 5.135466136 and \bar f(1, 0.2) = 5.657177752. Using Richardson extrapolation, find a better approximation for f'(1).

Solution. We have

   f'(x) = \bar f(x, h) + c_1 h^2 + c_2 h^4 + \dots
   and
   f'(x) = \bar f(x, 2h) + c_1 (2h)^2 + c_2 (2h)^4 + \dots,

with some c_1, c_2, \dots. Multiplying the first equation by 4 and subtracting the second one, we obtain

   3f'(x) = 4\bar f(x, h) - \bar f(x, 2h) - 12c_2 h^4 - \dots.

So, with h = 0.1, we have

   f'(x) \approx \frac{4\bar f(x, h) - \bar f(x, 2h)}{3} = \frac{4 \cdot 5.135466136 - 5.657177752}{3} \approx 4.961562264.

The function in the example is f(x) = x \tan x, and f'(1) \approx 4.982926545.

4. Given a certain function f, we are using the formula

   \bar f(x, h) = \frac{f(x+h) - f(x-h)}{2h}

to approximate its derivative. We have \bar f(2, 0.125) = 12.015625 and \bar f(2, 0.25) = 12.062500. Using Richardson extrapolation, find a better approximation for f'(2).

Solution. We have

   f'(x) = \bar f(x, h) + c_1 h^2 + c_2 h^4 + \dots
   and
   f'(x) = \bar f(x, 2h) + c_1 (2h)^2 + c_2 (2h)^4 + \dots,

with some c_1, c_2, \dots. Multiplying the first equation by 4 and subtracting the second one, we obtain

   3f'(x) = 4\bar f(x, h) - \bar f(x, 2h) - 12c_2 h^4 - \dots.

So, with h = 0.125, we have

   f'(x) \approx \frac{4\bar f(x, h) - \bar f(x, 2h)}{3} = \frac{4 \cdot 12.015625 - 12.062500}{3} = 12.000000.

The function in the example is f(x) = x^3, and f'(2) = 12 is exact.^50

20. SIMPLE NUMERICAL INTEGRATION FORMULAS

If we approximate a function f on an interval [a, b] by an interpolation polynomial P, we can integrate the formula f(x) = P(x) + E(x) to approximate the integral of f. The polynomial P is easy to integrate, and often one can get reasonable estimates of the integral of the error E. We consider two simple rules: the trapezoidal rule, and Simpson's rule. When considering these rules, the function f will be assumed to be Riemann-integrable on [a, b]; hence the error term E(x) = f(x) - P(x) will also be Riemann-integrable, even though this might not be apparent from the formula we may obtain for E(x).

50 It is natural that one step of Richardson extrapolation gives the exact result for a cubic polynomial: after all, all the coefficients c_2, c_3, \dots in the Taylor expansion of f'(x) above are zero. That is, if we eliminate the coefficient c_1, we must get an exact result.

The trapezoidal rule. Approximate f on the interval [0, h] by the linear polynomial P interpolating at the points 0 and h:

   f(x) = P(x) + E(x),   where   P(x) = f[0] + f[0, h]x.

Using the Newton interpolation formula, we have, for the error term,

   E(x) = f[x, 0, h]\, x(x - h) = \frac{1}{2} f''(\xi_x)\, x(x - h),

where \xi_x \in (0, h) depends on x, according to Lemma 1 in the section on Hermite interpolation (Section 7), assuming that f is continuous on [0, h] and twice differentiable in (0, h). Integrating these on the interval [0, h], we obtain for the main term

   \int_0^h P(x)\,dx = \int_0^h (f[0] + f[0,h]x)\,dx = f[0]\,h + f[0,h]\,\frac{h^2}{2}
      = f(0)h + \frac{f(h) - f(0)}{h} \cdot \frac{h^2}{2} = \frac{h}{2}(f(0) + f(h)).

For the error, we need a simple Mean-Value Theorem for integrals that we will now describe. Given integrable functions \phi and \psi on the interval [a, b] such that m < \phi(x) < M and \psi(x) >= 0 for x \in (a, b), we have

   \int_a^b \phi(x)\psi(x)\,dx = H \int_a^b \psi(x)\,dx

for some number H with m <= H <= M, provided that the integrals on both sides of this equation exist.^51 If \int_a^b \psi = 0, then the integrals on both sides of the last equation are zero, so the value of H makes no difference. If \int_a^b \psi > 0, then one can take

   H = \frac{\int_a^b \phi(x)\psi(x)\,dx}{\int_a^b \psi(x)\,dx};

then m <= H <= M can easily be shown. It can also be shown, in case of the usual form of error terms, that H = m or H = M cannot happen, that is, one can show that m < H < M.

In case \phi(x) = g'(\xi_x) for some function g that is differentiable on (a, b), where \xi_x \in (a, b) depends on x, this equality can be further simplified:

   (1)  \int_a^b g'(\xi_x)\psi(x)\,dx = g'(\eta) \int_a^b \psi(x)\,dx

for some \eta \in (a, b). This is because the derivative of a function has the intermediate-value property;^52 this means that the set {g'(x) : x \in (a, b)} is an interval (open, closed, or semi-closed). Hence the number H above, with the choice of \phi(x) = g'(\xi_x), can always be written as g'(\eta) for some \eta \in (a, b). Above, we assumed that the function \phi(x) is bounded. If g' is bounded from above or from below, then the useful half of the argument above is still applicable so as to establish (1); however, if g' is bounded neither from above nor from below on (a, b), then (1) does not say anything useful, since g'(\eta) can be any real number (because g' has the intermediate value property). It is clear that (1) is also valid if we assume \psi(x) <= 0 for all x \in (a, b) instead of assuming \psi(x) >= 0.^53

Using this Mean-Value Theorem, and noting that \psi(x) = x(x - h) <= 0 on (0, h), the integral of the error term can be written as

   \int_0^h E(x)\,dx = \frac{1}{2}\int_0^h f''(\xi_x)\,x(x-h)\,dx = \frac{1}{2} f''(\eta) \int_0^h (x^2 - hx)\,dx
      = \frac{1}{2} f''(\eta)\Big(\frac{h^3}{3} - h \cdot \frac{h^2}{2}\Big) = -\frac{h^3}{12} f''(\eta),

with some \eta \in (0, h). With a simple change of variable t = a + x, taking h = b - a, we obtain

   \int_a^b f = \frac{b-a}{2}\,(f(a) + f(b)) - \frac{(b-a)^3}{12}\, f''(\eta)

with some \eta \in (a, b), provided f is continuous in [a, b] and twice differentiable in (a, b).

51 The values of \phi(x) and \psi(x) at a or b clearly have no influence on the value of the integrals, so it was sufficient to impose conditions on these values only for x \in (a, b).
52 That is, if g'(\alpha) < c < g'(\beta) or g'(\alpha) > c > g'(\beta) for some \alpha, \beta \in (a, b) with \alpha < \beta, then we have g'(\xi) = c for some \xi with \alpha < \xi < \beta; this is true even if g' is not continuous.
53 To show this, use (1) with -\psi replacing \psi.

Simpson's rule. In Simpson's rule, a function is interpolated by a quadratic polynomial at the three points -h, 0, and h. To get a better and simpler error estimate, we use the Newton–Hermite interpolation polynomial at the points 0, -h, h, 0, which can be written as

   f(x) = P(x) + E(x) = f[0] + f[0,-h]x + f[0,-h,h]x(x+h)
      + f[0,-h,h,0]x(x+h)(x-h) + f[x,0,-h,h,0]x^2(x+h)(x-h),

where the last term on the right-hand side is the error term, but the cubic part of the interpolation polynomial will make no contribution to the integral. Indeed, for the main term we have

   \int_{-h}^{h} P(x)\,dx = \int_{-h}^{h} \big(f[0] + f[0,-h]x + f[0,-h,h](x^2+hx) + f[0,-h,h,0]x(x^2-h^2)\big)\,dx
      = f[0] \cdot 2h + f[0,-h,h] \cdot \frac{2h^3}{3} = \frac{h}{3}(f(-h) + 4f(0) + f(h)).

The main point of this equation is that the integral on the interval [-h, h] of the fourth (cubic) term on the right-hand side is zero. Assuming that f is continuous on [-h, h] and four times differentiable in (-h, h), for the error term we have, according to Lemma 1 in the section on Hermite interpolation (Section 7),^54

   E(x) = f[x,0,-h,h,0]\,x^2(x+h)(x-h) = \frac{1}{4!} f^{(4)}(\xi_x)\,x^2(x^2 - h^2)

for some \xi_x \in (-h, h) depending on x. As the polynomial multiplying the derivative on the right-hand side is nonpositive for x \in (-h, h), we can use the Mean-Value Theorem (1) to estimate the integral of the error:

   \int_{-h}^{h} E(x)\,dx = \frac{1}{24}\int_{-h}^{h} f^{(4)}(\xi_x)(x^4 - h^2 x^2)\,dx
      = \frac{f^{(4)}(\eta)}{24}\int_{-h}^{h}(x^4 - h^2 x^2)\,dx
      = \frac{f^{(4)}(\eta)}{24}\Big(\frac{2h^5}{5} - h^2 \cdot \frac{2h^3}{3}\Big) = -\frac{h^5}{90} f^{(4)}(\eta),

with some \eta \in (-h, h).

54 See the comment at the end of Section 7 (before the Problems) saying that the Lemma can be extended to the case of repeated interpolation points.

Writing h = (b - a)/2, with the change of variable t = a + h + x, we obtain

   \int_a^b f = \frac{b-a}{6}\Big(f(a) + 4f\Big(\frac{a+b}{2}\Big) + f(b)\Big) - \frac{(b-a)^5}{2^5 \cdot 90}\, f^{(4)}(\eta)

with some \eta \in (a, b), provided f is continuous in [a, b] and four times differentiable in (a, b). Here 2^5 \cdot 90 = 2880.

The above discussion assumed that the fourth derivative of f exists. If we assume that f has only three derivatives on (-h, h), we need to use the Newton interpolation polynomial at the points 0, -h, h:

   f(x) = P(x) + E(x) = f[0] + f[0,-h]x + f[0,-h,h]x(x+h) + f[x,0,-h,h]x(x+h)(x-h),

and the error term here is

   E(x) = f[x,0,-h,h]\,x(x+h)(x-h) = \frac{1}{3!} f'''(\xi_x)\,x(x^2 - h^2)

for some \xi_x \in (-h, h). The calculation involving the main term is straightforward. As for the error term, the problem here is that x(x^2 - h^2) changes sign on (-h, h), hence formula (1) above is not directly applicable; we need to apply formula (1) separately on the intervals [0, h] and [-h, 0].^55 We have

   \int_0^h E(x)\,dx = \frac{1}{6}\int_0^h f'''(\xi_x)(x^3 - h^2 x)\,dx = \frac{f'''(\eta_1)}{6}\int_0^h (x^3 - h^2 x)\,dx
      = \frac{f'''(\eta_1)}{6}\Big(\frac{h^4}{4} - h^2 \cdot \frac{h^2}{2}\Big) = -\frac{h^4}{24} f'''(\eta_1)

with some \eta_1 \in (-h, h). Similarly,

   \int_{-h}^0 E(x)\,dx = \frac{1}{6}\int_{-h}^0 f'''(\xi_x)(x^3 - h^2 x)\,dx = \frac{f'''(\eta_2)}{6}\Big(-\frac{h^4}{4} + h^2 \cdot \frac{h^2}{2}\Big) = \frac{h^4}{24} f'''(\eta_2)

with some \eta_2 \in (-h, h). So we obtain that

   \int_{-h}^{h} E(x)\,dx = \frac{h^4}{24}\big(f'''(\eta_2) - f'''(\eta_1)\big)

for some \eta_1, \eta_2 \in (-h, h).

21. COMPOSITE NUMERICAL INTEGRATION FORMULAS

In composite integration formulas, one divides the interval [a, b] into equal parts, on each part one uses the same simple numerical integration formula, and one adds the results.

Composite trapezoidal rule. Let f be a continuous function on the interval [a, b], and assume that f is twice differentiable on (a, b). Let n be a positive integer, put h = (b-a)/n, and let x_i = a + ih for i with 0 <= i <= n. We use the trapezoidal rule on each of the intervals [x_{i-1}, x_i] (1 <= i <= n), and add the results. For the sum of the error terms, we have

   -\frac{h^3}{12} \sum_{i=1}^{n} f''(\eta_i),

where \eta_i \in (x_{i-1}, x_i) for each i. Since the derivative of any function satisfies the intermediate-value property, the sum equals n f''(\xi) for some \xi \in (a, b). Thus, the error term can be written as

   -\frac{(b-a)^3}{12n^2}\, f''(\xi),

and we have

   \int_a^b f = \frac{b-a}{2n}\Big(f(a) + 2\sum_{i=1}^{n-1} f(x_i) + f(b)\Big) - \frac{(b-a)^3}{12n^2}\, f''(\xi).

55 Formula (1) above is not applicable literally, since we only have \xi_x \in (-h, h), and so \xi_x need not belong to the interval of integration (0, h). It is easy to show, however, that (1) can be generalized to the case where one assumes that \xi_x belongs to another interval (c, d), and not the interval (a, b) of integration.
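The composite trapezoidal rule is easy to program; the following sketch is not one of the book's program files but is written in the style of the Simpson program given later in this section, with the same test integral:

#include <stdio.h>
#include <math.h>

/* Composite trapezoidal rule on [lower, upper] with n parts. */
double trapezoid(double (*fnct)(double), double lower,
                 double upper, int n)
{
   double h = (upper - lower)/((double) n);
   double x = lower;
   double integr = ((*fnct)(lower) + (*fnct)(upper))/2.0;
   int i;
   for (i = 1; i < n; i++) {
      x += h;
      integr += (*fnct)(x);   /* interior points get weight 1 */
   }
   return integr * h;
}

double funct(double x) { return 1.0/sqrt(1.0 + x*x*x); }

int main(void)
{
   printf("n=100:  %.12f\n", trapezoid(&funct, 0.0, 4.0, 100));
   printf("n=1000: %.12f\n", trapezoid(&funct, 0.0, 4.0, 1000));
   return 0;
}

In agreement with the error term above, increasing n tenfold decreases the truncation error by a factor of about one hundred.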

The error committed when using this formula to calculate an integral results from the error term and from the accumulation of the errors committed in calculating each term on the right-hand side. To estimate the latter, let \delta be the maximum error in calculating the value of f. Then, assuming the errors in evaluating the function values are independent of one another, the error resulting from adding n function values can be considered a random variable having approximately a normal distribution with mean 0 and standard deviation \delta\sqrt{n/6}; this error will be less than 2.58\,\delta\sqrt{n/6} with 99% probability.^56 Thus, the error in calculating the right-hand side of the above formula is about^57

   \frac{b-a}{2n} \cdot 2 \cdot 2.58\,\delta\sqrt{n/6} \approx 1.053\,\delta \cdot \frac{b-a}{\sqrt{n}} \approx \delta \cdot \frac{b-a}{\sqrt{n}}.

Noting that the truncation error (i.e., the error given by the error term) decreases in proportion to 1/n^2, while the roundoff error decreases in proportion to 1/\sqrt{n} (i.e., much more slowly), there is relatively little payoff in increasing the value of n once the size of the roundoff error is about equal to the size of the truncation error.

Composite Simpson's rule. Let f be a continuous function on the interval [a, b], and assume that f is four times differentiable on (a, b). Let n be a positive integer, put h = (b-a)/(2n), and let x_i = a + ih for i with 0 <= i <= 2n; that is, we divide the interval [a, b] into 2n equal parts. Using Simpson's rule on each of the intervals [x_{2k-2}, x_{2k}] for k with 1 <= k <= n, and adding the results, we obtain

   \int_a^b f = \frac{b-a}{6n}\Big(f(a) + 4f(x_1) + \sum_{k=1}^{n-1}\big(2f(x_{2k}) + 4f(x_{2k+1})\big) + f(b)\Big) - \frac{(b-a)^5}{2880 n^4}\, f^{(4)}(\xi)

for some \xi \in (a, b). The error calculation uses the fact that f^{(4)} satisfies the intermediate-value property, in a way similar to the error calculation in the composite trapezoidal rule.

As an example, Simpson's rule is used to calculate the integral

   \int_0^4 \frac{dx}{\sqrt{1 + x^3}}.

The function is defined in the file funct.c:

56 The calculation on pp. 10–11 in A. Ralston–P. Rabinowitz [RR] concerns the case when \delta = 1/2.
57 This calculation of the error assumes that n terms are added on the right-hand side of the above integration formula, and that this sum is multiplied by 2; the number of terms actually added is n + 1, and not all of them are multiplied by 2.

   1 #include <math.h>
   2
   3 double funct(x)
   4 double x;
   5 {
   6    double value;
   7    value = 1.0/sqrt(1.0+x*x*x);
   8    return(value);
   9 }

The calling program and Simpson's rule are contained in the file main.c:

   1 #include <stdio.h>
   2 #include <math.h>
   3
   4 double funct(double);
   5 double simpson(double (*fnct)(double), double lower,
   6    double upper, int n);
   7
   8 main()
   9 {
  10    double lower = 0.0, upper = 4.0, integr;
  11    int n;
  12    printf("Type n: ");
  13    scanf("%d", &n);
  14    integr = simpson(&funct, lower, upper, n);
  15    printf("The integral is %.12f\n", integr);
  16 }
  17
  18 double simpson(double (*fnct)(double), double lower,
  19    double upper, int n)
  20 {
  21    double x, h, integr;
  22    int i;
  23
  24    h = (upper - lower) / 2.0 / ((double) n);
  25    x = lower;
  26
  27    integr = (*fnct)(x);
  28    for(i=0; i < n; i++) {
  29       x += h;
  30       integr += 4 * (*fnct)(x);
  31       x += h;
  32       integr += 2 * (*fnct)(x);
  33    }
  34    integr -= (*fnct)(upper);
  35    return integr *= h / 3.0;
  36 }

Lines 18–36 contain the function simpson implementing Simpson's rule; its prototype is given in lines 5–6. The parameters are a pointer (*fnct) to the function to be integrated, the lower limit lower, the upper limit upper, and the number n of the pairs of intervals used (i.e., the interval (a, b) is to be divided into 2n parts).

The calculation is fairly straightforward; the only point to be mentioned is that in lines 27, 30, 32, and 34 the function is referred to as (*fnct), that is, as the function in the location pointed to by the pointer fnct. In the calling program, given in lines 8–16, the number n is entered by the user on line 13, so that the user can experiment with different values of n. For example, with n = 40, the printout is

   1 Type n: 40
   2 The integral is 1.805473405017

Here, the number 40 on line 1 is typed by the user. With n = 200, the printout is

   1 Type n: 200
   2 The integral is 1.805473301824

The actual value of the integral is 1.805473301821.

Problems

1. We want to evaluate

   \int_0^1 e^{-x^2}\,dx

using the composite trapezoidal rule with four decimal precision, i.e., with the absolute value of the error not exceeding 5 \cdot 10^{-5}. What value of n should one use when dividing the interval [0, 1] into n parts?

Solution. The error term in the composite trapezoidal formula, when integrating f on the interval [a, b] and dividing the interval into n parts, is

   -\frac{(b-a)^3}{12n^2}\, f''(\xi).

We want to use this with a = 0, b = 1, and f(x) = e^{-x^2}. We have

   f''(x) = (4x^2 - 2)e^{-x^2}.

We want to find the maximum of the absolute value of this on the interval [0, 1]. For the third derivative we have

   f'''(x) = (12x - 8x^3)e^{-x^2} = 4x(3 - 2x^2)e^{-x^2} > 0   for x \in (0, 1).

Hence f''(x) is increasing on [0, 1]. Thus

   -2 = f''(0) <= f''(x) <= f''(1) = 2e^{-1} \approx 0.735759   for x \in [0, 1].

Hence, noting that a = 0 and b = 1, the absolute value of the error is

   \frac{(b-a)^3}{12n^2}|f''(\xi)| = \frac{1}{12n^2}|f''(\xi)| <= \frac{1}{12n^2} \cdot 2 = \frac{1}{6n^2}.

In order to ensure that this error is less than 5 \cdot 10^{-5}, we need to have 1/(6n^2) < 5 \cdot 10^{-5}, i.e.,

   n > \sqrt{\frac{10^5}{30}} \approx 57.735.

So one needs to make sure that n >= 58. Thus one needs to divide the interval [0, 1] into (at least) 58 parts in order to get the result with 4 decimal precision while using the trapezoidal rule. It is interesting to compare this with Simpson's rule (next problem), which gives the result with 4 decimal precision if one divides the interval into (at least) 8 parts.

2. We want to evaluate

   \int_0^1 e^{-x^2}\,dx

using the composite Simpson rule with four decimal precision, i.e., with the absolute value of the error not exceeding 5 \cdot 10^{-5}. What value of n should one use when dividing the interval [0, 1] into 2n parts?

Solution. The error term in the composite Simpson formula, when integrating f on the interval [a, b] and dividing the interval into 2n parts, is

   -\frac{(b-a)^5}{2880 n^4}\, f^{(4)}(\xi).

We want to use this with a = 0, b = 1, and f(x) = e^{-x^2}. We have

   f^{(4)}(x) = (16x^4 - 48x^2 + 12)e^{-x^2}.

We want to find the maximum of the absolute value of this on the interval [0, 1]. For the fifth derivative we have

   f^{(5)}(x) = -(32x^5 - 160x^3 + 120x)e^{-x^2}.

One can easily find the zeros of this, since solving this equation amounts to solving a quadratic equation. We find that the zeros are x = 0, x \approx \pm 0.958572, and x \approx \pm 2.02018. Only one of these is in the interval (0, 1]: x \approx 0.958572. The sixth derivative is

   f^{(6)}(x) = (64x^6 - 480x^4 + 720x^2 - 120)e^{-x^2},

and for x \approx 0.958572 it is positive; that is, the fourth derivative has a local minimum at x \approx 0.958572. The fourth derivative at x \approx 0.958572 is approximately -7.41948, and at x = 1 it is approximately -7.35759; at x = 0 it is 12. Since the fifth derivative is not zero anywhere else in the interval (0, 1), x \approx 0.958572 is the place of the absolute minimum of the fourth derivative in [0, 1]. Thus the absolute maximum of the absolute value of the fourth derivative on the interval [0, 1] is 12 (assumed at x = 0). Using this in the above error formula, we find that the absolute value of the error is at most

   \frac{1}{2880 n^4} \cdot 12.

In order to achieve four decimal precision, this error should be less than 5 \cdot 10^{-5}, that is,

   \frac{12}{2880 n^4} < 5 \cdot 10^{-5},

i.e., we must have

   n^4 > \frac{12 \cdot 10^5}{5 \cdot 2880} = \frac{250}{3} \approx 83.3.

This is satisfied with n = 4 (n = 3 comes close, since 3^4 = 81). That is, when we use the composite Simpson rule with n = 4 (i.e., when we divide the interval into eight parts), we will obtain the result with four decimal precision.
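The arithmetic at the end of Problems 1 and 2 can be carried out in a few lines of C; this snippet is illustrative only, with the derivative bounds 2 and 12 obtained above entered by hand:

#include <stdio.h>
#include <math.h>

int main(void)
{
   double eps = 5e-5;
   /* smallest n with (1/(12 n^2))*2 < eps, i.e., 1/(6n^2) < eps */
   int n_trap = (int) ceil(sqrt(1.0/(6.0*eps)));
   /* smallest n with (1/(2880 n^4))*12 < eps */
   int n_simp = (int) ceil(pow(12.0/(2880.0*eps), 0.25));
   printf("composite trapezoidal rule: n >= %d\n", n_trap);  /* 58 */
   printf("composite Simpson rule:     n >= %d\n", n_simp);  /*  4 */
   return 0;
}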

22. ADAPTIVE INTEGRATION

When numerically approximating the integral of a function, one usually divides the interval of integration into parts, and on each part one uses simple integration formulas, such as the trapezoidal rule and Simpson's rule. Often dividing the interval into equal parts is not the best thing to do, since on certain parts of the interval of integration the function may be better behaved, requiring a coarser subdivision, than in other parts, requiring a finer subdivision. This is the problem adaptive integration intends to deal with. Adaptive integration can be used with various simple integration rules; in the following discussion we consider it with the trapezoidal rule. We will consider in more detail how this is done.

Suppose we want to determine the integral of a function f on an interval I with a maximum absolute error of \epsilon. Assume the trapezoidal rule on this interval approximates the integral as T(I). Then we divide I into two equal parts, I_0 and I_1, and consider the trapezoidal rule approximations T(I_0) and T(I_1) on each of these parts. Since T(I) and T(I_0) + T(I_1) both approximate \int_I f, we have

   T(I) \approx T(I_0) + T(I_1).

This equation will, however, not be exact. Since, using a finer subdivision of the interval, one would expect that the right-hand side gives a better approximation, one can use the quantity

   T(I) - \big(T(I_0) + T(I_1)\big)

to estimate the accuracy of this approximation. In fact, writing |I| for the length of the interval I, we have

   E(I) = -\frac{|I|^3}{12} f''(\xi_I),

where \xi_I \in I, for the error of the trapezoidal rule on I, and, noting that |I_0| = |I_1| = |I|/2,

   E(I_0) = -\frac{|I|^3}{8 \cdot 12} f''(\xi_{I_0}),   E(I_1) = -\frac{|I|^3}{8 \cdot 12} f''(\xi_{I_1}),

where \xi_{I_0} \in I_0 and \xi_{I_1} \in I_1. Making the assumption that

   (1)  f''(\xi_I) \approx f''(\xi_{I_0}) \approx f''(\xi_{I_1}),

it follows that

   E(I_0) \approx E(I_1) \approx \frac{E(I)}{8}.

Hence, noting that

   \int_I f = T(I) + E(I) = T(I_0) + T(I_1) + E(I_0) + E(I_1),

we have

   T(I_0) + T(I_1) - T(I) = E(I) - E(I_0) - E(I_1) \approx (8 - 1 - 1) \cdot E(I_0) = 6E(I_0).

Hence

   E(I_0) \approx E(I_1) \approx \frac{1}{6}\big(T(I_0) + T(I_1) - T(I)\big).

Thus the error in approximating \int_I f by T(I_0) + T(I_1) is

   E(I_0) + E(I_1) \approx \frac{1}{3}\big(T(I_0) + T(I_1) - T(I)\big).

Thus, if

   (2)  \frac{1}{3}\,|T(I_0) + T(I_1) - T(I)| < \epsilon,

then we accept the approximation

   \int_I f \approx T(I_0) + T(I_1).

If this inequality is not satisfied, then we work with the intervals I_0 and I_1 separately. Noting that

   \int_I f = \int_{I_0} f + \int_{I_1} f,

we try to determine the integral on I_0 with an allowable error less than \epsilon/2, and, similarly, we try to determine the integral on I_1 with an allowable error less than \epsilon/2. If we are successful, the total error committed on I will be less than \epsilon. When continuing this procedure, some parts of I may have to be subdivided many times, others maybe only a few times. Of course, there is no certainty that the total error will indeed be less than \epsilon, since the assumption (1) may fail, especially if we did not make sure that we divided the interval into sufficiently many parts. Therefore, an additional precaution may be that (2) is only accepted as a guarantee that the error is small enough if the length of I itself is not too large.

A C program implementation of adaptive integration is contained in the file adapt.c. The header file adapt.h to this file is

   1 #include <stdio.h>
   2 #include <math.h>
   3
   4 #define absval(x) ((x) >= 0.0 ? (x) : (-(x)))
   5
   6 double funct(double);
   7 double trapadapt(double (*fnct)(double), double a,
   8    double b, double fa, double fb,
   9    double epsilon, double maxh, double minh,
  10    double *bada, double *badb,
  11    int *success);

On line 4, absval(x), to calculate the absolute value of x, is defined; on line 6 the declaration of funct, the function to be integrated (defined in a separate file), is given; and in lines 7–11 the declaration of the function trapadapt, discussed next, is given. The file adapt.c contains the definition of this function:

   1 #include "adapt.h"
   2
   3 double trapadapt(double (*fnct)(double), double a,
   4    double b, double fa, double fb,
   5    double epsilon, double maxh, double minh,
   6    double *bada, double *badb,
   7    int *success)
   8 {
   9    double x, mid, fmid, h, t1, t2,
  10       integr1, integr2, integr = 0.0;
  11
  12    h = b - a;
  13    if ( h >= minh ) {
  14       mid = (a + b)/2.0;
  15       fmid = (*fnct)(mid);
  16       t1 = (fa+fb)*h/2.0;
  17       t2 = (fa+2.0*fmid+fb)*h/4.0;
  18       if ( h<=maxh && absval(t2-t1)<= 3.0*epsilon ) {
  19          integr = t2;
  20          *success = 1;
  21       }
  22       else {
  23          integr1 = trapadapt(fnct, a, mid, fa, fmid,
  24             epsilon/2.0, maxh, minh, bada, badb, success);
  25          if ( *success ) {
  26             integr2 = trapadapt(fnct, mid, b, fmid, fb,
  27                epsilon/2.0, maxh, minh, bada, badb, success);
  28          }
  29          if ( *success ) {
  30             integr = integr1+integr2;
  31          }
  32       }
  33    }
  34    else {
  35       *success = 0;
  36       *bada = a; *badb = b;
  37    }
  38    return integr;
  39 }

*success = 1. the endpoints a and b are placed into the locations *bada and badb on line 33. the limits of the integral a and b.0*fmid+fb)*h/4. pointers *bada and *badb. 5 { 6 double value. and the variable success that is 1 (true) if the method is successful and 0 (false) otherwise.0.0*epsilon ) { integr = t2. The function trapadapt has parameters (mentioned in lines 3–7) (*fnct). } else { integr1 = trapadapt(fnct. bada. the values fa and fb. bada. epsilon/2. the minimum length minh of a subdivision (if the error is too large with an interval of length minh. *badb = b. *bada = a. and in line 16 it is decided whether the calculation is successful (when h<=maxh and the the absolute value of the difference t2-t1 estimating the error is small enough). success). fmid = (*fnct)(mid). fmid.b) and its two parts are calculated. the maximum length max h of a subdivision that need not be subdivided further. fa. 8 return(value). } if ( *success ) { integr = integr1+integr2. t2 = (fa+2. maxh.b). a. minh. The sum of the two integrals on the subintervals.0 instead of epsilon (since each of these subintervals allow only half the error of the whole interval.0/sqrt(1. mid. badb. t1 = (fa+fb)*h/2. . so it can be decided later to use a different method to determine the integral on the interval where the method failed). if ( *success ) { integr2 = trapadapt(fnct. to the endpoints of the subinterval on which the calculation failed (the calling program can read these values. Adaptive integration 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 } 77 if ( h >= minh ) { mid = (a + b)/2. fb. maxh. the calculation will be declared a failure). If the calculation is not successful. if ( h<=maxh && absval(t2-t1)<= 3. The file funct.mid) and (mid.b) with epsilon/2. epsilon/2.h" 2 3 double funct(x) 4 double x.0. 7 value = 1. of the function at a and b. b.c contains the definition of the function funct: 1 #include "adapt. the maximum allowable error epsilon on the interval (a. } } } else { *success = 0. In lines 15 and 16. a pointer to the function to be integrated.22. success). mid. calculated in line 28. the trapezoidal sums on the whole interval (a.0.0+x*x*x).b). If this is not the case. respectively. } return integr.0. minh. will be the integral on the interval (a. the method is recursively called in lines 21–25 for the intervals (a. fmid. badb.

b = 4.o adapt.o : adapt.c adapt. epsilon.o main.h gcc -c -O4 funct. mid. minh. given in the file main.0. but the file adapt.c may need to be rewritten. maxh. &badb. badb). there is no need for these files. The main program.o adapt. x. fb. The printout of the program is a single line (the 1 at the beginning of the line is a line number. bada.c.c and funct.o funct. badb. } } The makefile.h gcc -c -O4 adapt. one types the command $ make all If the program runs well. integr.o may be kept. &bada.c adapt. fb=funct(b). bada. %. &success).12f. fa. } else { printf("The integral has trouble in the interval" "(%. fb. epsilon=5e-10.c adapt.c clean : adapt rm funct.12f\n".h" main() { double a = 0.o by typing $ make clean Once the program runs. and funct. used to compile this program is given in 1 2 3 4 5 6 7 8 9 10 11 all: adapt adapt : funct. 1 The integral is 1. the integral of the function 1 √ 1 + x3 is being calculated.o : funct.12f)\n". minh=1e-5.h gcc -c -O4 main.o To compile the program. a.0. fa=funct(a). if ( success ) { printf("The integral is %.o. maxh=1e-1. The file funct.78 Introduction to Numerical Analysis 9 } That is.o main.o -lm funct.c adapt.h gcc -o adapt -s -O4 adapt. one may remove the object files main. fa. not part of the printout). int success.o : main.c main.o main. integr). specifies the limits of the integral as a = 0 and b = 4: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 #include "adapt. b. and the file adapt is the only one needed.c need not be changed. called makefile.805473301821 . integr=trapadapt(&funct. since it can be reused if a different integral is to be calculated (for which the files main.

599. using a finer subdivision of the integral. one has to further subdivide each ot these part into two equal parts.e. which is in accordance with out choice of the maximum allowable error ǫ = 5 · 10−10 (i. As in case of the trapezoidal rule. 2/8] further or the result is already precise enough. . Writing f (x) = interval [1/8. 2880 58 Of course. 7 3 This is smaller than the permissible error. I and I . We will consider it in more detail how this is done. The error allowed on the interval [1/8. I0 and I1 . Suppose we want to determine the integral of a function f on an interval I with a maximum absolute error of ǫ. the result is accurate to nine decimal places. to estimate the accuracy of this approximation. In order to 0 1 calculate S(I0 ) and S(I1 ). 2/8] is ǫ/8 = 6. In fact. this equation will not be exact. i. in order to calculate S(I). one can use the quantity  S(I) − S(I0 ) + S(I1 ) . we have E(I) = − |I|5 (4) f (ξI )). The error can be estimated as 1 ¯ (T − T ) ≈ −0. Then we divide I into two equal parts.126. ADAPTIVE INTEGRATION WITH SIMPSON’S RULE After of discussion of an implementation for adaptive integration for the trapezoidal rule.000. Solution. we have S(I) ≈ S(I0 ) + S(I1 ). Adaptive integration with Simpson’s rule 79 A more precise approximation of the integral is 1. for the errors of these approximations. Assume Simpson’s rule on this interval approximates the integral as S(I).805. Writing |I| for the length of the interval I. 701 2 and 1/16 (f (1/8) + 2f (3/16) + f (2/8)) = 0.23. 523. T¯ = 2 Do we have to subdivide the interval [1/8.e. 658. so the interval does not need to be further subdivided. 301. 077. it will be easy to make the appropriate changes so as to implement adaptive integration with Simpson’s rule. 2/8] we find that T = √ 3 1 + x2 . We would like to determine the integral Z 1p 3 1 + x2 dx 0 with precision ǫ = 5 · 10−4 using the adaptive trapezoidal rule. and consider the Simpson approximations S(I0 ) and R S(I1 ) on each of these parts. 853..25 · 10−5 . epsilon=5e-10). one can accept T¯ as the integral on this interval.58 Since S(I) and S(I0 ) + S(I1 ) both approximate I f .126. 23. one would expect that the righthand side gives a better approximation. 282. 473. one already has to divide I into two equal parts. on the 1/8 (f (1/8) + f (2/8)) ≈ 0. 025. Problem 1.

I R I0 I1 we try to determine the integral I0 with an allowable error less than ǫ/2. 25 · 2880 where ξI0 ∈ I0 and ξI1 ∈ I1 . Noting that Z Z Z f= f+ f. Similarly. The header file adapt. others maybe only a few times. noting that |I0 | = |I1 | = |I|/2. Of course. in any case.h to this file is . 25 Hence. the total error committed on I will be less than ǫ. Hence 1 E(I1 ) ≈ E(I2 ) ≈ (S(I0 ) + S(I1 ) − S(I)). 15 Thus. If we are successful. Making the assumption that f (4) (ξ) ≈ f (4) (ξI0 ) ≈ f (4) (ξI1 ). some parts of I may have to be subdivided many times. 15 then we accept the approximation Z f = S(I0 ) + S(I1 ). since the assumption (1) may fail. 30 R Thus the error in approximating I f by S(I0 ) + S(I1 ) is E(I0 ) + E(I1 ) ≈ 1 (S(I0 ) + S(I1 ) − S(I)). we have E(I0 ) = − |I|5 f (4) (ξI0 )). When continuing this procedure. I If this inequality is not satisfied. we try to R determine I1 with an allowable error less than ǫ/2. especially if we did not make sure that we divided the interval into sufficiently many parts. Therefore. if (2) 1 |S(I0 ) + S(I1 ) − S(I)| < ǫ. an additional precaution may be that (2) is only accepted as a guarantee that the error is small enough if the length of I itself is not too large. noting that Z f = S(I) + E(I) = S(I0 ) + S(I1 ) + E(I0 ) + E(I1 ). 25 · 2880 E(I1 ) = − |I|5 f (4) (ξI0 )). there is no certainty that this will be so. I we have S(I0 ) + S(I1 ) − S(I) = E(I) − E(I0 ) − E(I1 ) ≈ (25 − 1 − 1) · E(I0 ) = 30E(I1 ).c. then we work with the intervals I0 and I1 separately. and. the total error committed on I will be less than ǫ.80 Introduction to Numerical Analysis where ξI ∈ I. A C program implementation of adaptive integration is contained in the file adapt. (1) it follows that E(I0 ) ≈ E(I1 ) ≈ E(I) .

double epsilon. 19 *success = 1. fmid2 = (*fnct)(mid2). Adaptive integration with Simpson’s rule 1 2 3 4 5 6 7 8 9 10 11 81 #include <stdio. bada. fmid.0*epsilon ) { 18 integr = s2.0. double b. int *success). mid2. the function to be integrated (defined in a separate file) is given. double mid. double fa. 26 maxh. *bada = a. double fa. 30 } 31 } 32 } 33 else { 34 *success = 0. 14 mid2 = (mid + b)/2. on line 4. double maxh.0*fmid2+fb)*h/12.0. mid2. double *badb. 6 double *bada. double fmid. fmid2. fmid1.0. mid. and in lines 7–11 the declaration of the function simpsonadapt. double mid. double fb.h" 2 3 double simpsonadapt(double (*fnct)(double). badb. double fmid. 27 } 28 if ( *success ) { 29 integr = integr1+integr2. 10 mid1. double maxh. double a. double *badb.0. integr1.0. double minh. 5 double minh.0*fmid1+2. success).23. fmid2. 12 if ( h >= minh ) { 13 mid1 = (a + mid)/2. fb. 20 } 21 else { 22 integr1 = simpsonadapt(fnct.c contains the definition of this function: 1 #include "adapt.0. double *bada. mid1. 16 s2 = (fa+4. double epsilon. discussed next. mid. fmid. 17 if ( h<=maxh && absval(s2-s1)<= 15. double simpsonadapt(double (*fnct)(double). success). minh.0*fmid+4. Similarly as in case of the trapezoidal rule.0 ? (x) : (-(x))) double funct(double). epsilon/2. minh. fmid1 = (*fnct)(mid1). 7 int *success) 8 { 9 double x. a. 4 double b. absval(x) to calculate the absolute value of x is defined. h. 11 h = b . b.h> #define absval(x) ((x) >= 0. fmid1. integr = 0.a. fa. *badb = b. epsilon/2. s2. badb. double fb. s1. bada.h> #include <math. 35 } . 23 maxh. on line 6 the declaration of funct. double a. 24 if ( *success ) { 25 integr2 = simpsonadapt(fnct.0*fmid+fb)*h/6. integr2. 15 s1 = (fa+4. is give.0. The file adapt.

The sum of the two integrals on the subintervals. If this is not the case.0 instead of epsilon (since each of these subintervals allow only half the error of the whole interval. integr=simpsonadapt(&funct. the method is recursively called in lines 22–26 for the intervals (a. badb). } } There are very few changes in the calling program compared to the one used in case of adaptive integration for the trapezoidal rule. given in the file main. fb. minh=1e-5.c. one difference is that now the calling program has to calculate . fa. b. and in line 17 it is decided whether the calculation is successful (when h<=maxh and the the absolute value of the difference s2-s1 estimating the error is small enough). epsilon. the integral of the function 1 √ 1 + x3 is being calculated.82 Introduction to Numerical Analysis 36 return integr. the values fa. the calculation will be declared a failure). bada. b = 4. fmid.12f)\n".c contains the definition of the function funct. the minimum length minh of a subdivision (if the error is too large with an interval of length minh. &bada.mid) and (mid. minh. fb=funct(b). In lines 15 and 16. maxh. and the variable success that is 1 (true) if the method is successful and 0 (false) otherwise. a.12f. 37 } The function simpsonadapt has parameters (mentioned in lines 3–7) (*fnct). fa. The file funct. fmid of the function at a.b) with epsilon/2. x.0. the midpoint mid of the interval of integration (while the midpoint is easy to calculate.b). the endpoints a and b are placed into the locations *bada and badb on line 34. epsilon=5e-10. mid. will be the integral on the interval (a. so it can be decided later to use a different method to determine the integral on the interval where the method failed). if ( success ) { printf("The integral is %. the maximum allowable error epsilon on the interval (a. &success). and both the calling program and the called program will need the midpoint). respectively. the Simpson sums on the whole interval (a. The main program.0. fmid. mid. That is. %. the limits of the integral a and b. as before: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 #include "adapt. b.b). bada. this file is the same as in case of adaptive integration for the trapezoidal rule. a pointer to the function to be integrated.12f\n". and mid.h" main() { double a = 0. fmid=funct(mid). the calculation is more efficient if the midpoint is not recalculated each time. fb.b) and its two parts are calculated. fb. integr). mid=(a+b)/2. If the calculation is not successful. } else { printf("The integral has trouble in the interval" "(%. specifies the limits of the integral as a = 0 and b = 4. fa=funct(a).0. &badb. int success. maxh=1e-1. the maximum length max h of a subdivision that need not be subdivided further. badb. calculated in line 29. pointers *bada and *badb to the endpoints of the subinterval on which the calculation failed (the calling program can read these values. integr. the function simsponadapt will call itself recursively.

the continuity requirement can be relaxed.b) and has to calculate the value of the function there.100.100.3] = 4. so the interval does not need to be further subdivided. 24. 4 2 4 Do we have to subdivide the interval [2.25 · 10−4 . 1 The integral is 1. 168. .3] = 2 2 1 12  f (2) + 4f        1 1 3 2 + 2f 2 + 4f 2 + f (3) = 4. and assume f is a function that is N times differentiable on the interval [a.59 Then Euler-Maclaurin summation formula states that we have60 Z b N b−1  X X f (a) Bm  (m−1) f (b) f (x) dx + f (b) − f (m−1) (a) f (i) + + = 2 2 m! a m=2 i=a+1 (1) m is even Z (−1)N b PN (x)f (N ) (x) dx. one can accept S[2. 3] we find that     1 1 f (2) + 4f 2 + f (3) ≈ 4. let N > 0 be an integer.100. 60 The notation used will be explained below.24. not part of the printout). 3] further or the result is already precise enough.2 1 ] + S[2 1 . 2 2 15 15 This is considerably smaller in absolute value than the permissible error.3] ≈ = −4. Problems 1.3] = 6 2 √ 1 + x3 .100. The error can be estimated as  4. We would like to evaluate the integral Z 4p 1 + x3 dx 0 with a precision of ǫ = 5 · 10−4 using the adaptive Simpson Rule. b). The error allowed on the interval [2. a < b. Solution. and f (N ) is continuous on (a. on the and S[2. The Euler-Maclaurin summation formula 83 the value of the midpoint of the interval (a. for example. THE EULER-MACLAURIN SUMMATION FORMULA Let a and b be integers. 3] is ǫ/4 = 1. 756. 168. 27 · 10−6 . b] instead of assuming that it is continuous on (a. this is done on line 10. − N! a 59 The continuity of f (N ) is needed to justify the integration by parts involving f (N ) in the proof of (1) given later in this section.2 1 ] + S[2 1 . The printout of the program is a single line (the 1 at the beginning of the line is a line number. that f (N ) is Riemann-integrable on [a. The makefile is the same as in case of adaptive integration for the trapezoidal rule.3] − S[2. 756 − 4. 175 1 S[2. It is sufficient to assume. 756 as the integral on 2 2 this interval.361. Writing f (x) = interval [2. b].2 1 ] + S[2 1 . 102.100.805473301724 The first nine digits of the result coincide with those given by adaptive integration with the trapezoidal rule. 102. Insofar as this integration by parts can be justified under weaker assumptions. 102. S[2. b). 175.

and the Bernoulli polynomials Bm (x) are defined by the equation61 ∞ X zexz Bm (x) m = z . and B2 = 1/6. . given later in this section. considered as a function of z. . but taking its value to be 1 for z = 0. B6 = 1/42. one needs to check carefully which definition of these polynomials is taken. the function zexz ez − 1 on the left-hand side. The one given here is the modern version. B1 (x) = x − 21 . f (t) is called odd if f (−t) = −f (t) . That Bm (x) is indeed a polynomial of x will be a by-product of the proof of formula (1). ez − 1 regarded as a function of z. . To see that Bm = 0 for odd m > 1.e. and (6) 1 R= (2M + 1)! Z b P2M +1 (x)f (2M +1) (x) dx a 61 In the mathematical literature. and it turns out that the coefficients Bm (x)/m! of this Taylor series are polynomials of degree m of x. can be expanded into a Taylor series at z = 0. B14 = 7/6. were [x] is the integer part of x (i. is not defined for z = 0. B12 = −691/2730. giving rise to different. B1 = −1/2. See footnote 64 on p. and its Taylor series at z = 0 will converge for |z| < 2π (this value for the radius of convergence is a simple consequence of results from complex function theory). B8 = −1/30. (1) can be rewritten as (4) Z b M b−1  X X B2m  (2m−1) f (b) f (a) f (x) dx + f (b) − f (2m−1) (a) + R. 63 A function f (t) is called even if f (−t) = f (t). f (i) + + = 2 2 (2m)! a m=2 i=a+1 where the remainder term R can be written as Z b 1 P2M (x)f (2M ) (x) dx (5) R=− (2M )! a in case M > 0. the largest integer not exceeding x).62 Finally.84 Introduction to Numerical Analysis Here Pm (x) = Bm (x − [x]). there a number of different definitions of the Bernoulli polynomials. B4 = −1/30. and f (2M ) is continuous on (a. but closely related sets of polynomials. observe that the function z 1 1 ez + 1 1 e−z + 1 + = z = (−z) ez − 1 2 2 ez − 1 2 e−z − 1 is an even function63 of z. 62 The expression zexz . B10 = 5/66. B2 (x) = x2 − x + 12 . these coefficients identify the Bernoulli polynomials. . Writing N = 2M or N = 2M + 1 with an integer M . f is 2M times differentiable on [a. z e − 1 m=0 m! It turns out that B0 (x) = 1.. for odd m > 1 we have Bm = 0. b]. The proof of the Euler-Maclaurin summation formula will be given later in this section. Substituting x = 0 in the above series. the Bernoulli numbers Bm are defined as Bm (0). and B0 = 1. 86. so the odd powers of z are missing from its Taylor expansion at z = 0. b). When reading a statement involving Bernoulli polynomials in the mathematical literature. it becomes differentiable any number of times. z e − 1 m=0 m! (2) In other words. this shows that the Bernoulli numbers are defined by the Taylor expansion (3) ∞ X z Bm n = z .

and writing xi = a + ih.24. the trapezoidal rule gives a much better error term. than. When making the substitution t = hx in the Euler summation formula above. for example. n) = n−1 X h f (xi ) + f (b)) (f (a) + 2 2 i=1 we obtain: T (f. Further.e. for the trapezoidal sum T (f. and write A. By making the substitution t = hx in the integral in (1). then all the terms in the sum on the right-hand side are zero. n) = Z b f (t) dt + a − (−1)N N! N X m=2 m is even Z  Bm hm  (m−1) f (b) − f (m−1) (a) m! b PN (t/h)hN f (N ) (t) dt. For any integer c with a ≤ c < b we have ′ Z c+1  Z c+1 1 t−c− f (t) dt = f (t) dt 2 c c    . and F instead of a. noting that f (k) (t) = h−k f (t/h). hk f (k) (t) = f (t/h). Simpson’s rule. we obtain for the integral that Z b f = T (f. b. For example. and f is not the same as above. B. a An interesting point about the error term here is that if the function f is periodic with period b − a. b = Bh. In this case. when integrating a periodic. Proof of the Euler-Maclaurin summation formula. many times differentiable function along its whole period. n) + O(hN ). i. f is 2M + 1 times differentiable on [a. f (t) = F (t/h). we in fact obtain Z Bh B−1 X F (A) 1 F (B) F (t/h) dt + F (i) + + = 2 2 h Ah i=A+1 − (−1)N N! Z Bh Ah N X m=2 m is even  Bm  (m−1) F (B) − F (m−1) (A) m! 1 PN (t/h)F (N ) (t/h) dt. b). The Euler-Maclaurin summation formula 85 in case M ≥ 0. and not Simpson’s rule or Romberg integration. b]. and multiplying both sides by h. By making the substitution. in calculating the integral Z 2π esin x dx 0 one would use the trapezoidal rule. one can obtain a version of the trapezoidal rule with new error terms. h Writing a = Ah.. we also change to notation. and f (2M +1) is continuous on (a. Romberg integration (discussed in the next section) does not improve on the trapezoidal rule in this case. a That is.

t=c+1 Z c+1  1 1 .

(7) t−c− − f (t).

f ′ (t) dt = t−c− 2 2 t=c c Z c+1 1 B1 (t − c)f ′ (t) dt. = (f (c) + f (c + 1)) − 2 c .

(m + 1) Z c+1 Bm (t − c)f c (m) (t) dt = Z . using integration by parts. and the third equality is obtained by taking the partial derivative of (2) with respect to x. and the third equality holds since B1 (x) = x − 21 . According to (2) above we have ∞ ∞ ′ X X z 2 exz ∂ zexz Bm (x) m Bm (x) m+1 z = z = = z . Hence. z −1 m! e − 1 ∂x e m! m=0 m=0 Here the first equality is just (2) multiplied by x and with the sides reversed. Equating the coefficients of ′ z m+1 on the extreme members of these equations. we obtain that Bm+1 (x) = (m + 1)Bm (x) for all 64 m ≥ 0.86 Introduction to Numerical Analysis The second equality here follows by integration by parts.

t=c+1 Z .

= Bm+1 (t − c)f (m) (t).

for K = 1. can be established for K = 1. . = BK+1 · f (c + 1) − f (c) − K +1 c Using this equation. and writing K = m. the formula is equivalent to (7). Equating the coefficients of powers of z on the extreme members of these equations. This is because we have ∞ ∞ X X Bm (1) m zez z Bm (0) m z = z =z+ z =z+ z . Indeed. so that Bm (x) is indeed a polynomial of degree m of x. and the third equation holds according to (2) with x = 0. N by induction. To obtain (1).  64 This (m) equation shows that Bm (x) = m!B0 (x) = m!. the formula Z c+1 f (t) dt = c K X 1 (f (c) + f (c + 1)) 2  (−1)K Bm (m−1) (−1) − f (c + 1) − f (m−1) (c) + (m)! K! m=2 m Z c c+1 BK (t − c)f (K) (t) dt. . . one only needs to point out that PN (t) = BN (t − [t]) = BN (t − c) for t with c ≤ t < c + 1. m! e − 1 e − 1 m! m=0 m=0 The first equality here holds according to (2) with x = 1. we obtain that indeed Bm+1 (0) = Bm+1 (1) holds for m ≥ 1. b − 1. and that Bm = 0 for odd m > 1. Hence. . take this last formula for K = N and sum it for c = a. For K = N . we obtain for K ≥ 1 that Z c+1 BK (t − c)f (K) (t) dt c   Z c+1  1 (K+1) (K) (K) BK+1 (t − c)f (t) dt . To see that indeed (1) results this way. 2. . . − t=c c+1 c c+1 c ′ Bm+1 (t − c)f (m) (t) dt Bm+1 (t − c)f (m+1) (t) dt = Bm+1 (1)f (m) (c + 1) − Bm+1 (0)f (m) (c) − Z c c+1 Bm+1 (t − c)f (m+1) (t) dt. this equation is equivalent to (1). . . noting that Bm+1 = Bm+1 (0) by the definition of the Bernoulli numbers. and the induction step from K to K + 1 follows by expressing the integral on the right-hand side with the aid of the the equation above. Observe that Bm (0) = Bm (1) holds for m > 1. a + 1. .

that is. and using the above formula for F instead of f . While this explanation can be developed into a full-featured proof of the formula. Noting that D(eD − I)−1 = ∞ X Bm m D m! m=0 according to (3). i. let I be the identity operator. m! m=2 65 At this point all these calculations should be taken only as symbolic manipulations – the question whether the inverse of eD − I is meaningful should be bypassed for now. 2 m! m=2 Observing that F (x + 1) − F (x) = (8) R x+1 x  1 f (x) + f (x + 1) = 2 Z f . F ′ = f ). n! n=0 where the last equality can be taken as the definition of the exponential eE of the operator E in terms of the Taylor series of the exponential function. the last equation can be written as DF (x) = ∞ X Bm m D (F (x + 1) − F (x)) m! m=0 Noting that DF = F ′ = f . we obtain DF (x) = D(eD − I)−1 (F (x + 1) − F (x)). we will not do this here. In terms of the operators D and E. Let f be a function. B0 = 1. this is valid only under some restrictive assumptions on f . The Euler-Maclaurin summation formula can be explained in terms of the calculus of finite differences..24. this formula can be written as ∞ X 1 Bm (m−1) f (x) = F (x + 1) − F (x) − (f (x + 1) − f (x)) + (f (x + 1) − f (m−1 (x)). n! n=0 of course. Writing F for an antiderivative of f (that is. The Euler-Maclaurin summation formula 87 Connection with finite differences. this can be written as Ef = ∞ X 1 n D f = eD f. but for now we will assume that this formula is valid. Ef (x) = f (x + 1). Then. we obtain Z x+1 x f = F (x + 1) − F (x) = (eD − I)F (x). and let D be the differentiation operator. Multiplying both sides by D on the left. finally. see Section 9). . let E be the forward shift operator (with h = 1. Df (x) = f ′ (x). Multiplying both sides by (eD − I)−1 we can write that65 F (x) = (eD − I)−1 (F (x + 1) − F (x)). that is If (x) = f (x). we have f (x + 1) = ∞ X 1 (n) f (x). expressing f (x + 1) into a Taylor series expansion at x.e. and B1 = −1/2. it follows that x+1 x f+ ∞ X Bm (m−1) (f (x + 1) − f (m−1 (x)).

Bm (x) has zeros at x = 0. In fact. On may probably obtain this integral in a way similar to the way one obtains the integral remainder term of the Taylor formula. this cannot be done. Unfortunately. This integral can be considered in a sense a “remainder term” of the infinite series on the right-hand side of (1). 1] for odd m. It therefore follows that Bm (1/2) = −Bm (1/2) for odd m. f (N ) uniformly on the interval [a. in fact. x) for x = a as the integral of its derivative. According to (2). one may try to derive formula (1) from formula (8) by finding a polynomial P that approximates f . when f is a polynomial. defined by (3). we have ∞ X zexz ze(1−x)z Bm (x) − Bm (1 − x) m − = z . Hence. b] (if a < b) or [b. for odd m > 1.67 However. We are going to show that Bm (x) has no more zeros in the interval [0. that we have Bm (x) = mBm−1 (x) for all m > 0. b] or [b. This can be proved from (1) in the Lemma in Section 4 by expressing Rn (b. b] by the Weierstrass approximation theorem. Therefore. in much the same way as was done in the first proof of the Euler-Maclaurin formula. 66 The Weierstrass approximation theorem says if f is continuous on [a. 1. all the arguments above are rigorous. since all the infinite series in fact turn into finite series. it is questionable whether this is an avenue worth pursuing. 68 See footnote 63 on p. in its Taylor expansion at z = 0 the coefficients of even powers of z are zero. So this argument indeed gives a rigorous proof (of the form (8)) of the Euler-Maclaurin formula polynomials. Hence Bm (x) = Bm (1 − x) for even m. we have ′ ′ mBm−1 (x) = Bm (x) = −Bm (1 − x) = −mBm−1 (1 − x). z z e −1 e −1 m! m=0 The left-hand side here can also be written as z(e(x−1/2)z − e(1/2−x)z ) . k! n! a k=0 provided f k is exists on [a. . . Nevertheless. for even m > 1. We saw above. It is better to think of the above argument leading to formula (8) as an intuitive justification of the appearance of the Bernoulli numbers. a]. since the sum on the right-hand side of (1) usually diverges when N → ∞. Thus Bm (x) = −Bm (1 − x) for odd m > 0. this argument works in case f is a polynomial. . such a proof is in some sense more satisfying than the one we gave for the Taylor formula in Section 4 using telescoping sums. . b]. a] (if b < a) and f (n+1) is continuous on [a. and then Bm (1) = Bm (1 − 1) = Bm (0) = 0 also holds. 84. a proof of the Taylor formula with integral remainder term can be obtained by repeated integration by parts. and we ignore the last integral on the right-hand side. b] then for every ǫ > 0 there is a polynomial P such that |f (x) − P (x)| < ǫ for all x ∈ [a. . 67 The Taylor formula with integral remainder term says can be written as f (b) = Z b (n+1) n X f (x) (b − x)n f k (a) + dx. If one wants to get a proof of the general form if the Euler-Maclaurin summation formula. Zeros of the Bernoulli polynomials.88 Introduction to Numerical Analysis which equivalent the Euler-Maclaurin summation formula if on the right-hand side of (1) we make N → ∞. ′ in the proof of the Euler-Maclaurin summation formula. this is because we saw above that Bm (0) = Bm = 0 for odd m. 1/2. ez/2 − e−z/2 It is easy to see that this is an odd function68 of z.66 But then one runs into technical difficulties as to how to obtain the integral on the right-hand side of (1). since in this case high-order derivatives of f are zero. Thus. f ′ . in the Euler-Maclaurin summation formula. Namely. Hence Bm (1/2) = 0 for odd m.

and we know that this is not the case. Z c+1/ Z c+1 P2M +1 (x) dx P2M +1 (x)f (2M +1) (x) dx = f (2M +1) (ηc ) c+1/2 c+1/2 = f (2M +1) (ηc ) B2M +2 (1/2) − B2M +2 B2M +2 (1) − B2M +2 (1/2) = −f (2M +1) (ηc ) . 1]. indeed. Hence we have R= b−1  B2M +2 (1/2) − B2M +2 X (2M +1) f (ξc ) − f (2M +1) (ηc ) . 1/2 and 1 (and the only zero of B1 (x) is x = 1/2). if m > 0 is even. 1]. since then mBm−1 (x) = Bm (x) would have at least two zeros in (0. c + 1] for integer c.24. and another zero in the interval (1/2. for m > 0 even. Mean-value form of the remainder term. Bm−2 (x) = B1 (x). we will show that   1 = (21−m − 1)Bm . 2M + 2 2M + 2 The first equality holds for some ηc ∈ (c + 1/2. 1]. we will reach the contradictory conclusion that B( x) has at least two zeros in (0. 1) by Rolle’s theorem. Hence. The second ′ (x) = mBm−1 (x) for all m > 0. 1] are x = 0. and Bm−1 (x) = (m − 1)Bm−2 (x) has at lest two zeros in (0. In the end. 1). 68 – take ξ = x in that formula): Z c+1/2 Z c+1/2 P2M +1 (x) dx P2M +1 (x)f (2M +1) (x) dx = f (2M +1) (ξc ) c c B2M +2 (1/2) − B2M +2 B2M +2 (1/2) − B2M +2 (0) = f (2M +1) (ξc ) . = f (2M +1) (ξc ) 2M + 2 2M + 2 The first equality holds for some ξc ∈ (c. by Rolle’s theorem. Bm−2 (x) has at least four zeros in the interval [0. It cannot have more zeros ′ in [0. one in (0. and the third equality holds since equality holds because Bm Bm = Bm (0) by the definition of the Bernoulli numbers. 1). since them. c + 1) by the Mean-Value Theorem quoted. it follows that P2M +1 (x) = B2M +1 (x − [x]) does not change sign on any of the intervals [c. Then Bm (x) = mBm−1 has at least three zeros in the interval (0. thus. 1 ′ Now. 1). 1) by Rolle’s ′ theorem. c + 1/2) by the Mean-Value Theorem quoted. If m − 2 > 1. 1]. and Bm (x) has at least for zeros (counted with multiplicity) ′ in the interval [0. can have at most one zero. c + 1/2) and ηc ∈ (c + 1/2. The remainder term (6) to formula (4) can be written as b−1 Z c+1/2 X 1 P2M +1 (x)f (2M +1) (x) dx R= (2M + 1)! c=a c b−1 Z c+1 X 1 + P2M +1 (x)f (2M +1) (x) dx (2M + 1)! c=a c+1/2 Above we saw that the only zeros of Bm (x) for odd m > 1 in the interval [0. c + 1). 1). c + 1/2] and [c + 1/2. 1/2) and one in (1/2. and 1. Below. assume that m > 1 is odd. (9) Bm 2 . then Bm−2 (x) also has zeros at x = 0 and x = 1. Hence the integrals in the above sums can be estimated by the Mean-Value Theorem discussed in formula (1) in Section 20 (p. the only zeros of Bm (x) for odd m > 1 in the interval [0. 1/2). being a linear polynomial. 1/2. The third equality holds since Bm (1) = Bm (0) = Bm for all m ≥ 0. (2M + 2)! c=a where ξc ∈ (c. That is. then Bm (x) = m+1 Bm+1 (x) must have at least one zero in the interval (0. therefore we can repeat the same argument. 1] are x = 0. Similarly. The Euler-Maclaurin summation formula 89 Indeed. Bm (x) has exactly two zeros in the interval [0. If m − 2 = 1 then this is a contradiction.

possibly there are infinitely many intervals Ik between In . in case M = 0. that In+1 need not be adjacent to In . Write ff  Z f = 6 0 S = sup g ′ (ξ) : In 69 That 70 Note and In+1 . c + 1] for c with a ≤ c < b) when deriving (10) with assuming only that f (2M + 1)(x) exists for x ∈ (a. β). choose A such that Z Such an A exists since writing Rβ α β α Z ˛t=β ˛ f ′ g = f (t)g(t)˛ −A t=α β f. B2M +2 (2M + 2)! c=a In order that the form (6) be a valid expression for the remainder term in (4). F (x) = Z x α Z ˛t=x ˛ f ′ g − f (t)g(t)˛ +A t=α x f. if one wants to derive (10) under this weaker assumption. and.69 one can write this as Z (11) β α Z ˛x=β ˛ f ′ g = f (x)g(x)˛ − g ′ (ξ) x=α β f α for some ξ ∈ (α. since f (x) = 0 for those values of x). either f (x) 6= 0 for all x ∈ (αn . It may be of some interest to point that that for the validity of (10) one needs to assume only that f (2M +1) (x) should exist for all x ∈ (a. β). As is clear from the proof of the Euler-Maclaurin summation formula. b). is. or f (x) ≤ 0 for all x ∈ (α. first assume that f (x) 6= 0 for x ∈ (α. β). That is. by Rolle’s theorem there is a ξ ∈ (α. Of course. we cannot use (6). Instead. showing that (11) holds. The formula for integration by parts says that Z β ˛x=β Z ˛ f ′ g = f (x)g(x)˛ − x=α α β f g′ . α we have F (α) = 0 and F (β) = 0 (the latter by the choice of A). and [c + 1/2. α f 6= 0. βn ] such that any two of these intervals have at most one endpoint in common (i. b). in deriving (6). βn ) (for those intervals In on which f identically vanishes. β). In fact. in which case the integration by parts formula is certainly not valid. Since the formula analogous to (11) has already been established for f and g on each In with some ξn ∈ (αn . β]. b) can be decomposed as a union of (finitely many or infinitely many) intervals In = [αn . β). the last integration by parts in the derivation of the Euler-Maclaurin summation formula needs to be replaced by a Mean-Value Theorem. the above expression for R can be simplified as (10) R=− b−1   X 2 − 2−1−2M f (2M +1) (ξc ) − f (2M +1) (ηc ) . The point is that this last equation is valid even without the assumption that the integration by parts formula above is valid. it may happen that f g ′ is not integrable on [α. b). in fact. β) such that F ′ (ξ) = 0.. and g ′ (x) exists for all x ∈ (α.e. β]. one uses the integration by parts occurring in (7)). Since F is differentiable on (α. βn ) or f (x) = 0 for all x ∈ [αn . β). This integration by parts needs to be replaced by (11) (on the intervals [c. β). for each n. βn ]. 0 = F ′ (ξ) = f ′ (ξ)g(ξ) − (f ′ (ξ)g(ξ) + f (ξ)g ′ (ξ)) + Af (ξ) = (A − g ′ (ξ))f (ξ). c + 1/2]. β). either f (x) ≥ 0 for all x ∈ (α. To establish (11) under these weaker assumptions. they have no internal points in common).90 Introduction to Numerical Analysis With the aid of this relation. (11) is trivially true). we needed to assume that f (2M +1) be continuous on (a. Then. given that f is either always positive or always negative on the interval (α. α In case f does not change sign on (α. That is.70 f (x) = 0 at the endpoints of each of these intervals unless these endpoints equal α or β (when f (x) may or may not be zero). β) by the same Mean-Value Theorem for integrals that we used above. one integrates (5) by parts (in case M > 0. β). This equation implies that A = g ′ (ξ). f ′ and g are continuous on [α. 
we have Z β α Z ˛x=β X ˛ g ′ (ξn ) f ′ g = f (x)g(x)˛ − x=α n f In (the terms f (x)g(x) for x = αn or x = βn with x 6= α and x 6= β on the right-hand side do not appear. (11) is valid if f does not change sign on (α. and yet (11) is still valid. In case f (x) = 0 for some x ∈ (α. the interval (a.

Then (11) is satisfied with this ξ. according to the Euler-Maclaurin summation formula (4) with remainder term (10) we have Z b b−1 M X X dx 1 1  1 B2m  1 1 + R. Similarly. we disregard the trivial case when f (x) = 0 for all x ∈ (α. In fact. Hence ∞ ∞ X X z Bm (1/2) + Bm m m 2Bm m z =2· z = 2 z . so that we do not assume that these sets are bounded. Solution. we allow s = −∞ or S = +∞. and the second equality can be obtained by adding the two equations above. this is because of cancelation with the factorial in f (2m−1) (x). . With t = 2z. v] and H is such that g ′ (u) ≤ H ≤ g ′ (v) or g ′ (u) ≥ H ≥ g ′ (v). Hence. if g is differentiable on [u. it is easy to see that s ≤ H ≤ S. In any case. also.23. Next we will prove formula (9).71 there is a ξ ∈ (α. ∞ ∞ X X z 2z t Bm m Bm m m z − = 2z = t = t = 2 z . the negative sign on the right-hand 71 That is.. ez − 1 ez + 1 e −1 e − 1 m=0 m! m! m=0 where the third equality holds according to (3). This is true even if g ′ is not continuous. formula (9) follows. z z 2z t e −1 e +1 e −1 e −1 m! m! m=0 m=0 where the third equality holds according to (2). we have s < H < S unless s = S. there are i and j such that g ′ (ξi ) ≤ H ≤ g ′ (ξj ). we have ∞ ∞ X X z z 2zez tet·(1/2) Bm (1/2) m Bm (1/2) m m + = = = t = 2 z . Since a derivative always has the intermediate-value property. b] such that g ′ (ξ) = H. Write H= P n g ′ (ξn ) Rβ α f R In f . Euler’s constant γ is defined as γ = lim n→∞ n X 1 i=1 i  − log n. Evaluation of Bm (1/2). + + − =− − 2n c=a+1 c 2b 2m b2m a2m n x m=2 Note the absence of the factorial in the denominator after the sum sign on the right-hand side. R P R as αβ f = n In f . Writing f (x) = 1/x. β).e. Problem 1. β) such that H = g ′ (ξ). then there is a ξ in [a. Evaluate γ with 10 decimal precision. we have f (m) (x) = (−1)m m!/xm+1 . m! e − 1 m=0 m! m=0 the first equality here holds again according to (3). Equating coefficients of z m on the two sides. but we assume that these sets are not empty – i. Adaptive integration with Simpson’s rule and s = inf  Z g ′ (ξ) : In 91 ff f 6= 0 .

it gives 0. 53. if we put ξc = c + 1/2 and ηc = c + 11/. 000. For a = 10 and M = 4 we have 2 − 2−9 5/66 · 10 ≈ 1. 901. the seventh decimal place deviates from the decimal expansion of γ. 45. a b Hence. 901. for n = 10. 000. 72 It is known that B2m for m > 0 is positive if m is odd. 2M + 2 where b−1  X 1 1  S= . 901.92 Introduction to Numerical Analysis side occurs because (−1)2m−1 = −1 occurs in the expression for f (2m−1) (x). c + 1/2) and ηc ∈ (c + 1/2. Formula (10) for the remainder term gives 2 − 2−1−2M R= B2M +2 · S. 902. 664. − ξc2M +2 ηc2M +2 c=a Again the absence of the factorial and of the minus sign (the one that occurs in (10) in the first of these formulas is explained by the form of f 2M +2 . . 664. it gives 0. 068. 0<R≤ 10 10 and this gives an acceptable result. 31. 220.577. 714.72 For a = 10 and M = 3 this gives 0>R≥ 2 − 2−7 −1/30 ≈ −6. we get a sum smaller than S. 900. Even in the last result. 000. If we put ξc = c and ηc = c + 1. for n = 1. 20. 893. 20.577. 215. 164. 000 gives 0. and it is negative if m is even. 000. 000. 80 / γ / 0. Using the latter values for a and M . Thus we obtain 1 1 (12) 0 < S < 2M +2 − 2M +2 .577. i 2a 2m a m=2 where. we obtain that 0. c + 1). 664.577. 215. while n X 1 i=1 i − log n for n = 10. These latter sums are easily evaluated since almost everything in them cancels. for n = 100. by making b → ∞ in (12) we have 0<R≤ 2 − 2−1−2M B2M +2 · 2M +2 2M + 2 a or 0>R≥ 2 − 2−1−2M B2M +2 · 2M +2 2M + 2 a depending on whether B2M +2 is positive or negative. Here ξc ∈ (c. · 10 1010 which is slightly worse that what is required for ten decimal precision. 664.577. it gives 0. 265. 53. 215.514 · 10−12 . 664.640 · 10−11 . note that for the actual value of γ up to 14 decimals we have γ = 0.577. 215. To compare this with what one can do without the Euler-Maclaurin summation formula. 216. for any integer a ≥ 1 we have Z n n X  dx  1 − log n = lim − n→∞ n→∞ i i x 1 i=1 i=1 Z Z n a−1 n a 1 X1 X dx dx  1 1 1 = + − + lim + + − n→∞ 2a i 2a x i 2n x 1 a i=1 i=a γ = lim = n X 1 a−1 X i=1 M X 1 1 B2m 1 + − log a + · 2m + R. we get a sum larger than S.577.

so there is no reason to do repeat function evaluations used in calculating According to (1). n) = n−1  X h f (xi ) + f (b) f (a) + 2 2 i=1 The Euler (or Euler-Maclaurin) summation formula. this is because the sum expressing Sk.l h2l k + O(hk l=j+1 (j ≤ k). Writing hk = (b − a)/2k for k ≥ 0. ROMBERG INTEGRATION Let f be a function on the interval [a.j to Rb f as follows.j + a N −1 X 2N −1 ) cj. b]: T (f.0 will be the zeroth Richardson extrapolation. (b − a)/2k ). f = Sk. where k and j are integers with 0 ≤ j ≤ k. and write T (f. n) + a N −1 X cl h2l + O(h2N −1 ).0 = + hk f (a + (2i − 1)hk ) 2 i=1 for k > 0. we assume here that k ≤ N − 2.0 reuses all the partition points occurring in the sum Sk−1. In the jth Richardson extrapolation Sk.0 2X Sk−1.0 = h0 (f (a) + f (b)) 2 and k−1 (2) Sk. One calculates the approximations Sk. We have Z a 73 The b f = Sk.l h2l k + O(hk l=j+1 Euler summation formula gives the values of these constants and that of the error term explicitly.25. generalizes the trapezoidal rule in that it gives a more precise error term. Romberg integration 93 25.j by eliminating the coefficient of h2j k . let h = (b − a)/n and xi = a + ki for 0 ≤ i ≤ n. According to this formula. one obtains Sk. For j > 0. for h close to 0 we have Z (1) b f = T (f. this approximation is the a trapezoidal sum T (f. and assume that f is at least 2N − 1 times continuously differentiable for some N > 0.0 + k + O(h a l=1 The quantity Sk.j h2j k + N −1 X 2N −1 ) cj−1.j the first j terms h2l k are eliminated: Z (3) b f = Sk. l=1 where the constants cl and the error term O(h2N −1 ) depend on f .j−1 + cj−1. b]. S0. we have Z b N −1 X 2N −1 cl h2l ). The organization of this calculation is as follows. For j = 0.0 .j−1 and Sk. we have Clearly. Let n > 0 be an integer.73 This equation allows one to do Richardson extrapolation on the trapezoidal sum for various values of h. . n) for the trapezoidal sum approximating the integral of f on [a.j from Sk−1. discussed in Section 24.

h> #define absval(x) ((x) >= 0. k + (4 + 2 k note that in the error term. This hope is probably justified if the function is smooth enough.j−1 + l=j+1 −1 j 2N −1 cj−1. Richardson extrapolation after a certain point may produce no improvement.74 Writing (4) Sk. and in the other formula it may be negative. In practice. the error in one of the formulas may be positive. double a. All one can hope is that the errors do not all add up.l h2l ).j−1 + cj−1. but it is somewhat questionable when the process is repeated so as to get Sn. (4j − 1) Z N −1 X b a f = 4j Sk.94 Introduction to Numerical Analysis and Z b a N −1 X f = Sk−1. . which agrees with (3).e.j−1 4j − 1 and cj.j−1 − Sk−1.j−1 − Sk−1. double romberg(double (*fnct)(double).j−1 .l . 74 That is. this means that if the function is not sufficiently smooth. I. even though the two formulas are subtracted.j−1 and Sk−1.j = 4j Sk.j h2j k−1 + 2N −1 cj−1. 75 This is justified for the single use of (4) in calculating S k.0 ? (x) : (-(x))) double funct(double).j from Sk. we will calculate the integral Z 0 1 dx . the coefficient of h2j will be eliminated.l (4j − 4l )h2l )O(h2N ).j−1 + cj−1.l 22j h2j k + N −1 X −1 2N −1 cj−1.l = we have Z a provided we can write b f = Sk.j + 4j − 4l cj−1.h> #include <math. the errors may add up.l h2k k−1 + O(hk−1 ) l=j+1 = Sk−1. the coefficient is 4j + 22N −1 rather than 4j − 22N −1 since in only a bound on the absolute value of the error is known. 1 + x2 The header file romberg. 4j − 1 N −1 X 2N −1 cj. k + O(hk l=j+1 4j + 22N −1 −1 O(h2N ) k 4j − 1 −1 75 ). 0) with 0 ≤ k ≤ n. Multiplying the first equation by 4j = 22j and subtracting the second equation. Hence.n from S(k.h is used in all the files implementing the program: 1 2 3 4 5 6 7 #include <stdio. as O(h2N k In the computer implementation of Romberg integration.l 22l h2l O(h2N ) k +2 k l=j+1 where we used the equation hk−1 = 2hk .

s[MAXROW].0/(1. return(value). */ news=olds/2.0+x*x). for (j=1. k. j++) { four_to_j *= 4. hcoeff=-1. state==2: last row reached */ for (state=0. sigma. There is no need to keep the old row around. if ( maxrow > MAXROW ) { printf("Too many rows"). news=(four_to_j * news-olds)/(four_to_j-1. p++) { hcoeff+=2. two_to_kmin1. *success=0. sigma+=F(a+hcoeff*h). for (p=1.0. state == 0. { double value.0. double tol. four_to_j=1.) { sigma=0. state. double a. return 0. olds=s[j]. hcoeff. This includes declarations form the functions funct and romberg defined soon. int maxrow.0+h*sigma. /* state==0: more splitting is needed. s[j]=news. news.h" double funct(x) double x. } Romberg integration is defined in the file romberg. Romberg integration 8 95 double b.h" #define MAXROW 30 #define F(x) ((*fnct)(x)) double romberg(double (*fnct)(double). four_to_j.0. double b. int *success) { int j. p <= two_to_kmin1.c: 1 2 3 4 5 6 7 8 9 #include "romberg. s[0]=news. } k=1. double tol. two_to_kmin1=1. state==1: result is within tolerance.0. /* A new row of the interpolation table is calculated. } /* The new diagonal element is calculated and compared . h. j<k.0. h=(b-a)/2.0. double olds. } olds=s[0]. news=h*(F(a)+F(b)).0). int *success). int maxrow. s[0]=news. value = 1.c: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 #include "romberg.25. The integrand is defined in the file funct. p.

and the result is moved into s[j]. } else { /* The new diagonal element is entered in the table */ s[k++] = news. two_to_kmin1 *= 2. Lines 21–26 calculate ( k. printf ("The value of h is %. If on line 12 it turns out that maxrow is larger than MAXROW. In the implementation. return news.j is calculated in news. *success assigned 0. The other parameters are the lower limit a.j is calculated in lines 31–36 using formula (4).j .0). and it will stay 0 unless the test in line 55 shows that the calculation reached state==1. this is moved to olds. The function prototype romberg. an integer state describing the current state of the process. 0) using formula (2) (k is incremented on line 50). If they are close.0. the upper limit b. and a real four_to_j of type double to hold the current value 4j is declared in line 10. and the .e. Before the calculation. F(x) is defined to be (*fnct)(x). updating the array s[].or else the parameters need to be returned to the calling program. if ( absval(news-olds) <= tol ) { state = 1. and Sk. and the calling program can decide what to print. whether more interval splitting is needed (state==0). } else if ( k==maxrow ) { state = 2. and a pointer to an integer *success that will be true if the integral was successfully calculated.96 Introduction to Numerical Analysis 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 } to the old diagonal element. h /= 2. a yes answer will finish the calculation. Further. on line 42 it is tested if the new diagonal element is close enough to the old diagonal element. printf ("The number of division points is %d\n". and Sk. There is an integer two_to_kmin1 declared in line 8 to hold the current value of 2k−1 . has a pointer to the function (*fnct) to be integrated. The actual calculation starts on line 17 by calculating S0. In lines 50–52 the new diagonal element news is stored in s[k]. maxrow is not allowed to be larger than MAXROW. then k is incremented in the same statement.0. k). defined to be 30 in line 2 (because MAXROW number of locations have been reserved to the array s[] holding a column of the matrix of approximation sums Sk. or the result is within the tolerance defined by tol (state==1.0 . the result is accepted */ four_to_j *= 4. In line 11.12f\n". and the main part of the calculation is done in the loop between lines 21 and 54.. the array element s[j] contains Sk−1. */ printf ("The number of rows is %d\n". the permissible error tol. } } if ( state == 1 ) { *success = 1. As mentioned above. the calculation is terminated unsuccessfully. 2*two_to_kmin1). the maximum number of rows maxrow allowed. described in lines 5–6. } /* Some diagnostic information is printed out next. This part is best commented out in a production version -. or whether no more rows are allowed (state==2). this allows to simplify the writing when the function pointed to by fnct needs to be evaluated. In line 3.j ). news=(four_to_j * news-olds)/(four_to_j-1. h). i.

integr). 4.141592653590 tolerance used is 5e-13 This shows that all 12 decimal digits of the result (i. &success). and a maximum 26 rows is allowed. tol). tol=5e-13.0.12f\n".0*atan2(1. } } The tolerance tol is set to be 5 · 10−13 and in line the romberg integration routine is called with limits 0 and 1. 4. since they might be needed temporarily when the program is maintained or updated).12f\n".. printf("The calculated value of pi is %.007812500000 number of division points is 128 integral is 0. four times the integral) agree with the value of π. Problems 1.0.1.12g\n". printf("The actual value of pi is %. In line 55.e. the value of *success is set. and in the final version of the program they can be commented out (they should probably still be left there. As we mentioned.25. Romberg integration 97 values of h and two_to_kmin1 are updated.141592653590 actual value of pi is 3. The printout of the program is 1 2 3 4 5 6 7 The The The The The The The number of rows is 7 value of h is 0. 8 f (x) dx .785398163397 calculated value of pi is 3. The result of the calculation is compared to the value of π obtained as 4 arctan 1 calculated on line 11. 26. The calling program is contained in the file main.0*integr). printf("The tolerance used is %.h" main() { double integr.0. } else { printf("It is not possible to calculate the integral " "with the\n required precision\n"). Line 65 returns the value of the integral. int success.3 to evaluate the integral Z 0 using Romberg integration.c: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 #include "romberg. tol.12f\n". 1 + x2 0 The value of this integral is well known to be π/4. if ( success ) { printf("The integral is %. Write the expression for S3. the integral to be calculated is Z 1 dx . The printing statements on lines 61–63 have only diagnostic purpose. integr = romberg(&funct.1)).

750.0 − S2. Explain why adaptive integration would be a better choice. 3 3 1 16S2. 3.1 − S1.5 · 10−10 . Explain how one can use Taylor’s formula to approximate the missing part of the integral Z 10−5 p x3 + x dx 0 .0 2 S2.2 = 15 45 1 S3. 45 S3.0 1 (f (0) + 2f (1) + f (2) + 2f (3) + 2f (4) + 2f (4) + 2f (6) + 2f (7) + f (8)). Solution.0 = 1 64S3.1 = = (f (0) + 4f (2) + 2f (4) + 4f (6) + f (8) 3 3 4S3.1 = = (f (0) + 4f (1) + 2f (2) + 4f (3) + 2f (4) + 4f (5) + 2f (6) + 4f (7) + f (8)).2 = (14f (0) + 64f (1) + 24f (2) + 64f (3) + 28f (4) + 64f (5) + 24f (6) + 64f (7) + 14f (8).0 − S0.3 = 2. Near x = 0 one needs to divide the interval into much smaller parts than away from 0 because the error terms of both the trapezoidal rule and Simpson’s rule becomes infinite at 0. 3 3 = f (0) + 2f (2) + 2f (4) + 2f (6) + f (8).98 Introduction to Numerical Analysis Solution. We have h0 = 8.0 − S1.1 = S2. S3. Even adaptive integration will have trouble with the integral in the preceding problem on the interval [0. S2. 4 4S1.2 = (868f (0) + 4096f (1) + 1408f (2) + 4096f (3) 63 2835 + 1744f (4) + 4096f (5) + 1408f (6) + 4096f (7) + 868f (8)). 2 4S2. 1]. Even adaptive integration does not work unless one excludes a small neighborhood of x = 0. 122. but one can successfully use the adaptive trapezoidal rule to calculate Z 1 p x3 + x dx = 1. The integrand behaves badly near x = 0 in that all its derivatives become infinite at x = 0.0 = h0 (f (0) + f (8)) = 4(f (0) + f (8)). So no integration method is likely to work that divides the interval into equal parts.0 = (f (0) + 4f (4) + f (8)).0 1 S3. and S0.2 − S2. S1.1 = (28f (0) + 128f (2) + 48f (4) + 128f (6) + 28f (8)).0 = 2(f (0) + 2f (4) + f (8)). 2 Further. S1.847. 11 10−5 with an error less than 2. We would like to determine the integral Z 1p x3 + x dx 0 Explain why Romberg integration would not work.

for example.e. 52 = 1.26. when would you prefer Romberg integration over adaptive integration. or some other way of estimating the error.5 · 10. Briefly explain in what situation would you prefer adaptive integration over Romberg integration. 10−5 ]. we can ignore x2 /2 in the Taylor expansion. i. We have   √ p √ x2 4 2 + O(x ) . 122. x3/2 0 The trouble is that the usual numerical integration methods do not work well in this situation.152 · 10−8 3 x3 + x dx ≈ 1. occurring in the error estimate for Simpson’s rule. such as Z π/2 sin x dx I= . 000. 750. 26.77 76 For the sake of simplicity.63. the fourth derivative. 153. +x= x 1+x = x 1+ 2 √ where we took the Taylor expansion of 1 + x2 at x = 0. one needs to evaluate integrals with singularities. tends to infinity as fast as x−9/2 at x = 0. An exact proof showing that the contribution of the remaining terms is too small to matter would need to use the formula for the remainder term. 4. . Integrals with singularities 99 with the same error 2. Solution. and conversely. The situation is even worse for the derivatives of the integral. it is fairly easy to see that f ′′ (ξ) is very close to 1 when ξ is in the interval [0.847. since its maximum of the corresponding term √ x · x2 /2 = x5/2 /2 in the integrand on this interval is 10−25/2 < 1012 . 77 This is because the integrand tends to infinity at x = 0 as fast as x−1/2 . The remainder of the Taylor formula would be f ′′ (ξ) 2 t 2 √ √ to calculate the Taylor series of 1 + t and at t = 0 and then with f (t) = 1 + t and t = x2 (it is technically simpler √ 2 take t = x than to calculate the Taylor series of 1 + x2 .76 Thus p Z 0 Hence Z 0 1 p 10−5 x3 p x3 +x≈ Z 0 10−5 √ 2 x dx = 10−15/2 = 3. On occasion. since the integrand tends to infinity at x = 0. and the fourth derivative of this is constant times x−9/2 .000. and when we multiply this by the length of the interval of integration. INTEGRALS WITH SINGULARITIES Improving singularities by integration by parts. 031. 10−5 . the result is much smaller than the permissible error is < ·10−17 .847. the interval in which t = x2 belongs when x belongs to the interval [0. 10−10 ]. When integrating this on the interval [0. we used O(x4 ) instead of the remainder term of the Taylor formula. 750. 11 + 0. but the result will of course be the same).. 10−5 . and use this result to find Z 1 0 p x3 + x dx with an error less than 5 · 10−10 .

0 The integral on the right is just as bad as the as the integral on the left.100 Introduction to Numerical Analysis One can improve the situation by integration by parts. integration by parts gives78 I= Z π/2 0  π/2 Z x−3/2 sin x dx = lim −2x−1/2 sin x + ǫց0 x=ǫ π/2 2x−1/2 cos x dx 0 π/2 Z p = − 8/π + 2x−1/2 cos x dx. but another integration by parts will indeed improve the situation: . Indeed.

π/2 Z p .

1/2 + I = − 8/π + 4x cos x.

.

the argument also works for functions of form f (x) = xα g(xβ ) where α and β are reals. by a finite number of terms of the Taylor series) near 0 – the Taylor series need not even be convergent anywhere. since this time the contribution of the integrated-out part is zero. since the integrand on the right-hand tends to zero as fast as x3/2 when x tends to zero. lim x→0 x we have lim ǫ−1/2 sin ǫ = 0. − + 6 120 6 120 0 Here. Finally. and four more integration by parts are needed to make Simpson’s rule applicable. However. or whether one uses a less precise error term involving the third derivative for Simpson’s rule. 80 One needs to be careful here: just because a function behaves as O(x2 ) near x = 0.81 78 In order to deal with the singularity at x = 0. β > 0. The argument here works only because we are considering functions of the form f (x) = xα g(x). More generally. instead of x = 0 at the lower limit we take ǫ tending to 0 from the right.80 The second integral can be calculated directly. so Simpson’s rule is still not applicable. so the fourth derivative tends to zero when x tends to zero. ǫց0 79 This latter statement is true whether one uses the error term involving the fourth derivative that we obtained for Simpson’s rule. and the factor sin x will be differentiated. we have Z 0 π/2 −3/2 x sin x dx = Z 0 π/2 −3/2 x    Z π/2  x3/2 x3 x5 x7/2 −1/2 x − sin x − x + dx + dx. where O(x7 ) is used to indicate the error when x → 0. Consider for example the function f (x) = x2 sin x1 when x 6= 0 and f (0) = 0. Two more integration by parts are needed to make the trapezoidal rule applicable. it does not follow that its derivative behaves as O(x). the fourth derivative at x = 0 still tends to infinity as fast as x−5/2 . since the second derivative tends to infinity at x = 0. (2n + 1)! 6 120 convergent on the whole real line. So. and it is simpler to use another method. but its derivative does not tend to 0 when x → 0. called the subtraction of singularity. the integrand behaves as x−3/2 · O(x7 ) = O(x11/2 ). Using the Taylor expansion sin x = ∞ X n=0 (−1)n x3 x5 x2n+1 =x− + + O(x7 ). The situation is now better. R 81 Note that the integral π/2 x−1/2 dx is a convergent improper integral. x=0 π/2 1/2 4x 0 Z p sin x dx = − 8/π + π/2 4x1/2 sin x dx. Performing repeated integrations by parts is laborious. the first integral can be calculated by Simpson’s rule. without the need of numerical integration. so its fourth derivative behaves as O(x11/2−4 ) = O(x3/2 ). In the integration part. where α is a real number. and the function g(x) has a convergent Taylor series near 0. 0 the last equation holds. This function is differentiable everywhere. and g(x) can be approximated by Taylor polynomial (i.e. at x = 0. one can say that the integral no longer has a singularity.. 0 . since we have sin x = 1. even the trapezoidal rule is not applicable.79 Subtraction of singularity. the factor x−3/2 will be integrated. indeed.

and consider functions whose Taylor series never converges. + 3 5 the first integral can be evaluated by Simpson’s rule.83 There are also numerical methods that can deal with singularities. For example. one can even weaken this condition.. the integral I= Z ∞ 3/2 e−x dx 0 can be handled by first splitting the interval of integration as I = I 1 + I2 = Z 1 3/2 e−x 0 dx + Z ∞ 3/2 e−x dx 1 While first integral is continuous at x = 0. Infinite intervals of integration. without the use of numerical methods. This can be done by the subtraction of singularity. Integrals with singularities 101 When one wants to evaluate the integral Z I= 2 0 arctan x dx . Using the Maclaurin expansion82 arctan x = ∞ X (−1)n n=0 x2n+1 x x5 =x− + + O(x7 ). but it would be much more laborious than even in the previous example. 83 The . the only important point is that the Maclaurin converges near x = 0. We have the Maclaurin series ex = ∞ X xn x2 =1+x+ + O(x3 ). so we need to prepare this integral for numerical integration. its derivatives (beginning with the second derivative) tend to infinity when x → 0.2] of integration has no importance here. but in this case it is still important that the first few terms of the Taylor series approximate the function well near 0.26. but it is usually more accurate to transform the integral to a finite interval of integration. fact that the Maclaurin series of arctan x does not converge on the whole interval [0. we have Z 2 −3/2 x 0 arctan x dx = Z 0 2 −3/2 x  x3 x5 arctan x − x + − 3 5  dx + Z 2 0   x3/2 x7/2 −1/2 x − dx. On the other hand. 1] (only conditionally at x = 1). since the repeated derivatives of arctan x look much more complicated than the repeated derivatives of sin x. In fact.e. but removing the singularities by one of the methods outlined above is usually more accurate. and the second one directly. 2n + 1 3 5 convergent on the interval (−1. subtraction of singularity works fairly simply. and the first few derivatives of the function can similarly be approximated by the appropriate derivatives of the Taylor series. x3/2 repeated integration by parts would still work to prepare the integral for numerical evaluation. n! 2 n=0 and so 3/2 e−x 82 i. 2 the Taylor expansion at x = 0. There are numerical methods that work for integrals on infinite intervals. = 1 − x3/2 + x3 + O(x9/2 ).

We can differentiate the right-hand side again to obtain 1 1 y y ′′′ = − (1 + 2yy ′ )(x + y 2 )−3/2 + y ′ = − (x + y 2 )−3/2 − (x + y 2 )−1 + (x + y 2 )1/2 . and the second integral can be evaluated directly. in fact. 2 2 2 x+y 2 x+y 2 x + y2 ′′ where the second equality was obtained by substituting y ′ from the original differential equation.102 Introduction to Numerical Analysis Hence we can write Z Z 1 3/2 e−x dx = I1 = 1 0 0  3/2 e−x − 1 + x3/2 − x3 2  dx + Z 0 1  1 − x3/2 + x3 2  dx. one can differentiate to obtain p 1 + 2yy ′ 1 + 2y x + y 2 1 p y = p = = p + y. so dx = −t−2 dx. . the integrand needs to be taken to be 0. To simplify the notation. One of the simplest method of approximating the solution is Taylor’s method. The integral on the right-hand side behaves well. The coefficient’s here involve higher derivatives of y at x0 . 4 4 2 84 For t = 0. we can use the substitution t = 1/x to obtain I2 = Z ∞ −x3/2 e 1 dx = − Z 0 t −2 −t−3/2 e dt = Z 1 −3/2 t−2 e−t dt . NUMERICAL SOLUTION OF DIFFERENTIAL EQUATIONS The initial value problem for the differential equation y ′ = f (x. which consists in approximating the solution y = y(x) with its Taylor series at x0 : y(x) ≈ n X y (k) (x0 ) k=0 k! (x − x0 )k . if one wants to solve the differential equation p y(3) = 1. φ(x)) satisfying a given initial condition φ(x0 ) = y0 for some given numbers x0 and y0 . one often write y(x) instead of φ(x). y′ = x + y2 . y) is to find a function φ such that φ′ (x) = f (x. we have x = t−1 . and the upper limit ∞ moves to 0. the lower limit 1 stays the same. For example. The first integral on the right-hand side can be handled by Simpson’s rule. 0 1 indeed. As for the second integral above. the integrand and all its derivatives tend to 0 when t ց 0. One usually looks for the solution in a neighborhood of the point x0 .84 27. Hence this integral can be handled even by Romberg integration (which is more efficient than Simpson’s rule if the integrand behaves nicely). when substituting. but this is easy to calculate using implicit differentiation (provided that f is differentiable sufficiently many times).

4) into parts. a sufficient condition for the differentiability of g a (x0 . y). etc. y0 ) and B = gy (x0 . arguments are suppressed at some places. 2 6 where y1 = y(x1 ).). we must have A = gx (x0 . y0 ). one might divide the interval (3. where fx and fy denote the partial derivatives of f . |h| + |k| For the first of these equations to hold. y) + f (x. y)f (x. y) g(x.1. y(x) ≈ y(3) + y ′ (3)(x − 3) + 2 6 8 192 One can use this equation to determine y(x) at a point close to x = 3. y). Namely. y)y ′ = fx (x. for better readability. y) = +f dx ∂x dx ∂y ∂x ∂y ∂x ∂y where. where for the third equality we used the equation y ′ = f .27. One can obtain a general expression for y (n) in terms of differential operators. y0 ) is that partial derivatives gx and gy be continuous at (x0 . y ) is valid if. y ) = f (x . and then determine y(3. Differentiating once more. this means d ∂ ∂ = +f . y) = fx + fy f. k) is a function such that lim h→0 k→0 E(h.86 Since g was an arbitrary function. k) where E(h. This formula can be continued by calculating the fourth and higher derivatives of y. one can determine y(3. and for the fourth equality we used the fact that the order the mixed derivatives are taken is irrelevant. 86 Certain conditions are needed in order that the the chain rule be valid.1. 4 32 Hence y ′′′ (3) 5 59 y ′′ (3) (x − 3)2 + (x − 3)3 = 1 + 2(x − 3) + (x − 3)2 + (x − 3)3 . if one writes f instead of f (x. for example. . fx instead of fx (x. one can obtain an approximation to y(4). Numerical solution of differential equations 103 Substituting x = 2. y0 ) on the right-hand side. assuming that the function y = y(x) satisfies the above differential equation. y). Furthermore. y) + fy (x. y ′′ (3) = . y0 ). Then writing a similar equation at x = 3. f xy 0 0 xy 0 0 xy and fxy exist at (x0 . Repeating this step ten times. This latter condition means that there are numbers A and B g(x0 + h. say at x = 3. that is y ′ = f (x.2). then for the derivative of an arbitrary function g(x. y). and f and its derivatives are evaluated at (x0 . y ′′′ (3) = .. y) + fy (x. k) = 0. We will include a short discussion of Taylor’s method in general. in terms of differential operators. for h close to 0 we have we have h3 h2 (1) y1 = y0 + hf + (fx + fy f ) + (fxx + 2fxy f + fx fy + fyy f 2 + fy2 f ) + O(h4 ). y(x)) = g(x. and using the (two-variable) chain-rule y ′′ = fx (x.1) first. y) needs to be differentiable at (x0 . for the above formula to be valid for x = x0 and y0 = y(x0 ) the function y(x) needs to be differentiable at x0 and g(x. Namely. y0 ) and the first derivatives fx and fy are continuous at (x0 . If one wants to determine y(4). y) = g(x.e. y0 + h) = Ah + Bk + E(h. The second equality uses the differential equation y ′ = f (x. We have y ′ = f (x. say into ten equal parts. y). The calculations are easier to follow if one suppresses the arguments of f (i. dx ∂x ∂y 85 The equation f (x . One can even estimate the error of this approximation using the remainder term of Taylor’s formula. we obtain 5 59 y ′ (3) = 2. etc. writing h = x1 − x0 . we have y ′′′ = (fx + fy f )x + (fx + fy f )y y ′ = (fxx + fyx f + fy fx ) + (fxy + fyy f + fy fy )y ′ = (fxx + fyx f + fy fx ) + (fxy + fyy f + fy fy )f = fxx + 2fxy f + fx fy + fyy f 2 + fy2 f.85 Thus. in which case y = 3 according to the initial condition. y) + g(x. y0 ). y0 ). y) we can write   ∂ dy(x) ∂ ∂ ∂ ∂ d ∂ g g(x.

y) =  ∂ ∂ +f ∂x ∂y n−1 f While this formula can speed up the evaluation of higher-order derivatives of y. Write a third order Taylor approximation at x = 0 for the solution of the differential equation y ′ = x + y with initial condition y(0) = 2. Namely. Differentiating again. Differentiating and substituting x = 0 and y = 2. y ′ (0) = x + y = 2. it is is not as useful as one might think at first time.102. Solution. one cannot use the Binomial Theorem to evaluate the power of the differential operator on the right-hand side.540. and these differential operators do not commute:     ∂ ∂ ∂ ∂ f g = fx gy + f gyx = fx gy + f gxy .540. ∂x ∂y ∂y ∂x Problems 1. Thus. Solution.545. 626. 302x + 0. 351 for x near 0.104 Introduction to Numerical Analysis Therefore.616.272. We have y(0) = 2. for n ≥ 1 we have dn y = dx  d dx (n−1) f (x. x3 x2 − 0. Differentiating and again substituting x = 0 and y = 2. 771 + O(x4 ) 2 6 y(x) = 1 + 0. Write a third order Taylor approximation at x = 0 for the solution of the differential equation y ′ = sin x + cos y with initial condition y(0) = 1. y(x) = y(0) + y ′ (0)x + y ′′ (0) x2 x3 x2 x3 + y ′′′ (0) + O(x4 ) = 2 + 2x + 3 + 3 + O(x4 ) 2 3 2 6 1 3 = 2 + 2x + x2 + x3 + +O(x4 ) 2 2 for x near 0. the right-hand side was obtained by substituting x = 0 and y = 2. while f g = f gxy . y ′ (0) = sin 0 + cos 1 ≈ 0. . 351. We have y(0) = 1. we again obtain y ′′′ (x) = 1 + y ′ = 1 + x + y = 3. 626 + O(x4 ) 2 6 x2 x3 = 1 + 0. 676 − 0. The proof of the Binomial Theorem expressing (A+B)n depends on the commutativity AB = BA. Substituting x = 0 and y = 1 we obtain y ′′ (0) ≈ 0.540. 302 y ′′ (x) = cos x − y ′ sin y = cos x − (sin x + cos y) sin y.616. we obtain y ′′ (x) = 1 + y ′ = 1 + x + y = 3.545. we obtain y ′′′ (x) = − sin x − (cos x − y ′ sin y) sin y − (sin x + cos y)y ′ cos y = − sin x − (cos x − (sin x + cos y) sin y) sin y − (sin x + cos y)(sin x + cos y) cos y − sin x − cos x sin y + sin x sin2 y + cos y sin2 y − sin2 x cos y − 2 sin x cos2 y − cos3 y Substituting x = 0 and y = 1 we obtain y ′′′ (x) ≈ −0. 302x + 0. 2.

that is y(x1 ) − y1 is called the local truncation error. y). Runge-Kutta methods 105 28. αi . i=1 where (3) K1 = f (x0 . we have α1 = 0. so as to be consistent with equation (3). However. 2 The method described above (with appropriately chosen coefficients) is called an r-stage RungeKutta method. earlier errors may cause excessive global errors even though the local errors (truncation errors and round-off errors) may be small. and βij in such a way as to achieve a high degree of agreement with Taylor’s formula h2 y1 = y(0) + hy ′ (0) + y ′′ (0) + . The error committed in n steps is about n times the local truncation error. one seeks to determine the solution y1 = y(x1 ) at a nearby point x1 = x0 + h in the form (1) y1 = y0 + hK.87 We will discuss the equations that the requirement of an agreement with the Taylor series method imposes on the above coefficients. and (4) i−1   X βij Kj . then global truncation error is nO(hm+1 ) = O(hm ). . then m is called the order of the method. For example. The global truncation error is the error committed after a number of steps. So the local truncation error of an m order method is O(hm+1 ). if one wants to obtain a Runge-Kutta method along 87 This is not strictly true. and taking n steps to calculate the value at b from the known value at a. if the local truncation error is O(hm+1 ). One seeks to determine the coefficients γi . b). one might divide this interval into n equal parts and taking h = (b − a)/n. appropriate coefficients cannot always be found. y0 ).28. if one wants to calculate the solution on an interval (a. RUNGE-KUTTA METHODS The difficulty with Taylor’s method is that it involves a substantial amount of symbolic calculation. a task usually more complicated for computers than direct numerical calculation. Runge-Kutta methods seek to remedy this weakness of Taylor’s method in that they achieve the same precision without using any symbolic calculations. for a stable method this is not supposed to happen. . . A necessary (but not sufficient) condition that appropriate coefficients can be found is m ≤ r (we will not prove this). . where K is sought in the form (2) K= r X γi Ki . The Runge-Kutta methods are quite stable. Given the differential equation y = f ′ (x. Thus. y(x0 ) = y0 . y0 + h j=1 In the last equation. If an agreement with the Taylor formula up to and including the term hm is achieved. For a given m and r. since the errors do not add up: earlier errors may be magnified in subsequent calculations. Ki = f x0 + hαi . In an unstable method. The error committed in calculating the true value of y(x1 ).

In order to describe the equations for the coefficients involved. we need to express the quantities Ki in terms of the two dimensional Taylor series. hence we are only going to present the basic principles involved. According to this . the equations that one needs to solve are quite complicated.106 Introduction to Numerical Analysis these lines that are useful in practice.

.

 i  n+1 n X .

.

1 1 ∂ ∂ ∂ ∂ F (x0 +h. y0 +k) = h h F (x. y).

.

y).x=x + F (x.

.

and θ in the multi-variable Taylor formula. The right-hand side here can be further expanded by considering the Taylor expansion of the Kj on the right-hand side. we obtain !  j−1 i−1    X X βjl f h h Ki = f + fx αi + fy βij f + fx αj + fy j=1 + 1 2 fxx αi2 + 2fxy αi l=1 i−1 X βij f + fyy = f + fx αi + f fy i−1 X j=1 βij f j=1 j=1  X i−1   βij h + fx fy i−1 X j=1 2  h2 + O(h3 ) αj βij + f fy2 j−1 i−1 X X βij βjl j=1 l=1  i−1 i−1 X 1 1 2 X 2 2 2 h + O(h3 ). y0 + h j=1 j=1 + 1 2 fxx αi2 + 2fxy αi i−1 X βij Kj + fyy j=1 i−1 X βij Kj j=1 2  h2 + O(h3 ).  3 3 2  2 3    ∂ ∂ ∂ ∂ ∂ ∂ ∂ ∂ h k F = h F +3 h k F + 3h F+ h F +k ∂x ∂y ∂x ∂x ∂y ∂x ∂y ∂y ∂3F ∂3F ∂3F ∂3F = h3 3 + 3h2 k 2 + 3hk 2 + k 3 3 = h3 Fxxx + 3h2 kFxxy + 3hk 2 Fxyy + k 3 Fyyy . it relies on the fact ∂ ∂ ∂ ∂ = ∂y . This is valid if the function being differentiated is sufficiently nice. . Usually. the expansion Kl = f + O(h) was taken to obtain the right-hand side. in the second member of these equalities. 89 The proof of the Binomial Theorem relies on the commutativity of multiplication. 1). Other single-variable remainder terms can be used to obtain different multi-variable remainder terms. y0 ). and the symbol O(h3 ) is considered in a neighborhood of h = 0. we have i−1 i−1     X X βij Kj h βij Kj = f + fx αi + fy Ki = f x0 + hαi . Substituting this expansion into the expansion of Ki (in the terms multiplied by h2 .88 The derivatives after the sum sign can be evaluated by using the Binomial Theorem. where the derivatives are taken at the point (x0 . Kj = f + fx αj + fy βjl Kl h + O(h2 ) = f + fx αj + fy l=1 l=1 where. + fxx αi + f fxy αi βij βij + f fyy 2 2 j=1 j=1 88 It is customary to use θ for an unknown number between (0. y0 + tk). +k +k 0 i! ∂x ∂y (n + 1)! ∂x ∂y y=y0 i=0 y=y0 +θk This formula can easily be derived by writing G(t) = F (x0 + th. 2 ∂x ∂x ∂y ∂x∂y ∂y Using this formula with n = 2 for Ki above. using the same formula that we just gave for Ki : j−1 j−1     X X βjl f h + O(h2 ). we only need to take Kj = f + O(h)).x=x0 +θh . one used ξ in the remainder term of the single-variable Taylor formula. That is. so that the mixed derivatives that ∂x ∂y ∂x do not depend on the order of differentiation. The remainder term involves the value of G(θ) for some θ in the interval (0.89 For example. 1). and expressing G(1) in terms of the Taylor formula for G at t = 0. We used the Lagrange remainder term of the single-variable Taylor formula to obtain the multi-variable remainder term.

Since the present equation must hold for every (small) h and for every f . However. h y1 = y0 + (K1 + 2K2 + 2K3 + K4 ). If the error is too large. 2 γi αi2 = r X i=1 1 . K3 = f (x0 + h/2. in particular. but this involves too many function evaluations. We obtain r X i=1 γi = 1. 6 This is fourth order method. and we would need agreement with error of O(h5 ) to derive a fourth-order method. In the Runge-Kutta-Fehlberg method. y0 ). 2 i−1 X j=1 r X γi i=1 βij = 1 . the coefficients of f h. In the classical Runge-Kutta method one has K1 = f (x0 . r X γi αi = i=1 r X i=1 1 . fxx h3 . fx fy h3 . f fxy h3 . y0 + hK1 /2). one should use a smaller step size. A better way is to use two Runge-Kutta approximations with the same step size. one has a fourth-order method combined with a fifth 90 This means. Using two different methods normally still involves too much calculation. . The higher order method then calculates the approximation with much greater precision. K2 = f (x0 + h/2. there are pairs of Runge-Kutta methods where most of the calculations used in making the lower order step can be re-used form making the higher order step. 3 i−1 X αj βij = j=1 r X i=1 γi i−1 X j=1 1 . One way of estimating the error of a Runge-Kutta approximation is to use the same method with step size h and step size h/2 and compare the results. fx h2 . and f 2 fyy h3 must agree. y0 + hK3 ). y0 + hK2 /2). we obtain y1 = y0 + f h r X i=1  fx fy r X i=1 1 + f 2 fyy 2 γi r r i−1  X  X X γi + fx γi αi + f fy γi βij h2 + i=1 i−1 X i=1 αj βij + f fy2 j=1 r X γi i=1 r X i=1 i−1 X βij j=1 2  γi j=1 j−1 i−1 X X j=1 l=1 r r i−1 X X X 1 γi αi2 + f fxy γi αi βij βij βjl + fxx 2 i=1 i=1 j=1 h3 + O(h4 ). f fy h2 . one may compare this equation with equation (1) in Section 27 concerning Taylor’s method on the Numerical solution of Differential Equations. f fy2 h3 . so the difference of the two approximations can be taken as the error. 3 More equations can be obtained if more terms of the to Taylor expansions are calculated and compared. that the equations for the coefficients we listed above are not sufficient to derive this method. K4 = f (x0 + h. 6 βij r X i=1 2 = γi j−1 i−1 X X j=1 l=1 βij βjl = 1 . one method of order m and another of order m + 1.28. 3 γi i−1 X βij = j=1 r X i=1 γi αi 1 . To obtain equations for the coefficients. The above equations reflect only agreement with the Taylor expansion of y(x) with error O(h4 ).90 The disadvantage of this method is that it is difficult to estimate the error of the approximation of y1 to the true solution of the differential equation. 6 1 . Runge-Kutta methods 107 Substituting this into (1) and (2).

int *countfail. one puts K1 = f (x0 . one halves the step size. one uses the fourth order method with this estimate of the error. The declarations and the included resources are contained in the file rk45.. 282 in [AH]. then one doubles the step size. where x1 = x0 + h. y0 + h(−8K1 /27 + 2K2 − 3544k3 /2565 + 1859K4 /4104 − 11K5 /40)). y0 ).e. y1 = y0 + h(25K1 /216 + 1408K3 /2565 + 2197K4 /4104 − K5 /5). where rk45 refers to Runge-Kutta 4th and 5th order: 1 2 3 4 5 6 7 8 9 10 #include <stdio. double epsilon.h.108 Introduction to Numerical Analysis order method.. y0 + h(439K1 /216 − 8K2 + 3680K3 /513 − 845K4 /4104)). They erroneously give this term as 28561K4 /56437. 92 One might think that it is better to use the fifth-order method. This gives the following estimate for the error E = h(K1 /360 − 128K3 /4275 − 2197K4 /75240 + k5/50 + 2K6 /55).52) on p. no error estimate is available for the fifth-order method. K3 = f (x0 + 3h/8. int *countsucc. double rk45(double (*fnct)(double. In this file. double *calling_h. y¯1 = y0 + h(16K1 /135 + 6656K3 /12825 + 28561K4 /56430 − 9K5 /50 + 2K6 /55). K2 = f (x0 + h/4.92 When looking for the solution of a differential equation on an interval (a. For the fifth-order method one needs to do only one more function evaluation91 K6 = f (x0 + h/2. but combining the fifth order method with the error estimate for the fourth order method would lead to numerical instabilities.h> #include <math. In a practice. double x1. given an initial condition y(a) = c. which stands for the function f in the differential 91 In formula (8.h> #define absval(x) ((x) >= 0. takes h/2 as the new value of h (perhaps one needs to do this a number of times). y0 + h(1932K1 − 7200K2 + 7296K3 )/2197). b) (i. int *success). y0 + hK1 /4). For the local truncation error of the former method we have y(x1 ) − y1 = O(h5 ). double y). since it is more accurate than the fourth-order method. Next we discuss a computer implementation of the Runge-Kutta-Fehlberg method with step size control. If at one point one finds that the local truncation error is too small. one starts out with an initial step size h. Here y1 is the approximation given by the method to the correct value of y(x1 ). the definition of the abbreviation absval(x) to calculate the absolute value of x is given on line 4.0 ? (x) : (-(x))) double funct(double x. Since the latter is much smaller than the former. while for that of the latter method we have y(x1 ) − y¯1 = O(h6 ). However. Line 6 declares the function funct. . the denominator in the term involving K on the right-hand side of the 4 equation for y1 is in error. In the fourth order method. one can take y¯1 − y1 as a good estimate for the y(x1 ) − y1 . and if the local truncation error is too large. While it is true that the error of the fifth could be estimated by the above formula for E. where we wrote y¯1 for the approximation given by the method to the correct value of y(x1 ). double y0.double). i. K5 = f (x0 + h. double min_h.e. y0 + h(3K1 + 9K2 )/32). double x0. K4 = f (x0 + 12h/13. one is looking for y(b)).

int *countsucc. +1859.*k5/40. err1=h*(k1/360.*k3/2565.28. y).*k4/4104.)).. epsilonover32=epsilon/32.k5.*k2+7296. x += h. h=x1-x.) { k1=F(x. +k5/50.k2.. double y0. } else { (*countsucc)++. k5=F(x+h.))..*k4/4104.+1408. y +=h*(25.k3.+2.*k1+9. The function itself is described in the file rk45.-k5/5.y+h*(1932..y+h*(3.*k3/513.).h=*calling_h.*k1-7200. double k1..).y+h*(-8. else { if ( err <= h*epsilonover32 ) h *= 2.*k1/216.k6.*k2)/32.*k3/2565.*k2-3544. /* state==0: stepping state==1: h is too small state==2: x1 is reached */ for ( state=0.).*k1/216. if ( absval(x1-x) <= assumedzero ) state=2.-8..y+h*k1/4. Runge-Kutta methods 109 equation y ′ = f (x.). if ( shortrange ) { *calling_h=h. err=absval(err1).-128.*k1/27. k4=F(x+12.y).*k3/4275. On line 7.*k2+3680. double *calling_h. double x1. shortrange=0.-11. the function rk45 is declared to perform the Runge-Kutta-Fehlberg method with step size control. k6=F(x+h/2.double).h" #define F(x. k3=F(x+3. } (*countfail)++. if ( err>h*epsilon ) { h /= 2.*k3)/2197. int state.+2*k6/55.y) ((*fnct)(x. int *countfail. double epsilon.y+h*(439.*h/8.err.*k4/4104. k2=F(x+h/4.+2197.*k4/75240. if ( h > x1-x ) shortrange=1. } } } } .). if ( h<min_h ) { state=1. int *success) { /* This program implements the Runge-Kutta_Fehlberg method to solve an ordinary differential equation */ const double assumedzero=1e-20.err1.y=y0.c: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 #include "rk45. !state .k4.y)) double rk45(double (*fnct)(double..-2197.x. -845. double x0.*h/13. double min_h.

epsilon.. and the calculation is finished (on line 51. the quantities k1 through k6 (corresponding to K1 through K6 ) are evaluated. the point where the solution is sought (the interval (x0. line 13. Its first parameter is (*fnct). and in fact these fractions would be evaluated only once. then the step size is halved. the starting point. corresponding to ǫ/32 (cf. } else *success=0. this location will contain the value of the step size that was used last). larger that h*epsilon. pointer to the function f in the differential equation y ′ = f (x. based on the value of state). and in line 29 its absolute value is taken.110 50 51 52 53 54 55 56 57 } Introduction to Numerical Analysis if ( !shortrange ) *calling_h=h.y) according to the definition on line 2. any optimizing compiler would take care of this. a pointer calling_h to the step size the function uses (the starting value *calling_h of the location pointed to the by calling_h is the initial step size specified by the calling program.. if ( state==2 ) { *success=1. that of x is incremented by h on the next line.e. Therefore it might be reasonable to give some additional leeway.y) stands for (*fnct)(x. In lines 15–17 the comments describe the meaning of the variable state: it will be 0 if there are more steps to be calculated. when successive steps of the method are applied. However. return 0. in lines 19–25. to simplify the notation.e. x1. and 0 (false) if the calculation was not successful. The definition of the function rk45 starts on line 4. indicating that x1 has been reached. y0.e.0.. corresponding to hǫ). where the step needs to be repeated with a smaller step size. The reason for this would be that these fractions should be calculated only once. the location pointed to by the pointer countsucc counts the number of successful steps (i. that is. x0. The criterion used for considering the error too small is that it is smaller than epsilonover32. error step size is doubled. the global error allowed at x1. doubling the step size will make the error only slightly smaller than ǫ. where epsilonover32 is initialized). Otherwise the count of successes is incremented on line 36. Lines 18–48 contain a look that is performed if state is 1 (i. The location pointed to by success will contain 1 (true) if the value of the solution at x1 was successfully calculated. frequent halving and doubling of the step size. y) to be solved. while countfail points to the location containing the number of failed steps (i. and put these floating point numbers in the program. the error of the fourth-oder method is evaluated. return y. stepping). the value of y(x) at the next step. the initial value of the solution. when the result of the calculated for the given step is accepted). F(x.. Inside this loop. The reason is that a fourth order method is used.93 in these calculations. and the step size would then be halved. If on line 30 it is found that the local error is too large (i. If the value of h thereby becomes less than min_h. the process is abandoned. is calculated on line 38.. on line 32 the integer state is changed to 1. state is changed to 2. We will consider the effect of this on an example. and only consider the error too small if it is less than ǫ/64. and not every time the loop is executed. and the count of failures is incremented on line 33. The parameter min_h contains the smallest permissible step size (if the step size needed to achieve the required accuracy needs to be smaller. i. so the local truncation error is proportional to h5 . 
it will be 1 is the step size needed to achieve the required accuracy would be too small). If the new value of x is too close to x1 (the endpoint of the interval on which the solution is calculated).x1) will be divided into smaller parts). In line 27.e.e. If on line 41 it is found that the error is too small. This might result in thrashing. so doubling the step size will result in a 32-fold increase of the error. at compile time. . The function rk45 returns the calculated value of the solution at x1. when the process finished. and y. in a few steps the error might become larger than ǫ.94 93 One might think that instead of giving the coefficients as common fractions in these calculations. 94 This maybe trying to control the size of the error too tightly: if the error is only slightly smaller than ǫ/32. one should calculate these fractions as floating point numbers.

28. Runge-Kutta methods

111

If x is close to x1 than the current step size h, the integer shortrange is changed to 1 on line 42.
The next and last step to be taken will be x1-x instead of the current step size h, but the current
value of h is preserved at the location calling_h on line 44, so that the calling program can take
this value from this location. If the function rk45 is called again to continue the calculation, this
value of h can be used for further calculations as a starting value.
The loop ends on line 48, and in line 48, if the last step taken was not too short (i.e., shortrange
is 0; shortrange was initialized to be 0 on line 14, but this value may have been changed on line 43),
the last used value of h is loaded into the location calling_h (if the last step taken was too short,
this location will contain the value of h before the short step was taken; this value was assigned on
line 44).
If the loop ended with state equaling 2, assigned on line 39 (rather than 1, assigned on line
32), the variable *success is assigned 1 (true) on line 52, and the value y of the solution at x1 is
returned. Otherwise, *success is assigned 0, and the value 0 is returned on line 56.
The file funct.c contains the definition of the function f (x, y).
1
2
3
4
5
6
7
8
9
10
11
12
13
14

#include "rk45.h"
double funct(double x,double y)
{
const assumedzero=1e-20;
double value;
if ( absval(x+1.0)<=assumedzero ) {
value=0.0;
}
else {
value=-(x+1.0)*sin(x)+y/(x+1.0);
}
return(value);
}

This file defines

f (x, y) = −(x + 1) sin x +

y
.
x+1

The calling program is contained in the file main.c:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18

#include "rk45.h"
double sol(double x);
main()
{
double min_h=1e-5, epsilon=5e-10,
x0=0.0,x1,xn=10.0,y0=1.0,y1,h,h0,yx;
int i, n=10, success, countsucc=0,countfail=0, status;
/* status==0: failed;
status==1: integrating;
status==2: reached the end of the interval; */
status=1;
printf("Solving the differential equation "
"y’=-(x+1)sin x+y/(x+1):\n");
printf("
x
y
"
"
exact solution
error\n\n");
yx=sol(x0);
printf("%6.3f %20.16f %20.16f %20.16f\n",

112

Introduction to Numerical Analysis

19
x0, y0, yx, y0-yx);
20
h0=(xn-x0)/((double) n); h=h0;
21
for (i=1; status==1; i++) {
22
x1=x0+h0;
23
y1=rk45(&funct, x0, y0, x1, epsilon, &h,
24
min_h, &countsucc, &countfail, &success);
25
if ( success ) {
26
x0=x1; y0=y1;
27
yx=sol(x0);
28
printf("%6.3f %20.16f %20.16f %20.16f\n",
29
x0, y0, yx, yx-y0);
30
if ( i>=n ) status=2;
31
}
32
else {
33
status=0;
34
printf("No more values could be calculated.\n");
35
}
36
}
37
printf("\n" "Parameters\n");
38
printf("epsilon:
"
39
"
"
40
"%.16f\n", epsilon);
41
printf("smallest step size allowed:"
42
"
"
43
"%.16f\n", min_h);
44
printf("number of successful steps: %8d\n", countsucc);
45
printf("number of failed steps:
%8d\n", countfail);
46 }
47
48 double sol(double x) {
49
/* This function is the exact solution of the
50
differential equation being solved, to compare
51
with the numerical solution */
52
return (x+1.0)*cos(x);
53 }

This program will solve the differential equation y ′ = f (x, y) with the above choice of f (x, y) with
the initial condition y(0) = 1. The exact solution of this equation with the given initial condition is
(5)

y(x) = (x + 1) cos x.

Indeed, differentiating this, we obtain
(6)

y ′ (x) = cos x − (x + 1) sin x.

According to the above equation, we have
cos x =

y(x)
.
x+1

Replacing cos x in (6) with the right-hand side, we obtain the differential equation we are solving.
The exact solution using equation (5) is calculated in lines 48–53 as sol(x), so that the calculated
solution can be compared with the exact solution. The calculated solution and the exact solution

28. Runge-Kutta methods

113

is printed out at the points x = 1, 2, 3, . . . , 10. This is accomplished by setting up a loop in lines
21–35. Inside the loop, the function rk45, discussed above, is called with initial values x0 and ending
values x1 = x + 1 with x0 = 0, 1, . . . , 10. On line 38, the value of x, the calculated value y¯(x) of
y(x), the exact value of y(x), and the error y(x) − y¯(x) is printed out. Finally, in lines 37–46, the
value of ǫ, h, the smallest step size allowed, the number of successful steps, and the number of failed
steps is allowed. On line 6, the smallest step size is specified as 10−10 and ǫ is given as 10−5 (the
way the program was written, ǫ means the error allowed on an interval of length 1). The printout
of the program is
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20

Solving the differential equation y’=-(x+1)sin x+y/(x+1):
x
y
exact solution
error
0.000
1.000
2.000
3.000
4.000
5.000
6.000
7.000
8.000
9.000
10.000

1.0000000000000000
1.0806046116088004
-1.2484405098784015
-3.9599699865316182
-3.2682181044399328
1.7019731125069812
6.7211920060799493
6.0312180342683925
-1.3095003046372715
-9.1113026190614956
-9.2297868201511246

Parameters
epsilon:
smallest step size allowed:
number of successful steps:
number of failed steps:

1.0000000000000000
1.0806046117362795
-1.2484405096414271
-3.9599699864017817
-3.2682181043180596
1.7019731127793576
6.7211920065525623
6.0312180347464368
-1.3095003042775217
-9.1113026188467696
-9.2297868198409763

0.0000000000000000
0.0000000001274790
0.0000000002369743
0.0000000001298364
0.0000000001218732
0.0000000002723763
0.0000000004726129
0.0000000004780446
0.0000000003597498
0.0000000002147257
0.0000000003101476

0.0000000005000000
0.0000100000000000
353
10

If one makes a slight change in the file rk45.c by replacing the definition of the variable
epsilonover32=epsilon/32. given in line 13 with epsilonover32=epsilon/64., the printout
changes only slightly:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19

Solving the differential equation y’=-(x+1)sin x+y/(x+1):
x
y
exact solution
error
0.000
1.000
2.000
3.000
4.000
5.000
6.000
7.000
8.000
9.000
10.000

1.0000000000000000
1.0806046116088004
-1.2484405098375091
-3.9599699864770947
-3.2682181043739900
1.7019731125861131
6.7211920061722683
6.0312180343918813
-1.3095003044983455
-9.1113026189071356
-9.2297868199575586

Parameters
epsilon:
smallest step size allowed:
number of successful steps:

1.0000000000000000
1.0806046117362795
-1.2484405096414271
-3.9599699864017817
-3.2682181043180596
1.7019731127793576
6.7211920065525623
6.0312180347464368
-1.3095003042775217
-9.1113026188467696
-9.2297868198409763

0.0000000000000000
0.0000000001274790
0.0000000001960819
0.0000000000753129
0.0000000000559304
0.0000000001932445
0.0000000003802938
0.0000000003545558
0.0000000002208238
0.0000000000603657
0.0000000001165816

0.0000000005000000
0.0000100000000000
361

114

Introduction to Numerical Analysis

20 number of failed steps:

10
Problem

1. Consider the differential equation y = f (x, y) with initial condition y(x0 ) = y0 . Show that,
with x1 = x0 + h, the solution at x1 can be obtained with an error O(h3 ) by the formula  

h
h
y1 = y0 + hf x0 + , y0 + f (x0 , y0 ) .
2
2
In other words, this formula describes a Runge-Kutta method of order 2.95
Solution. Writing f , fx , fy for f and its derivatives at (x0 , y0 ), we have We have
f 

x0 + 

h
h
h
h
, y0 + f (x0 , y0 ) = f + fx + f · fy + O(h2 ).
2
2
2
2

according to Taylor’s formula in two variables. Substituting this into the above formula for y1 , we
obtain
h2
y1 = y0 + hf + (fx + f fy ) + O(h3 ).
2
This agrees with the Taylor expansion of y1 (given in the preceding program) with error O(h3 ),
showing that this is indeed a correct Runge-Kutta method of order 2.

29. PREDICTOR-CORRECTOR METHODS
Consider the differential equation y ′ = f (x, y), and assume that its solution y = y(x) is known at
the points xi = x0 + ih for h = 0, 1, 2, 3. Write yi = y(xi ).96 We would like to calculate y4 . Clearly,
Z x4
f (x, y(x)) dx.
(1)
y4 = y3 +
x3

We intend to evaluate the integral on the right-hand side by approximating f (x, y(x)) with an
interpolation polynomial. We will consider two different interpolation polynomial for this: first, we
will use the polynomial P (x) interpolating at the points x0 , x1 , x2 , x3 , and, second we will use
the polynomial Q(x) interpolating at the points x1 , x2 , x3 , x4 . The interpolation by Q(x) will give
a better approximation, because interpolation inside an interval determined by the nodes is more
accurate than interpolation outside this interval.97 On the other hand, interpolation by Q(x) will
involve the unknown value y4 . This will not be a big problem, since y4 will occur on both sides of
the equation, and it can be determined by solving an equation.
Write f¯(x) = f (x, y(x)), and write fi = f (xi , yi ). To simplify the calculation assume that x2 = 0;
in this case x0 = −2h, x1 = −h, x3 = h, x4 = 2h. It will help us to make the calculations that
follow more transparent to also use the notation f¯i = f¯(ih). Note, that, this will mean that
(2)
95 This

fi = f¯i−2

method is called the modified Euler method. The Euler method simply takes y1 = y0 + hf (x0 , y0 ).
or fewer points could be considered, but we want to confine our attention to a case where we can derive
formulas of considerable importance in practice.
97 In fact, for P (x), the interval determined by the nodes is (x , x ), and so x in the interval (x , x ) of integration
0
1
3
4
is outside the interval of interpolation, while, for Q(x), the interval determined by the nodes is (x1 , x4 ), and so x is
in the interval (x3 , x4 ) of integration is inside the interval of interpolation.
96 More

29. Predictor-Corrector methods

115

Writing EP for the error of the Newton interpolation polynomial P interpolating f¯ at x0 , x1 , x2 ,
x3 we have
f¯(x) = P (x) + EP (x) = f¯[0] + f¯[0, −h]x + f¯[0, −h, h]x(x + h) + f¯[0, −h, h, −2h]x(x + h)(x − h)
f¯0 − f¯−1
+ f¯[0, −h, h, −2h, x]x(x + h)(x − h)(x + 2h) = f¯0 +
x
h
f¯1 − 3f¯0 + 3f¯−1 − f¯−2 3
f¯1 − 2f¯0 + f¯−1 2
(x
+
xh)
+
(x − h2 x)
+
2h2
6h3
f¯(4) (ξx ) 4
+
(x + 2hx3 − h2 x2 − 2h3 x)
4!
for some ξx ∈ (x0 , x4 ) = (−2h, 2h) (assuming that x ∈ (x3 , x4 )).98 Integrating this, we obtain
Z

2h

h

f¯0 − f¯−1 3h2
f¯1 − 2f¯0 + f¯−1 23h3
f¯1 − 3f¯0 + 3f¯−1 − f¯−2 3h4
f¯ = f¯0 h +
·
·
·
+
+
h
2
2h2
12
6h3
8
(4)
h
251 5
f¯ (ξ) 251 5
·
h =
(55f¯1 − 59f¯0 + 37f¯−1 − 9f¯−2 ) + f¯(4) (ξ)
h
+
4!
30
24
720
251 5
h
(55f3 − 59f2 + 37f1 − 9f0 ) + f¯(4) (ξ)
h
=
24
720

for some ξ ∈ (x0 , x) ⊂ (x0 , x4 ); for the last equality we used (2). In calculating the remainder term,
note that the polynomial
x4 + 2hx3 − h2 x2 − 2h3 x = x(x + h)(x − h)(x + 2h)
is positive on the interval (h, 2h), and so we can use the Mean-Value Theorem for Integrals given
in formula (1) in Section 20 on Simple Numerical Integration Formulas.99 Recalling that f¯(x) =
f (x, y(x)) and substituting this into (1), we obtain
(3)

y4 = y3 +

h
251 5
(55f3 − 59f2 + 37f1 − 9f0 ) + f¯(4) (ξ)
h
24
720

for some ξ ∈ (x3 , x4 ). This is the Adams-Bashforth predictor formula.
98 To

ease the calculation of differences, one may note that
f [xi , xi+1 , . . . , xi+k ] =

1
∆k f i .
k!hk

according to (1) in Section 10 on Newton Interpolation With Equidistant Points, and
∆n f (x) =

n “ ”
X
n
(−1)n−i f (x + ih).
i
i=0

according to a formula in Section 7 on Finite Differences.
99 What we need here is a slight extension of that result. In discussing that formula, we assumed that ξ belongs
x
to the interval of integration, which is not the case at present. The same argument can be extended to the situation
when ξx belongs to some other, arbitrary interval, since the only thing that was used in the argument was that the
derivative of a function satisfies the Intermediate-Value Theorem. Thus, the Mean-Value Theorem in question can be
extended to cover the present situation.

and ycorr of the value calculated for y4 by the corrector formula (using ypred to calculated f4 . One solves this equation by fixed-point iteration. x2 . Recalling that f¯(x) = f (x.e. h = y3 + (9fpred + 19f3 − 5f2 + 9f1 ). h]x(x + h) + f¯[0. 2h). −h]x + f¯[0. ypred equals the right-hand side of (3) except for the error term). we obtain Z 2h h f¯1 − 2f¯0 + f¯−1 23h3 f¯2 − 3f¯1 + 3f¯0 − f¯−1 3h4 f¯0 − f¯−1 3h2 · · · + + f¯ = f¯0 h + 2 h 2 2h 12 6h3 8 (4) ¯ f (η) 19 5 h ¯ 19 5 + · h = (9f2 + 19f¯1 − 5f¯0 + 9f¯−1 ) − f¯(4) (η) h 4! 30 24 720 19 5 h (9f4 + 19f3 − 5f2 + 9f1 ) − f¯(4) (η) h = 24 720 for some η ∈ (x1 . note that the polynomial x4 − 2hx3 − h2 x2 + 2h3 x = x(x + h)(x − h)(x − 2h) is negative on the interval (h. 2h) (assuming that x ∈ (x3 . h. but one only does one step of fixed point iteration. ypred ). −h. x4 )). x4 . where (2) was used to obtain the last equality. one uses the corrector formula (4) as an equation for y4 (since y4 occurs on both sides). We will explain below how exactly this is done. In calculating the remainder term. x4 ). 2h]x(x + h)(x − h) f¯0 − f¯−1 + f¯[0. Write ypred for the value calculated for y4 by the predictor formula (i. write h (55f3 − 59f2 + 37f1 − 9f0 ). we obtain (4) y4 = y3 + h 19 5 (9f4 + 19f3 − 5f2 + 9f1 ) − f¯(4) (η) h 24 720 for some η ∈ (x3 . and so we can again use the Mean-Value Theorem for Integrals given in formula (1) in the Section 20 on Simple Numerical Integration Formulas. y(x)) and substituting this into (1). but first we need to discuss the error estimates for these formulas. That is. This is the Adams-Moulton corrector formula. x3 . −h. 2h. 24 ypred = y3 + (5) fpred ycorr In order to estimate the error. x4 ). we make the assumption that (6) f¯(4) (η) ≈ f¯(4) (ξ) and 19 5 h y4 ≈ ycorr − f¯(4) (η) 720 .. x4 ) = (−2h.116 Introduction to Numerical Analysis Similarly as above writing EQ for the error of the Newton interpolation polynomial Q interpolating f¯ at x1 . with a starting value obtained by the predictor formula (3). −h. h. 24 = f (x4 . x]x(x + h)(x − h)(x − 2h) = f¯0 + x h f¯1 − 2f¯0 + f¯−1 2 f¯2 − 3f¯1 + 3f¯0 − f¯−1 3 + (x + xh) + (x − h2 x) 2 2h 6h3 f¯(4) (ηx ) 4 + (x − 2hx3 − h2 x2 + 2h3 x) 4! for some ηx ∈ (x1 . In a practical method for solving differential equations. Integrating this. we have f¯(x) = Q(x) + EQ (x) = f¯[0] + f¯[0.

Using these estimates. The header file abm. and Ecorr is called the Adams-Bashforth-Moulton predictor-corrector method.29. if the error is not small enough after one iteration. λ)(y4 − ypred ) − f¯(4) (η) h 24 720 24 720 251 19 19 251 6 9h fy (x4 .070. 720 where (3) and (5) were used to express y4 − ypred . for computers. that can supply these values. and the third equation uses the error term given in (3). the second term on the right-hand side is much smaller than the first term. . as on the right-hand side of (4). so we obtain the second formula in (6). 428.100 The method combining the above formulas (5). we will use the latter value in the computer program below. Since the formulas rely on four previous values values for yi .071. rather than f4 . Hence 19 5 19 ¯(4) 270 5 19 y4 − ycorr ≈ −f¯(4) (η) h =− · f (η) h ≈− (ycorr − ypred ) 720 270 720 270 The left-hand side is the error Ecorr of the corrector formula. We will use the method solve the same differential equation f (x. For small h. After the Runge-KuttaFehlberg method supplies the initial values. 370. the Adams-Bashforth-Moulton method can work on its own until the step size needs to be changed (decreased because the error is too large or increased because the error is too small – to speed up the calculation). ycorr . The corrector formula (using fixed point iteration to solve the equation for y4 ) is used only once. The reason the latter equation is not exact is that we used fpred to calculate ycorr . When the step size is changed. Next we will discuss a computer implementation of this method. y) considered as a function of the single variable y only. Nevertheless. That is. such as the RungeKutta-Fehlberg method. λ)f¯(4) (ξ) h = 24 720 720 720 1920 y4 − ycorr = where the second equation follows with λ between y4 and ypred from the Mean-Value Theorem for the function f (x4 . 270 14 For the last approximate equation. the method is used in combination with another method.h contains the header file for a number of files used in the implementation of this method: 100 There seems to be hardly any practical difference whether one uses the value 19/270 or 1/14. Predictor-Corrector methods 117 holds. (7) Ecorr ≈ − 1 19 (ycorr − ypred ) ≈ − (ycorr − ypred ). and (7) for ypred . y) = −(x + 1) sin x + y . we obtain ycorr − ypred = (y4 − ypred ) − (y4 − ycorr ) ≈ f¯(4) (η)  251 19 + 720 720  270 5 h5 = f¯(4) (η) h . the Runge-Kutta-Fehlberg method can again be invoked to supply new starting-valued with the changed step size. λ)f¯(4) (ξ) h5 − f¯(4) (η) h5 = −f¯(4) (η) h5 + fy (x4 . 6. this is not an issue. (5). at compile time. and the second formula in (6) was used to estimate y4 − ycorr . For this reason. x+1 the file funct. The computer evaluates these fractions only once. it is better to decrease the step size than to do more iteration.c defining this function is identical to the file with the same name considered for the Runge-Kutta-Fehlberg method. The latter value is easier for humans to remember. note that 19/270 ≈ 0. 4 and 1/14 ≈ 0. 9h 19 5 9h 19 5 (f4 − fpred ) − f¯(4) (η) h = fy (x4 .

double x0. double *error).*k3/4275.-k5/5.*K1/216.double).). double epsilon. double f3.double). double f2. x0 for the starting point.+2*k6/55. int *countrkfail.*K1-7200. y1. *x1=x0+h. -845. and the step size control will be performed in the calling program. y0 for the initial value of the solution at that point. *error=absval(h*(K1/360.. k1addr . k4=F(x0+12.*K1+9. the step size h (so *x1=x0+h).*h/13. double *k1addr. double f1.y0+h*(439. double *x1.+2. *x1 is end point of the step where the solution is calculated (this value is calculated in here. double *h. double f0.+2197.y0+h*(3.y0+h*K1/4.*k2)/32. double h. double *x1.).y0+h*(-8.*k4/4104. k6=F(x0+h/2. double rkfstep(double (*fnct)(double.+1408.y0). double y0. This file contains the declarations of the functions needed for the program. double h.*k3/2565.*K1/216.*h/8.*k3)/2197. double x0. return y1.0 ? (x) : (-(x))) double funct(double x. double *error).-11. k4. K1=F(x0.double). k5.).*K1/27.*k2-3544. int *countabfail. k6. k2=F(x0+h/4. double x4.*k3/513. k3=F(x0+3. double x1. This file only implements a single step of the method. double x0.*k4/4104. y1=y0+h*(25. double h.h> #include <math. The parameters of this function are the pointer funct describing the function f (x.*k2+3680. double abmstep(double (*fnct)(double.*k4/75240.). k3. One step of the Runge-Kutta-Fehlberg method is implemented in the file 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 #include "abm. y) in the differential equation to be solved. double y0.)).y0+h*(1932.*k4/4104. double y).-2197. int *countabsucc.*k3/2565.h> #define absval(x) ((x) >= 0. double abmrkf(double (*fnct)(double. int *success).-128. k5=F(x0+h. double *error) { /* This program implements the Runge-Kutta_Fehlberg method to solve an ordinary differential equation */ double k2.y)) #define K1 (*k1addr) double rkfstep(double (*fnct)(double.118 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 Introduction to Numerical Analysis #include <stdio. double y3.y) ((*fnct)(x. double *k1addr. these declarations will be discussed in detail below.)).)). double y0.double).*k5/40. int *countrksucc.. +k5/50. and the calling program can read it from the location x1.*k2+7296.h" #define F(x. } The implementation of the Runge-Kutta-Fehlberg method has been discussed earlier.-8... double min_h. +1859.

y4. the main steps to ensure this were: 1) the Runge-Kutta method is used only to decrease the step size. double f0. int *countabfail.predy4).*f2+37. h.y) and K1 are defined.c: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 #include "abm. double f2. and the calling program can get it from this location. the value y1 of the solution is returned to the calling program. return y4. 2) the predictor corrector method is allowed to increase the step size only if it was found on 8 consecutive occasions that the step size is too small. One step of the Adam-Bashforth-Moulton method is implemented in the file abmstep.c implements step size control for the method: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 #include "abm. and h. Predictor-Corrector methods 119 for the address of the quantity K1 (x0 . double h.*predf4+19. f2 . double epsilon.y) ((*fnct)(x. */ . double y3. This is a difficult problem.*f3-5. int *countrkfail. double *error) { double predy4. Automatic step size control is incorporated. f1. } The function abmstep is quite simple.. predf4=F(x4.. double x0. *error=absval(y4-predy4)*1.y) ((*fnct)(x. double f1. respectively. and never to increase it. f3 . y).*f1-9. int *countabsucc. int *countrksucc. y0 ) (this is calculated here.y)) #define H (*h) double abmrkf(double (*fnct)(double./14. predy4=y3+h*(55. The main problem in coupling these two methods is to ensure that most of the steps are performed by the predictor corrector method and not by the Runge-Kutta method.29.y)) double abmstep(double (*fnct)(double. double y0. double f3. On line 24. y4=y3+h*(9. double min_h. The starting values or the initial values when the step size is changed are supplied by the Runge-Kutta-Fehlberg method. predf4. f1 . corresponding to x4 . The file abm. Its parameters are a pointer fnct representing the function f (x. double x4. f2. in lines 2 and 3 the symbols F(x.h" #define F(x. f3. Lines 11–23 perform the calculation the same way as was already discussed above on account of the Runge-Kutta-Fehlberg method. int *success) { /* This solves an ordinary differential equation using the Adams-Bashforth-Moulton predictor corrector method.h" #define F(x.*f3-59..*f2+f1)/24. x4. K0 can be reused even if the step size is changed). double *h.double). and the pointer error to the location containing the error as determined by formula (7) on line 12. To simplify the notation. and the *error of the step (readable by the calling program). double x1.double).*f0)/24.

. if ( H < min_h ) state=1. (*countrksucc)++. } else { iserror_ok=1. y[j]=y[j+1].) { if ( i==3 ) { f[3]=F(x[3]. if ( iserror_ok < 2 ) { (*countabsucc)++. } } else (*countabfail)++. y[0]=y0. y[0]=y[i]. (*countrkfail)++.y[3]). x[4]=x[3]+H. shortrange=0. i++.. !state . state. i=0. j++) { x[j]=x[j+1]. f[4].f[1].&error).y[i]. upscalecount=0. working_h. } } if ( iserror_ok==2 ) { H /= 2..&f[i]. if ( error > H*epsilon ) { iserror_ok=2. y[5]. y[4]=abmstep(fnct.x[4]. epsilonover64=epsilon/64. } for (j=0.120 Introduction to Numerical Analysis 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 const double assumedzero=1e-20. int i=0. iserror_ok=1. /* state==0: stepping state==1: h is too small state==2: x1 is reached */ /* iserror_ok==0: error is too small iserror_ok==1: error is OK iserror_ok==2: error is too large */ /* using Runge-Kutta-Fehlberg to generate starting values: */ x[0]=x0. if ( error > H*epsilon ) iserror_ok=2. x[5]. for (j=0. iserror_ok.&error). j.f[0].f[3]. j<3.f[2]. x[0]=x[i]. } else { y[i+1]=rkfstep(fnct.x[i].&x[i+1].y[3]. double error. for ( state=0. H. H. if ( error <= H*epsilonover64 ) iserror_ok=0. j<4. j++) { f[j]=f[j+1].

The meaning of the integer iserror_ok is explained in the comments in lines 34–36 (0 if the error is too small. } } else upscalecount=0. until the calculation is finished (the point x1 is reached) or when the calculation cannot be continued (since h would have to be smaller than allowed). the smallest step size min_h allowed. the global error epsilon allowed on an interval of length 1. Meaning of the integer state is explained in the comments in lines 31–33 (0 if the method is stepping. which is 1 is the determination of the value of the solution at x1 is successful. Predictor-Corrector methods 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 } 121 } else { if ( absval(x1-x[i]) <= assumedzero ) state=2. 1 is the value of h is smaller than the allowed step size. a pointer countrksucc to the number of successful RungeKutta-Fehlberg steps. y). if ( H > x1-x[i] ) { shortrange=1. } else { *success=0. reached in a single step. The loop in lines 39–97 is performed as long as useful calculations can be done. the point x1 where the solution is desired (this point is reached as a result of many steps. x[0]=x[i]. a pointer countrkfail to the number of failed Runge-Kutta-Fehlberg steps. i=0. y[0]=y[i]. H=x1-x[i]. if ( upscalecount==8 ) { H *= 2. a pointer h to the value of h.. } The function abmrkf has its parameters arranged in a similar way to the parameters of the function rk45 in the file rk45. return 0. and 0 is it is unsuccessful.0. x[0]=x[i].0 otherwise). 1 is the error is OK. y[0]=y[i].c discussed above. the function abmrkf here and the earlier function rk45 has many features in common.e. in the functions rkfstep and abmstep). i=0. so its meaning is very different from the meaning of x1. a pointer countabsucc to the number of successful Adams-Bashforth-Moulton steps. and a pointer to an integer *success. In fact. and 2 if the point x1 is reached. i. a pointer countabfail to the number of failed Adams-Bashforth-Moulton steps. return y[i]. if ( state==2 ) { *success=1. else { if ( iserror_ok==0 ) { upscalecount++. } } } if ( shortrange ) H=working_h. the starting point x0.29. the initial value y0 at this point. working_h=H. The function returns the value of the solution at y1 if successful (returns 0. The . } else shortrange=0. The parameters of abmrkf are a pointer fnct to the function representing f (x. and 2 if the error is too large. in the section on Runge-Kutta methods.

h. *countabfail is implemented because the error was found too large (through the failure of the test on line 48). 101 With more expensive book keeping. i. showed that there are not enough values available to take an Adams-Bashforth-Moulton step). and f[] are updated in lines 50–56. the variable the variable *countabsucc incremented on line 49. In line 98. i is set to 0 to indicate that only 1 (i. it is tested if the current step size to larger than the step needs to be taken to reach x1 . n=10.e. the calculated value is returned to the calling program on line 101. the step size is doubled on line 84.e. .101 ) If the step size does not need to be changed in lines 73–77. and the arrays x[]. and in lines 64–66 the integer iserror_ok is set to 2. and i is set to 0 to indicate that there is only one (i. As will appear from the printout of the present program. The step size will be increased only if it is found on eight occasions in a row that the error is too small. but to keep track of them is more expensive also in this case. In line 99 the variable state is tested to see whether the determination of y1 was successful. and *countrksucc is incremented in lines 68–70. success. if the error is not too large.h0. Only an Adams-Bashforth-Moulton step can increase the step size (in lines 64 and 68 we saw that the Runge-Kutta-Fehlberg method cannot set iserror_ok to 0). and in line 40 the program doing this is called. some of the old starting values could still be used even if the step size is halved. y[]. To read these lines. int i. the variable shortrange is set to 1. If the step size is doubled..y) and H.0.xn=10.h" double sol(double x). on line 81 it is tested if the error is too small.e. a Runge-Kutta-Felberg step is taken (because the test on line 40 failed. The step size must be decreased if this is the case. In line 76. this may have happened either in a Runge-Kutta-Fehlberg step or an Adams-Bashforth-Moulton step. and *countrkfail is incremented in case the error is found to be too large on line 63. the number of successive occurrences of a too small error is counted by the variable upscalecount. otherwise iserror_ok is set to 1.122 Introduction to Numerical Analysis test on line 40 i==0 is evaluated true when enough starting values have been obtained to do an Adams-Bashforth-Moulton step. so that the calling program can read this value in the location h (recall that H stands for *h. if so. On the eighth occurrence. Such a more expensive book keeping hardly seems justified unless a function evaluation is enormously costly. note the definitions in lines 2–3 defining F(x. main() { double min_h=1e-5. x0=0. In line 73 it is tested if the error is too large. On line 58. the error is too large (iserror_ok is set to 2). If so. i is set to be zero to ensure that the next (and last) step is a Runge-Kutta-Fehlberg step. On line 92. i + 1) starting value is available (if the step size changes. If the error is this step is smaller than or equal to hǫ/64 (the parameter epsilonover64 has value ǫ/64) the error is too small (iserror_ok is set to 0). on line 79 it is tested if we are close enough to target point x1 (we should of course not test exact equality with floating point numbers). epsilon=5e-10.c is very similar to the file by the same name discussed on account of the Runge-Kutta-Fehlberg method in the preceding section: 1 2 3 4 5 6 7 8 #include "abm. it seems that the method is tuned in a way that a change in step size is needed only rarely. 
then it is possible that there are still four starting values available among the already calculated values. On line 61. If the latter is the case. if so.0. The variable i is incremented on line 69 to indicate that the successful Runge-Kutta-Fehlberg step made an additional starting value available.y1... the working value of h is restored in case the last step was a short step. The main program contained in the file main. as indicated on line 3). countabsucc=0. and this is done in line 74.x1. earlier starting values become useless. and if the error is larger than hǫ. The current value of h is preserved in the variable working_h so that the program can communicate this value of h to the calling program. In line 89. the step size may need to be increased.y0=1.countabfail=0.0. and the step size is changed on line 91 to the size of the actual step that needs to be taken. or whether more steps need to be taken. i + 1) available starting value.yx.

y0-yx). 14 printf("Solving the differential equation " 15 "y’=-(x+1)sin x+y/(x+1):\n"). 47 printf("number of RK successful steps: %8d\n". epsilon). 16 printf(" x y " 17 " exact solution error\n\n"). min_h). &countabfail.16f %20.16f\n". epsilon. y0. to compare 54 with the numerical solution */ 55 return (x+1. 21 h0=(xn-x0)/((double) n). 27 if ( success ) { 28 x0=x1.16f %20.3f %20.\n").29.countrkfail=0. 46 printf("number of ABM failed steps: %8d\n". 10 /* status==0: failed.16f\n". 35 break. 56 } 123 The only difference between this file and the file main. 31 x0. 30 printf("%6. &success). 26 &countrksucc. 12 status==2: reached the end of the interval. y0=y1. 20 x0. i++) { 23 x1=x0+h0.c discussed on account of the Runge-KuttaFelberg method. &h. &countrkfail.16f\n". countrksucc). 39 printf("epsilon: " 40 " " 41 "%. 45 printf("number of ABM successful steps: %8d\n". 42 printf("smallest step size allowed:" 43 " " 44 "%. status. 32 } 33 else { 34 printf("No more values could be calculated. &countabsucc. yx. 49 } 50 51 double sol(double x) { 52 /* This function is the exact solution of the 53 differential equation being solved. 25 min_h. y0. yx. i<=n .0)*cos(x).3f %20. y0-yx). countabfail). The printout of this program is as follows: . 11 status==1: integrating. 29 yx=sol(x0). 48 printf("number of RK failed steps: %8d\n". y0.16f %20. countabsucc). that here the separate step counts are printed out for Runge-Kutta-Fehlberg steps and Adams-Bashforth-Moulton steps in lines 45–48. 24 y1=abmrkf(&funct. countrkfail).16f %20. x0. x1. 22 for (i=1. 36 } 37 } 38 printf("\n" "Parameters\n"). 19 printf("%6. Predictor-Corrector methods 9 countrksucc=0. */ 13 status=1.16f\n". 18 yx=sol(x0). h=h0.

7019731136560623 6.0000000000000000 0.7211920065525623 6.0000000005000000 0. where ξ1 and ξ2 are some unknown number near where the solution is being calculated.0000000004113366 0. We have Y (x1 ) − y1 = 1 3 ′′′ h Y (ξ1 ).000 1.0000000000000000 1.000 9.3095003030395211 -9.0000000000000000 1.000 10.7211920076259482 6.1113026188467696 -9.2484405092421744 -3.0000000008767047 0. Write y(x1 ) for the correct value of y(x) at x1 .1113026175358165 -9.000 4.0806046117362795 -1.0312180359864973 -1. 45 .2484405096414271 -3.9599699859316626 -3. and the error of the corrected value y¯1 is 1 3 ′′′ − 12 h Y (ξ2 ).2297868198409763 Parameters epsilon: smallest step size allowed: number of ABM successful steps: number of ABM failed steps: number of RK successful steps: number of RK failed steps: 0. 3 and Y (x1 ) − y¯1 = − 1 3 ′′′ h Y (ξ2 ).7019731127793576 6. assuming Y ′′′ (ξ1 ) ≈ Y ′′′ (ξ2 ).3095003042775217 -9. 12 5 12 5 14 5 (5) h Y (ξ1 ). where ξ1 and ξ2 are some unknown numbers near where the solution is being calculated. Solution.9599699864017817 -3. For a certain predictor-corrector method.000 8.0000000010733860 0. Estimate the error in terms of the difference y¯1 − y1 .0312180347464368 -1.000 1. We then have 14 y(x1 ) − y1 = − h5 Y (5) (ξ1 ). 2.2682181037536910 1.000 7.0000000012400602 0. 12 Hence.0000000014256585 0.0000100000000000 2033 4 52 6 Problem 1. the error of the predicted value y1 is 31 h3 Y ′′′ (ξ1 ).0806046121476161 -1. Estimate the error in terms of the difference y¯1 − y1 . where Y (x) is the true solution of the equation.   5 3 ′′′ 1 1 3 ′′′ h3 Y ′′′ (ξ2 ) = h Y (ξ2 ). the error of the predicted value y1 is − 45 1 5 (5) where Y (x) is the true solution of the equation.0000000004701192 0.0000000012380006 0.000 5. Solution. and the error of the corrected value y¯1 is 90 h Y (ξ2 ).2682181043180596 1.0000000005643685 0.000 2.0000000003992527 0.000 6. Assume that Y (5) (ξ1 ) ≈ Y (5) (ξ2 ).124 Introduction to Numerical Analysis 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 Solving the differential equation y’=-(x+1)sin x+y/(x+1): x y exact solution error 0.2297868184153185 1. For a certain predictor-corrector method.0000000013109534 0. y¯1 − y1 = (Y (x1 ) − y1 ) − (Y (x1 ) − y¯1 ) ≈ h Y (ξ2 ) − − 3 12 12 Therefore Y (x1 ) − y¯1 ≈ − 1 5 1 1 3 ′′′ h Y (ξ2 ) = − · h3 Y ′′′ (ξ2 ) = − (¯ y1 − y1 ).000 3.

. . . and can be multiplied by scalars. a homogeneous recurrence equation.30. . . and these numbers yi can be specified arbitrarily. i = hαy0 . i + hz0 . y2 + z2 . am 6= 0. they form a vector space. that is hy0 . . Write (2) P (ζ) = m X ak ζ k . Recurrence equations and y(x1 ) − y¯1 = 125 1 5 (5) h Y (ξ2 ). y2 . m is called the order of this equation. y1 . . the equation can be replaced with a lower order equation. y2 . i = hy0 + z0 . z2 . i. y1 + z1 . . . i. more precisely. 90 29 90 29 30. It will be advantageous to work with complex numbers. k=0 102 A recurrence equation can often be written in terms of the difference operators introduced in Section 9 on Finite Differences. . then what he get is called an inhomogeneous recurrence equation. as an 6= 0). First. . The solution vectors form an n-dimensional vector space. . yn is then determined by the recurrence equation. . y1 . a recurrence equation is also called a difference equation. In this case. that is αhy0 . . i with infinitely many components. It is convenient to consider a solution of this equation as a vector y = hy0 . RECURRENCE EQUATIONS An equation of form (1) m X ak yn+k = 0 k=0 (a0 6= 0. . αy2 . αy1 .) Here ak for integers k with 0 ≤ n are given numbers. .102 In this section. It is also clear that the dimension of this vector space is m since each solution is determined if we specify the number yi for each integer i with 1 ≤ i ≤ n − 1 (indeed. (If the right-hand side is replaced with some function of n that is not identically zero. The assumptions a0 6= 0 and am 6= 0 are reasonable in the sense that if either of these assumptions fail. since if y and z are solutions then αy + βz is also a solution.e. m > 0) is called a recurrence equation. and we seek solutions yn such that these equations are satisfied for all nonnegative integer n. These vectors can be added componentwise. . y1 . we will only discuss homogeneous recurrence equations. the numbers ak and yn will be allowed to be complex. i. 45 90 90   1 5 (5) 1 1 29 y(x1 ) − y¯1 ≈ h5 Y (5) (ξ2 ) = − (¯ h Y (ξ2 ) = · − y1 − y1 ). y2 . z1 . 90 Hence y¯1 − y1 = (y(x1 ) − y1 ) − (y(x1 ) − y¯1 ) ≈ − Therefore 1 29 14 5 (5) h Y (ξ2 ) − h5 Y (5) (ξ2 ) = − h5 Y (5) (ξ2 ). . .

etc.103 The forward shift operator E on functions of n is defined by writing Ef (n) = f (n + 1). the forward shift operator was introduced in Section 9 on page 27. Here ζ is a complex variable. whereas here n is a nonnegative integer variable. y2 . . the notation should indicate the variable in question as well. 104 The forward shift operator is always associated with a variable. i. counting multiplicities. i. 106 It would be formally more correct. By solving the characteristic equation. 103 Or an indeterminate. and is not tu be interpreted as representing a number. . but less convenient. y . and the operator E on yn will act according to the equation Eyn = yn+1 . and the polynomial equation P (ζ) = 0 is called its characteristic equation. The difference there was that the variable used was x.104 The powers of the operator E can be defined as they were in Section 9. given complex numbers λ and η the operators E − λ and E − η commute. and let P (n) be a polynomial of n that is not identically zero.108 The degree of a polynomial P (n) of n will be denoted by deg P (n). y . and the identically zero polynomial will have degree −1. . k=0 j=1 here E − λj could also have been written as E − λj I. An indeterminate is a symbolic variable used in defining a polynomial ring. assuming that λj is a zero107 of multiplicity mj of the characteristic polynomial for j with 1 ≤ j ≤ N . we can also use the identity operator I. Polynomials of the operator E will be called difference operators. y1 . from an alternative viewpoint. i = hy1 . particular. and x was a real variable.106 The recurrence equation (1) can be written in terms of the operator E as m X (3) k=0  ak E k yn = 0. For example. . the constant polynomial that is not identically zero will have degree zero. . Then (E − λ)P (n)η n = Q(n)η n . j=1 the second equation here just says that the above polynomial equation (of degree m) has m roots. where N X mj = m. for example Ex would shift the variable x forward.126 Introduction to Numerical Analysis The polynomial P (ζ) is called the characteristic polynomial of the recurrence equation (1). in addition. and n2 En = n2 (n + 1). the characteristic polynomial can be factored as the product of m linear factors. . . . Then we have Lemma. Note that E does not commute with expressions involving n. that is (E − λ)(E − η) = (E − η)(E − λ). 105 This is because these operators can also be expressed in terms of of the forward difference operator ∆ described in Section 9. Let λ and η be nonzero complex numbers. if more than one variable were associated with forward shift operators. . to say that E acts on vectors hy . while Ey would shift the variable y. we have m X k=0 ak ζ k = am N Y j=1 (ζ − λj )mj . and 0 1 2 Ehy0 . y3 . but the identity operator is often omitted when is has a number coefficient. 107 A 108 In zero of a polynomial is a root of the equation obtained by equating the polynomial to zero. This is because the rules of algebra involving polynomials of the variable ζ and polynomials of the forward shift operator E are the same.105 yn will be considered as a function of n. nEn2 = n(n + 1)2 . y2 . . The difference operator in recurrence equation (3) has a corresponding factorization: m N X Y ak E k = am (E − λj )mj .

where not all the complex coefficients ck are zero (≡ here means that equality holds identically.30. k=1 with a nonzero c. such that if 1 ≤ k < l ≤ r then either λk 6= λl . Let fk (n) = Pk (n)λnk be functions of n for k with 1 ≤ k ≤ r.) The proof is complete.  Linear independence of certain functions. Proof. Let d be the degree of P1 (x). Proof. on the contrary. The product is taken for all k for which λk is different from λ1 . Further. there is no need to take both of the factors k l (E − λk )d+1 and (E − λl )d+1 . Let r ≥ 1 be an integer. Assume. of λ = η then the term involving nk will cancel. (since deg Pk (n) ≤ d). λk 6=λ1 d+1 r X ck Pk (n)λnk = cλn1 . Corollary (Linear Independence). Then d (E − λ1 )  Y (E − λk ) k:2≤k≤r.109 The reason for this equation is that the difference operator (E − λk )d+1 annihilates the term Pk (n)λnk when λk 6= λ1 according to the Lemma above. the operator lowers the degree of nk in the term nk η n by one. since the terms with zero coefficients can simply be discarded. Then the functions fk are linearly independent. but not higher). Recurrence equations 127 where Q(n) is another polynomial of n such that deg Q(n) = deg P (n) if λ 6= η and deg Q(n) = deg P (n) − 1 if λ = η. In the former case. but such redundancy is harmless and it serves to simplify the notation. Given an integer k ≥ 0. because if λ = λ . The operator (E − λ1 )d will annihilate the term Pk (n)λnk in case λk = λ1 and k 6= 1 (since deg Pk (n) < d in this 109 This arrangement is of course highly redundant. since the degrees of those terms will also be lowered. in the present case this means that equality holds for every nonnegative integer n). that we have r X k=1 ck Pk (n)λnk ≡ 0. This equation says it all. k k−1 n of the highest degree of the polynomial P (n). and λk is a nonzero complex number. we may assume that among the terms Pk (n)λnk the polynomial P1 (n) is the one that has the highest degree (other polynomials Pk (n) with nonzero ck for λk 6= λ1 may have the same degree. without loss of generality. These operators will not change the degree of the polynomial in the term P1 (n)λn1 according to the same Lemma (because λk 6= λ1 ). then the resulting term η 1 n η will not cancel against the terms resulting from lower degree terms of P (n). and if λ 6= η then this term will not cancel. or if λk = λl then deg Pk (n) 6= deg Pl (n). Functions here mean functions on nonnegative integers. instead of the word “function” we could have used the word “sequence. To this end. .” The lemma just established has several important corollaries. (In this if nk is the term  case. where Pk (n) is a polynomial of n that is not identically zero. we have (E − λ)nk η n = (n + 1)k η n+1 − λnk η n =  k−1 X k  nj  η n . We will show that this equation cannot hold. = (η − λ)nk + η j j=0  k   X k j=0 j nj η n+1 − λnk η n the second equality was obtained by using the Binomial Theorem. we may assume that none of the coefficients are zero.

This is a contradiction since λ1 6= 0 according to assumptions. Since a recurrence equation of order m can have at most m linearly independent solutions. this confirms the above equation. If follows that any solution of (1) is a linear combination of these solutions. j=1 the functions nr λnj for r and j with 0 ≤ r < mj and 1 ≤ j ≤ N represent m linearly independent solutions of the differential equation m X k=0  ak E k yn = 0. To see that each of these functions is a solution. . we obtain that cλn1 ≡ 0. Finally.  Thus we exhibited m linearly independent solutions of equation (1). y1 = 1 and yn+2 = yn + yn+1 for every integer n ≥ 0. . the resulting function will be cλn1 with c 6= 0. Problems 1. according to our assumptions). 2 . the difference operator (E − λj )mj annihilates the function nr λnj for r < mj . as we mentioned). Assuming m X k=0 ak ζ k = am N Y j=1 (ζ − λj )mj . n = 0. where N X mj = m. The characteristic equation of the recurrence equation yn+2 = yn + yn+1 is ζ 2 = 1 + ζ. after the application of the above difference operators. these functions will represent a complete set of linearly independent solutions. are defined by the equations y0 = 0. Proof. applying the difference operator to both sides of the equation expressing linear dependency. while c 6= 0. The Fibonacci numbers yn . . in view of the Lemma above. So. 1. The solutions of this equation are √ 1+ 5 ζ1 = 2 and √ 1− 5 ζ2 = . showing that the functions in question are linearly independent. Hence. Write a formula expressing yn . Solution. the operator (E − λ1 )d lowers the degree of P1 (n) by d in the term P1 (d)λn1 according to the Lemma (while none of the other operators change the degree of P1 (n) in this term. The linear independence of the functions claimed to be representing the solutions have been established in the first Corollary. 2. Corollary (Solution of the Homogeneous Equation).128 Introduction to Numerical Analysis case.  The solution of the recurrence equation. it is enough to note according to the equation m N X Y ak E k = an (E − λj )mj j=1 k=0 that.

and n2 · 5n are linearly independent. Write a difference operator that annihilates all but the first term in the expression c1 n3 · 3n + c2 n4 · 2n + c3 n2 · 5n . 5 √ √ Adding the first equation to this. we obtain √ 5 (C1 − C2 ) = 1. C1 2 2 It is easy to solve these equations. The difference operator (E − 2)5 will annihilate the second term. we obtain C2 = −1/ 5. will it will annihilate the second and the third terms. it will change the term into c · 3n with a nonzero c).29. while it reduces the first term to c · 3n . Multiplying the first equation by 1/2 and subtracting it from the second equation. then we must have c1 = 0. (E − 3)3 (E − 2)5 (E − 5)3 will change the first term into c · 3n with a nonzero c. the formula for yn gives √ !n √ !n 1− 5 1 1+ 5 1 yn = √ −√ . Substituting this into 1 1 √ the first equation. The difference operator (E − 3)3 will lower the degree of the polynomial in the first term to 0 (i. the general solution of the above recurrence equation is √ !n √ !n 1+ 5 1− 5 yn = C1 + C2 . where c is a nonzero constant (it is assumed that c1 6= 0). we obtain 2C = 2/ 5. while it will not change the degrees of the polynomials Hence the product of these differential operators. 2 that is 2 C1 − C2 = √ . 2 2 5 5 2. or else C = 1/ 5. .e. Solution. This argument can be used to show that if c1 n3 · 3n + c2 n4 · 2n + c3 n2 · 5n ≡ 0. Finally. while it will not change the degrees of the other polynomials. n4 · 2n . Similar arguments can be used to show that we must also have c2 = 0 and c3 = 0. the difference operator (E − 5)3 will annihilate the third term. 2 2 The initial conditions y0 = 0 and y1 = 1 lead to the equations C1 + C2 = 0 and √ √ 1− 5 1+ 5 + C2 = 1. With these values for C1 and C2 . hence the terms n3 · 3n .. while it will not change the degrees of the polynomials in the other terms. Predictor-Corrector methods 129 Thus.

3 + hK . Of course. y¯n+1 ). yn ). this equation just expresses the use of Simpson’s method in calculating the integral in the equation yn+1 = yn−1 + Z xn+1 f (x. instead of considering what happens during one application of the corrector equation. 3 is solved exactly. the corrector equation is yn+1 = yn−1 + h ¯ (fn+1 + 4fn + fn−1 ). 3 The characteristic equation of this recurrence equation is ζ2 = 1 − hK 2 (ζ + 4ζ + 1). we will put f (x. and put fn = f (xn . xn−1 We will consider what happens when equation (1) is used to solve the differential equation y ′ = −Ky for some constant K. NUMERICAL INSTABILITY Numerical instability is the phenomenon when small errors committed early in the calculation cause excessively large errors later. Equation (1) then becomes (2) yn+1 = yn−1 − hK (yn+1 + 4yn + yn−1 ). since it only provides a starting value for the corrector equation. Write yn for the calculated solution of the above equation at xn . For the discussion that follows. y(x)) dx. An example for a numerically unstable method for solving the differential equation y ′ = f (x. That is. we will consider the behavior of the method when the equation (1) yn+1 = yn−1 + h (fn+1 + 4fn + fn−1 ). Milne’s method has the predictor y¯n+1 = yn−3 + 4h (2fn − fn−1 + 2fn−2 ). y) = −Ky. y) is Milne’s method. 3 that is (3 + hK)ζ 2 + 4hKζ − (3 − hK) = 0. 3 The predictor equation does not play too much role in out analysis of the method. The solutions of this equation are Z1 = √ −2hK + 3h2 K 2 + 9 3 + hK and Z2 = √ −2hK − 3h2 K 2 + 9 . 3 Setting f¯n+1 = f (xn+1 .130 Introduction to Numerical Analysis 31. Writing y¯n+1 for the predicted value of y at xn+1 . where we use fn+1 rather than f¯n+1 on the right-hand side to indicate this is an equation for yn . Let xn = x0 + nh with some x0 and some h > 0.

even if one enters the correct solution y(x1 ) as the value of y1 . seeking the solution for some x > 0. We obtain Z1 = 1 − Kh + O(h2 ) and Z2 = −1 − Kh + O(h2 ). it is likely that there will be some truncation error in the calculation. Because of normal roundoff errors. 3 Hence the general solution of equation (2) is yn = c1 Z1n + c2 Z2n = c1 (1 − Kh + O(h2 ))n + c2 (−1)n (1 + Kh/3 + O(h2 ))n . 111 In order to start the method. yn = c1 e + c2 (−1) e +O n The values of c1 and c2 are determined with the aid of the initial conditions. y0 = 1. and substituting these into the expressions for Z1 and Z2 . the parasitic solution will arise during the calculation even if the choice of initial conditions ensure that c2 = 0 at the beginning. Finally. In a practical calculation one will not be able to suppress the parasitic solution (by ensuring that c2 = 0). we are taking K = 1) with initial condition x0 = 0. we use h = x/n for some large n. one only has c1 ≈ 1 and c2 ≈ 0 then. Take the exponential of both sides and note that eu+O(1/n) = eu · eO(1/n) = eu · (1 + O(1/n)) = eu + eu · O(1/n). and. one needs initial conditions for y and y . since the correct solution of the differential equation will not be the same as the main solution of the recurrence equation. Rather than directly calculating the Taylor series for Z1 and Z2 . In so far as one obtains y by some 0 1 1 approximate solution of the differential equation. an easier way to do this may be to use the Taylor expansions 1 1 1 1 1 = · = (1 + (−hK/3) + O(h2 )) = (1 − Kh/3 + O(h2 )) 3 + hK 3 1 − (−hK/3) 3 3 and p p 3h2 K 2 + 9 = 3 1 + h2 K 2 /3 = 3(1 + O(h2 )). the situation is hopeless even without roundoff errors.110 and using this with u = −x and u = x/3. we obtain that   1 −x n x/3 . =e +O n n2 n as n → ∞. while solution c2 (−1)n ex/3 is called the parasitic solution of the recurrence equation (2). Numerical instability 131 We will approximate these with error O(h2 ) by using the Taylor series for them at h = 0. for large enough x. on the other hand. That is. since the latter only approximates the solution of the differential equation. the term c2 (−1)n ex/3 will have much larger absolute value than the term c1 e−x . we „ „ «« „ « u 1 1 n log 1 + + O = u + O n n2 n for fixed u when n → ∞. so one will not have c2 = 0. the parasitic solution will not be suppressed. .31. Noting that   n   u 1 1 u 1+ +O . and the calculated solution will be nowhere near the actual solution of the equation. The choice c1 = 1 and c2 = 0 correctly gives the solution y(x) = e−x when x → ∞. eu · O(1/n) = O(1/n) for fixed u.111 110 Using have the Taylor expansion log(1 + t) = t + O(t2 ) as t → 0 (here log t denotes the natural logarithm of t). In fact. If. Suppose we are trying to solve the differential equation y ′ = −y (that is. The solution c1 e−x is called the main solution.

Write m43 = a43 /a33 . we write mi1 = ai1 /a11 . and 4.132 Introduction to Numerical Analysis 32. and the following equations result: 3x1 + 6x2 − 2x3 + 2x4 = −22 2x2 + 4x3 − 2x4 = 12 −2x4 = 2 −3x3 + x4 = −13 The above method of solving is called Gaussian elimination. The last system of equation is called triangular because of the shape formed (4) by the nonzero coefficients That is. We have m32 = 2 and m42 = −3. named after the German mathematician Carl Friedrich Gauss. GAUSSIAN ELIMINATION Consider the following system of linear equations 3x1 + 6x2 − 2x3 + 2x4 = −22 −3x1 − 4x2 + 6x3 − 4x4 = 34 −6x1 − 18x2 − 2x3 − 2x4 = 36 6x1 + 16x2 + x3 + x4 = −33 Writing aij for the coefficient of xj in the ith equation and bi for the right-hand side of the ith equation. We obtain the equations 3x1 + 6x2 − 2x3 + 2x4 = −22 2x2 + 4x3 − 2x4 = 12 6x3 − 4x4 = 28 −3x3 + x4 = −13 (3) (3) Write a3ij for the coefficient of xj in the ith equation. We will continue the above process: writing (2) (2) mi2 = ai2 /a22 and subtract mi2 times the second equation from the ith equation for i = 3 and i = 4. We have m43 = −2. To solve this system of equation. The following equations will result (the first equation is written down unchanged): 3x1 + 6x2 − 2x3 + 2x4 = −22 2x2 + 4x3 − 2x4 = 4x2 + 5x3 − 3x4 = 12 11 −6x2 − 6x3 + 2x4 = − 8 (2) Write aij for the coefficient of xj in the ith equation. 3. and subtract mi1 times the first equation from the ith equation for i = 2. and subtract m43 times the third equation from the fourth equation. and m41 = −2. . we have aij = 0 for j < i. We have m21 = −1. m31 = 2. this system can be written as n X (1) j=1 aij xj = bi (1 ≤ i ≤ n) with n = 4. writing a4ij for the coefficient of xj and bi for the write hand (4) side in the ith equation in the last system of equations.

the final value of the coefficient ( (l) alr if l ≤ r. Taking aij = aij . 3 This way is of solving a triangular system of equations is called back substitution. .32. x3 . one defines the coefficients mij for 1 ≤ j < i < n and aij for 1 ≤ i ≤ n. this equation says ij (i+1) akj is (i) = akj in case i > j. for any l and r with 1 ≤ l. alr That is (s) anew = alr lr (5) whenever min(l. new (4) alr = (r+1) = 0 if r < l. and x4 . −3 Next. 2 Finally. using the values of x2 . one starts with the coefficients (1) (k) aij of the equations. The restriction i ≤ j can be dropped here. In a general description of the method of solving equations (1). equation (3) can be rewritten as (i+1) akj (i) = akj − mki anew ij for with i ≤ j and i < k. According to (4). Gaussian elimination 133 The resulting equations are easy to solve. given that anew = 0 when i > j. r + 1) ≤ s ≤ n. one defines (i) (2) (i) mki = aki /aii (i) (assuming that aii 6= 0). using the value of x3 and x4 . That (i+1) akj (i) = akj − mki anew ij for i < k. showing that xi is eliminated from the kth equation for note that for i = j this gives aki (i) = aij (i ≤ j). −2 Using this value of x4 . r ≤ n. The final value of the coefficient aij in these calculation will be anew ij generally. More k > i. we can determine the value of x2 from the second equation: (4) x2 = (4) (4) b2 − a23 x3 − a24 x4 = (4) a22 12 − 4 · 4 − (−2) · (−1) = −3. since we have j < i < k in this case. From the fourth equation we obtain (4) x4 = b4 = (4) a44 2 = −1. i ≤ k ≤ n as follows: when it is the turn to use the ith equation to eliminate the coefficients of xi in the kth equation for i < k ≤ n. This is true according to (5). we can determine x1 from the first equation: (4) x1 = (4) (4) (4) b1 − a12 x2 − a13 x3 − a14 x4 (4) a11 = −22 − 6 · (−3) − (−2) · 4 − 2 · (−1) = 2. we can determine x3 from the third equation (4) (4) b3 − a34 x4 x3 = (4) a33 = −13 − 1 · (−1) = 4. (i+1) = 0. and one defines the new value of akj as (i+1) (3) akj (i) (i) = akj − mki aij for with i ≤ j and i < k. since.

where x is the column vector of unknowns. and the letter U is used because the matrix it denotes is an upper-triangular matrix. . . Writing A = (aij ). xn )T and b = (b1 .. for this reason. Writing def new new −1 bnew = (bnew b 1 .e. A row vector can be considered a 1 × n matrix. Equation (1) can be written in the form Ax = b. . i. the transpose of an m × n matrix A = (aij ) is an n × m matrix B = (bij ) with bij = aji for any i. we calculated the righthand sides simultaneously with the coefficients in the LU-factorization. i=1 One can also express these equations in matrix form. the last equation in matrix form can be written as A = LU. . L = (mij ). a matrix whose elements below the main diagonal are zero. Multiplication by the inverse of U can be calculated with the aid of back-substitution. it is usually advantageous to separate the calculation of the left-hand side from that of the right-hand side.e. The name L is used because the matrix it designates is a lower-triangular matrix. already mentioned above: Pn bnew − j=i+1 anew i ij xj xi = . bn ) = L the elements bnew are calculated analogously to the way the coefficients anew i ij are calculated in formula (6): i−1 X mil bnew . In general. while a column vector. . . Hence a column vector can be written as the transpose of a row vector. i=1 Write mkk = 1 and mki = 0 for k < i. bn )T . x = (x1 .. we obtain (6) anew kj = akj − k−1 X mki anew ij . this equation can now be solved as x = U −1 L−1 b Multiplication by the inverse matrices U −1 and L−1 is easy to calculate. where T in superscript stands for transpose. . then the last equation can also be written as akj = n X mki anew ij . an n × 1 matrix. This is called the LU-factorization of A. i. . new aii 112 Writing column vectors takes up too much space in text. . In the numerical example above. a matrix whose elements above the main diagonal are zero. one wants to solve systems of equations with the same left-hand side but different right-hand sides. Often. . and U = (anew ij ) for the n × n matrices with the indicated elements. x2 .e. bnew = b − i l i l=1 This formula is called forward substitution. b2 . j with 1 ≤ i ≤ n and 1 ≤ j ≤ m. . . b2 . and so it is easier to describe a column vector as the transpose of a row vector.. i.134 Introduction to Numerical Analysis By a repeated application of this formula.112 As A = LU . and b is the column vector of the right-hand sides. The solution of the equation then can be written in the form x = U −1 bnew . .

This method of interchanging of equations is called partial pivoting. Instead of using the ith equation in continuing the calculations. 113 Of course. either. however.000. and moves this element into the position of the element aii by interchanging rows and columns of the coefficient matrix.3 Using partial pivoting with this system of equations. and then one carries out Gaussian elimination on the obtained system of equations without making further interchanges. one can use what is called full pivoting. with the rest of the elements being equal to 0. Such a permutation can be described in terms of a permutation matrix P = (pij ). 02 . even the working of full pivoting is affected by the equations being “differently scaled.32. To see this.1x2 = −2 −. (The interchange of rows corresponds to interchanging equations. In equation (2). the system of equations is not uniquely solvable.” Indeed.113 A sequence of interchanges of rows of a matrix results in a permutation of the rows of a matrix. In fact. one looks for the number with (i) (i) with the largest absolute value among the coefficients aki for k ≥ i.” If this is the case. This may.000. . and not actually equal to zero. rescaled system. and one cannot proceed further along the lines described.000. . both of these are good choices. in a theoretical discussion of algorithms one often considers oracles.) In practice. one interjects the following step. If this coefficient is ari then one interchanges the ith equation and the rth equation before continuing the calculation. it was assumed that aii 6= 0. x1 by back substitution. the system formed by i + 1st through the nth equations at this stage of the process is not solvable for xk with i < k ≤ n. unobtainable in practice. in this case. 02x1 − 10 x2 = 3 Interchanging the two equations in this case would be a mistake. To deal with this problem. and there is in practice no significant advantage of full pivoting over partial pivoting. dividing each equation by the coefficient of the maximum absolute value on its left-hand side we obtain an equivalent system of equations. assume that the system formed by i + 1st through the nth equations at this stage of the process is solvable for xk with i < k ≤ n (these are the only unknowns occurring in the latter equations). An oracle is a piece of information. it is helpful to imagine that all row interchanges happen in advance. and the new element moved in the position of aii is called the pivot element. 002x1 + x2 = − . Gaussian elimination 135 (i) Pivoting. one would use the coefficient 1 of x1 in the first equation as pivot element. x1 + . the system of equations is either unsolvable or it has infinitely many solutions. it may be simpler to rescale each equation by dividing through with the coefficient with the maximum absolute value on its left-hand side. Partial pivoting can lead one astray in the system of equations . Instead of partial pivoting. However. In this case we can chose xi arbitrarily. one would use the coefficient −10 of x2 in the second equation as pivot element. and then we can solve the system for the unknowns xi−1 . in practice this cannot be done. full pivoting is rarely done because it is much more complicated to implement than partial pivoting. one would not interchange the equations. It may (i) happen that we have aki = 0 for all k ≥ i. xi−2 . In fact. 001x2 = − . The point is that the working of partial pivoting may be distorted if the equations are “differently scaled. 
In full pivoting one looks for the coefficient (i) with the larges absolute value among all the coefficients akj for k ≥ i and j ≥ i.000. and full pivoting can work well without rescaling the equations. whereas with the latter. . . 01x1 + . not be the case. That is. that determines a step to be taken in a algorithm. In a theoretical discussion of partial pivoting. of course. then the original system of equations in not solvable.000. A permutation matrix can be characterized by saying that each row and each column has exactly one element equal to 1. One gets into trouble even if aii is only close to zero. with the former system. . and the interchange of columns corresponds to interchanging unknowns. If on the other hand.

. Calculating the vector P b amounts to making the same interchanges in the components of the vector b that were made to the rows of the matrix A.e.e. and R is a right inverse of a matrix B. one in effect obtains a permuted LU-factorization A = P −1 LU of the matrix A. Given the equation Ax = b.. The inverse of the matrix A can be written as U −1 L−1 P . This last step is perhaps unnecessary. Given a system of equations Ax = b. and U −1 (L−1 P B) rather than first evaluating A−1 and then evaluating A−1 B. Iterative improvement. It is easy to see that the inverse of a permutation matrix is its transpose. However. then QB = BR = I. so the jth row of the matrix A′′ will be the i row of the matrix A′ . as one can easily check by the equation (P A)il = n X pik akl = ajl . then it is nonsingular. Similarly. one can then obtain an approximate solution of the equation Aδx(1) = δb(1) . if one needs to calculate A−1 B for some matrix B. k=1 where (P A)il denotes the element in the ith row and lth column of the matrix P A. or A = P −1 LU. Writing A′′ = P T A′ then (P T )ji = 1. We explained above the multiplying by L−1 can be calculated by forward substitution. in the matrix P A the ith row of the matrix P A will be the jth row of the matrix A. then one calculates the matrices P B. the matrix A′′ is the same as the matrix A. Hence P T P = I. i. that is. finding the inverse of a permutation matrix presents no problems in practice. Then one obtains an LU-factorization of the matrix A′ . Therefore the solution obtained as x(0) = U −1 L−1 P b. Writing A′ = P A. and if it is nonsingular. then it must also have a right inverse. Writing δb(1) = b − Ax(0) . When one performs Gaussian elimination on the matrix A with partial pivoting. since if a matrix B has a left inverse.136 Introduction to Numerical Analysis If pij = 1 in an n × n permutation matrix then. and multiplying by U −1 can be calculated by back substitution. one can write P A = LU . In a practical calculation. and U is an upper triangular matrix. i. But this argument is more complicated than the above argument involving the double transpose. so Q = QI = Q(BR) = (QB)R = IR = R. we have P P T = (P T )T P T = I. then the ith row of the matrix A′ will be the jth row of the matrix A. the left and right inverses must agree. will only be approximate. so P T is the inverse of P . where I is the n × n identity matrix. Therefore. Let P is the permutation matrix describing the row interchanges performed during the solution of the equations. If Q is a left inverse. one finds a representation A′ = LU where L is a lower triangular matrix. Write A′ = P A. one usually does not want to evaluate the inverse matrix. That is. L−1 (P B). the elements of the matrices L and U in the factorization A = P −1 LU can only be approximately determined. one can then find the solution in the form x = U −1 L−1 P b. given an n × n matrix A. with the additional assumption that the elements in the main diagonal of the matrix L all equal to 1. that is. Suppose pij = 1 in the permutation matrix P .

double *x. single precision being represented by data type float. int n. using extended precision helps one to avoid loss of significant digits. double *b. even 9 the largest matrix for which Gaussian elimination 10 is feasible occupies a relatively modest amount of 11 memory. double **allocmatrix(int n). */ 13 { 14 int i. double *allocvector(int n).32. So memory for rows 8 could be allowcated separately. One can perform this step repeatedly to obtain a better approximation to the solution. double *b. The following code segment allocates 5 a contiguous block of pointers for the whole matrix. Gaussian elimination 137 as δx(1) = U −1 L−1 P δb(1) . . 6 There is no real advantage for this here. This is already double precision. int col. int improve(double **a.h> #include <math.c contains the routines allocating memory for the vectors and matrices used in the programs: 1 #include "gauss. int *success).h> #include <stdlib. LUsolve(double **a. int n. int n). On the other hand. rows will be interchanged. Extended precision in this case can be represented by data type long double. int n.h in the computer implementation of Gaussian elimination: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 #include <stdio. One can then expect that x(1) = x(0) + δx(1) is a better approximation of the true solution of the equation Ax = b than vector x(1) is. int maxits. 15 double **a.0 ? (x) : (-(x))) void void void void swap_pivot_row(double **a. when discussing the respective functions. The rest of the calculation can be done with single precision. double **c. LUfactor(double **a. We use the header file gauss. These declarations will be explained later. double *x. int *success). Lines 5-15 contain the declarations of functions used in the program. This calculation is called (one step of) iterative improvement. int n. 114 Most or all the programs in these notes use data type double for floating point numbers. double *b). there does not seem to be any disadvantage 12 in asking for a contiguous block. double *deltab). long double is used in the program below to calculate δx(1) .h> #define absval(x) ((x) >= 0. because with 7 pivoting. The file alloc.114 Computer implementation. since the vectors b and Ax(0) are close to each other. Indeed. double tol. backsubst(double **c.h" 2 3 double **allocmatrix(int n) 4 /* Allocate matrix. The calculation of δx(1) above should be done with double or extended precision.

thus a[0] will also point to the beginning of the block of n(n + 1) reals.. On line 32. here n is the number of unknowns and also the number of equations we are dealing with. } We will first discuss the function allocvector in lines 36–45. } /* Initialize the pointers pointing to each row: */ for (i=2. b=(double *) malloc((size_t)((n+1)*sizeof(double))). return a. the values of a[i] for 1<=i<=n may change: */ a[0]=a[1]. and the element a[i][j] for 1 ≤ i. This is important when deallocating this block: the value of a[0] will not change. Then the pointers a[i] are initialized for 2 ≤ i ≤ n on line 28 in such a way that each pointer a[i] will point to a block of n + 1 reals. i. the element a[i][0] will have special uses (at times it will contain the right-hand side of the equation. A vector of reals (of type double) will be represented a by a block of n + 1 reals pointed to by a pointer (called b on line 38). i<=n. while that of a[1] may change when interchanging rows. Thus a[i] can be said to refer to row i of the matrix. } return b. } /* Allocate the whole matrix: */ a[1]=(double *) malloc((size_t)(n*(n+1)*sizeof(double))).e. The elements of this matrix can be referred to as b[i] for 0 ≤ i ≤ n (they can also be referred to as *(b+i)). In lines 3–34 the function allocmatrix first allocates a block of n+1 pointers to reals (of the type double) on line 16. the pointer a[0] is given the value of a[1].138 Introduction to Numerical Analysis 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 a=(double **) malloc((size_t)((n+1)*sizeof(double*))). In line 40. j ≤ n will refer to the matrix element aij . whether b is the NULL pointer. exit(1). the rows will be interchanged. exit(1).c contains the functions doing the LU-factorization of the matrix A: 1 #include "gauss. and the address of the beginning of this block is placed in the pointer a[1]. it is tested whether !b is true. indicating that the memory was not successfully allocated. these reals can be invoked as a[i][j] for 0 ≤ j ≤ n. the interchange of row i and j can be handled by simply switching the values of a[i] and a[j].h" 2 .i++) a[i]=a[i-1]+n+1. at other times it will contain the solution vector). } double *allocvector(int n) { double *b. and then a block of n(n + 1) reals is allocated. if (!b) { printf("Allocation failure 1 in vector\n"). The file lufactor. if (!a) { printf("Allocation failure 1 in matrix\n"). exit(1). /* Save the beginning of the matrix in a[0]. if (!a[1]) { printf("Allocation failure 2 in matrix\n").

and then the rows a[col] and a[pivi] are interchanged. maxel=absval(a[col][col]). for (j=1. Gaussian elimination 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 139 void swap_pivot_row(double **a. a[col]=a[pivi]. and a pointer to the integer *success. The interchange is performed simply by exchanging the values of the pointers a[col] and a[pivi]. rows col and pivi are interchanged in lines 16–18. double pivot. i++) { mult = a[i][j] /= pivot. n). the element mij is calculated.j<n. int *success) /* Uses partial pivoting */ { const double assumedzero=1e-20. *rowptr. a[pivi]=rowptr. The other parameters of this function are the size of the matrix n.h above). If no appropriate pivot element if found.k<=n.i<=n. for (i=col+1.i++) { if ( absval(a[i][col])>maxel ) { maxel=absval(a[i][col]). } } void LUfactor(double **a. In the loop in lines 33–45.32. In lines 10–14 the element of the largest absolute value in or below the main diagonal of column col found. k. the column col and the size of the matrix n. for (i=j+1. j. the value of the variable *success is changed to 0 (false) on line 30. int col. pivi=i. if ( absval(pivot)<=assumedzero ) { *success=0. int n) /* Scans the elements a[col][i] for col<=i<=n to find the element c[col][pivi] with the largest absolute value. In line 10. */ { double maxel. int i. } } } In lines 3–19. mult. } else *success=1. In lines 21–38 to function LUfactor performs LU-factorization with partial pivoting of the matrix given as parameter **a. for (k=j+1. return. this will become true if LU-factorization was successfully performed. absval finds the absolute value of a number (as defined in line 5 of the file gauss. } } if ( pivi !=col ) { rowptr=a[col]. The outer loop in lines 27–37 first places the pivot element in the main diagonal in line 28 by interchanging appropriate rows. j.k++) a[i][k]-=mult*a[j][k]. pivi=col. mij will be stored in location . int n. int i. pivot=a[j][j]. the function swap_pivot_row is defined. In the end. one can find out what interchanges have been made simply by examining the values of the pointers a[i] for i with 1 ≤ i ≤ n.j++) { swap_pivot_row(a. i<=n. and if this element is a[pivi][col] (pivi refers to pivoting i. The parameters of this function is the matrix **a (described as a pointer of pointers to double. as explained above).pivi. row i of the matrix for j = 1 ≤ i ≤ n is updated: In line 34.

h" void LUsolve(double **a. a[i][0] /= a[i][i].140 Introduction to Numerical Analysis a[i][j]. Next we calculate y=Ux=L^-1 c. . double *deltab) { long double longdelta. First we calculate c=Pb. and the inner loop ends on line 36.j<i.i<=n. double *x. We will put x in the same place as y: x[i]=a[i][0]. The zero matrix elements (mij for i < j) and anew ij for i > 0) do not need to be stored. for (i=1. int n.j. double *b. } } void backsubst(double **c. */ a[n][0] /= a[n][n]. j. for (i=n-1. int i. double *b. int *success) { const double assumedzero=1e-20. double *b) { int i. y will be stored the same place c was stored: y[i] will be a[i][0].i>=1.i<=n. either. The file lusolve. the element anew of the ij matrix U will replace the matrix element aij . for i > j.i++) a[i][0]=b[1+(a[i]-a[0])/(n+1)]. the element aij will be replaced by the element mij of the matrix L. int maxits.i++) { longdelta=b[i]. For i ≤ j. and those needed for iterative improvement.j++) a[i][0] -= a[i][j]*a[j][0]. The quantity 1+(a[i]-a[1])/(n+1) gives the original row index of what ended up as row i after the interchanges. deltab[i]=longdelta. for (j=1. not used in representing the matrix. The elements of c[i] will be stored in a[i][0]. */ for (i=2. we calculate x=U^-1 y. double *x.c contains the function needed to solve the equation once the LU-factorization is given. The elements mii are equal to 1.i--) { for (j=i+1. an appropriate multiple of row i is subtracted from row j. in line 35. double **c.j<=n. int n.j++) a[i][0] -= a[i][j]*a[j][0]. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 #include "gauss.j<=n. Finally. as element aij of the matrix. that is.i<=n. /* We want to solve P^-1 LUx=b. } } int improve(double **a. int n. /* Finally. and these do not need to be stored.j++) longdelta -= c[i][j]*x[j]. recall that i > j at this point. That is. double tol. /* Here is why we preserved the beginning of the matrix in a[0].i++) for (j=1. */ for (i=1.

for (i=1. break. When this function is called. n. *deltab. In lines 16–17. The other parameters of the function backsubst are the size of the matrix n. maxx=0. if ( j==2 && rel_error > 0. the elements of this vector will replace the elements of the vector P b in column zero of the matrix **a.. rather.e. In line 9.. /* If the process is successful. *success will be changed to 1. in the lower half. since this vector replaced the elements of the vector P b originally placed here). } *success=0.j++) { LUsolve(a. if ( absval(a[i][0])>maxdeltax ) maxdeltax=absval(a[i][0]). break. the solution vector *x of the equation.32. double maxx. deltab). The function LUsolve is given in lines 3–25. if ( absval(x[i])>maxx ) maxx=absval(x[i]).i<=n. rel_error. when the value of xi is placed in location a[i][0]. the size n of this matrix. In line 27 the function backsubst substitutes the solutions of the system of equation into the original equation. the vector L−1 P b is calculated (by doing the same transformation to the elements of the vector *b as was done to the rows of the matrix). for (i=1. deltab). the details were explained above. i. x. the matrix L is stored. b. maxdeltax. Again. and the vector *b of the right-hand side of the equation. n.i++) { x[i] += a[i][0].5 ) { printf("Iterative refinement will not converge\n"). the coefficients of the original equation are stored in the matrix **c (the original coefficient matrix needed to be copied to another location. */ for (j=1. column zero of the matrix is used to store the solution vector x of the system of equations. /* Cannot return from here since memory must be freed */ } if ( rel_error <= tol ) { *success=1. the elements of the vector *b are permuted the way the rows of the matrix were permuted. j==1 || (j<=maxits). since this matrix was overwritten when LU-factorization was performed). this location is no longer needed to store the corresponding element of the vector b (or. while in the upper half the matrix U is stored.i++) { x[i]=0. deltab=allocvector(n). maxdeltax=0.i<=n. the vector L−1 P b. } rel_error=maxdeltax/(maxx+assumedzero). } backsubst(c. The parameters for this function are the matrix **a. j. . Gaussian elimination 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 } 141 int i. The comment in lines 10–14 explains how this permutation is preserved in the values of the pointers a[i]. the matrix **a should contain the LU-factorization of the original coefficient matrix. backward substitution is performed in lines 20–24.. the right-hand side *b of the equation. and they are placed in column zero of the matrix **a. this column is not used to store matrix elements.. return j. Finally. deltab[i]=b[i]. } free(deltab).

**c. If it is. the integer *success is given value 1 to indicate the success of the calculation. iterative improvement is done for values of j > 1. On line 54. j. the function allocvector (discussed above) is called to allocate memory to the vector deltab. *success is given the value 0 (false). the memory used for the vector δb is freed. but the constant assumedzero is added to the denominator to avoid a floating-point exception caused by dividing by 0. the relative error of the solution is calculated. The first 7 entry is the required tolerance. when it is clear from the context what type of norm is taken. to calculate the solution of the system of equations. the original coefficient matrix **c. the number of unknowns. the maximum allowed number of iterations maxits. and it is abandoned. and it will be explicitly changed to 1 later if iterative improvement is successful.h" 2 3 main() 4 { 5 /* This program reads in the coefficient matrix 6 of a system of linear equations. this loop is performed at least once. Whatever the value of maxits. when the first values of x and δx have been calculated116 ) is greater than 1/2. success. readerror=0. 116 For j = 1 at this point both x and δx are equal to the initial solution of the equation Ax = b 117 The inequality kδx/xk > 1/2 means roughly that even the first significant binary digit of kxk is wrong. In line 42. The loop in lines 52–71 does repeated iterative improvement. This is done by taking the ratio kδxk/kxk. the relative error is too large. Iterative improvement is performed by the function improve in lines 39–74. 15 int i. n. the original right-hand side of the equation. that is. the size of the matrix n. 1 #include "gauss. that is the vector δb obtained as δb = b − Aδx. The parameters of this function are **a.117 If the relative error is less than tol. one may simply write kvk instead.142 Introduction to Numerical Analysis and *deltab. the new value x + δx is given to the vector x. while δb made equal to b. the vector x is initially set to 0. Then the L∞ -norm115 maxx = kxk of the vector x. on line 72. which at this time holds the LU-factorization on the coefficient matrix. . temp. the solution of the equation Aδx = δb in column zero of the matrix A on line 53. and the L∞ norm maxdeltax kδxk of δx is calculated. and then. the difference the actual right-hand side and the one calculated by substituting the solutions into the equation. */ 14 double **a. tol. In line 49. LUsolve is called to solve the system of equations with δb as right-hand side (which is b for j = 1). there must be n(n+1) 11 of these. and the 8 second entry is n. and the number of steps used for iterative improvement is returned on line 73. The L ∞ norm of a vector v is usually denoted as kvk∞ . the original right-hand side *b. and the loop in lines 52–73 is terminated. the maximum number of iterations. a vector *x for the solution vector (eventually returned to the calling program). and this the values obtained during the calculation of the components of the vector δb are accumulated in longdelta. in lines 55-59 the new value of the vector x (the current solution) is calculated: The function LU-solve stored δx. *b. iterative improvement is unlikely to be useful. In lines 61 it is checked whether the initial relative error (for j = 2. to use to test when floating point numbers are too close to usefully distinguish. the constant assumedzero is defined to be 10−20 . the identifier longdelta is declared long double on line 30. which stands for δb. 
115 The L∞ -norm of a vector is the maximum of the absolute values of its components. and the integer *success indicating success or failure of iterative improvement. *x. 9 The rest of the entries are the coefficients 10 and the right-hand sides. The important point here is that the the components of this vector are calculated with extended precision. this memory is freed. On line 56. In line 45. In lines 46–48. The second entry must be an unsigned 12 integer. In line 61. the other entries can be integers or 13 reals. In line 72.

and x inside this else statement.j++) { if (fscanf(coefffile. fscanf(coefffile.x[i]). LUfactor(a. if ( readerror ) printf("Not enough data\n").tol. "r"). } else printf("The solution could not be determined\n"). "%u". s).j<=n+1.i++) { if ( readerror ) break. for (i=1.x.NULL). } } fclose(coefffile). s)==EOF) { readerror=1. since if this statement is not entered.12f\n".b. n. FILE *coefffile. else { /* duplicate coefficient matrix */ c=allocmatrix(n). j). } else b[i]=strtod(s. j). /* allocate vector for iterative improvement of solution */ x=allocvector(n). if ( success ) { printf("%u iterations were used to find" " the solution. for (i=1.j<=n. for (j=1. } if ( j<=n ) { a[i][j]=strtod(s. for (i=1. i. printf("Not enough coefficients\n").NULL).c.&success). &success). printf("i=%u j=%u\n".i++) for (j=1. fscanf(coefffile. tol=strtod(s. coefffile=fopen("coeffs".NULL).i++) printf("x[%3i]=%16. printf("The solution is:\n"). break. /* Allocate right-hand side */ b=allocvector(n).i<=n. c.n. /* Create pointers to point to the rows of the matrix: */ a=allocmatrix(n). s).i. "%s". "%s".32. printf("Tolerance for relative L-infinity" " error used: %s\n".\n". /* Call improve here */ if ( success ) j=improve(a. &n). Gaussian elimination 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 char s[25].j++) c[i][j]=a[i][j].i<=n.i<=n.100. space is not reserved for x and c */ 143 . /* must free c[0].

In line 43.” Otherwise. and if LU-factorization is successful. the coefficient matrix and the right-hand side is read in the loop in lines 28–42 (as character strings. memory reserved for **c and *x is freed. it is helpful to write an equation an its right-hand side on a single line. If this is done successfully.c has the program reading the input data: the tolerance tol. Line 2 contains the value of n. note that the parameter 100 indicates that at most 100 steps of iterative improvement are allowed. on line 47 one allocates space for a copy **c of the coefficient matrix.308315e-17 10 2 3 5 -2 5 1 4 -2 -1 3 3 -1 2 1 3 9 -2 -1 -1 4 -1 2 -1 2 -1 -4 -5 2 3 1 -1 -4 -2 3 4 -9 -4 -3 -1 9 -1 -2 9 8 -7 8 -7 7 0 3 3 -4 -2 1 2 -2 1 3 4 1 -1 2 3 2 -1 2 3 -1 1 4 3 -1 -3 -4 2 1 2 3 -1 -1 1 2 -4 -3 -8 -2 -2 -3 4 8 -8 2 -4 3 1 -5 3 -2 4 9 -3 2 -1 8 -3 -4 3 2 5 7 Note that the first column of numbers 1–12 here are line numbers. since a[1] may have been changed. and not part of the file. The integer readerror indicates whether there is a sufficient number of input data in the file coeffs. and on line 27. value of n. there is not much one can do other than print that there are “not enough data. and it is converted to double on line 20). memory is reserved for the coefficient matrix.144 Introduction to Numerical Analysis 67 68 69 70 71 72 73 74 75 76 77 } free(c[0]). the solution is printed out in lines 55–61. tabs. /* free matrix */ free(a). On line 25. and tol is read in lines 19–20 (it is read as a character string on line 18. The data can be separated by any kind of white space (i. In line 51. the coefficient matrix. Finally. On line 18. the function LUfactor is called. If on line 44 one finds that there was a reading error. the file coeffs is opened. and then coefficients and the right-hand sides of the equations. then improve is called to perform iterative improvement on line 54. since memory for these variables was reserved in the current block. tolerance was set nearly as low as possible without causing iterative improvement to fail. On line 67–69. } /* We must free a[0] and not a[1]. Line 1 contains the tolerance. */ free(a[0]).e. in lines 73–76. newlines. as usual. but this is not required). and and in lines 48–49 the coefficient matrix is duplicated. free(c). and the right-hand side of the equation. memory allocated for the matrix **a and the vector *b is freed. the second one is n. this is done for the right-hand side vector of the equation. and then calls the programs to solve the equation. The input data are contained in the file coeffs: the first entry is tol. otherwise a message saying that “the solution could not be determined” is printed on line 63. The above program was called with the following input file coeffs: 1 2 3 4 5 6 7 8 9 10 11 12 8. The file main. free(b). This must be done before the end of the current else statement. the file coeffs is closed. and each of . free(x). spaces. after some experimentation. then converted to double and placed in the correct location in lines 38 and 40). memory is allocated for the solution vector *x. On line 52.

731205614034 x[ 10]= -0.429714763221 x[ 8]= -1. we obtain the following output: 1 2 3 4 5 6 7 8 9 10 11 12 13 Tolerance for relative L-infinity error used: 8. We obtain the matrix A(1) by subtracting l21 = 2 times the first row from the second row. 12 Solution.163886551028 Problems 1. 2 3 A = 4 8 6 13  1 5 . we have   1 0 0 L = 2 1 0.669529286724 x[ 9]= 1.039356801309 x[ 5]= -0. i. 6 6 9 .. Find the LU-factorization of the matrix   3 2 1 A = 9 7 5.014536223326 x[ 4]= -1. With this input file. The values of the remaining elements of L (l21 = 2. lii = 1 for i = 1. we have l21 = a21 /a11 = 4/2 = 2 and we have l31 = a31 /a11 = 6/2 = 3.32.542881432027 x[ 6]= 0.270477452322 x[ 2]= 0. Find the LU-factorization of the matrix  Do not make any row interchanges. The solution is: x[ 1]= -0.e. 0 4 9 (1) (l) (1) Writing A(1) = aij .900056527510 x[ 7]= 2. lij = 0 whenever 1 ≤ i < j ≤ 3. and l21 = 3 times the first row from the third row:   2 3 1 A(1) =  0 2 3  . That is.e.. and l32 = 2) have been given above. 2. and the elements in the upper triangle of L are zero.308315e-17 3 iterations were used to find the solution.126721910153 x[ 3]= 0. We obtain the matrix U = A(2) by subtracting the l32 = 2 times the second row from the third row in he matrix A(1) :   2 3 1 U = A(2) =  0 2 3  . l31 = 3. Gaussian elimination 145 the rest of the lines contain the coefficients of one of the ten equations and its right-hand side. 0 0 3 The matrix L can be written by noting that the elements of L in the main diagonal are all 1’s. 3 2 1 2. 3. Writing A = (aij ) and L = (lij ). we have l32 = a32 /a22 = 4/2 = 2. i.

l31 = 3. lii = 1 for i = 1. That is. 6)T . that is      1 0 0 10 y1  3 1 0   y2  =  37  . i. Writing A = (aij ) and L = (lij ). We obtain the matrix A(1) by subtracting l21 = 3 times the first row from the second row. 3.. x3 0 2 47 Solution. 2 3 1 47 y3 This equation can be written as y1 =10 3y1 + y2 =37 2y1 + 3y2 + y3 =47 The solution of these equations is straightforward (the method for solving them is called forward substitution). from the third equation y3 = 47−2y1 −3y2 = 47−2·10−3·7 = 6.146 Introduction to Numerical Analysis Do not make any row interchanges. from the second equation we have y2 = 37−y1 = 37−3·10 = 7. and the elements in the upper triangle of L are zero. From the first equation. y = (10. Solve the equation  1 3 2  0 0 3 1 00 3 1 0     1 1 10 x1 4 1   x2  =  37  .. we have y1 = 10.e. the equation U x = y can be written as      10 3 1 1 x1  0 4 1   x2  =  7  . We obtain the matrix U = A(2) by subtracting the l32 times the second row from the third row in he matrix A(1) :   3 2 1 U = A(2) =  0 1 2  . Then this equation can be written as Ly = b. That is. 2 2 1 3. 0 0 3 The matrix L can be written by noting that the elements of L in the main diagonal are all 1’s. we have l21 = a21 /a11 = 9/3 = 3 and we have l31 = a31 /a11 = 6/3 = 2. 7. Write the above equation as LU x = b. The values of the remaining elements of L (l21 = 2.e. and l32 = 2) have been given above. Then. i. 6 0 0 2 x3 . lij = 0 whenever 1 ≤ i < j ≤ 3. Solution. and l31 = 2 times the first row from the third row:   3 2 1 A(1) =  0 1 2  . we have l32 = a32 /a22 = 2/1 = 2. Finally. and write y = U x. Thus. 0 2 7 (1) (l) (1) Writing A(1) = aij . we have   1 0 0 L = 3 1 0. 2.

3)T . from the third equation y3 = 28−y1 −3y2 = 28−·11−3·5 = 2. Substituting this into the second equation. Write the above equation as LU x = b. Gaussian elimination 147 or else 3x1 + x2 + x3 =10 4x2 + x3 =7 2x3 =6 This can easily be solved by what is called back substitution. we obtain x1 = 10 − 3 − 1 10 − x2 − x3 = = 2. 3 3 That is. From the first equation. the equation U x = y can be written as      11 x1 2 1 2  0 1 2   x2  =  5  . Solve the equation  1 2 1  0 0 2 1 00 3 1 0     11 1 2 x1 1 2   x2  =  27  . we have x3 = 6/2 = 3. 1.32. Finally. y = (11. that is      11 1 0 0 y1  2 1 0   y2  =  27  . Thus. we have x2 = 7 − x3 7−3 = = 1. 4 4 Finally. x3 28 0 1 Solution. from the second equation we have y2 = 27−2y1 = 27−2·11 = 5. we have y1 = 11. y3 28 1 3 1 This equation can be written as y1 =11 2y1 + y2 =27 y1 + 3y2 + y3 =28 The solution of these equations is straightforward (the method for solving them is called forward substitution). Then this equation can be written as Ly = b. That is. 2 x3 0 0 1 or else 2x1 + x2 + 2x3 =11 x2 + 2x3 =5 x3 =2 . From the third equation. 4. and write y = U x. 5. substituting these into the second equation. we have x = (2. Then. 2)T .

5. 2)T . we have x = (3. 1  2 2. we have  Furthermore. we obtain x1 = 11 − x2 − 2x3 11 − 1 − 4 = = 3. 2 2 That is. Solution. 0 0 3 We have A = LU = (LD)(D−1 U ). Finally. 1 . Given  1 0 A = LU =  2 1 1 3  0 2 6 00 4 1 0 0  4 8. Write D for the diagonal matrix whose diagonal elements are the same as those of U .  0 0. Note that the inverse of a diagonal matrix is a diagonal matrix formed by the reciprocal of the elements in the diagonal: D−1  1/2 = 0 0 0 1/4 0  0 0 . we have x2 = 5 − 2x3 = 5 − 4 = 1. 1. Briefly explain how iterative improvement works.   2 0 0 D = 0 4 0. 3  2 2. substituting these into the second equation. Substituting this into the second equation. 1/3 Now. That is. we have x3 = 2. From the third equation. 1 LD =  2 1  Thus.148 Introduction to Numerical Analysis This can easily be solved by what is called back substitution. 3 write A as L′ U ′ such that L′ is a lower triangular matrix and U ′ is an upper triangular matrix such that the elements in the main diagonal of U ′ are all 1’s. 1/2 D−1 U =  0 0  2 0 0 1 00 0 3 1   2 0 0 0 4 0 = 4 4 2 12 0 3 0 1/4 0  2 0 0 0 0 1/3  0 4 12 2 A = 4 2   1 3 6 4 4 8 = 0 1 0 0 0 3  0 1 3 00 1 3 0 0 6.

. .33.e.) One determines the elements of the matrices U and L row by row. The Doolittle algorithm is a slight modification of Gaussian elimination. In Gaussian elimination. and x = (x1 . one takes the calculated solution x(0) of the equation Ax = b. and so on. where A = (aij ) is an n × n matrix. by calculating δb(2) = b − Ax(1) . In iterative improvement. . column of L and the second row of U . one can use double in C if one uses float for the rest of the calculation. One then solves the equation Aδx(1) = δb(1) . uis = ais − i−1 X and For i > 1 one calculates (1) lik uks k=1 for s ≥ i. . That is. A(2) .. and then solving the equation Aδx(2) = δb(2) . (Similarly.g. one starts with the system of equations Ax = b. Equilibration 149 Solution. an upper triangular matrix with all 1s in the diagonal will be called an upper triangular unit matrix. a lower triangular matrix with all 1s in the diagonal. by using the formulas derived above. i. This process can be repeated. or one can use long double if one uses double for the rest of the calculation). . where U is an upper triangular matrix and L is a lower triangular unit matrix. and then one calculates the difference δb(1) = b − Ax(0) . at the ith step for making (i+1) akj = 0 for k > i of the elements of the matrix A(i+1) (see equation (3) of the section on Gaussian elimination). EQUILIBRATION The Doolittle algorithm. . one can directly calculate the elements of the matrix L = A(new) by doing the calculations in different order. .. . one takes u1s = a1s for s ≥ 1. . and (2) lsi = asi − Pi−1 k=1 lsk uki uii for s > i. and so on. Then one determines a factoring LU of the matrix A. In particular. x2 . . Namely. first one calculates the first column of L and the first row U . The Doolittle and Crout algorithms. . THE DOOLITTLE AND CROUT ALGORITHMS. xn )T and b = (b1 . A(n) . only one scalar variable needs to be declared for this (e. . bn )T are column vectors. b2 . One can change the order of calculations by doing the calculations column by column. and takes x(1) = x(0) + δx(1) as an improved approximation of the solution of the equation. ls1 = as1 u11 for s > 1. . The inner products used in the calculation of the components of this vector should be accumulated in a location with double or extended precision. instead of calculating the elements of the matrices A(1) . 33.

The downside of this is that uss needs to be calculated for each of the tentative interchanges. The equations can be obtained by combining the equations of Gaussian elimination with the method used in Problem 5 just mentioned. Given the factorization A = P −1 LU .” that is. As the first step. if the coefficients in some of the equations have sizes very different from the coefficients in some other equations.150 Introduction to Numerical Analysis If one uses the last two formulas for i = 2. since it involves n2 divisions. however. and keep the interchange that gives the largest absolute value for uss . and U ′ is an upper triangular unit matrix. as opposed to the factorization A=P −1 LU obtained in Gaussian elimination. in turn. We will describe the Crout algorithm with partial pivoting and implicit equilibration. To avoid this. Partial pivoting does not work well if the equations are “differently scaled. . . one factors the matrix A as P −1 L′ U ′ . . the elements on the right-hand side have already been calculated by the time one needs them. . We will describe how to simulate equilibration in the context of the Crout algorithm. . before doing any (variant of) Gaussian elimination. Instead of doing equilibration. This waste is avoided in the Crout algorithm. 3. . in the same way as this was done in the function backsubst in the implementation of iterative improvement in the file lusolve. one can obtain the factorization A = P −1 L′ U ′ as was done in Problem 5 of the section on Gaussian elimination. The way to get around this is to tentatively interchange row s with each of the rows s + 1. we put ′ ls1 = as1 for s ≥ 1. where L is an lower triangular unit matrix. one can simulate its effects by doing what is called implicit equilibration by first calculating (3) di = max{|aij | : 1 ≤ j ≤ n} for i with 1 ≤ i ≤ n. one may want to rescale the equations by dividing each equation by its coefficient of maximum absolute value. and pick the pivot row as the row s for which . s + 2. L′ is a lower triangular matrix. is very costly. Doing this is called equilibration.c of the section of Gaussian elimination. This. where P is a permutation matrix. and later the results of all these calculations will be discarded except the one involving the actual interchange made. . . one may have to interchange the equations first. since when calculating uss .118 The point of doing the calculations in this order is that one can use a location with extended precision to calculate the partial sums of the right-hand sides of the formulas for uis an lis . and U is an upper triangular matrix. but the point here is to obtain the latter factorization directly. to see which row will give the largest absolute value of uss . One gets into trouble with this calculation when one wants to implement partial pivoting. In the Crout algorithm.

′ .

.

ls1 .

.

.

.

ds .

are again denoted by a1s ). perhaps stating them makes the explanation clearer. Then change the notation appropriately to reflect this row interchange (so that elements the new first row. write (4) ′ lsi = asi − i−1 X k=1 ′ lsk u′ki for s ≥ i. u′1s = ′ l11 For i > 1. and write a1s for s > 1. for s ≥ 1 is largest. and then interchange row 1 and s. . when these equations contain an empty sum. So there was no need to state the starting equations. 118 The starting equations above are just equations (1) and (2) with i = 1. for example.

34. Determinants Then select the row s such that .

′ .

.

lsi .

.

.

.

ds .

ik } such that σ(i) = i for any element of M that is not in this set.119 The main point of the Crout algorithm is that finding the pivot row does not involve tentative calculations. A cyclic permutation. the pivot element is chosen from among the elements of column i of the matrix A(i) on or below the diagonal. . . all of these being 0. At the ith step. When doing Gaussian elimination. . Interchange row s and row i. then instead of (3) one takes di = 1 for all i. So. but the reason to use extended precision was to control the accumulation of roundoff errors in the calculation of the matrix element in question). The other elements of the ith column at the ith step need to be calculated tentatively to find the pivot element. . Equations (4) and (5) allow first to store the partial products in a location with extended precision. One often considers this σ as being a permutation of some set M that includes the set {i1 . as above in case of the Doolittle algorithm. ik ). 151 for s ≥ i is the largest as the pivot row. The two-cycle (ii) is the identity mapping. σ(ik ) = i1 . 34. and the diagonal elements of U ′ . . A standard notation for this permutation is σ = (i1 i2 . or a cycle. The elements whose calculation is not indicated here and not stored (these are the above diagonal elements of the matrix L′ and the below diagonal elements of U ′ . one needs to take another look at Problem 5 in the section on Gaussian elimination. In the Doolittle algorithm. If one does not want to do implicit equilibration. and then these calculations will be discarded after the pivot element is found. these diagonal elements are the elements that turned out to be pivot elements. σ(i2 ) = i3 . these elements are calculated at the ith step of the Crout algorithm. . and then to transfer the value to the appropriate location in the coefficient matrix (at this point. and only the pivot element among these occurs in the matrix U obtained at the end of Gaussian elimination. what is the real reason that partial pivoting works smoothly in the Crout algorithm. or a k-cycle. On the other hand. . the starting equations are just equations (4) and (5) with . a proper two-cycle being a two-cycle (ij) for i and j distinct. the other elements were changed in the continued course of the algorithms. all of these being 1). ik . 119 Similarly i = 1. i2 . and change the notation to reflect the row interchange. . some precision is lost. It is seen in this problem that when one converts an LU factorization LU with L being an lower diagonal unit matrix into and a factorization L′ U ′ with U ′ an upper diagonal unit matrix that the diagonal elements of L′ are the same as those of U . Such a two-cycle is not a proper two-cycle. . where k ≥ 2 is an integer. σ(ik−1 ) = ik . . is a permutation σ where for some elements i1 . . . since only the final elements of the matrix are calculated naturally. Then calculate (5) u′is = ais − Pi−1 ′ ′ k=1 lik uks ′ lii for s > i. DETERMINANTS Cyclic permutations. but requires extra calculations in the Doolittle algorithm? To understand this. so only the element actually chosen as pivot element is calculated naturally. these elements are the same as the elements of the ith column of the matrix L′ on or below the diagonal. i2 . . A proper two-cycle is also called a transposition. A permutation is a one-to-one mapping of a set onto itself. and are available when one needs to determine the pivot row. we have σ(i1 ) = i2 . these elements are never available. as it would in the Doolittle algorithm. .

Finally. . one needs to note that. n are distinct. .  One can use the idea of the proof of the above Lemma. it can be written as a product τ1 . The identity mapping cannot be written as a product of an odd number of transpositions. and we will use induction on n. or at most one transposition containing n will remain all the way to the left. in which case every permutation of Mn (the identity mapping being the only permutation) can be written as a product of zero number of transpositions (the empty product of mapping will be considered to be the identity mapping). j. there are no transpositions on M1 . . def Proof. It is enough to consider permutations of the set Mn = {1. For n = 1 the statement is certainly true. so the identity mapping can be represented only as a product of a zero number of transpositions. According to the last displayed equation. and k are distinct elements then (ijk) = (ik)(ij). in fact. So the permutation ρ = (in)σ restricted to the set Mn−1 is a permutation of this latter set. and we also have (nij) = (ijn) = (in)(ij) = (ni)(ij). τm of transpositions. We will use induction on n. We start with the following Lemma. Further. τm . Even and odd permutations. the inverse (in)−1 of (in) is (in) itself. where ι (the Greek letter iota) stands for the identity mapping (often represented by the empty product. all transpositions containing n in the product can be moved all the way to the left in such a way that in the end either no transposition will contain n. if i. Comparing this with one of the expressions equal to (nij) above we obtain (ij)(nj) = (ni)(ij). The Lemma is certainly true for n = 1. . Let σ be a permutation of the set Mn . and assume σ(n) = i. and n are distinct elements of Mn . . . and so (ik)(ij)(j) = k. By induction. namely. . Proof. (nij) = (jni) = (ji)(jn) = (ij)(nj). Moreover. we have (nij) = (nj)(ni). if i. we clearly have (ni)(ni) = ι. and so (nj)(ni) = (ni)(ij). For example. Every permutation of a finite set can be written as a product of transpositions. Then (in)σ(n) = n (since i = σ(n) and (in)(i) = n). k. 2. clearly. (ij)(j) = i and (ik)(i) = j. Using these four displayed equation. where for the second equality. which agrees with the equation (ijk)(j) = k. the first . We will assume that the underlying set of the permutations is Mn . in a similar way.152 Introduction to Numerical Analysis Lemma. n}. or else one can use direct calculation. j. and so σ = (in)−1 ρ = (in)ρ = (in)τ1 . j. to show that if i. then we have (ij)(nk) = (nk)(ij). . Each application of these identities changes the number of transpositions in the product by an even number.

that are not one-to-one) by putting sgn(σ) = 0 if σ is not one-to-one. . 2. A permutation cannot be written as a product both of an even number of transpositions and of an odd number of transpositions. . σ4 (1) = 2.g. . i=1 where σ runs over all mappings (not necessarily one-to-one) of the set Mn = {1. . . one may say instead of that σ runs through all permutations of the set of the set Mn = {1. Since the right-hand side contains an odd number of transpositions. . and the transpositions τ1 . . are transpositions. . . The above formula is called the Leibniz formula for determinants. The right-hand side multiplies out this product by taking one term out of each of these sums. Assume that for a permutation σ we have σ = τ1 . this is impossible according to the last Lemma. . and τ1 . n}. or Latin signum) on permutations of a finite set is defined as sgn(σ) = 1 if σ is an even permutation. n} into itself. . and ρ1 . for n = 2. . . . . the identity mapping on Mn−1 can only be represented as a product of an even number of transpositions. The determinant det A of an n × n matrix A = (aij ) is defined as det A = X σ sgn(σ) n Y aiσ(i) . However. and adding up all the products that can be so formed. The left-hand side represent a product of sums. l is odd. Then n n X Y aij = n XY aiσ(i) . σ2 (1) = 1.  A permutation that can be written as a product of an even number of transpositions is called an even permutation. and a permutation that can be written as a product of an odd number of transpositions is called an odd permutation. σ4 (2) = 2. . Determinants. and sgn(σ) = −1 if σ is an odd permutation. . it is expedient to extend the signum function to mappings of a finite set into itself that are not permutations (i. Let (aij ) be an n × n matrix. . Sometimes one writes det(A) instead of det A. σ3 (1) = 2. . n} into itself. rather than just permutations. .e.34. Since sgn(σ) = 0 unless σ is a permutation. The the identity mapping can be written as a product ι = σσ −1 = τ1 . τk ρl . Proof. . The distributive rule. . ρl )−1 = τ1 . σ3 (2) = 1. . . The function sgn (sign. and the last identity decreases the number of transpositions by two. . τk (ρl )−1 . where σ1 (1) = 1. . . 2. . . . the above equation says that (a11 + a12 )(a21 + a22 ) = a1σ1 (1) a2σ1 (2) + a1σ2 (1) a2σ2 (2) + a1σ3 (1) a2σ3 (2) + a1σ4 (1) a2σ4 (2) . σ i=1 i=1 j=1 where σ runs over all mappings (not necessarily one-to-one) of the set Mn = {1.. 2. τm cannot be the identity mapping (since σ(n) = i). for some considerations (e. (ρ1 )−1 = τ1 . . Often. . . τk . σ2 (2) = 2. ρl . . . A product σ = (ni)τ1 . ρl where k is even. . τm . ρ1 . The equality of these two sides is obtained by the distributivity of multiplication over addition. . Then we can remove n from the underlying set. so the only possibility that remains is the we end up with a product of transpositions representing the identity mapping where none of the transpositions contain n. τk and σ = ρ1 . Determinants 153 to identities do not change the number of transpositions. . For example. σ1 (2) = 1. τk (ρ1 . since by the induction hypothesis. . for the application of the distributive rule above) it may be expedient to say that σ runs through all mappings. Corollary. . where i is distinct from n. we must have started with an even number of transpositions to begin with. . .

ρ Y Y  n n sgn(σ) sgn(ρ) aiσ(i) biρ(i) i=1 i=1 Y Y  X n n n Y sgn(ρ) sgn(σ) aiσ(i) bσ(i)ρσ(i) = aiσ(i) bσ(i)ρσ(i) sgn(ρσ) i=1 σ. Then. i=1 n Y aiσ(i) bσ(i)π(i) = 0. while we still restrict π to permutations.ρ = X σ. Writing ρσ = π. we will show that the inner sum is zero whenever σ is not a permutation. The third equality takes into account that sgn(ρ) sgn(σ) = sgn(ρσ) (since. i=1 120 One might ask why did we not extend the range of σ to all mappings earlier. sgn(π) sgn(π) σ. and the third equality expresses the fact that σ(k) = σ(l). the product i=1 bσ(i)ρσ(i) is just a rearrangement Qn of the product i=1 biρ(i) . we have det(A) det(B) = X σ. that is: Lemma. Let A = (aij ) and B = (bij ) be two n × n matrices. i=1 The second equality expresses the fact that sgn(π(kl)) = − sgn(π).π σ i=1 π i=1 Here σ and π run over all permutations of Mn . which is zero if σ is not one-to-one.154 Introduction to Numerical Analysis Multiplications of determinants. and the equation just reflects the interchange of the factors corresponding to i = k and i = l in the product. it follows that 2 X sgn(π) π and so X π sgn(π) n Y aiσ(i) bσ(i)π(i) = 0. the first equality next is obvious: X sgn(π) π n Y aiσ(i) bσ(i)π(i) = X π sgn(π(kl)) π i=1 =− X sgn(π)) n Y i=1 n Y aiσ(i) bσ(i)π(kl)(i) i=1 aiσ(i) bσ(i)π(kl)(i) = − X sgn(π)) π n Y aiσ(i) bσ(i)π(i) . Rearranging this equation. ρσ can be written as the product of k + l transpositions). the permutation π(kl) will run over all permutations of Mn as π runs over all permutations of Mn . if ρ can be written as the product of k transpositions and σ as the product of l transpositions. so the right-hand side above can be written as n n Y Y X XX aiσ(i) bσ(i)π(i) = aiσ(i) bσ(i)π(i) . denoting by (kl) the transposition of k and l. The answer is that we wanted to make sure that π = ρσ is a permutation. The product of two determinants of the same size is the determinant of the product matrix. Then det(AB) = det(A) det(B).ρ i=1 i=1 Qn On the right-hand side of the second equality.120 To show that this is possible. Hence. the permutations π and σ uniquely determine ρ. Now we want to change our point of view in that we want to allow σ to run over all mappings of Mn into itself on the right-hand side. either. since σ is a one-to-one mapping. when this would have been easy since we had sgn(σ) in the expression. . In fact. assume that for some distinct k and l in Mn we have σ(k) = σ(l). and if σ is not one-to-one then π = ρσ is not one-to-one. With σ and ρ running over all permutations of Mn . Proof.

and the third equality is obtained by replacing σ −1 with σ.j . Formally.j = X sgn(σ) σ =− X σ sgn(σ) n Y i=1 n Y i=1 ai σ(kl)(i) = − X σ sgn(σ(kl)) n Y ai σ(kl)(i) i=1 aiσ(i) = − det A.j the third equality follows by the distributive rule mentioned above.122 If we multiply a row of a determinant by a number c. one needs to split up the sum representing the determinant into sums containing even and odd permutations. a similar statement can be made when one interchanges rows. l ∈ Mn are distinct. however. Determinants 155 as we wanted to show. Formally: if A = (aij ) B = (bij ). (kj) denotes a transposition. since σ −1 runs over all permutations of Mn . with σ running over all permutations of Mn . respectively. Indeed. In order to establish the result for this case as well. if A = (aij ) then. If two columns of A are identical. we have bij = aij if 121 The theory of determinants can be developed for arbitrary rings. i.34. in the last sum expressing det(A) det(B) one can allow σ to run over all mappings of Mn into itself. just as σ does. a + a = 0 holds for every a). where. with σ running over all permutations of Mn . To this end.121 Therefore. the third equality is based on the equality sgn(σ) = sgn(σ −1 ). and the last equality holds in view of the definition of the product matrix AB. . Indeed. and for some k ∈ Mn and some number c.  Simple properties of determinants. since σ(kl) runs over all permutations of Mn . one can conclude that det A = − det A. we have det AT = X sgn(σ) σ = X n Y aσ(i)i = σ i=1 sgn(σ) σ n Y X sgn(σ) n Y aiσ−1 (i) = X sgn(σ −1 ) σ i=1 n Y aiσ−1 (i) i=1 aiσ(i) = det A. as pointed out in the previous footnote. and then one needs to show that these two parts of the sum cancel each other. For rings of characteristic 2 (in which one can have a + a = 0 while a 6= 0 – in fact. one needs to split up the summation for π into two parts such that in the first part π runs over all even permutations. 122 This argument does not work for rings of characteristic 2.j = det(ai (kl)(j) )i. det A = det AT . Of course. by interchanging the identical columns. Indeed. where AT denotes the transpose of A. since det AT = det A. the second equality holds. For a matrix A. as usual. easy to change the argument in a way that it will also work for rings of characteristic 2. then det(aij )i. and the third equality is obtained by replacing σ(kl) by σ. then det A = 0. then the determinant gets multiplied by c. the last step in the argument is not correct. We have det(A) det(B) = = X XX sgn(π) π sgn(π) σ π n n X Y n Y aiσ(i) bσ(i)ρπ(i) = sgn(π) π i=1 aik bkπ(i) = det i=1 k=1 X X n aik bkj k=1  n XY aiσ(i) bσ(i)π(i) σ i=1 = det(AB). just as σ does. If one interchanges two columns of a determinant. if k. and not just over permutations (while π will still run over all permutations). since sgn(σ(kl)) = − sgn(σ). we have det(ai (kl)(j) )i. i=1 here the second equality represents a change in the order in of the factors in the product. then π(kl) will run over all odd permutations. the value determinant gets multiplied by −1. Is is.

then det A = (−1)k+l akl det A(k. Formally. With σ running over all permutations of Mn and ρ running over all permutations of {2. then rows k − 2 and k − 3. After bringing the element akl into the first row. while keeping the rest of the elements unchanged. 1). 1). into position (1. . However. one makes altogether k − 1 + l − 1 row and column interchanges. . The result can be obtained from the last Lemma by moving the element akl into the top left corner (i. then det C = det A + det B. hence the factor (−1)k+l = (−1)(k−1)+(l−1) in the formula to be proved. 1)) of the matrix A. note that the product on the left-hand side of the second equality is zero unless σ(1) = 1. Corollary. While doing so. . l). the two determinants get added. j) the (n − 1) × (n − 1) matrix obtained by deleting the ith row and jth column of the matrix. The following theorem describes the expansion of a determinant by a row. . since this will change the order of rows in the submatrix corresponding to the element.e. This is easy to verify by factoring out c from each of the products in the defining equation of the determinant det B.and C = (cij ). akl is the only (possibly) nonzero element in the kth row of A. Thus. Expansion of a determinant by a row. nothing will change. it will not work to interchange the kth row of the matrix A with the first row. i=2 for the second equality.and and ckj = akj + bkj . It is due to PierreSimon Laplace (1749–1827). and for some k ∈ Mn we have bij = aij if i 6= k. l ∈ Mn . . This is again easy to verify from the defining equation of determinants. n} of a permutation σ of Mn with σ(1) = 1. then rows k − 1 and k − 2. . . For any integer k ∈ Mn we have det A = n X j=1 (−1)i+j aij det A(i. n}. when doing this. By “(possibly) nonzero” we mean that the Lemma of course applies also in the case when we even have a11 = 0. If two determinants are identical except for one row. Proof. Theorem. one can make similar column interchanges. If a11 is the only (possibly) nonzero element of the first row of A. . l). denote by A(i. where we have cij = aij if i 6= k. then det A = a11 det A(1. then sgn(ρ) = sgn(σ). Lemma. note that if ρ is the restriction to the set {2. one can move the kth row into the position of the first row by first interchanging rows k and k − 1. we have det A = X sgn(σ) σ = X σ:σ(1)=1 n Y aiσ(i) = i=1 a11 sgn(σ) X sgn(σ) aiσ(i) = a11 X ρ i=2 aiσ(i) i=1 σ:σ(1)=1 n Y n Y sgn(ρ) n Y aiρ(i) = a11 det A(1. one always needs to interchange adjacent rows. then the determinant formed by adding the elements in the different rows. . If for some k. then the value of the determinant does not change. since this second determinant is zero. If one adds the multiple of a row to another row in a determinant. etc.. In order not to disturb the order of rows in the submatrix A(k. j). Given an n × n matrix A = (aij ). if A = (aij ) B = (bij ). note that adding c times row l to row k to a determinant amounts to adding to the determinant c times a second determinant in which rows k and l are identical. For the fourth equality. To see this. Proof. .156 Introduction to Numerical Analysis i 6= k and we have bkj = cakj then det B = c det A.

i=1 The expansion n X aik Aij . j) is often called the cofactor of the entry aij of the matrix A. This equation can also be written as n X Aij aik = δjk . one may assume that the letter that comes first in the alphabet refers to the rows.i  Aji det A  . where the subscripts outside indicate that the entry listed between the parenthesis is in the i row and the jth column.j To explain the notation here. . det A i=1 This equation can also be expressed as a matrix product. For example. If j = k then this is in fact the determinant of the matrix A. if j 6= k then this represent a determinant whose jth and kth columns are identical. i on the outside in the middle member indicates that the entry listed between the parenthesis is in the j row and the ith column. In fact. In view of the last Corollary. one can obtain a similar expansion of a determinant by a column: n X det A = aij Aij . defined as δjk = 1 if j = k and δjk = 0 if j 6= k. we have −1 A =  Aij det A T =  Aij det A  = j. with the matrix B = (bij ) with bij = Aji / det A. j=1 this equation can be established by the (repeated) use of the addition rule of determinants. where I is the n × n identity matrix. i. j) is akj .34. the matrix B is the inverse of A. Determinants 157 Proof. some agreed upon unspoken assumption is made. and in row k all elements are zero except that the element in position (k. Let the matrix Bj be the matrix that agrees with A except for row k. then the above equation can be written as n X det A = aij Aij . this equation can be written simply as AB = I. as usual. In other words. the subscripts j. Therefore n X aik Aij = δjk det A. That is. The matrix in the middle is obviously the same as the one on the right.123 123 If. j=1 Since det A = det AT . the subscripts on the outside are omitted. def The number Aij = (−1)i+j det A(i. k ≤ n represent the expansion of a determinant by the jth column that has the elements aik in this column instead of the elements aij . i=1 for some j and k with 1 ≤ j. i=1 where δjk is Kronecker’s delta. and one of the simple properties of determinants says that such a determinant is 0. the equation to be proved can be written as n X det A = det Bj .

POSITIVE DEFINITE MATRICES An n × n matrix A with real entries is called positive definite if xT Ax > 0 for any n-dimensional column vector x. Cramer’s rule is of theoretical interest. xn )T . this principal minor is called the kth principal minor. the determinant det A is called the determinant of the system. i=1 Assuming that det A 6= 0. In fact. det A where Bk is the matrix where the kth column of A has been replaced by the right-hand side b. note that if A is positive definite then the matrix obtained by deleting the first row and the first column of the matrix is also positive definite. If A = (aij ) and x = (x1 . . but it is not a practical method for the numerical solution of a system of linear equations. . The above equation expressing xk is called Cramer’s rule. xT Ax = i. or it has infinitely many solutions. either it has no solution..e. 35. all matrices will be assumed to have real entries.158 Introduction to Numerical Analysis Cramer’s rule. the second equation holds because the numerator in the middle member represents the expansion of det Bk . This determinant is called the determinant of the unknown xk .124 we have xk = Pn i=1 bi Aik det A = det Bk . since the calculations of the determinants in it are time consuming. then it is easy to show that the system of equations Ax = b does not have a unique solution – i. in this case the rank of the matrix A is less than n. multiplying n the ith equation j=1 aij xj = bi by the cofactor Aik . In what follows. . The following is an important characterization of positive definite matrices: 124 If det A = 0. . for some integer k. Further. the practical method for the solution of a system of linear equations is Gaussian elimination.j=1 Hence. this can be seen by choosing x1 = 0 in the equation above. For a fixed k with 1 ≤ k ≤ n. where A = (aij ) is an n × n matrix. and adding these equations for i with 1 ≤ i ≤ n we obtain for the left-hand side n X i=1 Aik n X j=1 aij xj = n X xj j=1 n X aij Aik = n X xj δjk det A = xk det A j=1 i=1 So we obtain the equation xk det A = n X bi Aik . A matrix A = (aij ) is symmetric is aij = aji . and x = P (xi )T and b = (bi )T are column vectors. A principal minor is the determinant obtained by taking the intersection of the first k rows and first k columns of the matrix. if A is a positive definite matrix then akk > 0 for all k with 1 ≤ k ≤ n. A minor of a matrix is the determinant of a submatrix obtained by taking the intersection of a number of rows and the same number of columns in the matrix. then n X xi aij xj . Consider the system Ax = b of linear equations. This follows by choosing xi = δki . .

we will only assume that A has one of the properties in the theorem. Assume n > 1. if x is an n-dimensional column vector and y = S T x then x is a nonzero vector if and only if y is so. assume that all principal minors of A are positive. bij = −mi1 aj1 + n X k=2 δik akj = aij − mi1 aj1 = 0 Next. Further. C = SAS T .e. and noticing that all but the first element of this column is zero. a11 is just the first principal minor of A. here the second equality can be seen by expanding the determinant C by its first column. Then. . . note that the equation C = SAS T implies that Cm = Sm Am (Sm )T . since. As we just saw. −m21 . Finally. Writing mi1 = ai1 /a11 . Further. c11 > 0. where S = (sij ) is the matrix whose first column is (1. . . writing Am . all diagonal elements of a positive definite matrix are positive. Hence det(C) = det(SAS T ) = det(S) det(A) det(S T ) = det(A). Thus. . At this point. Note that all elements of the elements of the first column of C = (cij ) are the same as those of B. Then the matrix C ′ obtained by deleting the first row and first column of C is also positive definite. so det(S) = det(S T ) = 1. −m31 . 0. We will use induction on n. Next assume that A is positive definite. S. c11 = a11 . and assume the result is true for every k × k matrix with 1 ≤ k < n. Our first claim is that the element a11 is then nonzero. C formed by the first m rows and m columns. We will then probe that it is also true for n × n matrices. −mn1 )T . and C satisfying the first equation. C is also positive definite. . All principal minors of a positive definite matrix are positive. In fact. Proof.Sm . Hence det(A) = det(C) = c11 det(C ′ ) > 0. by the induction hypothesis the determinant of C ′ is positive. i. 0. so assuming the left-hand side is positive for all nonzero y is the same as assuming that the right-hand side is positive for all x. Cm . Next. this is because the elements of the first column of C are obtained by multiplying the rows of B by the first column of S T . yT Ay = xT Cx. The result is obviously true in case n = 1. establishing our claim. for i > 1. the size of the matrix. Form the matrix B by subtracting appropriate multiples of the first row of A from the other rows in such a way that all but the first element of the first column will be zero (exactly as is done in Gaussian elimination). this second equation would not be true for arbitrary matrices A. the element bij is obtained as the product of the ith row of S and the jth column of A. For an arbitrary positive integer m ≤ n. Then all its principal minors other than det(A) are positive by the induction hypothesis. Indeed. we claim that C is positive definite if and only if A is positive definite. . since the first column of S T is (1. for the submatrices of A.. in each case the result of this multiplication is the first element of the row of B in question. form the matrix C = BS T . Let A = (aij ) be an n × n matrix. this is because S is nonsingular (as det(S) = 1). and so x = (S T )−1 y. this amounts to forming the matrix B = (bij ) = SA. and whose other elements are sij = δij . S. that it is either positive definite or that all its principal minors are positive. 0)T . We know that this is true is A is positive definite. then we will have to show that it has the other property. i. we need to prove that det(A) is also positive.35. a symmetric matrix is positive definite if all its principal minors are positive. S is a lower triangular matrix with all its diagonal elements equal to 1. Conversely.e. as we remarked above. 
On the other hand.. further. . Positive definite matrices 159 Theorem. and ci1 = 0 for i > 1. it is true at .

as the example of the matrix   1 −3 A= 0 1 with positive principal minors shows. . ci1 = c1i = 0. xn )T . The proof of the theorem is complete. since for the vector x = (11 . 1)T we have xT Ax = x1 − 3x1 x2 + x22 = −1. . i. x2 )T = (1. Therefore. It then follows that C is also symmetric. since C T = (SAS T )T = (S T )T AT S T = SAT S T = SAS T . Cm−1 consists of the first m − 1 rows and m − 1 columns of C ′ . it then follows that all but the first element in the first row of C is also zero. the inequality det(Cm−1 )′ > 0 follows.. .125 Now. det(Sm ) = det((Sm )T ) = 1. . of the rows and columns of index 2 through m of C). the first equation holds by expanding Cm by its first column. this implies that A is positive definite.j=2 i.. « . it follows that C ′ is positive definite. that is.e. . The Theorem has an important consequence concerning Gaussian elimination. it also follows that C is positive definite. So. Now. ′ As for the principal minor associated with the submatrix Cm−1 of the matrix C ′ obtained by deleting ′ the first row and first column of C (that is. that AT = A. so the elements in this block do not contribute to the product on the right-hand side of the second equation. for an arbitrary nonzero vector x = (x1 . Since det(C1 ) > 0. . So all principal minors of C ′ are positive. As we remarked above. . assume also that A is a symmetric matrix. we have ′ ′ 0 < det(Cm ) = c11 det(Cm−1 ) = det(C1 ) det(Cm−1 ).160 Introduction to Numerical Analysis present because the block formed by the first m rows and the last n − m columns of S contain all zeros. H of the appropriate size (so that they can be multiplied). As we know that all but the first element in the first column of C is zero. for i > 1. This matrix is not positive definite.j=1 the inequality holds since c11 = det(C1 ) > 0 and the matrix C ′ is positive definite.e. 125 It is easy to verify that matrices can be multiplying in bocks. . and the second equation holds because the only entry of the matrix C1 is c11 . i. Hence. det(Cm ) = det(Sm ) det(Am ) det((Sm )T ) = det(Am ) > 0. since Sm is a lower triangular matrix with ones in the diagonal. that is „ A C B D «„ E G F H « = „ AE + BG CE + DG AF + BH CF + DH for matrices A. By the induction hypothesis. i.  The second conclusion of the Theorem is not valid without assuming that A is symmetric. we have n X xi cij xj = c11 x21 + n X xi cij xj > 0. Cholesky factorization.

so its determinant is the product of its diagonal elements. this decomposition is identical to LU . Further. according to the Corollary above. This matrix is. then this decomposition is unique. and LD is a lower diagonal matrix. Then all diagonal entries of D are positive. 129 In fact. we have A = L′′ (L′′ )T . then A = AT = (D−1 U )T (LD)T = (U T (D−1 )T )(DT LT ) = (U T D−1 )(DLT ). 127 Another way of putting this is that the equation A = LU and the facts that L is a lower diagonal matrix. note that U = A(n) . without row interchanges). Then D−1 U is an upper diagonal matrix whose diagonal elements are ones. 128 A matrix all whose elements outside the diagonal are zero. so we have det(Ul ) > 0 for the kth principal minor of U . as A is positive definite. In particular. Then writing L′′ = LE.e. since D and D−1 are diagonal matrices. we will show that we can continue the procedure in that in (k) (k) the matrix A(k) = aij obtained at this point we have akk 6= 0. a 0 × 0 matrix) and we 0 stipulate that det(U0 ) = 1. or both the subtracted row and the row from which we subtract are outside this minor. This is an LU decomposition of the matrix A with the diagonal elements of U T D−1 being ones. and we have A = (LD)(D−1 U ). the diagonal elements of U are all positive. 126 This argument works even in case l = 1 if we take U to be the empty matrix (i. . Positive definite matrices 161 Corollary. Of course. That is. A = LDLT . All principal minors of A being positive according to the Theorem above. Hence Gaussian elimination can be continued as claimed. by what we said about the principal minors of A(k) above.. so they are their own transposes. The operations in Gaussian elimination (subtracting a multiple of a row from another row) do not change the values of the principal minor (since either the subtracted row is in the principal minor in question.. Assume A is a positive definite matrix. upper triangular. we have det(Ul ) = ull det(Ul−1 ). where L is a lower triangular matrix with all 1s in the diagonal. as this equation holds without a permutation matrix only if no pivoting was done).126  Observe that if it is possible to obtain an LU decomposition of a matrix A = LU (without pivoting.127 Form the diagonal matrix128 D by taking the diagonal elements of U .e. Then Gaussian elimination on A can be performed without pivoting (i. (k) since a11 = a11 > 0. because the steps in the Gaussian elimination are uniquely determined. in the LU factorization A = LU . and U is an upper diagonal matrix. the same holds for the (k) (k) principal minors of A(k) . so akk 6= 0. the inequality ull > 0 follows. Hence. note that this is true in case k = 1. Let E be the diagonal matrix formed by taking the square roots of the diagonal elements of D. Then D = E 2 . Now assume that A is positive definite (in addition to being symmetric). hence U = DLT .35. A being an n × n matrix). We start out with Gaussian elimination without pivoting. U is an upper diagonal matrix. we also know that u11 = a11 > 0. so this argument is not really needed in case l = 1. Proof. Since U is a triangular matrix. and the diagonal elements of L are ones can be used to derive the equations used in the Doolittle algorithm (without pivoting) to determine the entries of L and U . where Ak is the matrix found in the intersection of the first k rows and the first k columns of A(k) . the third equation here holds. as det(Ul ) > 0 and det(Ul−1 ) > 0. we have det(Ak ) > 0. 
To see that ull > 0 for each k with 1 ≤ l ≤ n in the LU factorization LU obtained. however. this is the decomposition obtained by the Crout algorithm. and in this latter case the principal minor is not changed).129 Assume now that A is a symmetric matrix. and at the kth step (with 1 ≤ k ≤ n. none of these diagonal (k) elements can be zero.

130 Finally. u1s u11 for s > 1. For the former. ′′ For the Cholesky factorization with L′′ = lij we have and ′′ ls1 = ′′ l11 = √ as1 l11 for s > 1. The formulas used in the Doolittle algorithm can easily be modified to obtain the factorization A = LDLT or the Cholesky factorization A = L′′ (L′′ )T . a11 . . This representation of a positive definite symmetric matrix A is called its Cholesky factorization. so the two starting equations are not needed.131 This emphasizes the point that the upper half of A (i. the above equations can also be directly obtained from the matrix equation A = LU = LDLT . For the derivation of the formulas below. Here we wrote as1 and asi instead of a1s and ais . where dij = 0 unless i = j. and then. . only about n2 /2 matrix elements need to be calculated. as no pivoting is intended. . it may also be worth noting that we have L′′ = LE = U T (D−1 )E = U T E −1 . because the matrix A is symmetric. when these equations contain an empty sum.e. . these equations can be used in a different order. k=1 s≥i If one uses the last two formulas for i = 2. for s > 1. for s = 1. the latter two formulas can also be used for i = 1 as well. with A = aij . So there was no need to state the starting equations. then it is better to use the factorization A = LDLT mentioned above than the Cholesky factorization. perhaps stating them makes the explanation clearer. This order of calculations is possible because one does not intend to do any pivoting. . ais for s > i) never needs 130 The starting equations above are just equations (1) and (2) with i = 1. 131 Again. since for this factorization one does not need to calculate the square roots of the entries of the diagonal matrix D. and the elements of L can also be calculated row by row. one can calculate uis for i ≤ s and lis for i < s. Instead of looking at the equations of the Doolittle algorithm. and ls1 = For i > 1 one calculates uis = ais − and lsi = i−1 X lik uks for uis uii for s > i. as outlined in the preceding footnote. this is possible. in turn. one calculates u11 by the above formulas. the elements on the right-hand side have already been calculated by the time one needs them.. These equations can also be used in a different order. (lik lii = taii − k=1 and ′′ lsi = asi − Pi−1 ′′ ′′ k=1 lik lsk ′′ lii for s ≥ i Again. L = (lij ) and D = (dii ).162 Introduction to Numerical Analysis where L′′ is a lower triangular matrix. The advantage of Cholesky factorization over Gaussian elimination is that instead of n2 matrix elements. 3. we have u1s = a1s for s ≥ 1. dii = uii (so dij = uii δij ). That is. For i > 1 we put v u i−1 X u ′′ ′′ )2 . If one is only interested in solving the equation Ax = b.

just as in other variants of the Gaussian elimination. According to the assumption that A is row-diagonally dominant. both Jacobi and Gauss-Seidel iterations will converge with any starting vector x(0) . the kth approximation xj to the solution xj of this equation is calculated as Pn (k) bi − j=1. one makes use of xj (1 ≤ i ≤ n). and b is an n-dimensional column vector. Consider the equation Ax = b. so the whole square block normally reserved for the elements of a nonsymmetric matrix is needed. this is also true for the elements of L′′ in Cholesky factorization. we have 0 ≤ q < 1. Assume that for some k ≥ 1 and for some C we have (1) (k) |xi (k−1) − xi |≤C for each i with 1 ≤ i ≤ n. we have (k+1) xi − (k) xi = Pn j=1. first to store A. and then to store L′′ . in Gauss-Seidel iteration. Write q = max 1≤i≤n Pn j=1. aii and in Gauss-Seidel iteration one takes Pi−1 Pn (k+1) (k) bi − j=1 aij xj − j=i+1 aij xj (k+1) xi = aii (k+1) that is. In case of Jacobi iteration. JACOBI AND GAUSS-SEIDEL ITERATIONS In certain cases. Lemma. When solving this equation for the n-dimensional column vector x. This is also true for the LDLT factorization. Proof. for example. where only the lower triangular block representing the symmetric matrix A is needed. but there one needs also to store the matrix U . where A is an n × n row-diagonally dominant matrix. j6=i (k−1) aij xj aii (k)  − xj (1 ≤ i ≤ n). The elements of L and U can be stored in the locations originally occupied by the elements of A. One can use any (0) xi starting values with these iterations. Jacobi and Gauss-Seidel iterations 163 to be stored. Call an n × n matrix A = (aij ) row-diagonally dominant if |aii | > n X j=1 j6=i |aij | for each i with 1 ≤ i ≤ n. as soon as it is available. one can solve the system of equations n X aij xj = bi (1 ≤ i ≤ n) j=1 (k) by iteration. = 0 for each i is a reasonable choice. 36. In Jacobi iteration.36. j6=i aij xj (k+1) xi = (1 ≤ i ≤ n). . j6=i |aii | |aij | .

Jacobi and Gauss-Seidel iterations exhibit linear convergence. This proves the convergence of Jacobi iteration. this establishes (2) with i + 1 replacing i (in that it shows the inequality in (2) with i replacing j). and so (k+1) lim (xi k→∞ (k) − xi ) = 0 for each i with 1 ≤ i ≤ n. The system of equation is row-diagonally dominant. if |xi − (k) xi | = Pn j=1. we have (k+1) xi − (k) xi = Pi−1 j=1 (k) aij xj (k+1)  − xj − (k−1) Pn j=i+1 aij xj aii (k)  − xj (1 ≤ i ≤ n). Explain why the system of equations 6x + 2y − 3z = −8 3x − 7y + 2z = −2x − 4y + 9z = 9 5 can be solved by Gauss-Seidel iteration. assuming (1) only (i.e. As q < 1. Question 2. The proof of Gauss-Seidel iteration is slightly more subtle. (2) follows by inductions. Such a system of equations can always be solved by both Jacobi iteration and Gauss-Seidel iteration. For Gauss-Seidel iteration. and Aitken’s acceleration (see Section 16) can be adapted to matrices and used to accelerate the convergence of these methods. Therefore. Question 1. thus. Write the equations describing the Gauss-Seidel iteration to solve the above system of equations. we also have (k+1) (k) lim (xi − xi ) = 0 k→∞ for each i with 1 ≤ i ≤ n in the present case. Assume in addition to (1) that we have (k+1) (2) |xj (k) − xj | ≤ C for each j with 1 ≤ j < i. Problems 1. . Then (k+1) |xi − (k) xi | = Pi−1 j=1 |aij |C + Pn j=i+1 |aii | |aij |C ≤ qC (1 ≤ i ≤ n). then we have |xi |aij |C ≤ qC (1 ≤ i ≤ n). similarly as in the case of the Jacobi iteration. Question 1. (k) − xi | ≤ q k C0 . j6=i (0) |aii | (k+1) − xi | ≤ C0 . the absolute value of the each coefficient in the main diagonal is larger than the sum of the absolute values of all the other coefficients on the left-hand side of the same equation. That is. note that this is vacuous for i = 1.164 Introduction to Numerical Analysis and so (k+1) |xi (1) Thus. Consider a fixed i with 1 ≤ i ≤ n. Do not solve the equations. without assuming (2)).. Hence the latter inequality also follows. Solution. This proves the convergence of the Gauss-Seidel iteration.

In Gauss-Seidel iteration. one can start with x(0) = y (0) = z (0) = 0. The system of equation is row-diagonally dominant. and for each integer k ≥ 0 one can take −9 + 3y (k) + 2z (k) 8 6 − 2x(k) + 3z (k) = 9 (k) 3 + 3x − 2y (k) = . Solution. Write the equations describing the Gauss-Seidel iteration to solve the above system of equations. Jacobi and Gauss-Seidel iterations 165 Question 2. Explain why the system of equations 8x − 3y − 2z = −9 2x + 9y − 3z = 6 −3x + 2y + 7z = 3 can be solved by Gauss-Seidel iteration. similarly as in the preceding problem. In Jacobi iteration. one can start with x(0) = y (0) = z (0) = 0. Question 2. Explain how to solve the system of equations x − 2y − 7z = 7 8x − y + 3z = −2 2x + 6y + z = −4 . 7 x(k+1) = y (k+1) z (k+1) Question 3. Write the equations describing Jacobi iteration to solve the above system of equations. and for each integer k ≥ 0 one can take −9 + 3y (k) + 2z (k) 8 6 − 2x(k+1) + 3z (k) = 9 3 + 3x(k+1) − 2y (k+1) = . In Gauss-Seidel iteration.36. Question 2. 7 x(k+1) = y (k+1) z (k+1) 3. Question 3. Question 1. Question 1. one can start with x(0) = y (0) = z (0) = 0. and for each integer k ≥ 0 one can take −8 − 2y (k) + 3z (k) 6 (k+1) 9 − 3x − 2z (k) = −7 5 + 2x(k+1) + 4y (k+1) = 9 x(k+1) = y (k+1) z (k+1) 2.

. the absolute value of the each coefficient in the main diagonal is larger than the sum of the absolute values of all the other coefficients on the left-hand side in the same row. . . b] be an interval. we would like to determine cubic polynomials Si (x) = ai (x − xi )3 + bi (x − xi )2 + ci (x − xi ) + di (1) for i with 0 ≤ i < n satisfying the following conditions: (2) (3) (4) Si (xi ) = yi Sn−1 (xn ) = yn . yn .166 Introduction to Numerical Analysis by Gauss-Seidel iteration. . . By changing the order of equations. . x1 . Solution. < xn = b for some positive integer n. . and for each integer k ≥ 0 one can take −2 + y (k) − 3z (k) 8 −4 − 2x(k) − z (k) = 6 7 − x(k) + 2y (k) = −7 x(k+1) = y (k+1) z (k+1) 37. and for each integer k ≥ 0 one can take −2 + y (k) − 3z (k) 8 −4 − 2x(k+1) − z (k) = 6 7 − x(k+1) + 2y (k+1) = −7 x(k+1) = y (k+1) z (k+1) When solving them by Jacobi iteration. Such a system of equations can always be solved by both Jacobi iteration and Gauss-Seidel iteration. When solving these equations with Gauss-Seidel iteration. y1 . . That is. Si (xi+1 ) = Si+1 (xi+1 ) for i with 0 ≤ i ≤ n − 2 and 0 ≤ k ≤ 2. Moving the first equation to the last place (when the second equation will become the first one) and the third one will become the second one. we obtain the following system of equations: 8x − y + 3z = −2 2x + 6y + z = −4 x − 2y − 7z = 7 This system of equation is row-diagonally dominant. and assume we are given points x0 . . . . (k) (k) for i with 0 ≤ i ≤ n − 1. xn with a = x0 < x1 < . and corresponding values y0 . CUBIC SPLINES Let [a. one can start with x(0) = y (0) = z (0) = 0. the system of equation can be made to be rowdiagonally dominant. one can start with x(0) = y (0) = z (0) = 0.

the other equations in (7)–(9) correspond the equations (4) with k = 0. these are called correct boundary conditions. For example. The two remaining equations can be obtained by setting conditions on S0 at x0 and Sn−1 at xn . hi−1 (1 ≤ i ≤ n) . According to (7). The function S(x) defined as S(x) = Si (x) whenever x ∈ [xi . Equation (12) with i − 1 replacing with i says di − di−1 = ai−1 h2i−1 + bi−1 hi−1 + ci−1 . one may require that ′′ S0′′ (x0 ) = 0 and Sn−1 (xn ) = 0. i + 1] (1 ≤ i < n) is called a cubic spline.37. then one can require that S0′ (x0 ) = f ′ (x0 ) (5) ′ and Sn−1 (xn ) = f ′ (xn ). Cubic splines 167 These represent (n + 1) + 3(n − 1) = 4n − 2 equations. (6) these are called free or natural boundary conditions. we obtain (11) ci+1 = (bi+1 + bi )hi + ci (0 ≤ i ≤ n − 2). equations (2) say that yi = di for i with 0 ≤ i ≤ n − 1 Write dn = yn . In case i 6= 0. Alternatively. we will use di instead of yi . dn has not been defined. By (9) we have (10) ai = bi+1 − bi 3hi (0 ≤ i ≤ n − 2). and used to draw curves. (12) di+1 − di = ai h2i + bi hi + ci hi (0 ≤ i ≤ n − 1). and 2. The name spline is that of a flexible ruler forced to match a certain shape by clamping it at certain points. Write hi = xi+1 − xi (0 ≤ i ≤ n − 1). we can substitute (11) with i − 1 replacing i on the right-hand side to obtain di+1 − di = ai h2i + bi hi + (bi + bi−1 )hi−1 + ci−1 . (0 ≤ i ≤ n − 2). This can be used to determine the ai ’s once the bi ’s are known. From now on. It is not too hard to solve the above equations. Equations (3) and (4) can be rewritten as di+1 = ai h3i + bi h2i + ci hi + di (7) 3ai h2i (8) ci+1 = + 2bi hi + ci (9) bi+1 = 3ai hi + bi (0 ≤ i ≤ n − 1). The drawback of these conditions is that f ′ (x0 ) and f ′ (xn ) may not be known. The equation for i = n − 1 in (7) corresponds to equation (3). (0 ≤ i ≤ n − 2). In view of equation (1). Substituting this into (8). if one wants to approximate a function f with f (xi ) = yi . 1. hi (1 ≤ i ≤ n − 1). so far.

Once bi are known. struct rows *allocrows(int n). double d. but. }.0 ? (x) : (-(x))) struct rows{ double a. . The header file spline. This can also be written as (13) bi−1 hi−1 + 2bi (hi + hi−1 ) + bi+1 hi = 3 di − di−1 di+1 − di −3 hi hi−1 (1 ≤ i ≤ n − 2). as we will see soon. (10) will be extended to i = n − 1. This gives n − 2 equations for bi .h is given as 1 2 3 4 5 6 7 8 9 10 11 12 13 14 #include <stdio. 0 ≤ i ≤ n − 1. Two more equations can be obtained from the boundary conditions (5) or (6).h> #define absval(x) ((x) >= 0.h> #include <stdlib. Substituting into this the value for ai and ai−1 given by (10) and multiplying both sides by 3. the free boundary conditions given by equation (6) gives b0 = 0 and 3an−1 hn−1 + bn−1 = 0. Thus equation (13) for i with 1 ≤ i ≤ n − 1 plus the equations b0 = bn = 0 allow us to determine the values of bi .h> #include <math. we obtain di+1 − di di − di−1 − = ai h2i − ai−1 h2i−1 + bi (hi + hi−1 ) hi hi−1 (1 ≤ i ≤ n − 1). By extending equation (10) to i = n − 1. double c. Note that the matrix of the system (13) of equations is row-diagonally dominant. For example. equation (13) will also be extended to i = n − 1.168 Introduction to Numerical Analysis Subtracting the last two equations. this amounts to extending (9) to i = n − 1. Introducing the new variable bn and setting bn = 0. double b. and so Gauss-Seidel iteration can be used to solve these equations. we cannot allow i = n − 1 here since (10) is not valid for i = n − 1. A program implementing cubic spline interpolation (using free boundary conditions) is described next. we have 3 di+1 − di di − di−1 −3 = (bi+1 − bi )hi − (bi − bi−1 )hi−1 + 3bi (hi + hi−1 ) hi hi−1 (1 ≤ i ≤ n − 2). Hence we can also extend (10) to i = n − 1: an−1 = bn − bn−1 3hn and bn = 0. the other coefficients can be determined with the aid of equations (10) and (12). and so this equation will also be extended to i = n − 1.

double evalspline(double xbar.i++) x[i]=0.h" 2 3 int gauss_seidel_tridiag(struct rows *eqn. b=(double *) malloc((size_t)((n+1)*sizeof(double))). float tol.. a pointer to the first of these structures will be returned. if (!b) { printf("Allocation failure 1 in vector\n"). The memory allocation is handled by the functions in the file alloc. exit(1). } return b. int *converged). 7 double newx. int maxits. } return b. In lines 7–12. float tol. double *x. This function is identical to the one used in Gaussian elimination.c: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 #include "spline. Lines 14–21 give the declarations of the functions occurring in the program. double *f.37. int n. In line 7 it is tested if the allocation was successful. double *x. Cubic splines 15 16 17 18 19 20 21 169 double *allocvector(int n).c: 1 #include "spline. struct rows *s. 4 int n. int maxits. int n). int *success). Line 5 defines the absolute value function |x|. In line 11. } The function allocrows in lines 3–12 allocates memory for a structure of type rows defined in the file spline. the function allocvector is defined. } double *allocvector(int n) { double *b. The main functions are defined in the file spline. that is. the structure rows is defined to hold the coefficients of the cubic polynomials constituting the splines on the one-hand. The parameter n indicates that a block of n + 1 structures (indexed from 0 to n) will be reserved. . b=(struct rows *) malloc((size_t)((n+1)*sizeof(struct rows))). struct rows *s. int *converged) 5 { 6 int itcount. This reserves memory for a block of n + 1 reals of the type double.h. int free_spline(double *x. int gauss_seidel_tridiag(struct rows *eqn. int maxits. it has nonzero elements only in the main diagonal and the two adjacent diagonals. if (!b) { printf("Allocation failure 1 in rows\n"). and it will hold the coefficients and the right-hand side of the coefficient matrix in equations (1)3 on the other hand. In lines 14–23.i<=n+1. i. exit(1).h" struct rows *allocrows(int n) { struct rows *b. double *x. The coefficient matrix is a tridiagonal matrix . float tol. 8 for (i=0. int n.

a=(bvec[i+1]-bvec[i])/(3. } } return itcount. prevdeltaf=deltaf. int *success) { int i. struct rows *eqn.i++) { newx = (eqn[i]. s[i]. } double evalspline(double xbar.d . eqn[i]. struct rows *s.i++) { hi=x[i+1]-x[i].c=(f[i+1]-f[i])/hi-(bvec[i+1]+2. eqn[i]. hiprev. int maxits. struct rows *s. float tol. } int free_spline(double *x. } noofits = gauss_seidel_tridiag(eqn.170 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 Introduction to Numerical Analysis *converged=0. i<n && x[i]<xbar. success).i<n.a=hiprev.c * x[i+1])/ eqn[i]. double *bvec.b=bvec[i]. eqn=allocrows(n-1). x[i]=newx. s[i]. if ( *success ) { for (i=0. eqn[i].i<=n. int n) { int i. . double dx. hi. itcount++) { *converged=1.*hi).d=f[i]. deltaf. itcount<maxits && !(*converged).d=3. for (i=1. double *x.*bvec[i])*hi/3. double *f.i<n. s[i]. bvec. tol.c=hi. for (i=1. for (itcount=0. n-1.b=2.a * x[i-1] . noofits.b. int n. s[i]. deltaf=f[i+1]-f[i]. } } free(bvec). prevdeltaf=f[1]-f[0]. free(eqn).eqn[i]. if ( absval(newx-x[i])>tol ) *converged=0.i++) { hi=x[i+1]-x[i]. bvec=allocvector(n). i++) .*(hiprev+hi). eqn[i]. prevdeltaf. maxits. hiprev=x[1]-x[0]. hiprev=hi.. dx=xbar-x[--i]. for (i=1.eqn[i].*(deltaf/hi-prevdeltaf/hiprev). return noofits.

eqn[i]. and in lines 40–45 it calculates the coefficients of the polynomials (1) in lines 39–46. the integer maxits giving the maximum permissible number of iterations. the value of i is searched for which the point x ¯ (or xbar is located in the interval [xi . 11 x=allocvector(n). eqn[i]. 0 if not). it is assumed that x0 = xn+1 = 0.h" 2 3 main() 4 { 5 const float tol=5e-12. the integer maxits to store the maximum permissible number of iterations. 6 const int maxits=100.d. and the integer *success indicating whether the determination of the polynomials (1) was successful. memory for the structure holding the coefficients of the equation is reserved. the number of iterations itcount is returned. 8 double *x.37. 7 int i. In line 11. the structure *s to store the coefficients of the polynomials (1). . and the integer *converged indicating whether the calculation was successful (1 is successful. the tolerance tol indicating the precision required for the solution. the coefficients bi ). The memory is freed in lines 48–49. success. 13 s=allocrows(n-1). . expxbar.d + dx * (s[i]. the vector *x to store the solution of the equations. This variable gets its values when it calls gauss_seidel_tridiag on line 37. the function values yi at these points stored in the vector *f.k. xi+1 ). The function returns the actual number of iterations performed. This corresponds to the situation that in equations (13) with bi the unknowns we have b0 = bn = 0 (the n in these equation will correspond to n − 1 in the program).c + dx * (s[i]. y. 9 struct rows *s. theerror. The main program calling these programs in contained in the file main. and indicates the number of iterations used in solving the system of equations.n. 61 } In linew 3–17.b + dx * s[i]. In the equation. On line 19. This function has the parameters xbar. but in line 15 it is set to be 0 unless the solution x[i] converges for every value of i (1 ≤ i ≤ n). Cubic splines 171 60 return s[i]. The function gauss_seidel_tridiag is called with parameters *eqs describing the coefficient matrix and the right-hand side of the equation as just described. On line 50 the number of iterations noofit is returned. the integer *converged is set to be 1. 14 printf("Approximating sin x by free cubic spline " 15 "interpolation using %d points\n". n+1). *f. and the integer n. xbar. the structure *s holding the coefficients of the polynomials (1). . the integer n (the number of interpolation points minus 1). xn are stored. the vector where the points x0 .a.a)). to indicate whether this function was successful in solving the system of equations. In lines 26. itcount. aii . and ai i+1 are represented by the structure elements eqn[i]. the double *x of interpolation points.c. it calls the function gauss_seidel_tridiag on line 37 to solve these equations. . The coefficients ai−1 i . In lines 53–61 the function evalspline is defined. The function free_spline in lines 22–51 calculates the coefficient matrix for the equations (1)3 in lines 30–36. 12 f=allocvector(n). Gauss-Seidel iteration for a tridiagonal matrix is implemented. so that the value of the spline can be returned on line 60. . The function has parameters *x. the value where the spline is to be evaluated. and the right-hand side is represented by the structure element eqn[i].b. In line 58. and in line 27 memory is reserved for the vector bvec to hold the solution of equations (1)3 (that is. 10 n=100.c: 1 #include "spline. This variable obtains its value from the function call of gauss_seidel_tridiag in line 37.

tol.70 2. The tolerance tol used is specified on line 5 to be 5·10−12 . and in lines 30–37 the spline is evaluated at 12 points.13f\n".s.55+(double) i.10 2.20 0.60 2.60 0. y. In lines 43–45 the reserved memory is freed.\n").30 2.40 0. s.30 1. y=evalspline(xbar. xbar.60 1.70 1.. and on line 13. } } else printf("The coefficients of the cubic spline " "count not be calculated\n" "Probably more iterations are needed.172 Introduction to Numerical Analysis 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 } for (i=0. &success).90 1. theerror=expxbar-y. f. 10]. itcount). xi ]. maxits.90 3.80 0. expxbar. The printout of this program is as follows: 1 Approximating sin x by free cubic spline interpolation using 101 points 2 0.10 1. for (i=-1.13f %18. printf("%7.i<=n. If the evaluation of the coefficients of the spline polynomial is unsuccessful. free(x). if ( success ) { printf("The number of Gauss-Seidel iterations " "is %u\n\n". In lines 11–13 memory is allocated for the vectors *x holding the interpolation points and *f holding the function values at the interpolation points. itcount=free_spline(x.40 2.00 5 1.00 3 0. f[i] = sin(x[i]).i++) { xbar=0. if ( i%5==0 ) printf("\n"). the rest of the evaluations take place inside. printf("%7.80 2.13f %18.80 1.4f %18. memory is allocated for the structure *s containing the coefficients of the spline polynomials.i++) { x[i] = (double) i/10. n.20 1.00 . expxbar=sin(xbar). In line 23.40 1. Two evaluations take place outside the interval of interpolation.i<=n/10.50 4 0.70 0. the function free_spline is called to find the coefficients of the spline polynomials.30 0. a message to this effect is printed in lines 40–41. } printf("\n\n"). theerror). at the middle of certain partition intervals [xi−1 .20 2.00 7 2.2f ".50 8 2. In lines 17 and 18 x[i] and f[i] are allocated. the maximum number of allowed iterations will be 100. free(f).x.n). printf(" x sin x " " spline(x) error\n\n"). The function sin x will be interpolated at 100 equidistant points in the interval [0.10 0.x[i]).50 6 1.90 2. free(s).

90 5.60 3.40 7.70 4.9541520171807 0.5500 -0.0000002003416 -0.3971481672860 34 4.5500 -0.30 10 3.80 19 8.70 7.60 4.50 10.00 8.80 23 24 25 The number of Gauss-Seidel 26 27 x sin x 28 29 -0.20 9.5500 0.40 9.60 9.30 20 8.10 3.00 4.9868438585032 35 5.5500 -0.50 6.20 3.10 7. however.4348249101747 0. where m > n.5500 0.4500 -0.1248950371168 40 10.2636601135395 0.70 9.7674011568675 39 9.6692398572763 36 6.10 9.60 8.5500 0.9868436008696 -0.0000002576336 -0.1815860484840 error -0.90 9.60 7.10 4. in view of measurement errors.0000001455935 -0. that is.80 11 4.90 8.10 6.90 6.5576835717979 -0.50 8.5500 -0.60 6.30 12 4.00 7.20 6.2636601823728 37 7.90 7. and for these reason not all equations can be satisfied.9023633099588 3.60 5.0000000688333 0.5576837173914 33 3. we have more equations than unknown.0000001747173 0.6692396825590 0.00 iterations is 22 spline(x) -0.5226870924736 0.0000002490989 0.5500 0.00 5.30 16 6.30 18 7.40 5.5500 -0.00 6.30 14 5. This situation can occur when we make more measurements to determine the quantities xi than absolutely necessary.20 4.30 22 9.5500 0. all these equations would be satisfied.50 9.0000001036828 -0. .9997837641894 32 2.2792227385251 38.9541522662795 38 8.40 6.0000013167352 0.9997835031776 0.70 5.38.40 8.50 4.50 5.7674009565259 -0.40 3.40 4.70 8. In such a situation one looks for the most reasonable solution of these equations.80 17 7.10 5.90 173 3.80 21 9. If the measurements were exact. One can write this system of equations in matrix form as Ax = b.00 9.70 3. Overdetermined systems of linear equations 9 3.70 6.5500 0.5226872289307 31 1.80 15 6.1248937203815 -1.0000002610118 0. the determination of the coefficients has errors.80 13 5.20 7.50 7.20 5.4349655341112 30 0.20 8.0001406239365 0.0000001364570 0.90 4.10 8. OVERDETERMINED SYSTEMS OF LINEAR EQUATIONS Assume we are given a system of linear equations n X j=1 aij xj = bi (1 ≤ i ≤ m).3971480636032 -0.

where A = (a_ij) is an m × n matrix, x = (x_1, . . . , x_n)^T is an n-dimensional column vector, and b = (b_1, . . . , b_m)^T is an m-dimensional column vector, with T in superscript indicating transpose. While this equation is unsolvable, one may look for a vector x for which the norm of the vector b − Ax is the smallest possible. One can take various vector norms; in the present case, we will discuss the case of the L2 norm.132 The square of the L2 norm of the vector b − Ax can be written as

    M = Σ_{i=1}^{m} ( b_i − Σ_{j=1}^{n} a_ij x_j )^2 .

The minimum of this will occur when the partial derivatives of M with respect to each of the quantities x_k (1 ≤ k ≤ n) are zero:

    ∂M/∂x_k = −2 Σ_{i=1}^{m} a_ik ( b_i − Σ_{j=1}^{n} a_ij x_j ) = 0    (1 ≤ k ≤ n).

In matrix form, this equation can be written as −2(A^T b − A^T Ax) = 0, or else as A^T Ax = A^T b. These represent n equations for n unknowns, and in principle they are solvable for x. The problem is that the matrix A^T A is usually ill-conditioned. Therefore, we will seek another way to solve the problem of minimizing the L2 norm of the vector b − Ax.

An n × n matrix Q is called orthogonal if Q^T Q = I, where I is the identity matrix of the same size.133 In this case, Q^T is the inverse of the matrix Q, and so we also have QQ^T = I. If Q_1 and Q_2 are orthogonal matrices then Q_1 Q_2 is also orthogonal. Indeed, since (Q_1 Q_2)^T = Q_2^T Q_1^T, we have

    (Q_1 Q_2)^T Q_1 Q_2 = Q_2^T Q_1^T Q_1 Q_2 = Q_2^T I Q_2 = Q_2^T Q_2 = I.

If Q is an orthogonal n × n matrix, then for any n-dimensional vector x we have ||Qx||_2 = ||x||_2. Indeed,

    ||Qx||_2^2 = (Qx)^T (Qx) = x^T Q^T Q x = x^T x = ||x||_2^2 .

In other words, multiplication on the left by Q is an isometry.134

Given an n-dimensional column vector v such that v^T v = 1, the matrix H = H_v = I − 2vv^T is symmetric and orthogonal. Indeed,

    H^T = I^T − 2(vv^T)^T = I − 2(v^T)^T v^T = I − 2vv^T = H,

and

    H^T H = HH = (I − 2vv^T)^2 = I − 2 · 2vv^T + 4vv^T vv^T = I − 4vv^T + 4v(v^T v)v^T = I − 4vv^T + 4vv^T = I.

132 The L2 norm ||v||_2 of the vector v = (v_1, . . . , v_n)^T is the square root of Σ_{i=1}^{n} v_i^2.
133 The reason for calling these matrices orthogonal is that for any two distinct column vectors x and y of such a matrix we have x^T y = 0; such vectors are called perpendicular, or orthogonal. The word orthogonal literally means having a right angle; the word perpendicular is rarely used for vectors of dimension greater than three.
134 An isometry is a transformation of a space that leaves the distance between two points unchanged; the distance between the points represented by the position vectors x and y (i.e., vectors with initial points at the origin and endpoints at the given points) is ||y − x||_2.
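Before turning to the orthogonalization approach, note that the quantity M above is easy to evaluate for any candidate x. The following fragment is an illustrative sketch only (the function name residual2 is ours, not taken from the book's files); it uses the same 1-based array conventions as the book's programs.

/* Squared L2 norm of the residual b - Ax for an m x n system;
   a, b, x are 1-based as in the book's programs.  Sketch only. */
double residual2(double **a, double *b, double *x, int m, int n)
{
    int i, j;
    double s, sum = 0.0;
    for (i = 1; i <= m; i++) {
        s = b[i];
        for (j = 1; j <= n; j++)
            s -= a[i][j] * x[j];    /* s = b_i - (Ax)_i */
        sum += s * s;
    }
    return sum;
}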
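The matrix H = I − 2vv^T itself never needs to be formed: to apply it to a vector x it is enough to know v, since Hx = x − 2(v^T x)v. The following small program is a hedged sketch of this observation (the function name house_apply is ours); it applies the reflection in O(n) operations instead of the O(n^2) that an explicit matrix-vector multiplication would take.

#include <stdio.h>
#include <math.h>

/* Apply H = I - 2 v v^T to x in place; v is assumed to have unit
   L2 norm.  Arrays are 1-based up to n, as in the book's programs.
   Illustrative sketch only. */
void house_apply(double *v, double *x, int n)
{
    int i;
    double vtx = 0.0;
    for (i = 1; i <= n; i++) vtx += v[i] * x[i];        /* v^T x          */
    for (i = 1; i <= n; i++) x[i] -= 2.0 * vtx * v[i];  /* x - 2(v^T x) v */
}

int main(void)
{
    /* with v = (1, -1)^T / sqrt(2), H swaps the two coordinates */
    double v[3] = {0.0, 1.0 / sqrt(2.0), -1.0 / sqrt(2.0)};
    double x[3] = {0.0, 3.0, 4.0};
    house_apply(v, x, 2);
    printf("%g %g\n", x[1], x[2]);   /* prints 4 3 */
    return 0;
}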

In finding the solution x of the equation Ax = b for which the square kb − Axk2 of the error is minimal. The matrix H is called a Householder transformation. we have to solve the equation Rx = c. we have xT y = (xT y)T = yT x. A2 = Hv2 A1 . the right-hand side equals x − (x − y) = y. in order to minimize the norm of r. For the vector r = b − Ax we then have Pr =       c − Rx R c . Hence. . It will be seen that the process will not break down provided the columns of the starting matrix A are linearly independent. and the fourth equation holds since vT v = 1. xT x = kxk2 = kyk2 = yT y in view of our assumption. Write P b = (cT . Hv1 . Since the product of orthogonal matrices is orthogonal.38. and. there is a Householder transformation that maps x to y and y to x. . . we will form the matrices A1 = Hv1 A. showing that Hv x = y. In the matrix Ak = (aij )1≤i≤m the elements below the main diagonal in the 1≤j≤n first k column will be 0. . noting that xT y is a scalar. and so it equals its own transpose. and we will write P = Hvn Hvn−1 . Overdetermined systems of linear equations 175 the parenthesis in the third term of the fourth member can be placed anywhere since multiplication of matrices (and vectors) is associative. x= − d 0 d We have krk2 = kP rk2 since P is an orthogonal matrix. The orthogonal matrix P will be found as the product of Householder matrices Hv of size m × m. as we wanted to show. Indeed. Starting with the matrix A. . krk22 = d 2 This will be minimal when c − Rx = 0. Thus. Indeed. kx − yk2 then Hv maps x to y. dT )T . T T T x x−x y−y x+y y x x − xT y − y T x + y T y Hv x =  In the denominator on the right-hand side. where c is an n-dimensional column vector and d is an m − n-dimensional def column vector. hence the denominator on the right equals 2xxT − 2yT x. Thus. (k) P will be orthogonal.    (x − y)(xT − yT ) (x − y)T x−y x x= I −2 · I −2 kx − yk2 kx − yk2 (x − y)T (x − y)   (x − y)(xT x − yT x) (x − y)(xT − yT ) x=x−2 T = I −2 T . This assumption will also guarantee that the 135 Such a solution is called the least square solution of the system of equations. for the square of krk2 we have   c − Rx 2 = kc − Rxk22 + kdk22 . 0T )T such that R is an n × n upper triangular matrix and 0 is an n × (m − n) matrix with all its entries 0. . An = Hvn An−1 .135 we will find an orthogonal matrix P of size m × m for which P A has form (RT . if one takes v= x−y . . Given two vectors x and y with kxk2 = kyk2 and x 6= y.

and that of C is (m − k + 1) × (n − k + 1) . can be written as « „ U B . since a zero linear combination of the columns of this matrix would be mapped by P −1 to a zero linear combination of the columns of the matrix A = P −1 (P A). I here is the (k − 1) × (k − 1) unit matrix. To see this. We will choose the vector vk in such a way that the first k − 1 components of vk will be zero. Thus the columns of R are linearly dependent. Hence its columns are also linearly dependent. 137 Thus. This is possible. and H is an (m − k + 1) × (m − k + 1) matrix. the Householder matrix Hvk = I − 2vk vkT will have zeros in the first k − 1 rows and columns except for ones in the main diagonal. that of B is (k − 1) × (n − k + 1). Here I denotes an identity matrix. we look for a Householder transformation that maps the first column x = a1 of the matrix A to the m-dimensional vector y = (±||a1 ||2 . i. that of 0 is (m − k + 1) × (k − 1). = = Ak = Hvk Ak−1 = 0 HC 0U + H0 0B + HC 0 C 0 H here the right-hand side is partitioned into k − 1 and m − k + 1 rows and k − 1 and m − k + 1 columns. so that the first components of x and y get added in absolute value when forming the vector v1 = x−y . then the rows of the block formed by the intersection of the first k rows and first k columns of R are linearly dependent. the first k − 1 columns of the matrix Ak−1 will not change. . partitioned into k − 1 and m − k + 1 rows and k − 1 and n − k + 1 columns.176 Introduction to Numerical Analysis diagonal elements of the triangular matrix R in the upper square of the matrix P A are nonzero. the size of U is (k − 1) × (k − 1). the columns of the matrix P A cannot be linearly dependent. P being an orthogonal matrix. This shows that the first k − 1 columns of Ak−1 and Ak are indeed identical. That is.137 The matrix Ak−1 . amk x = ak 136 This latter assertion is easy to see. . . kx − yk2 for the Householder transformation Hv1 . « « „ « „ «„ „ U B IU + 00 IB + 0C U B I 0 . H is actually a Householder matrix.138 This equation expresses the fact that the the elements in the first k − 1 columns of Ak−1 under the main diagonal are zero. Ak−1 = 0 C where U is an upper diagonal square matrix. . each of the appropriate size. To make sure that in Ak the kth column also has zeros below the main diagonal. On the other hand. Hvk = 0 H where the matrix is partitioned into k − 1 and m − k + 1 rows and k − 1 and m − k + 1 columns.136 Starting with the matrix A. as claimed. 0. since the entries in these columns outside the block just described are all zero. Thus. since the two vectors have the same length. . but this fact will not be used directly. since the kth row of this block consists entirely of zeros. and the two 0’s indicate zero matrices. . Now. the matrix Ak will have zeros in the first k − 1 columns below the main diagonal. and 0 is a zero matrix. 138 Thus. the 0 below I is the (m − k + 1) × (k − 1) zero matrix. 0)T . Assume that we have formed the matrix Ak−1 in such a way that all elements in the first k − 1 columns below the main diagonal of Ak−1 are zero. since the same was true for Ak−1 . choose a Householder transformation that maps the kth column T  (k−1) (k−1) (k−1) = a1k . in the product Ak = Hvk Ak−1 . Hence. The sign ± is chosen to be the opposite of that of a11 . since these columns outside R contain only zero entries. the 0 to the right is the (k − 1) × (m − k + 1) zero matrix.. each of the appropriate size. Hence the first k columns of the whole matrix R are linearly dependent. 
If the kth diagonal element of the triangular matrix R is zero. with I denoting the m × m identity matrix. note that the verbal description just given for Hvk means that this matrix can be written in block matrix form as « „ I 0 . it is invertible.e. . and so the equation Rx = c is solvable. In this case. . assuming that the columns of A are linearly independent. therefore the columns of P A are also linearly dependent. the same way as Ak−1 was partitioned.

so as to maximize the norm of the vector v.c: 1 #include "household. this can only happen if the columns of the starting matrix A are linearly dependent. since Ak−1 = P ′ A for some orthogonal matrix (see the footnote above). when forming the difference v T um X u (k−1) (k−1) (k−1) (k−1) v ¯ = x − y = 0.int m.  i=k where the zeros occupy the last m − k positions (i. .double **p.h> #include <stdlib. int n) 4 /* Allocate matrix. . . ±t (aik )2 .h> #define absval(x) ((x) >= 0. The memory allocation functions are described in the file alloc. amk  . double *b. giving a new way of triangularizing a matrix without row interchanges. The following code segment allocates 5 a contiguous block of pointers for the whole matrix 6 of size m times n. if we (k−1) have aik = 0 for i ≥ k. ak+1 k . The ± sign is chosen to be the opposite of the sign of (k−1) akk . The function declarations in lines 8–12 will be discussed below. k¯ vk2 This process will break down only if the norm in the denominator here is zero. . Then we form the Householder transformation Hvk with the vector vk = 1 v ¯. and k vectors of k − 1 dimension must be linearly dependent.h" 2 3 double **allocmatrix(int m. . Overdetermined systems of linear equations to (k) y = ak 177 v T um X u (k−1) (k−1) (k−1) = a1k .h is used: 1 2 3 4 5 6 7 8 9 10 11 12 #include <stdio. Thus this algorithm of triangularizing the top square of A will not fail if the columns of A are linearly independent. 0. These two vectors having the same L2 norms. all the positions corresponding to places below the main diagonal of the kth column. .int n). .38. .  i=k quantities of the same sign are added in the kth component. and on line 6 the square is defined. . double *allocvector(int n). double *x. . int n). .. double **allocmatrix(int m. int n). int n). This means that the first k columns of the matrix Ak−1 are linearly dependent. In programming the method. . 0 . ak−1 k . 0 . . */ . void triangsolve(double **a. On line 5 the absolute value. However.h> #include <math. so that. akk ∓ t (aik )2 .0 ? (x) : (-(x))) #define square(x) ((x)*(x)) int household(double **a. double *b. Thus the columns of the matrix Ak−1 are linearly dependent. The method will work even if A is a square matrix. int m. void pb(double **p. that is. the method is much more expensive in calculation than Gaussian elimination. since these columns have nonzero elements only in the first k − 1 places. the header file household. . This gives a new way of solving the equation Ax = b even for square matrices.e. there is a Householder transformation mapping x to y. .

if (!a[1]) { printf("Allocation failure 2 in matrix\n"). each row pointer is incremented by n + 1. exit(1). norm.k. if (!b) { printf("Allocation failure 1 in vector\n"). /* Save the beginning of the matrix in a[0]. if (!a) { printf("Allocation failure 1 in matrix\n").l. return a. } double *allocvector(int n) { double *b. double **a. m + 1 pointers are allocated. the values of a[i] for 1<=i<=n may change: */ a[0]=a[1].178 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 Introduction to Numerical Analysis { int i. and in line 16. . and on line 22. int i. b=(double *) malloc((size_t)((n+1)*sizeof(double))).success. c=allocmatrix(m. } /* Allocate the whole matrix: */ a[1]=(double *) malloc((size_t)(m*(n+1)*sizeof(double))). } The function allocmatrix in lines 3–38 differs from the earlier function allocmatrix discussed on account of Gaussian elimination in that the matrix allocated here has size m × n. the size of the allocated block for the matrix is m(n + 1) (no memory needs to be allocated to the pointed a[0]. i++) a[i]=a[i-1]+n+1. i<=m. on line 10.n). double **c. } return b. success=1. } /* Initialize the pointers pointing to each row: */ for (i=2. This changes very little.c contains the main functions implementing the method: 1 2 3 4 5 6 7 8 9 10 #include "household. this pointer only needed for freeing the memory.h" int household(double **a. The function allocvector in lines 30–39 is identical to the function by the same name discussed on account of Gaussian elimination. The file household. the rows will be interchanged. a=(double **) malloc((size_t)((m+1)*sizeof(double*))).j. exit(1). exit(1).double **p.int n) /* The matrices are size m times n */ { const double near_zero=1e-20.int m.

j<=n. 48 } 49 50 void pb(double **p.i<=m. 30 } 31 for (i=k+1.at the price of some loss of 15 transparency in the method.i<=m. 32 norm=0.i<=m. 39 for (l=k.i++) b[i]=c[i].j.i<=m. 26 p[k][k] += a[k][k]. It 17 stores 0’s in a part of the matrix p that is 18 supposed to be zero.i<=m. int m.l<=m..i<k. 25 if ( a[k][k]<0. but since these entries 19 of p are never used.l<=m.i++) 37 for (j=k.l++) c[i] -= 2. 33 for (i=k..i<=m. ) p[k][k]=-p[k][k]. 53 double *c.k<=n. 55 for(k=1. 58 for (l=k. 59 } 60 for (i=k.l. 46 free(c).j++) { 38 c[i][j]=a[i][j].j<=n.i++) 43 for (j=k.i<=m. int n) 51 { 52 int i.i++) { 57 c[i]=b[i].j++) a[i][j]=c[i][j]. 27 if (absval(p[k][k])<=near_zero) { 28 success=0.38. it is not necessary to 20 do this. 24 p[k][k]=sqrt(p[k][k]). 47 return success..k<=n. 44 } 45 free(c[0]).k. double *b. 61 } 179 . 35 for (i=k.i++) p[k][k] += square(a[i][k]).i++) p[i][k]=0.i++) p[i][k] /= norm. 54 c=allocvector(m). */ 22 p[k][k]=0. 34 norm=sqrt(norm). 23 for (i=k.k++) { 56 for (i=k. since it is superfluous.*p[i][k]*p[l][k]*b[l]. 41 } 42 for (i=k.*p[i][k]*p[l][k]*a[l][j]. 36 for (i=k. Overdetermined systems of linear equations 11 for(k=1.i++) p[i][k]=a[i][k].l++) 40 c[i][j] -= 2.i<=m. 21 for (i=1. The next line is 16 commented out.i++) norm += square(p[i][k]). 29 break.k++) { 12 /* The entries of the matrix p could be stored in 13 the entries of the matrix a except for its main 14 diagonal -.

backward substitution. double *x. The function returns 1 is the calculation is successful. but arranging this is considerably more complicated than in the case of Gaussian elimination. the columns of the matrix **p contain one more nonzero element that there are available spaces. 1 #include "household. the calculation is then abandoned on line 29. besides. In line 44. and in lines 42–43. The loop in lines 11–44 deals with the kth column of the matrix **a. Finally the vector is divided by its norm in line 35 to obtain the vector vk . The function triangsolve in lines 65–74 solves a system of linear equation whose left-hand side is represented by a triangular matrix. *b contains the right-hand side of the equation Ax = b when the function is called. The next parameter **p contains the column vectors vk 1 ≤ k ≤ n for the Householder transformations Hvk . and this memory freed in line 62. so it has no effect. the result is placed in the matrix **c. x[i] /= a[i][i].i--) { x[i]=b[i]. zeros are placed in the first k places of the column vector p[. m and n. the first parameter of this function. the last two parameters m and n indicate the size of the matrix. The calculation in lines 55–61 parallels those in lines 35–44. contain the size of the matrix. the second parameter is the column vector *b. the column vector v¯ is calculated in the kth column of the matrix **p. Before calculating the norm. double *b. the norm of this vector will also be too close to 0. .][k]. was described on account of Gaussian elimination.180 Introduction to Numerical Analysis 62 63 64 65 66 67 68 69 70 71 72 73 74 free(c). a matrix **c is allocated. When calling this function. x[n]= b[n]/a[n][n]. and in lines 31–34 the norm of this vector is calculated.k. In this case.j<=n. but note that this line is commented out.l.j. or perhaps it could be useful in a modification of the method. } } The function household in lines 3–63 implements the Householder transformation of the matrix **a.j++) x[i] -= a[i][j]*x[j]. it might be possible to store the matrix **p in the part of the matrix **a that will become zero.139 The last two parameters. } void triangsolve(double **a. in line 27 it is tested whether p[k][k] is too close to zero (the constant near_zero is set to 10−20 . The matrix **c is freed in lines 45–46. In lines 22–31. It perhaps makes the program easier to read by a human reader. In line 21. the variable success will be set to 0 (false) (it was set to 1 on line 10). for (j=i+1. The first 139 At some loss of transparency. and the integer success is returned on line 47. The function pb in lines 50–63 has the matrix **p as its first parameter. to hold a temporary copy of the matrix **a while doing the matrix calculations. for (i=n-1. memory for a vector *c is allocated for temporary storage. otherwise it returns 0. In line 9. so the matrix **a would need to be enlarged by one row to make this possible. The calculation of the product Hvk Ak−1 is performed in lines 36–41.h" 2 3 main() 4 { 5 /* This program reads in the coefficient matrix 6 of a system of linear equations. if it is. the matrix **p is expected to contain the Householder transformations in its columns representing the matrix P (P is the product of these Householder transformations). the method. int n) { int i.i>=1. and it will contain the vector P b when the function returns. this result is copied into the matrix **a.

} if ( j<=n ) { a[i][j]=strtod(s. and second entry is n.NULL). fscanf(coefffile. The second entry must be an unsigned integer.i. printf("i=%u j=%u\n". the other entries can be integers or reals.b. there must be m(n+1) of these. } else printf("The solution could not be determined\n"). else { p=allocmatrix(m.n).x.i<=m. j).n). &n). m. i. The rest of the entries are the coefficients and the right-hand sides.j<=n+1. readerror=0. for (i=1.b. the number of equations. s)==EOF) { readerror=1. printf("Not enough coefficients\n"). for (i=1. **p. for (j=1.n).x[i]). } } fclose(coefffile).NULL).m. "%u". success. success=household(a. j. if ( success ) { x=allocvector(n). triangsolve(a.n). /* Allocate right-hand side */ b=allocvector(m). "%s". FILE *coefffile. free(p[0]). 181 . } else b[i]=strtod(s. free(x). */ double **a. pb(p. "%u". *b. &m). the number of unknowns. "r"). Overdetermined systems of linear equations 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 entry is m.m. int i. free(p). fscanf(coefffile. if ( readerror ) printf("Not enough data\n"). n.n). /* Create pointers to point to the rows of the matrix: */ a=allocmatrix(m. coefffile=fopen("coeffs".38.p.i<=n. char s[25]. printf("The solution is:\n"). *x.i++) { if ( readerror ) break. break.12f\n".j++) { if (fscanf(coefffile.i++) printf("x[%3i]=%16.

2f\n".j++) printf("%5. the vector *x to contain the solution of the equations is allocated memory space (this is freed on line 52). the coefficients are read. In lines 60–65 the entries of the transformed matrix **a are printed out. these lines are commented out. the rest of the entries contain the coefficients.677022884362 3]= -0. The matrix **p is freed in lines 56–57. The program was run with the following input file coeffs: 1 2 3 4 5 6 7 8 9 10 11 12 13 12 2 1 3 9 -1 -4 -1 -9 -1 8 9 -6 10 3 4 -1 -2 2 -5 -4 -4 -2 -7 -6 -4 5 -2 2 -1 -1 2 -2 -3 9 7 8 -2 -2 5 -1 3 1 3 -1 4 2 -1 3 1 3 4 -1 9 8 -7 0 3 1 2 2 8 3 -2 -1 2 3 1 1 -2 -8 -5 -4 2 -4 1 2 3 -1 2 2 -2 2 3 6 -3 -2 1 2 3 4 1 3 2 -1 -1 1 4 -3 -4 2 3 -1 -1 -4 -3 -8 -3 4 8 -4 3 1 -2 4 9 -3 5 8 -1 3 6 -3 2 -1 8 -3 -4 3 2 5 7 9 5 It is important to keep in mind that the first column represent line numbers and is not part of the file. they can again be useful for debugging. If there was a reading error (because the end of file was reached before the expected number of data were read. In line 43. and the equations are solved on line 48. however.708141957416 2]= -0. In line 20 the file coeff is opened. then the function household is invoked to calculate the entries of the matrix **p and triangularize the top square of the matrix **a. no calculation is done. free(a). for (i=1.i++) { for (j=1. The matrix **a and the vector *b are freed in lines 67–69. the right-hand side of the equation is calculated on line 47. lines 42–58 are executed. so nothing is actually done here. The first two entries of the file contain the size of the coefficient matrix. the matrix **p is allocated to contain the vectors describing the Householder transformations. free(b). } */ free(a[0]). The program produced the following printout: 1 2 3 4 5 6 The x[ x[ x[ x[ x[ solution is: 1]= 0.611332408213 5]= -0. In lines 25–39. Otherwise. If this function was successful. and rather than deleting them. In line 46.508849337986 . and then the solutions are printed out.i<=m. they are only commented out. These lines were useful for debugging. then the coefficients are read in.182 Introduction to Numerical Analysis 58 59 60 61 62 63 64 65 66 67 68 69 70 } } /* printf("\n").j<=n. and the file is closed on line 40.948640836111 4]= 0.b[i]). printf(" %5.a[i][j]). then lines 45–43 are executed. In a later modification of the program.2f ".

and the equation p(λ) = 0 is called its characteristic equation. an n-dimensional nonzero column vector v is called an eigenvector of A if there is a real number λ such that Av = λv. Thus an n × n matrix has n eigenvalues when these eigenvalues are counted with their multiplicities. . The characteristic equation has n solutions (which are usually complex numbers) when these solutions are counted with their multiplicities. the above equation can be written as (A − λI)v = 0.e. i=1 140 The German word eigen means something like owned by self. . the determinant formed by these column vectors p(λ) = det(A − λI) is zero. If the eigenvalues λ1 . Writing I for the n × n identity matrix. . Thus. i. in any case. Given an n × n matrix A. There have been attempts to translate the term eigenvalue into English as characteristic value or proper value. The multiplicity of a solution of the characteristic equation is also called the multiplicity of the same eigenvalue. The power method to find eigenvalues 7 8 9 10 11 183 x[ 6]= 0.870101220300 x[ 10]= -0. and vi denotes the ith component of v.471529531514 x[ 7]= -0. then this equation can also be written as n X i=1 vi (ai − λei ) = 0. Indeed. λ2 . This polynomial is called the characteristic polynomial of the matrix.173441867530 Problem 1.e. The number λ is called an eigenvalue 140 of A. If ai denotes the ith column vector of A. 141 In case λ is a multiple eigenvalue (i.325609870356 x[ 8]= -0. eigenvalue of a matrix would mean something like a special value owned by the matrix.39. The coefficients of this polynomial of course depend on the entries of the matrix A. . p(λ) is a polynomial of λ of degree n. the the mixed German-English term eigenvalue predominates in English (the German term is Eigenwert).863288481097 x[ 9]= 1. Describe what an orthogonal matrix is. λm (there may also be other eigenvalues) of an n × n matrix are distinct.. In other words. writing vi (1 ≤ i ≤ m) for the corresponding eigenvectors. the latter is almost never. ei denotes the ith column of the identity matrix I. 39.141 assume that we have m X αi vi = 0. the vectors ai − λei (1 ≤ i ≤ n) are linearly dependent. then the corresponding eigenvectors are linearly independent. .. THE POWER METHOD TO FIND EIGENVALUES Let n > 1 be an integer. The former expression is sometimes used. if it is a multiple root of the characteristic equation). then the eigenvector i vi may not be unique.

0 1 The characteristic polynomial of this matrix is . we may assume that αi 6= 0 for i with 1 ≤ i ≤ k and αi = 0 for k < i ≤ m.184 Introduction to Numerical Analysis where not all the numbers αi are zero. we obtain k X αi λi vi = 0. k X αi vi = 0. since λi 6= λk by our assumption. where k ≥ 1. This relation contradicts the minimality of k. Multiplying this equation by the matrix A on the left. That an n × n matrix does not need to have n linearly independent eigenvectors is shown by the matrix   1 1 . we obtain k−1 X j=1 (αi λk − αi λi )vi = 0. where 1 ≤ k ≤ n. since it would mean that v1 = 0. i=1 Multiplying the former linear relation by λk and subtracting from it the latter. whereas we assumed that an eigenvector cannot be the zero vector. k = 1 is not possible here. Assume that the αi ’s here are so chosen that the smallest possible number among them is not zero. i=1 Of course. None of the coefficients αi λk − αi λi (1 ≤ i ≤ k − 1) is zero. Let the number of nonzero coefficients among the αi be k. By rearranging the vectors and the corresponding eigenvalues. That is. and noting that Avi = λi vi .

.

.

1 − λ 1 .

.

.

. = (λ − 1)2 .

0 1 − λ.

and so its only eigenvalue. . When describing a column vector. one can save space by writing it as the transpose of a row vector.143 The companion matrix of a polynomial. 0)T is the only eigenvector of the above matrix. Let p(λ) = λn + n−1 X ak λk = λn + an−1 λn−1 + . + a1 λ + a0 k=0 142 We must have λ = 1 in this equation since 1 is the only eigenvalue of the above matrix. aside from its scalar multiples. (1. 143 vT . v2 = v2 . . The equation      v1 v1 1 1 =λ v2 0 1 v2 with λ = 1 can be written as142 v1 +v2 = v1 . That is. The only solution of these equations is v2 = 0 and v1 is arbitrary. so 1 is a double eigenvalue of this matrix. is the transpose of the vector v.

. The power method to find eigenvalues 185 be a polynomial..39. We are going to prove the above statement about the characteristic polynomial of the matrix A.. we have to show that the determinant .. 1 −an−1       is called the companion matrix of the polynomial p... . −a0 −a1 −a2 . .. . 0 . The n × n matrix 0 0 1 0  0 1 A= . . .. 0 .  . ... 0 0 0 .. 0 .. . The interesting point is that the characteristic polynomial of the matrix A is (−1)n p(λ). .. To do this. 0 0  0 .. The importance of the companion matrix is that any method to find eigenvalues of a matrix can be used as a method to find zeros of polynomials.

.

−λ 0 .

.

1 −λ .

1 det(A − λI) = .

.

0 . ..

. . . .

. .

0 0 0 0 −λ .. . ...... 0 . . ... . 1 −λ − an−1 . . . 0 0 0 . .. −a0 −a1 −a2 ....

.

.

.

.

.

.

.

.

.

156 in Section 34 on determinants). We obtain that this determinant equals . equals (−1)n p(λ). To this end we will expand this determinant by its first row (see the Theorem on p.

.

−λ .

.

1 .

.

0 −λ .

.

.. .

. .

.

0 .

.... .. . . 0 0 −λ 1 . 0 . .. .. −a1 −a2 −a3 ... . 0 . .. 0 0 0 . 0 0 0 . . 0 .... .. −λ .

.

.

1 −λ .

.

.

.

0 1 .

.

.

.

0 0 .

.

+ (−1)n+1 (−a0 ) .

. .. .

. ..

. .

.

.

0 0 .

−λ −an−2 .

.

.

0 0 .

1 . . .. 0 . 1 −λ − an−1 0 . ....... 0 0 0 .. 0 0 0 . .. 0 . .. . −λ .. .

.

.

.

.

.

.

. .

.

1 −λ .

.

0 1 .

Hence it follows that indeed144 det(A − λI) = −λ · (−1)n−1 p1 (λ) + (−1)n+1 (−a0 ) = (−1)n p(λ). + a2 λ + a1 . The power method. and all its eigenvalues are real. . where p1 (λ) = λn−1 + n−2 X ak+1 λk = λn−1 + an−1 λn−2 + . in 144 To do. so the value of this determinant is just the product of this diagonal elements. So we can make the induction hypothesis that this determinant equals (−1)n−1 p1 (λ). Another proof can be given by expanding the determinant of the matrix A − λI by its last column. but this is easy to . The first determinant is the of form det(A1 − λI) where the (n − 1) × (n − 1) matrix A1 is obtained from the matrix A by deleting the first row and the first column of A. We will assume that the eigenvectors of the matrix A span the vector space Rn . since it is the determinant of a triangular matrix. see Problem 8 below. . k=0 The second determinant is 1. complete the proof by induction. one needs to check that the statement is true for n = 1. Let A be an n × n matrix with real numbers as entries.

. our choice for x(0) is unlucky. we have x (0) = n X (0) αi vi i=1 (0) (0) for some numbers αi . 1. As the vectors vi were assumed to be linearly independent. Let x(0) be an arbitrary n-dimensional column vector. .186 Introduction to Numerical Analysis this case. 146 The method converges even if λ has multiplicity greater than 1. we assume that A has a real eigenvalue λ of multiplicity 1 such that for all other eigenvalues ρ we have |ρ| < |λ|. . As we assumed that the eigenvectors are linearly independent. µk+1 Thus (k+1) αi (k+1) α1 (k) = λi αi . the last displayed equation means that (k) λi αi (k+1) αi = for each i with 1 ≤ i ≤ n. vn . . and thus we will not be able to determine the eigenvalue λ1 .145 In an important case. Let vi . i. This step normalizes the vector x(k+1) . and assume that |λ1 | > |λi | for all i with 2 ≤ i ≤ n. That is. . λ1 α(k) 1 145 This is easy to see. We do not assume that the λi ’s for i with 2 ≤ i ≤ n are distinct. for lack of a better choice. when aij = aji . denoting by µk+1 the component of largest absolute value of the vector u(k+1) . we put def u(k+1) = Ax(k) = n X (k) λi αi vi i=1 and. that is. 147 If we perform the procedure described below in exact arithmetic. we may choose x(0) = (1. . that λ is real. . . one can determine the eigenvalue of the matrix by what is called the power method. . 1 ≤ i ≤ n be the corresponding eigenvectors. then the vectors obtained during the procedure will all be inside the subspace spanned by v2 . If x(k) has already been determined. already discussed above. 1)T . 1 ≤ i ≤ n. albeit slowly. the components of the eigenvectors are also real. We will also assume that the matrix A has a single eigenvalue of largest absolute value.e. In practice. . these coefficients must be real. roundoff errors will lead us out of the subspace spanned by v2 . We start with the arbitrarily chosen vector x(0) . Given that these column vectors are in a vector space over the real numbers. assume that the eigenvalues of A are λi . . . this is valid. vn . . and we must choose another x(k) = (1) n X (k) αi vi i=1 and u(k) as follows. however.. we put x(k+1) = n X (k+1) αi vi = i=1 1 µk+1 u(k+1) = n (k) X λi α i i=1 µk+1 vi . and so the procedure may still converge to the largest eigenvalue. it ensures that the component of largest absolute value of x(k+1) is 1. The components of the eigenvector v corresponding to the eigenvalue λ are the coefficients showing that the column vectors of the matrix A − λI are linearly dependent.146 If this is the case. namely when A = (aij ) is a symmetric matrix. If α1 starting value. To describe the method.147 We form the vectors = 0. We need to assume.

(k) (0) λ 1 α1 α1 We assumed that λ1 has absolute value larger than any other eigenvalue Thus. The file alloc. int *success). we obtain  k (0) (k) αi αi λi = . because with 7 pivoting.h> #include <stdlib. This allows us to (approximately) determine λ1 and a corresponding eigenvector. As the component of the largest absolute value of x(k) is 1. Hence. even 148 The limit z = lim (k) of a sequence of vectors can be defined componentwise. The power method to find eigenvalues 187 for each i with 1 ≤ i ≤ n. we use the header file power. double *x.0 ? (x) : (-(x))) double **allocmatrix(int n). So memory for rows 8 could be allowcated separately.h" 2 3 double **allocmatrix(int n) 4 /* Allocate matrix. In writing a program to calculate the eigenvalue of an n × n matrix using the power method. it follows that the limit def (k) α1 = lim α1 k→∞ exists. so. double *allocvector(int n). rows will be interchanged. This file defines the absolute value function on line 5. by saying that the k→∞ z limit of each component of the vector z(k) is the corresponding component of the z. float tol. . according to (1).39. and contains the declarations of the functions used in the program in lines 7–10. (0) (k) lim k→∞ αi (k) α1 = αi (0) α1 · lim k→∞  λi λ1 k =0 for each i with 2 ≤ i ≤ n. Thus. we have148 lim k→∞ 1 x(k) = (k) α1 n (k) X α i v (k) i i=1 α1 = v1 . The following code segment allocates 5 a contiguous block of pointers for the whole matrix. int n. that is. 6 There is no real advantage for this here.c is almost the same as the one that was used for Gaussian elimination: 1 #include "power. we are also saying that it is finite.149 and α1 6= 0. int maxits. for large n. by saying that the limit exists. we also say that the limit does not exist.h: 1 2 3 4 5 6 7 8 9 10 #include <stdio. int power(double **a. Using this equation repeatedly. 149 When a limit is infinite. Since any nonzero scalar multiple of the eigenvector v1 is an eigenvector. The case i = 1 is of course of no interest. this means that x(k) is nearly an eigenvector of A for the eigenvalue λ1 . double *lambda. On the other hand. x(k) ≈ α1 v1 .h> #include <math.h> #define absval(x) ((x) >= 0.

9 for (i=1. exit(1).c contains the function power performing the power method. j. newx.0. there does not seem to be any disadvantage in asking for a contiguous block. the only change is line one. */ { int i. } /* Allocate the whole matrix: */ a[1]=(double *) malloc((size_t)(n*(n+1)*sizeof(double))). } return b.i<=n.h.h" 2 3 int power(double **a. if (!b) { printf("Allocation failure 1 in vector\n"). exit(1). itcount. *u.188 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 Introduction to Numerical Analysis the largest matrix for which Gaussian elimination is feasible occupies a relatively modest amount of memory.itcount++) { 11 /* Form y=Ax abd record maximum element max */ . i<=n. /* Save the beginning of the matrix in a[0].i++) x[i]=1. if (!a) { printf("Allocation failure 1 in matrix\n").h rather than gauss. } In fact. if (!a[1]) { printf("Allocation failure 2 in matrix\n"). 7 double m. double *lambda. return a. which now includes the file power. int n. exit(1). the rows will be interchanged. int maxits. double **a. a=(double **) malloc((size_t)((n+1)*sizeof(double*))). max. int *success) 5 { 6 int i. 4 float tol. } /* Initialize the pointers pointing to each row: */ for (i=2.itcount<=maxits. 1 #include "power. } double *allocvector(int n) { double *b. 8 u=allocvector(n). i++) a[i]=a[i-1]+n+1. the values of a[i] for 1<=i<=n may change: */ a[0]=a[1]. 10 for (itcount=1. *y. The file power. double *x. b=(double *) malloc((size_t)((n+1)*sizeof(double))).

the integer n specifying the size of the matrix. *lambda is given its value as the largest component of u (it is easy to see that this is close to the eigenvalue. but on line 23 it will be made 0 (false) unless the ith component of the iteration converges. 18 } 19 /* Normalize u and test for convergence */ 20 *success=1. The power method to find eigenvalues 189 12 max=0. 33 } The function power introduced in lines 3–4 returns the number of iterations performed.. and the pointer to the integer success indicating whether the procedure converged.j<=n. the maximum number of iterations maxits allowed. 16 if ( absval(m)>absval(max) ) max=m. it will stay one on line 26 if the iteration converged for every value of i. 28 break. the vector u is calculated. the pointer *lambda to the location where the eigenvalue will be calculated. The vector x at this point will contain an approximation of the eigenvector. space is allocated for the vector u. The 9 rest of the entries are the elements of the 10 matrix. On line 31.c: 1 #include "power. and on line 32. If on line 26 it is found that the iteration converged. the tolerance tol. The second entry must be an unsigned 11 integer. 23 if ( absval(newx-x[i])>tol ) *success=0. The 7 first entry is the required tolerance. On line 8. the size of the matrix. .i<=n. 13 for (i=1. 29 } 30 } 31 free(u). Its parameters of include the pointer **a to the coefficients of the matrix A. 1. . 25 } 26 if ( *success ) { 27 *lambda=max. thus. the number of iterations is returned. . the space allocated for u is freed. *success is set to be 1 (true). On line 20.i++) { 22 newx=u[i]/max. 32 return itcount. 17 u[i]=m. and the loop is exited on line 28. .39.c is called in the file main. the other entries can be integers or . 1)T .j++) m += a[i][j]*x[j]. In line 9 the vector x is given the initial value (1.h" 2 3 main() 4 { 5 /* This program reads in the elements of a square 6 matrix the eigenvalue of which is sought. 21 for (i=1.. and in lines 21–25. In lines 13–18. the vector *x that will store the components of the eigenvector. 15 for (j=1.i<=n. The function power. since the largest component of the previous value of x was 1). the next value of the vector x will be calculated.i++) { 14 m=0. 24 x[i]=newx. and the 8 second entry is n. The iterations to find the new values of u and x are carried out in the loop in lines 10–30.

j<=n.n. and. 30 break. i. s)==EOF) { 27 readerror=1.190 Introduction to Numerical Analysis 12 reals.i. 37 else { 38 x=allocvector(n). 28 printf("Not enough coefficients\n"). &n). 45 for (i=1. the results of the calculation are printed out in line 40–47.i++) { 24 if ( readerror ) break. 22 a=allocmatrix(n). 31 } 32 a[i][j]=strtod(s. tol). On line 39. space is allocated for the vector *x. otherwise an error message is . The first entry of this file is the tolerance tol.x[i]). itcount. 19 fscanf(coefffile.&success). "%u". the power method is called. an error message is printed out on line 36. 16 char s[25]. After tol and n are read in and printed out in lines 19–21. 47 } 48 else 49 printf("The eigenvalue could not be determined\n").\n".12f\n". otherwise the calculation is continued in line 37. success. *x.100. and the rest of the entries can be real numbers such as float or double or integers).i<=n. 25 for (j=1. the first entry must be of type float. "r"). 21 fscanf(coefffile. 44 printf("The eigenvector is:\n"). 51 } 52 free(a[0]).NULL). space for the matrix A is allocated on line 22. &tol). 50 free(x). readerror=0. j. 15 int n. 14 float tol.i++) 46 printf("x[%3i]=%16. subs. */ 13 double **a. 17 FILE *coefffile. lambda). 33 } 34 } 35 fclose(coefffile).12f\n". 43 printf("The eigenvalue is %16. the second entry is the size n of the matrix (this must be an integer. 29 printf("i=%u j=%u\n".j++) { 26 if (fscanf(coefffile. "%f". The file coeffs is opened on line 18. this will contain the eigenvector of the matrix in the end. If there were not enough coefficients.&lambda. In lines 23–34. "%s". and on line 35 the file coeffs is closed. lambda.i<=n. On line 38.x. 36 if ( readerror ) printf("Not enough data\n"). i. j). 23 for (i=1. itcount).tol. 20 printf("Tolerance used: %g\n". free(a). 18 coefffile=fopen("coeffs". if successful. 53 } The matrix A is contained in the file coeffs. 39 itcount=power(a. 40 if ( success ) { 41 printf("%u iterations were used to find" 42 " the largest eigenvalue. the matrix is read in.

Solution. Find the eigenvalues of the matrix A.829693363275 x[ 2]= 1. The characteristic polynomial is . The following input file coeffs was used to run the program: 1 2 3 4 5 6 7 5e-14 5 1 2 3 5 2 3 4 -2 2 3 1 1 1 1 -1 1 2 1 -1 3 2 1 -1 1 2 The output obtained was as follows: 1 2 3 4 5 6 7 8 9 Tolerance used: 5e-14 28 iterations were used to find the largest eigenvalue.478388026222 x[ 5]= 0. The space reserved for the vector x is freed on line 50 (this must be freed inside the else statement of lines 37–51.253187383172 x[ 4]= 0.39. Question 2.310753728109 The eigenvector is: x[ 1]= 0. The power method to find eigenvalues 191 printed on line 49. The eigenvalue is 8. Question 1. where it was allocated). The space allocated for the matrix is freed on line 52.457090784061 Problems 1. Question 1. Find the characteristic polynomial of the matrix A=  3 1 4 3  .000000000000 x[ 3]= 0.

.

.

3 − λ 1 .

.

.

p(λ) = .

4 3 − λ. = (3 − λ)(3 − λ) − 1 · 4 = 9 − 6λ + λ2 − 4 = λ2 − 6λ + 5.

The eigenvalues are the solutions of the equation p(λ) = 0. Question 1. Find the characteristic polynomial of the matrix A=  2 2 6 3  . the characteristic polynomial is easy to factor: p(λ) = λ2 − 6λ + 5 = (λ − 1)(λ − 5). Question 2.e. Find the eigenvalues of the above matrix. i. .. 2. the solutions of the equation p(λ) = 0 are λ = 1 and λ = 5. the eigenvalues of A are 1 and 5. however. Thus. One can use the quadratic formula to solve this equation. Question 2.

The characteristic polynomial is . Question 1.192 Introduction to Numerical Analysis Solution.

.

.

2 − λ 2 .

.

.

p(λ) = .

6 3 − λ. = (2 − λ)(3 − λ) − 2 · 6 = 6 − 5λ + λ2 − 12 = λ2 − 5λ − 6.

however. 3. the characteristic polynomial is easy to factor: p(λ) = λ2 − 5λ − 6 = (λ + 1)(λ − 6). The eigenvalues are the solutions of the equation p(λ) = 0. The eigenvalues of a triangular matrix are the elements in the main diagonal.e. Thus. One can use the quadratic formula to solve this equation. Find the eigenvectors of the matrix  2 3 1 A = 0 1 4. 0 0 3  Solution. the solutions of the equation p(λ) = 0 are λ = −1 and λ = 6. the eigenvalues of A are −1 and 6. Thus. Question 2.. The easiest way to see this is to note that the determinant of a triangular matrix is the product of the elements in the main diagonal. the characteristic polynomial of the above matrix is . i.

.

.

2 − λ 3 1 .

.

.

p(λ) = .

.

0 1−λ 4 .

.

. = (2 − λ)(1 − λ)(3 − λ).

0 0 3 − λ.

. I. this equation becomes x2 = 2x2 . Substituting x2 = x3 = 0 into the first equation. i. 0)T ..e. the solutions of the equation p(λ) = 0. This is satisfied for any value of x1 . Substituting this into the second equation. 0. Ax = x. i. are 2. the eigenvalues. 0. .. That is. and Ax = 3x. 0)T = x1 (1.e. x3 x3 0 0 3 This equation can be written as 2x1 + 3x2 + x3 = 2x1 x2 + 4x3 = 2x2 3x3 = 2x3 The third equation here can be satisfied only with x3 = 0. The equation Ax = 2x can be written as      x1 x1 2 3 1  0 1 4   x2  = 2  x2  . The eigenvectors are the solutions of the equations Ax = λx with these values of λ. an eigenvector corresponding to the eigenvalue 2 is x = (1. 0)T . 1. the eigenvector x corresponding to the eigenvalue 2 is x = (x1 .e. 3. this equation becomes 2x1 = 2x1 . 0. Thus. the solutions of the equations Ax = 2x. This can also be satisfied only with x2 = 0.

0. 0)T . corresponding to the eigenvalues 2. As for the eigenvector corresponding to the eigenvalue 1. (−3. 1)T . i. 0)T .. the equation Ax = x can be written as 2x1 + 3x2 + x3 = x1 x2 + 4x3 = x2 3x3 = x3 The third equation here can only be satisfied with x3 = 0. i. 3. however. any scalar multiple of this is also an eigenvector.. since A is a triangular matrix. That is. The power method to find eigenvalues 193 As we just saw. Substituting x3 = 0 and x2 = 1 into the first equation. Substituting this into the first equation. x2 . Find all eigenvectors of the matrix 1 0 A= 0 0  1 1 0 0 0 1 1 0  0 0 . (7. x3 . and the only element in the main diagonal is 1.e. i. We might as well choose x2 = 1. we obtain 2x1 + 3 · 2 + 1 = 3x1 . The second equation then becomes x2 + 4 = 3x2 . 0)T . this equation becomes x2 = x2 .. in turn. 1)T . the equation Ax = 3x can be written as 2x1 + 3x2 + x3 = 3x1 x2 + 4x3 = 3x2 3x3 = 3x3 The third equation here is satisfied for any value of x3 . 1.39. The equation Ax = x with x = (x1 . x4 )T can be written as      x1 1 1 0 0 x1  0 1 1 0   x2   x2  A=   =  . We might as well choose x3 = 1. this. 4. 2. since a scalar multiple of an eigenvector is always an eigenvector. x1 = 7. x1 = −3. is hardly worth pointing out. Substituting this into the second equation. this equation becomes 2x1 + 3 = x1 . As for the eigenvector corresponding to the eigenvalue 3.e. 1. The only eigenvalue of A is 1. 2. x2 = 2. since any other choice of x3 would only give a scalar multiple of the eigenvector we are going to obtain. the eigenvector corresponding to the eigenvalue 1 is x = (−3. 1 1 Solution. any other choice would only give a scalar multiple of the eigenvector we are going to obtain. This gives the eigenvector x = (7. the eigenvectors are (1. x3 x3 0 0 1 1 x4 x4 0 0 0 1 This equation can be written as the following system of equations: x1 +x2 = x1 x2 +x3 = x2 x3 +x4 = x3 x4 = x4 . Thus.e. 1.

The result is true even if A is not invertible. 6. In the first equation we can take any value of x1 . We have u(1) = Ax(0) = (3. That is. Taking the limit when η → 0. 7. we can see that det(AB − λI) = det(BA − λI). the matrix A − ηI will be invertible if η 6= 0 is close enough to 0. we have det(AB − λI) = det(A(B − λA−1 )) = det(A) det(B − λA−1 ) = det(B − λA−1 ) det(A) = det((B − λA−1 )A) = det(BA − λI). the second equation implies x3 = 0. aside from its scalar multiples. the matrix A − ηI is invertible except for finitely many values of η (namely. do one step of the power method to determine its eigenvalues. 17)T . The fourth equation is always satisfied. 0. and det(C) denotes the determinant of the matrix C. and the third equation implies x4 = 0. using the starting vector x(0) = (1. Given the matrix A=  2 1 8 9  .1 . Solution. We must have x1 6= 0. since an eigenvector cannot be the zero vector. So. showing that the characteristic polynomials of AB and BA are the same. and x(1) = 1 (1) u = 17   3 . 0)T is the only eigenvector. In fact. For such an η. 0. Note. since an eigenvector is only determined up to a scalar multiple. Explain what it means for a column vector v to be an eigenvector of a square matrix (of the same size). 5. The actual values of the eigenvalues are 10 and 1. we have det((A − ηI)B − λI) = det(B(A − ηI) − λI). 17 This gives the first approximation 17 (the component of u(1) with the largest absolute value) to the largest eigenvalue. 0. and B is an arbitrary n × n matrix. where λ is the unknown.194 Introduction to Numerical Analysis The first equation here implies x2 = 0. Noting that we have det(CD) = det(C) · det(D) for any two n × n matrices C and D. 0)T . and I is the n × n identity matrix. Thus we have x = x1 (1. We might as well take x1 = 1 here. 0. even if A is not invertible. (1. Show that the matrices AB and BA have the same eigenvalues. The eigenvalues of AB are the zeroes of its characteristic polynomial det(AB − λI). . 1)T . Assume A is an invertible n × n matrix. Solution. it is invertible exactly when η is not an eigenvalue of A).

we have µ = (λ − s)−1 . + a1 λ + a0 k=0 is (−1)n p(λ) by expanding the determinant A − λI by its last column. The inverse power method 195 8. For this discussion to make sense in case k = 0 (when P is an empty matrix. 0 is k ×(n−k −1) matrix with all its elements 0. λn . . (λn − s)−1 . then this vector is also an eigenvector of (A − sI)−1 with eigenvalue (λ − s)−1 . then µ−1 + s is an eigenvalue of A. To see this. Show that the characteristic polynomial of the companion matrix A of the polynomial p(λ) = λn + n−1 X ak λk = λn + an−1 λn−1 + . 152 One can reverse the above argument to show that if µ is an eigenvalue of the matrix (A − sI)−1 .150 If λ is an eigenvalue of A corresponding to the eigenvector v. . Indeed. . then the eigenvalues of (A − sI)−1 will be (λ1 − s)−1 . we obtain the matrix   P 0 C= . we write u(k+1) = (A − sI)−1 x(k) . That is. Q is an (n − k − 1) × (n − k − 1) upper triangular matrix with all its diagonal elements equal to 1. we have (A − sI)−1 v = (λ − s)−1 v. we need to assume that s itself is not an eigenvalue of A. as in the power method. we obtain the equation above (after reversing the sides). this argument.151 If the eigenvalues of A are λ1 . .152 Thus. and det C = det P · det Q. (λ − s)−1 will be the eigenvalue of (A − sI)−1 with the largest absolute value. THE INVERSE POWER METHOD The inverse power method is the power method applied with the matrix (A − sI)−1 for some number s. and 0T is its transpose. the power method will be well-suited to determine this eigenvalue of (A − sI)−1 . 1)T . . 150 There are variants of the inverse power method for which this statement is not quite accurate. 1. if λ1 is the closest eigenvalue of A to s. . this will guarantee that the matrix A − sI is invertible. Writing λ = µ−1 + s. By crossing out the row and the column of the element −ak (or −λ − an−1 in case k = n − 1) in the last column of the matrix A − λI. .40. unless a better choice is available. note that (A − sI)v = Av − sIv = λv − sv = (λ − s)v. . . we may put x(0) = (1. 151 In . one starts with an arbitrary vector x(0) . every eigenvalue of (A − sI) is of form (λ − s)−1 for some eigenvalue λ of A. As before. Therefore. assuming that λ1 is real. . . . It is easy to show by the Leibniz formula we used to define determinants (see p. Having determined x(k) . . Hint. That is. det Q = 1. Multiplying this equation by (λ − s)−1 (A − sI)−1 on the left. or a 0 × 0 matrix) or k = n − 1 (when Q is an empty matrix). take the determinant of an empty matrix to be 1. 153 in Section 34) that det P = (−λ)k . 0T Q where P is a k × k lower triangular matrix with all its diagonal elements equal to −λ. . 40.

one does not need to determine the inverse of the matrix A − sI. 6 There is no real advantage for this here. When using the inverse power method. µk+1 where µk+1 is the component of the largest absolute value of the vector u(k+1) . int *success). int n. int *success). 1 #include "inv_power. 16 a=(double **) malloc((size_t)((n+1)*sizeof(double*))). one finds the LU-factorization of the matrix A − sI. because with 7 pivoting. 20 } .h> #include <stdlib. int maxits. double *allocvector(int n). float tol. int n. and then finds u(k+1) = (A − sI)x(k) by solving the equation (A − sI)u(k+1) = x(k) . 19 exit(1). 15 double **a. On the other hand. So memory for rows 8 could be allowcated separately.h 1 2 3 4 5 6 7 8 9 10 11 12 13 #include <stdio. 17 if (!a) { 18 printf("Allocation failure 1 in matrix\n"). Instead. float s. void LUfactor(double **a. double *b). rows will be interchanged. int col.h> #define absval(x) ((x) >= 0. double *lambda. double *x. The LU-factorization need to be calculated only once.h> #include <math. The following code segment allocates 5 a contiguous block of pointers for the whole matrix. int n).h" 2 3 double **allocmatrix(int n) 4 /* Allocate matrix. int inv_power(double **a. even 9 the largest matrix for which Gaussian elimination 10 is feasible occupies a relatively modest amount of 11 memory. and then can be used repeatedly for solving this equation with different k. The file alloc. In a program implementing the inverse power method. we use the header file inv_power.196 Introduction to Numerical Analysis and x(k+1) = 1 u(k+1) . void LUsolve(double **a. there does not seem to be any disadvantage 12 in asking for a contiguous block. Line 5 defines the function absval to take the absolute value. */ 13 { 14 int i. int n. void swap_pivot_row(double **a.0 ? (x) : (-(x))) double **allocmatrix(int n).c handling memory allocation is virtually identical to the one used for Gaussian elimination.

if (!b) { printf("Allocation failure 1 in vector\n"). exit(1).h are needed in the file alloc. } } void LUfactor(double **a. a[col]=a[pivi]. *rowptr. and then the rows a[col] and a[pivi] are interchanged.i<=n.c The file lufactor. } } if ( pivi !=col ) { rowptr=a[col]. int i. return a. } return b.pivi. the rows will be interchanged. i<=n. The inverse power method 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 197 /* Allocate the whole matrix: */ a[1]=(double *) malloc((size_t)(n*(n+1)*sizeof(double))). /* Save the beginning of the matrix in a[0]. int n) /* Scans the elements a[col][i] for col<=i<=n to find the element c[col][pivi] with the largest absolute value. maxel=absval(a[col][col]). a[pivi]=rowptr.40.h" void swap_pivot_row(double **a. int col. int n. for (i=col+1. */ { double maxel. exit(1).h is included on line 1 instead of gauss. i++) a[i]=a[i-1]+n+1. since none of the functions declared in the file inv_power.i++) { if ( absval(a[i][col])>maxel ) { maxel=absval(a[i][col]). } /* Initialize the pointers pointing to each row: */ for (i=2. int *success) /* Uses partial pivoting */ . b=(double *) malloc((size_t)((n+1)*sizeof(double))). } The only difference is that the file inv_power. if (!a[1]) { printf("Allocation failure 2 in matrix\n"). an inessential change (since the latter would also work.h. the values of a[i] for 1<=i<=n may change: */ a[0]=a[1]. } double *allocvector(int n) { double *b.c implements LU-factorization: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 #include "inv_power. pivi=col. pivi=i.

mult.j<i. } } This file is again similar to the one used for Gaussian elimination. 33 for (i=j+1. i++) { 34 mult = a[i][j] /= pivot. j. The quantity 1+(a[i]-a[1])/(n+1) gives the original row index of what ended up as row i after the interchanges. double *b) { int i. /* We want to solve P^-1 LUx=b.i<=n.k++) a[i][k]-=mult*a[j][k]. We will put x in the same place as y: x[i]=a[i][0]. Next we calculate y=Ux=L^-1 c. not used in representing the matrix. j.h.i<=n. n). */ for (i=2. 26 int i.j++) a[i][0] -= a[i][j]*a[j][0]. 29 if ( absval(pivot)<=assumedzero ) { 30 *success=0.h However. and the program would work even if the file gauss.153 The file lusolve. */ a[n][0] /= a[n][n]. that is.j++) a[i][0] -= a[i][j]*a[j][0]. a[i][0] /= a[i][i]. First we calculate c=Pb.h" void LUsolve(double **a.h instead of gauss. In fact.k<=n.c.i++) for (j=1. since iterative im153 In fact. i<=n. y will be stored the same place c was stored: y[i] will be a[i][0]. we retained only the function LUsolve.i++) a[i][0]=b[1+(a[i]-a[0])/(n+1)]. the file lusolve uses no function defined outside the file. */ for (i=1. 25 double pivot. /* Finally. 27 for (j=1. The elements of c[i] will be stored in a[i][0]. for (i=n-1. .j++) { 28 swap_pivot_row(a. j. the file inv_power. return. 35 for (k=j+1. 31 } 32 else *success=1. /* Here is why we preserved the beginning of the matrix in a[0].j<n.h were included instead of inv_power. we deleted the functions backsubst and improve used for iterative improvement. int n.i--) { for (j=i+1.c implements the solution of a linear system of equation given the LU-factorization of the matrix of the equation: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 #include "inv_power.j<=n.i>=1. there is no need to change the file lusolve. 36 } 37 } 38 } This is identical to the file with the same name given on account of Gaussian elimination. pivot=a[j][j]. k. except that the first line now includes the file inv_power. we calculate x=U^-1 y.h is included on line 1 instead of gauss.198 Introduction to Numerical Analysis 23 { 24 const double assumedzero=1e-20. Again.h.

/* I am here */ u=allocvector(n).. The inverse power method 199 provement is not appropriate in the present case.i++) { u[i] = a[i][0]. itcount. LUfactor(a. for (i=1. newx.i++) a[i][i] -= s.c: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 #include "inv_power. or permissible error. . if ( absval(u[i])>absval(max) ) max=u[i]. } /* Normalize u and test for convergence */ *success=1. max. maxits is the maximum allowed number of iterations.i<=n. j./max. for (itcount=1. in the calculation. tol is the tolerance. int *success) { int i. } if ( *success ) { *lambda=s+1. } The function inv_power defined in lines 3–37 is very similar to the power function used in the power method.x). The parameter list is the same as the one used for the power method. the first parameter **a represents the entries of the matrix. if ( !*success ) return 0. That is.i++) x[i]=1. n is the size of the matrix.c discussed on account of the power method. *x will contain the eigenvector at the end of the calculation.n. It is useful to make a line-by-line comparison between this file and the file power. and *success indicates whether the method was successful. float s. it is much more efficient to improve the approximation to the eigenvalue and the eigenvector is to perform the inverse power method with a closed approximation s to the eigenvalue λ being calculated. if ( absval(newx-x[i])>tol ) *success=0. float tol.itcount<=maxits.i<=n. /* Form A-sI and place the result in A */ for (i=1. x[i]=newx.0.40.i<=n. In line 9. for (i=1. *y.i++) { newx=u[i]/max.n.i<=n.h" int inv_power(double **a. double *x. return itcount. *u.itcount++) { /* Solve (A-sI)y=x and place the result in u */ LUsolve(a. and in line 10 its LU-factorization 154 Whatever one may get out of iterative improvement. and stored in the same locations where A itself was stored. the matrix A − sI is calculated. double m. int n. double *lambda. break. max=0. int maxits.154 The inverse power method itself is implemented in the file inv_power. } } free(u). for (i=1.success).

successive values of the vectors u and x are calculated.c is as follows: 1 #include "inv_power. the eigenvalue of A is calculated. 21 FILE *coefffile. to try the inverse power method with a different value of s). On lines 17–22. and the rest of the entries are the entries of the matrix A. If the process is successful. success. on line 13. j. The first entry is an 8 approximate eigenvalue. again. note that LUsolve stores the solution of this equation in column zero of the array **a. and convergence is tested.) The first entry in this file is the value of s. µ (i. to be discussed soon.2 5e-14 5 1 2 3 5 2 3 4 -2 2 3 1 1 1 1 -1 1 2 1 -1 3 2 1 -1 1 2 (Recall that the numbers in the first column are line numbers. and the number of iterations is returned on line 35. For the convergence test. the memory allocated for the vector u is freed. using the equation λ=s+ 1 . and in lines 15–34. On line 35. *x. lambda. tol. 11 The rest of the entries are the elements of the 12 matrix. The calling program is contained in the file main. the other 14 entries can be integers or reals. In line 11. and it is changed to false on line 27 if the test on this line fails for any value of i. . it is stored in the same locations). In lines 25–29 the normalized vector u is stored in x. This program read the values of the entries of the matrix from the file coeffs 1 2 3 4 5 6 7 8 9. The inverse 15 power method is used to determine the 16 eigenvalue of the matrix */ 17 double **a. the second entry 9 entry is the required tolerance. 20 char ss[25]. and the 10 third entry is n. memory for the vector u is allocated.h" 2 3 main() 4 { 5 /* This program reads in the elements of a square 6 matrix from the file coeffs the eigenvalue of 7 which is sought. µ = (λ − s)−1 )m where µ is the eigenvalue calculated for the matrix (A − sI)−1 . the second one is the tolerance tol. perhaps. 18 float s. subs. the third one is the value of n..c. if LU-factorization was unsuccessful. The element of maximum absolute value is also calculated in these lines.e. i. there is nothing more to be done (except. on line 31. readerror=0. itcount. the vector u = (A − sI)−1 x is calculated. The first entry must be of type float. 13 the third entry an unsigned integer. and stored as the variable max. 19 int n. The vector x is initialized on line 14.200 Introduction to Numerical Analysis is calculated (and. and this solution is copied into the vector *x in lines 19–21. The file main. the integer *success is set to be 1 (true) on line 24. Then. the size of the matrix.

&lambda. &tol). 23 fscanf(coefffile.310753728109 The eigenvector is: x[ 1]= 0.tol. in the present file. 37 } 38 a[i][j]=strtod(ss. 56 free(x). 45 itcount=inv_power(a. "%s". &n). 26 printf("Tolerance used: %g\n".12f\n".i. 46 if ( success ) { 47 printf("%u iterations were used to find" 48 " the largest eigenvalue. ss)==EOF) { 33 readerror=1.c is nearly identical to the one used on account of the power method.&success).j<=n. The eigenvalue is 8.i<=n.j++) { 32 if (fscanf(coefffile. 28 a=allocmatrix(n). 31 for (j=1. 34 printf("Not enough coefficients\n").\n". 42 if ( readerror ) printf("Not enough data\n").12f\n". The inverse power method 201 22 coefffile=fopen("coeffs".829693363275 .c associated with the power method. the name of the character string declared on line 30 was changes to ss to avoid conflict (in the file main. 59 } The program main.i++) 52 printf("x[%3i]=%16. 29 for (i=1. 57 } 58 free(a[0]). 35 printf("i=%u j=%u\n".n. lambda). &s). this character string was called s). 25 fscanf(coefffile. tol). 53 } 54 else 55 printf("The eigenvalue could not be determined\n"). 39 } 40 } 41 fclose(coefffile). 50 printf("The eigenvector is:\n"). 43 else { 44 x=allocvector(n). 24 printf("Approximate eigenvalue used: %8. "%f".100. itcount). 27 fscanf(coefffile.x[i]). "%u".40. 51 for (i=1. On line 45 the function inv_power is called (rather than power power method earlier).5f\n". free(a). j). "r"). 36 break. the output of the program is the following: 1 2 3 4 5 6 Approximate eigenvalue used: 9.s. One difference is that in lines 23–24 the parameter s is read from the input file coeffs and then it is printed out.x.i++) { 30 if ( readerror ) break. 49 printf("The eigenvalue is %16. s). i.20000 Tolerance used: 5e-14 15 iterations were used to find the largest eigenvalue.NULL). With the above input file.i<=n. "%f".

 7 x[  2]=  1.000000000000
 8 x[  3]=  0.478388026222
 9 x[  4]=  0.253187383172
10 x[  5]=  0.457090784061

41. WIELANDT'S DEFLATION

In Wielandt's deflation, after determining the eigenvalue and the corresponding eigenvector of a square matrix, the order of the matrix is reduced, and then one can look for the remaining eigenvalues. To explain how this works, let A be a square matrix, and assume that the column vector x is an eigenvector of A with eigenvalue λ, that is, Ax = λx. One then looks for a column vector z such that z^T x = 1, and then forms the matrix

B = A − λxz^T.

One may want to keep in mind that if A is an n × n matrix and z, x are n × 1 column vectors, then z^T x is a number, and xz^T is an n × n matrix. One can make the following observations.

First, 0 is an eigenvalue of B, with eigenvector x. Indeed,

Bx = Ax − λ(xz^T)x = λx − λx(z^T x) = λx − λx = 0.

Second, if ρ ≠ 0 is an eigenvalue of A different from λ, then ρ is also an eigenvalue of B. Indeed, if Ay = ρy for some nonzero vector y, then we will show that Bw = ρw with

w = ρy − λ(z^T y)x.

Note that w ≠ 0. Indeed, as ρ ≠ λ, the eigenvectors y and x of A corresponding to the eigenvalues ρ and λ must be linearly independent, and w, being a nontrivial linear combination of them (the coefficient of y is ρ ≠ 0), cannot be the zero vector. We have

Bw = A(ρy − λ(z^T y)x) − λ(xz^T)(ρy − λ(z^T y)x)
   = ρ²y − λ²(z^T y)x − λρ(xz^T)y + λ²(z^T y)(xz^T)x
   = ρ²y − λ²(z^T y)x − λρ(xz^T)y + λ²(z^T y)x(z^T x)
   = ρ²y − λ²(z^T y)x − λρ(xz^T)y + λ²(z^T y)x
   = ρ²y − λρ(xz^T)y = ρ²y − λρx(z^T y) = ρw.

In the last term of the second member of these equations, the matrix xz^T and the scalar z^T y were interchanged. The associative rule for matrix products was used to obtain the last term of the third

member, and then the equality z^T x = 1 was used to obtain the fourth member of these equations. Thus, assuming that ρ ≠ λ, if we also have ρ ≠ 0, then the vector w = ρy − λ(z^T y)x is an eigenvector of B with eigenvalue ρ.155

One can also express y in terms of w. Indeed, since z^T x = 1, we have

z^T w = ρ(z^T y) − λ(z^T y)(z^T x) = ρ(z^T y) − λ(z^T y) = (ρ − λ)(z^T y).

Hence, assuming that ρ ≠ λ, we have z^T y = (1/(ρ − λ)) (z^T w). Therefore,

(1)  y = (1/ρ) (w + λ(z^T y)x) = (1/ρ) (w + (λ/(ρ − λ)) (z^T w) x).

This equation expresses the eigenvector y associated with the eigenvalue ρ of A in terms of the eigenvector w of B belonging to the same eigenvalue.

In one step of Wielandt deflation, the order of an n × n matrix A = (a_{ij}) is reduced by one. First, one finds a nonzero eigenvalue λ and the corresponding eigenvector x = (x_1, . . . , x_n)^T of A. Then one finds the component x_r of x of maximum absolute value; since x, being an eigenvector, cannot be a zero vector, this will guarantee that x_r ≠ 0. Then one picks

z = (1/(λ x_r)) (a_{r1}, a_{r2}, . . . , a_{rn})^T.

Then z^T x is the rth component of the vector

(1/(λ x_r)) Ax = (1/(λ x_r)) λx = (1/x_r) x.

The second equation holds because x is an eigenvector of A with eigenvalue λ. The rth component of this vector is (1/x_r) x_r = 1. That is, the equation z^T x = 1 is satisfied. The entry b_{ij} of the matrix B = A − λxz^T can be expressed as

(2)  b_{ij} = a_{ij} − λ x_i · (1/(λ x_r)) a_{rj} = a_{ij} − (1/x_r) x_i a_{rj}.

It can be seen that all elements in the rth row are zero:

b_{rj} = a_{rj} − (1/x_r) x_r a_{rj} = 0.

Now, we are looking for another eigenvalue ρ ≠ 0 of the matrix A with eigenvector y. Write w = (w_1, . . . , w_n)^T. The rth component of the vector ρw = Bw is

ρ w_r = Σ_{j=1}^{n} b_{rj} w_j = 0,

155 The assumption ρ ≠ λ was only used to show that x and y were linearly independent, and this was needed to ensure that w ≠ 0; the argument works even if ρ = λ ≠ 0 and x and y are linearly independent. That is, if λ is a multiple eigenvalue with at least two linearly independent eigenvectors, then λ will still be an eigenvalue of B.

after calculating the eigenvalues by deflation.h> #include <math. having found one eigenvalue λ and a corresponding eigenvector x of the matrix A. int n. we will have ρ¯ 6= ρ. double *lambda. float tol. float s. In the computer implementation of Wielandt’s deflation. int *r. int *success). Since the inverse power method also calculates the eigenvector of the original matrix. int n. Having found w′ . float tol. one should use the inverse power method with the original matrix to improve the accuracy of the solution. int k). it follows that wr = 0. int n. and this matrix will not be singular (so the inverse can be calculated). double *x. Finally. one can deflate the matrix. int maxits). bij wj = ρwi = j=1 j=1 j6=r where the second equality holds because wr = 0. one can calculate the corresponding eigenvector of y of A by using equation (1).204 Introduction to Numerical Analysis where the last equation holds because brj = 0 for each j. double *allocvector(int n). double *x. for the sake of simplifying the discussion. and can find w by restoring the deleted zero component. one will have B ′ w′ = ρw′ . there is little reason to keep track of eigenvectors during deflation. /* int inv_power(double **a. */ int power_maxel(double **a. Thus. int maxits. The fact that the matrix A − ρ¯I is nearly singular (and so its inverse cannot be calculated very accurately) will not adversely affect the calculation of the eigenvalue. As ρ 6= 0. This is because the eigenvalues of the deflated matrix should only be considered approximations of those of the original matrix A. if ρ¯ is the calculated approximation of the eigenvalue ρ of A. then one would use the inverse power method to calculate the largest eigenvalue of the matrix (A − ρ¯I)−1 . double *x. Using equation (1) to find the corresponding eigenvector is. One can use the inverse power method with the original matrix to obtain better approximations of the eigenvalues. In this way. int wielandt_defl(double **a. double *x. one can calculate the eigenvalues and eigenvectors of B ′ to find ρ and w′ . Because of roundoff errors. and look for the remaining eigenvalues of a lower order matrix. that is. we omitted this step.h> #define absval(x) ((x) >= 0. void wielandt(double **a.h> #include <stdlib. Hence. the header file wielandt. Therefore. since successive deflations introduce roundoff errors.h" 2 3 int power_maxel(double **a. float tol. int n. int maxits. if one deletes the rth row and rth column of the matrix B to obtain B ′ .0 ? (x) : (-(x))) double **allocmatrix(int n). one can reduce the order of the matrix A by one. since it is not used in the present implementation.h is used: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 #include <stdio. however. double *x. double *lambda. the equation ρw = Bw can be written componentwise as n n X X bij wj . Indeed. int *success). int n. . probably not worth doing. Here the function inv_power is commented out on lines 9–10. The file power_maxel. double *lambda. As mentioned above. double *lambda. one can find all the eigenvalues one by one. and if one deletes the rth component of the vector w to obtain w′ . It follows that the rth column of the matrix B plays does not enter into the calculation of the eigenvalue ρ and the corresponding eigenvector.c contains a slightly modified version of the power method: 1 #include "wielandt.

max. 19 if ( absval(m)>absval(max) ) max=m. 33 we know that x[r] for this r has value 1. 31 /* In principle. 46 } 47 break. 45 max=x[i]. so it 34 is easier to look for x[r] with the largest value.j++) m += a[i][j]*x[j].0.i++) { 17 m=0.. 26 if ( absval(newx-x[i])>tol ) *success=0. 39 but in practice it might happen because of rounding 40 errors. 13 for (itcount=1. 11 u=allocvector(n).. */ 41 *r=1. 20 u[i]=m. *u. 48 } 49 } 50 free(u). newx. *y. 37 it might happen that we find another component 38 with x[r]==-1. 16 for (i=1.i++) x[i]=1. 42 for (i=1. In lines 41–46. float tol. 51 return itcount. Wielandt’s deflation 4 int *r. 18 for (j=1. int *success) 5 /* In this variation of the power method.itcount<=maxits. the subscript r of the component xr of the largest absolute value of the . If we were 36 looking for x[r] with the largest absolute value. 21 } 22 /* Normalize u and test for convergence */ 23 *success=1. the subscript 6 of the maximal element of the eigenvector is returned 7 as r */ 8 { 9 int i. j.i<=n. int maxits.itcount++) { 14 /* Form y=Ax and record maximum element max */ 15 max=0. 35 The calling program will assume x[r]==1. max=x[1].j<=n.i++) 43 if ( x[i]>max) { 44 *r=i.41. However. 24 for (i=1. 27 x[i]=newx. we would want to find the value of r 32 for which x[r] has the largest absolute value.i<=n. 12 for (i=1. 52 } 205 The function power_maxel has an additional parameter *r compared to the function power discussed earlier. 28 } 29 if ( *success ) { 30 *lambda=max.i<=n. 10 double m.i++) { 25 newx=u[i]/max. this should not happen. theoretically. itcount.i<=n.

float tol. Note that the eigenvector *x was obtained by the power method (through the function call on line 24 below). which allows us make this assumption in the calling program. The deflation itself is performed by the functions in the file wielandt. Aside from this.m--) { eigencount=n-m. formula (2) is used to calculate the components of the matrix B and stored at the same location where the matrix A was stored. success. As explained in the comment in lines 31–40. lambda[n]=a[1][1]. This allows us to be certain that x[x]==1.j++) a[i][j] -= x[i]*a[r][j]. On line 11. itcount.c: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 #include "wielandt.m.m>1.m. in actual fact. } return eigencount.h" /* It is important to preserve the original row pointers for **a so the matrix could be freed */ void wielandt(double **a.j<n.i<n.r).j++) a[i][j]=a[i][j+1]. } int wielandt_defl(double **a. the component of the largest value is calculated instead. for (i=1. if ( !success ) break.i<=n. int n. return.x. int r) { int i. double *x. the function power_maxel is identical to the function power discussed earlier. the vector *x containing the eigenvector found by the calling program.x. this is why there is no need . tol. double *lambda. eigencount+1). j. The value *r of this subscript is returned to the calling program.i++) for (j=r. for (i=1. } The function wielandt in lines 6–16 has as parameters the matrix **a. wielandt(a.&lambda[eigencount+1]. the size n of the matrix. int maxits) { int m. r. } if ( success && m==1 ) { eigencount=n.j<=n. printf("%3u iterations used for eigenvalue number %u\n". as was pointed out in the theoretical discussion above. in effect.i++) if ( i!=r ) for (j=1. for (m=n. for (i=r.i<n. and the subscript r of the row and column to be deleted.i++) a[i]=a[i+1]. this value of r is used in the deflation. itcount. the power method ensures that the component x[r] of the largest absolute value of x is 1. eigencount. on account of the power method.206 Introduction to Numerical Analysis eigenvector x is calculated. double *x.maxits.&success).&r. itcount=power_maxel(a. int n.

the rth row and rth column of the matrix B is deleted. Wielandt’s deflation 207 to divide by xr on line 11. the 10 second entry is n. the vector *lambda that will contain the eigenvalues being calculated. Its parameters are the matrix **a.h" 2 3 main() 4 { 5 /* This program reads in the elements of in the 6 upper triangle of a square matrix int the file 7 named symm_coeffs. The remaining parameters are the size n of the matrix A. alternatively. since it is guaranteed that all eigenvalues of a symmetric matrix is real. and the value of m will be decreased by 1 with each deflation step. . the entries in the upper triangle of the matrix are given. as mentioned in the description of this function. this value could be returned as the component of a vector to the calling program. that on line 2 is the size of the matrix. With xr = −1. and fills in the lower triangle 8 to make the matrix symmetric. but no data will be returned in *x to the calling program. there is no need to store the other entries. the function power_maxel is called to calculate the largest eigenvalue of the current matrix. and the eigenvalue of this matrix is the only matrix element. The integer m will contain the current size of the deflated matrix. the size of the matrix. On line 35. to be used to store various vectors in the course of the calculation. the function wielandt is called to carry out one step of deflation. In lines 24–25. the calculation is abandoned on line 28 by breaking out of the loop. the calculation on line 11 would not work. The loop in lines 22–30 starts with m = n. This program was used with a symmetric matrix. The file symm_coeffs with the following content was used: 1 5e-14 2 5 3 1 2 3 5 8 4 3 -2 2 3 5 1 1 -1 6 -1 3 7 2 The first column of numbers are line numbers. the parameter r will contain the subscript of the largest component of the eigenvector x.41. in line 33 the last eigenvalue is calculated.156 In lines 13–15. even though without rounding errors we should have |x′r | < xr = 1. The calling program is given in the file main. and the calling program could decide whether to print out this value of to discard it.c we were looking for the largest x rather than the x with r r the largest absolute value in lines 41–43. a complex eigenvalue might interfere with the power method used during the deflation process. the vector *x that will contain the various eigenvectors calculated. If everything was successful. On line 26.c: 1 #include "wielandt. the number of iterations used to calculate the current eigenvalue is printed out. In lines 3–7. unlike in formula (2). and the maximum number of iterations maxit to be used in calculating each eigenvalue. since the matrix is assumed to be symmetric. The first entry in 9 the file is the required tolerance. r ′ would be returned instead of r. At this point. The 156 This was the reason that in the file power_maxel. The number on line 1 is the tolerance. the matrix has order 1 × 1. If the calculation of the eigenvalue was unsuccessful in lines 24–25. not part of the file. If there is an r ′ < r for which xr′ is rounded to −1. the number is returned of the eigenvalues successfully calculated. the tolerance (or permissible error) tol. The function wielandt described in lines 18–36 returns the number of the eigenvalues that were successfully calculated. and the body of the loop is executed again unless m==1. Otherwise. Its role is only to reserve memory only once.

i++) { 27 if ( readerror ) break. lambda[i]). */ 16 double **a. 17 float s. "%u". 18 int n. *lambda. 22 fscanf(coefffile. readerror=0. j).j++) { 29 if (fscanf(coefffile. 26 for (i=1. 39 if ( readerror ) printf("Not enough data\n"). i. 58 free(lambda).12f ". 40 else { 41 x=allocvector(n).n. success. It is:\n"). tol. 25 a=allocmatrix(n). 20 FILE *coefffile. 32 printf("i=%u j=%u\n". &tol).lambda. 21 coefffile=fopen("symm_coeffs". 59 free(x).NULL). They are:\n").150). 19 char ss[25].j<=n. ss)==EOF) { 30 readerror=1. The first entry 13 must be of type float. "%s". the other entries can be 15 integers or reals. subs. 42 lambda=allocvector(n). 48 else if ( eigencount<n ) 49 printf("Only %u eigenvalues could be calculated. 56 } 57 printf("\n"). i.i++) { 54 printf("%16. They are:\n". 60 } . 43 eigencount=wielandt_defl(a. 31 printf("Not enough coefficients\n").tol. 33 break. 44 if ( eigencount==0 ) 45 printf("No eigenvalues could be calculated\n"). 51 else 52 printf("All eigenvalues were calculated. 55 if ( i % 4 == 0 ) printf("\n"). 46 else if ( eigencount==1 ) 47 printf("Only one eigenvalue could be calculated. "%f". 34 } 35 a[j][i]=a[i][j]=strtod(ss. &n). tol). the third entry must be 14 an unsigned integer. *x. j.i<=eigencount. 50 eigencount). "r").i<=n.208 Introduction to Numerical Analysis 11 rest of the entries are the elements in the 12 upper triangle of the matrix. 24 fscanf(coefffile.x. 28 for (j=i. 36 } 37 } 38 fclose(coefffile). 23 printf("Tolerance used: %g\n". eigencount. 53 for (i=1.

memory for the vectors pointed to by x and lambda is reserved in lines 41–42, and this memory is freed in lines 58–59. Wielandt deflation is called on line 43, with at most 150 iterations permitted for calculating each eigenvalue. As in earlier programs, much of what is being done here is similar to the way earlier matrices have been read, except that only the upper triangle of the matrix is read; the missing elements of the matrix are filled in on line 35, so all coefficients of the matrix A are stored. The coefficients of the matrix are read in the loop in lines 26–37. Then, if there is no reading error (that is, if there were enough numbers in the input file), the calculation of the eigenvalues is performed in lines 40–60.

The purpose of choosing the component of the largest absolute value of the eigenvector is to make the roundoff error the smallest possible. The reason this results in the smallest roundoff error is that this is likely to make the entries of the subtracted matrix x · z^T the smallest possible (since we divide by x_r, the largest component of x, in calculating z). The larger the components of the matrix that we are subtracting from A, the more the original values of the entries of A will be perturbed by roundoff errors. This is especially important if one performs repeated deflations, since the roundoff errors then accumulate. When doing exact calculations with integers, this point may be moot, since in this case there are no roundoff errors.

The printout of the program was as follows:

1 Tolerance used: 5e-14
2  56 iterations used for eigenvalue number 1
3  48 iterations used for eigenvalue number 2
4 139 iterations used for eigenvalue number 3
5  20 iterations used for eigenvalue number 4
6 All eigenvalues were calculated. They are:
7  13.852544184194   4.624673383928  -7.583574215424  -0.256975565094
8  -3.363332212396

Problems

1. Consider the matrix

A = \begin{pmatrix} 14 & 8 & 10 \\ 8 & 22 & 14 \\ 4 & 8 & 8 \end{pmatrix}.

An eigenvalue of this matrix is 2 with eigenvector x = (−1, −1, 2)^T. Do one step of Wielandt deflation.

Solution. We have λ = 2; the component x_r with r = 3 has the largest absolute value among the components of the eigenvector x, i.e., x_r = x_3 = 2, and the rth, i.e., third, row of A is a_{3∗} = (4, 8, 8). Hence

z = (1/(λ x_3)) a_{3∗}^T = (1/(2 · 2)) (4, 8, 8)^T = (1, 2, 2)^T.

We have

B = A − λ · x · z^T = A − 2 · x · z^T
  = \begin{pmatrix} 14 & 8 & 10 \\ 8 & 22 & 14 \\ 4 & 8 & 8 \end{pmatrix} − \begin{pmatrix} −1 \\ −1 \\ 2 \end{pmatrix} (2\ \ 4\ \ 4)
  = \begin{pmatrix} 14 & 8 & 10 \\ 8 & 22 & 14 \\ 4 & 8 & 8 \end{pmatrix} − \begin{pmatrix} −2 & −4 & −4 \\ −2 & −4 & −4 \\ 4 & 8 & 8 \end{pmatrix} = \begin{pmatrix} 16 & 12 & 14 \\ 10 & 26 & 18 \\ 0 & 0 & 0 \end{pmatrix};

here the row vector (4, 8, 8) is the third row of the matrix A; the third row is used since the third component of the eigenvector x has the largest absolute value.
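One may also check this deflation step numerically. The following small sketch (a hypothetical illustration, not one of the program files belonging to these notes) forms B = A − λxz^T for the data of Problem 1 according to formula (2) and prints it, so that the zero third row can be verified:

#include <stdio.h>

/* One step of Wielandt deflation for the data of Problem 1:
   lambda = 2, x = (-1,-1,2)^T, and z built from row r = 3 of A. */
int main(void)
{
  double a[3][3] = {{14., 8., 10.}, {8., 22., 14.}, {4., 8., 8.}};
  double x[3] = {-1., -1., 2.}, z[3], lambda = 2.;
  int i, j, r = 2;                       /* r = 2 indexes the third row */
  for (j = 0; j < 3; j++)
    z[j] = a[r][j]/(lambda*x[r]);        /* z = (1/(lambda x_r)) (row r of A) */
  for (i = 0; i < 3; i++) {              /* b_ij = a_ij - lambda x_i z_j */
    for (j = 0; j < 3; j++)
      printf("%8.2f", a[i][j] - lambda*x[i]*z[j]);
    printf("\n");
  }
  return 0;
}

The third row of the printed matrix consists of zeros, in agreement with the calculation above.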

When looking for the eigenvalues of the matrix B (except for the eigenvalue 0, which is not an eigenvalue of the original matrix A), one can delete the third row and third column of the matrix B, and look for the eigenvalues of

B′ = \begin{pmatrix} 16 & 12 \\ 10 & 26 \end{pmatrix}.

2. Consider the matrix

A = \begin{pmatrix} 36 & 54 & 12 \\ 30 & 48 & 36 \\ 36 & 72 & 36 \end{pmatrix}.

An eigenvalue of this matrix is 12 with eigenvector x = (6, −2, −3)^T. Do one step of Wielandt deflation.

Solution. We have λ = 12; the component x_r with r = 1 has the largest absolute value among the components of the eigenvector x, i.e., x_r = x_1 = 6, and the rth, i.e., first, row of A is a_{1∗} = (36, 54, 12). Hence

z = (1/(λ x_1)) a_{1∗}^T = (1/(12 · 6)) (36, 54, 12)^T = (1/12) (6, 9, 2)^T.

We have

B = A − λ · x · z^T = A − 12 · x · z^T
  = \begin{pmatrix} 36 & 54 & 12 \\ 30 & 48 & 36 \\ 36 & 72 & 36 \end{pmatrix} − \begin{pmatrix} 6 \\ −2 \\ −3 \end{pmatrix} (6\ \ 9\ \ 2)
  = \begin{pmatrix} 36 & 54 & 12 \\ 30 & 48 & 36 \\ 36 & 72 & 36 \end{pmatrix} − \begin{pmatrix} 36 & 54 & 12 \\ −12 & −18 & −4 \\ −18 & −27 & −6 \end{pmatrix} = \begin{pmatrix} 0 & 0 & 0 \\ 42 & 66 & 40 \\ 54 & 99 & 42 \end{pmatrix}.

When looking for the eigenvalues of the matrix B (except for the eigenvalue 0, which is not an eigenvalue of the original matrix A), one can delete the first row and first column of the matrix B, and look for the eigenvalues of

B′ = \begin{pmatrix} 66 & 40 \\ 99 & 42 \end{pmatrix}.

42. SIMILARITY TRANSFORMATIONS AND THE QR ALGORITHM

Let A and S be n × n matrices, and assume that S is nonsingular. The matrix S^{−1}AS is said to be similar to A, and the transformation that takes A to S^{−1}AS is called a similarity transformation.157 The importance of similarity transformations for us is that A and S^{−1}AS have the same eigenvalues. Indeed, we have Av = λv

157 A matrix can be thought of as the description of a linear transformation of a finite dimensional vector space in terms of how the basis vectors are transformed. Namely, if T is a linear transformation and e_1, . . . , e_n are basis vectors, then the matrix A = (a_{ij}) such that T(e_j) = Σ_{i=1}^{n} a_{ij} e_i describes the transformation T. A similarity transformation corresponds to a change of basis.

if and only if S^{−1}AS(S^{−1}v) = λ(S^{−1}v), as is easily seen by multiplying the first of these equations by S^{−1} on the left.

An upper Hessenberg matrix is a square matrix where all elements below the main diagonal, but not immediately adjacent to it, are zero. That is, A = (a_{ij}) is an upper Hessenberg matrix if a_{ij} = 0 whenever i > j + 1. There are methods to find the eigenvalues of a Hessenberg matrix; hence, a first step in finding the eigenvalues of a matrix A may be to transform it to a Hessenberg matrix. There are various methods to find a matrix S^{−1}AS similar to A that is an upper Hessenberg matrix. One method performs Gaussian elimination. One goes through the columns of the matrix A, beginning with the first column. When dealing with the kth column (1 ≤ k ≤ n − 2), one makes the elements a_{k+2,k}, . . . , a_{n,k} in this column zero. If these elements are already zero, then nothing is done, and one goes on to deal with column k + 1. If not, then one finds the element of largest absolute value among the elements a_{k+1,k}, . . . , a_{n,k}. If this is a_{i′,k}, then one interchanges row k + 1 and row i′. This corresponds to multiplying A on the left by a permutation matrix P = (p_{ij}), where p_{ii} = 1 if i ≠ k + 1 and i ≠ i′, p_{k+1,i′} = 1 and p_{i′,k+1} = 1, and p_{ij} = 0 otherwise. The inverse of the matrix P is P itself.158 To make this a similarity transformation, one then interchanges column k + 1 and column i′; multiplying A by P on the right amounts to interchanging columns k + 1 and i′ of A. Thus, first interchanging the rows and then the columns as stated amounts to forming the matrix PAP = P^{−1}AP. At this point, the element that became a_{k+1,k} is nonzero. For i = k + 2, . . . , n, one then subtracts

m_i = a_{i,k} / a_{k+1,k}

times row k + 1 from row i. To make this a similarity transformation, one then adds m_i times column i to column k + 1. (We will give some explanation of this point below.)

To explain why changing the matrix A in this way is a similarity transformation, let k and l be arbitrary with 1 ≤ k, l ≤ n and k ≠ l, and let ρ be a number. Subtracting ρ times row k from row l in the matrix A amounts to multiplying the matrix A on the left by the matrix C = (c_{ij}), where159

c_{ij} = δ_{ij} − ρ δ_{il} δ_{kj}.

Adding ρ times column l to column k in the matrix A amounts to multiplying A on the right by D = (d_{ij}), where d_{ij} = δ_{ij} + ρ δ_{kj} δ_{il}. It is easy to see that C = D^{−1}. Indeed,160

Σ_{r=1}^{n} c_{ir} d_{rj} = Σ_{r=1}^{n} (δ_{ir} δ_{rj} + ρ δ_{ir} δ_{kj} δ_{rl} − ρ δ_{il} δ_{kr} δ_{rj} − ρ² δ_{il} δ_{kr} δ_{kj} δ_{rl})
   = δ_{ij} + ρ δ_{il} δ_{kj} − ρ δ_{il} δ_{kj} − ρ² δ_{il} δ_{kl} δ_{kj} = δ_{ij};

the last equation holds because δ_{kl} = 0, since we assumed k ≠ l. As this is true for arbitrary i and j, it follows that CD is the identity matrix, and thus DCA = A.

Plain rotations will be used to transform a Hessenberg matrix to a diagonal matrix. Given p and q with 1 ≤ p, q ≤ n and p ≠ q, a plain rotation by angle θ involving rows p and q is a transformation such that

a′_{pj} = a_{pj} cos θ − a_{qj} sin θ,

158 In terms of permutations this means that if one interchanges row k + 1 and row i′, and then one again interchanges row k + 1 and row i′, then one gets back the original matrix.
159 δ_{ij}, called Kronecker's delta, is defined to be 1 if i = j and 0 if i ≠ j.
160 Instead of the following calculation, one might note that multiplying the matrix CA on the left by D amounts to adding ρ times row k to row l of the matrix CA; the resulting matrix is clearly the original matrix A. Since this is true for an arbitrary matrix A, it follows that CD is the identity matrix.

a′_{qj} = a_{pj} sin θ + a_{qj} cos θ, and a′_{ij} = a_{ij} if i ≠ p and i ≠ q. The reason this is called a plain rotation is that if one describes the components of a vector with the unit vectors e_1, . . . , e_n, and one takes the plane of the vectors e_p and e_q and rotates the unit vectors in this plane by an angle θ to obtain the new unit vectors e′_p and e′_q (while leaving all other unit vectors unchanged), then the above formulas describe the vector (namely, the jth column vector of the matrix) in the new coordinate system. The way the matrix A′ = (a′_{ij}) is obtained from the matrix A = (a_{ij}) amounts to multiplying A on the left by the matrix S = (s_{ij}), where

s_{pp} = cos θ,  s_{pq} = − sin θ,  s_{qp} = sin θ,  s_{qq} = cos θ,

and s_{ij} = δ_{ij} otherwise. The matrix S is clearly an orthogonal matrix; hence its inverse is its transpose, and the transformation SAS^T is a similarity transformation. The entries a′′_{ij} of the matrix A′′ = A′S^T can be obtained by the formulas

a′′_{ip} = a′_{ip} cos θ − a′_{iq} sin θ,  a′′_{iq} = a′_{ip} sin θ + a′_{iq} cos θ,

and a′′_{ij} = a′_{ij} if j ≠ p and j ≠ q.

When wanting to change the element a_{qp} to zero (where q = p + 1 in our case), one uses a plain rotation S that changes the element a_{qp} to zero; that is, the angle θ needs to be determined so that a′_{qp} = 0. Using a plain rotation with rows p and q with angle θ, we have

a′_{qp} = a_{pp} sin θ + a_{qp} cos θ.

In order to have a′_{qp} = 0, we can take

sin θ = −a_{qp} α  and  cos θ = a_{pp} α,  with  α = 1/√(a_{pp}² + a_{qp}²).

Assuming that A is an upper Hessenberg matrix, one uses plain rotations to make the elements under the main diagonal zero, so as to transform A into a diagonal matrix. For p = 1, . . . , n − 1, one multiplies A from the left by such a matrix S_p involving rows p and p + 1 to change the element a_{p+1,p} to zero. That is, one forms the matrix

S_{n−1} S_{n−2} . . . S_1 A.

Since one wants to perform a similarity transformation, so as not to change the eigenvalues, one then needs to multiply on the right with the inverses (or, what amounts to the same, transposes) of these matrices, to obtain the matrix

(1)  A_1 = S_{n−1} . . . S_1 A S_1^T . . . S_{n−1}^T.

This matrix will not be a triangular matrix, since multiplication from the right will change the zero elements a_{p+1,p} next to the main diagonal back to nonzero elements. Nevertheless, assuming that all eigenvalues of A are real, by repeatedly performing this transformation (i.e., again changing the

void printmatrix(double **a. . int p. .h: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 #include <stdio. double costheta). The shift that needs to be applied can be determined the lower right 2 by 2 minor of the matrix A. double sintheta. double costheta). assuming that we are dealing with a real matrix). The functions in this file will be explained below. double shiftamount(double **a. and to determine Si+1 one needs only the current values of the entries ai+1 i+1 and ai+2 i+1 ). one obtains the matrices A2 .c is the same as the one used before: 1 #include "qr.0 ? (x) : (-(x))) #define square(x) ((x)*(x)) void simgauss(double **a. Having determined this eigenvalue. an eigenvalue of Ak will be (close to) the (k) element ann . int n).h" 2 3 double **allocmatrix(int n) . then one can determine the eigenvalues of the deflated matrix. the equation (1) is then also valid. int n. That is. int n. . int rightprotate(double **a. one obtains a sequence of matrices that usually converges to a triangular matrix. but the shift makes a difference in determining the appropriate matrices Si . the (k) element an−1 n is close enough to zero. This is because multiplying by Si−1 on the right does not change the entries of the matrix involved in determining the rotation Si+1 (since multiplying on the right by Si−1 affects columns i − 1 and i of the matrix only. double *sintheta. double tol. int protation(double **a. If λ1 and λ2 are the eigenvalues of this minor.h> #include <stdlib. int maxits. void printmatrix_mathematica(double **a. then forms the matrix T A′1 = Sn−1 . one can form the matrices B1 = S1 A. int q. .h> #define absval(x) ((x) >= 0. Sn−1 . . int p. and A1 = Bn−1 Sn−1 . then one choses s as the eigenvalue λi for which |λi − ann | is smaller in case λ1 and λ2 are real. B2 = S2 B1 S1T . in this case. The file alloc. int *success). int *itcount. and then applies the shift A1 = A′1 + sI. . One can speed up this convergence by applying shifts. double *allocvector(int n). int p. When calculating A1 . double **allocmatrix(int n). . . Similarity transformations and the QR algorithm 213 elements just below the main diagonal to zero). one used the header file qr. int n. the matrix Ak can be deflated by crossing out the last row and last column. instead of starting with the Hessenberg matrix A.int n). . double qr(double **a. . int n. int n). int q.h> #include <math. . S1 (A − sI)S1T .42. since the shifts −sI and +sI will cancel (since the matrices Si and SiT commute with the identity matrix I). and the real part of λ1 (or λ2 ) if these eigenvalues are complex (since they are complex conjugates of each other. . until in the matrix Ak . Of course. B3 = S3 B2 S2T . one forms the matrix A − sI with an appropriate value of S. double *costheta). To implement this method on computer. int q. int leftprotate(double **a. double sintheta. T T Bn−1 = Sn−1 Bn−2 Sn−2 . After obtaining the matrix A1 . int n).

there does not seem to be any disadvantage in asking for a contiguous block. because with pivoting. The only change we made is line 1. even the largest matrix for which Gaussian elimination is feasible occupies a relatively modest amount of memory. b=(double *) malloc((size_t)((n+1)*sizeof(double))). So memory for rows could be allocated separately. a=(double **) malloc((size_t)((n+1)*sizeof(double*))).c first used for Gaussian elimination.214 Introduction to Numerical Analysis 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 /* Allocate matrix. } This is identical to the file alloc. the rows will be interchanged. if (!a) { printf("Allocation failure 1 in matrix\n"). if (!a[1]) { printf("Allocation failure 2 in matrix\n"). exit(1). i<=n. } /* Allocate the whole matrix: */ a[1]=(double *) malloc((size_t)(n*(n+1)*sizeof(double))).h" void simgauss(double **a.h is included now. return a. There is no real advantage for this here. i++) a[i]=a[i-1]+n+1. the values of a[i] for 1<=i<=n may change: */ a[0]=a[1]. The file hessenberg. where the file qr. exit(1). } return b. rows will be interchanged. if (!b) { printf("Allocation failure 1 in vector\n"). /* Save the beginning of the matrix in a[0]. The following code segment allocates a contiguous block of pointers for the whole matrix. exit(1).int n) /* The matrices are size n times n */ { . */ { int i. double **a. On the other hand. } double *allocvector(int n) { double *b.c contains the function using Gaussian elimination to produce the upper Hessenberg matrix: 1 2 3 4 5 #include "qr. } /* Initialize the pointers pointing to each row: */ for (i=2.

success. for (i=k+2. int p. since then column k of the matrix is almost zero already */ if (absval(a[k+1][k])>near_zero) for (i=k+2.l.k.42. realpart. success=1.j<=n. theshift.k<n-1. pivi=i.i++) { temp=a[i][k+1]. norm. for(k=1. if ( absval(a[q][p])<=near_zero ) return.i<=n. } } double shiftamount(double **a. double *sintheta. double *costheta) { const near_zero=1e-20.+ a[n-1][n]*a[n][n-1]-a[n-1][n-1]*a[n][n]. a[i][k+1]=a[i][pivi].j<=n. a[pivi]=rowptr. 215 .i++) if( (temp=absval(a[i][k])) > max ) { max=temp. if ( discr<0 ) return realpart.. max=absval(a[k+1][k]). if (p==q) return 0. for (j=k. int i. int n.pivi. int q. Similarity transformations and the QR algorithm 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 } const double near_zero=1e-20. temp.j. discr=square(realpart)/4.i++) { temp=a[i][k]/a[k+1][k]. } /* interchange rows and column */ if ( pivi != k+1 ) { /* interchange rows */ rowptr=a[k+1]. for (j=1. a[k+1]=a[pivi]. double max. } int protation(double **a. int n) { double discr.j++) a[i][j] -= temp*a[k+1][j]. if ( a[n-1][n-1]>=a[n][n] ) return realpart-sqrt(discr). return realpart+sqrt(discr). *rowptr.i<=n. /* interchange columns */ for (i=1.k++) { pivi=k+1. realpart=(a[n-1][n-1]+a[n][n])/2. double alpha. a[i][pivi]=temp.i<=n.j++) a[j][k+1] += temp*a[j][i]. } } /* If a[k+1][k] is nearly zero then do nothing.

and calculates the appropriate shift to be applied to the matrix before using plane rotations. The last three functions mentioned return 0 in case p and q are equal (when the transformation cannot be carried out).216 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 Introduction to Numerical Analysis alpha=1. } return 1. as we mentioned. int q. } The function simgauss in lines 3–36 has the matrix **a and its size n as parameters. int p. otherwise they return 1. This function returns the values of sin θ and cos θ as pointers sintheta and costheta. The function protation in lines 50--61 calculates sin θ and cos θ for the plain rotation that would make the element aqp zero. and then the rows and columns are interchanged in lines 19–26. and the function rightprotate in lines 78–91 applies its transpose as a matrix multiplication on the right. for (i=1./sqrt(square(a[p][p])+square(a[q][p])). } int rightprotate(double **a. for (j=1. double sintheta. int q. The function shiftamount in lines 38–48 uses the quadratic formula to find the eigenvalues of the lower right 2 by 2 minor of the matrix **a. a[i][p]=temp*costheta-a[i][q]*sintheta.i<=n. double temp. a[p][j]=costheta*temp-sintheta*a[q][j]. } int leftprotate(double **a. the pivot element is found in lines 11–18.c contains the function applying these plain rotations: . The angle θ itself does not need to be determined. The function leftprotate in lines 63–76 applies this transformation as a matrix multiplication on the left. Then the k + 1st row is subtracted from later rows in line 32. The file qr. double temp. *sintheta=-alpha*a[q][p]. /* error is p==q */ if (p==q) return 0. int n.j++) { temp=a[p][j]. a[q][j]=sintheta*temp+costheta*a[q][j].j<=n. a[i][q]=temp*sintheta+a[i][q]*costheta. and the appropriate column subtraction (necessary to make the transformation a similarity transformation) is performed in line 33. int p. Dealing with columns k = 1 through k = n − 2 in the loop in lines 10–35. int n. /* error is p==q */ if (p==q) return 0. double sintheta. double costheta) { int j. we need to interchange columns as well in order to make the transformation a similarity transformation.i++) { temp=a[i][p]. } return 1. return 1. double costheta) { int i. *costheta= alpha*a[p][p].

217 . oldcost. printf("\nm={").&cost).i<=n.n).n.oldsint. cost.a[i][j]).n.j.i<=n.i-1.. pk.i++) { protation(a.i. int maxits.*itcount<=maxits. printf("%8. leftprotate(a. int *success) { int i. *success=0. j.(*itcount)++) { pk=shiftamount(a.n.n-1. } } /* We get to this point if we are unsuccessful */ return 0.j. } void printmatrix(double **a.i.h" double qr(double **a.sint. int *itcount. oldsint=sint.i<=n.j++) printf("%8.i+1.i++) { printf("\n"). for (i=1.5f". printf("\n{"). double lambda. return a[n][n]. Similarity transformations and the QR algorithm 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 #include "qr. sint.42.j<=n.i<n. } printf("}}\n").i++) a[i][i] += pk. for (j=1. oldsint.oldcost). for (i=1.5f ".n. } rightprotate(a. int n) { int i. int n) { int i.j++) printf("%8. double tol.&sint.oldcost). } void printmatrix_mathematica(double **a.cost).i.oldsint.i++) { if ( i>1 ) printf("}. for (i=1. } printf("\n")."). for (i=1. int n.a[i][j]).n.". if (i>1) rightprotate(a. oldcost=cost.i<=n. for (j=1.a[i][n]).5f.j<n. for (i=1. for (*itcount=1. if ( absval(a[n][n-1])<=tol ) { *success=1.i+1.i++) a[i][i] -= pk.

The first entry 13 must be of type float. and fills in the lower triangle 8 to make the matrix symmetric. so the calling program can read it). i. The 11 rest of the entries are the elements in the 12 upper triangle of the matrix. the other entries can be 15 integers or reals. itcount. it is guaranteed that the eigenvalues are real. . A similar function printmatrix_mathematica prints out the matrix in the form usable by the software Mathematica. One needs to multiply with these rotation matrices also on the right with one step delay as was explained above. This is useful for debugging purposes. only the last matrix used needs to be remembered. the third entry must be 14 an unsigned integer. &tol). and the integer *success showing whether the function was successful. readerror=0. This program was used with the following input file symm_coeffs: 1 5e-14 2 5 3 1 2 3 5 8 4 3 -2 2 3 5 1 1 -1 6 -1 3 7 2 The numbers on the left are line numbers. but it works only if the eigenvalues are real. The first number is the tolerance. success. the maximum number of iterations maxits allowed. 18 int n. For this reason.c: 1 #include "qr. "r"). This is why one stores the current values of sint and cost in oldsint and oldcost on line 16. to use it for the next cycle of the loop on lines 9–16. 19 char ss[25].h" 2 3 main() 4 { 5 /* This program reads in the elements of in the 6 upper triangle of a square matrix int the file 7 named symm_coeffs. 21 coefffile=fopen("symm_coeffs". the number of iterations *itcount (as a pointer.218 Introduction to Numerical Analysis 52 53 } printf("N[Eigenvalues[m]]\n"). The parameters of the function qr are the matrix **a. the second number is the size of the matrix. rows 3–7 then contain the upper triangle of a symmetric matrix. to compare it with the result produced by the present program. the tolerance tol to stop the iteration when the test on line 20 shows that the element a[n][n-1] is small enough. */ 16 double **a. "%f". the size of the matrix. since Mathematica can be used to calculate the eigenvalues of the matrix. For a symmetric matrix. j. lambda. The first entry in 9 the file is the required tolerance. The function printmatrix can be used to print out the entries of the matrix **a. subs. 20 FILE *coefffile. The calling program is contained in the file main. the 10 second entry is n. 17 float tol. the size n of the matrix. The present program does not require that the matrix be symmetric. The function qr uses plain rotations involving rows i and i + 1 on the Hessenberg matrix to make the element ai+1 i zero. 22 fscanf(coefffile.

25460 1. lambda).79959 -0. i. "%u". the eigenvalue is calculated by using plane rotations. printf("i=%u j=%u\n".18487 -0. the eigenvalues.NULL). } a[j][i]=a[i][j]=strtod(ss.00000 0. printf("Not enough coefficients\n"). if ( success ) printf("An eigenvalue is %16. free(a).74484 -0.00000 -0.j<=n. In lines 26–37 the matrix is read. in line 42.100. a=allocmatrix(n).76364 -1. in line 41 the function simgauss is invoked to perform a similarity transformation based on Gaussian elimination.00000 0. if there is no error in reading the input (no end of file is reached before the required number of data are read). else { simgauss(a.29943 -3. and in line 46 the final entries of the matrix **a are printed out. In lines 42–45 the result is printed out.85017 2.42.00001 -7.36333 -0. as printed out by Wielandt’s deflation are as follows: 1 Tolerance used: 5e-14 2 56 iterations used for eigenvalue number 1 3 48 iterations used for eigenvalue number 2 4 139 iterations used for eigenvalue number 3 5 20 iterations used for eigenvalue number 4 6 All eigenvalues were calculated.62467 The lower right entry of the matrix contains the calculated eigenvalue.n. They are: . } } fclose(coefffile).24524 0. "%s". tol). Similarity transformations and the QR algorithm 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 } 219 printf("Tolerance used: %g\n". ss)==EOF) { readerror=1.i++) { if ( readerror ) break.j++) { if (fscanf(coefffile. It is interesting to note that the other elements in the main diagonal of the matrix are also fairly close to the other eigenvalues.tol.40126 -0.&itcount. printmatrix(a.00000 -0. &n). Then.00000 -0. for (i=1.88053 4.&success).12f\n".01046 4. In fact. if ( readerror ) printf("Not enough data\n").n). for (j=i. break.00000 6.i<=n.32985 -0. lambda=qr(a.58357 3.52720 -1.n). else printf("No eigenvalue could be calculated\n"). } free(a[0]). itcount). j).00000 -0.624673383928 13. fscanf(coefffile.00000 -3. printf("The number of iterations was %u\n". The printout of the program is as follows: 1 2 3 4 5 6 7 8 9 Tolerance used: 5e-14 The number of iterations was 20 An eigenvalue is -0.

that is. In fact. . . it can be carried out using the Jordan normal form. (A − λI)x = 0. x2 . If k · k is a matrix norm compatible with a vector norm. we have An x = λn x. and. see. with a vector norm. and in the generalized setting not all elements of the spectrum will be eigenvalues. these column vectors are linearly dependent.162 for any matrix A we have ρ(A) ≤ kAn k1/n (1) for any positive integer n ≥ 1. if for any matrix A and any column vector x we have kAxk ≤ kAkkxk. however. xN )T . This is equivalent to saying that A − λI is not invertible. then Ax = λx. This equation can also be read as saying that the linear combination of the column vectors of A − λI) formed by the coefficients x1 . 162 A matrix norm k · k is called compatible. if λ is an eigenvalue of A with eigenvector x. in what follows. . relies on the theory of complex functions.363332212396 43. Reading. SPECTRAL RADIUS AND GAUSS-SEIDEL ITERATION Spectral radius. n→∞ The proof of the first of these relations is easy to prove. 163 For the Jordan normal form of matrices. AddisonWesley Publishing Company. where I is the identity matrix. in other word. . one can define the spectrum of A as the set of complex numbers for which the matrix λ − A is not invertible. . also denoted as k · k.220 Introduction to Numerical Analysis 7 8 13. . . In other words. A2 x = A(Ax) = A(λx) = λAx = λ2 x. (2) ρ(A) = lim kAn k1/n . In the discussion that follows. for which Ax = λx. or consistent. it will be simpler to write λ − A instead of λI − A. generalizable way beyond the scope of matrices. we have ρ(A) = λ = (λn )1/n = (|λn | kxk)1/n = (|λ|n |xk)1/n = kAn xk1/n ≤ (kAn kkxk)1/n ≤ kAn k1/n . Second edition. and should be skipped by those unfamiliar with the basics of this theory. called the eigenvector corresponding to λ. l · k will denote a fixed vector norm or a fixed matrix norm compatible with this vector norm.163 but the most elegant proof. Taking λ to be the eigenvalue with the largest absolute value of A. This discussion.583574215424 -0. in fact. using a similar argument combined with induction on n. and x to be a corresponding eigenvector with kxk = 1. for example Serge Lang. relies on the concept of the resolvent of a matrix. The spectrum of A is the set of its eigenvalues. Furthermore.624673383928 -7.852544184194 4.161 The spectral radius ρ(A) of a matrix A is defined as the maximum of the absolute values of the elements of its spectrum. Mass. for any positive integer n. . 161 This second definition of the spectrum is much better than the first definition. xN . An eigenvalue of such a matrix A is a complex number for which there is a nonzero n-dimensional column vector x = (x1 . 1971. The proof of (2) is more complicated. For matrices. Linear Algebra.256975565094 -3. In this section we will discuss N ×N matrices with complex numbers as entries. discussed in the next subsection. . since it can be generalized to entities much more general than matrices.

The resolvent of a matrix. The resolvent (or resolvent function) R(ζ) of the matrix A is defined as

R(ζ) = R_A(ζ) = (ζ − A)^{−1},

where ζ is a complex variable; that is, R(ζ) is defined at every point where ζ − A is invertible, i.e., at every point ζ that does not belong to the spectrum of A. The complement of the spectrum of A, that is, the domain of R_A, is called the resolvent set of A.

If ζ_0 is in the resolvent set of A and ζ is close to ζ_0, one might try to calculate the inverse of

ζ − A = (ζ_0 − A) − (ζ_0 − ζ) = (I − (ζ_0 − ζ)(ζ_0 − A)^{−1})(ζ_0 − A);

note that the inverse occurring in the second factor on the right-hand side exists, according to the assumption that ζ_0 belongs to the resolvent set of A. To calculate the inverse of the left-hand side, we need the inverse of the first factor on the right-hand side. In analogy with the expansion

(1 − x)^{−1} = Σ_{n=0}^{∞} x^n,

convergent for |x| < 1, we might try

(I − (ζ_0 − ζ)(ζ_0 − A)^{−1})^{−1} = Σ_{n=0}^{∞} ((ζ_0 − ζ)(ζ_0 − A)^{−1})^n = Σ_{n=0}^{∞} (ζ_0 − ζ)^n (ζ_0 − A)^{−n}.

In fact, it is easy to show that if |ζ_0 − ζ| ‖(ζ_0 − A)^{−1}‖ < 1, then this series converges and equals the inverse on the left-hand side. This shows that the resolvent set is open, since the small circle represented by the numbers ζ belongs to the resolvent set. Furthermore, the expansion can be termwise differentiated164 in this small circle, showing that the resolvent is holomorphic165 on its domain (since ζ_0 was an arbitrary element of the domain).

Take any circle with center at the origin that includes the spectrum of A in its interior; note that any circle of radius greater than the spectral radius ρ(A) of A will do. For any positive integer n, the value of the integral

∮_{|ζ|=r} ζ^n R_A(ζ) dζ,

taken along the circle {ζ : |ζ| = r} of radius r with center at the origin, is independent of r as long as r is large enough for this circle to include the spectrum of A in its interior, since a well-known theorem of complex function theory says that path integrals on simply connected domains166 only

164 That is, if one differentiates each term of the series and then adds up these terms, the result will be the derivative of the sum of this series.
165 Holomorphic is the term used in complex function theory for functions that are differentiable (differentiability in complex function theory behaves very differently from differentiability on the real line, so much so that the use of a different term is justified).
166 A domain is a connected open set, that is, an open set such that from any point of the set one can reach any other point by walking along a number of straight line segments entirely inside the domain. A domain is called simply connected if it has no holes inside; a somewhat more precise way of describing this is that any path consisting of line segments entirely inside the domain can be continuously deformed into any other similar path with the same endpoints – when doing the deformation, one may need to break up a line segment into a path consisting of a number of line segments.

depend on the starting point and the endpoint, and not on the path connecting them. For any such integral with large enough r, we can write

∮_{|ζ|=r} ζ^n R_A(ζ) dζ = ∮_{|ζ|=r} ζ^n (ζ − A)^{−1} dζ = ∮_{|ζ|=r} ζ^{n−1} (I − ζ^{−1}A)^{−1} dζ
  = ∮_{|ζ|=r} ζ^{n−1} Σ_{k=0}^{∞} ζ^{−k} A^k dζ = Σ_{k=0}^{∞} ∮_{|ζ|=r} ζ^{n−1−k} A^k dζ = 2πi A^n;

the expansion after the third equality is analogous to the expansion of (1 − x)^{−1} described above, and it is convergent for large r, that is, if r^{−1}‖A‖ = |ζ|^{−1}‖A‖ < 1; in this case the integral of the series can be evaluated termwise. The last equality follows since the integral of ζ^{n−1−k} along the circle is zero unless n − 1 − k = −1, that is, unless k = n, in which case the integral is 2πi. As we remarked above, the value of the integral on the left-hand side does not depend on r as long as r > ρ(A); so the left-hand side is still equal to the right-hand side for any such r, even though the intervening infinite series need not make sense for all such r.

Hence, for any r > ρ(A) and any n ≥ 1, we have

‖A^n‖ = ‖ (1/(2πi)) ∮_{|ζ|=r} ζ^n R_A(ζ) dζ ‖ ≤ (1/(2π)) ∮_{|ζ|=r} |ζ^n| max_{|ζ|=r} ‖R_A(ζ)‖ d|ζ|
  ≤ (1/(2π)) · 2πr · r^n max_{|ζ|=r} ‖R_A(ζ)‖ = r^{n+1} max_{|ζ|=r} ‖R_A(ζ)‖;

the second inequality holds since |ζ| = r along the path of integration, and the length of the path of integration is 2πr. Making n → ∞, we obtain

lim sup_{n→∞} ‖A^n‖^{1/n} ≤ lim_{n→∞} r^{(n+1)/n} (max_{|ζ|=r} ‖R_A(ζ)‖)^{1/n} = r;

on the left-hand side, we need to take lim sup, since the existence of the limit is not guaranteed. Since this is true for any r > ρ(A), it follows that

lim sup_{n→∞} ‖A^n‖^{1/n} ≤ ρ(A).

This, together with (1), establishes (2).

Hermitian matrices. Write z̄ for the complex conjugate of the number z.168 Given an m × n matrix A = (a_{jk})_{j,k: 1≤j≤m, 1≤k≤n}, the conjugate transpose A∗ of A is the n × m matrix

A∗ = (ā_{jk})_{k,j: 1≤k≤n, 1≤j≤m}.

We could equally well have written A∗ = (ā_{kj})_{j,k: 1≤j≤n, 1≤k≤m}. The order of listing the ranges of the subscripts is immaterial; to understand the notation, note that the first subscript listed outside the parentheses before the colon identifies rows, and the second one identifies columns. That is, the matrix A has m rows and n columns, and the entry in the intersection of the jth row and the kth column is a_{jk}; the matrix A∗ has n rows and m columns, and the element in the intersection of the kth row and jth column is ā_{jk}, the complex conjugate of a_{jk}.

If one can multiply the matrices A and B (in this order),167 then it is easy to check that one can also multiply the matrices B∗ and A∗ (in this order), and

(AB)∗ = B∗A∗.

This is true even in case AB is a number, e.g., if A is a row vector and B is a column vector (of the same dimension, so they can be multiplied), with the understanding that for a number c we write c∗ = c̄.

167 That is, if the number of columns of A is the same as the number of rows of B.
168 I.e., if z = x + iy with real x and y, then z̄ = x − iy.
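As a small check of the rule (AB)∗ = B∗A∗ (a made-up 2 × 2 example, not part of the programs belonging to these notes), the following sketch, using the C99 complex type, prints the entries of (AB)∗ next to those of B∗A∗; the two columns agree:

#include <stdio.h>
#include <complex.h>

int main(void)
{
  double complex a[2][2] = {{1+2*I, 3-I}, {0+I, 2}};
  double complex b[2][2] = {{2-I, 1}, {1+I, -3*I}};
  double complex ab[2][2], lhs, rhs;
  int i, j, k;
  for (i = 0; i < 2; i++)                /* ab = A B */
    for (j = 0; j < 2; j++) {
      ab[i][j] = 0;
      for (k = 0; k < 2; k++) ab[i][j] += a[i][k]*b[k][j];
    }
  for (i = 0; i < 2; i++)
    for (j = 0; j < 2; j++) {
      lhs = conj(ab[j][i]);              /* entry (i,j) of (AB)* */
      rhs = 0;                           /* entry (i,j) of B* A* */
      for (k = 0; k < 2; k++) rhs += conj(b[k][i])*conj(a[j][k]);
      printf("%8.3f%+8.3fi   %8.3f%+8.3fi\n",
             creal(lhs), cimag(lhs), creal(rhs), cimag(rhs));
    }
  return 0;
}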

In what follows, $N$ will be a fixed positive integer, and we will only be interested in the conjugate transpose of an $N\times N$ square matrix, or of an $N$-dimensional column vector; the conjugate transpose of the latter is an $N$-dimensional row vector. If the entries of $A$ are real, we have $A^* = A^T$, the transpose of $A$.

For an $N$-dimensional column vector $x = (x_1,\dots,x_N)^T$ with complex entries, the $l^2$ norm of $x$ is defined as
$$\|x\| = \|x\|_2 \overset{\mathrm{def}}{=} (x^*x)^{1/2} = \Bigl(\sum_{n=1}^N \bar x_n x_n\Bigr)^{1/2} = \Bigl(\sum_{n=1}^N |x_n|^2\Bigr)^{1/2}.$$
The norm $\|\cdot\|$ for vectors will from now on denote the $l^2$ norm. The matrix norm induced by this vector norm is defined for $N\times N$ matrices as
$$\|A\| \overset{\mathrm{def}}{=} \max_{x:\,\|x\|=1}\|Ax\|;$$
it is easily seen that this matrix norm is compatible with the vector norm $\|\cdot\|$, that is, $\|Ax\| \le \|A\|\,\|x\|$.

We will be interested in the expression $x^*Ax$, called a quadratic form, associated with an $N\times N$ matrix $A = (a_{jk})$. With $x = (x_1,\dots,x_N)^T$, we have
$$x^*Ax = \sum_{j,k=1}^N \bar x_j a_{jk} x_k.$$
The $N\times N$ matrix $A = (a_{jk})$ is called Hermitian (named after the French mathematician Charles Hermite, 1822–1901) if $A^* = A$; this is equivalent to saying that $a_{jk} = \bar a_{kj}$ for any $j$, $k$ with $1 \le j, k \le N$. For a Hermitian matrix $A$ and for any column vector $x$, the quadratic form $x^*Ax$ is a real number. Indeed,
$$(x^*Ax)^* = x^*A^*(x^*)^* = x^*Ax.$$
A Hermitian matrix $A$ is called positive definite if for any nonzero column vector $x$ we have $x^*Ax > 0$. As in the real case, this implies that the diagonal elements of $A$ are positive, since if in the vector $x$ all but the $k$th component is zero, and the $k$th component is 1, then we have $x^*Ax = a_{kk}$.

If the entries of $A$ are real, that is, if $A$ is a real symmetric matrix, then it is sufficient to require this inequality for column vectors with real entries; in this case the present definition of the term “positive definite” given for Hermitian matrices reverts to the definition given earlier, for symmetric matrices with real entries. In fact, any column vector $z$ can be written as $x + iy$ with $x$ and $y$ being column vectors with real entries. Then $z^* = (x + iy)^* = x^T - iy^T$, and so
$$z^*Az = (x^T - iy^T)A(x + iy) = x^TAx + ix^TAy - iy^TAx + y^TAy.$$
Since we assumed for this calculation that the entries of $A$ are real, the number $x^TAy$, being a $1\times1$ matrix, equals its own transpose, so
$$x^TAy = (x^TAy)^T = y^TA^Tx = y^TAx,$$
and hence $z^*Az = x^TAx + y^TAy$.
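As a small illustration (not from the book), the following C sketch computes the quadratic form $x^*Ax$ for a Hermitian matrix with complex entries using the C99 complex types; the $2\times2$ example matrix and the vector are arbitrary assumptions. Up to roundoff, the printed imaginary part is zero, as the argument above predicts.

#include <stdio.h>
#include <complex.h>

#define N 2

int main(void) {
    /* an example Hermitian matrix: a_{jk} equals the conjugate of a_{kj};
       it is in fact positive definite (diagonal 2 and 3, determinant 4) */
    double complex A[N][N] = {
        { 2.0,         1.0 - 1.0*I },
        { 1.0 + 1.0*I, 3.0         }
    };
    double complex x[N] = { 1.0 + 2.0*I, -1.0 + 0.5*I };
    double complex q = 0;              /* the quadratic form x*Ax */
    for (int j = 0; j < N; j++)
        for (int k = 0; k < N; k++)
            q += conj(x[j])*A[j][k]*x[k];
    /* for a Hermitian A the imaginary part must vanish (up to roundoff) */
    printf("x*Ax = %.12f + %.12fi\n", creal(q), cimag(q));
    return 0;
}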

Gauss-Seidel iteration. Consider the system of linear equations $Ax = b$, where $A = (a_{ij})$ is an $N\times N$ matrix with complex entries, $x = (x_1,\dots,x_N)^T$ is the column vector of the unknowns, and $b = (b_1,\dots,b_N)^T$ is the right-hand side of the equation. Write $A$ in the form
$$A = L + D + U,$$
where $L$ is a lower triangular matrix with zeros in the diagonal, $D$ is a diagonal matrix, and $U$ is an upper triangular matrix with zeros in the diagonal. (Clearly, $L$ and $U$ here are not the same as the matrices $L$ and $U$ obtained in LU-factorization.) Then Gauss-Seidel iteration can be described as follows: We start with an arbitrary vector $x_0$, and for $n \ge 0$ we determine the vector $x_{n+1}$ by using the equation
$$(L + D)x_{n+1} + Ux_n = b.$$
Writing $x_n = (x_1^{(n)},\dots,x_N^{(n)})^T$ for the $n$th approximation to the solution, this can be written as a system of equations
$$\sum_{k=1}^j a_{jk}x_k^{(n+1)} + \sum_{k=j+1}^N a_{jk}x_k^{(n)} = b_j \qquad (1 \le j \le N).$$
By solving the $j$th equation for $x_j^{(n+1)}$, we obtain the usual equations used in Gauss-Seidel iteration. When solving these equations, it is necessary to require that $a_{jj} \ne 0$ for $1 \le j \le N$, that is, that the diagonal elements of the matrix $A$ (which are the same as the diagonal elements of the matrix $D$) are nonzero.

The matrix form of these equations can also be solved for $x_{n+1}$:
$$(3)\qquad x_{n+1} = (L + D)^{-1}(b - Ux_n).$$
The condition for solving this equation is that the matrix $L + D$ should be invertible. As $L + D$ is a lower triangular matrix, it is invertible if and only if its diagonal entries, i.e., the diagonal entries of $D$, are nonzero (this is the same as the condition mentioned above needed in order to be able to solve the $j$th equation for $x_j^{(n+1)}$ for each $j$ in the above system of equations).
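The componentwise form of the iteration just described is easy to program. The following is a minimal C sketch, not one of the book's programs; the positive definite test system, the starting vector, the tolerance, and the iteration cap are all assumptions made for the illustration.

#include <stdio.h>
#include <math.h>

#define N 3

int main(void) {
    /* a positive definite symmetric test system A x = b */
    double A[N][N] = {{ 4.0, -1.0,  0.0},
                      {-1.0,  4.0, -1.0},
                      { 0.0, -1.0,  4.0}};
    double b[N] = {2.0, 4.0, 10.0};
    double x[N] = {0.0, 0.0, 0.0};       /* arbitrary starting vector x_0 */
    for (int n = 0; n < 100; n++) {
        double change = 0.0;
        for (int j = 0; j < N; j++) {
            /* solve the jth equation for x_j, using the components
               already updated earlier in this sweep */
            double s = b[j];
            for (int k = 0; k < N; k++)
                if (k != j) s -= A[j][k]*x[k];
            s /= A[j][j];                /* requires a_jj != 0 */
            change += fabs(s - x[j]);
            x[j] = s;
        }
        if (change < 1e-12) break;       /* stop when a sweep changes little */
    }
    printf("x = (%.10f, %.10f, %.10f)\n", x[0], x[1], x[2]);
    return 0;
}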

The following is an important result concerning the convergence of Gauss-Seidel iteration.

Theorem. Assume $A$ is a positive definite Hermitian matrix (with complex entries). Then the equation $Ax = b$ is solvable by Gauss-Seidel iteration.

Proof. If $x_n$ is the $n$th vector of the Gauss-Seidel iteration, then for $n \ge 0$ we have
$$x_{n+2} - x_{n+1} = -(L + D)^{-1}U(x_{n+1} - x_n);$$
indeed, this equation can be obtained by taking equation (3) with $n+1$ replacing $n$, and subtracting from it equation (3) as given. Iterating this, we can see that
$$x_{n+1} - x_n = (-1)^n\bigl((L + D)^{-1}U\bigr)^n(x_1 - x_0).$$
In order to show that the iteration converges, it will be sufficient to show that $\|x_{n+1} - x_n\| \to 0$, where we use $l^2$ norms; and for this, since
$$\|x_{n+1} - x_n\| \le \bigl\|\bigl((L + D)^{-1}U\bigr)^n\bigr\|\,\|x_1 - x_0\|,$$
it will be sufficient that there be a positive number $q < 1$ such that $\|((L + D)^{-1}U)^n\| < q^n$. This will follow according to (2) if the spectral radius $\rho\bigl((L + D)^{-1}U\bigr)$ of $(L + D)^{-1}U$ is less than 1, that is, if all eigenvalues of $(L + D)^{-1}U$ lie in the unit circle (of the complex plane). This is what we are going to show.

To this end, let $\lambda$ be an eigenvalue of $(L + D)^{-1}U$ with the associated eigenvector $x$, i.e.,
$$(L + D)^{-1}Ux = \lambda x,$$
where $\lambda$ is a complex number and $x$ is a nonzero vector. We need to show that $|\lambda| < 1$. Recall that the matrix $A = L + D + U$ is assumed to be Hermitian; hence the entries of $D$ are real and $U = L^*$. Thus, multiplying both sides by $L + D$ on the left and switching the sides, the last equation can be written as
$$\lambda(L + D)x = L^*x.$$
Adding $\lambda L^*x$ to both sides and noting that $L + D + L^* = A$, we obtain $\lambda Ax = (1+\lambda)L^*x$; multiplying both sides by $x^*$ on the left, we obtain
$$(4)\qquad \lambda\,x^*Ax = (1+\lambda)\,x^*L^*x.$$
The conjugate transpose of this equation is
$$\bar\lambda\,x^*Ax = (1+\bar\lambda)\,x^*Lx.$$
Multiplying the first equation by $1+\bar\lambda$ and the second one by $1+\lambda$ and adding the equations, we obtain
$$(\lambda + \bar\lambda + 2\lambda\bar\lambda)\,x^*Ax = (1+\lambda)(1+\bar\lambda)\,x^*(L + L^*)x.$$
Adding $(1+\lambda)(1+\bar\lambda)\,x^*Dx$ to both sides and noting on the right-hand side that $L + D + L^* = A$, we obtain
$$(1+\lambda)(1+\bar\lambda)\,x^*Dx + (\lambda + \bar\lambda + 2\lambda\bar\lambda)\,x^*Ax = (1 + \lambda + \bar\lambda + \lambda\bar\lambda)\,x^*Ax.$$
Transferring the term involving $A$ on the left-hand side to the right-hand side, and noting that $\lambda\bar\lambda = |\lambda|^2$ and $(1+\lambda)(1+\bar\lambda) = |1+\lambda|^2$, we obtain
$$|1+\lambda|^2\,x^*Dx = (1 - |\lambda|^2)\,x^*Ax.$$
Note that $\lambda \ne -1$; indeed, equation (4) with $\lambda = -1$ implies that $x^*Ax = 0$, whereas $x^*Ax > 0$ since $A$ is positive definite. $A$ being positive definite also implies that all the diagonal elements of $A$ (which are the same as the diagonal elements of $D$) are positive; hence we also have $x^*Dx > 0$. Thus, the last displayed equation can hold only if $1 - |\lambda|^2 > 0$. This shows that $|\lambda| < 1$, as we wanted to show.

44. ORTHOGONAL POLYNOMIALS

Let $(a,b)$ be an interval and let $w$ be a positive integrable function on $(a,b)$. (We will assume that the interval $(a,b)$ is finite; however, the arguments can be extended to the case when $a$ or $b$ or both are infinite if one assumes that the function $|x|^nw(x)$ is integrable on $(a,b)$ for all positive integers $n$.) The polynomials
$$p_n(x) = \gamma_nx^n + \dots \qquad (n = 0, 1, 2, \dots)$$
such that
$$(1)\qquad \int_a^b p_m(x)p_n(x)w(x)\,dx = \delta_{mn} \qquad (m, n = 0, 1, 2, \dots)$$
are called orthonormal on the interval $(a,b)$ with respect to the weight function $w$. (The word orthonormal is constructed by combining the words orthogonal, referring to the above relation when $m \ne n$, and normal, referring to the case when $m = n$.) Here we assume that $\gamma_n > 0$, so that $p_n$ is a polynomial of degree $n$. (The condition that $\int_a^b p_n^2(x)w(x)\,dx = 1$ determines only $|\gamma_n|$; one then is free to choose $\gamma_n$ to be positive or negative. We make the choice of $\gamma_n$ being positive.) Sometimes one drops the condition of normality, and one wants to consider the monic polynomials $\gamma_n^{-1}p_n$; monic polynomials are polynomials with leading coefficient one.

One can construct the polynomials $p_n$ by using what is called the Gram-Schmidt orthogonalization: Let
$$\gamma_0 = \Bigl(\int_a^b w(x)\,dx\Bigr)^{-1/2};$$
the integral here is positive since we assumed that $w(x) > 0$, hence raising it to the power $-1/2$ is meaningful. Put
$$p_0(x) \overset{\mathrm{def}}{=} \gamma_0;$$
then $\int_a^b p_0^2(x)w(x)\,dx = 1$. Assuming that the polynomials $p_k(x)$ of degree $k$ have been defined for $k$ with $0 \le k < n$ in such a way that
$$\int_a^b p_k(x)p_l(x)w(x)\,dx = \delta_{kl}$$
whenever $0 \le k, l < n$, put
$$P_n(x) = x^n - \sum_{k=0}^{n-1}\eta_kp_k(x),\qquad\text{where}\quad \eta_k = \int_a^b x^np_k(x)w(x)\,dx.$$
For $l < n$ we then have
$$\int_a^b P_n(x)p_l(x)w(x)\,dx = \int_a^b x^np_l(x)w(x)\,dx - \sum_{k=0}^{n-1}\eta_k\int_a^b p_k(x)p_l(x)w(x)\,dx = \eta_l - \eta_l = 0.$$
Write
$$\gamma_n = \Bigl(\int_a^b P_n^2(x)w(x)\,dx\Bigr)^{-1/2};$$
note that the integral here is positive since $w(x)$ is positive and the polynomial $P_n(x)$ is not identically zero (namely, its leading coefficient is 1). Put
$$p_n(x) = \gamma_nP_n(x).$$
Then $\int_a^b p_n^2(x)w(x)\,dx = 1$. Thus, the system of polynomials $p_n$ constructed this way is orthonormal on $(a,b)$ with respect to the weight function $w$.
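To make the construction concrete, here is a small C sketch (not from the book) carrying out this orthogonalization for the weight $w(x) \equiv 1$ on $(0,1)$, where the needed integrals $\int_0^1 x^{i+j}\,dx = 1/(i+j+1)$ are available in closed form; the degree bound NMAX is an assumption. It produces the normalized shifted Legendre polynomials. This direct construction is numerically ill-conditioned for higher degrees, so it is meant only as an illustration of the definitions.

#include <stdio.h>
#include <math.h>

#define NMAX 5

/* inner product of two polynomials given by coefficient arrays,
   with weight w(x) = 1 on (0,1): the integral of x^(i+j) is 1/(i+j+1) */
double inner(double p[], int dp, double q[], int dq) {
    double s = 0.0;
    for (int i = 0; i <= dp; i++)
        for (int j = 0; j <= dq; j++)
            s += p[i]*q[j]/(i + j + 1.0);
    return s;
}

int main(void) {
    double p[NMAX+1][NMAX+1] = {{0}};    /* p[n][j]: coeff of x^j in p_n */
    for (int n = 0; n <= NMAX; n++) {
        double P[NMAX+1] = {0};
        P[n] = 1.0;                      /* start from x^n */
        for (int k = 0; k < n; k++) {    /* subtract projections eta_k p_k */
            double xa[NMAX+1] = {0};
            xa[n] = 1.0;
            double eta = inner(xa, n, p[k], k);
            for (int j = 0; j <= k; j++) P[j] -= eta*p[k][j];
        }
        double gamma = 1.0/sqrt(inner(P, n, P, n));
        for (int j = 0; j <= n; j++) p[n][j] = gamma*P[j];
        printf("p_%d coefficients:", n);
        for (int j = 0; j <= n; j++) printf(" %10.6f", p[n][j]);
        printf("\n");
    }
    /* spot-check orthonormality */
    printf("<p_2,p_3> = %.2e, <p_3,p_3> = %.6f\n",
           inner(p[2], 2, p[3], 3), inner(p[3], 3, p[3], 3));
    return 0;
}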

It is clear that every polynomial $p(x)$ of degree $m \ge 0$ can be written as a linear combination
$$p(x) = \sum_{i=0}^m\lambda_ip_i(x).$$
Hence, (1) implies
$$(2)\qquad \int_a^b p_n(x)p(x)w(x)\,dx = 0$$
whenever $p(x)$ has degree less than $n$.

Now, it is also clear that $p_{n+1}(x)$ can be written as a linear combination
$$p_{n+1}(x) = \lambda_{n+1}xp_n(x) + \sum_{i=0}^n\lambda_ip_i(x);$$
in fact, any polynomial of degree $n+1$ can be written in such a way for appropriate choices of the coefficients $\lambda_i$, $0 \le i \le n+1$. Multiplying both sides by $p_k(x)w(x)$ for some integer $k$ with $0 \le k < n-1$ and then integrating on $(a,b)$, we obtain
$$\int_a^b p_{n+1}(x)p_k(x)w(x)\,dx = \lambda_{n+1}\int_a^b p_n(x)\bigl(xp_k(x)\bigr)w(x)\,dx + \sum_{i=0}^n\lambda_i\int_a^b p_i(x)p_k(x)w(x)\,dx.$$
The integral on the left-hand side is zero according to (2) (with $n+1$ replacing $n$), since $p_k$ has degree less than $n+1$; and on the right-hand side, every integral in the sum is zero except the coefficient of $\lambda_i$ for $i = k$. Further, since the polynomial $xp_k(x)$ has degree less than $n$ for $k < n-1$, we also have
$$\int_a^b p_n(x)\bigl(xp_k(x)\bigr)w(x)\,dx = 0.$$
Hence we have, writing $p_k^2(x)$ as shorthand for $(p_k(x))^2$,
$$0 = \lambda_k\int_a^b p_k^2(x)w(x)\,dx = \lambda_k.$$
Therefore,
$$(3)\qquad p_{n+1}(x) = \lambda_{n+1}xp_n(x) + \lambda_np_n(x) + \lambda_{n-1}p_{n-1}(x).$$
This is the three-term recurrence formula for orthogonal polynomials. For $n = 0$, (3) should be replaced with
$$p_1(x) = \lambda_1xp_0(x) + \lambda_0p_0(x),$$
as is easily seen by following the above calculation for $n = 0$. However, instead of considering the case $n = 0$ separately, it is more convenient to stipulate that $p_{-1}(x) \equiv 0$, in which case (3) becomes valid even for $n = 0$.

We will take a closer look at the coefficients $\lambda_{n-1}$, $\lambda_n$, and $\lambda_{n+1}$ in (3). First, note that the coefficient of $x^{n+1}$ on the left-hand side of (3) is $\gamma_{n+1}$, and that on the right-hand side is $\gamma_n\lambda_{n+1}$. Thus
$$(4)\qquad \lambda_{n+1} = \frac{\gamma_{n+1}}{\gamma_n};$$
note that the denominator on the right-hand side is not zero, since we assumed that $\gamma_n \ne 0$; furthermore, $\lambda_{n+1} \ne 0$ since $\gamma_{n+1} \ne 0$.

Multiplying both sides in (3) by $p_{n-1}(x)w(x)$ and integrating, we obtain
$$\int_a^b p_{n+1}(x)p_{n-1}(x)w(x)\,dx = \lambda_{n+1}\int_a^b xp_n(x)p_{n-1}(x)w(x)\,dx + \lambda_n\int_a^b p_n(x)p_{n-1}(x)w(x)\,dx + \lambda_{n-1}\int_a^b p_{n-1}^2(x)w(x)\,dx.$$
The integral on the left and the second integral on the right are zero, and the third integral on the right-hand side is 1. Therefore,
$$0 = \lambda_{n+1}\int_a^b xp_n(x)p_{n-1}(x)w(x)\,dx + \lambda_{n-1}.$$
Thus, using the expression for $\lambda_{n+1}$ given by (4), we obtain
$$\lambda_{n-1} = -\frac{\gamma_{n+1}}{\gamma_n}\int_a^b xp_n(x)p_{n-1}(x)w(x)\,dx.$$
This expression can be further simplified. Since
$$xp_{n-1}(x) = \gamma_{n-1}x^n + \text{lower order terms} = \frac{\gamma_{n-1}}{\gamma_n}p_n(x) + q(x),$$
where $q(x)$ is some polynomial of degree less than $n$, we have
$$\lambda_{n-1} = -\frac{\gamma_{n+1}}{\gamma_n}\int_a^b p_n(x)\bigl(xp_{n-1}(x)\bigr)w(x)\,dx = -\frac{\gamma_{n+1}}{\gamma_n}\int_a^b p_n(x)\Bigl(\frac{\gamma_{n-1}}{\gamma_n}p_n(x) + q(x)\Bigr)w(x)\,dx$$
$$= -\frac{\gamma_{n+1}\gamma_{n-1}}{\gamma_n^2}\int_a^b p_n^2(x)w(x)\,dx - \frac{\gamma_{n+1}}{\gamma_n}\int_a^b p_n(x)q(x)w(x)\,dx.$$
The first integral on the right-hand side is one and the second one is zero (because the degree of $q(x)$ is less than $n$ – cf. (2)). Thus,
$$(5)\qquad \lambda_{n-1} = -\frac{\gamma_{n+1}\gamma_{n-1}}{\gamma_n^2}.$$

Again, multiplying both sides of (3) by $p_n(x)w(x)$ and integrating, we obtain
$$\int_a^b p_{n+1}(x)p_n(x)w(x)\,dx = \lambda_{n+1}\int_a^b xp_n^2(x)w(x)\,dx + \lambda_n\int_a^b p_n^2(x)w(x)\,dx + \lambda_{n-1}\int_a^b p_{n-1}(x)p_n(x)w(x)\,dx.$$
The integral on the left and the third integral on the right are zero, while the second integral on the right-hand side is 1. Therefore,
$$0 = \lambda_{n+1}\int_a^b xp_n^2(x)w(x)\,dx + \lambda_n.$$
Hence, again using the above expression for $\lambda_{n+1}$, we obtain
$$(6)\qquad \lambda_n = -\frac{\gamma_{n+1}}{\gamma_n}\int_a^b xp_n^2(x)w(x)\,dx.$$
Write
$$a_k = \frac{\gamma_{k-1}}{\gamma_k}\quad\text{for }k \ge 1,\qquad\text{and}\qquad b_n = \int_a^b xp_n^2(x)w(x)\,dx;$$
note that $a_k > 0$ since $\gamma_i > 0$. (For $k = 0$ one can define $a_0$ arbitrarily; in this case, one can take $a_0 = 1$.) Multiplying (3) by $\gamma_n/\gamma_{n+1}$, (4)–(6) imply
$$(7)\qquad xp_n(x) = a_{n+1}p_{n+1}(x) + b_np_n(x) + a_np_{n-1}(x).$$
This is the usual form of the three-term recurrence formula for orthogonal polynomials. This formula is valid for any $n \ge 0$, provided that we stipulate that $p_{-1}(x) \equiv 0$ (since this choice made (3) valid for $n = 0$).

Given a system of orthonormal polynomials $p_n$, it can be shown that any “nice” function $f$ can be written as an infinite series
$$f(x) = \sum_{n=0}^\infty c_np_n(x),\qquad\text{where}\quad c_k = \int_a^b f(x)p_k(x)w(x)\,dx;$$
the formula for the coefficients is obtained by multiplying the first equation by $p_k(x)w(x)$ and integrating on the interval $(a,b)$. To verify it, one needs to be able to justify the legitimacy of the termwise integration of the infinite sum on the right-hand side. This can be done fairly easily under very general circumstances using the theory of the Lebesgue integral, developed by Henri Lebesgue at the beginning of the twentieth century; the theory of the Riemann integral is much less suitable for this purpose, and is not adequate to discuss this topic in an elegant framework. (We do not define here what is meant by “nice.”) This property of orthogonal polynomials is called completeness; that is, a system of orthogonal polynomials is complete.
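As an illustration of (7) (not worked out in the book), the recurrence gives a convenient way to evaluate $p_n(x)$ once the coefficients $a_n$ and $b_n$ are known. The following minimal C sketch assumes the orthonormal Legendre polynomials on $(-1,1)$ with $w(x) \equiv 1$, for which $a_n = n/\sqrt{4n^2-1}$ and $b_n = 0$ are known in closed form, and $p_0 = 1/\sqrt2$ since $\gamma_0 = \bigl(\int_{-1}^1 dx\bigr)^{-1/2}$.

#include <stdio.h>
#include <math.h>

/* recurrence coefficients for the orthonormal Legendre polynomials
   on (-1,1) with w(x) = 1 (assumed known): a_n = n/sqrt(4n^2-1), b_n = 0;
   the n = 0 case is guarded to avoid sqrt of a negative number */
double a(int n) { return n ? n/sqrt(4.0*n*n - 1.0) : 0.0; }

/* evaluate p_n(x) from (7): p_{n+1} = ((x - b_n)p_n - a_n p_{n-1})/a_{n+1} */
double p(int n, double x) {
    double pm1 = 0.0;                /* p_{-1} = 0 */
    double p0  = 1.0/sqrt(2.0);      /* p_0 = gamma_0 */
    for (int k = 0; k < n; k++) {
        double p1 = (x*p0 - a(k)*pm1)/a(k+1);
        pm1 = p0;
        p0 = p1;
    }
    return p0;
}

int main(void) {
    double x = 0.3;
    /* compare with sqrt(5/2)*(3x^2-1)/2, the normalized Legendre P_2 */
    printf("p_2(%.1f) = %.10f, expected %.10f\n",
           x, p(2, x), sqrt(2.5)*(3*x*x - 1)/2);
    return 0;
}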

An important theorem about the zeros of orthogonal polynomials is the following

Theorem. Let $(a,b)$ be an interval, let $w(x)$ be positive on $(a,b)$, and let $\{p_n\}_{n=0}^\infty$ be a system of orthonormal polynomials on $(a,b)$ with weight $w$. Then for each $n$, the zeros of the polynomial $p_n$ are single, real, and they are located in the interval $(a,b)$.

Proof. Let $\lambda_1, \lambda_2, \dots, \lambda_k$ be the zeros of $p_n$ of odd multiplicity that are located in the interval $(a,b)$. Since the $\lambda_i$ for $i$ with $1 \le i \le k$ are exactly the places where the polynomial $p_n(x)$ changes its sign, the polynomial
$$p_n(x)\prod_{j=1}^k(x - \lambda_j)$$
has constant sign (i.e., it is always positive, or always negative) on the interval $(a,b)$, with the exception of finitely many places (the $\lambda_i$'s and the zeros of even multiplicity of $p_n(x)$ in the interval $(a,b)$). Hence we have
$$\int_a^b p_n(x)\Bigl(\prod_{j=1}^k(x - \lambda_j)\Bigr)w(x)\,dx \ne 0;$$
thus, in view of (2), we must have $k = n$. Thus, the numbers $\lambda_1, \dots, \lambda_n$ constitute all the zeros of the polynomial $p_n$ (since $p_n$ has degree $n$, it cannot have more zeros).

Another important observation about the zeros is the following

Lemma. Let $(a,b)$ and $\{p_n\}_{n=0}^\infty$ be as in the Theorem above. Then $p_n$ and $p_{n+1}$ have no common zeros.

Proof. Assume $p_{n+1}(\lambda) = p_n(\lambda) = 0$ for some $\lambda$. Then $p_{n-1}(\lambda) = 0$ according to the recurrence formula (7). Repeating this argument, we can conclude that $p_{n-2}(\lambda) = \dots = p_0(\lambda) = 0$. However, $p_0$ must be a constant different from zero. This contradiction shows that the assumption $p_{n+1}(\lambda) = p_n(\lambda) = 0$ cannot hold.

Much more than this is true: the zeros of $p_{n+1}$ and $p_n$ are interlaced, in the sense that every zero of $p_n$ is located between adjacent zeros of $p_{n+1}$, and exactly one zero of $p_n$ is located between two adjacent zeros of $p_{n+1}$. This can be established by induction on $n$ with the aid of (7) by analyzing how the sign changes of $p_n(x)$ and $p_{n+1}(x)$ are related. We omit the details.

The Christoffel-Darboux formula. Let $(a,b)$ and $\{p_n\}_{n=0}^\infty$ be as in the Theorem above, and let $k \ge 0$. We have
$$a_{k+1}\bigl(p_{k+1}(x)p_k(y) - p_k(x)p_{k+1}(y)\bigr) = \bigl(a_{k+1}p_{k+1}(x)\bigr)p_k(y) - p_k(x)\bigl(a_{k+1}p_{k+1}(y)\bigr)$$
$$= \bigl((x - b_k)p_k(x) - a_kp_{k-1}(x)\bigr)p_k(y) - p_k(x)\bigl((y - b_k)p_k(y) - a_kp_{k-1}(y)\bigr)$$
$$= (x - y)p_k(x)p_k(y) + a_k\bigl(p_k(x)p_{k-1}(y) - p_{k-1}(x)p_k(y)\bigr);$$
the second equality here was obtained by using the recurrence equation (7) to express $a_{k+1}p_{k+1}(x)$ and $a_{k+1}p_{k+1}(y)$. Note that the above calculation is true even for $k = 0$ if we take $p_{-1} \equiv 0$ (this choice was needed to make (7) valid for $n = 0$). Dividing this equation by $x - y$ and rearranging it, we obtain
$$p_k(x)p_k(y) = a_{k+1}\frac{p_{k+1}(x)p_k(y) - p_k(x)p_{k+1}(y)}{x - y} - a_k\frac{p_k(x)p_{k-1}(y) - p_{k-1}(x)p_k(y)}{x - y}.$$
Summing this for $k = 0, 1, \dots, n$, the sum on the right-hand side telescopes; taking into account that $p_{-1} \equiv 0$, we can see that the second fraction on the right-hand side for $k = 0$ is zero. Hence we get
$$(8)\qquad \sum_{k=0}^n p_k(x)p_k(y) = a_{n+1}\frac{p_{n+1}(x)p_n(y) - p_n(x)p_{n+1}(y)}{x - y}.$$
This is called the Christoffel-Darboux formula. Making $y \to x$, we obtain
$$\sum_{k=0}^n p_k^2(x) = a_{n+1}\Bigl(p_n(x)\lim_{y\to x}\frac{p_{n+1}(y) - p_{n+1}(x)}{y - x} - p_{n+1}(x)\lim_{y\to x}\frac{p_n(y) - p_n(x)}{y - x}\Bigr) = a_{n+1}\bigl(p_n(x)p'_{n+1}(x) - p_{n+1}(x)p'_n(x)\bigr).$$

C. W. Clenshaw's summation method. Given a system of orthonormal polynomials $p_n$, the sum
$$f_n(x) = \sum_{k=0}^{n-1}c_kp_k(x)$$
can be efficiently evaluated as follows. Write $y_{n+1} = y_n = 0$, and put
$$y_k = \frac{x - b_k}{a_{k+1}}y_{k+1} - \frac{a_{k+1}}{a_{k+2}}y_{k+2} + c_k$$
for $k = n-1, n-2, \dots, 1, 0$. Expressing $c_k$ from this equation and substituting it into the above sum, we obtain
$$f_n(x) = \sum_{k=0}^{n-1}\Bigl(y_k - \frac{x - b_k}{a_{k+1}}y_{k+1} + \frac{a_{k+1}}{a_{k+2}}y_{k+2}\Bigr)p_k(x).$$
Given $m$ with $2 \le m \le n-1$, the quantity $y_m$ occurs on the right-hand side in the terms corresponding to $k = m$, $k = m-1$, and $k = m-2$. Collecting these terms, we can see that the sum of the terms involving $y_m$ on the right-hand side is
$$y_m\Bigl(p_m(x) - \frac{x - b_{m-1}}{a_m}p_{m-1}(x) + \frac{a_{m-1}}{a_m}p_{m-2}(x)\Bigr).$$
According to (7), the quantity in the parentheses is zero. Therefore, all terms involving $y_k$ for $k \ge 2$ add up to zero in the sum expressing $f_n(x)$. Hence
$$f_n(x) = \Bigl(y_0 - \frac{x - b_0}{a_1}y_1\Bigr)p_0(x) + y_1p_1(x).$$
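A minimal C sketch of this summation method (not one of the book's programs) follows; the orthonormal Legendre recurrence coefficients on $(-1,1)$ and the test coefficients $c_k$ are assumptions made for the illustration.

#include <stdio.h>
#include <math.h>

/* assumed recurrence data: orthonormal Legendre on (-1,1), w(x) = 1;
   a_0 = 1 is the arbitrary choice mentioned above and is never used */
static double a(int n) { return n ? n/sqrt(4.0*n*n - 1.0) : 1.0; }
static double b(int n) { (void)n; return 0.0; }

/* Clenshaw evaluation of f_n(x) = sum_{k=0}^{n-1} c_k p_k(x) */
double clenshaw(const double c[], int n, double x) {
    double y2 = 0.0, y1 = 0.0;               /* y_{k+2}, y_{k+1} */
    for (int k = n - 1; k >= 0; k--) {
        double y0 = (x - b(k))/a(k+1)*y1 - a(k+1)/a(k+2)*y2 + c[k];
        y2 = y1;
        y1 = y0;
    }
    /* now y1 holds y_0 and y2 holds y_1; close with the two-term formula */
    double p0 = 1.0/sqrt(2.0);               /* p_0 = gamma_0 */
    double p1 = (x - b(0))/a(1)*p0;          /* from (7) with p_{-1} = 0 */
    return (y1 - (x - b(0))/a(1)*y2)*p0 + y2*p1;
}

int main(void) {
    double c[4] = {0.7, -0.3, 0.25, 0.1};    /* arbitrary test coefficients */
    printf("f_4(0.3) = %.12f\n", clenshaw(c, 4, 0.3));
    return 0;
}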

45. GAUSS'S QUADRATURE FORMULA

Given an interval $(a,b)$, let $w(x)$ be positive on $(a,b)$, and let $\{p_k\}_{k=0}^\infty$ be a system of orthonormal polynomials on $(a,b)$ with respect to the weight function $w(x)$. Fix $n$, and let $x_1, x_2, \dots, x_n$ be the zeros of $p_n$; as we proved above, the $x_i$ are all distinct, real, and they lie in the interval $(a,b)$. For $i$ with $1 \le i \le n$, let $l_i$ be the Lagrange fundamental polynomials
$$(1)\qquad l_i(x) = \prod_{\substack{j=1\\ j\ne i}}^n\frac{x - x_j}{x_i - x_j} = \frac{p_n(x)}{p'_n(x_i)(x - x_i)}$$
considered on account of the Lagrange interpolation. (In Section 5 on Lagrange interpolation, we defined $l_i$ by the first equality. The second equality was proved there in the form $l_i(x) = p(x)/\bigl(p'(x_i)(x - x_i)\bigr)$, where $p(x) = \prod_{j=1}^n(x - x_j)$. It is easy to see that we have $p_n(x) = \gamma_np(x)$ for this $p(x)$ (as $p_n(x)$ and $p(x)$ have the same zeros, they must agree up to a constant factor), so the second equation above for $l_i$ follows.) The numbers
$$\Lambda_i = \int_a^b l_i(x)w(x)\,dx$$
are called Christoffel numbers. ($\Lambda_i$ of course also depends on $n$; one may write $\Lambda_{in}$ instead of $\Lambda_i$ if one wants to indicate this dependence.) They play an important role in the Gauss Quadrature Formula; this formula is also called the Gauss-Jacobi Quadrature Formula. The term quadrature is a synonym for integration; the term is somewhat old fashioned, but it survives in certain contexts, in that in quadrature one seeks a square having area equal to a certain area (in the case of integration, the area under a curve).

Theorem (Gauss). For any polynomial $p(x)$ of degree less than $2n$ we have
$$\int_a^b p(x)w(x)\,dx = \sum_{i=1}^n p(x_i)\Lambda_i.$$

Proof. According to the polynomial division algorithm, we have
$$p(x) = q(x)p_n(x) + r(x),$$
where $q(x)$ and $r(x)$ are polynomials of degree at most $n-1$; one can call this equation a division equation. We have
$$\int_a^b q(x)p_n(x)w(x)\,dx = 0$$
in view of the orthogonality relation (cf. formula (2) in Section 44 on Orthogonal Polynomials). Furthermore,
$$r(x) = \sum_{i=1}^n r(x_i)l_i(x) = \sum_{i=1}^n p(x_i)l_i(x)$$
according to the Lagrange interpolation formula; the second equation holds because $x_i$ is a zero of $p_n$ (i.e., $p_n(x_i) = 0$), and so $r(x_i) = p(x_i)$ according to the division equation above. Multiplying the division equation by $w(x)$ and then integrating on $(a,b)$, we obtain
$$\int_a^b p(x)w(x)\,dx = \int_a^b r(x)w(x)\,dx = \int_a^b\sum_{i=1}^n p(x_i)l_i(x)w(x)\,dx = \sum_{i=1}^n p(x_i)\Lambda_i;$$
the first equation here holds in view of the integral of $q(x)p_n(x)w(x)$ being zero, and the third equation holds in view of the definition of the Christoffel numbers.

It is worth pointing out that $\Lambda_i > 0$. This is because the Gauss Quadrature Formula is applicable with the polynomial $l_k^2(x)$ $(1 \le k \le n)$, since this polynomial has degree $2n-2$. We have
$$0 < \int_a^b l_k^2(x)w(x)\,dx = \sum_{i=1}^n l_k^2(x_i)\Lambda_i = \Lambda_k;$$
the left-hand side here is of course positive, and the second equation holds because $l_k(x_i) = \delta_{ki}$, showing that $\Lambda_i > 0$ as claimed.

The Christoffel-Darboux formula can be used to evaluate $\Lambda_i$. Namely, we have
$$a_{n+1}p'_n(x_i)p_{n+1}(x_i)l_i(x) = a_{n+1}\frac{p_n(x)p_{n+1}(x_i)}{x - x_i} = -a_{n+1}\frac{p_{n+1}(x)p_n(x_i) - p_n(x)p_{n+1}(x_i)}{x - x_i} = -\sum_{k=0}^n p_k(x)p_k(x_i);$$
the first equation here holds in view of (1), the second equation holds because $x_i$ is a zero of $p_n$ (i.e., $p_n(x_i) = 0$), and the third one because of the Christoffel-Darboux formula. Multiplying both sides by $w(x)$ and integrating on $(a,b)$, we obtain
$$a_{n+1}p'_n(x_i)p_{n+1}(x_i)\Lambda_i = -\sum_{k=0}^n p_k(x_i)\int_a^b p_k(x)w(x)\,dx = -1.$$
To see why the last equality holds, note that $p_0(x)$ is a nonzero constant, hence we have $p_0(x) = p_0(x_i) \ne 0$; so
$$\int_a^b p_k(x)w(x)\,dx = \frac{1}{p_0(x_i)}\int_a^b p_k(x)p_0(x)w(x)\,dx = \frac{1}{p_0(x_i)}\delta_{k0},$$
and hence the sum above equals $p_0(x_i)\cdot 1/p_0(x_i) = 1$. Thus,
$$(2)\qquad \Lambda_i = -\frac{1}{a_{n+1}p'_n(x_i)p_{n+1}(x_i)}.$$
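The theorem is easy to try out in the special case of the Chebyshev weight $w(x) = 1/\sqrt{1-x^2}$ on $(-1,1)$, for which the nodes $x_i = \cos\bigl(\pi(i-1/2)/n\bigr)$ and the Christoffel numbers $\Lambda_i = \pi/n$ are known in closed form (this is derived in the section on the Chebyshev polynomials below). The following minimal C sketch (not from the book) integrates $x^4$; this polynomial has degree less than $2n$ for $n = 3$, so the quadrature is exact up to roundoff, the exact value being $3\pi/8$.

#include <stdio.h>
#include <math.h>

/* Gauss quadrature for the Chebyshev weight w(x) = 1/sqrt(1-x^2) on (-1,1):
   the nodes are x_i = cos((i-1/2)pi/n) and all Christoffel numbers are pi/n */
double gauss_chebyshev(double (*f)(double), int n) {
    const double pi = 3.14159265358979323846;
    double s = 0.0;
    for (int i = 1; i <= n; i++)
        s += f(cos((i - 0.5)*pi/n));
    return pi/n*s;
}

double f(double x) { return x*x*x*x; }

int main(void) {
    const double pi = 3.14159265358979323846;
    printf("n=3 result: %.15f\n", gauss_chebyshev(f, 3));
    printf("exact 3*pi/8: %.15f\n", 3*pi/8);
    return 0;
}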

46. THE CHEBYSHEV POLYNOMIALS

The Chebyshev polynomials are defined as
$$T_n(\cos\theta) = \cos n\theta \qquad (n = 0, 1, 2, \dots).$$
(They are named after P. L. Chebyshev, a Russian mathematician of the 19th century. His name is often transliterated into French as Tchebichef; this is why these polynomials are usually denoted as $T_n$.) The above formula indeed defines $T_n$ as a polynomial, since we have
$$\cos n\theta + i\sin n\theta = (\cos\theta + i\sin\theta)^n = \sum_{k=0}^n\binom{n}{k}i^k\cos^{n-k}\theta\,\sin^k\theta,$$
where $i = \sqrt{-1}$; the first equation here holds according to de Moivre's formula, and the second one holds in view of the Binomial Theorem. Taking real parts on both sides, we have an equation for $\cos n\theta$. The real part on the right-hand side will only involve even values of $k$, that is, even powers of $\sin\theta$, i.e., integral powers of $\sin^2\theta = 1 - \cos^2\theta$. (The word integral here is the adjectival form of the word integer, and has nothing to do with the word integration, except in an etymological sense.) This shows that $\cos n\theta$ is indeed a polynomial of $\cos\theta$. For example, we have
$$(1)\qquad T_0(x) = 1,\qquad T_1(x) = x,\qquad\text{and}\qquad T_2(x) = 2x^2 - 1;$$
the last equation corresponds to the identity $\cos2\theta = 2\cos^2\theta - 1$.

Given arbitrary nonnegative integers $m$ and $n$, we have
$$\int_0^\pi\cos m\theta\cos n\theta\,d\theta = \frac12\int_0^\pi\bigl(\cos(m+n)\theta + \cos(m-n)\theta\bigr)\,d\theta.$$
The integral of both cosine terms is zero unless $m+n$ or $m-n$ is zero, in which case the integral of the corresponding term is $\pi$. Thus we obtain that
$$\int_0^\pi\cos m\theta\cos n\theta\,d\theta = \begin{cases} 0 &\text{if } m \ne n,\\ \pi/2 &\text{if } m = n > 0,\\ \pi &\text{if } m = n = 0.\end{cases}$$

Noting that the substitution $\theta = \arccos x$ gives
$$d\theta = -\frac{dx}{\sqrt{1-x^2}},$$
we have
$$\int_0^\pi\cos m\theta\cos n\theta\,d\theta = -\int_1^{-1}\frac{T_m(x)T_n(x)}{\sqrt{1-x^2}}\,dx = \int_{-1}^1\frac{T_m(x)T_n(x)}{\sqrt{1-x^2}}\,dx.$$
Therefore,
$$\int_{-1}^1\frac{T_m(x)T_n(x)}{\sqrt{1-x^2}}\,dx = \begin{cases} 0 &\text{if } m \ne n,\\ \pi/2 &\text{if } m = n > 0,\\ \pi &\text{if } m = n = 0;\end{cases}$$
i.e., the polynomials $T_n$ are orthogonal (but not orthonormal) on the interval $(-1,1)$ with respect to the weight function $\frac{1}{\sqrt{1-x^2}}$.

The recurrence equations for the polynomials $T_n$ can be written as $xT_0(x) = T_1(x)$ and
$$(2)\qquad xT_n(x) = \frac12\bigl(T_{n+1}(x) + T_{n-1}(x)\bigr)\qquad\text{for } n \ge 1.$$
These equations are just translations of the equations
$$\cos\theta\cos0\theta = \cos1\theta\qquad\text{and}\qquad \cos\theta\cos n\theta = \frac12\bigl(\cos(n+1)\theta + \cos(n-1)\theta\bigr).$$

It is simpler to work out Clenshaw's recurrence equations for calculating the sum
$$f_n(x) = \frac{c_0}{2} + \sum_{k=1}^{n-1}c_kT_k(x)$$
directly than to use the equations worked out in the section on Orthogonal Polynomials. (It is customary to take $c_0/2$ rather than $c_0$ in this sum; the reason is that the integral of $T_0^2$ with respect to the weight function is twice the integral of $T_k^2$ for $k > 0$. A similar convention is used for the sum $\frac12 + \sum_{k=1}^n\cos k\theta$; this sum is called the Dirichlet kernel, and is usually denoted as $D_n(\theta)$.) Put $y_{n+1} = y_n = 0$ and write
$$y_k = 2xy_{k+1} - y_{k+2} + c_k$$
for $k = n-1, \dots, 2, 1$. Expressing $c_k$ from this equation, we have
$$f_n(x) = \frac{c_0}{2} + \sum_{k=1}^{n-1}(y_k - 2xy_{k+1} + y_{k+2})T_k(x).$$
For $m$ with $3 \le m \le n-1$ the sum of the terms involving $y_m$ on the right-hand side is
$$y_m\bigl(T_m(x) - 2xT_{m-1}(x) + T_{m-2}(x)\bigr) = 0,$$

where the equality holds in view of the recurrence equation (2). Thus, only the terms involving $y_1$ and $y_2$ make a contribution to the sum describing $f_n(x)$, and we have
$$f_n(x) = \frac{c_0}{2} + y_1T_1(x) - 2xy_2T_1(x) + y_2T_2(x) = \frac{c_0}{2} + y_1x - 2xy_2x + y_2(2x^2 - 1) = \frac{c_0}{2} + xy_1 - y_2;$$
for the second equation, cf. (1).

Assume $f(x)$ is a polynomial of degree at most $n$. Then we have
$$f_n(x) = \frac{c_0}{2} + \sum_{k=1}^{n-1}c_kT_k(x)$$
with
$$(3)\qquad c_k = \frac{2}{\pi}\int_{-1}^1\frac{f(x)T_k(x)}{\sqrt{1-x^2}}\,dx \qquad (0 \le k < n);$$
the factor $2/\pi$ comes about as the reciprocal of the integral of $T_k^2$ (with respect to the weight function). In case $k = 0$ this integral is $\pi$; we still have $2/\pi$ in (3) for $k = 0$, since we took $c_0/2$ in the expansion of $f_n(x)$. Since the degree of $f(x)T_k(x)$ is at most $2n-1$ for $k < n$, we can use the Gauss Quadrature Formula with the zeros of $T_n$ to evaluate this integral.

For this, we first need the orthonormal Chebyshev polynomials $t_n$:
$$t_0(x) = \frac{1}{\sqrt\pi}T_0(x)\qquad\text{and}\qquad t_n(x) = \sqrt{\frac{2}{\pi}}\,T_n(x)\quad (n > 0).$$
The recurrence formula for the orthonormal Chebyshev polynomials can be written as follows:
$$(4)\qquad xt_0(x) = \frac{1}{\sqrt2}t_1(x),\qquad xt_1(x) = \frac12t_2(x) + \frac{1}{\sqrt2}t_0(x),\qquad xt_n(x) = \frac12t_{n+1}(x) + \frac12t_{n-1}(x)\quad (n \ge 2).$$
We are going to use formula (2) of Section 45 on the Gauss Quadrature Formula to evaluate $\Lambda_i$, where $x_i$ is a zero of $T_n(x)$. In that formula, $a_{n+1} = 1/2$ for $n \ge 1$ according to the coefficient in the recurrence equation (4) for the orthonormal Chebyshev polynomials; this is why we need to require $n \ge 1$ in the next displayed formula. Thus,
$$\frac{1}{\Lambda_i} = -\frac12t'_n(x_i)t_{n+1}(x_i) = -\frac12\cdot\frac{2}{\pi}T'_n(x_i)T_{n+1}(x_i) = -\frac{1}{\pi}T'_n(x_i)T_{n+1}(x_i)\qquad (n \ge 1).$$
Writing $x = \cos\theta$, we have
$$T'_n(x) = \frac{d\cos n\theta}{d\cos\theta} = \frac{n\sin n\theta\,d\theta}{\sin\theta\,d\theta} = \frac{n\sin n\theta}{\sin\theta}.$$
Further, writing $x_i = \cos\theta_i$ and noting that $0 = T_n(x_i) = \cos n\theta_i$, we have
$$T_{n+1}(x_i) = \cos(n+1)\theta_i = \cos(n\theta_i + \theta_i) = \cos n\theta_i\cos\theta_i - \sin n\theta_i\sin\theta_i = -\sin n\theta_i\sin\theta_i.$$

Hence
$$T'_n(x_i)T_{n+1}(x_i) = -n\sin^2 n\theta_i = -n(1 - \cos^2 n\theta_i) = -n,$$
and thus we have
$$\Lambda_i = \frac{\pi}{n}.$$
Here $\theta_i$ are the zeros of $\cos n\theta$ with $0 < \theta < \pi$, that is, $\theta_i = \pi\frac{i-1/2}{n}$ for $i = 1, 2, \dots, n$. (As we saw in the discussion of orthogonal polynomials, the zeros of $T_n(x)$ all lie in the interval $(-1,1)$, and this interval corresponds to the interval $(0,\pi)$ for $\theta$, since the mapping $\theta\mapsto\cos\theta$ maps $(0,\pi)$ onto $(-1,1)$ in a one-to-one way. Of course, we could choose another interval for $\theta$ that the mapping $\theta\mapsto\cos\theta$ maps onto the interval $[-1,1]$ in a one-to-one way.) Hence, by the Gauss Quadrature Formula, (3) becomes
$$c_k = \frac{2}{n}\sum_{i=1}^n f(x_i)T_k(x_i),$$
that is,
$$(5)\qquad c_k = \frac{2}{n}\sum_{i=1}^n f\Bigl(\cos\pi\frac{i-1/2}{n}\Bigr)\cos k\pi\frac{i-1/2}{n}.$$
This formula can be used for numerical calculations.

Next we discuss a program incorporating formula (5). The header file chebyshev.h contains the declarations of the functions to be described below:

 1  #include <stdio.h>
 2  #include <math.h>
 3  #include <stdlib.h>
 4
 5  #define absval(x) ((x) >= 0.0 ? (x) : (-(x)))
 6  #define square(x) ((x)*(x))
 7
 8  double *allocvector(int n);
 9  double chebyshev(double x, double *c, int n);
10  double funct(double x);
11  void chebcoeffs(double (*funct)(double), double *c, int n);

The file alloc.c now only needs to define vectors; matrices are not needed:

 1  #include "chebyshev.h"
 2
 3  double *allocvector(int n)
 4  {
 5    double *b;
 6    b=(double *) malloc((size_t)((n+1)*sizeof(double)));
 7    if (!b) {
 8      printf("Allocation failure 1 in vector\n");
 9      exit(1);
10    }
11    return b;
12  }

The definition of the function allocvector is the same as before. The main calculations take place in the file chebyshev.c:

 1  #include "chebyshev.h"
 2  #define F(x) ((*fnct)(x))
 3
 4  void chebcoeffs(double (*fnct)(double), double *c, int n)
 5  {
 6    const double pi=3.1415926535897931;
 7    int i,j;
 8    double two_over_n, *fvector;
 9    long double sum;
10    fvector=allocvector(n);
11    two_over_n=2.0/n;
12    for (j=1;j<=n;j++) {
13      fvector[j]=F(cos(pi*(j-0.5)/n));
14    }
15    for (i=0;i<n;i++) {
16      sum=0.;
17      for (j=1;j<=n;j++)
18        sum += fvector[j]*cos(pi*i*(j-0.5)/n);
19      c[i]=two_over_n*sum;
20    }
21    free(fvector);
22  }
23
24  double chebyshev(double x, double *c, int n)
25  {
26    int i;
27    double fx, oldfx, newfx;
28    oldfx=0.; fx=0.;
29    for (i=n-1;i>=1;i--) {
30      newfx=2.*x*fx-oldfx+c[i];
31      oldfx=fx;
32      fx=newfx;
33    }
34    return x*fx-oldfx+0.5*c[0];
35  }

The function chebcoeffs has a pointer to the function *fnct as its first parameter; in the program this function is referred to as F (cf. the #define on line 2). The second parameter, the vector *c, will contain the coefficients $c_k$ on return, and the last parameter, the integer n, corresponds to $n$ in formula (5). The calculation in lines 5–22 is a straightforward implementation of formula (5). One point worth mentioning is that the inner product calculated in line 18 is accumulated in the variable sum of type long double.

Once the coefficients $c_k$ have been calculated, the value of $f_n(x)$ can be calculated with the aid of the Clenshaw recurrence equations. This is accomplished by the function chebyshev in lines 24–35.

The file funct.c defines the function that is evaluated with the aid of Chebyshev polynomials:

 1  #include "chebyshev.h"
 2
 3  double funct(double x)
 4  {
 5    double value;
 6    value = 1.0/(1.0+x*x);
 7    return(value);
 8  }

This file defines the function funct as
$$f(x) = \frac{1}{1 + x^2}.$$
These programs were used with the calling program in the file main.c:

 1  #include "chebyshev.h"
 2
 3  main()
 4  {
 5    int n=40;
 6    double *c, x, y;
 7    c=allocvector(n-1);
 8    chebcoeffs(&funct,c,n);
 9    x=.34753; y=chebyshev(x,c,n);
10    printf(" x Chebyshev approx"
11      " function value\n\n");
12    printf("%10.8f %20.16f %20.16f\n", x, y, funct(x));
13    free(c);
14  }

On line 8, the function chebcoeffs is called to calculate forty Chebyshev coefficients. Then the function is evaluated at a single location on line 9, and the result is compared to the actual value of the function. The printout of this program is as follows:

 1      x       Chebyshev approx      function value
 2
 3  0.34753000  0.8922380723133848  0.8922380723133866

This shows good agreement between the actual function value and the one obtained by Chebyshev expansion. This example is somewhat pointless, however; the real role of Chebyshev polynomials is to evaluate functions repeatedly when these functions are expensive to evaluate. That is, by calculating the coefficients in the Chebyshev expansion of $f(x)$ we can replace further evaluations of $f$ with evaluations of the Chebyshev expansion. Chebyshev polynomials can be used to approximate a function with uniform accuracy on an interval, as opposed to Taylor polynomials, which approximate a function very well near the base point but less well away from this point. For example, if one finds that, in evaluating a power series, one needs to calculate too many terms near the endpoints of an interval, one can reduce the amount of calculation by using Chebyshev polynomials.

47. BOUNDARY VALUE PROBLEMS AND THE FINITE ELEMENT METHOD

In the discussion of a boundary value problem, we included a problem from classical mechanics as background. While this background can be skipped and the discussion of the differential equation below can be followed, some of the motivation would be lost.

Background from Lagrangian mechanics. The Lagrangian of a mechanical system is given as $L = T - V$, where $T$ is the total kinetic energy, and $V$ is the potential energy. These quantities are described in terms of the position variables $q_1, q_2, \dots, q_f$, and in terms of their time derivatives $\dot q_1, \dot q_2, \dots, \dot q_f$, called generalized velocities; here $\dot q_i(t) = dq_i(t)/dt$. Explicit dependence on time $t$ is also allowed. Here $f$ is a positive integer, called the degree of freedom of the system. For example, if one considers the motion of $n$ point particles in three-dimensional space, we will have $f = 3n$, with three Cartesian coordinates for each particle; but, depending on the nature of the system, one may find it convenient to use some coordinates other than the Cartesian coordinates to describe the system. Assume that the system starts at the point $A = (a_1, \dots, a_f)$ at time $t = t_0$ and ends up at the point $B = (b_1, \dots, b_f)$ at time $t = t_1$.

In order to describe the evolution of the system, one wants to find the position coordinates as functions of time; that is, one wants to find functions $q_1(t), q_2(t), \dots, q_f(t)$ with
$$q_1(t_0) = a_1,\quad q_2(t_0) = a_2,\quad\dots,\quad q_f(t_0) = a_f,$$
and
$$q_1(t_1) = b_1,\quad q_2(t_1) = b_2,\quad\dots,\quad q_f(t_1) = b_f.$$
According to Hamilton's principle, the evolution of the system takes place in a way for which the integral
$$I = \int_{t_0}^{t_1}L(q_i(t), \dot q_i(t), t)\,dt$$
is minimal (sort of – more about this later). The above integral is called the action of the mechanical system.

As an example, in describing the motion of a planet of mass $m$ around the sun of mass $M$, one may assume that the sun is always at the center of the coordinate system. (This is, of course, not really true, but since the sun's mass is so much larger than that of the planet, the sun's movement is neglected.) The planet is assumed to move in the $xy$ plane, but instead of the usual Cartesian coordinates, its position will be described with polar coordinates $\rho$ (the polar radius) and $\theta$ (the polar angle), and one can take $q_1 = \rho$ and $q_2 = \theta$. Then the potential energy of the planet is
$$V = -\gamma\frac{Mm}{\rho},$$
and the kinetic energy is
$$T = \frac{m(\dot\rho^2 + \rho^2\dot\theta^2)}{2},$$
so the Lagrangian is
$$L = \frac{m(\dot\rho^2 + \rho^2\dot\theta^2)}{2} + \gamma\frac{Mm}{\rho}.$$
This system has a degree of freedom $f = 2$.

In order to find the minimum of the above integral $I$, one replaces the paths $q_i(t)$ by nearby paths $q_i(t) + \delta_i(t)$, where $\delta_i(t)$ is assumed to be small, and $\delta_i(t_0) = \delta_i(t_1) = 0$; i.e., the two endpoints of the paths do not change. Then
$$L(q_i(t) + \delta_i(t), \dot q_i(t) + \dot\delta_i(t), t) \approx L(q_i(t), \dot q_i(t), t) + \sum_{k=1}^f\Bigl(\frac{\partial L(q_i(t), \dot q_i(t), t)}{\partial q_k}\delta_k(t) + \frac{\partial L(q_i(t), \dot q_i(t), t)}{\partial\dot q_k}\dot\delta_k(t)\Bigr),$$
where the change in $L$ is approximated by the total differential (this approximation implies that not only $\delta_k(t)$, but also $\dot\delta_k(t)$ is assumed to be small). So the change, or variation, of the above integral can be described approximately as
$$\delta I = \delta\int_{t_0}^{t_1}L(q_i(t), \dot q_i(t), t)\,dt = \int_{t_0}^{t_1}\sum_{k=1}^f\Bigl(\frac{\partial L(q_i(t), \dot q_i(t), t)}{\partial q_k}\delta_k(t) + \frac{\partial L(q_i(t), \dot q_i(t), t)}{\partial\dot q_k}\dot\delta_k(t)\Bigr)dt.$$
Hamilton's principle states that $\delta I$ is zero – not that $I$ is minimal. To this first order approximation, this means that if the path is slightly changed then the above integral does not change in the first order (i.e., its change is much smaller than the change in the path); this is described by saying that the system evolves along a stationary path, the same way as the derivative of a function at a place of minimum must be zero.

Hamilton's principle gives a kind of teleological formulation of Newtonian mechanics. (Teleology is the doctrine that final causes, i.e., goals or purposes, exist.) In Newtonian mechanics, the particles move along paths determined by forces acting upon them; in the formulation given by Hamilton's principle, the particles have a goal of getting from point $A$ to point $B$, and they choose a path which minimizes the action integral – the mechanical system, knowing in advance where it wants to go, finds a way to get there in such a manner that involves the least action. Hamilton's principle goes back to earlier teleological principles, such as Fermat's principle that light between two points travels in a way that takes the least amount of time. Fermat used this to explain the law of refraction: light travels more slowly in water than in air, so if light starts out at a point in air to go to another point under water, then going along a straight line is not the fastest way for the light to get there; it is faster to travel a longer path in air, and a shorter path in water. Pierre de Fermat (1601–1665) stated his principle in 1662; at that time the speed of light had not been measured – René Descartes (1596–1650) thought that the speed of light was infinite. The speed of light was first measured by the Danish astronomer Ole Rømer in 1676 by studying the orbits of the satellite Io of Jupiter, and the first measurement of the speed of light on earth was carried out by the French physicist Hippolyte Fizeau in 1849. So Fermat must have relied on some assumptions unverifiable at the time he stated his principle.

The area of mathematics that seeks to minimize integrals of the above type is called the calculus of variations. Using integration by parts, we have

$$\int_{t_0}^{t_1}\frac{\partial L(q_i(t), \dot q_i(t), t)}{\partial\dot q_k}\dot\delta_k(t)\,dt = \frac{\partial L(q_i(t), \dot q_i(t), t)}{\partial\dot q_k}\delta_k(t)\Bigr|_{t=t_0}^{t=t_1} - \int_{t_0}^{t_1}\frac{d}{dt}\frac{\partial L(q_i(t), \dot q_i(t), t)}{\partial\dot q_k}\,\delta_k(t)\,dt = -\int_{t_0}^{t_1}\frac{d}{dt}\frac{\partial L(q_i(t), \dot q_i(t), t)}{\partial\dot q_k}\,\delta_k(t)\,dt;$$
here, the second equality holds since we have $\delta_k(t_0) = \delta_k(t_1) = 0$, and so the integrated-out part is zero. Taking this into account, the equation $\delta I = 0$ can be written as
$$\int_{t_0}^{t_1}\sum_{k=1}^f\Bigl(\frac{\partial L(q_i(t), \dot q_i(t), t)}{\partial q_k} - \frac{d}{dt}\frac{\partial L(q_i(t), \dot q_i(t), t)}{\partial\dot q_k}\Bigr)\delta_k(t)\,dt = 0.$$
This integral must be zero for all choices of small $\delta_k(t)$ for which $\delta_k(t_0) = \delta_k(t_1) = 0$. To see what this implies, choose $\delta_k(t)$ to be of the same sign as the quantity in the parentheses; for example, one can choose
$$\delta_k(t) = \epsilon(t - t_0)(t_1 - t)\cdot\Bigl(\frac{\partial L(q_i(t), \dot q_i(t), t)}{\partial q_k} - \frac{d}{dt}\frac{\partial L(q_i(t), \dot q_i(t), t)}{\partial\dot q_k}\Bigr);$$
here a small positive $\epsilon$ ensures that $\delta_k(t)$ is small, and the factor $(t - t_0)(t_1 - t)$ (where $t_0 < t_1$) ensures that $\delta_k(t_0) = \delta_k(t_1) = 0$. The integral can then be zero only if
$$(1)\qquad \frac{\partial L(q_i(t), \dot q_i(t), t)}{\partial q_k} - \frac{d}{dt}\frac{\partial L(q_i(t), \dot q_i(t), t)}{\partial\dot q_k} = 0$$
for $k = 1, 2, \dots, f$. Equations (1) are the Euler-Lagrange differential equations describing the mechanical system. They originate in a letter that Lagrange sent to Euler in 1755, when he was only 19 years old. The Lagrangian formulation of mechanics plays a very important part in modern physics, in that it led to a reformulation of quantum mechanics by Richard Feynman.
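As a quick check of (1) on a standard example (not worked out in the text), consider a particle of mass $m$ on a spring with spring constant $k$, so that $T = m\dot x^2/2$ and $V = kx^2/2$. Then
$$L = \frac{m\dot x^2}{2} - \frac{kx^2}{2},\qquad \frac{\partial L}{\partial x} = -kx,\qquad \frac{d}{dt}\frac{\partial L}{\partial\dot x} = \frac{d}{dt}(m\dot x) = m\ddot x,$$
and (1) becomes $-kx - m\ddot x = 0$, that is, $m\ddot x = -kx$, which is Newton's equation of motion for the harmonic oscillator.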

Problem 1. Write the Euler-Lagrange equations for the planetary motion example above.

Solution. The position variables are $\rho$ and $\theta$. For the variable $\rho$ we have
$$\frac{\partial L}{\partial\rho} = m\rho\dot\theta^2 - \gamma\frac{Mm}{\rho^2}\qquad\text{and}\qquad \frac{d}{dt}\frac{\partial L}{\partial\dot\rho} = \frac{d}{dt}(m\dot\rho) = m\ddot\rho,$$
and the corresponding Euler-Lagrange equation $\dfrac{\partial L}{\partial\rho} - \dfrac{d}{dt}\dfrac{\partial L}{\partial\dot\rho} = 0$ is
$$m\rho\dot\theta^2 - \gamma\frac{Mm}{\rho^2} - m\ddot\rho = 0.$$
For the variable $\theta$ we have
$$\frac{d}{dt}\frac{\partial L}{\partial\dot\theta} = \frac{d}{dt}\bigl(m\rho^2\dot\theta\bigr) = 2m\rho\dot\rho\dot\theta + m\rho^2\ddot\theta\qquad\text{and}\qquad \frac{\partial L}{\partial\theta} = 0,$$
and the corresponding Euler-Lagrange equation $\dfrac{\partial L}{\partial\theta} - \dfrac{d}{dt}\dfrac{\partial L}{\partial\dot\theta} = 0$ is
$$-2m\rho\dot\rho\dot\theta - m\rho^2\ddot\theta = 0.$$

A boundary value problem. Consider the differential equation
$$(2)\qquad y'' + f(x) = 0$$
on the interval $[0,1]$; that is, one wants to find a function $y = y(x)$ satisfying the above equation under the conditions that $y(0) = a$ and $y'(1) = b$. This type of problem is called a boundary-value problem, since the constraints on $y$ are at the two end points (the boundary) of the domain on which we want to find $y$. It is easy to solve this problem explicitly; this is, however, a side issue here, since we want to discuss a numerical method applicable to handling a group of problems, some of which are not explicitly solvable. In order to approach this problem, we will restate it with the aid of the calculus of variations.
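For reference (this is only mentioned, not carried out, in the text), the explicit solution can be written down by integrating twice: the condition $y'(1) = b$ gives
$$y'(x) = b + \int_x^1 f(t)\,dt,$$
and then the condition $y(0) = a$ gives
$$y(x) = a + bx + \int_0^x\!\!\int_s^1 f(t)\,dt\,ds.$$
One checks directly that $y'' = -f$, $y(0) = a$, and $y'(1) = b$.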

First, consider an arbitrary nice (continuous, differentiable, etc. – we will not define here what is meant by that) function $v(x)$ on $[0,1]$ such that $v(0) = 0$. Then the above equation implies that
$$\int_0^1\bigl(y''(x) + f(x)\bigr)v(x)\,dx = 0.$$
Using integration by parts, we can see that
$$\int_0^1 y''(x)v(x)\,dx = y'(x)v(x)\Bigr|_{x=0}^{x=1} - \int_0^1 y'(x)v'(x)\,dx = bv(1) - \int_0^1 y'(x)v'(x)\,dx;$$
the second equation here holds because $v(0) = 0$ and $y'(1) = b$. Hence the above equation becomes
$$(3)\qquad \int_0^1 y'(x)v'(x)\,dx = \int_0^1 f(x)v(x)\,dx + bv(1).$$
This equation needs to be satisfied for every nice function $v(x)$. In a sense, (2) and (3) are equivalent, since the argument used to obtain (3) from (2) can be reversed. (There are some subtle points in reversing the argument that will not be discussed here.) (2) is called the strong statement of the problem, and (3) is the weak statement (or variational statement) of the same problem. An obvious difference is that equation (3) does not seem to require the existence of the second derivative of $y$. This is true, but if one relies on modern integration theory, then the existence of the second derivative is not required for (2), either; what is needed for (2) is that $y'$ be absolutely continuous – we will not define here what is meant by that, but in any case this assumption allows for $y''$ not to exist at many points. One important point is that $v'(x)$ does not need to exist at every point, either: the argument leading from (2) to (3) will work for functions $v(x)$ that are continuous on $[0,1]$ and differentiable everywhere with the exception of finitely many points, where $v(x)$ is assumed only to have left and right derivatives. This is because one can break up the interval $[0,1]$ into finitely many parts and do the integration by parts argument on each subinterval; when adding up the integrated-out terms, these will all cancel except those corresponding to the endpoints 0 and 1 of the interval $[0,1]$. (Using modern integration theory, this argument can be carried out more elegantly.)

In order to obtain a numerical approximation of the solution of equation (3), in Galerkin's approximation method we will restrict the functions $y$ and $v$ to certain subspaces of functions, and these need not be the same. Consider a partition
$$P:\ 0 = x_0 < x_1 < x_2 < \dots < x_N = 1$$
of the interval $[0,1]$, and consider the functions $G_i(x)$ for $i$ with $1 \le i \le N-1$ such that $G_i(x) = 0$ for $x$ outside the interval $(x_{i-1}, x_{i+1})$, $G_i(x_i) = 1$, and $G_i(x)$ is linear on each of the intervals $[x_{i-1}, x_i]$ and $[x_i, x_{i+1}]$. That is,
$$G_i(x) = \begin{cases} 0 &\text{if } x < x_{i-1},\\[2pt] \dfrac{x - x_{i-1}}{x_i - x_{i-1}} &\text{if } x_{i-1} \le x < x_i,\\[2pt] \dfrac{x - x_{i+1}}{x_i - x_{i+1}} &\text{if } x_i \le x < x_{i+1},\\[2pt] 0 &\text{if } x_{i+1} \le x.\end{cases}$$
In addition, we define $G_0(x)$ to be 1 at 0, linear on the interval $[0, x_1]$, and 0 for $x \ge x_1$:
$$G_0(x) = \begin{cases} 1 &\text{if } x < 0,\\[2pt] \dfrac{x - x_1}{-x_1} &\text{if } 0 \le x < x_1,\\[2pt] 0 &\text{if } x_1 \le x;\end{cases}$$
we defined $G_0(x)$ for all $x$, although $G_0(x)$ outside the interval $[0,1]$ is of no interest. Finally, we define $G_N(x)$ to be 1 at 1, linear on the interval $[x_{N-1}, 1]$, and 0 for $x < x_{N-1}$:
$$G_N(x) = \begin{cases} 0 &\text{if } x < x_{N-1},\\[2pt] \dfrac{x - x_{N-1}}{1 - x_{N-1}} &\text{if } x_{N-1} \le x < 1,\\[2pt] 1 &\text{if } 1 \le x;\end{cases}$$
again, $G_N(x)$ outside the interval $[0,1]$ is of no interest. We will seek $y(x)$ in the form
$$y(x) = \sum_{j=0}^N c_jG_j(x) = aG_0(x) + \sum_{j=1}^N c_jG_j(x),$$

where $c_0 = a$ and the $c_j$ for $j \ge 1$ are constants to be determined in such a way that equation (3) be satisfied for all $v(x)$ of the form
$$v(x) = \sum_{k=1}^N\lambda_kG_k(x).$$
These equations ensure that $y(0) = a$ and $v(0) = 0$, as required. If we substitute these choices of $y(x)$ and $v(x)$, equation (3) becomes
$$\sum_{k=1}^N\lambda_k\sum_{j=0}^N c_j\int_0^1 G'_k(x)G'_j(x)\,dx = \sum_{k=1}^N\lambda_k\Bigl(\int_0^1 f(x)G_k(x)\,dx + b\delta_{kN}\Bigr).$$
This equation must be satisfied for any choice of the $\lambda_k$; hence, for each $i = 1, \dots, N$, the coefficient of $\lambda_k$ for $k = i$ on each side of the equation must agree (to see this, pick $\lambda_k = \delta_{ik}$). That is, writing
$$a_{ij} = \int_0^1 G'_i(x)G'_j(x)\,dx\qquad\text{and}\qquad b_j = \int_0^1 f(x)G_j(x)\,dx + b\delta_{jN},$$
we have the equations
$$(4)\qquad \sum_{j=1}^N a_{ij}c_j = b_i - a\,a_{i0}\qquad (1 \le i \le N);$$
here, since we know that $c_0 = a$, the term $-a\,a_{i0}$ on the right-hand side comes from the term for $j = 0$ in the sum on the left-hand side of the above equation. These equations represent $N$ linear equations for the unknowns $c_1, \dots, c_N$, and by solving them we will obtain an approximate solution
$$y(x) = \sum_{j=0}^N c_jG_j(x)$$
of equation (3).

When considering methods to solve this system, one might be guided by the following considerations. First, observe that $a_{ij} = 0$ unless $|i - j| \le 1$, in view of the definitions of the functions $G_i$; hence most coefficients on the left-hand side of (4) are zero. A matrix $(a_{ij})_{i,j=1}^N$ with this property is called a tridiagonal matrix (since all elements outside the main diagonal and the diagonals adjacent to it are zero). It is also immediately seen that the matrix $(a_{ij})_{i,j=1}^N$ is symmetric, i.e., that $a_{ij} = a_{ji}$. Further, the matrix $A = (a_{ij})_{i,j=1}^N$ is positive definite; recall that a matrix $A$ is called positive definite if for any nonzero column vector $z$ the number $z^TAz$ is positive. To see that the matrix $A = (a_{ij})_{i,j=1}^N$ does indeed have this property, note that for $z = (z_1, \dots, z_N)^T$ we have
$$z^TAz = \sum_{i=1}^N\sum_{j=1}^N z_ia_{ij}z_j = \int_0^1\sum_{i=1}^N\sum_{j=1}^N G'_i(x)z_iG'_j(x)z_j\,dx = \int_0^1\Bigl(\sum_{i=1}^N G'_i(x)z_i\Bigr)^2dx > 0;$$
since not all $z_i$ are zero (because $z$ is not the zero vector), strict inequality holds on the right-hand side.
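Because the system (4) is tridiagonal, it can be solved in $O(N)$ operations by Gaussian elimination restricted to the three nonzero diagonals (often called the Thomas algorithm); as noted below, no pivoting is needed for a positive definite symmetric matrix. The following minimal C sketch is not one of the book's programs; the right-hand side and the scaling of the test system are assumptions made for the illustration.

#include <stdio.h>

#define N 4

/* solve a tridiagonal system l[i]*x[i-1] + d[i]*x[i] + u[i]*x[i+1] = r[i]:
   forward elimination without pivoting, then back substitution
   (omitting pivoting is safe here since the matrix is positive definite) */
void tridiag(double l[], double d[], double u[], double r[], double x[], int n) {
    for (int i = 1; i < n; i++) {      /* eliminate the subdiagonal */
        double m = l[i]/d[i-1];
        d[i] -= m*u[i-1];
        r[i] -= m*r[i-1];
    }
    x[n-1] = r[n-1]/d[n-1];
    for (int i = n - 2; i >= 0; i--)   /* back substitution */
        x[i] = (r[i] - u[i]*x[i+1])/d[i];
}

int main(void) {
    /* a small positive definite test system (assumed for illustration);
       with equidistant nodes x_i = i*h the Galerkin matrix of (4) has,
       after multiplying through by h, diagonal 2 (1 in the last row)
       and off-diagonal entries -1 */
    double l[N] = { 0, -1, -1, -1};
    double d[N] = { 2,  2,  2,  1};
    double u[N] = {-1, -1, -1,  0};
    double r[N] = { 1,  0,  0,  0};
    double x[N];
    tridiag(l, d, u, r, x, N);
    for (int i = 0; i < N; i++) printf("c_%d = %.6f\n", i+1, x[i]);
    return 0;
}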

The fact that this matrix is positive definite and symmetric is important, since it is known that a system of linear equations with a positive definite symmetric coefficient matrix can be solved by Gaussian elimination without pivoting (see the Corollary in Section 35, p. 161). If one chooses to use Gaussian elimination, it is important to avoid pivoting, since pivoting would destroy the tridiagonal nature of the coefficient matrix. Further, such a system of equations is also solvable by Gauss-Seidel iteration (see the Theorem in Section 43, p. 224). In addition, the fact that this matrix is tridiagonal allows one to store the matrix efficiently, without storing the zero elements.

The method outlined above is an example of the finite element method. The finite element method can be used to handle many boundary value problems involving partial differential equations. In two dimensions, the domain over which the equation is considered is usually partitioned into triangles, and the function element considered is usually linear on these triangles.

48. THE HEAT EQUATION

Chebyshev polynomials of the second kind. The Chebyshev polynomials of the second kind are defined as
$$U_n(x) = U_n(\cos\theta) = \frac{\sin(n+1)\theta}{\sin\theta}\qquad (n \ge -1).$$
For $n = -1$, this equation gives $U_{-1} = \frac{\sin0\theta}{\sin\theta} = 0$, and for $n = 0$ it gives $U_0 = \frac{\sin1\theta}{\sin\theta} = 1$; that is, we have $U_{-1}(x) = 0$ and $U_0(x) = 1$. Using the de Moivre formula combined with the Binomial Theorem,
$$\cos(n+1)\theta + i\sin(n+1)\theta = (\cos\theta + i\sin\theta)^{n+1} = \sum_{k=0}^{n+1}\binom{n+1}{k}i^k\cos^{n+1-k}\theta\,\sin^k\theta,$$
where $i = \sqrt{-1}$. Taking imaginary parts of both sides, we obtain an equation for $\sin(n+1)\theta$. The imaginary part of the right-hand side will only involve odd values of $k$, that is, only odd powers of $\sin\theta$. After dividing by $\sin\theta$, we can see that $\frac{\sin(n+1)\theta}{\sin\theta}$ involves only even powers of $\sin\theta$, that is, integral powers of $\sin^2\theta = 1 - \cos^2\theta$. This shows that $\frac{\sin(n+1)\theta}{\sin\theta}$ is indeed a polynomial of $\cos\theta$.

Using the identity
$$\sin\alpha\cos\beta = \frac12\bigl(\sin(\alpha+\beta) + \sin(\alpha-\beta)\bigr)$$
with $\alpha = (n+1)\theta$ and $\beta = \theta$, we can conclude that
$$\cos\theta\sin(n+1)\theta = \frac12\bigl(\sin(n+2)\theta + \sin n\theta\bigr)$$
for $n \ge 0$. This can also be written as
$$xU_n(x) = \frac12\bigl(U_{n+1}(x) + U_{n-1}(x)\bigr)$$
for $n \ge 0$. These are the recurrence equations for the Chebyshev polynomials of the second kind. It is worth recalling that these are the same recurrence equations that define the Chebyshev polynomials discussed earlier (sometimes also called Chebyshev polynomials of the first kind), with the exception that the latter satisfied a different equation for $n = 0$, namely $xT_0(x) = T_1(x)$. It is not hard to show that the polynomials $U_n$ are orthogonal on the interval $(-1,1)$ with respect to the weight function $\sqrt{1-x^2}$.

The heat equation. Heat conduction in a rod can, with appropriate choice of the physical units, be described by the equation
$$(1)\qquad \frac{\partial^2u(x,t)}{\partial x^2} = \frac{\partial u(x,t)}{\partial t},$$
subject to the boundary conditions
$$u(x,0) = f(x),\qquad u(0,t) = u(1,t) = 0.$$
Here the function $u(x,t)$, for $x$, $t$ with $0 \le x \le 1$ and $t \ge 0$, describes the temperature of a rod represented by the interval $[0,1]$ at point $x$ and time $t$. The boundary value conditions indicate that at time zero the temperature of the rod at point $x$ is $f(x)$, and the endpoints of the rod are kept at temperature 0 at all times. For consistency, we must of course have $f(0) = f(1) = 0$.

In order to solve this equation numerically, the partial derivatives may be approximated by finite differences: choosing step sizes $h$ and $k$ for $x$ and $t$, respectively, one may write
$$(2)\qquad \frac{\partial u(x,t)}{\partial t} \approx \frac{u(x,t+k) - u(x,t)}{k}$$
and
$$(3)\qquad \frac{\partial^2u(x,t)}{\partial x^2} \approx \frac{u(x+h,t) - 2u(x,t) + u(x-h,t)}{h^2};$$
the latter formula is related to the formula $f''(x) \approx \Delta^2f(x)/h^2$ with one-dimensional finite differences. Taking $h = 1/N$, and taking a grid $x_m = mh$ and $t_n = nk$, one can approximate the differential equation above with the difference equation
$$(4)\qquad \frac{u((m+1)h, nk) - 2u(mh, nk) + u((m-1)h, nk)}{h^2} = \frac{u(mh, (n+1)k) - u(mh, nk)}{k}$$
with the boundary conditions
$$u(mh, 0) = f(mh)\qquad\text{and}\qquad u(0, nk) = u(Nh, nk) = 0$$
for $m$ and $n$ with $1 \le m \le N-1$ and $n \ge 0$. This equation can be solved explicitly for $u(mh, (n+1)k)$: writing $s = k/h^2$,
$$u(mh, (n+1)k) = (1 - 2s)\,u(mh, nk) + s\bigl(u((m+1)h, nk) + u((m-1)h, nk)\bigr).$$
Using this equation, $u(mh, (n+1)k)$ can be calculated for $n = 0, 1, 2, \dots$, and all $m$, step by step. Unfortunately, this calculation is numerically unstable unless $k \le h^2/2$.

To see this, write $\mathbf v_n = \bigl(u(h, nk), u(2h, nk), \dots, u((N-1)h, nk)\bigr)^T$; then the above equation can be written as
$$(5)\qquad \mathbf v_{n+1} = A\mathbf v_n,$$
where $A = (a_{ij})$ is the $(N-1)\times(N-1)$ matrix for which $a_{ii} = 1 - 2s$ (where $1 \le i \le N-1$), $a_{i\,i-1} = a_{i-1\,i} = s$ (where $2 \le i \le N-1$), and all other entries are zero. With $\mathbf v = (v_0, \dots, v_{N-2})^T$ and writing $v_{-1} = v_{N-1} = 0$, the equation $A\mathbf v = \lambda\mathbf v$ can be written as
$$\frac{\lambda - 1 + 2s}{2s}\,v_k = \frac12(v_{k-1} + v_{k+1})\qquad (0 \le k \le N-2).$$
If we add the equation $v_0 = 1$ and temporarily drop the equation $v_{N-1} = 0$, these equations uniquely determine $v_k$ for $k = 1, 2, \dots, N-1$; in fact, these equations are the same as those defining the Chebyshev polynomials of the second kind at $(\lambda - 1 + 2s)/(2s)$, i.e.,
$$v_k = U_k\Bigl(\frac{\lambda - 1 + 2s}{2s}\Bigr)\qquad\text{for } k\text{ with } 0 \le k \le N-1.$$
Adding back the equation $v_{N-1} = 0$, this means that $U_{N-1}\bigl((\lambda - 1 + 2s)/(2s)\bigr) = 0$. With $\theta = \arccos x$, we have $U_{N-1}(x) = \sin N\theta/\sin\theta$, so $U_{N-1}(x) = 0$ if $\theta = k\pi/N$ for $k = 1, 2, \dots, N-1$ (these are the values of $k$ that place $\theta$ in the interval $(0,\pi)$, i.e., the range of the function $\arccos$). That is, we must have
$$\frac{\lambda - 1 + 2s}{2s} = \cos\frac{k\pi}{N},$$
i.e., the eigenvalues of $A$ are
$$\lambda = 1 - 2s + 2s\cos\frac{k\pi}{N}\qquad (k = 1, 2, \dots, N-1).$$
With a fixed $n > 0$, equation (5) can be written as $\mathbf v_n = A^n\mathbf v_0$. The calculation using equation (5) will accumulate errors if any of these eigenvalues has absolute value greater than one. Indeed, assume $\lambda$ is the largest eigenvalue of $A$, associated with eigenvector $\mathbf v$, and assume $|\lambda| > 1$. Because of roundoff errors, at the $k$th step a small component $c\mathbf v$ will appear as a part of the error in calculating $\mathbf v_k$; at the $n$th step, this error will be increased to $\lambda^{n-k}c\mathbf v$. As $\lambda^{n-k} \to \pm\infty$ in case $|\lambda| > 1$, the error in the calculation will tend to infinity. As
$$\lim_{N\to\infty}\cos\frac{(N-1)\pi}{N} = -1,$$
to guarantee that $|\lambda| < 1$ for all the above eigenvalues we need $1 - 4s > -1$, that is, $s < 1/2$, i.e., $k < h^2/2$.
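A minimal C sketch of this forward difference (explicit) method follows; it is not one of the book's programs, and the initial temperature $f(x) = \sin\pi x$, the grid size, and the choice $s = 0.4 < 1/2$ are assumptions made for the illustration. For this particular $f$ the exact solution $u(x,t) = e^{-\pi^2t}\sin\pi x$ is available for comparison; choosing $s > 1/2$ in this program makes the computed values blow up, illustrating the instability discussed above.

#include <stdio.h>
#include <math.h>

#define N 10          /* h = 1/N */
#define STEPS 200

int main(void) {
    const double pi = 3.14159265358979323846;
    double h = 1.0/N;
    double s = 0.4;               /* s = k/h^2; stable since s < 1/2 */
    double k = s*h*h;
    double u[N+1], unew[N+1];
    /* initial temperature f(x) = sin(pi x), so f(0) = f(1) = 0 */
    for (int m = 0; m <= N; m++) u[m] = sin(pi*m*h);
    for (int n = 0; n < STEPS; n++) {
        unew[0] = unew[N] = 0.0;  /* endpoints kept at temperature 0 */
        for (int m = 1; m < N; m++)
            unew[m] = (1 - 2*s)*u[m] + s*(u[m+1] + u[m-1]);
        for (int m = 0; m <= N; m++) u[m] = unew[m];
    }
    double t = STEPS*k;
    printf("u(0.5,%g): computed %.6f, exact %.6f\n",
           t, u[N/2], exp(-pi*pi*t)*sin(pi*0.5));
    return 0;
}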

The requirement $k < h^2/2$ severely limits the usefulness of the above method, since $h$ must be chosen small in order that the differences should closely approximate the derivatives, but then $k$ must be too small for practical calculations. Instead of the forward difference in (2), one can use the backward difference
$$\frac{\partial u(x,t)}{\partial t} \approx \frac{u(x,t) - u(x,t-k)}{k}$$
to approximate the right-hand side of (1). With this approximation, equation (1) becomes
$$(6)\qquad \frac{u((m+1)h, nk) - 2u(mh, nk) + u((m-1)h, nk)}{h^2} = \frac{u(mh, nk) - u(mh, (n-1)k)}{k},$$
or else
$$-u((m+1)h, nk) + \Bigl(\frac{h^2}{k} + 2\Bigr)u(mh, nk) - u((m-1)h, nk) = \frac{h^2}{k}\,u(mh, (n-1)k).$$
With a fixed $n > 0$, for $m = 1, 2, \dots, N-1$ this represents a system of equations for the quantities $u(mh, nk)$, assuming that the quantities $u(\cdot, (n-1)k)$ are already known (recall that $u(0, nk) = u(Nh, nk) = 0$). One starts out with the boundary conditions $u(mh, 0) = f(mh)$ and $u(0, nk) = u(Nh, nk) = 0$, and then uses the above equations to determine $u(mh, nk)$ for $n = 1, 2, 3, \dots$, step by step. This method is stable for all values of $h$ and $k$; we will omit the proof of this.

The Crank-Nicolson Method. One can combine the forward difference method and the backward difference method by taking the average of equation (4) with $n$ replaced by $n-1$ and equation (6):
$$\frac{u((m+1)h, (n-1)k) - 2u(mh, (n-1)k) + u((m-1)h, (n-1)k)}{2h^2} + \frac{u((m+1)h, nk) - 2u(mh, nk) + u((m-1)h, nk)}{2h^2} = \frac{u(mh, nk) - u(mh, (n-1)k)}{k}.$$
Similarly to the backward difference method, this can be rearranged as a system of equations for the quantities $u(\cdot, nk)$ when the quantities $u(\cdot, (n-1)k)$ are already known. One again obtains a method that is stable for all values of $h$ and $k$; by doing this, one obtains a method of solving equation (1) that is useful in practice. This method was invented by John Crank and Phyllis Nicolson.
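A minimal C sketch of one way to implement the Crank-Nicolson time step follows (it is not one of the book's programs). Multiplying the averaged equation by $k$ and writing $s = k/h^2$ shows that each step amounts to solving the tridiagonal system
$$-\frac{s}{2}u_{m-1}^{(n)} + (1+s)u_m^{(n)} - \frac{s}{2}u_{m+1}^{(n)} = \frac{s}{2}u_{m-1}^{(n-1)} + (1-s)u_m^{(n-1)} + \frac{s}{2}u_{m+1}^{(n-1)},$$
which is done below by elimination without pivoting. The grid, the time step (deliberately chosen with $s = 1 > 1/2$, to show that the stability restriction of the explicit method is gone), and the initial profile are assumptions made for the illustration.

#include <stdio.h>
#include <math.h>

#define N 10          /* h = 1/N; N-1 interior unknowns per time step */
#define STEPS 50

int main(void) {
    const double pi = 3.14159265358979323846;
    double h = 1.0/N, k = 0.01;   /* k need not satisfy k < h*h/2 here */
    double s = k/(h*h);
    double u[N+1], d[N+1], r[N+1];
    for (int m = 0; m <= N; m++) u[m] = sin(pi*m*h);  /* assumed f(x) */
    for (int n = 0; n < STEPS; n++) {
        /* set up: -(s/2)u[m-1] + (1+s)u[m] - (s/2)u[m+1] = r[m] */
        for (int m = 1; m < N; m++) {
            d[m] = 1 + s;
            r[m] = (s/2)*u[m-1] + (1 - s)*u[m] + (s/2)*u[m+1];
        }
        for (int m = 2; m < N; m++) {     /* elimination, no pivoting */
            double mult = (-s/2)/d[m-1];
            d[m] -= mult*(-s/2);
            r[m] -= mult*r[m-1];
        }
        u[N-1] = r[N-1]/d[N-1];           /* back substitution */
        for (int m = N - 2; m >= 1; m--)
            u[m] = (r[m] + (s/2)*u[m+1])/d[m];
        u[0] = u[N] = 0.0;
    }
    double t = STEPS*k;
    printf("u(0.5,%g): computed %.6f, exact %.6f\n",
           t, u[N/2], exp(-pi*pi*t)*sin(pi*0.5));
    return 0;
}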

BIBLIOGRAPHY

[AH] L. V. Atkinson–P. J. Harley, An Introduction to numerical analysis with Pascal, Addison-Wesley, Reading, Massachusetts, 1983.
[B] P. R. Beesack, A general form of the remainder in Taylor's theorem, Amer. Math. Monthly 73 (1966), 64–67.
[CB] S. D. Conte–Carl de Boor, Elementary numerical analysis, third edition, McGraw-Hill, New York–London–Paris, 1980.
[H] Thomas J. R. Hughes, The finite element method. Linear static and dynamic finite element analysis, Dover Publications, Mineola, NY, 2000.
[KR] Brian W. Kernighan–Dennis M. Ritchie, The C programming language, second edition, Prentice Hall, Englewood Cliffs, New Jersey, 1988.
[M] Alfred J. Maria, Taylor's formula with derivative remainder, Amer. Math. Monthly 73 (1966), 67–68.
[PTVF] W. H. Press–S. A. Teukolsky–W. T. Vetterling–B. P. Flannery, Numerical recipes in C, second edition, Cambridge University Press, Cambridge, England–New York, NY, USA–Oakleigh, Victoria, Australia, 1992.
[Ral] Anthony Ralston, On differentiating error terms, Amer. Math. Monthly 70 (1963), 187–189.
[RR] Anthony Ralston–Philip Rabinowitz, A first course in numerical analysis, second edition, Dover Publications, Inc., Mineola, NY, 2001.
[Ros] Maxwell Rosenlicht, Introduction to Analysis, Dover Publications, New York, 1968.
[Wiki] Taylor's theorem, Wikipedia entry http://en.wikipedia.org/wiki/Taylor%27s_theorem as given on July 10, 2007.


bn . Dirichlet Kernel. x2 . the identity operator. error term of Taylor’s formula. the transpose of the vector b. 63 Rn (x. big O notation. the L∞ -norm of v. 27 O(f (x)). 157. 29 Tn . divided difference. 33.j in Romberg integration. 134 ∆. 226 I. 27 δij . the forward shift operator. In certain cases. 19 f [x. for real t. 174 bT . 228 AT . the coefficients in the three term recurrence equation for orthogonal polynomials. nfold. the leading coefficients of orthogonal polynomials. 231 ∇.List of symbols 249 LIST OF SYMBOLS Page numbers usually refer to the place where the symbol in question was introduced. Chebyshev polynomials. 10 R. 10 Sk. Lagrange’s fundamental polynomials. a). the forward difference operator. divided difference. 27 li (x). 183 Dn (θ). 142 kvk. · · · . 23 γn . xi ]. where important discussion concerning the symbol occurs at several places. the Christoffel numbers. 233 kvk∞ . x. the backward difference operator. the set of real numbers. 211 det(A). an . 234 E. binomial coefficient. 27 f [x1 . the L∞ -norm of v. the transpose of the matrix A. the determinant of the matrix A. 13 Λi = Λin . 93  t i . 142 kvk2 . x]. 174 . the L2 -norm of v. the number of the page containing the main source of information is printed in italics. Kronecker’s delta. · · · .


SUBJECT INDEX

Page numbers in italics refer to the main source of information about whatever is being indexed. Often, they refer to the statement of a theorem, while the other page numbers listed refer to various applications of the same.

absolute error, see error, absolute
Adams-Bashforth predictor formula, see predictor, Adams-Bashforth
Adams-Bashforth-Moulton corrector formula, see corrector, Adams-Bashforth-Moulton
Adams-Moulton corrector formula, see corrector, Adams-Moulton
adaptive integration, 75; C program for, 76; with Simpson's rule, 79 (C program for, 80); with trapezoidal rule, 75; compared with Romberg integration, see Romberg integration
Aitken's acceleration, 56; of fixed-point iteration, C program for, 57
back substitution, 133
backward difference operator, see operator, backward difference
Bernoulli numbers, 84
Bernoulli polynomials, 84
Binomial Theorem for shift operators, 28
bisection, 35; C program for, 37
boundary value problem, see differential equation, boundary value problem
C program, compiling, 37
calculus of variations, 238
chain rule, multivariate, 103
characteristic equation of a recurrence equation, see recurrence equation, characteristic equation of
characteristic polynomial of a recurrence equation, see recurrence equation, characteristic polynomial of
Chebyshev polynomials, 233; see also polynomial, Chebyshev
Cholesky factorization, see Gaussian elimination, Cholesky factorization
Christoffel numbers, 231
Christoffel-Darboux formula, 230
Clenshaw, —'s recurrence equation for Chebyshev polynomials, 234; —'s summation method, 236
cofactor, see determinants, cofactor of an entry in a —
companion matrix, see matrix, companion
condition number, see condition of a function
condition of a function, 8
conjugate transpose, see matrix, conjugate transpose of
connected set, see set, connected
corrector, Adams-Moulton, 116; Adams-Bashforth-Moulton, 117
Cramer's rule, 158
Crank, John, 246
Crank-Nicolson method, see heat equation, Crank-Nicolson method
cubic spline, see spline, cubic
cycle, see permutation, cyclic
cyclic permutation, see permutation, cyclic
de Moivre's formula, 47
deflation of a polynomial, 52
delta, Kronecker's, see Kronecker's delta
Descartes, René, 240
determinants, 151–158; cofactor of an entry in a —, 153; definition of, 151; expansion of, 153; Leibniz formula for, 156; multiplication of, 155; simple properties of, 154
difference equation, see recurrence equation
differential equation, 102; boundary value problem, 238 (finite element method for, 241); initial value problem, 102; predictor-corrector method for, 124; Taylor method for, 102
digit, significant, see significant digit
Dirichlet kernel, 234
distributive rule, generalized, 28
divided difference, 18; n-fold, 33
Doolittle algorithm, see Gaussian elimination, Doolittle algorithm
double precision, see precision, double
eigenvalue, 183; multiplicity of, 220
eigenvector, 183
empty matrix, see matrix, empty
error, absolute, 8; mistake or blunder, 8; relative, 8; roundoff, 10; truncation, 10
Euler, Leonhard, 83, 240
Euler summation formula, see Euler-Maclaurin summation formula
Euler-Maclaurin summation formula, 83
even function, see function, even
even permutation, see permutation, even
expansion of determinants, see determinants, expansion of
Fermat, Pierre de, 240; —'s principle, 240
Feynman, Richard, 240
finite element method, 240; see also differential equation, boundary value problem
Fizeau, Hippolyte, 240
fixed-point iteration, 52; C program for, 53
floating point number, 1
forward difference operator, see operator, forward difference
forward shift operator, see operator, forward shift
forward substitution, 134
full pivoting, see pivoting, full
function, even, 84; odd, 84
Galerkin's approximation, 244
Gauss, Carl Friedrich, 132
Gauss's backward formula, 31
Gauss's quadrature formula, 231; Gauss-Jacobi quadrature formula, 231
Gauss-Seidel iteration, 163, 220; convergence for a positive definite matrix, 224
Gaussian elimination, 132; C program for, 137; Cholesky factorization, 161; Crout algorithm, 150 (advantage over Doolittle algorithm, 150); Doolittle algorithm, 149; implicit equilibration, 149
generalized distributive rule, see distributive rule, generalized
GNU C compiler, 37
Hamilton's principle, 239
heat equation, 244; Crank-Nicolson method, 245
Hermite, Charles, 22
Hermite interpolation, see interpolation, Hermite
Hermitian matrix, see matrix, Hermitian
Hessenberg matrix, upper, see matrix, Hessenberg, upper
holomorphic, 46
Horner's rule, 45; to evaluate the derivative of a polynomial, 46
Householder transformation, 211
identity operator, see operator, identity
ill-conditioned, 8
Illinois method, see secant method, Illinois method
implicit differentiation, 102
instability, see numerical instability
integrals with singularities, see numerical integration, with singularities
intermediate-value property of derivatives, 91
interpolation, 13; base points of, 13; Hermite, 22 (error of, 25; Lagrange form of, 31; Newton form of, 23); Lagrange, 13 (error of, 16); Newton, 18 (error of, 24; with equidistant points, 29)
inverse power method, 195; C program for, 196
isometry, 211
iterative improvement, 160
Jacobi iteration, 163
Kronecker's delta, 23
Lagrange fundamental polynomials, 13
Lagrange interpolation, see interpolation, Lagrange
Lagrangian mechanics, 239
Landau, Edmund, 10
Laplace, Pierre-Simon, 174
least square solution, 175
Lebesgue, Henry, 229
Leibniz formula for determinants, see determinants, Leibniz formula for
Linux operating system, 37
loader, 38
long double precision, see precision, long double
loss of precision, 1
lower triangular unit matrix, see matrix, triangular, lower, unit
LU-factorization, 136
main solution of a recurrence equation, see recurrence equation, main solution
makefile, 38
Mathematica, 169
matrix, companion, 46; conjugate transpose of, 222; eigenvalue of, see eigenvalue; eigenvector of, see eigenvector; empty, 134; Hermitian, 222; Hessenberg, upper, 218; inverse of, 135; minor of, 158; orthogonal, 211; permutation, 135; positive definite, see positive definite matrix; resolvent of, see resolvent; row-diagonally dominant, 223; spectral radius of, see spectral radius; spectrum of, see spectrum; symmetric, 223; transpose of, 186; triangular, lower, 134 (unit, 134), upper, 134; tridiagonal, 168
Mean-Value Theorem for integrals, 68
Milne's method, 130
minor, see matrix, minor of; principal, see principal minor of a matrix
mistake, see error, mistake or blunder
monic polynomial, see polynomial, monic
multiplication of determinants, see determinants, multiplication of
Newton interpolation, see interpolation, Newton
Newton's forward formula, 29
Newton's method, 35; as fixed-point iteration, 52; C program for, 41; for polynomial equations, 45 (C program for, 46); speed of convergence of, 60
Newton-Hermite interpolation, 23
Nicolson, Phyllis, 246
nonlinear equations, solution of, 35
norm, l2, of a vector, 174; of a matrix, compatible, 220 (consistent, 220)
numerical differentiation, of functions, 63 (higher derivatives, 65); of tables, 61
numerical instability, 130
numerical integration, composite formulas, 70; on infinite intervals, 101; simple formulas, 67; with singularities, 99 (subtraction of singularity, 100)
numerically unstable, see numerical instability
odd function, see function, odd
odd permutation, see permutation, odd
operator, backward difference, 27; differential, 103; forward difference, 27; forward shift, 27; identity, 27
oracle, 61
orthogonal matrix, see matrix, orthogonal
orthogonal polynomials, see polynomial, orthogonal
orthonormal, see polynomial, orthonormal
overdetermined system of linear equations, see system of linear equations, overdetermined
parasitic solution of a recurrence equation, see recurrence equation, parasitic solution
partial pivoting, see pivoting, partial
Perl programming language, 37
permutation, 151; cyclic, 152; even, 151; odd, 153
permutation matrix, see matrix, permutation
pivoting, 135; full, 135; partial, 135
plain rotation, 211
polynomial, Chebyshev, 233 (C program for expansion with —s, 236; second kind, 235); monic, 226; orthogonal, 226; orthonormal, 227
positive definite matrix, 158
power method, 183; C program for, 185
precision, double, 1; long double, 1; loss of, see loss of precision; single, 1
predictor, Adams-Bashforth, 115
predictor-corrector method, 114; Adams-Bashforth-Moulton, 117 (C program for, 117; step size control with Runge-Kutta-Fehlberg, 117)
principal minor of a matrix, see matrix, minor of
QR algorithm, 210; C program for, 213
quadrature formula, Gauss, see Gauss's quadrature formula
recurrence equation, 125; characteristic equation of, 125; characteristic polynomial of, 126; homogeneous, 125; inhomogeneous, 126; main solution, 131; parasitic solution, 131
relative error, see error, relative
resolvent, 187
Richardson extrapolation, 64; in Romberg integration, 98
Rolle's theorem, repeated application of, 15
Romberg integration, 93; C program for, 94; compared with adaptive integration with Simpson's rule, 98
roundoff error, see error, roundoff
row-diagonally dominant matrix, see matrix, row-diagonally dominant
Rømer, Ole, 240
Runge-Kutta methods, 105; C program for, 108; Runge-Kutta-Fehlberg method, 107 (step size control with, 107)
secant method, 36; C program for, 43; Illinois method, 43
set, connected, 220; simply connected, 221
significant digits, 1
similarity transformation, 210
simply connected set, see set, simply connected
Simpson's rule, composite, 71 (C program for, 71); simple, 69
single precision, see precision, single
spectral radius, 220
spectrum, 220
speed of light, 240
spline, cubic, 166; C program for, 168; free boundary conditions, 167; natural boundary conditions, 167
stationary path, 239
symmetric matrix, see matrix, symmetric
system of linear equations, 132; overdetermined, 173
Taylor series, Cauchy's remainder term, 12; integral remainder term, 13; Lagrange's remainder term, 12; Roche–Schlömilch remainder term, 12
Taylor's method for differential equations, see differential equation, Taylor method for
teleology, 240
three term recurrence formula, 228
transpose, see matrix, transpose of
transposition, 151
trapezoidal rule, composite, 70; for periodic functions, 88; simple, 68
triangular matrix, see matrix, triangular
tridiagonal matrix, see matrix, tridiagonal
truncation error, see error, truncation
Unix operating system, 37
unstable, see numerical instability
upper Hessenberg matrix, see matrix, Hessenberg, upper
vector, implementing in C, 138
velocity, generalized, 239
Wielandt's deflation, 202; C program for, 204
wild card, 38
