
TEXTS AND READINGS IN PHYSICAL SCIENCES - 2

Numerical Methods
for Scientists and Engineers
Third Edition
Texts and Readings in Physical Sciences

Managing Editors
H. S. Mani, Chennai Mathematical Institute, Chennai.
hsmani@gmail.com

Ram Ramaswamy, Vice Chancellor, University of Hyderabad, Hyderabad.


r.ramaswamy@gmail.com

Editors

Kedar Damle (TIFR, Mumbai) kedar@tifr.res.in


Debashis Ghoshal (JNU, New Delhi) dghoshal@mail.jnu.ac.in
Rajaram Nityananda (NCRA, Pune) rajaram@ncra.tifr.res.in
Gautam Menon (IMSc, Chennai) menon@imsc.res.in
Tarun Souradeep (IUCAA, Pune) tarun@iucaa.ernet.in

Volumes published so far

1. Field Theories in Condensed Matter Physics, Sumathi Rao (Ed.)


2. Numerical Methods for Scientists and Engineers (3/E), H. M. Antia
3. Lectures on Quantum Mechanics (2/E), Ashok Das
4. Lectures on Electromagnetism, Ashok Das
5. Current Perspectives in High Energy Physics, Debashis Ghoshal (Ed.)
6. Linear Algebra and Group Theory for Physicists (2/E), K. N. Srinivasa Rao
7. Nonlinear Dynamics: Near and Far from Equilibrium, J. K. Bhattacharjee
and S. Bhattacharyya
8. Spacetime, Geometry and Gravitation, Pankaj Sharan
9. Lectures on Advanced Mathematical Methods for Physicists, Sunil Mukhi
and N. Mukunda
10. Computational Statistical Physics, Sitangshu Bikas Santra and
Purusattam Roy (Eds.)
11. The Physics of Disordered Systems, Gautam I. Menon and
Purusattam Roy (Eds.)
Numerical Methods
for Scientists and Engineers
Third Edition

H. M. Antia

HINDUSTAN BOOK AGENCY
Published by

Hindustan Book Agency (India)


P 19 Green Park Extension
New Delhi 110016
India

email: info@hindbook.com
www.hindbook.com

Copyright © 1991, First Edition, Tata McGraw-Hill Publishing Company


Limited.
Copyright © 2002, Second Edition, Hindustan Book Agency (India)
Copyright © 2012, Hindustan Book Agency (India)

No part of the material protected by this copyright notice may be


reproduced or utilized in any form or by any means, electronic or mechanical,
including photocopying, recording or by any information storage and retrieval
system, without written permission from the copyright owner, who has also
the sole right to grant licences for translation into other languages and
publication thereof.

All export rights for this edition vest exclusively with Hindustan Book Agency
(India). Unauthorized export is a violation of Copyright Law and is subject
to legal action.

ISBN 978-93-80250-40-3 ISBN 978-93-86279-52-1 (eBook)


DOI 10.1007/978-93-86279-52-1
Texts and Readings in the Physical Sciences
The Texts and Readings in the Physical Sciences (TRiPS) series of books
aims to provide a forum for physical scientists to describe their fields of
research interest. Each book is intended to cover a subject in a detailed
manner and from a personal viewpoint, so as to give students in the field a
unique exposure to the subject that is at once accessible, pedagogic, and
contemporary. The monographs and texts that have appeared so far, as
well as the volumes that we plan to bring out in the coming years, cover a
range of areas at the core as well as at the frontiers of physics.

In addition to texts on a specific topic, the TRiPS series includes lecture


notes from a thematic School or Workshop and topical volumes of
contributed articles focussing on a currently active area of research.
Through these various forms of exposition, we hope that the series will
be valuable both for the beginning graduate student as well as the
experienced researcher.

H.S. Mani R. Ramaswamy


Chennai Hyderabad
Contents

Preface xiii

Preface to the second edition xv

Preface to the First Edition xix

Notation xxiii

List of Computer Programs xxv

1 Introduction 1
1.1 Errors in Numerical Computation 2
1.2 Truncation Error 4
1.3 Programming 8
Bibliography 11
Exercises 11

2 Roundoff Error 13
2.1 Number Representation 13
2.2 Roundoff Error . . . . . 22
2.3 Error Analysis . . . . . 32
2.4 Condition and Stability 42
Bibliography 50
Exercises . . . . . . 51

3 Linear Algebraic Equations 59


3.1 Introduction . . . . . 60
3.2 Gaussian Elimination . . . . . . . 63
3.3 Direct Triangular Decomposition 71
3.4 Error Analysis . . . . . . . . . 77
3.5 Matrix Inversion . . . . . . . 87
3.6 Singular Value Decomposition 88
3.7 Iterative Methods 98
Bibliography 102
Exercises . . . . 103

4 Interpolation 111
4.1 Polynomial Interpolation . . . . . . . . . . 112
4.2 Divided Difference Interpolation Formula. 117
4.3 Hermite Interpolation. . . 128
4.4 Cubic Spline Interpolation . . . 130
4.5 B-splines . .. . . . . . . . . . . 135
4.6 Rational Function Interpolation 141
4.7 Interpolation in Two or More Dimensions 147
4.8 Spline Interpolation in Two or More Dimensions 150
Bibliography 152
Exercises 153

5 Differentiation 159
5.1 Differentiation of Interpolating Polynomials 159
5.2 Method of Undetermined Coefficients 164
5.3 Extrapolation Method 168
Bibliography 172
Exercises 172

6 Integration 175
6.1 Newton-Cotes Quadrature Formulae 177
6.2 Extrapolation Methods . 187
6.3 Gaussian Quadrature 194
6.4 Roundoff Error . . 198
6.5 Weight Function 201
6.6 Improper Integrals 209
6.7 Automatic Integration 220
6.8 Summation. . . . . . 229
6.9 Multiple Integrals . .. 236
6.10 Rules Exact for Monomials. 242
6.11 Monte Carlo Method .. . 247
6.12 Equidistributed Sequences 254
Bibliography 259
Exercises .. . . . .. . 260

7 Nonlinear Algebraic Equations 271


7.1 Real Roots ... . . . . 273
7.2 Fixed-Point Iteration .. 276
7.3 Method of False Position 278
7.4 Secant Method ... . . 281
7.5 Newton-Raphson Method 285
7.6 Brent's Method 289
7.7 Complex Roots . ... . . 294
7.8 Muller's Method . . . . . 297
7.9 Quadrature Based Method 301

7.10 Real Roots of Polynomials 304


7.11 Laguerre's Method . . . . 308
7.12 Roundoff Error . . . . . . 313
7.13 Criterion for Acceptance of a Root 315
7.14 Ill-conditioning . . . . . . . . . 319
7.15 System of Nonlinear Equations 323
7.16 Newton's Method . 328
7.17 Broyden's Method 333
Bibliography 335
Exercises 336

8 Optimisation 345
8.1 Golden Section Search 347
8.2 Brent's Method . . . . 353
8.3 Methods Using Derivative 356
8.4 Minimisation in Several Dimensions. 359
8.5 Quasi-Newton Methods. 363
8.6 Direction Set Methods 371
8.7 Linear Programming 379
8.8 Simulated Annealing 390
Bibliography 395
Exercises .. 396

9 Statistical Inferences 401


9.1 Elementary statistics . . . . . . . . . . . . . 401
9.2 Monte Carlo Methods . . . . . . . . . . . . 416
9.3 Experimental Errors and their Propagation 418
Bibliography 422
Exercises . . . . . . 423

10 Functional Approximations 425


10.1 Choice of Norm and Model. 426
10.2 Linear Least Squares . . . . 432
10.3 Nonlinear Least Squares .. 451
10.4 Least Squares Approximation in Two Dimensions 457
10.5 Discrete Fourier Transform. . . . 459
10.6 Fast Fourier Transform . . . . . . 464
10.7 FFT in Two or More Dimensions 473
10.8 Inversion of Laplace Transform 475
10.9 Pade Approximations . . . 479
10.10 Chebyshev Expansions . . . . . 485
10.11 Minimax Approximations . . . 494
10.12 Discrete Minimax Approximations 503
10.13 L1-approximations 508
Bibliography . . . . . . . . . . . . 510

Exercises . . . . . . . . 511

11 Algebraic Eigenvalue Problem 523


11.1 Introduction .. . 524
11.2 Power Method . . . . . . 529
11.3 Inverse Iteration. . . . . 534
11.4 Eigenvalues of a Real Symmetric Matrix 541
11.5 The QL Algorithm . . . . . . . . . . . . 549
11.6 Reduction of a Matrix to Hessenberg Form 553
11.7 Lanczos Method. . . . . . . . . . . . . . . . 557
11.8 QR Algorithm for a Real Hessenberg Matrix . 558
11.9 Roundoff Errors. 564
Bibliography 566
Exercises . . . . 567

12 Ordinary Differential Equations 573


12.1 Initial Value Problem . . . 574
12.2 Stability of Numerical Integration Methods 578
12.3 Predictor-Corrector Methods 587
12.4 Runge-Kutta Methods .. . 600
12.5 Extrapolation Methods .. . 607
12.6 Stiff Differential Equations . 610
12.7 Boundary Value Problem. 618
12.8 Finite Difference Methods 626
12.9 Eigenvalue Problem .. . 633
12.10 Expansion Methods .. . 640
12.11 Some Special Techniques 642
Bibliography 647
Exercises 648

13 Integral Equations 657


13.1 Introduction . . . . . . . . . . . . . . . . 658
13.2 Fredholm Equations of the Second Kind 661
13.3 Expansion Methods . . . . . . . . . . . 669
13.4 Eigenvalue Problem . . . . . . . . . . . 672
13.5 Fredholm Equations of the First Kind 676
13.6 Inverse Problems . . . . . . . . . . . . 680
13.7 Volterra Equations of the Second Kind 685
13.8 Volterra Equations of the First Kind 692
Bibliography 695
Exercises . . . . . . . . . . . . . . . 696

14 Partial Differential Equations 701


14.1 Introduction . . . . . . 702
14.2 Diffusion Equation in Two Dimensions . . . . . 707
14.3 General Parabolic Equation in Two Dimensions 717
14.4 Parabolic Equations in Several Space Variables 724
14.5 Wave Equation in Two Dimensions 728
14.6 General Hyperbolic Equations . . . . 734
14.7 Elliptic Equations . . . . . . . . . . . 741
14.8 Successive Over-Relaxation Method. 748
14.9 Alternating Direction Method 753
14.10 Fourier Transform Method 756
14.11 Finite Element Methods 757
Bibliography 763
Exercises 765

Appendix A
Answers and Hints 773
A.1 Introduction . . . . . . . . . 773
A.2 Roundoff Error . . . . . . . 774
A.3 Linear Algebraic Equations 779
A.4 Interpolation 783
A.5 Differentiation . . . . . . . . 787
A.6 Integration . . . . . . . . . 789
A.7 Nonlinear Algebraic Equations. 795
A.8 Optimisation . . . . . . . . 801
A.9 Statistical Inferences . . . . . 803
A.10 Functional Approximations 804
A.11 Algebraic Eigenvalue Problem 811
A.12 Ordinary Differential Equations 814
A.13 Integral Equations . . . . . . 818
A.14 Partial Differential Equations 819
Bibliography . . . . . . . . . 822

Index 823

Appendices B and C are not included in the printed form but can be found
only in the online material at
http://www.tifr.res.in/~antia/nmse3.html
along with the machine readable versions of the Fortran and C programs. For
convenience the pages are numbered continuously following the Index. The
material in the Appendices is also included in the printed Index with respective
page numbers.

Appendix B Fortran Programs 861


B.1 Introduction . . . . . . . 861
B.2 Roundoff Error . . . .. 866
B.3 Linear Algebraic Equations 867
B.4 Interpolation 874
B.5 Differentiation . . . . . . . . 884
B.6 Integration . . . . . . . . . . 885
B.7 Nonlinear Algebraic Equations. 909
B.8 Optimisation . . . . . . . . 923
B.9 Statistical Inferences . . . . . 929
B.10 Functional Approximations 933
B.11 Algebraic Eigenvalue Problem 966
B.12 Ordinary Differential Equations 973
B.13 Integral Equations . . . . . . 988
B.14 Partial Differential Equations 996
Bibliography . . . . . . . . . 1004

Appendix C C Programs 1007


C.1 Introduction . . . . . . . . . 1007
C.2 Roundoff Error .. . .. . 1011
C.3 Linear Algebraic Equations 1012
C.4 Interpolation 1019
C.5 Differentiation . . . . . . . . 1029
C.6 Integration . . . . . . . . . 1030
C.7 Nonlinear Algebraic Equations. 1053
C.8 Optimisation . . . . . . . . 1064
C.9 Statistical Inferences . . . . . 1071
C.10 Functional Approximations 1074
C.11 Algebraic Eigenvalue Problem 1107
C.12 Ordinary Differential Equations 1113
C.13 Integral Equations . . . . . . 1128
C.14 Partial Differential Equations 1136
Bibliography . . . . . . . . . 1144
Preface

During the last twenty years since the publication of the first edition of the book,
the speed as well as memory of computers have increased by a few orders of
magnitude. However, the precision with which the floating point operations are
being handled on these computers has not improved. As a result, the relevance
of roundoff errors in numerical computations has increased substantially. In
this book, an attempt is made to demonstrate that with proper care the errors
do not grow very fast with the size of computational problem. Thus the book
is probably more relevant today than it was twenty years back. The main aim
of this book has been to not only provide suitable algorithms for numerical
computations, but also to explain their limitations.
Based on experience in teaching, some new topics have been added to
the second edition. A notable omission in the second edition of the book was
statistical analysis of data. This has been addressed by the addition of a new
Chapter (Chapter 9) on Statistical Inference. This Chapter introduces basic
statistical distributions, which are useful in the next Chapter on approxima-
tions. In view of this addition, the initial part of Chapter 10 on approximation
has also been edited and two new subsections on least squares fit when there
are errors in both x and y, and on Maximum Likelihood methods have been
added. In keeping with these additions some computer programs have also been
added and a few have been modified.
Further, keeping in view the convenience of readers, the computer programs
and related material in Appendices B and C are now made available online at
http://www.tifr.res.in/~antia/nmse.html
Although, the Appendices do not appear in the printed version, the corre-
sponding page numbers are still included in the printed index and the table
of contents. These page numbers refer to description of programs which is in-
cluded in the Appendices. The programs themselves are obviously in separate
files and there is no page number associated with them. The online material
also includes some examples of usage and instructions for using these routines.
Although, these computer programs have been tested to some extent, it is
not possible for an individual to test all these thoroughly and there are likely
to be some bugs in these routines. A few bugs in the earlier versions have been
corrected. Hence, before using these subprograms, the reader is urged to verify
their correctness. Even if there are no "bugs" in the program, it may not neces-

sarily give the correct result in all cases, since the algorithm used in writing the
program itself may not be capable of solving all possible problems. It should be
noted that these routines are not a substitute for high quality numerical soft-
ware that is available. Apart from the mathematical subroutine libraries, there
are also a number of software packages which allow the mathematical problems
to be specified in a convenient form for numerical solution. Nevertheless, it is
not possible for any program or software to give correct results for all possi-
ble problems. Thus it is advisable to use only those software where the user
is aware of which technique is actually implemented, so that their limitations
may be known. The routines in the online material are also provided in the
same spirit. They may not necessarily give the correct result in all cases, but
since the algorithm used is known to the readers, they can modify these to suit
their requirements.
Since the publication of the second edition, a few of the readers have
pointed out misprints in the book and bugs in the computer programs. Besides,
a few misprints were detected during revision. These have been corrected, hope-
fully, without addition of new bugs. I am thankful to Nirvikar Prasad, Lavish
Pabbi and Nilesh Sutar for pointing out misprints. It is possible that I have
forgotten some others who too pointed out the errors and I hope that they will
accept my apology. Careful readers may notice that in some of the examples
the numbers have changed as compared to previous edition. In almost all cases,
these are not due to misprints or bugs, but because of using a different compiler
and computer and the accompanying differences in roundoff errors. In some of
the cases, the old tables have been retained even though the results using cur-
rent machines were somewhat different. Apart from these a major change in
the manuscript was from plain TEX to LaTeX format, which involved extensive
editing of equations and tables. Hopefully, no errors have been introduced dur-
ing this transformation! I am thankful to Ram Ramaswamy for encouragement
and help in bringing out this third edition of the book.

August 2012 H. M. Antia


Preface to the Second Edition

During the last ten years since the publication of the first edition of the book,
the speed as well as memory of computers have increased by almost two orders
of magnitude. However, the precision with which the floating point operations
are being handled on these computers has not improved. In fact, the 32/64-bit
IEEE representation is now more or less standard on all computers, while some
of the older machines used longer bit sequences to represent the same numbers.
Although, many current compilers support quadruple precision arithmetic with
128 bit representation for floating point numbers, these are not implemented
in hardware and consequently the execution is very slow. This imbalance is
mainly due to the fact that nowadays the computers are being largely used
for non-numerical work. As a result, the computer manufacturers do not have
sufficient incentive to provide more accurate arithmetic operations. On the
other hand, because of availability of faster computers and larger memory,
the scientists and engineers are now handling larger computational problems,
using arithmetic with comparatively low precision. Clearly, more effort is now
required to control the errors in computation. In this book, an attempt is made
to demonstrate that with proper care the errors do not grow very fast with the
size of computational problem. Thus the book is probably more relevant today
than it was ten years back. The main aim of this book has been to not only
provide suitable algorithms for numerical computations, but also to explain
their limitations.
In keeping with the increase in computing power and memory some topics
requiring this increased power have been added in the second edition. This
includes interpolation and approximations in multiple dimensions, simulated
annealing and the inverse problems. A notable omission in the first edition was
the treatment of B-splines, which provide a convenient set of basis functions for
approximations. This has been added in this edition. Similarly, with increas-
ing popularity of C language for computer programming, this edition includes
computer programs in both Fortran and C. During the last ten years a new ver-
sion of Fortran, namely, Fortran 90 has emerged, but the subroutines described
in Appendix B, still use Fortran 77 as the corresponding compilers are more
readily available. A number of new programs have been added bringing the
total number to over two hundred. Most of the new additions involve B-splines
or approximations in multiple dimensions. Besides a number of functions to

calculate some of the common mathematical functions have been added. It is


obviously more useful to have programs in a machine readable form, rather
than the printed form as was the case in the first edition. Thus, in this edition
the computer programs in Fortran and C which are described in Appendices B
and C, respectively, have been shifted to a CD. To reduce the size of the book,
these Appendices themselves are also available only on the CD. Although, the
Appendices do not appear in the printed version, the corresponding page num-
bers are still included in the printed Index and the table of contents. These page
numbers refer to description of programs which is included in the Appendices.
The programs themselves are obviously in separate files and there is no page
number associated with them. The CD also includes some examples of usage
and instructions for using these routines.
Although, these computer programs have been tested to some extent, it is
not possible for an individual to test all these thoroughly and there are likely
to be some bugs in these routines. A few bugs in the earlier versions have been
corrected. Hence, before using these subprograms, the reader is urged to verify
their correctness. Even if there are no "bugs" in the program, it may not neces-
sarily give the correct result in all cases, since the algorithm used in writing the
program itself may not be capable of solving all possible problems. It should
be noted that these routines are not a substitute for high quality numerical
software that is available. Apart from the mathematical subroutine libraries,
there are also a number of software packages which allow the mathematical
problems to be specified in a convenient form for numerical solution. My expe-
rience with these softwares has not been particularly good as quite often they
produce incorrect results without any warning. Thus it is advisable to use only
those softwares where the user is aware of which technique is actually imple-
mented, so that their limitations may be known. The routines in the attached
CD are also provided in the same spirit. They may not necessarily give the
correct result in all cases, but since the algorithm used is known to the readers,
they can modify these to suit their requirements.
During the last ten years, significant advances have taken place in tech-
niques for parallel processing and for solution of partial differential equations,
but these are not reflected in this edition as they were thought to be beyond
the scope of a general book. The algorithms presented in this book are gener-
ally meant for use on computers with sequential processing, though in many
cases it may be possible to implement an equivalent algorithm on machines
with parallel processing.
Since the publication of the first edition, a few of the readers have pointed
out misprints in the book and bugs in the computer programs. Besides, a number
of misprints were detected during revision. Most of the misprints were discov-
ered in answers to exercises. These have been corrected, hopefully, without
addition of new bugs. I am thankful to S. Basu, R. S. Bhalerao, A. K. Mohap-
atra and B. Pal for pointing out misprints and bugs. It is possible that I have
forgotten some others who too pointed out the errors and I hope that they will
accept my apology. Careful readers may notice that in some of the examples

the numbers have changed as compared to previous edition. In almost all cases,
these are not due to misprints or bugs, but because of using a different compiler
and computer and the accompanying differences in roundoff errors. In some of
the approximation problems the difference arises merely because a different set
of random numbers was used. In some of the cases, the old tables have been
retained even though the results using current machines were somewhat differ-
ent. I am thankful to S. M. Chitre, M. S. Raghunathan and R. Ramaswamy
for encouragement and help in bringing out this second edition of the book.

November 2001 H. M. Antia


Preface to the First Edition

This book has evolved out of lecture courses delivered to Research Scholars at
the Tata Institute of Fundamental Research during the period 1978-88. The
book attempts to provide a practical guide for numerical solution of mathe-
matical problems on digital computers. It is intended for both students and
research workers in science and engineering. Readers are expected to have a
reasonable background in mathematics as well as computer programming.
The enormous variety of problems that can be encountered in practice
makes numerical computation an art to some extent and as such the best way of
learning it is to practice it. Trying to learn numerical methods by just attending
lectures or by just reading a book is like trying to learn how to play football
by the same method. All students are therefore urged to try out the methods
themselves on any problem that they may fancy, not necessarily the ones given
in this book. In fact, the reader will probably not appreciate the contents,
unless a serious attempt is made to use numerical methods to solve problems
on computers.
There is a wide difference between numerical methods for computers and
those for desk calculators. The enormous increase in speed enables us to solve
much larger problems on computers and frequently such problems are more
sensitive to various errors during calculations. Apart from the tremendous in-
crease in speed, the main qualitative difference is that in computation with
desk calculators the intermediate results are always available whether we want
them or not. Hence, in most cases, it is rather straightforward to detect the
numerical errors creeping in the computation and the corrective action can be
taken easily. Further, any unexpected result can also be tackled appropriately.
But in computation with modern computers, the intermediate results are usu-
ally not available and consequently all possibilities have to be considered while
writing the programs. Error propagation in the computation should also be es-
timated beforehand. Otherwise, the error accumulation could render a plausible
appearing set of results completely meaningless and no indication of this diffi-
culty may be present in the final results. This makes the task of computation
conceptually more difficult, even though the process itself is far more efficient.
In particular, the error estimation is far more important with digital computers
than it was with desk calculators. As a result, a detailed discussion of errors and
pitfalls in numerical calculations has been included in this book, even though

rigorous proofs may not always be given. An attempt is also made to illustrate
the effect of errors in numerical computations by giving appropriate examples,
rather than giving detailed theory of error propagation.
Keeping in view the intended readership, detailed derivations and proofs
of numerical methods and their properties have been avoided. As far as possible,
a sufficient discussion of basic assumptions behind each numerical method and
its limitations are provided. In most cases, the numerical methods are presented
in a nonalgorithmic form. It was felt that the algorithmic form may give a very
rigid picture of these methods, while in actual practice, considerable flexibility
is required to tailor these methods for effective practical implementation. Con-
sidering the wide variety of possible problems, no algorithm can successfully
solve the entire class of problems that may arise in practice. For example, no
single algorithm can integrate all possible functions that we can think of, even
if it is restricted to the class of continuous and bounded functions over finite
intervals. In fact, given a fixed algorithm, we can construct examples where the
algorithm will fail miserably. On the other hand, given a Riemann integrable
function, we can evaluate the integral using some algorithm based on the trape-
zoidal rule, provided "sufficient" computer time is available. Nevertheless, over
one hundred Fortran subprograms implementing some of the numerical meth-
ods are provided in Appendix B. These subprograms may serve to fill the
gaps left in description of the algorithm and can also be useful for solving prob-
lems. While an attempt has been made to test these subprograms, it is not
possible to guarantee their correctness. Hence, before using these subprograms,
the reader is urged to verify their correctness. Even if there are no "bugs" in
the program, it may not necessarily give the correct result in all cases, since
the algorithm used in writing the program itself may not be capable of solving
all possible problems.
In many cases, more than one method for solving the same class of prob-
lems is described. Although, there is some discussion about the merits of each
of these methods, it is not generally possible to give algorithmic description
for the choice of the best method for a given problem. Numerous examples are
included in the text to illustrate the working of numerical methods, as well as
to bring out the pitfalls in these methods. In addition, a number of exercises
are given at the end of each chapter to help the students in improving their
understanding of the material covered in the chapter. Short answers or hints to
most of the exercises are given in Appendix A. In case of numerical problems,
the answer should generally be correct to all figures specified there. Once again,
it is almost impossible to guarantee the correctness of all answers, and read-
ers should be prepared for some misprints and blunders. The exercises are of
varying difficulty and type: Some of the exercises are simple proofs or deriva-
tions of material covered in the text, while others provide results which are
extension of subject matter considered in the text. Apart from these, there are
computational problems, some of which are expected to expose the limitations
of numerical methods.

This book is intended to cover all elementary topics in numerical compu-


tations. However, it is difficult to fulfill this intention in actual practice and
readers may find some of their favourite topics missing. In fact, the topics of
each of the chapters (and some of the sections) have themselves been the sub-
jects of a full-fledged book. Consequently, in each chapter, the emphasis is on
simple problems where the intricacies of numerical methods can be easily un-
derstood and demonstrated. For more advanced topics, it may be necessary to
consult specialised books, which have been listed in the Bibliography at the
end of each chapter. This Bibliography is by no means exhaustive, but is only
intended to guide the readers in extending their knowledge. The choice of top-
ics included in this book is obviously biased by my experience. A glance at the
table of contents will reveal the subject matter covered in the text.
The first two chapters give an introduction to errors in numerical com-
putations. Different types of errors, as well as techniques for estimating them
are described in these chapters. Subsequent chapters discuss numerical meth-
ods for various mathematical problems. Attempt has been made to make each
of these chapters self-contained as far as possible, so that readers with some
background in numerical methods can directly go through the concerned chap-
ter, preferably after reading the first two chapters. However, certain amount of
back referencing is unavoidable. In particular, Chapters 9-13, which deal with
more advanced topics, require some material from the previous chapters.
It may not be possible to cover the entire material included in this book
in any course, but by leaving out appropriate portions it should be possible to
use the book effectively. The choice of topics which can be excluded or included
will naturally depend on the experience and taste of the teacher. The book is
intended to cover most of the important topics in numerical methods, so that
it may be useful also as a reference book for research workers and engineers.
The material in this book has been drawn from a variety of sources,
for which I am indebted to many numerical analysts. In particular, the in-
fluence of the books by A. Ralston and P. Rabinowitz, and J. H. Wilkinson,
should be clear to anyone who is familiar with them. It was after some per-
suasion from B. M. Udgaonkar that I undertook the task of writing this book.
I am thankful to him for constant encouragement during the preparation of
this book, and seeing the book through to its final stage. I am thankful to
B. Banerjee, S. M. Chitre, M. H. Gokhale, P. S. Joag, D. Narasimha, T.
Padmanabhan and R. K. Singh for going through and commenting on parts
of the manuscript. The preparation of a book of this size has undoubtedly
placed a strain on the forbearance of both my colleagues in the Astrophysics
Group and my mother, and I am thankful to them for their patient coop-
eration. I am also thankful to the Tata Institute of Fundamental Research
for allowing me a generous use of the facilities for preparing the book. Fi-
nally, I wish to thank D. E. Knuth for his excellent computer typesetting lan-
guage TEX, without which I would not have thought of writing this book.
H. M. Antia
Notation

The table below is a short list of symbols and notation used in this book. More
detailed explanation, if necessary, may be found at the first use of the symbol.

Meaning                                           Example           Page First Used

Reference to Exercises at the end of Chapter      {9}                19
                                                  {10.2}             46
Reference to Bibliography at the end of Chapter   Hamming (1973)      1
Ceiling operation                                 ⌈x⌉                36
Closed interval                                   [a, b]            112
Complex conjugate                                 z*                 62
Derivatives                                       f'(x), f''(x)      57
                                                  f^(k)(x)          114
Factorial                                         n!                 11
Floor operation                                   ⌊x⌋                39
Identity matrix                                   I                  45
Kronecker's delta                                 δ_ij               81
Open interval                                     (a, b)             31
Order of magnitude                                O(h²)               5
Vectors - boldface letters                        x                  44
List of Computer Programs

Chapter 2 Roundoff Error


CASSUM Cascade sum of a finite series
ROUND Round a floating-point number to a specified number of digits

Chapter 3 Linear Algebraic Equations


GAUELM Solve a system of linear equations using Gaussian elimination
MATINV Calculate inverse of a square matrix using Gaussian elimination
CROUT Solve a system of linear equations using Crout's algorithm
CROUTH Iterative refinement of solution of a system of linear equations
CHOLSK Solve a system of linear equations with symmetric positive defi-
nite matrix using Cholesky's decomposition
GAUBND Solve a system of linear equations with a band matrix using
Gaussian elimination with partial pivoting
SVD Singular value decomposition of a matrix
SVDEVL Solve a system of linear equations using singular value decompo-
sition

Chapter 4 Interpolation
DIVDIF Calculate interpolation and its derivatives using divided differ-
ence formula
DIVDIFO Divided difference interpolation formula (no derivatives version)
NEARST Find nearest point in an ordered table using bisection
SPLINE Calculate coefficients of interpolating cubic spline
SPLEVL Evaluate the cubic spline and its derivatives at a specified point
SMOOTH Draw a smooth curve through a set of points using cubic spline
BSPLIN Calculate B-spline basis functions on a set of knots
BSPINT Calculate coefficients of B-spline interpolation
BSPEVL Evaluate function value and its derivatives using B-spline expan-
sion
RATNAL Calculate rational function interpolation
POLY2 Calculate polynomial interpolation in two dimensions
LINRN Calculate linear interpolation in n dimensions
LOCATE Find the bracketing subinterval in an ordered table
BSPINT2 Calculate coefficients of B-spline interpolation in two dimensions

BSPEV2 Evaluate function value and derivatives using B-spline expansion


in two dimensions
BSPINTN Calculate coefficients of B-spline interpolation in n dimensions
BSPEVN Evaluate function value using B-spline expansion in n dimensions
BSPEVN1 Evaluate function value and first derivative using B-spline ex-
pansion in n dimensions
BSPEVN2 Evaluate function value and first and second derivatives using
B-spline expansion in n dimensions

Chapter 5 Differentiation
DRVT Differentiation using h → 0 extrapolation

Chapter 6 Integration
SIMSON Integration using Simpson's 1/3 rule
SPLINT Integrate a tabulated function using cubic spline
BSPQD Integrate a B-spline expansion
ROMBRG Romberg integration
EPSILN Integration using ε-algorithm
GAUSS Integration using Gauss-Legendre formula
GAUCBY Integration using Gauss-Chebyshev formula with weight function
w(x) = 1/√((x - A)(B - x))
GAUCB1 Integration using Gauss-Chebyshev formula with weight function
w(x) = √((x - A)/(B - x))
GAUCB2 Integration using Gauss-Chebyshev formula with weight function
w(x) = √((x - A)(B - x))
GAUSQ2 Integration over (0, A] with square root singularity using a com-
bination of Gaussian formulae
GAUSQ Integration over (0, A] using Gaussian formula with weight function
w(x) = 1/√x
GAULAG Integration over [A, 00) using a combination of Gaussian formu-
lae
LAGURE Integration over [A, 00) using Gauss-Laguerre formula
HERMIT Integration over (-00,00) using Gauss-Hermite formula
GAULG2 Integration over (0, A] with logarithmic singularity using a com-
bination of Gaussian formulae
GAULOG Integration over (0, A] using Gaussian formula with weight function
w(x) = ln(A/x)

GAUSRC Calculate weights and abscissas of Gaussian formula using recur-


rence relation of orthogonal polynomials
GAULEG Calculate weights and abscissas of Gauss-Legendre quadrature
formulae
GAUJAC Calculate weights and abscissas of Gauss-Jacobi quadrature for-
mulae
LAGURW Calculate weights and abscissas of Gauss-Laguerre quadrature
formulae
GAUHER Calculate weights and abscissas of Gauss-Hermite quadrature
formulae
GAUSWT Calculate weights and abscissas of Gaussian formula using mo-
ments of weight function
FILON Integration of an oscillatory function using Filon's formula
ADPINT Adaptive integration over a finite interval
KRONRD Integration using Gauss-Kronrod formula for use with ADPINT
GAUS16 Integration using 16 point Gauss-Legendre formula for use with
ADPINT
CAUCHY Calculate Cauchy principal value of an integral
EULER Summation of alternating series using Euler transformation
BSPQD2 Integrate a B-spline expansion in two dimensions
BSPQDN Integrate a B-spline expansion in n dimensions
MULINT Multiple integration using product Gauss rule with varying num-
ber of points
NGAUSS Multiple integration using a specified product Gauss rule
SPHND To convert from hyper-spherical coordinates to Cartesian coor-
dinates
STRINT Multiple integration using monomial rules with varying number
of points
STROUD Multiple integration using a specified monomial rule
MCARLO Multiple integration using Monte Carlo method
RAN1 Generate a sequence of random numbers with uniform distribu-
tion
RANF Generate a sequence of random numbers with uniform distribu-
tion
EQUIDS Multiple integration using equidistributed sequences

Chapter 7 Nonlinear Algebraic Equations


BISECT Solve a nonlinear equation using bisection
SECANT Solve a nonlinear equation using secant iteration
SECANC Complex roots of a nonlinear equation using secant iteration
SECANI Solve a nonlinear equation using secant iteration (with reverse
communication)
NEWRAP Solve a nonlinear equation using Newton-Raphson method
BRENT Solve a nonlinear equation using Brent's method
SEARCH Locate complex zeros by looking for sign changes

ZROOT Complex roots of a nonlinear equation with deflation


ZROOT2 Complex roots of a nonlinear equation with deflation, function
value in scaled form, f(x) × 2^ix
MULLER Complex root using Muller's method
MULER2 Complex root using Muller's method with function value in a
scaled form, f(x) × 2^ix
DELVES Complex zeros of an analytic function using quadrature based
method
CONTUR Contour integration over a circular contour for DELVES
NEWRAC Complex root of a nonlinear equation using Newton-Raphson
method
POLYR All roots of a polynomial with real coefficients
LAGITR One root of a polynomial with real coefficients using Laguerre's
method
POLYC All roots of a polynomial with complex coefficients
LAGITC One root of a polynomial with complex coefficients using La-
guerre's method
DAVIDN Solve a system of nonlinear equations using Davidenko's method
NEWTON Solve a system of nonlinear equations using Newton's method
BROYDN Solve a system of nonlinear equations using Broyden's method

Chapter 8 Optimisation
BRACKM Bracketing a minimum in one dimension
GOLDEN Minimisation in one dimension using golden section search
BRENTM Minimisation in one dimension using Brent's method
DAVIDM Minimisation in one dimension using cubic Hermite interpolation
BFGS Minimisation in n dimensions using quasi-Newton method with
BFGS formula
LINMIN Line search for quasi-Newton method
FLNM Calculate the function value for line search for quasi-Newton
method
NMINF Minimisation in n dimensions using direction set method
LINMNF Line search for direction set method
FLN Calculate the function value for line search for direction set
method
SIMPLX Solving a linear programming problem using simplex method
SIMPX Simplex method for a linear programming problem in the stan-
dard form

Chapter 9 Statistical Inferences


SHSORT Sorting an array in ascending order using shell sort algorithm
GAMMAP Calculate incomplete Gamma function
BETAP Calculate incomplete Beta function
BETSER Calculate incomplete Beta function using a power series approx-
imation

BETCON1 Calculate incomplete Beta function using a continued fraction


approximation
BETCON Calculate incomplete Beta function using an alternative contin-
ued fraction approximation
BETAI Calculate incomplete Beta function by directly evaluating the
integral
FBETA Calculate the integrand for BETAI
RANGAU Generate a sequence of random numbers with Gaussian distri-
bution
IRANBIN Generate a sequence of random numbers with binomial distribu-
tion
IRANPOI Generate a sequence of random numbers with Poisson distribu-
tion
PCOR Calculate the probability that two uncorrelated sequences will
give a correlation coefficient exceeding a given value

Chapter 10 Functional Approximations


POLFIT Least squares polynomial fit using orthogonal polynomials
POLEVL Evaluate the fitted polynomial and its derivatives at a specified
point
POLFIT1 Least squares polynomial fit using orthogonal polynomials, sim-
plified version for multiple data sets
POLORT Evaluate the orthogonal polynomial basis functions at a given
point
POLFIT2 Least squares polynomial fit using orthogonal polynomials in two
dimensions
POLEV2 Evaluate the fitted polynomial and its derivatives at a specified
point in two dimensions
POLFITN Least squares polynomial fit using orthogonal polynomials in n
dimensions
POLEVN Evaluate the fitted polynomial at a specified point in n dimen-
sions
POLEVN1 Evaluate the fitted polynomial and first derivative at a specified
point in n dimensions
POLEVN2 Evaluate the fitted polynomial and first and second derivatives
at a specified point in n dimensions
LLSQ Linear least squares fit in n dimensions to a user defined set of
basis functions
BSPFIT Least squares fit to B-spline basis functions in one dimension
BSPFIT2 Least squares fit to B-spline basis in two dimensions with equal
weights
BSPFITW2 Least squares fit to B-spline basis in two dimensions with arbi-
trary weights
BSPFITN Least squares fit to B-spline basis in n dimensions with equal
weights

BSPFITWN Least squares fit to B-spline basis in n dimensions with arbitrary


weights
LINFITXY Least squares straight line fit when there are errors in both x
and y values
NLLSQ Calculate the χ² function for a nonlinear least squares fit
DFT Discrete Fourier transform of complex data with arbitrary num-
ber of points
FFT Fast Fourier transform of complex data
FFTR Fast Fourier transform of real data
FFTN Fast Fourier transform of complex data in n dimensions
LAPINV Inverse Laplace transform
POLD Evaluate a polynomial and its derivative at any point
RMK Evaluate a rational function at any point
RMK1 Evaluate a rational function at any point (constant term in de-
nominator 1)
RMKD Evaluate a rational function and its derivative at any point
RMKD1 Evaluate a rational function and its derivative at any point (con-
stant term in denominator 1)
PADE Calculate coefficients of Pade approximations
CHEBCF Convert from power series to Chebyshev expansion and vice versa
CHEBEX Calculate the coefficients of Chebyshev expansion
CHEBAP Rational function approximation using Chebyshev polynomials
REMES Minimax approximation to mathematical functions using Remes
algorithm
FM Calculate error in rational function approximation for use with
REMES
GAMMA Calculate Gamma function at real x, Γ(x)
GAMMAL Calculate natural logarithm of Gamma function at real x,
ln |Γ(x)|
ERF Calculate Error function at real x
ERFC Calculate complementary Error function at real x
BJ0 Calculate Bessel function of first kind of order zero, J0(x)
BJ1 Calculate Bessel function of first kind of order one, J1(x)
BJN Calculate Bessel function of first kind of integral order, Jn(x)
BY0 Calculate Bessel function of second kind of order zero, Y0(x)
BJY0 Calculate Bessel function of first and second kind of order zero
BY1 Calculate Bessel function of second kind of order one, Y1(x)
BJY1 Calculate Bessel function of first and second kind of order one
BYN Calculate Bessel function of second kind of integral order, Yn(x)
SPRBJN Calculate spherical Bessel function of integral order, jn(x)
BI0 Calculate modified Bessel function of first kind of order zero,
I0(x)
BI1 Calculate modified Bessel function of first kind of order one, I1(x)
BIN Calculate modified Bessel function of first kind of integral order,
In(x)

BK0 Calculate modified Bessel function of second kind of order zero,
K0(x)
BK1 Calculate modified Bessel function of second kind of order one,
K1(x)
BKN Calculate modified Bessel function of second kind of integral or-
der, Kn(x)
DAWSON Calculate the value of Dawson's integral
FERMM05 Calculate the Fermi integrals for k = -1/2
FERM05 Calculate the Fermi integrals for k = 1/2
FERM15 Calculate the Fermi integrals for k = 3/2
FERM25 Calculate the Fermi integrals for k = 5/2
PLEG Calculate the Legendre polynomial, P_l(x)
PLM Calculate the associated Legendre function, P_l^m(x)
YLM Calculate the spherical harmonic, Y_l^m(θ, φ)
MINMAX Rational function minimax approximation to discrete data
POLYL1 Polynomial L1-approximation to discrete data
LINL1 Linear L1-approximation to discrete data for arbitrary basis
functions
SIMPL1 Modified simplex method for LP problems in L1-approximation

Chapter 11 Algebraic Eigenvalue Problem


INVIT Eigenvalue and eigenvector using inverse iteration
TRED2 Reduction of a real symmetric matrix to symmetric tridiagonal
form using Householder transformations
TRBAK Back-transform eigenvectors of tridiagonal matrix to original ma-
trix
TQL2 Eigenvalue problem for a symmetric tridiagonal matrix using QL-
algorithm
TRIDIA Specified eigenvalues and eigenvectors of a symmetric tridiagonal
matrix using Sturm sequence and inverse iteration
STURM Locate eigenvalues of a symmetric tridiagonal matrix using
Sturm sequence
TINVIT Eigenvalue and eigenvector of a symmetric tridiagonal matrix
using inverse iteration
HEREVP Eigenvalue problem for a complex Hermitian matrix
BALANC Balancing a general real matrix
BALBAK Back-transform eigenvectors of balanced matrix to original ma-
trix
ELMHES Reduce a real matrix to Hessenberg form using Gaussian elimi-
nation
HQR Eigenvalues of a Hessenberg matrix using QR-algorithm

Chapter 12 Ordinary Differential Equations


RKM Initial value problem using fourth-order Runge-Kutta method
with adaptive step size

RK4 One step of integration using fourth-order Runge-Kutta method


RK2 One step of integration using second-order Runge-Kutta method
MSTEP Initial value problem using predictor-corrector method with
adaptive step size
ADAMS One step of integration using fourth-order Adams method
STRT4 Starting values for multistep method using Runge-Kutta method
GEAR One step of integration using fourth-order stiffly stable method
EXTP Initial value problem using extrapolation method
FDM Two-point boundary value problems using Finite Difference
Method
GEVP Eigenvalue problem in differential equations using finite differ-
ences
GAUBLK Solve a system of linear equations involving finite difference ma-
trix
SETMAT Generate finite difference matrix for a system of differential equa-
tions
BSPODE Two-point boundary value problem using expansion method with
B-spline basis functions

Chapter 13 Integral Equations


FRED Solve a Fredholm equation using quadrature method
FREDCO Solve a Fredholm equation using collocation method
FUNK Integrand = K(x, t)φ_j(t), for evaluating integrals in collocation
method
RLS Solve a linear inverse problem using regularised least squares
technique
FORW Solve the forward problem
VOLT Solve a linear Volterra equation using trapezoidal rule
VOLT2 Solve a nonlinear Volterra equation of the second kind using
Simpson's rule

Chapter 14 Partial Differential Equations


CRANK Linear second-order parabolic equation using Crank-Nicolson
method
LINES Nonlinear parabolic equations using the method of lines
ADM Parabolic equation in two space variables using alternating di-
rection method
LAX Nonlinear hyperbolic equations using the Lax-Wendroff method
SOR Linear second-order elliptic equations using the successive over-
relaxation (SOR) method
ADI Linear second-order elliptic equations using the alternating di-
rection implicit iterative (ADI) method
Chapter 1

Introduction

The process of solving physical problems can be roughly divided into three
phases. The first consists of constructing a mathematical model for the cor-
responding physical problem. This model could be in the form of differential
equations or algebraic equations. In most cases, this mathematical model can-
not be solved analytically, and hence a numerical solution is required. In that
case, the second phase in the solution process usually consists of constructing an
appropriate numerical model or approximation to the mathematical model. For
example, an integral or a differential equation in the mathematical formulation
will have to be approximated appropriately for numerical solution. A numeri-
cal model is one where everything in principle can be calculated using a finite
number of basic arithmetic operations. The third phase of the solution process
is the actual implementation and solution of the numerical model. Although, in
this book we will be concerned with numerical solution to mathematical prob-
lems, it should be noted that numerical methods cannot substitute analytic
techniques. Instead, they should complement analysis. As Hamming (1987) has
rightly pointed out, the purpose of computing is insight, not numbers. Quite of-
ten, experimenting with the numerical solution provides a better understanding of
the physical problem. Sometimes, it even helps in finding an analytic approach
to the problem.
This book is concerned with the last two phases of the solution process
mentioned above. The first phase is beyond the scope of this book, but it will
be briefly touched upon in Section 1.1. Each of these phases involves some
approximation. In the first phase, the real world problem is approximated by
a mathematical model, while in the second phase the mathematical model is
approximated by a numerical model and finally in the third phase the numbers
are approximated. The reliability of the final result will depend on these ap-
proximations. Hence, estimating the error in each of these phases is an integral
part of the solution process. Without an error estimate or bound, the solution
is of little use. In this chapter, we give a brief introduction to various sources of
errors in numerical computations. In the next chapter, we consider the round-

off error which arises in the final phase of the solution process. In subsequent
chapters, where we will discuss numerical methods for different mathematical
problems, more details of error estimation will be given.

1.1 Errors in Numerical Computation


Each of these three phases of solution process introduces errors in the final
result. In the first phase, the errors could be either due to our inadequate
understanding of the physical problem, or else the system is so complicated
that we have to introduce some approximations in order to make the problem
tractable. For example, if we are considering the motion of a projectile near the
surface of the earth, the effect of air friction has to be included, but this effect
is not fully understood and as a result, some error will be introduced in the
mathematical model. On the other hand, the effect due to gravitational attrac-
tion of the moon or the relativistic effect could in principle be incorporated in
the mathematical model, but we may decide to neglect these in order to keep
the model tractable. This error is sometimes referred to as the modelling error.
Apart from this, there could be measurement error in the first phase of solu-
tion process. This error arises because most models of physical systems require
experimentally determined data as input. These data will have their associated
errors, which introduce uncertainty in the final solution. In the second phase,
errors could be introduced when an essentially infinite process like summing an
infinite series or evaluating an integral is approximated by a finite numerical
process. This error is referred to as truncation error, since it is usually due
to truncating an infinite process. This error will be considered in Section 1.2.
Finally, in the third phase involving the actual numerical computation, there
are errors due to finite precision with which the calculations can be carried
out. This is called the roundoff error, since the numbers are usually rounded
off to a finite accuracy during the calculations. This error will be discussed in
Chapter 2.
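The finite precision responsible for roundoff error can be demonstrated on any
computer with a few lines of code. The following C program is only a sketch
written for this discussion (it is not one of the routines described in the
Appendices): it halves a number eps until adding it to 1.0 no longer changes
the stored result.

#include <stdio.h>

int main(void)
{
    double eps = 1.0;
    volatile double sum = 2.0;   /* volatile forces each sum to be stored */

    /* halve eps until 1.0 + eps rounds back to 1.0 */
    while (sum > 1.0) {
        eps /= 2.0;
        sum = 1.0 + eps;
    }
    /* the last eps that still changed the sum was 2*eps */
    printf("machine epsilon is about %e\n", 2.0 * eps);
    return 0;
}

On a machine using 64-bit IEEE arithmetic this prints a value near 2.2e-16,
which is why results cannot be trusted beyond about sixteen significant
decimal digits, however accurate the mathematical and numerical models may be.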
In general, we would like to balance the errors occurring in different phases
of the solution process for a given problem. Thus, for example, if the mathemat-
ical model introduces an error of one percent in the solution, there is no point
in obtaining the numerical solution with six-digit accuracy. This balance may
not be easy in more complicated problems, since each of the phases mentioned
above may involve several different steps, each with its own error and further
it may not be easy to estimate the errors involved. Sometimes, a considerable
amount of experimentation with methods and parameters may be needed to
achieve the right balance of errors. In order to estimate the accuracy of the final
result, it is essential to have a reasonable estimate of the error in each step of
the solution. It should be stressed that estimating the accuracy of numerical
computation is an essential part of the calculation process, without which the
calculation will have little meaning. Throughout this book, we have attempted
to discuss this aspect in sufficient detail.

The formulation of mathematical models is obviously beyond the scope


of this book, but this phase of problem solving is so important that a few
comments are in order here. It is not practical for a mathematical model to
include every detail of a real physical problem. The best models are those that
include just those features of the real problem, which are needed to reduce
the error in this phase to an acceptable level. If there is any problem with
the solution, then it is natural to improve or enlarge the mathematical model.
This could be tricky, since adding more details to the mathematical model may
make the problem almost impossible to analyse. A common fallacy is the belief
that enhancing a mathematical model will automatically lead to better results.
In actual practice, this enhancement often makes the solution of a problem
so difficult that much larger errors may be introduced at the later phases of
the solution process. Consequently, only those mathematical models which can
be solved with a reasonable effort are useful. In fact, limitations imposed by
the last two phases, often determine the mathematical model for a physical
problem.
Apart from equations, a mathematical model also involves data that must
come from the real world and hence will have some uncertainty associated with
it. Thus, for example, in the projectile problem the uncertainty in speed and
the angle of projection is more important than the relativistic correction to the
equation of motion. Further, the error in data is of random nature, which is more
difficult to analyse. These errors can be analysed using statistical techniques,
which are discussed in Chapter 9.
Apart from these errors which can be analysed in one way or the other,
there is also the possibility of what can be called a blunder. This can occur at
any of the three phases of solution process. In the first phase, we may overlook
some basic assumption required to obtain the mathematical model, or there
could be an error in deriving the mathematical equations, or the input data
could be completely wrong. Similar errors could occur in the second phase of
the solution process also. Sometimes the error can also be caused by a misprint
in one of the sources for the data or the equations. The probability of misprint
is not all that small. The author has encountered them on several occasions
and will not advise anyone to disregard this possibility (even in this book).
In the final solution phase, which these days would be almost invariably done
using digital computers, the blunders mainly occur at the programming stage.
The programming errors are usually referred to as bugs. Sometimes, undetected
bugs in the compilers or other system programs can also lead to errors; in which
case, it may be very difficult to pinpoint the source of error. Because of their
very nature, a systematic analysis of this class of errors is not possible and
we can only hope that the readers will take care to reduce the possibility of
blunders in their calculations.
A rather common blunder in the first phase is to use a mathematical model
which is not applicable to the physical system under consideration. We must
keep in mind the basic assumptions used to obtain the mathematical model for
the real problem, since it often happens that one of the basic assumptions is not
satisfied leading to a completely useless solution. We will only cite an example
to illustrate this point. There is an immense set of computer programs used to
validate and analyse the design of nuclear power plants. One of the possibilities
that is analysed in great detail is the meltdown of the core. One safety device
to take care of this possibility is a huge tank of water, which can be pumped
into the core if it gets too hot. The model of this process includes quantities
like temperature of the core, amount of water, capacities of the pumps and
pipes, and so forth. There was no experimental verification of this model, since
meltdown did not actually occur (until 1986), and so it was decided to test this
program by building a test apparatus. Gas was used to heat a dummy metal
core, and when the core was hot enough, the pumps were started. To everyone's
great surprise, there was practically no cooling and had it been a real reactor,
it could have exploded. On analysing the results, it turned out that as soon
as the water entered into the core, it turned to steam, which built up enough
pressure to prevent any further water from entering. Even though the program
was being used for several years, no one had thought about water turning into
steam and so this phenomenon was not included in the simulation program.

1.2 Truncation Error


In this section, we discuss the error incurred in the second phase of problem
solving, which involves the translation of mathematical model to a numerical
model. If the mathematical model can be solved in a finite number of arithmetic
operations (e.g., a system of n linear equations in n unknowns), then there is
no truncation error, since the numerical model is identical to the mathematical
model. However, in most cases the mathematical model cannot be solved in a
finite number of steps. Hence, the numerical model can only be some approx-
imation to it, giving rise to a truncation error. This error of course, depends
on the mathematical model and will be considered in each of the subsequent
chapters, where different mathematical models are discussed. Here we discuss
some general characteristics of truncation error.
It is necessary to distinguish between the truncation error and the round-
off error, which will be discussed in the next chapter. The roundoff error is
due to our inability to do the arithmetic operations exactly, while the trun-
cation error is due to the fact that the numerical model itself is not an exact
representation of the mathematical model. If we can carry out all arithmetic
operations exactly, then there will be no roundoff error, but truncation error
will still remain. The truncation error can be reduced by constructing a better
numerical model, which usually involves more calculations. If the mathemat-
ical model is basically infinite like an integral or a differential equation, then
the truncation error cannot be reduced to zero in any finite numerical model,
though it can be reduced to any specified level by increasing the amount of
calculations appropriately. For example, in numerical integration the trunca-
tion error can be reduced by appropriately increasing the number of points at
which the function is evaluated, but it cannot be reduced to zero, unless the
number of points is infinite, or the function to be integrated has some special
form (e.g., a polynomial).
For most numerical models the truncation error or at least a bound on
truncation error can be given, but usually it involves quantities which are not
directly known and may be difficult to estimate. For example, while evaluating
an integral by using the trapezoidal rule involving n points, the truncation er-
ror can be expressed in terms of the second derivative of the function at some
unknown point in the interval of integration. However, it may not be possible
to estimate the second derivative easily in many cases. Hence, an experimental
approach is often used to estimate the error. In most numerical models, there
is some parameter that governs the truncation error: for example, the number,
n of points used in evaluating an integral, or the step size, h used in solving a
differential equation. A practical technique for estimating the truncation error
is to vary this parameter and observe the computed results. If the computed
results settle down or converge sufficiently, then we decide to stop the compu-
tation and accept the result. This process of testing for convergence is used in
many programs for numerical computations. Unfortunately, testing for conver-
gence using only a finite amount of information is a mathematically insolvable
problem, since convergence is essentially a continuous concept. Thus, given any
(finite) test for convergence, we can always construct examples where the test
fails, in the sense that it flags convergence even though the computed result is
far from the true value.
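As an illustration of this experimental approach, the following C sketch (the integrand exp(-x^2), the interval [0, 1] and the tolerance are arbitrary choices made here, not prescriptions from the text) keeps doubling the number of points used by the trapezoidal rule until two successive estimates agree to within the tolerance. As discussed below, such a test can be fooled, so it should be regarded as a heuristic rather than a guarantee.

    #include <math.h>
    #include <stdio.h>

    /* Composite trapezoidal rule with n subintervals on [a, b]. */
    static double trapezoid(double (*func)(double), double a, double b, int n)
    {
        double h = (b - a) / n, sum = 0.5 * (func(a) + func(b));
        for (int i = 1; i < n; i++)
            sum += func(a + i * h);
        return h * sum;
    }

    static double f(double x) { return exp(-x * x); }   /* example integrand */

    int main(void)
    {
        double tol = 1e-8, prev = trapezoid(f, 0.0, 1.0, 2);
        for (int n = 4; n <= (1 << 22); n *= 2) {
            double curr = trapezoid(f, 0.0, 1.0, n);
            if (fabs(curr - prev) < tol) {          /* the (fallible) convergence test */
                printf("apparently converged with n = %d: I = %.10f\n", n, curr);
                return 0;
            }
            prev = curr;
        }
        printf("no convergence up to the largest n tried; last value = %.10f\n", prev);
        return 0;
    }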
The rate at which the truncation error tends to zero as the parameters of
the method vary, is usually referred to as the order of convergence. This is used
for comparing methods and for obtaining an intuitive feel for the efficiency of
a method. There are specific definitions for the order of convergence, which are
applicable for certain methods like iterative methods for solution of nonlinear
equations, or for difference methods for solution of differential equations. In
general, this is done by comparing the behaviour of truncation error with stan-
dard functions like n^{-k}, e^{-n} or (log n)/n. More precisely, the truncation error
e(x) is said to be O(f(x)) as x tends to L, if

    lim_{x→L} |e(x)/f(x)| < ∞,                                            (1.1)

where "< ∞" means that the limit is finite. Here x is the parameter of the
method, which is usually selected to be the number of points or steps n, in
which case the limit is taken as n → ∞. Sometimes, as in the solution of a
differential equation, the parameter is the step size h of the grid of points used
in the solution, in which case the limit is taken as h → 0. Thus, 3/n^2, 4/n^2 + 5/n^{2.5}
or 5/n^2 + (log n)/n^3 are all O(1/n^2) as n → ∞. Many times when the truncation
error is O(1/n^2) or O(h^2), we say that the method has an order of convergence
two, or that it is a second-order method.
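The order of convergence can also be estimated experimentally. The following C sketch (a simple illustration using the series Σ 1/n^2, whose truncation error behaves like 1/N; the series and the values of N are arbitrary choices) estimates the exponent p in error ≈ C N^{-p} by comparing the errors obtained with N and 2N terms.

    #include <math.h>
    #include <stdio.h>

    /* Partial sum of sum_{n>=1} 1/n^2; the tail (truncation error) is ~ 1/N. */
    static double partial_sum(int nterms)
    {
        double s = 0.0;
        for (int n = 1; n <= nterms; n++)
            s += 1.0 / ((double)n * n);
        return s;
    }

    int main(void)
    {
        const double exact = 1.6449340668482264;    /* pi^2/6, the true sum */
        double e_prev = fabs(partial_sum(1000) - exact);
        for (int n = 2000; n <= 64000; n *= 2) {
            double e = fabs(partial_sum(n) - exact);
            /* if error ~ C * N^(-p), then p ~ log2( e(N) / e(2N) ) */
            printf("N = %6d   error = %.3e   observed order = %.2f\n",
                   n, e, log(e_prev / e) / log(2.0));
            e_prev = e;
        }
        return 0;
    }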
[Figure 1.1: The sawtooth function s(x) for n = 9]

Continuing our example of evaluating an integral using the trapezoidal
rule, it is known that the truncation error in this case is inversely proportional
to the square of the number of points, provided this number is sufficiently
large. Thus, if we evaluate the integral using successively larger values of n,
we can in principle bound the error. However, there is some uncertainty, since
we do not know what value of n is "sufficiently large". Another loophole in
this procedure is that the truncation error is proportional to n^{-2} only if the
function has bounded derivatives (up to second-order), which may not be true
in all cases. One of the strategies that is usually adopted in such situations is
to evaluate the integral using n, 2n, 4n, ..., 2^k n points, until two successive
values differ by less than the desired error bound. However, we can construct
any number of examples where two successive values will differ by an amount
which is arbitrarily smaller than the truncation error even when the iterative
process is far from convergence (see Section 6.7). Assuming that the test is
satisfied using n points a_1, ..., a_n and that the points are ordered such that
a_i < a_{i+1} for i = 1, ..., n-1, we can construct a sawtooth function s(x) defined
as follows (Figure 1.1)

    s(x) = 2(x - a_i)/(a_{i+1} - a_i)        for a_i <= x <= (a_i + a_{i+1})/2,
           2(a_{i+1} - x)/(a_{i+1} - a_i)    for (a_i + a_{i+1})/2 <= x <= a_{i+1}.    (1.2)

If we add any multiple of s(x) to our original function f(x), the integral as
estimated by the trapezoidal rule will be the same, while the true value of the
integral can be increased arbitrarily by increasing the constant multiplying s(x).
Instead of the sawtooth function we can add some multiple of the polynomial

    p(x) = ∏_{i=1}^{n} (x - a_i)^2.                                        (1.3)

Once again the computed value of the integral will be the same, even though
the true value could be completely different. Hence, it is obvious that although
the result has apparently converged, it could be completely wrong.
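The following C sketch illustrates this numerically. The integrand x^2, the nine equally spaced abscissas on [0, 1] and the multiplier 100 are arbitrary choices made here for illustration; the essential point is that the sawtooth vanishes at every abscissa, so the trapezoidal estimates of f and f + 100s come out identical even though the true integrals differ by 50.

    #include <math.h>
    #include <stdio.h>

    #define NPT 9                       /* trapezoidal abscissas a_i = i/8 on [0, 1] */

    /* Sawtooth of the form (1.2) for equally spaced points: zero at every
       abscissa and equal to 1 midway between consecutive abscissas. */
    static double saw(double x)
    {
        double h = 1.0 / (NPT - 1);
        double t = fmod(x, h) / h;      /* position within the current subinterval */
        return (t <= 0.5) ? 2.0 * t : 2.0 * (1.0 - t);
    }

    static double f(double x)      { return x * x; }              /* harmless integrand */
    static double f_plus(double x) { return f(x) + 100.0 * saw(x); }

    static double trapezoid(double (*g)(double), int n)           /* n points on [0, 1] */
    {
        double h = 1.0 / (n - 1), s = 0.5 * (g(0.0) + g(1.0));
        for (int i = 1; i < n - 1; i++)
            s += g(i * h);
        return h * s;
    }

    int main(void)
    {
        printf("trapezoidal estimate of f         : %.6f\n", trapezoid(f, NPT));
        printf("trapezoidal estimate of f + 100 s : %.6f\n", trapezoid(f_plus, NPT));
        /* Both printed values agree, yet the true integrals differ by 100 * 1/2 = 50. */
        return 0;
    }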
Such examples can be constructed to demonstrate the failure of
any convergence test. Hence, numerical methods can never be infallible. There
are two alternatives here: the first is to assume that difficulties are rare and not
likely to occur in practice, while the second is to specify some restrictions on the
class of problems that can be solved by the algorithm under consideration. In
the first alternative, we can improve the reliability by considering more than
one convergence test, so that the possibility of a spurious convergence may
be reduced considerably. Of course, it can never be foolproof. In the second
alternative, sometimes the restrictions may be very severe, in the sense that
most problems do not satisfy them, or else it may not be easy to verify if they
are actually satisfied or not. Again, considering our example of integration using
the trapezoidal rule, we can specify one of the following restrictions
1. The function is monotonic in the interval of integration.
2. The function oscillates at most once in any interval of length 0.0001.
3. The second derivative of the function is less than 10000 throughout the
interval of integration.
Each of these assumptions rules out the pathological sawtooth or polynomial
functions shown above and allows us to devise a convergence test which "al-
ways" works. It is important to examine each of these restrictions, to see if
the practical problems do actually satisfy them and how easy it is to devise a
convergence test using them. It can be seen that
1. This assumption is easy to test, but is too restrictive, since most functions
that we come across in practice will fail to satisfy this requirement.
2. This assumption is easy to test and most practical problems will satisfy
it, but its use requires the function to be evaluated at least once in an
interval of 0.0001, and hence it is very expensive.
3. This assumption is neither very restrictive nor is its use very expensive, but
unfortunately it is not easy to test. The calculation of a second derivative
can itself be an insolvable problem for many functions. The sawtooth
function of Figure 1.1 satisfies this condition almost everywhere, except
for a finite number of points. Hence it may not be easy to detect such
examples.
Thus, the statement of assumptions as a protective step is helpful, but it does
not guarantee that a method will always work in practice. Hence, we have to
live with this uncertainty and it is only with experience that we can identify
the problems for which the specified method will fail. In this book an attempt
will be made to give some idea about the pitfalls of various numerical methods
for practical problems.
1.3 Programming
The readers are expected to be familiar with computer programming, since
most of the methods described in this book are expected to be used on digital
computers. In this section, a brief discussion of some of the common pitfalls
of computer programming are described. This section is not expected to be
an exhaustive summary of computer programming, but just a warning signal
for those who are not very familiar with the art of programming. It is a com-
mon misconception to think that once one has got a few programs working or
has attended some lectures on computer programming, one has learnt all that
is required about programming. Learning the art of programming is just like
learning to play football. One cannot become a football player by kicking the
ball a few times. To become a good football player one has to put in a system-
atic and dedicated effort. Similarly, it is not possible to become a programmer
by just running small or large programs, but for that it is necessary to put
in a conscious effort. Unfortunately, most of the computer users including the
scientists and engineers to whom this book is addressed tend to neglect pro-
gramming, even though they may spend a sizable fraction of their time sitting
in front of computer terminals. I do not wish to discuss the pros and cons of
this attitude, but I will certainly be happy if after reading this book readers get
down to trying some of the algorithms themselves on the computer. A number
of computer programs are given in Appendices Band C, but the readers are
not expected to use them as black boxes. Instead it is expected that readers
will tailor these programs to suit their needs.
It is a universal experience that programmers are often so excited when a
program runs fully without complaining, that they immediately proclaim the
program to be debugged, without even bothering to check whether the output
is correct or not. Even if the computed result is correct in one or two cases,
that does not guarantee that the program has no bugs. In fact, it is usually
impossible to guarantee the correctness of any reasonably general program. We
can only reduce the possibility of error by carefully checking the program with
various different input data.
As far as possible, we have tried to avoid reference to any particular pro-
gramming language in this book, since the readers are free to use their favourite
programming language to communicate the algorithms to the computer. Nev-
ertheless, all programs in Appendices are written in Fortran or C, and the
primary reason for this choice is that, these are the languages with which the
author is familiar. These languages are by no means uniquely standardised and there are
many standards, e.g., Fortran 77 and Fortran 90. Further, most compilers may
not implement the full standard, while a few nonstandard facilities are added
to "enhance" the language. As a result, there are various dialects of these lan-
guages available to the users and any substantial program is unlikely to work
correctly when moved to a new system, unless adequate care has been taken to
avoid usage of nonstandard features. The programs in this book are all written
in Fortran 77 or ANSI C, but may require some changes before running them
on a new system. We leave it to the readers to figure out the changes needed
to run the programs on their machines.
Apart from idiosyncrasies of computer languages, there is also the confu-
sion caused by roundoff errors (see Chapter 2). Because of roundoff errors, two
numbers which are expected to be equal may not come out to be so, and this
has to be accounted for while writing the program. Fortran statements like

IF(A.EQ.B) GO TO 100 or IF(A-B) 50,100,200

could be dangerous, since if the numbers A and B are computed independently,
they may not come out to be exactly equal. Arithmetic IF with an expression
of type real should be avoided, while if the equality of two numbers is to be
checked, then the statement should preferably read

IF(ABS(A-B) .LT.AEPS) GO TO 100

where AEPS is a suitably chosen small number. The choice of AEPS depends
on the expected roundoff errors in evaluating A and B, which is always difficult
to estimate. Hence, as far as possible the algorithms should avoid checking
for equality of two floating-point numbers. The most common requirement for
such a check arises while testing for convergence of some iterative process. For
example, if a recurrence relation

    x_{n+1} = F(x_n)                                                       (1.4)

is being used, where the theory in some textbook has a theorem claiming that
x_n approaches a limiting value as n → ∞, it is usually disastrous to wait until
x_{n+1} = x_n for some n, since because of roundoff errors, the sequence x_n might
be periodic with a very long period, or it may even show chaotic behaviour.
The proper procedure is to continue the iteration until |x_{n+1} - x_n| < δ, for some
suitably chosen number δ. In general, we may not know the order of magnitude
of the limiting value. Hence, it may be better to use a relative criterion of the
form

    |x_{n+1} - x_n| < ε |x_{n+1}|,                                         (1.5)

where ε is a suitably chosen small number which is usually easier to select. A
detailed discussion of convergence tests is given in Sections 6.7 and 7.13.
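As a minimal illustration, the following C sketch applies the relative criterion (1.5) to the recurrence x_{n+1} = cos(x_n) (this particular recurrence and the tolerance are chosen here only as an example); an iteration limit ensures that the program cannot loop forever even if the test is never satisfied.

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        const double eps = 1e-10;    /* relative tolerance chosen by the user     */
        const int maxit = 1000;      /* never wait for exact equality of iterates */
        double x = 1.0;

        for (int it = 1; it <= maxit; it++) {
            double xnew = cos(x);    /* example recurrence x_{n+1} = F(x_n)       */
            /* relative criterion (1.5); the tiny absolute term guards against a
               limiting value that happens to be very close to zero               */
            if (fabs(xnew - x) < eps * fabs(xnew) + 1e-30) {
                printf("converged after %d iterations: x = %.12f\n", it, xnew);
                return 0;
            }
            x = xnew;
        }
        printf("no convergence in %d iterations; last x = %.12f\n", maxit, x);
        return 0;
    }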
Apart from this, there are problems when a number which is expected to
be positive turns out to be negative due to roundoff error. This could result
in the program giving undefined numbers (Not a Number, or NaN), because
an attempt was made to find the square root of a negative number, or perhaps
the logarithm. On most current computers this can significantly slow down the
execution of the program.
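The sketch below illustrates this pitfall, assuming IEEE double precision arithmetic (the particular numbers are chosen only to make the rounding error visible): a quantity that is mathematically zero comes out slightly negative, the naive square root then produces a NaN, while a simple guard avoids it.

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double x = 1e8 + 1.0 / 3.0;         /* 1/3 cannot be stored exactly at this magnitude */
        double d = (x - 1e8) - 1.0 / 3.0;   /* mathematically zero; typically slightly negative */

        printf("d = %.3e\n", d);
        printf("naive  : sqrt(d)          = %g\n", sqrt(d));             /* NaN when d < 0 */
        printf("guarded: sqrt(fmax(d, 0)) = %g\n", sqrt(fmax(d, 0.0)));  /* always defined */
        return 0;
    }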
The quality of a program depends on many characteristics, e.g., clarity,
efficiency, reliability and robustness. Clarity is important only for a human
reader to understand the program; the machine does not worry about clarity,
since it does not bother to really understand what we are trying to do. However,
any useful program is likely to be read by humans other than the author. Hence,
it is important to ensure that it can be understood easily. In fact, even the
author of the program may not be able to understand the program after a
while, unless it is expressed clearly and a proper documentation is provided.
Clarity is also important for maintaining the program, since that will help
in making the required changes. The changes may be required to enhance the
capabilities of the program, or simply to remove bugs which have been detected.
The efficiency of a program refers to the use of computer resources. The
principal resources are time, memory, and input-output (I/O), each of which
contribute to the cost of running a program on time-sharing systems. The exe-
cution time of a program not only depends on the computer hardware, but also
on the compiler used to translate the program to machine language. Many of
the compilers have some switches for optimisation, which can also change the
execution time substantially. Apart from the user program, memory is also used
by the library routines for I/O and other functions invoked by the program.
Programs using more memory generally get lower priority on a time-sharing
system, thus delaying the execution of program. Apart from this, some com-
puters have very limited memory. It should be noted that computer time is
also spent in converting the numbers between the computer's binary representa-
tion and the decimal representation required by human beings. A considerable
amount of time could be saved if huge amounts of data which are meant to be read by
other computer programs are written out using unformatted write statements,
in which case no conversion is done. Apart from saving time it also improves the
accuracy, since any conversion of floating-point numbers can only give an ap-
proximate value. In general, improving the efficiency of a program in one aspect
results in deterioration in other aspects. For example, very often changes made
to decrease the memory requirement result in increased time or increased I/O.
Hence, depending on the circumstances we have to make appropriate compro-
mises. It may be noted that efficiency is important even if a generous grant for
computer time is available, since an efficient program will allow us to perform
more experiments with different parameters of the physical problem, providing
a better understanding of the problem. Reliability of the results can also be
checked more thoroughly if the program is efficient.
The reliability of a program can be defined as the probability of successful
operation for the given class of problems. In practice, it is difficult to measure
reliability of a program, since in most cases the set of possible problems is infi-
nite for all practical considerations. The choice of the sample problems used for
this purpose will affect the estimate of reliability. The requirement of reliabil-
ity generally conflicts with efficiency and a compromise has to be made. This
depends on the usage of program. Programs meant to control critical processes
like nuclear reactors, missiles, airplanes, etc., cannot compromise on reliability.
There could be several reasons for low reliability of a program, like the algo-
rithm used, or bugs in the program. For example, the algorithm adopted for the
program may fail under certain circumstances. Similarly, if the bugs are such
that the program fails most of the times, then they would usually be detected
at the initial stage of program development itself. But in large programs many
parts of the program may not usually be accessed during execution, unless the
input data meet some special conditions. It is difficult to detect bugs in these
parts of the program. Robustness of a program measures its ability to grace-
fully exit when the input data are outside its designed capability or are highly
exceptional. This is even more difficult to quantify or estimate than reliability.
While writing a program we have to keep all these aspects in mind. The ef-
ficiency of a program usually conflicts with almost all other properties of the
program.

Bibliography
Acton, F. S. (1995): Real Computing Made Real, Princeton University Press, Princeton, New
Jersey.
Cowell, W. R. (ed.) (1984): Sources and Development of Mathematical Software, Prentice-
Hall, Englewood Cliffs, New Jersey.
Hamming, R. W. (1987): Numerical Methods for Scientists and Engineers, (2nd ed.), Dover,
New York.
Rice, J. R. (1992): Numerical Methods, Software, and Analysis, (2nd Ed.), Academic Press,
New York.
Ueberhuber, C. W. (1997): Numerical Computation 2: Methods, Software, and Analysis,
Springer Verlag, Berlin.

Exercises
1. Experimentally determine the order of convergence of the following infinite series

    (i)    Σ_{n=1}^∞ 1.5/(3 + n^2)
    (ii)   Σ_{n=1}^∞ 3.5 cos(1/n)/(4.5 + 2n^2 + n^3)
    (iii)  Σ_{n=1}^∞ ((-1)^n n^2 + n^3)/(1.5 + n^5)
    (iv)   Σ_{n=1}^∞ (1.5 + n^4) e^{-n/3}/(1 + n^2)
    (v)    Σ_{n=1}^∞ 1/n^s,   (s = 3, 2, 1.5, 1.1, 1.01)
    (vi)   Σ_{n=1}^∞ (-1)^n/n^s,   (s = 2, 1, 0.5, 0.1, 0.01)
    (vii)  Σ_{n=1}^∞ ((-1)^n/(2n + 1) + 10^{-10})
    (viii) Σ_{n=1}^∞ ((-1)^n/(2n + 1) + 10^{-10}/n)

How many terms are required to get an accuracy of 10^{-4} in the sum? (Warning: In (v)
and (vi) try only the first three values of s experimentally, and extrapolate the result to
get the estimate for other values of s. Do not attempt to verify your result experimentally
for smaller values of s.)
2. The sine integral Si(x) can be calculated by the following infinite series:

    Si(x) = ∫_0^x (sin t / t) dt = Σ_{i=0}^∞ (-1)^i x^{2i+1} / ((2i + 1)(2i + 1)!).

How many terms are required to get a relative accuracy of 10- 6 at x = 0.1, 1.0, 10.0,
30.0, 100.0? Is it possible to achieve this accuracy in actual practice? Evaluate the sum
and verify the results experimentally.
3. Consider the following infinite product representation for the Gamma function
    1/Γ(x) = x e^{γx} ∏_{i=1}^∞ (1 + x/i) e^{-x/i},

where γ = 0.5772156649... is Euler's constant. How many terms are required to get
a relative accuracy of 10- 6 for x = 0.1, 1.0, 10.0, 30.0? Is it possible to achieve this
accuracy in actual practice? Compare the efficiency of this product representation with
that of the asymptotic series:
    Γ(x) ≈ e^{-x} x^{x-1/2} √(2π) (1 + 1/(12x) + 1/(288x^2) - 139/(51840x^3) - 571/(2488320x^4) + ...).
For which values of x is the asymptotic series more efficient? How many terms of the
asymptotic series should be used? Also try to shift the range using Γ(x + 1) = xΓ(x)
before using the asymptotic series. How does this compare with the product representation?
4. Consider the following representations of the error function
    erfc(x) = (2/√π) ∫_x^∞ e^{-t^2} dt
            = 1 - (2/√π) Σ_{i=0}^∞ (-1)^i x^{2i+1} / (i! (2i + 1))
            = (e^{-x^2}/√π) · 1/(x + (1/2)/(x + 1/(x + (3/2)/(x + 2/(x + ···)))))
            ≈ (e^{-x^2}/(√π x)) (1 + Σ_{i=1}^∞ (-1)^i · 1·3·5···(2i - 1)/(2^i x^{2i})).

Compare the efficiencies of these formulae for x = 0.1, 0.5,1,2,4, and 6, when the function
is to be evaluated to a relative accuracy of 10- 6 . The first two formulae converge for any
value of x, while the last is an asymptotic series which diverges for any fixed value of x.
Hence, the last series has to be used carefully. It can be shown that error is bounded by
the first term which is neglected. Hence, the summation should not be continued beyond
the smallest term in the series.
5. Compare the following algorithms for computing the value of π.
(i) Using the Maclaurin series for tan^{-1} x:

    π = 4 tan^{-1}(1) = 4(1 - 1/3 + 1/5 - 1/7 + ···).

(ii) Using the Maclaurin series for sin^{-1} x:

    π = 6 sin^{-1}(1/2) = 6(1/2 + 1/(2×3×2^3) + (1×3)/(2×4×5×2^5) + (1×3×5)/(2×4×6×7×2^7) + ···).

(iii) Archimedes' method: Approximate the area of a circle of unit radius by that of an
inscribed regular polygon of n sides, where n = 2^i, (i = 2, 3, 4, ...). The area of this
polygon is

    (n/2) sin(2π/n),

and

    sin(θ/2) = √((1 - cos θ)/2),    cos θ = √(1 - sin^2 θ).

The calculation is initialised with sin(π/4) = cos(π/4) = 1/√2.
(iv) Similar to (iii), but use the circumference of a circle, which is approximated by

    2n sin(π/n).

(v) Approximate the area of the quarter circle by the trapezoidal rule, where the number of
trapezoids n = 2^i, (i = 1, 2, 3, ...).
6. Find all blunders in this book.
Chapter 2

Roundoff Error

In this chapter we discuss the roundoff error, which arises in the final phase
of the solution process discussed in the previous chapter. This error occurs
because of the finite precision with which the arithmetic operations are carried
out. Before going to the roundoff error, in Section 2.1 we discuss how numbers
are represented in the computer. In subsequent sections, the effect of roundoff
error on numerical computations is discussed, using appropriate examples.

2.1 Number Representation


The decimal number system with which all of us are familiar is an example
of a positional number system, which uses the number 10 as the base. This
system was first introduced by Indian or Arabic mathematicians and has been
universally adopted for the last 8-10 centuries, at least as far as numbers are
concerned. Unfortunately, for other measures like length, mass and, most impor-
tant of all, time, this system is not yet universally accepted. In the decimal system,
the numbers are represented by powers of ten, for example,

    643.576 = 6×10^2 + 4×10^1 + 3×10^0 + 5×10^{-1} + 7×10^{-2} + 6×10^{-3},    (2.1)

and in general, a number

    d_n d_{n-1} ... d_2 d_1 d_0 . d_{-1} d_{-2} ... = Σ_{i≤n} d_i 10^i.    (2.2)

The summation extends over all integers less than a finite number n, which
depends on the number being represented, and d_i are the decimal digits between 0
and 9. We can generalise the notation to a base other than 10 by writing

    d_n d_{n-1} ... d_2 d_1 d_0 . d_{-1} d_{-2} ... = Σ_{i≤n} d_i b^i.    (2.3)


Here, we should call the point as radix point, rather than the decimal point.
The base of the number system b is usually an integer greater than one, and
digits d; are between 0 and b - 1. This restriction is essential to ensure that the
representation is unique and complete, i.e., it should be possible to represent all
real numbers. If b > 10, we have to invent new symbols to represent the extra
digits. A digit d; for large i is said to be more significant than the digits d;
for small i; accordingly, the leftmost or leading digit is referred to as the most
significant digit, while the rightmost digit is referred to as the least significant
digit.
The most commonly used values of b apart from 10 are 2, 3, 8 and 16. In
the last case, the letters A, B, C, D, E and F are used to denote the six extra
digits. As the reader must be aware, the binary system with b = 2 is the choice
almost universally used in computers. Leibnitz is believed to be the first person
to illustrate arithmetic operations in binary system and is credited with the
invention of binary system in 1703. We will not go into the history of number
systems, but instead will consider the process of conversion from one number
system to another.
Let us try to convert from a system with base b to one with base B. If d_i
and D_i denote the digits in base b and B respectively, then

    Σ_i d_i b^i = Σ_i D_i B^i.                                             (2.4)

If we are prepared to use the arithmetic operations with base B, then the
conversion can be trivially accomplished by writing the digits d_i and the base
b in base B notation and explicitly evaluating the left-hand side. For example,
if we want to convert from binary to decimal system, b = 2 and B = 10 and

    (10101.11)_2 = 1×2^4 + 0×2^3 + 1×2^2 + 0×2^1 + 1×2^0 + 1×2^{-1} + 1×2^{-2} = (21.75)_{10}.    (2.5)

We will denote the base by a subscript, only when there is some chance of
confusion. Now if we want to convert from decimal to binary, we can write
(10)_{10} = (1010)_2 and the digits

    0, 1, 2, 3, 4, 5, 6, 7, 8, 9 = 0, 1, 10, 11, 100, 101, 110, 111, 1000, 1001.    (2.6)

Now we can use binary arithmetic giving the number in binary form. For ex-
ample,

    (21.75)_{10} = (10)(1010)^1 + (1)(1010)^0 + (111)(1010)^{-1} + (101)(1010)^{-2},    (2.7)

but there is a problem, because

    (1010)^{-1} = 0.0001100110011...,    (2.8)

has a nonterminating representation and, further, since we are not familiar with bi-
nary arithmetic, it is tedious to do the calculations. It should be noted that all
fractions which have a terminating expansion in binary system will terminate
in decimal system also, but the converse is not true. It is well-known that all
rational numbers with denominator which can be factorised in terms of 2 and 5
have a terminating expansion in decimal system, because 10 = 2 x 5. Similarly,
in binary system only those rational numbers, whose denominator is a power
of 2 will terminate while others will not. Thus, for example,

    1/10 = 1/(2×5) = 3/(2×15) = 11/(10 × 1111) = 0.0001100110011...,    (2.9)

where the last two expressions are in binary system.


There is an alternative method to convert numbers from one number sys-
tem with base b to another with base B, using the arithmetic operations in
base b instead of B. For this purpose, it is convenient to separate the integer
and fraction part. The integer part is then converted by successive division by
B, the remainder gives the successive digits of the number (i.e., d_0, d_1, ..., d_n).
The fraction part can be converted by successive multiplication by B, where
the integer part at each stage gives the successive digits (i.e., d_{-1}, d_{-2}, ...) and
should be removed before the next multiplication. For example, converting from
decimal to binary

    21.75 = 21 + 0.75,
    21/2 = 10 + 1/2,    10/2 = 5 + 0,    5/2 = 2 + 1/2,    2/2 = 1 + 0,
    0.75 × 2 = 1.5,    0.5 × 2 = 1.0,                                      (2.10)

thus (21.75)_{10} = (10101.11)_2.
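A small C sketch of this scheme (the value 21.75 follows the example above; the variable names and the fixed number of fraction digits printed are illustrative choices) converts the integer part by repeated division and the fraction part by repeated multiplication:

    #include <stdio.h>

    int main(void)
    {
        const int B = 2;                /* target base */
        double x = 21.75;
        int ipart = (int)x;
        double fpart = x - ipart;

        /* integer part: successive division by B; remainders are d0, d1, d2, ... */
        int digit[64], nd = 0;
        do {
            digit[nd++] = ipart % B;
            ipart /= B;
        } while (ipart > 0);

        printf("(21.75)_10 = (");
        for (int i = nd - 1; i >= 0; i--)
            printf("%d", digit[i]);

        /* fraction part: successive multiplication by B; the integer parts give d_-1, d_-2, ... */
        printf(".");
        for (int i = 0; i < 12 && fpart > 0.0; i++) {
            fpart *= B;
            int d = (int)fpart;
            printf("%d", d);
            fpart -= d;
        }
        printf(")_%d\n", B);
        return 0;
    }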
In principle, it is much simpler to do arithmetic in the binary system,
the only problem being that the number of digits is large even for ordinary
numbers. Thus, in writing it is more convenient to use the octal system with
b = 8, or the hexadecimal system with b = 16. It can be easily verified that
it is trivial to convert between binary and octal systems, since we can group
together three digits and write them in octal form or conversely, an octal digit
can be replaced by its three-digit binary representation. Thus

    (10101.11)_2 = (25.6)_8 = (010 101.110)_2.    (2.11)

Similarly, for hexadecimal system we can combine four binary digits. Thus,
even though the computer uses binary system, the contents of registers and
memory locations inside the computer are conventionally printed in the octal
or hexadecimal form.
Another important number system as far as computers are concerned is
obtained if a rather large number like 10^5 or 10^{10} or maybe 2^{32} is used as
the base. In this case, we can store one digit in one computer word and use
several words to obtain a high precision value. Such a system is used in exact
mathematical calculations, for example, in number theory while handling large
numbers. The largest known prime number now exceeds 10,000,000 decimal
digits, while the value of the constant π has been calculated to more than a
trillion (10^{12}) decimal places.
It is not essential for the base b to be a positive integer. For example,
we can have b = 1/10 with the usual ten decimal digits to get (21.75)_{10} =
(571.2)_{1/10}. Here the digits get reversed in order. Even more interesting is
b = -10 or -2. Thus, for example,

    (10101.11)_2 = (1101010.11)_{-2}
                 = (1000000.01)_2 - (101010.1)_2.                          (2.12)
Using this number system we can represent all real numbers without the use
of sign. Hence, this system could be useful for representing numbers in the
computers. However, in this system, it is nontrivial to find out the sign of a
given number, since that will depend on whether the leading digit is even or
odd. Similarly, it is nontrivial to find the negative of a given number. Arithmetic
in this system is similar to that in the usual number systems with positive base,
except for the fact that the carry has to be subtracted rather than added.
There is no reason to restrict ourselves to real values of b. For example,
we can take b = 2i (where i = √-1) with digits 0, 1, 2, 3 to get the quater-
imaginary number system, which has the advantage that every complex number
can be represented without the use of sign or i. For example,

    (312.13)_{2i} = 3×(2i)^2 + 1×(2i)^1 + 2×(2i)^0 + 1×(2i)^{-1} + 3×(2i)^{-2}
                  = -10.75 + 1.5i.                                         (2.13)
In general, a number can be separated into its real and imaginary parts by
separating the even and odd digits, thus

    (d_{2n} ... d_1 d_0 . d_{-1} d_{-2} ... d_{-2k})_{2i}
        = (d_{2n} ... d_2 d_0 . d_{-2} ... d_{-2k})_{-4} + 2i (d_{2n-1} ... d_3 d_1 . d_{-1} ... d_{-2k+1})_{-4}.    (2.14)
An interesting property of this system is that it allows multiplication and
division of complex numbers to be done in a fairly unified manner, without
treating the real and imaginary parts separately. The difference is that in
this case the carry (if any) is to be subtracted from two columns to the left.
A binary complex number system can be based on √2 i, but it requires an
infinite nonrepeating expansion for simple numbers like i. Penney (1965) found
that a binary complex number system can be obtained by using b = i - 1 as
the base. For example,

    (110.11)_{i-1} = (i - 1)^2 + (i - 1)^1 + 0 + (i - 1)^{-1} + (i - 1)^{-2} = -1.5 - i.    (2.15)

It can be shown that, this number system can represent all complex numbers
using only the digits 0 and 1. Interestingly, a number system with a base of
1 + i or 1 - i fails to represent all complex numbers.
Another interesting number system is the so-called balanced ternary sys-
tem with the base b = 3, but which uses the digits -1, 0, +1 instead of 0, 1, 2,
where -1 can be denoted by 1̄. Thus

    (165.6875)_{10} = (11̄0011.1̄011̄1̄011̄1̄ ...)_3.    (2.16)

This system can represent all real numbers without explicit use of sign and has
many pleasant properties. The negative of a number can be obtained by replac-
ing 1 with 1̄ and vice versa. The sign of the number is determined by the most
significant digit and more generally we can compare two numbers by reading
them from left to right and using the lexicographic order as in the decimal
system. Further, the operation of rounding to the nearest integer is identical
to truncating. This system was seriously considered for implementation on the
first electronic computer in 1945.
After this digression into various positional number systems, let us now
consider the representation of numbers in a computer. The binary system is
almost universally used for scientific computations on computers. In most of the
modern computers memory is usually organised into units called a byte, which
consists of 8 bits. Each of the bits can be in either of the two possible states,
which can be represented by 0 and 1. Since in general, one byte is not sufficient
to represent numbers, these computers typically use 4 or 8 bytes to represent a
number and this unit is sometimes referred to as a word. The number of bits in
a word varies from computer to computer and is usually in the range of 32 to
64. A word consisting of t bits can represent at most 2^t different numbers. Thus,
if the contents of the word are directly interpreted in the binary system, then all
numbers from 0 to 2^t - 1 can be represented. This is obviously unsatisfactory,
since no negative numbers are included in this representation. In fact, it was
the question of representing negative numbers in the computer, which led some
of the computer scientists to consider the number systems which we discussed
above. The simplest representation of both positive and negative numbers is to
use the leftmost bit as the sign bit and the remaining t - 1 bits can be used
to store the magnitude of the number. We adopt the convention of numbering
the bits from 0 to t - 1 from right to left. Thus, if (t - 1)th bit is 0, the sign is
assumed to be positive and if it is 1, the number is negative. This representation
is called the signed-magnitude representation, and corresponds to the normal
practice in writing numbers. One disadvantage of this representation is that,
minus zero and plus zero have different representations (e.g., 0000 and 1000 in
a 4-bit word). Hence, special care is needed in implementing the arithmetic.
Further, the arithmetic operations are slightly more difficult in this notation.
Almost all computers use the so-called two's complement to represent
numbers. In this notation no sign is attached to the numbers, but the cal-
culations are done modulo 2^t. For example, if we have an 8-bit word, then the
number -(107)_{10} = -(1101011)_2 is represented as 10010101, which is obtained
by adding 2^8 to the given negative number, i.e., 256 - 107 = 149 = (10010101)_2.
Further, any number with leftmost bit 1 is regarded as negative. For example,
01111111 represents 127 while 10000000 represents -128. There is no confusion
about ±0, since both are represented by all zero bits. The leftmost bit which
determines the sign of the number is referred to as the sign bit. Using this
representation all integers between -2^{t-1} and 2^{t-1} - 1 can be represented in a
t-bit word. The asymmetry in the range comes in because of the fact that zero
is regarded as positive. It may be noted that some computers actually store
the bits in reverse order, (i.e., the rightmost bit determines the sign), but we
will stick to the more natural convention. The two's complement representa-
tion is used on most computers to represent integers. Most computers allow
integer arithmetic with 1, 2 or 4 bytes (or 8, 16 or 32 bits) with 4 bytes being
the default length. Programming languages allow the length of integers to be
declared, thus in Fortran integer*1 implies integer variable using 1 byte. Some
compilers may allow integer arithmetic with 8 bytes also.
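The 8-bit example above can be reproduced with the following C sketch (an illustration which assumes that int8_t from <stdint.h> is an 8-bit two's complement type, as it is on common machines); it also shows that overflow simply wraps around without any warning:

    #include <stdint.h>
    #include <stdio.h>

    /* Print the bits of an 8-bit word, most significant (sign) bit first. */
    static void print_bits(int8_t v)
    {
        for (int i = 7; i >= 0; i--)
            putchar((((uint8_t)v >> i) & 1) ? '1' : '0');
    }

    int main(void)
    {
        int8_t a = -107;                    /* stored as 256 - 107 = 149 = 10010101 */
        printf("-107 in two's complement: ");
        print_bits(a);
        printf("\n");

        int8_t big = 127;                   /* largest 8-bit value, 2^7 - 1          */
        int8_t wrap = (int8_t)(big + 1);    /* wraps to -128 on typical machines     */
        printf("127 + 1 in 8-bit arithmetic = %d\n", wrap);
        return 0;
    }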
Another possible representation that is used by some computers is the
one's complement, which can be obtained by adding 2^t - 1 to the negative
number. Once again numbers with leftmost bit one are assumed to be negative.
In this representation, negative of a number can be obtained by simply taking
the logical complement of the number, i.e., replacing 1 by 0 and vice versa. Thus,
two's complement can be obtained by first taking the complement and then
adding one to the resulting number. In one's complement, +0 is represented by
all bits zero, while -0 is represented by all bits being one, and this should be
accounted for in the arithmetic operations. In this representation, all numbers
between ±(2^{t-1} - 1) can be represented using t bits.
The range of numbers that can be represented using a finite number of
bits is of course limited and if during the calculation some number goes outside
this range, it is referred to as an overflow. In two's complement, as long as
there is no overflow, the addition and subtraction can be carried out as if all
numbers are positive, provided the carry out of the leftmost position is ignored.
For example,

    1100 1010 = -54        1100 1010 = -54        1100 1010 = -54
    0110 1011 = 107        0010 0110 =  38        1100 1010 = -54          (2.17)
    0011 0101 =  53        1111 0000 = -16        1001 0100 = -108

Thus, the carry out of the last place can be ignored, provided there was a carry
into that place, otherwise it will represent an overflow condition. Similarly,
if there is a carry into the last place, but not out of it, that also represents
overflow. For example,

    0110 1011 = 107                      1001 0100 = -108
    0110 1011 = 107                      1001 0100 = -108                  (2.18)
    1101 0110 = -42 = 214 - 256          0010 1000 = 40 = -216 + 256

In one's complement representation, the arithmetic is only slightly more
complicated, since the carry out of the sign bit has to be added to the numbers.
For example,

    1100 1001 = -54        1100 1001 = -54        1100 1001 = -54
    0110 1011 = 107        0010 0110 =  38        1100 1001 = -54
    0011 0100              1110 1111 = -16        1001 0010                (2.19)
            1                                             1
    0011 0101 =  53                               1001 0011 = -108

The overflow condition can be detected as in two's complement by the fact that
there will be a carry into the sign bit, but nothing out of it {9}. Unfortunately,
most machines do not issue an error message on integer overflow, and erroneous
results may be generated. Consequently, integer arithmetic must be used with
utmost care. Even if integer arithmetic is not required, sometimes by mistake
some constants may be in integer form, resulting in the compiler using integer
arithmetic to calculate the result. For example, Fortran expressions like 15**20
will result in an overflow on almost every computer, but no error message may
be issued. In such cases, the results will be completely erroneous.
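The following C sketch (an illustration based on the 15**20 example above) shows how silently such an overflow happens: both 32-bit and 64-bit integer arithmetic produce meaningless values without any message, while floating-point arithmetic at least preserves the correct magnitude. In C, signed integer overflow is formally undefined behaviour; on most machines it simply wraps around, which is what the comments below assume.

    #include <stdio.h>

    int main(void)
    {
        /* 15^20 is about 3.3e23, far outside the range of 32-bit and 64-bit integers */
        int p32 = 1;
        long long p64 = 1;
        double pd = 1.0;

        for (int i = 0; i < 20; i++) {
            p32 *= 15;      /* overflows silently: no error message is produced */
            p64 *= 15;      /* also overflows, equally silently                 */
            pd  *= 15.0;    /* floating point keeps the correct magnitude       */
        }
        printf("32-bit integer : %d\n", p32);
        printf("64-bit integer : %lld\n", p64);
        printf("double         : %.6e\n", pd);
        return 0;
    }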
This representation can only represent integers and is called fixed-point
representation, since the radix point is fixed. Customarily, the radix point is
assumed to be after the rightmost digit, but of course, we can assume it to
be at any other fixed position. The process of addition and subtraction is not
affected by the position of radix point, but multiplication or division will require
appropriate scaling. Keeping track of this scaling factor is quite tedious. Hence,
the computers use the so-called floating-point representation to represent real
numbers. In this representation, each number is denoted by a pair of numbers
as follows:
    x = (e, f) = f × b^{e-q},    1/b ≤ |f| < 1,                            (2.20)

where b is the base of number system and q is a fixed integer called the exponent
offset. In this representation, e is referred to as exponent and f as fraction or
mantissa. For example, if b = 10, q = 50, then

    3 × 10^{10} = 0.3 × 10^{11} = (61, 0.3) and 9.108 × 10^{-28} = (23, 0.9108).    (2.21)


The exponent e is an integer and the inequality (2.20) ensures the uniqueness
of representation, since otherwise f × b^{e-q} = bf × b^{e-q-1} and so on. This
inequality provides a normalisation which defines the fraction uniquely. If this
restriction is not satisfied, then the number is said to be unnormalised. The
value of q is adequately selected, such that numbers with both positive and
negative exponents within a reasonable range can be represented with positive
value of e. The nonzero choice for q does not make any difference for addition,
subtraction or division, while for multiplication it can be trivially accounted
for. The number zero cannot be represented in the normalised floating-point
representation, and a special exception has to be made for it.
To represent floating-point numbers in a computer, the word is divided
into two parts, one for exponent and the other for fraction. Negative numbers
can be represented by two's complement or one's complement or simply using
the signed-magnitude form. Hence, the leftmost bit will be the sign bit. In
computers, since binary arithmetic is used, b = 2 and because of the inequality
in (2.20) the first bit of the fraction is always one. Thus, in most computers
this bit is not actually stored. This bit is sometimes referred to as the hidden
bit. Since the exponent e must be positive and within some fixed range, the
range of numbers that can be successfully represented within the computer is
limited. For example, let us consider a machine which uses 32 bits to represent
floating-point numbers, with 24 bits for fraction and 8 bits for exponent. Since
the first bit of fraction is not stored, one bit is actually available for the sign.
In case the computer explicitly stores the first bit of fraction also, then the
number of bits for the fraction will have to be reduced by one to accommodate
the sign bit. Thus bits 0-22 contain the fraction part, bits 23-30 the exponent
part and the 31st bit is the sign bit. With 8 bits, the range of exponent will
be between 0 and 2^8 - 1 = 255 and if the exponent offset q is taken as 126,
then the actual exponent range is from -126 to +129. For normalised floating-
point numbers the fraction part is between 1/2 and 1 - 2^{-24}. Hence, numbers
between 2^{-127} and 2^{129} - 2^{105} (approximately 5.9 × 10^{-39} to 6.8 × 10^{38}) in
magnitude can be represented. A number larger than the maximum limit is said
to overflow, while a nonzero number with magnitude less than the minimum
limit is said to underflow. Most computers replace an underflow by zero (often
without any error message), which may not matter in most cases, but if that
number is later multiplied by a large number, the result could be quite different.
In some computers, the range of numbers that can be represented is slightly
extended by allowing for unnormalised fraction part, when the number is too
small to be represented in the normalised form. Thus, in the 32-bit machine
considered above, if unnormalised fraction part is allowed, then the smallest
nonzero positive number that can be represented is 2^{-(126+23)} ≈ 1.4 × 10^{-45}.
This facility is usually referred to as graceful underflow.
In actual machines the range of numbers could be slightly restricted, since
some exponent values may be reserved for representing indefinite values, while
those machines which do not normally store the first bit of the fraction part ex-
plicitly also store this bit when the exponent is zero, since otherwise it will
be impossible to distinguish between zero and 0.100... × 2^{-q}. Most current
computers use the IEEE-754 standard for representing floating point numbers.
In this format a 32-bit word is divided into three parts as mentioned earlier
and signed-magnitude notation is used to represent negative numbers. Thus
the 31st bit represents sign, and it is 0 for positive numbers and 1 for negative
numbers. The remaining 31 bits represent the magnitude of the number in both
cases. The bits 0-22 contain the fraction part and bits 23-30 the exponent part.
If E is the content of exponent field, F that of the fraction field and S is the
value of the sign bit, then the numbers are interpreted as follows:
1. If E = 255 and F is nonzero, then the number is NaN ("Not a Number"),
which essentially results from operations which cannot be evaluated, e.g.,
square root of a negative number or 0/0.
2. If E = 255 and F is zero, then the number is (-1)^S ∞, i.e., ±∞ depending
on the sign. This may represent all numbers which are larger in magnitude
than the largest number that can be represented.
3. If 0 < E < 255 then the number is (-1)^S 2^{E-127} (1.F) where 1.F is in-
terpreted in binary form. Here the 1 before the radix point is the hidden bit,
which is not stored. This is exactly same as the example considered in
earlier paragraph. For example,

    0 10000000 01000000000000000000000 =  2.5,
    1 10000000 01000000000000000000000 = -2.5.                             (2.22)

4. If E = 0 and F is nonzero, then the number is (-1)^S 2^{-126} (0.F) as there
is no hidden bit in this case, since the number may be unnormalised.
5. If E = 0 and F = 0, then the number is (-1)^S 0, i.e., ±0. Further, any
number which is smaller in magnitude than 2^{-149} is replaced by zero with
the right sign.
Further, arithmetic operations are also defined with infinite operands, thus
(+∞) + (+∞) = +∞, x + ∞ = ∞ (where x is any finite number that can be
represented), 1/∞ = 0, (+∞) - (+∞) = NaN, etc. Unfortunately, the IEEE
standard doesn't specify the order in which bytes are stored and as a result
there are two possibilities. Some computers store the bytes in the order in
which it is normally written, i.e., the most significant digit is the first, which
is known as the big-endian representation. The other possibility is the little-
endian where the least significant digit is stored first. Because of this ambiguity
it may be necessary to swap the bytes if binary output from one computer is
fed to another with the opposite convention. The byte swap may also be needed to
store data in a predefined format, e.g., FITS (Flexible Image Transport System)
format (Wells et al. 1981).
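The field layout described above can be inspected directly. The following C sketch (an illustration which assumes that float is the 32-bit IEEE format) copies the bytes of a float into an unsigned integer and extracts S, E and F; for 2.5 it reproduces the bit pattern shown in (2.22).

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        float x = 2.5f;
        uint32_t bits;
        memcpy(&bits, &x, sizeof bits);      /* reinterpret the bytes, not the value */

        uint32_t S = bits >> 31;             /* sign bit (bit 31)           */
        uint32_t E = (bits >> 23) & 0xFF;    /* exponent field (bits 23-30) */
        uint32_t F = bits & 0x7FFFFF;        /* fraction field (bits 0-22)  */

        printf("x = %g : S = %u, E = %u, F = 0x%06X\n", x, S, E, F);
        /* For 2.5 this gives S = 0, E = 128, F = 0x200000,
           i.e. (-1)^0 * 2^(128-127) * (1.01)_2 = 2.5, as in (2.22). */
        return 0;
    }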
The precision with which the floating-point calculations can be performed
is determined by the length of the fraction part. Hence, we will refer to a
floating-point arithmetic using t-bit fraction as t-bit arithmetic. Thus, the
above example will be referred to as 24-bit arithmetic, even though actually
32 bits are required to represent a number. Most computers allow floating
point representation using either 4 or 8 bytes and the length of real num-
bers can be specified in programming languages, thus in Fortran real*8 im-
plies a real variable using 8 bytes of computer memory. Some compilers also
allow arithmetic with 16 byte long floating point variables. Most current com-
puters use the IEEE standard for 8 byte floating point numbers, which is
similar to the format described earlier for 32 bit word. In this case, 52 bits
are used for fraction part, 11 bits for exponent and 1 bit for sign and the
interpretation of the numbers is similar to what is described above. In this
case E = 2047 is reserved for indefinite numbers and for 0 < E < 2047 the
number is (-1)^S 2^{E-1023} (1.F). With graceful underflow, numbers in the range
(2^{-1074}, 2^{1024}) or (4.9 × 10^{-324}, 1.8 × 10^{308}) can be represented. In 16 bytes
floating point representation generally 15 bits are reserved for exponent and
112 bits for fraction part. In this case E = 32767 is reserved for indefinite num-
bers and for 0 < E < 32767 the number is (-1)^S 2^{E-16383} (1.F). With graceful
underflow, numbers in the range (2^{-16494}, 2^{16384}) or (6.5 × 10^{-4966}, 1.2 × 10^{4932})
can be represented.

2.2 Roundoff Error


We have noted earlier that at most 2^t different real numbers can be represented
using a t-bit word in computer. However, we know that between any two dis-
tinct real numbers, there are infinite number of real numbers. Obviously, all
these numbers cannot be represented in the computer. Hence, the computer's
number system is necessarily "quantised" in some sense. In this section, we will
consider some of these "quantum effects". The problem that a numerical ana-
lyst faces is in some sense opposite to that faced by quantum physicist, since
the aim of the physicist is to find out what happens if a classical system is
quantised, while the numerical analyst gets only the quantised results and has
to find the "classical limit" of the results. The quantisation of number system
necessarily introduces some uncertainty in numerical calculations, which is re-
ferred to as roundoff error. Apart from the fact that only a finite set of real
numbers can be represented in computers, it turns out that this set is not closed
under any of the basic arithmetic operations. Consequently, roundoff errors are
introduced at almost every step of numerical calculations. On present day com-
puters which are capable of performing 10^9 or more floating-point operations
per second, each involving a roundoff error, the error in the final result could be
significant, unless appropriate care is taken. Our discussion of roundoff error
will be restricted to floating-point arithmetic, since that is what is normally
used in scientific computations. For a detailed discussion of roundoff error in
fixed-point arithmetic, readers may refer to Wilkinson (1994).
To illustrate the effect of roundoff error, we consider a hypothetical com-
puter which uses decimal system and expresses floating-point numbers by one-
digit exponent and three-digit fraction. For clarity, we will specify the signs
of exponent and fraction explicitly. Most examples in this section are based
on arithmetic with such a hypothetical computer. Thus, the smallest nonzero
positive number that can be represented is 0.100 × 10^{-9}, while the next higher
number that can be represented is 0.101 × 10^{-9}. Any number between these
cannot be represented in this computer and will have to be approximated by
one of these two numbers, thus giving a roundoff error. Hence, if we round off a
number to the nearest one which can be represented in the machine, there will
be a maximum roundoff error of 0.5 × 10^{-12} for numbers in this range. There
is some ambiguity in this rounding operation, when the number is exactly mid-
way between two numbers that can be represented. For example, 0.1005 × 10^{-9}
can be rounded to either 0.100 × 10^{-9} or 0.101 × 10^{-9}. On most computers in
the ambiguous case the number is rounded up when the last significant digit is
odd.
The largest number that can be represented in this machine is 0.999 × 10^9,
while the next lower number is 0.998 × 10^9, giving a maximum roundoff error
of 0.5 × 10^6 for numbers in this range. Thus, the roundoff error depends on the
number and the bound on the magnitude of this error depends on the exponent
part of the number being represented. Therefore, it is more meaningful to talk
of the relative error in representing a number x. It can be easily shown that

    |δx/x| ≤ (1/2) × 10^{-2},                                              (2.23)

where δx is the roundoff error in representing x. This inequality can be easily
generalised to a floating-point representation, using a base b with t-digit fraction
part to give

    |δx/x| ≤ (1/2) × b^{1-t} = ℏ.                                          (2.24)

We will refer to this quantity on the right-hand side by ℏ. For a machine using
binary arithmetic ℏ = 2^{-t}. Thus, in quantum arithmetic ℏ is not a universal
constant, but varies from machine to machine. For a computer with a 32-bit word,
which uses 24 bits for the fraction, ℏ = 2^{-24} ≈ 6 × 10^{-8}. As we will see later on,
ℏ is either the smallest positive number which if added to one gives a result
greater than one, or it is the largest positive number which if added to one
gives a result equal to one. The choice between the two options depends on
the rounding algorithm used for ambiguous cases. The first choice is valid on
machines which round up the numbers in ambiguous cases, when the number
is exactly between two numbers that can be represented. On most computers
the latter possibility will be realised.
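The value of ℏ for a particular machine can be determined experimentally, as in the following C sketch (which assumes that the arithmetic is actually performed in the declared precision and not in longer registers): a trial value is halved until adding it to one no longer changes the result.

    #include <stdio.h>

    int main(void)
    {
        /* halve eps until 1 + eps rounds back to 1; the final value is about 2^-t */
        float ef = 1.0f;
        while ((float)(1.0f + ef) > 1.0f)
            ef *= 0.5f;
        printf("float : 1 + eps == 1 first holds for eps = %g (about 2^-24)\n", ef);

        double ed = 1.0;
        while (1.0 + ed > 1.0)
            ed *= 0.5;
        printf("double: 1 + eps == 1 first holds for eps = %g (about 2^-53)\n", ed);
        return 0;
    }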
In general, the error in numerical computation can be expressed in two
different ways, one is the so-called absolute error which is just the difference
between the true and calculated values, while the second is the relative error
which is the absolute error divided by the true value of the quantity. When
all numbers involved are of the same order of magnitude, it is convenient to
talk of absolute error, while if the numbers are varying by several orders of
magnitude, then it is more meaningful to talk of relative error. For example,
an (absolute) error of 1 in a number which is of the order of 10^8 is in general,
much more tolerable than an error of 0.001 in a number which is of the order
of 0.01. Hence, it is more meaningful to express these as relative errors of 10^{-8}
and 0.1, respectively. However, if the true value is zero, then the relative error
is not defined and in such cases, we have to resort to either a different measure
for normalisation or to the absolute error.
Another measure of error in numerical computation is provided by
the number of significant figures. Let us consider two numbers 12345.6 and
0.000234, both of which are accurate to the last digit, i.e., the errors are less
than 0.05 and 0.0000005, respectively. Thus, it appears that the second number
is more accurate, but this depends on what operations are to be done with the
number. For example, if we add the two numbers, then the error is determined
by the first number which has a larger error. On the other hand, if we multi-
ply these two numbers, then the error is essentially determined by the second
number. If we take the reciprocal of these numbers, we get 0.000081000518 ...
and 4273.504 ... , respectively. The relative error in the reciprocals is of the
same order as the relative error in the original numbers. Thus, if instead of
the original numbers we use 12345.65 and 0.0002345, then the reciprocals will
become 0.00008100019... and 4264.39..., respectively. Hence, it is clear that in the first case the error is ≈ 3 × 10^{-10}, while in the second case the error is of the order of 10. Here it is more meaningful to talk of relative error; we can say that the two numbers have a relative error of less than 4.1 × 10^{-6} and 2.2 × 10^{-3}, respectively. Sometimes it is convenient to express the accuracy in
terms of significant figures, which are essentially the number of correct digits in
the decimal representation. In counting the digits the leading zeros are ignored.
Hence, in our example we will say that the two numbers are accurate to six
and three significant figures, respectively. It should be noted that if there are
trailing zeros in the number which are known accurately, then that should be
counted. For example, 0.00023400 has an accuracy of five significant figures.
The detailed implementation of floating-point arithmetic operations varies
a little from computer to computer. In most computers, the result of arithmetic
operations is formed in a double length accumulator inside the arithmetic unit
and the results are then correctly rounded off to yield the single length result.
For instance, if we want to multiply two numbers of t bits, a 2t-bit product is
formed, which is then rounded to yield the t-bit result. For addition or subtrac-
tion, the number with larger magnitude is first converted to 2t digits by adding
zeros to the right of the fraction part, and then fraction part of the second
number is shifted to the right, until the exponent agrees with the first number
(if this is possible within 2t digits). After the addition or subtraction, the result-
ing number is normalised and then rounded. For example, in our hypothetical
decimal computer 0.317 × 10^0 + 0.386 × 10^{-2} will result in

      0.317000
    + 0.003860
    ------------
      0.320860 = 0.321 × 10^0        (2.25)

By considering all cases, it can be shown that it actually requires an accumulator with 2t + 1 bits if the process is implemented in a straightforward manner.
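The hypothetical three-digit decimal machine used throughout these examples is easy to imitate in software, which makes results such as (2.25) convenient to reproduce. The sketch below is an illustration only (it is not from the book and uses Python's decimal module with a 3-digit context as a stand-in for the hypothetical machine).

```python
from decimal import Decimal, Context, ROUND_HALF_EVEN

# A crude model of the hypothetical three-digit decimal computer:
# every arithmetic result is rounded back to 3 significant digits.
ctx = Context(prec=3, rounding=ROUND_HALF_EVEN)

def fl3(x):
    """Round a value to 3 significant decimal digits."""
    return ctx.plus(Decimal(str(x)))

def add3(a, b):
    """Addition as performed on the 3-digit machine."""
    return ctx.add(Decimal(str(a)), Decimal(str(b)))

if __name__ == "__main__":
    print(add3("0.317", "0.00386"))   # 0.321, reproducing (2.25)
    print(fl3(1 / 3))                 # 0.333
```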
To find the product of two floating-point numbers x_1 = f_1 b^{e_1} and x_2 = f_2 b^{e_2}, the exponents are first added together to give e_3 = e_1 + e_2, and the exact 2t-digit product f_1 f_2 is computed. If the original numbers are normalised, then this product satisfies the condition

\frac{1}{b^2} \le |f_1 f_2| < 1,    (2.26)

and is therefore normalised, if necessary, by a shift to the left, the exponent being adjusted accordingly. The resulting 2t-digit product is then rounded off to give the t-digit fraction of the computed product. In some cases rounding
the 2t-digit fraction may require the exponent to be adjusted, e.g., rounding 0.9999 to 3 digits gives 1.00 = 0.100 × 10^1. Similarly, to divide x_1 by x_2, first the exponent e_2 is subtracted from e_1 to give e_3 = e_1 − e_2, then f_1 is placed in the t most significant digits of the double length accumulator and the other digits are set to zero. If |f_1| ≥ |f_2|, then the number in the accumulator is shifted one place to the right and the exponent e_3 is increased by one. The number in the accumulator is then divided by f_2 to give the correctly rounded t-digit quotient. It can be verified that this process will automatically give the correctly normalised fraction part for the quotient.
Thus, clearly if Xl EB X2 denotes the floating-point addition of Xl and X2,
and similar notation is used for other operations, then

x_1 \oplus x_2 = (x_1 + x_2)(1 + \epsilon), \qquad x_1 \ominus x_2 = (x_1 - x_2)(1 + \epsilon),
x_1 \otimes x_2 = (x_1 \times x_2)(1 + \epsilon), \qquad x_1 \oslash x_2 = (x_1 / x_2)(1 + \epsilon),    (2.27)

where in all cases |ε| ≤ ℏ. This result follows from the fact that, assuming x_1 and x_2 are correctly represented in the machine, the only error occurs while
rounding off the result at the end. Of course, the value of E will be different for
different arithmetic operations, but the upper bound on its magnitude happens
to be the same.
It can be shown that for a t-bit fraction part, if the accumulator has
t + 2 bits, then it is possible to get the correctly rounded sum of two num-
bers, provided appropriate care is taken while rounding the smaller number to
t + 2 bits after the right shift {25}. All computers allow arithmetic with at
least two different floating-point representations. The higher precision arith-
metic generally uses 8 bytes to store the numbers, and is referred to as double
precision. Apart from the length of fraction part, the range of exponent also
may be different in double precision representation. The length of fraction part
is usually more than twice that in single precision. For example, if 8 bytes are
used to represent a double precision number, then 11 bits can be used for ex-
ponent part and 53 bits (including one hidden bit) for the fraction part. In that case, for double precision arithmetic ℏ = 2^{-53} ≈ 1.1 × 10^{-16}. However, the
same accumulator is usually used for all arithmetic operations. Some machines
use an accumulator which is longer than the fraction part of a double preci-
sion number. In general, the execution time for arithmetic operations in double
precision is larger than that in single precision. However, the relative speed
varies from machine to machine. On many machines, the difference between
execution times for single and double precision arithmetic is small. However,
on all machines, the use of double precision will result in increased memory
requirement, particularly when large arrays are used. Some compilers also al-
low 16 byte floating point numbers, but as of now hardly any computer has
hardware implementation for floating point operations with such numbers and
these operations are implemented through software. As a result, the execution
of programs using such numbers will be significantly slower. For 16-byte floating point arithmetic ℏ = 2^{-113} ≈ 9.6 × 10^{-35}.
In the above discussion, x_1 and x_2 are assumed to be correctly represented in the machine. If these numbers are themselves results of some previous calculations, then there would be errors in their representation, which in turn would affect the error in arithmetic operations. For example, 0.317 × 10^0 − 0.316 × 10^0 = 0.100 × 10^{-2} is exact, since no roundoff is needed. However, if either of the numbers itself is not exact, then the situation is different. For example, if the actual values are x_1 = 0.316501 × 10^0 and x_2 = 0.316499 × 10^0, then the exact difference is 0.2 × 10^{-5}. Hence, in such cases, the apparently exact operation can actually be totally meaningless. If x_1 and x_2 are not correctly represented, but have a relative error of ε_1 and ε_2 respectively, then we can write

x_1(1 + \epsilon_1) \oplus x_2(1 + \epsilon_2) = \left[ x_1(1 + \epsilon_1) + x_2(1 + \epsilon_2) \right](1 + \epsilon) \approx (x_1 + x_2) + x_1\epsilon_1 + x_2\epsilon_2 + (x_1 + x_2)\epsilon.    (2.28)

Hence, there is an additional error of x_1\epsilon_1 + x_2\epsilon_2, and if |ε_1| ≤ ℏ, |ε_2| ≤ ℏ, |ε| ≤ ℏ, the total error is bounded by 1.06ℏ(|x_1| + |x_2| + |x_1 + x_2|), provided ℏ is sufficiently small, which is usually true on any reasonable computer. The factor of 1.06 is arbitrary and could be replaced by any suitable number greater than one. This factor will keep occurring in all our error bounds in the next section.
The error could be quite large as compared to the sum x_1 + x_2, if the two
numbers have opposite sign, but are nearly equal in magnitude; even though in
this case, there is no roundoff error in the subtraction process itself. This is the
most common cause for significant roundoff error in numerical computations.
Hence, subtraction of two nearly equal numbers should be avoided as far as
possible.
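The loss of accuracy in the example just given can be seen explicitly by repeating it on a simulated three-digit machine: the operands round to 0.317 and 0.316, the computed difference is 0.100 × 10^{-2}, and the exact difference 0.2 × 10^{-5} is lost completely. The following fragment is only an illustration of this point, not code from the book.

```python
from decimal import Decimal, Context, ROUND_HALF_EVEN

ctx = Context(prec=3, rounding=ROUND_HALF_EVEN)   # 3 significant digits

x1_exact = Decimal("0.316501")
x2_exact = Decimal("0.316499")

x1 = ctx.plus(x1_exact)          # representation on the 3-digit machine: 0.317
x2 = ctx.plus(x2_exact)          # 0.316
diff = ctx.subtract(x1, x2)      # computed difference (exact, no roundoff in this step)

print("computed difference:", diff)                  # 0.001
print("exact difference:   ", x1_exact - x2_exact)   # 0.000002
```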
As a result of roundoff error, even the basic associative and distributive
laws of algebra may not hold in numerical calculations. For example, let us
consider the associative law of addition

a + (b + c) = (a + b) + c. (2.29)

If a = 0.456 × 10^{-2}, b = 0.123 × 10^0, c = −0.128 × 10^0, then

(a + b) + c = 0.128 × 10^0 − 0.128 × 10^0 = 0,
a + (b + c) = 0.456 × 10^{-2} − 0.500 × 10^{-2} = −0.440 × 10^{-3}.    (2.30)

The readers are urged to look at this as well as the other examples carefully,
in order to understand exactly how the roundoff error creeps in. Similarly, we
can see the failure of the distributive law

a × (b + c) = a × b + a × c,
a = 0.200 × 10^1, b = −0.600 × 10^0, c = 0.602 × 10^0,
a × (b + c) = (0.200 × 10^1) × (0.200 × 10^{-2}) = 0.400 × 10^{-2},
a × b + a × c = −0.120 × 10^1 + 0.120 × 10^1 = 0.    (2.31)
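The same violations occur, on a much smaller scale, in ordinary double precision arithmetic. The snippet below is purely illustrative (the particular numbers are chosen here and are not from the book); it shows that the grouping of additions and the use of the distributive law can change the computed result even with a 53-bit fraction.

```python
# Associative law of addition: the two groupings differ in the last bit.
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c)        # 0.6000000000000001 with IEEE double precision
print(a + (b + c))        # 0.6

# Distributive law: a*(b + c) need not equal a*b + a*c exactly.
a, b, c = 1e16, 1.0, -0.7
print(a * (b + c))        # 3000000000000000.5  (b + c is rounded first)
print(a * b + a * c)      # 3000000000000000.0
```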
We can also find examples violating the associative law of multiplication,
but in that case, the difference is always small {17}. However, apart from the
finite fraction part which gives rise to the roundoff error, the exponent range
is also limited, which can give rise to severe violation of the associative law for
multiplication
a × (b × c) = (a × b) × c.    (2.32)
For example, if a = 10^{-6}, b = 10^{-6}, c = 10^8, then the left-hand side will give the correct result of 10^{-4}, but the right-hand side results in an underflow on our hypothetical machine, since (a × b) is 10^{-12}. This underflow may be replaced
by zero, giving zero as the result.
The underflow and overflow can also cause difficulties while implementing
complex arithmetic, which is done through software on most machines. For
example, to evaluate the reciprocal of a complex number a straightforward
algorithm is

\frac{1}{x + iy} = \frac{x - iy}{x^2 + y^2}.    (2.33)

If the number is such that evaluating the denominator will give underflow or
overflow, then the result cannot be calculated even though the number itself is
within the range of the computer. For example, on our hypothetical computer, if we use a number like 10^5 + 2i or 10^{-6} + 10^{-7} i, it is not possible to evaluate
the reciprocal even though both the numbers as well as their reciprocals can be
easily represented in the computer. This problem can be avoided by modifying
the algorithm. If w = max(|x|, |y|), then we can write the reciprocal as

\frac{1}{x + iy} = \frac{1}{w} \, \frac{(x/w) - i\,(y/w)}{(x/w)^2 + (y/w)^2}.    (2.34)
This formula will not give overflow, unless the reciprocal cannot be represented
in the computer or is close to such a limit. Apart from this, it should be noted
that the multiplication and division of complex numbers actually require more than one arithmetic operation on their real components. Hence, the error bounds
given in (2.27) will not be applicable for complex numbers. For example, con-
sider the multiplication, (a + ib) (x + iy), e.g.,

(0.326 + 0.338i) x (0.342 + 0.327i) = (0.000966 + 0.222198i). (2.35)

If we are using a three decimal digit arithmetic, then it can be verified that
the result will come out to be (0 + 0.223i), which is not the correctly rounded
result. The significance of this error depends on what we are going to do with the
number. For example, if we are interested in only the real part of the number,
then the result is completely wrong. In any case, ideally we expect that both
real and imaginary parts of the result should be correctly rounded versions of
the exact result. This is actually possible if the double length products like ax
and by are not immediately rounded, but are rounded only after the addition or
subtraction is complete. Some computers allow this facility and if that is used,
the result of complex multiplication and division will always be a correctly
rounded approximation to the exact result.
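The rescaling of (2.34) is straightforward to express in code. The routine below is an illustrative sketch, not the book's implementation: it scales the real and imaginary parts by w = max(|x|, |y|) before forming the denominator, so that the intermediate sum of squares stays between 1 and 2.

```python
def complex_reciprocal(x, y):
    """Return (u, v) with u + iv = 1/(x + iy), using the scaling of (2.34)
    to avoid unnecessary overflow or underflow in forming x*x + y*y."""
    w = max(abs(x), abs(y))
    if w == 0.0:
        raise ZeroDivisionError("reciprocal of zero")
    xs, ys = x / w, y / w                 # scaled components, at most 1 in magnitude
    d = xs * xs + ys * ys                 # lies between 1 and 2
    return (xs / d) / w, (-ys / d) / w

if __name__ == "__main__":
    # A case where forming x*x + y*y directly would overflow.
    print(complex_reciprocal(1e200, 2e200))
    print(1.0 / complex(1e200, 2e200))    # Python's complex division, for comparison
```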
The commutative laws always hold good even for floating-point arithmetic.
Thus
a + b = b+ a, a x b = b x a. (2.36)
Apart from these we also have

a + 0 = a,   a × 1 = a,   a × 0 = 0,   a/1 = a,
a − b = a + (−b),   a + b = 0 if and only if a = −b,    (2.37)
in the floating-point arithmetic. Here, in (2.37) it is assumed that there is no
underflow, otherwise for example,

0.345 × 10^{-9} ⊖ 0.323 × 10^{-9} = 0.220 × 10^{-10} = 0.    (2.38)

This problem will not arise if graceful underflow is used as it allows this number
to be represented as an unnormalised number. Knuth (1997) has given many
more identities which are valid in floating-point arithmetic. However, the con-
verse of some of these identities may not be true. For example, in classical
arithmetic a + x = a, if and only if, x is zero, but in computer arithmetic we
can have a + x = a, even when x is nonzero. For example,

0.123 × 10^6 ⊕ 0.456 × 10^3 = 0.123 × 10^6.    (2.39)

It can be easily shown that

a \oplus x = a, \quad \text{if} \ \left| \frac{x}{a} \right| < \hbar.    (2.40)

If a = 1, then it can be seen that ℏ is the largest positive value of x, such that
1 ⊕ x = 1.
Apart from these fundamental laws of algebra, even the inequalities can
also be violated in computer arithmetic, for example, let us consider the well-
known inequality
(a + b)^2 \le 2(a^2 + b^2).    (2.41)
In our hypothetical computer, if a = 0.100 × 10^1 and b = 0.896 × 10^0, then
b^2 = 0.803 × 10^0,  a + b = 0.190 × 10^1,  (a + b)^2 = 0.361 × 10^1,  2(a^2 + b^2) = 0.360 × 10^1,    (2.42)
and the inequality is violated. To estimate the influence of roundoff error on
this inequality, we can transfer all terms to left-hand side and apply the result
of (2.27), to get an upper bound on error:

(2.43)

Here the second term is multiplied by three, since the error in evaluating (a + b) is ℏ(a + b) and it will be amplified by a factor of two when the square is taken.
Apart from this, there will also be an error in evaluating the square. It may
be noted that we have neglected the possible roundoff error while multiplying
(a 2 + b2 ) by 2. On a computer with binary arithmetic, there would be no
roundoff error on multiplication by 2 as it only increases the exponent by 1.
But in this case where decimal arithmetic is used there could be roundoff error
in multiplication by 2, which is neglected. Thus, the maximum possible error is ≈ 0.09, while the value of the expression is (a − b)^2 ≈ 0.01. Hence, the inequality could be violated. In general, this inequality could be violated for (a − b)^2 ≲ ℏ(a + b)^2. On machines with large word length, this condition could
be quite stringent, but it is trivial to find other inequalities where violation is
easily possible {29}.
EXAMPLE 2.1: Solve the quadratic equation ax^2 + bx + c = 0, with

(i)   a = 0.123 × 10^0,  b = 0.451 × 10^1,  c = 0.856 × 10^{-1},
(ii)  a = 0.123 × 10^0,  b = 0.451 × 10^1,  c = 0.103 × 10^0,
(iii) a = 0.400 × 10^0,  b = 0.116 × 10^1,  c = 0.841 × 10^0,
(iv)  a = 0.160 × 10^1,  b = 0.112 × 10^1,  c = 0.196 × 10^0,        (2.44)
(v)   a = 0.400 × 10^0,  b = 0.116 × 10^1,  c = 0.846 × 10^0,
(vi)  a = 0.160 × 10^1,  b = 0.112 × 10^1,  c = 0.195 × 10^0.

We can try to use the well-known formula for the roots of a quadratic equation

x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}.    (2.45)
For the first equation, using our hypothetical computer

ac = 0.105 × 10^{-1},  4ac = 0.420 × 10^{-1}.    (2.46)

Hence, b^2 − 4ac = 0.203 × 10^2 = b^2, which gives the roots as 0.0 and −0.367 × 10^2. The first root is obviously incorrect. The correct values of these roots to five decimal digits are −0.18990 × 10^{-1} and −0.36648 × 10^2. The problem here is because of the fact that evaluation of the first root requires subtraction of two nearly equal numbers, i.e., b and \sqrt{b^2 - 4ac}. The error in the final result due to this subtraction process alone is of the order of ℏ|b/a| ≈ 0.18, which is much larger than the smaller root. Hence, it is not possible to estimate that root using this formula. To estimate this root correctly to three decimal digits will require a calculation using six decimal digit accuracy. It may be noted that the second root is essentially correct, and we can use it to evaluate the first root by using the well-known result that the product of the two roots must equal c/a. Hence

x_1 = \frac{c}{a x_2} = -0.190 \times 10^{-1},    (2.47)
which is correct to three decimal digits. Thus, to solve a quadratic equation which has two
real roots, we can first find the root with larger magnitude using the standard formula, while
the second root should preferably be found using the product relation as stated above. This
avoids the subtraction of two nearly equal numbers in the crucial step. There is another
possibility for subtraction of two nearly equal numbers and that is while evaluating the expression b^2 − 4ac, which happens when the two roots are nearly equal. We will consider
this possibility a little later.
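The prescription just described, namely computing the root of larger magnitude from the standard formula and recovering the smaller one from the product relation, can be written down directly. The routine below is a simple illustrative sketch along these lines (it is not the book's program, it assumes real coefficients with a ≠ 0, and it ignores the overflow and underflow complications discussed later in this example).

```python
import math

def quadratic_roots(a, b, c):
    """Roots of a*x**2 + b*x + c = 0 for the real-root case, avoiding the
    subtraction of two nearly equal numbers.  Illustrative sketch only."""
    disc = b * b - 4.0 * a * c
    if disc < 0.0:
        raise ValueError("complex roots not handled in this sketch")
    # q has the same sign as -b, so -b and the square root never cancel.
    q = -0.5 * (b + math.copysign(math.sqrt(disc), b))
    x1 = q / a                       # root of larger magnitude
    x2 = c / q if q != 0.0 else x1   # smaller root from the product relation x1*x2 = c/a
    return x1, x2

if __name__ == "__main__":
    # Case (i) of Example 2.1
    print(quadratic_roots(0.123, 4.51, 0.0856))
```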
Let us consider the second equation, where 4ac = 0.508 × 10^{-1} and b^2 − 4ac = 0.202 × 10^2. The roots in this case will come out to be −0.813 × 10^{-1} and −0.366 × 10^2, as compared to the correct values of −0.22852 × 10^{-1} and −0.36644 × 10^2. Once again, the first root is not correct while the second one is correct to three decimal digits. In this case, even though −b and \sqrt{b^2 - 4ac} do not cancel each other giving zero, the roundoff error is of the same order and we cannot expect a correct value for the first root. Using the product relation,
we can calculate the smaller root still using only three-digit arithmetic to get a value of −0.229 × 10^{-1}, which is correct to three decimal figures.
Now coming to the third equation, we get

4ac = 0.134 × 10^1,  b^2 − 4ac = 0.100 × 10^{-1}.    (2.48)

It can be easily verified that if the arithmetic is done exactly, then b^2 = 4ac and the equation has two equal roots x = −1.45. Our calculation using three-digit arithmetic yields the roots −0.132 × 10^1 and −0.157 × 10^1, while a 24-bit arithmetic gives a pair of complex conjugate roots −1.45 ± 0.451 × 10^{-3} i. Hence, we can see that the error is once again magnified. In this case, the error in evaluating (b^2 − 4ac) can be estimated to be of the order of ℏ(b^2 + |4ac|) ≈ 10^{-2}. Thus, a result of this order can be expected instead of the exact value of zero. However, the square root of this quantity is added to −b. Consequently, the error gets magnified while evaluating the square root, which will come out to be of the order of 10^{-1}. Thus, in this case, the error in the final result will be of the order of \sqrt{\hbar}\,(b/a). The presence of the square root
magnifies the error considerably. In Chapter 7 we will see that, this is a general problem with
all nonlinear equations which have multiple roots, and there is very little that can be done
about it. Of course, we can set b^2 − 4ac = 0, if the value is less than the expected roundoff error,
but there will always be those marginal cases, where the value comes out to be small because
it is actually small rather than zero. This problem is because of inherent ill-conditioning in
the equation, in the sense that a small relative change, of the order of ℏ, in the coefficients will cause an error of this magnitude. In this case, it can be seen that the computed roots are the exact roots of 0.403x^2 + 1.16467x + 0.8351772 = 0, which differs slightly from the
original equation. The somewhat large difference between this and the original equation is
because of the fact that even though all the coefficients are correct to three significant figures,
the relative error in b is much larger than that in other coefficients. It may be noted that
the relative change in the coefficients is of the order of ℏ. We will consider more about such
problems in Section 2.4.
Equation (iv) is similar to (iii), and in this case b^2 − 4ac = −0.100 × 10^{-1}, even though once again we can verify that it should be exactly zero. This gives a pair of complex conjugate roots −0.350 ± 0.312 × 10^{-1} i, while the use of 24-bit arithmetic gives a pair of real roots −0.34995 and −0.35005, with an error of the order of \sqrt{\hbar}. In this case, the relative error in all the coefficients is comparable and it can be seen that if the value of c is perturbed to 0.197557504, the exact solution of the equation will be −0.35 ± 0.0312i. Equations (v) and (vi) illustrate the opposite situation, where the calculated value of b^2 − 4ac comes out to be
zero, even though the actual value should be negative for (v) and positive for (vi). Thus, for
(v) three-digit arithmetic gives a pair of equal roots -1.45, while correct value of roots to
three significant digits are -1.45 ± 0.112 i. For the last equation, three-digit arithmetic gives
a pair of equal roots -0.35, while the correct value of the roots are -0.325 and -0.375.

This example shows that even solving a quadratic equation is not as simple as we may imagine at first sight. Apart from the problems illustrated here, an ideal algorithm to solve a quadratic equation should be able to handle many other complications arising from overflow and underflow in floating-point numbers. For example, if all coefficients in any of the equations above are multiplied by 10^7 (or 10^{-7}), then the true result is not affected, but calculating b^2 − 4ac will give an overflow (or underflow) on our hypothetical machine with a one-digit exponent. Further, there is the possibility of an equation with a = 10^{-8}, b = 10^8, c = 10^8, which has one of the roots out of the range, but an ideal program should be able to find at least the other root correctly.
A consequence of violation of fundamental laws of algebra is that, two ex-
pressions which are algebraically identical may not give the same result when
evaluated numerically. We can find any number of examples where the com-
puted results are completely different from the actual results. Hence, in any
numerical computation we have to worry about the roundoff error. Estimating the roundoff error in numerical computations is one of the most important and
possibly also one of the most difficult problems in numerical analysis.
Since the computer arithmetic does not obey the basic laws of algebra, we
can try to develop a new algebra which is satisfied by the computer arithmetic.
This means, we should use only those identities which are valid in computer
arithmetic, while deriving the formulae from basic physical laws. But without
the associative and distributive laws, this algebra will not be of much use. We
are therefore forced to estimate the roundoff error in numerical computations.
One interesting technique for bounding this error is the use of so-called interval
arithmetic. Here the basic idea is that, instead of a number use the interval
which bounds the number. For example, 1/3 could be represented by an in-
terval (0.333,0.334), since the true value is somewhere in this interval. Instead
of computing one result at every stage, the interval containing the true value
is computed. This also requires additional set of arithmetic operations on the
computer, since to compute the lower bound of the interval, we need a set of
arithmetic operations which round the results to next lower number that can be
represented in the machine. Similarly, for computing the upper bound of the in-
terval, we need another set of arithmetic operations, which round up the result
to next higher number that can be represented in the machine. These options
are actually available on most current processors. The normal arithmetic op-
erations which round the number to the nearest value that can be represented
in the machine cannot be used to perform interval arithmetic. For example, 6 × (0.333, 0.334) = (1.998, 2.004), and if both limits are rounded to the nearest number that can be represented, both limits will be the same. Thus this should be rounded to (1.99, 2.01).
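A toy version of interval addition and multiplication can be put together in a few lines. The fragment below is only a rough illustration of the idea and is not from the book; proper interval packages use directed rounding, which this sketch merely imitates by widening each computed bound outward by one representable number (using math.nextafter, available from Python 3.9).

```python
import math

def widen(lo, hi):
    # Imitate outward (directed) rounding by stepping each bound
    # one representable number outward.
    return math.nextafter(lo, -math.inf), math.nextafter(hi, math.inf)

def iadd(a, b):
    """Sum of two intervals a = (a1, a2), b = (b1, b2)."""
    return widen(a[0] + b[0], a[1] + b[1])

def imul(a, b):
    """Product of two intervals: take the extremes of the four corner products."""
    p = [a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1]]
    return widen(min(p), max(p))

if __name__ == "__main__":
    third = (0.333, 0.334)           # interval containing 1/3, as in the text
    print(imul((6.0, 6.0), third))   # contains 2, roughly (1.998, 2.004)
    print(iadd(third, third))        # contains 2/3
```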
This procedure obviously requires more computations than the straight-
forward calculation using only one value, but it has the advantage that a bound
on roundoff error is automatically computed. The main disadvantage of this ap-
proach is not the excess effort, but the fact that in any reasonable computation,
the intervals all tend to become unrealistically large. The same difficulty arises
if we try to bound roundoff error using the elementary results stated in (2.27). Another important problem with interval arithmetic is that the interval may become so large that it includes some singularity of a function and it may not be possible to even evaluate the function at both the limits. This happens, for example, if the interval (x_1, x_2) is such that x_1 < 0 < x_2 and we are trying to find its reciprocal or the square root. A considerable amount of work has been done to use interval arithmetic to give rigorous bounds on roundoff error in computation (Alefeld and Herzberger, 1983). A straightforward use of this method gives
unrealistically large intervals which are of no use, but if properly applied it is
capable of giving reasonable bounds on roundoff error for some problems.
Kulisch and Miranker (1986) have proposed a new approach for computer
arithmetic, where they have introduced advanced computer arithmetic with an
augmented set of arithmetic operations. In addition to the four basic opera-
tions, they propose to include the calculation of scalar product of two vectors
as a basic operation in this advanced computer arithmetic. Further, this operation should be implemented in such a way that the result is correct to the last
bit in the computer representation. This ensures a bound similar to those given
by (2.27) for scalar product calculations also, which will be a significant im-
provement, since with the standard floating-point arithmetic it is not possible
to give any bound on the relative roundoff error in such calculations {15, 30}. If
the interval arithmetic is combined with this advanced computer arithmetic, it
may be possible to obtain reasonable bounds on roundoff error in a large class
of numerical computations.
In general, the estimation of a realistic bound on roundoff error in nu-
merical calculations is a difficult task and some approaches will be discussed in
the next section. It may be noted that an iterative method for numerical com-
putation does not usually converge to an accuracy better than that permitted
by roundoff error. Hence, such methods give a ready estimate of roundoff er-
ror. Apart from this, we can also estimate the roundoff error by repeating
the calculation on a different computer, which has a different word length for
representing numbers, or by repeating the calculation on the same computer
using double precision. Probably the best way of estimating error in numeri-
cal calculations is to repeat the computations using a different algorithm. This
technique could give an estimate of both roundoff as well as truncation errors.
Apart from this, in some cases, e.g. solution of linear or nonlinear equations, it
is in principle possible to check the computed result by calculating the residual,
by inserting the computed answer into the equation and hoping that a small
residual implies a small error. However, as we will see in Section 2.4, this test
is usually unreliable and any number of examples can be given where this test
could be completely misleading. In fact, most algorithms for solving such prob-
lems tend to give results which yield small residue, although the result may
not be anywhere near the true answer. This happens for the ill-conditioned
problems which will be discussed in Section 2.4.

2.3 Error Analysis


Roundoff error being essentially random in character, there are various means
of analysing the propagation of roundoff error in numerical computations. The
randomness comes in because the roundoff error due to a finite size of computer
word could be anything between two limiting values. For example, the number 0.123 × 10^1 in our hypothetical computer could represent any number between 1.225 and 1.235, thus giving a roundoff error in the interval (−0.005, +0.005).
Depending on the rounding algorithm used, the end points of this interval may
or may not be allowed. For example, if 1.225 is rounded to 1.22 and 1.235 to
1.24, then both limits are not allowed. Further, we can expect this error to be
uniformly distributed in this interval. This enables us to do a statistical analysis
of the roundoff error. A basic assumption in the statistical analysis is that, the
roundoff error at any stage is not correlated to those at any previous steps. This
assumption is often not valid in numerical computations. For example, if we
try to estimate n × 0.333 in our hypothetical decimal computer by performing repeated addition, the roundoff error at every step will be in the same direction and the result is always either less than or equal to the actual value, independent
of n. Further, if the number is assumed to be the rounded approximation to
1/3, then the result is always less than the actual value {21}. Including the
effect of correlations in errors will make the statistical analysis much more
complicated. Apart from the statistical approach, we can give an upper bound
on the roundoff error in any computation by applying the results of (2.27) at
each step of computation. This is the so-called forward error analysis, which
usually leads to a very pessimistic estimate of roundoff error. To circumvent
this problem, Wilkinson (1994) introduced the backward error analysis, where
instead of estimating the error in the final result, we estimate the "error" in
the initial data for which the computed result is the exact solution. We will
illustrate both these techniques in this section.
Let us consider the propagation of roundoff error while calculating the
sum of n + 1 numbers:

s_n = \sum_{i=0}^{n} x_i.    (2.49)

Since the associative law does not apply to computer arithmetic, the result will of course depend on the order in which the summation is carried out. Let us assume that the summation is carried out in the natural order, in which case the summation process can be defined by the recurrence relation

s_0 = x_0, \qquad s_i = s_{i-1} \oplus x_i, \qquad (i = 1, 2, \ldots, n).    (2.50)


We further assume that the numbers are correctly represented in the computer. Let \bar{s}_i denote the computed sum, while s_i represents the actual value which would be obtained if there were no roundoff error. Using (2.27), we get
\bar{s}_1 = x_1 \oplus s_0 = (x_1 + s_0)(1 + \epsilon_1) = (x_1 + x_0)(1 + \epsilon_1),
\bar{s}_2 = x_2 \oplus \bar{s}_1 = (x_2 + \bar{s}_1)(1 + \epsilon_2) = (x_0 + x_1)(1 + \epsilon_1)(1 + \epsilon_2) + x_2(1 + \epsilon_2),
\cdots
\bar{s}_n = (x_0 + x_1)\prod_{j=1}^{n}(1 + \epsilon_j) + \sum_{i=2}^{n} x_i \prod_{j=i}^{n}(1 + \epsilon_j) = (x_0 + x_1)(1 + \eta_1) + \sum_{i=2}^{n} x_i (1 + \eta_i),    (2.51)
where

1 + \eta_r = \prod_{i=r}^{n} (1 + \epsilon_i) \quad \text{and} \quad |\epsilon_i| \le \hbar, \quad (i = 1, 2, \ldots, n).    (2.52)

Hence, we get

|\eta_r| \le (1 + \hbar)^{n-r+1} - 1 < 1.06\hbar\,(n - r + 1).    (2.53)


Here the second inequality holds if ℏ is sufficiently small, which is the case for
any reasonable computer representation. The factor of 1.06 is arbitrary and
could be replaced by any suitable number greater than one.
We can use (2.51) for the forward error analysis by writing

\bar{s}_n = s_n + \eta_1 x_0 + \sum_{i=1}^{n} \eta_i x_i = s_n + E_n,    (2.54)

where E_n is the roundoff error in summing n terms. The upper bound on the roundoff error is given by

|E_n| < 1.06\hbar \left( n|x_0| + \sum_{i=1}^{n} (n - i + 1)|x_i| \right) \le 1.06\hbar \, \frac{n(n+3)}{2} |x|,    (2.55)

where |x_i| ≤ |x| for i = 0, 1, 2, …, n. If all terms are equal, then the bound on the relative error is given by

\left| \frac{E_n}{s_n} \right| \le 1.06\hbar \, \frac{n(n+3)}{2(n+1)}.    (2.56)

Hence, the relative error increases linearly with n. This upper bound actually
assumes the worst case scenario, where the roundoff errors at various steps add
together, while in actual practice usually there will be significant cancellation
between roundoff errors at different stages, and from statistical considerations we would expect the roundoff error to increase as \sqrt{n}. When the terms are
not of the same sign, we cannot give any bound on the relative error, since
the actual value of the sum could be zero or arbitrarily small. Further, it can
be seen from (2.53) that the factor 'TJi multiplying the terms decreases as i
increases and the first few terms could contribute a large fraction of the error.
Hence, in order to reduce the error it would be advisable to sum the series with
terms arranged in increasing order. This device reduces the upper bound on the
roundoff error and will usually result in lower error. However, counter-examples
can be given where the error actually increases, if the series is summed in the
order of increasing terms. For example, if x_0 = 0.105 × 10^4, x_1 = −0.943 × 10^3, x_2 = −0.963 × 10^2, x_3 = −0.915 × 10^1, then

\bar{s}_2 = 0.107 \times 10^2, \qquad \bar{s}_3 = 0.155 \times 10^1,    (2.57)

which is the exact result. On the other hand, if the summation is carried out
in increasing order of magnitude, we get

\bar{s}_3 = 0.    (2.58)

Here we have taken advantage of the fact that there is no roundoff error in
evaluating the difference of two nearly equal numbers. This is a rather artificial
example, but in general, we can expect the error to be reduced if the terms are
added in increasing order of magnitude. If we have an alternating series, it may
[Figure 2.1: Cascade sum — binary tree of partial sums s_i^j built from the terms x_1, …, x_{11}.]

be better to first sum the pairs of number which are nearly equal in magnitude,
but have opposite signs, since that will reduce the number of terms by a factor
of two without introducing any roundoff error.
In the above discussion, we have assumed that the numbers are correctly
represented in the computer, but if the numbers are actually rounded off ver-
sions of the exact numbers, then there will be an additional error which is
bounded by

\hbar \sum_{i=0}^{n} |x_i|.    (2.59)

For the above example this bound is ≈ 10, which is much larger than the correct
value of the sum. Hence, it is probably not very meaningful to compute the sum
using three-digit arithmetic.
In general, the error will be reduced if the summation is carried out in
pairs, essentially forming a binary tree with the numbers at the leaves. Sum-
mation process can go along the branches to the root. For example, if we have
11 numbers, then we can form the tree as shown in Figure 2.1, where

s_i^j = \sum_{k=i}^{j} x_k.    (2.60)

A straightforward algorithm to calculate the "binary" sum will first sum the terms in pairs to get s_i^{i+1} (i = 1, 3, 5, …). If the number of terms is odd, then the last number can be left as such. The same process is repeated on these partial sums to get s_i^{i+3} (i = 1, 5, 9, …), and the process is continued until we
have one number which is the sum of all the terms. This algorithm is referred
to as cascade sum, and is very effective on a parallel computer. However, this
algorithm requires a large amount of memory, since all partial sums need to be
stored. A careful look at the process shows that, this is not really required and
it is enough if we can store one partial sum at each horizontal level in the binary
tree (Figure 2.1). Initially all the partial sums should be set to zero. When two
numbers are added at any level, the sum should be carried to the next higher
level and if that completes the pair, then the result should be carried on to next
higher level, and so on. For our case of 11 numbers, the process will require the following steps:

(1) s_1 = x_1 + x_2,    (6) s_1 = x_5 + x_6,    (10) s_3' = s_3 + s_2',    (14) s_1 = x_{11},
(2) s_2 = s_1,          (7) s_2 = s_1,          (11) s_4 = s_3',           (15) s_2' = s_2 + s_1,
(3) s_1 = x_3 + x_4,    (8) s_1 = x_7 + x_8,    (12) s_1 = x_9 + x_{10},   (16) s_3 = s_2',
(4) s_2' = s_2 + s_1,   (9) s_2' = s_2 + s_1,   (13) s_2 = s_1,            (17) s_4' = s_4 + s_3.
(5) s_3 = s_2',
                                                                            (2.61)

Here the last value of s_4' is the final sum. It should be noted that in a computer implementation the value s_j' will be overwritten on s_j itself and no extra storage
is required. This algorithm of course, requires some bookkeeping. Hence, the
algorithm is more complicated than the straightforward summation process.
The function CASSUM in Appendix B implements this algorithm.
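The bookkeeping described above, with one stored partial sum per level of the binary tree, can be implemented compactly by treating the levels like the digits of a binary counter. The Python sketch below is an illustration of this scheme only; it is not a transcription of CASSUM.

```python
def cascade_sum(x):
    """Sum the numbers in x by adding them in pairs (cascade sum),
    keeping one pending partial sum per level of the binary tree."""
    levels = []                      # levels[k] holds a pending partial sum, or None
    for term in x:
        carry = term
        k = 0
        # Combine with the pending sum at each level, like binary addition.
        while k < len(levels) and levels[k] is not None:
            carry = levels[k] + carry
            levels[k] = None
            k += 1
        if k == len(levels):
            levels.append(carry)
        else:
            levels[k] = carry
    # Add up whatever partial sums are left (at most one per level).
    total = 0.0
    for s in levels:
        if s is not None:
            total = total + s
    return total

if __name__ == "__main__":
    data = [1.0 / i for i in range(1, 12)]     # 11 terms, as in Figure 2.1
    print(cascade_sum(data), sum(data))
```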
Following an analysis similar to that for the previous case, we can write

\bar{s}_n = \sum_{i=1}^{n} x_i (1 + \eta_i), \qquad |\eta_i| < 1.06\hbar \lceil \log_2 n \rceil,    (2.62)

where ⌈x⌉ represents the ceiling operation, which gives the smallest integer equal to or greater than x. Hence, the error in this case is bounded by

|E_n| < 1.06\hbar \lceil \log_2 n \rceil \sum_{i=1}^{n} |x_i| \le 1.06\hbar \lceil \log_2 n \rceil \, n |x|.    (2.63)

If all terms are equal, then the relative error is bounded by

\left| \frac{E_n}{s_n} \right| < 1.06\hbar \lceil \log_2 n \rceil.    (2.64)

Hence, in this case the relative error increases as log2 n, which is much better
than the previous case for large values of n. Further, in this case, all terms have
the same factor in the roundoff error expression. Hence, the order of terms will
not affect the error appreciably. Thus, if we want to sum a large number of
terms, it would be desirable to use the cascade sum. The only problem with
this algorithm is that it involves some bookkeeping, particularly if the partial sums s_i after adding each term are required. The partial sums may be required if we
want to sum an infinite series, but do not know a priori how many terms
are needed. In fact, it can be shown that if partial sums are needed, then
this algorithm requires (n/2) log2 n additions. Hence, it is not attractive for
computing partial sums on a sequential processor.
As we have remarked the computer actually finds a double length sum
and then rounds it off to the word length. If instead of rounding it off this
double length sum is retained in the accumulator and another number added
to it, the process being continued until the entire sum is evaluated; then we
can expect the error to be reduced considerably at essentially no extra cost. This technique is referred to as double length accumulation of the sum. To estimate the roundoff error in this case, we can replace ℏ by ℏ^2, since the length of the fraction part is now doubled. On some computers the length of the fraction part of the accumulator may be different from twice that of the single precision numbers. In that case, the actual value of ℏ for the accumulator should be used. In
addition to this error, there will be a roundoff error bounded by ℏ|s_n|, when the final sum is rounded off to a single length number. This gives the total error

|E_n| < 1.06\hbar^2 \left( n|x_0| + \sum_{i=1}^{n} (n - i + 1)|x_i| \right) + \hbar |s_n| < 1.06\hbar^2 \, \frac{n(n+3)}{2} |x| + \hbar |s_n|.    (2.65)
If n^2 ℏ|x| < |s_n|, which is most likely to be the case, unless the sum is much
smaller than the largest term in magnitude, the second term dominates and
the roundoff error in evaluating the sum will be of the same order as the error
in representing the number in the computer. On the other hand, if the first
term dominates, then the error bound is ℏ times what it would be without us-
ing double length accumulation. Hence, in either case, the error will be reduced
considerably without increasing the cost of computation, if the double length
accumulation of sum is used. Unfortunately, many computers (or the compilers
of higher level languages ) do not allow double length sum to be accumulated in
this manner and hence this procedure cannot be used on all machines. However,
on machines which allow such accumulation this technique has been successfully
used to improve the accuracy of computation, mainly for evaluating the scalar
product of two vectors, which arises very often in linear algebra problems. It
should be noted that, this technique can be used only if the sum can be accumu-
lated in one register. For example, it cannot be used with the cascade sum algorithm considered above, since in that case, pairs of numbers are summed independently. Hence, no accumulation of double length numbers is possible, unless the computer has a sufficient number (at least ⌈log_2 n⌉) of double length registers.
So far we have considered the usual forward error analysis, now we shall
consider the backward error analysis for the same problem. We can interpret (2.51) to claim that the computed sum \bar{s}_n is the exact result of summation of n numbers, if instead of x_i we use x_i(1 + \eta_i) (with \eta_0 = \eta_1). In this case,
it is trivial to translate the error in the original numbers to the error in the
resulting sum, but in general, it may not be the case. Our analysis shows that
the computed result is the exact result which would have been obtained, if the
input numbers are perturbed by a relative error of less than 1.06nℏ. Similarly, if we consider the cascade sum algorithm, then the backward error analysis tells us that the computed result is the exact result obtained if the input numbers are perturbed by a relative error of less than 1.06ℏ⌈log_2 n⌉.
This bound on perturbation of input numbers is very useful, since usually
these numbers are obtained by some experimental observations, or by some
previous calculations and will have some errors associated with them. We can
compare this error with the error bound given by the backward error analysis.
If the inherent error is larger than the error bound given by backward error
analysis, then we can say that the computation process has not introduced
any additional error, and the error in the final result is essentially due to the
inherent error in input data. Hence, in this case, nothing will be gained by
doing the computation more accurately. On the other hand, if we find that the inherent errors are smaller than the bound given by our error analysis, then we may have to improve on our calculation process if we want higher accuracy. For example, if the relative error in the input data is > nℏ, then there is no point in calculating the sum using higher precision arithmetic. On the other hand, if the relative error in the input data is < nℏ, then we can get a more accurate result by using higher precision arithmetic. In the summation problem, the error in the final result depends on the value of the sum; if the numbers are not of the same sign and the sum is of the order of or less than nℏ|x|, then it is not possible to evaluate the sum to any accuracy. In such cases, we may have to use more accurate arithmetic, i.e., smaller ℏ. However, that is meaningful only if the
input numbers are known with sufficient accuracy, otherwise the problem itself
is ill-conditioned and it is not possible to estimate an accurate value for the
sum.
Now let us consider the propagation of roundoff error in finding the product of n numbers

p_n = \prod_{i=0}^{n} x_i.    (2.66)

Once again, we can obtain this product by using the recurrence

p_0 = x_0, \qquad p_i = x_i \otimes p_{i-1}, \qquad (i = 1, 2, \ldots, n).    (2.67)

If \bar{p}_i denotes the computed result and p_i denotes the exact result, then using (2.27) we can write

\bar{p}_i = x_i \otimes \bar{p}_{i-1} = x_i \bar{p}_{i-1}(1 + \epsilon_i) = p_i \prod_{j=1}^{i} (1 + \epsilon_j),    (2.68)

where |\epsilon_j| \le \hbar. Hence, we get


\bar{p}_n = p_n \prod_{i=1}^{n} (1 + \epsilon_i) = p_n (1 + \eta_n),    (2.69)

where

(1 - \hbar)^n < 1 + \eta_n < (1 + \hbar)^n.    (2.70)

If nℏ is sufficiently small, then |\eta_n| < 1.06nℏ and the relative error increases
linearly with n. In this case, the error bound is independent of the order in
which the various terms are multiplied. However, the actual value of product
will of course, depend on the order, since the associative law for multiplication is
violated in computer arithmetic. For backward error analysis, we can interpret
(2.69) to imply that the computed product is the true product of the numbers x_i(1 + \epsilon_i). This corresponds to a relative error in the input numbers of the order of ℏ, which is comparable to the roundoff error in representing the number. Hence, unlike addition, the process of multiplication does not unduly
amplify roundoff error.
EXAMPLE 2.2: Estimate the roundoff error in evaluating the sum

S = \sum_{i=n_1}^{n_2} \frac{(-1)^{i-1}}{i},    (2.71)

where (i) n_1 = 1, n_2 = n; (ii) n_1 = n + 1, n_2 = 2n. Estimate the bound on the roundoff error and compare this estimate with the actual error in evaluating the sum for n = 10^3, 10^4, 10^5 and 10^6.
In this case, there are two sources of roundoff error: first, the numbers themselves cannot be represented correctly and second, the roundoff error in the summation process. The first part of the error is independent of the order of summation, while the second part of course depends on the order. Since the error in representing a number x is bounded by ℏ|x|, the error due to this (E_1) in summation is bounded by

|E_1| < \hbar \sum_{i=n_1}^{n_2} \frac{1}{i} \approx \begin{cases} \hbar \ln n & \text{for (i);} \\ \hbar \ln 2 & \text{for (ii).} \end{cases}    (2.72)
This is the minimum error bound that we can expect in summation, irrespective of the order in which the terms are summed. In addition to this, there will be error in the summation process, which is given by (2.55). Considering the case (i), if the summation is carried out in the natural order, where the terms are in a decreasing sequence of magnitude, the error bound is

|E_2| < 1.06\hbar \left( (n-1) + \sum_{i=2}^{n} \frac{n-i+1}{i} \right) = 1.06\hbar\,(n+1) \sum_{i=2}^{n} \frac{1}{i} \approx 1.06\hbar\, n \ln n,    (2.73)

where the approximation is expected to hold for large values of n. This is much larger than the error due to representation (E_1). If the summation is carried out in the reverse order, then this error is bounded by

|E_2| < 1.06\hbar \left( \frac{n-1}{n} + (n-1) \right) = 1.06\hbar\,(n-1)\left(1 + \frac{1}{n}\right).    (2.74)

As expected from our analysis, this bound works out to be less than that obtained for summing in the opposite order. If we use double length accumulation, then the error bound is given by

|E_2| < 1.06\hbar^2 (n-1)\left(1 + \frac{1}{n}\right) + 1.06\hbar \ln 2,    (2.75)

where in the last term the value of the sum is approximated by ln 2. This bound will of course be much smaller than the previous ones. If we perform the cascade sum by summing in pairs, then the error bound is given by

|E_2| < 1.06\hbar \lfloor \log_2 n \rfloor \sum_{i=1}^{n} \frac{1}{i} \approx 1.06\hbar \lfloor \log_2 n \rfloor \ln n,    (2.76)

where ⌊x⌋ is the floor operation, which gives the largest integer less than or equal to x. This expression is slightly different from (2.63), because in this case the first step in the cascade sum, which involves the sum of successive pairs of numbers, will not have any roundoff error associated with it. It is clear that this bound is significantly less than any of the previous bounds without accumulation of double length results. Hence, this is the recommended algorithm for summation, when the facility of double length accumulation of intermediate results is not available.
Table 2.1: Roundoff error in summing a finite series

                       Forward sum            Backward sum           Backward sum with      Cascade sum
                                                                     accumulation
  n1       n2          Actual     Error       Actual     Error       Actual     Error       Actual     Error
                       error      estimate    error      estimate    error      estimate    error      estimate

  1        10^3        2×10^-6    4×10^-4     3×10^-8    6×10^-5     3×10^-8    5×10^-7     9×10^-8    4×10^-6
  1+10^3   2×10^3      1×10^-9    2×10^-5     8×10^-10   2×10^-5     8×10^-10   4×10^-8     8×10^-10   4×10^-7
  1        10^4        6×10^-6    6×10^-3     9×10^-9    6×10^-4     9×10^-9    6×10^-7     5×10^-8    8×10^-6
  1+10^4   2×10^4      3×10^-10   2×10^-4     1×10^-10   2×10^-4     1×10^-10   4×10^-8     1×10^-10   6×10^-7
  1        10^5        8×10^-6    7×10^-2     5×10^-9    6×10^-3     5×10^-9    7×10^-7     5×10^-8    1×10^-5
  1+10^5   2×10^5      8×10^-11   2×10^-3     6×10^-12   2×10^-3     6×10^-12   4×10^-8     6×10^-12   7×10^-7
  1        10^6        9×10^-6    9×10^-1     3×10^-8    6×10^-2     3×10^-8    9×10^-7     8×10^-8    2×10^-5
  1+10^6   2×10^6      8×10^-10   2×10^-2     3×10^-11   2×10^-2     3×10^-11   4×10^-8     3×10^-11   9×10^-7

The main difference between case (i) and case (ii) is that, in the latter, all terms in the summation are of the same order of magnitude. For case (ii), it is more meaningful to consider relative error, since the value of the sum decreases with n. In this case, the bounds on roundoff error for the four different algorithms, i.e., forward, backward, backward with double length accumulation and the cascade sum, are given by

(Forward)  1.06\hbar \left( \frac{n-1}{n+1} + \sum_{i=2}^{n} \frac{n-i+1}{i+n} \right) \approx 1.06\hbar\, n(2\ln 2 - 1),

(Backward)  1.06\hbar \left( \frac{n-1}{2n} + \sum_{i=1}^{n-1} \frac{i}{n+i} \right) \approx 1.06\hbar\, n(1 - \ln 2),

(Double length accumulation)  1.06\hbar^2 n(1 - \ln 2) + 1.06\hbar\, \frac{1}{4n},

(Cascade sum)  1.06\hbar \lfloor \log_2 n \rfloor \sum_{i=n+1}^{2n} \frac{1}{i} \approx 1.06\hbar \lfloor \log_2 n \rfloor \ln 2.
                                                                                          (2.77)

It should be noted that these bounds refer to absolute error. Since the value of the sum is of the order of 1/(4n), the relative error will be much larger. It should be noted that in this case, the order of summation does not make much difference, since all terms are of nearly the same magnitude. Table 2.1 summarises the results of calculations using a 24-bit floating-point arithmetic. The column of estimated error in this table actually refers to the estimated error bounds given by (2.72–2.77).
Since the computer used for these calculations does not allow double length accumu-
lation of floating-point numbers, the corresponding results were actually obtained by using
a double precision variable for accumulating the sum, while the individual terms were cal-
culated using single precision arithmetic. It may be noted that the result will not change
if instead of backward the forward sum is obtained using the double length accumulation,
since in this case the error is mainly contributed by the fact that the individual terms them-
selves are rounded versions of the correct values. It can be seen that in the first case, where
the terms are decreasing significantly, the backward sum gives substantially better results,
while in the second case, when the terms are all of roughly the same magnitude, the order
of summation is not very important. It may be noted that the actual error obtained by the
program may also depend on the compiler and how exactly the arithmetic is implemented
in the computer. Further, in all cases, it can be seen that the estimated bound on roundoff
error is a few orders of magnitude larger than the actual error. This is the main problem with
all such bounds, since they tend to be unduly conservative in actual practice. In principle,
for any given precision it may be possible to find a series where such a bound is actually approached {31}, but in practice, such examples are most unlikely.
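The experiment of Table 2.1 is easy to repeat on any machine that provides both single and double precision. The sketch below is an illustration only: it uses numpy's 32-bit floats for the single precision arithmetic and a double precision value as the reference, so the errors obtained will not be exactly those of the table, which were produced on a particular 24-bit machine and compiler.

```python
import numpy as np

def term(i):
    # Terms of the series in (2.71)
    return (-1.0) ** (i - 1) / i

def forward_sum(n1, n2):
    s = np.float32(0.0)
    for i in range(n1, n2 + 1):
        s = s + np.float32(term(i))           # single precision accumulation
    return s

def backward_sum(n1, n2):
    s = np.float32(0.0)
    for i in range(n2, n1 - 1, -1):
        s = s + np.float32(term(i))
    return s

def backward_sum_accum(n1, n2):
    s = 0.0                                   # double precision accumulator
    for i in range(n2, n1 - 1, -1):
        s = s + float(np.float32(term(i)))    # terms still rounded to single precision
    return np.float32(s)                      # final rounding to single length

if __name__ == "__main__":
    n = 1000
    exact = sum(term(i) for i in range(1, n + 1))     # double precision reference
    for f in (forward_sum, backward_sum, backward_sum_accum):
        print(f.__name__, float(f(1, n)) - exact)
```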

EXAMPLE 2.3: Estimate the roundoff error in evaluating the value of sin(x) (x = 1, 10,
20, 30) by directly using the Maclaurin series.
Using the Maclaurin series we can write

\sin(x) = \sum_{i=1}^{\infty} (-1)^{i-1} \frac{x^{2i-1}}{(2i-1)!}.    (2.78)

This series is absolutely convergent for all values of x. However, for large values of x the terms
at first increase in magnitude and will start decreasing only for (2i - 1) > x. The number of
terms in this series to be included in evaluating the sum depends on the accuracy required.
Since it is an alternating series, the last term gives a reasonable estimate of the truncation error; a convenient algorithm for evaluating the sum is therefore to keep adding terms starting from the first
one, until we find that the last term is smaller than the desired accuracy.
As in the previous example, the roundoff error can be divided into two parts, one
due to the fact that evaluating the individual terms itself will not yield exact values, and
the second due to the summation process. Each of these two contributions will depend on
the precise algorithm used in calculations. In general, the evaluation of ith term requires 2i
multiplications. Hence, we can expect the relative roundoff error to be less than 2iℏ. On the
other hand, the roundoff error \eta_i due to summation decreases with i, and if n is the total number of terms used, then a bound on the total roundoff error can be written as

|E_T| < \hbar \sum_{i=1}^{n} \frac{x^{2i-1}}{(2i-1)!}\,(n + i).    (2.79)

This of course gives a very conservative error bound, while the actual error is much smaller. It may be more realistic to replace the last factor inside the summation sign by \sqrt{n}, which gives an error estimate of the order of

|E_T| \approx \hbar \sqrt{n}\, \sinh(x).    (2.80)

If t_{max} is the term with maximum magnitude in the series, then a rather optimistic estimate for the roundoff error could be given by \hbar t_{max}, which can be approximated as

|E_T| \approx \frac{\hbar\, x^x}{x!} \approx \frac{\hbar\, e^x}{\sqrt{2\pi x}}.    (2.81)

The result of actual computation with 24-bit arithmetic (ℏ ≈ 6 × 10^{-8}) for various values of x is given in Table 2.2. It can be easily seen that (2.81) gives a realistic estimate for the

Table 2.2: Evaluating sin(x) using Maclaurin series

   x    n    Calculated      Exact           Actual           Error estimates
             value           value           error            (2.79)        (2.80)        (2.81)

   1    6    0.8414710       0.8414710       -2.80×10^-8      1.54×10^-7    1.72×10^-7    6.46×10^-8
  10   20    -0.5441203      -0.5440211      -9.91×10^-5      1.74×10^-2    2.94×10^-3    1.65×10^-4
  20   34    2.013896        0.9129453       1.10×10^0        6.58×10^2     8.43×10^1     2.58×10^0
  30   44    -4887.133       -0.9880316      -4.89×10^3       1.93×10^7     2.11×10^6     4.64×10^4
roundoff error in this calculation, while the other bounds prove to be unduly conservative.
Further, the roundoff error increases very rapidly with x and it is not possible to get any
meaningful estimate for large x, even though the series is absolutely convergent. Of course,
in actual practice we don't have to evaluate this series for |x| > π/4, as larger angles can be transformed to this range by use of appropriate trigonometric identities.
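The blow-up shown in Table 2.2 can be reproduced with a few lines of single precision arithmetic. The following sketch is illustrative only (the stopping tolerance and the exact numbers obtained depend on the machine and are not those of the table); it accumulates the series (2.78) in 32-bit floats until the next term becomes negligible and compares the result with the library value of sin(x).

```python
import math
import numpy as np

def sin_maclaurin(x, tol=1e-8):
    """Sum the Maclaurin series (2.78) for sin(x) in single precision,
    stopping when the magnitude of the next term drops below tol."""
    x = np.float32(x)
    term = x                                  # x**(2i-1)/(2i-1)! for i = 1
    s = np.float32(0.0)
    i = 1
    while abs(term) > tol:
        s = s + term                          # single precision accumulation
        i += 1
        # next term: multiply by -x**2 / ((2i-2)(2i-1))
        term = term * (-x * x) / np.float32((2 * i - 2) * (2 * i - 1))
    return s

if __name__ == "__main__":
    for x in (1.0, 10.0, 20.0, 30.0):
        approx = float(sin_maclaurin(x))
        print(x, approx, math.sin(x), approx - math.sin(x))
```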

It is quite clear from these examples that in general, the roundoff error
estimates tend to be very conservative. The actual error is a few orders of
magnitude smaller than the estimated error bound even for simple calculations.
In general, it is very difficult to obtain a realistic estimate of roundoff error in
any reasonable computation. Considerable experience and experimentation will
be required to estimate the error in any realistic computation. On top of all this, we also have to cope with the idiosyncrasies of compilers and other system
dependent features. For example, some of the compilers do not round off the
double length intermediate results, until the accumulator is required for other
computations. On such systems, if the terms of the series are already calculated
before the summation is carried out, and further, if no other calculation is being
done in the summation loop, the result is most likely to be the same as what
would be obtained with double length accumulation. However, if the terms
are calculated in the same loop, the intermediate sums may be rounded off to
single length, but the individual terms may not be rounded before adding, and the result will usually have an error of the same order as expected from single precision arithmetic, but the value will not be exactly the same.

2.4 Condition and Stability


To illustrate the difference between the forward and the backward error analysis more clearly, let us consider the computation process for evaluating some function of n variables

x = f(a_1, a_2, \ldots, a_n).    (2.82)

Let us assume that for some given input values of a_i the computed value is \bar{x}, while the true value is x; then in the forward error analysis, we are interested in estimating a bound on the difference |\bar{x} - x|. In the backward error analysis we are not concerned with the true value, but instead we ask the question: for what values of a_i will the actual result (i.e., the one computed without any roundoff error) come out to be \bar{x}, i.e.,

\bar{x} = f(\bar{a}_1, \bar{a}_2, \ldots, \bar{a}_n) \quad \text{and} \quad \bar{a}_i = a_i + \epsilon_i,    (2.83)

where it is assumed that the right-hand side is the result of exact computation
without any roundoff error. Obviously, \epsilon_i may not be unique and in fact every computed value may not even be representable in this form. We are only interested in estimating bounds on \epsilon_i. It turns out that in many cases, it is
much simpler to estimate this bound rather than to estimate the bound on the
actual error. In fact, the concept of backward error analysis has enabled us to
estimate error in those computations, where it was not possible to use forward
error analysis in any meaningful manner.
To understand the significance of backward error analysis, we should re-
member that in most computations the input numbers ai would themselves
have some errors associated with them, since they would be the result of some
experiment or probably some previous computation. Hence, if we find that the bounds on \epsilon_i are much smaller than the expected error in the input a_i, we can conclude that the computation has satisfactory accuracy. This does not necessarily imply that the computed results are correct. On the other hand, if we find that the bounds on \epsilon_i are larger than the errors in the input values of a_i, then the accuracy of computation may need to be improved. In either case, we will ultimately be interested in estimating the error in the computed result.
If the equivalent perturbations \epsilon_i in the input values are small in some sense, then we can write the error

E = \bar{x} - x \approx \sum_{i=1}^{n} \epsilon_i \frac{\partial f}{\partial a_i}.    (2.84)

Since it may he more meaningful to consider the relative error, which can be
bounded by

(2.85 )

Thus, even if |\epsilon_i / a_i| are all very small, the error need not be small if some of the partial derivatives turn out to be large. If for some value of i

\left| \frac{a_i}{f} \frac{\partial f}{\partial a_i} \right| \gg 1,    (2.86)

the computed result could be completely wrong even if the bounds on \epsilon_i are all very small, and such problems are referred to as ill-conditioned. In many cases, it might not even be meaningful to attempt a solution to such problems, since the error in the input values of a_i itself may be such that the computed value may be completely useless. Only if the input values are known to an accuracy which is sufficient to ensure a reasonable accuracy for the final result will the process of computation be meaningful. In such cases, it may be necessary to improve the accuracy of computation to ensure that the bounds on \epsilon_i are smaller than the error estimate in the input values. The quantity on the left-
hand side of (2.86) is usually referred to as condition number for the given
problem. If all condition numbers are small, the problem is said to be well-
conditioned, while if one or more of these numbers are large, then the problem
becomes ill-conditioned. Hence, for a well-conditioned problem a small change
in the input data results in a small change in the final result, while for an ill-
conditioned problem, a small change in input data may cause a large change in
the result.
Table 2.3: Residuals for the system of two linear equations

   n      x             y              Residuals (RHS − LHS)

   1      0.9999        -1.0001        1.8475×10^-4     1.9494×10^-4
   2      1.9029        -1.9446        0                -1×10^-8
   3      0.0473        -0.0033        -1×10^-8         0
   4      0.0971        -0.0554        0                1×10^-8
   5      -0.9054       0.9934         -2×10^-8         0
   6      -8.5270       8.9670         -1×10^-7         0
   7      -94.2700      98.6700        -1×10^-6         0
   8      953.7000      -997.7000      1×10^-5          0
   9      -9028.0000    9445.0000      0                1×10^-4
To illustrate the application of backward error analysis and the ill-


conditioning, let us consider the solution of the following linear equations
0.9446x + 0.9029y = 0.0417,
(2.87)
0.9967x + 0.9527y = 0.0440.
The correct solution is of course x = 1, y = -1, but a computer using 24-bit
arithmetic gives x = -0.2306702, y = 0.2875082. In principle, we can verify
the correctness of this solution by substituting it back in the equations and
finding the residual, i.e., the difference between right-hand side and left-hand
side. If we do the calculations exactly, then it can be verified that the residuals
in this case are -8.286 × 10^-8 and -7.38 × 10^-8, which gives the impression
that the solution is close to the correct value. Table 2.3 gives the residuals for
several pairs of x, y values, calculated using exact arithmetic. It can be seen
that the residual is the largest for the first solution, which is actually very close
to the correct solution. For almost all entries, one of the two equations is exactly
satisfied, while the residual for the other equation is less than 10^-4. Hence, we
intuitively expect any of these values to be close to the correct solution. This
obviously cannot be true, since the solution is unique. The second and third
entries give very small residuals and on many computers the roundoff error may
force the residuals to be exactly zero, thus giving an impression that both the
equations are exactly satisfied. In fact, using a 24-bit arithmetic gives a value
of zero for both the residuals in the second case. Thus, it is very clear that the
residual is not a reliable indicator of the correctness of the solution.
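As a quick illustration (a sketch assumed here, not the book's program), the residuals of a few of the trial solutions from Table 2.3 can be computed directly; even in double precision the residuals come out tiny for solutions that are nowhere near the true one:

      ! Sketch: residuals of trial solutions of (2.87) in double precision.
      program residual_demo
        implicit none
        integer :: i
        double precision :: x(3), y(3), r1, r2
        x = (/ 0.9999d0, 1.9029d0, 0.0473d0 /)
        y = (/ -1.0001d0, -1.9446d0, -0.0033d0 /)
        do i = 1, 3
          r1 = 0.0417d0 - (0.9446d0*x(i) + 0.9029d0*y(i))
          r2 = 0.0440d0 - (0.9967d0*x(i) + 0.9527d0*y(i))
          print *, x(i), y(i), r1, r2
        end do
      end program residual_demo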
If we write the equations in a matrix form as Az = b and the calculated
solution as z_0, then for the backward error analysis we can substitute this
computed solution in the equation to get the residual Az_0 - b = δb. This
equation can be interpreted to imply that z_0 is the exact solution of the equation
Az = b + δb. In our example, δb may turn out to be zero within the accuracy of
computation. Hence, we can conclude that the solution is among the best that
can be achieved with the given accuracy. This does not imply that the solution is
correct. In this case, we can see that if the right-hand side of the second equation
is changed to 0.04400001, the exact solution will be x = 1.9029, y = -1.9446.
Consequently, only a small change in the coefficients of the equation can make a
substantial change in the solution. This is a typical example of an ill-conditioned
problem. If we interpret this problem as finding the point of intersection of two
straight lines, then it is clear that the ill-conditioning is because of the fact that
these two lines are nearly parallel.
We can formally write the solution to this problem as z = A^{-1} b, where
in this case the inverse matrix is given by

A^{-1} = \begin{pmatrix} -95270000 & 90290000 \\ 99670000 & -94460000 \end{pmatrix}.    (2.88)

It is clear that all elements of the inverse matrix are fairly large and a small
It is clear that all elements of the inverse matrix are fairly large and a small
change in the right-hand side makes a large change in the solution. In fact,
the elements of the inverse matrix can be considered as the condition numbers
for this problem. Here the backward error analysis tells us that the computed
solution is correct within the limits of computation, but in order to estimate
the error in final result, we need to know the condition numbers of the problem
which turn out to be of the order of 10^8 in this case. If we want to obtain a
meaningful solution to the given problem, then we must know the coefficients
to a better accuracy and use a more accurate arithmetic for the solution. Just
improving the accuracy of computation will not help, since if the coefficients
are not known accurately a small error in the coefficients causes a large error
in the solution. In this case, since all condition numbers are of the order of 10^8,
for any meaningful solution we need to know all coefficients in the equations
to a precision of better than 1 part in 10^8 and the computer arithmetic being
used should have the precision ℏ < 10^-8.
Estimation of the condition numbers for a problem is, in general, a difficult
proposition and recourse may have to be made to experimentation. We can
perturb the input data and find its effect on the solution, but if the number
of input parameters is large, then such a procedure may be very expensive.
Further, it is not guaranteed to give the correct results, since by some coincidence
we may find the result to change only slightly with the data, even for an
ill-conditioned problem.
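One simple experiment of this kind (a sketch with assumed details, solving the 2 × 2 system of (2.87) by Cramer's rule) is to perturb the right-hand side slightly and watch the solution change:

      ! Sketch: sensitivity of (2.87) to a tiny change in the right-hand side.
      program perturb_demo
        implicit none
        double precision, parameter :: a11 = 0.9446d0, a12 = 0.9029d0
        double precision, parameter :: a21 = 0.9967d0, a22 = 0.9527d0
        double precision :: b1, b2, det, x, y
        integer :: k
        det = a11*a22 - a12*a21          ! determinant is only about -1e-8
        do k = 0, 2
          b1 = 0.0417d0
          b2 = 0.0440d0 + 1.0d-8*k       ! perturb the second equation
          x = (b1*a22 - b2*a12)/det      ! Cramer's rule
          y = (a11*b2 - a21*b1)/det
          print *, 'perturbation =', 1.0d-8*k, '  x =', x, '  y =', y
        end do
      end program perturb_demo

A perturbation of 10^-8 in one right-hand side changes the solution by nearly unity, which is exactly the behaviour expected from condition numbers of the order of 10^8.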
The same system of equations can be used to illustrate another interesting
feature of computer arithmetic. A well-known theorem of algebra states that:
For square matrices of order n, if AX = I, where I is the identity matrix of
order n, then XA = I.
Using the same matrix A, if we take

X = \begin{pmatrix} -95260473 & 90299029 \\ 99660033 & -94469446 \end{pmatrix},    (2.89)

then a computation without roundoff shows that

AX = \begin{pmatrix} 0.9999 & 0 \\ 0 & 1.0001 \end{pmatrix}.    (2.90)

If we are using 24-bit arithmetic, then it will yield the identity matrix within
the limits imposed by roundoff error. Hence, we expect that X is close to the
unique inverse A^{-1}. However, another calculation without roundoff yields

XA = \begin{pmatrix} 17999.4085 & 17203.8566 \\ -18829.6564 & -17997.4085 \end{pmatrix},    (2.91)

which is nowhere close to the unit matrix. The true inverse matrix given in
(2.88) is not very different from the matrix X used here.
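This can be checked directly; the following sketch (assumed, not part of the text) forms both products in double precision using the matrices of (2.87) and (2.89):

      ! Sketch: AX is nearly the identity matrix while XA is not.
      program ax_xa_demo
        implicit none
        double precision :: a(2,2), x(2,2)
        ! matrices are filled column by column
        a = reshape((/ 0.9446d0, 0.9967d0, 0.9029d0, 0.9527d0 /), (/2,2/))
        x = reshape((/ -95260473.d0, 99660033.d0, &
                        90299029.d0, -94469446.d0 /), (/2,2/))
        print *, 'AX =', matmul(a, x)
        print *, 'XA =', matmul(x, a)
      end program ax_xa_demo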
EXAMPLE 2.4: Given the following data perform a least squares fit to a straight line of
the form y = mx + c
x = 0.990, 0.991, 0.992, 0.993, 0.994, 0.995, 0.996, 0.997, 0.998, 0.999
y = .0990, .0991, .0992, .0993, .0994, .0995, .0996, .0997, .0998, .0999
Repeat the calculations using arithmetic with an accuracy of three, four, five and six decimal
digits.
It is possible to simulate the calculation with d digits using a base b arithmetic on a
computer which has a higher accuracy, i.e., lower ℏ. This can be achieved by rounding off
the numbers to d significant digits after every arithmetic operation. The function ROUND in
Appendix B can be used for this purpose. The results for this example have been calculated
by using such a simulation.
Using the well-known formula for the linear least squares fit to a straight line (see {10.2})
we get the slope and intercept as

m = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad c = \bar{y} - m\bar{x},    (2.92)

where

\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad \text{and} \qquad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i,    (2.93)

are the average values of x and y. Using three decimal digit accuracy, we obtain

\bar{x} = 0.995, \quad \bar{y} = 0.0995, \quad \sum_{i=1}^{n} x_i y_i = 0.989, \quad \sum_{i=1}^{n}(x_i - \bar{x})^2 = 0.000085,
m = -11.8, \quad c = 11.8,    (2.94)
which is obviously wrong. The correct result is of course m = 0.1 and c = 0. The large
error in evaluating m comes in because the two terms in the numerator are nearly equal,
which introduces a large error when the subtraction is carried out. Using four decimal digit
accuracy we obtain the result m = 0 and c = 0.09945, which is also completely wrong. Using
five decimal digit accuracy, we get a reasonable result of m = 0.12121 and c = -0.02109,
while six decimal digit arithmetic gives m = 0.109090 and c = -0.00904000. The reason
for ill-conditioning in this case will be clear if we actually try to plot the points.
It turns out that all the points cluster in a small region and hence we are trying to find
the global equation of the straight line using information in a very small region. This is
possible only if the values are known very accurately. It can be seen that the correct value of
the numerator in (2.92) is 0.00000825, while n x̄ ȳ ≈ 0.989. Hence, at least six decimal digit
accuracy is required to get any reasonable result.
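The simulation of low-precision decimal arithmetic can be sketched as follows (an assumed stand-in for the ROUND function of Appendix B, which is not reproduced here): every intermediate result is rounded to d significant decimal digits before it is used again.

      ! Sketch: least squares slope with simulated d-digit decimal arithmetic.
      program lsq_digits
        implicit none
        integer :: i
        integer, parameter :: d = 3        ! number of decimal digits retained
        double precision :: x(10), y(10), xb, yb, sxy, sxx, m, c
        do i = 1, 10
          x(i) = rnd(0.989d0 + 0.001d0*i)  ! 0.990, 0.991, ..., 0.999
          y(i) = rnd(0.1d0*x(i))           ! 0.0990, 0.0991, ..., 0.0999
        end do
        xb = 0.d0;  yb = 0.d0;  sxy = 0.d0;  sxx = 0.d0
        do i = 1, 10
          xb = rnd(xb + x(i))
          yb = rnd(yb + y(i))
          sxy = rnd(sxy + rnd(x(i)*y(i)))
        end do
        xb = rnd(xb/10.d0)
        yb = rnd(yb/10.d0)
        do i = 1, 10
          sxx = rnd(sxx + rnd(rnd(x(i) - xb)**2))
        end do
        m = rnd(rnd(sxy - rnd(10.d0*rnd(xb*yb)))/sxx)
        c = rnd(yb - rnd(m*xb))
        print *, 'd =', d, '  m =', m, '  c =', c
      contains
        double precision function rnd(v)   ! round v to d significant digits
          double precision, intent(in) :: v
          double precision :: s
          if (v == 0.d0) then
            rnd = 0.d0
          else
            s = 10.d0**(d - 1 - floor(log10(abs(v))))
            rnd = anint(v*s)/s
          end if
        end function rnd
      end program lsq_digits

With d = 3 the computed slope is completely wrong, in line with the values quoted above; increasing d progressively improves the result.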
The same data can be used to illustrate another interesting phenomenon. It is well-known
that the square of the standard deviation can be written as

\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2.    (2.95)

Using three decimal digit accuracy, the first expression yields -0.001 while the second expression
gives 0.85 × 10^-5, which is close to the correct value of 0.825 × 10^-5. Thus, we
can see that two expressions which are algebraically equivalent do not give the same result
when evaluated numerically. It should be noted that the first expression gives an imaginary
value for the standard deviation, and the program will return the value as NaN. Using four-digit
accuracy, the first expression gives a value of zero for σ², while the second expression
gives the exact value. Using five decimal digit accuracy the first expression gives a value of
1.0 × 10^-5, which is closer to the true value. With six-digit accuracy the first expression yields
0.90 × 10^-5. The reason for the large error in this expression is the same as the one noted above
for calculating the slope. This example clearly illustrates that by rewriting the expression
appropriately, it may be possible to reduce the roundoff error significantly. The important point
to note from these examples is that one should avoid subtraction of two nearly equal numbers
as far as possible.
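A small single-precision variant of the same comparison (an assumed example; the book's simulation uses decimal rounding instead) shows the effect of the cancellation in the first form of (2.95):

      ! Sketch: the two algebraically equivalent forms of the variance.
      program variance_demo
        implicit none
        integer :: i
        real :: x(10), xb, s1, s2
        do i = 1, 10
          x(i) = 0.989 + 0.001*i           ! data clustered near 1
        end do
        xb = sum(x)/10.0
        s1 = sum(x**2)/10.0 - xb**2        ! one-pass form: severe cancellation
        s2 = sum((x - xb)**2)/10.0         ! two-pass form
        print *, 's1 =', s1, '   s2 =', s2, '   exact value = 8.25e-6'
      end program variance_demo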
EXAMPLE 2.5: Evaluate the function defined by

I_0(x) = e^x - 1, \qquad x > 0,    (2.96)

at x = 1, by using the recurrence relation

I_{n+1}(x) = \frac{(n+1)}{x} I_n(x) - 1, \qquad (n = 0, 1, 2, \ldots).    (2.97)

This recurrence relation can be easily derived using integration by parts. If we take
x = 1, then it is clear from the integral representation that I_n(1) is a monotonically decreasing function of n

I_0(1) > I_1(1) > I_2(1) > \cdots > 0.    (2.98)

Hence, using I_0(1) = e - 1 it is quite straightforward to calculate I_n(1) for n = 1, 2, 3, ...,
using the recurrence relation. The results using arithmetic with 24, 48 and 96 bits are shown
in Table 2.4, which also displays the exact values. It is quite clear that the error increases with n

Table 2.4: Evaluating I_n(1) using the recurrence relation in forward direction

  n      24-bit              48-bit              96-bit              Exact value

  1   7.1828190×10^-1    7.1828183×10^-1    7.1828183×10^-1    7.1828183×10^-1
  2   4.3656370×10^-1    4.3656366×10^-1    4.3656366×10^-1    4.3656366×10^-1
  3   3.0969120×10^-1    3.0969097×10^-1    3.0969097×10^-1    3.0969097×10^-1
  4   2.3876480×10^-1    2.3876388×10^-1    2.3876388×10^-1    2.3876388×10^-1
  5   1.9382380×10^-1    1.9381942×10^-1    1.9381942×10^-1    1.9381942×10^-1
  6   1.6294290×10^-1    1.6291649×10^-1    1.6291649×10^-1    1.6291649×10^-1
  7   1.4060020×10^-1    1.4041543×10^-1    1.4041543×10^-1    1.4041543×10^-1
  8   1.2480160×10^-1    1.2332347×10^-1    1.2332347×10^-1    1.2332347×10^-1
  9   1.2321470×10^-1    1.0991122×10^-1    1.0991122×10^-1    1.0991122×10^-1
 10   2.3214720×10^-1    9.9112168×10^-2    9.9112183×10^-2    9.9112183×10^-2
 12   1.7643430×10^01    8.2806218×10^-2    8.2808202×10^-2    8.2808202×10^-2
 14   3.1961050×10^03    7.0731765×10^-2    7.1092802×10^-2    7.1092802×10^-2
 15   4.7940570×10^04    6.0976473×10^-2    6.6392033×10^-2    6.6392033×10^-2
 16   7.6704810×10^05   -2.4376433×10^-2    6.2272531×10^-2    6.2272531×10^-2
 17   1.3039820×10^07   -1.4143994×10^00    5.8633024×10^-2    5.8633024×10^-2
 20   8.9192340×10^10   -1.0075492×10^04    4.9881743×10^-2    4.9881743×10^-2
 25   5.6865470×10^17   -6.4237622×10^10    3.9818695×10^-2    3.9938730×10^-2
 26   1.4785020×10^19   -1.6701782×10^12    3.5286072×10^-2    3.8406972×10^-2
 27   3.9919560×10^20   -4.5094811×10^13   -4.7276047×10^-2    3.6988231×10^-2
 28   1.1177480×10^22   -1.2626547×10^15   -2.3237293×10^00    3.5670457×10^-2
 30   9.7244050×10^24   -1.0985096×10^18   -2.0526445×10^03    3.3297601×10^-2

Table 2.5: Evaluating I_n(1) using the recurrence relation in backward direction

  n   I_6(1) = 0        I_11(1) = 0       I_21(1) = 0       I_31(1) = 0       Exact value

  0   1.718055×10^00    1.718282×10^00    1.718282×10^00    1.718282×10^00    1.718282×10^00
  1   7.180555×10^-1    7.182818×10^-1    7.182819×10^-1    7.182819×10^-1    7.182818×10^-1
  2   4.361111×10^-1    4.365636×10^-1    4.365637×10^-1    4.365637×10^-1    4.365637×10^-1
  3   3.083333×10^-1    3.096910×10^-1    3.096910×10^-1    3.096910×10^-1    3.096910×10^-1
  4   2.333333×10^-1    2.387638×10^-1    2.387639×10^-1    2.387639×10^-1    2.387639×10^-1
  5   1.666667×10^-1    1.938191×10^-1    1.938194×10^-1    1.938194×10^-1    1.938194×10^-1
  6   0.000000×10^00    1.629149×10^-1    1.629165×10^-1    1.629165×10^-1    1.629165×10^-1
  7                     1.404040×10^-1    1.404154×10^-1    1.404154×10^-1    1.404154×10^-1
  8                     1.232323×10^-1    1.233235×10^-1    1.233235×10^-1    1.233235×10^-1
  9                     1.090909×10^-1    1.099112×10^-1    1.099112×10^-1    1.099112×10^-1
 10                     9.090909×10^-2    9.911218×10^-2    9.911218×10^-2    9.911218×10^-2
 15                                       6.639203×10^-2    6.639203×10^-2    6.639203×10^-2
 16                                       6.227251×10^-2    6.227253×10^-2    6.227253×10^-2
 17                                       5.863269×10^-2    5.863302×10^-2    5.863302×10^-2
 18                                       5.538847×10^-2    5.539443×10^-2    5.539443×10^-2
 19                                       5.238095×10^-2    5.249409×10^-2    5.249409×10^-2
 20                                       4.761905×10^-2    4.988174×10^-2    4.988174×10^-2
 26                                                         3.840697×10^-2    3.840697×10^-2
 27                                                         3.698819×10^-2    3.698823×10^-2
 28                                                         3.566926×10^-2    3.567046×10^-2
 29                                                         3.440860×10^-2    3.444325×10^-2
 30                                                         3.225806×10^-2    3.329760×10^-2

very rapidly and for 24-bit arithmetic I_9(1) < I_10(1), after which the values increase almost
exponentially, which is obviously incorrect. The use of 48 or 96-bit arithmetic only delays
this trouble.
To understand the cause of this trouble, let us try to analyse the recurrence relation.
If ε_n denotes the error in I_n(1), then assuming that there is no roundoff error in evaluating
the recurrence relation, the error ε_{n+1} is given by

\epsilon_{n+1} \approx (n+1)\epsilon_n \approx (n+1)!\,\epsilon_0 \approx (n+1)!\, I_0 \hbar.    (2.99)

Here we have assumed that the error is entirely due to the rounding of the initial value I_0 in
the calculations. Obviously, there are other sources of roundoff error, but our analysis shows (a
posteriori) that they are not very important. For the 24-bit arithmetic, the error estimate
for n = 10 turns out to be 0.36 while I_10(1) ≈ 0.1. Hence, we cannot expect to get any
reasonable value for I_10(1). Similarly, with 48-bit arithmetic for n = 16, the error estimate
yields 0.13 while I_16(1) ≈ 0.06, and it can be seen from Table 2.4 that I_n(1) is negative for
n ≥ 16. Even with 96-bit arithmetic the computed values turn out to be negative for n ≥
27, which again agrees with our analysis, since the expected error in that case is 0.24 as
compared to the exact value of 0.03..., for I_27(1). Since the factorial function increases
rather rapidly, it is difficult to use the recurrence relation in this form to calculate I_n(1)
for large values of n. It can be easily estimated that to evaluate I_100(1) by this process
will require a floating-point representation with more than 500 bits in the fraction part. Such
unbounded growth of error in a numerical calculation is referred to as numerical instability
and the corresponding algorithm is said to be numerically unstable. This instability is quite
similar to hydrodynamic instabilities, where again a small perturbation grows exponentially
with time. Many algorithms for numerical computations are known to be unstable and we will
consider several such examples in subsequent chapters. Here we only want to warn against the
use of such algorithms by the brute force method of increasing the accuracy of computation.
In most cases, better alternatives are known to exist and should be used.
In this example it is quite obvious that if we reverse the recurrence to write

I_n(1) = \frac{I_{n+1}(1) + 1}{n + 1},    (2.100)

then the error at each step will be divided by n + 1, and we can have a stable algorithm. The
problem now is to estimate the starting value of I_n(1) to initiate the solution. We can use
a starting value of zero for a sufficiently large value of n. However, since the value of the integral
tends to zero as 1/n, at first sight it appears that a very large value of n will be required
to start this process. Since the error at each stage is divided by a factor of n + 1, we can
expect to get a reasonable accuracy for low values of n. Table 2.5 displays the results of such
a calculation using 24-bit arithmetic. It can be seen that the error decreases very rapidly
and we get a very accurate value of the integral. In fact, the "exact values" in these tables
have been calculated by using the recurrence relation in the backward direction, starting with a
much larger value of n. Even if we start with I_6(1) = 0 the value of I_1(1) turns out to be
accurate to four significant figures, while starting with I_11(1) = 0 yields a value which agrees
with the exact value to seven significant figures.
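A compact sketch of both directions of the recurrence (assuming the form I_{n+1}(1) = (n+1) I_n(1) - 1 used above, with the crude starting value I_31(1) = 0 for the backward direction) is:

      ! Sketch: forward versus backward evaluation of I_n(1) in single precision.
      program recur_demo
        implicit none
        integer :: n
        real :: f(0:30), b(0:31)
        f(0) = exp(1.0) - 1.0              ! I_0(1) = e - 1
        do n = 0, 29
          f(n+1) = (n+1)*f(n) - 1.0        ! forward: error multiplied by n+1
        end do
        b(31) = 0.0                        ! crude starting value
        do n = 30, 0, -1
          b(n) = (b(n+1) + 1.0)/(n+1)      ! backward: error divided by n+1
        end do
        do n = 0, 30, 5
          print *, n, f(n), b(n)
        end do
      end program recur_demo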

This example clearly illustrates two facts: first some very simple looking
calculation could produce disastrous results; and second that use of higher pre-
cision is not the answer to such problems. It is quite clear from this example
that roundoff errors are not necessarily restricted to very lengthy calculations
involving millions of operations, but they could be crucial even in short calcula-
tions. In many cases, the roundoff error can be controlled by selecting the right
algorithm for the solution. It should be noted that the numerical instability is
not related to ill-conditioning, for example, the above problem is well posed
and the integral can be easily determined to arbitrary accuracy. The instability
here occurs because of the choice of improper algorithm. This should be distin-
guished from the case (e.g., Example 2.4), where the problem itself is unstable
or ill-conditioned. For an ill-conditioned problem all algorithms will display
some instability, unless adequate precision is used in the calculations. Several
examples of both these situations will be considered in subsequent chapters.
Recurrence relations of the form (2.97) are quite common in various nu-
merical calculations and we consider a more detailed analysis of their stability.
For stability it is essential that the error introduced at any stage should grow
at a rate which is slower than (or at most equal to) the rate at which the solu-
tion is growing. For example, if the error grows as n!, but if the solution also
increases as n!, then the relative error will still remain reasonably bounded. Let
us consider the following recurrence relation for the Laguerre polynomials

L_{n+1}(x) = (2n + 1 - x) L_n(x) - n^2 L_{n-1}(x).    (2.101)

It is quite clear that the error increases roughly as n!, but that is not very dam-
aging, since the value of Ln(x) also increases at that rate. In fact this recurrence
relation is numerically stable. A more detailed analysis of the stability of such
recurrence relations will be considered in Section 12.2. In general, analysing
the stability of a recurrence relation is somewhat tricky. A rough idea can be
obtained if we consider it as a difference equation with constant coefficients
which of course, is not the case. If we ignore the variation of coefficients with
n, we can write the general solution of this recurrence relation as

L_n(x) \approx a_1 \alpha_1^n + a_2 \alpha_2^n,    (2.102)

where a_1 and a_2 are some arbitrary constants and α_1 and α_2 are the roots of
the quadratic

\alpha^2 - (2n + 1 - x)\alpha + n^2 = 0.    (2.103)

The values of a_1 and a_2 depend on the starting values. Now if both roots of
(2.103) have the same absolute magnitude, then the recurrence relation will be
stable, since all solutions grow at the same rate. On the other hand, if the two
roots have different absolute magnitudes, then the recurrence relation will be
stable only if the actual solution has a nonzero contribution from the dominant
solution. For example, if |α_1| > |α_2|, but the true solution corresponds to
a_1 = 0, then the recurrence relation is unstable, since the true solution will be
of the order of α_2^n, while the error grows as α_1^n.
If we consider the recurrence relation (2.97), which is a first-order differ-
ence equation, the general solution can be written as a linear combination of a
particular solution and an arbitrary multiple of the solution of the homogeneous
part. In this case, it turns out that the desired solution is the particular solu-
tion, while the solution to the homogeneous part is O(n!), and the recurrence
relation turns out to be unstable. Most of the standard references on math-
ematical functions (e.g., Abramowitz and Stegun, 1974; Oldham et al. 2009)
also discuss the stability of the recurrence relations which are given there.
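One rough way to check the claimed stability of (2.101) (a sketch under the assumption that the recurrence is started from L_0(x) = 1 and L_1(x) = 1 - x, consistent with the n!-scaled normalisation implied by the recurrence) is to run it in single and double precision and compare the results:

      ! Sketch: recurrence (2.101) run in single and double precision.
      program laguerre_demo
        implicit none
        integer :: n
        real :: xs, s0, s1, s2
        double precision :: xd, t0, t1, t2
        xs = 0.5;    s0 = 1.0;    s1 = 1.0 - xs
        xd = 0.5d0;  t0 = 1.d0;   t1 = 1.d0 - xd
        do n = 1, 19
          s2 = (2*n + 1 - xs)*s1 - real(n)**2*s0
          t2 = (2*n + 1 - xd)*t1 - dble(n)**2*t0
          s0 = s1;  s1 = s2
          t0 = t1;  t1 = t2
        end do
        print *, 'n = 20:', s1, t1, '  relative difference =', &
                 abs(s1 - t1)/abs(t1)
      end program laguerre_demo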

Bibliography
Abramowitz, M. and Stegun, I. A. (1974): Handbook of Mathematical Functions, With For-
mulas, Graphs, and Mathematical Tables, Dover, New York.
Acton, F. S. (1990): Numerical Methods That Work, Mathematical Association of America.
Acton, F. S. (1995): Real Computing Made Real, Princeton University Press, Princeton, New
Jersey.
Alefeld, G. and Herzberger, J. (1983): Introduction to Interval Analysis, Academic Press,
New York.
Conte, S. D. and De Boor, C. (1972): Elementary Numerical Analysis, an Algorithmic Ap-
proach, McGraw-Hill, New York.
Dahlquist, G. and Björck, A. (2003): Numerical Methods, Dover, New York.
Forsythe, G. E. (1970): Pitfalls in Computation, or Why a Math Book isn't Enough, Amer.
Math. Monthly 77, 931.
Goldberg, D. (1991): What Every Computer Scientist Should Know About Floating-Point
Arithmetic, Computing Surveys, 23, 5.
Hamming, R. W. (1987): Numerical Methods for Scientists and Engineers, (2nd ed.), Dover,
New York.
Knuth, D. E. (1997): The Art of Computer Programming, Vol. 2, Seminumerical Algorithms,
(3rd ed.), Addison-Wesley, Reading, Massachusetts.
Kulisch, U. W. and Miranker, W. L. (1986): The Arithmetic of the Digital Computer: a New
Approach, SIAM Rev., 28, 1.
Miranker, W. L. and Toupin, R. A. (eds.) (1987): Accurate Scientific Computations, Lecture
Notes in Computer Science, 235, Springer-Verlag, Berlin.
Oldham, K. B., Myland, J. C. and Spanier, J. (2009): An Atlas of Functions: with Equator,
the Atlas Function Calculator, (2nd ed.), Springer, Berlin.
Penney, W. (1965): A "Binary" System for Complex Numbers, JACM, 12, 247.
Ralston, A. and Rabinowitz, P. (2001): A First Course in Numerical Analysis, (2nd Ed.)
Dover.
Rice, J. R. (1992): Numerical Methods, Software, and Analysis, (2nd Ed.), Academic Press,
New York.
Rutishauser, H. (1990): Lectures on Numerical Mathematics, Birkhäuser, Boston.
Stoer, J. and Bulirsch, R. (2010): Introduction to Numerical Analysis, (3rd Ed.), Springer-
Verlag, New York.
Ueberhuber, C. W. (1997): Numerical Computation 2: Methods, Software, and Analysis,
Springer Verlag, Berlin.
Wells, D. C., Greisen, E. W. and Harten, R. H. (1981): FITS: A Flexible Image Transport
System, Astronomy & Astrophysics Supplement Series, 44, 363.
Wilkinson, J. H. (1994): Rounding Errors in Algebraic Processes, Dover, New York.

Exercises
1. Multiply the following numbers using the arithmetic in the number system in which they
are represented and verify the results by converting the numbers to decimal system:
(i) 110100.1012 x 1011.0112 (v) IlOI1.1h x I01.oh (balanced ternary)
(ii) 267.658 x 34.238 (vi) 10100.101-2 x 1011.011_2
(iii) BAD16 x DEED 16 (vii) 13201.232i x 231.122i
(iv) 165.38_10 x 13.65-10 (viii) 1100i_l x 1101i-l

2. (a) Express 57005 in the Hexadecimal number system.


(b) Express the following numbers in (i) binary (ii) negative binary (b = -2) (iii) negative
decimal (b = -10) and (iv) balanced ternary, number systems: 27.35, -27.35 and 13 ~.
(c) Express -29.625 + 23.375i in quater-imaginary number system (b = 2i).
(d) Express (6 + i) in the "binary" complex number system (b = i-I).
3. In the decimal number system there are some numbers with two infinite decimal expan-
sions (e.g., 3.56999 ... = 3.57000 ... ). Does the negative decimal (b = -10) system have
unique expansion for every number?
4. Write subroutines to add and subtract two fixed-point numbers of n digits using an
integral base b > 1. Also write subroutines to divide and multiply such numbers by a
single-digit number in the same base. Using appropriate value of b, calculate the value of
π correct to 1000 decimal places by employing the series expansion

π = 16 tan⁻¹(1/5) - 4 tan⁻¹(1/239),

where tan⁻¹ x can be calculated by the Maclaurin series

\tan^{-1} x = \sum_{i=0}^{\infty} \frac{(-1)^i x^{2i+1}}{2i + 1}.
5. Give an algorithm and write the corresponding computer program to perform addition
and subtraction in a number system with base -10 or 2i or 3 (balanced ternary) or (i - 1).
6. Give an algorithm and write the corresponding computer program to convert numbers
from the decimal representation to a number system with base -10 or 2i or 3 (balanced
ternary) or (i - 1).
7. Can every real number (positive, negative, or zero) be expressed in a balanced-decimal
system, i.e., in the form (2.2), for some integer n and some sequence of digits di, where
each d_i is one of the ten numbers {-4½, -3½, -2½, -1½, -½, ½, 1½, 2½, 3½, 4½}? Note
that zero is not one of the allowed digits, but we implicitly assume that d_{n+1}, d_{n+2}, ...
are zero. Find representations of zero and one in this system.
8. A spaceship sent to search for extraterrestrial intelligence has sent the following three pictures
from three different planets, which are believed to represent addition or subtraction.
Which number system is being used on each of these three planets?

[Pictures (i), (ii) and (iii) are not reproduced here.]

9. Suppose we have an 8-bit machine which represents integers using two's complement.
Try the following operations manually in such a representation:
(i) 21 + 37 = 58          (iv) 21 + (-37) = -16     (vii) -63 + (-83) = -146
(ii) 69 + 79 = 148        (v) 37 + (-21) = 16       (viii) 47 + 81 = 128
(iii) -21 + (-37) = -58   (vi) -58 + (-70) = -128   (ix) 47 + (-47) = 0
Repeat the above exercise using one's complement for representing the numbers.
10. Express the following numbers in the IEEE-754 floating-point representation using 32
bits: 3.14159265, 6.02486 × 10^23, 1.05443 × 10^-27.
11. Consider a machine with a word length of 60 bits, which represents floating-point numbers
using 48 bits for fraction part., 11 bits for exponent and one bit for the sign. The exponent
is represented in excess (1717)8 form. What is the largest and smallest (nonzero) positive
number that can be represented in this computer, if the numbers are all assumed to be
normalised? If unnormalised numbers are permitted, what will be the range of numbers
that can be represented? What is the smallest number x such that 0.314 + x > 0.314, in
this machine?
12. (a) Try to find out the representation used in your computer by giving it a few sample
numbers and printing out the result in octal or hexadecimal format, which can be done by
using the format specification "On" or "Zn" in Fortran. Find out the number of bits used
to store the fraction and the exponent parts. Also find the exponent offset and whether
the first bit of fraction is explicitly stored? Estimate the value of fi for the computer. Find
a few sets of four positive numbers a, b, c and d, such that lab - cdl < nab. Try to compute
this expression on the computer. What can you say about the computer arithmetic from
the results?
(b) Compute 3^n for n = 2, 3, ..., 100, using integer arithmetic on your computer and
explain the results.
13. Consider a floating-point representation with 24-bit fraction and 8-bit excess 126 expo-
nent. Which of the one million numbers 1100000.0, 1100000.1, ... , 1199999.9 have the
same representation in this format? What if the fraction has 23 or 25 bits? (Hint: Do not
worry about the exponent, since that will not affect the answer.)
14. Let all the numbers in the following calculations be correctly rounded to the number of
digits shown
(i) 1.1034 + 0.937     (iv) 9.537/0.039     (vii) 3.48^0.532
(ii) 2.367 - 1.75      (v) 1/0.9675         (viii) tan(1.57)
(iii) 2.343 × 7.37     (vi) 1/0.0023        (ix) e^3.14
For each of these calculations, determine the smallest interval, rounded to 5 decimal
digits, in which the result using true instead of rounded values of the quantities must lie.
In (v) and (vi) assume that 1 is the exact value.
15. (a) Consider the following expression
10^48 + 314 - 10^48 - 10^32 + 567 + 10^32.
The correct value of this expression is 881, but most computers will not calculate it
correctly. Assuming that there is no overflow, what is the value of this expression as
calculated by computers using 48, 96, 100, 104 and 108-bit fraction part, respectively?
What is the minimum size of fraction part needed to give the correct result?
(b) Evaluate the polynomial
p(x) = 20000x^4 - 74641x^3 + 9282x^2 + 223923x - 207846,
at x = 1.732051 and compare the result with the exact value (-3.57230142698 × 10^-9).
Estimate the minimum number of bits required in the fraction part of a computer using
binary floating-point arithmetic to give a reasonable value for the result.
(c) Evaluate the scalar product of the following vectors
a = (3.141593, 2.718282, 0.5772157, 1.414214 × 10^-6),
b = (5.194941, -1.139702 × 10^-6, -28.27433, -4.016619 × 10^-7).


Compare the result with the exact value of the scalar product

\sum_{i=1}^{4} a_i b_i = 1.177534 \times 10^{-19}.

What is the minimum size of the fraction part required to give a reasonable estimate of
the correct value?
16. (a) Mathematics text books claim that a x x = a and a/x = a, if and only if x = 1, or
a = O. Is this statement true with floating-point arithmetic? (Assume that there is no
exponent overflow or underflow.)
(b) Are the following identities valid in floating-point arithmetic? (Assume that there is
no exponent overflow or underflow and x ≠ 0.)
(i) 0 - (0 - x) = x,    (ii) 1/(1/x) = x.

17. Using three decimal digit floating-point arithmetic, find examples where the associative
law for multiplication (2.32) is violated. Using (2.27) find a bound on δ such that

\frac{a \otimes (b \otimes c)}{(a \otimes b) \otimes c} = 1 + \delta,

for all nonzero floating-point numbers which do not cause any exponent overflow or
underflow.
18. (a) What is the exact upper bound for |Δx/x| in (2.24)?
(b) What is the exact condition for a ⊕ x = a in floating-point arithmetic?
19. To illustrate that replacing an underflow by zero is not always right, consider the identity

\frac{axy + b}{cxy + d} = \frac{ax + b/y}{cx + d/y}.

Using the three decimal digit arithmetic with an exponent range of -9 to +9, find
examples where this identity is violated because of underflow being set to zero. Use
positive values for all numbers to reduce the effects of roundoff error, and compare the
computed values of both sides with the exact value.
20. Estimate the error in evaluating the following expressions, where the input value of x
itself has a relative error of ε:
(i) x(1 - x),              x = 0.01, 0.1, 0.51, 0.6, 0.9, 0.99
(ii) \sqrt{1 + x^2},         x = 1, 100, 10000
(iii) \sqrt{x^2 + 100} - x,  x = 1, 100, 10000

21. (a) Compute the sum (x + x + ... + x), where x = 1/3 using three decimal digit floating-
point arithmetic. What is the calculated value of the sum if the number of terms are 4,
30, 50, 300, 400, and 1000.
(b) Consider the following statements of a Fortran program


      S=127.96875
      DO 100 I=1,N
  100 S=S+1.E-5
PRINT *,S
Assuming that the computer uses a 24-bit arithmetic what value will be printed out if N
= 10,100, 1000, 10000, and 100000, respectively.
22. Using the well-known formula from integral calculus

\int_a^b \frac{dx}{x^p} = \frac{b^{1-p} - a^{1-p}}{1 - p},

calculate the value of the integral for
(i) a = 1, b = 2, p = 1 - 10^{-n}, (n = 1, 2, 3, ..., 20),
(ii) a = 1, p = 0.5, b = 1 + 10^{-n}, (n = 1, 2, 3, ..., 20).
Also evaluate the integral using the Simpson's rule

\int_a^b f(x)\,dx \approx \frac{b-a}{6}\left( f(a) + 4 f\!\left(\frac{a+b}{2}\right) + f(b) \right).

Compare the two computed values with exact values and explain the result.
23. Obtain the following limiting values by using computer arithmetic:

(i) \lim_{x\to 0} \frac{\sin x}{x}
(ii) \lim_{x\to 0} x \ln x
(iii) \lim_{x\to\infty} \frac{\ln x}{x^{0.01}}
(iv) \lim_{x\to 0} \frac{e^x - 1}{x}
(v) \lim_{x\to 0} \frac{\sin x - \sinh x}{x^3}
(vi) \lim_{x\to 0} \frac{\sin x + \sinh x - 2x}{x^5}
(vii) \lim_{x\to 1} \frac{\ln x}{1 - x}
(viii) \lim_{m\to\infty} \left( \left(1 + \frac{x}{m}\right)^m - e^x \right)
(ix) \lim_{m\to\infty} \left( \sum_{i=1}^{m} \frac{1}{i} - \ln m \right)
(x) \lim_{x\to 0} x^{0.01} \ln x
(xi) \lim_{x\to\infty} \left( x\sqrt{x^2 + 1} - x^2 \right)
(xii) \lim_{x\to\infty} \frac{\cosh x}{\sinh x}

In each case, try to approach as close as possible to the limit.


24. (a) Evaluate the following expressions for x = 2^{-n}, (n = 2, 3, 4, ..., 40), using single
precision arithmetic on any computer and estimate the accuracy:

(i) \frac{1 - \cos x}{\sin^2 x}      (iii) 2 - \sin x - \cos x - e^{-x}
(ii) \sin(100\pi + x) - \sin x       (iv) 2x - \sin x - \sinh x

Try to rewrite these expressions to improve the accuracy.
(b) Repeat the above exercise for x = 2^n, (n = 2, 3, 4, ..., 40), using the following expressions

(i) x - \sqrt{1 + x^2}    (ii) e^{-1/x} - \frac{x}{1+x}    (iii) \sin\left(\frac{100\pi x^5}{1 + x^5}\right)
25. Considering suitable examples for floating-point addition using four decimal digits, show
the following:
(a) At least a (2t + 1)-digit accumulator is required to get the correctly rounded result for
addition of t-digit numbers, if the arithmetic is implemented in a straightforward manner
described in the text.
(b) A (t+2)-digit accumulator is enough to get the correctly rounded result for addition,
provided the fraction part h of the number which is smaller in magnitude is truncated
to t + 2 digits after appropriate right shift using the following relation
if the two numbers have same sign;
if the two numbers have opposite sign.
Write a program to perform floating point addition or subtraction of two numbers using
t digits with integer base b. Represent the numbers using an integer array of length t + 2
with one element to store the sign and another to store the exponent. The program must
give the correctly rounded result in all cases. (c) Even with the transformation of part(b)
it is not possible to get a correctly rounded result using only t + 1 digits.
26. Consider floating-point arithmetic of t-digit numbers using a t-digit accumulator. Assume
that when the sum of the two fraction parts overflows, the overflow digit is retained and can
be shifted right.
(a) Show that if no overflow occurs when the two fraction parts (one of which could have
been shifted) are added, then (if |y| ≤ |x|)

x ⊕ y = x(1 + ε_x) + y,

where |ε_x| < ℏ.
(b) Show that if overflow occurs, the addition may involve two roundoff errors bounded by,
respectively, ½b^{e-1-t} and ½b^{e-t}, where e is the exponent of the computed sum. Hence,
show that in this case

x ⊕ y = (x + y)(1 + ε_{xy}),

where |ε_{xy}| < (1 + 1/b)ℏ.
(c) Combining the results of the previous two parts, show that for such an arithmetic

x ⊕ y = x(1 + ε_x) + y(1 + ε_y),

where |ε_x| < (1 + 1/b)ℏ and |ε_y| < (1 + 1/b)ℏ. Also show that, in general, it is not possible
to find any bound of the type obtained in part (b) on the relative error in the computed sum.
27. An alternative representation for real numbers is the so-called level-index (li) representation.
Given any positive number x, the li-image X is obtained by taking the natural logarithm
as many times as necessary (say ℓ times) to bring the result (say f) into the interval [0,1). Thus
the level ℓ represents how many times the logarithm has been taken, while the index f
serves the purpose of the fraction part in the floating-point representation. If the number |x| < 1,
then the same procedure can be applied to 1/|x| and the level can be considered to be
negative. Consider a decimal level-index system, with one digit (plus sign) for the level and
four digits for the index, and represent the following numbers: (i) 3.14159, (ii) 6.025 × 10^23,
(iii) 10^78, (iv) 10^1000, (v) 10^1000000, (vi) 10^(10^10), (vii) 1.6747 × 10^-24. What is the largest
number that can be represented in this system? Try the following arithmetic operations
in this number system

(i) 10^1000 + 10^1000 = 2 × 10^1000,

28. Consider the level-index representation defined in the previous exercise for a 32-bit com-
puter word. Assume that four bits are reserved for level (including one for its sign) and
the remaining 28 bits are for the index which also includes one bit for the sign of the
number. What is the relative roundoff error in representing the numbers in the previous
exercise on such machine? Compare this error with that in the standard floating-point
representation using the same number of bits. What is the largest number that can be
represented in this form?
29. Find at least 10 examples in which the inequality
(a + b)^{10} < 2 (a^{10} + 45a^8b^2 + 210a^6b^4 + 210a^4b^6 + 45a^2b^8 + b^{10}),
is violated for a and b differing by at least 1%, using single precision arithmetic on any
computer. Also try to estimate the maximum relative difference between a and b. for
which the inequality is violated.
30. Consider the problem of finding the scalar product of two vectors
S_n = \sum_{i=1}^{n} x_i y_i.

Find bounds on the roundoff error in such calculations similar to those in Section 2.3.
31. In Examples 2.2 and 2.3, it was found that the actual error is much less than the computed
bounds. This is probably true in most cases, but there are exceptions, where the error is
close to the bounds given by (2.55). Find at least one such example.
32. Assuming that n is even, rewrite the sum in Example 2.2 as

s_n = \sum_{i=1}^{n/2} \frac{1}{2i(2i-1)}.

Estimate the bounds on roundoff error similar to those in Example 2.2, for this series
and compare them with actual error.
33. (a) Estimate the value of eX using the Maclaurin series, for x = ±0.1, ±1.0, ±5.00, ±20
and ±50 to as much accuracy as possible, using single precision arithmetic and estimate
the error. Compare this value with the actual result obtained by using the system routine.
How can you improve the results?
(b) Estimate the value of sin x using the Maclaurin series, for x = 3.14, 3.1416, 3.141593.
Estimate the error and compare it with the actual value.
34. Consider the sequence of numbers x_i = a_0 + 0.1 r_i, (i = 1, ..., N), where r_i is a random
number in the interval (-0.5, 0.5). Find the average of these numbers using the
expressions

(i) a_1 = \frac{\sum_{i=1}^{N} x_i}{N},    (ii) a_2 = a_0 + \frac{\sum_{i=1}^{N} (x_i - a_0)}{N}.

Estimate the roundoff error in each case. Use a_0 = 1.345 and N = 10^6 and compare the
results with the accurate value, computed using a double precision variable to accumulate
the sum.
35. Consider the following algorithm for summation of n terms a_i, (i = 1, ..., n):
t_1 = a_1,   t_k = t_{k-1} + a_k,   η_k = t_k - t_{k-1},   ε_k = η_k - a_k,   (k = 2, 3, ..., n),
r_1 = 0,   r_k = r_{k-1} - ε_k,   s = t_n + r_n.
The basic idea behind this algorithm is as follows. When a_k is added to t_{k-1}, only its
high order accuracy part (η_k) affects the sum. The last line accumulates the low accuracy
part (-ε_k) in r_n, which is then added to t_n to get the final sum; a code sketch of this
algorithm is given after the exercises. Repeat Example 2.2 using this algorithm.
36. (a) Compute the sum (x + x + ... + x). Use n terms with n = 100, 500, 1000, 10000,
50000 and 200000, for x = 0.1, 0.125 and 1/3. Estimate the roundoff error in the sum and
compare it with the actual error. How will the result change if the cascade sum algorithm is
used?
(b) The following recurrence relations give a series of rotations in the xy-plane, through
π/4:

x_{n+1} = \frac{x_n - y_n}{\sqrt{2}}, \qquad y_{n+1} = \frac{x_n + y_n}{\sqrt{2}}.

Starting with x_0 = 0.333 and y_0 = 0.111, evaluate x_n and y_n for n = 800, 8000, 80000,
and compare them with the exact values.
37. Compute the Bessel function J_n(x), (n = 2, 3, ..., 30) at x = 1 using the recurrence
relation

J_{n-1}(x) + J_{n+1}(x) = \frac{2n}{x} J_n(x),

with J_0(1) = 0.7651976865 and J_1(1) = 0.4400505857. Compare the calculated values
with available tables. Try to compute these functions using the recurrence relation in the
backward direction starting with J_m(1) = 1 and J_{m+1}(1) = 0, for m = 10, 20 and 30.
Normalise the values by using the relation

J_0(x) + 2(J_2(x) + J_4(x) + \cdots) = 1.
38. Using a simulation of three decimal digit arithmetic on a computer, find the sum of the
"infinite" series:

\sum_{i=1}^{\infty} \frac{1}{i}.
(a) Starting from 1 keep adding terms until you encounter a term 1/n, such that it does
not affect the sum and print out the value of n and the sum.
(b) Starting from 1/(n - 1), sum the series in backward direction.
(c) Starting from 1/(lOn), sum the series in backward direction.
(d) Repeat the steps (a), (b) and (c) for five decimal digit arithmetic.
(e) Try to estimate the value of the sum if step (a) is carried out on a computer using
48-bit or 50-bit arithmetic. (Warning: do not try to verify this result experimentally.)
39. Write a program to solve the quadratic equation ax^2 + bx + c = 0. Explicitly distinguish
between the three cases of complex roots, repeated real roots and real but distinct roots.
Do not use complex arithmetic. Test the program for the following sets of values:
(i)    a = 1.3 × 10^-3     b = 12345.68        c = 3.82 × 10^-4
(ii)   a = 3.1 × 10^-3     b = -23536.9        c = 1.23 × 10^-4
(iii)  a = 1.21            b = 5.082           c = 5.3361
(iv)   a = 4.36            b = 1.87            c = 3.43
(v)    a = 1.23            b = 2.62            c = 1.395203
(vi)   a = 4.36 × 10^n     b = 1.87 × 10^n     c = 3.43 × 10^n
(vii)  a = 4.23 × 10^-n    b = 1.66 × 10^n     c = 3.21 × 10^n
where in (vi) and (vii) use a value of n such that 10^{1.5n} gives an overflow or underflow
on the computer. (For the 32-bit machine discussed in the text, use n = ±32.)
40. Find at least 100 "zeros" of the polynomial

p(x) = x^{10} - 10x^9 + 45x^8 - 120x^7 + 210x^6 - 252x^5 + 210x^4 - 120x^3 + 45x^2 - 10x + 1.

Here "zeros" are those values of x for which the computed value of p(x) vanishes. (There
can be millions or even billions of "zeros" of this polynomial around x = 1.) Try to
estimate the density of zeros in a suitable region. If you cannot find sufficient "zeros",
then explain why. Print the value of the function at 50 "successive" values of x around
one of these zeros. Also print the value of x in octal or hexadecimal format, say O20 or
Z15 (in Fortran), to make sure that you are using "successive" values of x.
41. Consider the 2 × 2 matrix A defined in (2.87); try to find a 2 × 2 matrix X such that
all elements of XA - I are less than or equal to 10^-4 in magnitude, but the elements of the
matrix AX - I are all larger than 10^4 in magnitude, where I is a unit matrix.
42. Consider the polynomial

f(x) = (x - 1)(x - 2)(x - 3) \cdots (x - 30) = x^{30} - 465x^{29} + \cdots + 30!

Find the coefficients of this polynomial and using these coefficients try to locate the
zeros by evaluating f(x) at x = -0.05 + 0.1i, (i = 0, ..., 320). Which of these intervals
contain zeros and how many zeros can be located by looking at the sign changes? Try
this problem using both single and double precision arithmetic. If it gives overflow on
your machine, replace 30 by 20 appropriately.
43. The derivative of a function is defined by

f'(x) = \lim_{h\to 0} \frac{f(x+h) - f(x)}{h}.

Try to find this limit for the following functions, by taking h = 2^{-n}, (n = 1, 2, ..., t + 2),
where t is the number of bits in the fraction part of floating-point numbers:
(i) f(x) = e^x, x = -1, 0, 1,              (iii) f(x) = x^2 + 3x + 2, x = 0, 1,
(ii) f(x) = \sin x, x = 0, \pi/4, \pi/2,   (iv) f(x) = \frac{x^2 + 3x + 2}{x^5 + x + 2}, x = 0, 1.

In each case, estimate the truncation and roundoff error in the calculation and explain
the results. (If you do not know the value of t for your computer, go on increasing n until
you keep getting the derivative to be zero (for x # 0) and try to estimate the value of t
using this information.)
44. Consider the following system of linear equations

x + ay = 1, \qquad ax + y = 2.
Assuming that a is the only input value, find the condition number of this problem for
evaluating x, y, x + y, x - y, x^2 - y^2 and x^2 + y^2.
45. Analyse the stability of the following recurrence relations
(i) (n + 1)P_{n+1}(x) - (2n + 1)x P_n(x) + n P_{n-1}(x) = 0   (Legendre polynomials)
(ii) n E_{n+1}(x) + x E_n(x) = e^{-x}   (Exponential integral)
(iii) T_{n+1}(x) - 2x T_n(x) + T_{n-1}(x) = 0   (Chebyshev polynomials)
(iv) H_{n+1}(x) - 2x H_n(x) + 2n H_{n-1}(x) = 0   (Hermite polynomials)

46. Prove that the following recurrence relation, which is satisfied by f_n(x) = J_n(x)/n!, is
unstable in the forward direction even though the (absolute) roundoff error may not
increase steeply with n:

f_{n+1}(x) + \frac{1}{n(n+1)} f_{n-1}(x) = \frac{2n}{(n+1)x} f_n(x).

Compare the results obtained using this recurrence relation with those in {37}.
47. Consider the differential equation

\frac{d^2 y}{dt^2} - 25 y = 0,

with initial conditions

y(0) = a, \qquad \left.\frac{dy}{dt}\right|_{t=0} = b.

Find the exact solution of this equation and analyse its stability for t > 0. For what
values of a and b is the solution unstable? What happens if we consider the solution for
t < 0?
48. Consider the difference equation

y_{n+1} - \frac{9}{4} y_n + y_{n-1} = 0,

which is obtained by replacing the second derivative in the previous exercise with a difference
approximation (see Chapter 12). Use y_0 = 1 and y_1 = e^{-1/2} to find y_2, y_3, ..., y_{100}
and compare them with the exact solution of the differential equation, y_k = y(k/10) = e^{-k/2}.
Also find the exact solution of the difference equation and show that the difference equation
is stable with these initial conditions. But the truncation error introduced by discretising
the equation causes the solution to deviate away from the true solution of the
differential equation, which is unstable. For what initial conditions (values of y_0 and y_1)
is the solution of the difference equation unstable?
49. Consider the recurrence relation

x_{i+1} = \frac{19}{4} x_i - 3 x_{i-1}, \qquad i = 1, 2, \ldots

with x_0 = 16, x_1 = 12. Try to find x_i, i = 2, 3, ..., 50 and compare them with the exact
values, x_i = 16(3/4)^i. Explain the results. Also try the same recurrence with x_0 = 1,
x_1 = 4 and compare the results with the exact value, x_i = 4^i. Try to solve the first
part using the recurrence relation in the backward direction. Using x_{n+1} = 0 and x_n = 1 find
x_{n-1}, ..., x_1, x_0 and then normalise all the values to get x_0 = 16. How will you choose the
value of n to start the recurrence?
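The summation algorithm of exercise 35 can be sketched as follows (an assumed implementation, applied to the simple test data x = 1/3 rather than to Example 2.2; aggressive compiler optimisation may need to be disabled for the compensation terms to survive):

      ! Sketch of the compensated summation algorithm of exercise 35.
      program comp_sum
        implicit none
        integer, parameter :: n = 100000
        integer :: k
        real :: x, t, told, r, eta, eps, s
        x = 1.0/3.0
        t = x                              ! t_1 = a_1
        r = 0.0                            ! r_1 = 0
        do k = 2, n
          told = t
          t = t + x                        ! t_k = t_(k-1) + a_k
          eta = t - told                   ! part of a_k actually absorbed
          eps = eta - x                    ! part of a_k that was lost
          r = r - eps                      ! accumulate the lost parts
        end do
        s = t + r                          ! s = t_n + r_n
        print *, 'plain sum =', t, '   compensated sum =', s, '   n*x =', n*x
      end program comp_sum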
Chapter 3

Linear Algebraic Equations

The solution of linear algebraic equations is probably the most important topic
in numerical methods. Since the simplest models for the physical world are
linear, linear equations arise frequently in physical problems. Even the most
complicated situations are frequently approximated by a linear model as a first
step. Further, as will be seen in Chapter 7, the solution of a system of non-
linear equations is achieved by an iterative procedure involving the solution
of a series of linear systems, each of them approximating the nonlinear equa-
tions. Similarly, the solution of differential and integral equations using finite
difference method leads to a system of linear or nonlinear equations. Linear
equations also arise frequently in numerical analysis. For example, the method
of undetermined coefficients which is useful for deriving formulae for numerical
differentiation, integration or solution of differential equations, generally leads
to a system of linear equations.
Because of the great importance of this topic, a large amount of literature
as well as software is available for the solution of a system of linear equations.
Several good subroutine packages are available and can be used effectively for
the solution of linear equations. For some of the published algorithms, readers
can refer to Dongarra et al. (1979), Wilkinson and Reinsch (1971) or Anderson
et al. (2000), while for details of the algorithms and the corresponding error analysis,
Wilkinson (1988) or Forsythe and Moler (1967) can be consulted.
In this chapter, we consider the simple methods of Gaussian elimination
and the direct triangular decomposition for solving a system of linear equations.
There are other methods for triangularisation of a matrix, e.g., using House-
holder's reflection {40} or Givens' rotation {41}, but those are not considered
here. These methods are numerically more stable than the Gaussian elimina-
tion, but not as efficient and hence are rarely used in practice. These methods
are classified as direct methods, since they yield the exact solution in a finite
amount of computation, provided there is no roundoff error. In Section 3.4 we
consider roundoff error in solving linear equations, where the technique of itera-
tive refinement of solution is also given. This technique can be used to estimate
the roundoff error and to improve on the computed results. In Section 3.6, we
describe the singular value decomposition, which is a powerful technique for
detecting ill-conditioning and for solving a system of equations with singular
matrix, or the least squares solution of an overdetermined system of linear equations.
In Section 3.7, we describe some iterative methods, which may be useful for cer-
tain sparse matrices. In Chapter 14 we shall consider iterative methods for a
system of linear equations arising in solution of partial differential equations.
More detailed treatment of sparse matrices is beyond the scope of this book
and readers can refer to Tewarson (1973), Pissanetzky (1984) or Althaus and
Spedicato (1998) for that.

3.1 Introduction
In this chapter, we consider the numerical solution of a system of linear algebraic
equations
\sum_{j=1}^{n} a_{ij} x_j = b_i, \qquad (i = 1, 2, \ldots, m).    (3.1)

If the number of variables n is equal to the number of equations m, then we


can expect to find a unique solution. For m > n the equations are said to be
overdetermined and in general, there is no solution. For m < n in general, there
may be an infinite number of solutions. Equation (3.1) can be conveniently written
in the matrix form
Ax=b, (3.2)
where A = [a_{ij}] is the m × n matrix of coefficients, x^T = (x_1, ..., x_n) and
b^T = (b_1, ..., b_m), with the superscript T denoting the transpose. We denote by
A_b the m × (n + 1) matrix which has the column vector b appended as the
(n + 1)th column to A. If r(A) denotes the rank of a matrix A (which is
essentially the size of the largest nonsingular submatrix of A), then the basic
existence theorem can be stated as:
Theorem: The system of equations Ax = b has a solution if and only if
r(A) = r(A_b). Further, if r(A) = r(A_b) = n the solution is unique, while for
r(A) = r(A_b) = k < n, there is an (n - k)-parameter family of solutions.
From this theorem it follows that, in the homogeneous case (b = 0),
there is a nontrivial solution if and only if r(A) = k < n. The solutions of
homogeneous problem are said to form the null-space of the matrix A. Any
vector in the null-space can be expressed as a linear combination of n - k
linearly independent vectors. For k < n, the general solution of the problem
can be constructed by adding any linear combination of n - k vectors, spanning
the null-space to a particular solution x_p. For r(A) ≠ r(A_b), there is no solution,
but frequently we are interested in the best solution which satisfies all equations
as closely as possible. The closeness is usually defined in the least squares sense,
i.e., the sum of the squares of the differences between the left and right-hand
sides of the equations is minimised.
Most of this chapter is devoted to the case, where the number of equations
is equal to the number of variables (m = n) and the matrix A is nonsingular, i.e.,
r(A) = n, in which case a unique solution exists. In Section 3.6 we will consider
the other cases. In contrast to other problems in numerical methods, here the
solution can be easily expressed in an analytic form. The Cramer's rule gives us
such a solution. The problem here is in computing the solution. The Cramer's
rule can require enormous amount of computation, even for comparatively low
order matrices. Therefore, our main aim in this chapter is to develop more
efficient algorithms to compute the solution of (3.2).
Unlike other areas of numerical methods, in this case, there is no trunca-
tion error and if the calculations are done exactly, the solution will be exact.
However, in practice, the inevitable roundoff error may completely ruin the re-
sults. Hence, apart from efficiency, we also have to worry about the accuracy of
the algorithms. If the matrix A is singular, the solution may not exist and the
numerical algorithm may experience difficulty in such circumstances. Further,
sharp distinction between singular and nonsingular matrices exists only in the
idealised situation, where the arithmetic operations are done exactly. In actual
practice, it is difficult to distinguish between a singular and a "nearly" singular
matrix. Hence, we can expect problems when the matrix is nearly singular,
even though in principle a unique solution exists in such cases. Numerical cal-
culation may turn a singular matrix into nonsingular and vice versa. In fact,
linear systems with nearly singular matrices are almost always ill-conditioned.
The matrices of coefficients that we generally come across can be classified
in several ways. We can classify them as filled or sparse matrices, depending on
the number of nonzero elements in the matrix. If most elements of the matrix
are nonzero, then the matrix is said to be filled, while if most of the elements are
zero, then the matrix is said to be sparse. It is difficult to give a precise value
of the fraction of nonzero elements, below which the matrix can be considered
to be sparse. The concept is heuristic in nature -- a matrix is said to be sparse,
when it is worthwhile to take explicit advantage of the existence of many zeros.
This will depend on the matrix, the algorithm and the computer being used.
In principle, a technique for solving filled matrices can be applied to sparse
matrices and many times the converse may also be true, but in that case the
efficiency may suffer. Here the efficiency could be with respect to the computer
time required, or with respect to the amount of memory used. There are no
firm guidelines about when it is more efficient to treat the matrix as sparse. For
example, a triangular matrix has more than 50% of its elements nonzero, but
it can be handled efficiently using special techniques, while a matrix with same
number of nonzero elements distributed randomly cannot be effectively treated
as sparse. The basic sources of sparsity are models with local connections. For
example, in an electrical network problem only those elements which form a
loop may occur in one equation. Similarly, in finite difference methods, only the
neighbouring points will occur in the same equation. Consequently, each row
of the resulting matrix of equations, has only a few nonzero elements. Further,
in most practical problems larger matrices have higher sparsity.
The sparse matrices can also be of various types, depending on the pat-
tern of nonzero elements. If the nonzero elements are located in a narrow band
around the diagonal, then the methods for filled matrices can be modified to
take advantage of the sparsity, while on the other extreme if the nonzero el-
ements are randomly distributed in the matrix, then special methods will be
required. The former type of sparse matrix is referred to as a band matrix,
where aij = 0 if Ii - jl > k. Here the constant k is called the bandwidth. There
are 2k + 1 nonzero elements in each row. If the bandwidth k = 1, then the
matrix is called tridiagonal. If the bandwidth is zero, then the matrix is diag-
onal. Equations involving a diagonal matrix can be trivially solved and hence
some algorithms for solution of linear equations actually try to transform the
original matrix to an equivalent diagonal form.
In practice we come across filled matrices of comparatively smaller size,
say, less than about 100 x 100. On the other hand, the sparse matrices are
usually encountered in solution of differential equations and their order could
be very large, limited only by the capability of the computer available. It is not
uncommon to have sparse matrices of order 100000 or more. Direct methods
are almost invariably used for solving equations involving filled matrices, while
iterative methods are sometimes preferred for sparse matrices. If the matrix
is in a band form with small bandwidth, then the direct methods are gener-
ally preferred. With significant advances in techniques for treatment of sparse
matrices, the use of direct methods, even in more general cases is increas-
ing. However, with machines capable of parallel processing, the situation may
change in favour of iterative methods. In this chapter, we mostly consider direct
methods for filled matrices, while sparse matrices will be briefly considered in
Section 3.7.
Alternately, matrices can be classified according to their type, e.g., Her-
mitian, real symmetric, positive definite, and so on. Most of the subroutine
packages available for solution of linear systems have special routines to deal
with some of these types. Hermitian matrix is one for which aij = aji for
all i and j, where asterisk denotes the complex conjugate. For such matrices
At = A, where At denotes the conjugate transpose of A. Similarly, a real ma-
trix is symmetric if aij = aji for all i and j, or equivalently if A = AT. Many
physical problems lead to real symmetric matrices. Advantage of working with
symmetric matrices is that, if the algorithm preserves symmetry, then only half
of the matrix needs to be stored and the amount of calculation required is also
halved.
A matrix is said to be positive definite if x^T A x > 0 for all nonzero vectors
x. This property is generally associated with real symmetric matrices. It can
be shown that a real symmetric matrix is positive definite if and only if all its
eigenvalues are positive. Similarly, a matrix is said to be positive semidefinite if
x^T A x ≥ 0 for all vectors x. All standard subroutine packages for linear equations
have a subroutine for positive definite real symmetric matrices. For other real
symmetric matrices, it may be difficult to take advantage of symmetry and at
the same time ensure stability of the algorithm against roundoff errors.

A matrix is said to be orthogonal if A^T A = I, or A^T = A^{-1}. It is trivial to
solve a system of linear equations involving an orthogonal matrix. Orthogonal
matrices are also useful in transforming matrix problems to a simpler form.
Orthogonal matrices do not change the length (x^T x) of any vector. For example,
if y = Ax, then

    y^T y = x^T A^T A x = x^T x.      (3.3)

This is an important property as far as numerical computations are concerned,
since we do not want the transformations to magnify any error which may
be present, either in the original problem or introduced by the roundoff error.
Similarly, a complex matrix is said to be unitary if A^† A = I, or A^† = A^{-1}.
Apart from these special matrices, there are the so-called triangular ma-
trices. A matrix is said to be upper triangular if a_{ij} = 0 when i > j for all
i and j, i.e., all elements below the diagonal are zero. Similarly, a matrix is
said to be lower triangular if a_{ij} = 0 for i < j. A system of linear equations
involving a lower triangular matrix can be easily solved using forward substitu-
tion, while those involving upper triangular matrices can be easily solved using
back-substitution. In particular, if all diagonal elements of a triangular matrix
are unity, then the matrix is referred to as a unit lower triangular or unit upper
triangular.
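To make the two substitution orders concrete, the following sketch (in Python with NumPy, not the Fortran routines of Appendix B; the function names are our own) solves L x = b and U x = b for triangular matrices with nonzero diagonal elements.

import numpy as np

def forward_substitution(L, b):
    # Solve L x = b for a lower triangular L with nonzero diagonal elements.
    n = len(b)
    x = np.zeros(n)
    for i in range(n):
        x[i] = (b[i] - np.dot(L[i, :i], x[:i])) / L[i, i]
    return x

def back_substitution(U, b):
    # Solve U x = b for an upper triangular U with nonzero diagonal elements.
    n = len(b)
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (b[i] - np.dot(U[i, i+1:], x[i+1:])) / U[i, i]
    return x

For a unit triangular matrix the division by the diagonal element can of course be skipped.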
Similarly, we can define the Hessenberg matrices. A matrix is said to be
upper Hessenberg if a_{ij} = 0 when i > j + 1 for all i and j. In this case, apart
from the upper triangle, the first subdiagonal below the diagonal is also nonzero.
Similarly, a matrix is said to be lower Hessenberg if a_{ij} = 0 for i < j - 1.

3.2 Gaussian Elimination


Gaussian elimination is the classical and possibly the simplest method for solv-
ing a system of linear equations, which all of us have used in some form or the
other. We can write out the system of equations in the form

    a_{11} x_1 + a_{12} x_2 + · · · + a_{1n} x_n = b_1,
    a_{21} x_1 + a_{22} x_2 + · · · + a_{2n} x_n = b_2,
        ...                                               (3.4)
    a_{n1} x_1 + a_{n2} x_2 + · · · + a_{nn} x_n = b_n.

If a_{11} ≠ 0, then we can eliminate x_1 from the last (n - 1) equations by
subtracting the multiple a_{i1}/a_{11} of the first equation from the ith equation
(i = 2, 3, ..., n) to get the first derived system

    a_{11} x_1 + a_{12} x_2 + · · · + a_{1n} x_n = b_1,
             a^{(1)}_{22} x_2 + · · · + a^{(1)}_{2n} x_n = b^{(1)}_2,
                 ...                                          (3.5)
             a^{(1)}_{n2} x_2 + · · · + a^{(1)}_{nn} x_n = b^{(1)}_n.

The new coefficients a^{(1)}_{ij} and b^{(1)}_i are given by

    a^{(1)}_{ij} = a_{ij} - (a_{i1}/a_{11}) a_{1j},   b^{(1)}_i = b_i - (a_{i1}/a_{11}) b_1,      i, j = 2, 3, ..., n.      (3.6)

Now, if a^{(1)}_{22} ≠ 0 we can eliminate x_2 from the last (n - 2) equations, to get the
second derived system, and continuing this process through (n - 1) steps, we
arrive at the final system

    a_{11} x_1 + a_{12} x_2 + a_{13} x_3 + · · · + a_{1n} x_n = b_1,
             a^{(1)}_{22} x_2 + a^{(1)}_{23} x_3 + · · · + a^{(1)}_{2n} x_n = b^{(1)}_2,
                      a^{(2)}_{33} x_3 + · · · + a^{(2)}_{3n} x_n = b^{(2)}_3,          (3.7)
                          ...
                               a^{(n-1)}_{nn} x_n = b^{(n-1)}_n.

Here we have assumed that all diagonal elements are nonzero and

    a^{(k)}_{ij} = a^{(k-1)}_{ij} - n_{ik} a^{(k-1)}_{kj},      k = 1, 2, ..., n - 1;
    b^{(k)}_i  = b^{(k-1)}_i  - n_{ik} b^{(k-1)}_k,            i, j = k + 1, ..., n;      (3.8)

where
    n_{ik} = a^{(k-1)}_{ik} / a^{(k-1)}_{kk}.      (3.9)
This process is referred to as Gaussian elimination, while the diagonal elements
a^{(k-1)}_{kk} at each of the major steps are called the pivots. Once the elimination is
carried out, the solution can be obtained by back-substitution. We can solve
the last equation for x_n and then solve the last but one for x_{n-1} and so on.
This gives

    x_i = ( b^{(i-1)}_i - Σ_{j=i+1}^{n} a^{(i-1)}_{ij} x_j ) / a^{(i-1)}_{ii},      (i = n, n - 1, ..., 1).      (3.10)

It may be noted that the manipulations used for elimination will preserve the
value of the determinant of the matrix A. Hence, after the elimination is com-
plete, the determinant can be easily computed by taking the product of all
diagonal elements in (3.7).
To estimate the efficiency of this process, we can count the number of arith-
metic operations required {5}. It turns out that this process requires approxi-
mately n^2/2 divisions and n^3/3 multiplications and additions for the elimination. In
addition, the back-substitution requires n divisions and n(n - 1)/2 multiplica-
tions and additions. Hence, for large n, it requires approximately 2n^3/3 floating
point operations to solve a system of linear equations. Such operation counts
for a numerical algorithm is usually referred to as its complexity. This complex-
ity may be compared with the (n + 1)! multiplications required by Cramer's
rule. Thus, using Gaussian elimination, the solution of a system of 100 linear
equations requires less than one million floating-point operations, which can
be performed within one second on existing computers. For a band matrix of
bandwidth k the number of arithmetic operations required will be O(nk^2) {7}.
Thus, a system of 100000 equations with a bandwidth of two requires of the
order of a million floating-point operations.
Obviously, this procedure of Gaussian elimination will break down if at
any stage, the pivot a^{(k-1)}_{kk} turns out to be zero. In such cases, if the original
matrix is nonsingular, then at least one of the elements a^{(k-1)}_{jk}, (j = k, ..., n),
will be nonzero. Hence, we can interchange the corresponding equations to get
a nonzero element in the (k, k) position. Sharp distinction between zero and
nonzero numbers exists only in the mathematician's ideal world of real numbers.
As soon as we use arithmetic operations with finite accuracy, the distinction
becomes fuzzy. Thus, sometimes a nonzero number may become zero because
of roundoff errors, while more often a zero element will turn out to be nonzero.
Hence, we can expect that, even if one of the pivots is not exactly zero but very
small, the process may be numerically unstable.
EXAMPLE 3.1: Consider the following system of equations
    ε x + y - ε z = 1,
    2x + y + z = 4,      (3.11)
    x - ε y + ε z = 1.
Elimination of x yields
    ε x + y - ε z = 1,
    (1 - 2/ε) y + 3z = 4 - 2/ε,      (3.12)
    -(ε + 1/ε) y + (ε + 1) z = 1 - 1/ε.
If ε < ℏ/2, then the computed value of 1 + ε will be 1, and so on. In that case, the computed
system of equations becomes
    ε x + y - ε z = 1,
    -(2/ε) y + 3z = -2/ε,      (3.13)
    -(1/ε) y + z = -1/ε.
Eliminating y from the last equation will yield the result z = 0, y = 1, x = 0, in contrast to
the exact solution x = y = z = 1. Here it should be noted that det(A) = 1 - ε + 4ε^2 and hence
the system of equations is well-conditioned. It is only because of the small pivot ε at the first
stage, which gives rise to the large multipliers 2/ε and 1/ε, that large roundoff errors have been
introduced. If we interchange the first two equations before carrying out the elimination, then
we can get the correct result x = y = z = 1, using the arithmetic operations with the same
precision.
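The effect of the small pivot can be reproduced directly in single precision arithmetic. The following sketch (Python with NumPy, our own illustration rather than any routine from Appendix B) carries out the elimination with and without the interchange for ε of the order of 10^{-8}; the unpivoted version returns approximately (0, 1, 0), while the pivoted one recovers the exact solution (1, 1, 1).

import numpy as np

def gauss_elim(A, b, pivot=True):
    # Naive Gaussian elimination followed by back-substitution.
    A, b = A.copy(), b.copy()
    n = len(b)
    for k in range(n - 1):
        if pivot:                              # partial pivoting: largest |a_ik|, i >= k
            p = k + np.argmax(np.abs(A[k:, k]))
            A[[k, p]] = A[[p, k]]
            b[[k, p]] = b[[p, k]]
        for i in range(k + 1, n):
            m = A[i, k] / A[k, k]
            A[i, k:] -= m * A[k, k:]
            b[i] -= m * b[k]
    x = np.zeros_like(b)
    for i in range(n - 1, -1, -1):
        x[i] = (b[i] - np.dot(A[i, i+1:], x[i+1:])) / A[i, i]
    return x

eps = np.float32(1.0e-8)
A = np.array([[eps, 1, -eps], [2, 1, 1], [1, -eps, eps]], dtype=np.float32)
b = np.array([1, 4, 1], dtype=np.float32)
print(gauss_elim(A, b, pivot=False))   # roughly [0, 1, 0]: ruined by the small pivot eps
print(gauss_elim(A, b, pivot=True))    # close to the exact solution [1, 1, 1]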

This example suggests that, at every stage before eliminating the variable
x_k, we should interchange the kth and k'th equations, where

    |a^{(k-1)}_{k'k}| = max_{i=k,...,n} |a^{(k-1)}_{ik}|.      (3.14)

If k' so defined is not unique, we can take the smallest of its possible values.
This choice ensures that all multipliers n_{ik} are less than or equal to one in mag-
nitude. The stabilised elimination process is usually called Gaussian elimination
with interchanges or with partial pivoting. Alternately, we can interchange both
the rows as well as columns, to bring the largest element in the relevant part of
the matrix into the pivotal position. Such a procedure is called complete pivot-
ing. Interchanging of columns is equivalent to changing the order of variables.
Hence, if such a procedure is used the variables will have to be restored to their
natural order after the solution has been found. Apart from this, finding the
maximum element in the submatrix of order n - k + 1 at kth stage will also
require some effort. Hence, complete pivoting is obviously a more complicated
procedure, which is rarely recommended in practice. We will consider it again
in Section 3.4. It is possible to construct examples {11}, {12}, where pivoting
may lead to disaster, while some other ordering of equations may give accurate
results. However, such counter-examples invariably involve matrices which are
ill-conditioned. Thus, selecting the largest element as pivot may not be the best
strategy in all cases. However, no alternative practical procedure has yet been
proposed.
Pivoting is not a foolproof method. For example, if the first equation
in Example 3.1 is multiplied by 2/ε, then according to the above criterion,
there will be no need for interchanging the rows and we get the same result
as that obtained in the example. In fact, by multiplying the equations with
suitable constants, we can make any required (nonzero) element the largest,
the largest, without changing the actual solution of the equations. Hence, the
purpose of pivoting can be easily defeated by indiscriminate scaling of the
equations. It will be best if all coefficients of the matrix are of the same order,
but it may not be possible to ensure that in all cases. There are two types of
scaling which can be easily applied to any system of linear equations. First is
multiplying any row of the matrix by a nonzero constant, which is same as
multiplying the corresponding equation by a constant. This scaling does not
change the solution of the system. Alternately, we can multiply any column of
the matrix by a nonzero constant, which is equivalent to changing the unit for
corresponding variable (e.g., from cm to km). After the solution is obtained, the
corresponding variable can be multiplied by the same constant to get the value
in original unit. These simple scaling techniques are not in principle sufficient
to ensure that all elements of the matrix will be of the same order. There is
no satisfactory solution to the problem of scaling an arbitrary matrix. Two
examples of troublesome matrices due to Rice (1992) are given below. Neither
of these simple techniques nor their combination properly scales the following
matrices:

    (two matrices whose entries range over 1, 10^{10}, 10^{20}, ..., 10^{80} in a pattern
    that no combination of row and column scaling can bring to comparable size)      (3.15)
Fortunately, most problems that are encountered in actual practice can
be scaled without much difficulty. The crucial point is to choose units which
are natural to the problem so that the relationship between various variables is
not distorted. The most common technique for row scaling is to multiply each
row (or equation), such that the largest number in that row is 1. A matrix A
is said to be row equilibrated if for each row index i
    1/β ≤ max_{1≤j≤n} |a_{ij}| ≤ 1,      (3.16)

where β is the base of the floating-point system used in calculations. Similarly, a
matrix A is said to be column equilibrated if for each column index j

    1/β ≤ max_{1≤i≤n} |a_{ij}| ≤ 1.      (3.17)

Finally a matrix is said to be equilibrated if it is both row and column equi-
librated. The process of scaling can also introduce roundoff errors during mul-
tiplication. To avoid this it is preferable to scale the elements by a power of
the base β. Such multiplication or division does not introduce any roundoff
error as the process is equivalent to changing the exponent of the floating point
numbers.
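As an illustration of scaling by a power of the base (here β = 2 for binary floating point), the following sketch (Python/NumPy, our own construction) row-equilibrates a matrix so that the largest magnitude in each row lies in [1/2, 1), satisfying (3.16) without introducing any roundoff error.

import numpy as np

def row_equilibrate(A):
    # Scale each row by a power of 2 so that its largest |a_ij| lies in [1/2, 1).
    # Multiplying by a power of the base only changes exponents, so it is exact.
    A = np.asarray(A, dtype=float).copy()
    big = np.max(np.abs(A), axis=1)        # largest magnitude in each row (assumed nonzero)
    _, e = np.frexp(big)                   # big = m * 2**e with 1/2 <= m < 1
    scale = np.ldexp(1.0, -e)              # exact powers of two, 2**(-e)
    return A * scale[:, None], scale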
Assuming that the matrix is equilibrated, we can consider the criterion to
decide if the matrix is singular or not. If the pivot, i.e., the maximum element
in the column at any given stage, comes out to be less than cℏ in magnitude
(where c is a constant of order unity), then the matrix can be considered to be
computationally singular, since even if the matrix is not actually singular, it
is unlikely that a reliable estimate for the corresponding pivot is available in
such circumstances. The constant c should increase with the order of the matrix.
There are at most 2n arithmetic operations performed on any one a_{ij}. Hence,
c = 2n is the largest value that can be considered. However, we can expect
considerable cancellation in the roundoff errors and as such a more reasonable
value may be c ≈ √n. As will be seen in the next section, it is convenient to
use accumulation of double length numbers when computing the coefficients.
In such cases, the roundoff error could be much lower and a value of c = 2 or 3
may be more realistic, if we are dealing with almost singular matrices. If we are
not particular about detecting a singularity, then this test could be avoided,
while the reliability of the results can be tested by the technique of iterative
refinement explained in Section 3.4.
We now describe a method of organising Gaussian elimination with partial
pivoting on a computer. Following the convention of programming languages,
a_{ij} refers to the current element in the (i, j) position, after all operations up
to the current stage have been performed. Thus, we will omit the upper suffix
on a_{ij}. This is possible, since only the latest version of a_{ij} is required in the
calculations at any stage. Further, all information regarding interchanges can
be stored in an array i_k, while n_{ik} can be overwritten on the elements a_{ik} which are
reduced to zero. The algorithm consisting of (n - 1) major steps is as follows:
For each value of k from 1 to (n - 1) in succession perform the following steps:


1. Find the largest of the quantities |a_{jk}|, (j = k, ..., n). If |a_{k'k}| is the
   maximum, then store the value of k' in i_k. If two or more of these numbers
   have maximum modulus, then choose k' to be the smallest of its possible
   values.
2. If k' ≠ k, then interchange a_{ki}, (i = k, ..., n) and b_k with a_{k'i} and b_{k'}.
3. For each value of j from k + 1 to n perform (3a), (3b) and (3c).
   3a. Compute n_{jk} = a_{jk}/a_{kk} and overwrite on a_{jk}.
   3b. For each value of i from k + 1 to n, compute (a_{ji} - n_{jk} a_{ki}) and
       overwrite on a_{ji}.
   3c. Compute b_j - n_{jk} b_k and overwrite on b_j.
For completeness, after all the above steps are over, we may set i_n = n. Further,
in step (2) we can check for a zero pivot before performing the division in step
(3a). It may be noted that this procedure destroys the contents of the matrix
A as well as the right-hand side vector b. The determinant of the matrix A
can be computed by taking the product of the pivots. It should be noted that the
determinant will change sign when two rows are interchanged, which should be
accounted for while computing the determinant. Computing the determinant
may result in overflow or underflow, if the order of the matrix is large. For ex-
ample, if we multiply each equation by 10, the determinant will be multiplied
by 10^n, even though the solution is not affected. To avoid this problem, there
are two alternatives: one is to take logarithms and compute the logarithm of the
determinant by adding the logarithms of all the diagonal elements. The sign of
the determinant can be passed separately in this case. The second alternative is to
keep scaling the determinant periodically and keep a count of the number of
times scaling is required. This is similar to the floating-point format, except that
now the range of the exponent is very large. For example, subroutine CROUT in
Appendix B returns the value of the determinant in the form DET × 2^{IDET}, where
1/32 ≤ |DET| ≤ 32.
The above algorithm is implemented in the subroutine GAUELM. This
subroutine assumes that the matrix is equilibrated. We can divide all rows by
the largest element in each row, before starting the procedure. However, this
technique has the disadvantage of introducing extra roundoff error during the
process of division. The roundoff can be avoided by using the nearest power of
two for division rather than the maximum element itself. Alternately, we can use
the so-called implicit equilibration as follows. First find the maximum element
in each row. If s_i is the maximum element in the ith row, then for pivoting
find the maximum of |a_{jk}/s_j| in step (1) above and use the corresponding
value of k' for interchanging the rows. It is obvious that this process will lead
to the same sequence of interchanges as would have been obtained if
all rows were explicitly divided by the maximum element. However, in this
case, the elements n_{ij} of the lower triangular factor may not necessarily be
less than unity. Similarly, to check for a zero pivot element, the ratio a_{kk}/s_k
should be considered. Implicit pivoting is implemented in subroutine CROUT
in Appendix B.
Often we may come across situations in which we have to solve several
sets of equations, all involving the same matrix A, but with different right-
hand sides. Clearly, if all the right-hand sides are known ab initio, then we
can process all of them simultaneously, as the elimination proceeds. In many
cases, we may be interested in iterative procedures, where each right-hand side
is obtained from the solution of the previous set of equations. In such cases,
elimination needs to be performed only once and the right-hand sides can be
processed separately using the forward and back-substitution. Each right-hand
side can be processed by the following algorithm:
1. For each value of k from 1 to n - 1 in succession, perform the steps (la)
and (lb).
1a. Interchange bk and bk" where k' is obtained from ik.
1b. For each value of j from k + 1 to n compute bj - njkbk and overwrite
on bj . It may be noted that the elimination algorithm overwrites njk
on ajk.
2. For each value of k from n to 1 in succession, compute

(3.18)

and overwrite on bk. It may be noted that for k = n the summation will
not contribute any term.
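A sketch of this right-hand side processing, using the arrays produced by an elimination that stores the multipliers and interchanges as above (Python/NumPy, our own illustration, assuming the gauelm_like sketch or any routine with the same storage convention):

import numpy as np

def process_rhs(a, intch, b):
    # Forward reduction and back-substitution (3.18) for one right-hand side,
    # using the stored multipliers (below the diagonal of a) and interchanges.
    n = len(b)
    b = b.copy()
    for k in range(n - 1):
        kp = intch[k]
        b[k], b[kp] = b[kp], b[k]                  # step 1a: repeat the interchanges
        b[k+1:] -= a[k+1:, k] * b[k]               # step 1b: subtract n_jk * b_k
    x = np.zeros(n)
    for k in range(n - 1, -1, -1):                 # step 2: back-substitution
        x[k] = (b[k] - np.dot(a[k, k+1:], x[k+1:])) / a[k, k]
    return x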
The basic idea behind Gaussian elimination is to transform the given
matrix to an upper triangular matrix, so that the resulting equations can be
easily solved by back-substitution. Alternately, we can transform the matrix to
a diagonal form, so that the solution is trivially obtained. Such a procedure is
referred to as Gauss-Jordan elimination {6}. In this algorithm at every stage,
the corresponding variable is eliminated from all the remaining equations both
above and below the pivot element. This algorithm requires more work than
Gaussian elimination and it is not the recommended method for solution of a
system of linear equations. However, this algorithm has better parallelism and
hence may be useful on parallel machines.
EXAMPLE 3.2: Consider the following system of linear equations
    0.7321 x_1 + 0.4135 x_2 + 0.3126 x_3 + 0.5163 x_4 = 0.8132,
    0.2317 x_1 + 0.6123 x_2 + 0.4137 x_3 + 0.6696 x_4 = 0.4753,      (3.19)
    0.4283 x_1 + 0.8176 x_2 + 0.4257 x_3 + 0.8312 x_4 = 0.2167,
    0.8653 x_1 + 0.2165 x_2 + 0.8265 x_3 + 0.7123 x_4 = 0.5165.
We use the Gaussian elimination algorithm. For illustration, we give here the matrix
as well as the right-hand side after every major step of the elimination process. The results
are shown in Table 3.1, where the last column gives the corresponding component of the
right-hand side. Here we have interchanged the entire row at every stage to get the right

Table 3.1: Solution of linear equations by Gaussian elimination

First major step (rows 1 and 4 interchanged)

0.8653000    0.2165000     0.8265000    0.7123000     0.5165000
0.2677684    0.5543281     0.1923894    0.4788686     0.3369976
0.4949728    0.7104384     .01660497    0.4786309    -.03895346
0.8460650    0.2303269    -0.3866727   -.08635207     0.3762074

Second major step (rows 2 and 3 interchanged)

0.8653000    0.2165000     0.8265000    0.7123000     0.5165000
0.4949728    0.7104384     .01660497    0.4786309    -.03895346
0.2677684    0.7802621     0.1794332    0.1054110     0.3673915
0.8460650    0.3242040    -0.3920561   -0.2415261     0.3888363

Third major step (rows 3 and 4 interchanged)

0.8653000    0.2165000     0.8265000    0.7123000     0.5165000
0.4949728    0.7104384     .01660497    0.4786309    -.03895346
0.8460650    0.3242040    -0.3920561   -0.2415261     0.3888363
0.2677684    0.7802621    -0.4576722   -.00512874     0.5453511

triangular decomposition. In practice, the first k - 1 elements should not be interchanged, if
we are interested in processing right-hand sides separately. The calculations were done using
a 24-bit arithmetic. Although the lower triangular elements of the matrix are reduced to zero,
we give here the corresponding entry at that place. As explained in the next section this will
give the triangular decomposition of the matrix. To indicate this fact the elements of the lower
triangular matrix are shown in bold face.
Using the same 24-bit arithmetic, it can be verified that

         ( 0.86530000  0.21650000  0.82650000  0.71230000 )
    LU = ( 0.42830000  0.81760000  0.42570000  0.83120000 )      (3.20)
         ( 0.73210000  0.41350000  0.31259990  0.51630000 )
         ( 0.23170000  0.61230000  0.41370000  0.66960000 )
where

        ( 1.0000000  0.0000000   0.0000000   0.0000000 )
    L = ( 0.4949728  1.0000000   0.0000000   0.0000000 )      (3.21)
        ( 0.8460650  0.3242040   1.0000000   0.0000000 )
        ( 0.2677684  0.7802621  -0.4576722   1.0000000 )

        ( 0.8653000  0.2165000   0.8265000   0.7123000 )
    U = ( 0.0000000  0.7104384   .01660497   0.4786309 )      (3.22)
        ( 0.0000000  0.0000000  -0.3920561  -0.2415261 )
        ( 0.0000000  0.0000000   0.0000000  -.00512874 )
Thus, the product of the lower and the upper triangular matrices essentially reproduces the
original matrix, apart from interchanges in the rows. This verifies the accuracy of the triangular
decomposition.
Once the elimination is performed, the solution can be found by back-substitution. The
computed solution is

    x_1 = 8.973596,   x_2 = 70.07462,   x_3 = 64.51418,   x_4 = -106.3323.      (3.23)

This result may be compared with the correctly rounded values to seven decimal digits

    x_1 = 8.973596,   x_2 = 70.07471,   x_3 = 64.51428,   x_4 = -106.3324.      (3.24)

Thus, it can be seen that the last two digits of the computed solution are not correct. The
comparatively large error here is because of the fact that the system of equations is somewhat
ill-conditioned. The determinant of the matrix is -1.236097 × 10^{-3}, even though all the
elements are of the order of unity.

3.3 Direct Triangular Decomposition


It can be easily seen that in Gaussian elimination algorithm, each element is
modified several times. It may be better if all these modifications can be effected
simultaneously. That will not change the number of arithmetic operations re-
quired, but the advantage is that we can use a double precision variable to
accumulate the intermediate sums and thereby reduce the roundoff error. Of
course, we can use double precision arithmetic to do all calculations, but that
requires twice as much storage. If the computer has the facility of accumulating
double length products, then the computer time required will be approximately
the same as that required for single precision calculation. Hence, we will get the
accuracy of double precision calculations, without actually using double pre-
cision arithmetic. On a desk calculator also, this form has the advantage that
intermediate results do not have to be noted down, since each matrix element
is modified only once.
To understand how this comes about, let us consider (3.10) for back-
substitution. The sum can be easily accumulated in a double precision variable
and after division by the pivot element, the result can be rounded to single
precision and stored as Xi. This technique requires only one double precision
variable and further, if the computer has facility for double length accumulation
of products, the process will be as fast as the single precision calculation. As
we have seen in Section 2.3, the process of summation can magnify the relative
error significantly. In fact, this is the main source of roundoff error in matrix
computations. Hence, it is better to reduce this error as far as possible.
Before considering the methods for reorganising Gaussian elimination, we
will demonstrate that Gaussian elimination is equivalent to decomposing the
original matrix into a product of a lower triangular and an upper triangular
matrix. Once the decomposition is obtained, the solution can be easily found
by a combination of forward and back-substitution.
For simplicity, let us first ignore pivoting and consider the matrix formu-
lation of the process of Gaussian elimination. If x is a solution of

Ax=b, (3.25)

then it is also a solution of


PAx=Pb, (3.26)
where P is any m x n matrix (assuming that A is a n x n matrix). However, if
P is square and nonsingular, then it is also true that any solution of (3.26) is
also a solution of (3.25). The basic idea in Gaussian elimination is to determine
a nonsingular lower triangular matrix P, such that P A = U is upper triangular
and hence (3.26) can be readily solved by back-substitution. In that case, A =
P^{-1} U gives the triangular decomposition of the original matrix.
The algorithm for reducing A to upper triangular form consists of n - 1
major steps. We denote the original set of equations (3.25) by A_0 x = b_0.
Each step leads to a new set of equations, which is equivalent to the original
set. If we denote the kth equivalent set by A_k x = b_k, then A_k is already
upper triangular, as far as its first k columns are concerned. Thus, the kth
major step consists of the determination of an elementary matrix L_k, such that
A_k = L_k A_{k-1} is upper triangular in its first k columns and is identical with
A_{k-1} in its first k - 1 rows and columns. It can be verified that this objective
is achieved by the elementary matrix L_k, which is equal to the identity matrix
except for the kth column, which is

    (0, ..., 0, 1, -n_{k+1,k}, ..., -n_{nk})^T,      (3.27)

where the first k - 1 elements are zero.

For a typical case of n = 5 and k = 3, the equation A_k = L_k A_{k-1} will have the
form

    ( x  x  x  x  x )   ( 1  0     0      0  0 ) ( x  x  x  x  x )
    ( 0  x  x  x  x )   ( 0  1     0      0  0 ) ( 0  x  x  x  x )
    ( 0  0  x  x  x ) = ( 0  0     1      0  0 ) ( 0  0  x  x  x )      (3.28)
    ( 0  0  0  x  x )   ( 0  0  -n_{4,3}  1  0 ) ( 0  0  x  x  x )
    ( 0  0  0  x  x )   ( 0  0  -n_{5,3}  0  1 ) ( 0  0  x  x  x )

where x denotes elements which are in general nonzero. Premultiplication by
L_k results in subtraction of a multiple n_{ik} of the kth row from the ith row,
for each value of i from k + 1 to n. The first k rows remain unchanged and the
zeros in the first k - 1 columns are preserved. This is exactly identical to the
kth major step in the Gaussian elimination process.
The kth major step can be expressed as

    A_k = L_k A_{k-1}   and   b_k = L_k b_{k-1}.      (3.29)

Using this equation for k = 1, 2, ..., n - 1, we get

    A_{n-1} = L A_0 = U   and   b_{n-1} = L b_0,      (3.30)

where L = L_{n-1} · · · L_2 L_1 is a lower triangular matrix. From this equation it
follows that A = L^{-1} U. Here L^{-1} will also be a unit lower triangular matrix
and in fact it can be shown {22} that it is given by

              ( 1                                             )
              ( n_{21}   1                                     )
    L^{-1} =  ( n_{31}   n_{32}   1                            )      (3.31)
              ( ...      ...      ...    ...                   )
              ( n_{n1}   n_{n2}   n_{n3}  ...  n_{n,n-1}   1   )

Hence, after Gaussian elimination is performed according to the algorithm given


in the previous section (except for row interchanges) the lower triangular part
will contain the corresponding elements of the lower triangular matrix L^{-1},
while the diagonal elements of this matrix are all unity. The upper triangular
part including the diagonal elements will contain the upper triangular factor U
of the matrix. Thus, Gaussian elimination decomposes the given matrix into a
product of a unit lower triangular and an upper triangular matrix. This is also
referred to as LU decomposition. It can be shown that {24} for a nonsingular
matrix if such a decomposition exists, then it is unique.
In the above analysis we have neglected pivoting. It can be shown that if a
matrix is premultiplied by the permutation matrix P_{ij}, then the rows i and j are
interchanged. The permutation matrix can be obtained by interchanging the
corresponding rows of the unit matrix. Hence, the kth major step in Gaussian
elimination with partial pivoting can be written in the matrix form as

    A_k = L_k P_{k,k'_k} A_{k-1}   and   b_k = L_k P_{k,k'_k} b_{k-1}.      (3.32)

It is clear that if we knew the interchanges in advance, we can perform all the
interchanges before beginning the process of elimination, to get a matrix

    P A = P_{n-1,k'_{n-1}} · · · P_{2,k'_2} P_{1,k'_1} A.      (3.33)

Now if Gaussian elimination is performed on P A, then no interchanges will be
required and the triangular decomposition can be obtained as explained earlier.
Hence, if A is nonsingular, then there exists a permutation matrix P, such that
P A can be decomposed into the product of a unit lower triangular matrix
L and an upper triangular matrix U. If it is required to obtain the triangular
decomposition using Gaussian elimination, then the algorithm of the previous
section can be modified to interchange the entire row in step (2).
It may be pointed out that, it is not necessary to obtain a decomposition
in terms of a unit lower triangular and an upper triangular matrix. Instead we
can demand that the diagonal elements of U are unity, but those of L need
not be so. Alternately, we can demand that the diagonal elements of both L
and U should be equal. The latter is useful for symmetric matrices, since that
may preserve the symmetry. For a real symmetric matrix, we can look for a
decomposition of the form A = L L^T, where L is a lower triangular matrix
{25}. Such a decomposition is referred to as Cholesky decomposition. However,
it can be shown that only if the matrix is positive definite, such a decomposition
will exist. For a general symmetric matrix, LU decomposition may not exist
without interchanging the rows, in which case the symmetry will be lost.
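A minimal sketch of the Cholesky factorisation (Python/NumPy, our own illustration; a library routine such as numpy.linalg.cholesky performs the same task) shows where positive definiteness enters: the diagonal elements require real, positive square roots.

import numpy as np

def cholesky_lower(A):
    # Factor a symmetric positive definite matrix as A = L L^T, L lower triangular.
    n = A.shape[0]
    L = np.zeros_like(A, dtype=float)
    for j in range(n):
        d = A[j, j] - np.dot(L[j, :j], L[j, :j])
        if d <= 0.0:                      # a real square root exists only if d > 0
            raise ValueError("matrix is not positive definite")
        L[j, j] = np.sqrt(d)
        for i in range(j + 1, n):
            L[i, j] = (A[i, j] - np.dot(L[i, :j], L[j, :j])) / L[j, j]
    return L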
Now we consider the reorganisation of Gaussian elimination to obtain
the triangular decomposition of a given matrix. Initially, we will assume that
no interchanges are necessary to obtain such a decomposition. The equation
A = LU can be written explicitly as
    a_{ij} = Σ_{k=1}^{n} l_{ik} u_{kj} = Σ_{k=1}^{min(i,j)} l_{ik} u_{kj},      (i, j = 1, 2, ..., n).      (3.34)

If we assume L to be a unit lower triangular matrix and U an upper triangular
matrix, then (3.34) provides a system of n^2 equations in n^2 unknowns, the
n(n + 1)/2 elements of U and the n(n - 1)/2 elements of L. These equations can
be solved recursively in several ways.
We shall first consider Doolittle's algorithm, which is very similar to Gaus-
sian elimination. In this algorithm, we compute in succession one row of U
followed by the corresponding column of L. Setting i = 1 in (3.34), we have

    u_{1j} = a_{1j},      (j = 1, 2, ..., n),      (3.35)

since l_{11} = 1. Again setting j = 1 and i > j in (3.34), we get

    l_{i1} = a_{i1} / u_{11},      (i = 2, ..., n).      (3.36)

Similarly, once k - 1 rows of U and k - 1 columns of L are computed, we can
compute the kth row of U by

    u_{kj} = a_{kj} - Σ_{i=1}^{k-1} l_{ki} u_{ij},      (j = k, ..., n),      (3.37)

while the kth column of L can be computed using

    l_{ik} = ( a_{ik} - Σ_{j=1}^{k-1} l_{ij} u_{jk} ) / u_{kk},      (i = k + 1, ..., n).      (3.38)

Of course, if at any stage u_{kk} = 0, then the algorithm fails. This failure can be
handled by pivoting, which will be considered a little later.
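A direct transcription of (3.35)-(3.38), without pivoting, might look as follows (a Python/NumPy sketch, not the code of Appendix B); in a practical routine each sum would be accumulated in a higher precision variable and the factors would overwrite A.

import numpy as np

def doolittle(A):
    # Doolittle decomposition A = L U with unit lower triangular L (no pivoting).
    n = A.shape[0]
    L = np.eye(n)
    U = np.zeros_like(A, dtype=float)
    for k in range(n):
        U[k, k:] = A[k, k:] - L[k, :k] @ U[:k, k:]          # kth row of U, eq. (3.37)
        if U[k, k] == 0.0:
            raise ZeroDivisionError("zero pivot encountered; pivoting is required")
        L[k+1:, k] = (A[k+1:, k] - L[k+1:, :k] @ U[:k, k]) / U[k, k]   # kth column of L, eq. (3.38)
    return L, U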
Crout's algorithm, which is more popular, differs from Doolittle's algo-
rithm in that it generates a unit upper triangular matrix U and a general
lower triangular matrix L satisfying (3.34). In this case, we first generate a
column of L followed by the corresponding row of U. At the kth major step,
we first compute the kth column of L using

    l_{ik} = a_{ik} - Σ_{j=1}^{k-1} l_{ij} u_{jk},      (i = k, ..., n),      (3.39)

followed by the kth row of U

    u_{kj} = ( a_{kj} - Σ_{i=1}^{k-1} l_{ki} u_{ij} ) / l_{kk},      (j = k + 1, ..., n).      (3.40)

Crout's algorithm has a slight advantage over Doolittle's algorithm when we
wish to include partial pivoting. Both these algorithms are closely related {23}
and in fact, the diagonal elements of U in Doolittle's algorithm are identical to
those of L in Crout's algorithm. In both cases, the elements of matrices L and
U can be overwritten on the corresponding elements of the original matrix as
soon as they are found.
Both Doolittle's and Crout's algorithms are very similar to Gaussian elim-
ination and hence it is obvious that they will run into difficulty when a pivot
is small. As we have seen, this can happen even for well-conditioned matrices.
Consequently, it is essential to use pivoting. Pivoting can be easily applied in
Crout's algorithm, since after calculating the elements in the kth column of
L, we can find the maximum and interchange the corresponding rows of the
matrix including the first k - 1 columns containing the elements of L. This is
justified, since if the same interchange was performed on the original matrix
before applying the process of triangular decomposition, then the element in
(k, k) position would have been the largest in the corresponding column.
For Doolittle's algorithm, the situation is slightly complicated, since the
kth row of U is calculated before the corresponding column of L, and hence we
would not know which row is going to be the kth. This problem can be trivially
overcome by calculating the numerators of l_{ik} by

    s_i = a_{ik} - Σ_{j=1}^{k-1} l_{ij} u_{jk},      (i = k, ..., n).      (3.41)

These quantities may be stored separately, or they can be overwritten on a_{ik}.
We can now find the maximum of |s_i| and interchange the corresponding rows to
bring the maximum element into the kth row. It is clear that if this interchange was
performed beforehand, then we would automatically get the maximum element
in the right position. After interchanging we get u_{kk} = s_k, while the other elements
of the kth row of U can be calculated as before and the corresponding elements
of L are obtained by l_{ik} = s_i / u_{kk}.
Crout's algorithm with partial pivoting and implicit equilibration can be
organised as follows. For implicit equilibration, we need to store the maximum
of each row in a separate array s. Further, in all cases, the sums can be accu-
mulated in a double precision variable for higher accuracy.
1. For all values of i from 1 to n find max_{1≤j≤n} |a_{ij}| and store it in s_i.

2. For all values of k from 1 to n in succession perform steps (2a) to (2d)


   2a. For each value of i from k to n compute

           l_{ik} = a_{ik} - Σ_{j=1}^{k-1} l_{ij} u_{jk},      (3.42)

       and overwrite l_{ik} on a_{ik}. Note that no calculations are required for
       k = 1.
   2b. Find the largest of the quantities |l_{ik}/s_i|, (i = k, ..., n). If |l_{k'k}/s_{k'}|
       is the maximum, then store the value of k' in i_k. If two or more of
       these numbers have maximum magnitude, then choose k' to be the
       smallest of its possible values.
   2c. Interchange a_{ki} with a_{k'i} for i = k + 1, ..., n, and l_{ki} with l_{k'i} for
       i = 1, ..., k. Also interchange s_k with s_{k'}. Note that if k' = k, no
       interchange is required.
   2d. For all values of j from k + 1 to n, compute

           u_{kj} = ( a_{kj} - Σ_{i=1}^{k-1} l_{ki} u_{ij} ) / l_{kk},      (3.43)

       and overwrite u_{kj} on a_{kj}.

At the end of the algorithm, the lower and upper triangular factors will be
overwritten on the original matrix A. The diagonal elements of upper triangular
factor U are unity and hence need not be stored explicitly.
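The following sketch transcribes this algorithm into Python with NumPy (our own illustration of the scheme, not the CROUT routine of Appendix B). The factors overwrite the matrix: the lower triangle including the diagonal holds L, the strict upper triangle holds U (whose unit diagonal is not stored), and the array ik records the interchanges.

import numpy as np

def crout(a):
    # Crout decomposition with partial pivoting and implicit equilibration.
    # a must be a floating point array; it is overwritten by the factors.
    n = a.shape[0]
    s = np.max(np.abs(a), axis=1)                          # step 1: row maxima
    ik = np.arange(n)
    for k in range(n):
        a[k:, k] -= a[k:, :k] @ a[:k, k]                   # step 2a: kth column of L, eq. (3.42)
        kp = k + np.argmax(np.abs(a[k:, k]) / s[k:])       # step 2b: scaled pivot search
        ik[k] = kp
        if kp != k:                                        # step 2c: interchange rows and s
            a[[k, kp]] = a[[kp, k]]
            s[k], s[kp] = s[kp], s[k]
        if a[k, k] == 0.0:
            raise ZeroDivisionError("matrix is singular to working precision")
        a[k, k+1:] = (a[k, k+1:] - a[k, :k] @ a[:k, k+1:]) / a[k, k]   # step 2d, eq. (3.43)
    return a, ik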
Having obtained the LU decomposition, the system of linear equations
can be solved using the forward and back-substitution

Ax = LUx = Ly = b, (3.44)

where U x = y and it is assumed that the elements of b have been interchanged
appropriately. Thus, first we can solve (3.44) involving a lower triangular matrix
for y and then solve the other equation involving the upper triangular matrix for
x. For example, using Crout's algorithm the process of solution for a given
right-hand side can be accomplished by the following algorithm:
1. For each value of k from 1 to n in succession, perform steps (1a) and (1b).
   1a. Interchange b_k and b_{k'}, where k' is obtained from i_k.
   1b. Compute y_k = ( b_k - Σ_{j=1}^{k-1} l_{kj} y_j ) / l_{kk}, and overwrite on b_k.
2. For each value of k from n to 1 in succession, compute

       x_k = y_k - Σ_{j=k+1}^{n} u_{kj} x_j,      (3.45)

   and overwrite on b_k.

For Doolittle's algorithm it may be noted that l_{kk} = 1 and hence in step (1b)
division is not required, while since in general u_{kk} ≠ 1, in step (2) the right-hand
side of the expression should be divided by u_{kk}.
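Continuing the sketch above, a given right-hand side can be processed as follows (again a Python/NumPy illustration of the algorithm just listed, assuming the crout sketch or any routine with the same storage convention).

import numpy as np

def crout_solve(a, ik, b):
    # Forward and back-substitution with the factors stored by crout() above.
    n = len(b)
    b = b.copy()
    for k in range(n):                                 # steps 1a, 1b: solve L y = b
        kp = ik[k]
        b[k], b[kp] = b[kp], b[k]
        b[k] = (b[k] - np.dot(a[k, :k], b[:k])) / a[k, k]
    for k in range(n - 1, -1, -1):                     # step 2: solve U x = y, eq. (3.45)
        b[k] -= np.dot(a[k, k+1:], b[k+1:])
    return b

For example, lu, ik = crout(A.astype(float)) followed by x = crout_solve(lu, ik, b.astype(float)) solves A x = b; note that A is overwritten by its factors.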
EXAMPLE 3.3: Solve the system of equations in Example 3.2 using Crout's algorithm.
The results obtained using 24-bit arithmetic are summarised in Table 3.2. For illus-
tration we have given the matrix after every major step in triangular decomposition using
Crout's algorithm. The results can be compared with those in Table 3.1, obtained using
Gaussian elimination. Once again the elements of lower triangular factor are in bold face. It
may be noted that the subroutine CROUT will give different results, since it uses implicit
equilibration, while for these results it is assumed that the matrix is equilibrated. It can be
seen that kth major step modifies only the elements in the kth row and kth column, while
other elements are unchanged apart from the interchanges in rows.
It can be seen that the diagonal elements in the two cases are the same, but the off-
diagonal elements in the corresponding row of U and the column of L differ by a factor given
by the diagonal element. If we use Doolittle's algorithm, then the result will be identical to
that obtained using Gaussian elimination. The difference here is because of the fact that,

Table 3.2: Triangular decomposition by Crout's algorithm

First major step (rows 1 and 4 interchanged)


0.8653000 0.2502022 0.9551601 0.8231827 0.5969028
0.2317000 0.6123000 0.4137000 0.6696000 0.4753000
0.4283000 0.8176000 0.4257000 0.8312000 0.2167000
0.7321000 0.4135000 0.3126000 0.5163000 0.8132000
Second major step (rows 2 and 3 interchanged)
0.8653000 0.2502022 0.9551601 0.8231827 0.5969028
0.4283000 0.7104384 .02337283 0.6737120 -.05483017
0.2317000 0.5543281 0.4137000 0.6696000 0.4753000
0.7321000 0.2303270 0.3126000 0.5163000 0.8132000
Third major step (rows 3 and 4 interchanged)
0.8653000 0.2502022 0.9551601 0.8231827 0.5969028
0.4283000 0.7104384 .02337283 0.6737120 -.05483017
0.7321000 0.2303270 -0.3920561 0.6160498 -0.9917874
0.2317000 0.5543281 0.1794332 0.6696000 0.4753000
Fourth major step
0.8653000 0.2502022 0.9551601 0.8231827 0.5969028
0.4283000 0.7104384 .02337283 0.6737120 -.05483017
0.7321000 0.2303270 -0.3920561 0.6160498 -0.9917874
0.2317000 0.5543281 0.1794332 -.00512874 -106.33230

Gaussian elimination yields a unit lower triangular matrix, while Crout's algorithm produces
a unit upper triangular factor. Using the LU decomposition, the following solution can be
obtained

    x_1 = 8.973590,   x_2 = 70.07465,   x_3 = 64.51421,   x_4 = -106.3323.      (3.46)

This can again be compared with the exact solution given in Example 3.2. It may be noted
that, this solution is only marginally different from that obtained using Gaussian elimination.

3.4 Error Analysis


It is clear that if all computations are carried out exactly, then the result ob-
tained using Gaussian elimination would be exact. Thus, there is no truncation
error in Gaussian elimination. However, because of finite accuracy of arithmetic
operations, there will be some roundoff error. Let x_c be the computed solution
of Ax = b, then an obvious measure of the error in the computed solution is
the magnitude of the residual vector

    r = A x_c - b.      (3.47)

If r is small we may think that we have obtained an accurate solution. However,


as can be seen from the example in Section 2.4, the residual is not a reliable
measure of accuracy. For ill-conditioned problems the residual may come out
to be small, even when the computed solution is completely different from the
exact solution.
To understand this problem, let us denote the exact solution of Ax = b
by x_t; then we get

    A(x_c - x_t) = r.      (3.48)

If A is nonsingular, then we get

    x_c - x_t = A^{-1} r.      (3.49)
Clearly, if some elements of A^{-1} are large, then x_c - x_t may come out to be
large, even when all elements of r are small. Also in this case, a small change
in the coefficients of A or the right-hand vector b can produce a large change
in the solution. Such a system of equations is termed ill-conditioned and it is
obvious that a small error in determining the elements of A or b will produce
a large error in the result. For example, if the elements of A or b are obtained
experimentally or through some calculations, then the inherent errors in their
values will give rise to substantial uncertainty in the solution. Unless, these
coefficients are known to sufficient accuracy, there may be no point in trying to
find an accurate solution of such a system of equations. Thus, for our example in
Section 2.4, unless the coefficients are known to an accuracy of better than 10^{-8},
the solution is bound to be completely unreliable. Solution of ill-conditioned
problems to as much accuracy as the data warrants, is one of the most difficult
problems in linear algebra.
Let us try to get some measure of ill-conditioning. An obvious measure is
provided by the magnitude of the elements of A^{-1}, but that may be misleading,
unless the original matrix A is normalised, such that all the elements are of
the order of unity or less. Thus, if both equations in (2.91) are multiplied by
10^8, then all elements of A^{-1} will be of the order of unity, but still the solution
will be no better. To allow for such difficulties, the condition number K(A) of
a matrix A is defined by

    K(A) = ||A||_2 ||A^{-1}||_2 ≥ 1,      (3.50)

where equality is realised if A is a multiple of a unitary matrix. Here ||A||_2 is the
spectral norm of A, which can be defined as the square root of the maximum
eigenvalue of A^† A. For a real symmetric matrix

    K(A) = |λ_1| / |λ_n|,      (3.51)

where λ_1 and λ_n are respectively the eigenvalues with the largest and the
smallest magnitude.
To understand the significance of this definition of the condition number, con-
sider the system of equations Ax = b. If b is perturbed to b + δb, and x + δx
is the corresponding solution, then

    A(x + δx) = b + δb,      (3.52)

and it can be proved that {26}

    ||δx||_2 / ||x||_2 ≤ K(A) ||δb||_2 / ||b||_2,      (3.53)
where for a vector b with n components, the norm is defined by

    ||b||_2 = ( Σ_{i=1}^{n} |b_i|^2 )^{1/2}.      (3.54)

Similarly, we can consider a perturbation in the matrix A, to get

    (A + δA)(x + δx) = b.      (3.55)

In this case, it can be proved that {26}

    ||δx||_2 / ||x||_2 ≤ ( K(A) ||δA||_2 / ||A||_2 ) / ( 1 - K(A) ||δA||_2 / ||A||_2 ),      (3.56)

where it is assumed that the perturbation δA is small enough to ensure

    K(A) ||δA||_2 / ||A||_2 < 1.      (3.57)

This condition ensures that the matrix (A + δA) will be nonsingular. Thus,
the significance of the condition number is obvious from these results. A system
of linear equations involving a matrix with a large condition number would be
ill-conditioned.
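The condition number is easily evaluated numerically. As a hedged illustration in Python with NumPy (not part of the text's routines), the following lines compute K(A) in the spectral norm for the n × n Hilbert matrices which reappear in Example 3.5; the values agree with those quoted there.

import numpy as np

def spectral_condition(A):
    # K(A) = ||A||_2 ||A^{-1}||_2 = sigma_max / sigma_min from the singular values.
    sv = np.linalg.svd(A, compute_uv=False)
    return sv[0] / sv[-1]

for n in (4, 6, 8, 10):
    i = np.arange(1, n + 1)
    hilbert = 1.0 / (i[:, None] + i - 1.0)             # a_ij = 1/(i + j - 1)
    print(n, spectral_condition(hilbert))              # roughly 1.5e4, 1.5e7, 1.5e10, 1.6e13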
It can be easily seen that multiplying all equations by the same constant
will not change the condition number of the system, but multiplying one equa-
tion of the system by a constant can change the condition number. Similarly,
multiplying one column by a constant can also change the condition number of
the system. The process of equilibration mentioned in Section 3.2 can reduce
the condition number of a system under appropriate circumstances. However,
equilibration may not necessarily transform an ill-conditioned problem into
well-conditioned. That will depend on how the matrix elements were obtained.
To illustrate this, let us consider a simple example, in which the coeffi-
cients of a 2 x 2 matrix are determined in the following manner. Each of the
four elements of the matrix A is determined by measuring three independent
quantities, each of which is of the order of unity, and then adding these quan-
tities together. Suppose the error in the individual measurements are of the
order of 10^{-4}, and that the resulting matrix is

    A = ( 1.0000    0.0001 )      (3.58)
        ( 1.0000   -0.0001 )

Obviously, cancellation has occurred while computing the elements of the sec-
ond column. If the second column of this matrix is multiplied by 10^4, it will
become √2 times an orthogonal matrix, which is well-conditioned. However,
the set of equations defined by the matrix is clearly ill-conditioned, since there
could be large uncertainties in the elements of the second column. Hence, the com-
puted solution has little significance.

Now consider a second situation, where the elements of the same matrix
are obtained from separate measurements, each of which has an accuracy of
1 part in 10^4. In that case, equilibrating the matrix will clearly remove the
ill-conditioning introduced, probably because of the use of unnatural units for
measuring the quantities in the second column. Similar considerations apply to
matrices whose coefficients have been computed numerically, using expressions
which involve roundoff errors. There is a considerable difference between situa-
tions, where the matrix in (3.58) had small elements in second column, because
of cancellation of nearly equal and opposite numbers, and the one where these
elements had a high relative accuracy.
With this background on ill-conditioning, we turn to our problem of esti-
mating a bound on roundoff error in Gaussian elimination. One approach would
be to consider the worst possible case of roundoff at each stage and to derive a
bound based on accumulation of these errors. Such bounds are very difficult to
calculate, since the roundoff at each stage is a complicated function of roundoff
at all the previous stages. Further, such bounds are rather pessimistic. Instead,
we can use the concept of backward error analysis due to Wilkinson, which has
been very successful in matrix computations. Here we consider the computed
solution Xc to be the exact solution of some perturbed system

(A + 8A)x c = b, (3.59)

and then find a bound on the perturbation 8A. One advantage of this approach
is that it gives us an estimate of the system that was actually solved, which
enables us to judge the relative importance of roundoff error in the calcula-
tions and the inherent errors in the coefficients. However, the most important
advantage of this approach is that, the analysis is substantially simpler and in
general, the bound so obtained is much less conservative.
Let us consider the LU decomposition of A. If the computed values of L
and U give a product A + F, then it can be proved that if |a^{(k)}_{ij}| ≤ a for all
relevant i, j, k, then (Wilkinson, 1988)

    |f_{ij}| ≤ 2.06 ℏ a (i - 1),   for j ≥ i;
    |f_{ij}| ≤ 2.06 ℏ a j,         for i > j.      (3.60)

It is clear from these bounds that it is important to prevent the elements a^{(k)}_{ij} from
increasing rapidly with k. In fact, this is the principal reason for introducing
pivoting. However, even if |a_{ij}| ≤ 1 for all i, j, with partial pivoting we can
only guarantee that |a^{(k)}_{ij}| ≤ 2^k. This follows from the inequalities

    |a^{(k)}_{ij}| = |a^{(k-1)}_{ij} - n_{ik} a^{(k-1)}_{kj}| ≤ |a^{(k-1)}_{ij}| + |a^{(k-1)}_{kj}|.      (3.61)

Unfortunately, it is possible to construct matrices for which this bound is ac-
tually achieved {19}, and so it would appear that for large n, the bounds on
|f_{ij}| can be rather large.
To take care of such situations, we can use complete pivoting, where at the
kth major step, the pivot is selected to be the element of maximum magnitude
in the whole square array of order n - k + 1. If a = max |a_{ij}|, then Wilkinson
has shown that with complete pivoting

    |a^{(k-1)}_{kk}| ≤ k^{1/2} (2^1 3^{1/2} 4^{1/3} · · · k^{1/(k-1)})^{1/2} a = f(k) a.      (3.62)

The function f(k) defined above is much smaller than 2^k for large k (e.g.,
f(100) ≈ 3570) and further this bound is very conservative. In fact, no ma-
trix has yet been found for which |a^{(k-1)}_{kk}| > k a.
From these theoretical considerations, it might be tempting to conclude
that it is advisable to use complete pivoting in all cases. However, partial piv-
oting is preferable in most cases, because of the following reasons:
1. It is easier to organise than complete pivoting.
2. For sparse matrices with nonzero elements arranged in a special pattern,
this pattern may be preserved by partial pivoting, but destroyed by com-
plete pivoting.
3. Experience suggests that any substantial increase in size of elements of
successive Ak is extremely rare, even with partial pivoting. In fact, if the
matrix Ao is at all ill-conditioned, then usually the elements of successive
Ak show a steady downward trend.
For Doolittle's or Crout's method, the error bounds will be similar if all opera-
tions are carried out in single precision. However, if double length accumulation
of sums is used, then the bounds reduce significantly. Here, neglecting the small
roundoff error of the order of ℏ^2 in calculating the sums, the only error occurs
when the sum is finally rounded to a single precision value, before storing in
the required array. In this case, it can be shown (see Ralston and Rabinowitz,
2001) that the elements of the perturbation matrix are bounded by

    |f_{ij}| < 1.06 ℏ a,      (3.63)

where a is the bound defined earlier.
The bounds above are for the elimination or the LU decomposition stage
only. To complete the process, we must also include the forward and back-
substitution. Let us first consider the solution of Ly = b by forward substitu-
tion. Once again, it can be shown (Wilkinson, 1994) that the computed solution
is the exact solution of the system (L + δL)y = b, where the elements of the
perturbation matrix δL are bounded by

    |δl_{ij}| < 1.06 ℏ (i + 1 - j) c,      c = max_{i,j} |l_{ij}|.      (3.64)

If double length accumulation of intermediate results is used, then the corre-
sponding bound is 1.06 ℏ δ_{ij} |l_{ii}|. Here δ_{ij} is the Kronecker delta, defined by

    δ_{ij} = 1, if i = j;   δ_{ij} = 0, if i ≠ j.      (3.65)
Similar results can be obtained for back-substitution. If partial pivoting is used,
then the maximum element c for the lower triangular matrix L is 1, while that
for the upper triangular matrix U is the quantity a defined earlier.
Combining these results, we can find the equivalent perturbation in the
original matrix A for the entire process of solution, consisting of the triangular
decomposition, forward and back-substitution. Thus, the computed solution is
the exact solution of the system

    (L + δL)(U + δU)x = (A + F + (δL)U + L(δU) + (δL)(δU))x = b,      (3.66)

where we have used the result LU = A + F. If the sums are accumulated in
double precision, then it can be shown that the total perturbation matrix E
for Crout's algorithm has elements bounded by

    |e_{ij}| < (2.06 + δ_{ij} - δ_{1j}) ℏ a.      (3.67)

If the entire computation is carried out using single precision arithmetic, then
the bound is markedly weaker (Wilkinson, 1994) and the maximum perturba-
tion in some of the elements could be of the order of n^2 ℏ a. Hence, the advantage
of double length accumulation is clear.
This analysis gives a bound on equivalent perturbation of the matrix el-
ements, while in practice we will be interested in estimating the error in the
computed solution. This backward error estimate can be transformed to the
forward error estimate using (3.56). Unfortunately, it requires the knowledge
of the condition number of the matrix which may not be known. As we have
already seen that for ill-conditioned systems, even a small perturbation may
produce results which are far from the true results. We shall now consider a
procedure to estimate and possibly improve the accuracy of the solution of an
ill-conditioned system, assuming that A and b as they are represented in the
computer are exact.
Let us consider the residual vector

    r^{(1)} = b - A x^{(1)},      (3.68)

where x^{(1)} is the calculated value of x. It can be proved that in all cases, the
norm of this residual vector is of the order of the equivalent perturbations in the
matrix A. The exact correction (x - x^{(1)}) satisfies the equation

    A (x - x^{(1)}) = r^{(1)}.      (3.69)

We can solve this equation to get the correction, which in turn gives an improved
value of the solution x. This procedure can be repeated till a satisfactory ap-
proximation has been achieved. This iteration can be defined as follows:

    A = LU,   x^{(0)} = 0,      (3.70)

    r^{(j)} = b - A x^{(j)},   LU d^{(j)} = r^{(j)},   x^{(j+1)} = x^{(j)} + d^{(j)},   (j = 0, 1, 2, ...).      (3.71)
In each iteration, the residual r^{(j)} corresponding to the current x^{(j)} is computed
and the correction d^{(j)} to be added to x^{(j)} is determined by solving the system
of linear equations using the LU decomposition obtained in the beginning. This
process is referred to as iterative refinement.
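A sketch of this iteration in Python is given below (an illustration only, not the CROUTH routine of Appendix B; it uses the LU solver from SciPy, and the function name and tolerance are our own). The factorisation and the solves for the corrections are carried out in single precision, while each residual is accumulated in double precision, which is the essential ingredient.

import numpy as np
from scipy.linalg import lu_factor, lu_solve   # any reusable LU factorisation will do

def iterative_refinement(A, b, niter=10, tol=1.0e-6):
    # Iterative refinement (3.70)-(3.71): factorise once in single precision,
    # compute each residual in double precision, and accumulate corrections.
    lu, piv = lu_factor(A.astype(np.float32))          # A = LU, done once
    A64, b64 = A.astype(np.float64), b.astype(np.float64)
    x = np.zeros_like(b64)                             # x^(0) = 0
    for _ in range(niter):
        r = b64 - A64 @ x                              # residual, eq. (3.68), in double precision
        d = lu_solve((lu, piv), r.astype(np.float32)).astype(np.float64)
        x += d                                         # x^(j+1) = x^(j) + d^(j)
        if np.linalg.norm(d) <= tol * np.linalg.norm(x):
            break                                      # correction below working accuracy
    return x

A practical routine would also abandon the iteration if the corrections d^{(j)} start growing, as discussed later in this section.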
The only reason why the first solution x^{(1)} is not exact is the presence of
roundoff errors. Since roundoff errors are inevitable in the iterative refinement
also, it is by no means obvious that the successive x^{(j)} will be improved so-
lutions. Wilkinson has shown that if A is not too ill-conditioned, the x^{(j)} will
converge to the correct solution to working accuracy, provided at each stage the
residual is computed using double length accumulation or a double precision
variable, to accumulate the sum for each element. The terms used here require
some explanation. We shall say that A is too ill-conditioned for the precision
of computation that is being used, if ||A^{-1}||_2 ||F||_2 > 1, where F is the pertur-
bation matrix defined earlier. In this case, A + F could be singular and x^{(1)}
may not agree with the true solution in any of its significant figures. By the correct
solution here, we mean the correct solution of the matrix equation using the
matrix and the right-hand side as represented in the computer. It should be
noted that the rounding of these coefficients to computer word length itself
will introduce errors, which cannot be eliminated by any means other than us-
ing a more accurate representation. We say that x^{(j)} is the correct solution to
working accuracy, if

(3.72)

Apart from improving the accuracy, the procedure of iterative refinement


also gives an estimate of roundoff error expected in the calculations. In partic-
ular, if the matrix is computationally singular, the iteration will not converge
to any reasonable accuracy. In general, the correction in the very first step can
give an estimate of error to be expected due to rounding of the coefficients of
the matrix.
For practical application of this technique, we must first remember to save
a copy of the matrix A and the right-hand side vector b, which are required for
calculating the residuals. This is necessary, since the algorithms that we have
described for Gaussian elimination or the triangular decomposition overwrite
the original matrix and vector. It should be noted that once the triangular
decomposition is obtained, solution of an additional set of equations requires
only of the order of n^2 floating point operations. Hence, a few iterations will
not add significantly to the cost of computations. However, doing too many
iterations also does not help, since if the matrix is not too ill-conditioned, then
a few iterations will be enough; while if the matrix is extremely ill-conditioned
the process will not converge in any case. In the intermediate case, when the
condition number of the matrix is such that the iteration does converge slowly,
it may not be worthwhile to carry out all iterations required to achieve the
specified accuracy, since for such ill-conditioned matrices the error introduced
by rounding of the coefficients itself will be much more than the error in solution
after the first few iterations. If the matrix and the right-hand vector are such
that their coefficients can be represented exactly in the computer word, then it
may be better to allow the iteration to converge.
In any case, in a practical procedure we have to allow for the possibility
that the iteration will not converge. Hence, only a reasonable finite number
of iterations should be allowed. Further, there should be some check for di-
vergences, so that iteration may be terminated before the maximum count is
reached. There is no foolproof method to check for divergences, but in practice
if we find that at any stage the magnitude of the correction vector is increasing
with the number of iterations, then the process will most probably not converge
and we may terminate the iterative refinement at that step itself. This tech-
nique is implemented in the subroutine CROUTH in Appendix B. It should be
noted that the iteration will not converge, unless the residue is calculated us-
ing a higher precision arithmetic as compared to what is used for the solution.
Thus this process cannot be applied if all calculations are done using the highest
precision arithmetic that is available on the computer.
EXAMPLE 3.4: Apply the method of iterative refinement to the system in Example 3.2.
Using Crout's algorithm the triangular decomposition was obtained in Example 3.3.
Using this factorisation, we can apply the iterative refinement. The successive solutions and
the corresponding residuals are given in Table 3.3. It can be seen that although the order of
magnitude of the residuals remains the same, the result converges to the correctly rounded
solution of the system as represented in the computer. It may be noted that this is not the
same as the correct solution of the original problem (see Example 3.2). In fact, the error in the
final result is comparable to (actually somewhat larger than) that in the first approximation. This
example illustrates that the error due to rounding of the coefficients is comparable to the error
in the solution process. If the coefficients are supplied in single precision, but the calculations
are performed using double precision arithmetic, then we once again get the same result
for the final solution. Thus the error in the computed solution is due to the roundoff error in
representing the coefficients of the matrix and the right-hand side.

Table 3.3: Iterative refinement of solution of linear equations

          Solution                                          Residuals
  8.973590   70.07465   64.51421   -106.3323      -1.2×10⁻⁷    1.2×10⁻⁷   -4.9×10⁻⁶    2.5×10⁻⁶
  8.973650   70.07516   64.51470   -106.3331      -5.2×10⁻⁸   -2.1×10⁻⁷   -1.6×10⁻⁸   -1.0×10⁻⁶
  8.973650   70.07516   64.51470   -106.3331      -5.2×10⁻⁸   -2.1×10⁻⁷   -1.6×10⁻⁸   -1.0×10⁻⁶

EXAMPLE 3.5: Consider the solution of the equation Ax = b, where A is the n × n principal
minor of the Hilbert matrix, defined by (consider n = 4, 6, 8, 10)
$$a_{ij} = \frac{1}{i+j-1} \quad (i, j = 1, \ldots, n) \qquad \text{and} \qquad b_i = \sum_{j=1}^{n} a_{ij} \quad (i = 1, \ldots, n). \tag{3.73}$$

This is a well-known example of an ill-conditioned system, even though the matrix involved
is real symmetric and positive definite. The successive iterates using 24-bit arithmetic and
Table 3.4: Iterative refinement of solution of linear equations involving n x n Hilbert matrices

n = 4
  .9999998  1.000001   .9999977  1.000001                                                      3.7×10⁻⁸
  .9999989  1.000013   .9999687  1.000021                                                      3.6×10⁻¹³
n = 6
  1.000049  .9984876  1.010757  .9710892  1.032644  .9869233                                   3.3×10⁻⁸
  1.000086  .9974023  1.018315  .9509940  1.055208  .9779097                                   2.1×10⁻⁸
  1.000086  .9973931  1.018379  .9508237  1.055400  .9778328                                   4.3×10⁻⁸
  1.000086  .9973931  1.018379  .9508222  1.055401  .9778321                                   7.2×10⁻⁸
n = 8
  .9992604  1.040400   .4727844  3.842493  -6.627931  11.77079  -6.66   3.16                   4.2×10⁻⁸
  .9989188  1.045198   .5371655  2.967138  -3.140440   5.525196 -1.43   1.50                   9.8×10⁻⁸
n = 10
  .9971536  1.123224  -.2232089  5.244606  -2.543540  -8.586297  21.53  -7.24  -6.09  5.79     4.0×10⁻⁸
  .9971765  1.122655  -.2194387  5.233353  -2.522662  -8.615644  21.51  -7.23  -6.12  5.81     1.3×10⁻⁷
  .9971780  1.122596  -.2189216  5.232026  -2.523790  -8.605571  21.54  -7.22  -6.12  5.80     3.2×10⁻⁷
  .9971775  1.122614  -.2190768  5.232379  -2.523170  -8.609286  21.54  -7.23  -6.12  5.81     1.7×10⁻⁷

employing Crout's algorithm for LU decomposition are given in Table 3.4. The last column
in each row is the magnitude of the residual vector at each stage. The correct solution of
this system of equations is xᵢ = 1. It can be seen that the residue is always very small, even
though the solution is far from correct. The matrix becomes highly ill-conditioned for large n
and it is impossible to have any estimate for the solution at all. The iterative refinement does
not converge for n = 8, 10. Even for smaller values of n, although the final solution is not close
to the true value, the iteration has converged to some solution, which should represent the
solution of the system of equations as represented in the computer. Further, in most cases the
correction at the first step of iterative refinement gives an estimate of the error in the solution as
compared to the final value. Unfortunately, this is not the same as the true solution of the real
system of equations. It may be noted that for these calculations the matrix elements and right-hand
sides were computed using double precision and the values were then converted to single
precision. Thus the matrix and right-hand side are correctly rounded versions of the exact
values. The results will be different if the summation for calculating bᵢ is done using single
precision arithmetic. The condition numbers for these matrices are approximately 1.5×10⁴,
1.5×10⁷, 1.5×10¹⁰ and 1.6×10¹³ for n = 4, 6, 8 and 10, respectively.
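The condition numbers quoted above are easy to verify numerically. The short numpy check below is only a sketch, assuming that the double-precision estimate returned by np.linalg.cond is adequate for the purpose; the single-precision solutions it prints will not reproduce Tables 3.4 and 3.5 exactly, since those were obtained with Crout's algorithm in 24-bit arithmetic.

```python
import numpy as np

for n in (4, 6, 8, 10):
    i = np.arange(1, n + 1)
    A = 1.0 / (i[:, None] + i[None, :] - 1)      # Hilbert matrix, a_ij = 1/(i+j-1)
    b = A.sum(axis=1)                            # b_i = sum_j a_ij, so the exact solution is x_i = 1
    x = np.linalg.solve(A.astype(np.float32), b.astype(np.float32))
    print(n, np.linalg.cond(A), np.abs(x - 1.0).max())
```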
The failure to get accurate values in this case is because the coefficients of the matrix and
the right-hand sides cannot be represented exactly in the computer word. The roundoff error
due to this initial rounding is by itself sufficient to give a sizable error. To demonstrate this,
we can multiply all the equations by 360360, which makes all the coefficients for n ≤ 8 integers
that can be represented exactly in the computer. Table 3.5 gives the results obtained using
this scaled matrix. The last column in this table gives the magnitude of the residual vector.
In this case, since the coefficients are of the order of 10⁵, the residue is
correspondingly larger. It can be seen that for n = 4 and 6, even though the error in the first step
is comparable to that in Table 3.4, the successive iterations converge to the exact solution.
For n = 8, even though all coefficients are exact, the iteration does not converge, as before.
This is clearly because the matrix is too ill-conditioned for the precision of the
calculations and we cannot expect the iterative refinement to work. Hence, if the coefficients
of the matrix and the right-hand side can be represented exactly in the computer and if the
matrix is not too ill-conditioned for the precision of the arithmetic being used, it is possible
to get the correctly rounded solution of the system of equations using the method of iterative
refinement.

Table 3.5: Iterative refinement of solution of linear equations involving n x n Hilbert matrices
(scaled by 360360)

n=4
0.9999979 1.0000187 0.9999588 1.0000256 0.039
1.0000000 1.0000000 1.0000000 1.0000000 0.000
n=6
0.9999700 1.0009345 0.9933591 1.0178473 0.9798034 1.0081085 0.011
1.0000002 0.9999942 1.0000410 0.9998898 1.0001245 0.9999501 0.012
1.0000000 1.0000000 0.9999998 1.0000007 0.9999992 1.0000004 0.0044
1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 0.0000
n=8
0.9986261  1.0748495  0.0068527    6.4514337  -13.858338   22.249147  -14.26    5.341   0.014
1.0041575  0.7758108  3.9557774  -15.154346    44.892052  -61.621269   45.89  -11.75    0.137

From these examples, it is clear that the iterative refinement may not
necessarily give more accurate results than what is obtained in the first step,
if the system of equations is ill-conditioned. Nevertheless, it gives an estimate
of the error. This is because of the fact that for such problems, the error in-
troduced by the fact that the coefficients of matrix in the computer are only
rounded versions of the exact values itself is very large. This error is com-
pletely unavoidable, unless higher precision is used to represent the matrix and
the right-hand side vector. The above process only gives correct solution to
the problem as represented in the computer. This technique is very effective
for those matrices for which LU decomposition with partial pivoting gives very
large pivots {19}, {20}. As explained earlier for such matrices, there could be
considerable roundoff error in LU decomposition, even though the matrix is
well-conditioned. The use of complete pivoting will avoid this problem, but
that is very expensive and cumbersome. On the other hand, if the process of
iterative refinement is used, it is possible to improve the results significantly
without much extra effort. In such cases, the residues will actually decrease as
the iteration proceeds.
Estimating condition number of a matrix is a rather difficult problem,
since it requires the knowledge of the inverse matrix which is not known until
the solution is complete. However, there are a few symptoms which may be
exhibited during solution by Gaussian elimination with partial pivoting and
which may indicate ill-conditioning. They are the emergence of (1) a
small pivotal element, (2) a large computed solution and (3) a large residual
vector. In order to detect the small or large numbers, it is essential to have
the equations correctly normalised or equilibrated. Unfortunately, it is possible
to invent pathologically ill-conditioned sets of equations which do not display
any of these symptoms. Hence, there is no reliable technique to estimate the
condition number of the matrix from the computed solution. In Section 3.6, we
describe the singular value decomposition, which gives a very reliable estimate
of the condition number of a given matrix. Unfortunately, this technique may
require an order of magnitude more effort as compared to that using Gaussian
elimination. However, in problems where ill-conditioning is suspected it will
be necessary to use such techniques to estimate the reliability of computed
solution. Alternately, the condition number can be estimated by perturbing the
right-hand side and solving another system of equations with the same matrix.
Here again it is possible that all perturbations may not lead to a significantly
different solution, even if the matrix is ill-conditioned.

3.5 Matrix Inversion


If A⁻¹ is the inverse of A, then we have AA⁻¹ = I. Thus, if xⱼ is the jth
column of A⁻¹ and iⱼ is the jth column of the identity matrix of order n, then
the above equation can be expressed as
$$A\mathbf{x}_j = \mathbf{i}_j \qquad (j = 1, 2, \ldots, n). \tag{3.74}$$

If we solve these n systems of equations, we can find the inverse of A. It can be
easily seen that the whole process requires approximately 4n³/3 multiplications
and about the same number of additions. It may be noted that if we take care
of the fact that the first j - 1 elements on the right-hand side are zero, then
the number of operations can be reduced to approximately n³. Gauss-Jordan
elimination applied to all the right-hand sides simultaneously can also give the
inverse in a slightly larger number of operations; the extra effort goes into eliminating
the elements above the diagonal. Alternately, once the LU decomposition is
obtained, the inverse can be calculated using A⁻¹ = U⁻¹L⁻¹. It can be shown
that {1} the inverse of a lower (upper) triangular matrix is also lower (upper)
triangular. The inverse matrices L⁻¹ and U⁻¹ can be easily calculated using
the usual forward or back-substitution, requiring approximately n³/3 multiplications
{31}. Thus, once again the number of arithmetic operations required
is of the same order.
Having given a method for matrix inversion, we would like to point out
that there is usually no good reason for calculating the inverse. It may appear
that, if we have to solve a large number of systems of linear equations involving
the same matrix A, but different right-hand sides, then it may be better to
compute A⁻¹ first and then compute A⁻¹b for each vector b. But it should be
noted that computation of A⁻¹b requires n² multiplications, which is exactly
the same as the number of multiplications and divisions required for each b after the
triangular decomposition of A is obtained. Of course, division usually requires
more time than multiplication, but only n divisions are required during back-substitution.
On the other hand, calculation of A⁻¹ itself requires of the order
of 2n³/3 additional multiplications and divisions. In fact, we will solve n extra
systems of equations while computing the inverse. Further, since the process
of matrix inversion involves additional computations, the roundoff error may
be larger than that in triangular decomposition. Hence, once the triangular
decomposition is known, no advantage in calculating A⁻¹b can be gained
by computing A⁻¹ explicitly. Similar remarks apply to products of the form
A⁻¹B, where B is an n × m matrix, since in this case each column of B can
be treated separately. However, if the matrix elements of A⁻¹ have some
physical significance, then it will be necessary to compute the inverse matrix
A⁻¹ explicitly. This situation arises frequently in least squares calculations.
Similarly, the matrix updating techniques for solution of a system of nonlinear
equations considered in Section 7.17 also require explicit calculation of the
inverse matrix.
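The point of the preceding paragraphs can be seen in a few lines. The sketch below (Python with scipy; the matrix size, the number of right-hand sides and the random test data are arbitrary) reuses a single LU factorisation for many right-hand sides and compares the result with the route via the explicit inverse: both give essentially the same answer, but forming the inverse costs roughly three times as much as the factorisation alone.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(1)
n, m = 300, 20
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, m))            # m different right-hand sides

lu, piv = lu_factor(A)                     # ~n^3/3 multiplications, done once
X1 = lu_solve((lu, piv), B)                # ~n^2 multiplications per right-hand side

Ainv = np.linalg.inv(A)                    # considerably more work than the factorisation
X2 = Ainv @ B                              # also ~n^2 per right-hand side, so nothing is gained

print(np.abs(X1 - X2).max())               # the two answers agree to roundoff level
```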

3.6 Singular Value Decomposition


In the previous sections, we have seen that if the matrix is singular or nearly
singular, then numerical solution becomes rather unreliable. In this section,
we consider a powerful technique to detect such problems and obtain the best
possible solution. Apart from this, the singular value decomposition (SVD) can
also be used to obtain the least squares solution of overdetermined systems of
linear equations.
Let A be a real m × n matrix with m ≥ n; then a theorem in linear
algebra (Forsythe and Moler, 1967) assures us of the so-called singular value
decomposition, given by
$$A = U \Sigma V^T, \tag{3.75}$$
where
$$U^T U = V^T V = V V^T = I_n \qquad \text{and} \qquad \Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_n). \tag{3.76}$$
Here U is an m × n matrix, Σ is an n × n diagonal matrix, while V is an n × n
orthogonal matrix. The matrix U consists of n orthonormalised eigenvectors
associated with the n largest eigenvalues of AAᵀ, and the matrix V consists
of the orthonormalised eigenvectors of AᵀA. The diagonal elements of Σ are
the nonnegative square roots of the eigenvalues of AᵀA; they are referred to as
singular values. We shall assume that the singular values are sorted such that
$$\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_n \ge 0. \tag{3.77}$$
Thus, if rank(A) = r < n, then σᵣ₊₁ = σᵣ₊₂ = ⋯ = σₙ = 0.


The decomposition (3.75) is unique if (3.77) is satisfied and the singular
values are all distinct. If two or more singular values are equal, then we can
interchange the corresponding columns of U, elements of Σ, and columns of V,
or form any linear combination of the columns of U and V corresponding to the
equal σᵢ's.
If the matrix A is square, then the matrices U, V and Σ will all be of
the same size. Their inverses can be trivially computed, since U and V are
orthogonal while Σ is diagonal. Hence
$$A^{-1} = V \Sigma^{-1} U^T, \qquad \Sigma^{-1} = \mathrm{diag}(1/\sigma_1, \ldots, 1/\sigma_n). \tag{3.78}$$


Of course, if one or more of the singular values are zero, then the inverse does
not exist. Thus, this technique gives a clear indication of singularity or near
singularity, since in that case σₙ will be zero or very small. The condition
number of a matrix as defined in Section 3.4 is the ratio σ₁/σₙ and can be
estimated using SVD. If ℏκ(A) ≥ 1, where ℏ is the machine accuracy, then the
matrix is computationally singular and the estimate of the condition number
may not be reliable.
If the matrix A is singular, then the solutions of the homogeneous problem
Ax = 0 are said to form the null-space of the matrix. The dimension of the
null-space, which is just the number of linearly independent vectors which span
the null-space is called the nullity of A. Similarly, we can define the range of the
matrix as the subspace of vectors b, for which a solution to the linear problem
Ax = b exists. The dimension of the range is the rank of the matrix A. In fact, the rank
plus the nullity equals the order n of the matrix. For a nonsingular matrix, the rank is
n and the nullity is zero.
SVD can easily be used to find the null-space of a singular matrix. Let
uᵢ and vᵢ be the ith columns of U and V, respectively. Then from (3.75) it is
clear that Avᵢ = σᵢuᵢ (i = 1, 2, ..., n). If the matrix has rank r (< n), then
σᵢ = 0 for i = r + 1, ..., n. Hence, Avᵢ = 0 for i = r + 1, ..., n and the last
n - r columns of V form an orthonormal basis for the null-space, while the first
r columns of U form an orthonormal basis for the range. In general, if the
singular values are not arranged in descending order, then the columns of V
corresponding to the zero σᵢ form an orthonormal basis for the null-space,
while the columns of U corresponding to the nonzero σᵢ form a basis for
the range.
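As a hedged illustration of the last two paragraphs, the following Python sketch extracts orthonormal bases for the range and null-space from numpy's SVD (note that numpy returns Vᵀ, so the columns of V are the rows of Vh, and the singular values come out already sorted in descending order). The threshold reps plays the same role as the criterion σᵢ ≤ ℏσ₁ discussed below; the function name, the default threshold and the use of the Example 3.8 matrix as a test case are illustrative choices only.

```python
import numpy as np

def range_and_nullspace(A, reps=1e-7):
    """Orthonormal bases for the range and null-space of A from its SVD.
    Singular values below reps*sigma_max are treated as zero."""
    U, s, Vh = np.linalg.svd(A, full_matrices=False)
    small = s <= reps * s[0]          # numpy sorts the singular values in descending order
    range_basis = U[:, ~small]        # columns of U with nonzero sigma_i
    null_basis = Vh[small].T          # columns of V (rows of Vh) with sigma_i ~ 0
    return range_basis, null_basis

# the singular matrix of Example 3.8
A = np.array([[1., -1., 1., 1.],
              [1., -1., 1., 1.],
              [1.,  0., 1., 0.],
              [1.,  0., 0., 1.]])
R, N = range_and_nullspace(A)
print(R.shape, N.shape)               # (4, 3) and (4, 1): rank 3, nullity 1
```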
If we want to solve a system of linear equations with a singular matrix,
then the first step will be to find out whether the right-hand vector b lies in the range
of A or not. If the vector b is orthogonal to all columns of U corresponding
to zero σᵢ, then it is in the range of A. If b is in the range, then the system
has a solution and the general solution will be the sum of a particular solution
plus any vector in the null-space of the matrix. In particular, if we set 1/σᵢ = 0
for i = r + 1, ..., n in Σ⁻¹, i.e., all infinities are replaced by zeros, then the
solution given using (3.78) (i.e., A⁻¹b = VΣ⁻¹Uᵀb) is the one with minimum
norm. This is obvious, since the other solutions can be obtained by adding some
vector in the null-space, which is orthogonal to this solution.
If the vector b is not in the range of the matrix A, then there is no solution to
the problem. In such cases we are sometimes interested in the solution which
minimises the residual r = Ax - b. This is referred to as the least squares
solution. Of course, if we add any vector in the null-space to the solution, then
the residual will not change. Hence, the solution is not unique, unless the nullity is
zero. It can be shown that the solution obtained using (3.78) gives the minimum
residue ‖Ax - b‖. Thus we want to find the solution x to minimise
$$f(\mathbf{x}) = \sum_{i=1}^{m} \left( \sum_{j=1}^{n} a_{ij} x_j - b_i \right)^2. \tag{3.79}$$

To find the minimum we can find the zero of the gradient
$$\frac{\partial f}{\partial x_k} = 2 \sum_{i=1}^{m} \left( \sum_{j=1}^{n} a_{ij} x_j - b_i \right) a_{ik} = 0 \qquad (k = 1, \ldots, n). \tag{3.80}$$
This equation can be written as AᵀAx = Aᵀb. Using SVD we can show that
the solution of this equation is given by VΣ⁻¹Uᵀb.
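The minimum-norm least-squares solution just described can be written very compactly in terms of the SVD. The sketch below is roughly what the pair of subroutines SVD and SVDEVL of Appendix B accomplishes, but it uses numpy's SVD routine instead; the parameter reps mimics the role of REPS mentioned in Example 3.6 below, and the 6×6 Hilbert system is the one from Example 3.5. It is a sketch, not the book's implementation.

```python
import numpy as np

def svd_solve(A, b, reps=1e-7):
    """Minimum-norm (least-squares) solution x = V Sigma^+ U^T b, where
    1/sigma_i is replaced by zero whenever sigma_i <= reps*sigma_max."""
    U, s, Vh = np.linalg.svd(A, full_matrices=False)
    sinv = np.where(s > reps * s[0], 1.0 / s, 0.0)   # suppress the doubtful singular values
    return Vh.T @ (sinv * (U.T @ b))

# the 6x6 Hilbert system of Example 3.5, with b chosen so that x_i = 1
i = np.arange(1, 7)
A = 1.0 / (i[:, None] + i[None, :] - 1)
print(svd_solve(A.astype(np.float32), A.sum(axis=1).astype(np.float32)))
```

The same routine handles square, overdetermined and (after the padding described below) underdetermined systems, since the truncated pseudoinverse automatically produces the minimum-norm least-squares solution.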
If there is no roundoff error, then there is no difficulty in determining
which of the singular values are zero. But in practical computation, the zero
elements may not come out to be exactly zero. Hence, we need some criterion to
decide which elements can be treated as zero. It is natural to set all σᵢ < ℏσ₁ to
zero, where ℏ is the machine accuracy, since we do not expect such values to be
determined with any accuracy. In practice, there will always be some marginal
cases, where the magnitude is slightly greater than the above limit and obviously,
there will be a significant error in determining such values also. If we wish to solve
a system of equations with an ill-conditioned, but nonsingular matrix, then some
of the σᵢ may turn out to be small. If none of the σᵢ's are zero, then we may be
able to solve the equation using (3.78), but the solution will be as unreliable as
that obtained using LU decomposition. On the other hand, if we set the small
σᵢ's to zero, and zero the corresponding elements in Σ⁻¹, then the computed
solution does not have any component corresponding to the doubtful singular
values. This will not be the true solution of the problem, since in fact the small
singular values may contribute a large component to the required solution vector.
Unfortunately, since this component cannot be determined reliably, it may be
best to separate it. Even if such a solution is not acceptable, the SVD will give
a clear indication of the trouble and we can determine the solution by adding to
this an arbitrary vector from the null-space, which is also determined by the SVD.
In fact, this is precisely the component that we suppressed by setting small
singular values to zero.
If the number of equations m is less than the number of unknowns n,
then a unique solution does not exist, but SVD can be used to find the entire
solution space. Before performing SVD, augment the matrix A with rows of
zeros, until it is filled up to be a square matrix of order n. Similarly, augment
the right-hand side vector b by adding zeros. This will give a singular set of n
equations in n variables, which can be treated by SVD as described above. In
this case, at least n - m singular values should be zero.
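The augmentation just described is easy to carry out explicitly. The sketch below uses a hypothetical 2×4 system and numpy's pseudoinverse, which applies the same truncated-SVD recipe as above; the padded matrix indeed has at least n − m zero singular values, and the minimum-norm solution of the original underdetermined system is recovered. The specific system is invented purely for illustration.

```python
import numpy as np

# hypothetical underdetermined system: 2 equations in 4 unknowns
A = np.array([[1., 2., 0., 1.],
              [0., 1., 1., 3.]])
b = np.array([3., 2.])
m, n = A.shape

A_sq = np.vstack([A, np.zeros((n - m, n))])        # pad with n-m rows of zeros
b_sq = np.concatenate([b, np.zeros(n - m)])        # pad the right-hand side as well

print(np.linalg.svd(A_sq, compute_uv=False))       # at least n-m of these are zero
x = np.linalg.pinv(A_sq) @ b_sq                    # minimum-norm particular solution
print(x, A @ x - b)                                # residual of the original equations ~ 0
```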
If the system of equations is overdetermined (m > n), then in general we
do not expect any solution. In such situations, we are usually interested in the
least squares solution, which minimises the norm of the residual vector. In this case,
it can be shown that the solution obtained using (3.78) satisfies the required
condition. In general, none of the singular values may be zero, but occasionally
we may encounter a situation where one or more of the singular values are zero
or very small. If σᵢ/σ₁ < ℏ, then 1/σᵢ must be set to zero in Σ⁻¹. If one of the
σᵢ's is small, but larger than this limiting value, then computationally it may
not be essential to set the corresponding inverse to zero, but the corresponding
column of V gives the linear combination of the xᵢ's which is insensitive to the
given data. It may be better to zero such singular values and reduce the number
of free parameters in the fit. A more detailed treatment of this problem will be
considered in Section 10.2.1.
The algorithm for obtaining the SVD consists of two phases, in the first
phase the matrix is reduced to a bidiagonal form using Householder transfor-
mations. The second phase uses the QR algorithm to find the singular values.
We will only outline the algorithm here and for more details the readers can
consult Wilkinson and Reinsch (1971). The methods employed here are very
similar to those used for solution of matrix eigenvalue problems, which will be
considered in Chapter 11. Readers who are only interested in practical use of
SVD may skip the following description until Example 3.6. Subroutine SVD in
Appendix B may be used to obtain SVD of a matrix.
In the first phase, we apply a finite sequence of Householder transformations
to reduce the matrix to a bidiagonal form, i.e., the only nonzero elements
are along the main diagonal and the superdiagonal just above it. The
algorithm is divided into n steps; in the ith step, the required elements in the
ith row and column are reduced to zero, without affecting the elements in
the previous rows and columns. The Householder transformation is essentially
a reflection operation {39}, which is achieved by using symmetric orthogonal
matrices of the form P = I - 2xxᵀ, where x is a unit vector (xᵀx = 1). Using
two sequences of Householder transformations with matrices
$$P^{(k)} = I - 2\mathbf{x}^{(k)}\mathbf{x}^{(k)T} \qquad (k = 1, 2, \ldots, n), \tag{3.81}$$
and
$$Q^{(k)} = I - 2\mathbf{y}^{(k)}\mathbf{y}^{(k)T} \qquad (k = 1, 2, \ldots, n-2), \tag{3.82}$$
where x^{(k)T}x^{(k)} = y^{(k)T}y^{(k)} = 1, it is possible to reduce the matrix A to an
upper bidiagonal form
$$P^{(n)} \cdots P^{(1)}\, A\, Q^{(1)} \cdots Q^{(n-2)} = \begin{pmatrix} J^{(0)} \\ 0 \end{pmatrix}, \qquad
J^{(0)} = \begin{pmatrix} q_1 & e_2 & & & \\ & q_2 & e_3 & & \\ & & \ddots & \ddots & \\ & & & q_{n-1} & e_n \\ & & & & q_n \end{pmatrix}. \tag{3.83}$$
Let A^{(1)} = A and define
$$A^{(k+\frac{1}{2})} = P^{(k)} A^{(k)} \quad (k = 1, 2, \ldots, n), \qquad A^{(k+1)} = A^{(k+\frac{1}{2})} Q^{(k)} \quad (k = 1, 2, \ldots, n-2), \tag{3.84}$$

then P^{(k)} is determined such that
$$a^{(k+\frac{1}{2})}_{ik} = 0 \qquad (i = k+1, \ldots, m), \tag{3.85}$$
and Q^{(k)} such that
$$a^{(k+1)}_{kj} = 0 \qquad (j = k+2, \ldots, n). \tag{3.86}$$

The singular values of J^{(0)} are the same as those of A. Thus, if the singular
value decomposition of J^{(0)} is
$$J^{(0)} = G \Sigma H^T, \tag{3.87}$$
then
$$A = P G \Sigma H^T Q^T, \tag{3.88}$$
so that U = PG and V = QH, with P = P^{(1)}P^{(2)} \cdots P^{(n)} and Q = Q^{(1)}Q^{(2)} \cdots Q^{(n-2)}.
For the first step in this process, we can solve (3.85) for the m components
of x^{(1)}, to get
$$x^{(1)}_1 = \frac{\pm\sqrt{s} + a_{11}}{t}, \qquad x^{(1)}_i = \frac{a_{i1}}{t} \quad (i > 1), \tag{3.89}$$
where
$$t^2 = 2(s \pm \sqrt{s}\, a_{11}), \qquad s = \sum_{i=1}^{m} a_{i1}^2. \tag{3.90}$$

Here the sign should be selected to match that of a₁₁, and it may be noted that
if all the required elements are already zero, the transformation does not reduce
to an identity. If s = 0, the expressions are indeterminate and hence this case
will have to be detected explicitly. Similarly, we can solve (3.86) to get

$$y^{(1)}_1 = 0, \qquad y^{(1)}_2 = \frac{\pm\sqrt{s'} + a_{12}}{t'}, \qquad y^{(1)}_i = \frac{a_{1i}}{t'} \quad (i > 2), \tag{3.91}$$
where
$$t'^2 = 2(s' \pm \sqrt{s'}\, a_{12}), \qquad s' = \sum_{i=2}^{n} a_{1i}^2. \tag{3.92}$$

Similar expressions can be obtained for the other steps. In order to ensure that the
rows and columns which have already been reduced to bidiagonal form in the previous
steps are not disturbed, the first i - 1 elements of the vector x^{(i)} and the first i
elements of y^{(i)} should be zero.
The second phase of the singular value decomposition of the bidiagonal
matrix is achieved by a variant of the QR algorithm. The matrix J^{(0)} is iteratively
diagonalised, so that
$$J^{(0)} \to J^{(1)} \to J^{(2)} \to \cdots \to \Sigma, \tag{3.93}$$
where
$$J^{(i+1)} = S^{(i)T} J^{(i)} T^{(i)}, \tag{3.94}$$
Figure 3.1: Reduction of bidiagonal form by chasing. The × represents elements
which are nonzero, while the asterisk represents those elements which become
nonzero at intermediate steps. Each arrow represents one step, where the
element at the first end becomes zero, generating an entry at the position shown by
the arrowhead.

and S^{(i)}, T^{(i)} are orthogonal matrices. The matrices T^{(i)} are chosen so that the
sequence M^{(i)} = J^{(i)T} J^{(i)} converges to a diagonal matrix, while the matrices
S^{(i)} are chosen so that all J^{(i)} are of bidiagonal form.
For convenience, we drop the superscript and use the notation
$$J \equiv J^{(i)}, \quad \bar{J} \equiv J^{(i+1)}, \quad S \equiv S^{(i)}, \quad T \equiv T^{(i)}, \quad M \equiv J^T J, \quad \bar{M} \equiv \bar{J}^T \bar{J}. \tag{3.95}$$

The transformation J → J̄ is achieved by applying Givens' rotations to
J, alternately from the right and the left. Thus
$$\bar{J} = S^{(n)T} \cdots S^{(2)T}\, J\, T^{(2)} \cdots T^{(n)}, \tag{3.96}$$
where S^{(k)} is the Givens' rotation matrix (in the (k - 1, k) plane), which is an
identity matrix except for the elements
$$s^{(k)}_{k-1,k-1} = s^{(k)}_{k,k} = \cos\theta_k, \qquad s^{(k)}_{k,k-1} = -s^{(k)}_{k-1,k} = \sin\theta_k, \tag{3.97}$$
and T^{(k)} is defined analogously to S^{(k)} with θₖ replaced by φₖ.


For the time being, let us assume that the first angle φ₂ is arbitrary, while
all other angles are chosen so that J̄ has the same form as J. This is achieved
by requiring that

T^{(2)} annihilates nothing, generates an entry J₂₁;
S^{(2)T} annihilates J₂₁, generates an entry J₁₃;
T^{(3)} annihilates J₁₃, generates an entry J₃₂;
⋯
and finally S^{(n)T} annihilates J_{n,n-1}, generates nothing.
This process is frequently referred to as chasing (see Figure 3.1).
The matrices M and M̄ are tridiagonal, and further
$$\bar{M} = \bar{J}^T \bar{J} = T^T M T. \tag{3.98}$$


The first angle φ₂ is chosen so that the transformation M → M̄ is a QR
transformation (see Section 11.8) with a given shift s. The shift parameter s is
determined by an eigenvalue of the lower 2 × 2 minor of M. It can be shown
that if T^{(2)} is chosen such that its first column is proportional to that of M - sI,
then the resulting transformation is equivalent to a QR transformation with
shift s for the matrix M, provided none of the subdiagonal elements of M are
zero. If some subdiagonal element of M is zero, then the matrix can be partitioned
at these points and each part treated separately. If these conditions are
satisfied, then the QR algorithm converges globally and almost always cubically
(Wilkinson, 1968).
As the successive iterates are computed, the off-diagonal elements will
reduce in magnitude and ultimately we end up with a diagonal matrix containing
the singular values in the diagonal positions. To fix a convergence criterion, we
select the tolerance ε = nℏ‖J^{(0)}‖_∞, where ℏ is the machine accuracy and the
matrix norm is defined by
$$\|A\|_\infty = \max_{i} \sum_{j=1}^{n} |a_{ij}|. \tag{3.99}$$

If at any stage the off-diagonal element in (3.83) satisfies |eₙ| ≤ ε, then |qₙ| is accepted
as a singular value, and the order of the matrix is reduced by one. If, however,
|eₖ| ≤ ε for k < n, the matrix breaks into two, and the singular values of each
block may be computed independently.
If at any stage qₖ = 0, then at least one singular value must be equal
to zero. If all calculations are done exactly, then the matrix will break into
two parts if the transformation is performed with a shift of zero. However, in
the presence of roundoff errors the situation is more complicated. If at any stage
|qₖ| ≤ ε, then we can perform an extra sequence of Givens' rotations from the
left on J, involving rows (k, k+1), (k, k+2), ..., (k, n). In this case, the sequence
of transformations is selected such that

e_{k+1} ≡ J_{k,k+1} is annihilated, but J_{k,k+2}, J_{k+1,k} are generated;
J_{k,k+2} is annihilated, but J_{k,k+3}, J_{k+2,k} are generated;
⋯
and finally J_{k,n} is annihilated, but J_{n,k} is generated.


The matrix thus obtained has nonzero elements in the kth column below the
diagonal, in addition to the usual bidiagonal form, while the element J_{k,k+1} ≡
e_{k+1} is zero. From the orthogonality of the transformation matrices, it follows
that
$$\sum_{i=k}^{n} \bar{J}_{i,k}^2 = q_k^2 \le \varepsilon^2, \tag{3.100}$$
where J̄_{i,k} is the corresponding element of the transformed matrix, and qₖ is
the value before applying the transformations. This inequality ensures that all the
subdiagonal elements are negligible. Hence, J̄ breaks up into two parts, which
may be treated independently. An implementation of this algorithm is provided
by the subroutine SVD in Appendix B, which is based on the procedure svd in
Wilkinson and Reinsch (1971). It may be noted that this subroutine does not
make any special effort to sort the singular values in descending order. This
fact should be taken care of while using the results.
The error analysis of the SVD algorithm is beyond the scope of this book.
Nevertheless, it may be noted that only orthogonal transformations are used to
reduce the matrix to its diagonal form. Hence, instability due to unnecessary
growth of elements will not be present in this algorithm. Thus, for matrices
which lead to growth of matrix elements with partial pivoting, this algorithm
may actually give a better result than that obtained using Gaussian elimination.
Of course, the use of iterative refinement with Gaussian elimination or the use
of complete pivoting will usually give good results with much less computation.
EXAMPLE 3.6: Solve the system of linear equations involving the Hilbert matrix considered
in Example 3.5, using the singular value decomposition. Consider two different vectors
on the right-hand side, given by
$$b^{(1)}_i = \sum_{j=1}^{n} a_{ij}, \qquad b^{(2)}_i = \sum_{j=1}^{n} (-1)^{j-1} a_{ij}. \tag{3.101}$$
Using 24-bit arithmetic, we obtain the singular value decomposition using the subroutine
SVD. The solution is then calculated using the subroutine SVDEVL (with REPS = 10⁻⁷),
which eliminates small singular values before calculating the solution. The results are
displayed in Table 3.6, which gives the solution obtained as well as the singular values for the
matrices. For each value of n the first row gives the singular values, while the second and
third rows give the solutions of the linear systems. The exact solutions for the two systems
are xᵢ = 1 and xᵢ = (-1)^{i-1}, respectively.
It can be seen that the solution to the first system is remarkably accurate, while that
for the second one is rather poor. From the singular values it is quite clear that the matrices are
ill-conditioned, with the condition numbers ranging from 10⁴ for n = 4 to about 10¹⁰ for

Table 3.6: Singular value decomposition of Hilbert matrices

n = 4
  1.50022    .169141   .006738   .000097
  .999996   1.000042   .999895   1.000071
  .999989   -.999898   .999790   -.999878
n = 6
  1.61890    .242361   .016322   .0006158   .000013    1.1×10⁻⁷
  .999961   1.000602   .998136   1.000905   1.002156   .998210
  .997672   -.933605   .551266   .166116    -.286113   -.493585
n = 8
  1.69594    .298125   .026213   .0014677   .000054    .0000013   1.9×10⁻⁸   2.1×10⁻⁹
  1.00023    .994880   1.02477   .966273    .992469    1.023879   1.020691   .976641
  .999713   -.988546   .882258   -.483455   -.148290   .357648    .186654    -.806032
n = 10
  1.75192    .342929   .035742   .0025309   .0001288   .0000047   1.2×10⁻⁷   2.4×10⁻⁹   2×10⁻⁹    2×10⁻⁹
  .999960   1.000988   .995727   1.003997   1.002842   .999119    .997043    .997486    .99981    1.00308
  .999236   -.972836   .764124   -.181250   -.300740   .002254    .238783    .209079    -.10513   -.65372
n = 10. Using 24-bit arithmetic, it is not really possible to determine the smaller singular
values accurately. Hence, the actual values will be quite different for larger n. In fact, for
n = 8 and 10 the smallest singular value is ≈ 1.1×10⁻¹⁰ and 1.1×10⁻¹³, respectively,
giving a much larger condition number. Despite the ill-conditioning, it is quite surprising that
the solution to the first system comes out to be quite accurate. In Section 3.4 we have argued
that the roundoff error in solving a system of linear equations is comparable to that due to
the rounding of the coefficients to machine accuracy. In that case, we cannot expect to get much
better results using any technique. In fact, the accuracy here is just a coincidence,
as the first solution vector does not have any significant component along the columns of
V corresponding to the small singular values. To illustrate this effect the second system is
included. It can be easily seen that the solution to this second system is no more accurate
than the solution which could be obtained using Crout's method. The only difference here is
that, even for n = 10, all components of the solution are of the order of unity, which is not the
case with other methods, which tend to amplify the errors arbitrarily.

From this example it is clear that, in general, the SVD cannot give a
solution which is more accurate than that obtained by other methods, but
the advantage is that it gives a clear indication of difficulties. If we use the
technique of iterative refinement, then there is a fair chance of estimating the
accuracy, but as we have seen in Example 3.5, for n = 10, which is an extremely
ill-conditioned system, the method of iterative refinement may underestimate
the roundoff error. As compared to that, SVD gives a more reliable estimate of
the condition number of the matrix. It can be easily seen from the subroutine
SVD that approximately 4n³/3 multiplications and an equal number of additions
are required for the first stage of reduction to bidiagonal form. In the second
stage, the number of operations will depend on the matrix. If the matrix does
not split into smaller parts at any stage, it may require of the order of 5n³ or
more operations to achieve this transformation. Hence, clearly the amount of
computation required by SVD is about an order of magnitude more than what
is required by LU decomposition, and in most cases it will not be worthwhile
to use this algorithm for solution of linear equations. However, if there is some
doubt about a system of equations being ill-conditioned, one will have to verify
the condition number using this algorithm. With the ever increasing speed of modern
computers it may be worthwhile to spend some additional effort to test the
reliability of the computed solution using SVD. Further, this algorithm can be used
to determine the rank of a matrix, or for finding the general solution to a system
of equations with a singular matrix, or for solving an overdetermined system of
equations.
EXAMPLE 3.7: Consider the following systems of linear equations
   36x₁ -    630x₂ +    3360x₃ -    7560x₄ +    7560x₅ =     463   (  -4157)
 -630x₁ +  14700x₂ -   88200x₃ +  211680x₄ -  220500x₅ =  -13860   ( -17820)
 3360x₁ -  88200x₂ +  564480x₃ - 1411200x₄ + 1512000x₅ =   97020   (  93555)
-7560x₁ + 211680x₂ - 1411200x₃ + 3628800x₄ - 3969000x₅ = -258720   (-261800)
 7560x₁ - 220500x₂ + 1512000x₃ - 3969000x₄ + 4410000x₅ =  291060   ( 288288)
-2772x₁ +  83160x₂ -  582120x₃ + 1552320x₄ - 1746360x₅ = -116424   (-118944)
where the numbers in parentheses are the right-hand sides for the second system. The 6 × 5
matrix in this case consists of the first five columns of the inverse of the 6 × 6 Hilbert matrix.
The first right-hand side is selected such that Ax - b = 0 for xᵀ = (1, 1/2, 1/3, 1/4, 1/5).
The second right-hand side is obtained by adding a vector orthogonal to the columns of A
to the first right-hand side. Hence, the solution should be the same in both cases.
The results obtained using 24-bit arithmetic are as follows:

σ₁ = 8888159.   σ₂ = 69916.21   σ₃ = 1249.146   σ₄ = 39.01334   σ₅ = 1.849190
x₁ = 1.056285   x₂ = .5175471   x₃ = .3406049   x₄ = .2531186   x₅ = .2010946     (3.102)
x₁ = 62.45992   x₂ = 20.96306   x₃ = 9.090468   x₄ = 4.077362   x₅ = 1.559966

Here the singular value decomposition was obtained using the subroutine SVD, while the
solution was later computed using the subroutine SVDEVL. It can be seen that there is a
significant roundoff error and the second solution is far from the actual value. The problem
here is because the component corresponding to the smallest singular value is dominant.
Here σ₅/σ₁ ≈ ℏ and hence σ₅ cannot be determined accurately. If the same computations
are repeated using 53-bit arithmetic, then the following results are obtained:

σᵀ = (8888158.395302, 69916.14797765, 1249.255765223, 38.96968805299, 1.892391797669)
xᵀ = (.9999999998885, .4999999999645, .3333333333185, .2499999999936, .1999999999978)     (3.103)
xᵀ = (.9999999865866, .4999999954474, .3333333313649, .2499999991340, .1999999996909)
EXAMPLE 3.8: Consider the system of equations
    x₁ - x₂ + x₃ + x₄ = 2,
    x₁ - x₂ + x₃ + x₄ = 2,
    x₁ + x₃ = 2,                                    (3.104)
    x₁ + x₄ = 2.

The matrix of equations is obviously singular, but since the first two equations are
identical a solution can be found. Using the subroutine SVD we find that one of the singular
values is very small, as expected. The SVD for this matrix gives
$$U = \begin{pmatrix} -.6184657 & -.3427832 & .7071067 & 1.23\times10^{-8} \\ -.6184657 & -.3427830 & -.7071068 & -5.21\times10^{-8} \\ -.3427831 & .6184656 & 3.41\times10^{-8} & .7071068 \\ -.3427831 & .6184656 & 5.88\times10^{-8} & -.7071068 \end{pmatrix},$$
$$\Sigma = \begin{pmatrix} 3.190064 & 0 & 0 & 0 \\ 0 & .8848785 & 0 & 0 \\ 0 & 0 & 1.3\times10^{-7} & 0 \\ 0 & 0 & 0 & 1.000000 \end{pmatrix}, \tag{3.105}$$
$$V = \begin{pmatrix} -.6014567 & .6230972 & .5000001 & .0000000 \\ .3869762 & .7747580 & -.4999999 & -5.08\times10^{-8} \\ -.4942163 & -.07583018 & -.5000001 & .7071068 \\ -.4942163 & -.07583030 & -.5000000 & -.7071068 \end{pmatrix}.$$
The third singular value can be replaced by zero and the solution with minimum norm can
be evaluated using the subroutine SVDEVL to get
    x₁ = 1.500000,  x₂ = 0.4999996,  x₃ = 0.5000004,  x₄ = 0.5000002.        (3.106)
The general solution can be obtained by adding an arbitrary multiple of v₃, the third column
of V, to this solution. In particular, if the multiple is selected to be -1, then we get the
obvious solution x₁ = x₂ = x₃ = x₄ = 1. Of course, this is not the unique solution. It can
be seen that the right-hand vector in this case is orthogonal to the third column of U, which
assures us that it is in the range of matrix A and hence the solution exists. The range of the
matrix is defined by any linear combination of columns 1, 2, 4 of U, while the null space is
defined by the third column of V.
If the right-hand side of the first equation is changed to 1, then the equations will not
be consistent, as can be verified by the fact that the new right-hand side is not orthogonal
to the third column of U. However, we can use SVD to obtain the least squares solution.
Using the same decomposition, we can use the subroutine SVDEVL to evaluate the solution with
minimum norm, which turns out to be
    x₁ = 1.625000,  x₂ = 0.8749999,  x₃ = 0.3750004,  x₄ = 0.3750002.        (3.107)
Any multiple of V3 can be added to this solution, since that does not change the residuals.

3.7 Iterative Methods


So far we have considered direct methods for solution of a system of linear equa-
tions. For sparse matrices, it may not be possible to take advantage of sparsity
while using direct methods, since the process of elimination can make the zero
elements nonzero, unless the zero elements are in a certain well defined pattern.
Hence, the number of arithmetic operations as well as the storage requirement
may be the same for sparse and filled matrices. This requirement may be pro-
hibitive for large matrices and in those cases, it may be worthwhile to consider
the iterative methods. Sparse matrices often arise in solution of ordinary or
partial differential equations. The matrices resulting from finite difference ap-
proximation to ordinary differential equations are usually band matrices with
small bandwidth. Hence, the algorithm for Gaussian elimination can be easily
modified to deal with such matrices. Solution of partial differential equations
often leads to matrices which may have large bandwidth and a large fraction of
elements within the band are also zero. For such matrices, the iterative methods
may be considered. In some cases, it may be possible to rearrange the equa-
tions and apply some variation of Gaussian elimination quite effectively. In this
section, we consider some of the simple iterative methods. More sophisticated
iterative methods will be considered in Chapter 14.
We can write the matrix A in the form A = D + L + U, where D is
a diagonal matrix, and L and U are, respectively, lower and upper triangular
matrices with zeros on the diagonal. Then the system of equations can be
written as
$$D\mathbf{x} = -(L+U)\mathbf{x} + \mathbf{b}, \tag{3.108}$$
or
$$\mathbf{x} = -D^{-1}(L+U)\mathbf{x} + D^{-1}\mathbf{b}. \tag{3.109}$$
Here we have assumed that all diagonal elements of A are nonzero. If some
diagonal element is zero, but A is nonsingular, then by permuting rows and
columns it is possible to get a nonsingular matrix D. In fact, it is desirable to
have the diagonal elements as large as possible in relation to the off-diagonal
elements. Equation (3.109) can be used to define an iterative process, which
generates the next approximation x^{(j)} using the previous one on the right-hand
side:
$$\mathbf{x}^{(j)} = -D^{-1}(L+U)\mathbf{x}^{(j-1)} + D^{-1}\mathbf{b}. \tag{3.110}$$
This iterative process is known as the Jacobi iteration or the method of
simultaneous displacements. The latter name follows from the fact that every
element of the solution vector is changed before any of the new elements are
used in the iteration. Hence, both x(j) and x(j-l) need to be stored separately.
The iterative procedure can be easily expressed in the component form as
$$x^{(j)}_i = \frac{b_i - \sum_{k=1,\, k\ne i}^{n} a_{ik}\, x^{(j-1)}_k}{a_{ii}} \qquad (i = 1, \ldots, n). \tag{3.111}$$
This is a special case of the general linear iteration, defined by
$$\mathbf{x}^{(j)} = B\mathbf{x}^{(j-1)} + \mathbf{b}. \tag{3.112}$$
For the Jacobi iteration B = -D⁻¹(L + U). It can be shown that this process will
converge if and only if all eigenvalues of B lie within the unit circle. The proof
follows from expansion of the vector x in terms of the eigenvectors of B. A
sufficient condition for convergence is that
$$\|B\|_E < 1, \tag{3.113}$$
where ‖B‖_E is the Euclidean norm of the matrix B.


The Jacobi method is seldom used in practice, since the Gauss-Seidel method,
which is a slight modification of the Jacobi method, usually converges faster than
the Jacobi method and is easier to implement on a sequential computer. The
difference between the Jacobi and Gauss-Seidel methods is that in the latter,
as each component of x^{(j)} is computed, we use it immediately in the iteration.
Consequently, the Gauss-Seidel method is sometimes called the method of successive
displacements. This method is more convenient for programming, since
there is no need for two separate arrays to store the successive approximations,
and the new values can be immediately overwritten on the old values. The
components of the new vector are calculated using the relation
$$x^{(j)}_i = \frac{b_i - \sum_{k=1}^{i-1} a_{ik}\, x^{(j)}_k - \sum_{k=i+1}^{n} a_{ik}\, x^{(j-1)}_k}{a_{ii}} \qquad (i = 1, 2, \ldots, n). \tag{3.114}$$

This equation can be written in the matrix form as
$$\mathbf{x}^{(j)} = -(D+L)^{-1}U\,\mathbf{x}^{(j-1)} + (D+L)^{-1}\mathbf{b}, \tag{3.115}$$
which is also a special case of (3.112).
It can be proved that if the matrix A is positive definite, the Gauss-Seidel
iteration converges independently of the initial vector. Further, in this case, the
convergence is twice as fast as that for the Jacobi iteration. In general, it converges
if and only if all eigenvalues of the matrix B = -(D + L)⁻¹U lie within the unit
circle. A sufficient condition for convergence is again given by (3.113) with B
defined as above.
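A bare-bones implementation of the Gauss-Seidel relation (3.114) for a dense matrix might look as follows. As noted in the next paragraph, such a routine is of limited practical use, and the simple step-size test used here for convergence suffers from exactly the pitfalls discussed later in this section; the test system, the tolerance and the iteration limit are illustrative choices only. (A Jacobi version would differ only in keeping a second copy of x, since all components must be updated from the previous iterate.)

```python
import numpy as np

def gauss_seidel(A, b, x0=None, tol=1e-8, max_iter=1000):
    """Gauss-Seidel iteration: each component of x is updated in place and
    used immediately, so only one solution vector is stored."""
    n = len(b)
    x = np.zeros(n) if x0 is None else np.asarray(x0, dtype=float).copy()
    for it in range(max_iter):
        change = 0.0
        for i in range(n):
            s = b[i] - A[i, :i] @ x[:i] - A[i, i+1:] @ x[i+1:]
            xi_new = s / A[i, i]
            change = max(change, abs(xi_new - x[i]))
            x[i] = xi_new
        if change <= tol * max(1.0, np.abs(x).max()):
            return x, it + 1        # converged by the simple step-size criterion
    return x, max_iter              # no convergence within max_iter iterations

# diagonally dominant test system (hypothetical example)
A = np.array([[4., -1., 0.], [-1., 4., -1.], [0., -1., 4.]])
b = np.array([2., 4., 10.])
x, k = gauss_seidel(A, b)
print(x, k, np.abs(A @ x - b).max())
```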
The main advantage of iterative methods for sparse matrices is that the
matrix is not manipulated and hence sparsity is preserved. Further, only the
nonzero elements of the matrix need to be used. Consequently, the number of
arithmetic operations is reduced considerably. For a filled matrix the Gauss-Seidel
method will require n(n - 1) multiplications and n divisions for each iteration.
If the method requires k iterations for convergence, then the total number of
arithmetic operations required is approximately kn². As compared to this, the
direct methods require about n³/3 operations. Hence, the Gauss-Seidel method will be
more efficient if the number of iterations k < n/3. This is only a hypothetical
situation, since for a general filled matrix the iteration will probably not converge.
Further, in general, it is found that the rate of convergence decreases
with n (see Section 14.8 and {43}). On the other hand, if a fraction p of the
elements of A are nonzero, then the number of operations required by the Gauss-Seidel
method is only pkn². If the pattern of nonzero elements is such that direct
methods cannot take advantage of sparsity, then the iterative methods will be
more efficient if k < n/3p. For a band matrix with bandwidth d, the Gauss-Seidel
method requires of the order of 2kdn operations, while Gaussian elimination
will require of the order of nd² operations. Hence, iterative methods cannot
compete with direct methods for small bandwidth. As noted earlier, partial
differential equations often lead to matrices with large bandwidth containing
a large number of zero elements within the band. For such matrices, the iterative
methods may prove to be more efficient. Further, even if the iterative method
is not as efficient as the direct method, as far as the amount of computation
is concerned, the prohibitive amount of memory required by the direct method
for large matrices may preclude their use. If the coefficients of the matrix are
simple numbers, as is often the case with matrices resulting from solution of
partial differential equations, then it may not be necessary to store any of
the matrix elements at all, since they can be easily generated when required.
Another advantage of iterative methods over the direct methods is that the
algorithm can be easily adapted for parallel processing, since the equations are
solved independently.
The direct methods can also be modified to take care of sparse matrices. A
straightforward application of Gaussian elimination introduces a large number
of nonzero elements during the calculations. This is referred to as fill-in. One
of the simplest techniques is to use complete pivoting and to choose the pivot
such that the fill-in is minimised. This technique may reduce the numerical
stability of the algorithm, but as long as a very small element is not used as a pivot,
the process will be reasonably stable. Alternately, if we are interested in a large
number of matrices with the same pattern of sparse elements, then we can do
some analysis to find a permutation of rows and columns, such that the fill-in
is minimised. Methods based on graph theory have proved to be quite useful in
such applications.
It may appear that roundoff error will be much less in iterative methods
as compared to that in direct method, since we always work with the original
matrix and further each step is essentially independent. Even if the sums are
accumulated in double precision, the corresponding perturbation matrix F for
Jacobi iteration has elements bounded by

(3.116)

where δᵢⱼ is the Kronecker delta. Thus, only the diagonal elements are perturbed
by an amount which is comparable to rounding the matrix elements. For the
Gauss-Seidel method, since the new values of xᵢ are used in the same iteration,
the bound will be


for i ≥ j;
(3.117)
for i < j.
These bounds are comparable to those for direct methods, neglecting the growth
of pivotal elements in those methods. For sparse matrices such growth will be
much smaller, since the number of steps required for elimination will be smaller.
Further, as noted earlier, for ill-conditioned matrices the elements actually show
a decreasing trend in direct methods. Hence, for such matrices the bounds are
essentially the same as those due to rounding of the coefficients in either case and
no great difference can be expected. Apart from roundoff error, the iterative
methods will also involve truncation error, since the process is necessarily terminated
after a finite number of iterations. In most cases, since the convergence
of the iteration is rather slow, the truncation error will be much larger in iterative
methods. In general, iterative methods are as prone to roundoff error as direct
methods. In fact, there is also the problem of defining the convergence criterion
when the iteration is converging slowly. There is also the possibility of
spurious convergence. A simple convergence criterion of terminating the iteration
when the change in the solution is less than some prescribed tolerance will not
be adequate when convergence is slow, as the actual truncation error with this
criterion will be much larger than the tolerance.
EXAMPLE 3.9: Apply the Gauss-Seidel method to the system of equations (Wilkinson, 1961)
$$0.96326x_1 + 0.81321x_2 = 0.88824, \qquad 0.81321x_1 + 0.68654x_2 = 0.74988. \tag{3.118}$$
This is a symmetric positive definite system and hence the Gauss-Seidel method should
converge from an arbitrary starting vector. Assuming that we are using five decimal digit
arithmetic, if we start with the initial vector x₁ = 0.33116, x₂ = 0.70000, then it can be verified
that the next approximation is
$$x_1 = \frac{0.88824 - 0.81321 \times 0.70000}{0.96326} \approx 0.33116, \qquad x_2 = \frac{0.74988 - 0.81321 \times 0.33116}{0.68654} \approx 0.70000. \tag{3.119}$$
Hence, the iteration will converge to the starting value itself, while the correctly rounded solution
is x₁ = 0.39473, x₂ = 0.62470. This example clearly shows that iterative methods are no
better for ill-conditioned problems.

It is trivial to write a subroutine implementing the Gauss-Seidel method for
a general n × n matrix. However, such a subroutine is not of much use, since
the Gauss-Seidel method is not very effective for general dense matrices. We would
like to have a routine which takes into account the sparsity of the matrix. For
sparse matrices of special form, it is most effective to write a special routine
which takes into account not only the pattern of sparsity, but the coefficients
of the matrix itself. As mentioned earlier, in many cases that arise in practice
it may not even be necessary to store the coefficients; a sketch of such a
special-purpose routine for a model problem is given below.
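As an example of such a special-purpose routine, the sketch below applies Gauss-Seidel sweeps to the standard 5-point discretisation of -(u_xx + u_yy) = f on a square grid with zero boundary values, a model problem of the kind discussed above. The coefficients (4 on the diagonal, -1 for the four neighbours) are generated on the fly and never stored. The grid size, source term and iteration count are arbitrary assumptions made only for illustration; better iterative methods for such problems are the subject of Chapter 14.

```python
import numpy as np

def gauss_seidel_laplace(f, h, n_iter=500):
    """Matrix-free Gauss-Seidel sweeps for the 5-point discretisation of
    -(u_xx + u_yy) = f on a square grid with zero boundary values.
    Only the solution grid is stored; the matrix is never formed."""
    u = np.zeros_like(f)                       # includes boundary points, kept at zero
    for _ in range(n_iter):
        for i in range(1, f.shape[0] - 1):
            for j in range(1, f.shape[1] - 1):
                u[i, j] = 0.25 * (u[i-1, j] + u[i+1, j] +
                                  u[i, j-1] + u[i, j+1] + h * h * f[i, j])
    return u

# hypothetical test: 33x33 grid, constant source term
n = 33
h = 1.0 / (n - 1)
f = np.ones((n, n))
u = gauss_seidel_laplace(f, h)
print(u[n // 2, n // 2])                       # value at the centre of the domain
```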
Another iterative method which can be applied to sparse matrices is the
conjugate gradient method, which essentially tries to minimise the norm ‖Ax -
b‖. This method will be considered in Section 8.6. The advantage of this method
is that for each iteration we need to evaluate only products of the form Au
involving the matrix, which can be easily accomplished for sparse matrices
using only the nonzero elements. This method also does not require any extra
storage. However, this algorithm is very susceptible to roundoff error and hence
will work for well-conditioned matrices only.

Bibliography
Acton, F. S. (1990): Numerical Methods That Work, Mathematical Association of America.
Althaus, G. W. and Spedicato, E. (eds.) (1998): Algorithms for Large Scale Linear Algebraic
Systems: Applications in Science and Engineering, Kluwer Academic.
Anderson, E., Bai, Z., Bischof, C., Blackford, L. S. and Demmel, J. (2000): Lapack Users'
Guide (Software, Environments and Tools, 9), SIAM, Philadelphia.
Dahlquist, G. and Bjorck, A. (2003): Numerical Methods, Dover, New York.
Dongarra, J. J., Bunch, J. R., Moler, C. B. and Stewart, G. W. (1979): LINPACK User's
Guide, SIAM, Philadelphia.
Dongarra, J. J., Gustavson, F. G. and Karp, A. (1984): Implementing Linear Algebra Algo-
rithms for Dense Matrices on a Vector Pipeline Machine, SIAM Rev., 26, 91.
Faddeev, D. K. and Faddeeva, V. N. (1963): Computational Methods of Linear Algebra,
(trans. R. C. Williams), W. H. Freeman, San Francisco.
Forsythe, G. E. and Moler, C. B. (1967): Computer Solution of Linear Algebraic Systems,
Prentice-Hall, Englewood Cliffs, New Jersey.
Gregory, R. T. and Karney, D. L. (1969): A Collection of Matrices for Testing Computational
Algorithms, Wiley-Interscience, New York.
Hamming, R. W. (1987): Numerical Methods for Scientists and Engineers, (2nd ed.), Dover,
New York.
Heller, D. (1978): A Survey of Parallel Algorithms in Numerical Linear Algebra, SIAM Rev.,
20, 740.
Hildebrand, F. B. (1987): Introduction to Numerical Analysis, (2nd ed.), Dover, New York.
Pissanetzky, S. (1984): Sparse Matrix Technology, Academic Press, London.
Press, W. H., Teukolsky, S. A., Vetterling, W. T. and Flannery, B. P. (2007): Numerical
Recipes: The Art of Scientific Computing, (3rd ed.) Cambridge University Press, New
York.
Quarteroni, A., Sacco, R. and Saleri, F. (2010): Numerical Mathematics, (2nd ed.), Texts in
Applied Mathematics, Springer, Berlin.
Ralston, A. and Rabinowitz, P. (2001): A First Course in Numerical Analysis, (2nd ed.),
Dover.
Rice, J. R. (1992): Numerical Methods, Software, and Analysis, (2nd ed.), Academic Press,
New York.
Rutishauser, H. (1990): Lectures on Numerical Mathematics, Birkhäuser, Boston.
Stoer, J. and Bulirsch, R. (2010): Introduction to Numerical Analysis, (3rd ed.), Springer-
Verlag, New York.
Tewarson, R. P. (1973): Sparse Matrices, Academic Press, New York.
Varga, R. S. (2009): Matrix Iterative Analysis, Springer Series in Computational Mathematics,
27, (2nd ed.), Springer-Verlag, New York.
Wilkinson, J. H. (1961): Error Analysis of Direct Methods of Matrix Inversion, JACM, 8,
281.
Wilkinson, J. H. (1968): Global Convergence of Tridiagonal QR Algorithm With Origin Shifts,
Lin. Alg. and Appl., 1, 409.
Wilkinson, J. H. (1988): The Algebraic Eigenvalue Problem, Clarendon Press, Oxford.
Wilkinson, J. H. (1994): Rounding Errors in Algebraic Processes, Dover, New York.
Wilkinson, J. H. and Reinsch, C. (1971): Linear Algebra: Handbook for Automatic Computation,
Vol. 2, Springer-Verlag, Berlin.
Exercises 103

Zlatev, Z. (2010): Computational Methods for General Sparse Matrices (Mathematics and
its Applications, Vol. 65), Kluwer Academic.

Exercises
1. Prove that the product of two (unit) triangular matrices is a (unit) triangular matrix of
the same form. Also prove that the inverse of such a matrix is also of the same form.
2. Let A be an n × n band matrix with bandwidth k < n/2. Show that A² has bandwidth
2k.
3. Show that the product of two symmetric matrices is symmetric if and only if they com-
mute, that is, AB = BA.
4. Assuming that matrix A is lower triangular as well as orthogonal, show that A is diagonal.
What are the diagonal elements of A?
5. Show that the Gaussian elimination algorithm for a filled matrix requires n(n - 1)/2
divisions, n(n - 1)(2n - 1)/6 multiplications and the same number of additions (or sub-
tractions). Processing each right-hand side requires n divisions, n(n - 1) multiplications
and n(n - 1) additions.
6. Instead of Gaussian elimination, we can use Gauss-Jordan elimination, which reduces
the matrix to a diagonal form. This algorithm is very similar to Gaussian elimination,
except for the fact that in step (3) of the algorithm given in Section 3.2, the steps (3a)
to (3c) have to be carried out for all values of j from 1 to n except for j = k. Show
that Gauss-Jordan elimination for a filled matrix requires n(n - 1) divisions, n(n - 1)²/2
multiplications and an equal number of additions. Further, the solution for each right
side requires the same number of arithmetic operations as with Gaussian elimination.
7. For a band matrix of bandwidth k, if Gaussian elimination can be performed without
pivoting, then show that the total number of arithmetic operations required is of the
order of nk divisions, nk² multiplications and nk² additions. What happens if partial
pivoting is included?
8. Give an efficient algorithm for reducing a tridiagonal matrix to upper triangular form.
Assume no pivoting is needed. Write a computer program to implement this algorithm,
which is also efficient in terms of memory space used. Test the program on the following
system with matrix A of order 21, given by
aᵢⱼ = 0  (|i - j| > 1),
and the right-hand side vector given by
bᵢ = 0  (i = 2, 3, ..., 20).
If partial pivoting is to be included what changes will be necessary? What happens if
complete pivoting is implemented?
9. Show that for an upper Hessenberg matrix Gaussian elimination requires (n-1) divisions,
n(n - 1)/2 multiplications and n(n - 1)/2 additions. What happens if partial pivoting is
incorporated?
10. The Fibonacci numbers are defined by the difference equation
    fⱼ₊₁ = fⱼ + fⱼ₋₁,   j ≥ 1,   f₀ = 0,   f₁ = 1.
Show that fₙ₊₁² - fₙfₙ₊₂ = (-1)ⁿ, n ≥ 0. Try to compute the unique solution of

for n = 5, 10,20,30,50 and explain the results.


11. Solve the following system of linear equations using only five correctly rounded significant
digits at each step:
    0.002321x₁ + 0.090244x₂ = 0.047443,        0.3043x₁ + 11.556x₂ = 6.0823.
Check the answer by finding the residues and estimate its accuracy. Compare the results
with the exact value of (1.000, 0.5000)ᵀ. Do the calculations three times, first without
pivoting, second with partial pivoting and third with complete pivoting.
12. Consider the following systems of linear equations
    εx₁ + x₂ + x₃ = 3ε        y₁ + y₂ + y₃ = 3
    x₁ + x₂ - x₃ = 1          y₁ + εy₂ - εy₃ = 1        (ε ≪ 1)
    x₁ - x₂ + x₃ = 1          y₁ - εy₂ + εy₃ = 1
Show that the second system can be obtained from the first one by the transformation
y₁ = x₁, y₂ = x₂/ε, y₃ = x₃/ε, followed by multiplying the first equation by 1/ε. This
transformation is achieved by scaling the rows and columns of the matrix appropriately.
Estimate the condition numbers for both these systems. Note that both the matrices
are equilibrated in the sense that the maximum element in each row and column is unity.
The correct solution is x₁ = 1, x₂ = x₃ = ε or y₁ = y₂ = y₃ = 1. If ε < ℏ/4 for the
arithmetic being used, then attempt to solve each of these systems with and
without pivoting. For the second system, interchange the first two equations and solve
without pivoting.
13. Solve the following systems of linear equations using any method and estimate the accu-
racy of solution

oj,J @
COOOO07l43
0.6262
0
0.000007355
0
0 (~OO~45)
0.3763
0.4923 0.2123 0.000002534 0.6087
0.8017 0.6123 0.7165 0.4306

3296160)
@
0.6592320 0.4944240 0.3955392
C~84~
0.6592320 0.4944240 0.3955392 0.3296160
0 0

0.2825280
r~'45)
0.65434
0.4944240 0.3955392 0.3296160 0.2825280 0.2472120 0.78645
0.3955392 0.3296160 0.2825280 0.2472120 0.2197440 0.95467
0.3296160 0.2825280 0.2472120 0.2197440 0.1977587 0.34527
$$\begin{pmatrix} 0.9729627 & 0.6852485 & 0.6548700 \\ 0.6796685 & 0.8707517 & 0.9187936 \\ 0.3522263 & 0.4959247 & 0.5287152 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} 0.6952996 \\ 0.4724475 \\ 0.2433280 \end{pmatrix}$$
Note that for the first matrix the LU decomposition need not be performed. For the
first matrix try to solve the equations with another right-hand side vector given by
b_i = Σ_{j=1}^{n} a_ij and estimate the accuracy.
14. Using the method of undetermined coefficients (see Section 5.2), find a formula for nu-
merical differentiation of the form
f'(αh) = [a_1 f(2h) + a_2 f(h) + a_3 f(0) + a_4 f(−h) + a_5 f(−2h)] / h,

where −1/2 ≤ α ≤ 1/2. The formula should be exact for all polynomials of degree less than
or equal to 4. Solve the relevant system of linear equations to obtain the formulae for
α = ±0.2.
15. Show that if we apply Gaussian elimination without pivoting to a real symmetric matrix
A, then the matrix A(l) after the first major step is also real symmetric in the last n - 1
rows and columns. Hence, deduce that all intermediate matrices A(k) are symmetric in
their last n - k rows and columns. Using this fact, the amount of calculations required
in Gaussian elimination can be reduced. Estimate the number of operations required in
this case. Note that only about half of the matrix needs to be stored. Write a program to
implement this algorithm and verify it on the system with 6 x 6 Hilbert matrix considered
in Example 3.5. Can this algorithm fail for nonsingular matrices?
16. In Gaussian elimination (without pivoting) show that

a_ij^{(k'−1)} = a_ij − Σ_{m=1}^{k'−1} l_im u_mj,     k' = min(i, j).

By identifying a_kj^{(k−1)} with u_kj, show that this is exactly identical to Doolittle's
algorithm for LU decomposition. Hence, if all calculations are done in single precision,
even the roundoff errors will be identical in Gaussian elimination and Doolittle's
algorithm.
17. Consider the permutation matrix P_ij which is obtained by interchanging the ith and jth
row of a unit matrix. Show that if any matrix is premultiplied by P_ij, then the result is
the matrix with corresponding rows interchanged. Similarly, show that postmultiplication
by Pij results in interchanging of corresponding columns.
18. Show that the operation of scaling the rows of a matrix is equivalent to premultiplication
by a diagonal matrix, the diagonal elements of which give the factors by which the corre-
sponding rows are multiplied. Similarly, show that the operation of scaling the columns
of a matrix is equivalent to postmultiplication by a diagonal matrix. Thus, a matrix A
can be equilibrated by a transformation of the form A' = D_1 A D_2, where D_1 and D_2 are
diagonal matrices.
19. Let A be a matrix of order n with elements

a_ij = { a,   if i = j;
         a,   if j = n;
        −a,   if i > j;
         0,   otherwise.

Show that for a = 1 with partial pivoting, the matrix elements at the kth step increase
as 2^k. Choose the right hand side of the equations to be b_i = (3 − i)a for i < n and
b_n = (2 − n)a. With this choice the exact solution will be x_i = 1. Compute the solution
using Gaussian elimination or any other method for LU decomposition for a = 1, 0.999
and n = 6, t/2, t, 3t/2, 2t, 3t, where t is the number of bits in the fraction part for the
arithmetic being used. Compare the result with the exact value and try to improve on
this result by the technique of iterative refinement. Do you expect complete pivoting to
improve the results?
20. (i) Let A_n be the matrix of order n with elements

a_ij = −1 for i > j or j = n;     a_ii = 1, i ≠ n;     a_ij = 0 for i < j < n.

Show that the elements of A_n^{-1} are given by

a_ij^{-1} = (−1)^{n+j−1} 2^{−j},   j ≠ n;
i > j, i ≠ n;       a_in^{-1} = (−1)^{n+i−1} 2^{−n+i},   i ≠ n;
i < j, j ≠ n;       a_nn^{-1} = −2^{−n+1}.

Hence, deduce that A_n is well-conditioned.


(ii) Let B_n be the matrix formed by replacing a_nn by −1/2. Show that

Hence, at least some elements of A_n^{-1} differ from those of B_n^{-1} in the first decimal place.
(iii) Show that if partial pivoting is used, then Gaussian elimination applied to A_n and
B_n is identical until the final step, i.e., the last row never has the element with greatest
magnitude. Further, the final element u_nn in the triangular decomposition will be −2^{n−1}
for A_n and −2^{n−1} + 1/2 for B_n. Hence, if arithmetic with less than (n − 1) bits is being
used, the LU decomposition in the two cases will be identical.
(iv) Try to evaluate A_n^{-1} and B_n^{-1} for n = 6, t/2, t, t + 2, 3t/2, 2t (where t is the number
of bits in fraction part) and compare it with exact values. Use Gaussian elimination with
partial pivoting, as well as with complete pivoting to calculate the inverses. Explain the
results.

21. Matrix A is said to be diagonally dominant, if

Σ_{j=1, j≠i}^{n} |a_ij| < |a_ii|,     (i = 1, 2, ..., n).

Show that if we apply Gaussian elimination without pivoting to such a matrix, then the
matrix A^(1) after the first major step is also diagonally dominant. Hence, deduce that
the elements a_ii^{(i−1)} ≠ 0 and, in principle, pivoting is never necessary for such matrices.

22. For the lower triangular matrices Lk considered in Section 3.3, show that the inverse can
be obtained by changing the sign of the off-diagonal elements. Using this property prove
(3.31).

23. If in Doolittle's algorithm we define l'_ik = u_kk l_ik and u'_kj = u_kj/u_kk, then show that l'_ik
and u'_kj satisfy the equations used in Crout's algorithm. Hence, the elements of triangular
matrices in the two algorithms are related by the above relationship. In particular, show
that u_kk = l'_kk, i.e., the diagonal elements in the two cases are equal.

24. Show that if a nonsingular matrix A can be factored into a product of unit lower triangular
matrix L and a upper triangular matrix U, then the factorisation is unique. Find one
example of a nonsingular matrix for which such a factorisation does not exist. Also find
an example of a singular matrix for which the factorisation is not unique.

25. Show that for a symmetric positive definite matrix A , it is possible to find a triangular
decomposition of the form A = LLT. This decomposition is referred to as Cholesky
decomposition and the algorithm for obtaining this decomposition can be specified as
For k = 1 to n in succession perform the following steps:
(i) Compute

l_ki = (a_ki − Σ_{j=1}^{i−1} l_ij l_kj) / l_ii,     (i = 1, ..., k − 1),

and overwrite l_ki on a_ki.

(ii) Compute

t = a_kk − Σ_{j=1}^{k−1} l_kj^2.

If t > 0, then l_kk = √t, else l_kk = 0, and overwrite l_kk on a_kk. If t ≤ 0 the calculation
has to be aborted.
Under what circumstances can the quantity t in the last step be negative? Write a com-
puter program implementing this algorithm and test it on linear systems involving Hilbert
matrices considered in Example 3.5. Find an example of nonsingular real symmetric ma-
trix for which decomposition of the form considered here does not exist.
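A minimal Python sketch of the algorithm described in this exercise is given below; it is an illustration rather than the program requested, and the 3 × 3 Hilbert matrix used as a test is a small stand-in for the 6 × 6 case of Example 3.5.

    import math

    def cholesky_lower(a):
        """Cholesky decomposition A = L L^T following the steps above.
        Only the lower triangle of a is used; it is overwritten by L."""
        n = len(a)
        for k in range(n):
            for i in range(k):
                # l_ki = (a_ki - sum_{j<i} l_ij l_kj) / l_ii
                s = a[k][i] - sum(a[i][j] * a[k][j] for j in range(i))
                a[k][i] = s / a[i][i]
            t = a[k][k] - sum(a[k][j] ** 2 for j in range(k))
            if t <= 0.0:
                raise ValueError("matrix is not positive definite")
            a[k][k] = math.sqrt(t)
        return a

    n = 3
    hilbert = [[1.0 / (i + j + 1) for j in range(n)] for i in range(n)]
    L = cholesky_lower([row[:] for row in hilbert])
    print([row[: i + 1] for i, row in enumerate(L)])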

26. Prove (3.53) and (3.56).

27. Repeat Example 3.4 using single precision arithmetic to calculate the residue and check
if the process converges. Also try to do the entire calculation in double precision and
study the behaviour of iterations.

28. Use a computer search (or analysis if possible) to find that vector x out of the class of
floating-point vectors with four significant decimal digits (base b = 10, t = 4), which
minimises the residue ||Ax − b||_2 for the following system, and compare it with the
computed solution rounded to four decimal digits:

0.9983x1 + 0.8873x2 = 0.8764,


0.6655x1 + 0.6015x2 = 0.5829.

29. Find the inverse of the following n × n matrices

A_2 = (  33   16   72 )
      ( −24  −10  −57 )
      (  −8   −4  −17 )

together with the matrices A_1, A_3 and A_4 (A_3 and A_4 have the same entries apart
from the denominators, 7 and 8 respectively; their full layout is not reproduced here),
and test the accuracy of computation by solving a linear system Ax = b, where
b_i = Σ_{j=1}^{n} a_ij,     (i = 1, ..., n).

The exact solution of these equations (x_i = 1) can be compared with the computed
solution using A^{-1} b, as well as the one using Gaussian elimination or LU decomposition.
Also, compute the determinant of the matrices. Test the accuracy of the inverse by
explicitly computing AA^{-1} − I. Which of the two tests is more reliable? Compute the
inverse of the computed inverse matrix and compare it with the original matrix. Note
that the last two matrices are essentially the same, but compare the results in the two cases and
explain the differences.
30. Repeat the previous exercise on the following n x n matrices

(i)  a_ij = 1/(i + j − 1) for i = 1  (the elements for i > 1 are not reproduced here);

(ii)  a_ij = √(2/(n+1)) sin(ijπ/(n+1)),     (i, j = 1, 2, ..., n);

(iii)  a_ij = n + 1 − max(i, j),     (i, j = 1, 2, ..., n).


Try n = 5, 10 and 15. For the first matrix try to compute the exact inverse using the fact
that elements of inverse are all integers.
31. If L is a lower triangular matrix, then its inverse can be calculated by directly solving
the system of equations

Σ_{k=j}^{i} l_ik l_kj^{-1} = δ_ij.

Noting that L^{-1} should also be lower triangular {1}, show that the diagonal elements of
the inverse are l_ii^{-1} = 1/l_ii. The off-diagonal elements can be calculated column-wise, using

l_ij^{-1} = −(1/l_ii) Σ_{k=j}^{i−1} l_ik l_kj^{-1},     (i = j + 1, j + 2, ..., n).
Show that this process requires n(n + 1)/2 divisions, n(n^2 − 1)/6 multiplications and
n(n − 1)(n − 2)/6 additions. Give an equivalent algorithm for finding the inverse of an upper
triangular matrix. Note that if we have a unit triangular matrix, then the divisions will
not be required. Hence, show that matrix inversion using LU decomposition will require
n^2 divisions, n^3 multiplications and n^3 − 2n^2 + n additions.
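The column-wise scheme of this exercise can be sketched in Python as follows; the function name and the small test matrix are illustrative only.

    def invert_lower_triangular(L):
        """Invert a lower triangular matrix column by column, using
        linv[j][j] = 1/l[j][j] and
        linv[i][j] = -(1/l[i][i]) * sum_{k=j}^{i-1} l[i][k]*linv[k][j]."""
        n = len(L)
        linv = [[0.0] * n for _ in range(n)]
        for j in range(n):
            linv[j][j] = 1.0 / L[j][j]
            for i in range(j + 1, n):
                s = sum(L[i][k] * linv[k][j] for k in range(j, i))
                linv[i][j] = -s / L[i][i]
        return linv

    # Quick check on a hypothetical example: L * Linv should be the identity
    L = [[2.0, 0.0, 0.0],
         [1.0, 3.0, 0.0],
         [4.0, 5.0, 6.0]]
    Linv = invert_lower_triangular(L)
    prod = [[sum(L[i][k] * Linv[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]
    print(prod)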
32. Gaussian elimination algorithm can be used for complex matrices by simply declaring
the relevant variables to be complex in the computer program. Find the inverse of the
following matrix
(a complex matrix whose entries include 1 + 2i, 3i, −5 + 14i, 2 + 10i, 5i and −8 + 20i;
the full layout is not reproduced here).

33. Let u and v be column vectors and A be a nonsingular matrix. Verify that

(A + uv^T)^{-1} = A^{-1} − (A^{-1} u v^T A^{-1}) / (1 + v^T A^{-1} u).

This formula, which is referred to as the Sherman-Morrison formula, can be used to find the inverse
of a matrix which differs from a matrix with known inverse in a few elements only {20}.
In particular, it can be used for sparse matrices which have a few nonzero elements outside
the main band. Let

B = A + Σ_{j=1}^{m} u_j v_j^T,

and define

C_k = (A + Σ_{j=1}^{k} u_j v_j^T)^{-1}.

Using the above formula, B^{-1} = C_m can be calculated recursively, starting with C_0 = A^{-1}, using

C_{k+1} = C_k − (C_k u_{k+1} v_{k+1}^T C_k) / (1 + v_{k+1}^T C_k u_{k+1}),     (k = 0, 1, ..., m − 1).

Here it is assumed that the denominator is nonzero at all stages.
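A minimal Python (numpy) sketch of this recursion is given below; the function name and the example update are illustrative, and the nonzero-denominator assumption is checked explicitly.

    import numpy as np

    def add_rank_one_updates(Ainv, updates):
        """Given Ainv = A^{-1}, apply the Sherman-Morrison formula
        repeatedly for B = A + sum_j u_j v_j^T and return B^{-1}.
        `updates` is a list of (u, v) pairs of 1-D arrays."""
        C = Ainv.copy()
        for u, v in updates:
            Cu = C @ u
            vC = v @ C
            denom = 1.0 + v @ Cu
            if abs(denom) < 1e-14:      # assumed nonzero in the exercise
                raise ZeroDivisionError("update is singular")
            C = C - np.outer(Cu, vC) / denom
        return C

    # Hypothetical example: start from a diagonal matrix, add one term
    A = np.diag([2.0, 3.0, 4.0])
    u = np.array([1.0, 0.0, 1.0])
    v = np.array([0.0, 1.0, 1.0])
    Binv = add_rank_one_updates(np.linalg.inv(A), [(u, v)])
    print(np.allclose(Binv @ (A + np.outer(u, v)), np.eye(3)))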

34. In the previous exercise, if u_j = e_i, the ith column of the unit matrix I, then the new
elements are added in the ith row only. Starting with a nonsingular diagonal matrix we
can go on adding one row at a time by appropriately choosing the vector Vi at each
stage, to calculate the inverse of the required matrix in n steps. Estimate the number
of arithmetic operations required to calculate the inverse using this technique. Find the
inverses of matrices in {29} using this technique.
35. Gaussian elimination algorithm can in principle be modified to find solution to linear sys-
tems with singular matrices. In the elimination stage, if all elements in the corresponding
column are zero, then that step can be ignored and we can go to the next step. However,
in such cases one of the diagonal elements in the upper triangular factor U will be zero.
During back-substitution if it so happens that the corresponding numerator is also zero,
then the equations are consistent and the general solution can be obtained by giving
some arbitrary value to the corresponding variable. In actual practice, it will be difficult
to distinguish between zero and nonzero, but small elements. Try this algorithm on the
following system and obtain the general solution:

3x_1 + 4x_2 − 5x_3 − x_4 = 12,
2x_1 + 3x_2 − x_3 − 3x_4 = 1,
4x_1 + 5x_2 − 9x_3 + 6x_4 = 28,
x_1 + x_2 − 4x_3 − 2x_4 = 7.

Try to find the null-space and range of the matrix using Gaussian elimination.

36. Apply SVD to find the general solution to the system in previous problem. Also find
the null-space and range of this matrix and compare the results with those obtained
using Gaussian elimination. What happens if the right hand side of the first Equation is
changed to 2?
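As an illustration of how SVD exposes the rank, range and null space, the following Python (numpy) sketch applies it to the 4 × 4 system of the previous exercise; the zero-threshold used for the singular values is a common heuristic, not a prescription from the text.

    import numpy as np

    A = np.array([[3., 4., -5., -1.],
                  [2., 3., -1., -3.],
                  [4., 5., -9., 6.],
                  [1., 1., -4., -2.]])
    b = np.array([12., 1., 28., 7.])

    U, s, Vt = np.linalg.svd(A)
    tol = max(A.shape) * np.finfo(float).eps * s[0]   # heuristic threshold
    rank = int(np.sum(s > tol))

    col_range = U[:, :rank]        # basis for the range of A
    null_space = Vt[rank:]         # rows span the null space of A

    # Minimum-norm particular solution: invert only nonzero singular values
    x_part = Vt[:rank].T @ ((U[:, :rank].T @ b) / s[:rank])
    print("rank =", rank)
    print("particular solution:", x_part)
    print("null space basis:", null_space)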

37. Solve exercises {19} and {20} using SVD and compare the results with those obtained
using LU decomposition. Why is SVD more accurate in this case?

38. For the electrical circuit shown in the illustration, the various resistances R_1, R_2,
R_3, R_4, R_5 are given, and the current I_0 which flows in at the left and out at the right is
given. Solve this network to find the currents in all the branches by solving the following

system of linear equations:

I_1 + I_4 = I_0,
I_2 + I_3 = I_0,
I_3 − I_4 − I_5 = 0,
I_1 − I_2 − I_5 = 0,
R_1 I_1 − R_4 I_4 + R_5 I_5 = 0,
R_2 I_2 − R_3 I_3 − R_5 I_5 = 0,
R_1 I_1 + R_2 I_2 − R_3 I_3 − R_4 I_4 = 0.

Show that only five of these equations are independent and solve them for the five un-
knowns I_1, ..., I_5, assuming R_i = i and I_0 = 1. Apply SVD to the entire set and find
the solution of the overdetermined system.

39. Householder transformations are produced by the elementary reflector P defined by

P = I − 2vv^T,     where v^T v = 1.

Show that P is symmetric and orthogonal. Hence, P^2 = I. Premultiplication by these
matrices reflects the space in a hyperplane through the origin perpendicular to v. For any
two given vectors x and y of equal length, we can find an elementary reflector P such
that Px = y. Show that this transformation is achieved by using

v = (x − y) / ||x − y||_2.
Use this result to verify (3.89) and (3.91).
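A minimal Python (numpy) sketch of this construction is given below; it maps a vector onto a multiple of the first coordinate vector, which is the basic step used in the next two exercises. The function name and test data are illustrative only.

    import numpy as np

    def reflector_mapping(x, y):
        """Elementary reflector P = I - 2 v v^T with v = (x-y)/||x-y||_2,
        which maps x to y when ||x||_2 = ||y||_2."""
        v = x - y
        v = v / np.linalg.norm(v)
        return np.eye(len(x)) - 2.0 * np.outer(v, v)

    x = np.array([3.0, 4.0, 0.0])
    y = np.array([np.linalg.norm(x), 0.0, 0.0])
    P = reflector_mapping(x, y)
    print(P @ x)                            # ~ (5, 0, 0)
    print(np.allclose(P @ P, np.eye(3)))    # P is symmetric and orthogonal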

40. Instead of Gaussian elimination, we can use a sequence of Householder transformation to


transform the matrix to an upper triangular form. The advantage of Householder trans-
formation is that, it does not amplify the magnitude of any matrix elements and hence is
numerically more stable. Using the results in the previous exercise give an algorithm for
reducing a matrix to upper triangular form using Householder transformations (Wilkin-
son, 1988). Show that this algorithm will require of the order of (2/3)n^3 multiplications.
Hence, it requires about twice as much effort as that required by Gaussian elimination.
Consequently, it is rarely used in practice despite better numerical stability.

41. In the previous exercise, instead of Householder transformations we can use a sequence
of Givens' rotation to transform a matrix to an upper triangular form (Wilkinson, 1988).
Show that this algorithm will require about (4/3)n^3 multiplications, which is about four
times that required by Gaussian elimination. This transformation is also orthogonal and
hence is more stable than Gaussian elimination.

42. Solve the following system of equations using (a) Gaussian elimination without pivoting,
(b) Gaussian elimination with partial pivoting and (c) Gauss-Seidel iteration.

(i) 0.876543 x_1 + 0.617341 x_2 + 0.589973 x_3 = 0.863257,
    0.612314 x_1 + 0.784461 x_2 + 0.827742 x_3 = 0.820647,
    0.317321 x_1 + 0.446779 x_2 + 0.476349 x_3 = 0.450098,
(ii) 0.99999 x_1 − x_2 + x_3 − x_4 = −3,
     x_1 − x_2 = −1,
     x_1 − x_2 + x_3 − x_4 = −2,
     x_1 − x_4 = −3,

(iii) −1.99 x_1 + x_2 = −1,
      x_8 − 1.99 x_9 = −2,
      x_{i−1} − 1.99 x_i + x_{i+1} = 0,     (i = 2, 3, ..., 8),

(iv) 4x_k − (k h^3/3)(4x_1 + 4x_2 + 12x_3 + 8x_4 + 20x_5 + 12x_6 + 28x_7 + 8x_8)
     = 4 sin(kh) − kh,     k = 1, 2, ..., 8     (h = π/16),

(v) 4x_1 − x_2 − x_4 = 1/√2,
    −x_2 + 4x_3 − x_6 = 1/√2,
    −x_2 − x_4 + 4x_5 − x_6 − x_8 = 0,
    −x_4 + 4x_7 − x_8 = 0,
    −x_1 + 4x_2 − x_3 − x_5 = 1,
    −x_1 + 4x_4 − x_5 − x_7 = 0,
    −x_3 − x_5 + 4x_6 − x_9 = 0,
    −x_5 − x_7 + 4x_8 − x_9 = 0,
    −x_6 − x_8 + 4x_9 = 0.

Note: Equations (iii), (iv), (v) correspond to the solution of a differential equation, an
integral equation and Laplace's equation, respectively.
43. Consider the N^2 × N^2 matrix A with elements all zero, except

a_ii = 4,  and  a_ij = −1 when the points i and j are neighbours on the underlying N × N grid
(j = i ± 1 within the same row of the grid, or j = i ± N),

and the corresponding right-hand side vector b with elements all zero except

b_i = sin(iπ/(N+1)),     (i = 1, 2, ..., N).

This is a generalisation of the last system in previous problem , which corresponds to


N = 3. As will be seen in Section 14.7, this system corresponds to the solution of
Laplace's equation using a simple finite difference method. Consider N = 10, 20, 50, 100
and estimate the solution to an accuracy of 0.1% using the Gauss-Seidel method. How many
iterations are required in each case? Compare its efficiency with that of direct methods.
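A minimal Python (numpy) sketch of Gauss-Seidel iteration for this system is given below, with the unknowns stored on the N × N grid so that the matrix is never formed explicitly. The stopping rule is a simple surrogate for the 0.1% accuracy requirement, and all names are illustrative.

    import numpy as np

    def gauss_seidel_laplace(N, tol=1e-3, maxiter=10000):
        # 4 on the diagonal, -1 for the grid neighbours of each point
        x = np.zeros((N, N))
        b = np.zeros((N, N))
        b[0, :] = np.sin(np.arange(1, N + 1) * np.pi / (N + 1))
        for it in range(1, maxiter + 1):
            diff = 0.0
            for i in range(N):
                for j in range(N):
                    s = b[i, j]
                    if i > 0: s += x[i - 1, j]
                    if i < N - 1: s += x[i + 1, j]
                    if j > 0: s += x[i, j - 1]
                    if j < N - 1: s += x[i, j + 1]
                    new = s / 4.0
                    diff = max(diff, abs(new - x[i, j]))
                    x[i, j] = new
            if diff <= tol * max(1.0, np.max(np.abs(x))):  # crude 0.1% test
                return x, it
        return x, maxiter

    x, iters = gauss_seidel_laplace(10)
    print("iterations needed:", iters)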
44. Generate a number of 10 x 10 matrices using random numbers (see Section 6.11) in the
interval (0,1) as matrix elements. Generate the right hand side vector such that exact
solution is Xi = 1 and solve the system using Gaussian elimination with and without
pivoting using single precision arithmetic, and compare the results with exact solution.
What is the probability of finding a matrix where pivoting is essential, in the sense that
the solution without pivoting has an error larger than 10^{-4} in at least one component?
Estimate the condition number of these matrices and find the probability for the condition
number to be larger than 100.
Chapter 4

Interpolation

Interpolation, which is essentially the art of reading between the lines of a table,
was the most important topic in numerical analysis in the pre-computer days.
This is because of the fact that in those days it was much faster to look up a
value from a table of mathematical function, rather than to calculate it ab initio.
The situation can be characterised by the low speed of computation and almost
unlimited amount of memory, which could be accessed in a time comparable to
that required for arithmetic operations. Once the computer came into picture,
the speed increased by several orders of magnitude, and at the same time the
fast access memory became very limited. Hence, in most cases, it is not economic
to store tables in the computer memory. As a result, the use of interpolation has
declined considerably. Present day computers can calculate most of the usual
mathematical functions directly in a reasonable time. Despite this advance,
interpolation has not become obsolete. There are many complicated functions
like opacities of stellar matter, or the position of astronomical objects at a given
time , which cannot be calculated in a reasonable time or effort, and for which
interpolation is still used.
Apart from this, there are many functions which are known experimen-
tally only at a limited number of points, and interpolation offers a convenient
method for approximating such functions at nontabulated points. In fact, even
in the pre-computer days one of the important uses of interpolation was in
approximating a function at nontabular points. This approximation can then
be used for other manipulations, like integration and differentiation. Hence,
interpolation formulae form the basis for numerical integration, differentiation,
solution of nonlinear equations, solution of differential equations and so on.
For example, the basic strategy for numerical differentiation or integration is
to approximate the given function by an interpolating function , and then to
differentiate or integrate this interpolating function. Similarly, for solution of
a nonlinear equation, the basic strategy is to approximate the given equation
by a linear or a quadratic equation using interpolation, and then to solve this
approximate equation. Hence, interpolation forms the basis for most of the nu-

merical methods, even though it may rarely be used directly. Interpolation is


not the only type of approximation used in numerical work. Other types of
approximations will be considered in Chapter 10.
Since this book is mainly directed towards digital computers, our discus-
sion of interpolation will be brief. We will not consider most of the classical
formulae for interpolation. For that the readers can consult any of the clas-
sical texts on numerical analysis. In this chapter, we will consider polynomial
interpolation, piecewise polynomial interpolation (spline) and rational function
interpolation. Trigonometric interpolation will be considered in Chapter 10.
Most of the discussion will be restricted to interpolation in one variable, but
in the last two sections we will consider the problem of interpolation in more
than one variable.

4.1 Polynomial Interpolation


Most of the standard mathematical functions, like the sine or the exponential
cannot be calculated using a finite number of arithmetic operations. Hence, it
is necessary to approximate these functions by other functions, which can be
evaluated using a finite number of arithmetic operations. A piecewise rational
function is probably the most general kind of function that can be directly
evaluated using only the basic arithmetic operations. Of course, most of the
higher level languages allow us to use the trigonometric and exponential func-
tions, but in practice these functions are themselves evaluated using a rational
function approximation.
Polynomials are the simplest and most widely used class of functions for
approximations. This is because of the ease with which they can be evaluated
and manipulated in general and differentiated and integrated in particular. It is
well-known from mathematical analysis, that the class of functions {x^n} forms a
complete set over any finite interval [a, b]. The Weierstrass theorem states that:
If f(x) is continuous on a finite interval [a, b], then, given ε > 0, there exists an
n (= n(ε)) and a polynomial P_n(x) of degree n, such that |f(x) − P_n(x)| < ε
for all x in [a, b].
There is a constructive proof of this theorem due to Bernstein, but we will
not consider that here, since it is not useful for practical computations. How-
ever, this theorem assures us that any continuous function can be approximated
by a polynomial of sufficiently high degree on a finite interval. In practice, even
this assurance is not enough, since if it turns out that the degree of polynomial
required is 10000, then the approximation will be essentially useless. Indeed,
there are several functions for which polynomials provide very poor approxi-
mation, but at the same time for a large variety of functions, polynomials have
been used successfully. Hence, in this section we consider polynomial interpola-
tion. It may be noted that the Weierstrass theorem is actually concerned with
uniform approximations of the type considered in Chapter 10.

The basic problem of interpolation consists of approximating the given


function f(x) by another function g(x)

f(x) = g(x) + E(x), (4.1 )

where E(x) is the error of approximation or the truncation error. Further, the
interpolating function should agree with the given function at a specified set of
points aj, (j = 0,1, ... , n), i.e.,

f(a_j) = g(a_j),     (j = 0, 1, ..., n).     (4.2)

This set of points will be referred to as the tabular points, or the abscissas.
In general, we can also demand equality of some of the derivatives, but in this
section, we consider only the simplest case, where the interpolating function is
a polynomial and equality of derivative is not required.
The Weierstrass theorem does not guarantee the existence of an interpo-
lating polynomial, and in fact, it is not obvious that such a polynomial will exist
for any arbitrary table of values. Since a polynomial of degree n has n + 1 coef-
ficients, we can hope to fix these coefficients in such a way that the polynomial
interpolates the given function at n + 1 distinct points. In fact, the following
theorem assures us that this is indeed the case.
Theorem 4.1: Given a real valued function f(x) and n + 1 distinct points
ao, ... ,an, there exists one and only one polynomial of degree less than or
equal to n, which interpolates f(x) at ao, .. ·, an.
Proof: In this book we normally refrain from proving theorems, but in this case,
the proof is very simple and it is tempting to include it. We will first prove
the existence of an interpolating polynomial and then prove its uniqueness.
Consider the functions

l_k(x) = Π_{j=0, j≠k}^{n} (x − a_j)/(a_k − a_j),     (k = 0, ..., n).     (4.3)

Each of these lk(x) is a polynomial of degree n, and hence


g_n(x) = Σ_{k=0}^{n} f(a_k) l_k(x),     (4.4)

is a polynomial of degree less than or equal to n. It is clear that

l_k(a_j) = δ_jk = { 1,  if j = k;     (4.5)
                    0,  if j ≠ k; }
where δ_jk is the Kronecker delta. Hence, g_n(a_j) = f(a_j), for j = 0, ..., n,
which proves that gn(x) is an interpolating polynomial of degree less than or
equal to n. To establish its uniqueness, let us assume that there is another
polynomial ĝ_n(x) of degree n or less, which also interpolates the given function

at the same set of points. Then consider h(x) = ĝ_n(x) − g_n(x), which is also a
polynomial of degree at most n. It can be seen that h(a_j) = 0 for j = 0, ..., n.
Hence, this polynomial vanishes at n + 1 distinct points, which is possible only
if the polynomial is identically zero. Hence, ĝ_n(x) ≡ g_n(x), which completes
the proof of the theorem.
The polynomial in (4.4) is referred to as Lagrange's interpolation polyno-
mial, and the process is referred to as Lagrangian interpolation. This theorem
not only assures us of the existence of an interpolating polynomial, but also
tells us how to construct such a polynomial, which in turn is shown to be the
unique polynomial of the specified degree. Hence, we may conclude that the
problem of polynomial interpolation is now completely solved. Actually, it is
just the beginning of the subject, since we still have to estimate the trunca-
tion error, and further we have to find an efficient algorithm for evaluating the
interpolated value at any required point. It is the latter requirement, which
has given rise to a maze of interpolation formulae which adorn the classical
books on numerical analysis. In fact, there are interpolation formulae named
after most of the well-known mathematicians like, Newton, Bessel, Lagrange,
Stirling, Everett, and so on. Most of these formulae are algebraically equivalent
to the Lagrange's polynomial (4.4).
Let us consider the case n = 1, when there are only two points. The
Lagrangian interpolation polynomial in that case is

g_1(x) = f(a_0) (x − a_1)/(a_0 − a_1) + f(a_1) (x − a_0)/(a_1 − a_0),     (4.6)

which is the familiar formula for linear interpolation. Similarly, we can get
higher order interpolation formula involving more points.
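A direct evaluation of (4.4) can be sketched in Python as follows; the function name is illustrative, and the three tabular values of sin x correspond to the data used in Example 4.1 below.

    import math

    def lagrange_interpolate(a, f, x):
        # Direct evaluation of (4.4): sum of f(a_k) times the cardinal
        # polynomial l_k(x) of (4.3).
        total = 0.0
        for k in range(len(a)):
            lk = 1.0
            for j in range(len(a)):
                if j != k:
                    lk *= (x - a[j]) / (a[k] - a[j])
            total += f[k] * lk
        return total

    a = [40.0, 45.0, 50.0]                       # degrees
    f = [math.sin(math.radians(t)) for t in a]
    print(lagrange_interpolate(a, f, 44.4444))   # ~0.7002254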
Let us try to estimate the truncation error in interpolation polynomial

E_n(x) = f(x) − g_n(x).     (4.7)

Obviously, En(aj) = 0 for j = 0, 1, ... , n. Thus, let us assume that x is not one
of the tabular points aj, and consider the function

F(z) = f(z) − g_n(z) − (f(x) − g_n(x)) p_n(z)/p_n(x),     (4.8)

where
p_n(x) = Π_{i=0}^{n} (x − a_i).     (4.9)

The function F(z), as a function of z, has n + 2 distinct zeros at z = a_0, ..., a_n
and x. Therefore, by applying Rolle's theorem n + 1 times,

F^{(n+1)}(z) = f^{(n+1)}(z) − g_n^{(n+1)}(z) − (f(x) − g_n(x)) p_n^{(n+1)}(z)/p_n(x)     (4.10)

has at least one zero in the interval limited by the largest and the smallest of the
numbers a_0, ..., a_n and x. Here f^{(n+1)}(z) denotes the (n+1)th derivative of f(z)
with respect to z. Calling this zero z = ξ, and noting that g_n^{(n+1)}(z) ≡ 0, since
g_n(z) is a polynomial of degree less than or equal to n, and p_n^{(n+1)}(z) = (n + 1)!,
we get

f^{(n+1)}(ξ) − (f(x) − g_n(x)) (n + 1)!/p_n(x) = 0.     (4.11)

Using (4.7) and (4.11), we get

E_n(x) = (p_n(x)/(n + 1)!) f^{(n+1)}(ξ).     (4.12)

If x is one of the tabular points, then p_n(x) = 0 and also E_n(x) = 0. Thus, this
formula holds for all values of x, although it was derived for nontabular points.
If f(x) itself is a polynomial of degree less than or equal to n, then
f^{(n+1)}(ξ) = 0 and E_n(x) ≡ 0. Thus, the interpolation would be exact for
polynomials of degree n or less, which is consistent with our assumptions. In
fact in this case, the polynomials f(x) and g_n(x) will be algebraically identical
by Theorem 4.1. In general, an interpolation formula which is exact for poly-
nomials of degree less than or equal to r is said to have an order of accuracy r,
or to be of order r. If f^{(n+1)}(x) = constant ≠ 0 (i.e., f(x) is a polynomial of
degree n + 1), then the truncation error can be exactly estimated, and if E_n(x)
is added to g_n(x) we get a polynomial of degree n + 1, which is algebraically
identical to f(x). In general, it is not possible to find the exact value of E(x), which
is of course sensible, since if we were able to estimate it exactly, then there is
no need to use an approximation.
EXAMPLE 4.1: Calculate sin( 44.4444°), by using the values of sin x at x = 40°, 45° and
50°.
Using (4.4) gives 0.70022540 as compared to the exact value of 0.70021679. Now for
f(x) = sin x with x expressed in degrees,

f'''(x) = −(π/180)^3 cos(πx/180),  so that  |f'''(x)| ≤ (π/180)^3.     (4.13)

Using this estimate we get an upper bound on the truncation error of

(4.4444 × 0.5556 × 5.5556 / 3!) (π/180)^3 ≈ 1.22 × 10^{-5},     (4.14)

which is consistent with the actual error of −8.61 × 10^{-6}. To get a better estimate of the
truncation error we can use (4.12) to write
E_2 = −1.22 × 10^{-5} cos(ξ),     (4.15)
which gives 7.81 × 10^{-6} < −E_2 < 9.31 × 10^{-6}. Hence, in this case, we are able to estimate
the truncation error rather well. In general, it is not possible to estimate the derivatives so
well and it may be difficult to give any estimate of the truncation error.

It may appear that any desired accuracy can be achieved by including


sufficient number of points in the interpolation formula. However, in actual
practice it is found that the interpolation series, obtained by allowing the num-
ber of tabular points tend to infinity, are usually only asymptotically convergent,

that is as we add more points the error at first reduces and then at some point
it starts increasing and grows without bound. One reason for this eventual di-
vergence is connected with the fact, that nth derivative of all, but a few entire
functions eventually grow without bound as n increases. Further, even if the
derivatives are bounded, as we include more points they are necessarily farther
and farther apart and so |x − x_i| becomes large. Thus, we should never use too
many points in any interpolation formula. Further, the roundoff error will also
increase with the number of points. It is probably not advisable to use more
than six points for interpolation. If the interpolation series is converging fast,
then it will be good enough, while if the series is only slowly converging, then in
any case, it is not possible to estimate the truncation error by just looking at the
new terms which are added. Hence, unless there is some reason to believe that
use of a larger number of points is going to give a better accuracy, we should
not try more than five or six points. It should be noted that in actual practice
interpolation is used only for functions which are very difficult to calculate or
are not known in a closed form, and it is virtually impossible to estimate their
derivatives. Thus, it is not possible to use (4.12) for estimating the truncation
error. Usually the only way to estimate the error is to go on adding points one
by one and observing the difference in interpolated values.
It should be noted that derivatives of f(x) occur in the estimate of trunca-
tion error and it should be ensured that the corresponding derivative is bounded
in the required interval. Thus for example, it is difficult to use polynomial in-
terpolation to approximate √x near x = 0. Another way to see this difficulty
is to notice that the curve of √x against x has a vertical tangent at the ori-
gin, and since no polynomial can have a vertical tangent, it is not possible to
approximate such curves using polynomials. In such cases, rational functions
which are just the ratio of two polynomials, are more effective. Thus before
trying interpolation one should plot out the tabular points to check whether
the tabulated function can be approximated by polynomials.
It can be easily seen that the Lagrange's interpolation polynomial is not
very efficient for computational purposes. It is customary to measure the ef-
ficiency of any algorithm by counting the number of floating-point operations
involved. This number is usually referred to as the complexity of the algorithm.
To evaluate the Lagrangian interpolation formula at any point x, we can eval-
uate Pn(X) requiring n + 1 additions and n multiplications. Then each of the
lj(x) can be evaluated using n additions, n multiplications and one division.
Hence, it requires a total of n(n + 3) + 1 additions, n(n + 3) + 1 multiplications
and n + 1 divisions to calculate gn(x). If we want to calculate gn(x) for several
different values of x, by using the same set of tabular points, we can preserve
the values of
Π_{i=0, i≠j}^{n} (a_j − a_i),     (4.16)
for each j. Then it requires only 2n + 1 additions, 2n + 1 multiplications and
n + 1 divisions to evaluate gn(x). This should be compared with only n addi-
tions and n multiplications required to evaluate a polynomial by using nested

multiplication. For example, a fourth degree polynomial can be written as

c_4 x^4 + c_3 x^3 + c_2 x^2 + c_1 x + c_0 = (((c_4 x + c_3) x + c_2) x + c_1) x + c_0.     (4.17)

In general, for a polynomial of degree n we can use the recurrence

p_0 = c_n,     p_i = p_{i−1} x + c_{n−i},     (i = 1, ..., n).     (4.18)

Here the final number Pn will be the value of polynomial at x. In most cases, the
intermediate Pi are not required. Hence, in any programming language, we can
use the same scalar variable to denote all values of Pi for i = 0, ... , n. Hence,
the evaluation of Lagrangian form of interpolation polynomial is roughly 2.5
times less efficient as compared to the evaluation of a polynomial of the same
degree, whose coefficients are known.
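A minimal Python sketch of nested multiplication is given below; the coefficient ordering (leading coefficient first) and the test polynomial are illustrative choices, not taken from the text.

    def horner(coef, x):
        """Evaluate a polynomial by nested multiplication as in (4.18).
        coef[0] is the leading coefficient, coef[-1] the constant term."""
        p = coef[0]
        for c in coef[1:]:
            p = p * x + c
        return p

    # p(x) = 2x^3 - 6x^2 + 2x - 1 evaluated at x = 3
    print(horner([2.0, -6.0, 2.0, -1.0], 3.0))   # 5.0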
Apart from this, there is another problem with Lagrangian interpolation,
which is as follows. If we have evaluated the polynomial for n tabular points
and decide to add one more point, then even if the values of all the required
quantities is preserved, a considerable amount of effort will be required. This is
very important, since as already mentioned it is usually impossible to estimate
the truncation error using (4.12). Hence in most cases, the truncation error can
only be estimated by performing a series of interpolations, with one point being
added at each step. That will require considerable effort using the Lagrangian
form of interpolation polynomial. It can be seen that, even if all the denominators
are preserved, it requires 2n + 1 additions, 2n + 2 multiplications and
n + 1 divisions, when going from n to (n + I)-point interpolation formula. As
we will see in the next section, other forms of interpolation formula are more
efficient in this respect.

4.2 Divided Difference Interpolation Formula


Most of the classical interpolation formulae are meant for the special case,
when the tabular points are uniformly spaced. In pre-computer days this was
the most important condition, since tables for most mathematical functions
were available with uniform spacing in abscissas. However, now that most of
these functions are directly calculated by the computer, we need interpolation
only for those situations, where an experimentally determined function, or a
very complicated function has to be approximated at nontabular points. Quite
often, the values are not available at uniform intervals. Hence, we first consider
the Newton's divided difference interpolation formula, which is applicable for
arbitrary spacing of the abscissas. Of course, this formula is also applicable
to situations where the tabular points are uniformly spaced, although it will
require more effort than using a formula specially designed for points with
uniform spacing.

Divided differences of order 0, 1, ... , k are defined recursively by the


relations
f[a_0] = f(a_0),     f[a_0, a_1] = (f[a_1] − f[a_0])/(a_1 − a_0),     ...,
                                                                       (4.19)
f[a_0, ..., a_k] = (f[a_1, ..., a_k] − f[a_0, ..., a_{k−1}])/(a_k − a_0).
Here it should be noted that the first k - 1 arguments in the first term of the
numerator are the same as the last k - 1 arguments in the second term and
that the denominator is the difference between those arguments which are not
common to the two terms. It is clear from the definitions that f[ao, ... ,ak] is a
linear combination of the k + 1 ordinates f(ao), ... , f(ak), with the coefficients
depending upon the corresponding k+1 abscissas. Further, f[a_1, a_0] = f[a_0, a_1],
and it can be proved that divided differences of any order are symmetric func-
tions of their arguments and the order of arguments is immaterial. The result
follows from the relation

f[a_0, ..., a_k] = Σ_{j=0}^{k} f(a_j) / Π_{i=0, i≠j}^{k} (a_j − a_i),     (4.20)

which can be proved by induction {6}.


If f(x) is a polynomial of degree m, then the first-order divided difference
f[a_0, x] is a polynomial of degree m − 1 in x. This result follows from

(x^m − a_0^m)/(x − a_0) = x^{m−1} + a_0 x^{m−2} + a_0^2 x^{m−3} + ... + a_0^{m−2} x + a_0^{m−1}.     (4.21)
By continuing this argument it can be shown that, if f(x) is a polynomial of
degree m, then f[ao, ... ,ak-l,x] is a polynomial of degree m - k, if k ~ m,
while for k > m the divided difference will be identically zero.
To derive the interpolation formula, we can note from the definition of
divided differences
f(x) = f[a_0] + (x − a_0) f[a_0, x],
f[a_0, x] = f[a_0, a_1] + (x − a_1) f[a_0, a_1, x],     (4.22)
...
f[a_0, ..., a_{n−1}, x] = f[a_0, ..., a_n] + (x − a_n) f[a_0, ..., a_n, x].
By successive substitution in the first of these equations, we get

f(x) = f[a_0] + (x − a_0) f[a_0, a_1] + (x − a_0)(x − a_1) f[a_0, a_1, a_2] + ...
       + p_{n−1}(x) f[a_0, ..., a_n] + p_n(x) f[a_0, ..., a_n, x]     (4.23)
     = g_n(x) + E_n(x),
where Pn(x) is the product as defined in (4.9), and gn(x) is a polynomial of
degree less than or equal to n consisting of all terms, except the last one. It can

be easily seen that


E_n(x) = p_n(x) f[a_0, ..., a_n, x] = 0     for x = a_0, ..., a_n.     (4.24)
Thus, f (aj) = gn (aj ), for j = 0, ... ,n. Hence, gn (x) is the interpolating poly-
nomial, and by Theorem 4.1 it must be algebraically identical to the Lagrange's
interpolation polynomial (4.4). This is referred to as the Newton's divided dif-
ference interpolation formula. It can be seen that the first term in (4.23) involves
only the point ao, the second involves ao and ai, and each successive term in-
volves one more abscissa. Hence, in this formula, we can conveniently go on
adding terms as we include additional points for interpolation. Each term of
this formula can be considered as the correction to the previous value, when the
corresponding point is added to the list of abscissas. Ideally the terms should
get smaller as we keep adding points. If the terms are decreasing rapidly, then
the last term gives a reasonable estimate for the truncation error, but if the
terms are not decreasing very rapidly, then it may not be possible to estimate
the truncation error in the interpolated value.
Since the interpolation polynomials in (4.23) and (4.4) are algebraically
identical, the truncation errors in the two cases can be equated to obtain

f[a_0, ..., a_n, x] = f^{(n+1)}(ξ)/(n + 1)!,     (4.25)

where ξ is some point in the range spanned by the arguments of the divided
difference. Hence, the divided difference formula is similar to the Taylor series,
and in fact, it can be shown that if we take the limit a_j → a, (j = 0, ..., n),
then the Newton's formula will reduce to the Taylor series, with the truncation
error giving the remainder term.
It may be noted that in deriving (4.23), we made no assumptions about the
order in which successive points are added, and in fact, the formula is applicable
for points in any arbitrary order. The only requirement being that the points
should be distinct. In fact, by choosing the proper order and assuming that the
spacing between points is uniform, most of the classical interpolation formulae
can be derived from (4.23). In practice, it is best to add the point which is
closest to the given value, since that reduces the factor Pn(x) in the truncation
error expression. In general, we will have little control over the other factor in
En(x), which is just a divided difference.
To calculate the divided differences, it is convenient to build a divided
difference table as follows:
a_i     f[a_i] = f(a_i)     f[a_{i−1}, a_i]     f[a_{i−2}, a_{i−1}, a_i]     f[a_{i−3}, a_{i−2}, a_{i−1}, a_i]
a_0     f[a_0]
a_1     f[a_1]              f[a_0, a_1]
a_2     f[a_2]              f[a_1, a_2]         f[a_0, a_1, a_2]
a_3     f[a_3]              f[a_2, a_3]         f[a_1, a_2, a_3]             f[a_0, a_1, a_2, a_3]
It should be noted that calculation of each of these divided differences requires
entries to the immediate left and the one just above it. The denominator in

the recurrence relation for the divided difference, is the difference between the
abscissa in the same line and the one which is obtained by drawing a diagonal
in the table extending up to the second column, e.g.,

f[a_1, a_2, a_3] = (f[a_2, a_3] − f[a_1, a_2])/(a_3 − a_1).     (4.26)

If di,j denote the divided difference in the ith row and jth column (excluding
the first one) of divided difference table, then

d_{i,j} = f[a_{i−j+1}, ..., a_i] = (d_{i,j−1} − d_{i−1,j−1})/(a_i − a_{i−j+1}),     i, j > 1.     (4.27)
As we go on adding points, we can calculate the additional lines in the
table. Further, to calculate the nth row only the (n - l)th row needs to be
preserved. If the calculation is arranged properly only one array is required for
calculating the differences, since the new entries can be overwritten on the old
ones, which are not required. If at any stage we have calculated

d_i = f[a_i, ..., a_{n−1}],     (i = 0, ..., n − 1),     (4.28)

then we can start with d_n = f[a_n], and for k = n − 1, ..., 0, calculate

d_k = (d_{k+1} − d_k)/(a_n − a_k),     (4.29)

and overwrite on d_k. We can proceed in this manner, if the interpolation is
desired at a few isolated points. Each time d_0 gives the diagonal entry which is
required in the Newton's formula.
It can be seen that it requires 2n + 2 additions, n divisions and two mul-
tiplications to evaluate the interpolation formula for n + 1 points from that for
n points, as compared to 2n + 1 additions, n + 1 divisions and 2n + 2 multi-
plications, required by the Lagrange's formula. It requires n(n + 3) additions,
n(n+ 1)/2 divisions and n multiplications to evaluate gn(x) ab initio. However,
if the interpolation formula is to be evaluated at a large number of points, then
the entire divided difference table can be stored. Once the divided differences
are available, it requires only n multiplications and 2n additions to evaluate
gn (x), using nested multiplication as for standard polynomial, e.g.

g_3(x) = ((f[a_0, a_1, a_2, a_3](x − a_2) + f[a_0, a_1, a_2])(x − a_1) + f[a_0, a_1])(x − a_0) + f[a_0].
                                                                       (4.30)
For n + 1 points we can start with p_n = f[a_0, ..., a_n] and compute

p_j = p_{j+1} (x − a_j) + f[a_0, ..., a_j],     j = n − 1, ..., 1, 0,     (4.31)

and p_0 will give g_n(x).
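The construction of the diagonal divided differences and the nested evaluation (4.31) can be sketched in Python as follows; the function names are illustrative, and the three-point sine table is the data of Example 4.1.

    import math

    def divided_differences(a, f):
        # Build the table column by column; diag collects the entries
        # f[a_0], f[a_0,a_1], ..., f[a_0,...,a_n] needed in (4.23).
        n = len(a)
        d = list(f)
        diag = [d[0]]
        for j in range(1, n):
            for i in range(n - 1, j - 1, -1):
                d[i] = (d[i] - d[i - 1]) / (a[i] - a[i - j])
            diag.append(d[j])
        return diag

    def newton_evaluate(a, diag, x):
        # Nested multiplication as in (4.31)
        p = diag[-1]
        for j in range(len(diag) - 2, -1, -1):
            p = p * (x - a[j]) + diag[j]
        return p

    a = [40.0, 45.0, 50.0]
    f = [math.sin(math.radians(t)) for t in a]
    print(newton_evaluate(a, divided_differences(a, f), 44.4444))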


If the entire difference table is stored, then we need not add differences
along the diagonal only, but can proceed in any order. Starting from the abscissa

which is nearest to the given point, we can add the next nearest point and
so on. In this case, the next difference to be added to the formula will be
either the one in the same row or in the row just below the current one. If
the new point is below the previous points, then we have to move one row
down, while if the new point is just above the previous points we should pick
the corresponding difference from the same row. For example, using the above
table we can construct the interpolation polynomials as

g_2(x) = f[a_2] + (x − a_2) f[a_1, a_2] + (x − a_2)(x − a_1) f[a_0, a_1, a_2]     (4.32)
       = f[a_1] + (x − a_1) f[a_0, a_1] + (x − a_1)(x − a_0) f[a_0, a_1, a_2].

Both these polynomials use the same set of points, i.e., {a_0, a_1, a_2}, but the
points are added in different order. Of course, the two polynomials are alge-
braically identical. But if the quadratic term is ignored, then the linear part
will not be identical, since they use different points.
If the table of function values is quite long, then it requires some effort
to find the nearest point. We can expect the table to be arranged in increasing
or decreasing order of abscissas. If a simple search technique is used, where we
start at one end and keep comparing the given value with successive abscissas
in the table until the nearest point is found, then on an average we require
about n/2 comparisons. A faster algorithm is the binary search or bisection,
where we start with the entire range of the table and at each stage take the
midpoint, and compare with the required value to find out which half contains
the required point. This process requires approximately log2 n comparisons to
locate the given point. An implementation of this scheme is provided by function
NEARST in Appendix B. This is the best algorithm for locating the nearest
point, if interpolation is desired at a set of randomly selected points. However,
frequently it happens that we need the interpolated value at a series of points
in some definite order. In that case, the required nearest point may be very
close to, or the same as the previous choice. In such cases, it may be better to
remember the previous choice and start the search from there. To take care of
the possibility that the new point may not be close to the previous point, we
should keep increasing the search step until an interval containing the required
point is found, or until the end of table is encountered. After that the bisection
process described above can be used to find the nearest point. If the new point
is located in the same subinterval as the previous point, no bisection will be
required, and the position can be located with just two comparisons. In the
worst situation, when the search step ultimately covers the entire table, this
process may be a factor of two slower as compared to the bisection algorithm
described earlier. This process is implemented in the function routine SPLEVL
in Appendix B. In either case, once the nearest point is found, the next nearest
point will be one of the two points on either side of it. If the table of abscissas
is uniformly spaced with spacing h, there is no need to search for the nearest
point as ⌊(x − a_0)/h⌋ gives the point just below the required value.
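A minimal Python sketch of the bisection search for the nearest tabular point is given below; it illustrates the same idea as the routine NEARST mentioned above, but it is not that routine, and the test table is illustrative.

    def nearest_index(a, x):
        """Locate the tabular point nearest to x by bisection;
        the abscissas a must be in ascending order."""
        lo, hi = 0, len(a) - 1
        if x <= a[lo]:
            return lo
        if x >= a[hi]:
            return hi
        while hi - lo > 1:                 # about log2(n) comparisons
            mid = (lo + hi) // 2
            if x < a[mid]:
                hi = mid
            else:
                lo = mid
        # x now lies in [a[lo], a[hi]]; pick the closer end point
        return lo if x - a[lo] <= a[hi] - x else hi

    a = [0.0, 0.5, 1.2, 2.0, 3.5, 5.0]
    print(nearest_index(a, 1.9))   # -> 3 (the point 2.0)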
Newton's formula gives a convenient and efficient procedure for calcu-
lating interpolating polynomial. The first term in Newton's formula gives the

interpolating polynomial using just one point, which is of course a constant,


since that is the best we can do. If second term is included, then we recover
the familiar formula for linear interpolation (4.6). If one more term is included
we get the formula for quadratic interpolation (4.32), which is the equation of
a parabola with its axis parallel to y-axis, and is also referred to as parabolic
interpolation.
Calculating the divided differences would involve subtraction of two nearly
equal numbers, and it may appear that the roundoff error would be large.
Although the roundoff error in evaluating the divided differences increases with
its order, in the interpolation formula it is multiplied by the product Pi(x),
which should progressively become smaller. Thus, the new term which is added
at any stage is itself small as compared to the previous term, and roundoff error
in that is not serious. If the successive terms are not reducing in magnitude
fast enough, then in any case, the truncation error will be much larger. As
already mentioned, nowadays interpolation is generally used to approximate
an experimentally determined function, or a function which is too complicated
to be calculated directly, in which case, usually the truncation error dominates
over the roundoff error. Roundoff error is rarely important in interpolation.
It may be noted that roundoff error in evaluating interpolation polynomial is
larger if Eq. (4.31) is used as compared to that when the terms are added in
natural order starting from f(ao). This is because although the first term in
(4.31) is small it has a large roundoff error. This difference will be seen only in
those cases where a truncation error comparable to ℏ f(x) can be achieved by
interpolation.
The truncation error (4.12) for polynomial interpolation splits into two
factors. One of this i.e., f(n+l)(~) depends on the function being interpolated,
but is essentially independent of the choice of interpolation points. The other
factor i.e., Pn(x) depends only on the choice of interpolation points, but is
independent of the interpolated function. In general, we have no control over
the first part, since the function is given to us. In most cases, even the second
part is also beyond our control, since the function may be known only at a
finite set of points. However, in case we can control the choice of points; for
example, if we can measure the quantity at the points of our choice, then it will
be relevant to know which is the best choice. For this purpose, we will like to
minimise the product p_n(x) over the required range of x values, say a ≤ x ≤ b.
The process of minimisation is not uniquely defined, since it is not specified
what we wish to minimise. We could minimise the maximum value of Pn(x),
or we could minimise some kind of average value of its magnitude. It should be
noted that Pn(x) is a polynomial of degree n + 1, whose zeros give the n + 1
abscissas used for interpolation.
Interestingly, Chebyshev has proved the following theorem: Let P_n des-
ignate the class of all polynomials of degree n with leading coefficient unity.
Then, for any p ∈ P_n

max_{−1≤x≤1} |T̃_n(x)| ≤ max_{−1≤x≤1} |p(x)|.     (4.33)

Here T̃_n(x) = T_n(x)/2^{n−1}, and T_n(x) is the Chebyshev polynomial of degree n,
which is defined by

T_n(x) = cos(n cos^{-1} x).     (4.34)

We will not go into the proof of this theorem, which can be found in
Davis (1975). It can be proved that T_n(x) has simple zeros at

x_k = cos((2k − 1)π/(2n)),     (k = 1, 2, ..., n).     (4.35)

Further, on the closed interval −1 ≤ x ≤ 1, T_n(x) has extreme values at the
n + 1 points

x'_k = cos(kπ/n),     (k = 0, 1, ..., n),     (4.36)

where it assumes the alternating values (−1)^k. Hence, the polynomial T̃_n(x)
which has the leading coefficient unity, has a maximum value of 1/2^{n−1} in
magnitude over the interval -1 :S x :S 1. This result can be transformed to any
finite interval by a suitable linear transformation.
If we consider the problem of interpolation in a given interval using a fixed
number of points, we can expect the lowest maximum error, if the abscissas are
chosen to be the zeros of the corresponding Chebyshev polynomial. For the
interval a ≤ x ≤ b, these points are given by

x_k = (a + b)/2 + ((b − a)/2) cos((2k + 1)π/(2n + 2)),     (k = 0, 1, ..., n).     (4.37)

Further, the maximum value of the product p_n(x) is (b − a)^{n+1}/2^{2n+1}. Hence,
if we have a choice for the interpolation points, we can choose the Chebyshev
points (4.37). Interpolation using these points can be referred to as Chebyshev
interpolation. Of course, it is not really advisable to use high order interpolation
formula. Hence, this prescription may not be very useful in actual practice.
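For reference, the Chebyshev points (4.37) on an interval [a, b] can be generated by the following short Python sketch; the function name is illustrative.

    import math

    def chebyshev_points(a, b, n):
        """Return the n+1 Chebyshev interpolation points (4.37) on [a, b]."""
        return [0.5 * (a + b) + 0.5 * (b - a) *
                math.cos((2 * k + 1) * math.pi / (2 * n + 2))
                for k in range(n + 1)]

    print(chebyshev_points(-1.0, 1.0, 4))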
If the tabular points are uniformly spaced, then the denominator in the
definition of divided differences can be ignored, since it will be the same for all
divided differences of the same order. In such cases, it is convenient to define
the differences by

Δ^0 f(x) = f(x),
Δ^k f(x) = Δ^{k−1} f(x + h) − Δ^{k−1} f(x),     (k = 1, 2, ...),     (4.38)

where h is the uniform spacing between the tabular points. These differences are
also referred to as forward differences, as opposed to the backward differences
∇^k f(x), which are defined by

∇^0 f(x) = f(x),
∇^k f(x) = ∇^{k−1} f(x) − ∇^{k−1} f(x − h),     (k = 1, 2, ...).     (4.39)

It can be easily shown that the forward differences are related to the divided
differences by the formula

f[a_0, ..., a_k] = (1/(k! h^k)) Δ^k f(a_0).     (4.40)

Using this relation and the Newton's divided difference formula, we can derive
the Newton's forward interpolation formula:

g_n(m) = f_0 + (m)_1 Δf_0 + (m)_2 Δ^2 f_0 + ... + (m)_n Δ^n f_0,     (4.41)

where x = a_0 + hm, f_i = f(a_i) and


(m)_k = (m choose k) = m(m − 1)(m − 2) ··· (m − k + 1) / k!.     (4.42)

Here it should be noted that m need not be an integer. In this interpolation


formula the new points are added in increasing order (assuming h > 0). Hence,
it is useful for interpolation near the beginning of the table. For interpolation
in the middle of the table it is better to add new points alternately on the two
sides.
If we are interested in interpolation near the end of the table, then the
new points should be added in the reverse order. In that case, we can obtain
the Newton's backward interpolation formula {13}

g_n(m) = f_0 + (m)_1 Δf_{−1} + (m + 1)_2 Δ^2 f_{−2} + ... + (m + n − 1)_n Δ^n f_{−n}.     (4.43)

This formula can be put in a more convenient form using backward differences.
This formula is also convenient for extrapolation beyond the end of the table.
EXAMPLE 4.2: Consider interpolation of the following functions over the interval [−1, 1]:

(i) sin x,     (ii) 1/(1 + 25x^2),     (iii) √(1 + x),     (iv) √|x|.     (4.44)

We use the Newton's formula with n = 2,4,6 ... ,20 points in the specified interval.
The truncation error depends on the point at which interpolating polynomial is evaluated.
To find the maximum error in the given interval, we evaluate the interpolating polynomial
at a set of 2001 points in the interval. The error is calculated by comparing the interpolated
value with the exact value, which can be easily calculated in this case. To study the effect of
distribution of abscissas, we consider two cases, first where the spacing between abscissas is
uniform, and second where the abscissas are the Chebyshev points. The results are presented
in Table 4.1, which also gives the decay exponent α defined as

α = (ln ‖e_n‖ − ln ‖e_m‖) / ln(n/m).     (4.45)

Here ‖e_m‖ is some measure of the truncation error using m points for interpolation. For our
Here II em II is some measure of the truncation error using m points for interpolation. For our
purpose, we have defined this norm to be the maximum error over the given interval. The
decay exponent essentially gives the rate at which the error decreases with n. Negative value
of decay exponent corresponds to the case, where the truncation error is decreasing with n.
The calculations were carried out using a 24-bit arithmetic, and it can be seen that
error never decreases below 10- 7 , which is the limit imposed by roundoff error. It can be
seen that only for the first function, the error decreases very rapidly with n and very soon

Table 4.1: Truncation error in polynomial interpolation

              sin x                1/(1 + 25x^2)         √(1 + x)              √|x|
  n      ‖e_n‖       α        ‖e_n‖       α        ‖e_n‖       α        ‖e_n‖       α

With abscissas uniformly spaced over [-1, 1]


2 6.0 X 10- 2 9.1 X 10- 1 3.5 X 10- 1 1.0 x 10°
4 1.2 X 10- 3 -5.6 7.0 x 10- 1 -0.37 1.5 x 10- 1 -1.260 5.2 x 10- 1 -0.93
6 1.2 x 10- 5 -11.5 4.3 x 10- 1 -1.20 1.0 x 10- 1 -.929 4.0 x 10- 1 -0.69
8 1.2 x 10- 7 -15.9 2.5 x 10- 1 -1.94 8.1 x 10- 2 -.795 3.3 x 10- 1 -0.62
10 1.2 x 10- 7 0.0 3.0 X 10- 1 0.88 6.8 x 10- 2 -.762 2.9 x 10- 1 -0.60
12 2.4 x 10- 7 3.8 5.6 x 10- 1 3.37 5.9 x 10- 2 -.742 2.6 x 10- 1 -0.57
14 4.2 x 10- 7 3.6 1.1 x 10° 4.25 5.3 x 10- 2 -.728 4.4 x 10- 1 3.43
16 1.6 x 10- 6 10.1 2.1 x 10° 5.07 4.8 x 10- 2 -.719 1.1 x 10° 6.70
18 1.8 x 10- 6 0.9 4.2 x 10° 5.89 4.4 x 10- 2 -.709 2.8 x 10° 8.13
20 2.8 x 10- 5 26.2 8.5 x 10° 6.71 4.1 x 10- 2 -.706 7.8 x 10° 9.57

With abscissas as Chebyshev points


2 7.7 x 10- 2 9.3 X 10- 1 3.8 X 10- 1 8.4 X 10- 1
4 9.9 X 10- 4 -6.28 7.5 x 10- 1 -0.30 1.8 x 10- 1 -1.09 5.5 x 10- 1 -.619
6 6.0 x 10- 6 -12.59 5.6 x 10- 1 -0.74 1.2 x 10- 1 -1.03 4.4 x 10- 1 -.538
8 1.8 x 10- 7 -12.22 3.9 x 10- 1 -1.22 8.9 x 10- 2 -1.01 3.8 x 10- 1 -.519
10 1.2 x 10- 7 -1.82 2.7xlO- 1 -1.68 7.1 x 10- 2 -1.01 3.4 x 10- 1 -.511
12 1.2 x 10- 7 0.00 1.8 X 10- 1 -2.12 5.9 x 10- 2 -1.01 3.1 x 10- 1 -.507
14 1.2 x 10- 7 0.00 1.2 X 10- 1 -2.55 5.1 x 10- 2 -1.00 2.9 x 10- 1 -.505
16 1.2 x 10- 7 0.00 8.3 X 10- 2 -2.96 4.4 x 10- 2 -1.00 2.7 x 10- 1 -.504
18 1.2 x 10- 7 0.00 5.6 X 10- 2 -3.37 3.9 x 10- 2 -1.00 2.5 x 10- 1 -.503
20 1.2 x 10- 7 0.00 3.7 X 10- 2 -3.77 3.5 x 10- 2 -1.00 2.4 x 10- 1 -.502

reaches the limiting value. The decay exponent decreases at first and then keeps fluctuating
as the error approaches the limiting value. Theoretically, if the derivatives are all of the same
order, the truncation error is expected to decrease as linn. In this case, there is no essential
difference between the results for the two different distribution of points. It may be noted
that the error increases slightly with n for large values of n when interpolation points are
uniformly distributed, which is clearly due to the roundoff error.
For the second function, which is the famous example due to Runge, the error decreases
a little and then starts increasing when uniformly spaced points are used. On the other hand,
for Chebyshev points the error keeps decreasing slowly but steadily. This divergence is related
to the fact, that the derivatives of this function increase unboundedly as n increases. Even
though this function is smooth on the real axis, there are two poles quite close to the real
axis at x = ±0.2i. The presence of singularity quite close to the region of interpolation causes
divergence of interpolation series. Hence, for rapid convergence of interpolation series, it is
essential that the function should be regular in the entire neighbourhood of the interpolation
region. If we consider a circle in the complex plane with the diameter covering all the inter-
polation points, then as a rule of thumb we can expect the interpolation to converge rapidly,
if there is no singularity in this circle. Hence, in this case, if the interval for interpolation is
less than about 0.2, the interpolation formula will converge reasonably. For this function, the
advantage of using Chebyshev points is quite clear, since the error does decrease slowly with
n.


Figure 4.1: The function f(x) = 1/(1 + 25x^2) and interpolation polynomials using 4, 8, 12
and 16 points are displayed as a function of x, for both the distributions of tabular points
considered.

To identify the problem clearly, the actual plot of polynomials using 4, 8, 12 and 16
points are displayed in Figure 4.1, for both the distributions of tabular points. It can be easily
seen that for uniformly spaced points, the error in the central region decreases as the number
of points is increased, but the error in both the end portions increases without bound. On
the other hand, with Chebyshev points, the error decreases over the entire interval, even
though in the central part the error is larger than that for uniformly spaced points. The wild
oscillations displayed by high degree polynomials is a typical problem with interpolation.
A polynomial of degree n can have up to n - 1 extrema, and as n increases some of these
extrema, may show up at unexpected places within the range of interpolation. Hence, it is
advisable to avoid high degree polynomials in interpolation.
For the third function which has unbounded derivatives at one end point, we do not
expect the interpolation formula to perform well. Even then the error is decreasing slowly with
n in both cases. The fourth function has unbounded derivatives in the middle of the interval,
where the first derivative is discontinuous. Hence, we expect the results to be worse, and
in fact, the interpolation does not converge when uniformly spaced tabular points are used.
However, with Chebyshev points it converges slowly, with the decay exponent approaching
-0.5. Hence, the error decreases roughly as ,,(ii.

In this example, we have considered only one of the possible asymptotic


limits, where the number of points are increasing, but the interval spanned by
the points remains constant. Another possibility is to use points with uniform
spacing, and as we add points the interval increases {17}. If this exercise is
carried out for the functions in Example 4.2, the result does not turn out to be
very different. Another possibility is to keep the number of interpolation points
fixed, but to decrease the spacing between them. In this case, interpolation will
converge for all continuous functions. The rate of convergence depends on the
number of points and on the existence and bounded ness of higher derivatives.
Usually a distinction is made between interpolation, where the value is
required at a point inside the interval spanned by the tabular points, and ex-
4.2. Divided Difference Interpolation Formula 127

trapolation, where the value of the function is required at a point outside the
interval spanned by the tabular points. In deriving the interpolation formula,
we have not made any assumptions about the position of the required point x.
Hence, the same formula is applicable for extrapolation also. However, in gen-
eral, extrapolation is a hazardous process, since firstly, there is no guarantee
that the function even exists in that region, and even if it exists the errors will
be larger, since the product Pn(x) in (4.12) is larger. Most common examples
of extrapolation in everyday life are probably the meteorological or economic
forecasts. In any case, extrapolation at a point which is separated from the
table by more than the typical spacing within the table, will be generally unre-
liable. If the table of values is available at very fine spacing and extrapolation
is required then in some cases, it may be helpful to drop some points from the
table so that the table spacing becomes comparable to the distance of required
point from the end point in the table. The most common application of extrap-
olation in numerical methods, is in estimating the limit of numerical results,
as the finite numerical model, tends to the infinite mathematical model. For
example, in numerical integration we will like to estimate the limit as the spac-
ing h between points tends to zero. Estimating such limits necessarily involves
extrapolation. A similar situation arises in some experiments also, for example,
if we want to measure some physical quantity at 0 K, we can measure it at a
few temperatures (of course greater than 0 K) and extrapolate the results.
Another topic which is usually considered in classical numerical analysis
is the inverse interpolation, where the value of independent variable x is re-
quired for which the dependent variable y takes on a prescribed value. This
problem arises mainly in solution of nonlinear equations. Classical numerical
analysis texts usually describe a few special methods for this case, since most
of the standard formulae are based on uniformly spaced points, which is not
applicable in this case, where the function values are usually not uniformly
spaced. However, the Newton's divided difference formula can be easily applied
to this problem by interchanging the first two columns of the difference table
before computing the differences. Of course, it is essential to ensure that the
inverse function exists and is single valued, otherwise, interpolation may pro-
duce completely useless results. Inverse function will exist if the function values
are monotonic, as otherwise the inverse function will become multiple valued,
with two different values of x corresponding to the same f(x).
Finally, we will like to warn once again that if the table of function values
is sparse, it may not be possible to obtain the desired accuracy by interpolation,
no matter how many terms are included. The usual indication in such cases is
that the terms in interpolation formula do not decrease very fast as we keep
adding more terms. Another indication of difficulty may be provided by a strong
variation in the higher differences. However, as mentioned in Section 1.2, there
is no conclusive test for convergence, and we can construct examples, where
these indications are absent, even though the results are totally unreliable. A
trivial example is a table of sin7rx for x = a + 2i, (i = 0,1, ... ). In this case
all differences are zero and interpolation will always yield the value sin 7ra. A
128 Chapter 4. Interpolation

somewhat nontrivial example is the same function, but with abscissas Xi =


a + 2.01i, (i = 0,1, ... ) {12}. We can start with any smooth function and
obtain a table of values at suitable points, and then change the function, such
that it has the same value at all the tabular points, but in between some of
these points, it could have sharp peaks. Such functions will obviously make a
mockery of all interpolation formulae. As yet another example, consider the
following functions
20 10
f(x) = Lan sin(21rnx), g(x) = L(an +alO+ n )sin(21rnX). (4.46)
n=1 n=1

If we choose the tabular points Xi = i /10, (i = 0, 1, ... ), then the tables for these
functions will be identical. Hence, we get the same results using interpolation in
the two tables, in spite of the fact, that the two functions could be completely
different.
These examples illustrate that, unless we have an idea of the typical scales
at which the function changes significantly, interpolation could be a hazardous
process. The given table of values may appear to be smooth, even though the
function itself may not be. In most cases, when the function is known only
through a few experimental measurements, it is nearly impossible to give any
reliable estimate of truncation error. In such cases, it may be best to use a
few sets of tabular values for interpolation, and compare the results, or to
use functions other than polynomials for interpolation. If the data themselves
are subject to significant error, then it is not justifiable to use interpolation,
since the true function need not pass exactly through the tabular points. This
situation may also arise if the tabulated values of the function are rounded to
very small number of digits. In such cases it is obviously impossible to achieve
accuracy better than that in the tabulated points. In these cases, we have to
take resort to more general class of approximations, which will be covered in
Chapter 10.

4.3 Hermite Interpolation


So far we have considered the simple polynomial interpolation, which is also
referred to as collocation, where the interpolating function is exactly equal to
the given function at a finite set of tabular points. In addition, if we also de-
mand that the first derivative of the two functions are equal at all the tabular
points, then the process is referred to as osculatory interpolation. If we are
once again restricted to polynomials, then the resulting formula is referred to
as Hermite interpolation formula. In this case, it can be shown that one and
only one polynomial of degree less than or equal to 2n - 1 can interpolate the
given function as well as the first derivative at a set of n distinct points. An
interpolation formula analogous to the Lagrange's formula can be easily ob-
tained, but we will not discuss that here, since just like the Lagrange's formula,
it is not very useful for practical computations. Instead, we will generalise the
4.3. Hermite Interpolation 129

divided differences to obtain a formula which is algebraically identical to the


Hermite interpolation polynomial.
In (4.25), if we take the limit as all the tabular points tend to the same
value, then we can generalise the definition of divided difference to the case,
when the points are not distinct

k a's
f[ ,--"---..]- 1
a, ... ,a - (k-1)!
f(k-l)()
a. (4.47)

The basic idea here is to use the usual recursive definition (4.19), when the
denominator is nonzero, but to take the limiting value when the denominator
is zero. In the simplest case

f[a,a] = lim f(x) - f(a) = f'(a). (4.48)


X - a
x~a

Thus, if we want the interpolated function to agree with the first k - 1 deriva-
tives, in addition to the function value itself, then the point should occur k
times in the list of tabular points. For the case of Hermite interpolation using
n + 1 points, at which the function and the first derivative agree with that of
interpolating polynomial, (4.12) can be generalised to give the truncation error

(4.49)

EXAMPLE 4.3: Estimate In(1.5) by using the values of lnx and its derivative at x = 1 and
2.
We can construct the divided difference table

ai flail f[ai-l,ai] ![ai-2,ai-l,ai] f[ai-3, ai-2, ai-I, ad


0
1 0 1.
2 0.69315 0.69315 -0.30685
2 0.69315 0.5 -0.19315 0.11370

Here in evaluating the third column, if the concerned points are identical, we replace the
divided difference by the first derivative. Now the interpolating polynomial is given by

93(X) = 0 + l(x - 1) - 0.30685(x - 1)2 + 0.11370(x - 1)2(x - 2), (4.50)

and In(l.5) ~ 0 + 0.5 - 0.07671 - 0.01421 = 0.40908, which can be compared with the exact
value of 0.40547. The truncation error of -0.00351 is consistent with estimate (4.49), which
yields
E x = (x - 1)2(x - 2)2 (_
1( ) 4! (4
3!)
= _ (x _1)2(x - 2)2
4(4'
(4.51)

with 1 <« 2, giving 0.00097::; -El(1.5) ::; 0.0156.


130 Chapter 4. Interpolation

Following this example, we can write the cubic which interpolates a func-
tiop and its first derivative at two distinct points in the form
2 l[ a1, a2]- f'(at}
h3(X) = f(a1) + (x - I
adf (ad + (x - a1)
a2 - a1
2( ) f'(at} + f'(a2)- 2f[a1, a2]
+ (x - a1 ) x - a2 (
a2 - a1
)2 .

-_ f( a1 ) + (x - a1 )f'()
a1 + (x - a1 )2 3f[a1, a2] - 2f'(ad - f'(a2)
a2 - a1

(4.52)
This is the Hermite cubic interpolation formula, which is probably the most use-
ful of Hermite interpolation formulae. This formula forms the basis for deriva-
tion of cubic spline in next section. Apart from this, it is also useful as a basis
function in finite element methods (see Section 14.11 and {21}).
It may be noted that it is not really essential to know the derivative, to
use this formula, since (4.52) can be used with any value for f'(a1) and f'(a2).
Important point to note here is that, whatever value we supply for f'(ad and
f' (a2), will determine the slope of the interpolating polynomial at those points.
As will be seen in the next section, this freedom can be effectively utilised to
obtain smooth approximations with continuous derivatives. Of course, if the
value of f' supplied does not agree with the actual value, then some additional
truncation error will be introduced apart from what is given by (4.49). Her-
mite interpolation is rarely useful in direct calculation, since it is very unlikely
that a table of both the function as well as its derivative is available. For the
standard mathematical functions, it may be possible to have such tables, but
these functions are usually calculated much more effectively by other methods.
Hermite interpolation is actually used in numerical solution of differential equa-
tions, where the solution is computed at a set of x values and to calculate the
solution at other intermediate values we can use Hermite interpolation as the
first derivative is also available.

4.4 Cubic Spline Interpolation


Let us consider the problem of interpolating a given function over a wide range
of values extending over several tabular points, which is equivalent to the prob-
lem of drawing a smooth curve through a set of n + 1 points. We can use a
polynomial of degree n passing through all these points, but as we have al-
ready seen, that is not very desirable if n is fairly large. Such polynomials are
likely to have wild oscillations in between the tabular points. Another alterna-
tive is to use interpolation polynomial based on a few nearby points. In this
case, the polynomial used is different in different regions of the table, and the
resulting interpolating function is piecewise polynomial. If the change in poly-
nomial occurs at only the tabular points, this approximation is bound to be
4.4. Cubic Spline Interpolation 131

continuous, but in general, the derivatives will not be continuous. To ensure


that some derivatives of the approximating function are also continuous, we
require additional conditions, and the resulting function with maximum degree
of smoothness is referred to as a spline.
Let the function values f(ai) for i = 0,1, ... , n be known and a = ao <
al < ... < an = b. Our aim is to approximate f(x) by an interpolating
function g(x) on the interval [a, b]. We divide the interval into n subintervals
Ri = [ai-l, ai], (i = 1, ... , n). The simplest approximation can be obtained by
using a linear interpolation in each of the subintervals R i . In this case, f (x) is
approximated by a broken straight line, with breakpoints ai, ... ,an -l, at which
there is in general, a discontinuity in the slope of approximating function. This
approximation is referred to as linear spline. This approximation will obviously
not be smooth, unless the number of tabular points is exceptionally large. To
make the approximation smoother, we can use polynomials of higher degree
over each of the subintervals. The piecewise cubic interpolation with continu-
ous second derivative, or the cubic spline, has become the most popular, and
in this section we restrict ourselves to cubic spline. For more general spline
interpolation, readers can refer to Ahlberg et at. (1967) or De Boor (2001).
In the next section we will consider another alternative formulation of spline
interpolation which can be easily generalised to higher order of smoothness.
In Cubic spline interpolation we approximate the given function by a cubic
polynomial of the form

(4.53)

over each of the subintervals R i . We have to determine the 4n coefficients Cj,i


for j = 0,1,2,3 and i = 1, ... , n. For this purpose, we demand that Pi(X)
interpolate f(x) at ai-l and ai, and that the resulting approximating function
g(x) is continuously differentiable on [a, b], to get

Pi(ai) = f(ai), (i = 1, ... ,n),


(4.54)
(i = 1, .... n-1),

and let p~(ao) = So and p~(an) = Sn. Here it should be noted that the slopes
Si need not be equal to f'(ai). If the slopes and the function values are known
at each tabular point, we can perform the required interpolation by using the
cubic Hermite interpolation formula in each of the subintervals. Using (4.52),
we can write

CO,i = f(ai-d,
3f[ai-l, ai]- 2Si- 1 - Si Si + Si-l - 2f[ai-l, ail (4.55)
C2.i = h
i-l
' C3,i = h2 '
i-l

where h i - 1 = ai - ai-i. It may be noted that irrespective of the choice of Si the


resulting approximating function gtx) and its first derivative will be continuous.
We can choose Si to improve the smoothness by demanding that g(x) is twice
132 Chapter 4. Interpolation

continuously differentiable. The resulting piecewise cubic polynomial is known


as the cubic spline, and the process is referred to as cubic spline interpolation.
The name spline has been derived from the fact that in this case, the resulting
function g(x) approximates the curve traced by a draftsman's spline (essentially
a thin flexible rod), which is constrained to pass through the points (ai, f(ai)),
(i = 0, ... , n). The interior tabular points ai, (i = 1, ... , n-1), where continuity
conditions are imposed are usually referred to as knots.
The requirement of continuous second derivative gives the following n - 1
conditions
p~'(ai) =p~~I(ai)' (i = 1, ... ,n-1). (4.56)
Using (4.55) we get the n - 1 equations

hisi- I + 2(hi + hi-dsi + hi-ISHI = 3J[ai-l, ai]hi + 3J[ai' aHI]hi- l , (4.57)

which is a system of n - 1 linear equations in the n + 1 unknowns so, ... ,Sn'


To complete the specification we require two more equations. Thus, if we can
somehow estimate the slope at both ends, i.e., So and Sn, then the system will
be complete. Alternately, we can use the free-end condition, corresponding to
the case when the spline is left free beyond the end points, in which case, it
will be a straight line. This condition requires that the second derivative should
vanish at the end points. Using (4.55), this condition can be written as

(4.58)

Thus, (4.57) and (4.58) give a system of n+ 1 linear equations in n+ 1 unknowns,


which can be solved for Si'S, which in turn determine the coefficients Ci,j of the
cubic spline. It should be noted that with this end conditions, the cubic spline
interpolation will not be exact, even when f(x) is a second or third degree
polynomial.
A better boundary condition is the so-called not-a-knot condition. This
condition essentially demands that the first and last knots are ineffective, or
the polynomial on two sides of these knots is algebraically identical

and Pn-I(X) == Pn(x), (4.59)

which implies that at these two tabular points, even the third derivative is
continuous. After some manipulation {24}, this condition can be written as

Sn-I (h n - I + hn- 2) + Sn hn-2 =


hn- 2f[a n-l, an](2hn- 2 + 3hn-d + f[a n-2, an-I]h;'-I
hn - I + hn - 2
(4.60)
In all these cases, the resulting equations for Si reduces to a system of lin-
ear equations with a tridiagonal matrix. It can be shown that these system
4.4. Cubic Spline Interpolation 133

of equations is nonsingular and hence a unique solution exists. This system


of equations can be solved by Gaussian elimination and back-substitution (see
Section 3.2). Further, this system of equations is diagonally dominant and there
is no need of pivoting. Using the fact, that the matrix of coefficients is tridi-
agonal, the amount of calculation required to solve this system can be reduced
considerably. Subroutine SPLINE in Appendix B calculates the coefficients of
cubic spline. Once the coefficients are calculated, the interpolated value of the
function at any required point can be calculated using function SPLEVL.
It can be seen from the subroutine SPLINE that it requires about 34n
floating-point operations to calculate the coefficients of the cubic spline. This
can be compared with about 9n floating-point operations required to calculate
the divided differences up to third order, as required for a simple cubic polyno-
mial interpolation using the nearest four points. In either case, it will require
only three multiplications to calculate the polynomial at any point. The trun-
cation error in both cases is of the same order. Hence, clearly it is much more
efficient to use the divided difference formula.
Another disadvantage with splines is that, there is no simple way of esti-
mating the truncation error. We can try to drop a few points from the table
and compare the interpolated value with tabular values at these points. This
technique can only give some rough estimate of the truncation error. As a
result, spline interpolation is not useful if we have sufficient data points to
ensure adequate convergence using polynomial interpolation. However, if the
data points are few and well separated, then in any case, it is impossible to en-
sure any significant accuracy, and we may be forced to accept the interpolation
as giving the best possible approximation. In such cases, spline interpolation
could be useful, since it will ensure higher degree of smoothness than the sim-
ple polynomial approximation. In most physical situations, smoothness is the
only criterion that can be used to determine the goodness or badness of the
approximation.
Spline interpolation is almost indispensable if we want to draw a smooth
curve through a few data points. In that case, if we use a high degree polyno-
mial interpolating to all the data points, it is virtually impossible to avoid wild
oscillations. While, if we use polynomial interpolation based on say nearest four
points, then as the points change the polynomial will also change and it is diffi-
cult to ensure continuity of even the first derivative. Hence, the resulting curve
may not be smooth. In most cases, spline interpolation yields smooth curves,
but it is by no means guaranteed. For functions which have steep gradient in
one region, and are essentially flat in neighbouring region, it is usually very
difficult to ensure a good approximation, unless sufficient number of points are
available. Sometimes, when the resulting approximation is to be used in an
iterative method, the iteration may not converge because of discontinuity in
the first derivative when the simple polynomial interpolation is used. In such
cases, splines are more effective, since what is important is not the truncation
error, but the smoothness. In fact, it is probably safer to use cubic spline for
interpolation in a sparse table.
134 Chapter 4. Interpolation

Cubic spline also has the following interesting property: Given the data
points (ai, f (ai)) of all functions with continuous second derivatives which in-
terpolate to these data, the cubic spline g(x) with free boundary condition
(4.58), uniquely minimises the integral

(4.61)

EXAMPLE 4.4: Repeat Example 4.2 using cubic spline interpolation.


The coefficients of cubic spline with not-a-knot boundary condition can be calculated
using the subroutine SPLINE. The maximum error II en II, for interpolation with n points
is calculated by evaluating the cubic spline at a set of 20 points in each subinterval ni, and
comparing the value with the known exact value. The results are displayed in Table 4.2. It
may be noted that, at least, four tabular points are required for cubic spline with not-a-knot
boundary conditions, hence the first row in the table corresponds to n = 4. Only the case,
where the tabular points are uniformly spaced is considered here.

Table 4.2: Cubic spline interpolation

sinx 1
1+25x 2 vT+x vIxI
n II en II Q
II en II Q II en II Q II en II Q

4 1.2 X 10- 3 7.0 X 10- 1 1.5 X 10- 1 4.0 X 10- 1


6 3.9 X 10- 4 -2.81 4.3 X 10- 1 -1.21 1.1 X 10- 1 -.685 3.0 X 10- 1 -.718
8 1.2 X 10- 4 -4.04 2.5 X 10- 1 -1.93 9.4 X 10- 2 -.585 2.5 X 10- 1 -.629
10 4.8 X 10- 5 -4.17 1.4 X 10- 1 -2.45 8.3 x 10- 2 -.563 2.2 X 10- 1 -.568
12 2.2 X 10- 5 -4.17 8.4 X 10- 2 -2.91 7.5 X 10- 2 -.550 2.0 X 10- 1 -.551
14 1.2 X 10- 5 -4.15 5.0 X 10- 2 -3.32 6.9 X 10- 2 -.542 1.8 X 10- 1 -.542
16 6.8 X 10- 6 -4.13 3.1 X 10- 2 -3.68 6.4 X 10- 2 -.536 1.7 X 10- 1 -.536
18 4.2 X 10- 6 -4.21 1.9 X 10- 2 -3.99 6.0 X 10- 2 -.531 1.6 X 10- 1 -.531
20 2.7 X 10- 6 -4.06 1.2 X 10- 2 -4.25 5.7 X 10- 2 -.528 1.5 X 10- 1 -.528

Comparing this results with Table 4.1 it is clear that for the first function which is
regular and has bounded derivatives, the higher order polynomial interpolation gives much
more accurate results. For cubic spline the decay exponent is ~ -4, which is the expected
value for cubic interpolation. For the second function, where the polynomial interpolation
diverges. the cubic spline gives much better result and decay exponent is again ~ -4, i.e.,
the error decreases as the fourth power of n. For the other two cases the decay exponent is
~ -0.5 and error decreases rather slowly. The result for f(x) = vT+x are comparable to
those obtained using polynomial interpolation. But in the last case, where the polynomial
interpolation is diverging, the cubic spline gives acceptable results. It may be noted that
polynomial interpolation using Chebyshev points gives better results in almost all cases.
Further, we would like to point out that these results cannot really be compared with those
in Example 4.2 for polynomial interpolation, as in that case we considered interpolation with
increasing degree of polynomial. If instead we had considered cubic polynomial interpolation
using nearest 4 points, the results would have been similar to those using cubic spline.

EXAMPLE 4.5: Consider the cubic spline for the following function in the interval [-1,1]

f(x) = - - ; - - ( 4.62)
1+ 1+!ax
4.5. B-splines 135

1.2

\
0.8 \

- - Exact
- - - Spline (n=7)
0.6
Spline (n= 13)
--( I
- - Polynomial (n=7)
0.4 \ /
.~
Polynomial (n= 13)

-1 -0.5 o 0.5 1
x
Figure 4.2: The function (4.62) and the cubic spline interpolant using 7 and 13 uniformly
spaced tabular points. Also shown is the curve for the polynomial interpolating at the same
tabular points.

For large values of a this function resembles a step function. For large positive values
of x, I(x) ;:::; 1, while for negative values of x, I(x) ;:::; 0.5. In the region Ixl < a/a, where
a is a constant of order unity, the function changes steeply from 0.5 to 1. In most of the
region the function is essentially flat and we may expect that a few points will be sufficient
for interpolation. The result of cubic spline interpolation using 7 and 13 tabular points are
displayed in Figure 4.2. The value of a = 30 is used in these calculations.
It can be seen that, even though the function is flat at both ends, the cubic spline using
seven points shows some oscillations there. These oscillations are damped out when additional
points are added. This behaviour is sometimes irritating, since even with seven points the
human eye can see that the function is flat at both ends, but unfortunately the spline cannot.
Use of simple polynomial interpolation in this case is completely disastrous as is illustrated
by the curve for polynomial using the same points. The polynomial interpolation with 13
points shows strong oscillations towards both ends. Interestingly, in the region around the
step both polynomial and spline approximations agree with each other, but spline gives much
better approximation in the flat regions of the curve.

4.5 B-splines
Although cubic spline interpolation can be calculated by using the procedure
described in the previous section, in many applications it is necessary to express
the approximating function as a linear combination of appropriate basis func-
tions. Such requirements arise in functional approximations (cf., Chapter 10)
or in expansion methods to solve differential or integral equations. In principle,
the representation defined by Eq. (4.55) defines a linear expansion in terms of
piecewise polynomials. But the number of coefficients involved are much larger
136 Chapter 4. Interpolation

than the number of independent functions defining the spline and hence this
representation is not particularly useful.
A simple representation of splines in terms of independent basis functions
can be obtained in terms of the so called, truncated power basis, using the
functions
_ {(X - a)", if x 2 a;
( x _ a )i+- (4.63)
0, if x < a.
These functions have continuous derivatives up to order i - 1. The cubic spline
using n + 1 tabular points can be represented as
n-l
g(x) = bo + b1(x - ao) + b2(x - ao)2 + co(x - ao)3 +L Ci(X - ai)~' (4.64)
i=l

This representation requires n + 3 constants, which can be determined from


the requirement of interpolation at n + 1 points plus the two boundary condi-
tions. If we are using the not-a-knot boundary condition, then Cl = Cn-l =
and the remaining n + 1 constants can be determined by interpolation re-
°
quirement. The coefficients can be determined by the method of undetermined
coefficients (cf., Section 5.2). This representation is very convenient for ana-
lytical and theoretical studies, but is quite often numerically unstable. As will
be seen in Chapter 10, instability arises because these basis functions are far
from orthogonal and away from the corresponding knots, a basis function may
be approximated well by a linear combination of other basis functions, making
it difficult to determine the coefficients of expansion. The corresponding equa-
tions to determine the coefficients bi , Ci may become ill-conditioned. It can be
easily seen from Eq. (4.64) that at large x, towards the end of table, evaluation
of g(x) will involve some cancellation between different terms, unless the func-
tion itself increases rapidly with x. This is the basic cause for ill-conditioning,
because of which this representation is not useful for numerical computations.
For similar reasons the simple polynomial basis (xi, i = 0, ... , n) is also not
useful for numerical computations.
For numerical computations it is preferable to have basis functions which
are localised, that is they are nonzero in a finite interval only. It is clear that if
we are to construct approximations with continuous derivatives up to certain
order using such functions, the corresponding derivatives of these functions at
the end points must vanish along with the function value. It turns out that
such basis functions can be constructed using:
(4.65)

Here it is understood that the k-th divided difference of (t - X)~-l is taken


using variable t, while x is kept fixed. The resulting function Bi,k(X) is called
the B-splines of order k. These will of course, depend on the set of knots ai
used. For simplicity, in the following discussion we assume that the knots are
distinct and arranged in ascending order. We can immediately verify that
(4.66)
4.5. B-splines 137

For, in this region g(t) = (t - x)~-l is a polynomial of degree < k, for t E


[ai, ai+kJ and hence the k-th divided difference will vanish. For x E [ai, ai+k],
g( t) has a discontinuity at t = x and the resulting divided difference will in
general be nonzero. Thus, for aj :::; x :::; aj+l, of all the B-splines of order k only
the k B-splines Bj-k+l,k(X), B j -k+2,k(X), ... , BJ,k(X) might be nonzero. All
other B-splines will be zero in this interval. Further, by expanding the divided
difference in Eq. (4.65) in terms of (k -l)-th order differences, it can be shown
that 2:i Bi,k(X) = 1. We will give a brief description of properties of B-splines
in this section, while for more details readers can consult De Boor (2001).
From Eq. (4.65) it can be easily seen that for k = 1,

B t. 1 ( X ) _
.
-
{I,
0,
if ai < x < ai+ 1 ;
otherwise.
(4.67)

These are the so called, top-hat functions, which are piecewise polynomials of
order O. Similarly, for k = 2,

for ai <x :::; ai+l;

for ai+l < x:::; ai+2; (4.68)


otherwise.

is a piecewise polynomial of order 1, which is continuous everywhere, but the


derivative has discontinuities at x = ai, ai+l, ai+2. It can be shown that in
general, when the knots are distinct Bi.k(X) are piecewise polynomials of order
k - 1, with first k - 2 order derivatives continuous, while the (k - 1 )th derivative
can have discontinuities at the knots ai, ... ,ai+k. If some knot is repeated then
at that knot, some of the lower order derivatives may also be discontinuous.
Evaluating Bi,dx) using Eq. (4.65) may not be advisable as it can lead to
loss of significant figures while evaluating the divided differences. Instead, we
can obtain a recurrence relation to construct these polynomials from lower order
B-splines. To get these recurrence, we note that (t - X)~-l = (t - x)(t - x)~-2
and obtain

(t - x)~-l [ai,"" ai+kJ = (ai - x)(t - x)~-2[ai'"'' ai+kJ


(4.69)
+ (t - x)~-2[ai+l"'" ai+kJ.

Using this equation and the definition ofB-splines we get the recurrence relation

(4.70)

Starting with Bi,l(x) from Eq. (4.67), this recurrence relation can be used to
generate higher order B-splines. Eqs. (4.67) and (4.70) can also be considered
as an alternative definition of B-splines. It may be noted that one of the two
terms on the right hand side may be zero. For ai < x < ai+ 1, only Bi,1 (x) will
138 Chapter 4. Interpolation

be nonzero when k = 1, while for k = 2, both B i ,2(X) and B i - I ,2(X) will be


nonzero. Thus as k increases by one we will have one more nonzero Bj,k(X). This
recurrence relation is numerically stable as it can be easily shown that Bi,k(X) >
o for ai < x < aHk. This follows from the fact that Bi,dx) 2: 0 and hence from
the recurrence relation it is clear that both terms will be positive. Further,
because both terms are positive the evaluation of Bi,k(X) using the recurrence
relation is numerically stable as it doesn't involve subtraction of two nearly
equal numbers. All k nonzero B-splines for a given value of x can be calculated
together, using this recurrence relation to avoid duplication of calculations.
This algorithm is implemented in the subroutine BSPLIN in Appendix B. For
programming convenience in the Fortran program the knots are indexed from
-k+2 to n+k, while the index i of Bi,k is assumed to range from 1 to n+k-l.
This transformation should be accounted for in transforming the formulae given
in text to computer program.
The cubic B-splines corresponding to k = 4 are probably the most popular.
For the interval covered by n + 1 knots ao, ... , an there would be n + k - 1 inde-
pendent B-splines, B-k+l,k(X), B-k+2,k(X), ... , Bn-I,k(X), which are nonzero
in this interval. For k > 1 in order to define the B-splines near the end points
we would need k - 1 extra knots on either side. It may be convenient to de-
fine the end points as multiple knots to define the B-splines in interval lao, an].
Instead, we can add points on either side of the table with same spacing as
that between the previous pair. The definition of B-splines will depend on this
choice, but that should not matter as in all cases they will define a set of basis
functions to expand any required function in the specified interval. If the end
knot at x = ao is repeated k times then BI-k,dao) = 1, while all other B-
splines will vanish at x = ao. Further, B~_k k(ao) -10 and B~_k k(ao) -10, etc.
as the successive B-splines will have one higher derivative vanishing at x = ao.
On each subinterval [ai, aHI] only k B-splines Bi-k+l,k(X), ... , Bi,k(X) will be
nonzero. Fig. 4.3 shows the set of B-splines for 6 knots. It is clear that these
are localised functions which are nonzero only in finite interval.
To express the interpolating spline in terms of B-splines we can write
n-I
g(x) = I: CiBi,k(X), (4.71 )
i=l-k

where Ci are the coefficients of expansion, which can be determined by the


interpolation condition at the n + 1 knots, !(o,i) = g(ai) along with the appro-
priate boundary conditions. For example, for cubic spline interpolation with
not-a-knot boundary condition, we remove the knots al and an-I, but retain
the interpolation condition on these knots to obtain n + 1 linear equations in
n + 1 unknowns. Because of the local nature of B-splines, the resulting equa-
tion matrix will be a band matrix, with only k nonzero elements in each row.
This is the advantage of using localised basis function, as apart from improv-
ing the efficiency it also improves the numerical stability since it ensures that
the basis functions are independent and hence the matrix in general, will be
4.5. B-splines 139

o
Parabolic
1
t')
,..i
iIl //.
"f
,/"'" -
'<
-........

/'
,,--
)/
/"'-
--....... /"-"-',
,"
"./
'./ "/'
;-.< "-
'-

o
Cubic
1

..... /-", ..... /


'-/'
/ ' '- /'<..
.... /
................
'- ---- "- /

o ' ..
o 0.2 0.4 0.6 0.8 1
x
Figure 4.3: The B-spline basis functions over the interval [0, 1] using 6 equally
spaced knots. It is assumed that the end knots are repeated k times to calculate
the values near end points.

well-conditioned. These equations can be solved for the coefficients Ci to obtain


the required interpolating spline. This procedure is implemented in subroutine
BSPINT in Appendix B. The resulting interpolation should be identical to
cubic spline interpolation discussed in the previous section. The advantage of
B-splines is that the formulation can be easily extended to higher order spline
functions. Further, as explained earlier unlike the cubic spline representation
considered in the previous sections, B-splines provide a convenient set of basis
functions, which can be used to expand any required function. In most prac-
tical problems there is rarely any need to consider higher order splines as the
cubic splines are sufficiently smooth for most applications. Only if higher or-
der derivatives are required one will have to consider higher order splines. For
higher order splines we can drop some more knots from either side to ensure
that the number of basis functions are equal to the number of points at which
interpolation is sought. On the other hand, for linear B-splines (k = 2) the
number of basis functions is equal to the number of knots and we do not need
to drop any points.
140 Chapter 4. Interpolation

Using the fact that


d
dx (t - X)~-l = -(k - l)(t - X)~-2 , (4.72)

we can calculate the derivative of B-spline, using Eq. (4.65)

(4.73)
Thus the derivative of B-spline can be expressed in terms of a lower order B-
spline. To calculate higher order derivative we can repeat this process. Since
the recurrence relation to calculate B-spline uses the lower order B-splines, the
derivative can also be calculated during the same calculation, without much
extra effort.
If we consider the expansion given by (4.71) for k = k + 1, then the
derivative can be written as

(4.74)

This implies that L:i biBi,k is the first derivative of the spline expansion
L: CiBi,k+l, provided
k Ci-Ci-l b
= ;, (4.75)
ai+k - ai

for every i, or
ai+k - ai
Ci=Ci-l+bi k ' (4.76)

for all i. This allows us to calculate the integral of B-spline expansion, provided
one of the coefficients c, is chosen arbitrarily. The integral is of course, not
unique but if we are interested in evaluating the definite integral over a finite
limit then the value would come out to be unique. For example, if we choose
Ci = 0 for i ::::: -k, then we can write,

.( 4.77)

This equation can be used to calculate the definite integral of B-spline expan-
sion over a finite interval. The interval should normally be contained within
the range spanned by the knots. This procedure is implemented in subroutine
BSPQD in Appendix B.
Because of the ease with which B-splines and their derivatives can be
calculated and the fact that low order derivatives of these functions are contin-
uous, they provide a very useful set of basis functions. Further, the evaluation
4.6. Rational Function Interpolation 141

of B-splines as well as the expansion in terms of these functions is numerically


stable. As a result, in numerical work it is preferable to use these basis functions
rather than the normal monomials xi, which are widely used in analytic work.
The monomials are known to cause ill-conditioning in numerical calculations.
As a result, B-splines have been widely used in approximations, as well as in so-
lution of differential and integral equations. We will consider these applications
in the corresponding chapters.

4.6 Rational Function Interpolation


After considering polynomial interpolation, we will briefly consider the general
problem of interpolation using other basis functions. It is of course, not essential
to use polynomials for interpolation, we can use any set of independent functions
say ¢i(X), (i = 1, ... , n), to write the interpolating function in the form
n
gn(x) = L Ci¢i(X). (4.78)
i=l

Here are constants, which can be determined by requiring that gn(ai) =


Ci'S
f (ai) for i = 1, ... , n, i.e., gn (x) interpolates the given function f (x) at a set of
n distinct points. The interpolation condition gives a system of n linear equa-
tions in n unknowns, which can be solved in principle to get the coefficients
Ci. This procedure can be adopted for polynomial interpolation also, by tak-
ing ¢i(X) = Xi-I, in which case, we get a system of linear equations involving
a Vandermonde matrix. It is well-known that these system of equations are
generally ill-conditioned. Hence, this approach is not recommended for polyno-
mial interpolation. Apart from ill-conditioning, this approach requires O(n 3 )
floating-point operations, which is much larger than what is required to con-
struct the divided difference table, or to calculate the cubic spline coefficients.
This method is also referred to as the method of undetermined coefficients,
where we determine the coefficients required for the purpose by directly im-
posing the conditions which are expected to be satisfied. This method is very
useful in deriving formulae for numerical methods in general, but is not rec-
ommended for polynomial interpolation. As mentioned in the previous section,
this method can be easily applied to B-spline basis functions as these lead to
numerically stable set of equations with band matrix.
Even for non polynomial interpolation, this method very often leads to a
system of linear equations which is highly ill-conditioned. As will be seen in Sec-
tion 10.2, this problem is caused by the fact that, very often the basis functions
¢i (x) may not really be very different from some linear combination of other ba-
sis functions. If the basis functions are properly chosen, then this problem does
not arise. For example, if we select a set of orthogonal functions as the basis,
then the matrix is generally well-conditioned. Apart from ill-conditioning, in
general, it is not even guaranteed that the matrix will be nonsingular. Hence,
the solution may not even exist. We will not go into the general problem of
142 Chapter 4. Interpolation

existence or uniqueness of solutions, but just mention that this approach could
be useful if we know that the given function can be approximated well by the
basis functions selected.
In (4.78) we have used a linear combination of basis functions to represent
the approximating function, in general, we can use a nonlinear function of coef-
ficients Ci and basis functions cPi(X). In that case, the method of undetermined
coefficients will lead to a system of nonlinear equations, which may be quite
difficult to solve. In this section, we only consider the special case of interpo-
lation using rational functions, which is the ratio of two polynomials. To start
with let us consider the simplest case, where both numerator and denominator
are linear functions. If we consider an approximating function of the form

R(x)=c+dx. (4.79)
a+bx
This function has four constants and we may expect to determine these coeffi-
cients by demanding it to interpolate the given function at four points. However,
it is obvious that the coefficients are not independent, since the denominator
and the numerator can be divided by one of these coefficients, which is nonzero.
Hence, there are only three independent coefficients, and we can require it to
interpolate at three points ao, al and a2. It is better to rewrite the function in
the form
R(x) = bo + bl(x - ao) . (4.80)
1 + b2 (x - ao)
It is obvious that bo = f(ao), while bl and b2 can be determined by solving the
equations
bo + bl(al - ao) = f(al)(l + b2(al - ao)),
(4.81)
bo + bl (a2 - ao) = f(a2)(1 + b2(a2 - ao)).
It can be seen that if f(ad = f(a2), then these equations do not have any solu-
tion, which clearly illustrates that unlike polynomials, an interpolating rational
function may not always exist.
EXAMPLE 4.6: Using the values of tan x at x = 1.56, 1.57 and 1.58, find an interpolating
function of the form (4.80), and compare the results with polynomial interpolation.
Using the given values of tan x we can proceed as outlined above to get the coefficients
bo = 92.6205, bl = 0.07375924 and b2 = -92.62432. Using these coefficients, we can calculate
the value of R(x) at any point. Using the same three points we can obtain the interpolating
quadratic

92(X) = 92.6205 + 116314.5(x - 1.56) - 1.263780 x 107 (x - 1.56)(x - 1.57). (4.82)

The results at a few typical points in the given interval are summarised in Table 4.3. It is
clear from the table that rational function interpolation using the same number of points
gives results which are significantly better than that using polynomial interpolation. This is
not surprising, since tan x has a singularity within the given interval and hence polynomials
can hardly be expected to approximate the function reasonably. If the points had been chosen
to be away from the singularity, then the results will not be significantly different in the two
cases.
4.6. Rational Function Interpolation 143

Table 4.3: Interpolating tanx

x tanx R(x) 92(X)

1.5610 102.0758 102.0752 323


1.5650 172.5211 172.5174 990
1.5750 -237.8858 -237.8787 890
1.5708 -272242 -270292 1240

As this example illustrates, interpolation using rational functions can be


very powerful for functions which cannot be represented satisfactorily by poly-
nomials. The main advantage of rational functions over polynomials is their
ability to model functions with poles. As we have seen earlier, polynomial in-
terpolation fails if the function has poles in neighbourhood of the region of
interpolation. Rational function will be able to take care of such poles, pro-
vided the degree of polynomial in the denominator is large enough to account
for all poles in the neighbourhood. In fact, rational functions are very widely
used for approximating standard mathematical functions. Rational function ex-
trapolation has also proved to be effective in solution of differential equations
(see Section 12.5).
The method of undetermined coefficients, described above can be used to
obtain rational function interpolation, but once again, there may be difficulties
because of ill-conditioning. Apart from this, there is no simple way of adding
points one by one, since entire calculation will have to be repeated from scratch
when a new point is added. It is more convenient to use a recurrence relation,
which provides a simple algorithm to add one point at a time, similar to the
Newton's formula for polynomial interpolation. Such an algorithm has been
given by Stoer and Bulirsch (2010).
Let us consider a rational function R~ (x) which interpolates the given
function j(x) at the n + 1 points, Xi, ... , Xi+n. We can write

Ri (x) = P~(X) = Po + PiX + ... + PJ-lXJ-l (4.83)


n Qh(x) qo + qiX + ... + qvX V

Since one of the coefficients, say qo can be arbitrary, we should have n = J-l + v.
For the same value of n, we can have several rational function approximations
with different combination of J-l and v values. It is found that usually the rational
functions which have similar degrees in numerator and denominator perform
better in approximation. Hence, we use the so-called diagonal rational junction,
where J-l = v if n is even, and J-l = v-1 if n is odd. Hence, we can write J-l = ln/2 J
and v = l(n + 1)/2J. We define

T~(x,y) = P~(x) - yQ~(x), (4.84)


144 Chapter 4. Interpolation

which is a polynomial of degree l (n + 1) /2 J in x, and a linear function in y.


Further, because of the requirement of interpolation

(j = i, ... , i + n), (4.85)

where Yj = f (x j ). We can try to obtain a recurrence relation of the form


i
Tn+l(x,y) i
= a(x,y)Tn(x,y) + (3(x,y)Tni+l
_ 1(x,y), (4.86)

where a and (3 are to be determined such that T~+1 (Xj, yj) = 0 for j = i, ... ,i+
n + 1, and T~+l is a polynomial of degree l (n + 2) /2 J in x and a linear function
in y. Since T~ and T~-:::~ are polynomials of degree l(n + 1)/2J and In/2J in x,
we should have (3 to be a linear function in x and a should be independent of x.
Similarly, it can be seen that both a and (3 should be independent of y. Hence,
a is a constant, while (3 is a linear function of x. Since we assume that T~ (x, y)
and T~-:::~(x,y) satisfy the interpolation condition (4.85), T~+l(xj,Yj) = 0 for
j = i + 1, ... ,i + n. To complete the recurrence relation a and (3 should be
selected such that, this condition is satisfied for j = i and i + n + 1 also. It is
clear that, if we choose (3 = x - Xi, then T~+ 1 (Xi, Yi) = 0, and if in addition

(4.87)

then T~+l as given by (4.86) satisfies the interpolation condition (4.85). If


T~(Xn+Hl' Yn+Hd = 0 this procedure will obviously break down. In that case,
it can be seen that T~+I(X,y) = T~(x,y). Neglecting this possibility, the fol-
lowing recurrence relation can be obtained:

T~+l(x,y) = P~+I(X) - yQ~+I(X)


HI
. ) T n _. (1(Xn+Hl' Yn+Hd
-_ (x,. _ Xn+,+1 ) (pin (X) _ YQin ())
X (4.88)
T;;' Xn+i+ 1, Yn+H 1
+(x - Xi)(P~-:::~(x) - yQ~~\(x)).

Equating the terms which are independent of Y and terms linear in y, we can get
the recurrence relations for the numerator and the denominator of the rational
function (4.83)

P~+1 (x) = (Xi - Xn+HlhP~(x) + (x - Xi)P~-:::~ (x),


(4.89)

where
,= TT;;'(Xn+Hl'
. Yn+Hl)
HI
n - 1(Xn+i+l, Yn+Hl)
(4.90)

The recurrence relation (4.89) can be used to evaluate the rational function
interpolation at a given set of points, but it will be more convenient if we can
4.6. Rational Function Interpolation 145

have a recurrence relation directly in terms of the rational function R~. Using
(4.89) we can get the recurrence relation
Hi
i ()
R n+i X = Rin ()
X + (X - xi )(RHi ()
n-i X -
Rin (X))Qn-i(X)
Qi () (4.91)
n+i X

This recurrence relation relates R~+i to R~ and R~~\, which corresponds to


starting with R~~\ and first adding the point Xi and then xn+Hi to get the
final interpolating function. We can interchange the order in which the points
are added to get another recurrence relation, connecting R~+l with R~+i and
R Hi.
n-i·

Hi
i ()
R n+i X = RHi()
n X + ( X - Xn+i+i )(RHi ()
n-i X -
RHi( ))Qn_i(X)
n X Qi () (4.92)
n+i X

Eliminating the last factor between (4.91) and (4.92), we obtain the required
recurrence relation

This recurrence relation is valid for n 2: O. To start the recurrence, we can use
Rb = Yi and R~ 1 = O. For reasons of convenience and to reduce the roundoff
error, it is better to write this relation such that at every stage the new term
can be added to the previous value. For this purpose we define the differences

D~(x) = R~(x) - R~"=-\(:r). (4.94)

Here C~ (x) is the correction to be added when the point Xn+i is added to the set
{Xi, . .. , Xn+i-i }. Similarly, D~ (x) is the correction when the point Xi is added
to the set {XHi, ... ,xn +;}. Using (4.93) we can show that, these differences
satisfy the following recurrence relations:

.
( x-x,
X-X n +'l+l
) Din (x) (CHi (x)
n
- Din (x))
C~+l (x) = ---'----(~----''-------,-)--------
x-x, Di (x) - C~+l(X)
X-X n+ 1 +l n
(4.95)

These recurrence relations offer a convenient way to calculate the rational func-
tion interpolation at any required point. Unlike the divided differences the quan-
tities C~ and D~ depend explicitly on x, the point at which the interpolant is
to be calculated. Hence, it is not possible to store a table of these quantities in
computer memory, and at each point we will have to calculate the values from
146 Chapter 4. Interpolation

the beginning. It is not necessary to store the entire arrays C~ and D~. Starting
with Xl, as we go on adding points we can calculate the new entries and over-
write on the previous ones. When Xn is added to the table we need to calculate
C~_i and D~_i for i = n, ... , 1. If the calculations are done in this order, the
results can be overwritten on the previous values in a one-dimensional array.
This algorithm is implemented in the subroutine RATNAL in Appendix B.
EXAMPLE 4.7: Repeat Example 4.2 using rational function interpolation
We can use the subroutine RATNAL for calculating the rational function interpolation.
The rational function may have a pole in the given interval, even when the original function is
completely regular. In fact, for f(x) = sinx, using the two tabular points x = ±1 the rational
function interpolation gives R?(x) = si~ 1, which has a pole at x = 0. In such cases, of course,
the rational function approximation is much worse than the polynomial interpolation using
the same number of points. As is clear from the results displayed in Table 4.4, in almost all
cases, polynomial interpolation is better than the rational function interpolation, if the entire
range of [-1,1] is considered. It may be noted that except for (iii) all other functions that
we have considered are either symmetric or antisymmetric about the origin, this appears to
cause difficulty for rational function interpolation. Interestingly, even for (ii) which is actually
a rational function and can be expected to be correctly reproduced with four or more points,
the error is not small. In fact, in all cases at some points the rational function interpolation
fails, because the denominator in recurrence relation vanishes. Of course, in those cases where

near x = °
it becomes small but nonzero the error becomes large. For f(x) = sinx, there is a large error
in all cases, when the entire range [-1,1] is used. The result depends on how
the program is written and which points are used to find the maximum error. Hence, the
results for this case are not shown in the Table. If the range is reduced to [0,1]' then the
results improve significantly. Even in this case since f(O) = 0, in many cases the interpolation
routine fails depending on the order in which points are added. It appears that if x =
chosen as the first point the recurrence doesn't fail. Similar problem arises for Vf+X at
is °
x = -1 and for Vx at x = 0. Again if this point is chosen as the first point the failure can
be avoided. The decay exponent is not shown, since in most cases, the error fluctuates wildly
and a meaningful estimate of decay exponent is not easily possible. For Vf+X the results
for interval [0,1] are not shown, since the function is regular in that interval and the error

Table 4.4: Rational function interpolation

sinx 1
1+25x 2 Vf+X v'IxI
[0,1] [-1,1] [0,1] [-1,1] [-1,1] [0,1]
n II en II II en II II en II II en II II en II II en II
2 7.8 X 10- 1 9.1 X 10- 1 5.1 X 10- 1 1.3 X 10° 1.0 X 10° 9.5 X 10- 1
4 7.2 X 10- 4 2.9 X 10- 2 8.6 X 10- 8 1.1 X 10- 1 5.5 X 10- 1 7.7 X 10- 2
6 8.8 X 10- 6 1.9 X 10- 1 3.3 X 10- 7 4.7 X 10- 2 3.8 X 10- 1 3.4 X 10- 2
8 7.3 X 10- 6 5.4 X 10- 2 1.9 X 10- 7 2.5 X 10- 2 3.1 X 10- 1 1.7 X 10- 2
10 5.2 X 10- 6 2.2 X 10- 1 9.5 X 10- 6 1.7 X 10- 2 2.5 X 10- 1 1.1 X 10- 2
12 2.6 X 10- 6 2.6 X 10- 1 2.8 X 10- 5 1.3 X 10- 2 2.5 X 10- 1 7.2 X 10- 3
14 3.3 X 10- 6 2.1 X 10- 1 1.1 X 10- 6 1.3 X 10- 2 2.1 X 10- 1 1.2 X 10- 2
16 6.8 X 10- 6 3.5 X 10- 2 8.8 X 10- 6 8.3 X 10- 3 1.9 X 10- 1 8.9 X 10- 3
18 3.3 X 10- 5 1.4 X 10- 4 1.1 X 10- 6 1.3 X 10- 2 1.9 X 10- 1 9.5 X 10- 3
20 1.4 X 10- 5 2.9 X 10- 5 3.2 X 10- 6 1.7 X 10- 2 1.5 X 10- 1 6.1 X 10- 3
4.7. Interpolation in Two or More Dimensions 147

is obviously very small. In any case, this function is neither symmetric nor antisymmetric
about the origin, so there is no reason to divide the range.
It can be seen from the table that the roundoff error is much larger for rational function
interpolation, as compared to the polynomial interpolation. The presence of poles in the
interpolating function can cause a large roundoff error, since it is difficult to detect such poles
in numerical calculations, The denominator may not turn out to be zero because of roundoff
error, and essentially arbitrary value may be accepted. Apart from poles, we frequently end
up with an indeterminate expression of the form % in rational function interpolation, which
again can cause problems because of roundoff errors.
If the range for interpolation is restricted to [0, 1], then it can be seen that for the second
function, which is actually a rational function, the interpolation is exact (apart from roundoff
error) for n :2: 4 as expected. Even when the range is increased to [-1, 1], the interpolation
is actually exact with 4 points, but because of roundoff errors the error appears to be large.
If the calculations are done using double precision arithmetic, then with 4 points the error
goes down to 5 X 10- 11 . Similarly, for the last function involving square root, the results are
significantly better than those with polynomial interpolation.

The evaluation of rational function interpolation requires much more effort than that for polynomial interpolation. Further, for each value of x the process has to be started from scratch. It can be seen that each interpolation based on n points requires about 3n² operations of multiplication and division, which is about six times the effort needed to evaluate polynomial interpolation. Once the difference table is constructed, the polynomial interpolation will require only n multiplications, while rational function interpolation cannot make use of such facilities. Thus, rational function interpolation is not as efficient as polynomial interpolation. Hence, it is rarely recommended in practice, except when the function is known to have poles in the neighbourhood of the given interval.
Another problem with rational function interpolation is that the existence of such an interpolating function is not guaranteed. As we have already seen, even in some very simple cases, the interpolating function does not exist, and even if it exists, it may have poles in the interval, where the given function is continuous. It may even lead to an indeterminate expression of the form 0/0. From the above example it is clear that rational function interpolation may either fail or give unreliable results when at some point the denominator vanishes or becomes small. Thus in general, evaluation of rational function interpolation is numerically unstable and must be avoided unless the function is known to have poles in the neighbourhood of the tabular range. Even in this case considerable care would be required to get meaningful results.

4.7 Interpolation in Two or More Dimensions


If we have a function of several variables, which is tabulated at a finite number
of points, we will need to interpolate for all the variables. We will only consider
the case, where the tabular values are available on a Cartesian mesh, i.e., at the
vertices of a rectangular grid. For clarity we will mostly discuss only the two-
dimensional case. In most cases, it is straightforward to extend these methods
to higher dimensions. If we are unlucky enough to encounter a table with entries at essentially random points in a multidimensional space, then probably the best we can do is to find a few nearest points and use the method of undetermined coefficients described in the previous section to find the coefficients of an interpolating polynomial of appropriate degree.

Figure 4.4: Interpolation in two dimensions; • denotes the tabular points, ○ the point at which interpolation is required, * the intermediate points at which interpolation is calculated.
Two-dimensional interpolation is the problem of interpolating in a two-
dimensional array of tabular values. The function values f(a_{1,i}, a_{2,j}) are known for i = 1, \ldots, n_1 and j = 1, \ldots, n_2, and we have to find the value of f(x_1, x_2) at some arbitrary point (x_1, x_2) in the region covered by the table. If the tabular points are plotted on a plane, they will form the vertices of a rectangular grid (Figure 4.4). The simplest approach is to use linear interpolation in each of the variables. For this purpose, first we have to locate the rectangle containing the given point, which can be done by a simple extension of the one-dimensional search techniques described earlier. Once we have found the values of (i, j), such that a_{1,i} \le x_1 \le a_{1,i+1} and a_{2,j} \le x_2 \le a_{2,j+1}, the bilinear interpolation can be easily carried out. It is more convenient to define the dimensionless variables

h_1 = \frac{x_1 - a_{1,i}}{a_{1,i+1} - a_{1,i}}, \qquad h_2 = \frac{x_2 - a_{2,j}}{a_{2,j+1} - a_{2,j}}.        (4.96)

These variables will of course, change with values of i and j, but in each rect-
angle they are uniquely defined and assume all values between 0 and 1. With
this definition, it can be easily verified that the interpolating linear function is
given by

g_1(h_1, h_2) = f(a_{1,i}, a_{2,j})(1 - h_1)(1 - h_2) + f(a_{1,i+1}, a_{2,j})\, h_1 (1 - h_2) + f(a_{1,i+1}, a_{2,j+1})\, h_1 h_2 + f(a_{1,i}, a_{2,j+1})(1 - h_1) h_2.        (4.97)

This is the simplest and probably the most commonly used formula for interpolation in two dimensions.
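As a concrete illustration of Eqs. (4.96)-(4.97), the following Python sketch (not one of the Appendix B routines; the function name bilinear and the test data are arbitrary choices) locates the enclosing rectangle with a binary search and evaluates the bilinear form:

import numpy as np

def bilinear(x1, x2, a1, a2, f):
    """Bilinear interpolation of f tabulated on the grid a1 x a2.
    a1, a2 : increasing 1-D arrays of abscissas; f[i, j] = f(a1[i], a2[j])."""
    # locate the rectangle containing (x1, x2); clip so the nearest cell is used
    i = int(np.clip(np.searchsorted(a1, x1) - 1, 0, len(a1) - 2))
    j = int(np.clip(np.searchsorted(a2, x2) - 1, 0, len(a2) - 2))
    # dimensionless variables of Eq. (4.96), between 0 and 1 inside the cell
    h1 = (x1 - a1[i]) / (a1[i + 1] - a1[i])
    h2 = (x2 - a2[j]) / (a2[j + 1] - a2[j])
    # Eq. (4.97)
    return (f[i, j] * (1 - h1) * (1 - h2) + f[i + 1, j] * h1 * (1 - h2)
            + f[i + 1, j + 1] * h1 * h2 + f[i, j + 1] * (1 - h1) * h2)

# quick check on a function that bilinear interpolation reproduces exactly
a1 = np.linspace(0.0, 1.0, 6)
a2 = np.linspace(0.0, 2.0, 9)
F = a1[:, None] + 2 * a2[None, :] + 3 * a1[:, None] * a2[None, :]
print(bilinear(0.37, 1.21, a1, a2, F))   # compare with 0.37 + 2*1.21 + 3*0.37*1.21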
This process can be easily generalised to n dimensions, by noting that each function value is multiplied by n factors of the form h_i or 1 - h_i. Further, if the corresponding abscissa is at the upper end of the range, the first factor is used, otherwise the second factor is used. Hence in n dimensions, if we have a table of values f(a_{1,i_1}, \ldots, a_{n,i_n}) for i_j = 1, \ldots, m_j, (j = 1, \ldots, n), and the point (x_1, \ldots, x_n) at which the interpolant has to be evaluated is located in the hypercube a_{j,i_j} \le x_j \le a_{j,i_j+1} for j = 1, \ldots, n, then we can once again define the dimensionless variables

h_j = \frac{x_j - a_{j,i_j}}{a_{j,i_j+1} - a_{j,i_j}}, \qquad (j = 1, \ldots, n).        (4.98)

Now the linear interpolating function in n variables can be written as

g_1(h_1, \ldots, h_n) = \sum_{k_1=i_1}^{i_1+1} \cdots \sum_{k_n=i_n}^{i_n+1} f(a_{1,k_1}, \ldots, a_{n,k_n}) \prod_{j=1}^{n} H(j, k_j - i_j),        (4.99)

where

H(j, k) = \begin{cases} 1 - h_j, & \text{if } k = 0; \\ h_j, & \text{if } k = 1. \end{cases}        (4.100)

This may be the only practical interpolation procedure when n is large. We


can formally write down the truncation error in such an approximation, but
these formal expressions are of little use. In any case, we cannot expect much
accuracy in interpolation for functions of several variables, since the amount
of computation required to achieve a given accuracy increases exponentially
with the number of variables. For example, the above expression for linear interpolation in n variables has 2^n terms, each requiring n multiplications. Hence, the total number of multiplications required is of the order of n 2^n. This number can, of course, be reduced by arranging the terms properly, but the factor of 2^n will remain. Further, such rearrangements may be rather difficult, particularly for the higher order polynomials. This algorithm is implemented in subroutine LINRN in Appendix B.
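A minimal Python sketch of the multilinear formula (4.99)-(4.100) is given below; it only illustrates the 2^n-term sum and is not the LINRN routine of Appendix B (the names and the test data are illustrative):

import itertools
import numpy as np

def multilinear(x, axes, f):
    """Multilinear interpolation in n dimensions, following Eqs. (4.99)-(4.100).
    x : sequence of n coordinates; axes : list of n increasing 1-D arrays;
    f : n-dimensional array with f[i1,...,in] = f(a1[i1], ..., an[in])."""
    n = len(axes)
    idx, h = [], []
    for xj, aj in zip(x, axes):
        i = int(np.clip(np.searchsorted(aj, xj) - 1, 0, len(aj) - 2))
        idx.append(i)
        h.append((xj - aj[i]) / (aj[i + 1] - aj[i]))   # Eq. (4.98)
    total = 0.0
    # sum over the 2**n vertices of the enclosing hypercube
    for corner in itertools.product((0, 1), repeat=n):
        weight = 1.0
        for j, kj in enumerate(corner):
            weight *= h[j] if kj == 1 else (1.0 - h[j])   # H(j, k) of Eq. (4.100)
        total += weight * f[tuple(i + k for i, k in zip(idx, corner))]
    return total

# example: trilinear interpolation of f(x, y, z) = x + y*z on a 3-D grid
ax = [np.linspace(0, 1, 5), np.linspace(0, 1, 4), np.linspace(0, 2, 7)]
F = ax[0][:, None, None] + ax[1][None, :, None] * ax[2][None, None, :]
print(multilinear((0.3, 0.5, 1.2), ax, F))   # exact value 0.3 + 0.5*1.2 = 0.9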
To get higher order accuracy, there are again two possibilities, one is to
use higher degree polynomial in two variables passing through all tabular points
in some neighbourhood of the required point, while the other possibility is to
use piecewise polynomial interpolation passing through all points in the table
and satisfying some smoothness criterion. We consider the first approach in this
section, while the second approach will be discussed in the next section.
To find the interpolated value of the function at (x_1, x_2) we can use a series of one-dimensional interpolations. We can consider the one-dimensional array of tabular values f(a_{1,i}, a_{2,j}), (i = 1, \ldots, n_1) for each value of j, and perform a polynomial interpolation with respect to x_1 in this table to get the value of f(x_1, a_{2,j}) for j = 1, \ldots, n_2 (see Figure 4.4). Now we can use one-dimensional interpolation with respect to x_2 on f(x_1, a_{2,j}), (j = 1, \ldots, n_2), to get the desired value of f(x_1, x_2). This procedure requires n_2 + 1 one-dimensional interpolations. If we assume n_1 = n_2 = n, then it will require of the order of n^3 floating-point operations to perform one interpolation in two dimensions. If
this procedure is generalised to m variables, it will require of the order of n^{m+1} floating-point operations.
The numbers n_1 and n_2 for the points used on each axis should be carefully chosen. As remarked earlier, if a high degree polynomial is used, there will be wild oscillations. Hence, the values of n_1 and n_2 should preferably be less than five, unless there are special reasons to believe that higher values will give better results. Subroutine POLY2 in Appendix B implements this algorithm for polynomial interpolation in two dimensions.
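The following Python sketch illustrates this procedure of repeated one-dimensional interpolation; it is not the POLY2 routine, a simple Lagrange evaluation is used for the one-dimensional steps, and the example polynomial is an arbitrary choice:

import numpy as np

def lagrange1d(a, fa, x):
    """Value at x of the polynomial through the points (a[i], fa[i])."""
    p = 0.0
    for i in range(len(a)):
        li = 1.0
        for j in range(len(a)):
            if j != i:
                li *= (x - a[j]) / (a[i] - a[j])
        p += fa[i] * li
    return p

def poly2d(a1, a2, f, x1, x2):
    """Interpolate f(a1[i], a2[j]) at (x1, x2) by a sequence of 1-D interpolations:
    first in x1 along every row of the table, then once in x2 (n2 + 1 steps)."""
    row_vals = np.array([lagrange1d(a1, f[:, j], x1) for j in range(len(a2))])
    return lagrange1d(a2, row_vals, x2)

# example: a low-degree bivariate polynomial is reproduced exactly
a1 = np.array([0.0, 0.5, 1.0, 1.5])
a2 = np.array([0.0, 1.0, 2.0])
F = np.array([[u**2 + 3*u*v + v for v in a2] for u in a1])
print(poly2d(a1, a2, F, 0.7, 1.3), 0.7**2 + 3*0.7*1.3 + 1.3)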

4.8 Spline Interpolation in Two or More Dimensions


The approach described in the previous section can also be used with cubic splines. We can perform spline interpolation with respect to x_1 for each of the n_2 values of x_2 in the table, to get f(x_1, a_{2,j}) for j = 1, \ldots, n_2. The spline coefficients for each row of points can be stored for future use, if the interpolant is to be evaluated at several points. Using the values of f(x_1, a_{2,j}), we can perform one more spline interpolation with respect to x_2, to get the interpolated value of f(x_1, x_2). The coefficients of this spline need not be stored, unless the function is to be evaluated for the same value of x_1 again. This process requires n_2 + 1 spline calculations for the first value, and one spline for each of the subsequent points at which the function is to be interpolated. Apart from this, in all cases, it will be required to evaluate the spline value at n_2 + 1 points.
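A sketch of this row-by-row spline procedure in Python, using scipy's CubicSpline (an assumption about the available environment, not one of the spline routines of Appendix B), could look as follows; the tabulated function is an arbitrary test case:

import numpy as np
from scipy.interpolate import CubicSpline

def spline2d(a1, a2, f, x1, x2):
    """Evaluate a two-dimensional spline interpolant at (x1, x2) by the
    row-by-row procedure described above: one cubic spline in x1 for each
    tabulated x2, followed by a single spline in x2 through those values."""
    # spline in x1 along each row of the table (these could be stored and reused)
    row_vals = np.array([CubicSpline(a1, f[:, j])(x1) for j in range(len(a2))])
    # one more spline, now in x2, through the intermediate values
    return CubicSpline(a2, row_vals)(x2)

a1 = np.linspace(0.0, 1.0, 11)
a2 = np.linspace(0.0, 2.0, 21)
F = np.sin(np.pi * a1)[:, None] * np.exp(-a2)[None, :]
print(spline2d(a1, a2, F, 0.33, 0.71), np.sin(np.pi * 0.33) * np.exp(-0.71))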
An alternative procedure is to actually construct a piecewise continuous
cubic in two variables, with continuous first and second derivatives throughout
the region. This gives the bicubic spline interpolation in two dimensions. How-
ever, it turns out that it is much simpler to generalise the B-splines to higher
dimensions and in this section we will only describe that option. We use the
B-spline basis functions to express the interpolating polynomial. For simplicity,
we assume that we are using cubic B-splines with not-a-knot boundary condi-
tiOllS, since in this case the number of basis functions is equal to the number of
points in interpolation table. Thus, if ePi(X) (i = 1, ... , nd are the nl B-splines
defining the basis on X and 'lj}j(Y) (j = 1, ... , n2) are those on y. Then the two
dimensional interpolation can be expressed as
nl n2

g(x, y) = L I>ijePi(X)'l/ij(Y), (4.101)


i=l j=l

where c_{ij} are the n_1 n_2 unknown coefficients to be determined by the interpolation condition g(a_{1,i}, a_{2,j}) = f(a_{1,i}, a_{2,j}). These conditions will define n_1 n_2 equations in as many variables and in principle these can be solved to calculate the coefficients. However, this solution will require O(n_1^3 n_2^3) floating point operations. The number of operations can be reduced significantly if these are arranged in proper order. For example, let us consider the n_1 equations at y = a_{2,k}, which we can write as

\sum_{j=1}^{n_2} \psi_j(a_{2,k}) \left( \sum_{i=1}^{n_1} \phi_i(a_{1,l})\, c_{ij} \right) = f(a_{1,l}, a_{2,k}), \qquad l = 1, \ldots, n_1.        (4.102)

It can be recognised that the part inside the parenthesis is just the equation matrix that will be solved for one-dimensional interpolation using B-splines. Thus we define the n_1 \times n_1 matrix A, with a_{li} = \phi_i(a_{1,l}), and let A^{-1} be its inverse, which can be calculated using O(n_1^3) floating point operations. Actually, the number of operations can be reduced significantly if we recognise the fact that only 4 elements in each row are nonzero. Now if we premultiply the system of equations in Eq. (4.102) by A^{-1}, then we get

\sum_{j=1}^{n_2} \psi_j(a_{2,k})\, c_{lj} = f'_{lk},        (4.103)

where f'_{lk} is the result of premultiplying A^{-1} on the right hand side of Eq. (4.102). We can repeat this process for k = 1, \ldots, n_2; noting that the inverse of A is already available, this process will require only O(n_2 n_1^2) floating point operations. In actual practice, there is no need to compute A^{-1}, as we can simply solve the required systems of linear equations. Once all these solutions are obtained we can collect the equations involving c_{lj} for fixed l, noting that the equations for different l's are decoupled. It can be easily seen that these systems of equations define the interpolation problem in the second variable y. These can be solved to obtain the required coefficients c_{ij}. Thus with this rearrangement, we need to solve a sequence of one-dimensional interpolation problems in x and y. These calculations will require O((n_1 + n_2)(n_1^2 + n_2^2)) floating point operations, which is much less than the n_1^3 n_2^3 required for direct solution of the two-dimensional problem. This algorithm is implemented in subroutine BSPINT2 in Appendix B. This method can be easily generalised to higher dimensions and an implementation of this algorithm for n dimensions is provided by subroutine BSPINTN in Appendix B. This subroutine allows B-splines of higher order also. For three dimensions this technique will require O(n^4) floating point operations, where n is the number of points in each dimension. For m dimensions it will require O(n^{m+1}) floating point operations. This technique can also be generalised to higher order B-splines as explained in Section 4.5, though in practical problems cubic B-splines are generally sufficient.
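For readers working in Python, tensor-product spline interpolation of this kind is also available in library form. The sketch below uses scipy's RectBivariateSpline (an assumption about the environment, and not the BSPINT2 routine) to interpolate a smooth function tabulated on a rectangular grid:

import numpy as np
from scipy.interpolate import RectBivariateSpline

# tabulate a smooth function on a rectangular grid
x = np.linspace(0.0, 1.0, 15)
y = np.linspace(0.0, 2.0, 25)
F = np.exp(-x)[:, None] * np.cos(y)[None, :]

# s=0 requests interpolation (the surface passes through all tabular values);
# kx = ky = 3 gives the cubic tensor-product case discussed above
spl = RectBivariateSpline(x, y, F, kx=3, ky=3, s=0)
print(spl(0.37, 1.21)[0, 0], np.exp(-0.37) * np.cos(1.21))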
It may be noted that the representation of interpolating function as a
sum of product of basis functions in one dimension has its own limitations.
For functions which are smooth in two dimensions such representation should
be adequate. Similarly, if the function has one peak which is localised in two
dimensions then we can choose more knots in region around the peak to ap-
proximate the function adequately. However, if the function has a sharp ridge
which stretches along the diagonal in 2-dimensions, then we will need small
spacing throughout the region for adequate approximation using product form
of basis functions. In such cases the product form of approximating function may not be adequate and we may have to look at other alternatives. The finite elements (cf. Section 14.11) provide such an alternative representation, which may be useful in more general situations. But this is beyond the scope of this book.
All techniques described in the last two sections can be generalised to
higher dimensions in a straightforward manner, although the amount of cal-
culation as well as the storage required will increase exponentially with the
number of variables. In general, the problem of interpolation for a function of
several variables is rather difficult. There are several reasons for this difficulty,
some of them are as follows.
1. It requires more effort to interpolate a function of several variables.
2. The theory of function of several variables is not so well developed. For
example, almost all methods that we have described are extensions of one-
dimensional methods. It will probably be more efficient if special methods
are used to directly interpolate in two or more variables.
3. We do not have much intuitive feeling for functions of several variables.
The behaviour of a function of several variables is also quite complicated.
For example, at a given point in one dimension, a continuous function can
only be increasing, decreasing or stationary, while in higher dimensions at any point the function could be increasing in some direction, decreasing in another direction and stationary in yet another direction.
4. In higher dimensions, the distribution of tabular points could be arbitrary.
In one dimension the most general distribution of tabular points is a table
with nonuniform spacing. In two dimensions it will be difficult to use any
such characterisation, since the region between neighbouring points could
have arbitrary shapes.
Some of these remarks will not apply to the problem of interpolating an analytic
function of complex variable, even though in principle that can be considered
as a function of two real variables. This is because, the theory of analytic func-
tions is well developed and further the requirement of analyticity puts severe
constraints on the behaviour of the function. In fact, the Newton's divided dif-
ference formula will be valid, even if the points are distributed randomly in the
complex plane. Of course, it will be necessary to use complex arithmetic in a
computer program, which may reduce the efficiency to some extent.

Bibliography
Abramowitz, M. and Stegun, I. A. (1974): Handbook of Mathematical Functions, With Formulas, Graphs, and Mathematical Tables, Dover, New York.
Ahlberg, J. H., Nilson, E. N. and Walsh, J. L. (1967): The Theory of Splines and Their Applications, Academic Press, New York.
Burden, R. L. and Faires, D. (2010): Numerical Analysis, (9th ed.), Brooks/Cole Publishing Company.
Conte, S. D. and De Boor, C. (1972): Elementary Numerical Analysis, an Algorithmic Approach, McGraw-Hill, New York.
Dahlquist, G. and Björck, A. (2003): Numerical Methods, Dover, New York.
Davis, P. J. (1975): Interpolation and Approximation, Dover, New York.
De Boor, C. (2001): A Practical Guide to Splines, Springer-Verlag, New York.
Gear, C. W. (1971): Numerical Initial Value Problems in Ordinary Differential Equations, Prentice-Hall, Englewood Cliffs, New Jersey.
Hamming, R. W. (1987): Numerical Methods for Scientists and Engineers, (2nd ed.), Dover, New York.
Knuth, D. E. (1998): The Art of Computer Programming, Vol. 3, Sorting and Searching, (2nd ed.), Addison-Wesley, Reading, Massachusetts.
Press, W. H., Teukolsky, S. A., Vetterling, W. T. and Flannery, B. P. (2007): Numerical Recipes: The Art of Scientific Computing, (3rd ed.), Cambridge University Press, New York.
Quarteroni, A., Sacco, R. and Saleri, F. (2010): Numerical Mathematics, (2nd ed.), Texts in Applied Mathematics, Springer, Berlin.
Ralston, A. and Rabinowitz, P. (2001): A First Course in Numerical Analysis, (2nd ed.), Dover.
Rice, J. R. (1992): Numerical Methods, Software, and Analysis, (2nd ed.), Academic Press, New York.
Rutishauser, H. (1990): Lectures on Numerical Mathematics, Birkhäuser, Boston.
Stoer, J. and Bulirsch, R. (2010): Introduction to Numerical Analysis, (3rd ed.), Springer-Verlag, New York.

Exercises
1. For Lagrange's interpolation formula based on n points, show that

\sum_{i=1}^{n} a_i^k\, l_i(x) = x^k, \qquad (k = 0, \ldots, n-1),

where a_i, i = 1, \ldots, n are the tabular points. Also prove that

\sum_{i=1}^{n} a_i^n\, l_i(x) = x^n - \prod_{i=1}^{n} (x - a_i).

2. What can be the maximum spacing in a uniformly spaced table of sin x for the first quadrant, if it is desired to have truncation error of less than 10^{-7}? Assume that we want to use only (i) linear, (ii) quadratic, (iii) cubic interpolation.
3. If we are given a table of f(x) = log_{10}(sin x), for x = 0.01, 0.02, \ldots, 1.00, over what range of x values will linear interpolation be sufficient for an accuracy of 10^{-6} or 10^{-5}?
4. The polynomial that interpolates a polynomial f(x) of the same degree should reproduce
f(x). To check the accuracy of the interpolation process, compute the interpolants g(x)
of the following polynomials at the specified points and then evaluate the difference
f(x) - g(x) at some suitably selected check points both inside as well as outside the
interval spanned by the interpolating points.

   Polynomial                                    Interpolation points

   1 + x + x^2 + x^3                             1, 2, 3, 4
   1 + x/2 + x^2/6 + x^3/24 + x^4/120            -1, 1.2, 3.5, 4.5, 5.8
   (x - 1)(x - 2) \cdots (x - 6)                 {0, 1, \ldots, 6} and {0.5, 1.5, \ldots, 6.5}
   (x - 1)(x - 2) \cdots (x - 20)                {0, 1, \ldots, 20} and {0.5, 1.5, \ldots, 20.5}
                                                 0.1, 0.2, \ldots, 2.1

In the first case, explicitly show that the interpolating polynomial is identical to the
original polynomial.
5. Consider the interpolation of f(x) = e^{ax}, over a table with uniform spacing h (> 0) using tabular points a_j = jh, (j = 0, 1, \ldots), and show that

f[a_0, \ldots, a_n] = \frac{(e^{ah} - 1)^n}{n!\, h^n}.

Using this result study the convergence of the series obtained by letting the number of tabular points tend to infinity. Show that the ratio of the (n+1)th and nth terms of the series is given by

\frac{e^{ah} - 1}{h(n+1)} (x - nh).

Hence, show that the series will converge if e^{ah} < 2 and diverge if e^{ah} > 2, unless x is a positive integral multiple of h. For e^{ah} = 2 the series converges if and only if x > -h. Verify these convergence conditions by actual computations, using a table of e^{ax} for various values of ah. What is the maximum accuracy that can be achieved for e^{ah} > 2?
6. Prove (4.20) using induction.
7. Evaluate the following quantities by interpolating between tables with specified uniform spacing
(i) sin(44.4444°) with h = 50°, 10°, 5°, 1°
(ii) exp(0.111) with h = 2.0, 1.0, 0.5, 0.1
(iii) \sqrt{0.0121} with h = 0.1, 0.01, 0.001
Use Newton's formula based on 2, 4, 6, \ldots, 20 points and compare the results with actual values. To study the effect of errors in the tabulated values on interpolation, multiply each value of the tabulated function by (1 + \epsilon R), where R is a random number in the interval (-0.5, 0.5), and try \epsilon = 10^{-5} and 10^{-3}. Repeat the above calculation using these values.
8. Plot the function Pn(x) defined in (4.9), using 10 tabular points in the interval [-1,1].
Consider both the cases, first where the points are uniformly spaced and second using
Chebyshev points.
9. If y = f(x) and f'(x) \ne 0 for x_0 < x < x_1, then show that the truncation error of linear inverse interpolation based on corresponding values (x_0, y_0) and (x_1, y_1) is given by

-(y - y_0)(y - y_1) \frac{f''(\xi)}{2 [f'(\xi)]^3},

where x_0 < \xi < x_1. Show also that the magnitude of this error is limited by each of the following bounds

\frac{K}{2} |y - y_0|\, |y - y_1| \qquad \text{and} \qquad \frac{M_2}{2 m_1^3} |y - y_0|\, |y - y_1|,

where

\left| \frac{f''(x)}{[f'(x)]^3} \right| \le K, \qquad m_1 \le |f'(x)| \le M_1 \quad \text{and} \quad |f''(x)| \le M_2 \quad \text{for } x_0 \le x \le x_1.

10. Using values of sin x for x = 1.560 + 0.001j, (j = 1, \ldots, 10) try to estimate sin x at x = 1.5715, 1.5734, 1.5635, 1.5605, 1.5585 using Newton's formula. Perform inverse interpolation in the same table to estimate the value of x for which sin x = 0.999995, 0.99995, 0.9999, 0.9995. Also try the inverse interpolation using a table of values at x = 1.564 + 0.001j for j = 1, \ldots, 10.

11. The following table gives the world record for running events (as in 2012)
100m 9.58 s 1500 m 3:26.00 s
200 m 19.19 s 5000 m 12:37.35 s
400 m 43.18 s 10000 m 26:17.53 s
800 m 1:41.01 s 30000 m 1:26:47.4 s

Using Newton's formula or cubic spline, try to estimate the world record for 1000 m,
2000 m, 25000 m and marathon (42195 m). Compare your results with actual values:
2:11.96 s, 4:44.79 s, 1:12:25.4 s and 2:03:38 s, respectively. Also try to estimate the world
record for distance covered in one hour (actual record = 21285 m). Repeat the process
by using log(dist) versus log(time).
12. Consider the interpolation for the following functions, at the specified points
(i) f_1(x) = \sin(\pi x),   a_j = 0.5 + 2.01j,   j = 0, \ldots, 10
(ii) f_2(x) = x^2 + e^{-x} \ln[(\pi - x)^2],   a_j = 2.6 + 0.1j,   j = 0, \ldots, 10
Estimate the values of f_1(1.5) and f_2(3.1416) and compare them with the exact values.
13. Using the definition of backward differences, derive the Newton's backward interpolation formula

g_n = f_0 + \binom{m}{1} \nabla f_0 + \binom{m+1}{2} \nabla^2 f_0 + \cdots + \binom{m+n-1}{n} \nabla^n f_0.
14. Using the techniques explained in Chapter 2, estimate a bound on the roundoff error in
evaluating the forward differences, and using this estimate, find a bound on the roundoff
error in interpolation with the Newton's forward difference formula.
15. Instead of forming a difference table, we can construct a table of interpolated values at a given point, similar to that used in rational function interpolation. Thus, if P_n^i(x) is the interpolating polynomial of degree n based on abscissas a_i, a_{i+1}, \ldots, a_{i+n}, we can define the differences

C_n^i = P_n^i(x) - P_{n-1}^i(x), \qquad D_n^i = P_n^i(x) - P_{n-1}^{i+1}(x),

where x is the point at which interpolation is required. Prove that the differences satisfy the following recurrence relations

C_{n+1}^i = \frac{(a_i - x)(C_n^{i+1} - D_n^i)}{a_i - a_{i+n+1}}, \qquad D_{n+1}^i = \frac{(a_{i+n+1} - x)(C_n^{i+1} - D_n^i)}{a_i - a_{i+n+1}}.

These recurrence relations can be used to construct a table of interpolated values at a given point and the resulting algorithm is known as Neville's algorithm.
16. Using values of tanx for x = 1.560 + O.OO1j, (j = 1, ... ,10) try to estimate tanx at
x = 1.5695, 1.5634, 1.5615, 1.5605 with Newton's formula based on 2,3, ... ,10 points.
Repeat the same process to interpolate sin x and cos x, and from each of these set of
values compute tan x. Compare these results with the exact values. Repeat this problem
using the following three different orders, in which the points are added; (i) at each step
choose the nearest point, (ii) points are added in the natural order, (iii) points are added
in reverse order, i.e., j = 10, \ldots, 1. Which of these orderings gives the best results? Repeat
the exercise using cubic spline or the rational function interpolation.
17. For all four functions in Example 4.2, find the maximum error in interpolation
over the interval [-1, 1], using n = 2,4, ... ,20 points, with a uniform spacing of
1, 1/2, 1/2^2, \ldots, 1/2^{10}. In all cases, use the nearest n points for interpolation, some of
which may be outside the required range. From these results study the asymptotic be-
haviour of interpolation series, as more points are added, at a fixed spacing. Also estimate
the decay exponent when the number of points is fixed, but the spacing decreases. Does
this agree with the value expected from theory?
18. Using induction prove the following relation for divided difference of product of two
functions. If f(x) = g(x)h(x), then
f[a_i, \ldots, a_{i+k}] = \sum_{j=i}^{i+k} g[a_i, \ldots, a_j]\, h[a_j, \ldots, a_{i+k}]

and using this prove Eq. (4.69).


19. Repeat Example 4.2 using Hermite cubic interpolation, with the actual value for the
derivatives. Compare the results with those obtained using cubic spline.

20. We are given the values of f(x) = ln x at x = 1, 2, 3 and, in addition, the first derivative at x = 1, 3, and the second derivative at x = 1. Find an interpolating
polynomial of degree five, which agrees with the function and the known derivatives
at the specified points. Estimate the truncation error in this interpolating polynomial.
Evaluate the interpolant at x = 1.5,2.5 and compare the actual error with the error
estimate.
21. Obtain the Hermite cubic interpolation using the function value and the derivative at
the two end points of the interval [a, b]. Try the following four cases
(i) f(a) = 1, f(b) = 0, f'(a) =0, f'(b) =0;
(ii) f(a) = 0, f(b) = 1, f'(a) = 0, f'(b) =0;
(iii) f(a) = 0, f(b) = 0, f'(a) = 1, f'(b) = 0;
(iv) f(a) = 0, f(b) = 0, f'(a) = 0, f'(b) = 1.
These four cubics form the so-called Hermite cubic basis, and interpolation for any given function can be easily expressed as a linear combination of these basis functions. These basis functions are very useful in finite element methods. Consider the special cases, when the interval is [0,1], [-1,1] and [-1,0].
22. The following table gives the vapour pressure (in mm of Hg) of water, as a function of temperature:
Temperature (C):   0      5      10      15       20       25       30       35       40
Vapour press.:   4.579  6.543  9.209  12.788   17.535   23.756   31.824   41.175   55.324
Using any interpolation method estimate the value of vapour pressure at a temperature
of -5, -1, 1, 13, 22, 37, and 45 C. Try to estimate the accuracy of computations and
verify the results by comparing them with the known values, 3.163, 4.258, 4.926, 11.231,
19.827, 47.067 and 71.88, respectively.
23. The following table gives the specific heat at constant pressure (Cp ) of Platinum as a
function of temperature:
Temp. (K):              1        3        6       10      20      30      50     80     100
C_p (J g^{-1} K^{-1}): .000035  .000122  .00037  .00112  .0074   .0212   .055   .088   .100
Using any interpolation formula, estimate the value of C_p at a temperature of 2, 4, 8, 25,
40 and 70 K. Try to estimate the accuracy of computations and compare it with actual
values 0.000074, 0.000186, 0.00067, 0.0137, 0.038, 0.079, respectively.
24. Show that the boundary condition (4.59) can be written as
p_1'''(a_1) = p_2'''(a_1) \qquad \text{and} \qquad p_{n-2}'''(a_{n-1}) = p_{n-1}'''(a_{n-1}).
Express these conditions in terms of the slopes Si and use (4.57) to eliminate some of the
Si and derive (4.60) for not-a-knot boundary conditions.

25. If s(x) is a cubic spline interpolation for a twice differentiable function f(x), and g(x) is
any other function interpolating f(x) at the same set of points, then

\int_a^b [g''(x)]^2\, dx - \int_a^b [s''(x)]^2\, dx = \int_a^b [g''(x) - s''(x)]^2\, dx + 2 \int_a^b s''(x)\, [g''(x) - s''(x)]\, dx.

Integrating by parts, show that

\int_a^b s''(x)\, [g''(x) - s''(x)]\, dx = s''(b)[g'(b) - s'(b)] - s''(a)[g'(a) - s'(a)].

Hence, prove that the natural spline (i.e., spline with the free boundary conditions)
uniquely minimises the integral in (4.61), among all functions with continuous second
derivatives which interpolate the data at the tabular points. Similarly, prove that the
spline which satisfies the boundary condition s'(a) = f'(a) and s'(b) = f'(b) uniquely
minimises (4.61) among all such functions, which in addition also satisfy the boundary
conditions.

26. Approximate \sqrt{x}, (0 < x < 1), by a cubic spline using the function values at x = 0, .001, .01, .1, .5, 1.0. Estimate the maximum error in each of the subintervals (a_{i-1}, a_i) and
try to adjust the four intermediate knots to reduce the maximum error over the entire
interval.
27. Consider the following data which gives some property of Titanium as a function of
temperature:
T_i   f(T_i)    T_i   f(T_i)    T_i   f(T_i)    T_i   f(T_i)    T_i   f(T_i)    T_i   f(T_i)    T_i   f(T_i)
585 0.644 655 0.657 725 0.676 795 0.699 865 1.336 935 0.746 1005 0.603
595 0.622 665 0.652 735 0.676 805 0.710 875 1.881 945 0.672 1015 0.601
605 0.638 675 0.655 745 0.686 815 0.730 885 2.169 955 0.627 1025 0.603
615 0.649 685 0.644 755 0.679 825 0.763 895 2.075 965 0.615 1035 0.601
625 0.652 695 0.663 765 0.678 835 0.812 905 1.598 975 0.607 1045 0.611
635 0.639 705 0.663 775 0.683 845 0.907 915 1.211 985 0.606 1055 0.601
645 0.646 715 0.668 785 0.694 855 1.044 925 0.916 995 0.609 1065 0.608
Try to pick 12 knots out of this set and approximate the function using cubic spline
interpolation on these knots. Try to pick the knots to get good approximation over the
entire range.
28. The coefficients of the interpolating polynomial of degree n can be determined directly
by solving the set of linear equations (method of undetermined coefficients)
\sum_{i=0}^{n} c_i a_j^i = f(a_j), \qquad (j = 0, \ldots, n),

where c_i are the coefficients of the interpolating polynomial. Use this technique to find the interpolating polynomial for f(x) = \sin x over the interval [0, 1] using 4, 6, \ldots, 20 uni-
formly spaced tabular points. Find the maximum error in interpolation over the interval
and compare the results with those obtained using Newton's formula. Also repeat the
exercise for cubic spline interpolation using the truncated power basis or B-spline basis.
29. Suppose we have a histogram defined by a sequence of n + 1 abscissas, a_0 < a_1 < \cdots < a_n, and a sequence of n nonnegative numbers h_1, h_2, \ldots, h_n, with h_i giving the height over the region a_{i-1} < x < a_i. The usual interpretation is that h_i(a_i - a_{i-1}) is the integral of the underlying distribution f(x) over the corresponding subinterval

\int_{a_{i-1}}^{a_i} f(x)\, dx = h_i (a_i - a_{i-1}).

Approximate the distribution function f(x) using a parabolic B-spline basis with knots at a_0, \ldots, a_n. Use the additional boundary conditions f(a_0) = f(a_n) = 0 to determine the coefficients of the B-spline basis. Try this procedure on the following data for the 21 knots a_i = -2 + 0.2i, i = 0, \ldots, 20, with h_i = 0.2, 0.4, 0.8, 1.3, 2.0, 2.8, 3.3, 3.5, 3.7, 10.8, 24.5, 20.4, 14.3, 9.9, 6.6, 4.2, 2.5, 1.4, 0.7, 0.3.
30. Suppose we have a sample of material composed of five radioactive substances. The jth substance is known to decay (emit radioactivity) at the rate e^{-d_j t}. We can, in principle, determine the proportions of the substances in the sample by measuring the radioactivity levels r_i at five different times, and solving the interpolation problem

r_i = \sum_{j=1}^{5} a_j e^{-d_j t_i}, \qquad (i = 1, \ldots, 5).
The half-lives of the substances are 12.5 hours, 33 hours, 2.5 days, 4.3 days and 7.3 days
respectively. (The decay rate d_j = ln 2/half-life.) Radioactivity measurements are taken
with the values normalised so that the first reading is 1. The readings are
Monday 10 : 00 hr 1.0000000
Tuesday 11 : 00 hr 0.7493199
Wednesday 14 : 00 hr 0.5910777
Thursday 16 : 00 hr 0.4867655
Friday 10 : 00 hr 0.4304283

Estimate the proportion of each substance in the sample at the beginning of the experi-
ment. These measurements are subject to some errors. Study the effects of uncertainties
in time measurement, the half-life and radioactivity levels, by perturbing the input values randomly within the specified error limits, and recomputing the solution. Consider the effect of the following uncertainties:
(i) ±2 min error in time measurements
(ii) ±10 sec error in time measurements
(iii) ±10 min uncertainties in known half-life values
(iv) ±10 sec uncertainties in known half-life values
(v) ±1 % error in measurement of radioactivity levels
(vi) ±0.01 % error in measurement of radioactivity levels
From these results estimate the condition numbers for this problem.
31. Compare the effectiveness of various interpolation methods for the following functions
(i) e^{-10x} on [0,2]
(ii) min(|x|, |x-2|) on [-1,3]
(iii) e^{-1/x} on [0,1]
(iv) cos(1+x^3) on [0,2]
(v) x\theta(x) = 0 if x \le 0, and x if x > 0, on [-1,1]
(vi) \theta(x) = 0 if x < 0, and 1 if x > 0, on [-1,1]
Compute the polynomial interpolation using both uniformly spaced and Chebyshev points. Estimate the order of error as the number of points increases.
32. Given the table of values of the elliptic integral of the first kind f(x, y)
y\x 50 60 70 80 90
50 0.9401 0.9647 0.9876 1.0044 1.0107
52 0.9835 1.0118 1.0387 1.0587 1.0662
54 1.0277 1.0602 1.0915 1.1152 1.1242
56 1.0725 1.1097 1.1462 1.1743 1.1851
58 1.1180 1.1605 1.2030 1.2362 1.2492
find an approximation to f(65, 53) by the following: (i) interpolating horizontally to find
f(65, y) for y = 50, 52, 54, 56, 58 and then interpolating vertically, (ii) interpolating
vertically and then horizontally, (iii) interpolating diagonally along the two diagonals.
Compare the results with the correct value, which is 1.0509.
33. The following table gives the thermal conductivity of compressed oxygen (in mW /m-K)
as a function of temperature (T) and pressure (P):
T(K)\P(atm) 1 10 20 30 40
100 9.5 138.8 139.6 140.4 141.1
120 11.4 13.5 111.5 112.8 113.3
140 13.2 15.0 17.5 81.6 84.1
160 15.8 16.5 18.4 20.7 24.2
180 16.8 18.1 19.6 21.3 23.3
200 18.5 19.6 20.9 22.3 23.8
Using any interpolation method estimate the conductivity at (T, P) = (170,5), (110,35),
(210,5), (110,5) and (130,25). Try to estimate the error in computation, and compare it
with actual values of 16.5, 127.3, 19.8, 11.4 and 97.1, respectively.
34. Using a table of values for the function f(w,x,y,z) = sin(w)sin(x)sin(y)sin(z) with
uniform spacing of 0.1 in w, x, y, z over the interval [0,1], find an approximation to
f(.23, .34, .54, .65). Use interpolation with product B-splines and compare the result with
the exact value of the function.
Chapter 5

Differentiation

In this chapter we consider numerical differentiation of a function which is


specified by either a table of values or by an expression which can somehow
be evaluated at any required point. We shall demonstrate that numerical dif-
ferentiation is basically an unstable process, which should be avoided as far as
possible. If the function is simple enough to be differentiated analytically with
a reasonable effort, the analytic expression should be used. In many cases, it
may be possible to obtain an analytic expression for the derivative by using
some algebraic manipulation routine. If it is too difficult to obtain an ana-
lytic expression, or the analytic expression for the derivative is too difficult to
evaluate compared to the function value itself, then recourse may be made to
numerical differentiation. On the other hand, if the function is known only at a
finite set of points, then there is no alternative but to calculate the derivative
numerically.
Simplest algorithms for numerical differentiation can be obtained by differentiating the interpolating polynomials. These methods will be discussed in Section 5.1. Many of the simple formulae can be derived independently by the method of undetermined coefficients, which will be discussed in Section 5.2. Finally in Section 5.3, we consider a method based on the h → 0 extrapolation, for differentiating a function which can be evaluated at any required point.
Of course, in principle, we can use interpolation between the tabular values to
calculate the function value at any specified point, but that will not add any
new information to the known tabular values. Such tricks should, therefore, not
be resorted to. If the function is known only through a table of values, then
formulae described in Section 5.1 or 5.2 should be used.

5.1 Differentiation of Interpolating Polynomials


For numerical differentiation of a function which is available only in the form of
a table of values, we can adopt the simple procedure of first approximating the

given function by an interpolating function and then differentiating that func-


tion. In such a situation, it is most convenient to use polynomial or piecewise
polynomial interpolation, since it is very easy to differentiate a polynomial.
We can use either the Newton's divided difference formula or the cubic spline
interpolation discussed in the previous chapter for this purpose.
The Newton's divided difference interpolation formula can be written as

f(x) = \sum_{i=0}^{n} P_{i-1}(x)\, f[a_0, \ldots, a_i] + \frac{P_n(x)}{(n+1)!} f^{(n+1)}(\xi),        (5.1)

where P_n(x) = \prod_{i=0}^{n} (x - a_i). Since the only dependence on x in this formula is through the products P_n(x), it can be easily differentiated any number of times to get derivatives of the required order. If this formula is differentiated once, we get

f'(x) = \sum_{i=1}^{n} P'_{i-1}(x)\, f[a_0, \ldots, a_i] + \frac{P'_n(x)}{(n+1)!} f^{(n+1)}(\xi_0) + \frac{P_n(x)}{(n+2)!} f^{(n+2)}(\xi_1),        (5.2)

where \xi_0 and \xi_1 are some points in the interval spanned by the tabular points a_0, \ldots, a_n and the required point x. It is not obvious that the error term can be differentiated in the manner given above, but it can be shown that it is indeed possible (Kopal, 1961). The derivatives of P_n(x) can be easily calculated using the following recurrence relations

P_i(x) = (x - a_i) P_{i-1}(x), \quad (i \ge 0), \qquad P_{-1}(x) = 1;
P'_i(x) = (x - a_i) P'_{i-1}(x) + P_{i-1}(x), \quad (i \ge 1), \qquad P'_0(x) = 1;        (5.3)
P_i^{(k)}(x) = (x - a_i) P_{i-1}^{(k)}(x) + k P_{i-1}^{(k-1)}(x), \quad (i \ge k), \qquad P_{k-1}^{(k)}(x) = k!.

Here P_j^{(k)}(x) = 0 for j = 0, \ldots, k - 2. In principle, derivatives of any order


can be calculated using these relations. In practice, of course, the errors will
keep increasing with order of the derivative and it may not be meaningful
to proceed beyond a certain point. The derivative calculation can be easily
combined with the interpolation procedure with little extra effort. In fact, the
subroutine DIVDIF in Appendix B includes the calculation of first and second
derivatives. Higher derivatives can also be calculated, but in most cases, it may
be difficult to ensure any reasonable accuracy.
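The following Python sketch (illustrative only, not the DIVDIF routine) combines a divided difference table with the recurrences (5.3) to evaluate the first derivative of the Newton interpolating polynomial; the tabulated function and the evaluation point are arbitrary choices:

import numpy as np

def divided_differences(a, fa):
    """Return the divided differences f[a0], f[a0,a1], ..., f[a0,...,an]."""
    a = np.asarray(a, dtype=float)
    d = np.asarray(fa, dtype=float).copy()
    coef = [d[0]]
    for k in range(1, len(a)):
        d = (d[1:] - d[:-1]) / (a[k:] - a[:-k])
        coef.append(d[0])
    return coef

def newton_derivative(a, fa, x):
    """First derivative of the Newton interpolating polynomial at x,
    using the recurrences of Eq. (5.3) for P_i(x) and P_i'(x)."""
    c = divided_differences(a, fa)
    P, dP = 1.0, 0.0          # P_{-1}(x) = 1 and its derivative 0
    deriv = 0.0
    for i in range(1, len(a)):
        # update to P_{i-1} and P_{i-1}' before adding the i-th term of Eq. (5.2)
        dP = (x - a[i - 1]) * dP + P
        P = (x - a[i - 1]) * P
        deriv += dP * c[i]
    return deriv

a = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
fa = np.exp(a)
print(newton_derivative(a, fa, 0.5), np.exp(0.5))   # compare with the exact value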
Let us consider the truncation error in approximating the first derivative
using Newton's formula. Clearly, if x is one of the tabular points, then P_n(x) = 0 and the last term in (5.2) vanishes. If the number of points is even (i.e., n is odd) and the points are symmetrically distributed about x, then

a_{n-j} - x = x - a_j, \qquad (j = 0, 1, \ldots, \tfrac{n-1}{2}),        (5.4)

and

(5.5)

Now

P_n(z) = \prod_{j=0}^{(n-1)/2} \left[ (z - x)^2 - (x - a_j)^2 \right],        (5.6)

and hence

P'_n(x) = 0.        (5.7)
Thus, the first term in truncation error vanishes. It should be noted that the
second term involves higher order derivative than the first one and the order
of accuracy in this case is higher. Such formulae are sometimes referred to as
super-accurate, since the order of accuracy is higher than what was expected.
Let us derive some specific formulae for numerical differentiation. For n = 0, that is using only one point, g_0(x) = f(a_0) and so g'_0(x) \equiv 0, which is probably a safe, but usually unacceptable, approximation to the derivative. For n = 1, that is linear interpolation, we get

f'(x) = f[a_0, a_1] + \frac{P'_1(x)}{2} f''(\xi_0) + \frac{P_1(x)}{6} f'''(\xi_1).        (5.8)

Now if we choose a_1 = a_0 + h, x = a_0, then

f'(x) = \frac{f(x + h) - f(x)}{h} - \frac{h}{2} f''(\xi_0),        (5.9)

while if x = a_0 + h/2, then

f'(x) = \frac{f(x + h/2) - f(x - h/2)}{h} - \frac{h^2}{24} f'''(\xi_1).        (5.10)

Thus, we see that f[a_0, a_1] is a better approximation to f'(x) at the midpoint than at either of the end points.
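A quick numerical check of (5.9) and (5.10) in Python (the test function sin x and the spacing h = 0.1 are arbitrary choices) shows the expected difference between the one-sided and the centred estimates:

import numpy as np

f, df = np.sin, np.cos          # test function and its exact derivative
x, h = 0.5, 0.1

forward = (f(x + h) - f(x)) / h                    # Eq. (5.9): error O(h)
central = (f(x + h / 2) - f(x - h / 2)) / h        # Eq. (5.10): error O(h^2)

print("forward error:", abs(forward - df(x)))      # ~ (h/2)|f''| ≈ 2.4e-2
print("central error:", abs(central - df(x)))      # ~ (h^2/24)|f'''| ≈ 3.7e-4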
Similarly, we can obtain higher order formulae for numerical differentia-
tion. In the next section, we will consider another approach for deriving all these
formulae without using the interpolation polynomials. Instead of the Newton's
formula, we can differentiate the Lagrangian interpolation formula to get equiv-
alent formulae for numerical differentiation. We can also use cubic spline for
calculating the derivatives. Once the coefficients of cubic spline are calculated,
the required point should be located in the appropriate subinterval. Using the
cubic in this subinterval, it is trivial to calculate the derivatives. The function
SPLEVL in Appendix B includes the calculation of the first and second derivatives.
The calculation of the third and higher derivatives using cubic spline is not

recommended, since we cannot expect to obtain a reasonable approximation to


the third derivative, with a cubic polynomial. Similarly, we can differentiate an
expansion in terms of B-spline basis function using Eq. (4.72).
It may appear that we can get arbitrary degree of accuracy by choosing
the spacing h to be sufficiently small. But in practice because of roundoff error,
very high accuracy is difficult to achieve. It can be seen that h appears in the
denominator of these formulae, and therefore it requires division by a small
quantity. This division by itself will not cause any problem, but we know that
the final result is not large and moreover should be independent of h. Hence,
as h decreases, the numerator should also decrease, which is possible only if
there is a substantial cancellation due to subtraction of nearly equal numbers,
which is bound to cause a large roundoff error. Thus, numerical differentiation
is basically an unstable process, and in general, we cannot expect to achieve
good accuracy, unless the original data are known to be very accurate.
When h is very small, the roundoff error would be rather large, while for
large h, the truncation error would be large. Clearly, there exists an optimum
value of h, for which the error in numerical differentiation is minimum. Let us
try to estimate the optimum value of h for the formula (5.10). If |f'''(\xi)| \le M_3 for x - h/2 < \xi < x + h/2, then the truncation error T_1 is bounded by

|T_1| \le \frac{h^2 M_3}{24},        (5.11)

and clearly it decreases with h. To estimate the roundoff error in evaluating (5.10), it may be noted that the process of subtraction by itself does not introduce any roundoff error. The roundoff error is due to the fact that the function values as represented in the computer are not exact. If ε is the magnitude of the maximum error in each value of f(a_j), then the roundoff error R_1 in (5.10) is bounded by

|R_1| \le \frac{2ε}{h}.        (5.12)

Clearly, the roundoff error would increase as h decreases. It may be noted that
the error in function value may be due to other reasons, like experimental error
or just because the table of values is available to a limited accuracy. But for
our analysis we refer to it as roundoff error.
The optimum value of h can be obtained by minimising T_1 + R_1, but since we can at best evaluate some bounds on T_1 and R_1, it is not possible to find the minimum exactly. A reasonable estimate for the optimum value of h is obtained by requiring the bounds on the magnitude of T_1 and R_1 to be equal. This prescription will lead to an accurate optimum value of h, only if the two bounds are equally good. Nevertheless, this approach is justifiable, since we are looking only for a reasonable value of h. Thus, by equating the two bounds, we get

h_{opt} = \left( \frac{48ε}{M_3} \right)^{1/3}.        (5.13)

The bound on the total error for this value of h is

|T_1| + |R_1| \le \left( \frac{4 M_3 ε^2}{3} \right)^{1/3}.        (5.14)

If the function values are correctly rounded versions of the exact values, then ε = ℏM_0, where |f(\xi)| \le M_0 for x - h/2 \le \xi \le x + h/2 and ℏ is the machine precision. In this case, we get

h_{opt} = \left( \frac{48 M_0}{M_3} \right)^{1/3} ℏ^{1/3}.        (5.15)

Hence, if the derivatives are of the same order as the function values, then h_{opt} \sim ℏ^{1/3} and the corresponding error bound is of the order of ℏ^{2/3}. For 24-bit arithmetic h_{opt} \approx 0.004, and the corresponding bound on the relative error will be of the order of 10^{-5}. Thus, it is clear that in this case, even with the optimum value of h, two to three significant figures may be lost while evaluating the derivative.
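The trade-off between truncation and roundoff error, and hence the existence of an optimum spacing, is easy to see numerically. The Python sketch below applies the centred formula (5.10) to e^x in double precision, where ℏ ≈ 2.2 × 10⁻¹⁶, so from (5.13) the optimum spacing is expected near (48ℏ)^{1/3} ≈ 2 × 10⁻⁵ (the choice of function and the range of h are arbitrary):

import numpy as np

f, dfexact = np.exp, np.exp(1.0)    # f(x) = e^x at x = 1, exact derivative e
x = 1.0

# central difference (5.10) for a range of spacings: the error first falls as
# h^2 (truncation) and then rises as 1/h (roundoff), with a minimum near
# h_opt ~ (48*eps)^(1/3) ~ 2e-5 in double precision (cf. Eq. 5.13 with M3 ~ f)
for h in 10.0 ** np.arange(-1, -12, -1):
    approx = (f(x + h / 2) - f(x - h / 2)) / h
    print(f"h = {h:8.1e}   relative error = {abs(approx - dfexact) / dfexact:9.2e}")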
Similar analysis can be carried out for higher order formulae for numerical
differentiation {2}, and it turns out that the optimum value of h increases with
the order of the formula, while the error bound decreases. Consequently, by
using higher order formulae, it may be possible to get higher accuracy. But,
the use of very high order formula is again not advisable, since as mentioned
in Section 4.1 high order polynomials may have wild oscillations. Further, if
the function values are not known very accurately, then the relative roundoff
error will be much larger than h and the total error in the computed value of
the derivative will be much larger. If the function values are available only in
a tabular form, then we may not have any control on the value of h, unless the
tabular points are very closely spaced. However, it may be possible to choose
appropriate formula for numerical differentiation, which gives a reasonable value
for the given spacing. If the spacing is comparatively large, then we should use a
higher order formula, while for closely spaced tables, a low order formula can be
used. Alternately, for a closely spaced table we can apply higher order formula
after ignoring a few intermediate points. If the table of values is available at
very small spacing, it may be best to ignore some entries to increase effective
spacing and get reasonable value for the derivative, as otherwise the roundoff
error will dominate. This is particularly true when the function values are
known to a low accuracy, either because of experimental error or because the
values are available to only a few decimal digits. Thus in such cases we can
skip some entries in the table so that the resulting spacing is close to the
estimated optimal value (Eq. 5.13) computed using the error in tabular values.
Alternately, in such cases where function value is known to low accuracy, it may
be better to use approximations of the form discussed in Chapter 10, instead
of interpolation. Such approximations can give smoother function, which can
be easily differentiated.
Similar analysis can be carried out for higher derivatives also {2}. If the
table has a uniform spacing of h, then it can be shown that the roundoff error
in evaluating the kth derivative is of the order of ε/h^k, which will give rise to
a much larger error, even for the optimum value of h.

5.2 Method of Undetermined Coefficients


In the previous section we derived formulae for numerical differentiation by
differentiating the interpolation polynomials. In many cases, it is simpler to
derive these formulae directly by using the method of undetermined coefficients.
In this method, we assume a formula of some form and proceed to determine
the coefficients in this formula by demanding that the result should be exact for
polynomials up to a certain degree. This method is very useful when the table
is available with uniform spacing and derivatives are required at the tabular
points or midway between two tabular points. Such formulae are also useful in
solving differential equations. The method of undetermined coefficients can also
be used to obtain formulae for interpolation or integration.
To illustrate the use of this method, let us consider a five-point formula
of the following form for calculating the first derivative:

f'(x) = \frac{a_1 f(x-2h) + a_2 f(x-h) + a_3 f(x) + a_4 f(x+h) + a_5 f(x+2h)}{h}.        (5.16)

Here the five coefficients a_1, \ldots, a_5 can be determined by demanding that this formula should give exact results when f(x) is a polynomial of degree less than or equal to four. Since the formula is linear in the function values, it is enough to ensure that it is exact when f(x) = 1, x, x^2, x^3, and x^4. Further, since the formula should be independent of any shift in the origin, we can assume that x = 0 in (5.16). Again it can be shown that the coefficients should be independent of the value of h and for simplicity we can assume h = 1. Using these simplifications the required conditions can be written as follows:

f(x) = 1:    a_1 + a_2 + a_3 + a_4 + a_5 = 0,
f(x) = x:    2(a_5 - a_1) + (a_4 - a_2) = 1,
f(x) = x^2:  4(a_5 + a_1) + (a_4 + a_2) = 0,        (5.17)
f(x) = x^3:  8(a_5 - a_1) + (a_4 - a_2) = 0,
f(x) = x^4:  16(a_5 + a_1) + (a_4 + a_2) = 0.

These equations can be easily solved for the coefficients to get a_1 = -a_5 = 1/12, a_4 = -a_2 = 2/3 and a_3 = 0, which yields the formula

f'(x) = \frac{f(x-2h) - 8f(x-h) + 8f(x+h) - f(x+2h)}{12h} + \frac{h^4}{30} f^{(5)}(\xi),        (5.18)

where x - 2h < \xi < x + 2h. Here the error term can be estimated by assuming it to be of the form a_e f^{(5)}(\xi). The unknown coefficient a_e can be determined by substituting f(x) = x^5 in the given formula. This may not be a rigorous derivation for the error term and the truncation error may not even be expressible in this form. However, we can expect it to give a reasonable estimate. It may be noted that if we take \xi = x, then the error term here is just the first term in the asymptotic series (see Section 5.3).
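The conditions (5.17) are just a small linear system, and solving it numerically reproduces the coefficients quoted above. A minimal Python sketch (illustrative only; any linear solver could be used) is:

import numpy as np

# Conditions (5.17): the formula (5.16) must be exact for f(x) = 1, x, ..., x^4,
# with h = 1 and the origin shifted so that x = 0.  Row k of the matrix holds
# j**k for the offsets j = -2, -1, 0, 1, 2; the right-hand side is d/dx x^k at 0.
offsets = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
A = np.vander(offsets, 5, increasing=True).T      # A[k, :] = offsets**k
b = np.array([0.0, 1.0, 0.0, 0.0, 0.0])
a = np.linalg.solve(A, b)
print(a)        # expected [1/12, -2/3, 0, 2/3, -1/12]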
Similarly, we can derive the following formulae for calculating the deriva-
tives at a point, which is at the centre of the region covered by the points used
in the formula. Such formulae are also referred to as central difference formulae:

f''(x) = \frac{f(x-h) - 2f(x) + f(x+h)}{h^2} - \frac{h^2}{12} f^{(4)}(\xi),        (5.19)

f''(x) = \frac{-f(x-2h) + 16f(x-h) - 30f(x) + 16f(x+h) - f(x+2h)}{12h^2} + \frac{h^4}{90} f^{(6)}(\xi),        (5.20)

f'''(x) = \frac{-f(x-2h) + 2f(x-h) - 2f(x+h) + f(x+2h)}{2h^3} - \frac{h^2}{4} f^{(5)}(\xi),        (5.21)

f^{(4)}(x) = \frac{f(x-2h) - 4f(x-h) + 6f(x) - 4f(x+h) + f(x+2h)}{h^4} - \frac{h^2}{6} f^{(6)}(\xi).        (5.22)

It may be noted that the formulae (5.20) and (5.22) are super-accurate, in the sense that the error term involves the sixth derivative instead of the fifth, which may be expected from the fact that five points are used in these formulae. Similarly, (5.19) is also super-accurate. We can get super-accurate formulae for the first and third derivatives using four points {4} when the derivative is calculated at a point midway between the central pair of points. These formulae are useful when the derivatives are required at the tabular points in the interior of a uniformly spaced table. On the other hand, sometimes we are interested in finding the derivative at one of the end points, in which case the following formulae, which are referred to as forward difference formulae, can be used:

f'(x) = \frac{-3f(x) + 4f(x+h) - f(x+2h)}{2h} + \frac{h^2}{3} f'''(\xi),        (5.23)

f'(x) = \frac{-11f(x) + 18f(x+h) - 9f(x+2h) + 2f(x+3h)}{6h} - \frac{h^3}{4} f^{(4)}(\xi),        (5.24)

f'(x) = \frac{-25f(x) + 48f(x+h) - 36f(x+2h) + 16f(x+3h) - 3f(x+4h)}{12h} + \frac{h^4}{5} f^{(5)}(\xi),        (5.25)

f''(x) = \frac{f(x) - 2f(x+h) + f(x+2h)}{h^2} - h f'''(\xi),        (5.26)

f''(x) = \frac{35f(x) - 104f(x+h) + 114f(x+2h) - 56f(x+3h) + 11f(x+4h)}{12h^2} - \frac{5h^3}{6} f^{(5)}(\xi),        (5.27)

f'''(x) = \frac{-5f(x) + 18f(x+h) - 24f(x+2h) + 14f(x+3h) - 3f(x+4h)}{2h^3} + \frac{7h^2}{4} f^{(5)}(\xi),        (5.28)

f^{(4)}(x) = \frac{f(x) - 4f(x+h) + 6f(x+2h) - 4f(x+3h) + f(x+4h)}{h^4} - 2h f^{(5)}(\xi).        (5.29)

The corresponding backward formulae can be obtained by replacing h by -h


in these formulae. Many formulae for numerical differentiation are listed in
Abramowitz and Stegun (1974).
EXAMPLE 5.1: Evaluate the first and second derivatives of the following functions using the three and five-point central difference formulae, and compare the results with the exact values:

(i) e^x at x = 0,        (ii) tan x at x = 1.56.        (5.30)

The results for e^x using a range of values for the spacing h are displayed in Table 5.1. These calculations were performed using a 24-bit arithmetic. The exact value of these derivatives is unity. It can be seen that in all cases as h decreases, the error at first decreases and then starts increasing. The three-point formula for the first derivative yields a minimum error of 5 × 10⁻⁶ when h = 0.005. It may be noted that the three-point formula is the same as (5.10), but with h/2 replaced by h. Hence, from (5.15) the optimum value of h is (6ℏ)^{1/3} ≈ 0.007 and the corresponding error bound is (4ℏ²/3)^{1/3} ≈ 2 × 10⁻⁵, which is reasonably close to the actual values. For the five-point formula (5.18) a similar analysis {2} gives the optimum value of h as (45ℏ)^{1/5} ≈ 0.08 and the corresponding error bound will be (27ℏ⁴/5)^{1/5} ≈ 2 × 10⁻⁶. Once again, these numbers are in reasonable agreement with the computed results. For the second derivative the optimum value of h for the three-point formula (5.19) is (48ℏ)^{1/4} ≈ 0.04 and the corresponding error is √(4ℏ/3) ≈ 3 × 10⁻⁴. The five-point formula (5.20) should have an optimum value of h as (480ℏ)^{1/6} ≈ 0.2 and the corresponding error is (1024ℏ²/405)^{1/3} ≈ 2 × 10⁻⁵. All these values are in reasonable agreement with the actual results.
The corresponding results for tan x are displayed in Table 5.2. This function has a singularity close to 1.56 and the derivative estimate will be totally unreliable, if one of the points is on the other side of the singularity. Hence, in this case, the values are reasonable only for h < 0.005. Correct values of the two derivatives are 8579.556 and 1589286, respectively. Once again, the error at first decreases with h, but after some stage it starts increasing. In this case the optimum values of h are different, because the relevant derivatives are not of the same order of magnitude in this region. To first order, we can approximate the derivatives by |f^{(k)}(x)| ≈ k!/(π/2 − x)^{k+1}. Using π/2 − x ≈ 0.01 we get the following values for h_opt and

Table 5.1: Calculating the first and second derivatives of e^x at x = 0

                     First derivative               Second derivative
      h           3-point       5-point           3-point        5-point

  1.000000       1.1752010     .9624581          1.0861610      .9878490
   .500000       1.0421910     .9978537          1.0210080      .9992896
   .100000       1.0016680     .9999969          1.0008390     1.0000060
   .050000       1.0004160     .9999991          1.0002140     1.0000050
   .010000       1.0000170    1.0000010          1.0001660     1.0002160
   .005000       1.0000050    1.0000010           .9989738      .9985765
   .001000       1.0000170    1.0000220          1.0132790     1.0182460
   .000500        .9999871     .9999772           .7152557      .6159146
   .000100       1.0001660    1.0002160           .0000000     -.4967054
   .000050        .9995699     .9993712        -23.8418600   -31.7891500
   .000010       1.0013580    1.0013580           .0000000      .0000000
   .000005       1.0013580    1.0013580           .0000000      .0000000

the corresponding error:

f'(x) using 3-point formula:    h_opt ≈ 0.00004    relative error ≈ 4 × 10⁻⁵
f'(x) using 5-point formula:    h_opt ≈ 0.0003     relative error ≈ 7 × 10⁻⁶
f''(x) using 3-point formula:   h_opt ≈ 0.0002     relative error ≈ 9 × 10⁻⁴
f''(x) using 5-point formula:   h_opt ≈ 0.0006     relative error ≈ 1 × 10⁻⁴

These values for h_opt are once again consistent with the results displayed in Table 5.2. However, the estimated error turns out to be significantly smaller than the actual error in Table 5.2. This is due to the fact that the actual relative error in evaluating tan x near x = 1.56 is much more than ℏ because of the singularity close to this point. In fact, the error in the function value due to rounding of x itself will be about 100 times larger. If this error is used in computing the bounds, then the estimated error will be consistent with the actual error. It may be noted that the bound on the relative error is similar in the two cases, even though the optimum

Table 5.2: Calculating the first and second derivatives of tan x at x = 1.56

                     First derivative               Second derivative
      h           3-point       5-point           3-point         5-point

  0.010000       60379.098    81681.391        11184987.0      15131174.0
  0.005000       10921.891    -5563.845         2023186.5      -1030746.7
  0.001000        8654.109     8577.409         1603172.1       1588986.6
  0.000500        8597.274     8578.330         1592468.1       1588895.1
  0.000100        8581.619     8580.889         1589966.0       1589648.0
  0.000050        8570.862     8567.276         1586914.3       1585388.3
  0.000010        8591.461     8591.525         1678466.9       1684824.6
  0.000005        8591.461     8591.461         1525879.0       1831054.8
  0.000001        8182.526     8012.136         7629394.5      17801920.0

value of h is two orders of magnitude lower for the second case. If the derivative is required
at 1.5708, then the optimum value of h would be less than ℏ and it may not be possible to
use that spacing, since the computed value of x + h will be equal to x.

The method of undetermined coefficients proves to be very useful in de-


riving formulae for derivatives, as required by the finite difference methods for
solution of differential equations. This method can also be applied to a function
of several variables {5}. A number of formulae for numerical differentiation of a
function of one or more variables are listed in Abramowitz and Stegun (1974).

5.3 Extrapolation Method


In Section 5.1 we have seen that every formula for numerical differentiation
has an associated optimum value of spacing which gives the minimum error.
We can use this spacing only if the function can be evaluated at any required
point, which may not be possible if the function is available only in the form
of a table of values. Of course, we can use interpolation to get the value at any
required point, but that is not going to add any new information to the table,
and the result obtained using such points cannot be more accurate than the
best estimate obtained using only the tabular points. On the other hand, if the
function is given in a closed form, or in some other form which can be directly
evaluated at any required point, then we can easily use the optimum value of
spacing in the formulae. In such a situation, we can do even better, by using
several different values of the spacing. If there were no roundoff error, we know
that the error tends to zero as the spacing h tends to zero. Hence, using the
approximate value of derivative for a few values of h, we can try to extrapolate
the result to h = 0. We can use polynomial extrapolation considered in the
previous chapter for this purpose, but we can do much better if we have more
information on the behaviour of truncation error with h. The basic idea here
is that, if we are able to get a satisfactory approximation using comparatively
larger values of h, then the roundoff error will not be significant and we should
be able to get a much better accuracy.
Consider Taylor series for the function

f(x + h) = f(x) + h f'(x) + \frac{h^2}{2!} f''(x) + \frac{h^3}{3!} f'''(x) + \cdots,
                                                                              (5.31)
f(x - h) = f(x) - h f'(x) + \frac{h^2}{2!} f''(x) - \frac{h^3}{3!} f'''(x) + \cdots.

Using these series expansions, we can easily obtain the following results

f'(x) = \frac{f(x + h) - f(x - h)}{2h} - \sum_{i=1}^{\infty} \frac{f^{(2i+1)}(x)}{(2i + 1)!} h^{2i},
                                                                              (5.32)
f''(x) = \frac{f(x + h) - 2 f(x) + f(x - h)}{h^2} - \sum_{i=1}^{\infty} \frac{2 f^{(2i+2)}(x)}{(2i + 2)!} h^{2i}.

The first terms on right-hand sides give formulae for calculating the first and
second derivatives, while the summation gives the error term for these formulae.

It can be seen that in both cases, the truncation error can be expressed as a
power series in h^2. In either case, as h → 0, the first term gives the required
derivative. However, as we have seen, in actual practice this limit does not work,
because the roundoff error keeps increasing as h decreases.
To avoid this difficulty due to roundoff error we can compute an approxi-
mation to the derivative using several different values of h and then extrapolate
the result to h = 0. One such technique is the Richardson's h → 0 extrapola-
tion, which is applicable whenever the truncation error can be expressed as an
asymptotic series in some parameter h.
Since the Richardson's extrapolation is widely applicable in numerical
methods, we consider the general case, where the truncation error has an asymp-
totic expansion of the form

E(h) = \sum_{j=1}^{\infty} a_j h^{\gamma_j}, \qquad \gamma_1 < \gamma_2 < \cdots    (5.33)

Here h could be any suitable parameter which characterises the formula used
for computing the approximation to a functional φ = φ(f) of a given function
f, the only requirement being that as h → 0, E(h) → 0. Further, we assume
that the powers γ_j in (5.33) are known, but the coefficients a_j are not known.
If we compute a series of approximations φ_i for i = 1, 2, ..., n, where φ_i
is the calculated value using h = h_i, then the true value φ is given by

\phi = \phi_i + \sum_{j=1}^{\infty} a_j h_i^{\gamma_j}, \qquad (i = 1, \ldots, n).    (5.34)

Using this set of n equations it is possible to eliminate a_j, (j = 1, ..., n-1),
to get a better approximation for φ. For example, using the first two equations
we can write

\phi_{12} = \phi_2 + \frac{\phi_2 - \phi_1}{(h_1/h_2)^{\gamma_1} - 1}.    (5.35)

This process can be continued by eliminating more terms. It is simpler to con-
sider the special case where h_i = p^{i-1} h_1 and γ_j = jγ. Here γ and p are con-
stants. More general cases are discussed in {13} and {15}. The choice of h is in
our hand, while the error expansion will depend on the formula used. In most
situations, it is possible to use this special case. In particular, for numerical
differentiation, using (5.32), we get γ_j = 2j. With these simplifications, (5.35)
can be written as

\phi_{12} = \phi_2 + \frac{\phi_2 - \phi_1}{p^{-\gamma} - 1}.    (5.36)

Here φ_{12} is the new approximation to φ, and the truncation error is once again of
the form (5.33), but with the first term missing. Thus, this new approximation

will be more accurate and the sequence φ_{i,i+1} converges faster to the correct
value. The process can be repeated with φ_{i,i+1} to remove one more term from
the error expansion.
To formalise this procedure we denote φ_i by T^i_0, and generate a triangular
array of approximations T^i_m by the formula

T^i_m = \frac{T^i_{m-1} - p^{m\gamma} T^{i-1}_{m-1}}{1 - p^{m\gamma}}
      = T^i_{m-1} + \frac{T^i_{m-1} - T^{i-1}_{m-1}}{p^{-m\gamma} - 1}, \qquad
      i = 1, 2, \ldots;\ m = 1, 2, \ldots, i-1.    (5.37)

Here the second form is preferred because of better roundoff properties, which
arises because the second term involving roundoff error is treated as a small
correction to the first term. It can be shown that the leading term in the error
expansion for T^i_m will be of the order of h_i^{(m+1)γ}. It may be noted that our
definition of T^i_m is slightly different from that used in some of the books. Here
T^i_m is defined such that all the elements T^i_j, (j = 1, ..., i-1) in the last row can
be calculated when the approximation T^i_0 is added to the table. The resulting
table of approximations is sometimes referred to as the T-table.
At any stage, we expect the diagonal element T^i_{i-1} to give the best ap-
proximation. If there is no roundoff error, then each of the columns in the
T-table should converge to the correct value, with the higher order columns
converging faster. However, in practice because of roundoff error, there may be
no improvement in the result after a certain stage, and the values may start
oscillating about the true value. In fact, the error may even start increasing,
and the values in higher columns may not show any improvement because of
roundoff error. Hence, in practice it is probably safer to consider only the first
few columns of the T-table.
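The construction of the T-table can be sketched in a few lines (an illustrative Python sketch, not one of the book's Fortran routines; it assumes the special case h_i = p^{i-1} h_1 and γ_j = jγ of (5.37), and uses the second, better-conditioned form of the recurrence).

    import numpy as np

    def richardson_table(phi, h1, p=0.5, gamma=2, n=6):
        """Build the triangular T-table for Richardson h -> 0 extrapolation.
        phi(h) is the basic approximation (e.g. a central difference) computed
        with spacing h_i = p**(i-1) * h1; each new column removes one more term
        of the assumed error expansion sum_j a_j h**(j*gamma)."""
        T = []                                   # T[i][m] corresponds to T^i_m
        for i in range(n):
            row = [phi(h1 * p**i)]
            for m in range(1, i + 1):
                cur, prev = row[m - 1], T[i - 1][m - 1]
                row.append(cur + (cur - prev) / (p**(-m * gamma) - 1.0))
            T.append(row)
        return T

    # first derivative of exp(x) at x = 0 (exact value 1); with p = 0.5 the
    # first column reproduces the three-point difference values of Table 5.3
    d = lambda h: (np.exp(h) - np.exp(-h)) / (2 * h)
    for row in richardson_table(d, h1=1.0):
        print(["%.8f" % t for t in row])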
EXAMPLE 5.2: Calculate the first derivatives of the following functions using the Richard-
son's h → 0 extrapolation, and compare the results with actual values:

(i) e^x at x = 0,    (ii) tan x at x = 1.56.    (5.38)

Starting with a spacing h = 1, we calculate a series of approximations to the first


derivative, with the spacing being reduced by a factor of two at every step. From these
results, we can construct the table of values giving T^i_m using (5.37). The results are displayed
in Tables 5.3 and 5.4. For the very well behaved function eX, it can be seen that each column in
the table converges to the actual value of unity. The first column which is the value obtained
using the three-point formula, converges very close to the true value for h ~ 0.00195, and if h
is reduced further , the error will increase (see Example 5.1). As expected the second column
converges much faster and converges to the essentially correct value for h ~ 0.031. Similarly,
the subsequent columns converge even faster. It may be noted that the second column in the
T-table has the same order of accuracy as the five-point formula (5.18), and we may expect
the optimum value of h for this column to be ~ 0.08. However, from Table 5.3 it is clear that
the error does not increase significantly, even at much smaller spacing. This is probably due
to the fact that the second form in (5.37) is used to construct the T-table, which restricts the
roundoff error to roughly the value expected for the first column. As a result, even in higher
columns the roundoff error is fairly small.
For the function tan x, since the derivative is required close to a singularity, the results
are not so good. As explained in Example 5.1, the initial values are completely off, but even
then the approximation ultimately converges towards the correct value. Because of these
reasons, the initial spacing was chosen to be 0.25. In this case, it can be seen that the

Table 5.3: Calculating the first derivative of eX at x = 0 using h ---+ 0 extrapolation

h               T^i_0        T^i_1        T^i_2        T^i_3        T^i_4        T^i_5

1.00000000 1.1752010
.50000000 1.0421910 .9978537
.25000000 1.0104490 .9998689 1.0000030
.12500000 1.0026060 .9999918 .9999999 .9999999
.06250000 1.0006510 .9999995 1.0000000 1.0000000 1.0000000
.03125000 1.0001630 .9999999 1.0000000 1.0000000 1.0000000 1.0000000
.01562500 1.0000410 .9999999 .9999999 .9999999 .9999999 .9999999
.00781250 1.0000100 .9999999 .9999999 .9999999 .9999999 .9999999
.00390625 1.0000030 .9999999 .9999999 .9999999 .9999999 .9999999
.00195313 1.0000010 .9999999 .9999999 .9999999 .9999999 .9999999

first column has not yet converged close to the correct value of 8579.556, but the second
and subsequent columns have indeed converged close to the correct value. It may be noted
that these columns appear to converge to a value which is slightly different from the true
value. Interestingly, if one considers the computer representation of 1.56 and calculates the
derivative at this rounded value, it turns out to be 8579.465, which is very close to the value
obtained in Table 5.4. Thus most of the error is due to rounding of the required value of x.
This arises because to evaluate the trigonometric function at x = 1.56, computer will first
transform the argument to 7r /2 - x, which will result in loss of two significant figures. Thus
there is a significant roundoff error in evaluating the function and its derivative at this point.
Comparing these results with those in Example 5.1, it can be seen that the use of Richardson's
extrapolation enables us to achieve a higher accuracy as compared to what was possible using
simple difference formulae. It may be noted that the columns beyond T3 show no improvement
over the previous column, which is clearly due to roundoff errors. The second derivative can
also be computed to better accuracy using extrapolation technique as compared to results
obtained in Example 5.1. For tan (x) the second derivative tends to a value of 1589300, which
is closer to the correct value of 1589286, than the values in Example 5.1. Once again columns
beyond T3 show no improvement. For tan x, the use of rational function extrapolation gives
much better results {12}.

Table 5.4: Calculating the first derivative of tan x at x = 1.56 using h ---+ 0 extrapolation

h               T^i_0        T^i_1        T^i_2        T^i_3        T^i_4        T^i_5

.25000000 -15.695
.12500000 -64.147 -80.298
.06250000 -263.541 -330.005 -346.652
.03125000 -1162.457 -1462.095 -1537.568 -1556.471
.01562500 -7837.965 -10063.130 -10636.540 -10780.960 -10817.140
.00781250 18009.670 26625.550 29071.460 29701.750 29860.500 29900.270
.00390625 9871.697 7159.040 5861.272 5492.857 5397.920 5374.007
.00195313 8869.732 8535.744 8627.524 8671.434 8683.898 8687.110
.00097656 8650.236 8577.071 8579.826 8579.069 8578.707 8578.604
.00048828 8597.050 8579.321 8579.472 8579.466 8579.468 8579.469

It can be seen from this example that with the extrapolation method, it
is possible to get high accuracy using comparatively large values of h, which in
turn reduces the roundoff error. Here we have reduced h by a factor of two at
every step, which is probably a large change. As a result, h values become very
low by the time higher columns are filled. It may be better to use a smaller
factor, or alternately, to use some other sequence of values for hi, which is
not in geometric progression. An implementation of this process is provided
by function DRVT in Appendix B. In this routine, h is reduced by a factor of
HFAC = 1.5 at every stage, and only NCMAX = 6 columns are constructed.
To check for convergence, we use a simple test, where the last element in the
T-table, i.e., T^i_j with j = min(i-1, NCMAX), is compared with the value at the
previous step as well as with the value just to the left of it, i.e., T^i_{j-1}. If either
of these differences is less than the specified error parameter, then the process is
presumed to have converged. If very high accuracy is demanded, then this test
may never be satisfied, because of roundoff errors. Hence, an attempt is also
made to detect such situations and stop the computation before the roundoff
error starts dominating. For this purpose, the difference used in the convergence
test at any step is compared with its value at the previous step. If this difference
has increased by a factor of two or more, we assume that the roundoff errors
have become significant and no further improvement may be possible. In such
cases, the computation is stopped and the previous value is returned as the
best approximation to the required derivative. This may not be the best test
for roundoff errors, but it has been found to work in some examples which have
been tried.
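The strategy just described can be sketched as follows (an illustrative Python version of the same logic; the actual DRVT routine in Appendix B is a Fortran program and differs in detail, so the function name, tolerance handling and termination rules here are assumptions of this sketch rather than the routine itself).

    import numpy as np

    def drvt_like(f, x, h0=0.1, hfac=1.5, ncmax=6, reps=1e-7, nmax=15):
        """Central-difference derivative with Richardson extrapolation, limited
        to ncmax columns; h is reduced by hfac at every step and the process is
        stopped on convergence or when the differences start growing."""
        T, err_prev, best, h = [], None, None, h0
        for i in range(nmax):
            row = [(f(x + h) - f(x - h)) / (2 * h)]
            for m in range(1, min(i, ncmax) + 1):
                cur, prev = row[m - 1], T[i - 1][m - 1]
                row.append(cur + (cur - prev) / (hfac**(2 * m) - 1.0))
            T.append(row)
            h /= hfac
            if i == 0:
                best = row[0]
                continue
            # compare the last element with its left neighbour and with the
            # last element of the previous row
            err = min(abs(row[-1] - row[-2]), abs(row[-1] - T[i - 1][-1]))
            if err < reps * max(abs(row[-1]), 1e-300):
                return row[-1]                    # convergence test satisfied
            if err_prev is not None and err > 2 * err_prev:
                return best                       # roundoff apparently dominating
            best, err_prev = row[-1], err
        return best

    print(drvt_like(np.sin, 0.5), np.cos(0.5))    # compare with the exact value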

Bibliography
Abramowitz, M. and Stegun, I. A. (1974): Handbook of Mathematical Functions, With
Formulas, Graphs, and Mathematical Tables, Dover, New York.
Dahlquist, G. and Bjorck, A. (2003): Numerical Methods, Dover, New York.
Joyce, D. C. (1971): Survey of Extrapolation Processes in Numerical Analysis, SIAM Rev.,
13, 435.
Kopal, Z. (1961): Numerical Analysis, (2nd ed.), John Wiley, New York.
Ralston, A. and Rabinowitz, P. (2001): A First Course in Numerical Analysis, (2nd Ed.)
Dover.
Rice, J. R. (1992): Numerical Methods, Software, and Analysis, (2nd Ed.), Academic Press,
New York.

Exercises
1. Derive the difference formulae (5.20) and (5.25) using, (i) the method of undetermined
coefficients, (ii) by differentiating the Newton's divided difference formula, and (iii) by
differentiating the Lagrangian interpolation formula. Which of these methods is the most
convenient?
2. Derive the difference formulae (5.18-5.29), and estimate the bound on roundoff error
in evaluating the derivatives using these formulae. From the error estimates, obtain the
optimum value of h and the corresponding error bound for each of these formulae.

3. Given a table of ln x with a uniform spacing of 0.01 for 1 ≤ x ≤ 2, it is required to
calculate the first and second derivatives at x = 1.1,1.2, ... , 1.9. Assuming that the table
is accurate to seven figures, which is the best formula that can be employed and what is
the spacing that should be used?
4. Using the method of undetermined coefficients, derive the following formulae for numer-
   ical differentiation:

   f'(x) = \frac{f(x - \tfrac{3}{2}h) - 27 f(x - \tfrac{1}{2}h) + 27 f(x + \tfrac{1}{2}h) - f(x + \tfrac{3}{2}h)}{24h} + \frac{3h^4}{640} f^{(5)}(\xi),

   f''(x) = \frac{f(x - \tfrac{3}{2}h) - f(x - \tfrac{1}{2}h) - f(x + \tfrac{1}{2}h) + f(x + \tfrac{3}{2}h)}{2h^2} - \frac{5h^2}{24} f^{(4)}(\xi),

   f'''(x) = \frac{-f(x - \tfrac{3}{2}h) + 3 f(x - \tfrac{1}{2}h) - 3 f(x + \tfrac{1}{2}h) + f(x + \tfrac{3}{2}h)}{h^3} - \frac{h^2}{8} f^{(5)}(\xi).

   Find the optimum value of h and the corresponding error bound for each of these formulae.
5. Using the method of undetermined coefficients, derive the following formulae for numer-
   ical differentiation of a function of two variables:

   \frac{\partial^2 f(x,y)}{\partial x \partial y} = \frac{f(x+h, y+k) + f(x-h, y-k) - f(x+h, y-k) - f(x-h, y+k)}{4hk}
      - \frac{h^2}{6} \frac{\partial^4 f}{\partial x^3 \partial y} - \frac{k^2}{6} \frac{\partial^4 f}{\partial x \partial y^3},

   \frac{\partial^4 f(x,y)}{\partial x^2 \partial y^2} = \frac{s_2 - 2 s_1 + 4 f(x,y)}{h^2 k^2}
      - \frac{h^2}{12} \frac{\partial^6 f}{\partial x^4 \partial y^2} - \frac{k^2}{12} \frac{\partial^6 f}{\partial x^2 \partial y^4},

   where
   s_1 = f(x+h, y) + f(x-h, y) + f(x, y+k) + f(x, y-k),
   s_2 = f(x+h, y+k) + f(x-h, y-k) + f(x+h, y-k) + f(x-h, y+k).
6. Evaluate the first and second derivatives of the following functions, by differentiating the
   interpolating polynomials using tables with specified uniform spacing:
   (i) sin x at x = 44.4444° using h = 50°, 10°, 5°, 1°
   (ii) e^x at x = 0.111 using h = 2.0, 1.0, 0.5, 0.1
   (iii) √x at x = 0.0121 using h = 0.1, 0.01, 0.001
   Use Newton's formula based on 2, 4, 6, ..., 20 points and compare the results with actual
   values. To study the effect of errors in the tabulated values on differentiation, multiply
   each value of the tabulated function by (1 + εR), where R is a random number in the interval
   (-0.5, 0.5), and try ε = 10^{-5} and 10^{-3}. Repeat the above calculation using these values.
7. Estimate the first and second derivatives of sin x at x = 0, π/2 and of 1/x at x =
   1, 0.001, using the three-point and five-point formulae with uniform spacing h = 2^{-n},
   (n = 0, 1, 2, ..., 30). To study the effect of roundoff errors on these calculations, multiply
   each of the function values by (1 + 10^{-5}R), (where R is a random number in the interval
   (-0.5, 0.5)), before calculating the derivatives.
8. Investigate the sensitivity of the optimal h formula (5.15). Select ℏ for the computer used
   and apply formula (5.10) for h = f·h_opt, where f = 1.25, 0.8, 2, 0.5, 5, 0.2, 10, 0.1, 100, 0.01.
   As test cases, use the functions e^x at x = 1.11 and (sin x - x)/x^3 at x = 0.01. How sensitive
   is the actual error to the value of h?
9. Using the table of values in {4.22} obtain the first derivative of vapour pressure with
   respect to temperature at T = 5, 20, 35°C. How will you estimate the accuracy of the
   computations? Try the problem using cubic spline and compare the results.
10. Given a table of √x with a uniform spacing of 0.01 for 0 ≤ x ≤ 1, it is required to calculate
    the second derivatives at x = 0.1, 0.2, ..., 0.9. Assuming that the table is accurate to six
    decimal figures, what is the best spacing to be used for each of these points using the
    5-point formula (5.20)? Estimate the resulting total error at each of these 9 points.
11. Instead of the Richardson's extrapolation we can treat the initial approximations φ(h_i)
    as a function of h_i^γ (assuming the special case γ_j = jγ for the error expansion). Using
    the table of values for this function, we can perform a polynomial extrapolation, using
    the Newton's divided difference formula. Show that if the points are added in decreasing
    order of h^γ, then the result is exactly identical to what we have derived in the text.
12. In the previous problem, instead of polynomial extrapolation, we can use rational function
extrapolation. In this case, the result will be different from the Richardson's extrapola-
tion. Repeat Example 5.2, using rational function extrapolation and compare the results
with Tables 5.3 and 5.4.
13. The polynomial extrapolation can also be used in the case when the successive values of h
    are not in geometric progression, but are in any decreasing sequence of values. Consider
    the special case when the powers γ_j in the asymptotic error expansion satisfy γ_j = jγ. Using
    polynomial extrapolation, show that in this case the T-table can be constructed using
    the following relation

    T^i_m = T^i_{m-1} + \frac{T^i_{m-1} - T^{i-1}_{m-1}}{(h_{i-m}/h_i)^{\gamma} - 1}.

14. If the required function does not have a Taylor series expansion on one side of the given
    point, then it is not possible to use (5.32) to evaluate the derivatives. For example, if we
    have to evaluate the first derivative of cos(sin x/√x) at x = 0, then since the function
    exists on one side of the given point only, it is not possible to use a central difference
    formula. In this case, the Richardson's extrapolation can be used with a forward formula,
    for example

    f'(x) = \frac{f(x + h) - f(x)}{h} - \sum_{i=1}^{\infty} \frac{f^{(i+1)}(x)}{(i + 1)!} h^i.

    Use such a procedure to compute the first derivative and compare the results with the
    exact value.
15. In Richardson's h → 0 extrapolation consider the special case when h_{i+1} = p h_i, where
    p is a constant. Then show that the T-table can be constructed using the recurrence
    relation

    T^i_m = \frac{T^i_{m-1} - p^{\gamma_m} T^{i-1}_{m-1}}{1 - p^{\gamma_m}}
          = T^i_{m-1} + \frac{T^i_{m-1} - T^{i-1}_{m-1}}{p^{-\gamma_m} - 1}.

16. Using a table of values for the function f(x, y) = e^x sin y with uniform spacing in x and y of
    0.05, calculate the derivatives ∂f/∂x, ∂f/∂y, ∂²f/∂x², ∂²f/∂y², ∂²f/∂x∂y at (0.5, 0.5)
    and compare them with the exact values.
Chapter 6

Integration

Numerical integration is one of the oldest and most fascinating topics in numer-
ical analysis. Methods of numerical integration date back to Archimedes who
tried to calculate the area of a circle, which was long before the integral calcu-
lus itself was formulated. Numerical integration is, paradoxically, both simple
and difficult. It is simple in that it can often be successfully accomplished by
the simplest of methods. It is difficult in that for some problems it may require
an almost infinite amount of computer time. Unlike differentiation, integration is
a numerically stable process and in general there is no difficulty in computing
an integral to any required accuracy permitted by the computer arithmetic or
available data. In contrast to the previous Chapter, most of this Chapter is
devoted to the situation where the integrand can be evaluated at any required
point. Numerical integration of functions provided in form of table of values is
dealt with only briefly in Section 6.1. Numerical integration in one dimension is
usually referred to as quadrature. Because of the simplicity of the problem and
its practical value, innumerable formulae have been developed for quadrature.
Since the computers came into existence, the emphasis in numerical integration
has shifted from quadrature rules, to evaluation of multiple integrals, and to
development of algorithms for automatic integration.
Numerical integration is usually utilised when analytic techniques fail.
Even if the indefinite integral of the function is available in a closed form, it may
involve some special functions, which cannot be computed easily. In such cases
also we can think of using numerical integration. However, it should be noted
that numerical integration can only give the numerical value of the integral with
given limits and other parameters. It is rather difficult to extract the functional
dependence of the integral on various parameters using numerical values. On the
other hand, analytic techniques usually give a good handle on the functional
dependence of integral on various parameters. Hence, numerical integration
cannot be a substitute for analytic techniques, it can only complement analysis.
Even if an integral cannot be evaluated analytically, some analysis will enable
us to choose the right method for numerical integration. Numerical integration

can provide satisfactory results, when it is used sensibly with proper controls.
Blind use of sophisticated computer programs for numerical integration can
lead to serious error, and the programmer may not even notice it. Whenever
possible, a problem should be analysed and put into a proper form before trying
it on a computer.
In this chapter, we start with the classical Newton-Cotes quadrature for-
mulae, and then consider the extrapolation and acceleration methods. These
methods are fairly effective for simple integrals which are usually encountered.
In Section 6.3, we consider the Gaussian quadrature formulae which are based
on the theory of orthogonal polynomials. They have proved to be quite effec-
tive when high accuracy is required. Most of these methods require inordinate
amount of computation, when the integrand or its derivative is singular or
nearly singular in the region of integration. Methods to deal with singulari-
ties are discussed in Section 6.6. In Section 6.7, we consider the problem of
automatic integration, which essentially means developing an algorithm which
should ideally be able to integrate any given function efficiently. Of course, in
practice, there are limitations. Nevertheless, adaptive integration routines have
proved to be very successful in evaluating a large fraction of integrals that are
encountered in scientific computations. The related problem of summation of
series is discussed in Section 6.8. Finally, the last few sections are devoted to the
problem of numerical integration in two or more dimensions, where we discuss
the so-called product rules, which are simple extensions of quadrature formu-
lae, as well as formulae specifically meant for multiple integration. When the
number of dimensions is large, the Monte Carlo method based on a sequence
of random numbers, or the method based on equidistributed sequences prove
to be quite effective.
In a finite amount of space that is available, it is not possible to dis-
cuss all techniques for numerical integration. Hence, we consider only some of
the techniques, while for more details the reader can refer to Davis and Rabi-
nowitz (2007), which also gives a number of Fortran programs for numerical
integration, as well as references to a large number of published algorithms.
Further, a complete collection of Fortran programs for numerical quadrature is
published in Piessens et al. (1983).
With a large number of algorithms available for numerical integration,
it may be somewhat difficult to choose the best technique for solving a given
problem. This choice is particularly difficult, since a simple transformation of
the integral or a simple modification of one of the techniques can improve its
performance significantly. An attempt will be made to give some guidelines
about the choice of quadrature formulae in Section 6.7. Apart from this, a
large number of examples are given to illustrate the effectiveness as well as
the limitations of various techniques. However, it should be noted that it is
impossible to give an infallible guideline. Hence, if confronted with a problem
involving a large number of integrations of similar type, it will be best to do
some experimentation with a few promising techniques, before deciding on the
best algorithm.

6.1 Newton-Cotes Quadrature Formulae


To approximate the integral of a given function, we can approximate the in-
tegrand by an interpolating polynomial and integrate this polynomial. Let us
take the Lagrangian interpolation polynomial

f(x) = \sum_{j=0}^{n} f(a_j)\, l_j(x) + E_n(x).    (6.1)

Integrating this function over a finite interval [a, b], we get

\int_a^b f(x)\, dx = \sum_{j=0}^{n} H_j f(a_j) + E_n.    (6.2)

Here H_j are called the weights for the corresponding abscissas a_j. It is clear
that these weights are independent of the function f(x), but depend only on
the abscissas a_j and the limits of integration. If the abscissas are uniformly
spaced, i.e., a_j = a_0 + jh, where h is a constant, then the quadrature formulae
are known as the Newton-Cotes quadrature formulae, and the weights H_j are
called Cotes numbers.
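For instance, the Cotes numbers can be generated numerically by demanding that the rule be exact for the monomials 1, x, ..., x^n. The sketch below (illustrative Python, not part of the book's routines; the name cotes_weights is an assumption of this sketch) does this for the closed formulae on uniformly spaced abscissas.

    import numpy as np

    def cotes_weights(n, a=0.0, b=1.0):
        """Weights H_j of the closed (n+1)-point Newton-Cotes rule on [a, b],
        obtained by requiring exactness for the monomials 1, x, ..., x**n."""
        x = np.linspace(a, b, n + 1)                    # uniformly spaced abscissas
        A = np.vander(x, n + 1, increasing=True).T      # A[k, j] = x_j**k
        moments = [(b**(k + 1) - a**(k + 1)) / (k + 1) for k in range(n + 1)]
        return np.linalg.solve(A, np.array(moments))

    print(cotes_weights(2))     # Simpson's rule on [0, 1]: [1/6, 4/6, 1/6]
    print(cotes_weights(4))     # Boole's rule on [0, 1]: [7, 32, 12, 32, 7] / 90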
Several types of Newton-Cotes formulae are possible, depending on the
location of the limits a and b with respect to ai's. Most of the important
formulae can be classified as the Newton-Cotes formulae of closed type, where
a = ao and b = an, and both the end points are abscissas of interpolating
polynomials. In thi~ ca~e, the truncation error can be written in the form

f(n+2) (1)) fa ( )
(n+2)! aon XPn X dx for n even;
{
En = (6.3)
f(n+I)(1)) an
(n+l)! Jao Pn(x) dx for n odd;

where ao < T] < an and Pn(x) = rr=o(x - ai). Thus for n odd, the Newton-
Cotes closed formula is exact for polynomials of degree less than or equal to n,
as expected. But for even values of n (i.e., odd number of points), we get an
extra bonus, since the formula is also exact for polynomials of degree (n + 1).
Such formulae are sometimes referred to as super-accurate.
Table of values of weights H j for various values of n are given in most
books on numerical analysis (e.g., Krylov, 2005). It can be proved that all Hj's
are positive for n ≤ 7 and n = 9, while for other values of n some of the weights
turn out to be negative. Later on we will see that the positivity of all weights
is a very important property of any quadrature formula.
Similarly, we can also get Newton-Cotes formulae of open type, where the
end points are not abscissas of interpolating polynomial. The only advantage

of these formulae over those of the closed type is when the function is difficult
to evaluate at the end point, or has an indeterminate form there. Various types
of open formulae are possible, depending on the distribution of abscissas with
respect to the limits. We can also have half-open, half-closed formulae, where
one of the end points is an abscissa of the interpolating polynomial. Obviously,
there is no limit to the number of quadrature formulae that can be obtained.
Let us consider some specific examples of quadrature formulae. We start
with the simplest case of n = 0, where the function is approximated by a
constant value
f(x) = f(a_0) + (x - a_0) f'(\xi),    (6.4)
and
\int_a^b f(x)\, dx = (b - a) f(a_0) + \int_a^b (x - a_0) f'(\xi)\, dx.    (6.5)

Now if we choose the point ao to be one of the end points (say a), then x - ao
is of constant sign throughout the interval and the truncation error can be
simplified by using the mean value theorem, to get

\int_a^b f(x)\, dx = (b - a) f(a) + \frac{f'(\eta)}{2}(b - a)^2,    (6.6)

where a < \eta < b. This formula is referred to as the rectangle rule (Figure 6.1),
and is obviously a half-open half-closed type formula. If we choose a_0 = (a + b)/2,
which is the midpoint of the interval, then we get the midpoint rule

\int_a^b f(x)\, dx = (b - a)\, f\!\left(\frac{a + b}{2}\right) + \frac{f''(\eta)}{24}(b - a)^3,    (6.7)

where a < \eta < b. It should be noted that this is a super-accurate formula,


since it is exact for all linear functions. It is obvious from Figure 6.1 that in
general, the midpoint rule is more accurate than the rectangle rule, since in
this case, the extra area included in the rectangle (top left corner in Figure 6.1)
compensates for the area not included to some extent.
Now let us consider n = 1, in which case, the function is approximated by
a straight line passing through the two given points, to get

f(x) = f(a_0) + f[a_0, a_1](x - a_0) + (x - a_0)(x - a_1)\frac{f''(\xi)}{2}.    (6.8)

Here if we take a_0 = a and a_1 = b, then (x - a_0)(x - a_1) is of constant sign in
(a, b) and once again the truncation error can be simplified by using the mean
value theorem to get

\int_a^b f(x)\, dx = \frac{b - a}{2}\left(f(a) + f(b)\right) - \frac{f''(\eta)}{12}(b - a)^3,    (6.9)

where a < \eta < b. This is the simplest Newton-Cotes formula of closed type
and is referred to as the trapezoidal rule, since in this case, the integral is

Rectangle Rule Midpoint Rule Trapezoidal Rule

Composite Rectangle Rule Composite Midpoint Rule Composite Trapezoidal Rule

Figure 6.1: Graphic illustration of the rectangle rule, midpoint rule and the
trapezoidal rule. In each case, the area of the shaded portion is the estimated value
of the integral.

approximated by the area of a trapezium (see Figure 6.1). It may be noted


that the truncation error in the trapezoidal rule is roughly two times that in
the midpoint rule. Further, if f''(x) does not change sign over the interval,
then the sign of the error will be opposite in the two cases. If f''(x) > 0 on
[a, b], then

(b - a)\, f\!\left(\frac{a + b}{2}\right) \le \int_a^b f(x)\, dx \le \frac{b - a}{2}\left(f(a) + f(b)\right),    (6.10)

which is known as the bracketing property. Similar pairs of rules can be found
with higher order accuracy also.
An open formula for n = 1 can be obtained by choosing a_0 = a + \tfrac{1}{3}(b - a),
a_1 = a + \tfrac{2}{3}(b - a), to get

\int_a^b f(x)\, dx = \frac{b - a}{2}\left(f(a_0) + f(a_1)\right) + \frac{(b - a)^3}{36} f''(\eta).    (6.11)

This formula is more accurate than the trapezoidal rule, but we can improve
on it by choosing a_0 = a + \tfrac{1}{4}(b - a) and a_1 = a + \tfrac{3}{4}(b - a), to obtain

\int_a^b f(x)\, dx = \frac{b - a}{2}\left(f(a_0) + f(a_1)\right) + \frac{(b - a)^3}{96} f''(\eta).    (6.12)

As will be seen later on, this formula is just the composite midpoint rule.
Now let us consider the case n = 2, where the function is approximated
by a parabola passing through the three given points. If the points are chosen
to be the two end points and the midpoint, then we get {1}

\int_a^b f(x)\, dx = \frac{h}{3}\left(f(a) + 4 f\!\left(\frac{a + b}{2}\right) + f(b)\right) - \frac{h^5}{90} f^{(4)}(\eta),    (6.13)

where h = (b - a)/2 is the constant spacing between successive points. This is the
well-known Simpson's rule, sometimes referred to as the Simpson's one-third
rule. It is a Newton-Cotes formula of closed type with even value of n. Hence, as
mentioned earlier it is super-accurate, being exact for all polynomials of degree
less than or equal to three.
Similarly, we can derive higher order Newton-Cotes formulae. Some of
these formulae are listed below. The notation used is as follows:
a_i = a + ih, \quad (i = 0, 1, \ldots, n), \qquad h = \frac{b - a}{n}, \qquad f_i = f(a_i).    (6.14)
Simpson's 3/8 rule (n = 3):

\int_a^b f(x)\, dx = \frac{3h}{8}\left(f_0 + 3 f_1 + 3 f_2 + f_3\right) - \frac{3}{80} h^5 f^{(4)}(\eta).    (6.15)

Boole's rule (n = 4):

\int_a^b f(x)\, dx = \frac{2h}{45}\left(7 f_0 + 32 f_1 + 12 f_2 + 32 f_3 + 7 f_4\right) - \frac{8}{945} h^7 f^{(6)}(\eta).    (6.16)

Newton-Cotes 7-point rule (n = 6):

\int_a^b f(x)\, dx = \frac{h}{140}\left(41 f_0 + 216 f_1 + 27 f_2 + 272 f_3 + 27 f_4 + 216 f_5 + 41 f_6\right)
                     - \frac{9}{1400} h^9 f^{(8)}(\eta).    (6.17)
It may be noted that the error in Simpson's 3/8 rule is of the same order as that
in the Simpson's 1/3 rule. As noted earlier, the error in a (2n + 1)-point Newton-
Cotes closed formula is of the same order as that in the (2n + 2)-point formula.
Hence, there is seldom any use for the formulae with an even number of points. But
if the function is available in the form of a table with an even number of points, then it
may be necessary to use Simpson's 3/8 rule over some part. It may be noted
that all these formulae can also be derived using the method of undetermined
coefficients described in Section 5.2.

It can be seen that the truncation error in these quadrature formulae can be
written in the form K(b - a)^m f^{(m-1)}(\eta), where K is some constant independent
of the function f(x). Further, the value of K decreases as the number of points
increases. But if |b - a| > 1, or f^{(m)}(x) increases with m, the error may not
decrease on increasing the number of points. It can be proved that the n-
point Newton-Cotes formula converges to the value of the integral as n → ∞,
provided the integrand is an analytic function that is regular in a sufficiently
large region of the complex plane containing the interval of integration. Let us
consider an ellipse whose centre is at (a + b)/2, whose major axis lies on the
x-axis, and whose semi-major and semi-minor axes have lengths i(b - a) and
~ (b - a), respectively. Davis has shown that if f (z) is regular inside this ellipse,
then the Newton-Cotes formulae converge to the integral as n → ∞. It may be
noted that this is a sufficient condition for convergence, and the process may
converge, even if the function has some singularity inside the ellipse, but in most
such cases, convergence (if any) would be quite slow. Hence, the accuracy of an
n-point Newton-Cotes formula does not necessarily increase as n increases, even
though f(x) possesses continuous derivatives of all orders for all real values of x,
and even when there is no roundoff error involved in calculation. As mentioned
in Section 4.1, high degree interpolating polynomials can have wild oscillations,
and it is not advisable to use Newton-Cotes formula of very high order.
So what should we do to achieve better accuracy? Obviously, we can break
up the integral into several parts by writing

\int_a^b f(x)\, dx = \sum_{i=1}^{m} \int_{a_{i-1}}^{a_i} f(x)\, dx.    (6.18)

For convenience, we can assume the a_i to be equally spaced, i.e., a_i = a_0 + ih,
h = (b - a)/m, a_0 = a and a_m = b. Now, if we apply the quadrature formula
on each of these subintervals, then the value of (b - a) for each of them would
be reduced and we can hope to get better results. The resulting formulae are
known as composite formulae (sometimes referred to as compound rules in the
literature). By applying the trapezoidal rule on each of the m subintervals, we
get the following composite formula (see Figure 6.1)

\int_a^b f(x)\, dx = h\left(\tfrac{1}{2} f(a_0) + f(a_1) + \cdots + f(a_{m-1}) + \tfrac{1}{2} f(a_m)\right) - \frac{(b - a) h^2}{12} f''(\eta).    (6.19)

Thus, although there are m subintervals, each one requiring two points, we
require only m + 1 points for evaluating the integral. A similar formula can be
obtained for the Simpson's rule. In any Newton-Cotes closed formula based on
m subintervals, there will be a saving of m - 1 points, because the m - 1 interior
points are common to two subintervals.
By looking at the truncation error in (6.19) it is obvious that if the cor-
responding derivative is bounded, we can get arbitrary accuracy by using suf-
ficiently large value of m. In practice, we can start with m = 1 and go on

subdividing each subinterval into two at every step. This subdivision is very
convenient, since in that case, we can make use of the function values which are
already calculated. The process of subdivision can continue till a satisfactory
accuracy is achieved, or when the results do not improve any more.
For Simpson's rule, the composite formula can be written as

\int_a^b f(x)\, dx = \frac{h}{3}\left(f(a) + f(b) + 4\sum_{i=1}^{m/2} f(a_{2i-1}) + 2\sum_{i=1}^{m/2-1} f(a_{2i})\right) - \frac{(b - a)h^4}{180} f^{(4)}(\eta)
                   = \frac{h}{3}\left(S_{\rm end} + 4 S_{\rm odd} + 2 S_{\rm even}\right) - \frac{(b - a)h^4}{180} f^{(4)}(\eta).    (6.20)

Here it is assumed that the number of subintervals on which the Simpson's rule
is applied is m/2. Since each subinterval should have three points including
the end points, the total number of points used is m + 1 with m even, and
h = (b - a)/m is the uniform spacing between two successive points. If we now
subdivide each subinterval into two, obviously all the old points become the
even points. Hence, S'_even = S_even + S_odd, while S'_odd will be the sum of function
values at all the new points. Here, S'_even and S'_odd are the new values of S_even
and S_odd, respectively. Hence, it is not necessary to store all the function values
separately. Subroutine SIMSON in Appendix B provides an implementation of
this algorithm.
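The successive-halving strategy can be sketched as follows (illustrative Python; the book's SIMSON routine is a Fortran implementation of the same idea, and the function name and convergence test used here are assumptions of this sketch, not the routine itself).

    def simpson_halving(f, a, b, reps=1e-8, kmax=20):
        """Composite Simpson's rule with successive halving of the subintervals.
        At every halving the old interior points become the 'even' points, so
        only the new midpoints have to be evaluated: S_even' = S_even + S_odd."""
        m = 2                                   # number of subintervals (even)
        h = (b - a) / m
        s_end = f(a) + f(b)
        s_odd = f(a + h)                        # midpoints of the current mesh
        s_even = 0.0
        old = h / 3.0 * (s_end + 4 * s_odd + 2 * s_even)
        for _ in range(kmax):
            m *= 2
            h *= 0.5
            s_even += s_odd                     # all old interior points become even
            s_odd = sum(f(a + (2 * j + 1) * h) for j in range(m // 2))
            new = h / 3.0 * (s_end + 4 * s_odd + 2 * s_even)
            if abs(new - old) < reps * max(abs(new), 1.0):
                return new
            old = new
        return old                              # convergence test not satisfied

    print(simpson_halving(lambda x: 4 / (1 + x * x), 0.0, 1.0))   # ~ pi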
Similarly, the midpoint rule gives the following composite formula

\int_a^b f(x)\, dx = h \sum_{k=1}^{m} f\!\left(a + (k - \tfrac{1}{2}) h\right) + \frac{(b - a) h^2}{24} f''(\eta).    (6.21)

This formula results from approximating f(x) by a piecewise constant step
function, with a jump at each point a_k = a + kh (see Figure 6.1). For a given
value of m, the error in this formula tends to be about half of that for the
trapezoidal rule. However, in this case, if the number of points is doubled, then
the new points will not include the previous ones and the entire calculation has
to be repeated. If each subinterval is subdivided into three subintervals, then it
will be possible to use all the old points, but in that case, the number of points
increases very rapidly.
The Riemann integral of a function is defined to be the limiting value of
the Riemann sum, defined by

S_n = \sum_{k=0}^{n} f(\xi_k)(r_{k+1} - r_k),    (6.22)

where

a = r_0 < r_1 < \cdots < r_{n+1} = b, \qquad r_k \le \xi_k \le r_{k+1}.    (6.23)

Any sequence of such sums, in which the subintervals are refined so that (r_{k+1} - r_k) →
0, tends to the Riemann integral of the function, if that exists. It can be easily
proved that the composite rectangle, trapezoidal, midpoint and Simpson's rule
approximate the integral by a Riemann sum. In fact, it can be proved that com-
posite rules formed by using any quadrature formula with positive weights form
a Riemann sum. Hence, each of these approximations assuredly converges to
the integral as the spacing h tends to zero if only the integral exists in the usual
Riemann sense, irrespective of whether or not f(x) is sufficiently differentiable
to ensure the respective rates of convergence. Actually, the convergence of Rie-
mann sum requires some mild restriction on continuity of the integrand (Davis
and Rabinowitz, 2007), which is violated in exceptional cases (see Example 6.6).
Thus, even if some or all of the derivatives are singular, these approximations
may converge to the actual value as h → 0, although in such cases, the con-
vergence could be very slow. In fact, even if the function itself is singular (and
the infinite value is ignored), these approximations may still converge to the
correct value, provided the limit can be taken correctly. This is probably the
most dangerous property of quadrature formulae, since it is tempting to misuse
these formulae, even for those functions, where the convergence is very slow. If
only one integral is required, the computer time spent may not be too large, but
usually the same strategy is applied for evaluating similar integrals a number
of times, thus resulting in enormous wastage of computer time. Singularities
in the function or its derivatives should be dealt with by methods described in
Section 6.6.
EXAMPLE 6.1: Evaluate the following integrals using the Newton-Cotes quadrature for-
mulae

(i) \int_0^1 \frac{4\, dx}{1 + x^2},  (ii) \int_{-10}^{10} \frac{dx}{1 + x^2},  (iii) \int_0^1 \sqrt{x}\, dx,  (iv) \int_0^1 \frac{dx}{100\, x^{0.99}}.    (6.24)

These integrals can be easily computed analytically, but they cover a wide range of
difficulties that can arise in numerical integration. The first two integrands as well as all
their derivatives are continuous over the entire interval of integration. The third function is
continuous, but has unbounded derivatives, while the last has an integrable singularity at x =
0. To compare the efficiencies of different quadrature formulae, we have chosen eight different
Newton-Cotes formulae, i.e., trapezoidal rule, midpoint rule, Simpson's 1/3 rule, Simpson's 3/8
rule, 5-point rule, 7-point rule, 9-point rule and 17-point rule (see Krylov, 2005 for the weights
in these formulae). The results obtained using 24-bit arithmetic are displayed in Table 6.1.
All these formulae are also used in the composite form and the total number of points used
in all cases is given in the first column. These results can be compared with the exact values
of the integrals.
For the first integral, the results converge very fast to the correct value and only about
10 points are required to achieve an accuracy of seven significant figures. Only the first two
columns corresponding to the trapezoidal and the midpoint rule show a somewhat slower
convergence. This is a typical result, when the integrand is analytic in a sufficiently large
neighbourhood of the interval of integration. In this case, there is no difficulty in obtain-
ing very accurate results, provided of course that arithmetic operations are also done using
comparable precision.
It can be seen that, even for the second integrand which is regular over the entire inter-
val, the n-point Newton-Cotes formula does not converge to the correct value as n increases.
This is because of the singularities in the complex plane at z = ±i, which are well within the
ellipse mentioned earlier. If the range of the integral is reduced to [0, 1], then the Newton-
Cotes formulae do converge as n → ∞. In composite rules, once the subintervals become

Table 6.1: Numerical integration using Newton-Cotes formulae

N        Trap.       Midpt       Sim.(1/3)   Sim.(3/8)   5-point     7-point     9-point     17-point

∫_0^1 4 dx/(1 + x²) ≈ 3.141593

2 3.000000 3.162353
3 3.100000 3.150849 3.133333
4 3.123077 3.146801 3.138462
5 3.131176 3.144926 3.141569 3.142118
7 3.136963 3.143294 3.141592 3.141583 3.141571
9 3.138988 3.142621 3.141593 3.141594 3.141593
17 3.140942 3.141881 3.141593 3.141593 3.141593 3.141589
25 3.141304 3.141726 3.141593 3.141593 3.141593 3.141593 3.141593

∫_{-10}^{10} dx/(1 + x²) ≈ 2.942255


2 .198020 .769231
3 10.099010 6.960064 13.399340
4 1.166924 1.553983 1.288037
5 5.434121 4.593666 3.879157 3.244478
7 4.063494 3.727284 5.029018 3.309055 6.828017
9 3.494052 3.338168 2.847362 2.778575 -1.051610
17 2.983245 2.972738 2.812976 2.810684 2.830657 -89.146650
49 2.942200 2.942283 2.941144 2.937070 2.944067 2.944876 2.944657 1.940812
97 2.942241 2.942262 2.942254 2.942222 2.942329 2.942241 2.942212 2.942271
193 2.942252 2.942258 2.942255 2.942255 2.942255 2.942255 2.942255 2.942252
385 2.942255 2.942256 2.942255 2.942255 2.942255 2.942255 2.942256 2.942252

∫_0^1 √x dx ≈ 0.6666667
2 .5000000 .6830127
3 .6035534 .6760754 .6380712
4 .6312823 .6729774 .6476926
5 .6432831 .6712801 .6565263 .6577566
7 .6536788 .6695294 .6611443 .6599445 .6622961
9 .6581302 .6686647 .6630793 .6635162 .6640488
17 .6635812 .6674632 .6653982 .6655528 .6657411 .6659116
49 .6660598 .6668355 .6664225 .6663694 .6664523 .6664736 .6664886 .6665208
1537 .6666635 .6666681 .6666655 .6666647 .6666656 .6666655 .6666656 .6666652
6145 .6666667 .6666670 .6666667 .6666666 .6666662 .6666667 .6666668 .6666659

∫_0^1 dx/(100 x^{0.99}) = 1
2 .0050000 .0263722
3 .0124309 .0302582 .0149079
4 .0165372 .0330246 .0179793
5 .0194015 .0351708 .0217251 .0221796
7 .0233977 .0384038 .0256845 .0247685 .0265882
9 .0262131 .0408141 .0284836 .0289341 .0297576
17 .0329491 .0468915 .0351944 .0356418 .0364596 .0374610
49 .0435179 .0569267 .0457359 .0448390 .0461784 .0466207 .0469872 .0479777
1537 .0760996 .0888693 .0782416 .0773753 .0786690 .0790963 .0794503 .0804071
6145 .0888192 .1014085 .0909318 .0900773 .0913531 .0917747 .0921239 .0930674

small enough to exclude the singular points from the associated ellipses, the results start
converging. In this case, the trapezoidal rule and the midpoint rule bracket the true value,
even though the second derivative changes sign inside the interval. Further, asymptotically
the error in the midpoint rule is about half that using the trapezoidal rule. Both these rules
converge rather slowly and about 200 function evaluations are required to get an accuracy of
seven decimal figures, after which the results do not improve because of roundoff errors. The
Simpson's rules converge somewhat faster, with the 1/3 rule being slightly more accurate than
the 3/8 rule. The higher order Newton-Cotes formulae will be more efficient if higher accuracy
is required. For example, the 5-point formula can give an accuracy of 14 decimal digits using
385 abscissas. The 17-point formula has a somewhat higher roundoff error and the results
do not converge to the seventh significant figure. If higher order Newton-Cotes formulae are
used, the roundoff errors will be larger. The problem of roundoff error in numerical integra-
tion will be discussed in Section 6.4. This example shows that for moderate accuracy, the use
of a very high order Newton-Cotes formula does not have any advantage over the Simpson's
1/3 rule. It may be noted that the integrand in the first two cases is essentially the same, and hence
extending the range has had a significant effect on the behaviour of quadrature formulae.
For the third function, which has unbounded derivatives at x = 0, the n-point Newton-
Cotes formulae appear to be slowly converging to the correct value as n → ∞. Further, in this
case, the results using all the rules considered have errors of the same order, with all columns
converging at the same rate of roughly 1/n^{1.5}, where n is the total number of points used
in the composite formulae. Hence, in this case, it is difficult to achieve very high accuracy
in a reasonable effort using the Newton-Cotes formulae. It is possible to get an accuracy of
seven significant figures using about 6000 points in any of the formulae, except the 17-point
formula for which the roundoff error does not permit such accuracy.
The last example is even worse and none of the values are anywhere near the actual
result. In this case, the function is singular and the function value at x = 0 is arbitrarily set
to zero in the quadrature formulae. It can be easily seen that if instead we had shifted the
lower limit to 10^{-100}, which will cause a difference of 0.1 in the exact value, the numerical
values would have been completely different, being of the order of 10^{97}/n, where n is the
total number of points used. This illustrates that the limit should not be shifted to avoid
singularity. It can be seen that the results are very slowly tending to the correct value, the rate
of convergence being roughly 1/n^{0.01}. Hence, it will require of the order of 10^{100} points to get
any reasonable accuracy. This is of course, an extreme example, for which most quadrature
formulae will fail miserably. It has been included here to illustrate the limitations of numerical
integration. It may be noted that if we consider the results using the trapezoidal rule, then the
difference between the values using 1537 and 6145 points is approximately 0.013 and hence
most routines for numerical integration may presume that an accuracy of that order has
been achieved. However, the actual error is roughly 80 times larger, which clearly illustrates
the limitation of the convergence tests, when the values are converging slowly. In fact, the
difference between values calculated using different quadrature formulae is also of the same
order. Hence, using several independent methods we get a result around 0.09, which is off
by an order of magnitude. This example clearly illustrates that even an agreement between
results obtained using apparently independent methods does not guarantee the correctness
of results. Of course, if we use an exponent of 0.9999 instead of 0.99 in the integrand, the
results will be much worse and the quadrature rules may indicate convergence to 10^{-4} when
the actual results are totally wrong. The problem here arises because the integrand is very close
to one for which the integral diverges.

In general, it is preferable to use Newton-Cotes formulae of closed type


rather than the open type formulae. Apart from the midpoint rule, other for-
mulae of open type have larger error as compared to the closed formula using
the same number of points. Further, in a composite rule the formulae of closed
type require one less function evaluation per subinterval, since the end points
are common to the adjoining subintervals. Open type formulae are useful for
obtaining predictor formulae for numerical solution of differential or integral

equations (see Sections 12.3 and 13.7). As noted earlier the order of accuracy
of a closed type formula with 2n abscissas, is the same as that with 2n - 1
abscissas and hence it is always preferable to use a closed formula with odd
number of points. For example, Simpson's 1/3 rule based on three points has
a truncation error of -\frac{(b-a)^5}{2880} f^{(4)}(\eta), while the Simpson's 3/8 rule based on
four points has a truncation error -\frac{(b-a)^5}{6480} f^{(4)}(\eta). Of course, the value of \eta
will be different in the two cases, but in general, the 3/8 rule appears to be
slightly better. However, it should be noted that the 3/8 rule requires one more
abscissa as compared to the 1/3 rule. Thus, in a composite rule requiring a total of
6m + 1 points, the truncation error would be -\frac{(b-a)}{180} h^4 f^{(4)}(\eta) for the 1/3 rule and
-\frac{(b-a)}{80} h^4 f^{(4)}(\eta) for the 3/8 rule, where h = (b - a)/6m in both cases. Thus,
clearly Simpson's 1/3 rule is superior to 3/8 rule and in general, there will be
no advantage in using Newton-Cotes formulae based on even number of points.
Nevertheless, the formulae with even number of points prove to be useful for
solution of integral equations (see Section 13.7) or for integrating tabulated
functions.
If the function is known only in the form of a table of values, it is conve-
nient to use the Newton-Cotes formulae. The composite trapezoidal rule is the
most convenient in such cases and is applicable, even when the table is not at
uniformly spaced points. If the table is available at uniform spacing, then it is
possible to use higher order formulae. Thus, if the number of points is odd, it is
convenient to use Simpson's 1/3 rule. If the number of points is even, then it is
not possible to use this rule directly, but in that case, we can apply Simpson's
3/8 rule to the first four points and then the Simpson's 1/3 rule can be applied
to the remaining set of points. It may be noted that both the rules have the
same order of accuracy. Similar devices can also be used for other higher order
Newton-Cotes formulae. However, if the function is known only in the form of
a table of values, it is not possible to ensure very high accuracy and no purpose
may be served by using higher order Newton-Cotes formulae. The truncation
error can be estimated by computing the integral using different quadrature
formulae.
Alternately, we can integrate the cubic spline approximation to the func-
tion, which also gives an accuracy of the order of h^4, where h is a typical spacing
in the table. This method can also be applied to tables with nonuniform spac-
ing. Using the results in Section 4.4, it can be shown that the cubic spline
approximation can be integrated to yield {3}
\int_a^b f(x)\, dx = \sum_{i=1}^{n} \frac{h_i}{2}\left(f(a_{i-1}) + f(a_i)\right) + \sum_{i=1}^{n} \frac{h_i^2}{12}\left(s_{i-1} - s_i\right),    (6.25)

where h_i = a_i - a_{i-1} and s_i is the first derivative of the spline at a_i.
Here we have assumed that a = a_0 and b = a_n. The first term in this formula is
the trapezoidal rule approximation to the integral, while the second term can
be treated as a correction to the trapezoidal rule value. This result can also be
obtained using the Euler-Maclaurin sum formula discussed in the next section.
If the spacing is uniform, then the second sum will "collapse" into just two
terms coming from the end points, similar to that in the Gregory formula {8}.
The B-spline expansion can also be integrated easily as explained in Section 4.5.
Interpolating B-splines with k = 4 will be identical to cubic spline. It is difficult
to get a formal expression for truncation error in integration using spline (or B-
spline), but we may expect an error of O(h^k) for B-splines of order k. B-spline
expansion can also be obtained by approximating a table of values using least
squares fit as explained in Section 10.2. Once the coefficients of expansion are
known, the integral over the required interval can be easily calculated.
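As an illustration of integrating a tabulated function through its cubic spline interpolant, the snippet below uses scipy.interpolate.CubicSpline with its default not-a-knot end conditions; this is an illustrative Python sketch and not necessarily the spline construction of Section 4.4 or formula (6.25).

    import numpy as np
    from scipy.interpolate import CubicSpline

    x = np.linspace(0.0, 1.0, 11)              # table of exp(x) with spacing 0.1
    y = np.exp(x)
    spl = CubicSpline(x, y)
    print(spl.integrate(0.0, 1.0), np.e - 1.0) # compare with the exact integral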

6.2 Extrapolation Methods


If the truncation error in a quadrature formula can be expressed as an asymp-
totic power series in the interval h between two successive points of a uniformly
spaced grid, we can apply the Richardson's h → 0 extrapolation to get higher
order accuracy (see Section 5.3). It turns out that the error in the composite trape-
zoidal rule can be expressed as an asymptotic series in h. This result follows
from the Euler-Maclaurin sum formula (Ralston and Rabinowitz, 2001)

\sum_{j=0}^{n} f(a_0 + jh) = \frac{1}{h} \int_{a_0}^{a_0+nh} f(x)\, dx + \frac{1}{2}\left(f(a_0 + nh) + f(a_0)\right)
   + \sum_{k=1}^{m} \frac{B_{2k}}{(2k)!}\, h^{2k-1}\left(f^{(2k-1)}(a_0 + nh) - f^{(2k-1)}(a_0)\right)    (6.26)
   + \frac{n h^{2m+2} B_{2m+2}}{(2m+2)!} f^{(2m+2)}(\eta),

where a_0 < \eta < a_0 + nh and B_k are the Bernoulli numbers, which are the
coefficients of t^k in the expansion

\frac{t}{e^t - 1} = \sum_{k=0}^{\infty} B_k \frac{t^k}{k!}.    (6.27)

The first few nonzero Bernoulli numbers are B_0 = 1, B_1 = -1/2, B_2 = 1/6,
B_4 = -1/30, B_6 = 1/42, B_8 = -1/30, B_{10} = 5/66, B_{12} = -691/2730, B_{14} =
7/6, B_{16} = -3617/510, B_{18} = 43867/798, B_{20} = -174611/330. In general

B_{2n} = (-1)^{n+1} \frac{2(2n)!}{(2\pi)^{2n}} \zeta(2n) \approx (-1)^{n+1}\, 4\sqrt{\pi n} \left(\frac{n}{\pi e}\right)^{2n},    (6.28)

where \zeta(n) is the Riemann zeta function. Here the last expression is the asymp-
totic value of B_{2n} for large values of n. Hence, for large n, B_{2n} increases ex-
ponentially with n. For example, B_{50} \approx 7.500866746 \times 10^{24}. Hence, (6.26) is
only an asymptotic series in h. After some rearrangements, this formula can be
written down as

\int_{a_0}^{a_0+nh} f(x)\, dx = I_T(h) + \sum_{i=1}^{m} c_i h^{2i} + O(h^{2m+2}),    (6.29)

where

I_T(h) = h\left(\frac{1}{2} f(a_0) + \sum_{j=1}^{n-1} f(a_0 + jh) + \frac{1}{2} f(a_0 + nh)\right),    (6.30)

is the approximation to the integral using the composite trapezoidal rule with n + 1
points, and the constants c_i depend only on a_0, a_0 + nh and the function f(x).
Thus, the error in the trapezoidal rule can be expressed as an asymptotic series in
h^2 and we can apply the Richardson's h → 0 extrapolation to accelerate the
convergence. The resulting algorithm is also referred to as Romberg integration.
From (6.26) it can be seen that if the integrand is periodic, with the function
value and derivatives being equal at the two end points, the error terms will
vanish and the trapezoidal rule can give very high accuracy.
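This behaviour for periodic integrands is easy to verify numerically; the following illustrative Python snippet applies the plain composite trapezoidal rule to e^{cos x} over one full period.

    import math

    def trap_periodic(f, a, b, n):
        # composite trapezoidal rule; since f(a) = f(b) the end-point terms of
        # the Euler-Maclaurin expansion cancel and convergence is very fast
        h = (b - a) / n
        return h * sum(f(a + j * h) for j in range(n))

    f = lambda x: math.exp(math.cos(x))
    for n in (4, 8, 16, 32):
        print(n, trap_periodic(f, 0.0, 2 * math.pi, n))
    # the value settles near 7.954926521012845 (= 2*pi*I_0(1)) already by n = 16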
It is most convenient to subdivide each interval into two every time, so
that all the old points can be used. If [a, b] is the interval over which the integral
is to be evaluated, we can take h_k = (b - a)/2^k, and define T^k_0 = I_T(h_k), which
constitutes the first column of the T-table. The second column of this table can
be obtained using (5.37)

T^k_1 = T^k_0 + \frac{T^k_0 - T^{k-1}_0}{3}, \qquad (k = 1, 2, \ldots).    (6.31)

The leading term in the error expansion of T^k_1 is of the order of h_k^4, which is the
same as that for the Simpson's rule. In fact, it can be easily shown that T^k_1 is
exactly identical to the approximation obtained using the composite Simpson's
rule based on the same set of points {5}. Similarly, it can be proved that the
third column of the T-table, i.e., T^k_2, is identical to the composite rule formed
by using the five-point Newton-Cotes formula. For m > 2, there is no direct
relation between T^k_m and the Newton-Cotes composite rules. The (m + 1)th
column can be calculated using the relation

T^k_m = T^k_{m-1} + \frac{T^k_{m-1} - T^{k-1}_{m-1}}{4^m - 1}, \qquad (k = m, m+1, \ldots;\ m = 1, 2, \ldots).    (6.32)

It can be shown that T^k_m has a leading term of the order of h_k^{2m+2} in its error
expansion.
It can be shown that all columns of the resulting T-table form a Riemann
sum and hence they converge to the value of the integral as k → ∞. Further,
the diagonal values also converge to the same limit. The main advantage of
Romberg integration over the Newton-Cotes formula of high order is that, in
this case all weights are positive and so the roundoff error can be controlled.
However, it should be noted that the extrapolation procedure will not always
lead to better results {9}. Thus, if successive values of T^k_m are oscillating about
the correct value of the integral, then T^k_{m+1} may be a worse approximation than
T^k_m itself. These oscillations can be due to the truncation error for large spac-
ings, or due to roundoff error when the number of points is large. Roundoff

error will usually increase as we go to the higher columns, and it may not be
advisable to proceed beyond the first few columns. For functions with bounded
derivatives, the convergence can be very fast, while for functions which do not
possess bounded derivatives of higher order, the convergence will not improve
beyond a few columns. If all derivatives up to f^{(2n+2)} are bounded, then the
results will keep improving for m ≤ n, but beyond that there may be no im-
provement.
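A compact sketch of Romberg integration along these lines is given below (illustrative Python, not one of the book's Fortran routines; the trapezoidal values are computed by successive halving so that old points are reused, and the simple convergence test is an assumption of this sketch).

    def romberg(f, a, b, kmax=12, reps=1e-10):
        """Trapezoidal rule with interval halving plus the extrapolation (6.32):
        T^k_m = T^k_{m-1} + (T^k_{m-1} - T^{k-1}_{m-1}) / (4**m - 1)."""
        h = b - a
        T = [[h * 0.5 * (f(a) + f(b))]]                 # T^0_0
        for k in range(1, kmax + 1):
            h *= 0.5
            # the new trapezoidal value reuses all old points; only midpoints are new
            mid = sum(f(a + (2 * j + 1) * h) for j in range(2 ** (k - 1)))
            row = [0.5 * T[k - 1][0] + h * mid]
            for m in range(1, k + 1):
                row.append(row[m - 1] + (row[m - 1] - T[k - 1][m - 1]) / (4**m - 1))
            T.append(row)
            if abs(row[-1] - T[k - 1][-1]) < reps * max(abs(row[-1]), 1.0):
                break
        return T[-1][-1]

    # integral (ii) of Example 6.1: 2*arctan(10) ~ 2.9422553
    print(romberg(lambda x: 1 / (1 + x * x), -10.0, 10.0))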
EXAMPLE 6.2: Evaluate the integral (ii) in Example 6.1 using the Romberg technique.

Table 6.2: Romberg integration for ∫_{-10}^{10} dx/(1 + x²) ≈ 2.942255

N        T^k_0       T^k_1       T^k_2       T^k_3       T^k_4       T^k_5       T^k_k

2 .198020
3 10.099010 13.399340
5 5.434121 3.879158 3.244479
9 3.494052 2.847362 2.778575 2.771180
17 2.983245 2.812977 2.810684 2.811194 2.811351
33 2.942398 2.928782 2.936502 2.938500 2.938999 2.939124
65 2.942223 2.942165 2.943057 2.943161 2.943180 2.943184 2.943185
129 2.942247 2.942255 2.942261 2.942248 2.942245 2.942244 2.942244
257 2.942254 2.942256 2.942256 2.942256 2.942256 2.942256 2.942256

The results are displayed in Table 6.2, where starting with one, the number of subin-
tervals for the composite trapezoidal rule are increased by a factor of two at every step. The
first three columns are identical to the results obtained using the trapezoidal rule, Simpson's
1/3 rule and the 5-point rule, respectively. All the columns as well as the diagonal ultimately
converge to the correct value. The first column converges rather slowly, while the higher
columns converge much faster. Once again, because of the moderate accuracy that is possible
using a 24-bit arithmetic, the columns beyond the second do not show any improvement
over the second column. But if more accurate arithmetic is used, then ultimately the higher
columns converge faster. It should be noted that the results in first two rows are far from the
actual value which affects the subsequent rows in higher columns. If the first two rows are
eliminated and the process is started with a larger number of points, then the convergence
will be much faster.
This exercise can be repeated for other integrals in Example 6.1. For the first integral
the convergence will be very fast and even before the higher columns are built up, the result
will converge to seven significant figures. For the last two integrals, there will be no improve-
ment over the trapezoidal rule, since the truncation error in these cases cannot be expressed
as a power series in $h^2$.

For functions which do not possess continuous derivatives up to sufficient
order, the truncation error in the trapezoidal rule cannot be expressed in the
form assumed in (6.29). For certain classes of singularities in the integrand or its
derivatives, the form of the error expansion is known. If this known form is used in
the extrapolation procedure, then it is possible to get much faster convergence.
For example, consider the integral

$$\int_0^a x^\alpha f(x)\,dx, \qquad 0 > \alpha > -1, \qquad (6.33)$$
where f(x) is a regular function. For this type of integral, it can be shown
(Davis and Rabinowitz, 2007) that the error in the trapezoidal rule using a uniform
spacing of h can be expressed in the form

$$E_h = \sum_{i=2}^{\infty} c_i h^i + \sum_{i=1}^{\infty} c_i'\, h^{\alpha+i}. \qquad (6.34)$$

It is interesting to note that the second summation can be obtained by consid-
ering the truncation error in integration over the interval [0, h]. If the higher
derivatives of f(x) are singular, then an appropriate number of terms of the series
could be considered. Hence, we can conclude that the error in the total integral
can be expressed as the sum of two asymptotic power series. Thus, in this case,
we can use the exponents α + 1, α + 2, 2, α + 3, 3, α + 4, 4, ... in the Richardson
h → 0 extrapolation (see {5.15}) to get

$$T_m^k = T_{m-1}^k + \frac{T_{m-1}^k - T_{m-1}^{k-1}}{2^{\gamma_m} - 1}, \qquad (k = m, m+1, \ldots;\ m = 1, 2, \ldots), \qquad (6.35)$$

where $\gamma_m$ are the exponents in the error expansion given above.
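The same recurrence is easy to apply once the exponents are known. The sketch below is only an illustration (the names extrapolate, T0 and gammas are introduced here, not taken from the text): it takes the first column $T_0^k$ computed with spacings $h_0/2^k$ and builds the higher columns according to (6.35).

    def extrapolate(T0, gammas):
        # T0[k] is the trapezoidal estimate with spacing h0 / 2**k;
        # gammas[m-1] is the exponent gamma_m removed by column m, Eq. (6.35)
        columns = [list(T0)]
        for g in gammas:
            prev = columns[-1]
            factor = 2.0 ** g - 1.0
            columns.append([prev[k] + (prev[k] - prev[k - 1]) / factor
                            for k in range(1, len(prev))])
        return columns   # columns[m][k - m] corresponds to T_m^k

With gammas = [0.01, 1.01, 2, 2.01, 3, ...] this should reproduce the kind of extrapolation used in Example 6.3 below.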


EXAMPLE 6.3: Evaluate the following integral using Richardson's h → 0 extrapolation

$$\int_0^1 \frac{e^x\,dx}{100\,x^{0.99}} \approx 1.01306543. \qquad (6.36)$$

Table 6.3: Richardson's h → 0 extrapolation

N      $T_0^k$      $T_1^k$      $T_2^k$      $T_3^k$      $T_4^k$      $T_5^k$      $T_k^k$

4 .028033
7 .035672 1.133864
13 .042830 1.071968 1.010922
25 .049732 1.042039 1.012521 1.013049
49 .056485 1.027385 1.012932 1.013067 1.013070
97 .063142 1.020162 1.013038 1.013072 1.013073 1.013073
193 .069728 1.016585 1.013058 1.013065 1.013064 1.013063 1.013063
385 .076256 1.014822 1.013082 1.013090 1.013093 1.013095 1.013097
769 .082733 1.013933 1.013057 1.013049 1.013043 1.013040 1.013036

In this case, the error expansion can be expected to be of the form

$$E = \sum_{i=1}^{\infty} b_i h^{\gamma_i}, \qquad \{\gamma_i\} = \{0.01,\ 1.01,\ 2,\ 2.01,\ 3,\ 3.01,\ 4,\ 4.01,\ \ldots\}. \qquad (6.37)$$

However, since computationally it may be difficult to distinguish between numbers like 2
and 2.01, we have used $\gamma_i = i - 1 + 0.01$. Using these exponents we can perform the h → 0
extrapolation using (6.35) and the results are displayed in Table 6.3. This integral is essentially
similar to the last integral in Example 6.1. The factor of $e^x$ has been added only to make the
error expansion more typical. If it was not present, then only one term of the form $bh^{\alpha+1}$
may be added to the usual error expansion. It can be seen that the first column is similar
to that in Table 6.1, but there is a dramatic improvement in the second column, where the
result is converging roughly as 1/n. In higher columns the convergence is even better.

There are other methods to speed up the convergence of a slowly converging
sequence, which do not require the knowledge of exponents in the asymptotic
expansion of the truncation error. Suppose it is known that a sequence $s_n$
converges linearly to the limit s

$$s_n = s + C q^n (1 + \epsilon_n), \qquad (6.38)$$

where C ≠ 0, $\epsilon_n \to 0$ as n → ∞ and |q| < 1. We further assume that the
constants C, $\epsilon_n$ and q are not known. It can be seen that

$$\frac{s - s_{n+1}}{s - s_n} = q\,\frac{1 + \epsilon_{n+1}}{1 + \epsilon_n} \approx q. \qquad (6.39)$$

Hence,

$$\frac{s - s_{n+1}}{s - s_n} \approx \frac{s - s_{n+2}}{s - s_{n+1}}. \qquad (6.40)$$

Solving for s we get

$$s \approx \frac{s_n s_{n+2} - s_{n+1}^2}{s_{n+2} - 2 s_{n+1} + s_n}. \qquad (6.41)$$

Starting from a sequence $s_1, s_2, \ldots$, we obtain a second sequence $s_1', s_2', \ldots$, using
the transformation

$$s_n' = s_{n+1} - \frac{(s_{n+2} - s_{n+1})(s_{n+1} - s_n)}{s_{n+2} - 2 s_{n+1} + s_n}. \qquad (6.42)$$

It can be shown that under the above hypothesis, this sequence converges faster
than the original sequence. This technique for accelerating the convergence is
known as Aitken's δ² method. It may be noted that the denominator in
(6.42) is the second-order central difference $\delta^2 s_{n+1}$. If the leading term in the
truncation error of a quadrature formula is of the form $bh^\gamma$, then it can be put
in the required form if we use a sequence of values $h_n = h_0 p^n$, where 0 <
p < 1. This technique can be used to accelerate the convergence of any slowly
converging sequence and is also useful in other areas of numerical methods.
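A single pass of this transformation takes only a few lines of code. The sketch below is purely illustrative (the function name aitken and the test series for ln 2 are not from the text): it applies (6.42) to a given sequence.

    import math

    def aitken(s):
        # one application of Aitken's delta-squared transformation, Eq. (6.42)
        t = []
        for n in range(len(s) - 2):
            denom = s[n + 2] - 2.0 * s[n + 1] + s[n]     # central difference
            if denom == 0.0:
                t.append(s[n + 2])                       # sequence already converged
            else:
                t.append(s[n + 1] - (s[n + 2] - s[n + 1]) * (s[n + 1] - s[n]) / denom)
        return t

    # partial sums of the slowly converging alternating series for ln 2
    partial = [sum((-1) ** (k + 1) / k for k in range(1, n + 1)) for n in range(1, 12)]
    print(aitken(partial)[-1], math.log(2.0))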
The same transformation can be applied again to the sequence $s_n'$ to ac-
celerate the convergence, provided the truncation error in $s_n'$ is of the same
form, which will be the case if the true value can be expressed in the form

$$s = s_n + \sum_{i=1}^{p} c_i q_i^n. \qquad (6.43)$$

In such cases, we can apply the Aitken process several times, each applica-
tion taking care of one term of the error expansion. This process is somewhat
similar to Richardson extrapolation, except for the fact that here we need
three values of $s_i$ to remove one term, since both $c_i$ and $q_i$ are unknown. In
Richardson's extrapolation the exponent is assumed to be known and hence the
convergence will be faster. On the other hand, the Aitken procedure is more
widely applicable, since we need not know the exponents in the error expansion.
The acceleration method is normally implemented via the so-called ε-
algorithm due to Wynn. In this method we construct the ε-table

    $\epsilon_{-1}^{(0)}$
                 $\epsilon_0^{(0)}$
    $\epsilon_{-1}^{(1)}$             $\epsilon_1^{(0)}$
                 $\epsilon_0^{(1)}$             $\epsilon_2^{(0)}$
    $\epsilon_{-1}^{(2)}$             $\epsilon_1^{(1)}$
                 $\epsilon_0^{(2)}$
    $\epsilon_{-1}^{(3)}$

using the relation

$$\epsilon_{m+1}^{(j)} = \epsilon_{m-1}^{(j+1)} + \left(\epsilon_m^{(j+1)} - \epsilon_m^{(j)}\right)^{-1}, \qquad \epsilon_{-1}^{(j)} = 0. \qquad (6.44)$$

It can be shown that if s satisfies an equation of the form (6.43), then $\epsilon_{2p}^{(j)} = s$.
Further, if $|q_i| \ll 1$, then

$$\epsilon_{2k}^{(j)} \approx s + \frac{c_{k+1}\,[(q_{k+1} - q_1)\cdots(q_{k+1} - q_k)]^2}{[(1 - q_1)\cdots(1 - q_k)]^2}\, q_{k+1}^j. \qquad (6.45)$$

Hence, the even numbered columns of the ε-table converge to the true value. For $\epsilon_0^{(k)}$
we can use the trapezoidal rule approximation to the integral. It can be shown
that $\epsilon_2^{(k)}$ is identical to the sequence $s_n'$ obtained using Aitken's method.
The convergence of the ε-algorithm is much slower than that of Richardson ex-
trapolation, and it should be used only for those integrals where the exponents
of h in the error expansion are not known.
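The ε-algorithm itself is a short triangular recursion. The following sketch is only an illustration (epsilon_algorithm is a name chosen here, and the guard against a vanishing difference is a crude expedient): it builds the table column by column from the initial sequence and returns the even columns, which are the ones that converge to the limit.

    def epsilon_algorithm(s):
        # Wynn's epsilon-algorithm, Eq. (6.44); returns the even columns only
        prev = [0.0] * (len(s) + 1)      # eps_{-1}^{(j)} = 0
        cur = list(s)                    # eps_0^{(j)} = s_j
        evens = [cur]
        m = 0
        while len(cur) > 1:
            nxt = []
            for j in range(len(cur) - 1):
                diff = cur[j + 1] - cur[j]
                nxt.append(prev[j + 1] + (1.0 / diff if diff != 0.0 else 0.0))
            prev, cur = cur, nxt
            m += 1
            if m % 2 == 0:
                evens.append(cur)
        return evens

    # quick test on the partial sums of the Leibniz series for pi/4
    partial = [sum((-1) ** k / (2 * k + 1) for k in range(n + 1)) for n in range(10)]
    print(4.0 * epsilon_algorithm(partial)[-1][-1])   # approaches pi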
EXAMPLE 6.4: Evaluate the integrals in Examples 6.2 and 6.3, using the ε-algorithm to accelerate the convergence of the trapezoidal rule approximation.
The ε-algorithm can be applied to the trapezoidal rule approximation for the integral,
even if the exact form of truncation error is not known. The results are displayed in Tables
6.4 and 6.5, where n is the number of points used in the trapezoidal rule approximation listed
in the first column. For higher columns, the number of points required can be found out by
drawing a diagonal line across the table. For the first integral, the result converges rather
fast to the correct value. It requires 257 points to get a result accurate to seven figures. The
higher columns have a larger roundoff error. The convergence here is only marginally slower
as compared to that for the Romberg integration. However, if higher accuracy is required,
the Romberg process will converge much faster. It may be noted that the odd columns do
not converge to any definite value.
Table 6.4: ε-table for $\int_{-10}^{10} \frac{dx}{1+x^2} \approx 2.942255$

n      $\epsilon_0^{(k)}$

2 .198020
.101000
3 10.099010 6.928104
-.214367 -.422035
5 5.434120 2.112725 2.847934
-.515446 .938124 6.40094
9 3.494052 2.800687 3.030989 2.950566
-1.95769 5.28025 -6.03330 -125.703
17 2.983245 2.938848 2.942600 2.942210 2.942259
-24.4817 271.831 -2571.76 20347.9
33 2.942398 2.942223 2.942248 2.942253
-5722.11 40369.1 180179.
65 2.942224 2.942245 2.942255
41943.0 135150.
129 2.942247 2.942255
167772.
257 2.942253

For the second integral the trapezoidal rule converges very slowly, but if the known form of the
error expansion is used in the extrapolation, then the results improve markedly (see Example 6.3).
The results using the ε-algorithm are displayed in Table 6.5, which shows only the even columns
of the ε-table. From the table it is clear that, even though the results are not as good as those
in Table 6.3, the higher columns do tend towards the correct value. There is a significant
improvement in the convergence for $\epsilon_2^{(k)}$, which is obtained after one step of the ε-algorithm.
Beyond $\epsilon_4^{(k)}$ the values are oscillating and we cannot expect any further improvement in
higher columns.

Table 6.5: ε-table for $\int_0^1 \frac{e^x\,dx}{100\,x^{0.99}} \approx 1.01306543$

n      $\epsilon_0^{(k)}$      $\epsilon_2^{(k)}$      $\epsilon_4^{(k)}$      $\epsilon_6^{(k)}$      $\epsilon_8^{(k)}$      $\epsilon_{10}^{(k)}$

2 .01359
3 .02317 .07632
5 .03128 .11524 .77396
9 .03868 .18060 .86148 3.11753
17 .04572 .28345 .95763 1.02108 1.01662
33 .05255 .42631 .99816 1.01692 1.02232 1.00871
65 .05926 .59097 1.01128 1.00769 1.01031
129 .06588 .74314 1.00640 1.00950
257 .07244 .85558 1.01521
513 .07895 .92762
1025 .08541
6.3 Gaussian Quadrature


Let us consider the quadrature formula, based on n abscissas $a_1, \ldots, a_n$

$$\int_a^b f(x)\,dx = \sum_{j=1}^{n} f(a_j)\, H_j + E_n. \qquad (6.46)$$

It may be noted that throughout this section, the points are labelled from $a_1$
to $a_n$ rather than from $a_0$ to $a_n$. So far we have assumed that the points are
uniformly spaced, but obviously that is not necessary. In fact, if this condition
is relaxed, then it can be seen that there are 2n independent constants $a_j$ and
$H_j$ on the right-hand side, and in principle it should be possible to adjust them
in such a way that the quadrature formula is accurate for polynomials of degree
less than or equal to 2n − 1, which also have 2n parameters.
For n = 1 we know that the midpoint rule realises the desired accuracy. For
n = 2, we can write

$$\int_{-1}^{1} f(x)\,dx = H_1 f(a_1) + H_2 f(a_2) + E_2, \qquad (6.47)$$

where for convenience we have chosen the interval [−1, 1]. This formula is re-
quired to be exact for polynomials of degree three or less. We can use the
method of undetermined coefficients to determine the constants $a_1$, $a_2$, $H_1$
and $H_2$, which yields the following set of nonlinear equations

    f(x) = 1:      $H_1 + H_2 = 2$,
    f(x) = x:      $a_1 H_1 + a_2 H_2 = 0$,
    f(x) = x²:     $a_1^2 H_1 + a_2^2 H_2 = \frac{2}{3}$,                (6.48)
    f(x) = x³:     $a_1^3 H_1 + a_2^3 H_2 = 0$.

This system of nonlinear equations can be solved to get $a_1 = -a_2 = 1/\sqrt{3}$ and
$H_1 = H_2 = 1$. Similarly, we can derive quadrature formulae involving more
points, but solving the system of nonlinear equations may not be an
easy task. Further, it is not obvious that the system of nonlinear equations will
have a real solution. There is a better way of deriving these formulae using the
properties of orthogonal polynomials, which also assures us about the existence
of such formulae.
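A quick numerical check of the two-point rule obtained from (6.48) can be made by integrating the monomials directly. The short script below is only an illustration (not part of the text): it confirms that the nodes ±1/√3 with unit weights reproduce $\int_{-1}^{1} x^k\,dx$ exactly for k = 0, 1, 2, 3.

    import math

    a1, a2, H1, H2 = 1.0 / math.sqrt(3.0), -1.0 / math.sqrt(3.0), 1.0, 1.0
    for k in range(4):
        exact = (1.0 - (-1.0) ** (k + 1)) / (k + 1)   # integral of x**k over [-1, 1]
        approx = H1 * a1 ** k + H2 * a2 ** k
        print(k, approx, exact)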
Assuming that the limits of integration are from −1 to 1, the truncation
error in a quadrature formula can be written as

$$E_n = \int_{-1}^{1} p_n(x)\, f[a_1, \ldots, a_n, x]\, dx, \qquad (6.49)$$

where $p_n(x) = \prod_{i=1}^{n} (x - a_i)$. This form can be obtained by integrating the error
term in the Newton divided difference interpolation formula. Now, if f(x) is
a polynomial of degree m, then the divided difference would be a polynomial
of degree m − n for m ≥ n, while for m < n it would vanish identically. Hence,
if f(x) is a polynomial of degree less than n, then the truncation error will
vanish. If f(x) is a polynomial of degree n, then the divided difference would be
a constant. In that case, if $\int_{-1}^{1} p_n(x)\,dx = 0$, then $E_n = 0$ and our quadrature
formula would be exact for polynomials of degree n also. In addition, if we also
have $\int_{-1}^{1} p_n(x)\,x\,dx = 0$, then the formula would be exact for polynomials of
degree n + 1, and so on. It should be noted that $p_n(x)$ contains n constants $a_i$
and by adjusting the $a_i$ suitably it should be possible to make it orthogonal to
$1, x, \ldots, x^{n-1}$, in which case the resulting quadrature formula would be exact
for polynomials of degree less than or equal to 2n − 1.
It is well known that if $p_n(x)$ is some constant multiple of $P_n(x)$, the
Legendre polynomial of degree n, then it is orthogonal to all polynomials of
degree less than or equal to n − 1 over the interval [−1, 1]. Thus, we require

$$p_n(x) = \prod_{i=1}^{n}(x - a_i) = c_n P_n(x), \qquad (6.50)$$

for some constant $c_n$.

Hence, the required abscissas are just the zeros of the Legendre polynomial
Pn(x). Thus, if we had chosen our Lagrangian formula based on the abscissas
ai which are zeros of the Legendre Polynomial of degree n, then the resulting
quadrature formula would be exact for all polynomials of degree 2n - 1 or less.
It is well-known that all zeros of Legendre polynomials are real and distinct and
lie in the interval (-1, 1). Hence, the existence of such high order quadrature
formula is assured for all values of n. Further, it can be easily proved that this
formula is not exact for all polynomials of degree 2n. This result follows from
the fact that the quadrature formula is not exact for
$$p_{2n}(x) = \prod_{i=1}^{n} (x - a_i)^2, \qquad (6.51)$$

where $a_i$ are the abscissas of the Gaussian formula.


The weights are given by

$$H_j = \int_{-1}^{1} l_j(x)\,dx, \qquad (6.52)$$

where using (4.3)

$$l_j(x) = \frac{p_n(x)}{(x - a_j)\, p_n'(a_j)}. \qquad (6.53)$$

After some manipulations, we get

$$H_j = \frac{2}{(1 - a_j^2)\,[P_n'(a_j)]^2}, \qquad (6.54)$$
while the truncation error in this case can be obtained by considering the
integration of $p_n^2(x)$ to get

$$E_n = \frac{2^{2n+1}\,(n!)^4}{(2n+1)\,[(2n)!]^3}\, f^{(2n)}(\eta), \qquad -1 < \eta < 1. \qquad (6.55)$$

Any quadrature formula whose abscissas and weights are subject to no


constraints and which are determined so as to achieve a maximum order of
accuracy is called a Gaussian quadrature formula. In particular, with the ab-
scissas given as the zeros of the Legendre polynomial of degree n and weights
H j given above, the formula is called Gauss-Legendre quadrature formula. It
should be noted that the zeros of the Legendre polynomial are distributed symmet-
rically about x = 0 and if the abscissas are arranged in a monotonic sequence,
then $a_i = -a_{n+1-i}$ and $H_i = H_{n+1-i}$, for i = 1, ..., n. For even values of n
we have n/2 pairs of abscissas with opposite signs, while for odd values of n,
there are (n − 1)/2 pairs and the remaining abscissa is at x = 0. Hence, only
half the values need to be tabulated in tables of weights and abscissas of Gauss-
Legendre formulae. Such tables are available in Abramowitz and Stegun (1974),
Kopal (1961), Krylov (2005) and Stroud and Secrest (1966).
This quadrature formula in the form presented above is applicable only
when the interval of integration is [-1,1]. If the integral is required over any
finite interval [a, b], then the following linear transformation will transform the
interval to [-1,1]:
$$x = \frac{b-a}{2}\,t + \frac{b+a}{2}. \qquad (6.56)$$
Using this transformation we get

$$\int_a^b f(x)\,dx = \frac{b-a}{2} \int_{-1}^{1} f\!\left(\frac{b-a}{2}\,t + \frac{b+a}{2}\right) dt
= \frac{b-a}{2} \sum_{i=1}^{n} H_i\, f\!\left(\frac{b-a}{2}\,a_i + \frac{b+a}{2}\right)
+ \left(\frac{b-a}{2}\right)^{2n+1} \frac{2^{2n+1}\,(n!)^4}{(2n+1)\,[(2n)!]^3}\, f^{(2n)}(\eta), \qquad (6.57)$$

where a < η < b. Again as in the case of Newton-Cotes formulae, we can also
consider composite Gauss-Legendre formulae, by subdividing the interval in
several parts and applying the formula over each subinterval separately. In this
case, it is obvious that if an interval is divided into two or more subintervals,
the new abscissas are entirely different from the old ones. In contrast, for the
Newton-Cotes formulae all the old points can be used to save the computation.
Another disadvantage of Gaussian formulae is that the weights and abscissas
are all irrational numbers and so, especially in the days of hand computation, it
was extremely inconvenient to use them. But on digital computers it is immate-
rial whether the argument is rational or irrational. However, all these irrational
numbers have to be supplied to the computer in the form of data, which needs
to be entered and checked thoroughly before the program can be used for nu-
merical integration. This check can be easily carried out by trying the program
on f(x) = xk for k = 0,1, ... , 2n - 1, where n is the number of points in
the Gaussian formula. If the weights and abscissas are typed correctly, then
the result should be exact (apart from some small roundoff error) for all these
functions. Of course, the weights and abscissas for Gaussian quadrature can
be computed using appropriate formulae, but this computation requires signif-
icant effort and hence it should be done only once and results can be saved for
further use.
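The check suggested above is easy to automate. The sketch below uses the Gauss-Legendre nodes and weights supplied by numpy (numpy.polynomial.legendre.leggauss), maps them to [a, b] with the transformation (6.56)-(6.57), and verifies the rule on a few monomials. It is an independent illustration, not the GAULEG or GAUSS routines of Appendix B.

    import numpy as np

    def gauss_legendre(f, a, b, n):
        # n-point Gauss-Legendre rule mapped from [-1, 1] to [a, b], Eq. (6.57)
        t, w = np.polynomial.legendre.leggauss(n)
        x = 0.5 * (b - a) * t + 0.5 * (b + a)
        return 0.5 * (b - a) * np.sum(w * f(x))

    # an n-point rule should reproduce x**k exactly for k <= 2n - 1
    n = 8
    for k in (0, 5, 15):
        exact = 2.0 ** (k + 1) / (k + 1)          # integral of x**k over [0, 2]
        print(k, gauss_legendre(lambda x: x ** k, 0.0, 2.0, n), exact)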
It is possible to prove that all weights of Gauss-Legendre quadrature for-
mulae of any order are positive and so Gaussian formulae have very good round-
off properties. Even for very high order Gaussian formulae, the roundoff error
may be fairly small. It can be shown that if f(x) is continuous, then any de-
sired degree of accuracy is attainable by using a sufficiently high order Gaussian
formula.
For any given ε > 0, let p(x) be a polynomial of degree N, such that
|f(x) − p(x)| < ε on [a, b]. The existence of such a polynomial is guaranteed by
the Weierstrass theorem. The truncation error E is given by

$$E = \int_a^b [f(x) - p(x)]\,dx + \left(\int_a^b p(x)\,dx - \sum_{j=1}^{n} H_j\, p(a_j)\right) + \sum_{j=1}^{n} H_j\,[p(a_j) - f(a_j)]. \qquad (6.58)$$

If 2n − 1 > N, then the second term vanishes, since the quadrature formula
would be exact for the polynomial p(x). Further, since the weights are all pos-
itive and $\sum_{j=1}^{n} H_j = b - a$, we get

$$|E| < 2\epsilon(b - a). \qquad (6.59)$$

Thus, the truncation error in Gauss-Legendre formula can be made arbitrarily


small by choosing n large enough. It should be noted that the above proof
is valid for any sequence of quadrature formulae of increasing order, provided
only that weights are all positive, which is necessary to bound the last term
in (6.58). This condition is not satisfied by the Newton-Cotes formulae, but is
satisfied by the Romberg integration. Hence, the above argument provides a
proof of convergence of the Romberg method.
The main advantage of Gaussian formulae comes from the fact that, even
for very high order formula the roundoff error is strictly controlled and further
the convergence is also assured. Thus, we can easily use Gauss-Legendre formula
based on 32 or 64 points per subinterval, while the use of 64-point Newton-


Cotes formula would be almost certainly disastrous. The use of very high order
Gaussian formula may not always be advisable, since in most cases, sufficient
accuracy can be achieved using a smaller number of points.
As noted earlier one of the main objections against the use of Gaussian
formulae is that, the weights and abscissas of rules of any order are distinct from
those of any other order (except for zero which appears as an abscissa in all rules
of odd order). Further, each of the composite rules will have abscissas which
are completely different from other composite rules. Hence, in proceeding from
one Gaussian formula to a more accurate one, almost all information obtained
in computing the first approximation has to be discarded. This objection has
been partially answered by a device due to Kronrod and developed by Patterson
(Davis and Rabinowitz, 2007, Piessens et al., 1983), which enables us to add
new abscissas to a given set producing a new rule of higher, but not optimally
higher, accuracy. Kronrod starts with an n-point Gauss rule, adds n + 1 abscissas
and arrives at a rule which is exact for all polynomials of degree less than or
equal to (3n + 1) when n is even, while for odd values of n the resulting rule is
also exact for all polynomials of degree (3n + 2). However, the weights associated
with the original Gauss abscissas are not conserved in the process of extension.
Such formulae are referred to as Gauss-Kronrod quadrature formulae. Patterson
has extended it further by adding another 2n + 2 abscissas to the Kronrod rule
and arriving at a (4n + 3)-point rule, which is exact for all polynomials of
degree less than or equal to (6n + 5). It is not obvious that such extensions will
exist, but some results have been proved, while a lot of rules have been found.
For example, starting with 3-point Gauss-Legendre rule, a 7-point Kronrod
extension and Patterson rules with 15,31, ... ,255 points have been constructed.
Each of these Patterson rules with n points is exact for polynomials of degree
less than or equal to (3n + 1)/2. Extensive tables of these rules are given in
Piessens et al. (1983).
Apart from these extensions, it is also possible to obtain Gauss type
quadrature rules, where some of the abscissas are preassigned. In this case,
the order of accuracy will be correspondingly reduced. The most important
of such rules are the Radau formula, where one of the end points is an ab-
scissa, and the Lobatto formula, where both the end points are abscissas. These
formulae are useful in some special circumstances (see Section 6.6.9).

6.4 Roundoff Error


Apart from the truncation error that we have considered so far, there will also
be roundoff error in evaluating the quadrature formulae. There are two sources
of roundoff error, first due to error in evaluating the function value and second
while evaluating the quadrature sum. The error in the function values could be
due to roundoff error in calculating the value, or due to inherent error in its
definition (e.g., when the function is defined by a table of values). If ε is the
magnitude of the maximum error in the function values, then the total error
due to it is bounded by

$$|R| < \epsilon \sum_{j=0}^{n} |H_j|. \qquad (6.60)$$

If the $H_j$ are all positive, then we get

$$|R| < \epsilon \sum_{j=0}^{n} H_j = \epsilon(b - a). \qquad (6.61)$$

Here the last equality can be obtained from (6.2) with f(x) ≡ 1. Hence, it can
be seen that the roundoff error due to the uncertainties in the function values
is independent of h and is not a serious problem in numerical integration. The
relative roundoff error in evaluating integral is of the same order as the relative
error in the function value itself, unless there is a substantial cancellation in
evaluating the quadrature sum. Of course, if the function evaluation itself has
a large roundoff error, then it may be difficult to estimate the integral very
accurately. For example, consider the integral

1 1 dx
, x - sin x
(0<1'«1). (6.62)

In this case, there could be large roundoff error in evaluating the integrand near
x = 1', unless proper care is taken to rewrite the denominator for small values
of x. Further, in this case, the major contribution to the integral comes from
the region close to x = I' and hence the roundoff error could be significant.
For higher order Newton-Cotes formulae, since the $H_j$ are not all positive,
there may be magnification of the roundoff error by a factor of

$$\frac{\sum_{j=0}^{n} |H_j|}{\sum_{j=0}^{n} H_j}. \qquad (6.63)$$

For the 9-point formula this factor is 1.5, for the 11-point formula it is 3.1, for the 21-point
formula it is 544, and it increases very steeply with the number of points. Hence,
use of these formulae can give rise to a significantly higher roundoff error in
estimating integrals.
So far we have not considered the roundoff error in evaluating the quadra-
ture sum. If $|f(x)| < M_0$, then using (2.55) we get the following bound

$$|R| < h M_0 \left( n|H_0| + \sum_{i=1}^{n} (n - i + 1)|H_i| \right). \qquad (6.64)$$

This bound depends on the order in which the summation is carried out. If the
weights are all of the same order, then this error bound will be of the order of
$hM_0 n(b - a)/2$. The actual error will be much smaller than this bound, and
for small values of n it may not cause any significant error, unless the value of
integral is significantly less than Mo(b - a), which can happen if the function
changes sign in the interval. This error bound can in general be reduced if we use
the cascade sum algorithm described in Section 2.3. However, it is not essential
to use the cascade sum algorithm, since that will require significant amount of
bookkeeping. We can divide the n abscissas into m subgroups, each consisting of
k points (n = m x k) and sum each of the subgroups independently. Now if the
partial sums are added to give the final result, the error bound is reduced by a
factor of (m + k) / n, which is sufficient in most cases. For numerical integration
in one dimension, the number of points is seldom large, and roundoff error is
not very important. But for evaluating multiple integrals, the number of points
could be very large, limited only by the speed of the computer. Hence, proper
care is required to control the roundoff error, particularly on computers with
small word length.
The roundoff error in evaluating the sum may be reduced if the terms are
added in increasing order. Hence, in general, for the Gaussian formulae, if the
summation is carried out in order of increasing weights, the roundoff error
may be reduced. Further, for Gauss-Legendre formula if the pairs of values
with the same weights are combined before the summation, then apart from
saving one multiplication the roundoff error may also be reduced. To control
roundoff error in composite rule, one simple device may be to sum over each
subinterval separately and then add the result to a running sum. This technique
is implemented in Subroutine GAUSS in Appendix B. Another technique for
reducing the roundoff error in summation is described in {2.35}.
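The grouping idea described above can be expressed in a few lines; the sketch below (the names and the group size k are arbitrary choices for illustration, not the Appendix B routine) sums the quadrature terms $H_j f(a_j)$ in subgroups and then adds the partial sums, which reduces the worst-case growth of the roundoff error without the bookkeeping of a full cascade sum.

    def grouped_sum(terms, k=32):
        # sum in subgroups of k terms, then add the partial sums
        partial = [sum(terms[i:i + k]) for i in range(0, len(terms), k)]
        return sum(partial)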
Here we have considered a bound on roundoff error, but as we have seen
in Chapter 2, these bounds are usually very pessimistic and the actual error
could be a few orders of magnitude lower than these bounds. Hence, it may be
more meaningful to consider a statistical approach for estimating the roundoff
error. If we assume that the function values are all of the same order, then the
root mean square estimate for the roundoff error (i.e., standard deviation) can
be written as {12}
(6.65)

Further, if the weights are all positive, then we have the constraint $\sum H_j = b - a$.
Hence, the roundoff error can be minimised if all weights are equal, i.e., $H_j =
(b - a)/n$, where n is the total number of points used in the quadrature formula.
Once again, this condition is grossly violated by the high order Newton-Cotes
formulae, where even for the 11-point formula the weights differ by more than
a factor of 25, while for the 21-point formula, the weights differ by a factor of
$10^4$ and for large n this ratio is approximately

$$\frac{4\, n!}{[(n/2)!]^2\, n \ln n} \approx \frac{2^{n+2}}{n \ln n} \sqrt{\frac{2}{\pi n}}. \qquad (6.66)$$
On the other hand, in the Gauss-Legendre formulae the weights are nearly
equal, the maximum ratios being respectively 3.6, 7 and 14 for 8, 16 and 32-
point formulae. Thus, once again the Gaussian formulae have a better roundoff
property.
It is actually possible to construct a class of quadrature formulae for which
the weights are all exactly equal. Such formulae can be written in the form

$$\int_a^b f(x)\,dx = \frac{b-a}{n} \sum_{k=1}^{n} f(a_k) + E_n. \qquad (6.67)$$

The abscissas ak can be chosen such that, this formula is exact for all polyno-
mials of degree n - 1 or less. The existence of such a formula is not guaranteed,
since the solution of the required nonlinear equations may turn out to be com-
plex. It is found that such formulae exist only for n = 1,2,3,4,5,6,7,9 (Krylov,
2005). These formulae are sometimes referred to as Chebyshev quadrature for-
mulae {13}.
Apart from these, there will be roundoff error due to the fact that the
abscissas may not be accurate. As a result, the function may be evaluated at
points which are slightly different from the actual abscissas in the quadrature
formula. This effect is particularly important near a singularity, where a small
change in abscissa will cause a large change in the function value. It may be
noted that when the successive abscissas are computed by successive addition,
e.g., $a_j = a_{j-1} + h$, the roundoff error could build up considerably. This error
could be reduced if each abscissa is computed directly, e.g., $a_j = a + jh$, for
uniformly spaced points.
In general, roundoff error is not a very serious problem in numerical inte-
gration, unless the integral itself vanishes. If f (x) changes sign in the interval,
then even if all weights are positive, the terms in quadrature sum may be of
different signs and there will be some cancellation and consequent loss of signifi-
cant figures. In general, if an iterative procedure is used to estimate the integral,
the convergence will be limited by the roundoff error, thus giving an estimate
of roundoff error. For example, we may subdivide each subinterval into two and
recompute the integral using the smaller subintervals, the process being contin-
ued until a satisfactory accuracy is achieved. In this case, if the roundoff error
is dominating, then usually the result will not converge. Of course, the roundoff
errors being essentially random, sometimes we can get accidental convergence.
The problem of checking for convergence will be considered in Section 6.7.

6.5 Weight Function


Let us consider the interpolation formula

$$f(x) = \sum_{i=1}^{n} f(a_i)\, l_i(x) + p_n(x)\, f[a_1, \ldots, a_n, x]. \qquad (6.68)$$
Instead of integrating this function directly we can multiply it by a suitable
function w(x) and then integrate to get the quadrature formula

$$\int_a^b w(x) f(x)\,dx = \sum_{i=1}^{n} f(a_i) \int_a^b w(x)\, l_i(x)\,dx + \int_a^b w(x)\, p_n(x)\, f[a_1, \ldots, a_n, x]\,dx. \qquad (6.69)$$

The function w(x) is known as the weight function and to be of any use it must
be such that

$$\int_a^b w(x)\, l_i(x)\,dx = H_i \qquad (6.70)$$

can be easily evaluated, so that the quadrature formula can be written in the
usual form

$$\int_a^b w(x) f(x)\,dx = \sum_{i=1}^{n} f(a_i)\, H_i + E_n. \qquad (6.71)$$

The simplest case, when w(x) ≡ 1, corresponds to the Gauss-Legendre
quadrature considered in Section 6.3. If w(x) ≢ 1, then the advantages of this
formulation are twofold. First, computationally it may be easier to evaluate
$f(a_i)$ rather than $w(a_i) f(a_i)$. Second and the most important advantage comes
from the fact that in this case, the truncation error involves derivatives of f(x)
only. Hence, even if derivatives of w(x) or even w(x) itself is singular in the in-
terval, the error can be bounded if the derivatives of f(x) are bounded. Hence,
the concept of weight function is very effective for dealing with singularities in
the integrand. The weight function is generally assumed to be positive through-
out the interval of integration, so that the properties of orthogonal polynomials
can be used to derive the quadrature formulae.
Again if we follow our argument for Gaussian formulae, then it can be
shown that if the product $p_n(x)$ is a constant multiple of the orthogonal poly-
nomial $\phi_n(x)$ corresponding to a weight function w(x) over the given interval
(a, b), the corresponding quadrature formula will be exact when f(x) is a poly-
nomial of degree less than or equal to 2n − 1. In that case, we can write

$$\phi_n(x) = A_n p_n(x) = A_n \prod_{i=1}^{n} (x - a_i), \qquad (6.72)$$

where $A_n$ is the coefficient of $x^n$ in the definition of $\phi_n(x)$. Obviously, in this
case, the abscissas $a_j$ would be zeros of $\phi_n(x)$, while the weights $H_j$ are given
by

$$(j = 1, \ldots, n), \qquad (6.73)$$

where

$$\gamma_n = \int_a^b w(x)\,\phi_n^2(x)\,dx. \qquad (6.74)$$

The truncation error $E_n$ can be expressed in the form

$$E_n = \frac{\gamma_n}{A_n^2\,(2n)!}\, f^{(2n)}(\eta). \qquad (6.75)$$
As in the case of Gauss-Legendre formula it can be shown that, if f(x)


is continuous and w(x) is any integrable, nonnegative weight function, then all
weights H j are positive and any desired degree of accuracy can be achieved by
using a sufficiently high order Gaussian quadrature formula.
Let us consider a specific example by taking $w(x) = 1/\sqrt{1-x^2}$ on the
interval [−1, 1]. In this case, it is well known that the corresponding orthogonal
polynomials are the Chebyshev polynomials of the first kind, generally denoted
by $T_n(x)$. For these polynomials we know that $A_n = 2^{n-1}$ (n ≥ 1) and $\gamma_n =
\pi/2$, while the zeros $a_j$ of $T_n(x)$ are given by (4.35). Thus, the quadrature
formula can be written as

$$\int_{-1}^{1} \frac{f(x)}{\sqrt{1-x^2}}\,dx = \frac{\pi}{n} \sum_{j=1}^{n} f\!\left(\cos\frac{(2j-1)\pi}{2n}\right) + \frac{2\pi}{2^{2n}\,(2n)!}\, f^{(2n)}(\eta). \qquad (6.76)$$

This is known as the Gauss-Chebyshev quadrature formula. It has two remark-
able properties as compared to the other Gaussian formulae. First, in this case,
all the weights are equal, which reduces the roundoff error in the cal-
culation. Further, all the abscissas and weights can be expressed in a closed
form, and there is no need of tables of weights and abscissas, which are required
for most Gaussian formulae. However, in this case, it is not possible to use
composite formulae, since if the range is broken anywhere, it is not possible to
transform the integral back to the original form without introducing singular-
ities in derivatives of f(x). It should be noted that in this case, the integrand
is singular at both the end points, but the truncation error can be bounded
if the derivatives of f(x) are bounded on [−1, 1]. If f(x) is analytic in some
neighbourhood of the interval of integration, then there will be no difficulty
in evaluating the integral accurately using this formula. But if there are sin-
gularities in f(x) or its derivatives, we may have to use one of the techniques
described in the next section.
EXAMPLE 6.5: Evaluate the following integral

$$\int_{-1}^{1} \frac{\cos x}{\sqrt{1-x^2}}\,dx = \pi J_0(1) \approx 2.4039394306344. \qquad (6.77)$$

Using the 4-point Gauss-Chebyshev quadrature formula we get a value of 2.4039388, which
is accurate to seven decimal figures. The truncation error in this case is bounded by

$$|E_4| < \frac{2\pi}{2^8 \times 8!} \approx 6.1 \times 10^{-7}. \qquad (6.78)$$

Thus, we can see that by using just four points we can get very good accuracy. Similarly,
the 6-point formula gives 2.40393943063125, which is correct to 12 significant figures. In this
case, the Newton-Cotes formulae or the Gauss-Legendre formulae converge very slowly, with
the truncation error decreasing as $1/\sqrt{n}$, where n is the total number of points used in the
composite formulae.
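Since all weights and abscissas of the Gauss-Chebyshev formula are known in closed form, the rule (6.76) takes only a few lines of code. The sketch below is an illustration with a function name chosen here; it should reproduce the values quoted in this example.

    import numpy as np

    def gauss_chebyshev(f, n):
        # Gauss-Chebyshev rule (6.76): nodes cos((2j-1)pi/(2n)), equal weights pi/n
        j = np.arange(1, n + 1)
        x = np.cos((2 * j - 1) * np.pi / (2 * n))
        return np.pi / n * np.sum(f(x))

    print(gauss_chebyshev(np.cos, 4))   # ~ 2.4039388
    print(gauss_chebyshev(np.cos, 6))   # ~ 2.4039394306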

Of course, we can use different weight functions corresponding to various


orthogonal polynomials, some of which are listed below:
6.5.1 $w(x) = (1-x)^\alpha (1+x)^\beta$, α, β > −1 on [−1, 1]


Here, depending on the values of α and β the integrand or its derivatives could
be singular at both the end points. The orthogonal polynomials in this case are
the Jacobi polynomials $J_n(x; \alpha, \beta)$ and the abscissas will be the zeros of these
polynomials. For these polynomials

$$A_n = \frac{1}{2^n n!}\,\frac{\Gamma(2n+\alpha+\beta+1)}{\Gamma(n+\alpha+\beta+1)}, \qquad
\gamma_n = \frac{2^{\alpha+\beta+1}}{n!\,(2n+\alpha+\beta+1)}\,\frac{\Gamma(n+\alpha+1)\,\Gamma(n+\beta+1)}{\Gamma(n+\alpha+\beta+1)}. \qquad (6.79)$$

Using (6.73) and (6.79), we can express the weights in the form

$$H_j = -\frac{2n+\alpha+\beta+2}{n+\alpha+\beta+1}\,\frac{\Gamma(n+\alpha+1)\,\Gamma(n+\beta+1)}{\Gamma(n+\alpha+\beta+1)\,(n+1)!}\,\frac{2^{\alpha+\beta}}{J_n'(a_j;\alpha,\beta)\,J_{n+1}(a_j;\alpha,\beta)}. \qquad (6.80)$$

Similarly, using (6.75) the truncation error can be expressed as

$$E_n = \frac{\Gamma(n+\alpha+1)\,\Gamma(n+\beta+1)\,\Gamma(n+\alpha+\beta+1)}{(2n+\alpha+\beta+1)\,[\Gamma(2n+\alpha+\beta+1)]^2}\,\frac{n!\,2^{2n+\alpha+\beta+1}}{(2n)!}\, f^{(2n)}(\eta). \qquad (6.81)$$
The weights and abscissas for a Gauss-Jacobi quadrature formula can be
calculated using subroutine GAUJAC in Appendix B. Depending on the value
of α and β, we can recover some of the quadrature formulae as special cases of
the Gauss-Jacobi formula. For example, α = β = 0 corresponds to the Gauss-
Legendre formula, while α = β = −1/2 corresponds to the Gauss-Chebyshev
formula.

6.5.2 $w(x) = \sqrt{1-x^2}$ on [−1, 1]


In this case, the integrand is continuous, but the derivatives are singular at both
the end points. For this weight function, the orthogonal polynomials are called
the Chebyshev polynomials of the second kind, denoted by $S_n(x)$ (sometimes
denoted in the literature by $U_n(x)$). These polynomials can be expressed in the
form

$$S_n(x) = \frac{\sin[(n+1)\cos^{-1} x]}{\sin(\cos^{-1} x)}. \qquad (6.82)$$

The abscissas, which are the zeros of this polynomial, are

$$a_j = \cos\frac{j\pi}{n+1}, \qquad (j = 1, \ldots, n), \qquad (6.83)$$

and $A_n = 2^n$, $\gamma_n = \pi/2$. Using (6.73) and (6.75), we get

$$H_j = \frac{\pi}{n+1}\,\sin^2\frac{j\pi}{n+1}, \quad (j = 1, \ldots, n); \qquad E_n = \frac{\pi}{2^{2n+1}\,(2n)!}\, f^{(2n)}(\eta). \qquad (6.84)$$
This quadrature formula has the advantage that the abscissas and weights can
be expressed in a closed form and no special tables are required.
6.5.3 $w(x) = 1/\sqrt{x}$ on [0, 1]


Here, the integrand is singular at one of the end points. In this case, the or-
thogonal polynomial of degree n is $P_{2n}(\sqrt{x})$, where $P_{2n}(x)$ is the Legendre
polynomial of degree 2n. Corresponding to each positive zero $\alpha_j$ of $P_{2n}(x)$,
there is an abscissa given by $a_j = \alpha_j^2$, and the corresponding weight $H_j = 2h_j$,
where $h_j$ is the weight corresponding to $\alpha_j$ in the 2n-point Gauss-Legendre
formula. The truncation error in this case can be expressed as

$$E_n = \frac{2^{4n+1}\,[(2n)!]^3}{(4n+1)\,[(4n)!]^2}\, f^{(2n)}(\eta). \qquad (6.85)$$

6.5.4 $w(x) = \sqrt{x}$ on [0, 1]


In this case, the derivatives are singular at one of the end points. The cor-
responding orthogonal polynomial of degree n is $P_{2n+1}(\sqrt{x})/\sqrt{x}$ and the
abscissas and weights are given by

$$a_j = \alpha_j^2, \qquad H_j = 2\alpha_j^2 h_j, \qquad (j = 1, \ldots, n), \qquad (6.86)$$

where $\alpha_j$ are the positive zeros of $P_{2n+1}(x)$ and $h_j$ is the weight corresponding to $\alpha_j$ in the (2n+1)-point Gauss-Legendre
formula. The truncation error is given by

$$E_n = \frac{2^{4n+3}\,[(2n+1)!]^4}{(4n+3)\,[(4n+2)!]^2\,(2n)!}\, f^{(2n)}(\eta). \qquad (6.87)$$

6.5.5 $w(x) = \left(\frac{x}{1-x}\right)^{1/2}$ on [0, 1]


In this case, the weight function has a singularity at one end point, while at
the other end point the derivatives are singular. Here the orthogonal polyno-
mial of degree n is given by $T_{2n+1}(\sqrt{x})/\sqrt{x}$, where $T_{2n+1}(x)$ is the Chebyshev
polynomial of degree 2n + 1. In this case, the abscissas and weights are given
by

$$a_j = \cos^2\frac{(2j-1)\pi}{4n+2}, \qquad H_j = \frac{2\pi}{2n+1}\, a_j, \qquad (j = 1, \ldots, n), \qquad (6.88)$$

and the truncation error can be expressed in the form

$$E_n = \frac{\pi}{2^{4n+1}\,(2n)!}\, f^{(2n)}(\eta). \qquad (6.89)$$

6.5.6 $w(x) = e^{-x}$ on [0, ∞)


The corresponding orthogonal polynomials are the Laguerre polynomials $L_n(x)$,
and the abscissas are zeros of $L_n(x)$. For these polynomials $A_n = (-1)^n$ and
"in = (n!)2, and the weights are given by

H _ (n!)2aj
(j=l, ... ,n). (6.90)
J - [Ln+l(aj)J2'
It should be noted that two different normalisations for these polynomials are
used in the literature. Some authors prefer to define the Laguerre polynomials,
such that An = (-l)njn! and "in = 1, in which case, the expression for the
weights will also change. In either case, the truncation error is given by

E = (n!)2 j(2n)( ). (6.91)


n (2n)! TJ .

This quadrature formula is referred to as Gauss-Laguerre formula.

6.5.7 $w(x) = x^\alpha e^{-x}$, α > −1 on [0, ∞)


The corresponding orthogonal polynomials are the associated Laguerre polyno-
mials $L_n^\alpha(x)$, and once again there is an ambiguity in the definition. If we use
the normalisation $A_n = (-1)^n$, then $\gamma_n = n!\,\Gamma(n+\alpha+1)$. The abscissas will be
the zeros of $L_n^\alpha(x)$ and the corresponding weights and the truncation error are
given by

$$H_j = \frac{n!\,\Gamma(n+\alpha+1)\,a_j}{[L_{n+1}^\alpha(a_j)]^2}, \quad (j = 1, \ldots, n); \qquad E_n = \frac{n!\,\Gamma(n+\alpha+1)}{(2n)!}\, f^{(2n)}(\eta). \qquad (6.92)$$
The weights and abscissas for a Gauss-Laguerre quadrature formula can
be calculated using subroutine LAGURW in Appendix B.

6.5.8 $w(x) = e^{-x^2}$ on (−∞, ∞)


In this case, the corresponding orthogonal polynomials are the Hermite poly-
nomials $H_n(x)$ and the abscissas are zeros of $H_n(x)$. For these polynomials
$A_n = 2^n$ and $\gamma_n = \sqrt{\pi}\,2^n n!$. The weights and truncation error are given by

$$H_j = \frac{2^{n+1}\, n!\,\sqrt{\pi}}{[H_{n+1}(a_j)]^2}, \quad (j = 1, \ldots, n); \qquad E_n = \frac{n!\,\sqrt{\pi}}{2^n\,(2n)!}\, f^{(2n)}(\eta). \qquad (6.93)$$

This quadrature formula is referred to as Gauss-Hermite formula. The weights


and abscissas for a Gauss-Hermite quadrature formula can be calculated using
subroutine GAUHER in Appendix B.
The weights and abscissas of some of the common Gaussian formulae can
be found in the computer programs in Appendix B and in the exercises. More
extensive tables are available in Abramowitz and Stegun (1974), Kopal (1961),
Krylov (2005) and Stroud and Secrest (1966). For some of the formulae the weights
and abscissas can be calculated using subroutines GAULEG (Gauss-Legendre),
GAUJAC (Gauss-Jacobi), LAGURW (Gauss-Laguerre) and GAUHER (Gauss-
Hermite) in Appendix B. For these Gaussian formulae, it can be shown that all
weights are positive and hence the computed value of the integral tends to the
exact value as the number of points n → ∞. However, in most cases, it may not
be practical to use formulae with a very large number of points. In such cases,
we may be tempted to use composite rules. For example, consider the integral

$$\int_0^1 \frac{f(x)}{\sqrt{x}}\,dx = \int_0^{1/2} \frac{f(x)}{\sqrt{x}}\,dx + \int_{1/2}^{1} \frac{f(x)}{\sqrt{x}}\,dx. \qquad (6.94)$$

Here the first integral on the right-hand side can be easily transformed to
the standard form by substituting x = t/2, but the second one cannot be
transformed into that form. However, it should be noted that the second integral
does not have any singularity, and it can be easily evaluated using Gauss-
Legendre formula. Thus, if a satisfactory convergence is not achieved in the
formula, then we can subdivide the interval and apply the same formula to the
singular part, while the remaining part can be evaluated by Gauss-Legendre
formula. This process can be repeated till a satisfactory approximation has been
found. To check for satisfactory convergence the same integral can be evaluated
by using two different quadrature rules (say 8 and 16-point rules) and if the
two values differ by less than the required tolerance, we can assume that a
satisfactory approximation has been achieved. This procedure is implemented
in subroutine GAUSQ2 in Appendix B. The Kronrod extension to some of these
Gaussian formulae is known and can be used for checking the convergence of the
Gaussian rule. It should be noted that for weight functions which are singular
at both the end points, the above procedure of subdividing the range cannot
be used.
For weight functions with known orthogonal polynomials it is possible to
calculate the weights and abscissas using a method due to Golub and Welsch
(1969) which requires the coefficients in the recurrence relation for the polyno-
mial as well as the integral of the weight function over the associated interval.
The recurrence relation for orthogonal polynomials can be written in the form

$$p_j(x) = (a_j x + b_j)\, p_{j-1}(x) - c_j\, p_{j-2}(x), \qquad (j = 1, 2, \ldots, N), \qquad (6.95)$$

with $p_{-1}(x) \equiv 0$, $p_0(x) \equiv 1$,
where $p_j$ is the orthogonal polynomial of degree j. This equation can be written
in the matrix form

$$x\begin{pmatrix} p_0(x) \\ p_1(x) \\ p_2(x) \\ \vdots \\ p_{N-1}(x) \end{pmatrix} =
\begin{pmatrix}
-\frac{b_1}{a_1} & \frac{1}{a_1} & 0 & \cdots & 0 \\
\frac{c_2}{a_2} & -\frac{b_2}{a_2} & \frac{1}{a_2} & \cdots & 0 \\
0 & \frac{c_3}{a_3} & -\frac{b_3}{a_3} & \ddots & \vdots \\
\vdots & & \ddots & \ddots & \frac{1}{a_{N-1}} \\
0 & \cdots & 0 & \frac{c_N}{a_N} & -\frac{b_N}{a_N}
\end{pmatrix}
\begin{pmatrix} p_0(x) \\ p_1(x) \\ p_2(x) \\ \vdots \\ p_{N-1}(x) \end{pmatrix}
+ \begin{pmatrix} 0 \\ 0 \\ 0 \\ \vdots \\ \frac{1}{a_N}\, p_N(x) \end{pmatrix}, \qquad (6.96)$$

or equivalently in matrix notation as

$$x\,\mathbf{p}(x) = T\,\mathbf{p}(x) + \frac{1}{a_N}\, p_N(x)\,\mathbf{e}_N, \qquad (6.97)$$
where T is a tridiagonal matrix and $\mathbf{e}_N = (0, 0, \ldots, 0, 1)^T$. Thus $p_N(t_j) = 0$ if
and only if $T\mathbf{p} = t_j\mathbf{p}$, i.e., when $t_j$ is an eigenvalue of T. Thus the eigenvalues of T
are the abscissas of the Gaussian quadrature formula with N abscissas. Hence,
the abscissas can be calculated by finding the eigenvalues of T. To improve the
efficiency of calculations we can first apply a diagonal similarity transformation
to obtain the matrix $J = DTD^{-1}$, which is a symmetric tridiagonal matrix. It
can be shown that this is achieved when the matrix J has elements given by

$$j_{i,i} = \alpha_i = -\frac{b_i}{a_i}, \qquad j_{i,i+1} = j_{i+1,i} = \beta_i = \sqrt{\frac{c_{i+1}}{a_i\, a_{i+1}}}. \qquad (6.98)$$

Eigenvalues of J are same as those of T and thus these are the abscissas
of the required quadrature formula. These eigenvalues and the correspond-
ing eigenvectors can be easily calculated using the QL algorithm described
in Section 11.5. Further, it can be shown that (Golub and Welsch, 1969) if
$\mathbf{q}_j = (q_{1j}, q_{2j}, \ldots, q_{Nj})^T$ is the normalised eigenvector of J with $\mathbf{q}_j^T\mathbf{q}_j = 1$, then
the weight corresponding to the abscissa $t_j$ is given by

$$w_j = q_{1j}^2 \int_a^b w(x)\,dx = q_{1j}^2\, m_0, \qquad (6.99)$$

where mo is the value of the integral. Thus we can calculate the abscissas and
weights for quadrature formula of any order using this technique, provided the
moment mo and the coefficients of recurrence relations are known. This method
is implemented in subroutine GAUSRC in Appendix B.
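The Golub and Welsch construction is straightforward to sketch with a standard symmetric eigenvalue routine. The example below uses numpy.linalg.eigh on the matrix J of (6.98) and the weight formula (6.99); it is an illustration only, not the GAUSRC routine of Appendix B, and the Legendre recurrence coefficients used in the test ($\alpha_i = 0$, $\beta_i = i/\sqrt{4i^2-1}$, $m_0 = 2$) are quoted here for the example.

    import numpy as np

    def golub_welsch(alpha, beta, m0):
        # alpha: diagonal, beta: off-diagonal of the symmetric matrix J, Eq. (6.98);
        # m0 is the integral of the weight function over the interval
        J = np.diag(alpha) + np.diag(beta, 1) + np.diag(beta, -1)
        nodes, vectors = np.linalg.eigh(J)
        weights = m0 * vectors[0, :] ** 2        # Eq. (6.99)
        return nodes, weights

    # illustration: the n-point Gauss-Legendre rule (w(x) = 1 on [-1, 1], m0 = 2)
    n = 5
    i = np.arange(1, n)
    nodes, weights = golub_welsch(np.zeros(n), i / np.sqrt(4.0 * i * i - 1.0), 2.0)
    print(nodes)
    print(weights)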
An alternate procedure for calculating the weights and abscissas of a Gaus-
sian formula is based on direct solution of the relevant equations. Since the
n-point quadrature rule should be exact for $f(x) = 1, x, x^2, \ldots, x^{2n-1}$, we get
the following system of equations for the weights and abscissas

$$\int_a^b w(x)\, x^j\,dx = m_j = \sum_{i=1}^{n} H_i\, a_i^j, \qquad (j = 0, 1, \ldots, 2n - 1). \qquad (6.100)$$

Here it is assumed that the integrals involved in evaluating the moments $m_j$
can be expressed in a closed form or can be evaluated easily. The abscissas will
be the zeros of the polynomial

$$p(x) = \prod_{i=1}^{n}(x - a_i) = \sum_{k=0}^{n} c_k x^k, \qquad (c_n = 1). \qquad (6.101)$$

Hence, to find the abscissas we have to determine the coefficients $c_i$ of this
polynomial. Taking appropriate linear combinations of the first n + 1 equations
in (6.100), we get

$$\sum_{i=0}^{n} m_i c_i = \sum_{i=1}^{n} H_i\left(c_0 + c_1 a_i + \cdots + c_n a_i^n\right). \qquad (6.102)$$
Since the term in the parentheses is $\sum_{k=0}^{n} c_k a_i^k = p(a_i) = 0$, we have

$$\sum_{i=0}^{n} m_i c_i = 0. \qquad (6.103)$$

Using linear combinations of other equations in (6.100), we can get similar equa-
tions and noting that $c_n = 1$, we get the following system of linear equations
in the quantities $c_0, \ldots, c_{n-1}$:

$$\sum_{i=0}^{n-1} m_{k+i}\, c_i = -m_{k+n}, \qquad (k = 0, 1, \ldots, n - 1). \qquad (6.104)$$

If w(x) is nonnegative throughout the interval, then it can be proved that this
system of equations has a unique solution. Once the coefficients Ci are known the
polynomial can be solved to obtain the abscissas, using the methods described
in Section 7.11. After calculating the abscissas, the weights can be determined
by solving the first n equations in (6.100) for Hi (i = 1, ... , n).
It should be noted that the system of equations involved in finding the
coefficients and the weights are usually ill-conditioned when n is large. Hence,
it is difficult to solve these equations accurately and it may be advisable to use
double precision arithmetic for evaluating the abscissas and weights using the
algorithm described above. An implementation of this algorithm is provided
by subroutine GAUSWT in Appendix B. Other methods for evaluating the
weights and abscissas are described in Davis and Rabinowitz (2007).
If the weight function changes sign in the interval of integration, then it
may not be possible to obtain Gaussian formulae. However, it is possible to ob-
tain formulae of the Newton-Cotes type with fixed abscissas. Using n abscissas
it is possible to find a quadrature formula which is exact for all polynomials
of degree less than or equal to (n - 1). Such formulae can be found by inte-
grating the interpolating polynomial or by using the method of undetermined
coefficients, which amounts to solving the first n equations in (6.100).
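The moment-based construction can also be sketched directly, keeping in mind the warning above about ill-conditioning for large n. The code below is an illustration with names chosen here (it is not the GAUSWT routine): it solves (6.104) for the coefficients $c_i$, finds the abscissas as the roots of (6.101), and then solves the first n equations of (6.100) for the weights. As a check it is applied to w(x) = 1 on [−1, 1] with n = 2, where it should recover the two-point Gauss-Legendre rule derived in Section 6.3.

    import numpy as np

    def gauss_from_moments(m):
        # m[j] = integral of w(x) * x**j for j = 0, ..., 2n-1, Eq. (6.100)
        n = len(m) // 2
        A = np.array([[m[k + i] for i in range(n)] for k in range(n)])
        c = np.linalg.solve(A, -np.array([m[k + n] for k in range(n)]))   # Eq. (6.104)
        coeffs = np.concatenate(([1.0], c[::-1]))     # x**n + c_{n-1} x**(n-1) + ... + c_0
        abscissas = np.sort(np.roots(coeffs))
        V = np.vander(abscissas, n, increasing=True).T   # first n equations of (6.100)
        weights = np.linalg.solve(V, np.array(m[:n]))
        return abscissas, weights

    print(gauss_from_moments([2.0, 0.0, 2.0 / 3.0, 0.0]))   # nodes ~ +-1/sqrt(3), weights ~ 1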

6.6 Improper Integrals


Integrals whose range or integrand is unbounded are known as improper inte-
grals. Such integrals are defined as the limits of corresponding (proper) inte-
grals. For example

rOO f(x)
Jo
dx = lim
r~oo Jo
r
f(x) dx,

l
(6.105)
bf(x) dx = lim jb f(x) dx ,
a r---+-a+ r

where in the second integral f(x) could be unbounded in the neighbourhood


of x = a. In either case, if the limit does not exist, then the integral does not
exist in the Riemann sense.
For integrals over (−∞, ∞), there are two possibilities. The first and the
usual definition is

$$\int_{-\infty}^{\infty} f(x)\,dx = \int_{-\infty}^{0} f(x)\,dx + \int_0^{\infty} f(x)\,dx, \qquad (6.106)$$

where both integrals on the right-hand side can be independently evaluated by


taking the appropriate limits. This integral exists in the Riemann sense only
if both limits exist independently, but sometimes it is convenient to define the
so-called Cauchy principal value when both limits do not exist independently.
This value is defined by

$$P\int_{-\infty}^{\infty} f(x)\,dx = \lim_{r\to\infty}\int_{-r}^{r} f(x)\,dx, \qquad (6.107)$$

where the P in front of the integral sign designates that it is the Cauchy principal
value. If the integral exists in the usual Riemann sense, then both these values
will be identical.
Similarly, we can define the Cauchy Principal value for the case, when
the integrand has a singularity at an internal point. Suppose that a < c < b
and f(x) is unbounded in the neighbourhood of x = c, then we can define the
Cauchy principal value by

$$P\int_a^b f(x)\,dx = \lim_{r\to 0^+}\left(\int_a^{c-r} f(x)\,dx + \int_{c+r}^{b} f(x)\,dx\right). \qquad (6.108)$$

Here again if both limits exist independently, then the integral exists in the
Riemann sense and its value will be identical to the above limit. The principal
value often arises in integral transforms.
In numerical integration, even when the function is not actually singular,
but has a singularity close to the interval of integration, special effort may be
needed to get accurate results. As seen in Example 6.1, even when the deriva-
tives of the function are unbounded in the interval of integration, many of the
usual quadrature formulae converge very slowly. Most of the techniques con-
sidered in this section are applicable to such pseudo-singularities also. Further,
there are the apparent singularities, where the function is indeterminate, e.g.,
sin x/x or cot x − 1/x at x = 0. In both these cases, a straightforward use of the
expression will lead to an indeterminate form, but we know that the limit does
exist. Such apparent singularities do not cause any problem, provided the lim-
iting values are appropriately incorporated in the definition of the function. For
the second example, there could be large roundoff error near x = 0, but it can
be taken care of by using appropriate expansion near the apparent singularity.
Some of the techniques that can be used to tackle improper integrals are
as follows:
6.6.1 Ignoring the Singularity


This is the simplest technique (or rather the lack of it), where the quadrature
formula is used as if there is no singularity, and if the function value is un-
bounded at one of the abscissas it can be ignored. As seen in Example 6.1, this
brute force technique does lead to correct results in some cases, but the con-
vergence is usually very slow. This is certainly not a recommended technique,
but unfortunately, it is used quite often. This technique may not work if the
function oscillates rapidly near the singularity.
EXAMPLE 6.6: Consider the integral

$$\int_0^1 \frac{1}{x}\,\sin\frac{1}{x}\,dx = \int_1^\infty \frac{\sin x}{x}\,dx = 0.624713. \qquad (6.109)$$
Using composite Simpson's rule we get the following result

n 16 32 64 128 256 512 1024 2048 4096 8192


I .335 .916 2.312 1.695 -.608 1.218 .721 .318 .943 -1.060
It is quite clear that the values are oscillating wildly and it is impossible to guess the correct
value from these results. It should be noted that this integral does not exist if the oscillatory
factor is removed.

In some cases, this technique proves to be very effective. For integration
over an infinite interval, if the function falls off very rapidly the limits can be
effectively truncated. Using the trapezoidal rule with uniform spacing, this
technique leads to an infinite series, which can be summed by using an appropriate
number of terms. Thus

$$\int_{-\infty}^{\infty} f(x)\,dx = h \sum_{i=-\infty}^{\infty} f(ih). \qquad (6.110)$$

For example, consider

$$\int_{-\infty}^{\infty} e^{-x^2}\,dx = \sqrt{\pi} \approx 1.7724538509. \qquad (6.111)$$

Using the trapezoidal rule with h = 1 and 0.5 yields the values 1.772637 and
1.772453850905516, respectively. Further, because the function falls off very
rapidly, it requires only 9 and 23 terms, respectively. It should be noted that the
second value is accurate to 16 figures. The trapezoidal rule gives very accurate
results in cases where the odd order derivatives are equal at the two end points
{9}.
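The doubly infinite trapezoidal sum (6.110) is trivial to program, because the terms can simply be added outwards from x = 0 until they fall below a tolerance. The sketch below is an illustration (the cut-off tolerance is an arbitrary choice made here); it should reproduce the behaviour described for the integral (6.111).

    import math

    def trapezoid_infinite(f, h, tol=1e-16):
        # doubly infinite trapezoidal rule, Eq. (6.110); terms added outwards from 0
        total = f(0.0)
        i = 1
        while True:
            term = f(i * h) + f(-i * h)
            total += term
            if abs(term) < tol:
                break
            i += 1
        return h * total

    print(trapezoid_infinite(lambda x: math.exp(-x * x), 1.0))   # ~ 1.772637
    print(trapezoid_infinite(lambda x: math.exp(-x * x), 0.5))   # ~ sqrt(pi)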

6.6.2 Proceeding to the Limit


We can use the basic definition of the improper integral. Let $b > r_1 > r_2 > \cdots$
be a sequence of points that converges to a, for example, $r_n = a + (b-a)2^{-n}$.
Then

$$\int_a^b f(x)\,dx = \int_{r_1}^{b} f(x)\,dx + \int_{r_2}^{r_1} f(x)\,dx + \int_{r_3}^{r_2} f(x)\,dx + \cdots \qquad (6.112)$$
Each of the integrals on the right-hand side is proper and can be evaluated
by any of the usual quadrature formulae. The series can be truncated when
the last integral is less than the required tolerance. This convergence criterion
gives reliable results only when the series is converging rapidly. If the series
is converging slowly, some acceleration technique like Aitken's method or
the ε-algorithm can be employed to speed up the convergence. The sequence
$r_n$ must be chosen carefully, so that the series converges fast enough and also
it is easy to evaluate each of the integrals using a few points. This technique
can be considered as an extrapolation method, where x → a, and the limiting
value is obtained by extrapolation on values at one side of the limit.
Similarly, for integration over an infinite range we can write

$$\int_a^\infty f(x)\,dx = \int_a^{r_0} f(x)\,dx + \int_{r_0}^{r_1} f(x)\,dx + \int_{r_1}^{r_2} f(x)\,dx + \cdots \qquad (6.113)$$

and we can use $r_n = a + 2^n$. It may be noted that if the $r_i$ are chosen very close
to each other, then a large number of integrals may be required, and further it
is possible that one of the integrals may come out to be very small, even when
the sum of the remaining terms is much larger than the required accuracy, thus
leading to a spurious convergence. For example, if $r_n = n$, then the divergent
integral $\int_1^\infty dx/x$ yields a finite value. This may not happen when $r_n = 2^n$.

6.6.3 Change of Variable


A change of variable can often remove a singularity in simple cases. For example,
a change of variable $x = t^n$ transforms the integral

$$\int_0^1 \frac{f(x)}{x^{m/n}}\,dx = n \int_0^1 f(t^n)\, t^{n-1-m}\,dt. \qquad (6.114)$$
If m and n are integers, then the transformed integral will be regular. In general,
some caution is required, since it may happen that the new integral has a
singularity at some other point or the range may become very large or even
infinite.
A change of variable can also transform the infinite interval into a finite
one. If the new integrand is bounded, then the integral will reduce to a proper
integral which can be easily evaluated. However, quite often it turns out that
the new integrand is singular, in which case, we have merely exchanged one
type of improper integral for another. Some of the useful transformations are
as follows

$$x = -\log y: \qquad \int_0^\infty f(x)\,dx = \int_0^1 \frac{f(-\log y)}{y}\,dy,$$

$$x = a + \frac{1+y}{1-y}: \qquad \int_a^\infty f(x)\,dx = \int_{-1}^{1} f\!\left(a + \frac{1+y}{1-y}\right)\frac{2}{(1-y)^2}\,dy, \qquad (6.115)$$

$$x = \tan y: \qquad \int_{-\infty}^{\infty} f(x)\,dx = \int_{-\pi/2}^{\pi/2} \frac{f(\tan y)}{\cos^2 y}\,dy.$$
Squire has suggested a one-parameter family of transformations by splitting the
range [0, ∞) into two intervals [0, a] and [a, ∞) and applying the transforma-
tions y = x/a and y = a/x in the first and second intervals respectively, to
get

$$\int_0^\infty f(x)\,dx = a \int_0^1 \left[ f(ay) + y^{-2} f(a/y) \right] dy. \qquad (6.116)$$

In each of these cases, the transformed integral will be either unbounded or has
an indeterminate form at one of the end points. Hence, it may be better to use
an open type quadrature formula, which does not require function evaluation
at the end points. The Gauss-Legendre formulae could be quite useful in such
cases. Sometimes, it may be better to transform an integral over infinite range
into another with infinite range, but which converges much faster. For example,

$$x = e^y: \qquad \int_1^\infty \frac{f(x)}{x^\alpha}\,dx = \int_0^\infty f(e^y)\, \exp(-(\alpha - 1)y)\,dy. \qquad (6.117)$$

Here, the second integral converges faster and since the integrand falls of ex-
ponentially, it may also be possible to use the Gauss-Laguerre formula.
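As an illustration of the second transformation in (6.115), the sketch below maps [a, ∞) onto (−1, 1) and applies a Gauss-Legendre rule, which avoids evaluating the transformed integrand at the end point y = 1. The test integral $\int_0^\infty e^{-x}\,dx = 1$ and the point counts are choices made here, not taken from the text.

    import numpy as np

    def integrate_half_line(f, a, n):
        # x = a + (1+y)/(1-y) maps [a, inf) to (-1, 1); dx = 2/(1-y)**2 dy, Eq. (6.115)
        y, w = np.polynomial.legendre.leggauss(n)
        x = a + (1.0 + y) / (1.0 - y)
        return np.sum(w * f(x) * 2.0 / (1.0 - y) ** 2)

    for n in (10, 20, 40):
        print(n, integrate_half_line(lambda x: np.exp(-x), 0.0, n))   # tends to 1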

6.6.4 Truncation of the Interval


If the integral is not required to be evaluated to very high accuracy, it may
be possible to truncate the interval and omit the part near the singularity. For
example, suppose the singularity is at the lower limit x = a, and we can estimate
$I_r = \int_a^{a+r} f(x)\,dx$, (r > 0), easily. Then if $|I_r| < \epsilon$, we can neglect this part
and evaluate the integral over the interval [a + r, b], which has no singularity. It
will be much better if we can estimate the value of $I_r$ to the required accuracy,
which is often possible when the singularity is of some simple known form. For
example, consider the following integral

$$I_r = \int_0^r x^\alpha f(x)\,dx = f(0)\,\frac{r^{\alpha+1}}{\alpha+1} + f'(0)\,\frac{r^{\alpha+2}}{\alpha+2} + \cdots \qquad (6.118)$$

In this case, if a few terms of the Taylor series of f(x) are known, it may be
possible to estimate Ir accurately for a reasonable value of r. In some cases,
it may be possible to estimate the entire integral efficiently using the Taylor
series.
Similarly an infinite interval can be reduced to a finite interval by trun-
cating the range and ignoring the "tail". Of course, it will be better if the "tail"
can be estimated by some simple analytic means (see Example 6.9). In most
cases the range will be fairly large and the integrand will vary significantly over
the range. Hence, it may be necessary to use an adaptive technique to deal with
the resulting integral over finite range so that the step size can be adjusted as
per the function values in different regions.
6.6.5 Weakening the Singularity


In some cases, it may be possible to reduce the strength of the singularity by
subtracting out a function with a similar singularity, but whose integral is known.
For example

$$\int_0^1 x^\alpha f(x)\,dx = f(0)\int_0^1 x^\alpha\,dx + \int_0^1 x^\alpha\,[f(x) - f(0)]\,dx, \qquad (-1 < \alpha < 0). \qquad (6.119)$$

Here, the first integral, which is improper, can be evaluated easily in a closed
form, while in the second integral, since $x^\alpha(f(x) - f(0)) \sim x^{\alpha+1}$, the strength
of the singularity is reduced. If −1 < α < 0, then the new integrand is bounded,
but its derivatives are still unbounded. If a few more terms of the Taylor series
can be removed explicitly, the strength of the singularity will reduce further.
A similar technique is possible for integration over an infinite range. For
example,

\int_1^\infty \frac{\ln x}{x(x+1)}\,dx = \int_1^\infty \frac{\ln x}{x^2}\,dx - \int_1^\infty \frac{\ln x}{x^2(x+1)}\,dx . \qquad (6.120)
Here the first integral can be evaluated analytically, while the second one will
converge faster than the original integral.
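The algebraic case (6.119) is easy to mechanise. The sketch below (our own illustration) integrates the singular part analytically and the smooth remainder with Gauss-Legendre, which does not sample the end point x = 0:

```python
import numpy as np
from numpy.polynomial.legendre import leggauss

def weakened_singularity(f, alpha, n=32):
    """\int_0^1 x^alpha f(x) dx for -1 < alpha < 0, via Eq. (6.119):
    the part f(0) x^alpha is integrated analytically, the rest numerically."""
    y, w = leggauss(n)
    x = 0.5*(y + 1.0)                   # map [-1, 1] onto [0, 1]
    smooth = 0.5*np.sum(w * x**alpha * (f(x) - f(0.0)))
    return f(0.0)/(alpha + 1.0) + smooth

# weakened_singularity(np.cos, -0.5) approximates \int_0^1 cos(x)/sqrt(x) dx
```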

6.6.6 Using a Quadrature Formula with Weight Function


If w(x) is a weight function with a singularity at x = a, such that \int_a^b w(x)x^k\,dx
exists and can be easily evaluated for k = 0, 1, \ldots, then it is possible to find
a Gaussian or Newton-Cotes quadrature formula. For simple algebraic or logarithmic singularities, the Gaussian formulae are well known and the weights
and abscissas are either available in tabulated form, or can be easily calcu-
lated as described in the previous section. This method can be very effective
for simple singularities.
It is possible to use Gauss-Laguerre or Gauss-Hermite quadrature formulae
for integration over infinite intervals. These formulae converge for a wide variety
of functions, but the convergence could be quite slow, unless the integrand has
the right asymptotic behaviour. Thus, the Gauss-Laguerre formula converges very
fast if for large x the integrand can be approximated by the product of e^{-x}
and a polynomial in x. On the other hand, if the integrand is O(1/x^\alpha) (\alpha > 1) for large x,
then the convergence will be quite slow, and in this case it may be better to
apply the transformation x = e^y to transform the integral into the right form.
If the approximations to the integral do not converge, then we can split the
range into the two parts [0, a] and [a, \infty); the first part can be evaluated by the usual
formulae, while for the second part we can apply the Gauss-Laguerre formula
after a linear transformation x = a + t. The value of a can be increased until
a satisfactory approximation using Gauss-Laguerre formula is achieved. This
scheme is implemented in the Subroutine GAULAG in Appendix B. For the
Gauss-Hermite quadrature formula, it is not possible to break the range when
a satisfactory convergence is not achieved.
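The shifted Gauss-Laguerre estimate of the tail can be sketched as follows (a simplified illustration, not the actual GAULAG subroutine); comparing the values obtained for increasing a, or increasing n, indicates whether the scheme has converged.

```python
import numpy as np
from numpy.polynomial.laguerre import laggauss

def laguerre_tail(g, a, n=32):
    """Estimate \int_a^infty g(x) dx by the shift x = a + t and Gauss-Laguerre:
    \int_0^infty e^{-t} [e^t g(a+t)] dt is approximated by sum_i w_i e^{t_i} g(a+t_i)."""
    t, w = laggauss(n)                  # Gauss-Laguerre nodes and weights
    return np.sum(w * np.exp(t) * g(a + t))

# This works well when g falls off roughly like e^{-x}; otherwise the estimates
# for n = 8, 16, 32 will not settle and a different treatment is needed.
```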

6.6.7 h → 0 Extrapolation
If the singularity is of algebraic form, then it is possible to find the form of
truncation error in the trapezoidal rule, and using it we can perform the h → 0
extrapolation as explained in Section 6.2. This method is very effective for
algebraic singularities. For logarithmic singularities also the form of truncation
error is known, but it may be somewhat difficult to use it in an extrapolation
scheme, since terms proportional to In h will appear in the error expansion.
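A sketch of such an extrapolation (our own illustration, assuming the step is halved between successive trapezoidal estimates and the exponents of the error expansion are supplied by the user): each pass eliminates one term of the expansion.

```python
def extrapolate_h_to_0(T, exponents):
    """Generalised h -> 0 extrapolation.  T[j] is the trapezoidal estimate with
    step h/2^j; exponents lists the error-expansion powers, smallest first
    (for an algebraic singularity these are non-integer)."""
    T = [float(t) for t in T]
    for p in exponents:
        if len(T) < 2:
            break
        fac = 2.0**p
        # the combination (fac*T[j+1] - T[j])/(fac - 1) cancels the c*h^p term
        T = [(fac*T[j + 1] - T[j])/(fac - 1.0) for j in range(len(T) - 1)]
    return T[-1]
```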

6.6.8 Singularity off but near the Interval of Integration


As we have seen, the presence of singularity off but near the interval of inte-
gration also affects the accuracy of quadrature formulae. In such cases also,
any of the methods described above can be applied. In particular, it is also
possible to precompute special quadrature rules that are applicable in such cir-
cumstances. For example, if the integrand has a pole of order r at x = a* close
to the interval of integration, then we can consider quadrature formulae with
a weight function w(x) = 1/(x - a^*)^r. For singularities close to the interval of
integration adaptive integration routines described in Section 6.7 are also fairly
effective. In such cases some care will be required to reduce roundoff error while
evaluating the integrand near the singularity.

6.6.9 Rapidly Oscillatory Integrands


If the integrand has numerous maxima and minima over the range of integra-
tion, then even if the integrand is bounded its derivatives will be quite large
in magnitude. In such cases, it may be difficult to approximate the integral
accurately using the usual quadrature rules, unless a very large number of points
is used. Further, if the sign of the integrand is also oscillating, then there
may be substantial cancellation between the positive and negative terms in the
quadrature sum, leading to a large roundoff error. The principal examples of
rapidly oscillatory integrands can be found in the various integral transforms,
e.g., the Fourier transform:

\int_a^b f(x)\cos nx\,dx, \qquad \int_a^b f(x)\sin nx\,dx . \qquad (6.121)

In such cases, we are usually not interested in an isolated integral, but the entire
family of integrals with different values of the parameter n. For large values of
n it is possible to get an asymptotic expansion using repeated integration by
parts, provided the derivatives of f(x) are bounded. This technique will be
more effective than numerical integration for very large values of n.
An alternative is to integrate between the zeros of the integrand or the
rapidly oscillatory part of the integrand. Let a \le x_1 < x_2 < \cdots < x_p \le b, where
x_i are the zeros of the integrand; then we can compute the integrals over the
ranges [x_i, x_{i+1}] separately. Here it is advantageous to use a rule which uses the

value of the integrand at both the end points, since in that case, two terms can
be dropped from the quadrature sum without affecting the accuracy. A Lobatto
rule is quite convenient in such cases. For example, using a 5-point Lobatto
rule (two endpoints and three interior points) requires only three points per
subinterval, while the resulting quadrature formula is exact for all polynomials
of degree less than or equal to seven.
It will of course be better if the oscillatory part can be treated as a weight
function, but since this function changes sign in the interval, it may not be
possible to find Gaussian formulae. However, it is possible to find interpolatory
formulae of Newton-Cotes type with an oscillatory weight function, which leads
to Filon's method, where the interval [a, b] is divided into 2N subintervals
of equal length h = (b - a)/2N. Over each double subinterval, f(x) is approx-
imated by a parabolic interpolation at the three points. For a quadratic f(x)
the integrals can be computed explicitly using integration by parts. After some
manipulations, we get the following rules for the Fourier integrals (Davis and
Rabinowitz, 2007)

\int_a^b f(x)\cos kx\,dx \approx h\left[\alpha\,(f(b)\sin kb - f(a)\sin ka) + \beta C_{2n} + \gamma C_{2n-1}\right],

\int_a^b f(x)\sin kx\,dx \approx h\left[-\alpha\,(f(b)\cos kb - f(a)\cos ka) + \beta S_{2n} + \gamma S_{2n-1}\right],
\qquad (6.122)

where

C_{2n} = \tfrac{1}{2}\left[f(a)\cos ka + f(b)\cos kb\right] + \sum_{i=1}^{N-1} f(a+2ih)\cos k(a+2ih),

C_{2n-1} = \sum_{i=1}^{N} f(a+(2i-1)h)\cos k(a+(2i-1)h),

S_{2n} = \tfrac{1}{2}\left[f(a)\sin ka + f(b)\sin kb\right] + \sum_{i=1}^{N-1} f(a+2ih)\sin k(a+2ih),

S_{2n-1} = \sum_{i=1}^{N} f(a+(2i-1)h)\sin k(a+(2i-1)h),

\alpha = \alpha(\theta) = \frac{\theta^2 + \theta\sin\theta\cos\theta - 2\sin^2\theta}{\theta^3} \approx \frac{2}{45}\theta^3 - \frac{2}{315}\theta^5 + \frac{2}{4725}\theta^7,

\beta = \beta(\theta) = \frac{2\left[\theta(1+\cos^2\theta) - 2\sin\theta\cos\theta\right]}{\theta^3} \approx \frac{2}{3} + \frac{2}{15}\theta^2 - \frac{4}{105}\theta^4 + \frac{2}{567}\theta^6,

\gamma = \gamma(\theta) = \frac{4(\sin\theta - \theta\cos\theta)}{\theta^3} \approx \frac{4}{3} - \frac{2}{15}\theta^2 + \frac{1}{210}\theta^4 - \frac{1}{11340}\theta^6,

\theta = kh = k\,\frac{b-a}{2N} . \qquad (6.123)

Here the approximate expressions for \alpha, \beta and \gamma are useful for small values of
\theta, where the evaluation of the exact expressions may incur a heavy roundoff error.
It can be seen that if k = 0, the approximation to the cosine integral reduces
to Simpson's rule.
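For reference, the cosine rule of (6.122) is straightforward to program. The sketch below is our own Python illustration (the switch-over point between the exact coefficients and the series of (6.123) is an arbitrary choice); by construction it reduces to the composite Simpson rule when k = 0.

```python
import numpy as np

def filon_cos(f, a, b, k, N):
    """Filon estimate of \int_a^b f(x) cos(kx) dx using 2N panels, Eqs. (6.122)-(6.123)."""
    h = (b - a)/(2.0*N)
    theta = k*h
    if abs(theta) > 1.0/6.0:            # exact coefficients
        s, c = np.sin(theta), np.cos(theta)
        alpha = (theta**2 + theta*s*c - 2.0*s*s)/theta**3
        beta  = 2.0*(theta*(1.0 + c*c) - 2.0*s*c)/theta**3
        gamma = 4.0*(s - theta*c)/theta**3
    else:                               # series of (6.123), safe for small theta
        alpha = 2*theta**3/45 - 2*theta**5/315 + 2*theta**7/4725
        beta  = 2/3 + 2*theta**2/15 - 4*theta**4/105 + 2*theta**6/567
        gamma = 4/3 - 2*theta**2/15 + theta**4/210 - theta**6/11340
    x_even = a + 2.0*h*np.arange(N + 1)               # x_0, x_2, ..., x_{2N}
    x_odd  = a + h*(2.0*np.arange(1, N + 1) - 1.0)    # x_1, x_3, ..., x_{2N-1}
    C_even = (np.sum(f(x_even)*np.cos(k*x_even))
              - 0.5*(f(a)*np.cos(k*a) + f(b)*np.cos(k*b)))
    C_odd  = np.sum(f(x_odd)*np.cos(k*x_odd))
    return h*(alpha*(f(b)*np.sin(k*b) - f(a)*np.sin(k*a)) + beta*C_even + gamma*C_odd)
```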
For periodic functions, when the interval of integration extends over an
integral multiple of the period, the trapezoidal rule is very efficient {9}. For such
integrals the trapezoidal rule converges almost exponentially.
For integration over infinite interval also the range can be broken at the
zeros of the integrand. The integral can then be expressed as an alternating
infinite series. The convergence is usually slow, and it may be necessary to use
some acceleration techniques to find the sum efficiently (see Section 6.8). In
some cases, the convergence can be accelerated by using a technique similar to
weakening the singularity.

6.6.10 Cauchy Principal Value


Let a < c < b and suppose that the integrand f(x) is unbounded in the neighbourhood
of x = c, such that the Cauchy principal value of the integral exists. There is
no loss of generality in assuming c - a > b - c = r. We can write the integral as

\int_a^b f(x)\,dx = \int_a^{c-r} f(x)\,dx + \int_{c-r}^{c} f(x)\,dx + \int_{c}^{c+r} f(x)\,dx . \qquad (6.124)

Here the first integral is regular, while we can combine the last two by making
transformations t = c - x and t = x - c, respectively to get

\int_a^b f(x)\,dx = \int_a^{c-r} f(x)\,dx + \int_0^r \left[f(c+t) + f(c-t)\right] dt . \qquad (6.125)

It is possible that the new function h(t) = f(c + t) + f(c - t) has no singularity
at t = 0. However, in most cases, this function has an integrable singularity at
t = 0, which can be treated by the methods described earlier.
Similarly, for integration over an infinite range the integral can be split into
two parts, which can be combined after an appropriate transformation to yield

\int_{-\infty}^{\infty} f(x)\,dx = \int_0^\infty \left(f(x) + f(-x)\right) dx . \qquad (6.126)

The new integral should exist in the Riemann sense and can be evaluated by
any technique discussed earlier.
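A minimal sketch of the finite-range case (6.124)-(6.125), assuming a single singularity at c with c - a > b - c and an integrand that is smooth elsewhere; Gauss-Legendre is used for both pieces since it never evaluates the end points:

```python
import numpy as np
from numpy.polynomial.legendre import leggauss

def cauchy_principal_value(f, a, b, c, n=64):
    """Cauchy principal value of \int_a^b f(x) dx with a singularity at c,
    assuming c - a > b - c = r, following Eqs. (6.124)-(6.125)."""
    r = b - c
    y, w = leggauss(n)
    # regular part over [a, c - r]
    x1 = 0.5*(c - r - a)*y + 0.5*(c - r + a)
    part1 = 0.5*(c - r - a)*np.sum(w*f(x1))
    # symmetric part \int_0^r [f(c+t) + f(c-t)] dt; the pole cancels analytically,
    # but some roundoff must be expected from the cancellation very close to t = 0
    t = 0.5*r*(y + 1.0)
    part2 = 0.5*r*np.sum(w*(f(c + t) + f(c - t)))
    return part1 + part2

# e.g. cauchy_principal_value(lambda x: np.exp(x)/(x - 2.0), 0.0, 3.0, 2.0)
```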
If the integrand has several singularities, then it may be better to break the
range into several parts, so as to isolate the singularities. Each of the singularities
can then be treated separately. In general, it is convenient to have singularities
at one or both of the end points of the interval. The above methods are applicable
when the type and location of the singularity are known. If the function is very
complicated or is only defined implicitly, then it may not be possible to use any
of these methods. In such cases, we can only hope that some kind of adaptive
quadrature routine (see Section 6.7) will work.

Table 6.6: Evaluating \int_0^1 \frac{\ln(1/x)\cos x}{x^{6/7} + x^{8/9}}\,dx = 29.65340

Method                                    N        I
1. Ignoring the singularity               16385    11.84010
2. Proceeding to the limit                  340    29.61279
3. Change of variable                      1025    29.65340
4. Truncating the range (a = 0.001)       16385    29.65340
   (a = 0.01)                              2049    29.65340
5. Using weight function                      2    29.65318
                                              4    29.65340
6. ε-algorithm                             2049    29.69121
7. Adaptive integration (ADPINT)           3015    29.64354

EXAMPLE 6.7: Consider the following integral

\int_0^1 \frac{\ln(1/x)\cos x}{x^{6/7} + x^{8/9}}\,dx = 29.65340, \qquad (6.127)

We attempt to use each of the methods 1–7, described above, to evaluate the integral
to a relative accuracy of 10^{-6} and the results are summarised in Table 6.6. In this case,
it is difficult to characterise the singularity and hence some of the methods may not work
effectively. If the singularity is ignored, then the results converge as 1/n^{1/7}. Taking the limit
as r → 0 and using the ε-algorithm to accelerate the convergence proves to be fairly effective.
The roundoff error is fairly large in this method. A change of variable to x = t^{63} removes
the algebraic singularity, but the logarithmic singularity remains. Ignoring the logarithmic
singularity and using the Simpson's rule as before yields fairly good result, which shows that
a logarithmic singularity does not affect the convergence of the quadrature formulae very
badly. Truncating the range of the integral is not very easy in this case, since the series
expansion does not converge so fast and a large number of terms are needed, even when a is
fairly small. The denominator can be written as x^{6/7}(1 + x^{2/63}) and a part of the integrand
can be expressed as a series in x^{2/63}. For a = 0.01 about 60–70 terms are required for an
accuracy of 10^{-6}. The artifice of weakening the singularity also proves to be ineffective, since
a large number of terms will be required to reduce the strength of singularity appreciably.
The singular part can be treated as a weight function, since its sign does not change
over the interval. Thus, using a weight function w(x) = \ln x/(x^{6/7} + x^{8/9}) proves to be
very effective and it is possible to get an accuracy of seven decimal digits using only a 4-point
formula. Calculation of abscissas and weights of the Gaussian formulae requires the knowledge
of integrals of the form \int_0^1 w(x)x^n\,dx, which can be expressed in a closed form after a change
of variable x = t^{63}. For this integral the form of the error expansion is probably not known and it
will not be easy to use the h → 0 extrapolation. It may be expected that the first term in the
error expansion has the form h^{1/7}, but that improves the results only slightly. Of course, the
ε-algorithm can be used to accelerate the convergence of the trapezoidal rule, even when the exact
form of error expansion is not known. This technique may be useful for moderate accuracy,
but fails to converge if accuracy requirement is high. Subroutine ADPINT also fails to give
satisfactory accuracy for this integral, but the value returned is not too bad, considering the
fact that the program was not supplied with any information about the location or the nature
of singularity.
EXAMPLE 6.8: Consider the integral
\int_0^{1.57} \tan x\,dx = -\ln(\cos 1.57) = 7.1355010\ldots \qquad (6.128)

The integrand is continuous over the interval of integration, but there is a singularity
at x = \pi/2, which is close to the upper limit. In this case, ignoring the singularity results
in very slow convergence. Even with 16385 points, the composite Simpson's rule fails to give
an accuracy of six significant digits. The use of composite rule with 32-point Gauss-Legendre
formula gives somewhat better results, requiring a total of 2016 function evaluations for a
similar result. After this stage, the roundoff error starts dominating and there is no further
improvement. Evaluating tan x close to the singularity will involve a significant roundoff error.
Even a slight error in calculating the abscissas causes a significant error in the function values.
It is possible to use the technique of truncation of the interval by using the Laurent series
about x = \pi/2

\tan x = \frac{1}{\frac{\pi}{2} - x} - \frac{\frac{\pi}{2} - x}{3} - \frac{(\frac{\pi}{2} - x)^3}{45} - \frac{2(\frac{\pi}{2} - x)^5}{945} - \cdots \qquad (6.129)
These four terms are enough to give an accuracy of 10^{-6}, even when the integral is truncated
at x = 1. With this truncated range, the Simpson's rule converges rather fast. An alternative
procedure is to weaken the singularity by subtracting out the first term of the Laurent
series expansion from the function. In this case, the Simpson's rule still converges rather
slowly and the roundoff error is quite significant, since the evaluation of function near the
singularity requires subtraction of two nearly equal numbers. To improve the convergence, we
can combine the two techniques by weakening the singularity and at the same time, truncating
the range of integral using the Laurent series to estimate the portion near x = 1.57. Once
again truncating the range at x = 1 requires only 33 points to evaluate the integral using the
Simpson's rule, while using 4-point Gauss-Legendre formula, the same integral requires only
12 function evaluations. It may be noted that in this case, the method of approaching the
limit or the h → 0 extrapolation is not very effective.
EXAMPLE 6.9: Consider the following integral

\int_0^\infty \frac{x}{1 + e^x}\,dx = \frac{\pi^2}{12} = 0.82246703342411\ldots \qquad (6.130)

This integral is in a form suitable for the Gauss-Laguerre formula and the results of the
N-point formulae obtained using 53-bit arithmetic are as follows:
N 2 4 8 16 32
I 0.80527 0.82370 0.82245036 0.82246696824592 0.82246703343093
I' 0.90916 0.82205 0.82246708 0.82246703342411 0.82246703342411
Thus, the use of 32-point Gauss-Laguerre formula gives an accuracy of 10 significant figures.
An alternative strategy is to split the integral at x = 5 and evaluate the part over [0,5] using
Gauss-Legendre formula and estimate the remaining part using the Gauss-Laguerre formula
after a simple shift of origin. The results using 53-bit arithmetic are shown in the last line (I')
of the above table, where for simplicity we have used the same value of N in both formulae.
Instead of the Gaussian formula we can use composite Simpson's rule to estimate
the first part, which requires about 2000 points to achieve an accuracy of 14 significant
figures and further the roundoff error in 53-bit arithmetic does not permit such an accuracy
to be achieved. On the other hand, we have seen that 16-point Gauss-Legendre formula
using the same 53-bit arithmetic gives a result which is accurate to 14 figures, which clearly
demonstrates the power of Gaussian formulae when a very accurate result is required. The
second part over [a, \infty) can also be estimated as follows

I_a = \int_a^\infty \frac{x e^{-x}}{1 + e^{-x}}\,dx = \int_a^\infty x e^{-x}\left(1 - e^{-x} + e^{-2x} - \cdots\right) dx
    = (1+a)e^{-a} - (1+2a)\,\frac{e^{-2a}}{2^2} + (1+3a)\,\frac{e^{-3a}}{3^2} - \cdots \qquad (6.131)
For a = 5, 8 terms are enough to give an accuracy of 10^{-15}, while for a = 1 we may need
about 35 terms to get the same accuracy. In fact, the series is convergent, even for a = 0 and
the entire integral can be evaluated by summing the series
I = 1 - \frac{1}{2^2} + \frac{1}{3^2} - \frac{1}{4^2} + \cdots \qquad (6.132)

However, this series converges rather slowly and it will require 10^7 terms to get an accuracy
of 14 significant figures. Hence, it is much more efficient to evaluate the integral directly. We
can reverse the problem and claim that numerical integration offers an efficient way to sum
this series. In fact, as will be seen in Section 6.8 integration can be used quite effectively to
sum certain series.

6.7 Automatic Integration


In the preceding sections, we have considered several methods for numerical
integration of a function that can be evaluated at any given point inside the
range of integration. It will be ideal if we can combine these techniques to write
a computer program, which can efficiently integrate any given function to the
required precision. Of course, it is impossible to write such a program, since
all techniques of numerical integration depend on the value of the integrand at
a finite number of points. On the other hand, changing the function values at
a finite number of points does not change the integral of the function. Hence,
it is theoretically impossible to ensure accuracy of any algorithm based on a
finite number of function values. As shown in Section 1.2, it is always possible
to change the function, in such a way that the value at these points and conse-
quently the estimated value of integral will not change, but the integral of the
new function could be completely different.
Because of these limitations, in practice, the goal of automatic integration
is to write a computer program which can integrate a large fraction of functions
that are encountered in practical problems. Further, if it fails to find the result
to the specified accuracy, then it should exit in a reasonable time with an error
message. Of course, since the type of integrals that one user comes across are
usually different from what another user will come across, the reliability of such
programs varies from user to user. Hence, the programmer should try to ensure
that the program is applicable to as wide a range of integrals as possible. A
large number of automatic integration routines are available, and in fact, almost
all subroutines on numerical integration in Appendix B can be considered to
achieve this specification to some extent. Further, because of our inability to
estimate the actual error, most routines for automatic integration tend to be
conservative in their error estimate. As a result, the estimated error is often a
few orders of magnitude larger than the true error. Consequently, such routines
may indicate failure, even when the actual error is well within the requested
tolerance.
Several classifications of such routines are possible. For example, these
routines could be either iterative or noniterative. In iterative algorithms, a
sequence of approximations to the integral is obtained, and from this sequence
an error estimate is derived. The iteration is continued until the error estimate
is within the specified bounds. Further, if the iteration fails to converge in some
reasonable effort, then the program should exit with an error message. Most of
the automatic integration programs are iterative in nature. A simple iterative
algorithm is to use a composite rule with the number of points being doubled

at each step, and the process is continued until two successive values agree up
to the specified accuracy. On the other hand, in a noniterative algorithm, using
a fixed number of steps the error is estimated and based on this estimate, the
number of points for the final step could be selected. A simple noniterative
algorithm is obtained by considering two different estimates of the integral, the
first estimate R_1 is obtained using, say, the midpoint rule with n points, while
the second estimate R_2 is obtained using 2n points in the midpoint rule. Here
n could be some fixed number in the program. Now using the fact that the
truncation error in the midpoint rule for a regular function should be O(1/n^2), we
can estimate the number of points required to achieve the desired accuracy.
Using this number the final value can be calculated and no attempt is made
to verify whether the result is actually correct to the specified accuracy. The
disadvantage of this algorithm is that, if the function does not have bounded
derivatives up to the second order on or near the interval of integration, or if
the initial estimates R_1 and R_2 are very much off the mark, then the estimated
value of n for the final result will not be correct and an erroneous result may
be accepted.
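As a concrete sketch of the simple iterative strategy (our own illustration, written in the spirit of, but not copied from, subroutine SIMSON; the integrand is assumed to accept NumPy arrays):

```python
import numpy as np

def simpson_iterative(f, a, b, epsrel=1e-8, epsabs=0.0, nmax=2**20):
    """Composite Simpson rule with the number of subintervals doubled until
    two successive estimates agree to the requested tolerance."""
    n = 2
    fx = f(np.linspace(a, b, n + 1))
    old = (b - a)/6.0*(fx[0] + 4.0*fx[1] + fx[2])
    while n < nmax:
        n *= 2
        h = (b - a)/n
        fnew = f(a + h*np.arange(1, n, 2))          # only the new (odd) abscissas
        allf = np.empty(n + 1)
        allf[0::2], allf[1::2] = fx, fnew
        fx = allf
        new = h/3.0*(fx[0] + fx[-1] + 4.0*np.sum(fx[1:-1:2]) + 2.0*np.sum(fx[2:-1:2]))
        if abs(new - old) < max(epsabs, epsrel*abs(new)):
            return new, abs(new - old)              # estimate and a crude error measure
        old = new
    return old, float('inf')                        # no convergence within nmax points
```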
It is also possible to classify the automatic integration routines as adaptive
or nonadaptive. In a nonadaptive algorithm the integrand is evaluated at a
fixed sequence of points, while in an adaptive algorithm the points at which the
integrand is evaluated depends on the nature of the integrand. For example,
the simple algorithm of doubling the number of points at each step using a
composite rule is an example of nonadaptive algorithm, since the sequence of
abscissas used will be independent of the integrand, although the point at which
the algorithm is terminated depends on the function being integrated. A simple
adaptive algorithm will be obtained if the program can choose different spacing
in different regions, depending on the behaviour of the integrand. For example,
in region where the integrand has some singularity the number of points could
be larger, while in region, where the behaviour is smooth, only a few points may
be enough. Such algorithm can be expected to be more effective for functions
which have varying behaviour over the interval of integration.
Simplest subroutines for automatic integration are obtained by using com-
posite rules of a fixed type over a predetermined sequence of subintervals. Sub-
routines SIMSON and GAUSS in Appendix B are examples of such iterative
nonadaptive routines for automatic integration. The strategy used in these sub-
routines is to start with one subinterval covering the entire interval of integra-
tion and at every step divide each subinterval into two and apply the composite
formula using the same basic quadrature rule. The process is continued till two
successive values converge to the required accuracy. It is not necessary to use
a sequence of composite rules based on the same basic rule, but any sequence
of quadrature formulae with increasing precision can be used. For example, we
can use a sequence of 2k-point Gauss-Legendre formula (k = 1,2,3, ... ) or a
sequence of embedded Gauss-Kronrod-Patterson rule. The problem with such
sequences is that, the weights and abscissas associated with very high order
rules may not be known or easy to calculate. This process does not guaran-

tee that the error is actually less than the difference between two successive
values. In practice, if the convergence is rapid, then the difference usually over-
estimates the actual error, while for slowly converging situation the difference
usually underestimates the actual error.
For example, if the integrand is regular and the asymptotic convergence
rate is achieved, then the actual error will be about 1/15 of the difference
between two successive values in subroutine SIMSON. The main reason for
this overestimate is the fact that the last value in the sequence is usually the
most accurate and the difference between two successive values, say R_n and
R_{n+1}, gives a good estimate for the error in R_n rather than the error in R_{n+1},
which is the current best estimate for the integral. Some algorithms try to
correct for this overestimate and the calculation may be terminated when the
difference between two successive values is less than some predetermined factor
of the required accuracy. For example, in subroutine SIMSON we can use the
stopping criterion as |R_{n+1} - R_n| < 15\epsilon, where \epsilon is the required accuracy.
This criterion of course increases the efficiency of the algorithm significantly,
but is more likely to lead to spurious convergences i.e., a situation where the
algorithm indicates convergence to specified accuracy, while the actual error is
larger than the tolerance. The problem of spurious convergence is more serious
if the specified accuracy is rather low. Numerous examples of such spurious
convergence have been given by Clenshaw and Curtis (1960).
EXAMPLE 6.10: Consider the integrals

I_1 = \int_{-1}^{1} \left(\frac{23}{25}\cosh x - \cos x\right) dx = 0.4794282, \qquad (6.133)

For the first integral, using the Simpson's rule with three points gives the value S_1 =
0.4795546, while a composite rule using five points gives the value S_2 = 0.4795551. The
difference S_2 - S_1 = 5 \times 10^{-7}, while the actual error in S_2 is 0.0001269, which is more
than 250 times the difference. In fact, by changing the factor 23/25 the difference could be made
arbitrarily small. For example, if this factor is replaced by 0.920044, then the two values agree
to eight significant figures as S_1 \approx S_2 \approx 0.4796585, while the correct value of the integral is
I_1 = 0.4795316. The actual error in this case is about 10^5 times the difference.
A more striking example is provided by the second integral. If \alpha = 2.5, then the
Simpson's rule value S_1 and the composite Simpson's rule value S_2 using five points agree exactly, S_1 =
S_2 \approx 0.681481481481481. The correct value in this case is 0.679234683\ldots, giving an error
of 2.25 \times 10^{-3}. The same integral with \alpha = 2.25, but using the 2-point Gauss-Legendre formula,
gives similar results. The value G_1 obtained using this formula and G_2 using a composite
rule based on two subintervals agree exactly, G_1 = G_2 \approx 0.7422680412371134. The correct
value of the integral in this case is I_2 = 0.7441825583390339, with an error of 2 \times 10^{-3}. Once
again the same integral using \alpha = 2.562 gives G_2 = 0.6652376 and G_4 = 0.6652371, where
G_2 and G_4 are respectively the values obtained using the 2 and 4-point Gauss-Legendre formulae.
The difference G_2 - G_4 = 5 \times 10^{-7}, but the correct value of the integral is I_2 = 0.6648797.
This gives an error of 3.574 \times 10^{-4} in G_4, which is about three orders of magnitude larger
than the difference.

Any number of such examples can be easily constructed (see integral I_2 in
{24} for another interesting coincidence). These examples suffice to illustrate
that the difference between two successive approximations does not necessar-
ily give a reasonable estimate of the actual error, even for functions which are

perfectly regular in a reasonable neighbourhood of the interval. Hence, if relia-


bility of the program is very crucial, it may be better to use a criterion which
requires three successive values to be within the required tolerance. This also
cannot be a foolproof method, but it is more unlikely to find examples violat-
ing this criterion. Such a criterion is used in subroutine EQUIDS for multiple
integration.
If the specified accuracy is quite close to the accuracy of the arithmetic
used, or if there is a substantial roundoff error in evaluating the function, it
may not be possible to achieve the required accuracy. However, because the
roundoff errors are random in nature, there is a finite chance that the roundoff
errors may conspire to give a spurious convergence. This chance can also be
reduced if a more stringent criterion using three successive values is adopted.
If the successive approximations to the integral are printed out, it may be
easier to detect such spurious convergence, but then the algorithm may not
be considered as automatic. Alternately, the program can actually check if the
expected asymptotic convergence rate is achieved or not, by comparing values
obtained at a few of the most recent steps. Any departure from the expected rate
could be either due to roundoff error or due to a singularity in the integrand.
If the convergence pattern is regular, then it is most likely to be due to a
singularity; while if the pattern is irregular, then it is most likely to be due
to the roundoff error or some rapid oscillations in the function values, which
are not resolved by the number of points used. This technique is sometimes
referred to as "cautious" iteration, and is used in some routines for automatic
integration (De Boor, 1971a).
A simpler technique to detect the growth of roundoff error in iterative
methods is to compare the difference between successive values at various
stages. If after reducing to a reasonable value, the difference tends to increase
or fluctuate around the same value, then the roundoff error is most probably
dominating and no purpose will be served by continuing further. If roundoff
error is a serious problem, then an additional stopping condition can be intro-
duced into the integration routine. Once the difference between two successive
values is below a reasonable value, a flag can be set. Afterwards, if at any step
the difference tends to increase, then it may be presumed that roundoff error
is dominating and an appropriate error exit can be arranged. This technique
is used in Subroutine DRVT for numerical differentiation. Of course, there are
pitfalls in this criterion, since it may happen that even though the roundoff
error is dominating, the difference may not actually increase, thus allowing the
iteration to continue well past the roundoff stage. More serious problem with
this criterion is that, spurious convergence with relatively modest criterion may
set the flag and in the next step, when the difference turns out to be larger,
the process will be terminated with an error message, while if the iteration was
allowed to continue it might have converged to an acceptable accuracy.
There is also the problem of choosing the tolerance, the simplest criterion
test for either the relative or absolute difference between two or more approxi-
mations to the integral. Each of these criterion have their merits and demerits.

The absolute error criterion is quite useful, if the order of magnitude of the
integral is known in advance. However, frequently it happens that the mag-
nitude depends on some parameter and may vary over a wide range. In such
cases, it may be difficult to specify the required tolerance in absolute terms
and a relative error criterion may be preferred. The relative error criterion en-
counters difficulties if the integral vanishes, or there is a significant cancellation
in evaluating the quadrature sum. In such cases, it will be difficult to satisfy
any reasonable requirement on relative accuracy of the result. This problem is
also encountered, if for reasons of efficiency or otherwise the original integral is
split into several parts. For example, when the technique of approaching a limit
is used in evaluating an improper integral, or when the integrand has several
singularities, it is required to break up the original integral into several parts.
In such cases, if the magnitude of the integrals differ widely it will be meaning-
less to use a relative criterion. For example, if the integral is broken into two
parts, one of which is ~ 100 while the other part is ~ 0.01, it is meaningless
to require a relative accuracy of 10^{-6} for each of them, since when the sum is
formed, the error in the sum cannot be expected to be much less than 10^{-4} in
any case. Thus, the second part needs to be evaluated to a relative accuracy of
10^{-2} only. Alternately, we could evaluate both parts to an absolute accuracy of
10^{-4}. Because of this difficulty, most of the programs for automatic integration
require two tolerances to be specified. The computed value of the integral I
should satisfy the following error criterion

|I - I_t| \le \max(\epsilon_{abs},\ \epsilon_{rel}|I_t|), \qquad (6.134)

where I_t is the true value of the integral, and \epsilon_{abs} and \epsilon_{rel} are the absolute and relative tolerances, respectively. If \epsilon_{abs} = 0,
this criterion will reduce to a simple relative criterion, while if \epsilon_{rel} = 0 it will
reduce to a simple absolute criterion. It should be noted that, since the exact
value of the integral is not known in practical problems, the program only tries
to estimate this error. Hence, the theoretical error (e.g., the variable DIF in
most of the subroutines in Appendix B), is only a figure of merit which gives
an indication of the accuracy achieved. Actual error could be much smaller or
larger than the estimated error.
Automatic integration routines based on Newton-Cotes formulae en-
counter difficulties when the function is periodic, and the interval of integration
is an integral multiple of the period. For example, consider the integral

(6.135)

If we apply subroutine SIMSON to this integral with n = 64, it will exit with a
value of 1/2 for the integral, and the error estimate will be zero. This happens
because the function is equal to 1/2 at each of the first 65 points tried by the
subroutine and it naturally presumes that the function is constant. Some of the
automatic integrators are smarter, if they encounter such a situation, a random

value of x is used to evaluate the function and if the value of the function comes
out to be different, then the routine may break the integral into two parts at
this point, or use some other technique to subdivide the range in order to bring
out the nonconstancy of the integrand. However, such smartness is not of much
use, since if we give the same integral with n = 65 or n = 63, the routine may
not detect any problem and approximate the integrand with a slowly varying
function.
An alternative program for automatic integration could be based on some
acceleration technique, like the Richardson h → 0 extrapolation or the ε-algorithm. Subroutines ROMBRG and EPSILN in Appendix B are based on
this approach. Here the trapezoidal rule or some other simple rule is used to
generate the first approximations to the integral, which form the first column
of the corresponding T-table or ε-table. The convergence of this sequence
is usually quite slow, and extrapolation techniques are used to improve the
convergence. Here a two-dimensional array of approximations are generated and
it is not so easy to check for convergence. Ideally we expect the diagonal values
to converge fastest, and the higher columns to converge faster than the first
few columns. But in actual practice, because of roundoff error or application
of incorrect form of error expansion, the higher columns do not necessarily
converge faster. We can use a simple criterion, where the difference between
successive values in each column is found at every step. The minimum of this
difference is used as an error estimate for the value in higher row, which is taken
as the best estimate for the integral. A more stringent criterion will require
the value in more than one column or the diagonal to converge together. The
roundoff error could be detected by techniques mentioned earlier.
The routines described so far can be classified as iterative, nonadaptive
integration routines. If the nature of integrand changes within the interval of
integration, it may be better to use an adaptive technique, where the number of
points to be used in a given subinterval depends on the behaviour of the func-
tion. Thus, if the function has a sharp variation in some subinterval, a larger
number of points is used, while if the function is smooth in some other subin-
terval, only a few abscissas are used for integration. The subroutine ADPINT
in Appendix B is a simple adaptive integration routine, which is based on the
Gauss-Kronrod quadrature formulae. The basic idea here is to use two different
integration rules (say R_1 and R_2) to estimate the integral over any subinterval.
If the difference |R_1 - R_2| is less than the acceptable error, then the result
is accepted. If the difference is larger than the acceptable level, the interval is
subdivided into a fixed number of subintervals, and each of these subintervals is
treated similarly. If all the subintervals are successfully treated, then the value
of total integral is obtained, while if the specified maximum number of function
evaluations are reached before all the subintervals are exhausted, an error exit
will take place. There is another error exit when one of the subinterval becomes
too small, that is only a finite number of subdivisions of the original interval
are allowed. If the subinterval needs to be divided further, then an error exit
takes place. A discussion of various possible strategies for adaptive integration

is given by Rice (1975), who concludes that it is possible to write more than a
million different adaptive integration programs. The difference could be in the
basic rules used for integration, or in the error estimates and various stopping
criteria used in the program.
Two main types of adaptive strategies have been used: locally adaptive
and globally adaptive. Locally adaptive algorithms are characterised by deci-
sions to subdivide particular subregions which do not use results from other
subregions. A simple locally adaptive strategy is to start at one end and choose
the first subinterval which has error larger than the required tolerance for sub-
division. The subroutine ADPINT described below uses this strategy. Globally
adaptive algorithms choose regions for subdivision using information about all
of the current subregions. Simplest globally adaptive strategy is to subdivide
the subregion with largest estimated error, until the total error is within ac-
ceptable limits or the upper limit on the number of function evaluations has
been reached. This technique requires more memory, since the results over all
the subintervals need to be stored. On the other hand, in subroutine ADPINT
intermediate results are not stored, and as soon as an acceptable value over a
subinterval is obtained, the result is added to a running total. This technique
may reduce the efficiency or reliability of this subroutine as compared to more
sophisticated routines. One advantage of globally adaptive strategy is that in
the event of failure, when the limit on total number of function evaluations
is reached, the estimated integral using a globally adaptive strategy is usually
more accurate than what is obtained using a locally adaptive strategy. This
is because a locally adaptive routine may spend most of its time trying to
overcome some singularity in a small region. Further, the chances of failure are also
higher with locally adaptive routines. For a more detailed discussion of strate-
gies for adaptive integration, readers can refer to Davis and Rabinowitz (2007),
Rice (1975), De Boor (1971a, 1971b).
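For concreteness, a bare-bones locally adaptive integrator is sketched below. This is the classical adaptive Simpson scheme, used here only to illustrate the locally adaptive strategy; it is not the ADPINT algorithm, which is built on a Gauss-Kronrod pair and different bookkeeping.

```python
def adaptive_simpson(f, a, b, eps, depth=40):
    """Locally adaptive integration: compare the Simpson estimate over a subinterval
    with the two-panel composite estimate and subdivide wherever the difference
    exceeds the (locally halved) tolerance."""
    def simpson(fa, fm, fb, a, b):
        return (b - a)/6.0*(fa + 4.0*fm + fb)

    def recurse(a, b, fa, fm, fb, whole, eps, depth):
        m = 0.5*(a + b)
        flm, frm = f(0.5*(a + m)), f(0.5*(m + b))
        left  = simpson(fa, flm, fm, a, m)
        right = simpson(fm, frm, fb, m, b)
        if depth <= 0 or abs(left + right - whole) < 15.0*eps:
            return left + right + (left + right - whole)/15.0
        return (recurse(a, m, fa, flm, fm, left,  0.5*eps, depth - 1) +
                recurse(m, b, fm, frm, fb, right, 0.5*eps, depth - 1))

    fa, fm, fb = f(a), f(0.5*(a + b)), f(b)
    return recurse(a, b, fa, fm, fb, simpson(fa, fm, fb, a, b), eps, depth)
```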
Automatic integration routines are very useful for evaluating integrals,
when a few integrals of the same type need to be evaluated. If a large number
of similar integrals are to be evaluated, it may be more efficient to write a
special routine to take care of such integrals. The strategy for such a routine
may be selected after some experimentation. Another disadvantage of auto-
matic integration and in particular of adaptive routines has been pointed out
by Lyness (1983). If the integral has a parameter and its value is to be used in
some iterative method (e.g., for solution of some nonlinear equation involving
the integral, or in recursive evaluation of multiple integrals), then the use of
automatic integration could sometimes lead to non-convergence of the iterative
process. This happens because depending on the value of the parameter, the
routine may use different sets of points for evaluating the integral, thus giving
rise to a discontinuity in the value of integral as a function of the parameter.
This discontinuity will interfere in the iterative methods which naturally pre-
sumes that the integral is a smooth function of its arguments. If the integration
routine is successful, the magnitude of the discontinuities should be less than
the requested tolerance, and it can be reduced by adjusting the tolerance. How-

ever, it reduces the efficiency of the program. In such cases, it is better to use
a quadrature rule based on a fixed set of abscissas, since that will ensure that
the integral is a smooth function of its arguments. The rule could be selected
after some experimentation with various values of the parameters to ensure
that sufficient accuracy is achieved.
Having discussed various techniques for automatic integration, we will at-
tempt to address the difficult problem of the choice of the right technique for
a given problem. Before attempting numerical evaluation of any integral, it is
necessary to analyse the integrand to check for any singularity or near singu-
larity, discontinuity, as well as possible sources of roundoff error in evaluating
the integrand. It should be noted that in the absence of theoretical knowledge
about the integral, it may be difficult to ensure the reliability of the results from
even the most sophisticated automatic integration routines. In such cases we
can plot out the integrand over the range of integration to check for any nons-
mooth behaviour. If the evaluation of integrand is likely to have large roundoff
error over some part of the interval, then attempt should be made to minimise
this error. This problem can be usually tackled by rewriting the function, or
by using appropriate expansion or approximation in the critical regions. If the
integrand has discontinuities in the function value or its derivatives, then it will
be best to split the integral into several parts at the discontinuities. Discon-
tinuities in the middle of the interval almost always lead to slow convergence
for the quadrature formulae. An adaptive integration routine will be able to
isolate the discontinuities, but it may be more efficient if the range is broken
right in the beginning. If there are several singularities in the integrand, then
the range should be broken into several parts to isolate the singularities. If an
integral over infinite region has a singular integrand, then the singular part
should be isolated. Once the singularities are isolated, each part can be dealt
independently. Before proceeding further, it is necessary to ensure that each of
these integrals exist. This existence can be ensured, once the nature and type of
singularities are known. Hence, in subsequent discussion we presume that the
integrand has only one singularity, and if the range is infinite, then there is no
singularity in the integrand. In a few exceptional cases, it may be possible to
take care of two singularities simultaneously. If it is difficult to find the exact
location of the singularity or the discontinuity, then we can try an adaptive
integration routine, provided the singularity is not very strong.
If the range of integration is finite and the integrand has no singularity or
discontinuity in the neighbourhood of the interval of integration, and further
it is not rapidly oscillating function, then most automatic integration routines
should work reasonably well. If the function value is of the same order of magni-
tude throughout the region of integration, then even the nonadaptive routines
will work efficiently. If only a moderate accuracy is required, then simple sub-
routines based on the Simpson's rule, or preferably Romberg integration could
be fairly efficient. For high accuracy a routine based on high order Gaussian
formula is preferable. This routine could be used, even for moderate accuracy,
if a low order Gaussian formula can be used. A subroutine based on composite

rule using 32-point Gaussian formula will require a minimum of 96 function


evaluations to obtain the result, since the result has to be verified by compar-
ison with value obtained after subdivision. Most integrals can be evaluated to
moderate accuracy using much less number of function evaluations. Hence, use
of such high order formula in this case may be wasteful. If a large number of
similar integrals are required, it may be best to do some experimentation with
different routines to decide which one is most efficient and reliable. If the func-
tion is regular but varies by orders of magnitude over the interval, then it may
be best to either breakup the integral into several parts or to use an adaptive
integration routine. Adaptive integration routines can be used for all regular
integrands and may prove to be fairly efficient for moderate accuracy require-
ments. If very high accuracy is required, then the efficiency of adaptive routine
will depend on the strategy employed. For integrating periodic functions over
a complete period it may be more efficient to use the trapezoidal rule.
If the integrand is regular, but highly oscillatory, it will require a large
number of function evaluations using any of the straightforward methods. In
such cases, it will be best to use interpolatory formulae based on oscillatory
weight functions, e.g., Filon's formula for trigonometric functions. If the oscil-
latory part cannot be treated as a weight function, then we can try to integrate
between the zeros of the oscillating part. It is possible that the integrand may
not actually change sign, but the value may be oscillating because of some
oscillatory component {28}.
If the integrand is singular, but the singularity is not very strong, e.g., \ln x
or 1/\sqrt{x}, then the adaptive integration routines may be fairly effective for low
or moderate accuracy. Some of the more sophisticated adaptive routines take
care of such simple singularities. The subroutine ROMBRG in Appendix B has
a provision to supply the exponents of error expansion for the trapezoidal rule.
For simple algebraic singularities, where the error expansion is known, this rou-
tine could be fairly effective. Various technique to deal with singularities have
been discussed and compared in Section 6.6. It will be most efficient to use a
Gaussian formula with the singular part of the integrand as a weight function,

provided such formulae exist. If the weights and abscissas of such a formula are
not available, they can be calculated, provided the integrals \int_a^b w(x)x^n\,dx can
be easily evaluated for the required values of n. If these integrals themselves
have to be evaluated numerically, then the effort may be worthwhile, only if
a large number of integrals with the same weight function are required. Sub-
routine GAUSWT in Appendix B can be used for calculating the weights and
abscissas for any admissible weight function. The user has to supply the func-
tion routine to calculate the required integrals. However, the algorithm used in
this subroutine is ill-conditioned and for large values of n, the number of points
in the quadrature formula, the results may not be reliable. It is advisable to
use this subroutine in double precision arithmetic only. Even with 53-bit arith-
metic it may not be possible to obtain Gaussian formulae for n > 12. For most
problems such low order formula may be enough, since if the integral does not
converge, then we can always break it up at some suitable point, such that the

singular part can be evaluated using the weight function, while the remaining
part which is hopefully regular can be evaluated using any other technique. If
the recurrence relation for corresponding orthogonal polynomials is known it
will be more efficient to use subroutine GAUSRC. This algorithm is also more
stable and it would be possible to calculate moderately high order Gaussian
formulae. If the singular part of the integrand is not an admissible weight func-
tion for Gaussian formulae (e.g., it may change sign in the interval), then we
can consider interpolatory quadrature formulae of Newton-Cotes type. These
formulae can be easily derived using the method of undetermined coefficients.
Of course, it is not advisable to use a high order formula of this type. Subroutine
GAUSWT can be used to calculate weights for such formulae also.
For integration over infinite interval, techniques described in Section 6.6
can be used. In general, if the integrand is falling off exponentially, it may be
quite efficient to use the Gauss-Laguerre quadrature formula to estimate the
entire integral. If it is not possible, then the interval can be broken into two parts
and the finite part can be evaluated by usual methods while the infinite "tail"
can be estimated using the Gauss-Laguerre formula. However, if the integrand
does not fall off exponentially, then the Gauss-Laguerre formula will be totally
ineffective, and in that case, the method of approaching the limit or weakening
the singularity can be used. If it is possible to estimate the "tail" of the integral
by some means, then the technique of truncating the range can be employed.
Alternately, a change of variable can transform the integral to one over a finite
interval, but the resulting integrand may be singular. A change of variable may
also be used to transform the integral into a more rapidly converging integral
over infinite interval.
If the function is known only in the form of a table of values, it is most
convenient to use the Newton-Cotes formulae, or to integrate the cubic spline
approximation as discussed in Section 6.1.

6.8 Summation
In this section, we consider the problem of summation of series, both finite as
well as infinite. If the series has only a few terms or if the sum is converging
very rapidly, then it is straightforward to find the sum. However, if the series
is slowly converging, then a large number of terms may be required to find an
accurate value of the sum. Apart from requiring a large amount of computer time,
the roundoff error will also increase. In such cases, it is desirable to use some
acceleration techniques similar to those used for calculating the integrals. The
ε-algorithm can be easily applied in such cases. Thus, using the partial sums
for 2^k n terms (k = 0, 1, 2, \ldots, and n \gg 1) to get the first column of the ε-table,
we can construct the higher columns, which will hopefully converge faster.
Alternately, we can accelerate the convergence of the series by subtracting out
a series with known sum, similar to the method of weakening the singularity
for integrals. For a slowly converging alternating series, we can use the Euler's
transformation to accelerate the convergence. For certain series, we can use the

Euler-Maclaurin sum formula to approximate the series by an integral which


can be evaluated easily. We will now discuss each of these methods in some
detail.
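For reference, a compact sketch of the ε-algorithm applied to a list of partial sums (our own illustration; subroutine EPSILN in Appendix B plays a similar role for sequences of quadrature estimates):

```python
def epsilon_algorithm(S):
    """Wynn's epsilon-algorithm applied to the partial sums S; the even columns
    of the epsilon-table provide the accelerated estimates."""
    cols = [[0.0]*len(S), [float(s) for s in S]]        # eps_{-1} and eps_0 columns
    while len(cols[-1]) > 1:
        prev2, prev = cols[-2], cols[-1]
        if any(prev[j + 1] == prev[j] for j in range(len(prev) - 1)):
            return prev[-1]                             # successive entries equal: stop
        cols.append([prev2[j + 1] + 1.0/(prev[j + 1] - prev[j])
                     for j in range(len(prev) - 1)])
    best = cols[-1] if len(cols) % 2 == 0 else cols[-2] # last even column
    return best[-1]

# e.g. feeding the partial sums of 1 - 1/2 + 1/3 - ... accelerates them towards ln 2
```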

6.8.1 Euler-Maclaurin Sum Formula


The Euler-Maclaurin sum formula (6.26) can be used to approximate the sum
by an integral. Here summation on the right-hand side involves the Bernoulli
numbers which ultimately increase exponentially with k and the series is di-
vergent, unless the higher derivatives are all zero or equal at both the ends. In
most cases, this formula leads to an asymptotic series for the sum and if used
properly gives an effective method to sum quite a few slowly converging series.
If a_0 is selected properly, then the terms of the series on the right-hand side will
at first decrease, but will ultimately start increasing without bound. In most
cases, we get an alternating series and it can be shown that the truncation error
is bounded by the first term that is neglected. Hence, if the sum is truncated
just before the smallest term, maximum accuracy will be obtained. Thus, a_0
should be chosen such that the smallest term is less than the required accuracy.
For an arbitrary value of a_0 the formula may not give sufficient accuracy.
Hence, it may be necessary to sum a few terms of the original series separately
and apply this formula to the remaining terms, so as to obtain the right value
of a_0. This formula can be used to sum infinite series, provided the required
derivatives of f vanish at infinity or approach a finite limiting value.
EXAMPLE 6.11: Estimate the following sum using the Euler-Maclaurin formula
\zeta(1.5) = \sum_{j=0}^{\infty} \frac{1}{(1+j)^{1.5}} . \qquad (6.136)

This is the Riemann zeta function, which yields a very slowly converging series: about
10^{12} terms will be required to find the sum to an accuracy of 10^{-6}, even though beyond
j = 10^4 all terms are smaller than 10^{-6}. Now using the Euler-Maclaurin sum formula with
a_0 = 1, h = 1 and f(x) = 1/x^{1.5}, we get

\zeta(1.5) = \int_1^\infty x^{-1.5}\,dx + \frac{1}{2} + \sum_{k=1}^{m} \frac{B_{2k}}{(2k)!}\left(-f^{(2k-1)}(1)\right) + E_m
           = 2 + 0.5 + \sum_{k=1}^{m} \frac{B_{2k}}{(2k)!}\,\frac{3\cdot5\cdot7\cdots(4k-1)}{2^{2k-1}} + E_m . \qquad (6.137)
Using the values of the Bernoulli numbers we get the following results for the partial sums S_m
using m terms of the series on the right-hand side:
m:   0    1     2      3      4      5      6     7     8     9    10
S_m: 2.5  2.625 2.6068 2.6175 2.6044 2.6311 2.549 2.898 0.913 15.4 -117.2
The values of S_m are oscillating about the true value. After k = 3, the terms in the series
start increasing and the best result is obtained when m = 2 after which the error increases.
Thus, it is not possible to get an accurate value using this asymptotic series. However, the
convergence can be improved if we increase the value of ao, since that will decrease the values
of the derivatives. Thus, if the first eight terms are summed explicitly, we get
\zeta(1.5) = \sum_{j=1}^{8} \frac{1}{j^{1.5}} + \sum_{j=0}^{\infty} \frac{1}{(9+j)^{1.5}} = 1.926676679707 + S' . \qquad (6.138)

The second sum can now be determined using the Euler-Maclaurin formula (with a_0 = 9),
giving very fast convergence because of the additional factor of 1/9^{k+1.5} in the kth derivative
of the function. We get the following results for the partial sums
m: 0 1 2 3 4
S_m: 2.61186186 2.61237627 2.612375342047 2.612375348784 2.612375348683

Here the error in the last value is less than 10^{-11}. Thus, using a total of 13 terms we can get an
accuracy of 10^{-11}, which would require 10^{22} terms of the original series. This is a phenomenal
improvement over the brute force method. The improvement will be even more marked for
\zeta(\alpha) with \alpha closer to one. Thus, for \alpha = 1.01 it will require 10^{1000} terms of the usual series
to get an accuracy of 10^{-10}, but using the Euler-Maclaurin formula appropriately, the same
accuracy can be achieved using about 12 terms.
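The calculation of this example is easy to reproduce. The sketch below is our own illustration, with the first few Bernoulli numbers hard-coded: it sums the first n_0 - 1 terms explicitly and applies the Euler-Maclaurin formula to the tail starting at a_0 = n_0.

```python
import math

def zeta_euler_maclaurin(s, n0=9, m=4):
    """zeta(s) for s > 1: explicit sum of 1/j^s for j < n0, plus the Euler-Maclaurin
    estimate of the tail starting at a0 = n0 (in the manner of Eqs. 6.137-6.138)."""
    B2k = [1/6, -1/30, 1/42, -1/30, 5/66]          # Bernoulli numbers B_2, B_4, ...
    head = sum(j**(-s) for j in range(1, n0))
    a0 = float(n0)
    tail = a0**(1 - s)/(s - 1) + 0.5*a0**(-s)      # integral of x^{-s} plus f(a0)/2
    for k in range(1, m + 1):
        # for f(x) = x^{-s}:  f^{(2k-1)}(a0) = -s(s+1)...(s+2k-2) a0^{-(s+2k-1)}
        deriv = -math.prod(s + i for i in range(2*k - 1))*a0**(-(s + 2*k - 1))
        tail -= B2k[k - 1]/math.factorial(2*k)*deriv
    return head + tail

# zeta_euler_maclaurin(1.5) reproduces 2.612375348... with an error around 10^{-11}
```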

6.8.2 Summation of Rational Functions


We shall now consider the method of subtracting out a known series to acceler-
ate the convergence of a slowly converging series. This method is very similar
to the method of weakening the singularity for improper integrals, and is most
useful in summing a series, where the nth term is a rational function of n. For
this method we need a class of standard series, which can be easily summed.
This is provided by the so-called factorial function.
For positive integral values of n the factorial function can be defined by

x^{(n)} = x(x-1)(x-2)\cdots(x-n+1), \qquad x^{(0)} = 1 . \qquad (6.139)

This gives n^{(n)} = n!, which justifies the nomenclature. The forward difference
of this function is given by

\Delta x^{(n)} = (x+1)^{(n)} - x^{(n)} = x(x-1)\cdots(x-n+2)\left[(x+1) - (x-n+1)\right] = n\,x^{(n-1)} . \qquad (6.140)

This is a very interesting property of the factorial function, because of which it


plays the same role with finite differences as that played by the power function
for ordinary differentiation. Using this property, we can easily sum any series
involving the factorial function by writing
\sum_{x=M}^{N} x^{(n)} = \frac{1}{n+1}\sum_{x=M}^{N} \Delta x^{(n+1)}
                       = \frac{1}{n+1}\sum_{x=M}^{N}\left[(x+1)^{(n+1)} - x^{(n+1)}\right]
                       = \frac{(N+1)^{(n+1)} - M^{(n+1)}}{n+1} . \qquad (6.141)
Since any polynomial in n can be expressed as a linear combination of the
factorial function, this result can be used to find the sum of any series, where
the nth term is a polynomial in n.

EXAMPLE 6.12: Sum the series


\sum_{j=1}^{n} j^4 . \qquad (6.142)

We can write
j^4 = j^{(4)} + 6j^{(3)} + 7j^{(2)} + j^{(1)} . \qquad (6.143)
Using (6.141) the sum can be written as

\sum_{j=1}^{n} j^4 = \frac{(n+1)^{(5)}}{5} + \frac{6(n+1)^{(4)}}{4} + \frac{7(n+1)^{(3)}}{3} + \frac{(n+1)^{(2)}}{2}
                   = \frac{n(n+1)(2n+1)(3n^2+3n-1)}{30} . \qquad (6.144)
This sum can also be evaluated using the Euler-Maclaurin sum formula. In this case, since
the higher derivatives vanish, the resulting series is finite and can be easily summed.

The definition of factorial function can be generalised to negative integral


values of n also by

x(n) = 1 (n < 0). (6.145)


(x - n)(-n) ,

In particular, note that o(n) -I 0 when n is a negative integer. It can be shown


that (6.140) holds for negative values of n also and hence for n -I -1, (6.141)
also holds. The factorial function can be used to accelerate the convergence of a
slowly converging series by subtracting it from the given series. This technique
will be illustrated in Example 6.13. Instead of factorial functions we can use
the Riemann zeta function as a known series:
1 2 1
L
7T6
L
00 00

((2) = n2 = ~ , ((6) = n6 = 945'


n=l n=l
(6.146)
1 7T4 1 7T8
L L
00 00

((4) = n4 = 90' ((8) = n8 = 9450'


n=l n=l

If the sum is required over only odd terms, in the above series, then we can use

oc 1 1
~ (2n - 1)0 = ((a)(l - 20)' (6.147)

Many more series can be found in Abramowitz and Stegun (1974).


EXAMPLE 6.13: Sum the following infinite series
1
S = LJ
00

j=l
-'2--
+1
~ 1.0766740474686. (6.148)

This series can be summed directly, but the convergence is slow, requiring about 10 6
terms to get an accuracy of 10- 6 . To accelerate the convergence, we can subtract an appro-
priate series of factorial function from it. Since the terms in this series are asymptotically
6.S. Summation 233

Table 6.7: Summation of infinite series

Series E = 10- 6 E = 10- 12


N 8 N 8

8 1000 1.075675 1000000 1.076673047470


1 + 81 100 1.076625 10000 1.076674042470
((2) - 82 32 1.076684 1000 1.076674047801
((2) - ((4) + 83 10 1.076673 100 1.076674047449

0(1/ j2), we should choose a factorial function of the form (x - m)< -2), where m could be
any integer. For example, using

8' = L(j - 1)(-2) = L -.-.1-


00 00

= (0)<-1) = 1, (6.149)
j=1 j=1 J(J + 1)
we get

8 = 8' +L 00 (1 1)
-+- - - -
+- = 1+
L
00

+
j-l
= 1 + 8 1. (6.150)
J=1 J'2 1 J.( J. 1) J=1
'C J + 1)
J J 1)("2

The new series 81 converges faster requiring about 103 terms for the same accuracy. The
convergence can be improved still further, by subtracting the series with terms 1)< -3) (j -
from the new series. This process can be continued any number of times to improve the
convergence, but the algebra will soon become messy.
In this case, it turns out that instead of factorial functions, we can use the Riemann
zeta function to accelerate the convergence. Thus, using the series ((2), we get
1 71"2
LJ
00

8 = ((2) - '2("2 + 1) = (3 - 82. (6.151)


j=1 J

The new series 82 converges even faster than 81 requiring only about 100 terms for an
accuracy of 10- 6 . This is a considerable improvement over the original series. We can improve
the convergence still further by subtracting out another series of the same form from 82. Thus,
using ((4) we get

(6.152)

The last series 83 converges even more rapidly, requiring only about 15 terms to get the same
accuracy.
Apart from improving the convergence, this technique enables a more reliable test of
convergence. The simple convergence criterion for terminating the summation when the new
term is less than the specified accuracy, is not strictly valid for any of the series considered.
The infinite number of terms that are neglected will add up to much more than the last
term added. As a rough estimate if E is the last term added to the series, the truncation
error will be of the order of E1/2, E2/3, E 3/ 4 and E 5/ 6 for the series 8, 81, 82 and 83,
respectively. Hence, unless a more sophisticated convergence criterion is used, the error will
be more than expected, further the slower the convergence of series the more inaccurate the
simple convergence criterion will be. In general, it is better to check the difference 8 n - 82n
for convergence, where 8 n is the sum using n terms. Even if a proper convergence criterion
is used, the new term to be added may become so small that the floating-point addition may
234 Chapter 6. Integration

not change the value of sum. Depending on the word length of the computer being used,
this problem may arise long before the required accuracy is achieved. For example, using a
24-bit arithmetic for the series S , the new terms will become too small to change the sum
for j greater than approximately 4000, even though 10 6 terms are required for an accuracy
of 10- 6 . Hence, using this series it is not possible to achieve an accuracy of better than fjl/2.
On the other hand, using the series S3, it may be possible to achieve an accuracy of the order
of 1i5 / 6 .
The results of actual computation using the simple convergence criterion with an error
requirement of 10- 6 and 10- 12 are shown in Table 6.7. The calculations were performed
using a 24-bit and 53-bit arithmetic, respectively. It can be seen that in all cases, the number
of points used (N) is less than what is actually required for the specified accuracy. It can be
seen that none of the values are accurate to the specified accuracy, but the estimated error
is quite close to the actual value for S3, which is the fastest converging series. Hence, clearly
a better convergence criterion is required.

6.8.3 Euler Transformation


The Euler's transformation is very useful for accelerating the convergence of
alternating series. We shall obtain the generalisation of Euler's transformation,
which is useful for summing finite series with oscillating terms and which re-
duces to the Euler's transformation for the infinite case. Consider the series

(6.153)

where Vi are generally, but not necessarily positive. We can define the function

(6.154)

Then

(1 + x)Sn(x) = Vo - (VI - VO)X + (V2 - vdx 2 - ...


+ (-l)n(vn - vn_dxn + (_l)nvnxn+l
(6.155)
= Vo - (~Vo)x + (~vdx2 - ...
+ (_l)n(~vn_dxn + (-l)nVnxn+1.
This yields

Sn(x) =
Vo + (_1)nvnxn+l -y [~vo-(~vdx+···+ ( -1) n- 1( ~Vn-l )Xn- 1] ,
l+x
(6.156)
where y = l~x' This transforms the original series to another series involving
the forward differences of the terms. Applying the same transformation once
again to the bracketed series, we get

S (x) = Vo + (_1)nvnxn+l _ ~vo + (_l)n-l(~vn_dxn


n l+x l+x Y
+y2 [~2VO _ (~2vdx + (~2V2)x2 - ... + (_1)n-2(~2Vn_2)Xn-2] .
(6.157)
6.8. Summation 235

Here 6 2 is the second-order forward difference (see Section 4.2). Applying this
transformation repeatedly to the series in the bracket will transform the original
series to a series of forward differences. After m (m :::; n) transformations, we
get

Vo - y6vo + y262vo - ... + (_1)m~lym~16m~lvo


Sn(x) = - - - - - - - - - - - - - - - - -
l+x
(-1)n [Vnxn+1 + (6vn~dxny + ... + (6m~lVn~m+dxn~m+2ym~ll
+~~~------~----~--~--------------~------~
l+x
+(_1)mym [6mvo - (6 m vdx + ... + (-1)n~m(6mvn~m)xn~ml.
(6.158)
Set x = 1, y = 1/2 to obtain the required sum

Sn = ~vo - ~6vo + ~62vO - ... + (-1)m~12~m6m~lvo


+( _l)n [~Vn + ~6Vn~1 + ~62Vn~2 + ... + 2~m6m~lVn~m+d
+2~m( _l)m [6mvo - 6mV1 + 6mV2 - ... + (-1)n~m6mvn~ml.
(6.159)
This is the generalisation of the Euler's transformation for a finite series.
If the series is infinite and convergent, then we can take the limit as n -+ 00
and m -+ 00. Noting Vn -+ 0 and 6kVn~k -+ 0 for all finite k, we obtain the
Euler's transformation

(6.160)

This transformation can also be derived directly by appropriately splitting the


terms {36}. If the original series is slowly converging, then the differences will
usually be quite small and the transformed series should converge much faster.
However, the first few terms of the series may not show this behaviour and once
again it may be necessary to sum the first few terms explicitly and apply the
Euler's transformation to the remaining part to get

S = 2) -l)1Vj = p~l.
ex:;
2) -l)1Vj + (-l)P (12Vp -
. 1
46vp + 86
1 2
Vp - ...
)
.
j=O j=O
(6.161)
Here p should be chosen such that the differences start decreasing rapidly.
Euler transform converts the original series to series involving differences
and will be effective only if the differences tend to zero faster than the original
series. This will usually be the case if Vn is a smooth function of n. For example,
with the series in {39} the even and odd terms have different dependence on n
and the differences will not decrease with order. For such series Euler transform
will not be effective. In such cases, it will be better to combine the positive and
negative terms to get a series with terms which are of the same sign. The
236 Chapter 6. Integration

convergence of this series can be accelerated by subtracting an appropriate


series of factorial function as discussed in the previous subsection.
Euler's transformation gives a transformation from one infinite series to
another. It can be shown that if the original series is convergent, then the trans-
formed series is also convergent. However, the converse is not true and quite
a few divergent series can be transformed into convergent series by applying
this transformation {37}. Hence, before applying this transformation, it must
be ensured that the series is convergent.
EXAMPLE 6.14: Evaluate the following integral by integrating between the zeros and sum-
ming up the resulting series

1= foo
1
sinx dx = 0.624713256428.
x
(6.162)

The integrand has zeros at x = nn for n = 1,2,3, ... and the integral could be broken
at these points. We can define

Vo= f1
7r sinx d
- - x,
x
vj=(-l)J
. j(j+l)7r

j7r
sinx
--dx
x
(j :::: 1). (6.163)

Hence, the integral can be written as


00 00

1= L(-l)jvj = Vo - VI + L(-l)ivj. (6.164)


j=O j=2

This is a slowly converging alternating series. Using 50 terms of this series gives a result of
0.618347, which has an error of ~ 0.006. Further, the error goes down roughly as lin, where
n is the number of terms summed. Hence, to get an accuracy of 10- 6 will require about 10 6
terms, which is a prohibitive amount of calculation, since each term involves one integral.
The convergence of this series can be accelerated by applying the Euler's transforma-
tion. We can apply this transformation to the entire series, but it will be better if the first
two terms are summed separately. In fact, using Euler's transformation it is possible to get
an accuracy of 10- 6 by involving only 14 terms of the series. Using 14 terms and applying
the Euler's transformation after the first two terms, gives the value 0.6247131, while if the
Euler's transformation is applied to the entire series, then the same accuracy can be achieved
using 19 terms. Thus, just like the Euler-Maclaurin sum formula, in this case, also a judicious
choice of the term from which the transformation is applied can improve the efficiency. If
higher accuracy is required, it will be more efficient to sum a few more terms separately.
Thus. by summing the first eight terms separately, it is possible to get an accuracy of 10- 12
using 25 terms, while if only the first two terms are summed separately, then 31 terms are
required to achieve the same accuracy.

If the Euler's transformation is to be applied to a finite series, then it may


be more efficient to sum the terms at one or both the ends separately depending
on the relative variation of the terms. If the terms are slowly varying throughout
the series, then the Euler's transformation can be applied to the entire series.
But if the last few terms show significant variation, then it may be better to
sum these terms separately {40}.

6.9 Multiple Integrals


So far we have discussed numerical integration in one dimension, in this and the
subsequent sections we will consider the problem of numerical integration in
6.9. Multiple Integrals 237

two or more variables. For reasons similar to those mentioned at the end of Sec-
tion 4.8, this problem is much more difficult than integration in one dimension.
Ideally we will like to discuss this topic in some detail including the treatment
of difficult integrals involving singularities, oscillatory integrands, infinite range
and so on. But the algorithms for treating these difficulties are not so well devel-
oped for multiple integrals. Hence, our discussion will be essentially restricted
to the regular case, where the integrand is well behaved throughout the range
of integration. Further, most of our discussion is restricted to integration over
rectangular regions.
The difficulty arises because of several reasons. Firstly, in one dimension,
there are only three classes of intervals, i.e., the finite interval [-1, 1J, singly
infinite interval [0,(0) and the doubly infinite interval (-00,00). While in two
or more dimensions, there are infinite number of different regions, which cannot
be mapped into each other by simple linear transformations. Hence, for each
shape of the region of integration, we will require one rule for numerical integra-
tion. Apart from this, the behaviour of a function of several variables is more
complicated and theory as well as our intuitive feeling for such functions is very
limited. Of course, the evaluation of multiple integral is much more expensive,
because more effort is required to evaluate a function of several variables, and
the number of function evaluations required to achieve the specified accuracy
also increases with the number of dimensions.
The quadrature rules for one dimension can be used for multiple integrals
by defining the integral recursively or by using the so-called product rules. On
the other hand, we can develop integration rules which are exact for polynomials
up to a certain degree in several variables. These methods can prove to be
rather effective for integration in two or three dimensions. But for integration
over several dimensions or for integration over complicated regions, it is not
possible to ensure any reasonable accuracy using such rules and in those cases,
methods based on sampling may prove to be useful. In this section, we consider
the product rules, while the other methods will be considered in the subsequent
sections.
If the region of integration is sufficiently simple, then multiple integrals can
be evaluated recursively using the techniques for one dimension. For example,
a double integral can be written as

I = Ib
a
dx
jh(X)
fa(x)
f (x, y) dy = Ib
a
F (x) dx , (6.165)

where

(6.166)

In general, the quadrature formula used to evaluate the one-dimensional integral


could also be a function of x, which is the case if an adaptive integration
routine is used to evaluate the integral. The major advantage of using recursive
238 Chapter 6. Integration

evaluation is that adaptive routine can be used to handle resulting integration


in one dimension which can effectively deal with moderately singular integrals
also. If we assume that the same rule is used for all values of x, which is possible
only if the region of integration is rectangular. In this case, the integral can be
approximated as
m m n m

1= L H;F(aj)+E:" = LL H;Hkf(aj, ak)+ L H;En(aj)+E:". (6.167)


j=l j = l k=l j=l

Neglecting the error term we get a product rule for numerical integration in two
dimensions. This is just the Cartesian product of two rules in one dimension,
where the weights are the product of corresponding weights in one-dimensional
formula. This procedure can be extended to higher dimensions in a straight-
forward manner. If we construct a product rule for d dimensions based on an
n-point rule in one dimension, then it requires n d abscissas to achieve the same
degree of accuracy. Hence, the number of points required to achieve a specified
accuracy increases exponentially with d. This is the so-called dimensional effect
which precludes the possibility of estimating an integral over a large number of
variables accurately. Even if integral in one dimension can be estimated using
only 8-point formula , it will require 810 ~ 109 points to evaluate a similar inte-
gral in 10 dimensions. This is just within the scope of present day computers.
As will be seen in subsequent sections the Monte Carlo method and the method
based on equidistributed sequence do not show such a pronounced dimensional
effect. Hence, for integration over large number of dimensions such methods
may prove to be more efficient.
If the function is known only in the form of a table of values, then we can
approximate the function by interpolation or some other form of approximation
and integrate the approximating function. For example, we can interpolate the
function using B-spline basis functions. The expansion in terms of B-splines can
then be recursively integrated provided the region of integration is rectangular
in all dimensions. Subroutines BSPQD2 and BSPQDN in Appendix B provide
an implementation of this procedure for integration in 2 and n dimensions,
respectively.
If the region of integration is not rectangular, then it may not be possible
to define a product rule in a straightforward manner. But it may still be possible
to use the recursive definition of the integral. In programming languages which
allow recursion, the same subroutine for integration in one dimension can be
used recursively to calculate a multiple integral. But in Fortran which does
not allow recursion, we will have to use different copies of the same subroutine
(with different names). The number of copies required is equal to the number of
dimensions. Of course, we can use different subroutines for integration in each
dimension. For example, if 11 and 12 are two subroutines for integration in one
dimension, then 11 can be invoked with the function F, while the function F
itself can call 12 with function f, where the first argument of f could be passed
via a common block. However, the number of function evaluations required will
6.9. Multiple Integrals 239

be of the order of n 2 , where n is the typical number required for integration over
one dimension. Apart from this, there could be some difficulty in controlling
the truncation error. If an absolute criterion is required, then there may be no
difficulty. However, for relative error criterion, there could be some difficulty,
since initially we do not have any estimate for the integral. The problem here
is similar to that faced in an adaptive integration routine in one dimension and
we will not consider it further.
Let us consider a Cartesian product of two integration rules R1 and R 2 ,
which is denoted by R1 x R 2. If the rules R1 and R2 are exact for polynomials of
degree less than or equal to n1 and n2 respectively, then R1 x R2 will be exact
for linear combination of all functions of the form Pm, (x )Pm 2 (y), with ml ::::: n1
and m2 ::::: n2, where Pm(x) is a polynomial of degree m in x. For example, the
product of the Gauss-Legendre 2-point rule (G 2 ) i.e., G 2 x G2 will be exact for
all linear combinations of the following monomials

(6.168)

Thus, we can see that some monomials of degree four in the two variables
e.g., x4 cannot be integrated exactly, while other monomial of degree six e.g.,
x 3 y 3 can be integrated correctly. Hence, although the degree of precision of
this formula is only three, it is in general, more accurate than other formulae
of the same degree, since some of the higher degree terms are also integrated
correctly. This is true for all product rules.
If the region of integration is not rectangular, then we can embed it in
a rectangular region and define the function to be zero outside the required
region. This definition will not change the value of the integral, but it may slow
down the convergence of integration rules, because such an artifact usually
introduces discontinuities in the function value or its derivatives. Further, in
higher dimensions the volume occupied by the original region could be much
smaller than that of the enclosing rectangular region, even for relatively simple
shapes. For example, if a unit hypersphere in d dimensions is enclosed in a
hypercube of side 2, the ratio of the volumes will be

(6.169)

Here r(x) is the Gamma function. For three dimensions this ratio is 7r /6 ::::::
0.5236, but for d = 10 it works out to be 7r 5 /2 10 5! :::::: 0.0025 and for d = 20 it
is 7rIO /220(10)! :::::: 2.5 x 10- 8 • Hence, for 20 dimensions if we try to enclose a
hypersphere in a hypercube, then out of about 108 points only a few will fall
inside the sphere. It is obvious that such a procedure is not going to give any
meaningful results.
If the region is sufficiently regular, special rules can be constructed. It is
possible to construct product rules over a circular disk in two dimensions or
a sphere or a cone in three dimensions and so on. For example, consider the
integral I = Ie f(x, y) dx dy , where C is the unit disk x 2 + y2 ::::: 1, changing
240 Chapter 6. Integration

the variables to polar coordinates we get

I = 10
r 27r
r1
dB 10 fr dr . (6.170)

In the r variable we can use a n-point Gaussian rule with a weight function
w(r) = r, while for the B variable it may be better to use the trapezoidal
rule {9}, since the integrand should be periodic in B. If h1' h 2 , ... , hn are the
weights of the Gaussian rule corresponding to the abscissas r1, r2, ... , r n, and
Bk = 2k7r 1m, (k = 1,2, ... , m) are the abscissas used for the trapezoidal rule,
then we get the following product rule

1
I ';:::, -m L: L: hr) )f )r· cos -27rk
m n
( . 27rk)
m- ,r·) Sln -m- (6.171)
k=l j=l

Here in the trapezoidal rule we have made use of the periodicity of the integrand
to combine the two end terms, which ensures that all the weights are 11m. Simi-
larly, we can obtain product formulae for integration over spheres and spherical
shells in three dimensions. In some cases, a change of variable may map the
region of integration onto a rectangular region. A simple linear transformation
can transform parallelepiped into a rectangle and hence the product formulae
can be used for integration over parallelepiped. Similarly, an elliptic region can
be mapped onto a circular region by a linear transformation. For integration
over a hypersphere or a hyperspherical shell in n-dimensions we can generalise
this procedure by using hyper-spherical coordinates. In these coordinates the
region of integration would be rectangular and the product rules can be easily
applied. Function SPHND in Appendix B gives the transformation from hyper-
spherical to Cartesian coordinates and it can be used to define the function
in terms of Cartesian coordinates while using integration over hyper-spherical
coordinates.
EXAMPLE 6.15: Consider the following integrals in d dimensions

II = (e _1)-d fal ... fal exp(xl + X2 + ... + Xd) dX1 dX2 ... dXd = 1,

12 -
-
11 11 Jd - (xi +
0
...
0
dXl dX2 ... dXd
X~ + ... + X~) ,

13 = _1_ /1 ... /1 + (Xl X2 + ... + Xd)6 dXl dX2 ... dXd = 2-2 (~ + (d - 1)(5d - 1))
2d d 3 -1 -1 d 7 9 '

14 = 1(d/2 + 2) h}xi + x~ + ... + x~) dV = (V Gt/2,


(6.172)
where Sd is the portion of hypersphere xi + x~ + ... + x~ ::; 1, with Xi ?: 0, (i = 1, ... , d).
We use the automatic integration routine MULINT, which implements the product
Gauss formulae and the results are summarised in Table 6.8. This subroutine was invoked with
an accuracy requirement of 10- 6 and maximum limit on the number of function evaluations as
30000. The second line in the table for each d shows the results obtained using the composite
6.9. Multiple Integrals 241

Table 6.8: Multiple integration using product Gauss-Legendre formulae

d II 12 h 14
N Error N Error N Error N Error

2 92 1.1 x 10- 9 24572 2.1 x 10- 6 92 5.3 x 10- 15 32764 8.2 x 10- 4
100 7.4 X 10- 7 30976 1.8 x 10- 5 2304 6.7 x 10- 7 30976 1.6 x 10- 4
3 504 1.6 x 10- 9 32760 1.3 x 10- 7 504 5.3 x 10- 15 32760 3.9 x 10- 3
1728 5.4 x 10- 7 21952 4.5 x 10- 6 27000 2.9 x 10- 6 21952 4.5 x 10- 3
4 2544 2.2 x 10- 9 32752 7.3 x 10- 8 2544 1.9 x 10- 14 32752 2.1 x 10- 2
20736 7.1 x 10- 7 20736 5.4 x 10- 6 20736 8.5 x 10- 5 20736 8.5 x 10- 5
6 32704 3.3 x 10- 9 32704 5.8 x 10- 7 32704 4.8 x 10- 14 32704 2.1 x 10- 2
4096 8.6 x 10- 5 4096 6.9 x 10- 5 4096 4.5 x 10- 3 4096 1.7 x 10- 1
10 31744 1.3 x 10- 3 31744 1.7 x 10- 4 31744 2.5 x 10- 2 31744 5.8 x 10- 1
1024 2.2 x 10- 3 1024 2.9 x 10- 4 1024 4.1 x 10- 2 1024 1.2 x 10°
15 32768 3.4 x 10- 3 32768 1.3 x 10- 4 32768 2.8 x 10- 2 32768 1.2 x 10°

rule obtained by using the product 2-point Gaussian rule. The table shows the number of
points used by the subroutine and the actual (absolute) error in the result. It should be noted
that, since the automatic integration routine actually tries to verify the results, it may require
more points to achieve the same accuracy as that achieved by the simple product rule. For
the last integral it is possible to use a special product formula for sphere, but for the sake
of illustration we have extended the region to the hypercube 0 ::; Xi ::; 1 and the function is
assumed to be zero outside the hypersphere.
It can be seen from the table that for the first integral which is regular, there is no
difficulty in obtaining a reasonably accurate value for d ::; 4. For higher dimensions, although
the result may be reasonably accurate, it is not easily verified by the program, unless larger
number of function evaluations are allowed. It may be seen that the composite rule appears to
give a similar result using much fewer points. In fact, the subroutine MULINT has obtained
the approximation using a similar number of function evaluations, but has used most of the
remaining abscissas to verify and possibly improve the results. For d = 15 within the limits
that we had imposed, it is not possible to proceed beyond a simple application of product
2-point formula and the results quoted are obtained using such a rule. For d > 15 even this
is not possible and hence we do not consider such values.
The second integrand has a singularity at Xi = 1, (i = 1, ... , d), and the results are
not very accurate. Nevertheless, some reasonable accuracy is possible and further the results
show a distinct improvement as we move from d = 2 to d = 3,4. This is a typical phenomenon
in multiple integrals, since the strength of singularity in some sense decreases with d. Thus,
in a composite rule with typical spacing of h in each dimension, the error in this case will
be of the order of h d - 1 / 2 ~ N-1+ 1 / 2 d, provided the formula is of sufficiently high order.
At very high d, dimensional effect finally takes over and it is not possible to achieve any
better accuracy. The correct value of the integral is 11"(1 - V172), 0.730758728, 0.6264221,
0.5070860, 0.3904194, 0.317884 and 0.274925 for d = 2,3,4,6,10,15 and 20, respectively.
The third integral actually appears to be very simple and smooth, but causes problems
at higher values of d. Since in this case, the integrand is a polynomial of degree six, the com-
posite rule formed using 4-point Gaussian formula integrates this function exactly, resulting
in very low error determined by the roundoff error, for d ::; 6. However, at higher dimensions
the number of points which were allowed (i.e., 30000) did not permit the use of composite
4-point rule and as a result the accuracy goes down. Finally, the last integrand has a discon-
tinuity at the surface of the hypersphere, and we do not expect these rules to perform well in
this case. At higher dimensions the problem is more serious, since most of the abscissas fall
242 Chapter 6. Integration

outside the sphere, because the ratio of volume of hypersphere to that of the hypercube be-
comes very small as mentioned earlier. If we use integration over hyper-spherical coordinates
using function SPHND, then it is possible to evaluate this integral without any difficulty up
to d = 6, beyond which the number of points are not sufficient to achieve required accuracy.
This arises because in hyperspherical coordinates the Integrand has a factor r d + 1 , where r
is the distance from centre. Hence, once again a Gaussian formula of sufficient high order is
required.

6.10 Rules Exact for Monomials


In one dimension we could obtain quadrature formulae by integrating the in-
terpolating polynomials. This approach cannot be generalised to higher dimen-
sions because of the fact that polynomial interpolation is not always possible
in two or more dimensions. For example, given n distinct points PI, P2 , ... , Pn
in the plane, and given n monomials in two variables 1, x, y, x2, xy, y2, ... it
is not always possible to find a linear combination of these monomials, which
takes on the prescribed values at each of the n points. Similarly, the theory
of orthogonal polynomials which leads to Gaussian rules in one dimension is
not very well developed for two or more dimensions. However, the method of
undetermined coefficients can be used to obtain the weights and abscissas for
integration in several dimensions. Of course, this method will lead to a system
of nonlinear equations, which may not have real roots. Even if the roots are
real, some of the abscissas may fall outside the region of integration. This is
not a desirable feature of any integration formula, since the function may not
exist outside the region of integration and further practical experience suggests
that even if the function exists at all the required points, such formulae tend
to have larger errors. Even if all abscissas are within the region of integration,
the weights may not all come out to be positive, in which case, the roundoff
error may be amplified and further the convergence of the formulae may also
not be assured. Even though the success of such an approach is not assured,
a number of formulae have been obtained for numerical integration in several
dimensions. Probably the first such formula was given by Maxwell in 1877, but
after that not much work was done until 1940. Since then a number of special
formulae have been constructed.
If we consider a n-point rule of the form

1= J f(XI, X2,···, Xd) dV = t


i=1
hd(ail' ai2,···, aid) + En, (6.173)

for integration over a region in d dimensions. There are (d + l)n constants on


the right-hand side which are to be determined and we can expect such a rule
to be exact for (d + l)n distinct monomials. As noted earlier, the existence of
such a rule is not guaranteed. If the rule is required to have a degree of precision
p, i.e., it should be exact for all monomials of degree less than or equal to p.
Here the degree of monomial is defined to be the sum of powers of Xi. Now it
6.10. Rules Exact for Monomials 243

can be shown that there are


(d + p)!
(6.174)
d!p!

distinct monomials of degree less than or equal to p in d variables. Thus, we


can expect a rule with

(6.175)

abscissas. If such a rule exists, it is termed as an efficient rule. If a rule of


precision p involves more than N points, it is said to be subefficient, while if it
involves less than N points it is said to be hyperefficient.
Given a required precision p, it is possible to set the following upper and
lower bound for n, the number of points required (Davis and Rabinowitz, 2007)

NL = lP /2 J) <n< (d+ P) =N
(d+lP/2J (6.176)
- - - p - u·

To have a feeling for the number of points required for a given precision and for
a specified dimension, we give in Table 6;9 the values of N L, Nand Nu for some
values of p and d. Also given in the table is the number Cd of points required
by the product Gauss rule to achieve the same degree of precision. As noted
earlier, for a given degree of precision the product Gauss rule are more accurate
than the polynomial rules considered in this section. It can be seen that for two
dimensions the product rules compare quite favourably with the polynomial
rules, but at higher dimensions, the product rules require far too many points
for the same precision. This comparison may not be very meaningful, since
usually these rules have to be used in the compound form, where the number
of subdivisions will increase exponentially with the dimensions. Hence, it may
be impossible to use any of these formulae for very large number of dimensions.

Table 6.9: Number of points required for a precision p in d dimensions


p d=2 d=4 d = 10
NL N Nu Cd NL N Nu Cd NL N Nu Cd
3 3 4 10 4 5 7 35 16 11 26 286 1024
5 6 7 21 9 15 26 126 81 66 273 3003 59049
9 15 19 55 25 70 143 715 625 1001 8398 92378 9765625
15 36 46 136 64 330 776 3876 4096 19448 297160 3268760 1073741824

For any given region, there always exists a one-point formula of degree one
with the weight equal to the volume of the region and abscissa at the centroid
of the region, given by
JXi dV
(6.177)
JdV
244 Chapter 6. Integration

The centroid may not necessarily be inside the region, for example, the centroid
of a spherical shell is outside its volume. Hence, even for the simplest formula
it is not guaranteed that the abscissas will be inside the region of integration.
Most of the important formulae in this class have been obtained for the
so-called fully symmetric regions. A region R is said to be fully symmetric if

(6.178)

where (iI, i 2, ... , id) is any permutation of (1, 2, ... , d) and all permutations of
the signs are to be considered. Examples of fully symmetric regions in d di-
mensions are the hypercube [-1, 1] d, the hypersphere xi + x~ + ... + x~ ::; 1
and the entire space (-oo,oo)d. Similarly, an integration formula is said to be
fully symmetric if the abscissas satisfy a similar relation. A weight function
is said to be fully symmetric ifw(Cl:I,Cl:2, ... ,Cl:d) = w(±Cl: i l'±Cl: i2, . . . ,±Cl:iJ.
For example, over the entire space we can use a weight function e- r or e- r2 ,
where r2 = xi + x~ + ... + x~. For such fully symmetric regions the number of
equations to be solved for deriving a numerical integration formula reduce dras-
tically, though we may not necessarily get the formula with minimum number
of abscissas. In particular, such formulae are satisfied for all odd degree mono-
mials.
Consider the formulae for integration over a hypercube Hd = [-I, l]d. We
know that

if all i j are even;


(6.179)
otherwise.

If we are interested in a rule of degree three, then we only need to consider


the monomials 1 and xi, since all other conditions will be satisfied because of
symmetry. We can look for a formula of the form

r f dV;:::; wI!(O, 0, ... ,0) + L f(±a, 0, ... ,0).


2d
W2 (6.180)
lHd
Here the summation is over all the 2d points forming the fully symmetric group.
This formula has three constants, while we need to satisfy only two conditions
i.e.

(6 .181)

Hence, we can drop the first term (WI = 0) , to obtain a 2d-point rule of precision
three, with W2 = 2d - l /d and a 2 = d/3:

r
lHdfdV;:::;TLf
2d - 1 2d (fd
±Y3'O, ... ,O
) (6.182)
6.10. Rules Exact for Monomials 245

It is obvious that for d > 3 all these points will be outside the unit hypercube,
and hence this formula is not very useful in such cases.
Stroud has given an equivalent formula with all points inside the unit
hypercube for arbitrary values of d, which is as follows
r 2d - 1 2d
if f dV = -d-l:= f(alk, a2k,···, adk), (6.183)
Hd k=1

where

a2r-l.k -
_!2
V:3 cos
((2r-l)k7r)
d '
_ V:3
a2r,k -
!2 sm
. ((2r-I)k1r)
d
(r = 1,2, ... , ld/2J); (6.184)
adk = (_I)k if dodd;
v'3'
(i,k= 1,2, ... ,d).
This formula has been obtained by rotating the points in the previous formula
appropriately.
If we want a formula with degree of precision five, then we need to con-
sider two additional monomials, i.e., xi and xix~. Hence, we need two more
constants, but the point at the origin can provide only one more. Further, a
formula of type (6.180) can never integrate the monomial of the form xix~
2xactly. Hence, we can consider a formula of the form
2d

iHd
r
f dV = wd(O, 0, ... ,0) + W2l:= f(±a, 0, ... ,0)
(6.185)
2d(d-l)
+ W3 l:= f(±a, ±a, 0, ... ,0).
-{ere the second summation is over 2d(d-1) points, forming the fully symmetric
?;roup of the form indicated by the typical term. This formula requires a total
)f 2d2 + 1 points. The constants WI, W2, W3 and a can be determined by solving
',he equations

iHd
r dV = 2d = WI + 2dw2 + 2d(d - 1)w3,

r xi dV = 23d = 2W2a2 + 4(d - 1)w3a2,


iHd
(6.186)
r xi dV = 2d5 = 2W2a4 + 4(d - 1)w3 a4 ,
1
iHd
2d 4 2 2
dV = - = 4W3a .
X IX 2
Hd 9
These equations can be solved to get
2 3 2d 2d 25 d
1 = 5' WI = 162 (25d 2-115d+162), W2 = 162 (70-25d), W3 = - 2 .
324
(6.187)
246 Chapter 6. Integration

In this case, although all the points are inside the hypercube, the weights are
not all positive for d ::=: 3. Hence, this formula is likely to amplify the roundoff
error, particularly for large d. This may not be a very serious problem, since in
any case, the truncation error will dominate for large d. The ratio I:: IWil/ I:: Wi
gives the factor by which the roundoff error could be amplified. This ratio turns
out to be 1.37,12.85,45.44 and 213.34 for d = 3,6,10 and 20, respectively.
It is possible to get higher order formulae of this type, but the weights
are once again not positive. It is also possible to obtain such formulae for
unit hypersphere {42} or the entire space with appropriate weight function
{43}. For hypersphere or the entire space it is not possible to use composite
formulae, as the region cannot be subdivided into subregions with similar shape.
Thus the accuracy achieved by these formulae will be limited. If hyper-spherical
coordinates are used then the hypersphere can be transformed to a hyper-
rectangle and the formulae over hypercube obtained above can be applied.
In practice, these formulae will be used in composite form after subdivid-
ing the original region into several subregions. From that point of view, it may
be more useful to have formulae with abscissas at the boundary of the subre-
gions, so that there could be some saving in the number of function evaluations
in forming a composite rule. Some such formulae have been developed. There
are also formulae on hypercube with points at the midpoint of the faces in such
a way that the weights at the opposite faces are negative of each other. Hence,
in a composite rule such points in the interior simply drop out.
If the interval along each coordinate is divided into m parts, then the entire
region will be divided into m d subregions. on each of which the integration
rule is applied. If we use an integration rule of degree of precision p, then in
composite rule the error will be of the order of h P , where h is the size of each
subregion, assuming it to be uniform along all axes. Since h::::; l/m ::::; N- 1 / d ,
where N is the total number of points used. The truncation error is of the
order of l/NP/d. Hence, once again the error shows the dimensional effect and
for large values of d the error will decrease rather slowly with N. If the function
is discontinuous or has some singularity, then the degree of precision will be
very low and the convergence is extremely slow. Of course, it is possible to
obtain special rules with weight functions to take care of singularities, in which
case, the convergence will be better. In the subsequent sections, we will consider
methods which do not show explicit dimensional effect and hence are expected
to be better for integration over several dimensions.
EXAMPLE 6.16: Evaluate the integrals in Example 6.Hi. using the 2d-point formula of
degree three, and the (2d 2 + I)-point formula of degree five.
The results are shown in Table 6.10. which gives the number of points used and the
actual (absolute) error in the results obtained using these formulae. The second column gives
the degree of precision p of the formula used. The results were obtained using the subroutine
STRINT with the MAXPT=30000 and accuracy requirement of 10- 6 . These results can be
compared with Table 6.8 to compare the effectiveness of these rules with product rules.
The general trend of the results is similar to that for the product rule. For the last
function with d :2: 10 the subroutine STRINT converges with a zero value, which happens
because all the points that it tried fall outside the unit hypersphere. If AEPS is set to zero
then for d = 10. 15 it does find nonzero value as the convergence criterion is not satisfied
6.11. Monte Carlo Method 247

Table 6.10: Multiple integration using rules exact for monomials

d p II 12 h 14
N Error N Error N Error N Error

2 3 1532 4.0 x 1O~7 32764 4.8 x 10- 5 32764 1.7 x 10- 7 32764 4.8 x 10- 4
2 5 45 9.6 x 10- 7 36855 2.1 x 10- 5 3447 2.2 x 10- 8 36855 3.8 x 10-.5
3 3 24570 7.1 x 10- 7 49146 6.9 x 10- 6 49146 9.5 x 10- 6 49146 5.4 x 10- 4
3 5 1197 1.1 x 10- 6 38893 6.6 x 10- 6 38893 4.8 x 10- 7 38893 4.2 x 10- 3
4 3 32760 4.7 x 10- 6 32760 1.4 x 10- 5 32760 5.6 x 10- 4 32760 2.4 x 10- 3
4 5 29535 7.1 x 10- 7 33759 5.5 x 10- 6 33759 2.3 x 10- 5 33759 1.7 x 10- 2
6 3 49140 1.1 x 10- 4 50676 5.3 x 10- 6 49140 5.7 x 10- 3 49140 3.0 x 10- 3
6 5 37303 8.0 x 10- 6 37303 1.1 x 10- 5 37303 1.8 x 10- 3 37303 4.7 x 10- 1
10 3 40940 8.6 x 10- 4 40940 1.8 x 10- 5 40940 1.4 x 10- 2 420 1.5 x 10°
10 5 51255 6.7 x 10- 4 51255 4.0 x 10- 5 51255 3.6 x 10- 2 4221 1.5 x 10°
15 3 30690 3.4 x 10- 2 30690 4.7 x 10- 5 30690 2.5 x 10- 1 930 1.2 x 10°
15 5 57277 7.7 x 10- 3 57277 7.6 x 10- 5 57277 1.4 x 10- 1 13981 1.2 x 10°
20 3 40920 1.5 x 10- 1 40920 3.5 x 10- 5 40920 6.7xlO- 1 1640 8.9 x 10- 1
20 5 50463 2.6 x 10- 2 50463 9.2 x 10- 5 50463 2.5 x 10- 1 31239 8.9 x 10- 1

when integral is zero. In all cases, the formula of degree five gives better results than that
of degree three, using the same number of function evaluations. If a higher order formula of
this class is used, the results may improve still further. In almost all cases, the product Gauss
formulae prove to be more efficient, but at higher dimensions (d :::: 10), there may be some
advantage in using the monomial rules, since the product rules require enormous number
of function evaluations. For such values of d, it may not be possible to get. any meaningful
accuracy, unless the function is reasonably smooth. It is clear that in most cases much larger
number of function evaluations will be required to achieve the specified accuracy.

6.11 Monte Carlo Method


We shall now consider the Monte Carlo method for numerical integration, which
is based on statistical considerations. In this method, instead of choosing ab~
scissas in a well defined fashion, we select abscissas at random and hope that
the function is sampled appropriately. To understand the basic ideas of this
method, let us consider the problem of integration in one dimension

1= lb f(x) dx. (6.188)

The mean value of f(x) over the interval [a, bj is I j(b - a). If al, a2,'" ,an are
n points in the interval [a, bj, then the average value of f over this sample is
given by
~ 1 n
fn = - Lf(Xi). (6.189)
n i=l
248 Chapter 6. Integration

If the points are uniformly distributed over the interval, we can expect that
In ~ Ij(b - a) and

(6.190)

This is quite similar to the quadrature formulae that we have considered ear-
lier, except for the fact that the abscissas are selected somewhat arbitrarily. If
random values are used for ai, then the resulting method is called the Monte
Carlo method.
This method can be easily generalised to the case, with a weight function,
but in that case, the random numbers will have to be selected according to the
corresponding probability density function. For example, the integral
lb
I = in
a
1
w(x)f(x) dx ~ ;:;: L f(ai) ,
n

i=l
(6.191)

where ai are random numbers selected according to a probability density func-


tion w(x), such that
lb w(x) dx = 1 . (6.192)

This method can be easily extended to multiple integrals, provided the points
are uniformly distributed in the required region.
Unlike the methods considered earlier, here it is not possible to give any
theoretical bound on truncation error. Since the abscissas are selected randomly,
there is always a finite chance that all these points will fall in a small subregion
inside the specified region of integration. Hence, all we can say is that the
calculated average value of f(x) is between the smallest and the largest value
of the function in the given region. However, if a large number of points are
used, then the probability of always hitting the extreme values will be quite
small and we have to appeal to statistics and obtain a probable error rather
than the error bound. Such probable error estimates can be obtained using the
Central Limit Theorem of statistics. For the integral considered above, we can
define the variance by (j
(6.193)

Here it is assumed that the function is square integrable over the required
region. In this case, the Central Limit Theorem tells us that

prob (
1
1
;:;: 8
n
f(ai) - I I :::; Vn 1 fA
A(j) = J27T _A e- x
2
/2dx +0 (
Vn
1 ) . (6.194)

A table of the probability integral yields the following typical values

PI 0.50 0.90 0.95 0.99 0.999 0.9999


A 0.674 1.645 1.960 2.576 3.291 3.891
6.11. Monte Carlo Method 249

For example, we can say that the error at 95% confidence level is less than
1.960a/yn. For a fixed level of confidence (i.e., ). = constant), the error esti-
mate ).a / yn varies directly as a and inversely as yn. This is the typical rate
of convergence of Monte Carlo method. This rate appears to be slow, but it
should be noted that it is independent of the dimension or the smoothness
of the integrand. Further, this method can be easily applied when the inte-
gration is required over irregular regions, since we can embed the region in a
hypercube and define the integrand to vanish outside the region. The disconti-
nuity introduced in this process will not affect the convergence rate of Monte
Carlo method, although if the required region occupies only a small fraction of
the hypercube, then the efficiency is certainly affected. Thus, the Monte Carlo
method proves to be quite effective for integration over irregular regions, or
when the number of dimensions is quite large.
For practical application of Monte Carlo method, the error can be esti-
mated by estimating the variance numerically, using the same set of function
values which are used for .:omputing the integral. We can write

(6.195)

Apart from this error estimate, we also need some algorithm to select
the abscissas randomly. It is possible to use sequences of numbers prepared
in advance, but that requires too much of memory. Hence, normally the so-
called pseudo-random sequence is used for Monte Carlo methods. According to
Lehmer the pseudo-random sequence is "a vague notion embodying the idea of
a sequence, in which each item is unpredictable to the uninitiated, and whose
digits pass a certain number of tests traditional with statisticians and depending
somewhat on the uses to which the sequence is to be put". The main advantage
of pseudo-random sequence is that they are completely deterministic and hence
a program using such a sequence will always give the same results, which is
very important for verification of programs. Since in practice we always use
pseudo-random sequences, we will refer to such sequences as simply random
sequences or random numbers. Most higher level languages do provide a random
number generator, but these may not necessarily pass the required criteria for
a random sequence. Hence, such sequences may not necessarily be ideal for use
in numerical integration. For numerical integration in one dimension, all that
is required is that the numbers should be uniformly distributed (corresponding
to w(x) = 1). If other properties of random sequence are not satisfied, then
it only implies that we cannot apply the statistical error estimates. In fact, as
will be seen in the next section, certain nonrandom sequences give much faster
convergence for numerical integration.
For integration in d dimensions, the d coordinates of a point could be taken
as the d consecutive members of the random sequence, but it is not obvious that
such a sequence of points will be randomly distributed in d-dimensional space.
There is some evidence to suggest that some of the random sequences may give
250 Chapter 6. Integration

poor results in such situations. Apart from this, many of the pseudo-random
sequences have only a finite length, in the sense that after certain number of
terms the sequence will start repeating, because the number of distinct num-
bers that can be represented in a computer are finite. Thus, all random number
generators of the form Xn+l = f(xn) have a period < 2t , where t is the word
length. Since most applications of Monte Carlo method for numerical integra-
tion require a large number of points, it is essential to ensure that the period
of the random sequence is sufficiently large.
The random sequences are most commonly generated using the power-
residue or linear congruential method. The integers Xl, X2, ... are defined re-
cursively by the relation

Xn+l = aXn + c (mod m), (n = 0,1, ... ). (6.196)

Here a, c, m and Xo are certain integers and this notation means that Xn+l is
the remainder when aXn + c is divided by m. Since division by m can produce
at most m distinct remainders, it is obvious that such a sequence cannot have
a period greater than m. In fact, arbitrary choices for the basic integers a, C
and m will not produce a sequence of any reasonable length. Even if the period
of such a sequence is rather large, it may not necessarily satisfy other tests for
random numbers. For a detailed discussion of the theory of such sequences, the
readers can refer to Knuth (1997). Here we merely outline a procedure suitable
for a computer using binary arithmetic.
If the computer uses t-bit arithmetic, it is most convenient to choose
m = 2t. This choice can be expected to give the largest possible length for
the sequence. Further, with this choice, the process of division can be trivially
accomplished. For a, an integer of the form 8n + 3 or 8n + 5 and close to 2t/2 is
recommended, while c could be any arbitrary odd integer close to m(~ - iJ3).
For Xo we can choose any integer. It has been shown that such a sequence
will have a period of 2t. Since most applications require random numbers in
interval (0,1) the binary point can be considered at the extreme left. This
choice is ideal if the program is written in assembly language. The process of
multiplication will invariably lead to an overflow, which may not be handled in
a higher level language. Further, most higher level languages may not be able
to make use of the fact that m is a special number. Hence, we are forced to
choose smaller values of m, which do not result in overflow. Press et al. (2007)
have given a table of "good" values m, a and c depending on the word length of
the computer. The value of m here is rather low being less than 106 and hence
the period of such sequences will be rather small. This problem can be avoided
by using double precision variables for calculations instead of integers.
It has been shown by Marsaglia (1968) that if such a sequence is used to
generate points in a d-dimensional space, then the points tend to fall on hy-
perplanes. There will be at most about m l / d such hyperplanes. Further, if the
constants m, a and c are not properly chosen, then the number of hyperplanes
may be much less. It is not obvious whether, this departure from randomness
will affect the computed value of the integral, but it is better to use a more
6.11. Monte Carlo Method 251

reliable random number generator which hopefully does not suffer from such
correlations. It should be noted that there cannot be any perfect random num-
ber generator. Since it is quite likely that some new type of correlation may
be found in random sequences, being generated by the best currently available
routines. There is no limit to the number of tests that can be designed to test
the randomness of a sequence, and someone may come up with some test which
the currently accepted sequences will fail. Hence, it may be better to test the
sensitivity of our results by using two different random number generators. It
is generally safer to use simple random number generator of the form described
above which is based on some theory.
To improve on the random numbers generated by linear congruential
methods, we can try to combine two or three such sequences. One of the pro-
cedures as described by Press et al. (2007) is to use three independent linear
congruential generators. One of these generates the most significant part of the
output number, while the second one generates the least significant part. The
third sequence is used to shuffle the numbers. The combination of two numbers
is better because these routines use a rather low value of m. Hence, even if the
period is increased by shuffling, the number of distinct numbers that such a
sequence can produce will not be more than m. By combining two numbers as
mentioned above, it is possible to exploit a wider range of numbers that can
be represented in the computer. Even with this combination all numbers in the
interval (0,1) which can be represented in the computer will not occur in the
random sequence generated by such a program. The third generator is used
to gem~rate a number between 1 and n s , where ns is the length of array used
for shuffling. This array is filled at the first call to the function routine. The
period of such a sequence is practically infinite, until someone finds a value of
the "seed" number for which the sequence starts repeating rather early. Func-
tion RANF in Appendix B, which is based on the function RANI of Press
et al. (2007) , implements this algorithm. Generation of random numbers with
other probability distributions is discussed in Section 9.2.
As we have seen, the error in l\Ionte-Carlo method varies directly as the
variance (J and inversely as yTi. To reduce the error we can either increase n or
decrease (J. Increasing n by a factor of four will reduce the error only by a factor
of two and a very large number of points will be required to decrease the error
substantially. Alternately, we can reduce the variance by using some variance
reduction techniques. This appears to be a more effective way of reducing the
error, but unfortunately most of the variance reduction technique are applica-
ble, only if the function can be approximated well by a simple function. Hence,
the effectiveness of these techniques is not obvious. It is clear that if the same
approximations are applied to other methods of numerical integration, their
performance will also improve.
To illustrate some of these technique consider the integral

1= 11 f(x) dx . (6.197)
252 Chapter 6. Integration

If we can find a function g(x) such that If(x) - g(x)1 :::; f over the entire region
and the integral of g(x) is already known, then we can consider the integral

h = 11 (f(x) - g(x)) dx. (6.198)

For this integral the variance is given by

(6.199)

It can be easily shown that (]"1 :::; Eo Hence, if f is sufficiently small, the integral
can be estimated very efficiently. This method is very similar to the method of
reducing the strength of singularity for improper integrals. The main difficulty
with this method is to find a suitable function g(x), so that f can be sufficiently
small. This difficulty is particularly serious for multiple integration over large
number of variables, or for integration over irregular regions, since in such
cases, integrals of very few functions are really known. Further, these are the
only cases, where Monte Carlo method is likely to be used, since for a few
dimensions or for regular regions, other methods may be far more efficient and
accurate.
An alternate method for variance reduction is the so-called importance
sampling. Here we write the integral as

1= 1 0
1 f(x)
p(x) p(x) dx, (6.200)

where p(x) > 0 and


11 p(x) dx = 1. (6.201)

Now we can treat p(x) as the weight function and use random numbers with
a probability density distribution p(x) on 0 :::; x :::; 1. In this case, the integral
can be approximated by

(6.202)

and the relevant variance is

(]"2 = (1 P(x) p(x) dx _ ( (1 f(x) p(x) dX)2 (6.203)


io p2(x) io p(x)
Now assuming that f(x) > 0 (if it is not, we can add a constant), we select

f(x)
(6.204)
p(x) = I01 f(x) dx .

Then it can be seen that (]"2 = 0, but it requires a knowledge of the integral
which we wish to evaluate. Instead, if we select p(x) to be some approximation
6.11. Monte Carlo Method 253

to the above function, then the resulting variance will be small. The main
problem with this method is the difficulty of generating random numbers with
a probability density function p( x), particularly in several dimensions. If we use
a p(x) whose integral is known and whose behaviour approximates that of f(x),
we can expect to reduce the variance. However, this gain must be compared with
the loss of time in generating random numbers with a nonconstant probability
density.
Sophisticated computer programs have been written using this technique,
which uses a piecewise constant function p(x) whose value is adjusted by an
iterative procedure. There are various other techniques possible. However, such
techniques will again show dimensional effects, since the number of subdivisions
required to approximate the function by a piecewise constant function will
increase exponentially with the number of dimensions.

EXAMPLE 6.17: Evaluate the integrals in Example 6.15 using Monte Carlo method.

The results are displayed in Table 6.11, which shows the actual error as well as the
variance estimate $2.576\sigma/\sqrt{n}$ for 99% confidence level. In all cases, 32000 points were used. Two
different random number generators were tried, the first one using a simple linear congruential
method (function RAN1 with $m = 714025$, $a = 1366$, $c = 150889$), while the second
using a more sophisticated routine (function RANF in Appendix B). Here, err1 and err2 are
the actual errors using RAN1 and RANF, respectively, while err $= 2.576\sigma/\sqrt{n}$ is the error
as estimated by the program using the computed variance $\sigma^2$. For both random number
generators the variance in the result was almost the same and only one value is given in the
table. For $I_1$ with $d = 3$, err1 $\approx 7 \times 10^{-6}$, which is probably a coincidence. It can be seen
that the results are not essentially different for the two cases. Function RAN1 has $m = 714025$
and for higher dimensions it will be expected to have all the points in a few hyperplanes only;
even then the results are not affected. In particular, the variance is almost the same. It should
be noted that except for $I_2$ the variance increases slowly with $d$, thus showing a milder
dimensional effect. Except at very large values of $d$, this method is significantly less efficient
as compared to the product rules or the monomial rules. In almost all cases, the variance
gives a reliable estimate of the error, which is an advantage of the Monte Carlo method over
other methods for high dimensions. For $I_4$ with large $d$ the variance comes out to be zero as
none of the points that were tried falls inside the hypersphere.

Table 6.11: Multiple integration using Monte Carlo method

d        I_1                       I_2                        I_3                     I_4
     err    err1   err2     err      err1     err2      err    err1   err2     err    err1   err2

2 .0059 .0006 .0006 .00381 .000909 .000176 .011 .0021 .0018 .0094 .0054 .0016
3 .0075 .0000 .0021 .00184 .000144 .000344 .018 .0056 .0001 .0169 .0015 .0010
4 .0088 .0003 .0009 .00125 .000068 .000119 .025 .0019 .0122 .0288 .0028 .0068
6 .0112 .0009 .0014 .00076 .000026 .000091 .033 .0064 .0003 .0725 .0092 .0165
10 .0160 .0013 .0082 .00043 .000045 .000097 .040 .0021 .0483 .4475 .0787 .3735
15 .0216 .0046 .0053 .00028 .000016 .000014 .046 .0128 .0195 .0000 1.225 1.205
20 .0297 .0089 .0018 .00021 .000015 .000010 .051 .0022 .0218 .0000 .8931 .8931

6.12 Equidistributed Sequences


In the previous section, we considered the Monte Carlo method for numerical integration, where the value of the integrand at random points is used to approximate
the integral. It is not really essential to choose the points randomly; all that is
required is that the points be uniformly distributed, so that all subregions are
sampled appropriately. An equidistributed (or uniformly distributed) sequence
in $[a, b]$ can be defined as a deterministic sequence of points $a_1, a_2, \ldots$ in $[a, b]$,
such that
$$\lim_{n \to \infty} \frac{b-a}{n} \sum_{i=1}^{n} f(a_i) = \int_a^b f(x)\,dx, \qquad (6.205)$$
for all bounded, Riemann integrable functions $f(x)$.


The term equidistributed comes from the fact that the fraction of points
of an equidistributed sequence that lie in any interval is asymptotically propor-
tional to the length of the interval. For practical application of such sequences,
we need a simple method for generating them. We will not go into the theory
of such sequences, but state the following result, which gives us a very simple
method for generating such sequences.
If $\theta$ is an irrational number, then the sequence
$$x_n = (n\theta) \equiv n\theta - \lfloor n\theta \rfloor, \qquad (6.206)$$
is equidistributed in $[0, 1]$.


The most convenient irrational numbers which can be used for this process
are obtained by taking square roots. The sequence of random numbers consid-
ered in the previous section is also equidistributed, but the sequence defined
above is not random. A sequence of random numbers is expected to pass vari-
ous statistical tests, apart from the equidistribution. For example, if we denote
by prob$(x_n > x_{n+1})$ the limit (if it exists)
$$\mathrm{prob}(x_n > x_{n+1}) = \lim_{N \to \infty} \frac{1}{N} \sum_{\substack{n \le N \\ x_n > x_{n+1}}} 1. \qquad (6.207)$$
For a random sequence we would require prob$(x_n > x_{n+1}) = \tfrac{1}{2}$, while there is
no such requirement for the equidistributed sequence and in fact, the sequence
defined by (6.206) does not satisfy this property. For the sequence (6.206) it
can be seen that $x_n > x_{n+1}$ if and only if $1 - (\theta) \le x_n < 1$. Hence, using the
property of equidistribution, prob$(x_n > x_{n+1}) = (\theta) \ne \tfrac{1}{2}$.
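This property is easy to verify numerically; the short sketch below (illustrative Python, with $\theta = \sqrt{2}$ as an arbitrary choice) counts how often $x_n > x_{n+1}$ for the sequence (6.206) and compares the observed fraction with $(\theta)$.

```python
import math

def frac(t):
    """Fractional part, as in Eq. (6.206)."""
    return t - math.floor(t)

theta = math.sqrt(2.0)          # any irrational number will do
N = 100000
x = [frac(n * theta) for n in range(1, N + 2)]
count = sum(1 for n in range(N) if x[n] > x[n + 1])
# The observed fraction approaches the fractional part of theta, not 1/2.
print(count / N, frac(theta))
```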
The definition of equidistributed sequences can be extended to several di-
mensions. To generate an equidistributed sequence of points in d dimensions
we can use $d$ independent sequences of the form (6.206) to generate the $d$ coordinates of the successive points. If the irrational numbers $\theta_i$ used for these
sequences are such that $1, \theta_1, \theta_2, \ldots, \theta_d$ are linearly independent over the rational numbers, that is,
$$\alpha_0 + \alpha_1\theta_1 + \alpha_2\theta_2 + \cdots + \alpha_d\theta_d \ne 0, \qquad (6.208)$$

for rational $\alpha_i$ not all vanishing, then it can be shown that the sequence of
points
$$\big((n\theta_1), (n\theta_2), \ldots, (n\theta_d)\big), \qquad (n = 1, 2, \ldots), \qquad (6.209)$$
is equidistributed over the hypercube $[0, 1]^d$. This means that
$$\lim_{N \to \infty} \frac{1}{N} \sum_{n=1}^{N} f\big((n\theta_1), \ldots, (n\theta_d)\big) = \int_0^1 \cdots \int_0^1 f(x_1, \ldots, x_d)\,dx_1 \cdots dx_d, \qquad (6.210)$$
for any bounded Riemann integrable function $f$. The most convenient choice
for $\theta_i$ is $\sqrt{p_i}$, where $p_i$ are distinct primes.
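A minimal sketch of this construction is given below (illustrative Python, not the EQUIDS routine of Appendix B). It forms the points (6.209) with $\theta_i = \sqrt{p_i}$ and approximates an integral over the unit cube by the simple average (6.210); the test integrand $\exp(x_1 + \cdots + x_d)$ is an arbitrary choice whose exact integral $(e-1)^d$ is known.

```python
import math

PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]   # enough primes for d <= 10 here

def equi_point(n, d):
    """n-th point of the sequence (6.209) with theta_i = sqrt(p_i), p_i distinct primes."""
    return [math.fmod(n * math.sqrt(p), 1.0) for p in PRIMES[:d]]

def equi_integrate(f, d, npts):
    """Average of f over the first npts points, approximating the integral over
    the unit cube as in Eq. (6.210).  (For very large n the product n*sqrt(p)
    loses some accuracy to roundoff in double precision.)"""
    return sum(f(equi_point(n, d)) for n in range(1, npts + 1)) / npts

if __name__ == "__main__":
    d, npts = 4, 20000
    f = lambda x: math.exp(sum(x))       # illustrative test integrand
    exact = (math.e - 1.0) ** d          # its exact integral over [0,1]^d
    print("relative error:", abs(equi_integrate(f, d, npts) - exact) / exact)
```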
A considerable effort is being spent on finding sequences of points in mul-
tidimensional space that are "good" for numerical integration. The reader can
refer to Davis and Rabinowitz (2007) and references therein for more details.
For these sequences to be useful, the convergence rate of the computed inte-
grals should be better than the $O(n^{-1/2})$ rate of the Monte Carlo method. It has been
shown that for a certain class of functions, these sequences give a convergence
rate of $O(1/n)$, which is much faster than that for the Monte Carlo method. In
order to improve the convergence still further, Haselgrove (1961) has suggested
the method of averaging, which we will outline here.
Let us assume that $f(x_1, x_2, \ldots, x_d)$ is a function of period 2 in each
variable and $\theta_1, \theta_2, \ldots, \theta_d$ are $d$ irrational numbers, such that $1, \theta_1, \ldots, \theta_d$
are linearly independent over the rationals. Let
$$S_1(N) = \sum_{n=-N}^{N} f(2n\theta_1, 2n\theta_2, \ldots, 2n\theta_d), \qquad S_2(N) = \sum_{n=0}^{N} S_1(n),$$
$$s_1(N) = \frac{S_1(N)}{2N+1}, \qquad s_2(N) = \frac{S_2(N)}{(N+1)^2}. \qquad (6.211)$$
Here $s_1(N)$ and $s_2(N)$ are averages of the function values and can be expected
to approximate the integral
$$I = \int_0^1 \cdots \int_0^1 f(x_1, \ldots, x_d)\,dx_1 \cdots dx_d. \qquad (6.212)$$
Haselgrove has shown that if the function satisfies certain smoothness conditions, then there exist irrational numbers $\theta_1, \ldots, \theta_d$, such that
$$|I - s_1(N)| \le \frac{\mathrm{const}}{N} \quad \text{and} \quad |I - s_2(N)| \le \frac{\mathrm{const}}{N^{2-\epsilon}}, \qquad (\epsilon > 0). \qquad (6.213)$$
Haselgrove has tabulated "good" values of $\theta_i$ for $d \le 8$.


If the function is not periodic, then it can be extended as a periodic
function. However, this extension will introduce discontinuities in the function,
which in turn will slow down the convergence of the method. Haselgrove has sug-
gested the following procedure, which results in a continuous periodic function,
provided the original function was continuous over the region of integration.

Consider the integral
$$I = \int_0^1 \cdots \int_0^1 F(x_1, x_2, \ldots, x_d)\,dx_1 \cdots dx_d. \qquad (6.214)$$
This can be written as
$$I = \frac{1}{2^d} \int_{-1}^{1} \cdots \int_{-1}^{1} F(|x_1|, |x_2|, \ldots, |x_d|)\,dx_1 \cdots dx_d. \qquad (6.215)$$
The function $f(x_1, x_2, \ldots, x_d) = F(|x_1|, |x_2|, \ldots, |x_d|)$ may be extended as a
periodic function with period 2 in each of the variables. It can be easily seen
that $f$ is continuous, provided $F$ is continuous. Using this function the sum
$S_1(N)$ can be expressed in the form
$$S_1(N) = F(0, 0, \ldots, 0) + 2 \sum_{n=1}^{N} F\big(2|\{n\theta_1\}|, \ldots, 2|\{n\theta_d\}|\big), \qquad (6.216)$$
where $\{x\}$ denotes the fractional part, lying in the range $(-\tfrac{1}{2}, \tfrac{1}{2})$.
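The following sketch (illustrative Python, not the EQUIDS routine of Appendix B; the roundoff precautions described below are omitted) accumulates $S_1(N)$ and $S_2(N)$ using the form (6.216) with $\theta_i = \sqrt{p_i}$, and returns the averages $s_1(N)$ and $s_2(N)$ of (6.211). The two-dimensional test integrand is an arbitrary choice with a known integral; Haselgrove's tabulated "good" $\theta_i$ would normally be preferred to plain square roots.

```python
import math

def haselgrove(F, thetas, N):
    """Averages s1(N), s2(N) of Eq. (6.211) for the integral of F over the unit
    cube, with S1 accumulated via Eq. (6.216):
    S1(N) = F(0,...,0) + 2*sum_{n=1..N} F(2|{n th_1}|, ..., 2|{n th_d}|)."""
    def frac_half(t):
        # fractional part of t reduced to the interval around zero
        return t - math.floor(t + 0.5)
    S1 = F(*([0.0] * len(thetas)))   # n = 0 term
    S2 = S1                          # running sum S2(N) = S1(0) + ... + S1(N)
    for n in range(1, N + 1):
        x = [2.0 * abs(frac_half(n * th)) for th in thetas]
        S1 += 2.0 * F(*x)
        S2 += S1
    return S1 / (2 * N + 1), S2 / (N + 1) ** 2

if __name__ == "__main__":
    # Illustrative test: integral of cos(x)cos(y) over [0,1]^2 equals sin(1)**2.
    F = lambda x, y: math.cos(x) * math.cos(y)
    s1, s2 = haselgrove(F, [math.sqrt(2.0), math.sqrt(3.0)], 5000)
    exact = math.sin(1.0) ** 2
    print(abs(s1 - exact), abs(s2 - exact))
```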
Haselgrove has also defined higher order averages s3(N) and s4(N), which
under appropriate conditions are expected to converge even faster. These higher
rates can be realised only if the function is sufficiently smooth in the periodic
form. This is a rather fast convergence rate, considering the fact that it is inde-
pendent of d. However, in practice, such rapid convergence is rarely achieved.
This method usually gives more accurate results than the Monte Carlo method,
with the same number of function evaluations. However, the improvement may
not be as much as may be expected from the theoretical convergence rates.
There is no straightforward method for estimating the truncation error
in this method. The difference sI(N) - s2(N) could give some estimate of the
error. Alternately, using the partial sums at some intermediate stages, we can
evaluate different approximations to the integral and using these it may be
possible to get some estimate of error.
If the summation for $S_2(N)$ is implemented in a straightforward manner,
there will be a substantial roundoff error. The sums $S_1(N)$ can be expected to
have a relative error of the order of $\sqrt{N}\,\hbar$. Hence, the roundoff error in $S_2(N)$
can be expected to be of the order of $\hbar N$. This estimate is for the probable error,
assuming some cancellation. If the roundoff errors at various steps are correlated,
then the error could be much larger, with the upper bound being of the order of
$\hbar N^2$ for $S_2(N)$. To reduce the roundoff error Haselgrove (1961) has suggested
the following procedure. At regular intervals in the computation, calculate the
average value of the function $\bar{J} = S_2(N)/(N+1)^2$, and subtract this value from
each subsequent calculation of the integrand. Thus, the accumulated sums are
$$S_1'(N) = \sum_{n=-N}^{N} \Big( f\big((n\theta_1), \ldots, (n\theta_d)\big) - \bar{J} \Big), \qquad (6.217)$$

and
$$S_2'(N) = \sum_{n=0}^{N} S_1'(n). \qquad (6.218)$$
We can start with $\bar{J} = 0$ and at each successive re-estimate, $\bar{J}$ is replaced by
$\bar{J} + S_2'(N)/(N+1)^2$, while $S_2'(N)$ is replaced by zero and $S_1'(N)$ by $S_1'(N) - (2N+1)\Delta\bar{J}$, where $\Delta\bar{J}$ is the change in $\bar{J}$ at this stage. This technique is
implemented in subroutine EQUIDS in Appendix B.
EXAMPLE 6.18: Evaluate the integrals in Example 6.15 using the equidistributed se-
quences.

Table 6.12: Multiple integration using equidistributed sequences

d         I_1                       I_2                       I_3                       I_4
      err1        err2        err1        err2        err1        err2        err1     err2

2 2.1 x 10- 5 5.0 X 10- 9 1.7 X 10- 4 1.2 X 10- 4 1.7 X 10- 5 2.3 X 10- 6 .001 .001
3 7.7 x 10- 5 3.0 X 10- 6 7.4 X 10- 5 6.8 X 10- 5 4.5 X 10- 4 2.9 X 10- 4 .007 .007
4 1.3 x 10- 5 4.0 X 10- 6 2.8 X 10- 5 2.1 X 10- 5 4.1 X 10- 5 3.8 X 10- 4 .014 .016
6 6.1 x 10- 5 4.4 X 10- 6 4.5 X 10- 6 3.0 X 10- 6 1.5 X 10- 3 2.5 X 10- 3 .004 .004
10 5.3 x 10- 4 1.1 X 10- 3 5.9 X 10- 7 6.1 X 10- 6 1.4 X 10- 2 4.2 X 10- 2 .294 .144
15 2.0 x 10- 3 1.4 X 10- 3 1.8 X 10- 6 9.8 X 10- 7 6.9 X 10- 2 1.3 X 10- 1 1.22 1.22
20 7.0 x 10- 3 6.0 X 10- 3 9.7 X 10- 7 1.5 X 10- 6 1.3 X 10- 1 2.6 X 10- 1 .893 .893

The results obtained using the subroutine EQUIDS are displayed in Table 6.12. Here
we have used a maximum of 32000 points and an accuracy requirement of $10^{-6}$. In most cases,
the subroutine failed to converge to the required accuracy and hence the number of points
actually used is not shown in the table. For $I_4$, there was spurious convergence to a zero value
for $d = 15, 20$, requiring only 400 points. For $I_1$ with $d = 2, 3$ the number of points used is
12800. For all other integrals the number of points used is 32000. In all cases, the actual errors
in both the estimates $s_1$ and $s_2$ are shown in the table as err1 and err2, respectively. The error
estimate returned by the subroutine, which is based on the difference between successive estimates,
was generally found to be reliable, and only for about 10% of the cases did it underestimate
the error. Even in these cases, the difference was less than an order of magnitude. Hence,
if some allowance is made, this error estimate could be quite reliable, even though there is
no theoretical basis for it. If the successive estimates obtained by the subroutine are also
printed out, the results could be even more reliable. Further, it is found that roundoff error
is significant when a large number of points is used.
It can be seen that this method is more efficient than the Monte Carlo method in almost
all the cases. The accuracy is remarkably good for $I_2$. For $I_1$ the accuracy deteriorates at
higher values of $d$. The second estimate $s_2$ is not always superior to the first estimate $s_1$.

Having discussed various techniques for evaluating multiple integrals, we


now address the question of which technique is to be used in practice. This
choice will of course depend on the problem, particularly on the number of
dimensions and the nature of function and the region over which integration
is required. If the region is highly irregular, then we can embed it into a regu-
lar region with the function being defined to be zero outside the given region.

This embedding may slow down the rate of convergence of all the methods
considerably. For integration in two or three dimensions, it is almost always
more efficient to use product rules. Alternately, we can use recursive definition
in terms of one-dimensional integrals, which can be efficiently evaluated us-
ing adaptive techniques. If the function is sufficiently smooth and an accurate
value is required, it may be more efficient to use product rules with high or-
der Gaussian formulae. In such cases, even for higher dimensions the product
rules could be very efficient. For a moderate number of dimensions, say between
four and six, the higher order product rules may require a very large number of
points, which may be beyond the reach of the computer. In such cases, a product
rule with low order formula or the compound rules for monomials could give
moderate accuracy, if the function is well behaved. A high accuracy for such di-
mensions will almost invariably require a considerable amount of computation.
For even higher dimensions, it is nearly impossible to ensure a high accuracy
and the number of points required in product rules will increase exponentially,
while for the monomial rules the increase is somewhat slower. Consequently,
it may still be possible to obtain some estimate using a reasonable number of
points. In general, for dimension larger than about six, the methods based on
equidistributed sequence will be more efficient.
If the function has singularities, then some of the techniques used in
one-dimensional case can be extended to higher dimensions. For example, the
method of weakening the singularity can be employed, provided the form of sin-
gularity is known and can be integrated easily. In general, this is rather difficult
in higher dimensions. In some cases, a change of variables may be able to make
the singularity more tractable. Apart from these, it may even be possible to use
appropriate weight function in the product rules or the rules for monomials.
Algebraic singularities tend to become somewhat milder, as we go to higher
dimensions. For example, an integrand of the form $1/r$ ($r^2 = x_1^2 + x_2^2 + \cdots + x_d^2$)
around $r = 0$ is not integrable in one dimension, but is integrable in higher dimensions. Using sufficiently high order formulae for regular integrals the results
will converge as $n^{-1+1/d}$. Hence, for such integrals, the results may even im-
prove to some extent for higher dimensions. If none of these techniques work,
then it may be best to break up the region into several parts to isolate the
singular region, before applying the numerical methods. If the number of di-
mensions is not large, it may be best to use recursive definition in terms of
one-dimensional integrals, which can be evaluated using an adaptive routine.
The adaptive routine would be able to overcome moderate singularities effi-
ciently.
It is fairly straightforward to implement numerical integration algorithms
on parallel processors. Parallelisation is possible at several levels. At the lowest
level, we can calculate the function values at different abscissas in parallel.
Thus, if n processors are available it is possible to evaluate the function value
at all the $n$ points of an $n$-point rule simultaneously, giving a speed-up by a factor
of almost $n$. If sufficient processors are available, then for an adaptive integration
routine it may be preferable to use a basic rule with a number of points equal
to the number of processors available. This choice will reduce the number of
subdivisions required without increasing the execution time. At a higher level
we can parallelise the numerical integration programs by subdividing the range
of integration into $n$ parts, where $n$ is the number of processors available. The
integral over each of the subregions can be evaluated in parallel. However, it
may be difficult to choose the divisions properly so that each of the subregions
require comparable time. At even higher level, if a number of similar integrals
are required, then we can calculate each of them in parallel.

Bibliography
Abramowitz, M. and Stegun, I. A. (1974): Handbook of Mathematical Functions, With For-
mulas, Graphs, and Mathematical Tables, Dover, New York.
Acton, F. S. (1990): Numerical Methods That Work, Mathematical Association of America.
Acton, F. S. (1995): Real Computing Made Real, Princeton University Press, Princeton, New
Jersey.
Burden, R. L. and Faires, D. (2010): Numerical Analysis, (9th ed.), Brooks/Cole Publishing
Company.
Carnahan, B., Luther, H. A. and Wilkes, J. O. (1969): Applied Numerical Methods, John
Wiley, New York.
Clenshaw, C. W. and Curtis, A. R. (1960): A Method for Numerical Integration on an Au-
tomatic Computer, Numer. Math., 2, 197.
Dahlquist, G. and Björck, A. (2003): Numerical Methods, Dover, New York.
Davis, P. J. and Rabinowitz, P. (2007): Methods of Numerical Integration, (2nd ed.), Dover,
New York.
De Boor, C. (1971a): CADRE: An Algorithm for Numerical Quadrature, in J. R. Rice (ed.)
Mathematical Software, Academic Press, New York (p 417).
De Boor, C. (1971b): On Writing an Automatic Integration Algorithm, in J. R. Rice (ed.)
Mathematical Software, Academic Press, New York (p 201).
Golub, G. H. and Welsch, J. H. (1969): Calculation of Gauss Quadrature Rules, Math. Comp.,
23, 221.
Haber, S. (1970): Numerical Evaluation of Multiple Integrals, SIAM Rev., 12, 481.
Hamming, R. W. (1987): Numerical Methods for Scientists and Engineers, (2nd ed.), Dover,
New York.
Haselgrove, C. B. (1961): A Method for Numerical Integration, Math. Comp., 15,323.
Hildebrand, F. B. (1987): Introduction to Numerical Analysis, (2nd ed.), Dover, New York.
Joyce, D. C. (1971): Survey of Extrapolation Processes in Numerical Analysis, SIAM Rev.,
13,435.
Knuth, D. E. (1997): The Art of Computer Programming, Vol. 2, Seminumerical Algorithms,
(3rd ed.), Addison-Wesley, Reading, Massachusetts.
Kopal, Z. (1961): Numerical Analysis, (2nd ed.), John Wiley, New York.
Krylov, V. I. (2005): Approximate Calculation of Integrals, (trans. A. H. Stroud), Dover,
New York.
Lyness, J. N. (1983): When Not to Use an Automatic Quadrature Routine, SIAM Rev., 25,
63.
Marsaglia, G. (1968): Random Numbers Fall Mainly in the Planes, Proc. Nat. Acad. Sci.
U.S.A., 61, 25.
Oldham, K. B., Myland, J. C. and Spanier, J. (2009): An Atlas of Functions: with Equator,
the Atlas Function Calculator, (2nd ed.) Springer, Berlin.
Piessens, R., Doncker-Kapenga, E. de, Uberhuber, C. W. and Kahaner, D. K. (1983): QUAD-
PACK, A Subroutine Package for Numerical Integration, Series in Computational Math.
1, Springer-Verlag, Berlin.

Press, W. H., Teukolsky, S. A., Vetterling, W. T. and Flannery, B. P. (2007): Numerical
Recipes: The Art of Scientific Computing, (3rd ed.) Cambridge University Press, New
York.
Ralston, A. and Rabinowitz, P. (2001): A First Course in Numerical Analysis, (2nd Ed.)
Dover.
Rice, J. R. (1975): A Metalgorithm for Adaptive Quadrature, JACM, 22, 61.
Rice, J. R. (1992): Numerical Methods, Software, and Analysis, (2nd Ed.), Academic Press,
New York.
Rutishauser, H. (1990): Lectures on Numerical Mathematics, Birkhäuser, Boston.
Stoer, J. and Bulirsch, R. (2010): Introduction to Numerical Analysis, (3rd Ed.) Springer-
Verlag, New York.
Stroud, A. H. (1971): Approximate Calculation of Multiple Integrals, Prentice-Hall, Engle-
wood Cliffs, New Jersey.
Stroud, A. H. and Secrest, D. (1966): Gaussian Quadrature Formulas, Prentice-Hall, Engle-
wood Cliffs, New Jersey.

Exercises
1. Derive the Simpson's 1/3 rule and 3/8 rule, using the method of undetermined coefficients
or by integrating the appropriate interpolating polynomial. Also estimate the truncation
error in these formulae.
2. Use any quadrature formula (for regular integrand) to approximate the following inte-
grals:

$$I_1 = \int_0^1 x^x\,dx \approx 0.783430510712,$$
$$I_2 = \int_0^{\pi/2} e^{\sin x}\,dx \approx 3.10437901786,$$
$$I_3 = \int_0^1 e^{-(x+1/x)}\,dx \approx 0.0721982401982,$$
$$I_4 = \int_0^1 \frac{e^{-(x+1/x)}}{x^{20}}\,dx \approx 6.05693330804 \times 10^{15},$$
$$I_5 = \int_0^1 \ln(1/x)\,\sin x\,dx \approx 0.239811742001,$$
$$I_6 = \int_0^1 \ln(1/x)\,\sin^2 x\,dx \approx 0.0986467557993,$$
$$I_7 = \int_0^1 e^{-12x}\,\sin^2 x\,dx = \frac{e^{-12}}{148}(6\cos 2 - \sin 2) - \frac{3}{74} + \frac{1-e^{-12}}{24},$$
$$I_8 = \int_0^1 x^5 e^{-2x^3}\,dx = \frac{1-3e^{-2}}{12},$$
$$I_9 = \int_0^1 \cdots\,dx = \frac{1}{2} - \frac{1}{4!} + \frac{1}{6!} - \frac{1}{8!} + \frac{1}{10!}.$$
What is the rate of convergence for each of these integrals? Compare the results with
actual values.
3. Derive (6.25) by explicitly integrating the cubic spline. Also obtain the same expression
using the Euler-Maclaurin sum formula over each subinterval.
4. Using the following table of values for $f(x)$, obtain the integral $\int_a^b f(x)\,dx$ for $[a,b] = [0,1]$,
$[0,0.2]$, $[0.5,1.0]$, $[0.33,0.92]$, $[0.5,1.5]$. Try to estimate the error in evaluating the integral
and compare it with the true value (noting that $f(x) = \sqrt{x}$).
x: 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
f(x): 0 0.3162 0.4472 0.5477 0.6325 0.7071 0.7746 0.8367 0.8944 0.9487 1.0000
5. Show that $T_{1,k}$ in Romberg integration is identical to the composite Simpson's 1/3 rule
using $2^k + 1$ points. Similarly, show that $T_{2,k}$ is identical to the composite 5-point Newton-Cotes closed formula. Verify these results experimentally by considering some of the
integrals in {2}. Using the leading term in the error expansion for $T_{m,k}$, show that it
does not correspond to any Newton-Cotes composite rule for $m \ge 3$.
6. Let $\phi(x)$ be a function of $x$ and $\phi(1/4^m) = T_{0,m}$. Show that the Romberg integration is
equivalent to using Neville's method (see {4.15}) of interpolation for approximating $\phi(0)$.
7. Show that the $\epsilon$-algorithm is equivalent to Aitken's $\delta^2$ process, by writing $\epsilon_i^{(2)}$ explicitly in
terms of $\epsilon_j^{(0)}$.
8. Using the Euler-Maclaurin sum formula obtain the integration formula
$$\int_a^{a+nh} f(x)\,dx \approx h \sum_{i=1}^{n-1} f(a+ih) + \frac{h}{2}\big(f(a) + f(a+nh)\big) - \frac{h^2}{12}\big(f'(a+nh) - f'(a)\big).$$
Here the last term can be treated as a correction to the trapezoidal rule. This term
requires the knowledge of derivatives at the end points, but has the advantage that by
just adding the correction term involving only the end points, the formula is accurate for
all cubic polynomials. Try this formula on the integrals in {2} and compare the results
with those obtained using the trapezoidal rule and the Simpson's rule. If the derivatives
are not known, they can be replaced by finite differences to get the Gregory's formula
$$\int_a^{a+nh} f(x)\,dx \approx h \sum_{i=1}^{n-1} f(a+ih) + \frac{h}{2}\big(f(a) + f(a+nh)\big) - \frac{h}{12}\big(\nabla f_n - \Delta f_0\big) - \frac{h}{24}\big(\nabla^2 f_n + \Delta^2 f_0\big),$$
where $\Delta f_0 = f(a+h) - f(a)$ is the forward difference and $\nabla f_n = f(a+nh) - f(a+nh-h)$
is the backward difference. Use this formula also on the integrals in {2}. Try to improve
on these formulae by including more correction terms involving higher derivatives.
9. Considering the formula of the previous problem, show that if $f'(a) = f'(a+nh)$, then
the correction term vanishes and the trapezoidal rule itself will have a higher order
accuracy. Similarly, if $f^{(2k-1)}(a) = f^{(2k-1)}(a+nh)$ for $k = 1, 2, \ldots, m$, then show that
the trapezoidal rule will have truncation error of the order of $h^{2m+2}$. If all the odd
derivatives of the function are equal at the two end points, then the convergence of
the trapezoidal rule will be almost exponential. In particular, if the integrand is periodic and
the interval of integration extends over an integral multiple of the period, the trapezoidal
rule converges extremely fast. To demonstrate this effect evaluate the integrals given
below using Romberg integration as well as the 2 and 4-point Gaussian formulae with
$2^n$, $(n = 1, 2, \ldots, 12)$, subintervals. Compare the results with the exact value of the integral.
In Romberg integration print all columns of the T-table.
$$I_1 = \int_0^{\pi} \cos(t\sin x - x)\,dx = \pi J_1(t), \qquad t = 8,$$
$$I_2 = \int_0^{\infty} e^{-x^2}\,dx = \frac{\sqrt{\pi}}{2},$$
$$I_3 = \int_0^{2\pi} \sin^2 x\,dx = \pi,$$
$$I_4 = \int_0^1 \frac{dx}{2 + \sin(20\pi x)} = \frac{1}{\sqrt{3}},$$
$$I_5 = \int_0^{2\pi} \sin x\,dx = 0.$$

For $I_2$ the range can be truncated appropriately. Which formula is most efficient for these
integrals?
10. Evaluate the following integral defining the Error function

$$\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2}\,dt,$$


for x = 0.5,1,2,3,4,5,6,8,10, using Romberg technique and print all columns. How
many function evaluations are required to achieve a predetermined accuracy in each
case? This integral can also be evaluated by expanding the integrand in a Taylor series
and integrating the series term by term. Compare the efficiency of quadrature formulae
with the Taylor series for different values of x.
11. Evaluate the following integrals using composite Simpson's rule:

$$I_1 = \int_0^1 x^x\,dx \approx 0.783430510712, \qquad I_2 = \int_0^{0.983} x^x\,dx \approx 0.766573383391;$$
keep adding abscissas until roundoff error is dominating. To see the effect of inaccuracies
in the abscissas, try both expressions, i.e., $a_j = a_0 + jh$ and $a_j = a_{j-2} + 2h$, for evaluating
the required abscissas and compare the results with exact value. Also try to evaluate the
sum using the cascade sum algorithm and compare the results. Repeat this exercise for
some of the integrals in {2}. Also evaluate the integrals using a composite rule based
on 32-point Gauss-Legendre formula. Try the following techniques for accumulating the
sum. (i) Use one variable to accumulate the sum in a straightforward manner. (ii) Find
the sum over 32 points separately and then add it to the running variable. Compare the
roundoff errors in the two cases.
12. Assuming the roundoff error to be uniformly distributed in the interval $[-\epsilon, \epsilon]$, show
that the standard deviation is $\sigma = \epsilon/\sqrt{3}$. Show that if $\epsilon_1$ and $\epsilon_2$ are the error bounds on $x_1$ and $x_2$
respectively, then the variance of the error in $x_1 + x_2$ is given by $\sigma_{12}^2 = \sigma_1^2 + \sigma_2^2 = (\epsilon_1^2 + \epsilon_2^2)/3$
(Section 9.3). Here it is assumed that the process of addition does not introduce any
further error, which is true for the fixed-point arithmetic, or when the exponents are the
same for floating-point numbers. Generalise this result to sum of n numbers and justify
(6.65).
13. Consider the Chebyshev quadrature formulae of the form
$$\int_{-1}^{1} f(x)\,dx = H_n \sum_{i=1}^{n} f(a_i),$$
which should be exact for all polynomials of degree less than or equal to $n$. From the
requirement that this formula is exact for $f(x) \equiv 1$, show that $H_n = 2/n$. Obtain the
following conditions on the abscissas
$$S_j = \sum_{i=1}^{n} a_i^j = \frac{n}{2}\,\frac{1 - (-1)^{j+1}}{j+1}, \qquad (j = 1, 2, \ldots, n).$$
Assume that the abscissas are the zeros of the polynomial $x^n + c_1 x^{n-1} + c_2 x^{n-2} + \cdots + c_n$.
Find the coefficients $c_i$ using the Newton's equations
$$S_1 + c_1 = 0, \qquad S_2 + c_1 S_1 + 2c_2 = 0, \qquad \ldots, \qquad S_n + c_1 S_{n-1} + c_2 S_{n-2} + \cdots + n c_n = 0.$$
Find the zeros of this polynomial to obtain the abscissas. Try the formulae for $n = 3, 4, 8$
and obtain the following formulae
$$\int_{-1}^{1} f(x)\,dx = \frac{2}{3}\big[f(-1/\sqrt{2}) + f(0) + f(1/\sqrt{2})\big],$$
$$\int_{-1}^{1} f(x)\,dx = \frac{1}{2}\big[f(-a_1) + f(-a_2) + f(a_2) + f(a_1)\big],$$

where $a_1 = \sqrt{\dfrac{5+2\sqrt{5}}{15}}$ and $a_2 = \sqrt{\dfrac{5-2\sqrt{5}}{15}}$. Show that it is not possible to obtain such a
formula in the last case, since the resulting polynomial has a pair of complex roots.
14. Derive quadrature formulae of the form
$$\int_0^1 \sqrt{x}\, f(x)\,dx \approx \sum_{i=0}^{n} w_i f(i/n),$$
which are exact for all polynomials of degree less than or equal to $n$ (use $n = 2, 4, 8$). Use
these formulae to evaluate the following integrals
$$I_1 = \int_0^1 \sqrt{x}\,\sin x\,dx \approx 0.364221932032,$$
$$I_2 = \int_0^1 \sqrt{x}\,\cos x\,dx \approx 0.531202683085.$$


Compare the accuracy of these formulae with those of the Gauss-Legendre formula as
well as the Gaussian formulae with weight function Vx discussed in Section 6.5.4.
15. Derive a quadrature formula of the form
$$\int_0^1 f(\sqrt{x})\,dx \approx w_1 f(0) + w_2 f(1/2) + w_3 f(1),$$
which is exact for all quadratic polynomials. Apply the formula to estimate
$$\int_0^1 \sqrt{1+\sqrt{x}}\,dx = \frac{8(\sqrt{2}+1)}{15}.$$
Can you transform this formula to apply over the interval [0, a]? Try to obtain the formula
over [0, a] and use composite Simpson's rule over [a, 1], and adjust a to get satisfactory
accuracy. Compare the efficiency of this formula with that of the Simpson's rule over the
entire interval.
16. Find the total arc length as well as the area of the ellipse $x^2 + y^2/9 = 1$.
17. Derive Gaussian formulae of the form
$$\int_0^1 f(x)\,\ln(1/x)\,dx \approx \sum_{i=1}^{n} w_i f(a_i),$$
for $n = 2, 4, 8$. Use these formulae to approximate the integrals $I_5$ and $I_6$ in {2}.
18. Evaluate the following integral using an appropriate Gaussian formula and compare its efficiency with that of the Gauss-Legendre formula
$$\int_0^1 \sqrt{\frac{x}{1-x}}\; e^x\,dx \approx 3.422110929991.$$
19. Consider a quadrature formula of the form ($k$ is an integer)
$$\int_0^{\pi} \cos(kx)\, f(x)\,dx \approx w_1(k) f(0) + w_2(k) f(\pi/2) + w_3(k) f(\pi),$$
which is exact for all quadratic polynomials. Here the weights depend on $k$. Obtain the
set of equations to determine the weights and solve them for $k = 50$. Use the formula to
estimate the integral
$$\int_0^{\pi} \cos(50x)\, e^x\,dx = \frac{e^{\pi} - 1}{1 + 50^2}.$$
20. Derive a 5-point Lobatto quadrature formula of the form
$$\int_{-1}^{1} f(x)\,dx \approx w_1\big(f(-1) + f(1)\big) + w_2\big(f(-a) + f(a)\big) + w_3 f(0).$$
Find the weights and abscissas such that this formula is exact for all polynomials of
degree less than or equal to seven.
21. Derive Filon's formula for integrating rapidly oscillating functions and use it to approximate the following integrals
$$I_1 = \int_0^{2\pi} x \cos x \sin nx\,dx = \begin{cases} -\pi/2 & \text{for } n = 1; \\ -\dfrac{2n\pi}{n^2-1} & \text{for } n \ne 1; \end{cases}$$
$$I_2 = \int_0^{2\pi} \frac{x \sin nx}{\sqrt{1 - x^2/4\pi^2}}\,dx = 2\pi^3 J_1(2\pi n);$$
$$I_3 = \int_0^{2\pi} \ln x\,\sin nx\,dx = -\frac{1}{n}\big[\gamma + \ln 2n\pi - \mathrm{Ci}(2n\pi)\big];$$
where $\gamma = 0.5772156649015\ldots$ is the Euler's constant and Ci is the cosine integral
$$\mathrm{Ci}(x) = \int_0^x \frac{1 - \cos t}{t}\,dt.$$
In each of these integrals, try $n = 10, 20, 30, 50$. Also evaluate these integrals by breaking
up the range at the zeros of the integrand and integrating each part using a five-point
Lobatto rule. Also try the integrals using Simpson's rule. Compare the relative efficiency
of these formulae. For the last two integrals try to improve the efficiency of Filon's formula
by appropriately breaking up the range and evaluating the singular part using appropriate
Gaussian formulae.
22. Evaluate the following improper integrals using the method of approaching the limit:
$$I_1 = \int_0^1 x^{\alpha}\,dx = \frac{1}{\alpha+1} \qquad (\alpha = -0.9, -0.5, 0.5),$$
$$I_2 = \int_0^{\infty} \frac{x\,dx}{1 + e^x} = \frac{\pi^2}{12}.$$
For the first integral use the intervals $[2^{-n-1}, 2^{-n}]$, $(n = 0, 1, 2, \ldots)$, while in the second
case use the intervals $[0,1]$, $[2^n, 2^{n+1}]$, $(n = 0, 1, 2, \ldots)$. In all cases, estimate the value of
$n$ at which the computations should be stopped, and also keep a count of the total number of
function evaluations used. Use the $\epsilon$-algorithm to accelerate the convergence of this technique.
23. Evaluate the following integrals accurately, using appropriate quadrature formulae:
$$I_1 = \int_0^1 \frac{e^{-5x} \cos 10x}{\sqrt{x}}\,dx \approx 0.450874741208;$$
$$I_2 = \int_0^1 x^m \big(\ln(1/x)\big)^n\,dx = \frac{n!}{(m+1)^{n+1}}, \qquad m = 0, \pm 0.5, 1.1, 20; \quad n = 0, \pm 0.5, 4, 4.5, 9;$$
$$I_3 = \int_0^1 \frac{dx}{b + \sin(20\pi x)} = \frac{1}{\sqrt{b^2-1}}, \qquad b = 2, 1.1, 1.01, 1.001, 1.0001;$$
$$I_4 = \int_1^b \frac{dx}{\sqrt{(x^2-1)(b^2-x^2)}}, \qquad b = 2, 1.1, 1.01, 1.001, 1.0001;$$
$$I_5 = \int_{-1}^{1} (1-x^2)^{3/2} \cos x\,dx = 3\pi J_2(1) \approx 1.08293983240;$$
$$I_6 = \int_{-1}^{1} \frac{dx}{\sqrt{1+x+\alpha}}, \qquad \alpha = 10^{-1}, 10^{-3}, 10^{-6}.$$
If possible evaluate the integrals using more than one technique and compare their efficiency. Also print the number of function evaluations required.

24. Evaluate the following integrals over infinite regions as accurately as possible, using any
suitable technique:
$$I_1 = \int_0^{\infty} e^{-x^2} \cos x\,dx = \frac{\sqrt{\pi}}{2}\, e^{-1/4};$$
$$I_2 = \int_0^{\infty} \exp\!\Big(-x^2 - \frac{1}{x^2}\Big)\,dx = \frac{\sqrt{\pi}}{2e^2};$$
$$I_3 = \int_0^{\infty} \Big(\frac{1}{e^x - 1} - \frac{e^{-x}}{x}\Big)\,dx = \gamma \approx 0.577215664902;$$
$$I_4 = \int_0^{\infty} e^{-x^2} \ln x\,dx = -\frac{\sqrt{\pi}}{4}\,(\gamma + 2\ln 2);$$
$$I_5 = \int_0^{\infty} \ln\!\Big(\frac{e^x + 1}{e^x - 1}\Big)\,dx = \frac{\pi^2}{4};$$
$$I_6 = \int_0^{\infty} \frac{e^{-x} \cos 4x}{\sqrt{x}}\,dx = \frac{\sqrt{\pi}\,\cos\!\big(\tfrac{1}{2}\tan^{-1} 4\big)}{(1+16)^{1/4}};$$
$$I_7 = \int_0^{\infty} \frac{dx}{(1+x)x^p} = \pi\,\mathrm{cosec}\,p\pi, \qquad p = 0.5, 0.9;$$
$$I_8 = \int_2^{\infty} \frac{(\alpha-1)\,x\,dx}{(1+x^2)(\ln x)^{\alpha}}, \qquad \alpha = 2, 1.5, 1.1, 1.01, 1.001.$$
If possible evaluate the integrals using more than one technique and compare their efficiency. Also print the number of function evaluations required.
25. Evaluate the following integrals, where the integrand has a singularity close to the interval
of integration:
$$I_1 = \int_a^1 \frac{dx}{e^x - 1} = \ln(1 - e^{-1}) - \ln(1 - e^{-a}) \qquad (a = 10^{-1}, 10^{-3}, 10^{-6}),$$
$$I_2 = \int_a^1 \frac{dx}{x - \sin x} \qquad (a = 0.2, 0.1, 0.01, 0.002).$$

26. Evaluate the Cauchy principal value of the following integrals using appropriate quadrature formulae:
$$I_1 = P \int_{-1}^{1} \frac{e^x}{x}\,dx \approx 2.114501750751,$$
$$I_2 = P \int_1^{\infty} \frac{dx}{x(x-t)} \qquad (t = 1, 1.01, 1.1, 1.5),$$
$$I_3 = P \int_0^{\infty} \frac{\tan x}{x}\,dx = \frac{\pi}{2}.$$

27. Evaluate the following integrals
$$I_1 = \int_0^{\infty} \frac{\sin x}{x}\,dx = \frac{\pi}{2},$$
$$I_2 = \int_0^{\infty} \frac{dx}{x^2 + b\cos x} \qquad (b = 0.1, 1.0).$$
In each case, break the integral at the zeros of the integrand and evaluate each part
separately to get an alternating infinite series. If necessary, speed-up the convergence

by using appropriate technique. In the last case, subtract the average value x- 2 before
evaluating the oscillatory part.
28. Evaluate the following integrals
$$I_1 = \int_0^1 \frac{x^2\,dx}{b + \sin(n\pi x)} \qquad (n = 10, 20, 50;\ b = 2, 1.1, 1.001);$$
$$I_2 = \int_0^1 \frac{dx}{1 + x\sin(n\pi x)} \qquad (n = 10, 20, 50);$$
$$I_3 = \int_0^1 \frac{x^2\,dx}{\big(1 + \sin(n\pi x)\big)^{1/3}}, \qquad n = 10, 20, 50.$$

29. Write a program to evaluate the following integral, which arises in statistical mechanics:
$$V(\alpha, p) = \int_0^{\infty} \frac{x^p\,dx}{e^{\alpha+x} + b}$$
over a wide range of values for the parameter $\alpha$. For Fermi-Dirac statistics $b = 1$ and use
$\alpha = 0, \pm 1, \pm 2, \pm 5, \pm 10$. For Bose-Einstein statistics $b = -1$ and try $\alpha = 0.001, 0.1, 1, 2, 5$.
30. Evaluate the following integral which arises in the problem of scattering by central force
fields in classical mechanics:
$$I = \int_0^{x_1} \frac{s\,dx}{\sqrt{1 - V(x)/E - s^2 x^2}},$$
where $x_1$ is the smallest positive zero of the denominator. Consider the potential
$V(x)/E = x^4$, and try $s = 10^n$ for $n = -4, -2, 0, 2, 4$.
31. Write a subroutine for automatic integration and test it on the integrals in {23}-{28}. For
infinite range, truncate the interval appropriately. Print the value of the integral along
with the estimated error and the number of function evaluations used. The last number
can be obtained by keeping a counter in the function routine, which can be used via a
common block, if the facility is not provided in the subroutine itself. Try various values of
absolute and relative error tolerances and compare it with the actual error. Also test any
routine for automatic integration available in your computer using the same integrals.
Also try the following integrals:

$$I_1(\alpha) = \int_0^1 \frac{dx}{5 + e^{\alpha x}} = -\frac{1}{5\alpha} \ln\!\Big(\frac{1 + 5e^{-\alpha}}{6}\Big);$$
$$I_2 = \int_0^1 \frac{x^3\,dx}{x - \sin x} = 6.1015908643463;$$
$$I_3(n) = \int_0^1 \frac{dx}{1 + \frac{1}{2}\sin n\pi x} = \begin{cases} 1.15470053837925 & \text{for } n \text{ even}; \\ 1.14859101172116 & \text{for } n = 63; \end{cases} \qquad n = 63, 64, 90;$$
$$I_4 = \int_0^1 e^x \ln x\,dx = -1.31790215145440;$$
$$I_5 = \int_0^1 \sin(1/x)\,dx = 0.504067061907;$$
$$I_6(n) = \int_0^1 \sin\!\Big(\frac{1}{2.5} + n\pi x\Big)\,dx = \cos\!\Big(\frac{2}{5}\Big)\,\frac{1 - (-1)^n}{n\pi}, \qquad n = 63, 64, 90;$$
$$I_7 = \int_0^1 \sin(2k\pi x)\,dx = 0, \qquad k = 61, 62, 63, 64, 65, 66, 67, 68;$$
$$I_8 = \int_0^1 \cos(2k\pi x)\,dx = 0, \qquad k = 61, 62, 63, 64, 65, 66, 67, 68;$$
$$I_9 = \int_{0.1}^{0.2} W(x)\,dx = -0.0131960130162;$$

where $W(x)$ is the Weierstrass continuous nondifferentiable function
$$W(x) = \frac{1}{\pi} \sum_{j=1}^{\infty} 2^{-j} \cos(7^j \pi x).$$

32. Attempt to evaluate the following integrals using any automatic integration routine:

$$I_1 = \int_0^1 |x-a|^{\alpha}\,dx = \frac{a^{\alpha+1} + (1-a)^{\alpha+1}}{\alpha+1} \qquad (a = \ldots;\ \alpha = 3, 1, 0.5, -0.5, -0.9);$$
$$I_2 = \int_0^1 \ln(|x-a|)\,dx = a\ln a - a + (1-a)\ln(1-a) - (1-a).$$
Break up the range of integration at $x = a$ and transform the variable to get the discontinuity at $x = 0$. Evaluate both parts separately and compare the results.
33. Attempt to evaluate the following divergent integrals using various techniques and test
which ones give an indication of possible divergence:

$$I_1 = \int_0^1 \frac{dx}{x\ln x}, \qquad I_3 = \int_0^{\infty} \frac{x+2}{x^2+1}\,dx, \qquad I_4 = \int_0^{\infty} \frac{(x+2)\,dx}{(x^2+5)\ln x}.$$
34. Find the sum of the following series by using the Euler-Maclaurin formula
$$S_n = \sum_{i=1}^{n} \frac{1}{i}, \qquad \big(n = 200,\ 12495,\ 2^{44},\ 10^{100},\ 10^{(10^{10})}\big).$$
Also evaluate $S_n - \ln n$ for each of these $n$, and estimate the value of the Euler's constant
from
$$\lim_{n \to \infty} (S_n - \ln n) = \gamma \approx 0.577215664901532860606512090082.$$

35. Consider the series for the Riemann zeta function
$$\zeta(3) = \sum_{i=1}^{\infty} \frac{1}{i^3}.$$
Accelerate the convergence of this series by subtracting an appropriate factorial function
and evaluate the sum. Evaluate the same sum using the Euler-Maclaurin sum formula.
Define
$$S_n = \sum_{i=1}^{n} \frac{1}{i^3},$$
and apply the $\epsilon$-algorithm to $S_n, S_{2n}, S_{4n}, \ldots$ to accelerate the convergence and evaluate
the limit $\zeta(3)$. Compare the efficiencies of these methods.
36. Derive the Euler's transformation for an infinite series by splitting and combining the
terms as follows
$$S = \tfrac{1}{2}v_0 - \tfrac{1}{2}(v_1 - v_0) + \tfrac{1}{2}(v_2 - v_1) - \tfrac{1}{2}(v_3 - v_2) + \cdots$$
Using the forward differences, this equation gives the first step of transformation. Repeat
the same process with the new series to obtain the Euler's transformation.
37. Apply the Euler's transformation to evaluate the following series
$$S_1 = 1 - \frac{1}{3} + \frac{1}{5} - \frac{1}{7} + \cdots = \frac{\pi}{4},$$
$$S_2 = 1 - \frac{1}{\sqrt{2}} + \frac{1}{\sqrt{3}} - \frac{1}{\sqrt{4}} + \cdots \approx 0.604898643422,$$
$$S_3 = \frac{1}{\ln 2} - \frac{1}{\ln 3} + \frac{1}{\ln 4} - \frac{1}{\ln 5} + \cdots \approx 0.924299897223,$$
$$S_4 = \sum_{n=1}^{N} (-1)^n\, \frac{1 + an^2}{n^2 + 3n + 5}.$$

The last series is actually divergent, but Euler's transformation will give the "average"
value.
38. Evaluate the following series
$$S_1 = \sum_{i=1}^{\infty} \frac{1}{i^3 + 2} \approx 0.507160669571,$$
$$S_2 = \sum_{i=1}^{\infty} \frac{1}{i^2 + 3i + 4} \approx 0.436827337720,$$
$$S_3 = \sum_{i=1}^{n} i^5 = \frac{n^2(n+1)^2(2n^2+2n-1)}{12}.$$

39. Evaluate the following two series by any method

$$S_1 = 1 + \frac{1}{3} - \frac{1}{2} + \frac{1}{5} + \frac{1}{7} - \frac{1}{4} + \frac{1}{9} + \frac{1}{11} - \frac{1}{6} + \cdots$$
$$S_2 = 1 - \frac{1}{2} - \frac{1}{4} + \frac{1}{3} - \frac{1}{6} - \frac{1}{8} + \frac{1}{5} - \frac{1}{10} - \frac{1}{12} + \frac{1}{7} - \cdots$$
40. Evaluate the following integral by breaking up the range at the zeros of integrand and
summing the resulting alternating series:
$$\int_0^{200\pi} \sqrt{200^2\pi^2 - x^2}\,\sin x\,dx \approx 606.092442727.$$

Use Euler's transformation to speed-up the summation.


41. Derive a product rule for integration over a unit sphere in three dimensions $S_3$ and use it
to approximate the following integral
$$I = \int_{S_3} \frac{dV}{x^2 + y^2 + (z-k)^2} = \pi\Big(2 + \Big(\frac{1}{k} - k\Big)\ln\Big|\frac{1+k}{1-k}\Big|\Big) \qquad (k = 2, 0.5).$$

42. Derive fully symmetric formulae of degree three and five for integration over a hypersphere
in d dimensions. Use these formulae to approximate the integrals

h = ( exp(x1 + X2 + ... + Xd) dV ,


JS d
( dV d 71"d/2
h= JSd VxI+x~+ ... +x~ = d-Ir(d/2+1)'

where Sd is the hypersphere xi + x~ + ... + x~ ::; 1. Try d = 2,3,4,6, 10,20.


43. Derive fully symmetric formulae of degree three and five for integration over the entire
space in $d$ dimensions, using a weight function of $\exp(-x_1^2 - x_2^2 - \cdots - x_d^2)$. Use these
formulae to estimate the following integrals for $d = 2, 3, 4, 6, 10, 20$:
$$I_1 = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} (x_1 + x_2 + \cdots + x_d)^6 \exp(-x_1^2 - x_2^2 - \cdots - x_d^2)\,dx_1 \cdots dx_d = \frac{15d^3}{8}\,\pi^{d/2},$$
$$I_2 = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \frac{\exp(-x_1^2 - x_2^2 - \cdots - x_d^2)}{\sqrt{x_1^2 + x_2^2 + \cdots + x_d^2}}\,dx_1\,dx_2 \cdots dx_d = \pi^{d/2}\, \frac{\Gamma\big(\tfrac{d-1}{2}\big)}{\Gamma\big(\tfrac{d}{2}\big)}.$$

44. Evaluate the following integral which arises in potential theory

$$\Phi(a, b, c) = \int\!\!\int\!\!\int \frac{dx\,dy\,dz}{\sqrt{(x-a)^2 + (y-b)^2 + (z-c)^2}}, \qquad (a, b, c) = (2, 0, 0) \text{ and } (0.5, 0, 0).$$
Evaluate the integral over (1) the ellipsoid $x^2 + y^2/2 + z^2/3 \le 1$ and (2) $0 < x < 1$,
$0 < y < 2$, $0 < z < 3$. For the first region try the following techniques: (i) embed the
o< y < 2, 0 < z < 3. For the first region try the following techniques: (i) embed the
region in a rectangle and apply product rules; (ii) transform the region to a sphere and
apply the formulae obtained in {41}, {42}; (iii) write it in a recursive form and use
appropriate quadrature formulae for integration in one dimension.
45. Evaluate the following integrals using any suitable method

$$I_1 = \int_0^1\!\!\int_0^1 \frac{dx\,dy}{1 - xy} = \frac{\pi^2}{6};$$
$$I_2 = \int_0^1\!\!\int_0^1\!\!\int_0^1\!\!\int_0^1 \big(k\cos w - 7kw\sin w - 6kw^2\cos w + kw^3\sin w\big)\,dx_1\,dx_2\,dx_3\,dx_4 = \sin k,$$
$$w = kx_1x_2x_3x_4, \qquad k = 0.1, 0.5, 1, 2;$$
$$I_3 = \int_0^1\!\!\int_0^1 \frac{x^2 + y + 1}{y^2 + x^2 + 3}\,dx\,dy \approx 0.493839087493;$$
$$I_4 = \int_0^1\!\!\int_0^1 e^{xy}\sin(x+y)\,dx\,dy \approx 1.06900706067.$$


46. Transform the integrals in {42} to hyper-spherical coordinates and evaluate them using
product Gauss formulae. Also try Monte-Carlo method or the method using equidis-
tributed sequence and compare their efficiencies with product Gauss formula.
47. Truncate the range of integration suitably for integrals in {43} and evaluate them using
product rules.
48. Evaluate the following integrals in d dimensions (d = 2,3,4,6,10,20) using any suitable
method

49. Consider the integral $\int_0^1 e^x\,dx$, using the Monte Carlo method. What is the expected
error? Try to reduce the variance by subtracting out a function g(x) = 1 + x from this
function. What is the gain in efficiency? Try this technique on a computer and compare
the results with exact value.
50. Apply the method of importance sampling to the integral in the previous problem. Use $p(x) = \frac{2}{3}(1+x)$ and generate random numbers with this probability distribution in the interval
$[0,1]$ (Section 9.2). Using these random numbers, evaluate the integral and estimate the
gain in efficiency.
51. Try to evaluate the integrals

$$I_1 = \int_0^1 \sqrt{x}\,dx = \frac{2}{3}, \qquad I_2 = \int_0^1 \frac{dx}{\sqrt{x}} = 2, \qquad I_3 = \int_0^1 \frac{dx}{x^{3/4}} = 4,$$
using the Monte Carlo method and explain the results.



52. Evaluate the integrals in Example 6.15, using an equidistributed sequence. Use the simple
average as given by (6.210). Also try the following sequence of points in d dimensions:

$$P_n = \Big(\Big(\frac{n(n+1)}{2}\,p_1^{1/2}\Big),\ \Big(\frac{n(n+1)}{2}\,p_2^{1/2}\Big),\ \ldots,\ \Big(\frac{n(n+1)}{2}\,p_d^{1/2}\Big)\Big),$$
where $p_1, p_2, \ldots$ is the sequence of prime numbers $2, 3, 5, 7, 11, 13, \ldots$, and compare the
results with those in Table 6.12.
53. Evaluate the integrals in Example 6.15 for n = 4,6 using Monte Carlo method and also
using equidistributed sequences. For each integral try $N = 10^4, 2\times10^4, 5\times10^4, 10^5, 2\times10^5,
5\times10^5, 10^6, 2\times10^6, 5\times10^6, 10^7$ points. In each case estimate the actual error using
the exact value of the integral and plot log (error) vs log N to check if the theoretically
expected rate of convergence is achieved.
54. Estimate the integral

$$\int_0^1\!\!\int_0^1\!\!\int_0^1 \sin(x)\sin(y)\sin(z)\,dx\,dy\,dz = (1 - \cos 1)^3,$$

using a table of values for the integrand with uniform spacing of 0.1 in x, y, z over the in-
terval [0,1]. Generate an approximation in terms of product B-splines using interpolation
at the tabulated points and evaluate the integral using this approximation and compare
with the exact value. Also evaluate the integral using product Simpson's 1/3 rule using
the same tabular points and compare the results with earlier estimate.
Chapter 7

Nonlinear Algebraic Equations

The solution of a system of nonlinear algebraic or transcendental equations is


probably one of the most difficult problems in numerical analysis. In general,
it is almost impossible to find the solution, unless an approximate value is al-
ready known. Linear equations occur more frequently in numerical calculations,
simply because it is difficult and sometimes impossible to solve the correspond-
ing nonlinear problem. Most of the equations governing the physical world are
nonlinear, but very often for simplicity and tractability linear models are con-
sidered. Solution of boundary value problems for differential equations also leads
to a nonlinear equation or a system of nonlinear equations. Solution of a system
of nonlinear equations also arises in numerical analysis, while deriving certain
formulae. As we have seen in Chapter 6, the Gaussian quadrature formulae can
be obtained by solving a system of nonlinear equations.
It should be noted that it is not necessary to write down the nonlinear
equation in a closed form. A nonlinear equation of the form f(x) = 0 can
involve a function which is defined only implicitly. All that is necessary for the
numerical solution of such equations is that, given a value of x, it should be
possible to calculate the function f (x). Calculation of this function itself may
involve evaluation of integrals, or solution of differential equations, or even a
solution of nonlinear equation. Because of this flexibility, a large number of
problems in numerical computations can be reduced to solution of nonlinear
equations. The enormous variety of problems that are possible adds to the
difficulty in calculating the solution. If it were easy to solve systems of nonlinear
equations, then a large fraction of problems arising in numerical computation
could be trivially solved.
Most of the methods for solution of nonlinear equations are iterative in
nature. That is, we guess an approximate value of the root and substitute it into
some formula involving the equation to obtain the next approximation which
is hopefully closer to the root. This approximation in turn is used to obtain a
third one and so on. The iteration is stopped when we think it has converged
to the root with sufficient accuracy, or it has indicated that it is not going to

converge, or when we get tired and decide to give up. A large number of iterative
methods have been developed for solution of a single nonlinear equation as
well as for solution of a system of nonlinear equations. One important reason
for such a large number of methods is that, none of them will work on all
problems. The amount of software available for solution of nonlinear equations
is much less than that for linear equations and whatever is available may not
work for a given problem, unless some analysis is done beforehand to locate
the approximate position of the roots. Unlike linear systems it is difficult to
give conditions for existence or uniqueness of solution of a system of nonlinear
equations. Even if the number of equations and the variables are equal, there
may be no real solution or the solution may not be unique. Even for a single
nonlinear equation in one unknown, the solution may not exist. On the other
hand, an infinite number of distinct roots are also possible.
In this chapter, we start with the simplest problem of finding real roots of
a single nonlinear equation. The methods for finding such roots can be broadly
classified under two categories, first the iterative methods and second the so-
called always convergent methods, which are also iterative but under appro-
priate circumstances they are guaranteed to converge to the required root.
Fixed-point iteration (Section 7.2), secant iteration (Section 7.4) and Newton-
Raphson method (Section 7.5) are examples of iterative methods, while bisec-
tion (Section 7.1), regula falsi method (Section 7.3) and Brent's method (Sec-
tion 7.6) are examples of always convergent methods. The problem of finding
complex roots of a single equation will be considered in Sections 7.7-7.9, while
special case of finding roots of polynomials will be considered in Sections 7.10-
7.11. In numerical work, because of roundoff error it is usually impossible to
find a solution which satisfies the equation exactly and in fact such a solution
may not exist in the finite set of numbers, that can be represented in a machine.
Further, roundoff error is significant when a function is evaluated near its zero,
since it necessarily involves almost exact cancellation between the positive and
negative terms, that occur in the function evaluation process. The influence of
roundoff error on iterative methods will be considered in Section 7.12 and some
examples of ill-conditioned problems will be considered in Section 7.14. Since
almost all methods for solution of nonlinear equations are iterative in nature,
the algorithms can be broadly divided into two parts, i.e., the iteration and
the convergence test. In Section 7.13 we will consider the important issue of
criterion for convergence of iterative methods.
Finally, in the last few sections we will consider the difficult problem of
solution of a system of nonlinear equations. For such problems, there is practi-
cally no always convergent method and we have to depend on iterative methods,
which are unlikely to converge, unless a reasonable approximation for the root
is already known. This forces us to come up with good approximations. Even
for a single equation it will save considerable computer time, if a reasonable
approximation to the root can be supplied. In fact, for solution of nonlinear
equations, it is most important to analyse the problem and locate the approx-
imate positions of roots before starting the computations.

7.1 Real Roots


We first consider the problem of finding real roots of a single equation f(x) = 0,
which is equivalent to finding zeros of a real valued function. Consequently, we
will use the terms zero and root interchangeably. Conceptually, the simplest way
of finding a zero would be to plot the function f(x) against x and the point at
which it crosses the x-axis is the zero of f(x). To get a better accuracy, we can
plot on a larger scale. This is of course, a crude but reliable method for solving
nonlinear equations.
A more convenient procedure for computers is based on the fact that if
f(x) is continuous on an interval (a, b) and further f(a)f(b) < 0, then there
is at least one zero of f (x) in the interval (a, b). Thus, if we want to find all
real zeros of f(x) in some interval (a, b) on which f(x) is continuous, we can
choose a suitable search step h = (b - a)/n (where n is a positive integer), and
compute f(a + jh) for j = 0,1, ... , n. Between any two successive values of the
abscissas for which the function has opposite sign, there would be at least one
zero of f (x). By this process the roots can be located in an interval of length
h.
The drawbacks of this simple search method are obvious - there may
be two (or any positive even number of) zeros in a single subinterval of length
h, which will be missed in this algorithm. While, if the number of zeros in a
single subinterval is odd (but $> 1$), then only one of them may be actually
detected and determined while the others may be ignored. Further, a zero of
even multiplicity can never be located by this process. Most of these problems
can be taken care of, if the derivatives of the function are also examined along
with the function value itself. In this approach, the proper choice of the search
step h is very important. If h is too small, a very large number of function
evaluations would be required, while if h is too large, some of the zeros are
likely to be missed. Obviously, the optimum value of h would depend on the
distribution of zeros of f(x), which in general is not known before the zeros are
computed, but some estimate may be possible if proper analysis is done.
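The search described above can be written down in a few lines; the sketch below (illustrative Python, not one of the routines of Appendix B) records every subinterval of length $h$ on which the computed function changes sign. The test function and interval are arbitrary choices.

```python
def bracket_roots(f, a, b, n):
    """Scan [a, b] in n equal steps of h = (b - a)/n and return the subintervals
    on which f changes sign; each such bracket contains at least one zero of a
    continuous f (zeros of even multiplicity are missed, as noted in the text)."""
    h = (b - a) / n
    brackets = []
    x0, f0 = a, f(a)
    for j in range(1, n + 1):
        x1 = a + j * h
        f1 = f(x1)
        if f0 == 0.0:
            brackets.append((x0, x0))        # grid point happens to be a zero
        elif f0 * f1 < 0.0:
            brackets.append((x0, x1))        # sign change: at least one zero inside
        x0, f0 = x1, f1
    return brackets

if __name__ == "__main__":
    import math
    # Illustrative: the zeros of sin(x) in (0.5, 10) are pi, 2*pi and 3*pi.
    print(bracket_roots(math.sin, 0.5, 10.0, 100))
```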
Once the root has been located in a given interval, it can be conveniently
determined by the method of bisection. Let us assume that, we have found an
interval (Xl, X2) such that f(xd and f(X2) are of opposite sign, if we evaluate
the function at the midpoint of the interval, then the following three possibilities
arise:
1. f( Xl !X2) = 0, which is very unlikely, but if it happens, then of course,
XI!X2 is a zero of f(x).
2. f( Xl !X2 ) has the same sign as f (xd, then the subinterval (X I !X2 , X2) must
contain at least one zero of f(x).
3. f( Xl !X2) has the same sign as f(X2), then the subinterval (Xl, Xl !X2) must
contain at least one zero of f (x).
Thus, by this process we can reduce the length of the original interval by a factor
of two. This iteration can be repeated till the desired accuracy is achieved. In

ten repetitions, the original interval reduces by a factor of $2^{10} = 1024$. Clearly,
if the interval contains more than one zero of f(x), then only one of it would
be located while the rest would be ignored. It may be noted that instead of the
midpoint we can decide to evaluate the function at any point in the interval
(Xl, X2). But in that case, the length of the interval containing the root may
not decrease by a factor of two, since the larger subinterval may be found to
contain the root. Thus, in the absence of any other information the technique
of selecting the midpoint at every step ensures that the interval is reduced by
a factor of two at every step. This is the optimum choice, unless additional
information is available.
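A bare-bones version of this bisection procedure is sketched below (illustrative Python; it is not the subroutine BISECT of Appendix B mentioned later in this section). Following the advice given later, the interval is halved a fixed number of times rather than tested against a tolerance that might never be met.

```python
def bisect(f, x1, x2, nit=50):
    """Bisection: f(x1) and f(x2) must have opposite signs; the bracketing
    interval is halved a fixed number of times (nit)."""
    f1, f2 = f(x1), f(x2)
    if f1 * f2 > 0.0:
        raise ValueError("f(x1) and f(x2) must have opposite signs")
    for _ in range(nit):
        xm = 0.5 * (x1 + x2)
        fm = f(xm)
        if fm == 0.0 or xm == x1 or xm == x2:   # exact zero, or no representable midpoint
            return xm
        if f1 * fm < 0.0:
            x2, f2 = xm, fm      # zero lies in (x1, xm)
        else:
            x1, f1 = xm, fm      # zero lies in (xm, x2)
    return 0.5 * (x1 + x2)

if __name__ == "__main__":
    print(bisect(lambda x: x * x - 2.0, 1.0, 2.0))   # approximately sqrt(2)
```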
Even such a simple method of bisection is not foolproof when applied on
a computer. For example, $\tan(\pi/4) > 0$ and $\tan(2\pi/3) < 0$, but if we apply the
method of bisection, we will come up with a value of $\pi/2$, which is well-known
to be a pole of $\tan x$, rather than a zero. Hence, before applying bisection it
must be ensured that the function is continuous over the given interval and in
case, there is any doubt about that, the value of the function at each of the
midpoints should be examined. If the successive values start increasing, then
obviously we are converging to a singularity.
Further, it may appear that any arbitrary degree of accuracy can be
achieved by repeating the process of bisection a sufficient number of times. This
cannot be true, since as we have seen in Chapter 2, only a finite set of real
numbers can be represented in a computer. Hence, after a sufficient number of
subdivisions, we come to a situation in which the two end points $x_1$ and $x_2$
are two consecutive numbers that can be represented in the machine. Then
the computed value of $(x_1+x_2)/2$ can only be either $x_1$ or $x_2$, since no number
in between these two can be represented in the machine, and the process of
bisection will get into a nonterminating loop. However, even before this limit
is reached the roundoff error will come into effect and the sign of the function
as evaluated by the computer may not be correct, in which case, the wrong
subinterval may be selected.
Roundoff error is significant when the function is evaluated near its zero,
since it will necessarily involve almost exact cancellation between the positive
and negative terms, that occur in the function evaluation process. To describe
such a situation, Wilkinson (1988) defines the domain of indeterminacy associ-
ated with every zero of the given function, within which the computed value of
the function is totally governed by the roundoff error. This concept is very help-
ful in understanding the effect of roundoff error on various methods for solving
nonlinear equations. The size of this domain would be determined by the value
of $\hbar$ for the arithmetic used, as well as the algorithm used for calculating the
function. The size of this domain also puts a fundamental limit on the accuracy
with which the zero of a given function can be determined. The influence of
roundoff error on various methods for solution of nonlinear equations will be
considered in Section 7.12.
Considering the method of bisection, at some stage it may happen that
the midpoint falls inside the domain of indeterminacy associated with the zero.

In that case, the value of the function at that point and hence its sign is totally
governed by roundoff error and the process of bisection may locate the zero
in the wrong subinterval. Obviously, bisection cannot yield an accuracy better
than that permitted by the domain of indeterminacy. It can be shown that
either all the bisection intervals do contain the exact zero, or from a certain
stage onwards at least one of the end points lies in the domain of indeterminacy.
In practice, it is safer to repeat the process of bisection a fixed number of times,
rather than testing for any other criterion which may lead to a nonterminating
loop. Subroutine BISECT in Appendix B provides an implementation of this
method.
EXAMPLE 7.1: Find the zero of the following function using the method of bisection over
the interval (0.6,2.1)
$$f(x) = x^9 - 9x^8 + 36x^7 - 84x^6 + 126x^5 - 126x^4 + 84x^3 - 36x^2 + 9x - 1. \qquad (7.1)$$

The exact value of the zero is of course $x = 1$, but the results obtained using the method
of bisection are given in Table 7.1, which gives the interval after every step as well as the value
of the function at each of the end points. These results were obtained using 24-bit arithmetic
and the last two columns show the contents of $x_1$ and $x_2$ in hexadecimal format. It can be
seen that after the fifth step, when the function is evaluated at the midpoint $x = 0.9984374$,
the value comes out to be positive and the root is located in the wrong interval. After that
the subdivision continues on one side of the zero and the value "converges" to 0.9896252,
which is quite far from the actual root. Further, the function value is of the order of $10^{-18}$
at all these points. At the 23rd step the two end points in hexadecimal form differ by three in
the last place. The midpoint is rounded up to a value just lower than $x_2$, and at the next
step $x_1$ and $x_2$ are the two consecutive numbers that can be represented in the computer. At
this stage, the midpoint is rounded to the same value as $x_1$ and the numbers do not change
beyond this point. Hence, if a convergence criterion of the type $|x_1 - x_2| < 10^{-8}$ is used, then
the iteration will never converge.
To estimate the size of the domain of indeterminacy, we can estimate the roundoff error
in evaluating the function around the zero at $x = 1$. A very optimistic estimate is $126\hbar$, while

Table 7.1: Finding zeros by the method of bisection

Step   x_1        x_2        f(x_1)           f(x_2)           x_1 (hex)   x_2 (hex)
  1   .6000000   1.3500000   -2.62 x 10^-4     7.88 x 10^-5    3F19999A    3FACCCCC
  2   .9750000   1.3500000   -3.81 x 10^-15    7.88 x 10^-5    3F799999    3FACCCCC
  3   .9750000   1.1625000   -3.81 x 10^-15    7.90 x 10^-8    3F799999    3F94CCCC
  4   .9750000   1.0687500   -3.81 x 10^-15    3.43 x 10^-11   3F799999    3F88CCCC
  5   .9750000   1.0218750   -3.81 x 10^-15    1.15 x 10^-15   3F799999    3F82CCCC
  6   .9750000    .9984374   -3.81 x 10^-15    4.23 x 10^-18   3F799999    3F7F9998
  7   .9867187    .9984374   -6.51 x 10^-18    4.23 x 10^-18   3F7C9998    3F7F9998
  8   .9867187    .9925780   -6.51 x 10^-18    1.19 x 10^-18   3F7C9998    3F7E1998
 21   .9896247    .9896255   -4.50 x 10^-18    5.42 x 10^-19   3F7D580C    3F7D5818
 22   .9896251    .9896255   -5.26 x 10^-18    5.42 x 10^-19   3F7D5812    3F7D5818
 23   .9896251    .9896253   -5.26 x 10^-18    2.71 x 10^-18   3F7D5812    3F7D5815
 24   .9896252    .9896253   -9.22 x 10^-19    2.71 x 10^-18   3F7D5814    3F7D5815
 25   .9896252    .9896253   -9.22 x 10^-19    2.71 x 10^-18   3F7D5814    3F7D5815

A very optimistic estimate is 126ℏ, while the value of the function is (x - 1)^9. Equating the two values will give us an estimate of the
size of the domain of indeterminacy d. It can be seen that

    d ≈ (126ℏ)^(1/9) ≈ 0.27,    (7.2)

which is much larger than the error in the calculated value. This result can be traced to
the fact that the computer does not round off intermediate results. Hence, the function is
effectively evaluated using a much higher precision of 64 bits (ℏ ≈ 5.4 x 10^-20), because
when the polynomial is evaluated using nested multiplication as follows

    f(x) = ((((((((x - 9)x + 36)x - 84)x + 126)x - 126)x + 84)x - 36)x + 9)x - 1,    (7.3)

only one accumulator is enough. Nested multiplication is also the most efficient
algorithm for evaluating the polynomial. It can be seen that the function value is of the order
of 10^-18, even though the last term is 1, which implies that ℏ < 10^-18 as far as function
evaluation is concerned. This phenomenon may not be observed on all computers.
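
For reference, a possible Python rendering (an illustrative sketch, not code from the book) of the nested multiplication (7.3) used in this example:

def nested_poly(x):
    """Evaluate Eq. (7.1) by the nested multiplication of Eq. (7.3)."""
    coeffs = [1.0, -9.0, 36.0, -84.0, 126.0, -126.0, 84.0, -36.0, 9.0, -1.0]
    result = 0.0
    for c in coeffs:          # Horner's scheme: a single accumulator suffices
        result = result * x + c
    return result

print(nested_poly(0.9984374))   # tiny value dominated by roundoff near the ninefold zero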

7.2 Fixed-Point Iteration


In this section we consider a simple iterative method, which converges to the
root only under special circumstances. If the equation is written properly, then
it could be very effective, but in general, it may not even converge. If the
equation can be written in the form

x = F(x), (7.4)

and if an approximation x_k to the root is known, then we can hope to improve
on it by using the simple recurrence relation

    x_{k+1} = F(x_k).    (7.5)

This iterative process is referred to as fixed-point iteration. Thus, starting
from some initial guess x_0, we can use the above equation to calculate the suc-
cessive approximations x_1, x_2, ..., until the sequence converges to a satisfactory
accuracy.
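
A minimal Python sketch of this recurrence (the function name and the convergence test are assumptions of this illustration) is:

import math

def fixed_point(F, x0, tol=1e-10, maxiter=100):
    """Fixed-point iteration x_{k+1} = F(x_k), Eq. (7.5)."""
    x = x0
    for _ in range(maxiter):
        x_new = F(x)
        if abs(x_new - x) <= tol * max(1.0, abs(x_new)):
            return x_new
        x = x_new
    return x

# e.g., x = cos x, for which |F'| < 1 near the root, so the iteration converges
print(fixed_point(math.cos, 1.0))   # about 0.739085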
Let us examine the condition under which this iteration can converge to
the correct value of the root, which we assume to be α. Then, α = F(α) and

    x_{k+1} - α = F(x_k) - F(α) = F'(ξ_k)(x_k - α),    (7.6)

where ξ_k lies between α and x_k. If ε_i = x_i - α denotes the error in the ith approx-
imation and if we assume that the iteration converges to α as k → ∞, then
x_k → α and F'(ξ_k) → F'(α). Hence, we get

    lim_{k→∞} ε_{k+1}/ε_k = F'(α),    (7.7)

and

    ε_k ≈ A [F'(α)]^k    (k → ∞),    (7.8)

where A is some constant. If |F'(α)| > 1, then ε_k would grow unboundedly in
magnitude. Hence, |F'(α)| < 1 is a necessary condition for convergence.

Figure 7.1: Fixed-point iteration: (a) monotone convergence, (b) spiral convergence, (c) monotone divergence, (d) spiral divergence.

If we define a convergence factor ρ_k = ε_{k+1}/ε_k, then ρ_k ≈ F'(α) for large k. Thus,
F'(α) may be considered as the asymptotic convergence factor. If |F'(α)| > 1,
then the error at each step will ultimately increase in magnitude and the itera-
tion would be asymptotically divergent. On the other hand, if |F'(α)| < 1,
the iteration would be asymptotically convergent, and if the initial approxima-
tion is sufficiently close to α, the sequence of iterates will converge to α. For
|F'(α)| = 1, the asymptotic behaviour of the iterative process is unpredictable.
In that case, it may not converge, and even if it converges the rate of conver-
gence would be very slow. In general, for the iteration to be effective, |F'(α)|
must be significantly less than unity. Figure 7.1 illustrates the four different
types of behaviour that this iteration may exhibit. For 0 < F'(α) < 1, we get
monotone convergence, i.e., the successive iterates approach the root from one
side. For 0 > F'(α) > -1, we get spiral or oscillatory convergence and the
successive iterates are on different sides of the root. Similarly, for F'(α) > 1,
we get monotone divergence, while for F'(α) < -1, we get spiral or oscillatory
divergence.

It may be noted that this result applies only when the successive iterates
are sufficiently close to the root. The behaviour could be different when the
starting value is far from the root. The iteration may even converge towards
the root initially, even if |F'(α)| > 1, but once it is close to the root it will start
diverging. Similarly, even if |F'(α)| < 1 the iteration may not converge from
arbitrary starting values.
If the iteration is converging slowly, then we can apply Aitken's δ²
method (see Section 6.2) to accelerate the convergence. In practice, the method
of fixed-point iteration is used only under special circumstances (e.g., in the solution
of differential equations, see Section 12.3), when a good approximation to the
root is already available and the asymptotic convergence factor is quite small.
As will be seen in subsequent sections, other iterative methods converge much
faster. Hence, this method is not generally recommended in practice.
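
As an illustration of the acceleration mentioned above, the following Python sketch applies Aitken's δ² formula to three successive iterates (the helper name is an assumption of this sketch; the method itself is described in Section 6.2):

import math

def aitken(x0, x1, x2):
    """Aitken's delta-squared extrapolation from three successive iterates."""
    denom = (x2 - x1) - (x1 - x0)
    if denom == 0.0:
        return x2                     # the sequence has (numerically) converged
    return x2 - (x2 - x1) ** 2 / denom

# accelerate three iterates of x = cos x
x0 = 1.0
x1 = math.cos(x0)
x2 = math.cos(x1)
print(aitken(x0, x1, x2))             # noticeably closer to 0.739085 than x2 is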

7.3 Method of False Position


As in the method of bisection, if we can find two points x_1 and x_2 such that
f(x_1)f(x_2) < 0, then if f(x) is continuous over [x_1, x_2], there exists at least
one zero of f(x) in (x_1, x_2). By using the two points x_1 and x_2 we can apply
linear inverse interpolation to get an estimate x_3 for the zero of f(x)

    x_3 = (y_2 x_1)/(y_2 - y_1) + (y_1 x_2)/(y_1 - y_2),    (7.9)

where y_i = f(x_i). It can be easily shown that x_1 < x_3 < x_2. Now if we
calculate y_3, then by looking at the sign of y_3 we can find the subinterval
(x_1, x_3) or (x_3, x_2) which contains the zero. This procedure can be repeated
till a satisfactory approximation is achieved. The resulting algorithm is known
as the method of false position or the regula falsi method, and will always
converge to the actual value of the zero. The only difference between bisection and
the method of false position is that in the latter the new approximation is
obtained by linear interpolation, while in the former it is simply the midpoint.
In this method the new approximation to x depends on two of the previous
approximations. However, it can be shown that after a sufficient number of
iterations, this method will be a stationary method of functional iteration (just
like fixed-point iteration), since the successive iterates depend only on the last
point, while the other point remains fixed. This behaviour is illustrated in
Figure 7.2, which gives a graphical construction for this process.

Figure 7.2: Graphical illustration of the regula falsi (a) and modified regula falsi (b) methods.

Let us take the fixed point to be x_0 itself; then the successive iterates are given by

    x_{i+1} = y_i/(y_i - y_0) x_0 + y_0/(y_0 - y_i) x_i = F(x_i).    (7.10)

In this case, since we use linear interpolation, we can estimate the truncation
error (x_{i+1} - α) by

    ε_{i+1} = x_{i+1} - α = (g''(η)/2) y_i y_0 = f''(ξ)/(2[f'(ξ)]^3) y_i y_0,    (7.11)

where g(y) is the function inverse to f(x). It may be noted that sometimes the
truncation error is defined with the opposite sign, which should be taken care of
while comparing the equations. Further, since f(α) = 0, using the mean value
theorem, we get

    y_i = f'(ξ_i) ε_i,    y_0 = f'(ξ_0) ε_0.    (7.12)

Here the points ξ_i, ξ_0, ξ and η lie in appropriate intervals. Using (7.11) and (7.12),
we get

    ε_{i+1} = [f''(ξ) f'(ξ_0) f'(ξ_i)] / (2[f'(ξ)]^3) ε_i ε_0,    (7.13)

and therefore

    lim_{i→∞} |ε_{i+1}| / |ε_i| = | f''(ξ) f'(ξ_0) f'(α) / (2[f'(ξ)]^3) | |ε_0|.    (7.14)

Here we are assuming that the iteration converges to the actual root α and
hence ξ_i → α as i → ∞, while ξ tends to a limiting value. If f'(x) is bounded away from zero in a
neighbourhood of α, this ratio tends to a finite limit. Hence, the asymptotic rate of
convergence is linear for this method also. An important difference between this
method and the fixed-point iteration considered in Section 7.2 is the occurrence
of |ε_0| in the above expression on the right-hand side. Thus, if the stationary
point is sufficiently close to α (i.e., |ε_0| is sufficiently small), the right-hand side
of (7.14) will be less than unity and the process will converge. Actually, by
construction this iteration will always converge, because one of the end points
becomes fixed only after |ε_i| has been reduced sufficiently to ensure that the
above convergence criterion is satisfied. In contrast, the fixed-point iteration
does not converge if |F'(α)| > 1, even if the starting value is arbitrarily close
to the actual root.
The regula falsi method is an excellent method, when the available in-
formation on the location of roots is poor. However, it converges very slowly
and in fact, in many cases, the convergence may be slower than that of the
method of bisection. Consequently, it is not very useful. This method can be
improved to give what is usually referred to as modified regula falsi method.

Figure 7.3: Solving x = tan x.

In this method, at each step the function value that is retained is divided by
two. The effect of this modification can be best seen in Figure 7.2. Thus, in
this method, if at any stage the root is bracketed between x_{i-1} and x_{i+1}, then
we compute x_{i+2} using

    x_{i+2} = n y_{i-1}/(n y_{i-1} - y_{i+1}) x_{i+1} + y_{i+1}/(y_{i+1} - n y_{i-1}) x_{i-1},    (7.15)

where 0 < n < 1. The most common choice of n is 1/2. At this step, if y_{i+2} y_{i+1} <
0, then we continue the regula falsi method as usual, using the points x_{i+1} and
x_{i+2}. Otherwise, we replace y_{i-1} by n y_{i-1} and perform another modified step,
using the points x_{i-1} and x_{i+2}. This process is continued until the old point y_{i-1}
is eliminated from the bracketing points. Hence, at each modified step the value
of y_{i-1} is artificially multiplied by n, which ensures that the bounding interval
also tends to zero as the iteration converges. On the other hand, for the simple
regula falsi method, the bounding interval may remain finite as the iteration
converges to the root, because one of the end points remains fixed. This method
is sometimes referred to as the Illinois method. For n = 1/2, it can be shown that
asymptotically a pattern of iteration is set up, which consists of cycles of two
normal secant iterations (see Section 7.4), followed by one modified iteration.
Further, asymptotically ε_{i+3} ≈ K ε_i^3 and the order of this iterative method (see
Section 7.4 for definition) is 3^{1/3} ≈ 1.44. Another modification of the regula falsi
method is considered in {6}.
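
A minimal Python sketch of the modified (Illinois) step with n = 1/2 (illustrative only; the function name, tolerance and loop structure are assumptions, and the Appendix B routines should be consulted for a careful implementation):

def illinois(f, x1, x2, tol=1e-10, maxiter=100):
    """Modified regula falsi: halve the retained function value whenever the
    new point falls on the same side of the zero as the previous one."""
    y1, y2 = f(x1), f(x2)
    if y1 * y2 > 0:
        raise ValueError("root must be bracketed")
    x3 = x2
    for _ in range(maxiter):
        x3 = (y2 * x1 - y1 * x2) / (y2 - y1)   # linear inverse interpolation
        y3 = f(x3)
        if abs(x2 - x1) < tol or y3 == 0.0:
            return x3
        if y3 * y2 < 0:
            x1, y1 = x2, y2    # sign change between x2 and x3: keep both sides
        else:
            y1 = 0.5 * y1      # same side as before: halve the retained value
        x2, y2 = x3, y3
    return x3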
EXAMPLE 7.2: Find the roots of x = tan x.
The roots of this equation are given by the points of intersection of the curves y = x
and y = tan x. Since the second function passes through all values from -∞ to +∞ in each
of the intervals ((n - 1/2)π, (n + 1/2)π) (where n is any integer), the two curves are bound to
intersect in each of these intervals, as is clear from Figure 7.3. Hence, there are an infinite number
of roots, but here we consider only the first two, the root x_1 in the figure and the root at
x = 0.

The root x_1 can be easily located in the interval (π, 3π/2) and starting from x_1 = 3.15
and x_2 = 4.71, we can apply the method of bisection or the regula falsi method to find the exact
value of the root using 24-bit arithmetic. The bisection method converges to the true value
of the root at the expected rate, and after 20 bisections it reaches the limit where the two
end points are two consecutive numbers represented in the machine and no further division
is possible. However, the regula falsi method converges very slowly to the correct root
and even after 100 iterations it gives the value 3.9648, which is far from the correct root at
4.493410. In fact, looking at the regula falsi results, after about 20 iterations it appears as if
the root is close to 3.4 and the difference between successive iterates is approximately 0.01,
even though the actual error is of the order of unity. On the other hand, using the modified
regula falsi method the convergence is much faster and in about 16 iterations the result has
converged to a value which is essentially correct to seven significant figures. After this stage,
the value fluctuates a little because of roundoff error.
To improve the convergence we can start with a smaller interval containing the root.
For example, if we start with x_1 = 4 and x_2 = 4.7, the convergence is somewhat faster, but
even then the regula falsi method converges very slowly and after 100 iterations the calculated
value is 4.49153, which is not very close to the correct root. In this case, the modified regula
falsi method requires only 12 iterations for convergence. If the starting interval is reduced
still further to (4.4, 4.6), the regula falsi method also converges to seven significant figures in
about 10 iterations. Thus, in this case, it is possible to achieve a relative accuracy of almost
ℏ using any of these methods, if we start sufficiently close to the actual root.
For locating the root at x = 0, we start with an initial interval (-1, 1.3). Of course, this
root is obvious from the equation, but nevertheless, to study the behaviour of these methods for
multiple roots, we have included this case. It may be noted that this is a root of multiplicity
three. The method of bisection is unaffected by the multiplicity and converges at the expected
rate to the root. The regula falsi method converges very slowly and after 100 iterations the
computed value is -0.1431. Once again, the modified regula falsi method converges faster,
but the convergence is not very fast, requiring about 30 iterations to reach the limiting value
of 0.000064 (where the function becomes zero), which has a significant error. This behaviour
is typical for multiple zeros, which have a larger domain of indeterminacy associated with them,
and the convergence of most iterative methods is also slow.
The rate of convergence is quite sensitive to the form in which the equation is written.
To demonstrate this, we try the calculations after rewriting the same equation as cot x = 1/x.
In this case the convergence is faster in all cases, because the singularity in the function is far
from the root. In particular, the regula falsi method converges to seven significant digits in
about eight iterations when starting with the interval (4, 4.7). For the new form of the equation
the point x = 0 is a simple zero and the convergence is much faster, but the accuracy does not
improve, because in this case, evaluating the function in the neighbourhood of x = 0 requires
subtraction of two very large quantities. In fact, the relative roundoff error in evaluating the
function is of the same order in both cases. Hence, by rewriting the equation we have reduced
the multiplicity of the zero, but the roundoff error does not reduce. It may be noted that in
this case, the simple regula falsi method converges at the same rate as the modified regula
falsi method. Alternately, we can rewrite the equation as sin x = x cos x, which eliminates the
singularities. In general, while rewriting the equation care should be taken to ensure that
no new root is introduced, and that an already existing root is not eliminated. For example, if the
equation tan x = x^2 is rewritten as cot x = 1/x^2, then the root at x = 0 is eliminated.

7.4 Secant Method


The secant method is similar to the method of false position, except for the
fact that we drop the condition that f(x) should have opposite signs at the
two points used to generate the next approximation. Instead, we always retain
the last two points x_i and x_{i-1} to generate x_{i+1} by linear inverse interpolation
using

    x_{i+1} = y_i/(y_i - y_{i-1}) x_{i-1} + y_{i-1}/(y_{i-1} - y_i) x_i
            = x_i - y_i (x_i - x_{i-1})/(y_i - y_{i-1}).    (7.16)

The second form is more convenient and has better roundoff properties. The first
form may have large roundoff error if y_i and y_{i-1} have the same sign, which is
possible in the secant method, but not in the regula falsi method. As mentioned in
Section 4.2, the second form will also have large roundoff error in the second
term, but since that term is only a small correction to x_i, in general the errors
may not be very serious.

Figure 7.4: Secant Iteration
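
A minimal Python sketch of the secant iteration using the second form of (7.16) (the names and stopping tests are assumptions of this illustration; note the explicit check for f(x_i) = f(x_{i-1}), cf. Example 7.3):

def secant(f, x0, x1, tol=1e-10, maxiter=50):
    """Secant iteration using the second form of Eq. (7.16)."""
    y0, y1 = f(x0), f(x1)
    for _ in range(maxiter):
        if y1 == y0:
            break                       # f(x_i) = f(x_{i-1}): cannot continue
        x2 = x1 - y1 * (x1 - x0) / (y1 - y0)
        if abs(x2 - x1) <= tol * max(1.0, abs(x2)):
            return x2
        x0, y0, x1, y1 = x1, y1, x2, f(x2)
    return x1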
In contrast to the regula falsi method, the secant iteration does not bracket
the root and it is not even necessary to bracket the root to start the iteration.
Hence, it is obvious that the iteration may not always converge. On the other
hand, if it converges, the convergence is faster. Thus, by dropping the necessity
of bracketing the root, we improve the rate of convergence at the risk that in
some cases, the iteration may not converge at all. The name secant iteration
comes from the fact that at any stage the next approximation can be obtained
by drawing the secant line through the two most recent points (see Figure 7.4).
Instead, if we draw a tangent at the most recent point, we get the Newton-
Raphson method considered in the next section.
To estimate the convergence rate for the secant iteration, using (7.13) we get

    ε_{i+1} = [f''(ξ) f'(ξ_i) f'(ξ_{i-1})] / (2[f'(ξ)]^3) ε_i ε_{i-1}.    (7.17)

If the first derivative is bounded away from zero, the term in the square bracket
is bounded in some neighbourhood of α. It immediately follows that if the initial

approximations x_1 and x_2 are sufficiently close to α, then the iteration will
converge. In the limit i → ∞, we can write

    |ε_{i+1}| ≈ K |ε_i| |ε_{i-1}|,    (7.18)

where K is the limiting magnitude of the term inside the square bracket of (7.17).
If we define d_i = ln(K|ε_i|), then

    d_{i+1} = d_i + d_{i-1},    (7.19)

which is a linear homogeneous difference equation. In fact, it is the well-known
Fibonacci equation. The general solution of this equation is given by

    d_i = C_1 α_1^i + C_2 α_2^i,    (7.20)

where C_1, C_2 are arbitrary constants and α_1, α_2 are the roots of the correspond-
ing characteristic equation

    α^2 = α + 1.    (7.21)

Hence

    α_1 = (1 + √5)/2,    α_2 = (1 - √5)/2,    (7.22)

and

    lim_{i→∞} d_{i+1}/d_i = α_1 = (1 + √5)/2 ≈ 1.618.    (7.23)

From the definition of d_i and K, we get

    lim_{i→∞} |ε_{i+1}| / |ε_i|^{α_1} = |f''(α) / (2f'(α))|^{α_1 - 1}.    (7.24)
To compare the rate of convergence of various iterative methods, it is
convenient to define a quantity known as the order of an iterative
method. If the iteration converges to the root α and there exists a real number
p ≥ 1, such that

    lim_{i→∞} |ε_{i+1}| / |ε_i|^p = c ≠ 0,    (7.25)

then p is said to be the order of the iterative method for the root at x = α.
The constant c is the asymptotic error constant and depends on the function
f(x) as well as the root α. The requirement c ≠ 0 means that c does not vanish for a
general function f(x). In special cases, c may be zero and the iteration will
converge faster. If p = 1 the method is said to converge linearly, as is the
case with both the method of bisection and regula falsi. However, for secant
iteration p ≈ 1.618. As we have seen earlier, for the case of p = 1, a necessary
condition for convergence would be c < 1. This condition is no longer necessary
if p > 1, in which case the iteration will converge, provided the starting values
are sufficiently close to the actual value of the root. Clearly, for p > 1, the
rate of convergence would be faster, since the number of significant digits of

accuracy would increase exponentially. For example, if p = 2, then at every
step the number of significant digits will be doubled.
Thus, the advantage of secant iteration over the method of false position is
obvious. Further, in this case, it is not necessary to locate the root between two
values. Of course, all these advantages cannot be achieved without any sacrifice.
The price we pay in this case is that the iteration may not converge in all
cases. For isolated roots, if the initial approximation is sufficiently close to the
actual value of the root, then the iteration will converge. However, for multiple
roots the iteration may not converge, even when the initial approximations are
arbitrarily close to the actual roots. This can be easily seen for a root of
even multiplicity, if we consider a case in which x_1 and x_2 are on opposite sides
of the root and are such that f(x_1) = f(x_2). For multiple roots, even if the
iteration converges, the convergence is linear.
For isolated roots, if the secant iteration converges, then the convergence
is usually quite fast and any desired accuracy permitted by the domain of
indeterminacy can be achieved. One advantage of the iterative methods over
the method of bisection is that, once the iterates are inside the domain of
indeterminacy, the difference between two successive values starts fluctuating
randomly, giving an indication of the size of the domain and hence an error
estimate (see Section 7.12).
Now let us consider the behaviour of secant iteration for the case when
x = α is a double root of f(x) = 0. In this case, we have f(x_i) ≈ (1/2) f''(α) ε_i^2 and
the successive values of ε_i can be shown to satisfy the equation

    ε_{i+1} = ε_i ε_{i-1} / (ε_i + ε_{i-1}).    (7.26)

Writing u_i = 1/ε_i, we get

    u_{i+1} = u_i + u_{i-1},    (7.27)

which is the same difference equation that we encountered earlier and the gen-
eral solution is given by (7.20). For large values of i

    u_i ≈ C_1 (1.618)^i    and    ε_i ≈ 1 / (C_1 (1.618)^i).    (7.28)

Hence, the error is divided by approximately a factor of 1.618 at every step.
For roots of higher multiplicity, the asymptotic convergence rate is even slower
{15}. For a root of multiplicity m > 1, the convergence is linear with the error
being multiplied by a factor α, which is given by the real root in the interval
(0, 1) of

    α^m + α^{m-1} - 1 = 0.    (7.29)

Unlike the Newton-Raphson method (see Section 7.5), it is not easy to modify the
secant iteration to produce higher order convergence, even if the multiplicity of
the root is known {16}.

EXAMPLE 7.3: Solve the equation in Example 7.2 using secant iteration.
Using 24-bit arithmetic and the same pairs of starting values as in Example 7.2, we
repeat the calculation using secant iteration. Using the equation x = tan x, the iteration does
not converge to the root x_1, unless the initial interval is rather small (4.4, 4.6). For other
values, if the iteration is continued further, it ultimately converges to the root at x = 0. The
convergence to this root is in any case rather slow, because of its multiplicity. In all cases,
we ultimately encounter a situation where f(x_i) = f(x_{i-1}) and it is not possible to continue
further. In general, such a situation can happen either because of roundoff errors within the
domain of indeterminacy, or because x_i = x_{i-1} as far as machine representation is concerned,
or just by coincidence, even when the iteration is far from the zero. This condition should
be detected explicitly in any subroutine implementing secant iteration, since it will occur in
some case or the other, even if the iteration is terminated as soon as a reasonable convergence
criterion is satisfied.
This example clearly illustrates the problem with secant iteration, i.e., that the iteration
may not converge to the required root. However, if we consider the second form of the equation
(cot x = 1/x), then the iteration converges very rapidly in all cases considered. For the root
x_1 five iterations are sufficient to get an accuracy of seven significant figures. For the root at
x = 0, after three iterations the values keep oscillating (around x ≈ 10^-7), thus giving some
indication of the degree of uncertainty in estimating this root. Hence, if used properly, the
secant method can provide an efficient algorithm for solving nonlinear equations. Further, it
can also provide an estimate of the roundoff error.

7.5 Newton-Raphson Method


In the secant method, if we take the limit in which the two points coincide, then
the function is approximated by a tangent (Figure 7.5) and we get the Newton-
Raphson method, where the successive iterates are calculated using the formula

    x_{i+1} = x_i - f(x_i)/f'(x_i).    (7.30)

This formula can also be derived by approximating the function by the first two
terms of the Taylor series about x_i and then finding the zero of the approximating
function.
In this case, the error ε_{i+1} can be written in the form

    ε_{i+1} = x_{i+1} - α = x_i - α - f(x_i)/f'(x_i) = (1/2) (f''(η)/f'(α)) ε_i^2,    (7.31)

where η is some point between α and x_i. Here the second form can be ob-
tained by expanding f(x_i) in a Taylor series about x = α. It is assumed that
f'(α) ≠ 0, i.e., α is an isolated root of f(x) = 0. Thus, we see that for isolated
roots, the Newton-Raphson method is of order two, or the convergence is quadratic.
Hence, at each step, the number of significant figures of accuracy will be dou-
bled. The Newton-Raphson method converges faster than the secant iteration,
but it should be noted that the Newton-Raphson method involves the calcula-
tion of the first derivative also.
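
A minimal Python sketch of the iteration (7.30) (the names and the stopping test are assumptions of this illustration):

import math

def newton_raphson(f, fprime, x0, tol=1e-10, maxiter=50):
    """Newton-Raphson iteration, Eq. (7.30)."""
    x = x0
    for _ in range(maxiter):
        dfx = fprime(x)
        if dfx == 0.0:
            break                          # derivative vanished; give up
        dx = f(x) / dfx
        x = x - dx
        if abs(dx) <= tol * max(1.0, abs(x)):
            return x
    return x

# e.g., the rewritten equation of Example 7.2, cot x = 1/x, started near 4.5
print(newton_raphson(lambda x: 1.0 / math.tan(x) - 1.0 / x,
                     lambda x: -1.0 / math.sin(x) ** 2 + 1.0 / x ** 2, 4.5))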
For comparing the efficiency of the two methods, we can use the efficiency
index {7} defined by

    EI = p^{1/θ},    (7.32)

where p is the order of the method and θ is the cost per iteration, which is
usually measured in units of the cost required for a function evaluation. For ex-
ample, if some method requires two function evaluations in each step, then
θ = 2 and the effective order of the method per function evaluation will be
p^{1/2}. Hence, the efficiency index of the method gives the effective order per
function evaluation (or an equivalent amount of calculation). Of course, this defi-
nition of efficiency index is meaningful only when the asymptotic rate of
convergence is realised. In practice, usually much more effort is spent in locat-
ing the approximate position of the root, before either of the iterative methods
starts converging rapidly.

Figure 7.5: Newton-Raphson Method

For secant iteration, the cost per iteration is one, since only the
value of the function is required, while for the Newton-Raphson method θ = 1 + α,
where α is the relative cost of evaluating f'(x). Thus, for secant iteration,
EI_S ≈ 1.618, while for the Newton-Raphson method EI_N = 2^{1/(1+α)}. Hence,
if α < 0.44, the Newton-Raphson method is more efficient, while for α > 0.44,
the secant iteration is more efficient. If f(x) is a polynomial, then evaluation of
f'(x) is almost as costly as evaluation of f(x) and α ≈ 1. Hence, for polynomials
secant iteration would be more efficient. On the other hand, if f(x) involves
elementary functions like exponentials or trigonometric functions, then often the
evaluation of f'(x) may not require much extra effort, provided both f(x) and
f'(x) are evaluated in the same subprogram unit, or the relevant information
is preserved. For more complicated functions that are usually encountered, it
may be extremely difficult to calculate the derivative, in which case the use
of the Newton-Raphson method would be out of the question. Calculating derivatives
numerically for use in the Newton-Raphson method should not be considered, since
apart from being less efficient as compared to the secant iteration, there will
be difficulties due to roundoff errors.
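
As a small illustrative calculation (not from the book), the crossover value of α quoted above follows from equating the two efficiency indices:

import math

# the cost ratio at which the two efficiency indices coincide,
# i.e., the solution of 2**(1/(1 + alpha)) = (1 + sqrt(5))/2
phi = (1.0 + math.sqrt(5.0)) / 2.0        # efficiency index of secant iteration
alpha = math.log(2.0) / math.log(phi) - 1.0
print(alpha)                              # approximately 0.44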

Now let us examine the behaviour of the Newton-Raphson method when the
zero has a multiplicity r > 1. By Taylor series expansion, in the neighbourhood
of the zero at x = α, we get

    f(x) ≈ (x - α)^r f^(r)(α)/r!    and    f'(x) ≈ (x - α)^{r-1} f^(r)(α)/(r - 1)!.    (7.33)

Using (7.31), we get

    ε_{i+1} ≈ ε_i (1 - 1/r).    (7.34)

Hence, this method converges linearly when α is a multiple zero. At every step
the error reduces by a factor of (1 - 1/r), which is 1/2 for a double root and 2/3
for a triple root. Thus the convergence is rather slow and further, the higher the
multiplicity, the slower is the convergence.
This method can be easily modified to give quadratic convergence for a
zero of multiplicity r, if we write

    x_{i+1} = x_i - r f(x_i)/f'(x_i).    (7.35)

However, it requires the knowledge of r, which is generally not available. In most
cases, it may be possible to estimate the multiplicity by considering the rate
of convergence of the iteration. For example, if we find that the difference between
two successive iterates is decreasing by a factor of two at every step, then it
should be a double root. An alternative method may be to use the function
u(x) = f(x)/f'(x) for finding zeros. It may be noted that u(x) has a simple
zero at x = α, irrespective of the multiplicity of the zero of f(x). If we find
zeros of u(x) by any of the iterative methods, the rate of convergence will be
independent of the multiplicity of the zeros. However, this technique is not
very efficient, since evaluating u(x) and possibly its derivatives requires more
computation than that required for evaluating f(x). Further, this function has
poles at those zeros of f'(x) which are not zeros of f(x). Similarly, the function
f(x)^{1/r} also has a simple zero at x = α, but evaluating this function requires
the knowledge of the multiplicity r.
The subroutine NEWRAP in Appendix B attempts to implement the
Newton-Raphson method for multiple zeros with unknown multiplicity. In this
subroutine, at each step the ratio t_i = (x_{i+1} - x_i)/(x_i - x_{i-1}) is computed
to estimate the multiplicity of the zero. Assuming that the current estimate r_0
of the multiplicity is being used to calculate the successive iterates using (7.35),
it can be shown that asymptotically the ratio t_i ≈ 1 - r_0/r, where r is the
actual multiplicity of the root. Hence, from this ratio, we can estimate the
multiplicity r = r_0/(1 - t_i). However, for this formula to be applicable, it is
essential that the last two iterations should have been carried out with the same
value of multiplicity. Hence, the multiplicity is not changed at every step, but
once a value is accepted, at least three iterations are performed with the same
multiplicity. After that, if it is found that two successive ratios, e.g., t_i and t_{i-1},
yield the same estimate for multiplicity (which is assumed to be an integer),
then the new estimate is used. Once the correct estimate of multiplicity is
obtained, the iteration converges very fast.
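
The following Python fragment is only a simplified illustration of this idea and not the NEWRAP algorithm itself; the guard on the ratio and the update schedule are assumptions of this sketch (NEWRAP is more careful and waits for consistent integer estimates over several iterations):

def newton_mult(f, fprime, x0, tol=1e-10, maxiter=60):
    """Newton iteration (7.35) with a crude running estimate of the multiplicity r."""
    r = 1
    x_prev, x = None, x0
    for _ in range(maxiter):
        dfx = fprime(x)
        if dfx == 0.0:
            break
        dx = r * f(x) / dfx
        x_new = x - dx
        if abs(dx) <= tol * max(1.0, abs(x_new)):
            return x_new
        if x_prev is not None and x != x_prev:
            t = (x_new - x) / (x - x_prev)   # asymptotically t = 1 - r/(true multiplicity)
            if 0.0 < t < 0.9:                # crude guard against wild estimates
                r = max(1, round(r / (1.0 - t)))
        x_prev, x = x, x_new
    return x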
An alternative strategy is based on the fact that

    lim_{x→α} h(x) = lim_{x→α} ln|f(x)| / ln|u(x)| = r,    (7.36)

where r is the multiplicity of the root and u(x) = f(x)/f'(x), which suggests
an iteration formula of the form

    x_{i+1} = x_i - h(x_i) u(x_i).    (7.37)

It can be proved (Traub, 1997) that for this iteration function |ε_{i+1}/ε_i| → 0
asymptotically. However, this iteration function is of incommensurate order as
it can be proved that

    (7.38)

Hence, the convergence is significantly faster than linear, but is not superlin-
ear. As a result, this method is not as efficient as the method implemented
in NEWRAP, when the multiplicity is an integer. For nonintegral multiplicity,
this method will converge faster. Of course, the subroutine NEWRAP can be
easily modified to take care of nonintegral multiplicity, but in that case the es-
timate for multiplicity cannot be accurate and the iteration may not converge
quadratically.

EXAMPLE 7.4: Solve the equation in Example 7.2 using the Newton-Raphson method.
Using 24-bit arithmetic and a starting value of 3.15 or 4.0, the Newton-Raphson iteration
diverges, while starting with 4.4, it converges in 4 iterations to seven significant figures. From
a starting value of 4.71, which is close to the singularity, it takes 11 iterations to reach
the same value. For the root at x = 0 the convergence is slow and after 12 iterations it
reaches a value of -0.0105. For this root the error is divided by a factor of approximately
1.5 at each iteration. If the iteration is continued further, then after 22 iterations it stops at
x = -3.0 x 10^-4, when the computed value of f'(x) turns out to be zero. As noted earlier,
the Newton-Raphson method can be easily modified to handle multiple zeros, provided the
multiplicity is known. With these modifications the convergence is quadratic even for multiple
zeros. Even if the multiplicity is not known ab initio, it can be estimated by looking at the
rate of convergence. The subroutine NEWRAP, which estimates the multiplicity of roots,
converges in eight iterations to a value -8.45 x 10^-5, when the iteration is started with
the same starting value of -1.0. This clearly demonstrates the power of the modified Newton's
method for multiple zeros. For the Newton-Raphson method also the convergence is much faster
if the equation is rewritten in the form cot x = 1/x. For this equation the iteration converges
rapidly, even with the starting value of 3.15. Thus, depending on the starting value and
the problem, the Newton-Raphson method may converge rapidly, or the iteration may not
converge at all.

7.6 Brent's Method


So far we have used only linear interpolation to approximate the function.
Instead, we can use higher order interpolation for this purpose. There are two
alternatives for higher order interpolation, since we can use direct or inverse
interpolation. We can interpolate the function using a few of the most recent
estimates for the zero and find the zero of the interpolating polynomial to get
the next approximation. This technique involves solving a polynomial which
may not be trivial if the degree of polynomial is higher than two. The Muller's
method described in Section 7.8 uses quadratic interpolation to approximate
the function. To avoid the problem of finding roots of interpolating polynomials,
we can use inverse interpolation; i.e., we can interpolate x as a function of y =
f(x) (i.e., x = g(y)) and then interpolate to find the value of x corresponding
to y = 0, to get the next approximation to the zero of f(x). This approach
can be used to get higher order methods. However, use of very high order
method is not expected to be very effective, since a large number of starting
values are required to start the iteration. Moreover, it can be shown that such
methods cannot have an order of convergence greater than two. Hence, the
convergence is not much faster than the secant iteration or the method based
on quadratic interpolation. In this section, we consider the method based on
quadratic (inverse) interpolation.
If we have three approximations x_i, x_{i-1} and x_{i-2} to a zero of the function,
we can perform an inverse quadratic interpolation to get

    x ≈ x_i + (y - f(x_i))/f[x_i, x_{i-1}]
          + [(y - f(x_i))(y - f(x_{i-1}))/(f(x_i) - f(x_{i-2}))] (1/f[x_i, x_{i-1}] - 1/f[x_{i-1}, x_{i-2}]).    (7.39)

To get a better approximation to the zero, set y = 0 in this expression to obtain

    x_{i+1} = x_i - f(x_i)/f[x_i, x_{i-1}]
          + [f(x_i) f(x_{i-1})/(f(x_i) - f(x_{i-2}))] (1/f[x_i, x_{i-1}] - 1/f[x_{i-1}, x_{i-2}]).    (7.40)

The truncation error in this approximation is given by (see Section 4.1)

    x_{i+1} - α = ε_{i+1} = f(x_i) f(x_{i-1}) f(x_{i-2}) g'''(ξ)/6,    (7.41)

where g(y) is the inverse function. If the zero is simple, then after some ma-
nipulations it can be shown that asymptotically the truncation error is given
by

    ε_{i+1} ≈ K ε_i ε_{i-1} ε_{i-2},    (7.42)

where

    K ≈ g'''(0) / (6 [g'(0)]^3).    (7.43)
It can be shown {17} that the order of convergence of this method is 1.839,
which is slightly faster than that for secant method based on linear interpola-
tion. Further, it can be shown {18} that if higher order interpolation is used,

then the order improves only marginally and in fact, the order of all methods
based on simple inverse interpolation is less than two. Hence, no purpose will
be served by using higher order interpolation.
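
A minimal Python sketch of one step of inverse quadratic interpolation, i.e., Eq. (7.40) written through divided differences of the inverse function (names are assumptions of this illustration; the three function values are assumed to be distinct):

def inverse_quadratic_step(f, x0, x1, x2):
    """Next approximation from three previous ones (x2 is the most recent)."""
    y0, y1, y2 = f(x0), f(x1), f(x2)
    g21 = (x2 - x1) / (y2 - y1)              # g[y2, y1] = 1/f[x2, x1]
    g10 = (x1 - x0) / (y1 - y0)              # g[y1, y0] = 1/f[x1, x0]
    g210 = (g21 - g10) / (y2 - y0)           # g[y2, y1, y0]
    # evaluate the interpolating polynomial of the inverse function at y = 0
    return x2 + (0.0 - y2) * g21 + (0.0 - y2) * (0.0 - y1) * g210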
If quadratic interpolation is used instead of linear interpolation, then the
convergence will be slightly faster, but apart from that there is no significant
advantage over the secant method. However, Brent (2002) has combined the
inverse quadratic interpolation with bisection, in such a way that the root is
always bracketed between two of the approximations and further if the conver-
gence of interpolation is slow, then the algorithm performs bisection to achieve
a faster rate of convergence. Thus, this algorithm combines the advantages
of faster convergence of iterative methods with that of an always convergent
method, in such a way that the iteration is guaranteed to converge to the root
with the required accuracy in a finite number of steps. However, before applying
this algorithm, it is necessary that the root is bracketed between two values as
shown by a sign change.
At a typical step in this algorithm, we have three approximations to the
zero, a, b and c, such that f(b)f(c) ≤ 0 and |f(b)| ≤ |f(c)|. In some cases, a
may coincide with c. The points a, b and c change during the computations,
but for simplicity we will omit subscripts to denote the successive values of
these numbers. At any stage, b is the best estimate of the root α, while a is
the previous value of b. Of course, α must lie between b and c. Further, initially
a = c. At each step this algorithm chooses the next iterate from two candidates:
one obtained by bisection and the other obtained using an interpolation.
Inverse quadratic interpolation is used whenever the three points are distinct,
otherwise linear interpolation is used. If the point obtained using interpola-
tion is "reasonable" it is chosen, otherwise bisection is used. Here "reasonable"
means that the point is inside the current interval and not too close to the end
points.
If f(b) = 0, then we have found the zero. If f(b) ≠ 0 and m = (1/2)(c - b),
then the iteration is terminated when |m| < δ, where δ is the required
tolerance. The last value of b is returned as the best approximation to the root.
If superlinear convergence has set in, then this value will be much more accurate
than the required tolerance. If the iteration has not converged, then the next
approximation is obtained as follows.
If the three current points a, b and c are distinct, then we can find the new
approximation d by inverse quadratic interpolation, while if a = c, then we use
linear interpolation using a and b. In either case, the new point is expressed as
d = b + p/q. If linear interpolation is used, then we can define s = f(b)/f(a)
and p = ±(a - b)s, and q = ∓(1 - s). If the three points are distinct, then since
a is the previous value of b, we perform a bisection if |f(b)| ≥ |f(a)|. Otherwise,
we have |f(b)| < |f(a)| < |f(c)| and after some manipulations it can be shown
that

    p = ±r_3 [2m r_1(r_1 - r_2) - (b - a)(r_2 - 1)],    q = ∓(r_1 - 1)(r_2 - 1)(r_3 - 1),    (7.44)

where r_1 = f(a)/f(c), r_2 = f(b)/f(c) and r_3 = f(b)/f(a). If either of the
interpolations is used, the new point is accepted only if it lies between b and
c and up to three-quarters of the way from b to c. Hence, the new point is
rejected if 2|p| ≥ 3|mq| and in that case, we perform bisection. To ensure a
certain minimum rate of convergence, one more test is done before accepting
the interpolated value at any step. If the previous step was also an interpolation
and if e is the value of p/q at that step, then we perform bisection if either
|e| < δ or if |p/q| ≥ (1/2)|e|, which ensures that |m| decreases by at least a factor
of two on every second step. Further, even if the iteration gives a value within
the acceptable limits, but |p/q| < δ, then a shift of at least δ is used. This situation
could arise when the inverse quadratic iteration has actually converged to a
root, but the bounding interval is still too large (see Example 7.5).
Once the new point d has been obtained by bisection or interpolation,
the function value at this point is computed. If f(d)f(c) < 0, then we can get
three distinct values for a, b and c for the next step. Otherwise a is set equal
to c. Further, if |f(d)| > |f(c)|, then b is set to c, while a and c are set to the
new approximation d. For a more complete specification of Brent's algorithm,
readers can consult Brent (2002) or the subroutine BRENT in Appendix B.
Unless a maximum limit on the number of iterations is kept, the iteration
may get into a nonterminating loop if an unrealistically low value of tolerance
is specified. For example, if the relative tolerance is much less than ℏ, then
the convergence criterion may never be satisfied and the iteration can get into
a nonterminating cycle. It may be noted that many implementations of this
method do not have any upper limit on the number of iterations and if the
convergence criterion is too stringent, such subroutines may get into an infinite
loop. On the other hand, by putting an upper limit on the number of iterations,
there is a chance that the iteration may not converge to a satisfactory accuracy
in some cases. If the upper limit is reasonably large, then this failure can occur
only if either the root is zero (or very close to zero) or if the specified limits are
several orders of magnitude larger than the root.
Like all other always convergent methods, this algorithm will also con-
verge to a singularity if the sign change across the original interval is due to a
singularity rather than a zero. For a multiple zero, the convergence of inverse
quadratic interpolation may be quite slow, but in the Brent's method, if the
convergence is slow, then a bisection will be performed at every alternate step.
Hence, in such cases, this algorithm will require about twice as many steps as
that required by bisection, for the same accuracy. Thus, for multiple roots the
convergence could be slower than that of bisection, but that is not a serious
disadvantage, since most other iterative methods may converge even slower. Of
course, if the root is of even multiplicity, then it cannot be bracketed between
two sign changes and this algorithm is not applicable for such roots.
EXAMPLE 7.5: Solve the equation in Example 7.2 using Brent's algorithm.
Using 24-bit arithmetic and the same starting values as in Example 7.2, we use Brent's
method to solve tan x = x. The convergence is quite fast for the root at x_1, but after the
spanning interval (b, c) has been reduced to the minimum possible with the precision of
arithmetic used, the iteration just keeps oscillating between the two values. If a reasonable
tolerance was supplied to the subroutine, then the iteration would have converged before this
stage. For the root x = 0, the convergence is quite slow and it takes about 22 iterations
to reach a value of 2 x 10^-4. If the iteration is started with the values 1.0, 3.0, which span a
singularity, the iteration slowly converges to the singularity in about 50 iterations. For this
case, similar results will be obtained if bisection or the regula falsi method is used. The value
of the function at the calculated "zero" in this case is of the order of 10^7, instead of 10^-7
at the normal zeros. Hence, in this case, the function value can be easily used to distinguish
between a zero and a singularity, but this is not always possible {14}.
For the alternative form of the equation, i.e., 1/x = cot x, the results are quite different.
For the first two starting values, the inverse quadratic interpolation converges very fast to
the accurate value in the first eight iterations, but after that the difference becomes too small
and the algorithm performs bisection at every alternate step to reduce the initial interval.
Hence, it ultimately requires about 40 iterations to get the accurate value, which is much
slower than other iterative methods as well as the simple method of bisection. Brent's
algorithm runs into this problem whenever the inverse quadratic interpolation converges to
the actual root from one side, since in that case the other end of the initial bounding interval
remains fixed, and the iteration will not converge until the bounding interval is reduced
to the specified tolerance, which requires unnecessary bisections at every alternate step. If by
any coincidence, because of roundoff error, one of the computed approximations comes out to
be on the other side of the root, then the iteration may converge quite fast. If a reasonable
tolerance is specified, then even if the change in iterate is too small, a shift by the amount
equal to the required tolerance is effected. This shift can force the next approximation to the
other side of the zero and the bracketing interval will reduce immediately. For example, in
this case, if the relative tolerance is supplied to be 10^-7, then at the seventh iteration the
value is forced to be 4.493409 and the function value changes sign, which immediately reduces
the bounding interval and the iteration is terminated. For the root at x = 0 the convergence
is very fast, as it is a simple zero for this form of the equation.

It is quite clear from the above example that in order to use Brent's
method effectively, it should be ensured that the initial interval actually con-
tains a zero rather than a singularity. Further, it is also necessary to give a
reasonable value of the required tolerance. If the tolerance is too small, then
the iteration may never converge or it may take unnecessarily large number
of iterations to converge to the root. Even if the iteration has apparently con-
verged to the root within specified tolerance, there is no guarantee that the
value is actually accurate to the required level. This problem arises because of
roundoff error. Hence, all we can say is that the iteration will converge to some
point within the domain of indeterminacy, provided it has not converged to a
singularity or some other discontinuity in the function. This method does not
give any idea about the size of this domain.
We have considered broadly two classes of methods for finding real roots
of nonlinear equations. The first class, which includes bisection, regula falsi
and Brent's methods, are guaranteed to converge to the actual value of the
root. However, the convergence is rather slow. The Brent's method which ac-
tually combines bisection with an iterative method may converge quite fast in
most cases. On the other hand, the secant iteration or the Newton-Raphson
method converge very fast, but the convergence is not guaranteed. Therefore,
Brent's method appears to be the best choice for finding real roots of nonlinear
equations.
However, the problem with Brent's method or any other always convergent
method is that they always converge. Thus, even if there is no root, but the
function changes sign because of some singularity or discontinuity, then the


iteration converges to the singularity or discontinuity. Of course, we can argue
that before applying this method, it should be ensured that the function is
continuous. But in practice, it is quite possible that a singularity in some odd
term of a complicated equation may be overlooked, or it may not be obvious
that a denominator could be zero, or it may happen because of some error in
defining the function. This problem can be easily taken care of, if the function
value at the successive iterates are also printed. In that case, if the iteration
is converging to a singularity, then the function values will be increasing in
magnitude, while if the iteration is converging to a zero, then the function
values will be decreasing in magnitude. It may be noted that in many cases,
it will not be sufficient to print the value of the function at the final "zero".
Since it may not be possible to have any idea about the typical value of the
function in the neighbourhood of the zero. For example, if we are finding zero
of the determinant of a large matrix, then the function value can easily come
out to be of the order of 10100 at the "zero" (or 10- 100 at a "singularity") and
it will be impossible to distinguish a zero from a singularity by looking at the
value of the function at this point.
Apart from this, the main problem with the always convergent methods is
that, there is no indication of the roundoff error and the iteration can converge
to any required accuracy. Thus, even if the domain of indeterminacy associated
with a root is very large, these methods will apparently converge to an arbitrary
accuracy giving no indication of the errors. On the other hand, the iterative
methods will usually not converge to an accuracy much better than what is per-
mitted by the domain of indeterminacy (see Section 7.12), and ultimately the
iteration will oscillate within this domain. These oscillations give a very good
indication of the limitations imposed by the roundoff error, which is otherwise
difficult to estimate for a rather complicated function. The always convergent
methods can take care of this problem, provided the function values at each
iteration are also printed. Once the iteration has entered the domain of indeter-
minacy, the function value will not decrease, but keep oscillating. This test may
be difficult to implement in an automatic program, since it may be tricky to
distinguish between this situation and situations where the convergence is very
slow, or when the iteration is far from the root, but the function value is of the
same order in a wide region. One advantage of the always convergent methods
is that, if the effect of roundoff error is neglected, then the results are always
reliable, in the sense that the root is guaranteed to be accurate to the speci-
fied tolerance. On the other hand, with iterative methods, there is always the
chance of spurious convergence, since the difference between successive iterates
may come out to be small, even when the iteration has not really converged
(see Section 6.7).
Apart from all these problems, the always convergent methods are not
applicable for roots of even multiplicity. If the function is simple enough to
ensure that, there is no singularity in the interval and the roundoff error is
reasonably small, then Brent's method can be safely and efficiently used to

calculate the roots. But if we have enough knowledge to ensure this condition,
then we may also be able to supply good starting values, in which case, there
is no need for an always convergent method.
On the other hand, the main problem with iterative methods is that these
methods may not converge or may converge to another root of the equation.
However, if a good starting value is available, then this problem will not arise.
Hence, we are forced to come up with good starting values, which may be pro-
vided by proper analysis of the problem, or by the method of bisection coupled
with some routine to scan for the sign changes. If a few bisections are performed
before using the approximation for the iterative methods, then convergence may
be ensured. In practice, we often come across nonlinear equations with some
parameter and the root needs to be evaluated for a number of values of this
parameter. Further, often the successive values of this parameter are close to
each other. In such cases, the previous value of the root or some extrapolated
version of it can be effectively used as the starting value. In such cases, the
iterative methods are likely to be more efficient, since to apply the always con-
vergent methods we need to bracket the root, which may not be easy unless a
rather large range is given. In that case, the convergence may be slow. Another
important advantage of iterative methods is that they can be easily adapted
for calculating complex zeros also.

7.7 Complex Roots


Having considered methods for real roots of nonlinear equations, now we con-
sider methods for determining complex roots. Even if all coefficients of a non-
linear equation are real, the equation can have complex roots. The iterative
methods like the secant iteration or the Newton-Raphson method, which do
not use the sign of function explicitly are applicable to complex roots also, pro-
vided complex arithmetic is used. For a real valued function of real variable,
the iteration cannot generate complex values, unless the starting value itself
is complex. If the iteration converges to a complex root, then the asymptotic
convergence rate is the same as that for a real root. However, as we have seen
earlier, such methods are not guaranteed to converge and in fact, the possibil-
ity of non-convergence is higher in the complex plane than along the real line,
since in the complex plane the iteration can proceed along any of the infinite
directions. Hence, we would like to have an always convergent method for com-
plex roots of a nonlinear equation. If the equation can be written as f(z) = 0,
where f(z) is an analytic function in some region of the complex plane containing
the zero, then it is possible to use the method described in Section 7.9. In this
section, we describe a simpler method.
The problem of finding complex zeros of a given function is equivalent to
finding real values x and y, such that

    f(z) = f(x + iy) = u(x, y) + iv(x, y) = 0,    (7.45)


where u and v are real functions. This problem is equivalent to solving a system
of two nonlinear equations in two real unknowns x and y. If the function f(z)
is analytic, then the two real functions u and v are not independent. This puts
a severe constraint on these functions. For locating real zeros, we can evaluate
the function at a number of points and consider the sign changes. Similarly,
to locate the simultaneous zeros of two real functions, we can evaluate the
functions at a number of points in two dimensions and consider the signs of
both the functions. The complex zero is defined by the two equations

    u(x, y) = 0    and    v(x, y) = 0.    (7.46)

Each of these equations defines a set of curves in the complex z-plane and the
intersection of these two sets would give the zeros of f(z). These systems of
curves divide the complex plane into four types of regions, depending on the
quadrant containing the function value f(z), which in turn depends on the
signs of u and v. Thus, if u > 0 and v > 0, f(z) would be in the first quadrant;
if u < 0 and v > 0, f(z) would be in the second quadrant; if u < 0 and v < 0,
f(z) would be in the third quadrant; and if u > 0 and v < 0, f(z) would be in
the fourth quadrant.
To locate the complex zeros, we can evaluate the functions u(x, y) and
v(x, y) at a set of points in the complex plane. It would be most convenient
if the set of points forms a Cartesian grid. We can obtain a figure similar to
Figure 7.6, where at each of these points the quadrant number of f(z) is marked.
It is obvious that the curve u = 0 separates the regions of the second and third
quadrants from the first and fourth, while v = 0 separates the first and second
quadrants from the third and fourth. Evidently, where all four quadrants meet, we
have a zero of f(z). In this case, multiple zeros will not cause a serious problem,
since they would be given by an intersection of a set of u = 0 and v = 0 curves,
which can be easily traced. This method can be understood with the help of the
following example.
EXAMPLE 7.6: Locate the approximate position of complex roots of the following equa-
tions:
(i) z + sinz = 0, (ii) z - tanz = O. (7.47)
For the first equation, we try to look for the root in the region 3 < x < 5 and 1 < y < 3
(where z = x + iy). The results are displayed in Figure 7.6, which also shows the curves
corresponding to u = 0 and v = 0, where u and v are defined by (7.45). The intersection
of these curves will give the root of the given equation. The numbers in the figure are the
quadrant number of the function J(z) = z + sin z at the corresponding point in the complex
plane. The curves u = 0 and v = 0 divide the complex plane into four regions and the point
where all these four regions meet is the zero of J(z). It is quite clear from the figure that, this
point can be approximately located with the help of such figures, even if the curves u = 0
and v = 0 are not drawn. The actual value of the root is z ~ 4.21239 + 2.25073i. which is a
simple zero of J(z).
To illustrate the behaviour of this method for a multiple root, we consider the root at
z = 0 of the second equation. The results are also displayed in Figure 7.6. It can be seen
that in this case, the zero is given by the common point of intersection of six curves, three
corresponding to u = 0 and thrf'e corresponding to v = O. Further, as we go around the
zero in a closed contour, the quadrant values change in cyclic order 1,2,3,4, etc., which is a
characteristic signature of multiple roots of an analytic function. It is obvious that to locate
[Figure 7.6: Searching complex zeros. One panel ("Simple zero of z + sin z") shows the quadrant numbers of f(z) on a grid over 3 < x < 5, 1 < y < 3, together with the curves u = 0 and v = 0; the other panel ("Triple zero of z - tan z") shows the same construction near z = 0.]

It is obvious that to locate multiple roots, we need higher resolution, i.e., a larger
number of points, in order to separate out the regions of different quadrants. Apart
from this, there is no difficulty in locating multiple roots. In particular, it is possible
to locate roots of even multiplicity as well. Further, it is also possible to determine
the multiplicity of the root by counting the number of times each quadrant occurs as
we go around the zero in a closed contour.

The method described above is crude and inefficient. However, in many cases,
there is no alternative and we have to resort to such methods. This method can
be used to locate the approximate position of the root in the complex plane.
Once an approximate value is known, an iterative method can be used to find
the accurate value of the root. If the function is real, then the
complex roots will occur in pairs of complex conjugate zeros. Hence, if z = x + iy
is a zero of f(z), then z* = x - iy is also a zero and we need to search only the
complex half plane above the real axis. If the function f(z) is not analytic in
the neighbourhood of the zero, or if we are considering the solution of a system
of two independent nonlinear equations, then this method can in principle be
used, but the intersection of the two sets of curves could have more complicated
geometry. For example, in this case, the two sets of curves can touch each other
without intersecting, and it will be difficult to distinguish between cases where
the curves actually touch and the ones where the curves come close without
touching. Further, in this case, only three quadrants may meet at the point
of contact. The requirement of analyticity of f(z) eliminates such pathological
cases.

7.8 Muller's Method


As mentioned earlier, Muller's method uses parabolic interpolation to ap-
proximate the given function, and the zero is then approximated by a zero of
the interpolating quadratic. The quadratic equation has two roots and at any
given step, we can choose the root which is nearer to the current best estimate.
Further, a quadratic equation with real coefficients can have complex roots.
Hence, this method can give complex roots, even when the starting values are
all real. At the same time, we may be forced to use complex arithmetic, even
when we are interested in real roots only. For these reasons, Muller's
method is not advisable for finding real roots.
In Muller's method, we start with three points x_i, x_{i-1} and x_{i-2} and
approximate the function by a parabola using quadratic interpolation (see Sec-
tion 4.2)

$$g_2(x) = f(x_i) + (x - x_i)\,f[x_i, x_{i-1}] + (x - x_i)(x - x_{i-1})\,f[x_i, x_{i-1}, x_{i-2}] = a_2 (x - x_i)^2 + a_1 (x - x_i) + a_0, \qquad (7.48)$$

where

$$a_0 = f(x_i), \qquad a_1 = f[x_i, x_{i-1}] + (x_i - x_{i-1})\,f[x_i, x_{i-1}, x_{i-2}], \qquad a_2 = f[x_i, x_{i-1}, x_{i-2}]. \qquad (7.49)$$
The zeros of g_2(x) are given by

$$x_{i+1} = x_i + \frac{2a_0}{-a_1 \pm \sqrt{a_1^2 - 4a_0 a_2}}, \qquad (7.50)$$
where we must choose one of the two roots. This choice can be made by requiring
that if x_i = α, the zero of f(x), then x_{i+1} = α, that is, the iteration must converge
if the initial value is exact. If x_i = α, then a_0 = f(x_i) = 0, and we must choose
that sign before the radical which agrees with -a_1, so that the denominator
does not vanish. It may be noted that the coefficients a_0, a_1 and a_2 are in
general complex and we must always choose that sign before the radical which
gives the larger magnitude for the denominator, i.e., we select that root of the
quadratic which is nearer to x_i.
The form of iteration function given above is more suitable than the form
originally given by Muller (1956). At each step the function value f(x_i), the
divided difference d_i = f[x_{i-1}, x_i] and the differences h_i = x_i - x_{i-1} and h_{i+1} =
x_{i+1} - x_i need to be preserved. An implementation of this method is provided
by subroutine MULLER in Appendix B.
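The following is a minimal Python sketch of the iteration (7.48)-(7.50); it is not the Fortran subroutine MULLER of Appendix B, and the function and variable names are ours. Complex arithmetic is used throughout, so complex zeros can be reached even from real starting values.

```python
import cmath

def muller(f, x0, x1, x2, tol=1e-12, maxiter=50):
    """Muller iteration: fit a quadratic through the last three iterates and move to
    its nearer zero.  For clarity each step re-evaluates all three function values;
    a careful implementation would reuse them (one new evaluation per step)."""
    for _ in range(maxiter):
        f0, f1, f2 = f(x0), f(x1), f(x2)
        d12 = (f2 - f1) / (x2 - x1)            # divided difference f[x1, x2]
        d01 = (f1 - f0) / (x1 - x0)            # divided difference f[x0, x1]
        d012 = (d12 - d01) / (x2 - x0)         # divided difference f[x0, x1, x2]
        a0, a1, a2 = f2, d12 + (x2 - x1) * d012, d012
        disc = cmath.sqrt(a1 * a1 - 4.0 * a0 * a2)
        # choose the sign giving the denominator of larger magnitude (nearer root)
        den = a1 + disc if abs(a1 + disc) > abs(a1 - disc) else a1 - disc
        dx = -2.0 * a0 / den
        x0, x1, x2 = x1, x2, x2 + dx
        if abs(dx) < tol * max(1.0, abs(x2)):
            return x2
    return x2

# e.g. the simple zero of z + sin z located in Example 7.6
root = muller(lambda z: z + cmath.sin(z), 4.0 + 2.0j, 4.5 + 2.5j, 4.2 + 2.2j)
```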
As in the case of the secant method, we can get the asymptotic form for the
error ε_{i+1} using the expression for truncation error in quadratic interpolation

$$|\epsilon_{i+1}| \approx K\,|\epsilon_i \epsilon_{i-1} \epsilon_{i-2}|, \qquad (7.51)$$

where

$$K \approx \left|\frac{f'''(\alpha)}{6 f'(\alpha)}\right|. \qquad (7.52)$$
Defining d_i = ln(K^{1/2} |ε_i|), we can proceed in a manner similar to that for the secant
method to find {17} that Muller's method is of order 1.839. Hence, Muller's
method converges faster than the secant method, but slower than the Newton-
Raphson method. Muller's method also requires calculation of one function
value per iteration and the cost per iteration will be the same as that for the
secant method. Hence, if we calculate the computational efficiency, it turns
out that Muller's method would be more efficient than the Newton-Raphson
method if α > 0.137, where α is the relative cost of evaluating f'(x). Thus,
usually Muller's method is preferable to the methods described earlier. It can be
seen that Muller's method involves finding the square root of an expression,
which can become negative, even if all the initial points are real. Therefore,
Muller's method can generate complex roots, even if all the starting values are
real.
Muller's method can be used for calculating real roots as well. However,
it may not be desirable to do that, since as we have seen, even when the
initial values are all real, this method can generate complex values for x_i and
it would be necessary to use complex arithmetic, which itself may slow down
the process by a factor of approximately four. The complex arithmetic can be
avoided if at every step we retain only the real part of x_i. Alternately, we can
set the expression under the radical sign to zero, if it turns out to be negative.
This artifact may affect the convergence rate of the iteration, but if the root
is real the iteration should normally converge. However, with such artifacts
the iteration may apparently converge to a point which is not a root. Such
spurious convergence can happen if the coefficient a_1 comes out to be zero or
small, which is possible near the zeros of f'(z).
Now we will examine the rate of convergence of Muller's method for mul-
tiple roots. Let us first consider the case when the root at x = α is a double
zero of f(z) = 0. In this case, it can be shown {20} that for large values of k
the truncation error ε_k satisfies the equation

(7.53)

which can be written in the form

(7.54)

Defining u_k = ln(K|ε_k|), we get

(7.55)

Using this difference equation it can be shown {20} that Muller's method is
of order 1.23 for a double zero. Hence, of all the methods (except the modified
Newton-Raphson method, which requires the knowledge of multiplicity) that
we have considered, this is the only one which has an order p > 1 for a double
zero.
It can be shown {21} that for a zero of multiplicity m ≥ 3, Muller's
method converges linearly and ultimately the error is multiplied by a constant
factor θ in each step. This constant is given by the root with smallest magnitude of
the polynomial

(7.56)

The required root turns out to be a pair of complex conjugate zeros within the
unit circle. The actual value of θ will be one of these two roots depending on
the value of f^{(m)}(α). For a triple root, θ = 0.713 ± 0.201i, which is comparable
in magnitude to the corresponding factor for secant iteration (0.755) and some-
what larger than the factor of 2/3 for the Newton-Raphson method. Thus, for triple
zeros, the rate of convergence of Muller's method will be somewhat slower than
that for Newton's method, and comparable to that for the secant iteration.
It can be shown that a method based on cubic interpolation using the
last four points will have superlinear convergence for roots of multiplicity m ≤
3, the order of convergence being 1.928, 1.349 and 1.126 for single, double
and triple roots, respectively. However, this method requires the solution of a
cubic equation at each step and is not very practical. As compared to that,
all higher order methods based on inverse interpolation converge linearly to
roots of multiplicity m ≠ 1, which is probably due to the fact that the inverse
function does not exist near multiple roots.
EXAMPLE 7.7: Solve the equation in Example 7.2 using Muller's method.
Muller's method requires three starting values and to compare the results with other
methods we take the third value to be the midpoint of the first two. Even though the roots
are real, in some cases, the iteration picks up an imaginary part, which tends to zero as
the iteration converges. In the first case, with starting values 3.15, 4.71, 3.93, the iteration
wanders off from the required interval and ultimately converges to the root at x = 0. When
the starting values are closer to the root the iteration converges very fast to the root. The
convergence is slow for the root at x = 0, requiring 24 iterations to reach a value -2.8 × 10^{-4}.
For this root the rate of convergence is comparable to that for secant iteration or the Newton-
Raphson method. If the equation is rewritten as cot x = 1/x, then the convergence is much
faster in all cases.

The iterative methods are very efficient in determining complex zeros of
arbitrary functions, provided the iteration converges. However, the iteration
may not converge unless the initial approximation is very good, particularly
if the zeros are close to each other, or if the function is singular in the neigh-
bourhood of the zero. In such cases, the convergence of the iterative process
depends on how good the initial guess is, and we are forced to come
up with a good initial guess. For this purpose, it may be necessary to use the
method described in the previous section or the one described in the next
section. Further, if the number of zeros is not known in advance, the result may
be unreliable, in the sense that some of the zeros may not be discovered at all.
On the other hand, if there is no zero in the region, we may make a long and
futile search for something which does not exist without ever being sure of its
nonexistence.
Another problem with the iterative methods is that, if we are interested
in more than one root, then it may happen that on subsequent attempts the
iteration may keep converging to the same root again and again. This problem
can be avoided by using the technique of deflation or removal of roots. If the
function can be explicitly divided out by the factor x - α (where α is the
calculated zero) to get a simpler form, then the process of deflation will improve
the efficiency of the calculations. If the division cannot be carried out explicitly,
then we can just use the function f(x)/(x - α) for the iteration process. If
more roots are known, then we can add more factors in the denominator. This
strategy is used in the subroutine MULLER. Deflation is needed if we want to
check if a root is multiple, as in that case iteration will still converge to the same
root. The process of deflation introduces some roundoff error, particularly if the
division is carried out explicitly, since the computed value of the root will not
be exact. Hence, the computed root of the deflated function should be refined
by iterating on the original function.
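A minimal sketch of this implicit deflation (dividing the function values by the factors corresponding to roots already found, rather than deflating explicitly); the helper name is ours and the returned function can be handed to any of the iterative methods above:

```python
def deflated(f, roots):
    """Return g(x) = f(x) / prod(x - r) over the roots already found, so that a
    fresh iteration is pushed away from them."""
    def g(x):
        val = f(x)
        for r in roots:
            val /= (x - r)          # assumes x is not exactly equal to a found root
        return val
    return g
```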
One technique which can be tried to handle difficult situations is based on
the continuity principle and can be used if the equation has a parameter such
that for some special value of the parameter the solution is known. In this case,
the parameter can be changed slowly in several steps from the value with known
solution to the required value. At each step, the previous root which should be
close to the new root can be used as the starting value. Even if the function has
no parameter, we can introduce one artificially. This technique is sometimes
referred to as Davidenko's method and will be discussed in Section 7.16 in
connection with the solution of a system of nonlinear equations. In general,
there may be no guarantee that the root is a single valued function of the
parameter and in fact, the function may not exist at some intermediate point.
Hence, this technique is not as simple or effective as we would imagine at first
sight.

7.9 Quadrature Based Method


The basic idea of this method for finding complex zeros of an analytic function,
due to Delves and Lyness (1967), is to construct a polynomial which has the
same zeros as the required function f(z) within a closed contour. Thus, instead
of finding an approximation which interpolates the given function, we find an
approximation which shares the zeros within a closed contour with f(z). This
approximation will, of course, interpolate the function at its zeros. Once such an
approximation is available, the zeros of f(z) can be easily calculated by finding
zeros of the polynomial. The problem of finding all zeros of a polynomial is
considered in Section 7.11. To construct the required polynomial the following
result based on Cauchy's theorem is used.
If C is a closed contour in the complex plane which does not pass through
a zero of f(z) and R is the interior of C, then

$$s_n = \frac{1}{2\pi i} \oint_C z^n \frac{f'(z)}{f(z)}\, dz = \sum_{i=1}^{N} z_i^n, \qquad (7.57)$$

where z_i, (i = 1, 2, ..., N) are all zeros of f(z) which lie in R. A multiple zero
is counted according to its multiplicity in this formula. Hence, if we consider
a region R, and carry out the contour integrals numerically, we can determine
the approximations to s_0, s_1, ..., s_N. The true value of s_0 will be an integer N,
the number of zeros of f(z) in R. Using these approximations, we can construct
a polynomial P_N(z) of degree N, whose zeros coincide with the zeros of f(z)
in R.
For this purpose, we can use Newton's formula, which gives the sums
of powers of the roots of a polynomial in terms of its coefficients. Thus, if
the required polynomial is

$$P_N(z) = z^N + a_1 z^{N-1} + \cdots + a_N, \qquad (7.58)$$

then Newton's formula gives

$$s_j + a_1 s_{j-1} + \cdots + a_{j-1} s_1 + j a_j = 0, \qquad (j = 1, 2, \ldots, N), \qquad (7.59)$$

where a_0 = 1. Hence, all coefficients of the polynomial can be calculated once
the s_j's have been determined (a small sketch of this computation is given after
the list of steps below). If f(z) has several zeros in the given region, then
P_N(z) will be a high degree polynomial, which may be very ill-conditioned. In
that case, we would have to determine the s_j to very high accuracy for the roots
of the resulting polynomial to adequately approximate those of f(z), and then
to find zeros of P_N(z) using the same precision. This difficulty can be avoided
by subdividing the region R, so that each subregion contains only a few zeros.
Delves and Lyness (1967) enumerate the following steps for determining the
zeros:

1. Evaluate the number of roots s_0 = N in the region. If N = 0, then stop,
   while if N is small enough to be handled conveniently, then evaluate also
   s_1, ..., s_N and proceed to steps (3) and (4). If N is too large, then do (2).
2. Subdivide the region into smaller subregions and do step (1) for each in
   turn.
3. Given a suitable region and the evaluated values of s_1, ..., s_N, construct
   and solve the equivalent polynomial P_N(z).
4. Take the zeros of P_N(z) as approximations to the zeros of f(z) and refine
   these by using an iterative method on the original function f(z).
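Step (3) uses Newton's formula (7.59) to convert the computed power sums into the coefficients of P_N(z). A small Python sketch of this conversion (names and conventions are ours) is:

```python
def coeffs_from_power_sums(s):
    """Given s[1..N] (power sums of the unknown zeros; s[0] is unused here), return
    the coefficients [1, a1, ..., aN] of P_N(z) = z^N + a1 z^{N-1} + ... + aN using
    Newton's formula s_j + a1 s_{j-1} + ... + a_{j-1} s_1 + j a_j = 0."""
    N = len(s) - 1
    a = [1.0] * (N + 1)                    # a[0] = 1
    for j in range(1, N + 1):
        acc = s[j]
        for k in range(1, j):
            acc += a[k] * s[j - k]
        a[j] = -acc / j
    return a

# check: zeros 1, 2, 3 have s1 = 6, s2 = 14, s3 = 36 and polynomial z^3 - 6z^2 + 11z - 6
print(coeffs_from_power_sums([0, 6, 14, 36]))      # -> [1, -6.0, 11.0, -6.0]
```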
This procedure involves the choice of a number M, which is the maximum
number of zeros that can be handled at a time. This choice obviously involves
a compromise, since if M is increased, then fewer regions need to be scanned
and efficiency will increase. However, if M is too large, then the resulting poly-
nomial may be ill-conditioned. A good value of M may be five. The resulting
polynomial would be of degree less than or equal to M, and for M = 5 it is
sufficiently low to ensure that calculation of zeros of P_N(z) can be assumed
to be a standard procedure. Of course, with this restriction, it is impossible
to estimate a zero of multiplicity greater than M, but that is not a serious
limitation, since zeros with such high multiplicity are rare in practice. In any
case, if after several subdivisions the value of s_0 does not decrease, then it will
indicate the presence of a multiple zero.
Efficiency of this method depends crucially on the efficiency with which
the contour integrals are evaluated. These integrals can be evaluated using the
methods for numerical integration considered in Chapter 6, although in this
case, it should be noted that the integrand would be complex. For convenience
the region R should be chosen to be either a circle or a rectangle. For a circular
contour it would be most efficient to use the trapezoidal rule, since the integrand
is periodic. For this purpose we can use the following change of variable to θ

$$z = z_0 + R e^{2\pi i \theta}, \qquad (0 \le \theta \le 1), \qquad (7.60)$$

where z_0 is the centre and R the radius of the circle. It is more efficient to
evaluate contour integrals over circles, but if a subdivision is required, then it
is not very convenient to subdivide a circle into smaller circles. On the other
hand, square or rectangular regions can be easily subdivided, but evaluation of
contour integrals may not be very efficient. Further, in a practical implementa-
tion, we also have to worry about the possibility of some zero lying on or close
to the contour, in which case, the evaluation of the integrals may be difficult
due to singularity. For more details regarding the actual implementation of this
method, readers can refer to Lyness and Delves (1967). Obviously, as in other
global methods for locating zeros, it would be uneconomical to find the zeros
to high accuracy by using this method. Hence, this procedure should be used
to locate the zeros to a relatively low accuracy and an iterative method can
be used to improve on this approximation as mentioned in step (4) above. Of
course, it is possible that in some cases, the iteration may fail to converge or
converge to a different root.
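A minimal Python sketch of the contour integration over a circular contour, using the trapezoidal rule in the parameter θ of (7.60); it is only an illustration of the idea, not the DELVES subroutine, and the argument names (f, df, z0, R, nmax, npts) are ours:

```python
import cmath

def power_sums(f, df, z0, R, nmax, npts=128):
    """Approximate s_n = (1/2*pi*i) * contour integral of z^n f'(z)/f(z) dz,
    n = 0..nmax, over the circle |z - z0| = R, by the trapezoidal rule (which
    converges rapidly for this periodic integrand)."""
    s = [0.0 + 0.0j] * (nmax + 1)
    for k in range(npts):
        theta = 2.0 * cmath.pi * k / npts
        z = z0 + R * cmath.exp(1j * theta)
        dz_dtheta = 1j * R * cmath.exp(1j * theta)          # dz/d(theta)
        w = df(z) / f(z) * dz_dtheta / (2j * cmath.pi)
        for n in range(nmax + 1):
            s[n] += w * z ** n * (2.0 * cmath.pi / npts)
    return s

# Example 7.8: f(z) = z + sin z, f'(z) = 1 + cos z, on |z| < 8
s = power_sums(lambda z: z + cmath.sin(z), lambda z: 1 + cmath.cos(z), 0j, 8.0, 5)
# s[0] should come out close to 5, the number of zeros inside the contour
```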
If the derivative of the function cannot be calculated easily, then numeri-
cal differentiation can be used. The derivative can be calculated using the same
set of function values which are used for integration. But in this case, addi-
tional error will be introduced in the estimate of contour integrals, reducing
the efficiency of the process still further. If a three-point formula is used for esti-
mating the derivatives, we can hope to use the Romberg integration to improve
on the trapezoidal rule value. The Romberg technique may take care of error
in integration as well as in computing the derivatives. But this process will
not work too well for rectangular contours, since the derivatives at the corner
points cannot be estimated using the same three-point formula, and the special
formulae for use at corner points have a different error expansion. Lyness and
Delves (1967) have given other methods for estimating the derivatives using
the function values on the contour only.
A simple implementation of this algorithm is provided by subroutine
DELVES in Appendix B. This subroutine assumes that the derivative of the
function can be calculated explicitly. It uses circular contours and has no facil-
ity to subdivide the region if the number of roots is larger than five. There is
no check to find if the contour is too close to one of the zeros and there may
be difficulty with convergence if the contour is close to one of the zeros either
inside or outside the contour. After locating the approximate position of ze-
ros, it uses a complex version of Newton-Raphson method to find the accurate
value. The Newton-Raphson method is used here simply because the function
routine calculates the derivative also which can be effectively utilised by the
Newton-Raphson method.
EXAMPLE 7.8: Find the zeros of z + sin z in the region |z| < 8.
Using the subroutine DELVES, five zeros inside this region are located. The first con-
tour integration s_0 gives a value of approximately five, indicating the number of zeros inside
the circular contour. The subroutine uses 32 function evaluations to estimate this number,
but a much smaller number of points may be enough. Subsequently, the subroutine evaluates
s_1, s_2, s_3, s_4 and s_5 to estimate the sums of powers of roots, which require 128 function
evaluations to achieve the specified accuracy, and the results are:
s_1 = 2.63 × 10^{-7} + 2.23 × 10^{-7} i,    s_2 = 50.71387 - 6.045 × 10^{-6} i,
s_3 = -8.22 × 10^{-6} + 7.84 × 10^{-5} i,    s_4 = -795.2405 + 5.25 × 10^{-4} i,     (7.61)
s_5 = 0.0029 + 0.0010 i.
Using these values, the coefficients of the equivalent polynomial are calculated and then the
polynomial is solved using the subroutine POLYC, which uses Laguerre's method (see
Section 7.11), to get the approximate positions of the roots as

x_1 = 1.51 × 10^{-6} - 6.57 × 10^{-7} i,    x_2 = 4.212392 + 2.250728 i,
x_3 = -4.212392 + 2.250729 i,    x_4 = 4.212391 - 2.250728 i,     (7.62)
x_5 = -4.212392 - 2.250727 i.

Using these as the starting values, the Newton-Raphson method gives the accurate values of
the roots as 0 and ±4.212392 ± 2.250729i.
If the radius of the circular contour is reduced to five, then the contour passes quite close to
the zeros and it requires 1024 function evaluations to get comparable accuracy. If the radius
is reduced still further to 4.8, then it requires 4096 function evaluations. Hence, the efficiency
of this process is reduced significantly if one of the zeros is close to the contour. To avoid this
problem Lyness and Delves (1967) have suggested that this condition should be detected by
checking the value of f'(z)/f(z). If this value comes out to be larger than a predetermined
constant, then it may be assumed that the contour passes close to a zero and the search
should be abandoned. A second attempt could then be made using a different contour.

The main advantage of this method is that it gives definite information
about the existence or nonexistence of roots within a closed contour, which is im-
possible using iterative methods: non-convergence of an iteration does not
establish the nonexistence of roots, while even if the iteration converges to a
particular root, it is difficult to ascertain whether there is any other root in the neigh-
bourhood. The main problem with the quadrature based method is that it
requires an enormous number of function evaluations to compute the integrals to
sufficient accuracy. If the integrals are not evaluated to sufficient accuracy, then
the estimate of roots may be completely unreliable. It is quite possible that one
of the roots of the resulting polynomial turns out to be outside the required
region. Because of these problems, the practical utility of such always conver-
gent methods is somewhat questionable. But if all other efforts have failed, or
if we want to make sure that, there are no undetected zeros in the given re-
gion, then it is worthwhile to try this method. There are some instances when
such methods have proved to be useful in locating zeros of some complicated
function, when either the analytic knowledge is lacking, or when very good
approximation is required for the iteration to converge.
However, before applying this method it must be ensured that the function
is analytic in the entire region within the contour. For example, this method
cannot be applied to the problem in Example 7.2, which has a pole close to the
zeros. An extension of this method to determine zeros and poles of a meromor-
phic function is given by Delves (Rabinowitz, 1970). Further, this method may
give completely misleading results, when applied to ill-conditioned problem. In
many cases, the method described in Section 7.7 may turn out to be simpler
and more efficient, because the quadrature based method can require enor-
mous number of function evaluations to evaluate the integrals with sufficient
accuracy.

7.10 Real Roots of Polynomials


So far we have considered the problem of finding zeros of arbitrary functions;
in this and the next section we consider the special case when the function is
a polynomial of degree n in x:

$$f(x) = a_0 + a_1 x + a_2 x^2 + \cdots + a_n x^n. \qquad (7.63)$$

Here the coefficients ai are generally real, although some of the methods are
applicable, even when the coefficients are complex. Of course, the methods
considered in the previous sections can be applied to polynomials also. However,
the problem of finding the roots of polynomials occurs very often in physical
problems and it is justified to look for methods which are specially suitable for
polynomials.
As illustrated in Section 7.14, unnecessary reduction of a physical problem
to that of finding roots of polynomials should be avoided, since such polynomi-
als often turn out to be ill-conditioned. Nevertheless, the analytic simplicity of a
polynomial often tempts us to reduce the problem to finding roots of a polyno-
mial. In some cases such a step may be helpful, since we can apply all analytic
techniques for locating the roots or simply for ascertaining the nature of roots.
While such an analysis will certainly lend us some insight into the problem,
very often the analysis part is ignored and the polynomial is straightway fed
to a solving routine. This may be hazardous, since no polynomial solving rou-
tine can guarantee required accuracy for all polynomials that are possible. The
subroutines that are available behave in a different manner, when it comes to
accuracy. Some of the subroutines may flag failure too soon, while others may
supply numbers which have little significance as the roots. As a rule of thumb,
we can assume that a good routine will be able to give roots with a relative
accuracy of at least ℏ^{1/n}, where ℏ is the machine accuracy and n is the degree of
the polynomial. Thus, with 24-bit arithmetic, we can only assume an accuracy of
2^{-4} ≈ 0.06 or about 6% for a sixth degree polynomial! In most cases, the accuracy
may be better, but that
cannot be assumed, unless some analysis or experimentation is carried out. We
do not want to discourage readers from solving polynomials, but only wish to
point out that the roots calculated through standard subroutines cannot be
taken for granted, particularly if high accuracy is required.
One advantage with polynomials is that the fundamental theorem of algebra
assures us that a polynomial of degree n has exactly n zeros. Hence, we can
expect to calculate all zeros of a given polynomial. Even if we use an iterative
method and find n zeros, we can be sure that there are no more zeros
to be found. Similarly, if the iteration fails to converge before all the zeros are
found, then we can say that the iterative method has failed. Even though we can
easily construct examples for which iterative methods starting from any fixed
starting value will fail, it is generally found that iterative methods converge
from essentially arbitrary starting values to some root of the polynomial. There
is an extensive literature on the location of roots of a polynomial as a function
of its coefficients (Householder, 1970). In this section, we consider only one
technique for locating real roots, which is based on Sturm sequences. This
method may not be very useful for finding real roots of a polynomial of low
degree that we generally come across. However, it proves to be very useful in
finding eigenvalues of real symmetric matrices, as well as those of Sturm-
Liouville problems in ordinary differential equations.
A sequence of real functions {f_0(x), f_1(x), ..., f_m(x)} is said to form a
Sturm sequence in the weak sense on the interval [a, b], if the following condi-
tions are satisfied:
1. No two consecutive functions in the sequence vanish simultaneously on
the interval.
2. If f_j(r) = 0 for 0 < j < m, then f_{j-1}(r) f_{j+1}(r) < 0.
3. Throughout the interval [a, b], f_m(x) ≠ 0.
4. In addition, if it is also true that f_0(r) = 0 implies f_0'(r) f_1(r) > 0, then
it will be called a Sturm sequence in the strict sense, or just a Sturm
sequence for f(x) = f_0(x).
Let {f_i(x)}, i = 0, 1, ..., m be a Sturm sequence on [a, b] and let x_0 be
any point in (a, b) at which f_0(x_0) ≠ 0. We define V(x_0) to be the number
of sign changes (i.e., the number of times the sign changes from positive to
negative and vice versa) in the sequence {f_i(x_0)}, zero values being ignored.
If a is finite, then V(a) is defined as V(a + ε), where ε is such that no f_i(x)
vanishes in (a, a + ε). Similarly, we can define V(b) when b is finite. If a = -∞,
then V(a) is defined to be the number of sign changes of {lim_{x→-∞} f_i(x)} and
similarly for V(b) when b = +∞.
If f_0(x) is any polynomial, then we can construct a Sturm sequence as
follows:

$$f_1(x) = f_0'(x); \quad f_{j-1}(x) = q_{j-1}(x) f_j(x) - f_{j+1}(x), \quad (j = 1, 2, \ldots, m-1); \quad f_{m-1}(x) = q_{m-1}(x) f_m(x). \qquad (7.64)$$

Here q_{j-1}(x) is the quotient and f_{j+1}(x) is the negative of the remainder
when f_{j-1}(x) is divided by f_j(x). It can be seen that {f_i(x)} is a sequence
of polynomials of decreasing degree, which must eventually terminate in a
polynomial f_m(x) for m ≤ n, where f_m(x) is the greatest common divisor
of f_0(x) and f_1(x) and hence of every other f_i(x). If we assume that all zeros of
f_0(x) are simple, then obviously f_m(x) would be a constant and conditions (1)
and (3) for the Sturm sequence would be satisfied. Further, if f_i(r) = 0, then
f_{i-1}(r) = -f_{i+1}(r) for 0 < i < m and condition (2) is also satisfied, while
condition (4) is satisfied by virtue of the choice of f_1(x). Thus, the sequence of
polynomials defined above forms a Sturm sequence.
If some of the zeros of f_0(x) are multiple, then since f_m(x) is the greatest
common divisor of f_0(x) and f_0'(x) = f_1(x), f_0(x)/f_m(x) will have only simple
zeros. Further, since f_m(x) divides all f_i(x), it follows that f_i(x)/f_m(x) would
also be a polynomial and the sequence of polynomials {f_i(x)/f_m(x)}, (i =
0, 1, ..., m) forms a Sturm sequence. It can be seen that this division by f_m(x)
does not affect V(x) as long as x is not one of the zeros of f_m(x), i.e., one of
the multiple zeros of f_0(x). Thus in practice we need not perform the division
and can use the sequence {f_i(x)} itself. Using the properties of the Sturm sequence we
can prove the following theorem due to Sturm.
Theorem (Sturm): If f_1(x) ≡ f_0'(x) and the sequence of polynomials
{f_i(x)}, (i = 0, 1, ..., m) is formed as above, then the number of times f_0(x)
vanishes on the interval (a, b) is exactly equal to V(a) - V(b) (that is, a zero of
whatever multiplicity is counted only once).
This result is extremely useful in locating real roots of any polynomial,
since it enables us to find out the exact number of real roots in any given
interval. Unlike the simple method of locating the roots by looking for sign
changes, there is no question of any root being missed and hence the search
interval need not be small. To find all real zeros of a polynomial, we can start
with an interval large enough to contain all real roots and go on bisecting the
interval until all the roots are isolated, i.e., each subinterval contains at most
one root. Once the roots are isolated, we can switch to an iterative method to
get the accurate values efficiently. Further, by making use of f_m(x) we can also
find out the multiplicity of the roots.
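The construction (7.64) and the counting of sign changes are easy to sketch in Python using numpy's polynomial division; this is only an illustration (with a crude tolerance for deciding when the remainder has effectively vanished), not a robust implementation:

```python
import numpy as np

def sturm_sequence(c):
    """Construct the Sturm sequence (7.64) for the polynomial with coefficients c
    (highest power first, the convention used by numpy.polydiv)."""
    seq = [np.array(c, dtype=float)]
    seq.append(np.polyder(seq[0]))
    while len(seq[-1]) > 1:
        _, rem = np.polydiv(seq[-2], seq[-1])
        rem = -rem                                   # f_{j+1} is minus the remainder
        if np.max(np.abs(rem)) < 1e-10 * np.max(np.abs(seq[-2])):
            break                                    # remainder negligible: f_m reached
        seq.append(np.trim_zeros(rem, 'f'))
    return seq

def sign_changes(seq, x):
    """V(x): number of sign changes in the sequence evaluated at x, zeros ignored."""
    vals = [v for v in (np.polyval(p, x) for p in seq) if v != 0.0]
    return sum(1 for a, b in zip(vals, vals[1:]) if a * b < 0)

# Example 7.9: f0(x) = x^6 - 4x^5 + 4x^4 - x^2 + 4x - 4 has 3 distinct real roots in (-4, 4)
seq = sturm_sequence([1, -4, 4, 0, -1, 4, -4])
print(sign_changes(seq, -4.0) - sign_changes(seq, 4.0))      # expected: 3
```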

EXAMPLE 7.9: Find the real roots of the following polynomial

$$f_0(x) = x^6 - 4x^5 + 4x^4 - x^2 + 4x - 4 = (x^2 + 1)(x + 1)(x - 1)(x - 2)^2. \qquad (7.65)$$

Using (7.64) we can construct the Sturm sequence to get the result shown in Table 7.2,
where the coefficients have been made integers by multiplying by suitable positive constants.
Since f_5(x) divides f_4(x) exactly, the process terminates there. Hence, (x - 2) is a com-
mon factor of all polynomials in the sequence and can be identified as a double root of the
polynomial f_0(x).

Table 7.2: Locating real zeros of a polynomial using Sturm sequence

                                                      x:   -4    0    1   24/17  1.9    4
  f_0(x) = x^6 - 4x^5 + 4x^4 - x^2 + 4x - 4                 +    -    0     +     +     +
  f_1(x) = f_0'(x) = 6x^5 - 20x^4 + 16x^3 - 2x + 4          -    +    +     +     -     +
  f_2(x) = 4x^4 - 8x^3 + 3x^2 - 14x + 16                    +    +    +     -     -     +
  f_3(x) = x^3 - 6x^2 + 12x - 8                             -    -    -     -     -     +
  f_4(x) = -17x^2 + 58x - 48 = (-x + 2)(17x - 24)           -    -    -     0     +     -
  f_5(x) = -x + 2                                           +    +    +     +     +     -
  No. of sign changes V(x)                                  4    3    2     2     2     1

It is obvious from the coefficients of the polynomial that all real roots would be in the
interval (-4, 4). Actually we can get stronger bounds on the roots, but that is not required.
Starting with this interval we find that there are three distinct real roots of the polynomial.
Further, from f_5(x) it is clear that the polynomial has a double root at x = 2. Hence,
counting the multiplicity of roots, the polynomial has four real roots and a pair of complex
conjugate roots. To isolate the roots we can refine the interval. Considering x = 0 we find
that there is one root in the interval (-4, 0) and two roots in the interval (0, 4). To isolate
these two roots we can evaluate the sequence at x = 2, but since all the polynomials vanish
at that point, which is a double root of the polynomial, we cannot estimate the number of
sign changes at this point. Hence, we can consider x = 1.9, which isolates the two roots in
the intervals (0, 1.9) and (1.9, 4). Thus, with just four evaluations of the Sturm sequence, we have
isolated all real roots. Just to illustrate the situation at one of the zeros of f_i(x), (i > 0), we
consider x = 24/17. It can be seen that in this case, if the zero value is ignored, then we can
get the correct result. On the other hand, if we consider x = 1 at which f_0(x) = 0, there may
be some confusion. To resolve this confusion we can consider x = 1 + ε, where ε > 0 is small
enough to ensure that the signs of f_i(x), (i > 0) will not change. In this case, since f_1(1) = f_0'(1)
is positive, f_0(1 + ε) is positive and the number of sign changes will be the same as at x = 1.
Hence, we can say that there is no root in the interval (1 + ε, 1.9), but one root in the interval
(0, 1 + ε) and so on.

7.11 Laguerre's Method


Having considered the problem of finding real roots of polynomials, we shall
now consider the complex roots of a polynomial. There are several theorems
for locating complex roots of polynomials also. One of these theorems, due to
Schur, is also based on constructing a sequence of polynomials, such that by
looking at the sign changes, we can find whether there is any zero in a given
circle in the complex plane. This criterion can be used to locate the zeros of
any polynomial as discussed by Lehmer (Householder, 1970). This method is
rather inefficient, but is guaranteed to work. There is another always convergent
method, namely Graeffe's root squaring method, where the basic idea is to
replace the given polynomial by another polynomial, still of degree n whose
roots are the squares of the roots of the original polynomial. By repeating this
process sufficient number of times, the roots which are unequal in magnitude
will become more widely separated and we can easily calculate the roots directly
from the coefficients. This process will have problems when two or more roots
are equal, or nearly equal in magnitude. These problems can be tackled, but
that makes the algorithm too cumbersome.
All these always convergent methods work very nicely for polynomials of
low degree, but for high degree polynomials they may produce the roots only in
principle. In practice, these methods get into problems, because of overflow and
roundoff error. Further, for most of the polynomials that we come across, the
iterative methods starting from almost arbitrary starting values are enough. It
is found that in most cases, the lack of convergence of iterative methods is due
to roundoff error and if higher precision arithmetic is used the iteration usually
converges.
We can use any of the iterative methods considered earlier for finding
zeros of a polynomial. Once one of the roots is determined, then we can re-
move the root by performing deflation to get a polynomial of lower degree.
This process of deflation introduces some error in the resulting polynomial.
Hence, it would be better if the root determined as above is "refined" by using
the original polynomial for iteration. This iteration should converge very fast,
since the initial approximation would be very good in general. If before each
deflation, iteration is carried out till the limiting accuracy for the precision of
computations is attained, and if the zeros are determined roughly in the or-
der of increasing absolute magnitude, the deterioration caused by deflation is
very small. However, if one of the larger zeros is accepted and deflation carried
out before determining the smaller zeros, then a catastrophic loss of accuracy
may take place, even if all zeros of original polynomial were well-conditioned.
Of course, it may not always be possible to find zeros in increasing order of
magnitude.
For polynomials it is possible to use Laguerre's method, which can be
proved to be always convergent for polynomials with real zeros. Even when
some of the zeros are complex, practical experience suggests that convergence
failure is extremely rare. Suppose all zeros α_1, ..., α_n of a polynomial f(y) are
real, distinct and ordered, such that α_1 < α_2 < ... < α_{n-1} < α_n. Let x be any
point in some interval I_i = (α_i, α_{i+1}). When x lies outside the interval (α_1, α_n),
we may regard the real axis as joined from +∞ to -∞ and say that it lies in
the interval (α_n, α_1). The basis of Laguerre's method is to construct a parabola
with two real zeros in I_i, one of which will be closer to a zero of f(y) than x.
We assume that n > 2, since for n ≤ 2 the results follow trivially.
Consider the quadratic g(y) in y, defined for any real u by

$$g(y) = (x - y)^2 \left[\sum_{i=1}^{n} \frac{(u - \alpha_i)^2}{(x - \alpha_i)^2}\right] - (u - y)^2. \qquad (7.66)$$

Clearly, if u ≠ x we have

$$g(\alpha_i) > 0, \quad (i = 1, 2, \ldots, n) \qquad \text{and} \qquad g(x) < 0. \qquad (7.67)$$

Hence, for any real u ≠ x, g(y) has two zeros y' and y'', such that

$$\alpha_i < y' < x < y'' < \alpha_{i+1}. \qquad (7.68)$$

Thus, y' and y'' are better approximations to the two zeros neighbouring x
than x itself. The function g(y) can be expressed in terms of the coefficients of
the polynomial f(x), by noting that

$$\frac{f'(x)}{f(x)} = \sum_{i=1}^{n} \frac{1}{x - \alpha_i}. \qquad (7.69)$$

Differentiating this equation with respect to x, we get

$$\frac{[f'(x)]^2 - f(x) f''(x)}{[f(x)]^2} = \sum_{i=1}^{n} \frac{1}{(x - \alpha_i)^2}. \qquad (7.70)$$

Using

$$\left[\frac{u - \alpha_i}{x - \alpha_i}\right]^2 = \frac{(u - x)^2}{(x - \alpha_i)^2} + \frac{2(u - x)}{x - \alpha_i} + 1, \qquad (7.71)$$

it can be easily seen that g(y) is a quadratic in (u - x) and we can write

$$g(y) = A(u - x)^2 + B(u - x) + C, \qquad (7.72)$$

where A, B and C are functions of y. To get the best results, we can choose u
such that the zeros y' and y'' are as near as possible to the corresponding zeros
of f(x). In view of condition (7.68), this aim is achieved when the value of y is
an extremum, that is, dy/du = 0. Noting that

$$\frac{\partial A}{\partial u} = \frac{\partial B}{\partial u} = \frac{\partial C}{\partial u} = 0, \qquad (7.73)$$

we get

$$2A(u - x) + B = 0. \qquad (7.74)$$

Eliminating u between (7.74) and (7.72) we get B^2 = 4AC, which after
considerable algebraic manipulation yields

$$y = x - \frac{n f(x)}{f'(x) \pm \sqrt{H(x)}}, \qquad (7.75)$$

where

$$H(x) = (n - 1)\left[(n - 1)[f'(x)]^2 - n f(x) f''(x)\right]. \qquad (7.76)$$

As in Muller's method, we choose the sign so that the denominator has the
larger magnitude. It should be noted that for n = 2, (7.75) reduces to the usual
formula for the roots of a quadratic, which gives the exact result. Similarly, for
n = 1, (7.75) reduces to the Newton-Raphson iteration, which also gives the exact
result in that case. Hence, the successive iterates can be generated using

$$x_{k+1} = x_k - \frac{n f(x_k)}{f'(x_k) \pm \sqrt{H(x_k)}}, \qquad (7.77)$$

with H(x) defined by (7.76).
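A minimal Python sketch of the iteration (7.77) for a polynomial given by its coefficients; it is only an illustration with names and conventions of our own choosing, not the POLYR or POLYC subroutines of Appendix B:

```python
import cmath
import numpy as np

def laguerre(coeffs, x, tol=1e-12, maxiter=100):
    """Laguerre iteration (7.77); coeffs lists the coefficients with the highest
    power first (numpy convention), and x is a (possibly complex) starting value."""
    n = len(coeffs) - 1                      # degree of the polynomial
    dcoef = np.polyder(coeffs)
    d2coef = np.polyder(dcoef)
    for _ in range(maxiter):
        f = np.polyval(coeffs, x)
        if f == 0:
            return x
        df = np.polyval(dcoef, x)
        d2f = np.polyval(d2coef, x)
        h = (n - 1) * ((n - 1) * df * df - n * f * d2f)      # H(x) of (7.76)
        sq = cmath.sqrt(h)
        # choose the sign giving the denominator of larger magnitude
        den = df + sq if abs(df + sq) > abs(df - sq) else df - sq
        dx = n * f / den
        x = x - dx
        if abs(dx) < tol * max(1.0, abs(x)):
            return x
    return x

# one of the roots of x^3 - 1, starting from 0.5 + 0.5j
print(laguerre([1.0, 0.0, 0.0, -1.0], 0.5 + 0.5j))
```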


It can be proved that for a simple zero, this method gives cubic conver-
gence (i.e., p=3), which is much faster than any other method we have so far
considered. However, it requires the calculation of first as well as the second
derivatives and so the computational efficiency is

Eh = 31 / 6 = 3 1 / 3 ~ 1.44, (7.78)

where we have assumed e = 3, which is reasonable, since for a polynomial, com-


putation of derivatives require nearly the same effort as calculation of function
values. It should be noted that, this efficiency index is only slightly greater than
that for Newton-Raphson method (21/2 ~ 1.41) and is considerably less than
that for secant iteration or Muller's method. However, Laguerre's method is
very useful for finding zeros of a polynomial, mainly because of the empirical
evidence that the convergence properties of this method are much better when
the approximation is far from the actual root. Moreover, by its construction it
will always converge to the nearest zero, provided all zeros are real and distinct.
It can be proved that it will converge, even if some of the real zeros are multi-
ple. For a polynomial, some of whose zeros are complex nothing can be proved
about the global convergence, but experience suggests that convergence failure
is very rare. Hence, Laguerre's method gives a very good technique for finding
all zeros of any polynomial. It can be seen that a real initial approximation can
converge to a complex zero, since H(x) can come out to be negative at some
stage.
It can be easily seen from (7.77) that if at any point f(x_k) ≠ 0, but
f'(x_k) = f''(x_k) = 0, then the denominator vanishes and it is impossible to
proceed further. Hence, in such cases, Laguerre's method fails. For example,
if x_k = 0 and the coefficients of x and x^2 in the polynomial are zero, then
the iteration will fail. This failure can cause some problem, since zero is a
convenient starting value for iteration, because it usually results in roots being
found in increasing order of magnitude. To counter this problem in subroutine
POLYR, we make a second attempt with some arbitrary starting value if the
iteration fails to converge with the first starting value. This device may avoid
the convergence failure, but in that case a large root might be located first,
causing problems after deflation. Hence, if the polynomial is expected to have
small roots, this constant (i.e., 1.123456 in subroutine POLYR) may be changed
in the subroutine. This failure can also occur near a root of multiplicity three or
higher, when because of roundoff errors it may turn out that the first and second
derivatives vanish. This problem can be avoided if we estimate the roundoff
error in evaluating the polynomial and iteration is terminated once the function
value is within the roundoff limit. The roundoff error will generally be of the
order of nℏ times the sum of the absolute magnitudes of all terms in the polynomial.
Once a root of the polynomial is determined, we can deflate the polynomial
by dividing out the corresponding factor to obtain a polynomial of lower degree.
If the coefficients of the polynomial are real and a complex root has been found,
then the second root will be the complex conjugate of it and both these roots
can be removed by deflating the polynomial by dividing out the corresponding
quadratic factor. If a is a real root of a polynomial, then the deflation process
can be defined by

f(x) = ao +alx+a2x2 + .. ·+anx n = (x - O')(bo +hx+b 2x 2 + ... +bn_1x n - 1).


(7.79)
Equating the coefficients of x j for j = 0, 1. 2, ... ,n, we can obtain the fol-
lowing recurrence relation to determine the new coefficients bi of the deflated
polynomial in terms of the coefficients of the original polynomial:

(k = n - 2, n - 3, ... ,0). (7.80)

If a complex root α + iβ has been found, then we can divide by the quadratic
factor x^2 + px + q, where p = -2α and q = α^2 + β^2. Following the above
procedure, it can be shown that this deflation is achieved by the following
recurrence relations

$$b_{n-2} = a_n, \qquad b_{n-3} = a_{n-1} - p\, b_{n-2}, \qquad b_k = a_{k+2} - p\, b_{k+1} - q\, b_{k+2}, \quad (k = n - 4, n - 5, \ldots, 0). \qquad (7.81)$$
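The recurrences (7.80) and (7.81) amount to synthetic division and are easily sketched in Python (coefficient ordering a = [a_0, ..., a_n] as in (7.79); the function names are ours):

```python
def deflate_real(a, alpha):
    """Divide out the factor (x - alpha) using (7.80); a[k] is the coefficient of x^k."""
    n = len(a) - 1
    b = [0.0] * n
    b[n - 1] = a[n]
    for k in range(n - 2, -1, -1):
        b[k] = a[k + 1] + alpha * b[k + 1]
    return b

def deflate_quadratic(a, p, q):
    """Divide out the factor x^2 + p x + q (a complex-conjugate pair) using (7.81);
    assumes the degree n is at least 3."""
    n = len(a) - 1
    b = [0.0] * (n - 1)
    b[n - 2] = a[n]
    b[n - 3] = a[n - 1] - p * b[n - 2]
    for k in range(n - 4, -1, -1):
        b[k] = a[k + 2] - p * b[k + 1] - q * b[k + 2]
    return b

# (x - 1)(x - 2)(x - 3) = -6 + 11x - 6x^2 + x^3 ; removing the root at x = 1
print(deflate_real([-6.0, 11.0, -6.0, 1.0], 1.0))    # -> [6.0, -5.0, 1.0], i.e. (x-2)(x-3)
```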

It should be noted that in either case, if the root is not determined ac-
curately, then some error is introduced in the new coefficients. Hence, if the
polynomial is ill-conditioned, such a procedure may lead to disaster, since the
roots of deflated polynomial may not be close to the roots of the original poly-
nomial. It is possible to refine the roots obtained using the deflated polynomial
by using them as the starting values in iterative methods with the original
polynomial. If the deflation process is stable, then the iteration should con-
verge very fast and we can get an accurate value of the roots. Since the refined
roots are obtained by iterating on the original polynomial, any effect that the
deflation process might have had will be removed. However, this process of re-
finement has its own drawbacks. It may happen that the iteration simply does
not converge with the original polynomial. A more likely possibility is that, in
two or more cases, the iteration with original polynomial may converge to the
same root, which can happen since, unlike the deflated polynomial, the original
polynomial has all its roots intact. This possibility is likely to occur when two
or more roots of the polynomial are nearly equal.
Wilkinson (1994) has done extensive error analysis of the process of de-
flation. It is found that if the roots are found in increasing order of magnitude,
then the process of deflation is generally stable. On the other hand, if one
of the larger roots is accepted beforehand and deflation is carried out, then the
smaller roots may be perturbed by a significant amount, rendering the solution
useless. Further, before deflation it is necessary to find the root as accurately
as possible, so that the error in the deflation process is minimised. This may
be achieved by continuing the iteration until the difference between succes-
sive iterates starts increasing after a relatively moderate convergence criterion
is satisfied. To find the roots in increasing order of magnitude, the iteration
should be started from zero or a value close to zero.
Because of these reasons, it is necessary to choose the proper conver-
gence criterion. This problem will be considered in Section 7.13. For the case of
polynomials it is difficult to give a general criterion, since if the convergence
criterion is too stringent, it may not be possible to satisfy it for multiple
roots. Hence, to allow for the possibility of multiple roots, it is essential to
choose a more flexible criterion, which should allow maximum accuracy to be
achieved for well-conditioned simple roots. At the same time, it should ensure
a reasonable accuracy for multiple roots. Since the accepted values of the roots
will be used in deflation procedure, it is essential to ensure maximum possible
accuracy to minimise the errors introduced in the process of deflation. Because
of these considerations, polynomial solving routines usually do not require any
external convergence parameter to determine the accuracy of roots. As a re-
sult, it may be difficult to ascertain the accuracy of roots calculated using any
of these subroutines. In Section 7.13, we will consider some practical criterion
to achieve maximum possible accuracy. However, none of these approaches are
foolproof and we can always find exceptions, where these methods will give
misleading results.
To check the reliability of computations, a simple method is to repeat the
calculations after multiplying all the coefficients by a number like 1.1 or 1.37,
which cannot be represented exactly in the computer. If calculations are done
exactly, then the result should not change because of this multiplication. By
comparing the results in the two cases, we can get some idea about the accuracy
of computations. If the polynomial is ill-conditioned, then the two sets of results
are unlikely to agree. Alternately, we can perturb some of the coefficients by a
small amount and check if the changes in the roots are reasonable in magnitude.
This may not always give reliable indication as it is possible that the roots are
not very sensitive to the coefficients perturbed.

7.12 Roundoff Error


In our discussion of the iterative methods, we have not considered the influence
of roundoff error. Our results about the asymptotic convergence rates would be
correct only if the computations are carried out exactly, which is never the case.
In practice, the attainable accuracy is limited by our inability to evaluate f(x)
accurately in the neighbourhood of a zero. It can be seen that in all iterative
methods, the correction x_{k+1} - x_k is directly proportional to the computed
value of f(x_k). The computed correction therefore has no relation with the
actual correction, unless f(x_k) has some correct significant figures.
We have already introduced the concept of the domain of indeterminacy
in Section 7.1. Here we will attempt to estimate the size of this domain which
depends on the function and on how it is evaluated. Nevertheless, some general
principles can be stated independently of the function. Let us assume that the
(absolute) roundoff error in evaluating the function in the neighbourhood of a
zero is ℏK, where ℏ is the machine accuracy and K is some constant which should
generally be of the order of the function value somewhat away from the root. The
value of the function at a point x close to the zero at x = α can be approximated
by the first nonvanishing term in the Taylor series, to get

$$f(x) \approx \frac{f^{(m)}(\alpha)}{m!}\,(x - \alpha)^m, \qquad (7.82)$$
where m is the multiplicity of the zero. Since this domain can be defined as
the region where the roundoff error is comparable to the function value itself,
equating the function value to the roundoff error estimate will give a reasonable
estimate of the size of the domain of indeterminacy. Hence, the "radius" of the
domain of indeterminacy can be written as

$$\delta \approx \left(\frac{m!\,\hbar K}{|f^{(m)}(\alpha)|}\right)^{1/m}. \qquad (7.83)$$

Hence, we see that the size of this domain increases with the multiplicity. For
isolated roots (m = 1) the size is of the order of ℏ, and in general, it will
not pose any serious limitation. However, even for a double root, the size of this
domain is of the order of √ℏ and there will be a significant loss of accuracy.
Thus, if we are using an arithmetic with an accuracy of 10 decimal digits, we
can only achieve an accuracy of approximately five digits for a double root. For
roots with higher multiplicity the situation will be even worse.
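The estimate (7.83) is easy to observe numerically. The following illustrative Python fragment (our own example, not from the text) evaluates f(x) = (x - 1)^2 in its expanded form near the double zero; with double precision (ℏ of the order of 10^{-16}) the computed values lose all significance over a region of width of the order of √ℏ ≈ 10^{-8}, in agreement with (7.83):

```python
# f(x) = x^2 - 2x + 1 has a double zero at x = 1; the expanded form suffers
# cancellation, so roundoff is visible.  Compare the computed value with the
# "exact" value (x - 1)^2 as x approaches the zero.
f = lambda x: x * x - 2.0 * x + 1.0
for d in [1e-6, 1e-7, 1e-8, 1e-9]:
    x = 1.0 + d
    print(d, f(x), (x - 1.0) ** 2)
```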
Let us first consider the effect of roundoff error on secant iteration. When
x_k is sufficiently far from the zero at α, the computed value of the function will
have a low relative error, and the computed value of the correction is nearly
as good as the exact value. As we approach the domain of indeterminacy, the
correction would still be in the right direction. However, if x_k happens to be
inside the domain while x_{k-1} is outside, the computed value of f(x_k) depends
on the vagaries of roundoff errors and the computed value of the correction will have no
relation to the exact value. But in this case, since x_{k-1} is outside the domain
of indeterminacy, |f(x_k)| < |f(x_{k-1})| and in general, |x_{k+1} - x_k| < |x_k - x_{k-1}|.
Thus, most probably x_{k+1} is also inside the domain of indeterminacy. It can
be easily seen that if instead x_{k-1} is inside the domain and x_k is outside it, the
same result will hold good. Now let us consider a situation in which both x_k
and x_{k-1} lie inside the domain of indeterminacy; then the correction is given
by

$$x_{k+1} - x_k = (x_k - x_{k-1})\left[\frac{f(x_k)}{f(x_{k-1}) - f(x_k)}\right]. \qquad (7.84)$$

The quantity inside the square brackets is entirely governed by the roundoff
error. For all values of x inside the domain of indeterminacy, the function value
is generally of the same order of magnitude. Hence, the factor inside the square
bracket would be of the order of unity and the next approximation would be
inside or near the domain of indeterminacy. Thus, once we reach a point inside
the domain of indeterminacy, the successive approximations tend to remain
inside, or near the domain of indeterminacy. Of course, an unfortunate com-
bination of roundoff errors may produce the values f(x_k) and f(x_{k-1}), which
are very nearly or even exactly equal. In this case, the next approximation may
be far away from the domain of indeterminacy and it may happen that if the
iteration is continued still further, it may approach another zero of f(x). More
likely possibility is that the iteration will once again tend towards the same
zero, in which case, we can get some sort of oscillatory behaviour, where the
iteration keeps coming towards the zero, but bounces back without converging.
If the two values of f(x) turn out to be exactly equal, then the iteration has to
be terminated.
For Muller's method, the situation is not essentially different. However,
for Newton-Raphson method the situation is somewhat different. In this case,
also as long as we are well outside the domain of indeterminacy, the computed
correction would have a low relative error and we find that the computed ap-
proximations improve right up to the stage when the domain of indeterminacy
is reached. If x_k is inside the domain of indeterminacy and the zero under
consideration is a simple zero, the computed value of f'(x_k) may still have a
comparatively low relative error. Since |f(x_k)| ≤ d|f'(α)| (where d is some
measure of the radius of the domain of indeterminacy), the computed value
of x_{k+1} is likely to be inside or near the domain of indeterminacy. Thus, the
iterates will move about more or less randomly inside the domain of indetermi-
nacy. However, for a multiple zero the computed value of f'(x_k) will also have
a large error when x_k is inside the domain of indeterminacy. In particular,
the computed value of f'(x_k) may come out to be very small or even zero while
f(x_k) is nonzero, in which case x_{k+1} would be far from α or x_k. If f'(x_k)
happens to be zero, then the iteration has to be terminated.
The always convergent methods like Brent's method may apparently
converge to a relative accuracy of the order of ℏ, even when the domain of
indeterminacy is very large. However, in all cases, the bounding interval may
not reduce to this value, since one of the function values may come out to be
exactly zero and the iteration may be terminated. In any case, such methods
should converge to some value inside the domain of indeterminacy and the
likely error in the computed solution is once again determined by the size of
this domain. However, these methods will not give any estimate for the size of
this domain.
In practical applications to estimate the effect of roundoff error, we can
repeat the calculations after multiplying each term in the equation by a num-
ber like 1.1 or 1.37, which cannot be exactly represented in computer. For
example, all coefficients of a polynomial can be multiplied by 1.1 and roots
can be found once again. Comparing the results in the two cases may give an
estimate of roundoff error in computations. This device will not work if the
function is defined by only one term, for example, if we wish to find zero of
say the Bessel function, Jo(x). Alternately, we can use different starting values
to check whether the iteration converges to the same root. Printing out the
intermediate values after every iteration will also give an idea of size of domain
of indeterminacy as the iterates may keep oscillating within this domain.

7.13 Criterion for Acceptance of a Root


The practical problem of determining when the iteration should be stopped is
not free from difficulties. Let us first consider the idealised problem in which
the computations are performed exactly, so that any desired accuracy can be
achieved by continuing the iteration for a sufficient number of steps. Of course,
we assume that the initial value is such that the iteration does converge to the
required root. In all iterative methods we have ultimately

$$|\epsilon_{i+1}| \approx c\,|\epsilon_i|^p, \qquad (7.85)$$

where p > 1 for simple roots, but may be equal to one for multiple roots or
for the regula falsi method or the fixed-point iteration. A simple criterion for
terminating the iteration is |x_{i+1} - x_i| < ε, where ε is some preassigned small
number. It should be noted that if we want to cover the possibility of multiple
roots, ε should be appreciably smaller than the maximum error that is regarded
as acceptable in a root. For example, if α is a root of multiplicity r, and if we
are using the Newton-Raphson method, then ultimately we have

$$\alpha \approx x_{i+1} + (r - 1)(x_{i+1} - x_i). \qquad (7.86)$$

Hence, ε must be at least a factor of (r - 1) smaller than our required tolerance.


It should be noted that in any case, ε must be small enough to ensure that
the asymptotic convergence rate is achieved.
If the magnitudes of the roots are widely varying, it may be better to use a
criterion based on relative error

$$\left|\frac{x_i - x_{i-1}}{x_i}\right| < \epsilon. \qquad (7.87)$$

This criterion is satisfactory as long as α ≠ 0, but for a simple root at α = 0,
we would have

$$\lim_{i \to \infty} \left|\frac{x_i - x_{i-1}}{x_i}\right| = \infty. \qquad (7.88)$$

Thus, this relative criterion should be used only when we are sure that there
is no root at or near x = 0. Alternately, we can use a criterion of the form

(7.89)

This criterion reduces to an absolute criterion if IXi IEre! < Eabs, while for larger
values of IXi I it will be a relative criterion. If Ere! = 0, th~n it reduces to a
simple absolute criterion, while for Eabs = 0 it reduces to a relative criterion.
To reduce the chances of spurious convergence we can apply a test on the last
three iterates; for example, we can use

|x_{i+1} − x_i| + |x_i − x_{i−1}| < ε,    (7.90)

where ε is the tolerance, which can be defined in a relative or an absolute or a
combination of the two forms as before.
Alternately, we can use a criterion based on the function value. For ex-
ample, the iteration may be stopped if |f(x_k)| < ε_f, where ε_f is some prede-
termined tolerance. This test can be used only if we have a good idea of the
magnitude of the function in the neighbourhood of the zero. We can also use a
test of the form |f(x_k)| < ε_f ‖f‖, where ‖f‖ is the norm of the function, which
is essentially the magnitude of the function at points away from the zero.
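The criteria discussed above might be combined as in the following sketch (illustrative only; the tolerance names eps_abs, eps_rel and eps_f are not from the text):

# Illustrative sketch of a combined stopping criterion (not a routine from
# Appendix B).  eps_abs, eps_rel and eps_f are user-supplied tolerances.
def converged(x_new, x_old, fx, eps_abs=1e-12, eps_rel=1e-8,
              eps_f=None, f_norm=1.0):
    # Criterion (7.89): absolute test for small |x|, relative test otherwise.
    step_ok = abs(x_new - x_old) < max(eps_abs, eps_rel * abs(x_new))
    # Optional test on the function value, |f(x)| < eps_f * ||f||.
    fval_ok = True if eps_f is None else abs(fx) < eps_f * f_norm
    return step_ok and fval_ok
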
As explained in Section 1.2, it is impossible to check for convergence of
an infinite sequence using a finite amount of information. Hence, in any prac-
tical procedure, we should be prepared for spurious convergence in some cases,
where the convergence criterion is satisfied, even though the iteration has not
converged. The probability of spurious convergence will be higher when the con-
vergence criterion is modest. For example, if we are interested in only a relative
accuracy of 0.01 and set ε_rel = 0.01, then the chances of spurious convergence
are quite high, since it is possible that the first two digits agree by accident.
Hence, even if we are interested in only a crude result, it is preferable to specify
a small value for ε so as to avoid spurious convergence.
In practice, the attainable accuracy in any zero is limited by the domain
of indeterminacy associated with it. Further, in general, there is no a priori
information on the size of this domain. If we choose ε to be too large, then we
may accept a much less accurate value than what may ultimately be possible.
This is particularly important if the root is to be used in further computations.
On the other hand, if we choose ε to be very small, then the criterion may
never be satisfied. In such cases, it may happen that after a sufficient number
of iterations, a peculiar combination of roundoff errors suddenly produces a value of
|x_{i+1} − x_i| which is very small and satisfies the given criterion. However, the result
so obtained is no more accurate than most of the preceding approximations.
Thus, indiscriminate use of iteration may produce a result which is apparently
more accurate than what can really be achieved. To guard against such spurious
convergence, we can use a convergence criterion which checks three successive
iterates for convergence rather than two. However, if one of the function values
has come out to be exactly zero, then the iteration will certainly converge to
that value, no matter what convergence criterion is utilised. Similarly, if x_{i+1} is
equal to x_i to machine precision, then again the iteration has to be terminated.
Hence, spurious convergence in some cases is unavoidable. To estimate the size
of the domain of indeterminacy, we can consider the successive values of the
differences |x_{i+1} − x_i|. If the root is simple, then after a sharp decrease the
differences may either start oscillating or decrease very slowly. The average
difference at this stage will give a rough estimate for the size of the domain of
indeterminacy.
As we will see later on, in some ill-conditioned cases, the domain of inde-
terminacy may be very large and it may not be possible to meet even the most
modest demands, unless very high precision arithmetic is used. An ideal pro-
gram should be able to detect this trouble. A simple technique which reduces
the risk of iterating indefinitely for ill-conditioned roots, without sacrificing the
accuracy attainable in well-conditioned roots, is as follows. It is noticed that
after a certain stage the differences |x_{i+1} − x_i| decrease monotonically, until we
reach the domain of indeterminacy. Thereafter the differences will in general
behave erratically. Hence, we can use a relatively modest criterion of the form
|x_{i+1} − x_i| < ε, and after it has been satisfied, the iteration should be continued
as long as the successive differences keep decreasing. The iteration can be
terminated whenever

|x_{i+1} − x_i| ≥ |x_i − x_{i−1}|,    (7.91)

and the value x_{i+1} can be accepted as the zero. This technique is not foolproof,
since in some cases it may happen that the difference keeps decreasing
very slowly for several iterations. There is also the problem of choosing the
right value of ε for the first criterion. If ε is chosen to be too small, then the
criterion may never be satisfied, while if too large a value is used for ε, then
it is likely to be satisfied even when the iteration is far from the root and the
asymptotic convergence rate has not been reached. In the latter case, if after
this stage the difference increases, then the corresponding value will be accepted
as the zero. Hence, such devices should be used with due care. Such flexible
convergence criteria are normally employed for finding roots of a polynomial,
since in that case the same criterion is used for all roots irrespective of their
multiplicity.
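A minimal sketch of this strategy is given below (illustrative only; iterate stands for one step of any iterative method of this chapter):

# Illustrative sketch of the flexible stopping strategy described above.
# `iterate(x)` is assumed to return the next approximation x_{i+1} from x_i.
def find_zero(iterate, x0, eps=1e-6, maxiter=100):
    x_old, x_new = x0, iterate(x0)
    d_old = abs(x_new - x_old)
    for _ in range(maxiter):
        x_old, x_new = x_new, iterate(x_new)
        d_new = abs(x_new - x_old)
        # Once the modest criterion is satisfied, keep iterating until the
        # differences stop decreasing, cf. (7.91).
        if d_old < eps and d_new >= d_old:
            return x_new
        d_old = d_new
    return x_new   # iteration limit reached
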
If we wish to find more than one zero of a given function, then it may
happen that on our second attempt the iteration will once again converge to
the same zero. To avoid this problem, we can divide the function f(x) by
(x − α), where α is the computed value of the previous zero. If this division can
be explicitly carried out, then it may result in a simpler function, which would
be the case for polynomials. Similarly, if a complex zero of a real function f(x)
is found, then we can divide f(x) by the quadratic factor x^2 − px + q, which
has the pair of complex conjugate zeros α and α*. Each time a zero is found,
we can divide the function by the corresponding factor before continuing the
search for the next zero. If the division cannot be carried out explicitly, then
the form of function gradually becomes more complicated. On the other hand,
if the division can be carried out explicitly, then a significant roundoff error
may be introduced, because the computed value of the zero will not be exact.
In the above discussion, we have assumed that the iteration is converging
to the required root. If the iteration is not converging to the required root, then
we would like to detect this possibility as soon as possible, so that unnecessary
computation can be avoided. This is a very difficult problem and in general,
there is no way we can tell that the iteration is not going to converge. If the
approximate position of the root is known and some bounds can be put on
its value, then if the iteration goes outside these bounds we may assume that
it is not going to converge to this root. Of course, this assumption may not
be correct, since after wandering around, the iteration may come back to the
neighbourhood of the root and converge. If the bounds on a real root are pro-
vided by sign changes, then it may happen that one of the bounds is very close
to the actual root. In such cases, it may be difficult for the iteration to converge
to the root without going outside the bounding interval. If by coincidence one
of the bounding points happens to be inside the domain of indeterminacy for
the root, then it may be virtually impossible for the iteration to converge to
the root without violating the bounds. On the other hand, if the bounds are
known very crudely, then a large number of iterations may be required before
the iteration actually goes outside the bounds. In many cases, the iteration may
drift very slowly in some direction and it is impossible to predict if it is going
to converge or not.
Because of these uncertainties, it is always better to keep some upper
limit on the number of iterations, that are allowed in any iterative method. No
matter how carefully we choose this number, there will always be cases, where
the iteration was about to converge when the limit on the number of iterations
was reached. Some of these cases can be detected by checking the difference in
the last iteration. If the change is small, but not small enough to accept the
root, then it is very likely that the iteration may converge in a few more steps.
If the limiting number of iterations is small, then the iteration may fail very
frequently. On the other hand, if this number is large, then the probability of
spurious convergence also increases, particularly if roundoff error is significant
at the level of accuracy required.
If the iteration fails to converge, then either the starting value is not
sufficiently close to the root, or the roundoff error is significant. If the failure is
due to roundoff error, then the successive iterates should be wandering more or
less randomly within the domain of indeterminacy. Occasionally, it may happen
that some peculiar combination of roundoff errors will send the iteration far
from the root, after which it once again approaches the same or a different
root. This behaviour can be easily detected if the successive iterates are printed
out. Experience suggests that for a single nonlinear equation, the convergence
failure is usually due to roundoff error, even when the iteration does not exhibit
the above mentioned behaviour. Increasing the precision of arithmetic often


improves the convergence dramatically.
For some highly ill-conditioned problems, the iteration may apparently
converge to almost arbitrary values, usually very close to the starting value.
Such spurious convergences can be easily detected, if the function values are
also printed out, since in that case, the function value does not decrease by
any appreciable amount. Further, a second attempt from a slightly different
starting value will give a different result in such cases.

7.14 Ill-conditioning
Having considered the problem of finding real and complex zeros of arbitrary
functions, now we consider some examples of ill-conditioned problems. The
main purpose of this section is to give some idea about what types of problems
could be ill-conditioned. In fact, a significant number of high degree polyno-
mials are ill-conditioned and should be avoided as far as possible. Most of
the polynomials that we come across in practice, do not arise naturally in the
problem, but rather because of our training in mathematics we tend to reduce
the problem to finding roots of a polynomial. While a polynomial may provide
a neat way of expressing the results analytically, this reduction may lead to
disaster if numerical values are required. For example, the matrix eigenvalue
problem can be reduced to solution of a polynomial, but even if the original
matrix is well-conditioned with respect to the eigenvalues, the corresponding
polynomial may (and by Murphy's law it usually does) turn out to be ill-
conditioned. On the other hand, an ill-conditioned matrix eigenvalue problem
will never become well-conditioned when expressed in polynomial form. Let us
consider an ill-conditioned matrix eigenvalue problem, where the eigenvalues
are very sensitive to perturbations in matrix elements. Now if this problem is
reduced to a polynomial which is well-conditioned, in the sense that the roots
are not sensitive to perturbations in coefficients, then it only implies that the
coefficients of the polynomial themselves are very sensitive to perturbations in
matrix elements. Hence, unless the matrix elements are known exactly and the
corresponding coefficients of the polynomial can also be calculated exactly, no
purpose will be served by using the polynomial. Another important reason for
reducing a problem to that of finding roots of polynomials is the feeling that
it is straightforward to find zeros of polynomials, while finding zeros of an ar-
bitrary function is a rather difficult task. However, in practical problems the
additional roundoff errors involved in calculating the coefficients of polynomial
and the resulting ill-conditioning of polynomials, may make the problem more
difficult. It is a general principle in numerical computation that unnecessary
analytic simplification should be avoided, unless we are sure that it will help in
numerical computations or when it provides some insight into the solution.
To define the condition number of a function, let us assume that a function
f(z) has a zero at z = α. If this function is perturbed to f(z) + εg(z), where it
is assumed that for general values of z, both f(z) and g(z) have the same order
of magnitude and |ε| ≪ 1. In actual practice, the perturbation may be due to
uncertainties in knowledge of some coefficients or parameters occurring in the
definition of the function, or due to roundoff error in evaluating the function.
If α + δα is the zero of the perturbed function and if ε is small, then neglecting
terms of the order of ε^2 and higher, we get

δα ≈ −ε g(α) / f'(α).    (7.92)

The quantity K = |g(α)/f'(α)| can be considered to be the condition number of
the problem. If K is large, then the problem is very sensitive to perturbations
in the function definition and such problems are said to be ill-conditioned, while if
K is small, then the roots are not very sensitive to errors in function evaluation
and the problem is said to be stable or well-conditioned. Here we have assumed
α to be a simple zero. If it is a zero of multiplicity r, then in general, it will be
perturbed to r distinct zeros and the perturbation would be given by

δα ≈ ε^{1/r} (−r! g(α) / f^{(r)}(α))^{1/r}.    (7.93)

By taking the r different interpretations of ε^{1/r}, we can get the perturbations in
the r zeros. In general, the perturbation in multiple zeros would be larger as
compared to that in simple zeros. In practice, even if the zeros are distinct but
close, in the sense that the perturbation given by (7.92) is larger than the spacing
between the zeros, they should probably be treated as multiple.
EXAMPLE 7.10: Let us consider a polynomial of degree 20 with zeros at 1,2, ... ,20, which
is the classic example considered by Wilkinson (1994).
In this case, the perturbation δx_r in the rth zero due to a perturbation ε in the
coefficient a_k of x^k (i.e., g(x) = x^k) is given by (7.92)

δx_r ≈ −ε x_r^k / f'(x_r) = −(−1)^{20−r} r^k ε / ((20 − r)! (r − 1)!) = ±Kε,    (7.94)

where we have assumed ε to be small enough for the linear perturbation theory to be appli-
cable. The number K is the condition number of the rth zero with respect to perturbation in a_k.
The maximum value of K (≈ 2.4 × 10^9) is obtained for r = 16, k = 19, while the minimum
value (≈ 8.2 × 10^{-18}) is realised for r = 1. Most of the larger zeros are very sensitive to
perturbations in the coefficients of higher powers of x, while the smaller zeros are rather
insensitive.
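The condition numbers K of (7.94) are easily tabulated; the sketch below (illustrative, not part of the text) confirms the extreme values quoted above:

# Illustrative sketch: condition numbers (7.94) for the Wilkinson polynomial
# with zeros 1, 2, ..., 20 and perturbation in the coefficient of x^k.
from math import factorial

def condition_number(r, k):
    # K = r^k / ((20 - r)! (r - 1)!), cf. (7.94)
    return r**k / (factorial(20 - r) * factorial(r - 1))

K = {(r, k): condition_number(r, k) for r in range(1, 21) for k in range(20)}
(r_max, k_max) = max(K, key=K.get)
print(r_max, k_max, K[(r_max, k_max)])   # largest K occurs at r = 16, k = 19
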
The explicit form of the polynomial is

x^20 − 210x^19 + 20615x^18 − 1256850x^17 + 53327946x^16 − 1672280820x^15
+ 40171771630x^14 − 756111184500x^13 + 11310276995381x^12 − 135585182899530x^11
+ 1307535010540395x^10 − 10142299865511450x^9 + 63030812099294896x^8
− 311333643161390640x^7 + 1206647803780373360x^6 − 3599979517947607200x^5
+ 8037811822645051776x^4 − 12870931245150988800x^3 + 13803759753640704000x^2
− 8752948036761600000x + 2432902008176640000.
                                                                      (7.95)
It can be seen that the coefficients of this polynomial vary over several orders of magnitude
and it is more meaningful to consider a relative perturbation in the coefficients while defining
the condition numbers. Hence, in (7.94) we can use ε = ε_r a_k and the condition number will
have an extra factor of a_k, which could be quite large. In this case, the maximum condition
number is attained for r = 14, k = 13, for which δx_r ≈ 1.4 × 10^{14} ε_r. Assuming that the
perturbation is due to rounding of the coefficients, we get ε_r = ℏ. It can be seen that, unless
ℏ < 10^{-14}, the perturbations will be too large for the linear theory to be applicable. It should
be noted that here we have considered the effect of error in a single coefficient, while in actual
practice all coefficients are rounded and the error may be much larger. Further, apart from
the initial rounding of coefficients, there will be roundoff error in evaluating the polynomial,
which will also be of the same order.
To illustrate the effect, we have computed the zeros of this polynomial when the co-
efficient of x^14 is perturbed by 2^{-10}, which corresponds to a relative change of the order of
10^{-13} in the coefficient. The zeros rounded to the fifth decimal place are
1.00000, 2.00000, 3.00000, 4.00000, 5.00000, 5.99999, 7.00015,
7.99823, 9.01426, 9.93252, 11.37045 ± 0.27920i, 13.37933 ± 0.82557i,
15.56225 ± 0.92533i, 17.67071 ± 0.49798i, 19.10306, 19.98632.
It can be seen that eight zeros turn out to be complex with very substantial imaginary parts,
while the smaller zeros are fairly insensitive.
If all coefficients are rounded to 24 bits, then the roots correct to the fifth decimal place
are
0.99999,  7.33652, 3.83179 ± 0.62889i,  8.72470 ± 6.01625i, 23.56912 ± 6.21891i,
2.00031, 15.20071, 4.93175 ± 2.03176i, 12.33241 ± 8.13074i,
2.96648, 26.29165, 6.48936 ± 3.86115i, 17.72305 ± 8.85964i.
It should be noted that these roots are determined using higher precision.

Table 7.3: Muller iteration for an ill-conditioned polynomial

 i   z_i                              z_i − z_{i−1}                      f(z_i)

 1   13.95404 + 6.495721×10^{-7} i    −4.596×10^{-2} + 6.496×10^{-7} i   −2.943×10^9 − 1.100×10^8 i
 2   14.04481 + 1.658152×10^{-6} i     9.077×10^{-2} + 1.009×10^{-6} i   −5.276×10^9 − 1.979×10^8 i
 3   14.01986 + 1.597903×10^{-6} i    −2.495×10^{-2} − 6.025×10^{-8} i    5.353×10^9 + 1.994×10^8 i
 4   13.97466 + 2.522342×10^{-6} i    −4.520×10^{-2} + 9.244×10^{-7} i    5.681×10^9 + 2.120×10^8 i
 5   13.98185 + 2.878036×10^{-6} i     7.192×10^{-3} + 3.557×10^{-7} i   −1.915×10^9 − 7.168×10^7 i
 6   13.99916 + 2.733912×10^{-6} i     1.731×10^{-2} − 1.441×10^{-7} i   −1.893×10^9 − 7.083×10^7 i
 7   14.00064 + 2.695378×10^{-6} i     1.482×10^{-3} − 3.853×10^{-8} i   −2.459×10^8 − 9.189×10^6 i
 8   13.99841 + 2.746675×10^{-6} i    −2.230×10^{-3} + 5.130×10^{-8} i   −8.178×10^8 − 3.056×10^7 i

If we try to solve this polynomial using the subroutine POLYR or ZROOT with 24-bit
arithmetic, then the routine fails after finding the first few zeros and returns more or less arbi-
trary values for the other zeros. The error parameter IER is nonzero and it is difficult to have
any idea about the accuracy of the results, even if all the intermediate iterates are printed
out. Because of the extreme ill-conditioning, the domain of indeterminacy corresponding to
the larger zeros is extremely large and any reasonable starting value will be inside one of the
domains. Consequently, the iteration keeps wandering aimlessly and the function value does
not show any decrease. For the purpose of illustration, one such sequence of iterates is shown
in Table 7.3, which was obtained using Muller's method with starting values of 14.1, 13.9 and
14.0. It can be seen that, even though the function value has hardly decreased, the differ-
ence between successive iterates shows a slow decrease, which may give the impression that
a moderate accuracy has been achieved. However, in such cases, the iteration generally tends to
converge close to the starting value itself.

EXAMPLE 7.11: Consider the polynomial of degree 20 with roots in geometric progression
x_r = 2^{-r}, (r = 1, 2, ..., 20).
For this polynomial, the coefficients vary enormously in magnitude, from 1 for the
coefficient of x^20 to 2^{-210} for the constant term, and it is more reasonable to consider
relative perturbations. It is found that for all values of r and k

(7.96)

Thus in this case, not more than six significant binary digits can be affected in any zero and
hence the polynomial is well-conditioned. Further, it is found that the smaller zeros are more
stable, which appears surprising, since the absolute difference between the zeros is very small
and we might think that the zeros are very close. However, the ratio of each of these roots
to the next one is two and it is this ratio which is important. In the previous example, the
roots were apparently well separated, but actually the ratio of successive roots is close to one
for the higher zeros, leading to ill-conditioning.
EXAMPLE 7.12: Consider the polynomial in Example 7.1.
This polynomial has a zero of multiplicity nine, and is expected to be ill-conditioned.
However, since the coefficients are simple numbers which can be represented exactly in any
computer, the perturbations in the zeros may turn out to be much smaller than expected, if
the intermediate numbers are not rounded while evaluating the polynomial (see Example 7.1).
If all coefficients are multiplied by 1.1 and the coefficients are rounded to 24 bits, then it turns
out that the roots become all distinct and all iterative methods converge rapidly to the roots.
The roots of this polynomial correctly rounded to seventh decimal place are
0.8154658, 1.0000000, 1.2262931, 0.8574369 ± 0.1245913i,
(7.97)
0.9795323 ± 0.2012872i, 1.1421512 ± 0.1659622i.
It can be seen that just the error in representation is enough to perturb the roots significantly
in this case.

These examples show that, even polynomials having zeros which at first
sight appear to be well separated, may be extremely ill-conditioned. From the
second example it is clear that it is the ratio of neighbouring zeros which is
the decisive factor and not their absolute distances. The possession of several
zeros having ratios close to unity always leads to ill-conditioning of these zeros.
Further, it is found that polynomials having complex zeros are in general better
conditioned. Thus, if we take polynomials of fixed degree and compare random
distribution of zeros inside the unit circle with a random distribution in the
interval (-1, 1), then the former will in general be much better conditioned.
Before we close this section we consider one more example due to Olver (1952).
EXAMPLE 7.13: Find the zeros of the following function given by Olver (1952)
f(z) = 2z(dz^2 + 1)^{n−1} sinh nγ cosech γ + 2(dz^2 + 1)^n cosh nγ,    (7.98)

where
cosh γ = ((d + 1/2)z^2 + 1) / (dz^2 + 1),    (7.99)

with parameters d = 10 and n = 8.
By eliminating cosh γ between the two equations, this function reduces to a polynomial
of degree 16 given by

1250162561x^16 + 385455882x^15 + 845947696x^14 + 240775148x^13 + 247926664x^12
+ 64249356x^11 + 41018752x^10 + 9490840x^9 + 4178260x^8 + 837860x^7
+ 267232x^6 + 44184x^5 + 10416x^4 + 1288x^3 + 224x^2 + 16x + 2.
                                                                  (7.100)

However, it turns out that the polynomial is ill-conditioned, with condition numbers of about
10^7, and using 24-bit arithmetic the subroutine POLYR fails to find any root at all. If the
coefficients are rounded to 24 bits, but the calculations are done in double precision, then the
following results are obtained:
−.0033179 ± .3255979i,  .0127553 ± .3187566i,  −.0167539 ± .3128566i,
−.0124053 ± .2933054i,  .0022977 ± .2881406i,  −.0186876 ± .2530314i,    (7.101)
 .0143967 ± .3012906i,  −.1324472 ± .1360055i.
On the other hand, if the original function defined by (7.98) is supplied to the subroutine
ZROOT, then the following results are obtained using 24-bit arithmetic:
−.0000031 ± .3121970i,  −.0000148 ± .3116963i,  −.0000471 ± .3106619i,
−.0004915 ± .3041824i,  −.0023209 ± .2925837i,  −.0186950 ± .2530457i,    (7.102)
−.0001426 ± .3086121i,  −.1324472 ± .1360054i.
All roots are complex and further most of the complex conjugate pairs are close, and as a
result the iteration converges slowly. It can be verified that all these roots are correct to seven
significant figures, which is about the accuracy of the arithmetic employed. Hence, it is clear
that the original function is well-conditioned for all zeros and the ill-conditioning was introduced
when the function was transformed to polynomial form. Thus, although the polynomial form
assures us that there are only 16 zeros, it does not help us in finding them accurately. On
the other hand, from the original function it is not obvious that the number of zeros is only
16 and we may not be sure if we have found all roots.

7.15 System of Nonlinear Equations


So far we have considered methods for finding real or complex roots of a single
nonlinear equation; in this and the subsequent sections we briefly consider the
problem of finding real roots of a system of nonlinear equations. We will not
consider the case of complex roots of a system of nonlinear equations, since that
can always be considered as a system of real equations in twice the number of
variables. Once again when we try to extend the methods to higher dimensions,
there are difficulties as mentioned towards the end of Section 4.8. Apart from
the fact that the amount of computation increases steeply with the number
of equations, there is also the problem of lack of intuitive feeling and analysis
to obtain a reasonable approximation. Even if the iterative methods are gen-
eralised to a system of nonlinear equations, the convergence of iteration is a
serious problem. Unless a good guess is known, it is extremely unlikely that the
iteration will converge.
For solution of a single equation we can have a good idea of the root,
if the corresponding function is plotted as a function of x. The solution of
two nonlinear equations in two variables can be considered as the problem
of finding the points of intersection of two series of curves in two dimensions
defined by each of the two equations. Similarly, in three dimensions the solution
can be considered as the point of intersection of the three surfaces defined by
each of the equations. While this concept can be easily generalised to higher
dimensions, it is of little help for having a feeling of where the roots are located.
We can write the system of n nonlinear equations in n unknowns in the
form

f_j(x_1, x_2, ..., x_n) = 0,    (j = 1, 2, ..., n),    (7.103)
which can be expressed in vector notation by defining a vector function f of a
vector x as

f(x) = 0.    (7.104)
Apart from boldfacing, this equation looks similar to the corresponding equa-
tion for one variable, but the analogy does not go very far. A simple extension
of one-dimensional methods is to solve the equations recursively. For example,
consider a system of two equations

f_1(x_1, x_2) = 0,    f_2(x_1, x_2) = 0.    (7.105)
If the second equation can be solved for x_2 in terms of x_1, then this solution
can be substituted in the first equation to yield an equation in only one variable
x_1, which can be solved by methods described earlier. This technique is similar
to the elimination procedure for a system of linear equations. If the elimination
can be carried out analytically, then there is no difficulty. However, usually the
elimination cannot be carried out in a closed form. In such cases, we can use
the subroutine for solving one equation recursively. For this purpose, we can
consider the first equation as a function of x_1 only, with the value of x_2 being
determined by solving the second equation for a given value of x_1. This requires
the subroutine for solving one nonlinear equation to be called recursively and,
if the programming language does not support recursion, multiple copies of the
routine will be needed.
This procedure can be easily generalised to higher dimensions. This proce-
dure has the advantage that at every step we require a solution of one equation
only. Hence, we can scan the required region for roots by looking for sign changes
(and extrema to allow for multiple roots). Consequently, we can be reasonably
certain of the fact that all roots in a given range have been located. This as-
surance is almost impossible with other methods. Hence, this strategy may be
considered for solving a system of two or three equations. Beyond that the work
involved may be too much, since this method is fairly inefficient in the sense
that a large number of function evaluations may be required. Further, there is
also the problem that the solution of the second equation for x_2 may not exist
for arbitrary values of x_1. Hence, unless a good approximation to the solution
is known, it may not be possible to use this method at all. Another problem
with this method is that, for any value of x_1, there may be more than one
root of f_2(x_1, x_2) = 0 and the value of x_2 may not be determined uniquely.
In such cases, we have to try all branches of the solution x_2 as a function of
x_1 separately, which could be a tedious exercise if there are several solutions.
In higher dimensions, if each of the equations has multiple solutions, then the
number of branches will multiply rapidly making it impossible to consider all
possibilities in a reasonable time. However, if other methods have failed in solv-
ing the problem, then this approach can be tried. The order in which variables
are eliminated will be crucial in determining the effectiveness of this method.
Hence, various different orders may be tried.
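As an illustration of this recursive strategy (not one of the subroutines of Appendix B), the sketch below uses SciPy's brentq to solve the second equation for x_2 at each trial value of x_1 and then applies brentq again to the resulting function of x_1 alone; the bracketing intervals are assumed to be known:

# Illustrative sketch of the recursive elimination strategy for two equations
# f1(x1, x2) = 0, f2(x1, x2) = 0 (not a routine from Appendix B).
from scipy.optimize import brentq

def f1(x1, x2):
    return x1**2 + x2**2 - 4.0       # circle of radius 2

def f2(x1, x2):
    return x2 - x1**2                # parabola

def g(x1):
    # Solve f2(x1, x2) = 0 for x2 at the given x1; the bracket [0, 5] for x2
    # is assumed to be known in advance.
    x2 = brentq(lambda t: f2(x1, t), 0.0, 5.0)
    return f1(x1, x2)

# Outer one-dimensional solve for x1; the bracket [0.5, 1.8] is assumed.
x1 = brentq(g, 0.5, 1.8)
x2 = brentq(lambda t: f2(x1, t), 0.0, 5.0)
print(x1, x2)   # point where the circle meets the parabola
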
In one dimension, it is straightforward to locate the approximate position
of the root by scanning the expected region and looking for sign changes in
the function. For two-dimensional problems, we have to scan in two dimensions


looking for sign changes in both the functions. It is possible to use the method
described in Section 7.7 for this purpose, but as mentioned earlier, there could
be problems because the two sets of curves could be coming close without
intersecting. An extension of regula falsi method to two dimensions is described
in Acton (1990), where three points are used to "bracket" the root. Each of
the functions is approximated by a plane, and the point of intersection of these
planes with the (x_1, x_2) plane gives the new approximation to the root. This
procedure is similar to the generalised secant method mentioned in Section 7.17.
However, in this method, following the basic philosophy of the regula falsi method,
one of the old points is discarded, such that the remaining three points still
"bracket" the root. However, finding the bracketing points itself will require
some effort. Further, depending on the geometry of the problem, the root may
not be inside the triangle formed by the "bracketing" points and it may be
difficult to estimate the error at any stage.
The problem becomes even more complicated in higher dimensions. Be-
cause of these reasons, scanning for the roots of nonlinear equations is rather
difficult and we have to rely more or less entirely on iterative methods. The
success of iterative methods depends crucially on the initial approximation
supplied, which necessitates proper analysis of the problem before attempting
numerical solution. In many cases, the solution may be known in some limit or
after making some simplifying assumptions. Such solutions can be effectively
used as initial approximations to the roots. By progressively relaxing the sim-
plifying assumptions, or by changing the parameter slowly from its limiting
value to the required value, we can solve a series of intermediate problems to
finally arrive at the required solution.
The fixed-point iteration method can be easily generalised to higher di-
mensions by writing the system of equations in the form x = F(x). It can be
shown that the iteration converges if all eigenvalues of the Jacobian of the vec-
tor function F(x) at the root are within the unit circle {51}. Since, in general,
the roots are not known before the problem is solved, it may be difficult to en-
sure this condition. Using an arbitrary form of the equations, the iteration is unlikely
to converge and this method need not be tried, unless proper analysis is carried
out. If the convergence is slow, then acceleration techniques can be applied.
Once the approximate position of a root is found, it may be better to shift to
Newton's method to obtain faster convergence.
Another problem with the solution of a system of nonlinear equations is
scaling. We have considered this problem for the case of a system of linear
equations and the same principles apply here except for the fact that it is very
difficult to scale the equations using any automatic program and this has to be
ensured by the user. The scaling can usually be achieved by choosing proper
units, such that all components f_j(x) of the vector function are roughly of
the same order of magnitude. Similarly, all components x_j should also be of
the same order of magnitude. For example, if some of the variables are varying
between 10^6 and 10^8 while others are between 10^{-6} and 10^{-8}, then the iterative
methods for solving nonlinear equations are unlikely to work. In this case, if
the iteration chooses a step size of 0.5, then for some of the variables it will
jump across the entire range, while for others it will be an insignificant change.
Hence, it is worthwhile to spend some effort in scaling a difficult problem, so
that the variables and the function values vary over approximately the same
range of values.
Because of these reasons, the solution of a system of nonlinear equations
is a difficult and time consuming problem. It is unlikely that an automatic pro-
gram will successfully solve such a system, unless some analysis is done before
invoking such a routine. However, with some patience, experimentation and
proper analysis such problems can be solved. In practice, we may come across
systems of nonlinear equations with hundreds or even thousands of equations.
Such problems normally arise out of finite difference approximation to differ-
ential equations. In such problems, we normally do not worry about the order
of convergence of iterative methods, since convergence itself is rather unlikely.
Very little effort is spent in improving the accuracy of the root, since most
of the work concerns the initial stages, when iteration is wandering around in
search of the elusive roots. Hence, any method which converges to the required
solution is normally acceptable. Our present understanding of this problem is
not good enough to suggest which methods should be preferred. In subsequent
sections, we will outline a few of the methods which will hopefully enable the
reader to solve systems of nonlinear equations. In the next section, we consider
the generalisation of the Newton-Raphson method to a system of nonlinear
equations, while in the last section we will consider the Broyden's method,
which can be used when the derivatives of the functions are not available.
In one dimension, the process of deflation or removal of known roots is
very straightforward and it ensures that the iteration does not converge to the
same root again and again. However, it is difficult to generalise this procedure
to higher dimensions. Hence, if we are interested in more than one root of the
equation, it may be difficult to locate them, since the iteration may preferen-
tially converge to one of the roots only. We can try to extend the concept of
deflation to higher dimensions. For example, in two dimensions, if (x_0, y_0) is a
root of f(x, y) = 0 and g(x, y) = 0, then we can perform deflation by defining
the new function

f_1(x, y) = f(x, y) / ((x − x_0) ∂f/∂x + (y − y_0) ∂f/∂y),    (7.106)

where the partial derivatives are evaluated at (x_0, y_0).

If the root is simple, then f_1(x_0, y_0) = 1 and the root is certainly removed.
Unfortunately, this technique introduces a line in the (x, y) plane, along which
f_1(x, y) is singular, and if there are any other roots on or close to this line,
they may also be suppressed. Apart from such coincidences, this singularity
may interfere with the root finding algorithms. It may be noted that it is not
necessary to divide both functions by such factors, since removing the root
from one equation will remove the simultaneous solution of both the equations.
Hence, we can choose g(x, y) instead of f(x, y) to perform deflation, or perhaps
we can perform it on both. The basic problem here is that, while we want
to remove one point, the process outlined here removes a line from the (x, y)
plane. This procedure can be easily generalised to higher dimensions, but in
that case, singularity will be introduced along a hyperplane, and the problem
may become more complicated. Further, if several roots are removed by this
procedure, then singularities will be introduced along several such hyperplanes,
making it very difficult for the iterative procedure to converge to roots. It is
tempting to use the simpler option of using the function
f_1(x, y) = f(x, y) / √((x − x_0)^2 + (y − y_0)^2),    (7.107)

to perform deflation. This technique does not really remove the zero, since the
function f(x, y) vanishes on a line while the denominator vanishes only at one
point. Hence, the value of the function does not exist at (x_0, y_0), since the limit
depends on the direction along which we approach the point. However, the
denominator may be able to distort the geometry of the problem in such a way
as to drive the iteration away from this point.
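A sketch of the deflation described above (illustrative only; it follows the form of (7.106), with the partial derivatives at the known root estimated by central differences) is:

# Illustrative sketch of two-dimensional deflation as in (7.106); the partial
# derivatives at the known root (x0, y0) are assumed to be available (here
# they are estimated by finite differences).
def deflate(f, x0, y0, h=1e-6):
    fx = (f(x0 + h, y0) - f(x0 - h, y0)) / (2 * h)   # df/dx at (x0, y0)
    fy = (f(x0, y0 + h) - f(x0, y0 - h)) / (2 * h)   # df/dy at (x0, y0)
    def f1(x, y):
        # Singular on the line (x - x0)*fx + (y - y0)*fy = 0, as noted above.
        return f(x, y) / ((x - x0) * fx + (y - y0) * fy)
    return f1
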
An alternate strategy is to simply remove a small region around the point
(x_0, y_0). For example, we can take a small disk centred at (x_0, y_0), whose radius
r_0 is at our disposal, and define our new function to have its old value outside
the disk, while inside it takes on the interpolated value

f_1(x, y) = f(x, y) (r/r_0) + A (1 − r/r_0),

where r is the distance of (x, y) from (x_0, y_0) and A is some arbitrary constant.
The value of f_1(x, y) now passes smoothly
from A at the root (x_0, y_0) to its correct value on the circumference of the
disk. This device will of course, wipe out all roots lying within the disk, but
it may still be considered, since it does not introduce singularities all over the
space. If the radius of the disk is too small, then the probability of wiping out
other roots may be minimised, but in that case, this device fails to prevent the
iteration from tending towards the old root. This is because, far from the root
the geometry is still the same and the iteration tends towards the root until
it lands inside the disk, where it will be thoroughly confused by the sudden
change in the geometry of the function. Hence, the radius r_0 has to be carefully
chosen.
Even in one dimension the iterative methods face difficulty when the root
is multiple. The problem is likely to be more severe in higher dimensions. It
can be shown that the determinant of the Jacobian of the function vanishes
at a multiple root, which causes difficulties with iterative methods. Similar
problems are encountered if the roots are not multiple, but closely clustered.
One strategy to deal with multiple roots is to replace one of the equations by
the determinant of the Jacobian (Acton, 1990) and solve the resulting system of
equations. The solution of this system gives the points, close to which multiple
or closely clustered zeros can be expected. Further, at this point only one of
the equations which has been replaced by the determinant of the Jacobian may
not be satisfied. Treating the root as a small perturbation about this point, the
root or an approximation to the root may be found.

7.16 Newton's Method


The Newton's method can be generalised to a system of nonlinear equations
f(x) = 0 by considering a Taylor series expansion about the most recent ap-
proximation to the root, and retaining only the terms involving the first deriva-
tives. This process is basically equivalent to linearising the nonlinear equations
and leads to a system of linear equations at each iteration. Assuming that
x^{(i)} = (x_1^{(i)}, x_2^{(i)}, ..., x_n^{(i)}) is the best approximation to the root, we expand
each of the n functions in a Taylor series about this point, to get

~ f j ((i)
f j (XI,X2"",Xn ) ~ (i) (i))
Xl 'X 2 "",Xn + ~(
L....t Xk -Xk
(i))
aofj I (7.109)
k=l Xk x=x(i)

for each value of j. To find the root of this system of equations, we set the
right-hand sides to zero and get the next approximation by solving the resulting
system of linear equations for the differences (x_k − x_k^{(i)}). The first derivatives

J_{jk} = ∂f_j/∂x_k |_{x=x^{(i)}}    (7.110)

constitute the Jacobian matrix J_i. Using the Jacobian we can write the iteration
formally in the form

x^{(i+1)} = x^{(i)} − J_i^{-1} f(x^{(i)}),    (7.111)

which is similar in appearance to the Newton-Raphson iteration for a single
nonlinear equation. In actual practice, we should avoid calculating the inverse
of the Jacobian matrix explicitly. Instead, the system of linear equations (7.109) can
be solved directly to yield the new approximation.
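A bare-bones version of the iteration (7.111) might look as follows (an illustrative sketch, not the subroutine NEWTON of Appendix B); the Jacobian is supplied by the user and the linear system is solved directly instead of forming the inverse:

# Illustrative sketch of Newton's method for a system f(x) = 0 (not the
# subroutine NEWTON of Appendix B).  f returns an n-vector, jac the n x n
# Jacobian matrix of first derivatives.
import numpy as np

def newton_system(f, jac, x0, eps=1e-10, maxiter=50):
    x = np.asarray(x0, dtype=float)
    for _ in range(maxiter):
        # Solve J dx = -f(x) directly instead of forming the inverse of J.
        dx = np.linalg.solve(jac(x), -f(x))
        x = x + dx
        if np.linalg.norm(dx) < eps * max(1.0, np.linalg.norm(x)):
            return x
    raise RuntimeError("Newton iteration failed to converge")
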
Hence, in this case, apart from calculating all the n functions and their n^2
first derivatives, we also need to solve a system of n linear equations, requiring
of the order of n^3 floating-point operations at each step of the iterative pro-
cess. Further, the number of iterations required for convergence also generally
increases with n. As a result, the amount of computation required to solve a
system of nonlinear equations increases rapidly with n. It can be shown that
this method converges quadratically to the root, even for a system of nonlinear
equations. However, the asymptotic convergence rate is realised only when the
approximation is close to the root. In actual practice, most of the effort is spent
in approaching the root and the asymptotic convergence rate is not of much
significance.
Most of the methods for solution of a system of nonlinear equations are
modifications or approximations of the Newton's method. The main problem
with the Newton's method is that, the n 2 derivatives are required at each step,
which makes each step of the iterative process very costly in terms of computer
time. Further, calculation of the derivatives itself may be quite difficult for most
of the functions that are encountered in practice. Hence, it is necessary to look
for generalisation of the secant iteration, which can be obtained by approxi-
mating the derivatives in the Newton's method with appropriate differences.
Such methods will be discussed in the next section. Most of the modifications
of the Newton's method are based on the iteration formula

x^{(i+1)} = x^{(i)} − t_i H_i f(x^{(i)}),    (7.112)

where t_i is a scaling factor and H_i is an n × n matrix, which is supposed
to approximate the inverse of the Jacobian matrix. The choice t_i = 1 and
H_i = J_i^{-1} gives the Newton's method.
A simple modification of the Newton's method is obtained if the Jacobian
is not computed at every step, but the same value is used for a fixed number
m of iterations. If the convergence is not achieved within m iterations, then
we compute a new value for the Jacobian. This device certainly reduces the
computation required during each iteration, but with this modification the
convergence is not quadratic.
Another modification which is referred to as the damped Newton's
method, is obtained if we assume H_i = J_i^{-1}, as in Newton's method, but
select t_i such that

‖f(x^{(i+1)})‖ < ‖f(x^{(i)})‖,    (7.113)

for some convenient norm. In this case, usually t_i is determined by a one-
dimensional minimisation algorithm, which minimises ‖f(x^{(i)} − tH_i f^{(i)})‖ as a
function of t, for t > O. In this method, at any stage the new approximation
to the root is obtained by moving in the direction given by the inverse of
Jacobian as in the Newton's method, but the amount of shift is determined by
minimising the norm of the function along that direction. This device ensures
that the norm of the function decreases at every step. Since the search for
minimum may require considerable effort, it may be sufficient to ensure that
the condition (7.113) is satisfied. After some steps it may happen that the norm
cannot decrease any more, because we have hit a local minimum of the function.
If the norm at this stage is small enough to be accepted as zero, then we have
found the root, otherwise the search has to be abandoned and a fresh attempt
made with a new starting value. A complicated function in several variables is
likely to have several minima and the iteration is very likely to hit one of these
minima rather than the actual root, unless approximate position of the root is
known beforehand.
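The damped Newton's method can be sketched by combining the Newton direction with a simple step-halving choice of t_i that enforces (7.113) (illustrative only; a proper one-dimensional minimisation could be used instead):

# Illustrative sketch of the damped Newton's method: the full Newton step is
# shortened by halving until the norm of f decreases, cf. (7.113).
import numpy as np

def damped_newton(f, jac, x0, eps=1e-10, maxiter=100):
    x = np.asarray(x0, dtype=float)
    for _ in range(maxiter):
        dx = np.linalg.solve(jac(x), -f(x))
        t, fnorm = 1.0, np.linalg.norm(f(x))
        # Halve the step until ||f(x + t*dx)|| < ||f(x)|| or t becomes tiny
        # (the latter indicates a local minimum of ||f|| rather than a root).
        while np.linalg.norm(f(x + t * dx)) >= fnorm and t > 1e-10:
            t *= 0.5
        x = x + t * dx
        if np.linalg.norm(t * dx) < eps * max(1.0, np.linalg.norm(x)):
            return x
    raise RuntimeError("damped Newton iteration failed to converge")
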
Another alternative is to use a method for finding the minimum of a function
of several variables (see Chapter 8). It can be seen that the scalar function

F(x) = f^T(x) f(x) = Σ_{i=1}^{n} f_i(x_1, x_2, ..., x_n)^2,    (7.114)
takes on its absolute minimum value of zero at the roots of the given system of
equations. Hence, if we can find a global minimum of the scalar function F, we
can solve the equations. This looks like an attractive possibility, but there are
several pitfalls which need to be overcome to make it a useful method. Firstly,
it is not easy to find a global minimum of such a function and most algorithms
only find a local minimum at which the function value may be nonzero. Hence,
we have to keep looking for minima, until we find one where the function
value "vanishes". However, it should be noted that even other methods like the
Newton's method also suffer from a similar problem, since at a local minimum
the Jacobian matrix is singular and the Newton's method will also fail or give
arbitrarily large shift. The main problem with minimisation techniques is that,
a minimum at which the function value is small but nonzero, may be mistaken
to be a root in the presence of roundoff errors. The advantage of minimisation
methods is that , it is much easier to find a minimum of one function than to
find simultaneous zeros of all the n distinct functions. Consequently, if all other
methods have failed, then it may be worthwhile to try such methods.
We can use the method of steepest descent (see Section 8.4) to find the
required minimum. This method searches along the direction −∇F(x^{(i)}) to find
the next approximation x^{(i+1)}, which minimises F(x), similar to the damped
Newton's method. If finding the minimum requires too much effort, we may
just take some value such that the norm is decreasing. It is always possible to
find such points as long as the gradient is nonzero, that is, the point is not an
extremum of the function. Noting that ∇F(x) = 2J^T(x) f(x), we can write this
iteration also in the form (7.112), by defining H_i = J_i^T, and once again t_i is
determined as in the damped Newton's method. Thus, in this case, the matrix
H_i is the transpose of the Jacobian rather than its inverse.
Whereas Newton's method converges quadratically in a neighbourhood
of the root, the method of steepest descent converges only linearly. However,
the Newton's method requires a good initial approximation, while the method
of steepest descent converges to one of the local minima starting from almost
arbitrary starting values. This is because of the fact that for finding a mini-
mum it is only necessary to find a direction along which the scalar function is
decreasing, which is always possible. On the other hand, for finding zeros of a
vector function it is necessary to find directions, where all the n functions are
decreasing in magnitude, which in general is not possible. To make use of the
faster asymptotic convergence of the Newton 's method close to the root, we can
switch to the Newton's method when the iteration appears to be converging to
a root.
The Levenberg-Marquardt algorithm combines the Newton's and the
steepest descent methods. In this algorithm we set

H_i = (J_i^T J_i + λ_i I_n)^{-1} J_i^T,    t_i = 1,    λ_i ≥ 0.    (7.115)

For λ_i = 0, it yields the Newton's method, whereas as λ_i increases the direction
specified by H_i tends to that of the steepest descent method. Thus, in this
algorithm we start with a large value of λ_i and go on reducing it as the solution
is approached, so as to switch from the method of steepest descent to the


Newton's method.
An alternate strategy to get a good initial approximation is provided by
the continuation method, which is also referred to as Davidenko's method. The
basic idea here is to construct a sequence of problems, such that the solution to
the ith problem in this sequence provides a good approximation to the solution
of the (i + 1)th problem. Further, if the first problem in the sequence has a
known solution while the last one is the problem whose solution is required, then
starting with the first problem we can solve a sequence of problems to get the
required solution. Let g(x, θ) be a function of x and θ such that g(x, 0) ≡ f(x)
and g(x, 1) = 0 has a known solution. Then starting with θ_0 = 1, we can choose
a sequence θ_j such that

1 = θ_0 > θ_1 > θ_2 > ··· > θ_m = 0,    (7.116)

and solve in succession the equations

g(x, θ_i) = 0,    (i = 1, 2, ..., m).    (7.117)

At each step the previous root, or an extrapolated value based on the last few
roots, can be used as the starting value. The solution of the last system gives
the required result. By choosing θ_{i+1} sufficiently close to θ_i at every step, the
problem of non-convergence of the iteration may be avoided. If there is a natural
parameter in the problem, then the corresponding values can be used instead of
0 and 1; otherwise it is straightforward to introduce a parameter. For example,
if x^{(0)} is some guess for the zero, then we can define

g(x, θ) = f(x) − θ f(x^{(0)}),    (7.118)

or
g(x, θ) = (x − x^{(0)}) θ + (1 − θ) f(x).    (7.119)

It can be easily verified that for both these definitions g(x^{(0)}, 1) = 0 and for
θ = 0, it reduces to the original function.
At first sight it appears that Davidenko's method can always be made
to converge to some root of the required system. However, the root may not be
a continuous function of θ, in which case this technique will fail, no matter how
small the steps used in changing θ. From the last two definitions of g(x, θ), if
we treat the root x as a function of θ, then we obtain

dx/dθ = J^{-1}(x) f(x^{(0)}) = θ^{-1} J^{-1}(x) f(x),    (7.120)

and
dx/dθ = [(1 − θ)J(x) + θI]^{-1} [f(x) − (x − x^{(0)})]
       = θ^{-1} [(1 − θ)J(x) + θI]^{-1} f(x),    (7.121)
respectively. Thus, the direction of the Davidenko path in the first case is
the direction of the corresponding Newton step, whereas in the second it is
like that of the Levenberg-Marquardt algorithm, provided f is a gradient of some
scalar function. In either case, Davidenko's method will succeed only if the
derivative exists for 0 ≤ θ ≤ 1. Thus, in the first case, the method will work,
provided the Jacobian is nonsingular throughout the range.
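A minimal continuation driver based on the parametrisation (7.118) could be sketched as follows (illustrative only; newton_system is the Newton sketch given earlier and the sequence of θ values is chosen arbitrarily):

# Illustrative sketch of Davidenko's method with g(x, theta) = f(x) - theta*f(x0),
# cf. (7.118).  newton_system is the Newton sketch given earlier in this chapter.
import numpy as np

def davidenko(f, jac, x0, thetas=(1.0, 0.5, 0.0)):
    x0 = np.asarray(x0, dtype=float)
    f0 = f(x0)                      # g(x0, 1) = 0 by construction
    x = x0.copy()
    for theta in thetas:            # theta decreases from 1 to 0
        g = lambda x, th=theta: f(x) - th * f0
        x = newton_system(g, jac, x)   # Jacobian of g equals that of f
        # The root of the previous problem serves as the starting value
        # for the next one.
    return x                        # solution of f(x) = 0 (theta = 0)
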
EXAMPLE 7.14: Solve the following system of nonlinear equations to obtain a 4-point
Gaussian quadrature formula to evaluate integrals of the form ∫_0^1 f(x) ln x dx:

Σ_{i=1}^{4} w_i x_i^j = −1/(j + 1)^2,    (j = 0, 1, ..., 7),    (7.122)

where x_i are the abscissas and w_i are the corresponding weights.


If Newton's method is applied with reasonable, but otherwise arbitrary, starting
values given in the first row of Table 7.4, then the iteration fails to converge. However, if
Davidenko's method is applied, then the iteration converges from the same starting values
with one intermediate value for the parameter θ, which clearly demonstrates the power of
Davidenko's method. Here we have used (7.118) to parametrise the function. These results
have been obtained with the subroutine NEWTON in Appendix B using 24-bit arithmetic. It
can be seen that the iteration does converge rather fast in this case and only one intermediate
value of the parameter θ is required. But this situation is not typical and in most cases, many
more intermediate values may be necessary. Readers may try to solve this problem after
changing the initial values of the abscissas to x_i = 0.2 × i. The correctly rounded solution of this
system of equations is
w_1 = −0.3834641,  w_2 = −0.3868753,  w_3 = −0.1904351,  w_4 = −0.03922549,
x_1 = 0.04144848,  x_2 = 0.2452749,  x_3 = 0.5561654,  x_4 = 0.8489824.
                                                                    (7.123)

Table 7.4: Newton/Davidenko method for a system of nonlinear equations

  i    w_1        x_1         w_2        x_2        w_3        x_3        w_4         x_4

θ = 0.8
     -.2500000  .10000000  -.2500000  .3500000  -.2500000  .6000000  -.25000000  .8500000
  1  -.2380672  .06295187  -.2342870  .2510820  -.3054465  .5478025  -.22219930  .8463859
  2  -.2633193  .07491492  -.2566524  .3005650  -.2651121  .5710580  -.21491610  .8477684
  3  -.2553394  .07355019  -.2686689  .2935534  -.2630826  .5763674  -.21290900  .8484067
  4  -.2552063  .07349666  -.2688932  .2938928  -.2630646  .5764400  -.21283580  .8484309
θ = 0
  1  -.3268597  -.00057323 -.3902644  .1682333  -.2349166  .5324223  -.04795930  .8458008
  2  -.3573105  .03000237  -.3822079  .2291974  -.2169211  .5323158  -.04356055  .8417670
  3  -.3794663  .04064046  -.3907252  .2442779  -.1898410  .5533004  -.03996745  .8474201
  4  -.3834502  .04144989  -.3869194  .2452714  -.1904019  .5562028  -.03922856  .8489638
  5  -.3834670  .04144914  -.3868729  .2452761  -.1904346  .5561656  -.03922553  .8489823
  6  -.3834918  .04145333  -.3868664  .2452932  -.1904201  .5561830  -.03922177  .8489894
  7  -.3834890  .04145268  -.3868693  .2452924  -.1904201  .5561837  -.03922152  .8489900
  8  -.3834782  .04145096  -.3868709  .2452842  -.1904275  .5561746  -.03922349  .8489862
  9  -.3834585  .04144744  -.3868771  .2452713  -.1904382  .5561621  -.03922615  .8489811
 10  -.3834738  .04145022  -.3868721  .2452813  -.1904299  .5561718  -.03922409  .8489851
 11  -.3834891  .04145283  -.3868681  .2452919  -.1904209  .5561826  -.03922176  .8489895
Thus the computed solution has significant roundoff error. As mentioned in Section 6.5 this
system of equations is somewhat ill-conditioned. It may be noted that the required Gaus-
sian formula can be easily obtained using the method described in Section 6.5, which is
implemented in the subroutine GAUSWT.

7.17 Broyden's Method


The Newton's method and its modifications described in the previous section
require the computation of the Jacobian. However, even for moderate values of
n the differentiation and the programming effort involved to evaluate the n 2
derivatives can become tedious and time consuming. Further, in many cases, it
may be rather difficult to find analytic expressions for the derivatives. Hence,
we would like to have methods which do not require calculation of derivatives.
This objective can be achieved if we use some finite difference approximation
to the Jacobian. The simplest approximation is provided by

∂f_j/∂x_k |_{x=x^{(i)}} ≈ (f_j(x^{(i)} + e_k h_i) − f_j(x^{(i)})) / h_i,    (7.124)

where e_k is the kth column of I_n, the identity matrix of order n. The choice
of the increment h_i is crucial in this approximation, since if h_i is too large,
there will be a large truncation error, while if h_i is too small, there may be a large
roundoff error. It is found that if we wish to retain the quadratic nature of
the Newton's method, then h_i ≤ G‖f_i‖ for some positive constant G and an
appropriate norm for the function.
While this discretisation of the Jacobian may avoid the need to calculate
the derivatives, it still requires calculation of n 2 functions at each step which
may be quite expensive. By analogy with the secant iteration in one dimension,
it will be better if the Jacobian is calculated only at the beginning of the
iteration and after that it is updated without any extra function evaluations.
There are various techniques for such updating, but in this section we describe
the one due to Broyden, which is probably the most convenient.
If x^{(i)} are the successive iterates and f^{(i)} are the function values at these
points, then we can define

q^{(i)} = x^{(i+1)} − x^{(i)}    and    y^{(i)} = f^{(i+1)} − f^{(i)}.    (7.125)

Further, if B_i is some approximation to the Jacobian at the corresponding
point, then we would like to have

B_{i+1} q^{(i)} = y^{(i)}.    (7.126)

However, this equation by itself is not sufficient to determine the Jacobian at


the new point. One simple technique for estimating the Jacobian is the so-called
generalised secant procedure, in which we demand

B_{i+1} q^{(j)} = y^{(j)},    (j = i, i−1, ..., i−n+1).    (7.127)


334 Chapter 7. Nonlinear Algebraic Equations

If these equations are linearly independent, then it is possible
to solve them for B_{i+1}. To start the iteration it will be necessary to calculate
an approximation to the Jacobian using (7.124).
In Broyden's method once again, at the first step we calculate the approx-
imation to the Jacobian using (7.124). At the subsequent steps we can update
the matrix B_i using the relation

B_{i+1} = B_i − (B_i q^{(i)} − y^{(i)}) q^{(i)T} / (q^{(i)T} q^{(i)}).    (7.128)

It can be verified that the resulting matrix satisfies (7.126). One interesting
feature of this formula is that, if H_i = B_i^{-1}, then

H_{i+1} = H_i − (H_i y^{(i)} − q^{(i)}) q^{(i)T} H_i / (q^{(i)T} H_i y^{(i)}).    (7.129)
The iteration is defined by

x^{(i+1)} = x^{(i)} − H_i f^{(i)}.    (7.130)

Hence, we need not use B_i at all, but always work with the inverse matrix H_i, in
which case we do not need to solve a system of linear equations at every step.
Consequently, the efficiency may improve for large systems, since the solution of a
system of linear equations requires O(n^3) floating-point operations, while this
procedure requires only O(n^2) operations to calculate the next approximation.
On the other hand, the disadvantage of working with the inverse matrix is
that, for large sparse systems the inverse may not be sparse. Hence, a larger
storage and computational effort may be required in handling the inverse. For
example, in nonlinear systems arising out of the finite difference approximation
to ordinary differential equations, the Jacobian is a band matrix with rather
small bandwidth, but the inverse of this matrix may be filled. In such cases, it
may be better to work with the matrix Bi and solve a system of linear equations
for the sparse matrix at each stage. Even then there is the problem that the
matrix Bi which is only an approximation to the actual Jacobian may not turn
out to be sparse. We may ignore the elements which are always expected to be
zero and hope that the iteration will converge.
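To summarise the preceding discussion, a minimal Python sketch of Broyden's method working with the inverse matrix H_i, as in (7.129) and (7.130), is given below. It is only an illustration, not the implementation used in this book; the name broyden, the tolerances and the use of the fdjac sketch given after (7.124) to start the iteration are assumptions made for this example.

    import numpy as np

    def broyden(f, x0, tol=1.0e-8, maxit=50):
        """Broyden's method using the inverse update (7.129) and the iteration
        (7.130).  The initial inverse Jacobian is obtained from the finite
        difference approximation (7.124) via the fdjac sketch given earlier."""
        x = np.asarray(x0, dtype=float)
        fx = np.asarray(f(x), dtype=float)
        H = np.linalg.inv(fdjac(f, x))          # B_0 from (7.124), H_0 = B_0^{-1}
        for _ in range(maxit):
            q = -H @ fx                         # step q^(i): x^(i+1) = x^(i) - H_i f^(i), eq. (7.130)
            x_new = x + q
            f_new = np.asarray(f(x_new), dtype=float)
            y = f_new - fx                      # y^(i) = f^(i+1) - f^(i), eq. (7.125)
            Hy = H @ y
            H = H - np.outer(Hy - q, q @ H) / (q @ Hy)   # inverse update, eq. (7.129)
            x, fx = x_new, f_new
            if np.linalg.norm(fx) < tol:        # simple convergence test on ||f||
                break
        return x

Apart from the initial Jacobian, each iteration requires only one new evaluation of the vector function and O(n^2) arithmetic operations, which is the point made in the text.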
EXAMPLE 7.15: Solve the system of equations in Example 7.14 using Broyden's method.
Using Broyden's method, once again, the iteration does not converge directly from the
starting values in the first row of Table 7.4. However, if Davidenko's technique is applied, then
the iteration converges in two steps with \theta = 0.9 as an intermediate value. The convergence is
much slower with Broyden's method, requiring 29 iterations for \theta = 0. However, it should be
noted that each iteration of Newton's method requires the evaluation of n^2 + n = 72 different
scalar functions, while apart from the initial iteration which requires the same number of
function evaluations, the subsequent iterations in Broyden's method all require only eight
function evaluations. If this factor is taken into account, then Broyden's method may turn
out to be more efficient in terms of computational effort involved. The computed solution is

    w_1 = -0.3834746    w_2 = -0.3868727    w_3 = -0.1904289    w_4 = -0.03922379        (7.131)
    x_1 = 0.04145026    x_2 = 0.2452822     x_3 = 0.5561731     x_4 = 0.8489857

The accuracy of the computed solution is comparable to that obtained using Newton's
method.

Bibliography
Acton, F. S. (1990): Numerical Methods That Work, Mathematical Association of America.
Acton, F. S. (1995): Real Computing Made Real, Princeton University Press, Princeton, New
Jersey.
Brent, R. P. (2002): Algorithms for Minimization Without Derivatives, Dover, New York.
Broyden, C. G. (1965): A Class of Methods for Solving Nonlinear Simultaneous Equations,
Math. Comp., 19, 577.
Broyden, C. G. (1969): A New Method of Solving Nonlinear Simultaneous Equations, Comput.
J., 12, 94.
Burden, R. L. and Faires, D. (2010): Numerical Analysis, (9th ed.), Brooks/Cole Publishing
Company.
Carnahan, B., Luther, H. A. and Wilkes, J. O. (1969): Applied Numerical Methods, John
Wiley, New York.
Dahlquist, G. and Bjorck, A. (2003): Numerical Methods, Dover, New York.
Delves, L. M. and Lyness, J. N. (1967): A Numerical Method for Locating the Zeros of an
Analytic Function, Math. Comp., 21, 543.
Forsythe, G. E., Malcolm, M. A. and Moler, C. B. (1977): Computer Methods for Mathe-
matical Computations, Prentice-Hall, Englewood Cliffs, New Jersey.
Gerald, C. F. and Wheatley, P. O. (2003): Applied Numerical Analysis, (7th ed.) Addison-
Wesley.
Hamming, R. W. (1987): Numerical Methods for Scientists and Engineers, (2nd ed.), Dover,
New York.
Hildebrand, F. B. (1987): Introduction to Numerical Analysis, (2nd ed.), Dover, New York.
Householder, A. S. (1970): The Numerical Treatment of a Single Nonlinear Equation,
McGraw-Hill, New York.
Lyness, J. N. and Delves, L. M. (1967): On Numerical Contour Integration Round a Closed
Contour, Math. Comp., 21, 561.
Muller, D. E. (1956): A Method for Solving Algebraic Equations Using an Automatic Com-
puter, MTAC, 10, 208.
Olver, F. W. J. (1952): The Evaluation of Zeros of High-Degree Polynomials, Phil. Trans.
Roy. Soc., A244, 385.
Press, W. H., Teukolsky, S. A., Vetterling, W. T. and Flannery, B. P. (2007): Numerical
Recipes: The Art of Scientific Computing, (3rd ed.) Cambridge University Press, New
York.
Quarteroni, A., Sacco, R. and Saleri, F. (2010): Numerical Mathematics, (2nd ed.) Texts in
Applied Mathematics, Springer, Berlin.
Rabinowitz, P. (ed.) (1970): Numerical Methods for Nonlinear Algebraic Equations, Gordon
and Breach, London.
Ralston, A. and Rabinowitz, P. (2001): A First Course in Numerical Analysis, (2nd Ed.)
Dover.
Rice, J. R. (1992): Numerical Methods, Software, and Analysis, (2nd Ed.), Academic Press,
New York.
Rutishauser, H. (1990): Lectures on Numerical Mathematics, Birkhäuser, Boston.
Traub, J. F. (1997): Iterative Methods for the Solution of Equations, Chelsea Publishing
Company.
Wasserstrom, E. (1973): Numerical Solution by the Continuation Method, SIAM Rev., 15,
89.
Wilkinson, J. H. (1959): The Evaluation of the Zeros of Ill-conditioned Polynomials, Numer.
Math., 1, 150.
Wilkinson, J. H. (1994): Rounding Errors in Algebraic Processes, Dover, New York.
Wilkinson, J. H. (1988): The Algebraic Eigenvalue Problem, Clarendon Press, Oxford.

Exercises
1. The problem "solve x 2 - x - 2 = 0" can be reformulated in several ways for fixed-point
iteration, like

    (i) x = x^2 - 2,   (ii) x = ±\sqrt{x + 2},   (iii) x = 1 + 2/x,   (iv) x = x - (x^2 - x - 2)/a,
where a is any constant. Which of these iterations will converge to the two zeros and if it
converges, then how fast would the convergence be in each of these cases? Estimate the
number of iterations required to reduce the error by a factor of 100. In (iv) estimate the
range of a for which the iteration converges and also estimate the value of a for which
the convergence is the fastest.
2. Reformulate the following equations, so that the fixed-point iteration converges to the
real zeros.

    (i) x^3 - x + 1 = 0,   (ii) e^x = \sin x,   (iii) \ln(1 + x) = x^3.

3. Apply Aitken's \delta^2 method to accelerate the convergence of fixed-point iteration for
the previous problem. If the acceleration technique is applied systematically, then we
obtain what is usually referred to as Steffensen's method. In this method, if x_0, x_1,
x_2 are the three latest approximations generated by the fixed-point iteration, then the next
approximation x_3 is generated by applying Aitken's technique to these three approx-
imations. This cycle of three iterations can be repeated until convergence or divergence
is detected. Thus the approximations x_{3n} are generated by applying Aitken's method to
x_{3n-1}, x_{3n-2}, x_{3n-3}, while other approximations are generated using the usual fixed-
point iteration. Apply this method to equations in the previous problem and compare
the efficiency of the two methods.
4. Find the smaller root of the quadratic x 2 - 100x + 1 = 0 by (i) using the quadratic
formula, (ii) fixed-point iteration. Compare the efficiency of the two processes.
5. For the regula falsi method, show that if (i) f''(x) \ne 0 in [x_1, x_2]; and (ii) f(x_1)f''(x_1) >
0, then x_1 always remains one of the points in the regula falsi method. These conditions
are called Fourier's conditions.
6. The regula falsi method can be modified to improve the convergence rate if in (7.15),
we take \alpha = y_i/(y_i + y_{i+1}). This procedure is also referred to as the Pegasus method.
Implement this method in a computer program and test it on the equation in Example 7.2,
and compare the results with those obtained using the regula falsi method as well as the
modified regula falsi method considered in Section 7.3.
7. Consider two different iterative methods which converge to the same root from the same
starting value. Asymptotically the truncation error is given by the relation

    \epsilon_{i+1}^{(j)} \approx c_j \left(\epsilon_i^{(j)}\right)^{p_j},   (j = 1, 2),

where c_j are the asymptotic error constants and p_j the order of convergence of the two
methods. Take the logarithm of both sides and obtain a linear difference equation for
\ln|\epsilon_i^{(j)}|. Show that the solution of these difference equations is

    \ln|\epsilon_i^{(j)}| = p_j^i \left(\ln|\epsilon_0| + \frac{\ln c_j}{p_j - 1}\right) - \frac{\ln c_j}{p_j - 1},   (j = 1, 2).

Note that, since the starting values are identical for both methods, \epsilon_0 is the same. Now
if the two iterations converge to the same accuracy in I_1 and I_2 steps, then equating
\epsilon_{I_1}^{(1)} and \epsilon_{I_2}^{(2)} obtain a relation between I_1 and I_2. Neglecting the logarithmic terms in the
asymptotic error constants show that I_1 \ln p_1 \approx I_2 \ln p_2. If \theta_1 and \theta_2 are the costs per
iteration, then the total costs of computation are \theta_1 I_1 and \theta_2 I_2. Using these relations justify
the definition of computational efficiency given in (7.32).

8. Find the roots of the following equation using the Newton-Raphson method

    (x - a)^2\left(1 + \frac{x - b}{a - b}\right) + \delta(x - b)^2\left(1 + \frac{x - a}{b - a}\right) = 0.

Try starting values x = a or x = b. Use \delta = ±1, a = 1 and b = 2. Prove that, if there is
no roundoff error, then the iteration will keep oscillating indefinitely between x = a and
x = b.
9. Assuming that the operation of division is not available on your computer, write a pro-
gram to calculate the reciprocal of any number a using the Newton-Raphson method.
Test the program by computing the reciprocal of 10, 0.01, 10^4. Try different starting
values in all cases. Try secant iteration instead of Newton-Raphson method.
10. Compute \sqrt{2} and 2^{1/5} using the Newton-Raphson method and the secant iteration.
Compare their efficiencies.
11. What is the expected accuracy with which the zero at x = 1 of the following function
can be determined by using a 24-bit arithmetic:
    x^7 - 7x^6 + 21x^5 - 35x^4 + 35x^3 - 21x^2 + 7x - 1.
Assuming that we are using the Newton-Raphson method, estimate the number of itera-
tions required in attaining this limiting accuracy when the starting value is x = 2. What
happens if the iteration is continued further? Modify the Newton-Raphson method to
get faster convergence. How will you rewrite the expression to get higher accuracy?
12. What is the order of convergence of the Newton-Raphson method for the following equa-
tion? For what range of \alpha will it converge at all?

    |x - 0.1|^\alpha = 0,   (\alpha > 0).

13. Find the specified roots of the following equations using any suitable method.
(i) All real roots of x = a \sin x, (a = 2, 10, 20, 20.3959).
(ii) First 100 positive roots of x = \tan x.
(iii) First 10 positive roots of \sin(2x) + [1 - \cos(2x)][\cot(x) - 2] = 0.
(iv) All real roots of
    x^6 - 9.2x^5 + 34.45x^4 - 66.914x^3 + 70.684x^2 - 38.168x + 8.112 + 0.01234 \ln x = 0.
(v) First five positive roots of \sin(2x) + [1 + \cos(2x)][\cot(x) - 2] = 0.
(vi) First five positive zeros of \sin(x + \pi \cos x).
(vii) First five positive maxima and minima of \sin(x + \pi \cos x).
(viii) Real roots of x^{20} = a, (a = 0.2, 2, 20, 1000).
(ix) First three positive zeros of the Bessel function J_0(x).
(x) Evaluate \ln(a + \ln(a + \ln(a + ...))), (a = 1, 2, 3.14).
(xi) Zero near x = 0 of

    \frac{\sin x + \sinh x}{x} - \frac{a}{x^2}\ln\left(\frac{a + x^2}{a - x^2}\right),   a = 1, 2\pi, \sqrt{46}.

14. Find all real roots of

    (i) e^{-x^2}\,\frac{x^2 - ax + 131}{131 - x^2},   a = 21.34, 22.8911, 25.3454;

    (ii) \exp\left(-\frac{1}{0.005 + |131 - x^2|}\right)\frac{x^2 - ax + 131}{131 - x^2},   a = 21.34, 22.8911, 25.3454;

    (iii) x^{100}\,|x - 0.123|^{0.44}\,(x^2 - 2.21x + 0.42).


15. If x = \alpha is a zero of multiplicity m, then f(x_i) \approx \frac{\epsilon_i^m}{m!} f^{(m)}(\alpha). Using this expression show
that the error \epsilon_i in secant iteration is asymptotically given by

    \epsilon_{i+1} \approx \epsilon_i \epsilon_{i-1}\,\frac{\epsilon_i^{m-1} - \epsilon_{i-1}^{m-1}}{\epsilon_i^m - \epsilon_{i-1}^m}.

Assuming that the convergence is linear and that \epsilon_i \approx q^i \epsilon_0, prove that q is the real
root in the interval (0, 1) of the polynomial q^m + q^{m-1} - 1 = 0. Find the value of q for
m = 2, 3, 4, 5, 10.

16. Try to modify the secant iteration to improve its convergence for multiple zeros with
known multiplicity m. Try to multiply the increment at each step by m, as in the mod-
ified Newton-Raphson method. Apply this method to the equation in Example 7.2 and
estimate the order of convergence for the triple root.
17. Derive (7.51) and show that the order of convergence of Muller's method is given by the
largest root of q^3 - q^2 - q - 1 = 0. Similarly, show that the method based on inverse
quadratic interpolation also has the same order of convergence.
18. For a method based on the inverse interpolation using a polynomial of degree n, show
that the asymptotic form of truncation error is given by

    \epsilon_{i+1} \approx (-1)^n \frac{g^{(n+1)}(0)}{(n+1)!}\,[f'(\alpha)]^{n+1}\,\epsilon_i \epsilon_{i-1} \cdots \epsilon_{i-n},

where x = g(y) is the inverse function. Using this form show that the order of convergence
for this method is given by the positive root, with magnitude greater than unity, of the
following polynomial

    t^{n+1} - \sum_{j=0}^{n} t^{n-j} = 0.
Hence, show that this order of convergence is always less than 2. Find the order for
n = 3,4 and 5.
19. Modify the Muller's method to generate only real values, (i) by retaining only the real
part of the new approximation Xi+l at every step and (ii) by ignoring the square root if
the expression inside turns out to be negative. Test this modified method on the equation
in Example 7.2 and compare the rate of convergence with (unmodified) Muller's method.
20. For a double root, prove (7.53) and show that the Muller's method has an order of
convergence 1.23. Also show that the method based on inverse quadratic interpolation
converges linearly to a double root.
21. Show that for a multiple root with multiplicity m 2': 3, the Muller's method converges
linearly and at every step the error is asymptotically multiplied by the root, with smallest
absolute magnitude of the polynomial

22. Find all complex zeros with |x| < 10 and |y| < 10 (where z = x + iy) of e^z = z, by
(i) Treating it as one equation in complex variable.
(ii) Separating the real and imaginary parts and solving it as a system of two equations.
(iii) Eliminating x between the two equations and solving for y.
(iv) Eliminating y and solving for x.
23. Find all zeros real as well as complex of

    (i) (x - 1)^{10} + 2^n,   n = -24, -32, -48, -64, -96;

    (ii) (x - 1)(x - 2) \cdots (x - 20) + 2^n x^{14},   n = 11, 3, -13, -29.

In both cases also try the equivalent polynomial form to calculate the roots.
24. Find the approximate value of the roots near x = 13 of the following equations, by
considering it as a small perturbation of equations with known roots:

    (i) (x - 1)(x - 2) \cdots (x - 20) + 10^{-6} x^{12} = 0,

    (ii) ((x - 1)(x - 3) \cdots (x - 19))^2 + 10^{-6} x^{12} = 0.

25. Solve the following nonlinear equations:

    (i) \int_{-x}^{x} (t^3 - x)^2 \ln\left(\frac{1 + t^2}{1 + x^2}\right) dt = 12,

    (ii) \int_1^8 \int_1^8 \frac{(t - s)^2/(t^2 + s^2)}{\sqrt{1 + (t - s)^2 + (t - x)^2 + (s - x)^2}}\, dt\, ds = 1.

26. Given the number density n of a Fermion of mass m, calculate the chemical potential \psi
using

    n = \int_0^\infty \frac{t^2\, dt}{\exp\left(\sqrt{t^2 + m^2} - \psi\right) + 1},   m = 0.1, 1, 10;   n = 10^k, (k = -3, -1, 0, 1, 3).

27. Consider the iteration

    x_{i+1/2} = x_i - \frac{f(x_i)}{f'(x_i)},   x_{i+1} = x_{i+1/2} - \frac{f(x_{i+1/2})}{f'(x_i)},

which is the Newton-Raphson iteration, with the derivative computed at alternate steps.
(a) Show that if the iteration converges to the root at x = \alpha, then

    \lim_{i \to \infty} \frac{x_{i+1} - \alpha}{(x_i - \alpha)^3} = \frac{1}{2}\left(\frac{f''(\alpha)}{f'(\alpha)}\right)^2,

and that the convergence is cubic.

(b) Estimate the computational efficiency of this method. If the cost of computing f(x)
is 1 and that for f'(x) is e, for what range of e will this method be more efficient than
(i) the Newton-Raphson method and (ii) the secant method?
(c) Test this method on the equation in Example 7.2.
28. Consider an iteration function of the form

    x_{i+1} = x_i - \frac{u_i}{Q(u_i)},   u_i = \frac{f(x_i)}{f'(x_i)},

where Q(u_i) is a polynomial in u_i. If Q is linear, then derive a method of order three.
This is referred to as Halley's method. Solve the equation in Example 7.2 using this
method.
29. Derive a new iterative method from the Newton-Raphson method by replacing f'(x_i) by
its approximation found by differentiating a three-point Lagrangian interpolation formula
based on x_i, x_{i-1} and x_{i-2}. Show that this method has an order of convergence p \approx
1.839. Test this method on the equation in Example 7.2. What are the advantages and
disadvantages of this method over Muller's method?
30. Derive an iterative method by considering the Taylor series approximation to the function,
retaining terms up to the second derivative. The resulting quadratic equation can be
solved to give the iteration formula

    x_{i+1} = x_i - \frac{2 f(x_i)}{f'(x_i) \pm \sqrt{f'(x_i)^2 - 2 f(x_i) f''(x_i)}},

where the sign of the square root is chosen such that the denominator has larger magnitude.
Show that this iteration converges cubically to the root. Estimate its computational
efficiency. Under what conditions will this iteration be more efficient than the (i) Newton-
Raphson method or (ii) Muller's method?
31. Derive the following iterative method by considering the Taylor series approximation to
the inverse function x = g(y), retaining terms up to the second derivative:

    x_{i+1} = x_i - \frac{f(x_i)}{f'(x_i)} - \frac{f(x_i)^2 f''(x_i)}{2 f'(x_i)^3}.

Show that this iteration also converges cubically to the root . What are the advantages
and disadvantages of this iteration over that considered in the previous problem?
32. What is the order of convergence of the methods considered in the previous two problems
for multiple roots? Can these methods be modified to give higher order convergence for
multiple roots?
33. Consider the following iteration function:

Xi+1 = Xi _ 4 + _...:/--,[
Ii I; -
__ [-; +
1;-1 Ii
_,_1_ _
li-1
3
I[Xi,Xi-1]
] ,

where f_i = f(x_i) and f'_i = f'(x_i). Show that this iteration has an order of convergence
p \approx 2.73.
34. Consider the following iteration function, which is the finite difference analogue of Halley's
method:

    x_{i+1} = x_i - \frac{f_i}{f[x_i, x_{i-1}] - f_{i-1}\,\dfrac{f[x_i, x_{i-1}, x_{i-2}]}{f[x_i, x_{i-1}]}},

where f_i = f(x_i). Show that this iteration function has an order of convergence \approx 1.839.
35. Consider the following iteration function which is an example of multipoint iteration
function, in the sense that at each step it requires evaluation of function or the derivative
at more than one new point:
    x_{i+1} = x_i - \frac{f(x_i)}{f'\left(x_i - \dfrac{f(x_i)}{2 f'(x_i)}\right)}.

Show that this method converges cubically.


36. If the sequence of functions {f_0(x), f_1(x), ..., f_m(x)} forms a Sturm sequence, then show
that the number of roots of the function f_0(x) in the interval (a, b) is the difference
between the number of changes of sign in the sequences {f_0(a), f_1(a), ..., f_m(a)} and
{f_0(b), f_1(b), ..., f_m(b)}.
37. Locate all real roots of the following polynomials using Sturm sequences and then deter-
mine them accurately using any iterative method:
    (i) x^5 - 5x^4 + 5x^3 + 7x^2 - 14x + 6,
    (ii) 36x^6 + 36x^5 + 23x^4 - 13x^3 - 12x^2 + x + 1,
    (iii) 288x^5 - 720x^4 + 694x^3 - 321x^2 + 71x - 6,
    (iv) x^6 - 36x^5 + 450x^4 - 2400x^3 + 5400x^2 - 4320x + 720,   (Laguerre polynomial),
    (v) 231x^6 - 315x^4 + 105x^2 - 5,   (Legendre polynomial).

38. Find all roots, real as well as complex, of the following polynomials
    (i) x^7 - 5.52x^6 + 20.856x^5 - 63.9788x^4 + 111.00494x^3 - 88.004172x^2
        + 24.4494x - 2.16172
    (ii) x^{10} - 2x^9 + x^8 - 4.1x^6 + 8.1x^5 - 4.1x^4 + 4.2x^2 - 8.3x + 4.2
    (iii) 676039x^{12} - 1939938x^{10} + 2078505x^8 - 1021020x^6 + 225225x^4
        - 18018x^2 + 231   (P_{12}(x))
    (iv) x^{12} - 144x^{11} + 8712x^{10} - 290400x^9 + 5880600x^8 - 75271680x^7
        + 614718720x^6 - 3161410560x^5 + 9879408000x^4 - 17563392000x^3
        + 15807052800x^2 - 5748019200x + 479001600   (Laguerre polynomial)
    (v) 6x^{15} - 45x^{14} + 105x^{12} - 273x^{10} + 715x^8 - 1287x^6 + 1365x^4 - 691x^2 + 105
    (vi) x^{20} - 1
    (vii) 524288x^{20} - 2621440x^{18} + 5570560x^{16} - 6553600x^{14} + 4659200x^{12} - 2050048x^{10}
        + 549120x^8 - 84480x^6 + 6600x^4 - 200x^2 + 1   (Chebyshev polynomial)
    (viii) x^6 - 6.6x^5 + 18.15x^4 - 26.62x^3 + 21.9615x^2 - 9.66306x + 1.771561

Reconstruct the polynomial using the computed roots and compare it with original poly-
nomial. Multiply all coefficients of these polynomials by 1.1 and find the roots.
39. The polynomial
    P_3(x) = x^3 + 9813.18x^2 + 8571.08x + 0.781736,
has zeros -9812.306, -0.8734119 and -0.00009121577. Assume that -9812.306 has been
accepted as a zero of P_3(x), deflate P_3(x) using this value and find the roots of the result-
ing quadratic. Repeat the process by deflating the polynomial for roots at -0.8734119
and -0.00009121577.
40. Show that the polynomial deflation of the form
    a_0 + a_1 x + a_2 x^2 + ... + a_n x^n = (x - \alpha)(b_0 + b_1 x + b_2 x^2 + ... + b_{n-1} x^{n-1}),
can also be achieved by the following recurrence:
    b_0 = -\frac{a_0}{\alpha};   b_k = \frac{b_{k-1} - a_k}{\alpha},   (k = 1, 2, ..., n-1).
This process is referred to as backward deflation. How will you modify this process if
\alpha = 0? Apply backward deflation to the polynomial of the previous problem using each
of the three accepted zeros given there, and determine the remaining zeros of the deflated
polynomial.
41. Show that if the zeros of f(x) = a_0 + a_1 x + ... + a_n x^n with a_0 \ne 0 are \alpha_1, \alpha_2, ..., \alpha_n,
then the zeros of the polynomial g(x) = a_0 x^n + a_1 x^{n-1} + ... + a_{n-1} x + a_n are
\alpha_1^{-1}, \alpha_2^{-1}, ..., \alpha_n^{-1}. If the polynomial resulting from deflation of g(x) by a zero \alpha^{-1}
be d_0 x^{n-1} + d_1 x^{n-2} + ... + d_{n-2} x + d_{n-1}, then the coefficients d_i can be calculated using
the usual recurrence
    d_0 = a_0,   d_k = a_k + \alpha^{-1} d_{k-1},   (k = 1, 2, ..., n-1).
Show that these are the same coefficients that would result, if we performed backward
deflation on f(x) by dividing through by 1 - x/\alpha instead of by x - \alpha. Hence, conclude
that the reason backward deflation is stable if we deflate with the zero of the highest
magnitude is that, we are essentially performing deflation of g(x) with the zero of smallest
magnitude.
42. Find all real zeros of the following function
    f(x) = \frac{20}{y^{15}} + \frac{36}{y^{25}} + \frac{40}{y^{33}} + \frac{475}{y^{40}} - \frac{1.12(y^{40} - 1)}{x y^{40}} - \frac{6}{y^4} - \frac{3}{y^8} - 4.5,

where y = 1 + x. This function can be transformed into a polynomial equation, if it is
multiplied by x y^{40}. Discuss the advantages and disadvantages of this approach. Solve
this polynomial and compare the results with those obtained directly.
43. Test the library routine for solving nonlinear equations available at your computer centre
(or any subroutine that is available or the one you have written) on the real roots of the
following functions:
    (i) f(x) = (y + 12) \ln(1 + y^2) \sqrt{|y - 6|},   y = x - 1234.56,

    (ii) f(x) = 1.2 \times 10^9 - |x| if |x| > 10^8, and f(x) = 1 + x^4 otherwise; try starting values
         close to x = \pm 10^8,

    (iii) f(x) = e^{-y}(1 + 0.5 \sin y),   y = \min(x, 88),

    (iv) f(x) = 10 if |y| < 10^{-5}, and f(x) = y otherwise, where y = (x + 1)(x + 5)(x - 33)(x - 137),

    (v) f(x) = \frac{1}{1 + x^{21}},

    (vi) f(x) = \frac{1}{(x - \pi)^3},

    (vii) f(x) = \exp(-1/x^2),

    (viii) f(x) = \frac{|1 - e^x|}{|\ln|x + 5||}.

44. Find the eigenvalues of the n x n Hilbert matrix A defined by a_{ij} = 1/(i + j - 1), by finding
the zeros of the determinant of A - \lambda I. Try n = 4, 6 and 10.
45. To simulate the effects of roundoff errors add a term of the form \epsilon(R - 0.5) to the function,
where R is a random number between 0 and 1. Use \epsilon = 10^{-4} and try secant iteration
and Brent's method on the equations in {13}.
46. Consider the equation in Example 7.1 and apply Brent's method, secant iteration and
Newton-Raphson method to determine the zero. Compare the behaviours of these meth-
ods. Which of these is more reliable?
47. For functions with heavy roundoff error we can perform the following test to ascertain
and possibly improve the reliability of the computed zero. Let x_{k+1} and x_k be the two
final iterates from any method, then define z = x_{k+1} and dz = 3|x_{k+1} - x_k|. Check f(x)
at x = z ± dz; if either of these values is smaller than f(z), then change z to that value
and repeat the process. If neither of the values is smaller, then set dz = dz/2 and repeat
the process. Terminate the process when dz has been divided 10 times, or when dz < 3\hbar|z|,
or when the value of z has been changed too many times. In the last case, the roundoff
error is probably dominating. Apply this test to functions in the last two problems.
48. Consider the following convergence tests for any iterative method:

(v) f(Xk) = 0, (vi) If(Xk)1 < ff'

Discuss the strength and weakness of these convergence tests and give examples, where
each of these tests is unreliable or ineffective. Give an example which passes all the above
tests, even when the iteration is far from the actual zero.
49. Find all real roots of the following systems of equations:

    (i) x = \tan y,   \sin\left(\frac{2}{x + y + a}\right) = 0,   a = 1, 0.1;

    (iii) 42.25x^2 + 27.885x - 0.749y^2 - 2.54y - 2.466 = 0,
          -0.052x - 0.0192 + 0.00359y^2 + 0.00356y = 0.

50. Try to use various techniques for deflation mentioned in Section 7.15 to determine the
real roots of the equations in the previous problem. and compare their effectiveness.
51. Prove that the method of fixed-point iteration, which can be written as x_{i+1} = F(x_i),
converges to the required roots, only if all eigenvalues of the Jacobian of F(x) are within
the unit circle at the root. Try to reformulate the equations in {49} (ii), such that the
method of fixed-point iteration converges to the required roots.
52. Find real roots of the following systems of equations

    (i) 3x + 4y + e^{z+w} = 1.007,
        6x - 4y + e^{3z+w} = 11,
        x^4 - 4y^2 + 6z - 8w = 20,
        x^2 + 2y^3 + z - w = 4;

    (ii) -2x^2 - 3xy + 4\sin y + 6 = 0,
         3x^2 - 2xy^2 + 3\cos x = 8;

    (iii) \frac{\sin(xy + \pi/6) + \sqrt{x^2 y^2 + 1}}{\cos(x - y)} = -2.8,
          \frac{x e^{xy + \pi/6} - \sin(x - y)}{\sqrt{x^2 y^2 + 1}} = 1.66.

53. Consider the system of nonlinear equations f(x) = 0 and an initial approximation x =
x^{(0)}. Define

    g(x, t) = f(x) - e^{-t} f(x^{(0)}),   0 \le t < \infty,

so that g(x^{(0)}, 0) = 0 and g(x, t) \to f(x) as t \to \infty.
(a) Show that if we define x(t) to be the solution of g(x(t), t) = 0, then x(t) satisfies the
differential equation

    J(x) \frac{dx}{dt} = -f(x),   x(0) = x^{(0)},   0 \le t < \infty,

where J(x) is the Jacobian of the vector function f(x).
(b) Show that the application of Newton's method to the system of nonlinear equations
f(x) = 0 is equivalent to the solution of the above system of ordinary differential equations
by Euler's method (see Section 12.1) with step size h = 1.
54. Find the ionisation equilibrium of a mixture of hydrogen and helium using Saha's equa-
tions:

    \frac{n_2 n_e}{n_1} = a T^{1.5} \exp(-I_1/T),   I_1 = 157828,   a = 2.413 \times 10^{15},

    \frac{n_4 n_e}{n_3} = 4a T^{1.5} \exp(-I_2/T),   I_2 = 285482,

    \frac{n_5 n_e}{n_4} = a T^{1.5} \exp(-I_3/T),   I_3 = 631310,

    n_e = n_2 + n_4 + 2n_5,   n_H = n_1 + n_2,   n_{He} = n_3 + n_4 + n_5.

Determine n_1, n_2, n_3, n_4 and n_5, when n_{He} = 0.1 n_H and (T, n_H) = (10^3, 10^{10}),
(10^3, 10^{25}), (8 \times 10^3, 10^{10}), (2 \times 10^4, 10^{15}), (10^5, 10^{10}), (5 \times 10^5, 10^{10}), (5 \times 10^5, 10^{25}). Here
n_1, n_2, n_3, n_4, n_5 and n_e are respectively the abundances per cm^3 of H^0, H^+, He^0, He^+,
He^{++} and the electrons, and T is the temperature in K. The equations may be solved
simultaneously, or by eliminating some of the variables before solving the equations. It
is possible to solve for all other variables in terms of n_e, which can be determined by
iteration. Try both these approaches.
55. The nonlinear system given below arises from a model of the combustion of propane
in air. Here S = \sum_{i=1}^{10} x_i, and R = 4.056734 is a physical parameter. It is a difficult
problem, sensitive to the choice of initial guesses, because we may get negative values for
the argument of square roots.

    x_1 + x_4 - 3 = 0,
    2x_1 + x_2 + x_4 + x_7 + x_8 + x_9 + 2x_{10} - 10 - R = 0,
    2x_2 + 2x_5 + x_6 + x_7 - 8 = 0,
    2x_3 + x_5 - 4R = 0,
    x_1 x_5 - 0.193 x_2 x_4 = 0,
    x_6 \sqrt{x_2} - 0.002597 \sqrt{x_2 x_4 S} = 0,
    x_7 \sqrt{x_4} - 0.003448 \sqrt{x_1 x_4 S} = 0,
    x_8 x_4 - 1.799 \times 10^{-5} x_2 S = 0,
    x_9 x_4 - 2.155 \times 10^{-4} x_1 \sqrt{x_3 S} = 0,
    x_{10} x_4^2 - 3.846 \times 10^{-5} x_4^2 S = 0.

(i) Solve the system as it is written here.
(ii) Replace x_1 through x_4 by their squares (that is, x_{i,new}^2 = x_{i,old}) to force these vari-
ables to be positive and try the solution. Discuss the change in performance.
(iii) Put absolute values inside the square roots and try the solution. Compare the per-
formance with earlier approaches.
(iv) Rewrite the equations to eliminate the square roots and compare the performance
with earlier approaches.

56. Consider the system


    -2x^2 - 3xy + 4\sin y = -6,   3x^2 - 2xy^2 + 3\cos x = 8.

Assume that a tentative solution x = 1.690 and y = 0.253 has been obtained. Apply
backward error analysis to estimate how satisfactory this solution is. This analysis can
be achieved by substituting the solution into the equation and perturbing the various
coefficients to satisfy the equations exactly. From this result try to estimate the actual
error in the result.
57. Consider the following system of equations, which has two close roots

x 2 _ xy +y2 =~
4
Try to find the roots using any of the standard iterative methods, using deflation if
necessary. Try the strategy mentioned in Section 7.15 to determine the multiple roots
and compare the effectiveness of both the approaches. In the second approach, replace one
of the equations by the Jacobian of the system and determine the roots. The required
roots will be close to this value and can be found by considering small perturbations
about it.
58. Try to locate the roots of system of nonlinear equations in {49} using the method similar
to one described in Section 7.7. Also try this method on the system of equations in the
previous exercise.
59. Estimate the condition number for the system of equations in Example 7.14, by lin-
earising the equations about the computed solution. Find the condition number for the
resulting system of linear equations. Also substitute the calculated solution in the system
of equations and calculate the residuals. From these residuals and condition numbers
estimate the roundoff error in the computed solution. Use the computed solution which
gives the weights and abscissas of the Gaussian formula to evaluate the integrals I_1 and I_2 in
{6.2} and compare the results with the exact values. Try to improve the accuracy by
splitting the range into two parts and evaluating the second part using the Gauss-Legendre
formula to estimate the maximum accuracy that can be achieved using the computed
weights and abscissas. Also try to find the weights and abscissas for the 8-point Gaussian
formula and compare the results with exact values calculated in {6.17}. How does the
accuracy deteriorate with the number of abscissas?
60. Solve the system of equations in Example 7.14 using the damped Newton's method or the
Levenberg-Marquardt algorithm and compare its performance with Newton's method.
Chapter 8

Optimisation

In this chapter, we consider methods for minimising or maximising a function


of several variables, that is, finding those values of the coordinates for which the
function takes on the minimum or the maximum value. Throughout the chapter
we only consider the problem of finding the minimum, since the maximum can
be easily found by noting that

    \max f(x) = -\min(-f(x)),                    (8.1)

that is, the maximum of f(x) is the negative of the minimum of - f(x). The
point at which the function takes its minimum value is called the minimiser.
Minimisation methods are often misused by computer users, because a variety
of problems can be formulated in terms of optimisation, even though that may
not be the best method of solving the problems. For example, we can take a
set of nonlinear equations and produce a positive function by taking the sum
of squares of left-hand sides (the right-hand sides being zero). The solution of
this set of equations will coincide with the minimiser of this function. However,
as mentioned in the previous chapter, this technique should be attempted only
as a last resort, when all other methods for solving the system of nonlinear
equations have failed. Unfortunately, it is very often the first thing that users
try. The problem here is that, the amount of computations required using this
approach could be an order of magnitude larger than what would be required by
a method for solution of a system of nonlinear equations. Further, as we will see
in the next section, the accuracy of the computed minimiser is in general, much
lower than those of the calculated roots. Further, there is really no practical
method for finding the global minimiser of a function of several variables. Here
by global minimiser we mean the point, where the function assumes its lowest
value out of the entire range of coordinate values that is allowed.
Optimisation problems arise in almost every field, where numerical in-
formation is processed (Science, Engineering, Mathematics, Economics, Com-
merce, etc.). In Science, optimisation problems arise in data fitting, in varia-
tional principles, and in the solution of differential and integral equations by

expansion methods. Engineering applications are in design problems, which usu-


ally have constraints in the sense that variables cannot take arbitrary values.
For example, while designing a bridge an engineer will be interested in min-
imising the cost, while maintaining certain minimum strength for the structure.
Even the strength of materials used will have a finite range depending on what
is available in the market. Such problems with constraints are more difficult
to handle than the simple unconstrained optimisation problems, which very
often arise in scientific work. In most problems, we assume the variables to be
continuously varying, but some problems require the variables to take discrete
values. For example, in combinatorial optimisation problems the object is
to find a permutation which minimises a certain function.
While most formulation of optimisation problems require the global min-
imum to be found, most of the methods that we are going to describe here
will only find a local minimum. The function has a local minimum at a point
where it assumes the lowest value in a small neighbourhood of the point, which
is not at the boundary of that neighbourhood. To find a global minimum we
normally try a heuristic approach, where several local minima are found by
repeated trials with different starting values or using different techniques. The
smallest of all known local minima is then assumed to be the global minimum.
This procedure is obviously unreliable, since it is impossible to ensure that all
local minima have been found. There is always the possibility that at some un-
known local minimum, the function assumes even smaller value. The technique
of simulated annealing is more likely to give the global minimum and may be used
if the function has a large number of local minima. Further, there is no way of
verifying that the point so obtained is indeed a global minimum, unless the
value of the function at the global minimum is known independently. On the
other hand, if a point is claimed to be the solution of a system of nonlinear
equations, then it can in principle be verified by substituting in the equations to
check whether all the equations are satisfied or not. Of course, in practice, the
roundoff error introduces some uncertainty, but that can be overcome. Because
of these reasons, minimisation techniques are inherently unreliable and should
be avoided if the problem can be reformulated to avoid optimisation. However,
there are problems for which no alternative solution method is known and we
have to use these techniques.
Not much can be said about the existence and uniqueness of either the
global or local minimum of a function of several variables. It is possible that no
minimum of either type exists, when the function is not bounded from below
(e.g., f(x) = x). Even if the function is bounded from below, the minimum
may not exist (e.g., f(x) = e^{-x}). Even if a minimum exists it may not be
unique; for example, f(x) = sin x has an infinite number of both local and
global minima. Further, an infinite number of local minima may exist, even when
there is no global minimum (e.g., f(x) = x + 2 sin x). If the function or its
derivative is not continuous, then the situation could be even more complicated.
For example, f(x) = \sqrt{x} has a global minimum at x = 0, which is not a local
minimum (i.e., f'(x) \ne 0)!

Because of the enormous variety of geometries that are possible, it is rather


difficult to find efficient methods for obtaining local minima. Given sufficient
time many simple methods can produce a local minimum, but such methods are
rarely recommended because of their inefficiency. It is possible to use the stan-
dard technique of approximating the given function by a simple function e.g., a
polynomial and then finding the minimum of this approximating function. Of
course, it is not possible to use a linear model for approximating the function,
since that does not have any minimum. Hence, simplest models to be used
for approximation are quadratic models. As usual we first consider the simpler
problem of finding minimum of a function of one variable. In Section 8.1, we con-
sider the golden section search method, which is an always convergent method
similar to the method of bisection for finding roots of one nonlinear equation. In
Section 8.2, we consider Brent's method, which combines the golden section
search with an iterative method based on parabolic interpolation, similar in
spirit to Brent's method for finding zeros. Both these methods do not require
the calculation of derivatives. If the first derivative of the function can be easily
calculated, then it is possible to use this additional information to improve the
efficiency using the method described in Section 8.3. The general principles and
some simple methods for finding a minimum of a function of several variables
are outlined in Section 8.4, while in Section 8.5 we consider the quasi-Newton
methods, which require the calculation of derivatives. If the derivatives are too
difficult to calculate, then the direction set methods described in Section 8.6
could be considered.
All these methods deal with the problem of unconstrained optimisation,
where no restriction is imposed on the values of the parameters to be found by
minimisation. In many engineering and economic applications, there are con-
straints which have to be satisfied as arbitrary values of the parameters are not
allowed. In such cases, the global minimum often occurs at the boundary of the
allowed region and is not a local minimum. In Section 8.7, we describe the sim-
plex method for solving linear optimisation problems with linear constraints.
The more general case of nonlinear problems with constraints is beyond the
scope of this book. The last section describes the method of simulated anneal-
ing, which if used properly has high probability of finding the global minimum
of the function.

8.1 Golden Section Search


In the previous chapter, we have seen that a zero of a nonlinear function can be
bracketed between two values, where the function value has opposite signs. To
bracket a minimum we need more information. For example, if the first deriva-
tive of the function has opposite signs at the two end points, then the function
will have a stationary point inside the interval, provided it is continuous over
the interval. This point could be either a minimiser or a maximiser. (It may be
noted that the first derivative does not change sign at a point of inflection.) If
this point is known to be a minimiser, then we can apply the simple method of

Figure 8.1: Bracketing a minimum

bisection to determine the minimiser more accurately. Of course, here we have


assumed that the first derivative of the function can be calculated at every
point.
If it is not possible to calculate the derivative easily, then we need a third
point inside the interval to detect if the function has a minimum or not. For
example, if we have three points a < b < c, such that f(b) < min(f(a), f(c))
and if the function is continuous in the interval (a, c), then it can be easily
seen that, there should be at least one minimum inside the interval (a, c). In
contrast to the location of roots by looking for sign changes, in this case, there
could be two minima in the interval also (Figure 8.1). Once a minimum has
been bracketed, we can locate it more accurately by subdividing the interval as
in the method of bisection. If d is any point in the interval (b, c), then we can
evaluate f (d) and the following three possibilities arise:

1. f(d) < f(b), in which case, the interval (b, c) will bracket a minimum.
2. f(d) > f(b), in which case, the interval (a, d) will bracket a minimum.
3. f(d) = f(b), which is unlikely to happen in practice, but if it occurs, then
as can be seen from Figure 8.1, both the intervals (a, d) and (b, c) will
contain a minimum. If there is only one minimum in the original interval,
then the minimum can be pinned down in the interval (b, d). But if the
original interval had more than one minimum, then the two intervals may
bracket different minima.

Hence, in all cases, we get a smaller interval containing a minimum. This process
can be continued until the interval has been reduced to the required level.
For finding zeros of a function, it was found that the optimum strategy
is to choose the new point to be the midpoint of the original interval. This
choice will not be optimum for a minimiser, since either of the intervals (a, d)
or (b, c) could be selected. Hence, the point d should be selected such that the
larger of the two intervals should be as small as possible. This objective can
be realised if the two intervals are equal, that is, when the points b and d are
symmetrically distributed with respect to the interval (a, c). Obviously, the

new point must be selected in the larger of the two segments b - a and c - b.
Hence, we have
    \frac{b - a}{c - a} = \frac{c - d}{c - a} = \tau,   \frac{c - b}{c - a} = 1 - \tau.                    (8.2)

If we choose the points such that the two segments, e.g., b - a and c - b, have a
constant ratio, then this ratio can be estimated by

    \tau = \frac{d - b}{c - b} = \frac{1 - 2\tau}{1 - \tau},                    (8.3)

which gives the quadratic \tau^2 - 3\tau + 1 = 0, yielding

    \tau = \frac{3 - \sqrt{5}}{2} = 0.381966.                    (8.4)

Hence, the ratio of the two segments

    \frac{c - b}{b - a} = \frac{1 - \tau}{\tau} = \frac{1 + \sqrt{5}}{2} \approx 1.618034.                    (8.5)
This is the so-called golden ratio, which also occurs in the Fibonacci numbers.
The golden ratio gives the optimal method for subdividing the interval and the
corresponding algorithm is called the golden section search.
Thus, in this algorithm, at each stage the new point is selected to be a
fraction 7 as measured from the central point and into the larger of the two
subintervals. Even if we start out with a triplet whose segments are not in the
golden ratio, this procedure will ultimately lead to intervals with the golden
ratio. Once the bracketing triplet has achieved the golden ratio, at each step
the bracketing interval reduces by a factor of 1 - 7 >:::: 0.618034. Hence, this
method converges linearly to the minimiser. Further, this ratio is somewhat
larger than the factor of 1/2 in the method of bisection. Hence, the convergence
of golden section search will be slower than that of bisection. Once we are
close to the minimum, we can switch to some iterative method which converges
superlinearly.
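The subdivision step just described can be summarised in a few lines. The following Python sketch is purely illustrative: it is not the subroutine GOLDEN of Appendix B, the convergence test and the iteration limit are assumptions, and it assumes that the caller supplies a bracketing triplet (a, b, c) with b between a and c and f(b) < min(f(a), f(c)).

    def golden(f, a, b, c, reps=1.0e-6, maxit=100):
        """Golden section search for a minimum bracketed by the triplet (a, b, c),
        where b lies between a and c and f(b) < min(f(a), f(c))."""
        tau = 0.381966                       # (3 - sqrt(5))/2, eq. (8.4)
        fb = f(b)
        for _ in range(maxit):
            # place the new point a fraction tau into the larger of the two segments
            if abs(c - b) > abs(b - a):
                x = b + tau * (c - b)
            else:
                x = b + tau * (a - b)
            fx = f(x)
            if fx < fb:                      # x becomes the new central point
                if (x - b) * (c - b) > 0:    # x lies on the c side of b
                    a, b, fb = b, x, fx
                else:                        # x lies on the a side of b
                    c, b, fb = b, x, fx
            else:                            # x becomes the new end point on its side
                if (x - b) * (c - b) > 0:
                    c = x
                else:
                    a = x
            if abs(c - a) <= reps * (abs(b) + reps):
                break
        return b, fb

At every step the bracketing interval shrinks by the factor 1 - tau of (8.5), so the convergence is linear, as noted above.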
We can continue dividing the interval as many times as we like, but as
in the method of bisection, beyond a certain stage it will not improve the
accuracy in any way. In fact, it turns out that in general, the roundoff error
poses a more serious problem for finding a minimum than for finding a zero,
because near a minimum the function is rather flat and the function value may
not change significantly over a rather long interval. Let us assume that x = \alpha
is the minimiser, then in the neighbourhood of the minimiser the function can
be approximated by

    f(x) \approx f(\alpha) + \frac{1}{2}(x - \alpha)^2 f''(\alpha).                    (8.6)

Assuming that f(\alpha) \ne 0, the roundoff error in evaluating the function in the
neighbourhood of the minimum could be written as \hbar K|f(\alpha)|, where K is some

constant. If the function can be evaluated without excessive roundoff error, then
K is of the order of unity. Hence, for

    |x - \alpha| < \sqrt{\frac{2\hbar K|f(\alpha)|}{f''(\alpha)}} = \delta,                    (8.7)

the function value will be essentially equal to the minimum because of roundoff
error. Thus, any point in this range could be taken as the minimiser. This region
could be considered as the domain of indeterminacy for the minimiser. It may
be noted that the size of this domain is of the order of \sqrt{\hbar}\,|\alpha| and the accuracy
will be severely limited. Thus, using 8-digit arithmetic it may be possible to
determine the minimiser to only four significant figures, which is similar to the
situation for double zeros. Hence, it is much more difficult to find an accurate
value of a minimiser, than to find a zero of the same function. This is another
reason why the problem of finding roots of a system of nonlinear equations
should not be reduced to an optimisation problem.
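This loss of accuracy is easy to observe numerically. The following small Python experiment, constructed only for this discussion with an arbitrarily chosen test function, evaluates f(x) = 1 + (x - 1)^2 in single precision and measures the width of the interval over which the computed value cannot be distinguished from the minimum value; the width comes out to be of the order of \sqrt{\hbar}, as suggested by (8.7).

    import numpy as np

    # f(x) = 1 + (x - 1)^2 has its minimum value 1 at x = 1.
    def f(x):
        x = np.float32(x)
        return np.float32(1.0) + (x - np.float32(1.0))**2

    hbar = np.finfo(np.float32).eps              # machine accuracy for 24-bit arithmetic
    xs = 1.0 + np.linspace(-1e-3, 1e-3, 20001)
    flat = [x for x in xs if f(x) == np.float32(1.0)]
    print("width of flat region:", max(flat) - min(flat))   # about 5e-4
    print("sqrt(machine accuracy):", np.sqrt(hbar))          # about 3.5e-4

Any point in this flat region is indistinguishable from the minimiser, which is why only about half the significant figures can be obtained.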
The flatness of the function near the minimum can also be used to check
for roundoff error in the calculation. Thus, if at any stage we find that the
computed value of the function comes out to be the same at all the three points
of the bracketing triplet, then we may assume that the iteration has reached
the roundoff limit. Of course, the function values may come out to be equal by
chance, even when the bracket is very large. In that case, the bracketing interval
must enclose at least one minimum and one maximum. The probability of such
coincidences is rather small and further in most cases, this situation can be
usually detected by looking at the size of the bounding interval. Subroutine
GOLDEN in Appendix B employs this technique to detect the roundoff errors
and the iteration is terminated with an error flag, if all the three function values
are equal at any stage. This is not a foolproof method for checking roundoff
errors, in the sense that, even if the roundoff errors are dominating, the function
values may not come out to be equal. This is very likely to happen when the
function value is zero or close to zero near the minimum. Adding an arbitrary
constant to the function may avoid such situations for simple functions, where
the roundoff error is not too large. Ideally we should check if the difference
in the function values is smaller than the expected roundoff error. But that
requires an estimate of the roundoff error, which may not be easily available.
In the above discussion we have assumed that f(\alpha) \ne 0; if the function vanishes
at the minimum, then f(\alpha) in (8.7) can be replaced by a typical term in the
function definition. If all terms in the function definition also tend to zero at
the minimiser, then very high accuracy can be achieved.
Further, because of roundoff error the function value may show some fluc-
tuations which can give rise to spurious minima, because the calculated value of
the function may not be monotonic on either side of the minimum. To overcome
this difficulty, we should avoid evaluating the function at two points within a
distance of \delta (as given by (8.7)) from each other. Since the value of the second
derivative is not in general known, it may not be possible to estimate \delta and for

practical purposes we can take \delta \approx |\alpha|\sqrt{\hbar K}. Hence, the golden section search
should be terminated when the interval is of the order of 3\delta, since continuing
further will involve calculating function values at two close points.
If the first derivative of the function can be calculated using an analytic
expression, then the roundoff error in evaluating the derivative may be com-
paratively low. At a simple minimum the derivative will have a simple zero and
it may be possible to calculate this zero to a much higher accuracy, using any
of the methods considered in the previous chapter. However, extra information
will be required to distinguish between a minimum and a maximum. Thus, if a
root of f'(x) is located between x = a and x = b as shown by the sign change
and if f'(a)(b - a) < 0, then the root corresponds to a minimum, provided
the interval contains only one root. Since in general, it may not be possible to
ensure that the given interval contains only one root, this criterion may be mis-
leading. Hence, this condition should be checked only after the root has been
isolated or determined to the required accuracy. This approach could be rather
accurate and efficient, if the derivative can be calculated directly. However, if
the derivative is calculated using differences, then the roundoff error will be
much higher and no purpose will be served by using the derivatives to locate
the minimum.
Before applying the method of golden section search, it is essential to
bracket a minimum. A crude but reliable method to locate minimum is to
plot the function. In general, bracketing a minimum is slightly more involved
than bracketing a zero by looking for sign changes. As in the case of locating
zeros, there is no foolproof technique for bracketing a minimum. To be able to
bracket a minimum, it is essential to have some idea about the typical scale
of the function. The basic idea here is to start with two points a and b, such
that f(b) < f(a) and with separation of the order of the typical length scale of
the function. Then we have to find a third point c to complete the bracketing.
A trial value x could be selected and if that does not meet the requirement
f(x) > f(b), then we can keep continuing further in the same direction, until
the function value starts increasing. At every stage, we can increase the search
step by some appropriate ratio. Further, it should be noted that if f(x) < f(b),
then we can replace our old values of a and b by b and x, respectively to get a
smaller bracketing interval. We can improve upon this simple method by using
parabolic interpolation, to predict the new trial value x. However, far from the
minimum interpolation may give arbitrary results. Hence, the predicted value
should be tried only if it is in some reasonable range.
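A possible form of such a bracketing search is sketched below in Python. It is illustrative only: it is not the subroutine BRACKM, it omits the parabolic extrapolation just mentioned, and the growth factor and the limit on the number of trials are assumptions made for this sketch.

    def brackm(f, a, b, maxtry=50):
        """Attempt to bracket a minimum starting from the two points a and b.
        Returns a triplet (a, b, c) with f(b) < min(f(a), f(c)), or None if the
        search gives up without finding a bracket."""
        fa, fb = f(a), f(b)
        if fb > fa:                          # ensure that the step from a to b is downhill
            a, b, fa, fb = b, a, fb, fa
        step = b - a
        for _ in range(maxtry):
            step *= 1.618034                 # grow the search step at every stage
            c = b + step
            fc = f(c)
            if fc > fb:                      # the function has started increasing
                return a, b, c
            a, b, fa, fb = b, c, fb, fc      # otherwise shift the pair and continue
        return None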
It may happen that the function has a maximum immediately following
the minimum and thereafter it keeps decreasing indefinitely. In such cases, if
this bracketing process jumps over the maximum at some stage, then there is
no hope of bracketing the minimum. For example, the subroutine BRACKM
which implements this algorithm fails to locate any of the minima of function
f(x) = x + 2sin(x), when started with a = 0 and b = 1. However, if the
function is unimodal, that is it has only one minimum and no maximum, then
the function value will ultimately start increasing and the minimum will be

bracketed. Even if the function has a global minimum, it is possible that the
initial slope as determined by the points a and b is pointing in the wrong
direction and no minimum may be located by the above technique. For example,
f(x) = (x^2 - 1)e^{-x} has a global minimum at x = 1 - \sqrt{2}, but if we start the
search with a = 4 and b = 5, our algorithm will fail to bracket the minimum.
Similarly, for f(x) = (x^2 - 0.01)e^{-10x} if the search is started with a = -2,
b = -1 the subroutine just jumps over the global minimum at x \approx -0.0414.
Hence, before applying this algorithm we should have some idea about the
approximate position of the minimum, as well as the typical scale over which
the function changes significantly. For a unimodal function which has exactly
one minimum and no maximum, this algorithm is guaranteed to work, but in
practice, we rarely come across unimodal functions. It will be best to plot out
the function to check the approximate position of the minimum before trying
other techniques to locate it more accurately. An alternative strategy is to scan
the function at a suitable spacing and look for three successive points which
satisfy the bracketing requirement.
EXAMPLE 8.1: Find the minimum of the following functions

    f_1(x) = (x^2 - 0.01)e^{-10x},
    f_2(x) = x^{10} - 10x^9 + 45x^8 - 120x^7 + 210x^6 - 252x^5 + 210x^4 - 120x^3 + 45x^2 - 10x + 1.
                                                                                        (8.8)

These functions have a unique minimum at x = 0.1(1 - \sqrt{2}) and x = 1, respectively.
The function f_2(x) is unimodal and there is no difficulty in locating the minimum using any
of the methods, while f_1(x) has a maximum at x = 0.1(1 + \sqrt{2}) beyond which the function is
monotonically decreasing. This maximum interferes with the minimum and unless the initial
values and the step size are carefully chosen, the subroutine BRACKM fails to detect the
minimum. For example, starting with a = -2 and b = -1, the subroutine jumps over both the
minimum and the maximum into the monotonically decreasing region x > 0.1(1 + \sqrt{2}). In this
region, the value of x keeps increasing in the hope of finding a minimum, but the function
value just keeps decreasing slowly without reaching any minimum. However, if a = 0 and
b = 1 are used to start BRACKM, then it detects the minimum.
Once the minimum is bracketed, we can use the golden section search to find accurate
value of the minimum. With the initial bracketing triplet (1, 0, -1.618) for f_1(x) and using 24-bit
arithmetic, after 26 steps the function value is equal at all the three points and the subroutine
GOLDEN terminates with an error flag, which gives an estimated error of \approx 10^{-5} from the
difference a - c at the last step. The computed value of the minimiser is -0.0414211. For f_2(x),
which has a minimum value of zero at x = 1, the roundoff criterion is never satisfied and the
subroutine goes on doing unnecessary divisions until the specified criterion is satisfied. The
minimiser is estimated to be around 1.00545. It may be noted that unlike subroutine BISECT,
this subroutine will not go into a nonterminating loop, even if there is no limit on the number
of iterations and an arbitrarily low tolerance is specified. In this case, after sufficient number
of subdivisions the points a, b, c are the three consecutive numbers that can be represented
in the machine. After this stage, the computed value of new point x is equal to b, and one
of the end points will be replaced by it. Hence, after this step two of the three bracketing
points are identical. Further, at the next step all the three points will become identical and
the bracketing interval will be reduced to zero. It can be seen from the subroutine GOLDEN
that if either (or both) of the IF statements are changed from FP.LT.FX to FP.LE.FX, then
the iteration will get into a nonterminating loop as the bounding interval will not reduce to
zero.
In this case, there is a substantial error in the minimiser of the order of 0.005, which
is much larger than the bracketing interval. This function is similar to the one considered in
Example 7.1 and we expect heavy roundoff error. If the function is modified by dropping the

last constant term, then the modified function will have a minimum value of -1 at x = 1.
In this case, after seven iterations, the function values come out to be equal at all three
points and the iteration is terminated giving a good estimate of the roundoff error. The final
bracket in this case is (0.853,0.926). Thus, in this case, although we get a reliable estimate
of roundoff error, but the error itself is substantially larger than that in the previous case.
A more reliable estimate will be provided if we add a small constant of the order of \epsilon/\hbar
to f_2(x), where \epsilon is the estimated roundoff error in evaluating the function. As argued in
Example 7.1, the roundoff error in evaluating this function is \approx 252 \times 2^{-64} \approx 10^{-17}. If the
calculations are repeated with f_2(x) + 10^{-9}, the iteration terminates with the final bracket of
(0.982,1.010), which gives a reasonable estimate of the error. The unexpectedly high accuracy
in this case is because the intermediate quantities are not rounded while calculating f(x) and
the calculations are effectively done using a higher precision.

8.2 Brent's Method


Following the principle explained in Section 7.6, Brent (2002) has also given
an algorithm for finding minimum of a function of one variable. This method
combines the golden section search with parabolic interpolation in such a way,
that the iteration is guaranteed to converge to the specified accuracy in a finite
number of iterations, provided of course, that the roundoff error is neglected. In
the presence of roundoff error, it may be meaningless to continue the iteration
beyond a certain stage, which unfortunately cannot be detected by this method.
At any stage, if we have three distinct approximations to the minimiser
Xi, Xi-l, Xi-2, then we can perform parabolic interpolation to approximate the
function by

    f(x) \approx f(x_i) + (x - x_i) f[x_i, x_{i-1}] + (x - x_i)(x - x_{i-1}) f[x_i, x_{i-1}, x_{i-2}].        (8.9)

To find the minimum of this parabola, we can find the point where the first
derivative vanishes, which gives the new approximation Xi+l as

    x_{i+1} = \frac{x_i + x_{i-1}}{2} - \frac{f[x_i, x_{i-1}]}{2 f[x_i, x_{i-1}, x_{i-2}]}
            = x_i - \frac{(x_i - x_{i-2})^2 (f(x_i) - f(x_{i-1})) - (x_i - x_{i-1})^2 (f(x_i) - f(x_{i-2}))}{2[(x_i - x_{i-2})(f(x_i) - f(x_{i-1})) - (x_i - x_{i-1})(f(x_i) - f(x_{i-2}))]}.
                                                                                        (8.10)
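For reference, the parabolic step (8.10) can be written directly in terms of the divided differences of (8.9). The short Python sketch below is only an illustration of this single step; it is not Brent's full safeguarded algorithm, which adds the checks described in the rest of this section, and the function name is an assumption.

    def parabolic_step(x2, x1, x0, f2, f1, f0):
        """One step of (8.10): x2, x1, x0 are x_i, x_{i-1}, x_{i-2} and f2, f1, f0
        the corresponding function values.  Returns None when the divided
        difference f[x_i, x_{i-1}, x_{i-2}] vanishes (collinear points)."""
        d1 = (f2 - f1) / (x2 - x1)                      # f[x_i, x_{i-1}]
        d2 = (d1 - (f1 - f0) / (x1 - x0)) / (x2 - x0)   # f[x_i, x_{i-1}, x_{i-2}]
        if d2 == 0.0:
            return None
        # a positive d2 suggests a minimum, a negative one a maximum (see the text below)
        return 0.5 * (x2 + x1) - 0.5 * d1 / d2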
It should be noted that the new approximation could correspond to either a
maximum or a minimum of the interpolating parabola. Hence, if the iteration
is carried out without any additional checks, then it may converge either to
a minimum or a maximum. In order to ensure that the iteration converges
to a minimum, we can check the sign of J[Xi' Xi-l, Xi-2] which approximates
the second derivative of !. If this divided difference is positive, then the point
should be a minimiser, while if it is negative, then the point is expected to be
a maximiser. Finally, if the divided difference vanishes, then the three points
are collinear. Hence, if we are interested in finding a minimum and at any
stage if the divided difference comes out to be negative or zero, then we will
have to adopt some alternative strategy for selecting the new point. Of course,

once we are sufficiently close to a minimum, then this divided difference will
automatically come out to be positive, but far from the minimum, we cannot
be sure of the sign. In the presence of roundoff errors, this divided difference
may come out to be zero or negative when we are very close to the minimum. In
fact, in most cases, the computed value of the divided difference vanishes, when
the iteration is too close to a minimum or a maximum. If the minimum is not
simple, in the sense that the second derivative also vanishes at the minimiser,
then the sign of the divided difference will have little relevance to the nature
of the stationary point. Consequently, it is difficult to use a straightforward
iterative method for finding a minimum. However, the iterative methods have
the advantage, that the iteration will normally not converge to an accuracy
much better than what is permitted by the roundoff error, thus giving a good
estimate of the roundoff error. On the other hand, the golden section search or
Brent's method described below give little indication of the roundoff error.
Following an analysis similar to that in the previous chapter, it can be
shown {1} that this method has an order of convergence of 1.325. Hence, if α
is the true minimiser, then asymptotically

$$|x_{i+1} - \alpha| \approx c\,|x_i - \alpha|^{1.325}, \tag{8.11}$$

where c is the asymptotic error constant. This convergence is much slower than
that of most iterative methods for finding zeros. But it should be noted that
if the derivative is estimated using differences and the secant method is used to
find a zero of the derivative, then since two function evaluations are required at
every step, the computational efficiency of the iteration will be only √1.618 ≈
1.27. Hence, the parabolic interpolation method is more efficient. Further, the
procedure based on estimating derivatives using differences has large roundoff
error.
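As an illustration of the bare iteration (8.10), the following Python sketch performs successive parabolic interpolation and stops when the computed second divided difference is no longer positive; the test function, starting points and tolerance are assumptions of this illustration, and this is not the BRENTM routine of Appendix B.

```python
def parabolic_min(f, x0, x1, x2, tol=1e-8, maxit=50):
    """Successive parabolic interpolation for a minimum, following (8.9)-(8.10).

    Stops when consecutive iterates agree to within tol, or when the second
    divided difference is not positive (possible maximum or heavy roundoff)."""
    xs = [x0, x1, x2]
    for _ in range(maxit):
        xi, xim1, xim2 = xs[-1], xs[-2], xs[-3]
        d1 = (f(xi) - f(xim1)) / (xi - xim1)        # f[x_i, x_{i-1}]
        d2 = (f(xim1) - f(xim2)) / (xim1 - xim2)    # f[x_{i-1}, x_{i-2}]
        dd = (d1 - d2) / (xi - xim2)                # f[x_i, x_{i-1}, x_{i-2}]
        if dd <= 0.0:
            # approximating parabola has no minimum; give up at this point
            return xi
        xnew = 0.5 * (xi + xim1) - 0.5 * d1 / dd    # equation (8.10)
        if abs(xnew - xi) < tol:
            return xnew
        xs.append(xnew)
    return xs[-1]

# illustrative use on a simple function with a single minimum
print(parabolic_min(lambda x: (x - 1.0)**2 + 0.1 * x**4, 0.0, 0.5, 2.0))
```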
To take care of the problems associated with the iterative procedure and to
ensure the convergence of iteration, Brent (2002) has proposed an algorithm,
which is similar to the one described in Section 7.6 for finding zeros. Before
applying this method it is essential that the minimum is bracketed in some
interval. Further, at every stage we have at least three distinct points which
bracket a minimum. A new approximation can be obtained using parabolic in-
terpolation. However, this value is accepted only if it is within the bracketing
interval and is not too close to the previous approximations. Otherwise the
interval is subdivided using the golden section search method to yield a smaller
interval. In this method, there is no need to check if the parabolic interpolation
is tending to a minimum or a maximum, since the result will be accepted only
if it lies within the interval known to contain a minimum.
At a typical step in this algorithm, there are six points a, b, u, v, w and
x, not all distinct. The positions of these points change after every step of the
algorithm, but for simplicity we will omit the subscripts. Initially (a, b) is the
interval containing the minimiser and v = w = x is a point inside the interval,
such that f(x) < min(f(a), f(b)), which ensures that a minimum is bracketed
within the interval. At every step (a, b) brackets the minimum, while x is the
point in (a, b), at which the function has the minimum value among all the
points where the function has been evaluated so far. If there is more than
one point with the same minimum value, then x is the most recent of these,
while w is the point with next lowest value of the function and v is the previous
value of w; u is the last point at which the function is evaluated (undefined at
the first step).
When the iteration converges, x is returned as the final approximation
to the minimiser and the truncation error should be less than three times the
specified tolerance, provided the associated domain of indeterminacy is smaller
than the specified tolerance. If m = (a + b)/2 is the midpoint of the interval
known to contain the minimum and |x − m| ≤ 2ε − (b − a)/2, i.e., max(x −
a, b − x) ≤ 2ε, then the iteration is terminated. Here the tolerance ε could be
specified in terms of a relative or an absolute criterion, or a combination of the two
as described in Section 7.13. It may be noted that the subroutine BRENTM in
Appendix B actually sets ε to be half of the specified criterion. Hence, the final
result should generally be accurate to the specified tolerance.
If the iteration has not converged, then the new point may be obtained by
performing parabolic interpolation based on the three points x, w and v. The
result can be expressed in the form x + p/q, where

$$p = \pm\left[(x - v)^2\,(f(x) - f(w)) - (x - w)^2\,(f(x) - f(v))\right], \qquad
q = \mp 2\left[(x - v)(f(x) - f(w)) - (x - w)(f(x) - f(v))\right]. \tag{8.12}$$

Let e be the value of p/q at the second-last iteration; then, as in the algorithm
for finding zeros, this new point is accepted only if q ≠ 0, x + p/q ∈ (a, b), |e| > ε
and |p/q| < ½|e|. The first two conditions ensure that the new point is within
the bracketing interval, while the third condition ensures that the bracketing
interval is reduced, even if the iteration has apparently converged. The last
condition eliminates the case of slow convergence. If any of these conditions
is violated, then instead of parabolic interpolation we use the simple golden
section search to get the new point

$$u = \begin{cases} \dfrac{\sqrt{5}-1}{2}\,x + \dfrac{3-\sqrt{5}}{2}\,a, & \text{if } x \ge m;\\[6pt]
\dfrac{\sqrt{5}-1}{2}\,x + \dfrac{3-\sqrt{5}}{2}\,b, & \text{if } x < m. \end{cases} \tag{8.13}$$
Further, to avoid evaluating the function at two points closer together than ε, it
is ensured that |u − x|, b − u and u − a are all greater than ε. If these conditions
are not satisfied, then the point u may be shifted slightly. While these conditions
ensure that u and x are not too close, they will not necessarily ensure that u is not
too close to w, which may be used in further interpolations. Further, it should
be noted that the golden section search may not really reduce the interval by
the usual ratio, since the triplet a, x, b may not have subintervals in the golden
ratio. But if two consecutive iterations perform golden section search, then the
bounding interval must reduce by a factor of at least 1.618.
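The following Python sketch outlines the logic just described. It is a condensed illustration, not the BRENTM subroutine of Appendix B, and it simplifies the acceptance tests slightly (in particular the comparison with the step taken two iterations earlier); the tolerance handling is an assumption of the sketch.

```python
import math

def brent_minimise(f, a, b, tol=1e-8, maxit=100):
    """Sketch of a Brent-type minimiser: a parabolic step through x, w, v is
    accepted only if it stays inside (a, b) and is smaller than half the step
    taken two iterations earlier; otherwise a golden section step (8.13) is used."""
    gr = 0.5 * (3.0 - math.sqrt(5.0))            # (3 - sqrt 5)/2, about 0.382
    x = w = v = a + gr * (b - a)
    fx = fw = fv = f(x)
    d = e = b - a
    for _ in range(maxit):
        m = 0.5 * (a + b)
        eps = tol * abs(x) + 1e-12               # combined relative/absolute tolerance
        if max(x - a, b - x) <= 2.0 * eps:       # termination test of this section
            return x, fx
        r = (x - w) * (fx - fv)
        q = (x - v) * (fx - fw)
        p = (x - v) * q - (x - w) * r            # p, q as in (8.12)
        q = 2.0 * (q - r)
        if q > 0.0:
            p = -p
        q = abs(q)
        e_old, e = e, d
        if q != 0.0 and a < x + p / q < b and abs(p / q) < 0.5 * abs(e_old):
            d = p / q                            # parabolic interpolation accepted
        else:
            e = (a - x) if x >= m else (b - x)
            d = gr * e                           # golden section step
        u = x + (d if abs(d) >= eps else math.copysign(eps, d))
        fu = f(u)
        if fu <= fx:                             # u becomes the new best point
            if u < x: b = x
            else:     a = x
            v, w, x = w, x, u
            fv, fw, fx = fw, fx, fu
        else:                                    # x remains the best point
            if u < x: a = u
            else:     b = u
            if fu <= fw or w == x:
                v, w, fv, fw = w, u, fw, fu
            elif fu <= fv or v == x or v == w:
                v, fv = u, fu
    return x, fx

# illustrative use: the minimum of cos(x) on (2, 4) is at x = pi
print(brent_minimise(math.cos, 2.0, 4.0))
```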
EXAMPLE 8.2: Minimise the functions in Example 8.1 using Brent's method.
Using 24-bit arithmetic with the same starting triplet as in Example 8.1 it can be seen
that convergence is somewhat faster as compared to that using the golden section search. If
a relative tolerance of 10^-6 is requested, the iteration keeps continuing aimlessly after some
stage. Instead, if the tolerance is raised to 10^-3, then the iteration converges in 16 and 10
iterations, respectively, for the two functions. Thus, to use Brent's method effectively,
it is necessary to give a reasonable convergence criterion. It is possible to include some simple
tests for detecting the roundoff error as discussed in the previous section. However, such tests
may not work for all problems.

Even if there is no true minimum but only a discontinuity in the function
value, this algorithm will converge to the discontinuity. This may be justified,
since the function will have a minimum value at that point, even if it is not
continuous. Thus, this point can be regarded as a minimiser, even though the
first derivative may not vanish there. Of course, such minima cannot be located
by iterative methods. However, Brent's method will never give any idea about
the roundoff error. For simple functions the technique of comparing the function
values at the three bracketing points mentioned in the previous section may give
a reasonable estimate of the roundoff error. On the other hand, the problem
with straightforward iterative methods is that, without any extra checks they
may converge to a maximum or a point of inflection rather than a minimum.
Further, since the function is rather flat near all stationary points, it may not
be very easy to determine the nature of this stationary point, by looking at the
function values in the immediate neighbourhood.

8.3 Methods Using Derivative


In the preceding sections, we considered optimisation techniques which require
only the function value. If the first derivative can also be computed easily, then
we can use the extra information to get faster convergence. If the function is
continuous in the neighbourhood of the minimum, then the first derivative van-
ishes at the minimiser. Hence, we can use methods described in the previous
chapter to find zeros of the derivative. For example, we can use the Brent's
method or the secant iteration to find zeros of f'(x). However, as mentioned
earlier we will be faced with the problem of distinguishing between a minimum,
a maximum and a point of inflection. Assuming that this identification is pos-
sible, such methods will be more efficient and accurate than those considered
earlier.
But in this approach the information present in the function values is
essentially discarded. Hence, it may be more efficient to develop a method
which incorporates both the function value and its derivative. In this section,
we describe one such method due to Davidon, which uses the Hermite cubic
interpolation (Section 4.3) to approximate the function. If at any stage we have
two approximations x_i and x_{i-1} to the minimiser, then using the function value
and its derivative at these two points, we can approximate the function by a
cubic given by (4.52). Equating the first derivative of this cubic to zero will give
us the next approximation to the minimiser. After some manipulations, we get

$$x_{i+1} = x_i - \frac{f_i'\,(x_i - x_{i-1})}{2f_i' + f_{i-1}' - 3f[x_i, x_{i-1}] \pm \sqrt{\left(3f[x_i, x_{i-1}] - f_i' - f_{i-1}'\right)^2 - f_i'\,f_{i-1}'}}, \tag{8.14}$$
where f_k' = f'(x_k). As usual, we select the sign which gives the larger value for the
denominator. Hence, we select the root of the quadratic which is nearest to x_i.
It may happen that the interpolating cubic has no minimum or maximum, in
which case, the expression under the square root sign comes out to be negative.
In that case, we can either try some other strategy to proceed with the iteration
or abandon the iteration and start it again at some other more promising point.
This problem is likely to arise when we are far from the minimum. Even if the
iteration converges, we still have to find out if it is converging to a minimum
or a maximum. This can be done by estimating the second derivative of the
function, which can be approximated by that of the cubic:

f" : : : f'(Xi) - f'(xi-d . (8.15)


Xi - Xi - 1

Hence, by looking at the sign of this quantity, we can distinguish between a
minimum and a maximum. Thus, if f''(x_i) > 0, we are converging towards a
minimum. If the function has some point of inflection, it may be difficult to
detect that {6}, unless this estimate is printed out at every iteration.
If the cubic does not have any extrema, then we can approximate the
given function by a quadratic based on the function value at the two points
and the derivative at X = Xi. This approximating quadratic can be obtained by
dropping the last term in (4.52). The quadratic always has an extremum, unless
the quadratic term vanishes. Hence, we may be able to proceed further with
the iteration, but this estimate is not likely to be reliable. It can be easily seen
that in this case, the next approximation is given by

$$x_{i+1} = x_i - \frac{f'(x_i)\,(x_i - x_{i-1})}{2\left(f'(x_i) - f[x_i, x_{i-1}]\right)}, \tag{8.16}$$

and the iteration will fail if the denominator in this equation vanishes.
This iterative method can also be combined with bisection or the golden
section search, to get an always convergent method. In this case, since the
derivatives are also available, the minimum can be bracketed by two points,
where the derivative has opposite sign. For functions which are not continuous
near the minimum, the iterative method is of no use and it will be faster to use
bisection if there is a sign change in the derivative across the minimum, or to
use the golden section search otherwise.
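A minimal Python sketch of one such step is given below; it applies (8.14), falls back on the quadratic formula (8.16) when the expression under the square root is negative, and uses the estimate (8.15) to classify the stationary point. The test function and the number of iterations are assumptions of this illustration, not a routine from the book.

```python
import math

def cubic_step(x0, x1, f0, f1, g0, g1):
    """One step of minimisation by cubic Hermite interpolation, following (8.14);
    g0, g1 are the derivatives f'(x0), f'(x1)."""
    d = (f1 - f0) / (x1 - x0)                    # divided difference f[x1, x0]
    disc = (3.0 * d - g1 - g0) ** 2 - g1 * g0    # expression under the root in (8.14)
    if disc >= 0.0:
        root = math.sqrt(disc)
        den = 2.0 * g1 + g0 - 3.0 * d
        # take the sign giving the larger |denominator|, i.e. the root nearest x1
        den = den + root if abs(den + root) > abs(den - root) else den - root
        if den != 0.0:
            return x1 - g1 * (x1 - x0) / den
    den = 2.0 * (g1 - d)                         # quadratic fallback (8.16)
    if den == 0.0:
        raise ArithmeticError("quadratic term vanishes; iteration fails")
    return x1 - g1 * (x1 - x0) / den

# illustrative use on f(x) = x**4 - 2*x**2 + x (two minima and one maximum)
f = lambda x: x**4 - 2.0 * x**2 + x
g = lambda x: 4.0 * x**3 - 4.0 * x + 1.0
x0, x1 = 1.2, 0.8
for _ in range(6):
    x_new = cubic_step(x0, x1, f(x0), f(x1), g(x0), g(x1))
    if abs(x_new - x1) < 1e-12:
        break
    x0, x1 = x1, x_new
print(x1, (g(x1) - g(x0)) / (x1 - x0))   # second value > 0: a minimum, cf. (8.15)
```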
It can be shown {2} that this iterative method converges quadratically to
a minimiser, which is much faster than the parabolic interpolation considered
in the previous section. However, this method requires the calculation of the
first derivatives also and the computational efficiency is E_I = 2^{1/(1+α)}, where
α is the relative cost of computing the derivative. It can be seen that this
method is more efficient than the parabolic interpolation, provided α < 1.46.
Hence, unless derivative calculation is much more expensive as compared to
the function value itself, this method will be more efficient than the parabolic
interpolation. It may be noted that for computing the zeros, the situation is
quite different. In that case, the methods requiring first derivative are more
efficient than methods based on function values alone, only if the derivatives
can be computed very easily. However, for computing a minimum the deriva-
tives should be ignored only if it is really difficult to evaluate them. Many
functions that we come across in practice are rather difficult to differentiate
and it may not be possible to use the methods based on derivatives. Further, if
this procedure is required for minimising a function of several variables, then
all components of derivatives need to be computed, which could require much
more effort than calculating the function value alone. Further, as in the previ-
ous chapter, this rate of convergence is realised only if the minimum is simple,
that is the second derivative is nonzero at the minimiser. If the minimum cor-
responds to a multiple root of f'(x) = 0, then the convergence is linear and the
golden section search method will be more efficient. In practice, one irritation
with methods using derivatives is that, often users make errors in coding the
derivative and consequently the program gets confused.
Since in this method we are essentially finding zeros of f'(x), the accuracy
of the computed minimiser depends on the domain of indeterminacy associated
with the corresponding zero of f'(x). If this zero is simple, then a relative accu-
racy of Ii is possible, provided the roundoff error in computing the derivative is
not exceptionally large. Hence, in general this method will permit much more
accurate results than the methods based on function values alone, which cannot
give a relative accuracy of better than √ℏ.
Subroutine DAVIDM in Appendix B provides an implementation of this
method. This subroutine requires two starting values, which need not bracket
the minimum and no attempt is made to keep a bracket on the minimum.
Hence, the process may not converge and even if it converges, it may not be
to the nearest minimum. Further, no distinction is made between a minimum
and a maximum, but if the iteration converges, then depending on the sign
of the computed second derivative an error flag may be set if the point is a
maximiser rather than a minimiser. Hence, to use this subroutine effectively,
the minimum or maximum must be approximately located beforehand and
appropriate starting values should be supplied. The minimum can be located
either by looking for sign changes in the first derivative, or by the bracketing
algorithm considered in Section 8.1.
EXAMPLE 8.3: Minimise the functions in Example 8.1 using the method based on cubic
Hermite interpolation.
The calculations are done using the subroutine DAVIDM with 24-bit arithmetic. For
the first function, the convergence is much faster when the starting values are close to the required
minimiser. However, in some cases, the iteration fails to converge, while in some others it
converges to a maximum. But in all cases, the subroutine was able to distinguish between a
minimum and a maximum by looking at the sign of the estimated second derivative. Using the
starting values 1, -1.618 it converges to the minimum in about 20 iterations, while starting
with 1, 0 it converges to the maximum in 6 iterations. On the other hand, starting with 0, 1 it
starts diverging. Hence, before using this subroutine it should be ensured that we are close to
a minimum. In all cases when the iteration converges the result is more accurate as compared
to methods based on function value alone. In this case, both the minimum and maximum
are correct to seven significant figures, which is the best that can be achieved using 24-bit
arithmetic.
For the second function, which has a zero of multiplicity 10 at the minimiser, the convergence of
iteration is very slow. For this function it will be more efficient to use the golden section
search or the Brent's algorithm, which combines the golden section search with parabolic
interpolation. In this case, the second derivative also vanishes at the minimiser and it may
not be possible for the subroutine to distinguish between a maximum and a minimum. But
by some coincidence in both cases, the subroutine correctly finds the point to be a minimum.

8.4 Minimisation in Several Dimensions


In the last few sections, we considered methods for minimising a function of
one variable. However, in practice most optimisation problems are multivariate.
A function of n variables will have a local minimum at a point where all the
n components of the gradient are zero. Hence, if derivatives can be explicitly
evaluated, this problem can be reduced to that of solving the corresponding
system of nonlinear equations leading to the quasi-Newton methods. However,
the gradient vector also vanishes at a maximum or a saddle point. To distinguish
between these three types of stationary points will require some additional
efforts.
For a function of several variables the Taylor series about a point
(a_1, ..., a_n) can be expressed in the form

$$f(x_1, \ldots, x_n) = f(a_1, \ldots, a_n) + \sum_{i=1}^{n} (x_i - a_i)\,\frac{\partial f}{\partial x_i} + \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} (x_i - a_i)(x_j - a_j)\,\frac{\partial^2 f}{\partial x_i\,\partial x_j} + \cdots, \tag{8.17}$$

where the derivatives are assumed to be evaluated at the point (a_1, ..., a_n).
This series can be written in the vector form

$$f(x) = f(\alpha) + (x - \alpha)^T \nabla f(\alpha) + \frac{1}{2}(x - \alpha)^T \nabla^2 f(\alpha)\,(x - \alpha) + \cdots \tag{8.18}$$

Here ∇f is the gradient vector and ∇²f = G(x) is the matrix of second deriva-
tives. This matrix, which is known as the Hessian matrix, is symmetric, because
of the symmetry of the partial derivatives. If we consider points along a line
passing through the point α along a given direction s, then these points can be
written as x = α + ts. Along this line the function can be considered to depend
on only one variable, i.e., t, and the Taylor series can be written as

$$f(\alpha + t s) = f(\alpha) + t\, s^T \nabla f(\alpha) + \frac{t^2}{2}\, s^T G(\alpha)\, s + \cdots \tag{8.19}$$
Hence, along this line we can identify the derivatives

and I" (0) = ST GS. (8.20)

For the function to have a local minimum at a point α, the derivative f'(0)
must vanish along all possible directions s, which implies that ∇f(α) = 0. Any
point where this condition is satisfied is referred to as a stationary point. In
addition, at a local minimum we also require that the function value must be
minimum at this point along all possible directions. This condition is satisfied
if 1"(0) > 0, for all possible vectors s, which gives

$$s^T G\, s > 0, \qquad \forall\, s \neq 0. \tag{8.21}$$

Thus, if at any point ∇f = 0 and the Hessian matrix is positive definite, then
the point is a minimum. This is a sufficient condition for the existence of a
minimum, and is almost but not quite necessary. It can be easily shown that
the following conditions are necessary for a local minimum:

$$\nabla f = 0 \qquad \text{and} \qquad s^T G\, s \ge 0, \quad \forall\, s. \tag{8.22}$$

On the other hand, if the Hessian matrix is negative definite, then the corre-
sponding point is a local maximum. Finally, if the Hessian matrix is indefinite,
then there are some directions along which f''(0) is positive and the function
has a minimum along these directions, while there are other directions along
which f''(0) is negative and the function has a maximum. Such a stationary
point, where the function could be a minimum or a maximum depending on the
direction of the path, is called a saddle point. If the Hessian matrix is semidefi-
nite, then the nature of the corresponding stationary point can be decided by
considering the higher derivatives. If we consider the hypersurfaces with con-
stant f, then these will be ellipsoids about a maximum or a minimum, while
about a saddle point these surfaces will be hyperboloids.
In practice, the positive definiteness of a matrix can be checked by any of
the following properties:
1. All eigenvalues of G are positive.
2. The Cholesky decomposition of the form G = LLᵀ (Section 3.3, {3.25})
exists with l_ii > 0 for i = 1, ..., n.
3. Triangular decomposition of the form G = LDLᵀ exists with l_ii = 1 and
d_ii > 0 for i = 1, ..., n.
4. All pivots in Gaussian elimination without pivoting are positive.
Here, the matrices Land D are lower triangular and diagonal, respectively. In
practice, it is simplest to use the last three conditions and in general, the con-
ditions (2) and (3) are the most efficient and they also enable linear equations
with coefficient matrix G to be solved subsequently. In practice, even if the
Hessian matrix is available, it may be difficult to check the positive definiteness
because of roundoff error. In most cases, it is too expensive to compute the


n(n + 1)/2 independent components of this matrix and we would like to approxi-
mate G numerically. This approximation may introduce further errors, making
it more difficult to ascertain the nature of this matrix at a stationary point.
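For moderate-sized problems the Cholesky test (condition 2 above) is easily applied in practice. The short sketch below, which assumes NumPy is available, uses it; the matrices shown are arbitrary examples chosen for the illustration.

```python
import numpy as np

def is_positive_definite(G):
    """Test positive definiteness via the Cholesky decomposition G = L L^T
    (condition 2 above): the factorisation fails unless G is positive definite."""
    try:
        np.linalg.cholesky(G)
        return True
    except np.linalg.LinAlgError:
        return False

# illustrative symmetric matrices (assumed for this sketch)
print(is_positive_definite(np.array([[4.0, 1.0], [1.0, 3.0]])))   # True
print(is_positive_definite(np.array([[1.0, 2.0], [2.0, 1.0]])))   # False: indefinite
```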
Many minimisation methods are based on only the first condition, i.e.,
V f = O. The nature of the stationary point is assured to be a minimum, because
of some additional features of these methods. For example, if the function value
is decreasing at every step, we cannot end up in a maximum. Eliminating the
possibility of convergence to a saddle point may be more difficult.
Many early methods for minimisation were based on essentially ad hoc
ideas without much theoretical basis. These methods may work when the num-
ber of variables is not too large. For most of these methods, the amount of
effort required to obtain a minimum goes up rapidly (typically as 2^n) with
the number of variables. On the other hand, the better methods require O(n^2)
or O(n^3) arithmetic operations, thus making it feasible to solve problems with
large number of variables. However, before attempting to solve such problems,
it is worthwhile to examine whether we really need so many variables. In most
cases, minimisation problems arise out of an attempt to fit some theoretical
model to experimentally obtained data. In such cases, many times it happens
that some of the variables are superfluous, in the sense that either they are
not really independent of the other variables or they do not affect the fit to
any significant level. Such variables should be identified and eliminated before
attempting the fit.
One of the most successful of these ad hoc methods, which merely com-
pares function values at different points, is known as the simplex method (Nelder
and Mead, 1965). This method should not be confused with the more well-
known simplex method of linear programming (cf., Section 8.7), although its
name is derived from the same geometrical concept. Simplex is a geometrical
figure obtained from a set of n + 1 distinct points in n-dimensional space, by
interconnecting all line segments, polygonal faces, and so on. In two dimensions
simplex is a triangle, while in three dimensions it is a tetrahedron. In general,
we are only interested in simplexes that are nondegenerate, that is those which
enclose a finite volume in n-dimensional space. In this method at any stage, we
have a simplex with n + 1 vertices, where the function value is evaluated. The
basic iteration consists of changing this simplex to reduce the function values.
Several transformations are possible, the simplest is the one where the vertex
at which the function value is the largest is reflected in the hyperplane formed
by the other n vertices. The function value at the new vertex is then evaluated
and the process can be repeated. However, it may turn out that the function
value is maximum at the new vertex, in which case, if it is reflected we will
get back to the old simplex and no progress can be made. In such cases, either
another point is selected for reflection or the simplex is extended or contracted
in some direction. The iteration proceeds in such a manner that the volume
of the simplex keeps decreasing once appropriate conditions are satisfied and
ultimately the volume may shrink to the required accuracy and the process
may be terminated. At this stage, the vertex with lowest value of the function
will hopefully give a good approximation to the minimiser.
It is quite clear from the description that the simplex method does not
make full use of the information available in the function values, since the
values at different vertices are only compared with each other. As a result, we
cannot expect it to be as efficient as the methods which make better use of the
available information. This method is not as efficient as the method described in
Section 8.6 which also uses only function values. However, the simplex method
is very robust and may be useful for minimising functions with large errors or
functions with discontinuities in the function value or derivatives. Further, it
is simpler to implement and does not require any additional procedure for the
minimisation of the function along a given line, which is required by most other
methods.
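The reflection step at the heart of the method is easily sketched. The following Python illustration is not the full Nelder and Mead algorithm (which also includes expansion and more elaborate contraction rules); it simply reflects the worst vertex in the centroid of the others and shrinks the simplex towards the best vertex when the reflection fails, and the test function is an assumption of the sketch.

```python
import numpy as np

def simplex_step(f, vertices):
    """One reflection step of a simplex-type iteration: the worst vertex is
    reflected in the centroid of the remaining vertices; if that does not help,
    the simplex is contracted towards the best vertex instead."""
    vertices = sorted(vertices, key=f)            # best first, worst last
    best, worst = vertices[0], vertices[-1]
    centroid = np.mean(vertices[:-1], axis=0)     # centroid of the other n vertices
    reflected = centroid + (centroid - worst)     # mirror image of the worst vertex
    if f(reflected) < f(worst):
        vertices[-1] = reflected                  # accept the reflection
    else:
        vertices = [best] + [0.5 * (v + best) for v in vertices[1:]]  # contract
    return vertices

# illustrative use: a few steps on f(x1, x2) = x1**2 + x2**2
f = lambda v: v[0]**2 + v[1]**2
simplex = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
for _ in range(20):
    simplex = simplex_step(f, simplex)
print(min(simplex, key=f))    # best vertex approaches the minimiser (0, 0)
```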
Most methods for minimisation are based on the concept of line search.
Here the user is required to supply an initial estimate x^(1) and the basic struc-
ture of the kth iteration is
1. Determine a direction of search s^(k)

2. Find α^(k) to minimise f(x^(k) + α s^(k)) with respect to α

3. Set x^(k+1) = x^(k) + α^(k) s^(k)
Different methods correspond to different ways of choosing s^(k) in step (1).
Step (2) is the line search subproblem and can be carried out by using the
methods described in Sections 8.1-3. In practice, of course, this step cannot be
accomplished exactly in the sense that the computed minimum can only be an
approximation to the actual minimum along that direction. For most methods,
it is not crucial to find the exact minimum in step (2). This freedom can be
used to improve the efficiency of these methods, since most of the time is spent
in line searches.
The simplest method of this type is the method of steepest descent, where we
choose s^(k) = -∇f(x^(k)), that is, the line search is carried out in the direction
opposite to that of the gradient vector. It can be seen that, this is the direction
along which the function decreases most rapidly, which justifies the name of this
method. This method looks attractive, but in practice it exhibits an oscillatory
behaviour. Although there is a theoretical proof of convergence, the method
usually terminates far from the minimiser, due to roundoff errors. Hence, in
practice this method is both unreliable and inefficient. The cause of the difficulty
can be understood if we consider a two-dimensional problem, where the contours
of constant f are long and narrow (Figure 8.2). In this case, at most points the
gradient which is perpendicular to the contours points the quickest way to the
valley floor, but not to the centre. Indeed, the centre is poorly defined along the
valley, since the function declines very slowly along the valley, as we approach
the centre. The zigzag line in the figure shows a possible path for the method of
steepest descent. It can be seen that the first step takes it close to the valley floor
and thereafter the iteration keeps oscillating about the valley floor, proceeding
8.5. Quasi- Newton Methods 363

Figure 8.2: Method of steepest descent

very slowly towards the centre. Ideally we would like to advance directly towards
the centre rather than into the valley.
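The behaviour in Figure 8.2 is easy to reproduce numerically. The sketch below applies steepest descent with an exact line search to an illustrative ill-conditioned quadratic f(x) = ½xᵀAx; this particular A and the starting point are assumptions of the illustration, not the functions of the examples. For this choice the error shrinks only by the factor 99/101 per iteration, while the second component keeps changing sign, which is exactly the zigzag path sketched in the figure.

```python
import numpy as np

# illustrative quadratic with elongated elliptical contours (axis ratio 10)
A = np.diag([1.0, 100.0])

def steepest_descent(x, nsteps):
    for _ in range(nsteps):
        g = A @ x                          # gradient of the quadratic
        if g @ g == 0.0:
            break
        alpha = (g @ g) / (g @ (A @ g))    # exact minimum along the line x - alpha*g
        x = x - alpha * g
    return x

x = np.array([1.0, 0.01])
print(steepest_descent(x, 10))    # error reduced only by (99/101)**10
print(steepest_descent(x, 100))   # still well away from the minimum at the origin
```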
In order to be able to locate the centre, we need more information and
most of the successful methods for minimisation are based on a quadratic model,
where the function is approximated by a quadratic. The reason for choosing a
quadratic function are obvious, since that is the simplest smooth function with
a well determined minimum. Further, close to a minimum most functions can be
approximated well by a quadratic. Hence, methods based on quadratic models
should ultimately converge very fast. Finally, even away from the minimum a
quadratic can be expected to give a better approximation than a linear function.
Another simple method requiring line searches is the alternating variables
method, in which during the kth iteration (k = 1, 2, ..., n) the variable x_k alone is
changed in an attempt to reduce the function. After n iterations, the whole
cycle is repeated again. Unfortunately, in practice this method is very inefficient
and unreliable. Powell (1973) has constructed a function for which the method
fails to converge to a stationary point. However, if the successive directions are
properly selected, the method can prove to be effective and in fact, that is the
basic idea behind the direction set methods discussed in Section 8.6.

8.5 Quasi-Newton Methods


The simplest method based on a quadratic model is the Newton's method (Sec-
tion 7.16) for determining the zeros of the gradient vector. If we can calculate
the second derivatives of the function, then we can approximate the function
using the first three terms of the Taylor series (8.18). If the Hessian matrix is
positive definite, then there is a unique minimum of the quadratic which can
be found by equating the gradient vector to zero, giving

$$x - \alpha = -G^{-1} \nabla f. \tag{8.23}$$

In practice, of course, we do not calculate the inverse of the Hessian matrix, but
rather solve the resulting system of linear equations. Thus, at the kth step if
we have an approximation x^(k) to a minimiser, the next approximation x^(k+1)
is obtained by solving the following system of linear equations

$$G^{(k)}\left(x^{(k+1)} - x^{(k)}\right) = -\nabla f(x^{(k)}), \tag{8.24}$$

where G^(k) is the Hessian matrix evaluated at the point x = x^(k). It can be
shown that the iteration converges quadratically to the minimum, provided the
starting value is sufficiently close.
The main drawback of Newton's method is that, far from a minimum,
the matrix G^(k) need not be positive definite and the approximating quadratic
may not have a minimum. Furthermore, even if G^(k) is positive definite, the
iteration may not converge. In fact, f(x^(k)) may not even decrease. The latter
possibility is eliminated in Newton's method with line search, where the
Newton correction is only used to generate a direction of line search

$$s^{(k)} = -\left(G^{(k)}\right)^{-1} \nabla f(x^{(k)}), \tag{8.25}$$

which is then used in a line search algorithm to get the next approximation.
However, if G^(k) is not positive definite, then the direction s^(k), as given by (8.25),
may not be a descent direction, i.e., the function value may not be decreasing
in that direction. In most such cases, we can perform a line search along -s^(k),
where the function will be decreasing. But if the point x^(k) happens to be a
saddle point of f(x), then s^(k) = 0 and it is not possible to continue further.
Even if the point is not a saddle point, it is possible that the function may not
have a minimum along the direction specified by (8.25). For example, consider
the following problem due to Powell:

$$f(x) = x_1^4 + x_1 x_2 + (1 + x_2)^2, \qquad x^{(1)} = (0, 0)^T. \tag{8.26}$$

It can be easily verified that

$$G^{(1)} = \begin{pmatrix} 0 & 1 \\ 1 & 2 \end{pmatrix}, \qquad s^{(1)} = \begin{pmatrix} -2 \\ 0 \end{pmatrix}. \tag{8.27}$$

It can be seen that a search along ±s^(1) gives the minimum at x_1 = 0 and the
algorithm fails to make any progress. It may be noted that, this function has a
well determined minimum and there is no difficulty in reducing the function, for
instance by a search along the steepest descent direction. The difficulty arises
because s^(1)T ∇f(x^(1)) = 0 and the directions ±s^(1) are not downhill. It can
be easily seen that this problem will not arise if the matrix G^(1) is positive
definite.
This example clearly illustrates that some modification is required to make
Newton's method applicable to general problems. One possibility is to use
the method of steepest descent whenever G^(k) is not positive definite. However,
if this modification operates for a number of iterations, then the convergence is
likely to be very slow. An alternative is to modify the matrix G^(k) to make it
positive definite, which can be most conveniently achieved by adding a multiple


of the unit matrix I to G^(k). In this case, the search direction can be obtained by
solving the system of equations

$$\left(G^{(k)} + \nu I\right) s^{(k)} = -\nabla f(x^{(k)}). \tag{8.28}$$

Here ν should be larger than the negative of the lowest eigenvalue of G^(k).
There are other modifications to the basic Newton's method to overcome this
problem and readers can refer to Fletcher (2000) for more details.
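A minimal sketch of this modification is given below (assuming NumPy is available; the safety margin added to ν is an arbitrary choice of the illustration). It is applied to the data of Powell's example (8.26)-(8.27), where the gradient (0, 2)ᵀ at the origin follows from G⁽¹⁾s⁽¹⁾ = -∇f with the quoted G⁽¹⁾ and s⁽¹⁾.

```python
import numpy as np

def modified_newton_direction(G, grad, margin=0.1):
    """Search direction from (8.28): if G is not safely positive definite, add
    nu*I with nu just above the negative of its lowest eigenvalue."""
    lam_min = np.linalg.eigvalsh(G).min()
    nu = 0.0 if lam_min > margin else margin - lam_min
    return np.linalg.solve(G + nu * np.eye(len(grad)), -np.asarray(grad))

# Hessian and gradient at the origin for Powell's example
G = np.array([[0.0, 1.0], [1.0, 2.0]])
grad = np.array([0.0, 2.0])
s = modified_newton_direction(G, grad)
print(s, s @ grad)     # s @ grad < 0, so the modified direction is downhill
```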
As in the case of solution of a system of nonlinear equations, the main
disadvantage of Newton's method, even when modified to ensure global conver-
gence, is the need to calculate all the n(n + 1)/2 second derivatives. Hence, as in
Section 7.17, it may be better to approximate the Hessian matrix using the value
of the function and the gradient vector. The simplest technique is to use a finite
difference approximation, which may not yield a symmetric matrix. Hence, the
approximate matrix G can be symmetrised by taking ½(G + Gᵀ) to replace
G^(k) in Newton's method. Once again this matrix may not be positive def-
inite, thus requiring modifications as discussed earlier. Further, evaluating the
finite difference approximation requires n + 1 evaluations of the gradient vector,
which could be very expensive.
The above disadvantages can be avoided if some updating procedure simi-
lar to that in the Broyden's method (Section 7.17) can be given for the Hessian
matrix or its inverse, such that the matrix is assured to be positive definite.
The initial matrix H^(1) can be any positive definite matrix, which is generally
taken to be the unit matrix. As in Broyden's method for solution of nonlin-
ear equations, it will be advantageous if instead of the Hessian matrix we can
directly update the inverse H^(k) = (G^(k))^{-1}. Such methods are referred to as
the quasi-Newton methods and they enjoy the following advantages over the
Newton's method:
1. Only first derivatives are required.
2. The matrix H^(k) is assured to be positive definite, which ensures that the
search direction given by (8.25) is a descent direction.
3. Each iteration requires only O(n^2) floating-point operations, as opposed
to O(n^3) for Newton's method.
To obtain the updating formulae we define the differences

$$\delta^{(k)} = x^{(k+1)} - x^{(k)}, \qquad \gamma^{(k)} = \nabla f(x^{(k+1)}) - \nabla f(x^{(k)}). \tag{8.29}$$

If higher order terms in the Taylor series are neglected, then

$$\gamma^{(k)} \approx G^{(k+1)} \delta^{(k)}. \tag{8.30}$$

Since γ^(k) and δ^(k) can only be calculated after the line search is completed,
the matrix H^(k) (which is supposed to approximate the inverse of G^(k)) does
not usually satisfy the above requirement. Thus, we can choose the updating
procedure such that the new matrix H^(k+1) satisfies

$$H^{(k+1)} \gamma^{(k)} = \delta^{(k)}, \tag{8.31}$$

which is referred to as the quasi-Newton condition. This condition does not


determine H^(k+1) uniquely and there are various possibilities. In addition, we
would also like to ensure that the matrix H^(k+1) is positive definite.
One such updating formula due to Davidon, Fletcher and Powell known
as the DFP formula can be written as

$$H^{(k+1)}_{DFP} = H + \frac{\delta\,\delta^T}{\delta^T \gamma} - \frac{H \gamma\,\gamma^T H}{\gamma^T H \gamma}, \tag{8.32}$$

where the superscripts (k) are dropped from the right-hand side. It can be
shown that for quadratic functions this method terminates in at most n itera-
tions with H^(n+1) = G^{-1}, while for a general function it preserves the positive
definiteness of H^(k). This method has been found to work well in practice and
is much more efficient than the method of steepest descent and somewhat more
efficient than the conjugate gradient methods. However, this method is found
to be rather sensitive to the accuracy of line search used at every step. With
crude line searches, the performance of this method degrades significantly.
Another updating formula was suggested by Broyden, Fletcher, Goldfarb
and Shanno and is known as the BFGS formula

$$H^{(k+1)}_{BFGS} = H + \left(1 + \frac{\gamma^T H \gamma}{\delta^T \gamma}\right) \frac{\delta\,\delta^T}{\delta^T \gamma} - \frac{\delta\,\gamma^T H + H \gamma\,\delta^T}{\delta^T \gamma}. \tag{8.33}$$

It can be shown that if B^(k) = (H^(k))^{-1} approximates the Hessian matrix G,
then

$$B^{(k+1)}_{BFGS} = B + \frac{\gamma\,\gamma^T}{\gamma^T \delta} - \frac{B \delta\,\delta^T B}{\delta^T B \delta}, \tag{8.34}$$

which resembles the DFP formula. A similar formula can also be given for B_DFP.
Hence, we can work with the matrix B instead of H, but in that case, we have
to solve a system of linear equations at every iteration, which may be desirable
if the matrix B is expected to be sparse. Since in that case, the inverse will
generally not be sparse and a large amount of memory may be required to store
H. Even the solution of equations may be more efficient, because of the sparsity
of B. The BFGS formula has been found to work well in practice and in fact
is usually better than the DFP formula. Fletcher (2000) has described many
more updating formulae, but these formulae do not appear to have any clear
advantage over the BFGS formula.
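The sketch below illustrates how the update is used in practice. It is not the BFGS subroutine of Appendix B: the Fletcher line search described below is replaced by crude backtracking, and the function names, iteration counts and tolerances are assumptions of the illustration.

```python
import numpy as np

def bfgs_update(H, delta, gamma):
    """BFGS update of the approximate inverse Hessian H, in the form (8.33);
    delta = x_new - x_old and gamma = grad_new - grad_old, as in (8.29)."""
    dg = delta @ gamma                          # delta^T gamma, positive in practice
    Hg = H @ gamma
    return (H + (1.0 + (gamma @ Hg) / dg) * np.outer(delta, delta) / dg
              - (np.outer(delta, Hg) + np.outer(Hg, delta)) / dg)

def quasi_newton(f, grad, x, niter=100, step_tol=1e-10):
    """Bare-bones quasi-Newton iteration with a crude backtracking line search."""
    H = np.eye(len(x))
    g = grad(x)
    for _ in range(niter):
        s = -H @ g                              # search direction, cf. (8.25)
        alpha = 1.0
        while f(x + alpha * s) > f(x) and alpha > 1e-12:
            alpha *= 0.5                        # backtrack until f decreases
        x_new = x + alpha * s
        g_new = grad(x_new)
        delta, gamma = x_new - x, g_new - g
        if np.linalg.norm(delta) < step_tol:
            break
        if delta @ gamma > 0.0:                 # keep H positive definite
            H = bfgs_update(H, delta, gamma)
        x, g = x_new, g_new
    return x

# illustrative use on the Rosenbrock function of Example 8.4; the iteration
# should converge towards the minimum at (1, 1)
f = lambda x: 100.0 * (x[1] - x[0]**2)**2 + (1.0 - x[0])**2
grad = lambda x: np.array([-400.0 * x[0] * (x[1] - x[0]**2) - 2.0 * (1.0 - x[0]),
                           200.0 * (x[1] - x[0]**2)])
print(quasi_newton(f, grad, np.array([0.0, 0.0])))
```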
It can be shown {15} that if instead of starting with a unit matrix for
H^(1) some of the diagonal elements are set to zero, the corresponding variables
remain fixed at their initial values. This technique can be used to vary the
parameters selectively in quasi-Newton methods.
This method requires a line search to be carried out at every step. For
this purpose, in principle, we can use the methods described in Sections 8.1-3.
In practice, the specification of the function could be somewhat tricky, since
the function has n independent variables, which can be expressed in terms of
one variable α:

$$F(\alpha) = f\left(x^{(k)} + \alpha\, s^{(k)}\right). \tag{8.35}$$

If the derivative is also required, then it can be evaluated using

$$F'(\alpha) = s^{(k)T}\, \nabla f\left(x^{(k)} + \alpha\, s^{(k)}\right). \tag{8.36}$$

Further, since most of the effort in this method is spent in line searches, it is
important to perform them as efficiently as possible. It may not be essential
to find an accurate minimum for every line search and special techniques for
finding a crude minimum will be more effective. Various criteria have been given
to determine what is an acceptable minimum obtained from the line search
methods. It can be shown (Fletcher, 2000) that any point along the line, which
satisfies the following conditions is acceptable as a minimiser for the line search

$$F(\alpha) \le F(0) + \alpha \rho F'(0), \qquad |F'(\alpha)| \le -\sigma F'(0). \tag{8.37}$$

Here ρ and σ are predetermined constants with 0 < σ < 1 and 0 < ρ < min(1/2, σ).
It may be noted that F'(0) < 0, since the selected direction should be a descent
direction. This criterion for acceptance of minimiser is sufficient to prove the
convergence of some of the methods.
Fletcher has also described a simple algorithm for performing the line
search. This method consists of two stages: in the first stage we attempt to
bracket the minimum, while the second stage consists of sectioning the bracket.
For bracketing the minimum, we start with an initial guess α₁ and generate
successive iterates α_i (i = 2, 3, ...), until the minimum is bracketed. We may
take α₀ = 0. At the ith step the function and its derivative are evaluated at
α_i and the conditions (8.37) are checked. If these conditions are satisfied, then
we straightway accept the point without going into the second phase. If
F(α_i) > F(0) + α_i ρ F'(0) or F'(α_i) > 0 or F(α_i) > F(α_{i-1}), then the minimum
is bracketed in the interval (α_{i-1}, α_i). Otherwise, we generate the next value
α_{i+1} ∈ (2α_i − α_{i-1}, α_i + τ₁(α_i − α_{i-1})) and start the next iteration. Here τ₁
is a preset factor by which the size of jumps is increased, typically τ₁ = 9.
For the initial approximation α₁ we can use min(1, −2Δf/F'(0)), where Δf is
the decrease in the function value during the last iteration. The value of α_{i+1} in
the last step can be obtained by some interpolation. For example, we can use
the Hermite cubic interpolation or the parabolic interpolation as discussed in
Section 8.3.
Once the bracketing is complete, the interval is subdivided until an ac-
ceptable point is found. At each iteration of the searching phase we have two
points a and b (b > a) bracketing the minimum at which the function values
and derivatives are known from the previous calculations. We choose a trial
value α ∈ [a + τ₂(b − a), b − τ₃(b − a)] and evaluate F(α) and F'(α). Here, the
constants τ₂ and τ₃ are chosen to ensure that the bracketing interval reduces
significantly at each iteration. Now, if the conditions (8.37) are satisfied, then
the process is terminated. Otherwise, depending on the value of the function
and the derivative at the three points, one of the end points of the bracketing
interval is replaced by α and the next iteration is started. It can be easily seen that
if F(α) > F(0) + αρF'(0) or F(α) > F(a) or F'(α) > 0, then α replaces b,
while otherwise α replaces a. Once again, the trial value of α could be obtained
by cubic or parabolic interpolation. Because of roundoff error or some error in
calculating the derivatives, it may happen that the line search will try to look
for the minimum in the wrong direction. In that case, it is not possible to find
the minimum and the initial interval will be subdivided forever, unless some
alternative exit is provided.
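The following Python sketch condenses the two phases into a few lines. It is an illustration, not Fletcher's algorithm nor the code used in the BFGS subroutine; for brevity the trial point in each phase is taken as a fixed fraction of the current bracket rather than being obtained by cubic or parabolic interpolation, and the defaults shown are the values quoted in Example 8.4.

```python
def line_search(F, dF, rho=0.01, sigma=0.1, tau1=9.0, alpha1=1.0, maxit=40):
    """Two-phase line search sketch; F and dF are the function and derivative
    along the line, as in (8.35)-(8.36)."""
    F0, dF0 = F(0.0), dF(0.0)              # dF0 < 0 for a descent direction

    def acceptable(al):                    # acceptance conditions (8.37)
        return (F(al) <= F0 + al * rho * dF0) and (abs(dF(al)) <= -sigma * dF0)

    # bracketing phase: march outwards until (a, al) brackets a minimum
    a, al = 0.0, alpha1
    lo = hi = None
    for _ in range(maxit):
        if acceptable(al):
            return al
        if F(al) > F0 + al * rho * dF0 or dF(al) > 0.0 or F(al) > F(a):
            lo, hi = a, al                 # minimum bracketed in (a, al)
            break
        a, al = al, al + tau1 * (al - a)   # expand with the largest allowed jump
    if lo is None:
        return al                          # no bracket found within maxit steps

    # sectioning phase: shrink (lo, hi) until an acceptable point appears
    for _ in range(maxit):
        al = lo + 0.4 * (hi - lo)          # trial point inside the bracket
        if acceptable(al):
            return al
        if F(al) > F0 + al * rho * dF0 or F(al) > F(lo) or dF(al) > 0.0:
            hi = al                        # al replaces b
        else:
            lo = al                        # al replaces a
    return al

# illustrative use along a line with F(al) = (al - 2)**2, minimum at al = 2
print(line_search(lambda al: (al - 2.0)**2, lambda al: 2.0 * (al - 2.0)))
```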
Subroutine BFGS in Appendix B is a simple implementation of quasi-
Newton method using BFGS formula. This subroutine uses a simple criterion
for checking the convergence. The iteration is terminated when the shift in the
point during the last iteration is smaller than the required tolerance. To avoid
the possibility of spurious convergence, we may use a more stringent criterion,
where the shift in last two or three iterations is checked. In some applications,
it may be better to use a convergence criterion based on the function values,
that is, the iteration can be terminated when the decrease in the function
value during the last iteration or may be the last k iterations is less than the
required tolerance. This criterion is simpler to apply, but if we are interested
in the minimiser, then it may not be very satisfactory, since the minimum may
be poorly defined along some directions, in the sense that the function value
may change very slowly along these directions. In such cases, a criterion based
on function values may cause the iteration to terminate far from the actual
minimiser.
It should be noted that the quasi-Newton methods only look for a zero of
the gradient vector, which can arise at a maximum or a saddle point also. Hence,
we should be able to distinguish between different types of stationary points.
Since during each iteration the function value decreases, there is no danger of
the iteration converging to a maximum, unless the starting value itself happens
to coincide with the maximiser. However, the situation at a saddle point is not
that simple, and in principle, if the iteration is started at a point close to the
saddle point, where the function value is larger than the function value at the
saddle point, it is possible for the iteration to converge to the saddle point.
Nevertheless, the gradient at a point close to the saddle point may not really
lead towards the saddle point, and in most cases, the iteration may be able to
avoid the saddle point. But if the Hessian matrix is singular at the saddle point
and the nature of the stationary point is actually determined by the higher
order derivatives {14}, the iteration may actually converge to the saddle point.
This situation can be detected by the fact that the matrix H in the quasi-
Newton method, which approximates the inverse of the Hessian matrix, will
tend to blow up at such points. Hence, by considering the norm of this matrix
it may be possible to detect such situations. However, the Hessian matrix can
also be singular at a minimum and the singularity of the Hessian matrix does
not necessarily imply that the corresponding point is a saddle point.
Subroutine BFGS attempts to detect this situation by considering a norm
(i.e., Σ_ij |h_ij|) of the matrix H. If the ratio of the norms at the last two itera-
tions is larger than 1.3, then it is assumed that the Hessian matrix is singular.
This is not a foolproof technique for detecting singularity of the Hessian ma-
trix, particularly if the convergence requirement is modest. In that case, even
if the Hessian matrix is nonsingular, it is possible that the matrix H has not
converged to its asymptotic value and the ratio of norms may come out to be
larger than 1.3. On the other hand, it is also possible that the Hessian matrix
is singular, but the ratio is smaller than 1.3, because we are not sufficiently
close to the stationary point. At higher accuracy it may be easier to distinguish
between the two types of points, but that is not guaranteed. However, if very
high accuracy is required, then because of the roundoff error, it may be difficult
to decide if the matrix is singular or not. Alternatively, we can look at the norm
of the matrix H at the end of iteration, but it may be difficult to detect singu-
larity, unless we have some idea about the expected norm. If a second attempt
is made to locate the minimum with a starting value close to the final point,
then the iteration may tend towards lower function values if it is a saddle point,
otherwise it should come back to the same point. The direction set methods
or the conjugate gradient methods described in the next section do not have
this drawback, since there at each stage search is made along n independent
directions.
The quasi-Newton methods require calculation of derivatives. It often hap-
pens that the users make blunders in computing the derivatives of the function,
and as a result, the program gets completely confused. In fact, some of the bet-
ter library routines to find a minimum include internal check on the calculated
derivatives using finite differences, to make sure that, there are no errors in
computing the derivatives.
Since the quasi-Newton methods essentially look for a zero of the gradient
vector, it is normally possible to achieve a relative accuracy of the order of ℏ.
However, each iteration involves a line search and it may not be possible to
decrease the function, once we are close to the minimiser. This problem can
be overcome if the line search uses derivatives and if the point is accepted
to be a minimum, even if the function value has not actually reduced. Once
the iteration is close to a minimiser, if at each step we choose the value of
the parameter α along the line to be one, then the iteration will converge
quadratically to the minimiser. This result follows from the fact that α = 1
yields the Newton's method which converges quadratically, even though the
function value may not reduce within the machine accuracy. If the function
value itself is zero at the minimum, then there may not be any difficulty, since
the function value will keep reducing as we approach the minimum.
EXAMPLE 8.4: Minimise the following functions using the BFGS method:

(8.38)
The first function is a simple quadratic function which has been included to demonstrate
the quadratic termination of the BFGS method. The minimum is clearly at x = 0, but simple
methods like the method of steepest descent will converge very slowly, because the contour
diagram for this function is similar to that shown in Figure 8.2, where the contours are
ellipses, with semi-major axis ten times the semi-minor axis. There is effectively a valley
along the x₂ axis. If the iteration is started with x₁ = 0.01 and x₂ = 1, it converges in three
iterations to essentially the exact value of the minimiser. Further, in the third iteration, apart
from a small roundoff error, the matrix H is the exact inverse of the Hessian matrix.
The second function is a well-known example due to Rosenbrock, and is somewhat
difficult to minimise, because of a parabolic valley along x₂ = x₁². Using the method of
steepest descent, the iteration tends to fall into the valley and then follow it around to the
minimum at (1, 1)ᵀ. Table 8.1 gives the results obtained using the subroutine BFGS in
Appendix B, with a starting value of x₁ = x₂ = 0. It requires about 15 iterations to converge
to the accuracy permitted by the 24-bit arithmetic. The convergence is nearly quadratic for
the last few iterations, before the roundoff error starts dominating. It can be seen that in
the last three iterations the function value has not reduced significantly. It may be noted
that this method required 45 function evaluations, giving an average of three per line search,
which includes both the bracketing and the sectioning phases. The line searches were carried
out with parameters σ = 0.1, ρ = 0.01, τ₁ = 9, τ₂ = 0.1, τ₃ = 0.5, as recommended by
Fletcher (2000), which corresponds to a fairly accurate line search. If the accuracy is reduced
by taking σ = 0.9, the program requires 21 iterations to converge to the same results, but
it needs only 26 function evaluations. Hence, it may be more efficient to use crude line
searches. In the latter case, it requires only about one function evaluation per line search,
because almost any reasonable value is acceptable according to (8.37). Table 8.1 also gives
the updated matrix H at each stage.
Table 8.1: Minimisation of the Rosenbrock's function using the BFGS method

  i     x(i)         f(x(i))        ∇f(x(i))       α(i)         s(i)            H_BFGS
  1   0.1634684    7.712×10^-1    7.421×10^-2    0.0817342    2.000×10^0     1.000000  0.000000
      0.0000000                  -5.344×10^0                  0.000×10^0     0.000000  1.000000
  2   0.2944206    6.263×10^-1    2.809×10^0     0.0098670    1.327×10^1     6.717596  2.576584
      0.0508464                  -7.167×10^0                  5.153×10^0     2.576584  1.000000
  3   0.3436205    4.370×10^-1   -2.389×10^0     0.6592494    7.463×10^-2    0.074204  0.039497
      0.1259072                   1.566×10^0                  1.139×10^-1    0.039497  0.031367
  4   0.4604641    3.186×10^-1    1.974×10^0     0.8308288    1.406×10^-1    0.102587  0.066695
      0.1954527                  -3.315×10^0                  8.371×10^-2    0.066695  0.048293
  5   0.5395574    2.743×10^-1    4.465×10^0     1.3615100    5.809×10^-2    0.138944  0.100252
      0.2661691                  -4.991×10^0                  5.194×10^-2    0.100252  0.075359
  8   0.9474563    4.770×10^-3    1.594×10^0     4.2812870    1.466×10^-2    0.066600  0.114178
      0.8931908                  -8.965×10^-1                 2.604×10^-2    0.114178  0.200475
 10   0.9946687    4.415×10^-5    1.471×10^-1    0.4790581    3.323×10^-2    0.568068  1.081823
      0.9889693                  -7.931×10^-2                 6.161×10^-2    1.081823  2.065783
 12   1.0001400    1.037×10^-7    1.188×10^-2    0.5000000   -5.901×10^-4    0.506532  1.004832
      1.0002510                  -5.798×10^-3                -1.067×10^-3    1.004832  1.999388
 13   1.0000020    4.619×10^-12   5.126×10^-5    1.0000000   -1.384×10^-4    0.467117  0.933078
      1.0000030                  -2.384×10^-5                -2.480×10^-4    0.933078  1.868769
 14   0.9999986    1.435×10^-12  -2.670×10^-5    1.0000000   -1.709×10^-6    0.468873  0.936380
      0.9999972                   1.192×10^-5                -3.296×10^-6    0.936380  1.874965
 15   1.0000000    7.851×10^-13   4.792×10^-5    1.0000000    7.864×10^-7    0.302770  0.612240
      1.0000000                  -2.384×10^-5                 1.537×10^-6    0.612240  1.242514
Since this function is not quadratic, we do not expect
the matrix to converge to the correct inverse of the Hessian matrix. For this problem it may
be noted that at the minimiser

$$G = \begin{pmatrix} 802 & -400 \\ -400 & 200 \end{pmatrix}, \tag{8.39}$$

which can be compared with the matrix H at the end of the calculation. It can be seen
from the table that at the end of the 12th iteration, the matrix H is very close to G^{-1}, but in
subsequent iterations, because of roundoff errors the approximation becomes worse. From the
table, it can be seen that the gradient vector tends to zero at the minimiser as expected. In
this case, since the minimum value of the function is zero, there is no difficulty in obtaining
a fairly accurate position of the minimiser. If we add a constant to the function, then the
function will no longer vanish at the minimiser, but nevertheless the subroutine BFGS still
yields essentially the same results.

8.6 Direction Set Methods


In the previous section, we considered quasi-Newton methods for finding a min-
imum of a function of several variables. These methods require the calculation
of derivatives , which may not be possible if the function is rather complicated.
In this section, we describe methods which do not require the cakulation of
derivatives. These methods are not as efficient as the methods described in the
previous section, in terms of the number of function evaluations required, and
unless the derivative calculation is extremely difficult, the method described
earlier will be more efficient. If the derivative can be evaluated, then it is also
possible to use the conjugate gradient methods described at the end of this sec-
tion. These methods are not as efficient as the quasi-Newton methods described
in Section 8.5, but they require much less memory and may be considered if
the number of variables is very large.
In the absence of even the first derivatives, it is difficult to approximate
the quadratic terms, unless a search is made along n independent directions.
The simplest method of this type is the alternating direction method briefly
described in Section 8.4. But as mentioned earlier it is not very satisfactory.
We need better set of directions than the set of arbitrary orthogonal vectors.
The basic idea of all direction set methods is to start with an arbitrary set of
orthogonal vectors and update this set using the information about the function
value and attempt to come up with a set, which has non-interfering directions
such that a minimisation along one direction is not spoilt by subsequent min-
imisation along another. If such a set can be found, then after one cycle the
process will converge to the minimiser. Such a set of directions may not even
exist for arbitrary functions. However, it can be shown that for a quadratic
function such a set of directions exists. Hence, such methods should terminate
after a predetermined finite number of iterations for a quadratic function. This
is a useful property of any optimisation method and is referred to as quadratic
termination. Since a general function can be approximated reasonably well by
a quadratic in the neighbourhood of a minimum, we can expect such methods
to converge very fast, once we are close to a minimum.
The set of non-interfering directions is called conjugate directions and this


concept is strictly valid only for quadratic functions. Hence, for the time being
let us assume that our function is quadratic and is given by the first three terms
of the Taylor series (8.18). Suppose we have moved along some direction u to
a minimum, then the gradient vector at this point is perpendicular to u. Now
if we want to minimise the function along another direction v, then at the end
of the minimisation the gradient is perpendicular to v, but not necessarily to
u, which spoils the minimisation along u. Hence, what we want is that, during
the second minimisation the gradient should always remain perpendicular to
u. Since our function is quadratic, the change in the gradient can be written as

$$\delta(\nabla f) = G\, \delta x. \tag{8.40}$$

During the minimisation along v, δx = αv, and if

$$u^T G\, v = 0, \tag{8.41}$$

then the gradient will remain perpendicular to u and after this minimisation,
there will be no need to minimise again along u. Any two vectors which satisfy
this condition are said to be conjugate. A set of vectors such that any two of
them are conjugate, is said to constitute a set of conjugate directions.
Thus, the basic aim of the direction set methods is to come up with a
set of conjugate directions, using only the information obtained from function
values at intermediate points. It may be noted that the Hessian matrix G is not
known and hence it is not possible to use (8.41) to find conjugate directions.
The difference between various direction set methods lies in the ingenuity
with which conjugate directions are generated without explicit knowledge of the
Hessian, and in the techniques needed to make the methods efficient for general
non-quadratic functions. One such method was given by Powell which starts
with an initial approximation to the minimum x^(0) and a set of orthogonal
vectors s^(1), ..., s^(n). This set of vectors could be simply the columns of a unit
matrix. One iteration of the basic procedure consists of the following steps:
1. For i = 1, ..., n, compute α_i to minimise f(x^(i-1) + α_i s^(i)), and define
x^(i) = x^(i-1) + α_i s^(i).

2. For i = 1, ..., n - 1, replace s^(i) by s^(i+1).

3. Replace s^(n) by x^(n) - x^(0).

4. Compute α to minimise f(x^(0) + α s^(n)), and replace x^(0) by x^(0) + α s^(n).


It can be proved that if f is quadratic, then after the kth iteration (1 ≤ k ≤ n) the
last k directions s^(n-k+1), ..., s^(n) are conjugate. After n iterations the process
will terminate at the minimum. However, for a general non-quadratic function
the entire cycle has to be repeated until some suitable convergence criterion is
satisfied.
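A bare-bones Python sketch of this basic procedure is given below. The crude golden-section line minimiser, the restart frequency, the cycle count and the test function are all assumptions of the illustration and not part of Powell's algorithm as such.

```python
import numpy as np

def line_min(f, x, s, step=1.0, tol=1e-6):
    """Crude minimisation of f along x + a*s: bracket by doubling, then golden
    section.  A real implementation would use the methods of Sections 8.1-8.3."""
    phi = 0.5 * (np.sqrt(5.0) - 1.0)
    g = lambda a: f(x + a * s)
    a, b = -step, step
    while g(b) < g(0.0):               # expand to the right while still decreasing
        b *= 2.0
    while g(a) < g(0.0):               # expand to the left while still decreasing
        a *= 2.0
    c, d = b - phi * (b - a), a + phi * (b - a)
    while abs(b - a) > tol:
        if g(c) < g(d):
            b, d = d, c
            c = b - phi * (b - a)
        else:
            a, c = c, d
            d = a + phi * (b - a)
    return 0.5 * (a + b)

def powell(f, x0, ncycles=30):
    """Sketch of Powell's direction set iteration (steps 1-4 above), restarting
    with the unit vectors after every n cycles to limit linear dependence."""
    x0 = np.asarray(x0, dtype=float)
    n = len(x0)
    for cycle in range(ncycles):
        if cycle % n == 0:
            S = [np.eye(n)[i] for i in range(n)]   # reinitialise the directions
        x_start = x0.copy()
        x = x0.copy()
        for i in range(n):                         # step 1: minimise along each s(i)
            a = line_min(f, x, S[i])
            x = x + a * S[i]
        S = S[1:] + [x - x_start]                  # steps 2-3: new direction x(n)-x(0)
        a = line_min(f, x_start, S[-1])            # step 4: minimise along it
        x0 = x_start + a * S[-1]
    return x0

# illustrative use on the Rosenbrock function of Example 8.4; the iterates
# should approach the minimum at (1, 1)
rosen = lambda x: 100.0 * (x[1] - x[0]**2)**2 + (1.0 - x[0])**2
print(powell(rosen, [0.0, 0.0]))
```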
The problem with Powell's method is that, for large n, the set of directions
s^(i) tends to become linearly dependent and no longer spans the entire space in
n dimensions. Hence, the method effectively searches for the minimiser over a
proper subspace of the entire space Rn. Consequently, it is not likely to give the
correct result, since the true minimiser may not be in this subspace. To over-
come this problem, various suggestions have been put forward. Simplest option
is to restart the cycle after n iterations, which ensures that all directions are
searched. However, this approach wastes a lot of information, which has been
gathered during each cycle of n iterations. In particular, at the end of this cycle
the set of directions S(i) is close to conjugate. Hence, it will be better if this in-
formation can be used in subsequent iterations. Brent (2002) has proposed that
at the end of one cycle we can reinitialise the directions to a set of orthogonal
vectors obtained from the set s(1), ... ,s(n). This technique retains the previous
information and at the same time avoids the problem of linear dependence.
An alternative suggestion, which has been made by Powell, is not to discard
s(1) at step (2) of the algorithm, but to discard that direction which will
maintain linear independence as far as possible. With this modification, it turns
out that the property of quadratic termination is lost, since a complete set of
conjugate directions may never be obtained. However, practical experience sug-
gests that, this method works fairly well on most problems. Of course, we can
combine the two suggestions, that of selecting which direction to discard and
to reinitialise the direction sets after each cycle by orthogonalisation. In that
case, at the kth iteration (2 ≤ k ≤ n), if we discard any one of the directions
s(1), ... ,s(n-k+1), the conjugacy and hence the quadratic termination property
will be preserved, since only the last k directions are conjugate at any stage.
This device is possible if we reinitialise the set after every n iterations, since
in that case k ≤ n.
We need some criterion to decide which direction to discard at any stage.
For this purpose, we normalise the directions such that s(i)^T G s(i) = 1 and con-
struct the matrix S = (s(1), ... , s(n)), where each column is the corresponding
direction vector. Then the determinant of this matrix can give a good mea-
sure of linear independence. Hence, we should discard that direction, which
maximises the determinant. This direction can be easily identified as follows.
Suppose that the new direction x(n) - x(0) = s(n+1) satisfies

    s(n+1) / (s(n+1)^T G s(n+1))^{1/2} = Σ_{i=1}^{n} β_i s(i) / (s(i)^T G s(i))^{1/2} .        (8.42)

Here the denominator is added to ensure that the corresponding vectors are
normalised. It is clear that the effect of discarding s(i) and replacing it by
s(n+1), and then renumbering the directions, is to multiply the determinant
|S| by |β_i|. Thus, we should choose i such that |β_i|, (1 ≤ i ≤ n - k + 1), is
maximised. In order to use this result we need to estimate the denominators,
which can be done along with the basic line search at step (1) of Powell's
algorithm. At the ith step, if the minimisation with shift α_i s(i) decreases f(x)
by an amount δ_i and if the function is quadratic, then it can be shown that

    s(i)^T G s(i) = 2δ_i / α_i² .                                   (8.43)

If at any step α_i = 0, then the result of the previous iteration may be used. Further,
from the iterative procedure it is clear that s(n+1) = Σ_i α_i s(i). Hence

    β_i (s(n+1)^T G s(n+1))^{1/2} = α_i (s(i)^T G s(i))^{1/2} = α_i √(2δ_i) / |α_i| ,          (8.44)

which essentially implies that the direction along which the largest change in
the function value has taken place should be discarded. This result may appear
paradoxical at first sight, but it should be noted that, this is usually a major
component of the new direction we are adding.
Finally we discuss the suggestion of Brent to reinitialise the directions
after every n iterations. If S is the matrix of conjugate direction vectors, then
from the property of conjugacy

    S^T G S = D,                                                    (8.45)

where D is a diagonal matrix with positive diagonal elements d_i, which can be
estimated from (8.43). Now we define a matrix V by

(8.46)

Since S is nonsingular, we get

(8.47)

At the end of every cycle of n iterations, we wish to replace the direction vectors
by a set of orthonormal vectors to ensure linear independence. It will be best if
this orthonormal set is selected, such that the corresponding matrix Q satisfies

    Q^T G Q = Λ,                                                    (8.48)

where Λ = diag(λ_i) is diagonal. Thus, the columns of Q are just the eigenvectors
of the real symmetric matrix G, with corresponding eigenvalues λ_1, ..., λ_n. This
device ensures that the information from previous cycle is not lost, since the
new set will be conjugate. To determine these directions, Brent recommends
singular value decomposition (Section 3.6) of the matrix V to get Q^T V R = Σ,
where R is also an orthogonal matrix, and Σ = diag(σ_i) is a diagonal matrix.
It can be easily seen that

(8.49)

and λ_i = σ_i². Thus, we can compute the required orthogonal matrix Q by


singular value decomposition of V. The columns of this matrix will give the
new set of direction vectors.
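A minimal Python sketch of this reinitialisation is given below. The scaling of the columns by 1/√d_i, with d_i ≈ s(i)^T G s(i) estimated from (8.43), is an assumption made for the sketch; the precise construction used by NMINF is the one defined by the book's equations (8.46)-(8.49).

    import numpy as np

    def reinitialise_directions(S, d):
        """Return an orthonormal set of search directions at the end of a cycle,
        in the spirit of Brent's SVD device.  Scaling by 1/sqrt(d_i) is an
        assumed normalisation (d_i ~ s(i)^T G s(i), estimated from (8.43));
        see (8.46)-(8.49) for the exact construction."""
        V = S / np.sqrt(np.asarray(d, dtype=float))   # scale the (nearly) conjugate columns
        Q, sigma, RT = np.linalg.svd(V)               # V = Q diag(sigma) R^T
        return Q                                      # columns: new orthonormal directions

The point of the device is visible here: only the direction matrix S and the estimates d_i obtained from the line searches are needed, never the Hessian G itself.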
The process of orthogonalisation introduces some extra computations, but
Brent has shown that it is much smaller than the effort spent in evaluating the
function a large number of times. Further, by improving the efficiency and thus
reducing the number of function evaluations, this device should ultimately re-
duce the total amount of computations required. Apart from these suggestions,
Brent also has some suggestions regarding the problem of narrow valleys men-
tioned in Section 8.4.
It can be seen that each cycle of n basic iterations of Powell's method re-
quires n( n + 1) line searches. Hence, most of the effort is spent in line searches
and it is essential to perform them as efficiently as possible. Since these line
searches only give some intermediate quantities, it will be wasteful to spend
considerable effort to find the minimum accurately. However, for the no deriva-
tive methods, there is no theory to guide us about what is an acceptable level
of accuracy in line searches. There is no general agreement about how these
line searches should be carried out. Brent (2002) suggests that it is only nec-
essary to ensure that the function value decreases after each line search. Using
the function value at three distinct points along the line, or any other infor-
mation which is available (e.g., the estimate of the second derivative f''), we
can approximate the function f(α) by a parabola. If α* is the minimiser of
this parabola, then we can accept that as the required minimiser, provided
f(α*) < f(0). If this condition is not satisfied, then α* is replaced by α*/2
and f(α*) is evaluated and the test repeated. This process, which is similar to
bisection, can be repeated and if after a number of attempts the test is still
not satisfied, then we accept α = 0 as the minimiser. According to Brent this
procedure requires on an average two to three function evaluations for each line
search.
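The acceptance test itself is simple enough to sketch in Python. The parabolic step below is the standard three-point interpolation formula, and the halving loop mirrors the bisection-like retreat described above; the function names and the limit of five halvings are arbitrary choices made for the sketch.

    def parabolic_minimum(a, fa, b, fb, c, fc):
        """Minimiser of the parabola through (a, fa), (b, fb), (c, fc);
        assumes the three points are not collinear."""
        num = (b - a)**2 * (fb - fc) - (b - c)**2 * (fb - fa)
        den = (b - a) * (fb - fc) - (b - c) * (fb - fa)
        return b - 0.5 * num / den

    def accept_step(f, alpha, f0, max_halvings=5):
        """Accept the trial step alpha if it lowers the function value,
        otherwise halve it a few times; give up with alpha = 0 on failure."""
        for _ in range(max_halvings):
            if f(alpha) < f0:
                return alpha
            alpha *= 0.5
        return 0.0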
If the derivatives are not available, it is not possible to guess which side
the minimum will be found and it may not be correct to subdivide the interval
on one side of the starting point, since the first guess could be on the wrong
side of the initial point. To avoid this problem the subroutine LINMNF which
implements the line search for no derivative case adopts the following procedure.
When the subroutine is called, the function value at the starting value is already
available. The function is evaluated at the first guess α_1, which is provided on
the basis of earlier results. If the function is less than F(0), then the third
point α_2 is chosen further down on the same side. Otherwise, the third point is
chosen on the opposite side. Using these three points, parabolic interpolation is
performed to obtain the approximation to the minimiser. The interpolated value
α_p is accepted only if it is in some reasonable range. If this value is acceptable
and F(α_p) < F(0), the point is accepted and the process is terminated. In other
cases, if min(F(α_1), F(α_2)) < F(0), then once again the corresponding point
is accepted. Hence, after the first iteration either we have found an acceptable
point or the function value at all the new points is larger than F(0), in which
case, the minimiser is bracketed between the points α_1 and α_2. If α_p is within
this interval, then the bracketing interval can be reduced still further. If the
reduction in the interval size is significant, then α_p replaces the corresponding
bound, otherwise a new point is selected as in the golden section search. The
process is repeated using the new triplet of bracketing points and iteration is
continued until we find one acceptable point, or the change in function value
is within acceptable limits. It may be noted that we have made no attempt to
use the estimate of the second derivative at α = 0, which is also available.
The line search routine described here as well as the one described in
the previous section may fail to detect the true minimum, if the function is
not unimodal or if the roundoff error is significant. For functions like h (x) in
Example 8.1, the routine may jump over the maximum into the asymptotically
decreasing region and one of the points in this region may pass the required
test. Consequently, the iteration may wander into a region, where the function
is monotonically decreasing and the successive values tend to diverge away from
any of the local minimum.
Since this method does not involve any derivatives, it may not be possible
to determine the minimiser to a relative accuracy of better than √ℏ, because
of the roundoff error. As argued in Section 8.1, the function value is essen-
tially constant at all points within this distance from the minimiser and it is
not possible to make any progress during line minimisation. This problem is
not present in quasi-Newton methods which essentially find the zeros of gra-
dient vector, since in that case, the accuracy is determined by the domain of
indeterminacy associated with the zero.
Subroutine NMINF in Appendix B implements the Powell's method, in-
corporating some of the modifications suggested by Brent as described in this
section. Brent (2002) has given a more sophisticated implementation of this
algorithm and the corresponding Algol program. Subroutine NMINF incorpo-
rates the modified criterion for discarding old directions and the singular value
decomposition to reinitialise the basic search directions.
EXAMPLE 8.5: Minimise the following functions using a direction set method:

Here the first function is slightly different from that in the previous example, since
h(x_1, x_2) in Example 8.4 has the axis directions as conjugate directions and the iteration
will converge trivially to the exact value in the first iteration of Powell's method. The results
obtained using a 24-bit arithmetic are displayed in Table 8.2. It may be noted that for the
first function, the numbers in the first column give the number of basic iterations rather than
the number of cycles of n iterations. It can be seen that after one cycle of two iterations,
the procedure has converged to the correct value. If the iteration is continued further, the
results improve because in this case, the function value as well as the variables vanish at the
minimum. Hence, the relative roundoff error is still quite low. This essentially means that,
there is no inherent scale in the function and x_1 = x_2 = 1 is as good as x_1 = x_2 = 10^-10,
as long as the numbers do not give rise to underflow or overflow during the calculation. The
improvement will continue until x_i² results in an underflow. This function was included only
to demonstrate the property of quadratic termination for this method.
The second function is Rosenbrock's function considered in Example 8.4 and using
the same starting values, the subroutine NMINF converges to the minimum in 21 cycles. The
results are also shown in Table 8.2, where the numbers in the first column give the number of
cycles rather than the number of basic iterations. Each cycle requires six line searches to be
carried out, thus giving a total of 126 line searches, which required 355 function evaluations
giving approximately three function evaluations per line search. Using a more sophisticated
implementation of this method, Brent (2002) requires somewhat smaller number of function
evaluations per line search. By comparing these results with Table 8.1, it is clear that this
method is not as efficient as the quasi-Newton methods. Even if calculation of each derivative
requires as much effort as evaluating the function , the quasi-Newton method works out to
Table 8.2: Minimisation using a direction set method

                   First function                                 Rosenbrock's function
  i        f               x_1               x_2              f              x_1         x_2

  1   3.6615 x 10^0    -9.4321 x 10^-1    9.6152 x 10^-1    5.3838 x 10^-1    .2867207    .0650012
  2   3.3904 x 10^0    -9.0762 x 10^-1    9.2524 x 10^-1    5.1571 x 10^-1    .2849864    .0745347
  3   2.8681 x 10^0    -8.3268 x 10^-1    7.8141 x 10^-1    4.5035 x 10^-1    .3315537    .1158690
  4   2.3162 x 10^-11  -7.5234 x 10^-7    1.1926 x 10^-6    2.3382 x 10^-1    .5169949    .2649833
  5   6.1991 x 10^-15  -3.7796 x 10^-8    3.4733 x 10^-8    2.2805 x 10^-1    .5228686    .2714006
  6   6.1555 x 10^-25   1.1591 x 10^-13  -1.8823 x 10^-13   2.1523 x 10^-1    .5363553    .2860430
  7   9.0823 x 10^-31  -4.5755 x 10^-16   4.2050 x 10^-16   1.8778 x 10^-1    .5672530    .3195236
 10                                                         1.3812 x 10^-1    .6570248    .4173683
 15                                                         3.2666 x 10^-2    .8461918    .7065487
 18                                                         3.4433 x 10^-4    .9814639    .9633574
 19                                                         5.0228 x 10^-8    .9997774    .9995576
 20                                                         3.8725 x 10^-13   .9999998    .9999997
 21                                                         3.5882 x 10^-13   .9999999    .9999999

be more efficient by a factor of three, as compared to the direction set methods. In this case,
since the function value is zero at the minimiser, it is possible to get an accurate value of the
minimiser. If we consider the function h(x_1, x_2) + 1, then the function will not vanish at the
minimiser and it is not possible to find the minimiser to an accuracy of better than 10^-3,
using the same 24-bit arithmetic. Thus in general, the direction set methods are neither as
efficient nor as accurate as the quasi-Newton methods.

If the first derivatives can also be calculated, then it is possible to use
the so-called conjugate gradient methods, which are similar to the direction
set methods considered here. The simplest method of this type is the Fletcher-
Reeves method, in which the line searches are carried out along conjugate di-
rections, which are selected as follows. The first search direction is the direction
of steepest descent (s(1) = -∇f). For k ≥ 1
s(k+1) = component of -∇f(x(k+1)) conjugate to s(1), s(2), ..., s(k).          (8.51)

For a quadratic function the conjugacy condition can be written as

    s(j)^T γ(i) = 0,    j ≠ i,                                      (8.52)

where γ(i) = ∇f(x(i+1)) - ∇f(x(i)). Hence, using the Gram-Schmidt process,
we can write

    s(k+1) = -∇f(x(k+1)) + Σ_{j=1}^{k} β_j s(j),                    (8.53)

For quadratic functions, it can be shown that β_j = 0 for j < k and only one
term in the summation survives. Further

    β_k = ∇f(x(k+1))^T ∇f(x(k+1)) / ∇f(x(k))^T ∇f(x(k)).            (8.54)

It can be shown that for quadratic functions this method terminates in at most
n iterations, each involving one line search.
When this method is applied to a general non-quadratic function, it may
not terminate in n iterations. Hence, we can either continue to use (8.53) (with
only one term in the summation) for all k, or we can periodically reset s(k)
to the steepest descent direction. It turns out that in practice, continuing to
use (8.53) is a better alternative. Another possibility is to replace
(8.54) by another formula, which is equivalent for a quadratic function, but
gives better results for a general non-quadratic function. It turns out that the
following formula:

    β_k = (∇f(x(k+1)) - ∇f(x(k)))^T ∇f(x(k+1)) / ∇f(x(k))^T ∇f(x(k))           (8.55)

due to Polak and Ribiere is, in general, better than (8.54).
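A bare-bones Python sketch of the conjugate gradient iteration with the Polak-Ribiere update (8.55) is given below. The line search here is simply a library scalar minimiser rather than the more careful search a production implementation would use, and the starting point for the Rosenbrock test is the conventional (-1.2, 1), which is not necessarily the one used in Example 8.4.

    import numpy as np
    from scipy.optimize import minimize_scalar

    def conjugate_gradient_min(f, grad, x0, n_iter=50, tol=1e-8):
        """Conjugate gradient minimisation: (8.53) with the single surviving
        term and beta_k given by the Polak-Ribiere formula (8.55)."""
        x = np.asarray(x0, dtype=float)
        g = grad(x)
        s = -g                                  # first direction: steepest descent
        for _ in range(n_iter):
            a = minimize_scalar(lambda t: f(x + t * s)).x
            x = x + a * s
            g_new = grad(x)
            if np.linalg.norm(g_new) < tol:
                break
            beta = g_new @ (g_new - g) / (g @ g)   # Polak-Ribiere update (8.55)
            s = -g_new + beta * s
            g = g_new
        return x

    # usage: Rosenbrock's function, started from the conventional point (-1.2, 1)
    f = lambda x: 100*(x[1] - x[0]**2)**2 + (1 - x[0])**2
    grad = lambda x: np.array([-400*x[0]*(x[1] - x[0]**2) - 2*(1 - x[0]),
                               200*(x[1] - x[0]**2)])
    xmin = conjugate_gradient_min(f, grad, [-1.2, 1.0])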


The advantage of these methods over the direction set methods is that,
only n line searches are required to scan the set of n conjugate directions, as
opposed to n(n + 1) line searches for direction set methods. This is to be ex-
pected, since the direction set methods rely only on the function value, while
here we have function value plus n components of the gradient vector. In gen-
eral, the conjugate gradient methods are not as efficient as the quasi-Newton
methods, but the advantage here is that much less storage space is required,
since no matrix needs to be stored. Hence, for problems involving hundreds or
thousands of variables, this method may be the only one which can be used
because of the constraints imposed by computer memory. Further, unlike the
quasi-Newton methods the conjugate gradient methods will not converge to a
saddle point.
The conjugate gradient method is quite popular for solving a system of
linear equations Ax = b, where the matrix A is sparse. In this case, we use

    ∇f(x) = A^T (Ax - b).                                           (8.56)

The minimum along the direction s(k) is given by

    α = - s(k)^T ∇f / |A s(k)|²                                     (8.57)

Since the function is quadratic the conjugate gradient method should converge
in n iterations. However, in practice because of roundoff error more iterations
are required for convergence. The advantage of this method for sparse matrices
is that the only operation that is performed with the matrix is multiplication of the
matrix by some vector, which can be performed efficiently for a sparse matrix.
For the same reasons this algorithm can also be easily implemented on parallel
processors.
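A short Python sketch of this use of the method is given below, applied to the quadratic (1/2)|Ax - b|², whose gradient is (8.56); the step length is the exact minimum (8.57). Note that A enters only through products with vectors, which is what makes the method attractive for sparse matrices.

    import numpy as np

    def cg_least_squares(A, b, x0=None, n_iter=None, tol=1e-10):
        """Conjugate gradient iteration for f(x) = |Ax - b|^2 / 2, using the
        gradient (8.56) and the step length (8.57)."""
        m, n = A.shape
        x = np.zeros(n) if x0 is None else np.asarray(x0, dtype=float)
        if n_iter is None:
            n_iter = 2 * n              # roundoff usually needs a few extra sweeps
        g = A.T @ (A @ x - b)           # gradient (8.56)
        s = -g
        for _ in range(n_iter):
            As = A @ s
            denom = As @ As
            if denom == 0.0 or np.linalg.norm(g) < tol:
                break
            a = -(s @ g) / denom        # step length (8.57)
            x = x + a * s
            g_new = A.T @ (A @ x - b)
            beta = (g_new @ g_new) / (g @ g)   # Fletcher-Reeves type update
            s = -g_new + beta * s
            g = g_new
        return x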

8.7 Linear Programming


So far we have considered the problem of unconstrained optimisation, where the
independent variables can take any arbitrary values. In many practical optimi-
sation problems, there are constraints on the independent variables. However,
finding a minimum of a nonlinear function of several variables in the presence
of constraints is a difficult problem and we restrict ourselves to only linear
functions with linear constraints. Such problems are usually referred to as lin-
ear programming (LP) problems. Even if our problem is nonlinear, it may be
approximated by a linear model to make it more tractable. The principal ap-
plications of LP have been in industry, where it is used to determine the op-
timum allocation of resources subject to various constraints imposed by the
prevailing situation. Because of the economic importance of LP, a considerable
effort has been put in to develop sophisticated algorithms for solving large LP
problems on computers. Problems involving tens of thousands of variables and
thousands of constraints have been successfully solved. There is a vast amount
of literature available on this problem. Because of its importance an elaborate
quasi-economic terminology has been developed, which to some extent obscures
the basic numerical process from the non-experts. In numerical methods, the
principal application of LP is in functional approximation. As we will see in
Chapter 10, many problems of functional approximation can be formulated
as LP problems. Rabinowitz (1968) has discussed various applications of LP to
numerical analysis.
A general LP problem could be of the following form, where we have to
minimise the linear function

    f(x) = c_1 x_1 + c_2 x_2 + ... + c_n x_n,                       (8.58)

subject to primary constraints

    x_i ≥ 0,    i = 1, 2, ..., n_1,                                 (8.59)

and simultaneously subject to M = m_1 + m_2 + m_3 additional constraints of the
form

    a_{i1} x_1 + a_{i2} x_2 + ... + a_{in} x_n ≤ b_i,    (i = 1, 2, ..., m_1);

    a_{i1} x_1 + a_{i2} x_2 + ... + a_{in} x_n ≥ b_i,    (i = m_1 + 1, ..., m_1 + m_2);

    a_{i1} x_1 + a_{i2} x_2 + ... + a_{in} x_n = b_i,    (i = m_1 + m_2 + 1, ..., m_1 + m_2 + m_3).
                                                                    (8.60)
Here some of the variables are restricted to be positive, while others can range
over all real values consistent with other constraints. There is no restriction on
the total number of constraints M, but the number of equality constraints m_3
should generally be less than the number of variables n. Otherwise, it may not
be possible to find a solution to the problem.
The function f(x) is referred to as the objective function or sometimes
also as the cost function, while the coefficients Ci are referred to as costs. A set
of values x_1, x_2, ..., x_n that satisfies the constraints (8.59) and (8.60) is called
a feasible vector. The feasible vector which minimises the objective function
is called the optimal feasible vector. Each equality constraint in (8.60) defines
a hyperplane in the space and the intersection of all such hyperplanes gives
a region with effective dimension n - m_3 containing the feasible points. On
the other hand, the inequality constraints divide the space into two parts, one
of which is admissible. When all constraints are imposed, either we are left
with some feasible region, or else there are no feasible vectors. In the former
case, since the feasible region is bounded by hyperplanes, it is a kind of convex
polyhedron. An optimal feasible vector need not exist for all problems of the form
(8.58-8.60). The failure can occur either because there are no feasible vectors
(i.e., the constraints are incompatible), or because the objective function is not
bounded from below in the region covered by the feasible vectors. The latter
possibility can arise if the set of feasible vectors allows unbounded value for
some component, and the objective function tends to -∞.
It is obvious that a linear function of the form (8.58) cannot have a local
minimum, since the gradient vector is a constant. Thus, starting from any
feasible vector, we can go down the gradient (i.e., along -∇f) and the function
will keep reducing, until we hit one of the bounding planes when we cannot
proceed further. On the other hand, if no such boundary is encountered, then
the function can be reduced indefinitely in the feasible region and optimal
feasible vector does not exist. Hence, the solution cannot be in the interior
of the feasible region, it can only be at the boundary. At any point in one
of the bounding hyperplanes two possibilities exist. If the gradient vector is
perpendicular to the hyperplane, then the function value is constant along the
entire hyperplane. If the gradient vector is not perpendicular to the hyperplane,
then we can reduce the function by moving along the projection of (negative of)
gradient vector in the hyperplane. In the second case, once again we can either
reduce the function indefinitely, or we hit a boundary i.e., intersection of the
bounding hyperplane with another hyperplane. The process can be continued
until one of the following three possibilities are realised:

1. We end up at a vertex, which is the common point of intersection of
n bounding hyperplanes and it is not possible to continue further. This
vertex gives the optimal feasible vector which is unique in this case.
2. At some stage we find that the gradient vector is perpendicular to the
boundary, which could be defined by the intersection of several hyper-
planes. In this case, it is not possible to decrease the function along this
boundary. Such problems will have nonunique optimal feasible vector.
3. At some stage no boundary is encountered and the function can be reduced
to -∞. In this case, the optimal feasible vector does not exist.

Hence, we can see that if the optimal feasible vector exists, it can only be
one of the vertices of the bounding region, unless the solution is nonunique.
In that case, the entire bounding plane including the vertices will yield the
same minimum value. Thus, to find the minimum, we only have to consider
the vertices of the feasible region. However, the number of vertices will increase
almost exponentially with n and M, and it is not possible to apply the brute
force method of examining the function value at each of the vertices.
It may be noted that each vertex is a point of intersection of n hyperplanes
defined by the constraints. Hence, at each vertex, n of the constraints (including
x_i ≥ 0) are reduced to equality. Therefore, the problem of finding the minimum
reduces to a combinatorial problem of finding the combination of n constraints,
which are exactly satisfied by the optimal feasible vector. For small number of
variables, there may be no difficulty in obtaining the solution just by inspection.
The difficulty here is that, in practical problems the number of variables and
constraints could be very large. Several algorithms have been developed for
solving this problem. Probably, the most popular amongst these is the simplex
method developed by Dantzig and his coworkers in the late forties. There are
some deficiencies in this algorithm, such as its possible numerical
instability. Nevertheless, for almost all practical problems it has proved to be
fairly effective. As a result, the simplex method, in its original form or with
some modifications is probably the most widely used algorithm for solving LP
problems. In this section, we only describe the basic form of this algorithm and
refer the reader to extensive literature for further details. Computer programs
implementing simplex method are given by Wilkinson and Reinsch (1971) and
Press et al. (2007).
The simplex method for solution of LP problems requires the problem to
be put in the so-called standard form

    minimise  f(x) = c^T x,    subject to  Ax = b ≥ 0,   x ≥ 0,      (8.61)
where A is an m × n matrix and m ≤ n. Thus, here all variables are constrained
to be positive and further only equality type of constraints are allowed. The
requirement of b_i ≥ 0 is not a limitation, since if it is not satisfied, the cor-
responding equation can be multiplied by -1. A general linear constrained
optimisation problem can be easily transformed to this form, albeit with some
possible loss of efficiency. Most of these transformations involve inclusion of
additional variables. For example, all variables that are unbounded can be re-
placed by a difference of two new variables, e.g., x_i = y_1 - y_2, each of which
is constrained to be positive. Similarly, an inequality type constraint can be
converted to equality form by introducing a new variable referred to as a slack
variable. For example, a constraint of the form a^T x ≤ b can be rewritten as
a^T x + y = b, with y ≥ 0, while a constraint of the form a^T x ≥ b can be
transformed to a^T x - y = b. More general bounds of the form x_i ≥ l_i can be
incorporated by a shift of origin. Thus, the problem defined by (8.58-8.60) can
be reformulated as follows: Minimise
    f(x_1, ..., x_{n_1}, y_1, ..., y_{m_1+m_2}, z_1, ..., z_{2n-2n_1}) = c_1 x_1 + ... + c_{n_1} x_{n_1}
        + c_{n_1+1}(z_1 - z_{n-n_1+1}) + ... + c_n(z_{n-n_1} - z_{2n-2n_1}),
                                                                    (8.62)
382 Chapter 8. Optimisation

subject to constraints

    x_i ≥ 0,    (i = 1, ..., n_1);
    y_i ≥ 0,    (i = 1, ..., m_1 + m_2);                             (8.63)
    z_i ≥ 0,    (i = 1, ..., 2(n - n_1));

    Σ_{j=1}^{n_1} a_{ij} x_j + Σ_{j=1}^{n-n_1} a_{i,n_1+j}(z_j - z_{n-n_1+j}) + y_i = b_i,    (i = 1, 2, ..., m_1);

    Σ_{j=1}^{n_1} a_{ij} x_j + Σ_{j=1}^{n-n_1} a_{i,n_1+j}(z_j - z_{n-n_1+j}) - y_i = b_i,    (i = m_1 + 1, ..., m_1 + m_2);

    Σ_{j=1}^{n_1} a_{ij} x_j + Σ_{j=1}^{n-n_1} a_{i,n_1+j}(z_j - z_{n-n_1+j}) = b_i,    (i = m_1 + m_2 + 1, ..., m_1 + m_2 + m_3).
                                                                    (8.64)
After the solution is obtained, the values of the slack variables y_i can be ignored,
while the z_i can be used to obtain the values of the corresponding unbounded
variables in the original problem. This problem has 2n - n_1 + m_1 + m_2 variables
subject to m_1 + m_2 + m_3 equality constraints, apart from the requirement of
nonnegativity of all the variables. Hence, we see that the standard form is not
really a limitation.
For the LP problem in the standard form, the only inequality constraints
are those requiring the variables to be nonnegative. If all other constraints are
linearly independent, that is the matrix A has a rank m, then at each vertex at
least n - m components of x must vanish. It may be noted that the constraints
in (8.61) yield an underdetermined system of linear equations. If the rank of A
is m, then this system of linear equations has an infinite number of solutions, in
which n - m of the variables can be freely chosen. In particular, we can assume
these n - m variables to be zero, in which case the corresponding solution yields
a vertex of the feasible region, provided the nonzero x_i's are all positive. Such
feasible vectors are called basic feasible vectors and correspond to vertices of
the feasible region. Thus, choosing an arbitrary set of n - m components to be
zero may not give a basic feasible vector, since some of the nonzero components
may be negative. In fact, for problems with large number of variables, finding
a feasible vector itself may be a nontrivial problem. As we will see later on,
the simplex method itself can be used to find a basic feasible vector. Further, if
more than n - m components of a basic feasible vector are zero, then it is said to
be degenerate. This essentially means that the vertex is defined by intersection
of more than n hyperplanes.
Using the concept of basic feasible vectors, the Fundamental theorem of
Linear optimisation can be stated as: If an optimal feasible vector exists, then
there is a basic feasible vector which is optimal.
The basic idea of the simplex method is to start with some basic feasible
vector and every step of the algorithm consists of exchanging one zero variable
with a nonzero one, to get another basic feasible vector with a reduced value of
the objective function. The algorithm terminates when either we have reached
a stage, where it is not possible to reduce the objective function, or when it
turns out that the objective function can be reduced to -∞. In the former
case, we have found the optimal feasible vector, while in the latter case, such a
vector does not exist. The other cause of failure i.e., nonexistence of any feasible
vector can be detected at the first stage, where we have to find a basic feasible
vector to start the iteration.
For simplicity, if we assume that the first m columns of A are linearly
independent, then we can solve (8.61) for the first m variables in terms of the
remaining components, to get

    x_i = b_i - a_{i,m+1} x_{m+1} - ... - a_{in} x_n,    (i = 1, ..., m).       (8.65)

Thus, if b_i > 0, then we can get a basic feasible vector as (b_1, ..., b_m, 0, ..., 0).
Similarly, substituting (8.65) in the cost function, we can eliminate the first m
variables to get

    f(x) = z_0 + c̄_{m+1} x_{m+1} + ... + c̄_n x_n,
    z_0 = c_1 b_1 + c_2 b_2 + ... + c_m b_m.                          (8.66)

Here c̄_j, (j = m + 1, ..., n) are called the reduced cost coefficients. For the
basic feasible vector x* = (b_1, ..., b_m, 0, ..., 0), the value of the objective func-
tion f(x*) = z_0. In this procedure, the set of variables is partitioned in two
parts, the first set {x_1, x_2, ..., x_m} consisting of the dependent or left-hand
side variables which are associated with rows of the matrix A, while the second
set {x_{m+1}, x_{m+2}, ..., x_n} consisting of the independent or the right-hand side
variables associated with the columns of A.
Now, if all the reduced cost coefficients c̄_i are nonnegative, then any
allowed change in the independent variables {x_{m+1}, x_{m+2}, ..., x_n} cannot de-
crease the objective function. Hence, f(x) has a "local" minimum at x*, which
can be shown to be a global minimum over all the feasible vectors. Conversely,
if some c̄_i < 0, then we can increase the corresponding x_i to get a feasible
vector with a reduced cost function. Thus, a necessary and sufficient condition
for optimality of a basic feasible vector is that, all the reduced cost coefficients
are nonnegative.
Now let us assume that some c̄_k < 0; then we can increase x_k to decrease
f(x), and the more we increase x_k the greater the decrease in f(x). However,
x_k cannot be increased arbitrarily, since as we increase x_k the components
x_1, ..., x_m change and it may happen that one of these components becomes
negative and the vector will no longer be feasible. For a particular variable x_j,
1 ≤ j ≤ m, if a_jk ≤ 0, then any increase in x_k will not decrease x_j, and x_j
remains positive. Thus, such variables do not put any constraint on the value of
x_k. Hence, if a_jk ≤ 0 for all j (1 ≤ j ≤ m), then there will be no constraint on
x_k and the objective function can be reduced indefinitely. In such a situation,
no optimal feasible vector exists, which usually indicates that the LP problem
has not been formulated properly. For example, some constraint may have been
omitted. On the other hand, if for some j, a_jk > 0, then x_j will decrease as
x_k increases and it remains nonnegative only if x_k ≤ b_j/a_jk. Hence, all x_i,
(1 ≤ i ≤ m), will remain nonnegative as long as

    x_k ≤ min_{1≤j≤m, a_jk>0} (b_j / a_jk) = b_l / a_lk .           (8.67)

Now if we set x_k = b_l/a_lk, then x_l = 0 and we have a new basic feasible
vector. This process essentially exchanges one variable from the set of left-hand
variables, with another one in the set of right-hand variables.
If the minimum in (8.67) is not unique, then several components will be
reduced to zero during the transformation leading to degeneracy. By considering
a small perturbation, such ties can be resolved uniquely (Fletcher, 2000). It may
be noted that the requirement of nondegeneracy is essential here, since if one
of the first m components of the basic feasible vector is zero, it may not be
possible to increase x_k at all, if the corresponding a_jk > 0. In this case, we
can still exchange the two components, even though it does not result in any
change in the function value. However, in some cases, this exchange may lead
to cycling, i.e., the iteration will come back to the previous vertex and the
sequence keeps repeating {30}. Alternately, we can find some other component
x_k', such that x_k' can be increased to obtain a smaller value of f(x). In that
case, the degeneracy may be avoided. However, this may not be possible in all
cases. Another problem is that, in practical computations, because of roundoff
error it may be difficult to decide when one of the components is actually zero.
Opinion of the experts appears to be divided on the issue of cycling, while some
feel that in actual practice such cycling is rarely observed, others think it is
not that rare. Most good implementations of the simplex method take degeneracy
into account.
We shall now show how to write the new basic feasible solution in the
form (8.65), which will enable us to start a new iteration. With l defined by
(8.67), we can rewrite the lth equation in (8.65) as

    x_k = (1/a_lk) (b_l - a_{l,m+1} x_{m+1} - ... - a_{l,k-1} x_{k-1} - x_l - a_{l,k+1} x_{k+1} - ... - a_{ln} x_n),
                                                                    (8.68)

and substitute this value of x_k in other equations in (8.65) and into (8.66).
After some manipulations, we get the new system

(8.69)

for i = 1, ... , l - 1, k, l + 1, ... , m and the new objective function

(8.70)
8.7. Linear Programming 385

where, for i = 1, ..., m, i ≠ l and j = m + 1, ..., n, j ≠ k,

    b'_i = b_i - a_ik b_l / a_lk ,    a'_ij = a_ij - a_ik a_lj / a_lk ,    a'_ik = - a_ik / a_lk ,
    b'_l = b_l / a_lk ,               a'_lj = a_lj / a_lk ,                a'_lk = 1 / a_lk ,
    c̄'_j = c̄_j - c̄_k a_lj / a_lk ,   c̄'_l = - c̄_k / a_lk ,               z'_0 = z_0 + c̄_k b_l / a_lk .
                                                                    (8.71)
Since c̄_k < 0, b_l > 0 and a_lk > 0, we see that z'_0 < z_0, so that each iteration
reduces the value of f(x).
In practice, the calculations are usually arranged in the form of a tableau,
where one row and one column are added to the matrix A of (8.65). We can
define b_i = a_i0, c̄_j = a_0j and z_0 = -a_00. Further, the new column a'_ik can be
overwritten on a_ik itself and a record of the permutation giving which column
corresponds to which variable is kept. Similarly, the row a'_lj can be overwritten on
a_lj itself. Thus, at any iteration the set N = {1, 2, ..., n} is partitioned into two
sets, N = D ∪ V, where D = {P_1, ..., P_m} is the set of indices of the dependent
variables occurring on the left-hand side in (8.65), and V = {P_{m+1}, ..., P_n}
is the set of indices of the independent or the right-hand side variables. Here
{P_1, ..., P_n} gives an appropriate permutation of N. In the following description,
any index i actually refers to P_i (i = 1, ..., n); for example, x_l refers to x_{P_l}.

At the beginning of any iteration, we have the sets D and V in addition to the
matrix A = [aij], (i = 0, 1, ... , m; j = m + 1, . .. , n). The basic iteration of the
simplex method can be accomplished in the following steps:

1. Compute min_{m+1≤j≤n} a_0j = a_0k to find the most negative c̄_k. In case of
   ties, choose the first index k. If a_0k ≥ 0, terminate the iteration, in which
   case the optimal feasible vector is given by x_{P_i} = a_i0, (i = 1, ..., m), and
   x_{P_i} = 0, (i = m + 1, ..., n). The optimal value of the objective function
   is -a_00.

2. If a_0k < 0, compute min_{a_ik > 0} (a_i0 / a_ik) = a_l0 / a_lk. If all a_ik ≤ 0, then
   terminate the iteration with an indication that the objective function is
   unbounded from below.

3. Introduce P_k into D in place of P_l to yield D' and replace P_k by P_l in V
   to yield V'. Transform the matrix elements as follows:

       a'_ij = a_ij - a_ik a_lj / a_lk ,   a'_ik = - a_ik / a_lk ,      (i = 0, 1, ..., m,  i ≠ l);
       a'_lj = a_lj / a_lk ,               a'_lk = 1 / a_lk ,           (j = 0, m + 1, ..., n,  j ≠ k);
                                                                        (8.72)

   and return to step (1) with the updated matrix A and permutation vectors
   D' and V'.

This algorithm can be understood more clearly from Example 8.6
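The exchange step is compact enough to sketch in Python; the book's SIMPLX subroutine in Appendix B is a Fortran implementation of the same idea. In the sketch the tableau layout follows (8.65)-(8.66), with the first row holding -z_0 and the reduced costs and the first column holding the current values of the dependent variables, and the trailing lines apply it to the starting tableau of Example 8.6 below.

    import numpy as np

    def simplex_exchange(T, basis, nonbasic):
        """One exchange of the tableau algorithm (steps 1-3 above).
        T[0, 0] = -z0, T[0, 1:] = reduced costs, T[1:, 0] = values of the
        dependent variables, T[i, j] = a_ij of (8.65) for i, j >= 1.
        Returns False when the current vertex is already optimal."""
        k = 1 + int(np.argmin(T[0, 1:]))        # step 1: most negative reduced cost
        if T[0, k] >= 0:
            return False
        col = T[1:, k]
        if np.all(col <= 0):                    # step 2: unbounded objective
            raise ValueError('objective function is unbounded from below')
        ratios = np.full(col.shape, np.inf)
        ratios[col > 0] = T[1:, 0][col > 0] / col[col > 0]
        l = 1 + int(np.argmin(ratios))          # pivot row from the ratio test
        piv = T[l, k]
        basis[l - 1], nonbasic[k - 1] = nonbasic[k - 1], basis[l - 1]
        for i in range(T.shape[0]):             # step 3: transform, cf. (8.72)
            if i == l:
                continue
            factor = T[i, k]
            T[i, :] -= factor * T[l, :] / piv
            T[i, k] = -factor / piv
        T[l, :] /= piv
        T[l, k] = 1.0 / piv
        return True

    # starting tableau of Example 8.6: rows (f, x3, x4, x5), columns (b, y, x1, x2)
    T = np.array([[0., 1., 1., -1.],
                  [1., 2., -1., 1.],
                  [3., 6., -1., 1.],
                  [4., 1., 1., -1.]])
    basis, nonbasic = ['x3', 'x4', 'x5'], ['y', 'x1', 'x2']
    while simplex_exchange(T, basis, nonbasic):
        pass
    print(-T[0, 0], dict(zip(basis, T[1:, 0])))   # minimum -1 with x2 = 1, x4 = 2, x5 = 5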


To start this iteration, we need one basic feasible vector and as mentioned
earlier it may not be trivial to find one. In fact, such a vector may not even
exist. It turns out that the simplex algorithm itself can be used to find such a
vector. For this purpose, we introduce some more variables Yi, (i = 1, ... , m)
referred to as artificial variables, and consider the following LP problem:
    minimise  Σ_{i=1}^{m} y_i ,                                     (8.73)

subject to the constraints


    Ax + I_m y = b,    x ≥ 0,    y ≥ 0,                             (8.74)
where I_m is the m × m identity matrix. This problem has a basic feasible
solution y = b, x = 0, so that we can apply the simplex algorithm. Since the
objective function is bounded from below by 0, this LP problem must have an
optimal feasible vector. Hence, the algorithm should terminate at the optimal
solution. If this optimal feasible vector has any component of y nonzero, then
it implies that there is no feasible vector for our original problem. For if a
feasible vector existed, then (x, y) with y = 0 would be a feasible solution of
the new problem, with the value of the objective function equal to zero, while the
algorithm terminated with a positive value of the objective function. On the
other hand, if the optimal feasible vector for this problem has y = 0 and if it
is nondegenerate, then exactly m components of x will be nonzero. Hence, this
solution will give the required basic feasible vector for the original problem.
Thus, a given LP problem can be solved in two phases; in the first phase an
auxiliary LP problem is solved to obtain a basic feasible vector for the actual
problem, and in the second phase the actual problem is solved. Both these
phases can be accomplished using the same simplex algorithm.
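Continuing the Python sketch, the phase-one tableau for the auxiliary problem (8.73)-(8.74) can be set up as follows; if the optimal auxiliary objective -T[0, 0] comes out zero, the final basis provides the required basic feasible vector. The subroutine SIMPLX described in the next paragraph organises the two phases somewhat differently, carrying the real objective as an additional row.

    import numpy as np

    def phase1_tableau(A, b):
        """Tableau for the auxiliary problem (8.73)-(8.74): the artificial
        variables y_i = b_i - sum_j a_ij x_j form the initial basis, and the
        auxiliary objective sum_i y_i is written as z0 + sum_j cbar_j x_j."""
        A, b = np.asarray(A, dtype=float), np.asarray(b, dtype=float)
        m, n = A.shape
        T = np.zeros((m + 1, n + 1))
        T[1:, 0], T[1:, 1:] = b, A              # rows: y_i = b_i - sum_j a_ij x_j
        T[0, 0] = -b.sum()                      # a_00 = -z0, with z0 = sum_i b_i
        T[0, 1:] = -A.sum(axis=0)               # reduced costs of sum_i y_i
        basis = ['y%d' % (i + 1) for i in range(m)]
        nonbasic = ['x%d' % (j + 1) for j in range(n)]
        return T, basis, nonbasic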
The calculations can be organised to include the auxiliary objective func-
tion for determining the initial basic feasible vector as the (m + l)th row of
matrix A. In the first phase, the last row is used to get the reduced cost co-
efficients, while the transformation is applied to the entire matrix. At the end
of the first stage, if a basic feasible vector has been found, then the columns
corresponding to the artificial variables can be discarded to obtain the required
tableau for the given problem. At this stage, the last row containing information
about the auxiliary objective function can also be discarded. This procedure
is implemented in subroutine SIMPLX in Appendix B. This subroutine is a
crude implementation of simplex algorithm, which does not take into account
degeneracy or possible roundoff error. It has a parameter AEPS, which is used
to determine the roundoff limit, i.e., any variable less than AEPS in magnitude
can be assumed to be zero. If different components of the vectors are varying
over several orders of magnitude, it may be difficult to give any value for AEPS.
For some problems with degeneracy this subroutine may get into a nontermi-
nating loop, but because of the limit on the maximum number of iterations it
will exit with an error message.

Figure 8.3: Optimisation with constraints

EXAMPLE 8.6: Minimise f(x, y) = x + y, subject to the following constraints

    y ≥ 0,    2y - x ≤ 1,    6y - x ≤ 3,    y + x ≤ 4.               (8.75)

Since there are only two variables, it is convenient to give a geometric interpretation
of the problem. The four inequality constraints force the feasible points to be in the shaded
region in Figure 8.3. Minimising f(x, y) subject to these constraints is equivalent to finding
that line of slope -1 having the smallest y-intercept, and still intersecting the feasible region. It
is clear from Figure 8.3 that the required line is y + x = -1, intersecting the feasible region at
(-1, 0). Alternately, from the fundamental theorem of linear programming, we know that the
minimum can only be at one of the four vertices of the feasible region. Computing the value
of the objective function at each of the vertices, we can easily see that the minimum occurs
at (-1, 0). If instead of minimising we wish to maximise the function, then following similar
arguments it can be easily seen that the maximum value of four is attained at any point on
the boundary defined by the last constraint. The function attains its maximum value at the
vertices (4, 0) and (3, 1), and hence at all points on the straight line connecting these two
points. Therefore, the point at which the function assumes its maximum value is not unique.
If the second constraint is changed to 6y - x ≤ -5, then it can be seen that there will be
no feasible vectors. While if the last constraint is omitted, then the function is bounded from
below, but not from above and hence the maximum does not exist. On the other hand, if the
first constraint, i.e., y ≥ 0, is omitted, then the function is unbounded on both sides. These
examples illustrate various reasons for nonexistence or nonuniqueness of the solution to LP
problems.
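The conclusions of this paragraph are easy to check numerically. For instance, using scipy as a readily available LP solver (an assumption of this sketch, not a library used by the book's programs), the problem can be passed directly to a library routine:

    from scipy.optimize import linprog

    # minimise x + y subject to  2y - x <= 1,  6y - x <= 3,  x + y <= 4,  y >= 0
    # variables ordered as (x, y); x is unrestricted in sign
    res = linprog(c=[1, 1],
                  A_ub=[[-1, 2], [-1, 6], [1, 1]],
                  b_ub=[1, 3, 4],
                  bounds=[(None, None), (0, None)])
    print(res.x, res.fun)    # approximately (-1, 0) and -1, as found geometrically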
Now we will solve the problem using the simplex method . For this purpose, we first
transform the problem to the standard form by adding extra variables. Defining x = x_1 - x_2
and using x_3, x_4, x_5 as slack variables to take care of the three inequality constraints, we get
the following LP problem in the standard form:

    Minimise  f(y, x_1, x_2, x_3, x_4, x_5) = y + x_1 - x_2,

    subject to  y ≥ 0,   x_i ≥ 0,  (i = 1, ..., 5),
    2y - x_1 + x_2 + x_3 = 1,   6y - x_1 + x_2 + x_4 = 3,   y + x_1 - x_2 + x_5 = 4.
                                                                     (8.76)

This problem has six variables and three equality constraints. The equations for constraints
can be easily solved for x_3, x_4, x_5 in terms of y, x_1, x_2, to give

    x_3 = 1 - 2y + x_1 - x_2,   x_4 = 3 - 6y + x_1 - x_2,   x_5 = 4 - y - x_1 + x_2.   (8.77)



Hence, we can start the simplex iteration with basic feasible vector (0,0,0,1,3,4). It may
be noted that, this vector does not correspond to any of the vertices in Figure 8.3, that is
because we have introduced extra variables. We can obtain the following tableau from these
equations:
          y    x1    x2
 f    0    1     1    -1
 x3   1    2    -1     1
 x4   3    6    -1     1
 x5   4    1     1    -1
It can be seen that only one reduced cost coefficient, corresponding to x_2, is negative and the
coefficients corresponding to x_3 and x_4 in that column are positive. Thus, these variables will
be reduced when x_2 is increased. It can be easily seen that for x_2 = 1, x_3 will be reduced to
zero, thus fixing the extent to which we can increase x_2. This information can be obtained
from the tableau by finding the minimum of a_i0/a_i3 over positive values of a_i3. Thus, the
variables x_2 and x_3 should be exchanged to get the new tableau using (8.72):

          y    x1    x3
 f    1    3     0     1
 x2   1    2    -1     1
 x4   2    4     0    -1
 x5   5    3     0     1
Here it can be seen that all the reduced cost coefficients are nonnegative and the function
cannot be reduced further, which gives the optimal feasible vector (0, 0, 1, 0, 2, 5). Discarding
the last three components corresponding to the slack variables, we get y = 0, x = x_1 -
x_2 = -1 as the optimal point at which the function value is -1. However, it can be seen
that the reduced cost coefficient corresponding to x_1 is zero and further all the entries
in that column are nonpositive. Hence, this variable can be increased arbitrarily without
affecting the objective function. Thus, the general solution for the optimal feasible vector is
(0, x_1, x_1 + 1, 0, 2, 5), but it can be seen that it gives a unique value for the required variables
x and y.
If we want to maximise the function, then we can consider the objective function
f = -y - x_1 + x_2 for minimising. Once again, we can start with the same starting vector,
and the corresponding tableau can be obtained by changing the signs of the entries in the first
row. Now, there are two negative entries in the first row, and either of these could be used for
exchanging variables. It can be seen that if we choose to exchange x_1, then we arrive at the
solution. But to illustrate the problems caused by degeneracy we choose to exchange y. It
can be seen that min(a_i0/a_i1) is not unique and both x_3 and x_4 will be reduced to zero. We
decide to exchange y with x_3 to get the following sequence of tableaux:
          x3    x1    x2                x3    x4    x2                x5    x4    x2
 f   1/2   1/2  -3/2   3/2       f   1/2  -7/4   3/4    0       f    4    1     0     0
 y   1/2   1/2  -1/2   1/2       y   1/2  -1/4   1/4    0       y    1   1/7   1/7    0
 x4    0    -3     2    -2       x1    0  -3/2   1/2   -1       x1   3   6/7  -1/7   -1
 x5  7/2  -1/2   3/2  -3/2       x5  7/2   7/4  -3/4    0       x3   2   4/7  -3/7    0
It can be seen that after the first exchange the only negative coefficient in the first row
corresponds to x_1. However, x_1 cannot be increased, since increasing x_1 will cause x_4 to
become negative. Hence, we are forced to exchange x_1 with x_4 without any change in the
function value to get the next tableau. Now it can be seen that the only negative value in the
first row corresponds to x_3 and further, there is only one positive entry in that column.
Hence, we exchange x_3 with x_5 to get the final tableau. Now all the reduced cost coefficients
are nonnegative and the function cannot be reduced further, which gives the optimal vector
(1, 3, 0, 2, 0, 0) corresponding to x = 3 and y = 1. The corresponding function value is -4,
and hence the maximum value of our original function is 4. It can be seen that two of the
reduced cost coefficients are zero and the corresponding variables can be increased without
affecting the function value. Increasing x_2 does not really lead to any new solution, since
x = x_1 - x_2 remains fixed. Increasing x_4 gives a different optimal vector and it can be seen
from the tableau that x_4 can be increased to 7 giving y = 0 and x_1 = 4, which corresponds
to the other vertex in Figure 8.3.

For some problems considerable effort can be saved by using the so-called
duality theorem, which relates two LP problems. Consider the LP problems

    A:  minimise  c^T x    subject to  x ≥ 0,  Ax ≥ b;
    B:  maximise  y^T b    subject to  y ≥ 0,  y^T A ≤ c.            (8.78)

They are called dual problems because of the many interesting relationships
between them, such as the following:

1. If either problem has a solution, then the other does and further the
minimum of c^T x equals the maximum of y^T b.

2. If the solution of either problem is found using the simplex algorithm,


then the solution for the other problem can be obtained by taking the
slack variables in order and assigning those in the final basis (left-hand
variables) the value zero, and giving each of the others the corresponding
value in the first row of the tableau.

Thus, the actual variables of one problem are the slack variables of the dual. To
solve a LP problem we can apply the simplex algorithm to either the problem
or its dual depending on which is easier to solve. In many cases, the difference
could be considerable.
The tableau form of simplex algorithm is to a large extent superseded by
the more efficient revised simplex method, which uses matrix factorisation with
the inverse being updated at every step. This technique suffers from roundoff
error and a more stable form has been suggested. The simplex algorithm has
proved to be very successful in practice, although in principle, we can find
examples, where the time required increases exponentially with n, the number of
variables {31}. In practice, the time required increases almost linearly with n. It
has been shown that LP problems can be solved in polynomial-time, that is the
time required for solution should increase as some power of n. Karmarkar (1984)
has given a polynomial-time algorithm for LP problems, which is claimed to be
several times faster than the simplex method.
If some of the expressions for the constraints or the objective function are
nonlinear in the variables, then we get a nonlinear programming problem. Several
methods have been developed for treating such problems, but that is beyond our
scope. In the simplest cases, we can try to eliminate some variables to remove
the constraints. Alternately, we can use the method of Lagrange's multipliers.
Another class of optimisation problems are combinatorial problems discussed
in the next section.

8.8 Simulated Annealing


The method of simulated annealing has been fairly successful in solving com-
binatorial minimisation problems where the aim is to find a permutation that
minimises the given function. Thus, instead of continuous variables the function
is defined only over a discrete but very large set of permutations. Since the
number of permutations increases factorially, it is not possible to use the brute
force method of comparing all possibilities to determine the minimum. This
method has also been applied to continuous problems and has been reasonably
effective in finding the global minimum. Although it can be shown that asymptot-
ically this method will yield the global minimum, that will require infinite
computational effort. With any finite computational effort the convergence to
the global minimum is not guaranteed, but experience suggests that it has been
fairly successful. This technique is also known as the Metropolis algorithm.
This technique derives its name from the analogy with the process of
annealing in solids. In condensed matter physics annealing is the process in
which a solid is heated to a certain temperature, so that it melts and the molecules
are randomly distributed. After that the material is cooled very slowly to the
required temperature. It is found that if the cooling is sufficiently slow the
atoms arrange themselves in ordered form of a crystal which is the minimum
energy state. On the other hand, if the material is cooled rapidly, then we
get an amorphous or polycrystalline state with higher energy. This process is
referred to as quenching as opposed to annealing. This is similar to the problem
of minimisation, where we generally end up with a local minimum, rather than
the global minimum (i.e., the crystal). The basic idea in annealing is that at
each temperature the system should be in thermal equilibrium, with energy E
given by the Boltzmann distribution

    P(E) = exp(-E/(k_B T)) / Z(T),                                  (8.79)

where P(E) dE is the probability that energy is between E and E + dE,


kB is the Boltzmann constant and Z(T) is the partition function. The factor
exp(-E/(k_B T)) is known as the Boltzmann factor. As the temperature de-
creases, the energy distribution concentrates in the states with lowest energy
and as the temperature approaches zero only the minimum states have nonzero
probability of occurrence.
The basic idea of simulated annealing is to start with some configuration
and go on finding a 'neighbouring' state which is chosen randomly. Further, this
new state is accepted if the required function value is lower than the previous
value like any other minimisation algorithm. The crucial difference is that even
if the new function value is higher, it is not always rejected, but instead it is
accepted with a probability of P(E) as defined by (8.79). In order to use this
we need to define a temperature of the system. In general, the temperature
is defined such that kB = 1. In practical applications we generate a random
number and check if its value is less than exp(-δf/T), where δf is the change

in function value. If the random number is less we accept the new configuration
even though the function value has increased, otherwise the new configuration
is rejected. The advantage of this approach is that when temperature is suffi-
ciently high the process can get out of a local minimum. The temperature is
slowly decreased and the process is continued, until the temperature becomes
so small that no changes are allowed. At this stage we have found the minimum
of the energy function. If the temperature is sufficiently high to start with and
is reduced sufficiently slowly then the algorithm will converge to a global mini-
mum, but in other cases it will still converge to a local minimum. Nevertheless,
with some care it is possible to get global minimum with some reliability. We
will give only a brief description of this technique, for more details readers
can consult Laarhoven & Aarts (2010). The working of this technique can be
understood from Example 8.7
In practical implementation of simulated annealing, we have to find a suit-
able definition of temperature. In the early steps this temperature should be
large enough to allow almost all transition, so that the system has opportunity
to span the parameter space. This temperature should be slowly reduced as
the calculations proceed. The initial value of the temperature T_0 should be such
that for virtually all transitions exp(-δf/T_0) ≈ 1. A simple way to get this
value is to start with some value of T_0 and try out a number of transitions. If
the acceptance ratio χ, defined as the number of accepted transitions divided
by the number of trials, is less than a given value χ_0, then double the value
of T_0, otherwise accept the trial value of T_0. If the value of T_0 is increased, try
this procedure again, until the value is acceptable. A value of χ_0 = 0.8 may be
reasonable for this purpose.
Apart from initial temperature, we will also need to choose the final value,
when we decide that the algorithm has converged to the required minimum. In
general, it is difficult to choose the convergence criterion, but if we find that
there is no improvement in function value, that is, the function value does not
decrease in the last few sets with different temperatures, then one can decide
to stop. Alternately, if we find that none of the transitions have been accepted
for a few successive values of temperature, then we can stop the process. Of
course, none of these criteria can ensure that the function would not decrease
in one of the subsequent attempts. Besides we also need some prescription to
decrease the temperature. There are two alternatives, either one can decrease
the temperature slowly before each attempted transition, or we can keep it
fixed for a certain number of attempts and then decrease it by a reasonable
amount. The second approach is simpler. Thus, after a certain number of trials
we can decrease the temperature by a constant factor α < 1. The value of α
can range from 0.5 to 0.99. At each temperature we can either make a fixed
number of attempts or we can keep trying until a fixed number of transitions
have been accepted. The problem with latter approach is that as we decrease
the temperature, it may be difficult to find any transition that is acceptable.
Thus, it may be better to put some maximum limit on number of trials at each
temperature. At higher temperature we can decrease the temperature after a

fixed number of transitions have been accepted. The actual numbers will of
course, depend on the problem and should be of order of some polynomial in
number of parameters.
The choice of transition at every step has to be made randomly, but it
may be better if it can be controlled such that at higher temperatures we tend to
generate transitions that are expected to make a large difference to the function
value, while as the temperature decreases the transitions should be chosen so as to
make smaller differences, since larger differences would most likely be rejected.
In general, we can choose transitions such that the expected δf ~ T.
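The whole prescription fits in a short Python sketch. The function names, the default parameter values and the neighbour-generating routine below are placeholders to be adapted to the problem at hand; the Fortran program actually used for Example 8.7 below is the one included in Appendix B.

    import math
    import random

    def simulated_annealing(f, x0, neighbour, T0=1.0, alpha=0.95,
                            trials_per_T=1000, accepts_per_T=100, T_min=1e-3):
        """Bare-bones simulated annealing loop following the prescription above:
        geometric cooling by the factor alpha, a fixed maximum number of trials
        at each temperature, and the Metropolis test for uphill moves.
        neighbour(x) must return a randomly chosen neighbouring configuration."""
        x, fx = x0, f(x0)
        best, fbest = x, fx
        T = T0
        while T > T_min:
            accepted = 0
            for _ in range(trials_per_T):
                y = neighbour(x)
                fy = f(y)
                df = fy - fx
                # accept downhill moves always, uphill moves with probability exp(-df/T)
                if df < 0 or random.random() < math.exp(-df / T):
                    x, fx = y, fy
                    accepted += 1
                    if fx < fbest:
                        best, fbest = x, fx
                if accepted >= accepts_per_T:
                    break
            T *= alpha                   # lower the temperature
        return best, fbest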
Simulated annealing has been applied successfully to a number of prob-
lems, the classic example is the travelling salesman problem (Bonomi & Lutton
1984; Press et al. 2007). This problem is known to belong to the class of prob-
lems known as NP-complete problems. Computational effort for exact solution
of these problems increases exponentially with the size N. Simulated annealing
is able to find a good approximation to the solution of the travelling salesman
problem in reasonable time. It has also been applied in computer circuit design
problems as well as other design problems where the basic problem is to choose
a configuration which gives optimal performance. Here we will consider another
application which is relevant for this book.
EXAMPLE 8.7: For the Titanium data in {4.27} find the combination of 12 knots such that
the maximum error in spline approximation over the entire range is minimised.
The data consists of 49 points (x_i, y_i) and we want to select 12 of these knots as abscis-
sas for cubic spline approximations such that the maximum error in this approximation over
the entire interval (585, 1065) is minimised. Since we wish to approximate the function over the en-
tire interval, we would expect the end points to be the abscissas, thus we only wish to find the
remaining 10 knots in the interior. Thus we define an array J_i, i = 1, ..., 10 such that 2 ≤ J_i ≤ 48
and J_i < J_{i+1}. The knots for cubic spline approximation are x_1, x_{J_1}, x_{J_2}, ..., x_{J_10}, x_49. We
calculate the spline coefficients using subroutine SPLINE in Appendix B and then the inter-
polated value is calculated over a set of 501 points covering the entire interval using function
SPLEVL. These values are compared with the 'exact' values evaluated using the spline with all 49
knots. The maximum difference between these sets of values is the function f(J_i) that we wish
to minimise.
We use simulated annealing to minimise this function. Starting with a temperature of 1,
we reduce the temperature by a factor of 0.95 at each step. At each step we try a maximum of
10000 transitions. If 500 transitions are accepted at any step we go to the next temperature.
The number of trials at each step is much smaller in the beginning, but as the temperature
reduces the number of successful transitions decreases and we have to make 10000 attempts at
each temperature. The final value is accepted when there is no reduction in the function value
and the temperature T < 0.01 f_0, where f_0 is the function value at the minimum.
To implement simulated annealing we also need an algorithm to select a transition from
one state to the next. At high temperatures we want to try out large changes and in this
case we change all 10 indices J_i simultaneously. The perturbation is selected randomly and
we change each index by −1, 0 or +1. Of course, we also need to ensure that the
indices remain in ascending order; the indices may have to be shifted to maintain the ascending
order. When the temperature has reduced below 0.05, we do not wish to make large changes
in the function and an alternative strategy is used to select the transition. In this case only one
of the indices is perturbed in the manner outlined above; the choice of which index to perturb
is made randomly.
With this choice the program converges to the following set of knots: 585, 685, 805,
855, 865, 875, 895, 905, 925, 975, 1005, 1065, and the maximum error in the approximation is
0.024. It is not clear if this is indeed the optimum choice, but it is certainly among the better choices.
[Figure 8.4 appears here: the data values plotted against T over the range 600–1000.]

Figure 8.4: Optimum choice of knots for the Titanium problem. The large circles denote
the 12 knots chosen to get minimum errors, while small circles represent the other points in
the data set. The dotted curve centred at 0.4 is the error in approximation multiplied by a factor
of 10.
The resulting function is shown in Fig. 8.4. It can be seen that in the region covered
by the peak most of the points in the data set have been chosen as knots, while in the almost
flat region on either side only one intermediate knot is required. Attempts with different seeds
for the random numbers give the same value, which strengthens the case for it being a global
minimum. Even if we relax the condition that the end points are knots, and try to select
all 12 knots using simulated annealing, we arrive at the same result, that is, both the end
points have to be included as knots to get optimal results. Simulated annealing may not be the
most efficient method of solving this problem, but we have applied it for illustration. The Fortran
program used to solve this problem is included in Appendix B.

So far we have considered the application of simulated annealing to combinatorial
minimisation problems where the variables are discrete. But this technique
can also be applied to optimisation problems over continuous variables,
as considered in this chapter (Vanderbilt & Louie 1984). In principle, there is no
difficulty in applying the technique of simulated annealing described above to
continuous variables. We can choose the transition randomly in the space of
variables, with the step being accepted or rejected as per the criterion described
earlier. The main problem is the choice of the optimal magnitude and direction
of the step in the parameter space. If we select a very small step the process will be inefficient,
while if the step is too large then most of the attempts will be rejected. Similarly,
if we are in a highly anisotropic valley, most of the steps which explore
directions perpendicular to the axis of the valley will be rejected. In such cases
we will need to make many more trials to find a direction where the function
value reduces. Further, the step length should shrink as the temperature decreases.
It turns out that the function values which are available can be used to guide
us in making these choices, as described by Vanderbilt and Louie (1984).
A natural choice of random steps in n variables x_i is to call a random
number generator n times to generate the numbers (u_1, u_2, ..., u_n), where each
u_i is chosen independently and is distributed with zero mean and unit variance.
We then choose the step Δx according to

    \Delta x = Q u,    (8.80)

where the matrix Q controls the step distribution. The simplest choice, Q =
aI, where I is the unit matrix, generates an isotropic distribution with average
step length a√n. The covariance matrix of these steps is given by S = QQ^T.
Thus for a given covariance matrix we can obtain the required matrix Q by
Cholesky decomposition and then use (8.80) to generate the Δx_i. The choice
of temperature for simulated annealing is made as explained earlier. We keep
a constant temperature for a fixed number of attempted transitions. Using the
function values obtained in this process during the previous step we can obtain
the covariance matrix. Considering the M steps at the previous temperature that
were found to be acceptable, we define the first and second moments of the random
walk segment,

    A_i = \frac{1}{M} \sum_{m=1}^{M} x_i^{(m)}, \qquad
    S_{ij} = \frac{1}{M} \sum_{m=1}^{M} \left(x_i^{(m)} - A_i\right)\left(x_j^{(m)} - A_j\right).    (8.81)

We can also multiply this matrix by a constant factor to allow an increase in the
size of region searched during subsequent step. With this choice of covariance
matrix, the shape of the allowed region is approximately mapped and search
is more efficient in locating promising directions where the function is likely to
decrease.
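As an illustration of this step generation, the following Fortran sketch factorises the covariance matrix S of Eq. (8.81) by Cholesky decomposition, S = QQ^T, and then forms a step Δx = Qu as in Eq. (8.80). Here the u_i are taken to be uniformly distributed on [−√3, √3], which has zero mean and unit variance; the routine name is only illustrative and S is assumed to be positive definite.

      ! Sketch: generate a trial step dx = Q u, where S = Q Q**T is the
      ! covariance matrix estimated from the accepted steps (Eq. 8.81).
      subroutine gen_step(n, s, dx)
        implicit none
        integer, intent(in) :: n
        real, intent(in) :: s(n, n)       ! covariance matrix S (assumed positive definite)
        real, intent(out) :: dx(n)        ! correlated random step
        real :: q(n, n), u(n)
        integer :: i, j, k
        ! Cholesky decomposition S = Q Q**T with Q lower triangular
        q = 0.0
        do j = 1, n
          q(j, j) = s(j, j)
          do k = 1, j - 1
            q(j, j) = q(j, j) - q(j, k)**2
          end do
          q(j, j) = sqrt(q(j, j))
          do i = j + 1, n
            q(i, j) = s(i, j)
            do k = 1, j - 1
              q(i, j) = q(i, j) - q(i, k)*q(j, k)
            end do
            q(i, j) = q(i, j)/q(j, j)
          end do
        end do
        ! independent random numbers with zero mean and unit variance
        ! (uniform on [-sqrt(3), sqrt(3)])
        call random_number(u)
        u = (2.0*u - 1.0)*sqrt(3.0)
        ! dx = Q u
        do i = 1, n
          dx(i) = 0.0
          do j = 1, i
            dx(i) = dx(i) + q(i, j)*u(j)
          end do
        end do
      end subroutine gen_step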
The method of simulated annealing is highly inefficient as compared to
the methods described in Sections 8.5 and 8.6. However, it is more likely to find
a global minimum, while the regular minimisation techniques tend to get stuck
in a local minimum. We can try several different starting values with these
minimisation techniques and choose the solution that gives the minimum value of
the function, but even then it is not likely that the global minimum will be found. Another
advantage of simulated annealing is that it is straightforward to restrict the
solution to a given region: while generating a new configuration we can easily choose
only those which are within the required region. This is difficult to ensure in
the iterative techniques as the iteration may go outside the region. As a result,
simulated annealing may be useful when there is a large number of local minima
in the function or when there are constraints on the parameter space. This
situation arises in nonlinear fitting problems, when the number of data points
and parameters is large. Because of errors in the data to be fitted a large number of
local minima may be introduced in the function and it may be difficult to find
the global minimum that we are interested in. In such cases simulated
annealing is more likely to succeed in finding the global minimum. However,
the annealing schedule needs to be chosen carefully. We have to allow a sufficient
number of trials at each temperature and the temperature should be decreased
so that the decrease in function value keeps pace with the decreasing temperature.
All this will require considerable experimentation. Simulated annealing is not
useful in those cases where a reasonable guess for the minimum is known, as even if
one starts with a good guess, this technique will wander around the parameter
space before finding the minimum. Further, if the function has a very narrow dip
around the minimum, it will be difficult to detect it using random values and in
such cases simulated annealing is not likely to succeed {36}. For such problems
quasi-Newton methods are more likely to succeed, as with the knowledge of
derivatives they will be able to locate the dip more effectively.
Another technique which has been fairly successful in finding global minima
is the genetic algorithm, which is again a heuristic search technique inspired
by the biological process of evolution by natural selection. The basic strategy in this
technique is similar to simulated annealing, except that the choice of transition
is made by 'breeding'. We will not describe this algorithm further and readers
can refer to Goldberg (1989) for more details. Like simulated annealing this
technique is also fairly inefficient and should be used only when a good initial
guess for the minimum is not available and a large number of local minima are
expected.

Bibliography
Acton, F. S. (1990): Numerical Methods That Work, Mathematical Association of America.
Bonomi, E. and Lutton, J.-L. (1984): The N-city Travelling Salesman Problem: Statistical
Mechanics and the Metropolis Algorithm, SIAM Rev., 26, 551.
Brent, R. P. (2002): Algorithms for Minimization Without Derivatives, Dover, New York.
Chong, E. K. P. and Zak, S. H. (2008): An Introduction to Optimization, (3rd ed.) Wiley-
Interscience.
Dahlquist, G. and Bjorck, A. (2003): Numerical Methods, Dover, New York.
Dantzig, G. B. (1998): Linear Programming and Extensions, Princeton University Press,
Princeton, New Jersey.
Fletcher, R. (2000): Practical Methods of Optimization, (2nd ed.), John Wiley, Chichester.
Goldberg, D. E. (1989): Genetic Algorithms in Search, Optimization and Machine Learning,
Addison-Wesley, Reading.
Karmarkar, N. (1984): A New Polynomial-time Algorithm for Linear Programming, Combi-
natorics, 4, 373.
Laarhoven, P. J. M. van, and Aarts, E. H. L. (2010): Simulated Annealing: Theory and
Applications, D. Reidel, Dordrecht.
Nelder, J. A. and Mead, R. (1965): A Simplex Method for Function Minimization, Computer
J., 7, 308.
Powell, M. J. D. (1973): On Search Directions for Minimization Algorithms, Math. Prog., 4,
193.
Press, W. H., Teukolsky, S. A., Vetterling, W. T. and Flannery, B. P. (2007): Numerical
Recipes: The Art of Scientific Computing, (3rd ed.) Cambridge University Press, New
York.
Rabinowitz, P. (1968): Applications of Linear Programming to Numerical Analysis, SIAM
Rev., 10, 121.
Rabinowitz, P. (ed.) (1970): Numerical Methods for Nonlinear Algebraic Equations, Gordon
and Breach, London.
Ralston, A. and Rabinowitz, P. (2001): A First Course in Numerical Analysis, (2nd Ed.)
Dover.
Stoer, J. and Bulirsch, R. (2010): Introduction to Numerical Analysis, (3rd Ed.) Springer-
Verlag, New York.
Vanderbilt, D. and Louie, S. G. (1984): A Monte Carlo Simulated Annealing Approach to
Optimization over Continuous Variables, J. Comp. Phys., 56, 259.
Wilkinson, J. H. and Reinsch, C. (1971): Linear Algebra: Handbook for Automatic Compu-
tation, Vol. 2, Springer-Verlag, Berlin.

Exercises
1. For a simple minimum (i.e., f'' ≠ 0) of a function of a single variable, show that the
truncation error in parabolic interpolation using three distinct points is given by

    \epsilon_{i+1} \approx \frac{f'''}{6 f''}\, \epsilon_{i-1} \epsilon_{i-2}.

Hence, show that the order of convergence is given by the largest root of α^3 − α − 1 = 0.
2. For a simple minimum, show that the truncation error in cubic Hermite interpolation
based on two distinct points is given by

    \epsilon_{i+1} \approx -\frac{f^{(4)}}{12 f''}\, \epsilon_i^2 \epsilon_{i-1}.

Hence, show that the order of convergence of this iteration is given by the larger root of
α^2 − α − 2 = 0.

3. Find all local minima of the following polynomials and deduce the global minimum:
(i) x^6 − 36x^5 + 450x^4 − 2400x^3 + 5400x^2 − 4320x + 720 (Laguerre polynomial)
(ii) 676039x^12 − 1939938x^10 + 2078505x^8 − 1021020x^6 + 225225x^4 − 18018x^2 + 231
(Legendre polynomial)
(iii) 524288x^20 − 2621440x^18 + 5570560x^16 − 6553600x^14 + 4659200x^12 − 2050048x^10
+ 549120x^8 − 84480x^6 + 6500x^4 − 200x^2 + 1 (Chebyshev polynomial)

4. Find all local minima of the following functions:

(i) ax + b sin(ax), a = 1, 100; b = 2, 1.1, 1.001;
(ii) x sin x (find the first five positive minimisers);
(iii) √|x − 1.1|.
5. Try to bracket the minimum of functions in {3} and {4} using a procedure similar to
that used in the subroutine BRACKM, but without invoking any interpolation. Thus, the
minimum is bracketed by just comparing the function values at different points. Compare
the efficiency and reliability of this method with the one including interpolation.
6. Try to find local minima of f(x) = x − sinh x, using the iterative method based on Hermite
cubic interpolation. Use positive starting values close to zero. Show that the function has
no local minimum, but a point of inflection at x = 0. How will you detect this point to
be a point of inflection, if the iteration does converge there?
7. Find the global minima of
    f(x) = (x − a sin x)^2,   a = 10, 20.3959;
and compare the results with those of {7.13}. Compare the efficiency of this process with
that of direct solution of x = a sin x. Are there any other local minima of this function?
Also try to minimise the function |x − a sin x| and compare the efficiency with the earlier
methods.
8. Find the global minima of
    f(x) = (x^6 − 9.2x^5 + 34.45x^4 − 66.914x^3 + 70.684x^2 − 38.168x + 8.112 + 0.0125 ln x)^2,
and compare the results with those obtained by solving the equivalent nonlinear equation
directly. Are there any other local minima of this function?
9. For the following function with the specified starting points, find the interval which will
contain an acceptable point for line search with (σ, ρ) = (0.1, 0.01), (0.1, 0.05), (0.9, 0.01),
(0.9, 0.5) (see 8.37):
    f(x) = (x^2 − 0.01) e^{−10x},   x_0 = −0.1, −1.

10. For Rosenbrock's function
    f(x_1, x_2) = 100(x_2 − x_1^2)^2 + (1 − x_1)^2,
find the region where the Hessian matrix is positive definite.


11. For the function
    f(x_1, x_2) = x_1^2 + 2x_2^2 + 4x_1 + 4x_2,
prove by induction that the method of steepest descent (with exact line search) applied
with the starting value x^{(1)} = 0 generates the sequence {x^{(k)}}, where

    x^{(k+1)} = \left( \frac{2}{3^k} - 2,\; \left(-\frac{1}{3}\right)^k - 1 \right)^T.

Hence, prove that the method converges linearly. Find a point from which this method
will converge in one iteration.
12. Find all stationary points of the following functions. Which of these points are local min-
imisers, which are local maximisers, and which are neither? Is there any global minimum
or maximum of these functions?

13. Verify that H^{(k+1)}_{BFGS} γ = δ, B^{(k+1)}_{DFP} δ = γ and B^{(k+1)}_{DFP} H^{(k+1)}_{BFGS} = I.


14. Try to minimise the following functions by the quasi-Newton method:

Use x = 0 or x = (0.5, 0.5)^T as the starting values. Show that the functions do not have
any local minimum, but a saddle point at (1, 1). Also try a few different starting values
close to this point. How will you detect this point to be a saddle point, if the iteration
does converge there? Try to apply a direction set method to the same problem.
15. In the BFGS method, if the ith row and column of H^{(1)} is zeroed, then show that this
property is preserved for all H^{(k)} and hence x_i^{(k)} = x_i^{(1)} for all k > 1. It then follows
that the objective function is minimised subject to the constraint x_i = x_i^{(1)}.
16. Using the DFP formula, show that the matrix B = H^{-1} can be updated using

    B^{(k+1)}_{DFP} = B + \left(1 + \frac{\delta^T B \delta}{\gamma^T \delta}\right) \frac{\gamma \gamma^T}{\gamma^T \delta} - \frac{\gamma \delta^T B + B \delta \gamma^T}{\gamma^T \delta}.

17. Minimise the following quadratic function generated by the n × n Hilbert matrix H:

    f(x) = x^T H x = \sum_{i=1}^{n} \sum_{j=1}^{n} \frac{x_i x_j}{i + j - 1}.

Try n = 4, 6, 10 and 15.


18. Minimise the following functions:

(i) 100(x_2 − x_1^2)^2 + (1 − x_1)^2 + 90(x_4 − x_3^2)^2 + (1 − x_3)^2 + 10.1[(x_2 − 1)^2 + (x_4 − 1)^2]
+ 19.8(x_2 − 1)(x_4 − 1),
(ii) 100(x_2 − x_1^2)^2 − (1 − x_1)^2 + 90(x_4 − x_3^2)^2 + (1 − x_3)^2 + 10.1[(x_2 − 1)^2 + (x_4 − 1)^2]
+ 19.8(x_2 − 1)(x_4 − 1),
(iii) (x_1 + 10x_2)^2 + 5(x_3 − x_4)^2 + (x_2 − 2x_3)^4 + 10(x_1 − x_4)^4.

19. To test the efficacy of the orthogonalisation in the modification of the Powell's method,
try to solve the previous problem using the direction set method with and without the sin-
gular value decomposition at the end of each cycle, and compare the number of iterations
required in the two cases.
20. Implement a conjugate gradient method and test it on the functions in {18}. Compare the
efficiency of this method with (i) quasi-Newton methods and (ii) direction set methods.
21. Find all local minima of the following functions:
(i) (x^4 − 16x^3 + 72x^2 − 96x + 24)(y^4 − 10y^3 + 35y^2 − 50y + 24),
(ii) (x^2 − 4x + 5)(y^2 − 3y + 3)(z^2 − 6z + 10).
Which of these (if any) is the global minimum?
22. Find the global minima of the following function:
    (−2x^2 − 3xy + 4 sin y + 6)^2 + (3x^2 − 2xy^2 + 3 cos x − 8)^2,
and compare the results with the direct solution of the corresponding system of nonlinear
equations {7.52}. Compare the efficiency of the two methods. Are there any local minima
which are not global?
23. Minimise the following function due to Kowalik and Osborne, which arises when a poly-
nomial of degree n is fitted by least squares (see Chapter 10) to approximate a solution
of the differential equation f' = 1 + f^2, f(0) = 0 (try n = 4, 6 and 9):

    x_1^2 + (x_2 - x_1^2 - 1)^2 + \sum_{i=2}^{30} \left[ \sum_{j=2}^{n} (j-1) x_j \left(\frac{i-1}{29}\right)^{j-2} - \left( \sum_{j=1}^{n} x_j \left(\frac{i-1}{29}\right)^{j-1} \right)^2 - 1 \right]^2.

24. An industry uses resources of material m and labour l to make up to four possible
items (a, b, c, d). The resources required for these items are 5m + l, m + 4l, 2m + 2l and
2m + 3l, respectively, while the profit per item is 100, 200, 125 and 150 units, respectively.
Assuming that the industry can manage a supply of up to 50 units of material and 40
units of labour per day and that there is no constraint on the integrality of these numbers,
what is the manufacturing schedule which will give maximum profit? Formulate this as
an LP problem and find the optimal solution. Compare the profit in the optimal solution
with that obtained if (1) each item is manufactured in equal amount, and (2) only the
two items yielding maximum profit are manufactured to fully utilise the resources of
material and labour.
25. Consider the LP problem: minimise x + y, subject to

    2x + y ≤ 9,   |y − 2| ≤ 1,   x ≥ 0,   y ≥ 0.
Solve the problem graphically. Formulate this problem as a standard LP problem and
solve it using the simplex method.
26. Consider modifying the simplex method to solve a more general LP problem of the form

    minimise c^T x,
    subject to Ax = b,   l_i ≤ x_i ≤ u_i,   (i = 1, 2, ..., n).
Show that the following changes are required in the basic simplex method. The right-hand
side variables can take the values l_i or u_i instead of zero and the first column of the tableau b_i
will have to be calculated accordingly. The optimality test is changed to require that the
set
    {i : i ∈ N, c_i < 0 if x_i = l_i, c_i > 0 if x_i = u_i}
is empty; if not, then c_k is chosen as max(|c_i|) for i in this set. Similarly, the choice of
the basic variable to be exchanged is determined by which of the variables reaches the
upper or lower bound for the smallest change in the value of x_k.
27. Implement the algorithm described in the previous problem and test it on the following
problem: minimise x_1 + 3x_2 − 2x_4 subject to

    x_1 + x_2 + x_3 + x_4 ≤ 10,   x_1 + 2x_2 − 2x_3 = 6,

    0 ≤ x_1 ≤ 4,   0 ≤ x_2 ≤ 10,   2 ≤ x_3 ≤ 8,   3 ≤ x_4 ≤ 10.
Also formulate this as a standard LP problem and solve it using the usual simplex method.
Compare the efficiency of the two methods.
28. Consider the LP problem: minimise x_1 + x_2, subject to x_1 ≥ 0, x_2 ≥ 0 and x_1 − x_2 = x,
and show that for x > 0 the optimal vector is (x, 0), while for x < 0 it is (0, −x). Thus, we
can replace |x| = x_1 + x_2 and x = x_1 − x_2 to minimise the absolute values, and one of the
two artificial variables will turn out to be zero. Using this technique solve the following
problem: minimise |x| + |y − 1| + |z − x|, subject to x − z ≤ 1 and x − 2y + z = 3.
29. Convert the LP problem: minimise x_1 + x_3, subject to

    x_1 + x_2 + x_3 = 2,   x ≥ 0,

to the standard form and find an initial basic feasible vector by the method of introducing
artificial variables. Show that the optimal solution to the auxiliary problem is degenerate.
Obtain the required basic feasible vector by exchanging this variable with some other
variable on the right-hand side. Using this initial vector solve the original problem.
30. Consider the following LP problem due to Beale:

    minimise −(3/4)x_1 + 20x_2 − (1/2)x_3 + 6x_4,

    subject to (1/4)x_1 − 8x_2 − x_3 + 9x_4 ≤ 0,
               (1/2)x_1 − 12x_2 − (1/2)x_3 + 3x_4 ≤ 0,
               x_3 ≤ 1,   x ≥ 0.

Add slack variables x_5, x_6, x_7 and show that the obvious choice of x_i = 0 for i = 1, 2, 3
and 4 gives a basic feasible vector which is degenerate, since x_5 = x_6 = 0. Solve the
problem by the simplex algorithm, with the ties being resolved in favour of the first
occurrence as given in Section 8.7. Show that after six iterations the original tableau is
restored so that cycling is established, and the algorithm goes into a nonterminating loop.
Show that instead of exchanging x_1 with x_5 at the first step, if we ignore the smallest
value of c_i and choose to exchange x_3 with x_7, the degeneracy will be resolved and we
can get the required solution in one more iteration.
31. Consider the following LP problem due to Klee, Minty and Chvatal, involving n variables:

    maximise \sum_{j=1}^{n} 10^{j-1} x_j,

    subject to x_i + 2 \sum_{j>i} 10^{j-i} x_j \le 10^{2(n-i)},   (i = 1, 2, ..., n).

Show that the feasible region has 2^n vertices. Starting with the basic feasible vector
x = 0, that is, with all the slack variables on the left-hand side, show that the simplex method
visits all the 2^n vertices before converging to the optimal vector. Hence, the time required
to solve this problem using the simplex method increases exponentially with n. Consider
the case n = 6 and explicitly verify that the iteration traverses all the 64 vertices.
32. Solve the problem: minimise 2x + y, subject to x^2 + 4y^2 ≤ 1, by graphical means. Also
try the following alternative method. First prove that the minimum occurs on the curve
x^2 + 4y^2 = 1 and then eliminate y using this equation and find the minimum. Compare
the results obtained using these two methods.
33. Solve the problem: minimise x^2 + y^2 subject to (x − 1)^3 = y^2, both graphically and
also by eliminating y. In the latter case, show that the resulting function of x has no
minima, and explain this apparent contradiction. What happens if the problem is solved
by eliminating x?
34. Solve the travelling salesman problem for N = 100 cities using the technique of simulated
annealing. Generate N pairs of random numbers to define the coordinates of the N cities
and then find the path with shortest length, which passes through each of the N cities
once and returns to the starting point. If (Xi, Yi) are the coordinates of cities taken in
the order of the path then the object is to minimise

    L = \sqrt{(x_1 - x_N)^2 + (y_1 - y_N)^2} + \sum_{i=1}^{N-1} \sqrt{(x_i - x_{i+1})^2 + (y_i - y_{i+1})^2}.

Starting with an arbitrary permutation of cities apply simulated annealing to reduce the
length L. Try random variations in permutation of the following types: (i) A section of
path is removed and then replaced with the same cities in opposite order; or (ii) a section
of path between two randomly selected cities is removed and grafted between another
pair of randomly selected cities.
35. Solve the optimisation problems in {18} using simulated annealing and compare its effi-
ciency with that of quasi-Newton methods.
36. Solve the system of equations in Example 7.14, using a minimisation technique, by min-
imising

    \sum_{j} \left( \sum_{i} w_i x_i^j + \frac{1}{(j+1)^2} \right)^2,

and compare the reliability of the minimisation technique with that of directly solving the
system of nonlinear equations. Try the quasi-Newton method, direction set method and
simulated annealing technique for minimisation and compare their effectiveness in finding
the global minimum, starting with different initial guesses.
Chapter 9

Statistical Inferences

Probability and statistical inference have a wide variety of applications, from
scientific research to financial markets, insurance, government policies, epi-
demiology and reliability of consumer products, as well as gambling. Our discussion
will be restricted to scientific applications, mainly in data analysis. The topic of
data analysis is discussed in the next Chapter, while in this Chapter we discuss
the relevant statistics.
In Section 9.1, we discuss elementary statistics and properties of some use-
ful probability distributions. In Section 9.2 we discuss the basic ideas in Monte
Carlo methods, which are extensively used in simulating random events on a
computer. In Section 9.3 we consider experimental errors and their propagation.
Similar ideas can be used to study propagation of roundoff errors in numerical
computations. The application of statistics to fitting data will be discussed in
the next Chapter along with the general problem of functional approximations.

9.1 Elementary statistics


If we repeat the same measurement, e.g., the height of a person, the result may not
be the same every time. If we make n measurements of some quantity and get
the values x_i, i = 1, ..., n, we may want to get the best estimate of the 'correct'
value x using these measurements. In general, there are two types of errors that
one may encounter, random or statistical errors and systematic errors. Random
errors are due to uncertainties in measurements, which can be of either sign and
are randomly distributed about the 'correct' value. Systematic errors, on the other hand, are
by definition not random; e.g., in measuring height, if the scale is used at a higher
temperature than what it is calibrated for, then due to thermal expansion all
readings will be on the lower side. Such errors cannot be detected or eliminated by
making several measurements using the same equipment. Such errors are very
important in actual measurements, but there is no general theory for these
systematic errors, as these will depend on the technique being used. Another
source of non-random errors are so-called blunders, where a wrong reading
may be recorded or there may be some other error in calculation or recording.
Sometimes such errors can be spotted by careful examination of the recorded data,
as these may be far from the other values. In general, it is not easy to locate such
errors, but they can be controlled by more careful measurement. Hence in our
discussion we will assume that there are no blunders and that the systematic
errors are much smaller than the random errors and hence can be neglected.
Henceforth, we will assume that only random errors are present.
Coming back to our measurement, a simple way of obtaining the best
value is to take the mean

    \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i.    (9.1)
We generally denote the mean of a set of quantities by a bar or by angular brackets.
In the next Chapter we will show that if the measurements x_i follow the so-
called normal distribution, then the mean gives the best estimate in a certain
sense. In addition to the best value we would also like to have some measure of
the error in the individual values, as well as in the mean. For this purpose a convenient
measure is the standard deviation, σ, defined by

    \sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n-1}.    (9.2)

The quantity σ^2 is also referred to as the variance. It may be noted that we have
used n − 1 instead of n in the denominator. This can be roughly understood
by considering n = 1, i.e., only one measurement, x_1. In that case \bar{x} = x_1 and
Eq. (9.2) would be indeterminate. This is to be expected, as with only one
measurement we cannot hope to estimate the error. It is clear that the mean
and the standard deviation have the same units as the quantities x_i and the
standard deviation gives an estimate of the random errors in individual measure-
ments, which can be formally written as x_i ± σ. Here we have assumed that all
measurements have comparable errors as each is performed in the same man-
ner. The error estimate can also be used to measure the consistency of two different
measurements, e.g., if |x_i − x_j| ≫ σ, then the two measurements x_i and x_j are
unlikely to be reasonable measures of the same quantity. In Section 9.3 we will
give a more quantitative measure of the acceptable difference. Apart from errors
in individual measurements, we would be interested in the error in the mean. In
Section 9.3 we will show that this is given by σ/√n. This may give an impres-
sion that we can get arbitrarily high accuracy by repeating the measurement
a sufficiently large number of times. In practice, there will be limitations due to
the systematic errors which have been neglected.
It is also possible to define higher order moments; these are called the
skewness and kurtosis, which are defined by

    \text{Skewness} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^3}{(n-1)\,\sigma^3}, \qquad \text{Kurtosis} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^4}{(n-1)\,\sigma^4} - 3.    (9.3)

It may be noted that in this case we have divided the differences by σ to get
a dimensionless number. The skewness will give a measure of the asymmetry in the
distribution. For a symmetric distribution the skewness is zero. If the skewness is positive
the distribution is more pronounced on the positive side, and vice versa. In the
definition of kurtosis we have subtracted 3, which is the value for the normal
distribution. This is sometimes referred to as the excess kurtosis. If the kurtosis is
positive, the distribution is peaked as compared to the normal distribution,
and if it is negative, then the distribution is flat as compared to the normal
distribution. For any finite distribution of numbers these quantities will almost
always come out to be non-zero, and to estimate if the magnitude is significant
we need some measure of the expected variations. This will, of course, depend on
the actual distribution, but a convenient measure is provided by the standard
deviation of these quantities for a normal distribution, which is given by √(6/n)
and √(24/n) respectively for the skewness and kurtosis. Thus these values can be
considered to be significant only when they are significantly larger than the
standard deviation for the normal distribution.
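These sample statistics are straightforward to compute; the short Fortran sketch below evaluates the mean, standard deviation, skewness and excess kurtosis of a data set following Eqs. (9.1)–(9.3) (the routine name is an illustrative choice).

      ! Sketch: mean, standard deviation (with n-1 in the denominator),
      ! skewness and excess kurtosis of a sample, following Eqs. (9.1)-(9.3).
      subroutine moments(x, n, xmean, sigma, skew, kurt)
        implicit none
        integer, intent(in) :: n
        real, intent(in) :: x(n)
        real, intent(out) :: xmean, sigma, skew, kurt
        real :: d
        integer :: i
        xmean = sum(x)/n
        sigma = sqrt(sum((x - xmean)**2)/(n - 1))
        skew = 0.0
        kurt = 0.0
        do i = 1, n
          d = (x(i) - xmean)/sigma
          skew = skew + d**3
          kurt = kurt + d**4
        end do
        skew = skew/(n - 1)
        kurt = kurt/(n - 1) - 3.0        ! subtract 3 to get the excess kurtosis
      end subroutine moments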
For a continuous probability distribution, P(x) implies a probability of
P(x) δx of finding a value in the interval (x, x + δx). Since the total probability
has to be unity, the distribution should satisfy

    \int_{-\infty}^{\infty} P(x)\, dx = 1.    (9.4)

In this case the mean and standard deviation are defined by

    \langle x \rangle = \int_{-\infty}^{\infty} x P(x)\, dx, \qquad \sigma^2 = \int_{-\infty}^{\infty} (x - \langle x \rangle)^2 P(x)\, dx,    (9.5)

where the angular brackets are used to denote the mean value.
Apart from the mean there are other estimates for the 'best' value of a distribu-
tion; the more common ones are the median and the mode. The median, μ_{1/2},
is defined as the value such that the probability of finding values less (or greater)
than μ_{1/2} is 1/2, i.e.,

    P(x < \mu_{1/2}) = P(x \ge \mu_{1/2}) = \frac{1}{2},    (9.6)

where we have resolved the ambiguity of the middle value by putting x = μ_{1/2}
in the upper half. For a continuous distribution this does not matter, but for a
finite set, the definition of the median may have a small ambiguity depending on
which half the median value is included in. To find the median value for a finite
set of observations, a straightforward technique is to sort the measurements in
ascending order and then take the middle value in the ordered table. If the
number of observations n is odd, then μ_{1/2} = x_{(n+1)/2}, as there would be
(n − 1)/2 values on either side of it in the ordered table, though the definition given
by Eq. (9.6) is not strictly valid, as with odd n we cannot get a probability of
exactly half. For even values of n there is an ambiguity and we can use either x_{n/2}
or x_{n/2+1}. With the definition given by Eq. (9.6), we will get μ_{1/2} = x_{n/2+1}. If
the probability distribution is symmetric about a central value, then both the mean
and the median will be equal to this central value. For an asymmetric distribution the
mean and median will not be equal. In general, the median is a more robust
estimator, though it takes much more effort to calculate. In particular, the median
is remarkably insensitive to the presence of a few measurements with large
errors. This issue will be discussed in the next Chapter in connection with the
larger problem of approximating a function. Apart from the median we can
also have a measure of the width of a distribution in terms of the quartile deviation
instead of the standard deviation. The first or lower quartile, Q_1, is defined as
the value such that P(x < Q_1) = 1/4. Similarly the third or upper quartile, Q_3,
is the value such that P(x > Q_3) = 1/4. Thus 1/4 of the values in a distribution
should be lower than Q_1 and another 1/4 above Q_3. The quartile deviation is defined
by

    \text{Quartile Deviation (QD)} = \frac{Q_3 - Q_1}{2}.    (9.7)

This is also referred to as the semi-inter-quartile range, while the difference
Q_3 − Q_1 is the inter-quartile range. Thus half the values will be between Q_1
and Q_3.
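For a finite set of observations the median and quartiles can be obtained by sorting, as described above. The following Fortran sketch is one possible implementation; the conventions adopted for even n and for locating the quartiles in the ordered table are illustrative choices.

      ! Sketch: median and quartiles of a sample obtained by sorting the
      ! values in ascending order.  The conventions used for even n and for
      ! the quartile positions are one possible choice.
      subroutine quartiles(x, n, q1, xmed, q3)
        implicit none
        integer, intent(in) :: n
        real, intent(in) :: x(n)
        real, intent(out) :: q1, xmed, q3
        real :: y(n), t
        integer :: i, j
        y = x
        do i = 1, n - 1                   ! simple exchange sort
          do j = i + 1, n
            if (y(j) < y(i)) then
              t = y(i); y(i) = y(j); y(j) = t
            end if
          end do
        end do
        if (mod(n, 2) == 1) then
          xmed = y((n + 1)/2)
        else
          xmed = y(n/2 + 1)               ! convention of Eq. (9.6)
        end if
        q1 = y(max(1, (n + 1)/4))
        q3 = y(min(n, (3*(n + 1))/4))
      end subroutine quartiles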
Another measure for the best value is given by the mode, which is the
most probable value of x. This can be defined, for a continuous probability
distribution P(x), as the value of x at which P(x) is maximum. This value may
not be uniquely defined, as in some cases there may be more than one value
of x at which P(x) has the same maximum value. For a symmetric probability
distribution with only one peak at the central value, the mean, median and
mode will all be equal, but for asymmetric distributions they will all be different.
For a symmetric distribution which has its maximum value away from the centre,
the mode will not be unique as there will be at least 2 (or an even number) of values
at which P(x) has a maximum value. The mean of these values will be the same
as the mean of the distribution. For distributions with one dominant peak, it
is also possible to get a measure of the width as the range where the probability is
larger than half of its maximum value at the mode. This is referred to as the Full
Width at Half Maximum (FWHM). Half of this range is referred to as the Half
Width at Half Maximum (HWHM). Thus if x_m is the mode and x_1 and x_2 are
two values on either side such that P(x_1) = P(x_2) = (1/2)P(x_m), then the FWHM is
x_2 − x_1. These quantities would be meaningful only if there are no secondary
peaks in the distribution that are higher than half the maximum value.
In the following subsections we consider properties of some of the impor-
tant probability distributions that arise in data analysis, including some which
are useful for statistical tests.

9.1.1 The Normal Distribution


If we repeat the same measurement many times, we would not get the same
result, as every time a different error may creep in. If the source of errors is
random and the various sources are independent, then in most cases the limiting
distribution tends to the normal or Gaussian distribution:

    G_{\mu,\sigma}(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/(2\sigma^2)}.    (9.8)
The factor in front of the exponential ensures that the probability distribution
is normalised:

    \int_{-\infty}^{\infty} G_{\mu,\sigma}(x)\, dx = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-(x-\mu)^2/(2\sigma^2)}\, dx = 1.    (9.9)

This distribution has a maximum value of 1/(σ√(2π)) at x = μ and is symmetric
about the maximum. In Eq. (9.8), the parameters μ and σ can be identified
with the mean and standard deviation of the distribution. This can be easily
verified by calculating the mean, which for a continuous distribution is given by

    \int_{-\infty}^{\infty} x\, G_{\mu,\sigma}(x)\, dx = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{\infty} x\, e^{-(x-\mu)^2/(2\sigma^2)}\, dx = \mu.    (9.10)

Since the distribution is symmetric with its peak at x = μ, the median and mode
of the distribution are also μ. Similarly we can get the standard deviation from

    \int_{-\infty}^{\infty} (x - \mu)^2\, G_{\mu,\sigma}(x)\, dx = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{\infty} (x - \mu)^2\, e^{-(x-\mu)^2/(2\sigma^2)}\, dx = \sigma^2.    (9.11)

In this distribution σ can be considered as a measure of the width of the
distribution. It can be seen that G_{μ,σ}(μ ± σ) = e^{−1/2} G_{μ,σ}(μ). It turns out
that the points μ ± σ are the points of inflection, where the second derivative
of the distribution function vanishes. Further, because of the exponential, the
values fall off rapidly beyond this. Another common measure of the width of a
distribution is the Full Width at Half Maximum (FWHM), Γ, which is defined
as the width of the region where the probability is greater than half of that at the
maximum. Thus,

    G_{\mu,\sigma}\!\left(\mu \pm \tfrac{\Gamma}{2}\right) = \tfrac{1}{2} G_{\mu,\sigma}(\mu), \quad\text{or}\quad e^{-\Gamma^2/(8\sigma^2)} = \tfrac{1}{2}, \quad\text{or}\quad \Gamma \approx 2.3548\,\sigma.    (9.12)

Thus the FWHM is close to 2σ, as may be expected. As will be seen later, the
quartile deviation for the normal distribution is 0.6745σ.
The normal distribution is the most often used distribution, because it is
probably the simplest to treat and in practice a large number of measurement
problems do lead to a distribution that is close to normal. Some other common
distributions can also be approximated by the normal distribution in appro-
priate limits. In fact, most of the statistical tests in curve fitting problems are
valid only when errors follow the normal distribution.
To estimate the probability of finding a measurement in a given interval
we need to evaluate the integral of the probability distribution. The most in-
teresting results are the probabilities of finding values in the interval μ ± zσ.
Thus

    P(|x - \mu| \le z\sigma) = \int_{\mu - z\sigma}^{\mu + z\sigma} G_{\mu,\sigma}(x)\, dx = \frac{1}{\sqrt{2\pi}} \int_{-z}^{z} e^{-y^2/2}\, dy = \mathrm{erf}(z/\sqrt{2}),    (9.13)
where we have used the substitution y = (x − μ)/σ and erf(x) is the error
function, which is defined by

    \mathrm{erf}(x) = \frac{2}{\sqrt{\pi}} \int_{0}^{x} e^{-t^2}\, dt.    (9.14)

This is also known as the probability integral. In particular, for z = 1 the prob-
ability comes out to be 0.6827. Thus there is about a 68% chance that the
measured value is within 1σ of the mean value. Obviously, there is about a
32% chance that it is outside this limit. Because of the symmetry of the curve,
P(x > μ + σ) = P(x < μ − σ) = 0.1587. Thus the probability of finding a
value larger than μ + zσ is given by

    P(x > \mu + z\sigma) = P(x < \mu - z\sigma) = \tfrac{1}{2}\bigl(1 - \mathrm{erf}(z/\sqrt{2})\bigr) = \tfrac{1}{2}\,\mathrm{erfc}(z/\sqrt{2}),    (9.15)

where erfc(x) = 1 − erf(x) is the complementary error function. Since the
probability distribution falls off exponentially, the probability in Eq. (9.13)
tends rapidly to 1 with increasing z. For z = 2, 3, 4, the probability turns out
to be 0.9545, 0.9973, 0.99994, respectively. Thus with the normal distribution
the probability of finding a measurement with a deviation of greater than 3σ
from the mean value is extremely small. A table of these probabilities can be
found in books on error analysis. The values can be easily computed using
the error function. Conversely, we can find the range of values for a given
probability. For example, a probability of 1/2 is achieved for z = 0.6745. Thus
there is a 50% chance that the measured value is within 0.67σ of the mean.
Similarly, a probability of 90%, 95%, 99%, 99.9%, 99.99% is achieved for z =
1.6449, 1.9600, 2.5758, 3.2905, 3.8906 respectively. These values can also be read
from the tables, or they can be found by solving the nonlinear equation p =
erf(z/√2) using, say, secant iteration. The probability integral also gives the
error estimate in Monte Carlo methods for numerical integration (Section 6.11).
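For instance, the value of z corresponding to a given probability p can be obtained by applying secant iteration to p = erf(z/√2); a minimal Fortran sketch using the intrinsic ERF is shown below (the function name, starting values and tolerance are illustrative choices).

      ! Sketch: find z such that erf(z/sqrt(2)) = p by secant iteration.
      ! Uses the Fortran 2008 intrinsic ERF.
      real function zvalue(p)
        implicit none
        real, intent(in) :: p
        real :: z0, z1, f0, f1, z2
        integer :: it
        z0 = 1.0
        z1 = 2.0
        f0 = erf(z0/sqrt(2.0)) - p
        f1 = erf(z1/sqrt(2.0)) - p
        do it = 1, 50
          z2 = z1 - f1*(z1 - z0)/(f1 - f0)   ! secant step
          z0 = z1; f0 = f1
          z1 = z2; f1 = erf(z1/sqrt(2.0)) - p
          if (abs(z1 - z0) < 1.0e-6) exit    ! converged
        end do
        zvalue = z1
      end function zvalue

With p = 0.95 this sketch should return a value close to 1.96, in agreement with the table of values quoted above.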

9.1.2 The Binomial Distribution


The binomial distribution pertains to the discrete case and arises naturally in
many situations where the same process is repeated many times. Let us consider
the problem of rolling a die multiple times. If we roll the die 5 times, there
are 6^5 possibilities and the probability of getting all sixes is clearly 1/6^5. But
if we want the probability of getting exactly 2 sixes (and some other
score on the remaining 3 occasions), a simple calculation will give the probability as
(1/6)^2 (5/6)^3 = 5^3/6^5. But that is not correct, as we have not specified which of
the two attempts have given a six each. There are 5!/(2! 3!) = 10 combinations
which give 2 sixes out of 5 attempts. Thus the actual probability of getting 2
sixes is 10 × 5^3/6^5. In general, if the probability of an event is p and the process
is repeated n times, the probability that the event will occur exactly m times
is

    B_{n,p}(m) = \frac{n!}{m!\,(n-m)!}\, p^m (1-p)^{n-m} = \binom{n}{m} p^m (1-p)^{n-m}.    (9.16)
[Figure 9.1 appears here: two panels showing P(m) against m for m = 0, ..., 5.]

Figure 9.1: The binomial distribution for n = 5, p = 1/2, 1/6 is shown by the
bars. The solid curve shows the normal distribution with the same mean and
standard deviation.

Here the argument m can only take integral values from 0 to n. The two pa-
rameters, n and p, define the number of trials and the probability of occurrence in
each trial. While n can only take positive integral values, p is a real number
with 0 < p < 1. The name binomial comes from the fact that the probabilities
are the terms in the binomial expansion of (p + q)^n, with q = 1 − p.
We know

    (p + q)^n = \sum_{m=0}^{n} \binom{n}{m} p^m q^{n-m} = 1.    (9.17)

Here the last equality follows from the fact that q = 1 − p. This equation
shows that the total probability is 1, as expected. For any admissible values
of n, p, m the probability can be calculated using Eq. (9.16), though for large
values of n the calculation may result in an overflow. For small values of n
the factorials can be easily calculated, while for large values we can either use
Stirling's formula or calculate them using approximations to the Gamma
function, n! = Γ(n + 1). To avoid overflow, generally, the routines calculate the
natural logarithm of Γ(x). Stirling's formula, which gives an asymptotic
approximation to the factorial,

    n! \approx \sqrt{2\pi n}\; n^n e^{-n},    (9.18)

is valid for large values of n, with a relative error of O(1/(12n)).
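For example, B_{n,p}(m) can be evaluated through the logarithm of the Gamma function to avoid overflow, as in the following Fortran sketch using the intrinsic LOG_GAMMA (the function name is an illustrative choice and 0 < p < 1 is assumed).

      ! Sketch: binomial probability B_{n,p}(m) of Eq. (9.16), computed through
      ! log Gamma to avoid overflow for large n (LOG_GAMMA is a Fortran 2008
      ! intrinsic).  Assumes 0 < p < 1 and 0 <= m <= n.
      real function binom(n, m, p)
        implicit none
        integer, intent(in) :: n, m
        real, intent(in) :: p
        real :: blog
        blog = log_gamma(real(n + 1)) - log_gamma(real(m + 1)) &
             - log_gamma(real(n - m + 1))                      &
             + m*log(p) + (n - m)*log(1.0 - p)
        binom = exp(blog)
      end function binom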


The Binomial distribution is symmetric only for p = 1/2. For other values
of p it is not symmetric. We can calculate the mean value
n n ,
J-l = L mBn,p(m) = L m
m. n
'( n~ m.),pm(1- pt- m = np. (9.19)
m=l m=l '
This may be expected, as the probability of occurrence in one trial is p and
hence on average we expect np occurrences in n trials. The median of this
distribution is ⌊np⌋ or ⌈np⌉, while the mode is ⌊(n + 1)p⌋ or ⌊(n + 1)p⌋ − 1.
The standard deviation can also be calculated with a little more manipulation
to get

    \sigma^2 = \sum_{m=0}^{n} (m - \mu)^2 B_{n,p}(m) = \sum_{m=0}^{n} (m - np)^2 \frac{n!}{m!\,(n-m)!}\, p^m (1-p)^{n-m} = np(1-p).    (9.20)

The skewness and kurtosis of the binomial distribution are given by (1 − 2p)/σ
and (6p^2 − 6p + 1)/σ^2 respectively. The cumulative probability of obtaining
more than k successes is given by

    P(>k) = \sum_{m=k+1}^{n} \binom{n}{m} p^m (1-p)^{n-m} = I_p(k+1,\, n-k),    (9.21)

where I_x(a, b) is the incomplete beta function

    I_x(a, b) = \frac{B_x(a,b)}{B(a,b)} = \frac{1}{B(a,b)} \int_{0}^{x} t^{a-1} (1-t)^{b-1}\, dt.    (9.22)

Having calculated the moments, we can compare the binomial distribu-
tion with the normal distribution with the same mean and σ. It turns out that
for μ = np ≫ 1 the binomial distribution tends to the normal distribution
G_{μ,σ}(x). This can be proved using Stirling's formula to approximate the fac-
torials in B_{n,p}(m). Fig. 9.1 compares the binomial distribution with the normal
distribution for n = 5. Thus for large values of np we can approximate the
binomial distribution by the normal distribution, which is more convenient to
use.

9.1.3 The Poisson distribution


The Poisson distribution can be considered as a limiting case of the binomial dis-
tribution, when n → ∞ but p ≪ 1 and μ = np remains finite. This is useful
in counting experiments, e.g., radioactive decay, where the number of nuclei is
very large, but the probability of decay is very small, giving only a small num-
ber of decaying nuclei in a given time interval. In this case the number of counts
m is relatively small and hence taking the limit we get

    B_{n,p}(m) \approx \frac{n^{n+1/2} e^{-m}}{m!\,(n-m)^{n-m+1/2}}\, p^m (1-p)^{n-m}
             \approx \frac{\mu^m e^{-m} (1-\mu/n)^n (1+pm)}{m!\,(1-m/n)^n \bigl(1+m(m-1/2)/n\bigr)} \approx \frac{\mu^m e^{-\mu}}{m!},    (9.23)
[Figure 9.2 appears here: two panels showing P(m) against m.]

Figure 9.2: The Poisson distribution for μ = 1, 2 is shown by the bars. The
solid curve shows the normal distribution with the same mean and standard
deviation.

which defines the Poisson distribution. In this case we require only one param-
eter, μ, to define the distribution and one would expect σ^2 = μ(1 − p) → μ.
Thus the Poisson distribution is defined by

    P_{\mu}(m) = \frac{\mu^m e^{-\mu}}{m!}.    (9.24)

In this case the count m can be any integer from 0 to ∞. Thus taking the sum
over all possibilities we get

    \sum_{m=0}^{\infty} P_{\mu}(m) = e^{-\mu} \sum_{m=0}^{\infty} \frac{\mu^m}{m!} = e^{-\mu} e^{\mu} = 1.    (9.25)

Thus the total probability adds up to 1 as expected. Further, we can verify the
expression for the mean by noting

    \mu = \sum_{m=1}^{\infty} m\, P_{\mu}(m) = \sum_{m=1}^{\infty} m\, e^{-\mu} \frac{\mu^m}{m!} = \mu\, e^{-\mu} \sum_{m=1}^{\infty} \frac{\mu^{m-1}}{(m-1)!} = \mu.    (9.26)

Similarly, we can get the standard deviation using

    \sigma^2 = \sum_{m=0}^{\infty} (m - \mu)^2 P_{\mu}(m) = \mu.    (9.27)

Thus in this case the standard deviation is just the square root of the mean.
Since this distribution has been obtained from the binomial distribution, we
can expect that for large μ it will approach the normal distribution G_{\mu,\sqrt{\mu}}(x).
This can also be easily proved directly. Fig. 9.2 shows the Poisson distribution
for μ = 1, 2. The higher order moments can also be calculated; the skewness is
1/√μ and the kurtosis is 1/μ.

9.1.4 The Lorentzian distribution


The Lorentzian distribution is generally seen in spectral lines obtained by tak-
ing the Fourier transform of a time series. The Fourier transform of a damped
harmonic oscillator gives a Lorentzian peak. This distribution is not similar to
the normal distribution, although it is also defined for a continuous variable.
The Lorentzian distribution is defined by

    L_{\mu,\Gamma}(x) = \frac{1}{\pi} \frac{\Gamma/2}{(x-\mu)^2 + \Gamma^2/4},    (9.28)

where the parameter μ is the mean value and Γ is the FWHM. It can be easily
verified that the total probability

    \int_{-\infty}^{\infty} L_{\mu,\Gamma}(x)\, dx = \frac{1}{\pi} \frac{2}{\Gamma} \int_{-\infty}^{\infty} \frac{dx}{\left(\frac{x-\mu}{\Gamma/2}\right)^2 + 1} = 1.    (9.29)
Thus this distribution is correctly normalised. This distribution is symmetric
about x = μ and the main difference with respect to the normal distribution is
that the tail falls off rather slowly, roughly as 1/(x − μ)^2 for large differences,
instead of the exponential fall of the normal distribution.
We can verify the mean value using

    \mu = \int_{-\infty}^{\infty} x\, L_{\mu,\Gamma}(x)\, dx = \frac{1}{\pi} \frac{2}{\Gamma} \int_{-\infty}^{\infty} \frac{(x-\mu+\mu)\, dx}{\left(\frac{x-\mu}{\Gamma/2}\right)^2 + 1} = \mu.    (9.30)
This is actually the Cauchy principal value (Section 6.6) of the integral; the
integral diverges in the Riemann sense. Since the distribution is symmetric
about the mean, the median and mode are also equal to μ. If we try to find the
standard deviation,

    \sigma^2 = \int_{-\infty}^{\infty} (x-\mu)^2 L_{\mu,\Gamma}(x)\, dx = \frac{1}{\pi} \frac{2}{\Gamma} \int_{-\infty}^{\infty} \frac{(x-\mu)^2\, dx}{\left(\frac{x-\mu}{\Gamma/2}\right)^2 + 1} = \infty.    (9.31)

It is clear that the integral diverges and hence in this case the standard de-
viation is not meaningful. This can be attributed to the slowly falling tail of
the distribution, because of which the integrand does not tend to zero at in-
finity. With a finite number of measurements the standard deviation will, of
course, be finite, but it will not have any significance, as it will not tend to a
limiting value as the number of measurements increases. Thus in this case the
relevant measure of width is the Full Width at Half Maximum, Γ. It is clear
that L_{μ,Γ}(μ ± Γ/2) = (1/2) L_{μ,Γ}(μ). This distribution, as well as its integral over
any interval, can be easily calculated. Thus the probability of finding a value
within μ ± Γ/2 is

    P(|x - \mu| \le \Gamma/2) = \int_{\mu-\Gamma/2}^{\mu+\Gamma/2} L_{\mu,\Gamma}(x)\, dx = \frac{1}{\pi} \frac{2}{\Gamma} \int_{\mu-\Gamma/2}^{\mu+\Gamma/2} \frac{dx}{\left(\frac{x-\mu}{\Gamma/2}\right)^2 + 1} = \frac{1}{2}.    (9.32)

Thus the probability of finding a value within the FWHM is exactly half. Fur-
ther, the probability of finding a value within μ ± zΓ/2 is (2/π) tan^{-1} z and can
be easily calculated. Even for z = 5 the probability is 0.8743, and hence there
is a significant probability of finding a value beyond these limits. For large z the
probability is approximately 1 − 2/(πz) and approaches 1 asymptotically as
1/z.

9.1.5 Student's t Distribution


The t distributions were discovered by Gosset in 1908. Since his employer had
stipulated that he could not publish it under his name, he published it under the
pen name of 'Student'. This distribution is useful in checking if a given set
of measurements is consistent with a specified mean and standard deviation.
This is useful in industry to check the quality of a product by comparing the
distribution of its properties with specifications. Suppose we have a set of n
numbers which are drawn from a normal distribution with mean μ and standard
deviation σ. If \bar{x} is the mean of this sample and s is its standard deviation, then
the quantity

    t = \frac{\bar{x} - \mu}{s/\sqrt{n}}    (9.33)

has a t distribution with n − 1 degrees of freedom. The probability distribution
with ν degrees of freedom is given by

    p_{\nu}(t) = \frac{1}{\sqrt{\nu}\, B\!\left(\tfrac{1}{2}, \tfrac{\nu}{2}\right)} \frac{1}{\left(1 + t^2/\nu\right)^{(\nu+1)/2}},    (9.34)
where B(x, y) = Γ(x)Γ(y)/Γ(x + y) is the beta function. It can be shown that
the integral of this distribution is 1. The mean value of this distribution is
clearly zero, while the standard deviation is √(ν/(ν − 2)) for ν > 2. For ν = 1
the Student's distribution is the same as the Lorentzian distribution with zero mean
and FWHM of 2. For ν = 2 also the standard deviation diverges. The skewness
of this distribution is 0, while the kurtosis is 6/(ν − 4) for ν > 4. As ν → ∞, the
Student's t distribution tends to the normal distribution with μ = 0 and σ = 1.
It is more meaningful to calculate the probability that a value is in the
range [−t, t], which is given by

    A(t|\nu) = \frac{1}{\sqrt{\nu}\, B\!\left(\tfrac{1}{2}, \tfrac{\nu}{2}\right)} \int_{-t}^{t} \frac{dx}{\left(1 + x^2/\nu\right)^{(\nu+1)/2}} = 1 - I_{\nu/(\nu+t^2)}\!\left(\tfrac{\nu}{2}, \tfrac{1}{2}\right),    (9.35)
where I_x(a, b) is the incomplete beta function. Thus depending on the proba-
bility we can decide if the two mean values are consistent. Actually, we would
be more interested in the probability of finding a value as high as t or higher in
magnitude, which is given by 1 − A(t|ν). If this probability is low then the two
values are unlikely to be compatible. The critical value of probability for the
test is, of course, arbitrary. A reasonable value may be 0.05, which corresponds
to a 95% confidence level.
If the mean μ is also calculated from another data set, then there are two
possibilities, depending on whether we expect the two data sets to have the
same or different variances. If the two data sets are assumed to have the same
variance and have lengths n_1, n_2, then the t value is given by

    t = \frac{\bar{x}_1 - \bar{x}_2}{\sigma\sqrt{1/n_1 + 1/n_2}}, \qquad \sigma^2 = \frac{\sigma_1^2 (n_1 - 1) + \sigma_2^2 (n_2 - 1)}{n_1 + n_2 - 2},    (9.36)

where \bar{x}_1, \bar{x}_2 are the means of the two data sets and σ_1, σ_2 are the respective stan-
dard deviations. Here σ^2 is the variance of the combined data set and in this
case the number of degrees of freedom ν for the t-test is n_1 + n_2 − 2. In case
the variances in the two sets are different, then we use

    t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}}.    (9.37)

In this case the effective number of degrees of freedom, ν, for the t distribution
is given by

    \nu = \frac{\left(\sigma_1^2/n_1 + \sigma_2^2/n_2\right)^2}{\dfrac{(\sigma_1^2/n_1)^2}{n_1 - 1} + \dfrac{(\sigma_2^2/n_2)^2}{n_2 - 1}}.    (9.38)

Here it is assumed that the two data sets are uncorrelated. If there is some
correlation between them, then the covariance between the two sets also needs
to be accounted for. In that case the denominator in Eq. (9.37) should be
replaced by \sqrt{(\sigma_1^2 + \sigma_2^2 - 2\sigma_{12})/n}, where σ_{12} is the covariance between the two
data sets, which should now have the same length n, and the number of degrees
of freedom is n − 1.

9.1.6 The χ^2 distribution


The χ^2 function which arises in least squares approximations (see Section 10.1
for the definition) depends on the degrees of freedom in the approximation. To check
the quality of a fit we need to consider the value of χ^2 obtained by the fit. In order
to quantify the quality of the fit, we need to have some measure of the expected
value of χ^2. From the definition of χ^2 we may expect it to be of the order of ν,
the number of degrees of freedom. We can define χ_ν^2 = χ^2/ν, referred to as the normalised
or reduced χ^2, or χ^2 per degree of freedom. This is expected to be of the order
of 1 for a good fit. Any value substantially larger than 1 would point to some
discrepancy, but that also depends on the value of ν.
[Figure 9.3 appears here: two panels showing p(χ^2) against χ^2.]

Figure 9.3: The χ^2 distribution for ν = 1, 2, 3, 5, 10, 20 is shown by the curves
as marked in each panel. The dotted lines show the normal distribution with
the same mean and standard deviation, for ν = 5, 10, 20.

Further, if for some reason
the errors σ_i have been underestimated, the χ^2 would turn out to be large. On
the other hand, very small values of χ_ν^2 are also not expected. This can happen
if the errors are overestimated.
In order to get a more quantitative measure of the probability of finding a given
value of χ^2 we need its probability distribution, which depends on the degrees
of freedom ν and is given by

    p_{\chi}(\chi^2, \nu) = \frac{e^{-\chi^2/2} (\chi^2)^{\nu/2 - 1}}{2^{\nu/2}\, \Gamma(\nu/2)}.    (9.39)

For low values of ν this is given by

    p_{\chi}(\chi^2, 1) = \frac{e^{-\chi^2/2}}{\sqrt{2\pi\chi^2}}, \qquad p_{\chi}(\chi^2, 2) = \tfrac{1}{2} e^{-\chi^2/2}.    (9.40)

The distribution for χ^2 with 2 degrees of freedom, which is essentially an expo-
nential distribution, also arises in some power spectra. For 1 degree of freedom
the distribution has a singularity at χ^2 = 0, but it is an integrable singularity
and hence the probability remains finite. The mode of this distribution is ν − 2
for ν ≥ 2. The mean value of the χ^2 distribution is ν, the standard deviation is
√(2ν), the skewness is √(8/ν), while the kurtosis is 12/ν. Further, it can be shown
that for large ν this distribution tends to the normal distribution with the same
mean and standard deviation (Fig. 9.3).
In practice, it is not meaningful to find the probability of finding a partic-
ular value of χ^2, as that would be infinitesimal. It is more meaningful to consider
the possibility of finding a value as large as or larger than the one found. Hence
we are more interested in the cumulative probability of finding a value less
than some value x. This is given by

    F_{\nu}(x) = \frac{\gamma\!\left(\tfrac{\nu}{2}, \tfrac{x}{2}\right)}{\Gamma\!\left(\tfrac{\nu}{2}\right)} = P\!\left(\tfrac{\nu}{2}, \tfrac{x}{2}\right),    (9.41)

where P(a, x) is the incomplete Gamma function defined by

    P(a, x) = \frac{\gamma(a, x)}{\Gamma(a)} = \frac{1}{\Gamma(a)} \int_{0}^{x} e^{-t} t^{a-1}\, dt.    (9.42)

There is some ambiguity in the definition of the incomplete Gamma function; some-
times γ(a, x) is also referred to as the incomplete Gamma function. Thus given a
χ^2 value we can ask what is the probability of finding a value larger than that,
which is given by 1 − F_ν(χ^2). If this probability is reasonable, then the χ^2 value
is acceptable. On the other hand, if this probability is very small then it is
unlikely that the fit is good. By giving the probability of finding such a value we
can quantify the quality of the fit by the confidence level.
The χ^2 test can also be applied to test for a distribution. If we have a set
of n values x_i, then we can divide the range into appropriate bins and get the
number of values n_j in the jth bin, j = 1, ..., m. If we wish to know whether
this distribution is compatible with, say, the normal distribution, then we can
calculate the probability p_j of finding a value in the jth bin with the normal
distribution. Then we can compare the numbers n_j with np_j using a χ^2 function

    \chi^2 = \sum_{j=1}^{m} \frac{(n_j - np_j)^2}{np_j},    (9.43)

where we have assumed that the standard deviation in np_j is given by \sqrt{np_j},
which is what we expect from the Poisson distribution, which may be appropriate
in this case as counting is involved. Since the parameter n appears in this
definition, the number of degrees of freedom is m − 1 and using this we can
test for the compatibility of the two distributions. It is clear that the bins will have
to be chosen carefully, so that there is a reasonable number of entries in each
bin. Typically we should expect at least 5 entries in each bin. To account for the
tail in the distribution at the extremes, we can choose a large bin to ensure a
minimum count.
If we wish to compare the distributions in two data sets, then we can again
bin them in appropriate bins and count the number of points in each bin, say
n_i and k_i. Then we can use the definition

    \chi^2 = \sum_{i=1}^{m} \frac{(n_i - k_i)^2}{n_i + k_i},    (9.44)

where the denominator gives the variance in the difference n_i − k_i. If the total
number of points in the two sets has been adjusted to be the same, i.e., Σ n_i =
Σ k_i, then the number of degrees of freedom in χ^2 is m − 1. On the other hand,
if this is not the case then the number of degrees of freedom is m. However, in
that case the numbers n_i and k_i in (9.44) will need to be normalised properly.
EXAMPLE 9.1: In a test the scores of 21 students are:


49, 73, 46, 45, 67, 47, 28, 49, 61, 71, 49, 57, 52, 79, 40, 41, 55, 69, 41, 71, 40.
Test if this distribution is consistent with the normal distribution.

We can calculate the mean and standard deviation from the data to get μ = 53.8 and
σ = 13.6. We further bin the data into 5 bins with ranges [0,40], (40,50], (50,60], (60,70],
(70,100] to find 3, 8, 3, 3, 4 students respectively. Using the μ and σ values we can calculate
how many students are expected in each bin assuming a normal distribution. Using the
function ERFC we get the values 3.25, 4.93, 6.01, 4.36, 2.45 in the 5 bins. Using Eq. (9.43)
we get χ² = 4.85 with 4 degrees of freedom. The probability of finding a value as large or
larger is given by 1 − P(2, χ²/2) = 0.31, which is reasonable. Thus the given distribution is
consistent with a normal distribution.
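
The calculation in Example 9.1 is easy to reproduce with a short script. The following is a
minimal Python sketch (using NumPy and SciPy rather than the routines referred to in the
text); the observed counts are those of the example, and scipy.stats.chi2 supplies the
incomplete Gamma function of Eq. (9.41).

    import numpy as np
    from scipy.stats import norm, chi2

    scores = np.array([49, 73, 46, 45, 67, 47, 28, 49, 61, 71, 49, 57,
                       52, 79, 40, 41, 55, 69, 41, 71, 40])
    mu, sigma = scores.mean(), scores.std(ddof=1)      # about 53.8 and 13.6

    # observed counts in the bins [0,40], (40,50], (50,60], (60,70], (70,100]
    observed = np.array([3, 8, 3, 3, 4])

    # expected counts n*p_j from a normal distribution with the same mu, sigma;
    # the outer bins are extended to -inf and +inf to pick up the tails
    edges = np.array([-np.inf, 40, 50, 60, 70, np.inf])
    expected = len(scores) * np.diff(norm.cdf(edges, loc=mu, scale=sigma))

    # chi-square statistic of Eq. (9.43) and the probability of a larger value
    chisq = np.sum((observed - expected) ** 2 / expected)
    dof = len(observed) - 1                            # n fixes one constraint
    print(chisq, chi2.sf(chisq, dof))                  # roughly 4.8 and 0.3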

9.1.7 The F distribution


The so-called F-test can be applied to test if the variances of two distributions
are the same. It can also be applied to the ratio of two χ² values. If we have two
χ² values, χ₁² and χ₂² with ν₁, ν₂ degrees of freedom respectively, then we can
define the ratio of χ² values per degree of freedom

    f = \frac{\chi_1^2/\nu_1}{\chi_2^2/\nu_2} .        (9.45)

It can be shown that these f values follow the distribution given by

    f_{\nu_1,\nu_2}(x) = \frac{\Gamma\!\left(\frac{\nu_1+\nu_2}{2}\right)}{\Gamma\!\left(\frac{\nu_1}{2}\right)\Gamma\!\left(\frac{\nu_2}{2}\right)}
      \left(\frac{\nu_1}{\nu_2}\right)^{\nu_1/2} x^{\nu_1/2-1}
      \left(1 + \frac{\nu_1 x}{\nu_2}\right)^{-(\nu_1+\nu_2)/2} , \qquad x > 0 .        (9.46)

The mean value for this distribution is ν₂/(ν₂ − 2) (ν₂ > 2), while for ν₂ ≤ 2
the mean diverges. The variance is given by

    \sigma^2 = \frac{2\nu_2^2(\nu_1+\nu_2-2)}{\nu_1(\nu_2-2)^2(\nu_2-4)} \qquad (\nu_2 > 4).        (9.47)

Similarly, the skewness is defined only for ν₂ > 6 and is given by

    \gamma_1 = \frac{(2\nu_1+\nu_2-2)\sqrt{8(\nu_2-4)}}{(\nu_2-6)\sqrt{\nu_1(\nu_1+\nu_2-2)}} \qquad (\nu_2 > 6).        (9.48)

The kurtosis is defined only for ν₂ > 8 and is given by

    \gamma_2 = 12\,\frac{\nu_1(5\nu_2-22)(\nu_1+\nu_2-2) + (\nu_2-4)(\nu_2-2)^2}{\nu_1(\nu_2-6)(\nu_2-8)(\nu_1+\nu_2-2)} \qquad (\nu_2 > 8).        (9.49)
The integrated probability of finding a value less than f is given by

    F_{\nu_1,\nu_2}(f) = I_x\!\left(\frac{\nu_1}{2}, \frac{\nu_2}{2}\right) , \qquad x = \frac{\nu_1 f}{\nu_1 f + \nu_2} ,        (9.50)

where I_x(a, b) is the incomplete Beta function defined by Eq. (9.22).


While applying the F-test we have to consider the ratio in both orders
to make sure that the ratio is neither too small nor too large. Another use of
the F-test is to determine the significance of an additional term in least-squares fit
problems as discussed in Section 10.2.
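
As an illustration, a minimal Python sketch of the F-test applied to two hypothetical
samples is given below; scipy.stats.f is used for the integrated probability of Eq. (9.50)
instead of evaluating the incomplete Beta function directly.

    import numpy as np
    from scipy.stats import f

    rng = np.random.default_rng(1)
    x = rng.normal(0.0, 1.0, size=25)      # hypothetical sample 1
    y = rng.normal(0.0, 1.5, size=30)      # hypothetical sample 2

    s1, s2 = x.var(ddof=1), y.var(ddof=1)  # unbiased sample variances
    nu1, nu2 = len(x) - 1, len(y) - 1

    # if the true variances are equal, the ratio of sample variances follows
    # the F distribution with (nu1, nu2) degrees of freedom
    ratio = s1 / s2
    # consider the ratio in both orders: too small or too large is suspect
    p_large = f.sf(ratio, nu1, nu2)        # probability of a larger ratio
    p_small = f.cdf(ratio, nu1, nu2)       # probability of a smaller ratio
    print(ratio, p_large, p_small)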

9.2 Monte Carlo Methods


In simple situations it is possible to theoretically estimate the probability of
random events, but in more complicated situations that may not be possible.
For example, the study of very high energy cosmic rays by ground-based detectors
depends on the secondaries generated when these cosmic rays interact with the
Earth's atmosphere. Because of the large number of particles and the numerous
possibilities at every stage, it is not possible to estimate the probability distributions
of the various outcomes theoretically. In such cases Monte Carlo methods have
of various possibilities theoretically. In such cases Monte Carlo Methods have
been very successful. In this method we simulate the actual process by using
appropriate random numbers to decide on various possibilities. We can repeat
the whole process any number of times with different sets of random numbers
to get the final distribution. Even for simple problems we can use this tech-
nique to test the methods that we are learning. For example, we can generate
the Binomial distribution by simulating the coin toss or dice throw. For this
purpose we need some way of generating random numbers with appropriate
distributions. The standard random number generator (Section 6.11) generates
a sequence of uniformly distributed random numbers in the interval [0, 1]. Thus
the probability of finding a number in the interval [x, x + dx] is dx. With this
we can simulate tossing a coin by deciding that if the random number is in
the range [0, 1/2) it should be counted as a head. Similarly, to test a curve fitting
technique, we can generate artificial data from a known function and add ran-
dom errors with say a normal distribution to it, before using them in the fitting
routine. The result can then be compared with the actual function to test the
technique and to determine the limits on errors for which it may be possible to
recover the real function. In fact, before applying any formal statistical test to
real data it is a good idea to try it out on artificial data of similar kind to test
the procedure.
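
For instance, the coin-toss simulation mentioned above can be sketched as follows in Python
(the numbers of tosses and repetitions are arbitrary choices for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    n_toss, n_repeat = 20, 10000            # arbitrary choices

    # each row is one experiment of n_toss tosses; u < 0.5 counts as a head
    u = rng.random((n_repeat, n_toss))
    heads = np.sum(u < 0.5, axis=1)

    # empirical distribution of the number of heads, to be compared with
    # the Binomial distribution B(n_toss, 1/2)
    counts = np.bincount(heads, minlength=n_toss + 1)
    print(counts / n_repeat)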
In Section 6.11 we described a method to generate random numbers with
uniform distribution. To generate random numbers with a different probability
distribution P(x), we can generate a sequence of random numbers r_i with uniform
distribution and then transform it to the required distribution by using

    r_i = \int_{-\infty}^{x_i} P(x)\, dx .        (9.51)

By solving this equation for x_i we can get a sequence of random numbers


with the required distribution. The main problem with this approach is that
it may not be possible to solve this equation analytically. Thus at each stage
numerical solution may be required. Even the integral may have to be evaluated
numerically. For the Lorentzian distribution it is straightforward to evaluate this
integral and hence this procedure can be easily adopted in that case.
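
As a sketch of this inverse-transform method for the Lorentzian distribution, Eq. (9.51)
gives r = 1/2 + arctan((x − x₀)/γ)/π, which is easily inverted; the centre x₀ and width γ
used below are arbitrary, and the function name is just a label for this illustration.

    import numpy as np

    def lorentzian_deviates(n, x0=0.0, gamma=1.0, rng=None):
        """Random numbers with the Lorentzian (Cauchy) distribution
        p(x) = (gamma/pi) / ((x - x0)**2 + gamma**2), obtained by solving
        Eq. (9.51): r = 1/2 + arctan((x - x0)/gamma)/pi for x."""
        rng = rng or np.random.default_rng()
        r = rng.random(n)                       # uniform deviates in [0, 1)
        return x0 + gamma * np.tan(np.pi * (r - 0.5))

    x = lorentzian_deviates(100000)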
For the normal distribution the integral in Eq. (9.51) cannot be evaluated
in a closed form, though it can be expressed in terms of the error function, which can be
calculated. It may be noted that for the normal distribution it is only necessary
to generate a sequence z_i for μ = 0 and σ = 1. This can be transformed to the
required values by using x_i = μ + σ z_i. In this case Box and Muller (1958)
have suggested an alternative technique which is more efficient. For this we
can combine two random sequences with the normal distribution. The two-dimensional
distribution is given by

    G(z_1, z_2) = \frac{1}{2\pi}\, e^{-\frac{1}{2}(z_1^2 + z_2^2)} = \frac{1}{2\pi}\, e^{-r^2/2} = G(r\cos\theta, r\sin\theta) ,        (9.52)

where we have transformed to polar coordinates. Now we can integrate the
distribution in r, θ, where the θ integral gives 2π,

    f(r)\, dr = e^{-r^2/2}\, r\, dr .        (9.53)

We can generate two random sequences with uniform distribution, r_1 and r_2,
and calculate the two sequences z_1, z_2 with the normal distribution using

    z_1 = \sqrt{-2\ln r_1}\, \cos(2\pi r_2), \qquad z_2 = \sqrt{-2\ln r_1}\, \sin(2\pi r_2) .        (9.54)

In practice we can discard one of these sequences.
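
A minimal Python sketch of the Box–Muller transformation of Eq. (9.54) is given below;
both sequences are returned, though as noted one of them may be discarded, and the final
line shows the transformation to a general μ and σ.

    import numpy as np

    def box_muller(n, rng=None):
        """Generate two sequences of n normal deviates (mu=0, sigma=1)
        from two uniform sequences r1, r2 using Eq. (9.54)."""
        rng = rng or np.random.default_rng()
        r1 = 1.0 - rng.random(n)       # in (0, 1], avoids log(0)
        r2 = rng.random(n)
        rho = np.sqrt(-2.0 * np.log(r1))
        return rho * np.cos(2 * np.pi * r2), rho * np.sin(2 * np.pi * r2)

    z1, z2 = box_muller(100000)
    x = 5.0 + 2.0 * z1                 # transform to mu = 5, sigma = 2 via x = mu + sigma*z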


To generate random numbers with the Poisson distribution we can use
Eq. (9.51)

    r = \sum_{m=0}^{x} P_\mu(m) = \sum_{m=0}^{x} e^{-\mu} \frac{\mu^m}{m!} .        (9.55)

We can store the partial sums and get the value of x (i.e., ⌊x⌋) for a given uniform
random deviate r. This would be reasonably efficient for low μ, while for high
μ the distribution can be approximated by a normal distribution. In this case,
there is no simple way to scale the random number sequence for different values
of μ and hence the sequence has to be generated separately for each μ, unlike
the case for the normal distribution. A similar procedure can be adopted for
generating random numbers with the Binomial distribution.
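
A minimal sketch of this partial-sum inversion of Eq. (9.55) follows; the table length and
the value of μ are arbitrary, and as noted above the method is practical only for modest μ.

    import numpy as np

    def poisson_deviates(n, mu, rng=None):
        """Poisson random numbers by inverting Eq. (9.55): the cumulative
        sums of P_mu(m) are tabulated once and each uniform deviate r is
        located in the table."""
        rng = rng or np.random.default_rng()
        # tabulate the cumulative distribution far enough into the tail
        mmax = int(mu + 10 * np.sqrt(mu) + 20)
        m = np.arange(mmax + 1)
        logp = -mu + m * np.log(mu) - np.cumsum(np.log(np.maximum(m, 1)))
        cdf = np.cumsum(np.exp(logp))
        r = rng.random(n)
        return np.searchsorted(cdf, r)      # smallest m with cdf[m] >= r

    x = poisson_deviates(10000, mu=3.5)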
In this section we have only discussed methods for generating random
numbers with different probability distributions, which form the basis of
Monte Carlo methods. These methods have a wide range of applications. We
have already considered the application to numerical integration in Section 6.11
and to optimisation problems using simulated annealing (Section 8.8). Other
applications in data analysis are discussed in the next Chapter. Apart from data
analysis, this technique can also be used to simulate the effect of roundoff errors
in numerical calculations. Monte Carlo methods have been widely used in sim-
ulating physical processes which are not fully deterministic, like the interaction of
particles in accelerators, or where the initial conditions are not known, like the
large scale structure formation in the universe. In these cases random numbers
are used to decide on the branch of reaction activated or to generate the ini-
tial conditions with appropriate distribution. Monte Carlo techniques are also
applied to simulate experiments, to decide the optimal strategy to be used to
get the best results.

9.3 Experimental Errors and their Propagation


As we have pointed out earlier all experimentally measured quantities would
have some associated errors. Neglecting systematic errors, the remaining errors
would be randomly distributed about the true value. In most cases the distri-
bution is expected to be close to the normal distribution. In general we specify
the measured quantity x by x ± σ, where σ is the estimated error. We use the
standard deviation as a measure of errors. If errors are normally distributed,
then the probability that the true value is within the limits specified is about
68%. This can be used to compare two measurements of the same quantity. If
the difference between the two is comparable to estimated errors, then the two
measurements are consistent with each other. Apart from the explicit error-
bars there is also a concept of significant figures. Thus if we write a value, say,
1.234, then it is presumed that the error is less than 0.0005 and we can claim
it has 4 significant figures of accuracy. For counting significant figures, leading
zeros should be ignored, while following zeros should be counted. For example,
1.23400 has 6 significant figures and error of less than 0.000005, while 0.00012
has only 2 significant figures with the same error. For numbers that are very
different from unity, scientific notation may be used, e.g., 4.5678 × 10^9 has 5
significant figures. While writing numbers with error bars one should not show
too many digits as they may not have any meaning. In general it is a good
idea to show 1 more digit than what may be significant, e.g., 1.2344 ± 0.0013.
While doing calculations with these numbers, intermediate quantities should
not be rounded. The rounding should be done only for final results. Otherwise
additional errors will be introduced in the calculations. It should be noted that
the error bar is only an estimate of the error; the actual error would be of this order
but is, of course, unknown.
Ideally, we should be able to estimate the errors by analysing the process
of measurement, but if that is not possible, then we can measure the same
quantity many times and the standard deviation in this distribution will give
the estimate of error in individual measurement. It may be noted that the
error in the mean value would be smaller by a factor of √n, where n is the
number of independent measurements. If the measured value is used in further
calculations, the error will propagate and we need to estimate the error in
the numbers so obtained. For the simple case of addition of two numbers,
(x ± σ_x) + (y ± σ_y), a straightforward calculation would suggest that the best
value should be x + y and the error would be σ_x + σ_y. However, one can
easily argue that the error is over-estimated. Since the errors are random, the
probability that both numbers will have error close to the upper (or lower) limit
is relatively small and thus the actual error would be smaller.
To analyse this more carefully, let us assume that we make a series of
measurements, x_i, y_i, i = 1, ..., n for the quantities x, y, each with estimated
errors of σ_x, σ_y. The mean values x̄, ȳ and the standard deviations σ_x, σ_y can be
calculated from these measurements. In this case we get n values of the sum
s_i = x_i + y_i. It can be easily seen that the mean value s̄ = x̄ + ȳ. Now we can
calculate the standard deviation in these values to get

    \sigma_s^2 = \frac{\sum_{i=1}^{n} (s_i - \bar{s})^2}{n-1}
               = \frac{\sum_{i=1}^{n} (x_i + y_i - \bar{x} - \bar{y})^2}{n-1}
               = \sigma_x^2 + \sigma_y^2 + \frac{2\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n-1}
               = \sigma_x^2 + \sigma_y^2 + \frac{2\left(\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}\right)}{n-1}
               = \sigma_x^2 + \sigma_y^2 + 2\sigma_{xy} ,        (9.56)

where σ_{xy} is known as the covariance between x, y and is given by

    \sigma_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n-1} = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{n-1} .        (9.57)

The covariance can be negative and hence we have omitted a power of 2 in
this definition. If the errors in x and y are independent, the covariance will
vanish and the squares of the errors add to give the squared error in the
sum. Since the errors are added in squares, small errors will not make much
contribution to the final result and can be neglected in comparison to larger
errors. It can be verified that this sum would be less than (σ_x + σ_y)².
If we consider the difference of two numbers x_i − y_i, then the error will
be essentially the same as that in Eq. (9.56), except that the covariance term
will have the opposite sign.
Instead of the sum, if we take the product of two numbers, p_i = x_i y_i, then

    x_i y_i - \bar{x}\bar{y} = \bar{x}(y_i - \bar{y}) + \bar{y}(x_i - \bar{x}) + (x_i - \bar{x})(y_i - \bar{y}) .        (9.58)

If the errors are small we can neglect the last term and, after dividing by
the product xy, it can be seen that in this case the relative errors are added.
Following the analysis for the sum, these errors will be added in quadrature to
get

    \frac{\sigma_p^2}{p^2} = \frac{\sigma_x^2}{x^2} + \frac{\sigma_y^2}{y^2} + \frac{2\sigma_{xy}}{xy} .        (9.59)
In general, if we are doing some calculation using two measured quantities
x, y, this can be considered to be a function f(x, y). Again we can consider n
measurements x_i, y_i to get f_i = f(x_i, y_i). There may be some error in comput-
ing this function, but we assume that it is much smaller than the errors caused
by uncertainties in x, y. If the errors are small we can write

    f_i - \bar{f} = \frac{\partial f}{\partial x}(x_i - \bar{x}) + \frac{\partial f}{\partial y}(y_i - \bar{y}) ,        (9.60)

where f̄ = f(x̄, ȳ) is the mean value. Using this we can calculate the variance
in f

    \sigma_f^2 = \left(\frac{\partial f}{\partial x}\right)^2 \sigma_x^2 + \left(\frac{\partial f}{\partial y}\right)^2 \sigma_y^2
               + 2\left(\frac{\partial f}{\partial x}\right)\left(\frac{\partial f}{\partial y}\right) \sigma_{xy} .        (9.61)

For a function of one variable this reduces to σ_f = (df/dx)\,σ_x.

This result can be generalised to a function of m variables f(x_1, ..., x_m)
by using the gradient vector

    \nabla f = \left(\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_m}\right)^T .        (9.62)

In this case the variance in f is given by

    \sigma_f^2 = (\nabla f)^T C\, (\nabla f) ,        (9.63)

where C is the covariance matrix whose elements are defined by

    C_{ij} = \begin{cases} \sigma_i^2 & \text{if } i = j, \\ \sigma_{ij} & \text{if } i \neq j, \end{cases}        (9.64)

where σ_i² is the variance in x_i and σ_{ij} is the covariance between x_i, x_j. The
covariance matrix is symmetric, with the diagonal elements given by the vari-
ance of x_i and the off-diagonal elements given by the covariance between the
respective variables. If all variables are independent of each other the covariance
matrix will be diagonal and the variance is given by

    \sigma_f^2 = \sum_{i=1}^{m} \left(\frac{\partial f}{\partial x_i}\right)^2 \sigma_i^2 .        (9.65)

Using these relations we can now estimate the error in the mean value
given by Eq. (9.1). Since each measurement is independent, the errors in
any two of them are uncorrelated and hence the error in the mean is given by

    \sigma_{\text{mean}}^2 = \frac{\sigma^2}{n} , \qquad\text{or}\qquad \sigma_{\text{mean}} = \frac{\sigma}{\sqrt{n}} .        (9.66)

This proves the result that we had assumed earlier.
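
As a numerical illustration of Eq. (9.63), the following Python sketch propagates errors
through a hypothetical function f(x, y) = xy/(x + y) of two measured quantities; the input
values, errors and covariance are made up, and the gradient is written out analytically.

    import numpy as np

    # hypothetical measured values with errors and a covariance
    x, y = 10.0, 4.0
    sig_x, sig_y, sig_xy = 0.2, 0.1, 0.005

    # f(x, y) = x*y/(x + y) and its gradient
    f = x * y / (x + y)
    grad = np.array([y**2 / (x + y)**2, x**2 / (x + y)**2])

    # covariance matrix of Eq. (9.64) and the variance of f from Eq. (9.63)
    C = np.array([[sig_x**2, sig_xy],
                  [sig_xy,  sig_y**2]])
    var_f = grad @ C @ grad
    print(f, np.sqrt(var_f))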


The definition of covariance is not normalised and it is difficult to estimate
whether two sets of measurements indeed have nonzero covariance. For this purpose
there is another measure, the correlation, defined by

    R = \frac{\sigma_{xy}}{\sigma_x \sigma_y} ,        (9.67)

which is known as the correlation coefficient. It is clear that this is always
between −1 and 1. If the two quantities satisfy a linear relation, R = ±1, with
the sign depending on the slope of the line. To get a quantitative measure of the
significance of a given value of R we need to consider its probability distribution, which depends
on the actual value of the correlation coefficient, which in general is not known.
Thus we normally consider the probability distribution of R when the two sets
of variables are not correlated and check if the actual value is consistent with
that. This probability distribution is given by

    P_\nu(r) = \frac{1}{\sqrt{\pi}}\,\frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right)}\, (1 - r^2)^{(\nu-2)/2} ,        (9.68)

where ν = n − 2 is the number of degrees of freedom, which arises because a linear relation
would require 2 parameters. The mean value of this distribution is 0 and its
standard deviation is 1/√(ν+1). In the limit of large ν it can be shown that this
distribution reduces to a normal distribution with zero mean and a standard
deviation of 1/√ν. We can use this approximation for large ν, but for smaller
values we may have to use the distribution given by Eq. (9.68).
As usual we need the integrated probability in the region [|R|, 1] to check
if the actual result is consistent with no correlation. If this probability is small
then there is a good chance that the two variables are correlated. This proba-
bility is given by

    P(|r| > |x|) = \begin{cases}
      1 - \dfrac{2}{\sqrt{\pi}} \dfrac{\Gamma[(\nu+1)/2]}{\Gamma(\nu/2)}
          \displaystyle\sum_{k=0}^{l} (-1)^k \dfrac{l!}{(l-k)!\,k!} \dfrac{|x|^{2k+1}}{2k+1}
        & \text{for } \nu = 2l+2 \text{ even}, \\[2ex]
      1 - \dfrac{2}{\pi}\left( \sin^{-1}|x| + |x| \displaystyle\sum_{k=0}^{l} \dfrac{(2k)!!}{(2k+1)!!}\, (1 - x^2)^{k+1/2} \right)
        & \text{for } \nu = 2l+3 \text{ odd},
    \end{cases}        (9.69)

where n!! denotes the double factorial, in which only the odd or only the even factors up to n are included.
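
A minimal sketch of this test for two hypothetical sequences follows; the correlation
coefficient is computed from Eq. (9.67), and scipy.stats.pearsonr is used to obtain the
probability of Eq. (9.69) rather than summing the series explicitly.

    import numpy as np
    from scipy.stats import pearsonr

    rng = np.random.default_rng(2)
    n = 20
    x = rng.normal(size=n)                       # hypothetical measurements
    y = 0.3 * x + rng.normal(scale=1.0, size=n)  # weakly correlated with x

    # correlation coefficient R = sigma_xy / (sigma_x * sigma_y), Eq. (9.67)
    sxy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)
    r = sxy / (x.std(ddof=1) * y.std(ddof=1))

    # probability of |R| exceeding this value when the variables are
    # uncorrelated (nu = n - 2); pearsonr returns the same quantity
    print(r, pearsonr(x, y)[1])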
We can also address the problem of checking if two independent measure-
ments are consistent with each other. If there are two measurements x_1 ± σ_1
and x_2 ± σ_2 of the same quantity, then we can take the difference to get
x_1 − x_2 ± √(σ_1^2 + σ_2^2). If the two error estimates are equal the error will be mul-
tiplied by √2 when the difference is taken. If we assume that the errors have a
normal distribution we can compare the magnitude of the difference |x_1 − x_2| with
the error estimate to check what is the probability of finding such a value. Of
course, the probability of getting any particular value will always be small and
it is not meaningful to estimate that probability. Instead we can estimate the
probability of finding the difference larger than the measured difference. This
will depend on the ratio |x_1 − x_2| / √(σ_1^2 + σ_2^2). If this ratio is small, then there
will be a reasonable probability of getting such difference and the two mea-
surements are consistent with each other. For example, if the ratio is 1, then
the probability of finding a difference larger than this magnitude is about 32%,
which is reasonable. On the other hand, if this ratio is 4, then the probability
of finding a difference larger than this is less than 0.01%. Thus there is little
chance that the two values are consistent with each other.
Similarly, we can consider the possibility of rejecting some measurement.
If we measure a quantity n times and calculate the mean value, then we can
calculate the deviations x_i − x̄ and compare them with the estimated standard
deviation σ. If this ratio is large, say ≥ 4, then it is unlikely that this mea-
surement is reasonable. We can think of rejecting this value, though rejection
of data is a controversial topic and sometimes it is claimed that one should
not reject any data, unless there is an independent reason to believe that the
measurement is not reliable. The significance of the difference also depends
on how many measurements are made. If we make a very large number of measurements,
one of them may show a large difference. Thus we should calculate
nP(x > |x_i − x̄|), and if this number is much smaller than one, we can reject the
corresponding value. Such values are referred to as outliers. The exact threshold
below which we can reject a value is arbitrary and a reasonable value is 1/2,
which is referred to as Chauvenet's criterion. After eliminating the data value
we need to recalculate the mean and σ again. Both these values will change
and in particular, σ will reduce significantly. Thus after recalculation, we may
find more data points which can be eliminated. On the other hand, it is also
possible that some value which was rejected earlier becomes acceptable. Thus
this is an iterative process. This process may be continued, until we do not find
any offending data points. If a large number of points get eliminated then it is
very likely that the errors do not have a normal distribution and we may have
to worry about alternate strategies. There are fitting techniques which are not
very sensitive to the presence of a few outliers and one can consider using these.
For example, in the present case if we calculate the median value, the result
will not be particularly sensitive to the presence of a few outliers.
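
The iterative rejection described above can be sketched as follows; the data are hypothetical,
the two-sided normal tail probability is used for P(x > |x_i − x̄|), and the function name
chauvenet is just a label for this illustration.

    import numpy as np
    from scipy.stats import norm

    def chauvenet(data, threshold=0.5):
        """Iteratively reject outliers: a point is dropped whenever
        n * P(|x - mean| > |x_i - mean|) < threshold, with the probability
        taken from the normal distribution (Chauvenet's criterion)."""
        data = np.asarray(data, dtype=float)
        keep = np.ones(len(data), dtype=bool)
        for _ in range(100):                       # guard against endless cycling
            mean, sigma = data[keep].mean(), data[keep].std(ddof=1)
            # two-sided normal tail probability for every point, times n
            n_expected = keep.sum() * 2 * norm.sf(np.abs(data - mean) / sigma)
            new_keep = n_expected >= threshold
            if np.array_equal(new_keep, keep):     # no change: iteration has converged
                break
            keep = new_keep                        # previously rejected points may return
        return data[keep], mean, sigma

    vals = [9.8, 10.1, 10.0, 9.9, 10.2, 14.5, 10.0, 9.7]   # hypothetical data
    cleaned, mean, sigma = chauvenet(vals)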

Bibliography
Abramowitz, M. and Stegun, I. A. (1974): Handbook of Mathematical Functions, With For-
mulas, Graphs, and Mathematical Tables, Dover New York.
Bevington, P. R. and Robinson, D. K. (2002) : Data reduction and error analysis for the
physical sciences, (3rd ed.) McGraw Hill
Box, G. E. P. and Muller, M. E. (1958): A note on the generation of random normal deviates,
Ann. Math. Statist., 29, 610.
Brownlee, K. A. (1984): Statistical Theory and Methodology in Science and Engineering,
(2nd ed.), Krieger Publishing Company.
Chambers, J. M. (1977): Computational Methods for Data Analysis, John Wiley, New York.
Oldham, K. B., Myland, J. C. and Spanier, J. (2009): An Atlas of Functions: with Equator,
the Atlas Function Calculator, (2nd ed.) Springer, Berlin.
Press, W. H., Teukolsky, S. A., Vetterling, W. T. and Flannery, B. P. (2007): Numerical
Recipes: The Art of Scientific Computing, (3rd ed.) Cambridge University Press, New
York.
Robert, C. P. and Casella, G. (2010): Monte Carlo Statistical Methods, Springer Texts in
Statistics.
Taylor, J. R. (1997): An Introduction to Error Analysis, University Science Books.

Exercises
1. Use a random number generator which generates numbers with the normal distribution,
with μ = 0 and σ = 1 and generate a sequence of n random numbers. Draw a histogram
showing the distribution of these numbers and compare it with the normal distribution,
for n = 10,20,50, 100, 1000, 10000, 100000. From these distributions estimate the value of
Ixl such that the probability of finding a value which has deviation larger than this is 20%,
2%, 0.5%, 0.05%, 0.005%. Compare these with theoretically estimated values. Calculate
the mean, standard deviation, skewness and kurtosis from these numbers. Repeat the
exercise with a different set of random numbers and find the distribution of these moments.
2. Simulate the process of rolling dice on the computer by using random numbers with
uniform probability distribution. For example, if the random number is > 5/6, we can
assume that the result is a six. Using this simulate the event n times and find out the
number of times a six has been found. Repeat the entire exercise 10000 times and find
out the probability of getting m sixes in n trials, m = 0, ... , n and draw the histogram
of probability as a function of m. Compare this with the Binomial distribution and
appropriate normal distribution, for n = 10,20,50, 100, 1000.
3. Using the Stirling formula for n!, show that for μ = np ≫ 1 the binomial distribution tends
to the normal distribution with the same mean and standard deviation.
4. Show that the mean and standard deviation for the binomial distribution are given by
Eq. (9.19, 9.20) respectively. Also estimate the skewness and kurtosis for binomial distri-
bution.
5. In a university course, on average 5% of the students have failed over the last several years.
What is the probability that 4 or more students will fail in a class of 40 students this
year?
6. If three dice are thrown together, find the probability that the sum of the points on the
three dice is x (3 ≤ x ≤ 18). Find the mean and standard deviation of this distribution.
Simulate the process when 100 dice are thrown together. Repeat the simulation 10000
times and find the probability distribution for the sum of points on all dice and calculate
the mean and standard deviation of the distribution.
7. Calculate the Poisson distribution for μ = 0.5, 1, 2, 5, 10, 20 and draw the histogram and
compare with (1) Binomial distribution with p = μ/n and n = 30, 100, 500, 1000 and
(2) Normal distribution with the same μ and σ = √μ.
8. For the Poisson distribution, if the mean μ is an integer, then show that
P_μ(μ) = P_μ(μ − 1).
9. Calculate the skewness and kurtosis for the Poisson distribution.
10. A large neutrino detector detected 8 neutrinos coinciding with observations of Supernova
SN 1987A. If the average number of (background) events detected by the instrument is
2 per day, what is the probability of detecting 8 or more neutrinos in a day? Noting
that all these 8 neutrinos were detected within a span of 1 minute, calculate what is the
probability of it being due to background.
11. In an experiment to measure radioactivity of a sample, a count of 400 particles in 1 minute
is recorded when the source is kept near the counter. To measure the background the
source is removed and 400 counts are recorded in 5 minutes. Find the contribution from
the source with the error estimate assuming Poisson distribution.
12. Electronic counters are limited by dead-time. After each count the instrument is effec-
tively dead for this interval as it is completing the recording process and recovering from
the event. We are using a counter that has a dead-time of 100 ns (10^{-7} s) and a beam
of particles with a rate of 10^6 particles per second is hitting the counter. If two or more
particles hit the detector within the dead-time only one will be recorded, thus leading
to a loss in counting. Find the efficiency of this counter, which is the ratio of particles
detected to the total number hitting the counter. What happens if the beam rate is
5 × 10^6 or 10^7 particles per second?

13. Using the definition of the beta function and the asymptotic expansion for Γ(x) ({1.3}),
show that as ν → ∞ the Student's t distribution tends to a normal distribution with μ = 0
and σ = 1.
14. Generate three sequences of 100 random numbers with the normal distribution with (μ, σ) =
(0,1), (0, 1.5) and (0.1,1.5) respectively. Calculate the mean and standard deviation for
each set using these numbers and then apply the t-test to check if they are consistent
with each other.
15. Show that for the χ² distribution the mean is ν and the standard deviation is √(2ν).
16. Using the asymptotic expansion for Γ(x) ({1.3}), show that as ν → ∞ the χ² distribution
tends to a normal distribution with μ = ν and σ = √(2ν).
17. We have three identical dice and wish to test if they are loaded. We throw them 100
times and count the number of 6s each time to find that we get 0, 1, 2, 3 sixes 45, 35,
15, 5 times respectively. What is the chance that the dice are loaded?
18. If x, y have uncertainties σ_x, σ_y respectively, and a covariance of σ_xy, what is the uncer-
tainty in (i) x/y, (ii) √(x^2 + y^2), (iii) √(x^2 + 4y^2) − x.
19. Generate four sequences each of 100 random numbers using

    r_i^{(1)} = g_i, \quad r_i^{(2)} = (u_i - 0.5), \quad r_i^{(3)} = \frac{0.01}{(u_i - 0.5)}, \quad r_i^{(4)} = \frac{10^{-6}}{(u_i - 0.5)^3},

where g_i are random numbers with a normal distribution and μ = 0, σ = 1, and u_i are
random numbers with uniform distribution over (0,1). Find the mean, median, standard
deviation, skewness and kurtosis for each of these. Plot the histogram of the distribution
for each of these. Compare each of these with the normal distribution with the same
mean and standard deviation, using the χ² test. Also find the correlation coefficients
between all 6 pairs of sequences and interpret the value.
20. Generate two sequences of n numbers using

x_i = (i/n) + 0.01\,r_i , \qquad y_i = a(i/n) + 0.02\,s_i , \qquad i = 1, \ldots, n,


where r_i and s_i are two sequences of random numbers with a normal distribution and μ =
0, σ = 1. Calculate the correlation coefficient between x_i, y_i and estimate its significance.
Use n = 10,20,50,100 and a = 0, ±0.01, ±0.05, ±0.1.
Chapter 10

Functional Approximations

In numerical computation we often have to approximate functions. The need


for approximation arises, because the function may be known only in the form
of a table of values, and we need some closed form representation so that the
required manipulations, e.g., differentiation or integration, can be performed.
Another reason for using approximations could be that the function is defined
implicitly, or the analytic expression is too complicated to be evaluated effi-
ciently. In fact, any part of a computer program which has only one input
variable and one output variable can be treated as a function of one variable,
and a suitable approximation can save considerable amount of computation,
if that part of the program is executed several times. Even if the function is
known in analytic form, it may not be possible to evaluate the function using a
finite sequence of basic arithmetic operations, e.g., sin x or eX. Such functions
have to be approximated by simpler functions that can be evaluated efficiently
on a computer. A rational function is the most general function that can be
evaluated using the basic arithmetic operations. Approximation of functions is
also useful in numerical analysis, for example, the solution of a differential or
integral equation can be approximated by a combination of simple functions.
In Chapter 4 we considered one class of approximation, i.e., interpola-
tion, which is widely used in numerical computations, because of the ease with
which it can be implemented in practice. As we have seen in previous chapters,
interpolation forms the basis of many numerical methods for differentiation, in-
tegration, solution of nonlinear equations and optimisation. Interpolation may
not be the best type of approximation for use on digital computers, since it
requires some memory to store the table, as well as time to scan the table for
the nearest value. Further, if the function is defined by data which are obtained
experimentally, there will be errors in the function value itself and the function
need not agree with given values at the specified points. In this chapter, we
shall consider other types of approximations, which are obtained by minimis-
ing some reasonable norm of truncation error in approximation. Hence, these
approximations are the best in some sense.

Section 10.1 describes various norms which can be used for approxima-
tions. In Sections 10.2--4, we consider the least squares approximations, which
are the simplest (apart from interpolation) to implement. These approximations
are very widely used in fitting experimentally obtained data. The related topic
of Fourier approximation is considered in Sections 10.5-7, while numerical in-
version of Laplace transforms is considered in Section 10.8. In Sections 10.9-11,
we consider the problem of approximating mathematical functions. The mini-
max approximations are discussed in Sections 10.11 and 10.12. In Section 10.13,
we consider approximations based on L1 norm, which are more robust than the
least squares approximation and are useful when the expected errors in the
data are large.

10.1 Choice of Norm and Model


There are three rather distinct types of approximation problems which arise in
practice:
1. Approximation of mathematical functions: We need to approximate an
exact mathematical function, which is defined implicitly, or cannot be
evaluated using a finite number of arithmetic operations. For example, the
functions like sin x or eX are evaluated using appropriate approximations
in the library routines. Such approximation problems have the following
characteristics.
(a) There is no uncertainty or error in the function being approximated,
in the sense that at any given point the value of the function can be
calculated to any required accuracy.
(b) Very high accuracy, of the order of the machine accuracy ℏ, is required; that is, the computed
value of the function is expected to be the correctly rounded version
of the actual result, for every value of the argument x in the required
range. Thus, the approximation should be uniformly valid.
(c) There is a lot of specialised information available about the function,
which can be effectively utilised in constructing the approximation.
For example, all library routines to calculate sin x use the follow-
ing identities to reduce the interval over which the approximation is
required:

    \sin(x + k\pi) = (-1)^k \sin x, \qquad \sin(\pi - x) = \sin x, \qquad \sin\!\left(\frac{\pi}{2} - x\right) = \cos x .        (10.1)
(d) The approximation is expected to provide an efficient algorithm for
evaluating the function. The efficiency is important, since the func-
tion may need to be evaluated a large number of times.
2. Representation of data: A large set of fairly accurate data needs to be rep-
resented by a compact formula. For example, if we have measured values
of viscosity of water at several temperatures, it may be more convenient


to replace it by a simple analytic expression, which is more efficient for
computation and will also save the amount of memory required. Compact
representation of complicated functions also falls in this category. In that case,
the errors could be either the roundoff error or the truncation error, in cal-
culating the function values. The approximation problems in this category
have the following characteristics:
(a) There is little uncertainty or error in the function values.
(b) Moderate accuracy is required, and the errors in the input data are
less than the required accuracy.
(c) There may be some specialised information available about the func-
tion. For example, there may be some theory which claims that the
function should be of the form a + b e^{cx}, in which case the data can
be used to determine the best values of the parameters a, b and c.
(d) The approximation is expected to give an efficient algorithm for eval-
uating the function.
3. Smoothing of data: We need to approximate a set of data with substan-
tial error in some or all of the entries. There could be some outliers which
have substantially larger error than the average. These could be due to
some unusual circumstances, for example, the value might have been mis-
copied or someone might have kicked the instrument. In such problems,
the approximation is mainly required to smooth out the errors and pro-
duce more acceptable function values. This class of problems has the
following characteristics:
(a) There is a substantial error in the data.
(b) Low accuracy is required.
(c) There is probably no specialised information available about the
function.
(d) The efficiency is usually not very crucial, but the approximation is
expected to smooth the function by removing errors from the data.
Depending on the class of problems, we use different techniques for approx-
imation. Interpolation can in principle be used for the second class of problems,
but it does not yield a compact representation, since the entire table has to be
stored. For the first class of problems, interpolation cannot yield sufficient accu-
racy with reasonable storage requirements; while for the third class of problems,
it is meaningless to interpolate, since the actual function need not pass
through the given set of points.
Approximations are defined with respect to some measure of goodness or
closeness. Most of the approximations minimise some norm of the error. The
norm could be based on a continuum of points or on a set of discrete points.

For problems in the first class, it may be better to use a continuum norm, while
for other classes of problems, the norm may be defined over a set of discrete
values only, since the function value is not known at an arbitrary point. A
function f(x) can be approximated by a computable function F(a,x), where
a = (a_1, a_2, \ldots, a_m)^T is a set of parameters which characterise the function F
and are determined by minimising the norm of error. The most convenient class
of norm is defined as follows:

    \|f - F\|_p = \left( \int_a^b w(x)\, |f(x) - F(\mathbf{a}, x)|^p \, dx \right)^{1/p} ,
    \quad\text{or}\quad
    \|f - F\|_p = \left( \sum_{i=1}^{n} w_i\, |f(x_i) - F(\mathbf{a}, x_i)|^p \right)^{1/p} ,        (10.2)

where w(x) (or w_i) is some convenient weight function, which is assumed to
be positive over the required range. Here, in the continuum case, it is assumed
that the approximation is required over the interval (a, b), while in the discrete
case, the value of the function is assumed to be known at a set of n points.
This norm is referred to as the L_p norm.
The most common values of p are 1, 2 and ∞. The L_∞ norm is generally
used for the first class of problems, and the corresponding approximation is
referred to as the Chebyshev approximation, or the minimax approximation.
The latter name arises because

    \|f - F\|_\infty = \max_{a \le x \le b} |f(x) - F(\mathbf{a}, x)|
    \quad\text{or}\quad
    \|f - F\|_\infty = \max_{1 \le i \le n} |f(x_i) - F(\mathbf{a}, x_i)| .        (10.3)
Hence, the L_∞ norm just gives the maximum error over the required interval or the
required set of points, and in L_∞-approximation the object is to minimise this
maximum error. The advantage of this norm is that, in this case, the maximum
error in approximation is minimised. Thus an upper bound on the error is
known and this bound is minimised. However, this approximation is not very
widely used, primarily because the norm is not a differentiable function of the
parameters. Hence, it is not easy to minimise this norm. Further, the L_∞ norm is
unsuitable when data have substantial errors, since it often magnifies the effect
of a single error. Because of these reasons, the minimax approximations are
primarily used to approximate mathematical functions.
The most widely used norm is the L2 norm, which leads to the least
squares approximations, obtained by minimising the sum (or the integral) of
the square of the error. This is the simplest norm which yields a differentiable
function of the parameters and is easy to minimise. Because of this simplicity
it is widely used, even in those problems where L_1 or L_∞ norms are more
natural. In many cases, this use may be justified, because the benefits obtained
by using the other norms do not offset the extra effort required. In particular, if
the errors are normally distributed, then this is the correct norm to be used. In
practice, the errors may not be normally distributed and it may not be easy to
test for the type of distribution, particularly, when the amount of data is very
little. The L2 norm is also fairly sensitive to large errors in some data points,
and is not really suitable for approximating data with outliers. Most of the
statistical tests about the goodness of fit apply only if the errors are normally
distributed.
To show that this is the correct norm to be used when data errors follow
the normal distribution, let us assume that we make n independent measure-
ments to get values Yi = f(xi) where the errors in Xi are negligible. Now the
probability of finding the set of values y_i is given by

    P(y_1, \ldots, y_n) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_i}
        \exp\left(-\frac{(y_i - f(x_i))^2}{2\sigma_i^2}\right)
      = \frac{1}{(2\pi)^{n/2} \prod_{i=1}^{n} \sigma_i}
        \exp\left(-\sum_{i=1}^{n} \frac{(y_i - f(x_i))^2}{2\sigma_i^2}\right) .        (10.4)

Here the terms outside the exponential function are constants and hence this
probability will be maximum if the summation inside the exponential is mini-
mum. This is the principle of maximum likelihood which shows that for max-
imum likelihood, the sum of squares or the L_2 norm of the difference, with
weights given by 1/σ_i², should be minimised. This function is usually referred
to as χ².
In L_1-approximation the object is to minimise the sum of absolute values
of the errors. The L_1 norm is even more difficult to minimise than the L_∞
norm, but it is very robust as it is remarkably insensitive to the presence of a
few outliers. Hence, this norm is most suitable for the third class of problems,
but unfortunately because of difficulty in implementation, it is rarely used in
practice. With the development of new effective techniques and software for
this approximation, it should be possible to use it in data smoothing problems.
From the above discussion it is clear that the choice of the norm is dictated
by two conflicting requirements of the ease in implementation and the suitability
of the approximation. The least squares approximation may not be the best one
to use in most problems, but it is the easiest to obtain and analyse. There is
immense literature on various tests to determine the quality of least squares
approximations. In the presence of outliers which are quite common in real data
that needs smoothing, the least squares approximation is not really satisfactory
and it may be better to use the L_1 norm. For approximating mathematical
functions, it is best to use the minimax (L_∞) approximations.
Having defined the norm, let us consider a simple problem of estimating
the best value of a quantity, say x, from n independent measurements x_i,
i = 1, ..., n. If we minimise the L_2 norm, then the function to be minimised is

    \chi^2(x) = \sum_{i=1}^{n} \frac{(x_i - x)^2}{\sigma_i^2} .        (10.5)

To minimise χ²(x) we can equate the derivative to zero, which gives

    x = \frac{\sum_{i=1}^{n} x_i/\sigma_i^2}{\sum_{i=1}^{n} 1/\sigma_i^2} .        (10.6)

If the errors σ_i are all equal then we can use σ_i = 1 and this is just the mean
value that we defined in Eq. (9.1). Thus if the errors have a normal distribution,
the mean gives the best value. The error in the estimated mean can be calculated
using the results in Section 9.3 to get

    \sigma_{\text{mean}}^2 = \frac{1}{\sum_{i=1}^{n} 1/\sigma_i^2} .        (10.7)

It can be seen that if all errors are equal, i.e., σ_i = σ, then the error in the
mean is σ/√n as found in Section 9.3.
Similarly, it can be shown that even for the Poisson distribution the mean is
the best estimate. In this case the probability is given by

    P(y_1, y_2, \ldots, y_n) = e^{-n\mu} \prod_{i=1}^{n} \frac{\mu^{y_i}}{y_i!} .        (10.8)

To maximise this probability, take the logarithm and find its derivative with
respect to μ, which gives

    -n + \frac{1}{\mu} \sum_{i=1}^{n} y_i = 0 , \qquad\text{or}\qquad \mu = \frac{1}{n} \sum_{i=1}^{n} y_i .        (10.9)

For the L_1 norm let us assume w_i = 1; in that case we need to minimise

    h(x) = \sum_{i=1}^{n} |x_i - x| .        (10.10)

This function does not have continuous derivatives, but it can be seen that it
will be a minimum when x is the median of the distribution of x_i. Thus for the L_1 norm
the median is the best value to use. It can be easily seen that the median
is rather insensitive to the presence of a few outliers and hence gives a robust
estimate of the true value.
On the other hand, if we use the L_∞ norm, then we have to minimise the
maximum of |x_i − x|, which is achieved if x is the midrange, i.e., the mean
of the minimum and maximum values of {x_i}. This is obviously much more
sensitive to the presence of a few outliers, as these outliers will define the minimum
and maximum values.
EXAMPLE 10.1: An experiment to measure the acceleration due to gravity g at some place
produces five different results g_1, ..., g_5 = 980.5, 981, 978.5, 979, and 982 cm s^{-2}. Determine
the best approximation to g using different norms.

Using the minimax approximation we get the best value as the midrange, i.e., (max g_i +
min g_i)/2 = 980.25 cm s^{-2}. The least squares approximation produces the average, which is
(1/5) Σ g_i = 980.2 cm s^{-2}. The L_1-approximation produces the median, which is 980.5 cm s^{-2}.
It can be seen that all these approximations give very similar results in this case.
Now assume that we have a sixth value which is 1980 cm s^{-2}, because of some blunder
in carrying out the experiment. In that case, the L_∞ and L_2-approximations give values
of 1479.25 and 1146.83, respectively, while the L_1 norm does not give a unique result in
this case, since any value between 980.5 and 981 gives the same norm. This nonuniqueness
is because of the even number of data points and has nothing to do with the actual values.
It is clear that the L_∞ norm is most sensitive to the presence of outliers, while the least squares
approximation is slightly less sensitive. Nevertheless, the shift in the mean value due to a
single blunder is considerable and the result is totally unacceptable. On the other hand,
the L_1-approximation is remarkably insensitive to the presence of a few outliers. Of course,
we can argue that the new measurement is obviously wrong and must be rejected. But in
practice, there will always be cases where it is not that obvious to reject some measurement,
but it will cause a substantial error in the result.
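
The comparison made in Example 10.1 can be reproduced with a few lines of Python; the
following sketch computes the midrange, mean and median with and without the spurious
sixth measurement.

    import numpy as np

    g = np.array([980.5, 981.0, 978.5, 979.0, 982.0])   # measurements of Example 10.1

    def estimates(x):
        midrange = 0.5 * (x.min() + x.max())    # minimises the L_inf norm
        mean = x.mean()                         # minimises the L_2 norm
        median = np.median(x)                   # minimises the L_1 norm
        return midrange, mean, median

    print(estimates(g))                          # 980.25, 980.2, 980.5
    # with the spurious value the midrange and mean are ruined, the median is not
    print(estimates(np.append(g, 1980.0)))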

Apart from the norm we also have to select the right model for the func-
tion F(a, x) to obtain a good approximation. In fact, the choice of the model
is more important than the choice of the norm, but it depends entirely on the
function being approximated. If the function is smooth and does not have many
extrema, then it may be reasonable to use a low degree polynomial or a piece-
wise polynomial model. High degree polynomials are normally not very useful,
unless there are good reasons to believe that the function is actually a polyno-
mial of high degree. If the function is oscillatory, then it may be better to use
trigonometric functions. If the function has steep variation or singularity, then
it can never be approximated well by a polynomial and it may be better to use
a model based on rational functions. Most approximations are performed using
functions which are linear in the parameters ai, since it is easiest to handle such
functions. It should be noted that the linearity is only with respect to ai, the
function could be highly nonlinear with respect to x. In this case, the function
is represented in the form
m

F(a, x) = 2..: aiq\(x), (10.11)


i=l

where 1>i(X), (i = 1,2, ... , m) is a set of suitable basis functions. Such approx-
imations are usually referred to as linear approximations. The functions 1>i(X)
could be nonlinear. For example, if we take ¢i(X) = Xi-I, then the function
F(a, x) will be a polynomial of degree less than or equal to (m - 1) in x. Sim-
ilarly, if we take 1>i(X) = sin(ix), then we get the Fourier approximation. We
can also use piecewise continuous functions e.g., B-splines for approximation.
The advantage of using linear approximations is that, we end up with linear
equations to determine the parameters ai, which can be solved conveniently.
In some cases, a change of variable may improve the approximation signif-
icantly. For example, if the function is varying over several orders of magnitude,
it may be better to take logarithm before approximating by a polynomial. For
some problems, it may be better or even necessary to use nonlinear functions
of ai. In that case, determining ai will involve nonlinear equations, which may
be much more difficult to solve. Consequently, nonlinear models are usually


avoided, but in some cases, these models yield much better approximations
and it may be worthwhile to try them. The form of nonlinear function could
be guessed only if there is some specialised knowledge about the function. For
example, we can use functions of the form

    F(\mathbf{a}, x) = e^{a_1 x} (a_2 + a_3 x + a_4 x^2 + \cdots + a_m x^{m-2}) ,
    F(\mathbf{a}, x) = a_1 e^{a_2 x} + a_3 e^{a_4 x} + \cdots + a_{m-1} e^{a_m x} ,        (10.12)
    F(\mathbf{a}, x) = \frac{a_1 + a_2 x + \cdots + a_k x^{k-1}}{1 + a_{k+1} x + \cdots + a_m x^{m-k}} , \qquad (k < m).

Rational functions and continued fractions have proved to be very successful


in approximating mathematical functions. For functions given in the form of a
table of values, it is an art to guess the type of function required for approxi-
mation and we cannot give any guidelines for such cases.
It is difficult to give any error analysis for approximation. For general
F(a, x) we can say little beyond the fact that the error as measured by the norm
of the residuals decreases as more basis functions are added. However, we cannot
say anything about the error with respect to the true function that we are trying
to approximate, since we do not know the correct value of the true function
at any point. Only in the case of approximating mathematical functions, we
can actually estimate the error, since the true function is known in that case.
For approximation to a set of data it is impossible to give any rigorous error
estimate. Some idea about the uncertainties can be obtained by perturbing the
input data by an amount comparable to the expected error and recalculating
the coefficients. If the magnitude and distribution of the errors is known, it
may be possible to give some statistical estimate of error in approximation.
However, this estimate could be highly unreliable, unless the magnitude and
distribution of error in the data is actually close to what is assumed in the
statistical analysis. It is most convenient to perform a Monte Carlo analysis to
estimate the error in approximation. In this technique we add random error
to data with distribution that is expected in the data and then repeat the
exercise to calculate the coefficients for approximation. Using several different
realisations of errors we produce sets of artificial data with error added and
calculate the parameters for each of these data sets. The distribution of fitted
parameters in these sets will give the required error. For example, if the errors
are normally distributed, we can calculate the standard deviation of the fitted
parameters from these sets.
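
The Monte Carlo procedure just described can be sketched as follows for a straight-line fit;
the "true" function, the noise level and the number of realisations are arbitrary choices for
illustration, and NumPy's least squares routine stands in for the fitting method.

    import numpy as np

    rng = np.random.default_rng(3)
    x = np.linspace(0.0, 1.0, 21)
    sigma = 0.05                                   # assumed error in the data
    y_true = 1.0 + 2.0 * x                         # arbitrary "true" function

    def fit_line(y):
        # least squares fit of a + b*x (design matrix with columns 1, x)
        G = np.column_stack((np.ones_like(x), x))
        coef, *_ = np.linalg.lstsq(G, y, rcond=None)
        return coef

    fits = []
    for _ in range(1000):                          # independent error realisations
        y = y_true + rng.normal(scale=sigma, size=x.size)
        fits.append(fit_line(y))
    fits = np.array(fits)

    # spread of the fitted parameters gives the error estimate
    print(fits.mean(axis=0), fits.std(axis=0, ddof=1))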

10.2 Linear Least Squares


In this section, we consider approximations based on the L2 norm, which are
referred to as the least squares approximations. Since in practice, we normally
come across problems over a discrete set of points, only the discrete case is consid-
ered in this section. The results can be easily extended to least squares approx-
imation of continuous functions. Let f(x) be a function and {x_1, x_2, ..., x_n} be
a set of points at which the function values are known. These values will have
some error associated with them. We denote by f_i the known value at x_i with
an estimated error of σ_i. The true value f(x_i) is, in general, not known and
it may be difficult to give any estimate of the error with respect to the actual
function. It is generally assumed that the errors f_i − f(x_i) at different points
are uncorrelated. Let {φ_1(x), ..., φ_m(x)} be a set of basis functions, in terms
of which we are seeking a linear approximation of the form

    f(x) \approx F(\mathbf{a}, x) = \sum_{j=1}^{m} a_j^{(m)} \phi_j(x) .        (10.13)

Here the superscript (m) denotes that the coefficients in general, depend on the
number of basis functions included in the approximating function. In the linear
least squares approximation, the coefficients a_j^{(m)} are determined by minimising
the L_2 norm, which yields

    \chi^2 = \sum_{i=1}^{n} \frac{r_i^2}{\sigma_i^2} = \sum_{i=1}^{n} \frac{1}{\sigma_i^2} \left( f_i - \sum_{j=1}^{m} a_j^{(m)} \phi_j(x_i) \right)^2 ,        (10.14)

where we have omitted the square root, since minimising the square root of a
nonnegative function is the same as minimising the function itself. Here σ_i are
the standard deviations in f_i and r_i the residuals at x_i. This is the so-called
chi-square function. It may be noted that the statistical tests for fits will be
valid only if the χ² function is defined by (10.14) with weights given by the inverse of
the variance σ_i². It is convenient to define the so-called design matrix G and the
vector b by

    g_{ij} = \frac{\phi_j(x_i)}{\sigma_i} , \qquad b_i = \frac{f_i}{\sigma_i} .        (10.15)

Here G is an n × m matrix. With these definitions, the function χ² can be
written as

    \chi^2 = (b - G\mathbf{a})^T (b - G\mathbf{a}) = |b - G\mathbf{a}|^2 .        (10.16)
The minimisation can be carried out by any method described in Chap-
ter 8, but in this case, since the function is quadratic in a_j^{(m)}, we can directly
find the minimum by finding the zero of the gradient vector. Thus, the minimum
can be obtained by solving the system of equations

~
aX2 = -2 nL: a1 (m
2 fi - L: a j
(m)
cPj(.Ti) ) (h(x;) = 0, (k = 1, . . . , m) ,
aa k i=l' j=l
(10.17)
or in the matrix form G T Ga = GTb. This is a system of m linear equations in
m unknowns and is referred to as the normal equations. If the determinant of
the coefficient matrix is nonzero, then these equations yield a unique solution
for a_j^{(m)}. By considering the function at (a_1^{(m)} + δa_1, ..., a_m^{(m)} + δa_m), it can
be shown that this solution is indeed a minimum. The coefficient matrix will
be nonsingular, provided the basis functions are linearly independent and the
points x_i are distinct. If m ≥ n, then in general, it is possible to find a solution
which reduces the error norm χ² to zero and the resulting approximating func-
tion will interpolate the function f(x) at the given points. This approximation
does not serve any useful purpose, since including a large number of basis func-
tions will not improve the smoothness of the function. Further, the inherent
errors in the function value are not eliminated or even reduced. Hence, in prac-
tice we always consider the case where m < n, and in general it is not possible
to reduce χ² to zero. Noting that the χ² function is quadratic in the parameters a_i
and that its gradient vanishes at the minimum, we can write

    \chi^2(\mathbf{a}) = \chi^2_{\min} + (\mathbf{a} - \mathbf{a}_{\min})^T G^T G\, (\mathbf{a} - \mathbf{a}_{\min}) ,        (10.18)

where a_min is the minimiser of χ² and χ²_min is its value at the minimum. Thus
the matrix G^T G is related to the second derivatives of χ², i.e., the Hessian
matrix. This expression can be used to obtain confidence limits on the estimated
values, by using appropriate values for χ².
The solution of the normal equations can be formally written as a =
C G^T b, where the matrix C = (G^T G)^{-1}. It turns out that the elements of the
inverse matrix C have some statistical significance and it may be necessary to
compute the inverse explicitly. The sensitivity of the parameters to the input
values f_i is given by

    \frac{\partial a_j^{(m)}}{\partial f_i} = \frac{1}{\sigma_i} \sum_{k=1}^{m} C_{jk}\, g_{ik} .        (10.19)

If the errors are normally distributed, then the variance associated with a_j^{(m)}
is given by

    \sigma^2\!\left(a_j^{(m)}\right) = \sum_{i=1}^{n} \sigma_i^2 \left(\frac{\partial a_j^{(m)}}{\partial f_i}\right)^2
      = \sum_{k=1}^{m} \sum_{l=1}^{m} C_{jk} C_{jl} \left( \sum_{i=1}^{n} g_{ik}\, g_{il} \right) .        (10.20)

The term in the parentheses gives the elements of G^T G, which is the inverse of
C. Hence, the above equation reduces to

    \sigma^2\!\left(a_j^{(m)}\right) = C_{jj} .        (10.21)

Thus, the diagonal elements of the matrix C give the variances associated with
the fitted parameters. It can be shown that the off-diagonal elements C_{jk} give
the covariance between a_j^{(m)} and a_k^{(m)}, and the matrix C is called the covariance
matrix.
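
A minimal Python sketch of the procedure up to this point: the design matrix of Eq. (10.15)
is set up for a small synthetic data set, the normal equations G^T G a = G^T b are solved, and
the errors of the fitted parameters are read off the diagonal of C = (G^T G)^{-1}. The polynomial
basis and the data are only for illustration; this is not one of the routines of Appendix B.

    import numpy as np

    rng = np.random.default_rng(4)
    x = np.linspace(0.0, 1.0, 50)
    sigma = 0.1 * np.ones_like(x)                      # assumed errors
    f = 1.0 - 2.0 * x + 0.5 * x**2                     # synthetic data
    f = f + rng.normal(scale=sigma)

    m = 3                                              # number of basis functions
    phi = np.vander(x, m, increasing=True)             # phi_j(x) = x**(j-1)

    # design matrix and right hand side of Eq. (10.15)
    G = phi / sigma[:, None]
    b = f / sigma

    # normal equations G^T G a = G^T b and covariance matrix C = (G^T G)^{-1}
    GtG = G.T @ G
    a = np.linalg.solve(GtG, G.T @ b)
    C = np.linalg.inv(GtG)
    print(a, np.sqrt(np.diag(C)))                      # coefficients and their errors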
In principle, the least squares approximation can be obtained by solving
the normal equations for the coefficients a_j^{(m)}, which is possible for small values
of m. However, for a large number of basis functions the normal equations are
usually ill-conditioned and unlikely to yield meaningful results. For example,
if we consider polynomial approximation using the basis functions φ_i(x) = x^{i-1},
then the normal equations can be written as

    \sum_{j=1}^{m} a_j^{(m)} \left( \sum_{i=1}^{n} \frac{x_i^{j+k-2}}{\sigma_i^2} \right)
      = \sum_{i=1}^{n} \frac{f_i\, x_i^{k-1}}{\sigma_i^2} , \qquad (k = 1, \ldots, m).        (10.22)

For convenience, if we assume that the points x_i are uniformly distributed in
the interval (0, 1) and σ_i = σ, (i = 1, ..., n), then for large n we can use the
approximation

    \sum_{i=1}^{n} x_i^{j+k-2} \approx n \int_0^1 x^{j+k-2}\, dx = \frac{n}{j+k-1} , \qquad (j, k = 1, \ldots, m).        (10.23)

Thus, it can be seen that the coefficient matrix in the normal equations is ap-
proximately equal to a multiple of the m × m principal minor of the Hilbert
matrix (see Example 3.5), which is well known to be highly ill-conditioned.
Hence, we cannot expect to solve the normal equations, even for moderate
values of m (e.g., m = 10), unless very high precision arithmetic is used for
calculating the coefficients as well as for solving the system of equations. Even
then a slight change in the input data f_i or x_i may change the solution sig-
nificantly. This ill-conditioning can be traced to the fact that for large values
of m, the basis functions x^j are not really independent, in the sense that x^j
can be approximated by a lower degree polynomial to a good approximation.
Hence, the coefficient matrix becomes almost singular. Similar problems may
also be encountered with other basis functions when the number is large. It
essentially means that the data do not clearly distinguish between two or more
combinations of basis functions. If two different combinations of functions fit
the data equally well, then the matrix of normal equations becomes singular.
This problem can also arise if we do not realise that the parameters are not
really independent. For example, if we attempt to fit a function of the form
a1 exp(a2 + a3x), then clearly the parameters a1 and a2 play the same role
and the linearised version of the normal matrix will be singular in this case. This
problem can also arise out of the not uncommon, but potentially disastrous ap-
proach exemplified by the attitude: I really do not know which variables affect
the function, so let me throw in everything I can imagine and let the com-
puter sort them out. Unfortunately, the computer is not intelligent enough to
sort out the problems for us, and in most such cases, we get garbage, where
small random variations in the data, in fact determine which variables are se-
lected as important. If instead of polynomial we use the B-spline basis functions
to approximate the functions, the resulting normal matrix is likely to be well-
conditioned. Thus B-spline basis functions are preferred over simple polynomial
expansion.
This problem can be overcome by using the orthogonal matrix factorisa-
tion to solve the system of overdetermined equations Ga = b. In this method,
we seek an orthogonal matrix Q so that

    QG = \begin{pmatrix} R \\ 0 \end{pmatrix} ,        (10.24)

where R is an m × m upper triangular matrix. This decomposition can be
achieved using a sequence of Householder transformations (see Section 3.6 and
Wilkinson and Reinsch, 1971). Using this decomposition the system of equa-
tions is transformed to Ra = Qb and the required solution with minimum
norm is given by a = R^{-1} Qb. This solution can be easily computed using
back-substitution. The advantage of this technique is twofold: first, the round-
off errors are not magnified during the orthogonal transformation. Of course, the
solution of upper triangular system using back-substitution can magnify the
roundoff error. However, since we are directly working with the design matrix
G rather than G T G, the condition number is smaller. In the process of ob-
taining the normal equations, the condition number is squared. Hence, using
the orthogonal factorisation it is possible to improve the computed solution.
Although, for large values of m the system of equations can be ill-conditioned,
for most practical problems this technique can give reliable results.
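
A sketch of the same least squares problem solved through the orthogonal (QR) factorisation
follows; NumPy's Householder-based routine is used in place of the decomposition described
in the text, and the synthetic data are noise-free for brevity.

    import numpy as np
    from scipy.linalg import solve_triangular

    # weighted design matrix G and right hand side b as in Eq. (10.15);
    # a small synthetic example with basis functions 1, x, x**2
    x = np.linspace(0.0, 1.0, 50)
    sigma = 0.1
    G = np.vander(x, 3, increasing=True) / sigma
    b = (1.0 - 2.0 * x + 0.5 * x**2) / sigma     # noise-free data for brevity

    # Householder QR factorisation of G; working with G directly avoids
    # squaring the condition number as happens when G^T G is formed
    Q, R = np.linalg.qr(G)                       # reduced factorisation, R is m x m
    a = solve_triangular(R, Q.T @ b)             # back-substitution
    print(a)                                     # recovers 1.0, -2.0, 0.5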

10.2.1 Least Squares Approximation Using SVD


Another alternative is to use the singular value decomposition (Section 3.6),
which requires much more effort. The SVD algorithm will also detect linear
dependence or near linear dependence of the basis functions, since one or more
singular values may come out to be very small. If the components corresponding
to small singular values are suppressed, then it may be possible to obtain a
meaningful solution. Another advantage of SVD is that it also isolates the linear
combination of basis functions, which gives rise to linear dependence. SVD can
also be applied directly to obtain the least squares solution of the system of
overdetermined equations Ga = b as explained in Section 3.6. If G = UI:V T is
the SVD of the matrix G, then the solution can be written as

a= L: (Ui.b)
m
--

Vi, (10.25)
i=l '

where $\mathbf{u}_i$ and $\mathbf{v}_i$ are the $i$th columns of $U$ and $V$, respectively, and $s_i$ are the singular values. Of course, if some of the singular values are very small, then the corresponding terms should be dropped. Once again, since we are working directly with the design matrix $G$, the condition number is reduced.
Using the singular value decomposition, the matrix of normal equations is $G^TG = V\Sigma^2V^T$, and the covariance matrix is $C = V\Sigma^{-2}V^T$. Thus the matrix $V$ diagonalises the covariance matrix and the columns of $V$ give the principal directions which are not correlated. Thus, the solution can be written in the

form

(10.26)

Thus, the columns of V give the principal axes of the error ellipsoid of the fitted
parameters. The error ellipsoid essentially defines a region in the parameter
space, where the parameter values are likely to fall. The variance in the estimate
of the parameter aj can be written as

$$\sigma^2(a_j) = \sum_{i=1}^{m} \frac{V_{ji}^2}{s_i^2}, \qquad (10.27)$$

while the covariance between parameters aj and ak is given by

$$\mathrm{cov}(a_j, a_k) = \sum_{i=1}^{m} \frac{V_{ji}\, V_{ki}}{s_i^2}. \qquad (10.28)$$

It can be easily seen that the error will be rather large for small $s_i$ and dropping such terms will reduce the errors, at the cost of increasing the mean square deviation slightly (see Eq. 10.18). The columns of $V$ corresponding to small $s_i$ identify the linear combinations of variables which contribute little towards reducing the $\chi^2$, but make a large contribution to the standard deviation. Thus, even if some of the singular values are not small enough to cause roundoff problems, it may be better to zero them while computing the solution. It may be noted that the required solution can be computed using the subroutines SVD and SVDEVL in Appendix B. The parameter REPS in SVDEVL controls which singular values are reduced to zero. If $s_i < \mathrm{REPS} \times s_{\max}$, then it is set to zero.
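A minimal sketch of this truncation in Python (NumPy) is given below; the relative threshold plays the same role as the parameter REPS described above, but this is plain NumPy, not the SVD/SVDEVL subroutines of Appendix B, and the data are hypothetical.

# Sketch: least squares solution via Eq. (10.25) with small singular values suppressed.
import numpy as np

def svd_lstsq(G, b, reps=1e-7):
    """Solve G a = b in the least squares sense, dropping s_i < reps * s_max."""
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    keep = s >= reps * s.max()               # components retained in the sum (10.25)
    coeff = (U.T @ b)[keep] / s[keep]
    return Vt[keep].T @ coeff

# Hypothetical usage with an ill-conditioned polynomial design matrix:
x = np.linspace(0.0, 1.0, 50)
G = np.vander(x, 10, increasing=True)
b = 231*x**6 - 315*x**4 + 105*x**2 - 5
a = svd_lstsq(G, b, reps=1e-7)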
The only disadvantage of using SVD is the extra effort and memory required. It can be seen that SVD will require an extra array $G$ of size $n \times m$, which could be quite large when the number of data points n is large. Further, the SVD algorithm could require approximately $cnm^2$ floating-point operations, where $c \approx 6$. As compared to that, the solution of the normal equations requires only $m^3/3$ floating-point operations (provided we make use of the symmetry of the matrix). In addition, setting up the normal equations itself may require $nm^2$ operations. This number does not include the effort required in evaluating the basis functions at different points. Thus, the SVD solution is about five times more expensive in terms of the amount of effort required. Further, if we take into account the fact that in practice we usually repeat the calculation for a large number of values of m, the situation is even worse for SVD, since the coefficients of the matrix in the normal equations can be preserved and reused. Hence, the SVD calculation could easily require an order of magnitude more effort. Nevertheless, the solution of the normal equations is fairly unreliable and in general it is rather difficult to detect the trouble. Hence, it is usually worthwhile to spend the extra effort required for the SVD solution, since that is not only more reliable, but also gives

some insight into the problem by way of near linear dependence of some of the
basis functions. However, for polynomial least squares approximation, there is
an alternative method based on orthogonal polynomials, which is much more
efficient than SVD and is also reliable.
Another problem with approximations is to decide the number of basis functions (or the degree of polynomial) to be used for the approximation. If the normal equations can be solved accurately, then $\chi^2$ will keep decreasing as m increases, and it can be reduced to zero for $m = n$. However, if the actual data have some errors, we cannot expect $\chi^2$ to be reduced to zero for the true function. Hence, we should stop adding more terms to the approximation once the error is reduced to some limiting value. There are elaborate statistical tests to decide the optimal value of m to be used. Adding more terms does not improve the approximation, but only increases the amount of computation required to evaluate the approximating function. In practice, if we plot $\chi^2$ as a function of m, then the point at which the errors become dominant may be detected, since beyond that the function becomes rather flat with $\chi^2$ decreasing very slowly with m, while for small values of m, $\chi^2$ decreases rapidly. Crude statistical analysis shows that if we compute the quantity

$$\chi^2_\nu = \frac{\chi^2\!\left(a_1^{(m)}, \ldots, a_m^{(m)}\right)}{n - m}, \qquad (10.29)$$

then the process should be terminated at a value of m beyond which $\chi^2_\nu$ is essentially constant. Here, $\nu = n - m$ is the number of degrees of freedom. Alternately, we can look at the distribution of the residuals $r_i$. If the errors are uncorrelated, then the residuals should be randomly distributed, and if we count the number of sign changes in the sequence $\{r_1, \ldots, r_n\}$, then we should find a value close to $n/2$. Thus, after adding every term, we can check the number of sign changes and the process can be terminated at a stage where the number of sign changes increases sharply to a value close to $n/2$ (i.e., within about $\sqrt{n}/2$). This criterion is not very robust and the presence of one outlier may modify the approximation in such a way that the number of sign changes in the residuals is much smaller than the expected value.
Another alternative is to use the F-test to determine the significance of an additional term in least squares fit problems. As we increase the number of parameters the $\chi^2$ value decreases, but the question is to determine whether the decrease is significant. If $\chi^2(m)$ is the $\chi^2$ value with m parameters, then we can consider the ratio

$$f = \frac{\chi^2(m) - \chi^2(m+1)}{\chi^2(m+1)/(n - m - 1)}. \qquad (10.30)$$

It is clear that if this ratio is small then adding the additional parameter is not effective in improving the fit. For the purpose of the F-test, the numerator can be considered as a $\chi^2$ with one degree of freedom. This ratio can be tested as in the F-test with degrees of freedom 1, $n - m - 1$ to get the probability. The probability of obtaining a value f or larger should be high (e.g., > 0.95) for the new term to be acceptable.
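The test is easy to carry out with any F-distribution routine. The following sketch uses SciPy; the chi-square values fed to it are hypothetical placeholders, and the cumulative probability it returns corresponds to the significance levels quoted in Example 10.2 below.

# Sketch: significance of one additional term via the ratio of Eq. (10.30).
from scipy import stats

def f_test_extra_term(chi2_m, chi2_m1, n, m):
    f = (chi2_m - chi2_m1) / (chi2_m1 / (n - m - 1))
    # Cumulative probability of the F(1, n-m-1) distribution up to f.
    return f, stats.f.cdf(f, 1, n - m - 1)

# Hypothetical usage (placeholder chi-square values):
print(f_test_extra_term(3.6e4, 48.0, 50, 5))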

Table 10.1: Polynomial least squares approximation by solving the normal equations

 m  nc   χ²           a1        a2       a3        a4        a5       a6       a7     a8    a9   a10
 2   3   1.20×10^8   -4.004    19.71   -17.01
 3   4   6.83×10^7  -10.927   107.23  -238.04    147.4
 4   5   4.98×10^6   -2.977   -67.77   568.33  -1115.5     631.5
 5   6   3.67×10^4   -5.148     8.45    11.55    393.76  -1076.8    684.0
 6   6   2.56×10^3   -4.967    -2.04   128.47   -101.11   -116.8   -179.4    291.9
 7   6   5.69×10^3   -4.952    -1.64   112.75     23.14   -526.2    476.9   -215.7   152
 8   6   1.86×10^3   -5.027     0.78   102.55    -11.64   -262.4    -21.4     93.5   199   -80
 9   7   2.02×10^3   -4.960    -2.48   134.85   -131.36    -81.4    -89.9     51.7   122    87   -70

EXAMPLE 10.2: Generate synthetic data using the function

$$f(x) = 231x^6 - 315x^4 + 105x^2 - 5 + (R - 0.5)\times 10^{-2}, \qquad (10.31)$$

where $R$ is a uniformly distributed random number in the interval (0,1). Using n equally spaced points in the interval [0, 1], try to fit a polynomial of degree m using the basis functions $\phi_j(x) = x^{j-1}$. Use $n = 50$ and $m = 2, 3, \ldots, 9$.
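A short Python (NumPy) sketch of how such a data set and the two fits may be set up is given below. This is purely illustrative and is not the computation actually used in this example, which employed the Appendix B routines in 24-bit arithmetic, so the numbers obtained will not match the tables.

# Sketch: synthetic data of Eq. (10.31) fitted by normal equations and by SVD.
import numpy as np

rng = np.random.default_rng(1)                     # arbitrary seed, assumption
n = 50
x = np.linspace(0.0, 1.0, n)
f = 231*x**6 - 315*x**4 + 105*x**2 - 5 + (rng.random(n) - 0.5) * 1e-2

m = 7                                              # degree 6 polynomial: 7 coefficients
G = np.vander(x, m, increasing=True)               # basis functions x**(j-1)

a_normal = np.linalg.solve(G.T @ G, G.T @ f)       # direct normal equations (ill-conditioned)
a_svd, *_ = np.linalg.lstsq(G, f, rcond=1e-7)      # SVD-based solution with relative cutoff
print(a_normal)
print(a_svd)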
In this case the errors are not normally distributed, but we use the variance $\sigma^2 = 8.333 \times 10^{-6}$ to define $\chi^2$. The normal equations are set up using (10.22). These equations can be solved using Gaussian elimination with partial pivoting to yield the results shown in Table 10.1, which gives the coefficients of the polynomial starting with the constant term as well as the corresponding value of $\chi^2$. This table also gives the number of sign changes nc observed in the sequence of residuals. It can be seen that the residuals quickly reduce as m is increased, but after m = 6 there are some fluctuations. The function should be correctly represented by a polynomial of degree six, which should approximately reproduce the coefficients of the original polynomial. But it can be seen that apart from the first coefficient, all other coefficients are far from the actual values, which is clearly because of the ill-conditioning
of the normal equations. The matrix $G^TG$ defining the normal equations for m = 6 is

$$\frac{\sigma^2}{n}\, G^TG = \begin{pmatrix}
1.000000 & .500000 & .336735 & .255102 & .206136 & .173503 & .150204 \\
 .500000 & .336735 & .255102 & .206136 & .173503 & .150204 & .132738 \\
 .336735 & .255102 & .206136 & .173503 & .150204 & .132738 & .119161 \\
 .255102 & .206136 & .173503 & .150204 & .132738 & .119161 & .108306 \\
 .206136 & .173503 & .150204 & .132738 & .119161 & .108306 & .099431 \\
 .173503 & .150204 & .132738 & .119161 & .108306 & .099431 & .092041 \\
 .150204 & .132738 & .119161 & .108306 & .099431 & .092041 & .085792
\end{pmatrix} \qquad (10.32)$$
It can be seen that this matrix is quite close to the $7 \times 7$ Hilbert matrix. The number of sign changes does not show any sharp increase for $m \le 9$ and from these results it is not very clear where the process should be terminated, though we may be tempted to use a value of m = 6. The $\chi^2$ value is also too large to be acceptable.
If the same least squares approximation is obtained using SVD, then we get the results displayed in Table 10.2. It should be noted that both these calculations have been performed using the same 24-bit arithmetic. It can be seen that the results are distinctly better and for m = 6 all coefficients are close to the actual values. For higher degrees the coefficients are not correct, because some combination of higher degree terms is also included. It can be seen that the number of sign changes increases sharply as m is increased to six, thus indicating that the residuals are essentially random. The residuals do not decrease substantially beyond m = 6, which indicates m = 6 as the optimum value for the approximation.

Table 10.2: Polynomial least squares approximation using SVD

 m  nc   χ²           a1        a2       a3       a4        a5       a6      a7      a8    a9   a10
 2   3   1.20×10^8   -4.004    19.71   -17.01
 3   4   6.83×10^7  -10.928   107.25  -238.08   147.4
 4   5   4.97×10^6   -2.952   -68.26   570.42 -1118.6     633.0
 5   6   3.59×10^4   -5.180     9.52     3.91    414.2  -1099.7   693.1
 6  38   4.76×10^1   -4.997    -0.07   105.55     -2.1   -311.1    -3.5    232.2
 7  38   4.68×10^1   -4.998    -0.01   104.63      3.2   -326.0    18.2    216.5     4
 8  37   4.51×10^1   -4.999     0.11   102.40     20.3   -391.4   155.8     55.0   104   -25
 9  37   4.48×10^1   -4.999     0.06   103.85      5.9   -319.4   -48.8    398.7  -235   156   -40

Alternately, we can use the F-test to check the significance of an additional term. Calculating the ratio of Eq. (10.30) gives the values 34.8, 573.4, 6047.4, 32387.7, 0.72, 1.55, 0.27 for m = 2, 3, 4, 5, 6, 7, 8 respectively. Using the appropriate degrees of freedom, the probability of finding a value as large as these is close to one till m = 5, while for m = 6 it falls to 0.60, thus giving an indication that a polynomial of degree 6 is the best in this case. The estimated errors in the coefficients can be computed using Eq. (10.27). For m = 6 the estimated errors are 0.0023, 0.067, 0.60, 2.31, 4.24, 3.69, 1.23, which is consistent with the difference between the exact and fitted coefficients.
In the SVD calculation, the parameter REPS, which is used to suppress small singular values, was set to $10^{-7}$. For m = 9 the singular values are 9.475661, 4.277862, 1.394861, 0.3753915, 0.0857653, 0.0166677, 0.0027254, $3.653 \times 10^{-4}$, $3.798 \times 10^{-5}$ and $2.639 \times 10^{-6}$. Thus, none of the singular values are reduced to zero during the calculations, but still we find that the results are much better. This behaviour can be explained by the fact that in this approach we use the design matrix $G$ directly, while the normal equations involve $G^TG$. The condition number of $G^TG$ will be the square of that of $G$. It can be seen that for m = 9 the condition number of $G$ is $\approx 3.6 \times 10^6$ and that of $G^TG$ will be of the order of $10^{13}$. Hence, it is not possible to obtain any meaningful solution of the normal equations using 24-bit arithmetic. In fact, if the normal equations are set up and solved using double precision arithmetic, the results are essentially the same as those in Table 10.2. For m = 9 the column of $V$ corresponding to the smallest singular value turns out to be close to a multiple of the coefficients of the Chebyshev polynomial over the interval [0, 1], which defines the linear combination that gives rise to the ill-conditioning in the normal matrix.
Figure 10.1 shows some of the approximations obtained using the SVD. The actual function (i.e., without the error term) is also shown in the figure. It can be seen that the function has three extrema in the interval, including the one at x = 0. Hence, a polynomial of degree at least four is required for any reasonable approximation. For comparison the figure also shows the approximation obtained using the direct solution of the normal equations for m = 6. This curve generally coincides with the actual curve of the function in the figure on the left-hand side. But from the right-hand side figure, which shows a part of the curves magnified five times, the difference is quite clear. The m = 6 curve using the SVD coincides with the actual curve even on this scale and is not shown.
To illustrate the smoothing property of the approximation, we compute the square deviation using the actual function $f^*(x)$, rather than the given data. In a realistic problem the function $f^*(x)$ will not be known, but in this case, since the data have been generated artificially, we know the actual function, which can be obtained by removing the last error term from (10.31). Thus, we can compute

$$\delta^2 = \sum_{i=1}^{n} \frac{1}{\sigma_i^2} \left( f^*(x_i) - F(\mathbf{a}, x_i) \right)^2, \qquad (10.33)$$

Figure 10.1: Polynomial least squares fit: The figure on the right-hand side shows a part of the curves magnified five times. The curve labelled m = 6′ is obtained using the direct solution of the normal equations, while the one labelled m = 6″ is for the second data set with outliers.

to determine the effectiveness of the approximation. The results are shown in Table 10.3. It can be seen that for m < 6, $\delta^2$ is essentially the same as the $\chi^2$ shown in Tables 10.1 and 10.2. With SVD it can be seen that this difference is substantially less than what is obtained using the actual data for m = 6. Thus, the approximating function is closer to the actual function than to the data. Further, for higher values of m this difference increases and the smoothing property of the approximation is lost. These results show that the approximation does not improve by including more functions in the basis, even though the $\chi^2$ may decrease to zero when m = n.
An important application of smoothing is in calculating derivatives of functions. To illustrate this application, we compute the first and second derivative at x = 0.4897959, which is the 25th point in the data set. Since each of the data points has some error, computing derivatives using the formulae in Chapter 5 gives substantial errors. Thus, using the 3-point formula we get $f' = -6.054663$ and $f'' = -313.077$, while the 5-point formula gives $f' = -5.987589$ and $f'' = -317.817$. The exact values are $f' = -6.125849$ and $f'' = -297.987$. Thus, it can be seen that the errors are magnified while computing the derivatives. Using the approximating polynomial, we can easily compute the derivative at any point and the results are shown in Table 10.3. It can be seen that for m < 5 the approximation to the derivatives is very poor. Further, the SVD solution gives distinctly better results and the derivatives are substantially more accurate than what is obtained by finite differences on the raw data.
From these tables it is clear that SVD gives an effective method to deal with ill-conditioning of the normal equations. To illustrate the limitation of least squares approximation, we repeat the calculations with a different synthetic data set using the function

$$f(x) = 231x^6 - 315x^4 + 105x^2 - 5 + \frac{10^{-5}}{(R - 0.5)^3}, \qquad (10.34)$$

where $R$ is a random number in the interval (0,1). In this case, the errors are not uniformly distributed and there will be a few outliers corresponding to those cases where the random number $R$ is very close to 0.5. The calculations using SVD yield completely useless results

Table 10.3: Smoothing of data by polynomial least squares approximation

        Using normal equations                    Using SVD
 m    δ²           f′           f″           δ²           f′           f″
 2  1.20×10^8    3.044130     -34.025     1.20×10^8    3.044413     -34.025
 3  6.83×10^7  -19.902650     -43.046     6.83×10^7  -19.906320     -43.048
 4  4.98×10^6  -17.057330    -323.593     4.97×10^6  -17.031700    -324.235
 5  3.62×10^4   -6.124409    -312.220     3.54×10^4   -5.980693    -312.010
 6  2.60×10^3   -6.115625    -294.076     4.07×10^0   -6.124817    -297.935
 7  5.82×10^3   -6.436885    -293.327     4.76×10^0   -6.131210    -297.948
 8  1.73×10^3   -5.868022    -299.058     6.44×10^0   -6.132968    -297.783
 9  1.96×10^3   -5.888071    -294.806     6.47×10^0   -6.136873    -297.799

and for m = 6 the fitted polynomial is

$$F(x) = -5.240 - 5.674x + 275.074x^2 - 1098.642x^3 + 2441.109x^4 - 2925.531x^5 + 1335.850x^6. \qquad (10.35)$$

It is clear that these coefficients have no relation to the coefficients of the actual polynomial used to generate the synthetic data set. This approximating polynomial is also plotted in Figure 10.1. For this error distribution the variance will diverge, but for the sake of comparison we have used the same value as before. The $\chi^2$ does not decrease significantly as m is increased. For example, with m = 2, $\chi^2 = 1.68 \times 10^8$, while with m = 6, $\chi^2 = 6.07 \times 10^7$. This example clearly demonstrates that the least squares approximation is not suitable for data sets with outliers or where the distribution of errors is far from normal. The results also depend on the choice of random numbers. These results should be compared with those in Example 10.13 obtained using the $L_1$ norm.

10.2.2 Least Squares Approximation Using Orthogonal Polynomials
Although SVD overcomes the problems caused by near linear dependence of the basis functions, it will be better if this problem can be avoided altogether. This objective can be realised if the basis functions are mutually orthogonal. Here the orthogonality is defined with respect to the discrete sum

$$\sum_{i=1}^{n} \frac{1}{\sigma_i^2}\, \phi_j(x_i)\, \phi_k(x_i) = \begin{cases} 0, & \text{if } j \neq k; \\ \gamma_j, & \text{if } j = k. \end{cases} \qquad (10.36)$$
It can be easily seen that if this property is satisfied, then the matrix of normal equations is diagonal and the equations can be trivially solved to give

$$a_j^{(m)} = \frac{\sum_{i=1}^{n} \frac{1}{\sigma_i^2}\, f_i\, \phi_j(x_i)}{\gamma_j}. \qquad (10.37)$$

It can be seen that in this case $a_j^{(m)}$ is independent of m and depends only on j.
This is another advantage of orthogonality, since the old coefficients do not have

to be recalculated when one more basis function is added. If the basis functions are not orthogonal, then almost the entire calculation has to be repeated when one more basis function is added, since a new row and column have to be added to the coefficient matrix and the solution for all components has to be obtained afresh. Using the orthogonality condition we can show that

$$\chi^2\!\left(\mathbf{a}^{(m)}\right) = \sum_{i=1}^{n} \frac{f_i^2}{\sigma_i^2} - \sum_{j=1}^{m} \gamma_j \left(a_j^{(m)}\right)^2. \qquad (10.38)$$

In practice, using this expression may involve significant roundoff error if $\chi^2$ is of the order of the roundoff accuracy times the first sum. Nevertheless, it gives a convenient estimate of the amount by which the last term will reduce the $\chi^2$, and may be useful in the stopping criterion for terminating the approximation.
A straightforward method for generating a set of orthogonal functions is the Gram-Schmidt orthogonalisation process. In this process, we begin with a set of linearly independent functions $q_j(x)$ and go on removing the parallel components from each of them to obtain a set of orthogonal functions $\phi_j(x)$, using

$$\phi_j(x) = q_j(x) - \sum_{i=1}^{j-1} \left( \sum_{k=1}^{n} \frac{1}{\sigma_k^2}\, q_j(x_k)\, \phi_i(x_k) \right) \frac{\phi_i(x)}{\gamma_i}, \qquad (10.39)$$

where $\gamma_i$ are given by (10.36). However, if the original basis functions are close to being linearly dependent, then this process will run into the same difficulty mentioned earlier. In this case, the roundoff error will occur while constructing such a set rather than while solving the normal equations. This process is similar to the method using orthogonal factorisation mentioned earlier. From the orthogonality condition (10.36), it is quite clear that the set of orthogonal polynomials depends on the weights as well as the sequence of data points $\{x_i\}$. Thus, for each sequence of data points we require a separate set of orthogonal polynomials.
A more convenient and efficient method for generating orthogonal polynomials is provided by recurrence relations. It can be proved {18} that the set of orthogonal polynomials $\{p_j(x)\}$ which satisfy (10.36) also obey a recurrence relation of the form

$$p_{j+1}(x) = (x - \alpha_{j+1})\, p_j(x) - \beta_j\, p_{j-1}(x), \qquad (j = 0, 1, \ldots, n-2), \qquad (10.40)$$

with $p_0(x) = 1$ and $p_{-1}(x) = 0$. Here $\alpha_j$ and $\beta_j$ are constants which can be determined using

$$\alpha_{j+1} = \frac{\sum_{i=1}^{n} \sigma_i^{-2}\, x_i\, [p_j(x_i)]^2}{\gamma_j}, \qquad \beta_{j+1} = \frac{\gamma_{j+1}}{\gamma_j}. \qquad (10.41)$$

Thus, we can go on generating the orthogonal polynomials until the denominator in the definition of $\alpha_{j+1}$ vanishes. The denominator will vanish if $p_j(x_i) = 0$ for $i = 1, \ldots, n$, in which case the polynomial has at least n distinct zeros and

hence $j \geq n$. In fact, it can be shown {19} that $p_n(x_i) = 0$ for $i = 1, \ldots, n$. Thus, we cannot generate more than n independent polynomials using n data points. Using the coefficients $\alpha_j$ and $\beta_j$, the orthogonal polynomials can be evaluated at any value of x. The coefficients $a_j$ of these polynomials in the approximating function can be easily obtained using (10.37). It can be shown that the approximating function can be conveniently calculated using the following recurrence due to Clenshaw

$$q_k(x) = a_k + (x - \alpha_{k+1})\, q_{k+1}(x) - \beta_{k+1}\, q_{k+2}(x), \quad (k = m, m-1, \ldots, 0); \qquad q_{m+1}(x) = q_{m+2}(x) = 0; \qquad (10.42)$$

with $q_0(x) = F(\mathbf{a}, x)$. Similar recurrences can be obtained for evaluating the derivatives of the approximating function {20}.
Given the input values of $x_i$, $\sigma_i$, $f_i$ and m, the following algorithm can be used to generate the orthogonal polynomials and the corresponding coefficients in the least squares approximation:
1. Set $p_0(x_i) = 1$, $p_{-1}(x_i) = 0$ and $Y(x_i) = 0$ for $i = 1, 2, \ldots, n$
2. Set $\beta_0 = 0$ and $\gamma_0 = \sum_{i=1}^{n} \sigma_i^{-2}$
3. For $j = 0, 1, \ldots, m$ in succession, do steps (3a) to (3g)
   3a. Compute $a_j = \dfrac{\sum_{i=1}^{n} \sigma_i^{-2}\, f_i\, p_j(x_i)}{\gamma_j}$
   3b. For $i = 1, 2, \ldots, n$ compute $Y(x_i) + a_j p_j(x_i)$ and overwrite on $Y(x_i)$
   3c. If $j = m$, then stop
   3d. Compute $\alpha_{j+1}$ using (10.41)
   3e. For $i = 1, 2, \ldots, n$ compute $p_{j+1}(x_i) = (x_i - \alpha_{j+1})\, p_j(x_i) - \beta_j\, p_{j-1}(x_i)$
   3f. Compute $\gamma_{j+1} = \sum_{i=1}^{n} \sigma_i^{-2}\, [p_{j+1}(x_i)]^2$
   3g. Compute $\beta_{j+1}$ using (10.41)
Here $Y(x_i)$ gives the calculated approximation at $x = x_i$. An implementation of this algorithm is provided by subroutine POLFIT in Appendix B. Using the coefficients $a_j$, $\alpha_j$ and $\beta_j$ the approximating polynomial can be evaluated at any required point using Clenshaw's recurrence, which can be accomplished by the subroutine POLEVL. In principle, we can also find the coefficients of $x^k$ in the approximating polynomial, but there is really no need to obtain these coefficients. In fact, calculating these coefficients may involve a considerable roundoff error. It can be seen from this algorithm that calculating the approximation requires cnm floating-point operations, where the constant c could be fairly large, approximately 10--20. Even for moderate values of m, this is much less than the number of operations required for the SVD solution or even for the direct solution of the normal equations. Further, in this case, the covariance matrix is also

diagonal and in fact, the square of variance associated with the coefficients aj
is given by l/ rj , which can also be computed without any extra effort.
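The algorithm above is compact enough to be transcribed directly. The following Python (NumPy) sketch mirrors its steps; it is only a rough analogue of what POLFIT computes, with a different interface and without the error estimates returned by the actual routine.

# Sketch: polynomial least squares fit via the orthogonal polynomial recurrence.
import numpy as np

def orth_poly_fit(x, f, sigma, m):
    """Return coefficients a_j, recurrence coefficients alpha, beta and fitted values."""
    w = 1.0 / sigma**2
    p_prev = np.zeros_like(x)            # p_{-1}
    p_cur = np.ones_like(x)              # p_0
    gamma = np.sum(w)                    # gamma_0
    a = np.zeros(m + 1)
    alpha = np.zeros(m + 1)
    beta = np.zeros(m + 1)               # beta_0 = 0
    y = np.zeros_like(x)
    for j in range(m + 1):
        a[j] = np.sum(w * f * p_cur) / gamma          # step 3a
        y += a[j] * p_cur                             # step 3b
        if j == m:
            break                                     # step 3c
        alpha[j + 1] = np.sum(w * x * p_cur**2) / gamma
        p_next = (x - alpha[j + 1]) * p_cur - beta[j] * p_prev   # step 3e
        gamma_next = np.sum(w * p_next**2)                       # step 3f
        beta[j + 1] = gamma_next / gamma
        p_prev, p_cur, gamma = p_cur, p_next, gamma_next
    return a, alpha, beta, y

The fitted polynomial can then be evaluated at any point with Clenshaw's recurrence (10.42), which is the role played by POLEVL.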
If the data extend over a large interval, it may not be possible to approximate the function over the entire range using a low degree polynomial. In such cases, we can use local approximations to obtain a smoothed function. For example, we can use the nearest five points to fit a least squares quadratic to evaluate the function or its derivative {3}. Such local approximations will not yield a fit that is smooth on a global scale, as the approximating parabola will change as we move across the table. To get a smooth fit we can use spline approximations. It is most convenient to use B-spline basis functions for this purpose. In general, if a polynomial of degree five or six does not give a sufficiently good fit, then it may be better to try B-splines, unless there are good reasons to believe that a higher degree polynomial will give a good fit. Subroutine BSPFIT in Appendix B provides a least squares fit to B-spline basis functions. As the number of B-spline basis functions is increased the fitted function tends to become less smooth. Thus smoothing can be controlled by choosing the right number of basis functions. In practice, this may be difficult to decide. An alternative
number of basis functions. In practice, this may be difficult to decide. Alternate

rt r
strategy is to use regularisation, where instead of minimising the X2 we add a
term representing smoothing. For example, we can minimise

x~ ~ t :; (f' -~ alm) ¢,(x;l + A' h(x,)' (~alm)1j (x,)

(10.43)
where A is a constant which determines the extent of regularisation, h(Xi) is a
suitable weight function and rP'j is the second derivative of rPj (x). Here we have
chosen the same set of points in both the terms, but that is not necessary. We
can apply the smoothing constraint on a different set of points also. The parameter $\lambda$ is known as the regularisation parameter. For $\lambda = 0$ this definition reduces to the usual $\chi^2$, while for $\lambda \gg 1$ the second term will dominate and the approximating function will tend to a straight line. The extent of smoothing can be controlled by the choice of $\lambda$. We can choose $\lambda$ to give the expected $\chi^2$ or we can make the choice by visual inspection of fits obtained using different $\lambda$. There are more sophisticated techniques for choosing the optimal regularisation, but it is difficult to apply them in practical problems. In general, the $\chi^2$ (without the smoothing term) increases as $\lambda$ increases and the best choice will probably be around the value where it starts increasing rapidly. If the two terms in (10.43) are plotted against each other, then the curve has the shape of the letter L and the corner of this curve probably gives an optimal value of $\lambda$. The weight function $h(x)$ can be chosen to give appropriate weightage to different regions. In the absence of any particular information we can use $h(x) = 1$. Instead of the second derivative we can use any other measure of smoothness to define the smoothing. For example, if we use the first derivative, then as $\lambda$ increases the approximating function will tend to a constant. In practical applications either first or second derivative smoothing proves to be useful.

In order to account for the additional terms arising from smoothing, we can modify the design matrix G by adding the additional equations given by

$$\lambda\, h(x_i) \sum_{j=1}^{m} a_j^{(m)}\, \phi_j''(x_i) = 0, \qquad i = 1, \ldots, n. \qquad (10.44)$$

This set of n equations can be added to the n equations $G\mathbf{a} = \mathbf{b}$ to obtain the system of 2n equations to be solved when smoothing is included. We can solve this system of equations using SVD, or alternately we can obtain the normal equations by differentiating (10.43) with respect to $a_i$ and solve the normal equations. It is clear that this process will require more effort than the usual least squares solution, but the ability to control smoothing may be useful and in some cases compensates for the additional effort.
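A sketch of this augmented system in Python (NumPy) is given below. The basis matrix, its second derivatives and the weight function are hypothetical inputs; the point is only to show how the n smoothing equations of (10.44) are stacked under G a = b.

# Sketch: least squares fit with second-derivative regularisation by stacking equations.
import numpy as np

def regularised_fit(G, d2G, b, lam, h=None):
    """Solve the augmented 2n x m least squares problem of (10.44).

    G   : n x m design matrix, G[i, j] = phi_j(x_i)
    d2G : n x m matrix of second derivatives, d2G[i, j] = phi_j''(x_i)
    lam : regularisation parameter lambda; h is an optional weight function h(x_i)
    """
    n = G.shape[0]
    if h is None:
        h = np.ones(n)
    A = np.vstack([G, lam * h[:, None] * d2G])       # 2n equations
    rhs = np.concatenate([b, np.zeros(n)])
    a, *_ = np.linalg.lstsq(A, rhs, rcond=None)      # SVD-based solution
    return a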
EXAMPLE 10.3: Using the data in Example 10.2 obtain the least squares polynomial approximation using orthogonal polynomials and using B-spline basis functions.
Using the subroutine POLFIT we can obtain the coefficients $a_j$ in the least squares approximation as well as the coefficients $\alpha_j$ and $\beta_j$ in the recurrence relation for the orthogonal polynomials. The results are very similar to those found using SVD in Example 10.2. Table 10.4 shows the results, which include the coefficients $a_j$ of the orthogonal polynomials and the coefficients $\alpha_j$ and $\beta_j$ occurring in the recurrence relations. This table also displays $1/\sqrt{\gamma_j}$, which gives the estimate of errors in the coefficients $a_j$. It can be seen that the error increases with j and for j > 6 the error is comparable to or larger than $a_j$ itself, which also gives an indication of where to terminate the approximation process. It may be noted that in this example the errors are not normally distributed. This table also gives the quantity $\delta^2$ defined in Example 10.2 and the computed values of the first and second derivatives using the approximating function. Once again these quantities are essentially the same as those obtained using the SVD.
If instead of orthogonal polynomials we use the cubic B-spline basis functions, which also form a set of reasonably independent basis functions, we can again avoid the ill-conditioning seen in Example 10.2 for the simple polynomial representation. However, in this case, since the given data are obtained from a polynomial, using an alternate representation will not give optimal results. But to illustrate the effectiveness of piecewise polynomial approximation
Table 10.4: Smoothing of data by polynomial least squares approximation

 m     a_m     1/√γ_m   α_m      β_m      χ²           δ²           f′            f″
 0     .121     .0004                     1.35×10^8
 1    2.697     .0014   .50000  .08673    1.31×10^8
 2  -17.012     .0053   .50000  .06930    1.21×10^8    1.21×10^8     3.044120     -34.025
 3  147.376     .0204   .50000  .06670    6.83×10^7    6.83×10^7   -19.906303     -43.048
 4  632.989     .0796   .50000  .06569    4.98×10^6    4.97×10^6   -17.031675    -324.234
 5  693.105     .3118   .50000  .06508    3.58×10^4    3.54×10^4    -5.980465    -312.010
 6  232.189    1.2271   .50000  .06459    4.76×10^1    3.97×10^0    -6.124782    -297.935
 7    4.473    4.8455   .50000  .06412    4.68×10^1    4.84×10^0    -6.131027    -297.948
 8  -24.888   19.2048   .50000  .06366    4.51×10^1    6.51×10^0    -6.132732    -297.782
 9  -41.991   76.4146   .50000  .06316    4.48×10^1    6.83×10^0    -6.137438    -297.800

Table 10.5: Smoothing of data by B-spline least squares approximation

 m   nc    λ        χ²           δ²           f′            f″

 Set I, using error (R − 0.5)×10^-2
  5   5  0.0000   6.31×10^6    6.31×10^6   -15.050076    -503.214
  6   7  0.0000   2.72×10^5    2.71×10^5    -4.975206    -236.709
  7   9  0.0000   4.09×10^4    4.12×10^4    -7.175162    -359.605
  8  11  0.0000   7.81×10^3    7.66×10^3    -5.663962    -276.692
  9  15  0.0000   2.02×10^3    1.98×10^3    -6.099776    -317.494
 10  21  0.0000   6.43×10^2    6.16×10^2    -6.169413    -286.726
 11  25  0.0000   2.47×10^2    2.26×10^2    -6.092216    -308.530
 12  29  0.0000   1.42×10^2    9.49×10^1    -6.159128    -291.399
 13  25  0.0000   9.46×10^1    4.61×10^1    -6.093847    -303.166
 14  33  0.0000   6.71×10^1    2.54×10^1    -6.170155    -293.605
 15  35  0.0000   5.33×10^1    1.66×10^1    -6.103477    -301.363

 Set II, using error 10^-5/(R − 0.5)^3
 15  20  0.0000   5.31×10^7    1.41×10^7   -13.334391   -1706.265
 15  17  0.0005   5.35×10^7    1.24×10^7   -14.635931   -1283.273
 15  15  0.0010   5.46×10^7    1.03×10^7   -13.534383    -753.033
 15   9  0.0020   5.82×10^7    8.37×10^6    -7.921742    -337.410
 15   8  0.0050   7.47×10^7    1.73×10^7    -3.632702    -239.965
 15   6  0.0100   1.01×10^8    4.35×10^7    -5.486947    -195.368
 15   7  0.0200   1.31×10^8    7.85×10^7    -2.677186    -110.525
 15   5  0.0500   1.61×10^8    1.11×10^8     2.243095     -40.846
 15   5  0.1000   1.75×10^8    1.24×10^8     3.420311     -14.283

we fit these data using B-spline basis functions. We use subroutine BSPFIT to obtain the coefficients of the approximation and calculate the approximating function using BSPEVL. The results obtained using a uniform spacing for the knots are shown in Table 10.5, which gives the $\chi^2$ as well as $\delta^2$ as defined earlier for different values of m, the number of basis functions. It can be seen that for m = 7 the results are not as good as those obtained using polynomials. This is to be expected, as a polynomial of degree 6 was used to generate this data set. But as m increases we get a better approximation and for m = 15 the $\chi^2$ and $\delta^2$ are comparable to those obtained using polynomials. An advantage of using B-splines is that we can increase the number of basis functions without causing any ill-conditioning. For example, even for m = 15 the condition number of the design matrix G is less than 10. This is a significant improvement over the polynomials. As a result, B-splines prove to be very useful in fitting data. The first and second derivatives at the required point are also given in the table. The value of the second derivative is not as good as that obtained using the polynomial approximation. This is probably because we have used cubic B-splines; if we use higher order B-splines then this approximation also improves. Further, in that case fewer basis functions are required to achieve the same accuracy. For example, using quartic B-splines, we need only about 11 basis functions to reduce $\chi^2$ to 48.
All these calculations have been done without any regularisation. In order to study the influence of regularisation we try the least squares approximation with regularisation. However, the approximation in this case is already quite smooth and not much can be achieved by using regularisation. To illustrate the process we choose the second data set obtained using (10.34). In this set there are some outliers and the resulting approximation tends to become nonsmooth. We use 15 cubic B-spline basis functions based on 13 uniformly spaced knots

Figure 10.2: Least squares approximation with regularisation obtained using cubic B-spline basis functions. The exact function f*(x) as well as the given data f(x) and the approximations using different λ are shown.

and calculate the approximation with different values of the smoothing parameter $\lambda$ using the subroutine BSPFIT in Appendix B. The resulting approximations are shown in Fig. 10.2, while the $\chi^2$, $\delta^2$ and the two derivatives at x = 0.4897959 are listed in Table 10.5. From the figure it can be seen that there is one point around x = 0.7 where the data value is 21.9, which is far from the actual function. Because of this point the approximations are distorted in its neighbourhood and it is not possible to get any meaningful approximation. Of course, by looking at the residuals in the fit we can identify this point and remove it from the set. After that the fits will be reasonable. But there may be situations where no individual point is totally off, but yet the combined effect may be to distort the approximation. Thus to illustrate the working of smoothing we retain this point in our fits. Because of this distortion, the derivatives at x = 0.4897959 cannot be calculated reliably. From the results in Table 10.5, it is clear that as smoothing is increased the derivatives approach the true values. As $\lambda$ is increased still further the derivatives again tend to be unreliable. The $\chi^2$ in this table is the value excluding the smoothing term and it can be seen that this keeps increasing with $\lambda$. From this variation we may expect a value of $\lambda = 0.005$ to be optimal. From the table it can be seen that the minimum value of $\delta^2$ is achieved at $\lambda = 0.002$. Thus at this point the approximation is closest to the polynomial used to generate the data set. But in practical applications this information will not be available and this criterion cannot be used to decide the optimal value of $\lambda$. Nevertheless, it can be seen that this value is close to that obtained by looking at the variation of $\chi^2$. Similarly, the number of sign changes in the residuals also decreases markedly as $\lambda$ is increased to 0.002, which also gives an indication of the optimal value for smoothing. The best approximation for the derivatives is also obtained around the same value. From Fig. 10.2, it can be seen that for $\lambda = 0$ the approximation has a significant deviation around x = 0.7, and as smoothing is increased this deviation reduces. But as $\lambda$ is increased beyond 0.002, the actual dip in the function around x = 0.85 also tends to be smoothed out. Thus the optimal value of $\lambda$ has to be chosen such that real nonsmooth variations in the function are not smoothed out. In general, we will not know what the real function looks like and it will be difficult to make the optimal choice, but by looking at various solutions and the computed $\chi^2$ it is generally possible to make a reasonable guess for $\lambda$.
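For readers working in Python, a B-spline least squares fit of the kind used in this example (15 cubic B-splines on 13 uniformly spaced knots) can be sketched with SciPy as below. This is not the BSPFIT/BSPEVL computation above and includes no regularisation; the seed and evaluation point are arbitrary.

# Sketch: unregularised cubic B-spline least squares fit with SciPy.
import numpy as np
from scipy.interpolate import make_lsq_spline

rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 50)
y = 231*x**6 - 315*x**4 + 105*x**2 - 5 + (rng.random(50) - 0.5) * 1e-2

k = 3                                          # cubic B-splines
interior = np.linspace(0.0, 1.0, 13)[1:-1]     # 11 interior knots -> 15 basis functions
t = np.concatenate([[x[0]] * (k + 1), interior, [x[-1]] * (k + 1)])
spl = make_lsq_spline(x, y, t, k=k)

# Value, first and second derivative of the fit at a chosen point:
print(spl(0.4897959), spl.derivative()(0.4897959), spl.derivative(2)(0.4897959))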

10.2.3 Least Squares Approximation With Correlations


In the foregoing discussion we have assumed that the errors in each data point are uncorrelated with those in other points. In practice, we often come across situations where these errors are correlated. In such cases we need to define the $\chi^2$ using the covariance matrix. For example, let us assume that the data values $f_i$ ($i = 1, \ldots, n$) are obtained from measuring a set of independent quantities $g_j$ ($j = 1, \ldots, N$) with estimated variances $\sigma_j^2$, which are uncorrelated. The error in $f_i$ can be expressed as

$$\delta f_i = \sum_{j=1}^{N} J_{ij}\, \delta g_j, \qquad (10.45)$$

where $J_{ij}$ are the elements of the Jacobian matrix $J$. If we define the matrix $J'_{ij} = J_{ij}\sigma_j$, then the covariance matrix can be obtained as $C = J'J'^T$, which is an $n \times n$ matrix. Now the $\chi^2$ can be defined as

$$\chi^2 = \sum_{i=1}^{n} \sum_{j=1}^{n} \left( f_i - F(\mathbf{a}, x_i) \right) (C^{-1})_{ij} \left( f_j - F(\mathbf{a}, x_j) \right), \qquad (10.46)$$

which replaces (10.14). Thus, if the errors in various data points are correlated we should obtain the least squares approximation by minimising this function. It can be easily seen that if the covariance matrix is diagonal then it reduces to the usual definition (10.14). We can again obtain the normal equations as earlier to find the minimum. It would be advisable to use SVD to obtain the solution of the normal equations.
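One convenient way to organise this computation, sketched below in Python (NumPy), is to "whiten" the equations with the Cholesky factor of the covariance matrix and then solve an ordinary least squares problem; this minimises the same quantity as (10.46). The inputs are hypothetical.

# Sketch: linear least squares fit with correlated errors via Cholesky whitening.
import numpy as np

def correlated_lstsq(G, f, C):
    """Minimise (f - G a)^T C^{-1} (f - G a) for the coefficient vector a."""
    L = np.linalg.cholesky(C)          # C = L L^T
    Gw = np.linalg.solve(L, G)         # L^{-1} G
    fw = np.linalg.solve(L, f)         # L^{-1} f
    a, *_ = np.linalg.lstsq(Gw, fw, rcond=None)
    return a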

10.2.4 Least Squares Approximation when both x and y have errors

So far we have discussed fitting data sets $(x_i, y_i)$ where the error in the independent variable x is small and has been neglected. If both variables have significant errors then we have to consider them. The procedure in this case becomes much more complicated and hence to illustrate the procedure we only consider the simple case of fitting a straight line. Since x, y both have errors, let us assume that their true values are $\xi, \eta$. Further, we also assume that the variances are independent of i. In that case the probability distribution for one pair of (x, y) values is given by

$$G(x,y) = \frac{1}{2\pi \sigma_x \sigma_y \sqrt{1 - \rho_{xy}^2}} \exp\left( -\frac{1}{2(1-\rho_{xy}^2)} \left[ \frac{(x - \xi)^2}{\sigma_x^2} - 2\rho_{xy} \frac{(x-\xi)(y-\eta)}{\sigma_x \sigma_y} + \frac{(y-\eta)^2}{\sigma_y^2} \right] \right), \qquad (10.47)$$

where $\rho_{xy}$ is the correlation between x and y. If there is no correlation, the expression is straightforward to obtain. In the presence of correlation we need to take the inverse of the covariance matrix

$$C = \begin{pmatrix} \sigma_x^2 & \rho_{xy}\sigma_x\sigma_y \\ \rho_{xy}\sigma_x\sigma_y & \sigma_y^2 \end{pmatrix}. \qquad (10.48)$$

Thus in this case, neglecting the constant factor involving $(1 - \rho_{xy}^2)$, the $\chi^2$ function is given by

$$\chi^2 = \sum_{i=1}^{n} \left[ \frac{(x_i - \xi_i)^2}{\sigma_x^2} - 2\rho_{xy}\, \frac{(x_i - \xi_i)}{\sigma_x}\, \frac{(y_i - \eta_i)}{\sigma_y} + \frac{(y_i - \eta_i)^2}{\sigma_y^2} \right]. \qquad (10.49)$$

It may be noted that if $\rho_{xy}$ depends on i, then the factor involving it will need to be included in the definition of $\chi^2$. If we wish to fit a function of the form $\eta = F(\mathbf{a}, \xi)$, then we can use $\eta_i = F(\mathbf{a}, \xi_i)$ to eliminate $\eta_i$ and minimise $\chi^2$ to get the parameters $\mathbf{a}$ as well as $\xi_i$. Thus in this case, apart from the parameters $\mathbf{a}$ defining the fit, we have n more parameters $\xi_i$ to be determined. For the case of a straight line fit of the form $\eta = a + b\xi$, the $\chi^2$ is not a quadratic function of these parameters and hence we would need to solve nonlinear equations to determine these parameters. Differentiating $\chi^2$ with respect to a, b and $\xi_i$ we get the set of equations
$$\sum_{i=1}^{n} (y_i - a - b\xi_i) = R \sum_{i=1}^{n} (x_i - \xi_i), \qquad (10.50)$$

$$\sum_{i=1}^{n} \xi_i\, (y_i - a - b\xi_i) = R \sum_{i=1}^{n} \xi_i\, (x_i - \xi_i), \qquad (10.51)$$

$$(y_i - a - b\xi_i)\,b - R\left( (x_i - \xi_i)\,b + y_i - a - b\xi_i \right) + S\,(x_i - \xi_i) = 0, \qquad i = 1, \ldots, n; \qquad (10.52)$$
where $R = \rho_{xy}\sigma_y/\sigma_x$ and $S = (\sigma_y/\sigma_x)^2$. The last equation can be written as

$$y_i - a - b\xi_i = \frac{bR - S}{b - R}\,(x_i - \xi_i) = \theta\,(x_i - \xi_i), \qquad (10.53)$$

$$\text{or} \qquad \xi_i = \frac{\theta x_i - y_i + a}{\theta - b}. \qquad (10.54)$$

It may be noted that although $\theta$ is a constant, its value is not known as it involves the unknown parameter b. If $\rho_{xy} = 1$, then $S = R^2$ and $\theta = R$, independent of b. In this case it can be seen that Eqs. (10.50) and (10.51) will also be satisfied (if 10.52 holds) and in fact, $\chi^2 = 0$ independent of the values of a, b. Thus there is strictly no solution in this case. This case can be handled by perturbing $\rho_{xy}$ slightly. Substituting Eq. (10.53) in Eq. (10.50), we get

$$\sum_{i=1}^{n} (x_i - \xi_i) = 0, \qquad \sum_{i=1}^{n} (y_i - \eta_i) = 0, \qquad (10.55)$$

which gives the relationship between the mean values, $\bar{x} = \bar{\xi}$ and $\bar{y} = \bar{\eta}$. Further, using the equation for the straight line we get

$$\bar{\eta} = a + b\bar{\xi}, \qquad \text{or} \qquad \bar{y} = a + b\bar{x}. \qquad (10.56)$$

This gives one equation connecting the parameters a, b. The second equation can be obtained by eliminating $\xi_i$ from Eq. (10.51), using Eq. (10.54)

$$\sum_{i=1}^{n} (y_i - a - bx_i)(a - y_i + \theta x_i) = 0. \qquad (10.57)$$

Substituting the value of a from Eq. (10.56), we get a nonlinear equation in b

$$(\bar{y})^2 - \overline{y^2} - b\,\bar{x}\,\bar{y} + b\,\overline{xy} + \theta\left( \overline{xy} - \bar{x}\,\bar{y} + b\,(\bar{x})^2 - b\,\overline{x^2} \right) = 0. \qquad (10.58)$$

Since $\theta$ is also a function of b, this equation gives a quadratic in b

(10.59)

This equation can be solved to get two values for the slope b, one of which will give the minimum $\chi^2$. We can then use Eq. (10.56) to calculate the intercept a and Eq. (10.54) to calculate $\xi_i$. It is clear that generalising this to a more general least squares fit will be complicated. This should be done only if the errors in x are significant. To compare the significance we can compare $\sigma_y$ with $b\sigma_x$, where b is the estimated slope of a straight line fit.
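The whole procedure for the straight line is short enough to sketch. In the Python (NumPy) sketch below the quadratic for b is the one obtained by substituting $\theta = (bR - S)/(b - R)$ into (10.58) and multiplying through by $(b - R)$; the error parameters sig_x, sig_y and rho are assumed inputs, and the root giving the smaller $\chi^2$ of (10.49) is kept.

# Sketch: straight line fit when both x and y have errors (Eqs. 10.53-10.58).
import numpy as np

def line_fit_xy_errors(x, y, sig_x, sig_y, rho=0.0):
    R = rho * sig_y / sig_x
    S = (sig_y / sig_x) ** 2
    xb, yb = x.mean(), y.mean()
    A = yb**2 - np.mean(y**2)
    B = np.mean(x * y) - xb * yb
    C = xb**2 - np.mean(x**2)
    roots = np.roots([B + R * C, A - S * C, -(A * R + S * B)])   # quadratic in b
    best = None
    for b in roots.real:
        a = yb - b * xb                               # Eq. (10.56)
        theta = (b * R - S) / (b - R)
        xi = (theta * x - y + a) / (theta - b)        # Eq. (10.54)
        chi2 = np.sum(((x - xi) / sig_x) ** 2
                      - 2 * rho * (x - xi) * (y - a - b * xi) / (sig_x * sig_y)
                      + ((y - a - b * xi) / sig_y) ** 2)
        if best is None or chi2 < best[0]:
            best = (chi2, a, b)
    return best[1], best[2]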

10.3 Nonlinear Least Squares


In this section we consider approximations where the approximating function depends nonlinearly on the parameters $a_j$. Before we consider the methods for solving such problems, it may be noted that many nonlinear problems can be transformed into linear problems. For example, $y(x) = a_1 \exp(a_2 x)$ can be rewritten as $\ln y(x) = a_1' + a_2 x$. Such problems should preferably be tackled as linear problems. Another problem with nonlinear functions is that it is easy to overlook the fact that some of the parameters are not really independent. For example

(10.60)

Here in $F_1$, the parameters $a_1$ and $a_2$ are essentially the same, while in $F_2$ any one of the parameters (which is nonzero) can be eliminated by dividing the numerator and denominator by the same parameter.
We can follow the same approach as for the linear problem, but in this
case, the normal equations turn out to be nonlinear. Thus, it may be better to
directly minimise the error norm using the techniques described in Chapter 8.

Of course, if the parameters are not independent, then there could be problems
in minimisation, because of narrow valleys in the contour diagram for the func-
tion. The statistical analysis of the results is also more difficult in the nonlinear
case.
The function to be minimised can be written as

$$\chi^2(\mathbf{a}) = \sum_{i=1}^{n} \frac{1}{\sigma_i^2} \left( f_i - F(\mathbf{a}, x_i) \right)^2. \qquad (10.61)$$

Because of the special form of this function, i.e., a sum of squares of real functions, it is possible to find special methods to minimise the function. Here we will only outline some methods, while for more details readers can consult Fletcher (2000). The gradient vector and the Hessian matrix for the required function are given by

$$\frac{\partial \chi^2}{\partial a_j} = 2\sum_{i=1}^{n} r_i\, \frac{\partial r_i}{\partial a_j}, \qquad \frac{\partial^2 \chi^2}{\partial a_j\, \partial a_k} = 2\sum_{i=1}^{n} \left( \frac{\partial r_i}{\partial a_j}\frac{\partial r_i}{\partial a_k} + r_i\, \frac{\partial^2 r_i}{\partial a_j\, \partial a_k} \right), \qquad (10.62)$$

where $r_i = (F(\mathbf{a}, x_i) - f_i)/\sigma_i$ are the residuals. It can be seen that the second term in the definition of the Hessian matrix involves the residuals. Near the minimum these residuals may be expected to be small and uncorrelated. Thus, the summation may give a
rather small result. Hence, it is customary to neglect this term. The advantages
of neglecting this term are: (1) The Hessian matrix can be calculated without
requiring second derivatives of the approximating function. (2) The resulting
Hessian matrix is positive semidefinite. The second characteristic is important to ensure the descent property in the modified Newton methods. Ideally we would like the matrix to be positive definite, but semi-definiteness may be good enough in most cases. Further, modifying the Hessian matrix will only change the rate of convergence of the iteration and not the final result. In fact, it turns out that with this modification the convergence of Newton's method is linear rather than quadratic, unless the second term in the definition of the Hessian matrix actually vanishes at the minimiser. If J is the $m \times n$ Jacobian matrix of the residuals

$$J_{ij} = \frac{\partial r_j}{\partial a_i}, \qquad (10.63)$$

then we can write the gradient vector $\mathbf{g}$ and the Hessian matrix G in the form

$$\mathbf{g}(\mathbf{a}) = 2J\mathbf{r}, \qquad G(\mathbf{a}) \approx 2JJ^T. \qquad (10.64)$$
Thus, the Hessian matrix can be calculated using the Jacobian of the residuals.
The difference in this case as compared to the general problems considered
in Chapter 8, is that, an approximation to the Hessian matrix is also available

and we do not need quasi-Newton method, which generates the approximation


in m steps. We can directly use the Newton's method which gives the iteration
solve J(k) J(k)T /Sa = _J(k)r(k) for /Sa and set a(k+l) = ark) + /Sa. (10.65)

Here J(k) is the Jacobian matrix evaluated at a = a(k). It can be seen that
each iteration essentially consists of linearising the normal equations and solv-
ing them to get the next approximation. As mentioned in Section 8.5, this
method can fail to reduce the function when the iteration is far from a min-
imum and it may be necessary to use the Newton's method with line search.
Even this technique may fail if the Hessian matrix is not positive definite. In
this case, since the Hessian matrix is positive semidefinite, usually, there will
be no difficulty. Of course, we can construct examples for which this method
converges to a nonstationary point, where the Jacobian is rank deficient. Fur-
ther, if the residuals are large at the minimum, then the approximation to the
Hessian is not very good and the convergence could be very slow. However,
practical experience with this method appears to suggest that the failures are
rare and in general, it is fairly efficient in finding the minimum.
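A bare-bones sketch of the iteration (10.65) in Python (NumPy) is given below. The Jacobian is approximated by forward differences purely for illustration, and the model, data and starting guess are hypothetical; a practical code would add a line search or Levenberg-Marquardt damping as described next.

# Sketch: Gauss-Newton iteration for a sum-of-squares objective.
import numpy as np

def gauss_newton(residual, a0, niter=20, eps=1e-6):
    """Minimise sum(residual(a)**2) starting from a0."""
    a = np.asarray(a0, dtype=float)
    for _ in range(niter):
        r = residual(a)
        # J[i, j] = d r_j / d a_i, as in Eq. (10.63), by forward differences
        J = np.array([(residual(a + eps * e) - r) / eps for e in np.eye(a.size)])
        da = np.linalg.solve(J @ J.T, -J @ r)        # Eq. (10.65)
        a = a + da
        if np.max(np.abs(da)) < 1e-10:
            break
    return a

# Hypothetical usage for a two-exponential model similar to Eq. (10.68):
x = np.linspace(0.0, 1.0, 50)
f = np.exp(-x) + 2 * np.exp(-2 * x)
res = lambda a: a[0] * np.exp(-a[1] * x) + a[2] * np.exp(-a[3] * x) - f
print(gauss_newton(res, [0.9, 0.9, 2.1, 2.1]))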
Another method which is widely used for nonlinear least squares is the Levenberg-Marquardt method briefly described in Section 7.16. In this method, the Hessian matrix is modified by adding some multiple of the unit matrix, to get the equation

$$\left( J^{(k)} J^{(k)T} + \nu I \right) \delta\mathbf{a} = -J^{(k)}\mathbf{r}^{(k)}. \qquad (10.66)$$

The value of the parameter $\nu$ is adjusted during the iteration. If $\nu = 0$ we get Newton's method, while for very large $\nu$ this iteration yields corrections in the direction of steepest descent. If at any step the iteration causes the function value to increase, then the correction is ignored and the iteration is repeated with a larger value of $\nu$, which will take the iteration closer to steepest descent. After a successful iteration the value of $\nu$ can be reduced. If $\nu$ is changed appropriately, then the iteration normally converges to the minimum, though the convergence could be very slow in some cases.
If the function is too complicated and it is difficult to compute even the
first derivatives, then we will have to consider some method which does not re-
quire derivatives. Fletcher (2000) has described some methods which are based
on updating the Jacobian matrix, similar to the Broyden's method described
in Section 7.17. In fact, the Broyden's method itself can be used to solve the
resulting system of nonlinear equations. But in that case, the objective function
may not decrease after every iteration and the iteration may not converge to a
minimum. Of course, we can always use the direction set methods described in
Section 8.6, but that may not be very efficient for the special problem consid-
ered here.
It may be difficult to give any statistical analysis of nonlinear least squares
problems. However, close to a minimum the residuals may be approximated by
a linear function and the statistical techniques for linear least squares may be
applied to nonlinear problems also. If a quasi-Newton method for minimisa-
tion is used, then it may give an estimate to the inverse Hessian matrix at the

minimum. This should be the covariance matrix which can be used to estimate
errors in fitted parameters. If the distribution of errors in the input data is
known, then we can perform Monte Carlo simulation to estimate the uncer-
tainties in the estimated parameters. For this purpose, we can obtain the value
of parameters (which we call ajOl) as determined by the least squares technique.
Then using these parameters, we can generate a set of synthetic data at the
same points and following the expected error distribution. In fact, we can gen-
erate several such sets. Using each of these synthetic data set, we can apply
the same method to obtain the least squares approximation and get a new set
of parameters aj. From the sets of parameters aj, we can study the distribu-
tion of aj with respect to the "true parameters" ajOl used in the simulation.
It is reasonable to assume that, this distribution gives a good approximation
to the actual distribution of aj with respect to the true parameter, which is
not known. Consequently, this technique can give a reliable estimate of the
expected errors in estimating the actual parameters. However, it requires con-
siderable computer time to simulate the data sets and perform least squares
fit on each one of them. The advantage of this technique is that, it can be ap-
plied even when the errors are not normally distributed. Of course, the Monte
Carlo simulation requires the knowledge of distribution of errors in the input
data. But that is not a serious limitation, since if the distribution of errors in
the input data is not known, it is not possible to give any estimate of errors
on the output coefficients. Monte Carlo simulation can be used to give error
estimate for linear least squares problem, or even on approximation problems
using different norms. In fact, this technique can be used to do experiments
on computer using different data reduction techniques, to find out which is the
best one to use under the given circumstances. It can also be used to plan an
experiment by identifying the points at which more data should be obtained,
in order to improve the reliability of parameter estimation.
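A schematic Python (NumPy) version of this Monte Carlo procedure is sketched below. The names fit and model are placeholders for whatever fitting routine and model function are being used, and the normally distributed noise is only one possible choice; any assumed error distribution can be substituted, as discussed above.

# Sketch: Monte Carlo estimate of the errors in fitted parameters.
import numpy as np

def monte_carlo_errors(fit, model, a_fit, x, sigma, nsets=200, seed=0):
    rng = np.random.default_rng(seed)
    samples = []
    for _ in range(nsets):
        synthetic = model(a_fit, x) + sigma * rng.standard_normal(x.size)
        samples.append(fit(x, synthetic, sigma))     # refit each synthetic data set
    samples = np.array(samples)
    # Spread of the refitted parameters about the values used to generate the data.
    return samples.std(axis=0), samples.mean(axis=0) - a_fit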
EXAMPLE 10.4: Generate synthetic data using the function

$$f(x) = e^{-x} + 2e^{-2x} + 3e^{-3x} + (R - 0.5)\times 10^{-5}, \qquad (10.67)$$

where $R$ is a random number with uniform distribution in the interval (0,1). Using n equally spaced points in the interval [0, 1], try to fit a function of the form

$$F(\mathbf{a}, x) = \sum_{i=1}^{m/2} a_{2i-1} \exp(-a_{2i}\, x). \qquad (10.68)$$

Use $n = 50$ and $m = 2, 4, 6$ and 8.


We can perform the nonlinear least squares fit by minimising the X 2 , where we use
0';2 = 12 X 1010. For m 2 6 it turns out to be rather difficult to find an accurate minimum
of the function, because of narrow valleys in the contour diagram for the function. There
are some directions along which the function decreases very slowly and as a result, the
minimisation routine fails to find the minimiser accurately and terminates far away from
the actual minimum. There are significant roundoff errors in evaluating the function and
its gradient. Consequently, 24-bit arithmetic is probably not sufficient to carry out these
calculations. The Hessian matrix is nearly singular at the minimiser and the convergence is
extremely slow with both the BFGS method as well as the Levenberg-Marquardt method.
The results obtained using the BFGS method are shown in the Table 10.6. The last 10 rows

Table 10.6: Nonlinear least squares approximation

 m  nc    χ²          a1      a2      a3      a4      a5      a6        α1        α2       |δa|
 2   2  1.4×10^10   5.8873  2.1228
 4   4  6.0×10^4    1.8302  1.2227  4.1695  2.8189
 8  35    48.78     0.7560  0.9555  1.9164  2.0228  2.9920  3.0002   a7 = 0.3356    a8 = 1.2654
 6  35    48.81     0.9731  0.9920  1.9859  1.9775  3.0411  2.9949   -0.0563    0.0065   0.000082
 6  30    38.24     0.9965  0.9992  1.9936  1.9958  3.0099  2.9986   -0.0128   -0.0029   0.000083
 6  26    54.27     0.9843  0.9958  1.9793  1.9834  3.0363  2.9950   -0.0476   -0.0064   0.000140
 6  31    39.81     1.0163  1.0051  2.0021  2.0119  2.9816  3.0019    0.0264   -0.0093   0.000031
 6  26    30.84     1.0176  1.0051  2.0126  2.0155  2.9699  3.0038    0.0407   -0.0015   0.000052
 6  26    50.40     1.0282  1.0078  2.0334  2.0283  2.9385  3.0083    0.0809    0.0084   0.000250
 6  24    49.35     1.0895  1.0250  2.1058  2.0884  2.8047  3.0255    0.2564    0.0269   0.002205
 6  27    38.19     0.9475  0.9838  1.9834  1.9591  3.0692  2.9917   -0.0967    0.0218   0.000296
 6  26    45.96     1.0241  1.0073  2.0084  2.0190  2.9675  3.0037    0.0452   -0.0094   0.000091
 6  29    39.42     1.0714  1.0200  2.0778  2.0693  2.8508  3.0194    0.1968    0.0160   0.001258

of the table give the results obtained for m = 6 using different simulations i.e., using different
sets of simulated data all using (10.67). It can be seen that only a small error term added to
the function has introduced significant errors in the coefficients. In fact , if the coefficient in
the error term is increased to 10- 4 , the results are completely different and it is difficult to
identify the coefficients. From different sets of data, it is possible to have some idea about the
expected errors in the coefficients. As expected, the X 2 decreases as m is increased, and by
looking at the number of sign changes in the residuals, it is clear that m = 6 is the optimum
value.
From the distribution of different sets of results for m = 6, it can be seen that , there is
a strong correlation between the errors in different coefficients. In fact, if the set of coefficients
is plotted in the six-dimensional space, all the points will lie close to a straight line passing
through the true value of the parameters. To study the problem we linearise the function
about the true values of the parameters and perform it linear least squares fit using the
function
F(a, x) = e- x + 2e- 2x +3e- 3x + al e- x - a2xe-x +a3e-2x - a42xe-2x + a5e-3x - a63xe - 3x .
(10.69)
It essentially means that we are fitting the random error term to a six-parameter linear
function. Hence, if the process is carried out correctly, then we should get a = O. In order
to a nalyse the problem we perform the fit using SVD algorithm. The singular values are
7.077082, 2.277433 , 0.5033542 , 0.04148534, 5.047 x 10 - 4 and 2.103 x 10- 5 . Thus, it can be
seen that the last two singular values are rather small and the condition number of the normal
equations is of the order of lOll. If none of the singular values are suppressed, then we get
the result

aT = (-0.026509, -0.007844, -0.015673, -0.022566,0.042184, -0.005234). (10.70)

It can be seen that the errors in this approximation are comparable to those seen in the
coefficients shown in Table 10.6. Further, it can be seen that the error in most sets of values
in Table 10.6 is proportional to this solution. In fact , the dominant contribution to the error
is from the sixth column of the matrix V of SVD corresponding to the smallest singular value:

V6 = (-0.4095254, -0.1177549, -0.3361527, -0.3743083, 0.7456710, -0.0963958)T (10.71)

Along this direction the X2 decreases very slowly, resulting in very slow convergence of the
minimisation algorithms for the nonlinear problem. If this column is suppressed by increasing

REPS to $10^{-5}$ in subroutine SVDEVL, then we get the following results:

aT = (-0.002981, -0.000658,0.000305, -0.001060, -0.000548,0.000305). (10.72)

Hence, the accuracy improves significantly. This vector is approximately proportional to the
fifth column of V

V5 = (0.5964462,0.2139682, -0.7289404,0.2141384,0.1323258, -0.0612379) T. (10.73)

In fact, if this column is also suppressed, then the results improve dramatically to

aT = (-0.00000086, -0.00000939, -0.00000007,0.00000961,0.00000272, -0.00000095).


(10.74)
To show that the errors in Table 10.6 are mainly in these two directions, we estimate the components along these directions. For each data set the result $\mathbf{a}$ can be expressed in the form

$$\mathbf{a} = \mathbf{a}_0 + \alpha_1 \mathbf{v}_6 + \alpha_2 \mathbf{v}_5 + \delta\mathbf{a}, \qquad (10.75)$$

where $\mathbf{a}_0 = (1,1,2,2,3,3)^T$ is the true solution, $\alpha_1 = \mathbf{v}_6^T(\mathbf{a} - \mathbf{a}_0)$, $\alpha_2 = \mathbf{v}_5^T(\mathbf{a} - \mathbf{a}_0)$ and $\delta\mathbf{a}$ is the residual error vector after removing the errors along the two principal directions. The last three columns in Table 10.6 give $\alpha_1$, $\alpha_2$ and $|\delta\mathbf{a}|$. It can be seen that in all cases these two directions account for the maximum component of the error.
To illustrate the smoothing property of this approximation, following Example 10.2, we calculate the quantity $\delta^2$ and find it to be 2.79, which is much smaller than the $\chi^2$. Hence, the approximation is closer to the actual function. Similarly, we can compute the derivatives at the 25th point (x = 0.4897959). Using the 3-point formulae we find $f' = -4.18689$ and $f'' = 9.82169$, while the 5-point formulae yield $f' = -4.18509$ and $f'' = 9.81638$, which can be compared with the exact values of $f' = -4.185205$ and $f'' = 9.828256$. Once again, the errors in the finite difference approximation to the derivatives are much larger than the errors in the function values. Using the approximation with m = 6, we find the derivatives to be $f' = -4.185206$ and $f'' = 9.828230$, which gives a much better approximation to the derivatives of the true function. This result once again establishes the effectiveness of smoothing for calculating the derivatives.

10.3.1 Maximum Likelihood Method


As explained earlier, if the errors have a normal distribution, the $L_2$ norm is the correct norm to be used for fitting data. If that is not the case we can directly invoke the probability function and maximise the likelihood. To illustrate this method let us assume that the measured quantities $y_i = f(x_i)$ have a Poisson distribution. In that case the probability of getting the values $y_i$ is given by

$$P = \prod_{i=1}^{n} \frac{\left[f(x_i)\right]^{y_i}\, e^{-f(x_i)}}{y_i!}. \qquad (10.76)$$

In this expression, the measured quantities $y_i$ can be treated as constants as they do not depend on the parameters which define the function $f(x_i)$. To maximise the probability we can take the logarithm and maximise the logarithm, or minimise the negative of the logarithm. Further, we assume that the function $f(x) = a + bx$ is linear. This gives

$$\ln P = \sum_{i=1}^{n} \left( y_i \ln(a + bx_i) - (a + bx_i) - \ln(y_i!) \right). \qquad (10.77)$$

To maximise this, we can differentiate with respect to $a$ and $b$ to get

$$\sum_{i=1}^{n} \frac{y_i}{a + bx_i} = n, \qquad \sum_{i=1}^{n} \frac{x_i y_i}{a + bx_i} = \sum_{i=1}^{n} x_i. \qquad (10.78)$$

These are a set of two nonlinear equations in the parameters a, b which can be
solved to get the values. Alternately, we can get the parameters by maximising
In P directly. It is clear that this requires much more effort than the simple
least squares fit. Further, estimating errors in parameters a, b is also difficult.
Near the minimum the function In P can be approximated by a quadratic using
the second derivatives of the function. From this we can obtain the equivalent
of the normal equations for the problem and the corresponding error estimates.
Alternately, we can do a Monte Carlo simulation by generating a set of artificial
data using the parameters $a, b$ and selecting the $y_i$ from the appropriate distribution.
Then the fitting process is repeated for the artificial data to get new values of
a, b. By repeating this with many realisations of artificial data, we can find the
distribution of a, b values and get the standard deviation to estimate the errors.
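As an illustration of this procedure, the negative logarithm of (10.77) can be handed to any general-purpose minimiser. The sketch below is one possible implementation in Python (numpy and scipy are assumed and are not part of the routines described in this book; the name poisson_fit and the starting values are illustrative):

import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

def poisson_fit(x, y, a0=1.0, b0=1.0):
    # Maximum likelihood fit of f(x) = a + b*x to Poisson counts y,
    # obtained by minimising -ln P from (10.77); a0, b0 are starting guesses.
    def neg_log_like(p):
        a, b = p
        lam = a + b * x
        if np.any(lam <= 0):          # the Poisson mean must remain positive
            return np.inf
        return -np.sum(y * np.log(lam) - lam - gammaln(y + 1))
    res = minimize(neg_log_like, [a0, b0], method='Nelder-Mead')
    return res.x

Error estimates for $a, b$ can then be obtained exactly as described above, for instance by refitting many artificial data sets drawn with numpy.random.poisson(a + b*x).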

10.4 Least Squares Approximation in Two Dimensions
So far we have considered approximations in one dimension only, but in prin-
ciple the same methods could be applied to approximate a function of several
variables. In that case, for linear least squares approximations, the basis func-
tions will be functions of several variables and the entire fitting procedure goes
through. The only difficulty is that, to approximate a function of several vari-
ables, we require a large number of basis functions, which may cause problems
with ill-conditioning. The theory of orthogonal polynomials in several variables
is not very well developed and it may be difficult to use orthogonal polynomials.
But we can easily use the product B-spline basis functions to approximate a
function of several variables. For example
$$f(x, y) \approx \sum_{i=1}^{n_x} \sum_{j=1}^{n_y} a_{ij}\, \phi_i(x)\, \psi_j(y), \qquad (10.79)$$

where $n_x$ and $n_y$ are the number of basis functions along $x$ and $y$, while $\phi_i(x)$
and $\psi_j(y)$ are the basis functions in the two variables. The coefficients $a_{ij}$ can
be determined by minimising the $\chi^2$. Similarly, for nonlinear least squares the
approximating function could be nonlinear and there will be no difference in
the basic method, although it may be difficult to find a good form for the
approximating function.
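For the linear case (10.79), once the basis functions have been evaluated at the data points the problem is an ordinary least squares solve with one unknown per pair $(i, j)$. A minimal Python sketch is given below (numpy is assumed; fit_2d, phi and psi are illustrative names, with phi and psi returning matrices of basis-function values, for instance B-splines):

import numpy as np

def fit_2d(x, y, f, phi, psi):
    # Least squares fit of the product form (10.79):
    # f(x, y) ~ sum_ij a_ij phi_i(x) psi_j(y), with data points (x, y, f).
    Bx = phi(x)                       # shape (npoints, nx)
    By = psi(y)                       # shape (npoints, ny)
    A = (Bx[:, :, None] * By[:, None, :]).reshape(len(f), -1)
    a, *_ = np.linalg.lstsq(A, f, rcond=None)
    return a.reshape(Bx.shape[1], By.shape[1])

Weights or the regularisation equations discussed below can be incorporated by scaling or appending rows of the design matrix before the solve.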
The technique of regularisation can also be generalised to higher dimen-
sions. Thus in two dimensions we can incorporate second derivative regularisa-
tion by including additional equations of the type

$$\sum_{i=1}^{n_x} \sum_{j=1}^{n_y} a_{ij}\, \phi_i''(x)\, \psi_j(y) = 0, \qquad \sum_{i=1}^{n_x} \sum_{j=1}^{n_y} a_{ij}\, \phi_i(x)\, \psi_j''(y) = 0, \qquad (10.80)$$

at a suitable set of points $(x, y)$. These equations have to be added to the set of
equations obtained from (10.79). These equations can be multiplied by suitably
chosen weights. If we apply the regularisation condition at the same set of points
where the table of values is available, the number of equations will increase by
a factor of 3. This process can be easily generalised to $n$ dimensions, in which
case the number of equations will increase by a factor of $n + 1$ as compared
to the case where no regularisation is applied. Thus it is clear that applying
regularisation increases the amount of calculation significantly. In most cases,
it may be better to adjust the number of basis functions to achieve smoothing.
For calculating the linear least squares approximation in multiple dimen-
sions, if the tabular points are given over a rectangular mesh we can use the
technique described in Section 4.8 to improve the efficiency of calculations, by
solving the approximation problem in one dimension recursively. However, un-
like interpolation where the solution of equation is unique and hence is not
affected if some equation is multiplied by a constant, in the case of approxima-
tion if an equation is multiplied by a constant it is equivalent to using a different
weight and the least squares solution will be different. Thus if the weights are
nonuniform in Eq. (4.102) we will need to solve the one-dimensional problem
for every k. Even if that is done the resulting equations may effectively get
some weight and the solution will not be identical to that obtained by solving
the entire set of equations in two dimensions. Further, if smoothing has to be
included through additional equations it will be difficult to generalise this pro-
cedure to higher dimensions when weights are nonuniform. When weights are
equal it is possible to generalise this procedure to solve approximation problem
efficiently in multiple dimensions. Subroutines BSPFIT2 and BSPFITN in Ap-
pendix B provide an implementation of this algorithm in two and n dimensions
respectively. These routines are very efficient in calculating the least squares ap-
proximation over a large number of tabular points in multiple dimensions when
the weights are uniform and the tabular points are on a hyper-rectangular mesh in
n dimensions.
The orthogonal polynomials can also be generalised to higher dimensions
by using the product form if the points are on a hyper-rectangular mesh. If the
weights are equal for all points then we can use expansion of the form (10.79)
with the orthogonal polynomials as the basis functions in two dimensions. If the
weights are not equal then the orthogonal polynomials $\phi_i(x)$ will also depend on
$y$ and the expansion will be more complicated. Thus we only consider the case
where the weights are equal, since in that case the same orthogonal polynomials
$\phi_i(x)$ are applicable for all $y$. In this case the equations to be solved are

$$\sum_{i=1}^{m_x+1} \left( \sum_{l=1}^{m_y+1} a_{il}\, \psi_l(y_k) \right) \phi_i(x_j) = f(x_j, y_k), \qquad (10.81)$$

where $m_x$ and $m_y$ are the required degrees of the polynomials in $x$ and $y$, respec-
tively, $x_j$, $j = 1, \ldots, n_x$ are the points along the $x$ axis and $y_k$, $k = 1, \ldots, n_y$ are the
points along the $y$ axis where the function values are available. It can be recognised
that the part inside the parentheses is just the expansion in one dimension. If
these equations are multiplied by $\phi_i(x_j)$ and summed over all points $x_j$, then
using the orthogonality of $\phi_i(x_j)$, we get

$$\gamma_i \sum_{l=1}^{m_y+1} a_{il}\, \psi_l(y_k) = \gamma_i\, b_i^{(k)}, \qquad \gamma_i = \sum_{j=1}^{n_x} \phi_i(x_j)^2, \qquad i = 1, \ldots, m_x + 1, \qquad (10.82)$$

where $b_i^{(k)}$ are the coefficients of the expansion in $\phi_i(x)$ for $y = y_k$. Cancelling $\gamma_i$
from both sides, these equations can then be solved for the required coefficients
$a_{ij}$. Thus in this case also the approximation problem in two dimensions can
be solved by separately solving the one-dimensional problem along each of the
axes. This process is implemented in subroutine POLFIT2 in Appendix B.
This procedure can be easily generalised to higher dimensions by solving for
coefficients separately in each dimension. This procedure is implemented in
subroutine POLFITN in Appendix B.

10.5 Discrete Fourier Transform


If the function is expected to be periodic, then it may be better to use trigono-
metric functions instead of polynomials for approximation. The methods dis-
cussed in the preceding sections can be applied to trigonometric functions also,
but the Fourier series and transform have a wide range of application and we
will consider them in this and the next few sections. Even if the function is a
superposition of several periodic functions, the Fourier transform may be able
to separate each part and identify the different frequencies. In mathematical
analysis, we are familiar with the Fourier transform which operates on contin-
uous functions. In numerical computations, the data are always available at a
set of discrete points only, which imposes some fundamental limitations on the
amount of information that can be extracted from such data. In this section, we
briefly describe some of the properties and limitations of the so-called discrete
Fourier transform (DFT). In the next section, we will give an algorithm for
calculating the DFT.

The continuous Fourier transform and its inverse are defined by

$$G(f) = \int_{-\infty}^{\infty} g(t)\, e^{2\pi i f t}\, dt, \qquad g(t) = \int_{-\infty}^{\infty} G(f)\, e^{-2\pi i f t}\, df. \qquad (10.83)$$


The function can be represented in the time domain by g(t) or in the frequency
domain by G(f), and (10.83) provide a transformation between the two rep-
resentations. Here, if time t is measured in seconds, then the frequency f is
measured in cycles per second or Hertz. Of course, the Fourier transform can
also be applied to a function of spatial coordinates x or any other variable, in
which case, the interpretation of f will be different.
In most practical situations, the function $g(t)$ is sampled, i.e., its value
is recorded at a set of evenly spaced points in time. If $\Delta$ denotes the time
interval between consecutive samples, then the sequence of sampled values is
$g_j = g(j\Delta)$. The reciprocal of $\Delta$ is called the sampling rate. Further, only
a finite sequence of function values are available instead of the infinite range
required by the Fourier transform. If the function g(t) is nonzero only in a finite
range of time, then this range could be included in the data points. Alternately,
if the function g(t) goes on for ever, then the data can only represent some
typical range of time. Hence, the information content in a finite sequence of
sampled points is expected to be much less than what is present in the actual
function, defined over a continuous and infinite interval. The discrete sampling
at an interval $\Delta$ itself introduces some fundamental limitation on the amount
of information that is contained in the discrete data, even if we have an infinite
number of values.
The frequency $f_c = 1/(2\Delta)$ is called the Nyquist critical frequency. It is
quite clear that if we have two waves $\exp(2\pi i f_1 t)$ and $\exp(2\pi i f_2 t)$, such that
$f_1 - f_2 = 2f_c$, then the sampled values will be identical and from a discrete
sequence of points, we cannot distinguish between such waves. In fact, it turns
out that from the sampled data, we can obtain the transform only over a finite
range of $f$, i.e., $(-f_c, f_c)$, and the function $G(f)$ outside this range gets folded
over into this range. Thus, a frequency $f$ in this range will include contributions
from $f + 2nf_c$, $(n = \ldots, -2, -1, 0, 1, 2, \ldots)$. This phenomenon is called aliasing.
Thus, any frequency outside this range is aliased into the range because of
discrete sampling. On the other hand, the sampling theorem states that: If a
continuous function $g(t)$, sampled at an interval $\Delta$, happens to be bandwidth
limited to frequencies smaller in magnitude than $f_c$ (i.e., $G(f) = 0$ for $|f| > f_c$),
then the function $g(t)$ is completely determined by its samples $g_n$. In fact, $g(t)$
is explicitly given by the formula

$$g(t) = \Delta \sum_{n=-\infty}^{\infty} g_n\, \frac{\sin[2\pi f_c(t - n\Delta)]}{\pi(t - n\Delta)}. \qquad (10.84)$$

Thus, we should choose the sampling rate such that most of the contribution
in $G(f)$ is within the range $(-f_c, f_c)$. In general, we do not know the bandwidth
of the signal. Hence, after finding the Fourier transform, we can check if the
function $G(f)$ is tending to zero as $f$ tends to $+f_c$ from below and as $f$ tends
to $-f_c$ from above. If the transform is not tending to zero in these limits, then
the function is unlikely to be bandwidth limited in the required range.
Thus, because of sampling in the time domain, the Fourier transform is ef-
fectively periodic in the frequency domain with a period of $2f_c$. Similarly, if the
Fourier transform is sampled in the frequency domain, then the corresponding


function in the time domain will be effectively periodic. Sampling in the fre-
quency domain arises because in practice we only have a finite sample of values
for $g_k$. Suppose we have $N$ sampled values $g_k = g(k\Delta)$ for $k = 0, 1, 2, \ldots, N-1$.
For simplicity we also assume that $N$ is even. Now with only $N$ independent
inputs we cannot expect more than $N$ independent outputs. Hence, we can obtain
the Fourier transform $G(f)$ at only a set of $N$ discrete values $f_j = 2f_c j/N$ for
$j = -N/2, \ldots, N/2 - 1$. Since $G(f)$ is periodic, $G_{-N/2} = G_{N/2}$ and in fact, we
can even take the frequency interval to be $(0, 2f_c)$ by letting $j = 0, \ldots, N - 1$.
This discretisation introduces sampling in the frequency domain at a rate of
$1/(N\Delta)$. Hence, the function in time domain is essentially periodic with period
$T = N\Delta$, which essentially implies that, even though we know the function
value only in the interval T, the function is extended to other values of t by
assuming that it is periodic with period T. Thus, if the function is known to be
periodic, we should sample it over an interval which is an integral multiple of
the period. Otherwise, some discontinuities may be introduced in the effective
function definition, when it is extended periodically outside the interval.
The aliasing introduces distortions in the calculated transform for a gen-
eral function, which is not bandwidth limited. To minimise the effect of alias-
ing, we should sample the function at a very fast rate. However, if the Fourier
transform is falling off very slowly with f, then very high sampling rate will be
required. In such cases, it may be better to accelerate the convergence of the
transform by modifying the function. For this purpose, we have to look at the
convergence properties of the Fourier transform. Unlike polynomial approxima-
tions, the Fourier approximations have the interesting property, that the rate
at which the Fourier coefficients G(f) fall off as f increases is determined by
the discontinuities in the function value or its derivatives. If the function and
all its derivatives are continuous, then the coefficients will fall off very rapidly
and aliasing should not pose any serious problem. It should be noted that for
nonperiodic functions, discontinuities are usually introduced by the act of sam-
pling over a finite interval, since the function is implicitly extended outside the
interval, assuming that it is periodic. For example, if g(O) =f. g(N 6.), then a fi-
nite length of samples will introduce discontinuities at t = 0, ±N6., ±2N6., ...
in the function values. Similar discontinuities may also be introduced into the
derivatives. By performing integration by parts over the definition of Fourier
transforms, it can be shown that if the function has a discontinuity inside the
range, or if the values at the two end points are not equal, then $G(f) \sim 1/f$
for large $f$. Similarly, if the function is continuous, but the first derivative has
a discontinuity, then $G(f) \sim 1/f^2$. If the first derivative is also continuous, but the
second derivative has a discontinuity, then $G(f) \sim 1/f^3$. If the Fourier transform
G(f) is decreasing slowly with f, then there will be a significant aliasing, un-
less sampling rate is unrealistically high. To circumvent this problem, we can
apply the usual procedure of subtracting out the discontinuity, where another
function with discontinuity is subtracted from the given function, such that the
difference has no discontinuity. This technique is illustrated in Example 10.6.

Approximating the integral in Fourier transform using the trapezoidal


rule, which is very accurate for periodic functions, we get the discrete Fourier
transform
$$G_j = \sum_{k=0}^{N-1} g_k\, e^{2\pi i j k/N} = \sum_{k=0}^{N-1} g_k\, w^{jk}, \qquad (j = 0, 1, 2, \ldots, N-1), \qquad (10.85)$$

where $w = e^{2\pi i/N}$ and $i = \sqrt{-1}$. To avoid any confusion between $i$ used as
an index for summation, or as an array index, and its definition as $\sqrt{-1}$, we
will not use $i$ as an index in this and the next few sections. Hence, in these
sections $i$ always represents $\sqrt{-1}$. Here we have assumed that the index $j$
is in the range $0, \ldots, N - 1$, and it should be noted that indices $1 \le j \le N/2 - 1$
correspond to positive frequencies in the range $(0, f_c)$, while the indices
$N/2 + 1 \le j \le N - 1$ correspond to negative frequencies in the range $(-f_c, 0)$.
The value $j = N/2$ corresponds to both $f = f_c$ and $f = -f_c$. It should be
noted that while defining the DFT we have neglected the factor of $\Delta$, which
should arise when the integral is approximated by the trapezoidal rule. Hence,
the continuous Fourier transform can be approximated by

$$G(f_j) \approx \Delta\, G_j. \qquad (10.86)$$

The corresponding inverse transformation is given by

$$g_k = \frac{1}{N} \sum_{j=0}^{N-1} G_j\, e^{-2\pi i j k/N} = \frac{1}{N} \sum_{j=0}^{N-1} G_j\, w^{-jk}, \qquad (k = 0, \ldots, N-1). \qquad (10.87)$$

Using the orthogonality relations

$$\sum_{k=0}^{N-1} w^{jk}\, w^{-rk} = \begin{cases} N, & \text{if } j = r \ (\mathrm{mod}\ N); \\ 0, & \text{otherwise}; \end{cases} \qquad (10.88)$$

it can be shown that, if G j given by (10.85) are substituted in (10.87), then the
right-hand side gives gk.
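Before describing the fast algorithm of the next section, it is worth noting that (10.85) and (10.87) can be evaluated directly; the $O(N^2)$ Python sketch below (numpy assumed, names illustrative) is convenient for checking an FFT implementation on small $N$:

import numpy as np

def dft_direct(g):
    # Direct evaluation of (10.85): G_j = sum_k g_k w^{jk}, w = exp(2*pi*i/N)
    g = np.asarray(g, dtype=complex)
    N = len(g)
    jk = np.outer(np.arange(N), np.arange(N))
    return np.exp(2j * np.pi * jk / N) @ g

def idft_direct(G):
    # Inverse transform (10.87): g_k = (1/N) sum_j G_j w^{-jk}
    G = np.asarray(G, dtype=complex)
    N = len(G)
    jk = np.outer(np.arange(N), np.arange(N))
    return (np.exp(-2j * np.pi * jk / N) @ G) / N

Applying idft_direct to the output of dft_direct recovers the original samples, which is just the orthogonality relation (10.88) in action.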
With this definition, the DFT can be used to interpolate a function using
trigonometric functions as the basis. For this purpose, the interpolating function
can be written in the form

$$g_N(t) = \frac{1}{N} \sum_{j=0}^{N-1} G_j\, e^{-2\pi i j t/(\Delta N)}, \qquad (10.89)$$

where it is assumed that the function was sampled at $t_k = k\Delta$ for $k =$


0,1, ... , N -1. From the definition of inverse DFT, it is clear that this function
reproduces the sampled values at the points tk and hence it is an interpolating
function. Its behaviour at intermediate points could be quite different from the
actual function, if the original function is not bandwidth limited. In particu-


lar, at the sampled points the function value does not depend on whether $G_j$
is interpreted as a component of frequency $j/(\Delta N)$ or $2kf_c + j/(\Delta N)$, but at
intermediate values of $t$, between the sampled points, this interpretation will
affect the value of interpolating functions. In general, it will be best to split the
sum over positive and negative frequencies as explained earlier. At points where
the function has some discontinuity (probably because of periodic extension),
gN (t) generally tends to overshoot the discontinuity and then oscillates about
the true function, with the amplitude of oscillation decreasing as we move away
from the discontinuity. This is known as the Gibbs' phenomenon (Hamming,
1987). For a square wave function it can be shown that in the limit $N \to \infty$,
the function $g_N(t)$ will overshoot the discontinuity by approximately 9%. This
overshooting and oscillations can be reduced by introducing the Lanczos sigma
factors. In this approach, each of the G j is multiplied by a factor

$$\sigma_j = \frac{\sin(\pi j/N)}{\pi j/N}, \qquad (10.90)$$

before using it to construct the interpolating function using (10.89). This factor
damps the oscillations and reduces the extent of overshooting across the dis-
continuity. However, with this factor added, the function gN(t) will no longer
interpolate the function g(t) at the given set of points. The sigma factor has
been obtained by averaging the rapidly oscillating function $g_N(t)$ over appro-
priate subintervals. The use of sigma factor is illustrated in Example 10.6. The
sigma factor is also useful in calculating derivatives. Thus, we can write (Ham-
ming, 1987)
$$g_N'(t) = \frac{1}{N} \sum_{j=0}^{N-1} \sigma_j \left( \frac{-2\pi i j}{\Delta N} \right) G_j\, e^{-2\pi i j t/(\Delta N)}, \qquad (10.91)$$

to approximate the derivative. As usual the oscillations about the true function
will introduce higher error in derivatives than in the function value and it is
more important to apply smoothing process for calculating the derivatives.
Least squares approximation using trigonometric functions can be ob-
tained by restricting the summation in (10.89) to $m < N - 1$ terms. This result
follows directly from the orthogonality of the functions $\phi_j(t) = e^{-2\pi i j t/(\Delta N)}$. Trun-
cating the summation essentially implies that the high frequency components
are dropped. While truncating the sum, it should be recognised that high fre-
quency terms are actually near j = N /2 and these terms should be dropped
to get a smooth approximation. If high frequency components are dropped we
get the low-pass filter, while if only high frequency terms are included we get a
high-pass filter. It is also possible to get a band-pass filter which retains terms
in specified frequency range. In all cases we have to include contributions from
both positive and negative frequencies in appropriate range. Further, disconti-
nuities in frequency domain of the function can introduce some artifact in the
filtered function and hence it may be better to multiply the frequency spectrum
by a smooth function which tends to zero in appropriate range.
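A low-pass filter of this kind takes only a few lines once the DFT is available. The Python sketch below (numpy assumed; low_pass and m are illustrative names) keeps the m lowest frequencies on each side and zeroes the terms near $j = N/2$ before transforming back:

import numpy as np

def low_pass(g, m):
    # Truncated trigonometric least squares fit: retain only the m lowest
    # frequencies on either side (indices near 0 and near N), drop the rest.
    N = len(g)
    G = np.fft.ifft(g) * N            # DFT with the sign convention of (10.85)
    keep = np.zeros(N, dtype=bool)
    keep[:m + 1] = True               # j = 0, ..., m  (non-negative frequencies)
    keep[N - m:] = True               # j = N-m, ..., N-1  (negative frequencies)
    G[~keep] = 0.0
    return (np.fft.fft(G) / N).real   # inverse DFT; imaginary part is only roundoff for real g

A band-pass filter is obtained in the same way by shifting the retained index ranges, and a smooth taper can be applied to G instead of the sharp cutoff to reduce the artifacts mentioned above.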

We shall denote the function in the time domain by small case letters
and the corresponding function in the frequency domain by upper case letters
and use the symbol $\Longleftrightarrow$ to indicate the transform pairs. If $g_k \Longleftrightarrow G_j$ and
$h_k \Longleftrightarrow H_j$, then some of the properties of the discrete Fourier transform are
as follows:

$$\begin{aligned}
a g_k + b h_k &\Longleftrightarrow a G_j + b H_j, && \text{(linearity)};\\
g_{k-r} &\Longleftrightarrow w^{jr} G_j, && \text{(time shifting)};\\
w^{-ks} g_k &\Longleftrightarrow G_{j-s}, && \text{(frequency shifting)};\\
\sum_{r=0}^{N-1} g_r h_{k-r} &\Longleftrightarrow G_j H_j, && \text{(convolution)};\\
\sum_{r=0}^{N-1} g_r h_{r+k} &\Longleftrightarrow G_j^* H_j, && \text{(correlation)}.
\end{aligned} \qquad (10.92)$$

The discrete form of the Parseval's theorem is


$$\sum_{k=0}^{N-1} |h_k|^2 = \frac{1}{N} \sum_{j=0}^{N-1} |H_j|^2. \qquad (10.93)$$

We can easily find the corresponding properties of the continuous transform.


Apart from these, there are the following symmetry properties:
If $h_k$ is real then $H_{N-j} = H_j^*$
If $h_k$ is imaginary then $H_{N-j} = -H_j^*$
If $h_k = h_{N-k}$ (i.e., $h$ is even) then $H_j = H_{N-j}$ (i.e., $H$ is even)
If $h_k = -h_{N-k}$ (i.e., $h$ is odd) then $H_j = -H_{N-j}$ (i.e., $H$ is odd)
If $h$ is real and even then $H$ is real and even
If $h$ is real and odd then $H$ is imaginary and odd
If $h$ is imaginary and even then $H$ is imaginary and even
If $h$ is imaginary and odd then $H$ is real and odd
We conclude this section with a cautionary note, that although many
properties of DFT are analogous to those of the Fourier transform, the two are
not always the same. Significant effects could be introduced because of sampling
and finite amount of data. The most important concept to keep in mind is that,
DFT implies periodicity in both time and frequency domain. It should be noted
that the N sampled values of the time domain function represent one sample
of a periodic function.

10.6 Fast Fourier Transform


A straightforward use of (10.85) to calculate the DFT requires of the order
of N 2 complex multiplications and equal number of additions. However, all
computations involved in this calculation are not independent and in the 1960s
Cooley and Tukey discovered an algorithm which requires only of the order
of $N \ln N$ operations. Similar algorithms had been used by a few individuals
since the 1940s. This algorithm and its variants are known by the name of Fast
Fourier Transform (FFT). The discovery of this algorithm has revolutionised
the numerical computation of Fourier transforms. This algorithm can be applied
to any data, where the length N is a composite number, though the efficiency
depends on the factorisation of N. If all prime factors of N are small, then the
algorithm will be very efficient, while if N is prime, then no saving may be
accomplished and it can require $O(N^2)$ operations. The algorithm is simplest
when $N = 2^t$ and we will describe only that case here. For more advanced FFT
algorithms which are also applicable to other values of N, readers can refer to
Elliott and Rao (1982).
If $N = 2^t$, then we can express the indices $j$ and $k$ in the form

$$\begin{aligned}
j &= j_1 + 2j_2 + 2^2 j_3 + \cdots + 2^{t-1} j_t, && j_r = 0, 1 \quad (r = 1, \ldots, t);\\
k &= k_t + 2k_{t-1} + 2^2 k_{t-2} + \cdots + 2^{t-1} k_1, && k_r = 0, 1 \quad (r = 1, \ldots, t).
\end{aligned} \qquad (10.94)$$

It can be seen that here $j_r$ are the digits in the binary representation of $j$, while
$k_r$ are the digits in the binary representation of $k$, but in the reverse order. For
clarity of presentation, we give the algorithm for $N = 8$ or $t = 3$, which can
be easily generalised to any value of $t$. Using the above expansion for $k$ we can
write (10.85) as

$$G_j = \sum_{k_3=0}^{1} \sum_{k_2=0}^{1} \sum_{k_1=0}^{1} g_k\, w^{4k_1 j}\, w^{2k_2 j}\, w^{k_3 j}. \qquad (10.95)$$
Now using the expansion for $j$, we get

$$w^{4k_1 j} = w^{4 j_1 k_1}, \qquad w^{2k_2 j} = w^{(j_1 + 2j_2) 2 k_2}, \qquad w^{k_3 j} = w^{(j_1 + 2j_2 + 4j_3) k_3}, \qquad (10.96)$$

where we have made use of the fact that $w^8 = w^N = 1$. Substituting (10.96) in


(10.95), we get

$$G_{j_1, j_2, j_3} = \sum_{k_3=0}^{1} \left[ \sum_{k_2=0}^{1} \left( \sum_{k_1=0}^{1} g_{k_1, k_2, k_3}\, w^{4 j_1 k_1} \right) w^{(j_1 + 2j_2) 2 k_2} \right] w^{(j_1 + 2j_2 + 4j_3) k_3}. \qquad (10.97)$$
It should be noted that the terms in the parentheses depend only on $j_1$ and
hence can be considered as transforming the index $k_1$ to $j_1$, yielding $F_{j_1, k_2, k_3}$.
Similarly, the second summation transforms the index $k_2$ to $j_2$, giving $F_{j_1, j_2, k_3}$,
and the third summation gives the required transform. Each of these summa-
tion involves only two terms and has to be performed for all relevant values of
the other indices. Thus for each value of j, this process essentially breaks the
summation over 23 terms into three summations over two terms, which require
only three additions instead of seven required in the direct evaluation of the
sum. Hence, 8 x 3 additions and an equal number of multiplications are required
to evaluate the complete DFT. This technique can be easily extended to any
value of $t$, in which case the summation will be broken up into $t = \log_2 N$ parts,
each having two terms. This algorithm requires of the order of $N \log_2 N$ arith-
metic operations. Thus, the FFT algorithm gains in efficiency by rearranging
the summation over $N$ terms into $\log_2 N$ summations involving two terms.
For $N = 2^t$ the algorithm can be described as follows. To start
with we have the quantities $f_0(k_1, k_2, \ldots, k_t) = g_k$ and the algorithm con-
sists of $t$ steps. At the $r$th step, the index $k_r$ is exchanged with $j_r$ to get
$f_r(j_1, \ldots, j_r, k_{r+1}, \ldots, k_t)$. At the end of the last step, we have the DFT as
$G_j = f_t(j_1, j_2, \ldots, j_t)$. The $r$th step consists of the summation over $k_r$, given by

$$f_r(j_1, \ldots, j_r, k_{r+1}, \ldots, k_t) = \sum_{k_r=0}^{1} f_{r-1}(j_1, \ldots, j_{r-1}, k_r, k_{r+1}, \ldots, k_t)\, w^{(j_1 + 2j_2 + \cdots + 2^{r-1} j_r)\, 2^{t-r} k_r}. \qquad (10.98)$$

This expression is to be evaluated for all relevant values of $j_1, \ldots, j_r$ and
$k_{r+1}, \ldots, k_t$. This formula can be simplified by defining

$$J_1 = 0, \qquad J_r = j_1 + 2j_2 + \cdots + 2^{r-2} j_{r-1}, \quad (r = 2, \ldots, t), \qquad (10.99)$$

and

$$K_r = k_{r+1} + 2k_{r+2} + \cdots + 2^{t-r-1} k_t, \quad (r = 0, \ldots, t-1), \qquad K_t = 0. \qquad (10.100)$$

Then we can write the index on the left-hand side of (10.98) as

$$J_r + 2^{r-1} j_r + 2^r K_r, \qquad (r = 1, \ldots, t). \qquad (10.101)$$

It can be seen that $J_r$ and $K_r$ as defined above have the ranges $J_r = 0, 1, \ldots, 2^{r-1} - 1$
and $K_r = 0, 1, \ldots, 2^{t-r} - 1$, respectively. With these definitions (10.98) can
be written as

$$f_r(J_r + 2^{r-1} j_r + 2^r K_r) = \sum_{k_r=0}^{1} f_{r-1}(J_r + 2^{r-1} k_r + 2^r K_r)\, w^{(J_r + 2^{r-1} j_r)\, 2^{t-r} k_r}. \qquad (10.102)$$
Now we split (10.102) into two equations, one for $j_r = 0$ and the other for
$j_r = 1$. Noting that $w^{(2^{r-1})(2^{t-r})} = w^{2^{t-1}} = w^{N/2} = -1$, we get

$$\begin{aligned}
f_r(J_r + 2^r K_r) &= f_{r-1}(J_r + 2^r K_r) + f_{r-1}(J_r + 2^r K_r + 2^{r-1})\, w^{J_r 2^{t-r}},\\
f_r(J_r + 2^r K_r + 2^{r-1}) &= f_{r-1}(J_r + 2^r K_r) - f_{r-1}(J_r + 2^r K_r + 2^{r-1})\, w^{J_r 2^{t-r}}.
\end{aligned} \qquad (10.103)$$
In a computer implementation the array fr can be overwritten on fr-l itself
and as a result at the end of the calculations gk will be replaced by G j. However,
from the definition of the indices, it is clear that while the index $j$ is defined
in the natural order in the array $f_t(j) = G_j$, the initial array $f_0(k) = g_k$
should contain the elements in the bit reversed order. Thus, before applying
the algorithm, the array gk should be sorted in the bit reversed order.

k (binary)    bit-reversed k
000           000
001           100
010           010
011           110
100           001
101           101
110           011
111           111

Figure 10.3: Bit reversal for the FFT algorithm with N = 8

The bit reversal can be done without any extra storage requirement, since
it involves exchanging pairs of elements. If k' is the bit reverse of k, then k is
the bit reverse of $k'$. The only care required is to ensure that the same pair is not
exchanged twice. This objective can be achieved by proceeding down the array
$g_k$, and if $k'$ is the bit reverse of $k$, then $g_k$ and $g_{k'}$ are exchanged only if $k' > k$.
The operation of bit reversal for the case N = 8 is illustrated in Figure 10.3. In
a computer implementation, it is convenient to update k' as k increases, which
can be achieved by noting that if the first s digits of k in binary representation
are one, then they will become zero when k is incremented by one, while the
next digit is raised from zero to one. Other digits are not affected. Thus, in k'
the most significant s digits will be reduced to zero and one is added in the
next place. This updating can save some calculations, since the entire binary
representation for k need not be evaluated for each value of k . On most com-
puters straightforward implementation of bit reversal algorithm as explained
above is rather slow, because it requires memory access in almost random order.
In particular, if the array is too large to be stored in the computer memory,
then the execution will be too slow because of swapping and it will be almost
impossible to calculate the Fourier transform. It is possible to reorganise this
process to improve memory access, but we will not describe such algorithms.
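The whole procedure, bit reversal followed by the butterfly relations (10.103), is compact enough to sketch in full. The Python version below (numpy assumed; it is not the Appendix B subroutine FFT) uses the sign convention of (10.85) and assumes that the length is a power of 2:

import numpy as np

def fft_radix2(g):
    # Iterative radix-2 FFT computing G_j = sum_k g_k w^{jk}, w = exp(2*pi*i/N).
    a = np.asarray(g, dtype=complex).copy()
    N = len(a)
    t = N.bit_length() - 1                    # N = 2**t is assumed
    for k in range(N):                        # bit-reversal permutation
        kr = int(format(k, '0{}b'.format(t))[::-1], 2)
        if kr > k:                            # exchange each pair only once
            a[k], a[kr] = a[kr], a[k]
    m = 1
    while m < N:                              # butterfly passes, eq. (10.103)
        w = np.exp(2j * np.pi * np.arange(m) / (2 * m))   # twiddle factors
        for start in range(0, N, 2 * m):
            u = a[start:start + m].copy()
            v = w * a[start + m:start + 2 * m]
            a[start:start + m] = u + v
            a[start + m:start + 2 * m] = u - v
        m *= 2
    return a

The output agrees with the direct sum (10.85) but uses only of the order of $N \log_2 N$ operations.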
Many variants of the basic FFT algorithm are possible. The FFT algo-
rithm in the form described above is also referred to as Cooley-Tukey FFT
algorithm. Here the bit reversal is performed before applying the transforma-
tion. Alternately, we can perform the transformation to get G j in a bit reversed
order, and then perform the bit reversal to get G j in natural order. This form
is called the Sande-Tukey FFT algorithm. Further, there are FFT algorithms,
which are based on powers of 4 or 8. In this case, each summation involves 4
or 8 terms, respectively. But it can be performed effectively, by noting that the
different powers of w that occur in the summation are simply related to each
other. For example, with N = 4 all the powers of w which occur in a summation
differ by a factor of ±1 or ±i.
In the above algorithm, the input numbers gk as well as the output num-
bers G j are assumed to be complex. Of course, it can be used to obtain the
transform of real data by treating the numbers as real part of complex numbers,
with imaginary part being zero. However, there will be some loss in efficiency.
If we have to find Fourier transform of two real functions with the same number
of data points, then we can treat them as the real and the imaginary parts of
the input function. The FFT of each of these functions can be separated from
the output G j , using the symmetry properties of the DFT {29}.
If FFT of one real function is required, then the input data can be broken
into two parts, one obtained from the points with even k and other with odd
k and once again these can be treated as the real and imaginary parts of the
input array. Consequently, a real array of length N is essentially treated as a
complex array of length N /2 (assuming that N is even) and the DFT of this
complex array is decoded to yield the DFT of the original function as explained
below. The new complex function can be written as

$$h_k = g_{2k} + i\, g_{2k+1}, \qquad (k = 0, 1, \ldots, \tfrac{N}{2} - 1), \qquad (10.104)$$

and the DFT of this complex function can be expressed as

$$H_j = G_j^e + i\, G_j^o = \sum_{k=0}^{N/2-1} (g_{2k} + i\, g_{2k+1})\, e^{2\pi i k j/(N/2)}, \qquad (10.105)$$

where $G_j^e$ and $G_j^o$ are the DFT of the even and odd parts, respectively, while the
DFT of the original function is given by

$$G_j = G_j^e + e^{2\pi i j/N}\, G_j^o, \qquad (j = 0, 1, \ldots, N-1). \qquad (10.106)$$
Using the symmetry properties described in the previous section, it can be seen
that

$$G_j^e = \tfrac{1}{2}\left(H_j + H^*_{N/2-j}\right), \qquad G_j^o = -\tfrac{i}{2}\left(H_j - H^*_{N/2-j}\right). \qquad (10.107)$$

Substituting these values in (10.106) we can get the required DFT. However,
since $G_{N-j} = G_j^*$, there is no point in computing the entire transform, which
would occupy twice the space as compared to that required by the input array $g_k$.
Hence, only $G_j$ for $j = 0, \ldots, N/2$ need to be calculated. Even this truncated
form requires $H_{N/2}$, which is not directly obtained from the DFT, but by the peri-
odicity property of the DFT we know that $H_{N/2} = H_0$. From the above equations
it can be seen that the values $G_0$ and $G_{N/2}$ are real and independent. In order
to get the entire transform $G_j$ in the original array space, it is convenient to
return $G_{N/2}$ as the imaginary part of $G_0$. This transform can be easily inverted
by noting that

$$G_j^e = \tfrac{1}{2}\left(G_j + G^*_{N/2-j}\right), \qquad G_j^o = \tfrac{1}{2} e^{-2\pi i j/N}\left(G_j - G^*_{N/2-j}\right), \qquad (j = 0, \ldots, N/2 - 1). \qquad (10.108)$$

Using (10.105) we get $H_j$, which can be inverted using the normal inverse DFT
algorithm. It should be noted that, since the DFT is performed over $N/2$ data
points, the summation should be divided by $N/2$ rather than $N$ in (10.87).
This algorithm is implemented in subroutine FFTR in Appendix B.
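The packing and unpacking described by (10.104)-(10.107) can be sketched as follows in Python (numpy assumed; this mirrors the idea behind FFTR but is not that subroutine, and the helper name is illustrative):

import numpy as np

def real_dft(g):
    # DFT of a real array of even length N through one complex transform
    # of length N/2; returns G_j of (10.85) for j = 0, ..., N/2 - 1.
    N = len(g)
    h = g[0::2] + 1j * g[1::2]           # h_k = g_{2k} + i g_{2k+1}, eq. (10.104)
    H = np.fft.ifft(h) * (N // 2)        # DFT with kernel exp(+2*pi*i*k*j/(N/2))
    Hc = np.conj(np.roll(H[::-1], 1))    # H*_{N/2-j}, using the periodicity H_{N/2} = H_0
    Ge = 0.5 * (H + Hc)                  # even-part transform, eq. (10.107)
    Go = -0.5j * (H - Hc)                # odd-part transform
    idx = np.arange(N // 2)
    return Ge + np.exp(2j * np.pi * idx / N) * Go    # eq. (10.106) for j < N/2

The remaining components follow from $G_{N-j} = G_j^*$, so only half of the transform ever needs to be stored.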
If the length of the data set $N$ is not a power of 2, then an alternative strategy has
to be used. If $N = p^t$ for some relatively small prime $p$, then the FFT algorithm
can be trivially modified by expressing the indices in base $p$ notation with $k$
expressed using reverse digits. In that case, each summation involves $p$ terms.
If N is a composite number with many small prime factors, then we can use
a mixed radix representation {32} to express the indices j and k, with k once
again expressed in the reverse order. In this case, bookkeeping will be more
complex, since the summations will be over varying lengths. However, if N is a
prime, or has only a few large prime factors, then this algorithm may not make
much difference as far as efficiency is concerned. For prime $N$ it will require
$O(N^2)$ floating-point operations. Winograd has given a set of highly optimised
algorithms for obtaining FFT for a variety of values of N (Elliott and Rao,
1982). However, the FFT algorithm is simplest when N is a power of 2. Thus,
in planning an experiment, we can take account of this fact and obtain data
sets with the right length.
Alternately, we can pad the data set with zeros up to the next power of 2.
This padding will introduce some distortions in the Fourier transform, but in
most cases, these distortions can be figured out. The technique of zero padding
is also used to fill gaps in data sets. If for some reasons the value of function
is not recorded at intermediate points, then such gaps can be filled with zeros.
If the amount of filling required is a small fraction of total number of data
points, the Fourier transform may not be significantly distorted. To find out
the distortion introduced by zero padding and gap filling we can note that the
effective function of time can be considered as a product of the actual function
with the window function, $w(t)$, which is defined to be one when data is available
and zero at points where data is missing. Thus the Fourier transform of the
modified function will be the convolution of Fourier transform of actual function
with that of the window function. Hence to estimate the distortion we should
find the Fourier transform of the window function. If the power spectrum for
the window function has a peak at frequency $\nu_1 \neq 0$, then the gap filling will
introduce spurious peaks in the power spectrum for the actual function which are
separated from the real peaks by $\nu_1$. If the number of peaks in the spectra of the actual
and window function are small it may be possible to deconvolve the calculated
spectrum to obtain the spectrum of original function without data gaps. It
is not straightforward to perform this deconvolution unless the spectrum of
window function is simple enough.
An FFT algorithm for arbitrary value of N can be obtained by the fol-
lowing procedure due to Bluestein. Write the equation for DFT in the form

$$G_j = w^{j^2/2} \sum_{k=0}^{N-1} \left( w^{k^2/2}\, g_k \right) w^{-(j-k)^2/2}. \qquad (10.109)$$
Defining $y_k = w^{k^2/2}\, g_k$ and $h_k = w^{-k^2/2}$, this becomes

$$G_j = w^{j^2/2} \sum_{k=0}^{N-1} y_k\, h_{j-k}, \qquad (10.110)$$

where the summation can be considered as a convolution of two functions.


This convolution can be evaluated by augmenting the arrays Yk and hk with
sufficient zeros, so that they are of length N'. The number N' could be chosen
to be such, that there is sufficient zero padding to avoid the end effects and N'
is a composite number or a number of the form $2^t$, so that an efficient FFT
algorithm is available. To evaluate the convolution, we can perform FFT on
both Yk and hk and then calculate the convolution in the frequency domain,
which can be achieved by simply taking the product of corresponding terms.
To get the convolution in time domain, perform an inverse FFT. Thus, this
algorithm will require three FFT evaluations, each one requiring $O(N' \ln N')$
arithmetic operations.
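A compact realisation of Bluestein's idea is sketched below in Python (numpy assumed; bluestein_dft is an illustrative name and the convolution is done with a library FFT of padded length $N'$):

import numpy as np

def bluestein_dft(g):
    # DFT G_j = sum_k g_k exp(2*pi*i*j*k/N) for arbitrary N, via (10.109)-(10.110).
    g = np.asarray(g, dtype=complex)
    N = len(g)
    k = np.arange(N)
    w_half = np.exp(1j * np.pi * k**2 / N)        # w^{k^2/2} with w = exp(2*pi*i/N)
    y = g * w_half                                # y_k of (10.110)
    M = 1
    while M < 2 * N - 1:                          # zero padding up to a power of 2
        M *= 2
    h = np.zeros(M, dtype=complex)
    h[:N] = np.conj(w_half)                       # h_k = w^{-k^2/2} for k = 0, ..., N-1
    h[M - N + 1:] = np.conj(w_half[1:])[::-1]     # h_{-k} = h_k, wrapped to the end
    conv = np.fft.ifft(np.fft.fft(y, M) * np.fft.fft(h))[:N]   # circular convolution
    return w_half * conv                          # multiply by w^{j^2/2}

The three FFTs of length $M$ replace the $O(N^2)$ direct sum, exactly as described above.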
EXAMPLE 10.5: Obtain the DFT of $g(t) = \sin(2\pi f t)$ using function values at $t_k = k/N$
for $k = 0, 1, \ldots, N - 1$. Use $N = 128$ and $f = 30, 30.5$.
The DFT can be easily calculated using the FFT algorithm described above. For $f = 30$
the sampled interval includes an integral multiple of the period, and the DFT essentially
coincides with the continuous transform, which is a delta function. Apart from some roundoff
errors, we find that $G_{30}/N = -G_{98}/N = \frac{1}{2}i$, while all other components of $G_j$ are zero.
Noting that $G_{98} = G_{-30}$, it can be easily seen that the DFT reproduces the original function.
In fact, these results are fairly insensitive to random errors in the function value. Even if
random errors of the order of 0.1 are introduced into all function values, the DFT is not
significantly affected and only a few of the components of $G_j$ (other than $G_{30}$ and $G_{98}$)
have a magnitude of the order of $10^{-3}$. Neglecting these small random components in $G_j$,
we can recover the actual function without any difficulty. Hence, for such cases, where the
signal is known to be a linear superposition of sinusoidal functions, there is no difficulty in
determining the true form of the signal from the observed values, at a set of discrete points
spanning an integral multiple of the periods.
For f = 30.5 the interval does not span an integral multiple of the period, while
sampling essentially extends the function as a periodic function beyond the sampled inter-
val. This extension introduces some discontinuity at the end points. Hence, in this case,
we do not expect the DFT to resemble the continuous transform. In fact, the DFT does
not have any point at the frequency 30.5 and we may expect the signal to be split be-
tween the frequencies of 30 and 31. Even though the original function is a pure sine
wave, all coefficients of the DFT are real indicating cosines, which is to be expected, since
cos(30t) - cos(31t) = 2sin(30.5t) sin(0.5t). Because of the factor of sin(0.5t), the other coef-
ficients of the DFT are also nonzero. Figure 10.4 shows the DFT power spectrum ICj II N as
a function of the frequency j. It can be seen that the power has leaked into the neighbouring
frequencies, which is a typical result for periodic signals, where the sampling interval is not
an integral multiple of the period. The smearing can be understood by the fact that the finite
sample of function values essentially implies that we are seeing the product of the function
with a square wave window function

$$w(t) = \begin{cases} 1, & \text{for } 0 \le t \le 1; \\ 0, & \text{otherwise}. \end{cases} \qquad (10.111)$$

Hence, the Fourier transform of the product function is the convolution of the Fourier trans-
forms $G(f)$ and $W(f)$. The Fourier transform

$$|W(f)| = \left| \frac{\sin \pi f}{\pi f} \right|, \qquad (10.112)$$




Figure 10.4: DFT of a sine wave: The continuous curve shows the DFT obtained using
the function values directly, while the dashed line shows the DFT using the Hanning window
function.

while G(f) is an impulse function with nonzero value at f = 30.5 only. Hence, the convolution
will have one pronounced peak at 30.5 and a series of smaller peaks on either side. These
peaks give the nonzero contributions at different frequencies. When the frequency of the input
signal is an integral multiple of the frequency spacing of the DFT, it so happens that all the points at
which the frequency spectrum is sampled coincide with the zeros of the convolved function
and we get zero contribution. Brigham (1997) has given a detailed analysis of this situation.
The smearing in the frequency spectrum may be attributed to sharp edges in the
window function w(t). This smearing can indeed be reduced by using a smooth window
function. One of the simplest window functions which is widely used is given by Hanning and
is defined by

$$w(t) = \begin{cases} \frac{1}{2} - \frac{1}{2}\cos(2\pi t), & 0 \le t \le 1; \\ 0, & \text{otherwise}. \end{cases} \qquad (10.113)$$
This function as well as its first derivative vanishes at both the end points $t = 0$ and $t = 1$.
Hence, the truncation of the interval does not introduce any discontinuity in the function value
or the first derivative. If we use this window function instead of the square wave, then the
leakage is considerably reduced as can be seen from Figure 10.4, which also shows the DFT
obtained using this window function. This transform is obtained by multiplying the given
function with $w(t)$, before taking the DFT. This window function has frequency components
at $f = 0, \pm 1$ and it introduces two side lobes on either side of the actual frequency. Apart
from these main side lobes, the truncation of the function to the interval $[0,1]$ will cause small
contributions at other frequencies also. But this contribution is much less than that for the
square window. Thus, if this window function is applied to the first case with $f = 30$, then
we get $G_{30}/N = -G_{98}/N = 0.25i$ and $G_{29}/N = G_{31}/N = -G_{97}/N = -G_{99}/N = -0.125i$,
while other components of $G_j$ are zero. Introduction of the window function will change the
normalisation for the power spectrum and the DFT may be divided by an appropriate factor to
get the right normalisation.
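The numbers quoted in this example are easy to reproduce; the short Python sketch below (numpy assumed) computes the DFT of the sine wave with and without the Hanning window of (10.113) and prints the coefficients around $j = 30$:

import numpy as np

N = 128
t = np.arange(N) / N
for f in (30.0, 30.5):
    g = np.sin(2 * np.pi * f * t)
    G = np.fft.ifft(g) * N                       # DFT with the convention of (10.85)
    w = 0.5 - 0.5 * np.cos(2 * np.pi * t)        # Hanning window, eq. (10.113)
    Gw = np.fft.ifft(g * w) * N                  # DFT of the windowed signal
    print(f, np.round(np.abs(G[28:33]) / N, 3),
          np.round(np.abs(Gw[28:33]) / N, 3))

For $f = 30$ only $|G_{30}|/N = 0.5$ is nonzero, while for $f = 30.5$ the leakage into the neighbouring indices and its reduction by the window are clearly visible.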

EXAMPLE 10.6: Obtain the DFT of $g(t) = \exp(-2|t|)$, using function values at $t_k =
-a + 2ak/N$, $(k = 0, 1, \ldots, N - 1)$. Use $N = 64$ and $a = 16$.
This function is not bandwidth limited and we expect some aliasing in the DFT. The
continuous Fourier transform is

$$G(f) = \int_{-\infty}^{\infty} e^{-2|t|}\, e^{2\pi i f t}\, dt = \frac{1}{1 + \pi^2 f^2}, \qquad (10.114)$$

Figure 10.5: DFT of a nonperiodic function: The figure on the left shows the Fourier
transform (continuous curve) and the DFT (dashed curve). The dotted curve shows the DFT
obtained after removing the discontinuities. The figure on the right shows the function in the
time domain; the continuous curve shows the actual function, while the dashed curve shows
the function obtained using the DFT. The dotted curve is obtained using the sigma factor,
while the long dashed curve, which essentially coincides with the continuous curve, shows the
function obtained using the DFT after removing the discontinuities.

while the Nyquist frequency $f_c$ for the sampled input is 1. Hence, the contribution for $|f| > 1$
is folded back into the interval $(-1, 1)$ and the DFT is more than twice the actual transform at
the boundaries. Further, since $G(f)$ drops rather slowly with $f$, there is a significant aliasing
for all values of $j$ as the DFT has contributions from frequencies $f_j + 2nf_c$, $(n = 0, \pm 1, \pm 2, \ldots)$.
Figure 10.5 displays the Fourier transform as well as the DFT. This figure displays only the
magnitude of the DFT, since because of the time shift a phase of $w^{jN/2} = (-1)^j$ is introduced in
$G_j$. The problem here is because of the discontinuities in the first derivative. There are two
discontinuities, the first at $t = 0$, and the second at $t = \pm a$, by virtue of periodic extension. If
we subtract the parabola $(a - |t|)^2/a$ from the function, then the slope at the origin is reduced
to zero and the discontinuity is removed. At the same time, this function has a slope of zero
at $t = \pm a$ and the second discontinuity is unaffected. To remove the second discontinuity, we
can add the function $e^{-2a} t^2/a$ which has a vanishing slope at $t = 0$. Thus we get the function

$$h(t) = e^{-2|t|} - \frac{1}{a}(a - |t|)^2 + \frac{1}{a} e^{-2a} t^2, \qquad (10.115)$$

which is continuous and has a continuous first derivative, when extended periodically beyond
the interval $[-a, a]$. The DFT of this function is also shown in Figure 10.5. It can be easily
seen that in this case, the DFT drops to very low values at the end points and there is no
significant aliasing due to frequencies outside the Nyquist range. For this function, the second
derivative is also continuous and the discontinuity is only in the third derivative. Thus, the
Fourier transform falls off as $1/f^4$. Near $f = 0$ the curve for $G(f)$ goes outside the scale as
$G(0) = 169.67$.
To demonstrate the effect of discontinuity on the Fourier approximation, we also show
the function $g_N(t)$ defined by (10.89). It can be seen that this function interpolates the
original function at the sampled points, but oscillates about it at intermediate points. The
corresponding curve obtained using the modified function defined by (10.115) is also shown
in the figure. It can be seen that removing the discontinuities in the derivative improves
the approximation significantly and in fact the curve using this approximation essentially
coincides with the exact function. Also shown in the figure is the curve obtained using the
Lanczos sigma factor. The sigma factor improves the approximation, but the result is not as
good as that obtained after removing the discontinuity.

The FFT algorithm can also be used to obtain the sine and cosine trans-
forms defined by
$$G_j = \sum_{k=1}^{N-1} g_k \sin(\pi j k/N) \qquad \text{or} \qquad G_j = \frac{1}{2}\bigl(g_0 + (-1)^j g_N\bigr) + \sum_{k=1}^{N-1} g_k \cos(\pi j k/N). \qquad (10.116)$$
The sine transform should normally be applied to odd functions, while the
cosine transform to even functions. The simplest technique of using the FFT
algorithm to obtain these transforms is to extend the data set to 2N points
using the symmetry of transforms. Thus, for the sine transform we can use
g2N-k = -gk, (k = 0, ... , N - 1) along with gN = 0, while for the cosine
transform we use g2N -k = gk. It can be seen that with this extension the FFT
will give the required sine or cosine transform. However, in this technique the
data length is doubled and both the storage and the number of arithmetic
operations required increase by a factor of approximately two. It is possible to
perform this operation using an array of length N, provided we define some
auxiliary function {30}. Both these transforms are their own inverses. Hence,
if these transformations are applied twice, we get back the original array apart
from a normalising constant.
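The even or odd extension is straightforward to write down; the Python sketch below (numpy assumed; sine_transform is an illustrative name, and the cosine transform is analogous with an even extension) computes the first form of (10.116) through one complex FFT of length $2N$:

import numpy as np

def sine_transform(g):
    # G_j = sum_{k=1}^{N-1} g_k sin(pi j k / N), via the odd extension to 2N points.
    N = len(g)
    ext = np.zeros(2 * N)
    ext[1:N] = g[1:]
    ext[N + 1:] = -g[1:][::-1]                   # g_{2N-k} = -g_k, with g_N = 0
    G = np.fft.ifft(ext) * 2 * N                 # complex DFT of the extended array
    return G.imag[:N] / 2                        # the odd part gives 2i times the sine transform

Applying the routine twice returns the original array up to the normalising constant mentioned above.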
While computing discrete convolution of two functions, it should be noted
that discretisation implies periodicity in both time and frequency domain. If the
function is not actually periodic, then this extension may introduce spurious
effects near the end points. In most applications involving convolution, one of
the functions is a signal which goes on indefinitely in time. While the other
function is usually a response function, that is nonzero only in a small finite
interval around t = 0, which specifies how the signal will be smeared by the
measuring instrument. In such cases, we may pad the input data with sufficient
zeros, so that periodicity will not cause the data at one end to be "polluted"
by that at the other end. The number of zeros added should be larger than the
number of nonzero entries in the response function on either side of t = 0. This
number can be chosen so that the number of points in the resulting data set is
such that an efficient FFT algorithm is available. For computing convolution
we can improve the efficiency by calculating the FFT of both functions and
then computing the product in frequency domain. The inverse FFT of this
will give the convolution. This requires $O(3N \ln N)$ floating point operations as
compared to $O(N^2)$ for the straightforward computation of convolution.
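The zero-padded convolution described here is a few lines in Python (numpy assumed; smear is an illustrative name for the operation of convolving a signal with an instrumental response):

import numpy as np

def smear(signal, response):
    # Linear (non-circular) convolution via the FFT; zero padding prevents the
    # periodicity of the DFT from polluting one end of the data with the other.
    n = len(signal) + len(response) - 1
    nfft = 1 << (n - 1).bit_length()             # next power of 2, for an efficient FFT
    S = np.fft.fft(signal, nfft)
    R = np.fft.fft(response, nfft)
    return np.fft.ifft(S * R).real[:n]           # three FFTs instead of an O(N^2) sum

The result agrees with the direct sum numpy.convolve(signal, response) but scales as $O(N \ln N)$.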

10.7 FFT in Two or More Dimensions


The definition of Fourier transform can be easily generalised to two or more
dimensions. In $n$ dimensions we can define the DFT by

$$G_{j_1, \ldots, j_n} = \sum_{k_1=0}^{N_1-1} \cdots \sum_{k_n=0}^{N_n-1} g_{k_1, \ldots, k_n} \exp\left( 2\pi i \left[ \frac{j_1 k_1}{N_1} + \cdots + \frac{j_n k_n}{N_n} \right] \right). \qquad (10.117)$$

Here $N_j$ are the number of data points along the $j$th dimension. As usual the
inverse transform can be obtained by changing the sign of $i$ in the exponential
and dividing the sum by $N_1 N_2 \cdots N_n$. The properties of the DFT in $n$ dimensions
are analogous to those for one dimension. In particular, the input data are
effectively treated as periodic, with period equal to the length of data string
along each dimension. The DFT is also effectively periodic in the corresponding
frequencies. If the signal is not bandwidth limited with respect to one or more
dimensions, then the corresponding frequencies are aliased into the limiting
region.
Each of the summations in (10.117) can be carried out using the FFT al-
gorithm described in the previous section. For example, we can perform FFT
with respect to $k_1$ for each of the $N_2 N_3 \cdots N_n$ values of $k_2, \ldots, k_n$ to get
$G^{(1)}_{j_1, k_2, k_3, \ldots, k_n}$. Then we can perform FFT with respect to $k_2$ for each value
of j1 and k 3 , ... , k n . This process can be continued until all variables are ex-
hausted. Hence, we can use the subroutine FFT to calculate the DFT in any
number of dimensions. It may be noted that unlike the case of solution of a
system of nonlinear equations, here the routine is not really required in a re-
cursive manner and one copy of FFT routine in one dimension will be sufficient
for any number of dimensions. However, this process requires a large number
of operations just for copying arrays from the full data set, to a linear array for
the FFT routine. The copying operations can be avoided if proper bookkeeping
is done. Further, some efficiency can also be gained by noting that for each FFT
with respect to one variable, the number of data points is the same, and the
sets are completely independent. Hence, on a parallel processor different sets
can be processed in parallel. For example, the operation for bit reversal can be
performed on each of the sets in one loop. That is if two elements have to be
exchanged, then the corresponding elements for all the sets can be exchanged.
In this section, we consider the generalisation of the algorithm described in
the previous section to higher dimensions. Other FFT algorithms can also be
generalised in a similar manner.
Such an algorithm is implemented in subroutine FFTN in Appendix B.
Since the number of dimensions is variable, the input array is treated as a one-
dimensional array with elements stored in the usual Fortran order. Thus, the
Fortran array CG contains the elements of $g$ as follows

$$\mathrm{CG}(k_1 + k_2 N_1 + k_3 N_1 N_2 + \cdots + k_n N_1 N_2 \cdots N_{n-1} + 1) = g_{k_1, k_2, \ldots, k_n}, \qquad (10.118)$$

for $k_j = 0, \ldots, N_j - 1$. As the FFT operation proceeds, these elements will be


overwritten by the corresponding elements of the DFT. The algorithm consists
of $n$ steps. At the $r$th step, FFT is performed with respect to $k_r$ for each value
of $j_1, \ldots, j_{r-1}$ and $k_{r+1}, \ldots, k_n$, which gives

$$G^{(r)}_{j_1, \ldots, j_r, k_{r+1}, \ldots, k_n} = \sum_{k_r=0}^{N_r-1} G^{(r-1)}_{j_1, \ldots, j_{r-1}, k_r, \ldots, k_n}\, e^{2\pi i j_r k_r/N_r}. \qquad (10.119)$$

For convenience we can introduce

$$\begin{aligned}
J_r &= j_1 + j_2 N_1 + j_3 N_1 N_2 + \cdots + j_{r-1} N_1 N_2 \cdots N_{r-2},\\
K_r &= N_r k_{r+1} + k_{r+2} N_r N_{r+1} + \cdots + k_n N_r N_{r+1} \cdots N_{n-1}.
\end{aligned} \qquad (10.120)$$
Using these definitions, (10.119) can be written in terms of the Fortran array CG as

$$\mathrm{CG}\Bigl(J_r + [j_r + K_r] \prod_{s=1}^{r-1} N_s + 1\Bigr) = \sum_{k_r=0}^{N_r-1} \mathrm{CG}\Bigl(J_r + [k_r + K_r] \prod_{s=1}^{r-1} N_s + 1\Bigr)\, e^{2\pi i j_r k_r/N_r}, \qquad (10.121)$$

where $J_r = 0, 1, \ldots, N_1 N_2 \cdots N_{r-1} - 1$ and $K_r$ takes the values $0, N_r, 2N_r, \ldots, N_r(N_{r+1} \cdots N_n - 1)$.
The summation can be obtained by using the usual FFT algorithm involving
the bit reversal and transformation. Inside the loops for bit reversal and trans-
formation, we can introduce two additional loops over Jr and K r .
It can be easily seen that this algorithm requires of the order of $N \ln N$
arithmetic operations, where $N = N_1 N_2 \cdots N_n$. Although the number of oper-
ations as well as the amount of memory required increases almost linearly with
N, the number N itself will increase very rapidly with the number of dimensions
n. Hence, it may not be possible to use this algorithm for n larger than three or
four. Further, the subroutine FFTN in Appendix B is restricted to cases, where
each of the $N_r$ is a power of 2. We can use zero padding to increase $N_r$ to the
next power of 2, but if that is to be done with each dimension, then there may
be a significant loss in efficiency.
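When a library FFT is available the same idea can be expressed very compactly, since the $n$-dimensional transform is just a one-dimensional FFT applied along each axis in turn. The Python sketch below (numpy assumed; it does not reproduce the in-place bookkeeping of FFTN) uses the sign convention of (10.117):

import numpy as np

def dft_nd(g):
    # n-dimensional DFT built from one-dimensional transforms along each axis.
    G = np.asarray(g, dtype=complex)
    for axis in range(G.ndim):
        G = np.fft.ifft(G, axis=axis) * G.shape[axis]   # one 1-D pass per dimension
    return G

Each pass over axis $r$ costs of the order of $N \ln N_r$ operations, giving the overall $N \ln N$ count quoted above.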

10.8 Inversion of Laplace Transform


Laplace transform is superficially similar to the Fourier transform. The funda-
mental importance of Laplace transform lies in its ability to lower the tran-
scendence level of an equation. Ordinary differential equations can be reduced
to algebraic equations, while partial differential equations are reduced to or-
dinary differential equations. Similarly, some classes of integral equations can
also be reduced to algebraic equations. However, the Laplace transform is not
very often used in practice, because it is usually very difficult to find the in-
verse, which is required to obtain the final solution. Unlike Fourier transform
the Laplace transform cannot be inverted easily and in many cases, the prob-
lem of inversion may even be ill-conditioned. Thus, small perturbations in the
transform can make large changes in the inverse function. In this section, we
describe one method for inversion of Laplace transform, which may be useful
for some problems.
If $f(t)$ is a function defined for $t \ge 0$, then the Laplace transform is defined
by

$$F(s) = \int_0^\infty f(t)\, e^{-st}\, dt. \qquad (10.122)$$

We shall assume that $f(t)$ is piecewise continuous and of exponential order
$\alpha$ (i.e., $|f(t)| \le M e^{\alpha t}$ for some constant $M$), in which case, the transform
function $F(s)$ is defined for $\mathrm{Re}(s) > \alpha$, where $\mathrm{Re}(s)$ denotes the real part of $s$.
We denote the Laplace transform by L. Thus, we can write the above equation
as L(f(t)) = F(s). Some important properties of the Laplace transform are as
follows:

$$\begin{aligned}
L[af(t) + bg(t)] &= aL[f(t)] + bL[g(t)] && \text{(linearity)};\\
L[e^{-at} f(t)] &= F(s + a) && \text{(translation)};\\
L\!\left(\frac{df}{dt}\right) &= sL(f) - f(0);\\
L\!\left(\frac{d^n f}{dt^n}\right) &= s^n L(f) - s^{n-1} f(0) - s^{n-2} f'(0) - \cdots - f^{(n-1)}(0);\\
L\!\left(\int_0^t f(x)\, g(t-x)\, dx\right) &= L(f)\, L(g) && \text{(convolution)}.
\end{aligned} \qquad (10.123)$$
Starting with $F(s)$, the inverse transform $f(t)$ is given by the inversion
formula

$$f(t) = \frac{1}{2\pi i} \int_{a - i\infty}^{a + i\infty} e^{st} F(s)\, ds = \frac{e^{at}}{\pi} \int_0^\infty \bigl( \mathrm{Re}(F(a + i\omega)) \cos\omega t - \mathrm{Im}(F(a + i\omega)) \sin\omega t \bigr)\, d\omega, \qquad (10.124)$$

where $a$ is a real number greater than $\alpha$. Evaluating the integral using the
trapezoidal rule, we get the inversion formula for $\tilde f(t)$,

$$\tilde f(t) = \frac{e^{at}}{T} \left( \frac{1}{2} F(a) + \sum_{k=1}^{\infty} \left[ \mathrm{Re}\!\left(F\!\left(a + \frac{k\pi i}{T}\right)\right) \cos\!\left(\frac{k\pi t}{T}\right) - \mathrm{Im}\!\left(F\!\left(a + \frac{k\pi i}{T}\right)\right) \sin\!\left(\frac{k\pi t}{T}\right) \right] \right), \qquad (10.125)$$
where $f(t) = \tilde f(t) - E(t)$ and the truncation error $E(t)$ can be expressed as
(Crump, 1976)

$$E(t) = e^{at} \sum_{k=1}^{\infty} \exp\bigl(-a(2kT + t)\bigr)\, f(2kT + t) = \sum_{k=1}^{\infty} e^{-2kaT} f(2kT + t). \qquad (10.126)$$

This expression for the error can be derived by considering a Fourier series
representation for $f(t)\, e^{-at}$ in the intervals $[2nT, 2(n+1)T]$, $(n = 0, 1, 2, \ldots)$, which
span the required region. Here the parameter $a$ can be chosen such that the
error is within acceptable limits for 0 < t < 2T. It should be noted that,
there are two sources of truncation error in this approximation. First is the
error E(t) due to the approximation to the integral given by (10.126) and the
second source is due to the fact that in practice the summation in (10.125)
is necessarily truncated at some finite value of k. Using our assumption that
$|f(t)| \le M e^{\alpha t}$, we can sum (10.126) to get

$$E(t) \le \frac{M e^{\alpha t}\, e^{-2T(a - \alpha)}}{1 - e^{-2T(a - \alpha)}}, \qquad 0 < t < 2T. \qquad (10.127)$$

The case of $t = 0$ needs to be examined separately, and it can be shown that

$$\tilde f(0) - \tfrac{1}{2} f(0) = E(0) + \tfrac{1}{2} f(2T)\, e^{-2aT} \le \tfrac{3}{2} M e^{-2T(a - \alpha)}. \qquad (10.128)$$

Thus, the error bound at $t = 0$ is one and a half times the error bound at other
points and, importantly, the approximation $\tilde f(t)$ converges to $f(0)/2$, rather than
$f(0)$, at $t = 0$.
The error bound (10.127) provides a simple criterion to choose the pa-
rameters $a$ and $T$ required to compute $f(t)$. Suppose we want to compute $f(t)$
over a range of $t$ values, such that the largest is $t_{\max}$, and the required relative
accuracy is $\epsilon$. Then we can choose $T$ such that $2T > t_{\max}$ and, using (10.127),
$a$ is given by

$$a = \alpha - \ln(\epsilon)/(2T). \qquad (10.129)$$
The series in (10.125) is then summed until it has converged to the required
accuracy. The summation usually converges slowly, but the convergence can be
accelerated by using a suitable acceleration technique. We can use Euler's
transformation or the $\epsilon$-algorithm described in Chapter 6 for this purpose.
Crump (1976) finds the $\epsilon$-algorithm to be more efficient in all cases that he
considered. The $\epsilon$-algorithm can be applied to the successive partial sums of the
series to improve the convergence. It turns out that in most cases, for continuous
functions, a reasonable accuracy can be achieved using less than 100 terms of the
series. For functions with discontinuity, the convergence is slow and the series
will converge to the mean value of the solution at that point. Convergence
can be accelerated by removing the discontinuity {38}. Similarly, as pointed
out earlier, the series converges to $f(0)/2$ at $t = 0$, so that a discontinuity is
introduced artificially at $t = 0$ when $f(0) \neq 0$. If $f(0)$ is known beforehand,
this discontinuity can be removed by considering the function $F^*(s) = F(s) -
f(0)/s$, which gives $f(t) = f^*(t) + f(0)$. This transformation improves the
convergence for small values of t to some extent.
This algorithm is implemented in the subroutine LAPINV. This subrou-
tine uses $T = t_{\max}$ and calculates $a$ using (10.129), which requires the order $\alpha$
to be specified. The order $\alpha$ can be estimated as the maximum of the real part
of the poles of the function $F(s)$, which may be known if an analytic expression
for $F(s)$ is known. In general, the order may not be known and different values
may be tried to see how the results are converging. In that case, it may be best
to start with $\alpha = 0$. The $\epsilon$-algorithm is used to accelerate the convergence of
the series, and the result is accepted if the difference between two successive
approximations is less than 0.01 times the required convergence criterion. If
the series is converging rapidly, then the result will be correct to the required
accuracy, but if the convergence is slow the result may not be accurate to the
specified accuracy, even if the subroutine does not give any error indication. For small values of t, the convergence may be improved by removing the discontinuity or by making a second attempt with a smaller value of $t_{\max}$.
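As an illustration of the procedure described above, the following Python sketch sums the series (10.125), with a fixed by (10.129), and accelerates the partial sums with Wynn's ε-algorithm. It is not a translation of the Fortran subroutine LAPINV: the function names, the fixed number of terms and the absence of safeguards against roundoff in the deeper columns of the ε-table are simplifications made here for illustration only.

```python
import math

def wynn_epsilon(S):
    """Wynn's epsilon-algorithm applied to the partial sums S[0..n].
    Returns an entry of the deepest even column of the epsilon table,
    usually the best accelerated estimate of the limit of the series."""
    prev = [0.0] * (len(S) + 1)     # column epsilon_{-1} (all zeros)
    curr = list(S)                  # column epsilon_0 (the partial sums)
    best, col = S[-1], 0
    while len(curr) > 1:
        nxt = []
        for j in range(len(curr) - 1):
            diff = curr[j + 1] - curr[j]
            if diff == 0.0:         # exact convergence; avoid division by zero
                return curr[j + 1]
            nxt.append(prev[j + 1] + 1.0 / diff)
        prev, curr, col = curr, nxt, col + 1
        if col % 2 == 0:            # even columns approximate the limit
            best = curr[-1]
    return best

def laplace_invert(F, t, tmax, alpha=0.0, eps=1.0e-6, nterms=50):
    """Approximate f(t) from its Laplace transform F(s) using the
    trapezoidal-rule formula (10.125) with T = tmax and a from (10.129);
    the slowly convergent series is accelerated by Wynn's epsilon-algorithm."""
    T = tmax
    a = alpha - math.log(eps) / (2.0 * T)      # equation (10.129)
    total = 0.5 * F(complex(a, 0.0)).real       # the F(a)/2 term
    partial = []
    for k in range(1, nterms + 1):
        w = k * math.pi / T
        Fk = F(complex(a, w))
        total += Fk.real * math.cos(w * t) - Fk.imag * math.sin(w * t)
        partial.append(total)
    return math.exp(a * t) / T * wynn_epsilon(partial)

# Example 10.7: F(s) = 2/s - 1/(s+1) has the exact inverse f(t) = 2 - exp(-t)
F = lambda s: 2.0 / s - 1.0 / (s + 1.0)
print(laplace_invert(F, 1.0, tmax=5.0), 2.0 - math.exp(-1.0))
```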
EXAMPLE 10.7: Find the inverse Laplace transform of F(s) = 2/s - 1/(s + 1).
Using the subroutine LAPINV we can evaluate the required function at the specified
set of points. In this case, the inverse can be calculated analytically and we can estimate the exponential order $\alpha$ required by the subroutine. Alternately, we can see that F(s) has poles at s = 0 and s = −1. Hence, the maximum real part of the poles is zero, which indicates that $\alpha = 0$. This value is also confirmed by the exact inverse $f(t) = 2 - e^{-t}$. The results obtained using $\alpha = 0$ and $\epsilon = 10^{-4}$ and $10^{-6}$ are shown in Table 10.7. This table also gives
N, the number of terms used in the summation for each value of t, as well as the value of
the sum. It can be seen that at t = 0 the algorithm yields an approximation to f(0)/2 and
the convergence is rather slow. As a result, the subroutine gives spurious convergence to a
value which is far from the correct value. When the accuracy requirement is raised to $10^{-6}$, the convergence criterion is not satisfied, as can be seen from the fact that N = 51, which was
the maximum number of function evaluations permitted by the subroutine. The convergence
appears to be even slower when the accuracy requirement is raised. The convergence is also
slow for small values of t, but at higher values of t the convergence appears to be fast. It
can be seen that for $\epsilon = 10^{-6}$, f(5) is not correct to the specified accuracy, even though
the series has apparently converged to the required accuracy. This spurious convergence is
because of roundoff error in evaluating the sum. It can be seen that the first term in the
sum F(a)/2 is of the order of unity, but the sum in this case is $\approx 0.007$. Hence, at least two
decimal digits have been lost during the summation. Since the calculations were done using
a 24-bit arithmetic, we cannot expect the sum to have an absolute accuracy of better than $\hbar F(a)/2 \approx 10^{-7}$, giving a relative error of $\approx 10^{-5}$, which is consistent with the actual error of the order of $10^{-4}$, considering the fact that more roundoff errors will be introduced during summation and application of the ε-algorithm. Since the constant a in the inversion formula increases with accuracy, the factor $e^{at}$ increases and the value of the sum reduces (to keep f(t) constant), giving higher roundoff error.
As mentioned earlier, convergence of the sum near t = 0 can be improved by removing
the jump. Using f(0) = 1, we consider the function $F^*(s) = F(s) - 1/s$, which gives $f^*(0) = 0$. Using this function the required inverse $f(t) = f^*(t) + 1$ can be easily calculated, and the results using $\epsilon = 10^{-6}$ are also shown in Table 10.7. It can be seen that the results for smaller
values of t are much better, though the convergence is still quite slow.

Table 10.7: Numerical inversion of Laplace transform

             ε = 10⁻⁴                   ε = 10⁻⁶                   Removing jump at t = 0
  t      N   Sum      f(t)         N   Sum        f(t)        N   Sum        f(t)        Exact f(t)
  0.0   18   2.5009   0.5002      51   2.490883   0.498177   51   0.014317   1.002863    1.000000
  0.2   51   4.6024   1.1221      51   4.448482   1.189221   51   0.678807   1.181467    1.181269
  0.4   51   4.3494   1.2927      51   3.723129   1.330391   51   0.922288   1.329562    1.329680
  0.6   49   4.0055   1.4513      51   3.038325   1.451196   51   0.944651   1.451194    1.451188
  0.8   34   3.5095   1.5501      50   2.428922   1.550694   51   0.862543   1.550672    1.550671
  1.0   41   2.9702   1.5993      44   1.912576   1.632118   51   0.740742   1.632121    1.632121
  2.0   27   1.2866   1.8651      34   0.512112   1.864666   29   0.237471   1.864664    1.864665
  3.0   35   0.4998   1.9503      23   0.125529   1.950216   26   0.061162   1.950213    1.950213
  4.0   16   0.1886   1.9818      25   0.029895   1.981683   23   0.014809   1.981686    1.981684
  5.0   13   0.0705   1.9934      28   0.007047   1.993170   21   0.003512   1.993252    1.993262

10.9 Pade Approximations


In this section, we consider the problem of approximating mathematical functions by a rational function. Ideally, we would like to use minimax approximations for this purpose, but constructing such approximations is somewhat difficult. Hence, we first consider a simpler approximation which can be easily generated and is useful in many cases. Here we seek an approximation of the form $R_{mk} = P_m(x)/Q_k(x)$, where $P_m(x)$ and $Q_k(x)$ are polynomials of degree m and k, respectively. Empirically it is found that for a given value of n = m + k, such approximations are most efficient when m = k or m = k + 1. Hence, this is the case most often considered. In Pade approximations, the basic idea is to choose the approximating function such that the function value and as many derivatives as possible agree with the given function at x = 0. Hence, this approximation is an extreme case of interpolation, where only one point is used, but higher order accuracy is achieved by matching higher order derivatives. Here it is implicitly assumed that the Maclaurin series for the function f(x) exists in some neighbourhood of x = 0. The choice of the point x = 0 is arbitrary and is made to simplify the algebraic manipulations. By a simple change of variable, we can construct such approximations about any other point.
We assume that $P_m(x)$ and $Q_k(x)$ have no common factors. Now let

$P_m(x) = \displaystyle\sum_{j=0}^{m} a_j x^j, \qquad Q_k(x) = \displaystyle\sum_{j=0}^{k} b_j x^j, \qquad b_0 = 1.$    (10.130)

If $b_0 \ne 0$, then it is possible to set $b_0 = 1$, because the numerator and denominator can be divided by a constant without changing the value of the rational function. If $b_0 = 0$, the rational function does not exist at x = 0, unless $a_0 = 0$, in which case there is a common factor between $P_m(x)$ and $Q_k(x)$. Hence, in subsequent discussion we assume $b_0 = 1$. Now let f(x) have a Maclaurin series

$f(x) = \displaystyle\sum_{j=0}^{\infty} c_j x^j.$    (10.131)

Then consider the difference

$f(x) - R_{mk}(x) = \dfrac{\left(\sum_{j=0}^{\infty} c_j x^j\right) Q_k(x) - P_m(x)}{Q_k(x)}.$    (10.132)

Since we have n + 1 constants at our disposal, we can hope to cancel the first n + 1 terms (i.e., the coefficients of $x^i$ for i = 0, 1, ..., n) in the numerator, which is equivalent to ensuring that the first n derivatives of the two functions are equal at x = 0. Equating the first n + 1 coefficients in the numerator of

(10.132) to zero, we get

$\displaystyle\sum_{j=0}^{i} b_j c_{i-j} = a_i, \quad i = 0, 1, \ldots, m, \qquad (b_j = 0 \ \text{if}\ j > k),$
$\displaystyle\sum_{j=0}^{k} b_j c_{i-j} = 0, \quad i = m+1, \ldots, n, \qquad (c_r = 0 \ \text{if}\ r < 0).$    (10.133)

This is a system of n + 1 linear equations in n + 1 unknowns, which can be solved to get the required coefficients, provided the matrix is nonsingular. First the second set of equations can be solved for $b_j$, j = 1, ..., k, and then the first set of equations gives the $a_i$. The approximation need not exist for all functions {42}. It is possible to obtain recurrence relations giving the polynomials $P_m(x)$ and $Q_k(x)$ in terms of those for lower values of n.
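As a concrete illustration, the linear system (10.133) can also be solved directly. The following NumPy sketch (pade_coefficients is an illustrative name, not a routine from the book) first solves the second set of equations for the $b_j$ and then evaluates the first set to obtain the $a_i$.

```python
import numpy as np

def pade_coefficients(c, m, k):
    """Solve the linear system (10.133) for the Pade coefficients.
    c contains the Maclaurin coefficients c_0, ..., c_{m+k}; returns (a, b)
    with b_0 = 1, so that f(x) ~ sum(a_j x^j) / sum(b_j x^j)."""
    n = m + k
    c = np.asarray(c, dtype=float)
    if len(c) < n + 1:
        raise ValueError("need the first m + k + 1 Maclaurin coefficients")
    # second set: sum_j b_j c_{i-j} = 0 for i = m+1..n, with the b_0 term on the right
    A = np.zeros((k, k))
    rhs = np.zeros(k)
    for row, i in enumerate(range(m + 1, n + 1)):
        rhs[row] = -c[i]
        for j in range(1, k + 1):
            A[row, j - 1] = c[i - j] if i - j >= 0 else 0.0
    b = np.concatenate(([1.0], np.linalg.solve(A, rhs))) if k > 0 else np.array([1.0])
    # first set gives the numerator coefficients directly
    a = np.array([sum(b[j] * c[i - j] for j in range(0, min(i, k) + 1))
                  for i in range(m + 1)])
    return a, b

# tan^{-1}(x)/x = 1 - y/3 + y^2/5 - ... in the variable y = x^2; the [2/2]
# approximation in y corresponds to R_{5,4} of Example 10.8 after multiplying by x.
a, b = pade_coefficients([1.0, -1.0/3.0, 1.0/5.0, -1.0/7.0, 1.0/9.0], 2, 2)
print(a, b)
```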
The above derivation of the Pade approximation does not provide us with
an error term. However, in this section we are mainly interested in minimax approximations of mathematical functions, and the maximum error can be estimated by actually comparing the approximate value with the exact value of the function. If necessary, a rough estimate of the truncation error can be obtained by assuming that the error is dominated by the first nonvanishing term in the numerator of (10.132), which gives an error estimate of the form $d_{n+1}x^{n+1}$ (the denominator tends to 1 as $x \to 0$), where the coefficient $d_{n+1}$ is
given by
$d_{n+1} = \displaystyle\sum_{j=0}^{k} b_j c_{n+1-j}.$    (10.134)

Hence, the error is small for small x and generally increases with x, which is to be expected, since the approximation has been obtained by matching the value of the function and its derivatives at x = 0. Hence, the accuracy of the approximation deteriorates as |x| increases. Thus, in practice, if x = 0 is not the centre of the interval over which the approximation is required, then we should perform a change of variable to bring the centre of the required interval to x = 0.
Since one of the most important requirements of approximations to math-
ematical functions is efficiency, we shall consider the evaluation of rational
functions. The simplest procedure is to evaluate the polynomials $P_m(x)$ and $Q_k(x)$ using nested multiplication or Horner's method (Section 4.1) and then evaluate the rational function approximation $R_{mk}$. This technique
requires n = m + k multiplications, n additions and 1 division to evaluate the
rational function. If some preprocessing is done, then it is possible to reduce
the number of multiplications required to evaluate the polynomials (Ralston
and Rabinowitz, 2001). Alternately, we can convert the rational function to a continued fraction, which can be achieved by successive divisions and reciprocation. For example,

$\dfrac{2x^3 + x^2 + 2x + 3}{x^2 + x + 1} = 2x - 1 + \dfrac{x+4}{x^2+x+1} = 2x - 1 + \cfrac{1}{\cfrac{x^2+x+1}{x+4}} = 2x - 1 + \cfrac{1}{x - 3 + \cfrac{13}{x+4}}.$    (10.135)
For the case of m = k or m = k + 1, the continued fraction can be written in the form

$R_{mk}(x) = C_0 x + D_0 + \cfrac{C_1}{x + D_1 + \cfrac{C_2}{x + D_2 + \cfrac{C_3}{x + D_3 + \cdots}}}$    (10.136)

where $C_0 = 0$ when m = k. To evaluate the continued fraction we can use the recurrence

$d_j = \dfrac{C_j}{x + D_j + d_{j+1}}, \qquad (j = k, k-1, \ldots, 1), \qquad d_{k+1} = 0,$    (10.137)

to evaluate $R_{mk}(x) = C_0 x + D_0 + d_1$. This process requires k divisions, 1 multi-
to evaluate Rmk (x) = Cox + Do + d1 . This process requires k divisions, 1 multi-
plication (if m = k + 1) and 2k + 1 additions. Noting that k = n/2 or (n - 1) /2
the number of additions required is the same as before, but instead of n mul-
tiplications we now require approximately n/2 divisions. If multiplication and
division require the same amount of time on a computer, then the evaluation
of continued fraction will be more efficient than the direct evaluation of the ra-
tional function. However, if division requires more than twice the time required
for multiplication, then the direct evaluation of rational function will be more
efficient.
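The backward recurrence (10.137) can be sketched in a few lines of Python; the coefficient lists C and D are assumed to be given, and the example values below are read off from (10.135).

```python
def eval_continued_fraction(C, D, x):
    """Evaluate R_mk(x) = C_0*x + D_0 + C_1/(x + D_1 + C_2/(x + D_2 + ...))
    by the backward recurrence (10.137); C_0 = 0 corresponds to m = k."""
    d = 0.0                                               # d_{k+1} = 0
    for Cj, Dj in zip(reversed(C[1:]), reversed(D[1:])):  # j = k, k-1, ..., 1
        d = Cj / (x + Dj + d)
    return C[0] * x + D[0] + d

# (10.135): (2x^3 + x^2 + 2x + 3)/(x^2 + x + 1) = 2x - 1 + 1/(x - 3 + 13/(x + 4)),
# i.e. C = [2, 1, 13], D = [-1, -3, 4]; at x = 1 the value should be 8/3.
print(eval_continued_fraction([2.0, 1.0, 13.0], [-1.0, -3.0, 4.0], 1.0))
```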
Just like a power series, a continued fraction can also have an infinite number of terms, and the result may converge for some range of x. However, for a continued fraction it is not straightforward to keep adding one term at a time and check for convergence, since the continued fraction is normally evaluated from the bottom upwards. If we are not sure about how many terms are required for the specified accuracy, then we can convert the continued fraction to a rational function for evaluation. Thus, we can write $R_{mk} = A_k/B_k$, where $A_k$ and $B_k$ are computed using the following recurrence

$A_j = (x + D_j)A_{j-1} + C_j A_{j-2}, \qquad B_j = (x + D_j)B_{j-1} + C_j B_{j-2}, \qquad (j = 1, 2, \ldots, k);$
$A_0 = C_0 x + D_0, \qquad A_{-1} = 1, \qquad B_0 = 1, \qquad B_{-1} = 0.$    (10.138)

This recurrence can be easily proved by induction {44}. In practice, this recurrence frequently leads to overflow or underflow, since the numbers $A_j$ and $B_j$ tend to be rather large or small. Since the recurrence is linear in the $A_j$'s and $B_j$'s, we can rescale $A_j$, $B_j$, $A_{j-1}$ and $B_{j-1}$ by dividing all of them by $B_j$. This scaling will not change the value of the rational function.
It can be seen that this process requires approximately 2n multiplications, 2n additions and one division to compute $R_{mk}$, which is much larger than those required by other methods. Hence, this technique should be used only if we do not know the number of terms beforehand. If the value of k is known, it will be more efficient to use the recurrence (10.137), or the direct evaluation of the rational function approximation.
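A sketch of the forward recurrence (10.138) with the rescaling described above might look as follows (illustrative only; it assumes no convergent has a vanishing denominator):

```python
def eval_convergents(C, D, x):
    """Evaluate the successive convergents A_j/B_j of the continued fraction
    (10.136) by the forward recurrence (10.138), rescaling by B_j at every
    step to avoid overflow or underflow.  Returns the list of convergents so
    that successive values can be compared to judge convergence."""
    A_prev, B_prev = 1.0, 0.0           # A_{-1}, B_{-1}
    A, B = C[0] * x + D[0], 1.0         # A_0, B_0
    values = [A / B]
    for j in range(1, len(C)):
        A, A_prev = (x + D[j]) * A + C[j] * A_prev, A
        B, B_prev = (x + D[j]) * B + C[j] * B_prev, B
        values.append(A / B)
        # rescale the whole pair by B_j; the ratio A/B is unchanged
        A, A_prev, B, B_prev = A / B, A_prev / B, 1.0, B_prev / B
    return values

print(eval_convergents([2.0, 1.0, 13.0], [-1.0, -3.0, 4.0], 1.0))  # -> [1.0, 0.5, 8/3]
```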
It turns out that the Pade approximations with m = k or m = k + 1 can be obtained by truncating the known continued fraction expansion of the function at an appropriate stage. The coefficients of the continued fraction expansion can be calculated from the known power series coefficients by using the quotient-difference (QD) algorithm due to Rutishauser. The QD algorithm is based on the following table

$\begin{array}{cccc}
q_1^{(0)} & & & \\
e_0^{(1)} & e_1^{(0)} & & \\
q_1^{(1)} & q_2^{(0)} & & \\
e_0^{(2)} & e_1^{(1)} & e_2^{(0)} & \\
q_1^{(2)} & q_2^{(1)} & q_3^{(0)} & \\
e_0^{(3)} & e_1^{(2)} & e_2^{(1)} & e_3^{(0)} \\
q_1^{(3)} & q_2^{(2)} & q_3^{(1)} & q_4^{(0)}
\end{array}$

where the entries are related by the following relations

$e_k^{(n)} = q_k^{(n+1)} + e_{k-1}^{(n+1)} - q_k^{(n)}, \qquad e_0^{(n)} = 0,$
$q_{k+1}^{(n)} = \dfrac{q_k^{(n+1)}\, e_k^{(n+1)}}{e_k^{(n)}}.$    (10.139)

The equations in this form can be used to fill the table from the second column (in the first column all elements are zero). This process may be numerically unstable in some cases. Alternately, we can rewrite the equations to fill the table starting from the top diagonal.
If we have the power series

$f(x) = \displaystyle\sum_{i=0}^{\infty} c_i x^i,$    (10.140)

and compute the second column of our QD table from the relations

$q_1^{(n)} = \dfrac{c_{n+1}}{c_n}, \qquad (n = 0, 1, 2, \ldots),$    (10.141)
then the complete QD table can be generated using (10.139). The elements along the main diagonal give the coefficients of the corresponding continued fraction, which is given by

$f(x) = \cfrac{c_0}{1 + \cfrac{-q_1^{(0)}x}{1 + \cfrac{-e_1^{(0)}x}{1 + \cfrac{-q_2^{(0)}x}{1 + \cfrac{-e_2^{(0)}x}{1 + \cdots}}}}}$    (10.142)

This continued fraction can be truncated at an appropriate stage and converted to an equivalent rational function, or to a continued fraction of the form (10.136). This procedure can be reversed to obtain the coefficients of the power series from those of the continued fraction.
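A minimal Python sketch of the QD algorithm, filling the table column by column exactly as in (10.139) and collecting the diagonal entries required in (10.142), is shown below. It takes no precautions against the numerical instability mentioned above (or against zero divisors), and the function name is chosen here for illustration only.

```python
def qd_continued_fraction(c):
    """Quotient-difference (QD) algorithm: from the power series coefficients
    c[0], c[1], ... build as many coefficients q_1^(0), e_1^(0), q_2^(0),
    e_2^(0), ... of the continued fraction (10.142) as the data allow."""
    q = [c[n + 1] / c[n] for n in range(len(c) - 1)]   # second column, eq. (10.141)
    e = [0.0] * len(q)                                  # first column, e_0^(n) = 0
    diag = [q[0]]                                       # q_1^(0)
    while len(q) >= 2:
        # e_k^(n) = q_k^(n+1) + e_{k-1}^(n+1) - q_k^(n)
        e = [q[n + 1] + e[n + 1] - q[n] for n in range(len(q) - 1)]
        diag.append(e[0])
        if len(e) < 2:
            break
        # q_{k+1}^(n) = q_k^(n+1) * e_k^(n+1) / e_k^(n)
        q = [q[n + 1] * e[n + 1] / e[n] for n in range(len(e) - 1)]
        diag.append(q[0])
    return diag

# first few coefficients for 1 - y/3 + y^2/5 - ..., i.e. arctan(x)/x with y = x^2
print(qd_continued_fraction([1.0, -1.0/3.0, 1.0/5.0, -1.0/7.0, 1.0/9.0]))
```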
EXAMPLE 10.8: Find Pade approximations to $\tan^{-1}x$ with n = m + k = 9.
Since $\tan^{-1}x$ is an odd function, we can look for an approximation to $\tan^{-1}x/x$, which involves only even powers of x. Using the well-known Maclaurin series

$\dfrac{\tan^{-1}x}{x} = 1 - \tfrac{1}{3}x^2 + \tfrac{1}{5}x^4 - \tfrac{1}{7}x^6 + \tfrac{1}{9}x^8,$    (10.143)
we can calculate the Pade approximations $R_{9,0}(x)$, $R_{7,2}(x)$, $R_{5,4}(x)$, $R_{3,6}(x)$ and $R_{1,8}(x)$:

$R_{9,0}(x) = 1 - \tfrac{1}{3}x^2 + \tfrac{1}{5}x^4 - \tfrac{1}{7}x^6 + \tfrac{1}{9}x^8,$
$R_{7,2}(x) = \dfrac{1 + \tfrac{4}{9}x^2 - \tfrac{8}{135}x^4 + \tfrac{4}{315}x^6}{1 + \tfrac{7}{9}x^2},$    (10.144)
$R_{5,4}(x) = \dfrac{1 + \tfrac{7}{9}x^2 + \tfrac{64}{945}x^4}{1 + \tfrac{10}{9}x^2 + \tfrac{5}{21}x^4}.$

Of course, all these approximations should be multiplied by x to get the corresponding approximation to $\tan^{-1}x$. Here we have stuck to our convention of setting $b_0 = 1$, but in practice it is better to normalise the coefficients in the numerator and the denominator such that the coefficient of the highest power of x in either of them is one, since that saves one multiplication while evaluating the function. The rational function can also be expressed as a continued fraction to improve the efficiency while evaluating the approximation. It may be noted that $R_{9,0}(x)$ is just the truncated Maclaurin series, while $R_{5,4}(x)$ is algebraically equivalent to the well-known continued fraction expansion truncated after four stages,

$\tan^{-1}x = \cfrac{x}{1 + \cfrac{x^2}{3 + \cfrac{4x^2}{5 + \cfrac{9x^2}{7 + \cfrac{16x^2}{9 + \cdots}}}}}.$    (10.145)

[Figure 10.6: Pade approximations to $\tan^{-1}x$ — the approximations $R_{9,0}(x)$, $R_{7,2}(x)$, $R_{5,4}(x)$, $R_{3,6}(x)$, $R_{1,8}(x)$ and the logarithm of their errors, plotted for $0 \le x \le 2$.]

It is clear that as $x \to \infty$, $R_{9,0}(x)$ and $R_{7,2}(x)$ tend to infinity, $R_{3,6}(x)$ and $R_{1,8}(x)$ tend to zero, and $R_{5,4}(x)$ tends to 64/225, while the actual function tends to zero. Each of these approximations to $\tan^{-1}x$ and the truncation error in them are shown in Figure 10.6. It can be seen that the error in $R_{9,0}(x)$ diverges very fast, which can be expected from the fact that it is just the truncated Maclaurin series, which converges only in the interval [−1, 1]. On the other hand, $R_{1,8}$ has a pole around x = 1.7, beyond which it is negative. This gives rise to a cusp-like feature in the error curve. The other approximations remain fairly close even for 1 < x < 2, even though all these approximations were derived from the Maclaurin series, which diverges for |x| > 1. Thus, the region of convergence of Pade approximations does not necessarily coincide with that of the Maclaurin series from which these approximations are obtained. In the range (0, 2) shown in the figure, the approximation $R_{5,4}(x)$ turns out to be the best.

It can be seen that for all the approximations in Example 10.8, the error grows rapidly as we approach |x| = 1. To improve the approximation, there are three alternatives: (1) add more terms in the approximation, (2) break the required interval into an appropriate number of subintervals, (3) use some analytic property of the function. Increasing the number of terms will require more effort to calculate the approximation, while increasing the number of subintervals will require more memory to store the coefficients, and in addition more time will be spent in finding the proper subinterval to use, unless the subintervals are chosen very carefully. In practice, it is difficult to give any general rule for choosing the optimum value of the number of terms n and the number of subintervals. The optimum choice can be obtained only after some experimentation. The third alternative must be used whenever possible; for example, we can use $\tan^{-1}x = \pi/2 - \tan^{-1}(1/x)$ to extend the range beyond [−1, 1]. With Pade approximations, the error increases rapidly as we move away from the centre of the interval, which is not the case with Chebyshev or minimax approximations. Hence, the latter are more useful for approximating mathematical functions on digital computers. Nevertheless, Pade approximations may serve as a starting point from which to develop better approximations.

10.10 Chebyshev Expansions


We do not expect Pade approximations to be the best in the minimax sense, since these approximations are derived by matching the function value and its derivatives at one point. Hence, as we move away from this point the error increases rapidly. For a global approximation over the given interval, we would like the error to be not monotonic, but oscillatory with nearly constant amplitude. This may be the case if the function is expanded in terms of Chebyshev polynomials or trigonometric functions rather than powers of x. As we have seen in Section 4.2, Chebyshev polynomials have the minimax property in the sense that, of all polynomials of degree n with leading coefficient unity, the Chebyshev polynomial has the least maximum value on the interval [−1, 1]. Thus, Chebyshev polynomials are very useful in minimax approximations and, in fact, such approximations are often referred to as Chebyshev approximations. But we will avoid that nomenclature, since it can be confused with an approximation based on Chebyshev expansions.
Before considering Chebyshev expansions, we shall list some important properties of the Chebyshev polynomials $T_n(x)$. These polynomials satisfy the recurrence relation

$T_{n+1}(x) = 2xT_n(x) - T_{n-1}(x), \qquad T_0(x) = 1, \qquad T_1(x) = x.$    (10.146)

This recurrence relation can be used to generate the coefficients of the higher order polynomials, as well as to express the powers of x in terms of the Chebyshev polynomials. Some of the Chebyshev polynomials and the expansions for powers of x in terms of $T_n(x)$ are as follows:

$T_2(x) = 2x^2 - 1,$                      $x^2 = \tfrac{1}{2}(T_0(x) + T_2(x)),$
$T_3(x) = 4x^3 - 3x,$                     $x^3 = \tfrac{1}{4}(3T_1(x) + T_3(x)),$
$T_4(x) = 8x^4 - 8x^2 + 1,$               $x^4 = \tfrac{1}{8}(3T_0(x) + 4T_2(x) + T_4(x)),$
$T_5(x) = 16x^5 - 20x^3 + 5x,$            $x^5 = \tfrac{1}{16}(10T_1(x) + 5T_3(x) + T_5(x)),$
$T_6(x) = 32x^6 - 48x^4 + 18x^2 - 1,$
$T_7(x) = 64x^7 - 112x^5 + 56x^3 - 7x.$    (10.147)
More extensive tables of Chebyshev polynomials are available in Abramowitz
and Stegun (1974), or can be generated using subroutine CHEBCF in Ap-
pendix B.
To convert a polynomial expressed in terms of powers of x to that in terms
of the Chebyshev polynomials, we can use the recurrence relation. Consider the
polynomial
$P_n(x) \equiv \displaystyle\sum_{i=0}^{n} a_i x^i \equiv \displaystyle\sum_{i=0}^{n} b_i T_i(x),$    (10.148)

for which the coefficients ai are known and we wish to find the coefficients bi
for the Chebyshev expansion. This polynomial can be evaluated using the usual

nested multiplication algorithm

$p_j = xp_{j+1} + a_j, \qquad (j = n, n-1, \ldots, 0), \qquad p_{n+1} = 0;$    (10.149)

with $p_0 = P_n(x)$. In this algorithm, $p_j$ is a polynomial of degree $n - j$, which can be expressed in terms of the Chebyshev polynomials

$p_j = \displaystyle\sum_{i=0}^{n-j} b_i^{(j)} T_i(x).$    (10.150)

Substituting this expression in (10.149) and using the recurrence relation for $T_n(x)$, we get

$\displaystyle\sum_{i=0}^{n-j} b_i^{(j)} T_i(x) = x\sum_{i=0}^{n-j-1} b_i^{(j+1)} T_i(x) + a_j = a_j + b_0^{(j+1)} T_1(x) + \sum_{i=1}^{n-j-1} b_i^{(j+1)}\,\tfrac{1}{2}\left[T_{i+1}(x) + T_{i-1}(x)\right].$    (10.151)
Equating the coefficients of $T_i(x)$ on both sides, we get

$b_0^{(j)} = a_j + \tfrac{1}{2}b_1^{(j+1)}, \qquad b_1^{(j)} = b_0^{(j+1)} + \tfrac{1}{2}b_2^{(j+1)},$
$b_i^{(j)} = \tfrac{1}{2}\left(b_{i-1}^{(j+1)} + b_{i+1}^{(j+1)}\right), \qquad (i = 2, 3, \ldots, n-j).$    (10.152)

Starting with $b_0^{(n)} = a_n$, $b_i^{(n)} = 0$, $(i = 1, 2, \ldots, n)$, we can calculate the required coefficients $b_i = b_i^{(0)}$ using the above relations. This process is implemented in the subroutine CHEBCF.
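The conversion (10.149)–(10.152) is easy to sketch in Python; the following function is an illustrative analogue of what CHEBCF does in Fortran (the array names and the zero padding are choices made here, not taken from the book).

```python
def power_to_chebyshev(a):
    """Convert P(x) = sum_i a[i]*x**i into sum_i b[i]*T_i(x) using the nested
    multiplication recurrences (10.149)-(10.152)."""
    n = len(a) - 1
    b = [0.0] * (n + 3)            # b^{(n)}; padded so out-of-range terms are zero
    b[0] = a[n]
    for j in range(n - 1, -1, -1):  # p_j = x*p_{j+1} + a_j
        new = [0.0] * (n + 3)
        new[0] = a[j] + 0.5 * b[1]
        new[1] = b[0] + 0.5 * b[2]
        for i in range(2, n - j + 1):
            new[i] = 0.5 * (b[i - 1] + b[i + 1])
        b = new
    return b[: n + 1]               # coefficients b_0, ..., b_n

# x^3 = (3 T_1 + T_3)/4, cf. (10.147)
print(power_to_chebyshev([0.0, 0.0, 0.0, 1.0]))   # -> [0.0, 0.75, 0.0, 0.25]
```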
If the Chebyshev expansion is known and we want to find the coefficients of $x^i$ in the equivalent polynomial, then this procedure can be reversed to calculate the $a_j$. Alternately, we can use Clenshaw's recurrence relations (Section 10.2)

$q_k = b_k + 2xq_{k+1} - q_{k+2}, \qquad (k = n, n-1, \ldots, 1), \qquad q_{n+1} = q_{n+2} = 0;$
$q_0 = b_0 + xq_1 - q_2;$    (10.153)
with $q_0$ giving the required value. This recurrence can be used to evaluate the corresponding polynomial at a given value of x, or in algebraic form to find the coefficients of the required polynomial. Since each step of the recurrence involves the two previous values, additional memory space will be required to store the intermediate quantities.
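For example, a Chebyshev sum can be evaluated with Clenshaw's recurrence (10.153) as follows (a minimal sketch):

```python
def clenshaw_eval(b, x):
    """Evaluate sum_{k=0}^{n} b[k]*T_k(x) using Clenshaw's recurrence (10.153)."""
    q1 = q2 = 0.0                      # q_{k+1} and q_{k+2}
    for bk in reversed(b[1:]):         # k = n, n-1, ..., 1
        q1, q2 = bk + 2.0 * x * q1 - q2, q1
    return b[0] + x * q1 - q2

# 0.5*T_0(x) + 0.5*T_2(x) = x^2, so the value at x = 0.3 should be 0.09
print(clenshaw_eval([0.5, 0.0, 0.5], 0.3))
```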
As we have seen in Section 4.2, $T_n(x)$ has n real zeros in the interval (−1, 1) and n + 1 extrema (including x = ±1), at which it takes the values ±1. This is probably the most important property of these polynomials. Because of these equal minima and maxima, these polynomials are most suitable for minimax approximations. These polynomials are similar to trigonometric functions inside the interval [−1, 1]. To illustrate this property, Figure 10.7 shows the first few Chebyshev polynomials.

[Figure 10.7: Chebyshev polynomials $T_j(x)$, j = 1, 2, 3, 4, 5, plotted over $-1 \le x \le 1$.]
These polynomials satisfy the orthogonality relation

$\displaystyle\int_{-1}^{1}\frac{T_m(x)T_n(x)}{\sqrt{1-x^2}}\,dx = \begin{cases} 0, & \text{if } m \ne n;\\ \pi, & \text{if } m = n = 0;\\ \pi/2, & \text{if } m = n \ne 0.\end{cases}$    (10.154)
Chebyshev polynomials are also orthogonal over the following discrete set of N + 1 points, which are equally spaced in $\theta = \cos^{-1}x$,

$x_j = \cos(j\pi/N), \qquad (j = 0, 1, \ldots, N),$    (10.155)
which are the extrema of $T_N(x)$. The orthogonality relation is

$\displaystyle\sum_{i=0}^{N}{}''\;T_j(x_i)T_k(x_i) = \begin{cases} 0, & \text{if } j \ne k;\\ N/2, & \text{if } j = k \ne 0;\\ N, & \text{if } j = k = 0;\end{cases}$    (10.156)

where the double prime indicates that the first and the last terms in the sum are halved.
If the function is periodic, the first and the last term can be combined to get the orthogonality condition in a more compact form over the first N points. Over this set of N + 1 points, we can have at most N + 1 linearly independent polynomials. In fact, $T_j(x)$, (j > N), can be expressed in terms of lower degree polynomials over these points, which is similar to the phenomenon of aliasing observed in the discrete Fourier transform. Chebyshev polynomials are also orthogonal over the set of N points

$x_j = \cos\!\left(\dfrac{(2j-1)\pi}{2N}\right), \qquad (j = 1, 2, \ldots, N),$    (10.157)
which are the zeros of $T_N(x)$. The orthogonality relation is

$\displaystyle\sum_{i=1}^{N} T_j(x_i)T_k(x_i) = \begin{cases} 0, & \text{if } j \ne k;\\ N/2, & \text{if } j = k \ne 0;\\ N, & \text{if } j = k = 0.\end{cases}$    (10.158)

It may be noted that the unequal spacing of the points $x_j$ compensates for the weight factor of $1/\sqrt{1-x^2}$ in the continuous case.
The Chebyshev polynomials can be easily differentiated or integrated using the relations

$\dfrac{T'_{n+1}(x)}{n+1} - \dfrac{T'_{n-1}(x)}{n-1} = 2T_n(x), \qquad (n \ge 2);$
$T'_2(x) = 4T_1(x), \qquad T'_1(x) = T_0(x), \qquad T'_0(x) = 0.$    (10.159)

Instead of the usual Maclaurin series in powers of x, we can expand the function f(x) in a series of Chebyshev polynomials, given by

$f(x) = \tfrac{1}{2}c_0 + \displaystyle\sum_{j=1}^{\infty} c_j T_j(x).$    (10.160)

The factor of $\tfrac{1}{2}$ in front of the first term is introduced so as to have a uniform formula for the coefficients $c_j$. Using the orthogonality of the Chebyshev polynomials, the coefficients can be evaluated using

$c_j = \dfrac{2}{\pi}\displaystyle\int_{-1}^{1}\frac{f(x)T_j(x)}{\sqrt{1-x^2}}\,dx, \qquad (j = 0, 1, \ldots).$    (10.161)

It should be noted that the truncated series obtained by retaining the first n
terms is not algebraically equivalent to the truncated Maclaurin series. The
truncated Maclaurin series can also be expressed as a sum of Chebyshev poly-
nomials, but the coefficients will be different from those in (10.160). Similarly,
the truncated Chebyshev expansion can be expressed as a polynomial in x,
but the coefficients of x j will be different from those in the Maclaurin series.
The advantage of the Chebyshev expansion is that the error incurred by dropping the rth term is bounded by $|c_r|$ over the interval [−1, 1], and further it is more evenly distributed over the interval, while the power series has large errors near the ends of the interval. Further, the coefficients $c_j$ in the Chebyshev expansion fall off more rapidly with j as compared to the corresponding coefficients
in the Maclaurin series. Beyond the interval [-1, 1], the Chebyshev polyno-
mials increase rapidly and the accuracy of approximation deteriorates. If the
approximation is required over a different interval, then the corresponding re-
gion should be mapped on to [-1, 1] by a suitable change of variable, before
obtaining the Chebyshev expansion.

We can also obtain approximations similar to the Pade approximations using the Chebyshev polynomials instead of powers of x. Thus, we can write

$T_{mk}(x) = \dfrac{\sum_{j=0}^{m} a_j T_j(x)}{\sum_{j=0}^{k} b_j T_j(x)}.$    (10.162)

To determine the coefficients $a_j$ and $b_j$, we can use the Chebyshev expansion (10.160) to write

$f(x) - T_{mk}(x) = \dfrac{\left(\tfrac{1}{2}c_0 + \sum_{j=1}^{\infty} c_j T_j(x)\right)\left(\sum_{j=0}^{k} b_j T_j(x)\right) - \sum_{j=0}^{m} a_j T_j(x)}{\sum_{j=0}^{k} b_j T_j(x)}.$    (10.163)
The m + k + 1 = n + 1 independent coefficients $a_j$ and $b_j$ can be determined by equating the coefficients of $T_j(x)$ for j = 0, 1, ..., n in the numerator to zero. In order to obtain the required equations, we can use the identity

$T_i(x)T_j(x) = \tfrac{1}{2}\left[T_{i+j}(x) + T_{|i-j|}(x)\right],$    (10.164)

which can be easily derived using the relation $T_j(\cos\theta) = \cos(j\theta)$. We get the following expansion for the numerator of (10.163):

$\tfrac{1}{2}c_0\displaystyle\sum_{j=0}^{k} b_j T_j(x) + \tfrac{1}{2}\sum_{j=1}^{\infty}\sum_{i=0}^{k} b_i c_j\left[T_{i+j}(x) + T_{|i-j|}(x)\right] - \sum_{j=0}^{m} a_j T_j(x).$    (10.165)
Equating the coefficients of $T_j(x)$ for j = 0, 1, ..., n to zero, we get the following equations:

$\tfrac{1}{2}\displaystyle\sum_{i=0}^{k} b_i c_i = a_0,$
$\tfrac{1}{2}\displaystyle\sum_{i=0}^{k} b_i\left(c_{|j-i|} + c_{j+i}\right) = \begin{cases} a_j, & \text{if } j = 1, 2, \ldots, m;\\ 0, & \text{if } j = m+1, \ldots, n.\end{cases}$    (10.166)
These constitute a system of n + 1 linear equations in n + 2 unknowns $a_j$ and $b_j$. However, the coefficients $a_j$ and $b_j$ are not all independent, and any one of the nonzero coefficients can be given an arbitrary value by multiplying the denominator and the numerator by a suitable factor. If $b_0 \ne 0$, then we can set $b_0 = 1$ to solve the system of equations for the remaining n + 1 coefficients.
The truncation error in this approximation may be estimated using the coefficient of the first nonvanishing term in the numerator of (10.163), which is

$d_{n+1} = \tfrac{1}{2}\displaystyle\sum_{i=0}^{k} b_i\left(c_{n+1-i} + c_{n+1+i}\right).$    (10.167)

This term gives a truncation error of $d_{n+1}T_{n+1}(x)/Q_k(x)$, where $Q_k(x)$ is the polynomial of degree k in the denominator of the rational function.

While this procedure can give very good approximation, it requires the
knowledge of Chebyshev expansion, which may be difficult to obtain. A simpler
procedure is to start with the Maclaurin series or any other polynomial approx-
imation and apply what is known as Chebyshev economisation. The basic idea
behind economisation is that for large n , xn can be approximated by a com-
bination of lower degree terms, and in fact, as we have already seen, it causes
problems when we try to obtain least squares approximations using powers of
x as the basis functions. If this approximation to xn is substituted in the poly-
nomial, then we can drop one term from our polynomial approximation to get
a lower degree polynomial. This is the basic philosophy of economisation proce-
dure. This process can be continued until the approximation of leading power
of x by lower degree terms introduces significant errors in approximation.
In practice, economisation is achieved using the Chebyshev polynomials,
which basically give the linear combination of powers of x, which can be elimi-
nated from the known polynomial approximation. Thus, we can consider a few
extra terms in the I\Iaclaurin series and then remove the extra terms by sub-
tracting appropriate multiples of Chebyshev polynomial to get a polynomial of
lower degree, which is much more accurate than the truncated Maclaurin series
of the same degree. In the simplest case, when only one extra term is added,
this procedure can be applied as follows. Consider the truncated Maclaurin
expansion
$R_{n+1,0}(x) = \displaystyle\sum_{i=0}^{n+1} d_i x^i.$    (10.168)

Then

$C_{n,0}(x) = R_{n+1,0}(x) - \dfrac{d_{n+1}}{2^n}T_{n+1}(x)$    (10.169)

is a polynomial of degree n, since the leading term in $R_{n+1,0}(x)$ is cancelled by that in $T_{n+1}(x)$. Now since $|T_{n+1}(x)| \le 1$, the truncation error in $C_{n,0}(x)$ is greater than that in $R_{n+1,0}(x)$ by no more than $d_{n+1}/2^n$, which is quite small in most practical situations. On the other hand, the maximum error caused by neglecting the last term in $R_{n+1,0}(x)$ is $d_{n+1}$. Hence, the maximum error in $C_{n,0}(x)$ is expected to be approximately a factor of $2^n$ less than that in $R_{n,0}(x)$, provided the higher order terms in the Maclaurin series are much smaller.
In general, the improvement will be somewhat less. It should be noted that both $C_{n,0}(x)$ and $R_{n,0}(x)$ are polynomials of degree n, which can be evaluated with the same amount of effort. Hence, we start with a polynomial approximation of degree n + 1 and from that obtain another polynomial approximation of degree n, whose maximum error is only slightly greater than that of the original polynomial of degree n + 1. Thus, we have "economised" the power series in the sense of using fewer terms to achieve almost the same accuracy. This process is referred to as Chebyshev economisation.
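A single economisation step (10.169) can be sketched as follows; the helper that generates the power-series coefficients of $T_{n+1}(x)$ from the recurrence (10.146) is included so that the fragment is self-contained (both function names are illustrative).

```python
def chebyshev_T_coeffs(n):
    """Power-series coefficients of T_n(x), from the recurrence (10.146)."""
    t0, t1 = [1.0], [0.0, 1.0]
    if n == 0:
        return t0
    for _ in range(n - 1):
        t2 = [0.0] + [2.0 * c for c in t1]    # 2x*T_k(x)
        for i, c in enumerate(t0):            # minus T_{k-1}(x)
            t2[i] -= c
        t0, t1 = t1, t2
    return t1

def economise(d):
    """One Chebyshev economisation step (10.169): remove the leading term of
    sum_i d[i]*x**i (degree n+1) by subtracting d[n+1]*T_{n+1}(x)/2**n."""
    n = len(d) - 2
    t = chebyshev_T_coeffs(n + 1)
    factor = d[-1] / 2.0 ** n                 # leading coefficient of T_{n+1} is 2**n
    return [di - factor * ti for di, ti in zip(d[:-1], t[:-1])]

# Example 10.9: economising the degree-11 truncated Maclaurin series of arctan(x)
d = [0, 1, 0, -1/3, 0, 1/5, 0, -1/7, 0, 1/9, 0, -1/11]
print(economise(d))   # coefficients of C_{9,0}(x), cf. (10.178)
```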
If the terms in the Maclaurin series are falling off rapidly, then economi-
sation may give an accuracy comparable to that achieved by Chebyshev expan-
sion, which may require much more effort to obtain the coefficients. Of course,
the economisation can be applied to $C_{n,0}(x)$ to obtain a polynomial of degree n − 1, or alternately we can start with $R_{n+2,0}(x)$ and apply the economisation twice to obtain a polynomial of degree n. This process can be applied any number of times. However, the effort required for economisation increases rapidly,
twice to obtain a polynomial of degree n. This process can be applied any num-
ber of times. However, the effort required for economisation increases rapidly,
since at each stage the coefficient of the leading power of x needs to be calcu-
lated. After a few steps, this coefficient may come out to be fairly large and the
error introduced may not be very small. For functions whose Maclaurin series
converges very slowly, it is possible to drop several terms during economisation
without increasing the error significantly, but the error itself may be fairly large.
Similar economisation can also be applied to rational function approximation,
(Ralston and Rabinowitz, 2001), but the algebra is obviously more involved.
EXAMPLE 10.9: Obtain the Chebyshev expansions equivalent to $R_{9,0}(x)$ and $R_{5,4}(x)$ of Example 10.8 for $f(x) = \tan^{-1}x$ and compare their accuracy with the corresponding Pade approximations.
To obtain the Chebyshev expansion we need to evaluate the integrals (10.161). However, in this case we can avoid the integrals by using the following technique. We can write

$\tan^{-1}x = \displaystyle\int_0^x \frac{dt}{1+t^2},$    (10.170)

and first find the Chebyshev expansion for the integrand,

$\dfrac{1}{1+x^2} = \displaystyle\sum_{j=0}^{\infty} b_{2j}T_{2j}(x).$    (10.171)

To find the coefficients $b_{2j}$, multiply this equation by $1 + x^2$ and, using the recurrence relation, write $x^2T_{2j}(x)$ in terms of the $T_{2j}(x)$, to get

$\displaystyle\sum_{j=0}^{\infty} b_{2j}T_{2j}(x) + \frac{b_0}{2}\left[T_0(x) + T_2(x)\right] + \sum_{j=1}^{\infty}\frac{b_{2j}}{4}\left[T_{2j-2}(x) + 2T_{2j}(x) + T_{2j+2}(x)\right] = 1.$    (10.172)

Now equating the coefficients of $T_{2j}(x)$ on both sides gives the difference equation

$\tfrac{3}{2}b_0 + \tfrac{1}{4}b_2 = 1; \qquad \tfrac{1}{2}b_0 + \tfrac{3}{2}b_2 + \tfrac{1}{4}b_4 = 0;$
$\tfrac{1}{4}b_{2j-2} + \tfrac{3}{2}b_{2j} + \tfrac{1}{4}b_{2j+2} = 0, \qquad (j = 2, 3, \ldots).$    (10.173)

The solution of this difference equation can be written as $b_{2j} = c_1\alpha_1^j + c_2\alpha_2^j$, $(j \ge 1)$, where $c_1$ and $c_2$ are constants and $\alpha_1$ and $\alpha_2$ are the roots of the characteristic equation $\alpha^2 + 6\alpha + 1 = 0$, which gives $\alpha_1 = -3 + 2\sqrt{2}$ and $\alpha_2 = -3 - 2\sqrt{2}$. Since $|\alpha_2| > 1$, $c_2$ must be zero, otherwise the series will diverge. The constant $c_1$ and the coefficient $b_0$ can be found from the first two equations, which give

$6b_0 + c_1\alpha_1 = 4, \qquad 2b_0 + c_1(6\alpha_1 + \alpha_1^2) = 0.$    (10.174)

Hence, $c_1 = \sqrt{2}$ and $b_0 = 1/\sqrt{2}$, and we get the expansion

$\dfrac{1}{1+x^2} = \sqrt{2}\left(\tfrac{1}{2}T_0(x) + \alpha_1 T_2(x) + \alpha_1^2 T_4(x) + \cdots\right).$    (10.175)
Integrating this series term by term gives the required Chebyshev expansion

$\tan^{-1}x = 2\displaystyle\sum_{j=0}^{\infty}\frac{(-1)^j\mu^{2j+1}}{2j+1}T_{2j+1}(x), \qquad \mu = \sqrt{2} - 1.$    (10.176)

As expected, only odd terms are present in the expansion, and in fact the expansion is very similar to the Maclaurin series, except for the extra factor $2\mu^{2j+1}$ in each term, which improves the convergence of the series significantly. As a result, we expect the truncated Chebyshev series to give a significantly better approximation than the truncated Maclaurin series.

[Figure 10.8: Chebyshev approximations to $\tan^{-1}x$ — the approximations $R_{9,0}(x)$, $T_{9,0}(x)$, $C_{9,0}(x)$, $C^*(x)$, $R_{5,4}(x)$, $T_{5,4}(x)$ and the logarithm of their errors, plotted for $0 \le x \le 1$.]
It is possible to get an expansion in terms of Chebyshev polynomials of even degree for $\tan^{-1}x/x$, but that obviously gives a larger error in approximating $\tan^{-1}x$, since the multiplication by x distorts the error curve. Hence, we stick to our expansion in terms of odd degree polynomials. The Chebyshev expansion $T_{9,0}(x)$ is obtained by retaining the first five terms of the above series. Because of symmetry, it has only five nonzero coefficients. The error in this approximation is shown in Figure 10.8, which also shows the corresponding curve for the truncated Maclaurin expansion $R_{9,0}(x)$. It is clear that the error in $T_{9,0}(x)$ is much less, and in fact, this approximation is better than even $R_{5,4}(x)$. Using the known coefficients of $T_j(x)$, we can express $T_{9,0}(x)$ as a polynomial in x to get

$T_{9,0}(x) = 0.9998513x - 0.3301050x^3 + 0.1794405x^5 - 0.0841985x^7 + 0.0204196x^9,$    (10.177)

where the coefficients have been rounded to seven figures. It can be seen that the coefficients of the lower degree terms are close to those of the Maclaurin series, while the last few coefficients are significantly different.
This approximation can be compared with the economised power series, which can be written as

$C_{9,0}(x) = x - \tfrac{1}{3}x^3 + \tfrac{1}{5}x^5 - \tfrac{1}{7}x^7 + \tfrac{1}{9}x^9 - \tfrac{1}{11}x^{11} + \tfrac{1}{2^{10}\cdot 11}T_{11}(x)$
$\phantom{C_{9,0}(x)} = 0.9990234x - 0.3138021x^3 + 0.0906250x^5 + 0.1071429x^7 - 0.1388889x^9.$    (10.178)
The error in $C_{9,0}(x)$ is also shown in Figure 10.8. It can be seen that this approximation has an accuracy comparable to that of $T_{9,0}(x)$ for smaller values of x, but for x close to 1 the error is significantly larger and is, in fact, close to the error in $R_{9,0}(x)$. This behaviour is because the Maclaurin series converges very slowly for $x \approx 1$. The truncated Maclaurin series can be written in terms of the Chebyshev polynomials as

$R_{11,0}(x) = \tfrac{415}{512}T_1(x) - \tfrac{31}{512}T_3(x) - \tfrac{11}{5120}T_5(x) - \tfrac{23}{7168}T_7(x) - \tfrac{5}{9216}T_9(x) - \tfrac{1}{11264}T_{11}(x).$    (10.179)
The economised approximation $C_{9,0}(x)$ is obtained by dropping the last term, which gives a maximum error of 1/11264. Even if all terms other than the first one are dropped, the maximum error introduced is less than 0.07, which is comparable to the maximum error of 1/13 introduced in $R_{11,0}(x)$ by ignoring the higher terms of the Maclaurin series. Figure 10.8 also shows the error in this approximation $C^*(x) = (415/512)T_1(x)$. It can be seen that the maximum error in $C^*(x)$ is only slightly larger than that in $C_{9,0}(x)$ or $R_{9,0}(x)$. In fact, if we take the approximation $T_{1,0}(x)$ obtained by retaining only the first term of the Chebyshev expansion (10.176), then the maximum error is even less. It should be noted that $C^*(x)$ and $T_{1,0}(x)$ have only one coefficient, while $R_{9,0}(x)$ and $C_{9,0}(x)$ have five coefficients, which clearly illustrates the power of Chebyshev polynomials for minimax approximations. Alternately, we can write the truncated Maclaurin series in terms of functions of the form $xT_j(x^2)$, but it is clear that in that case economisation will introduce larger errors. The larger error is essentially because of the fact that after the change of variable $y = x^2$ our range is restricted to [0, 1], which is not appropriate for using Chebyshev polynomials in the standard form. We should apply a change of variable to transform this range to [−1, 1], in which case we will recover the same approximation as obtained earlier.
To obtain the rational function approximation using the Chebyshev expansion, we can follow the procedure outlined above. In view of the symmetry of the function we look for a rational function of the form

$T_{5,4}(x) = \dfrac{a_0T_1(x) + a_1T_3(x) + a_2T_5(x)}{1 + b_1T_2(x) + b_2T_4(x)}.$    (10.180)

Comparing $T_{5,4}(x)$ with the Chebyshev expansion (10.176), we get the following equations for the coefficients:

$c_1 + \tfrac{1}{2}b_1(c_1 + c_3) + \tfrac{1}{2}b_2(c_3 + c_5) = a_0,$
$c_3 + \tfrac{1}{2}b_1(c_1 + c_5) + \tfrac{1}{2}b_2(c_1 + c_7) = a_1,$
$c_5 + \tfrac{1}{2}b_1(c_3 + c_7) + \tfrac{1}{2}b_2(c_1 + c_9) = a_2,$    (10.181)
$c_7 + \tfrac{1}{2}b_1(c_5 + c_9) + \tfrac{1}{2}b_2(c_3 + c_{11}) = 0,$
$c_9 + \tfrac{1}{2}b_1(c_7 + c_{11}) + \tfrac{1}{2}b_2(c_5 + c_{13}) = 0,$

where the $c_{2j+1}$ are the coefficients in the expansion (10.176).

The last two of these equations can be solved for b1 and b2 , and these values can be substituted
into the first three equations to give ao, al and a2. The solution gives the rational function
approximation

T5 4(X) = 0.9725548TI(x) + O. 11 22664T3 (x) + 0.00l5806T5 (x)


, 1 + 0.3697941T2(X) + 0.0l34554T4(x) ,
(10.182)
0 .9999958x + O.6485605x 3 + 0.0392906x 5
1 + 0 .9817970x2 + 0.1672363x4
The error in this approximation is also shown in Figure 10.8. It can be seen that while for $R_{9,0}(x)$ and $R_{5,4}(x)$ the error is monotonic and of the same sign in the interval (0, 1), the error in $T_{9,0}(x)$ and $T_{5,4}(x)$ oscillates in sign. At points where the error changes sign, the absolute magnitude drops to zero, giving rise to sharp dips in the curves shown in Figure 10.8. It may be noticed that these dips occur at nearly the same locations in all the curves. That is because in all these approximations the error is dominated by a term proportional to $T_{11}(x)$, and the dips correspond to the zeros of $T_{11}(x)$.

This example clearly illustrates that Chebyshev expansions give much


better approximations to functions as compared to those obtained using power
series. The main problem with Chebyshev expansion is the necessity of eval-
uating the integrals (10.161) to find the coefficients of expansion. In general,
these integrals cannot be evaluated analytically, in which case, numerical eval-
uation of integral involves considerable effort and will also add to the errors
in the coefficients. If the integrals are to be evaluated numerically, we need
to worry about the singularity at both the end points, which can be tackled
by using Gauss-Chebyshev quadrature formulae. One way of avoiding the integrals is to use trigonometric interpolation to approximate the coefficients $c_j$.
For this purpose, we perform a change of variable $x = \cos\theta$ and, remembering that $T_j(\cos\theta) = \cos j\theta$, (10.160) becomes

$g(\theta) = f(\cos\theta) = \tfrac{1}{2}c_0 + \displaystyle\sum_{j=1}^{\infty} c_j\cos j\theta,$    (10.183)

and the coefficients $c_j$ are given by

$c_j = \dfrac{2}{\pi}\displaystyle\int_0^{\pi} g(\theta)\cos j\theta\,d\theta, \qquad (j = 0, 1, \ldots),$    (10.184)

which is exactly the Fourier cosine series for an even function. Hence, Fourier and Chebyshev expansions are related to each other by the change of variable $x = \cos\theta$. We can use the FFT algorithm to obtain the coefficients $c_j$. If the coefficients are obtained by interpolating at the zeros of $T_n(x)$, then the result is equivalent to evaluating the integrals using the corresponding Gauss-Chebyshev quadrature formula.
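For instance, the coefficients can be approximated by cosine sums at the zeros of $T_N$, which is the same as evaluating (10.161) with N-point Gauss-Chebyshev quadrature; the sketch below checks the result against the closed form (10.176).

```python
import math

def chebyshev_coefficients(f, N):
    """Approximate the coefficients c_j of (10.160) by evaluating (10.161)
    with N-point Gauss-Chebyshev quadrature, i.e. by cosine sums at the
    zeros of T_N (an FFT could be used instead for large N)."""
    theta = [(2 * i - 1) * math.pi / (2 * N) for i in range(1, N + 1)]
    fx = [f(math.cos(t)) for t in theta]
    return [2.0 / N * sum(fx[i] * math.cos(j * theta[i]) for i in range(N))
            for j in range(N)]

mu = math.sqrt(2.0) - 1.0
c = chebyshev_coefficients(math.atan, 32)
print(c[1], 2 * mu)             # c_1 from (10.176)
print(c[3], -2 * mu**3 / 3)     # c_3 from (10.176)
```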
Another approach, which can be used if the Maclaurin series is known, is to replace the function by its Maclaurin series inside the integral and integrate it term by term. The resulting infinite series usually converges sufficiently
rapidly to enable the calculation of the coefficients. This procedure can be eas-
ily mechanised on a digital computer, but it still involves considerable amount
of calculation to obtain the coefficients.

10.11 Minimax Approximations


For approximating mathematical functions, we would like to minimise the maximum error over the required interval. However, none of the approximations that we have considered so far are minimax approximations; that is, the maximum error in the approximation is not the smallest that is possible with a polynomial of the same degree, or a rational function of the same form. If the coefficients in the Chebyshev series are decreasing very rapidly, then truncation of the series or the corresponding rational function approximation yields approximations which are close to minimax, and in practice the effort to improve on them may not be worthwhile. For functions whose Chebyshev coefficients do not decrease rapidly, the minimax approximation of the same degree may have significantly smaller maximum error than the $T_{mk}(x)$ considered in the previous section, and it may be worthwhile to spend the extra effort to improve on this approximation. Before we consider techniques for generating minimax approximations, we will state a few useful results.
Let f(x) be a continuous function which we wish to approximate over a finite interval [a, b] by a rational function of the form

$R_{mk}(x) = \dfrac{P_m(x)}{Q_k(x)} = \dfrac{\sum_{j=0}^{m} a_j x^j}{\sum_{j=0}^{k} b_j x^j}.$    (10.185)

Two rational functions are considered to be identical if they are equal when reduced to their lowest terms. Let

$r_{mk} = \max_{a \le x \le b}|f(x) - R_{mk}(x)|$    (10.186)

be the maximum error in the approximation. Then we can prove the following theorem due to Chebyshev.
Theorem: There exists a unique rational function $R^*_{mk}(x)$ which minimises $r_{mk}$. Moreover, if we write this unique rational function as

$R^*_{mk}(x) = \dfrac{P^*_m(x)}{Q^*_k(x)} = \dfrac{\sum_{j=0}^{m-\nu} a^*_j x^j}{\sum_{j=0}^{k-\mu} b^*_j x^j},$    (10.187)

where $0 \le \mu \le k$, $0 \le \nu \le m$, $a^*_{m-\nu} \ne 0$ and $b^*_{k-\mu} \ne 0$, and $P^*_m(x)/Q^*_k(x)$ is irreducible, then if $r^*_{mk} \ne 0$, the number of consecutive points of [a, b] at which $f(x) - R^*_{mk}(x)$ takes on its maximum value of magnitude $r^*_{mk}$ with alternate change of sign is not less than $L = m + k + 2 - d$, where $d = \min(\mu, \nu)$.
Thus, the important property of minimax approximations is that the error curve has a series of equal minima and maxima. Such error curves are usually referred to as equal ripple, while the points at which the error has maximum magnitude are referred to as the critical points. In particular, for polynomial approximations (k = 0), the theorem says that the number of points at which the error attains its maximum magnitude is at least m + 2. We shall not go into the proof of this theorem, since that does not tell us anything about how to construct such approximations. This theorem also does not say anything about how close a given approximation is to the minimax approximation. Intuitively, we can see from Figure 10.8 that the plot of error in $T_{5,4}(x)$ has nearly equal minima and maxima, and we expect $T_{5,4}$ to be close to a minimax approximation. On the other hand, the plot of error in $R_{5,4}(x)$ shows that this approximation deviates considerably from the minimax approximation. This is confirmed by the following theorem.
Theorem: Let $R_{mk}$ as defined by (10.187) be irreducible, and let the difference $f(x) - R_{mk}(x)$ remain finite in [a, b]. Further, let $x_1 < x_2 < \cdots < x_L$ be the extrema of the difference in [a, b] and

$f(x_i) - R_{mk}(x_i) = (-1)^i\lambda_i, \qquad (i = 1, 2, \ldots, L),$    (10.188)

where $\lambda_i > 0$ for all i and $L = m + k + 2 - d$ with $d = \min(\mu, \nu)$. Let $S_{mk}(x)$ be any rational function approximation to f(x) with degrees of numerator and denominator less than or equal to, respectively, m and k, and let

$s_{mk} = \max_{a \le x \le b}|f(x) - S_{mk}(x)|.$    (10.189)

Then $s_{mk} \ge \min_i\lambda_i$.


Thus, this theorem assures us that the maximum error in any approxi-
mation is greater than the smallest extrema of error in a given approximation,

provided it has the right number of extrema. Further, by definition the minimax approximation has a maximum error smaller than the maximum error in the given approximation. Thus, this theorem gives us lower and upper bounds on the maximum error in the minimax approximation. If $r^*_{mk}$ is the maximum error in the minimax approximation, then $\min_i\lambda_i \le r^*_{mk} \le \max_i\lambda_i$. Thus, if we find an approximation with the correct number of extrema for which $\min_i\lambda_i$ differs very little from $\max_i\lambda_i$, then the maximum error in this approximation is not much greater than that in the true minimax approximation.
If the Chebyshev expansion of the function is known, then there is an interesting result which gives upper and lower bounds for the minimax polynomial approximation. Suppose that f(x) has the Chebyshev expansion (10.160), and let $P^*_n(x)$ be the polynomial of degree less than or equal to n which approximates f(x) with minimax absolute error in [−1, 1]. Then it can be shown (Cheney, 1999) that

$\left|\displaystyle\sum_{k=0}^{\infty} c_{(2k+1)(n+1)}\right| \le \max_{-1\le x\le 1}|P^*_n(x) - f(x)| \le \displaystyle\sum_{k=n+1}^{\infty}|c_k|.$    (10.190)

The right-hand inequality can be easily proved by using the fact that $|T_k(x)| \le 1$.
The minimax approximation can be constructed by an iterative process referred to as the second algorithm of Remes. For this iteration to converge we need a reasonably good initial approximation, or at least a good approximation to $x_i$, the extrema of the minimax approximation. Using the equal ripple property of the error in the minimax approximation, we can construct the next approximation. The initial approximation can be obtained using the Chebyshev expansion or by economised polynomials or rational functions. The Pade approximations or the truncated Maclaurin series are, in general, not suitable for starting the iteration, since the error curve does not have the required number of extrema. For approximation by a polynomial of degree m, we can start with an interpolating polynomial which interpolates the function at the zeros of $T_{m+1}(x)$. As explained in Section 4.1, this approximation should have a truncation error of the form $T_{m+1}(x)f^{(m+1)}(\xi)/(2^m(m+1)!)$, which can be expected to have the required number of extrema. Alternately, we can fit at the m + 2 extrema of $T_{m+1}(x)$ by requiring the error to be of equal magnitude with alternating sign at these points. For a rational function approximation, we can obtain the coefficients by solving the system of equations

$\displaystyle\sum_{j=0}^{m} a_j x_i^j - \left[f(x_i) - (-1)^i E\right]\sum_{j=1}^{k} b_j x_i^j = f(x_i) - (-1)^i E, \qquad (i = 0, 1, \ldots, m+k+1), \qquad b_0 = 1,$    (10.191)

where $x_i$ are the m + k + 2 extrema of $T_{m+k+1}(x)$. This system of m + k + 2 equations can be solved for the m + k + 1 coefficients and the error E. It may be noted that in some cases $b_0$ may turn out to be zero, in which case a different normalisation may be required. For polynomial approximation (k =
0), these equations are linear and can be solved easily. For rational function approximation, there is some nonlinearity because there are products of E with the $b_j$ in the equations. Hence, these equations need to be solved iteratively. The iteration formula can be defined by linearising the equations to obtain

$\displaystyle\sum_{j=0}^{m} a_j x_i^j - \left[f(x_i) - (-1)^i E_r\right]\sum_{j=1}^{k} b_j x_i^j = f(x_i) - (-1)^i E_{r+1}.$    (10.192)

These equations are obtained by neglecting nonlinear terms of the form $(E_r - E_{r+1})b_j x_i^j$ in the original equations. Starting from an assumed initial value of $E_0$, this set of linear equations can be solved for the unknowns $a_0, \ldots, a_m, b_1, \ldots, b_k$ and $E_{r+1}$ for $r = 0, 1, \ldots$, until two successive values of $E_r$ are in satisfactory agreement. In the absence of any other information, a starting value of $E_0 = 0$ could be used.
In the Remes algorithm, we start with an initial approximation and find the extrema of the error curve $f(x) - R_{mk}(x)$, or alternately we start directly with the approximate positions of these extrema, and solve (10.191) to obtain the next approximation to the coefficients. Having found the coefficients, the extrema of the error curve can be found, and the process is repeated until the extrema are of equal magnitude to the required accuracy. Here it is assumed that at every step the error curve has the correct number of extrema. It can be shown that for k = 0, this iteration converges for an arbitrary choice of the m + 2 extrema in the first step. However, for $k \ne 0$ the iteration converges only if the initial choice of extrema is sufficiently close to the actual extrema of the minimax approximation. Hence, a good initial approximation is usually required. In general, at some stage we may encounter an approximation which does not have the required number of extrema, and it may not be possible to proceed further.
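To make the exchange idea concrete, the following sketch implements the polynomial case (k = 0), for which the equations (10.191) are linear. It is a simplified illustration and not the REMES subroutine of Appendix B: the extrema are located only on a fixed grid (rather than refined by Brent's method), sign alternation of the retained extrema is not enforced, and the function is assumed to be smooth.

```python
import numpy as np

def remes_polynomial(f, a, b, m, niter=10, ngrid=2000):
    """Bare-bones Remes exchange iteration for a near-minimax polynomial of
    degree m on [a, b]; assumes the error curve has at least m+2 extrema."""
    i = np.arange(m + 2)
    # initial critical points: the m+2 extrema of T_{m+1}, mapped to [a, b]
    x = 0.5 * (a + b) + 0.5 * (b - a) * np.cos(np.pi * i / (m + 1))
    grid = np.linspace(a, b, ngrid)
    fg = f(grid)
    for _ in range(niter):
        # solve sum_j c_j x_i^j + (-1)^i E = f(x_i), cf. (10.191) with k = 0
        A = np.hstack([np.vander(x, m + 1, increasing=True),
                       ((-1.0) ** i)[:, None]])
        sol = np.linalg.solve(A, f(x))
        coef, E = sol[:m + 1], sol[m + 1]
        err = np.polyval(coef[::-1], grid) - fg
        # local extrema of the error on the grid, end points included
        idx = [0] + [j for j in range(1, ngrid - 1)
                     if (err[j] - err[j - 1]) * (err[j + 1] - err[j]) <= 0.0] + [ngrid - 1]
        # exchange step: keep the m+2 extrema of largest magnitude
        idx = sorted(sorted(idx, key=lambda j: -abs(err[j]))[:m + 2])
        x = grid[idx]
        if max(abs(err[idx])) - min(abs(err[idx])) < 0.01 * max(abs(err[idx])):
            break
    return coef, abs(E)

# e.g. a degree-5 near-minimax polynomial for exp(x) on [0, 1]
coef, E = remes_polynomial(np.exp, 0.0, 1.0, 5)
print(E)   # estimate of the minimax error
```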
It should be noted that this algorithm assumes our ability to evaluate the function at any value of x in the required interval to an accuracy substantially
better than the maximum error in the approximation. However, this evaluation
need not be very efficient, since this algorithm will be used only once to gen-
erate the approximation. For example, we can use an infinite series or product
representation of the function, even if the convergence is not very fast. Never-
theless, it should be fast enough to ensure convergence in a reasonable time.
Of course, suitable analysis is required to ensure that the truncation error is
indeed less than the required value.
The Remes algorithm is also applicable for finding approximations of the form $f(x) \approx w(x)R_{mk}(x)$, where w(x) is a given continuous function which does not vanish in the required interval. In particular, if the function f(x) does not vanish in the interval, we can choose $w(x) = 1/f(x)$ and attempt to find an approximation to $g(x) \equiv 1$. This technique yields the best relative minimax approximation to f(x) over the interval, since

$g(x) - w(x)R_{mk}(x) = 1 - \dfrac{R_{mk}(x)}{f(x)} = \dfrac{f(x) - R_{mk}(x)}{f(x)}.$    (10.193)

If f(x) vanishes at some point in the interval, then it is meaningless to consider relative error. In that case, we can modify the function to take care of this situation. Thus, for example, if $f(\alpha) = 0$, then we can consider the function $F(x) = f(x)/(x - \alpha)$, provided $F(\alpha) \ne 0$, i.e., the limit as $x \to \alpha$ exists and is nonzero.
If the error curve has the standard form, that is, it has exactly m + k + 2 extrema including both the end points, then this algorithm can be easily applied to find the corresponding approximation. However, if the error curve is nonstandard in the sense that it has some extra extrema, or one of the end points is not an extremum, then the situation could be difficult. In such cases, the approximation $R_{m+1,k+1}(x)$ will be degenerate. Fortunately, the only cases of true degeneracy encountered in practice are those due to even or odd symmetry, and these can be easily handled as illustrated in Example 10.10. In true degeneracy of degree one, there is one less extremum of the error curve than in the normal case, and essentially the numerator and denominator of the rational function have a common factor, which must be divided out to reduce it to the lowest terms. While true degeneracy almost never occurs in practice, it is quite common to encounter situations where one zero of the numerator differs very little from a zero of the denominator. Such situations can be handled by the method of artificial pole (Cody, 1970). In this technique, we essentially use $w(x) = 1/(x - p)$, where p is some suitably chosen point outside the interval [a, b]. The point p is adjusted until the error curve is of the standard form, leading to an approximation of the form $R_{mk}(x)/(x - p)$, where the value of k should be chosen to be one less than the required value.
If the degree of the polynomial or the rational function is large, then the system of equations could be ill-conditioned. This ill-conditioning can be avoided if we expand the denominator and numerator in terms of Chebyshev polynomials. From (10.191) it can be seen that while the coefficients on the left-hand side are generally of order unity, the term on the right-hand side, being the maximum error, is quite small. Thus, there has to be considerable cancellation. If the error is required to be close to the single precision limit, then the equations will have to be solved using double precision arithmetic, otherwise we will be fitting rounding errors instead of information. Higher precision arithmetic may also be required for calculating the extrema. As we have seen in Chapter 8, because of the flatness of the curve near an extremum, it is usually impossible to achieve an accuracy of better than $\sqrt{\hbar}$ in computing the extrema. In this case, the situation is even worse, because we are trying to find extrema of the error, which itself is obtained by subtraction of two nearly equal functions. For locating the extrema we can use Brent's method.
The requirement of higher precision arithmetic may be avoided in some
cases, by approximating the discrepancy, or the difference between the required
function and some convenient approximation. For this procedure to be useful,
it is essential that the discrepancy should be evaluated directly, rather than
by taking the difference of two functions. For example, if we wish to approxi-
mate the cosine function by a polynomial of degree 2n, then we can define the
discrepancy by

$\cos x - \left(1 - \dfrac{x^2}{2!} + \dfrac{x^4}{4!} - \cdots + (-1)^n\dfrac{x^{2n}}{(2n)!}\right) = (-1)^{n+1}\left(\dfrac{x^{2n+2}}{(2n+2)!} - \dfrac{x^{2n+4}}{(2n+4)!} + \cdots\right).$    (10.194)
Thus, the discrepancy can be evaluated without evaluating the cosine. If we
now find a minimax approximation to this discrepancy, we can add it to the
truncated Maclaurin series to obtain the required approximation. Since the
discrepancy is small, the approximation to it will also be small. Hence, the
relative precision required in approximating the discrepancy will be quite low
and it may be possible to avoid the use of higher precision arithmetic. For
functions which have infinite continued fraction expansion, it is possible to
estimate the discrepancy between the function value and the approximation
obtained by truncated continued fraction (Acton, 1990). This estimate could
be useful for obtaining rational function approximations.
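The discrepancy (10.194) can be summed directly from its leading term, for example as in the following sketch (the function name and the fixed number of terms are illustrative choices):

```python
import math

def cosine_discrepancy(x, n, nterms=12):
    """Evaluate the discrepancy (10.194) between cos(x) and its truncated
    Maclaurin series of degree 2n directly from the tail of the series,
    so that cos(x) itself is never evaluated and no cancellation occurs."""
    term = (-1.0) ** (n + 1) * x ** (2 * n + 2) / math.factorial(2 * n + 2)
    total = 0.0
    for k in range(nterms):
        total += term
        # next term: multiply by -x^2 / ((2n+2k+3)(2n+2k+4))
        term *= -x * x / ((2 * n + 2 * k + 3) * (2 * n + 2 * k + 4))
    return total

# check against direct subtraction for n = 2, x = 0.5
print(cosine_discrepancy(0.5, 2),
      math.cos(0.5) - (1 - 0.5**2 / 2 + 0.5**4 / 24))
```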
Subroutine REMES in Appendix B provides a simple implementation of the Remes algorithm. It can use a known approximation with the required number of extrema to start the iteration. Alternately, it is possible to start the iteration directly, in which case it starts with the extrema of $T_{m+k+1}(x)$ as the critical points for the first approximation. It is also possible to specify the positions of the extrema for the first approximation. The iteration is continued until the errors at the extrema of the error curve are nearly equal. If $E_{\max}$ and $E_{\min}$ are, respectively, the maximum and the minimum magnitude of the error at the extrema, then the iteration is terminated when $E_{\max} - E_{\min} < 0.01E_{\max}$. To find the extrema of the
error function, this subroutine uses a simple approach, where the extrema are
bracketed by computing the value of the function at a fixed grid of points in the
given interval. This technique saves some computation, since the function value
is calculated only once at these points. Using the bracketing triplet, Brent's
method is employed to find the extrema. Here it is preferable to use an always
convergent method for finding the extrema, since otherwise it may be difficult
to ensure that iteration has converged to the correct extrema. The coefficients
of the final approximation are not very sensitive to the accuracy with which
the extrema are located. Hence, a separate convergence criterion is required
to specify the accuracy with which the extrema are to be located. Because
of roundoff error, it may not be possible to locate the extrema very accurately, and if high accuracy is specified, Brent's method will perform unnecessary iterations to subdivide the range. Hence, this convergence parameter should be set to a moderate value (e.g., $10^{-3}$ to $10^{-6}$) and in any case it should be significantly larger than $\sqrt{\hbar}\,(|a| + |b|)$, where [a, b] is the range over which the
approximation is required.
Evaluating the rational function approximation using computer arithmetic
involves some roundoff error. The roundoff error will be particularly important
when the maximum error is comparable to the machine accuracy. In general,
it is found that when the rational function is evaluated directly by taking ratio

of two polynomials, the roundoff error is usually small. There is more likely to be serious propagation of roundoff error if the equivalent continued fraction is used. Thus, we have the usual conflict between efficiency and accuracy. While the continued fraction may be more efficient to evaluate, it may suffer from excessive roundoff error. This trouble can usually be detected by the fact that some of the coefficients in the continued fraction expansion may come out to be large and possibly of opposite signs {43}. Since the polynomial basis is not orthogonal, the computed coefficients of rational function approximations may also come out to be slightly different from tabulated values, but the resulting approximation may be equally good if sufficient significant figures are retained in these coefficients.
EXAMPLE 10.10: Find the minimax approximation of the form $R_{5,4}(x)$ for $\tan^{-1}x$ over the
interval $[-1, 1]$.
To construct this minimax approximation, we start with an initial approximation
$T_{5,4}(x)$ obtained in Example 10.9. It can be seen from Figure 10.8 that the error curve of this
function is fairly close to what we expect for the minimax approximation and it may serve
as a good starting approximation for applying the Remes algorithm. This rational function
approximation has polynomials of degree five and four, respectively, in the numerator and the
denominator, and we expect it to have 11 extrema inside the interval. However, as seen from
Figure 10.8, there are 12 extrema, which can be attributed to the symmetry of the function.
Because of the symmetry, $R^*_{5,4}(x) \equiv R^*_{5,5}(x) \equiv R^*_{6,4}(x) \equiv R^*_{6,5}(x)$, and we can consider this
approximation as $R_{6,4}(x)$, in which case the number of extrema will be normal and the error
curve can also be expected to be of the standard form. From the error curve for $T_{5,4}(x)$,
it follows that the magnitude of the maximum error $E$ in the minimax approximation satisfies
$8.60\times 10^{-8} \le E \le 3.48\times 10^{-7}$. Using subroutine REMES the required approximation is
obtained in two iterations. The final approximation
$$R^*_{6,4}(x) = \frac{0.9999974987\,x + 0.6557398216\,x^3 + 0.0405226924\,x^5}{1 + 0.9890036549\,x^2 + 0.1707411613\,x^4}, \qquad (10.195)$$
has all extrema between $1.86\times 10^{-7}$ and $1.91\times 10^{-7}$. Hence, this approximation should be very
close to the true minimax approximation. It should be noted that if the program recognised
the symmetry of the function, then the amount of effort required could be reduced by a
factor of two. The calculations were performed using 53-bit arithmetic. The error curves
for the initial approximation $T_{5,4}(x)$ as well as the final approximation $R^*_{6,4}(x)$ are shown
in Figure 10.9. The initial approximation has a larger error near the centre of the interval,
while the final approximation has the equal ripple error curve characteristic of all minimax
approximations.
It should be noted that this approximation is with respect to the absolute error. We
can also find a minimax approximation involving the relative error. Since the function vanishes at
$x = 0$ we can use the function $f^*(x) = \tan^{-1}x/x$, which is symmetric about $x = 0$, and to
improve the efficiency we can seek an approximation of the form $R_{2,2}(x^2)$ over the range $[0, 1]$.
Using subroutine REMES with the function FUN = 1 and FUND $= \sqrt{x}/\tan^{-1}(\sqrt{x})$, and once
again starting with the approximation $T_{5,4}(x)$, we can obtain the required approximation.
After four iterations, we get
$$R^*_{2,2}(x^2) = \frac{0.9999995340 + 0.6667190811\,x^2 + 0.0424327889\,x^4}{1 + 1.0000197854\,x^2 + 0.1761383590\,x^4}, \qquad (10.196)$$
which has a maximum relative error (in $f^*(x)$) of $4.66\times 10^{-7}$. It can be seen that the coefficients in the two approximations are somewhat different. For comparison the absolute error
in this approximation is also shown in Figure 10.9. It can be seen that this approximation
has a somewhat higher maximum error near the ends of the interval, while near $x = 0$ the
error is very low, which is to be expected, since the error curve for the relative error will have
the characteristic equal ripple property for this approximation.
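The quoted coefficients can be verified numerically; the following short script (illustrative only, not part of Appendix B) evaluates (10.195) as a ratio of two polynomials and compares it with the library arctangent on a fine grid.

```python
import numpy as np

# numerator and denominator coefficients of R*_{6,4}(x) from (10.195)
p = [0.9999974987, 0.6557398216, 0.0405226924]   # coefficients of x, x^3, x^5
q = [1.0, 0.9890036549, 0.1707411613]            # coefficients of 1, x^2, x^4

def r64(x):
    x2 = x * x
    num = x * (p[0] + x2 * (p[1] + x2 * p[2]))   # nested (Horner) evaluation
    den = q[0] + x2 * (q[1] + x2 * q[2])
    return num / den

x = np.linspace(-1.0, 1.0, 20001)
err = r64(x) - np.arctan(x)
print(np.max(np.abs(err)))   # close to 1.9e-7, as quoted in the text
```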
Figure 10.9: Error curves for approximations to $\tan^{-1}x$.

EXAMPLE 10.11: Find the minimax approximation of the form $R_{1,2}(x)$ for the Gamma function over the interval $[2, 3]$.
In order to calculate the approximation, we first need some method for calculating the
Gamma function to sufficient accuracy. For this purpose, we can use the infinite product, or
the asymptotic series (see {1.3}). However, the infinite product converges very slowly and is
not very useful, while the asymptotic series cannot give sufficient accuracy for the relevant
values of $x$. To improve the convergence, we use the well-known relation $\Gamma(x+1) = x\Gamma(x)$ to
translate the range to a sufficiently large value of $x$, so that the asymptotic series converges to
sufficient accuracy. This may not be the best technique for evaluating $\Gamma(x)$, but it is sufficient
for our purpose. We apply this relation 100 times to translate the range to $[102, 103]$. In
this process we lose two significant digits in $x$, but even then the truncation error will be
larger than the roundoff error, since 53-bit arithmetic is used for all computations. Having
an algorithm to calculate $\Gamma(x)$, we can apply the Remes algorithm to find the required
approximation. However, it turns out that, even if a reasonably good starting approximation
is supplied, the iteration does not converge. The problem can be traced to near degeneracy
of the approximation. From the approximation given later on, it can be seen that while the
denominator has a zero at $x = 3.00605$, the numerator vanishes at $x = 3.00599$. Consequently,
it is very difficult to obtain this approximation directly.
If the nearly common factor between the numerator and the denominator is eliminated,
then the approximation reduces to the form $R_{0,1}(x)$. Hence, to start with, we obtain this
approximation using the subroutine REMES. If the iteration is started with critical points
at the extrema of $T_2(x)$, the iteration converges to
$$R^*_{0,1}(x) = \frac{0.496693593}{1 - 0.250859058\,x}, \qquad (10.197)$$
which has a maximum error of 0.00747. It turns out that for this approximation the error
curve shown in Figure 10.10 has a nonstandard form, since there is an extra extremum at
$x = 2$ which is smaller than the other extrema. If the curve is extended to values of $x < 2$,
then the error rises to the magnitude of the three other extrema, when $x \approx 1.9507$. Thus,
the minimax approximation to $\Gamma(x)$ on $[1.9507, 3]$ has four alternating error extrema of equal
magnitude, which is one more than normal. Hence, we expect the approximation $R_{1,2}(x)$ to
be degenerate over this interval. On the slightly smaller interval $[2, 3]$, this approximation can
be expected to be nearly degenerate. If we try to use the Remes algorithm with approximate
positions of the extrema to start the iteration, then in most cases it turns out that the resulting
approximation has a pole inside the interval and the iteration fails to converge.
Figure 10.10: Error curves for minimax approximations to $\Gamma(x)$.

To find the minimax approximation $R_{1,2}(x)$, we use the method of artificial pole mentioned above. For this purpose, we seek an approximation of the form $R_{1,1}(x)/(x-p)$, where
$p$ is outside the required interval. For a given value of $p$, we can find the minimax approximation $R_{1,1}(x)$ by using the weight function $w(x) = 1/(x-p)$, which can be achieved using
the subroutine REMES with the function FUND $= 1/(x-p)$. The value of $p$ is then adjusted to
yield the minimum error. We can use Brent's method for minimising the maximum error in
this approximation with respect to $p$. The minimisation gives $p = 3.00605$ and
$$R^*_{1,2}(x) = \frac{0.4940536316 - 0.1643563367\,x}{1 - 0.5842357276\,x + 0.0836889881\,x^2}, \qquad (10.198)$$
which has a maximum error of 0.00567. It may be noted that $p$ is between the end point of the
interval and the pole of $R^*_{0,1}(x)$ and this fact could be used to bracket the minimum for finding
$p$. It should be noted that the coefficients of the approximation may not be correct to all
digits shown above. However, rounding the coefficients to less than eight significant figures
will result in a considerably different error curve, because of roundoff error in evaluating
the approximation close to $x = 3$. The error curve for this approximation is also shown in
Figure 10.10. This curve is of the standard form, but because of a pole just outside the
interval the last two extrema are very close to each other. In fact, if the region is extended
to $[1.9507, 3]$, then the zeros of the numerator and denominator coincide, giving an exactly
degenerate approximation. It can be seen that the error in $R^*_{1,2}(x)$ is not much less than that
in $R^*_{0,1}(x)$. This behaviour is typical of degenerate or nearly degenerate approximations.
This difficulty can also be avoided by looking for approximations with different values
of $m$ and $k$. For example, there is no difficulty in directly obtaining approximations of the
form $R_{2,2}(x)$ or $R_{4,2}(x)$. In fact, for the latter case, starting with the initial guess for critical
points as the extrema of $T_7(x)$, the Remes algorithm converges in the very first iteration
itself, giving
$$R_{4,2}(x) = \frac{2.362083654 - 2.167034140\,x + 1.327927514\,x^2 - 0.3570635379\,x^3 + 0.0523888798\,x^4}{1 + 0.3437883594\,x - 0.0915343207\,x^2}, \qquad (10.199)$$
which has extrema ranging between $1.7617\times 10^{-7}$ and $1.7722\times 10^{-7}$ and the error curve
has a standard form. Better approximations to $\Gamma(x)$ can be found in function GAMMA in
Appendix B.
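A similar numerical check can be made for (10.199); the script below (again illustrative, not the GAMMA function of Appendix B) compares the rational approximation with the standard library Gamma function over [2, 3].

```python
import math

# coefficients of R_{4,2}(x) from (10.199)
num = [2.362083654, -2.167034140, 1.327927514, -0.3570635379, 0.0523888798]
den = [1.0, 0.3437883594, -0.0915343207]

def r42(x):
    # Horner evaluation of the numerator and denominator polynomials
    n = 0.0
    for c in reversed(num):
        n = n * x + c
    d = 0.0
    for c in reversed(den):
        d = d * x + c
    return n / d

xs = [2.0 + 0.001 * i for i in range(1001)]
max_err = max(abs(r42(x) - math.gamma(x)) for x in xs)
print(max_err)   # of the order of 1.8e-7, consistent with the text
```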

An extensive tabulation of minimax approximations to common mathematical functions is given by Hart et al. (1968), while references to several
published approximations for special functions are given in Cody (1970). Appendix B contains routines to calculate approximations to a few well known
mathematical functions, such as the Gamma function, the error function and Bessel
functions of various types. These functions are approximated by rational functions of appropriate form, whose coefficients have been calculated using the Remes
algorithm.

10.12 Discrete Minimax Approximations


In the previous section, we considered an algorithm for obtaining the minimax approximation, when the function could be evaluated at any point inside the interval. If the function is known only at a set of discrete points $\{x_1, x_2, \ldots, x_N\}$,
then the minimax approximation $R^*_{mk}(x)$ is defined by the requirement
$$\max_{1\le i\le N}\bigl|f(x_i) - R^*_{mk}(x_i)\bigr| \le \max_{1\le i\le N}\bigl|f(x_i) - R_{mk}(x_i)\bigr|, \qquad (10.200)$$
for all $R_{mk}(x)$. This problem is not essentially different from the continuous
case, since if the set of points is rather dense in the interval $[a, b]$, then the
resulting approximation may be close to that for the continuous case. In this
case also, the error curve should have the equal ripple character, the only difference
being that the extrema should be attained at one of the mesh points. Thus, the
problem basically boils down to finding a set of $m+k+2$ points, at which the
error has equal magnitude and is alternating in sign. In this case, we can use
the so-called exchange algorithm, where the basic idea is to start with a choice
of m + k + 2 mesh points and construct an approximation which has these points
as the critical points. Using this approximation we can calculate the error at
each mesh point and if this error is less than or equal to that at the critical
points, then we have found the approximation. If that is not the case, then find
the point at which the error is maximum and include that in the set of critical
points, and remove one point from the set such that the error is alternating in
sign. Using the new set of points a new approximation is constructed and the
process is repeated until the minimax approximation is found. This algorithm
is also referred to as the first algorithm of Remes. This exchange algorithm also
makes use of the equal ripple property of the error curve. In some applications,
e.g., the solution of a system of overdetermined linear equations, it is not pos-
sible to define any order and the minimax approximation will only satisfy the
requirement that the error has maximum magnitude at m+k+2 points. In such
cases, it is not possible to use the exchange algorithm, since even if we select
a set of $m+k+2$ equations which have residuals of maximum magnitude,
it is not possible to fix the sign of the residual. In such situations, it is possible
to use the differential correction algorithm described below, which reduces the
approximation problem to a linear programming (LP) problem (Section 8.7).
For simplicity, we first consider the case k = 0, i.e., approximation by a
polynomial of degree $m$. In fact, we can consider the general linear approximation of the form
$$F(a, x) = \sum_{i=1}^{m} a_i\phi_i(x). \qquad (10.201)$$

Since a LP problem in the standard form requires the variables to be nonnegative, we define $\alpha_{m+1} = \max(0, -a_1, \ldots, -a_m)$ and $\alpha_j = a_j + \alpha_{m+1}$ for
$j = 1, 2, \ldots, m$. Then for $1\le i\le N$, we have
$$e_i = e(x_i) = F(a, x_i) - f(x_i) = \sum_{j=1}^{m}\alpha_j\phi_j(x_i) - \alpha_{m+1}\sum_{j=1}^{m}\phi_j(x_i) - f(x_i)
= \alpha_1\phi_{1,i} + \alpha_2\phi_{2,i} + \cdots + \alpha_m\phi_{m,i} + \alpha_{m+1}\phi_{m+1,i} - f_i, \qquad (10.202)$$
where
$$\phi_{m+1,i} = \phi_{m+1}(x_i) = -\sum_{j=1}^{m}\phi_j(x_i). \qquad (10.203)$$
Now if we define $w = \max_{1\le i\le N}|e_i|$, then we obtain the $2N$ constraints
$$\alpha_1\phi_{1,i} + \alpha_2\phi_{2,i} + \cdots + \alpha_{m+1}\phi_{m+1,i} + w \ge f_i, \qquad
-\alpha_1\phi_{1,i} - \alpha_2\phi_{2,i} - \cdots - \alpha_{m+1}\phi_{m+1,i} + w \ge -f_i, \qquad 1\le i\le N, \qquad (10.204)$$
which define a LP problem of minimising $w$ subject to (10.204). However, in
practice it is more convenient to solve the dual problem. The dual problem involves finding nonnegative quantities $s_i$ and $t_i$, $1\le i\le N$, which are essentially
the slack variables in the above constraints. This LP problem can be stated as
maximising
$$\sum_{i=1}^{N} f_i(s_i - t_i), \qquad (10.205)$$
subject to the $m+2$ constraints
$$\sum_{i=1}^{N}\phi_{j,i}(s_i - t_i) \le 0, \quad 1\le j\le m+1, \qquad \sum_{i=1}^{N}(s_i + t_i) \le 1. \qquad (10.206)$$
These constraints can be expressed as equality constraints using $\alpha_j$ and $w$, the
original variables, as slack variables, to get
$$\sum_{i=1}^{N}\phi_{j,i}(s_i - t_i) + \alpha_j = 0, \quad 1\le j\le m+1, \qquad \sum_{i=1}^{N}(s_i + t_i) + w = 1. \qquad (10.207)$$
These constraints define a LP problem in the standard form, which can be
solved by the simplex algorithm. In the tableau for the simplex algorithm, the
columns corresponding to the variables $s_i$ and $t_i$ are not really independent, but
can be obtained from one another. Hence, we can modify the simplex algorithm
to use the condensed tableau containing only $s_i$, $\alpha_j$ and $w$, which reduces the
memory requirement by a factor of approximately two and will also improve
the efficiency by roughly the same factor, since only half the elements need to
be calculated at each iteration. Such an algorithm is given by Barrodale and
Young (1966). This algorithm can be easily extended to functions of two or
more variables, using appropriate basis functions.
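For a discrete linear minimax fit the LP above can also be set up directly with a general-purpose solver. The sketch below uses scipy's linprog; since that solver accepts unbounded variables, the shift to the non-negative variables $\alpha_j$ described above is not needed, and the primal problem (10.204) is solved instead of the dual. It is only an illustration of the formulation, not the condensed-tableau algorithm of Barrodale and Young (1966).

```python
import numpy as np
from scipy.optimize import linprog

def minimax_fit(x, f, basis):
    """Discrete minimax (L_inf) fit of sum_j a_j*phi_j(x) to data (x, f):
    minimise w subject to |F(a, x_i) - f_i| <= w for all i."""
    x = np.asarray(x, dtype=float)
    f = np.asarray(f, dtype=float)
    Phi = np.column_stack([phi(x) for phi in basis])   # N x m design matrix
    N, m = Phi.shape
    # unknowns z = [a_1, ..., a_m, w]; objective: minimise w
    c = np.zeros(m + 1)
    c[-1] = 1.0
    # constraints in the form A_ub @ z <= b_ub:
    #   F(a, x_i) - f_i <= w   ->   Phi a - w <= f
    #  -(F(a, x_i) - f_i) <= w ->  -Phi a - w <= -f
    A_ub = np.vstack([np.hstack([Phi, -np.ones((N, 1))]),
                      np.hstack([-Phi, -np.ones((N, 1))])])
    b_ub = np.concatenate([f, -f])
    bounds = [(None, None)] * m + [(0, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:m], res.x[-1]   # coefficients and the minimax error w

# Example: minimax straight line through mildly perturbed data
x = np.linspace(0.0, 1.0, 21)
f = 1.0 + 2.0 * x + 0.01 * np.sin(40 * x)
coeff, err = minimax_fit(x, f, [lambda t: np.ones_like(t), lambda t: t])
print(coeff, err)
```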
For rational function approximation, the corresponding equations will
be nonlinear and we need to linearise the equations. These linearised equa-
tions can then be solved iteratively to obtain a sequence of approximations
$R^{(s)}_{mk}(x) = P^{(s)}_m(x)/Q^{(s)}_k(x)$, which will hopefully converge to the required minimax approximation $R^*_{mk}(x)$. We define

(S)
r rnk = max t f( Xi- p,;;) (Xi) t
l<i<N ) Q(S)'.)'
- - k lx, ( 10.208)
Pm(X)
ernk(X) = f(x) - Qk(X) ,

and at each step consider the quantity

(!emdXi)1 - T~~)Qdxi)
w = max ,
l<O;i<O;N Qks)(Xi)
(10.209)
If(Xi)Qk(Xi) - Pm(xi)1 - r~~Qk(xi)
= max
l<O;i<O;N Qks)(Xi)

The next approximation $R^{(s+1)}_{mk}(x)$ is obtained by minimising $w$, subject to the
additional constraint that $Q_k(x_i) > 0$ for all $i$. It is clear that $w$ can be negative,
because it is possible to have $|e_{mk}(x_i)| < r^{(s)}_{mk}$ for all $i$, unless $R^{(s)}_{mk}(x)$ is the
minimax approximation.
To cast this problem in the LP form, we can write the above equation as
$$|f(x_i)Q_k(x_i) - P_m(x_i)| - r^{(s)}_{mk}Q_k(x_i) \le w\,Q^{(s)}_k(x_i), \qquad 1\le i\le N. \qquad (10.210)$$
This form can be directly obtained by linearising the term $r_{mk}Q_k(x_i)$ on the
left-hand side of (10.208). This inequality is equivalent to
$$w\,Q^{(s)}_k(x_i) + r^{(s)}_{mk}Q_k(x_i) - \bigl[f(x_i)Q_k(x_i) - P_m(x_i)\bigr] \ge 0, \qquad
w\,Q^{(s)}_k(x_i) + r^{(s)}_{mk}Q_k(x_i) + \bigl[f(x_i)Q_k(x_i) - P_m(x_i)\bigr] \ge 0. \qquad (10.211)$$
Thus, the problem is to minimise $w$ subject to the constraints (10.211) and
$Q_k(x_i) > 0$. In addition, some normalising condition is required to fix the
coefficients of the rational function. For example, we may take $b_0 = 1$. This is a
LP problem, though not in the standard form, since $w$ can be negative and the coefficients
$a_j$ and $b_j$ could have either sign. Thus, to make $w$ positive, we can consider
$w' = r^{(s)}_{mk} + w$, which will be positive, while the coefficients can be made positive
by defining $\alpha_{m+k+2} = \max(0, -a_0, \ldots, -a_m, -b_1, \ldots, -b_k)$ as for the polynomial case. With these definitions, we get a LP problem of minimising $w'$ subject
to the constraints
$$\begin{aligned}
w'Q^{(s)}_k(x_i) &+ \bigl(r^{(s)}_{mk} - f_i\bigr)\sum_{j=1}^{k}\alpha_j x_i^j + \sum_{j=0}^{m}\alpha_{k+j+1}x_i^j
- \alpha_{m+k+2}\left(\bigl(r^{(s)}_{mk} - f_i\bigr)\sum_{j=1}^{k}x_i^j + \sum_{j=0}^{m}x_i^j\right)
\ge f_i - r^{(s)}_{mk} + Q^{(s)}_k(x_i)\,r^{(s)}_{mk},\\
w'Q^{(s)}_k(x_i) &+ \bigl(r^{(s)}_{mk} + f_i\bigr)\sum_{j=1}^{k}\alpha_j x_i^j - \sum_{j=0}^{m}\alpha_{k+j+1}x_i^j
- \alpha_{m+k+2}\left(\bigl(r^{(s)}_{mk} + f_i\bigr)\sum_{j=1}^{k}x_i^j - \sum_{j=0}^{m}x_i^j\right)
\ge -f_i - r^{(s)}_{mk} + Q^{(s)}_k(x_i)\,r^{(s)}_{mk},\\
&\sum_{j=1}^{k}\alpha_j x_i^j - \alpha_{m+k+2}\sum_{j=1}^{k}x_i^j > -1,
\end{aligned} \qquad (10.212)$$
for $1\le i\le N$. Here $\alpha_j = b_j + \alpha_{m+k+2}$ for $1\le j\le k$, and $\alpha_{k+j+1} = a_j + \alpha_{m+k+2}$
for $0\le j\le m$. It may be noted that the last inequality in the constraints does
not really conform to the LP form, but for all practical purposes we can replace
$>$ by $\ge$. Once again, it is more convenient to solve the dual problem. It may
be noted that the initial feasible vector for this problem is highly degenerate
with almost all variables vanishing.
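The differential correction iteration can likewise be sketched with a general LP solver. The following fragment is a simplified illustration, not the MINMAX subroutine of Appendix B: it solves the linearised constraints (10.210)-(10.211) in primal form with free coefficients and the normalisation $b_0 = 1$, replaces the strict inequality $Q_k(x_i) > 0$ by $Q_k(x_i) \ge \delta$ for a small arbitrary $\delta$, and uses a crude convergence test.

```python
import numpy as np
from scipy.optimize import linprog

def diff_correction(x, f, m, k, iters=20, delta=1e-6):
    """Sketch of differential correction for a discrete minimax rational
    approximation P_m(x)/Q_k(x) with Q normalised by b_0 = 1."""
    x = np.asarray(x, dtype=float)
    f = np.asarray(f, dtype=float)
    N = len(x)
    XP = np.vander(x, m + 1, increasing=True)           # x_i^j, j = 0..m
    XQ = np.vander(x, k + 1, increasing=True)[:, 1:]    # x_i^j, j = 1..k
    a = np.zeros(m + 1)
    b = np.zeros(k)                                     # start from Q = 1
    for _ in range(iters):
        Qs = 1.0 + XQ @ b                               # current Q^{(s)}(x_i)
        r = np.max(np.abs(f - (XP @ a) / Qs))           # current error r^{(s)}
        # unknowns z = [a_0..a_m, b_1..b_k, w]; minimise w
        c = np.zeros(m + 1 + k + 1)
        c[-1] = 1.0
        # constraint 1:  f_i Q(x_i) - P(x_i) - r Q(x_i) <= w Q^{(s)}(x_i)
        A1 = np.hstack([-XP, (f - r)[:, None] * XQ, -Qs[:, None]])
        b1 = -(f - r)
        # constraint 2: -(f_i Q(x_i) - P(x_i)) - r Q(x_i) <= w Q^{(s)}(x_i)
        A2 = np.hstack([XP, -(f + r)[:, None] * XQ, -Qs[:, None]])
        b2 = (f + r)
        # constraint 3:  Q(x_i) >= delta
        A3 = np.hstack([np.zeros((N, m + 1)), -XQ, np.zeros((N, 1))])
        b3 = np.full(N, 1.0 - delta)
        res = linprog(c, A_ub=np.vstack([A1, A2, A3]),
                      b_ub=np.concatenate([b1, b2, b3]),
                      bounds=[(None, None)] * (m + 1 + k + 1), method="highs")
        a, b = res.x[:m + 1], res.x[m + 1:m + 1 + k]
        if res.x[-1] > -1e-12:      # w no longer negative: no further improvement
            break
    return a, b
```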
The major advantage of the differential correction algorithm over the
Remes algorithm is that, it can easily take care of additional constraints on
the approximation of the form $R_{mk}(x_i) = f(x_i)$ at some selected points $x_i$. For
example, if we are approximating $f(x) = \sin x$, we would like to have $R_{mk}(0) = 0$
and $R_{mk}(\pi/2) = 1$. In addition, we may also like to have $|R_{mk}(x)| \le 1$ for all $x$.
Such constraints can also be imposed in the differential correction algorithm. It
should be noted that the usual minimax approximation may not satisfy these
constraints. For example, near $x = \pi/2$ a small error may result in $R_{mk}(x) > 1$.
Subroutine MINMAX in Appendix B provides an implementation of this
algorithm. This subroutine uses the subroutine SIMPX to solve the LP problem. Since this subroutine does not explicitly take care of degeneracy, for some
problems it may fail to converge, in which case a second attempt may be made
with a different starting approximation. For $k = 0$ the starting approximation
is superfluous, since the algorithm converges in the first iteration.
EXAMPLE 10.12: Find the minimax polynomial approximation to the data in Exam-
ple 10.2.
Minimax approximation may not be suitable for data with errors, but just for illustra-
tion we obtain a minimax approximation to this data. Using the subroutine MINMAX, we
obtain the coefficients of the required minimax polynomials for $m = 2, 3, \ldots, 9$. The results
Table 10.8: Minimax approximations using polynomials

 m  nc   Emax      χ²         a1        a2        a3       a4        a5       a6      a7     a8     a9   a10
 2   3  8.7205   2.2×10⁹     3.722   -31.783     35.3
 3   4  5.3332   8.6×10⁸   -10.332   125.616   -311.0    206.4
 4   5  1.3600   5.6×10⁷    -3.639   -63.206    569.4  -1139.6    651.7
 5   4  0.1122   6.4×10⁵    -5.100     8.164      9.3    406.6  -1097.2    693.9
 6  29  0.0046   5.3×10²    -4.999     0.008    104.9      0.7   -316.3      1.0   230.7
 7   0  0.0046   7.4×10³    -4.996     0.030    104.8      0.3   -314.0     -3.5   234.8    -1
 8  27  0.0045   9.4×10²    -4.994    -0.163    106.6     -7.0   -300.1    -13.9   231.6     8     -4
 9   0  0.0045   1.7×10⁴    -5.005    -0.197    106.5     -2.1   -334.2     94.3    43.6   192   -101    21

obtained using 24-bit arithmetic are displayed in Table 10.8, where $E_{\max}$ is the maximum value,
i.e., the $L_\infty$ norm of the residual. These results can be compared with those in Table 10.2 obtained using a least squares fit. It can be seen that the coefficients are quite similar and even
the value of $\chi^2$, which is just the sum of squared differences, is only marginally higher for the
minimax approximation. Since the least squares approximations are obtained by minimising
the $\chi^2$, it is natural to expect that $\chi^2$ is smaller for the least squares approximation. Never-
theless, comparing the coefficients with those of the polynomial used to generate the data, it
is clear that these results are somewhat better than those obtained using the least squares
approximation. Since we have used $x^j$ as the basis, we can expect some ill-conditioning for
higher values of m. This ill-conditioning may be avoided by using polynomials, which are
orthogonal over the interval or over the discrete set of points. The algorithm appears to be
having some problems when fitting polynomials of odd degree, possibly because of the symmetry
in the original polynomial. The approximations for $m = 7, 9$ are distinctly worse than those
of neighbouring degrees.
We can also consider the smoothing property of the approximation by computing the
derivatives at $x = 0.489796$. For $m = 6$ the derivatives are $f' = -6.1243$ and $f'' = -298.04$,
which are again comparable in accuracy to that obtained using the least squares approxima-
tion. To consider the effect of some outliers in the data, we again consider the second data
set in Example 10.2. For this data set the coefficients are substantially different. For m = 6
the coefficients are -17.5, 524.4, -3374.3,7103.4, -3018.1, -5117.8 and 3929.1, which are
distinctly worse than the corresponding results obtained using least squares approximation.
Hence, the minimax approximation is even more sensitive to the presence of outliers and
should not be used for data with significant errors.

This example illustrates that the minimax approximation for discrete data
can be considered if the error in individual values is very small, or if all the
errors are of comparable magnitude. Minimax approximation is very sensitive
to the presence of some outliers, where the error is much larger than aver-
age. Hence, in general, this approximation is not suitable for approximating
experimentally obtained data points. In any case, obtaining minimax approxi-
mation requires much more effort, as compared to that in finding least squares
approximations. Consequently, minimax approximations are rarely used to ap-
proximate discrete data. However, the algorithm described in this section can
be used to approximate a mathematical function which requires enormous ef-
fort to compute. In that case, this algorithm may be more efficient as compared
to the Remes algorithm, which may require many more function evaluations.
10.13 $L_1$-approximations
In Section 10.1, we found that approximations using the Ll norm are most
suitable for approximating data with substantial errors. Now we shall describe
an algorithm for obtaining such approximations. We restrict ourselves to the
discrete case with linear approximations. In principle, this algorithm can be ex-
tended to the nonlinear approximations by linearising the equations and solving
them iteratively. Convergence of iteration could be a problem in such cases.
The algorithm is very similar to that for the discrete Leo-approximation
considered in the previous section. Once again, the problem is reduced to a
LP problem. In order to ensure nonnegativity of the coefficients, we define the
new set of coefficients $\alpha_j$ introduced in Section 10.12. The error $e_i$ given by
(10.202) can be written as $e_i = u_i - v_i$, where $u_i \ge 0$ and $v_i \ge 0$, which gives
the $N$ constraints in nonnegative variables
$$\alpha_1\phi_{1,i} + \alpha_2\phi_{2,i} + \cdots + \alpha_{m+1}\phi_{m+1,i} - u_i + v_i = f_i, \qquad 1\le i\le N. \qquad (10.213)$$

The $L_1$-approximation problem is to find coefficients $\alpha_j$ such that
$\sum_{i=1}^{N}|e_i|$ is minimised, which leads to the LP problem of minimising $\sum_{i=1}^{N}(u_i + v_i)$
subject to (10.213). We can solve this problem using the simplex method.
The initial basis can be obtained from the constraints by noting that if $f_i \ge 0$,
then $v_i$ will be in the basis, otherwise $u_i$ will be in the basis. This problem
yields a tableau of size (N + 1) x (N + m + 2), which could be quite large, even
for moderate values of N. Further, at each iteration all elements need to be
recalculated, which involves considerable calculations. The effort as well as the
memory requirement can be reduced by noting that the columns corresponding
to $v_i$ and $u_i$ are not completely independent (Barrodale and Young, 1966). At
most one of the variables $u_i$ and $v_i$ can be in the basis at any stage. Further,
if $u_i$ is in the basis, then the simplex algorithm cannot exchange it with $v_i$.
Hence, the column corresponding to $v_i$ need not be considered. If both $u_i$ and
$v_i$ are among the right-hand side variables, then it can be seen that the cost
coefficients of $u_i$ and $v_i$ add up to 2, while the other entries in the tableau
corresponding to $u_i$ can be obtained from those of $v_i$ by just changing the sign.
Hence, there is no need to store both these columns. Thus, we need to keep
only one of these variables at any stage, which reduces the size of the tableau to
(N + 1) x (m + 2). This is a considerable saving, since in all practical cases
m « N, particularly when N is large.
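The same reduction can be written down directly with a general LP solver; the sketch below (illustrative, not the POLYL1/SIMPL1 routines, and without the condensed-tableau saving described above) minimises $\sum_i(u_i + v_i)$ subject to (10.213), keeping the polynomial coefficients as free variables. The short test at the end illustrates the insensitivity of the $L_1$ fit to a single outlier.

```python
import numpy as np
from scipy.optimize import linprog

def l1_fit(x, f, m):
    """Discrete L1 polynomial fit of degree m: minimise sum_i |P(x_i) - f_i|.
    The error is split as e_i = u_i - v_i with u_i, v_i >= 0 (cf. (10.213))
    and sum(u_i + v_i) is minimised."""
    x = np.asarray(x, dtype=float)
    f = np.asarray(f, dtype=float)
    N = len(x)
    Phi = np.vander(x, m + 1, increasing=True)          # basis x^j
    # unknowns z = [a_0..a_m, u_1..u_N, v_1..v_N]
    c = np.concatenate([np.zeros(m + 1), np.ones(2 * N)])
    A_eq = np.hstack([Phi, -np.eye(N), np.eye(N)])      # P(x_i) - u_i + v_i = f_i
    bounds = [(None, None)] * (m + 1) + [(0, None)] * (2 * N)
    res = linprog(c, A_eq=A_eq, b_eq=f, bounds=bounds, method="highs")
    return res.x[:m + 1]

# L1 fits are insensitive to a few outliers, unlike least squares:
rng = np.random.default_rng(1)
x = np.linspace(0, 1, 51)
f = 1.0 + 2.0 * x + 0.01 * rng.standard_normal(51)
f[25] += 5.0                                            # a single large outlier
print(l1_fit(x, f, 1))                                  # close to [1, 2]
```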
An Algol implementation of this algorithm is provided by Barrodale and
Young (1966), which also gives an example that can be followed to understand
the algorithm. Subroutine POLYL1 in Appendix B provides an implementa-
tion of this algorithm for obtaining L 1-approximation with polynomials as ba-
sis functions. This subroutine uses subroutine SIMPL1, which implements the
modified simplex method as explained above.
EXAMPLE 10.13: Find the polynomial $L_1$-approximation to the data in Example 10.2.
Using the subroutine POLYL1, we obtain the coefficients of the required polynomial
approximations for $m = 2, 3, \ldots, 9$. The results are displayed in Table 10.9, where $E_{\mathrm{sum}}$ is the
Table 10.9: $L_1$ polynomial approximations

 m  nc   Esum      χ²         a1       a2       a3      a4        a5      a6      a7     a8    a9
 2   4  141.87   1.6×10⁸    -7.068    42.91    -44.0
 3   4  131.84   8.9×10⁷   -10.236    85.15   -161.0     85
 4   5   37.00   6.6×10⁶    -1.848   -76.68    582.1  -1110     617.7
 5   5    2.89   7.2×10⁴    -5.344    12.58    -10.7    445   -1129.5    704
 6  30    0.12   5.6×10¹    -4.999     0.03    104.7          -316.7            230.7
 7  35    0.11   5.3×10¹    -4.999     0.09    103.4     10    -343.5     43    199.1     9
 8  32    0.11   7.0×10¹    -4.998     0.24     99.8     39    -458.4    286    -86.8   185    -44
Data set with outliers
 2   6  160.48   2.0×10⁸    -6.544    40.49    -42.0
 3   6  154.19   1.5×10⁸    -9.984    80.81   -145.7     73
 4   7   62.43   7.9×10⁷    -1.127   -83.28    601.7  -1136     630.8
 5  13   29.63   6.7×10⁷    -5.730    17.36    -33.6    492   -1171.4    717
 6  29   28.08   6.7×10⁷    -5.002     0.09    104.2      3    -320.2      4    229.6
 7  14   28.07   6.7×10⁷    -5.012     0.77     97.3     34    -391.1     93    172.6    15
 8  23   28.08   6.7×10⁷    -5.028     1.22     89.9     86    -581.6    477   -262.8   274    -63

$L_1$ norm of the residual. These results can be compared with those in Table 10.2 obtained
using a least squares fit. It can be seen that the coefficients are similar, though the value of
$\chi^2$, which is just the sum of squared differences, is somewhat higher for the $L_1$-approximation.
Since the subroutine POLYL1 uses $x^j$ as the basis functions, the problem is ill-conditioned
for higher values of $m$. This ill-conditioning may be avoided by using polynomials orthogonal
over the interval $[0, 1]$ or over the discrete set of points.
We can also consider the smoothing property of the approximation by computing the
derivatives at $x = 0.489796$. For $m = 6$ the derivatives are $f' = -6.12066$ and $f'' = -297.985$,
which are again comparable in accuracy to those obtained using the least squares approximation. To consider the effect of some outliers in the data, we again consider the second
data set considered in Example 10.2. The results are also shown in Table 10.9 and it can
be seen that the coefficients are essentially the same as those obtained using the first data set.
Using $m = 6$ the derivatives at $x = 0.489796$ are $f' = -6.12468$ and $f'' = -298.057$, which
are again comparable to those obtained from the first set, even though this point is fairly
close to the outlier in the data set as shown in Figure 10.2. These results can be compared with
those in Examples 10.2, 10.3 and 10.12, obtained using the least squares and the minimax
approximations, respectively. It is clear that the $L_1$-approximation is remarkably insensitive
to the presence of a few outliers in the data.

The above example clearly demonstrates that the L 1-approximations are


robust and insensitive to the presence of some outliers in data. The improve-
ment in the results clearly justify the extra effort required in computing the
L 1-approximations, as compared to the least squares approximations. Since
outliers are fairly common in experimentally obtained data, it may be better
to use L 1-approximations for approximating such data. The statistical analysis
of L 1 -approximation is much less developed than the corresponding analysis
for least squares approximations. The Monte Carlo technique discussed in Sec-
tion 10.3 can be applied to statistical analysis of L 1-approximations also. In
this technique, using the calculated coefficients, synthetic data sets are gen-
erated, where the error distribution is the same as that in the original data
set. Using these synthetic data sets, more L 1 -approximations are obtained and
the coefficients in these approximations can be compared with those used in
generating the synthetic data sets, thus giving some idea about the uncertainty
in coefficients due to random errors. However, calculating coefficients for L 1-
approximations requires significant effort and it will require considerable effort
to repeat this exercise using a large number of synthetic data sets to obtain a
good distribution of errors in the coefficients. But there may be no other alter-
native to estimate the errors in fitted coefficients. This technique can be easily
extended to nonlinear problem by linearising the approximating function about
an initial guess for the parameters and solving the resulting problem iteratively.

Bibliography
Abramowitz, M. and Stegun, I. A. (1974): Handbook of Mathematical Functions, With For-
mulas, Graphs, and Mathematical Tables, Dover, New York.
Acton, F. S. (1966): Analysis of Straight-Line Data, Dover, New York.
Acton, F. S. (1990): Numerical Methods That Work, Mathematical Association of America.
Antia, H. M. (1993): Rational Function Approximations for Fermi-Dirac Integrals, Astrophys.
J. Suppl., 84, 101.
Barrodale, I. and Young, A. (1966): Algorithms for Best Ll and Loo Linear Approximations
on a Discrete Set, Numer. Math., 8, 295.
Bellman, R., Kalaba, R. E. and Lockett, J. A. (1966): Numerical Inversion of the Laplace
Transform, Elsevier, New York.
Brigham, E. O. (1997): The Fast Fourier Transform and its Applications, Prentice-Hall,
Englewood Cliffs, New Jersey.
Brownlee, K. A. (1984): Statistical Theory and Methodology in Science and Engineering,
(2nd ed.), Krieger Publishing Company.
Burden, R. L. and Faires, D. (2010): Numerical Analysis, (9th ed.), Brooks/Cole Publishing
Company.
Chambers, J. M. (1977): Computational Methods for Data Analysis, John Wiley, New York.
Cheney, E. W. (1999): Introduction to Approximation Theory, (2nd ed.), American Mathe-
matical Society.
Clenshaw, C. W. (1962): Chebyshev Series for Mathematical Functions, National Physical
Laboratory Tables, 5, HMSO, London.
Cody, W. J. (1970): A Survey of Practical Rational and Polynomial Approximation of Func-
tions, SIAM Rev., 12, 400.
Cody, W. J., Fraser, W. and Hart, J. F. (1968): Rational Chebyshev Approximation Using
Linear Equations, Numer. Math., 12, 242.
Cody, W. J., Paciorek, K. A. and Thacher, H. C. (1970): Chebyshev Approximations for
Dawson's Integral, Math. Comput., 24, 171.
Cody, W. J. and Waite, W. (1981): Software Manual for Elementary Functions, Prentice-Hall,
Englewood Cliffs, New Jersey.
Crump, K. S. (1976): Numerical Inversion of Laplace Transforms Using a Fourier Series
Approximation, JACM, 23, 89.
Davis, P. J. (1975): Interpolation and Approximation, Dover, New York.
Elliott, D. F. and Rao, K. R. (1982): Fast Transforms: Algorithms, Analyses, Applications,
Academic Press, New York.
Farebrother, R. W. (1988): Linear Least Squares Computations, Marcel Dekker, New York.
Fike, C. T. (1968): Computer Evaluation of Mathematical Functions, Prentice-Hall, Engle-
wood Cliffs, New Jersey.
Fletcher, R. (2000): Practical Methods of Optimization, (2nd ed.), John Wiley, Chichester.
Fox, L. and Parker, I. B. (1968): Chebyshev Polynomials in Numerical Analysis, Oxford


University Press, London.
Gil, A., Segura, J. and Temme, N. M. (2007): Numerical Methods for Special Functions,
SIAM, Philadelphia.
Hamming, R. W. (1987): Numerical Methods for Scientists and Engineers, (2nd ed.), Dover,
New York.
Handscomb, D. C. (ed.) (1966): Methods of Numerical Approximation, Pergamon Press,
Oxford.
Hart, J. F., Cheney, E. W., Lawson, C. L., Maehly, H. J., Mesztenyi, C. K., Rice, J. R.,
Thacher, H. G. Jr. and Witzgall, C. (1968): Computer Approximations, SIAM Series in
Applied Mathematics, John Wiley, New York.
Jerri, A. J. (1992): Integral and Discrete Transforms With Applications and Error Analysis,
Marcel Dekker, New York.
Lanczos, C. (2010): Applied Analysis, Dover, New York.
Lawson, C. L. and Hanson, R. J. (1995): Solving Least Squares Problems, Society for Indus-
trial & Applied Mathematics.
Luke, Y. L. (1969): The Special Functions and Their Approximations, Vols. 1 and 2, Academic
Press, New York.
Luke, Y. L. (1975): Mathematical Functions and Their Approximations, Academic Press,
New York.
Lyusternik, L. A., Chervonenkis, O. A. and Yanpolskii, A. R. (1965): Handbook for Com-
puting Elementary Functions, Pergamon Press, Oxford.
Oldham, K. B., Myland, J. C. and Spanier, J. (2009): An Atlas of Functions: with Equator,
the Atlas Function Calculator, (2nd ed.) Springer, Berlin.
Press, W. H., Teukolsky, S. A., Vetterling, W. T. and Flannery, B. P. (2007): Numerical
Recipes: The Art of Scientific Computing, (3rd ed.) Cambridge University Press, New
York.
Rabinowitz, P. (1968): Applications of Linear Programming to Numerical Analysis, SIAM
Rev., 10, 121.
Ralston, A. and Rabinowitz, P. (2001): A First Course in Numerical Analysis, (2nd ed.),
Dover.
Ralston, A. and Wilf, H. S. (eds.) (1960): Mathematical Methods for Digital Computers,
Vols. 1 and 2, John Wiley, New York.
Rice, J. R. (1964): The Approximation of Functions, Vol. 1, Linear Theory, Addison-Wesley,
Reading, Massachusetts.
Rice, J. R. (1969): The Approximation of Functions, Vol. 2, Nonlinear and Multivariate
Theory, Addison-Wesley, Reading, Massachusetts.
Rice, J. R. (1992): Numerical Methods, Software, and Analysis, (2nd ed.), Academic Press,
New York.
Rutishauser, H. (1990): Lectures on Numerical Mathematics, Birkhäuser, Boston.
Wall, H. S. (1948): Analytic Theory of Continued Fractions, American Mathematical Society.
Wilkinson, J. H. and Reinsch, C. (1971): Linear Algebra: Handbook for Automatic Compu-
tation, Vol. 2, Springer-Verlag, Berlin.

Exercises
1. Consider the three functions

$$f_1(x) = 1, \qquad f_2(x) = 1 + \tfrac{1}{2}\sin(8\pi x), \qquad f_3(x) = \ldots \quad (a = 100),$$
on the interval $[0, 1]$. Find the three norms $L_1$, $L_2$ and $L_\infty$ for each of these functions.
2. Consider the data set $(x_i, f_i)$ with errors $\sigma_i$ for $i = 1, 2, \ldots, n$ and obtain the normal
   equations for the least squares straight line fit of the form $y = mx + c$. Solve these
   equations to get
   $$m = \frac{s_0t_1 - s_1t_0}{s_0s_2 - s_1^2}, \qquad c = \frac{s_2t_0 - s_1t_1}{s_0s_2 - s_1^2},$$
   where
   $$s_k = \sum_{i=1}^{n}\frac{x_i^k}{\sigma_i^2}, \qquad t_k = \sum_{i=1}^{n}\frac{f_ix_i^k}{\sigma_i^2}.$$
   Estimate the errors in $c$, $m$. Also consider the form $y = m(x - \bar x) + c$, where $\bar x = s_1/s_0$
   is the mean value of $x$.
   Apply this formula to find the least squares straight line fit to the following data with
   error 0.1:
       $x_i$:   2    3    4    5    6    7    8    9    10   11
       $f_i$:  3.9  3.8  4.1  4.0  4.4  4.3  4.3  4.5  4.6  4.6
3. In a large data set we can perform smoothing using a local approximation, which involves
   only a few neighbouring data points. Assuming that the abscissas are uniformly spaced,
   obtain the least squares parabola using five points $(x_i, f_i)$, $(i = k-2, k-1, \ldots, k+2)$.
   From this parabola obtain the smoothing formula
   $$F(x_k) = f_k - \frac{3}{35}(f_{k-2} - 4f_{k-1} + 6f_k - 4f_{k+1} + f_{k+2}) = f_k - \frac{3}{35}\delta^4f_k,$$
   where $\delta^mf_k$ denote the central differences defined by
   $$\delta f_k = f_{k+1/2} - f_{k-1/2}, \qquad \delta^mf_k = \delta^{m-1}f_{k+1/2} - \delta^{m-1}f_{k-1/2} \quad (m > 1).$$
   This formula can be used for smoothing data in the central part of the table. Derive the
   corresponding formula for the end points using the nearest five points to fit the parabola.
4. The five-point parabola of the previous problem can also be used to calculate the first
   derivative of the function at any point. Derive the following formulae for the first derivative:
   $$F'(x_k) = \frac{1}{10h}(-2f_{k-2} - f_{k-1} + f_{k+1} + 2f_{k+2}), \qquad (k = 2, 3, \ldots, N-2),$$
   $$F'(x_0) = \frac{1}{70h}(-54f_0 + 13f_1 + 40f_2 + 27f_3 - 26f_4),$$
   $$F'(x_1) = \frac{1}{70h}(-34f_0 + 3f_1 + 20f_2 + 17f_3 - 6f_4),$$
   $$F'(x_N) = \frac{1}{70h}(26f_{N-4} - 27f_{N-3} - 40f_{N-2} - 13f_{N-1} + 54f_N),$$
   $$F'(x_{N-1}) = \frac{1}{70h}(6f_{N-4} - 17f_{N-3} - 20f_{N-2} - 3f_{N-1} + 34f_N).$$

5. Using the five-point parabola of {3} obtain the numerical integration formula
   $$\int_{x_{k-2}}^{x_{k+2}}F(x)\,dx = \frac{h}{105}(44f_{k-2} + 104f_{k-1} + 124f_k + 104f_{k+1} + 44f_{k+2}),$$
   which is exact when $f(x)$ is a polynomial of degree three or less.


6. Using the following data, which gives the value of $\sqrt{x_i}$ rounded to two decimal places:
       $x_i$:   1     2     3     4     5     6     7     8     9    10
       $f_i$:  1.00  1.41  1.73  2.00  2.24  2.45  2.65  2.83  3.00  3.16
   obtain quadratic approximations of the form $F(x) = a_0 + a_1x + a_2x^2$. Consider approximations using the $L_1$, $L_2$ and $L_\infty$ norms. In all cases, calculate all the three error norms,
   i.e.,
   $$E_1 = \sum_{i=1}^{n}|f_i - F(x_i)|, \qquad E_2 = \left(\sum_{i=1}^{n}(f_i - F(x_i))^2\right)^{1/2}, \qquad E_\infty = \max_{1\le i\le n}|f_i - F(x_i)|.$$
   Compare these error norms for different approximations. Replace $f_3$ by 1.83 and $f_8$ by
   2.73 and repeat the exercise.
7. Obtain synthetic data sets using the functions
   $$f_1(x) = \sin(\pi x) + r, \qquad f_2(x) = \sin(5\pi x) + r, \qquad f_3(x) = e^x + r,$$
   where $r$ is a random number with normal distribution with zero mean and $\sigma = \epsilon$. Use
   $\epsilon = 10^{-3}, 10^{-5}$ and the points $x_i = (i-1)/50$ for $i = 1, 2, \ldots, 51$. In each case, obtain the
   least squares polynomial fit using polynomials of degree $m = 2, 3, \ldots$ and determine the
   optimal degree. Calculate the function as well as the first and second derivatives at $y_i =
   i/10$, $(i = 0, 1, \ldots, 10)$ using these approximations and compare the results with the exact
   values for the unperturbed function. Repeat this exercise using 100 different synthetic
   data sets and find the distribution of calculated coefficients. Use only the polynomial
   with optimal degree for obtaining this distribution. From this distribution estimate the
   error in each coefficient. Compare this with that calculated from the covariance matrix.
8. Repeat the above exercise with $r = \epsilon/u^3$, where $u$ is a uniformly distributed random
   number in the interval $(-0.5, 0.5)$ (hopefully $u \ne 0$). Use $\epsilon = 10^{-5}$, $10^{-7}$ and $10^{-10}$.
   Compare the results with those in the previous exercise.
9. Solve {4.11}, {4.22} and {4.23} using least squares polynomials instead of interpolation
and compare the results with those obtained using interpolation.
10. Show that the residual of a least squares solution of a system of overdetermined equations
    $Ax = b$ is orthogonal to the columns of $A$.
11. Consider
    $$A = \begin{pmatrix} 1 & 1.01\\ 1 & 1\\ 1.01 & 1\end{pmatrix}, \quad
      b_1 = \begin{pmatrix}2.01\\ 2\\ 2.01\end{pmatrix}, \quad
      b_2 = \begin{pmatrix}1.01\\ 4.01\\ 1.01\end{pmatrix}, \quad
      b_3 = \begin{pmatrix}-2\\ 1\\ 1\end{pmatrix}, \quad
      b_4 = \begin{pmatrix}1.01\\ -2\\ 1\end{pmatrix}.$$
    Find the least squares solution to $Ax_j = b_j$ for $j = 1, 2, 3, 4$. Compare the relative
    difference between $b_1$ and $b_2$ with that between $x_1$ and $x_2$. Similarly, compare the
    relative difference between $b_3$ and $b_4$ with that between $x_3$ and $x_4$. Explain the results
    using the SVD of the matrix $A$. To consider the effect of perturbation in the matrix
    coefficients, solve the above problems using the matrices
    $$A_1 = \ldots, \qquad A_2 = \begin{pmatrix}1 & 1.011\\ 1 & 1\\ 1.011 & 1\end{pmatrix},$$
    and explain the results.
12. Consider the least squares approximation to $f(x)$ over the interval $[0, \pi]$, by
    $$F(a, x) = \tfrac{1}{2}a_0 + \sum_{j=1}^{m}a_j\cos(jx).$$
    Derive the normal equations for this problem for the continuous case and show that the
    coefficients $a_j$ are given by
    $$a_j = \frac{2}{\pi}\int_0^{\pi}f(x)\cos(jx)\,dx.$$
    If instead the data are specified over a set of uniformly spaced points in the interval $[0, \pi]$,
    then show that the coefficients are given by
    $$a_k = \frac{2}{n}\left(\frac{1}{2}f_0 + \frac{(-1)^k}{2}f_n + \sum_{j=1}^{n-1}f_j\cos(jk\pi/n)\right),$$
    where $f_j = f(j\pi/n)$ for $j = 0, 1, \ldots, n$ and $n > m$. What happens if $m = n$?


13. Generate a synthetic data set using
    $$x_i = \frac{i-1}{50}\pi, \qquad f_i = \ldots + r, \qquad (i = 1, 2, \ldots, 51),$$
    where $r$ is a random number with normal distribution and $(\mu = 0,\ \sigma = 0.1)$. Using the
    formula of the previous problem, obtain least squares approximations by a series of cosine
    functions for $m = 1, 2, \ldots$ and find the optimum number of terms to be used in the series.
14. Obtain the least squares polynomial approximation to $\tan^{-1}x$ over the interval $[-1, 1]$.
For a polynomial of degree nine, compare the accuracy with other approximations of
similar form obtained in Examples 10.8 and 10.9. Compare the error curve of this ap-
proximation with that of the minimax approximation.
15. Repeat {7} and {8} using cubic B-spline basis functions. Compare the results with those
obtained using polynomials with same number of coefficients. Also try the same problem
using the truncated power basis to represent cubic spline (see section 4.5).
16. Use cubic B-spline expansion to obtain least squares approximations to the following
    data on points $x_i = (i-1)/50$ for $i = 1, 2, \ldots, 51$, using the knots $t_j$:
    (i) $f_i = \sin(\pi x_i)$, $t_j = j/10$, $(j = 0, 1, \ldots, 10)$;
    (ii) $f_i = 1 + x_i + x_i^2 + x_i^3$, $t_j = j/10$, $(j = 0, 1, \ldots, 10)$;
    (iii) $f_i = \sqrt{x_i}\,\sqrt{1 - x_i}$, $t_j = j/10$, $(j = 0, 1, \ldots, 10)$;
    (iv) $f_i = \sqrt{x_i}\,\sqrt{1 - x_i}$, $t_j = 0.5(1 - \cos(j\pi/10))$, $(j = 0, 1, \ldots, 10)$.
17. Repeat the previous exercise using linear or parabolic B-splines and compare the results.
18. Prove that orthogonal polynomials satisfy a recurrence relation of the form (10.40) and
    determine the coefficients $\alpha_j$, $\beta_j$.
19. Show that if $p_j^{(n)}(x)$, $(j = 0, 1, \ldots, n)$ are a set of orthogonal polynomials of degree $j$
    satisfying (10.36), then $p_n^{(n)}(x_i) = 0$ for $i = 1, 2, \ldots, n$.
20. Substituting $a_k$ from (10.42) in $F(a, x) = \sum_ja_jp_j(x)$ and using the recurrence relation
    for orthogonal polynomials, show that $F(a, x) = q_0(x)$. Differentiating (10.42), show that
    the derivative of the approximating function can be computed using the recurrence
    $$q_k'(x) = q_{k+1}(x) + (x - \alpha_{k+1})q_{k+1}'(x) - \beta_{k+1}q_{k+2}'(x), \qquad (k = m-1, m-2, \ldots, 0),$$
    with $q_m'(x) = q_{m+1}'(x) = 0$ and $q_0'(x) = F'(a, x)$.


21. Consider the least squares polynomial approximation for $f(x)$ using a weight function $w(x) =
    1/\sqrt{1-x^2}$ over the interval $[-1, 1]$. Show that the approximating polynomial is given by
    $$P_n(x) = \frac{1}{2}c_0 + \sum_{i=1}^{n}c_iT_i(x), \qquad c_i = \frac{2}{\pi}\int_{-1}^{1}\frac{f(x)T_i(x)}{\sqrt{1-x^2}}\,dx.$$

22. Using least squares approximations with appropriate weight functions, obtain the following
    expansions of $|x|$ over the interval $[-1, 1]$:
    $$|x| = \frac{1}{2} + \sum_{i=1}^{\infty}(-1)^{i+1}\frac{(2i-2)!}{(i-1)!(i+1)!}\,\frac{4i+1}{2^{2i}}\,P_{2i}(x),$$
    $$|x| = \frac{2}{\pi} + \frac{4}{\pi}\sum_{i=1}^{\infty}(-1)^{i+1}\frac{1}{4i^2-1}\,T_{2i}(x),$$
    $$|x| = \frac{1}{2} - \frac{4}{\pi^2}\sum_{i=1}^{\infty}\frac{1}{(2i-1)^2}\cos\bigl((2i-1)\pi x\bigr).$$
    Show that the ratio of the coefficients of $T_{2i}(x)$ and $P_{2i}(x)$ tends to $\sqrt{1/(\pi i)}$ as $i\to\infty$.
    Using the Chebyshev expansion and the relevant Parseval formula
    $$\sum_ic_i^2\gamma_i = \int_{-1}^{1}w(x)\bigl(f(x)\bigr)^2\,dx, \qquad \gamma_i = \int_{-1}^{1}w(x)\bigl(q_i(x)\bigr)^2\,dx,$$
    where $c_i$ are the coefficients of expansion in orthogonal polynomials $q_i(x)$, prove that
    $$\frac{\pi^2}{8} = 1 + 2\sum_{i=1}^{\infty}\frac{1}{(4i^2-1)^2}.$$

23. Generate synthetic data using the function
    $$f(x) = 0.5 + x + \frac{0.01\sin(20x + 0.5)}{0.5 + x^2} + \epsilon r,$$
    where $r$ is a random number with normal distribution and $\mu = 0$, $\sigma = 1$. Use 51 equally
    spaced points over the interval $[0, 1]$ to generate the values $f_i$, $i = 0, \ldots, 50$. Now calculate
    the second differences
    $$y_i = f_{i-1} - 2f_i + f_{i+1}, \qquad i = 1, \ldots, 49.$$
    Because of these relations the errors in $y_i$ are correlated. Calculate the covariance matrix
    for these data points, assuming that the errors in $f_i$ are uncorrelated. Use this covariance
    matrix to perform a nonlinear least squares fit to a function of the form
    $$y = \frac{a\sin(\omega x + \phi)}{1 + bx^2}$$
    to calculate the parameters $a$, $b$, $\omega$, $\phi$. Compare the results with those obtained without
    using the correlation between different data points. Use $\epsilon = 0.05, 0.02, 0.01, 0.005, 0.002, 0.001$.
    Estimate the errors in the fitted parameters directly or by repeating the fit several times.
    Try to fit $f_i$ directly, to a function including the linear trend. Note that the linear trend
    is wiped off when the second differences are taken and hence is not needed in fitting $y_i$.
24. Approximation by rational functions leads to nonlinear least squares approximations.
    Obtain a least squares rational function approximation of the form $R_{5,4}(x)$ for $\tan^{-1}x$
    over a set of points $x_i = -1 + 2i/n$, $(i = 0, 1, \ldots, n)$. Use $n = 100$ and compare the error in
    this approximation with the rational function approximations obtained in Examples 10.8-10.10.
25. Consider the following nonlinear least squares fit problem due to Walsh. Obtain an approximation of the form
    $$F(a, x) = \left(1 - \frac{a_1x}{a_2}\right)^{1/(a_1c) - 1} \qquad (c = 96.05),$$
    to fit the following data by minimising the $L_2$ norm of the difference:
        $x_i$:   2000    5000    10000   20000   30000   50000
        $f_i$:  0.9427  0.8616  0.7384  0.5362  0.3739  0.3096
    Repeat the problem using $x_i' = x_i/100000$ and compare the results.
26. In {16}, if the intermediate knots $t_i$, $(i = 1, \ldots, k-1)$ are also variables, then it is possible
    to obtain a nonlinear approximation with the coefficients $a_i$ and knots $t_i$ as parameters.
    Obtain the cubic B-spline least squares approximation to the data in the third set of {16}
    using 11 knots and compare the results with those in {16}.
27. Generate a synthetic spectrum using the function
    $$f(x) = 1 + x + \frac{x^2}{20} + \frac{400}{(x-5)^2 + 1} + \frac{800}{(x-10)^2 + 0.5}.$$
    Use $n = 400$ points $x_i = i/20$, $i = 1, \ldots, 400$. At each point the value $f_i = f(x_i)$ is chosen
    to be random with probability distribution given by $p_\mu(x) = \frac{1}{\mu}e^{-x/\mu}$ with $x > 0$ and
    $\mu = f(x_i)$. The spectrum consists of two Lorentzian peaks and a quadratic background.
    Use the maximum likelihood technique to estimate the parameters. Show that this requires
    minimising the function
    $$S(a_1, \ldots, a_9) = \sum_{i=1}^{n}\left(\ln\bigl(F(x_i)\bigr) + \frac{f_i}{F(x_i)}\right),$$
    where
    $$F(x) = a_1 + a_2x + a_3x^2 + \frac{a_4}{(x-a_5)^2 + a_6} + \frac{a_7}{(x-a_8)^2 + a_9}.$$

28. Using monomials in two variables $x_1^ix_2^j$ as the basis functions, obtain least squares ap-
proximation to the data in {4.33} and calculate the function value at the specified points.
Compare the results with those obtained using interpolation. Also try product B-spline
basis functions.
29. The subroutine for calculating the FFT of one complex data set can be used to find the FFT
    of two real data sets of equal length $N$ by considering the two sets as the real and the
    imaginary parts of the input data. If $h_j = f_j + ig_j$ is the complex input data, then show
    that the resulting FFT $H_k$ can be separated using
    $$F_k = \tfrac{1}{2}\bigl(H_k + \bar H_{N-k}\bigr), \qquad G_k = -\tfrac{i}{2}\bigl(H_k - \bar H_{N-k}\bigr),$$
    where $F_k$ and $G_k$ are the FFTs of $f_j$ and $g_j$, respectively.
30. The sine transform of real data $f_j$, $(j = 1, \ldots, N-1)$ can be obtained by considering the
    FFT of the auxiliary array
    $$g_0 = 0, \qquad g_j = \sin(j\pi/N)(f_j + f_{N-j}) + \tfrac{1}{2}(f_j - f_{N-j}), \qquad (j = 1, \ldots, N-1).$$
    Since the elements are real they can be considered as a complex array of length $N/2$ and
    its Fourier transform $G_k$ can be calculated. Noting that the first term is even while the
    second one is odd, show that the Fourier transform $G_k = (F_{2k+1} - F_{2k-1}) + iF_{2k}$, where
    $F_k$, $(k = 0, \ldots, N-1)$ is the required sine transform. Thus, the sine transform can be
    calculated using
    $$F_{2k} = \operatorname{Im}(G_k), \qquad F_{2k+1} = F_{2k-1} + \operatorname{Re}(G_k), \qquad (k = 0, 1, \ldots, N/2-1).$$
    The element $F_1$ can be calculated independently before using the above relations recursively. Similarly, show that the cosine transform of $f_j$ can be obtained using the auxiliary
    array
    $$g_j = \tfrac{1}{2}(f_j + f_{N-j}) - \sin(j\pi/N)(f_j - f_{N-j}), \qquad (j = 0, \ldots, N-1),$$
    which gives the FFT $G_k = F_{2k} + i(F_{2k+1} - F_{2k-1})$.


31. Consider the FFT algorithm for the case when the number of points $N = r^t$, where
    $r$ is an integer. Show that the number of operations required is $N(r-1)\log_rN$. Show
    that the coefficient of $N\ln N$ in this number is minimum when $r = 2$. For $N = 4^t$, some
    simplification is possible by noting that the powers of $\omega$ which are required are of special
    form. Compare the efficiency of this algorithm with that of the power of 2 FFT algorithm
    for the same value of $N$.
32. If the number of data points $N = r_1r_2\cdots r_t$, where $r_i$, $(i = 1, \ldots, t)$ are prime factors of
    $N$, then develop an FFT algorithm by expanding the indices $j$ and $k$ in the mixed radix
    form
    $$j = j_1 + r_1j_2 + r_1r_2j_3 + \cdots + r_1r_2\cdots r_{t-1}j_t, \qquad j_s = 0, 1, \ldots, r_s-1; \quad s = 1, \ldots, t;$$
    $$k = k_t + r_tk_{t-1} + r_tr_{t-1}k_{t-2} + \cdots + r_tr_{t-1}\cdots r_2k_1, \qquad k_s = 0, 1, \ldots, r_s-1; \quad s = 1, \ldots, t.$$
33. In Example 10.5, the Hanning window was used to reduce the leakage. Repeat the exercise
    using the following window functions and compare the results:
    $$w_j = 1 - \left|\frac{j - N/2}{N/2}\right| \quad \text{(triangular window)}, \qquad
      w_j = 1 - \left(\frac{j - N/2}{N/2}\right)^2 \quad \text{(parabolic window)}.$$
34. To consider the effect of gaps in data, consider a signal which is sampled at an interval of
    $1/N$ over $[0, 1]$. However, because of some reason the observations could not be recorded
    for some subinterval $[t_1, t_2]$ within this interval. We can fill these gaps by putting a value
    of zero to get an effective function
    $$f(t) = \begin{cases}0, & \text{if } t_1\le t\le t_2;\\ \sin(2\pi at), & \text{otherwise}.\end{cases}$$
    Find the DFT of the resulting data for $N = 256$, $a = 30, 30.5$ and $[t_1, t_2] = [1/4, 1/2]$,
    $[3/4, 1]$, $[3/8, 1/2]$, $[7/16, 1/2]$. Also find the Fourier transform of the window function,
    which can be defined as
    $$w(t) = \begin{cases}0, & \text{if } t_1\le t\le t_2;\\ 1, & \text{otherwise};\end{cases}$$
    and explain the effect of data gaps.
35. Consider the discrete convolution
    $$h_k = \sum_{j=-N/2}^{N/2-1}f_jg_{k-j}, \qquad f_j = \frac{\sin(\pi j/4)}{\pi j/4}, \qquad
      g_j = \begin{cases}0, & \text{if } |j|\ge 8;\\ (8-|j|)/8, & \text{otherwise}.\end{cases}$$
    Evaluate $h_k$ for $k = -N/2, \ldots, N/2-1$ by direct summation for $N = 128$. Also obtain
    the FFT of $f_j$ and $g_j$ and evaluate the convolution using the convolution theorem in the
    frequency domain and then calculate the inverse FFT to get the required convolution
    in the time domain. Compare the efficiency of the two processes. Do you get the same
    results in the two cases? What changes are required to get the correct results?
36. With $x_j = 2\pi j/(2L+1)$, show that
    $$\cos(kx_j) = \cos\bigl((2L+1-k)x_j\bigr) \qquad\text{and}\qquad \sin(kx_j) = -\sin\bigl((2L+1-k)x_j\bigr).$$
    Thus, deduce that over this set of $2L+1$ points the functions $\cos kx_j$ and $\sin kx_j$, for
    $k > L$, can be expressed in terms of $\cos nx_j$ and $\sin nx_j$ for $n\le L$.
37. Apply the sigma factors to the square wave function
    $$g(t) = \begin{cases}1, & \text{for } 0 < t < \pi;\\ 0, & \text{for } \pi < t < 2\pi;\end{cases} \qquad
      g(t + 2n\pi) = g(t), \quad (n = \pm 1, \pm 2, \ldots).$$
    The Fourier series for this function is
    $$g(t) = \frac{1}{2} + \frac{2}{\pi}\left(\sin t + \frac{1}{3}\sin 3t + \frac{1}{5}\sin 5t + \cdots\right).$$
    Truncate the series after 32 terms and apply the sigma factor to smooth the function.
    Plot both the raw and the smoothed function to see the effect of smoothing. Repeat the
    exercise using 64 terms. The Fourier series for the derivative $g'(t)$ diverges everywhere
    except at $t = (2k+1)\pi/2$. Show that applying the sigma factor to this series results in a
    convergent series and plot the function obtained using 32 terms.
38. Find the inverse Laplace transform of the function
    $$F(s) = \frac{s}{s^2+1}\,e^{-5s},$$
    and compare the result with the exact value
    $$f(t) = \begin{cases}0, & t < 5;\\ \cos(t-5), & t > 5;\end{cases}$$
    at $t = 0, 0.1, 0.2, \ldots, 10.0$. Because of the discontinuity at $t = 5$ the convergence is
    quite slow around $t = 5$ and there are oscillations as in the Gibbs' phenomenon for
    Fourier series. Remove the discontinuity by considering the modified function $F^*(s) =
    F(s) - e^{-5s}/s$ and compare the results with those obtained earlier.
39. Show that an integral equation of the form
    $$u(t) = f(t) + \int_0^tg(t-\tau)u(\tau)\,d\tau$$
    can be easily solved for $u(t)$ by taking the Laplace transform, which gives
    $$L(u) = \frac{L(f)}{1 - L(g)}.$$
    Solve the equation for $f(t) = 1$ and $g(t) = e^{-2t}$ and compare the results with the exact
    solution $u(t) = 2 - e^{-t}$.
40. Solve the differential equation with time delay
    $$u'(t) = u(t-1), \quad (t\ge 0); \qquad u(t) = 1, \quad (-1\le t\le 0),$$
    using the Laplace transform. Obtain $u(t)$ at $t = 0.1, 0.2, \ldots, 5$. Estimate the exponential
    order $\alpha$ for this function.
41. Consider the linear differential equation
    $$\frac{d^4u}{dt^4} - 22\frac{d^3u}{dt^3} + 39\frac{d^2u}{dt^2} + 22\frac{du}{dt} - 40u = 0,$$
    subject to the initial conditions
    $$u(0) = 1, \quad u'(0) = -1, \quad u''(0) = 1, \quad u'''(0) = -1.$$
    Here the initial conditions have been chosen such that the exact solution $u(t) = e^{-t}$
    is ill-conditioned and cannot be obtained using standard numerical integration methods.
    Solve the equation using the Laplace transform and show that the ill-conditioning can be avoided.
    Explain the results.
42. For m = k = 1, show that the Pade approximation to cos x does not exist.
43. Find the Padé approximation to $\cos x$ for $m = k = 6$. Show that it contains only even
    powers of $x$ and write it as $R_{3,3}(z)$, where $z = x^2$. Express this approximation as a
    continued fraction and estimate the roundoff error in evaluating this approximation.
44. Prove the recurrence (10.138) by noting that $A_{k+1}/B_{k+1}$ can be obtained from $A_k/B_k$
    by replacing $(x + D_k)$ by
    $$x + D_k + \frac{C_{k+1}}{x + D_{k+1}}.$$
45. Show that a change of argument $x = 2t - 1$, which converts the interval $[-1, 1]$ to $[0, 1]$, also
    converts the Chebyshev polynomials into the so-called shifted Chebyshev polynomials,
    which may be used if the interval is $[0, 1]$. Show that these shifted polynomials $T_j^*(x)$
    satisfy the recurrence relation
    $$T_{n+1}^*(x) = 2(2x-1)T_n^*(x) - T_{n-1}^*(x), \qquad (n\ge 1),$$
    and show that the first four polynomials are
    $$T_0^*(x) = 1, \quad T_1^*(x) = 2x-1, \quad T_2^*(x) = 8x^2 - 8x + 1, \quad T_3^*(x) = 32x^3 - 48x^2 + 18x - 1.$$

46. Prove the following relations for integrating Chebyshev polynomials:
    $$\int T_0(x)\,dx = T_1(x) + c, \qquad \int T_1(x)\,dx = \tfrac{1}{4}T_2(x) + c,$$
    $$\int T_n(x)\,dx = \frac{1}{2}\left(\frac{T_{n+1}(x)}{n+1} - \frac{T_{n-1}(x)}{n-1}\right) + c, \qquad (n > 1).$$
    Applying this result to the expansion in {22} show that
    $$\frac{\pi}{4} = 1 - 2\left(\frac{1}{3^2} - \frac{1}{15^2} + \frac{1}{35^2} - \cdots\right).$$


47. Find the coefficients of the expansions
    $$(1 - x^2)^{10} = \sum_{i=0}^{10}a_{2i}x^{2i} = \frac{1}{2}c_0 + \sum_{i=1}^{10}c_{2i}T_{2i}(x),$$
    and compare the maximum magnitudes of the coefficients in the two expansions. Which
    of these expansions has better roundoff properties on the interval $[-1, 1]$?
48. Consider the following truncated Maclaurin series:
    $$\cos x \approx 1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \cdots + \frac{x^{20}}{20!},$$
    $$e^x \approx 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots + \frac{x^{12}}{12!},$$
    $$\ln(1 + ax) \approx ax - \frac{(ax)^2}{2} + \frac{(ax)^3}{3} - \cdots - \frac{(ax)^{20}}{20} \qquad (a = 0.9).$$
    Estimate the truncation error in these approximations over the interval $[-1, 1]$ and
    economise the power series. How many terms can be dropped when the error due to
    economisation is less than half of the truncation error?
49. Find the Fourier cosine series for $|\cos\theta|$ in $0\le\theta\le\pi$, and compare it with the series in
    {22}, to verify that the Chebyshev series for $f(x)$ is exactly the Fourier series for $f(\cos\theta)$.
50. For $f(x) = e^x$ use trigonometric interpolation to estimate the coefficients of the Chebyshev expansion. Find the coefficients $c_0, c_1, \ldots, c_6$ using formulae based on 8, 16, 32-point
    trigonometric interpolation. Also try to evaluate these coefficients using the FFT algorithm
    for 32, 64, 128 points. Estimate the error in these calculated coefficients. Using this expansion obtain rational function approximations of the form $T_{2,2}(x)$ and $T_{4,0}(x)$. With these
    approximations as starting values, apply the Remes algorithm to obtain the corresponding
    minimax approximations over the interval $[-1, 1]$. Compare the maximum errors in these
    approximations.
51. Let $f(x) = \sum_{i=0}^{\infty}a_ix^i$ be the Maclaurin series for $f(x)$ and let $T_j(x) = \sum_{i=0}^{j}t_i^{(j)}x^i$ be
    the expansion of the Chebyshev polynomials. Substituting these expressions in the integrals
    for the coefficients of the Chebyshev expansion, show that
    $$c_j = \sum_ia_iR_i^{(j)},$$
    where
    $$R_{2j}^{(2n)} = \frac{(2j)!}{2^{2j-1}(j!)^2}\left(t_0^{(2n)} + t_2^{(2n)}\frac{2j+1}{2j+2} + \cdots + t_{2n}^{(2n)}\frac{(2j+1)(2j+3)\cdots(2j+2n-1)}{(2j+2)(2j+4)\cdots(2j+2n)}\right),$$
    $$R_{2j+1}^{(2n+1)} = \frac{(2j)!}{2^{2j-1}(j!)^2}\left(t_1^{(2n+1)}\frac{2j+1}{2j+2} + t_3^{(2n+1)}\frac{(2j+1)(2j+3)}{(2j+2)(2j+4)} + \cdots + t_{2n+1}^{(2n+1)}\frac{(2j+1)(2j+3)\cdots(2j+2n+1)}{(2j+2)(2j+4)\cdots(2j+2n+2)}\right).$$
    Approximate the coefficients for $f(x) = e^x$ using these results and compare the results
    with those in the previous problem.
52. Show that the minimax polynomial approximation of degree less than $n$ to $f(x) = x^n$ on
    the interval $[-1, 1]$ is a polynomial of degree $n-2$. Find this polynomial. What happens
    if the interval is $[0, 1]$?
53. Find the polynomial approximation which interpolates $f(x) = 1/(1 + x^2)$ at the zeros
    of $T_2(x)$. Using the form of the truncation error in interpolation, show that the maximum
    error is $1/2$. Show that actually the maximum error over the required interval is $1/3$.
    Find the minimax polynomial of degree less than two over the interval $[-1, 1]$ and find
    the maximum error in this approximation.
54. Using the orthogonality of the Chebyshev polynomials, show that interpolation at the
    zeros of $T_n(x)$ can be achieved by the polynomial
    $$P_{n-1}(x) = \frac{1}{2}c_0 + \sum_{i=1}^{n-1}c_iT_i(x), \qquad c_i = \frac{2}{n}\sum_{j=1}^{n}f(x_j)T_i(x_j),$$
    where $x_j$ are the zeros of $T_n(x)$. If $f(x)$ is a polynomial of degree $n$, show that $P_{n-1}(x)$
    is the minimax polynomial of degree $n-1$. Verify this result explicitly for $f(x) = 1 +
    x + x^2 + x^3$.
55. Finite Chebyshev expansions can be obtained by using the Lanczos $\tau$-method. Consider
    the expansion
    $$\frac{1}{1+x^2} \approx \sum_{i=0}^{n}a_iT_i(x).$$
    Multiplying the equation by $(1 + x^2)$ and writing the resulting polynomial in terms of
    $T_i(x)$, we can hope to find $a_j$ by equating the coefficients of $T_i(x)$ on both sides. Show
    that it gives rise to a system of $n+3$ equations in $n+1$ unknowns, which has no solution.
    If the symmetry of the function is used, the number of equations and that of unknowns
    can be reduced, but that does not solve the problem. This difficulty can be overcome by
    introducing perturbations, to get
    $$(1 + x^2)\sum_{i=0}^{n}a_iT_i(x) = 1 + \tau_1T_{n+1}(x) + \tau_2T_{n+2}(x).$$
    Solve the resulting system of $n+3$ equations for the $n+3$ unknowns $\tau_1$, $\tau_2$ and $a_j$, $(j =
    0, \ldots, n)$. Use $n = 9$ and compare the coefficients with those in Example 10.9. Using the
    values of $\tau_i$ estimate the maximum error in this approximation.
56. Show that the Chebyshev series for sin⁻¹ x is

sin⁻¹ x = (4/π) Σ_{k=0}^{∞} T_{2k+1}(x) / (2k+1)².

Using this series show that the maximum error in minimax approximation by a polyno-
mial of degree less than or equal to n (where n is an even integer) over [−1, 1] is greater
than (π/2)/(n+1)². Show that if a polynomial approximation with a maximum absolute er-
ror of less than 2⁻²⁵ is required, the degree of the polynomial must be at least 7259. Obtain
rational function approximations of the form T_{9,0}(x) and T_{5,4}(x) from the Chebyshev
expansion. Using them as the initial approximation, obtain the corresponding minimax
approximations. Compare the error in these approximations. Obtain an approximation
of the form R_{3,3}(x) over the interval [0, 0.5] and use that to obtain the approximation
over the entire range using

sin⁻¹ x = π/2 − 2 sin⁻¹ √((1 − x)/2).

57. Asymptotic expansions in terms of powers of 1/x can also be transformed to a series in
T_i(1/x), and in many cases the transformed series may give much better convergence
over a wider range. Consider the asymptotic expansion for e^{x²} erfc(x) given in {1.4}
and estimate the maximum accuracy that can be achieved at x = 1 and x = 2. The
corresponding Chebyshev expansion is (see Clenshaw, 1962)

√π x e^{x²} erfc x ≈ .8598866396 T0(z) − .1169978010 T2(z) + .0180164275 T4(z)
− .0037711095 T6(z) + .0009412014 T8(z) − .0002644669 T10(z) + .0000810840 T12(z)
− .0000266056 T14(z) + .0000092225 T16(z) − .0000033463 T18(z) + .0000012623 T20(z)
− .0000004924 T22(z) + .0000001979 T24(z) − .0000000816 T26(z) + .0000000345 T28(z)
− .0000000149 T30(z) + .0000000065 T32(z) − .0000000029 T34(z) + .0000000013 T36(z)
− .0000000006 T38(z) + ···

where z = 1/x. Using this expansion obtain an approximation which has a maximum
error of less than 2⁻²⁵ for |x| > 1. Write this approximation in the usual polynomial
form and compare the coefficients of 1/x^j with those in the asymptotic form. Using this
Chebyshev expansion obtain a rational function approximation of the form T_{6,6}. What
is the maximum error in this approximation?
58. Obtain the minimax polynomial approximation to tan⁻¹ x over [−1, 1] using the exchange
algorithm applied to a finite set of points. Using a set of 101 uniformly spaced points find
the minimax polynomial of degree nine. Also find a rational function approximation of
the form R_{5,4}(x) and compare it with similar approximations in Example 10.10.
59. The function value f(x) corresponding to the argument x is said to be unstable if
xf′(x)/f(x) is very large and amplifies small errors in x. Show that cos x is unstable
near x = π/2. Similarly, show that cos⁻¹ x is unstable near |x| = 1. Using single preci-
sion library routines, compute x₁ = cos⁻¹(cos x₀), for x₀ = 1.57079, 0.7854, 10⁻³ and
10⁻⁶ and explain the results.
60. A common iterative method for computing √a for a ∈ (0, ∞) is the Newton-Raphson
iteration

x_{n+1} = (1/2)(x_n + a/x_n).

To get a good starting value for this iteration, it is possible to use a rational function
approximation. Writing a = 2^{2m} b, where 1/4 < b ≤ 1 and m is an integer, obtain
minimax approximations of the form R_{1,1}(x) and R_{2,2}(x) for √x over [1/4, 1]. What is
the maximum error in these approximations? How many Newton-Raphson iterations will
be required to obtain a relative accuracy of 2⁻⁶⁴ with these starting values?
61. Obtain approximations to the error function erf(x) and the complementary error function
erfc(x) (see {1.4})

erf(x) = (2/√π) ∫₀^x e^{−t²} dt,    erfc(x) = 1 − erf(x) = (2/√π) ∫_x^∞ e^{−t²} dt.

Using the symmetry erf(x) = −erf(−x), the range can be restricted to nonnegative values.
This range can be further split into two parts [0, a] and [a, ∞) to find rational function
minimax approximations (with respect to relative error) of the form

erf(x) ≈ x R_{m,k}(x²),    erfc(x) ≈ 1 − x R_{m,k}(x²)    (0 ≤ x ≤ a);
erfc(x) ≈ (e^{−x²}/x) R_{m′,k′}(1/x²),    erf(x) ≈ 1 − (e^{−x²}/x) R_{m′,k′}(1/x²)    (a ≤ x < ∞).

Use a = 2, m = k = 4, m′ = k′ = 3. For the higher range you may use the approximation
obtained in {57} as the starting approximation.
62. Obtain minimax approximations to the Bessel functions J₀(x) and J₁(x) over the interval
[0, 8]. Use the form R_{5,5}(x²) for J₀(x) and x R_{5,5}(x²) for J₁(x). Use these approximations
to find the zeros of these functions in the interval [0, 8] and compare them with the exact
zeros of the Bessel functions.
63. Obtain approximations to the Fermi integrals (see {6.29} or Antia 1993)

F_n(x) = ∫₀^∞ t^n dt / (e^{t−x} + 1),

for n = ±1/2, 3/2. Divide the range into two parts and seek minimax approximations of
the form

F_n(x) ≈ e^x R_{m,k}(e^x),    −∞ < x < 2;
F_n(x) ≈ x^{n+1} R_{m′,k′}(1/x²),    2 ≤ x < ∞.
Choose the order of approximation such that the maximum relative error is less than
10⁻⁷.
64. Dawson's integral (see Cody et al., 1970)

D(x) = e^{−x²} ∫₀^x e^{t²} dt,

appears in a variety of applications including spectroscopy, heat conduction, and electrical


oscillations in certain vacuum tubes. Obtain a recurrence relation between the derivatives
of this function at x = 0 and using that get the Maclaurin series

D(x) = x Σ_{j=0}^{∞} (−1)^j j! (2x)^{2j} / (2j+1)!,    |x| < ∞.

This series is not very useful at large values of x because of heavy roundoff error. Using
the QD algorithm obtain the continued fraction expansion
D(x) = x / (1 + (2x²/3) / (1 − (4x²/15) / (1 + (6x²/35) / (1 − ··· (−1)^{k+1} 2kx²/((2k−1)(2k+1)) / (1 + ···)))))
This continued fraction can be used to evaluate the function at any required value of x,
though for large x it requires a large number of terms. Obtain a minimax approximation
of the form

D(x) ≈ x R_{m,k}(x²),    |x| ≤ 3;
D(x) ≈ (1/x) R_{m′,k′}(1/x²),    |x| ≥ 3;

with a maximum error of less than 10⁻⁷.
65. Consider the normal probability integral

P(x) = (1/√(2π)) ∫₀^x e^{−t²/2} dt.

Obtain a minimax approximation of the form R_{5,5}(x) over the interval [0, ∞), which
satisfies the additional constraints R_{5,5}(0) = P(0) = 0 and R_{5,5}(x) → P(x) → 1/2
as x → ∞. Use Remes algorithm with the coefficients a₀ = 0, b₅ = 1 and a₅ = 1/2
to satisfy the constraints. Determine the remaining coefficients to obtain the minimax
approximation with 10 extrema.
66. Repeat exercises {7} and {8} using minimax polynomial approximations.
67. The algorithm for obtaining linear minimax approximations for discrete data can be
easily extended to functions of several variables. Using the basis functions x^i y^j for
i, j = 0, 1, 2, ..., m obtain a minimax approximation to f(x, y) = exp(x² + y²) over the
region 0 ≤ x ≤ 1, 0 ≤ y ≤ 1. Use a set of discrete points (x_i, y_j) = (i/n, j/n) for
i, j = 0, 1, ..., n. Try n = 10, 20 and m = 2, 3, 4. For each of these approximations obtain
the extrema of the error. What can you infer about the number of extrema in the error curve
for minimax approximations in more than one variable?
68. Repeat exercises {7} , {8} and {28} using polynomial approximations based on the L1
norm.
69. Obtain the L₁-approximation by a polynomial of degree nine to tan⁻¹ x over the interval
[−1, 1] using the function values at a set of 101 uniformly spaced points. Compare the
error in this approximation with other approximations calculated in Example 10.9.
70. Calculate a synthetic data set using the function

f(x, y, z) = sin(x) sin(y) sin(z) + 10⁻⁵ r,

where r is a random number having normal distribution with unit variance and zero mean.
Use a set of 20 points over the interval [0, 1] with uniform spacing in x, y, z to calculate
the table of values f(x_i, y_j, z_k). Use this table of values to calculate least squares approx-
imations with B-spline basis functions. Also obtain the minimax and L₁-approximations
with the same set of basis functions and compare the results with the exact function obtained
by removing the random error.
Chapter 11

Algebraic Eigenvalue Problem

One of the important problems in linear algebra is the determination of eigen-


values and eigenvectors of a matrix. Eigenvalue problems in differential and
integral equations can also be approximated by algebraic eigenvalue problems.
In spite of the simplicity of its formulation, many algorithms are required to deal
efficiently with a wide spectrum of problems, which are encountered in prac-
tice. The variety in problems arises because of the variety in type of matrix
and the varying requirements from the solution. The classification of matrices
into different types has been described in Section 3.1 and will not be repeated
here. The varying requirements from the solution arise because we may need
all eigenvalues and eigenvectors or only some of them. Alternately, we may be
interested in only eigenvalues. In some problems, we may be interested in only
the eigenvalue with the largest magnitude, or the one with the largest real part,
while in other problems we may require all eigenvalues in a given region of the
complex plane. In general, there is some correlation between the type of matrix
and the amount of information required from the solution. For example, with
large matrices which usually arise from discretisation of differential or integral
equations, we are normally interested in only a few eigenvalues and possibly
the eigenvectors. On the other hand, for small matrices we normally want all
eigenvalues and some or all eigenvectors.
In this chapter, we shall consider only a few methods for calculating eigen-
values and eigenvectors. Efficient and reliable algorithms for solving eigenvalue
problems tend to be fairly complex and as a result, we do not give a full descrip-
tion of some of the algorithms and the corresponding error analysis. For
more details readers can refer to the excellent treatise by Wilkinson (1988).
Reliable software for solution of eigenvalue problems is readily available. Al-
gol programs for eigenvalue problems can be found in Wilkinson and Rein-
sch (1971), which will be referred to as the Handbook in this chapter. Fortran
programs can be found in Smith et al. (1976) and Garbow et al. (1977). Other
freely available software libraries include LINPACK (Dongarra et al. 1987) and
LAPACK (Anderson et al. 1987).

In Section 11.1, we describe some of the basic properties of the algebraic


eigenvalue problem. A simple method for finding the dominant eigenvalue and
the corresponding eigenvector will be considered in Section 11.2. In Section 11.3,
we describe the inverse iteration method. These methods are useful for find-
ing a few eigenvalues and eigenvectors of a large matrix. In Sections 11.4 and
11.5, we consider the eigenvalue problem for a real symmetric matrix, while
Sections 11.6-8 deal with general unsymmetric matrices. Finally, Section 11.9
gives a brief discussion of errors in computed eigenvalues and eigenvectors.

11.1 Introduction
The fundamental algebraic eigenvalue problem is the determination of those
values of λ for which the set of n homogeneous linear equations in n unknowns

Ax = λx,    (11.1)

has a nontrivial solution. It can be easily seen that x = 0 trivially satisfies all
the equations. Further, for an arbitrary value of λ this is the unique solution.
Hence, the aim is to find those special values of λ for which a nontrivial
solution exists. It can be easily shown that a nontrivial solution to this system
of equations exists if, and only if, the corresponding matrix A − λI is singular, that is

det(A − λI) = 0.    (11.2)

Here I is the unit matrix of order n. This is referred to as the characteristic
equation. By expanding the determinant, this characteristic equation can be
written as a polynomial of degree n in λ. Thus, a nontrivial solution exists only
when λ is one of the roots of the characteristic polynomial. These roots are
referred to as eigenvalues of the matrix A. Corresponding to any eigenvalue
λ, the system of equations (11.1) has at least one nontrivial solution x, which
is referred to as the eigenvector corresponding to that eigenvalue. If the matrix
(A − λI) has a rank less than (n − 1), then there will be more than one
independent eigenvector corresponding to the eigenvalue. Since the system of
equations (11.1) is homogeneous, the solution is arbitrary to the extent of a
constant multiplier. Thus, if x is an eigenvector corresponding to λ, then αx
(where α is a nonzero constant) is also an eigenvector corresponding to the same
eigenvalue. It is convenient to choose the multiplier such that the eigenvector
is normalised in some sense. The most convenient normalisations are

(i)  Σ_{i=1}^{n} |x_i|² = 1,    or    (ii)  max(x) = 1.    (11.3)

The first choice has the disadvantage of being arbitrary to the extent of a
complex multiplier of modulus unity. The second choice gives a unique vector,

provided the maximum component is unique and is assumed to have a value


+ 1. If the maximum component is not unique, then the first such component
can be scaled to + 1.
The eigenvector as defined above is also referred to as the right eigenvector,
as opposed to the left eigenvector, which is the solution of

y^T A = λ y^T.    (11.4)
Throughout the book the term eigenvector without any qualification will refer
to the right eigenvector. It can be seen that left eigenvector of a matrix A is the
right eigenvector of the matrix AT corresponding to the same eigenvalue. Since
the determinant of a matrix is equal to that of its transpose, the eigenvalues of
AT are the same as those of A. It can be shown that the two sets of eigenvectors
satisfy the following biorthogonality relation

y_i^T x_j = 0    if λ_i ≠ λ_j,    (11.5)

where x_i and y_i are respectively the right and left eigenvectors corresponding
to λ_i. Note that, since the eigenvectors could be complex, the product above
is not the true inner product as usually understood (e.g., if x = (1, i)^T, then
x^T x = 0).
In principle, the eigenvalues can be computed by solving the characteristic
equation using any of the methods described in Chapter 7. Once the eigenvalue
is known, the eigenvector can be computed by solving any n - 1 equations in
the system (11.1) for any n - 1 components, in terms of the remaining compo-
nent (provided this component is nonzero). In fact, this technique is normally
taught in the traditional mathematics courses. However, both these procedures
often lead to trouble and should be avoided. Obtaining the coefficients of the
characteristic polynomial usually requires more effort than solving the entire
eigenvalue problem using other techniques. Apart from the loss in efficiency,
the characteristic equation often turns out to be ill-conditioned, even when the
eigenvalue problem itself is well-conditioned. Similarly, direct solution of linear
equations to find eigenvector often leads to numerical instability, as illustrated
in Example 11.4.
The general theory of the eigensystem is simplest when the n eigenvalues
are distinct, and we first consider that situation. In that case, it can be shown
that the eigenvector corresponding to any eigenvalue is unique apart from a
constant multiplier. Further, the n eigenvectors are linearly independent and
span the entire space. Thus, any arbitrary vector v can be expressed as a linear
combination of the eigenvectors Xl, ... , Xn
n
V = LO:iXi, (11.6)
i=l

where the coefficients O:i can be obtained using the biorthogonality (11.5)
yTv
0: j - J -·
- -T (11. 7)
Yj Xj

Here the denominator is nonzero, since otherwise x_j would be orthogonal to the
complete set y₁, ..., yₙ. If the matrix A is real and symmetric, then the two
sets of eigenvectors are identical and we need not distinguish between the left
and right eigenvectors. In that case, the eigenvectors form a complete set of
orthogonal vectors. Further, it can be shown that for a real symmetric (or a
complex Hermitian) matrix all eigenvalues are real. As a result, the eigenvalue
problem is simplest for this class of matrices.
For a general matrix with distinct eigenvalues, we can choose the arbitrary
multipliers in the eigenvectors x_i and y_j, such that

y_i^T x_j = δ_{ij}    (i, j = 1, 2, ..., n).    (11.8)

If X is the matrix with x_i as the ith column, and Y is the matrix with y_j as
the jth column, then we can show that

X⁻¹AX = Y^T AX = diag(λ_i),    (11.9)

where diag(λ_i) is a diagonal matrix with elements λ_i along the diagonal. In
general, a transform of the form H⁻¹AH of the matrix A, where H is a non-
singular matrix, is referred to as a similarity transform, while the matrices
A and H⁻¹AH are said to be similar. It can be shown that the eigenvalues of a
matrix are invariant under a similarity transform, while the eigenvectors are
multiplied by H⁻¹. Thus, if x is an eigenvector of A, then H⁻¹x is an eigenvector
of H⁻¹AH. Most of the methods for numerical solution of the eigenvalue
problem use similarity transform to reduce the matrix to a simpler form which
can be handled easily. In particular, if we can find a similarity transform which
reduces the matrix to a diagonal form, then we have solved the eigenvalue prob-
lem, since the diagonal elements will give the eigenvalues while the columns of
H will give the eigenvectors. Further, since similarity is a transitive property,
we can use a sequence of similarity transformations to reduce the matrix to a
simpler form.
In the above discussion, we have assumed that the eigenvalues of the
matrix are distinct. If the matrix has multiple eigenvalues, then in general, it
may not be possible to reduce the matrix to a diagonal form using similarity
transform. For example, consider the r x r matrix

           ( a  1            )
           (    a  1         )
C_r(a) =   (       a  1      )    (11.10)
           (          ·  ·   )
           (             a  1)
           (                a)

where all elements outside the two diagonals are zero. This matrix has all
eigenvalues equal to a, and only one eigenvector x = el, the first column of the

identity matrix. This result follows from the fact that the matrix C_r(a) − aI
has a rank r − 1. This matrix also has only one left eigenvector, i.e., y = e_r, and

y^T x = e_r^T e₁ = 0,    (11.11)

which is in contrast with the result for matrices with distinct eigenvalues, where
this product cannot vanish (at least for real y). It can be shown that the matrix
Cr(a) cannot be reduced to a diagonal form by a similarity transformation.
This matrix is called a simple Jordan matrix of order r. It can be shown that
a general matrix can be reduced to the so-called Jordan canonical form using
similarity transformations (Wilkinson, 1988). This canonical form consists of
simple Jordan submatrices isolated along the diagonal with all other elements
equal to zero.
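As a quick numerical check of this fact (an illustrative sketch, not from the text; the order r = 5 and the value a = 2 are arbitrary assumptions), we can form C_r(a) in Python and verify that C_r(a) − aI has rank r − 1, so that only one independent eigenvector exists:

    import numpy as np

    def jordan_block(a, r):
        """Simple Jordan matrix C_r(a): a on the diagonal, 1 on the superdiagonal."""
        return a * np.eye(r) + np.diag(np.ones(r - 1), k=1)

    r, a = 5, 2.0                          # assumed illustrative values
    C = jordan_block(a, r)
    # rank(C - aI) = r - 1, so the null space of C - aI is one-dimensional:
    # the only eigenvector (up to a constant multiplier) is e_1.
    print(np.linalg.matrix_rank(C - a * np.eye(r)))    # prints 4 = r - 1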
If a matrix with multiple eigenvalues can be reduced to a diagonal form
by similarity transformations, then corresponding to every eigenvalue Ai with
multiplicity mi, there exist mi independent eigenvectors. In this case, the ma-
trix A - A;I has a rank n - mi. Further, any linear combination of these mi
eigenvectors is also an eigenvector corresponding to the same eigenvalue. Thus,
in this case, the eigenvectors corresponding to multiple eigenvalues are not
unique. However, it is possible to find a set of n eigenvectors which are lin-
early independent and span the entire space. For such matrices, the situation is
not essentially different from those with distinct eigenvalues. It can be shown
that all real symmetric matrices fall in this class. Hence, such matrices can be
reduced to a diagonal form using similarity transformations and the Jordan
canonical form has all elementary matrices of order one. Consequently, they
are also referred to as matrices with linear elementary divisors.
If the Jordan canonical form for a matrix includes some elementary ma-
trices of order larger than one, then the corresponding matrix is said to have
nonlinear elementary divisors. For such matrices, the set of eigenvectors do
not form a complete set and they are also referred to as defective. A matrix for
which the Jordan canonical form has only one elementary divisor corresponding
to every distinct eigenvalue, is called non derogatory. For nonderogatory matri-
ces, the eigenvector corresponding to every eigenvalue is unique. If eigenvector
corresponding to some eigenvalue is not unique, then the matrix is said to be
derogatory.
For a general matrix with real coefficients, eigenvalues could be complex.
Some bound on the location of eigenvalues is provided by the Gerschgorin
theorem. Let A = [a_ij] be an n × n matrix and C_i, for i = 1, 2, ..., n, be the
circular disks with centres a_ii and radii

r_i = Σ_{k=1, k≠i}^{n} |a_ik|.    (11.12)

Further, let

D = ∪_{i=1}^{n} C_i.    (11.13)

Then the Gerschgorin theorem can be stated as follows. All the eigenvalues of
A lie within the domain D. Thus every eigenvalue of the matrix A lies in at
least one of the circular disks Ci . This theorem is valid, even when the elements
of matrix are complex.
This theorem can be easily proved by considering the eigenvalue equation

Σ_{j=1}^{n} a_ij x_j = λ x_i,    (11.14)

where the index i is chosen such that |x_j| ≤ |x_i| for j ≠ i. Dividing the equation
by x_i, we get

λ − a_ii = Σ_{j≠i} a_ij (x_j / x_i),    or    |λ − a_ii| ≤ Σ_{j≠i} |a_ij| = r_i.    (11.15)

Thus the eigenvalue λ is inside the circle C_i.
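The theorem is easily tried out numerically (an illustrative sketch, not part of the text; the 3 × 3 test matrix is an arbitrary assumption): compute the centres and radii of (11.12) and check each eigenvalue against the disks.

    import numpy as np

    def gerschgorin_disks(A):
        """Centres a_ii and radii r_i = sum_{k != i} |a_ik| of the Gerschgorin disks."""
        A = np.asarray(A)
        centres = np.diag(A)
        radii = np.abs(A).sum(axis=1) - np.abs(centres)
        return centres, radii

    A = np.array([[4.0, 1.0, 0.2],     # arbitrary test matrix (an assumption)
                  [0.5, 3.0, 0.1],
                  [0.1, 0.2, 1.0]])
    centres, radii = gerschgorin_disks(A)
    for lam in np.linalg.eigvals(A):
        # every eigenvalue must lie in at least one disk |lambda - a_ii| <= r_i
        print(lam, any(abs(lam - c) <= r for c, r in zip(centres, radii)))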


The eigenvalue problem defined by (11.1) can be generalised in various
ways to cover a wider spectrum of problems. The simplest of these generalisa-
tions is a problem of the form Ax = )..Bx, where A and B are square matrices
of order n. This problem is usually referred to as the generalised eigenvalue
problem. If the matrix B is nonsingular, then multiplication from left by B- 1
converts this problem to the standard form. If the matrix B is positive definite,
then this problem can be reduced to the standard form by using the Cholesky de-
composition of B (see Section 3.3). If B = LL^T is the Cholesky decomposition,
then the generalised eigenvalue problem can be transformed to

(L⁻¹ A (L^T)⁻¹) (L^T x) = λ (L^T x).    (11.16)
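A minimal sketch of this reduction (for illustration only, not the book's routine; the 2 × 2 matrices A and B below are arbitrary assumptions, with B positive definite):

    import numpy as np

    def reduce_generalised(A, B):
        """Reduce A x = lambda B x, with B = L L^T positive definite, to the
        standard form (L^{-1} A L^{-T}) y = lambda y, where y = L^T x."""
        L = np.linalg.cholesky(B)               # B = L L^T
        Linv = np.linalg.inv(L)
        return Linv @ A @ Linv.T, L

    A = np.array([[2.0, 1.0], [1.0, 3.0]])      # arbitrary symmetric A (assumption)
    B = np.array([[4.0, 1.0], [1.0, 2.0]])      # arbitrary positive definite B (assumption)
    C, L = reduce_generalised(A, B)
    lam, Y = np.linalg.eigh(C)                  # standard symmetric eigenvalue problem
    X = np.linalg.solve(L.T, Y)                 # recover x from y = L^T x
    print(lam)                                  # generalised eigenvalues of (A, B)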
A more generalised form is obtained if the elements of the matrix are
polynomials of degree less than or equal to m in λ. This problem can be stated
as

(B₀ + λB₁ + λ²B₂ + ··· + λ^m B_m) x = 0,    (11.17)

where B_i, (i = 0, ..., m) are n × n matrices. Such problems often arise in
hydrodynamic stability analysis. This problem can be reduced to the earlier
form:

( 0    I    0    ···   0       ) ( x₀      )       ( I    0    ···   0     0    ) ( x₀      )
( 0    0    I    ···   0       ) ( x₁      )       ( 0    I    ···   0     0    ) ( x₁      )
( ···  ···  ···  ···   ···     ) (  ···    )  = λ  ( ···  ···  ···   ···   ···  ) (  ···    )    (11.18)
( 0    0    0    ···   I       ) ( x_{m−2} )       ( 0    0    ···   I     0    ) ( x_{m−2} )
( B₀   B₁   B₂   ···   B_{m−1} ) ( x_{m−1} )       ( 0    0    ···   0    −B_m  ) ( x_{m−1} )

Here I is an n × n unit matrix and x_i, (i = 0, 1, ..., m − 1) are vectors of length
n. It can be easily seen that this equation leads to

x_i = λ x_{i−1},  (i = 1, 2, ..., m − 1);    B₀x₀ + B₁x₁ + ··· + B_{m−1}x_{m−1} + λ B_m x_{m−1} = 0;    (11.19)

which gives (11.17) with Xo as the required eigenvector. Thus, this problem
can be transformed to the previous form, but with a matrix of size nm and the
number of eigenvalues will be nm. This is expected, since the determinant of
(11.17) can be expanded to yield a polynomial of degree nm in A. Reduction of
generalised eigenvalue problem to a standard form may be useful, since good
software for solution of standard problem is often available.
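For example, for m = 2 the reduction leads to a 2n × 2n problem. The sketch below (an illustration under the conventions of (11.18), not the book's code; the test matrices B₀, B₁, B₂ are arbitrary assumptions) builds the block matrices and solves the resulting generalised eigenvalue problem with a general-purpose routine, checking the residual of the original quadratic problem:

    import numpy as np
    from scipy.linalg import eig

    def quadratic_eig(B0, B1, B2):
        """Solve (B0 + lam B1 + lam^2 B2) x = 0 via the companion linearisation
        [[0, I], [B0, B1]] z = lam [[I, 0], [0, -B2]] z with z = (x, lam x)."""
        n = B0.shape[0]
        I, Z = np.eye(n), np.zeros((n, n))
        lam, z = eig(np.block([[Z, I], [B0, B1]]),
                     np.block([[I, Z], [Z, -B2]]))
        return lam, z[:n, :]                    # first n components of z give x

    # arbitrary 2x2 test matrices (assumptions, for illustration only)
    B0 = np.array([[2.0, 0.0], [0.0, 3.0]])
    B1 = np.array([[0.0, 1.0], [-1.0, 0.0]])
    B2 = np.eye(2)
    lam, X = quadratic_eig(B0, B1, B2)
    for l, x in zip(lam, X.T):                  # residual of the original problem
        print(l, np.linalg.norm((B0 + l * B1 + l * l * B2) @ x))

As expected, the linearised pencil yields nm = 4 eigenvalues, each of which satisfies the original quadratic problem to roundoff level.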
However, if the coefficients of the matrix are transcendental functions of
λ, then the problem cannot be reduced to the standard form. In this case, the
problem can be written in the form A(λ)x = 0, where each coefficient of the
matrix A is some function of λ. For such problems, the number of eigenvalues
may even be infinite and there may not be any theory to help us in reducing the
problem to a simpler form. In this case, the eigenvalues can be calculated by
finding zeros of the determinant using any method described in Chapter 7. For a
given value of A, the determinant can be calculated using Gaussian elimination.
Explicit expansion of the determinant should be avoided as far as possible, since
that is likely to be ill-conditioned. Once the eigenvalue is found, the eigenvector
can be found using the inverse iteration method described in Section 11.3. The
same technique may also be applied to other generalised eigenvalue problems,
if only a few eigenvalues and eigenvectors are required.

11.2 Power Method


In this section, we describe a simple iterative method to find the dominant
eigenvalue and the corresponding eigenvector of a matrix. Let us assume that
all the elementary divisors of the matrix A are linear. Hence, the eigenvectors
of A form a complete set and any arbitrary vector u₀ can be written as

u₀ = Σ_{i=1}^{n} α_i x_i,    (11.20)

where α_i are some constants and x_i is the eigenvector of A corresponding to
the eigenvalue λ_i. Let us consider a sequence of vectors v_s and u_s defined by

v_{s+1} = A u_s,    u_{s+1} = v_{s+1} / max(v_{s+1}),    (11.21)
where we use the notation max(x) to denote the element of vector x which
has the maximum magnitude. The second operation is a simple normalisation,
which enables us to check convergence of the sequence of vectors u_s. It can be
easily seen that

u_s = A^s u₀ / max(A^s u₀).    (11.22)

Thus, apart from a normalising factor, the sequence u_s is obtained by repeated
premultiplication with the matrix A. Using the expansion (11.20), u_s is given by

u_s = Σ_{i=1}^{n} α_i λ_i^s x_i,    (11.23)
530 Chapter 11. Algebraic Eigenvalue Problem

apart from some normalising factor. Now if we assume that the eigenvalues are
sorted in descending order of magnitude, that is |λ₁| > |λ₂| ≥ |λ₃| ≥ ··· ≥ |λₙ|,
then provided α₁ ≠ 0, we get

lim_{s→∞} u_s = x₁ / max(x₁)    and    lim_{s→∞} max(v_s) = λ₁.    (11.24)

Thus we get both the dominant eigenvalue λ₁ and the corresponding eigenvector
x₁. This method is known as the power method.
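A minimal sketch of the iteration (11.21) (an illustration only, not a library routine; the convergence test and iteration limit are assumptions, and the Rayleigh quotient discussed later in this section is included for comparison):

    import numpy as np

    def max_elem(x):
        """The element of x with the largest magnitude (the max(.) of the text)."""
        return x[np.argmax(np.abs(x))]

    def power_method(A, u0, tol=1e-7, maxit=200):
        """Power method: v_{s+1} = A u_s, u_{s+1} = v_{s+1} / max(v_{s+1})."""
        u = np.asarray(u0, dtype=float)
        lam_old = 0.0
        for _ in range(maxit):
            v = A @ u
            lam = max_elem(v)                 # estimate of the dominant eigenvalue
            u = v / lam                       # normalised estimate of the eigenvector
            if abs(lam - lam_old) <= tol * abs(lam):
                break
            lam_old = lam
        rc = (u @ A @ u) / (u @ u)            # Rayleigh quotient (better for symmetric A)
        return lam, u, rc

    A = np.array([[2.0, 1.0], [1.0, 3.0]])    # arbitrary test matrix (an assumption)
    print(power_method(A, [1.0, 1.0]))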
If |λ₂/λ₁| ≪ 1, then the other terms in the summation will drop out very fast
and the iteration converges rapidly to the eigenvector. However, if this ratio
is close to unity, the convergence could be very slow. In the above discussion,
we have tacitly assumed that there is a unique eigenvalue which is dominant.
In general, we can have a situation where |λ₁| = |λ₂| = ··· = |λ_r|. If the
dominant eigenvalue is multiple, then the convergence to the eigenvector is not
affected and the iteration converges to some eigenvector corresponding to the
dominant eigenvalue. Thus, if we have

λ₁ = λ₂ = ··· = λ_r    and    |λ_r| > |λ_{r+1}| ≥ ··· ≥ |λ_n|,    (11.25)

then

u_s ∝ Σ_{i=1}^{r} α_i x_i + Σ_{i=r+1}^{n} α_i (λ_i/λ₁)^s x_i,    (11.26)

and the iteration will clearly converge to Σ_{i=1}^{r} α_i x_i, which is some vector in the
subspace spanned by the eigenvectors corresponding to the multiple eigenvalue
λ₁. The limit will however depend on the initial vector u₀. Thus, it is possible to
find all the linearly independent eigenvectors by starting with different initial
vectors. However, if |λ₁| = |λ₂|, but λ₁ ≠ λ₂, then this iteration will not
converge to any eigenvector. This situation can arise if λ₁ = −λ₂ or when the
eigenvalues are complex. The nature of the dominant eigenvalues can be detected
by considering the behaviour of the iteration and it is possible to modify the
iteration to produce both the eigenvalues and the eigenvectors simultaneously
(Wilkinson, 1988). However, it is quite cumbersome to detect all possible cases
in a computer program.
To start with, if α₁ = 0, that is, u₀ does not have any component along
x₁, then this iteration should converge to the next dominant eigenvalue for
which the coefficient α_i ≠ 0. However, in practical computation roundoff error
will invariably introduce a small component of x₁ in the vector. This com-
ponent gets magnified as the iteration proceeds and ultimately the iteration
converges to x₁. Nevertheless, a good approximation to λ₂ and x₂ may be
achieved before the component of x₁ starts dominating. As we will see later,
once an approximate value of λ₂ is obtained we can use inverse iteration to get
the accurate eigenvalue and eigenvector. Hence, the power method can also be
used for finding the subdominant eigenvalues and eigenvectors. This idea can
be continued further by starting with an initial vector which is orthogonal to
both x₁ and x₂, but in that case, the convergence to the next eigenvector may

be quite poor. If the matrix has integer elements and the eigenvalues as well as
the components of eigenvectors are also integers, then the iteration may con-
verge to arbitrary eigenvalue if the initial vector happens to coincide with the
corresponding eigenvector. In this case, there may be no roundoff error during
calculations.
We can improve the convergence by shifting the eigenvalues, which can be
achieved by using the matrix (A − pI) instead of A. It can be shown that the
eigenvectors of the two matrices are the same, while if λ_i is an eigenvalue of A,
then λ_i − p is an eigenvalue of (A − pI). The convergence of the power method can be
improved by choosing the constant p suitably. If all eigenvalues of A are real and
p is also real, then irrespective of the value of p either λ₁ − p or λₙ − p will be the
dominant eigenvalue. Here it is assumed that λ₁ > λ₂ ≥ ··· ≥ λ_{n−1} > λₙ. For
convergence to x₁ the optimum value of p is (λ₂ + λₙ)/2, and the convergence
is determined by the rate at which [(λ₂ − λₙ)/(2λ₁ − λ₂ − λₙ)]^s tends to zero.
This shift can make a considerable difference in some cases. But unfortunately
the choice of the optimum value of p requires a knowledge of the distribution of
eigenvalues. In many cases, it is possible to estimate the value of λ₂ by looking
at the rate at which the iteration is converging. In fact, this is a good example
of a method which in the hands of experts can prove to be very effective, while
if used haphazardly it may not lead anywhere. An interesting discussion of
various shifting strategies is given by Wilkinson (1988). Unfortunately, it is
difficult to write a computer program which chooses the optimum value of p
automatically.
Another use of shifting technique is for annihilating the component cor-
responding to some eigenvector. For example, if we have found the dominant
eigenvalue λ₁ and wish to iterate for the next dominant eigenvalue, then as
mentioned earlier, because of roundoff error a small component of the domi-
nant vector is generated, which grows rather fast. To suppress this component,
we can perform one iteration with p = λ₁, which will clearly reduce this compo-
nent drastically. Even in this technique, it is difficult to decide in an automatic
program when to use this value of p, in order to allow the iteration to converge
to the subdominant eigenvalue.
An alternate strategy for calculating the subdominant eigenvalue is de-
flation {8}, where we suppress the known eigenvalues similar to the deflation
technique for finding roots of polynomials. Roundoff error may pose a serious
problem for deflation techniques and it may not be possible to proceed beyond
eliminating the first few eigenvalues, unless the eigenvalues are found in the
right order. Wilkinson (1988) has given several deflation techniques as well as
a detailed error analysis for each of them.
To accelerate the convergence of the power method, we can also try
Aitken's δ² method (Section 6.2). This method is applicable, since it can be
shown that the power method converges linearly. Once again, using this technique in
an automatic program is not very straightforward, since applying this technique
before the results have started converging may not yield any useful results.

For a real symmetric matrix the Rayleigh quotient can yield much better
results. If the matrix A is real and symmetric, then the eigenvectors are or-
thogonal and we can consider them to be orthonormal by selecting appropriate
multipliers. In that case

u_s^T A u_s = Σ_{i=1}^{n} α_i² λ_i^{2s+1},    (11.27)

and

u_s^T u_s = Σ_{i=1}^{n} α_i² λ_i^{2s},    (11.28)

apart from some normalising factor, which gives

u_s^T A u_s / (u_s^T u_s) = (Σ_{i=1}^{n} α_i² λ_i^{2s+1}) / (Σ_{i=1}^{n} α_i² λ_i^{2s}).    (11.29)

This quantity is known as the Rayleigh quotient. It can be seen that in this
quotient the extraneous terms tend to zero as (λ₂/λ₁)^{2s} instead of (λ₂/λ₁)^s
in the normal power method. Hence, the convergence to the dominant eigen-
value is twice as fast. In fact, for symmetric matrices this method is preferable
to Aitken's method. The definition of the Rayleigh quotient can in principle
be extended to unsymmetric matrices by considering both the left and right
eigenvectors {11}, but in that case, the amount of calculation will be doubled.
If the matrix has some nonlinear divisors, the eigenvectors do not form
a complete set and the power method may not work. It can be shown that if
the dominant eigenvalue corresponds to a linear divisor, then power method
will still converge to the dominant eigenvalue. But if the dominant eigenvalue
corresponds to a nonlinear divisor, then the convergence of iteration will be
very slow ({3}) and it is not possible to use this method for finding eigenvalues
corresponding to nonlinear divisors.
EXAMPLE 11.1: Find the eigenvalues and eigenvectors of the following matrix using the
power method

A = ( ~
0.5 0.25
(11.30)

Applying the power method without any shift we get the results shown in Table 11.1.
For the first set, an arbitrary starting vector is used and the iteration converges to the
dominant eigenvalue 2.536526 in 28 iterations. The calculations were performed using 24-bit
arithmetic and the iterations were continued until the eigenvalue as given by max(us) has
converged to all 24 bits. The final result is correct to almost seven significant figures. The
last column in the table gives the Rayleigh quotient R_c. It can be seen that R_c converges
much faster to the eigenvalue and in general, for a given accuracy it requires about half
as many iterations as that required by the plain power method. For the second set, the
iteration was started with a vector which is orthogonal to the eigenvector. The iteration
initially converges to the next eigenvalue 1.480122. This convergence is very rapid, since the
third eigenvalue is ~ 0.016. However, the eigenvalue does not converge to all the 24 bits and
after the fourth iteration, the vector starts drifting away from this eigenvector, towards the

Table 11.1: Eigenvalue and eigenvector using power method

u₀ = (1, 1, 1)^T                                     u₀ = (−1, 0, 0.7482212)^T


s max(v s ) Eigenvector Rc s max(v s ) Eigenvector Rc

1 2.750000 .9090909 .8181818 2.500000 1 0.996442 -.6281240 -.8158472 0.879227


2 2.659091 .8376068 .7435897 2.524834 2 1.481976 -.6369678 -.8056614 1.479994
3 2.604701 .7990156 .7030353 2.532516 3 1.480101 -.6368683 -.8057757 1.480122
4 2.575267 .7774150 .6803377 2.535157 4 1.480122 -.6368691 -.8057742 1.480121
5 2.558792 .7651082 .6674058 2.536059 5 1.480122 -.6368687 -.8057737 1.480121
6 2.549406 .7580253 .6599633 2.536367 6 1.480122 -.6368680 -.8057730 1.480121
7 2.544003 .7539253 .6556550 2.536472 7 1.480123 -.6368667 -.8057716 1.480121
8 2.540876 .7515440 .6531527 2.536507 8 1.480124 -.6368646 -.8057694 1.480121
9 2.539060 .7501582 .6516966 2.536520 9 1.480125 -.6368609 -.8057656 1.480121
10 2.538003 .7493508 .6508482 2.536524 10 1.480128 -.6368546 -.8057589 1.480121
11 2.537387 .7488801 .6503536 2.536525    15 1.480221 -.6366464 -.8055402 1.480121
12 2.537028 .7486056 .6500651 2.536525    20 1.481589 -.6335765 -.8023143 1.480124
13 2.536819 .7484455 .6498969 2.536526    25 1.501400 -.5897353 -.7562465 1.480552
16 2.536584 .7482657 .6497080 2.536526    30 1.726309 -.1625901 -.3074077 1.566500
20 2.536533 .7482263 .6496666 2.536526    35 2.344149 .5889586 .4823100 2.484882
22 2.536528 .7482229 .6496630 2.536526    40 2.520849 .7361527 .6369798 2.536278
24 2.536527 .7482218 .6496618 2.536526    45 2.535450 .7473980 .6487962 2.536525
25 2.536526 .7482215 .6496615 2.536526    50 2.536453 .7481654 .6496026 2.536526
26 2.536526 .7482213 .6496614 2.536526    55 2.536521 .7482174 .6496572 2.536526
27 2.536526 .7482213 .6496613 2.536526    60 2.536525 .7482209 .6496609 2.536526
28 2.536526 .7482212 .6496612 2.536526    62 2.536526 .7482210 .6496611 2.536526

dominant eigenvector found in the first set. Ultimately after 62 iterations it converges to the
dominant eigenvector. The correct value of the second eigenvalue and eigenvector to eight
significant figures is λ₂ = 1.4801214 and x₂ = (−0.63686975, −0.80577481, 1)^T. In this case,
since the magnitudes of second and third eigenvalues differ by a factor of almost 10, it is
possible to find a very good approximation to the second eigenvalue and eigenvector before
the component corresponding to the dominant eigenvalue becomes significant.
To find the third eigenvalue, we can iterate with a shift of p = 2.54, which ensures
that the third eigenvalue dominates. In this case, the iteration converges in 22 iterations
to the eigenvalue -0.01664734 and the eigenvector (1, -0.9516674, -0.1299598)T. Here the
eigenvector is accurate to seven significant digits, but the correct eigenvalue is -0.0166472836.
This is not the optimum shift for this eigenvalue and it can be seen that the optimum value
of p is close to 2. Using p = 2 the power method requires 18 iterations to get the same result.
Thus, there is a marginal improvement. Similarly, for the first eigenvalue the optimum value
of p should be around 0.7, which requires 19 iterations to converge to the dominant eigenvalue
as compared to 28 iterations required without shift.
It may be noted that in this case, once two eigenvalues are known, the third can be
easily estimated using the well-known result that the trace of the matrix equals the sum of
all eigenvalues (i.e., a₁₁ + a₂₂ + ··· + a_nn = λ₁ + λ₂ + ··· + λₙ).

The main advantage of the power method is its simplicity. It can be eas-
ily applied to sparse matrices, since the only operation that is required is the
matrix multiplication. It may appear that roundoff error will be rather low
in this method, since the matrix is handled directly. However, Wilkinson has
shown that the roundoff error due to just one matrix multiplication is in gen-
eral, comparable to that in the entire solution of the eigenvalue problem using

the better methods. If the eigenvectors are ill-conditioned, then it is possible


that U s +l = Us to working precision, even when it is still far from the exact
eigenvector. This spurious convergence is illustrated by the following example.
EXAMPLE 11.2: Consider the eigenvalue problem for the matrix

A = ( 0.9900001     2 × 10⁻⁷  )
    ( −1 × 10⁻⁷     0.9900004 ).    (11.31)

The correct eigenvalues and eigenvectors for this matrix are:

λ₁ = 0.9900003,  x₁ = (1, 1)^T;    λ₂ = 0.9900002,  x₂ = (1, 0.5)^T.    (11.32)

If we take u_s = (1, 0.9)^T, exact computations give

v_{s+1} = (0.99000028, 0.89100026)^T,    u_{s+1} = (1, 0.90000000808...)^T,    max(v_{s+1}) = 0.99000028.    (11.33)

The difference between the components of u_{s+1} and u_s is less than 2⁻²⁴ and if we are using
24-bit arithmetic, the computed u_{s+1} may be exactly equal to u_s and the iteration may
terminate. It can be verified that many different starting vectors u_s show this behaviour.
In fact, for starting vectors of the form u_s = (1, a)^T, the iteration essentially converges to
an eigenvector which can be obtained by linear interpolation between the two eigenvectors.
In this case, the result also depends on the convergence criterion used to terminate the
iteration, because the eigenvalue and eigenvector converge at widely different rates. If a
simple convergence criterion, which checks for the change in eigenvalue alone is considered,
then the iteration may converge to a completely arbitrary eigenvector. For example, starting
with u₀ = (1, 0)^T the iteration converges to λ = 0.9900001 and u_s = (1, −2.02 × 10⁻⁷)^T,
even though the second component of the eigenvector has changed in the last iteration. On the
other hand, if the convergence criterion is applied to the eigenvector, then the iteration fails
to converge in any reasonable number of iterations as the eigenvector is changing very slowly.
It may be noted that it is possible to obtain accurate eigenvectors, provided we perform
a shift by considering A − 0.99 I. However, this is a rather special case, since the matrix is
2 × 2. In general, when |λ₁ − λ₂| ≪ |λ₁ − λₙ|, the difficulty involved in computing x₁ and
x₂ is quite fundamental and cannot be overcome by such a simple device.

11.3 Inverse Iteration


The main drawback of the power method is the slow rate of convergence. We
have already considered the technique of shift of origin to accelerate the con-
vergence. This technique is the simplest transformation in the eigenvalue that
can be performed. A more interesting transformation is obtained if we use the
matrix (A - pI) -1 instead of (A - pI). This process is known as inverse iteration
or the inverse power method. In practice, the inverse need not be calculated
explicitly, since the iteration can be defined by

(11.34)

The first step is achieved by solving a system of linear equations to get the
vector V s +l. For this purpose, we can use Gaussian elimination or the triangular
decomposition, and note that the elimination has to be performed only for the
first iteration. In subsequent iterations, we only have to solve the system of

equations with a different right-hand side, which requires essentially the same
amount of computation as that required in matrix multiplication. Thus, apart
from the first iteration, other iterations in this method require the same effort
as that in the power method, provided the value of p is not changed.
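A minimal sketch of inverse iteration with a fixed shift p (an illustration only, not subroutine INVIT; the use of scipy's LU routines and the convergence test are assumptions), in which A − pI is factorised only once:

    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    def inverse_iteration(A, p, u0, tol=1e-7, maxit=100):
        """Inverse iteration (11.34): solve (A - pI) v_{s+1} = u_s and normalise.
        The LU decomposition of A - pI is computed once and reused at every step."""
        n = A.shape[0]
        lu, piv = lu_factor(A - p * np.eye(n))       # factorise only once
        u = np.asarray(u0, dtype=float)
        m = 1.0
        for _ in range(maxit):
            v = lu_solve((lu, piv), u)               # solve with a new right-hand side
            m = v[np.argmax(np.abs(v))]
            u_new = v / m
            if np.max(np.abs(u_new - u)) <= tol:
                u = u_new
                break
            u = u_new
        return p + 1.0 / m, u                        # eigenvalue estimate and eigenvector

The eigenvalue estimate p + 1/max(v) follows from the fact that max(v_{s+1}) converges to (λ_k − p)⁻¹.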
The convergence properties of this method follow from the fact that the
eigenvalues of (A − pI)⁻¹ are (λ_i − p)⁻¹, while the eigenvectors are the same
as those of A. Thus, this method will converge to the eigenvalue with the largest
|λ_i − p|⁻¹. Now if |λ_k − p| ≪ |λ_i − p|, (i ≠ k), the component of x_i (i ≠ k) in
u_s decreases rapidly to zero and the iteration converges to the eigenvector
x_k. The merit of this method lies in the fact that if p is very close to the
eigenvalue λ_k, then the ratio |(λ_k − p)/(λ_i − p)|, (i ≠ k) would be very small
and the convergence would be extremely fast. In fact, the closer p is to λ_k,
the smaller the ratio will be, and the convergence will also be faster. Even if
two eigenvalues are very close, there will be no problem, provided the required
eigenvalue is known to an accuracy much better than the separation between
the neighbouring eigenvalues. In fact, this method provides a powerful technique
for calculating an eigenvector of a matrix, provided the eigenvalue is known.
If the eigenvalue is known very accurately, then the process usually converges
after the first iteration itself. Of course, this method can be used to find any
eigenvector, including those corresponding to complex Ak, provided p is chosen
to be complex. If the eigenvalue is known accurately, then the choice of the
initial vector is not very crucial and any essentially arbitrary choice e.g., Uo =
(1, 1, ... , 1) T, or the one obtained using random numbers, is suitable. If by
coincidence the initial eigenvector is orthogonal to the required eigenvector (this
coincidence is not as rare as we might expect), then a second iteration may be
required to get the eigenvector correctly. This situation is unlikely to arise if the
initial vector is chosen using random numbers. Nevertheless, a second iteration
may be desirable to check for convergence. Wilkinson recommends that the
first iteration may be only a half iteration, where the vector VI is obtained by
performing only back-substitution. Thus, in this case, the vector Uo is implicitly
defined to be such that Gaussian elimination yields some convenient vector,
e.g., (1,1, ... , 1)T. It is claimed that, this choice is less likely to be deficient in
a given eigenvector.
If λ_k is a multiple eigenvalue, then the iteration will converge to some
vector in the subspace spanned by the eigenvectors corresponding to λ_k. To
get another eigenvector, we can start with a different initial choice. Unlike
the power method, even if we start with a vector orthogonal to the known
eigenvector, there is no guarantee that the new eigenvector will be significantly
different from the first one. However, in most cases, different eigenvectors can
be computed by just perturbing the initial vector to some extent. Alternately,
perturbing the eigenvalue by a small amount also generally leads to a different
eigenvector. If the eigenvalues are not multiple, but very close with separation
comparable to ℏ‖A‖, then in general, it is difficult to avoid some contamination
from other eigenvectors. Wilkinson (1988) has discussed this situation in con-
siderable detail and finds that in many cases, an accurate eigenvector may be

obtained, if in the final iteration a high accuracy solution is computed using the
technique of iterative refinement (Section 3.4). For the iterative refinement to
converge, it is essential that the matrix A - pI is not too ill-conditioned. Hence,
p should be slightly different from the eigenvalue. In fact, in such situations the
eigenvector itself may be ill-conditioned (see Section 11.9), and it may not be
possible to determine it accurately using any method.
It may appear that if p is very close to an eigenvalue λ_k, then the matrix
(A − pI) is almost singular and the corresponding system of linear equations is
ill-conditioned. However, error analysis by Wilkinson (1988) shows that inverse
iteration gives good results, even when p is equal to λ_k to machine precision,
unless the eigenvector x_k itself is ill-conditioned. This result follows from the
fact that the only requirement for convergence to x_k is that at each step the
component of x_k in the vector gets multiplied by some large quantity. It is
immaterial whether the large factor is the one that will be obtained by exact
computation or some completely different number of comparable magnitude. If
the required eigenvector corresponds to a multiple eigenvalue, then all we can
say is that after a few iterations the resulting vector will be in the subspace
spanned by the eigenvectors corresponding to the eigenvalue λ_k. Even if the
iteration fails to converge, the resulting vector will be in the subspace spanned
by the eigenvectors (provided the eigenvalue does not correspond to a nonlinear
divisor) and is an acceptable eigenvector. If the eigenvalue is known accurately,
then it may not be necessary to check for convergence of the eigenvector and the
iteration can be terminated when max(v_{s+1}) is larger than some number of the
order of 1/ε, where ε is the required accuracy.
The inverse iteration method can also be used to calculate eigenvalues of
a matrix, provided a reasonable approximation to the eigenvalue is known. In
that case, it may be better to change the value of p after every iteration, or
after every few iterations, to the latest estimate of the eigenvalue. That will
accelerate the convergence. For symmetric matrices, we can use the Rayleigh
quotient to estimate the eigenvalue, and this estimate can be used as the value
of p for the next iteration. In this case, at any step we compute the new value
of p using the Rayleigh quotient

p_{s+1} = p_s + (v_{s+1}^T (A − p_s I) v_{s+1}) / (v_{s+1}^T v_{s+1}) = p_s + (v_{s+1}^T u_s) / (v_{s+1}^T v_{s+1}),    (11.35)

after the normal inverse iteration defined by (11.34). If this shift is used in
the next iteration, then it can be proved that the iteration usually converges
cubically, with p_s converging to λ_k and u_s to x_k. As usual, the cubic con-
vergence will be established only when p_s is sufficiently close to λ_k. Far from
the eigenvalue the convergence could be slow. With this modification, Gaus-
sian elimination or the triangular decomposition will have to be performed at
every step, since the matrix is changing. Hence, this iteration requires much
more effort as compared to the simple inverse iteration with constant shift p.
For complex Hermitian matrices, the definition of the Rayleigh quotient should be
modified by replacing v_{s+1}^T by its complex conjugate transpose v_{s+1}^H.
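A sketch of this variable-shift iteration for a real symmetric matrix, using the update (11.35) (an illustration only, not subroutine TINVIT; the stopping criterion and iteration limit are assumptions):

    import numpy as np

    def rayleigh_quotient_iteration(A, p0, u0, tol=1e-12, maxit=50):
        """Inverse iteration with the shift updated by the Rayleigh quotient (11.35):
        p_{s+1} = p_s + v_{s+1}^T u_s / (v_{s+1}^T v_{s+1}).
        A new factorisation of A - pI is required at every step."""
        n = A.shape[0]
        u = np.asarray(u0, dtype=float)
        u /= np.linalg.norm(u)
        p = p0
        for _ in range(maxit):
            v = np.linalg.solve(A - p * np.eye(n), u)    # (A - p_s I) v_{s+1} = u_s
            p_new = p + (v @ u) / (v @ v)                # Rayleigh-quotient shift
            u = v / np.linalg.norm(v)
            if abs(p_new - p) <= tol * max(abs(p_new), 1.0):
                return p_new, u
            p = p_new
        return p, u

Note that, as remarked above, a fresh solve (or factorisation) of A − pI is needed at every step, so each iteration is more expensive than in the fixed-shift case.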

For a general unsymmetric matrix, the Rayleigh quotient as defined above
still converges to the eigenvalue, provided v_{s+1}^T is replaced by its complex con-
jugate transpose v_{s+1}^H. However, it can be shown that in this case, the contribution due
to other vectors will fall off as β^s rather than β^{2s}, where β = (λ_k − p)/(λ_i − p).
Thus, the rate of convergence is the same as that achieved in the simple inverse
iteration. Hence, in this case, we may use the new shift as

p_{s+1} = p_s + 1 / max(v_{s+1}),    (11.36)

and the convergence is quadratic rather than cubic. In principle, the rate of
convergence will be the same for either choice of P8+1, but in practice if the
eigenvectors of the unsymmetric matrix are orthogonal to some extent, the
Rayleigh quotient may give faster convergence. It is possible to modify the
definition of Rayleigh quotient {11} by using two sequences of vectors corre-
sponding to the left and the right eigenvectors. In this case, cubic convergence
is achieved, but each iteration requires more effort. However, since the matrices
involved in the two equations are essentially the same, the same LU decom-
position can be used to solve both these equations. Hence, for a dense matrix
the effort required per iteration may not increase significantly. However, as will
be seen in Section 11.6, it is more economical to reduce a general matrix to
Hessenberg form before finding the eigenvalues and eigenvectors. Now if this
process is applied to Hessenberg matrix, then both the LU decomposition and
the solution of equation for a given right-hand side requires O(n 2 ) arithmetic
operations. As a result, the effort per iteration increases significantly if two
sequences of vectors have to be generated.
This algorithm is implemented in subroutine INVIT for a general ma-
trix and in subroutine TINVIT for a symmetric tridiagonal matrix. Subroutine
TINVIT uses variable shift as given by the Rayleigh quotient, while subroutine
INVIT has an option of using fixed or variable shift. If the eigenvalue is already
known accurately, it is better to use a fixed shift. If the eigenvalue is known
only approximately, it may be better to use a variable shift. However, with
variable shift the inverse iteration may not necessarily converge to the nearest
eigenvalue. It may happen that the initial vector has larger component of some
other eigenvector corresponding to one of the nearby eigenvalues, in which case,
the Rayleigh quotient will be closer to this eigenvalue. Hence, the iteration with
variable shift may converge to an eigenvalue which is not nearest to the initial
choice of the shift p. This behaviour could cause some problems in subroutine
TINVIT, since it may fail to find some eigenvalues correctly, even though they
have been bracketed in a reasonably small interval. This problem can only be
avoided by locating the eigenvalues more accurately, before calling TINVIT
or by using the simple inverse iteration with constant p. More sophisticated
implementations of inverse iteration algorithm are given in the Handbook.
EXAMPLE 11.3: Solve the eigenvalue problem in Example 11.1 using inverse iteration.
For simplicity we consider the method of inverse iteration with a fixed value of p. Using
subroutine INVIT with a variety of values for p and the initial vector, we can obtain the

required eigenvalues. Using p = 2.5 and an arbitrary starting vector, the iteration converges
in 5 steps to the same results as that obtained in Example 11.1. The convergence can be
improved by supplying a more accurate value of the eigenvalue. In fact, if p = 2.536526,
which is the eigenvalue correct to seven significant figures, then the iteration converges to
the same eigenvector in the very first iteration itself. For this value of p even if the initial
vector is orthogonal to the required eigenvector, the iteration almost converges to the required
eigenvector in the second iteration. Similar results could be obtained for other eigenvalues
also. With p = 0, iteration converges to the lowest eigenvalue in six iterations, while with
p = 1.5 it converges to the middle eigenvalue in eight iterations. All these results are accurate
to almost seven significant digits.
EXAMPLE 11.4: Find the eigenvectors corresponding to λ ≈ ±10.75, 1, 0 of the following
tridiagonal matrix W₂₁ of order 21 and compare the results with those obtained by direct solution
of the linear equations:

        ( 10   1                      )
        (  1   9   1                  )
W₂₁ =   (      1   8   ·              )    (11.37)
        (          ·   ·   ·          )
        (              ·  −9    1     )
        (                  1   −10    )

where the diagonal elements are 10, 9, ..., 1, 0, −1, ..., −10, the off-diagonal elements are all
equal to 1, and all other elements are zero.
The eigenvectors can be easily determined using the method of inverse iteration, which
also gives an accurate estimate of the eigenvalues. However, for illustration we calculate the
eigenvector by direct solution of the system of linear equation using the calculated eigenvalue.
We can use the equations

(w₁₁ − λ)x₁ + w₁₂x₂ = 0,
w_{i,i−1}x_{i−1} + (w_{ii} − λ)x_i + w_{i,i+1}x_{i+1} = 0,    (i = 2, 3, ..., n−1),    (11.38)
w_{n,n−1}x_{n−1} + (w_{nn} − λ)x_n = 0,

where n = 21. Clearly x₁ cannot be zero. Hence, we can assume x₁ = 1 and solve the
first 20 equations for x₂, x₃, ..., x₂₁, respectively. The resulting eigenvector can then be
normalised by dividing out by the largest component to conform to the convention used in
the inverse iteration method. Alternately, we can solve the last 20 equations by taking x₂₁ = 1
and solving the equations for x₂₀, x₁₉, ..., x₂, x₁. The eigenvector obtained using both these
options is also given in Table 11.2. For λ = 10.74619 the first component x₁ is the largest
and if we solve the first 20 equations, then there is a significant cancellation in determining
the higher elements and the results are totally useless. On the other hand, using the last
20 equations gives a result which is essentially identical to that obtained using the inverse
iteration method, where all components are correct to essentially seven significant figures.
This is a remarkable achievement, considering the fact that the components of the eigenvector
vary widely in magnitude from 1 to 10⁻²⁰.
The problem here arises because the eigenvalue cannot be exact and a small depar-
ture from the correct eigenvalue produces a numerical instability which ruins the results. To
analyse the situation, let us assume that λ is close, but not exactly equal, to the eigenvalue
λ_k. In that case, if we solve the first 20 equations exactly to obtain the eigenvector, the last
equation in (11.38) is not satisfied and the calculated vector y satisfies exactly the equation

(W − λI) y = δ eₙ,    (11.39)

where δ is the residual in the last equation and eₙ is the nth column of the unit matrix. We
can expand the vector eₙ in terms of the normalised eigenvectors x_i of the matrix as

eₙ = Σ_{i=1}^{n} α_i x_i.    (11.40)

Using this expansion, we obtain from (11.39)

y = δ Σ_{i=1}^{n} α_i x_i / (λ_i − λ) = δ α_k x_k / (λ_k − λ) + δ Σ_{i≠k} α_i x_i / (λ_i − λ).    (11.41)

Table 11.2: Calculating eigenvectors of W₂₁ for λ = 10.74619

Inverse iter. First 20 eq. Last 20 eq.

1.000000 x 10° 1.000000 x 10° 1.000000 x 10°


7.461942 X 10- 1 7.461942 X 10- 1 7.461942 X 10- 1
3.029999 X 10- 1 3.029999 X 10- 1 3.030000 X 10- 1
8.590249 X 10- 2 8.590240 X 10- 2 8.590250 X 10- 2
1.880748 X 10- 2 1.880715 X 10- 2 1.880748 X 10- 2
3.361465 X 10- 3 3.360009 X 10- 3 3.361465 X 10- 3
5.081471 X 10- 4 5.001082 X 10- 4 5.081472 X 10- 4
6.659431 X 10- 5 1.381875 X 10- 5 6.659432 X 10- 5
7.705364 X 10- 6 -3.930655 X 10- 4 7.705364 X 10- 6
7.982855 X 10- 7 -3.451646 X 10- 3 7.982856 X 10- 7
7.488266 X 10- 8 -3.324735 X 10- 2 7.488268 X 10- 8
6.418173 X 10- 9 -3.538308 X 10- 1 6.418174 X 10- 9
5.064413 X 10- 10 -4.122918 x 10° 5.064414 X 10- 10
3.702574 X 10- 11 -5.219768 X 10 1 3.702575 X 10- 11
2.521770 X 10- 12 -7.133965 X 102 2.521770 X 10- 12
1.607628 X 10- 13 -1.046769 X 104 1.607628 X 10- 13
9.632471 X 10- 15 -1.641128 X 10 5 9.632472 X 10- 15
5.444318 X 10- 16 -2.737798 X 106 5.444319 X 10- 16
2.912112 X 10- 17 -4.842138 X 10 7 2.912112 X 10- 17
1.478380 X 10- 18 -9.049788 X 10 8 1.478380 X 10- 18
7.126030 X 10- 20 -1.782147 X 1010 7.126030 X 10- 20

If y is to be a good approximation to x_k, then it is important that α_k/(λ_k − λ) should be
very large compared with α_i/(λ_i − λ), (i ≠ k). Now we know that λ_k − λ is very small, but if
α_k is also very small, then the calculated vector will not be close to the expected eigenvector.
As can be seen from the correct eigenvector, α_k is indeed very small. Consequently, we do
not expect the calculated eigenvector to resemble the true eigenvector for this eigenvalue.
It may be argued that it is just a coincidence and in general, the probability that eₙ is
highly deficient in the required eigenvector is very small. Unfortunately, such situations are
not as rare as we might imagine at first sight. On the other hand, when the first equation is
omitted, the relevant vector in (11.39) will be e₁, which has a significant component of the
eigenvector, and we get a reasonable result. Thus, this technique is likely to run into trouble,
since we will not know beforehand which is the dominant component in the eigenvector. We
may think that by dropping any one of the equations, we can calculate the eigenvector and verify a
posteriori whether the omitted component in the eigenvector is substantial. This technique
clearly fails in this case, since from Table 11.2 it is clear that the calculated vector actually
has the largest component in the last position, which will tend to justify the omission of the
last equation.
For the second eigenvalue $\lambda = -10.74619$ the situation is reversed, since the first com-
ponent is the smallest, and in this case it turns out that using the first 20 equations gives
the correct solution. On the other hand, the solution using the last 20 equations leads to numer-
ical instability. In general, if the eigenvector is not known, it is impossible to decide which
equation should be omitted while solving for the eigenvector. For the third eigenvalue the
situation is completely different, as the maximum component of the eigenvector is in the
middle, and as a result, starting from either end leads to some errors, though these are not
as serious as in the first two cases. Nevertheless, it is quite clear that the strategy will ul-
timately fail if we consider the matrix $W_{2n+1}$ for larger values of $n$. It may be noted that
the correct value of this eigenvalue is 1.00000000000062, but the 24-bit arithmetic used in
these calculations is not able to detect the departure from 1. For the last eigenvalue close to
zero, the eigenvector is somewhat similar in behaviour, but in that case, all the calculated
solutions essentially agree with each other. The exact value of this eigenvalue is zero, and the
direct solution gives good results in this case, since the computed eigenvalue is so small that
up to machine accuracy $w_{ii} - \lambda = w_{ii}$ for all diagonal elements, except for $i = 11$. Further,
since all diagonal elements are integers the calculations are essentially performed with the exact
eigenvalue and we get the correct eigenvector. This is just a coincidence, since actually the
first and the last components of the eigenvector are quite small and we do not expect the
direct solution to give any reasonable accuracy. For example, if all elements of the matrix are
multiplied by 1.1, the results are similar to those obtained for $\lambda = 1$.
Even if the calculations are repeated using 53-bit arithmetic and the eigenvalue is also
determined to corresponding accuracy before using it in the system of equations, it turns out
that the results obtained using direct solution of the linear equations are not as good as that
obtained using inverse iteration with 24-bit arithmetic.

From the above examples, it is quite clear that inverse iteration provides
a very reliable method for determining the eigenvector of a matrix, once the
eigenvalue is known. Direct solution of the system of linear equations is unreli-
able, as in general it is impossible to decide which equation should be omitted.
Even though in principle any of the equations can be omitted, in practice, be-
cause of roundoff error, the results may depend on which equation is omitted.
For these reasons, such methods should be avoided.
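As an illustration of this conclusion (and not of the actual routines in Appendix B), the following minimal NumPy sketch performs inverse iteration for a single eigenvector; the matrix, the starting vector and the stopping test are arbitrary choices made here.

import numpy as np

def inverse_iteration(A, lam, maxit=25, tol=1e-10):
    """Eigenvector of A belonging to the eigenvalue closest to the estimate lam."""
    n = A.shape[0]
    B = A - lam * np.eye(n)              # nearly singular when lam is accurate
    x = np.ones(n)
    for _ in range(maxit):
        y = np.linalg.solve(B, x)        # the wanted component is amplified enormously
        y = y / y[np.argmax(np.abs(y))]  # scale so that the largest component is +1
        if np.linalg.norm(y - x) < tol * np.linalg.norm(y):
            break
        x = y
    return y

# W21 of Example 11.4: diagonal 10, 9, ..., 0, ..., -10 and unit off-diagonal elements
W = np.diag(np.arange(10.0, -11.0, -1.0)) + np.diag(np.ones(20), 1) + np.diag(np.ones(20), -1)
v = inverse_iteration(W, 10.74619)
print(v[:4])    # approximately 1.000000, 0.746194, 0.303000, 0.085902 (cf. Table 11.2)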
The method of inverse iteration can also be used to find eigenvectors in a
generalised eigenvalue problem of the form $A(\lambda)x = 0$, provided the eigenvalue
is known accurately. This follows from the fact that the required solution of
the system of homogeneous equations gets multiplied by a large factor and
the inverse iteration should converge to this solution. As compared to that,
other components in the initial vector are multiplied by much smaller factors.
In this case, the number of eigenvectors may be much larger than n and they
need not be orthogonal or even linearly independent. Since the theory of such
problems is not developed, it is difficult to say what will happen if there is a
multiple eigenvalue. But in simple cases, where eigenvalues are distinct and the
eigenvector is unique, the method of inverse iteration will yield the eigenvector.
Even for multiple eigenvalues, we can expect the behaviour to be similar to
that for the standard eigenvalue problem, since in the neighbourhood of the
eigenvalue, we can expand the elements of the matrix in a Taylor series in
$(\lambda - \lambda_k)$ and retain the first nonvanishing term. This approximation yields a
generalised eigenvalue problem of a polynomial form, which can be transformed
to the standard form under appropriate conditions.
It can be shown that the inverse iteration method can also be used to find
eigenvectors of matrices with nonlinear divisors. For those eigenvalues which
correspond to linear divisors, there is no difficulty and the method will work
just as for normal matrices with linear divisors. For eigenvalues with nonlin-
ear divisors, it can be shown that the iteration converges very slowly if the
eigenvalue is not known accurately. It can be shown {9} that the extraneous
terms fall off as $\alpha/s$, where $\alpha$ is some constant and $s$ is the number of itera-
tions performed. Here the constant $\alpha$ depends on the eigenvalue and the initial
vector. If the initial approximation is not very good, then no purpose may be
served by using the inverse iteration method, as no significant improvement
over the known approximation can be obtained. Using a variable shift may
help in this case. If the eigenvalue is known accurately, then it can be shown
that the inverse iteration method will immediately converge to the eigenvec-
tor. This result follows from the fact that in that case, the constant $\alpha$ defined
above is very small and the extraneous terms fall off to a very small value in
the first iteration, even though the subsequent iterations will not improve the
eigenvector significantly. In this case, it is essential to ensure that the initial
vector has a significant component along the required eigenvector. Hence, with
some caution this method can also be used to find eigenvectors corresponding
to nonlinear divisors.

11.4 Eigenvalues of a Real Symmetric Matrix


If A is real and symmetric, then it is well-known that all eigenvalues are real
and eigenvectors corresponding to distinct eigenvalues are orthogonal. Further,
in this case, instead of a general nonsingular matrix H, we can use orthogonal
matrices for similarity transform. In fact, it can be proved that there exists an
orthogonal matrix Q, such that

$$Q^T A Q = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n). \qquad (11.42)$$

Here the columns of Q are the eigenvectors. The advantage of using orthogonal
transformation as opposed to a general similarity transformation is that the
roundoff error can be controlled, because orthogonal transformations preserve
the norm of a matrix. Hence, the matrix elements will not increase appreciably
in size.
Finding eigenvalues of a matrix is equivalent to finding roots of the corre-
sponding characteristic polynomial. As we have seen in Chapter 7, this polyno-
mial can only be solved iteratively. Hence, we can expect that all methods for
finding eigenvalues are also iterative in nature. The methods discussed so far
obviously fall in this category. Now any serious manipulation, such as multi-
plication with another matrix or the solution of corresponding linear equations
usually requires O(n 3 ) arithmetic operations for a general dense matrix. Hence,
each iteration with a general matrix will require O(n 3 ) operations. In practice,
several iterations will be required to find each eigenvalue. Consequently, finding
all eigenvalues will require O( n 4 ) operations. On the other hand, if the matrix
is reduced to a more condensed form, then each iteration may require much
less effort. For example, if we deal with a Hessenberg matrix, each iteration may
require only O(n 2 ) operations, while for a tridiagonal matrix it may require
only O(n) operations. Of course, for a diagonal matrix the eigenvalue problem
can be solved trivially. Hence, if we can reduce the general matrix to any of
these condensed forms using similarity transformations, then the iterative part
of the solution can be accomplished efficiently.
Thus, the basic strategy for solving an eigenvalue problem is to first re-
duce the general matrix to a more condensed form using a similarity transform
and then apply an iterative method to solve the eigenvalue problem for the
condensed matrix. Of course, it will be best if the matrix can be reduced to
a diagonal form using similarity transformations, since in that case, the eigen-
value problem is solved. We can expect from the very nature of this problem,
that the reduction can only be achieved iteratively, since otherwise the eigen-
value problem can be solved in a finite number of arithmetic operations. In
fact, the Jacobi method gives a sequence of orthogonal transformations, which
diagonalises a given real symmetric matrix. The basic idea in this method is
to use Givens' rotation to annihilate one of the nonzero off-diagonal elements.
Since this transformation does not preserve the zeros already present in the
matrix, the process is iterative in nature, in the sense that the norm of off-
diagonal elements keeps reducing with each iteration and at some stage, when
all the off-diagonal elements have been reduced to some predetermined level,
the process can be terminated. This method is rather inefficient and we do not
consider it further.
In this section, we deal with only real symmetric matrices. For such ma-
trices, it is natural to apply transformations which preserve symmetry. Since it
is not possible to reduce such a matrix to a diagonal form using a finite number
of arithmetic operations, the next best that we can achieve is to reduce it to
a symmetric tridiagonal form. In this case, it turns out that it is possible to
arrange the sequence of transformations such that zero elements introduced in
the previous steps are preserved. Hence, this reduction can be achieved using
a finite number of transformations. The reduction to a symmetric tridiagonal
form can be achieved by a sequence of Givens' rotations or Householder's reflec-
tions (Section 3.6). It turns out that the Householder's method is more efficient
by a factor of two as compared to the Givens' method. As a result, we describe
only the Householder's method here.
The Householder's method consists of (n - 2) major steps, where in the
rth step zeros are introduced in the rth row and rth column, without affecting
the zeros introduced in the previous steps. The typical configuration of the
matrix before the rth step for the case n = 7 and r = 4 is as follows:

x x 0 0 0 0 0
x x x 0 0 0 0
0 x x x 0 0 0
0 0 x x x x x (11.43)
0 0 0 x x x x
0 0 0 x x x x
0 0 0 x x x x

where x denotes elements which are in general nonzero. The matrix Pr of the
rth transformation should be such that

$$A_r = P_r A_{r-1} P_r \qquad (11.44)$$
is tridiagonal in the first $r + 1$ rows and columns. It should be noted that, since
$P_r^{-1} = P_r$, the equation above defines a similarity transform. We can choose

$$P_r = I - 2\, w_r w_r^T = I - \frac{u_r u_r^T}{H_r}, \qquad H_r = \tfrac{1}{2}\, u_r^T u_r, \qquad (11.45)$$

where $w_r$ is a unit vector having $n$ components, of which the first $r$ are zeros.
If $u_{ir}$ denotes the $i$th component of $u_r$ and $a_{ij}$ denotes the current element in
the $(i,j)$ position of $A_{r-1}$, then

$$u_{ir} = \begin{cases} 0, & \text{if } i = 1, 2, \ldots, r; \\ a_{r,r+1} \mp s_r, & \text{if } i = r + 1; \\ a_{ri}, & \text{if } i = r + 2, \ldots, n; \end{cases} \qquad (11.46)$$

where

$$s_r^2 = \sum_{i=r+1}^{n} a_{ri}^2. \qquad (11.47)$$

It can be proved that (see {3.39}), with this choice of $u_r$ the matrix $P_r$ produces
the desired transform. Here the sign of $s_r$ is chosen to be that of $a_{r,r+1}$, so that
cancellation cannot take place during the addition. With this prescription, if
at any stage all the elements which are to be eliminated are already zero, the
transformation matrix is not the identity matrix as might be expected. Instead
it differs from the identity matrix in that the $(r + 1)$th diagonal element is $-1$
instead of $+1$. Hence, this case should be explicitly detected and the corresponding
transformation should be skipped. It may be noted that if we had chosen the
opposite sign for $s_r$, the transformation matrix would have come out to be the
identity matrix in this case. However, this prescription is not desirable because
of roundoff error.
To calculate the elements after each stage, we can define

$$p_r = \frac{A_{r-1} u_r}{H_r}, \qquad q_r = p_r - \left(\frac{u_r^T p_r}{2 H_r}\right) u_r. \qquad (11.48)$$

The vectors Pr and qr have the first (r - 1) components zero. With these
definitions

$$A_r = P_r A_{r-1} P_r = A_{r-1} - u_r q_r^T - q_r u_r^T. \qquad (11.49)$$

It should be noted that, this transformation leaves the first r rows and columns
unaffected. Further, since the transformation maintains symmetry, we need to
calculate only the upper triangle of $A_r$. This process requires approximately
$\tfrac{2}{3}n^3$ additions and $\tfrac{2}{3}n^3$ multiplications, together with $O(n^2)$ divisions and $(n - 2)$ square root
evaluations, to reduce a symmetric matrix of order $n$ to a symmetric tridiagonal
matrix.
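As an illustration of the steps (11.44)-(11.49) (and not of the subroutine TRED2, which works in place, in reverse order, and stores the quantities $u_i/H_i$ as described below), the following compact NumPy sketch forms each $P_r$ explicitly; the test matrix is an arbitrary choice.

import numpy as np

def householder_tridiag(A):
    """Reduce a real symmetric matrix to tridiagonal form T = Q^T A Q."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    Q = np.eye(n)
    for r in range(n - 2):                # r-th step zeroes row/column r beyond the subdiagonal
        s = np.sqrt(np.sum(A[r, r+1:]**2))
        if s == 0.0:
            continue                      # nothing to eliminate in this step
        u = np.zeros(n)
        sign = 1.0 if A[r, r+1] >= 0 else -1.0
        u[r+1] = A[r, r+1] + sign * s     # sign chosen so that no cancellation occurs
        u[r+2:] = A[r, r+2:]
        H = 0.5 * np.dot(u, u)
        P = np.eye(n) - np.outer(u, u) / H
        A = P @ A @ P                     # similarity transform, preserves symmetry
        Q = Q @ P
    return A, Q                           # A is now tridiagonal up to roundoff

M = np.array([[4., 1., 2., 2.],
              [1., 3., 0., 1.],
              [2., 0., 5., 1.],
              [2., 1., 1., 2.]])
T, Q = householder_tridiag(M)
print(np.allclose(Q.T @ M @ Q, T))        # True: T is an orthogonal similarity transform of M
print(np.round(T, 6))                     # tridiagonal up to roundoff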
An implementation of this algorithm is provided by the subroutine TRED2


in Appendix B, which is based on the procedure tred2 in the Handbook. Follow-
ing the Handbook this subroutine performs the transformation in reverse order,
starting from the nth row and column instead of the first. The parameter REPS
is used to decide when to skip the transformation. If the sum $s_r^2 \le$ REPS, then
the relevant elements are already very small and the corresponding step in the
transformation can be skipped. The value of REPS should be $\eta/\epsilon$, where $\eta$ is the
smallest nonzero positive number that can be represented in the machine and $\epsilon$ is the machine accuracy. With
this choice, underflow can be avoided during the calculations. As explained in
the Handbook, under certain circumstances underflows can cause substantial
departure from orthogonality in the transformation matrix. After completing
the Householder reduction, this subroutine accumulates the transformation ma-
trix Q = P 1 P2 ... Pn - 2 , which is returned in the array A. This transformation
matrix will be required to transform the eigenvectors of the tridiagonal matrix
to that of the original matrix. This matrix is computed using

$$Q_{n-2} = P_{n-2}, \qquad Q_i = P_i Q_{i+1}, \qquad (i = n-3, \ldots, 2, 1), \qquad (11.50)$$

with $Q = Q_1$. Computing the transformation requires an additional $\tfrac{2}{3}n^3$ additions


and multiplications. Because of the nature of Pi, the matrix Qi is equal to the
identity matrix in the last i rows and columns. Hence, it is possible to overwrite
the relevant parts of the successive Qi on the same array, which stores the details
of the transformation $P_i$. The elements of $u_i$ are stored in the row $l = n - i + 1$,
while it proves convenient to store $u_i/H_i$ in the column $l$ of the original matrix
A. The nonzero elements of $p_i$ are stored in the unused locations of the array
e used to store the off-diagonal elements of the transformed matrix. At each
stage, the number of storage locations not yet used by elements of e is adequate
to store the nonzero elements of Pi. The vector Pi is subsequently overwritten
with qi.
When dealing with matrices having elements of widely varying orders
of magnitude, it is better to permute the rows and columns such that the
smaller elements are in the top left-hand corner {17}. This is because, the
Householder reduction is performed from the last row and roundoff error will
be reduced in this case. If the original matrix is in a band form, then the
corresponding Householder matrices will have several zero elements and if these
zero elements are explicitly taken into account, then it is possible to reduce the
matrix efficiently.
To complete the solution, we now consider the eigenvalue problem for a
symmetric tridiagonal matrix A. For convenience, we define

$$d_i = a_{ii}, \quad (i = 1, \ldots, n), \qquad \text{and} \qquad e_i = a_{i,i-1} = a_{i-1,i}, \quad (i = 2, \ldots, n), \qquad (11.51)$$

and we shall assume that $e_i \ne 0$. If some of the $e_i = 0$, then it can be seen
that the matrix can be split into smaller matrices with $e_i \ne 0$, which can be
considered independently. The eigenvalues can be determined by finding zeros of
$\det(\lambda I - A)$, which can be evaluated in $2n$ multiplications and an equal number
of additions. For this purpose, we can use any method discussed in Chapter 7,
but it turns out that in this case, the sequence of principal minors form a
Sturm sequence (Section 7.10), which can be effectively used to determine any
eigenvalue.
If we denote the leading principal minor of order $r$ in $(\lambda I - A)$ by $p_r(\lambda)$,
then defining $p_0(\lambda) = 1$, we get

$$p_1(\lambda) = \lambda - d_1, \qquad p_i(\lambda) = (\lambda - d_i)\, p_{i-1}(\lambda) - e_i^2\, p_{i-2}(\lambda), \qquad (i = 2, 3, \ldots, n). \qquad (11.52)$$
It can be proved that the sequence $\{p_n(\lambda), p_{n-1}(\lambda), \ldots, p_1(\lambda), p_0(\lambda)\}$ forms a
Sturm sequence in the strict sense (provided $e_i \ne 0$), and we can use Sturm's
theorem to find the exact number of zeros of $p_n(\lambda)$ in any given interval. For this
purpose, we need to calculate the number of sign changes $V(x)$ in the Sturm
sequence at $\lambda = x$. Further, it is obvious that as $\lambda \to \infty$ the signs of $p_i(\lambda)$
are all positive and there would be no sign change, giving $V(\infty) = 0$. Thus, at
any point $x$ if we evaluate the number of sign changes $V(x)$, then $V(x)$ is the
number of eigenvalues which are strictly greater than $x$. While computing $V(x)$
it should be ensured that none of the polynomials $p_i(\lambda)$ gives an underflow,
since in that case, the value may be replaced by zero and the sign may come
out to be incorrect. Such underflows are quite likely when the order of the matrix
is large. The underflows and overflows may be avoided if we consider the ratios
$q_k(\lambda) = p_k(\lambda)/p_{k-1}(\lambda)$, which satisfy the recurrence relation

$$q_1(\lambda) = \lambda - d_1, \qquad q_i(\lambda) = (\lambda - d_i) - \frac{e_i^2}{q_{i-1}(\lambda)}, \qquad (i = 2, 3, \ldots, n). \qquad (11.53)$$

In that case, the number of sign changes is given by the number of negative
$q_i(\lambda)$. At first sight, these relations look dangerous, since it is possible that one
of the $q_i(\lambda)$ turns out to be zero. In such cases, it is merely necessary to replace
it by a suitable small positive quantity and the analysis may be continued. From
the above recurrence relation it is clear that if $q_i(\lambda)$ is small, then the exact value
is immaterial, since either $q_i(\lambda)$ or $q_{i+1}(\lambda)$ will be negative in any case. Hence,
the number of negative $q_j(\lambda)$, and hence $V(\lambda)$, is not affected. By comparing
the recurrence relations for $p_i(\lambda)$ and $q_i(\lambda)$, it can be seen that instead of two
multiplications we now require one division at each step. If one of the $e_i = 0$,
then the sequence $p_i(\lambda)$ does not strictly form a Sturm sequence, since $p_n(\lambda)$
can now have multiple zeros. In practice, for simplicity, the condition $e_i \ne 0$
may be dropped, since we can always perturb the matrix a little by introducing
a small $e_i$ without changing the eigenvalues appreciably. Splitting the matrix
into smaller parts may increase efficiency, since we will have to deal with smaller
matrices at a time. However, this splitting introduces additional bookkeeping,
since each submatrix needs to be considered while counting the number of
eigenvalues.
The condition $e_i \ne 0$ ensures that there are no multiple eigenvalues. If
a symmetric tridiagonal matrix has an eigenvalue of multiplicity k, then there
must be at least k -1 zero elements in the superdiagonal. However, the presence
of some zero off-diagonal elements does not imply that the matrix has multiple
eigenvalues. For example, a diagonal matrix which has all off-diagonal elements
zero need not have any multiple eigenvalues.
This Sturm sequence property may be used to locate any individual eigen-
value, say kth in the order of decreasing value, without reference to any of the
other eigenvalues. Suppose we have two values $a$ and $b$ such that $b > a$ and
$V(a) \ge k$ and $V(b) < k$. Then we know that $\lambda_k$ lies in the interval $(a, b)$. An
initial choice for $a$ and $b$ can be obtained using the Gerschgorin theorem. From
the Gerschgorin theorem the eigenvalues are all contained in the union of the in-
tervals $d_i \pm (|e_i| + |e_{i+1}|)$ with $e_1 = e_{n+1} = 0$. Hence, the upper bound $\lambda_{\max}$
and the lower bound $\lambda_{\min}$ are given by

$$\lambda_{\max} = \max_{1 \le i \le n} \left( d_i + |e_i| + |e_{i+1}| \right), \qquad \lambda_{\min} = \min_{1 \le i \le n} \left( d_i - |e_i| - |e_{i+1}| \right). \qquad (11.54)$$

Once this interval is known, the method of bisection can be used to determine
the eigenvalue accurately. For more rapid convergence, we can use the method of
inverse iteration, once the eigenvalue is isolated, i.e., $V(a) = k$ and $V(b) = k-1$.
The bisection method has very great flexibility. We can use it to find
specifically selected eigenvalues $\lambda_{k_1}, \lambda_{k_2}, \ldots, \lambda_{k_r}$, the eigenvalues in a given in-
terval $(\lambda_l, \lambda_u)$, or a prescribed number of eigenvalues to the left or right of a
given value. Since each evaluation of the Sturm sequence tells us exactly how
many of the eigenvalues are to the right of the point of evaluation, we can
search for all of the required eigenvalues simultaneously. An implementation of
this algorithm is provided by subroutine STURM in Appendix B.
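As an illustration of the counting function $V(x)$ based on (11.53), of the Gerschgorin interval (11.54) and of the bisection just described (this is not the subroutine STURM itself; the names and tolerances are arbitrary choices), a short sketch:

import numpy as np

def sturm_count(d, e, x, tiny=1e-300):
    """Number of eigenvalues strictly greater than x for the symmetric tridiagonal
    matrix with diagonal d and off-diagonal e (e[i-1] couples d[i-1] and d[i]),
    counted via the ratios q_i of (11.53)."""
    n = len(d)
    count = 0
    q = x - d[0]
    if q < 0:
        count += 1
    for i in range(1, n):
        if q == 0.0:
            q = tiny                      # replace an exact zero by a small positive number
        q = (x - d[i]) - e[i-1]**2 / q
        if q < 0:
            count += 1
    return count

def kth_eigenvalue(d, e, k, tol=1e-12):
    """k-th eigenvalue in decreasing order, by bisection on the Sturm count."""
    d = np.asarray(d, dtype=float)
    e = np.asarray(e, dtype=float)
    ee = np.concatenate(([0.0], np.abs(e), [0.0]))   # e_1 = e_{n+1} = 0 as in (11.54)
    lo = np.min(d - ee[:-1] - ee[1:])
    hi = np.max(d + ee[:-1] + ee[1:])
    while hi - lo > tol * max(1.0, abs(lo), abs(hi)):
        mid = 0.5 * (lo + hi)
        if sturm_count(d, e, mid) >= k:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# example: d = [2., 2., 2.], e = [1., 1.] has eigenvalues 2 and 2 +- sqrt(2);
# kth_eigenvalue(d, e, 1) returns approximately 3.414214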
It can be seen that calculating the determinant for a given value of A re-
quires O(n) floating-point operations. Hence, for large n, calculating all n eigen-
values and eigenvectors of the reduced tridiagonal matrix will require O(n 2 ) op-
erations, which is much less than that required for reducing the original matrix
to a tridiagonal form. To complete the Householder's method, we must obtain
the eigenvectors of the original matrix from those of the tridiagonal matrix.
Thus, if x is an eigenvector of the tridiagonal matrix, then the corresponding
eigenvector z of the original matrix is given by Qx, where Q is the matrix
defined in (11.50). If the calculations are arranged properly, then it requires
approximately n 2 multiplications to recover each eigenvector. Hence, n 3 mul-
tiplications are required to recover all the eigenvectors. This is comparable to
the number of arithmetic operations required to reduce the original matrix to
tridiagonal form.
EXAMPLE 11.5: Solve the eigenvalue problem for the symmetric tridiagonal matrix $W_{21}^+$
of order 21 with elements

$$w_{ii} = |11 - i|, \qquad w_{i,i+1} = w_{i+1,i} = 1, \qquad (i = 1, 2, \ldots, 21). \qquad (11.55)$$

This matrix is very similar to the matrix W 21 considered in Example 11.4, except for
the fact that here all elements are positive. Since the matrix is already in a tridiagonal form,
we can use the Sturm sequence property of the principal minors to isolate the eigenvalues
using the method of bisection. From the Gerschgorin theorem, we know that all the eigenvalues
are located in the interval $(-2, 11)$. Using the subroutine TRIDIA, we can find any of the
required eigenvalues and the corresponding eigenvectors. It turns out that apart from the
Table 11.3: Eigenvectors of $W_{21}^+$ using inverse iteration

  $\lambda$ = -1.12544152211998      9.21067864736133      10.74619...332      10.74619...339

   2.97872416730933 x 10^-8     -8.57416934 x 10^-1    -9.927284 x 10^-1     9.983061 x 10^-1
  -3.31396215339255 x 10^-7      6.76777494 x 10^-1    -7.407682 x 10^-1     7.449302 x 10^-1
   3.32574575739642 x 10^-6      9.99999501 x 10^-1    -3.007966 x 10^-1     3.024867 x 10^-1
  -3.00175022112204 x 10^-5      5.33900549 x 10^-1    -8.527785 x 10^-2     8.575698 x 10^-2
   2.40579713099982 x 10^-4      1.80283043 x 10^-1    -1.867072 x 10^-2     1.877562 x 10^-2
  -1.68421917489111 x 10^-3      4.49303668 x 10^-2    -3.337021 x 10^-3     3.355771 x 10^-3
   1.00760063531287 x 10^-2      8.90429329 x 10^-3    -5.044520 x 10^-4     5.072864 x 10^-4
  -4.99597621645793 x 10^-2      1.46704415 x 10^-3    -6.610994 x 10^-5     6.648162 x 10^-5
   1.96030070915866 x 10^-1      2.07046515 x 10^-4    -7.648429 x 10^-6     7.693216 x 10^-6
  -5.62720761059992 x 10^-1      2.59017322 x 10^-5    -7.847047 x 10^-7     8.047358 x 10^-7
   1.00000000000000 x 10^0       5.62428449 x 10^-6     5.454528 x 10^-10    1.498959 x 10^-7
  -5.62720761059992 x 10^-1      2.59017448 x 10^-5     7.905662 x 10^-7     8.060749 x 10^-7
   1.96030070915866 x 10^-1      2.07046619 x 10^-4     7.704466 x 10^-6     7.706266 x 10^-6
  -4.99597621645793 x 10^-2      1.46704489 x 10^-3     6.659419 x 10^-5     6.659443 x 10^-5
   1.00760063531287 x 10^-2      8.90429774 x 10^-3     5.081471 x 10^-4     5.081471 x 10^-4
  -1.68421917489111 x 10^-3      4.49303892 x 10^-2     3.361465 x 10^-3     3.361465 x 10^-3
   2.40579713099982 x 10^-4      1.80283133 x 10^-1     1.880748 x 10^-2     1.880748 x 10^-2
  -3.00175022112204 x 10^-5      5.33900816 x 10^-1     8.590249 x 10^-2     8.590249 x 10^-2
   3.32574575739642 x 10^-6      1.00000000 x 10^0      3.029999 x 10^-1     3.029999 x 10^-1
  -3.31396215339255 x 10^-7      6.76777832 x 10^-1     7.461942 x 10^-1     7.461942 x 10^-1
   2.97872416730933 x 10^-8     -8.57417362 x 10^-1     1.000000 x 10^0      1.000000 x 10^0

lowest eigenvalue, the other eigenvalues are located in pairs with the separation between the
members of the pair decreasing as $\lambda$ increases. Further, for each pair one of the eigenvectors is
symmetric while the other is antisymmetric about the central component, that is, $x_i = \pm x_{22-i}$
for $i = 1, 2, \ldots, 11$. Using 24-bit arithmetic it is not possible to separate the last four pairs.
Consequently, it is not possible to find the eigenvectors correctly. The accuracy with which
eigenvectors are computed can be easily ascertained using the symmetry property mentioned
above. Any departure from the symmetry can be attributed to the error in computation.
In order to separate the close pairs of eigenvalues we use 53-bit arithmetic to perform the
calculations. With this accuracy it is barely possible to isolate the last pair of eigenvalues.
The eigenvalues of the matrix are
-1.12544152211998, 0.25380581709668, 0.94753436752929, 1.78932135269508,
2.13020921936251, 2.96105888418573, 3.04309929257882, 3.99604820138363,
4.00435402344086, 4.99978247774290, 5.00024442500191, 6.00021752225710,
6.00023403158417, 7.00395179861638, 7.00395220952867, 8.03894111581427,
8.03894112282902, 9.21067864730492, 9.21067864736133, 10.74619418290332 ,
10.74619418290339,
(11.56)
The subroutine STURM required 386 evaluations of the determinant to isolate all eigenvalues,
when the parameter REPS was set to $10^{-3}$. After the eigenvalues are isolated, inverse iteration
is used to obtain the accurate eigenvalues and eigenvectors, which requires three to four
iterations for each eigenvalue. Because of the close pairs of eigenvalues, it is difficult to
determine the corresponding eigenvectors accurately. Some of the eigenvectors determined
using inverse iteration are shown in Table 11.3. From the symmetry properties of eigenvectors,
it is clear that the first two eigenvectors are essentially accurate to 15 significant figures.
However, the last two eigenvectors corresponding to very close pair have substantial errors.
The error arises because some component of the other eigenvector is also mixed in. If the linear
equations arising in the inverse iteration method are solved using subroutine CROUTH, which
iterates on the residuals, then it is possible to get much more accurate eigenvectors for this
pair of eigenvalues. In that case, the components are correct to approximately six significant
digits. The second column gives an eigenvector corresponding to the next pair, which has a
separation of the order of $6 \times 10^{-11}$. This eigenvector is more accurate, with an accuracy of
approximately six significant digits.
We have noted that a symmetric tridiagonal matrix can have a multiple eigenvalue only
if at least one of the off-diagonal elements vanishes. From this result, it may be tempting to
conclude that if the matrix has close eigenvalues, then some off-diagonal elements should be
small. However, the matrix in this example clearly demonstrates that this statement is not
true, in the sense that even if all the off-diagonal elements are of substantial magnitude, the
matrix may have very close eigenvalues. In fact, it can be shown that for the matrix $W_{2n+1}^+$,
the separation between the largest pair of eigenvalues is of the order of $(n!)^{-2}$.
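The rapid coalescence of the largest pair can be seen directly by constructing $W_{2n+1}^+$ from (11.55) for a few values of $n$; the following sketch uses numpy.linalg.eigvalsh purely as a check, not the methods of this chapter, and the chosen values of $n$ are arbitrary.

import numpy as np

def w_plus(n):
    """Symmetric tridiagonal W_{2n+1}^+ : diagonal |n+1-i| for i = 1..2n+1, unit off-diagonals."""
    m = 2 * n + 1
    d = np.abs((n + 1) - np.arange(1, m + 1)).astype(float)
    return np.diag(d) + np.diag(np.ones(m - 1), 1) + np.diag(np.ones(m - 1), -1)

for n in (5, 8, 10):
    ev = np.linalg.eigvalsh(w_plus(n))
    print(n, ev[-1] - ev[-2])   # the gap of the largest pair shrinks roughly like (n!)**(-2)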

If a large fraction, say 0.25 or more, of the eigenvalues is required, it may be
more efficient to use the QL algorithm described in the next section to find all
the eigenvalues. In general, an individual eigenvector calculated using inverse
iteration may be more accurate, as compared to the same calculated using the
QL algorithm. However, the set of eigenvectors determined using the QL al-
gorithm will be orthogonal to machine accuracy, while those determined using
the inverse iteration method may not satisfy orthogonality to that accuracy.
The method described in this section can be easily generalised to complex
Hermitian matrices. Thus, a Hermitian matrix can be diagonalised using sim-
ilarity transformations involving unitary matrices. Once again, all eigenvalues
of a Hermitian matrix are real. Householder transformation can be generalised
to Hermitian matrices. This transformation can be applied to reduce the ma-
trix to a tridiagonal Hermitian matrix. The principal minors of this tridiagonal
matrix also form a Sturm sequence and the same procedure can be used to find
the eigenvalues and eigenvectors.
Alternately, we can convert this eigenvalue problem to that for a real
symmetric matrix of order $2n$. Thus, if $C = A + iB$ is an $n \times n$ Hermitian matrix
with A and B giving the real and imaginary parts, then the eigenvalue problem

$$(A + iB)(x + iy) = \lambda(x + iy), \qquad (11.57)$$

is equivalent to the (2n) x (2n) real problem

$$\begin{pmatrix} A & -B \\ B & A \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \lambda \begin{pmatrix} x \\ y \end{pmatrix}. \qquad (11.58)$$

Note that the matrix in (11.58) is symmetric and further, for a given eigen-
value $\lambda$, if $(x^T, y^T)^T$ is an eigenvector, then $(-y^T, x^T)^T$ is also an eigenvector.
Hence, each eigenvalue of the Hermitian matrix is a double eigenvalue of the
corresponding real matrix. The eigenvectors are pairs of the form $(x + iy)$ and
$i(x + iy)$, which are identical except for a factor of $i$. Thus, we can find the
eigenvalues and eigenvectors of the original matrix from those of the augmented
real matrix. The matrix in (11.58) requires twice as much storage as that re-
quired by the corresponding complex matrix. Further, if the complex routine is
coded properly, it will require half as much time as that required to solve
the eigenvalue problem for the real matrix. Nevertheless, if an efficient
complex routine is not available, conversion to the real matrix may be faster.
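A short sketch of this embedding (an illustration only: numpy.linalg.eigh stands in for the symmetric routines of this chapter, and the $2 \times 2$ Hermitian test matrix corresponds to Example 11.6 below):

import numpy as np

def hermitian_to_real(C):
    """Embed an n x n Hermitian matrix C = A + iB into the 2n x 2n real
    symmetric matrix of (11.58)."""
    A, B = C.real, C.imag
    return np.block([[A, -B], [B, A]])

C = np.array([[1.0, 1.0 - 1.0j],
              [1.0 + 1.0j, 1.0]])
M = hermitian_to_real(C)
w, V = np.linalg.eigh(M)               # each eigenvalue of C appears twice
n = C.shape[0]
lam = w[::2]                           # one copy of each double eigenvalue
vecs = V[:n, ::2] + 1j * V[n:, ::2]    # x + i y recovered from (x^T, y^T)^T
print(lam)                             # approximately 1 - sqrt(2) and 1 + sqrt(2)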

11.5 The QL Algorithm


The QL algorithm is similar to the more well-known QR algorithm described
in Section 11.8. The basic idea in this algorithm is to factorise the given matrix
as a product of an orthogonal matrix and a lower triangular matrix and then
to construct the product in the opposite order

$$A_s = Q_s L_s, \qquad A_{s+1} = L_s Q_s. \qquad (11.59)$$

Here $Q_s$ is an orthogonal matrix and $L_s$ is lower triangular. It can be seen that
this transformation defines a similarity transform. Starting with the original
matrix $A_0 = A$, we can perform this transformation for $s = 0, 1, \ldots$, and
generate a sequence of similar matrices As. The only difference between the
QR and QL algorithms is that, in the QR algorithm the second factor is upper
triangular rather than lower triangular. This nomenclature comes from the fact
that the triangular factors have their nonzero elements in the left (lower) or the right
(upper) portion of the matrix. Here the QL algorithm is preferred over the QR algorithm, because of a slight
advantage with regard to the roundoff error. As mentioned in the previous
section, if the elements of the matrix are varying widely in magnitude the matrix
should be preferably presented such that the smaller elements are in the top
left-hand corner. With this arrangement it turns out that QL algorithm has
better roundoff property than the QR algorithm. The QL algorithm is based
on the following result.
It can be shown that if $A_0$ has eigenvalues of distinct modulus, the ma-
trix $A_s$ tends to a lower triangular form, with the limiting matrix $A_\infty$ having
the diagonal elements equal to the eigenvalues of $A_0$ arranged in the order of
increasing absolute magnitude. If $A_0$ has a number of eigenvalues with equal
modulus, the limiting matrix is not triangular but block triangular. In gen-
eral, corresponding to a $|\lambda_i|$ of multiplicity $m$, $A_s$ ultimately has an associated
diagonal block matrix of order $m$ with $m$ eigenvalues which tend to $|\lambda_i|$. In
such cases, the block diagonal matrix itself may not tend to a limit, but the
eigenvalues of the diagonal blocks will tend to a limit. It can be shown that the
superdiagonal element $(A_s)_{ij}$ behaves asymptotically as $k_{ij}(\lambda_i/\lambda_j)^s$, where $k_{ij}$
is a constant. Thus, if some of the eigenvalues are close in magnitude, the con-
vergence could be very slow. The rate of convergence can be improved by using
the technique of shifting the eigenvalues by working with the matrix As - ksI,
for suitably chosen k s . The iteration is then defined by

$$A_s - k_s I = Q_s L_s, \qquad A_{s+1} = L_s Q_s + k_s I. \qquad (11.60)$$

Thus, the sequence of matrices $A_s$ obtained in this manner are still similar
to $A_0$, but the contribution of the $s$th step towards convergence of the $(i,j)$
element is determined by the ratio $(\lambda_i - k_s)/(\lambda_j - k_s)$ rather than $\lambda_i/\lambda_j$. If
$k_s$ is chosen to be close to $\lambda_1$, the eigenvalue with the smallest modulus, then
the off-diagonal elements in the first row would decrease rapidly. When they
are negligible to working accuracy, $a_{11}^{(s)}$ can be accepted as an eigenvalue and
the other eigenvalues are those of the remaining principal submatrix of order $n -
1$. Hence, in subsequent iterations we can work with this submatrix. In this
manner, we can go on finding eigenvalues one by one, until all eigenvalues are
determined. Further, if the algorithm is carried out without any shift on the
original matrix, then $a_{11}^{(s)} \to \lambda_1$ and the algorithm itself gives a suitable value
of the shift $k_s$ to be used.
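Since standard libraries provide a QR but usually no QL factorisation, the sketch below obtains the QL factors by reversing rows and columns and then carries out the shifted iteration (11.60) on a small symmetric matrix; the test matrix, the number of iterations and the crude choice $k_s = a_{11}^{(s)}$ are illustrative only (a better shift, based on the leading $2 \times 2$ submatrix, is described later in this section).

import numpy as np

def ql_factor(A):
    """QL factorisation A = Q L, obtained from a QR factorisation of the
    row- and column-reversed matrix (NumPy provides only QR)."""
    J = A[::-1, ::-1]                       # reverse rows and columns
    Qr, Rr = np.linalg.qr(J)
    return Qr[::-1, ::-1], Rr[::-1, ::-1]   # Q orthogonal, L lower triangular

def ql_step(A, k):
    """One shifted QL step (11.60):  A - kI = Q L,  A_next = L Q + k I."""
    n = A.shape[0]
    Q, L = ql_factor(A - k * np.eye(n))
    return L @ Q + k * np.eye(n)

A = np.array([[1.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [0.0, 1.0, 3.0]])
for _ in range(10):
    A = ql_step(A, A[0, 0])   # crude shift: the current (1,1) element
print(np.round(A, 6))         # a[0, 0] has settled on one eigenvalue of A and the rest of
                              # its row is negligible; a real code would now deflate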
The algorithm in this form can be used for a dense matrix, but each
iteration requires O(n 3 ) operations. It can be seen that the transformation
preserves the form of the matrix. Thus, if we start with a real symmetric matrix $A_0$,
then all $A_s$, $(s > 0)$, are also real and symmetric. Similarly, if $A_0$ is tridiagonal,
then $A_s$, $(s > 0)$, are also tridiagonal. Further, each iteration for a tridiagonal
matrix requires only $O(n)$ operations. Hence, it is advisable to reduce a real
symmetric matrix to tridiagonal form before applying the QL algorithm. If we
start with a symmetric tridiagonal matrix, then all matrices $A_s$ are symmetric
and tridiagonal and $A_s$ tends to a diagonal matrix as $s \to \infty$.
For a real symmetric tridiagonal matrix, it is convenient to use the iteration

$$A_s - k_s I = Q_s L_s, \qquad A_{s+1} = L_s Q_s. \qquad (11.61)$$

Note that we do not add $k_s I$. Hence, $A_{s+1}$ is similar to $A_0 - \sum_j k_j I$. The
orthogonal matrix $Q_s$ is obtained by a sequence of Givens' rotations

$$Q_s^T = P_1^{(s)} P_2^{(s)} \cdots P_{n-1}^{(s)}. \qquad (11.62)$$

Here $P_i^{(s)}$ is a rotation in the $(i, i + 1)$ plane designed to annihilate the $(i, i + 1)$
element. This rotation introduces a nonzero element in the $(i + 2, i)$ position.
Thus, the lower triangular matrix $L_s$ has its first three diagonals nonzero. The
matrix $P_i^{(s)}$ is an identity matrix except for the following elements

$$p_{ii} = p_{i+1,i+1} = \cos\theta_i = \frac{d_{i+1}}{\sqrt{d_{i+1}^2 + e_i^2}}, \qquad p_{i+1,i} = -p_{i,i+1} = \sin\theta_i = \frac{e_i}{\sqrt{d_{i+1}^2 + e_i^2}}, \qquad (11.63)$$

where $d_{i+1}$ is the $(i+1)$th diagonal element of $P_{i+1}^{(s)} P_{i+2}^{(s)} \cdots P_{n-1}^{(s)} (A_s - k_s I)$, and $e_i$ is the
element in the $(i, i + 1)$ position.
The reduction to $L_s$ and the determination of $A_{s+1}$ can proceed side by
side. Let $d_i^{(s)}$, $(i = 1, \ldots, n)$, be the diagonal elements and $e_i^{(s)} = a_{i,i+1}$,
$(i = 1, \ldots, n - 1)$, be the off-diagonal elements. It may be noted that here we
number the off-diagonal elements from $1, \ldots, n - 1$. For simplicity, we omit the
upper suffix $s$ in everything except $d$ and $e$. After some algebraic manipulation
it can be shown that one QL iteration is defined by the following algorithm:
1. Set $c_n = 1$, $s_n = 0$.
2. For i = n - 1, ... ,1 in succession calculate

(11.64)

3. Calculate
Since the eigenvalues of a real symmetric matrix are real, those of equal
modulus are either multiple or of opposite sign. Further, for a tridiagonal matrix
if the eigenvalue has a multiplicity r, then at least r - 1 off-diagonal elements
must be zero. In that case, the matrix can be split into r submatrices, each of
which has only simple eigenvalues. The eigenvalues with opposite signs will get
separated out due to the variable shift used in the algorithm. Hence, in this
case, we need not worry about the diagonal blocks corresponding to multiple
$|\lambda_i|$. In practice, if the tridiagonal matrix has been obtained by reduction of a
full matrix, then because of roundoff errors the required off-diagonal elements
may not be exactly zero. Hence, at every stage, we can check if any of the $e_i$
are negligible. For this purpose, the following simple criterion can be used

$$|e_i| \le \epsilon \left( |d_i| + |d_{i+1}| \right), \qquad (11.65)$$

where $\epsilon$ is the machine accuracy.

If any of the ei are found to be negligible, then the matrix is split and each
part is considered separately. From Example 11.5 it is clear that the matrix can
have very close eigenvalues, which cannot be separated within the precision of
the arithmetic being used, even though all off-diagonal elements are sizable.
Thus, in actual practice, despite all the tests for splitting, we may end up with
a submatrix with multiple eigenvalues. In such cases, the matrix will usually
split after the first few iterations of the QL algorithm.
At each stage, the shift can be determined by considering the eigenvalue of
the $2 \times 2$ submatrix at the top left-hand corner of the matrix. The shift $k_s$ can be
taken as the eigenvalue nearer to $a_{11}$. Actually, if $r - 1$ eigenvalues have already
been isolated, then $a_{11}$ should be replaced by $a_{rr}$, since only the submatrix in
the lower right-hand corner will be considered for further reduction. This choice
of $k_s$ has been fairly successful in practice, though it is by no means clear if
it is the optimum choice. It can be shown that with this choice the iteration
always converges and in most cases, the convergence is cubic. In practice, it
is found that the average number of iterations required per eigenvalue is often
less than two or three. Because of shifts, the order in which eigenvalues are
found is no longer predictable, as the eigenvalues may not be found in
increasing order of magnitude. As a result, this algorithm cannot be used to
find, say, the $r$ smallest eigenvalues. In principle, it is possible to modify the
QL algorithm to give the principal minors of the determinant, which can give
the information about the location of roots via the Sturm sequence property.
However, for such applications, where specific roots are required, the method of
bisection described in the previous section is preferable. If all eigenvalues and
possibly the eigenvectors are required, then it is much more efficient to use the
QL algorithm.
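The shift just described can be written down in closed form from the quadratic for the eigenvalues of a $2 \times 2$ symmetric matrix; a small sketch (the function name and the guard against a zero denominator are choices made here):

import numpy as np

def ql_shift(d1, d2, e):
    """Shift k_s: the eigenvalue of the 2 x 2 block [[d1, e], [e, d2]] nearer to d1."""
    g = 0.5 * (d2 - d1)
    r = np.hypot(g, e)                     # sqrt(g**2 + e**2) without overflow
    if r == 0.0:
        return d1                          # block is already diagonal with equal elements
    return d1 - e * e / (g + np.copysign(r, g))

# usage: ks = ql_shift(A[0, 0], A[1, 1], A[0, 1]), applied to the leading block of the
# submatrix currently being reduced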
An implementation of this algorithm is provided by subroutine TQL2 in
Appendix B, which is again based on the procedure tql2 in the Handbook. If
the tridiagonal matrix is obtained by reduction of a full real symmetric ma-
trix using subroutine TRED2, then subroutine TQL2 also performs the back-
transformation required to give the eigenvectors of the original matrix. This
back-transformation is achieved by accumulating the transformations by mul-
tiplying the matrix Qs to the matrix Q obtained by TRED2 at every step.
With this accumulation, the matrix Q will be the orthogonal matrix which
diagonalises the original matrix. Hence, the columns of this matrix are the re-
quired eigenvectors. If the eigenvalues are widely varying in magnitude and the
matrix is presented in such a way that the larger eigenvalues are determined
first, then the shift may introduce roundoff error which will affect the smaller
eigenvalues significantly. This problem may be avoided by using the so-called
QL algorithm with implicit shift (see the Handbook).
EXAMPLE 11.6: Solve the eigenvalue problem for the following matrix

$$A = \begin{pmatrix} 1 & 1 & 0 & 1 \\ 1 & 1 & -1 & 0 \\ 0 & -1 & 1 & 1 \\ 1 & 0 & 1 & 1 \end{pmatrix}. \qquad (11.66)$$

It may be noted that this eigenvalue problem corresponds to that for the Hermitian
matrix

$$\begin{pmatrix} 1 & 1 - i \\ 1 + i & 1 \end{pmatrix}. \qquad (11.67)$$

The matrix A is tridiagonal except for the two elements $a_{1,4}$ and $a_{4,1}$. This matrix can be reduced
to a tridiagonal form using the subroutine TRED2, which uses Householder transformations.
This subroutine gives an orthogonal matrix Q such that $Q^T A Q = T$, where

$$Q = \begin{pmatrix} 0.7071068 & 0.0000000 & -0.7071068 & 0.0000000 \\ 0.0000000 & 1.0000000 & 0.0000000 & 0.0000000 \\ -0.7071068 & 0.0000000 & -0.7071068 & 0.0000000 \\ 0.0000000 & 0.0000000 & 0.0000000 & 1.0000000 \end{pmatrix}, \qquad (11.68)$$

$$T = \begin{pmatrix} 1.000000 & 1.414214 & 0 & 0 \\ 1.414214 & 1.000000 & 0.000000 & 0 \\ 0 & 0.000000 & 1.000000 & -1.414214 \\ 0 & 0 & -1.414214 & 1.000000 \end{pmatrix}.$$

It can be seen that the element t2,3 is zero and the matrix splits into two parts, which can
be treated independently. The splitting is inevitable, since the eigenvalue problem has been
obtained from that of a complex Hermitian matrix and all eigenvalues are expected to be
multiple. The eigenvalues and eigenvectors of the tridiagonal matrix T can be found using the
subroutine TQL2, which also performs the back-transformation to produce the eigenvectors of
the original matrix A. This subroutine requires only two QL iterations to find all eigenvalues.
The eigenvalues and eigenvectors are


$\lambda = -0.4142134$   $x = (0.5000000, -0.7071068, -0.5000000, 0.0000000)^T$;
$\lambda = -0.4142134$   $x = (-0.5000000, 0.0000000, -0.5000000, 0.7071068)^T$;        (11.69)
$\lambda = 2.414213$    $x = (0.5000000, 0.7071068, -0.5000000, 0.0000000)^T$;
$\lambda = 2.414214$    $x = (0.5000000, 0.0000000, 0.5000000, 0.7071068)^T$;

and the eigenvectors of the Hermitian matrix are

$$x_1 = (0.5000000 - 0.5000000\,i,\; -0.7071068)^T, \qquad x_2 = (0.5000000 - 0.5000000\,i,\; 0.7071068)^T. \qquad (11.70)$$

These results can be compared with the exact eigenvalues $1 \pm \sqrt{2}$ and eigenvectors
$\left(\tfrac{1}{2}(1 - i),\; \pm\tfrac{1}{2}\sqrt{2}\right)^T$.

11.6 Reduction of a Matrix to Hessenberg Form


The basic idea of preceding sections can also be applied to a general unsymmet-
ric matrix. Thus, it is possible to reduce an unsymmetric matrix to a tridiagonal
form. However, in general, such reduction may be numerically unstable. Hence,
it is preferable to reduce such matrices to upper or lower Hessenberg form. In
fact, if the Householder's method described in Section 11.4 is applied to an
unsymmetric matrix, the matrix will reduce to a Hessenberg form. However,
Gaussian elimination turns out to be more efficient for reducing a general ma-
trix to such form and we consider that in this section. The eigenvalues and
eigenvectors of a Hessenberg matrix can be calculated using the QR algorithm
described in Section 11.8.
It is impossible to design algorithms for solving the general eigenvalue
problem, which are numerically as satisfactory as those for the real symmetric
(or Hermitian) matrices, since the eigenvalues themselves may be very sensitive
to small changes in the matrix elements. Further, the matrix may be defective,
in which case, there is no complete set of eigenvectors. In practice, it is virtually
impossible to show that a given matrix is defective, unless the elements as well
as the eigenvalues and components of eigenvectors are all rational numbers.
In presence of roundoff error, it is impossible to prove that a given matrix is
defective. Hence, all algorithms always attempt to find the same number of
eigenvectors as eigenvalues. From the computed eigenvectors, it may often be
evident that the matrix is defective or nearly so. In that case, the corresponding
eigenvectors may be nearly parallel. In many cases, the computed eigenvalues
corresponding to a nonlinear divisor may not be very close.
Accurate solution of an eigenvalue problem may present practical diffi-
culties if the matrix is badly balanced, that is, if the corresponding rows and
columns have very different norms. Balancing essentially reduces the norm of
the matrix and since roundoff error is usually proportional to some norm of the
matrix, the error may be reduced by balancing a matrix. Before attempting
numerical calculation of eigenvalues or eigenvectors, it is advisable to balance
the matrix. Balancing usually improves the accuracy substantially, when the
original matrix is badly balanced. If the original matrix is fairly well balanced,
then balancing will have little effect, but since the time required for balancing
is small as compared to that required in the solution of eigenvalue problem, it


is a good strategy to balance the matrix before solving the eigenvalue problem.
Balancing is achieved by using a similarity transformation with a diagonal
matrix D, to get
$$D^{-1} A D. \qquad (11.71)$$
In order to avoid roundoff error during balancing, it is desirable to have elements
of D as powers of the base of the floating-point arithmetic used. On most
computers, it means that elements of D are powers of 2, so that multiplications
and divisions can be accomplished by a simple shift in the exponent, which
does not involve any roundoff error. In practice, the matrix D is obtained by a
sequence of such transformations, each of which affects only one row and the
corresponding column. Let $i$ be the index of the row and column modified in
the $k$th step; then $i - 1 \equiv k - 1 \pmod{n}$. Thus, the rows are modified cyclically
in their natural order. The process is terminated when for a complete cycle
no modification is required. At the $k$th step, we find the norm of the $i$th row and
column

$$R_i = \sum_{j \ne i} |a_{ij}|, \qquad C_i = \sum_{j \ne i} |a_{ji}|. \qquad (11.72)$$

Now if $R_i C_i \ne 0$ and $b$ is the base of the floating-point arithmetic used, then
there exists a unique integer $\alpha$, such that

$$b^{2\alpha - 1} < \frac{R_i}{C_i} \le b^{2\alpha + 1}. \qquad (11.73)$$

If $f = b^{\alpha}$, then a modification is made if

$$C_i f + R_i/f < \gamma (C_i + R_i), \qquad (11.74)$$

where $\gamma \le 1$ is some predetermined constant. If a modification is required,
then the corresponding diagonal matrix $D_k$ has the element $d_{ii} = f$, with the other
diagonal elements being unity. If $\gamma = 1$, then in every step, the factor $f$ is that
integer power of the base $b$ which gives the maximum reduction in the contribution
of the $i$th row and column to the norm of A. If $\gamma$ is slightly smaller than unity,
a step is skipped if it would produce an insignificant reduction in the norm of
A.
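A sketch of one balancing cycle along the lines of (11.72)-(11.74); the base $b = 2$ and the constant $\gamma = 0.95$ are illustrative choices, and the isolation of eigenvalues by permutations performed by BALANC is omitted here.

import numpy as np

def balance_once(A, b=2.0, gamma=0.95):
    """One full cycle of diagonal balancing; returns D^{-1} A D and the diagonal of D.
    Repeat the cycle until no element of D changes."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    d = np.ones(n)
    for i in range(n):
        R = np.sum(np.abs(A[i, :])) - abs(A[i, i])    # row norm, diagonal excluded
        C = np.sum(np.abs(A[:, i])) - abs(A[i, i])    # column norm, diagonal excluded
        if R == 0.0 or C == 0.0:
            continue
        f = 1.0
        while C * f < R / f / b:       # grow f while f**2 is below R/(C*b)
            f *= b
        while C * f > R / f * b:       # shrink f while f**2 is above R*b/C
            f /= b
        if C * f + R / f < gamma * (C + R):           # criterion (11.74)
            d[i] *= f
            A[:, i] *= f               # scale column i by f ...
            A[i, :] /= f               # ... and row i by 1/f, i.e. A <- D^{-1} A D
    return A, d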
An algorithm to balance a general matrix is provided by subroutine BAL-
ANC in Appendix B which is based on the procedure balance in the Handbook.
Subroutine BALANC also recognises "isolated" eigenvalues, that is eigenval-
ues which are available by inspection without any computation and its use
will ensure that such eigenvalues are determined accurately, even if they are ill-
conditioned. If the subroutine BALANC is used to balance the matrix, then the
eigenvectors of the original matrix can be recovered from those of the balanced
matrix by using the subroutine BALBAK.
If the ith column has all subdiagonal elements zero for i = 1,2, ... , ko,
then the first ko diagonal elements are eigenvalues of the matrix. Such eigen-
values can be isolated by applying appropriate permutations of the columns. It
is better to detect such eigenvalues, so that they can be computed accurately


and moreover, we have to worry about only the remaining submatrix of order
n - ko, which improves the efficiency. Similarly, we can also isolate eigenvalues
along the rows. Subroutine BALANC applies column permutations to bring the
required columns to the left corner, while the rows are transferred to the bot-
tom of the matrix. Thus, after permutation, the matrix is typically partitioned
as
$$P^T A P = \begin{pmatrix} \begin{smallmatrix} \times & \times & \times \\ 0 & \times & \times \\ 0 & 0 & \times \end{smallmatrix} & X & Y \\ 0 & Z & W \\ 0 & 0 & \begin{smallmatrix} \times & \times & \times \\ 0 & \times & \times \\ 0 & 0 & \times \end{smallmatrix} \end{pmatrix}, \qquad (11.75)$$
where P is the corresponding permutation matrix. The subsequent calculations
may be carried out using the reduced matrix Z. This permutation should be
done before balancing.
For reduction to the Hessenberg form it is more efficient to use the Gaus-
sian elimination with partial pivoting. This transformation can be achieved in
a finite number of steps, with relevant elements of one column being zeroed at
each step. As shown in Section 3.3, each major step of Gaussian elimination
algorithm is equivalent to premultiplying the matrix by an elementary matrix.
In order to obtain a similarity transformation, the matrix should also be post-
multiplied by the corresponding inverse. It turns out that with this constraint,
it is not possible to reduce the matrix to a triangular form, since the elements
which are reduced to zero during elimination become nonzero when the corre-
sponding inverse is postmultiplied. However, it is possible to reduce a general
matrix to Hessenberg form using Gaussian elimination. Further, this process
requires only half as many arithmetic operations as that required by the House-
holder's method. The algorithm consists of n - 2 major steps, where in the kth
step we perform the following operations:
1. Determine the maximum of the quantities $|a_{jk}|$, $(j = k + 1, \ldots, n)$. If
this maximum is not unique, take the first of these maximum values. If
the maximum is zero, then the $k$th step is complete, since all the required
elements are already zero. Otherwise, denote the maximum element by
$a_{rk}$ and proceed as follows.
2. If $r \ne k + 1$, interchange the rows $k + 1$ and $r$. To make it a similarity
transformation, also interchange the columns $k + 1$ and $r$.
3. For each value of $j$ from $k + 2$ to $n$, calculate

$$m_{j,k+1} = \frac{a_{jk}}{a_{k+1,k}}, \qquad (11.76)$$
where, because of the previous steps, $|m_{j,k+1}| \le 1$. If $m_{j,k+1} \ne 0$, then subtract
$m_{j,k+1}$ times row $k + 1$ from row $j$. Once again, to make the elimination
a similarity transformation, also add $m_{j,k+1}$ times column $j$ to column
$k + 1$.
As with the solution of a system of linear equations, the partial pivoting which
constitutes the first two steps listed above ensures numerical stability of the
elimination process. This algorithm is also referred to as reduction to Hessen-
berg form using stabilised elementary transformations. It can be shown that the
postmultiplication by the elementary matrix does not disturb the zero elements in-
troduced during the elimination phase. This algorithm requires approximately
$\tfrac{5}{6}n^3$ multiplications and additions. The multipliers $m_{j,k+1}$ can be overwritten
on the location $a_{jk}$ which is reduced to zero. This algorithm is implemented in
subroutine ELMHES in Appendix B, which is based on the procedure elmhes
in the Handbook.
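The three steps above translate almost directly into the following sketch (an illustration only; the subroutine ELMHES additionally records the multipliers and the interchanges needed later to transform eigenvectors back, which is omitted here).

import numpy as np

def hessenberg_by_elimination(A):
    """Reduce a general real matrix to upper Hessenberg form by Gaussian
    elimination with partial pivoting (stabilised elementary transformations)."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    for k in range(n - 2):
        # step 1: pivot = largest |a_{jk}| below the subdiagonal position
        r = k + 1 + np.argmax(np.abs(A[k+1:, k]))
        if A[r, k] == 0.0:
            continue                        # nothing to eliminate in this column
        # step 2: symmetric interchange of rows and columns k+1 and r
        if r != k + 1:
            A[[k+1, r], :] = A[[r, k+1], :]
            A[:, [k+1, r]] = A[:, [r, k+1]]
        # step 3: eliminate below the subdiagonal and restore the similarity
        for j in range(k + 2, n):
            m = A[j, k] / A[k+1, k]         # |m| <= 1 because of the pivoting
            if m != 0.0:
                A[j, :] -= m * A[k+1, :]    # row operation
                A[:, k+1] += m * A[:, j]    # corresponding column operation
    return A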
If the elements of a matrix change rapidly in magnitude as we move in
a direction parallel to the principal diagonal, the matrix should be presented
such that the largest element is in the top left-hand corner. This arrangement
may require reflecting the matrix in the secondary diagonal. It may be noted
that, this arrangement is opposite of that required by the subroutine TRED2,
since there the reduction starts with the last row and the last column.
Just like the LU decomposition, this algorithm can also be rearranged
so that each element is modified only once. In that case, the inner products
occurring in the transformation can be accumulated using higher precision
arithmetic, giving better accuracy. The direct reduction to Hessenberg form
is similar to the Crout's algorithm for LU decomposition. The number of arith-
metic operations required for direct reduction is the same as that required for
the step by step reduction. It may be noted that, this method can also be used
to reduce a general complex matrix to a Hessenberg form, provided complex
arithmetic is used in the calculations.
It may be noted that the Householder's method, which uses orthogonal
transformations to reduce the matrix to Hessenberg form, is numerically more
stable than the method based on Gaussian elimination described here. As seen
in Section 3.4, there are matrices for which the matrix elements can grow rapidly
when partial pivoting as described above is used. For such matrices, the round-
off error could be rather large. Nevertheless, such matrices are rare in actual
practice, and in most cases, there is no growth of elements during Gaussian
elimination with partial pivoting. Thus, even though the theoretical bound on
the roundoff error is smaller for Householder's method, in most practical cases,
the roundoff errors in the two methods is of the same order and there is little
justification for spending the extra effort in using orthogonal transformations.
However, if we do encounter matrices for which elements grow rapidly, then
we can consider using Householder's method. Subroutine implementing this
method is given in the Handbook.
To complete the eigenvalue problem, we need to find the eigenvalues and
eigenvectors of the reduced Hessenberg matrix, which can be achieved using
the QR algorithm described in Section 11.8. If eigenvectors are also required,


then eigenvectors of the Hessenberg matrix need to be transformed back to the
eigenvectors of the original matrix by using the matrix of transformation.

11. 7 Lanczos Method


In the preceding section, we considered an algorithm for reducing a general
real matrix to Hessenberg form. However, if the matrix is sparse, the sparsity
is generally destroyed as the transformation proceeds. If the matrix is in a
band form, it is possible to modify the algorithm to preserve the zero elements
to a large extent. However, if the nonzero elements are not restricted to a
narrow band around the diagonal, it is difficult to avoid filling in of the matrix.
In this section, we describe the Lanczos method for reducing a general real
matrix to tridiagonal form, which uses only matrix multiplication with the
original matrix. Hence, this method can be easily applied to sparse matrices.
Of course, this method can also be applied to filled matrices, but it turns out
that numerical property of this method are not as good as that of Gaussian
elimination considered in the previous section. Roundoff error could pose a
serious problem with this method. However, for large sparse matrices, we may
not have any alternative and this method may be useful.
In the Lanczos method, the aim is to find a similarity transformation which
reduces a general matrix to tridiagonal form. This transformation is achieved by
constructing two sequences of vectors Xl, X2, ... ,X n and YI, Y2, ... , Yn, which
satisfy the biorthogonality condition

$$y_i^T x_j = 0, \qquad \text{if } i \ne j. \qquad (11.77)$$

These sequences of vectors are generated by the following recurrence

$$x_{k+1} = A x_k - b_k x_k - c_{k-1} x_{k-1}, \qquad y_{k+1} = A^T y_k - b_k y_k - c_{k-1} y_{k-1}, \qquad (k = 1, 2, \ldots, n-1), \qquad (11.78)$$

where

$$c_0 = 0, \qquad b_k = \frac{y_k^T A x_k}{y_k^T x_k}, \qquad c_{k-1} = \frac{y_{k-1}^T A x_k}{y_{k-1}^T x_{k-1}}. \qquad (11.79)$$

The recurrence is started with $x_0 = y_0 = 0$ and $x_1$ and $y_1$ as some arbitrary
vectors. The coefficients $b_k$ and $c_{k-1}$ are chosen so as to make the new vector
$x_{k+1}$ orthogonal to $y_k$ and $y_{k-1}$. Clearly, this recurrence requires that $y_j^T x_j \ne 0$
for $j = 1, 2, \ldots, n - 1$. If we assume that this condition is satisfied, then it can
be shown that the sequence of vectors given by the above recurrence satisfies
the biorthogonality relation (11.77). This essentially means that if $A x_k$ is made
orthogonal to $y_k$ and $y_{k-1}$, it automatically becomes orthogonal to the rest of
the $y_j$, $(j = k - 2, \ldots, 1)$. Further, it can be proved that under these assumptions,
the vectors $x_1, \ldots, x_n$ are linearly independent and if the same relations are
used to generate $x_{n+1}$, then $x_{n+1} = 0$.
Using the first equation in (11.78), we get

$$A x_1 = b_1 x_1 + x_2, \qquad A x_k = c_{k-1} x_{k-1} + b_k x_k + x_{k+1}, \quad (k = 2, 3, \ldots, n-1), \qquad A x_n = c_{n-1} x_{n-1} + b_n x_n. \qquad (11.80)$$

Now if $S$ is a matrix with columns $x_1, \ldots, x_n$, then the above equations can be
written as

$$A S = S \begin{pmatrix} b_1 & c_1 & & & \\ 1 & b_2 & c_2 & & \\ & \ddots & \ddots & \ddots & \\ & & 1 & b_{n-1} & c_{n-1} \\ & & & 1 & b_n \end{pmatrix} = S T, \qquad (11.81)$$

where $T$ is a tridiagonal matrix. Hence, we get the required similarity transfor-
mation

$$S^{-1} A S = T. \qquad (11.82)$$
Thus, a general matrix A can be reduced to a tridiagonal matrix, provided
$y_j^T x_j \ne 0$ for $j = 1, \ldots, n$. In practice, it is found that if we start with arbitrary
vectors $x_1$ and $y_1$, then at some stage this condition is violated, in the sense
that the scalar product comes out to be very small, leading to excessive roundoff
error. There is no simple remedy to this problem and we may have to restart the
calculation with a different starting vector. Various strategies for constructing
the full sequence of acceptable vectors are discussed in Wilkinson (1988).
As can be seen from (11.78), the only operation that is required to be
performed on the matrix is a matrix multiplication. This operation can be
easily performed for a sparse matrix taking full advantage of the zero elements.
Hence, this method may prove to be useful for large sparse matrices. In view of
the possible numerical instability, there is no reason to use this method for dense
matrices. It may be noted that the matrix T is not in general symmetric, even
if the original matrix is symmetric. The principal minors of this matrix may
not form a Sturm sequence and as such the method described in Section 11.4
cannot be used to find eigenvalues of this matrix. The QL algorithm described
in Section 11.5 can be used to find eigenvalues of this matrix, but we will have
to worry about complex eigenvalues. The QR algorithm described in the next
section can be used to find all eigenvalues and possibly eigenvectors of this
matrix.
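A minimal sketch of the recurrence (11.78)-(11.79) as described above; the breakdown test, the tolerance and the random test matrix are arbitrary choices here, and a practical code would restart rather than stop.

import numpy as np

def lanczos_unsymmetric(A, x1, y1, tol=1e-12):
    """Two-sided Lanczos recurrence (11.78)-(11.79).  Returns S (columns x_1..x_n)
    and the tridiagonal T of (11.81), so that A S = S T up to roundoff."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    X = np.zeros((n, n)); Y = np.zeros((n, n))
    X[:, 0], Y[:, 0] = x1, y1
    b = np.zeros(n); c = np.zeros(n - 1)
    s_prev = None
    for k in range(n):
        s = Y[:, k] @ X[:, k]
        if abs(s) < tol:
            raise RuntimeError("y_k^T x_k ~ 0: restart with different starting vectors")
        Ax = A @ X[:, k]
        b[k] = (Y[:, k] @ Ax) / s
        ck1 = (Y[:, k-1] @ Ax) / s_prev if k > 0 else 0.0   # this is c_{k-1}
        if k > 0:
            c[k-1] = ck1
        if k < n - 1:
            X[:, k+1] = Ax - b[k] * X[:, k] - (ck1 * X[:, k-1] if k > 0 else 0.0)
            Y[:, k+1] = A.T @ Y[:, k] - b[k] * Y[:, k] - (ck1 * Y[:, k-1] if k > 0 else 0.0)
        s_prev = s
    T = np.diag(b) + np.diag(c, 1) + np.diag(np.ones(n - 1), -1)
    return X, T

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 6))
X, T = lanczos_unsymmetric(A, rng.standard_normal(6), rng.standard_normal(6))
print(np.allclose(A @ X, X @ T))   # True (up to roundoff) when no breakdown occurs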

11.8 QR Algorithm for a Real Hessenberg Matrix


The QR algorithm is very similar to the QL algorithm described in Section 11.5,
the only difference being that in this case, the factorisation is obtained in the
form A = QR, where Q is once again orthogonal while R is upper triangular


instead of lower triangular. The properties of this transformation are almost
identical to those of the QL transformation and will not be repeated here.
In this section, we describe the difference that arises, because the matrices
to which the algorithms are applied are different. Here we are interested in
applying QR algorithm to upper Hessenberg matrices obtained by reducing a
general unsymmetric matrix. The major difference arises, because in this case,
the eigenvalues need not be real and since we will prefer to restrict ourselves to
real arithmetic, it may not be possible to isolate complex eigenvalues by using
real shifts. For example, if ).. = a + ib and)" + E (lEI « Ibl) are two complex
eigenvalues, then using any real shift it is impossible to separate them, in the
sense I().. - p) / ().. + E - p) I ;:::0 1 for all real p. Hence, the convergence will be
slow.
In principle, complex eigenvalues can be isolated by performing two successive
shifts k_s and k_{s+1} = k̄_s, the complex conjugate of k_s. After these two shifts the matrix should
again be real. However, in practice it turns out that because of roundoff error,
the resulting matrix elements may have significant imaginary parts and neglect
of these can produce serious errors. Further, complex arithmetic will be required
for carrying out the individual transformations. This problem can be avoided
by performing the double QR transformation with an implicit shift.
The double QR transformation can be used with either two real shifts
or a pair of complex conjugate shifts. This transformation is based on the
following result. If B is nonsingular and BQ = Q H with orthogonal Q and
upper Hessenberg H, then the matrices Q and H are determined by the first
column of Q. Further, this determination is unique if H has positive subdiagonal
elements. Now if two steps of the QR algorithm with shifts ks and ks+l are
performed, we get
    A_s - k_s I = Q_s R_s,    A_{s+1} = R_s Q_s + k_s I,    (11.83)
and
    A_{s+1} - k_{s+1} I = Q_{s+1} R_{s+1},    A_{s+2} = R_{s+1} Q_{s+1} + k_{s+1} I.    (11.84)
Defining
    Q = Q_s Q_{s+1},    R = R_{s+1} R_s,    M = (A_s - k_s I)(A_s - k_{s+1} I),    (11.85)
we have
    R = Q^T M.    (11.86)
Here the matrices Qs, Qs+l, Rs and Rs+I may be complex, but Q and R will be
real. In the present application As and As+2 are both upper Hessenberg. Hence,
given the initial matrix A_s, the matrices Q and A_{s+2} are determined, provided
the first column of Q is known. Further, the same matrix Q should reduce the
matrix M to an upper triangular matrix R. Now the matrix M can be reduced
to an upper triangular form using a sequence of Householder transformations
P_1, P_2, ..., P_{n-1}, where P_r reduces the relevant elements in the rth column to
zero. From the form of P_r it can be shown that the first row of the matrix
P_{n-1} ... P_2 P_1 is the first row of P_1 itself. Hence, to determine Q, we only need
to determine the first row of P_1, which in turn is determined by the first column
of M. Now the first column of M can be written as (p_1, q_1, r_1, 0, ..., 0)^T, where

    p_1 = a_{11}^2 - a_{11}(k_s + k_{s+1}) + k_s k_{s+1} + a_{12} a_{21},
    q_1 = a_{21}(a_{11} + a_{22} - k_s - k_{s+1}),    (11.87)
    r_1 = a_{32} a_{21}.

Using (11.87) we can determine P_1 = I - 2 w_1 w_1^T, where the unit vector w_1
has only the first three components nonzero. With this prescription, the matrix
P_1 A_s P_1 is almost upper Hessenberg, in the sense that apart from a_{31}, a_{41}, a_{42}
all other elements a_{ij} (i > j + 1) are zero. This matrix can be reduced to an
upper Hessenberg form by using a sequence of orthogonal similarity transformations
as in Section 11.4, to get

    A_{s+2} = P_{n-1} ... P_2 P_1 A_s P_1 P_2 ... P_{n-1}.    (11.88)

It can be shown that at each stage of this transformation the matrix is upper
Hessenberg, except for three extra nonzero elements, and the Householder matrix
P_r = I - 2 w_r w_r^T is determined by three nonzero elements in w_r. Once
again, the first column of P_{n-1} ... P_2 P_1 is the same as the first column of P_1.
Hence, this transformation should be the same as the one in (11.86). Thus,
we get the required transformation matrix Q = P_{n-1} ... P_2 P_1 for the double
QR transformation. It is quite clear from the above discussion that no complex
arithmetic is involved in this process, even if k_s and k_{s+1} form a complex
conjugate pair.
To sum up, a complete step of the double QR algorithm is as follows:
1. Compute the shifts k_s and k_{s+1} as explained below.
2. Using these shifts compute the quantities p_1, q_1 and r_1 defined by (11.87).
Find the Householder matrix P_1 that annihilates the second and third elements
of the first column of M.
3. Reduce the matrix P_1 A_s P_1 to Hessenberg form using a sequence of Householder
similarity transformations, yielding A_{s+2}.
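The three steps can be summarised in a short sketch of a single double-shift step on an upper Hessenberg matrix. It is written here in Python with NumPy rather than in the Fortran of Appendix B, assumes n >= 3, and leaves out the convergence tests and deflation of subroutine HQR; it is meant only to show how the bulge created by P_1 is chased down the matrix.

    import numpy as np

    def double_qr_step(H):
        """One implicit double-shift QR step on an upper Hessenberg matrix H."""
        H = np.array(H, dtype=float)
        n = H.shape[0]                       # assumes n >= 3
        # step 1: the shifts enter only through their sum s and product t, cf. (11.89)
        s = H[n-2, n-2] + H[n-1, n-1]
        t = H[n-2, n-2]*H[n-1, n-1] - H[n-2, n-1]*H[n-1, n-2]
        # step 2: first column of M, cf. (11.87)
        x = H[0, 0]**2 + H[0, 1]*H[1, 0] - s*H[0, 0] + t
        y = H[1, 0]*(H[0, 0] + H[1, 1] - s)
        z = H[2, 1]*H[1, 0]
        # step 3: chase the bulge with 3x3 Householder reflectors P_k = I - 2 u u^T
        for k in range(n - 2):
            v = np.array([x, y, z])
            if np.linalg.norm(v) > 0.0:
                u = v.copy()
                u[0] += np.copysign(np.linalg.norm(v), v[0])
                u /= np.linalg.norm(u)
                q = max(k - 1, 0)
                H[k:k+3, q:] -= 2.0*np.outer(u, u @ H[k:k+3, q:])      # P_k from the left
                r = min(k + 4, n)
                H[:r, k:k+3] -= 2.0*np.outer(H[:r, k:k+3] @ u, u)      # P_k from the right
            x, y = H[k+1, k], H[k+2, k]
            z = H[k+3, k] if k < n - 3 else 0.0
        # final 2x2 reflector acting on the last two rows and columns
        v = np.array([x, y])
        if np.linalg.norm(v) > 0.0:
            u = v.copy()
            u[0] += np.copysign(np.linalg.norm(v), v[0])
            u /= np.linalg.norm(u)
            H[n-2:, n-3:] -= 2.0*np.outer(u, u @ H[n-2:, n-3:])
            H[:, n-2:] -= 2.0*np.outer(H[:, n-2:] @ u, u)
        return H

A practical routine repeats this step, combined with the tests for negligible subdiagonal elements and the deflation described below, until all eigenvalues are isolated.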
Each step of the double QR algorithm requires about 5n^2 multiplications. If
the shifts are chosen properly, then convergence is very fast and usually an average
of only two or three iterations per eigenvalue is required. Further, as explained
below, the matrix is deflated after each eigenvalue is isolated and the effective
value of n reduces. For these reasons, the double QR algorithm has
become the accepted method for finding eigenvalues of Hessenberg matrices.
The double QR transformation can be used to isolate both real and complex
eigenvalues using only real arithmetic. Once again, the shifts k_s and k_{s+1}
in the double QR transformation can be taken as the eigenvalues of the 2 x 2 matrix
in the lower right-hand corner of the matrix, which gives

    k_s + k_{s+1} = a_{n-1,n-1} + a_{nn},    k_s k_{s+1} = a_{n-1,n-1} a_{nn} - a_{n-1,n} a_{n,n-1}.    (11.89)
With these shifts, the convergence is usually very fast and the iteration is
continued until either a_{n,n-1} or a_{n-1,n-2} is negligible. In the first case, a_{nn}
can be regarded as an eigenvalue and the calculations are continued with a
deflated matrix obtained by dropping the last row and column. If a_{n-1,n-2} is
negligible, then the two eigenvalues of the 2 x 2 matrix at the bottom right-hand
corner can be regarded as eigenvalues of the matrix A^{(s)} and calculations
are continued with a deflated matrix obtained by dropping the last two rows
and columns. With this choice of shifts, convergence failure is very rare, but
there are matrices for which it occurs {27}. The failure can occur for matrices
with more than two eigenvalues of equal magnitude. In that case, the QR algorithm
may converge to a block triangular form with diagonal blocks larger than 2 x 2.
Such blocks will not be detected by the above criterion. It is difficult to take
care of such blocks, since their size is not known in advance. To take care of
such situations, we can adopt a different strategy for calculating shifts, if the
iteration fails to converge in, say 10 iterations. The alternative shift may be
given by

    k_s + k_{s+1} = 1.5(|a_{n,n-1}| + |a_{n-1,n-2}|),    k_s k_{s+1} = (|a_{n,n-1}| + |a_{n-1,n-2}|)^2.    (11.90)
There is no particular significance to these expressions and any random shift of
the same order of magnitude may work.
After each iteration of the double QR transformation, all subdiagonal
elements are examined to see if any of them is negligible, that is if

(11.91)

If any of these elements is negligible, then the matrix splits into two or more
submatrices, which can be considered separately. The search for a negligible
element is made starting from the last row and if it is found that a_{r+1,r}, (r < n - 2), is
negligible, then we can continue with the matrix of order n - r in the bottom
right-hand corner. Further, the matrix is also searched to see if two consecutive
subdiagonal elements are small enough to split the matrix. For this purpose,
we can use the following simple criterion (see the Handbook)

where p_m, q_m and r_m are given by equations analogous to (11.87) (with a_{11}
replaced by a_{mm}, and so on). If this criterion is satisfied, then we can consider
the submatrix of order n - m + 1 in the lower right-hand corner. The double
QR algorithm is implemented in subroutine HQR in Appendix B, which is again
based on the procedure hqr in the Handbook.
To calculate the eigenvectors of this matrix, we can consider the triangular
or block triangular matrix obtained by the QR transformation. The eigenvectors
of this matrix can be easily obtained by direct solution of the corresponding
system of linear equations. Since the matrix is already in an upper triangular
form, we only need to do back-substitution to obtain the eigenvector. Unlike
the case of a tridiagonal matrix, the direct solution for the eigenvectors in this case is
stable, since we know exactly which pivot is zero. If T is the upper triangular
matrix obtained by the QR transformation, then the eigenvector corresponding to
the ith eigenvalue λ_i = t_{ii} is given by

    x_{ji} = 0,  j = i+1, ..., n;    x_{ii} = 1;    x_{ji} = -( Σ_{k=j+1}^{i} t_{jk} x_{ki} ) / (t_{jj} - t_{ii}),   j = i-1, ..., 1.    (11.93)
Similarly, we can find expressions for eigenvectors for the case when the matrix
T has 2 x 2 blocks along the diagonal. Once the eigenvector of the triangular
matrix is determined, that of the original matrix can be obtained by multiply-
ing it by the matrices corresponding to: (1) the QR transformation, (2) the
reduction to upper Hessenberg form and (3) the balancing performed initially.
Subroutines for this purpose can be found in Smith et al. (1976) or in the Hand-
book. It may be noted that it requires considerable amount of bookkeeping to
keep track of the transformation matrix for the QR algorithm. Consequently,
subroutine HQR in Appendix B does not keep track of the transformations and
hence it cannot be used for calculating the eigenvectors.
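The back-substitution (11.93) takes only a few lines; a minimal sketch in Python (assuming distinct diagonal elements, so that t_jj ≠ t_ii, and using a zero-based index i) is

    import numpy as np

    def eigvec_triangular(T, i):
        """Eigenvector of an upper triangular matrix T for the eigenvalue t_ii,
        computed by the back-substitution (11.93)."""
        n = T.shape[0]
        x = np.zeros(n)
        x[i] = 1.0                                  # normalisation x_ii = 1
        for j in range(i - 1, -1, -1):
            s = T[j, j+1:i+1] @ x[j+1:i+1]          # sum of t_jk x_ki for k = j+1, ..., i
            x[j] = -s / (T[j, j] - T[i, i])         # assumes t_jj != t_ii
        return x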
If only a few selected eigenvectors are required, then we can use the inverse
iteration method. Once the eigenvalues are known, then this method converges
very rapidly. The inverse iteration can be applied to either the original matrix
or preferably to the one obtained after balancing, or to the Hessenberg matrix
obtained by similarity reduction. In the last two cases, the eigenvectors have to
be back-transformed to that of the original matrix using the matrix of transfor-
mation. This process will be most efficient with Hessenberg matrix, provided
the transformation matrix is available and the inverse iteration routine takes
into account the Hessenberg form of the matrix.
EXAMPLE 11.7: Solve the eigenvalue problem for the following matrix
6 0 0 0 0 1 0
0 4 0 3 X 10- 8 0.0001 0.0002 0.001
1 10000 7 0 0 -2 20
A= 0 2 X 10 8 0 -40000 30000 -4 x 10 5 (11.94)
-2 -30000 0 0.0001 2 2 40
0 0 0 0 0 0 0
0 1000 0 4 X 10- 5 0.1 -0.2 3
This matrix has elements with widely varying magnitudes and it is necessary to balance
the matrix before attempting the solution of the eigenvalue problem. Using subroutine BALANC
with b = 10, we get the following balanced matrix
7 0.01 0.02 0 0 1 -2
0 4 1 3 0 200
0 1 3 4 0 -200
Ab = 0 2 -4 -4 0 300 (11.95)
0 -3 4 1 2 -200 200
0 0 0 0 0 6 1
0 0 0 0 0 0 0

The corresponding diagonal matrix for balancing A is diag(1, 10^-6, 0.001, 100, 0.01, 1, 1). It
may be noted that here we have used b = 10 rather than 2, to demonstrate the effect in the
familiar decimal system. Apart from the reduction in norm, the subroutine BALANC also
detects the three eigenvalues 7, 6 and 0 in rows 1, 6 and 7, respectively. These eigenvalues
can be determined exactly and now we need to work with only the 4 x 4 submatrix in rows
LOW = 2 to IGH = 5 and the corresponding columns. It can be seen that while the original
matrix had elements ranging from 3 x 10^-8 to 2 x 10^8, the balanced submatrix in the relevant
rows and columns of A_b has all elements between 1 and 4. Thus, the variation is reduced
significantly. To find the eigenvalues of this 4 x 4 matrix, we first reduce it to an upper Hessenberg
form using the subroutine ELMHES, which yields the Hessenberg matrix H

7 0.01 -0.006666667 0.016 0.02 1 -2


0 4.000000 -1.333334 3.800001 1.000000 0 200.0000
0 -3.000000 -1.1921 x 10- 7 4.200001 4.000000 -200.0000 200.0000
0 0 -3.333333 0.600000 -1.333333 -133.3333 433.3334
0 0 0 7.320001 5.400000 40.00003 -480.0001
0 0 0 0 0 6 1
0 0 0 0 0 0 0
(11.96)
The eigenvalues of this matrix can be found using the subroutine HQR, which invokes the
QR algorithm. It turns out that only five double QR iterations are required to compute
all four eigenvalues of the balanced matrix. Table 11.4 gives the eigenvalues of this matrix.
To demonstrate the effect of balancing, we also compute the eigenvalues without balancing.
For this purpose, we use the same subroutines ELMHES and HQR directly on the original
matrix A and the results are also shown in Table 11.4. All calculations were done using
24-bit arithmetic. It can be easily seen that balancing has a significant effect on the accuracy of
the computed eigenvalues.

Table 11.4: Eigenvalues of a real matrix

Exact After balancing Without balancing

7 7 6.999119
-0.077155796...                      -0.07715616             -0.07766255
2.743808423... ± i 3.881158983...    2.743811 ± 3.881158i    2.744898 ± 3.882214i
4.589538948...                       4.589540                4.589580
6                                    6                       5.999151
0                                    0                       0

To find the eigenvectors, we can use inverse iteration on the balanced matrix A_b. Using
the known eigenvalues as shifts, the inverse iteration converges in one or two iterations.
However, for the three eigenvalues which are determined exactly, there could be some problem,
since the shifted matrix is exactly singular and during Gaussian elimination one of the pivots
may turn out to be exactly zero. This problem can be avoided if the calculated eigenvalues
are slightly perturbed before being used as shifts in the inverse iteration routine.
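The perturbed-shift device mentioned above is easily expressed in code. The sketch below (Python; a dense solver stands in for the Gaussian elimination routine, and the size of the perturbation is an arbitrary illustrative choice) performs a few steps of inverse iteration for a known eigenvalue.

    import numpy as np

    def inverse_iteration(A, lam, niter=3):
        """Inverse iteration for the eigenvector of A belonging to the known
        eigenvalue lam, with the shift perturbed slightly so that A - pI is
        not exactly singular."""
        n = A.shape[0]
        p = lam + 1e-6*(abs(lam) + 1.0)        # slightly perturbed shift
        x = np.ones(n)
        for _ in range(niter):
            x = np.linalg.solve(A - p*np.eye(n), x)
            x /= np.abs(x).max()               # normalise so that max |x_i| = 1
        return x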

For a general complex matrix C = A + iB it is possible to use the augmented
real matrix as defined by (11.58). In this case, corresponding to each
complex eigenvalue λ_k and eigenvector x_k of C, the augmented matrix has
the eigenvalues λ_k and its complex conjugate λ̄_k, with eigenvectors of the form

and (11.97)
Corresponding to a real eigenvalue λ_k of C with eigenvector x_k = u_k + i v_k, the
augmented matrix has a double eigenvalue λ_k and independent eigenvectors
given by

and (11.98)

If only the eigenvalues of the augmented matrix are calculated, there is no simple
way of telling which member of a complex conjugate pair of eigenvalues is the eigenvalue
of the original matrix. If the eigenvectors are also computed, then it is possible to
resolve this difficulty. If the eigenvector of the augmented matrix is partitioned
into u and v, then for the eigenvalue which is the complex conjugate of an eigenvalue
of the original matrix, the vector u + iv is null to within roundoff error. Such eigenvalues
should be discarded and the vector u + iv derived from the other member of the pair will be
the required eigenvector. For a real eigenvalue of the original matrix, the eigenvector u + iv
derived from either of the double eigenvalues of the augmented matrix gives the eigenvector
of the original matrix. Alternatively, if only a few eigenvalues and eigenvectors are required, it
is straightforward to use the method of inverse iteration for complex matrices
to determine the required eigenvalues and corresponding eigenvectors. In this
case the shift p will also need to be complex.

11.9 Roundoff Errors


A detailed error analysis of the various techniques for solving the eigenvalue problem
is beyond the scope of this book, but in this section we give a general outline of
the error estimates. In most methods of solving the eigenvalue problem, the given
matrix A is reduced to another matrix B, using similarity transformations.
Because of roundoff error, the matrix B is not strictly similar to A. However,
using a backward error analysis, it is possible to show that B is similar to some
matrix A + δA, where the elements of δA are small. In particular, for the reduction of
A to a Hessenberg or tridiagonal form using the Householder transformation,
it can be shown that
    (11.99)
where r is a constant of order unity and ||A||_E is the Euclidean norm of A.


Even if the bound on ||δA|| is small, it does not imply that the error in
computed eigenvalues or eigenvectors is small. The error depends on whether
the eigenvalue problem is well-conditioned or not. To estimate the condition
numbers for the problem, let us consider an eigenvalue λ_i and the corresponding
right and left eigenvectors x_i and y_i. Without any loss of generality, we can
assume that the eigenvectors are normalised such that ||x_i|| = ||y_i|| = 1. Now
if λ_i + δλ_i is the corresponding eigenvalue of A + δA with eigenvectors x_i + δx_i
and y_i + δy_i, we have

    (A + δA)(x_i + δx_i) = (λ_i + δλ_i)(x_i + δx_i).    (11.100)
Assuming that the perturbations are small, we can linearise this equation and
premultiply it by y_i^T, to get

    δλ_i = (y_i^T δA x_i) / (y_i^T x_i),    (11.101)

where we have used y_i^T δx_i = 0. This result follows from the fact that, since the
eigenvectors are arbitrary to the extent of a constant multiple, the perturbation
in x_i should not have any component along x_i, when expanded in terms of the
eigenvectors. Further, since y_i^T x_j = 0 if j ≠ i, the scalar product y_i^T δx_i = 0. For
a real symmetric matrix x_i = y_i, y_i^T x_i = 1, and if the elements of δA are small, then
δλ_i will also be small. If all elements of δA are less than ε in magnitude, then
it can be shown that |δλ_i| ≤ nε, which essentially implies that the eigenvalue
problem for a real symmetric matrix is always well-conditioned. This result only
implies that the error in the computed eigenvalue is of the same order as the
error in the matrix elements. If the matrix has eigenvalues of widely varying
magnitudes, it may be difficult to determine the small eigenvalues accurately
{17}, since the roundoff errors may be comparable to the smallest eigenvalue.
For a general unsymmetric matrix, the numerator in (11.101) may still
be small, but in general, there is no lower bound on the denominator and in
fact, for matrices with nonlinear divisors, the denominator vanishes. Thus, the
perturbation in the eigenvalues can be arbitrarily large {32}. This is similar
to the situation for multiple roots of a nonlinear equation. On the other hand,
for multiple eigenvalues with linear divisors there is no difficulty and they are
just like distinct eigenvalues. Thus, for the eigenvalue problem, multiple eigenvalues
pose a problem only if they have a nonlinear divisor. If the matrix has a nonlinear
divisor of order m, then the eigenvalue will be perturbed to m distinct
eigenvalues with spacing comparable to (y_i^T (δA) x_i)^{1/m}. If the left and right
eigenvectors corresponding to the same eigenvalue are nearly orthogonal, then
the eigenvalue is ill-conditioned and a small perturbation in A can affect the
eigenvalue significantly. Hence, for a general matrix A, there could be some
eigenvalues which are very sensitive to perturbations, while others are
well-conditioned.
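The denominator y_i^T x_i in (11.101) is easy to monitor numerically. The following sketch (Python; the pairing of left and right eigenvectors by nearest eigenvalue is a naive choice that assumes well-separated eigenvalues) returns 1/|y_i^T x_i| for unit-norm eigenvectors, so that a large value flags an ill-conditioned eigenvalue.

    import numpy as np

    def eigenvalue_conditions(A):
        """Sensitivity 1/|y_i^T x_i| of each eigenvalue of A, with x_i and y_i
        the unit-norm right and left eigenvectors."""
        lam, X = np.linalg.eig(A)              # columns of X: right eigenvectors
        mu, Y = np.linalg.eig(A.T)             # right eigenvectors of A^T = left eigenvectors of A
        order = [int(np.argmin(np.abs(mu - l))) for l in lam]
        Y = Y[:, order]                        # pair each y_i with the matching eigenvalue
        s = np.abs(np.sum(Y * X, axis=0))      # |y_i^T x_i| (bilinear product, no conjugation)
        return lam, 1.0/s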
To find the effect of this perturbation on the computed eigenvectors, we
can again premultiply the linearised version of (11.100) by y_k^T, (k ≠ i), to get

    y_k^T δA x_i = (λ_i - λ_k) y_k^T δx_i.    (11.102)
To find an expression for δx_i, we can expand it in terms of the eigenvectors
(assuming that the matrix is not defective), to get

    δx_i = Σ_{j≠i} α_j x_j.    (11.103)

Substituting (11.103) in (11.102), we can get the coefficients α_k, which yields

    δx_i = Σ_{k≠i} [ y_k^T δA x_i / ((λ_i - λ_k)(y_k^T x_k)) ] x_k.    (11.104)
Thus, once again the perturbation could be large if the quantity in the denominator
is small, which can happen if either (λ_i - λ_k) or y_k^T x_k is small for some
value of k. For a real symmetric matrix, the scalar product is unity and the
perturbation can be large only if λ_i - λ_k is small for some k. Hence, for eigenvalues
which are close, there could be significant contamination by components
of eigenvectors corresponding to other eigenvalues in the neighbourhood. For
a truly multiple eigenvalue, the situation is different, since in that case, any
combination of the corresponding eigenvectors is also an exact eigenvector. Thus,
for real symmetric matrices, the eigenvectors corresponding to isolated eigenvalues
are well-conditioned, but those corresponding to close eigenvalues are in
general ill-conditioned.
For a general unsymmetric matrix, the ill-conditioning in eigenvectors
could arise from two causes: firstly from close eigenvalues and secondly from
near orthogonality of the left and right eigenvectors corresponding to some eigenvalue.
The closeness of eigenvalues only affects the eigenvectors corresponding
to the eigenvalues which are close. On the other hand, if the matrix has any eigenvalue
for which y_k^T x_k is small, all eigenvectors will be affected. Thus, for such
matrices it may be difficult to find any eigenvector accurately.
It may be noted that the bound given in (11.99) is valid for the reduction
of the matrix using Householder transformations, which are orthogonal and hence
numerically stable. Reduction of a general matrix to Hessenberg form using
Gaussian elimination, described in Section 11.6, does not yield an orthogonal
transformation. For this transformation, the strict error bound could be considerably
larger, even with partial pivoting. In fact, there is an additional factor
of 2^n in the error bound, which gives an unrealistically large bound for large n.
However, as mentioned earlier, in practice it is rare to come across matrices
for which the error is significantly larger than that in Householder's method.
Here we have only considered errors introduced during similarity reduction
of the matrix to a condensed form. Additional errors will be introduced in the
process of solving the eigenvalue problem for the condensed form. Once again,
using backward error analysis, it is possible to show that the error is equivalent
to some perturbation in the matrix A and we can give limits on the norm of
these perturbations. The basic effect of these perturbations will be the same as
discussed above.

Bibliography
Acton, F. S. (1990): Numerical Methods That Work, Mathematical Association of America.
Anderson, E., Bai, Z., Bischof, C. and Blackford, S. (1987): LAPACK Users' Guide (Software,
Environments and Tools), (3rd ed.) SIAM, Philadelphia.
Dongarra, J. J., Bunch, J. R., Moler, C. B. and Stewart, G. W. (1987): LINPACK User's
Guide, SIAM, Philadelphia.
Faddeev, D. K. and Faddeeva, V. N. (1963): Computational Methods of Linear Algebra,
(trans. R. C. Williams), W. H. Freeman, San Francisco.
Garbow, B. S. et al. (1977): Matrix Eigensystem Routines - EISPACK Guide Extension,
Lecture Notes in Computer Science 51, Springer-Verlag, New York.
Golub, G. H. and Van Loan, C. F. (1996): Matrix Computations, (3rd edition), Johns Hopkins
University Press.
Gregory, R. T. and Karney, D. L. (1969): A Collection of Matrices for Testing Computational
Algorithms, Wiley-Interscience, New York.
Pissanetzky, S. (1984): Sparse Matrix Technology, Academic Press, London.
Press, W. H., Teukolsky, S. A., Vetterling, W. T. and Flannery, B. P. (2007): Numerical
Recipes: The Art of Scientific Computing, (3rd ed.) Cambridge University Press, New
York.
Quarteroni, A., Sacco, R. and Saleri, F. (2010) Numerical Mathematics, (2nd ed.) Texts in
Applied Mathematics, Springer, Berlin.
Ralston, A. and Rabinowitz, P. (2001): A First Course in Numerical Analysis, (2nd Ed.)
Dover.
Rice, J. R. (1992): Numerical Methods, Software, and Analysis, (2nd Ed.), Academic Press,
New York.
Rutishauser, H. (1990): Lectures on Numerical Mathematics, Birkhäuser, Boston.
Smith, B. T., Boyle, J. M., Garbow, B. S., Ikebe, Y., Klema, V. C. and Moler, C. B. (1976):
Matrix Eigensystem Routines - EISPACK Guide, (2nd ed.), Lecture Notes in Computer
Science 6, Springer-Verlag, New York.
Stoer, J. and Bulirsch, R. (2010): Introduction to Numerical Analysis, (3rd Ed.) Springer-
Verlag, New York.
Tewarson, R. P. (1973): Sparse Matrices, Academic Press, New York.
Wilkinson, J. H. (1988): The Algebraic Eigenvalue Problem, Clarendon Press, Oxford.
Wilkinson, J. H. (1968): Global Convergence of Tridiagonal QR Algorithm With Origin Shifts,
Lin. Alg. and Appl., 1, 409.
Wilkinson, J. H. and Reinsch, C. (1971): Linear Algebra: Handbook for Automatic Compu-
tation, Vol. 2, Springer-Verlag, Berlin.

Exercises
1. How many distinct (i.e., not mutually similar) 7 x 7 matrices are there, whose character-
istic polynomial is

2. If A has an elementary divisor of order r corresponding to Ai, then show that, there exists
a solution Xr of the system

(A - λ_i I)^r x = 0 such that (A - λ_i I)^{r-1} x_r ≠ 0.

If we define
(j = 1, ... , r - 1),
then show that
(j=1, ... ,r-1).
Show that the vectors Xj, (j = 1, ... , r) are linearly independent. The vector Xj is called
generalised eigenvector of rank j corresponding to Ai, while Xl is the eigenvector.
3. What happens if the power method is applied to a matrix with nonlinear divisors? Expand
the arbitrary vector in terms of the generalised eigenvectors of the matrix and show that
if the dominant eigenvalue corresponds to a linear divisor, then the iteration converges
as usual. If the dominant eigenvalue corresponds to a nonlinear divisor, show that the
iteration converges as a/ s, where a is a constant and s is the order of iteration. Verify
these results by applying power method to the following matrices:

-3 4

t)
2 4
-2 3
2 3
4. Find all eigenvectors corresponding to the dominant eigenvalue of the matrix in Exam-
ple 11.6, using the power method. How will you know that all eigenvectors have been
found?
5. Analyse the behaviour of the power method, when the dominant eigenvalues are of the
form ±λ_1. How will you modify the iteration to yield both the eigenvectors simultaneously?
Try this technique on the following matrix:
-1 4 4
-1
A= ( 4
4 1 -1
1 4 4

Apply the power method with initial vectors u_0 = (1,0,0,0)^T, (0,1,0,1)^T, (1,-1,0,0)^T
to this matrix and explain the results.
6. Analyse the behaviour of the power method, when the dominant eigenvalue is a complex
conjugate pair. How will you modify the iteration to yield both the eigenvalues and
eigenvectors simultaneously? Try this technique on the following matrix:
o
(~
5

T)
3 3
A = -5 3 3
-3 o -5

7. Apply Aitken's δ^2 process to improve the convergence of the power method. Consider
the matrix in Example 11.1 and compare the results with those in Table 11.1.
8. If the dominant eigenvalue and eigenvector (both left and right) have been found using
the power method, then we can perform deflation by defining

    A_1 = A - λ_1 x_1 y_1^T / (y_1^T x_1),

where x_1 and y_1 are the right and left eigenvectors corresponding to λ_1. Show that A_1
has the same eigenvectors as A and the eigenvalues are also the same, except for λ_1 which
is replaced by zero. Hence, this process due to Hotelling essentially removes the first eigenvalue. Use
this process to find the subdominant eigenvalues of the matrix in Example 11.1.
9. Analyse the behaviour of inverse iteration for a defective matrix. In particular, consider
the case, when the shift p is very close to the eigenvalue with nonlinear divisor. Verify the
results by considering the matrices in {3}. The generalised eigenvectors Xj (j > 1) (see
{2}), can be found by finding eigenvectors of (A - >.1)j and removing the components
from lower order. Try to use this technique to find the generalised eigenvectors of these
matrices.
10. Consider the inverse iteration method with variable shift for a real symmetric matrix.
Assume that the shift is close to the eigenvalue Aj and expand the vector Us in terms of
the normalised eigenvectors Xi in the form

    u_s = x_j + Σ_{k≠j} ε_k x_k.

If the next iteration is performed using the shift ps given by Rayleigh quotient, then show
that the next iterate is given by

Hence, show that the iteration converges cubically.


11. For a general unsymmetric matrix, modify the inverse iteration method by considering
two sequences of vectors u_s and w_s, with

    p_{s+1} = (w_s^T A u_s) / (w_s^T u_s),
where k_{s+1} and k'_{s+1} are normalising factors chosen such that max u_{s+1} =
max w_{s+1} = 1. By considering expansions similar to that in the previous problem, show
that this iteration also converges cubically.
12. Find all eigenvalues and eigenvectors of the following matrices using a combination of
power method and inverse iteration (with acceleration/annihilation techniques if neces-
sary):

Al =6 x 6 Hilbert matrix with


0.81321 -0.00013 0.00014 0.00011
0.00021) 4 4
-0.00013 0.93125 0.23567 0.41235 0.41632
( 6 1
0.00014 0.23567 0.18765 0.50632 0.30697
6
0.00011 0.41235 0.50632 0.27605 0.46322
4 4
0.00021 0.41632 0.30697 0.46322 0.41931
4 3 2
o 4
1 3 -5 o
(1
6
-3 1 4 -3
(1 o
4
3
7
6
5
6
8
7
AF 1
5
6
-2
-3
o
4
5

-2 2 2
( -3
-2
3
o 4
2
A8 = A9 = (2+~ 3
i
2i
2-i
4
5 + 3i
3 - 2i)
5 - 3i
6
-1 o o
Also find the left eigenvectors for A6, A7 and A8.
13. Show that the transformation given by (11.46) actually reduces all the required elements
in the rth row and rth column of A_r to zero. Estimate the number of floating-point
operations required in this step. Show that the entire process of reduction of a symmetric
matrix to a tridiagonal form requires approximately 2n^3/3 multiplications and a similar
number of additions.
14. Try to modify the Householder transformation for reducing a symmetric matrix to
tridiagonal form, such that at the rth step all off-diagonal elements in the rth row and
rth column are reduced to zero. Show that such a transformation is not possible for a
general symmetric matrix.
15. Find the Givens' matrices P_1, P_2, ..., P_{n-1} used to effect one step of the QL algorithm
for a symmetric tridiagonal matrix and verify that (11.64) gives the required QL transformation.
16. Find all eigenvalues of the n x n Hilbert matrix for n = 6, 10,20.
17. Find all eigenvalues of the following 7 x 7 symmetric tridiagonal matrices using the
QL algorithm

a_{11}, ..., a_{77} = 1, 10^2, 10^4, 10^6, 10^8, 10^10, 10^12;    a_{12}, ..., a_{67} = 10, 10^3, 10^5, 10^7, 10^9, 10^11;

b_{11}, ..., b_{77} = 10^12, 10^10, 10^8, 10^6, 10^4, 10^2, 1;    b_{12}, ..., b_{67} = 10^11, 10^9, 10^7, 10^5, 10^3, 10.

Compare the accuracy in the computed results for the two matrices. Also find the eigen-
vector corresponding to the eigenvalue with the smallest magnitude and compare the
results for the two matrices. (Note that the eigenvalues of A and B are identical, but the
corresponding eigenvectors are reflected.)
18. Repeat the above exercise using the QR algorithm.
19. Find all eigenvalues of the 21 x 21 symmetric tridiagonal matrix with elements

a_{ii} = |110 - 10i|, (i = 1, ..., 21);    a_{i,i+1} = a_{i+1,i} = 1, (i = 1, ..., 20).

Try to estimate the separation between the close pairs of eigenvalues.


20. The following generalised eigenvalue problem of the form Vx = >..Tx arises in the study
of free vibration of a linear triatomic molecule:
-k
2k
-k
T= (I I)
M
o
o
Convert this problem to the standard form and find all eigenvalues and eigenvectors. Use
k = 1, m = 1 and M = 2.
21. Find the three eigenvalues with lowest magnitude and the corresponding eigenvectors of
the n x n symmetric tridiagonal matrix, with aii = 2 and ai,i+1 = ai+l,i = -1. This
matrix arises in the solution of an eigenvalue problem for a differential equation. Use
n = 10,20 and 50 and study the change in the eigenvalue and the eigenvector with n.
22. Use Gaussian elimination method to reduce the following matrix to upper Hessenberg
form:
-2
1
3
Write down the corresponding transformation matrix and verify the results by explicit
multiplication. Also reduce this matrix to a tridiagonal form using the Lanczos method
with Xl = Yl = (0,0, 1)T.
23. Suppose an element h_{i+1,i} of an upper Hessenberg matrix is zero. Show that the matrix
can be split into the form

    H = ( H_1   A_1 )
        ( 0     H_2 )
where HI and H2 are upper Hessenberg matrices of order i and n - i, respectively and Al
is an i x (n - i) matrix. Show that the eigenvalues of H can be calculated by considering
the matrices HI and H2, independently. How will you find the eigenvectors of H from
those of HI and H2?
24. The eigenvalues of an upper Hessenberg matrix H can be found by finding the zeros of
its determinant. To calculate the determinant, write the equations in the form
    (h_{11} - λ)x_1 + h_{12}x_2 + ... + h_{1n}x_n = 0,
    h_{21}x_1 + (h_{22} - λ)x_2 + h_{23}x_3 + ... + h_{2n}x_n = 0,
    ...
    h_{n-1,n-2}x_{n-2} + (h_{n-1,n-1} - λ)x_{n-1} + h_{n-1,n}x_n = 0,
    h_{n,n-1}x_{n-1} + (h_{nn} - λ)x_n = 0.

If none of the subdiagonal elements is zero, then show that x_n ≠ 0. Using x_n = 1, solve
these equations (in reverse order) for x_{n-1}, x_{n-2}, ..., x_1. Now the first equation will be
satisfied only if λ is an eigenvalue. Denote the left-hand side of this equation by F(λ).
Show that F(λ) is some multiple of the determinant.
25. Use the technique in the previous exercise to find all eigenvalues of the following 12 x 12
matrix:
a_{ij} = 13 - j, (if j ≥ i);    a_{i+1,i} = 12 - i;    a_{ij} = 0, (if j < i - 1);
and compare the efficiency with that of the QR algorithm.
26. Show that the Hessenberg or tridiagonal form is invariant under the QR transformation,
i.e., if A_s = QR and A_{s+1} = RQ with an orthogonal matrix Q and an upper triangular
matrix R, then show that if A_s is Hessenberg/tridiagonal, then so is A_{s+1}.
27. Show that the double Q R algorithm with shifts determined by the 2 x 2 matrix in lower
left corner fails to converge for the following matrix
o o o
o o o
1 o o
o 1 o
o o
Apply essentially arbitrary shifts of order unity at the first iteration and then try the
same algorithm.
28. Find explicit expressions (in terms of the elements of the matrix As) for the Householder
matrices PI, . .. , P n - l for performing one step of the double QR transform on an upper
Hessenberg matrix.
29. The characteristic polynomial associated with a given matrix A can be obtained by the
following method due to Krylov. If

    p_n(λ) = λ^n + Σ_{i=0}^{n-1} b_i λ^i

is the characteristic polynomial, then the Cayley-Hamilton theorem states that p_n(A) =
0. Thus, for any arbitrary vector y

    A^n y + Σ_{i=0}^{n-1} b_i A^i y = 0,
which gives a system of n equations in the n unknown coefficients bi . Consider the 20 x 20
matrix with elements
aii = i, (i = 1, ... ,20); ai,i+l = 1, (i = 1, ... ,19); a20,1 = 19, a20,2 = -1,
the remaining elements being zero. Obtain the characteristic polynomial for this matrix
and find the eigenvalues by calculating the zeros of this polynomial. Compare the accuracy
with eigenvalues calculated using the QR algorithm.
30. Use Gerschgorin theorem to find approximate location of the eigenvalues of the following
matrix and verify the result by actually calculating the eigenvalues:
0.1 0.2
A = (0~3
0.1
10
1 100
0.3
1 )
10
0.1 2 10 1000

31. Find all eigenvalues and corresponding eigenvectors of the following complex matrices:

A= ( 73
3
7
1 + 2i
1 - 2i -1+21)
-1 - 2i
B=
( 1 + 2i
43 + 44i
3 + 4i
13 + 14i
21
15
+ 22i)
+ 16i
1 - 2i 1 + 2i 7 -3
5 + 6i 7 + 8i 25 + 26i
-1- 2i -1 + 2i -3 7

32. Find all eigenvalues and eigenvectors of the following matrices:

-")
11 6 -9
0
3 9 -3 -8

1) C
( 10 10
A-
- 00 B= ; 6 6 -3 -11
0 10
10- 16 7 5 -3 -11
0 0
17 12 5 -10 -16
Here the matrix A is obtained by perturbing a matrix with nonlinear divisor, while the
matrix B has a pair of quadratic divisors. If a library routine for calculating eigenvalues
and eigenvectors of a real matrix is available, try it on these matrices. Looking at the
eigenvalues and eigenvectors, can you claim that the matrix has nonlinear divisors?
33. Find all eigenvalues and corresponding eigenvectors of the generalised eigenvalue problem,
Ax = ABx with
2 3 1 1 -1 2
12 2 14 -1
~1 ~1
1 ) (12
1 11 1 B = 16 -1
2 1 9 -1 -1 12
1 -1 15 1 1 1 -1
Try to solve this problem by (i) using Cholesky decomposition of B to reduce this problem
to the standard form, (ii) premultiplying with B- 1 to reduce it to the standard form, (iii)
finding the eigenvalues by finding zeros of det(A - )"B) and then using inverse iteration
to calculate the eigenvectors. Compare the efficiency of these techniques.
34. Consider the following generalised eigenvalue problem, which arises in hydrodynamic
stability analysis:

-! - ik<
k; +w 2 12
_w
(1+{3)w+q
) (Xl) (0)
X2 0
-(3w X3 0

where w is the eigenvalue to be determined, while kx, kz, {3 and q are the parameters
in the problem. Use kx = kz = 1, {3 = ±0.1, q = 0.01 and find all the eigenvalues and
eigenvectors. Convert this problem to a more standard form by the technique outlined in
Section 11.1 and then solve it using the standard techniques. Also find the eigenvalues
by directly finding zeros of the corresponding determinant and use inverse iteration to
find the eigenvectors. Compare the efficiency of the two processes.
35. Find all real eigenvalues and the corresponding eigenvectors for the following generalised
eigenvalue problem, which arises in the problem of square well potential in quantum
mechanics:
- sink2 - cos k2
k2 cos k2 -k2 sink2
- sink2 COSk2
k2 cos k2 k2 sin k2

where k_1 = √(-E), k_2 = √(10 + E) and E is the required eigenvalue.


36. Find all eigenvalues and corresponding eigenvectors of the following matrices:
-1 0 0 0 0 0
-1 0 1 0 0 0 0
-1 0 0 1 0 0 0
A= -1 0 0 0 1 0 0
-1 0 0 0 0 1 0
-1 0 0 0 0 0 1
-1 0 0 0 0 0 0
13 12 11 10 9 8 7 6 5 4 3 2
12 12 11 10 9 8 7 6 5 4 3 2
0 11 11 10 9 8 7 6 5 4 3 2 1
0 0 10 10 9 8 7 6 5 4 3 2 1
0 0 0 9 9 8 7 6 5 4 3 2
0 0 0 0 8 8 7 6 5 4 3 2
B= 0 0 0 0 0 7 7 6 5 4 3 2
0 0 0 0 0 0 6 6 5 4 3 2
0 0 0 0 0 0 0 5 5 4 3 2
0 0 0 0 0 0 0 0 4 4 3 2
0 0 0 0 0 0 0 0 0 3 3 2
0 0 0 0 0 0 0 0 0 0 2 2
0 0 0 0 0 0 0 0 0 0 0
Chapter 12

Ordinary Differential Equations

Differential equations are the most important mathematical models for describ-
ing various physical phenomena. Motion of solid objects or fluid, deformation
of elastic objects, heat flow, chemical or nuclear reactions are all modelled by
differential equations. If a differential equation has only one independent vari-
able, then it is referred to as an ordinary differential equation. If there are more
than one independent variables, then it is called a partial differential equation.
Most of the differential equations governing physical phenomena are partial
differential equations. However, because of the difficulty in solving them, we
often simplify the problem and reduce it to ordinary differential equations. For
example, assuming spherical symmetry can reduce a problem in three space
variables to that in just one variable. As a result of such simplifications, a large
fraction of the differential equations that we come across in practice are ordi-
nary differential equations. However, because of a significant improvement in
algorithms for solving partial differential equations and in computing power,
the situation is changing.
In this chapter, we shall consider numerical solution of ordinary differential
equations, while partial differential equations will be considered in Chapter 14.
In contrast to most of the problems that we have considered so far in this book,
solution of differential equations cannot be represented by a finite sequence of
numbers, since the solution is a function of a continuous variable. In numeri-
cal computation, we are forced to approximate this continuous function by its
value at a finite sequence of points, or by an approximating function of the
type considered in Chapter 10. This finite representation of solution naturally
introduces truncation error in the solution. Estimating this truncation error is
rather difficult and we only consider some aspects of it in this chapter. Further,
numerical solution can only provide the solution for a given set of boundary
conditions and may not be useful in obtaining a general solution of the differ-
ential equations.
The simplest method for solving differential equations is to replace the


derivatives by the corresponding finite difference approximations considered
in Chapter 5. In order to achieve higher accuracy, we can use a higher order
difference formula. A higher order formula leads to a higher order difference
equation. If the order of the difference equation is higher than that of the
differential equation, then it will have some extraneous solutions which do not
satisfy the differential equation. These unwanted solutions may cause numerical
instabilities under certain circumstances. The stability problem is considered
in Section 12.2.
In this chapter, we first consider the initial value problems. Numerical integration
methods, which are the simplest to derive, are discussed in Section 12.1.
Predictor-corrector methods which are based on the numerical integration for-
mulae are described in Section 12.3. These methods are generally more efficient
than the popular Runge-Kutta methods, but are not used very often because of
the difficulty in efficient implementation. Runge-Kutta methods which are dif-
ficult to derive, but easy to use are described in Section 12.4. In Section 12.5,
we describe the extrapolation methods which have proved to be fairly effec-
tive in automatic solution of nonstiff differential equations. The choice between
these methods is rather difficult, since their efficiency depends crucially on the
level of sophistication in the computer programs. In Section 12.6, we discuss
the so-called stiff differential equations, for which most methods considered
earlier fail miserably. The methods described in this section are important,
since stiffness is rather common in practical problems involving a large number
of differential equations. In Sections 12.7 and 12.8, we consider boundary value
problems, while eigenvalue problems are considered in Section 12.9. In Section 12.10
we consider expansion methods for boundary value problems. Finally in Sec-
tion 12.11, we outline some techniques for dealing with singularities and other
complications which can arise in some problems.

12.1 Initial Value Problem


It is well-known from the theory of differential equations that the solution of a
differential equation can be defined uniquely only if additional conditions are
specified on the solution. These conditions are usually referred to as the bound-
ary conditions, because in most practical problems they are specified at one or
both the end points of the interval over which the solution is required. These
boundary conditions are algebraic conditions on the value of the solution or its
derivatives. For the solution to be defined uniquely, the number of independent
boundary conditions should equal the order of the differential equation. The
order of a differential equation is the order of the highest derivative occurring
in it. We will not consider the problem of existence and uniqueness of the so-
lution of differential equations. In most practical problems, if the equation has
been derived consistently, then a solution exists and usually this solution is also
unique. Problems involving solution of differential equations can be classified
according to the type of the boundary conditions. The simplest problems are
those, where the required number of boundary conditions are specified at one
point. Such problems are referred to as initial value problems. On the other
hand, problems where boundary conditions are specified at both end points are
referred to as two-point boundary value problems, or simply as boundary value
problems. For differential equations with singularity, special conditions may be
required at singular points, in order to ensure continuity of the solution across
the singularity. If the differential equation as well as the boundary conditions
are homogeneous in the unknown functions, then the corresponding problem
can be considered as an eigenvalue problem. In this case, a nontrivial solution
exists only when some parameter in the equation assumes special values, which
can be considered as the eigenvalue.
Problems arising in ordinary differential equations can usually be reduced
to that for a system of first-order differential equations. For example, a differential
equation of order m

    d^m y/dt^m = f(t, y, dy/dt, ..., d^{m-1}y/dt^{m-1})    (12.1)

can be reduced to a system of m first-order equations by defining the new variables
y_1 = y and y_{j+1} = d^j y/dt^j for j = 1, ..., m - 1, to get

    dy_j/dt = y_{j+1},  (j = 1, 2, ..., m - 1);    dy_m/dt = f(t, y_1, y_2, ..., y_m).    (12.2)
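As a concrete illustration (an assumed example, not from the text), the second-order pendulum equation y'' = -sin y becomes, with y_1 = y and y_2 = dy/dt, the system y_1' = y_2, y_2' = -sin y_1; in Python the right-hand side of the resulting first-order system is just

    import numpy as np

    def pendulum_rhs(t, Y):
        """Right-hand side of the first-order system for y'' = -sin(y)."""
        y1, y2 = Y
        return np.array([y2, -np.sin(y1)])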

In some problems, it may be useful to include some additional factors in the


definition of new variables, for the purpose of tackling some singular behaviour
or simply to avoid underflow and overflow. This will of course, depend on the
equation and it is not possible to give any general guideline for such choices.
However, if it is found that the variable Yl is of a reasonable size while some of
the extra variables Yj' (j > 1) are much larger or smaller in magnitude, then it
may be better to include some additional factors in the definition to make all
variables of roughly the same order of magnitude. In this chapter, we denote the
independent variable by t, since in many cases, initial value problems involve
time as the independent variable. This is of course, not essential and we can use
any other variable in its place, but for convenience we refer to the independent
variable as time.
Since a general system of differential equations can be reduced to a system
of first-order differential equations, we only consider solution of a system of
first-order differential equations, which can be written as

    dy_j/dt = f_j(t, y_1, y_2, ..., y_n),    (j = 1, 2, ..., n).    (12.3)

To complete the specification of an initial value problem, we need n boundary


(or in this case initial) conditions of the form

(j = 1,2 , .. . , n) , (12.4)
where to is the point at which initial conditions are defined. The system of
differential equations defined above is considered to be linear, if the functions
fJ(t, Yk) are linear functions of Yk . Many methods for solving an initial value
problem are applicable to nonlinear equations, without any special modifica-
tions.
Some library routines for solution of ordinary differential equations assume
that the functions on the right-hand side do not explicitly depend on t. Such a
system is said to be in the autonomous form. A non-autonomous system can be
easily transformed to the autonomous form, by introducing an extra variable
Yn+l = t, satisfying the equation

    dy_{n+1}/dt = 1.    (12.5)
A large fraction of equations occurring in physical problems are time indepen-
dent and do not need this transformation.
Most methods for solving initial value problems for a system of differ-
ential equations are simple extension of methods for a single equation in one
variable. Hence, in this and the next few sections, we consider solution of a
single first-order differential equation. The theory of numerical methods is of
course simpler for one differential equation. Methods for solution of initial value
problem generally proceed step by step, in the sense that using the value of the
solution at t ≤ t_j, we calculate the value at some later time t_{j+1} = t_j + h_j.
This process is repeated for j = 0, 1, 2, ..., until the required range of time is
covered. It may be noted that at the first step (j = 0) the solution is not known
at t < t_0, and we can only use the known initial values at t = t_0 to compute the
solution at t_1. Thus, methods which require more than one previous value
to compute the solution need some other technique to generate the required
number of starting values. Such methods are referred to as multistep methods.
In the simplest case, we use information only at the previous step and the computed
value of y(t_{j+1}) depends only on y(t_j). Such methods are referred to as
single-step methods. However, higher order formulae of this type are not easy
to derive.
A straightforward method for numerical solution of differential equations
is the so-called numerical integration method. In this method, the spacing h j
at different steps is assumed to be constant and hence tj = to + jh. Further,
the computed value of solution at the new point (Yj+d depends only on a fixed
number of the previous values. Thus, we can write
    y_{j+1} = Σ_{k=0}^{m} a_k y_{j-k} + h Σ_{k=-1}^{m} b_k y'_{j-k},    (12.6)

where y_i is the computed solution and y'_i = f(t_i, y_i) is the computed value of
the derivative at t = t_i. Here a factor of h has been included in the second term
to make the coefficients bk independent of h . This formula uses information at
m + 1 previous points. Here some of the coefficients ak or bk could be zero, but
it is assumed that either am or bm is nonzero. Further, at least one of the bk'S


must be nonzero, since otherwise the differential equation will not be used at
all.
If b_{-1} = 0, then (12.6) expresses y_{j+1} as a linear combination of the
known past values of y_{j-k} and their derivatives, which can be easily evaluated.
Such formulae are called explicit or forward integration formulae. An example
of an explicit formula is the well-known Euler's method, y_{j+1} = y_j + h y'_j. Explicit
formulae essentially extrapolate the value of the solution at the next point
using the previous points, and are likely to have larger errors. On the other
hand, if b_{-1} ≠ 0, then y_{j+1} also occurs on the right-hand side of (12.6), since
y'_{j+1} = f(t_{j+1}, y_{j+1}). Hence, this equation can only be solved iteratively, unless
the differential equation is linear. Such formulae are called implicit formulae. In
this case, at each step we have to solve a system of nonlinear equations, which
could require considerable effort. However, this effort is compensated by the higher
accuracy or better stability of such formulae. An example of an implicit formula
is the trapezoidal rule:

    y_{j+1} = y_j + (h/2)(y'_{j+1} + y'_j).    (12.7)
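Both formulae are short enough to state in code. In the sketch below (Python; the implicit equation of the trapezoidal rule is solved by a few fixed-point iterations started from an Euler predictor, which is only one of several possible choices) each function advances a single scalar equation by one step.

    def euler_step(f, t, y, h):
        """Explicit Euler step: y_{j+1} = y_j + h*f(t_j, y_j)."""
        return y + h*f(t, y)

    def trapezoidal_step(f, t, y, h, niter=4):
        """Implicit trapezoidal step (12.7), solved by fixed-point iteration."""
        ynew = y + h*f(t, y)                       # Euler predictor
        for _ in range(niter):
            ynew = y + 0.5*h*(f(t, y) + f(t + h, ynew))
        return ynew

    # example: dy/dt = -y, y(0) = 1, integrated to t = 1 with h = 0.1
    f = lambda t, y: -y
    t, y = 0.0, 1.0
    for _ in range(10):
        y = trapezoidal_step(f, t, y, 0.1)
        t += 0.1
    print(y)                                       # close to exp(-1) = 0.3679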

Numerical integration formulae can be obtained using quadrature for-


mulae or numerical differentiation formulae, or by integrating interpolating
polynomials. It is also possible to obtain these formulae using the method of
undetermined coefficients (Section 5.2). In this method, the coefficients are de-
termined by demanding that the formula is exact, when the actual solution is
a polynomial of some specified degree. For example, if we require that (12.6) is
exact for polynomials of degree less than or equal to r, then we can obtain r + 1
equations by letting Yk = t~ for i = 0,1, ... , r in (12.6). We can simplify the
algebra if we assume that tj = 0, since a shift in the origin should not affect
the coefficients. Further, it can be seen that factors of h cancel and hence we
can assume h = 1. With these simplifications, we get the following conditions
    Σ_{k=0}^{m} a_k = 1,    (i = 0);

    -Σ_{k=0}^{m} k a_k + Σ_{k=-1}^{m} b_k = 1,    (i = 1);    (12.8)

    Σ_{k=0}^{m} (-k)^i a_k + i Σ_{k=-1}^{m} (-k)^{i-1} b_k = 1,    (i = 2, ..., r).

These are r + 1 equations for the 2m + 3 coefficients. However, some of the coefficients
(e.g., b_{-1}) may be postulated to be zero. If the number of coefficients is r + 1,
then generally (12.8) can be solved for the a_k's and b_k's. If the number of coefficients
is less than r + 1, then in general, there is no solution. While if this number is
larger than r + 1, then we shall have some free parameters which can be used
to achieve other objectives. For example, we can use these parameters to make
the coefficient of error term small, or to improve stability of the formula. Some
specific examples of numerical integration methods will be considered in the
next two sections.
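As a worked illustration of the method of undetermined coefficients (not taken from the text), consider the two-step explicit formula y_{j+1} = y_j + h(b_0 y'_j + b_1 y'_{j-1}), i.e. m = 1, a_0 = 1, a_1 = 0 and b_{-1} = 0. The conditions (12.8) for i = 1, 2 give two linear equations whose solution is b_0 = 3/2, b_1 = -1/2, the familiar two-step Adams formula.

    import numpy as np

    # conditions (12.8):  i = 1:  b0 + b1 = 1;   i = 2:  2*(0*b0 - 1*b1) = 1
    A = np.array([[1.0,  1.0],
                  [0.0, -2.0]])
    rhs = np.array([1.0, 1.0])
    b0, b1 = np.linalg.solve(A, rhs)
    print(b0, b1)            # 1.5  -0.5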
Since a numerical integration formula approximates a continuous opera-
tion by a discrete sum, there is bound to be some truncation error. If the values
of y_k at the previous points are known exactly, then the truncation error is simply
the error in the integration over the last step, which can be easily estimated. For
example, using the Taylor series we know that the truncation error in Euler's method
is (1/2) h^2 y''(ξ). This truncation error is referred to as the local truncation error. In
general, the local truncation error can be expected to be of the form c h^{r+1} y^{(r+1)}(ξ),
where r is the order of the formula. It may be possible to estimate the coefficient
c in the truncation error using the method of undetermined coefficients. For this
purpose, we can consider the function y = t^{r+1} and find the difference between
the exact and computed values.
If the first two equations in (12.8) are satisfied, then the corresponding
numerical integration method is said to be consistent. Consistency is equivalent
to ensuring that the formula is exact for linear functions. All useful numerical
integration methods are consistent. It can be shown that if the formula is not
consistent, then the truncation error over a given finite interval may not tend to
zero, even in the limit h → 0. For example, the numerical integration method
y_{j+1} = y_j + 2h y'_j is not consistent, even though the local truncation error tends
to zero as h → 0. But the error over a finite interval does not tend to zero {5}
for any reasonable differential equation, even when h → 0. Consistency is an
essential property of all useful numerical integration methods.

12.2 Stability of Numerical Integration Methods


As explained in the previous section most numerical methods for solution of
initial value problems proceed step by step, with the value of solution at the
next point depending on that at previous points. Thus, there are two sources
of truncation error in Yj+!. First, there is some truncation error introduced
in the jth step and second, the computed value of Yj itself cannot be exact,
thus introducing additional error while computing Yj+l. The first contribution is
referred to as the local truncation error and is defined as the error introduced in
the jth step, assuming that the previous values of Yj-k are exact. In most cases,
it is comparatively easy to estimate this local truncation error. The second
source of truncation error is essentially propagation of the local truncation error,
since it depends on the truncation error in the previous steps. This contribution
is rather difficult to estimate and leads to the concept of stability of numerical
integration methods. If a small error introduced at some stage increases rapidly
as the solution proceeds, then the corresponding method is said to be unstable,
while if the error remains bounded in some sense, then the method is said
to be stable. In this section, we attempt to study stability of some numerical
integration methods for simple problems.
Truncation error, of course, depends on the step size. Hence, stability also
depends on the step size. At sufficiently large step size, most methods do not
yield any meaningful solution of any nontrivial differential equation. Thus, in
most cases, we may be interested in stability for a reasonably small value of h
only. An exception to this situation occurs for stiff differential equations consid-
ered in Section 12.6. The concept of stability of a numerical integration method
can be defined as follows: If there exists an ho > 0 for each differential equation,
such that a change in starting values by a fixed amount produces a bounded
change in the numerical solution for all h in [0, hoL then the method is said
to be stable. The stability of a numerical integration method only assures us
that the solution is not sensitive to perturbations, it does not imply that the
computed solution is actually close to the true solution of the required differen-
tial equation. Accuracy is related to the concept of convergence. Convergence
implies that any desired degree of accuracy can be achieved for any well posed
differential equation, by choosing a sufficiently small step size h. While defining
convergence, it is assumed that, there is no roundoff error. In practice, round-
off error will increase as h decreases and an arbitrary accuracy is not possible,
unless the precision of arithmetic operations is also increased.
Since it is difficult to analyse stability of numerical integration methods
for an arbitrary nonlinear equation, we consider a simple linear differential
equation of the form

    dy/dt = λy,    y(0) = y_0,    (12.9)

for which the exact solution is y = y_0 e^{λt}. For this differential equation, the
numerical integration method (12.6) becomes
    (1 - hλ b_{-1}) y_{j+1} = Σ_{k=0}^{m} (a_k + hλ b_k) y_{j-k}.    (12.10)

This is a linear difference equation with constant coefficients, which has a general
solution of the form

    y_j = Σ_{k=0}^{m} c_k α_k^j,    (12.11)

where the c_k are constants, which can be determined from the initial conditions,
and the α_k are the roots of the characteristic equation

    (1 - hλ b_{-1}) α^{m+1} = Σ_{k=0}^{m} (a_k + hλ b_k) α^{m-k}.    (12.12)

Here it is assumed that the roots of (12.12) are all distinct. If the roots are not
distinct, then there will be terms of the form c_k j^r α_k^j in the solution (where r
is an integer less than the multiplicity of the root). These terms do not affect
our conclusions in any significant respect.
If this solution has to approximate the solution of the differential equation,
then one of the roots, say α_0 ≈ e^{λh}. To show this we note that, if h = 0, then
from the first equation in (12.8), α_0 = 1 is one of the roots of (12.12). For h << 1
we can expand this root in a Taylor series

    α_0 = 1 + Σ_{i=1}^{∞} β_i h^i.    (12.13)

Substituting this expansion in (12.12) and using the second equation in (12.8),
it can be shown that β_1 = λ. Thus, if the numerical integration method is
consistent, then

    α_0 = e^{λh} + O(h^2).    (12.14)

In fact, the O(h^2) term is actually O(h^{r+1}), where r is the order of the numerical
integration method. Using this value of α_0, we get

    (12.15)

Hence, the solution of the difference equation will approximate that of the differential
equation, provided the constant c_0 ≈ y_0 and the other terms in the solution are
small. These terms are sometimes referred to as parasitic solutions. They arise
because while the differential equation has only one independent solution, the
difference equation has m + 1 independent solutions. The constants c_i depend
on the initial values of y_0, y_1, ..., y_m. It may be noted that, since the difference
equation is of higher order than the differential equation, it requires more initial
conditions. These extra initial conditions have to be obtained using some other
technique. In general, there is some error in the starting values y_1, ..., y_m, but
it is reasonable to assume that as h → 0, the error in these values also tends to
zero. With this assumption, it can be shown {10} that c_0 → y_0 as h → 0, while
the other coefficients c_1, ..., c_m tend to zero. If m = 0, then there is no parasitic
solution and the solution of the difference equation approximates that of the
differential equation.
In general, the above conditions on the coefficients c_i are not sufficient to en-
sure that the solution of the difference equation approximates the true solution
of the differential equation. For that we have to ensure that the parasitic so-
lutions c_i α_i^j, (i > 0) always remain small as compared to c_0 α_0^j, which is true
if

|α_i| < |α_0|,    (i = 1, ..., m),                                    (12.16)

that is, α_0 is the root with the largest magnitude. As h → 0, α_0 → 1. Hence, if
(12.16) is to be satisfied, it is necessary that all other roots of (12.12) lie on or
within the unit circle when h = 0. Further, the roots on the unit circle must be
simple, since for multiple roots the additional factor of j in the solution may
result in instability. Only if this condition is satisfied can we expect (12.16)
to hold for a finite value of h. If some of the roots are on the unit circle when
h = 0, then further analysis may be required to ensure that (12.16) is satisfied
for a finite value of h. If for h = 0 all the roots α_j, (j > 0) are inside the unit circle,
then, since the roots are continuous functions of h, we can expect that (12.16) is
satisfied for some finite range of h including zero.
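This root condition is easy to test numerically. The following Python sketch is an
illustration of mine (not part of the original text); the coefficient ordering follows (12.6),
and the function names are my own. It forms the characteristic polynomial (12.12) for a
given method and value of hλ and checks the relative stability condition (12.16).

import numpy as np

def characteristic_roots(a, b, hl):
    """Roots of (1 - hl*b_{-1})*alpha**(m+1) = sum_k (a_k + hl*b_k)*alpha**(m-k).

    a : [a_0, ..., a_m],  b : [b_{-1}, b_0, ..., b_m],  hl : the product h*lambda.
    """
    m = len(a) - 1
    poly = np.zeros(m + 2)
    poly[0] = 1.0 - hl * b[0]                  # coefficient of alpha**(m+1)
    for k in range(m + 1):
        poly[k + 1] = -(a[k] + hl * b[k + 1])  # coefficient of alpha**(m-k)
    return np.roots(poly)

def relatively_stable(a, b, hl):
    """Check condition (12.16): the root closest to exp(h*lambda) must dominate."""
    r = characteristic_roots(a, b, hl)
    alpha0 = r[np.argmin(np.abs(r - np.exp(hl)))]   # root approximating exp(h*lambda)
    others = r[np.abs(r - alpha0) > 1e-12]
    return bool(np.all(np.abs(others) < np.abs(alpha0)))

# Simpson's rule (12.20): a = [0, 1], b = [1/3, 4/3, 1/3]
print(relatively_stable([0.0, 1.0], [1/3, 4/3, 1/3], -0.3))   # False: unstable for negative h*lambda
print(relatively_stable([0.0, 1.0], [1/3, 4/3, 1/3], +0.3))   # True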

It can be seen that it is the combination hλ which occurs in the
equations governing stability. Further, the condition (12.16) is a relative con-
dition in the sense that it ensures that the parasitic solutions are always small
as compared to the true solution. This condition does not imply that the par-
asitic solutions decrease in magnitude with time. On the other hand, absolute
stability requires that all solutions of the difference equation must decrease in
magnitude as integration proceeds. This concept may be meaningless in sit-
uations where the true solution is increasing with time. However, it is useful
for stiff differential equations discussed in Section 12.6. Thus, a consistent nu-
merical integration method is said to be relatively stable on an interval [0, β],
which must include zero, if for all hλ in this interval the condition (12.16) is
satisfied and if, when |α_i| = |α_0|, α_i is a simple root. It is said to be absolutely
stable on an interval [γ, δ], if for all hλ in this interval

|α_i| < 1,    (i = 0, 1, ..., m).                                    (12.17)

Adams-Moulton methods, which are of the form

y_{j+1} = y_j + h Σ_{k=-1}^{m} b_k y'_{j-k},                              (12.18)

have very good stability characteristics, since for h = 0, α_i = 0 (i > 0).
Hence, in this case, we can expect (12.16) to be satisfied for a reasonably large
range of hλ. These methods can be derived by integrating Newton's backward
interpolation formula over the last step. These methods will be considered in
the next section.
The above stability analysis is restricted to a simple linear equation. A
general nonlinear equation can be locally approximated by a linear equation
and we can hope that the same condition will apply locally. For the differential
equation y' = f(t, y), we can define

λ = ∂f/∂y,                                                        (12.19)

which is no longer a constant. But we can expect that if (12.16) is satisfied for all
values of λ in the required range, then the corresponding numerical integration
method is stable. Alternately, we can vary the step size h to keep hλ in the
stability region.
The definition of stability and convergence does not depend on the number
of equations in the system being considered. However, the stability analysis
becomes complicated when a system of differential equations is considered {9}.
For a system of linear first-order equations y' = Ay, a numerical integration
method is stable if the stability condition (12.16) is satisfied for all eigenvalues
λ_i of the matrix A. It should be noted that the eigenvalues can be complex,
even if all the coefficients are real. Hence, we need to extend the above stability
analysis to complex values of λ. For a system of nonlinear first-order equations
the matrix A can be replaced by the Jacobian matrix.
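For a system, the λ values entering the stability condition are therefore the eigenvalues of
the local Jacobian. A small Python sketch of this idea follows; it is my own illustration
(not from the text), and the finite-difference Jacobian and the example matrix are assumptions.

import numpy as np

def jacobian_eigenvalues(f, t, y, eps=1e-6):
    """Estimate the eigenvalues of the Jacobian of f at (t, y) by finite differences."""
    y = np.asarray(y, dtype=float)
    n = y.size
    f0 = np.asarray(f(t, y), dtype=float)
    J = np.empty((n, n))
    for k in range(n):
        dy = np.zeros(n)
        dy[k] = eps * max(1.0, abs(y[k]))
        J[:, k] = (np.asarray(f(t, y + dy), dtype=float) - f0) / dy[k]
    return np.linalg.eigvals(J)

# Example: a linear system y' = A y with eigenvalues -1 and -100
A = np.array([[-1.0, 0.0], [0.0, -100.0]])
lam = jacobian_eigenvalues(lambda t, y: A @ y, 0.0, np.array([1.0, 1.0]))
h = 0.05
print(h * lam)   # every h*lambda_i must lie in the stability region of the method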

EXAMPLE 12.1: Analyse stability of the following numerical integration formula

y_{j+1} = y_{j-1} + (h/3) (y'_{j+1} + 4y'_j + y'_{j-1}),                        (12.20)

which is essentially the Simpson's rule.

Considering the linear equation y' = λy, we get the difference equation

(1 - (1/3)hλ) y_{j+1} = (1 + (1/3)hλ) y_{j-1} + (4/3)hλ y_j.                    (12.21)

The general solution of this difference equation can be written as c_0 α_0^j + c_1 α_1^j, where α_k are the
roots of the characteristic equation

(1 - (1/3)hλ) α^2 - (4/3)hλ α - (1 + (1/3)hλ) = 0,                          (12.22)

which gives the roots

α_0 = [(2/3)hλ + √(1 + (1/3)h^2λ^2)] / (1 - (1/3)hλ),
α_1 = [(2/3)hλ - √(1 + (1/3)h^2λ^2)] / (1 - (1/3)hλ).                        (12.23)

For h = 0 the roots are ±1 and both roots are of the same magnitude. Therefore, stability
depends on how the roots change as h becomes positive. Obviously, the root α_0 corresponds
to the true solution while α_1 corresponds to a parasitic solution, which arises because the
difference equation is of second order, while the differential equation is first order. For finite
h the ratio

α_1/α_0 = -[1 + (7/9)h^2λ^2 - (4/3)hλ √(1 + (1/3)h^2λ^2)] / (1 - (1/9)h^2λ^2).        (12.24)

It can be shown that |α_1/α_0| < 1 when hλ > 0, and |α_1/α_0| > 1 when hλ < 0. Hence, if
hλ < 0, then this method is unstable, since ultimately the parasitic solution c_1 α_1^j dominates.
However, for hλ > 0 the method is stable, as the parasitic solution always remains smaller than
the required solution. If h > 0, it follows that this method is stable for λ > 0, that is, when
the true solution is an increasing function of t. On the other hand, when the true solution is a
decreasing function of t the method is unstable. Considering the absolute stability criterion,
it can be seen that α_0 > 1 for hλ > 0, while α_1 < -1 for hλ < 0. Hence, the method is
absolutely stable only for hλ = 0.
It can be seen that for hλ ≪ 1, the root

α_0 = 1 + hλ + (1/2)(hλ)^2 + (1/6)(hλ)^3 + (1/24)(hλ)^4 + (1/72)(hλ)^5 + ··· .        (12.25)

The first five terms of this expansion agree with those of e^{hλ}. Hence, the local truncation error
is O(h^5), which is consistent with the known truncation error for the Simpson's rule given
by -(1/90) h^5 y^{(5)}(ξ). The coefficient of the error term can be obtained by considering the difference
1/72 - 1/120 = 1/180, which is half of that in the Simpson's rule. This is because Simpson's rule
actually integrates the function over two steps.
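The behaviour of the two roots can be checked numerically. The short Python sketch below
is my own check (not part of the text); it evaluates the closed-form roots (12.23) and the
ratio |α_1/α_0| for hλ = ±0.3, the values that appear again in Example 12.3.

import math

def simpson_roots(hl):
    R = math.sqrt(1.0 + hl * hl / 3.0)
    denom = 1.0 - hl / 3.0
    alpha0 = (2.0 * hl / 3.0 + R) / denom
    alpha1 = (2.0 * hl / 3.0 - R) / denom
    return alpha0, alpha1

for hl in (0.3, -0.3):
    a0, a1 = simpson_roots(hl)
    print(f"h*lambda={hl:+.1f}  alpha0={a0:.6f} (exp={math.exp(hl):.6f})  "
          f"alpha1={a1:.6f}  |alpha1/alpha0|={abs(a1 / a0):.4f}")

# |alpha1/alpha0| < 1 for h*lambda > 0 but > 1 for h*lambda < 0, so the
# parasitic solution dominates when the true solution is decreasing.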

EXAMPLE 12.2: Analyse stability of a general three-step, fourth-order numerical integra-
tion formula of the form

y_{j+1} = a_0 y_j + a_1 y_{j-1} + a_2 y_{j-2} + h (b_{-1} y'_{j+1} + b_0 y'_j + b_1 y'_{j-1} + b_2 y'_{j-2}).    (12.26)

Using the method of undetermined coefficients, we can attempt to determine the con-
stants a_i and b_i. Since the method is required to be fourth order, the formula should be exact
when y(t) = 1, t, t^2, t^3 and t^4, which provides five equations in the seven unknown coefficients.
Thus, we can find all the coefficients in terms of b_{-1} and b_2. The values of b_{-1} and b_2 can then


Figure 12.1: The figure on the left shows the stability region of a general three-step fourth-
order numerical integration method in the (b_{-1}, b_2) plane. The figure on the right shows the
three roots of the cubic for Hamming's method. The continuous curve shows the root α_0.
The dot-dashed portion of the curve shows the magnitude of the complex roots in the region
where the roots are a complex conjugate pair.

be decided by requiring the method to be stable. In principle, we can find a sixth-order for-
mula of this form, but as will be seen later on, this formula is not stable. The method of
undetermined coefficients leads to the following equations

a_0 + a_1 + a_2 = 1,
-a_1 - 2a_2 + b_{-1} + b_0 + b_1 + b_2 = 1,
a_1 + 4a_2 + 2b_{-1} - 2b_1 - 4b_2 = 1,                                      (12.27)
-a_1 - 8a_2 + 3b_{-1} + 3b_1 + 12b_2 = 1,
a_1 + 16a_2 + 4b_{-1} - 4b_1 - 32b_2 = 1,

which give the coefficients

a_0 = -9 + 27b_{-1} - 3b_2,    a_1 = 9 - 24b_{-1},    a_2 = 1 - 3b_{-1} + 3b_2,
                                                                          (12.28)
b_0 = 6 - 14b_{-1} + b_2,      b_1 = 6 - 17b_{-1} + 4b_2.

Stability of this method can be analysed by considering the characteristic equation

(1 - hλ b_{-1}) α^3 = (a_0 + hλ b_0) α^2 + (a_1 + hλ b_1) α + (a_2 + hλ b_2).            (12.29)

It is rather difficult to analyse the conditions under which the stability condition (12.16) will
be satisfied. Since for a useful numerical integration method, we must have stability in the
limit h → 0, we can set h = 0 in the above equation, to get the cubic

α^3 - a_0 α^2 - a_1 α - a_2 = (α - 1)(α^2 + (1 - a_0) α + a_2) = 0.                  (12.30)

We can expect the method to be stable if both roots of the quadratic are within the unit
circle. It is convenient to study this problem in the (b_{-1}, b_2) plane, where the region in which
the methods are stable is bounded by lines on which |α_i| = 1. This condition can be satisfied
when α_i = ±1, or when the roots are a complex conjugate pair with magnitude unity, which
gives

α_i = 1  ⇒  b_2 = 5b_{-1} - 2,        α_i = -1  ⇒  b_{-1} = 1/3.                  (12.31)

If the roots are complex, then the condition can be written as

|α_i|^2 = α_1 α_2 = 3b_2 - 3b_{-1} + 1 = 1,                                    (12.32)

which gives b_2 = b_{-1}. In addition, we must also have a condition to ensure that the roots are
complex. From the quadratic, it can be seen that the roots are complex when the discriminant
is negative. Using the condition b_2 = b_{-1} for unit magnitude, the roots are complex when
|10 - 24b_{-1}| < 2, which gives 1/3 < b_{-1} < 1/2. It can be seen that the end points are
exactly the points of intersection of the lines (12.31) with (12.32). These three lines are shown
in Figure 12.1, where the region inside the triangle bounded by continuous lines gives stable
numerical integration methods. For points on the boundary of this region, we get methods
which are weakly stable in the sense that they are stable only for h = 0, while for nonzero h
such methods are generally not stable when hλ is negative.
It can be seen that the line b_{-1} = 0 is outside the stability region. Hence, all explicit
fourth-order three-step methods are unstable. If we wish to choose the coefficients b_{-1} and
b_2 to get higher order accuracy, then the method of undetermined coefficients requires an
additional condition

40b_{-1} + 4b_2 = 12.                                                      (12.33)

This line is also shown in Figure 12.1 and it can be seen that it just touches the stability
region at b_{-1} = 1/3 and b_2 = -1/3. Hence, there appears to be only one fifth-order weakly
stable three-step method, given by

y_{j+1} = y_j + y_{j-1} - y_{j-2} + h ((1/3) y'_{j+1} + y'_j - y'_{j-1} - (1/3) y'_{j-2}).        (12.34)

Following the analysis of the last example, it can be shown that this method is relatively stable
only for hλ > 0. Even for hλ = 0 it is not stable, since α = 1 is a double root of the
cubic. This method does not satisfy the requirement of absolute stability for any value of
hλ (including zero). Thus, there are no stable fifth-order three-step methods. Hence, using
a three-step method, we can at best achieve fourth-order accuracy for a stable numerical
integration method. It can be expected that the distance from the line (12.33) should give a good
measure of the error coefficient in the numerical integration methods. In fact, it can be shown
that the coefficient of h^5 y^{(5)}(ξ) in the error term is given by

c_5 = -(1/120) (-12 + 40b_{-1} + 4b_2).                                      (12.35)
Now we shall consider a few of these fourth-order methods. It can be seen that for
b_{-1} = 1/3 and b_2 = 0, we get the Simpson's rule considered in the previous example. This
point is at the boundary of the stable region and this method is only weakly stable. For
this method a_2 = 0 and it is actually a two-step method. In fact, it is the only two-step
method with fourth-order accuracy. For b_2 = 0 it can be shown {16} that the best stability
characteristics are achieved when the coefficient a_1 ≈ 0 or b_{-1} ≈ 3/8, which gives the
Hamming's method

y_{j+1} = (1/8) (9y_j - y_{j-2}) + (3h/8) (y'_{j+1} + 2y'_j - y'_{j-1}).                  (12.36)

For this method the characteristic polynomial is

(1 - (3/8)hλ) α^3 - ((9/8) + (3/4)hλ) α^2 + (3/8)hλ α + 1/8 = 0.                  (12.37)
The roots of this polynomial are shown in Figure 12.1. It can be shown that this method is
relatively stable for hλ > -0.6946, beyond which the cubic has a pair of complex conjugate
roots (which includes α_0). In this case, |α_0/α_1| = 1, even beyond the stability region. Since
the solution should be real, the coefficients c_0 = c_1^* and the two terms in the solution of
the difference equation are equal in magnitude. Since the value of α_0 is complex, the solution of
the difference equation is oscillatory in character and cannot provide a reasonable approximation
to the solution of the differential equation. On the real axis, this method is absolutely stable for
0 ≥ hλ ≥ -8/3.
To get a better stability characteristic, we can consider the case where both extraneous
roots of the cubic vanish at h = 0. It can be seen that if 3b_2 = 3b_{-1} - 1, then one of these roots
vanishes. This line is also shown in Figure 12.1. Further, if 3b_2 = 27b_{-1} - 10, then the other

root also vanishes. It can be seen that for b_{-1} = 3/8 and b_2 = 1/24, both these conditions
are satisfied. This choice gives the fourth-order Adams-Moulton method

y_{j+1} = y_j + h ((3/8) y'_{j+1} + (19/24) y'_j - (5/24) y'_{j-1} + (1/24) y'_{j-2}),        (12.38)

which is relatively stable for hλ > -0.923 (where α_0 = -α_1) and absolutely stable for
0 ≥ hλ ≥ -3. Thus, this method has better stability characteristics than the other methods
considered above.
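The relations (12.28) and the h = 0 cubic (12.30) are easy to verify numerically. The Python
sketch below is an illustration of mine (not the book's code); it reconstructs the coefficients
from (b_{-1}, b_2) and prints the extraneous roots at h = 0 for Hamming's method, the
Adams-Moulton method (12.38) and Simpson's rule.

import numpy as np

def three_step_coefficients(bm1, b2):
    """Coefficients (12.28) of the general three-step formula (12.26)."""
    a0 = -9.0 + 27.0 * bm1 - 3.0 * b2
    a1 = 9.0 - 24.0 * bm1
    a2 = 1.0 - 3.0 * bm1 + 3.0 * b2
    b0 = 6.0 - 14.0 * bm1 + b2
    b1 = 6.0 - 17.0 * bm1 + 4.0 * b2
    return (a0, a1, a2), (bm1, b0, b1, b2)

def extraneous_roots_at_h0(bm1, b2):
    """Roots of the quadratic factor in (12.30) at h = 0."""
    (a0, _, a2), _ = three_step_coefficients(bm1, b2)
    return np.roots([1.0, 1.0 - a0, a2])

print(extraneous_roots_at_h0(3.0 / 8.0, 0.0))        # Hamming's method: roots well inside the unit circle
print(extraneous_roots_at_h0(3.0 / 8.0, 1.0 / 24.0)) # Adams-Moulton (12.38): both extraneous roots vanish
print(extraneous_roots_at_h0(1.0 / 3.0, 0.0))        # Simpson's rule: a root on the unit circle (weakly stable)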

Stability of a numerical integration method ensures that the parasitic so-
lutions are small as compared to the required solution. It also ensures that
a small perturbation in the initial values does not cause numerical instability
in the computed solution, provided the problem itself is well posed. However,
stability by itself does not ensure that the computed solution converges to the
true solution in the limit h → 0. It can be easily shown that stability and
consistency are necessary conditions for convergence. In fact, in our discussion
on stability, we had used consistency to show that one of the roots of the char-
acteristic equation corresponds to the actual solution. In general, consistency
and stability together are sufficient to ensure convergence, but the proof of this
theorem is beyond our scope.
Convergence of a numerical integration method only ensures that the com-
puted solution tends to the true solution in the limit h → 0. In practice, we
are interested in estimating the error for a finite value of h. Since errors are
introduced at each stage, it is essential to consider the propagation of errors.
Let Y_j = y(t_j) be the true solution of the differential equation; then we can
define the accumulated error ε_j = Y_j - y_j, where y_j is the computed solution.
In practical computations, there are roundoff errors. Hence, y_j is not exactly the
same as that given by the numerical integration formula. To correct for this
error, we modify the numerical integration formula (12.6) to write

y_{j+1} = Σ_{k=0}^{m} a_k y_{j-k} + h Σ_{k=-1}^{m} b_k y'_{j-k} + R_j,                    (12.39)

where R_j is the roundoff error in this step. Here y'_j = f(t_j, y_j) is the derivative
at the computed point. Similarly, we can define Y'_j = f(t_j, Y_j) as the derivative
for the true solution. The true solution satisfies the equation

Y_{j+1} = Σ_{k=0}^{m} a_k Y_{j-k} + h Σ_{k=-1}^{m} b_k Y'_{j-k} + T_j,                    (12.40)

where T_j is the truncation error at the jth step. This is the local truncation
error, which can be easily estimated for most formulae. For the Simpson's rule
considered in Example 12.1 it is of the form -(1/90) h^5 y^{(5)}(ξ). Subtracting (12.39)
from (12.40) we get

ε_{j+1} = Σ_{k=0}^{m} a_k ε_{j-k} + h Σ_{k=-1}^{m} b_k (Y'_{j-k} - y'_{j-k}) + E_j,            (12.41)

where E_j = T_j - R_j is the error introduced at the jth step. This error should
be distinguished from ε_j, which is the accumulated error. Using the mean value
theorem we get

Y'_j - y'_j = f(t_j, Y_j) - f(t_j, y_j) = f_y(t_j, η_j) (Y_j - y_j) = f_y(t_j, η_j) ε_j,        (12.42)

where f_y denotes the partial derivative of f, and η_j lies between y_j and Y_j. If
we are considering the simple linear equation y' = λy, then f_y(t, y) = λ and
the difference equation reduces to

(1 - hλ b_{-1}) ε_{j+1} = Σ_{k=0}^{m} (a_k + hλ b_k) ε_{j-k} + E_j.                      (12.43)

Apart from the addition of the inhomogeneous term, this equation is identical to
that studied earlier. If we make a further simplifying assumption that the er-
ror E_j introduced in every step is a constant E, the general solution of this
difference equation can be written as

ε_j = Σ_{k=0}^{m} d_k α_k^j - E / (hλ Σ_{k=-1}^{m} b_k),                            (12.44)

where we have used the first equation in (12.8) to simplify the denominator in the
second term. The coefficients d_k can be obtained from the initial conditions,
which is essentially the error in the starting values. We assume that the initial
conditions are such that |d_0| ≪ |c_0|, since otherwise the computed solution will
be of no use. If the numerical integration method is stable, then the parasitic
solutions should be smaller than the true solution and the term d_0 α_0^j dominates
over the other terms in the summation. Further, this term always remains smaller
than the true solution. If α_0 > 1, that is, the solution is increasing with time,
then the error ε_j also increases with time. However, at all times the error d_0 α_0^j is
small as compared to the true solution c_0 α_0^j and the relative error is bounded.
Thus, stability of a numerical integration method is necessary to ensure that
the parasitic solutions do not overwhelm the required solution and the error
remains bounded. If the true solution is decreasing with time, then the solution
will be meaningful as long as the true solution is larger than the second term
in Eq. (12.44). The contribution of the truncation error to E reduces with h, but
the roundoff error may be independent of h and its contribution to the error fj
increases as h reduces. This will put some limit on how far the solution can be
continued when it is decreasing with time.
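This competition between truncation and roundoff can be illustrated with a toy model, which
is entirely my own assumption of constants and is not taken from the text: if the per-step
error E has a truncation part scaling as h^5 and a roundoff part of the order of the unit
roundoff of the arithmetic, the accumulated error behaves roughly like C_1 h^4 + C_2 u/h, and
cannot be reduced indefinitely by decreasing h.

import numpy as np

roundoff = 2.0 ** -23                    # roundoff level of a 24-bit arithmetic
for h in np.logspace(-6, -1, 6):
    est = 1.0 * h**4 + roundoff / h      # C_1 = C_2 = 1, purely for illustration
    print(f"h = {h:8.1e}   estimated accumulated error ~ {est:8.1e}")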
If the term E j in (12.43) is not constant, then the inhomogeneous term in
the solution may be complicated. A bound on this term can be obtained if E j
is replaced by its upper bound E over the entire range. In general, estimating
the accumulated error in a numerical integration method is a difficult problem
and we do not consider it further. In the next few sections, we will consider
some practical methods for controlling the accumulated error.

12.3 Predictor-Corrector Methods


In this section, we consider the Predictor-Corrector methods, which are based
on numerical integration formulae considered in the preceding sections. As seen
in Section 12.1, numerical integration methods can be explicit or implicit. For
explicit methods the formula directly yields the value of y_{j+1}, once all the
previous values are known. Such formulae are easy to use, but in general, the
stability characteristics are not very good and as a result implicit formulae may
be preferred. However, for implicit formulae, the value of y_{j+1} can be calculated
only iteratively. The iteration will converge rapidly, if a good initial guess can
be provided. The basic idea behind the predictor-corrector methods is to use an
explicit formula to predict a good initial guess, which can be used as the starting
value for iteration on an implicit formula. The explicit formula is referred to
as the predictor, while the implicit formula is referred to as the corrector. For
effective use, it turns out that both the predictor and the corrector should have
the same order of accuracy.
For example, consider the following pair of second-order numerical inte-
gration formulae

predictor:    y_{j+1} = y_{j-2} + (3h/2) (y'_j + y'_{j-1}) + (3h^3/4) y'''(ξ_1),
                                                                          (12.45)
corrector:    y_{j+1} = y_j + (h/2) (y'_j + y'_{j+1}) - (h^3/12) y'''(ξ_2).
The first of these is an explicit formula obtained from the Newton-Cotes open
formula, while the second is an implicit formula based on the trapezoidal rule.
The truncation error in the implicit formula is significantly less than that in the
explicit formula. Further, it can be shown that the trapezoidal rule is stable
over the entire range of hλ values, while the explicit formula is unstable for
hλ < 0. Hence, the explicit formula cannot be used by itself to integrate a
differential equation with decreasing solution, but it serves a useful purpose
by predicting a reasonable starting value for the corrector formula, so that the
iteration converges rapidly.
To solve the corrector formula for y_{j+1}, it is necessary to use an iterative
method. Any method considered in Chapter 7 can be used for this purpose. In
this case, since a very good guess is generally available, it is convenient to use
the simple fixed-point iteration defined by

y_{j+1}^{(k+1)} = Σ_{i=0}^{m} a_i y_{j-i} + h b_{-1} f(t_{j+1}, y_{j+1}^{(k)}) + h Σ_{i=0}^{m} b_i y'_{j-i},        (12.46)

where y_{j+1}^{(k)} is the kth approximation to y_{j+1}. The starting value y_{j+1}^{(0)} is provided
by the predictor formula and the above formula can be used to obtain successive
approximations with k = 0, 1, ..., until a satisfactory accuracy is achieved. As
shown in Section 7.2, this iteration converges only if

|hλ b_{-1}| < 1,                                                          (12.47)

where f_y ≈ λ is the partial derivative of f(t, y) with respect to y. Thus, for sufficiently small
h, the iteration will converge. Further, for rapid convergence we need |hλ b_{-1}| ≪ 1. In
most cases, if (12.47) is violated, then the step size h is too large to ensure any
reasonable accuracy. Hence, accuracy requirement usually forces the step size
to be small enough to ensure convergence of fixed-point iteration. An exception
to this rule occurs for stiff differential equations considered in Section 12.6.
For such equations, it is not possible to use the simple fixed-point iteration on
corrector and a more sophisticated iterative method is required.
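A minimal Python sketch of one step of the pair (12.45), with the simple fixed-point
iteration on the trapezoidal corrector, is given below. This is my own illustration under
the stated assumptions (a scalar equation and a simple relative convergence test); it is
not the routine from Appendix B.

def pc_step(f, t, h, yj, yjm2, fj, fjm1, tol=1e-8, maxit=20):
    """One step of the pair (12.45) for y' = f(t, y).

    yj, yjm2 : y_j and y_{j-2};  fj, fjm1 : y'_j and y'_{j-1}.
    Returns (y_{j+1}, y'_{j+1}, number of corrector iterations used).
    """
    y_new = yjm2 + 1.5 * h * (fj + fjm1)           # explicit predictor
    for it in range(1, maxit + 1):
        f_new = f(t + h, y_new)
        y_next = yj + 0.5 * h * (fj + f_new)       # corrector (trapezoidal rule)
        if abs(y_next - y_new) <= tol * (abs(y_next) + tol):
            return y_next, f(t + h, y_next), it
        y_new = y_next
    return y_new, f(t + h, y_new), maxit

For the trapezoidal corrector b_{-1} = 1/2, so this iteration can be expected to converge
roughly when |h f_y / 2| < 1, in line with (12.47).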
It should be noted that while iterating on the corrector, only the last term
in the iteration formula will change with k. In most practical problems, evalu-
ation of the derivative using the given differential equation requires more effort
than the rest of the calculations. Hence, efficiency of the method is essentially
determined by the number of evaluations of the derivative y'_{j+1} = f(t_{j+1}, y_{j+1}).
This number can be reduced by providing a good guess for the starting value
and by making hλ b_{-1} small. Thus, reducing the step size may reduce the num-
ber of iterations required. Hence, the total effort involved in integrating an
equation over a given region may not increase as h is decreased, even though
larger number of steps are required. Further, reducing h improves accuracy of
the computed solution. However, reducing h below some optimum value will
result in a significant extra effort. Thus, h should be chosen such that the
corrector formula converges to the required accuracy in one or two iterations.
Apart from the choice of h, efficiency can be improved significantly if a good
starting value can be provided for the iteration. The predicted value of Yj+1
can be improved by estimating the local truncation error. If the predictor and
the corrector are of the same order, then truncation error in both the formulae
involve the same derivative. If we assume that this derivative is slowly varying
with time, then it can be estimated using the difference between the predicted
and the corrected value.
The local truncation error at each step can also be estimated using the
estimated derivative. For example, using the predictor-corrector method defined
by (12.45), we get
y_j - y_j^P ≈ (5/6) h^3 y''',                                              (12.48)

where y_j^P and y_j are the computed values of y_j using the predictor and the corrector,
respectively. From this estimate, the local truncation error in the predictor is
≈ (9/10)(y_j - y_j^P), while that in the corrector is ≈ -(1/10)(y_j - y_j^P). If we assume that
the derivative is not changing appreciably in one step, then the predicted value
can be improved by adding this truncation error, to get

y_{j+1}^{(0)} = y_{j+1}^P + (9/10) (y_j - y_j^P).                                    (12.49)

This improved value can be used as a starting value for iteration on the cor-
rector. This formula is sometimes referred to as modifier.
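In code the modifier and the error estimate amount to two one-line helpers. The sketch
below uses my own names and applies only to the pair (12.45); it is not taken from the text.
The previous step's predicted and corrected values improve the new predicted value, while
the current difference estimates the local truncation error of the corrector.

def modified_start_value(y_pred_new, y_pred_prev, y_corr_prev):
    # improved starting value (12.49); the previous-step difference estimates (5/6) h^3 y'''
    return y_pred_new + 0.9 * (y_corr_prev - y_pred_prev)

def corrector_error_estimate(y_pred, y_corr):
    # local truncation error of the corrected value, cf. (12.48)
    return -0.1 * (y_corr - y_pred)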
In principle, it is possible to improve accuracy of the corrector by adding
the estimated truncation error, which is equivalent to using a higher order

method. However, modifying the corrector is likely to affect the stability prop-
erty {19} and hence should be avoided. For Adams method it can be shown
{19} that adding the estimated truncation error does not cause any serious
deterioration in the relative stability of the method. However, the region of ab-
solute stability shrinks significantly. Hence, this technique could be useful for
nonstiff differential equations. If the step size is adjusted to achieve the required
accuracy, then the step size is not affected by this modification, since we can
only estimate the truncation error in the predictor or the corrector, but not in
the modified formula. Hence, we can only hope that the estimated truncation
error in the corrector provides an upper bound on the error in the higher order
formula, obtained by adding the estimated truncation error. This assumption
need not be true, since even if the corrector is stable, the addition of estimated
truncation error could make the modified formula unstable. In such cases, the
error in higher order formula will be larger than that in the corrector. In gen-
eral, it will be safer to use a higher order formula, rather than increasing the
order by adding the estimated error.
If the relevant derivative is varying rapidly or is changing sign in the
required interval, then it is possible that the error estimate is completely off
the mark. However, such problems are unavoidable in any technique for error
estimation based on finite amount of information. It should be noted that if
the predictor has lower order of accuracy as compared to the corrector, then
in most cases, the difference between the predicted and the corrected value
gives an approximation to truncation error in the predictor, rather than the
corrector. In such cases, it is difficult to obtain a reliable estimate of truncation
error in the corrector. Even if we assume that the difference in the predicted and
corrected values provides an upper bound on the error in the corrector, we are
forced to keep the step size small enough to ensure that the estimated error
is within the specified limits. This condition will force a smaller step size than
what is really necessary, thus reducing the efficiency. Hence, for effective error
estimation, it is essential that both predictor and corrector are of the same
order.
If the solution is smooth, then higher order methods are more efficient for
a given accuracy, since a larger step size can be used. On digital computers,
it is common to use fourth-order methods, since methods of this order provide
sufficient accuracy for most problems and are reasonably straightforward to
derive. However, if the solution is smooth and very high accuracy is required,
then even higher order method may be used. Conversely, if the solution shows
some singular behaviour, then it may be more efficient to use a lower order
method.
Adams-Moulton correctors are more popular, because of better stability
properties. As mentioned in the previous section, for these correctors all para-
sitic solutions vanish when h = O. These corrector formulae can be used with
Adams-Bashforth predictors, which can also be derived by integrating the New-
ton's backward interpolation formula (not using the last point) over one step.
The first five Adams-Bashforth formulae with the corresponding error terms

are

y_{j+1} = y_j + h y'_j + (1/2) h^2 y''(ξ)        (Euler's method),

y_{j+1} = y_j + (h/2) (3y'_j - y'_{j-1}) + (5/12) h^3 y'''(ξ),

y_{j+1} = y_j + (h/12) (23y'_j - 16y'_{j-1} + 5y'_{j-2}) + (3/8) h^4 y^{(4)}(ξ),

y_{j+1} = y_j + (h/24) (55y'_j - 59y'_{j-1} + 37y'_{j-2} - 9y'_{j-3}) + (251/720) h^5 y^{(5)}(ξ),

y_{j+1} = y_j + (h/720) (1901y'_j - 2774y'_{j-1} + 2616y'_{j-2} - 1274y'_{j-3} + 251y'_{j-4})
          + (95/288) h^6 y^{(6)}(ξ).
                                                                          (12.50)
While the corresponding Adams-Moulton corrector formulae are

y_{j+1} = y_j + h y'_{j+1} - (1/2) h^2 y''(ξ)        (Backward Euler's method),

y_{j+1} = y_j + (h/2) (y'_{j+1} + y'_j) - (1/12) h^3 y'''(ξ)        (Trapezoidal rule),

y_{j+1} = y_j + (h/12) (5y'_{j+1} + 8y'_j - y'_{j-1}) - (1/24) h^4 y^{(4)}(ξ),

y_{j+1} = y_j + (h/24) (9y'_{j+1} + 19y'_j - 5y'_{j-1} + y'_{j-2}) - (19/720) h^5 y^{(5)}(ξ),

y_{j+1} = y_j + (h/720) (251y'_{j+1} + 646y'_j - 264y'_{j-1} + 106y'_{j-2} - 19y'_{j-3})
          - (3/160) h^6 y^{(6)}(ξ).
                                                                          (12.51)
These formulae can be used with an appropriate modifier in the predictor-corrector
methods. The resulting predictor-corrector method is referred to as the Adams-
Bashforth-Moulton method or simply as the Adams method. The fourth-order
Adams-Bashforth-Moulton predictor-corrector formula can be used with the
modifier

y_{j+1}^{(0)} = y_{j+1}^P + (251/270) (y_j - y_j^P).                                  (12.52)
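For concreteness, a sketch of one fourth-order Adams-Bashforth-Moulton step in Python is
given below; it is my own illustration, not the Appendix B routine. It applies the corrector
once to the raw predicted value; a full implementation would first improve the starting
value with the modifier (12.52), which requires the predicted value of the previous step.

def abm4_step(f, t, h, y, fhist):
    """fhist = [y'_j, y'_{j-1}, y'_{j-2}, y'_{j-3}] for y' = f(t, y).

    Returns (y_{j+1}, y'_{j+1}, estimate of the local truncation error).
    """
    f0, f1, f2, f3 = fhist
    y_pred = y + h / 24.0 * (55 * f0 - 59 * f1 + 37 * f2 - 9 * f3)   # Adams-Bashforth predictor
    f_pred = f(t + h, y_pred)
    y_corr = y + h / 24.0 * (9 * f_pred + 19 * f0 - 5 * f1 + f2)     # Adams-Moulton corrector
    err = -19.0 / 270.0 * (y_corr - y_pred)   # corrector error estimate, cf. -(19/720) h^5 y^(5)
    return y_corr, f(t + h, y_corr), err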
As noted earlier, predictor-corrector methods of order greater than one
require additional initial conditions to start the solution. For example, to use the
fourth-order Adams-Bashforth-Moulton method, we need four previous values.
To start with, we know only one value y_0, while the next three values y_1, y_2 and
y_3 need to be generated using some other method. One way of generating these
starting values is to use a Taylor series expansion about t = t_0. If a few derivatives
can be calculated by repeated differentiation of the differential equation, then
we can obtain the starting values with sufficient accuracy. However, in most
practical problems, particularly for a system of differential equations, it is too
cumbersome to calculate the required derivatives. Hence, it is usually difficult to
use this technique for generating starting values. Another alternative is to use a

single-step method with sufficient accuracy such as the Runge-Kutta methods


(Section 12.4) for generating the starting value. In fact, this turns out to be
the most convenient means of generating the required starting values.
It is possible to use an iterative method to generate the starting values.
For example, by integrating the interpolating polynomial using the four points
t_0, t_1, t_2, t_3, we can obtain the following formulae for calculating the three values
y_1, y_2 and y_3:

y_1 = y_0 + (h/24) (9y'_0 + 19y'_1 - 5y'_2 + y'_3) - (19/720) h^5 y^{(5)}(ξ_1),

y_2 = y_0 + (h/3) (y'_0 + 4y'_1 + y'_2) - (1/90) h^5 y^{(5)}(ξ_2),                      (12.53)

y_3 = y_0 + (3h/8) (y'_0 + 3y'_1 + 3y'_2 + y'_3) - (3/80) h^5 y^{(5)}(ξ_3).

Here the last two formulae are just the Simpson's 1/3 and 3/8 rule (Section 6.1),
while the first formula is essentially Adams-Moulton formula used in the reverse
direction. These formulae can be used to generate starting values for any four-
step fourth-order method. It can be seen that these equations involve y_1, y_2
and y_3 on the right-hand sides. Hence, they can only be solved iteratively. Once
again, if the step size is sufficiently small, then it can be shown that the simple
fixed-point iteration will converge. Thus, using some initial estimates y_i^{(0)}, we
can use these formulae to compute the next approximation y_i^{(1)}, and the process
is continued until the results have converged to a satisfactory accuracy. Once
the results have converged, we can estimate the truncation error by estimating
the fifth derivative, provided this estimate can be obtained independently. It
can be shown {23} that if the fifth derivative is estimated using the computed
values of y_i and y'_i, then the estimated value is always close to the roundoff limit.
This is basically because of the fact that the starting values have been generated
using a fourth degree polynomial. Hence the fifth derivative estimated using the
computed values vanishes. Alternately, we can use the estimate of truncation
error in the first step of the predictor-corrector method, as an estimate for error
in the starting values. If the estimated truncation error is within acceptable
bound, then we can accept the results and continue the integration. Otherwise,
the step size should be reduced and second attempt is made to generate the
starting values. Using the fact that the truncation error is expected to scale as
h^5, we can choose the new step size to be tried.
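The iterative generation of starting values from (12.53) can be sketched as follows. This is
a simplified scalar version of my own (not the library routine), with the crude initial
estimate y_1 = y_2 = y_3 = y_0.

def starting_values(f, t0, h, y0, tol=1e-10, maxit=50):
    """Generate y_1, y_2, y_3 for y' = f(t, y) by fixed-point iteration on (12.53)."""
    y1 = y2 = y3 = y0
    f0 = f(t0, y0)
    for _ in range(maxit):
        f1, f2, f3 = f(t0 + h, y1), f(t0 + 2 * h, y2), f(t0 + 3 * h, y3)
        y1n = y0 + h / 24.0 * (9 * f0 + 19 * f1 - 5 * f2 + f3)
        y2n = y0 + h / 3.0 * (f0 + 4 * f1 + f2)
        y3n = y0 + 3.0 * h / 8.0 * (f0 + 3 * f1 + 3 * f2 + f3)
        if max(abs(y1n - y1), abs(y2n - y2), abs(y3n - y3)) <= tol:
            return y1n, y2n, y3n
        y1, y2, y3 = y1n, y2n, y3n
    return y1, y2, y3

# e.g. for y' = 3y, y(0) = 1 with h = 0.1 this converges in a few iterations
print(starting_values(lambda t, y: 3.0 * y, 0.0, 0.1, 1.0))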
One advantage of the predictor-corrector method is that it gives an es-
timate of the local truncation error without any extra computations. This es-
timate can be used to choose the proper step size. If the estimated error is
much smaller than the required accuracy, then the step size can be increased
to improve efficiency, while if the estimated error is too large, then the step
size can be reduced to achieve the required accuracy. However, the numerical
integration formulae assume a constant step size and as a result it is not easy
to change the step size. It is convenient to increase or decrease the step size by

a factor of two when required. If the step size is doubled, then no extra starting
values may be required, provided sufficient information at the required number
of previous steps is retained. This procedure requires that the step size can be
doubled only after sufficient number of steps have been performed. Hence, if
the initial step size is too small, it will require a large number of steps before an
optimum value of step size is reached. When the step size is to be halved, the
required values can be calculated using Hermite interpolation formula, since
the first derivatives are also known. For example using the solution at t, t - h
and t - 2h, we can use

y(t - h/2) = (1/128) (45y(t) + 72y(t-h) + 11y(t-2h))
             + (h/128) (-9y'(t) + 36y'(t-h) + 3y'(t-2h)) + O(h^6),
                                                                          (12.54)
y(t - 3h/2) = (1/128) (11y(t) + 72y(t-h) + 45y(t-2h))
             + (h/128) (-3y'(t) - 36y'(t-h) + 9y'(t-2h)) + O(h^6).
Once these values are calculated, we can use the points t, t - h/2, t - h and
t - 3h/2 for continuing the integration using a four-step method.
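A direct transcription of (12.54) into Python might look as follows; the helper name is my
own. The check with y = t^3 is exact, since the formulae are accurate to O(h^6).

def halve_step_values(y0, y1, y2, d0, d1, d2, h):
    """y0, y1, y2 = y(t), y(t-h), y(t-2h);  d0, d1, d2 = y'(t), y'(t-h), y'(t-2h)."""
    y_half = (45 * y0 + 72 * y1 + 11 * y2) / 128.0 + h * (-9 * d0 + 36 * d1 + 3 * d2) / 128.0
    y_3half = (11 * y0 + 72 * y1 + 45 * y2) / 128.0 + h * (-3 * d0 - 36 * d1 + 9 * d2) / 128.0
    return y_half, y_3half

# check with y = t**3, t = 1, h = 0.5: exact values are 0.421875 and 0.015625
print(halve_step_values(1.0, 0.125, 0.0, 3.0, 0.75, 0.0, 0.5))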
Instead of interpolation, we can use special formulae with unequal step
sizes for halving the step size. For example, for a three-step third-order method,
we can use the formulae

y(t + h/2) = y(t) + (h/24) (17y'(t) - 7y'(t-h) + 2y'(t-2h)) + O(h^4),
                                                                          (12.55)
y(t + h/2) = y(t) + (h/72) (64y'(t) - 33y'(t-h/2) + 5y'(t-3h/2)) + O(h^4),
to calculate y_{j+1} = y(t_j + h/2) (using the first formula at t = t_j) and y_{j+2} = y(t_j + h)
(using the second formula at t = t_j + h/2). Once these values are
calculated, integration can be continued with a step size of h/2 using y_j, y_{j+1}
and y_{j+2}. In practice, we would like to estimate the truncation error in these
values also, which can be achieved either by estimating the required derivative,
or by using a corrector formula of similar type.
Any implementation of a numerical integration method which automatically
adjusts the step size depending on the accuracy requirements is similar in prin-
ciple to the adaptive integration routines discussed in Section 6.7. In principle,
it is possible to use such algorithms to evaluate a definite integral. However, an
adaptive integration routine is in general much more effective for that purpose.
Further, it is rather difficult to handle singularities in a differential equation
and most implementations of numerical integration methods will fail to evaluate
singular integrals with even moderate singularities.
Apart from adjusting the step size, we can also change the order of the
predictor-corrector method to improve efficiency. Such implementations are re-

ferred to as variable order variable step methods. In these methods, both the
order and step size are adjusted to complete the integration with minimum
effort. Such methods usually employ Adams predictor-corrector formulae, since
they have the best stability characteristics. Not much is known about stability
of such variable order variable step methods, but empirical evidence suggests
that changing both the order and step size too frequently introduces insta-
bilities. Further, if a method of order r is being used, then the order should
preferably be changed to r ± 1 only. In these methods, if at any step the es-
timated truncation error is larger than the required accuracy, then either the
step size is decreased or the order is changed to achieve the required accuracy.
Which of the two parameters is changed depends on the estimated efficiencies.
Apart from that, in view of the possible instability, the order is changed only
if some minimum number of steps have been performed with the same order.
Even if the estimated error is within the required limits, the order or the step
size can be changed if it is estimated to improve efficiency. The optimum strat-
egy for changing the step size and order is not known and certain heuristics
are used to decide the change. Such variable order methods are self starting,
since we can start with a first-order formula which does not require any ex-
tra starting values. A Fortran implementation of such an algorithm is given
in Gear (1971), while subroutine 11STEP in Appendix B provides a simple
implementation of fourth-order Adams-Bashforth-Moulton predictor-corrector
method with variable step size.
It should be noted that the predictor-corrector method only gives an es-
timate of local truncation error, while in actual practice we are interested in
the accumulated error. It is difficult to give any estimate for the accumulated
error and once again heuristics are used. For example, if we require an accu-
racy of ε over the interval [a, b], then we can demand that at each step the local
truncation error is less than εh/(b - a). Here it is effectively assumed that the
accumulated error is the sum of local truncation errors, which is obviously not
the case. Alternately, we can simply ensure that the difference between the pre-
dicted and the corrected values at every step is less than the required accuracy
and hope that it will also control the accumulated error. This technique may
work, if the local truncation error is much less than the required accuracy and
we can expect that the extra factor will take care of accumulation. The only
reasonably reliable method of estimating the accumulated error is to repeat the
integration using two different methods and compare the results. It is possible
to use the same basic method with different step sizes to get two independent
estimate for the solution at the required points. However, if a routine with
adaptive step size is used, then it is difficult to achieve this objective, since
changing the accuracy requirement may not necessarily change the step size
significantly. Hence, with such routines two completely independent methods
will be needed to get independent results.
When a predictor-corrector method with variable step size is used, it is
usually impossible to ensure that the final point of the required interval coin-
cides with one of the points used in the integration formula. Hence, the integra-

Table 12.1: Solving y' = 3y, y(0) = 1 using predictor-corrector methods

                               Simpson's rule                  Adams method
   t     Exact sol.        Corrected       Rel. error      Corrected       Rel. error

  0.4   3.32012 x 10^0    3.32042 x 10^0   -9.0 x 10^-5    3.32062 x 10^0   -1.5 x 10^-4
  0.5   4.48169 x 10^0    4.48222 x 10^0   -1.2 x 10^-4    4.48259 x 10^0   -2.0 x 10^-4
  1.0   2.00855 x 10^1    2.00891 x 10^1   -1.8 x 10^-4    2.00947 x 10^1   -4.6 x 10^-4
  2.0   4.03429 x 10^2    4.03555 x 10^2   -3.1 x 10^-4    4.03818 x 10^2   -9.7 x 10^-4
  3.0   8.10307 x 10^3    8.10670 x 10^3   -4.5 x 10^-4    8.11504 x 10^3   -1.5 x 10^-3
  4.0   1.62754 x 10^5    1.62849 x 10^5   -5.9 x 10^-4    1.63078 x 10^5   -2.0 x 10^-3
  5.0   3.26899 x 10^6    3.27135 x 10^6   -7.2 x 10^-4    3.27717 x 10^6   -2.5 x 10^-3
  6.0   6.56593 x 10^7    6.57156 x 10^7   -8.6 x 10^-4    6.58573 x 10^7   -3.0 x 10^-3
  7.0   1.31880 x 10^9    1.32011 x 10^9   -9.9 x 10^-4    1.32345 x 10^9   -3.5 x 10^-3
  8.0   2.64887 x 10^10   2.65186 x 10^10  -1.1 x 10^-3    2.65958 x 10^10  -4.0 x 10^-3
  9.0   5.32045 x 10^11   5.32712 x 10^11  -1.3 x 10^-3    5.34463 x 10^11  -4.5 x 10^-3
 10.0   1.06865 x 10^13   1.07012 x 10^13  -1.4 x 10^-3    1.07404 x 10^13  -5.0 x 10^-3

tion can be continued till the last point is just beyond the required end point.
After that the solution at the required point can be calculated using interpola-
tion. Since the step size is constant, and first derivatives are also available, it
is most convenient to use Hermite interpolation.
EXAMPLE 12.3: Solve the following initial value problem using the Simpson's rule (also
referred to as Milne's method) and the fourth-order Adams predictor-corrector formula:

dy/dt = λy,    y(0) = 1,    (λ = ±3).                                      (12.56)

For both these methods, we need four starting values, which can be generated using
(12.53). The same starting values can be used with both predictor-corrector methods. Milne's
method uses the predictor

y_{j+1}^P = y_{j-3} + (4h/3) (2y'_j - y'_{j-1} + 2y'_{j-2}) + (14/45) h^5 y^{(5)}(ξ).        (12.57)

The exact solution e^{λt} can be compared with the computed solution. Similarly, the actual
error can be compared with the difference between the predicted and the corrected value. For
simplicity, all results presented here have been obtained with constant step size. Iteration on
the corrector is continued until the relative difference between two successive approximations
is less than 10^-7. It can be seen that in most cases, the truncation error is much larger
than 10^-7 and a more modest criterion may be sufficient. However, to ensure that there
is no effect due to lack of convergence, we have chosen this criterion. All calculations were
performed using a 24-bit arithmetic.
The results obtained using λ = 3 and h = 0.1 are shown in Table 12.1. Since the
solution is increasing rapidly with time, it is only meaningful to consider relative error in the
computed solution, which is also shown in the table. It can be seen that the Simpson's rule
gives more accurate results as can be expected from the fact that the local truncation error
in this method is less than half that in Adams method. The local (relative) truncation error
in Simpson's rule is expected to be ≈ 2.7 x 10^-5. However, the starting values themselves are
not accurate to that extent and as a result, even at the first step the error is somewhat larger.
Further, the relative error increases almost linearly with t. At each step, it requires four to
five iterations on the corrector to achieve the specified accuracy. Even if the convergence

Table 12.2: Solving y' = -3y, y(0) = 1 using predictor-corrector methods

                               Simpson's rule                   Adams method
   t     Exact sol.         Corrected        Rel. error     Corrected        Rel. err.

   .4   3.01194 x 10^-1    3.01194 x 10^-1   -1.2 x 10^-6   3.01141 x 10^-1   .0002
   .5   2.23130 x 10^-1    2.23090 x 10^-1    1.8 x 10^-4   2.23072 x 10^-1   .0003
   .6   1.65299 x 10^-1    1.65308 x 10^-1   -5.7 x 10^-5   1.65242 x 10^-1   .0003
   .7   1.22456 x 10^-1    1.22416 x 10^-1    3.3 x 10^-4   1.22405 x 10^-1   .0004
  1.0   4.97871 x 10^-2    4.98170 x 10^-2   -6.0 x 10^-4   4.97538 x 10^-2   .0007
  1.5   1.11090 x 10^-2    1.10454 x 10^-2    5.7 x 10^-3   1.10970 x 10^-2   .0011
  2.0   2.47875 x 10^-3    2.57863 x 10^-3   -4.0 x 10^-2   2.47507 x 10^-3   .0015
  2.5   5.53085 x 10^-4    3.87614 x 10^-4    3.0 x 10^-1   5.52037 x 10^-4   .0019
  3.0   1.23410 x 10^-4    3.94961 x 10^-4   -2.2 x 10^0    1.23126 x 10^-4   .0023
  4.0   6.14424 x 10^-6    7.39597 x 10^-4   -1.2 x 10^2    6.12505 x 10^-6   .0031
  5.0   3.05905 x 10^-7    1.98098 x 10^-3   -6.5 x 10^3    3.04699 x 10^-7   .0039
  6.0   1.52301 x 10^-8    5.34878 x 10^-3   -3.5 x 10^5    1.51576 x 10^-8   .0047
  7.0   7.58266 x 10^-10   1.44442 x 10^-2   -1.9 x 10^7    7.54039 x 10^-10  .0056
  8.0   3.77519 x 10^-11   3.90061 x 10^-2   -1.0 x 10^9    3.75107 x 10^-11  .0064
  9.0   1.87954 x 10^-12   1.05335 x 10^-1   -5.6 x 10^10   1.86602 x 10^-12  .0072
 10.0   9.35757 x 10^-14   2.84454 x 10^-1   -3.0 x 10^12   9.28277 x 10^-14  .0080

criterion is reduced to 10^-5, it still requires two to three iterations. Hence obviously, the step
size is a bit too large for this equation. The relative difference between the predicted and the
corrected value is ≈ 10^-3 throughout the range. This difference is consistent with the local
truncation error expected in the methods and gives a good estimate of the actual error.
Now we consider the case λ = -3 using h = 0.1. The results are shown in Table 12.2. It
can be seen that the relative error using Adams method is about twice that for λ = 3. On the
other hand, Simpson's rule initially gives more accurate results, but the error is increasing
rapidly and within a few steps it exceeds that in Adams method. As the integration is
continued further, there is a distinct instability and the error keeps increasing exponentially.
In fact, even in the first few steps it can be seen that the error is oscillating in sign, which is
clearly due to the fact that Simpson's rule is unstable for decreasing solutions (Example 12.1).
From the quadratic equation in Example 12.1, it can be seen that for hλ = -0.3 the second
root α_1 ≈ -1.1. Thus, the parasitic solution grows as α_1^j, while the true solution is decreasing
exponentially. Hence, at each step the relative error will increase by a factor of |α_1 e^{-λh}| ≈ 1.5.
If the error in the starting values is of the order of 10^-5, which is a reasonable figure, then
after about 30 steps the relative error should be comparable to one. After 100 steps the
parasitic solution is expected to be ε_0 α_1^100 ≈ 10^4 ε_0, where ε_0 is the coefficient of the parasitic
solution as determined from the initial conditions. Since the value at the end of 100 steps is
≈ 0.3, it follows that ε_0 ≈ 3 x 10^-5, which is consistent with errors in Adams method, as
well as the errors for λ = 3. Since the value of α_1 is negative, the error is alternating in sign.
Even if the starting values are exact, the error will be of the same order. If the cal-
culations are repeated for h = 0.01, then only one iteration on the corrector is required to
achieve the specified accuracy. In fact, the local truncation error is smaller than the roundoff
error. However, towards the end of integration, Simpson's rule requires four to five iterations,
as the instability develops. Adams method gives much more accurate results and in fact, the
error reduces by a factor of ≈ 10^4, which is expected from the fact that it is a fourth-order
method. Note that although the local truncation error is O(h^5), accumulation over 1/h steps
makes the accumulated error O(h^4). For Simpson's rule, the instability still persists, but the
relative error is reduced by a factor of 10^4. As a result, the relative error exceeds unity only

around t = 6. This result can be understood from the fact that for hλ = -0.03, the root of
the quadratic α_1 ≈ -1.01, and α_1^10 is approximately equal to the value of α_1 for the previous
case. Hence, even though this factor is small, the growth of error takes place at the same rate,
since the number of steps required to reach a given value of t increases. However, because of
the smaller step size the initial values are more accurate by a factor of 10^4 and as a result the
instability is somewhat delayed.

From this example it is clear that for λ < 0, Adams method, which is stable
for this value of hλ, gives much better results as compared to Simpson's rule.
In general, since we cannot know in advance whether the solution is decreasing
or increasing with time, it is risky to use Simpson's rule. Further, for a system
of differential equations, there are several independent solutions and some of
them may be decreasing, leading to numerical instability in Simpson's rule. It is
interesting to note that for Simpson's rule the difference between the predicted
and the corrected value gives a clear indication of the problem. Hence, if a
routine with adaptive step size control is used, then it will keep reducing the
step size and if proper checks are not provided, it may get into a nonterminating
loop. The reduction of step size may not make any difference as far as the
solution is concerned, but the computer time required will increase drastically.
EXAMPLE 12.4: Solve the following initial value problem using Simpson's rule and Adams
predictor-corrector methods

dy/dt = λ(y - t^3) + 3t^2,    y(0) = 0,    (λ = ±3).                          (12.58)

The exact solution of this differential equation is y = t^3 for all values of λ. Since both
methods under consideration have fourth-order accuracy, the computed solution should be
exact except for roundoff error. Apart from the inhomogeneous term, the equation is the same as
that considered in the previous example. Following the previous example, we first consider
the case λ = 3, for which both methods are expected to be stable. The results are shown in
Table 12.3. It is clear that both methods develop some instability and the results are totally
useless, except at small t. Initially the solution is almost exact, but as integration proceeds
the error builds up exponentially.

Table 12.3: Solving y' = 3(y - t^3) + 3t^2, y(0) = 0 using predictor-corrector methods

                        Simpson's rule                            Adams method
  t   Exact sol.    Predicted     Corrected    Rel. error     Predicted     Corrected    Rel. error

 1.0     1.0000       1.00000       1.00000    0.00             1.00000       1.00000    0.00
 2.0     8.0000       8.00000       8.00000    0.00             8.00000       8.00000    0.00
 3.0    27.0000      26.99998      26.99998    6.4 x 10^-7     26.99998      26.99998    7.0 x 10^-7
 4.0    64.0000      63.99966      63.99966    5.3 x 10^-6     63.99960      63.99960    6.2 x 10^-6
 5.0   125.0000     124.99316     124.99316    5.5 x 10^-5    124.99187     124.99187    6.5 x 10^-5
 6.0   216.0000     215.86237     215.86230    6.4 x 10^-4    215.83678     215.83670    7.5 x 10^-4
 7.0   343.0000     340.23511     340.23386    8.1 x 10^-3    339.72000     339.71835    9.6 x 10^-3
 8.0   512.0000     456.45786     456.43277    1.1 x 10^-1    446.08636     446.05286    1.3 x 10^-1
 9.0   729.0000    -386.74384    -387.24747    1.5 x 10^0    -595.58606    -596.25916    1.8 x 10^0


Figure 12.2: Exact solution of y' = 3(y - t^3) + 3t^2, y(0) = ε for ε = 0, ±10^-4, ±10^-6,
±10^-8, ±10^-10. The curves for negative ε are below that for ε = 0, while those for positive ε
are above. The curves diverging first are for ε = ±10^-4.

The problem here is not due to the numerical method and similar results will be
obtained using any numerical technique. The general solution of the differential equation can
be written as A e^{λt} + t^3, where A is an arbitrary constant, which is determined by the initial
condition. The initial condition of the problem gives A = 0 and the exponential term is
suppressed. For positive λ, the exponential term ultimately dominates the solution, even if A
is small but nonzero. To illustrate the instability, Figure 12.2 shows the exact solution of the
problem with slightly perturbed initial conditions. Hence, if due to some error in specifying
the boundary conditions, the coefficient A is not exactly zero, this solution will ultimately
dominate, giving rise to an instability. It should be recognised that the trouble here is due
to the fact that the problem itself is ill-conditioned, in the sense that a small perturbation in
initial conditions can give a solution which is completely different. This problem can only be
overcome by increasing the accuracy of numerical computations. In this case, because of the
special nature of solution, there is no truncation error and the only error is due to roundoff.
Thus, we can expect the coefficient A to be of the order of ℏ, the roundoff level of the arithmetic. Hence, the exponential term
is expected to be comparable to the true solution when t ~ 8, which is consistent with the
results in Table 12.3. It should be noted that in this case, the predicted and the corrected
values agree reasonably well for all values of t and it is impossible to detect the problem,
unless solutions using two independent methods are compared or a solution using higher
accuracy (i.e., smaller n) is computed. Even that may not always give reliable results as the
instability may be similar in both cases.
Even if the calculations are repeated using smaller step size, essentially similar results
will be obtained, since there is no truncation error in this problem. In fact, reducing the step
size increases the error, since roundoff error increases as h decreases. By increasing the step
size it may be possible to improve the results to some extent. But it is purely because of
the fact that, there is no truncation error in this problem. In most practical problems the
truncation error is dominating. Hence, reducing the step size will reduce the errors and it may
be possible to improve the results using a smaller step size {25}. In this case, to get better
results we need to reduce the coefficient A, which can be achieved by reducing ℏ. Thus, using
a 53-bit arithmetic, it is possible to get a relative accuracy of 10^-7 at t = 10. However, the
errors are increasing exponentially and if integration is continued further, the results will be
no different from those in Table 12.3.

For λ = -3 the exponential term in the general solution is rapidly decreasing and as a
result the problem is well-conditioned. In this case the Adams method gives essentially exact
results apart from roundoff errors. However, Simpson's rule shows some instability, provided
the integration is continued far enough, which is clearly due to the known instability of
Simpson's rule for negative values of λ. Even though the true solution of the differential
equation is increasing with time, the parasitic solution introduced due to hλ = -0.3 is
growing exponentially. As shown in the previous example, this solution is of the form ε α_1^j,
where |α_1| ≈ 1.1. Since in this case there is no truncation error, ε ≈ ℏ and after 100 steps
at t = 10 this solution is ≈ 10^-3, and we expect a relative error of ≈ 10^-6 in the true
solution. The parasitic solution becomes comparable to the true solution at a point where
ℏ |α_1|^j ≈ (jh)^3, which gives jh ≈ 25.
If we consider a large negative value of λ (e.g., -10), then with this step size the
iteration on corrector may not converge and it is not possible to use the predictor-corrector
method in the normal form. This case will be considered in Example 12.8.

From these examples it can be seen that, even solving one differential
equation can be tricky. In practice, we rarely come across a single differential
equation, as most problems involve solution of a system of first-order differential
equations. For such systems, there may be several independent solutions of
widely differing behaviour and it may be difficult to approximate all of these
to sufficient accuracy using the methods described here. We will return to this
problem in Section 12.6.
In most cases, if the step size is reasonably small, then the iteration on cor-
rector converges in one iteration, which requires two evaluations of the function
f(t, y) at every step. On the other hand, a fourth-order Runge-Kutta method
requires four function evaluations per step. Hence, the predictor-corrector meth-
ods are more efficient. There is another way of using the predictor-corrector
method in which instead of checking for convergence only a fixed number (usu-
ally one) of iterations are performed on the corrector. Properties of this method
are different from those of the method in which the corrector is iterated to convergence.
In the latter case, the choice of predictor does not affect the final result, while
in the former the choice of predictor will have significant influence on the prop-
erties of the method. Stability properties of the two methods are also different
{13}.
In practical implementation of predictor-corrector methods, it is rather
difficult to devise a general convergence test to check for convergence of the
corrector or to check if the truncation error is within acceptable limits. Simple
checks based on absolute or relative criterion or some simple combination of
the type considered in Section 6.7 are not sufficient for differential equations.
The solution of differential equations can exhibit a wide range of behaviour
from oscillatory to steeply increasing or decreasing functions. Further, for a
system of differential equations, the situation is even more complicated, since
each component may behave differently. Apart from this, the behaviour could
be different in different intervals of time. A simple relative criterion is impos-
sible to satisfy near a zero of an oscillatory function. If the amplitude of
oscillatory solution is constant, and all components are of the same order of
magnitude, then it may be more meaningful to use an absolute criterion. How-
ever, behaviour of the solution is generally not known in advance. One artifact
which is used to overcome this problem is to keep an array of scale factors, one for each component, which contains

the scale of each component. For example, we can take the scale of the ith component to be the maximum
magnitude of that component so far encountered. The convergence criterion
is then assumed to be satisfied if the change is less than ε times this scale in magnitude,
where ε is the requested accuracy. This criterion can take care of the situation
where different components of the solution vary over several orders of magnitude.
This criterion can also take care of oscillatory solution with increasing ampli-
tude. But if the solution is decreasing in magnitude, it will not be effective at
large times, since any reasonable (or even an unreasonably large) change in the
solution can pass the convergence test. For example, with Y = e- 5t cos t over
[0,5] the maximum value is 1, but y(4) ~ 10- 9 and if E = 10- 7 , a change which
is 100 times the actual value will pass the convergence test. In fact, given any
convergence test, we can find a function or a set of functions for which the test
is either too stringent or ineffective.
Hence, designing a general purpose convergence test is rather difficult and
all routines for automatic integration use some assumptions about the solution.
In all subroutines given in Appendix B, we have used a simple relative criterion $|\delta y| < \epsilon(|y_j| + |y_{j+1} - y_j| + \eta)$. Here $\delta y$ is the change in $y$ during the last iteration, $y_j$ and $y_{j+1}$ are the computed values of $y$ at the two most recent points and $\eta$ ($> 0$) is some suitably chosen small number, which is introduced to take care of a situation where $y_j = y_{j+1} = 0$. This criterion is somewhat stringent near the zeros of an oscillatory function. If the solution has a multiple zero, the criterion is even more stringent, since the change $y_{j+1} - y_j$ is also very small. Further, for a system of equations, if some component is very small (and its exact value is not important for other components in the solution) in a certain interval of $t$, then it may be impossible to satisfy such a criterion.
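As an illustration, the following is a minimal Python sketch of such a test applied to successive corrector iterates; the function name and arguments are ours, not those of the Appendix B subroutines.

import numpy as np

def corrector_converged(y_new, y_old, y_prev, eps, eta=1e-30):
    # Test |dy| < eps*(|y_j| + |y_{j+1} - y_j| + eta) componentwise, where
    # y_new and y_old are successive corrector iterates for y_{j+1} and
    # y_prev is the accepted value y_j.
    dy = np.abs(y_new - y_old)
    scale = np.abs(y_prev) + np.abs(y_new - y_prev) + eta
    return np.all(dy < eps * scale)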
If the required accuracy is very high, it may not be possible to achieve that because of roundoff error. In that case, the step size will keep reducing until either the lower limit (if any) on step size is violated, or until it becomes so small that computationally $y_{j+1} = y_j$. The latter can happen if the term to be added to $y_j$ at the required step is smaller than $|y_j|$ times the machine accuracy. In this case, the accuracy requirement is apparently satisfied and the integration proceeds further with a very small step size and the value of $y$ remains constant. For a multistep method (not of Adams type), exact equality of $y_j$ and $y_{j+1}$ may not be achieved, even for arbitrarily small values of $h$. However, for Adams type methods, which are commonly used, if the term coming from the derivatives is very small, it will ultimately lead to $y_{j+1} = y_j$ as the step length $h$ becomes sufficiently small. Similarly, if the step size $h$ is allowed to become smaller than $|t_j|$ times the machine accuracy, then the computed value of $t$ remains constant and the process may get
into a nonterminating loop, since the end point of integration will never be
reached. A nonterminating loop can also arise if the sign of h is not correct to
start with, since in that case, the integration proceeds in the wrong direction.
Near a singularity in the differential equation it may not be possible to achieve any reasonable accuracy and the integration may get aborted. It is best to start the solution a little away from the singularity and determine the solution near the singularity using some other technique, like a Taylor series expansion.

12.4 Runge-Kutta Methods


Runge-Kutta methods can be used to generate not only the starting values, but
the entire solution. These are single-step methods in the sense that informa-
tion at only one previous step is required. In this case, higher order accuracy
is achieved by considering intermediate points, which may not necessarily be
on the path defining the solution of the differential equation, i.e., the function
f(t, y) is evaluated at points such that y(t) is not the computed solution of
the differential equation. In this case, since the information at previous steps
is not used, we cannot expect these methods to be as efficient as the predictor-
corrector methods considered in the previous section. However, if the deriva-
tives f(t, y) are simple functions, that can be evaluated without much effort,
then these methods may be comparable in efficiency to the predictor-corrector
methods, which require considerable amount of bookkeeping to achieve opti-
mum efficiency. The most important reason for their popularity is the ease with
which they can be implemented. These methods are explicit and do not require
any iteration and further they do not require any special starting values. Hence,
the step size can be easily adjusted at every step. However, the local truncation
error can only be estimated by performing some extra computation.
To illustrate the basic principles of Runge-Kutta methods, we consider the following second-order method:
$$y_{n+1} = y_n + \frac{h_n}{2}\Bigl(f(t_n, y_n) + f(t_{n+1},\, y_n + h_n f_n)\Bigr),$$
where $h_n = t_{n+1} - t_n$ is the step size for the $n$th step. This formula is similar to the trapezoidal rule, but it may be noted that the second point at which the function $f$ is evaluated is an approximation to $y(t_{n+1})$ using Euler's method. This intermediate point has been selected only to achieve higher order accuracy and may have no other relation with the calculated solution. In particular, $y_n + h_n f_n$ is not the computed value of $y_{n+1}$. Consequently, this formula is not identical to the numerical integration formula based on the trapezoidal rule. Unlike the numerical integration formula, here no iteration is required to determine $y_{n+1}$. The stability properties of the two methods are also different {38}. In fact, this formula can be regarded as a predictor-corrector formula using Euler's method as the predictor and the trapezoidal rule as the corrector, where only one iteration is performed on the corrector.
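A minimal Python sketch of one step of this scheme, assuming a right-hand side f(t, y); this is our own illustration, not one of the Appendix B subroutines.

def rk2_step(f, t, y, h):
    # One step of the second-order method above: Euler predictor followed by a
    # single application of the trapezoidal-rule corrector.
    f0 = f(t, y)
    y_pred = y + h * f0                      # Euler predictor for y(t+h)
    return y + 0.5 * h * (f0 + f(t + h, y_pred))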
In Runge-Kutta methods, the value of $y$ at the next step is computed using a formula of the form
$$y_{n+1} = y_n + \sum_{i=1}^{m} w_i k_i, \qquad (12.60)$$
where the $w_i$'s are constants and
$$k_i = h_n f\Bigl(t_n + \alpha_i h_n,\; y_n + \sum_{j=1}^{i-1}\beta_{ij} k_j\Bigr). \qquad (12.61)$$

Here $\alpha_i$ and $\beta_{ij}$ are constants with $\alpha_1 = 0$. These constants are selected to
achieve the required degree of accuracy. For this purpose, we expand both sides
using Taylor series about (tn, Yn) and match terms up to the required order.
Derivation of these formulae is fairly involved and here we only illustrate the
procedure by deriving second-order formulae.
We can write
$$y_{n+1} - y_n = h_n f + \frac{h_n^2}{2}\bigl(f_t + f f_y\bigr) + O(h_n^3), \qquad (12.62)$$
where $f = f(t_n, y_n)$ and $f_t$ and $f_y$ are the partial derivatives evaluated at the same point. For a second-order formula, we need to match only these three terms and hence $m = 2$ in (12.60). Thus, this formula can be written as
$$\begin{aligned}
y_{n+1} - y_n &= h_n\bigl(w_1 f(t_n, y_n) + w_2 f(t_n + \alpha_2 h_n,\; y_n + \beta_{21} h_n f)\bigr)\\
&= h_n(w_1 f + w_2 f) + h_n^2 w_2\bigl(\alpha_2 f_t + \beta_{21} f f_y\bigr) + \cdots
\end{aligned} \qquad (12.63)$$
Second-order accuracy can be achieved if these two expansions agree up to terms involving $h_n^2$. The required equations to determine the coefficients $w_1$, $w_2$, $\alpha_2$ and $\beta_{21}$ can be obtained by comparing the coefficients of $f$, $f_t$ and $f f_y$ in (12.62) and (12.63), which yields
$$w_1 + w_2 = 1, \qquad w_2\alpha_2 = \frac{1}{2}, \qquad w_2\beta_{21} = \frac{1}{2}. \qquad (12.64)$$
These equations provide three nonlinear equations in four unknowns, which can be solved for $w_1$, $w_2$ and $\beta_{21}$ in terms of $\alpha_2$.
Three second-order methods of interest correspond to $\alpha_2 = \frac{1}{2}, \frac{2}{3}, 1$, which give the formulae
$$\begin{aligned}
y_{n+1} &= y_n + h_n f\bigl(t_n + \tfrac{1}{2}h_n,\; y_n + \tfrac{1}{2}h_n f_n\bigr),\\
y_{n+1} &= y_n + \tfrac{1}{4}h_n\Bigl(f(t_n, y_n) + 3 f\bigl(t_n + \tfrac{2}{3}h_n,\; y_n + \tfrac{2}{3}h_n f_n\bigr)\Bigr),\\
y_{n+1} &= y_n + \tfrac{1}{2}h_n\Bigl(f(t_n, y_n) + f(t_n + h_n,\; y_n + h_n f_n)\Bigr).
\end{aligned} \qquad (12.65)$$
The first of these formulae is similar to the midpoint rule, while the last one is similar to the trapezoidal rule. The second formula is obtained by a choice of $\alpha_2$ for which the truncation error is minimised in some sense, as defined in Ralston and Rabinowitz (2001).
The local truncation error can be estimated by considering the terms of $O(h^3)$ in (12.62) and (12.63). However, it is obvious that this involves several terms which cannot be expressed in a simple form (e.g., $c\,h^{r+1} y^{(r+1)}(\xi)$). As a result, we will not try to find an expression for the truncation error. The

interested readers can refer to Ralston and Rabinowitz (2001). Nevertheless, it


is clear that for a Runge-Kutta formula of order r, the leading term in the error
expansion will be O(hr+1).
By including higher order terms in the expansion, it is possible to obtain higher order formulae. However, it is quite obvious that the algebra gets messy. One of the third-order formulae is
$$\begin{aligned}
y_{n+1} - y_n &= \tfrac{1}{6}(k_1 + 4k_2 + k_3),\\
k_1 &= h_n f(t_n, y_n),\\
k_2 &= h_n f\bigl(t_n + \tfrac{1}{2}h_n,\; y_n + \tfrac{1}{2}k_1\bigr),\\
k_3 &= h_n f\bigl(t_n + h_n,\; y_n - k_1 + 2k_2\bigr),
\end{aligned} \qquad (12.66)$$
which is similar to Simpson's rule. However, this method has third-order
accuracy and is (relatively) stable. In fact, all single-step methods are relatively
stable, since there are no parasitic solutions as the corresponding difference
equation is also of first order. Similarly, a fourth-order formula is given by
$$\begin{aligned}
y_{n+1} - y_n &= \tfrac{1}{6}(k_1 + 2k_2 + 2k_3 + k_4),\\
k_1 &= h_n f(t_n, y_n),\\
k_2 &= h_n f\bigl(t_n + \tfrac{1}{2}h_n,\; y_n + \tfrac{1}{2}k_1\bigr),\\
k_3 &= h_n f\bigl(t_n + \tfrac{1}{2}h_n,\; y_n + \tfrac{1}{2}k_2\bigr),\\
k_4 &= h_n f\bigl(t_n + h_n,\; y_n + k_3\bigr).
\end{aligned} \qquad (12.67)$$
This is the classical Runge-Kutta formula, which is very widely used. This method can also be considered as an extension of Simpson's rule to the case where the derivative is a function of both $t$ and $y$. For most practical problems a fourth-order method is fairly effective. Apart from this, there are several other fourth-order formulae like
$$\begin{aligned}
y_{n+1} - y_n &= \tfrac{1}{6}\Bigl(k_1 + (2-\sqrt{2})\,k_2 + (2+\sqrt{2})\,k_3 + k_4\Bigr),\\
k_1 &= h_n f(t_n, y_n),\\
k_2 &= h_n f\bigl(t_n + \tfrac{1}{2}h_n,\; y_n + \tfrac{1}{2}k_1\bigr),\\
k_3 &= h_n f\Bigl(t_n + \tfrac{1}{2}h_n,\; y_n + \tfrac{\sqrt{2}-1}{2}k_1 + \tfrac{2-\sqrt{2}}{2}k_2\Bigr),\\
k_4 &= h_n f\Bigl(t_n + h_n,\; y_n - \tfrac{\sqrt{2}}{2}k_2 + \bigl(1 + \tfrac{\sqrt{2}}{2}\bigr)k_3\Bigr),
\end{aligned} \qquad (12.68)$$
which is known as the Runge-Kutta-Gill formula.
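As a concrete illustration of (12.67), here is a minimal Python sketch of one classical Runge-Kutta step; the function name is ours.

def rk4_step(f, t, y, h):
    # One step of the classical fourth-order Runge-Kutta formula (12.67).
    k1 = h * f(t, y)
    k2 = h * f(t + 0.5 * h, y + 0.5 * k1)
    k3 = h * f(t + 0.5 * h, y + 0.5 * k2)
    k4 = h * f(t + h, y + k3)
    return y + (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0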



It can be seen that for $r \le 4$, a Runge-Kutta method of order $r$ requires $r$ function evaluations per step. It turns out that for $r > 4$ the number of function evaluations required is larger than $r$. This number should be com-
pared with (generally) two function evaluations for predictor-corrector method
of any order. Hence, predictor-corrector methods are generally more efficient
than Runge-Kutta methods. However, Runge-Kutta methods are easier to use,
since they use the value of the solution at only one previous point. Conse-
quently, no special technique is required to start the integration and the same
formula is applicable at all points. In fact, Runge-Kutta methods are often used
to start the integration for a predictor-corrector method. Further, depending
on the accuracy requirements, the step size can also be freely changed without
any special effort. However, unfortunately, there is no simple technique to esti-
mate the local truncation error and unless extra computations are performed,
it is not possible to decide what step size is optimal for the required accuracy.
Thus, in order to use the facility of changing the step size freely, we have to
put in extra effort in computing the solution. On the other hand, in predictor-
corrector methods a reliable estimate of truncation error is available without
any extra effort, but it is cumbersome to change the step size freely.
In practical implementations, the simplest technique for estimating the local truncation error is to repeat the calculations using a different step size. This technique can be applied at every step, if the calculations are repeated with two steps of half size. Let $y_{n+1}^{(1)}$ and $y_{n+1}^{(2)}$ be the calculated values of $y_{n+1}$ using a step size of $h_n$ and $h_n/2$, respectively. Then the local truncation errors can be expected to be $Kh_n^{r+1}$ and $Kh_n^{r+1}/2^r$, respectively. Note that, since two steps are performed using $h_n/2$, the per-step error should be multiplied by two. Here the constant $K$ can be expected to be roughly independent of $h$, at least for small enough values of $h$. Using this assumption we see that
$$y_{n+1}^{(2)} - y_{n+1}^{(1)} \approx K h_n^{r+1}\left(1 - \frac{1}{2^r}\right), \qquad (12.69)$$
which gives an estimate for the constant $K$. Using this value of $K$, the estimated truncation error in $y_{n+1}^{(2)}$ is expected to be $E_n \approx (y_{n+1}^{(2)} - y_{n+1}^{(1)})/(2^r - 1)$. Further, we expect the error to scale as $h^{r+1}$. Hence, using this estimate and the required accuracy, it is possible to estimate the optimal step size. For example, if the requested accuracy is $\epsilon$, then the optimal step size is
$$h_{\rm opt} = h_n\left(\frac{\epsilon}{|E_n|}\right)^{1/(r+1)}. \qquad (12.70)$$
Usually, in order to control the accumulated truncation error, the required accuracy at each step is assumed to be proportional to $h_n$, in which case the exponent in the above expression should be $1/r$ instead of $1/(r+1)$. In practice, to allow for some uncertainties in the estimate of errors and some variation from step to step, $h$ should be chosen to be slightly smaller than the optimal value given by (12.70). Hence, it is preferable to multiply the above value by a safety factor $\gamma < 1$.
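The following is a sketch of the step-doubling error estimate and the resulting step-size choice for a scalar equation, using the rk4_step function sketched above; the safety factor 0.9 is an illustrative value of $\gamma$, not the value used in subroutine RKM.

def rk4_double_step(f, t, y, h, r=4):
    # Advance from t to t+h once with step h and once with two steps of h/2,
    # and estimate the truncation error of the latter as in (12.69).
    y1 = rk4_step(f, t, y, h)
    y_half = rk4_step(f, t, y, 0.5 * h)
    y2 = rk4_step(f, t + 0.5 * h, y_half, 0.5 * h)
    err = (y2 - y1) / (2**r - 1)
    return y2, err

def next_step_size(h, err, eps, r=4, safety=0.9):
    # Optimal step size from (12.70), reduced by a safety factor gamma < 1;
    # use exponent 1/r instead if the per-step tolerance is proportional to h.
    return safety * h * (eps / max(abs(err), 1e-300)) ** (1.0 / (r + 1))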

This technique is implemented in subroutine RKM in Appendix B, which


can be used with a second or fourth-order Runge-Kutta method. Since the
point (tn, Yn) is common to both the step sizes h n and hn/2, we can save one
function evaluation, if this value is preserved. Thus, for a fourth-order method,
this technique requires 11 function evaluations at every step. However, since
it achieves the accuracy corresponding to a step size hn/2, only three extra
function evaluations are required to estimate the accuracy, which gives a relative
overhead of 3/8 for estimating the truncation error.
It may be noted that it is possible to add the estimated truncation error to the calculated value to get fifth-order accuracy. However, in that case, we
have no means of estimating the error in the corrected value. In most problems,
we can expect that the error in the corrected value is smaller than that in
the uncorrected value. However, it is not necessarily true in all cases, since
higher order method does not always give lower truncation error at finite step
size. In most practical situations, since the step size is chosen to be small
enough to achieve some reasonable accuracy, the higher order method should
give higher accuracy. Adding the correction also changes the stability properties
of the method. However, in this case, it turns out that the stability is improved
slightly {29} and in fact, the modified fourth-order method is absolutely stable for $-3.22 < h\lambda \le 0$, where $h$ is the step size of the smaller step.
Another technique for estimating the truncation error in Runge-Kutta methods is to choose the constants $w_i$, $\alpha_i$ and $\beta_{ij}$ in such a manner that two different estimates are available. This technique usually involves one extra function evaluation. For example, consider the formula
$$\begin{aligned}
\eta_0 &= y_n, & k_0 &= h_n f(t_n, y_n),\\
\eta_1 &= \eta_0 + \tfrac{1}{3}k_0, & k_1 &= h_n f\bigl(t_n + \tfrac{1}{3}h_n, \eta_1\bigr),\\
\eta_2 &= \eta_0 + \tfrac{1}{6}(k_0 + k_1), & k_2 &= h_n f\bigl(t_n + \tfrac{1}{3}h_n, \eta_2\bigr),\\
\eta_3 &= \eta_0 + \tfrac{1}{8}(k_0 + 3k_2), & k_3 &= h_n f\bigl(t_n + \tfrac{1}{2}h_n, \eta_3\bigr),\\
\eta_4 &= \eta_0 + \tfrac{1}{2}(k_0 - 3k_2 + 4k_3), & k_4 &= h_n f(t_n + h_n, \eta_4),\\
y_{n+1} &= \eta_5 = \eta_0 + \tfrac{1}{6}(k_0 + 4k_3 + k_4).
\end{aligned} \qquad (12.71)$$

This formula, due to Merson, is referred to as the Runge-Kutta-Merson method. It can be verified that both $\eta_5$ and $\eta_4$ give an estimate for $y(t_{n+1})$ and the truncation errors are $O(h^5)$ and $O(h^4)$, respectively. Further, it can be shown that if the function $f(y,t)$ is linear in both $y$ and $t$, then the truncation error in $\eta_4$ is also $O(h^5)$. In this case, Merson has shown that the truncation errors in $\eta_4$ and $\eta_5$ are, respectively, $\frac{1}{120}h^5 y^{(5)}(\xi)$ and $\frac{1}{720}h^5 y^{(5)}(\xi)$. Hence, the local truncation error at any step can be estimated by considering the difference $\eta_4 - \eta_5$. However, in general, the coefficients of $h^5$ or $h^4$ in the error term are uncertain and cannot be estimated by just considering the difference $\eta_4 - \eta_5$.

Thus, it is not possible to get an accurate estimate of the truncation error in $\eta_5$. It can be seen that this process requires five function evaluations per step. Thus, two steps will require 10 function evaluations against 11 for the step doubling process considered earlier. For a general differential equation, the step size is determined by the error in the third-order method ($\eta_4$). As a result, a subroutine with adaptive control of step size will choose a smaller step size as compared to what is chosen by the simple fourth-order Runge-Kutta method discussed earlier. Thus, the Runge-Kutta method with step doubling is more efficient. It is possible to obtain a Runge-Kutta-Fehlberg formula using six function evaluations, which provides estimates with fourth and fifth order accuracy (Rice, 1992). For this formula the step size can be adjusted according to the truncation error in a fourth-order formula, but it requires 12 function evaluations for a double step. Hence, it appears that these formulae do not have any obvious advantage over the simple technique for error estimation outlined earlier.
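A minimal sketch of one Runge-Kutta-Merson step (12.71), returning the solution together with the difference $\eta_4 - \eta_5$ that serves as a rough error indicator (our own illustration, not an Appendix B routine):

def rk_merson_step(f, t, y, h):
    # One Runge-Kutta-Merson step (12.71); returns y_{n+1} (= eta_5) and the
    # difference eta_4 - eta_5 used as a rough error indicator.
    k0 = h * f(t, y)
    k1 = h * f(t + h / 3.0, y + k0 / 3.0)
    k2 = h * f(t + h / 3.0, y + (k0 + k1) / 6.0)
    k3 = h * f(t + h / 2.0, y + (k0 + 3.0 * k2) / 8.0)
    eta4 = y + 0.5 * (k0 - 3.0 * k2 + 4.0 * k3)
    k4 = h * f(t + h, eta4)
    eta5 = y + (k0 + 4.0 * k3 + k4) / 6.0
    return eta5, eta4 - eta5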
EXAMPLE 12.5: Analyse the stability of the fourth-order Runge-Kutta method.
Since Runge-Kutta methods are single-step methods, they are always stable in the relative sense, as the corresponding difference equation is also first order and has only one independent solution. Hence, we should only worry about the absolute stability, which requires that the computed solution of the differential equation tends to zero as $t \to \infty$. Once again considering the linear equation $y' = \lambda y$, the fourth-order Runge-Kutta method gives
$$\begin{aligned}
k_1 &= h\lambda y_n, \qquad k_2 = h\lambda y_n\bigl(1 + \tfrac{1}{2}h\lambda\bigr), \qquad k_3 = h\lambda y_n\bigl(1 + \tfrac{1}{2}h\lambda + \tfrac{1}{4}h^2\lambda^2\bigr),\\
k_4 &= h\lambda y_n\bigl(1 + h\lambda + \tfrac{1}{2}h^2\lambda^2 + \tfrac{1}{4}h^3\lambda^3\bigr),\\
y_{n+1} &= \Bigl(1 + h\lambda + \tfrac{1}{2}h^2\lambda^2 + \tfrac{1}{6}h^3\lambda^3 + \tfrac{1}{24}h^4\lambda^4\Bigr)y_n = \alpha y_n.
\end{aligned} \qquad (12.72)$$
In fact, it can be shown that all fourth-order Runge-Kutta methods using four function evaluations give the same difference equation. This result follows from the fact that the terms up to $h^4$ have to be the same to ensure fourth-order accuracy, while higher terms will vanish if the method uses only four function evaluations per step. From the difference equation, it is clear that the local truncation error is $O(h^5)$. For $h\lambda > 0$ the true solution of the differential equation is increasing with time and we cannot expect the method to be absolutely stable. Thus, we need to consider only negative values of $h\lambda$. The method is absolutely stable when $|\alpha| \le 1$, which gives the stability region $(-2.78, 0)$. Similarly, it can be shown that third-order methods using three function evaluations per step have a stability region $(-2.51, 0)$, while for the second-order methods it is $(-2, 0)$.
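A quick numerical check of this stability boundary (a sketch for illustration, not part of the book's routines):

import numpy as np

def rk4_alpha(z):
    # Amplification factor alpha(h*lambda) of the classical RK4 method, (12.72).
    return 1.0 + z + z**2 / 2.0 + z**3 / 6.0 + z**4 / 24.0

z = np.linspace(-3.0, 0.0, 300001)
stable = np.abs(rk4_alpha(z)) <= 1.0
print("stability boundary near h*lambda =", z[np.argmax(stable)])
# prints a value close to -2.785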

EXAMPLE 12.6: Solve the initial value problem in Example 12.4 using the fourth-order Runge-Kutta method.
We can use subroutine RKM in Appendix B to perform the integration. This subroutine adjusts the step size to achieve the required accuracy, which was specified to be $10^{-6}$. It is found that, even though the fourth-order method should be exact (irrespective of the step size), for this problem the step size does not increase arbitrarily as the integration proceeds and in fact, it is more or less constant (except close to $t = 0$). Initially the step size is very small and it increases gradually as the integration proceeds. This is due to the fact that the starting point is a triple zero of the exact solution and as a result the convergence criterion is very stringent close to $t = 0$. Consequently, the subroutine indicates failure when $|\lambda| > 20$, as the required step size goes beyond the permissible limit. The difficulty will be even more serious if the solution has a zero of higher multiplicity. This problem can of course

Table 12.4: Solving $y' = \lambda(y - t^3) + 3t^2$, $y(0) = 0$ using Runge-Kutta method

          λ = -25                  λ = -27                  λ = -28                  λ = -29
  t       y          Rel. error    y          Rel. error    y          Rel. error    y        Rel. error

  1       1.021      2.1 x 10^-2   1.04       4.3 x 10^-2   1.07       0.07          1.1      1.3 x 10^-1
  2       8.048      6.1 x 10^-3   8.13       1.6 x 10^-2   8.31       0.04          9.2      1.5 x 10^-1
  4      64.104      1.6 x 10^-3   64.33      5.1 x 10^-3   65.47      0.02          107.4    6.8 x 10^-1
  6     216.159      7.4 x 10^-4   216.53     2.4 x 10^-3   219.97     0.02          1569.7   6.3 x 10^0
  8     512.214      4.2 x 10^-4   512.73     1.4 x 10^-3   520.54     0.02          42387.1  8.2 x 10^1
 10    1000.271      2.7 x 10^-4   1000.93    9.3 x 10^-4   1016.34    0.02                   1.3 x 10^3
 20    8000.594      6.8 x 10^-5   8001.99    2.4 x 10^-4   8216.10    0.03                   4.6 x 10^9
 30   27001.030      3.0 x 10^-5   27003.15   1.1 x 10^-4   29095.86   0.08                   3.8 x 10^16
 40   64000.920      1.7 x 10^-5   64003.80   6.2 x 10^-5   83370.52   0.30                   4.6 x 10^23
 50  125000.000      1.1 x 10^-5   125003.60  4.0 x 10^-5   302728.80  1.40                   6.6 x 10^30

be avoided by modifying the convergence criterion suitably. But that may lead to difficulty in some other problems. The results for $\lambda = 3$ are similar to those in Example 12.4 and the cause of instability is the same. For $\lambda = -3$ and $-10$, there is no difficulty and it is possible to obtain an accurate solution over a large range of $t$. For $\lambda = 3, -3, -10$ it requires 210, 195, 437 steps to integrate the equation up to $t = 10$. Here each step requires 11 function evaluations. To compare the efficiency of Runge-Kutta and predictor-corrector methods, we have tried the same equation using subroutine MSTEP, which implements the fourth-order Adams method with adaptive step size control. This subroutine requires 928 and 266 function evaluations for $\lambda = 3$ and $-3$, respectively, which is much less than what is required by the Runge-Kutta method for the same accuracy. This comparison may not give typical values, since in this case the true solution is such that there is no truncation error.
To illustrate the effect of stability, we perform calculations for $\lambda = -25, -27, -28, -29$ with a fixed step size of $h = 0.1$. As mentioned earlier, the routine with adaptive step size control fails for these cases, as the step size becomes too small. For these values of $\lambda$, the exponential term in the general solution decreases very fast and we do not expect it to interfere with the true solution. The results are shown in Table 12.4. It can be seen that for $\lambda = -25$ the computed values are fairly accurate and in fact, the relative accuracy actually improves with $t$. For $\lambda = -27$, there is a considerable error in the initial phase, but the relative error decreases significantly as the integration proceeds. This behaviour can be understood by the fact that for this case $h\lambda = -2.7$, which is very close to the stability limit of the Runge-Kutta method. For this choice $\alpha \approx 0.88$ and the solution corresponding to the exponential term will get slowly damped out. To start with, since the true solution is zero, any small error introduced in the first step can have a significant component of this solution. This is particularly true, because for such values of $\lambda$ the method cannot be expected to be accurate. Hence, initially the relative error is fairly large. However, since the true solution is increasing with time, ultimately the extraneous term falls off. Thus, after 100 steps the extraneous solution should decrease by a factor of $\alpha^{100} \approx 3\times 10^{-6}$, which does not agree with the accuracy observed in Table 12.4, probably due to the fact that a significant error may be introduced at every step.
For larger values of $-\lambda$, the method is unstable, but for $\lambda = -28$ the error initially decreases. For this value of $\lambda$, $\alpha = 1.022$ and the extraneous solution increases very slowly. Initially the growth rate of the true solution is larger than $\alpha$ and the relative error decreases. When $t > 3h/(\alpha - 1) \approx 14$, the growth rate of the true solution is lower than that of the extraneous solution. As a result, for larger $t$ we can expect the relative error to increase, which is consistent with the results in Table 12.4. For $\lambda = -29$, $\alpha = 1.187$ and the extraneous solution increases

very rapidly. In 10 steps this solution will increase by a factor of $\approx 6$, but since the true solution is also increasing, the increase in relative error is somewhat slower. For larger values of $-\lambda$, the situation is much worse and the error will become very large in a few steps. Thus, even though the exponential term in the general solution of the differential equation becomes negligible (even if there is some error in the initial values), the numerical solution is exponentially increasing. This is clearly due to the fact that for such values of $\lambda$ the step size is too large and we cannot hope to approximate the solution to any accuracy. On the other hand, if the numerical integration method is absolutely stable, then even though the solution is not represented accurately, the extraneous solution also decreases with time and the error may not be significant. Thus, absolute stability is important in such problems. In fact, this problem can be classified as stiff and we will return to the discussion of such problems in Section 12.6.
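The following sketch reproduces the kind of fixed-step computation behind Table 12.4, reusing the rk4_step function sketched earlier; the exact numbers obtained may differ somewhat from the table, since the parasitic component is seeded by the errors of the first few steps.

lam, h = -28.0, 0.1
f = lambda t, y: lam * (y - t**3) + 3.0 * t**2
t, y = 0.0, 0.0
for n in range(500):                       # 500 steps of h = 0.1, i.e. up to t = 50
    y = rk4_step(f, t, y, h)
    t += h
print(y, abs(y - t**3) / t**3)             # compare with the lambda = -28 column of Table 12.4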

We have described Runge-Kutta methods for a single differential equation.


This method can be easily generalised to a system of first-order differential equations. Thus, if we have a system of differential equations
$$\frac{dy^i}{dt} = f^i(t, y^1, \ldots, y^m), \qquad (i = 1, \ldots, m), \qquad (12.73)$$
where the superscripts identify different variables and should not be confused with exponents, then we can use the fourth-order Runge-Kutta method as follows:
$$\begin{aligned}
k_1^i &= h_n f^i\bigl(t_n,\; y_n^1, \ldots, y_n^m\bigr), & (i = 1, \ldots, m),\\
k_2^i &= h_n f^i\bigl(t_n + \tfrac{h_n}{2},\; y_n^1 + \tfrac{1}{2}k_1^1, \ldots, y_n^m + \tfrac{1}{2}k_1^m\bigr), & (i = 1, \ldots, m),\\
k_3^i &= h_n f^i\bigl(t_n + \tfrac{h_n}{2},\; y_n^1 + \tfrac{1}{2}k_2^1, \ldots, y_n^m + \tfrac{1}{2}k_2^m\bigr), & (i = 1, \ldots, m),\\
k_4^i &= h_n f^i\bigl(t_n + h_n,\; y_n^1 + k_3^1, \ldots, y_n^m + k_3^m\bigr), & (i = 1, \ldots, m),\\
y_{n+1}^i &= y_n^i + \tfrac{1}{6}\bigl(k_1^i + 2k_2^i + 2k_3^i + k_4^i\bigr), & (i = 1, \ldots, m).
\end{aligned} \qquad (12.74)$$
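In practice, the rk4_step function sketched earlier can be used unchanged for a system if $y$ and $f(t, y)$ are NumPy arrays; a small illustration on a test problem of our own choosing (the harmonic oscillator $y'' = -y$ written as a first-order system):

import numpy as np

def f(t, y):
    # Harmonic oscillator y'' = -y written as a first-order system.
    return np.array([y[1], -y[0]])

t, y, h = 0.0, np.array([1.0, 0.0]), 0.01
for n in range(628):                       # roughly one full period 2*pi
    y = rk4_step(f, t, y, h)
    t += h
print(y)                                   # close to the initial state [1, 0]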
In this section, we have only considered the explicit Runge-Kutta methods
for which the solution at the next step can be calculated directly in terms of the
known value at the previous step. It is possible to derive implicit Runge-Kutta
formulae {30} similar to the corrector formulae, which can only be solved it-
eratively. These methods are not used very often, because of the difficulty in
implementation and because the iterative process may require several function
evaluations. In general, it is difficult to justify the additional work for implicit Runge-Kutta methods by the increased accuracy attainable, and their use
is restricted to some special problems for which they have desirable stability
characteristics.

12.5 Extrapolation Methods


Following Sections 5.3 and 6.2, it is possible to use Richardson's $h \to 0$ extrapolation to achieve higher order accuracy in the solution of differential equations.

It can be shown that the truncation error in Euler's method can be expanded as a power series in $h$. Hence, the extrapolation process can be used to eliminate the terms one by one. However, since the error expansion contains all powers of $h$ higher than one, it is not very efficient to eliminate the terms. For example, if we use the sequence of step sizes $h$, $h/2$, $h/4$ and $h/8$, then Euler's method requires 12 function evaluations to achieve fourth-order accuracy (using $h \to 0$ extrapolation). It will be better if we can find a method for which the truncation error can be expanded in even powers of $h$, so that we can rapidly achieve higher order accuracy. It turns out that the modified midpoint method due to Gragg has this property. Consequently, it is widely used in extrapolation methods.
Starting from $t = t_0$, we can compute the midpoint rule approximations $\eta_n$ using $\eta_0 = y_0$ and
$$\eta_1 = y_0 + h f(t_0, y_0); \qquad \eta_{n+1} = \eta_{n-1} + 2h f(t_n, \eta_n), \qquad (n = 1, 2, \ldots, N-1). \qquad (12.75)$$
The computed value of $y$ at the final point $t_N = t_0 + Nh$ is then given by
$$y_N = \tfrac{1}{2}\bigl(\eta_N + \eta_{N-1} + h f(t_N, \eta_N)\bigr). \qquad (12.76)$$

Gragg has shown that the truncation error in $y_N$ can be expanded in even powers of $h$, provided that the number of steps $N$ is always even or always odd and $y(t)$ is sufficiently differentiable. Bulirsch and Stoer recommend using an all-even sequence starting with $h_0 = H/2$, where $H$ is the basic step size to be used. A new initial value problem is solved over the interval of size $H$ starting from $t = t_0$, using different step sizes $h_i$. The extrapolation procedure is applied to the results of the first interval to compute $y(t_0 + H)$ to the desired accuracy. Starting with this initial value, the extrapolation process is applied over the interval $[t_0 + H, t_0 + 2H]$ to compute $y(t_0 + 2H)$ and the process is repeated until the required range is covered. This is essentially a single-step method, in the sense that to compute $y(t_0 + jH)$ we only need the value at the previous point $y(t_0 + (j-1)H)$. Hence, if necessary, the basic step size $H$ can be varied at every step.
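A minimal sketch of the modified midpoint method (12.75)-(12.76) combined with polynomial extrapolation in $h^2$ over the first few step sizes of the sequence (12.77); this is our own simplified illustration, not subroutine EXTP.

import numpy as np

def modified_midpoint(f, t0, y0, H, N):
    # Gragg's modified midpoint method (12.75)-(12.76): N substeps of size
    # h = H/N over [t0, t0+H], followed by the final smoothing step.
    h = H / N
    eta_prev, eta = y0, y0 + h * f(t0, y0)
    for n in range(1, N):
        eta_prev, eta = eta, eta_prev + 2.0 * h * f(t0 + n * h, eta)
    return 0.5 * (eta + eta_prev + h * f(t0 + H, eta))

def extrapolated_step(f, t0, y0, H, seq=(2, 4, 6, 8, 12, 16)):
    # Polynomial (Richardson) extrapolation in h^2 of the midpoint results.
    hs2 = [(H / N) ** 2 for N in seq]
    T = [modified_midpoint(f, t0, y0, H, N) for N in seq]
    for k in range(1, len(seq)):
        for i in range(len(seq) - 1, k - 1, -1):
            T[i] = T[i] + (T[i] - T[i - 1]) / (hs2[i - k] / hs2[i] - 1.0)
    return T[-1]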
We can use the usual polynomial extrapolation to achieve higher order
accuracy as explained in Section 5.3. However, Bulirsch and Stoer recommend
the use of rational function extrapolation (Section 4.6) to achieve better re-
sults. Gear (1971) has compared the efficiencies of the polynomial and rational
function extrapolation to find that the rational function extrapolation provides
slightly better results. The advantage of extrapolation methods is that they automatically provide a variable-order, variable-step-size method. Here the order depends on the number of terms eliminated in the extrapolation process. In this
method, an initial guess for step size is used to perform the first step. If very
high order is required to achieve the specified accuracy, then the step size is
reduced appropriately. On the other hand, if the specified accuracy is achieved
within a few extrapolations, then the step size can be increased. Further, to

improve the efficiency of the extrapolation process, it is better to decrease the step size $h$ slowly, and the following sequence is found to be fairly effective
$$H/2,\; H/4,\; H/6,\; H/8,\; H/12,\; H/16,\; H/24,\; H/32,\; H/48,\; H/64,\; H/96,\; H/128, \ldots \qquad (12.77)$$
It is not clear what the optimum step size $H$ is for this method. Even
though the step size can be easily changed from step to step, it is rather dif-
ficult to decide how to change it. An estimate of the local truncation error is
easily obtained by considering the difference between the latest extrapolated
values. Thus, the extrapolation procedure can be continued until the required
accuracy is achieved, or until the prescribed maximum number of trials is made.
For example, if extrapolation fails to give a satisfactory result when the step size is reduced to $h = H/128$, then we can abandon the step and make a second attempt with a much smaller step size $H$. Unlike Runge-Kutta methods, here it is difficult to estimate the optimum step size. Hence, the step size can be reduced by some ad hoc factor in such circumstances. However, such failures
are quite rare. Because of the unsatisfactory manner in which the step size is
adjusted, an automatic routine incorporating this method may fail miserably,
if the differential equation is stiff or has some singularity. This failure arises
because at some stage the step size becomes too large and the process leads to
an overflow, particularly on machines with small exponent range. On the other
hand, a simple Runge-Kutta method with adaptive step size control may be
able to tackle such equations, although the efficiency may be low. A sophisti-
cated implementation of extrapolation method may be able to overcome such
difficulties.
If rational function extrapolation is used, then the denominator may van-
ish in some cases and the procedure may have to be abandoned. In that case, we
can use polynomial extrapolation, which does not suffer from this drawback.
A simple implementation of extrapolation method is provided by subroutine
EXTP in Appendix B, which can be used with either polynomial or rational
function extrapolation.
Deuflhard (1985) describes a technique for controlling the step size and
order for extrapolation method. Using this technique, it is possible to adjust
the step size effectively. With these modifications, extrapolation routines can
compete favourably with those based on Adams method. In terms of number of
function evaluations required to achieve a given accuracy, predictor-corrector
methods usually fare better, but these methods have heavy overheads because
of difficulty in varying the step size. As a result, on simple differential equations,
where the required functions can be evaluated easily, extrapolation method may
be more efficient. On the other hand, for problems where the function evalu-
ation requires a considerable effort, the multistep methods based on Adams
formula usually fare better. The main problem with extrapolation methods is their sensitivity to output requirements. If we require the solution at a large number of intermediate points, then the performance of extrapolation methods degrades significantly, because the basic step size used in this method is rather large.
Hence, if output is required at smaller intervals, then we are forced to use

smaller step size. Unlike predictor-corrector methods, here it is difficult to use interpolation to generate intermediate results. This difficulty arises because, with the widely separated points used in the method, we are unlikely to get any meaningful results from interpolation. It is basically due to the fact that the method internally uses a smaller step size to achieve the required accuracy, but the intermediate values are not available to sufficient accuracy. In general, the choice between various methods for nonstiff initial value problems is rather difficult, since the efficiency and reliability of methods depend very crucially on the implementation.
The modified midpoint method described in this section is not the only technique which gives an error expansion in terms of even powers of $h$. Some other methods are described in Deuflhard (1985), which include some that are applicable to stiff differential equations also. Any of these methods can be used with the extrapolation technique to yield higher order accuracy.

12.6 Stiff Differential Equations


When the digital computer came into existence, it was generally believed that
simple methods like Runge-Kutta methods or predictor-corrector methods with
appropriate control of step size should be able to solve all initial value problems.
However, discovery of the so-called stiff differential equations soon after the
computers were introduced belied this expectation. Stiff equations have proved
to be too important to be ignored and too expensive to overpower. They are
too important, because they occur rather frequently in physical problems. They
are too expensive to overpower, because of their size and the difficulty they
present to classical methods, no matter how great an improvement in computing
capability becomes available.
Stiff differential equations arise in a variety of applications, such as net-
work analysis, chemical or nuclear kinetics. The basic cause of stiffness is the
existence of vastly differing time scales in the problem. Time scale or time con-
stant is a term used by physicists and engineers to refer to the rate of change.
For example, if the solution is $ce^{\lambda t}$, then the time scale is $|1/\lambda|$. Thus, the time scale is the time over which the solution increases or decreases by a factor of $e$. For a general function, the time scale is defined to be $|f(t)/f'(t)|$. Now in
a system of differential equations, if different components have widely varying
time scales, then for a meaningful solution using any of the methods described
so far, we require a time step which is smaller than the smallest of these time
scales. This condition is generally very restrictive and it may not be possible
to integrate the equation over any reasonable time.
If some of the components are increasing with time, then ultimately the
one with the shortest time scale will dominate, while other components will
become irrelevant. If the time scales of the decreasing components are also larger than that of the dominant component, then the equation is not stiff, since
the solution is essentially determined by the shortest time scale and we have to
continue the integration with time steps of that order, no matter which method

is used. On the other hand, if there are some components which are decreasing
over a time scale much shorter than the dominant solution (irrespective of
whether the dominant solution itself is increasing or decreasing with time),
then the actual solution changes over a time scale which is much larger than
the shortest scale in the problem. Components with short time scales will soon
become insignificant, as compared to the dominant solution. Such equations
are termed stiff.
As seen in Examples 12.4 and 12.6, in order to ensure stability of the numerical integration method, we have to choose the step size such that $|h\lambda| \lesssim 1$, which implies that integration has to continue with time steps comparable to the shortest time scale in the problem, even though the corresponding solution is insignificant. For example, with the equation in Example 12.4, if $\lambda = -10000$, then the general solution is $t^3 + c\,e^{-10000t}$ and, irrespective of the initial conditions, at any reasonable time the second term is negligible. Even then the step size will need to be such that $10000h \lesssim 1$ or $h \lesssim 0.0001$ and approximately 10000 steps are required to integrate the equation, say between $t = 2$ and $t = 3$, even though the solution itself changes by only a factor of three. This is the generic problem with stiff differential equations. Even if we can afford to spend the computer time to integrate the equation with an extremely small time step, roundoff errors accumulated during these steps may invalidate the solution.
It is of course true that the solution using most methods discussed so far
does converge to the true solution in the limit $h \to 0$, but the problem here is that for any reasonable result the step size has to be intolerably small. Hence, it is desirable to have methods which do not restrict the step size under such circumstances. The crucial point here is that, when the component $ce^{\lambda t}$ is insignificant and continues to decay, it is not necessary to approximate $e^{\lambda t}$ accurately; it is only necessary to ensure that the approximation remains bounded, so that the corresponding term in the computed solution is not amplified. This requirement leads to the concept of A-stability. A method is said to be A-stable if all numerical approximations tend to zero as $j \to \infty$ when it is applied to the differential equation $y' = \lambda y$ with a fixed positive $h$ and a (complex) constant $\lambda$ with negative real part. It should be noted that the general case of complex $\lambda$ has to be considered, since for a system of differential equations it is possible to have a pair of complex conjugate exponents, even when all coefficients of the differential equations are real. Our stability analysis has so far been restricted to real values of $\lambda$, but readers can refer to Gear (1971), where the region of absolute stability for most of the useful numerical integration methods is given. Some of these results are obtained in Example 12.7.
For example, consider the backward Euler method $y_{j+1} = y_j + h y'_{j+1}$, which is an implicit method. For this method we can see that the relevant difference equation is $y_{j+1} = y_j/(1 - h\lambda)$. Hence, for $\mathrm{Re}(\lambda) < 0$ the solution is decreasing in magnitude and the method is absolutely stable for all negative values of $\lambda$. For such methods the step size need not be restricted because of stability requirements. Hence, for stiff equations this method may be more efficient than most of the methods considered so far. Similarly, it can be shown
that the implicit method based on the trapezoidal rule is also absolutely stable for all negative $\mathrm{Re}(\lambda)$.

Figure 12.3: Stiff stability in the $h\lambda$ plane.
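To see the practical difference, here is a small sketch (our own illustration, not from the Appendix B routines) comparing the explicit and backward Euler methods on $y' = \lambda y$ with a step size far outside the explicit stability region:

lam, h, nsteps = -10000.0, 0.01, 100       # h*lambda = -100, far outside (-2, 0)
y_exp, y_imp = 1.0, 1.0
for j in range(nsteps):
    y_exp = (1.0 + h * lam) * y_exp        # explicit Euler: amplified by |1 + h*lambda| = 99
    y_imp = y_imp / (1.0 - h * lam)        # backward Euler: damped by 1/(1 - h*lambda) = 1/101
print(y_exp, y_imp)                        # explicit grows to ~1e199, implicit decays towards zero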
Dahlquist has shown that a multistep method of order greater than two cannot be A-stable. Hence, the trapezoidal rule mentioned above is the highest order multistep A-stable method. This result is rather disappointing, since in order to ensure the required accuracy we would like to use a higher order method. However, the definition of A-stability given above is very restrictive, since in practice we do not need stability over the entire negative half of the complex plane. This condition is weakened by introducing the concept of stiff stability due to Gear. A method is said to be stiffly stable if, in the region $R_1$ ($\mathrm{Re}(h\lambda) \le D$), it is absolutely stable, and in $R_2$ ($D < \mathrm{Re}(h\lambda) < \beta$, $|\mathrm{Im}(h\lambda)| < \theta$) it is accurate.
The rationale for this definition is as follows: $e^{h\lambda}$ is the change in a component in one step due to an eigenvalue $\lambda$. If $h\lambda = u + iv$, then the change in magnitude of the solution is $e^u$. If $u < D < 0$, then we are satisfied if the component is reduced by a factor of at least $e^D$ in one step. In this case, we are not interested in the accuracy of components that are very small. Hence, for some suitable choice of $D$, we are prepared to ignore all components in the region $R_1$ (Figure 12.3). Around the origin we are interested in accuracy, for which relative or absolute stability is necessary. If $u > \beta > 0$, then the component is increasing by a factor of at least $e^{\beta}$ in one step. This factor must be limited to some reasonable value, in order to ensure that the step size is small enough to follow the change accurately. Hence, $h$ should be adjusted such that $u < \beta$. Similarly, if $|v| > \theta$, there are at least $\theta/2\pi$ complete cycles of oscillation in one step. Except in $R_1$, where we are not interested in decaying components, and for $u > \beta$, which is not used, we must follow the oscillating components accurately. As we know from Section 10.5, for a bandwidth limited signal at least two samples must be taken at the highest frequency present, in order to

represent the signal faithfully. In practice, about five times that number is required for numerical accuracy, and $\theta$ is generally less than $\pi/5$. Thus, in order to ensure accuracy, $h$ must be such that $u < \beta$ and $|v| < \theta$ (if $D < u < \beta$) for all components of the solution.
If $D = 0$ in the above definition, then it is the same as A-stability and we know that no multistep method of order greater than two exists. However, as argued above, for a reasonable method we should have $D < 0$. In that case, Gear has shown that there are stiffly stable methods of order $\le 6$ for some $D$, $\beta$ and $\theta$. In fact, it turns out that the methods based on backward differentiation formulae lead to stiffly stable methods. Thus, if the derivative at $t_{j+1}$ is estimated using the function values at $t_{j-k}$, $(k = -1, 0, \ldots, m-1)$, then we get a method of order $m$, and for $m \le 6$ it can be shown to be stiffly stable. For example, the fourth-order method follows from

$$y'_{j+1} = \frac{1}{12h}\left(25y_{j+1} - 48y_j + 36y_{j-1} - 16y_{j-2} + 3y_{j-3}\right), \qquad (12.78)$$
which gives the numerical integration formula
$$y_{j+1} = \frac{1}{25}\left(48y_j - 36y_{j-1} + 16y_{j-2} - 3y_{j-3}\right) + \frac{12h}{25}y'_{j+1}. \qquad (12.79)$$
This method is also referred to as Gear's method. The local truncation error in this formula is $-\frac{12}{125}h^5 y^{(5)}(\xi)$, which is much larger than that in the Adams method.
Nevertheless, on stiff differential equations, this method performs much better
than the Adams method. This is an implicit formula and the value of $y_{j+1}$ can only be calculated iteratively. Further, in this case, $|h\lambda|$ may be very large if some sharply decaying components in the region $R_1$ are present. Hence, the simple fixed-point iteration described in Section 12.3 will not usually converge for stiff differential equations. For such equations it may be necessary to use a more sophisticated iterative method, e.g., Newton's method or Broyden's method. The Jacobian matrix may be slowly varying with time and it may not be necessary to get a new approximation at every step. The initial approximation to $y_{j+1}$ can be computed using a predictor formula similar to that in the predictor-corrector methods. Some of the stiffly stable methods are as follows:

Yj+l ="31 (4Yj - Yj-l


) 2h
+ 3Yj+l
I 2 3 1I1( )
- gh Y ~,
_ ~ (_ ~ 4 (4)
Yj+l - 11 18Yj 9Yj_l + 2Yj_2 ) + 6h
11 Yj+l
I _
22h Y (0,
1 60h I
Yj+l = 137 (300Yj - 300Yj_l + 200Yj_2 - 75Yj_3 + 12Yj_4) + 137Yj+1
- 11307h6y(6l(~),
1
Yj+1 = 147 (360Yj - 450Yj_l + 400Yj_2 - 225Yj_3 + 72Yj_4 - lOYj_5)

60h I _ 20h 7 (7)((:).


+ 147 Yj+l 343 Y <,
(12.80)

These methods are of order 2, 3, 5 and 6, respectively.
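A minimal sketch of one step of the fourth-order formula (12.79) for a scalar equation, solving the implicit equation by Newton's method (subroutine GEAR instead uses Broyden's method for systems; the names and tolerances below are ours):

def gear4_step(f, dfdy, t_next, y_hist, h, tol=1e-10, maxit=20):
    # One step of the fourth-order stiffly stable formula (12.79) for a scalar
    # equation.  y_hist = [y_{j-3}, y_{j-2}, y_{j-1}, y_j]; dfdy is df/dy.
    yj3, yj2, yj1, yj = y_hist
    c = (48.0 * yj - 36.0 * yj1 + 16.0 * yj2 - 3.0 * yj3) / 25.0
    y = yj                                  # crude predictor: previous value
    for it in range(maxit):
        g = y - c - (12.0 * h / 25.0) * f(t_next, y)
        dy = -g / (1.0 - (12.0 * h / 25.0) * dfdy(t_next, y))
        y += dy
        if abs(dy) < tol * (abs(y) + tol):
            break
    return y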


For all the methods described above, the roots of the characteristic equation tend to zero as $|h\lambda| \to \infty$ and as a result these methods are absolutely stable for large $|h\lambda|$. In fact, the boundary of the stability region can be easily obtained by solving the characteristic equation for $h\lambda$ with the root $\alpha = e^{i\theta}$ (Example 12.7). Apart from absolute stability, we also require the methods to be relatively stable in the region $R_2$ (Figure 12.3).
Subroutine GEAR in Appendix B provides a simple implementation of
fourth-order stiffly stable method. It uses Broyden's method to solve the sys-
tem of nonlinear equations. This subroutine performs integration over one time
step and it can be used with the driver routine MSTEP, which implements the
adaptive step size control. For stiff differential equations, if the routine requires
halving the step size, then the formulae used for this purpose may magnify the
errors significantly and as a result the routine may fail. To avoid this prob-
lem, the subroutine uses Runge-Kutta method to restart the integration from
this point. This restarting is used when two successive halving of step size are
required, or if the estimated error at the step is too large to be corrected by
halving the step size. However, since Runge-Kutta method is not stiffly stable,
it will lead to a very small step size and the integration starts with small step
size. This step size keeps increasing as the integration proceeds until once again
the accuracy test fails at some point. If this failure occurs, then once again the
solution will be restarted with a small step size and the cycle may keep repeat-
ing. This situation is clearly unsatisfactory and can be remedied if a variable-order stiffly stable method is used, since in that case integration can be started using a first-order method. This problem is basically due to the fact that for
stiff differential equations, the truncation error in halving formula could be
very large for components with small time scale. In principle, such components
should be negligible and we do not expect any problem. However, the magni-
tude of the extraneous components is determined by the convergence criterion used in solving the implicit equation for $y_{j+1}$. If the solution is obtained with
sufficient accuracy, then this problem may not occur. On the other hand, if this
criterion is only slightly better than the required truncation error, then it is
possible that due to some magnification of errors in the step halving formula,
the routine may fail to achieve the specified accuracy. In subroutine GEAR,
the accuracy to which the corrector equations are solved can be controlled by the parameter CFAC in the PARAMETER statement. This subroutine accepts the value of $y_{j+1}$ when the change is less than CFAC × REPS. Thus, reducing
CFAC will cause it to obtain more accurate solution. It should be noted that,
this parameter does not affect the truncation error due to the difference ap-
proximation, but it controls the error due to accepting an inaccurate value for
the solution of the difference equation. Decreasing the value of CFAC increases
the number of iterations required by Broyden's method for solving the corrector
equations. However, since the convergence of Broyden's method is superlinear, the increase in the number of iterations may not be substantial. If the value of CFAC
is reduced to such an extent that it is not possible to achieve the specified


accuracy due to roundoff error, then Broyden's method may fail to converge. This failure is likely to occur when the required accuracy REPS is close to the machine accuracy.

Figure 12.4: Stability of the fourth-order Gear's method: The figure on the left shows the roots of the characteristic equation as a function of $h\lambda$. Where the roots are complex, the dot-dashed lines give the magnitude of the roots. The continuous line shows $e^{h\lambda}$. The figure on the right shows the region of absolute stability in the complex $h\lambda$ plane for the fourth-order Gear's method and the Adams-Moulton method. Gear's method is absolutely stable everywhere outside the big curve, while the Adams method is absolutely stable inside the smaller curve.
EXAMPLE 12.7: Find the stability region of the fourth-order Gear's method (12.79).
Considering the linear equation $y' = \lambda y$, we get the corresponding characteristic equation
$$\left(1 - \frac{12}{25}h\lambda\right)\alpha^4 - \frac{48}{25}\alpha^3 + \frac{36}{25}\alpha^2 - \frac{16}{25}\alpha + \frac{3}{25} = 0. \qquad (12.81)$$
Roots of this equation for real values of $h\lambda$ are shown in Figure 12.4. It can be seen that for $D < h\lambda < \beta$, where $D \approx -0.5$ and $\beta \approx 1.0$, the dominant root of this equation approximates $e^{h\lambda}$ fairly accurately. Hence, depending on the accuracy requirements, it can be considered to be the region in which the method is accurate. At $h\lambda = 25/12$, the dominant root is infinite and for higher values of $h\lambda$ it is negative. For $h\lambda < -0.5$, the roots are complex and we cannot hope to approximate the true solution of the differential equation accurately. However, the magnitude of all roots is less than one and absolute stability is assured.
To find the absolute stability region in the complex $h\lambda$ plane, we can solve the characteristic equation (12.81) for $h\lambda$, to get
$$h\lambda = \frac{1}{12\alpha^4}\left(25\alpha^4 - 48\alpha^3 + 36\alpha^2 - 16\alpha + 3\right). \qquad (12.82)$$
The boundary of the absolute stability region can be obtained by substituting $\alpha = e^{i\theta}$ in the above equation. Different values of $\theta$ in the range $(0, 2\pi)$ will trace the required curve. Since we know that as $|h\lambda| \to \infty$ all roots vanish, the method is absolutely stable everywhere outside the curve traced by the above equation. This curve is also shown in Figure 12.4. A similar procedure can be repeated for the fourth-order Adams-Moulton method and the corresponding curve is also shown in Figure 12.4. However, in this case, the method is absolutely stable for small negative values of $h\lambda$ and the stability region is inside the curve rather than outside. Thus, it can be seen that the absolute stability region of the fourth-order Adams method is essentially included in that of Gear's method. Further, Gear's method is stable over a much more extensive region including almost the entire negative half of the complex plane. The

stability region of Hamming's method can also be obtained in a similar manner. This region
essentially coincides with that for Adams method and is not shown in the figure.
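The boundary curve in Figure 12.4 can be traced numerically from (12.82); a short sketch of our own:

import numpy as np

theta = np.linspace(0.0, 2.0 * np.pi, 400)
a = np.exp(1j * theta)
# Boundary of the absolute stability region of (12.79), from (12.82) with alpha = e^{i*theta}
hl = (25.0 * a**4 - 48.0 * a**3 + 36.0 * a**2 - 16.0 * a + 3.0) / (12.0 * a**4)
print(hl.real.min(), hl.real.max())        # extent of the boundary curve along the real axis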

The stability of Gear's method for large values of $|h\lambda|$ in the positive half of the complex plane can cause problems in some cases. If the equations are stiff for some range of $t$, then the step size can become rather large. After that, if there is some transition in the characteristics of the equation and some component of the solution starts increasing rapidly, then $h\lambda$ will be large and positive. Absolute stability of Gear's method will ensure that the numerical solution decreases rapidly, introducing serious errors in the computed solution
{42}. Thus, in practical computations involving stiff equations, we have to
watch out for the possibility of the solution suddenly changing its nature. It
is possible to put some reasonable upper limit on the step size to avoid this
situation. However, this limit can be put only if we have some idea about the
time scales at which the solution is expected to evolve.
It may be noted that the implicit method based on the trapezoidal rule {38} is absolutely stable in the entire negative half of the complex plane. But as $h\lambda \to -\infty$, one root of the characteristic equation tends to $-1$. Hence, for large negative values of $h\lambda$, the extraneous solution decreases very slowly and the error will keep oscillating. This method does not have the problem mentioned above, since it is not absolutely stable in the positive half of the complex plane.
EXAMPLE 12.8: Solve the equation in Example 12.4 with $\lambda = -30, -32, -100$ using the fourth-order Gear's and Adams methods.
Since this equation is linear in $y$, (12.79) can be explicitly solved for $y_{j+1}$, provided the solution is known at four previous points. For simplicity we use the known exact solution to obtain the first four starting values. Using these values the solution can be easily continued to any value of $t$. Similarly, the Adams corrector can also be solved explicitly for $y_{j+1}$ and we do not need the predictor in this case. We use a fixed value of $h = 0.1$. For $\lambda = \pm 3$ the results are similar to those in Example 12.4. For $\lambda = -30$ both the Adams and Gear's methods give essentially exact results. For the Adams method with $h\lambda = -3$ the root with largest magnitude is $-1$ and the solution doesn't increase with time. Since the true solution is increasing slowly with time there is no difficulty.
For $h\lambda < -3$ the Adams method is expected to be unstable and this instability is clearly reflected in the results for $\lambda = -32, -100$. The root with the largest magnitude is $-1.738$ and $-1.043$ for $h\lambda = -10$ and $-3.2$, respectively. For $\lambda = -32$ the correct solution is increasing faster than the other solution for $t \le 7$ and hence initially the solution is almost exact, but ultimately the error builds up and the relative error in the solution is about $8\times 10^{-7}$, $2\times 10^{-5}$, $5\times 10^{-4}$, $2\times 10^{-2}$ respectively for $t = 20, 30, 40, 50$. For $\lambda = -100$ the situation is much worse and the error starts increasing rapidly for $t > 1$. At $t = 5$ the relative error already exceeds 12. On the other hand, Gear's method is absolutely stable for all negative values of $h\lambda$ and the truncation error is well contained. In fact, since the true solution is a polynomial, the step size can be increased arbitrarily without increasing the error appreciably. If subroutine GEAR with adaptive step size control is used to integrate this equation, then the step size keeps increasing steadily. If integration is started with $h = 0.1$, then at $t = 100$ the step size increases to 12.8, which clearly demonstrates the importance of absolute stability for stiff problems.
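A short sketch reproducing the kind of computation in this example for Gear's method: since the equation is linear in $y$, (12.79) is solved explicitly at each step (variable names are ours).

lam, h = -100.0, 0.1
exact = lambda t: t**3
y = [exact(j * h) for j in range(4)]       # starting values from the exact solution
for j in range(3, 500):
    t1 = (j + 1) * h
    rhs = ((48.0 * y[j] - 36.0 * y[j-1] + 16.0 * y[j-2] - 3.0 * y[j-3]) / 25.0
           + (12.0 * h / 25.0) * (-lam * t1**3 + 3.0 * t1**2))
    y.append(rhs / (1.0 - 12.0 * h * lam / 25.0))
print(abs(y[500] - exact(50.0)) / exact(50.0))   # error stays at roundoff level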
EXAMPLE 12.9: Solve the following system of differential equations due to Krogh (1973)
$$(z^i)' = -\beta_i z^i + (z^i)^2, \qquad z^i(0) = -1, \qquad i = 1, 2, 3, 4; \qquad (12.83)$$
where $\beta_i$ are nonzero constants. Here the superscript should not be confused with an exponent. Since this system of equations is decoupled, we obtain a coupled equation by defining a new

Table 12.5: Solving a stiff differential equation

         Adams               Runge-Kutta          Extrapolation        Gear (0.1)          Gear (0.0001)
  t      n       Rel. error  n       Rel. error   n        Rel. error  n      Rel. error   n      Rel. error

 10^-2   356     1.0 x 10^-8   781   6.0 x 10^-9     248   9.9 x 10^-8   929  2.2 x 10^-8  1077   2.2 x 10^-8
 10^-1   747     5.9 x 10^-9  1166   1.1 x 10^-9    1402   4.7 x 10^-8  1216  1.9 x 10^-7  1469   1.9 x 10^-8
 10^0    4112    1.9 x 10^-10 3564   1.3 x 10^-9   11614   3.4 x 10^-8  1772  2.6 x 10^-8  2178   3.4 x 10^-8
 10^1    37943   1.0 x 10^-11 27797  5.5 x 10^-10 108374   1.9 x 10^-9  2327  4.9 x 10^-8  2679   7.8 x 10^-8
 10^2    267905  1.0 x 10^-11                                           7527  2.4 x 10^-9  2981   8.0 x 10^-8
 10^3                                                                                      3215   6.8 x 10^-8
 10^4                                                                                      3372   9.9 x 10^-9

set of variables $y = Uz$, where $U$ is the unitary matrix
$$U = \frac{1}{2}\begin{pmatrix} -1 & 1 & 1 & 1\\ 1 & -1 & 1 & 1\\ 1 & 1 & -1 & 1\\ 1 & 1 & 1 & -1 \end{pmatrix}, \qquad (12.84)$$
and solve the differential equation in the new variables $y$,
$$(y^i)' = \sum_{j=1}^{4} U_{ij}\bigl(-\beta_j z^j + (z^j)^2\bigr), \qquad y^i(0) = -1, \qquad (12.85)$$
where $z = Uy$. Krogh has suggested $\beta_1 = 1000$, $\beta_2 = 800$, $\beta_3 = -10$ and $\beta_4 = 0.001$.
The exact solution of this system of differential equations is
$$z^i = \frac{\beta_i}{1 - (1+\beta_i)e^{\beta_i t}}. \qquad (12.86)$$
Thus, for positive $\beta_i$ the solution tends to zero as $t \to \infty$, while for negative $\beta_i$ it tends to $\beta_i$. The Jacobian matrix $\partial f/\partial y$ for this system of equations is
$$J = U\,\mathrm{diag}\bigl(-\beta_i + 2z^i\bigr)\,U, \qquad (12.87)$$
which has eigenvalues $-\beta_i + 2z^i$. Thus, initially the eigenvalues are $-2 - \beta_i$, while as $t \to \infty$ the
eigenvalues become $-|\beta_i|$. Thus, initially the solution corresponding to $\beta_i < -2$ is increasing while ultimately all solutions are decreasing in magnitude. Because of the wide variation in the values of $\beta_i$, the equation is stiff. The component corresponding to $\beta_4 = 0.001$ requires a time of the order of $t = 1000$ to reach the asymptotic limit and the integration will need to be carried out over a time interval of this order, while the other components reach the asymptotic limit in a much shorter interval. The shortest time scale in the problem is 0.001, and unless the method is stiffly stable, a step size of the order of 0.001 will be needed. Hence, it requires $\approx 10^6$ steps for integration over the entire interval, when all solutions settle down to their asymptotic values. In this case, since all eigenvalues are real, absolute stability along the negative real axis is sufficient.
To demonstrate the effect of stiffness, we attempt to solve this problem using the sub-
routines RKM, EXTP and MSTEP (with both ADAMS and GEAR). Subroutine EXTP was
used with polynomial extrapolation, since with rational function extrapolation the subroutine
failed at t ~ 0.48, because the step size became too large and as a result the function value
resulted in an overflow. This failure arises because the midpoint method used to integrate the
equations over a basic step H is not stiffly stable and as a result the solution can blow up in
128 steps that are performed with this method. The basic problem here is with the procedure
for adjusting the step size. Since accuracy is not a problem, the extrapolation process may
618 Chapter 12. Ordinary Differential Equations

converge in two to three steps, after which the subroutine tries to increase the step size, lead-
ing to disaster when the step size goes outside the stability region for the midpoint method.
Interestingly, this problem was not encountered with polynomial extrapolation. Even in the
first two steps, where rational function extrapolation was successful, polynomial extrapola-
tion was found to be more efficient. Rational function extrapolation required 1842 function
evaluations up to t = 0.1.
The solution is requested at times t = 0.01, 0.1,1,10,100, 1000, 10000 and the results
are shown in Table 12.5. The number in parenthesis following GEAR is the value of the
parameter CFAC. All subroutines were requested to achieve a relative accuracy of 10^-6 and were started with an initial step size of 10^-3. It is not possible to achieve this accuracy
using a 24-bit arithmetic. Hence, all calculations are performed using a 53-bit arithmetic.
Consequently, roundoff error should be negligible. The table shows the number of function
evaluations used as well as the maximum (relative) error in the final results. All subroutines
were terminated when the number of function evaluations became too large, or when the step
size became too small. As mentioned earlier, subroutine MSTEP with GEAR has difficulty
when the step size needs to be halved and as a result the step size keeps oscillating between a
value of 0.001 and 0.1 during the calculations. For t > 100, it results in failure as the step size
required to restart the integration using Runge-Kutta method is smaller than the permissible
limit. This problem can be tackled by reducing the parameter CFAC in subroutine GEAR
to 0.0001 and results obtained using that value are also shown in the table.
It can be seen from the table that in the first interval when all components of the
solution are varying, the extrapolation and Adams methods are most efficient. By t = 0.01
the first two components have essentially reached the asymptotic limit and the step size is
determined by stability rather than accuracy. As a result, Gear's method performs better
in this interval. Beyond t = 1 Gear's method with CFAC = 0.1 faces the problem due
to accuracy mentioned above and the performance deteriorates. By reducing CFAC it is
possible to overcome this problem. Even with the smaller value of CFAC this problem may
be encountered at t > 1000, where the step size becomes as large as 100. The results shown
in Table 12.5 were obtained by using two intermediate points 2000 and 4000. Cutting down
the interval for integration enables the integration to proceed, as the specified tolerance
10^-6 H/TSTEP at each step increases. Among all the methods tried, Gear's method with the smaller value of CFAC is the only one which
completes the integration up to t = 10000 in a reasonable time.
By comparing the accuracy of the computed results, it can be seen that Gear's method
has much larger errors, which is clearly due to the fact that the step size is fairly large for this
method. The other methods are forced to use a very small step size because of the stability constraint, and consequently they achieve much higher accuracy than what was requested. In all cases, however, the actual (relative) error never exceeds the requested accuracy of 10^-6.

12.7 Boundary Value Problem


So far we have considered numerical solution of initial value problems, where
all boundary conditions are specified at one point. Now we shall consider more
general problems, where boundary conditions involve the function values at
more than one point. Thus, for a system of n first-order differential equations,
the boundary conditions may be written in the form

g_j(y_1(a), ..., y_n(a), y_1(b), ..., y_n(b)) = 0,   (j = 1, ..., n).        (12.88)

Here it is assumed that the boundary conditions involve function values at


two different points. This definition can be easily generalised to more than
two points. In practice, the boundary value problems usually involve boundary
conditions at two different points and are also referred to as two-point boundary

value problem. Further, in most cases the boundary conditions are separable in
the sense that each individual condition involves the function value at one point
only. In that case, the boundary conditions can be written as

g_j(y_1(a), ..., y_n(a)) = 0,   (j = 1, ..., n_1);
g_j(y_1(b), ..., y_n(b)) = 0,   (j = n_1 + 1, ..., n).                       (12.89)
In this chapter, we assume that boundary conditions are in this form. The
numerical methods can be easily generalised to more general forms.
Many problems can be reduced to this form. Consider, for example, the eigenvalue problem, where the differential equation involves an additional parameter λ:

dy/dt = f(t, y, λ).                                                          (12.90)

In this case, if we have to satisfy n+ 1 boundary conditions of the form (12.89),


then the problem is overdetermined. Therefore, there is no solution for arbitrary
values of λ. But for special values of λ a solution may exist. This leads to an
eigenvalue problem with the difference that the system of equations need not
be homogeneous. For a homogeneous system of equations, we do not need the
additional boundary condition, since even with n boundary conditions a non-
trivial solution exists only for special values of λ. This homogeneous problem
can also be transformed to an inhomogeneous form, by imposing an additional
boundary condition to normalise the eigenfunctions. Some special methods for
this kind of eigenvalue problem will be considered in Section 12.9. The gen-
eral inhomogeneous eigenvalue problem can be reduced to a simple boundary
value problem, by defining an additional variable y_{n+1} = λ, which satisfies the additional equation y'_{n+1} = 0. With this addition the number of boundary conditions equals the number of unknowns and the problem reduces to a standard
boundary value problem.
Similarly, a problem with free boundary can be reduced to a boundary
value problem in the standard form by a change of variable. In these problems,
only one boundary point a is prescribed, while the second boundary point b
is to be determined so that the solution satisfies certain boundary conditions
at that point. For example, while integrating stellar structure equations, the
outer point may be fixed by the requirement that pressure vanishes at that
point. Since an additional parameter is to be determined, in general there will
be an extra boundary condition. This problem can be reduced to the standard
form by defining a new independent variable x = (t - a)/(b - a), in terms of
which the boundary conditions are specified at x = 0 and 1. The unknown
parameter b - a can be determined as in the previous case by introducing an
additional variable y_{n+1} = b - a, satisfying the equation y'_{n+1} = 0. Any other
change of variable which transforms the points a and b to a set of fixed values
may be used instead of the simple linear transformation given above.
Numerical solution of boundary value problems is usually more difficult
than that of initial value problems. For initial value problems, sufficient bound-
ary conditions are known at one point to determine the values of all dependent

variables and the solution is uniquely defined under fairly general conditions.
For boundary value problem the solution may not exist for arbitrary boundary
conditions, while even if the solution exists it may not be unique. For example,
consider the equation y'' + y = 0, which has a general solution

y = c_1 sin t + c_2 cos t.                                                   (12.91)

If the boundary conditions are y(0) = 1 and y(π/2) = 0, the unique solution is y = cos t. But if the boundary conditions are y(0) = y(π) = 0, then the solution is not unique, since all solutions of the form y = c sin t satisfy the boundary conditions. On the other hand, if the boundary conditions are y(0) = y(π) = 1,
then the solution does not exist.
Numerical methods for the solution of boundary value problems can be
broadly divided into three classes: the shooting methods which use a technique
for initial value problem, the finite difference methods which attempt to find
the solution by solving a system of simultaneous algebraic equations, and the
expansion methods where the solution is expanded in terms of suitable basis
function. Since the number of boundary conditions at one point is not sufficient
to define the solution uniquely, in shooting method, we choose some of the
components arbitrarily to start the integration using some method for initial
value problem. The solution is continued up to the second point, where in
general the boundary conditions are not satisfied. Now the extra conditions at
the first point are adjusted, such that the boundary conditions at the second
point are satisfied. This method derives its name from the fact that here we
essentially aim at the target and shoot the solution from one end. If the target
is not hit within the required precision, then the initial aim is adjusted to hit
the target. This method is also referred to as an initial value method for boundary
value problems. The shooting method is quite similar to the analytic technique
for solving a boundary value problem. In the analytic technique, we first write
the general solution of the differential equation introducing the required number
of arbitrary constants. The required solution is then determined by calculating
the constants such that the specified boundary conditions are satisfied. On
the other hand, in finite difference methods the basic idea is to replace the
differential equations by a set of difference equations and then to solve the
system of difference equations. In this section, we consider the shooting methods
while the finite difference methods will be discussed in the next section. The
expansion methods will be discussed in Section 12.10.
If we assume the boundary conditions in the form (12.89), then at the first
boundary we can obtain n_1 of the n components in terms of the remaining components, which can be treated as unknown parameters. For any value of these n - n_1 parameters, it is possible to find a solution of the initial value problem. This solution can be written as y(t, α_1, ..., α_{n-n_1}), where α_i are the unknown parameters introduced at the first point to define the solution uniquely. Substituting this solution in the second set of conditions in (12.89), we get a system of n - n_1 nonlinear equations in these parameters. These equations can be solved using
any method described in Chapter 7.

To make the idea more clear, consider the boundary value problem

y'' = f(t, y, y'),   y(a) = α,   y(b) = β.                                   (12.92)

Now the initial value problem

y'' = f(t, y, y'),   y(a) = α,   y'(a) = x,                                  (12.93)

has a solution y(t) = y(t, x) for any given value of x. To solve the boundary value problem, we can substitute this solution in the second boundary condition to get F(x) = y(b, x) - β = 0. Thus, the problem essentially boils down
to finding a zero of F(x). Even though the function F(x) is not defined explic-
itly, it can be computed for any given value of x by solving the initial value
problem (12.93), using any method described in the preceding sections. For
finding a zero of F(x), we can use any method described in Chapter 7, for
example, secant iteration. If the method requires the first derivative of F(x)
(e.g., Newton-Raphson method), then the first derivative of this function needs
to be computed, which requires solution of another initial value problem {46}.
However, it requires a substantial effort to compute this derivative and it is
more efficient to avoid methods requiring derivatives. Thus in this case, it is
preferable to use secant iteration or Brent's method for finding zeros of F(x).
It should be noted that, since the function F(x) is in general nonlinear, it can
have more than one real zero, or it may not have any real zero at all. For a
higher order differential equation, we will end up with a system of nonlinear
equations, which can be solved using Broyden's method.
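As an illustration of this procedure, the following Python sketch (illustrative only; the book's own implementation uses the Fortran subroutines described in the text) combines a fixed-step fourth-order Runge-Kutta integrator with secant iteration on F(x) = y(b, x) - β for a problem of the form (12.92):

    import numpy as np

    def rk4(f, t0, y0, t1, n):
        """Fixed-step classical fourth-order Runge-Kutta integration of y' = f(t, y)."""
        h = (t1 - t0) / n
        t, y = t0, np.asarray(y0, dtype=float)
        for _ in range(n):
            k1 = f(t, y)
            k2 = f(t + h/2, y + h/2*k1)
            k3 = f(t + h/2, y + h/2*k2)
            k4 = f(t + h, y + h*k3)
            y = y + h/6*(k1 + 2*k2 + 2*k3 + k4)
            t += h
        return y

    def shoot(f2, a, b, alpha, beta, x0, x1, n=100, tol=1e-10, maxit=30):
        """Solve y'' = f2(t, y, y'), y(a) = alpha, y(b) = beta by shooting:
        secant iteration on the unknown initial slope x = y'(a)."""
        def F(x):
            sys = lambda t, u: np.array([u[1], f2(t, u[0], u[1])])
            return rk4(sys, a, [alpha, x], b, n)[0] - beta
        F0, F1 = F(x0), F(x1)
        for _ in range(maxit):
            x2 = x1 - F1*(x1 - x0)/(F1 - F0)        # secant step
            if abs(x2 - x1) < tol:
                return x2
            x0, F0, x1, F1 = x1, F1, x2, F(x2)
        return x1

    # Example: y'' = -y, y(0) = 1, y(pi/2) = 0, whose solution y = cos t has y'(0) = 0.
    print(shoot(lambda t, y, yp: -y, 0.0, np.pi/2, 1.0, 0.0, x0=-1.0, x1=1.0))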
If the differential equations as well as the boundary conditions are linear
in components of y, then it can be shown that the resulting system of equations
is linear and hence can be solved directly. For example, consider the boundary
value problem
y' = F(t)y + g(t), Ay(a) + By(b) = c, (12.94)
where F, A and B are n × n matrices and y, g and c are vectors of length n. The elements of F and g can be arbitrary continuous functions of t. The general solution of this differential equation is a linear combination of n independent
solutions of the homogeneous problem (i.e., g == 0), plus a particular solution of
the full equation. The particular solution yp(t) can be obtained by considering
the initial value problem
y' = F(t)y + g(t), y(a) = O. (12.95)
The n independent solutions of the homogeneous equation can be defined by the initial value problems

y' = F(t)y,   y(a) = e_j,   (j = 1, 2, ..., n),                              (12.96)
where ej is the jth column of the unit matrix. If we construct a matrix Y(t)
whose columns are the respective solutions of the above initial value problems,
then the general solution of the original differential equations is given by
y(t, s) = Y(t)s + y_p(t).                                                    (12.97)

It can be easily verified that for any vector s, this function satisfies the dif-
ferential equation. Substituting this solution in the boundary conditions, we
get
(A + BY(b))s + By_p(b) - c = 0.                                              (12.98)

This is a system of n linear equations in the n components of s, which can be


solved for s, provided the matrix A + BY(b) is nonsingular. Once the vector
s is known, the required solution can be computed from (12.97). Hence, if
the equation is linear, then we do not need to use any iterative method to
compute the required solution. It can also be seen that if an iterative method
is attempted, then the iteration will converge after the first step itself.
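For the linear problem (12.94) the whole procedure reduces to n + 1 initial value integrations and one linear solve, as the following Python/SciPy sketch indicates (all names are illustrative, and the integration tolerances are arbitrary choices):

    import numpy as np
    from scipy.integrate import solve_ivp

    def linear_bvp(Fmat, g, a, b, A, B, c, rtol=1e-9):
        """Solve y' = F(t)y + g(t) with A y(a) + B y(b) = c by superposition,
        following Eqs. (12.95)-(12.98)."""
        n = len(c)
        def endpoint(y0, rhs):
            return solve_ivp(rhs, (a, b), y0, rtol=rtol, atol=1e-12).y[:, -1]
        # particular solution: y' = F y + g, y(a) = 0
        yp_b = endpoint(np.zeros(n), lambda t, y: Fmat(t) @ y + g(t))
        # homogeneous solutions y' = F y, y(a) = e_j, assembled as the columns of Y(b)
        Yb = np.column_stack([endpoint(np.eye(n)[:, j], lambda t, y: Fmat(t) @ y)
                              for j in range(n)])
        # boundary conditions give (A + B Y(b)) s = c - B y_p(b), Eq. (12.98)
        s = np.linalg.solve(A + B @ Yb, c - B @ yp_b)
        return s                     # y(t) = Y(t) s + y_p(t); in particular y(a) = s

    # Example: y'' = -y written as a first-order system, with y(0) = 1 and y(pi/2) = 0.
    Fmat = lambda t: np.array([[0.0, 1.0], [-1.0, 0.0]])
    g = lambda t: np.zeros(2)
    A = np.array([[1.0, 0.0], [0.0, 0.0]])      # first condition involves y1(a)
    B = np.array([[0.0, 0.0], [1.0, 0.0]])      # second condition involves y1(b)
    c = np.array([1.0, 0.0])
    print(linear_bvp(Fmat, g, 0.0, np.pi/2, A, B, c))   # y(a) = (1, 0), i.e. y = cos t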
The simple shooting method as described here may run into difficulty,
when the eigenvalues of the corresponding Jacobian matrix ∂f_i/∂y_j are widely
separated. In that case, irrespective of the initial values, the numerical solution
always tends to the dominant component. As a result, solutions with different
initial vectors are not independent and it may be difficult to determine the cor-
rect starting values from the results. For example, if a second-order equation
has a general solution c_1 e^{-t} + c_2 e^{100t} and the two boundary conditions specified at t = 0 and 1 are such that the required solution is e^{-t}, then no matter
what initial condition is chosen at t = 0, the solution is dominated by the other
component at t = 1, and it may be impossible to find the correct solution. On
the other hand, if integration is started from t = 1, it may be possible to ob-
tain the correct solution, since in that case the unwanted solution is decreasing
rapidly. However, for a higher order equation, it is quite possible that the dom-
inant solutions from both end points are unwanted and irrespective of where
we start the integration, it is not possible to get the correct solution. This is
the main drawback of shooting methods. Further, it is possible that when the
initial values are not close to the correct values, the solution may not exist
because of some singularity or some other pathology in the general solution
of the equation. For example, at some stage some quantity inside the square
root may become negative, because of the incorrect initial conditions and the
solution will break down.
To overcome this problem, we can try to reduce the interval over which
the integration is carried out. The simplest option is to start the integration
from both end points and match the solution at some point in between. Thus,
we can introduce unknown parameters at both ends to determine the initial
values. This technique clearly requires n parameters, which can be determined
by requiring that the n components of the solution match at some intermedi-
ate point. Thus, let Yb( t, aI, ... , an!) be the solution of initial value problem
starting from t = 0 and satisfying the boundary conditions at t = b. Similarly,
Ya(t, an! +1, ... , an) is the solution starting from t = a. Then the unknown pa-
rameters ai can be determined by matching the solutions at some intermediate
point c giving the n equations

y_b(c, α_1, ..., α_{n_1}) = y_a(c, α_{n_1+1}, ..., α_n).                     (12.99)

This device can only improve the solution to some extent, but cannot really
eliminate the problem mentioned above, since integration over even half the
range may be sufficient for the true solution to be swamped by the extraneous
ones. This problem can be overcome by introducing several intermediate points
and solving the initial value problems over sufficiently small intervals. Thus,
if m - 1 intermediate points are introduced, then we can solve m initial value
problems in each of the m subintervals. The initial conditions at all points can
then be determined by requiring that the resulting solution is continuous and
satisfies the required boundary conditions. These conditions lead to a system
of nm equations in equal number of unknowns. This technique is usually re-
ferred to as multiple shooting. The solution of resulting system of nonlinear
equations could be rather difficult and we may require good initial guess for
the solution, in order to converge to the required result. Readers can refer to
Stoer and Bulirsch (2010) or Keller (1992) for more details. In the limit of very
large number of intermediate points, this method will be similar to the finite
difference methods considered in the next section.
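A minimal multiple shooting sketch, using SciPy both for the initial value integrations and for solving the resulting matching conditions, might look as follows (Python, illustrative names; a serious implementation would instead use a Newton iteration that exploits the block structure of the Jacobian, as described in the references above):

    import numpy as np
    from scipy.integrate import solve_ivp
    from scipy.optimize import fsolve

    def multiple_shooting(f, a, b, bc, n, m, guess):
        """Multiple shooting for y' = f(t, y) on [a, b] with n boundary conditions
        bc(y(a), y(b)) = 0.  The unknowns are the states s_0, ..., s_{m-1} at the left
        ends of m equal subintervals; the residuals are the continuity conditions
        between neighbouring subintervals together with the boundary conditions."""
        tk = np.linspace(a, b, m + 1)
        def propagate(k, s):
            return solve_ivp(f, (tk[k], tk[k+1]), s, rtol=1e-9, atol=1e-12).y[:, -1]
        def residual(u):
            s = u.reshape(m, n)
            ends = [propagate(k, s[k]) for k in range(m)]
            res = [ends[k] - s[k+1] for k in range(m - 1)]     # continuity
            res.append(bc(s[0], ends[-1]))                     # boundary conditions
            return np.concatenate(res)
        u = fsolve(residual, np.asarray(guess, dtype=float).ravel())
        return u.reshape(m, n)       # states at the m subinterval starting points

    # Example: y'' = -y, y(0) = 1, y(pi/2) = 0, with m = 4 subintervals.
    f = lambda t, y: np.array([y[1], -y[0]])
    bc = lambda ya, yb: np.array([ya[0] - 1.0, yb[0]])
    s = multiple_shooting(f, 0.0, np.pi/2, bc, n=2, m=4, guess=np.ones((4, 2)))
    print(s[:, 0])                   # close to cos(t_k) at t_k = 0, pi/8, pi/4, 3*pi/8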
EXAMPLE 12.10: Solve the following boundary value problem using shooting method

Y" - (A - l)y' - AY = -At - A + 1, yeO) = 1, y(l) = 1 + e- 1 , (A = 10,50). (12.100)

The general solution of this differential equation is y = c_1 e^{λt} + c_2 e^{-t} + t and it can be easily verified that the exact solution satisfying the boundary conditions is y = t + e^{-t}. Using
the fourth-order Runge-Kutta method we can find the solution to the initial value problems

(i)  y_1'' - (λ - 1)y_1' - λy_1 = -λt - λ + 1,   y_1(0) = 1,   y_1'(0) = 1,
(ii) y_2'' - (λ - 1)y_2' - λy_2 = 0,             y_2(0) = 0,   y_2'(0) = -1,        (12.101)

where we denote the two solutions by y_1(t) and y_2(t), respectively. Any linear combination of the form y_1(t) + αy_2(t) satisfies the differential equation and the boundary condition at t = 0. Hence, to solve the boundary value problem, we have to find the value of α such that the boundary condition at t = 1 is satisfied, which gives α = (1 + e^{-1} - y_1(1))/y_2(1). Using this value of α, we can compute the solution at any required point, provided the values of y_1(t) and y_2(t) are preserved. If a method with adaptive control of step size is used, then the step
and Y2 (t) are preserved. If a method with adaptive control of step size is used, then the step
size will be different for the two integrations and combining the results to get the required
solution may be difficult. Further, controlling the error in both integrations separately does
not guarantee that the error in the linear combination is also within the required limits.
Hence, for simplicity we use integration with constant step size. The results are shown in
Table 12.6. Since the results are obtained with constant step size, it is more efficient to use
a predictor-corrector method. But in this case, since the number of steps is only five, most
of the effort will be spent in starting the solution. As a result, we have used the fourth-order
Runge-Kutta method.
The term proportional to e^{λt} is expected to dominate both solutions y_1(t) and y_2(t), even though it does not appear in the solution of the boundary value problem. It can be seen that for λ = 10 both solutions are of the order of 10^3 at t = 1, and the calculated value of α is close to 1. Although the value of α can be calculated accurately, three significant figures will be lost while constructing the required solution y(t) = y_1(t) + αy_2(t), when t ≈ 1. As a result, the error increases slightly towards the end. Using different starting values for y'(0) in the initial value problems does not help: since the dominant term will always be of the same order while the exact solution is of the order of unity, some cancellation is inevitable. For λ = 50 the situation is much worse, since y_1(1) and y_2(1) are of the order of 10^12 and 12 significant figures will be lost while finding y(1). Since all results are obtained using a 24-bit arithmetic, we cannot hope to have any significant digit in the computed value of y(1). In fact the error
increases sharply with t.

Table 12.6: Solving a two-point boundary value problem using shooting method

  t      y_1(t)               y_2(t)               y(t)         y_exact      Rel. error

  λ = 10, h = 0.2
  0.2    1.580667 x 10^0     -5.619333 x 10^-1     1.018733     1.018731     -2.5 x 10^-6
  0.4    5.463932 x 10^0     -4.393607 x 10^0      1.070324     1.070320     -3.6 x 10^-6
  0.6    3.228075 x 10^1     -3.113193 x 10^1      1.148815     1.148812     -3.2 x 10^-6
  0.8    2.194812 x 10^2     -2.182319 x 10^2      1.249318     1.249329      8.9 x 10^-6
  1.0    1.529244 x 10^3     -1.527876 x 10^3      1.367860     1.367879      1.4 x 10^-5

  λ = 50, h = 0.2
  0.2    1.363667 x 10^1     -1.261794 x 10^1      1.018731     1.018731     -3.5 x 10^-6
  0.4    8.141556 x 10^3     -8.140488 x 10^3      1.068359     1.070320      1.8 x 10^-3
  0.6    5.245197 x 10^6     -5.245196 x 10^6      1.000000     1.148812      1.3 x 10^-1
  0.8    3.379655 x 10^9     -3.379655 x 10^9      0.000000     1.249329      1.0 x 10^0
  1.0    2.177625 x 10^12    -2.177625 x 10^12     0.000000     1.367879      1.0 x 10^0

  λ = 50, h = -0.2
  0.8    3.339594 x 10^0     -5.681934 x 10^0      1.249329     1.249329      9.5 x 10^-8
  0.6    6.119692 x 10^2     -1.660382 x 10^3      1.149109     1.148812     -2.6 x 10^-4
  0.4    1.777529 x 10^5     -4.831799 x 10^5      1.125000     1.070320     -5.1 x 10^-2
  0.2    5.172580 x 10^7     -1.406054 x 10^8     16.000000     1.018731     -1.5 x 10^1
  0.0    1.505220 x 10^10    -4.091616 x 10^10     0.000000     1.000000      1.0 x 10^0

  λ = 50, h = -0.05
  0.8    1.241794 x 10^0      2.048249 x 10^-2     1.249329     1.249329      9.5 x 10^-8
  0.6    1.138276 x 10^0      2.863859 x 10^-2     1.148811     1.148812      1.6 x 10^-7
  0.4    1.057217 x 10^0      3.561946 x 10^-2     1.070320     1.070320      2.6 x 10^-8
  0.2    1.002684 x 10^0      4.361890 x 10^-2     1.018731     1.018731     -8.5 x 10^-9
  0.0    9.803937 x 10^-1     5.329626 x 10^-2     1.000000     1.000000      0.00

For this problem, since the extraneous solution is increasing with t, we can hope to get good results if the integration is started from t = 1 instead of t = 0. For this purpose, we can use the following initial conditions for the two solutions

(i)  y_1(1) = 1 + e^{-1},   y_1'(1) = 1;        (ii)  y_2(1) = 0,   y_2'(1) = -1.        (12.102)
The last two sets in Table 12.6 show the result obtained using h = -0.2 and h = -0.05,
respectively. It can be seen that for h = -0.2 the computed solutions y_1(t) and y_2(t) are increasing steeply (much faster than e^{-t}) and once again we face the same problem. This problem arises because for this value of hλ the Runge-Kutta method is not absolutely stable and as a result the extraneous solution increases. Reducing h to -0.05 makes the method absolutely stable and we get much better results. If the general solution of the differential equation had another component of the form e^{-λt}, then starting from either end point will
lead to trouble.
EXAMPLE 12.11: Solve the following boundary value problem using shooting method

y'' = λ sinh(λy),   y(0) = 0,   y(1) = 1,   λ = 5, 10.                       (12.103)

To solve the problem using shooting method, we can solve the initial value problem
with initial conditions y(0) = 0 and y'(0) = x, where x is the unknown parameter which

should be adjusted to satisfy the second boundary condition. The solution of this initial value problem gives y(1, x), and finding a zero of f(x) = y(1, x) - 1 using secant iteration can give us the required value of x. Once the value of x is known, the required solution can be computed at any specified set of points using any initial value method. However, it turns out that the solution has a singularity a little beyond t = 1 and as we approach this point, the step size has to be reduced. It turns out that using a 24-bit arithmetic it is not possible to achieve an accuracy of better than 10^-6 for λ = 5 and 10^-5 for λ = 10. Further, in both
cases if the value of x is slightly larger than the correct value, then the solution of initial value
problem has a singularity inside the interval and subroutine RKM fails. It can be shown that
the solution has a singularity at
t ≈ (1/λ) ln( 8/|y'(0)| ),                                                   (12.104)

and for |y'(0)| > 8e^{-λ} the singularity is inside the required interval. Thus for λ = 5, we must have y'(0) < 0.05 while for λ = 10, y'(0) < 0.00036. Unless the starting values are specified rather accurately, the secant iteration fails to converge, since it tends to give an estimate for x outside the limits mentioned above. Hence, the solution is singular and subroutine RKM fails. The situation is particularly serious for λ = 10, where the initial value problem can be solved only for a very small range of x values. Consequently, a large number of trials are required to find the correct value of x. The value of x is 0.04575042 for λ = 5 and 0.0003586 for λ = 10.
For λ = 5 there is no difficulty and the results are accurate to the specified accuracy. For λ = 10, there is a significant error of the order of 10^-3, since the boundary condition at t = 1 is not satisfied to high accuracy. The exact solution to this initial value problem can be expressed in terms of elliptic functions. It can be shown that the exact solution has a logarithmic singularity at t ≈ 1 + 1/(λ cosh(λ/2)). This singularity occurs at t ≈ 1.0326 for λ = 5, and at t ≈ 1.0013 for λ = 10, which is very close to the upper limit. The exact solution
listed in the table has actually been calculated using the same program in double precision
with a specified accuracy of 10^-10. It may be noted that the Runge-Kutta method required
about 1500 function evaluations to achieve the specified accuracy in each integration. A large
number of integrations were required before the solution was found, and the total number of
function evaluations required is of the order of 10^4. This example shows that it requires a
considerable effort to solve a two-point boundary value problem.
To avoid the problem posed by the singularity falling inside the range of integration it
is possible to use the continuation (or Davidenko's) method (Section 7.16). For this purpose,
we can solve a series of problems, where the second boundary condition is applied at t = β and the value of β is varied gradually from a suitably chosen value (say β = 0.5) to β = 1.
A simpler possibility is to use Brent's method for solving the nonlinear equation and in the
routine to calculate the function we can set an arbitrary value with correct sign when the
integration has to be aborted before the end point is reached. This will ensure that the correct
value of x is found.
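The second, simpler device can be sketched as follows (Python, illustrative only; the blow-up threshold and the value 10 returned for an aborted integration are arbitrary choices, and SciPy's brentq plays the role of Brent's method):

    import numpy as np
    from scipy.optimize import brentq

    lam = 5.0

    def y_at_1(x, n=2000):
        """Integrate y'' = lam*sinh(lam*y), y(0) = 0, y'(0) = x up to t = 1 with
        fixed-step RK4; return None if the solution blows up before the end point."""
        h = 1.0 / n
        f = lambda t, y: np.array([y[1], lam*np.sinh(lam*y[0])])
        t, y = 0.0, np.array([0.0, x])
        with np.errstate(over='ignore', invalid='ignore'):   # overflow just signals blow-up
            for _ in range(n):
                k1 = f(t, y); k2 = f(t + h/2, y + h/2*k1)
                k3 = f(t + h/2, y + h/2*k2); k4 = f(t + h, y + h*k3)
                y = y + h/6*(k1 + 2*k2 + 2*k3 + k4); t += h
                if not np.all(np.isfinite(y)) or abs(y[0]) > 10.0:
                    return None
        return y[0]

    def F(x):
        """Shooting function y(1, x) - 1; if the integration is aborted, return an
        arbitrary positive value, which has the correct sign since y(1, x) grows with x."""
        y1 = y_at_1(x)
        return 10.0 if y1 is None else y1 - 1.0

    x = brentq(F, 0.0, 0.2)      # F(0) = -1 < 0, while F(0.2) > 0 (aborted integration)
    print(x)                     # close to the value of x quoted above for lam = 5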

In general, using a simple shooting method for solution of a boundary


value problem could yield misleading results, where roundoff errors rather than
the boundary conditions decide the solution. For linear problems, it is much
more straightforward to use finite difference methods described in the next
section or the expansion methods described in Section 12.10. For nonlinear
problems the choice is not very clear, since finite difference methods will lead
to a system of nonlinear equations, which may be very difficult to solve. The
expansion method also leads to a system of nonlinear equations, but the number of equations is much smaller than that for finite difference methods and hence it may be preferred. The shooting method may also be preferable in such cases, since the number of algebraic equations to be solved is much less with this method.
However, if the eigenvalues of the Jacobian matrix are widely separated, then
this advantage may be lost, since multiple shooting will have to be used.

12.8 Finite Difference Methods


The basic idea underlying all finite difference methods is to replace the deriva-
tives in the differential equations by a finite difference approximation and to
solve the resulting system of algebraic equations. If the differential equation
is nonlinear, then the resulting algebraic equations are also nonlinear and we
will be faced with the problem of solving a large system of nonlinear equations.
On the other hand, if the differential equation as well as the boundary condi-
tions are linear in y, then the resulting system of equations is also linear and
it can be easily solved. The finite difference methods are similar to the numeri-
cal integration methods considered in Section 12.1. Once again, if higher order
approximations are used, then extra boundary conditions are required. In this
case, generating extra boundary conditions is rather difficult, but it is possible
to improve on the first-order approximation using the technique of deferred cor-
rection. Alternately, the necessity of extra boundary conditions can be avoided
by using special off-centred formulae near the end points. Stability of higher
order difference schemes may also pose problems.
For a system of n first-order differential equations, we can write the finite
difference approximation as

(y_{j+1} - y_j)/(t_{j+1} - t_j) = f( t_{j+1/2}, (y_j + y_{j+1})/2 ),   (j = 1, ..., N-1),        (12.105)

where t_j, (j = 1, ..., N) with t_1 = a and t_N = b, is a sequence of mesh points covering the required interval and t_{j+1/2} = (t_j + t_{j+1})/2. At each of these mesh
points, we need to calculate the n components of y. Thus, the total number of
unknowns is nN, while the above system of equations provides only (N - 1)n
equations. The remaining n equations are supplied by the boundary conditions.
It can be easily verified that the difference approximation (12.105) has second
order accuracy and is referred to as central difference scheme. If on the right-
hand side the function is evaluated at t_j or t_{j+1}, then we get the forward
or the backward differences. Both these approximations have only first order
accuracy, but may be useful in some cases. Since the difference equations involve
only one subinterval, it is possible to choose a variable mesh spacing. However,
in general, we have no idea of the solution beforehand and it may be difficult
to choose the optimal mesh spacing. Depending on the expected behaviour of
the solution, the mesh spacing is usually decided beforehand. If a large number
of problems with similar equations are to be solved, then it is possible to do
some experimentation to find better mesh points. It is also possible to use an
adaptive technique for step size control as discussed in Section 12.11.3.
If we have a system of second-order differential equations, it may be bet-
ter to obtain the finite difference approximation directly. If the equations are
reduced to a system of first-order equations, then the number of equations in
the corresponding difference approximation will be doubled. For second deriva-
tive, we need a three-point approximation and it is convenient to have a mesh
with uniform spacing. Thus, if t_j = a + (j - 1)h for j = 1, ..., N, where

h = (b - a)/(N - 1), then the difference equations can be obtained by using


the approximations

y'(t_j) ≈ ( y(t_{j+1}) - y(t_{j-1}) )/(2h),        y''(t_j) ≈ ( y(t_{j+1}) - 2y(t_j) + y(t_{j-1}) )/h^2.        (12.106)

Substituting (12.106) in the differential equation at t = t_2, ..., t_{N-1}, we obtain
n(N - 2) equations in nN unknowns. The remaining 2n equations are supplied
by the boundary conditions for the system of second-order differential equa-
tions. Similarly, if we have a differential equation of order m, we can use a
finite difference approximation based on m + 1 points. It may be noted that in
this case, although the lower derivatives may be approximated by higher order
formulae, the highest derivative can have an accuracy of at most second order.
Thus, the truncation error in these formulae is O(h^2). In some cases, it may be
convenient to introduce some fictitious points beyond the physical boundaries,
in order to take care of boundary conditions involving derivatives. For example,
if the boundary condition is y' (a) = 0, we can write it in finite difference form
as y(a + h) - y(a - h) = 0, thus introducing a point a - h outside the region,
where the solution is required. This technique increases the number of algebraic
equations to be solved, unless the extra values of y can be easily eliminated.
To achieve higher order accuracy, we can use Richardson's h → 0 extrap-
olation. This extrapolation has to be applied at each value of t, for which the
solution is required. Since this technique has been discussed earlier (Sections 5.3
and 6.2), we do not consider it further. An alternative approach is the method
of deferred correction. In this method, we estimate the local truncation error
in each step, using an estimate for higher order derivatives.
Let us consider a boundary value problem for a system of linear equations

y' = F(t)y + g(t),                                                           (12.107)

Using a uniform mesh of points t_j, (j = 1, ..., N) with t_j = a + (j - 1)h,


h = (b - a)/(N - 1), we can write the finite difference approximation as

(y_{j+1} - y_j)/h - F(t_{j+1/2}) (y_{j+1} + y_j)/2 - g(t_{j+1/2})
    = (h^2/24) y'''(t_{j+1/2}) + (h^4/1920) y^(5)(t_{j+1/2}) + ...
      - F(t_{j+1/2}) [ (h^2/8) y''(t_{j+1/2}) + (h^4/384) y^(4)(t_{j+1/2}) + ... ],        (12.108)

where y_j is an approximation to y(t_j) and t_{j+1/2} = (t_j + t_{j+1})/2. The two terms on the right-hand side arise from the truncation errors in approximating y' and y at t_{j+1/2}. If the right-hand side is neglected, then we get the usual finite difference approximation (12.105), which is a system of linear equations in y_j.
These equations supplemented by the boundary conditions can be solved to
obtain the approximate solution y_j. It may be noted that the equation matrix
has only 2n nonzero elements in each row. Using the computed solution, we can
try to estimate the derivatives on the right-hand side, which gives an estimate

of the local truncation error at each step. For the boundary value problem,
this local estimate can be converted to global estimate, if the corresponding
equations are solved once again. In most cases, we do not proceed beyond the
first term. Hence, in the subsequent discussion, we neglect all higher order
terms. To estimate the second and third derivatives, we can use the formulae

"'(t )~ Yj+2 - 3Yj+l + 3Yj - Yj-l


Y j+! ~ h3 '
(12.109)
"(t. 1) ~ Yj+2 - Yj+l - Yj + Yj-l
Y 1+2" 2h2 '

which have truncation errors of O(h2). These formulae can be used for j
2, ... ,N - 2, but at the two end points it requires values outside the interval
and we need to use an off-centred formula. Off-centred formula for ylll based
on four points has an accuracy of O(h) only, and it is preferable to use the
five-point formulae

(12.110)

Using this estimate of the local truncation error, we can calculate the correction δy_j to the computed solution by solving the system of equations

(δy_{j+1} - δy_j)/h - F(t_{j+1/2}) (δy_{j+1} + δy_j)/2 = τ_{j+1/2}
    = (h^2/24) ( y'''(t_{j+1/2}) - 3F(t_{j+1/2}) y''(t_{j+1/2}) ),           (12.111)

where τ_{j+1/2} is the estimated truncation error. These equations coupled with
boundary conditions (in which there is no truncation error) can be solved for
the corrections δy_j. It may be noted that the matrix of equations in (12.108)
and (12.111) is the same and the elimination or the LU decomposition need not
be repeated. Thus, the correction can be obtained without too much of extra
effort. In most cases, calculating the coefficients on the right-hand side requires
more effort than solving the resulting system of linear equations.
It should be noted that, this technique of deferred correction is not equiv-
alent to using a higher order difference approximation, even though the order of
accuracy achieved may be the same. For example, if a four-point formula (with
corresponding off-centred formulae at the end points) is used to approximate
the derivative, then we can achieve an accuracy of O(h^4). An equivalent formula could be obtained if we use the approximations to y'' and y''' given above in (12.108) and transfer the terms onto the left-hand side before attempting the solution of the difference equations. The difference between the two approaches can be understood as follows. Let us write the system of equations (12.108) in matrix form as Ay = b, where A is an nN × nN matrix while y and b are

vectors with nN components. Now if A + δA is the matrix which is obtained when the second and third derivatives are expressed in the difference form and transferred to the left-hand side, then the fourth-order method gives the solution y + δy, which satisfies (A + δA)(y + δy) = b. Hence, we get

δA y + A δy + δA δy = 0,                                                     (12.112)

where we have used the fact that Ay = b. On the other hand, it can be easily seen that the method of deferred correction gives

δA y + A δy = 0,                                                             (12.113)

which is obtained by neglecting the higher order term δA δy in (12.112).


Thus, there is some difference between the direct solution of higher order
difference equation and the method of deferred correction. The direct solution
of fourth-order difference equation leads to a matrix with 4n nonzero elements
in each row. Solution of this system requires more effort and it is not possi-
ble to estimate truncation error in the computed solution. Further, stability of
higher order difference approximation also needs to be considered. As seen in
Section 12.2, for the initial value problem, many of the higher order difference
schemes are unstable for certain differential equations. Stability may also af-
fect the solution of boundary value problem, although the effect is somewhat
difficult to analyse. The method of deferred correction has the advantage that
it gives an estimate of truncation error in the calculated solution. The use of
off-centred differences near the end points could lead to some additional errors
in the computed solutions. For a system of second-order differential equations,
it is possible to avoid the use of off-centred formulae {47} at the end point.
For nonlinear differential equations, the resulting difference equations are
nonlinear and we have to use an iterative method to solve this system. We can
use either Broyden's or Newton's method for this purpose. For these equations,
the Jacobian matrix is in the same block form as the matrix for linear equations.
In Newton's method, we start with an initial guess for the solution (at all mesh
points) and linearise the difference equations (12.108) to get

(δy_{j+1} - δy_j)/h - F(t_{j+1/2}) (δy_{j+1} + δy_j)/2
    = -[ (y_{j+1} - y_j)/h - f( t_{j+1/2}, (y_j + y_{j+1})/2 ) ],            (12.114)

where δy_j is the correction to the current iterate y_j, and F is the Jacobian matrix for the function f with elements F_ij = ∂f_i/∂y_j.
The matrix F here plays the same role as the corresponding coefficient matrix
in the linear problem. These equations can be solved to calculate the correction
to be added to the previous iterate. This process can be continued until the
iteration has converged.
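As a concrete illustration, the following Python sketch applies Newton's method to the difference equations obtained by using the three-point formula (12.106) directly on the second-order equation of Example 12.11, y'' = λ sinh(λy) with y(0) = 0, y(1) = 1 (this is a simplification of the first-order form used in the text; all names and parameter values are illustrative):

    import numpy as np

    lam, N = 5.0, 101                      # lambda and the number of mesh points
    t = np.linspace(0.0, 1.0, N)
    h = t[1] - t[0]

    def newton_fd(y, maxit=50, tol=1e-10):
        """Newton iteration for the three-point difference approximation to
        y'' = lam*sinh(lam*y) with y(0) = 0, y(1) = 1."""
        for _ in range(maxit):
            # residual of the difference equations at the interior mesh points
            res = (y[:-2] - 2.0*y[1:-1] + y[2:])/h**2 - lam*np.sinh(lam*y[1:-1])
            # tridiagonal Jacobian matrix d(res_i)/d(y_j) of the interior unknowns
            J = np.zeros((N-2, N-2))
            i = np.arange(N-2)
            J[i, i] = -2.0/h**2 - lam**2*np.cosh(lam*y[1:-1])
            J[i[:-1], i[:-1]+1] = 1.0/h**2
            J[i[1:], i[1:]-1] = 1.0/h**2
            dy = np.linalg.solve(J, -res)              # Newton correction
            y[1:-1] += dy
            if np.max(np.abs(dy)) < tol:
                break
        return y

    y = newton_fd(t.copy())        # start from y = t; boundary values already satisfied
    print(y[N//2])                 # approximation to y(0.5), with an O(h^2) truncation error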
The nonlinear systems resulting from the finite difference approximations
involve a large number of variables and the solution is unlikely to converge,

unless a reasonably good guess is available for the solution. In many physical
problems, the initial guess is provided by solving a sequence of problems, just
as in the Davidenko's method (Section 7.16). Most problems that we come
across have natural parameters, such that for some limiting value the solution
is known. In such cases, we can solve a sequence of problems with the parameter
changing gradually from the value for the known solution to the required value.
The idea of deferred correction can be generalised to nonlinear equations
by linearising the perturbations to the original problem, to get an equation
similar to the one obtained for the linear case. The deferred correction can be
calculated by solving (12.114) with the right-hand side replaced by the local
truncation error.
An implementation of the finite difference method with deferred correction
is provided by subroutine FDM in Appendix B, which can be used to solve a
two-point boundary value problem for a system of differential equation with
separable boundary conditions. This subroutine accepts differential equations
written in slightly more general form

D(t)y' = f(t,y(t)), (12.115)

where D is an n × n matrix. It must be ensured that the matrix D is nonsingular, otherwise the results will be determined by roundoff errors rather than
the differential equations. The corresponding difference equations can be easily
obtained in analogy with (12.114). While the finite difference solution can be
obtained on a nonuniform mesh, the algorithm for obtaining deferred correction
assumes a constant spacing. Hence, this option can be used only if the mesh
spacing is uniform. It is of course possible to estimate the third derivative using
nonuniform spacing, but it requires a five-point formula to achieve an accuracy
of O(h^2). The formulae could be quite cumbersome and further, the effect of
nonuniform spacing on the computed correction is also not clear. Because of
these reasons, the formulae with nonuniform spacing are not implemented in
this subroutine. This subroutine can be used for both linear and nonlinear prob-
lems. For linear problems only one iteration is performed and there is no need
to supply initial guess for the solution.
This subroutine uses Gaussian elimination with partial pivoting to solve
the system of linear equations. In principle, if the equations are written in
proper order, then in many cases it may be possible to ensure that the resulting
matrix is diagonally dominant and no pivoting is necessary. This ordering could
save some memory space as well as time during execution, at the cost of some
work in preparing the equations properly. However, in some problems it may
be difficult to ensure that the resulting matrix is diagonally dominant. Another
reason for implementing pivoting is that the same subroutine can then be used
for the eigenvalue problem also, where the matrix could be nearly singular and
it is essential to use partial pivoting. A typical matrix for the case N = 4,
n = 5 and n_1 = 2 is shown in Figure 12.5, where x denotes elements which
can in general be nonzero. The zero elements are not shown explicitly. To
perform elimination we first consider the elements in the topmost box of size

[Pattern of the elements marked x that can in general be nonzero; the matrix has a block banded structure.]

Figure 12.5: Typical matrix for boundary value problem

(n + n_1) × 2n. It is clear that the first five (= n) steps of Gaussian elimination
do not involve any element outside this box, even if partial pivoting is used. In
the next stage, when the subdiagonal elements in the sixth column are being
reduced to zero, it may be necessary to exchange this row with the one outside
the box and we can switch to the next box. This process can be continued until
we reach the last n x n box, where the elimination can be completed. It may
be noted that, there is some overlap between different boxes which results in
some wastage of memory space and some extra copying of a small part of the
block matrix in the simple implementation, provided in subroutine GAUBLK.
EXAMPLE 12.12: Solve the boundary value problem in Example 12.10 using finite difference
method.
Using subroutine FDM we can obtain the solution on a uniform mesh of points. To use
this subroutine the equation is written as a system of two first-order differential equations
dy^1/dt = y^2,        dy^2/dt = (λ - 1)y^2 + λy^1 - λt - λ + 1.              (12.116)

This subroutine also applies the method of deferred correction to improve the accuracy of the
computed results. The results obtained using a 24-bit arithmetic show that for λ = 10 and spacing h = 0.2 the computed solution has relative errors of about 10^-3, while after applying the deferred correction the error reduces by an order of magnitude, which is slightly worse than the results in Example 12.10 using the shooting method. For λ = 50 and h = 0.2 the error in the solution is comparable to the earlier case, but there is little improvement when the deferred correction is applied. This behaviour can again be attributed to the fact that for this value of λ a higher order method is not likely to yield a significantly smaller error, since λh = 10. Using a smaller

value of h = 0.05 the results improve significantly. The error in computed results decreases by
an order of magnitude, while that in the corrected results is down by almost three orders of
magnitude. These errors can be compared with the expected truncation error in these results
which is O(h^2) and O(h^4), respectively.
Alternately, we can use the finite difference approximations (12.106) for the first and
second derivatives, to obtain a system of linear equations in y(tj). The matrix of this system
is tridiagonal and for the same mesh the number of unknowns is half that used in subroutine
FDM. Hence, this process is more efficient. This tridiagonal system of equations can be
easily solved using Gaussian elimination. No pivoting is required, since the equations are
diagonally dominant. The results obtained using this technique are very similar and the error
is generally larger by approximately a factor of two. Once again the truncation error in this
difference approximation is also O(h^2). The first-order correction can also be computed using
the method of deferred correction.
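The tridiagonal variant described in the last paragraph can be sketched as follows in Python (illustrative; a dense solver is used only for brevity, whereas the text solves the tridiagonal system directly by Gaussian elimination):

    import numpy as np

    lam, N = 10.0, 101
    t = np.linspace(0.0, 1.0, N)
    h = t[1] - t[0]

    # Three-point approximations (12.106) applied to
    # y'' - (lam - 1)y' - lam*y = -lam*t - lam + 1, y(0) = 1, y(1) = 1 + e^{-1}.
    A = np.zeros((N, N))
    b = np.zeros(N)
    A[0, 0] = 1.0;    b[0] = 1.0                      # boundary condition at t = 0
    A[-1, -1] = 1.0;  b[-1] = 1.0 + np.exp(-1.0)      # boundary condition at t = 1
    for i in range(1, N - 1):
        A[i, i-1] = 1.0/h**2 + (lam - 1.0)/(2.0*h)
        A[i, i]   = -2.0/h**2 - lam
        A[i, i+1] = 1.0/h**2 - (lam - 1.0)/(2.0*h)
        b[i] = -lam*t[i] - lam + 1.0

    y = np.linalg.solve(A, b)
    print(np.max(np.abs(y - (t + np.exp(-t)))))       # maximum error against y = t + e^{-t}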

EXAMPLE 12.13: Solve the boundary value problem in Example 12.11 using finite difference
method.
As noted in Example 12.11, the solution has a singularity just beyond the end of the interval and
it may be preferable to use a nonuniform mesh with close spacing around t = 1, but it is
difficult to guess the optimum distribution of the mesh points. Hence, for simplicity we use
a uniform spacing and try to estimate the error using the method of deferred correction. We
have used subroutine FDM with REPS = 10^-6, which ensures that the results converge to this accuracy. It should be noted that REPS does not put any limit on the truncation error, which could be considerably larger. The relative error in the solution using N = 51, 101, 201 mesh points is respectively ≈ 0.008, 0.002 and 0.0005 for λ = 5; and 0.25, 0.075 and 0.02 for λ = 10. After applying the deferred correction, the relative error in the three cases comes down to approximately 1 x 10^-4, 3 x 10^-5 and 7 x 10^-6 for λ = 5, and 0.12, 0.026 and 0.0043 for λ = 10. Thus, it can be seen that the error decreases as the number of points is increased, even though it does not fall off as h^2 and h^4 as expected. The departure from the expected behaviour is more marked for λ = 10, which is clearly due to the fact that higher order terms in the error expansion are not negligible.
In all cases, the iteration was started with essentially arbitrary values and it required
respectively six and nine iterations for λ = 5 and 10, to achieve the specified accuracy. This is much less than the number of trials required with the shooting method in Example 12.11.
Further, it can be seen that here the number of mesh points used is much less, but the effort
involved in one iteration of finite difference method using 201 mesh points is comparable to
that in one integration of initial value problem using 1500 function evaluations. If a finite
difference approximation to the second-order differential equation is used, the effort involved
in finite difference method could be much less for comparable accuracy. The accuracy achieved
using 401 mesh points would be better than that achieved in the shooting method. Further, it
is much simpler to use finite difference method, since the iteration converges from essentially
arbitrary values. Of course, it depends on the problem and there may be instances, where
finite difference methods may have more difficulty in convergence.

The choice between the shooting and finite difference method could de-
pend on several factors. Firstly, if the Jacobian matrix has widely separated
eigenvalues in the interval, then the initial value methods are likely to run into
difficulties, unless multiple shooting technique is used. In such cases, the use of
finite difference methods may be simpler. If the solution has wide variation in
its behaviour over the region, then it may be difficult to choose the mesh size
appropriately in the finite difference methods, while the initial value method
used in shooting technique can easily use adaptive control of step size. Hence,
for such problems, shooting method may appear to be advantageous. However,
As seen in Example 12.11, the solution in such cases may be very sensitive to the
choice of arbitrary initial conditions used in shooting techniques. Consequently,

it may be more difficult to use shooting technique, unless a good approximation


to starting values is known. For simple problems it may be easier to implement
shooting method, because the number of unknowns in that case is much less.
The shooting method requires very little memory, since the complete solution
need not be stored; while finite difference method requires the entire matrix
to be stored. If the size of the matrix is too large to fit in the primary mem-
ory of the computer, then it is possible to store it in the secondary memory
and keep only the part immediately required in the main memory. However,
this technique will necessarily slow down the execution. Singularities in the
equation may result in incorrect results being accepted using the finite differ-
ence method, since because of finite spacing between mesh points the difference
scheme may completely fail to detect singular behaviour in a very narrow re-
gion {49}. As with initial value problems, it is difficult to choose between the
two class of methods, since their reliability and efficiency depends crucially on
the implementation. Another technique is the expansion method discussed in
Section 12.10, where the solution is expanded in terms of some basis functions.

12.9 Eigenvalue Problem


The eigenvalue problem can be considered as a special case of the boundary
value problem, when the differential equations as well as the boundary con-
ditions are linear and homogeneous. As with algebraic eigenvalue problem, a
nontrivial solution exists only for special values of the parameter λ, which are
referred to as the eigenvalues, while the corresponding nontrivial solution is
referred to as the eigenfunction. The theory of eigenvalue problems in ordinary
differential equations is similar to that for algebraic eigenvalue problem. How-
ever, for differential eigenvalue problem, the number of eigenvalues could be
infinite, and in some cases it is even possible to have a continuum of eigenval-
ues, in the sense that any number in some finite region of complex plane could
be an eigenvalue. The eigenfunctions are arbitrary to an extent of a constant
multiplier.
For solving eigenvalue problems, once again, we can use either shooting
or finite difference methods. In either case, we will end up with a system of
homogeneous linear equations containing a parameter, which can be considered
as a generalised eigenvalue problem discussed in Chapter 11. In some cases,
the solution of differential eigenvalue problem leads to an algebraic eigenvalue
problem in the standard form. Since we are considering only linear equations,
it may be more convenient to use finite difference methods.
The simplest eigenvalue problems arising in physical systems are the
Sturm-Liouville problems defined by a second-order differential equation of the
form

d/dt ( p(t) dy/dt ) - q(t)y + λ r(t)y = 0,                                   (12.117)

with boundary conditions

α_1 y(a) + α_2 y'(a) = 0,        β_1 y(b) + β_2 y'(b) = 0,                   (12.118)

where p, q and r are functions of t such that p and r are positive in the required interval a ≤ t ≤ b, and α_1, α_2, β_1, β_2 are constants. This problem has an infinite
number of eigenvalues and the corresponding eigenfunctions form a complete
set of orthogonal functions satisfying

∫_a^b y_i(t) y_j(t) r(t) dt = 0,   i ≠ j,                                    (12.119)

where y_i(t) and y_j(t) are eigenfunctions corresponding to eigenvalues λ_i and


λ_j, respectively. Sturm-Liouville problems can be reduced to an algebraic eigen-
value problem by using a finite difference method on a uniform mesh with
spacing h, to get

(1/h) [ p(t_{j+1/2}) (y_{j+1} - y_j)/h - p(t_{j-1/2}) (y_j - y_{j-1})/h ] - q(t_j) y_j = -λ r(t_j) y_j.        (12.120)

In addition, if r(t_j) = 1 and the boundary conditions are y(a) = y(b) = 0,


then this problem can be reduced to a standard eigenvalue problem with a
symmetric tridiagonal matrix. Any required eigenvalue can be conveniently
calculated using the Sturm sequence property (Section 11.4).
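For the simplest case p = r = 1, q = 0 with y(a) = y(b) = 0, the following Python sketch (illustrative) builds the symmetric tridiagonal matrix obtained from (12.120) and compares its lowest eigenvalues with the exact values (mπ)^2; a Sturm-sequence/bisection routine could be used instead of the dense solver to pick out individual eigenvalues:

    import numpy as np

    # -y'' = lambda*y, y(0) = y(1) = 0: p = r = 1, q = 0 in (12.117),
    # with exact eigenvalues (m*pi)^2.
    N = 100                                  # number of interior mesh points
    h = 1.0 / (N + 1)
    T = (2.0*np.eye(N) - np.eye(N, k=1) - np.eye(N, k=-1)) / h**2
    eigs = np.linalg.eigvalsh(T)             # all eigenvalues, in ascending order

    m = np.arange(1, 4)
    print(eigs[:3])                          # lowest three discrete eigenvalues
    print((m*np.pi)**2)                      # 9.87, 39.48, 88.83; only the low ones agree well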
For a general system of differential equations it is convenient to reduce it
to a first-order system, before applying the finite difference approximation. If
the differential equation is

Cy ≡ D(t, λ)y'(t) - F(t, λ)y(t) = 0,                                         (12.121)

where D and F are n × n matrices. Here the matrix D must be nonsingular.


Now using a uniform mesh, we can write the finite difference equations

D(t_{j+1/2}, λ) (y_{j+1} - y_j)/h - F(t_{j+1/2}, λ) (y_j + y_{j+1})/2 = 0,   (j = 1, ..., N-1).        (12.122)

This equation coupled with the boundary conditions

(12.123)

give the required system of nN homogeneous equations in equal number of


unknowns, which can be written in the matrix form A(λ)y = 0. Thus, for a nontrivial solution to exist, we must have det(A(λ)) = 0. The roots of this
equation will give the eigenvalues. These roots may be found using any method
described in Chapter 7. If the eigenvalue problem can be reduced to the stan-
dard form Ay = λy, then we can apply the standard technique of reduction
to a more condensed form, before calculating the eigenvalues. However, if the
problem is not in the standard form, then it may not be possible to reduce

it to any condensed form. This is not a serious problem, since the matrix A
is already in a band form as shown in Figure 12.5 and further we are always
interested in only a few eigenvalues of this matrix.
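The following Python sketch (illustrative only) carries out this programme for the test problem y'' + λy = 0, y(0) = y(1) = 0, written as a first-order system and discretized with (12.122); the lowest eigenvalue is located by scanning det(A(λ)) for a sign change and then refining it with brentq:

    import numpy as np
    from scipy.optimize import brentq

    N, n = 21, 2                               # mesh points and order of the system
    h = 1.0 / (N - 1)

    def bvp_matrix(lam):
        """Band matrix A(lambda) for y1' = y2, y2' = -lambda*y1 with y1(0) = y1(1) = 0."""
        F = np.array([[0.0, 1.0], [-lam, 0.0]])
        A = np.zeros((n*N, n*N))
        A[0, 0] = 1.0                          # boundary condition y1(0) = 0
        for j in range(N - 1):                 # difference equations (12.122)
            r = n*j + 1
            A[r:r+n, n*j:n*(j+1)]     = -np.eye(n)/h - F/2.0
            A[r:r+n, n*(j+1):n*(j+2)] =  np.eye(n)/h - F/2.0
        A[-1, n*(N-1)] = 1.0                   # boundary condition y1(1) = 0
        return A

    det = lambda lam: np.linalg.det(bvp_matrix(lam))
    grid = np.linspace(1.0, 50.0, 200)         # scan for a sign change of det(A(lambda))
    vals = [det(g) for g in grid]
    k = next(i for i in range(len(grid) - 1) if vals[i]*vals[i+1] < 0.0)
    lam1 = brentq(det, grid[k], grid[k+1])
    print(lam1, np.pi**2)                      # lowest eigenvalue, close to pi^2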
It may be noted that, we have approximated the differential eigenvalue
problem by an algebraic eigenvalue problem, which introduces some truncation
error. We can expect the eigenvalues of this finite difference matrix A(λ) to
approximate those of the differential equation. Now the number of eigenvalues
of A may be nN or more, depending on the matrix elements. For example, if
all elements of A are polynomials in λ of degree less than or equal to m, then
the number of eigenvalues may be mnN. On the other hand, the differential
equation may have infinite number of eigenvalues. Such a situation appears to
be rather uncomfortable, but it can be shown that the lower eigenvalues are
approximated quite accurately, while the higher eigenvalues are poorly repre-
sented. In physical problems, the higher eigenfunctions have a larger number
of nodes (i.e., zeros) which imply larger derivatives. It can be shown that the
local truncation error in the finite difference approximation considered above
is (h^2/24)(Dy''' - 3Fy''). Now, if the eigenfunctions are of the form sin(mt), (m = 1, 2, ...), then the truncation error is O(h^2 m^3), which increases rapidly
with m. For values of m comparable to the number of mesh points N, the
error will be substantial. This is to be expected, since in general the mth eigen-
function has m nodes, which can never be approximated by any numerical
representation involving less than m points. Hence, it is only meaningful to
calculate the few lower eigenvalues of the finite difference matrix A. If a large
number of eigenvalues are required, then we have to use correspondingly larger
number of mesh points in the difference approximation.
For determining the eigenvector we can use the inverse iteration method.
For the generalised eigenvalue problem, it is necessary to find the eigenvalue
accurately before using inverse iteration. It should be noted that reduction to
first-order system of differential equations generates a larger matrix as com-
pared to that for the equivalent system of higher order differential equations.
Further, if the difference scheme (12.122) is applied to Sturm-Liouville prob-
lems, then the resulting matrix is not symmetric tridiagonal and even the alge-
braic eigenvalue problem is not in the standard form. Thus, for some problems
it may be better to use special methods to obtain the matrix in a convenient
form.
To obtain higher order accuracy, we can use h → 0 extrapolation. We can
compute the eigenvalue for different mesh spacing h and apply extrapolation
technique to find more accurate value corresponding to h = O. Alternately, we
can also use a higher order difference scheme, but as noted in the previous
section, that increases the bandwidth of the matrix and may introduce some
instabilities in the solution. It is possible to apply the method of deferred correc-
tion to the eigenvalue problem also. As in the previous section, we can estimate
the local truncation error in each interval using (12.108). Thus, if the trunca-
tion error terms are transferred to the left-hand side, we get the corresponding
matrix A + δA. If y and u are respectively the right and left eigenvectors of
636 Chapter 12. Ordinary Differential Equations

matrix A corresponding to eigenvalue A; and A + t5A and y + t5y respectively, be


the corresponding eigenvalue and right eigenvector of A + t5A. Then we have

A(A)Y = 0, u T A(A) = OT, (A(A + t5A) + t5A(A + t5A)) (y + t5y) = O.


(12.124)
Writing A(A+t5A) = A(A)+t5AA'(A), where A' = 8Aj8A, and neglecting higher
order terms in perturbations, we get

A(A)t5y + t5AA' (A)y + t5A(A)Y = o. (12.125)

Multiplying this equation from the left by u T, we get

t5A = _ u T (t5A(A))Y (12.126)


u T A'(A)y ,
which gives the first-order correction to the eigenvalue. Calculating first-order
correction to the eigenvector requires considerable effort and it may not be
necessary, since in most physical problems we are interested in accurate eigen-
values rather than eigenfunctions. It may be noted that the vector δAy in the
numerator is just the estimated local truncation error in each finite difference
equation, which is given by

(12.127)

The derivatives y″ and y‴ can be calculated using a four-point central dif-
ference approximation. As noted in the previous section, at the end points we
have to use the corresponding off-centred formulae. If the eigenvalues of A have
been calculated by finding zeros of the determinant, then the triangular de-
composition is already available and the eigenvectors can be easily found using
the inverse iteration method. The same triangular decomposition can be used
for both the right and left eigenvectors. Further, if the eigenvalue has been
calculated to a good accuracy, then inverse iteration will converge in the first
step itself. Hence, calculating the left and right eigenvectors requires very little
extra computation. Usually, calculating the coefficients of the right-hand side
requires more effort than solving the equations.
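As an illustration of the correction formula (12.126), the following sketch (in Python with NumPy, rather than the book's Fortran routines; the matrices and names are purely hypothetical) evaluates δλ for a problem of the form A(λ) = B − λC, given an approximate eigenvalue, the corresponding right and left eigenvectors, and an estimate δA of the truncation-error terms:

import numpy as np

def eigenvalue_correction(dA, dA_dlam, y, u):
    # First-order correction (12.126): dlam = -u^T (dA) y / (u^T A'(lam) y)
    return -(u @ dA @ y) / (u @ dA_dlam @ y)

# Hypothetical example: A(lam) = B - lam*C, so that A'(lam) = -C.
B = np.array([[2.0, -1.0, 0.0],
              [-1.0, 2.0, -1.0],
              [0.0, -1.0, 2.0]])
C = np.eye(3)
lam = 2.0 - np.sqrt(2.0)                 # an exact eigenvalue of B y = lam C y
A = B - lam * C
w, V = np.linalg.eigh(A)                 # A is symmetric here, so u = y
y = V[:, np.argmin(np.abs(w))]           # right null vector of A(lam)
u = y.copy()                             # left null vector (same, by symmetry)
dA = 1e-3 * np.diag([1.0, 2.0, 3.0])     # stand-in for the truncation-error matrix
print(eigenvalue_correction(dA, -C, y, u))

For a nonsymmetric A the left eigenvector would have to be computed separately, for instance by inverse iteration with Aᵀ as described above.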
If Gaussian elimination with partial pivoting is used to solve the finite
difference equation, then we obtain a triangular decomposition of the form
LU = NA, where N is the permutation matrix (Section 3.3). Noting that
N² = I, we get Aᵀ = UᵀLᵀN. Hence, the matrix equation involving Aᵀ can
be solved by solving three systems involving the matrices Uᵀ, Lᵀ and N in that
order. The matrix Uᵀ is a lower triangular matrix, whose elements are stored
in the upper triangle of the original matrix A. This equation can be easily
solved using forward substitution. The matrix Lᵀ is a unit upper triangular
matrix, whose superdiagonal elements are stored in the lower triangle of the
original matrix. This equation can be solved using back-substitution. However,
the elements of this matrix are not in the right place, since during Gaussian
elimination with partial pivoting, we do not exchange the entire row, but only
the elements to the right of the pivot column. This problem can be rectified if
the entire row is exchanged whenever pivoting is required. However, in that case
nonzero elements may be introduced outside the blocks considered in subroutine
GAUBLK. In fact, any element in the lower triangle of the matrix can become
nonzero, if this procedure is adopted. As a result, in subroutine GEVP which
implements this technique, we find the actual position of the elements by using
the information on exchanges. If we are considering the subdiagonal elements
in jth column, then the actual position of the element in the matrix with full
exchange can be found by considering all exchanges that have taken place after
this column was processed. Let us assume that we wish to find the actual
position of the kth element (k > j). If any row k′ between j + 1 and k − 1 is
exchanged with the kth, then this element should have been in the (k′, j) position, while
if none of these rows have been exchanged with the kth and there is no exchange at
the kth step, then the element retains its position. On the other hand, if there is an
exchange in the kth step, then we have to follow the sequence until we reach a step
where no interchange has taken place, or the element has been exchanged with
some previous element. For example, if row k is exchanged with k₁ (> k), then
once again three possibilities arise: (i) row k₁ has been exchanged with some
row between k + 1 and k₁ − 1, (ii) the element (k₁, j) retains its position, (iii)
it has been exchanged with an element k₂ in the forward direction (i.e., k₂ > k₁).
In the first two cases the search is complete, while in the third case we have to
consider the row k2 in a similar manner. Thus, in all cases the actual position
of any subdiagonal element can be found. Nevertheless, this procedure could
require significant amount of computer time. The final step in solving the matrix
equation for AT involves equation with matrix N of row interchanges. Since in
actual algorithm the matrix is obtained by a sequence of exchanges, the effect
can be undone by exchanging the rows in the reverse order.
If the eigenvalue problem for the differential equation (12.121) is Hermi-
tian, then the process of deferred correction can be simplified. For a Hermitian
operator,

∫ₐᵇ x(t)* C y(t) dt = ∫ₐᵇ y(t) (C x(t))* dt,   (12.128)

for all x(t) and y(t), which satisfy the required boundary conditions. In this
case, just like that for Hermitian matrices the eigenvalues are real and the left
and right eigenfunctions are identical. Thus, for such problems we do not need
to calculate the left eigenfunction separately and some part of the calculation
can be avoided. Even if the eigenvalue problem is Hermitian, the matrix A
obtained using a finite difference approximation to it may not be Hermitian,
unless proper difference approximation is used to approximate the differential
operator.
For large matrices, calculating the determinant usually leads to overflow or
underflow. Hence, the determinant should preferably be calculated in a scaled
form det(A) = d·2^i, with the exponent i given separately. If the routine for
finding zeros also handles this form, then we can avoid overflow and underflow.
Scaling the equations appropriately may help, but for really large matrices it is
difficult to scale the equations to avoid overflow and underflow. For example, if
the total number of variables is of the order of 1000, which is not unreasonable
in realistic problems, multiplying every equation by a factor of 2 causes the
determinant to be multiplied by a factor of 2^1000 ≈ 10^301, which may cause
overflow on most machines. If coefficients of the equation matrix are varying in
magnitude, it may be impossible to scale them appropriately to avoid overflow
and underflow, unless the determinant is expressed in a scaled form.
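The scaled form det(A) = d·2^i can be accumulated during the elimination itself, as in the following sketch (Python/NumPy; the function name and the test matrix are only illustrative, and the book's band routines would of course work on the band storage rather than on a full matrix):

import numpy as np

def scaled_determinant(a):
    # Returns (d, i) with det(a) = d * 2**i and 0.5 <= |d| < 1 (or d = 0).
    a = np.array(a, dtype=float)
    n = a.shape[0]
    d, expo = 1.0, 0
    for k in range(n):
        p = k + np.argmax(np.abs(a[k:, k]))      # partial pivoting
        if a[p, k] == 0.0:
            return 0.0, 0
        if p != k:
            a[[k, p]] = a[[p, k]]
            d = -d                               # a row exchange flips the sign
        d *= a[k, k]
        d, e = np.frexp(d)                       # renormalise so that 0.5 <= |d| < 1
        expo += e
        a[k+1:, k] /= a[k, k]
        a[k+1:, k+1:] -= np.outer(a[k+1:, k], a[k, k+1:])
    return d, expo

d, i = scaled_determinant(2.0 * np.eye(1000))    # det = 2**1000 would overflow directly
print(d, i)                                      # prints 0.5 1001, i.e. det = 0.5 * 2**1001

A zero-finding routine that works directly on the pair (d, i) can then locate zeros of the determinant without ever forming the possibly huge or tiny number itself.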
As noted in Section 12.7, an eigenvalue problem can be reduced to the
standard boundary value problem by introducing an additional equation. This
reduction transforms a linear equation to a nonlinear equation, since terms of the
form λy become nonlinear in the variables, once λ is also a component of y.
Hence, finite difference equations have to be solved iteratively. This iteration
requires an initial guess for not only the eigenvalue, but also the eigenvector.
Further, introducing extra equation increases the size of the matrix and each
iteration with finite difference matrix requires more effort. On the other hand,
following the method described in this section, the eigenvalues are found by
calculating the zeros of determinant, which is again an iterative process. But
in this case, there is only one unknown to be found and the process is much
simpler.
EXAMPLE 12.14: Solve the following eigenvalue problem which defines Spheroidal Harmonics:

d/dt[(1 − t²) dS/dt] + (λ − c²t² − m²/(1 − t²)) S = 0,   (12.129)

subject to the boundary conditions that the solution is regular at the singular points t = ±1.
Before attempting the numerical solution, we should formulate the boundary conditions
in appropriate form. In the neighbourhood of the singularities, we can expand the solution
in a power series of the form

S(t) = (1 ± t)^α Σ_{j=0}^∞ a_j (1 ± t)^j.   (12.130)

Substituting this expansion in the differential equation, we can obtain the characteristic
equation to determine α, which gives α = ±m/2. Now assuming that m > 0, the regular
solution corresponds to α = m/2. It is convenient to factor out this behaviour from the
solution of the differential equation by defining a new variable y by

S(t) = (1 − t²)^{m/2} y(t).   (12.131)

Substituting this expression in the differential equation, we get the equation in terms of the
new variable

(1 − t²) d²y/dt² − 2(m + 1)t dy/dt + (μ − c²t²)y = 0,   (12.132)

where μ = λ − m(m + 1). Since the equation is invariant under the transformation t → −t, the
eigenfunctions will be either symmetric or antisymmetric about t = 0, that is y( -t) = ±y(t).
Using this symmetry, the range of problem can be reduced to [0,1]. The boundary condition
at t = 0 is determined by the symmetry requirements. Thus, for symmetric solutions, we get
y'(O) = 0 while for antisymmetric solutions we have y(O) = O. The boundary condition at
t = 1 is determined by the requirement of regularity. Now the solution can be expanded in a
simple power series
y(t) = Σ_{j=0}^∞ a_j (1 − t)^j.   (12.133)
Figure 12.6: Spheroidal harmonics for m = 2, c² = 1 (the curves correspond to μ = 0.1409, 14.402 and 36.455).

Substituting this expansion in the differential equation, we can calculate the higher coefficients
in terms of a₀. For our purpose, it is sufficient to calculate a₁, which is given by

a₁ = − (μ − c²)/(2(m + 1)) a₀.   (12.134)

This yields the boundary condition

y′(1) = (μ − c²)/(2(m + 1)) y(1).   (12.135)

In order to solve this problem numerically, it may be more convenient to write it as


a system of two coupled first-order differential equations, by introducing the new variables
y₁ = y and y₂ = y′. Using this form we can use subroutine GEVP in Appendix B to
find any required eigenvalue and the corresponding eigenvector. The results for the first two
eigenvalues for even functions with m = 2 and c² = 1 are shown in Table 12.7. Here μ_fdm is
the eigenvalue as calculated using the finite difference approximation, while μ_cor is the estimated
eigenvalue after applying the deferred correction. It can be seen that the error in the calculated
eigenvalue decreases as h². Using the method of deferred correction it is possible to achieve

Table 12.7: Solving the eigenvalue problem for the spheroidal harmonics

                 μ = 0.1409490                                μ = 14.40235
  N    μ_fdm      Rel. err.    μ_cor      Rel. err.    μ_fdm      Rel. err.    μ_cor      Rel. err.
  6   .1410852   9.7 × 10⁻⁴   .1409485   3.7 × 10⁻⁶   14.41293   7.3 × 10⁻⁴   14.42849   1.8 × 10⁻³
 11   .1409824   2.4 × 10⁻⁴   .1409489   4.2 × 10⁻⁷   14.40907   4.7 × 10⁻⁴   14.40340   7.3 × 10⁻⁵
 21   .1409573   5.9 × 10⁻⁵   .1409490   0.0          14.40417   1.3 × 10⁻⁴   14.40237   1.5 × 10⁻⁶
 41   .1409511   1.5 × 10⁻⁵   .1409490   0.0          14.40281   3.2 × 10⁻⁵   14.40235   6.6 × 10⁻⁸
much higher accuracy, as the error falls off as h⁴. The correct eigenvalues are 0.14094899 ...
and 14.402353 ... , respectively. It can also be seen that the errors are significantly larger for
the second eigenvalue, which is to be expected from the eigenfunctions shown in Figure 12.6.
The second eigenfunction has larger derivatives and the error is also expected to be larger.
Alternately, we can formulate this problem as a standard boundary value problem, by
introducing an additional variable y₃ = μ and an additional boundary condition to normalise
the eigenfunctions. This gives the equations

dy₂/dt = [2(m + 1)t y₂ − (y₃ − c²t²)y₁] / (1 − t²),   dy₃/dt = 0,   (12.136)

subject to the boundary conditions

y₂(0) = 0,   y₁(1) = 1.   (12.137)

Here except for the last boundary condition, all other equations are homogeneous. Hence,
this boundary condition ensures the normalisation of the eigenfunctions. Using subroutine
FDM it is possible to solve this problem. The result is essentially the same as that shown in
Table 12.7. However, the computer time required using this technique is larger by a factor of
1.5 to 2, as compared to that using subroutine GEVP.
It will be more efficient if the eigenvalue problem is discretised by using a finite dif-
ference approximation for the second derivative. This approximation reduces the size of the
finite difference matrix by a factor of two. It is also possible to use the shooting method to solve
the eigenvalue problem. Since the differential equation has a singularity at t = 1, we can start
the integration from t = 1 − ε (ε > 0) using the known expansion for the function close to
t = 1. Thus, using the first two terms we get the initial values

y₁(1 − ε) = 1 − ε(μ − c²)/(2(m + 1)),   y₂(1 − ε) = (μ − c²)/(2(m + 1)).   (12.138)

For any given value of μ, this initial value problem can be solved to find the solution at
t = 0. The eigenvalue μ can be determined by requiring that y₂(0) = 0. This equation can be
solved using secant iteration, or any other convenient method for finding the zero of a nonlinear
function. Starting the integration from t = 0.99, it is possible to obtain both the eigenvalues
correct to seven significant figures. The time required is comparable to that required by the
finite difference method.
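The following sketch (Python/SciPy, not the GEVP or FDM routines; the brackets for μ are taken from Table 12.7 and the tolerances are illustrative assumptions) implements this shooting approach for the even modes with m = 2, c² = 1:

import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import brentq

m, c2, eps = 2, 1.0, 0.01                  # parameters of Example 12.14; start at t = 1 - eps

def rhs(t, y, mu):
    # y[0] = y, y[1] = y'; equation (12.132) as a first-order system
    return [y[1], (2.0*(m + 1)*t*y[1] - (mu - c2*t*t)*y[0]) / (1.0 - t*t)]

def shoot(mu):
    # initial values (12.138) at t = 1 - eps, integrated down to t = 0
    y0 = [1.0 - eps*(mu - c2)/(2*(m + 1)), (mu - c2)/(2*(m + 1))]
    sol = solve_ivp(rhs, (1.0 - eps, 0.0), y0, args=(mu,), rtol=1e-10, atol=1e-12)
    return sol.y[1, -1]                    # y'(0), which must vanish for even modes

for lo, hi in [(0.0, 1.0), (10.0, 20.0)]:  # brackets assumed to contain a sign change
    print(brentq(shoot, lo, hi, xtol=1e-10))   # expect about 0.1409490 and 14.40235

Secant iteration, as mentioned in the text, could be used instead; the bracketing method brentq is chosen here only because it is readily available.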

12.10 Expansion Methods


Instead of approximating the solution by its value at selected grid points, we
can approximate the solution by an expansion in terms of suitably selected
basis functions. For example, we can use the B-spline basis functions to expand
the solution. This expansion is then substituted in the differential equation
and boundary conditions to obtain the required equations to determine the
coefficients of expansion. Let us consider the expansion
y_j(t) = Σ_{i=1}^m a_{ij} φ_i(t),   (12.139)

where φ_i(t) are the m independent basis functions. Here for simplicity we have
assumed that each component of y is expanded separately in terms of the same
basis functions. If there are n components of y then we need to determine the
mn coefficients of expansion a_{ij}. Substituting this expansion in the differential
equation, we get

Σ_{i=1}^m a_{ij} φ′_i(t) = f_j(t, y),   (12.140)

where φ′_i(t) = dφ_i/dt. If the right-hand side is linear in y, we can obtain a linear
equation connecting the coefficients a_{ij}. This equation can be written at any point
t. Thus if we choose a set of points t_j, j = 1, ..., N, then we get nN equations.
In addition the boundary conditions would also give n equations, giving a total
of n(N + 1) equations. If we choose N + 1 = m, then we can solve this system
of equations to determine the coefficients a_{ij}. If we choose N + 1 > m, then
we can obtain the least squares solution to determine the coefficients. If the
differential equations are nonlinear, then the resulting system of equations in
a_{ij} will also be nonlinear and the solution will need to be calculated iteratively.
For nonlinear differential equations we can linearise the equations (12.140)
to obtain the required set of equations which need to be solved iteratively. For
this purpose we can express the coefficients as a_{ij} + δa_{ij}, where the first part is
the current approximation to the coefficients and δa_{ij} is the correction which needs
to be determined. We can linearise the equations in δa_{ij} to obtain

(12.141)

Similar equations can be written for the boundary conditions, and the resulting
system of (N + 1)n equations can be solved for δa_{ij} to calculate the correction to
be added to the previous iterate. This process can be continued till the solution
converges to some predetermined accuracy. In general, the number of unknown
coefficients can be quite large and convergence of the iteration is not guaranteed.
Comparing this method with the finite difference methods considered in
Section 12.8, we find that in this case the number of coefficients are determined
by the number of basis functions, instead of the number of mesh points. In most
cases, the number of basis functions required to achieve a specified accuracy
is much less than the number of mesh points to achieve the same accuracy.
Thus the system of equations to be solved in expansion methods is smaller
than that in finite difference methods. If the basis functions are chosen care-
fully, this number could be significantly smaller. However, the finite difference
methods give rise to a banded system of equations while the system of equa-
tions arising from expansion methods may be filled. It may also be difficult to
choose the points t_j such that their number is just sufficient to determine the
coefficients. In most cases this number will need to be larger and the solution
will need to be calculated in the least squares sense. This may require larger
effort as compared to solving a finite difference equation with the same number
of unknowns. On the other hand, expansion methods have the advantage that
once the coefficients are determined the solution as well as its derivatives can
be easily computed at any required point.
Subroutine BSPODE in Appendix B provides an implementation of ex-


pansion method using B-spline basis functions based on a specified set of knots.
For B-splines of order k the number of basis functions on n knots is n + k − 2.
Hence, for k ≠ 3 it is not possible to use the same set of knots for determin-
ing the coefficients. Thus this routine requires another set of points to define
the system of equations to be solved to determine the coefficients and a least
squares solution is computed using SVD. This may require more effort.
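The basic idea can be illustrated with the following sketch (Python/SciPy, not the BSPODE routine; the test equation y′ + y = 0, the knot set and the number of collocation points are all illustrative choices). The solution is expanded in B-splines, the differential equation is imposed at more points than there are coefficients, and the overdetermined system together with the boundary condition is solved in the least squares sense:

import numpy as np
from scipy.interpolate import BSpline

# Solve y' + y = 0, y(0) = 1 on [0, 1]; the exact solution is exp(-t).
deg = 3                                        # cubic splines (degree 3, i.e. order 4)
interior = np.linspace(0.0, 1.0, 11)           # knots
knots = np.r_[[0.0]*deg, interior, [1.0]*deg]  # clamped knot vector
nb = len(knots) - deg - 1                      # number of basis functions

def basis(pts):
    B = np.empty((len(pts), nb)); dB = np.empty((len(pts), nb))
    for i in range(nb):
        c = np.zeros(nb); c[i] = 1.0
        spl = BSpline(knots, c, deg)
        B[:, i], dB[:, i] = spl(pts), spl.derivative()(pts)
    return B, dB

pts = np.linspace(0.0, 1.0, 3*len(interior))   # three times as many points as knots
B, dB = basis(pts)
A = np.vstack([dB + B, basis(np.array([0.0]))[0]])    # equation rows plus boundary row
rhs = np.r_[np.zeros(len(pts)), 1.0]
coef, *_ = np.linalg.lstsq(A, rhs, rcond=None)

y = BSpline(knots, coef, deg)
print(np.max(np.abs(y(pts) - np.exp(-pts))))   # maximum error of the computed solution

Here the overdetermined system is solved with np.linalg.lstsq, which is itself based on a singular value decomposition, so it corresponds closely to the SVD solution mentioned above.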
EXAMPLE 12.15: Solve the boundary value problem in Example 12.11 using expansion
method with B-spline basis functions.
As noted in Example 12.11, the solution has a singularity close to the interval and it
may be preferable to use nonuniformly distributed knots with close spacing around t = 1,
but it is difficult to guess the optimum distribution of the mesh points. Hence, for simplicity
we use a uniform spacing and try to estimate the error by solving the equations with different
number of knots or with different order of B-splines. We have used subroutine BSPODE with
REPS = 10- 7 , which ensures that the results converge to this accuracy. Since the routine did
not converge to an accuracy of better than 10- 5 with 24-bit arithmetic, all calculations are
done using 53-bit arithmetic. It should be noted that, REPS does not put any limit on the
truncation error, which could be considerably larger. We use 20, 50 and 100 knots with the
order of B-splines k = 8,16. For lower order splines the results did not converge satisfactorily.
The use of high order spline is probably dictated by almost singular behaviour of the solution
near t = 1. The error decreases very rapidly as the number of knots is increased and also when
the order of B-splines, k is increased. For N = 20 and A = 10 the iteration did not converge to
specified accuracy, while for all other cases the iteration converged from essentially arbitrary
initial choice of all coefficients equal to unity. For A = 5 and k = 8 the relative error in the
computed solution is about 0.01,0.0008,0.000009 for N = 20,50,100 respectively. For k = 16
the error drops to 6 x 10- 6 for N = 20, while for N = 50, 100 it is close to the roundoff limit.
For A = 10 the errors are much larger and even for k = 8 and N = 100 it is about 0.06. For
k = 16 the error drops to 0.007.
In most cases the number of points to obtain the equations was chosen to be three
times the number of knots, but for N = 20 and A = 10 it was necessary to choose larger
number of points to get stable results.

The expansion method can also be applied to the eigenvalue problem


provided the number of points is chosen such that N + 1 = m. In that case we
will get a linear homogeneous system of equations and zeros of the determinant
will give the eigenvalues, while the eigenvectors can be found using inverse
iteration. However, we will have to choose the points tj carefully to cover the
region with minimum number of points. Instead of B-splines we can use any
other set of basis functions but the resulting system of equations may be ill-
conditioned unless the basis functions are chosen carefully.

12.11 Some Special Techniques


12.11.1 Infinite Range of Integration
If one or both end points are at infinity, then it is not possible to apply the
boundary condition in a straightforward manner. As in the case of integration
over infinite range, several alternatives are possible. In the simplest case, we
can choose a sufficiently large value of t as the boundary and apply the required
condition there, which introduces additional truncation error in the solution.
This error can be estimated by repeating the calculation with boundary condi-
tion being imposed at a different point. It is important to ensure that the new
point is substantially different (e.g., by a factor of two). This technique usu-
ally requires the boundary point to be chosen very far, in order to reduce the
truncation error due to restricting the interval. Consequently, a large number
of points will be required to approximate the solution. In order to improve the
efficiency, we can use an extrapolation technique similar to h → 0 extrapolation
or the ε-algorithm, for obtaining the limiting value as |b| → ∞.
If it is possible to obtain the asymptotic solution of the differential equa-
tion in the limit t --> 00, then we can use this solution to apply a better condition
at a finite t. For example, if we know that the solution is of the form y ~ ce Qt
as t --> 00, where a is a known constant, then it is possible to use the boundary
condition y' = ay at a suitably chosen boundary t = b. In this case also, it is
necessary to estimate the truncation error by repeating the calculation with a
different value of b. If a series of similar problems are being solved, it is better
to do some experimentation to find an optimum value of b. If the value of b
is too small, then truncation error is large while if it is too large, then the
number of steps required to integrate the equation will increase leading to a
loss in efficiency. In some cases, it may be possible to transform the indepen-
dent variable, such that the range becomes finite. However, in most cases, such
transformation leads to a singularity which may be equally difficult to treat.
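As a small illustration (Python/SciPy; the equation, boundary positions and tolerances are illustrative assumptions, not an example from the text), consider y″ + y′/t − y = 0 on [1, ∞), whose decaying solution is the modified Bessel function K₀(t) ≈ √(π/(2t)) e⁻ᵗ. Imposing the asymptotic condition y′(b) = −y(b) at a finite b and repeating with a larger b shows how the truncation error due to the finite boundary can be estimated:

import numpy as np
from scipy.integrate import solve_bvp
from scipy.special import k0

def solve_on(b):
    # y'' + y'/t - y = 0 with y(1) = K_0(1) and the asymptotic condition y'(b) = -y(b)
    def rhs(t, y):
        return np.vstack([y[1], y[0] - y[1]/t])
    def bc(ya, yb):
        return np.array([ya[0] - k0(1.0), yb[1] + yb[0]])
    t = np.linspace(1.0, b, 200)
    y0 = np.vstack([np.exp(-t), -np.exp(-t)])          # rough initial guess
    return solve_bvp(rhs, bc, t, y0, tol=1e-9, max_nodes=100000)

for b in (5.0, 10.0):                                  # repeat with a substantially larger b
    sol = solve_on(b)
    print(b, abs(sol.sol(2.0)[0] - k0(2.0)))           # error at t = 2 shrinks as b grows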
Equations with singularity at one or both the end points can also be
treated in a similar manner. In that case, we can use analytic solution in the
neighbourhood of the singularity, to apply the boundary conditions at a point
slightly away from the singularity, thus avoiding it in numerical solution (see
Example 12.14).

12.11.2 Highly Oscillatory Solution


If the solution of a differential equation is highly oscillatory, then we will be
forced to use a step size which is significantly smaller than the period of oscil-
lation. If the period itself is not changing rapidly, then it is possible to apply
some transformation to yield a more tractable form of equations. For example,
consider the equation

y" + A(t)y = 0, A(t) » 1, (12.142)

which needs to be integrated over several periods. In many practical cases, the
function A(t) itself is a slowly varying function, where by slowly varying we
mean that the variation in A(t) over one period 2π/√A is small. In this case, we
can try to get a smoother function by separating the amplitude and phase of
oscillation using the Madelung transformation y(t) = x(t) cos(z(t)). Substitut-
ing this expression in (12.142) and separating into two differential equations by
equating the coefficients of the sine and cosine terms, yields

x" - (Z')2 X + Ax = 0, XZ" + 2x' z' = O. ( 12.143)



The second equation is separable and can be integrated once to give z′ = c/x²,
where c is an arbitrary constant, which can be determined by the boundary
conditions. Using this solution, the first equation becomes

x″ + Ax − c²/x³ = 0,   (12.144)

which can be integrated independently to give the amplitude x(t). Once x is
known, we can obtain the phase z by a simple quadrature using z′ = c/x².
Thus, in this approach, the equation with highly oscillatory solution has been
transformed to a system of two nonlinear equations with slowly varying com-
ponents. The increase in work due to increasing the number of equations may
be more than compensated by using a much larger step size.
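A short sketch of this procedure (Python/SciPy; the choice A(t) = 10⁴(1 + t), the interval and the tolerances are illustrative assumptions) integrates the smooth amplitude equation (12.144) together with the phase z′ = c/x², and compares the reconstructed y = x cos z with a direct integration of (12.142):

import numpy as np
from scipy.integrate import solve_ivp

A = lambda t: 1.0e4 * (1.0 + t)          # slowly varying on the scale of one period
c = np.sqrt(A(0.0))                      # constant in z' = c/x**2, fixed by the initial data

def rhs(t, u):
    x, xp, z = u                         # amplitude, its derivative, and the phase
    return [xp, -A(t)*x + c**2/x**3, c/x**2]

sol = solve_ivp(rhs, (0.0, 10.0), [1.0, 0.0, 0.0], rtol=1e-10, atol=1e-12,
                dense_output=True)
x, xp, z = sol.sol(10.0)
print(x*np.cos(z))                       # y(10) reconstructed from amplitude and phase

# Direct integration of y'' + A(t) y = 0 for comparison; this has to resolve
# every oscillation and therefore takes far more (and far smaller) steps.
ref = solve_ivp(lambda t, u: [u[1], -A(t)*u[0]], (0.0, 10.0), [1.0, 0.0],
                rtol=1e-10, atol=1e-12)
print(ref.y[0, -1])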

12.11.3 Automated Allocation of Mesh Points


In finite difference methods, it is difficult to use a variable step size effectively,
since the mesh is chosen before the solution is known. If we know the rough
form of the solution, then it is possible to choose a smaller step size in regions
where the solution is changing rapidly and a coarser mesh elsewhere. If the
solution is not known initially, we can compute an approximate solution using
uniform spacing and then try to improve on it by allocating more points, where
the function is varying rapidly. However, such crude techniques are useful only
if a large number of similar problems are being solved.
It is possible to automate the allocation of mesh points, similar to the
adaptive choice of step size in initial value methods. To apply this technique,
let us transform the independent variable from t to x, such that a uniform
spacing of (say) 1 in x, gives the required spacing in t. In the new variable, the
equations can be written as
dy/dx = f dt/dx,   (12.145)
and we can easily write the corresponding finite difference approximation. In
the new variable x, the boundary conditions are applied at x = 0 and x = N.
Here dt/dx gives the mesh spacing in terms of the original variable, while dx/dt
gives the density of mesh points. Now if we can give an expression for the density
of mesh points in terms of the solution y, then we can add one more equation
for dt / dx to the system. Solving all these equations together will automatically
allocate the mesh according to our requirement.
We can choose the density of mesh points to be larger, where the solution
is varying steeply. The problem here is that, it is difficult to give a formula
for density of mesh points which adds up to the required number of points. To
overcome this problem, we introduce another independent variable X which
differs from x only in the unknown proportionality constant. This variable
satisfies the equation
d²X/dx² = 0,   (12.146)
which can be written as a system of two coupled first-order equations. These


equations can be added to our system of differential equations. To complete the
prescription, we add a third equation specifying the required density of mesh
points
dX/dt = (dX/dx)(dx/dt) = φ(t, y(t)),   (12.147)

where φ(t) is chosen by us. Hence, we get the third equation

dt/dx = [1/φ(t, y(t))] dX/dx.   (12.148)

The corresponding additional boundary conditions can be

X(0) = 0,   t(0) = a,   t(N) = b.   (12.149)

To use automated allocation of mesh points, we add these three equations


to the original system. By changing the chosen function φ(x, y), we can exper-
iment with different strategies of choosing the mesh points. For example, we
can choose

φ(t) = 1/Δ + |dy/dt| / (δ|y|),   (12.150)

where Δ and δ are constants. Here if y has more than one component, then we
can consider the maximum value of (dy_j/dt)/y_j. The first term by itself gives
a uniform spacing in t, while the second term forces more grid points in regions
where y is changing rapidly. By choosing the constants δ and Δ, we can give
appropriate weightage to these two strategies.
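The effect of such a density function can be seen in the following simplified sketch (Python; it equidistributes the density (12.150) over a known trial solution rather than solving the augmented system for t(x), and the function y(t) = 2 + tanh(20(t − 0.5)) and the constants Δ, δ are purely illustrative):

import numpy as np

# Trial solution with a sharp transition near t = 0.5 (illustrative only).
y  = lambda t: 2.0 + np.tanh(20.0*(t - 0.5))
dy = lambda t: 20.0/np.cosh(20.0*(t - 0.5))**2

Delta, delta = 1.0, 0.2
phi = lambda t: 1.0/Delta + np.abs(dy(t))/(delta*np.abs(y(t)))

# Place mesh points at equal increments of the integral of phi (equidistribution).
tt = np.linspace(0.0, 1.0, 2001)
F = np.concatenate([[0.0], np.cumsum(0.5*(phi(tt[1:]) + phi(tt[:-1]))*np.diff(tt))])
mesh = np.interp(np.linspace(0.0, F[-1], 21), F, tt)
print(np.round(mesh, 3))                 # the points cluster around t = 0.5

The same density, used as φ(t, y) in (12.147)–(12.148), lets the mesh adapt automatically as part of the boundary value solution.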

12.11.4 Internal Singularities


As mentioned earlier, if a singularity occurs at the end points, then we can use
an analytic approximation to the solution in the neighbourhood of the singular-
ity to impose appropriate boundary condition on the solution. In most physical
problems, singularities define the boundary conditions required to specify the
solution uniquely. Even if the singularity occurs at a fixed point inside the
interval, a similar technique can be applied by introducing extra boundary
conditions at that point from both sides. However, if the location of the singu-
larity itself depends on the solution, then it is difficult to use this technique.
For example, consider a differential equation of the form

dy/dt = N(t, y)/D(t, y),   (12.151)

where the denominator D(t, y) vanishes at some intermediate point. In a phys-


ical problem with a smooth finite solution, the numerator must also vanish at
the same point in such a manner that the ratio has a finite limit. This con-
straint on the solution is sometimes referred to as regularity condition, which is
essentially an extra boundary condition. This extra boundary condition may be


required to define the solution uniquely, or to fix the value of some parameter
in the differential equation.
Such problems can be treated by a technique similar to that for the free
boundaries discussed in Section 12.7. Let us assume that the singularity occurs
at an unknown point t_s, then we can add an equation

z = t_s − a,   dz/dt = 0.   (12.152)

Now we can perform a change of variable to x, by setting

t − a = xz,   0 ≤ x ≤ 1.   (12.153)

In terms of the new variable, the singularity occurs at x = 1, where we can
apply the boundary conditions

N(t, y) = 0,   D(t, y) = 0.   (12.154)

In principle, these boundary conditions may not be sufficient to ensure regularity


of the solution, since the denominator may have a zero of higher order than
that of the numerator. But in most practical problems, this situation may not
arise.
This problem can also be tackled by the technique of automated allocation
of mesh points discussed in Section 12.11.3. In that case, we can choose the new
variable such that the singularity occurs at x = n, the number of mesh points,
while the variable X is essentially the original variable t (apart from some
shift), which gives the function φ(t) = 1. The extra boundary conditions for
these problems in terms of the new variable x can be given by

t(0) = a,   X(0) = 0,   N(t(n), y(n)) = 0,   D(t(n), y(n)) = 0.   (12.155)

The solution to this problem gives t(n), which is the position of the singularity.
If the solution needs to be continued on the other side of the singular point
also, then this treatment can be modified slightly. In that case these boundary
conditions can be applied at some internal point. It is convenient to define
two mesh points on either side of the singularity. Thus, if the singularity is at
t = t_s, then the two mesh points could be taken at t = t_s − ε and t = t_s + ε,
(ε > 0). Extra boundary conditions connecting the solution at these two points
can be imposed. These connection conditions do not change the structure of
finite difference matrix as shown in Figure 12.5.
Similar technique can be applied to situations where the solution is not
continuous at some point, such as presence of shocks in fluid flows or presence
of internal boundaries where the physical conditions and hence the governing
equations or the coefficients in equations change discontinuously. In this case
also we can define two mesh points on either side of the discontinuity and apply
appropriate connection conditions.
Bibliography
Acton, F. S. (1990): Numerical Methods That Work, Mathematical Association of America.
Antia, H. M. (1979): Finite-Difference Method for Generalized Eigenvalue Problem in Ordi-
nary Differential Equations, J. Computational Phys., 30, 283.
Burden, R. L. and Faires, D. (2010): Numerical Analysis, (9th ed.), Brooks/Cole Publishing
Company.
Dahlquist, G. and Bjorck, A. (2003): Numerical Methods, Dover, New York.
Deuflhard, P. (1985): Recent Progress in Extrapolation Methods for Ordinary Differential
Equations, SIAM Rev., 27, 505.
Fox, L. (1991): The Numerical Solution of Two-Point Boundary Problems in Ordinary Dif-
ferential Equations, Dover, New York.
Fox, L. (1987): Numerical Solution of Ordinary Differential Equations, Chapman and Hall
London.
Fox, L. (ed.) (1962): Numerical Solution of Ordinary and Partial Differential Equations,
Pergamon Press, Oxford.
Gear, C. W. (1971): Numerical Initial Value Problems in Ordinary Differential Equations,
Prentice-Hall, Englewood Cliffs, New Jersey.
Gear, C. W. (1981): Numerical Solution of Ordinary Differential Equations: Is There Anything
Left to Do? SIAM Rev., 23, 10.
Gerald, C. F. and Wheatley, P. O. (2003): Applied Numerical Analysis, (7th ed.) Addison-
Wesley.
Hall, G. and Watt, J. M. (eds.) (1976): Modern Numerical Methods for Ordinary Differential
Equations, Clarendon Press, Oxford.
Hamming, R. W. (1987): Numerical Methods for Scientists and Engineers, (2nd ed.), Dover,
New York.
Hoffman, J. D. (2001): Numerical Methods for Engineers and Scientists, (2nd ed.) CRC Press.
Iserles, A. (2008): A First Course in the Numerical Analysis of Differential Equations, (2nd
ed.) Cambridge University Press.
Jain, M. K. (1979): Numerical Solution of Differential Equations, Wiley Eastern, New Delhi.
Keller, H. B. (1992): Numerical Methods for Two Point Boundary Value Problems, Dover,
New York.
Krogh, F. T. (1973): On Testing a Subroutine for the Numerical Integration of Ordinary
Differential Equations, JACM, 29, 545.
Lapidus, L. and Schiesser, W. E. (1976): Numerical Methods for Differential Systems, Aca-
demic Press. New York.
Lindberg, B. (1974): On a Dangerous Property of Methods for Stiff Differential Equation,
BIT, 14, 430.
Quarteroni, A., Sacco, R. and Saleri, F. (2010): Numerical Mathematics, (2nd ed.) Texts in
Applied Mathematics, Springer, Berlin.
Press, W. H., Teukolsky, S. A., Vetterling, W. T. and Flannery, B. P. (2007): Numerical
Recipes: The Art of Scientific Computing, (3rd ed.) Cambridge University Press, New
York.
Rahman, M. (2004): Applied Numerical Analysis, WIT Press.
Ralston, A. and Rabinowitz, P. (2001): A First Course in Numerical Analysis, (2nd Ed.)
Dover.
Rice, J. R. (1992): Numerical Methods, Software, and Analysis, (2nd Ed.), Academic Press,
New York.
Rutishauser, H. (1990): Lectures on Numerical Mathematics, Birkhäuser, Boston.
Shampine, L. F. and Gear, C. W. (1979): A User's View of Solving Stiff Ordinary Differential
Equations, SIAM Rev., 21, 1.
Shampine, L. F., Watts, H. A. and Davenport, S. M. (1976): Solving Nonstiff Ordinary
Differential Equations - The State of the Art, SIAM Rev., 18, 376.
Stoer, J. and Bulirsch, R. (2010): Introduction to Numerical Analysis, (3rd Ed.) Springer-
Verlag, New York.
Exercises
1. Prove that a system of ordinary differential equations can be written as a system of
first-order equations, if and only if the system can be rewritten with the highest order
derivative in each variable appearing as the left-hand side of one equation and nowhere
else. Write the following system as a system of first-order equations:

d³y/dt³ + (d²y/dt²)(dz/dt) + cos(t) y²z³ (dy/dt) = t³ sin y cos z,
d²z/dt² + y²z³ (d²y/dt²) + z²y³ (dz/dt)^(3/2) (dy/dt) = e^{yz} sin(t).

2. Consider the system of linear equations y′ = Ay, where A is an n × n matrix. Show that if
all elementary divisors of the matrix A are linear, then the general solution of this system
of equations is
y = Σ_{i=1}^n c_i x_i e^{λ_i t},
where c_i are arbitrary constants, and x_i is the eigenvector corresponding to the eigenvalue
λ_i. If the matrix has an elementary divisor of order m corresponding to an eigenvalue λ,
then the relevant terms in the solution can be written as

y = Σ_{i=1}^m c_i e^{λt} ( Σ_{j=i}^m x_j t^{j−i}/(j − i)! ) = Σ_{i=1}^m x_i e^{λt} ( Σ_{j=1}^i c_j t^{i−j}/(i − j)! ),

where x_i is the generalised eigenvector of rank i corresponding to the eigenvalue λ.


3. Show that the differential equation y' = y/t, has a general solution y = ct. Hence, show
that the initial value problem with y(O) = 1 has no solution.
4. Consider the initial value problem

y' = y +t - t3 , y(O) = 5.

Suppose we use Euler's method with step size h to compute the approximate solution
y_j ≈ y(t_j); t_j = hj. Find an explicit formula for y_j and obtain the truncation error.
Show that for a fixed value of t, the error goes to zero as h --.. O. Repeat this exercise for
the initial condition y(O) = O.
5. Repeat the above exercise for the numerical integration method y_{j+1} = y_j + 2hy′_j and
show that the truncation error does not tend to zero as h → 0. Why?
6. The initial value problem
y′ = √y,   y(0) = 0,
has a nontrivial solution y(t) = t²/4. However, application of single-step methods like
Euler's method or the Runge-Kutta method yields y(t) = 0 for all t and h. Explain the
paradox.
7. Analyse stability of the midpoint method

y_{j+1} = y_{j−1} + 2hy′_j,

and show that it is relatively unstable for decreasing solutions. Find the region of absolute
stability of this method.
8. Consider the initial value problems

(i) y′ = −y²,  y(0) = 1;   (ii) y′ = −y − ty²,  y(0) = 1;
(iii) y′ = 2y tan t + 2 tan t,  y(0) = 1;   (iv) y′ = 2t³ − 2ty,  y(0) = 1;
(v) y²(y − (1 + t)y′) + (1 + t)³ = 0,  y(0) = 1;   (vi) ty′ = t + y,  y(1/e) = 0.
Differentiating these equations, find expressions for higher derivatives of y. Compute y(1)
using the Taylor series method

y_{j+1} = y_j + h y′_j + (h²/2) y″_j + (h³/3!) y‴_j + ... + (h^r/r!) y_j^(r).

Use r = 2,3,4, h = 0.1,0.05 and compare the efficiency and accuracy of this method
with Runge-Kutta methods and predictor-corrector methods.
9. Consider stability of numerical integration methods for a system of two coupled linear
equations
y₁′ = a₁₁y₁ + a₁₂y₂,   y₂′ = a₂₁y₁ + a₂₂y₂.
Show that a straightforward analysis similar to that for one equation leads to generally
intractable algebraic equation. Assuming that the matrix A has linear divisors, apply
a linear transformation to obtain the equation in a diagonal form. Now apply the same
analysis and show that the result is similar to that for one equation, provided λ is replaced
by the eigenvalues of the coefficient matrix and the stability condition is satisfied for each
value of λ.
10. For the stability analysis of Section 12.2, use Cramer's rule to express the coefficients c_i in
terms of the initial values y_i (0 ≤ i ≤ m) and show that c₀ → y₀ and c_i → 0, (i > 0) as
h → 0.
11. Show that the transformation w = (z - 1)/(z + 1) maps the unit circle in the complex
z-plane onto the negative half of the w-plane. Apply this transformation to the charac-
teristic polynomial in Example 12.2, and apply the Routh-Hurwitz criterion to find the
condition for stability. If d_i, (i = 0, ..., p) is the coefficient of w^{p−i} in the polynomial
(with d₀ = 1), then define a matrix C with elements

i, j = 1, ..., p;   d_k = 0 if k < 0 or k > p.

The Routh-Hurwitz criterion states that the polynomial with coefficients d_i has all its roots
in the negative half of the complex plane or on the imaginary axis, if and only if all
principal minors of the matrix C are of the same sign, zero values being ignored.
12. Derive Adams-Bashforth predictors of order 1, 2, 3, 4, 5 by integrating the Newton's
backward formula between appropriate limits. Also derive Adams-Moulton correctors of
order 1, 2, 3, 4, 5 using a similar technique.
13. Write down the numerical integration formula resulting from a predictor-corrector
method, if only one iteration on corrector is performed. Analyse stability of this formula
for the fourth-order Adams-Bashforth-Moulton predictor-corrector method. Repeat this
exercise for Hamming's method with predictor given by Eq. (12.57).
14. Apply the numerical integration method in Example 12.2 to the equation y′ = 0 with
initial conditions y₋₂ = 0, y₋₁ = 0, y₀ = ε (a small roundoff error). Show that the
solution is
c₀ = ε / (12 − 30b₋₁ + 6b₂).
Hence, show that the roundoff error will be amplified if |12 − 30b₋₁ + 6b₂| < 1. Find a stable
fourth-order three-step formula, which does not amplify the roundoff error. Compare the
roundoff properties of Hamming's and Adams methods. Apply these methods to the
following initial value problem

y′ = 1/(1 + tan²y),   y(0) = 0.
Use h = 0.1, 0.01, 0.001 and continue the integration up to t = 10. Compare the results
with the exact values. Repeat the exercise when integration is continued up to t = 1000
and explain the results. What happens if the step size is adjusted to achieve an accuracy
of 10⁻⁶?
15. Consider the numerical integration formula in Example 12.2 and find the values of b₂ and
b₋₁ which maximise the range of λh (on the negative side) for stability.
16. Repeat the above exercise for the case b₂ = 0 and show that the resulting formula is close
to Hamming's method.
17. Consider numerical integration formula in Example 12.2 and find a fourth-order explicit
formula. This formula can be used as a predictor with fourth-order Adams-Moulton cor-
rector. What are the advantages and disadvantages of this formula over Adams-Bashforth
predictor? Implement this formula in a predictor-corrector method and try it on initial
value problems in {8}.
18. Use the errors in Adams-Bashforth predictors and Adams-Moulton correctors to derive
the corresponding modifier equations. What is the estimated truncation error in the
corrector?
19. Show that if the estimate of the truncation error in a corrector formula is used to modify
the corrector formula, then the resulting formula would be of higher order. Analyse
stability of such a modified formula obtained from the fourth-order Adams-Bashforth-
Moulton predictor-corrector method. Repeat this exercise for Hamming's method.
20. Consider a general two-step numerical integration method of the form

y_{j+1} = a₀y_j + a₁y_{j−1} + h(b₋₁y′_{j+1} + b₀y′_j + b₁y′_{j−1}).


What is the maximum order of this method? Find the formula which achieves the max-
imum order. Is this formula stable? Consider third-order methods of this form and find
all coefficients in terms of b₋₁. Find the range of b₋₁ values for which the method is
stable. Find a third-order method with the best stability characteristics. What is the region
of absolute stability of this method in the complex plane? Plot the two roots of the
characteristic equation as a function of hλ for b₋₁ = 0.4, 0.45, 0.48 and 0.49.
21. Show that the initial value problem

Y" = Ly, y(O) = 1, y'(O) = -Vi, (L» I),


is unstable. Apply the Riccati transformation defined by y′ = ηy and obtain the differential
equation in terms of η: η′ + η² = L. Solve this differential equation and obtain the
required solution y by integrating the transformation equation. Compare the results with
those obtained by direct solution of the equation. Show that this transformation does not
avoid the instability, unless the equation in η is solved analytically. Try to solve the equation
numerically for L = 100, 123.21, 1.96 and 1.69 with η(0) = −10, −11.1, −1.4 and −1.3,
respectively. Perturb the initial values slightly and see the result.
22. Show that starting values for a numerical integration method can be generated by letting

y(x) = Σ_{k=0}^∞ A_k (x − x₀)^k.

Substituting this expansion in the differential equation, we can obtain a recurrence re-
lation for the coefficients A k , while Ao is determined from the initial condition. For
which type of differential equations will this method be useful? Apply this method to the
equations in {8} and calculate y(O.l), y(0.2) and y(0.3).
23. Derive the formulae (12.53) for generating the starting values. Show that truncation error
in all these formulae is O(h⁵) and can be estimated if the fifth derivative of the function
can be approximated. In principle, the fifth derivative can be estimated using the value
of the function and the first derivative at these four points. Show that all such estimates
will vanish. Hence, this technique cannot be used to estimate the truncation error in
these formulae. How will you estimate the truncation error in these formulae? Use this
technique to obtain the starting values for the equations in {8} and compare the results
with those in the previous exercise.
24. Evaluate the integral

y(t) = ∫₀ᵗ dx/(1 − x)^α = [1 − (1 − t)^{1−α}] / (1 − α)
by solving the initial value problem y′ = 1/(1 − t)^α, y(0) = 0. Try to estimate y(1) to
an accuracy of 10⁻⁶, using a method with adaptive step size control and compare its
efficiency with that of an adaptive quadrature routine.
25. Solve the initial value problem

y′ = 3(y − 1/(1 + t)) − 1/(1 + t)²,   y(0) = 1,

using any fourth-order numerical method with constant step size h = 0.1 and continue
the integration up to t = 100. Compare the results with the exact solution y = 1/(1 + t).
Repeat the exercise using h = 0.05 and 0.01. Compare the results with those in
Example 12.4 and explain the difference.
26. Solve the following initial value problem
y′ = 1/t² − y/t − y²,   y(1) = −1,   1 ≤ t ≤ 10.

Use h = 0.1, 0.01 and 0.001 and compare the computed results with the exact solution
y(t) = −1/t.
27. Derive the formula for the fourth-order Runge-Kutta method by expanding the formula
as a Taylor series in t and y. Also obtain the following formula due to Ralston, which
minimises the truncation error in a certain sense:
y_{n+1} = y_n + 0.17476028k₁ − 0.55148066k₂ + 1.20553560k₃ + 0.17118478k₄,
k₁ = h_n f(t_n, y_n),
k₂ = h_n f(t_n + 0.4h_n, y_n + 0.4k₁),
k₃ = h_n f(t_n + 0.45573725h_n, y_n + 0.29697761k₁ + 0.15875964k₂),
k₄ = h_n f(t_n + h_n, y_n + 0.21810040k₁ − 3.05096516k₂ + 3.83286476k₃).
Try both these fourth-order methods on initial value problems in {8} and compare the
relative errors.
28. Try to verify the order of Runge-Kutta formulae given in Section 12.4 experimentally.
Consider the solution of initial value problems in {8} using different values of h and verify
that the error actually falls off as the expected power of h.
29. In the fourth-order Runge-Kutta method, if the truncation error estimated using the
technique described in Section 12.4 is added to the computed solution, then show that
the resulting method is of fifth order. Find the region of absolute stability in the complex
plane for this method and compare it with that for unmodified method.
30. Consider a two-stage implicit Runge-Kutta method of the form (Gear, 1971)
y_{n+1} = y_n + γ₁k₁ + γ₂k₂,
k₁ = h_n f(t_n + α₁h_n, y_n + β₁₁k₁ + β₁₂k₂),
k₂ = h_n f(t_n + α₂h_n, y_n + β₂₁k₁ + β₂₂k₂).
Determine the constants α_i, β_{ij} and γ_i to achieve fourth-order accuracy. Show that the
formula with

γ₁ = γ₂ = 1/2,   α₁ = 1/2 − √3/6,   α₂ = 1/2 + √3/6,
β₁₁ = β₂₂ = 1/4,   β₁₂ = 1/4 − √3/6,   β₂₁ = 1/4 + √3/6,
due to Butcher achieves fourth order accuracy. Analyse the stability of this method. Use
this method on initial value problems in {8} and compare the efficiency with standard
fourth-order Runge-Kutta method.
31. For a second-order differential equation Y" = f(t, y, y'), derive a Runge-Kutta formula
of the form
y_{n+1} = y_n + h_n(y′_n + γ₁k₁ + γ₂k₂),   y′_{n+1} = y′_n + δ₁k₁ + δ₂k₂,
k₁ = (h_n/2) f(t_n, y_n, y′_n),
k₂ = (h_n/2) f(t_n + α₁h_n, y_n + α₁h_n y′_n + α₂h_n k₁, y′_n + α₃k₁).

Show that the method with

γ₁ = γ₂ = 1/2,   α₁ = α₂ = 2/3,   α₃ = 4/3,
has a third order accuracy. Compare this method with the usual third-order Runge-Kutta
method, applied after reducing the equation to a system of first-order equations. Try this
method on the following initial value problem

Y(O) = 0, Y' (0) = 1.

32. For initial value problems in {8} continue the integration up to t = 1 with h = 0.1
and h = 0.05 using (i) Euler's method, (ii) trapezoidal rule, (iii) Milne's method, (iv)
second and fourth-order Adams methods, and (v) second and fourth-order Runge-Kutta
methods. Compare the relative accuracy and efficiency of these methods. In (iii) and (iv),
also estimate the truncation error using the computed values and compare it with actual
error.
33. If a first-order differential equation can be easily differentiated to obtain higher deriva-
tives, then it may be more efficient to use these derivatives also in numerical integration
formulae. Derive the following pairs of formulae which can be used as a predictor-corrector
pair:
2
Yj+1 = Yj + 2h (
-Yj ' ) + 12
' + 3Yj_l h ( 17Yj" + 7Yj_l
") ,

2
Yj+l = Yj + 2h (Yj+l
, + Yj') + 12
h ( "
-Yj+l + Yj") .
Show that truncation error in these formulae is (31/720)h5 y (5) (E) and (1/720)h 5 y(5) (E),
respectively. Analyse stability of the corrector formula. Use these formulae to solve the
initial value problems in {8}.
34. Solve the following differential equations:

(i) Bessel's equation: y″ + y′/x + (1 − n²/x²)y = 0,   0 ≤ x ≤ 10;
with y(0) = 1, y′(0) = 0 for n = 0; and y(0) = 0, y′(0) = 1/2 for n = 1;
(ii) Legendre's equation: (1 − x²)y″ − 2xy′ + ℓ(ℓ + 1)y = 0,   −1 ≤ x ≤ 1;
with y(1) = 1, y′(1) = ℓ(ℓ + 1)/2,   ℓ = 3, 6, 7, 10;
(iii) y″ − (1 − 13/t + 12/t²)y = 0,   solution regular at t = 0,   0 ≤ t ≤ 10;
(iv) y′ − 10y = 15e⁻⁵ᵗ,   y(0) = ±1,   0 ≤ t ≤ 10;
(v) y′ − 10y = 15e⁻⁵ᵗ,   y(10) = 0,   10 ≥ t ≥ 0;
(vi) y′ = t(1 − y) + (1 − t)e⁻ᵗ,   y(0) = 1,   0 ≤ t ≤ 50;
(vii) y₁″ = −y₁/(y₁² + y₂²)^{3/2},   y₂″ = −y₂/(y₁² + y₂²)^{3/2};
y₁(0) = 1, y₁′(0) = 0, y₂(0) = 0, y₂′(0) = 1,   0 ≤ t ≤ 50.

35. Solve the following initial value problem, which is the equation of motion for a simple
pendulum:

e" + asine = 0, e(o) = 7r, e'(O) = 0, 0< t < 1000,

for a = ±1. Compare the results with exact solution. Also try the initial condition
e(O) = 7r /3, e' (0) = O. Estimate the period of oscillation in each case. (Note: a = 1
e
corresponds to the case when the angle is measured from the lowest point, while if the
angle is measured from the highest point, then a = -1.)
36. Using the solution obtained in {34} compute the first zero of the Bessel functions J₀(x) and
J₁(x). Also treat this as a free boundary problem and try to find the zero by using the
technique outlined in Section 12.7.
37. Solve the following initial value problem defined by the Lane-Emden equation

(1/r²) d/dr(r² dθ/dr) + θⁿ = 0,   θ(0) = 1,   θ′(0) = 0.

Use n = 0, 1, 1.5, 2, 3, 4, 5 and continue the integration up to the first zero of θ.


38. Analyse stability of the numerical integration method based on the trapezoidal rule
y_{j+1} = y_j + (h/2)(y′_{j+1} + y′_j),
and show that this method is absolutely stable in the entire negative half of the complex
plane. Compare stability of this method with the second-order Runge-Kutta method
defined by the last formula in (12.65).
39. If there is a sample of pure Radium 226 at time t = 0, what will be the relative abundance
of Radium 226, Radon 222 and Polonium 218 after a time t (0 < t < 10⁵ years), given
their half-lives of 1620 years, 3.825 days, 3.05 min, respectively? The abundances n₁, n₂
and n₃ can be obtained by solving the initial value problem

dn₁/dt = −a₁n₁,   dn₂/dt = a₁n₁ − a₂n₂,   dn₃/dt = a₂n₂ − a₃n₃,

with n₁(0) = 1, n₂(0) = 0, n₃(0) = 0. Here aᵢ = ln 2/Tᵢ, where Tᵢ is the half-life.


Compare the results with the exact solution. Try various routines for stiff as well as
nonstiff equations and compare their performances.
40. Solve the following initial value problem using any suitable technique

dy
dx = Y + z,
dz
dx = (1 + x)azyly - z - x, y(l) = -z(l) = (2~)2/3
~

Use a = 1, 10², 10⁵, 10¹⁰ and continue the solution in the negative x direction, until either
y(x) = 0 or y(x) has a minimum. How will you estimate the accuracy of the solution?
41. Solve the initial value problem in Example 12.4, for λ = 10, 50, 100, 200 using fourth-
order (1) Runge-Kutta method, (2) Adams predictor-corrector method, (3) Gear's stiffly
stable method. Use h = 0.1 and 0.01 and explain the results. Repeat the exercise with
initial condition y(O) = 1.
42. Solve the following initial value problem due to Lindberg using any suitable technique

y₁′ = 10⁴y₁y₃ + 10⁴y₂y₄,   y₁(0) = 1,
y₂′ = −10⁴y₁y₄ + 10⁴y₂y₃,   y₂(0) = 1,
y₃′ = 1 − y₃,   y₃(0) = −1,
y₄′ = −y₄ − 0.5y₃ + 0.5,   y₄(0) = 0.
43. Consider the equations for the two-body central force problem:

m(d²r/dt² − r(dθ/dt)²) = f(r),   mr² dθ/dt = l,

with m = 1, l = 1 and f(r) = −1/r². Try the solution using random initial conditions
for r(0) in the interval (0.5, 1.5) and for dr/dt(0) in the interval (−0.1, 0.1) with θ(0) = 0 and
calculate the solution for 0 < t ≤ 1000. Trace the orbits and try to estimate the orbital
period. Try several combinations of random numbers and demonstrate that the orbits
are either ellipses or hyperbolas as expected for an inverse square law. Repeat the same
exercise when f(r) = −1/r or f(r) = −1/r² − 0.01/r⁴ and demonstrate that in this case
the orbits are not closed. What happens if f(r) = −2/r³?
44. Consider the equation of motion for three point particles under their gravitational field:
d²xᵢ/dt² = Σ_{j≠i} k mⱼ(xⱼ − xᵢ) / ((xⱼ − xᵢ)² + (yⱼ − yᵢ)²)^{3/2},   i = 1, 2, 3,

with the corresponding equations for the yᵢ. Use k = 1, m₁ = 100, m₂ = m₃ = 1 with the initial values:

x₁ = 0,   y₁ = 0,   dx₁/dt = 0,   dy₁/dt = 0,
x₂ = T₁,   y₂ = T₂,   dx₂/dt = 15y₂ + T₃/10,   dy₂/dt = −15x₂ + T₄/10,
x₃ = T₅,   y₃ = T₆,   dx₃/dt = 15y₃ + T₇/10,   dy₃/dt = −15x₃ + T₈/10,

at t = 0, where Tᵢ, i = 1, ..., 8 are some random numbers in the interval (−1/2, 1/2). Find
the solution for 0 < t < 1000 and trace the path of each of these three particles. Try
many different sets of random initial condition.
45. Solve the following boundary value problems:

(i) y" +y = 0, y(O) = 1, y(l) = 2;


(ii) Legendre's equation: (1 - x2)y" - 2xy' + £(£ + l)y = 0,
y(-I) = (_I)l, y(l) = 1.
In (ii) try £ = 3,6,7,10 and compare the results with those in {34}.
46. Consider shooting method for the boundary value problem defined by (12.92). Show that
the first derivative of the function F(x) can be calculated by solving the initial value
problem

v" = af(t,y,y')v+ af(t,y,y')v',


v(a) = 0, v'(a) = 1,
ay ay'
which gives F'(x) = v(b). Using this derivative, we can use Newton-Raphson method to
find the zeros of F(x). Compare the efficiency of this technique with that using secant
iteration on F(x).
47. For boundary value problems involving a second-order differential equation y" =
f(t, y, y'), it is possible to use the finite difference approximations (12.106) to obtain
the difference equations. Find the truncation error in these formulae. The method of de-
ferred correction can be applied to this formula also. Show that the local truncation error
involves the derivatives y‴(t_j) and y⁽⁴⁾(t_j), which can be estimated using the same finite
difference formulae on y″(t_j) = f_j. Show that in this case, we do not need to use any
off-centred formula at the end point. Apply this method to the boundary value problems
in {45}.
48. The displacement u of a loaded beam of length 2L under certain approximations satisfies
the following differential equation:

d²/dx² (E I(x) d²u/dx²) + Ku = q(x),

with boundary conditions

u"(-L) = 0, u"'(-L) = 0, u"(L) = 0, u"'(L) = O.


Here K and E are constants and

I(x) = I₀(2 − (x/L)²),   K = 40EI₀/L⁴.
Use L = 1 and c = q₀/K = 0.0025 and solve the boundary value problem to obtain the
displacement at x = 0.
49. Consider the boundary value problem
y″ = −2y′/x + y/(a(y + k)),   y′(0) = 0,   y(1) = 1,

where a and k are constants, k = 0.1, a = 0.1, 0.01, 0.001. To overcome the singularity
at x = 0, expand the solution in a Taylor series about x = 0 up to and including the term
x⁶, expressing all coefficients in terms of β = y(0). Use the shooting technique to determine
the value of β after shifting the boundary to x = 0.01, in order to avoid the singularity.
Try this problem using a finite difference or expansion method and compare the results.
50. Find the lowest eigenvalue and the corresponding eigenfunction of

(i) y" + >..y = 0, yeO) = 0, y(l) = 0;


(ii) y" + (>.. - x 2 )y = 0, y(±oo) = ° (Harmonic oscillator);
... d ( -1- -
(m)- dY ) +>..y=o yeO) = y(l) = 0.
dt 1 + t dt '

51. Consider the eigenvalue problem for a stretched membrane

(1/r) d/dr(r dy/dr) + λy = 0,

with boundary conditions y(1) = 0 and regularity condition at r = 0. Using the general
solution of the differential equation near the singularity at r = 0, find the boundary
condition at r = 0. Write the finite difference equations on a uniform mesh with points
r_j = (2j + 1)/(2N + 1), (j = −1, 0, 1, ..., N). Find the lowest eigenvalue for this problem.

52. Solve the following eigenvalue problem for the linear adiabatic oscillation of polytropes

dv = _ (~+ _1_(2: -1») v+ (0'2 _ 1'(1'+ 1») PI _ 1'(1'+1) P01/JI,


dr r "YHp r c2 r2 r2

-dPI = ( -9- (- "Y - 1) - 1) v - -9 Pl - PO -d1/J1 ,


dr 0'2"YHp r c2 dr

d 21/J1 = _ (n + 1) (2'. _ l)v + n + 1 PI + 1'(1' + 1) 1/J1 _ ~ d1/J1 ,


dr 2 0'2 "Y Hp r c2 r2 r dr

with boundary conditions


d1/J1 I'
rv + ePI + epo1/J1 = 0, -
dr
- -1/J1 =0,
r
r ---> 0;

n +1 + 1 + --
--v
0'2
+ I'--1/J1
r
d1/J1
dr
=0, r = R;

where "Y = 5/3, r = (n + l)/n = 4/3, c2 = "Ye, 9 = -en + 1)()',


Hp = e/g, PO = en. Here
e is the solution of the Lane-Emden equation {37} for polytropic index n and the outer
boundary R is the first zero of e. For a given value of I' (¥o 0), there are two series of
eigenvalues, one in which the number of nodes is increasing with 0'2 and another in which
it is decreasing with 0'2. Use n = 3 and I' = 2 and find all eigenvalues 0'2 corresponding
to eigenfunctions which have less than 10 nodes in v.
53. Apply the technique of adaptive mesh point allocation to the boundary value problem
in Example 12.11. Experiment with various strategies for allocation of mesh points and
find out the improvement resulting from this approach.
54. Consider the Mathieu equation

$$y'' + (\lambda - 2q\cos 2x)\,y = 0, \qquad y(0) = y(\pi) = 0.$$



Find the lowest eigenvalue λ (for q = 1) with symmetric eigenfunctions using the expansion
method. Try the expansion

$$y = \sum_{i=0}^{\infty} A_{2i+1}\,\sin(2i+1)x.$$

Substituting this expansion in the differential equation, obtain the following linear equations
connecting the unknown coefficients A_{2i+1}:

$$\left[\lambda - (2i+1)^2\right]A_{2i+1} - q\left(A_{2i-1} + A_{2i+3}\right) = 0, \qquad (i = 0, 1, \ldots),$$

where for i = 0, we can use A_{-1} = -A_1. Truncate the expansion after n = 3, 5, 8 terms
and find the lowest eigenvalue. Estimate the accuracy. Also try to estimate the next lowest
eigenvalue. How many terms are required to achieve an accuracy of 10^{-6}? Compare the
results with those obtained using B-spline basis functions.
55. Solve the following initial value problem using the Lanczos τ-method

$$2(1+t)y' + y = 0, \qquad y(0) = 1, \qquad 0 < t < 1.$$


Expand the solution as a polynomial in t
$$y(t) = a_0 + a_1 t + a_2 t^2 + \cdots + a_n t^n,$$
substitute in the differential equation and obtain equations connecting the coefficients.
Show that these equations have only the trivial solution a_i = 0. To determine the coefficients,
use the Lanczos τ-method by assuming that the given expansion is an exact solution
of the slightly perturbed problem

$$2(1+t)y' + y = \tau\,T_n^*(t),$$

where T_n^*(t) is the Chebyshev polynomial over the interval [0, 1]. Use n = 5, 8, 10 and
compare the solution with the exact value.
56. Solve the following eigenvalue problem using the technique described in the previous
problem:
y(-1) = y(1) = 0.
Consider the expansion in terms of even powers of x and show that in this case two
perturbing terms are required to determine the coefficients. Assume a perturbation of
the form TIT2n(X) + T2T2n+2(X) and find the lowest eigenvalue for n = 3,5 and 10.
Repeat the exercise using expansion in terms of odd powers.
Chapter 13

Integral Equations

We shall now consider numerical solution of integral equations, where the un-
known function occurs under the integral sign. Integral equations do not occur
as frequently as differential equations in physical problems. However, a differ-
ential equation can be converted to an equivalent integral equation. Apart from
this, integral equations also appear in certain diffusion and transport problems.
The main difference between a differential and an integral equation is that the
former only describes the local behaviour of the solution, in the sense that f(x)
is related to f(x + δx). On the other hand, an integral equation specifies global
behaviour of the solution, since its value at any point is related to that at every
other point within the range of the integral.
Methods for numerical solution of integral equations can be broadly clas-
sified into two classes. The first is quadrature methods, where the integral is
approximated by a quadrature formula. These methods are similar to finite
difference methods for solving differential equations. The second is expansion
methods, where the solution is approximated by an expansion in terms of some
convenient basis functions. The coefficients of expansion are found by minimis-
ing the error.
In Section 13.1, we consider some properties of integral equations as well
as the classification of integral equations into different types. In Sections 13.2
and 13.3 we describe respectively, the quadrature and expansion methods for
Fredholm equations of the second kind. These methods can be applied to Fred-
holm equations of other kinds, after some minor modifications. In Section 13.5
we discuss the solution of Fredholm equations of the first kind, while Section
13.4 considers the Fredholm equations of the third kind, which is the eigenvalue
problem for integral equations. Section 13.6 deals with inverse problems, which
is essentially discrete version of Fredholm equations of first kind. The equations
of the first kind are generally ill-conditioned and considerable care is required
to obtain meaningful solution. Section 13.7 deals with the basic techniques for
solving Volterra equations of the second kind, while Section 13.8 describes the
application of these methods to Volterra equations of the first kind. In this

chapter, we only consider some simple methods for numerical solution of in-
tegral equations; many techniques are only outlined briefly. For more details
readers can refer to Baker (1977) or Delves and Walsh (1974). Baker (1977)
also gives a number of illustrative examples.

13.1 Introduction
In an integral equation the unknown function appears under the integral sign.
The limits of the integral may be constants, in which case we have a Fredholm
type of equation, whereas if one of the limits is the independent variable, then
we have an equation of Volterra type. An integral equation is said to be lin-
ear if all terms occurring in the equation are linear in the unknown function.
Otherwise, the equation is said to be nonlinear. Solution of nonlinear equations
is obviously more difficult and in most of this chapter, we consider only linear
integral equations. Linear integral equations can be classified into Fredholm
and Volterra type, as explained above. Each of these can be further classified
into different kinds. For example

$$\int_a^b K(x,t)\,f(t)\,dt = g(x), \qquad \int_a^b K(x,t)\,f(t)\,dt = g(x) + f(x), \qquad \int_a^b K(x,t)\,f(t)\,dt = \lambda f(x), \qquad (13.1)$$

are Fredholm equations of the first, second and third kind, respectively. These
equations are supposed to hold for all values of x in [a, b]. Here f(x) is the
unknown function which is to be determined, while g(x) and K(x, t) are known
functions. The function K(x, t) is referred to as the kernel. The equation of
the third kind is just the homogeneous version of an equation of the second
kind and defines an eigenvalue problem. This equation has a nontrivial solution
only for some special values of λ, which are referred to as the eigenvalues.
The corresponding solution is referred to as the eigenfunction. Sometimes, this
problem is referred to as a homogeneous problem of the second kind. Similarly,
some books define the equations of second kind with g(x) on the left-hand
side. This convention should be taken care of while comparing the formulae for
numerical solution of integral equations.
In some problems the variable x may be discrete, in which case typically
gi represent some measured data and the object is to determine f(t) using
the corresponding kernels K_i(t) and g_i. This situation arises in the so-called
inverse problems. The corresponding forward problem, that is, determining
gi for a known f(t) is straightforward to solve. The forward problem being
simple quadrature is well-conditioned, while the corresponding inverse problem
is generally ill-conditioned.

If the upper limit b in (13.1) is replaced by x, then we obtain the Volterra


equations of the first, second and third kind. Equations of Volterra type can be
considered as a special case of Fredholm type, with kernel K 1 (x, t) defined by

$$K_1(x,t) = \begin{cases} K(x,t), & \text{for } x \ge t; \\ 0, & \text{for } x < t, \end{cases} \qquad (13.2)$$

However, this kernel is normally discontinuous on the line x = t. Further, solv-


ing a Volterra equation is usually simpler than solving a Fredholm equation.
Consequently, it is desirable to consider Volterra equations as a separate prob-
lem.
It can be shown that Volterra equations of the third kind can only have
the trivial solution f(x) ≡ 0. Hence, we need not consider them. Further,
an equation of the first kind can be converted to that of the second kind by
differentiation. For example, consider Volterra equation of the first kind

$$\int_a^x K(x,t)\,f(t)\,dt = g(x). \qquad (13.3)$$

Differentiating it with respect to x gives

$$K(x,x)\,f(x) + \int_a^x \frac{\partial K(x,t)}{\partial x}\,f(t)\,dt = g'(x). \qquad (13.4)$$

Now if K(x, x) ≠ 0, we can divide by this factor to obtain an equation of the
second kind. If K(x, x) = 0, we can differentiate this equation once more to
obtain the second derivative of the kernel. This process can be repeated until
the relevant derivative of the kernel at t = x is nonzero, to get the equation

$$\frac{\partial^k K(x,t)}{\partial x^k}\bigg|_{t=x} f(x) + \int_a^x \frac{\partial^{k+1} K(x,t)}{\partial x^{k+1}}\,f(t)\,dt = g^{(k+1)}(x). \qquad (13.5)$$

Thus, we need to consider only Volterra equations of the second kind. However,
in some cases, it may be difficult to find the required derivatives. It is also
possible that for all finite values of k, the derivatives vanish at t = x. In such
circumstances, it may be necessary to use special methods for solving Volterra
equations of the first kind.
A Fredholm equation is nonsingular if the limits a and b are finite, the
kernel K(x, t) is well behaved in the range [a, b] of x and t, and g(x) is also
smooth in [a, b]. A singular equation will have infinite limit, or some singularity
in K(x, t) or g(x), or some combination of these qualities. In most of this
chapter, we consider only nonsingular problems. Some techniques for treatment
of singular equations are outlined briefly.
A boundary value problem for a differential equation can be transformed
to a Fredholm equation of the second kind. For example, the boundary value
problem
$$y'' + p(t)\,y = q(t), \qquad y(0) = 0, \quad y(1) = 0, \qquad (13.6)$$

can be transformed into an integral equation

$$y(x) + p_1(x) = \int_0^1 K(x,t)\,p(t)\,y(t)\,dt, \qquad (13.7)$$

where

$$K(x,t) = \begin{cases} t(1-x), & \text{for } 0 \le t \le x \le 1; \\ x(1-t), & \text{for } 0 \le x < t \le 1; \end{cases} \qquad p_1(x) = \int_0^1 K(x,t)\,q(t)\,dt. \qquad (13.8)$$
It should be noted that the integral equation is equivalent to the differential
equation, together with the boundary conditions.
Similarly, initial value problems in differential equations can be trans-
formed to Volterra equations of the second kind. For example, the initial value
problem
$$y'' + p(t)\,y = q(t), \qquad y(0) = c_0, \quad y'(0) = c_1, \qquad (13.9)$$

can be transformed into an integral equation

$$y(x) + p_1(x) = \int_0^x K(x,t)\,p(t)\,y(t)\,dt, \qquad (13.10)$$

where

$$K(x,t) = t - x \qquad \text{and} \qquad p_1(x) = -c_0 - c_1 x - \int_0^x (x-t)\,q(t)\,dt. \qquad (13.11)$$
Once again the integral equation is equivalent to the differential equation to-
gether with the initial conditions.
Properties of an integral equation are determined by those of the kernel.
The kernel is said to be symmetric if K(x, t) = K(t, x). Properties of a sym-
metric kernel are similar to those of a symmetric matrix. Similarly, the kernel
is said to be positive definite, if

$$\int_a^b\!\!\int_a^b K(x,t)\,f(x)\,f^*(t)\,dt\,dx > 0, \qquad (13.12)$$

for every nonzero function f(x). If sign of the inequality is reversed, then it is
negative definite. If the kernel K(x, t) is a function of x - t only, then it is said
to be of convolution type.
If the kernel is a simple product of a function of x and a function of t, then
the Fredholm equation can be solved easily. For example, if K(x, t) = X(x)T(t),
then the integral can be written as

$$\int_a^b X(x)\,T(t)\,f(t)\,dt = A\,X(x), \qquad (13.13)$$

where the constant

$$A = \int_a^b T(t)\,f(t)\,dt. \qquad (13.14)$$

Thus, the Fredholm equation of the second kind has the solution

f(x) = AX(x) - g(x). (13.15)

The unknown constant A can be determined by substituting this solution in


(13.14), to obtain

$$A\left(1 - \int_a^b X(t)\,T(t)\,dt\right) = -\int_a^b g(t)\,T(t)\,dt. \qquad (13.16)$$

Fredholm equations of other kinds can also be solved in a similar manner,


provided the kernel is of the product form.
This kernel is a simple example of a degenerate kernel
$$K(x,t) = \sum_{i=1}^{n} X_i(x)\,T_i(t). \qquad (13.17)$$

In this case, if we define

$$A_i = \int_a^b T_i(t)\,f(t)\,dt, \qquad (i = 1, \ldots, n), \qquad (13.18)$$

then the solution of Fredholm equation of the second kind can be written as
$$f(x) = \sum_{i=1}^{n} A_i\,X_i(x) - g(x). \qquad (13.19)$$

To determine the unknown constants A_i, substitute this solution in (13.18), to obtain

$$A_i - \sum_{j=1}^{n} A_j \int_a^b T_i(t)\,X_j(t)\,dt = -\int_a^b T_i(t)\,g(t)\,dt, \qquad (i = 1, \ldots, n), \qquad (13.20)$$

which gives a system of n linear equations in the n unknowns Ai. These equa-
tions can be solved to obtain the required solution.
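As a concrete illustration of (13.17)-(13.20), the following sketch (in Python, purely illustrative and not part of the routines in Appendix B) solves a second-kind equation with an assumed two-term degenerate kernel; the auxiliary integrals in (13.18) and (13.20) are evaluated with a fine trapezoidal rule, and g(x) is manufactured from an assumed solution so that the result can be checked.

import numpy as np

# Illustrative degenerate kernel K(x,t) = X_1(x)T_1(t) + X_2(x)T_2(t) on [0,1];
# the particular X_i, T_i and the manufactured solution are assumptions.
X = [lambda x: x, lambda x: x**2]
T = [lambda t: t, lambda t: np.sin(t)]
n = 2

M = 2001
t = np.linspace(0.0, 1.0, M)
wt = np.full(M, t[1] - t[0]); wt[0] = wt[-1] = 0.5 * (t[1] - t[0])   # trapezoidal weights

f_true = lambda x: 1.0 + x**3
kernel = lambda x: sum(Xi(x) * Ti(t) for Xi, Ti in zip(X, T))        # K(x, t) on the grid
g = lambda x: (kernel(x) * f_true(t) * wt).sum() - f_true(x)         # so that the equation holds

# System (13.20):  A_i - sum_j A_j int T_i X_j dt = - int T_i g dt
gvals = np.array([g(tt) for tt in t])
B = np.zeros((n, n)); rhs = np.zeros(n)
for i in range(n):
    for j in range(n):
        B[i, j] = (i == j) - (T[i](t) * X[j](t) * wt).sum()
    rhs[i] = -(T[i](t) * gvals * wt).sum()
A = np.linalg.solve(B, rhs)

# Solution (13.19):  f(x) = sum_i A_i X_i(x) - g(x); compare with f_true at a few points.
for x in (0.1, 0.5, 0.9):
    print(x, sum(Ai * Xi(x) for Ai, Xi in zip(A, X)) - g(x), f_true(x))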
Thus, if a kernel is degenerate, or can be approximated by a degener-
ate kernel, then the solution of the corresponding Fredholm equation can be
computed easily using quadrature formulae. In principle, many kernels may be
expanded in a power series in t and x, which may converge over the entire re-
gion. However, if the number of terms required to achieve a reasonable accuracy
is very large, then we will end up with a large system of linear equations. In
such cases, it may be more efficient to use other techniques for a direct solution
of the integral equation.

13.2 Fredholm Equations of the Second Kind


Fredholm equation of the second kind can be solved iteratively by a technique
similar to the fixed-point iteration considered in Section 7.2. Thus, we can use

the iteration

$$f_{n+1}(x) = -g(x) + \int_a^b K(x,t)\,f_n(t)\,dt, \qquad (13.21)$$

where fn(x) are the successive iterates. It can be shown that, this process
converges if all eigenvalues of the kernel are inside the unit circle. This is the
classical technique which has been used to establish some of the fundamental
properties of Fredholm equations of the second kind. This technique can be
used in practical problems only if the integrals can be evaluated analytically. If
the integrals are to be evaluated numerically, then this method is too laborious
to be useful.
The simplest technique for numerical solution of integral equations is the
quadrature method, where we replace the integral by a quadrature formula and
solve the resulting system of algebraic equations. This technique is similar to
finite difference methods for differential equations. Thus, a Fredholm equation
of the second kind can be approximated by
$$f(x) + g(x) = \sum_{i=1}^{N} w_i\,K(x, a_i)\,f(a_i), \qquad (13.22)$$

where ai are the abscissas and Wi are the corresponding weights in the quadra-
ture formula. Now if we assume that this equation is satisfied for x = aj,
(j = 1,2, ... , N), then we get a system of N linear equations in N unknowns
f(aj). These equations can be written in the matrix form

(KD - I)f = g, (13.23)

where f and g are vectors with elements f(ai) and g(ai) respectively, K is a
matrix with elements k ij = K (ai, aj), and D is a diagonal matrix with elements
d ii = Wi. This system of linear equations can be solved for f, provided the
matrix K D - I is nonsingular. If the matrix K D - I is singular, then there
may be no solution, or the solution will not be unique. If the kernel K(x, t)
has an eigenvalue equal to one, then the solution cannot be unique, since any
multiple of the corresponding eigenfunction can be added to the solution. In
such cases, the matrix K D - I may be singular or nearly singular and the
problem is likely to be ill-conditioned.
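A minimal sketch of this quadrature method in Python (an illustration, not the FRED subroutine of Appendix B): trapezoidal abscissas and weights, the dense matrix KD - I of (13.23), and a library linear solver. The kernel K(x,t) = xt and the function g(x) = -2x/3 are assumptions chosen so that the exact solution of the continuous equation (13.22) is f(x) = x.

import numpy as np

# Quadrature (Nystrom) method for  f(x) + g(x) = int_0^1 K(x,t) f(t) dt,  Eq. (13.22).
K = lambda x, t: x * t                       # illustrative kernel, not from the text
g = lambda x: -2.0 * x / 3.0                 # chosen so that f(x) = x is the exact solution

N = 21
a = np.linspace(0.0, 1.0, N)                 # abscissas a_i of the trapezoidal rule
h = a[1] - a[0]
w = np.full(N, h); w[0] = w[-1] = 0.5 * h    # weights w_i

Kmat = K(a[:, None], a[None, :])             # k_ij = K(a_i, a_j)
D = np.diag(w)                               # d_ii = w_i
A = Kmat @ D - np.eye(N)                     # matrix (KD - I) of Eq. (13.23)

f = np.linalg.solve(A, g(a))                 # solve (KD - I) f = g
print(np.max(np.abs(f - a)))                 # small quadrature-induced error w.r.t. f(x) = x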
There are two important differences between an integral equation and a
boundary value problem in ordinary differential equations. Although both these
problems can be reduced to the solution of a system of algebraic equations, for
differential equations the matrix is in a convenient band form, which can be
solved efficiently. On the other hand, it is quite clear from (13.22) that solution
of an integral equation leads to a dense matrix which requires O(N^3) floating-
point operations to solve. This is due to the fact that a differential equation
represents local behaviour of the solution, where f(ai) is only related to a few
neighbouring values, while an integral equation represents global property of
the solution with f (ai) related to its value at all points. Thus, the solution of

an integral equation requires much more effort than that for a boundary value
problem considered in Section 12.8. In both cases, the solution of a system of
linear equations gives a set of discrete values, approximating the solution at
certain points. For differential equations the solution at any intermediate point
can be computed by interpolation. On the other hand, for Fredholm equation
of the second kind, we can use (13.22) itself to compute the solution at any
required point x, once the function values f(ai) are known.
The computational procedure for solving (13.22) is straightforward. Hav-
ing decided which quadrature formula to use, we can easily calculate the ele-
ments of the relevant matrix. Since this matrix is dense, we have to use the
usual method of Gaussian elimination with partial pivoting or Crout's algo-
rithm for solving the required system of linear equations. As the effort involved
in solving the system of equations is proportional to N^3, it is clear that the
value of N should be as small as possible. Hence, it may be advisable to use
a Gaussian formula, which gives higher order accuracy using the same number
of points. To estimate the truncation error, we can use a formula with larger
number of points, or alternately we can use the h ----> 0 extrapolation, or the
method of deferred correction.
In order to apply the method of deferred correction, we need to estimate
truncation error in the quadrature formula. As we have seen in Chapter 6,
truncation error in most quadrature formulae can be expressed in terms of a
high order derivative of the integrand at some unknown point in the interval.
This form is not very useful here, since the interval could be rather large and it
is difficult to estimate a typical value of the derivative in the interval. However,
as noted in Section 6.2, truncation error in the trapezoidal rule can be expanded
as a power series in h^2, and further Gregory's formula expresses this error in
a convenient difference form. This error expansion can be easily used to apply
the technique of deferred correction to estimate the truncation error in the
computed results.
Gregory's quadrature formula {6.8} can be written as

$$\int_a^b u(t)\,dt = h\left(\tfrac{1}{2}u_1 + u_2 + \cdots + u_{N-1} + \tfrac{1}{2}u_N\right) - \frac{h}{12}\left(\nabla u_N - \Delta u_1\right) - \frac{h}{24}\left(\nabla^2 u_N + \Delta^2 u_1\right) - \cdots, \qquad (13.24)$$

where h = (b - a)/(N - 1) and u_j = u(a + (j-1)h). For solution of integral


equations u(t) = K(x, t)f(t). Once we have decided how many terms in the
series are to be used, this formula leads to a system of N linear equations,
which can be solved to give the solution. Since the error expansion involves
differences of the unknown function f(t), it is almost impossible to predict in
advance how many terms should be included to achieve the required accuracy.
In the method of deferred correction, we first compute the solution using some
fixed number of terms in the error expansion to obtain a first approximation to
the solution. Using the computed solution, we can estimate the local truncation
error using the first few terms that are neglected in the error expansion. Then

as in Section 12.8, the correction δf(a_j) is computed by solving the system of
equations

$$(KD - I)\,\delta f = e, \qquad (13.25)$$

where e is the vector of estimated local truncation error. This estimate is ob-
tained by computing the required differences using the approximate solution
(e.g., e_j = (h/12)(∇u_N - Δu_1)). Calculation of deferred correction requires so-
lution of another system of linear equations involving the same matrix. Hence, if
the information from elimination is preserved, it requires only O(N^2) floating-
point operations to compute the solution. Apart from improving the accuracy
of the computed solution, the method of deferred correction gives an estimate
of the truncation error.
This process is implemented in subroutine FRED in Appendix B. This
subroutine can use trapezoidal rule, Simpson's rule or composite Gauss-
Legendre formulae based on 4, 8, 16 or 32 points. For trapezoidal rule, it also
estimates the truncation error using the method of deferred correction. The de-
ferred correction is calculated using the first two terms in the error expansion
which should give an accuracy of O(h^4), provided the integrand is sufficiently
smooth.
Using the technique of deferred correction, it is possible to iterate on the
corrected solution to get the true solution of the equation, using a higher order
formula including the error term. In this method, using the corrected solution
we can again calculate the relevant differences to estimate the truncation error.
With this estimate, we can once again calculate the deferred correction. If the
iteration converges, then the resulting solution is the same as that obtained if
the weights were modified to include the relevant differences before solving the
system of linear equations. In fact, we can go on adding higher order differences,
until satisfactory accuracy is achieved. The convergence of this process is not
guaranteed, since it depends on the integrand and the value of h. If h is too
large, then higher order differences may not be small and the process may not
converge.
Instead of deferred correction, we can apply the h ---> 0 extrapolation to
obtain higher accuracy. Thus, using the same quadrature formula based on
Nand 2N points, we can solve the system of linear equations to obtain two
approximations to the required solution. Applying the h ---> 0 extrapolation, we
can improve on these results. For example, the computed values obtained using
the trapezoidal rule can be improved by noting that the truncation error is a
power series in h^2, provided relevant derivatives exist. This technique has been
described in Sections 5.3 and 6.2, and will not be considered further. However,
for integral equations the system of linear equations corresponding to different
values of N are essentially independent and will have to be solved separately.
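For the same assumed model problem as in the earlier sketch (K(x,t) = xt, g(x) = -2x/3, exact solution f(x) = x), the following hedged sketch shows the h → 0 extrapolation: the trapezoidal solutions for 11 and 21 points share the coarser abscissas, where the leading h^2 error term can be removed by the usual Richardson combination.

import numpy as np

K = lambda x, t: x * t                 # illustrative kernel, as before
g = lambda x: -2.0 * x / 3.0           # exact solution is f(x) = x

def solve_trap(N):
    """Solve (KD - I) f = g with an N-point trapezoidal rule on [0,1]."""
    a = np.linspace(0.0, 1.0, N)
    h = a[1] - a[0]
    w = np.full(N, h); w[0] = w[-1] = 0.5 * h
    A = K(a[:, None], a[None, :]) * w[None, :] - np.eye(N)
    return a, np.linalg.solve(A, g(a))

a1, f1 = solve_trap(11)                # step h
a2, f2 = solve_trap(21)                # step h/2; a2[::2] coincides with a1

# The truncation error is a series in h^2, so (4 f_{h/2} - f_h)/3 removes the
# leading term at the points common to both meshes.
f_ex = (4.0 * f2[::2] - f1) / 3.0
print(np.max(np.abs(f1 - a1)), np.max(np.abs(f_ex - a1)))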
Unfortunately, these methods for achieving higher order accuracy may not
be of much use in actual practice, since most problems that we come across
have some singularity or discontinuity. In many cases, the kernel or some of
its derivatives are singular or discontinuous at t = x. This singularity can be

weakened by using

$$\int_a^b K(x,t)\,f(t)\,dt = \int_a^b K(x,t)\left(f(t) - f(x)\right)dt + \int_a^b K(x,t)\,f(x)\,dt = \int_a^b K(x,t)\left(f(t) - f(x)\right)dt + f(x)\,H(x), \qquad (13.26)$$
where H(x) is assumed to be a known function. Since f(x) - f(t) vanishes at
x = t, the singularity is weakened. With this modification, the corresponding
algebraic equations can be written as
$$f(a_i) + g(a_i) = \sum_{j=1}^{N} w_j\,K(a_i, a_j)\left(f(a_j) - f(a_i)\right) + f(a_i)\,H(a_i), \qquad (i = 1, \ldots, N). \qquad (13.27)$$
In this case, the matrix elements are somewhat more complicated. But if the
function H(a_i) can be easily computed, then there is not much additional effort
required to compute the coefficients. In any case, since the matrix is dense the
solution of either system requires the same effort. It should be noted that this
device only weakens the singularity, it does not eliminate it. Thus, if the original
integrand is singular at t = x, the modified integrand will have a singularity in
its first derivative. Nevertheless, the improvement could be significant and this
technique may be useful in some cases.
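The following sketch illustrates the device (13.27) under stated assumptions: a weakly singular kernel K(x,t) = 0.1|x - t|^{-1/2} on [0,1], for which H(x) = 0.2(√x + √(1-x)) is known analytically, and g(x) = x. The diagonal term j = i, where f(a_j) - f(a_i) vanishes, is simply omitted, and the accuracy is judged by comparing the solutions on two meshes.

import numpy as np

# Sketch of the subtraction device (13.27) for a weakly singular kernel.
Kfun = lambda x, t: 0.1 / np.sqrt(np.abs(x - t))       # illustrative singular kernel
H = lambda x: 0.2 * (np.sqrt(x) + np.sqrt(1.0 - x))    # H(x) = int_0^1 K(x,t) dt
g = lambda x: x

def solve(N):
    a = np.linspace(0.0, 1.0, N)
    h = a[1] - a[0]
    w = np.full(N, h); w[0] = w[-1] = 0.5 * h
    A = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            if j != i:                       # f(a_j) - f(a_i) kills the j = i term
                kij = w[j] * Kfun(a[i], a[j])
                A[i, j] += kij               # coefficient of f(a_j)
                A[i, i] -= kij               # coefficient of -f(a_i)
        A[i, i] += H(a[i]) - 1.0             # + f(a_i) H(a_i) - f(a_i)
    return a, np.linalg.solve(A, g(a))       # rows of (13.27) rearranged as A f = g

a1, f1 = solve(41)
a2, f2 = solve(161)
print(np.max(np.abs(f1 - f2[::4])))          # self-convergence check at common points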
If the kernel is not singular, but discontinuous at t = x, then the range
of integration can be broken into two parts and Gregory's formula can be ap-
plied to each of them. This division introduces extra terms involving differences
around t = x. However, at points close to either boundary, we may not have suf-
ficient abscissas to calculate the required differences on the smaller subinterval.
Consequently, additional errors may be introduced.
Another alternative to deal with singularity is to use a quadrature for-
mula with singular weight function. For example, we can consider a quadrature
formula of the form (Section 6.5)

$$\int_a^b K(x,t)\,f(t)\,dt \approx \sum_{i=1}^{N} w_i(x)\,f(a_i). \qquad (13.28)$$

As shown in Section 6.5, under appropriate conditions, the error term in such
formulae depends only on the derivatives of f(t) and the kernel does not enter
the error term. This technique removes the singularity entirely, but the effort
involved could be considerably larger. It can be seen that in this case the
weights w_i are not constants, but are some functions of the variable x. Further,
these weights may not be known for each value of x, in which case we need
to calculate the weights using the method of undetermined coefficients. This
method yields

$$\int_a^b K(x,t)\,t^k\,dt = \sum_{i=1}^{N} w_i(x)\,a_i^k, \qquad (k = 0, 1, \ldots, N-1). \qquad (13.29)$$



This system of N linear equations in the weights Wi can be solved, provided the
integrals on the left-hand sides are known. This exercise needs to be repeated for
x = aj, (j = 1, ... , N). Thus, this process requires solution of N + 1 systems
of equations, each with N variables. Hence, the effort required to solve the
integral equation increases by a factor of N. This procedure requires O(N^4)
floating-point operations.
It is not essential to consider the entire kernel as the weight function for
the quadrature formulae. We can write K(x, t) = M(x, t)L(x, t), where M(x, t)
is smooth, but L(x, t) could be singular. In this case, we can consider quadrature
formulae of the form

$$\int_a^b L(x,t)\,\phi(t)\,dt \approx \sum_{i=1}^{N} w_i(x)\,\phi(a_i). \qquad (13.30)$$

If the singular part L (x, t) is independent of x, then the weights are independent
of x and we do not need to solve a system of linear equations for each value of
aj to find the weights. In such cases, the singularity can be eliminated easily.
However, in most practical problems, it may not be possible to isolate the
singular part as a function of t alone.
There is an enormous increase in the effort required to use a special
quadrature formula with singular weight function, and hence in practical prob-
lems it is better to use a composite formula. Thus, we can use the trapezoidal
rule for regular integrands over the intervals [a, x - h] and [x + h, b]; while over
the interval [x - h, x + h] which contains the singularity, we can use a special
formula with singular weight function involving only three points. Both these
formulae are accurate to second order. In this case, most of the weights entering
in the quadrature formulae can be determined without any effort, while for each
value of x = aj, the weights of the three-point formula need to be calculated.
This technique requires much less additional effort.
Another alternative is to break up the range of integral into several parts,
and to use different quadrature formulae over each one. This is essentially what
is done when a composite formula using say the trapezoidal rule or the Simp-
son's rule is used. The final system of linear equations is of the same size, but if
the weights are to be determined using the method of undetermined coefficients,
then the size of the system of linear equations can be reduced. For example,
if we break up the original interval into p equal parts each of q points, then
for determining the weights of quadrature formulae we need to solve equations
in q unknowns. There are p such systems for each of the N = pq abscissas.
Thus, the total effort in computing the weights is of the order of Npq^3 = N^2q^2,
instead of N^4 when the range is not divided. We therefore gain a factor of p^2 in
efficiency. If p is large, then the gain in efficiency will be large, but the resulting
formula is of low order and accuracy may be low. Consequently, p should be
chosen such that the resulting formulae are of reasonably high order to achieve
the required accuracy.
If the solution f(x) itself is singular, then the problem is more difficult,
since in general, the form of the singularity is not known in advance. In principle,

we can use special quadrature formulae with singular weight functions to deal
with this problem also. But the resulting quadrature formula could be quite
complicated, requiring many weights to be determined by solving systems of
linear equations. It should be noted that the integrand is actually a product
of the kernel and the solution. Hence, in some cases, the singularity in the two
functions may cancel giving accurate results, even though the kernel is singular
{8}.
EXAMPLE 13.1: Solve the integral equation

$$f(x) + \int_0^1 K(x,t)\,f(t)\,dt = x^3, \qquad K(x,t) = \begin{cases} x(1-t), & \text{for } 0 \le x \le t \le 1; \\ t(1-x), & \text{for } 0 \le t < x \le 1, \end{cases} \qquad (13.31)$$

Using subroutine FRED, we can solve this equation using the trapezoidal rule with 11,
21 and 41 points. This subroutine also applies the technique of deferred correction to achieve
higher order accuracy. The results are shown in Table 13.1. The computed results can be
compared with the exact solution

$$f(x) = \frac{7\sinh x}{\sinh 1} - 6x. \qquad (13.32)$$

It can be seen that error in the computed values decreases as h^2 and the deferred correction
does not lead to any significant improvement because of a discontinuity in the first derivative
of the kernel at t = x. Thus, a large number of points are required to achieve high accuracy.
The trapezoidal rule is not affected by the discontinuity, since the discontinuity occurs at
one of the grid points. Thus, the integral can be broken up into two parts at this point.
The combined result will be the same as that obtained without dividing the range. Hence,
the trapezoidal rule does not feel the discontinuity, but the higher order formulae (including
Gregory's formula) which are based on more than two points are affected by the discontinuity.

Table 13.1: Solving a Fredholm equation of the second kind using trapezoidal rule

Computed solution Corrected solution


x    f_exact      N=11      N=21      N=41     |Error|      N=11       N=21       N=41      |Error|

.2 -.0007568 -.00061 -.000719 -.000747 .000009 -.0007372 -.0007543 -.0007563 .0000005


.4 .0466162 .04688 .046682 .046633 .000017 .0466111 .0466110 .0466147 .0000016
.6 .1921805 .19249 .192258 .192200 .000019 .1920685 .1921470 .1921718 .0000087
.8 .4899384 .49018 .489998 .489953 .000015 .4895878 .4898433 .4899142 .0000240
1.0 1.0000000 1.00000 1.000000 1.000000 .000000 1.0000000 1.0000000 1.0000000 .0000000

If Simpson's rule is used, the results are not much better, because of the discontinuity
at t = x. Similar results are obtained using composite Gauss-Legendre formulae with a total
of 32 points. The errors in results using 4, 8, 16 and 32-point formulae are comparable. Thus,
the higher order formulae do not give higher order accuracy.
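The trapezoidal-rule part of this example is easy to reproduce in a few lines; the sketch below (an illustration in Python, without the deferred correction and in double precision rather than 24-bit arithmetic) assembles the dense matrix for the kernel (13.31) and compares the solution with (13.32).

import numpy as np

# Reproduce the trapezoidal-rule results of Example 13.1 (no deferred correction).
K = lambda x, t: np.where(t <= x, t * (1.0 - x), x * (1.0 - t))
g = lambda x: x**3                      # equation:  f(x) + int_0^1 K f dt = x^3
f_exact = lambda x: 7.0 * np.sinh(x) / np.sinh(1.0) - 6.0 * x

for N in (11, 21, 41):
    a = np.linspace(0.0, 1.0, N)
    h = a[1] - a[0]
    w = np.full(N, h); w[0] = w[-1] = 0.5 * h
    # Here the integral is added to f, so the matrix is (I + K D).
    A = np.eye(N) + K(a[:, None], a[None, :]) * w[None, :]
    f = np.linalg.solve(A, g(a))
    print(N, np.max(np.abs(f - f_exact(a))))   # error decreases roughly as h^2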

The quadrature method can be easily extended to nonlinear problems.


In that situation, we can use some approximation f^(i)(x) to the solution and
linearise the equation to calculate the correction δf^(i)(x) = f^(i+1)(x) - f^(i)(x).
This process can be repeated with the new approximation f^(i+1)(x), until a

satisfactory accuracy is achieved. For example, the nonlinear equation

$$\int_a^b K(x, t, f(t))\,dt = f(x) + g(x), \qquad (13.33)$$

can be linearised to give

$$\int_a^b K\left(x, t, f^{(i)}(t)\right)dt + \int_a^b \frac{\partial K}{\partial f}\left(x, t, f^{(i)}(t)\right)\delta f^{(i)}(t)\,dt = f^{(i)}(x) + \delta f^{(i)}(x) + g(x). \qquad (13.34)$$
Assuming that f^(i)(x) is a known function, the first integral can be evaluated
and the equation can be treated as a linear Fredholm equation of the second
kind in δf^(i)(x), with kernel ∂K/∂f. This equation can be solved using the
techniques outlined in this section to give the correction δf^(i)(x). The itera-
tion can be repeated until the solution converges satisfactorily. This technique
is equivalent to using Newton's method for solving the system of nonlinear
equations obtained by using the quadrature formula.
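A hedged sketch of this linearisation, written directly as Newton's method on the quadrature equations: the nonlinear kernel K(x,t,f) = 0.2xtf^2, its derivative ∂K/∂f = 0.4xtf and the function g(x) = -0.95x are illustrative assumptions, chosen so that f(x) = x solves the continuous equation (13.33).

import numpy as np

# Newton iteration on the quadrature form of  int_0^1 K(x,t,f(t)) dt = f(x) + g(x).
K = lambda x, t, f: 0.2 * x * t * f**2
dKdf = lambda x, t, f: 0.4 * x * t * f
g = lambda x: -0.95 * x

N = 21
a = np.linspace(0.0, 1.0, N)
h = a[1] - a[0]
w = np.full(N, h); w[0] = w[-1] = 0.5 * h

f = np.zeros(N)                                   # starting approximation f^(0)
for it in range(20):
    R = (K(a[:, None], a[None, :], f[None, :]) * w[None, :]).sum(axis=1) - f - g(a)
    J = dKdf(a[:, None], a[None, :], f[None, :]) * w[None, :] - np.eye(N)
    df = np.linalg.solve(J, -R)                   # linearised correction delta f^(i)
    f += df
    if np.max(np.abs(df)) < 1e-12:
        break
print(it, np.max(np.abs(f - a)))                  # converges close to f(x) = x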
A similar technique can also be applied to integro-differential equations,
where the unknown function occurs inside the integral and its derivatives also
appear in the equation. In this case, the integral can be approximated using
the quadrature formula as explained above, while the derivatives can be ap-
proximated using a finite difference approximation of the form discussed in
Section 12.8. The resulting system of algebraic equations will be linear if the
original equation is linear, in which case, we can solve it using Gaussian elim-
ination. If the equation is nonlinear, then it can be solved iteratively after
linearising at every step as explained in the preceding paragraph.
The quadrature method can be easily extended to a system of integral
equations. For example, consider the system of linear integral equations

$$f_i(x) + g_i(x) = \sum_{j=1}^{M} \int_a^b K_{ij}(x,t)\,f_j(t)\,dt, \qquad (i = 1, \ldots, M). \qquad (13.35)$$

Each of the integrals can be approximated using the same quadrature formula
based on N points to get the system of equations

$$f_i(a_s) + g_i(a_s) = \sum_{j=1}^{M}\sum_{k=1}^{N} w_k\,K_{ij}(a_s, a_k)\,f_j(a_k), \qquad i = 1, \ldots, M, \quad s = 1, \ldots, N; \qquad (13.36)$$

which defines NM linear equations in an equal number of unknowns.
Similarly, this method can also be extended to equations involving multi-
ple integrals, in which case, we can use a formula for multiple integration. We
can use either the product rules or special rules for multiple integrals. How-
ever, the number of abscissas required for even a modest approximation could
be very large, resulting in a large system of equations.

13.3 Expansion Methods


In this section, we describe methods for approximating the solution of a Fred-
holm equation of the second kind by expansions of the form
$$f(x) = \sum_{i=1}^{N} a_i\,\phi_i(x), \qquad (13.37)$$

where φ_i(x) are suitably chosen basis functions. The basis functions could be
polynomials, trigonometric functions or B-splines. To determine the coefficients
of expansion, we can substitute this solution into the integral equation, to get
the residual

$$\eta(x) = g(x) + \sum_{i=1}^{N} a_i\,\phi_i(x) - \sum_{i=1}^{N} a_i \int_a^b K(x,t)\,\phi_i(t)\,dt. \qquad (13.38)$$

Unless the true solution is a linear combination of the chosen basis functions,
it is not possible to choose the coefficients ai such that the residual vanishes
identically. The coefficients ai should therefore be chosen such that the residual
is as small as possible.
The simplest option here is to use interpolation, that is, to require that
η(x) vanishes at a set of suitably chosen points. Of course, this does not imply that
η(x) vanishes at other points too, but in general, if the functions are sufficiently
smooth and the set of points is chosen carefully, then we can expect the residual
to be small throughout the region. Thus, if z_i, (i = 1, ..., N) are the points
at which η(x) vanishes, then substituting x = z_i in (13.38) and equating it to
zero, we get a system of N linear equations in the N unknown coefficients a_i. These
equations can be written in the matrix form as (B - C)a = g, where g_j = g(z_j),
while B and C are N x N matrices with elements

$$b_{ij} = \int_a^b K(z_i, t)\,\phi_j(t)\,dt, \qquad c_{ij} = \phi_j(z_i). \qquad (13.39)$$

This system of equations can be solved for the coefficients, provided the matrix
B - C is nonsingular. The solution as well as the condition number of the matrix
will crucially depend on the choice of the basis functions and the interpolating
points Zi. This technique is usually referred to as collocation.
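A small collocation sketch under stated assumptions: Chebyshev polynomials mapped to [0,1] as basis functions φ_j, the mapped zeros of T_N as collocation points z_i, the smooth kernel K(x,t) = e^{xt}, and g manufactured from a known solution; the elements b_ij are evaluated by a fine trapezoidal rule rather than analytically.

import numpy as np

# Collocation for  int_0^1 K(x,t) f(t) dt = f(x) + g(x)  with Chebyshev basis on [0,1].
K = lambda x, t: np.exp(x * t)              # illustrative smooth kernel
f_true = lambda x: np.cos(x)                # assumed solution used to manufacture g

M = 2001                                    # fine grid for the auxiliary integrals
t = np.linspace(0.0, 1.0, M)
wt = np.full(M, t[1] - t[0]); wt[0] = wt[-1] = 0.5 * (t[1] - t[0])
g = lambda x: (K(x, t) * f_true(t) * wt).sum() - f_true(x)

N = 8
phi = lambda j, x: np.cos(j * np.arccos(2.0 * x - 1.0))                      # T_j mapped to [0,1]
z = 0.5 * (1.0 + np.cos((2 * np.arange(1, N + 1) - 1) * np.pi / (2 * N)))   # mapped zeros of T_N

B = np.zeros((N, N)); C = np.zeros((N, N))
for i in range(N):
    for j in range(N):
        B[i, j] = (K(z[i], t) * phi(j, t) * wt).sum()    # b_ij = int K(z_i,t) phi_j(t) dt
        C[i, j] = phi(j, z[i])                           # c_ij = phi_j(z_i)

acoef = np.linalg.solve(B - C, np.array([g(zi) for zi in z]))
x = np.linspace(0.0, 1.0, 11)
f_num = sum(acoef[j] * phi(j, x) for j in range(N))
print(np.max(np.abs(f_num - f_true(x))))    # small for this smooth kernel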
An alternative is to minimise some norm of the residual η(x). For this
purpose, we can use the L_1, L_2 or L_∞ norms defined in Section 10.1. It is more
convenient to define these norms over a set of discrete points z_i, (i = 1, ..., M).
In this case, M > N and it is not possible to choose the coefficients such that
the residuals η(z_i) = 0 for every value of i. Consequently, (13.38) provides an
overdetermined system of linear equations. In particular, the L_2 norm leads to
the linear least squares approximation, which can be obtained as explained in
Section 10.2. The problem with L_1 or L_∞ norms can be solved using linear
programming techniques as explained in Sections 10.13 and 10.12, respectively.

These techniques will almost certainly require more effort than the collocation
method described earlier. However, it may give better results and further the
residuals in the equations can give some estimate of the truncation error. We
can consider the computed solution as the true solution of the perturbed inte-
gral equation, with the residual η(x) subtracted from the function g(x). This
backward error estimate can be converted to the actual error, if the condition
number for the kernel is known or can be estimated.
Third alternative is the so-called Galerkin method, where the coefficients
are determined by requiring that the residual η(x) is orthogonal to the basis
functions:

$$(\eta, \phi_i) = \int_a^b \eta(x)\,\phi_i^*(x)\,dx = 0, \qquad (i = 1, \ldots, N), \qquad (13.40)$$

where φ_i^*(x) denotes the complex conjugate of φ_i(x). These equations can be
written in the matrix form as (B - C)a = v, where B and C are N x N matrices
and v is a vector with N components defined by

$$b_{ij} = \int_a^b\!\!\int_a^b K(x,t)\,\phi_j(t)\,\phi_i^*(x)\,dx\,dt, \qquad c_{ij} = (\phi_j, \phi_i) = \int_a^b \phi_j(x)\,\phi_i^*(x)\,dx, \qquad v_i = (g, \phi_i) = \int_a^b g(x)\,\phi_i^*(x)\,dx. \qquad (13.41)$$

It is also possible to add a suitable weight function in the integrals, by re-


defining the scalar product using the weight function. If the functions φ_i(x) are
orthonormal, then the matrix C is the identity matrix. If the kernel is symmet-
ric, then the matrix B is also symmetric. On the other hand, in the collocation
method the corresponding matrices are in general not symmetric, even if the
original kernel is symmetric. Using symmetry of the matrices, we can improve
the efficiency of the numerical methods. This feature is particularly useful for
eigenvalue problems. If the matrix B - C is nonsingular, then the system of
linear equations can be solved to obtain a unique solution for the coefficients
ai. Implementation of the Galerkin method is complicated because of the need
to calculate the elements b_ij, c_ij and v_i, which involve evaluation of single or
double integrals. If these integrals are to be evaluated numerically, which is
often the case, then the amount of effort required is substantial.
In practical computations, it is most convenient to use the collocation
technique to find the coefficients of expansion, as this is good enough in most
cases. It is more important to choose the basis functions properly. As in the
least squares approximation, the basis functions should be not only linearly in-
dependent, but also nearly orthogonal. Thus, for a polynomial approximation,
although it is more convenient to use the basis functions φ_i(x) = x^{i-1}, this
choice may lead to ill-conditioned equations when N is moderately large. It is
therefore recommended to use a set of orthogonal polynomials as the basis func-
tions. For example, we can use the Legendre polynomials or the Chebyshev poly-
nomials after transforming the interval appropriately. If φ_i(x) = P_{i-1}(x), the

Legendre polynomial of degree i-1, then we can apply the collocation method
using the points z_i, which are zeros of P_N(x). With Chebyshev polynomials we
can use the Galerkin method with a weight function w(x) = 1/√(1-x²), or the
collocation method with z_i as the zeros of T_N(x), or the extrema of T_{N-1}(x).
If approximation by a single polynomial over the entire range is not good
enough, we can use the B-spline basis functions (Section 4.5). As mentioned in
Section 4.5, B-spline basis functions are linearly independent to a good extent
and hence the resulting system of equations is generally well-conditioned. Fur-
ther, these basis functions can be easily evaluated at any point and hence are
convenient to use.
For smooth kernels a high order quadrature formula can give sufficient
accuracy using only a few points, and quadrature methods considered in the
previous section are generally efficient. However, in practice most kernels have
discontinuity in function value or its derivatives at t = x, and high order quadra-
ture formulae are not very useful. In such cases, expansion methods may be
better, particularly if the integrals can be evaluated analytically. Even if the
integrals cannot be evaluated analytically, it may be much simpler to evaluate
the integrals to calculate the coefficients of matrices Band C, since in this case,
it is not necessary to use the same abscissas for each integral. In fact, we can
use an adaptive quadrature routine to tackle the singularity or discontinuity.
Alternately, if the kernel is discontinuous at a known point, then we can break
up the range into two parts and treat each part separately to achieve higher
order accuracy. This division may not be possible in quadrature methods, since
the abscissas have to be the same for all values of x.
EXAMPLE 13.2: Solve the integral equation in Example 13.1 using expansion methods.
We can try to approximate the solution by a polynomial of appropriate degree. It can
be shown that only odd degree terms in the polynomial will survive. Hence, we can use the
basis functions φ_i(x) = x^{2i-1}, (i = 1, ..., N). This choice is not the best, since for larger
values of N the resulting system of linear equations can be ill-conditioned. However, it is the
simplest choice. Coefficients of the expansion can be determined using collocation method
based on the N equidistant points z_i = i/N, (i = 1, ..., N). For this choice of basis functions

$$\psi_i(x) = \int_0^1 K(x,t)\,\phi_i(t)\,dt = \frac{x\left(1 - x^{2i}\right)}{2i(2i+1)}, \qquad (13.42)$$

and we get the system of linear equations

$$\sum_{i=1}^{N} a_{2i-1}\,\frac{z_j\left(1 - z_j^{2i}\right)}{2i(2i+1)} + \sum_{i=1}^{N} a_{2i-1}\,z_j^{2i-1} = g(z_j), \qquad (j = 1, \ldots, N). \qquad (13.43)$$

This system of equations can be solved to obtain the coefficients. To test the accuracy we
compute the maximum error in the calculated value of the function. These computations were
performed using a 24-bit arithmetic. For N = 2 the error is 0.006 and reduces to 5 × 10^{-5}
for N = 3. For higher values of N it has essentially reached the limit due to roundoff error.
For N = 4 the solution is given by
$$f(x) = -0.04357326\,x + 0.992741\,x^3 + 0.049619\,x^5 + 0.001213\,x^7. \qquad (13.44)$$
Beyond N = 8, the error increases, most probably due to the fact that the resulting system
of linear equations is ill-conditioned. Actually, it turns out that even for larger values of N
the error is significant only for x > 0.8 and is likely to be due to errors in higher coefficients.
These results may be compared with those in Example 10.2, where ill-conditioning is more

severe. Comparing these results with those in Example 13.1, it can be seen that in this case
expansion methods give much better accuracy as compared to quadrature methods for the
same size of linear equations.
Instead of collocation method we can use the Galerkin method to determine the coef-
ficients. Using (13.42) and
$$\int_0^1\!\!\int_0^1 K(x,t)\,\phi_j(x)\,\phi_k(t)\,dx\,dt = \frac{1}{(2k+1)(2j+1)(2k+2j+1)}, \qquad (13.45)$$

we can set up the required matrix. For N ≤ 4, the results are similar to that obtained using
collocation. But for larger values of N the coefficients are significantly different and the error
is much larger. This error is clearly due to ill-conditioning of the resulting equations. In
contrast to the collocation method, here the error is of the same order almost throughout the
interval. Thus, for a given value of N, the system of equations using Galerkin method are
more ill-conditioned as compared to that using collocation. Further, when both the systems
are well-conditioned, there is no significant difference in the results. Hence, in this case, it
is preferable to use the collocation method. Instead of polynomial we can also use B-spline
basis functions, which, in general, lead to system of equations which are better conditioned.
However, B-splines cannot account for the fact that solution has only odd degree terms and
hence in this case the B-splines will not be as efficient as the basis functions considered above.
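The collocation calculation of this example can be reproduced directly; the following sketch sets up and solves (13.43) for N = 4 in double precision (rather than the 24-bit arithmetic used above), so the computed coefficients may differ from (13.44) in the last quoted digits.

import numpy as np

# Set up and solve the collocation equations (13.43) of Example 13.2 for N = 4.
N = 4
i = np.arange(1, N + 1)                       # basis phi_i(x) = x**(2*i - 1)
z = np.arange(1, N + 1) / N                   # collocation points z_j = j/N
psi = lambda x: x[:, None] * (1.0 - x[:, None]**(2 * i)) / (2 * i * (2 * i + 1))
A = psi(z) + z[:, None]**(2 * i - 1)          # matrix of Eq. (13.43)
acoef = np.linalg.solve(A, z**3)              # right-hand side g(z_j) = z_j^3
print(acoef)                                  # compare with the coefficients in (13.44)

# Maximum error of the resulting approximation against the exact solution (13.32).
x = np.linspace(0.0, 1.0, 101)
f_num = (acoef * x[:, None]**(2 * i - 1)).sum(axis=1)
f_exact = 7.0 * np.sinh(x) / np.sinh(1.0) - 6.0 * x
print(np.max(np.abs(f_num - f_exact)))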

13.4 Eigenvalue Problem


In this section, we consider numerical solution of the eigenvalue problem defined
by a Fredholm equation of the third kind. The properties of this eigenvalue
problem are similar to those of the algebraic eigenvalue problem. For example,
if the kernel is symmetric (i.e., K(x, t) = K(t, x)), then the eigenvalues are real
and eigenfunctions corresponding to distinct eigenvalues are orthogonal. Thus

$$\int_a^b f_i(x)\,f_j(x)\,dx = 0, \qquad \lambda_i \ne \lambda_j, \qquad (13.46)$$

where f_i(x) and f_j(x) are eigenfunctions corresponding to the eigenvalues λ_i
and λ_j. If the kernel is nondegenerate, then it has an infinite number of eigenvalues.
Unlike the differential eigenvalue problem, however, the spectrum of eigenvalues
has the origin, rather than infinity as the limit point. It can be seen that if a
differential eigenvalue problem is transformed to the integral form, then the
eigenvalue λ will multiply the integral rather than f(x). In fact, sometimes
the eigenvalue problem for integral equations is defined with the factor of λ
multiplying the integral. However, we follow the more natural definition given in
Section 13.1, even though it leads to an inverted spectrum. With this definition,
we are generally interested in the eigenvalue with the largest magnitude.
If the number of nonzero eigenvalues is finite (say r), then the kernel is
degenerate and may be expressed in the form (provided it is symmetric)

$$K(x,t) = \sum_{i=1}^{r} \lambda_i\,f_i(x)\,f_i(t), \qquad (13.47)$$

where f_i(x) are the normalised eigenfunctions. This expression can be gener-
alised to unsymmetric kernels using the left eigenfunctions, which are equiva-
lent of the left eigenvectors for the matrix eigenvalue problem. For a symmetric

kernel, we can also define the Rayleigh quotient by

$$\lambda_R = \frac{\int_a^b\!\int_a^b K(x,t)\,f(x)\,f(t)\,dx\,dt}{\int_a^b \left(f(x)\right)^2 dx}, \qquad (13.48)$$

which for any well behaved function f(x) is between the smallest and the largest
eigenvalues. Further, we also have the relation

$$\sum_{i} \lambda_i = \int_a^b K(x,x)\,dx, \qquad (13.49)$$

which corresponds to the trace of a matrix.


Now returning to the computational problem, we can use the quadrature
method described in Section 13.2 to obtain the equivalent algebraic eigenvalue
problem KDf = λf, where K and D are the matrices defined in Section 13.2.
Unlike the differential eigenvalue problem, in this case the matrix is dense, and
it may be better to reduce the matrix to a more condensed form before finding
the eigenvalues. We can use direct or inverse iteration to determine the few
eigenvalues with largest magnitude. The eigenvalues with smaller magnitude

may not be represented accurately in the equivalent algebraic eigenvalue prob-
lem. The method of deferred correction or the h → 0 extrapolation can also be
applied to the eigenvalue problem, to achieve higher order accuracy.
It should be noted that if the matrix is obtained as in Section 13.2, then
even if the kernel is real and symmetric, the matrix K D may not be symmetric.
In order to obtain an eigenvalue problem with real symmetric matrix, we can
reformulate the problem as

$$\left(D^{1/2} K D^{1/2}\right)\left(D^{1/2}f\right) = \lambda\left(D^{1/2}f\right), \qquad (13.50)$$

which defines a similarity transform on KD. Thus, the eigenvalues are the same
while the eigenvectors are D^{1/2}f. This symmetric matrix can be reduced to a sym-
metric tridiagonal form using Householder's method and the required eigen-
values of the symmetric tridiagonal matrix can be determined using the Sturm
sequence property as explained in Section 11.4. It may be noted that, this refor-
mulation is possible only if all weights in the quadrature formula are positive.
It is also possible to use the expansion methods to solve the eigen-
value problem. The corresponding equations can be obtained by substituting
g(x) = 0, and adding a factor of λ in the terms arising from f(x), in the rele-
vant equations in Section 13.3. It can be easily seen that the resulting eigenvalue
problem may not be in the standard form, but of the form Bx = λCx. This
eigenvalue problem can be solved by techniques outlined in Section 11.1. With
collocation method, the matrices Band C may not be symmetric, even if the
original kernel is symmetric. Thus, an eigenvalue problem for a symmetric op-
erator is reduced to one for an unsymmetric matrix, which may cause problems
in numerical solution. On the other hand, using Galerkin method the matrices

are symmetric if the kernel is symmetric. Further, if the basis functions are or-
thonormal, then C is an identity matrix and the problem reduces to a standard
eigenvalue problem. Thus, Galerkin method has a significant advantage as far
as solving the matrix eigenvalue problem is concerned, even though calculating
coefficients of the matrix may require more effort as compared to that in the
collocation method.
In all these methods, the eigenvalue problem for integral equation is ap-
proximated by an algebraic eigenvalue problem. In the limit of N ----> 00, the
eigenvalues of this matrix should tend to those of the integral equation. In
practical computations with finite matrix, we can only hope to approximate
a few of these eigenvalues. The higher order eigenvalues (i.e., the ones with
smaller magnitude) have eigenfunctions with large number of nodes and we
cannot hope to approximate them to any reasonable accuracy, using a matrix
of comparable size. On the other hand, the lower eigenvalues can be approx-
imated reasonably well by those of the matrix. Another problem in practical
computation is identification of the required eigenvalues. If all eigenvalues of
the algebraic problem are computed, then we can arrange the eigenvalues in
the order of decreasing magnitude. In this case, any required eigenvalue can
be easily identified. However, if only a few eigenvalues are calculated, it may
be difficult to identify the order of these eigenvalues. In most cases, a count of
the number of nodes in the associated eigenfunction should give the order of
the eigenvalue. For example, if we are interested in finding the second largest
eigenvalue, then we can look for eigenfunction with one node in the range of
integration. For symmetric matrices, it is possible to use the Sturm sequence
property to determine any required eigenvalue (Section 11.4).
If the kernel is degenerate, then the number of nonzero eigenvalues is
finite and if a quadrature formula with larger number of points is used, then
some eigenvalues of the resulting matrix will be zero. The number of nonzero
eigenvalues of the matrix should be equal to that of the kernel. Further, each
of these eigenvalues approximate some eigenvalue of the kernel. As the number
of points N ----> 00, the nonzero eigenvalues of the matrix tend to those of the
kernel.
EXAMPLE 13.3: Solve the eigenvalue problem

$$\int_0^1 e^{xt}\,f(t)\,dt = \lambda f(x). \qquad (13.51)$$

Since the kernel in this equation is regular, the eigenvalue spectrum has a limit point at
λ = 0. Hence, we can find a few of the largest eigenvalues. We can use a high order quadra-
ture formula to obtain the corresponding algebraic equations. Using subroutine FRED in
Appendix B, it is possible to find some of the eigenvalues. Results for the first four eigen-
values using the trapezoidal rule (N = 5,9,17), the Simpson's rule (N = 5,9,17) and the
4-point and 8-point Gauss-Legendre formulae are shown in Table 13.2. The computed val-
ues can be compared with the exact eigenvalues 1.3530302, 0.10598322, 3.5607491 × 10^{-3},
7.6379714 × 10^{-5}. All computations were performed using a 24-bit arithmetic and a con-
vergence parameter REPS = 2 × 10^{-7} was specified for the inverse iteration method to
determine the eigenvalues and eigenvectors. In some cases, with large number of points the
inverse iteration did not converge to the specified accuracy, but nevertheless the computed
eigenvalues as well as the eigenvectors were accurate to the extent that is possible using a

Figure 13.1: The first four eigenfunctions of the kernel K(x,t) = e^{xt} on [0,1].

24-bit arithmetic. There is some difficulty in ensuring that the inverse iteration converges to
the correct eigenvalue, unless an approximate value is known. The order of the eigenvalue
can be ascertained by counting the number of nodes in the eigenfunctions. Alternately, if the
number of points in the quadrature formula is not very high, then we can find all eigenvalues
of the matrix using the techniques described in Chapter 11 and arrange the eigenvalues in a
decreasing order.
For a given formula, the accuracy worsens as we go to higher order eigenvalues. Using
trapezoidal rule the error is O(1/N^2), while for Simpson's rule it is O(1/N^4). Gauss-Legendre
formulae achieve much higher accuracy. The 4-point formula which uses only four abscissas,
gives the first two eigenvalues essentially correct to seven significant figures. Even the fourth
eigenvalue is accurate to a few percent, which is quite remarkable, since this formula leads to
a 4 x 4 matrix with only four eigenvalues. It can be seen that the magnitude of eigenvalues is
decreasing rapidly as they tend to a limit point at λ = 0. Consequently, the resulting matrix
could be quite ill-conditioned with respect to solution of linear equations. This ill-conditioning
is evident in Example 13.4.
It can be seen that in contrast to Sturm-Liouville problems, where the eigenvalue
spectrum has a limit point at λ = ∞, here the limit point occurs at λ = 0. Thus, in this case,
the eigenfunction corresponding to the largest eigenvalue has no nodes, while the successive

Table 13.2: Solving an eigenvalue problem using quadrature methods

Trapezoidal rule Simpson's rule Gauss-Legendre formula


N=5 N=9 N = 17 N=5 N=9 N= 17 4-point 8-point

1.364445 1.355832 1.353727 1.353127 1.353036 1.353031 1.353030 1.353030


.1204690 .1096512 .1069029 .1064218 .1060110 .1059850 .1059831 .1059832
.0056101 .0041211 .0037040 .0040109 .0035910 .00356268 .00355975 .003560744
.0001531 .0001060 .0000846 .0001498 .0000833 .00007685 .00007384 .0000763776

eigenfunctions have one additional node. For illustration, the eigenfunctions corresponding
to the first four eigenvalues are shown in Figure 13.1. Here the eigenfunction which has no
node corresponds to λ ≈ 1.353, while the one with three nodes corresponds to λ ≈ 0.000076.
These eigenfunctions are also calculated using the same subroutine FRED.
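The Gauss-Legendre entries of Table 13.2 can be reproduced with a few lines; the sketch below (illustrative Python, not the FRED subroutine) maps the 8-point Gauss-Legendre formula to [0,1], symmetrises the matrix as D^{1/2}KD^{1/2} as in (13.50) (all weights being positive), and computes the eigenvalues with a standard symmetric eigensolver instead of inverse iteration.

import numpy as np

# Eigenvalues of  int_0^1 exp(x*t) f(t) dt = lambda f(x)  by the quadrature method,
# using the 8-point Gauss-Legendre formula mapped from [-1,1] to [0,1].
nodes, weights = np.polynomial.legendre.leggauss(8)
a = 0.5 * (nodes + 1.0)                        # abscissas a_i on [0,1]
w = 0.5 * weights                              # corresponding weights w_i

Kmat = np.exp(a[:, None] * a[None, :])         # k_ij = K(a_i, a_j) = exp(a_i a_j)
d = np.sqrt(w)
S = d[:, None] * Kmat * d[None, :]             # D^(1/2) K D^(1/2), symmetric as in (13.50)

lam = np.sort(np.linalg.eigvalsh(S))[::-1]     # same eigenvalues as K D
print(lam[:4])
# The largest four should be close to 1.3530302, 0.10598322, 3.5607491e-3, 7.6379714e-5.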

13.5 Fredholm Equations of the First Kind


The quadrature and expansion methods for Fredholm equation of the second
kind can in principle be used for numerical solution of the equation of the first
kind. However, in practice, there are difficulties because the resulting matrix
may be ill-conditioned. Even the existence of a solution is not clear for equations
of the first kind. In many cases, it can be easily seen that for a given kernel
K(x, t) the solution does not exist for arbitrary g(x). For example, it can be
easily shown by successive differentiation, that if K(x, t) satisfies a differential
equation of the form

(13.52)

then g(x) must also satisfy this equation. Otherwise, the integral equation can-
not have any solution whose behaviour will permit differentiation under the
integral sign.
Similarly, if the kernel is completely degenerate, that is K(x, t) =
X(x)T(t), then the integral equation

$$\int_a^b X(x)\,T(t)\,f(t)\,dt = g(x), \qquad (13.53)$$

has no solution, unless g(x) is a constant multiple c of X(x). Even then the
solution is not unique, since the only condition on f(x) is

$$\int_a^b T(t)\,f(t)\,dt = c. \qquad (13.54)$$

Similar conditions can be obtained for a general degenerate kernel {18}.


A formal solution of the integral equation can be obtained in terms of
the eigenfunctions of the kernel. Let λ_i be the eigenvalues and f_i(x) the corre-
sponding eigenfunctions of the kernel. Now if we can expand the solution and
the right-hand side in terms of the eigenfunctions,

$$f(x) = \sum_{i} a_i\,f_i(x), \qquad g(x) = \sum_{i} b_i\,f_i(x), \qquad (13.55)$$

it follows that a_i = b_i/λ_i and we have found the solution. In order to ensure


the existence of such expansion for a general g(x), it is necessary that the
eigenfunctions form a complete set. Hence, a necessary condition for existence of
such an expansion is that the number of eigenvalues is infinite. Further, λ_r = 0
should not be an eigenvalue, since in that case, the corresponding coefficient

a_r = b_r/λ_r does not exist, unless b_r = 0, which destroys the generality of


g(x). Even if b_r = 0, the solution is not unique, since any multiple of f_r(x)
can be added to the solution. This expansion is not very useful in practical
computations, since the eigenvalue problem usually has a limit point at λ = 0
and there are a large number of small eigenvalues, which can make a substantial
contribution to the solution, since a_r = b_r/λ_r can be large. Further, it is difficult
to estimate these high order eigenvalues or the corresponding eigenfunctions
accurately. It is clear that the problem is ill-conditioned, as a small change in
g(x), i.e., small b_r, can give a large change in the solution if λ_r is small.
For numerical solution of Fredholm equations of the first kind, we can
again use the quadrature or the expansion methods. Using quadrature meth-
ods we get the matrix equation K Df = g, where K and D are the matrices
defined in Section 13.2. The eigenvalues of this matrix approximate that of the
kernel. Thus, if the kernel is degenerate and the order of the matrix is larger
than the number n in the expansion for the kernel (13.17), then the matrix will
be singular. Thus, in this case, it is not possible to obtain a unique solution
to the system of linear equations. In fact, Gaussian elimination algorithm fails.
However, it is possible to find the solution using the singular value decomposi-
tion (Section 3.6). As explained above the solution is nonunique for degenerate
kernel. Hence, singular value decomposition gives an approximation to some
admissible solution. In fact, the general solution of the system of linear equa-
tion can also be calculated using SVD. However, it is necessary to check if the
right hand side is in the range of the matrix.
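As an illustration of this approach, the following Python sketch (an illustrative sketch for this discussion, not one of the routines referred to in this chapter) discretises a first-kind equation with the trapezoidal rule and solves the resulting, possibly singular, system by a truncated singular value decomposition; the function names, the test kernel and the cut-off rcond are assumptions made here for the example.

import numpy as np

def fredholm1_svd(kernel, g, a, b, n, rcond=1e-8):
    # Discretise int_a^b K(x,t) f(t) dt = g(x) with the trapezoidal rule and
    # solve the (possibly singular) linear system by truncated SVD.
    x = np.linspace(a, b, n)                  # abscissas, also used as collocation points
    w = np.full(n, (b - a) / (n - 1))         # trapezoidal weights
    w[0] = w[-1] = 0.5 * (b - a) / (n - 1)
    A = kernel(x[:, None], x[None, :]) * w    # matrix of K(x_i, t_j) w_j
    # lstsq with a cut-off rcond discards singular values below rcond * max
    f, *_ = np.linalg.lstsq(A, g(x), rcond=rcond)
    return x, f

# illustrative use on the kernel of Example 13.4 below; exact solution f(t) = exp(t)
x, f = fredholm1_svd(lambda x, t: np.exp(x * t),
                     lambda x: (np.exp(x + 1.0) - 1.0) / (x + 1.0), 0.0, 1.0, 9)

Increasing n rapidly worsens the conditioning of the matrix, which is the behaviour discussed above.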
If the kernel is nondegenerate, then generally λ = 0 is a limit point for
the eigenvalues. Consequently, there are several eigenvalues with very small
magnitudes. Thus, the matrix equation is likely to be highly ill-conditioned.
Further, in some cases, it can be shown {20} that, even if the resulting linear
equations are solved exactly, the computed solution does not converge to the
true solution as N → ∞. Thus, the use of quadrature methods could be risky,
since if the computed values at only a few selected points are considered, then
the computed solution may apparently converge to some limiting value, which
is far from the true solution.
Another problem with quadrature methods for equations of the first kind
is that, the solution is obtained only at a set of discrete points. Unlike the
equations of the second kind, here it is not possible to use the quadrature
formula itself for calculating the solution at intermediate points, and we need
some interpolation formula. However, the choice of interpolation formula is not
clear. For example, if N-point Gauss-Legendre formula is used to approximate
the integrals, then we can obtain the function values at the N abscissas. This
quadrature formula is exact for all polynomials of degree less than or equal to
2N - 1, but no interpolation formula using N points can achieve this order of
accuracy. Hence, the process of interpolation will introduce additional errors.
If the trapezoidal or the Simpson's rule is used to approximate the integral,
then it may be possible to use an interpolation formula with the same order
of accuracy. However, as mentioned earlier, because of severe ill-conditioning of
the resulting system of equations, it is preferable to use a high order Gaussian
formula, in which case interpolation is not possible with the required accuracy.
From the preceding discussion it appears that quadrature methods are
not very effective for solving a Fredholm equation of the first kind. It may
be better to use expansion methods. These methods are also likely to lead to
ill-conditioning, since a similar matrix is involved. In fact, the problem itself is
ill-conditioned, in the sense that a small perturbation to the right-hand side
can cause a significant change in the solution. One advantage of the expansion
method over the quadrature method is that, the solution can be easily computed
at any required point. Instead of using the simple collocation method, it may be
better to use a minimisation technique for any convenient norm of the residual.
For example, we can use the singular value decomposition to obtain a least
squares solution of a system of overdetermined equations using the residual at
several points.
If the kernel has some singularity, then the eigenvalue spectrum may not
have a limit point at λ = 0. Even if the eigenvalue spectrum has a limit point
at λ = 0, the successive eigenvalues may not decrease rapidly in magnitude. In
such cases, the problem may be well posed. However, in this case, the numerical
solution using quadrature methods may have some difficulty due to singularity.
These difficulties can be tackled by techniques outlined in Section 13.2. Alter-
nately, we can use the expansion methods which may not be affected by this
singularity, provided the solution itself is regular.
In general, Fredholm equations of the first kind are ill-conditioned and
accurate solution of the problem may be difficult. It is possible to improve the
solution using regularisation. Here, the problem is transformed to minimising
the quadratic functional

$\|\eta(x)\|^2 + \alpha\,\|Lf\|^2$,    (13.56)

where η(x) is the residual, given by

$\eta(x) = \int_a^b K(x,t)\,f(t)\,dt - g(x)$,    (13.57)

and L is some linear operator. For example, we can take Lf = f, f' or f''. If the kth
derivative is selected, then the process is referred to as kth order regularisation.
More generally, the term ||Lf||² can be replaced by a linear combination of the
squared norms of several derivatives. Here α > 0 is the regularisation parameter.
Motivation for this method is as follows. It is impractical to try to solve the
integral equation exactly, since the input functions are only approximate. Thus,
we should only require ||η(x)|| ≤ ε, where ε is the expected error in g(x). Hence,
we should minimise ||Lf|| subject to this restriction. The latter requirement
ensures some degree of smoothness in the solution. The parameter α plays a
crucial role in this method. When α = 0, this method reduces to the expansion
method considered earlier. As α increases, ||η(x)|| increases, while ||Lf||
decreases. Thus, α should be chosen such that the magnitude of residuals is
balanced against some measure of smoothness. Optimal choice of α is not clear,
but a value of the order of ||δg||² may be reasonable, where δg is the expected
uncertainty in the right-hand side function g. This technique is discussed in the
next section.
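To make the minimisation concrete, a minimal Python sketch is given below (an assumption of this rewrite, not a routine from the text): it minimises the functional (13.56) for a discretised equation Af = g by stacking sqrt(α)L below A and solving the augmented least squares problem, with L an identity, first- or second-difference matrix (any factors of the mesh spacing in L are simply absorbed into α).

import numpy as np

def regularised_solve(A, g, alpha, order=2):
    # Minimise ||A f - g||^2 + alpha*||L f||^2 by solving the augmented
    # least squares problem [A; sqrt(alpha) L] f = [g; 0].
    n = A.shape[1]
    if order == 0:
        L = np.eye(n)                          # zeroth order: Lf = f
    elif order == 1:
        L = np.diff(np.eye(n), 1, axis=0)      # first differences ~ f'
    else:
        L = np.diff(np.eye(n), 2, axis=0)      # second differences ~ f''
    A_aug = np.vstack([A, np.sqrt(alpha) * L])
    g_aug = np.concatenate([g, np.zeros(L.shape[0])])
    f, *_ = np.linalg.lstsq(A_aug, g_aug, rcond=None)
    return f

The choice of α then controls the balance between the residual and the smoothness of the computed solution, as described above.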
EXAMPLE 13.4: Solve the following integral equation

$\int_0^1 e^{xt}\,f(t)\,dt = \frac{e^{x+1} - 1}{x + 1}$.    (13.58)

The kernel in this equation is regular and we can expect this problem to be ill-
conditioned. Results obtained using quadrature methods are shown in Table 13.3. It is found
that the resulting system of equations is highly ill-conditioned and no meaningful solution is
possible except for very small values of N. Thus, we use N = 5 for the trapezoidal and the
Simpson's rule, and N = 4 and 8 for the Gauss-Legendre formulae. Using the trapezoidal
rule, we also calculate the corrected values using the deferred correction (Gregory's formula).
The computed values can be compared with the exact solution f(x) = eX. All calculations
were performed using a 24-bit arithmetic. Using Gauss-Legendre formulae, the function val-
ues can be calculated at the corresponding abscissas. To calculate the function value at any
other point, we need to use interpolation. As explained in the text the process of interpo-
lation will introduce additional errors. Thus, we compare the error in the computed values
at the abscissas, which are also listed in the table. The best results are obtained using the
four-point Gauss-Legendre formula. Condition number of the matrix using the trapezoidal
and Simpson's rule is 9 × 10⁵ and 1.2 × 10⁶ respectively, while for Gauss-Legendre formulae
with 4 and 8 points, it is 2 × 10⁴ and 9 × 10¹¹ respectively. Since the integrand is regular,
the Gauss-Legendre formulae are much more accurate, and we get better results, provided
the matrix is not too ill-conditioned. Using the 8-point Gauss-Legendre formula, the system
of equations is too ill-conditioned to obtain any meaningful solution.
We can try the expansion methods on this problem by expanding the solution in terms
of the functions φ_i(x) = x^(i-1). It is simplest to use the collocation method to calculate the
coefficients. For this purpose, we can choose the points z_j = j/N. This technique requires the
integrals

$\int_0^1 e^{z_j t}\, t^{i-1}\,dt$,    (13.59)

which can be evaluated using a recurrence relation of the type considered in Example 2.5.
This recurrence is highly unstable in the forward direction and no meaningful results may
be obtained even for moderate values of i. It is possible to use the recurrence relation in the
backward direction. Alternately, we can evaluate the integrals numerically. The results shown
in Table 13.4 are obtained using subroutine ADPINT to evaluate the integrals numerically.
This table shows the coefficients a_i of x^i in the expansion as well as the calculated value of
the solution at a few sample points. It can be seen that the error decreases until N = 4, after
that it starts increasing. This is clearly due to ill-conditioning of the resulting system of linear

Table 13.3: Solving a Fredholm equation of the first kind using quadrature methods

x fexact Trap. Greg. Simp. x fexact 4-pt x fexact 8-pt

.00 1.0000 0.558 0.698 0.866 .06943 1.07190 1.07208 .01986 1.0201 1.175
.25 1.2840 1.953 1.627 1.439 .33001 1.39098 1.39075 .23723 1.2677 0.270
.50 1.6487 0.689 0.748 1.103 .66999 1.95422 1.95441 .59172 1.8071 -0.603
.75 2.1170 3.138 2.614 2.333 .93057 2.53595 2.53583 .76277 2.1442 4.508
1.00 2.7183 1.627 2.034 2.459 .98014 2.6648 3.843

Table 13.4: Solving a Fredholm equation of the first kind using the collocation method

N f(0) f(0.25) f(0.5) f(0.75) f(1) a0 a1 a2 a3 a4

2 .85081 1.28396 1.71711 2.15026 2.58341 0.8508 1.7326


3 1.01523 1.27803 1.64731 2.12305 2.70527 1.0152 0.8383 0.8518
4 0.99818 1.28468 1.64838 2.11693 2.71798 0.9982 1.0284 0.3965 0.295
5 0.87811 1.32243 1.61432 2.13406 2.64555 0.8781 3.2055 -8.7814 13.918 -6.57

4 1.00252 1.28296 1.64814 2.11853 2.71463 using cubic B-splines


10 0.64011 1.41945 1.50227 2.18328 2.02245
10 0.98355 1.29055 1.64155 2.12154 2.69752 cubic B-splines with regularisation
exact 1.00000 1.28402 1.64872 2.11700 2.71828

equations. Accuracy achieved using N = 4 is comparable to that using the 4-point Gauss-
Legendre formula. To obtain more accurate results it is necessary to use higher precision
arithmetic.
Instead of simple polynomial basis we can try cubic B-spline basis functions. In this
case also best accuracy is achieved using 4 basis functions (based on 2 knots) and this result
is also shown in Table 13.4. In this case instead of collocation we have obtained least squares
solution, by considering residuals at 2N points. As the number of basis functions is increased
the approximation deteriorates. To demonstrate the effect of regularisation we apply second
derivative regularisation (Eq. 13.56) with α = 5 × 10⁻⁸ and these results are also shown
in the table. It can be seen that regularisation improves the results, though they are not
as good as those obtained using N = 4. Using fewer basis functions automatically imposes
smoothness, as nonsmooth functions cannot be expanded in terms of basis functions unless a
large number of functions are used. Thus we can effectively apply regularisation by choosing
the right number of basis functions.

13.6 Inverse Problems


An important application of Fredholm equations of the first kind arises in the so-
called inverse problems. In many of these applications the variable x is discrete.
Thus a simple linear inverse problem in one dimension can be written as (Gough
and Thompson 1991)

$d_i = \int_a^b K_i(t)\,f(t)\,dt + \epsilon_i, \qquad i = 1, 2, \ldots, n$;    (13.60)

where d_i are generally some experimentally determined quantities which may
have some error ε_i associated with them. The kernels K_i(t) are known func-
tions and f(t) is the unknown function which needs to be determined. The
corresponding forward problem, where f(t) is known, can be easily solved to
calculate the quantities d_i and is generally well conditioned. But the inverse
problem of determining f(t) from d_i is generally ill-conditioned. It is clear that
using only a finite number of measured quantities d_i, it is not possible to deter-
mine the continuous function f(t) uniquely. However, this is essentially what
we do all the time. Thus, using only a finite number of measurements of any
quantity at different times we try to estimate the values at some other time
using interpolation or approximation. The underlying assumption in all this is
that the required function is sufficiently smooth to be approximated. Similarly,
if we assume that the function f(t) in the inverse problem is smooth in some
sense, it may be possible to calculate f(t) from the known d_i. Nevertheless, be-
cause of smoothing due to integration the problem is ill-conditioned and very
high accuracy in d_i will be required to get any reasonable solution.
Although, many specialised techniques have been developed for solving
these problems, the simplest technique is the regularised least squares method,
where the unknown function is expanded in terms of a set of basis functions
and the least squares solution of the resulting system of equations is obtained.
Thus, we can expand the required solution in terms of a suitable set of basis
functions using (13.37) to obtain

$d_i = \sum_{j=1}^{N} a_j \int_a^b K_i(t)\,\phi_j(t)\,dt = \sum_{j=1}^{N} b_{ij}\,a_j$,    (13.61)

where we have ignored the errors ε_i. Here, b_ij are the integrals and if n > N,
these equations provide a system of overdetermined equations in N unknown
coefficients. A least squares solution of this system of equations should yield
the required coefficients of expansion. However, a straightforward solution of
these equations will not give any meaningful result, as the resulting f(t) will
show oscillations on small scales. It is necessary to apply regularisation as ex-
plained in the previous section. We can apply either first or second derivative
regularisation and the resulting procedure is referred to as Regularised Least
Squares (RLS) technique. For example, using second derivative regularisation
we obtain the solution by minimising the function

$\sum_{i=1}^{n}\Bigl(d_i - \sum_{j=1}^{N} b_{ij}a_j\Bigr)^2 + \alpha \sum_{i=1}^{M} w(t_i)\Bigl(\frac{d^2 f}{dt^2}\Big|_{t=t_i}\Bigr)^2$,    (13.62)

where α is the regularisation parameter, w(t) is a weight function and t_i, i =
1, ..., M is a suitable mesh covering the required interval. If α = 0 the solution
will reduce to the straightforward least squares solution, while when α is large the
solution will tend to a straight line. The value of α has to be chosen carefully
to ensure some degree of smoothness in the solution. As mentioned in Section 10.2,
we can try different values of α and plot the two terms in (13.62) against each
other. The resulting curve has a shape resembling the letter "L" and the value
of α corresponding to the corner would be close to the optimal value. This plot
is also referred to as the L-curve. By enforcing smoothness, regularisation helps
us in obtaining a meaningful solution to the problem. The value of α can be
increased until the small scale oscillations in the solution are smoothed out.
Of course, if the real solution has such oscillations or steep variation, it will
also be smoothed out. However, with sufficient experimentation with different
basis functions and regularisation parameter, it is possible to identify any real
nonsmooth variation in f(t) which is well determined by the data. In practice,
the minimisation of (13.62) is obtained using SVD. The system of equations
defined by the terms in (13.62) is solved using SVD to find the least
squares solution.
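In Python, the scan over α needed for an L-curve can be sketched as follows (an illustrative sketch, assuming the matrix B of the integrals b_ij and a discrete smoothing matrix L have already been constructed; this is not the subroutine RLS itself).

import numpy as np

def l_curve(B, d, L, alphas):
    # For each trial alpha solve min ||B a - d||^2 + alpha*||L a||^2 and
    # return the two terms of (13.62) for plotting against each other.
    chi2, smooth = [], []
    for alpha in alphas:
        A_aug = np.vstack([B, np.sqrt(alpha) * L])
        rhs = np.concatenate([d, np.zeros(L.shape[0])])
        a, *_ = np.linalg.lstsq(A_aug, rhs, rcond=None)
        chi2.append(np.sum((B @ a - d) ** 2))
        smooth.append(np.sum((L @ a) ** 2))
    return np.array(chi2), np.array(smooth)

# e.g. alphas = np.logspace(-6, 1, 30); plotting smooth against chi2 on
# logarithmic axes shows the L-shaped curve whose corner indicates alpha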
An advantage of regularisation is that in regions where the data given
by the right-hand side d_i are not sufficient to determine the required solution, the
resulting function tends to a straight line (with second derivative regularisation)
or a constant (with first derivative regularisation). The regularisation parameter
α is generally chosen to ensure that the solution is smooth in some sense.
Considerable experimentation with different values of α and different forms of
regularisation will be needed before the optimal solution can be found. In most
of the real problems, the kernels K_i(t) are highly oscillatory and this actually
improves the condition of the problem. If the kernels are smooth the problem
will be extremely ill-conditioned. Test results with artificial data generated
using known functions can also be used to validate the solution. For this purpose
we choose a form for f(t) which is similar to the expected or inferred behaviour
and solve the forward problem to calculate d_i. In order to simulate real data it is
necessary to add random errors to the artificial data d_i, with the same distribution
as that expected in real data, before trying the solution. The computed solution
for the artificial data can be compared with the actual f(t) used to construct
the data. A simple implementation of the RLS inversion technique is provided by
subroutine RLS in Appendix B.
To estimate the errors in the inverted f(t) arising from possible errors in input
data, we can perform a Monte Carlo simulation. For this purpose we choose
an artificial function f(x) close to the inverted profile and solve the forward
problem to calculate the data d_i. Then we add random errors to d_i with the same
distribution as expected in real data and perform the inversion using the same
value of α as that used in the actual calculation. This exercise is repeated for
several different realisations of random errors, e.g., obtained using different
seed values in the random number generator. Using the set of inverted profiles, we
then calculate the standard deviation at each value of t_i to get the error. As the
regularisation parameter α is increased the estimated error will reduce, because
the computed solution is determined by regularisation and is not very sensitive
to errors in data. For very small values of α the errors are large as the solution
is very sensitive to small errors in data values.
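A hedged Python sketch of this Monte Carlo procedure is given below; the inversion routine solve, the noise level sigma and the number of realisations are placeholders to be supplied by the user (for instance, solve could be the regularised least squares solver sketched earlier).

import numpy as np

def monte_carlo_errors(solve, B, d_clean, sigma, alpha, nrealisations=100, seed=0):
    # Add Gaussian noise of standard deviation sigma to the artificial data,
    # invert each realisation with the same alpha, and return the standard
    # deviation of the inverted solutions as the error estimate.
    rng = np.random.default_rng(seed)
    solutions = []
    for _ in range(nrealisations):
        d_noisy = d_clean + rng.normal(0.0, sigma, size=d_clean.shape)
        solutions.append(solve(B, d_noisy, alpha))
    return np.array(solutions).std(axis=0)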
EXAMPLE 13.5: Consider the problem of inversion for the rotation rate in the solar interior using
observed splittings of solar oscillation frequencies.
This inversion problem can be written in the form (13.60), with kernels as defined by
Ritzwoller and Lavely (1991). For simplicity, we consider only the latitudinally independent
component of rotation rate and choose an artificial rotation profile shown in Fig. 13.2 to
calculate the splitting coefficients. These coefficients are then perturbed by adding a random
error with standard deviation of 0.2 nHz to each of the values. It is beyond our scope to
describe how the kernels are actually calculated and we will assume that these are available for
a set of about 1300 modes that are used for inversion. These kernels were actually calculated
using a realistic solar model. This artificial data is then inverted using the subroutine RLS,
with different values of the regularisation parameter α. We use cubic B-spline basis functions with


Figure 13.2: The left panel shows the two terms in the function to be minimised, χ² and ||Lf||²
(the smoothing part), plotted against each other for both first and second derivative regular-
isation. The right panel shows the inverted solution obtained using a few different values of α
for both first derivative smoothing (labelled as f') and second derivative smoothing (labelled
as f''). The exact rotation profile which was used to construct the data is also shown.

50 uniformly spaced knots to represent the unknown function Ω(r), where r is the distance
measured in units of the solar radius. We try both first and second derivative regularisation. In
order to identify an optimum value of α, we try a large range of values and calculate the
two terms χ² and ||Lf||² in (13.62) and plot them against each other. The result is shown in
Fig. 13.2. The points form a curve resembling the letter "L" and we expect the value of α
corresponding to the corner of the L to be the optimal value. This value turns out to be 0.2 and
0.02, respectively, for first and second derivative regularisation. The right panel in Fig. 13.2
shows the results obtained using a few different values of α.
The exact rotation rate that was used to construct the artificial data is almost like a
step function. It can be seen that when α is much smaller than the optimal value, the solution
shows many oscillations in the interior, but near the step it matches the exact value quite well.
On the other hand, with the optimal choice of α, although the oscillations are suppressed,
there is some departure in the steep region. The first derivative smoothing appears to do
better, but even using second derivative smoothing we can get similar results if α is reduced
slightly below the value inferred from the L-curve. When α is much larger the solution is
smooth and shows two humps on either side of the step. This is similar to the Gibbs phenomenon
in Fourier transforms. In this case there is considerable departure from the exact value in the region
around the step. Instead of using the L-curve we can simply increase α until the oscillations in
the inferred Ω(r) are suppressed. This strategy also gives a similar value of α for the optimal solution.
In this case since we know the exact solution we can try to measure the difference between
the inferred and exact solution and it turns out that the integrated value of the squared difference
is minimum when α is close to the value indicated by the L-curve. Very close to the centre
(r = 0) the data is not adequate to determine the rotation rate and the inverted profile tends
to a straight line when smoothing is used. For first derivative smoothing the slope tends to
be zero, which matches the exact profile, while second derivative smoothing can give nonzero
slope in this region, giving large error in some cases. The subroutine RLS can also be used to
estimate the errors in the inverted profile. As α is reduced the estimated errors increase and for
small α these estimates are comparable with the magnitude of oscillations seen in the inverted
profiles. Thus for small α the error estimates are more realistic, while at large values of α the
estimated errors are very small even though the inverted profile has significant error near the
step. This is to be expected since the errors only measure the sensitivity of the inversion to data
errors and do not reflect the systematic error introduced through smoothing. Nevertheless,
it should be clear from this example that in spite of smoothing there is no difficulty in
identifying the step-like variation in rotation rate from inversion results.

An alternative technique for inversion is the Optimally Localised Averages


(OLA) due to Backus and Gilbert (1968), where the aim is to construct a linear
combination of kernels, such that the result is localised in a small region, like a
delta function, peaked around the required point t_0. The resulting combination
is called the averaging kernel. Taking a linear combination of (13.60), we get

$\sum_{i=1}^{n} c_i(t_0)\,d_i = \int_a^b K(t; t_0)\,f(t)\,dt + \sum_{i=1}^{n} c_i(t_0)\,\epsilon_i, \qquad K(t; t_0) = \sum_{i=1}^{n} c_i(t_0)\,K_i(t)$,    (13.63)

where K(t; t_0) is the averaging kernel corresponding to t = t_0. The aim in the OLA
technique is to choose the coefficients c_i(t_0) such that the resulting averaging
kernel K(t; t_0) is localised around t_0 and at the same time the error term is as
small as possible. As expected, the two requirements are generally conflicting
and the error will increase if we try to choose the combination to make the
averaging kernel localised in a very small region. Thus once again a suitable
compromise has to be made.
There are many techniques to choose the appropriate coefficients c_i(t_0).
The simplest may be to choose them by minimising

$12\cos\theta \int_a^b (t - t_0)^2\,\bigl(K(t; t_0)\bigr)^2\,dt + \mu\sin\theta \sum_{i=1}^{n} c_i^2(t_0)\,\epsilon_i^2$,    (13.64)

subject to the normalisation condition

$\int_a^b K(t; t_0)\,dt = 1$.    (13.65)

Here, μ can be chosen as the inverse of the mean square error in the data points and
the parameter θ serves as the trade-off between the two terms, similar to the
regularisation parameter in RLS. More details of the technique can be found
in Backus and Gilbert (1968) or Gough and Thompson (1991). The choice of
parameter θ will again require some experimentation and compromise, but the
advantage of the OLA technique over RLS is that in this case the relationship
between the errors and resolution (as measured by the width of the resulting av-
eraging kernels) comes out more clearly. Thus if we try to reduce the width of
the averaging kernel the errors will increase, and the parameter θ can be chosen to
get an appropriate balance between the two.
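A minimal Python sketch of this construction is given below, assuming the usual Lagrange-multiplier solution of minimising (13.64) subject to (13.65), with kernels sampled on a mesh and a diagonal error term; all names are illustrative and this is not a routine from the text.

import numpy as np

def ola_coefficients(K, t, t0, errors, theta, mu):
    # K is an (n, m) array of kernels K_i sampled on the mesh t.
    # Minimise c^T S c subject to c^T r = 1, where S collects the two terms
    # of (13.64) and r_i = int K_i dt enforces the normalisation (13.65).
    spread = 12.0 * np.cos(theta) * np.trapz(
        (t - t0) ** 2 * K[:, None, :] * K[None, :, :], t, axis=-1)
    S = spread + mu * np.sin(theta) * np.diag(errors ** 2)
    r = np.trapz(K, t, axis=-1)
    c = np.linalg.solve(S, r)
    return c / (r @ c)

# averaging kernel: Kbar = c @ K;  localised average of f at t0:  c @ d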
Disadvantage of the OLA technique is the excessive amount of computation
required. While RLS requires a least squares solution of an n × N matrix equation,
the OLA technique requires an n × n matrix to be solved for each value of t_0
at which the solution is required. It is possible to reduce the amount of com-
putation by using a suitable strategy, but still the effort is significantly larger
than what is required in the RLS solution. RLS also has the advantage that the
residuals can be easily calculated and hence one has an idea of whether the
solution indeed fits the data properly or not. The residuals also help in iden-
tifying outliers which may distort the solution unreasonably. The OLA technique
does not allow residuals to be computed and hence the presence of outliers can-
not be detected. The main disadvantage of the RLS technique is the somewhat
arbitrary manner in which the regularisation parameter is fixed. Consequently,
it is difficult to get a realistic estimate of errors in the inverted solution. The OLA
technique, on the other hand, gives a convenient relation between resolution and errors which is
easier to interpret. Since the inverse problems are generally ill-conditioned, it
is better to try both techniques and compare the results. A detailed discussion
of various techniques for solving inverse problems is beyond the scope of this
book and readers should consult books on the subject (Craig and Brown 1986;
Tarantola 2004; Menke 2012).

13.7 Volterra Equations of the Second Kind


There is a strong similarity between Volterra equations of the second kind and
initial value problems for ordinary differential equations. If quadrature method
is applied to a Volterra equation, the resulting matrix is triangular and the
equations can be easily solved using forward substitution. This technique is
similar to the step by step solution of initial value problems. The main difficulty
with Volterra equations is that the range of integration is not constant. It is
therefore not possible to use a Gaussian formula, since the abscissas used will
be different for each value of x. Thus, for Volterra equations it is convenient
to use Newton-Cotes quadrature formulae. Further, for Volterra equations, it
is not difficult to treat nonlinear equations. Hence, we consider a nonlinear
equation of the form

$\int_a^x K(x, t, f(t))\,dt = f(x) + g(x)$,    (13.66)

where K(x, t, f) can be a nonlinear function of its arguments. If this function is
linear in f, then we get the form considered in Section 13.1. This is not the most
general form of nonlinear Volterra equations, but we are using it simply because
nonlinearities in this form can be tackled easily. Setting x = a in (13.66), it is
clear that f(a) = -g(a), which provides the initial value for the solution.
Using the function values f_j = f(a_j) at a_j = a + (j-1)h, we can approx-
imate the Volterra equation by

$f_r + g(a_r) = \sum_{i=1}^{r} w_{ri}\,K(a_r, a_i, f_i), \qquad (r = 2, 3, \ldots)$,    (13.67)

where the weights w_ri in general depend on r also. If the quadrature formula
is closed (i.e., w_rr ≠ 0), then assuming f_1, f_2, ..., f_{r-1} are known, we have a
nonlinear equation for f_r. This equation can be solved using simple fixed-point
iteration, provided |w_rr K_f(a_r, a_r, f_r)| < 1, where K_f = ∂K/∂f. Since the
weight w_rr is proportional to the spacing h, for sufficiently small h the iteration will
converge. If the equation is linear, then iteration is not required, since (13.67)
can be solved explicitly for f_r. This formula is similar to a corrector formula
for solving initial value problems in differential equations, the difference being
that in this case, the function values at all the previous points are required. The
starting approximation f_r^(0) for the iteration can be provided by extrapolation
from the previous points, or by using a predictor formula. The simplest choice
f_r^(0) = f_{r-1} arises from extrapolation using only one previous point. Since
the number of iterations required by the corrector formula depends on the
accuracy of the initial approximation, it is preferable to provide a more accurate
approximation. A better approximation is obtained if we assume the solution
to be linear, which gives f_r^(0) = 2f_{r-1} - f_{r-2} for r ≥ 3. Using three previous
points we can obtain an even better approximation, provided the solution is
sufficiently smooth. Quadratic extrapolation, using the last three points, gives

$f_r^{(0)} = 3f_{r-1} - 3f_{r-2} + f_{r-3}, \qquad r \geq 4$.    (13.68)

Truncation error in these formulae is O(h²) and O(h³), respectively. If the pre-
dicted values are to be used for estimating the truncation error in the computed
result, then it is better to have a predictor formula with truncation error of the
same order as the corrector. This can be achieved by including a larger number of
previous points. However, unlike initial value problems, here it may be difficult
to estimate the coefficient of h^p in the error term, since the integration is over
a large interval.
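The trapezoidal version of this scheme, with a linear-extrapolation predictor and a few fixed-point corrector iterations at each step, can be sketched in Python as follows (an illustrative sketch, not the book's implementation; the kernel is assumed to accept array arguments for its second and third parameters).

import numpy as np

def volterra2_trapezoidal(K, g, a, b, n, niter=10):
    # Solve int_a^x K(x,t,f(t)) dt = f(x) + g(x) on a uniform mesh using the
    # trapezoidal weights of (13.67); f(a) = -g(a) starts the recursion.
    h = (b - a) / (n - 1)
    x = a + h * np.arange(n)
    f = np.empty(n)
    f[0] = -g(x[0])
    for r in range(1, n):
        w = np.full(r, h)
        w[0] = 0.5 * h                        # weights h/2, h, ..., h for the known points
        s = np.dot(w, K(x[r], x[:r], f[:r])) - g(x[r])
        fr = f[r - 1] if r == 1 else 2.0 * f[r - 1] - f[r - 2]   # predictor
        for _ in range(niter):                # corrector: fixed-point iteration
            fr = s + 0.5 * h * K(x[r], x[r], fr)
        f[r] = fr
    return x, f

# e.g. for the linear test problem of Example 13.6 below (lambda = 12):
# x, f = volterra2_trapezoidal(lambda x, t, f: 12.0 * f,
#                              lambda x: 12.0 - 13.0 * np.exp(-x), 0.0, 1.0, 21)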
Alternately, we can obtain a predictor formula using a quadrature rule of
the open type to approximate the integral. This open formula needs to be used
only on the last few subintervals. For example, we can use the formulae

$\int_x^{x+3h} f(t)\,dt \approx \frac{3h}{4}\bigl(f(x) + 3f(x+2h)\bigr), \qquad \int_x^{x+2h} f(t)\,dt \approx 2h\,f(x+h)$,    (13.69)

instead of the Simpson's 3/8 or 1/3 rule over the last interval. These formulae
have a truncation error of O(h⁴) and O(h³), respectively. It is difficult to obtain
a reliable estimate of the local truncation error for integral equations, since at
each step the integral has to be evaluated over the entire range. Consequently,
unless the entire integral over [a, x] is evaluated using two different quadrature
formulae (involving the same abscissas), it may be difficult to give a reliable
estimate of the truncation error. A change in the quadrature formula in the last
few subintervals only gives an estimate of truncation error over those subinter-
vals. This estimate is useful for predicting an accurate starting value for the
corrector formula, but the difference between the predicted and the corrected
values may not give a reasonable estimate of the truncation error.
In the simplest case, when the trapezoidal rule is used to approximate the
integral, then w_r1 = w_rr = h/2, while w_ri = h for i = 2, ..., r-1. In this case,
there is no difficulty in calculating f_2, f_3, ... using (13.67), because the trape-
zoidal rule is essentially a single-step formula. But if a higher order quadrature
formula is used, there may be some difficulty. For example, if we wish to ap-
proximate the integral using the Simpson's 1/3 rule, then r must be odd. To
overcome this problem, we can use the Simpson's 3/8 rule at one of the ends.
It may appear that it is immaterial at which end the 3/8 rule is used. However,
it turns out that using it at the lower end leads to instability, and the 3/8 rule
must be used at the upper end for even r. Thus, for odd values of r the weights
are

$w_{r1} = w_{rr} = \frac{h}{3}, \qquad w_{ri} = \frac{h}{3}\bigl(3 + (-1)^i\bigr), \quad (1 < i < r)$,    (13.70)

while for even values of r (> 4) the weights are

$w_{r1} = \frac{h}{3}, \qquad w_{rr} = \frac{3h}{8}, \qquad w_{r,r-1} = w_{r,r-2} = \frac{9h}{8}, \qquad w_{r,r-3} = \frac{17h}{24},$
$w_{ri} = \frac{h}{3}\bigl(3 + (-1)^i\bigr), \quad (2 \leq i \leq r-4)$.    (13.71)

For r = 4 we have only one application of the Simpson's 3/8 rule, giving
w_41 = w_44 = 3h/8 and w_42 = w_43 = 9h/8. It can be easily seen that this process
cannot be used for r = 2, since it is not possible to approximate the integral to
sufficient accuracy using only two points. Thus, we need a special formula for
generating this starting value. This situation is similar to the requirement of
extra starting values for initial value problems in differential equations, when
a higher order multistep method is used.
Instead of the 3/8 rule, we can use the trapezoidal rule over one subinterval
when r is even. This process does not require any extra starting value, but in
that case, the truncation error is O(h³) rather than O(h⁴). Note that the error of O(h³)
arises because the trapezoidal rule is used over only one subinterval of length
h. If the trapezoidal rule is used over the entire range, then the truncation error
is O(h²). In this case also it is found that using the trapezoidal rule over the last
subinterval enhances the stability. On the other hand, if the trapezoidal rule is
used in the first subinterval, then the resulting method is unstable for decreasing
solutions, just like Milne's corrector for initial value problems.
To generate additional starting values, we can again use techniques similar
to that for differential equations. For example, we can use the Taylor series
expansion, provided the required number of derivatives can be computed easily.
It is also possible to use Runge-Kutta type of methods for solving Volterra
equations (Baker, 1977). Accuracy of starting values should be of the same
order as the global truncation error expected in the quadrature formula. For
example, the Simpson's rule has a local truncation error of O(h⁵), but for a
fixed value of x the error is O(h⁴). Hence, we need starting values with an
accuracy of O(h⁴) only. We can improve the accuracy of the starting values by
including intermediate points, similar to that in Runge-Kutta methods.
It is possible to use the following method due to Day for calculating f_2,
which gives an accuracy of O(h⁴). In this method we use the intermediate values

$f_{21} = -g(a+h) + h\,K(a+h, a, f_1) + O(h^2)$   (rectangle rule)

$f_{22} = -g(a+h) + \frac{h}{2}\bigl(K(a+h, a, f_1) + K(a+h, a+h, f_{21})\bigr) + O(h^3)$   (trapezoidal rule)

$f_{23} = \frac{1}{2}(f_1 + f_{22}) + O(h^2)$   (linear interpolation)

$f_{24} = -g\bigl(a+\tfrac{h}{2}\bigr) + \frac{h}{4}\Bigl(K\bigl(a+\tfrac{h}{2}, a, f_1\bigr) + K\bigl(a+\tfrac{h}{2}, a+\tfrac{h}{2}, f_{23}\bigr)\Bigr) + O(h^3)$   (trapezoidal rule)
    (13.72)

Then the solution at the next step can be written using the Simpson's rule

$f_2 = -g(a+h) + \frac{h}{6}\Bigl(K(a+h, a, f_1) + 4K\bigl(a+h, a+\tfrac{h}{2}, f_{24}\bigr) + K(a+h, a+h, f_{22})\Bigr)$,    (13.73)

which gives an accuracy of O(h⁴) rather than O(h⁵), since f_22 and f_24 are
known to an accuracy of O(h³) only. This procedure is similar to a Runge-
Kutta method, except for the fact that the coefficients have not been chosen
to achieve the optimal accuracy, using a fixed number of function evaluations.
This starting procedure along with the composite Simpson's rules can be used
to obtain the solution over a uniform mesh.
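Day's starting procedure translates directly into code; the following Python sketch (illustrative only) computes f_2 from (13.72) and (13.73), given f_1 = f(a) = -g(a).

def day_start(K, g, a, h):
    # Starting value f_2 = f(a+h) to O(h^4) for int_a^x K(x,t,f(t)) dt = f(x) + g(x).
    f1 = -g(a)
    f21 = -g(a + h) + h * K(a + h, a, f1)                      # rectangle rule
    f22 = -g(a + h) + 0.5 * h * (K(a + h, a, f1)
                                 + K(a + h, a + h, f21))       # trapezoidal rule
    f23 = 0.5 * (f1 + f22)                                     # linear interpolation at a + h/2
    f24 = -g(a + 0.5 * h) + 0.25 * h * (K(a + 0.5 * h, a, f1)
                                        + K(a + 0.5 * h, a + 0.5 * h, f23))
    f2 = -g(a + h) + (h / 6.0) * (K(a + h, a, f1)
                                  + 4.0 * K(a + h, a + 0.5 * h, f24)
                                  + K(a + h, a + h, f22))      # Simpson's rule (13.73)
    return f1, f2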
The main difference between multistep methods for differential equations
and those for integral equations is that, for differential equations we need func-
tion values at only a fixed number of previous points, while for integral equa-
tions all the previous values are required. Clearly, the solution of integral equa-
tions requires much more effort. Thus, if N steps of integration are performed,
then for differential equations the effort required is O(N), while for integral
equations it is O(N²); this puts a severe constraint on the number of steps that
can be performed. Further, it also requires a fair amount of computer memory,
since all the previous values need to be stored. Adjusting the step size is also
more difficult for integral equations, since all the previous values need to be
used.
The technique of deferred correction using Gregory's formula can be ap-
plied to Volterra equations. But, once again a special technique is required to
estimate the correction in the first few steps, where sufficient number of differ-
ences cannot be calculated using the computed function values. It is actually
possible to use a variable order method using Gregory's formula, where the
number of differences used can be adjusted to achieve the required accuracy.
This technique can be applied at each step, although, it is not advisable to
change the order too frequently.
Unlike Fredholm equations of the second kind, here the quadrature formula
cannot be used to calculate the solution at points other than the abscissas.
For this purpose, we can use a suitable interpolation formula. However, un-
like Fredholm equations of the first kind, there is no difficulty in choosing the
interpolation formula, since we can choose the one which has the same order
of accuracy as that of the quadrature formula. This choice is always possible,
since we are using only Newton-Cotes quadrature formulae.
As noted earlier, Volterra equations are similar to initial value problems in
differential equations, and the concepts of consistency, stability and convergence
can be applied to these problems also. Thus, a quadrature formula of the form
used in (13.67) is said to be consistent if, for a fixed value of x - a, the truncation
error in evaluating the integral (assuming the function values to be exact) tends
to zero as h → 0. Similarly, a method is said to be convergent if the computed
solution converges to the true solution in the limit h → 0, while a method is
said to be stable if the errors introduced at some stage are not amplified at
subsequent steps.
For initial value problems in ordinary differential equations, instability
of numerical integration methods may be attributed to the presence of extra
complementary solutions. These extra solutions arise because of the fact that
the resulting difference methods are of higher order than the corresponding
differential equations. There does not seem to be any simple analogue of this
argument in the treatment of Volterra equations. For Volterra equations of the
second kind, the homogeneous part has only trivial solution and there cannot
be any extra complementary solutions. This is essentially because of the fact
that integral equations incorporate the boundary conditions also. Nevertheless,
higher order quadrature formulae do need extra starting values, thus admitting
extra independent solutions. The instability in numerical methods for Volterra
equations can be ascribed to the sensitivity of numerical methods to small
perturbations in g(x) or the kernel. In practice, it is found that numerical
methods for Volterra equations display instabilities, which are very similar to
those in corresponding methods for the initial value problems.
The stability analysis for Volterra equation is rather complicated, since the
difference equation involves all the previous points. As in the case of differential
equations, we can analyse stability only for a simple form of linear equations.
For a differential equation, we can claim that any equation can be linearised
locally, and if the numerical integration method is stable at all points, then the
computed solution is likely to be stable. But for integral equations, since the
solution depends on all the previous points, the results cannot be generalised
to equations of other form, even if the kernel can be represented locally in the
assumed form. Thus, the usefulness of stability analysis of integral equations
is rather limited and we consider only some simple cases here. Most stability
analyses are performed for the case K(x, t, f(t)) = λf(t) (where λ is some
constant) and g(x) is also a constant. It can be easily seen that this integral
equation is equivalent to the differential equation f' = λf. For a general differ-
ential equation, we can consider λ = f'/f locally and hope that the stability
analysis is still valid locally. Such an assumption is not possible for integral
equations, since we need to approximate the kernel over the entire range. This
difference is basically because of the fact that a differential equation represents
local behaviour of the solution, while an integral equation represents the global
behaviour.
Consider the integral equation

$\int_0^x \lambda f(t)\,dt = f(x) - 1$.    (13.74)

Applying the trapezoidal rule to approximate the integral for x = nh and x =
(n+1)h, we can obtain the linear equations connecting f_1, ..., f_{n+1}. Subtracting
the two equations gives

$\Bigl(1 - \frac{\lambda h}{2}\Bigr) f_{n+1} - \Bigl(1 + \frac{\lambda h}{2}\Bigr) f_n = 0$.    (13.75)

This is the same difference equation that is obtained when the trapezoidal rule
is used for the solution of the differential equation f' = λf. We know (see {11.38})
that this method is absolutely stable for all values of λ in the negative half
of the complex plane and is relatively stable for all positive values of λ. We
therefore expect this method to be stable for integral equations also.
Similarly, if we apply the Simpson's 1/3 rule, or the Simpson's 3/8 rule
followed by the 1/3 rule, then it can be shown that the computed solution
satisfies the difference equation

$\Bigl(1 - \frac{\lambda h}{3}\Bigr) f_{n+2} - \frac{4\lambda h}{3} f_{n+1} - \Bigl(1 + \frac{\lambda h}{3}\Bigr) f_n = 0$.    (13.76)

Once again this is the same as the difference equation that is obtained when
Simpson's rule is used for initial value problems. We know (Example 12.1) that
this method is unstable for λh < 0, and hence we expect this method to be
unstable for integral equations when the solution is decreasing. On the other
hand, if we consider a variation of this method, where the 3/8 rule is applied
at the other end, then the resulting difference equation is rather complicated.
It turns out that this method is stable over a much wider range of λh values.
Thus, the application of the 3/8 rule on the last subinterval stabilises the method
based on the Simpson's 1/3 rule. Similarly, it can be shown that if the trapezoidal
rule is used on the first subinterval (instead of the 3/8 rule), then the resulting
method is unstable for λh < 0, while if the trapezoidal rule is applied at the
last subinterval, then the resulting method is stable.
Expansion methods of the type described in Section 13.3 may be used for
Volterra equations, but the resulting matrix will not be triangular, and more
effort is required to solve these equations.
EXAMPLE 13.6: Solve the following integral equation

$\lambda \int_0^x f(t)\,dt = f(x) + \lambda - (\lambda+1)e^{-x}, \qquad \lambda = 12$.    (13.77)

We select a mesh with uniform spacing h and use the trapezoidal rule to approximate the
integral. Substituting x = 0 in the equation gives the starting value f(0) = 1. Substituting

Table 13.5: Solving a Volterra equation of the second kind

Using trapezoidal rule Using Simpson's rule


x fexact h = 0.05 h = 0.025 h = 0.01 h = 0.05 h = 0.025 h = 0.01 h = 0.005

.10 .9048374 .905327 .904955 .904856 .905453 .9048646 .9048377 .9048374


.20 .8187308 .820860 .819230 .818809 .820727 .8188212 .8187318 .8187307
.30 .7408182 .748563 .742589 .741095 .747458 .7411180 .7408214 .7408177
.40 .6703200 .697395 .676339 .671256 .692392 .6713156 .6703314 .6703180
.60 .5488116 .872306 .616724 .559211 .792751 .5597870 .5489339 .5487846
.80 .4493290 4.298617 1.211995 .564329 3.145281 .5703328 .4506739 .4490354
1.00 .3678795 46.157890 8.929915 1.639254 30.162810 1.7019300 .3827050 .3646526

x = h, 2h, ..., we get successive equations to calculate approximations to f(h), f(2h), ...;
thus for x = rh we can use

$\Bigl(1 - \frac{\lambda h}{2}\Bigr) f_r = (\lambda+1)e^{-rh} - \lambda + \frac{\lambda h}{2} f_0 + \lambda h \sum_{i=1}^{r-1} f_i, \qquad (r = 1, 2, \ldots)$,    (13.78)

to calculate the solution f_r ≈ f(rh). The results using h = 0.05, 0.025 and 0.01 are shown
in Table 13.5. The computed values can be compared with the exact solution f(x) = e^{-x}. It
can be seen that initially the solution is fairly accurate, but the error increases rapidly. The
problem here can be traced to the extraneous solution e^{λx}, which arises because of the kernel.
For a general function g(x) on the right-hand side, the solution will have some component
of this solution. However, the right-hand side in this problem has been specially chosen to
suppress this component. This phenomenon is similar to instability in initial value problems
in differential equations (Example 12.4). In fact, this integral equation is equivalent to the
initial value problem
$y' = \lambda y - (\lambda + 1)e^{-x}, \qquad y(0) = 1$,    (13.79)
which is known to be ill-conditioned. Thus, for λ = 12 using the trapezoidal rule the error is
multiplied by f_{k+1}/f_k = (1 + 6h)/(1 - 6h) at every step. For h = 0.05, this factor is ≈ 1.86
and in 20 steps the error can increase by a factor of ≈ 2 × 10⁵.
From Table 13.5 it can be seen that the error decreases with h and it is possible to get more
accurate results using a smaller step size. This is clearly because of the fact that decreasing h
increases the accuracy, thus reducing the coefficient of the extraneous solution. It can be seen
that the error is O(h²). To achieve higher order accuracy, we can use the Simpson's 1/3 rule
or this rule followed by a single application of the Simpson's 3/8 rule. The results obtained using
this method are also shown in Table 13.5. It can be seen that for h = 0.05, the error using the
Simpson's rule appears to be larger than that using the trapezoidal rule. In particular, the
error in calculating the value f_1 at the first step is somewhat larger. This error is probably
due to the fact that the step size is a bit too large for the extraneous solution, since λh = 0.6.
Consequently, a higher order method does not necessarily give better accuracy. For smaller
values of h the results are more accurate when Simpson's rule is used. With Simpson's rule,
the error decreases as h⁴. All computations were performed using a 24-bit arithmetic. It is
clear that with h = 0.01 the truncation error in the first step is comparable to the roundoff error.
Hence, decreasing the step size further will not improve the results significantly. This is borne
out by the results in the last column using h = 0.005, where it can be seen that the error decreases
by roughly a factor of 3 rather than 16, as expected for the Simpson's rule. Further, using
a smaller step size requires much larger effort, since for Volterra equations the amount of
computation increases as N², where N is the number of steps used. For the same kernel, if
a different right-hand side is used, the problem may be stable and it will be possible to get
an accurate solution without much effort.
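The error amplification quoted above is easily checked with a few lines (a small illustrative calculation, not part of the text):

lam, h, steps = 12.0, 0.05, 20
factor = (1.0 + lam * h / 2.0) / (1.0 - lam * h / 2.0)   # (1 + 6h)/(1 - 6h) for lambda = 12
print(factor, factor ** steps)                           # about 1.86, growing to roughly 2e5 in 20 steps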
In the above example, the problem is due to ill-conditioning of the integral
equation itself. Apart from this, there could be problems caused by instability
of the numerical method. For example, if hλ < 0 and we use the Simpson's
1/3 rule or a single application of the Simpson's 3/8 rule followed by the 1/3 rule,
then errors start increasing exponentially. The behaviour of this method is very
similar to that of Milne's method for solving initial value problems. Because of
this instability, it is recommended to apply the 3/8 rule at the end. Further,
there could be difficulties for stiff problems, when the numerical method is not
absolutely stable. For example, if hλ ≪ -1, then the method based on Simp-
son's rule may not be stable, even though the extraneous solution is decreasing
rapidly. The trapezoidal rule is absolutely stable for Re(hλ) < 0, and it may
be used for solving stiff problems. Of course, it should be noted that all theo-
retical results on stability are for extremely simple form of integral equations.
The behaviour of these methods for more realistic equations could be quite
different.

13.8 Volterra Equations of the First Kind


Integral equations of the first kind are generally suspected of being ill-
conditioned, in the sense that a small change in g(x) or K(x, t) can have a
large effect on the solution. If the original problem had a solution, the per-
turbed equation may not have any solution, and vice versa. We have already
seen in Section 13.5 that Fredholm equations of the first kind are generally
ill-conditioned. Now, Volterra equations can be written in the Fredholm form
by defining the kernel to be zero for t > x. This transformation introduces a
discontinuity in the kernel if K(x, x) ≠ 0. As noted in Section 13.5, Fredholm
equations of the first kind with discontinuous or singular kernel are likely to be
better conditioned. We therefore expect Volterra equations of the first kind to be
less difficult to deal with. Nevertheless, Volterra equations of the first kind can
be ill-conditioned. For example, consider the problem

$\int_a^x f(t)\,dt = g(x)$.    (13.80)

It can be easily seen that, there is no solution, unless g(a) = 0 and g'(x) exists.
The solution is then f(x) = g'(x). Now, if g(x) is perturbed to g(x) + ε(x), then
the solution exists only if ε(a) = 0 and ε'(x) exists. Thus, for arbitrary infinites-
imal perturbation the solution may not exist. It may be noted that numerical
solution of this equation essentially involves computation of the derivative of
g(x), which is known to be sensitive to errors in the function values.
As noted in Section 13.1, Volterra equations of the first kind can be trans-
formed to those of the second kind by differentiation. If the required derivatives
can be easily computed, then this technique can be used for numerical solution.
Alternately, it is also possible to transform the equation of the first kind to that
of the second kind using integration by parts. Thus

$g(x) = \int_a^x K(x, t)\,f(t)\,dt = K(x, x)F(x) - \int_a^x \frac{\partial}{\partial t}K(x, t)\,F(t)\,dt$,    (13.81)

where

$F(x) = \int_a^x f(t)\,dt$.    (13.82)

If K(x, x) ≠ 0, then (13.81) gives an equation of the second kind in F(x). The
required solution f(x) can then be computed by differentiating F(x). It should
be noted that numerical differentiation may introduce additional errors in the
solution.
If conversion to equation of the second kind is not straightforward, then it
may be better to treat the equation directly. The methods for Volterra equations
of the second kind can be applied to those of the first kind after some obvious
modifications. For example, quadrature methods can be applied to equations of
the first kind. However, in this case, since f(x) does not occur on the right-hand
side, it is not possible to use simple fixed-point iteration to solve the nonlin-
ear equations. In this case, we can use secant iteration to solve the nonlinear
equation at each step. But, if the integral equation is linear, then there is no
difficulty in solving the resulting system of algebraic equations. Hence, in this
section we consider only linear equations of the form

$\int_a^x K(x, t)\,f(t)\,dt = g(x)$.    (13.83)

If the quadrature formula requires f(a), then this value needs to be determined
separately. Differentiating this equation with respect to x, we get the initial
value f(a) = g'(a)/K(a, a). The derivative g'(a) may be computed numerically
using g'(a) = (g(a+h) - g(a-h))/(2h). The solution can be continued further
using any suitable quadrature formula, e.g., the trapezoidal rule, which gives
the equations

$\frac{h}{2} K(a_r, a_r)\,f_r = g(a_r) - \frac{h}{2} K(a_r, a)\,f_1 - h \sum_{i=2}^{r-1} K(a_r, a_i)\,f_i$,    (13.84)

where a_i = a + (i-1)h. Thus, if the solution is known at the previous points,
the value at the next point can be computed using this formula, provided
K(a_r, a_r) ≠ 0. In order to compute the solution, we should have K(x, x) ≠ 0
for a ≤ x ≤ b, where [a, b] is the range in which the solution is required. It
may be noted that if this condition is satisfied, then it is possible to convert
the problem to that of the second kind using differentiation or integration.
It is found that the results obtained using the trapezoidal rule show some
oscillations in the computed function values. It has been suggested that these
oscillations can be smoothed out using

$\phi(a_r) = \frac{1}{4}\bigl(f_{r-1} + 2f_r + f_{r+1}\bigr)$.    (13.85)

This averaging cannot be performed at the first and the last point. Since the
number of steps to be performed is in our hands, the problem at the last point can
be avoided by computing the solution at one more point. It can be shown that
under suitable differentiability conditions, the truncation error φ(a_r) - f(a_r) in
the smoothed values is O(h²).
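A Python sketch of this procedure, the trapezoidal recursion (13.84) followed by the smoothing (13.85), is given below; the names and the centred difference used for the starting value are illustrative.

import numpy as np

def volterra1_trapezoidal(K, g, a, b, n):
    # Solve int_a^x K(x,t) f(t) dt = g(x) by (13.84); requires K(x,x) != 0.
    h = (b - a) / (n - 1)
    x = a + h * np.arange(n)
    f = np.empty(n)
    f[0] = (g(a + h) - g(a - h)) / (2.0 * h) / K(a, a)    # f(a) = g'(a)/K(a,a)
    for r in range(1, n):
        s = g(x[r]) - 0.5 * h * K(x[r], x[0]) * f[0] \
            - h * np.dot(K(x[r], x[1:r]), f[1:r])
        f[r] = s / (0.5 * h * K(x[r], x[r]))
    fs = f.copy()                                         # smoothing (13.85);
    fs[1:-1] = 0.25 * (f[:-2] + 2.0 * f[1:-1] + f[2:])    # end points left unsmoothed
    return x, f, fs

# illustrative use on Eq. (13.87) below, exact solution exp(-x):
# x, f, fs = volterra1_trapezoidal(lambda x, t: np.exp(x - t), np.sinh, 0.0, 5.0, 51)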
Instead of the trapezoidal rule, it is possible to use the midpoint rule, which
requires the function values f(a + (2i-1)h), (i = 1, 2, ..., N) only. Hence, in this
case, we do not need to calculate the starting value f(a). Using this quadrature
rule the successive values can be calculated using

$2h\,K(a_{2i}, a_{2i-1})\,f_{2i-1} = g(a_{2i}) - 2h \sum_{j=1}^{i-1} K(a_{2i}, a_{2j-1})\,f_{2j-1}, \qquad (i = 1, 2, \ldots)$,    (13.86)

where a_j = a + jh and f_{2i-1} gives an approximation to f(a_{2i-1}). Here the
summation may be dropped for i = 1. Truncation error in the computed values
can be shown to be O(h²). It should be noted that this method can be applied
even when K(x, x) = 0.
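The midpoint-rule variant (13.86) can be sketched in the same way; here f is computed only at the odd abscissas a + (2i-1)h, following the indexing in the text (again an illustrative sketch, not the book's routine).

import numpy as np

def volterra1_midpoint(K, g, a, h, n):
    # Returns approximations f[i] ~ f(a + (2i+1)h), i = 0,...,n-1, from (13.86).
    # No starting value f(a) is needed and K(x, x) = 0 is allowed.
    t = a + h * (2.0 * np.arange(n) + 1.0)     # abscissas a + (2i-1)h in the book's numbering
    xc = a + 2.0 * h * np.arange(1, n + 1)     # collocation points a + 2ih
    f = np.empty(n)
    for i in range(n):
        s = g(xc[i]) - 2.0 * h * np.dot(K(xc[i], t[:i]), f[:i])
        f[i] = s / (2.0 * h * K(xc[i], t[i]))
    return t, f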
EXAMPLE 13.7: Solve the integral equation

$\int_0^x e^{x-t}\,f(t)\,dt = \sinh(x)$.    (13.87)

The numerical solution can be obtained using the trapezoidal rule as explained above.
The results obtained using h = 0.1 and 0.05 are shown in Table 13.6. These results can be
compared with the exact solution f(x) = e^{-x}. This table also shows the relative error in the
computed solution. It can be seen that the error keeps increasing steadily with x. The results
show some oscillation and the error is alternating in sign. As explained above, the oscillations
can be removed by averaging the function value using neighbouring values. This averaging
cannot be performed at the first point. The averaged or smoothed results are also shown
in the table. It can be seen that for large values of x the smoothed values are significantly

Table 13.6: Solving a Volterra equation of the first kind

                  Using trapezoidal rule                              After smoothing

 x    fexact   h = 0.1    Error        h = 0.05   Error        h = 0.1   Error        h = 0.05  Error

0.0  1.00000   1.00167   1.7 x 10⁻³    1.00042   4.2 x 10⁻⁴
0.5   .60653    .59629   1.7 x 10⁻²     .60808   2.6 x 10⁻³    .60605   7.9 x 10⁻⁴    .60640   2.1 x 10⁻⁴
1.0   .36788    .38022   3.4 x 10⁻²     .37095   8.3 x 10⁻³    .36754   9.2 x 10⁻⁴    .36780   2.2 x 10⁻⁴
1.5   .22313    .20004   1.0 x 10⁻¹     .22856   2.4 x 10⁻²    .22300   5.7 x 10⁻⁴    .22308   2.1 x 10⁻⁴
2.0   .13534    .17178   2.7 x 10⁻¹     .14459   6.8 x 10⁻²    .13514   1.5 x 10⁻³    .13529   3.1 x 10⁻⁴
2.5   .08208    .02108   7.4 x 10⁻¹     .09763   1.9 x 10⁻¹    .08217   1.1 x 10⁻³    .08206   2.6 x 10⁻⁴
3.0   .04979    .14982   2.0 x 10⁰      .07540   5.1 x 10⁻¹    .04948   6.1 x 10⁻³    .04975   6.8 x 10⁻⁴
3.5   .03020   -.13540   5.5 x 10⁰      .07111   1.4 x 10⁰     .03059   1.3 x 10⁻²    .03019   1.3 x 10⁻⁴
4.0   .01832    .29113   1.5 x 10¹      .08458   3.6 x 10⁰     .01760   3.9 x 10⁻²    .01814   9.7 x 10⁻³
4.5   .01111   -.43854   4.0 x 10¹      .12293   1.0 x 10¹     .01222   1.0 x 10⁻¹    .01106   4.2 x 10⁻³
5.0   .00674    .74690   1.1 x 10²      .18837   2.7 x 10¹     .00490   2.7 x 10⁻¹    .00656   2.6 x 10⁻²

better than the raw values. In some cases, the raw values are off by an order of magnitude,
while the smoothed result has an accuracy of at least one significant figure. In both sets of
results, the error is roughly proportional to h². It may be noticed that, this problem is not
as ill-conditioned as those in Examples 13.4 and 13.5. However, in general, the equations of
the first kind are more ill-conditioned than those of the second kind.

It is possible to use higher order quadrature formulae to achieve higher
accuracy. But in many cases, it has been shown that the results obtained using
such formulae do not converge to the true solution in the limit h → 0. The use
of Gregory's quadrature formula may also lead to results which do not converge
to the true solution in the limit h → 0. Hence, this formula should not be used
to apply the method of deferred correction. We can use h → 0 extrapolation
to achieve higher order accuracy. However, because of oscillations in the com-
puted values, this technique cannot be applied to the results obtained using
the trapezoidal rule. But we can use the h → 0 extrapolation on the smoothed
values φ(a_r), to achieve higher order accuracy. The h → 0 extrapolation can
also be used to improve on the computed results obtained using the midpoint
rule. However, in that case, we need to decrease h by a factor of three, so that
the old points are again the abscissas in the quadrature formula. Some other
higher order methods for solution of Volterra equations of the first kind are
described in Delves and Walsh (1974) and Baker (1977).

Bibliography
Backus, G. and Gilbert, F. (1968): The Resolving Power of Gross Earth Data, Geophys. J.
Roy. Astron. Soc., 16, 169.
Baker, C. T. H. (1977): The Numerical Treatment of Integral Equations, Clarendon Press,
Oxford.
Bellman, R., Kalaba, R. E. and Lockett, J. A. (1966): Numerical Inversion of the Laplace
Transform, Elsevier, New York.
Brunner, H. and Van der Houwen, P. J. (1986): The Numerical Solution of Volterra Equations,
North-Holland, Amsterdam.
Craig, I. J. D. and Brown, J. C. (1986): Inverse Problems in Astronomy: A Guide to Inversion
Strategies for Remotely Sensed Data, A. Hilger, Bristol.
Delves, L. M. and Walsh, J. (ed.) (1974): Numerical Solution of Integral Equations, Clarendon
Press, Oxford.
Fox, L. (ed.) (1962): Numerical Solution of Ordinary and Partial Differential Equations,
Pergamon Press, Oxford.
Fox, L. and Parker, I. B. (1968): Chebyshev Polynomials in Numerical Analysis, Oxford
University Press, London.
Gough, D. O. and Thompson, M. J. (1991): in Solar Interior and Atmosphere, (eds.) A. N.
Cox, W. C. Livingston and M. S. Matthews, University of Arizona Press, Tucson.
Ikebe, Y. (1972): The Galerkin Method for the Numerical Solution of Fredholm Integral
Equations of the Second Kind, SIAM Rev., 14, 465.
Jerri, A. J. (1999): Introduction to Integral Equations with Applications, (2nd ed.), John
Wiley, New York.
Kirsch, A. (1996): An Introduction to the Mathematical Theory of Inverse Problems, (Applied
Mathematical Sciences, Vol. 120), Springer-Verlag.
Kopal, Z. (1961): Numerical Analysis, (2nd ed.), John Wiley, New York.
Menke, W. (2012): Geophysical Data Analysis: Discrete Inverse Theory, (3rd ed.), Academic
Press, Orlando.
Morse, P. M. and Feshbach, H. (1953): Methods of Theoretical Physics, Part I, McGraw-Hill,
New York.
Rice, J. R. (1992): Numerical Methods, Software, and Analysis, (2nd ed.), Academic Press,
New York.
Ritzwoller, M. H. and Lavely, E. M. (1991): A Unified Approach to the Helioseismic Forward
and Inverse Problems of Differential Rotation, Astrophys. J., 369, 557.
Tarantola, A. (2004): Inverse Problem Theory and Methods for Model Parameter Estimations,
SIAM, Philadelphia.

Exercises
1. Show that the initial value problem
$$\frac{d^n y}{dt^n} + p_1(t)\,\frac{d^{n-1} y}{dt^{n-1}} + \cdots + p_{n-1}(t)\,\frac{dy}{dt} + p_n(t)\,y = g(t),$$
with initial conditions y^(k)(0) = a_k for k = 0, 1, ..., n−1 is equivalent to a Volterra
equation of the second kind. This equation can be obtained by defining y^(n)(x) = f(x)
and integrating it n times to get y^(n−1)(x), ..., y'(x), y(x), using the initial conditions
at each stage. Use the identity
$$\int_0^x \int_0^x \cdots \int_0^x f(t)\,dt^n = \int_0^x \frac{(x-t)^{n-1}}{(n-1)!}\,f(t)\,dt,$$
to transform the multiple integrals into a single integral and obtain a Volterra equation
of the second kind.
2. Using the technique of the previous problem, transform the following initial value prob-
lems into integral equations
(i) y' − λy = 0, y(0) = 1;
(ii) y" + V; + y = 0, yeO) = 1, y'(O) = O.

3. For a Volterra equation of the third kind show that f^(k)(a) = 0. Hence, show that if the
function has a Taylor series expansion about x = a, then it must be identically zero.
4. Solve the following integral equations
$$\text{(i)}\quad f(x) = x(1 - 2x^2 + x^3) + \int_0^1 K(x,t)\,f(t)\,dt, \qquad K(x,t) = \begin{cases} t(1-x), & \text{if } 0 \le t < x \le 1; \\ x(1-t), & \text{if } 0 \le x \le t \le 1; \end{cases}$$
$$\text{(ii)}\quad f(x) + \frac{e^{x+1} - 1}{x+1} - e^x = \int_0^1 e^{xt}\,f(t)\,dt;$$
$$\text{(iii)}\quad f(x) = \frac14 + \frac{x}{2} - \frac12 x^2\ln x + \frac12(x^2 - 1)\ln(1+x) + \int_0^1 \ln|x+t|\,f(t)\,dt.$$

5. Solve Love's equation which arises in electrostatics
$$f(x) + \frac{1}{\pi}\int_{-1}^{1} \frac{a\,f(y)}{a^2 + (x-y)^2}\,dy = g(x).$$
Use a = −1, g(x) = 1 and find the solution f(x), −1 ≤ x ≤ 1. Apply the technique of
deferred correction using Gregory's formula with varying number of differences included
and estimate the error in the computed values.
6. Apply the technique of weakening a singularity described in Section 13.2 to the following
integral equations
$$\text{(i)}\quad f(x) = \frac14 e^{2x} - \frac12 x e^{2} - \frac14 + e^x + \int_0^1 \min(x,t)\,e^t f(t)\,dt;$$
$$\text{(ii)}\quad f(x) = x - \frac12\left(x^2\ln x + (1 - x^2)\ln(1-x) - \left(x + \frac12\right)\right) + \int_0^1 \ln|x-t|\,f(t)\,dt.$$
Compare the results with those obtained using the direct method.


7. An n × n matrix A is said to be centrosymmetric if a_{ij} = a_{n−i,n−j}. Similarly, a kernel
K(x,t) is said to be centrosymmetric on [a,b] if K(a+s, a+u) = K(b−s, b−u). For
example, a kernel of convolution type K(x,t) = K(|x−t|) is centrosymmetric. In addition,
if g(a+s) = g(b−s), then show that the Fredholm equation of the second kind can be
simplified to
$$\int_0^{\beta}\bigl(K(\alpha+s,\,\alpha+u) + K(\alpha-s,\,\alpha+u)\bigr)\,f(\alpha+u)\,du = f(\alpha+s) + g(\alpha+s),$$
where α = (a+b)/2 and β = (b−a)/2. This formulation has the advantage that only
half the range needs to be considered. Apply this technique to solve Love's equation {5}.
8. Solve the following integral equation using the quadrature method
$$f(x) + \int_0^1 \sqrt{xt}\,f(t)\,dt = \sqrt{x}.$$
The derivatives of the kernel are singular at t = 0, but show that the trapezoidal rule
gives the correct answer, because this singularity is "cancelled" by that in the solution
f(x) = 2√x/3.
9. Solve the integral equation
$$f(x) = (1 - x^2)^{3/4} - \frac{\pi\sqrt{2}}{4}(2 - x^2) + \frac{2}{3}\int_{-1}^{1} |x - t|^{-1/2}\,f(t)\,dt,$$
using a quadrature method with a singular weight function. Obtain a quadrature formula
of the form
$$\int_{-1}^{1} \frac{\phi(t)}{|a_j - t|^{1/2}}\,dt \approx \sum_{i=0}^{N} w_i(a_j)\,\phi(a_i),$$
where a_i = −1 + ih and h = 2/N. The weights are chosen so that this formula is exact
for continuous functions which are linear in the intervals [a_i, a_{i+1}].
10. Solve the following Fredholm equation for which the solution is not unique (Why?)
$$f(x) + \frac{\pi^2}{6}(x - x^3) - x = \int_0^1 K(x,t)\,f(t)\,dt, \qquad K(x,t) = \begin{cases} \pi^2\,x(1-t), & \text{for } 0 \le x \le t \le 1; \\ \pi^2\,t(1-x), & \text{for } 0 \le t < x \le 1. \end{cases}$$

11. Consider the coupled integral equations
$$f_1(x) + g_1(x) = \int_a^b L(x,t)\,f_2(t)\,dt, \qquad f_2(x) + g_2(x) = \int_a^b M(x,t)\,f_1(t)\,dt.$$
Eliminate f_2 to obtain an integral equation in f_1(x) with kernel
$$K(x,t) = \int_a^b L(x,y)\,M(y,t)\,dy \approx \sum_{i=1}^{N} w_i\,L(x,a_i)\,M(a_i,t).$$
This kernel can be approximated by a degenerate kernel using a quadrature formula. The
equation with the degenerate kernel can then be solved. This technique can be applied to a
Fredholm equation of the second kind, by applying the operator I + K to the equation,
giving
$$f(x) - \int_a^b L(x,t)\,f(t)\,dt = -g(x) - \int_a^b K(x,t)\,g(t)\,dt, \qquad L(x,t) = \int_a^b K(x,y)\,K(y,t)\,dy.$$
Apply this technique to the equation in Example 13.1.


12. Apply the expansion method to the Fredholm equation in Example 13.1. Find the coefficients
of expansion by minimising the L₁, L₂ or L∞ norm of the residual. Compare the results
with those in Examples 13.1 and 13.2. Also compare the actual error with the magnitude
of the residuals.

13. Find the first three eigenvalues and eigenfunctions of the following kernels
$$\text{(i)}\quad K(x,t) = \begin{cases} x(1-t), & \text{for } 0 \le x < t \le 1; \\ t(1-x), & \text{for } 0 \le t \le x \le 1; \end{cases}$$
$$\text{(ii)}\quad K(x,t) = \begin{cases} -\ln t, & \text{for } 0 \le x < t \le 1; \\ -\ln x, & \text{for } 0 \le t \le x \le 1. \end{cases}$$

14. Solve the eigenvalue problems in the previous exercise using h → 0 extrapolation. Obtain
the eigenvalues using the trapezoidal rule based on N = 2^k + 1 points (k = 1, 2, ...) and
apply the h → 0 extrapolation to improve on these values. Estimate the exponent of h
in the leading term of the truncation error.
15. Solve the eigenvalue problem in Example 13.3 using expansion methods. Use φ_i(x) = x^{i−1}
and apply the collocation or the Galerkin method to reduce the problem to an algebraic
form. Compare the efficiency of this technique with quadrature methods.
16. Apply the quadrature method with the trapezoidal rule to find the nonzero eigenvalue of
$$\int_0^1 (xt)^{1/2}\,f(t)\,dt = \lambda f(x).$$

17. Find the three eigenvalues with the largest magnitude for the following problem which arises in
the study of water waves in a dock
$$\int_{-1}^{1} \ln|x - t|\,f(t)\,dt = \lambda f(x).$$


18. For a Fredholm equation of the first kind with a degenerate kernel
$$K(x,t) = \sum_{i=1}^{p} X_i(x)\,T_i(t),$$
prove that a solution exists only if the right-hand side function g(x) is of the form
$$g(x) = \sum_{i=1}^{p} a_i X_i(x),$$
where a_i are some constants. If g(x) is of this form, then show that any function
f(x) which satisfies the constraints
$$\int_a^b T_i(t)\,f(t)\,dt = a_i,$$
is a solution of the integral equation.


19. Solve the following integral equations
$$\text{(i)}\quad \int_{-a}^{a} \ln|x - y|\,f(y)\,dy = \pi, \qquad a = 1,\ 1.5,\ 1.9,\ 2;$$

20. Consider a Fredholm equation of the first kind with the kernel in Example 13.1. Show
that the solution exists only if g(0) = g(1) = 0. Using the trapezoidal rule based on N + 1
points, set up the corresponding matrix equations and show that after discarding the
first and the last equations, the matrix can be inverted explicitly to give the solution
$$f(ih) = -\frac{g(ih-h) - 2g(ih) + g(ih+h)}{h^2}, \qquad (i = 1, \ldots, N-1),$$
which approximates the exact solution f(x) = −g''(x). Assuming N is even, apply
Simpson's rule to the same problem and show that the solution is
$$f(ih) = \begin{cases} -0.75\bigl(g(ih-h) - 2g(ih) + g(ih+h)\bigr)/h^2, & \text{for } i \text{ odd}; \\ -1.5\bigl(g(ih-h) - 2g(ih) + g(ih+h)\bigr)/h^2, & \text{for } i \text{ even}. \end{cases}$$
Hence, show that even if the equations are solved exactly, the computed solution does
not tend to the exact solution as h → 0. Use g(x) = x(1 − x) and solve these equations
numerically using both the trapezoidal and Simpson's rule with N = 11, 21, 41. Find the
solution at x = 0.2, 0.4, 0.6, 0.8 and study the convergence of the results as N → ∞.
21. Find the inverse Laplace transform defined by the following integral equation
$$\int_0^{\infty} e^{-st}\,f(t)\,dt = g(s) = \frac{2}{s} - \frac{1}{s+1}.$$
Transform the range of integration to a finite value using the transformation y = e^{−t} and
use the Gauss–Legendre formula to solve the problem. Compare the results with those in
Example 10.7.
22. Solve the following Fredholm equation of the first kind
$$\int_{-\infty}^{\infty} f(t)\,e^{-(x-t)^2}\,dt = \sqrt{\pi}\,e^{-1/4}\cos x.$$

23. Solve the inverse problem defined in Example 13.5 using artificial data generated with
f(r) = 430 + 20 sin(30r) nHz. The kernels for the problem can be found in the online
material. Use these kernels to solve the forward problem and calculate the splitting co-
efficients d_i, and then add random errors to these values. Use different regularisation
techniques to invert this artificial data and compare the results with the exact profile. Try
the exercise using different magnitudes of errors, namely, 0.2, 1, 5 nHz. Try to estimate
the optimal value of the regularisation parameter. Does the optimum value of the regularisation
parameter depend on the magnitude of the errors?
24. A variant of the inverse problem is the so-called moment problem, which occurs in
transport phenomena. In this case the moments
$$M_n = \int_a^b w(t)\,t^n f(t)\,dt$$
are known for some weight function w(t), while f(t) is the unknown function to be deter-
mined. Use w(t) ≡ 1, a = 0, b = 1 and M_n = 1/(n+1)², n = 0, 1, ..., 15 to determine
f(t). Try the quadrature method using 16 abscissas to determine f(t). Also try expansion
methods using polynomial or B-spline basis functions.
25. Solve the following integral equations (0 ≤ x ≤ 10)
$$\text{(i)}\quad f(x) = \sin x + \int_0^x (x - t)\,f(t)\,dt,$$
$$\text{(ii)}\quad f(x) + \int_0^x \cosh(x - t)\,f(t)\,dt = \sinh x.$$

26. Convert the Volterra equations in the previous problem to Fredholm type by extending
the range of integration to t = 10 and assuming the kernel to vanish for t > x. Apply
quadrature methods based on the trapezoidal and the Simpson's 1/3 rule to these Fredholm
equations and compare the results with those obtained in the previous problem.
27. Consider the following integral equations
$$\text{(i)}\quad f_1(x) = x - \frac12\lambda x^2 + \lambda\int_0^x f_1(t)\,dt,$$
$$\text{(ii)}\quad f_2(x) = 1 + \lambda\int_0^x f_2(t)\,dt,$$
with exact solutions f_1(x) = x and f_2(x) = e^{λx}, respectively. Use h = 0.1, 0.5 and −50 ≤
λ ≤ 50, and apply the following quadrature methods to find the solution in the range
0 ≤ x ≤ 10: (i) trapezoidal rule, (ii) Simpson's 1/3 rule or one application of the trapezoidal
rule followed by Simpson's 1/3 rule, (iii) Simpson's 1/3 rule or Simpson's 1/3 rule followed
by one application of the trapezoidal rule, (iv) Simpson's 1/3 rule or one application of
Simpson's 3/8 rule followed by the 1/3 rule, (v) Simpson's 1/3 rule or Simpson's 1/3 rule
followed by one application of the 3/8 rule. Study the stability of each of these methods for
different values of hλ.
28. Solve the following integral equation and compare the results with those in Example 13.5:
$$\int_0^x 12\,f(t)\,dt = f(x) + 13 - 13 e^{-x}.$$

29. Consider the following integral equations:
$$\text{(i)}\quad \int_0^x \cos(x - t)\,f_1(t)\,dt = \sin x,$$
$$\text{(ii)}\quad \int_0^x f_2(t)\,dt = x^2,$$
with exact solutions f_1(x) = 1 and f_2(x) = 2x, respectively. Use h = 0.2, 0.1, 0.05, 0.01
and apply the following quadrature methods to find the solution in the range 0 ≤ x ≤ 10:
(i) trapezoidal rule, (ii) midpoint rule, (iii) Simpson's 1/3 rule or one application of
Simpson's 3/8 rule followed by the 1/3 rule, (iv) Simpson's 1/3 rule or Simpson's 1/3 rule
followed by one application of the 3/8 rule. Study the stability of each of these methods for
different values of h. In the first two cases also try the h → 0 extrapolation.
30. Solve the following integral equations:
$$\text{(i)}\quad \int_0^x e^{2(x-t)}\,f(t)\,dt = \sin x,$$
$$\text{(ii)}\quad \int_0^x e^{x-t}\,f(t)\,dt = x^2.$$
Convert these equations to Volterra equations of the second kind and find the solution.
Compare the efficiency and accuracy of the two approaches.
31. Solve the following integral equation where K(x,x) = 0:
$$\int_0^x \sin(x - t)\,f(t)\,dt = 1 - \cos x.$$

32. Solve the following nonlinear integral equations:
$$\text{(i)}\quad f(x) + \int_0^1 \bigl(x + f(t)\bigr)^2\,dt = x^2 + 2x + \frac13,$$
$$\text{(ii)}\quad f(x) = e^x - x + \frac12 x\,e^{\frac12 x^2} - \int_0^x x\,t\,f^2(t)\,dt.$$

33. Solve the following nonlinear integral equation which arises in the theory of radiative
transfer:
$$f(x) = 1 + \frac{\lambda}{2}\,f(x)\int_0^1 \frac{x}{x+y}\,f(y)\,dy, \qquad \lambda = 0.5,\ 0.95,\ 1.$$
34. Solve the following system of two coupled integral equations
$$f_1(x) = 1 + a\,x\int_0^1 \frac{f_1(x)f_1(t) - f_2(x)f_2(t)}{x+t}\,dt,$$
$$f_2(x) = e^{-b/x} + a\,x\int_0^1 \frac{f_2(x)f_1(t) - f_1(x)f_2(t)}{x-t}\,dt,$$
where a and b are constants. Use a = 0.25, b = 1.
Chapter 14

Partial Differential Equations

Partial differential equations are basic tools for modelling physical phenomena.
Numerical solution of partial differential equations requires a large amount of
computer time and memory. Consequently, the amount of computer time spent
in solving them is probably larger than that for any other class of problems.
Theory and analysis of numerical methods for partial differential equations is
rather difficult and actually beyond the scope of this book. As Rice (1992) has
correctly pointed out, partial differential equations are too important to ignore
and too difficult to cover adequately in a general text like this one.
The aim of this chapter is to give an introduction to the basic numerical
methods for solving partial differential equations. These methods are explained
by considering some simple equations. However, unlike ordinary differential
equations, it is not always straightforward to generalise these techniques to
partial differential equations of other forms. Every partial differential equation
that arises often in practical problems has its associated numerical techniques,
which have evolved after years of experimentation and analysis. As a result,
readers may find that the methods which are suitable or effective for a par-
ticular practical problem are not even mentioned in this chapter. The basic
idea in this chapter is to introduce some of the simpler methods and point out
the difficulties that arise in numerical solution of partial differential equations.
We mainly consider finite difference methods for solution of partial differential
equations. This choice has been made because of the simplicity of these meth-
ods rather than any other property. Finite element methods are outlined only
very briefly.
Apart from the difficulty in computing the solution, it is equally difficult
to interpret the numerical solution. This difficulty arises because numerical
solution of partial differential equations generates an enormous amount of data. To
extract the required information out of these numbers is rather difficult. This
difficulty is enhanced by the fact that, because of limitations on computing
resources we often end up with results which are barely sufficient to supply the
relevant information.

In this chapter, we start with a brief introduction to the theory and classi-
fication of partial differential equations. Finite difference methods for parabolic
equations are considered in Sections 14.2-4, while those for hyperbolic equa-
tions are considered in Sections 14.5-6. Boundary value problems defined by
elliptic equations are considered in Section 14.7. The solution of these problems
leads to large systems of linear equations, and some special methods to solve
these systems of linear equations are considered in Sections 14.8-10. Finally,
Section 14.11 gives a brief introduction to the finite element methods.

14.1 Introduction
Analogous to ordinary differential equations, the solution of partial differential
equations can be specified uniquely, only if additional initial or boundary con-
ditions are supplied. For partial differential equations, since derivatives with
respect to several independent variables are involved, we require boundary or
initial conditions over the enclosing hypersurface. For example, if the equation
has only two independent variables x and t, then the boundary conditions may
be specified over a line in the (x, t) plane. If this line forms a closed curve,
then we have a boundary value problem similar to the two-point boundary
value problem in ordinary differential equations. On the other hand, if this
line is open, then we have an initial value problem. It should be noted that
partial differential equations require an infinite number of boundary conditions,
one at each point along the line. In practical methods, we can only specify
the boundary condition over a finite number of points, or by using some finite
approximation to the function defining the boundary condition. This approxi-
mation plays a crucial role in the stability of numerical methods for solution of
partial differential equations. Thus, a finite difference method which is unstable
to perturbations on scales comparable to the mesh spacing is not likely to be
of any use in computing the solution, since the boundary conditions cannot be
represented correctly on that scale.
A partial differential equation can be classified according to various prop-
erties. Some of the characteristic properties are as follows: The order is the
order of the highest derivative that appears in the equation; the dimension is
the number of independent variables in the equation. Sometimes for initial value
problems, dimension refers to the number of "space" variables, while "time"
is not counted. An equation is said to be linear if it is linear in the unknown
variable and its derivatives. A nonlinear differential equation which is linear
in the derivatives of the unknown function (see {10}) is sometimes referred
to as quasilinear. In this chapter, we will mostly consider second-order linear
equations in two dimensions.
The boundary conditions can also be of different types. Dirichlet condition
consists in prescribing only the values of the function on a hypersurface. For
Neumann condition, only the values of the derivative of the function along
the normal to a hypersurface are specified. Cauchy conditions define both the
function value as well as the normal derivative along a hypersurface. These
boundary conditions apply to second-order differential equations. For equations
of higher order, boundary conditions may involve derivatives of higher order.
The majority of problems in science and engineering fall naturally into
one of the three categories: equilibrium problems, eigenvalue problems and
evolution problems. Equilibrium problems describe a steady state, in which
the equilibrium configuration of a system is to be determined by solving the
differential equation subject to some boundary conditions that are usually spec-
ified on a closed curve enclosing the region over which the solution is required.
Typical examples include steady viscous flow, equilibrium temperature distri-
bution, equilibrium stresses in elastic structures. Despite the apparent diversity
of the physics, it turns out that in most cases, governing equations for equi-
librium problems can be classified as elliptic. For partial differential equations,
the type of boundary conditions are closely related to the type of differential
equations. Thus, in most cases, boundary value problems are associated with
elliptic differential equations.
Eigenvalue problem can be thought of as a homogeneous version of the
boundary value problem. Typical examples include stability of fluids or elastic
objects, resonance in electric circuits or acoustics. Eigenvalue problems are also
usually associated with elliptic differential equations. These problems actually
arise from linearised equations involving time derivatives, where the coefficients
are time independent. Hence, a solution of the form f(x,y,z,t) = e^{iωt} F(x,y,z)
is sought, where the frequency w can be treated as the eigenvalue. Once the
time derivatives are eliminated by substituting this form of the solution, the
resulting differential equation in spatial coordinates is elliptic.
In contrast to the equilibrium problems, the evolution (or the initial value)
problems involve calculation of an unsteady state, where the initial configura-
tion is given and we need to find how the system evolves in time. Typical
examples of such problems include motion of fluids, displacement in elastic ob-
jects, propagation of heat. In this case, the governing equations are usually
parabolic or hyperbolic.
Partial differential equations are usually classified according to the so-
called characteristic curves, which are essentially the curves along which infor-
mation is propagated. The equations for which all characteristic curves are real
are referred to as hyperbolic, while those for which the curves do not exist, are
termed elliptic. Those equations for which some of the chamcteristic curves are
coinciding are termed parabolic. Hyperbolic and parabolic equations are usu-
ally associated with initial value problems, while elliptic equations are usually
associated with boundary value problems.
In order to make these ideas clear, let us consider a second-order partial
differential equation in two dimensions

$$a\,\frac{\partial^2 u}{\partial x^2} + 2b\,\frac{\partial^2 u}{\partial x\,\partial y} + c\,\frac{\partial^2 u}{\partial y^2} + \text{(lower order terms)} = 0, \qquad (14.1)$$

where a, b and c are constants. The basic properties of this equation depend
on the sign of b² − ac. If this quantity is positive, then we have a hyperbolic
equation, if it is zero, then the equation is parabolic, while if it is negative,
then the equation is elliptic. It should be noted that the presence of first-order
derivatives in the equation does not affect the classification. If the coefficients
are not constant, but functions of x and y, then the local behaviour of the solu-
tion is determined by the sign of b² − ac, and if this sign is the same throughout
the region, then the equation can still be classified into the three categories,
although in this case the characteristics may not be straight lines. If b² − ac
changes sign inside the region of interest, then the equation is of mixed type,
and the problem may not be well posed.
A typical example of hyperbolic equations is the wave equation

$$\frac{\partial^2 u}{\partial t^2} = c^2\,\frac{\partial^2 u}{\partial x^2}, \qquad (14.2)$$

where for simplicity we may assume c to be a constant. The characteristics of
this equation are x ± ct = const. The general solution of this equation can be
written as
u(x, t) = f(x + ct) + g(x - ct), (14.3)
where f and g can be arbitrary differentiable functions. If c > 0, then
u(x, t) = f(x + ct) represents a wave travelling to the left with a velocity
c. This wave retains its original shape. Similarly, u(x, t) = g(x - ct) represents
a wave travelling to the right. However, since the general solution is a super-
position of two waves travelling in opposite directions, it may not retain its
shape. To specify the solution uniquely, we need some boundary conditions.
Let us consider the Cauchy conditions at t = 0

$$u(x,0) = a(x), \qquad \left.\frac{\partial u}{\partial t}\right|_{t=0} = b(x). \qquad (14.4)$$

Substituting the general solution (14.3) in these initial conditions, we get

$$f(x) + g(x) = a(x), \qquad c\,\frac{df(x)}{dx} - c\,\frac{dg(x)}{dx} = b(x), \qquad (14.5)$$
which gives

$$f(\xi) = \frac12 a(\xi) + \frac{1}{2c}\int_{x_1}^{\xi} b(y)\,dy + \text{const}, \qquad g(\eta) = \frac12 a(\eta) - \frac{1}{2c}\int_{x_1}^{\eta} b(y)\,dy - \text{const}. \qquad (14.6)$$

Therefore, the solution is

$$u(x,t) = \frac12\left(a(x+ct) + a(x-ct) + \frac{1}{c}\int_{x-ct}^{x+ct} b(y)\,dy\right). \qquad (14.7)$$

Since the initial conditions specify a(x) and b(x) only in the interval [x₁, x₂],
the above expression is meaningful when x₁ ≤ x ± ct ≤ x₂, which determines a
Figure 14.1: Solution of the wave equation: The oblique lines are the charac-
teristics. The Cauchy conditions on [x₁, x₂] give a unique solution within the
shaded rectangle delimited by the characteristics that pass through x₁ and x₂.

rectangle in the (x, t) plane (Figure 14.1). This rectangle includes a triangle in
the future (t > 0) and another in the past (t < 0). Hence, the solution at any
given point is determined by the values at some other epoch in the triangular
region, defined by the two characteristics which intersect at that point. Thus,
the solution at any given point (x₀, t₀) is affected only by the values in the
region |x − x₀| < c|t − t₀|, which leads to the principle of causality. If x₁ → −∞
and x₂ → +∞, i.e., when the initial conditions are specified over an infinite
line, then the rectangle covers the entire (x, t)-plane and the solution can be
determined everywhere for any time.
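The d'Alembert formula (14.7) is easy to evaluate directly; the following sketch does so for illustrative initial data a(x) and b(x) (both chosen arbitrarily here), using a library quadrature routine for the integral of b.

```python
import numpy as np
from scipy.integrate import quad

c = 1.0                               # wave speed (assumed constant)
a = lambda x: np.exp(-x**2)           # illustrative initial displacement u(x, 0)
b = lambda x: 0.0*x                   # illustrative initial velocity du/dt(x, 0)

def u(x, t):
    """Evaluate (14.7); meaningful only while x - c*t and x + c*t stay inside
    the interval on which a(x) and b(x) are specified."""
    integral, _ = quad(b, x - c*t, x + c*t)
    return 0.5*(a(x + c*t) + a(x - c*t) + integral/c)

print(u(0.3, 0.5))   # half of the initial pulse travelling each way
```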
If the initial conditions are specified over a finite segment of the x-axis,
we need additional information to determine the solution everywhere. This
information may come from additional boundary conditions of the form

(14.8)

Instead of this, we can have a Neumann boundary condition at x = x₁ and/or
x = x₂. Thus, in addition to the initial conditions, we also need boundary
conditions at both boundaries. This requirement can be understood from the
fact that the differential equation is second order in both spatial and temporal
derivatives. We therefore expect two boundary or initial conditions in space as
well as time.
A typical example of parabolic equations is the diffusion equation

$$\frac{\partial u}{\partial t} = \sigma\,\frac{\partial^2 u}{\partial x^2}, \qquad (14.9)$$

where σ may be a constant. The characteristics of this equation are the lines
t = const. It is easy to see that if we specify Cauchy conditions along a char-
acteristic, then the problem is overdetermined. Suppose we specify the initial
condition u(x, 0) = a(x), then using the differential equation, we get

$$\left.\frac{\partial u}{\partial t}\right|_{t=0} = \sigma\,\frac{d^2 a(x)}{dx^2}. \qquad (14.10)$$

Thus, clearly we do not have the freedom of specifying the derivative at t = 0.
This restriction can be easily understood from the fact that the equation is only
first order in the time derivative and hence we need only one initial condition
of Dirichlet type.
Let us consider the initial condition

$$u(x,0) = \begin{cases} 1, & \text{for } x > 0; \\ 0, & \text{for } x < 0. \end{cases} \qquad (14.11)$$

It can be shown that the solution of the diffusion equation with this initial
condition is
$$u(x,t) = \frac{1}{2\sqrt{\pi\sigma}}\int_{-\infty}^{x/\sqrt{t}} e^{-y^2/4\sigma}\,dy, \qquad t > 0. \qquad (14.12)$$

It may be noted that there does not exist any solution of the diffusion equation
for t < 0, which would satisfy the boundary conditions (14.11). This behaviour
can be physically interpreted by noting that the diffusion equation describes the
"disorganisation" of physical systems. It is clear that while we can describe the
evolution of a given system in time, we cannot have a situation, where a system
that was disorganised in the past becomes organised at t = O. Effectively, it
means that we can solve a parabolic equation only for t > 0 or future time
but not in the reverse direction. This happens because the solution for t < 0 is
always unstable. As opposed to this, hyperbolic equations can be solved for both
t > 0 and t < 0. Once again, if the initial conditions are specified over a finite
interval of x, then additional boundary conditions are required to determine
the solution in the required region of (x, t)-plane.
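The integral in (14.12) is just a scaled error function, so the solution for the step initial condition can be evaluated with a standard library routine; the following sketch uses the equivalent closed form u(x,t) = ½[1 + erf(x/(2√(σt)))].

```python
import numpy as np
from scipy.special import erf

def u_step(x, t, sigma=1.0):
    """Equivalent closed form of (14.12) for the step initial condition (14.11)."""
    return 0.5*(1.0 + erf(x/(2.0*np.sqrt(sigma*t))))

print(u_step(np.array([-1.0, 0.0, 1.0]), 0.25))   # smooth profile for t > 0
```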
The prototype of elliptic equations is Laplace's equation

$$\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = 0. \qquad (14.13)$$

This equation is also second order in derivatives with respect to both x and y.
However, in this case, the characteristics are not real, and consequently, initial
conditions of Cauchy type do not lead to a well posed problem. For elliptic
equations the solution can be uniquely defined by specifying a single bound-
ary condition on a closed curve. The boundary condition can be of Dirichlet
or Neumann type. This equation normally arises when an equilibrium situation is
considered. For example, if we consider a parabolic equation in two space vari-
ables and set the time derivatives to zero, we end up with an elliptic equation.
We have given a few simple examples to demonstrate how different types
of partial differential equations are associated with different types of boundary
conditions. These results can be summarised as follows:

1. Hyperbolic equations are associated with Cauchy conditions on an open hypersurface.
2. Parabolic equations are associated with Dirichlet or Neumann conditions
on an open hypersurface. In this case, a stable solution exists on one side
of the hypersurface only.
3. Elliptic equations are associated with Dirichlet or Neumann conditions on
a closed hypersurface.
Another major difference in the nature of these equations is that elliptic and
parabolic equations have a very strong smoothing effect on the solution, while
the hyperbolic equations do not. If the boundary conditions or the forcing
term in the differential equation have a sharp corner or a singularity or a
discontinuity, the solution of an elliptic or parabolic equation is smooth, even
at a small distance from the singularity. On the other hand, in the solution of
a hyperbolic equation the singularity propagates along the characteristics.

14.2 Diffusion Equation in Two Dimensions


In this section, we consider finite difference methods for numerical solution of
the diffusion equation in two dimensions. For example, if x denotes a coordinate
along the length of a thin insulated rod in which heat can flow, and if u = u(x, t)
denotes the temperature at the position x and time t, then the temperature
satisfies the diffusion equation

$$a\,\frac{\partial u}{\partial t} = \frac{\partial}{\partial x}\left(K\,\frac{\partial u}{\partial x}\right), \qquad (14.14)$$
where a = a(x) or a(x, t) is the heat capacity of the material and K = K(x) or
K(x,t) is the thermal conductivity. It is assumed that K/a > 0. This equation is
linear, although it can be nonlinear if the material properties denoted by a and
K also depend on the temperature u, which is quite often the case in physical
systems. For simplicity, we assume K and a are constants and σ = K/a, in
which case (14.14) reduces to the diffusion equation (14.9). It is possible to
scale the variables t and x such that σ = 1. However, we retain the factor of σ
in our equation, since in general, if σ is not constant, it cannot be eliminated
by simple scaling.
This differential equation is to be supplemented by the initial conditions
specifying the initial temperature distribution over the rod, and boundary con-
ditions at say x = 0 and x = L, which denote the ends of the rod. The boundary
conditions depend on the situation at the end points. For example, if the end
of the rod is in contact with a large heat reservoir, then we may assume the
temperature to be constant, i.e., u(O, t) = Uo. On the other hand, if the end is
insulated, then we can use au/ax = O.
To solve the equation numerically we define a grid of mesh points, denoted
by x_j = jh and t_n = nk, where j = 0, 1, 2, ..., M and n = 0, 1, 2, ..., with
h = Δx = L/M and k = Δt as the step sizes in the x and t directions. Here it is
assumed that the solution is required in the interval 0 ≤ x ≤ L and t > 0. The
initial line is represented by n = 0, while the boundaries are specified by j = 0
and j = M. Since we are considering an initial value problem, it is possible
to calculate the solution step by step. Thus, starting with the initial values at
n = 0, we can compute the solution for n = 1. Once the solution is known for
n = 1, the same procedure can be followed to obtain the solution at n = 2 and
the process can be repeated until the required range in t is covered.
Let us denote the computed solution at (x_j, t_n) by u_j^n; then it is straight-
forward to obtain a finite difference approximation to the diffusion equation:

$$\frac{u_j^{n+1} - u_j^n}{k} = \sigma\,\frac{u_{j+1}^n - 2u_j^n + u_{j-1}^n}{h^2}, \qquad (j = 1, \ldots, M-1). \qquad (14.15)$$

This difference equation can be easily solved for u_j^{n+1}, giving

$$u_j^{n+1} = r\,u_{j-1}^n + (1 - 2r)\,u_j^n + r\,u_{j+1}^n, \qquad (j = 1, \ldots, M-1), \qquad (14.16)$$

where

$$r = \frac{\sigma k}{h^2} = \frac{\sigma\,\Delta t}{(\Delta x)^2}. \qquad (14.17)$$

Hence, this difference scheme can be classified as explicit. In general, any dif-
ference equation involving only one grid point at the advanced time level tn+l
is termed explicit. On the other hand, if the difference equation involves more
than one grid point at the advanced time level, then it is called implicit. It may
be noted that the solution at the boundary points j = 0 and j = M cannot
be obtained from (14.16) and we need to use the boundary conditions. If the
boundary conditions are of Dirichlet type, then u_0^n and u_M^n are explicitly given
and there is no difficulty in calculating these values. If the boundary conditions
are of Neumann type, then it may be convenient to introduce fictitious mesh
points beyond the required interval, say j = −1 and j = M+1. With this addi-
tion (14.16) can be used to calculate u_0^{n+1} and u_M^{n+1}, but additional equations
are required to calculate u_{-1}^{n+1} and u_{M+1}^{n+1}. These equations are provided by the
boundary conditions. For example, if the boundary condition is ∂u/∂x = f(t) at
x = 0, then we get the finite difference approximation u_{-1}^{n+1} = u_1^{n+1} − 2h f(t_{n+1}).
Thus, in all cases, we can find sufficient equations to continue the solution.
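A minimal sketch of the explicit scheme (14.16) is given below, with Dirichlet values supplied at the two ends; the grid size, time step and boundary values are illustrative, and a Neumann condition could be handled by eliminating a fictitious point as described above.

```python
import numpy as np

def explicit_step(u, r, left, right):
    """One step of the explicit scheme (14.16):
    u_j^{n+1} = r u_{j-1}^n + (1 - 2r) u_j^n + r u_{j+1}^n,
    with Dirichlet values supplied at j = 0 and j = M."""
    unew = u.copy()
    unew[1:-1] = r*u[:-2] + (1.0 - 2.0*r)*u[1:-1] + r*u[2:]
    unew[0], unew[-1] = left, right     # Dirichlet boundary values
    # For a Neumann condition, a fictitious point u_{-1} = u_1 - 2 h f(t)
    # could be eliminated here instead.
    return unew

# Illustrative use: sigma = 1, h = 0.05, r = 0.45 (inside the stability limit r <= 1/2)
M, sigma, r = 20, 1.0, 0.45
h = 1.0/M
k = r*h*h/sigma
x = np.linspace(0.0, 1.0, M + 1)
u = np.sin(np.pi*x)                     # initial condition
for n in range(100):
    u = explicit_step(u, r, 0.0, 0.0)
```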
If u( x, t) is the exact solution of the diffusion equation, then the truncation
error in (14.15) is defined by

$$e(u) = \frac{u(x_j, t_{n+1}) - u(x_j, t_n)}{k} - \sigma\,\frac{u(x_{j+1}, t_n) - 2u(x_j, t_n) + u(x_{j-1}, t_n)}{h^2} = O(k) + O(h^2). \qquad (14.18)$$
Thus, the explicit difference scheme is first order in time, but second order in
space coordinate. The actual truncation error incurred in advancing the solu-
tion by one time step (assuming that the values at the previous time step are
exact) is obtained by multiplying this expression by k. Sometimes, this result is


referred to as the local truncation error. However, we prefer to treat (14.18) as
the definition of truncation error. As in the case of ordinary differential equa-
tions, it is difficult to estimate the actual truncation error, which is accumulated
over all the previous time steps. But if the explicit difference scheme is stable,
then the accumulated truncation error may be expected to be O(k + h²).
The explicit method outlined in the previous paragraph can be easily used
to compute the approximate solution over any required interval. However, as
with ordinary differential equations, we need to analyse the stability of such
a difference method. There are several techniques for analysing the stability
of difference methods, none of which may be applicable to realistic nonlinear
equations that are encountered in practice. One of the simplest technique is
based on the concept of a maximum principle. It can be easily seen that the
local truncation error in (14.16) is O(k² + kh²). Hence, we can choose a constant
A, such that the local truncation error is bounded by A(k² + kh²). Now if
z_j^n = u_j^n − u(x_j, t_n) is the error in the computed solution, then it can be shown
that

$$z_j^{n+1} = r\,z_{j-1}^n + (1 - 2r)\,z_j^n + r\,z_{j+1}^n + O(k^2 + kh^2), \qquad (j = 1, 2, \ldots, M-1). \qquad (14.19)$$
If the boundary conditions are of Dirichlet type, then the error at the boundaries
vanishes and we can ignore the boundary values. If 0 < r ≤ 1/2, then all
coefficients in (14.19) are positive and add up to one. In that case

$$|z_j^{n+1}| \le r\,|z_{j-1}^n| + (1 - 2r)\,|z_j^n| + r\,|z_{j+1}^n| + A(k^2 + kh^2) \le \|z^n\| + A(k^2 + kh^2), \qquad (14.20)$$

where

$$\|z^n\| = \max_{j=0,\ldots,M} |z_j^n|. \qquad (14.21)$$

Consequently

$$\|z^{n+1}\| \le \|z^n\| + A(k^2 + kh^2), \qquad (n = 0, 1, \ldots). \qquad (14.22)$$

Since ‖z⁰‖ = 0 we can easily show that

$$\|z^n\| \le n\,A(k^2 + kh^2) = A\,t_n\,(k + h^2). \qquad (14.23)$$

Thus, for a given time t, the error tends to zero as k and h tend to zero, provided
r ≤ 1/2. It should be noted that for r > 1/2 the second coefficient in (14.19)
is negative and the sum of absolute values of the coefficients is larger than
one. If this analysis is applied to that case, then the error bound will increase
exponentially with n and does not tend to zero as h, k → 0. In fact, the explicit
difference method is unstable in that case.
From the above analysis it appears that stability of the difference method
depends on how h and k tend to zero. In particular, if r ≤ 1/2 as h, k → 0,
then the difference method is stable, otherwise it may be unstable. The cause
of instability here is different from that for ordinary differential equations. For
an ordinary differential equation we know that a single-step method of the type
considered above is always relatively stable, while higher order methods may be
unstable because of the presence of extraneous solutions. For partial differen-
tial equations, the instability can be attributed to our inability to represent the
initial conditions. Since the number of initial conditions is actually infinite, we
need to approximate them by initial values at a set of discrete points. For exam-
ple, consider the initial conditions u(x, 0) = c. For finite difference solution, we
specify the initial conditions only at x = Xj for j = 0, 1, ... , M. This represen-
tation cannot distinguish between u(x, 0) = c and u(x, 0) = c + bsin(Jrx/h) for
any value of the constant b. Now if the difference equation actually amplifies a
rapidly oscillating term of this type, then the true solution can be swamped and
we end up with a computed solution, that is dominated by a spurious solution
due to such oscillating initial conditions. This will be clear from the Fourier
technique of analysing the stability of difference scheme.
In the Fourier method, we study the stability of difference methods to
oscillatory perturbations in the initial conditions. Thus, we can consider an
initial perturbation of the form e^{imx}, where i = √−1 (it is assumed that we
consider only the real part). Neglecting boundary conditions it can be shown
that the solution of the difference equations for the perturbations can be written
in the form u_j^n = ξ^n e^{imhj}, where ξ is referred to as the amplification factor.
Substituting this solution in the difference equation we get

$$\xi = 2r\cos(mh) + (1 - 2r) = 1 - 4r\sin^2\!\bigl(\tfrac12 mh\bigr). \qquad (14.24)$$

This solution will be bounded, provided |ξ| ≤ 1, otherwise it will grow expo-
nentially. Thus, for stability we must have |ξ| ≤ 1 for all values of m. It can
be easily seen that ξ ≤ 1, since r > 0, and we only need to ensure that ξ ≥ −1
for all values of m. This condition can be clearly ensured if 0 ≤ r ≤ 1/2. Thus,
once again we end up with the same stability condition. Further, it can be seen
that for r > 1/2 the dominating perturbations are those for which mh = π,
i.e., a perturbation with spatial oscillations on a scale of twice the mesh spacing.
Hence, the resulting solution will display oscillations with successive values of
opposite sign.
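The following few lines evaluate the amplification factor (14.24) over the full range of mh and confirm that |ξ| exceeds unity only when r > 1/2; the values of r scanned are arbitrary.

```python
import numpy as np

def xi_explicit(r, mh):
    """Amplification factor (14.24) of the explicit scheme."""
    return 1.0 - 4.0*r*np.sin(0.5*mh)**2

mh = np.linspace(0.0, np.pi, 201)        # mh = pi is the worst case (2-cell waves)
for r in (0.45, 0.5, 0.55):
    print(r, np.abs(xi_explicit(r, mh)).max())   # exceeds 1 only for r > 1/2
```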
It may be noted that the solution of the diffusion equation with the same
initial condition is e^{−m²σt + imx}. The amplification factor per time step is thus

$$e^{-m^2\sigma k} = 1 - m^2\sigma k + \tfrac12 m^4\sigma^2 k^2 - \cdots \qquad (14.25)$$

This factor may be compared with the amplification factor in numerical com-
putation, which is given by

$$\xi(m) = 1 - m^2\sigma k + \tfrac{1}{12}\,m^4\sigma k h^2 - \cdots \qquad (14.26)$$
Thus, it can be seen that the first two terms in the expansion agree with those
for the actual amplification factor, and for sufficiently small values of m²σk, the
computed solution will be close to the exact solution. Further, if k = h²/(6σ),
then it can be seen that the third term in the two expansions also agrees. Nev-
ertheless, for sufficiently large m the two factors could be completely different.
However, for stability we do not need to approximate the exact solution, but
only need to ensure that the perturbation remains bounded. Thus, for small
values of m which are relevant for the physical problems, we should ensure
accuracy, while for large m it is sufficient if the numerical method does not am-
plify the perturbations. The situation here is similar to that for stiff differential
equations, except for the fact that in this case, there is no bound on m.
Yet another technique for analysing the stability is the matrix method. In
this method we write the finite difference equation in the matrix form u^{n+1} =
A u^n, where u^n is a vector of M+1 elements u_j^n, (j = 0, 1, ..., M). Now
if all eigenvalues of A are within the unit circle, then the solution remains
bounded, and the difference method is stable. Thus, for stability we require all
eigenvalues of A to be less than unity in magnitude. It can be shown that, for
the explicit method with simple Dirichlet boundary conditions u_0^n = u_M^n = 0,
the eigenvalues of the matrix A are

$$\lambda_k = 1 - 4r\sin^2\!\left(\frac{k\pi}{2M}\right), \qquad (k = 1, \ldots, M-1). \qquad (14.27)$$

Once again, we get the same stability criterion for the explicit method. In this
way, all these techniques lead to the same stability condition.
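As a check of the matrix method, the sketch below forms the explicit-method matrix A for simple Dirichlet conditions and compares its numerically computed eigenvalues with the closed form quoted in (14.27), which is the standard eigenvalue formula for a constant tridiagonal matrix; the values of M and r are illustrative.

```python
import numpy as np

M, r = 10, 0.45
# Explicit-method matrix for the interior points j = 1, ..., M-1 with u_0 = u_M = 0.
A = np.diag((1.0 - 2.0*r)*np.ones(M - 1)) \
  + np.diag(r*np.ones(M - 2), 1) + np.diag(r*np.ones(M - 2), -1)

eig_numeric = np.sort(np.linalg.eigvalsh(A))
k = np.arange(1, M)
eig_formula = np.sort(1.0 - 4.0*r*np.sin(0.5*np.pi*k/M)**2)   # closed form (14.27)
print(np.max(np.abs(eig_numeric - eig_formula)))              # agreement to rounding
print(np.max(np.abs(eig_numeric)) <= 1.0)                     # stability check for r <= 1/2
```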
The stability condition can be interpreted physically to mean that the
maximum allowed time step is, up to a numerical factor, the diffusion time
scale τ = (Δx)²/σ across a cell of size Δx. Since in practical problems we are
generally interested in diffusion over spatial scales λ ≫ Δx, the time scale is
also correspondingly larger. In general, we need of the order of (λ/Δx)² time
steps before significant changes in the system take place. This number could
be quite large if a very fine mesh is used.
The explicit method outlined above is convenient to use, but the time
step is restricted by the stability requirement. Further, the permissible time
step is proportional to (Δx)². Thus, if we reduce the mesh spacing in spatial
coordinates by a factor of two, then the time step has to be decreased by a
factor of four. As a result, the effort required to integrate the equation over a
given length of time is roughly proportional to M³, where M is the number of
mesh points in the spatial coordinate.
In order to improve stability, we need to ensure that the difference method
damps out perturbations on small spatial scales. As with ordinary differential
equations, we can expect the implicit methods to possess better stability proper-
ties. An implicit method is one in which each difference equation involves more
than one unknown at time step n+1, so that it cannot be directly solved for
any one of the unknowns. Since our equation is linear, we get a system of linear
equations which have to be solved simultaneously.
The simplest implicit method is obtained if the space derivative is evalu-
ated at time t_{n+1} instead of t_n, which gives

$$\frac{u_j^{n+1} - u_j^n}{k} = \sigma\,\frac{u_{j+1}^{n+1} - 2u_j^{n+1} + u_{j-1}^{n+1}}{h^2}, \qquad (j = 1, \ldots, M-1). \qquad (14.28)$$

A more general difference equation can be obtained by taking a weighted average
of (14.15) and (14.28), to get

$$\frac{u_j^{n+1} - u_j^n}{k} = \frac{\sigma}{h^2}\Bigl(\theta\bigl(u_{j+1}^{n+1} - 2u_j^{n+1} + u_{j-1}^{n+1}\bigr) + (1-\theta)\bigl(u_{j+1}^n - 2u_j^n + u_{j-1}^n\bigr)\Bigr), \qquad (14.29)$$

where 0 ≤ θ ≤ 1. For θ = 0, this formula reduces to the explicit formula (14.15)
while for θ = 1 it yields the simple implicit formula (14.28). For θ = 1/2, we
get the Crank–Nicolson method, where both the space and time derivatives are
centred at the point (x_j, t_n + k/2) and we can expect a higher order accuracy
with truncation error of O(k² + h²). For θ = 0 the space derivative is centred
at (x_j, t_n) while the time derivative is centred at (x_j, t_n + k/2). Consequently,
the truncation error is expected to be O(k + h²).
It can be seen that each equation in (14.29) involves three unknown quan-
tities and we get a system of linear equations with a tridiagonal matrix

$$-r\theta\,u_{j-1}^{n+1} + (1 + 2r\theta)\,u_j^{n+1} - r\theta\,u_{j+1}^{n+1} = r(1-\theta)\,u_{j-1}^n + \bigl(1 - 2r(1-\theta)\bigr)u_j^n + r(1-\theta)\,u_{j+1}^n. \qquad (14.30)$$

This tridiagonal system can be easily solved using Gaussian elimination. The
matrix is diagonally dominant and pivoting may not be required unless rθ ≫ 1.
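One time step of (14.30) can be coded compactly with a banded solver; the sketch below assumes homogeneous Dirichlet conditions u_0 = u_M = 0 and is meant only to show the structure of the tridiagonal system, not the program supplied with the book.

```python
import numpy as np
from scipy.linalg import solve_banded

def theta_step(u, r, theta):
    """One time step of (14.30) with u_0 = u_M = 0 (Dirichlet).
    Solves the tridiagonal system for the interior values."""
    rhs = (r*(1 - theta)*u[:-2]
           + (1.0 - 2.0*r*(1 - theta))*u[1:-1]
           + r*(1 - theta)*u[2:])
    m = rhs.size
    ab = np.zeros((3, m))                 # banded storage: super-, main, sub-diagonal
    ab[0, 1:] = -r*theta                  # superdiagonal
    ab[1, :]  = 1.0 + 2.0*r*theta         # main diagonal
    ab[2, :-1] = -r*theta                 # subdiagonal
    unew = np.zeros_like(u)
    unew[1:-1] = solve_banded((1, 1), ab, rhs)
    return unew

# Illustrative use with r = 2 and theta = 1/2 (Crank-Nicolson):
x = np.linspace(0.0, 1.0, 11)
u = np.sin(np.pi*x)
u = theta_step(u, r=2.0, theta=0.5)
```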
To study the stability of this implicit method, we can use the Fourier
technique. Substituting A ξ^n e^{imjh} for u_j^n in the difference equations, we find
that the amplification factor is given by

$$\xi = \xi(m) = \frac{1 - 2r(1-\theta)\bigl(1 - \cos(mh)\bigr)}{1 + 2r\theta\bigl(1 - \cos(mh)\bigr)}. \qquad (14.31)$$

Once again, it can be shown that ξ(m) < 1. Hence, for stability we require
ξ(m) > −1 for all values of m. It can be shown that ξ is a monotonically
decreasing function of (1 − cos mh), with a limiting value of 1 − 4r/(1 + 4rθ)
when 1 − cos mh = 2. Thus for stability, we require

$$1 - \frac{4r}{1 + 4r\theta} > -1. \qquad (14.32)$$

For θ ≥ 1/2 this condition is satisfied for all positive values of r and the
difference method is always stable. For other values of θ, the value of r has to
be restricted in order to achieve stability. We can easily obtain the stability
criterion

$$2r = \frac{2\sigma\,\Delta t}{(\Delta x)^2} \le \frac{1}{1 - 2\theta} \quad \text{if } 0 \le \theta < 1/2; \qquad \text{no restriction if } 1/2 \le \theta \le 1. \qquad (14.33)$$
It may be noted that for θ = 1/2 the amplification factor ξ(m) → −1 when
cos mh = −1, or mh = (2n+1)π. Thus, perturbations over a scale of twice the
mesh spacing are not damped effectively and may be maintained at the initial
level, while for θ > 1/2 perturbations at all length scales are damped. For
θ = 1 the damping factor in the extreme case tends to zero as Δt → ∞. Thus,
this difference method damps out all small scale perturbations very rapidly. In
fact, for r → ∞ this difference method essentially yields the finite difference
equations for the equilibrium equation ∂²u/∂x² = 0.
The most common choices for θ are 0, 1/2 and 1. The first choice has the
advantage of simplicity, since the equation can be solved explicitly for u_j^{n+1},
while the other two have the advantage of stability for arbitrary time steps. In
general, θ = 1/2 gives better accuracy as compared to θ = 1. But θ = 1 has
the advantage of damping out small scale perturbations more effectively.
We can define the concepts of consistency, stability and convergence for
partial differential equations also. Thus, a difference method is said to be con-
sistent if the truncation error tends to zero as the step sizes h, k → 0. A method
is said to be stable if the perturbations due to errors introduced at any stage
remain bounded at all subsequent steps. Similarly, a method is said to be con-
vergent if for a given time t the computed solution tends to the true solution
as the step sizes h, k → 0. In general, consistency and stability together ensure
convergence.
Now let us consider the truncation error in the difference approximation
(14.29):
$$e(u) = \frac{u(x_j, t_{n+1}) - u(x_j, t_n)}{k} - \frac{\sigma}{h^2}\Bigl(\theta\bigl\{u(x_{j+1}, t_{n+1}) - 2u(x_j, t_{n+1}) + u(x_{j-1}, t_{n+1})\bigr\} + (1-\theta)\bigl\{u(x_{j+1}, t_n) - 2u(x_j, t_n) + u(x_{j-1}, t_n)\bigr\}\Bigr), \qquad (14.34)$$
where u = u(x,t) is the exact solution of the equation. As noted earlier, some-
times this expression is multiplied by k and the result is referred to as the local
truncation error. In that case, the definition of consistency given above may
need to be modified. To estimate the truncation error, we can use Taylor series
expansion about the point (x_j, t_n), to get

$$e(u) = \sigma\,\frac{\partial^4 u}{\partial x^4}\left(\sigma k\bigl(\tfrac12 - \theta\bigr) - \tfrac{1}{12}h^2\right) + O(k^2) + O(h^4), \qquad (14.35)$$

where we have used the fact that ∂²u/∂t² = σ²∂⁴u/∂x⁴. Thus, in general, the
truncation error is O(k + h²). Since this error tends to zero as h, k → 0, the
resulting difference method is consistent. It may be noted that if θ = 1/2, the
first term drops out and the truncation error is O(h² + k²). On the other hand,
if

$$\theta = \frac12 - \frac{h^2}{12\sigma k} = \frac12 - \frac{1}{12r}, \qquad (14.36)$$

then the entire term in the parenthesis vanishes and the error is O(k² + h⁴),
giving a higher order accuracy. Further, it can be verified that for this value
of θ the difference method is stable. In fact, for 0 ≤ θ < 1/2 the time step in
this case is one-third of the critical value required for stability. It should be
noted that if the time step is very small, then the value of θ given by the above
expression may come out to be negative. Such small values of the time step should
in general be avoided, since that will lead to a loss of efficiency. In general,
if σ = σ(x,t) is variable, then the corresponding value of θ to achieve higher
order accuracy will also be variable. It can be shown {4} that if in addition
r = 1/√20, then there is a further reduction of the truncation error to O(h⁶).
It should be noted that the truncation error tends to zero as h, k → 0, even
if the stability condition is violated, because the truncation error is only a
measure of how closely the true solution of the differential equation satisfies
the difference equation. It does not say anything about how close the computed
solution is to the exact solution. Thus in an unstable situation, although the
true solution of the differential equation may satisfy the difference equation
very accurately, the actual solution of the difference equation may not be close
to the true solution of the differential equation. For the solution of difference
equation to converge to that of the differential equation, we need stability in
addition to consistency.
EXAMPLE 14.1: Solve the diffusion equation

$$\frac{\partial u}{\partial t} = \frac{\partial^2 u}{\partial x^2}, \qquad 0 \le x \le 1, \quad t > 0, \qquad (14.37)$$

subject to the initial and boundary conditions

$$u(x,0) = \sin\pi x, \qquad u(0,t) = 0, \qquad u(1,t) = 0. \qquad (14.38)$$

We use a uniform mesh x_j = jh and t_n = nk to discretise the equation. Since the
solution vanishes at the boundaries, these points can be ignored. It may be noted that the
initial condition is symmetric about the midpoint and this symmetry will be maintained at all
times. We can take advantage of this symmetry and compute the solution over only half the
range, i.e., 0 ≤ x ≤ 1/2. In that case, the boundary condition at x = 0 remains the same, but
at x = 1/2 we can invoke symmetry and demand ∂u/∂x = 0. This condition can be written in
the difference form as u_{M+1}^n = u_{M-1}^n, which enables us to write (14.30) for j = 1, 2, ..., M,
giving M equations in the M unknowns u_j^{n+1}, (j = 1, ..., M). This system of equations with
a tridiagonal matrix can be easily solved for the solution at the next time step. In fact, we
can notice that the coefficient matrix in the difference equations is independent of n and
elimination needs to be performed only once. The computed solution can be compared with
the exact solution of the differential equation

$$u(x,t) = e^{-\pi^2 t}\sin\pi x. \qquad (14.39)$$


To begin with, we use the explicit method (θ = 0) with h = 0.05, which requires only 10
points at x_j = jh, (j = 1, ..., 10). To take care of the boundary condition at x = 0.5,
we use u(0.55, t) = u(0.45, t) to eliminate u(0.55, t) from the last difference equation. We
perform two sets of calculations using r = 0.45 and r = 0.55, which gives k = Δt = 0.001125
and 0.001375, respectively. The results after a few time steps are shown in Figure 14.2,
where continuous lines show the exact solution, while the circles represent the computed
values. Although the computations were performed over the interval [0, 0.5], the results are
extended to the entire range using symmetry. It can be easily seen that in the first set with
r = 0.45, there is a reasonable agreement between the computed and the exact solution.
For the second set with r = 0.55 the agreement is initially good, but after some time the
computed result shows some oscillations. For the second curve at t = 0.111375 (i.e., after 99
time steps) the results already show mild oscillations about the true value. At t = 0.136125,

Figure 14.2: Solution of the diffusion equation using explicit difference method. The contin-
uous curves show the exact solution while the circles denote the computed values at different
times. The figure on the left shows the result obtained using r = 0.45 at t = 0.037125,
0.07425,0.111375, 0.1485 and 0.185625. The highest curve corresponds to the smallest value
of t. The figure on the right shows the result obtained using r = 0.55 at t = 0.07425,0.111375
and 0.136125.

the oscillations have grown substantially in amplitude and it may be difficult to identify
the resulting computed values with the true solution. This example clearly demonstrates the
instability of the explicit difference method for r > 1/2. It is clear that the perturbations with a
wavelength of twice the mesh spacing swamp the solution completely. On the other hand, for
r = 0.45, the solution is stable at all times. If the calculations are performed using θ = 1/2
or 1, then the computed results are stable for r = 0.55 also.
In order to compare the accuracy of computation using different values of r and θ, we
perform the calculations using h = 0.1 involving only five points. The results at some fixed
times with different values of θ and r are shown in Table 14.1, which also gives the exact
solution. It may be noted that, since h = 0.1, the time step k = 0.01r. With the coarse mesh
spacing used, it is not possible to achieve very high accuracy. However, it can be seen that for
the same value of r, the accuracy is significantly better for θ = 1/2. Further, it can be seen
that increasing the time step to k = 0.01 does not affect the accuracy significantly, provided
the difference equations are stable. Hence, it is possible to use a much larger time step for the
same accuracy using the implicit method. In fact, for θ = 1/2 the accuracy improves when
the time step is increased to k = 0.02.
As explained earlier, it is possible to improve the accuracy of the difference method
by choosing θ = 0.5 − 1/(12r). For r = 1 and 2 it gives θ = 0.41666... and 0.458333...,
respectively. The results obtained using this value of θ are also shown in the table. It can
be seen that the results improve significantly over those for θ = 1/2. As the time step increases,
the optimal value of θ approaches 1/2, which probably explains the fact that for θ = 1/2
the accuracy improves as the time step is increased. It is possible to improve the accuracy still
further if we use r = 1/√20 ≈ 0.2236 and θ = 0.5 − 1/(12r) ≈ 0.1273. In this case, the
time step is significantly smaller, but as can be seen from Table 14.1 the results are accurate
to about five significant figures, which is much better than the results obtained with other
values of θ using the same time step.
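The following sketch repeats the computation of this example on the full interval with h = 0.1, r = 2 and θ = 1/2 (so k = 0.02), and compares the result at t = 2 with the exact solution (14.39); it is an independent illustration, not the program used to produce Table 14.1.

```python
import numpy as np

h, r, theta, sigma = 0.1, 2.0, 0.5, 1.0      # k = r h^2 / sigma = 0.02 (Crank-Nicolson)
k = r*h*h/sigma
M = int(round(1.0/h))
x = np.linspace(0.0, 1.0, M + 1)
u = np.sin(np.pi*x)                          # initial condition (14.38)

# Matrices of the theta-scheme (14.30) for the interior points, with u_0 = u_M = 0.
I = np.eye(M - 1)
T = (np.diag(-2.0*np.ones(M - 1)) + np.diag(np.ones(M - 2), 1)
     + np.diag(np.ones(M - 2), -1))
A = I - r*theta*T                            # left-hand side
B = I + r*(1 - theta)*T                      # right-hand side

t = 0.0
while t < 2.0 - 1e-12:
    u[1:-1] = np.linalg.solve(A, B @ u[1:-1])
    t += k

exact = np.exp(-np.pi**2*t)*np.sin(np.pi*x)
print(t, np.max(np.abs(u - exact)))          # error of the same order as Table 14.1, theta = 1/2
```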

So far we have considered difference methods for which the order of dif-
ference approximations is the same as that of the differential equation. Thus,
the time derivative is approximated by a two-point formula, while the space
derivative by a three-point formula. This approximation ensures that no spu-
rious solutions are introduced during the process of discretisation. In order to

Table 14.1: Solving diffusion equation using finite difference methods

θ        r        u(0.1,t)          u(0.2,t)          u(0.3,t)          u(0.4,t)          u(0.5,t)

t = 1.98
0.0000   0.4500   7.61437 × 10⁻¹⁰   1.44834 × 10⁻⁹    1.99347 × 10⁻⁹    2.34346 × 10⁻⁹    2.46406 × 10⁻⁹
1.0000   0.4500   1.78964 × 10⁻⁹    3.40410 × 10⁻⁹    4.68534 × 10⁻⁹    5.50795 × 10⁻⁹    5.79140 × 10⁻⁹
1.0000   0.5500   1.95633 × 10⁻⁹    3.72116 × 10⁻⁹    5.12173 × 10⁻⁹    6.02096 × 10⁻⁹    6.33081 × 10⁻⁹
0.5000   0.4500   1.17836 × 10⁻⁹    2.24137 × 10⁻⁹    3.08498 × 10⁻⁹    3.62662 × 10⁻⁹    3.81325 × 10⁻⁹
0.5000   0.5500   1.17661 × 10⁻⁹    2.23804 × 10⁻⁹    3.08039 × 10⁻⁹    3.62122 × 10⁻⁹    3.80758 × 10⁻⁹
Exact solution    1.00711 × 10⁻⁹    1.91564 × 10⁻⁹    2.63665 × 10⁻⁹    3.09957 × 10⁻⁹    3.25908 × 10⁻⁹

t = 2.0
1.0000   1.0000   2.39032 × 10⁻⁹    4.54665 × 10⁻⁹    6.25793 × 10⁻⁹    7.35664 × 10⁻⁹    7.73523 × 10⁻⁹
1.0000   2.0000   5.30980 × 10⁻⁹    1.00998 × 10⁻⁸    1.39012 × 10⁻⁸    1.63419 × 10⁻⁸    1.71829 × 10⁻⁸
0.5000   1.0000   9.56815 × 10⁻¹⁰   1.81997 × 10⁻⁹    2.50497 × 10⁻⁹    2.94477 × 10⁻⁹    3.09632 × 10⁻⁹
0.5000   2.0000   9.12672 × 10⁻¹⁰   1.73600 × 10⁻⁹    2.38941 × 10⁻⁹    2.80891 × 10⁻⁹    2.95347 × 10⁻⁹
0.4167   1.0000   8.14208 × 10⁻¹⁰   1.54872 × 10⁻⁹    2.13162 × 10⁻⁹    2.50587 × 10⁻⁹    2.63483 × 10⁻⁹
0.4583   2.0000   7.75727 × 10⁻¹⁰   1.47552 × 10⁻⁹    2.03088 × 10⁻⁹    2.38744 × 10⁻⁹    2.51031 × 10⁻⁹
Exact solution    8.26713 × 10⁻¹⁰   1.57250 × 10⁻⁹    2.16437 × 10⁻⁹    2.54436 × 10⁻⁹    2.67530 × 10⁻⁹

t = 2.0013
0.0000   0.2236   7.72133 × 10⁻¹⁰   1.46868 × 10⁻⁹    2.02147 × 10⁻⁹    2.37638 × 10⁻⁹    2.49867 × 10⁻⁹
1.0000   0.2236   1.18557 × 10⁻⁹    2.25510 × 10⁻⁹    3.10387 × 10⁻⁹    3.64882 × 10⁻⁹    3.83660 × 10⁻⁹
0.5000   0.2236   9.59013 × 10⁻¹⁰   1.82415 × 10⁻⁹    2.51073 × 10⁻⁹    2.95154 × 10⁻⁹    3.10343 × 10⁻⁹
0.1273   0.2236   8.16374 × 10⁻¹⁰   1.55283 × 10⁻⁹    2.13729 × 10⁻⁹    2.51254 × 10⁻⁹    2.64184 × 10⁻⁹
Exact solution    8.16344 × 10⁻¹⁰   1.55278 × 10⁻⁹    2.13721 × 10⁻⁹    2.51245 × 10⁻⁹    2.64174 × 10⁻⁹

achieve higher order accuracy, we can introduce higher order approximations
to the derivative, involving more points. For example, we can use a three-level
formula for representing the time derivative. This formula gives rise to a multi-
step method and as with ordinary differential equations, we require additional
starting values. Similarly, if we use five-point formula to approximate the space
derivative, then additional boundary conditions are required. Apart from re-
quiring additional boundary conditions, these higher order formulae will have
some extra solutions which do not approximate those of the differential equa-
tion. If these solutions are not damped out sufficiently fast , then they may
swamp the true solution giving rise to instabilities of the kind considered in
Section 12.2. Consequently, it is dangerous to use higher order methods for so-
lution of partial differential equations, unless the method is known to be stable
for the equation being solved. Further, the higher order methods are effective
only if the solution is sufficiently smooth and a high accuracy is required.
Many difference methods for solving the diffusion equation along with
their stability characteristics are listed by Richtmyer and Morton (1995). One
of these methods {3} is unconditionally unstable and hence cannot be used
in practical computation. However, it serves as an excellent example of what
can happen if no attention is paid to the mathematical properties of finite dif-
ference methods. Yet another method due to Du Fort and Frankel {3}, which
incidentally is a slight modification of the earlier method, is always stable, but
the truncation error has a term O((k/h)^2). As a result, if k, h -> 0 such that
k/h = β, a constant, then the truncation error does not approach zero and the
computed solution in fact tends to the actual solution of a hyperbolic equation.
If k tends to zero faster than h, then the result converges to the actual
solution of the diffusion equation. On the other hand, if h tends to zero faster
than k, then the truncation error diverges. In practical computations with fi-
nite values of k and h the truncation error depends on the ratio k/h, which
essentially forces us to use a small time step for controlling the truncation error.
Thus, the stability of this method for arbitrary time steps cannot be exploited
effectively in practical computations.

14.3 General Parabolic Equation in Two Dimensions


In this section, we outline the extension of the methods of the previous section
to more general parabolic equations. Let us first consider the linear parabolic
equation
\frac{\partial u}{\partial t} = a(x,t)\,\frac{\partial^2 u}{\partial x^2} + b(x,t)\,\frac{\partial u}{\partial x} + c(x,t)\,u + d(x,t).   (14.40)

Difference methods considered in the previous section can be easily extended


to this equation, by adding a difference approximation for the first derivative.
Thus, we can use the difference method

\frac{u_j^{n+1} - u_j^n}{\Delta t} = \theta\,F_j^{n+1}(u) + (1-\theta)\,F_j^n(u),   (14.41)

where
F_j^n(u) = a(x_j,t_n)\,\frac{u_{j+1}^n - 2u_j^n + u_{j-1}^n}{h^2} + b(x_j,t_n)\,\frac{u_{j+1}^n - u_{j-1}^n}{2h} + c(x_j,t_n)\,u_j^n + d(x_j,t_n).   (14.42)
Once again if θ = 0, then it is an explicit method, while for other values of θ
the resulting system of linear equations needs to be solved at every time step.
In this case also the matrix of equations is tridiagonal.
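To make the structure of this scheme concrete, the following Python fragment sketches a single step of (14.41)-(14.42) with Dirichlet boundary conditions. It is only an illustrative sketch, not one of the routines of Appendix B; the function name, the vectorised coefficient functions a, b, c, d and the use of a library banded solver are choices made here for brevity.

import numpy as np
from scipy.linalg import solve_banded

def theta_step(u, x, t, dt, theta, a, b, c, d, uL_new, uR_new):
    # One step of the theta-method (14.41)-(14.42) for
    #   u_t = a(x,t) u_xx + b(x,t) u_x + c(x,t) u + d(x,t)
    # on a uniform mesh x; uL_new, uR_new are the Dirichlet values at t+dt.
    h = x[1] - x[0]
    M = len(u) - 1
    xi = x[1:M]                                   # interior mesh points

    def F(v, tt):                                 # F_j of (14.42)
        return (a(xi, tt)*(v[2:] - 2*v[1:M] + v[:M-1])/h**2
                + b(xi, tt)*(v[2:] - v[:M-1])/(2*h)
                + c(xi, tt)*v[1:M] + d(xi, tt))

    rhs = u[1:M] + dt*(1-theta)*F(u, t) + dt*theta*d(xi, t+dt)
    an, bn, cn = a(xi, t+dt), b(xi, t+dt), c(xi, t+dt)
    lower = -dt*theta*(an/h**2 - bn/(2*h))        # multiplies u_{j-1}^{n+1}
    diag  = 1.0 - dt*theta*(-2*an/h**2 + cn)      # multiplies u_j^{n+1}
    upper = -dt*theta*(an/h**2 + bn/(2*h))        # multiplies u_{j+1}^{n+1}
    rhs[0]  -= lower[0]*uL_new                    # fold in known boundary values
    rhs[-1] -= upper[-1]*uR_new

    ab = np.zeros((3, M-1))                       # banded storage for the matrix
    ab[0, 1:]  = upper[:-1]
    ab[1, :]   = diag
    ab[2, :-1] = lower[1:]
    unew = np.empty_like(u)
    unew[0], unew[M] = uL_new, uR_new
    unew[1:M] = solve_banded((1, 1), ab, rhs)
    return unew

For θ = 0 the matrix degenerates to the identity and the update is effectively explicit; for θ > 0 the cost of the tridiagonal solve is only O(M) per step.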
It is somewhat difficult to analyse the stability of this difference equation,
but it can be seen that the terms arising from the lower order derivatives on the right-
hand side will have an extra factor of h or h^2. Hence, if h is sufficiently small,
we do not expect these terms to affect the stability of the difference equations
significantly. In particular, it can be shown that the explicit formula (θ = 0) is
stable, if

a(x,t) > 0 \quad\text{and}\quad \Delta t < \min_x \frac{(\Delta x)^2}{2a(x,t)}.   (14.43)

This stability condition is essentially the same as that obtained in the previous
section for the simple diffusion equation. It should be noted that if a(x,t) < 0,

then the true solution of the differential equation is increasing exponentially
with time and perturbations on small spatial scales grow much faster than
those at larger scales. Hence, the differential equation itself is unstable (i.e., ill-
conditioned) and no meaningful solution can be obtained using any numerical
method. Thus, in this section we assume that a(x,t) > 0.

If the coefficients a, b, c, d are independent of time, it is possible to use the
matrix method to analyse the stability of the difference method. Eigenvalues of
the relevant matrix may have to be computed numerically. If the coefficients are
constant, then the Fourier method can also be applied to analyse the stability
of difference methods. For θ ≥ 1/2 we can expect the difference method to be
stable for all time steps, while for 0 ≤ θ < 1/2 we expect the method to be
stable if

\Delta t < \min_x \frac{(\Delta x)^2}{2a(x,t)(1 - 2\theta)}.   (14.44)
In practice, to allow for some uncertainties in the stability analysis, we can
restrict the time step to a value somewhat smaller than the critical value given
above. Of course, it is necessary to ensure that the stability condition is satisfied
for every value of x in the required range, at the given time step.
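In practice the critical value can simply be evaluated on the mesh before each step (or once, if the coefficient and the mesh are fixed). A minimal sketch, assuming a vectorised coefficient function a(x, t) and a safety factor to stay somewhat below the critical value:

import numpy as np

def stable_dt(x, t, a, theta, safety=0.9):
    # Time step suggested by (14.43)-(14.44); no restriction for theta >= 1/2.
    dx = x[1] - x[0]
    if theta >= 0.5:
        return np.inf
    return safety*np.min(dx**2/(2.0*a(x, t)*(1.0 - 2.0*theta)))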
For the differential equation (14.14), we can approximate the derivative
on the right-hand side by

\frac{\partial}{\partial x}\left(K\,\frac{\partial u}{\partial x}\right)\bigg|_{(x_j,t_n)} \approx \frac{K(x_{j+1/2},t_n)(u_{j+1}^n - u_j^n) - K(x_{j-1/2},t_n)(u_j^n - u_{j-1}^n)}{h^2},   (14.45)

where x_{j+1/2} = x_j + h/2. Alternately, if K(x,t) = K(x) is only a function of x,
then we can perform a change of variable

y = \int \frac{dx}{K(x)},   (14.46)

to obtain the equation


\frac{\partial u}{\partial t} = \frac{1}{K(y)}\,\frac{\partial^2 u}{\partial y^2}.   (14.47)
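The conservative form (14.45) is also very simple to evaluate on a mesh. The short sketch below (illustrative Python only; K is assumed to be a vectorised callable) returns the approximation at the interior points:

import numpy as np

def flux_form_dxx(u, K, x, t):
    # Approximation (14.45) to d/dx(K du/dx) at the interior points of a
    # uniform mesh x, using K evaluated at the half-integer points.
    h = x[1] - x[0]
    Kp = K(x[1:-1] + h/2, t)          # K(x_{j+1/2}, t)
    Km = K(x[1:-1] - h/2, t)          # K(x_{j-1/2}, t)
    return (Kp*(u[2:] - u[1:-1]) - Km*(u[1:-1] - u[:-2]))/h**2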
A general boundary condition of the form

\alpha(t)\,u + \beta(t)\,\frac{\partial u}{\partial x} = f(t),   (14.48)

can also be treated easily. If β(t) ≠ 0, then it is convenient to introduce an
extra mesh point x_{M+1} outside the region of interest. In this case, we can write
the difference equation at the point x_M with the usual formula. This difference
equation involves u_{M+1}^n, which requires an additional equation. This equation
can be obtained by discretising the boundary condition, to get

\alpha(t_n)\,u_M^n + \beta(t_n)\,\frac{u_{M+1}^n - u_{M-1}^n}{2h} = f(t_n),   (14.49)

and a similar equation at t = t_{n+1}. It should be noted that the difference


approximation used for approximating the boundary condition should be of the
same order as that used to approximate the differential equation. Otherwise,
the computed solution will have lower order of accuracy determined by the
boundary conditions.
If the differential equation is nonlinear, then the resulting system of alge-
braic equations is also nonlinear and will need to be solved iteratively. If the
differential equation is of the form

\frac{\partial u}{\partial t} = f(x, t, u, u_x, u_{xx}),   (14.50)

where

u_x = \frac{\partial u}{\partial x}, \qquad u_{xx} = \frac{\partial^2 u}{\partial x^2},   (14.51)
then we can once again adopt difference approximations of the form considered
above for the linear equation. If the explicit method is used, then the resulting
equation can be explicitly solved for u_j^{n+1}, provided all u_j^n at the previous time
step are known. This method yields the formula

u_j^{n+1} = u_j^n + k\, f\!\left(x_j, t_n, u_j^n, \frac{1}{2h}\delta u_j^n, \frac{1}{h^2}\delta^2 u_j^n\right),   (14.52)

where

\delta u_j^n = u_{j+1}^n - u_{j-1}^n, \qquad \delta^2 u_j^n = u_{j+1}^n - 2u_j^n + u_{j-1}^n.   (14.53)

No stability analysis of a general nonlinear equation has been performed, but
we can expect \partial f/\partial u_{xx} to play the role of the coefficient a(x,t) in the linear
equation. We would then expect the explicit method to be stable, provided

\Delta t < \min_x \frac{(\Delta x)^2}{2\,\partial f/\partial u_{xx}}.   (14.54)

The explicit method is very convenient for nonlinear equations, but in that
case, the time step is limited by the stability requirement.
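A sketch of the explicit update (14.52) in Python is given below; it assumes a vectorised right-hand side f(x, t, u, u_x, u_xx) and Dirichlet boundary values supplied by the caller, and is meant only as an illustration of the formula:

import numpy as np

def explicit_step(u, x, t, dt, f, uL_new, uR_new):
    # Explicit step (14.52) for u_t = f(x, t, u, u_x, u_xx).
    h = x[1] - x[0]
    ux  = (u[2:] - u[:-2])/(2*h)             # (1/2h) delta u_j
    uxx = (u[2:] - 2*u[1:-1] + u[:-2])/h**2  # (1/h^2) delta^2 u_j
    unew = np.empty_like(u)
    unew[1:-1] = u[1:-1] + dt*f(x[1:-1], t, u[1:-1], ux, uxx)
    unew[0], unew[-1] = uL_new, uR_new
    return unew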
It is possible to generalise the implicit method to nonlinear equations, to
get

\frac{u_j^{n+1} - u_j^n}{k} = \theta\, f\!\left(x_j, t_{n+1}, u_j^{n+1}, \frac{1}{2h}\delta u_j^{n+1}, \frac{1}{h^2}\delta^2 u_j^{n+1}\right) + (1-\theta)\, f\!\left(x_j, t_n, u_j^n, \frac{1}{2h}\delta u_j^n, \frac{1}{h^2}\delta^2 u_j^n\right).   (14.55)

For θ ≥ 1/2, the time step may not be limited by stability. However, use of an
implicit method results in a system of nonlinear equations, which needs to be
solved at each time step. Solving these equations requires a considerable effort

and this method may be worthwhile only if the time step can be increased
significantly. This may be true in most practical situations, since when the co-
efficients are variable or if the equation is nonlinear, the stability requirement
needs to be satisfied throughout the range of x values. Thus, a large value of
\partial f/\partial u_{xx} in a small range of x may force a very small time step over the entire re-
gion. This problem can be avoided by using the implicit method at the expense
of solving a system of nonlinear equations at every time step. The system of
equations can be solved using Newton's method, which is equivalent to linearis-
ing the equation at every step using the current approximation to the solution
u(x, t) and solving the resulting equations to obtain a new approximation to
the solution. This iteration can be continued until a satisfactory convergence is
achieved. Convergence may not be guaranteed in all cases, but if the time step
is sufficiently small and the solution is smooth in the relevant range, then there
should not be any difficulty in ensuring convergence. The starting approxima-
tion can be obtained using an explicit method. If u_j^{(m)} is the mth approximation
to u_j^{n+1}, then the iteration can be defined by

\frac{\Delta u_j^{(m)}}{k} - \theta\left(\frac{\partial f}{\partial u}\,\Delta u_j^{(m)} + \frac{1}{2h}\,\frac{\partial f}{\partial u_x}\,\delta\Delta u_j^{(m)} + \frac{1}{h^2}\,\frac{\partial f}{\partial u_{xx}}\,\delta^2\Delta u_j^{(m)}\right)
= -\frac{u_j^{(m)} - u_j^n}{k} + \theta f\!\left(x_j, t_{n+1}, u_j^{(m)}, \frac{1}{2h}\delta u_j^{(m)}, \frac{1}{h^2}\delta^2 u_j^{(m)}\right) + (1-\theta) f\!\left(x_j, t_n, u_j^n, \frac{1}{2h}\delta u_j^n, \frac{1}{h^2}\delta^2 u_j^n\right),   (14.56)


where \Delta u_j^{(m)} = u_j^{(m+1)} - u_j^{(m)} is the correction in the mth iteration. This is
a system of equations with a tridiagonal matrix in \Delta u_j^{(m)}, which can be easily
solved to compute the required corrections.
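One Newton correction of this kind can be written quite compactly if the partial derivatives of f are available as functions. The sketch below is only an illustration of (14.56), with hypothetical callables fu, fux, fuxx for the partial derivatives of f with respect to u, u_x and u_xx, and with Dirichlet boundary values assumed to be stored already in the current guess, so that the corrections vanish at the end points:

import numpy as np
from scipy.linalg import solve_banded

def newton_correction(um, un, x, tn, dt, theta, f, fu, fux, fuxx):
    # One correction of the iteration (14.56) for the implicit scheme (14.55).
    # um is the current guess for u^{n+1}, un the solution at t_n.
    h = x[1] - x[0]
    xi = x[1:-1]

    def parts(v):
        return (v[1:-1], (v[2:] - v[:-2])/(2*h),
                (v[2:] - 2*v[1:-1] + v[:-2])/h**2)

    u_m, ux_m, uxx_m = parts(um)
    u_n, ux_n, uxx_n = parts(un)
    # residual: right-hand side of (14.56)
    res = (-(u_m - u_n)/dt + theta*f(xi, tn+dt, u_m, ux_m, uxx_m)
           + (1-theta)*f(xi, tn, u_n, ux_n, uxx_n))
    # tridiagonal Jacobian: left-hand side of (14.56)
    a1 = fu(xi, tn+dt, u_m, ux_m, uxx_m)
    a2 = fux(xi, tn+dt, u_m, ux_m, uxx_m)
    a3 = fuxx(xi, tn+dt, u_m, ux_m, uxx_m)
    lower = -theta*(-a2/(2*h) + a3/h**2)
    diag  = 1.0/dt - theta*(a1 - 2*a3/h**2)
    upper = -theta*( a2/(2*h) + a3/h**2)
    ab = np.zeros((3, len(xi)))
    ab[0, 1:]  = upper[:-1]
    ab[1, :]   = diag
    ab[2, :-1] = lower[1:]
    du = solve_banded((1, 1), ab, res)
    unew = um.copy()
    unew[1:-1] += du                  # u^{(m+1)} = u^{(m)} + Delta u^{(m)}
    return unew

The correction is repeated until it is small enough, with the explicit step providing the starting guess as described above.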
For some nonlinear equations, it may be preferable to average the terms
in a different way. For example, consider the following system of equations
considered by Crank and Nicolson

\frac{\partial u}{\partial t} = \frac{\partial^2 u}{\partial x^2} - q\,\frac{\partial w}{\partial t}, \qquad \frac{\partial w}{\partial t} = -b\,w\,e^{-A/u}.   (14.57)

Using θ = 1/2 we can write the corresponding difference equations in the form

\frac{u_j^{n+1} - u_j^n}{k} = \frac{1}{2h^2}\left(\delta^2 u_j^{n+1} + \delta^2 u_j^n\right) - q\,\frac{w_j^{n+1} - w_j^n}{k},
\frac{w_j^{n+1} - w_j^n}{k} = -\frac{b}{2}\left(w_j^{n+1} + w_j^n\right)\exp\!\left(-2A/(u_j^{n+1} + u_j^n)\right).   (14.58)

This form is different from what will be obtained if the formula (14.55) (with
θ = 1/2) is applied to this equation. But both these formulae are centred at
(x_j, t_{n+1/2}).

Alternately, it is possible to use a predictor-corrector type of method with


an explicit formula serving as the predictor and an implicit formula serving as
the corrector. A simple fixed-point iteration may be used to solve the implicit
equations. The fixed-point iteration will converge, provided the time step is
sufficiently small. However, it may turn out that in order to ensure the con-
vergence of the fixed-point iteration, the time step may have to be restricted
to a very small value. In that case, it may be more efficient to use Newton's
method to solve the system of nonlinear equations. The convergence of New-
ton's method is also not guaranteed for arbitrary time step, since the starting
values may not be sufficiently close to the actual solution. However, in order
to ensure a reasonable accuracy, the time step will have to be restricted to a
value, such that change in the solution at each time step is not unduly large.
This in turn usually ensures that the starting approximation obtained using an
explicit method is good enough for the Newton's iteration to converge.
These methods can be easily generalised to a system of parabolic equations
of the form

\frac{\partial u^{(i)}}{\partial t} = f^{(i)}(x, t, u, u_x, u_{xx}), \qquad (i = 1, \ldots, N).   (14.59)

The amount of computation required to solve the system will of course increase
with the size N. Further, if the explicit method is used, then the time step is
restricted to the lowest of the critical time steps for different equations.
It is rather difficult to write an algorithm with automatic control of step
sizes for partial differential equations. Firstly, it is difficult to estimate even the
local truncation error, since higher order derivatives with respect to both t and x
are required. Further, both the time step and the space step need to be adjusted
to achieve the required accuracy, since adjusting time step alone may not be
sufficient. Adjusting the step size in spatial direction may require interpolation
to compute solution at the new set of points. Further, the optimum strategy for
adjusting both step sizes may not be known. This problem will become more
severe for equations in more than one space variable.
For nonlinear equations of the form (14.59), it may be more convenient
to use the method of lines. In this method, we discretise the equation in space
coordinates to obtain a system of ordinary differential equations. The resulting
system of ordinary differential equations can then be solved by using techniques
described in Chapter 12. This approach has proved to be very effective in many
cases, especially in reducing the amount of programming effort and analysis
required to solve the partial differential equation. The execution time using
this method may not be less than that for methods specifically developed for
the given problem. One advantage of the method of lines is that the time-
step can be adjusted while solving the resulting system of ordinary differential
equations using the techniques described in Chapter 12.
To implement this method, we can introduce a mesh in x coordinate as
before and define Uj(t) = u(Xj, t). We can approximate the differential equation

(14.50) by

\frac{du_j}{dt} = f\!\left(x_j, t, u_j, \frac{u_{j+1} - u_{j-1}}{2h}, \frac{u_{j+1} - 2u_j + u_{j-1}}{h^2}\right), \qquad (j = 1, 2, \ldots, M-1).   (14.60)

This approximation reduces the original partial differential equation to a system
of ordinary differential equations. It can be seen that using the trapezoidal
rule to solve this system of ordinary differential equations is equivalent to ap-
plying the Crank-Nicolson method to the original partial differential equation.
In many cases, the resulting system of ordinary differential equations may be
stiff with widely varying eigenvalues of the equation matrix. Thus, if the usual
techniques for nonstiff equations are used, the time step will be restricted to
a very small value. In fact, with the usual Runge-Kutta method, the limiting
value of the time step may be comparable to that using the explicit method
for solving the partial differential equation. Hence, it may be desirable to use a
stiffly stable method to solve this system of differential equations. The time step
in this method may be controlled to achieve the required accuracy. However,
that will only control the error due to time discretisation. Additional error is in-
troduced because of difference approximation in the spatial direction. Thus, for
optimum results, it is desirable to balance the two errors. Hence, the requested
accuracy for the solution of the system of ordinary differential equations should
be comparable to the estimated error due to space discretisation.
This technique is particularly attractive for problems which are nonlinear
in space derivatives, since the solution of initial value problems for ordinary dif-
ferential equations poses no serious difficulty for nonlinear equations. It should
be noted that the resulting equation matrix is banded, because du_j/dt de-
pends only on u_{j-1}, u_j, and u_{j+1}. Hence, the corresponding equation matrix
of the linearised equation (which is required to solve the implicit equations) is
tridiagonal. If the routine adopted to solve the system of ordinary differential
equations uses the tridiagonal nature of the matrix, then we can achieve much
higher efficiency. Some routines for solving stiff differential equations are spe-
cially tailored for banded matrices of the type encountered in the method of
lines. The method of lines can also be generalised to equations in more than
one space dimensions. But, in that case, the number of variables will increase
significantly and we may be forced to use a coarse mesh with only a few points
in each direction. Further, as would be seen in the next section, the bandwidth
of the resulting equation matrix is also much larger. The method of lines can
also be used with finite element discretisation (Section 14.11).
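As an illustration of the method of lines, the semi-discretisation (14.60) and its integration with a library stiff solver can be sketched in Python as follows. This is not the MSTEP/GEAR combination of Appendix B; the interface of f, the boundary functions and the tolerances are assumptions made only for the sketch.

import numpy as np
from scipy.integrate import solve_ivp

def method_of_lines(f, x, u0, t_eval, uL, uR):
    # Semi-discretisation (14.60) of u_t = f(x,t,u,u_x,u_xx): central
    # differences in x, stiff (BDF) integration in t.  uL(t), uR(t)
    # supply Dirichlet boundary values, u0 the initial values on the mesh x.
    h = x[1] - x[0]
    xi = x[1:-1]

    def rhs(t, ui):
        u = np.concatenate(([uL(t)], ui, [uR(t)]))
        ux  = (u[2:] - u[:-2])/(2*h)
        uxx = (u[2:] - 2*u[1:-1] + u[:-2])/h**2
        return f(xi, t, ui, ux, uxx)

    return solve_ivp(rhs, (t_eval[0], t_eval[-1]), u0[1:-1],
                     method='BDF', t_eval=t_eval, rtol=1e-6, atol=1e-9)

# For the equation of Example 14.2, u_t = (u^5)_xx, one possible choice is
#   f = lambda x, t, u, ux, uxx: 20*u**3*ux**2 + 5*u**4*uxx
# since (u^5)_xx = 20 u^3 u_x^2 + 5 u^4 u_xx.

The banded structure of the Jacobian, which is important for efficiency as discussed above, can be indicated to such a solver (for example, through the jac_sparsity argument of solve_ivp), so that the linear algebra cost per step remains modest.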
EXAMPLE 14.2: Solve the following nonlinear parabolic equation

\frac{\partial u}{\partial t} = \frac{\partial^2 u^5}{\partial x^2}, \qquad t > 0, \quad 0 \le x \le 1,   (14.61)

which has an exact solution that is given implicitly by the equation

\frac{5}{4}(u - u_0)^4 + \frac{20}{3}u_0(u - u_0)^3 + 15u_0^2(u - u_0)^2 + 20u_0^3(u - u_0) + 5u_0^4\ln(u - u_0) = v(vt - x + x_0),   (14.62)

Figure 14.3: Solution of a nonlinear parabolic equation using the method of lines. The
continuous curves show the exact solution while the circles denote the computed values at
different times. The curves from left to right show the solution at time intervals of 10^-4. The
dotted line shows the computed solution using Δx = 0.01 for t = 6 x 10^-4.

where u_0, x_0 and v are constants. This equation can be used to specify the initial condition
u(x,0) and the boundary conditions u(0,t) and u(1,t). Use u_0 = 1, x_0 = 0 and v = 1500.
The required solution is a nonlinear wave running to the right if v > 0; its shape is
shown in Figure 14.3. At the right u approaches u_0 more or less exponentially, because for
u - u_0 << u_0 the logarithmic term on the left-hand side dominates. The implicit equation
can be solved using secant iteration to find u(x,t). When u - u_0 << u_0, the iteration may not
converge, unless the starting value is sufficiently close to the actual solution. In fact, using
finite precision arithmetic, it may not be possible to distinguish the root from u = u_0. Hence,
in that case, we approximate the solution by neglecting all other terms on the left-hand side
to obtain

u(x,t) = u_0 + \exp\!\left(\frac{v(vt - x + x_0)}{5u_0^4}\right), \qquad \frac{v(vt - x + x_0)}{5u_0^4} \ll -1.   (14.63)
Solution of the implicit equation (14.62) is required to generate the initial values u(x,0) and
the boundary values u(0,t) and u(1,t). Further, this equation is also solved to obtain the
exact solution at the required values of t to be compared with the computed solution.
The differential equation can be solved numerically, using a finite difference method.
If explicit method is used, then the time step is restricted to very small values. Using an
implicit difference scheme leads to a system of nonlinear equations at every time step. This
equation can be linearised if we approximate

(14.64)

Using this approximation, it is possible to obtain a set of linear equations in u_j^{n+1} with a
tridiagonal matrix, which can be easily solved. Richtmyer and Morton (1995) have reported
results using this technique. They find that for θ < 1/2, the difference method is stable if

\max_x \frac{5u^4\,\Delta t}{(\Delta x)^2} < \frac{1}{2 - 4\theta}.   (14.65)

It may be noted that the value of U increases with time and at some stage this criterion is
bound to be violated. Hence, if we are using a constant time step throughout the calculation,
this time step will have to be very small in the initial phases. Since the difference method
is single-step in time, there is no need to restrict ourselves to a constant time step. We
can effectively use a variable time step, with the actual value determined using the stability
criterion. The use of θ ≥ 1/2 avoids this problem, since the time step is no longer constrained
by the stability condition.
We solve this equation using the method of lines. Choosing a uniform mesh with spacing
Δx = 0.05, we can approximate the spatial derivatives using the usual difference approxima-
tion to obtain a system of ordinary differential equations in the variables u_j(t), which approxi-
mate the solution u(x_j,t). This system of ordinary differential equations can be solved using
the subroutine MSTEP. It is found that the resulting system of equations is stiff and it is
better to use this routine with subroutine GEAR. The results obtained using a constant time
step Δt = 10^-5 are shown in Figure 14.3, which shows the solution at time intervals of 10^-4.
The smallest curve represents the solution at t = 10^-4, while the largest at t = 6 x 10^-4.
The circles represent the computed solution at the same time. It can be seen that the com-
puted solution is close to the exact solution at the left end, but near the edge of the wave
there are significant departures. This is to be expected, since at the right edge of the wave
there is a sharp change in the gradient, resulting in significant errors using simple difference
approximation. To improve the accuracy we need to increase the number of mesh points in
spatial direction. The figure also shows the result obtained using Δx = 0.01 at the last time
step and it is quite clear that there is significant improvement.
If the same routine is used with the fourth-order Adams method, then after some time
the iteration on the corrector fails to converge, unless the time step is very small. To continue
the integration till t = 6 x 10^-4, we need to use a time step of about 5 x 10^-7, which is 20 times
smaller than what is used with the stiffly stable method. A similar step size is necessary with
the fourth-order Runge-Kutta method, in order to ensure any reasonable accuracy. If the
integration is to be continued till t = 10^-3, then an even smaller time step of about 2 x 10^-7 is
required. Further, the accuracy of results obtained using such small time steps is comparable
to that with Δt = 10^-5, using the stiffly stable method, because the accuracy is determined by
the spatial discretisation. Thus, in principle it is much more efficient to use a stiffly stable
method. However, the subroutine GEAR in Appendix B does not use the banded nature of
the equation matrix. Consequently, one step of integration using it requires much more effort
than what is required for Runge-Kutta method, or Adams method. Even then at later stages
when the time step is otherwise constrained to very low values, the stiffly stable method
works out to be significantly more efficient.

14.4 Parabolic Equations in Several Space Variables


In this section, we consider techniques for solving parabolic equations in more
than one space variable. As a first example we consider the parabolic equation in
three dimensions

\frac{\partial u}{\partial t} = A\,\frac{\partial^2 u}{\partial x^2} + 2B\,\frac{\partial^2 u}{\partial x\,\partial y} + C\,\frac{\partial^2 u}{\partial y^2},   (14.66)

where A > 0, C > 0 and AC - B^2 > 0. These inequalities ensure the parabolic
character of the equation; when they are satisfied, the initial value problem is
well posed.
For writing the finite difference equations, we introduce a uniform mesh
in each coordinate and define

u_{jl}^n = u(j\Delta x,\; l\Delta y,\; n\Delta t).   (14.67)

Using this notation we can write the finite difference approximation to (14.66)
as

\frac{u_{jl}^{n+1} - u_{jl}^n}{\Delta t} = \theta\,\phi_{jl}^{n+1} + (1-\theta)\,\phi_{jl}^n,   (14.68)

where

\phi_{jl}^n = A\,\frac{u_{j+1,l}^n - 2u_{jl}^n + u_{j-1,l}^n}{(\Delta x)^2} + B\,\frac{u_{j+1,l+1}^n - u_{j-1,l+1}^n - u_{j+1,l-1}^n + u_{j-1,l-1}^n}{2\Delta x\,\Delta y} + C\,\frac{u_{j,l+1}^n - 2u_{jl}^n + u_{j,l-1}^n}{(\Delta y)^2}.   (14.69)

If θ = 0 the equations are explicit and there is no difficulty in solving them.
If θ > 0, then these equations are implicit, and for each value of n we must
solve a system of linear equations similar to those encountered in the solution of
solve a system of linear equations similar to those encountered in solution of
elliptic equations (Section 14.7). This solution requires a considerable effort,
since the equations are no longer tridiagonal. It can be seen that each equation
has nine nonzero coefficients distributed in groups of three. Matrices of this type
are discussed in Section 14.7. Although the matrix is sparse, the bandwidth is
rather large and further most of the elements within the band are also zero.
Iterative methods of the type described in Sections 14.8-9 can be used to solve
this system of equations. However, it may be more efficient to use the explicit
method and take Δt sufficiently small to ensure stability.
It can be seen that the method of lines also encounters similar problems.
Once again the equation matrix is not tridiagonal and if the resulting system of
ordinary differential equations is stiff, it may be difficult to solve the required
system of equations at each time step. For nonstiff systems, there may not be
any difficulty, since we can use Runge-Kutta methods or predictor-corrector
methods. But in that case, the time step may have to be restricted.
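For reference, the explicit case (θ = 0) of (14.68)-(14.69) takes only a few lines of code. The sketch below (illustrative Python, with constant coefficients A, B, C and boundary values left untouched) performs one time step:

import numpy as np

def explicit_2d_step(u, A, B, C, dt, dx, dy):
    # Explicit (theta = 0) update of (14.68)-(14.69) for
    #   u_t = A u_xx + 2B u_xy + C u_yy;  axis 0 is x, axis 1 is y.
    phi = (A*(u[2:, 1:-1] - 2*u[1:-1, 1:-1] + u[:-2, 1:-1])/dx**2
           + B*(u[2:, 2:] - u[:-2, 2:] - u[2:, :-2] + u[:-2, :-2])/(2*dx*dy)
           + C*(u[1:-1, 2:] - 2*u[1:-1, 1:-1] + u[1:-1, :-2])/dy**2)
    unew = u.copy()
    unew[1:-1, 1:-1] += dt*phi
    return unew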
The stability of the difference method (14.68) can be analysed using the
Fourier method by looking for a solution of the form

u_{jl}^n = \xi^n \exp(i k_1 j\Delta x + i k_2 l\Delta y),   (14.70)

which gives the amplification factor

\xi = \xi(k_1, k_2) = \frac{1 + (1-\theta)\psi}{1 - \theta\psi},   (14.71)

where

\psi = -2\left(\frac{A\Delta t}{(\Delta x)^2}(1 - \cos k_1\Delta x) + \frac{B\Delta t}{\Delta x\,\Delta y}\sin k_1\Delta x\,\sin k_2\Delta y + \frac{C\Delta t}{(\Delta y)^2}(1 - \cos k_2\Delta y)\right).   (14.72)

For stability we require |ξ| ≤ 1 for all possible values of k_1 and k_2. It can be
shown that for all values of k_1 and k_2

-4\left(\frac{A\Delta t}{(\Delta x)^2} + \frac{C\Delta t}{(\Delta y)^2}\right) \le \psi \le 0.   (14.73)

Following the arguments in Section 14.2, we get the following stability criterion

\frac{A\Delta t}{(\Delta x)^2} + \frac{C\Delta t}{(\Delta y)^2} < \frac{1}{2 - 4\theta}, \quad \text{if } 0 \le \theta < 1/2; \qquad \text{no restriction, if } 1/2 \le \theta \le 1.   (14.74)

This stability criterion can be easily generalised to higher dimensions.


Another problem with equations in more than one space dimension is the
treatment of boundary conditions. If the solution is required over a rectangular
region, then the boundary conditions can be incorporated easily. However, if
the boundary is curved, then such simple treatment will not work. In general,
it may be difficult to incorporate such boundary conditions to an accuracy
compatible with the difference approximation. Some techniques for dealing with
such boundaries are described in Section 14.7.
The problem of solving a large system of linear equations using the implicit
method can be tackled by adopting the so-called alternating direction method.
This method is based on the observation that the tridiagonal nature of the
equation matrix can be preserved, if at any step we use differences along one
space direction in the implicit method. Consider the simple diffusion equation

\frac{\partial u}{\partial t} = a\left(\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}\right),   (14.75)

on a square mesh, i.e., Δx = Δy = h. The alternating direction method consists
of taking two half-steps of time step Δt/2, defined by

u_{jl}^{n+1/2} - u_{jl}^n = \frac{a\Delta t}{2(\Delta x)^2}\left(\delta_x^2 u_{jl}^{n+1/2} + \delta_y^2 u_{jl}^n\right),
u_{jl}^{n+1} - u_{jl}^{n+1/2} = \frac{a\Delta t}{2(\Delta x)^2}\left(\delta_x^2 u_{jl}^{n+1/2} + \delta_y^2 u_{jl}^{n+1}\right),   (14.76)

where

\delta_x^2 u_{jl} = u_{j+1,l} - 2u_{jl} + u_{j-1,l}, \qquad \delta_y^2 u_{jl} = u_{j,l+1} - 2u_{jl} + u_{j,l-1}.   (14.77)

Assuming for simplicity, that boundary values are given on some rectangle in
the (x,y)-plane, the first equation in (14.76) leads to a tridiagonal system for
u_{jl}^{n+1/2}, which can be easily solved. This stage is referred to as the x sweep. In
fact, it may be noted that the equations for each value of l are decoupled from
the rest. Hence, we get M_y systems of tridiagonal equations in M_x unknowns,
where M_x and M_y are the numbers of internal mesh points in the x and y directions,
respectively. Similarly, the second equation in (14.76) can be solved for u_{jl}^{n+1}. For
each value of j it yields a system of tridiagonal equations. This stage can be
referred to as the y sweep. Thus, at each time step we need to solve several
systems of equations with tridiagonal matrices, instead of one large system of
equations with a more complex matrix.
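A sketch of one full step of this method for the simple diffusion equation (14.75), with homogeneous Dirichlet boundaries so that the intermediate boundary values need no special treatment, is given below. It is illustrative Python only; the helper name and the use of a library banded solver are choices made here.

import numpy as np
from scipy.linalg import solve_banded

def _sweep(r, rhs):
    # Solve the tridiagonal systems (1+2r) v_j - r v_{j-1} - r v_{j+1} = rhs_j
    # along axis 0 of rhs, one system for each column.
    n = rhs.shape[0]
    ab = np.zeros((3, n))
    ab[0, 1:]  = -r
    ab[1, :]   = 1 + 2*r
    ab[2, :-1] = -r
    return solve_banded((1, 1), ab, rhs)

def adi_step(u, a, dt, h):
    # One step of the alternating direction method (14.76) for
    # u_t = a(u_xx + u_yy) on a square mesh (dx = dy = h), with u = 0 on the edges.
    r = a*dt/(2*h**2)
    uh = u.copy()                                  # intermediate level n+1/2
    rhs = u[1:-1, 1:-1] + r*(u[1:-1, 2:] - 2*u[1:-1, 1:-1] + u[1:-1, :-2])
    uh[1:-1, 1:-1] = _sweep(r, rhs)                # x sweep (implicit in x)
    un = u.copy()                                  # new level n+1
    rhs = uh[1:-1, 1:-1] + r*(uh[2:, 1:-1] - 2*uh[1:-1, 1:-1] + uh[:-2, 1:-1])
    un[1:-1, 1:-1] = _sweep(r, rhs.T).T            # y sweep (implicit in y)
    return un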
It should be noted that u_{jl}^{n+1/2} is just some intermediate quantity and not an
approximation to the solution at t = t_{n+1/2}. Hence, in the first stage we cannot
use the boundary condition u_{0l}^{n+1/2} = h(y_l, t_{n+1/2}), where u(0, y, t) = h(y, t)
is the required boundary condition. In fact, if this simple boundary condition
is used, the accuracy will drop to first order in Δt. It can be shown {16} that
the correct boundary condition for this intermediate stage is

u_{0l}^{n+1/2} = \frac{1}{2}\left(h(y_l, t_n) + h(y_l, t_{n+1})\right) + \frac{a\Delta t}{4(\Delta y)^2}\left(\delta_y^2 h(y_l, t_n) - \delta_y^2 h(y_l, t_{n+1})\right).   (14.78)

If the boundary condition is independent of t, then it reduces to the simple form
u_{0l}^{n+1/2} = h(y_l). Similar conditions need to be applied at other boundaries.
The stability of this method can be analysed by the Fourier method. The
resulting amplification factor is the product of the amplification factors for the
two steps. For a Fourier component exp(ik_1 x + ik_2 y) the amplification factor
in one complete step is

\xi(k_1, k_2) = \frac{1 - r(1 - \cos k_2\Delta x)}{1 + r(1 - \cos k_1\Delta x)} \cdot \frac{1 - r(1 - \cos k_1\Delta x)}{1 + r(1 - \cos k_2\Delta x)}.   (14.79)

Since r(1 - cos k_iΔx) ≥ 0, (i = 1, 2), we have |ξ| ≤ 1, giving unconditional stability. It
should be noted that the individual amplification factors representing half-steps
could have a magnitude larger than unity, but the magnitude of their product
is always smaller than or equal to unity. To find the order of accuracy of this
method, we can eliminate u_{jl}^{n+1/2} between the two equations in (14.76) to find
{15} that the resulting formula has a truncation error of O((Δt)^2 + (Δx)^2).
The most obvious generalisation of this method to three spatial dimensions
is by three equations for u^{n+1/3}, u^{n+2/3} and u^{n+1}. The equations are identical,
except for a cyclic shifting of the implicit term among the x, y and z derivatives.
It turns out that the resulting method is not unconditionally stable and further,
the truncation error is O(Δt + (Δx)^2). It is possible to restore the unconditional
stability as well as the second order accuracy, if this method is modified by
defining intermediate quantities u^{*n+1} and u^{**n+1} to approximate u^{n+1}. The
resulting method is defined by

(14.80)

It can be proved that this method is unconditionally stable. It is interesting
to note that if u^{*n+1} is replaced by u^{**n+1} in the third equation above (it
may appear reasonable to use the latest approximation to u^{n+1}), then the
unconditional stability is lost. It can also be proved that the truncation error in
this method is O((Δt)^2 + (Δx)^2). Once again, at each step we need to solve
implicit equations with tridiagonal matrices.
This technique can be easily generalised to n space variables, in which
case, we need to solve n systems of tridiagonal equations at each time step.
This method can also be generalised to linear parabolic equations with variable
coefficients. More generally, if the equation can be written in the form

\frac{\partial u}{\partial t} = \sum_{i=1}^{K} A_i u,   (14.81)

where A_i is some linear differential operator containing only spatial derivatives,


then we can split one time step into K substeps. In each of these substeps, the
difference equation is implicit with respect to one of the operators Ai. If the
operators Ai are chosen such that an implicit equation in anyone of them
can be solved efficiently, then the integration over the entire time step can be
performed efficiently.
Solving partial differential equations in three or more dimensions requires
a considerable amount of computing resources. Hence, it is better to start ex-
perimenting on a coarse grid, even though the resulting values may not have
the required accuracy. If a difference method is not stable on a coarse grid, it
is not likely to be stable on a finer grid. To detect instabilities in the difference
method, it is necessary to print out or plot the solution at every mesh point
at successive time steps. The instability usually shows up as oscillation (either
spatial or temporal) in some of these values. Even if the difference method
is stable on a coarse grid, it may turn out to be unstable when the grid is re-
fined. In that case, decreasing the time step sufficiently should normally restore
stability.

14.5 Wave Equation in Two Dimensions


For hyperbolic equations, information is propagated along the characteristics.
In particular, if a discontinuity exists across a certain characteristic at one point,
there will be discontinuity across that characteristic along its entire length. On
the other hand , for parabolic equations, even if the initial state is discontinu-
ous, the discontinuity diffuses out as the solution is continued. The presence of
discontinuities may pose problems for finite difference methods, since the char-
acteristic along which discontinuity exists may not pass through any of the grid
points used in the finite difference representation. Further, the characteristic
curves are the natural boundaries for determining which portions of the solu-
tion domain are influenced by which boundary conditions. A knowledge of the

characteristics is important for solution and understanding the properties of hy-


perbolic equations. In fact, the method of integrating along the characteristics
is quite often the most accurate and convenient process for numerical solution
of hyperbolic equations. The main problem with the method of characteristics
is that, if the solution is required at a fixed time, then it requires interpolation
along different characteristics to obtain the solution at the required time. Apart
from this, many equations in practice are of mixed hyperbolic and parabolic
type. In this book we restrict ourselves to finite difference methods for step by
step solution of the initial value problem. The basic ideas here are similar to
those for parabolic equations and hence our discussion is brief.
Let us consider the wave equation

\frac{\partial^2 u}{\partial t^2} = c^2\,\frac{\partial^2 u}{\partial x^2},   (14.82)

with initial conditions of Cauchy type. Using the notation of Section 14.2, the
simplest difference method is obtained by approximating both derivatives by a
central difference scheme, to get

u_j^{n+1} - 2u_j^n + u_j^{n-1} = \left(\frac{c\Delta t}{\Delta x}\right)^2 \left(u_{j+1}^n - 2u_j^n + u_{j-1}^n\right).   (14.83)

If the solution is known at time steps n and n - 1, then this equation can be
solved explicitly for u_j^{n+1}. Initially u_j^0 is given by the initial condition, while
u_j^{-1} can be eliminated by using the initial value of the derivative. For example,
if the initial conditions are

u(x, 0) = f_1(x), \qquad \frac{\partial u}{\partial t}\bigg|_{t=0} = f_2(x),   (14.84)

then these can be written in the finite difference form as

u_j^0 = f_1(x_j), \qquad \frac{u_j^1 - u_j^{-1}}{2\Delta t} = f_2(x_j).   (14.85)

The second equation can be used with (14.83) (for n = 0) to eliminate u_j^{-1},
thus enabling us to calculate u_j^1 directly. Similarly, at the boundaries j = 0 and
j = M, the boundary conditions can be used to calculate u_0^n and u_M^n.
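A short Python sketch of this procedure (illustrative only, using the notation of (14.84)-(14.85) and homogeneous Dirichlet boundary conditions) is:

import numpy as np

def wave_explicit(f1, f2, c, x, dt, nsteps):
    # Explicit scheme (14.83) for u_tt = c^2 u_xx, with u(x,0) = f1(x),
    # u_t(x,0) = f2(x) and u = 0 at both ends; returns u at t = nsteps*dt.
    dx = x[1] - x[0]
    s = (c*dt/dx)**2
    u_old = np.asarray(f1(x), dtype=float)         # u^0
    u_old[0] = u_old[-1] = 0.0
    u = u_old.copy()                               # u^1 from (14.83) and (14.85)
    u[1:-1] = (u_old[1:-1] + dt*f2(x[1:-1])
               + 0.5*s*(u_old[2:] - 2*u_old[1:-1] + u_old[:-2]))
    for _ in range(nsteps - 1):
        u_new = np.empty_like(u)
        u_new[1:-1] = (2*u[1:-1] - u_old[1:-1]
                       + s*(u[2:] - 2*u[1:-1] + u[:-2]))
        u_new[0] = u_new[-1] = 0.0
        u_old, u = u, u_new
    return u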
The truncation error in this difference method is given by

e(u) = \frac{u(x_j, t_{n+1}) - 2u(x_j, t_n) + u(x_j, t_{n-1})}{(\Delta t)^2} - c^2\,\frac{u(x_{j+1}, t_n) - 2u(x_j, t_n) + u(x_{j-1}, t_n)}{(\Delta x)^2},   (14.86)

where u is the exact solution of the wave equation. Using the Taylor series
expansion, we find

e(u) = \frac{1}{12}\left((\Delta t)^2\,\frac{\partial^4 u}{\partial t^4} - c^2(\Delta x)^2\,\frac{\partial^4 u}{\partial x^4}\right) = \frac{c^2(\Delta x)^2}{12}\,\frac{\partial^4 u}{\partial x^4}\left(\frac{c^2(\Delta t)^2}{(\Delta x)^2} - 1\right),   (14.87)

where we have used \partial^4 u/\partial t^4 = c^4\,\partial^4 u/\partial x^4 to simplify the error term. Thus in
general, the truncation error is O(h^2 + k^2), but if cΔt = Δx, then the leading
term in the error expansion vanishes and we get higher order accuracy. It can
be shown that this difference scheme is stable if cΔt ≤ Δx. Thus, if the time
step is just at the stability limit, then the difference scheme has higher order
accuracy. In fact, it can be shown that the local truncation error vanishes in
this case and we get essentially exact (apart from roundoff) results. In practice,
it may be difficult to use this result to achieve higher order accuracy, since the
propagation speed c may be variable and we will be constrained to use a step
size determined by the maximum value of c.
Stability of this difference method can be analysed using the Fourier or
the matrix method. Using the Fourier method we can seek solutions of the form
u_j^n = \xi^n e^{imjh}. Substituting this expression in the difference equation gives

\xi + \frac{1}{\xi} = 2 - 4s\sin^2(mh/2) = 2A,   (14.88)

where s = c^2k^2/h^2. This quadratic equation in ξ can be solved to get the
amplification factor

\xi = A \pm \sqrt{A^2 - 1}.   (14.89)

Now for stability we require |ξ| ≤ 1 for both roots, which is true only if |A| ≤ 1.
This in turn requires s ≤ 1, which gives the stability criterion cΔt ≤ Δx. This
is the celebrated stability criterion first obtained by Courant, Friedrichs and
Lewy, and is usually referred to as the Courant condition. It should be noted
that if this condition is satisfied, then |ξ| = 1 and the amplitudes of oscillatory
modes are neither decreasing nor increasing. This is reasonable, since the simple
wave equation without any dissipation should preserve the amplitude of wave
motion. It is interesting to note that if cΔt = Δx, then ξ(m) = e^{imcΔt}, which
is the amplification factor per time step for the exact solution of the wave
equation. In this case, we do not expect any truncation error for all values of
m. Thus, when the time step is equal to the limiting value for stability, the
numerical solution of the wave equation gives exact results. It should be noted
that the Fourier analysis does not incorporate the boundary conditions. Hence,
in actual practice, even if cΔt = Δx, there could be some truncation error, if
the boundary conditions are not represented exactly. The Courant condition
can be physically understood as follows. If cΔt > Δx, then using the difference
equations on a pure initial value problem, it will be possible to find the solution
outside the triangular region limited by the characteristics (Figure 14.1), and
the solution cannot be physical.
We can expect this restriction on the time step to be overcome by using
an implicit method. The simplest implicit method is obtained by writing

(14.90)

In this case, we need to solve a system of equations involving a tridiagonal


matrix at every step. Stability of this method can be analysed using the Fourier
method, which gives the amplification factor

(14.91)

Thus, |ξ| = 1 for any positive value of s, giving unrestricted stability. This result
can be physically understood by the fact that, apart from function values at
previous time steps, here we also need u_{j±1}^{n+1}. This essentially implies that we
need some boundary conditions at the end points, thus defining the solution of
the physical problem at the new time step.
We can use the more general implicit formula

(14.92)

which gives unrestricted stability if θ ≥ 1/4.


It is instructive to rewrite the wave equation as a system of two coupled
first-order equations. Introducing new variables v = ∂u/∂t and w = c ∂u/∂x,
we can write the wave equation as

\frac{\partial v}{\partial t} = c\,\frac{\partial w}{\partial x}, \qquad \frac{\partial w}{\partial t} = c\,\frac{\partial v}{\partial x}.   (14.93)

It can be shown {24} that the rather obvious explicit difference scheme

\frac{v_j^{n+1} - v_j^n}{\Delta t} = c\,\frac{w_{j+1}^n - w_{j-1}^n}{2\Delta x}, \qquad \frac{w_j^{n+1} - w_j^n}{\Delta t} = c\,\frac{v_{j+1}^n - v_{j-1}^n}{2\Delta x},   (14.94)

is always unstable. The instability in this difference scheme may be attributed
to the fact that v_j with odd j are coupled to w_j with even j and vice versa.
Thus, the two sets of function values are independent. We can actually ignore
half the values and in fact, that is what is essentially done in the following
scheme, which is found to be stable. Another alternative is to introduce some
coupling between the two sets, as is done in the Lax-Friedrichs scheme.
To avoid difference quotients over the double space interval 2Δx, we
can consider the values of w at the midpoints of the intervals, i.e., w_{j+1/2},
(j = 0, 1, ..., M-1). We can also treat time intervals in a similar manner, which
gives a more symmetrical and correctly centred difference scheme. However, we
avoid the time centring and write the explicit difference scheme

\frac{v_j^{n+1} - v_j^n}{\Delta t} = c\,\frac{w_{j+1/2}^n - w_{j-1/2}^n}{\Delta x}, \qquad \frac{w_{j-1/2}^{n+1} - w_{j-1/2}^n}{\Delta t} = c\,\frac{v_j^{n+1} - v_{j-1}^{n+1}}{\Delta x}.   (14.95)

It may be noted that, even though v appears at the advanced time step n + 1 in
the right-hand side of the second equation, it is essentially an explicit difference
scheme, since v_j^{n+1} can be calculated explicitly using the first equation before

calculating w_{j-1/2}^{n+1}. In fact, this difference scheme is equivalent to the explicit
scheme considered earlier, if we identify

v_j^n = \frac{u_j^n - u_j^{n-1}}{\Delta t} \qquad \text{and} \qquad w_{j-1/2}^n = c\,\frac{u_j^n - u_{j-1}^n}{\Delta x}.   (14.96)

This equivalence implies that the same stability condition should apply to this
difference scheme also.
The stability of this difference method can be analysed using the Fourier
method. In this case, we look for solutions of the form v_j^n = v_0\,\xi^n e^{imjh} and
w_{j-1/2}^n = w_0\,\xi^n e^{imh(j-1/2)}. Substituting these expressions in the difference equa-
tion, it follows that ξ must be an eigenvalue of the matrix

G = G(\Delta t, m) = \begin{pmatrix} 1 & ia \\ ia & 1 - a^2 \end{pmatrix},   (14.97)

where

a = \frac{2c\Delta t}{\Delta x}\,\sin(m\Delta x/2).   (14.98)

This matrix is usually referred to as the amplification matrix for the difference
method. It can be shown that both eigenvalues of this matrix have an absolute
value of unity, provided a^2 ≤ 4, which is true if the Courant condition is
satisfied. Thus, we get the same stability criterion.
An implicit difference scheme is obtained if we centre these equations in
time to obtain

\frac{v_j^{n+1} - v_j^n}{\Delta t} = \frac{c}{2\Delta x}\left(w_{j+1/2}^{n+1} - w_{j-1/2}^{n+1} + w_{j+1/2}^n - w_{j-1/2}^n\right), \qquad \frac{w_{j-1/2}^{n+1} - w_{j-1/2}^n}{\Delta t} = \frac{c}{2\Delta x}\left(v_j^{n+1} - v_{j-1}^{n+1} + v_j^n - v_{j-1}^n\right),   (14.99)

which can be shown to be equivalent to the implicit difference scheme (14.92)
with θ = 1/4. The amplification matrix for this method is

G = \frac{1}{1 + a^2/4}\begin{pmatrix} 1 - a^2/4 & ia \\ ia & 1 - a^2/4 \end{pmatrix}.   (14.100)

Both eigenvalues of this matrix have unit absolute magnitude for all values of
a, and therefore, this difference method is unconditionally stable.
Alternately, it is possible to stabilise the difference scheme (14.94) if v_j^n
and w_j^n are replaced by their spatial averages on the left-hand side, to obtain

\frac{v_j^{n+1} - \frac{1}{2}(v_{j+1}^n + v_{j-1}^n)}{\Delta t} = c\,\frac{w_{j+1}^n - w_{j-1}^n}{2\Delta x}, \qquad \frac{w_j^{n+1} - \frac{1}{2}(w_{j+1}^n + w_{j-1}^n)}{\Delta t} = c\,\frac{v_{j+1}^n - v_{j-1}^n}{2\Delta x}.   (14.101)

This explicit difference scheme due to Lax can be shown to be stable if the
Courant condition is satisfied. In this case, the amplification matrix is

G = \begin{pmatrix} \cos m\Delta x & i\frac{c\Delta t}{\Delta x}\sin m\Delta x \\ i\frac{c\Delta t}{\Delta x}\sin m\Delta x & \cos m\Delta x \end{pmatrix}.   (14.102)

The eigenvalues of this matrix are

\xi = \cos m\Delta x \pm i\,\frac{c\Delta t}{\Delta x}\sin m\Delta x.   (14.103)

It can be shown that |ξ| ≤ 1, provided cΔt ≤ Δx. It may be noted that if
cΔt = Δx, then ξ(m) = e^{imcΔt}, which is the exact amplification factor for the
wave equation, and in this case there is no truncation error. This difference
method is usually referred to as the Lax-Friedrichs method.
It is interesting to note that, in this case |ξ| < 1 if cΔt < Δx and the amplitude
of the waves decreases with time, while the exact solution of the wave equation
should preserve the amplitude. For mΔx << 1, we can expand

|\xi|^2 = 1 - (1 - s)\sin^2 m\Delta x = 1 - (1 - s)\left(m^2(\Delta x)^2 - \frac{1}{3}m^4(\Delta x)^4 + \cdots\right),   (14.104)

where s = c^2(\Delta t)^2/(\Delta x)^2. For |mΔx| << 1, this factor is close to unity and there
may not be any significant damping. However, for large m the damping could
be significant. For mΔx = π/2, i.e., waves with wavelength four times the mesh
spacing, the damping factor drops to √s. While for mΔx = π, i.e., waves with
wavelength twice the mesh spacing, there is no damping. Thus, the damping
factor is not monotonic, but an oscillatory function of m. For small m we can
expect the amplitude to be damped as ≈ exp(-(1 - s)m^2(Δx)^2 t/2Δt). Since in
general, we are interested in phenomena over a spatial scale of ≈ MΔx, we can
assume mΔx = 1/M. In that case, we expect a significant decrease in amplitude
of the calculated solution after M^2 time steps, which may lead to serious errors, if
integration is to be continued over a long interval of time. However, it should be
noted that, even if such artificial damping is not present, the truncation error
may be of the same order of magnitude. On the other hand, the advantage
of such difference schemes is that the oscillations at smaller length scales are
damped rapidly. Unfortunately, it does not apply to all small scale motions.
To understand the origin of this damping, we can write the difference
scheme in the form

\frac{v_j^{n+1} - v_j^n}{\Delta t} = c\,\frac{w_{j+1}^n - w_{j-1}^n}{2\Delta x} + \frac{(\Delta x)^2}{2\Delta t}\,\frac{v_{j+1}^n - 2v_j^n + v_{j-1}^n}{(\Delta x)^2},   (14.105)

This difference scheme can be considered as a finite difference approximation
to the differential equation

\frac{\partial v}{\partial t} = c\,\frac{\partial w}{\partial x} + \frac{(\Delta x)^2}{2\Delta t}\,\frac{\partial^2 v}{\partial x^2},   (14.106)

which is an equation for damped waves. The last term in the equation provides
a small damping with a diffusion coefficient a = (Δx)^2/(2Δt). This damp-
ing is sometimes referred to as numerical dissipation or numerical viscosity.
Such artificial dissipation is useful in many cases, e.g., the solution of nonlinear
equations leads to shocks, where the solution is discontinuous, and the simplest
technique of treating shocks is to introduce artificial viscosity which smears the
shock.
A finite difference method is called dissipative if for all m, |ξ(m)| ≤ 1 and
|ξ(m)| < 1 for some m. For dissipative methods, the amplitude of some of the
Fourier components will show a spurious decrease, although, this is not the only
source of error in difference schemes. Apart from amplitude, the phase may also
have an error. Thus, if we start with a wave packet which is a superposition
of components with different wavelengths, each of these components may incur
different errors in phase, and the wave packet may show some dispersion. A
difference method is said to be dispersive, if the phase error depends on the
wavelength.
Dissipative difference schemes are useful in many circumstances, particu-
larly for nonlinear problems. For such problems, different Fourier modes may
interact with each other in such a manner, that energy is transferred from
the large scale modes to smaller scale modes. For example, in turbulent flows
the turbulent energy cascades from large eddies to small eddies. The energy
in small scale eddies is then dissipated into internal energy through friction or
viscosity. If a finite difference method is not dissipative or does not have suffi-
cient dissipation, then the energy which is transferred from long wavelength to
short wavelength modes accumulates at the wavelength of 2Δx, since that is the
smallest scale that can be represented. Because of limitations on computer time,
this scale may be much larger than the scale at which the physical dissipative
processes operate. In such a situation, the energy simply keeps accumulating at
this scale and some part of it may get transmitted back to larger scale motions,
resulting in errors even at larger scales, which are of greatest interest. Such
problems can be avoided by using a difference scheme with dissipation, so that
the energy is not allowed to accumulate in the small scale eddies.

14.6 General Hyperbolic Equations


Finite difference methods considered in the previous section can be generalised
to other hyperbolic equations. For Lax-Friedrichs method, it is convenient to
write the equations in conservation law form
\frac{\partial u}{\partial t} + \frac{\partial}{\partial x}f(u) = 0,   (14.107)

where u and f(u) are vectors of length N. The function f(u) may be nonlinear.
The differential equation is represented as a system of first-order differential
equations. In order to represent a given equation in this form, some transfor-
mations may be necessary. For example, the equations of fluid dynamics in one

space variable can be cast in this form (Richtmyer and Morton, 1995). The
physical interpretation of this equation is that, the integral of U over space co-
ordinates is conserved if the flux f(u) vanishes at the boundary. More generally
the time rate of change of the integral is equal to the flux across the boundary,
i.e.
\frac{d}{dt}\int_{x_1}^{x_2} u\, dx = -f(u)\big|_{x=x_2} + f(u)\big|_{x=x_1}.   (14.108)

Thus, f(u) can be interpreted as the flux of the quantity u in the +x direction.
The Lax-Friedrichs method for this equation can be written as

u_j^{n+1} = \frac{1}{2}\left(u_{j-1}^n + u_{j+1}^n\right) - \frac{\Delta t}{2\Delta x}\left(f_{j+1}^n - f_{j-1}^n\right).   (14.109)

However, it is preferable to use the two-stage Lax-Wendroff difference method,


which avoids large numerical dissipation. This difference scheme is defined by

u_{j+1/2}^{n+1/2} = \frac{1}{2}\left(u_{j+1}^n + u_j^n\right) - \frac{\Delta t}{2\Delta x}\left(f_{j+1}^n - f_j^n\right),
u_j^{n+1} = u_j^n - \frac{\Delta t}{\Delta x}\left(f_{j+1/2}^{n+1/2} - f_{j-1/2}^{n+1/2}\right).   (14.110)

It can be shown that this difference scheme has second order accuracy. Further,
it can be easily seen that the difference approximation also satisfies the conser-
vation law (14.108) exactly, with the integral replaced by a summation (using
the trapezoidal rule). It is not possible to analyse the stability for the general
nonlinear equation. But if the equations are linear and f(u) = Au, where A is
an N x N matrix whose coefficients are constants, then it is possible to analyse
the stability using the Fourier method. It can be shown that the difference scheme
is stable, provided |λ|Δt < Δx, where λ is any eigenvalue of the matrix A. If
the differential equation is hyperbolic, then all eigenvalues of A should be real.
In particular, for the wave equation (14.93) the eigenvalues are λ = ±c and
we obtain the Courant condition for stability. For this method the amplification
factor can be shown to be

\xi = 1 - s(1 - \cos m\Delta x) \pm i\sqrt{s}\,\sin m\Delta x,   (14.111)

where s = c^2k^2/h^2. Thus, in this case, artificial damping of waves is much less
as compared to that in the Lax-Friedrichs method, and we can expect better
accuracy. For waves covering the region of interest, a significant damping occurs
after ≈ M^4 time steps.
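A sketch of one step of the two-stage scheme (14.110) in Python is given below; it is only an illustration (the flux routine and the treatment of the end points are left to the caller, as discussed next):

import numpy as np

def lax_wendroff_step(u, flux, dt, dx):
    # One step of the two-stage Lax-Wendroff scheme (14.110) for
    # u_t + f(u)_x = 0; u may have shape (M,) or (M, N) for a system.
    # Only interior points are updated; boundary values must be set
    # separately from the boundary conditions.
    f = flux(u)
    u_half = 0.5*(u[1:] + u[:-1]) - 0.5*dt/dx*(f[1:] - f[:-1])
    f_half = flux(u_half)                 # fluxes at (j+1/2, n+1/2)
    unew = u.copy()
    unew[1:-1] = u[1:-1] - dt/dx*(f_half[1:] - f_half[:-1])
    return unew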
It should be noted that this difference scheme is second order in x, while
the differential equation is first order in x. Hence, an extra boundary condition
is required to continue the integration. This extra boundary condition must be
specified carefully, otherwise it may lead to some instability {20}. If Dirichlet
boundary conditions are specified at both end points, then the solution can be
easily obtained at all interior points using the difference equations. For more
general boundary conditions, it may be necessary to introduce fictitious points

outside the required region. For example, we can apply the boundary condi-
tions on the flux f instead of the solution u. Consider the boundary condition
f(0,t) = g(t) on some component of the flux. To apply this boundary condi-
tion, we introduce a fictitious point x_{-1/2} at the intermediate stage and write
the boundary condition as

\frac{1}{2}\left(f_{-1/2}^{n+1/2} + f_{1/2}^{n+1/2}\right) = g(t_{n+1/2}).   (14.112)

Thus, we can determine f_{-1/2}^{n+1/2}, which in turn can be used to calculate the
boundary value u_0^{n+1}, using the second equation in (14.110). This process is
implemented in the subroutine LAX in Appendix B, which incorporates a more
general boundary condition of the form

a(t)\,f(0,t) + b(t)\,\frac{\partial}{\partial x}f(0,t) = c(t).   (14.113)

This boundary condition can be discretised to give the difference equation

a(t_{n+1/2})\,\frac{f_{-1/2}^{n+1/2} + f_{1/2}^{n+1/2}}{2} + b(t_{n+1/2})\,\frac{f_{1/2}^{n+1/2} - f_{-1/2}^{n+1/2}}{\Delta x} = c(t_{n+1/2}).   (14.114)

It should be noted that if the coefficients of the boundary condition on the flux com-
ponents are time dependent, then the resulting difference scheme may have an
accuracy of O(Δt) only, because u^{n+1/2} is just some intermediate quantity and
not the solution at t = t_{n+1/2}.
EXAMPLE 14.3: Solve the wave equation

\frac{\partial^2 u}{\partial t^2} = \frac{\partial^2 u}{\partial x^2}, \qquad 0 \le x \le 1, \quad t > 0,   (14.115)

with the initial and boundary conditions

u(x,0) = \sin\pi x, \qquad \frac{\partial u}{\partial t}(x,0) = 0, \qquad u(0,t) = 0, \qquad u(1,t) = 0.   (14.116)

The exact solution is a standing wave with u(x,t) = sin πx cos πt, which may repre-
sent the fundamental mode of oscillation in a string fixed at both ends. Since the problem is
symmetric about x = 1/2, we can consider only half the range in the x direction. As in Exam-
ple 14.1, we can apply the boundary conditions at x = 0 and x = 1/2. At x = 1/2 because of
symmetry we can use the boundary condition ∂u/∂x = 0. We use Δx = 0.1, requiring only
five points. In order to incorporate the boundary condition at x = 1/2, we define an extra
mesh point x_6 = 0.6 and demand u_6^n = u_4^n. Using this relation, it is possible to write the
finite difference equation at x = x_5 = 0.5, thus giving us the required number of equations
to find the solution.
First we use the explicit difference scheme (14.83) to integrate the equation. The results
at some fixed times are shown in Table 14.2, which also gives the exact solution. This difference
scheme is unstable if Δt > Δx and if such a time step is used, then the solution grows
exponentially, displaying small scale spatial oscillations. This instability is evident in the first
set of results. At smaller time steps, the truncation error is controlled and interestingly when
the time step is increased from 0.05 to 0.1, the truncation error drops significantly. In fact,
for Δt = 0.1, there is practically no error in the computed values, even when the integration
is continued for 10000 time steps. This confirms our result that the explicit method has no

Table 14.2: Solving wave equation using finite difference methods

θ      Δt     t         u(0.1,t)    u(0.2,t)    u(0.3,t)    u(0.4,t)    u(0.5,t)

0.00 0.11 1.6500 0.141690 0.268928 0.370950 0.435135 0.458520


0.00 0.11 2.2000 0.239374 0.491548 0.626689 0.795341 0.774630
0.00 0.11 2.7500 0.371383 -1.543920 0.972292 -2.498120 1.201820
0.00 0.11 3.3000 -36.919600 69.542800 -96.656700 112.523000 -119.474000

0.00 0.05 3.5000 -0.010494 -0.019960 -0.027473 -0.032296 -0.033959


0.00 0.10 3.5000 0.000000 5.96 x 10^-8 0.000000 5.96 x 10^-8 0.000000
0.25 0.05 3.5000 -0.020816 -0.039593 -0.054496 -0.064064 -0.067361
0.25 0.10 3.5000 -0.041039 -0.078060 -0.107440 -0.126304 -0.132803
Exact solution 0.000000 0.000000 0.000000 0.000000 0.000000

0.00 0.05 3.7500 0.210413 0.400230 0.550869 0.647585 0.680912


0.25 0.05 3.7500 0.202170 0.384550 0.529288 0.622215 0.654236
Exact solution 0.218506 0.415624 0.572057 0.672494 0.707102

0.00 0.05 4.0000 0.308784 0.587342 0.808407 0.950340 0.999246


0.00 0.05 20.0000 0.303215 0.576750 0.793827 0.933199 0.981223
0.00 0.05 50.0000 0.273347 0.519939 0.715635 0.841280 0.884574
0.00 0.05 100.0000 0.174575 0.332063 0.457047 0.537290 0.564942
0.00 0.05 160.0000 0.005594 0.010638 0.014642 0.017212 0.018097
0.00 0.05 200.0000 -0.111769 -0.212595 -0.292612 -0.343985 -0.361686
0.00 0.05 250.0000 -0.233237 -0.443641 -0.610619 -0.717826 -0.754765
0.00 0.05 324.0000 -0.309016 -0.587785 -0.809017 -0.951056 -1.000000
0.00 0.10 4.0000 0.309017 0.587785 0.809017 0.951057 1.000000
0.00 0.10 1000.0000 0.309017 0.587785 0.809017 0.951057 1.000000
0.25 0.05 4.0000 0.308100 0.586042 0.806617 0.948235 0.997034
0.25 0.05 20.0000 0.286371 0.544712 0.749730 0.881362 0.926718
0.25 0.05 50.0000 0.176459 0.335646 0.461977 0.543085 0.571036
0.25 0.05 80.0000 0.009251 0.017599 0.024222 0.028475 0.029942
0.25 0.05 100.0000 -0.107488 -0.204458 -0.281409 -0.330819 -0.347844
0.25 0.05 164.0000 -0.308973 -0.587701 -0.808899 -0.950920 -0.999854
0.25 0.10 4.0000 0.305444 0.580988 0.799661 0.940058 0.988436
0.25 0.10 20.0000 0.223747 0.425591 0.585777 0.688621 0.724060
0.25 0.10 40.0000 0.014997 0.028525 0.039261 0.046154 0.048529
0.25 0.10 60.0000 -0.202029 -0.384281 -0.528919 -0.621781 -0.653779
0.25 0.10 82.0000 -0.308946 -0.587650 -0.808832 -0.950839 -0.999771
Exact solution 0.309017 0.587785 0.809017 0.951057 1.000000

truncation error for the simple wave equation when Δt = Δx. Further, it may be noted that,
even for Δt = 0.05, there is no systematic change in the amplitude over several time steps.
Looking at the computed solution at time intervals of one period, it may appear that the
amplitude is decreasing slowly with time. However, this decrease is actually due to a small
truncation error in the phase. It is quite clear from the table that in this case, the phase of
the wave drifts by π in approximately 6500 time steps covering t ≈ 324.
Using an implicit method with θ = 1/4, it is possible to use larger time steps. In this
case, the truncation error increases with time step and there is no reduction in error when

Table 14.3: Solving wave equation using finite difference methods

Δt     t           w(0.0,t)   w(0.1,t)   w(0.2,t)   w(0.3,t)   w(0.4,t)

Using Lax-Friedrichs difference scheme
0.05   2.0000      0.223368   0.212435   0.180708   0.131292   0.069024
0.05   4.0000      0.048611   0.046232   0.039327   0.028573   0.015022
0.05   6.0000      0.010286   0.009782   0.008321   0.006046   0.003178
0.05   8.0000      0.002107   0.002004   0.001705   0.001239   0.000651
0.05   10.0000     0.000415   0.000395   0.000336   0.000244   0.000128
0.10   4.0000      1.000000   0.951057   0.809017   0.587785   0.309017
0.10   1000.0000   1.000000   0.951057   0.809017   0.587785   0.309017
Exact solution     1.000000   0.951057   0.809017   0.587785   0.309017

Using Lax-Wendroff difference scheme
0.05   3.5000     -0.131533  -0.125095  -0.106412  -0.077313  -0.040646
0.10   3.5000      0.000000   0.000000   0.000000   0.000000   0.000000
Exact solution     0.000000   0.000000   0.000000   0.000000   0.000000
0.05   3.7500      0.588640   0.559833   0.476222   0.345996   0.181901
Exact solution     0.707102   0.672494   0.572057   0.415624   0.218506
0.05   4.0000      0.970692   0.923183   0.785307   0.570559   0.299960
0.05   20.0000     0.658875   0.626628   0.533041   0.387277   0.203604
0.05   40.0000     0.032712   0.031111   0.026465   0.019228   0.010109
0.05   60.0000    -0.507397  -0.482564  -0.410493  -0.298241  -0.156794
0.05   78.0000    -0.695961  -0.661898  -0.563044  -0.409076  -0.215064
0.05   124.0000    0.020435   0.019435   0.016532   0.012011   0.006315
0.05   162.0000    0.481430   0.457867   0.389485   0.282977   0.148770
0.10   4.0000      1.000000   0.951056   0.809017   0.587785   0.309017
0.10   1000.0000   1.000000   0.951056   0.809017   0.587785   0.309017
Exact solution     1.000000   0.951057   0.809017   0.587785   0.309017

Δt = Δx. Even for Δt = 0.05 the error is larger than that for the explicit method and
the phase is drifting at a rate which is twice that for the explicit method. In fact, the error
will usually force us to use a step size smaller than or comparable to that required by the
Courant condition, and we do not gain much by using the implicit method. It should be noted
that solving the implicit difference equations requires much more effort at each time step as
compared to that for the explicit method. Hence, for the wave equation it may be advisable
to use an explicit method, with the step size controlled to ensure stability. It should be noted
that, since the method requires values at two previous time steps, it is not possible to change
the time step freely.
In order to allow the facility of changing the time step freely, we can rewrite the equation
as a system of two first-order equations by defining v = ∂u/∂t and w = ∂u/∂x, which gives
the differential equations

\frac{\partial v}{\partial t} = \frac{\partial w}{\partial x}, \qquad \frac{\partial w}{\partial t} = \frac{\partial v}{\partial x}.   (14.117)

Ignoring a factor of π, the initial and boundary conditions are

v(x,0) = 0, \quad w(x,0) = \cos\pi x, \qquad v(0,t) = 0, \quad \frac{\partial w}{\partial x}(0,t) = 0, \quad \frac{\partial v}{\partial x}(\tfrac{1}{2},t) = 0, \quad w(\tfrac{1}{2},t) = 0.   (14.118)

Here, it should be noted that the boundary conditions are not really independent. For exam-
ple, v(0,t) = 0 implies ∂w/∂x = 0, using the first differential equation. Similarly, w(0.5,t) = 0
implies ∂v/∂x = 0. However, in order to use the Lax-Friedrichs or Lax-Wendroff methods, we
need these extra boundary conditions in order to compute the solution. The results obtained
using these methods are shown in Table 14.3.
The Lax-Friedrichs method has a significant numerical dissipation if Δt < Δx and as
a result the amplitude decreases with time. It can be seen from the table that the decrease
in amplitude is fairly rapid and it is not possible to estimate the error in phase. In just one
period requiring 40 time steps with Δt = 0.05, the amplitude decreases by almost a factor
of five. When Δt = Δx = 0.1, there is no dissipation and in fact the difference scheme gives
exact results.
For the Lax-Wendroff method, numerical dissipation is much less and we get better
results using Δt = 0.05. It is possible to estimate the error in phase by using the computed
values. From Table 14.3 it can be seen that the phase shifts by π in about 1560 time steps,
where the amplitude falls to ≈ 0.7. Thus, in this case, the errors in phase and amplitude
are roughly comparable. A comparison with Table 14.2 shows that the phase error in the Lax-
Wendroff method is about four times that in the explicit difference method when Δt = 0.05.
The Lax-Wendroff difference scheme is also explicit, but it involves two stages at every time
step, which requires more effort as compared to that for the simple explicit difference scheme
(14.83). Once again, if Δt = Δx = 0.1 we get exact results.
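To make the discussion concrete, here is a minimal Python sketch (not one of the Fortran
routines of Appendix B) of the Lax-Friedrichs scheme applied to the system (14.117) with the
conditions (14.118); the grid, the time step and the first-order one-sided treatment of the
derivative boundary conditions are illustrative assumptions.

    import numpy as np

    # Lax-Friedrichs for v_t = w_x, w_t = v_x on 0 <= x <= 1/2, as in (14.117)-(14.118).
    # Grid, step sizes and the one-sided boundary treatment are illustrative assumptions.
    dx, dt, tmax = 0.1, 0.05, 4.0
    x = np.arange(0.0, 0.5 + dx / 2, dx)
    v = np.zeros_like(x)                     # v(x,0) = 0
    w = np.cos(np.pi * x)                    # w(x,0) = cos(pi*x)
    c = dt / (2.0 * dx)
    for n in range(int(round(tmax / dt))):
        vn, wn = v.copy(), w.copy()
        v[1:-1] = 0.5 * (vn[2:] + vn[:-2]) + c * (wn[2:] - wn[:-2])
        w[1:-1] = 0.5 * (wn[2:] + wn[:-2]) + c * (vn[2:] - vn[:-2])
        v[0], w[-1] = 0.0, 0.0               # v(0,t) = 0, w(1/2,t) = 0
        w[0] = w[1]                          # dw/dx(0,t) = 0, one-sided approximation
        v[-1] = v[-2]                        # dv/dx(1/2,t) = 0
    print(np.round(w, 6))                    # compare with the t = 4 row of Table 14.3

Because the boundary conditions are imposed only to first order here, the printed values will
be close to, but not necessarily identical with, the corresponding row of Table 14.3; setting
dt = dx = 0.1 in the same loop illustrates the complete absence of dissipation noted above.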

For a first-order hyperbolic equation, there could be other instabilities.


For example, the Eulerian equations of fluid mechanics have terms of the form
∂ρ/∂t + v(∂ρ/∂x), where v is the fluid velocity and ρ could be any variable like
the density, internal energy or even the velocity v itself. It is not satisfactory
to use a central difference approximation for the spatial derivative. It can be
easily verified that if this term is approximated as

    (ρ_j^{n+1} − ρ_j^n)/Δt + v_j^{n+1} (ρ_{j+1}^n − ρ_{j−1}^n)/(2Δx),            (14.119)
then the situation is similar to (14.94) and the difference method is always
unstable. Lelevier has suggested the use of upwind differencing to overcome
this problem. In this technique, we use a forward or backward difference to
approximate the spatial derivative, depending on whether v < 0 or v > 0,
giving

    (ρ_j^{n+1} − ρ_j^n)/Δt + v_j^{n+1} × { (ρ_{j+1}^n − ρ_j^n)/Δx,   if v_j^{n+1} < 0;
                                           (ρ_j^n − ρ_{j−1}^n)/Δx,   if v_j^{n+1} > 0.      (14.120)

If v is constant, then the amplification factor for this difference scheme (assum-
ing the right-hand side of the equation to be zero) is given by

    ξ = 1 − |vΔt/Δx| (1 − cos mΔx) − i (vΔt/Δx) sin mΔx,                         (14.121)

which gives

    |ξ|² = 1 − 2 |vΔt/Δx| (1 − |vΔt/Δx|) (1 − cos mΔx).                          (14.122)

For stability we require |ξ|² ≤ 1, which is just the Courant condition. This
result can be physically understood by noting that in the central difference
scheme the solution at Xj is affected by values at both Xj-l and Xj+l, while if
velocity v is positive we cannot expect the disturbances at Xj+l to affect Xj at
a later time. This problem is taken care of in upwind differencing, but it should
be noted that upwind differencing gives only first order accuracy.
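The following Python sketch applies the backward-difference branch of (14.120) to a single
advection equation with constant v > 0; the parameters and the periodic boundary are
assumptions made only for illustration.

    import numpy as np

    # First-order upwind differencing for rho_t + v*rho_x = 0 with constant v > 0
    # (illustrative parameters; a periodic boundary is assumed for simplicity).
    v, dx, dt = 1.0, 0.02, 0.01              # v*dt/dx = 0.5 satisfies the Courant condition
    x = np.arange(0.0, 1.0, dx)
    rho = np.exp(-100.0 * (x - 0.5) ** 2)    # initial profile
    for n in range(100):
        # backward difference for v > 0; np.roll provides the periodic wrap-around
        rho -= v * dt / dx * (rho - np.roll(rho, 1))

For v < 0 the forward difference would be used instead; the scheme is stable for
|v|Δt/Δx ≤ 1 but, as noted above, it is only first-order accurate.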
In principle, it may be possible to use the method of lines for hyperbolic
equations. However, if the spatial derivatives are not approximated in a stable
manner, then the solution may lead to instability, irrespective of the time step
used in the solution of the resulting system of ordinary differential equations.
For example, if the spatial derivative has to be approximated by a forward or
backward difference depending on the sign of v, then the resulting system of
ordinary differential equations will need to be changed when v changes sign at
any of the mesh points. This is obviously not a satisfactory situation. Specification
of boundary conditions also poses problems for hyperbolic equations,
because most convenient approximations for spatial derivatives are of higher
order than the differential equation, requiring additional boundary conditions.
The Lax-Wendroff method can be generalised to equations involving more
than one space variables. For example, consider the system of first-order equa-
tions in two space variables
    ∂u/∂t + ∂f(u)/∂x + ∂g(u)/∂y = 0.                                            (14.123)

Here f(u) and g(u) can be interpreted as the x and y components of the
flux. In this case, it is convenient to change the notation slightly, to avoid
the fractional indices by thinking of the unit cell of the calculational net as one
having dimensions 2Δx, 2Δy and 2Δt in the x, y and t coordinates, respectively.
Hence, we calculate provisional values at t = t_{n+1} before proceeding to t_{n+2}.
The Lax-Wendroff difference scheme can be written as
    u_{jl}^{n+1} = (1/4)(u_{j+1,l}^n + u_{j−1,l}^n + u_{j,l+1}^n + u_{j,l−1}^n)
                   − (Δt/(2Δx)) (f_{j+1,l}^n − f_{j−1,l}^n) − (Δt/(2Δy)) (g_{j,l+1}^n − g_{j,l−1}^n),
                                                                                 (14.124)
    u_{jl}^{n+2} = u_{jl}^n − (Δt/Δx) (f_{j+1,l}^{n+1} − f_{j−1,l}^{n+1}) − (Δt/Δy) (g_{j,l+1}^{n+1} − g_{j,l−1}^{n+1}).
These difference equations give no coupling between the set of net points having
even values of n + j + l and that having odd values. Half the net points can
therefore be omitted if desired. For the simple case of wave equation in two space
dimensions, with Δx = Δy it can be shown that these difference equations are
stable, provided
(14.125)

This difference scheme can be easily generalised to more than two space vari-
ables.
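As an illustration, here is a Python sketch of the two-step scheme (14.124) for the simplest
case f(u) = a u, g(u) = b u (scalar advection), with assumed parameters and periodic
boundaries; it is meant only to show the structure of the two stages.

    import numpy as np

    # Two-step Lax-Wendroff (14.124) for the scalar equation u_t + a*u_x + b*u_y = 0,
    # i.e. f(u) = a*u and g(u) = b*u; a sketch with assumed parameters and periodic
    # boundaries.  Each pass through the loop advances the solution by 2*dt.
    a, b, dx, dy, dt = 1.0, 0.5, 0.05, 0.05, 0.02
    x = np.arange(0.0, 1.0, dx)
    y = np.arange(0.0, 1.0, dy)
    u = np.exp(-50.0 * ((x[:, None] - 0.5) ** 2 + (y[None, :] - 0.5) ** 2))

    def nb(v, jx, jy):                       # periodic neighbour value at (j+jx, l+jy)
        return np.roll(np.roll(v, -jx, axis=0), -jy, axis=1)

    for n in range(50):
        f, g = a * u, b * u
        # provisional values at the intermediate time level
        uh = 0.25 * (nb(u, 1, 0) + nb(u, -1, 0) + nb(u, 0, 1) + nb(u, 0, -1)) \
             - dt / (2 * dx) * (nb(f, 1, 0) - nb(f, -1, 0)) \
             - dt / (2 * dy) * (nb(g, 0, 1) - nb(g, 0, -1))
        fh, gh = a * uh, b * uh
        # full step from the original time level using the provisional fluxes
        u = u - dt / dx * (nb(fh, 1, 0) - nb(fh, -1, 0)) \
              - dt / dy * (nb(gh, 0, 1) - nb(gh, 0, -1))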

14.7 Elliptic Equations


In this section, we consider the boundary value problems involving elliptic dif-
ferential equations. Application of finite difference methods to these problems
leads to a system of algebraic equations, which need to be solved. These al-
gebraic equations are linear or nonlinear depending on whether the original
differential equation is linear or nonlinear. In this section, we consider only the
linear problems. The main difficulty in solving linear problems arises because
of the large size of the algebraic system. Even though the matrix is sparse,
the nonzero elements are usually not confined to positions close to the diago-
nal. Hence, it is not easy to use the sparseness of the matrix to improve the
efficiency of the direct methods for solving a system of linear equations. As a
result, iterative methods have been used to solve such systems. However, with
significant developments in direct solution of sparse matrices the scene is chang-
ing in favour of the direct methods. Nevertheless, these techniques are beyond
the scope of this book and we will consider iterative methods for solution of
such sparse matrices. In this section, we describe the basic difference equations,
while the next two sections deal with iterative methods for solving the result-
ing system of linear equations. A direct method based on the FFT algorithm
is described in Section 14.10.
Let us consider Poisson's equation in two dimensions

    ∂²u/∂x² + ∂²u/∂y² = f(x, y),                                                 (14.126)

where f(x, y) is some known function. If f(x, y) ≡ 0, then we get Laplace's


equation. The boundary conditions can be specified on some closed curve in
the (x, y)-plane. In the simplest case, the boundary conditions may be specified
over a rectangle. The simplest boundary conditions are the Dirichlet conditions

    u(0, y) = h₁(y),   u(a, y) = h₂(y),   u(x, 0) = g₁(x),   u(x, b) = g₂(x).    (14.127)
In general, the boundary conditions may also involve the normal derivative.
The simplest difference approximation is obtained by replacing the deriva-
tives with centred difference approximations, to get

    (u_{j+1,l} − 2u_{jl} + u_{j−1,l})/(Δx)² + (u_{j,l+1} − 2u_{jl} + u_{j,l−1})/(Δy)² = f(x_j, y_l).      (14.128)
Here j = 1, ... , Mx - 1 and l = 1, ... , My - 1, where Mx + 1 and My + 1
are the number of grid points in the x and y directions. For j = 0, Mx or
l = 0, My the relevant equations can be obtained using the boundary conditions.
If Dirichlet boundary conditions are specified, then these equations provide
(Mx - 1)(My - 1) equations in equal number of unknowns. This system of
linear equations can be solved to find the approximate solution. It may be
noted that each equation involves the solution at five mesh points. But it is
not possible to order the mesh points in such a way that these are the five

consecutive elements. If Δx = Δy = h, then the difference equation can be
simplified to

    u_{j+1,l} + u_{j−1,l} + u_{j,l+1} + u_{j,l−1} − 4u_{jl} = h² f(x_j, y_l).    (14.129)

For Laplace's equation the right-hand side vanishes and the finite difference
equation essentially implies that the solution at any point is the average of
those at the four nearest points. This equation is essentially a finite difference
representation of the well-known averaging property of the Laplace's equation.
It can be easily seen that the truncation error in this finite difference approxi-
mation is O((Δx)² + (Δy)²).
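For illustration, the following Python/SciPy sketch assembles the five-point equations
(14.129) for the unit square with homogeneous Dirichlet conditions in a natural ordering of
the interior points and solves them directly; the right-hand side chosen below, for which the
exact solution is sin(πx) sin(πy), is an assumption made for the example.

    import numpy as np
    import scipy.sparse as sp
    import scipy.sparse.linalg as spla

    def poisson_five_point(M, f):
        """Assemble and solve (14.129) on the unit square with u = 0 on the
        boundary; M intervals per side, f a function of (x, y)."""
        h = 1.0 / M
        x = np.arange(1, M) * h                      # interior points only
        X, Y = np.meshgrid(x, x, indexing="ij")
        T = sp.diags([1, -4, 1], [-1, 0, 1], shape=(M - 1, M - 1))
        I = sp.identity(M - 1)
        # block tridiagonal matrix: T blocks on the diagonal, identity blocks off it
        A = sp.kron(I, T) + sp.kron(sp.diags([1, 1], [-1, 1], shape=(M - 1, M - 1)), I)
        b = h * h * f(X, Y).ravel()
        u = spla.spsolve(A.tocsr(), b)
        return u.reshape(M - 1, M - 1)

    # example: f = -2*pi^2*sin(pi*x)*sin(pi*y), for which u = sin(pi*x)*sin(pi*y)
    u = poisson_five_point(20, lambda x, y: -2 * np.pi**2 * np.sin(np.pi * x) * np.sin(np.pi * y))

The same block structure of the matrix reappears explicitly in (14.134) and (14.135) below.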
If a more general boundary condition of the form

(14.130)

is specified at x = 0, then once again, we can define a fictitious point x_{−1}
outside the given region and approximate the boundary condition by

(14.131)

This equation coupled with the finite difference equation (14.128) for j = 0
provides the required number of equations. Boundary conditions of this form
at other boundaries can be dealt with similarly. It may be noted that solution
of Poisson's equation with Neumann boundary conditions over a closed curve
is not well posed. In that case, the solution is not unique, since an arbitrary
constant can be added to the solution. In such situations, the finite difference
matrix can be singular, leading to difficulties in numerical solution. In principle,
we can use the singular value decomposition to find an admissible solution in
such cases, but the computation involved may be prohibitive.
For a general second-order elliptic equation, we can approximate the other
derivatives by

    ∂²u/∂x∂y ≈ (u_{j+1,l+1} − u_{j−1,l+1} − u_{j+1,l−1} + u_{j−1,l−1})/(4ΔxΔy),
                                                                                 (14.132)
    ∂u/∂x ≈ (u_{j+1,l} − u_{j−1,l})/(2Δx),        ∂u/∂y ≈ (u_{j,l+1} − u_{j,l−1})/(2Δy).
All these approximations are accurate to second order in Δx and Δy. It should
be noted that the presence of the cross derivative yields a finite difference
equation involving nine points.
To write the finite difference equations in a matrix form, we need to order
the mesh points to obtain a vector out of u_{jl}. The most natural choice is to use the
usual order in which, for example, Fortran stores a two-dimensional array.
Thus, if the ith element of the array represents u_{jl}, then i = j + 1 + l(Mx + 1),
where j = 0, 1, ..., Mx and l = 0, 1, ..., My, while i = 1, ..., (Mx + 1)(My + 1).
Depending on the boundary conditions some of the equations can be dropped.

For example, if boundary conditions are of Dirichlet type, then we can drop
j = 0, Mx and l = 0, My, which will only change the effective value of Mx and
My. This may not be the optimum choice for ordering the equations, in the
sense that, the bandwidth of the resulting matrix may be reduced by choosing
a different order. Nevertheless, we shall not consider reordering these equations.
Using a simple five-point or nine-point difference approximation, this
choice for ordering generally leads to a block tridiagonal matrix with each
block being a (Mx + 1) x (Mx + 1) tridiagonal matrix. For Poisson's equation
using the five-point difference approximation with Δx = Δy, a typical matrix
with Mx = My = 6 with Dirichlet boundary condition is

-4 1 1
1 -4 1 1
1 -4 1 1
1 -4 1 1
1 -4 1
1 -4 1 1
1 1 -4 1 1
1 1 -4 1 1
1 1 -4 1 1
1 1 -4 1
1 -4 1 1
1 1 -4 1 1
1 1 -4 1 1
1 1 -4 1 1
1 1 -4 1
1 -4 1 1
1 1 -4 1 1
1 1 -4 1 1
1 1 -4 1 1
1 1 -4 1
1 -4 1
1 1 -4 1
1 1 -4 1
1 1 -4 1
1 1 -4

where only nonzero elements are shown explicitly. For a general second-order
linear elliptic equation

    a(x,y) ∂²u/∂x² + b(x,y) ∂²u/∂x∂y + c(x,y) ∂²u/∂y² + d(x,y) ∂u/∂x + e(x,y) ∂u/∂y
                                                          + f(x,y) u = g(x,y),  (14.133)
we need to use the nine-point approximation and the resulting matrix is similar
to that shown above, except for the fact that the coefficients are not constants
and the subdiagonal block matrices are also tridiagonal rather than diagonal.
In general, this matrix can be written as a block tridiagonal matrix

    A = ( A₁  B₁                  )
        ( C₂  A₂  B₂              )
        (      ⋱    ⋱    ⋱       )                                              (14.134)
        (           C_{My+1}  A_{My+1} )

where for Poisson's equation with Δx = Δy, C_i = B_i = I, the identity matrix
of order Mx + 1 and A_i is the tridiagonal matrix

    A_i = ( -4   1              )
          (  1  -4   1          )
          (      ⋱   ⋱   ⋱    )                                                (14.135)
          (           1  -4  1  )
          (               1 -4  )
Depending on the boundary conditions the matrices in the first and last block
may be slightly different but the pattern of nonzero elements will be the same. If
we attempt to solve the finite difference equations using Gaussian elimination,
the zero elements outside the block triangular part can be easily accounted for,
if the matrix is treated as a band matrix with a bandwidth of Mx + 1. However, a
large fraction of elements within the band are also zero initially. These elements
will in general become nonzero during elimination. Thus, even if we neglect
pivoting, the elimination process requires O(Mx³My) floating-point operations.
If Mx = My, then the total number of unknowns N ≈ Mx². Thus, we need
O(N²) arithmetic operations to complete the elimination. Further, all 2Mx²My ≈
2N^{3/2} elements within the band need to be stored. Since in practical problems
the number of mesh points could be very large (say N ≈ 10⁴ or more), it requires
substantial computing resources. With more sophisticated techniques for direct
solution, it may be possible to reduce the number of arithmetic operations, as
well as the memory utilisation to some extent.
As we will see in the next section, simple iterative methods require compa-
rable or even larger number of arithmetic operations for a reasonable accuracy.
However, these methods require very little memory, apart from storage for N
components of the solution. In fact, for simple equations, where the coefficients
of the equation matrix can be easily generated, the coefficients need not be
stored. Hence, if memory of the computer is limited, it may be better to use
iterative methods. For ill-conditioned equations the number of iterations re-
quired can be so large, that it may be more efficient to use a direct method like
Gaussian elimination with partial pivoting. Similarly, when solution is required
for the same equation, but with different right-hand sides, it is more efficient to
use direct methods, since the elimination needs to be performed only once. This
situation arises in the solution of eigenvalue problems. Again if the solution is
required to high accuracy, it may be more efficient to use direct methods, since
the iterative methods require a large number of iterations. For matrices of more
complicated form arising from say irregular boundaries or higher order differ-
ence approximations, there may be no theory to guide the choice of optimum
iterative methods and it may be more efficient to use a direct method.
To obtain higher order accuracy, we can apply the method of deferred cor-
rection or the h → 0 extrapolation in a manner similar to that for the bound-
ary value problems in ordinary differential equations. For h → 0 extrapolation,
it is convenient to refine the mesh such that, there is only one independent
parameter. For example, we can keep Δx/Δy constant and expand the trun-
cation error in powers of Δx. Alternately, we can use higher order difference

approximations. These higher order methods are useful only if the solution is
sufficiently smooth. Thus, for Poisson's equation assuming Δx = Δy = h, we
can use the nine-point difference approximation

    4(u_{j+1,l} + u_{j−1,l} + u_{j,l+1} + u_{j,l−1}) − 20u_{jl} + u_{j+1,l+1} + u_{j−1,l+1} + u_{j+1,l−1}
                                                      + u_{j−1,l−1} = 6h² f(x_j, y_l).
                                                                                 (14.136)
It can be shown that the truncation error in this difference approximation is

    (h²/12) ∇⁴u + O(h⁴),                                                         (14.137)

where

    ∇⁴u = ∂⁴u/∂x⁴ + 2 ∂⁴u/∂x²∂y² + ∂⁴u/∂y⁴ = ∇²(∇²u).                            (14.138)

For Laplace's equation ∇²u = 0, and the first term in (14.137) vanishes. In
fact, for Laplace's equation it can be shown that the O(h⁴) term also vanishes,
giving an error of O(h⁶). For Poisson's equation we can use ∇⁴u = ∇²f and
approximate ∇²f using the five-point approximation to achieve an accuracy of
O(h⁴) {40}.
For a general elliptic equation, it may not be possible to achieve higher or-
der accuracy using only nine-point difference approximation. To achieve higher
order accuracy, it may be necessary to use more points in the difference equa-
tion. But in that case, extra boundary conditions will be required to obtain
sufficient equations for solving the problem. It may be noted that a nine-point
formula does not require any extra boundary conditions. Further, if a higher
order formula is to be used, then the boundary conditions also need to be ap-
proximated to the same order of accuracy. For Dirichlet boundary conditions,
there is no difficulty as long as the boundaries are rectangular, but for Neu-
mann boundary condition it will be necessary to approximate the derivative to
the same accuracy as that of the difference scheme, which may pose some diffi-
culty. For example, the nine-point formula can give an accuracy of O(h⁴), but
representing the Neumann boundary condition to that accuracy may require
five points in one direction.
If the boundary values are specified over a curved boundary, then the
situation is more difficult. If the boundary conditions are of Dirichlet type, then
at the end point we can use divided differences to approximate the derivatives.
For example, consider a typical point B shown in Figure 14.4. Let the length
BC = αΔx, where 0 < α < 1. Using Taylor series about the point B, we get

    u_A = u_B − Δx u_x + (Δx)²/2 u_xx + O((Δx)³),
                                                                                 (14.139)
    u_C = u_B + αΔx u_x + (αΔx)²/2 u_xx + O((Δx)³),

[Figure 14.4: Applying boundary conditions at irregular boundaries. Left: Dirichlet boundary
condition; right: Neumann boundary condition.]

where u_x and u_xx are the derivatives evaluated at the point B. From these
expansions we get the derivatives

    u_x  = (1/Δx) [ u_C/(α(α+1)) − ((1−α)/α) u_B − (α/(1+α)) u_A ] + O((Δx)²),
                                                                                 (14.140)
    u_xx = (2/(Δx)²) [ u_C/(α(1+α)) − u_B/α + u_A/(1+α) ] + O(Δx).
If the boundary condition is of Dirichlet form, then the value u_C is known
and these expressions for derivatives can be used to write the finite difference
equation at the point B. It should be noted that the approximation to the
second derivative is accurate to only first order in Δx and there will be some
loss in accuracy. The error in boundary condition will affect the solution over
the entire region and in this case, we can hope to achieve an accuracy of O(Δx)
only. Similar expressions can be obtained for derivatives with respect to y.
Another alternative is to introduce a fictitious point D outside the bound-
ary, such that BD is the normal spacing Δx. The function value u_D at this
point can be found using linear interpolation

    u_D = [u_C − (1 − α) u_B]/α + O((Δx)²),                                       (14.141)

or quadratic interpolation

    u_D = ((1−α)/(1+α)) u_A − (2(1−α)/α) u_B + (2/(α(1+α))) u_C + O((Δx)³).       (14.142)

Similar expressions can be written down for the point C. Using the known
boundary values u_C and u_F, we can write the usual finite difference equation
at the point B with uniform spacing. In this technique, we essentially embed

the required area in a region covered by rectangles by defining fictitious grid


points wherever necessary.
We now consider the case where the boundary condition is of the Neumann
type, with the normal derivative of u, ∂_n u, specified on an irregular boundary.
Here the problem is more complicated, since the direction of the normal may
not coincide with either of the axes, and approximating the normal derivative in
terms of the function values at mesh points is not straightforward. In this case,
it is convenient to introduce fictitious points outside the boundary. Consider the
situation shown on the right-hand side of Figure 14.4. We first approximate the
boundary by a piecewise linear curve EF. From the point D draw the normal
to EF and let us assume that the angle ADH = θ. Now the normal derivative
at the point G can be approximated by the divided difference

    ∂_n u_G = (u_D − u_H)/(Δx sec θ) + O(Δx).                                     (14.143)

The value u_H at point H can be obtained using the linear interpolation

    u_H = [(Δy − Δx tan θ) u_A + (Δx tan θ) u_B]/Δy.                              (14.144)

From (14.143) and (14.144), we get

    u_D = [(Δy − Δx tan θ) u_A + (Δx tan θ) u_B]/Δy + (Δx sec θ) ∂_n u_G + O(Δx).  (14.145)

A similar exercise can be repeated for point C. If the normal derivative ∂_n u_G is
specified, then using this expression we can write the finite difference equation
at point A. For other methods of treating irregular boundaries, see Forsythe
and Wasow (2004), Greenspan (1965).
As usual, the finite difference methods can in principle be applied to non-
linear equations also. In that case, we need to solve a system of nonlinear
equations, which can be achieved using Newton's method. Newton's
method is essentially equivalent to linearising the equation about the current
approximation and solving the system of linear equations to obtain the next
approximation. This iteration may be repeated until the solution converges. In
general, it may be impossible to ensure convergence, unless some approximate
solution is known.
In analogy with ordinary differential equations, we can define an eigen-
value problem for elliptic partial differential equations. In this case, the equa-
tion and the boundary conditions are homogeneous and a nontrivial solution
exists only for some special values of a parameter in the equation. The simplest
eigenvalue problem is defined by

    ∂²u/∂x² + ∂²u/∂y² = λu,        u = 0 on the boundary.                         (14.146)

There is no new principle involved in setting up finite difference equations, but
the resulting difference equations are also homogeneous and yield an algebraic

eigenvalue problem of the form (A − λB)u = 0, where u is the vector of function


values at different grid points. This eigenvalue problem can be solved using
techniques similar to that for ordinary differential equations, except for the
fact that the matrix is more complicated and larger.

14.8 Successive Over-Relaxation Method


In this section, we consider some iterative methods for solving the system of al-
gebraic equations obtained by using finite difference approximations to elliptic
equations. For simplicity, we only discuss boundary value problems over rect-
angular regions. Some of the simple iterative methods for solution of a system
of linear algebraic equations are considered in Section 3.7. Here we apply these
methods to the elliptic boundary value problems and consider some modifica-
tions to improve their performance.
For simplicity, we consider the solution of Poisson's equation with Δx =
Δy = h. The simplest iterative method due to Jacobi (Section 3.7) is defined
by the iteration

    u_{jl}^{n+1} = (1/4)(u_{j+1,l}^n + u_{j−1,l}^n + u_{j,l+1}^n + u_{j,l−1}^n) − (h²/4) f_{jl},       (14.147)

where the superscript denotes the successive approximations generated by the


Jacobi's method. It may be noted that the same equation is obtained if the
diffusion equation

    ∂u/∂t = ∂²u/∂x² + ∂²u/∂y² − f(x, y)                                           (14.148)

is solved using the explicit method considered in Section 14.4, provided the time
step Δt = h²/4, the maximum value permissible by the stability condition. The
solution of this diffusion equation does tend to that of the Poisson's equation
as t → ∞ and the system reaches an equilibrium state. However, this method
is not practical, since the iteration converges very slowly.
The Gauss-Seidel method which generally converges twice as fast as the
Jacobi's method, is defined by the iteration

    u_{jl}^{n+1} = (1/4)(u_{j+1,l}^n + u_{j−1,l}^{n+1} + u_{j,l+1}^n + u_{j,l−1}^{n+1}) − (h²/4) f_{jl}.   (14.149)

This method is simpler to implement on a computer with sequential processing,


since the new values can be directly overwritten on the old ones. However, this
method is also slowly converging and is not useful in practical computations.
Thus, in order to provide a practical iterative method, we need to accelerate
the convergence.
Before considering techniques for accelerating the convergence, let us es-
timate the rate of convergence of Jacobi and Gauss-Seidel iterations. For this
purpose, we write the finite difference equations in the matrix form as Au = b.
As in Section 3.7, we split the matrix as A = L+D+U, where D is the diagonal

part of A, and L and U are the lower and upper triangles of A with zeros on the
diagonal. Now the Jacobi method is defined by the iteration

    D u^{n+1} = b − (L + U) u^n.                                                  (14.150)

The convergence properties of this method depend on the matrix B = D^{−1}(L + U).
It can be easily shown that this method converges if and only if all eigen-
values of B are inside the unit circle. The rate of convergence depends on the
eigenvalue with the largest magnitude, which is also referred to as the spectral
radius of the matrix, and is denoted by ρ(B). Thus, for convergence we must
have ρ(B) < 1. In n iterations, the error will reduce by a factor of ρⁿ(B).
Hence, the rate of convergence R can be defined by R = −ln ρ(B). Thus 1/R
is essentially the number of iterations required to reduce the error by a factor
of e.
For elliptic boundary value problems, the spectral radius ρ(B) → 1 as
the number of grid points N → ∞. Thus, the number of iterations required
to achieve the same accuracy increases with the number of points used in the
grid. For Poisson's equation with Dirichlet boundary conditions on the region
0 ≤ x ≤ a and 0 ≤ y ≤ b with unequal mesh spacings Δx = h and Δy = k, it
can be shown that

    ρ(B_J) = [k² cos(πh/a) + h² cos(πk/b)] / (h² + k²),                           (14.151)

where the subscript J refers to the Jacobi method. In a square region with
M = Mx − 1 = My − 1 points along each axis, we have

    ρ(B_J) = cos(πh/a) ≈ 1 − π²/(2M²) = 1 − π²/(2N),                              (14.152)
where N is the size of the finite difference matrix. Thus, the rate of convergence
R_J ≈ π²/(2N) and the number of iterations required to achieve a given accu-
racy will increase linearly with the size N of the matrix. Since the amount of
computation needed in each iteration also increases linearly with N, the total
effort required to solve the equation increases as N², which is the same as that
using the direct method. Actually, in order to achieve an accuracy of O(M⁻²),
we require O(N ln N) iterations. Thus, Jacobi's method requires more effort as
compared to the direct method. However, the direct method will need to store
a significant fraction of the matrix elements in the computer memory, while the
iterative method requires very little memory. Thus, if the problem is too large
to fit into the memory, we may be forced to use the iterative methods.
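A short numerical illustration of these estimates, based on (14.152) with an assumed unit
square and h = 1/M, is the following sketch; 1/R is roughly the number of iterations needed
to reduce the error by a factor of e.

    import numpy as np

    # Spectral radius and rate of convergence of the Jacobi iteration, using (14.152).
    for M in (10, 20, 50, 100):
        rho = np.cos(np.pi / M)          # rho(B_J) with h/a = 1/M
        R = -np.log(rho)                 # rate of convergence
        print(M, round(rho, 4), round(1.0 / R))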
The Gauss-Seidel method is defined by the iteration

    (L + D) u^{n+1} = b − U u^n.                                                  (14.153)

Convergence property of this method is determined by the matrix B_G = (L +
D)^{−1}U. It can be shown that the spectral radius of this matrix ρ(B_G) = ρ(B_J)².
Thus, the rate of convergence of the Gauss-Seidel method is twice that of the Ja-
cobi's method. Hence, the number of iterations required is half that for Jacobi's
method. However, the improvement by a factor of two is not sufficient to make
the method practical.
Before the advent of electronic computers, relaxation methods were pop-
ular for solution of such systems of linear equations. In these methods, we start
with an initial guess and compute the residuals r = Au − b. Changes are made
in the solution to reduce some component of the residual to zero. The process is
continued until all residuals have been reduced to an acceptable level. The main
input in these methods is the choice of the component of the residual, which
is to be reduced to zero. With some experience the relaxation experts were
able to achieve fairly efficient solutions using such an apparently crude technique.
Unfortunately, it is rather difficult for an automatic program to make proper
choices. Hence, for digital computers, we need some systematic procedure to
choose the component at each stage. In fact, the Gauss-Seidel method can be
considered as a relaxation method, where the successive components of the
residual are reduced to zero at each step. One iteration of Gauss-Seidel method
consists of a complete cycle of relaxation method, where each component of the
residual is successively reduced to zero. The relaxation experts had also noticed
that it is often better to overrelax - that is, to make a larger change in the
unknown solution than what was required to reduce the corresponding residual
to zero. Attempts to apply this idea in a systematic manner have resulted in
the successive over-relaxation (SOR) method. In this method at each step we
apply a correction to the previous solution, which is ω times that given by the
Gauss-Seidel iteration. Here ω is some predetermined constant, which is also
referred to as the relaxation parameter.
Thus, the iteration in the SOR method is defined by

    u_{jl}^{n+1} = u_{jl}^n + ω (u'_{jl}^{n+1} − u_{jl}^n),                        (14.154)

where u'_{jl}^{n+1} is the value obtained using the Gauss-Seidel method defined by
(14.149). For the simple case considered in (14.149) the SOR iteration is defined
by

    u_{jl}^{n+1} = u_{jl}^n + ω [ (1/4)(u_{j+1,l}^n + u_{j−1,l}^{n+1} + u_{j,l+1}^n + u_{j,l−1}^{n+1}) − u_{jl}^n − (h²/4) f_{jl} ].   (14.155)

It may be noted that the expression in the parentheses is just the residual. If
ω = 1, then this iteration reduces to the Gauss-Seidel method. In the SOR
method we expect ω > 1. If ω < 1, then the iteration can be considered as
underrelaxation. It can be shown that for matrices satisfying a certain property,
which is usually referred to as Property (A), the eigenvalues λ_i of the matrix
B_ω which determines the convergence of the SOR iteration are given by

    (λ_i + ω − 1)² = λ_i ω² μ_i²,                                                  (14.156)

where μ_i are the corresponding eigenvalues of the matrix B_J for the Jacobi
iteration. Using this relation it can be shown {32} that the optimum choice of
ω is given by

    ω_b = 2 / (1 + √(1 − ρ²(B_J))) = 2 / (1 + sin(πh/a)),                          (14.157)

where the last equality holds for the simple case of Poisson's equation on a
square region with h = k. Further, for this value of ω

    ρ(B_{ω_b}) = ω_b − 1 = (1 − √(1 − ρ²(B_J))) / (1 + √(1 − ρ²(B_J))) = (1 − sin(πh/a)) / (1 + sin(πh/a)),     (14.158)

where again the last equality is valid only in the special case. For this case,
the rate of convergence Rw ~ 27rh/a ~ 27r/M, which is much better than that
for the Gauss-Seidel method. Thus, SOR method requires only of the order of
M = VN iterations to achieve a reasonable accuracy. Further, the number of
arithmetic operations in completing one iteration of SOR method is comparable
to that for the Gauss-Seidel method. Consequently, the SOR method should be
preferred.
In practice, the truncation error is expected to be O(M- 2 ) and we can
perform the iteration until the result converges to this accuracy, which requires
O(M In M) iterations. Each SOR iteration requires O(M2) floating-point op-
erations. Thus, the total effort required to solve an elliptic boundary value
problem is O(M 3 1n M), which can be compared with about 3M 4 arithmetic
operations using Gaussian elimination for the band matrix. Hence, the SOR
method works out to be more efficient than the direct methods for solution of
finite difference equations, if the optimum value of the relaxation parameter is
used.
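The following sketch (an illustration in Python, not the book's Fortran subroutine) implements
the SOR iteration (14.155) with the optimum parameter (14.157) for Poisson's equation on the
unit square with homogeneous Dirichlet conditions; the tolerance and stopping criterion are
assumptions patterned on Example 14.4 below.

    import numpy as np

    def sor_poisson(f, M, tol=1e-5, maxit=10000):
        """SOR for u_xx + u_yy = f on the unit square, u = 0 on the boundary;
        M intervals per side, f an (M+1) x (M+1) array of right-hand side values."""
        h = 1.0 / M
        omega = 2.0 / (1.0 + np.sin(np.pi * h))       # optimum relaxation parameter (14.157)
        u = np.zeros((M + 1, M + 1))
        for it in range(maxit):
            du_max = 0.0
            for j in range(1, M):
                for l in range(1, M):
                    # residual of (14.149), as in (14.155)
                    res = 0.25 * (u[j+1, l] + u[j-1, l] + u[j, l+1] + u[j, l-1]) \
                          - u[j, l] - 0.25 * h * h * f[j, l]
                    u[j, l] += omega * res
                    du_max = max(du_max, abs(omega * res))
            if du_max < tol:
                return u, it + 1
        return u, maxit

Setting omega = 1 in this sketch recovers the Gauss-Seidel iteration; for the boundary values of
Example 14.4 one would initialise the top row of u to sin πx instead of zero before the loop.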
The main difficulty with the SOR method is that, for more general prob-
lems it may not be possible to estimate the optimum value of ω. In fact, the
corresponding matrix may not even satisfy the Property (A). If we assume that
(14.156) is valid, then the value of ρ²(B_J) = ρ(B_G) may be estimated experi-
mentally by observing the convergence rate for the Gauss-Seidel method, which
in turn determines ω_b using (14.157). Alternately, if a large number of similar
problems (e.g., on the same region, but with different right-hand sides f(x,y))
are to be solved, then we can do some experimentation with different values of
ω (1 < ω < 2) to determine the optimum value. Several methods for approxi-
mating ω_b for irregular regions have been tried. For example, we can choose ω_b
to be that value obtained for a circumscribing rectangle. Alternately, we can
use ω_b for a square region with the same area as the actual region. Both these
choices have been found to perform fairly well on a number of problems.
For ω in the range ω_b ≤ ω < 2, it can be shown that ρ(B_ω) = ω − 1. In
general, it is better to overestimate ω_b than to underestimate it by the same
amount. Overestimation of the optimum ω_b has a smaller adverse effect on the
rate of convergence than an underestimation (by the same amount), since the
curve of ρ(B_ω) versus ω has a slope of unity for ω > ω_b, but an infinite slope as

ω → ω_b. Nevertheless, we can estimate ω_b only by approaching it from below.
If ω > ω_b the dominant eigenvalues of B_ω are complex. Hence, it is difficult
to estimate the spectral radius by observing the convergence rate, because of
oscillations in the residuals.

EXAMPLE 14.4: Solve Laplace's equation in two dimensions, subject to the boundary con-
ditions

    u(0, y) = 0,   u(1, y) = 0,   u(x, 0) = 0,   u(x, 1) = sin πx.                (14.159)

The exact solution of this problem is

    u(x, y) = sin(πx) sinh(πy) / sinh π.                                          (14.160)

Using a uniform grid with x_j = y_j = jh for j = 0, 1, ..., M−1 and h = 1/(M−1), it is possible
to write the finite difference equations using the usual five-point formula. This system of
equations can be solved using the SOR method. Using different values of M and ω, we obtain
the results shown in Table 14.4. In each case, the iteration was continued until the maximum
difference at any step is less than 10⁻⁵. This table shows the number of iterations (n) required
and the maximum truncation error in the computed values. The truncation error is estimated
by comparing with the exact solution. The truncation error depends only on the number of
points M, while the number of iterations required for convergence also depends on the value
of ω. The case ω = 1 yields the Gauss-Seidel method. For this problem the optimum value of ω
is given by (14.157). We have used M = 6, 11, 21, 51 corresponding to h = 0.2, 0.1, 0.05, 0.02,
respectively. For each value of M four values of ω are considered, i.e., ω = 1, ω_b, ω_b ± 0.1,
where ω_b is the optimum value of ω.
It can be seen from the table that if the optimum value of ω is used, then the number
of iterations increases almost linearly with M. This is expected from the spectral radius of
the iteration matrix for the SOR method. For the Gauss-Seidel method (ω = 1), the number
of iterations increases roughly as M². It can be seen that in most cases, it is better to
overestimate the optimum value of ω rather than underestimate it. However, for M = 51,
it appears that underestimation is better, probably because the attempted value of ω is too
close to the limiting value of 2. The truncation error is roughly proportional to h² as expected.
For M = 51 the truncation error is much higher for ω = 1, probably because of the fact that
the convergence is very slow and as a result the convergence criterion is satisfied before the
error is actually smaller than the requested accuracy.

Table 14.4: Solving Laplace's equation using the SOR method

M        ω          n     Error        M        ω          n     Error

6 1.000000 23 0.00994 21 1.000000 266 0.00041


6 1.160000 16 0.00995 21 1.630000 61 0.00061
6 1.259616 11 0.00995 21 1.729454 39 0.00070
6 1.360000 12 0.00996 21 1.830000 53 0.00070
11 1.000000 80 0.00275 51 1.000000 1213 0.00240
11 1.430000 30 0.00280 51 1.780000 181 0.00040
11 1.527864 20 0.00282 51 1.881838 89 0.00009
11 1.630000 23 0.00283 51 1.980000 503 0.00011

14.9 Alternating Direction Method


We have seen in the previous section that Jacobi's method applied to the fi-
nite difference approximation of the Poisson's equation, is equivalent to solving
the diffusion equation (14.148), with the explicit method. This restricts the
time step. It is possible to take larger time steps using an implicit method,
but in that case, we have to solve a system of equations at every time step.
This system of equations has a matrix similar to that resulting from the orig-
inal elliptic equation, and this method is obviously impractical. As noted in
Section 14.4, it is possible to simplify the calculations using the alternating
direction method, which involves solving several systems of equations involving
tridiagonal matrices. This method is stable for arbitrary time steps, and we can
use such techniques for solving the elliptic equations governing the equilibrium
situation. Thus, we can effectively use equations (14.76) to define the iteration
with Δt being considered as some arbitrary parameter. It can be seen that
if the iteration does converge, then the resulting solution satisfies the elliptic
equation obtained by setting the time derivative to zero. This method is gen-
erally referred to as the alternating direction implicit iterative method or ADI
method.
The key question in this method is how to choose the optimum time step.
Although the method is stable for arbitrarily large time step, we cannot expect
it to be accurate. This method can also be considered as a matrix iterative
method of the type described in the previous section. The optimum time step
is then chosen to minimise the spectral radius of the iteration matrix. If that is
done, then it turns out that the ADI method has the same rate of convergence
as the SOR method. Since one iteration of ADI method requires much more
work than that of SOR method, there appears to be no advantage in using the
ADI method. This is indeed true if the time step is kept constant. However, it
turns out that if the time step is varied suitably, then it is possible to get faster
convergence.
An alternative formulation of this method can be obtained by rewriting
the finite difference equations Au = b, as

(X + rI)u = b - (Y - rI)u or (Y + rI)u = b - (X - rI)u, (14.161)

where A = X + Y. If the partitions X and Y are chosen such that the systems
of equations with matrices X + rI and Y + rI can be solved easily, then one
step of ADI iteration is defined by

    (X + rI) u^{n+1/2} = b − (Y − rI) u^n,        (Y + rI) u^{n+1} = b − (X − rI) u^{n+1/2}.    (14.162)

For Poisson's equation, if X is chosen as the part arising from δ²_x u_{jl}, and Y
as that arising from δ²_y u_{jl}, then this iteration can be identified with (14.76),
provided r = 2h²/Δt. In this case, the matrix X is just the tridiagonal part of
A with diagonal elements as −2. The matrix Y is not tridiagonal, but it can
be easily seen that if the order of elements in the vector u is changed suitably,

then it can also become tridiagonal. However, it is not possible to choose an


order of elements such that both X and Y are tridiagonal.
With this definition, the convergence property of the ADI iteration is
determined by the matrix

    (Y + rI)^{−1} (X − rI) (X + rI)^{−1} (Y − rI).                                 (14.163)

The parameter r can be chosen to minimise the spectral radius of this matrix.
As mentioned earlier, this choice leads to the same rate of convergence as that
of the SOR method and it is not very useful. We can choose a sequence of
values ri, such that the product of spectral radii for a fixed number of steps is
minimised.
As with the SOR method, the optimum sequence of time steps (i.e., r_i) is not
known for arbitrary problems. The result is known when the two matrices X
and Y are positive definite and commute (i.e., XY = YX), which is certainly
true for Poisson's equation. For other problems we can only hope that the same
prescription will give a reasonable result. Practical experience suggests that this
method does give good results, even in some cases where the two matrices do
not commute. Since the number of steps is not known in advance, we can
choose a sequence of m time steps and repeat the cycle after m iterations. The
formula is simplest when m is a power of two and we give the result only for
that case.
Let the eigenvalues of X and Y be bounded by the interval [α, β], where
0 < α < β. Then the prescription for choosing the ADI parameters r_i^{(m)}, (i =
1, 2, ..., m = 2^k) for a complete cycle of m steps is as follows (a short sketch
implementing this recursion is given after the list).

1. Define α₀ = α and β₀ = β and recursively compute

    α_{j+1} = √(α_j β_j),    β_{j+1} = (α_j + β_j)/2,    (j = 0, 1, ..., k − 1).   (14.164)

2. Set s₁^{(0)} = √(α_k β_k).

3. For j = 0, 1, ..., k − 1, determine s_i^{(j+1)}, (i = 1, 2, ..., 2^{j+1}), as the 2^{j+1}
   roots of the 2^j quadratic equations in x

    s_i^{(j)} = (1/2) ( x + α_{k−1−j} β_{k−1−j} / x ),    (i = 1, 2, ..., 2^j).    (14.165)

4. Set r_i^{(m)} = s_i^{(k)} for i = 1, 2, ..., m = 2^k.
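A minimal Python sketch of this prescription follows; note that the β-recursion written in
step 1 above is the reconstructed arithmetic-mean companion of (14.164), and the bounds used
in the example are those of (14.166) for an assumed grid, so the sketch should be treated as
illustrative.

    import numpy as np

    # ADI parameters r_i^(m), m = 2**k, for eigenvalue bounds [alpha, beta],
    # following steps 1-4 above.
    def adi_parameters(alpha, beta, k):
        a, b = [alpha], [beta]
        for j in range(k):                        # step 1
            a.append(np.sqrt(a[j] * b[j]))
            b.append(0.5 * (a[j] + b[j]))
        s = [np.sqrt(a[k] * b[k])]                # step 2
        for j in range(k):                        # step 3: each s splits into two roots
            c = a[k - 1 - j] * b[k - 1 - j]
            s = [r for sv in s
                   for r in (sv - np.sqrt(sv**2 - c), sv + np.sqrt(sv**2 - c))]
        return sorted(s)                          # step 4

    # example: bounds (14.166) for a square grid with h/a = 1/20
    h = 1.0 / 20
    alpha, beta = 2 * (1 - np.cos(np.pi * h)), 2 * (1 + np.cos(np.pi * h))
    print(adi_parameters(alpha, beta, k=2))       # a cycle of m = 4 parameters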

For the solution of Poisson's equation on a square grid, the bounds on the
eigenvalues turn out to be

    α = 2 (1 − cos(πh/a)),        β = 2 (1 + cos(πh/a)).                           (14.166)



For this case, it can be shown that the spectral radius of the iteration matrix
for the entire cycle of m = 2^k iterations is

(14.167)

The presence of the mth root in this expression leads to a dramatic increase in the
rate of convergence

    R_ADI = (8/m) (πh/(4a))^{1/m}.                                                 (14.168)

To minimise the total number of iterations, we should choose m ≈ ln(4a/πh) ≈
ln(4M/π), where M is the number of mesh points in each direction. Thus,
it will require of the order of ln M / R_ADI or (ln M)² iterations to achieve the
required accuracy of O(h²), which may be compared with M ln M iterations
for the SOR method. Each iteration of the ADI method requires about four
times the effort required for the SOR method. However, for large values of M,
the ADI method will be more efficient.
EXAMPLE 14.5: Solve the equation in Example 14.4 using the ADI method.
For Laplace's equation on rectangles, the bounds on the eigenvalues of X and Y are
known and there is no difficulty in finding the optimum sequence r_i for the ADI iteration.
Using the subroutine ADI in Appendix B, we can find the solution for a given value of
m = 2^k and the number of mesh points Mx and My. For simplicity, we use M = Mx = My
in all computations. The iteration was continued until the successive iterates differ by less
than 10⁻⁵ in all components. For M = 6, 11, 21 and 51 it requires 5, 7, 9 and 12 iterations,
respectively. These numbers may be compared with the results in Table 14.4 for the SOR
iteration. It is clear that the number of iterations required for the ADI method is much less than
that required for the SOR method. The accuracy of results is of course the same for both
these methods, since the same set of finite difference equations is being solved. However,
one ADI iteration requires about four times as much computer time as an SOR iteration.
It is found that for M = 51, the ADI method is about twice as fast as the SOR method.
The efficiency of these methods also depends on the implementation. If the special nature of
coefficients for this equation is used, then the time required could easily reduce by a factor
of five or more. For example, the subroutine ADI treats the coefficients as functions of x
and y, while if the coefficients are constant, then each tridiagonal matrix is identical and
the elimination part needs to be performed only once. Even if the coefficients are variable,
the elimination needs to be done only for the first iteration, provided relevant information is
retained for all submatrices. However, in order to reduce memory requirements, subroutine
ADI performs the elimination in every iteration.

The iterative methods like SOR and ADI can be generalised to elliptic
problems in three or more dimensions. In such problems the number of un-
knowns in the finite difference equation could be very large and much larger
effort is required. For these problems, the improvement in efficiency could play
a significant role in deciding whether a given problem is solvable or not, with
the available computing resources. In fact, it has been demonstrated that for
solution of elliptic problems in three dimensions, the increase in efficiency due
to improvement in the algorithms during the recent times is larger than the
increase in computing power during the same period.

14.10 Fourier Transform Method


In the last two sections, we considered iterative methods for solution of finite dif-
ference equations resulting from elliptic boundary value problems. These meth-
ods are applicable for fairly general elliptic equations. More efficient methods
have been developed in recent years for some special forms of elliptic equations.
These are direct methods, which yield the solution in a fixed number of arith-
metic operations. The method of cyclic reduction due to Buneman (Stoer and
Bulirsch, 2010) is applicable to elliptic equations which are separable (in the
sense of separation of variables). For Poisson's equation, this method requires
approximately 3M² log₂ M arithmetic operations, which is much better than
the ADI method. For equations with constant coefficients it is possible to use
the Fourier transform method, which requires approximately 2M² log₂ M arith-
metic operations. It is also possible to combine the two methods to improve
the efficiency. These methods are applicable only over rectangular regions.
The basic idea in the Fourier transform method is to take the Fourier trans-
form of the difference equation and solve the problem in the frequency domain.
The required solution is then obtained by taking the inverse Fourier transform.
The efficiency can be improved by using the FFT algorithm for calculating the
Fourier transform. The inverse discrete Fourier transforms in two variables can
be defined by

    u_{jl} = Σ_{m=0}^{Mx−1} Σ_{n=0}^{My−1} U_{mn} exp(2πi jm/Mx) exp(2πi ln/My),   (14.169)

where u_{jl} is the solution at the required points and U_{mn} is the corresponding
solution in the frequency domain. Similarly, we can define the inverse DFT for the
right-hand side f_{jl}. Substituting this equation in the finite difference equation
(14.128), we can obtain the solution

    U_{mn} = F_{mn} / { 2 [ (cos(2πm/Mx) − 1)/(Δx)² + (cos(2πn/My) − 1)/(Δy)² ] }.   (14.170)

Thus, we can first calculate F_{mn} by using the FFT on the right-hand side and
then calculate U_{mn} using the above equation. The inverse FFT of U_{mn} will give the
required solution. This technique can be used for periodic boundary conditions

    u_{0,l} = u_{Mx,l},  (l = 0, ..., My);        u_{j,0} = u_{j,My},  (j = 0, ..., Mx),       (14.171)
since the DFT essentially extends the function periodically, beyond the specified
region.
For homogeneous Dirichlet conditions u = 0 on the boundaries, we can
use the sine transform

    u_{jl} = Σ_{m=1}^{Mx−1} Σ_{n=1}^{My−1} U_{mn} sin(πjm/Mx) sin(πln/My).         (14.172)

It can be easily verified that this solution satisfies the homogeneous Dirichlet
boundary condition. Once again, substituting it in the finite difference equation
(14.128), we can obtain the solution

    U_{mn} = F_{mn} / { 2 [ (cos(πm/Mx) − 1)/(Δx)² + (cos(πn/My) − 1)/(Δy)² ] }.    (14.173)

Thus, we can first calculate F_{mn} by using the sine transform on the right-hand
side and then calculate U_{mn} using the above equation. The sine transform of
U_{mn} will give the required solution. An algorithm for obtaining the fast sine
transform is described in Section 10.6.
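As an illustration, here is a Python/SciPy sketch of this procedure for the five-point Poisson
problem with homogeneous Dirichlet conditions, using SciPy's type-I sine transform; the grid
and right-hand side are assumptions made for the example.

    import numpy as np
    from scipy.fft import dstn, idstn

    def poisson_dirichlet_dst(f, h):
        """Solve the five-point equations (14.128) with u = 0 on the boundary of a
        square, given the right-hand side f at the (M-1) x (M-1) interior points of
        a grid with spacing h."""
        M = f.shape[0] + 1
        F = dstn(f, type=1)                               # sine transform of the right-hand side
        m = np.arange(1, M)
        lam = 2.0 * (np.cos(np.pi * m / M) - 1.0) / h**2  # one-dimensional eigenvalues
        U = F / (lam[:, None] + lam[None, :])             # equation (14.173) with dx = dy = h
        return idstn(U, type=1)                           # back-transform gives u

    # example: f = -2*pi^2*sin(pi*x)*sin(pi*y) on the unit square, exact u = sin(pi*x)*sin(pi*y)
    M = 32
    h = 1.0 / M
    x = np.arange(1, M) * h
    f = -2 * np.pi**2 * np.outer(np.sin(np.pi * x), np.sin(np.pi * x))
    u = poisson_dirichlet_dst(f, h)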
For nonhomogeneous Dirichlet conditions, we can transfer the known
boundary values in the relevant finite difference equation to the right-hand
side before taking the sine transform of the right-hand side. Another alterna-
tive is to add a particular solution which satisfies the boundary conditions. For
Neumann boundary conditions, we can use the cosine transform instead of sine
transform. The Fourier transform method can be easily extended to Poisson's
equation in three or more dimensions.

14.11 Finite Element Methods


The finite difference methods discussed so far in this chapter yield an approxi-
mate solution of the differential equations by discretising the domain. Solution
is found only at a finite set of points. If the solution is required at other points,
then interpolation may be required. On the other hand, finite element meth-
ods result in a finite system of equations by discretising the solution space.
These methods are similar to expansion methods considered in Section 13.3 for
integral equations. Thus, we can seek a solution of the form

    u(x, y) = Σ_{i=1}^{n} a_i φ_i(x, y) = u(x, y, a),                              (14.174)

where φ_i(x, y) are the basis functions. The coefficients a_i of the expansion can be
found by substituting this solution in the differential equation and minimising
the residue in some sense, as explained in Section 13.3.
Finite element methods are special forms of expansion methods, where
the basis functions are finite elements, that is, functions which are zero every-
where, except on a small part of the domain under consideration. Thus, the
finite element method involves the following four steps for the solution of any
differential equation:
1. Partition the region where solution is required into convenient pieces. For
example, in two dimensions the pieces could be rectangles, triangles or
curved sided triangles or quadrilaterals. The discretisation is very flexible
and is adjusted to suit the region of the problem. It should be noted that

adjacent small angles or highly obtuse angles should be avoided, as they


lead to numerical difficulties.

2. Over each piece define basis functions which are nonzero only in that piece
and zero everywhere else. The resulting piecewise approximation to the
solution should satisfy some continuity requirements. In most cases, the
solution must be continuous across the boundary between two elements. In
elastomechanics problems, e.g., bending of beams or plates, the continu-
ity conditions are more stringent, because physics requires the first-order
partial derivatives to be continuous.

3. Substitute the solution in the differential equation and obtain algebraic


equations for the coefficients of expansion. In most cases, it is possible to
use a variational formulation of the problem to obtain the required equa-
tions. This technique may be equivalent to the Galerkin method applied
to the partial differential equation.

4. Solve the system of algebraic equations to obtain the required coefficients


of expansion.

Finite element methods can also be applied to ordinary differential equations in


which case, pieces will be just line segments in one dimension. For example, the
B-spline basis functions can be considered as finite elements in one dimension,
where in each subinterval only a few of these components are nonzero. The
coefficients of B-splines can be calculated by substituting the solution into the
differential equation (Section 12.10).
The basic idea in finite element methods is to choose localised functions
for expansion. This choice helps in two ways: first, since each function is ex-
pected to approximate the solution in only a small region, the accuracy can be
improved; second, the resulting system of equations for determining the coeffi-
cients of expansion come out to be sparse, since each element is coupled only
to a few neighbouring elements. Consequently, the fourth step can be com-
pleted efficiently. Of course, for improving the accuracy of approximation, we
can choose more elements.
In order to ensure continuity of the piecewise function across the bound-
aries of individual elements, it is convenient to define the basis functions in
terms of function values at suitably chosen nodal points or nodes. We can also
use the values of some derivatives at these nodes. The approximating function
can be expressed as a linear combination of basis functions with the nodal
variables as coefficients. If only the function values Ui at the nodes (Xi, Yi) are
taken as nodal variables, then the approximating function for a two-dimensional
element with n nodes can be expressed as

n
u(x, y) = L UiPi(X, y). (14.175)
i=l

Here the basis functions φ_i(x, y) must possess the interpolation property

    φ_i(x_j, y_j) = { 1,   for j = i;
                      0,   for j ≠ i.                                              (14.176)

Since a nodal point (x_i, y_i) could be at the common boundary of several ele-
ments, the basis function φ_i(x, y) is the sum of the basis functions with the same
nodal point defined on all such elements. This basis function is nonzero only in
those elements which contain the node. For example, if a two-dimensional re-
gion is partitioned into rectangular elements, then an internal node defined by
the intersection of bounding lines is a vertex of four different elements. Then,
the corresponding basis function is nonzero only in these four elements, and is
defined as the piecewise function obtained by the sum of the basis functions
defined in each of these four regions. In principle, we can have four different
basis functions with independent coefficients, but the requirement of continuity
essentially forces these coefficients to be the same.
The choice of the basis functions over each element depends on the shape
of the element and the order of accuracy required. It is most convenient to
use polynomials as basis functions, although rational functions can be used
if the solution is expected to be singular in the concerned region. In order to
standardise these basis functions, it is convenient to use local coordinates within
each element to define the basis functions. For example, consider a triangular
element defined by the vertices (x₁, y₁), (x₂, y₂) and (x₃, y₃) in counterclockwise
order. For this element, it is convenient to define a linear transformation, such
that the point (x₁, y₁) becomes the origin and the triangle is transformed to a
unit right-angled triangle with vertices (0,0), (1,0) and (0,1). It can be easily
seen that this is achieved by the transformation

    ξ = [(y₃ − y₁)(x − x₁) − (x₃ − x₁)(y − y₁)] / D,
    η = [−(y₂ − y₁)(x − x₁) + (x₂ − x₁)(y − y₁)] / D,                              (14.177)
    D = (x₂ − x₁)(y₃ − y₁) − (x₃ − x₁)(y₂ − y₁).

Thus, for triangular elements we can consider basis functions in terms of the
standard coordinates (ξ, η). The transformation equation (14.177) can be used
to write these basis functions in terms of the required variables x, y. The same
transformation can be used to transform a parallelogram into a unit rectangle.
The simplest basis function in a triangle is obtained by using a linear func-
tion of the form φ(ξ, η) = α₀ + α₁ξ + α₂η. Since there are three independent
constants, we can use the three vertices as the nodes. The values at the three
nodes (in counterclockwise order) can be written as

    ( u₁ )   ( 1  0  0 ) ( α₀ )
    ( u₂ ) = ( 1  1  0 ) ( α₁ ).                                                   (14.178)
    ( u₃ )   ( 1  0  1 ) ( α₂ )

The corresponding basis functions can be easily obtained by inverting the above
matrix to write α = Au, where

    A = (  1  0  0 )
        ( −1  1  0 ).                                                              (14.179)
        ( −1  0  1 )

[Figure 14.5: Nodes for triangular elements. Left: linear approximation; right: cubic
approximation.]

Columns of this matrix provide the coefficients for the corresponding basis
function. It can be easily verified that the basis functions are

    φ₁(ξ, η) = 1 − ξ − η,    φ₂(ξ, η) = ξ,    φ₃(ξ, η) = η.                        (14.180)

These functions satisfy the interpolation condition (14.176). Thus, the approx-
imating function over the triangle can be expressed as

    u(ξ, η) = u₁ φ₁(ξ, η) + u₂ φ₂(ξ, η) + u₃ φ₃(ξ, η).                              (14.181)
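A small Python sketch evaluating these linear shape functions on an arbitrary triangle
through the local coordinates (ξ, η) of (14.177) follows; the vertex coordinates in the example
call are assumptions.

    import numpy as np

    def linear_shape_functions(verts, x, y):
        """Evaluate the linear basis functions phi_1, phi_2, phi_3 of (14.180)
        at a point (x, y), for a triangle with vertices verts = [(x1,y1),(x2,y2),(x3,y3)]
        listed counterclockwise."""
        (x1, y1), (x2, y2), (x3, y3) = verts
        D = (x2 - x1) * (y3 - y1) - (x3 - x1) * (y2 - y1)   # twice the triangle area
        xi  = ((y3 - y1) * (x - x1) - (x3 - x1) * (y - y1)) / D
        eta = (-(y2 - y1) * (x - x1) + (x2 - x1) * (y - y1)) / D
        return np.array([1.0 - xi - eta, xi, eta])

    # each basis function is 1 at its own vertex and 0 at the other two
    print(linear_shape_functions([(0, 0), (2, 0), (0, 1)], 0, 0))   # -> [1, 0, 0]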
Suppose the given region is partitioned into triangles with common ver-
tices (Figure 14.5) and that on two adjoining triangles ABC and ADB the
nodal values u_A and u_B at the shared vertices A and B are the same for both
elements. Then the function values at all intermediate points on the boundary
AB between the two triangles are also equal on the two elements, since the
function is linear. This ensures continuity of the function across the boundaries
of elements. If instead of linear approximation we
use cubic approximation on each element, then in order to ensure continuity we
need to have equal values at four points on the boundary. Thus, we can use four
nodes (including the two end points) on each side giving a total of nine nodes.
However, a general cubic in two variables has 10 independent coefficients, and
we need one extra node which can be taken as the centroid of the triangle. It
is clear that ensuring continuity of derivatives across such boundaries is rather
difficult.
It is somewhat difficult to define basis functions on parallelograms using
such nodes. For example, if we consider the general linear function, then there
are only three constants, while in order to ensure continuity of the function
across the element boundaries we need four nodes. To avoid this problem it is
convenient to use a bilinear approximation of the form

    u(ξ, η) = α₀ + α₁ξ + α₂η + α₃ξη.                                               (14.182)

It should be noted that this form is not preserved under the general linear
transformation (14.177), in the sense that other quadratic terms of the form

x² and y² will arise when the function is transformed to the x, y coordinates. How-
ever, in the (ξ, η) coordinates it is clear that for fixed ξ or η, the approximation
becomes a linear function of the other non-fixed variable. In particular, on the
sides of the standard rectangle the approximation u(ξ, η) is a linear function of
arc length. Hence, continuity across the element boundaries can be ensured if
the function values at the two end points of the side are equal. Now if this func-
tion is transformed to the original coordinates (x, y), since the transformation is
linear, the behaviour of the transformed function on the sides of the parallelogram
is also linear. Thus, the continuity of the function from one element to the next
is assured, if the function values at the corners are used as nodal variables. In
this case, the basis functions are

    φ₁(ξ, η) = 1 − ξ − η + ξη,   φ₂(ξ, η) = ξ − ξη,   φ₃(ξ, η) = ξη,   φ₄(ξ, η) = η − ξη.     (14.183)
Similar difficulties are encountered for higher order approximations on paral-
lelograms.
Having decided on the basis functions, the next step is to obtain the al-
gebraic equations to determine the coefficients or the nodal values. For this
purpose, we can use the collocation, least squares or Galerkin methods described in
Section 13.3. It is simplest to use the collocation method, where the colloca-
tion points must be in the interior of the elements, since at the boundaries the
required derivatives may not be uniquely defined. The number of collocation
points must equal the number of nodal quantities in the system. Least squares
and Galerkin methods require some integrals involving the basis functions to
be evaluated. In principle, it may be possible to evaluate these integrals analyt-
ically, but it may be too tedious to derive the corresponding formulae. There
are two alternatives: the first is to use an algebraic manipulation program to
evaluate the expressions for integrals; the second is to use numerical integration
to approximate the integrals. Over rectangular elements, we can easily use the
product Gauss rules of appropriate degree. For triangular elements we can use
special formulae for a triangular region of the kind discussed in Section 6.10. It
should be noted that the resulting solution of the differential equation will have
significant truncation error, even if the integrals are evaluated exactly. Hence,
it may not be meaningful to waste a lot of effort in evaluating the integrals
accurately, and a 2- or 3-point Gauss-Legendre formula is usually sufficient to ob-
tain the required accuracy. The use of integrals in the least squares method can be
avoided if the integral is replaced by a sum of squares over a set of points. The
number of points chosen must be larger than the number of basis functions
and these points should be approximately uniformly distributed over the entire
region.
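As an illustration of the numerical integration mentioned above, the following sketch applies a product Gauss-Legendre rule on the standard square; mapping the nodes from [-1, 1] to [0, 1] is a choice made here for the illustration and is not taken from the text.

```python
import numpy as np

def gauss_legendre_2d(f, npt=2):
    """Product Gauss-Legendre rule with npt x npt points on the standard
    square 0 <= xi, eta <= 1, as might be used for element integrals."""
    x, w = np.polynomial.legendre.leggauss(npt)   # nodes and weights on [-1, 1]
    x = 0.5 * (x + 1.0)                           # map nodes to [0, 1]
    w = 0.5 * w                                   # scale weights accordingly
    return sum(wi * wj * f(xi, xj)
               for xi, wi in zip(x, w) for xj, wj in zip(x, w))

# Example: the integral of xi*eta over the unit square is exactly 1/4.
print(gauss_legendre_2d(lambda xi, eta: xi * eta))   # prints 0.25
```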
The boundary conditions can be treated in two different ways. The first
is to choose basis functions which satisfy the boundary conditions exactly.
This is possible when the boundary conditions are simple and the boundaries
are regular. On irregular boundaries, it may be difficult to choose basis func-
tions which satisfy the boundary conditions exactly. In such cases, the second
approach may be used, where some of the algebraic equations used to determine

the coefficients of expansion are derived using the boundary conditions. Thus,
in the collocation method, we can choose some collocation points at the boundary.
It may be difficult to decide the weight to be given to the boundary condi-
tions, that is, the ratio of the number of collocation points on the boundary to
that in the interior. Similarly, in the least squares method, we can choose the
function to be minimised as

$$F(a) = \iint \left(Lu(x,y,a) - f(x,y)\right)^2 dx\,dy + w\int \left(u(x,y,a) - g(x,y)\right)^2 dS, \qquad (14.184)$$
where $Lu = f$ is the differential equation with Dirichlet boundary conditions
$u = g(x, y)$ on the boundary. Here the second integral is over the curve defin-
ing the boundary. Once again, the weight factor $w$ for the boundary conditions
is arbitrary, and a good value of $w$ can only be obtained by trial and error. For
the Galerkin method it is difficult to account for boundary conditions in this man-
ner. Usually the basis functions are chosen to satisfy the boundary conditions
exactly, and if that is not possible the Galerkin method may be avoided.
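The following hedged sketch shows one way the functional (14.184) could be discretised by replacing the integrals with sums over interior and boundary points, as discussed above; the function names and the default weight are illustrative assumptions, not part of the text.

```python
def penalised_residual(a, u, Lu, f, g, interior_pts, boundary_pts, w=10.0):
    """Discrete analogue of (14.184): the integrals are replaced by sums of
    squared residuals over interior and boundary points.  Here u(x, y, a) and
    Lu(x, y, a) are the trial function and its image under the operator L,
    while f(x, y) and g(x, y) are the source term and the boundary data."""
    r_int = sum((Lu(x, y, a) - f(x, y))**2 for x, y in interior_pts)
    r_bnd = sum((u(x, y, a) - g(x, y))**2 for x, y in boundary_pts)
    return r_int + w * r_bnd
```

This scalar function could then be minimised over the coefficient vector a with any general-purpose minimiser, with the weight w chosen by trial and error as noted above.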
Instead of using these methods, which are based on minimising the resid-
ual in some sense, it is possible to reformulate the problem as a variational
problem, which can be directly solved using the approximating function. For
example, the solution of Laplace's equation with Dirichlet boundary conditions is
equivalent to finding stationary points of the integral
$$I = \iint \left[\left(\frac{\partial u}{\partial x}\right)^2 + \left(\frac{\partial u}{\partial y}\right)^2\right] dx\,dy, \qquad (14.185)$$

where the trial functions u(x, y) satisfy the required boundary conditions. Here
the integral is evaluated over the region, where the solution is required. Thus, we
can use an expansion in terms of the basis functions, which satisfy the required
boundary conditions. This expansion can be substituted into the integral and
the required coefficients are determined by finding the extremum. Thus, the
coefficients can be determined by solving the system of equations

$$\frac{\partial I}{\partial a_i} = \iint \left(\frac{\partial\phi_i}{\partial x}\sum_{j=1}^{n} a_j\frac{\partial\phi_j}{\partial x} + \frac{\partial\phi_i}{\partial y}\sum_{j=1}^{n} a_j\frac{\partial\phi_j}{\partial y}\right) dx\,dy = 0. \qquad (14.186)$$

This system of linear equations can be solved to obtain the coefficients of
expansion. The advantage of the variational approach over other methods is that
lower order derivatives are required. The variational method re-
quires only first-order derivatives when the differential equation is of second
order, which enables us to use a lower order approximating function to simplify
the algebra. For example, we can use a linear approximation to evaluate the
first derivative, but not the second derivative. The integrals required in this
approach can be evaluated either by using an algebraic manipulation program or
by numerical integration.
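As a concrete illustration of assembling the system (14.186), the sketch below (not the book's implementation) computes the element contributions for Laplace's equation using the linear triangular elements of (14.180) and adds them into a global matrix; boundary conditions still have to be imposed on the assembled system.

```python
import numpy as np

def element_stiffness(xy):
    """Contribution of one linear triangular element to (14.186) for Laplace's
    equation; xy is a 3x2 array holding the vertex coordinates."""
    x, y = xy[:, 0], xy[:, 1]
    b = np.array([y[1] - y[2], y[2] - y[0], y[0] - y[1]])
    c = np.array([x[2] - x[1], x[0] - x[2], x[1] - x[0]])
    area = 0.5 * abs((x[1] - x[0]) * (y[2] - y[0]) - (x[2] - x[0]) * (y[1] - y[0]))
    # grad(phi_i) = (b_i, c_i)/(2*area) is constant over the element, so the
    # integral of grad(phi_i).grad(phi_j) is area * grad(phi_i).grad(phi_j).
    return (np.outer(b, b) + np.outer(c, c)) / (4.0 * area)

def assemble(vertices, triangles):
    """Assemble the global matrix from element contributions.
    vertices: (N, 2) array of coordinates; triangles: (M, 3) array of indices.
    A dense matrix is used only for brevity; the matrix is actually sparse."""
    K = np.zeros((len(vertices), len(vertices)))
    for tri in triangles:
        K[np.ix_(tri, tri)] += element_stiffness(vertices[tri])
    return K
```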

In all cases, the problem is reduced to solving a large system of linear
equations to find the coefficients of expansion. Because of the finite extent of the
basis functions the equation matrix is sparse. The bandwidth of this matrix
depends on the ordering of the basis functions. Thus, it is better to choose
an ordering of basis functions which minimises the bandwidth of the resulting
matrix. Some algorithms for this purpose are described in Schwarz (1988). It
is possible to use iterative methods like SOR to solve the system of linear
equations. However, in most cases, there is no theory to estimate the optimum
value of the relaxation parameter w, which needs to be obtained experimentally.
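A bare-bones SOR iteration of the kind mentioned above might look as follows; a dense matrix is used purely for brevity, whereas in practice the sparse banded structure would be exploited, and the relaxation parameter would be tuned experimentally.

```python
import numpy as np

def sor(A, b, omega=1.5, tol=1e-8, maxiter=10000):
    """Plain SOR iteration for A x = b; omega is the relaxation parameter."""
    n = len(b)
    x = np.zeros(n)
    for _ in range(maxiter):
        x_old = x.copy()
        for i in range(n):
            # Gauss-Seidel sweep: use updated values below the diagonal.
            s = A[i, :i] @ x[:i] + A[i, i + 1:] @ x_old[i + 1:]
            x[i] = (1 - omega) * x_old[i] + omega * (b[i] - s) / A[i, i]
        if np.linalg.norm(x - x_old, np.inf) < tol:
            break
    return x
```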
For initial value problems, it is convenient to use an expansion of the form
$$u(x, y, t) = \sum_{i=1}^{n} a_i(t)\,\phi_i(x, y) = u(x, y, a(t)), \qquad (14.187)$$

where the coefficients of expansion are functions of time, while $\phi_i(x, y)$ are some
suitable basis functions in the spatial coordinates. Substituting this expansion
into the differential equation, we get a system of ordinary differential equations
in the coefficients ai, which can be solved by techniques described in Chapter 12.
For example, consider the differential equation
$$\frac{\partial u}{\partial t} = L(u), \qquad (14.188)$$

where L is a differential operator involving the spatial coordinates. If $\phi_i(x, y)$ are or-
thonormal basis functions over the required region, then using the Galerkin method,
we can obtain the system of ordinary differential equations

$$\frac{da_i}{dt} = \sum_{j=1}^{n} a_j \iint \phi_i(x,y)\,L\phi_j(x,y)\,dx\,dy, \qquad (i = 1, \ldots, n). \qquad (14.189)$$

This process is similar to the method of lines described in Section 14.3. Here we
use the finite element method to discretise the spatial region, instead of the finite
difference method. In this case also the resulting equation matrix is sparse, since
only a few of the integrals on the right-hand side of (14.189) are nonzero. Instead
of the Galerkin method, we can use the collocation method to obtain the required
system of ordinary differential equations. This system of ordinary differential
equations is likely to be stiff and a routine for dealing with stiff equations will
be required to solve these equations.
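As a minimal illustration of (14.187)-(14.189), the sketch below assumes the diffusion equation on the unit square with orthonormal sine basis functions, for which the Galerkin integrals can be evaluated analytically and the ODE system becomes diagonal; in general the integrals would be computed numerically and the matrix would be sparse but not diagonal.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Orthonormal basis phi_mn(x, y) = 2 sin(m*pi*x) sin(n*pi*y) on the unit square,
# for which L(phi_mn) = -pi^2 (m^2 + n^2) phi_mn when L is the Laplacian, so the
# Galerkin integrals in (14.189) reduce to a diagonal matrix.
modes = [(m, n) for m in range(1, 4) for n in range(1, 4)]
lam = np.array([-np.pi**2 * (m * m + n * n) for m, n in modes])

def rhs(t, a):
    # da_i/dt = sum_j a_j * integral(phi_i L phi_j) collapses to lam_i * a_i here.
    return lam * a

a0 = np.ones(len(modes))                              # arbitrary initial coefficients
sol = solve_ivp(rhs, (0.0, 0.1), a0, method="BDF")    # stiff solver, as suggested above
print(sol.y[:, -1])
```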

Bibliography
Ames, W. F. (1992): Numerical Methods for Partial Differential Equations, (3rd ed.), Academic
Press, New York.
Birkhoff, G. and Schoenstadt, A. (eds.) (1984): Elliptic Problem Solvers, Academic Press,
New York.
Carnahan, B., Luther, H. A. and Wilkes, J. O. (1969): Applied Numerical Methods, John
Wiley, New York.

Dennery, P. and Krzywicki, A. (1996): Mathematics for Physicists, Dover, New York.
Evans, G. A., Yardley, P. D. and Blackledge, J. M. (1999): Numerical Methods for Partial
Differential Equations, (Springer Undergraduate Mathematics Series), Springer Verlag.
Farlow, S. J. (1993): Partial Differential Equations for Scientists and Engineers, Dover, New
York.
Forsythe, G. E. and Wasow, W. R. (2004): Finite-Difference Methods for Partial Differential
Equations, Dover, New York.
Fox, L. (ed.) (1962): Numerical Solution of Ordinary and Partial Differential Equations,
Pergamon Press, Oxford.
Gerald, C. F. and Wheatley, P. O. (2003): Applied Numerical Analysis, (7th ed.) Addison-
Wesley.
Greenspan, D. (1965): Introductory Numerical Analysis of Elliptic Boundary Value Problems,
Harper and Row, New York.
Heller, D. (1978): A Survey of Parallel Algorithms in Numerical Linear Algebra, SIAM Rev.,
20, 740.
Hockney, R. W. and Jesshope, C. R. (1988): Parallel Computers 2, Architecture, Program-
ming and Algorithms, Adam Hilger, Bristol.
Hoffman, J. D. (2001): Numerical Methods for Engineers and Scientists, (2nd ed.) CRC Press.
Iserles, A. (2008): A First Course in the Numerical Analysis of Differential Equations, (2nd
ed.) Cambridge University Press.
Jain, M. K. (1979): Numerical Solution of Differential Equations, Wiley Eastern, New Delhi.
Lapidus, L. and Pinder, G. F. (1999): Numerical Solution of Partial Differential Equations
in Science and Engineering, Wiley-Interscience.
LeVeque, R. J. (2007): Finite Difference Methods for Ordinary and Partial Differential Equa-
tions: Steady-State and Time-Dependent Problems, SIAM, Philadelphia.
Morton, K. W. and Mayers, D. F. (2005): Numerical Solution of Partial Differential Equations:
an Introduction, (2nd ed.) Cambridge University Press, Cambridge.
Ortega, J. M. and Voigt, R. G. (1985): Solution of Partial Differential Equations on Vector
and Parallel Computers, SIAM Rev., 27, 149.
Quarteroni, A., Sacco, R. and Saleri, F. (2010): Numerical Mathematics, (2nd ed.) Texts in
Applied Mathematics, Springer, Berlin.
Press, W. H., Teukolsky, S. A., Vetterling, W. T. and Flannery, B. P. (2007): Numerical
Recipes: The Art of Scientific Computing, (3rd ed.) Cambridge University Press, New
York.
Rahman, M. (2004): Applied Numerical Analysis, WIT Press.
Ram-Mohan, L. R. (2002): Finite Element and Boundary Element Applications in Quantum
Mechanics, Oxford University Press.
Rice, J. R. (1992): Numerical Methods, Software, and Analysis, (2nd Ed.), Academic Press,
New York.
Richtmyer, R. D. and Morton, K. W. (1995): Difference Methods for Initial Value Problems,
(2nd ed.), Krieger Publishing Company.
Rutishauser, H. (1990): Lectures on Numerical Mathematics, Birkhauser, Boston.
Schwarz, H. R. (1988): Finite Element Methods, Academic Press, London.
Sod, G. A. (2009): Numerical Methods in Fluid Dynamics: Initial and Initial Boundary- Value
Problems, Cambridge University Press, Cambridge.
Stoer, J. and Bulirsch, R. (2010): Introduction to Numerical Analysis, (3rd Ed.) Springer-
Verlag, New York.
Swarztrauber, P. N. (1977): The Methods of Cyclic Reduction, Fourier Analysis and the
FACR Algorithm for the Discrete Solution of Poisson's Equation on a Rectangle, SIAM
Rev., 19, 490.
Varga, R. S. (2009): Matrix Iterative Analysis, Springer Series in Computational Mathemat-
ics, 27 (2nd Ed.) Springer Verlag, New York.

Exercises
1. Solve the diffusion equation in Example 14.1 subject to initial and boundary conditions

$$u(x,0) = f(x) = \begin{cases} x, & \text{if } 0 \le x \le \pi/2; \\ \pi - x, & \text{if } \pi/2 < x \le \pi; \end{cases} \qquad u(0, t) = u(\pi, t) = 0.$$

2. A thin rod of length L and thermal diffusivity $a$ is initially at a uniform temperature
$\theta_0$. Its two ends are subsequently maintained at a constant temperature $\theta_1$. Find how
the temperature $\theta$ inside the rod varies with time and position. Define variables
$$u = \frac{\theta - \theta_0}{\theta_1 - \theta_0}, \qquad \tau = \frac{a t}{L^2}, \qquad X = \frac{x}{L},$$
to formulate the problem as
$$\frac{\partial u}{\partial \tau} = \frac{\partial^2 u}{\partial X^2}, \qquad u(X, 0) = 0, \quad u(0, \tau) = u(1, \tau) = 1, \qquad 0 \le X \le 1, \quad \tau > 0.$$
Note the discontinuity at the end points for $\tau = 0$.


3. Consider the following difference scheme for the diffusion equation (14.9) and show that
it is always unstable:
$$\frac{u_j^{n+1} - u_j^{n-1}}{2\Delta t} = a\,\frac{u_{j+1}^n - 2u_j^n + u_{j-1}^n}{(\Delta x)^2}.$$
This difference scheme can be stabilised by replacing $u_j^n$ on the right-hand side by the average
over the two neighbouring time steps to obtain
$$\frac{u_j^{n+1} - u_j^{n-1}}{2\Delta t} = a\,\frac{u_{j+1}^n - u_j^{n+1} - u_j^{n-1} + u_{j-1}^n}{(\Delta x)^2}.$$
Show that this difference scheme due to Du Fort and Frankel is always stable. Show that
the truncation error in this difference approximation is
$$\frac{a(\Delta x)^2}{12}\frac{\partial^4 u}{\partial x^4} - a\left(\frac{\Delta t}{\Delta x}\right)^2\frac{\partial^2 u}{\partial t^2} - \frac{(\Delta t)^2}{6}\frac{\partial^3 u}{\partial t^3} + \cdots$$
Hence, show that if $\Delta x, \Delta t \to 0$ such that $\Delta t/\Delta x = \beta = \text{const}$, then the numerical
solution of the diffusion equation tends to the exact solution of the hyperbolic equation
$$\frac{\partial u}{\partial t} = a\,\frac{\partial^2 u}{\partial x^2} - a\beta^2\,\frac{\partial^2 u}{\partial t^2}.$$
A sketch of the Du Fort-Frankel scheme in code is given after this exercise list.

4. Using the Taylor series expansion, estimate the truncation error in the difference scheme
(14.29) for the diffusion equation. Find the higher order terms in this error expansion
and show that, if $r = 1/\sqrt{20}$ and $\sigma = 1/2 - 1/(12r)$, then the first two terms vanish and
the truncation error is $O(h^6)$.
5. Show that the following difference scheme for the diffusion equation (14.9) is always
stable:
$$\frac{1}{12}\,\frac{u_{j+1}^{n+1} - u_{j+1}^{n}}{\Delta t} + \frac{5}{6}\,\frac{u_{j}^{n+1} - u_{j}^{n}}{\Delta t} + \frac{1}{12}\,\frac{u_{j-1}^{n+1} - u_{j-1}^{n}}{\Delta t} = a\,\frac{\delta^2 u_j^{n+1} + \delta^2 u_j^{n}}{2(\Delta x)^2}.$$
Show that this difference method is identical to the implicit method (14.29) with $\sigma =
1/2 - 1/(12r)$. Estimate the truncation error in this difference scheme.
6. Consider the parabolic equation
$$\frac{\partial u}{\partial t} = \frac{\partial^2 u}{\partial x^2} + \frac{\partial u}{\partial x}.$$

Analyse the stability of the explicit difference scheme
$$\frac{u_j^{n+1} - u_j^{n}}{\Delta t} = \frac{u_{j+1}^n - 2u_j^n + u_{j-1}^n}{(\Delta x)^2} + \frac{u_{j+1}^n - u_{j-1}^n}{2\Delta x},$$
and show that the addition of the first derivative term does not affect the stability. Use the
matrix method to analyse stability of this difference scheme, with boundary conditions
$u(0, t) = f_1(t)$ and $u(1, t) = f_2(t)$.
7. Consider the diffusion equation in cylindrical polar coordinates with cylindrical symmetry
$$\frac{\partial u}{\partial t} = \frac{\partial^2 u}{\partial r^2} + \frac{1}{r}\frac{\partial u}{\partial r},$$
subject to boundary conditions
$$\frac{\partial u}{\partial r} = 0 \text{ at } r = 0, \qquad \frac{\partial u}{\partial r} = -u f(t) \text{ at } r = 1.$$
The apparent singularity at r = 0 can be avoided by noting that $(1/r)(\partial u/\partial r) = \partial^2 u/\partial r^2$ at r = 0.
Use the Crank-Nicolson difference scheme to approximate this differential equation and anal-
yse its stability using the matrix method.
8. Show that if r = 0 is not in the range of integration, the equation in the previous problem
can be transformed to the usual form by a change of variable $x = \ln r$. This transformation
is applicable to problems over hollow cylinders.
9. Solve the diffusion equation with variable coefficients

$$\frac{\partial u}{\partial t} = \frac{\partial}{\partial x}\left(x\,\frac{\partial u}{\partial x}\right) - 4u, \qquad 1 \le x \le 2, \quad t > 0,$$
subject to initial and boundary conditions
$$u(x,0) = \frac{\ln x}{\ln 2}, \quad 1 \le x \le 2, \qquad u(1, t) = 0, \quad u(2, t) = e^{-4t}, \quad t > 0.$$
10. Solve the quasilinear diffusion equation

$$\frac{\partial u}{\partial t} = \frac{\partial^2 u}{\partial x^2} + (1 + u^2)(1 - 2u), \qquad 0 \le x \le 1, \quad 0 \le t < \frac{\pi}{2} - 1,$$
subject to the initial and boundary conditions
$$u(x,0) = \tan x, \quad 0 \le x \le 1, \qquad u(0, t) = \tan t, \quad u(1, t) = \tan(1 + t), \quad 0 \le t < \frac{\pi}{2} - 1.$$

11. Solve the parabolic equation

$$\frac{\partial u}{\partial t} = (\sin t\cos t)\,\frac{\partial u}{\partial x} - \sin(2t)\,u\,\frac{\partial^2 u}{\partial x^2}, \qquad 0 \le x \le 1, \quad t \ge 0,$$
subject to the initial and boundary conditions
$$u(x,0) = \sqrt{x}, \qquad u(0, t) = \sin t, \qquad u(1, t) = \sqrt{1 + \sin^2 t}.$$
Also try to solve the same equation for $t > \pi/2$ with initial condition $u(x, \pi/2) = \sqrt{x+1}$.
Explain the results.
12. Solve the Burgers' equation

$$\frac{\partial u}{\partial t} = \frac{\partial^2 u}{\partial x^2} - u\,\frac{\partial u}{\partial x}, \qquad 0 \le x \le 1, \quad 0 \le t \le 2,$$
subject to initial and boundary conditions
$$u(x,0) = \frac{2}{1 + e^{x}}, \qquad u(0, t) = \frac{2}{1 + e^{-t}}, \qquad u(1, t) = \frac{2}{1 + e^{1-t}}.$$

Discretise the space variable and convert the equation to a system of ordinary differential
equations (method of lines). Compare the efficiency of this method with that of the
explicit difference scheme.
13. Consider the following equation which arises in elastic vibrations:
$$\frac{\partial v}{\partial t} = -a\,\frac{\partial^2 w}{\partial x^2}.$$
Show that the explicit difference scheme
$$\frac{v_j^{n+1} - v_j^{n}}{\Delta t} = -a\,\frac{\delta^2 w_j^{n}}{(\Delta x)^2},$$
is stable if $a\Delta t/(\Delta x)^2 \le 1/2$. Study the stability of the implicit difference scheme of
Crank-Nicolson type.
14. Solve the diffusion equation in two space dimensions
$$\frac{\partial u}{\partial t} = \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}, \qquad 0 \le x, y \le 1, \quad t \ge 0,$$
with initial conditions
$$u(x, y, 0) = \sin(\pi x)\sin(\pi y), \qquad 0 \le x, y \le 1,$$
and boundary conditions
$$u(0, y, t) = u(1, y, t) = u(x, 0, t) = u(x, 1, t) = 0, \qquad 0 \le x, y \le 1, \quad t \ge 0.$$
Use the explicit difference scheme as well as the method of alternating directions and
compare their efficiencies.
15. Consider the alternating direction method for the diffusion equation in two space vari-
ables. Eliminate the intermediate values $u_j^{n+1/2}$ to obtain $u_j^{n+1}$ in terms of $u_j^{n}$. Estimate
the truncation error in this method by using the Taylor series expansion. Also analyse the
stability of this method using the Fourier technique and show that it is unconditionally
stable.
16. In the alternating direction method for two space variables, the boundary condition for
the intermediate step should be selected to achieve a second order accuracy. Using the
result of the previous exercise, show that in order to ensure second order accuracy with
Dirichlet boundary conditions, the boundary value should be approximated by (14.78).
17. Solve the diffusion equation in three space dimensions
$$\frac{\partial u}{\partial t} = \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} + \frac{\partial^2 u}{\partial z^2}, \qquad 0 \le x, y, z \le 1, \quad t \ge 0,$$
with initial conditions
$$u(x, y, z, 0) = \sin(\pi x)\sin(\pi y)\sin(\pi z), \qquad 0 \le x, y, z \le 1,$$
and boundary conditions
$$u(0, y, z, t) = u(1, y, z, t) = u(x, 0, z, t) = u(x, 1, z, t) = u(x, y, 0, t) = u(x, y, 1, t) = 0,$$
where $0 \le x, y, z \le 1$ and $t \ge 0$. Use the explicit difference scheme as well as the method
of alternating directions and compare their efficiencies.
18. Consider the explicit difference scheme for the wave equation. Plot the curve traced by
the amplification factor $\xi(m)$ in the complex plane as m varies. Show that if the Courant
condition is satisfied, then this curve is the unit circle and hence the difference scheme
is stable. Plot the curve for $c\,\Delta t/\Delta x = 0.5, 1, 1.5$. Repeat this exercise for the implicit
method with $\theta = 0.1, 0.25$ and 0.5.
19. Consider the first-order hyperbolic equation
$$\frac{\partial u}{\partial t} = \frac{\partial u}{\partial x}.$$

Plot the amplification factor as in the previous exercise for the difference approximations
$$(i)\ u_j^{n+1} = u_j^{n} + \frac{\Delta t}{\Delta x}\left(u_{j+1}^{n} - u_j^{n}\right), \qquad
(ii)\ u_j^{n+1} = u_j^{n} + \frac{\Delta t}{\Delta x}\left(u_j^{n} - u_{j-1}^{n}\right).$$
Which of these difference schemes are stable? Repeat this exercise for the Lax-Friedrichs
and the Lax-Wendroff difference schemes.
20. Consider the first-order hyperbolic equation of the previous problem subject to initial
condition u(x, 0) = 1 and boundary condition u(1, t) = 1. Analyse the stability of the
leapfrog method
$$u_j^{n+1} = u_j^{n-1} + \frac{\Delta t}{\Delta x}\left(u_{j+1}^{n} - u_{j-1}^{n}\right).$$
This difference scheme requires an additional boundary condition to solve the resulting dif-
ference equations. Impose an additional boundary condition u(0, t) = 0 and solve the
difference equation. This difference scheme also requires an additional initial condition,
but in order to study the effect of the boundary condition assume the exact value $u(x, \Delta t) = 1$
to start the integration.
21. Consider the first-order hyperbolic equation in {19} subject to the initial condition
$$u(x,0) = f(x) = \begin{cases} 1, & \text{if } 0 < x < 0.2; \\ 0, & \text{otherwise}; \end{cases}$$
over $-\infty < x < +\infty$. Study the change in the shape of the wave packet with time using (i)
the explicit method, (ii) the implicit method and (iii) the Lax-Friedrichs method.
22. Solve the hyperbolic equation
$$\frac{\partial u}{\partial t} = -\frac{1}{4}x\,\frac{\partial u}{\partial x}, \qquad 0 \le x \le 4, \quad 0 \le t \le 2,$$
subject to initial and boundary conditions
$$u(x,0) = f(x) = \begin{cases} 1, & \text{if } 1 < x < 1.2; \\ 0, & \text{otherwise}; \end{cases} \qquad u(0, t) = 0.$$

23. Solve the nonlinear hyperbolic equation
$$\frac{\partial u}{\partial t} + u\,\frac{\partial u}{\partial x} = 0,$$
with the initial condition
$$u(x,0) = \begin{cases} 1, & \text{if } x \le 0; \\ 1 - x, & \text{if } 0 \le x \le 1; \\ 0, & \text{if } x > 1. \end{cases}$$
The solution of this equation develops into a shock.
24. Show that the explicit difference scheme (14.94) for the wave equation (14.93) is uncon-
ditionally unstable.
25. Analyse the stability of the implicit difference scheme (14.92) for the wave equation.
26. Consider the Lax-Wendroff difference scheme applied to the wave equation (14.93). Elim-
inate $u_j^{n+1/2}$ from the two equations and find the amplification matrix for the difference
scheme and verify (14.111).


27. Solve the wave equation in Example 14.3 subject to initial and boundary conditions
$$u(x,0) = \sin(\pi x), \quad \frac{\partial u}{\partial t}(x,0) = -\pi\cos(\pi x), \quad u(0, t) = -\sin(\pi t), \quad \frac{\partial u}{\partial x}(1, t) = -\pi\cos(\pi t).$$
Use the explicit difference scheme with various values of time steps to demonstrate the
stability and instability of the difference method. Repeat the exercise with the following
initial and boundary conditions
$$u(x,0) = e^x\sin 2x, \qquad \frac{\partial u}{\partial t}(x,0) = -e^x(2\cos 2x + \sin 2x), \qquad 0 \le x \le 1;$$
$$u(0, t) = -e^{-t}\sin 2t, \qquad u(1, t) = e^{1-t}\sin(2 - 2t), \qquad t > 0.$$

28. Study the effect of roundoff error on the solution of initial value problems by carrying out a
time step and then perturbing the results by a factor of $(1 + \epsilon R)$, where R is a random
number in the interval (-0.5, 0.5). Choose $\epsilon = 10^{-2}$ and try to solve the equations in
Examples 14.1 and 14.3 using this technique. Study the accumulation of roundoff error
as the computation progresses when a perturbation is applied after every time step.
29. Solve the following equation
$$x^2\frac{\partial^2 u}{\partial x^2} - t^2\frac{\partial^2 u}{\partial t^2} = xu\frac{\partial u}{\partial x} + xtu^3, \qquad 0 \le x \le 1, \quad t \ge 0,$$
subject to initial and boundary conditions
$$u(x,0) = 1, \quad \frac{\partial u}{\partial t}(x, 0) = -x, \qquad u(0, t) = 1, \quad u(1, t) = \frac{1}{1+t}.$$
30. Consider the following equations for a compressible fluid with heat conduction:
$$\frac{\partial u}{\partial t} = c\,\frac{\partial w}{\partial x} - c(\gamma - 1)\,\frac{\partial e}{\partial x}, \qquad \frac{\partial w}{\partial t} = c\,\frac{\partial u}{\partial x}, \qquad \frac{\partial e}{\partial t} = \sigma\,\frac{\partial^2 e}{\partial x^2} - c\,\frac{\partial u}{\partial x},$$
where $c$ and $\gamma$ are constants. Here the hyperbolic equations of sound waves are coupled
to parabolic equations governing heat flow. These equations can be approximated by the
following explicit difference scheme
$$\frac{u_j^{n+1} - u_j^{n}}{\Delta t} = c\,\frac{w_{j+1/2}^{n} - w_{j-1/2}^{n}}{\Delta x} - c(\gamma - 1)\,\frac{e_{j+1/2}^{n} - e_{j-1/2}^{n}}{\Delta x},$$
$$\frac{w_{j+1/2}^{n+1} - w_{j+1/2}^{n}}{\Delta t} = c\,\frac{u_{j+1}^{n+1} - u_j^{n+1}}{\Delta x},$$
$$\frac{e_{j+1/2}^{n+1} - e_{j+1/2}^{n}}{\Delta t} = \sigma\,\frac{e_{j+3/2}^{n} - 2e_{j+1/2}^{n} + e_{j-1/2}^{n}}{(\Delta x)^2} - c\,\frac{u_{j+1}^{n+1} - u_j^{n+1}}{\Delta x}.$$
Experimentally show that this difference scheme is stable if
$$\sqrt{\gamma}\,c\,\Delta t < \Delta x, \qquad \sigma\,\Delta t < \tfrac{1}{2}(\Delta x)^2.$$
The second condition can be avoided if the third equation is approximated by a Crank-
Nicolson type of difference scheme.
31. Solve the eigenvalue problem which gives the eigenfrequencies of a homogeneous square
membrane fixed at the outer edge
$$\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = \lambda u, \qquad u(0, y) = u(1, y) = u(x, 0) = u(x, 1) = 0.$$
Find the three lowest eigenvalues using the usual five-point formula. Prove the following
nine-point formula on a square grid with spacing h
$$4(u_{j+1,l} + u_{j-1,l} + u_{j,l+1} + u_{j,l-1}) + u_{j+1,l+1} + u_{j-1,l+1} + u_{j+1,l-1} + u_{j-1,l-1} - 20u_{jl} = \left(6h^2\lambda + \tfrac{1}{2}h^4\lambda^2\right)u_{jl}.$$
Use this formula to calculate the eigenvalues and compare the two results.

32. Using (14.156) show that $\rho(B_\omega)$ is minimum when $\omega$ is given by (14.157). Plot the function
$\rho(B_\omega)$ as a function of $\omega$ and show that it has a slope of unity for $\omega > \omega_b$, but an infinite
slope as $\omega \to \omega_b^-$.
33. Use the nine-point formula to solve the problem in Example 14.4 and find the optimum
value of the relaxation parameter w.
34. Solve the Poisson's equation $\nabla^2 u = 2e^{x+y}$ subject to boundary conditions
(i) $u(0, y) = e^y$, $u(1, y) = e^{y+1}$, $u(x, 0) = e^x$, $u(x, 1) = e^{x+1}$;
(ii) $\dfrac{\partial u}{\partial x} = u$, $x = 0, 1$; $u(x, 0) = e^x$, $u(x, 1) = e^{x+1}$;
(iii) $u(0, y) = e^y$, $u(1, y) = e^{y+1}$; $\dfrac{\partial u}{\partial y} = u$, $y = 0, 1$;
(iv) $\dfrac{\partial u}{\partial x} = u$, $x = 0, 1$; $\dfrac{\partial u}{\partial y} = u$, $y = 0, 1$;
and compare the results.
35. Solve the Laplace's equation in two dimensions inside a triangular region with boundary
conditions
$$u(x,0) = -x^3, \quad 0 \le x \le 1; \qquad u(0, y) = y^3, \quad 0 \le y \le 1;$$
$$u(x, 1-x) = 4x^3 - 6x^2 + 1, \quad 0 \le x \le 1.$$
Try to estimate the optimum value of the relaxation parameter $\omega$ experimentally and
compare it with that for a square region with the same mesh spacing. What changes are
required if the first boundary condition is replaced by the Neumann condition
$$\frac{\partial}{\partial y}u(x, 0) = -3x^2.$$

36. Solve the Laplace's equation in two dimensions inside the unit circle $x^2 + y^2 \le 1$, subject
to the boundary condition $u(x, y) = x^2 - y^2$ at the boundary $x^2 + y^2 = 1$.
37. Solve the Poisson's equation $\nabla^2 u = -2$ inside the ellipse
$$\frac{x^2}{a^2} + \frac{y^2}{b^2} = 1,$$
subject to the boundary condition u = 0 at the boundary. Use a = 1, b = 0.5.
38. Solve the potential problem in the region between two squares
(hatched area in the adjoining figure). Here the inner square
has sides of length 1/2, while the outer one has sides of length
2. The boundary conditions are u = 0 on the outer boundary
and u = 1 on the inner boundary. Because of the symmetry we
need to consider only 1/8 of the total region. Find the equiva-
lent boundary conditions on this region and solve the Laplace's
equation to find the potential. Find the optimum value of the
relaxation parameter $\omega$ for this problem.
39. Write down the finite difference equations for Poisson's equation with periodic boundary
conditions
$$u(0, y) = u(1, y), \qquad \frac{\partial u}{\partial x}(0, y) = \frac{\partial u}{\partial x}(1, y), \qquad
u(x, 0) = u(x, 1), \qquad \frac{\partial u}{\partial y}(x, 0) = \frac{\partial u}{\partial y}(x, 1).$$
Compare the pattern of nonzero elements in the resulting finite difference matrix with
that shown in Section 14.7.
40. Estimate the truncation error in the nine-point formula (14.136) for Poisson's equation.
Show that for Laplace's equation the first two terms vanish giving an accuracy of $O(h^6)$.
How will you modify the scheme for Poisson's equation to get $O(h^4)$ accuracy? Try both
versions on the equation in {34} and compare the results.

41. Show that the nine-point formula of the previous problem can be modified to give O(h6)
accuracy for Poisson equation if V'2 f is approximated by the 9-point formula and V'4 f
is approximated using

where
$$S_1 = f_{j+1/2,\,l} + f_{j-1/2,\,l} + f_{j,\,l+1/2} + f_{j,\,l-1/2},$$
$$S_2 = f_{j+1/2,\,l+1/2} + f_{j-1/2,\,l+1/2} + f_{j+1/2,\,l-1/2} + f_{j-1/2,\,l-1/2},$$
$$S_3 = f_{j+1,\,l} + f_{j-1,\,l} + f_{j,\,l+1} + f_{j,\,l-1},$$
which is based on the half grid. Verify experimentally the accuracy of this method by solving
the equation in the previous problem.
42. Use the Fourier transform method to solve the Poisson's equation $\nabla^2 u = 2e^{x+y}$, $0 \le
x, y \le 1$, with boundary conditions
(i) $u(0, y) = e^y$, $u(1, y) = e^{y+1}$, $u(x, 0) = e^x$, $u(x, 1) = e^{x+1}$;
(ii) $\dfrac{\partial u}{\partial x}(0, y) = e^y$, $\dfrac{\partial u}{\partial x}(1, y) = e^{y+1}$, $\dfrac{\partial u}{\partial y}(x, 0) = e^x$, $\dfrac{\partial u}{\partial y}(x, 1) = e^{x+1}$.

43. Solve the following elliptic equation
$$4\frac{\partial^2 u}{\partial x^2} + x^2\frac{\partial^2 u}{\partial x\partial y} + 4\frac{\partial^2 u}{\partial y^2} = 0, \qquad 0 \le x, y \le 1,$$
subject to the boundary conditions
$$u(x, 0) = x^2, \quad u(x, 1) = x^2 - 1, \quad 0 \le x \le 1;$$
$$u(0, y) = -y^2, \quad u(1, y) = 1 - y^2, \quad 0 \le y \le 1.$$

44. Solve the elliptic equation
$$x^2\frac{\partial^2 u}{\partial x^2} + xy\frac{\partial^2 u}{\partial x\partial y} + y^2\frac{\partial^2 u}{\partial y^2} - 2x\frac{\partial u}{\partial x} - 2y\frac{\partial u}{\partial y} + 3u = 3x^3 - 3y^3,$$
inside the unit square $0 \le x, y \le 1$ subject to the boundary conditions
$$u(0, y) = -y^3, \quad u(1, y) = 1 + y - y^3, \quad u(x, 0) = x^3, \quad u(x, 1) = x^3 + x - 1.$$

45. Consider the biharmonic equation
$$\nabla^4 u = \frac{\partial^4 u}{\partial x^4} + 2\frac{\partial^4 u}{\partial x^2\partial y^2} + \frac{\partial^4 u}{\partial y^4} = 0.$$
Show that the following difference approximation has a truncation error of $O(h^2)$ on a
square grid with $\Delta x = \Delta y = h$
$$20u_{jl} - 8S_1 + 2S_2 + S_3 = 0,$$
where
$$S_1 = u_{j+1,l} + u_{j-1,l} + u_{j,l+1} + u_{j,l-1},$$
$$S_2 = u_{j+1,l+1} + u_{j-1,l+1} + u_{j+1,l-1} + u_{j-1,l-1},$$
$$S_3 = u_{j+2,l} + u_{j-2,l} + u_{j,l+2} + u_{j,l-2}.$$
Solve the equation subject to the boundary conditions
$$u(0, y) = y^5, \quad \frac{\partial u}{\partial x}(0, y) = 0, \quad u(1, y) = 1 - 5y^2 - 5y^3 + y^5, \quad \frac{\partial u}{\partial x}(1, y) = 5 - 15y^2 - 10y^3,$$
$$u(x, 0) = x^5, \quad \frac{\partial u}{\partial y}(x, 0) = 0, \quad u(x, 1) = 1 - 5x^2 - 5x^3 + x^5, \quad \frac{\partial u}{\partial y}(x, 1) = 5 - 15x^2 - 10x^3.$$

46. Solve the Laplace's equation in three dimensions inside a unit cube subject to the bound-
ary conditions
$$u(x, y, 0) = x^2 + 2y^2, \quad u(x, y, 1) = x^2 + 2y^2 - 3, \quad u(0, y, z) = 2y^2 - 3z^2,$$
$$u(1, y, z) = 1 + 2y^2 - 3z^2, \quad u(x, 0, z) = x^2 - 3z^2, \quad u(x, 1, z) = 2 + x^2 - 3z^2,$$
where $0 \le x, y, z \le 1$.
47. Solve the following equation which arises in subsonic flow problems
$$\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} + \frac{\partial^2 u}{\partial z^2} = u^2,$$
subject to the boundary conditions
$$u(x, y, 0) = \frac{126}{(x+2y+1)^2}, \qquad u(x, y, 1) = \frac{126}{(x+2y+5)^2}, \qquad u(0, y, z) = \frac{126}{(2y+4z+1)^2},$$
$$u(1, y, z) = \frac{126}{(2y+4z+2)^2}, \qquad u(x, 0, z) = \frac{126}{(x+4z+1)^2}, \qquad u(x, 1, z) = \frac{126}{(x+4z+3)^2},$$
where $0 \le x, y, z \le 1$.


48. On the standard unit right-angled triangle, obtain the cubic basis functions of the form
$$u(\xi, \eta) = \alpha_0 + \alpha_1\xi + \alpha_2\eta + \alpha_3\xi^2 + \alpha_4\xi\eta + \alpha_5\eta^2 + \alpha_6\xi^3 + \alpha_7\xi^2\eta + \alpha_8\xi\eta^2 + \alpha_9\eta^3,$$
using the nodes shown in Figure 14.5.
49. Obtain the basis functions for biquadratic approximation of the form
$$u(\xi, \eta) = \alpha_0 + \alpha_1\xi + \alpha_2\eta + \alpha_3\xi^2 + \alpha_4\xi\eta + \alpha_5\eta^2 + \alpha_6\xi^2\eta + \alpha_7\xi\eta^2 + \alpha_8\xi^2\eta^2,$$
on a unit square with nodes $(i/2, j/2)$, $i, j = 0, 1, 2$.
50. Solve the boundary value problem in {36} using finite element method with triangular
elements.
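The following Python sketch (not part of the original text) implements the Du Fort-Frankel scheme referred to in exercise 3, for time-independent Dirichlet boundary values; taking a simple explicit first step to provide the second time level is an assumption made here for illustration.

```python
import numpy as np

def du_fort_frankel(u0, a, dx, dt, nsteps):
    """Du Fort-Frankel time stepping for u_t = a u_xx on a uniform grid.
    u0 holds the initial profile; the end values are kept fixed (Dirichlet)."""
    r = a * dt / dx**2
    old = u0.copy()
    cur = old.copy()
    # First step with the ordinary explicit (FTCS) scheme, since the
    # leapfrog-like formula needs two previous time levels.
    cur[1:-1] = old[1:-1] + r * (old[2:] - 2 * old[1:-1] + old[:-2])
    for _ in range(nsteps - 1):
        new = old.copy()
        # (1 + 2r) u^{n+1} = (1 - 2r) u^{n-1} + 2r (u_{j+1}^n + u_{j-1}^n)
        new[1:-1] = ((1 - 2 * r) * old[1:-1]
                     + 2 * r * (cur[2:] + cur[:-2])) / (1 + 2 * r)
        old, cur = cur, new
    return cur
```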
Appendix A
Answers and Hints

A.1 Introduction
1.
(i) O(~),
n
(iii)O(~), (v) 0 (1/~:~!1)) , (vii) 0 (10- 10 Inn),

(ii) O( 1.:\
n
(iv) 0 C· 5 ,
-;;- ) (vi) 0 (~s ) , (viii) 0 CO-
n O.O!
8
) .

The series (vii) is divergent, but the divergence is not likely to be detected experimen-
tally, while series (viii) may apparently converge much faster as O(I/n), if the accuracy
requirement is not very high.
2. n = 3, 5, 18, 45, 138. For higher values of x roundoff error will dominate (see Example 2.3).
The roundoff error is of the order of $\hbar e^x/\sqrt{2\pi x^3}$.
3. To find the order of convergence, we can take logarithm

The terms in the summation will be $O(x^2/i^2)$. Hence, the number of terms required for a
relative accuracy of $10^{-6}$ is of the order of $10^6 x^2$. The roundoff error may be of the order
of $\hbar\,10^3 x$. The asymptotic formula is very efficient, but does not give sufficient accuracy
for x = 0.1 and 1, unless the range is shifted by using $\Gamma(x + 1) = x\Gamma(x)$.
4. The first series has a large roundoff error for x > 2, the number of terms required for
various values of x are approximately 3, 6, 10, 22, 65 and 137, respectively. The continued
fraction converges very fast for higher values of x, it requires about 3000, 120, 40, 12,
6 and 4 terms, respectively. Hence, it is more efficient and accurate as compared to the
series for x > 1.
The asymptotic series does not give sufficient accuracy for x ::; 2. For
x = 4 and 6 it requires only 10 and 5 terms, respectively for the specified accuracy.
5. The rates of convergence of these methods are:

(i) 0 (~), (ii) 0 (22 nn'~ 5.fii)' (iii) 0 (27r:) , (iv) 0 (7r
32), (v) 0 (n-1.5).
n 7r 3n 6n

Here it should be noted that in (iii) and (iv), n = 2i and so effective rate of convergence of
(ii), (iii) and (iv) are comparable. However, methods (iii) and (iv) require two square roots
at each step, and will be somewhat less efficient as compared to (ii). Further, (iii) and
(iv) are numerically unstable, since they require subtraction of two nearly equal numbers
when evaluating $\sin\theta$. Methods (i) and (v) have a slow rate of convergence and roundoff

error will not allow full accuracy to be achieved. Hence, only method (ii) will achieve full
accuracy in a reasonable effort.
6. Contact antia@tifr.res.in with the answer.

A.2 Roundoff Error


1.
(i) 52.625 x 11.375 = 598.609375 = (1001010110.100111h ,
(ii) 183.828125 x 28.296875 = 5201.761474609375 = (12121.6057)8 ,
(iii) 2989 x 57069 = 170579241 = (A2AD529h6,
(iv) 44.78 x (-7.55) = -338.089 = (1742.129)_10 ,
(v) (-55.777 ... ) x (-8.111 ... ) = 452.419753086 ... = (II01I11.11Ilh ,
(vi) 19.375 x (-8.875) = -171.953125 = (1101010100.000111)_2 ,
(vii) (8.25 - 25i) x (-7.5 + 5.5i) = 75.625 + 232.875i = (1123013110.3232hi ,
(viii) 2 x 3 = 6 = (111011100)i_1 .

2. (a) DEAD,

(b) 27.35 = (11011.010110h = (1101100.lOi01i)_2 = (188.75)-10 = (looo.iooih ,


-27.35 = (100101.11i110)_2 = (33.45)-10 = (Iooo.1001h ,
13~ = (1101.00i)2 = (11101.01100i)-2 = (194.958)_10 = (l11.0110Ilh ,
(c) 1121003.2332, (d) 111011111 .
3. (1.909090 ... )-10 = (0.09090909 ... )-10 = rt- .
4. 7f = 3.1415926535897932384626433832795028841971 ... 2164201989.
5. For addition in the system with b = i - 1 use the fact that 1 + 1 = 1100 in this system.
In some cases, this technique will lead to a nonterminating process, e.g., 11 + 111 = 0
(i + (-i) = 0). Hence, this condition has to be checked explicitly.
6. To convert numbers to balanced ternary system, first represent the number in the usual
ternary system (b = 3); for example, 2120.12. Now add the infinite number ... 111.111 ...
in ternary notation to get ... 11121002.00111 ... , and finally subtract ... 111.111 ... by
decreasing each digit by one to get lOIl1.II.
To convert number to "binary" complex system, first express real and imaginary parts
in hexadecimal system and then express each digit in "binary" complex and shift it
appropriately by noting that (i - 1)8 = 16 and then add the numbers using algorithm of
the previous problem.
7.
1 ~ 1
1
0= - +~ 1
L...,(-4-) x 10
-I.
and
-k
-=L...,(4-)xlO .
2 1.21 2 2 1.21 2
The various representation of zero can be obtained by multiplying the above represen-
tation by any power of ten or by -1. The representation of unity are ~ + ~, 1 ~ - ~,
5 - 4~ + ~, 50 - 45 - 4~ + ~, etc., where ~ can be replaced as in above equation. To
convert numbers from decimal to "balanced" decimal representation, add a suitable rep-
resentation of zero to the number by decreasing each digit independently by the required
amount. This representation will not be unique, but any number can be represented in
this notation.
8. The possible number systems are (i) b = 5, (ii) balanced ternary, (iii) b = -3.

23 33 101 III 12 12
(i) ~ ~ (ii) 101 111 (iii) -20 -21
111 121 II 11 122 121

9. The cases (ii), (vii) and (viii) will give overflow. For ones complement case (vi) will also
give overflow.
10. The result in hexadecimal form are (i) 4049 OFDB, (ii) 66FF 29AA, (iii) 12A 7 14C3.
11. Range (2- 976 , 2 1072 ) ::::0 (1.57 X 10- 294 , 5.06 X 10 322 ). If unnormalised numbers are
allowed, the lower limit will be 2- 1023 ::::0 1.11 X 10- 308 . X = 2- 50 ::::0 8.88 X 10- 16 .
12. (a) If the result is always zero or larger than nab in magnitude, then it implies that the
computer rounds off each intermediate product to single length number before subtrac-
tion. If the result usually comes out to be smaller than nab in magnitude, but is not
otherwise accurate, then it implies that one of the product is not rounded off to single
length value before subtraction. On the other hand, if it actually gives a correct result,
then both the double length products are retained and used in subtraction process.
(b) The result for 3 n may not be correct because of integer overflow.
13. It may be noted that all numbers are between $2^{20}$ and $2^{21}$. Hence the numbers are
represented in the form $f \times 2^{21}$, which requires 21 bits for the integer part. Consequently,
for integers in this range the last three bits will be zero. The last three bits for numbers
of form x.l, x.2, x.3 are 001, 010, and 010, respectively. Hence, all numbers of form
x.2 and x.3 will be identical in such a representation. Similarly, x.7 and x.8 are also
identical. If the number of bits is increased to 25, all these numbers will have different
representations. If the number of bits is 23, then following groups will have the same
representation: ((x - 1).9, x.O, x.l), (x.2, x.3), (x.4, x.5, x.6), (x.7, x.8).
14. (i) (2.03985, 2.04095), (ii) (0.6115, 0.6225), (iii) (17.252, 17.284),
(iv) (241.43, 247.73), (v) (1.0335, 1.0337),
(vi) (425.53, 444.45), (vii) (1.9387, 1.9442),
(viii) (172.52, 00) and (-00, -237.88), (ix) (22.988, 23.220).
15. (a) Note that 1048 = 2 160 /l, 1032 = 2 107 /2,314 = 0.100111010 x 29 ,
567 = 0.1000110111 x 2 10 . If we assume that at each step the sum is rounded off to t
bits, the value of the expression using t-bit arithmetic is as follows:
t :s: 96 97 98-100 101-2 103 104-5 106 107-150 151 152-3 154-6 157 158 :::: 159
v 0 1024 512 576 560 568 566 567 1079 823 887 879 883 881
For t = 106, 158, the result will depend on how rounding is done in the case when the
number is exactly midway between two neighbouring numbers.
(b) Since the largest term is approximately 3.8 x 10 5 , roundoff error will be ::::0 3.8 x 105 1i.
Equating this to the actual value we get the minimum number of bits required to be 48.
(c) As in (b) 16.31i::::o 1.2 x 10- 19 giving 68 bits.
16. (a) Using 3-digit decimal arithmetic, 0.1111)9 0.996 = 0.111 and 0.111 00.996 = 0.111.
However, for binary arithmetic this theorem cannot be violated, provided the process of
multiplication and division produces a correctly rounded result. Using a t-bit arithmetic
the best possibility for violation occurs when x = 0.111 ... = 1- 2- t . Then ax = a- 2- t a,
and if a is normalised the first bit will be one. Thus, a 1)9 x can be obtained by subtracting
one from the last bit in a. Similarly, we can prove the result for division.
(b) The first identity is always valid, but the others are not. For example,
1 1 1
-1/-0.-99-6 = 1, -1/-0.-33-5 = -2.9-9 = 0.334, "-1/-:-:-(0-.1-1-:1)-2 = -1.0-1 = 0.110,
9.96 5.77
0.996 -:f. 9.96 x 1, 0.335 = 17.2 -:f. 5.77 x 2.99 = 17.3,
(0.110h
- - - = 0.111 i- 0.110 x 1.01 = 1.00.
(0.111h

17. Consider a = 0.325, b = 0.322, c = 0.315, Then a 1)9 (bl)9c) = 0.0328 and (al)9b)l)9c = 0.0331.
a 1)9 (bl)9c) (1+E1)(I+f2) -1+8
(a 1)9 b) 1)9 c (1 + f3)(1 + f4) - ,

where Ifil < Ii and 181 < 4.241i.



18. (a) The maximum error will occur for numbers of the form 0.10049999 ... , which gives
an upper bound of 1i/(1 + Ii).
(b) Let a = bea la and x = bex lx, and a t-digit fraction part is being used, then assuming
that the numbers are normalised, the following cases arise:
(i) If ea - ex 2: t + 2, then a Efl x = a.
(ii) If ea - ex = t + 1, then a Efl x -I- a if and only if ax < 0 and Iia I = l/b and I/x I > 1/2.
For example, 0.100 - 0.501 x 10- 4 = 0.999 X 10- 1 .
(iii) If ea - ex = t, then a Efl x = a if and only if I/xl < 1/2 and (ilal > l/b or ax > 0).
(iv) If ea - ex < t, then a Efl x -I- a.
It may be noted that for binary arithmetic I/x I 2: 1/2 and cases (iii) and (iv) can be
combined.
19. (i) a = 0.900 x 10- 1 , b = d = 0.100 X 10- 9 , c = 0.100, Y = 0.100 X 10- 8 , gives 0.5 and
0.95 for the two sides.
(ii) a = c = 0.100 X 10- 9 , b = 0.100 X 10- 4 , d = 0.900 X 10- 5 , Y = 0.100 X 106 , gives
1.05 and 2 for the two sides.
20. (i) The roundoff error in evaluating the expression is less than 2.12lilx(1 - x)l. The error
due to initial uncertainty in x is
d
-(x(l - x») EX = (1 - 2X)EX.
dx
(ii) Assuming that the square root function gives the correctly rounded value, the roundoff
error is bounded by
~ (3+4x 2 )
liv 1 + x· 2(1 + x 2 ) .

The error due to initial uncertainty in x is Ex2/V1 + x 2.


(iii) Roundoff error is bounded by

5 + 6x ) 2
Ii ( 21100 - xl + vII + x 2 2(1 + x2)
'

For positive values of x it will be better to rewrite it as


1
100+ .
x + vI + x 2
The error due to initial uncertainty in x is

For positive values of x it will be much less than the corresponding error in (ii), because
of cancellation.
21. (a)
1 {1.33 + (n-4)0.33, for 4 ::; n ::; 30;
L. -3 =
n
10.2 + (n - 31)0.3, for 31 ::; n ::; 330;
,=1 100, for n > 330;
(b) The result depends on the compiler, if the compiler rounds off the sum to single
length after each addition, the result will be 127.96875 + n2- 17 for n ::; 4096, while for
4096::; n < 4096 + 2 23 the result is 128 + (n - 4096)2- 16 . For still higher values of n the
sum will be constant at 256. If the computer does not round off the result after every
step, then the result will be essentially correct, with the error depending on the length
of the accumulator.
22. The result depends on the compiler, the precision of arithmetic used and how the state-
ments are actually coded in the program. Roundoff error could be important when 1 - p
is computed, or when the numerator of the analytic result is computed. Computing the
approximation by Simpson's rule can have large roundoff in case (ii) while evaluating

b - a. In (i) if 1 - P is replaced by lO- n directly, or if the compiler uses double length ac-
cumulator to calculate the difference, the result could be more accurate. Similar remarks
apply to evaluating b - a for case (ii) while using Simpson's rule.
23. (i) 1, (ii) 0, (iii) 0, (iv) 1, (v) -!,(vi) la,
(vii) -1, (viii) 0, (ix) , = 0.5772 ... , (x) 0,
(xi) ~, (xii) 1.
24. (a) To improve the accuracy use the first few terms of Taylor series, (ii) is of course
identically zero.
(b) Use Taylor series expansion in terms of ~.
25. (a) 0.100080.5001 x 10- 5 = 0.9999 X 10- 1 ,
(b) 0.10068 0.7349 X 10- 3 = 0.9987 X 10- 1 , 0.100680.7351 x 10- 3 = 0.9986 X 10- 1,
0.8736 ED 0.6499 x 10- 3 = 0.8742, 0.9996 ED 0.8999 x 10- 3 = 0.1000 X 10 1 .
(c) Try the above examples using (t + 1)-bit accumulator.
26. Let us assume that x :::: y, if this condition is not satisfied, then the numbers may be
interchanged. In this case, before addition y may be shifted right if necessary. Since
the accumulator is of the same size as the numbers, this shift will be accompanied by
rounding. After addition, if there is no overflow (i.e., the fraction part is less than one),
there will be no further rounding. If the result overflows, there will be additional rounding
when the result is shifted right by one place.
(a) In this case, the only rounding (if any) occurs when y is shifted right before addition.
It can be shown that, this error will be bounded by nlxl.
(b) In this case, the sum has t + 1 digits, the first rounding which occurred when y was
rounded off to t digits will be less than half in the last digit, while the second rounding
will be b times the first, where b is the base of arithmetic being used. This gives the
required result. (e.g., 0.9856 ED 0.03484 = 1.020.)
(c) Consider 1.00080.9996 = 0 with 4-digit accumulator. In fact, using such an arithmetic
a 0 b = 0 does not imply a = b.
27. (i) (2,0.1352), (ii) (4,0.3272), (iii) (4,0.4989), (iv) (4,0.7162), (v) (4,0.9875),
(vi) (5,0.1436), (vii) (-4,0.3271). The largest number that can be represented is larger
than
101000000
1010
101010
In level-index representation the results are indistinguishable from
(i) lO lDOO , (ii) lOlDlooo.

28. (i) $2.1 \times 10^{-9}$, (ii) $5.7 \times 10^{-7}$, (iii) $2.9 \times 10^{-6}$, (iv) $6.8 \times 10^{-5}$, (v) 1.18, (vi) $10^{1627}$, (vii)
$5.7 \times 10^{-7}$. For floating-point arithmetic the relative error is $\hbar = 2^{-24} \approx 6 \times 10^{-8}$ as
long as there is no overflow. The largest number that can be represented is larger than
lOlDOOOOO
101010

29. Try random values of a and b such that (a - b)10 < n(a + b)lD. The maximum relative
difference for violation of inequality would be ~ 2nO. 1 .
30. The result is essentially the same as that in Section 2.3, except that an additional relative
error bounded by $\hbar$ will be present in all terms to be summed. Summing in the natural
order, the error will be
$$E_n = \sum_{i=1}^{n} x_i y_i \epsilon_i,$$
where $|\epsilon_1| < 1.06\,n\hbar$ and $|\epsilon_i| < 1.06(n - i + 2)\hbar$, $(i = 2, 3, \ldots, n)$. For the cascade sum the
same result applies with $|\epsilon_i| < 1.06(1 + \lceil\log_2 n\rceil)\hbar$. A short illustration of cascade
summation in code is given at the end of this section.
31. If we are using a t-bit binary arithmetic, choose $x_1 = 1$, $x_2 = 1 - 2^{-t}$, $x_3$ to $x_4 = 1 - 2^{1-t}$,
$x_5$ to $x_8 = 1 - 2^{2-t}$, ..., $x_{2^{r-1}+1}$ to $x_{2^r} = 1 - 2^{r-1-t}$. It can be easily verified that the
computed sum will be $2^r$, which gives an error of
where $n = 2^r$ is the total number of terms.


32. The upper bounds on the total error for large values of n are

1.06h(% In2), forward sum;

{ 1.06h(0.25 Inn) + hln2, backward sum;


IEI<
1.06hllog2 n lin 2, cascade sum;
1.06h2 (% In2) + 2hln2, forward with double length accumulation.

33. (a) For positive values of x the accuracy will be reasonable, but for negative values the
roundoff error ~ ha a la!, where a = Ixl. To improve the results for negative values of x,
a simple algorithm is to write eX = 1/e- x and use the Maclaurin series to calculate e- X.
A better algorithm is to write eX = 2n 2Y , where n is the nearest integer to xl In 2, and
Y = xl In 2 - n. 2 Y can be calculated by using a series expansion.
(b) The roundoff error is of the order of h, while the function value decreases as x -+ 7r.
Hence, relative error is ~ h/lx - nl.
34. The second expression gives more accurate result, since the absolute error in summation
is less.
36. (a) There should be no roundoff error when x = 0.125, since that can be exactly repre-
sented in binary system.
(b) The exact value is XSn = 0.333 and YSn = 0.111.
37. This recurrence is unstable in the forward direction.
38. (a) 6.16 (n=201), (b) 5.90, (c) 7.15, (d) 10 (n=12492), 10.003, 11.751.
(e) For 48-bit arithmetic the sum will be between 16 and 32. Hence, the process will
stop when $n = 2^{44} \approx 1.76 \times 10^{13}$, and $S_n \approx \ln n + \gamma \approx 31.1$. For 50-bit arithmetic
the summation will stop when the sum becomes 32, which happens for $n \approx e^{32-\gamma} \approx
4.43 \times 10^{13}$.
39. The roots correct to ten decimal digits are as follows:
(i) - 9496676.923, -3.094199752 x 10- 8 (iv), (vi) - 0.2144495413 ± 0.8606443179i
(ii) 7592548.387, 5.225836877 x 10- 9 (v) - 1.064587987, -1.065493314
(iii) - 2.1, -2.1 (vii) - 1.933734940, -3.924349882 X 102n - 1

In (vii) the second root gives overflow and cannot be determined. Here it is assumed that
n > O. If n < 0 the roots are -1.962174941 X 1O- 2n - 1 ± 8.711287208 x 1O- n - 1i, and
the real part will give underflow.
41.
x= (-95260473 90280971 )
99679967 -94469446 .

42. This is a well-known example of ill-conditioned polynomials (see Example 7.10).


43. Truncation error $\approx h f''(x)/2$; roundoff error $\approx 2\hbar|f(x)|/h$.
44. The solution of the equations is
$$x = \frac{1 - 2a}{1 - a^2}, \qquad y = \frac{2 - a}{1 - a^2}.$$
The condition numbers are
$$C_x = \left|\frac{a\,x'(a)}{x(a)}\right| = \left|\frac{2a(a - a^2 - 1)}{(1 - a^2)(1 - 2a)}\right|, \qquad C_y = \left|\frac{a(4a - 1 - a^2)}{(1 - a^2)(2 - a)}\right|,$$
$$C_{x+y} = \left|\frac{a}{1 + a}\right|, \qquad C_{x-y} = \left|\frac{a}{1 - a}\right|, \qquad
C_{x^2-y^2} = \left|\frac{2a^2}{1 - a^2}\right|, \qquad C_{x^2+y^2} = \left|\frac{2a(5a^3 - 12a^2 + 15a - 4)}{(1 - a^2)(5 - 8a + 5a^2)}\right|.$$

Evaluation of x is ill-conditioned near a 2 = 1 and a = 0.5, the first is due to the fact
that the system of equations is singular for a = 1, while the ill-conditioning near a = 0.5
is due to the fact that x = 0 at a = 0.5. The latter is usually not a serious problem and
is inevitable in all function evaluations near the zero of the function. Similar remarks
will apply to the evaluation of y. For x + y , there is no ill-conditioning near a = 1, but
the problem is ill-conditioned near a = -1. This result applies only if x + y is computed
directly, and not by using the calculated values of x and y. Of course, if the calculations
are done accurately, then the errors in x and y will cancel, but in practical calculations
the roundoff errors may not cancel. Similarly, evaluating x - y is ill-conditioned near
a = 1, but not near a = -1.
45. The recurrence relations (i), (iii) and (iv) are always stable, since even if the error actually
increases, the value of the function increases at least as fast. The recurrence relation (ii)
for exponential integral is stable in forward direction when x < n, but is unstable for
x> n.
47. The exact solution of the differential equation is
$$y = \frac{1}{2}\left(a + \frac{b}{5}\right)e^{5t} + \frac{1}{2}\left(a - \frac{b}{5}\right)e^{-5t}.$$
For $t \gg 1$, in general, the first term should be larger. For $a + b/5 = 0$ the first term should
be zero and the solution is unstable. For t < 0 the solution is unstable when a = b/5.
48. The difference equation is unstable when $y_1 = (9 - \sqrt{17})/8$.
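As a loose illustration of the cascade (pairwise) summation referred to in answer 30 above, the following Python sketch (not from the book) compares it with naive forward summation in single precision; the choice of random test data is arbitrary.

```python
import numpy as np

def cascade_sum(x):
    """Pairwise (cascade) summation: the array is split in half recursively,
    so each element passes through only about log2(n) additions, which is why
    the cascade-sum error bound grows like log2(n) rather than n."""
    n = len(x)
    if n == 1:
        return x[0]
    return cascade_sum(x[:n // 2]) + cascade_sum(x[n // 2:])

# Compare against naive forward summation, both in single precision.
vals = np.random.random(100_000).astype(np.float32)
forward = np.float32(0.0)
for v in vals:
    forward = forward + v            # rounded to single precision at every step
print(forward, cascade_sum(vals), np.sum(vals.astype(np.float64)))
```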

A.3 Linear Algebraic Equations


1. If $l'_{ij}$ denote the elements of the inverse of a lower triangular matrix L, then they must
satisfy the $n^2$ equations
$$\sum_{k=1}^{i} l_{ik}\,l'_{kj} = \delta_{ij}, \qquad (i, j = 1, 2, \ldots, n),$$
where the summation is restricted to i terms, because L is lower triangular. Considering
the equation for i = 1, we get $l_{11}l'_{1j} = 0$ for $j > 1$. If the matrix L is nonsingular, then
$l_{11} \neq 0$, which proves $l'_{1j} = 0$ for $j > 1$. Similarly, by considering $i = 2, 3, \ldots, n-1$, we
can show that $l'_{ij} = 0$ for $j > i$. Hence, the inverse is also lower triangular.
4. The diagonal elements can be +1 or -1.
7. If partial pivoting is included, then the number of multiplications and additions required
for elimination would be approximately doubled .
8. For a tridiagonal matrix only the diagonal and the two subdiagonals need to be stored.
If $d_k = a_{kk}$, $l_k = a_{k,k-1}$ and $u_k = a_{k,k+1}$, then the following algorithm can be used for
elimination.
For each value of k from 1 to n - 1 in succession, perform the following steps.
(i) Compute $n_{k+1} = l_{k+1}/d_k$ and overwrite on $l_{k+1}$.
(ii) Compute $(d_{k+1} - n_{k+1}u_k)$ and overwrite on $d_{k+1}$.
If partial pivoting is required, then before step (i) the rows k and k + 1 may be inter-
changed if $|l_{k+1}| > |d_k|$. The interchange can generate entries in another subdiagonal
corresponding to $a_{k,k+2}$, which will require additional storage. Further, in step (ii) apart
from $d_{k+1}$ the element $u_{k+1}$ will also be modified. Complete pivoting can introduce
nonzero elements at a number of locations and is almost impossible to organise, unless
the tridiagonal nature of the matrix is ignored. The correct solution of the given system is
$x_i = 1$ for $i = 1, \ldots, n$. A sketch of this elimination (without pivoting) in code is given at
the end of this section.

9. In this case, partial pivoting will neither increase the number of arithmetic operations
nor the storage required.

10. The correct solution is Xl = X2 = 1. The equations become more and more ill-conditioned
as n increases. The condition number is of the order of f~. Hence, if nf~ > 1 the solution
may be completely unreliable.
11. The solution and residue in each case are
no pivoting: Xl = 1.0000, X2 = 0.50000; T1 = 0, T2 = 0;
partial pivoting: Xl = 1.0092, X2 = 0.49976; T1 = 10- 6 , T2 = 0;
complete pivoting: Xl = 1.0109, X2 = 0.49971; T1 = 10- 6 , T2 = 10- 4 .

The condition number of this matrix is ~ 2 X 10 5 , even then we get exact result without
pivoting. The results with partial and complete pivoting are much worse.
12. The condition numbers of the two systems are $\sqrt{2}$ and $1/\epsilon$, respectively. Both systems be-
come singular if no pivoting is used. Interchanging the first two equations give the correct
result in both cases. It may be noted that for the first system which is well-conditioned,
partial or complete pivoting requires the first two equations to be interchanged in first
step, which gives the correct result. On the other hand, for the second system, even if any
reasonable form of pivoting is used, no interchange is required and it will not be possible
to solve the equations.
13. The solution correctly rounded to seven decimal digits are
Xl = 1.294274, X2 = -59031.20, X3 = 4.945657 X 109 , X4 = -8.573741 X 10 14 ;
Xl = -19234.05, X2 = 230665.8, X3 = -806864.5, X4 = 1075266., X5 = -483650.5;
Xl = 0.8295765, X2 = -0.7228296, X3 = 0.5855675;
It may be noted that all three systems are ill-conditioned, in the first two cases, the
ill-conditioning can be easily detected because of the magnitude of the solution vector;
while in the third case, the solution is of the order of unity, even though the condition
number is of the order of 2.3 x 106 . The condition numbers for the first two matrices are
~ 2 X 1020 and 10 7 , respectively. Even though the first matrix is highly ill-conditioned
the required solution is reasonably reliable. If the right-hand side is chosen to be such
that the exact solution is Xl = X2 = X3 = x4 = 1, then the solution will have significant
error.
14.
- 0.266 f(2h) + 0.908f(h) - 0.492f(0) - 1.156 f( -h) + 0.058f( -2h)
r(Q2h) = 3 3 ,
h
-0.058f(2h) + 1.156 f(h) + 0.492f(0) - 0.908f( -h) + 0.266 f( -2h)
r(-02h)-
. - 3 h 3

15. The number of additions and multiplications required is approximately ~ n 3 . The algo-
rithm will obviously fail for A = (~ ~).
19. The matrix is well-conditioned, but partial pivoting leads to magnification of the matrix
elements, which gives rise to a large roundoff error. Technique of iterative refinement will
improve the accuracy significantly. For a = 1 the solution may come out to be correct up
to certain value of n, since the numbers can all be represented correctly. Hence, there may
not be any roundoff error, until numbers become too large to be represented exactly. For
a = 0.999 the growth of matrix elements will be almost the same and in addition roundoff
error is introduced at every step. Complete pivoting will also give accurate solution, since
the growth of matrix elements will be checked.
20. The correctness of inverse can be verified by explicit multiplication. Since the maximum
element in each row and column of the inverse is of order unity, while all nonzero elements
of original matrix are also of order unity, the matrix is well-conditioned. The inverse of B
can be obtained using the Sherman-Morrison formula described in {3.33}. This matrix is
similar to that considered in the previous problem, in the sense that with partial pivoting
some of the pivots tend to be very large causing very high roundoff error. Complete

pivoting or iterative refinement of solution using partial pivoting will lead to the correct
inverse.
24. Let $A = LU = L_1U_1$ be two distinct factorisations of a nonsingular matrix A with L and
$L_1$ being unit lower triangular. Then the triangular matrices also must be nonsingular
and $L_1^{-1}L = U_1U^{-1}$. From {3.1} it follows that $L_1^{-1}L$ is lower triangular while $U_1U^{-1}$
is upper triangular. Hence, both must be diagonal matrices. Further, since L and $L_1$ are
unit lower triangular matrices, $L_1^{-1}L = U_1U^{-1} = I$ and hence the two decompositions
must be identical. However, this does not establish that triangular decomposition exists
for all nonsingular matrices. For example, the triangular decomposition does not exist
for the following matrix (Wilkinson, 1988)

G: n
For singular matrices the decomposition may not be unique, for example, (Wilkinson,
1988)

o~ D 0 ~ DG~ ~1)
G : DG ~ ~1)
25. If the matrix is not positive definite, then t may not be positive. In practice, even for
positive definite matrices t may come out to be zero or negative because of roundoff error.
This is likely to happen for ill-conditioned positive definite matrices. The algorithm will
obviously fail for the matrix (~ ~).
26. Noting that $A\,\delta x = \delta b$, or $\delta x = A^{-1}\delta b$, we get $\|\delta x\| \le \|A^{-1}\|\cdot\|\delta b\|$. Similarly, using
$Ax = b$ we get $\|b\| \le \|A\|\cdot\|x\|$. Combining the two we get the required inequality. The
second inequality follows from

which implies
118x ll < IICII
Ilxll -
and further
or
which gives

or

which gives the required result.


27. The iteration may not converge, unless the residuals are calculated using higher precision.
28. The solution correctly rounded to four decimal digits is Xl = 0.9968 and X2 = -0.1338,
which give residues of -1.53 x 10- 5 and -1.03 x 10- 5 for the two equations, using
exact calculations. But the lowest residue is given by the solution Xl = 0.9969 and
X2 = -0.1339, which gives -4.2 x 10- 6 and -3.9 x 10- 6 . If arithmetic with 4 decimal
digits is used then the computed solution is Xl = 0.9934, X2 = -0.1300 and the residual
vanishes if computed using the same arithmetic.
29. Since the residues calculated using the computed solution are always small, AA -1 - I
will always be quite small. Hence, that is not a very reliable indicator of accuracy of com-
putations. The matrices in this problem are reasonably well-conditioned and the errors
will not be too large. The approximate condition numbers for these matrices are 3000,

4500, 700 and 700, respectively. The last two matrices are the so-called Pascal matrices
of order four. When the denominator is seven, the numbers cannot be represented ex-
actly in a binary computer and the results may not be very accurate. While for the last
matrix, where the numbers can be represented exactly in the computer, the results are
more accurate and iterative refinement will lead to exact results. The iterative refinement
will also lead to the correctly rounded results for the first two matrices, since the matrix
elements can be exactly represented in the computer memory. The determinants of these

n
hlatrices are 1,6,1/74 and 1/8 4 , respectively. The correct inverses are

-41 -17 8

A-
I
I -- ( -41
68 25 10 10)
-6 A-
2
I -
-
-3
5
2"
-32)
51
2"
-17 10 5 -3
2
10 -6 -3 2 3 9

A-
3
I -- ( -42
28
28
-42
98
-77
28
-77
70
-7)
21
-2~
A-I
4 -
- (-483232
-48
112
-88
32
-88
80
-8)
24
-2~
-7 21 -21 -8 24 -24

30. The first matrix is extremely ill-conditioned and it is almost impossible to compute the
correct inverse for n = 15, even using double precision on most computers. The condition
number of this matrix is approximately 6 x 10 5 , 3 X 10 13 and 10 21 for n = 5, 10, 15,
respectively. However, since the elements of inverse are all integers, we can multiply all
equations suitably to get all coefficients as integers. If the word length of arithmetic
being used is large enough to represent these integers exactly and IiK(A) < 1/2, then
iterative refinement will yield the exact result, provided residues are calculated using
extended precision. The determinant for this matrix correctly rounded to seven decimal
digits is 1.874648 x 10- 11 , -2.164179 X 10- 52 and 1.587814 x 10- 123 for n = 5,10 and
15, respectively. The correct inverse [bij] for the first matrix can be expressed in a closed
form by

( )n-i (n +z. -i-I)


bil= -1
1
(n).
z i = 1, ... ,n
b.. = (_I)i- (i + j) (i + j - 1) (n + i-I) (n + j)i
j j = 1, ... ,n-l
',)+1 j j-l i+j i+j

while the determinant is given by det (An) = (-1) n - 18;;:-1, where

8n + 1 = (n 2~ 1) c:) (2n + 1)8n ,


For n = 5 the inverse is
5 300 -2100 4200
-60 -2400 18900 -40320 -2520)
25200
( 210 6300 -52920 117600 -75600
-280 -6720 58800 -134400 88200
126 2520 -22680 52920 -35280

The second matrix is a symmetric orthogonal matrix and A-I = A while det(A) = 1 for
n = 5 and det(A) = -1 for n = 10,15. For the third matrix, the inverse is a tridiagonal
matrix B with elements given by

b11 = 1, bii =2 (i> 1); bi,i+1 = bi ,i-l = -1 (i = 1, ... ,n); bij =0 (Ii- jl > 1).
Thus, inverse of a tridiagonal matrix is not in general tridiagonal. For this matrix
det(A) = 1.
32.

A-I =
10 + i
( 9 - 3i
-2+ 6i
8i
-3 -
-3 - 2i
2i)
-2 + 2i -1- 2i 1
A.4. Interpolation 783

34. This process will require approximately n 3 divisions, 2n 3 multiplications and 2n 3 addi-
tions.
35. Xl = 23 + llX3, X2 = -14 - 7X3, X4 = 1, where X3 can have any arbitrary value.
36. The solution with minimum norm is
72 63 351
X2 = 171' X3 = -171' X4 = 1.
xI=171'
The null space is covered by the vector (11, -7, 1,0). The range consists of all vectors
orthogonal to (-13,9,4,5).
37. SVD will lead to more accurate results in these cases, since the SVD algorithm uses only
orthogonal matrices for transformation. Hence, the magnitude of matrix elements should
not increase.
38. h = 11/15, h = 2/3, h = 1/3, 14 = 4/15, h = 1/15.
42. The solutions correctly rounded to seven decimal digits are
(i) (0.6363290, -0.02950666,0.5486742),
(ii) (100000,100001,100002,100003),
(iii) (1.168248,1.324814,1.468131,1.596767,1.709436, 1.805010, 1.882534, 1.941232,
1.980519),
(iv) (0.1950885,0.3826798,0.5555648,0.7070995, 0.8314606, 0.9238687, 0.9807726,
0.9999855),
(v) (0.3318121,0.4692531,0.3318121,0.1508883, 0.2133883, 0.1508883, 0.05835298,
0.08252358,0.05835298),
The condition numbers of the first two systems are approximately 10 5 and 106 , respec-
tively and Gauss-Seidel method does not converge. For the first system, it may converge
rather fast requiring only a few iterations for a tolerance of 10- 4 , but the solution may
be nowhere near the correct value. For the second system, unless the starting values are
such that the residuals are small enough to satisfy the convergence criterion, the process
does not converge. For the third system, the convergence of Gauss-Seidel iteration is very
slow, requiring about 77 iterations for an accuracy of 10- 4 and of course, the use of
simple convergence criterion underestimates the error. Hence, the actual error even after
77 iterations is more than 10- 4 . For the fourth system, the convergence is extremely
fast requiring only nine iterations for an accuracy of 10- 7 . Nevertheless, this matrix is
filled and the iterative method cannot compete with direct method in efficiency. The last
system requires 24 iterations for an accuracy of 10- 7 .
43. For this system of equations the Gauss-Seidel method requires only about 3N 2 additions
and N 2 divisions (or multiplications) for each iteration. Hence, it may appear that for
large N it will be more efficient than the direct methods, which require approximately
~N6 floating-point operations (addition, multiplication, division). However, the matrix
can be considered as a band matrix with bandwidth N which reduces the number of oper-
ations to O(N4) for direct methods. In this case, the convergence of the ;terative methods
becomes very slow for large N and number of iterations required for a fixed accuracy is
O(N2). Hence, it may turn out that the direct methods are more efficient in terms of exe-
cution speed. Direct methods will require O(N3) memory locations, as compared to only
N 2 required for the Gauss-Seidel method. Further, for such slow convergence, it is difficult
to apply a reliable convergence criterion. For methods of accelerating the convergence of
iterative methods see Chapter 14.

A.4 Interpolation
1. Consider the Lagrangian interpolation with error term for f(x) = xk.
2. If the uniform spacing is h and the point is located between X = a and a + h, then the
error in linear interpolation is given by
EI(X) = (x - alex - a - h) fl/(f,).
2
784 Appendix A. Answers and Hints

The product in the numerator has a maximum value of h 2 /4 when x = a + h/2, and the
derivatives are all bounded by unity. Hence, lEI (x)1 < h 2 /8, and for a maximum error of
10- 7 we get h < 8.9 X 10- 4 . This gives the required spacing of about three arc minutes,
which requires 1800 entries in the table for linear interpolation in the first quadrant.
For quadratic interpolation we can proceed as above, the maximum value of the product
in this case is 2h 3 /3V3 and the error bound is IE2(X)1 < h 3 j9V3. This gives the required
spacing of 0.0116 or 40 arc minutes, which will require only 135 entries.
For cubic interpolation, the maximum value of the product comes out to be 9h 4 /16 and
the error is bounded by IE3(X)1 < 3h 4 /128. This gives the required spacing of 0.0454 or
about 2.5 0 , and only 36 entries will be required in the table.
3. As in the last problem, the maximum error in linear interpolation is bounded by
h2 h2 1
IE 1(X)I:S 8" /1/((,) = 8" (loge 10) sin2.;
For a spacing of 0.01, this expression will never be less than 10- 6 and such an accuracy
can only be achieved for points which are close to the tabular points (within about
4.6 X 10- 4 sin 2 .;), since in that case, the factor multiplying the derivative will be small.
An accuracy of 10- 5 can be achieved, provided sin2.; > 0.54 or .; > 0.83. At smaller
values of x such an accuracy can only be achieved, if the required point is close to a
tabular point (within about 0.0046 sin 2 .;)
4. For the first three polynomials which are of low degree, there should be no difficulty,
and the interpolated value should agree with the exact value everywhere. For the fourth
polynomial the interpolated value is reliable, as long as the value is required at points
outside the interval [0,21]' where the polynomial has zeros. However, within this interval
the roundoff error is large. If the first choice of points is used and the points are added
in decreasing order, then the interpolation polynomial will be identical to the given
function and there may be no significant roundoff error. For the last polynomial. there is
no difficulty for x in the tabular range, but outside the range the error increases rapidly,
and the results are totally unreliable.
5. It can be seen that the interpolation series will be alternating for sufficiently large value
of n. Hence, it converges if the terms are decreasing in magnitude, and tend to zero as
n ---. 00. From the ratio of two successive terms, we can easily prove that for e uh > 2 the
terms are increasing in magnitude and the series diverges. For e uh - 1 < 1 the terms are
decreasing in magnitude and further as n ---. 00, the nth term will tend to zero because
of the factor (e uh _1)n. For e uh = 2, if we define m = x/h, then the nth term is given
by
_ ( )n nn j - 1 - m
tn - -1 .,
j=1 J
taking the logarithms, we get

n
In{ltnl) = Lin (
1- +
~ 1) .
j=1 J

As n ---. 00, the jth term of the series tends to (m + l)jj. Hence, the series diverges for
m + 1 f. O. Further, for m + 1 > 0 the series will diverge to -00 giving

lim tn = O.
n~oo

Thus, the interpolation series will converge for x > -h. Similarly, it can be easily shown
that the series will diverge for x < -h. For x = -h it can be easily verified that all terms
of the interpolation series will be +1 or -1, and the series is obviously divergent. For
e uh > 2 the best approximation is achieved when the series is truncated at the smallest
term.
6. This result can be proved in a straightforward manner by induction, but an elegant
proof is obtained by considering interpolation using points ao, ... , ak and equating the
A.4. Interpolation 785

powers of xk in the Newton's divided difference formula and the Lagrange's interpolation
formula.
7. For sin x the interpolation converges for all required spacings. For exp x the interpolation
does not converge for h = 2, while for h = 1 it converges only if the new points are added
in alternating direction. If the points are added in increasing or decreasing order, the
interpolation does not converge {4.5}.
8. The maximum value of the polynomial is 0.0126 for uniform spacing and 0.00195 for
Chebyshev points.
9. If x = g(y) is the inverse function, then the truncation error in linear interpolation is
given by

E ( ) = (y - yo)(y - yll "() = _ (y - yo)(y - Yl) f"(~)


1 y 2 9 TJ 2 [f'(~)J3

where ~ = g(TJ). The first factor has a maximum value of (Yl-YO)2 /8 when Y = (YO+Yl )/2,
which gives the first bound. The second bound can be obtained from the first by using
the mean value theorem to get

XO<6<Xl.
The third bound can be obtained from the second by noting that K ::::: M2/mr.
10. Using two to four points for interpolation, we can get the value of sin x accurate to nearly
10- 7 in all cases. For inverse interpolation the error is somewhat larger for the first three
values, while for the last value which is outside the range of table, the result is completely
useless. It may be noted that, there is no difficulty in extrapolation for sin x on either side
of the table. For the second table the inverse function will not be single valued and there
could be two values of x for a given value of sin x. If the points on one side of x = rr /2
are used, it may be possible to get some reasonable value for the first two cases, but as
soon as a point on the other side is added, the approximation will deteriorate.
11. This problem is more suitable for approximation, rather than interpolation, since there
are inherent fluctuations in the function value itself. Nevertheless, a reasonable estimate
can be obtained by interpolation. It is much better to use logarithms, since both the
distance as well as time vary by a few orders of magnitude over the range covered by
the table. This is a good example of real life situation with inherent errors and relatively
sparse table of values.
12. For h(x) the interpolation will actually approximate the function sin(rr(0.5 + (x -
0.5)/201)), which has the same tabular values as the given function. There is no way
the interpolation routine can know, that between every two points the function has one
complete oscillation. Similarly, for h(x) the interpolation will miss the mild singularity
at x = rr, which gives rise to a very narrow spike in the plot for this function.
14. The evaluation of the forward differences usually involves subtraction of two nearly equal
numbers. Hence, there will be no roundoff error in evaluating the difference. Only error
is due to the error in function values, which are bounded by hMo, where If(ai)1 < Mo
for i = 0, ... , n. Hence, the error in evaluating the first difference will be bounded by
2hMo, and the roundoff error in evaluating the kth difference is bounded by 2k hMo. If
the given point is close to the first point, then the factor multiplying the differences are
close to unity, and ignoring the comparatively small roundoff error in evaluating these
factors, the error in evaluating the interpolated value is bounded by
hMo(l + 2 + 22 + ... + 2n) < 2n + 1hMo.
Of course, here we have neglected some sources of roundoff error in interpolation, but
since such bounds are usually very conservative, the bound obtained above is unlikely to
be violated in most practical problems.
15. The result follows by noting that with this definition the resulting interpolating polyno-
mial is of the correct degree and satisfies the interpolation condition at the required set
of points.
786 Appendix A. Answers and Hints

16. Direct interpolation for tan x gives reasonable value only in the middle portion of the
table, in the end portion the result is not very good, because of the singularity at x = 7r /2.
Using sin x and cos x gives good result over the entire range, as these functions have no
singularity. It is best to use nearest point at every step. Using cubic spline wi!! not change
the situation significalJtly, but using rational function interpolation will of course improve
the approximation, and in fact, it is possible to get an accuracy of approximately seven
significant digits using just two or three points.
17. The results are not essentially different when points are added at uniform spacing. When
the spacing is decreased, keeping the number of points in interpolation formula fixed, the
error decreases in all cases and the decay exponents are approximately, -n, -n, -0.5, -0.5
respectively, for the four functions using n-point interpolation formula.
18. The results are generally worse than those obtained using uniform spacing, and the
approximation appears to be diverging for vr+x. Chebyshev points were chosen to
minimise error in polynomial interpolation, not for splines.
19. For vr+x it is not possible to use this interpolation, as the derivative does not exist at
one of the points. For other functions the results are comparable to that for cubic spline.
It may be noted that Hermite cubic interpolation uses more pieces of information about
the functions as compared to cubic spline, since apart from the function values the first
derivative is also required at each of the tabular points.
20. Using the divided difference table, we can get the interpolating polynomial gs(x) as

(x -1) - 0.5(x _1)2 + .19315(x _1)3 - .05582(x -1)3(x - 2) + .0165(x -1)3(x - 2)(x - 3)

with a truncation error of


E (x) = (x - 1)3(x - 2)(x - 3)2 f(6)( ) = _ (x - 1)3(x - 2)(x - 3)2
5 6! ~ 6~6 '

g5(1.5) = 0.40418 and g5(2.5) = 0.91876.


21.
(i) (b-x)2(2x+b-.3a), c.) (x-a)(b-x)2
(b - a)3 m (b _ a)2 '
.. (x-a)2(3b-2x-a) (iv) _ (x-a)2(b-x) .
(Il) (b-a)3 ' (b - a)2

22. For the first five points it is possible to get a reasonable approximation using cubic spline
or four to six-point polynomial interpolation. At 45 C the polynomial interpolation is
diverging. Rational function interpolation gives good results in all cases.
25. After integration by parts the second term yields the following integral:

inr I>:O iar '


b n a
S"'(X)[g'(x) - s'(x)] dx = [g'(x) - s'(x)] dx = O.
a i=l ai-l

Since Sill (x) is a piecewise constant function, the integration is broken over the n sub-
regions [ai-l, ail between the n + 1 tabular points. This integral vanishes because of
the interpolation requirement g(ai) = s(ai) = f(a;). Further, using the free boundary
condition, the integrated part also vanishes. Hence, it can be proved that the right-hand
side of the first equation is positive, which proves the required result.
26. Adjust the intermediate knots such that the error in each subinterval is nearly equal.
Using the knots at 0.02, 0.09, 0.25 and 0.65 gives a maximum error of around 0.03.
27. See Example 8.7.
29. The coefficients of the parabolic B-spline basis are 0, 0.291263, 0.326211, 0.803894,
1.258215, 1.963248, 2.888795, 3.281573, 3.784913, 2.578776, 8.099982, 29.821302,
19.614816, 14.119430, 9.707468, 6.450697, 4.089743, 2.390330, 1.348936, 0.613927,
0.395358, O.
A.5. Differentiation 787

30. The composition is {0.1O, 0.15, 0.21, 0.18, 0.36}. Perturbing the time measurements by 2
min., which is a change of the order of 0.1 %, changes the composition drastically and
some of the values become negative, thus giving a perturbation of the order of 100%.
This gives the condition numbers of the order of 1000 with respect to uncertainties in
time. It may be noted that the composition of the first substance is not sensitive to the
uncertainties in time. The uncertainties in t.he known half-life is not very serious and the
condition numbers are of the order of 10. An error of 0 .01 % in measured radioactivity
level causes the results to change by almost 100% in some variables, thus giving condition
numbers of the order of 10000.
31. For (i) it is more meaningful to consider the relative error, and it turns out that low order
polynomials, splines or rational function interpolation perform well. Function (ii) has a
discontinuity in slope at x = 0,1 , 2 and it is better to use splines. In fact , a piecewise
linear interpolation with knots at x = 0,1,2 will give exact results. Rational function
interpolation and high order polynomial interpolation perform rather poorly in this case.
The third function is the reciprocal of a function with essential singularity at x = 0 and
it is virtually impossible to approximate it by interpolation near x = O. It may give an
underflow in evaluating the function at x = 0.01 using single precision arithmetic. Hence,
the range may have to be suitably truncated. The fourth function is well behaved, but
because of the large range of the argument of cosine, it may require a reasonable number
of points before the interpolation starts converging. Polynomials will be the best in this
case. The fifth function has a discontinuity in slope and once again it is best to use
spline. The last function is discontinuous at x = 0 which give rise to problems with
all interpolation methods. Only by breaking the range suitably will it be possible to
interpolate this function near x = O.
33. Using polynomial interpolation with 2 x 2 or 3 x 3 points, give reasonable values at the
first three points. Higher order polynomials or product B-splines fail to give reasonable
values at any of the required points. A close look at the table reveals that there are
actually two regions, one in the lower left corner, where the function values are small and
the second region in upper right corner, where the function values are nearly an order of
magnitude larger. For interpolation in the first region, if we select a point in the second
region, then the approximation fails, which is also the cause of failure of B-splines, since
that is a global approximation. At the last two points which are at the boundary of the
two regions, it is impossible to get any reasonable approximation using polynomials.

A.5 Differentiation
2. The bound on the roundoff error R, hopt and the corresponding error Eopt are as follows:
formula R hopt E opt

(5.18) 3<Mo
2h
( 45~~o ) 1/5 (27<4 ,,;t Ms ) 1/5

(5.19) 4~Afo (48~~o) 1/4 J4<MgM4

16<1>/0 (480<M o ) 1/ 6 ( 1024<2 M5 M6 ) 1/3


(5.20) 3h M6 405

(5.21) 3Wo (12;:;/0) 1/5 (9<2 ~5 M?) 1/5

(5.22) 16~rQ (96~~0) 1/6 (32<I>~OMg) 1/ 3

4<Mo (P~~o )1/3 ( 128<2 ~6 M3 ) 1/3


(5.23) h

(5.24) 20<Mo (80<Mo ) 1/4 (32DOO<;7M8 M4 ) 1/4


3h 3M4
788 Appendix A. Answers and Hints

(5.25) 32<Mo
3h
(160<Mo ) 1/5
3Ms
32 (4< MoMs
4 r(5
405

(5.26) 4~!;Jo (4~0) 1(3 ( 32EMOMj) 1(3

80<!;Jo (32f.t~Q ) 1(5 ( 1024<3 M3 M2) 1(5


(5.27) 3h 10 243 0 S

(5.28) 32~ro (128<Mo ) 1(5 (175616E 2 M5 Ml) 1(5


7Ms

(5.29) 16~~0 (8~0) 1(5 (8192EMoMt) 1(5


Here E is the bound on the relative error in the function values, and Mk is a bound on
f(k)(x).
3. Since the derivatives are required at interior points, it is better to use central difference
formulae. The three-point formula will have a truncation error of the order of 10- 4 and
will not give very good accuracy. The five-point formula can give an accuracy of 10- 8 ,
if there is no roundoff error. The spacing of 0.01 is too small for five-point formula, and
it may be better to skip a few points. The optimum spacing in this case is (45 x 5 x
1O- 8 x 5 /24)1(5 ~ 0.04x for the first derivative and (480 x 5 x 1O- 8 x 6 /120)1/6 ~ 0.08x
for the second derivative. At x = 1.1,1.9 it is not possible to use a spacing of larger than
0.05. It will be better to use extrapolation method using a few values of h, which are
permitted by the table e.g., 0.08, 0.04, 0.02, 0.01, or some other sequence which is not
in geometric progression. In fact, using the former set of h values it is possible to get an
accuracy of almost seven decimal digits in all cases.
5. To derive these formulae we can use the nine points, (x, y), (x±h, y), (x, y±k), (x±h, y±k),
and the nine coefficients can be determined by requiring the formula to be exact when
f(x, y) = xiyJ. The error term in these formulae is the lowest (in i + j) nonvanishing
contribution.
6. As the points are added the approximation improves at first, but after some stage the
roundoff error dominates, particularly for the second derivative. For eX with h = 2 and
yX with h = 0.1,0.01, there is no convergence as explained in {4.5}. The effect of random
perturbations is also highly magnified for the second derivative, and it is difficult to get
any reasonable approximation when E = 10- 3 .
8. In practice, the optimum value of h is probably somewhat smaller than what is given by
(5.15). This arises because the truncation error estimate is quite close to actual error, but
roundoff error is generally overestimated, as in actual practice there is some cancellation
of roundoff errors between different floating point operations.
9. The error can only be estimated by comparing results obtained using different sets of
points. It is difficult to estimate the derivative at T = 35 C. The correct value of derivative
at the three points is probably 0.467, 1.08 and 2.1 respectively.
10. The roundoff error is bounded by (16/3)E/h 2 , where E = 5 X 10- 7 is the error in each entry.
For this function f(6)(x) = (945/64)x- 11 (2 and the maximum value of the derivative will
be at x - 2h. Thus the truncation error is given by (21/128)h 4 (x - 2h)-11(2. Since h also
appears in the second factor it is not easy to estimate the optimum value by equating the
two errors. In this case h = O.OIn, where n is integer. The optimum spacing and total
error are given by
x 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
hopt om 0.02 0.03 0.05 0.06 0.07 0.08 0.09 0.05
error 0.028 0.0073 0.0033 0.0018 0.0012 0.00083 0.00062 0.00048 0.0011

11. If Neville's algorithm is used for interpolation, then the result follows trivially. The New-
ton's formula is based on the divided difference table, while the T-table is essentially a
table of extrapolated values using different sets of points hi, and to see the correspon-
dence some manipulations will be required. If we represent the interpolating polynomial
based on points h;_m, ... ,h; by T;n(h"l), then the T:" occurring in the Richardson's
A.6. Integration 789

extrapolation should be just the value of the corresponding polynomial at h'Y = O. It


is trivial to see that for m = 1 the Newton's divided difference formula gives the same
result as the extrapolation method. For m > 1, we can write

= Ti (h"Y) + (h'Y _ h'Y)


( T' (h'Y)_Ti - l (h"Y»_(T'-l (h"Y)_T' - l (h"Y»
hi - hi_Tn '
m - l m - 2 m-l m-2
Tn-I t

which gives (5.37), when we set h"Y = 0 and hi = pi-1h1.


12. There is not much difference for eX, but for tan x the results are much better and it is
possible to get an accuracy of seven figures using h = 0.03 and larger.
13. This result follows from {11}.
14. In this case, since the error expansion is in powers of h rather than h 2 , the convergence
of the approximation will be slower. The correct value of the derivative is -0.5.

A.6 Integration
2. For integral hand h the rules for regular integrands will have a convergence rate of 1/n2,
for h it will be 1/n3 , provided the quadrature formula has sufficient degree of precision.
For other integrals the convergence rate will be according to the degree of precision of
the quadrature formula being used.
9. The trapezoidal rule forming the first column of T-table should be most efficient for these
integrals. The correct value of h ~ 0.737131823541.
10. For x 2: 6 the function is equal to unity within the accuracy used here; for other value of
x the correct values are
x I x I x I
0.5 0.52049987781305 1.0 0.84270079294971 2.0 0.99532226501895
3.0 0.99997790950300 4.0 0.99999998458274 5.0 0.99999999999846

For lower values of x the Taylor series converges rapidly, while the quadrature formulae
converge somewhat slowly. 32-point Gauss-Legendre formula gives the result to an accu-
racy of better than 10- 14 in all cases considered here. The trapezoidal or Simpson's rule
require a large number of function evaluations for lower values of x, but the efficiency
increases as x is increased beyond a certain value, and for large values of x the trape-
zoidal rule is the most efficient. This is rather unexpected, since the range of integration
increases with x. This result can be explained by the fact that as x increases the deriva-
tives at the end point vanish to give higher accuracy for the trapezoidal rule, as explained
in the previous exercise. Increasing the range beyond the point, where the integral has
approached its limiting value of unity within the required accuracy, will reduce the effi-
ciency of quadrature formulae, since most of the function evaluations will be wasted in
region, where the integrand is essentially zero. Thus, at low x the higher columns in the
T-table converge rapidly, while at large x the first column converges faster.
11. When the upper limit is unity, there may be no difference in the results in the two cases,
since the step size h can be represented exactly in the computer. With the upper limit
changed to 0.983 which cannot be represented exactly in the binary form, the computation
using aj = ao + jh should be more accurate. Summation technique will affect the result
only when the accuracy is comparable to the roundoff limit.
14. The weights are
n = 2: 0.0380952380952,0.457142857143,0.171428571429;
n = 4: .0134680134680, .166233766234, .106204906205, .301683501684, .0790764790765;
n= 8: 0.00444191319025,0.0649952169693,0.00453825576633,0.193331044600,
-0.0780232342668, ,0.268141346862, -0.0174182336411, .191452115191, .0352082419958.
790 Appendix A. Answers and Hints

The Gaussian formulae requires about half as many points for the same accuracy. The
Gauss-Legendre formulae will converge as roughly 1/n 2 . 5 and 1/n1. 5 for II and 12, re-
spectively.

15. lo
a 8
f( y'X) dx ~ (a + 1 - 2Va)f(0) + (- Va
3
-
a 2
2)f( -) + (1 - -Va)f(a).
2 3
16.
Area = / ydx = 4 10 3~ dx = 3'1r,
1

arc length = / 1 + (d Y )
dx
2 dx = 2 /
1
-1
/1 + 8x22 dx
I-x
~ 13.36489322056.
The integrals can be easily evaluated using the Gauss-Chebyshev formulae.
17. The weights and abscissas are

n ai Wi ai Wi

2 0.602276908119 0.281460680970 0.112008806167 0.718539319030


4 0.848982394533 0.0392254871300 0.556165453560 0.190435126950
0.245274914321 0.386875317775 0.0414484801994 0.383464068145
8 0.953326450056 0.00368640710403 0.849379320441 0.0209790737421
0.701814529939 0.0578722107178 0.52945857523 0.112924030247
0.354153994352 0.175754079006 0.197871029326 0.226841984432
0.0797504290139 0.237525610023 0.0133202441609 0.164416604728

19.
Jro cos(kx)f(x) dx ~ --;. ((3+ (-ll)f(O) - 4(1 +
'Irk
(-l)k)f(~) +
2
(3(-1)k + l)f('Ir)).

20. /1 f(x)dx~~(J(l)+f(-l))+
-1 10
49 (f( iI)+f(-
90 V;; ;; + 32
ViI)) 45
f (0).

21.

n = 10 n = 20 n = 30 n = 50
-4.38761101729 -3.11175259538 -2.54325961889 -1.97156703598
- .4 71793074422 -.270544415329 -.193877275100 -.126542517365

23. 14 can be easily evaluated using the Gauss-Chebyshev formula, if the range is transformed
to [-1,1] by a linear transformation. The results are as follows:

b=2 b = 1.1 b = 1.01 b = 1.001 b = 1.0001


1.07825782375 1. 496845655 72 1.56299109151 1. 57001141916 1.57071779189

h can be evaluated using the Gauss-Chebyshev formula, The results are:


0=0.1 0=10- 3 0=10- 6
6.85551720847 70.2305918566 2221.44091369

24. II can be easily evaluated using Gauss-Hermite formula. For 12 the two-point Gauss-
Hermite formula gives the exact result by coincidence, but the higher order formulae
do not give very accurate results. This integral can be conveniently evaluated using the
trapezoidal rule, because the derivatives of the function vanish at both the end points.
For 18 a transformation x = e t followed by weakening the singularity improves the
convergence. The correct values are
0=2 0=1.5 0=1.1 0=1.01 0=1.001
1.33647751530 1.14914223106 1.02696523041 1.00263390826 1.00026276774
A.6. Integration 791

25. The correct values of hand 13 are

h(.I) "'" 297.6946987824, h(10-3) "'" 2999999.07628912, 12(10- 6 ) "'" 3 x 10 12 + 1.148616;


13(0.2) "'" 11.32580754739, 13(0.1) "'" 290.405560773, 13(0.01) "'" 2.74356030377 x 10 39 ;
13 (0.002) "'" 5.63696218062 x 10 211 .

26.
1 2sinhx d
II = 1 --- X
o x

h= 1o
t - 11
- (Jln(t+X) - Jln(t-X)) dx+
x t +x t- X
1"" -~--dx
2t-l x(x - t)

~ 1"/2 2xcot x dx
13= L...
n=O 0 (n+!)27r 2 _X 2

"/2 ) 8 "" 1 "" 1"/2 2x 3 cot t dt


= (1 x cot x dx - ~ + ~
o 7r 2 ,S(2n+l)2 ,S 0 (n+~)27r2((n+~)27r2_x2)

,,/2 11"/2 3 21"/2 5


= 1 x cot x dx + - x cot x dx + - x cot x dx
o 3 0 15 0
"" {" /2 2x 7 cot x dx
+~io (n+ !)67r 6 ((n+ !)27r 2 _x 2 )

Here for 13 the series is obtained by breaking the range over [k7r, (k + 1)7r] and breaking
each of these intervals into two at the singularity, which is then transformed to x = O.
The convergence of the resulting series can be accelerated by subtracting series of the
form I' j(7r(n + 0.5))2k, where I' is the appropriate integral. For II and 13 the resulting
integrals have no singularity and can be evaluated easily. For h the first integral has
an algebraic singularity at the upper end because of Jln(t - y) and may be treated by
adaptive technique, or by weakening the singularity. If t = 1 the first integral vanishes,
but the second integral will have an algebraic singularity at x = 1, which can be treated
by a Gaussian formula. The second integral in all cases can be evaluated by a change of
variable x = e Y and using the Gauss-Laguerre formula. For t = 1 the integral exists in
the Riemann sense. The correct values for 12 are

t = 1.5 t = 1.1 t = 1.01 t= 1


1.183343919108 1.991787625060 2.279475719276 2.315157373394

27. h(b = 0.1) "'" 5.07143047634, h(b= 1) "'" 1.838017695010.


28. For even values of n the range of 13 can be suitably broken to obtain

..L 2d
+ L' 2
n/2
{1
(4i_l)2 +
2n n d
1 5-
{
1 (2n-l + £)2
dx
1 2n X X 2n n
o (1 + sin(n7rx))1/3 i=l;;: io (1 - cos(7rx))1/3 x - ;;: il/2 (1 - cos(7rx))1/3

where the integrals in the sum correspond to the range (4~~3, 4~~1) of the original
integral. The summation can be easily evaluated by considering the two integrals

Cl= 1 1 dx
o (1 - cOS(7rx))1/3'
C2 -
-
1 0
1 x2 dx
(1 - cOS(7rx))1/3 .

To reduce the roundoff error near x = 0, the denominator could be either expanded in a
Taylor series or rewritten in terms of sin(7rxj2). The integrand is singular and the integral
can be evaluated using the Gaussian formula with appropriate weight function. Integral
II can also be treated in a similar manner, for b close to 1 the integrand will be nearly
792 Appendix A. Answers and Hints

singular. The correct values of the integrals are

n = 10 n = 20 n = 50
h(b = 2) 0.202031325408 0.197251024771 0.194372939875
h(b=l.l) 0.805823763228 0.766816070023 0.743212057605
h (b = 1.001) 8.520372316139 7.990352204714 7.668195405159
12 1.619567548422 1.610286600537 1. 598827048964
h 0.662176859271 0.638038023124 0.623401951146

29. For large values of a, we can use the following asymptotic forms

V(a p) = r(p+ 1)
,
(e-o< _ b e2P+l
- 2 + e- 3 o< _ b e - 4 + ... )
o<
3P+1 4P+l
o< a» 1

(~alP;l [1+2CP+~;PC2 + (P+l)P(P:})(P-2)C4 + ... )] -a»l(b=l)

where
Cn = 1 - ~n + ~n - ~ + ... = «(n) (1 _ _1_)
2 3 4n 2n - 1 '

and «(n) is the Riemann zeta function. Further, C2 = rr2/12, C4 = 7rr 4 /720, Cs =
31rr6 /30240, Cs = 127rr s /1209600. For intermediate values of a the range can be broken
suitably and the finite part can be evaluated using Gaussian formula with weight function
w(x) = yX, while the infinite tail can be estimated using the Gauss-Laguerre formula. If
sufficiently high order formula is not available, then the range can be broken into three
parts and the intermediate part evaluated using Gauss-Legendre formula. The correct
values are

Fermi-Dirac (b = +1) Bose-Einstein (b = -1)


a p=~ p=~ a p=~ p=~

10 .0000402339943669 .0000603514758981
5 .00595717690518 .00894638226041 5 .00598562754875 .00896772007184
2 .114587823925 .175800988854 2 .126140522431 .184437265706
1 .290500896170 .460848806290 1 .379695714963 .526057226932
0 .678093895153 1.15280383709 0.1 1.45020171900 1. 52570484329
-1 1.39637528067 2.66168262473 10- 3 2.21710560411 1. 77991883051
-2 2.50245782601 5.53725367501 10- 6 2.31201707495 1.78328972170
-5 7.83797605729 27.8024462157
-10 21.3444714924 134.270159963

30.
8 = 10- 4 8 = 10- 2 8=1 8 = 10 2 8 = 10 4
.000131102877565 .0131101379998 1.13708259952 1.57079631501 1.57079632679

34. Evaluating'Y directly involves subtraction of two nearly equal quantities, which may cause
significant roundoff error for large n. To get accurate value of 'Y remove the In n term,
which occurs in the Euler-Maclaurin formula before evaluating the sum.
n = 200 n = 12495 n = 244 n = 10100 n = 10(1010)
Sn 5.87803094812 10.0103395236 31.07569160954 230.835724964 23025850930.5
Sn - In n .579713581573 .577255680374 .577215664902 .577215664902 .577215664902

35. The Euler-Maclaurin formula is most efficient. «(3) ~ 1.20205690316.


A.6. Integration 793

37. The correct values of S4 are


0=1 0=10- 2 0=10- 6
0.414837082935 -0.0654540168145 -0.0703049569220
0.416333833948 -0.0654395423286 -0.0703054534290

39. SI = ~ In 2 and S2 = ~ In 2. It may be noted that both the series are obtained by rear-
ranging terms in the series for In 2. This example demonstrates that it is not permissible
to rearrange terms, unless the series is absolutely convergent.
40. In this case, the last few terms should be summed separately, since there the differences
will not be small. Thus, only the first few and the last few terms need to be evaluated.
41. Using the usual transformation to spherical coordinates
x = l' cos 0, y = rsinOcoscp, z = 1'sinOsincp,
the integral can be transformed to

[= r
iS3
f(x, y, z) dV =
in
r dr ior dO inr2rr d¢F(r, 0, cp)1'2 sin 0,
l

where F(1',O,cp) = f(x,y,z). Since the integrand should be periodic in cp, it is preferable
to use the trapezoidal rule. For 0 we can use the Gauss-Legendre formula for x' = cos 0,
while for l' we can use the Gauss-Legendre formula or a special formula with a weight
function of 1'2. The product rule can be written as

["" -271' LLLA,B)F


n, n2 n3 ( 271'k)
ri,O),- ,
n3 ,=1 )=1 k=1 n3

where cosO) are the zeros of Legendre polynomial of degree n2 and B) are the corre-
sponding weights of Gauss-Legendre formula. Similarly, 1', and Ai are the abscissas and
weights of Gaussian formula with weight function 1'2 over [0,1].
This technique can be extended to hypersphere in n-dimensions by using the transfor-
mation
j-I n-I
Xj = l' cos 0) IT sinO" j=1, ... ,n-1; Xn = l' IT sin 0, ,
,=1 i=l

where 1',01, ... ,On-I are the n coordinates in the hyper-spherical coordinate system and

[= r
iSn
f(xI, ... ,xn)dV

42. The following elementary integrals can be used to derive the required formulae

In particular, the volume of the hypersphere is given by


71'd/2
V=----
r(d/2+ 1)
The 2d-point formula of degree three and (2d 2 + I)-point formula of degree five for
integration over the hypersphere are

i rSd f dV "" V
2d
f (± J +
f
d
d
2
, 0, 0, ... ,0) ,

"" wJ/(O, 0, ... ,0) + W2 I: 2d f(±a, 0, 0, ... ,0) + W3 I: 2d (d-l) f(±a, ±a, 0, ... ,0),
794 Appendix A. Answers and Hints

where

J3 Wl= V (d3-3d2-1Od+36) W = V(16-d 2 ) V(d+4)


a = V4+d' 18(d + 2) , 2 18(d + 2) , w3 = 36(d + 2)

The approximate value of the first integral for various values of dare
d=2 d=3 d=4 d=6 d = 10 d = 20
3.9952370677 5.587807917 6.79964863 7.42186 3.82369 0.0403243

43. The following elementary integrals can be used for deriving these formulae

/ 00 ... /00
-00 -00
exp( -xI _ x~ _ ... _ X~)X~' X~2 ... X~d dXl dX2 ... dXd

= r (n 1: 1) r (n2 : 1 ) ... r ( nd 2+ 1 ) .

The 2d-point formula of degree three and the (2d 2 + I)-point formula of degree five are

l:···l: exp(-xI-x~- ... -x~)f(xl ,x2, ... ,xd)dxldx2 ... dXd

Jrd/2 2d
~ 2d Lf(±y'd72,O,O, ... ,O)
2d 2d(d-l)
~ wl/(O,O, ... ,0) + w2 Lf(±a, O,O, ... ,0) + w3 L f(±a, ±a,O, ... ,0) ,

where
Jrd/2
a = y'3j2, WI = _ _ (d 2 - 7d + 18),
18
44. For a = 2 the integrand is regular and there will be no difficulty in evaluating the integral.
But for a = 0.5 the integrand is singular and it will be difficult to get an accurate value
for the integral. The approximate values of the integral are
ellipsoid: a = 2 4.7988404904, a = 0.5 10.4710433534,
rectangle: a = 2 : 2.5487135317, a = 0.5 : 3.89714993772.

49. The expected error is of the order of 17/ yn, where variance 17 2 = ~ (e 2 - 1) - (e - 1)2 ~
0.242. The variance of eX - 1 - x is 3e - ~e2 - M
~ 0.044. Hence, the number of function
evaluations required for the same accuracy will reduce by a factor of 0.044/0.242 ~ 0.18.
Since each function evaluation requires a little more time with modified functions, the
actual gain in efficiency will be a little less.
50. To generate a sequence of random numbers with probability distribution p(x) over [0,1]'
first generate the random sequence r, with uniform distribution. The corresponding mem-
ber of the required sequence can then be obtained by solving the equation

1o
x 2
p(y) dy = -x
3
1
+ _x2
3
= ri .

The relevant variance is

17 2= 31
-
2 0
1
-
P.
2x dx
1+x
- - (1 1 )2
0 eX dx ~ 0.0269.

Hence, the number of function evaluations required will be reduced by a factor of


0.0269/0.242 ~ 0.11. However, the generation of random numbers as well as the func-
tion evaluation itself now requires more time, which will partially compensate for the
reduction in the number of function evaluations.
51. The last two integrands are not square integrable and hence the variance could be large.
A.7. Nonlinear Algebraic Equations 795

A.7 Nonlinear Algebraic Equations


1. The roots of the equation are x = -1 and 2. The number of iterations n, required to
reduce the error by a factor of 100 is given by If'(a)ln = 0.01 , provided If'(a)1 < 1.
The first form does not converge to either of the roots. The second form requires about
six to seven iterations for the root at x = -1, and three to four iterations for the other
root. The third form does not converge to the root at x = -1 and requires six to seven
iterations for the other root. The last form converges to x = -1 if a < -1.5 and to x = 2
if a> 1.5. The convergence is quadratic for a = -3 (to x = -1) and a = 3 (to x = 2).
2. The possible forms are as follows:

... {(In(I+X))1/3, for x = 0.850 ... ;


(i) x = -(1 - x)1/3, (m) = 3
eX - 1, for x = 0;
In( 4x 3 ), for larger root;
(iv) x = { ex / 3
for smaller root.
4 1/ 3 '

Here, in (ii) it is assumed that sin-Ix gives results in the range [-n/2,n/2] and n =
1,2, ... , identify the different roots. The actual values of the roots to seven significant
figures are (i) -1.324718, (ii) -3.183063, -6.281314, -9.424859, -12.56637, ... , (iii) 0,
0.8506512,
(iv) 0.8310315, 7.384406. For the second equation the higher roots will coincide with-nn
to seven significant figures.
4. Using the form x = 0.01 + 0.01x2 and starting value Xo = 0.01, the convergence will be
very fast. Root x ~ 0.0100010002.
5. If the root is at x = a, then from the Taylor series expansion we get
, (Xl - a)2
+
II
f(xd = (Xl - a)f (a) f «(d,
2
for 6 and 6 in appropriate range. From the hypothesis of the problem, it follows that
f(xd, f"(6), f"«(2) have the same sign, while f(X2) has the opposite sign. This implies
that the two terms in f(xd are of the same sign, while in f(X2) the two terms are of
opposite sign and further I(X2 - a)f'(a)1 > 1~(X2 - a)2fll(6)1. The next iterate X3 is
given by

Using these equations, we get

= I I>
I Xl - XII
X3 -
a f'(a) _
f'(a)
(xl-a)2
+ ~f"(6)
fll(C l ) + (x2-a)2 fll(C 2 )
1
.
2(X2 · xt> <" 2(X2 xt> <"

The inequality follows from the fact that the first two terms in the numerator and de-
nominator are of the same sign while the last term in the denominator is of opposite sign,
but smaller in magnitude than the first term. This inequality ensures that the point Xl
remains fixed.
9. Choose the function as f(x) = l/x - a, then the Newton-Raphson iteration is given by
Xi+l = 2Xi - ax;' The secant iteration gives Xi+l = Xi + Xi-l - aXiXi-l. The iteration
converges when the starting values are less than 2/a.
10. Use the function f(x) = xn - a, where n = 2 or 5. v'2 ~ 1.414213, 2 1 / 5 ~ 1.148698.
11. The roundoff error in evaluating the function near x = 1 is of the order of 35fi. Equating
this to the function value, we get the size of the domain of indeterminacy as (3511)1/7 ~
0.16. Here it is assumed that all intermediate results are rounded off to single precision
before further calculation. If the polynomial is evaluated using nested multiplication, then
the roundoff error may be significantly less, if the machine does not roundoff intermediate
796 Appendix A. Answers and Hints

results to single precision. If the Newton-Raphson method is used, then at each step the
error is multiplied by a factor of 6/7. Hence, the number n of iterations required is given
by (6/7)n ~ 0.16 or n = In(0.16)/ In(6/7) ~ 12. If the iteration is continued further, it
may keep oscillating within the domain of indeterminacy, or if at some stage the computed
value of the derivative fl(x) turns out to be very small it may jump to a value far away.
In the latter case it will again converge to the root slowly. If the iteration is modified
for a root of multiplicity seven, then iteration converges quadratically. If the function is
rewritten as (x - 1)7, then the roundoff error is reduced.
12. For 0< # 1 the order of convergence is 1. Since the error at each step is multiplied by a
factor of 1 - 1/0<, the iteration converges for 0< > 0.5.
13. The correct value of the roots to seven significant figures are:
(i)(a = 2): 0, ±1.895494; (a = 10): 0, ±2.852342, ±7.068174, ±8.423204;
(a = 20) 0, ±2.991456, ±6.620579, ±8.960238, ±13.29342, ±14.86970;
(a = 20.3959) 0, ±2.994254, ±6.613406, ±8.969439, ±13.2751O, ±14.88960, ±20.36873,
±20.37388.
(ii) 4.493409, 7.725252, 10.90412, 14.06619, 17.22076, 20.37130, 23.51945, 26.66605,
29.81160, 32.95639, 36.10062, 39.24443, 42.38791, 45.53113, 48.67414, 51.81698,
54.95968, 58.10225, 61.24473, 64.38712, 67.52943, 70.67169, 73.81388, 76.95603,
80.09813, 83.24019, 86.38222, 89.52422. 92.66619, 95.80814, 98.95006, 102.0920,
105.2339, 108.3757, 111.5176, 114.6594, 117.8012, 120.9430, 124.0849, 127.2266,
130.3684, 133.5102, 136.6520, 139.7937, 142.9355, 146.0772, 149.2189, 152.3607,
155.5024, 158.6441, 161.7858, 164.9276. 168.0693, 171.2110, 174.3527, 177.4944,
180.6360, 183.7777, 186.9194, 190.0611, 193.2028, 196.3444, 199.4861, 202.6278,
205.7695, 208.9111, 212.0528, 215.1944, 218.3361, 221.4778, 224.6194, 227.7611,
230.9027, 234.0444, 237.1860, 240.3277, 243.4693, 246.6110, 249.7526, 252.8943,
256.0359, 259.1775, 262.3192, 265.4608, 268.6024, 271.7441, 274.8857, 278.0274,
281.1690, 284.3106, 287.4522, 290.5939, 293.7355, 296.8771, 300.0188, 303.1604,
306.3020, 309.4436, 312.5853, 315.7269.
(iii) mr, mr + ~ for n = 0, 1,2, ...
(iv) 3.205745 x 10- 286 , 0.5953495, 1.211214, 1.549504, 1.593314.
(v) mr + ~ and mr + ~ (double roots), for n = 0, 1,2, ...
(vi) 0.6602603, 2.481332, 3.141593, 4.325869, 5.098909.
(vii) Maxima: 1.570796, 3.884530, 5.540248, 7.853982, 10.16772.
Minima: 0.3239461, 2.817647, 4.712389, 6.607131, 9.100832.
(viii) ±0.9226808 (a = 0.2), ±1.035265 (a = 2), ±1.161586 (a = 20), ±1.412538 (a =
1000).
(ix) 2.404826, 5.520078, 8.653728. (x) 0 (a = 1), 1.146193 (a = 2), 1.544194 (a =
3.14). The value is given by the positive root of In (a + x) = x.
(xi) x = 0 is a root of multiplicity five when a # V40. For a = V40 it has multiplicity
nine.
14. (i) and (ii): none for a = 21.34; 11.42075, 11.47035 (a = 22.8911); 7.232358, 18.113042
(a = 25.3454). Note that it is impossible to distinguish between the singularities and roots
by looking at the function value alone. For the last equation, the roots are 0,0.123,0.21
and 2. In all cases, the iteration may not converge easily, because of the dominant expo-
nential or power factor. In the first case, the iteration will almost always tend towards
x = 00, while in the last case, it will tend towards x = O. Hence. in the first case, if
the starting value is larger than the root, the iteration is unlikely to converge. Similarly,
in the last case, the iteration is unlikely to converge, if starting value is smaller than
the root. This situation often arises in practice, while finding zeros of large determinants
resulting from eigenvalue problems in differential equations. If each of the equation is
multiplied by x, the determinant will be multiplied by xn and we may not recognise the
difficulty.
15. 0.618, 0.755, 0.819, 0.857, 0.930 for m = 2,3,4,5, 10, respectively.
16. Multiplying the increment by m does not improve the order of conver~nce, but the
convergence may be somewhat faster and erratic.
A.7. Nonlinear Algebraic Equations 797

17. To estimate the error in Muller's method, we note that the new approximation Xi+1 is
given by the zero of interpolating polynomial. From the expression for truncation error
at this point (see Section 4.1) it follows that
f'" (7])
f(xi+1l = --6-(Xi+1 - Xi)(Xi+1 - xi-Il(Xi+l - Xi-2),

f:)
where 7] is a suitable intermediate point. To get an expression in terms of the errors Ei we
use the fact that if x = a is a zero of multiplicity m ::; 3, then f(Xi) ~ Er'. For an
isolated zero (m = 1), substituting this value in the above equation leads to the required
equation, provided higher order terms are neglected. For example, Ei+1 - Ei ~ -Ei and
so on.
18. The form of truncation error follows directly from the expression of truncation error in
polynomial interpolation, as in the previous problem. To find the order, define Ui =
In(IK 1 /(n-1)Eil), where K is the magnitude of the factor in front of the product sign.
This yields a linear difference equation in terms of Ui. The dominant solution of this
difference equation will give the required order. It can be easily shown that the order
is the root with the largest magnitude of the characteristic equation. Further, it can be
shown that, this root will be in the interval (1 , 2) for n ~ 1 and that all other roots of
the characteristic equations are within the unit circle in the complex plane. The order is
1.928, 1.966 and 1.984 for n = 3,4,5 respectively.
20. The equation can be derived as in {17}. To find the order of convergence define
Ui = In( IK Ei I) and obtain a linear difference equation. The corresponding character-
istic equation is 2t 3 - t 2 - t - 1 = O. To find the order of the method based on the inverse
quadratic interpolation for a double root, substitute the approximation f(Xi) ~ fff~",) E;
in the definition of the iteration function. Retaining only the dominant terms gives an
equation which is homogeneous in terms of different Ei and hence the convergence will
be linear.
21. For a triple zero it may be noted that the equation defining the truncation error is
homogeneous in terms of Ei, and as a result the convergence will be linear. Hence, all Ej
will be of the same order and none of the terms can be neglected. To find the ratio of
successive errors, assume that Ei = Eoa i , and substitute in the resulting equation to get
the required polynomial. For higher order zeros the expression for truncation error used
to derive these formulae is not valid. In this case, we can write down the interpolating
quadratic

One of the zero of this quadratic gives the new point Xi+1. If Xi is sufficiently close to
the zero, then f(Xi) ~ f~~) Er'. Substituting this approximation in the above equation
f(=)
and neglecting the common factor of ~, we get

Substituting Ei = Eoa i gives the required polynomial after some simplification. The ratio
a turns out to be 0.713±0.201i, 0.814±0.146i, 0.862±0.1l5i, 0.939±0.055i for m = 3,4,5
and 10, respectively.
22. The roots are 0.3181315 ± 1.337236i and 2.062278 ± 7.588631i. Eliminating x gives

x = In (-!!-) ,
Sill y
In (-!!-) -
Sill Y
ycoty = O.

Eliminating y gives
y = cos- 1 (xe- x ) + 2m!",
where n = 0,1,2, ... and it is assumed that cos- 1 x gives a value in the range [0, 11"J. For
each value of n, there is one root.
798 Appendix A. Answers and Hints

23. (i) (n = -24): 0.8198085 ± 0.05854777i, 0.8886355 ± 0.1532801i,


1.000000 ± 0.1894646i, 1.111364 ± 0.1532801i, 1.180192 ± 0.05854777i;
(n = -32): 0.8965072 ± 0.03362686i, 0.9360379 ± 0.08803628i,
1.000000 ± 0.1088188i, 1.063962 ± 0.08803628i, 1.103493 ± 0.03362686i;
(n = -48): 0.9658601 ± 0.01109273i, 0.9789004 ± 0.02904114i,
1.000000 ± 0.03589682i, 1.021100 ± 0.02904114i, 1.034140 ± 0.01109273i;
(n = -64): 0.9887380 ± 0.003659236i, 0.9930397 ± 0.009580004i,
1.000000 ± 0.01184154i, 1.006960 ± 0.009580004i, 1.011262 ± 0.003659236i;
(n = -96): 0.9987745 ± 0.0003981937i, 0.9992426 ± 0.001042485i,
1.000000 ± 0.001288582i, 1.000757 ± 0.001042485i, 1.001226 ± 0.0003981937i;
(ii) (n = 11): 1.000000, 2.000000, 3.000014, 3.995714, 5.022002 ± 0.3707534i,
5.857901 ± 1.549740i, 6.924237 ± 3.086166i, 8.536600 ± 5.036545i, 11.26297 ± 7.222429i,
15.69946 ± 8.688029i, 21.27809 ± 7.616640i, 25.42087 ± 3.090892i;
(n = 3): 1.000000, 2.000000, 3.000000, 3.999983, 5.001566, 5.949160,
6.944681 ± 0.6362860i, 8.259316 ± 1.840439i, 10.04692 ± 3. 17591Oi, 12.54622 ± 4.380456i,
15.80874 ± 4.876681i, 19.27186 ± 3.963980i, 21.64690 ± 1.531178i;
(n = -13): 1.000000, 2.000000, 3.000000, 4.000000, 5.000000, 5.999999,
7.000018,7.999778,9.001741,9.990855,11.03683, 11.90887, 13.25371, 13.67273,
15.52041 ± 0.04311312i, 17.15171, 17.92973, 19.01485, 19.99834;
(n = -29): To seven significant figures the roots are 1,2,3, ... ,20.
24. (i) 13.0000096506, (ii) 13 ± 0.00218226i.
25. (i) -1.768570, (ii) 4.797964, -1.974795.
26.

m n = 0.001 n = 0.1 n = 1 n = 10 n = 1000

0.1 -7.576831 -2.965756 -0.6117821 2.082781 14.19787


1.0 -7.393116 -2.783201 -0.4385950 2.210731 14.22903
10.0 -6.209451 -1.601932 0.7218636 3.223867 14.53686

27. This method will be more efficient than the Newton-Raphson method if (J > 0.71, while
for (J < 0.283 it is more efficient than the secant iteration.
28. Assuming Q(Ui) = a + (3ui expand the error {i+l in terms of <i using the Taylor series
expansion and then estimate a and (3 such that terms of the order of {i and (~ cancel
giving a cubic convergence. This leads to the following iteration function
Ui
Xi+ 1 = Xi - ----cf;;-;";-;-(-x,~)-
1 - 2f'(x,) Ui

29. The resulting iteration function is

To find the order of convergence of this method, expand the expression in terms of Taylor
series retaining only the dominant terms. This may be slightly tricky, since which terms
dominate will depend on the order of iteration. Hence, the neglect of terms can only be
justified a posteriori. For isolated zeros this method has the same order of convergence
as Muller's method, but for double zero this method is linear, while Muller's method is
of order 1.23.
30. The order of convergence can be estimated by a procedure similar to that for the Muller's
method. If we make the simplifying assumption that both the first and second derivatives
can be evaluated at a relative cost of a, then this method is more efficient than the
l';ewton-Raphson method for a < 1.41, while it is more efficient than the Muller's method
for a < 0.401.
A.7. Nonlinear Algebraic Equations 799

31. The cubic convergence follows directly from the truncation error in the Taylor series
expansion.
32. For double zeros following a procedure similar to that for Muller's method, it can be
shown that while the method based on direct Taylor series expansion has an order of
convergence 1.5, the method based on Taylor series for the inverse function converges
linearly. For roots with higher multiplicity, both these methods converge linearly. The
first method can be modified to give cubic convergence for roots with known multiplicity,
provided we consider the function fIlm, which has only simple zeros. The resulting
iteration can be written as
2mf(Xi)
Xi+ 1 = Xi - -::f::-:'(;---X-:i):-±-:-yll70(2""m=-=1"")"'f:;C'(;=OX"'i)""2=-~2m===:f;'7(x=,=;')""f7C1/T(X==c'i)

The second iteration function can be easily modified to get a cubic convergence by mul-
tiplying the last two terms by suitable factors, such that the first two terms in the Taylor
series expansion are cancelled, which yields the iteration function
m(3 - m) f(Xi) m 2 f(x;)2 !"(Xi)
Xi+l=Xi- ----
2 f'(Xi) 2 f'(Xi)3

33. In this case, the truncation error can be written in the form Ei+l = KErE7-1' which gives
the characteristic equation 0 2 - 20 - 2 = O. It may be noted that this iteration function
is obtained from inverse Hermite interpolation if the term in is neglected. fl
36. As x increases from a to b, the number of sign changes in the Sturm sequence can only
be affected by one or more of the functions changing sign across the corresponding zeros.
From property (2) of the Sturm sequence, it follows that the number of sign changes will
not be affected as we pass over a zero of fi(X), (i > 0). Hence, the number of sign changes
is affected by zeros of fo (x) only. From the last property of the Sturm sequence, it follows
that we lose one sign change, between fo(x) and h (x) as we move across a root of fo(x),
which proves the required result.
37. The roots are
.. 1 1 -1 ± V3 i .. 1 1 1 2 3
(i) 1,1,±V2,3; (n) ±3'±2' 2 ; (m) 4'3'2'3'4;
(iv) 0.2228466,1.188932,2.992736,5.775144,9.837467,15.98287;
(v) ± 0.2386192, ±0.6612094, ±0.9324695.

38. The roots are


(i) 0.2,0.22,1.7,1.7,1.7, ±V'1O i;
(ii) 1.000000,1.380772, -0.03145856 ± 1.191511i, 0.03311558 ± 1.197438i,
1.003364 ± 0.1924398i, -1.195407 ± 0.02239336i;
(iii) ± 0.1252334, ±0.3678315, ±0.5873180, ±0.7699027, ±0.9041173, ±0.9815606;
(iv) 0.1157221,0.6117575,1.512610,2.833751,4.599228, 6.844525, 9.621317,
13.00605,17.11686,22.15109,28.48797,37.09912;
(v) - 0.5000105,0.5000122, 1.000000, 1.071554,7.176522, -1.015845 ± 0.5108793i,
-0.9829008 ± 0.09101494i, -0.8373775 ± 1.212646i, 0.8952141 ± 1.266988i,
1.066871 ± 0.4941294i;
(vi) ± 1, ±i, ±0.3090170 ± 0.9510565i, ±0.5877853 ± 0.8090170i,
±0.8090170 ± 0.5877853i, ±0.9510565 ± 0.3090170i;
(vii) ± 0.07845910, ±0.2334454, ±0.3826834, ±0.5224986, ±0.6494480, ±0.7604060,
±0.8526402, ±0.9238795, ±0.9723699, ±0.9969173;
(viii) 1.1,1.1,1.1,1.1,1.1,1.1.

42. 0.09864724 and -2.074703.


43. (i) 1234.56, 1222.56, 1240.56; (ii) ±1.2 x 109 ; (iii) none; (iv) none on the continuous part,
but the function changes sign at discontinuities (dose to x = -5, -1, 33,137) and may
be assumed to have roots there. (v) none; (vi) none; (vii) 0; (viii) 0, -5.
800 Appendix A. Answers and Hints

44. (n = 4) : 1.500214,0.1691412,6.738274 x 10- 3 , 9.670230 x 10- 5 ;


(n = 6) : 1.618900,0.2423609,0.01632152,6.157484 x 10- 4 , 1.257076 X 10- 5 ,
1.082799 X 10- 7
(n = 10) : 1.751920,0.3429295,0.03574182,2.530891 x 10- 3 , 1.287496 X 10- 4 ,
4.729689 X 10- 6 , 1.228968 X 10- 7 , 2.147439 X 10- 9 , 2.266747 X 10- 11 , 1.093154 X 10- 13 .
47. For functions with heavy roundoff error this process tends to converge to a value for
which the computed value of the function turns out to be exactly zero. However, it may
not be any better approximation as compared to any other value within the domain of
indeterminacy.
48. exp(-1/x2) = O. If an underflow is replaced by zero, then on most machines all the
criteria will be satisfied even when x is not too small. Even if underflow is prevented, all
criterion other than (v) can be satisfied for reasonably large values of x. For example,
if Newton-Raphson method is used, the successive differences will be O(x 3 ) and the
first four criterion can be satisfied at rather large values of x. While in secant or Muller's
method requiring function evaluation at more than one point, if two points like Xl = 0.05
and X = 0.04 are selected as starting values, then because of enormous difference in
the function values at the two points, the iteration will converge to the smaller value
immediately and within machine accuracy Xk+1 = Xk. The first four criteria can be
satisfied in an always convergent method like bisection, regula falsi or Brent's method for
functions like tan x or l/(x - 0) across the singularity. Condition (vi) could be satisfied
for functions like e- x2 , 1/(1 + x 8 ) using any of the iterative methods, since the iteration
will tend to large values of x. The criteria (i) and (v) are ineffective, in the sense that in
most cases, it may be impossible to satisfy them.
49. (i) If (x, y) is a solution, then (-x, -y) is also a solution of the system. Hence, only
positive solutions are listed here. For a = 1 : x = 0.3778177, y = 0.3612387. For
a = 0.1 the solutions (x, y) are (0.05526980,0.05521363), (0.08885481,0.08862207),
(0.1171499,0.1166184), (0.1444941,0.1435009), (0.1728242,0.1711337),
(0.2037208,0.2009707), (0.2390529,0.2346493), (0.2816034,0.2744949),
(0.3362722,0.3243932), (0.4132972,0.3919166), (0.5399257,0.4950757),
(0.8317777,0.6938195);
(ii) (0.1269280, -1.99999999999979), (1.929939,1.773101);
(iii) (0.05397181, -3.020855), (0.3572531,2.785752).
Ei = Xi -0<, where 0< is the required root, then it can be shown
51. If we define the error vector
that Ei+1 "" JEi. Here, J is the Jacobian matrix. From this it follows that Ek "" Jk EO .
Using the expansion of EO in terms of the eigenvectors of J, it can be easily shown that
the error tends to zero only if all the eigenvalues of the matrix J are within the unit
circle. For the given problem, the iteration will converge to root near x = 0.1, Y = -2, if
it is expressed as
x = In(l + e Y ).
While for the other root the iteration converges, if the equations are written in the form

y = In(e X - 1).
Alternately, we can try to write equations in the form

where the constants aI, a2, a3, a4 are selected such that the eigenvalues of the Jacobian
vanish at a point thought to be close to the root. This will ensure convergence, provided
our guess for the root is reasonable. It requires some effort to find these values and the
choice will be useful only if the root is actually close to the anticipated value.
52. (i) (x, y, z, w) = (-0.7672973,0.5906711,1.471900, -1.527193);
(ii) (x, y) = (-1.658599, -0.05550650), (-0.4113288, -2.401050),
(1.690779,0.2533899), (2.016678, -0.8489945);
(iii) (x, y) = (1.937807, -0.1208453), (4.041585, -0.3278100), (8.033715, -0.1625996),
(10.3859916, -0.1821919), (14.40846, -0.1209285), (16.61774, -0.1318683), ...
A.B. Optimisation 801

54. This problem is very difficult to scale properly, since the solution ranges over several
orders of magnitude for the range of T and nH values considered. Consequently, it is
difficult to solve the system of nonlinear equations, unless a high precision calculation
is done or only absolute accuracy to some reasonable level is required in the solution.
In the latter case, the abundances which are very low will not be determined to any
accuracy and in particular at low temperatures it may not be possible to estimate ne to
any reasonable accuracy. If all variables are expressed in terms of ne, then it is possible
to get accurate value of all abundances, even if some of them are very small. The results
are shown in the following table

1 X 10 3 10 10 1.000000 X 1010 4.670506 X 10- 20 1.000000 X 109 6.791975 X 10- 76 7.425747 X 10- 311
1 X 10 3 10 25 1.000000 x 10 25 1.476944 x 10- 12 1.000000 x 10 24 2.147811 X 10- 68 7.425747 x 10- 311
8 X 103 10 10 2.132758x 10 7 9.978672 x 10 9 9.997801 x 108 2.198707x 105 2.034563 x 10- 18
2 X 10 4 10 18 2.318538 X 10 17 7.681462 X 10 17 9.780873 X 10 16 2.191270 X 10 15 3.796599 X 10 5
1 X 105 10 10 7.621882 X 10- 3 1.000000 X 1010 5.926045 X 10- 14 8.676977 X 10- 2 1.000000 X 109
5 X 105 10 10 1.928664 X 10- 4 1.000000 X 10 10 3.094520 X 10- 19 4.971848 X 10- 5 1.000000 X 109
5 X 105 10 25 8.004067 X 10 24 1.995933 X 10 24 5.412848 X 10 23 4.182561 X 10 23 4.045912 X 10 22

55. The iteration usually stops because one of the expressions inside the square root be-
comes negative. However, if subroutine DAVIDN is used and the parameter () is changed
very gradually, then the iteration does converge from almost arbitrary starting values to
the following solution: 2.997669, 3.986057, 8.113169, 2.331476 x 10- 3 ,5.983403 X 10- 4 ,
5.490709 X 10- 4 , 0.02613967, 0.5896927, 3.455701, 7.373799 x 10- 4 .
56. Substituting the tentative solution into the equations gives the residual, which can be
absorbed in the terms. Simplest alternative is to perturb the constant at the right-hand
side to absorb the effect. The perturbation turns out to be -6.33 x 10- 3 and 4.82 x 10- 3 ,
respectively. To find the error in tentative solution, linearise the system by assuming that
the required perturbation are small which gives equations of the form
(-4x - 3y)8x + (-3x + 4cosy)8y = 8b!,
(6x - 2y2 - 3sinx)8x - 4xyoy = Ob2,
where 8b! and 8b2 are the calculated perturbations in the right-hand side. Solving this
system of linear equations gives the expected errors 8x and 8y as 6.67 x 10- 4 and 1.09 x
10- 3 . As compared to that, the actual error in this solution is 7.79 x 10- 4 and 3.9 x 10- 4 ,
respectively. It can be easily seen that, this process is identical to one step in Newton's
method. If the perturbations are absorbed in different terms, then the results will be
somewhat different.
57. The roots are: (x, y) = (0.3652685,0.5698476) and (0.3700294,0.5688228). Further, if
(x,y) is a root, then (-x, -y) is also a root.

A.8 Optimisation
3. (i) 1(0.61703085) = -391.3917, 1(4.610833) = -1681.383, 1(14.26010) = -133652.7, the
last one is the global minimum.
(ii) 1(±0.9533098) = -415.6791, 1(±0.6861885) = -270.5583, 1(±0.2492869) =
-234.7165, the first pair gives the global minima.
(iii) ±0.9876883, ±0.8910065, ±0.7071068, ±0.4539905, ±0.1564345, are all global min-
ima with 1 = -1.
4. (i) b = 2, the minimisers are ax = 4.188790 + 2mr, where n is any integer,
for b = 1.1, ax = 3.571292 + 2mr and for b = 1.001, ax = 3.186295 + 2mr.
802 Appendix A. Answers and Hints

(ii) f(4.913180) = -4.814470, f(l1.085538) = -11.04071, f(17.33638) = -17.30761,


f(:::!3.60428) = -23.58313, f(29.87859) = -29.86187.
(iii) f(1.1) = 0 is not a local minimum, since f'(1.1) i= O.
7. Apart from the global minima at the roots of x = a sin x, the function has infinite number
oflocal minima at ±(14.03700+2n7r) for a = 10, or at ±(26.65449+2n7r) (n = 0, 1,2, ... )
for a = 20.3959.
8. Global minima at 1.446282 x 10- 282 , 0.5952897, 1.210769, and a local minimum at
1.571090, where f = 2.600965 X 10- 10 .
9. <7 = 0.1, p = 0.01: = -0.1, (-0.05238067, -0.02567633);
Xo
Xo = -1, (-0.8087300,7.319328)
<7 = 0.1, p = 0.05: Xo = -0.1, (-0.05238067, -0.02567633);
Xo = -1, (-0.8087300, .6638655)
<7 = 0.9, p = 0.01: Xo = -0.1, (-0.09654989,0.08702072);
xo = -1, (-0.9911137,7.319328)
<7 = 0.9, p = 0.5: Xo = -0.1, (-0.09654989, -0.05571456);
Xo = -1, (-.9911137, -.8662966)
It may be noted that the last choice of <7 and p excludes the true minimiser at x
-0.0414 ... from the interval of acceptable points.
10. The Hessian matrix is
H = (1200xI +2- 400X2 -400X 1 )
-400Xl 200 .
This matrix will be positive definite, provided one of the diagonal elements and the
determinant are positive definite, which give X2 < xI + 0.005.
11. If the starting point is (-2, y) or (x, -1), the method of steepest descent will converge
in the first iteration.
12. Minimum: f(l, 0) = -1, maximum: f( -1, -1) = 1, saddle points: f(O,O) = f(O, -1) = O.
Since the function is unbounded, there is no global minimum or maximum.
14. The iteration may converge to saddle point, particularly for the first function for which
the Hessian matrix is singular at (1,1). This singularity can be detected by considering
the norm of matrix H in the BFGS method, as explained in the text.
17. Since the Hilbert matrix is positive definite, the minimum is x = O. However, because of
ill-conditioning it may be difficult to find an accurate minimum.
18. The minima are: (i) f(l, 1, 1, 1) = 0,
(ii) f(l, 1, 1, 1) = 0 and
f( -1.362053,1.846517,0.4795782,0.2239666) = -4.984719,
(iii) f(O, 0, 0, 0) = O.
21. (i) The local minima are f(0.9358222, 2.5) = -8.510601,
f(3.305407, 1.381966) = -34.88072, f(3.305407,3.618034) = -34.88072,
f(7.758770,2.5) = -132.6098. There is no global minimum, since the function is un-
bounded from below.
(ii) The global minimum is f(2, 1.5,3) = 0.75, and there is no other local minimum.
22. There are infinite number of local minima apart from the four global minima, which
correspond to the solution of the system of nonlinear equation (see {7.52}). Some of these
local minima are: f( -0.08771214,4.774252) = 11.53299, f( -0.01914651,11.00986) =
7.055762,
f( -0.008017066,17.28477) = 5.879316.
23. This is a highly ill-conditioned problem, the direction set method fares very badly for
high n, while the BFGS method is reasonably efficient. Higher precision arithmetic may
be required to get good results. The minima are:
(n = 4), f( -0.1830481,1.080739, -0.4526588,0.6144563) = 0.06958773,
(n = 6), f( -.01572509,1.012435, -.2329916, 1.260430, -1.513729, .9929964) = .00228767
=
(n 9), f( -1.530704 X 10- 5 ,0.9997897,0.01476396,0.1463423,1.000821, -2.617731,
4.104403, -3.143612, 1.052626) = 1.399760 x 10- 6 .
A.9. Statistical Inferences 803

24. The corresponding LP problem is to minimise -100xa - 200Xb - 125xc - 150Xd subject
to

The solution is x = (2.5,0, 18.75,0) T and the maximum profit is 2593.75. If only items
band d are manufactured, then it is not possible to find any combination which utilises
the resources fully. While if all items are made in equal quantities, then 4 units of each
can be manufactured giving a profit of 2300.
25. x = 0, y = 1, f(O, 1) = 1.
27. The minimum is f(O, 5, 2, 3) = 9.
28. The LP problem is to minimise Xl +X2+X3+X4+X5+X6, where X = xl-X2, y-l = X3-X4
and z - x = X5 - X6, which gives the constraints
-X5 + X6 ~ 1,
The solution is x = 2.5, y = 1, z = 2.5 and f = 2.5.
29. The minimum is f(O, 1.5,0.5) = 0.5.
30. The minimum is f(l, 0,1,0) = -5/4.
31. For n =6 the maximum is f(lO lo , 0, 0, 0, 0, 0) = 1010.
32. Graphically the problem boils down to finding a straight line parallel to 2x + y = c, with
smallest value of c and passing thr::mgh the feasible region. This will be clearly achieved
by a line that is tangent to the ellipse in the third quadrant, which gives x = -4/V17 and
y = -0.5/V17. The same solution will be obtained, if negative sign is assumed for the
square root while eliminating y. Positive sign yields the maximum of the same expression.
33. In this case, the minimum is at x = 1 and y = 0, where y is not a continuous function of
x.
36. Simulated annealing is not likely to succeed in this case as the minimum is located in a
very narrow dip, where the function value drops significantly.

A.9 Statistical Inferences


1. The theoretical values of x are 1.28, 2.32, 2.81, 3.48, 4.06 respectively.
5. Consider binomial distribution with p = 0.05 and n = 40. The probability is
1 - Bn,p(O) - Bn,p(l) - Bn,p(2) - Bn,p(3) = 0.14
6. For 3 dices the probability of getting a sum of 3, ... , 18 is 1/216, 3/216, 6/216, 10/216,
15/216, 21/216, 25/216, 27/216, 27/216, 25/216, 21.216, 15/216, 10/216, 6/216, 3/216,
1/216, respectively. For 100 dices the average is 350 and standard deviation is about
17.08
10. The probability of detecting;::: 8 neutrinos per day is about 0.1%. In 1 minute the average
background is 1/720 giving the probability for;::: 8 neutrinos as about 10- 27 .
11. 320 ± 20.4.
12. For 106 particles per second, the average in dead time J.L = 0.1. Irrespective of how many
particles come in this period only one will be detected giving detection of

particles, giving an efficiency of (1- e-I")/ J.L = 0.95. For a rate of 5 x 106 ,10 7 per second,
the efficiency drops to 0.79 and 0.63 respectively.
17. Using binomial distribution the probability of getting 0,1,2,3 sixes is 0.579,0.347,0.069,
0.005 respectively. Thus the X 2 value for difference with observed distribution is about
52.9 with 3 degrees of freedom. Probability of getting such values is very small and hence
the dice should be loaded.
804 Appendix A. Answers and Hints

18. The variances are


. 1 2 x2 2 x
(t) 20"X
y
+ 40"Y
y
- 23
y
0"xy,

(n.. ) 1
-2--2 x o"x2 + y 2O"y2 + 2xyO"xy ) ,
(2
x +y

(iii) ( x) - 1 2 O"x2 + 16y2 0" 2 + -----r~==;:c


8y (x )
- 1 O"xy.
Jx 2 + 4y2 x 2 + 4y2 y Jx 2 + 4y2 Jx 2 + 4y2

A.IO Functional Approximations


1. Ilhlli = IIhll2 = IIhlloo = 1, 1112111 = 1, 1112112 = V1.l25, 11121100 = 1.5, Ilfall1 =
1.01, II fa 112 = V1.52, II fa 1100 = 101.0. It can be seen that L1 norm is least sensitive to
perturbations in the function, while Loo norm is most sensitive.
2. F = (3.671 ± 0.078) + (0.089 ± O.Ol1)x.
3. It is convenient to perform the change of variable t = (x - X k) / h, where h is the uniform
spacing between the points. Let the parabola be pet) = aD + a1t + a2t2, then the normal
equations are

5aO + lOa2 = Ik-2 + Ik-1 + Ik + 1k+1 + 1k+2 ,


10a1 = -2fk-2 - fk-1 + 1k+1 + 2fk+2 ,
lOaD + 34a2 = 4fk-2 + fk-1 + Ik+l + 41k+2 ,
which give the coefficients of parabola

aD = Ik - fgUk-2 - 41k-1 + 6fk - 41k+1 + 1k+2) = Ik - fg8 41k ,


a1 = to (-2/k-2 - !k-1 + 1k+1 + 2/k+2),
a2 = -tt(2fk-2 - Ik-1 - 2/k - Ik+1 + 2/k+2).
Using these coefficients we get F(Xk) = aD, which gives the required formula. At the first
point, we get the formula
1 3 3 4
F(xo) = aD - 2a1 + 4a2= 10 + -~ 10 + -~ 10.
5 35
1 3
= 10 + -(-/0 +3/1 - 312 + fa) + -Uo -4h +612 - 4fa + 14)
5 35
Similarly, we can get the formulae for the other end points
2 3 1 4
F(X1) = aD - a1 + a2 = h - -~ 10 - -~ 10,
5 7
F(XN-1) = IN-1 + ~\13IN - ~\14IN' F(XN) = IN - ~\13IN + ~\14IN.
5 7 5 35
4. The required formulae can be easily derived using the coefficients of parabola obtained
in the last exercise. Thus, F'(Xk) = a1, F'(xo) = a1 - 4a2 and F'(X1) = a1 - 2a2.
6. L1: 0.7586 + 0.3476x - 0.01095x2, IIEll1 = 0.200, IIEI12 = 0.106, IIElioo = 0.095
L2: 0.6818 + 0.3771x - 0.0l322x 2 , IIEl11 = 0.232, IIEII2 = 0.083, IIElloo = 0.046
Loo: .6594 + 0.3881x - 0.01414x2, IIEl11 = 0.251, IIEI12 = 0.086, IIElloo = 0.033
After changing two data points
L1: 0.7586 + 0.3476x - 0.01095x2, IIEIII = 0.400, IIEI12 = 0.196, II Ell 00 = 0.127
L2: 0.7152 + 0.3711x - 0.0l322x 2 , IIEIII = 0.422, IIEI12 = 0.188, IIElioo = 0.121
Loo: 0.735 + 0.3757x - 0.0l429x 2 , IIEl11 = 0.511, IIEII2 = 0.201, IIElloo = 0.096
11. The solutions are, x = (1,I)T, (1,I)T, (0.001655,0.00l655)T, (-0.4967,0.5033)T. The
first two right-hand side vector differ by a vector which is orthogonal to the range of
A . IO. Functional Approximations 805

A. Consequently, the solution is exactly same. While b3 is almost orthogonal to the


range of A, and the solution is close to zero, b4 - b3 is small, but has a component
corresponding to the smaller singular value of A, '72 = 0.01, which magnifies the difference
in the solutions. The solutions for matrix Al are (1.0032, 0.9965)T, (-9.013, 11.01O)T,
(9.9847, -9.9784)T and (9.5045, -9.4951)T. These solutions are significantly different,
because although singular values of Al are close to those of A, the corresponding columns
of SVD matrix U are quite different, and as a result a component corresponding to the
smaller singular value IJ2 is picked up. For the matrix A2 the column of U corresponding
to IJ2 is identical to that for A and as a result the difference in solutions is not too large.
The solutions for A2 are: (0.99967, 0.99967)T, (0.99950, 0.99950)T, (0.00182,0.00182)T
and (-0.4511,0.4580)T.
12. If m = n, then we get the Fourier cosine transform and the coefficients are given by the
same formula, provided a factor of 1/2 is included in the last term to get the expansion
1 1
+L
n-I
F(a,x) = -aD aj cos(jx) + -an cos(nx).
2 j=1 2
This approximation interpolates the function at the given points. The coefficients can be
obtained by using the orthogonality of functions over the discrete points with end points
having a weight of 1/2 and middle points with weight 1.
14. 0.9998834x - 0.3306000x 3 + 0 .1814533x 5 - 0.087l766x 7 + 0.0218689x 9,
with maximum error of 3.08 x 10- 5 .
16. The least squares fit gives the coefficients
(i) a = (0.00000101, 0.1047l68, 0 .3141450, 0.5975427, 0.8224456, 0 . L~68435,
1.0165989, 0.9668435,0.8224456,0.5975427, 0.3141450, 0.1047l68, 0.00000101)T;
(ii) a = (0.0082417,0.2338425, 0.2945998,0.4144088, 0.4580286, 0.4962023,
0.5009666,0.4962023,0.4580286,0.4144088, 0.2945998, 0.2338425, 0.0082417)T;
(iii) a = (0.0082417,0.2338425,0.2945998,0.4144088, 0.4580286, 0.4962023,
0.5009666,0.4962023,0.4580286,0.4144088, 0.2945998, 0.2338425 , 0.0082417) T ;
(iv) a = (-0.00000144, 0.1028774,0.2098219, 0.3219762, 0.4205090, 0.4851111 ,
0.5078317,0.4851111,0.4205090,0.3219762, 0.2098219, 0.1028774, -0.00000144)T.
The approximation with Chebyshev points gives much better results for the last function.
17. With linear B-spline the coefficients are:
(i) a = (0.0004763,0.3113436, 0.592467l, 0.8154125, 0.9585857,
1.0079136, 0.9585857,0.8154125, 0.592467l, 0.3113436 , 0.0004763) T ;
(ii) a = (0.9987088,1.1088428,1.2454593,1.4139551,1.6204816,
1.8709980,2.1715267,2.5280136,2.9466657, 3.4326572, 3.9952912)T;
(iii) a = (0.0551934,0.3217544,0.3992901,0.4612640,0.4913419,
0 .5017125,0.4913419,0.4612640,0.3992901,0.3217544,0.0551934)T;
(iv) a = (0.0001678,0.1701409,0.2997401,0.4097278,0.4798117,
0.5039490,0.4798117, 0.4097278,0.2997401, 0.1701409, 0.0001678)T .
For parabolic B-splines the coefficients are:
(i) a = (-0.0001324,0.1584345,0.4596100, O. 7l59004, 0.9020727,
0.9999601,0.9999601,0.9020727, O. 7l59004, 0.4596100, 0.1584345, -0.0001324)T;
(ii) a = (1.0000250, 1.0497393, 1.1722539, 1.3237483, 1.5102509,
1. 7377496 , 2.0122503,2.3397492, 2.7262518, 3.1777460, 3.7002606, 3.9999749)T;
(-iii) a = (0.0211807, 0.2660660,0.3530617,0.4410953,0.4780686,
0.5004740,0.5004740,0.4780686,0.4410953, 0.3530617, 0.2660660, 0.0211807)T;
(iv) a = (-0.0000319,0.1235992,0.2491155,0.3656637,0.4536294,
0.5000357,0.5000357,0.4536294,0.3656637, 0.2491155, 0.1235992, -0.0000319) T.

19. Consider the orthogonality relation (10.26) for k = nand j = 0,1, ... , n - 1. These
equations define a system of n homogeneous linear equations for p~n) (Xi). Because of
806 Appendix A. Answers and Hints

orthogonality of basis functions the coefficient matrix is nonsingular giving p~n) (x,) = 0
as the unique solution.
24. The following least squares approximation has a maximum error of 3.25 x 10- 7 :

R (x) = 0.99999786x + 0.65879419x 3 + 0.041072lOx 5


5,4 1 + 0.99206571x2 + 0.17226764x4

25. a1 = -0.00962989, a2 = 562.944.


26. The convergence to minimum is very slow because of discontinuity in the derivatives. The
minimisation tends towards the Chebyshev points considered in {1O.16} (iv).
31. The FFT algorithm breaks the summation into logr N summations, each involving r
terms which requires r - 1 complex additions and equal number of multiplications to
evaluate the required product.s with powers of w. This gives the total number of operations
as (r-1)N logr N, where each operation consists of one complex addition and one complex
multiplication. Here we have not included simplifications that are possible by noting the
special form of powers of w that enter into the calculations. From subroutine FFT that
implements the power of 2 FFT algorithm, it can be easily seen that it requires N log2 N
additions and ~ N log2 N multiplications. Similarly, it can be shown that the power of 4
FFT algorithm will require N log2 N additions and ~ N log2 N multiplications.
35. In general, direct evaluation of convolution requires O(N2) operations. However, in
this case, if the zero elements of 9j are considered, then it requires 15N multiplica-
tions and similar number of additions, to evaluate the convolution. Using FFT requires
O(3N log2 N) arithmetic operations to evaluate the two DFT and one inverse DFT. In
addition N multiplications are required to take the product of DFT's. The result ob-
tained using FFT will not be the same as that obtained by the direct evaluation of the
sum, since DFT implicitly continues the function periodically. Consequently, near the end
points the convolution is distorted due to the function value near the other end point,
which will be effectively used in computing the convolution. To avoid this problem, we
can pad the input fJ and 9J with at least seven zeros at both ends before taking the FFT
and then discard the extra points after calculating the convolution.
37. The truncated series with sigma factor is

1
932 ()
t = - + -2 (sin( 7r /32).
sm t + -1 sin(37r /32) .
sm 3t + ... + -1 sin(3l7r /32). )
sm 31t ,
2 7r 7r /32 3 37r /32 31 3l7r /32
2 (sin( 7r /32)
932 t
I ()
= - cost + sin(37r /32) cos3t + ... + sin(317r /32) cos31t
)
.
7r 7r/32 37r/32 317r/32
It is interesting to note that 932(t) is the truncated Fourier series of slightly blunted
square wave h(t) and 9b(t) is the truncated Fourier series of a narrow square wave 8(t),
given by

~ + l!-t, O::;t::;:fu; .!2


"'
O::;t::; :fu;
1, :fu < t < 3i;; 0, :fu < t < 3i;;
h(t) = 1- l!- (t - 3i2"), 31" 33" .
32 ::;t::; 32 ' 8(t) = 16
--,r, 31" 33" .
32 ::;t::; 32'
0, 33" <t< 63" . 0, 33" <t< 63" .
32 32 ' 32 32 '
l!- (t - 6;;), 63"
32 ::; t::; 27r;
16
-;r, 63"
32 ::; t::; 27r.

2 1
39. L(u) = - - - - .
8 8 + 1
40. Taking Laplace transform of the equation yields

L(u) = 8 +1- e- S
8(8 - c s)
A.10. Functional Approximations 807

The exponential order of this function can be estimated by the pole with largest real
part. This pole is given by the zero of 8 - e- s , which is approximately 0.5671. The exact
solution of the equation for 0 ~ t ~ 5 is

1 + t, o~ t ~ 1;

~ + !t 2, 1 ~ t ~ 2;

u(t) = ~+~t+i(t-l) 3 , 2 ~ t ~ 3;

¥ + ~t + ~(t - 1)2 + -14(t - 2)4, 3 ~ t ~ 4;

- it + ¥t+ i (t - 1)2 + ~ (t - 2)3 + 1~O (t - 3)5 , 4 ~ t ~ 5.

41. Taking Laplace transform of the equation, we get

L(u) = 83 - 238 2 + 628 - 40


84 - 228 3 + 398 2 + 228 - 40

which gives the required result. The ill-conditioning has been avoided , because the com-
mon factors have been cancelled exactly from the numerator and denominator , which
may not be possible in numerical work.
43. The maximum error in the following approximation over [0,1] is 1.3 X 1O- 11 :

1 - 0.470595788392x 2 + 0.027388289676x 4 - 0.0003723422685x 6


%,6(X) = 1 + 0 .0294042116076x 2 + 0.0004237288136x 4 + 0.0000032355435x6
23535.6031992064
-115.07874015748 + 4017.037828631424
x2 + 92.70474093766 + 2 1670.694958431145
X - 3.504759928139 + x 2 +41.760648911736
The evaluation of continued fraction will involve a loss of at least two significant figures,
because the final result is of the order of unity, while the first coefficient is approximately
115.
47. The expansions are

1 - lOx 2 + 45x 4 - 120x 6 + 21Ox 8 -


+ 21Ox 12 - 120x 14 + 45x 16 - lOx 18 + x 20
252x 10
= O. 17619705To (x) + 0.24026871T4(x) - O. 14785767T6 (x)
- 0.32035828T2(X)
+0.07392883Ts(x) - 0.02957153TlO (x) + 0.00924110T12 (x) - 0.00217438T14 (x)
+0.00036240T16 (x) - 0.00003815T1S(X) + 0.00000191T2o(x).

Since the maximum coefficient in the Chebyshev expansion is almost three orders of
magnitude smaller than that in the polynomial , the roundoff error will be less.

48. For cosine the truncation error is given by the magnitude of first term that is neglected,
which is ~ 10- 21 . If the truncated power series is expressed in terms of the Chebyshev
polynomials, then the coefficient of T20(X) is 7.84 X 10- 25 and this term can be easily
dropped. The coefficient of T18(X) is -1.18 X 10- 21 , which is comparable to truncation
error in the truncated power series and this term cannot be dropped. Thus, only one term
can be dropped. For eX, since the series converges quite rapidly, a reasonable estimate
of the truncation error can be given by the first term that is neglected , which is ~
2 X 10- 10. Hence, the last two terms in the Chebyshev expansion which have coefficients
of 2.45 x 1O- 11 a nd 1.02 x 10- 12 respectively, can be dropped. The Maclaurin series for
In(1 + 0.9x) converges rather slowly and it can be shown that the truncation error is
bounded by 10 times the first term that is neglected. This gives a truncation error of
~ 0.05. It can be shown that the coefficients of Chebyshev expansion fall off rapidly and
in this case, about 14 terms can be dropped during the process of economisation. The
808 Appendix A. Answers and Hints

economised power series are

cos x ~ 1- f3 - (~ - 2OO(3) x 2 + (~ - 6600 f3 j x4 - (~ - 84480(3) x 6

+ (~
8!
- 549120(3) x 8 - (..!-.
1O!
- 2050048f3 x lO + (..!-.
12!
- 4659200(3) x 12
- (~
14.
- 6553600(3 ) x14 + (~
16.
- 5570560(3) x 16 - (~
18.
- 2621440(3 ) x 18 ,

eX ~ 0.999999999999 + 1.000000000269x + 0.500000000073x2 + 0.166666661284x 3


+0.04166666581x 4 + 0.008333363474x 5 + 0.001388892542x 6 ·+ .000198343805x 7
+0.000024794541x 8 + 0.000002824625x 9 + 0.000000281836x lO,
In(l + 0.9x) ~ 0.00343 + 0.9444x - 0.50327x 2 - 0.05829x 3 + 0.24811x4
+0.55445x 5 - 0.55501x 6 ,

where f3 = 1/(2 19 20!) ~ 7.839809 x 10- 25 .

50. The Chebyshev coefficients can be estimated by performing FFT using

9k = exp(cos(2k1r/N»,

N = 16 is sufficient to get an accuracy of 10- 9 in all the required coefficients. The


coefficients are co = 2.53213176, C) = 1.13031821, C2 = 0.27149534, C3 = 0.04433685,
C4 = 0.00547424, C5 = 0.00054293, C6 = 0.00004498, which yields the rational function
approximation

To (x) = 1.00100476 + 0.48257145TI{x) + 0.03971850T2(X) .


2,2 1 _ 0.47830627Tl (x) + 0.03873383T2 (x)

The maximum error in T4,O (x) is 5.9 X 10- 4 , while that in T2,2 (x) is 1.89 X 10- 4 . The
corresponding minimax approximations are

R4,O(X) = 1.00009000 + 0.99730925x + 0.49883512x2 + 0.17734527x 3 + 0.04415552x4,


R (x) = 1.00007255 + 0.50863618x + 0.08582937x 2 .
2,2 1 _ 0.49109193x + 0.07770847x 2

The maximum error in R4,O(X) is 5.47 x 10- 4 , while that in R2 ,2(X) is 8.69 x 10- 5 .
52. The minimax approximations are

Pn-2(X) = xn - _1-1 Tn(x),


2n -
over [-1,1];

Pn-l (x) = xn - - 12
2 n-
1 T';:(x), over [0,1].

53. The interpolating polynomial is p(x) = 2/3, while the minimax polynomial of degree less
than two is p(x) = 3/4. The maximum errors are 1/3 and 1/4, respectively.
54. The minimax polynomial is f(x) - iT3(X) = 1 + ~x + x 2 .
55. Tl = 0 and T2 = 1/3363 while the expansion is
1 2378 816 140 24 4
-- ~ -To(x) - -T2(X) + -T4(X) - -T6(X) + --T8(X)
1+x2 3363 3363 3363 3363 3363 .

56. The limits on error follow from (10.164), which gives

4 1
:; (n + 1)2 E (Xl 1
(2k - 1)2 =
1'(

"2 (n + 1)2
1
.
A.10. Functional Approximations 809

The approximations are

T9,0(X) = 1.06305x - 0.88386x 3 + 4.69522x 5 - 7.39114x 7 + 4.02407x 9 ,


1.00029x - 1.24796x 3 + 0.29458x 5
T54 ()
X - ------~------;---
, - 1 - 1.41058x 2 + 0.44130x 4 '
R9,0(X) = 1.28588x - 3.88985x 3 + 15.63387x 5 - 22.12209x 7 + 1O.63219x 9 ,
1.0232700x - 1.6931651x 3 + 0.6717794x 5
R54 ()
x - ------------,:----------;---
, . - 1 - 1.7556079x 2 + 0.7568116x4 '
1.9212 X 10- 8 + 0.999996576213x - 0.451974837608x 2 - 0.189912429240x 3
R33 ()
X - ------------------~-------~---
, - 1 - 0.452075196681x - 0.355472695998x 2 + 0.069535056619x 3
The maximum error in these approximations are 0.063, 0.044, 0.031, 0.0055 and 1.92 x
10- 8 , respectively. The last approximation is over the interval [0,0.5J. To get the correct
value at x = 0, we can look for approximations to sin - I x/x.
57. At x = 2 we can achieve a maximum relative accuracy of ~ 0.026, while at x = 1
no reasonable value can be obtained using the asymptotic series. About 16 terms of
Chebyshev expansion will be required for an accuracy of 2- 25 .

To (x) = 0.80550522 + 0.70097449T2(X) + 0.08132690T4(x) + 0.00133236T6(X)


6,6 1 + 0.95111581T2(X) + 0.14068429T4(X) + 0.00504286T6(X) ,

with maximum error 1.89 x 10- 6 .


60. The maximum error n RI,I(X) and R2,2(X) is 0.0018 and 1.31 X 10- 5 , respectively. With
these starting values the Newton-Raphson method will require 3 and 2 iterations respec-
tively, to achieve a relative accuracy of 2-64. The approximations are:

R (x) _ 0.2334689 + 1.3851241x


1,1 - 1 + 0.6214895x '
R (x) = 0.13894896 + 2.83602154x + 2.72638752x 2
2,2 1 + 3.96167477x + 0.73975823x2 '

61. The maximum relative error in the following approximations are 6.76 x 10- 10 and 2.48 x
10- 9 , respectively:

R4,4(X 2 ) =
1.128379166334+ .217020814 7306x 2 +.04975623466289x 4 + .00382418049807x 6 + .0000844977610078x 8
I +0.52566298661377x 2 +0. I 19316626255058x 4 +0.0144034527781Ix 6 +0.00083240506459358x 8

R' I _ 0.564189582152+3.7637189120372?z +5.129508094085 -!:r +0.7888279543317 ~


3,3 (~ ) - 1+7.1710165204368 ~ + 11.9274827896078 :fr +3.8523310784287 ~

J (x) ~ Po(x) where


62. 0 Qo(x)'

PO(x) = 1.00000000283 - .23211639547x 2 + .01131903349x 4 - .00019480145226x 6


+ .000001344352592x8 - 3.211918985 X 1O-9 x I0,
Qo(x) = 1 + .0178836173332x 2 + .00016492837725x 4 + .0000010296016141x 6
+ 4.652771508 X 1O- 9x 8 + 1.73705556 X 1O-IlxlO,
PI (x) = .500000000072 - .0545520781894x 2 + .00167487827953x 4
- .0000205397463597x 6 + 1.08512959204 X 10- 7 x 8 - 2.083992338 X 1O-lO x lO ,
QI (x) = 1 + .01589584426198x 2 + .0001284032936675x 4 + 6.87138742777 X 10- 7 x 6
+ 2.60487994245 X 1O- 9x 8 + 6.9092998374 X 1O- 12 x I0 .
The maximum errors in these approximations are 2.83 x 10- 9 and 7.2 x 10- 11 , respec-
tively. The exact values of zeros of Jo(x) to 12 significant figures are 2.40482555770 and
5.52007811029. Similarly, zeros of h(x) are 3.83170597021 and 7.01558666982.
810 Appendix A. Answers and Hints

63. The approximations for k = -1/2,1/2, and 3/2 are respectively:


1. 77245380305+2.0093539162e x +0.568207763005e 2x +0.034983034684e 3x +0.0000595581732765e 4x
1+ 1.84076133244ge x + 1.044865089683e 2x +0.1956912234403e 3x +0.008630506158283e 4x
1. 999999865103+ 179. 135~6511 76 + 9384. 30~581206 + 166037. ~581868 + 654697 .~812554 + 54088:tJ89726
1+ 89. 97875220251 + 4731.31881544K + 85024.54791313 + 391779. 2278651 + 2992. 738831806

0.88622689922+ 1. 0 1846830281 eX +0.296300309698e 2x +0.0196807358036e 3x + .0000813692815896e 4x


1+ 1.50276977212e x +0.6732205995827e 2x +0.095921111854 74e 3x +0.003088652398712e 4x
0.6666666839314+ ~~ + 1714.6;r 6470 + 19030. :'1;"68205 + 62200. ~~966954 + 45670Jb06686
1+ 67.882;%663845 + 2487.1!¥90582 + 25422. ;~29239~ + 55446·~i(24070 + 9U03.6;t665,J36

The maximum relative errors in these approximations are 2.76 x 10- 8 , 2.98 X 10- 8 , 6.88 X
10- 8 , 6.83 X 10- 8 , 2.60 X 10- 8 and 2.03 x 10- 8 , respectively. For more approximations
to Fermi integrals see Antia (1993) or functions FERMM05, FERM05, FERM15 and
FERM25 in Appendix B.
64. The maximum errors in the following approximations for D(x)/x and xD(x) are 3.04 X
10- 8 and 2.67 x 10- 8 , respectively.
R45 =
, 1.0000000303- .04541189370x 2 +.032851709369x' +.00098304424147x 6 +.000156319072713x 8
1+.62125626124x 2 +.180344290425x 4 +.0317692194191x 6 +.00359378367607x 8 +.000272852786029x IO
0.49999997332-7.60364446685 ~ +39.4001823571 :::'or -2.2997676475 -:\ -99.1484559775 -:\
R~,4 = 1-15.7073307220:f,: +85.~096151021 :!r -37.~237479620 -!r> -21~.0217238992 -!r> x

65. It is rather difficult to obtain this approximation, as the iteration may not converge,
unless the starting approximation is very close. The resulting approximation, R5,5 (x)
with a maximum error of 4.62 X 10- 5 is
+ O. 76777946x 2 + 1.66690211x 3 - 0.6242701Ox 4 + !x 5
9.05085828x
22.60211104 + 2.77695983x + 5. 14840348x 2 + 2.99556743x 3 - 1.23860387x 4 + x 5
67. Some of the approximations are
P2(X, y) = 1.2105 - 1.3582x + 3.0498x 2 - 1.3582y + 1.5218xy - 2.6545x 2 y + 3.0498y 2
-2.6545xy2 + 6.3723x 2 y 2,
P2(X, y) = 1.2180 - 1.1335x + 2.800lx2 - 1.1335y - .2570xy - 1.0497x 2 y + 2.800ly2
-1.0497xy2 + 4.9763x 2 y 2,
P3(X, y) = .9543 + .4318x - .7198x 2 + 2.0062x 3 + .4318y + 1.5474xy - 2.9716x 2 y
+2.1065x 3 y - .7198y 2 - 2.9716xy2 + 5.0239x 2 y2 - 3.1270x 3 y 2 + 2.0062y 3
+2.1065xy3 - 3.1270x 2 y3 + 4.3656x 3 y 3,
P4(X, y) = 1.0073 - .1591x + 1.9805x 2 - 2.0107x 3 + 1.9071x4 - .1591y + .5486xy
-3.2997x 2 y + 5.5176x 3 y - 2.9111x 4 y + 1.9805y 2 - 3.2997xy2
+20.8424x2y2 - 33.3178x 3 y 2 + 18.5420x 4 y 2 - 2.0107y 3 + 5.5176 xy3
-33.3178x 2 y 3 + 54.7181x 3 y 3 - 29.3791x 4 y 3 + 1.9071 y4
-2.9111 xy4 + 18.5420x 2 y 4 - 29.3791x 3 y4 + 16.5259x 4 y 4.

The first one is for n = 10, while the others are for n = 20. The maximum errors (with
respect to discrete data) in these approximations are 0.2105, 0.2180, 0.0457 and 0.0073,
respectively.
69. The maximum error in the following approximation, P9(X) is 2.84 x 10- 5 .
0.9998891581x - 0.3306879479x 3 + 0.1817663353x 5 - 0.087580l559x 7 + .0220392457x9
A.ii. Algebraic Eigenvalue Problem 811

A.II Algebraic Eigenvalue Problem


1. 15.
3. If the dominant eigenvalue Al has a nonlinear divisor of order m, then the components
corresponding to other eigenvalues will fall off rapidly and we only need to consider
components corresponding to the generalised eigenvectors corresponding to AI. Using
the result
A s Xj~AIXj+
's (8)'S~1
1 Al Xj~l+ (8)'S~2
2 Al Xj~2+"'+ ( j -
8
l )'S~j+l
Al Xl, (8 > j).

Clearly as 8 --> 00, Xl becomes dominant and

ASXj ~ (. 8 )A~~j+l (Xl + j ~ 1 AI X2 + ... ) .


J-l 8-)+2
Eigenvalues of A are 6, 3, 3 and corresponding eigenvectors are (3,4, 2)T and (0,1, -1).
The matrix B has a pair of quadratic divisors corresponding to A = 3 ± y'5, with eigen-
vectors
(±y'5,3 ± y'5, 2, 6)T.
4. It is really impossible to be sure, but if starting with initial vectors which are orthogonal
to the known eigenvectors corresponding to the dominant eigenvalue, the iteration tends
to converge to subdominant eigenvalue; then it is most likely that all the independent
eigenvectors have been found.
5. Use the iteration
Ws
U s +I = ----,=----,-
max(w s )
If Xl and X2 are the eigenvectors corresponding to Al and -AI, then it can be shown
that as 8 --> 00

max(ws) = Ai,
Alternately, a simple shift of origin will isolate the dominant eigenvalues. The dominant
eigenvalues of the matrix are ±8 and eigenvectors (1, ±1, ±1, 1)T.
6. Use the normal iteration for power method and define ks+l = max(vs+Il. Now if Al and
Ai are the roots of A2 - PA - q = 0, then

lim (ks+lks+2Us+2 - pks+lUs+1 - qus) = O.


S--(X;

Hence, ultimately any three consecutive iterates are linearly dependent. Using least
squares, we can determine approximations Ps and qs to P and q. The dominant eigenvalues
of the given matrix are 6 ± 5i and eigenvectors (1, ±i, ±i, _1)T.
9. If Xj is the generalised eigenvector corresponding to Al and p = Al -1), then for 8 >j - 1

(A _ pI)~sx' = ~x. +
J 1)8 J
(8)1 _1_X'~1
1)s+1J
+ (8) _1_X'~2 + ... +
2 1)s+2 J
( 8 ) __I_Xl.
j - 1 1)s+J~1

If 1) « 1 then Xl dominates over the other components. The generalised eigenvector of A is


(1, -0.5, -0.5)T. For B, A = 3+ y'5, X2 = (1, -0.095760608,0.55065526, -0.47266157)T
and A = 3 - y'5, X2 = (0.87063218,0.60764124, -1,0.58043269) T .
12. For Al the eigenvalues and eigenvectors are
1.082799 X 1O~7 (-.001248194, .03560664, -.2406791, .6254604, -.6898072, .2716054)T
:
1.257076 x 1O~5 (.01114432, -.1797328, .6042122, -.4435747, -.4415366, .4591148)T
:
6.157484 x 1O~4 (.06222659, -.4908392, .5354769, .4170377, -.04703402, -.5406816)T
:
1.632152 x 1O~2 (.2403254, -.6976514, -.2313894, .1328632, .3627149, .5027629)T
:
2.423609 x 1O~1 (-.6145448, .2110825, .3658936, .3947068, .3881904, .3706959)T
:
1.618900 x 10° : (.7487192, .4407175, .3206969, .2543114, .2115308, .1814430)T
812 Appendix A. Answers and Hints

A2 : A = -0.29908 : (-0.000029689,0.071652,0.63745, -0.74839, 0.16866)T


A = 0.01521 : (-0.00010723, -0.11924, -0.50939, -0.26257, 0.81078)T
A = 0.41985: (-0.00067247, -0.75284,0.45070,0.37828, 0.29496)T
A = 0.81321 : (0.99999976, -0.00060590,0.00021732,0.00013798,0.00022437) T
A = 1.67828 : (0.00013832,0.64334,0.36201,0.47736, 0.47665)T
Aa : A = 15 : (1, 1, 1, 1) T ; A = 5 : (-1, -1, 1, 1) T ; A = 5 : (-1, 1, -1, 1) T ;
A = -1 : (1, -1, -1, I)T
A4 : A = -1.0965952 : (0.46935807, -0.54221219, -0.54445240, .42586566, .088988503)T
A = 1.3270456 : (.34101304, -0.11643462, -0.019590672, -0.68204303, .63607121)T
A = 4.8489501 : (.54717280, -0.31256992,0.61811208, -0.11560659, -0.45549375)T
A = 7.5137242 : (-0.55096196, -0.70944034,0.34017913,0.083410953, 0.26543568)T
A = 22.406875 : (0.24587794,0.30239604,0.45321452,0.57717715, 0.55638458) T
As : A = -8.0285784 : (-0.26346240, -0.65904072,0.19963353, 0.67557335)T
A = -1.5731907: (-0.68804794,0.62412286,0.25980086, 0.26375027)T
A = 5.6688644 : (0.37870269,0.36241905, -0.53793516, 0.66019881)T
A = 7.9329047 : (0.56014451,0.21163276,0.77670826, 0.19538161)T
A6:A=12:(I,-I,I,I)T; A=I±5i:(I,'fi,'fi,-I)T; A=2:(I,I,-I,I)T
A7:A=I:(4,3,2,I)T; A=2:(3,3,2,I)T; A=3:(2,2,2,I)T; A=4:(I,I,I,I)T
A8 : A = 2: (0,1, _1)T; A = 4,4: (1,0, _1)T (defective matrix)
Ag : A = -1.6963228 : (-0.97392088 + 0.037903363i, -0.52915801 + 0.68432114i, I)T
A = 0.28498644: (1, -0.74948997 + 0.36926493i, 0.32462646 - 0.27958897)T
A = 12.411336 : (0.35427988 - 0.30579544,0.71502968 - 0.38725249, I)T
The left-eigenvectors are:

A6 A = 12 : (1, -1, 1, I)T; A = 1 ± 5i : (1, ±i, ±i, _1)T; A = 2 : (1,1, -1, I)T
A7 A=I:(I,-I,O,O)T; A=2:(-1,2,-1,0)T; A=3:(0,-1,2,-1)T;
A = 4: (0,0,-1,2)T
A = 4,4: (1,1, I)T (defective matrix)

16. For n = 6 see {12}. For n = 10 the eigenvalues are


1.0931538 x 10- 13 ,2.2667467 X 10- 11 ,2.1474388 X 10- 9 , 1.2289677 X 10- 7 ,
4.7296893 X 10- 6 ,0.00012874961, 0.0025308908, 0.035741816, 0.34292955, 1.7519197.
For n = 20: 7.7773771 X 10- 29 , 3.3763048 X 10- 26 , 7.0264420 X 10- 24 ,
9.3311941 X 10- 22 ,8.8800759 X 10- 2 °,6.4467646 X 10- 18 , 3.7109770 X 10- 16 ,
1.7379067 X 10- 14 , 6.7408082 X 10- 13 ,2.1928908 X 10- 11 ,6.0360953 X 10- 10 ,
1.4139548 X 10- 8 , 2.8276521 X 10- 7 , 4.8305100 X 10- 6 , 7.0334315 X 10- 5 ,
0.00086767111, 0.0089611286, 0.075595821, 0.48703841, 1.9071347.
17. The correct eigenvalues are: -9.4634742 x 108 , -9.4634692 X 102 ,0.99989902,
1.0463372 x 103 , 1.0098990 X 10 6 , 1.0463377 X 10 9 , 1.0100000 X 10 12 . The eigenvector of A
corresponding to the third eigenvalue is (0.99994951, -0.000010097471, -0.0099984955,
0.00099985055, 0, -9.9985055 x 10- 6 , 9.9985055 X 1O- 7 )T.

19. - .197092891034046780233653, 9.900494253375477505820646,


10.096595438597929556594178,19.999506574411644593957503,
20.000496623252656100784589, 29.999999172903972235905071,
30.000000828491903698796504, 39.999999999309251557759714,
40.000000000691211415039377, 49.999999999999654337301074,
50.000000000000346008996989, 60.000000000000345662698926,
60.000000000000345893409985, 70.000000000690748442240286,
70.000000000690748442295243, 80.0000008270960277640949286,
80.0000008270960277640949384, 90.0004934255883554060424967605625,
90.0004934255883554060424967619262, 100.09950574662452249417935375412714,
100.09950574662452249417935375412728.
The last two pairs may not be accurate to all figures.

20. A=O, x=(1,1,1)T; A=2, x=(1,-I,1)T.


A .11. Algebraic Eigenvalue Problem 813

21. The eigenvalues of this matrix are Ak = 2[1 - cos(k7r/(n + 1»] for k = 1,2, ... , n. The
kth component of the jth eigenvector is Xkj = sine ~~~). As n ---> 00, Ak(n + 1)2 tends
to a limiting value for first few eigenvalues.
22. The Gaussian elimination method leads to the following transformation

G no 1J G n G 1
0 -2 0
S-IAS = H =* 1 2
-1 :3 0 2)

Using Lanczos method, we get the following transformation

G11 G T) G T) (~1 1 -~)


-2 3 3 6
8
AS=ST =* 1 3"
1) = 19
3 0 0 3"
23. Corresponding to an eigenvalue Al of HI, if Xl is an eigenvector of HI then the eigenvector
of His (Xl, 0). While if A2 is an eigenvalue of H2 with eigenvector X2, then eigenvector
of His ((A2! - Ht}-l AIX2, X2).
25. 32.228892, 20.198989, 12.311077, 6.9615331, 3.5118559, 1.5539887, 0.64350532,
0.28474972, 0.14364652, 0.081227659, 0.049507429, 0.031028061.
29. A = 1,2,3, ... ,20 to 12 significant figures. The matrix is well-conditioned, but the char-
acteristic equation is ill-conditioned.
30. A = 0.99653334, 9.9904242, 99.899860, 1000.1132.
31. ForA, A=O:(-i,i,l,I)T; A=8:(1,1,-1,1)T; A=8: (i,-i,l,l)T;
A = 12 : (1,1,1, _l)T.
For matrix B
A = 6.7008755 - 7.8759890i : (-.24990158 - .20661306i, 1, -.21122723 + .092245190i)T
A = -7.4775303 + 6.8803213i: (-.24259230 + .19803185i, 1, -.20658148 - .11037889i)T
A = 39.776655 + 42.995668i: (.42108067 + .011575612i, 1, .61791802 + .0055785778i)T

32. For A, A = 10 + wlO- 4 and x = (l,wlO-4,w21O-8,w31O-12)T, where w = ±1, ±i


are the fourth root of unity. The Matrix B has a complex conjugate pair of eigenval-
ues with quadratic divisors. For A = 1.5 ± V12.75 i, x = (184 'f 230V12.75 i,507 'f
52V12.75 i, 295.5 'f 163V12.75 i, 411 'f 166V12.75 i, 213.5 'f 223V12.75 i) T. For A = -1
which has a linear divisor, x = (13,22,19,16, 28)T.
33. A = 0.43278721 : (-0.85236472,0.38818067,1, -0.69324896, 0.26264939)T;
A = 0.66366275 : (-0.45357708, -0.83773233,0.64877028,1, -0.0l9482857)T;
A = 0.94385900 : (1, -0.82933173,0.39037639, -0.71706696, 0.46412747)T;
A = 1.1092845 : (0.99713530,1,0.84958338,0.88141454, 0.054010743)T;
A = 1.4923532 : (-0.26391810,0.059074043, -0.23032770, 0.29729812, l)T.

34. For (3 = 0.1 the eigenvalues and eigenvectors are


-.00038428853 + 1.56009825i : (-.66549147 + .27242335i, 1, .075248555 - .038336771i)T
-.00038428853 - 1.56009825i : (.66621835 - .27237094i, 1, .074818547 - .03926141Oi)T
-.0046157115 + .20264477i : (1, .20925224 + .091280725i, .0087296851 + .45753738i)T
-.0046157115 - .20264477i : (1, -.21173434 - .081020781i, -.029462102 - .45488951i)T
For (3 = -0.1 the eigenvalues and eigenvectors are
0.21427174: (1,0.12350095 - 0.20392028i, 0.47994249 + 0.021540805i)T
-0.22510492: (1, -0.13142482 + 0.21304753i, -0.50386486 - 0.024901057i)T
.00041659046 + 1.43987585i : (-.65969097 + .40291472i, 1, -.080421994 + .050294755i)T
.00041659046 - 1.43987585i : (.65866750 - .40273989i, 1, -.079621050 + .051432794i)T

35. There are only two real eigenvalues, A = -8.5927853, x = (1,0,0.14215192, 1)T, and
A = -4.6241941, x =
(1,0.15880859,0, _1)T
814 Appendix A. Answers and Hints

36. For A the eigenvalues are>. = -1, ±i, (-1 ± i)/V2, (1 ± i)/V2 and the eigenvectors
corresponding to >. are
>.-6(1 + >. + >.2 + >.3 + >.4 + >.s + >.6)
>.-5(1 + >. + >.2 + >.3 + >.4 + >.S)
>.-4(1 + >. + >.2 + >.3 + >.4)
>. -3(1 + >. + >.2 + >.3)
>. -2(1 + >. + >.2)
>.-1(1 + >.)
1

For B the eigenvalues are 35.613861, 23.037532, 14.629782, 8.7537796, 4.7673065,


2.2989112, 1.0000000, 0.43498852, 0.20976205, 0.11423637, 0.068353718, 0.043407427,
0.028078955.

A.12 Ordinary Differential Equations


1. Using variables YI = Y, Y2 = y', Y3 = y", Y4 = z, Ys = z', the system is

y~ = Y2, y~ = Y3, y~ = t 3 sin YI cos Y4 - Y3YS - Y2YIyl cos t,


eY1 Y4 sin t _ y2 y 3 y 3 / 2 _ y3 y 2Y2
y~ = Ys,
' _
YS I 4 S I 4
-
Y3

2. The result follows from the Jordan canonical form. For a Jordan submatrix of order m
the equations are
dx;
- dt = + X'+I
>.x'" ,
(i=I, ... ,m-l), dXm = >,xm
dt '
which gives the required solution in terms of generalised eigenvectors Xi.
4. = ce t + t 3 + 3t 2 + 5t + 5, while that
The general solution of the differential equation is y
of the difference equation is

Yj = c(1 + h)j + (jh)3 + h 2 (3j2 + 3j + 1) + h(5j + 6) + 5.


5. The general solution of the difference equations is

Yj = c(1 + 2h)l, + (jh) 3+1 22 1 1


'2h (3j + 3j + 1) + '2h(j + 3) + 4: .

This numerical integration method is not consistent.


6. Solution is nonunique as the Lipschitz condition is not satisfied. The Lipschitz condition
requires that, there exist a constant L, such that for any two numbers Y and y*

If(x,y) - f(x,Y*)1 ::; Lly - Y*I·

1. This method is absolutely unstable everywhere except for Ih>'1 < 1 on the imaginary axis,
where both roots of the characteristic equation are on the unit circle.
S. The exact solutions for the differential equations are
1 1 2
(i) Y = t' (ii) Y = 2et _ t - 1 ' (iii) y = - - - 1,
cos 2 t
(iv) Y = 2e- t2 + t 2 - 1, (v) Y = (1 + t) (1 + 31n(1 + t))1/3, (vi) Y = t + t In t.
10. The result follows by noting that as h --> 0, 0<0 --> 1 and Yi --> Yoe ihA . For the coefficient Co
the first column in the numerator tends to YO times the first column of denominator. For
Ci (i > 0), the first column of numerator becomes proportional to the (i + l)th column.
A .12. Ordinary Differential Equations 815

11. For h = 0, the coefficients of polynomial are


12b-l - 12b2 24 - 60b-l + 12b2
do = 1, dl - - - - , - - - - d2 - - - - - : - - - - -
- 48b-l - 16 ' - 48b-l - 16 '
which confirms the conditions given in Example 12.2.
13. For the linear equation Y' = )..y, the fourth-order Adams-Bashforth-Moulton formula
gives
h)" h 2 )..2
Yj+l = Yj + - (28Yj - 5Yj-l + Yj-2) + - - (55Yj - 59Yj-l + 37Yj-2 - 9Yj-3).
24 64
This difference equation is relatively stable for h)" > -0.61 and absolutely stable for
-1.28 < h)" ::; O. Similarly, Hamming's formula gives the difference equation
1 3h)" h 2 )..2
Yj+l = 8 (9Yj - Yj-2) +8 (2Yj - Yj-l + Yj-3) + -2- (2Yj - Yj-l + 2Yj-2),
which is relatively stable for h)" > -0.26 and absolutely stable for -0.5 < h)" ::; O.
14. For Hamming's, Adams and Milne's (Simpson's rule) correctors CO/f = 4/3, 1 and 1/2,
respectively. For b-l = 0.35 and b2 = 1/12, it is possible to have a stable method for
which CO/f = 1/2. The exact solution of the differential equation is Y = tan- 1 t.
15. For b-l ~ 0.383, b2 ~ 0.094 the formula is relatively stable for h)" > -1.416. But much
before this h)", the formula is not accurate to any reasonable degree.
11. Such formulae can be obtained by setting b-l = 0 in the expression for the coefficients.
A reasonable choice could be b2 = O. but in any case, the resulting explicit formula is
highly unstable with one of the extraneous roots of the order of 10. By changing b2, this
root can be decreased to approximately three, but even that is too large.
19. If yj+l and yC +1 are respectively, the predicted and the corrected values, then the re-
sulting numerical integration formula is
19 ( C P) 251 C 19 P
+ 270 Yj+l
C
Yj+l = Yj+l - 270 Yj+l - Yj+l = 270 Yj+l .

For the linear equation Y' = )..y, this formula for Yj+l can be written as

251 Yj
-
+ ~~ (19Yj - 5Yj-l
3
+ Yj-2) + -19 (
Yj + -h)" (55YJ - 59YJ-l + 37Yj-2 - 9Yj-3)
)
270 1 - ah)" 270 24
This difference equation is relatively stable for h)" > -0.65 and absolutely stable for
-1.53 < h)" ::; O. Similar difference equations can be written for Hamming's method also,
which is relatively stable for h)" > -0.48 and absolutely stable for -1.29 < h)" ::; O.
20. The maximum order of four is achieved by the Simpson's rule or the Milne's corrector,
whose stability is analysed in Example 12.1. For third-order formula, the coefficients are

ao = -4 + 12b_ 1 , bo=4-8b-l , bl = 2 - 5b-l .


For h = 0 the formula is relatively stable when 1/3 ::; b-l < 1/2. The best stability
property is achieved when b-l = 5/12, which gives the Adams-Moulton formula . This
formula is relatively stable when h)" 2 -1.5 and absolutely stable when -6 ::; h)" ::; O.
It can be shown that both roots of the characteristic equation are real. Hence, it is
tempting to claim that the transition to relative instability takes place when Ql = -Qo,
which gives h)" = -(3b-l - 1)/(1 - 2b-J). However, for b- 1 close to 1/2, the roots show
avoided crossing (Figure A.l), where the roots are never equal, but as h)" decreases the
larger root is not the one which is close to e h >'. For b-l = 1/2, the roots actually cross
at h)" = O. In such cases, it is difficult to identify the region of relative stability. On real
axis the methods are absolutely stable for -(12b-l - 4)/(1 - 2b-J) < h)" ::; O.
23. There are only three independent fourth-order formulae of the form

aoyo + alYl + a2Y2 + a3Y3 + h(boyb + bIY~ + b2Y~ + b3Y~) = O.


816 Appendix A. Answers and Hints

-1
-2 -1 o
hA

Figure 1.6: Roots of characteristic equation as a function of )"h for b_ 1 = 0.45.

Equation (12.54) provide three such formulae and any other formula of this type will
be a linear combination of these. Since these formulae have been obtained by effectively
interpolating with 4th degree polynomial the fifth derivative will vanish. The truncation
error can be estimated by performing one step of predictor-corrector method, using these
starting values.
29. Considering the linear equation y' = )..y, the difference equation is

16 (. h)" h 2 )..2 h 3 )..3 h4)..4)2


(
Yn+1 = Yn 15 1 + :2 + 222 + 236 + 2424

1 ( 2
h )..2 h )..3 h4)..4))3
- - 1+ h)" +- -+- -+- -
15 2 6 24
h2)..2 h 3 )..3 h4)..4 h5)..5 h 6 )..6 h7)..7 h 8 )..8)
= Yn
(
1 + h)" + -2- + -6- + 24 + 120 + 864 + 8640 + 138240 .
This formula is absolutely stable for -6.46 < h)" s:; 0, where h is the length of the
double-step.
30. Considering the linear equation y' = )..y, the difference equation is
_ (1 + (1 - h)" h)..('Ylfh2 + )'21311 - )'11312 - ),21321)))
Yn+l - Yn 1 _ h)..(1311 + 1322) + h 2 )..2(13111322 - 1312132Il .
On real axis this method is absolutely stable for -00 < h)" s:; O.
31. y(1) = 1.306448.
33. The formulae can be easily derived using the method of undetermined coefficients. The
corrector formula is a single-step method and is always relatively stable. It is absolutely
stable for -00 < h)" s:; 0 on the real axis. It is interesting to note that for the linear
equation y' = )..y, the difference equation is identical to that for the implicit Runge-
Kutta method considered in {30}.
34. The first three equations have singularities and it will be better if analytic solution in the
neighbourhood of singular point is used to start the integration. The analytic solutions

f
are
(i) In(x) = (_1)k(x/2)n+2k ,
k=O k!(n+k)!
A .12. Ordinary Differential Equations 817

(ii) Pe(x) = 1 - W + 1) (1 _ x) + £(£ + 1) (£(£ + 1) - 2)(1 _ x)2


2 16
£(£+ 1) 3
- ~(£(£ + 1) - 2)(£(£ + 1) - 6)(1 - x) + ... ,
13 5 59 6 169 7
(m) y(t) = t - - t + - t - - t + .. .
. .. 4
8 48 288
The exact solutions are Jo(l) = 0.7651977, Jo(10) = -0.2459358,
h(l) = 0.4400506, h(lO) = 0.04347275 and
1 1
P3(X) =- (5x 3 - 3x), P6(X) =- (231x 6 - 315x 4 + 105x 2 - 5),
2 16
1
P7(X) = - (429x 7 - 693x 5 + 315x 3 - 35x),
16
PlO(X) = _1_ (46189x lO - 109395x 8 + 90090x 6 - 30030x 4 + 3465x 2 - 63) .
256
For (iii), y(l) = 0.1749676 and y(lO) = -0.2035107. For y(O) = -1, the problem in (iv)
is unstable (with exact solution of y = _e- 5t ) and should be preferably reformulated as
in (v). For y(O) = 1, the exact solution of (iv) is Y = 2e 10t - e - 5t .
For (vi), y = e- t2 / 2 - e- t+ 1, and for (vii) Yl = cos t, Y2 = sin t.
35. The exact solution O(t) = 0 is unstable for a = 1.
36. The exact zero of Jo(x) and h(x) are 2.40482555770 and 3.83170597021, respectively.
37. For n = 0, 1 and 5, the exact solutions are 0 = l-r 2 /6, 0 = sin r/r and 0 = 1/ VI + r 2 /3,
respectively. The radius rn (i.e., the first zero of 0) is
n: 0 1.5 2 3 4 5
zero(rn) : 2.4494 3.14159 3.65375 4.35287 6.89685 14.97154 00
-r;,O' : 4.8988 3.14159 2.71406 2.41105 2.01824 1. 79723 1. 73205

39. The exact solution is


- - (-a
== - aln2 e 2 t - e -alt) ,
al - a2
ala2e- a1t ala2e- a2t ala2 e - a3t
n3 = -,--------,--,----,- + + -,------...,...,-----,-
(a2 - at)(a3 - at) (al - a2)(a3 - a2) (a3 - al)(a3 - a2)

40. For a = 1, y is minimum at x;::; -0.551, where y ;::; 0.2618 and z ;::; -0.26. For a = 100,
Y = 0 and z = 0.05879574 at x = 0.3956863. For a = 105 , Y = 0 and z = 0.001756217 at
x = 0.4618568. For a = 10 10 , Y = 0 and z = 5.548 X 10- 6 at x = 0.4649234.
41. The equation is not stiff for .A > 0, but the problem is ill-conditioned. However, the
extraneous solution may fall off rapidly for some of the stiffly stable methods, when the
step size is large. For y(O) = 1 the exact solution is y = t 3 + eAt.
42. The first two components are initially decreasing steeply, but ultimately start increasing.
The increase may be suppressed by a stiffly stable method which is also absolutely stable
for large portion of positive half of complex plane. The exact solution is
Yl = exp(10 4 (2e- t +t - 2)) sin(10 4 (1 - (1 + t)e- t )), Y3 = 1- 2e- t ,

Y2 = exp(10 4 (2e- t +t - 2)) cos (10 4 (1 - (1 + t)e- t )), Y4 = te- t .


43. For f(r) = -2/r 3 the particles can fall into each other at r = 0 and numerical solution
may break down due to singularity.
45. The exact solution for (i) is y = «2 - cos 1)/ sin 1) sinx + cosX.
48. u(O) = 0.004523035.
49. The solution close to t = 0 is

( t) {3 + {3 t2 + k{3 t4 + (3k - lO{3)kf3 t6 + ...


y = 6a({3 + k) 120a 2 ({3 + k)3 15120a3 ({3 + k)5
818 Appendix A. Answers and Hints

where (3 = y(O). For ex = 0.1, {3 = 0.02279135, while for ex = 0.01, {3 = 1.126702 X 10- 11
and for ex = 0.001, {3 = 6.2615 X 10- 41 .
50. (i) A = 1r 2 , (ii) A = 1, (iii) A = 6.77387.
51. Boundary condition at r = 0 is y' = O. A = 5.783186.
52. g-modes (glO,.'" g1): 0.00794, 0.00942, 0.01138, 0.01401, 0.G1769, 0.02304, 0.03126,
0.04484, 0.06960, 0.1209, f-mode (no nodes): 0.2012, p-modes (PI, ... ,PlO): 0.3756,
0.6576, 1.020, 1.462, 1.983, 2.581, 3.257, 4.010, 4.842, 5.749.
54. A = -0.110248816992 and 9.04773925981. n = 4 is enough for 10- 6 accuracy.
55. If c;n) is the coefficient of x j in T;::(x), then

(2j -1)aj-l + 2jaj = TC;nJ1 , (j= 1,2, ... ,n); (2n + l)an = TC~n).
This equations along with the initial condition ao = 1 provide n + 2 equations in n + 2
unknowns. The exact solution is y = 1/ vr+t.
56. If c;n) is the coefficient of x j in Tn(x), then

(2j+2)(2j+l)a2j+2 -(2j(2j-2)-A)a2j -Aa2j-2 = T1C~~n) +T2C~~n+2), (j = 0, ... , n+l)


with a-2 = a2n+2 = a2n+4 = O. For n = 3 we get A = 3.5696 as compared to the exact
value of 3.5592. For antisymmetric case A = 12.156.

A.13 Integral Equations


1. Use

dn - 1 y
--1 =
dt n -
l0
X
f(t) dt+an-1,

to obtain the required equation

f(x)
r ( 2:>j(x)
+ in
n (t)j-1)
~ :--1)' f(t) dt = g(x) -
n
~Pi(X) ~an-j (~_
i i-j
')' .
o j=l J. i=l j=l t J.

2. (i) f(x) = A + A foX f(t) dt, (ii) f(x) + 1 + foX G+ t) x - f(t) dt = o.


CoS( 1(1 - 2x)) )
4. (i) f(x) = 24 ( 2 1 - 1 -12x(1 - x), (ii) f(x) = eX, (iii) f(x) = x.
cos '2
5. frO) = 1.919032, f(±0.25) = 1.899615, f(±0.5) = 1.842385,
f(±0.75) = 1.751955, f(±I) = 1.639695.
6. (i) f(x) = eX, (ii) f(x) = x.
9. The weights are given by

(i = 2, ... , N - 1).

The exact solution of the integral equation is f(x) = (1 - x 2 )3/4.


10. The kernel has an eigenvalue equal to unity, for which the eigenfunction is sin(1rx). Con-
sequently, the general solution is f(x) = x + a sin( 1rx), where a is an arbitrary constant.
13. (i) 0.2724430, 0.08127033, 0.03864755; (ii) 0.6916603, 0.1312712, 0.05341381.
A.14. Partial Differential Equations 819

16. A = 1/2.
17. It is convenient to use a special quadrature formula over the interval [x - h,x + h] to
account for the singularity. This formula is given by

j h
-h
2h h
f(t)lnltldt=--(8-6Inh)f(0)--(1-3Inh)
9 9
( f(-h)+f(h) ) ,

r
Jo
f(t) In It I dt = - ~ (3 -
4
21n h)f(O) - ~(1 -
4
2Inh)f(h).

The required eigenvalues are -1. 7642, -1.5660, -0.7883. Note that because of singularity
in the kernel the eigenvalues are not decreasing rapidly to the limit point at A = O.
19. The first problem is well-conditioned except close to a = 2, where the solution does not
exist. The quadrature formula in {17} may be used here. The exact solutions are
. 1
(z) f(x) = I (1 ) ~' (ii) f(x) = x.
n 2a ya- - x-

22. f(x) = cos x.


24. f(x) = In(l/x).
1
25. (i) f(x) = -(sinx + sinhx),
2

27. The first equation is ill-conditioned for A > 0 just like that in Example 13.6. For A < 0
the methods (ii) and (iv) are unstable, while for Ah « -1 all methods except (i) are
unstable.
28. f(x) = e- x - e 12x .

29. The Simpson's rule leads to instability in both cases.


30. (i) f(x) = cosx - 2sinx, (ii) f(x) = 2x - x 2 .
31. f(x) = 1.

32. (i) f(x) = x or f(x) = 5x - 4, (ii) f(x) = e x2 .


33. x = 0.00 x=0.25 x=0.50 x=0.75 x = 1.0
A = 0.50 1.000000 1.129653 1.187735 1.224933 1.251260
I\. = 0.95 1.000000 1.400740 1.671788 1.891785 2.077124
A = 1.00 1.000000 1.547326 2.012779 2.463460 2.907810
34. x=O x = 0.25 x = 0.50 x = 0.75 x = 1.00
h(x) 1.0000 1.1278 1.1796 1.2082 1.2262
fz(x) 0.0000 0.0580 0.2272 0.3951 0.5275

A.14 Partial Differential Equations


1. The exact solution is

u(x t) - ~ 4(-I)j sin«2J' + l)x) e-(2j+l)2 t .


, - f;:a rr(2j + 1)2

2. The exact solution is


00 4
T(X,T) = 1- { ; rr(2j + 1) sin«2j + I)X) e-(2]+1)2 T •

3. The amplification factors are

~(m)=-w±vw2+1, and
820 Appendix A. Answers and Hints

where r = 2a~t/(~x)2 and w = 2rsin 2 (mh/2).


4.
4 6 3
a2
a k( -1 - 0) - -a h 2) + -a u (a
e(u) = -a U (2 - k 2(1- 30) - -ah 4 - -Okh 2)
ax4 2 12 ax 6 6 360 12
8 a2
+ -a u8 (a-4k 3 (1 - 40) - -Okh 4 a
- -Ok 2 2
h - --a h .
3 6)
ax 24 360 24 20160

5. The amplification factor


5 + cos mh - 12a dt sin 2 mh
~(m) = ~ 2
5 + cosmh + 12a (dx)2
dt sin2 mh
2

8. The equation becomes e


2
x
au
at = ax 2 '
a2 u
e- 4t lnx
9. The exact solution is u(x , t) = .
In2
10. The exact solution is u(x, t) = tan(x + t).
11. The exact solution is u(x, t) = J
x + sin 2 t. This problem is ill-conditioned when sin 2t >
0, since the coefficient of second derivative is negative. Hence, in this range the solution
is unstable to perturbations at small length scales.
12. The exact solution is u(x, t) = 2/(1 + ex-t).
13. The amplification matrix for the explicit difference scheme is

4a~t . 2 m~x
w= - - s m - -
(~x)2 2
The Crank-Nicolson method is always stable with the amplification matrix
l_w2 / 4

G(~t , m) = ( 1+~~ /4
1+w 2 /4

14. The exact solution is u(x, y, t) = e- 2rr2t sin(-rrx) sin(1rY).


15. Eliminating u;t 1/ 2 between the two equations, we get

where r = ak/(2h 2 ). Using Taylor series and neglecting higher order terms, the truncation
error is given by

16. The result follows from


2u jl
n+l/2 _
-
(1 -
<2) n+l
roy u jl + (1 + roy
<2) ujl
n
.

2
11. The exact solution is u(x, y , z , t) = e- 3rr t sin(1rx) sin(1rY) sin(1rz) .
19. The amplification factors are
~t . ~t .
= 1 + _(e'm = 1 + - ( 1 - e-,m ),
h h
(i) ~(m) - 1), (ii) ~(m)
~x ~x
~t 1
(iii) ~(m) = 1 + i - sin mh, (iv)~(m)= 1+ dt(l_eimh)
~x
dx
A .14. Partial Differential Equations 821

The schemes (ii) and (iii) are always unstable, while (i) is stable if At :-::; A x , (iv) is
always stable. Here it is assumed that At/Ax> O.
20. The exact solution is u(x, t) = 1, while the amplification factor is

~(m) = i (~:) sinmh± ';1- (At / Ax)2 s in 2 mh.


The difference method is stable when At :-::; Ax.
21. The exact solution is u(x, t) = f(x + t).
22. The exact solution is u(x, t) = f(xe- t / 4 ).
23. The exact solution is
I, ifx<t;
u(x, t) = { ~=~, if t :-::; x :-::; 1; for t < 1;
0, if x> 1;

1, if x> ttl ;
u(x, t) = { for t 2 1.
0, if x < ttl;
24. The amplification matrix is

G(At,m) = (1 ia)
ta 1 '
cAt
a= -sinmh.
Ax

25. The amplification factor is
$$\xi(m) = \frac{1 + 2w\theta - w \pm \sqrt{w^2 - 2w(1 + 2w\theta)}}{1 + 2w\theta}.$$
It can be shown that Re(ξ(m)) < 1, and the stability criterion is … for 0 ≤ θ < 1/4 and … for 1/4 ≤ θ ≤ 1.

26. The amplification matrix is
$$G(\Delta t, m) = \begin{pmatrix} 1 - 2r^2\sin^2(mh/2) & ir\sin mh \\ ir\sin mh & 1 - 2r^2\sin^2(mh/2) \end{pmatrix}, \qquad r = \frac{c\,\Delta t}{\Delta x}.$$

27. The exact solutions are u(x, t) = sin(π(x - t)) and u(x, t) = e^{x-t} sin(2(x - t)).
29. The exact solution is u(x, t) = 1/(1 + xt).
31. The three lowest eigenvalues are -2π^2, -5π^2 and -5π^2, with eigenfunctions
sin(πx) sin(πy), sin(2πx) sin(πy) and sin(πx) sin(2πy), respectively.
32. Noting that 0 ≤ |μ_i| ≤ ρ(B_J), the maximum λ_i = ρ(B_ω) occurs when either μ_i = 0 or
μ_i = ρ(B_J):
… for ω_b ≤ ω ≤ 2;
… for 0 ≤ ω < ω_b.
The optimum value ω_b can be obtained by equating the two estimates of ρ(B_ω).
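For reference, the classical SOR expressions for a consistently ordered matrix (supplied here as an assumption for the two estimates, not recovered verbatim from the text) are
$$\rho(B_\omega) = \begin{cases} \omega - 1, & \omega_b \le \omega \le 2,\\[4pt] \left[\dfrac{\omega\,\rho(B_J)}{2} + \sqrt{\dfrac{\omega^2\rho(B_J)^2}{4} - \omega + 1}\,\right]^2, & 0 \le \omega < \omega_b, \end{cases} \qquad \omega_b = \frac{2}{1 + \sqrt{1 - \rho(B_J)^2}}.$$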
33. The optimum value of ω is the same as that in Example 14.4.
34. The exact solution in all cases is u(x, y) = e^{x+y}.
35. The exact solution is u(x, y) = -x^3 - 3x^2 y + 3x y^2 + y^3.
36. The exact solution is u(x, y) = x^2 - y^2.

37. The exact solution is
$$u(x,y) = \left(1 - \frac{x^2}{a^2} - \frac{y^2}{b^2}\right)\frac{a^2 b^2}{a^2 + b^2}.$$

38. u(0, -0.5) ≈ 0.575, u(0, -0.75) ≈ 0.260, u(0.25, -0.5) ≈ 0.511, u(0.25, -0.75) ≈ 0.233,
u(0.5, -0.5) ≈ 0.344, u(0.5, -0.75) ≈ 0.166, u(0.75, -0.75) ≈ 0.084.
40. The truncation error is
$$e(u) = -\frac{h^2}{12}\left(\frac{\partial^4 u}{\partial x^4} + 2\frac{\partial^4 u}{\partial x^2\partial y^2} + \frac{\partial^4 u}{\partial y^4}\right)
      - \frac{h^4}{360}\left(\frac{\partial^6 u}{\partial x^6} + \frac{\partial^6 u}{\partial y^6} + 5\frac{\partial^6 u}{\partial x^4\partial y^2} + 5\frac{\partial^6 u}{\partial x^2\partial y^4}\right).$$
To achieve higher order accuracy for Poisson's equation ∇^2 u = f, we can add the following term on the right-hand side of the nine-point formula:
$$\frac{h^2}{2}\bigl(f(x_{j+1},y_l) + f(x_{j-1},y_l) + f(x_j,y_{l+1}) + f(x_j,y_{l-1}) - 4f(x_j,y_l)\bigr).$$
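For clarity (a standard observation, not quoted from the text, and assuming the nine-point formula is written with 6h^2 f on the right-hand side): for Poisson's equation ∇^2 u = f one has ∇^4 u = ∇^2 f, and
$$\frac{h^2}{2}\bigl(f(x_{j+1},y_l) + f(x_{j-1},y_l) + f(x_j,y_{l+1}) + f(x_j,y_{l-1}) - 4f(x_j,y_l)\bigr) = \frac{h^4}{2}\,\nabla^2 f + O(h^6),$$
which cancels the leading term of the truncation error above.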

42. The exact solution is u(x, y) = e^{x+y}.
43. The exact solution is u(x, y) = x^2 - y^2.
44. The exact solution is u(x, y) = x^3 - y^3 + xy.
45. The exact solution is u(x, y) = x^5 + y^5 - 5(x^2 y^3 + x^3 y^2).
46. The exact solution is u(x, y, z) = x^2 + 2y^2 - 3z^2.
47. The exact solution is u(x, y, z) = 126/(x + 2y + 4z + 1)^2.
48. Taking nodes in the order (0,0), (1,0), (0,1), (1/3,0), (2/3,0), (2/3,1/3), (1/3,2/3),
(0,2/3), (0,1/3), (1/3,1/3), the basis functions are
φ_1(ξ,η) = 1 - 5.5ξ - 5.5η + 9ξ^2 + 18ξη + 9η^2 - 4.5ξ^3 - 13.5ξ^2η - 13.5ξη^2 - 4.5η^3,
φ_2(ξ,η) = ξ - 4.5ξ^2 + 4.5ξ^3,
φ_3(ξ,η) = η - 4.5η^2 + 4.5η^3,
φ_4(ξ,η) = 9ξ - 22.5ξ^2 - 22.5ξη + 13.5ξ^3 + 27ξ^2η + 13.5ξη^2,
φ_5(ξ,η) = -4.5ξ + 18ξ^2 + 4.5ξη - 13.5ξ^3 - 13.5ξ^2η,
φ_6(ξ,η) = -4.5ξη + 13.5ξ^2η,
φ_7(ξ,η) = -4.5ξη + 13.5ξη^2,
φ_8(ξ,η) = -4.5η + 4.5ξη + 18η^2 - 13.5ξη^2 - 13.5η^3,
φ_9(ξ,η) = 9η - 22.5ξη - 22.5η^2 + 13.5ξ^2η + 27ξη^2 + 13.5η^3,
φ_{10}(ξ,η) = 27ξη - 27ξ^2η - 27ξη^2.
49. Taking nodes in the order (0,0), (1,0), (1,1), (0,1), (1/2,0), (1,1/2), (1/2,1), (0,1/2),
(1/2,1/2), the basis functions are
φ_1(ξ,η) = 1 - 3ξ - 3η + 2ξ^2 + 9ξη + 2η^2 - 6ξ^2η - 6ξη^2 + 4ξ^2η^2,
φ_2(ξ,η) = -ξ + 2ξ^2 + 3ξη - 6ξ^2η - 2ξη^2 + 4ξ^2η^2,
φ_3(ξ,η) = ξη - 2ξ^2η - 2ξη^2 + 4ξ^2η^2,
φ_4(ξ,η) = -η + 3ξη + 2η^2 - 2ξ^2η - 6ξη^2 + 4ξ^2η^2,
φ_5(ξ,η) = 4ξ - 4ξ^2 - 12ξη + 12ξ^2η + 8ξη^2 - 8ξ^2η^2,
φ_6(ξ,η) = -4ξη + 8ξ^2η + 4ξη^2 - 8ξ^2η^2,
φ_7(ξ,η) = -4ξη + 4ξ^2η + 8ξη^2 - 8ξ^2η^2,
φ_8(ξ,η) = 4η - 12ξη - 4η^2 + 8ξ^2η + 12ξη^2 - 8ξ^2η^2,
φ_9(ξ,η) = 16ξη - 16ξ^2η - 16ξη^2 + 16ξ^2η^2.
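The nine functions of answer 49 are tensor products of the one-dimensional quadratic Lagrange polynomials with nodes 0, 1/2, 1, so each should equal 1 at its own node, vanish at the other eight, and the nine should sum to 1 everywhere. The following C sketch checks this; it is an illustrative test, not one of the book's programs, and the factored form is used only as a compact equivalent of the expanded polynomials above:

#include <math.h>
#include <stdio.h>

/* One-dimensional quadratic Lagrange polynomials on [0,1], nodes 0, 1/2, 1 */
static double q0(double s) { return (1.0 - s) * (1.0 - 2.0 * s); }
static double q1(double s) { return s * (2.0 * s - 1.0); }
static double qm(double s) { return 4.0 * s * (1.0 - s); }

/* phi_i(xi,eta), i = 0..8, in the node order of answer 49 */
static double phi(int i, double xi, double eta)
{
    switch (i) {
    case 0: return q0(xi) * q0(eta);   /* node (0,0)     */
    case 1: return q1(xi) * q0(eta);   /* node (1,0)     */
    case 2: return q1(xi) * q1(eta);   /* node (1,1)     */
    case 3: return q0(xi) * q1(eta);   /* node (0,1)     */
    case 4: return qm(xi) * q0(eta);   /* node (1/2,0)   */
    case 5: return q1(xi) * qm(eta);   /* node (1,1/2)   */
    case 6: return qm(xi) * q1(eta);   /* node (1/2,1)   */
    case 7: return q0(xi) * qm(eta);   /* node (0,1/2)   */
    default: return qm(xi) * qm(eta);  /* node (1/2,1/2) */
    }
}

int main(void)
{
    const double node[9][2] = { {0,0}, {1,0}, {1,1}, {0,1}, {0.5,0},
                                {1,0.5}, {0.5,1}, {0,0.5}, {0.5,0.5} };
    int i, j;
    /* phi_i at node_j should be the Kronecker delta, and the sum should be 1 */
    for (j = 0; j < 9; ++j) {
        double sum = 0.0;
        for (i = 0; i < 9; ++i) sum += phi(i, node[j][0], node[j][1]);
        printf("node %d: phi_%d = %g, sum of all phi = %g\n",
               j + 1, j + 1, phi(j, node[j][0], node[j][1]), sum);
    }
    return 0;
}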

Bibliography
Antia, H. M. (1993): Rational Function Approximations for Fermi-Dirac Integrals, Astrophys.
J. Sup., 84, 101.
Wilkinson, J. H. (1988): The Algebraic Eigenvalue Problem, Clarendon Press, Oxford.
Index

1.06, 26, 33, 36, 81, 778 rate of convergence, 755


alternating direction method, 726-728, 753,
A-stable, 611, 612 999, 1139
Aarts, E. H. L., 391 boundary condition, 726, 767
Abramowitz, M., 50, 166, 168, 196, 206, stability, 727
232, 485 three space variables, 727
abscissa, 113, 177, 195, 201, 202, 208, 242, truncation error, 767, 820
243, 662, 988, 1128 alternating variables method, 363
nearest, 121, 875-877, 1021, 1022 Althaus, G. W., 60
absolute convergence, 793 always convergent methods, 272, 290-294,
absolute error, 23, 224, 315, 598 304, 308, 314, 347, 357
absolute stability, 581, 604, 605, 607, 615,
amplification factor, 710-712, 725, 727, 730,
624
731, 735, 739, 767, 768, 819, 820
integral equation, 692
amplification matrix, 732, 733, 768, 820,
accumulation of double length numbers, 36,
821
39, 42, 71, 81, 83, 556
analytic function
Acton, F. S., 325, 327, 499
interpolation, 152
ADAMS, 617, 974-976, 1115, 1116
zeros, 301-304, 916
Adams method, 590, 593, 594, 596, 613,
Anderson, E., 59
649, 724, 974, 976, 1115, 1116
annealing, 390
roundoff error, 649
stability, 585, 615, 815 Antia, H. M., 521, 810
Adams-Bashforth formula, 589 approximation, 111, 425-522, 804-810,
Adams-Moulton formula, 581, 590, 591 933-966, 1074-1107
stability, 585 Chebyshev, 428, 485-493, 951, 1092
adaptive integration, 217, 221, 225-228, chi-square, 433, 434
237, 592, 651, 671, 898, 1043 choice of model, 431
global, 226 degenerate, 498, 501
local, 226 differentiation, 441, 446, 456, 507,
ADI, 755, 996, 1002, 1136, 1143 509, 512, 934, 935, 937, 938,
ADM, 998, 1139 1075, 1077-1079
ADPINT, 218, 225, 226, 679, 898-900, 931, error in, 432, 454, 480, 489, 509
932,990, 1043-1045, 1072, 1073, Fourier, 431, 462
1130 ill-conditioning, 435, 439, 498, 509
advanced computer arithmetic, 32 in n dimensions, 458, 936-938, 1077-
AEPS, 866, 1011 1079
Ahlberg, J. H., 131 in two dimensions, 457-459, 935,
Aitken's method, 191, 212 , 278, 336, 531, 1076, 1077
568 interpolation, 111, 427, 669
Alefeld, G., 31 L1, 429, 508-510, 965, 1105, 1106
algebraic manipulation program, 159, 761 least squares, 428, 432-458, 933-947,
aliasing, 460, 461, 471, 474 1074-1087
Chebyshev polynomial, 487 linear, 431, 432-449, 503, 508, 933-
alternating direction implicit iterative 947, 965, 1074-1087, 1105, 1106
(ADI) method, 753-755, 1003, minimax, 428, 494-507, 521, 522, 952-
1143 965, 1093-1105

Monte Carlo simulation, 454, 509 in n dimensions, 151, 882-884, 943,


multivariate, 457-459, 516, 522 945, 1027-1029, 1084, 1086
nonlinear, 431, 451-457, 505, 947, in two dimensions, 150, 151, 457, 880,
1087 881, 901, 941, 942, 1025, 1026,
of data, 426, 432-457, 503-510, 933- 1082, 1083
947, 1074-1087 integral equation, 671, 680, 991, 1131
of mathematical function, 426, 479- integration, 140, 238, 886, 901, 902,
503, 952-965, 1093-1105 1031, 1046, 1047
Pade, 479-484, 950, 1091 interpolation, 138, 150, 877, 880-882,
rational function, 479-506, 950-965, 1022, 1025-1027
1091-1105 inverse problem, 683, 991, 1131
roundoff error, 437, 443, 444, 454, 499, least squares approximation, 445, 939,
502, 518, 519 941-945, 1080, 1082-1086
smoothing, 427, 440, 445, 447, 456, linear, 139
463, 507, 509 multiple integration, 238, 901, 1046
Arc sin parabolic, 157
see sin⁻¹ quadrature, 140, 886, 1031
Arc tan recurrence relation, 137
see tan⁻¹ back-substitution, 64, 69, 76, 87, 133, 535,
Archimedes, 12, 175 636
array in Fortran, 474, 863, 948, 1089 roundoff error, 81
array in C, 1009, 1010 back-transformation, 544, 552, 562, 968,
artificial damping, 733, 735 972, 1108, 1112
artificial pole, 498, 502 Backus, G., 684
artificial variable, 386, 399, 928, 929, 1070 backward deflation, 341
artificial viscosity, 734 backward difference, 123, 261, 626
associative law, 26, 27, 33, 38, 53 backward differentiation formula, 165, 613
asymptotic error constant, 283, 336, 354 backward error analysis, 33, 37-39, 42, 44,
asymptotic series, 12, 116, 187, 230, 520 80, 82, 344, 564, 670
automated allocation of mesh points, 644, backward Euler's method, 590, 611
646 backward interpolation formula, 124, 155,
automatic integration, 220-228, 885-909, 581, 589
1030-1053 Baker, C. T. H., 658, 687, 695
adaptive, 221, 225, 226, 898, 1043 BALANC, 554, 555, 562, 563, 971, 972,
iterative, 220, 885, 886, 898, 903, 909, 1112
1030, 1031, 1043, 1048, 1053 balanced-decimal, 51, 774
limitations, 220, 226 balanced-ternary, 17, 774
multiple integrals, 240, 903-909, balancing a matrix, 553, 555, 562, 971,1112
1048-1053 isolated eigenvalue, 554
autonomous form, 576 BALBAK, 554, 972, 1112
average, 430, 431 band matrix, 62, 98, 100, 103, 544, 634, 722,
averaging kernel, 684 744, 877, 1022
avoided crossing, 815 Gaussian elimination, 65, 100, 103,
872, 1017
B-spline, 136-141, 157, 431, 435, 445, band-pass filter, 463
877, 878, 880-884, 939-945, 986, bandwidth, 62, 98, 100, 742, 744, 763, 872,
1022, 1023, 1025-1029, 1080- 1017
1086, 1126 bandwidth limited, 460
approximation, 446, 457, 514, 939- Barrodale, 1., 505, 508
945, 1080-1086 base, 13, 15, 16, 19, 23, 46, 51, 554, 867,
boundary value problem, 640, 986, 971, 1011, 1112
1126 basic feasible vector, 382, 386, 388, 928,
cubic, 138, 877, 1022 929, 1069, 1070
differential equation, 640, 986, 1126 degenerate, 382, 384, 399, 506
differentiation, 140, 162, 877, 1022 finding, 386
expansion, 138, 878, 881-884, 1023, basis function, 130, 135, 139, 141, 433, 503,
1026-1029 508,669,670,757,989,990,1130

B-spline 136-141, 445, 876, 1022 roundoff error, 274


biquadratic, 772, 822 searching an ordered table, 121, 875-
cubic, 772, 822 877, 880, 1021, 1022, 1025
finite element, 758, 759 Sturm sequence, 546, 969, 970, 1110
linear, 759 bit, 17, 20, 21
on parallelogram, 760, 772, 822 bit reversal, 467
on triangle, 759, 772, 822 BJO, 955, 957, 958, 1096, 1098
polynomial, 435, 442-445, 934, 1076 BJ1, 956, 957, 959, 1096, 1098, 1099
BCS, 979-982, 985-988, 1119-1122, 1125-
1128 BJYO, 958, 1098
BCSD, 981, 982, 985, 1121, 1122, 1125 BJY1, 959, 1099
Beale, E. M. L., 399 BKO, 962, 963, 1102, 1103
Bernoulli numbers, 187, 230 BK1, 962, 963. 1103
Bernstein, S. N., 112 BKN, 963, 1103
Bessel function, 56, 58, 503, 521, 652, 653, block tridiagonal matrix, 743
809, 816, 817, 955-962, 1096- Bluestein's FFT algorithm, 469
1103 blunder, 3, 12, 358, 369, 401, 427, 431
asymptotic form, 955-962, 1096-1103
first kind, 955, 956, 958, 959, 1096- Boltzmann distribution, 390
Bonomi, E., 392
modified, 960-963, 1101-1103
recurrence relation, 956, 959-961, 963, Bose-Einstein statistics, 266
1097, 1100, 1102, 1103 boundary condition, 574, 618, 626, 627,
second kind, 957-959, 1098, 1099 660, 982, 1122
spherical, 959, 1100 alternating direction method, 726,
zero of, 337, 653, 817 767
Beta function, 411 Cauchy, 702, 704, 729
incomplete, 408, 412, 416, 930--932, cubic spline, 132, 156
1072, 1073 Dirichlet, 702, 706, 708, 745, 756, 762
BETAI, 931, 932, 1072, 1073 expansion method, 762
BETAP, 930-932, 1072, 1073 finite difference method, 627, 708.
BETCON, 931, 1072, 1073 718, 726, 729, 735, 741, 742, 745,
BETCON1, 931, 1072 747, 756
BETSER, 931, 1072 finite element method, 761
BFGS, 368-371, 454, 924-927, 947, 954,
1066-1068, 1087 Lax-Wendroff method, 735, 1000,
BFGS formula, 366, 370, 397, 924, 1066 1140
BIO, 960, 962, 1101, 1102 Neumann, 702, 705, 706, 708, 742,
BI1, 961, 962, 1101, 1102 747,757
bidiagonal matrix, 91, 92 partial differential equation, 702, 705-
big-endian, 21 708, 718, 726, 729, 735, 741,
biharmonic equation, 771 742, 745~ 747, 756, 761, 996, 999,
bilinear approximation, 760 1000, 1003, 1136, 1137, 1139,
bilinear interpolation, 148 1140, 1143
BIN, 961, 1101 periodic, 756, 770
binary complex system, 16, 774 separable, 619, 630, 979,981,985, 986,
binary number system, 14, 15, 17, 20, 465 1119, 1121, 1126
binary search, 121, 875-877, 880, 1021, boundary value problem, 575, 618-640, 662,
1022, 1025 979-986, 1119-1126
binary tree, 35 B-spline, 640, 986, 1126
binomial distribution, 406, 408, 932, 1073 choice of method, 625, 632
random numbers, 417, 932, 1073 conversion to integral equation, 659
biorthogonality, 525, 557 deferred correction, 627-630, 654
BISECT, 275, 352, 866, 909, 1008, 1053 existence and uniqueness of solution,
bisection, 273-276, 281, 291, 294, 347, 909, 620
1053 expansion method, 633, 640-642, 986,
pitfalls, 274 1126

finite difference method, 620, 626-- BSPINTN, 151, 864, 882, 883, 902, 903,
633, 979, 1119 1027, 1028, 1047
initial value method, 620-625, 654 BSPLIN, 138, 864, 877-879, 881--884, 886,
linear, 621, 623, 627, 631 902, 903, 940, 942, 943, 945, 946,
nonlinear, 620, 625, 629, 632, 642 988, 993, 994, 1022-1024, 1026-
partial differential equation, 702, 741- 1029, 1031, 1046, 1047, 1082-
757, 1001 -1004, 11411144 1084, 1086, 1087, 1128, 1134
shooting method, 620-625, 654 BSPODE, 642, 986, 1126
Box, G. E. P, 417 BSPQD, 140, 886, 902, 903, 1008, 1031,
bracketing complex root, 294- 297, 913, 1046, 1047
1056 BSPQD2, 238, 901, 1008, 1046
bracketing minimum, 348, 351, 367, 923, BSPQDN, 238, 902, 1008, 1047
1064 bug, 3, 10
bracketing real root, 273, 278, 280, 282, 290, Bulirsch, R., 143, 608, 623, 756
294, 306, 325 Buneman, 0., 756
BRACKM, 351, 352, 396, 923, 1064 Burgers' equation, 766
breeding, 395 Butcher, J. C., 651
BRENT, 291, 912, 1056 BYO, 957-959, 1098, 1100
Brent's method, 289- 294, 354-356, 621, BY1, 958, 959, 1099, 1100
912, 923, 1056, 1065 BYN, 959, 1099
convergence, 289 byte, 17
minimisation, 354-356, 498, 499, 502,
C, 8, 1007
923, 1065
CABS, 1010, 1012, 1057-1061
multiple root , 291
carry, 16, 18
pitfalls, 292 , 293
Cartesian grid, 147, 295
root, 289 - 294, 621, 912, 1056
Cartesian product, 238
roundoff error, 293 cascade sum, 35, 39, 200, 866, 1011
Brent, R. P., 290, 291, 353, 354, 373- 376, roundoff error, 36
912, 923, 1056, 1065 CASSUM, 36, 866, 867, 1011
BRENTM, 355, 923, 953, 954, 1065, 1094, CAUCHY, 900, 1045
1095 Cauchy condition, 702, 704, 706, 729
Brigham, E. 0. , 471 Cauchy principal value, 210, 217, 265, 410,
Brown, J. C., 685 900, 1045
Broyden's method, 333, 334, 453, 613, 614, Cauchy's theorem, 301
621, 629, 922, 977, 1063, 1117
Broyden, C. G., 366 cautious iteration, 223
BROYDN, 920-922, 1061-1063 Cayley-Hamilton theorem, 571
BSPEV2, 881, 942, 943, 1008, 1026, 1083, CDIV, 1010, 1012, 1057 - 1061, 1089
1084 central difference, 512, 719, 741
BSPEVL, 447, 878,940,942,945,987,988, central difference formula, 165, 626, 729,
992, 994, 1008, 1023, 1082, 1083, 741
1086, 1127, 1128, 1132, 1134 central force problem, 266, 653
BSPEVN, 883, 884, 944-946, 1008, 1028, central limit theorem, 248
1029, 1085- 1087 centroid, 243
BSPEVN1, 883, 884, 1008, 1028, 1029 centrosymmetric, 697
BSPEVN2, 883, 884, 1008, 1028, 1029 change of variable, 212, 240
BSPFIT, 445, 447, 448, 876, 939, 942, 945, characteristic equation, 524, 579, 582, 583,
1022, 1080, 1083, 1086 614, 615, 638
BSPFIT2, 458, 941, 942, 1082, 1083 characteristic polynomial, 319, 524, 525,
BSPFITN , 458, 943, 945, 1084, 1086 529,649
BSPFITW2, 941 , 942, 1082, 1083 Krylov's method , 571
BSPFITWN, 943, 945, 1084, 1086
BSPINT, 139, 877, 878, 881, 883, 886, chasing, 93
1022-1024, 1026, 1028, 1031 Chauvenet's criterion, 422
BSPINT2, 151, 880, 881, 902, 1025- 1027, CHEBAP, 952, 1092
1046, 1047 CHEBCF, 485, 486, 951, 1092

CHEBEX, 951, 1092 complexity, 64, 103, 116, 120, 147,437, 516,
Chebyshev approximation, 428, 485 749, 751
Chebyshev economisation, 490, 519, 807 composite formula, 181, 196, 207, 246, 666
Chebyshev expansion, 485-494, 496, 951, convergence, 183
952, 1092, 1093 computational efficiency, 285, 298, 310, 336,
calculating the coefficients, 493, 519, 339, 354, 357, 798
951, 1092 computer programming
Lanczos r-method, 520 see programming/program
rational function approximation, 489, condition number, 43, 45, 158, 564, 670,
493, 952, 1092 779,787
Chebyshev integration, 201, 262 nonlinear equation, 320
Chebyshev interpolation, 123, 125, 806 of matrix, 78, 79, 82, 86, 89, 96, 669,
Chebyshev polynomial, 58, 123, 203-205, 679, 780, 782
340,396,440,485,488,498,656, confidence level, 249, 253, 412, 908, 1052
670, 951, 1092 conjugate directions, 372 , 377
differentiation, 488 conjugate gradient method, 101, 366, 371,
extrema, 123, 486 377
integration, 518 Fletcher-Reeves method, 377
orthogonality relation, 487 sparse matrix, 378
recurrence relation, 485, 486, 518 conjugate transpose, 62
shifted, 518 connection condition, 646
zeros of, 123, 203, 486 conservation law form, 734, 1000, 1140
Chebyshev theorem, 122, 495 consistency, 578, 580, 585, 689, 713
Cheney, E. W., 496 constraints, 346, 347, 379, 504, 506, 508,
chi-square, 412, 415, 429, 433, 434, 438, 450, 928, 929, 1069, 1070
993, 1133 continuation method, 300, 331, 334, 625,
chi-square distribution, 412, 415 630, 920, 1061
Cholesky decomposition, 73, 106, 360, 394, continued fraction, 480-483
528, 572, 871, 1016 expansion 12, 483, 522, 930, 931,
CHOLSK, 871, 1016 1071-1073
Chvatal, V., 399 conversion to rational function, 481
clarity, 10 evaluation, 481
Clenshaw's recurrence, 444, 486, 514, 934, QD algorithm, 483
1075 contour integral, 301, 302, 917
Clenshaw, C. W., 222, 520 CONTUR, 917, 1010
Cody, W. J., 498, 503, 521 convergence
collocation, 128 absolute, 793
collocation method, 669-671, 678, 679, 761, acceleration, 170, 188, 191, 192, 231,
989,1130 234,278
eigenvalue problem, 673 failure, 311, 318, 561, 570, 677, 695
Fredholm equation, 669-671, 678, interpolation, 115, 116, 125, 127, 154,
679, 989, 1130 784
partial differential equation, 762 inverse iteration, 535, 536, 568, 811
combinatorial optimisation, 346, 390 iterative methods, 99, 277, 279, 282,
commutative law, 28 285,289,298,310,328,330,338-
complete pivoting, 66, 81, 100, 780 340, 342, 354, 357, 364, 587, 748,
complex arithmetic, 27, 1010 750, 751 , 755, 796, 798
complex root, 294-304, 308-312, 913-920, iterative refinement, 83
1056-1061 linear, 283
locating, 294-297, 913, 1056 numerical integration method, 579,
Muller's method, 297-300, 915, 916, 585
1058, 1059 order of, 5, 11, 115, 283, 578
of analytic functions, 301-304, 916 QL algorithm, 549, 551
of polynomials, 308-312, 918-920, quadrature formulae, 181, 183, 188,
1060 197, 203, 249, 255
quadrature based method, 301-304, solution of differential equation, 579,
916 585, 713, 714, 717

solution of integral equation, 677, 689, boundary condition, 132, 156


695 differentiation, 160, 161, 876, 1021
spurious, 7, 212, 222-224, 257, 281, in two dimensions, 150
298, 316-318, 368, 478, 534, 677, integration, 186, 885, 1030
699, 800 truncated power basis, 136
convergence criterion, 5, 9, 94, 222, 227, Curtis, A. R., 222
233, 315-319, 355, 534, 598 cyclic reduction, 756
absolute, 224, 316, 598 cycling in simplex method, 384, 399, 966,
detecting roundoff error, 172, 223, 1107
317, 350
extrapolation methods, 172, 225 Dahlquist, G., 612
iterative methods, 312, 315-319, 342, Dantzig, G. B., 381
368 Davidenko's method, 300, 331, 334, 625,
limitation, 7, 185, 222, 224 630, 920, 1061
relative, 224, 316, 598 DAVIDM, 358, 924, 1065
convolution, 464, 470, 473, 476, 517, 806 DAVIDN, 801, 920-922, 1061, 1063
convolution type kernel, 660, 697 Davidon's method, 356-359, 924, 1065
Cooley, J. W., 465 convergence, 357
Cooley-Tukey FFT algorithm, 467 order, 396
corrector, 587, 686 roundoff error, 358
Adams-Moulton, 590
Davidon, W. C., 366
convergence of iteration, 587
Davis, P. J., 123, 176, 181, 183, 190, 198,
correlation, 450, 464, 946, 1087
209, 216, 226, 243, 255
correlation coefficient, 420, 421
DAWSON, 963, 1103
probability distribution, 421, 933,
Dawson's integral, 521 , 810, 963, 1103
1074
Day's method, 688
cosine, 807
De Boor, C., 131, 137, 223, 226
cosine integral, 264
debugging, 8
cosine transform, 473, 516, 757, 805
decay exponent, 124, 134, 146, 786
cosmic rays, 416
cost coefficient, 383, 928, 965, 966, 1069, decimal number system, 13
1106 defective matrix, 527, 553, 565, 567, 568,
cost function, 379 571, 811, 967, 1108
Cotes numbers, 177 eigenvector, 540
Courant condition, 730, 732, 733, 735, 739, generalised eigenvector, 567, 568, 811
767 inverse iteration, 540, 568, 811
Courant, R., 730 power method, 567, 811
covariance, 412, 419 deferred correction, 627, 628, 631, 979, 981,
covariance matrix, 394, 434, 436, 444, 449, 988, 1119, 1121, 1129
450, 453, 515 convergence, 664
Craig,1. J. D., 685 eigenvalue problem, 635, 981, 984,
Cramer's rule, 61, 65, 649 1121, 1124
CRANK, 996, 1136 integral equation, 663, 664, 667, 673,
Crank, J., 720 679, 688, 695, 988, 1129
Crank-Nicolson method, 712, 720, 722, 766, nonlinear equation, 630
820, 996, 1137 partial differential equation, 744
critical frequency, 460, 472 second-order equation, 629, 654
critical points, 495, 499, 501, 503 deflation, 300, 311, 318, 341, 913, 914, 919,
CROUT, 68, 69, 76, 868, 870, 1013, 1016 1057, 1061
Crout's algorithm, 74, 75, 556, 663, 868, backward, 341
869, 1013-1015 matrix, 531, 560, 568
partial pivoting, 75 polynomial, 308, 311, 312, 341, 919,
CROUTH, 84, 548, 869, 1014 1061
Crump, K. S., 476, 477 stability, 312
CSQRT, 1010, 1012, 1057-1061 system of equations, 326, 327
cubic spline, 131-135, 758, 786, 875, 876, degeneracy
1021 approximations, 498, 501

kernel, 661 , 672, 674 , 676, 677, 697, free boundary, 619, 646
698 Gear's method , 613, 61 5, 616, 977,
linear programming, 382, 384, 399, 1117
506 , 929, 966, 1070, 1071 , 1107 Hamming's method , 583, 584, 616
degree of freedom, 411- 414, 421 , 438 ill-conditioning, 518, 597, 650, 652,
DELVES, 303, 916-918, 1010 817
Delves, L. M ., 301-304, 658, 695 infinite range of integration, 642
density of mesh points, 644 initial value problem, 575-618, 973-
derogatory matrix, 527 979, 1113-1119
design matrix, 433, 436 , 440 limitation of numerical methods , 573
determinant, 64, 524, 529, 545 , 572, 634, linear , 576, 621, 623, 627, 631 , 633,
636, 868, 870-872, 984, 985, 648
1013- 1017, 1124, 1125 method of undetermined coefficients,
Hessenberg matrix, 570 577, 578, 582, 816
overflow, 68, 637 Milne's method, 594
Deuflhard, P., 609, 610 modified midpoint method , 608
DFP formula , 366, 397 multistep method , 576, 609, 612, 974,
DFT, 947, 1088 1115
diagonal matrix, 62, 69, 88, 94, 526, 541, numerical integration method, 576,
550 599
diagonal rational function, 143 order, 574
diagonalisation, 542, 550 oscillatory, 643
diagonally dominant matrix, 106, 133, 630, predictor-corrector method , 587- 599,
712 974- 976,111 5, 1116
DiDonato, A. R., 930, 931, 1071- 1073 reduction to first- order system, 575
difference roundoff error , 585, 596, 597, 599, 614,
backward, 123, 261, 626 623 , 649, 81 5
central, 512, 719, 741 Runge-Kutta method, 600- 607, 651,
divided, 118, 119, 122, 129 652 , 973, 974, 1113, 1114
forward , 123, 234, 261 , 626 shooting method , 620-625, 640, 654
difference equation, 50, 283, 491, 579, 582, Simpson's rule, 596
586, 605, 708 single-step method, 576, 600, 610
differential correction algorithm, 504 , 506, singularity, 575, 609, 625, 638, 643,
964, 1105 645, 655, 816
differential equation, 573-656, 814- 822, stability, 578-586, 588, 593, 602, 604,
973- 986, 1113- 1126 605 , 611 , 614-6 16,649, 653, 815,
Adams method , 581, 585, 589, 590, 816
593, 594,596,615,815,976,1116 stiff, 579, 581, 588, 607, 609, 610- 618,
autonomous form , 576 722, 977, 997,1117,1138
B-spline, 640, 986, 1126 stiffly stable method , 613, 618, 722,
boundary condition, 574, 576, 618, 977, 1117
626, 627, 645, 646 trapezoidal rule, 577, 587, 590, 600,
boundary value problem, 575, 618·· 612, 616, 653
633, 979-986, 1119- 1126 truncation error, 578, 582, 585- 589,
choice of method, 610, 625, 632 591 , 594, 601, 603, 604, 627- 630,
convergence criterion, 598 635, 642, 654, 977, 979, 1117,
convergence of solution, 579, 585 1120
eigenvalue problem, 575, 619, 633- with time delay, 518
640, 642, 981 , 1121 differentiation, 57, 104, 159-174, 303, 577,
Euler 's method , 577, 578, 589, 611 , 787- 789, 884, 1029
648 B-spline, 140, 162, 877, 1022
expansion method , 640-642, 656, 986, backward difference formula, 165, 613
1126 central difference formula, 165
extrapolation methods, 607- 610, 643, cubic spline, 161, 876, 1021
978, 1118 extrapolation met hods, 168-·172, 789,
finite difference method , 626- 640, 884, 1029
644,979-985,1119-1126 Fourier series, 463, 517

function of two variables, 173 weakening the discontinuity, 472


interpolation formula, 160, 874, 875, window function, 470, 471, 516
1019, 1020 zero padding, 469, 473, 475
method of undetermined coefficients, dispersive method, 734
164-168 dissipative method, 734
optimum spacing, 162, 166, 172, 173, distributive law, 26
787,788 DIVDIF, 160, 874, 876, 1019-1021
roundoff error, 162, 169, 172, 787,885, DIVDIFO, 875, 879, 1020, 1025
1030 divergent integral, 212, 267
truncation error, 160, 162, 165, 787 divergent series, 236, 268, 773
using approximation, 441, 446, 456, divided difference, 118, 119, 122, 129, 353,
507, 509, 512, 934, 935, 937, 938, 745,747
1075, 1077-1079 divided difference interpolation formula,
diffusion equation, 705, 707-717, 748 117-128,174,194,874,875,879,
alternating direction method, 726 1019, 1020, 1024
Crank-Nicolson method, 712, 996, differentiation, 160, 874, 875, 1019,
1136 1020
cylindrical polar coordinates, 766 divided difference table, 119, 129
Du Fort and Frankel method, 716, 765 domain of indeterminacy, 274, 284, 292,
higher order methods, 713, 715, 716 293, 313, 316, 795, 915, 918,
two space variables, 726 1059, 1060
diffusion time scale, 711 of minimum, 350, 355, 358
digit, 13-17 size, 276, 313, 317
dimensional effect, 238, 246, 253 Dongarra, J. J., 59
direct methods, 59, 62, 98, 100, 741, 744 Doolittle's algorithm, 74
direction set method, 371-377, 453, 926, partial pivoting, 75
1068 double length accumulation, 36, 39, 42, 71,
line search, 375, 927, 1068 81, 83, 556
Powell's method, 372 double precision, 25, 28, 32, 83, 84, 209, 865
roundoff error, 376 DRVT, 172, 223, 884, 1008, 1029
Dirichlet condition, 702, 706, 708, 741, 745, Du Fort and Frankel method, 716, 765
749, 756, 762, 996-1001, 1136- dual problem, 389, 504, 506
1139, 1141 duality theorem, 389
irregular boundary, 745
discontinuity €-algorithm, 192, 212, 229, 477, 643, 887,
Fourier transform, 461, 463, 472 949, 1032, 1090
integral equation, 665 €-table, 192, 225, 229
inverse Laplace transform, 477, 478 economisation of power series, 490, 519, 807
partial differential equation, 728 efficiency, 10
discrepancy, 498 efficiency index, 285, 298, 310, 336, 339,
discrete Fourier transform, 459-475, 756, 354, 357, 798
947, 948, 1088, 1089 efficient rule, 243
aliasing, 460, 461, 471, 474 eigenfunction, 633, 634, 638, 981, 1122
basic properties, 464 integral equation, 658, 662, 672, 674,
convolution, 464, 470, 473 676, 988, 1129
correlation, 464 inverse iteration, 635, 673, 675
cosine, 473, 516, 805 left, 673
differentiation, 463, 517 nodes, 635, 655, 674, 675
frequency shift, 464 eigenvalue, 62, 78, 524, 529, 554, 617, 648,
in n dimensions, 473-475, 948, 1089 966, 1107
inverse, 462, 469, 474 differential equation, 633-640, 981,
Lanczos sigma factor, 463, 472, 517, 1121
806 dominant, 529
orthogonality relation, 462 integral equation, 658, 662, 672, 676,
side lobes, 471 989, 1129
sine, 473, 516, 805 partial differential equation, 703
time shift, 464, 472 see also eigenvalue problem

eigenvalue problem, 523-572, 633-640, sparse matrix, 557, 558


672-676, 811-813, 966-973, Sturm sequence, 545, 548, 970, 1110
1010, 1107-1113 symmetric tridiagonal matrix, 544,
balancing the matrix, 553, 555, 562, 548, 550, 553, 969, 970, 1109,
971, 1112 1110
characteristic polynomial, 524, 525, truncation error, 635, 674
529, 571 unsymmetric matrix, 553, 563, 568,
collocation method, 673 972, 973, 1112, 1113
defective matrix, 540, 553, 565, 567, eigenvector, 524, 529, 552, 565, 648, 966,
568, 571, 811, 96~ 1108 1107
differential equation, 570, 575, 619, back-transformation, 544, 552, 562,
633-640, 642, 981, 1121 968, 972, 1108, 1112
distinct eigenvalues, 525, 526 by direct solution of equations, 538,
expansion method, 642, 673 562
extrapolation methods, 635 generalised, 567, 811
finite difference method, 634-640, inverse iteration, 529, 535, 540, 562
744, 747, 769, 981, 1121 left,525, 537, 636, 968, 972, 982, 1108,
Galerkin method, 673 1112, 1122
generalised, 528, 540, 570-572, 633, normalisation, 524, 529, 967, 1107
673, 968, 981, 1108, 1121 orthogonality, 525, 548
Hermitian matrix, 526, 536, 548, 971, right, 525
1111 see also eigenvalue problem
Hermitian operator, 637 elementary matrix, 72, 555
Hessenberg matrix, 559-562, 570, 972, Elliott, D. F., 465, 469
1113 ellipse, 181, 370
Householder's transformation, 542, ellipsoid, 360, 437
553, 556, 564, 968, 1108 elliptic equation, 703, 706, 741-757, 1001-
ill-conditioning, 525, 529, 536, 538, 1004, 1141-1144
553, 565, 566 alternating direction implicit iterative
integral equation, 658, 672-676, 989, (ADI) method, 753-755, 1003,
1129 1143
inverse iteration, 534-541, 546, 635, boundary condition, 706, 741, 742,
811, 966, 970, 984, 988, 1107, 745, 747, 756
1110, 1124, 1129 cyclic reduction, 756
Jacobi method, 542 direct methods, 741, 744, 756
Lanczos method, 557, 558 eigenvalue problem, 703, 744, 747, 769
multiple eigenvalue, 526, 527, 530, FFT method, 756, 757
535, 536, 545, 548, 551 irregular boundary, 745-747
partial differential equation, 703, 744, iterative methods, 744, 748-755
747,769 nonlinear, 747
power method, 529-534, 567, 568, 811 successive over-relaxation method,
QL algorithm, 549-553, 969, 1109 750-752, 1001, 1141
QR algorithm, 558-562, 972, 1113 three or more dimensions, 755
quadrature method, 673, 989, 1129 truncation error, 742, 745, 822
Rayleigh quotient, 532, 536, 537, 568, ELMHES, 556, 563, 972, 1112
967, 970, 1107, 1111 EPSILN, 225, 887, 1032
real symmetric matrix, 526, 541-553, EQN, 979-982, 985-988, 1119-1122, 1125-
968-971, 1108-1111 1128
reduction to boundary value problem, EQND, 981, 982, 985, 1121, 1122, 1125
619, 638, 640 equidistributed sequence, 238, 254-257,
reduction to condensed form, 542, 908, 1052
544, 553, 557, 968, 972, 1108, convergence of integrals, 255
1112 generation, 254, 255
roundoff error, 530, 531, 533, 536, 543, roundoff error, 256
544, 553, 556, 558, 564-566, 569 EQUIDS, 223, 257, 905, 908, 1049, 1052
shooting method, 640 equilibrated matrix, 67, 86, 104
software, 523 equilibration, 66, 68, 75, 79, 105, 325

equilibrium problem, 703 exponential function, 56, 519, 778, 807, 808
ERF, 933, 955, 1074, 1095, 1096 exponential integral, 58
ERFC, 955, 1096 exponential order, 476, 477, 807, 949, 1090
error EXTP, 609, 617, 978,1118
absolute, 23, 224, 315, 598 extrapolation, 11, 127, 143, 168, 331, 577,
blunder, 3, 12, 358, 369, 401, 427, 431 686, 979, 1119
bug, 3, 10 extrapolation methods, 169-174, 187-193,
measurement, 2, 401 607-610
modelling, 2 adjusting the step size, 609
probable, 248 convergence criterion, 172, 225
propagation, 418, 420 differential equation, 607-610, 627,
relative, 23, 224, 315, 598 635, 643, 978, 1118
roundoff, 2, 4, 9, 22-50 differentiation, 168-172, 789, 884,
significant figures, 23 1029
systematic, 401 €-algorithm, 192, 212, 229, 477, 643,
truncation, 2, 4, 7 887, 949, 1032, 1090
error curve, 495, 498, 501-503, 952, 1093 integral equation, 664, 673, 695
nonstandard, 498 integration, 187-193, 212, 215, 225,
error ellipsoid, 437 886, 887, 1031, 1032
error function, 12, 262, 406, 503, 520, 809, partial differential equation, 744
933, 955, 1074, 1095, 1096 roundoff error, 170
minimax approximation, 521, 809 using polynomials, 168, 174, 608, 618,
Euclidean norm, 99, 564 979, 1119
EULER, 901, 1045 using rational functions, 174, 608,
Euler's constant, 12, 264, 267, 777, 792, 617,979,1119
957, 958, 962, 1098, 1099, 1102, extrema, 431, 486, 495-499, 503, 522, 952,
1103 1093
Euler's method, 343, 577, 589, 600, 648 see also minimum
backward, 590, 611
truncation error, 578, 607 F distribution, 415, 416
Euler's transformation, 234 236, 267, 477, F test, 415, 416
900, 1045 factorial, 407
Euler-Maclaurin sum formula, 186, 187, factorial function, 231, 232
230-232, 261, 267, 792 false position
Eulerian equation, 739 see regula falsi method
evolution problem, 703 fast Fourier transform, 465-475, 494, 806,
exchange algorithm, 503 947, 948, 1088, 1089
expansion method, 640-642, 669-672, 986, bit reversal, 467
989, 1126, 1130 Bluestein's algorithm, 469
boundary condition, 762 complexity, 516, 806
collocation, 669-671, 673, 678, 679, convolution, 470, 473, 517, 806
761, 762, 989, 1130 Cooley-Tukey algorithm, 467
differential equation, 640-642, 656, for arbitrary N, 469
986, 1126 gap filling, 469
eigenvalue problem, 642, 673 in n dimensions, 474, 475, 948, 1089
Fredholm equation, 669-672, 678, parallel processing, 474
679, 989, 1130 real data, 468, 516, 948, 1089
Galerkin, 670, 672, 673, 758, 761-763 Sande-Tukey algorithm, 467
least squares, 669, 678, 761, 762 solution of elliptic equations, 756
partial differential equation, 757 using mixed radix representation, 469,
experimental error, 418 516
explicit difference method, 708, 714, 729, zero padding, 469, 470, 473, 475
731, 733, 736 FBETA, 931, 932, 1072, 1073
nonlinear equation, 719 FDM, 630-632, 640, 979, 983, 985, 986, 988,
stability, 709, 710, 717, 730 1119, 1124, 1126, 1128
explicit formula, 577, 584, 587 feasible region, 382
exponent offset, 19, 20 feasible vector, 380

Fehlberg, E., 605 parabolic equation, 708- 728, 765,


FERM05, 810, 963, 1104 996- 1000, 1136- 1140
FERM15, 810, 963, 1104 second-order equation, 626, 629, 654
FERM25, 810, 963, 1104 stability, 709-·713, 717, 725, 727, 730-
Fermi integrals, 266, 521, 792, 810, 963, 732, 735, 739, 767, 821
1104 Sturm-Liouville problem, 634
Fermi-Dirac statistics, 266 truncation error, 627-630, 635, 642,
FERMM05, 810, 963, 1104 654, 708, 713, 727, 729, 742, 745,
FFT, 474, 947, 948, 1088, 1089 765, 820, 822
FFTN, 474, 475, 948, 1089 finite element method, 757-763
FFTR, 469, 948, 1089 boundary condition, 761
Fibonacci equation, 283 continuity of solution, 758
Fibonacci number, 103, 349 initial value problem, 763
fictitious mesh point, 708, 736, 742, 746, method of lines, 763
747 nodal points, 758
fill-in, 100, 557 parallelogram element, 760, 772, 822
partitioning, 757
FILON, 898, 1042
triangular element, 759, 760, 772, 822
Filon's formula, 216, 228, 898, 1042
variational method, 762
finite difference matrix, 630, 631, 636, 743,
finite elements, 757
985, 1126
FITS, 21
finite difference method, 626-640, 668, 708-
five-point formula, 741, 743
757, 979- 985, 996- 1004, 1119-
fixed-point arithmetic, 18, 22, 262
1126, 1137-1144
fixed-point iteration, 276-278, 587, 591,
amplification factor, 710-712, 725,
661, 685, 721, 995, 1136
727, 730, 731, 735, 739, 767, 768,
convergence, 277, 342
819, 820
system of equations, 325, 800
amplification matrix, 732, 733, 768,
fixed-point representation, 17, 19
820,821
Fletcher, R., 365-367, 370, 384, 452, 453
automated allocation of mesh points, Fletcher-Reeves method, 377
644- 646
FLN, 927, 928, 1068, 1069
boundary condition, 627, 645, 646, FLNM, 925, 926, 1066, 1067
708, 718, 726, 729, 735, 741, 742, floating-point arithmetic, 24-27, 54, 55
745, 747, 756 overflow, 27, 30
central difference formula, 626, 729, roundoff error, 25, 26, 28, 31, 53, 775,
741 777
consistency, 713 underflow, 27, 30, 53
convergence, 713 floating-point number, 19
deferred correction, 627-630, 635, base, 19, 23, 46, 867, 1011
654, 979, 981, 984, 1119, 1121, exponent offset, 19, 20
1124 fraction/mantissa, 19, 20, 23- 25, 28,
dispersive, 734 37, 55
dissipative, 734 IEEE standard, 20, 21, 52, 862
eigenvalue problem, 634-640, 744, NaN, 9, 20, 47
747, 769, 981, 1121 normalisation, 19, 20, 24
elliptic equation, 741-757, 1001-1004, overflow, 20
1141-1144 precision, 21, 23
explicit, 708- 710, 714, 717, 719, 729- underflow, 20, 27
731, 733, 736 fluid mechanics, 739, 769
extrapolation methods, 627, 635, 744 flux, 735, 736, 740, 1000, 1140
higher order schemes, 628, 713, 715, FM, 953, 954, 1094, 1095
716, 730, 745 folding of frequency spectrum, 460
hyperbolic equation, 729-740, 1000, Forsythe, C. E ., 59, 88, 747
1140 Fortran, 8, 9,18,19,21,52,54,57,176,238,
implicit, 708, 711- 713, 719, 730, 732, .523, 593, 742, 864, 913, 954, 997,
737, 821 1008
nonlinear equation, 629, 719, 735, 747 array, 474, 863, 948, 1089

SAVE, 865 Friedrichs, K 0., 730


unformatted write, 10 full width at half maximum, 404, 410, 4ll
FORW, 991, 994, ll31, ll34 normal distribution, 405
forward difference, 123, 234, 261, 626 fully symmetric region, 244
forward error analysis, 33, 34, 42 FUNK, 990, 1130, ll31
forward integration formula, 577
forward interpolation formula, 124 Galerkin method, 670, 672, 758, 761
forward problem, 991, 994, ll31, ll34 eigenvalue problem, 673
forward substitution, 69, 76, 636, 685 Fredholm equation, 670, 672, 673
roundoff error, 81 partial differential equation, 758, 762,
Fourier approximation, 431, 462 763
Fourier method, 710, 712, 718, 725, 727, GAMMA, 502, 896, 897, 930, 954, 955,
730-732, 735 1041, 1071, 1095
Fourier transform, 215, 410, 459-475, 947, Gamma function, 12, 239, 407, 501, 503,
1088 931-933, 954, 955, 1072-1074,
convergence, 461 1095
convolution, 806 incomplete, 414, 930, 1071
deconvolution, 469 GAMMAL, 930-933, 955, 1071-1074, 1095
discontinuity, 461, 463, 472
GAMMAP, 930, 1071
Gibbs' phenomenon, 463
Garbow, B. S., 523
see also discrete Fourier transform
GAUBLK, 631, 637, 981, 984, 985, 1121,
see also fast Fourier transform
ll25
Fourier transform method, 756, 757
GAUBND, 872, 873, 878, 881, 883, 1017,
Fourier's conditions, 336
1023, 1026, 1028
fraction, 19, 20, 23-25, 28, 37, 55
GAUCB1, 889, 1033
FRED, 664, 667, 674, 675, 866, 988, 1008,
GAUCB2, 889, 1034
ll28
GAUCBY, 888, 1033
FREDCO, 989, 990, 1130, 1131
Fredholm equation, 658-679, 988-990, GAUELM, 68, 867, 868, 872, 897, 898,
921, 922, 951-953, 967, 968,
ll28-ll31
976, 978, 989, 990, 1012, 1013,
classification, 658
1017, 1042, 1063, 1064, 1092-
collocation method, 669-671, 673,
1094, ll08, ll16, ll18, ll29,
678, 679, 989, ll30
deferred correction, 663, 664, 667, 1130
673, 679 GAUHER, 206, 897, 1041
existence of solution, 676, 698 GAUJAC, 204, 206, 896, 1040
expansion method, 669-673, 678, 679, GAULAG, 214, 891, 892, 1036, 1037
989, ll30 GAULEG, 206, 895, 1040
extrapolation methods, 664, 673 GAULG2, 893, 894, 1038, 1039
first kind, 676-679, 988-990, 1128- GAULOG, 893, 894, 1038, 1039
ll31 GAUS16, 899, 900, 1044
fixed-point iteration, 661 GAUSQ, 890, 891, 1035
Galerkin method, 670, 672, 673 GAUSQ2, 207, 890, 891, 1034, 1036
nonlinear, 667 GAUSRC, 208, 229, 895-898, 1039-1042
quadrature method, 662-668, 673, GAUSS, 200, 221, 888, 890-894, 1032,
677, 678, 818, 819, 988, 1128 1035-1039
regularisation method, 678-680 Gauss-Chebyshev formula, 203, 790, 888,
second kind, 659-669, 672, 988-990, 889, 896, 1033, 1034, 1041
1128-1131 Gauss-Hermite formula, 206, 214, 790, 893,
singular, 659, 664, 666, 678 1037
spurious convergence, 677, 699 weights and abscissas, 897, 1041
system of, 668 Gauss-Jacobi formula, 204
third kind, 672-676, 989, ll29 weights and abscissas, 896, 1040
truncation error, 663, 670, 674 Gauss-Jordan elimination, 69, 87
free boundary, 619, 646 complexity, 103
free-end conditions, 132 Gauss.-Kronrod formula, 198, 207, 225, 898,
frequency domain, 460, 464, 756 899, 1043, 1044

Gauss-Kronrod-Patterson formula, 198. weight function, 202-209, 214


221 weights and abscissas. 196, 202-209,
Gauss-Laguerre formula, 206, 214, 229, 891, 790
892, 1036, 1037 GAUSWT, 209, 228, 229, 333, 895, 897,
weights and abscissas, 896, 1041 1040. 1042
Gauss-Legendre formula, 196, 200, 221, GEAR, 614, 616--618, 724. 974, 976, 977,
761, 888, 890, 891, 893, 894, 997,1115 1117, 1138
896, 900, 1032. 1034, 1036, 1038, Gear's method, 613-618, 977. 1117
1039, 1041, 1044 pitfalls, 616
composite, 196 stability, 615
integral equation, 664, 667. 674, 677, Gear, C. W., 593, 608, 611-613. 651
989. 1129 generalised eigenvalue problem, 528. 529,
product, 239, 240, 243, 761, 903, 904, 570-572, 633, 673. 968, 981-983,
1048 1108,1121-1123
truncation error, 196 differential equation, 633, 638, 981-
weights and abscissas, 895. 1040 983. 1121-1123
Gauss-Seidel method, 99, 100, 748, 783 eigenvector, 529, 540, 635
convergence. 99. 750 inverse iteration, 529, 540
Gaussian distribution, 405 reduction from polynomial to linear
Gaussian elimination, 63 -73, 98, 133, 360, form. 528
529. 534, 536, 663. 677, 712, 744, generalised eigenvector, 567, 568, 648. 811,
867. 872, 1012, 1017 814
algorithmic form. 68 genetic algorithm. 395
complexity, 64. 103 Gerschgorin theorem, 527, 546, 571
for band matrix, 65, 100, 103. 872, GEVP, 637, 639, 640, 981, 985, 986, 1121.
1017 1126
for eigenvalue problem, 555, 556, 566, Gibbs' phenomenon, 463, 517, 683
972, 1112 Gilbert, F., 684
for finite difference matrix, 630, 636, Gill, S., 602
985, 1125 Givens' rotatioil. 93, 109, 542. 550
for singular matrix, 108 global minimum. 345, 346, 396
for sparse matrix, 100 Goldberg, D. E., 395
for symmetric matrix, 73, 104, 106, GOLDEN, 350. 352, 923. 1064
871, 1016 golden ratio, 349. 355
for tridiagonal matrix, 779 golden section search, 348 352, 355, 375,
matrix formulation, 71-73 923, 1064
partial pivoting, 66, 67, 73, 630, 636, convergence, 349
867. 872, 1012, 1017 roundoff error, 3.50-352
roundoff error, 65, 67, 80 Goldfarb, D., 366
Gaussian quadrature, 194-198, 202- 209, Golub, G. H., 207
219, 888·-894, 1032-1039 Gosset, W. S., 411
calculating weights and abscissas, Gough, D.O., 680, 684
208. 209, 895-897. 1039-1042 graceful underflow, 20, 28
checking the weights and abscissas, gradient vector, 359, 433
197 Graeffe's root squaring method, 308
composite formula, 196, 207 Gragg, W. B., 608
convergence, 197, 203 Gram-Schmidt process, 377, 443
improper integrals, 202, 203, 209, 214 Greenspan, D., 747
integral equation, 663, 664, 677, 678, Gregory's formula, 187, 261, 663, 665, 679,
685, 988, 1128 688, 695, 988, 1129
logarithmic singularity, 263, 790, 893,
894, 1038, 1039 h -+0 extrapolation
nonstandard weight functions, 209 see extrapolation methods
over infinite interval, 206, 214, 892, Ii, 23, 25, 29, 30, 33, 36-41, 45, 46, 48,
893, 1037 52. 55, 65, 67, 80, 82, 83, 89,
roundoff error, 197, 200 90, 94, 97, 100, 163, 166, 199,
truncation error, 196, 202 234, 256, 274, 291, 305, 311, 313,

350, 353, 358, 369, 376, 426, 443, artificial damping, 733, 735
498, 535, 870, 898, 919, 968, 969, boundary condition, 705, 729, 735
971. 972, 1015, 1043, 1061, 1108, characteristics, 704, 707, 728-730
1109, 1111, 1113 conservation law form, 734, 1000,
half width at half maximum, 404 1140
Halley's method, 339, 340, 798 explicit method, 729-731, 733
Hamming's method, 583, 649 implicit method, 730, 732
roundoff error, 649 Lax-Friedrichs method, 733
stability, 584, 616, 815 Lax-Wendroff method, 735, 736, 740,
Hamming, R. W., 1,463 1000, 1140
Handbook, 523, 537, 544, 552, 554, 556, nonlinear, 734, 1000, 1140
561, 562, 968, 969, 971, 972, numerical dissipation, 734
1108, 1109, 1112, 1113 several space variables, 740
Hanning window, 471 stability, 730-732, 735, 739, 767
harmonic oscillator, 410, 655 truncation error, 729, 730
Hart, J. F., 503 upwind differencing, 739
Haselgrove, C. B., 255, 256 hyperboloids, 360
HEREVP, 971, 1111 hypercube, 239, 244
HERMIT, 893, 1037 hyperefficient rule, 243
Hermite cubic basis, 156, 786 hyperplane, 250, 327, 380
Hermite interpolation, 128-130, 592, 594, hypersphere, 239, 241, 244, 793
799 hypersurface, 360, 702
cubic, 130, 131, 356, 367, 396, 786,
924, 1065 IEEE standard, 20, 21, 52
Hermite polynomial, 58, 206, 897, 1041 IER, 866, 1008
Hermitian matrix, 62, 526, 536, 552, 967, ignoring the singularity, 211
1107 ill-conditioning, 30, 32, 43, 44, 46, 49, 57,
conversion to real symmetric form, 79, 319-323, 898, 1042
548 approximation, 435, 439, 498, 509
eigenvalue problem, 548, 971, 1111 differential equation, 518, 597, 650,
Herzberger, J., 31 652, 817
Hessenberg matrix, 63, 103, 537, 541, 553, eigenvalue problem, 525, 529, 536,
813 538, 553, 565, 566
determinant, 570 integral equation, 658, 662, 670-672,
eigenvalue problem, 559--562, 570, 676-679, 691, 692, 819
972, 1113 integration, 209, 228
Hessian, 359, 363, 365, 366, 372, 452-454, interpolation, 136, 141, 787
924, 1066 inverse problem, 680
singular, 368 Laplace transform, 475
updating, 365 linear equations, 44, 61, 78, 79, 85,
heuristic, 346, 593, 866, 1011 86, 96, 780, 868-870, 873, 1013,
hidden bit, 20, 21, 25 1014, 1016, 1018
high-pass filter, 463 minimisation, 802
Hilbert matrix, 84, 95, 96, 104, 106, 342, nonlinear equation, 301, 305, 317,
397, 435, 439, 569, 802 319-323
histogram, 157 partial differential equation, 820
Horner's method, 117, 276, 480, 485 spurious convergence, 319
Hotelling, H., 568 symptoms, 86, 319
Householder's transformation, 91, 109, 436, Illinois method, 280
542, 548, 553, 556, 559, 564, 673, implicit difference method, 708, 711, 715,
968, 1108 737
roundoff error, 544 hyperbolic equation, 730, 732
Householder, A. S., 305, 308 nonlinear equation, 719
HQR, 561-563, 972, 1113 parabolic equation, 711, 712, 719
hyper-spherical coordinates, 905, 1049 stability, 712, 713, 731, 821
hyperbolic equation, 703, 704, 728-740, implicit equilibration, 68, 75, 869, 1014
1000, 1140 implicit formula, 577, 587, 611, 613, 616

implicit Runge-Kutta method, 607, 651, integral equation, 518, 657-700, 818, 819,
816 988-995, 1128-1136
importance sampling, 252, 269, 794 B-spline, 671, 680, 991, 1131
improper integrals, 189, 202, 209-219, 227-- collocation method, 669-671 , 673,
229 678, 679, 989, 1130
Cauchy principal value, 217 convergence, 689, 695
change of variable, 212 deferred correction, 663, 664, 667,
extrapolation methods, 189, 212, 215 673, 679, 688, 695
ignoring the singularity, 211 discontinuity, 665
proceeding to the limit, 212 eigenvalue problem, 658, 672--676,
truncating the interval, 213 989, 1129
weakening the singularity, 214 existence of solution, 676, 692, 698
weight function, 202, 209, 214 expansion method, 669-673, 678, 679,
incomplete Beta function, 408, 412, 416, 690, 989, 1130
930-932, 1072, 1073 extrapolation methods, 664, 673, 695
incomplete Gamma function, 414, 930, 1071 Fredholm type, 658, 660-679, 988-
index, 55 990, 1128-1131
initial condition, 576, 704, 706, 707, 729 Galerkin method, 670, 672, 673
initial value method ill-conditioning, 658, 662, 670-672,
677-679, 691, 692, 819
see shooting method
in higher dimensions, 668
initial value problem, 575-618, 621, 685,
inverse problem, 658, 680-685, 991,
702-740, 973-979, 996-1001,
1131
1113-1119,1137-1141
least squares method, 669, 678
choice of method, 610
moment problem, 699
convergence criterion, 598
Monte Carlo simulation, 682, 991,
convergence of solution, 579, 585
1131
conversion to integral equation, 660,
nonlinear, 667, 685, 689, 700, 995,
696
1135
existence and uniqueness of solution,
quadrature method, 662-668, 673,
648 677, 678, 685, 695, 988, 995,
extrapolation methods, 607-610, 978, 1128, 1135
1118 singularity, 659, 664, 666, 678
finite element method, 763 spurious convergence, 677, 699
ill-conditioning, 597 stability, 687, 689, 690, 692
method of lines, 721, 724, 763, 997, symmetric kernel, 660, 670, 672, 673
1137 system of, 668
multistep method, 576, 609, 612, 974, truncation error, 663, 670, 674, 686,
1115 687, 694
partial differential equation, 702-740, Volterra type, 658, 659, 685-695, 995,
996-1001, 1137-1141 1135, 1136
pitfalls in numerical solution, 599 integral transforms, 215
predictor-corrector method, 587-599, integration, 54, 175-270, 789-794, 885-909,
974-976, 1115, 1116 1030-1053
roundoff error, 585, 596, 597, 599, 614 B-spline, 140, 238, 886, 901, 1031,
Runge-Kutta method, 600-607, 973, 1046
974, 1113, 1114 contour, 301, 302, 917
single-step method, 576, 600, 610 multiple (see multiple integration)
stability, 578-586, 588, 593, 602, 604, see also quadrature
605, 611, 614-616 integro-differential equation, 668
starting values, 590, 591, 593, 977, inter-quartile range, 404
1117 interpolation, l11-158, 339, 425, 427, 577,
stiff, 579, 588, 607, 609, 610-618 593,610,669,677,689,746,783-
stiffly stable method, 613, 618, 977, 787, 874-884, 1019-1029
1117 B-spline, 138, 150,877,880-883, 1022,
truncation error, 578, 585-589, 591, 1025-1028
594, 601, 977, 1117 backward formula, 124, 155, 581, 589

Chebyshev, 123, 125 multiple eigenvalue, 535, 536


complexity, 116, 120, 147 Rayleigh quotient, 536, 537, 568
convergence, 115, 116, 125, 127, 154, roundoff error, 536
784 with variable shift, 536, 537
cubic spline, 131-136, 150, 156, 875, inverse Laplace transform, 475-478, 699,
876, 1021 949, 1090
differentiation, 160, 874, 875, 1019, differential equations, 518
1020 exponential order, 476, 477, 807
divided difference formula, 119, 874, removing a discontinuity, 477, 478
875, 879, 1019, 1020, 1024 roundoff error, 478
forward formula, 124 truncation error, 476
general basis functions, 141, 157 inverse matrix, 45, 87, 88, 107, 334, 868-
Hermite, 128-130, 131, 356, 367, 592, 871, 873, 1013, 1014, 1016-1018
594,799 inverse problem, 658, 680-685, 991, 1131
ill-conditioning, 136, 141, 787 B-spline, 683, 991, 1131
in n dimensions, 148, 152, 879, 882- Monte Carlo simulation, 994, 1134
884, 1025, 1027-1029 optimally localised averages, 684
in two dimensions, 147-152, 879, 880, regularisation method, 681, 992, 1132
1024, 1025 inverse sine transform, 473
inverse, 127, 154, 289, 290, 338 INVIT, 537, 966, 968, 989, 1107, 1108, 1129
Lagrangian, 114-117, 153 IRANBIN, 932, 1073
linear, 114, 122, 148, 154 IRANPOI, 932, lO74
method of undetermined coefficients, irrational number, 254
136, 141, 148, 157 irregular boundary, 238, 239, 249, 745-747
Neville's, 155, 261 isolated eigenvalue, 554, 972, 1112
nonlinear equation, 338 iterative methods, 98--101, 748-755
of analytic functions, 152 cautious, 223
order, 115 convergence, 99, 277, 279, 282, 285,
osculatory, 128 289, 298, 310, 328, 330, 338-340,
parabolic, 122, 289, 297, 351, 353-355, 342, 354, 357, 364, 587, 748, 750,
375, 396 751, 755, 796, 798
pitfalls, 127, 785 convergence criterion, 9, 312, 315-
polynomial, 112-135 319, 342, 368
rational function, 142-147, 879, lO24 efficiency index, 285, 298, 310
roundoff error, 122, 154, 155, 785 eigenvalue problem, 529, 534, 549,
solution of nonlinear equation, 278, 558, 811
289, 290, 297, 299 elliptic equation, 744, 748
trigonometric, 462 integration, 220
truncation error, 114, 116, 119, 122, limitations, 300
133, 783-785 linear equations, 62, 98-102
interpolation property, 759 nonlinear equation, 272, 277, 282, 285,
interpolation series, 115, 154, 784 298, 300, 310, 328, 338-340, 342,
interval arithmetic, 31, 32 796, 798, 799
inverse cosine transform, 473 of incommensurate order, 288
inverse Fourier transform, 462, 469, 474, order, 280, 283, 289
947-949, 1088-1090 roundoff error, 101, 313, 315
inverse interpolation, 127, 154, 278, 289, iterative refinement, 83-85, 105, 536, 780,
290, 338, 785 782, 869, 1014
inverse iteration, 529, 530, 534-541, 546, convergence, 83
562, 966, 969, 970, 984, 988, limitations, 86
1107, 1109, 1110, 1124, 1129
convergence, 535, 536, 568, 811 Jacobi method, 98, 542, 748, 749
defective matrix, 540, 568, 811 complexity, 749
eigenfunction, 635, 673, 675 convergence, 99, 749
eigenvector, 529, 535, 540, 562 Jacobi polynomial, 204, 896, 1040
generalised eigenvalue problem, 540, Jacobian, 325, 327, 328, 330, 333, 334, 344,
572,635 449,452,613,617,622,629,920-

922, 978, 980, 987, 1061-1063, convergence, 310


1118, 1120, 1127 convergence failure, 311
Jordan canonical form, 527, 814 roundoff error, 311
Jordan matrix, 527, 814 LAGURE, 891, 892, 1036, 1037
LAGURW, 206, 896, 1041
Karmarkar, N" 389 Lanczos r-method, 520, 656
Keller, H. B., 623 Lanczos method, 557, 558
kernel. 658, 995, 1135 roundoff error, 558
averaging, 684 Lanczos sigma factor, 463, 472, 517, 806
centrosymmetric, 697 differentiation, 463
convolution type, 660, 697 Lane-Emden equation, 653, 655, 817
degenerate, 661, 672, 674, 676, 677, LAPINV, 477, 478, 949, 1090
697,698 Laplace transform, 475-478, 518, 699, 807
discontinuous, 665 basic properties, 476
eigenfunction, 662, 672, 674-676 convolution, 476
eigenvalue, 662, 672, 674, 676 exponential order, 476, 477, 807
positive definite, 660 ill-conditioning, 475
singular, 664, 678 see also inverse Laplace transform
symmetric, 660, 670, 672, 673 Laplace's equation, 110, 706, 742
Klee, V., 399
higher order methods, 745, 770
knot, 132, 134, 136, 138, 150, 157, 877, 878,
variational formulation, 762
880-884, 1022, 1023, 1025-1029
see also Poisson's equation
Knuth, D. E., 28, 250, 932, 1073
Laurent series, 219
Kopal, Z., 160, 196, 206
Lavely, E. M., 682
Kowalik, J., 398
LAX, 736, 1000, 1140
Krogh, F. T., 616
Lax-Friedrichs method, 733, 735, 739
Kronecker's delta, 81, 113
artificial damping, 733
KRONRD, 899. 900, 931, 932, 990, 1044,
Lax-Wendroff method, 735, 736, 739, 1000,
1045. 1072, 1073, 1130
1140
Kronrod rule, 198, 207, 221, 225, 898, 899,
amplification matrix, 768, 821
1043, 1044
artificial damping, 735
Krylov's method, 571
Krylov, V. 1., 177, 196,201,206 boundary condition, 735
Kulisch, U. W., 31 several space variables, 740
kurtosis, 402, 403, 408, 410, 411, 413, 415 stability, 735
leakage in power spectrum, 470, 471
L-curve, 681, 683, 994, 1134 leapfrog method, 768
L1-approximation, 429-431, 508-510, 965, least squares, 46, 60, 88-90, 428, 432-
1105, 1106 457, 511, 516, 811, 874, 933-945,
in multiple dimensions, 965, 1106 1019, 1074-1087
L1 norm, 429, 430 choice of degree, 438, 443
L2- approximation, complexity, 437
see least squares approximation correlation, 449
L2 norm, 428, 429, 433 design matrix, 433, 436, 440
L",,-approximation, 430, 431, 494-507 differential equations, 641
Loo norm, 428, 430 differentiation, 441, 446, 456, 512
Lp norm, 428 error ellipsoid, 437
Laarhoven, P. J. M. van, 391 error in both x and y, 449-451, 946.
LAG lTC, 917, 919, 1010 1087
LAGITR, 898, 919, 1042, 1061 error in parameter estimate, 434, 437,
Lagrange's multiplier, 389 444,454
Lagrangian interpolation, 114-117, 153, five-point parabola, 512, 804
172, 177, 339 for solving integral equation, 669, 678
Laguerre polynomial, 49, 205, 206, 340, 396, for solving partial differential equa-
896, 1040, 1041 tion, 761, 762
Laguerre's method, 303, 308-311, 918, 919, Householder's transformation, 436
1060. 1061 in n dimensions, 943-945, 1084-1086

in two dimensions, 457-459, 516, 941, linear elementary divisor, 527, 529
942, 1082, 1083 linear equations, 59-110, 779-783, 867- 874,
inverse problem, 681-683, 991- 994, 1010, 1012-1019
1131- 1134 back-substitution, 64, 69, 76, 81, 87
integration, 512 band matrix, 65, 98, 100, 103, 872,
Levenberg-Marquardt method , 453 1017
linear, 432-449, 933-945, 1074- 1087 Cholesky's algorithm, 73, 106, 871,
Monte Carlo simulation, 454 1016
Newton's method, 452 condition number, 78, 79, 82, 86, 89,
nonlinear, 451- 457, 947, 1087 96,780,782
normal equations, 433, 451 Crout's algorithm, 74, 75, 868, 1014
outliers, 431, 438, 441 direct methods, 59, 62, 98, 100
regularisation, 445, 681, 939,941 - 943, Doolittle's algorithm, 74
945, 1080, 1082-1084, 1086 existence and uniqueness of solution,
singular value decomposition (SVD), 60
88, 89, 436-442, 455, 513, 678, forward substitution, 69, 76, 81
874, 1019 Gauss-Jordan elimination, 69, 87, 103
smoothing, 440, 456 Gauss-Seidel method , 99, 100, 748,
straight line fit, 511, 946, 1087 750,783
trigonometric approximation, 463, Gaussian elimination, 63-73, 98, 100,
513 103, 104, 108, 867, 872, 1012,
using B-spline, 445, 939-945, 1081}- 1017
1086 ill-conditioning, 44, 61, 78, 79, 85,
using orthogonal polynomials, 442- 86, 96, 780, 868-870, 873, 1013,
448, 933-936, 1074-1077 1014, 1016, 1018
left eigenfunction, 673 iterative methods, 62, 98-102, 748,
left eigenvector, 525, 537, 636, 968, 972, 752
982, 1l08, 1112, 1122 iterative refinement, 83, 85, 86, 105,
Legendre functions 780, 782, 869, 1014
associa.ted, 964, 1104 Jacobi method, 98, 748, 749
Legendre polynomial, 58, 195, 205 , 340, least squares solution, 60, 88-90
652, 654, 670, 817, 895, 963, overdetermined system, 60, 88, 90,
1040, 1104 435, 436, 669, 678
Lehmer, D. H., 249, 308 roundoff error, 65, 67, 80, 82, 87, 90,
Leibnitz, G. W., 14 101, 106, 780
Lelevier, R., 739 singular value decomposition, 88- 97,
level, 55 873, 1018
level-index, 55, 777 with singular matrix, 89
Levenberg-Marquardt method, 330, 453, linear interpolation, 114, 122, 131, 154, 278,
454 747
Lewy, H., 730 in n dimensions, 148, 880, 1025
limit point, 672, 675 linear programming, 379-389, 504, 506,
Lindberg, B., 653 508, 669, 928, 929, 965, 966,
line search, 362, 367, 375, 376 1069, 1070, 1106
criterion for acceptance of minimum, basic feasible vector, 382 , 386, 928,
367 929, 1069, 1070
with derivatives, 367, 925, 1066 degeneracy, 382, 384, 399, 506, 929,
without derivatives, 375, 927, 1068 966, 1071, 1107
linear approximation, 431-445, 503, 504, dual problem, 389, 504, 506
508, 933- 945, 965, 1074-1087, feasible vector, 380
1105, 1106 fundamental theorem, 382
Ll-approximation, 508, 965, 1l05, simplex method, 381, 385, 389, 504,
1106 928, 929, 966, 1069, 1070, 1106
linear congruential method, 250, 908, 1052 slack variable, 381, 389
linear differential equation, 576, 621, 623, standard form, 381, 929, 966, 1070,
627, 631 , 633, 648 1106
general solution, 621 tableau , 385, 389

LINES, 997, 1137 Euclidean norm, 99, 564


LINFITXY, 946, 1087 Hermitian, 62, 526, 536, 548, 552, 967,
LINL1, 965, 966, 1106 1107
LINMIN, 925, 926, 954, 1066, 1067 Hessenberg, 63, 103, 537, 541, 553,
LINMNF, 375, 927, 928, 1068, 1069 559, 570, 972, 1113
LINRN, 149, 880, 1025 in C function, 1010
Lipschitz condition, 814 inverse, 87, 88, 107, 868- 871, 873,
little-endian, 21 1013, 1014, 1016-1018
LLSQ, 938, 1079 norm, 78, 94, 99, 369, 553, 564
Lobatto quadrature formula, 198, 216, 263, null-space, 60, 89, 90
264,790 orthogonal, 63, 88, 91, 109, 541, 549,
local coordinates, 759 558, 969, 1109
local maximum, 360 positive definite , 62, 73, 99, 106, 360,
local minimum, 346, 359, 360 528, 754, 781, 871, 1016
local truncation error, 578, 582, 585, 587, range, 89, 805
588, 601, 603, 627, 709, 730 rank, 60, 88, 89, 96, 524, 527
LOCATE, 880, 1025 similarity transform, 526, 541, 549,
logarithmic singularity, 263, 790, 893, 894, 554, 555, 557, 564
1038, 1039 singular, 61 , 67, 89, 97, 108, 524
Lorentzian distribution, 410, 411, 417 singular value decomposition, 88- 97,
Louie, S. G., 393 873, 1018
Love's equation, 696, 697 singular values, 88, 90
low-pass filter, 463 sparse, 61, 81 , 98-101, 108, 334, 366,
lower Hessenberg matrix, 63 533, 557, 558, 725, 741, 748, 763
lower triangular matrix, 63, 71, 72, 76, 549 spectral norm, 78
LP problem symmetric, 62, 104, 526, 541 , 544, 550,
see linear programming 871, 1016
LU decomposition trace, 533
see triangular decomposition triangular, 63, 69, 71, 72, 76, 87, 98,
Lutton, J.-L., 392 103, 107, 549, 558, 779
Lyness, J. N., 226, 301-304 tridiagonal, 62, 103, 132, 537, 541,
548, 550, 557, 632, 779
Maclaurin series, 12, 41, 56, 479, 483, 488, unitary, 63, 78, 548
778, 950, 1091 W$, 538, 546
Madelung transformation, 643 with nonlinear elementary divisor,
mantissa, 19, 20, 23- 25, 28, 37, 55 527, 540, 553, 565, 567, 568, 571,
Marsaglia, G., 250 811, 967, 1108
Mathieu equation, 655 matrix method for stability, 711, 718
MATINV, 868, 1013 maximisation, 345
matrix maximiser, 347, 353, 358, 924, 1065
balancing, 553, 555, 562, 971, 1112 maximum, 345
band, 62, 65, 98, 100, 103, 544, 634, local, 360
722, 744, 872, 877, 1017, 1022 maximum likelihood , 429, 456, 515
condition number , 78, 79, 82, 86, 89, maximum principle, 709
96, 669, 679, 780, 782 Maxwell, J. C., 242
defective, 527, 540, 553, 565, 567, 568, MCARLO, 905, 907, 1049, 1051
571, 811, 967, 1108 Mead, R., 361
deflation, 531, 560, 568 mean, 402, 403, 405, 411, 413,415, 418, 421,
derogatory, 527 430
determinant, 64, 68, 524, 529, 545, binomial distribution, 407
570,572,634,636,637,868,870- error in , 420
872, 985, 1013-1017, 1125 Lorentzian distribution, 410
diagonal, 62, 69, 88, 94, 526, 541, 550 Poisson distribution, 409
diagonalisation, 542, 550 mean value theorem, 178, 279, 586
diagonally dominant, 106 measurement error, 2, 401
equilibration, 66-68, 75, 79, 86, 104, median, 403, 405, 408, 422, 430, 431
105 memory, 10, 633
Menke, W., 685 Hermite interpolation, 356, 396, 924,
meromorphic function, 304 1065
Merson, R. H., 604 ill-conditioning, 802
method of artificial pole, 498, 502 least squares, 433, 452, 453
method of characteristics, 729 line search, 362, 367, 375, 376, 925,
method of lines, 721, 724, 725, 740, 763, 927, 1066, 1068
997, 1137 linear programming, 379-389, 928,
method of simultaneous displacement, 98 929, 966, 1069, 1070, 1106
method of steepest descent, 362, 397 method of steepest descent, 362, 397
method of successive displacements, 99 multivariate, 359-378, 924-927, 1066-
method of undetermined coefficients, 104, 1068
164-168, 180, 665, 666 Newton's method, 363
differential equation, 577, 578, 582, parabolic interpolation, 351, 353, 355,
816 375,396
differentiation, 164-168 Powell's method, 372, 376
integration, 194, 209, 242 quadratic termination, 366, 370-373
interpolation, 136, 141, 148, 157 quasi-Newton method, 365-371, 924,
quadrature, 180 1066
truncation error, 165, 578 roundoff error, 349-352, 358, 369, 376,
Metropolis algorithm, 390-395 923, 1064
midpoint method, 608 simplex method, 361, 381, 389, 928,
stability, 648 929, 966, 1069, 1070, 1106
midpoint rule, 178, 179, 183, 194,601,694 simulated annealing, 390-394
composite, 182 solving nonlinear equations, 329
midrange, 430, 431 using derivative, 351, 356, 359, 363,
Milne's method, 594, 596, 602 371
minimax approximation, 428, 494-507, 521, minimiser, 345, 923, 924, 926, 1064-1066,
522, 952-965, 1093-1105 1068
Chebyshev theorem, 495 minimum, 345
differential correction algorithm, 504, bracketing, 348, 351, 367, 923, 1064
506, 964, 1105 distinguishing from a maximum, 351,
discrete data, 503-507, 964, 1105 353
error bounds, 496 distinguishing from a saddle point,
error curve, 495, 498, 501-503 368,369
exchange algorithm, 503 domain of indeterminacy, 350, 355,
method of artificial pole, 498, 502 358
multivariate, 522 existence and uniqueness, 346
polynomial, 496, 503 global, 345, 346, 396
rational function, 496, 503, 505, 506, local, 346, 359, 360
952-964, 1093-1105 spurious, 350
relative error, 500 MINMAX, 506, 964, 1105
Remes algorithm, 496, 952, 1093 Minty, G. J., 399
smoothing, 507 Miranker, W. L., 31
with constraints, 506, 522 mixed radix representation, 469, 516
minimisation, 329, 345-400, 801-803, 923- mode, 404, 405, 408, 413
929,966,1064-1071, 1107 modelling error, 2
alternating variables method, 363 modified midpoint method, 608
Brent's method, 354-356, 923, 1065 modifier, 588
conjugate gradient method, 377 Moler, C. B., 59, 88
constrained, 347, 379, 389, 928, 1069 moment problem, 699
convergence criterion, 350, 355, 368 moments, 208, 897, 1042
Davidon's method, 356-359, 396, 924, monomial, 239, 242
1065 degree, 242
direction set method, 371-377, 926, monomial rules, 242-247, 906, 907, 1050,
1068 1051
golden section search, 348, 349, 355, degree 3,244,793,794,906,907, 1050,
375, 923, 1064 1051
degree 5, 245, 793, 794, 906, 907, 1050, product rule, 238-242, 268, 793, 903,
1051 904, 1048
one-point rule, 244 recursive evaluation, 237, 238
over entire space, 268, 794 roundoff error, 256
over hypersphere, 268, 793 rule of degree 3, 245, 793, 794
monotone convergence, 277 rule of degree 5, 246, 793, 794
monotone divergence, 277 singularity, 241, 246, 258
Monte Carlo method, 238, 247-253, 416, Stroud's formula, 245, 906, 907, 1050,
417, 457, 907, 1051 1051
convergence, 248, 249 truncation error, 246, 248, 256
importance sampling, 252, 794 multiple root, 276, 281, 287, 288, 291, 322,
irregular region, 249 799, 912, 1055
variance reduction, 251, 794 condition number, 320
weight function, 248 in system of nonlinear equations, 327,
Monte Carlo simulation, 454, 509, 682, 991, 344
994, 1131, 1134 Muller's method, 298, 299, 338, 797
Morris, A. H., Jr., 930, 931,1071-1073 Newton-Raphson method, 287
Morton, K. W., 716, 723, 735 rate of convergence, 284, 287, 298,
MSTEP, 593, 606, 614, 617, 618, 724, 864, 338, 797, 799
974, 976-978, 997, 998, 1115- secant method, 284, 338
1118, 1138 multiple shooting method, 622
MULER2, 864, 913-916, 984, 985, 1008, multistep method, 576, 609, 612, 974, 977,
1057-1059, 1124, 1125 1115,1117
MULINT, 240, 241, 903, 905, 1048, 1049 multivariate problems, 152
MULLER, 298, 300, 913-916, 1057-1060 Murphy's law, 319
Muller's method, 297-300, 913-916, 984,
NaN, 9, 20, 47
1057-1059, 1124
NEARST, 121, 875, 879, 1020, 1021, 1024,
convergence, 298, 338, 797
1025
multiple root, 298, 299, 338, 797 negative decimal system, 16
real root, 298, 984, 1125 Nelder, J. A., 361
roundoff error, 314 nested multiplication, 117, 120, 276, 480,
Muller, D. E., 298 485
Muller, M. E., 417 Neumann condition, 702, 705, 706, 708,
multiple eigenvalue, 526, 527, 530, 535, 536, 742, 745, 757
545, 548, 551 irregular boundary, 747
multiple integration, 237-259, 793, 794, neutrino, 423
901-909, 1049-1053 Neville's interpolation, 155, 261, 788
B-spline, 238, 901, 902, 1046, 1047 NEWRAC, 912, 917, 918, 1010
change of variable, 240 NEWRAP, 287, 288, 866, 912, 918, 1008,
choice of method, 257, 258 1055
equidistributed sequence, 254-257, NEWTON, 332, 920, 921, 1061-1063
908, 1052 Newton's equations, 262, 301
hyper-spherical coordinates, 905, 1049 Newton's interpolation formula, 117-128
method of undetermined coefficients, backward, 124, 155, 581, 589
242 divided difference, 119, 160, 174, 194,
monomial rules, 242-247, 268, 793, 789, 874, 875, 879, 1019, 1020,
794, 906, 907, 1050, 1051 1024
Monte Carlo method, 247-253, 907, forward, 124
1051 Newton's method, 328-333, 363, 452, 613,
of tabulated functions, 238, 901, 902, 629, 668, 720, 721, 747, 921, 981,
1046, 1047 988, 1063, 1121, 1128
over entire space, 268, 794 convergence, 328, 364
over fully symmetric region, 244 damped, 329
over hypersphere, 268, 793 modified, 329, 364
over irregular region, 238, 239, 249 with line search, 329, 364
over sphere, 268, 793 Newton-Cotes formula, 177-186
choice, 185 iterative methods, 272, 277, 282, 285,
closed, 177, 685 298, 300, 310, 328, 338-340, 342,
convergence, 181 796, 798, 799
integral equation, 685, 687 Laguerre's method, 308-311, 918, 919,
open, 177, 179, 213, 587, 686 1060, 1061
roundoff error, 199, 200 Levenberg-Marquardt method, 330
truncation error, 177, 179, 181 locating complex roots, 294-297, 913,
weight function, 209, 216, 263, 789, 1056
897, 1041 minimisation techniques, 329
Muller's method, 297-300, 338, 797,
Newton-Raphson method, 285-288, 292,
913-916, 984, 1057-1059, 1124
298,303,337,521,621,654,912,
multiple root, 281, 284, 287, 288, 291,
918, 1055
298, 320, 322, 327, 338, 344, 797,
convergence, 285
799, 912, 1055
modified, 287, 912, 1055
multipoint iteration function, 340
multiple root, 287, 912, 1055 Newton's method, 328-333, 921, 1063
roundoff error, 314 Newton-Raphson method, 285-288,
NGAUSS, 903, 904, 1048 314, 912, 918, 1055
Nicolson, P., 720 polynomial, 304-312, 320, 322, 341,
nine-point formula, 743, 745 918, 920, 1060
eigenvalue problem, 769 quadrature based method, 301-304,
truncation error, 770, 822 916
NLLSQ, 947, 1087, 1088 real root, 273-294, 298, 909-913, 920-
NMINF, 376, 926, 927, 947, 1068, 1087, 922, 1053-1056, 1061-1064
1088 recursive solution, 324
nodal points, 758 refining roots, 302, 308, 311
nodes, 635, 655, 674, 675, 758 regula falsi method, 278-281, 325,
nonlinear approximation, 431, 451-457, 505 336,795
nonlinear elementary divisor, 527, 540, 553, roundoff error, 272, 274, 282, 293, 308,
565, 567, 568, 571, 648, 811, 967, 311,313-315, 316, 342, 795, 915,
1108 1059,1060
nonlinear equation, 271-344, 620, 795-801, secant method, 281-285, 313, 333,
909-922, 1053-1064 338, 909-911, 984, 1053, 1054,
always convergent method, 272, 290- 1124
294, 304, 308 Steffensen's method, 336
bisection, 273-276, 291, 909, 1053 system of, 323-334, 920-922, 1061-
Brent's method, 289-294, 912, 1056 1064
truncation error, 279, 289, 336, 338,
Broyden's method, 333, 334, 922, 1063
797,799
checking the roots, 315
nonlinear hyperbolic equation, 734, 1000,
choice of method, 292, 294
1140
complex root, 294-304, 308, 312, 913- Lax-Wendroff method, 735
920, 1056-1061 nonlinear integral equation, 667, 685, 689,
continuation method, 300, 331, 920, 700, 995, 1135
1061 nonlinear least squares, 451-456, 947, 1087,
convergence criterion, 312, 315-319, 1088
342, 800 Levenberg-Marquardt method, 453
deflation, 300, 308, 311, 312, 318, 326, Newton's method, 452
327, 341, 913, 914, 1057 nonlinear parabolic equation, 719, 724, 997,
fixed-point iteration, 276-278, 325, 1137
342 nonlinear programming, 389
generalised secant method, 333 nonlinear wave, 723
Halley's method, 339, 340, 798 nontrivial solution, 524, 619
ill-conditioning, 301, 305, 317, 319- norm, 78, 79,427, 669
323 choice of, 429
interpolatory method, 289, 299, 338, continuum, 427
797 discrete, 427
error, 427 order, 578, 580
Euclidean, 99, 564 parasitic solution, 580, 582, 585, 586
Lp, 428 roundoff error, 585, 596, 597, 649
matrix, 78, 79, 94, 99, 369, 553, 564 stability, 578-585, 587, 588, 593, 611,
spectral, 78 614-616, 648, 649, 653, 814, 815
vector, 79, 89 starting values, 590, 591, 650
normal distribution, 404-406, 408-411, 413, stiffly stable, 613, 618, 977, 1117
416, 418, 421, 423, 428, 429, 434, three-step fourth-order formula, 582,
442, 454, 932, 933, 1073, 1074 585, 649, 650
full width at half maximum, 405 truncation error, 578, 582, 585, 587,
quartile deviation, 405 588
random numbers, 417, 932, 1073 two-step formula, 650, 815
normal equations, 433, 451 using higher order derivatives, 652
normal probability integral, 522, 810 weakly stable, 584
normalisation numerical model, 1
eigenvector, 524, 529, 967, 1107 numerical viscosity, 734
floating-point number, 19, 20, 24 Nyquist frequency, 460, 471
not-a-knot boundary condition, 132, 134,
136, 138, 150, 156, 875, 877, 882, objective function, 379, 928, 1069
1021, 1023, 1027 off-centred difference formula, 626, 628, 636
NP-complete problem, 392 Oldham, K. B., 50
nuclear power plant, 4 Olver, F. W. J., 322
null-space, 60, 89, 90 one's complement, 18
nullity, 89 open formula, 177, 179, 213, 587, 686
number representation, 13-21 optimal feasible vector, 380, 383, 385, 928,
fixed-point, 17, 19 929, 966, 1069, 1071, 1107
floating-point, 19, 52 optimally localised averages, 684
level-index, 55 optimisation
one's complement, 18 see minimisation
signed-magnitude, 17 order
two's complement, 17 differential equation, 574
number system exponential, 476, 477
balanced-decimal, 51, 774 interpolation formula, 115
balanced-ternary, 17, 774 iterative methods, 280, 283, 289, 336
base, 13, 15, 51 numerical integration method, 578,
binary, 14 580
binary complex, 16, 774 partial differential equation, 702
conversion, 14, 15, 774 order of convergence, 5, 11
decimal, 13 ordinary differential equation
hexadecimal, 15 see differential equation
negative decimal, 16, 51 orthogonal functions, 442, 670
octal, 15 orthogonal matrix, 63, 88, 91,109,541,549,
positional\ 13 558, 969, 1109
quantisation, 22 orthogonal polynomials, 194, 202, 203, 208,
quater-imaginary, 16 438, 446, 458, 933, 934, 1074-
numerical dissipation, 734 1076
numerical instability, 48, 49, 525, 538, 585, Clenshaw's recurrence, 444, 934, 1075
595, 650, 679 in n dimensions, 936-938, 1077-1079
numerical integration method, 576-588 in two dimensions, 458, 935, 1076,
absolute stability, 581, 615 1077
accumulated error, 585, 586 least squares, 442-449, 458
consistency, 578, 580, 585 recurrence relation, 443
convergence, 579, 585 orthogonal transformation, 91, 95, 109,436,
explicit, 577, 584, 587 542, 550, 556, 559, 566
implicit, 577, 587 orthogonalisation, 373, 443
method of undetermined coefficients, Osborne, M. R., 398
577, 578, 582 oscillatory convergence, 277
oscillatory divergence, 277 boundary condition, 702, 705-708,


osculatory interpolation, 128 718, 726, 729, 735, 741, 742, 745,
outliers, 427, 429, 431, 438, 441, 507, 509 747, 756, 761
overdetermined system, 60, 88, 90,435,436, boundary value problem, 702, 741-
503, 669, 678 757, 1001-1004, 1141-1144
overflow, 18-20, 27, 30, 68, 250, 308, 376, characteristics, 703-707, 728-730
482,545,575,617,637,775,868, classification, 703, 706
911, 916, 955, 1010, 1013, 1054, convergence, 713, 714, 717
1059, 1095 Crank-Nicolson method, 712, 996,
1137
π, 12, 16, 51, 773 detecting instability, 728
Pade approximation, 479-484, 950, 1091 dimension, 702
convergence, 484 eigenvalue problem, 703, 744, 747, 769
existence, 518 elliptic, 703, 706,741-757,1001-1004,
truncation error, 480 1141-1144
equilibrium problem, 703
PADE, 950, 952, 964, 1091, 1093, 1105
extrapolation methods, 744
parabolic equation, 703, 705, 707-728, 996-
FFT method, 756
1000, 1136 1140
finite difference method, 708-757,
alternating direction method, 726-
996-1004, 1137-1144
728, 999, 1139
finite element method, 757-763
boundary condition, 707, 708, 718,
hyperbolic, 703, 704, 728-740, 1000,
726, 767
1140
convergence, 713, 714, 717
ill-conditioning, 820
Crank-Nicolson method, 712, 720,
initial value problem, 702, 703, 707-
996, 1137
740, 996-1001, 1137-1141
explicit method, 708-710, 717, 719
Lax-Wendroff method, 735, 736, 740,
finite difference method, 708, 721,
1000, 1140
724, 728, 765, 996-1000, 1136-
method of lines, 721, 724, 763, 997,
1140
1137
implicit method, 711, 712, 719 nonlinear, 719, 724, 734, 747, 997,
method of lines, 721, 724, 997, 1137 1000, 1137, 1140
nonlinear, 719, 724, 997, 1137 order, 702
several space variables, 724, 728, 999, parabolic, 703, 705, 707-728, 996-
1139 1000, 1136-1140
stability, 709-713, 717, 725, 727 quasilinear, 702, 766
system of, 721 roundoff error, 769
three space dimensions, 727 stability, 709-713, 717, 725, 727, 730-
truncation error, 708, 713, 727, 765, 732, 735, 739, 767
767, 820 successive over-relaxation method,
parabolic interpolation, 122, 289, 297, 351, 750-752, 1001, 1141
353-355, 375, 396 truncation error, 708, 713, 727, 729,
parabolic window, 516 730, 742, 745, 765, 767, 820, 822
parallel processing, 35, 378 partial pivoting, 66, 67, 73, 75, 105, 556,
fast Fourier transform, 474 630, 636, 663, 867-869, 872, 985,
integration, 258 1012-1014, 1017, 1125
speed-up, 258 limitations, 66, 780
parallelogram, 759, 760, 772, 822 roundoff error, 80
parasitic solution, 580, 582, 585, 586, 602 particular solution, 60, 621
Parseval 's theorem, 464, 514 partitioning, 757
partial differential equation, 701-772, 819- Pascal matrix, 782
822,996-1004,1136-1144 Patterson rule, 198, 221
alternating direction implicit iterative PCOR, 933, 1074
(ADI) method, 753-755, 1003, Pegasus method, 336
1143 Penney, W., 16
alternating direction method, 726-- periodic boundary condition, 756, 770
728, 999, 1139 permutation matrix, 73, 105, 636
piecewise polynomial, 130, 150, 431, 445 Laguerre's method, 308-311, 918, 919,
Piessens. R., 176, 198 1060, 1061
Pissanetzky, S., 60 least squares, 443, 446, 933-- 936,
pivot , 64, 360 1074 -1077
pivoting. 66, 67, 80, 81 , 133 Legendre, 58, 195, 205, 340, 652 , 654,
complete. 66, 81, 100 670,81 7, 963, 1104
limitations. 66 minimax approximation, 496, 503
partial, 66, 67, 73, 75, 80, 105. 556, multiple root, 276, 322
630, 636. 663 orthogonal, 194, 202, 203, 208, 438,
PLEG. 963, 1104 443, 444 , 446, 458
PLM, 964, 1104, 1105 real roots, 305- 307
point of inflection, 347, 357, 396 , 924, 1065 refining roots , 308, 311
Poisson distribution. 408- 410, 414. 430, roots, 57, 304, 312 . 319, 320, 322, 918,
456, 932, 1074 919, 1060, 1061
random numbers, 417, 932, 1074 Sturm sequence, 306
Poisson's equation, 741. 742, 748. 749, 753, polynomial extrapolation, 174, 608, 618,
754. 756. 1002, 1003. 1142, 1143 979. 1119
FFT method . 756, 757 polynomial interpolation, 112-135, 874--
five-point formula , 741 , 743 879, 1019- 1024
nine-point formula, 745. 770. 822 complexity. 116, 120
Polak-Ribiere formula, 378 existence and uniqueness, 113
POLD, 949. 1090 in several dimensions, 147-152, 879,
pole, 274, 287. 304, 477. 478. 498. 502, 879, 1024
1024 truncation error, 114, 116, 119, 122.
POLEV2, 935, 1077 783- 785
POLEVL, 444, 933. 934, 1075, 1076 polynomial-time algorithm, 389
POLEVN, 937, 938, 1078, 1079 POLYR, 311, 321, 323, 897, 898, 918, 919,
POLEVN1, 937, 1078, 1079 1042, 1060
POLEVN2, 937, 938. 1078, 1079 poly trope, 655
POLFIT, 444, 446 . 933, 934, 939, 1074- positional number system, 13
1076, 1080 positive definite kernel, 660
POLFIT1, 934, 935, 937. 1075- 1078 positive definite matrix, 62, 73, 99. 106,
POLFIT2, 459, 935, 1076. 1077 360, 365 . 453, 528, 754, 781
POLFITN, 459, 864 , 936 938, 1077- 1079 positive semidefinite matrix, 62
POLORT.934- 937. 1076-1078 potential problem, 770
POLY2 , 150, 879, 1024 Powell's method, 372 , 376
POLYC, 303. 917, 919,1010 Powell, M_ J. D., 363, 364, 366, 373
polyhedron, 380 power method , 529 -534, 673, 811
POLYL1, 508, 509, 965, 966, 1105, 1106 annihilation, 531
polynomial, 301, 304-313, 431. 949, 1090 complex conjugate eigenvalues, 530,
characteristic, 524. 525, 529. 571 568, 811
Chebyshev, 58, 123, 203-205, 340, convergence, 530, 811
396,485,486.488,670,951 , 1092 convergence criterion, 534
Chebyshev expansion, 485. 951, 1092 defective matrix, 567, 811
checking the roots. 312 deflation, 531, 568
complex root, 308-312 multiple eigenvalue, 530
convergence criterion, 312 Rayleigh quotient, 532
deflation, 308, 311. 312, 341, 919, 1061 roundoff error, 530, 531, 533
differentiation, 949, 1090 subdominant eigenvalues, 530
evaluation, 117, 276, 444, 480, 949, with shift, 531
1090 power series, 482, 951 , 1092
Graeffe's root squaring method. 308 power-residue method , 250
Hermite, 58, 206, 897, 1041 predictor, 587, 613, 650, 686
ill-conditioning, 301, 305, 319- 322 Adams-Bashforth, 589
Jacobi, 204, 896, 1040 predictor-corrector method. 587-599, 725,
L1-approximation, 508-510, 965, 1105 predictor-corrector method, 587-599, 725,
Laguerre, 49, 205, 340, 396 Adams, 590, 593, 596, 615, 976, 1116
adjusting step size, 591, 592 choice of shift, 551


choice of order, 589 convergence, 549, 551
convergence of iteration, 587, 613 multiple eigenvalue, 551
for partial differential equation, 721 with shift, 549, 552
Hamming's, 616 QR algorithm, 92, 94, 549, 553, 557-562,
improving accuracy of corrector, 650, 972, 1113
815 choice of shift, 560
interpolation for final value, 593 convergence failure, 561, 570
modifier, 588 deflation, 561
one iteration on corrector, 598, 649, double, 559
815 eigenvector, 562
stability, 588, 593, 649, 650, 815 with implicit shift, 559
starting values, 590, 591, 593, 650, quadratic equation, 29, 30
816,975,977,1115,1117 quadratic interpolation, 122, 289, 297, 351,
stiffly stable, 613, 614, 977, 1117 353-355, 375, 396
truncation error, 588, 589, 593, 594 quadratic model, 363
variable order variable step size, 592 quadratic termination, 366, 370-373
Press, W. H., 250, 251, 381, 392, 908, 1052 quadrature, 175-229, 577, 789-792, 885-
prime number, 15, 465, 469 900, 1030-1045
probability distribution, 248, 252, 403, 794 adaptive, 217, 221-228, 898,1043
binomial, 406-408 automatic, 220, 885, 899, 900, 1030,
chi-square, 412-·415 1044
correlation coefficient, 421 B-spline, 238, 886, 1031
F distribution, 415, 416 Boole's rule, 180
Gaussian, 405 bracketing property, 179
Lorentzian, 410, 411 Cauchy principal value, 210, 217, 265,
normal, 404-406 900, 1045
Poisson, 408-410 change of variable, 212
Student's t, 411 Chebyshev, 201, 262
probability integral, 248, 406 choice of formula, 227
probable error, 248 composite formula, 181-183, 196, 207
product, 38 convergence criterion, 222, 224, 227
product Gauss rule, 240, 243, 761, 903, 904, cubic spline, 186, 885, 1030
1048 <-algorithm, 192, 212, 887, 1032
product rule, 238, 242, 243, 903, 904, 1048 extrapolation methods, 187-193, 212,
over disk, 239 215, 225, 886, 88~ 1031, 1032
over hypersphere, 793 Filon's formula, 216, 898, 1042
over sphere, 268, 793 Gauss-Chebyshev, 203, 888, 889,
program, 861-1144 1033, 1034
clarity, 9 Gauss-Hermite, 206, 893, 897, 1037,
debugging, 8 1041
efficiency, 10 Gauss-Jacobi, 204, 896, 1040
quality, 9 Gauss-Kronrod rule, 198, 207, 221,
reliability, 10 225, 898, 899, 1043, 1044
programming, 8 Gauss-Laguerre, 206, 892, 896, 1037,
bug, 3, 10 1041
pitfalls, 8 Gauss-Legendre, 196, 200, 888, 895,
reverse communication, 911,914, 916, 900, 1032, 1040, 1044
985, 1054, 1057-1059, 1125 Gaussian formula, 194-198, 202-209,
roundoff error, 9 214, 263, 887-897, 1032-1042
projectile, 2, 3 Gregory's formula, 187, 261
Property (A), 750 improper integrals, 189, 202, 209-219,
pseudo-random sequence, 249, 908, 1052 227-229
pseudo-singularity, 215 least squares formula, 512
Lobatto rule, 198, 216, 264, 790
QD table, 482 logarithmic singularity, 790, 893, 894,
QL algorithm, 548-553, 558, 969, 1109 1038, 1039
method of undetermined coefficients quotient-difference (QD) algorithm, 482,


180, 194, 209 522
midpoint rule, 178, 179, 182, 194
Newton-Cotes formula, 177, 186, 199, Rabinowitz, P., 81, 176, 183, 187, 190, 198,
200, 209, 216 209, 216, 226, 243, 255, 304, 379,
of oscillatory functions, 215-217, 228, 480, 491, 601, 602
263, 265, 898, 1042 Radau quadrature formula, 198
of periodic functions 188 224 228 radiative transfer, 700
261 "" radioactive decay, 408, 423
of tabulated functions, 186, 229, 885, radix point, 14, 19, 21
886, 1030, 1031 Ralston, A., 81, 187,480,491,601,602,651
over infinite interval 229 891-893. RANI, 908, 970, 971, 1052, 1110. 1111
1036, 1037 ' , random numbers, 154, 173, 248, 249, 416,
parallel processing, 258 439, 454, 509, 513, 514, 535
Radau rule, 198 Binomial distribution, 417, 932, 1073
rectangle rule, 178, 179 generation, 250, 251, 416, 794, 908,
Romberg, 188, 197, 227, 261, 886, 932, 1052, 1073, 1074
1031 normal distribution, 417, 932, 1073
Poisson distribution, 417, 932, 1074
roundoff error, 185, 188, 197-201,215,
uniform distribution, 251-253, 416,
219,223,227,262,791,887,1032
908, 1052
Simpson's 3/8 rule, 180, 186
RANF, 251, 253, 864, 908, 1052
Simpson's rule, 180, 182, 186, 188,
RANGAU, 932, 994, 1073, 1134
227, 885, 1030
range of matrix, 89, 805
singularity, 189, 202-219, 227, 228,
rank, 60, 88, 89, 96, 382, 524, 527
887-891, 1032-1034, 1036
Rao, K. R., 465, 469
square root singularity, 888--891,
rational function, 116, 431, 949, 950, 1090,
1033, 1034, 1036
1091
trapezoidal rule, 178, 179 181 186
conversion to continued fraction, 480
188, 190, 192,211, 228, 240, 261:
diagonal, 143
885, 886, 1030, 1031
differentiation, 950, 1091
truncation error, 177, 179, 181, 188,
evaluation, 480, 949, 950, 1090, 1091
190, 194, 196, 202, 885, 1030
rational function approximation 479 506
weight function, 202-209, 214, 216, 865,950-965,1091-1105' ,
228, 263, 888-897, 1033-1041
Bessel function, 955-962, 1096-11 03
quadrature based method, 301-304, 916 Dawson's integral, 521, 810, 963, 1103
quadrature method, 662-668 988 995 error function, 521, 809, 955, 1095
1128, 1135 " ,
Fermi integrals, 521, 810, 963, 1104
eigenvalue problem, 673, 989, 1129 Gamma function, 501, 954, 1095
Fredholm equation, 662-668, 677, minimax, 496-503, 505, 952-964,
678, 818, 819, 988, 1128 1093-1105
Volterra equation, 685-693, 695, 995, Pade, 479-484, 950, 1091
1135 roundoff error, 499, 518
quadruple precision, 25 using Chebyshev polynomial, 489,
quantisation, 22 493, 952, 1092
quartile deviation, 404 rational function extrapolation, 171, 174,
normal distribution, 405 608, 617, 979, 1119
quasi-Newton condition, 366 rational function interpolation, 142-147,
quasi-Newton method, 365-371, 376, 453, 879, 1024
924, 1066 existence, 142
BFGS formula, 366, 924, 1066 pitfalls, 147
DFP formula, 366, 397 RATNAL, 146, 879, 1024
line search, 367, 925, 1066 Rayleigh quotient, 532, 536, 537, 568, 967,
pitfalls, 368 970, 1107, 1111
roundoff error, 369 integral equation, 673
quasilinear equation, 702, 766 real roots
quater-imaginary, 16 see roots
real symmetric matrix, 526, 550 discrete approximation, 503


eigenvalue problem, 541- 553, 565, initial approximation, 496
566,968- 971, 1108-1111 relative minimax approximation, 497
reduction to tridiagonal form , 542, REPS , 866, 1011
544 , 968, 1108 residual , 32 , 44 , 77, 82 , 84, 433, 452, 669,
triangular decomposition, 73, 106, 750, 992, 1132
871 , 1016 response function , 473
rectangle rule, 178, 179, 688 reverse communication, 911, 914 , 916, 985,
rectangular elements, 760, 772, 822 1054, 1057- 1059, 1125
continuity, 761 Riccati transformation, 650
recurrence relation , 47, 50, 679, 895- 897, Rice, J. R. , 66, 226, 605, 701
1039- 1041 Richa rdson's h ---+ 0 extrapolation
B-spline, 137 see extrapolation method
Bessel function , 56, 956, 959- 961 , 963, Richtmyer, R. D. , 716, 723, 735
1097, 1100, 1102, 1103 Riemann integral, 182, 209, 217, 254, 791
Chebyshev polynomial, 485, 518 Riemann sum, 182, 188
Clenshaw's, 444 , 486, 514, 934, 1075 Riemann zeta function, 187, 230, 232, 233,
for continued fraction, 481 267, 792
for evaluating a polynomial, 117, 444 Ritzwoller , M. H ., 682
Hermite polynomial, 897, 1041 RK2 , 974, 1114
Jacobi polynomial, 896, 1040 RK4, 974, 976, 977, 1114,1116, 1117
Laguerre polynomial, 896, 1041 RKM, 604, 605, 617, 625,973,974,997,998,
orthogonal polynomials, 443 1113, 1114, 1138
stability, 49, 50, 778, 779, 956, 960, RLS , 682, 991, 994, 1131 , 1134, 1135
961 , 1097, 1100, 1102 RMK , 949, 950, 1090, 1091
recursive method RMK1 , 950, 954, 1091, 1095
multiple integration, 238 RMKD, 950, 1091
system of nonlinear equations, 324 RMKDl, 950, 1091
refining a root, 302, 308, 311, 916, 918, 919, robust approximation, 429, 509
1061 Rolle 's theorem , 114
regula falsi method, 278-282 Romberg integration, 188, 227, 261 , 303,
convergence, 279 886, 1031
for two variables, 325 convergence, 188, 197
Fourier's condition, 336, 795 ROMBRG, 225, 228, 886, 1031
modified, 279, 281, 336 root, 273-344, 909- 922, 1053-1063
truncation error, 279 bracketing, 273, 278, 280, 282, 290,
regularisation, 445, 447, 678, 680, 681 , 939, 294, 297, 306, 325, 913, 1056
941- 943, 945, 992, 1080, 1082- complex, 294- 304, 308, 312, 913-920,
1084, 1086, 1132 1056- 1061
in two dimensions, 457 deflation, 300, 308, 311, 312, 318, 341,
L-curve, 681, 683, 994, 1134 913, 914, 919, 1057, 1061
regularisation parameter, 940, 941, 943, domain of indeterminacy, 274, 276,
944, 946, 992, 1081 , 1083- 1085, 284,292, 293,313,316, 317, 795,
1087, 1132 915 , 918, 1059, 1060
regularised least squares, 681 , 991 , 1131 multiple, 276, 281 , 284, 287, 288, 291 ,
regularity condition, 645 298, 320, 322, 327, 338, 344, 797,
Reinsch, C., 59, 91, 95, 381, 436, 523, 873, 799, 912, 1055
1018 of analytic functions, 301-304, 916
rejection of data points, 422 of polynomials, 304-312, 319, 320,
relative error, 23, 224 , 315, 598 322, 918- 920, 1060, 1061
relative stability, 581 , 584 refining, 302, 308, 311 , 916, 918, 919,
relaxation method , 750 1061
relaxation parameter, 750, 763, 1002, 1142 Rosenbrock's function, 370, 376, 397
optimum value, 751 ROUND, 46, 867, 1011
reliability, 10, 220 roundoff error, 2, 4, 9, 22-50, 1010
REMES, 499-502, 952, 954, 1093, 1095 approximation, 437, 443, 444, 454,
Remes algorithm , 496, 952, 1093 499, 502, 518, 519
backward analysis, 33, 37-39, 42, 44, saddle point, 359, 360, 364, 368, 378, 397,
80,82, 564 802, 925, 1066
complex arithmetic, 27 detecting, 369
convergence criterion, 172, 223, 317 safety factor, 603
detection, 172, 223, 317, 350 Saha's equation, 343
differential equation, 585, 596, 597, sampling, 460, 461
599, 614, 623, 649, 815 sampling theorem, 460
differentiation, 162, 169, 172, 787, Sande-Tukey FFT algorithm, 467
885, 1030 SAVE, 864
eigenvalue problem, 530, 531, ,533, scalar product, 32, 55, 777
536,543,544,553,556,558,564- scaling a matrix, 66, 68, 75, 79, 104, 105
566, 569 scaling a nonlinear system, 325
estimation, 31, 32, 313, 315 Sch ur, I., 308
Schwarz, H. R., 763
floating-point arithmetic, 25, 26, 28,
SEARCH, 913, 1056
31, 53, 775, 777
searching an ordered table, 121, 875-877,
forward analysis, 33, 34, 42
880, 1021, 1022, 1025
integration, 185, 188, 197-201, 215,
SECANC, 910, 911, 1010
219, 223, 227, 256, 262, 791, 887,
SECANI, 864, 866, 911, 984, 985, 1008,
1032
1054, 1124, 1125
interpolation, 122, 154, 155, 785
SECANT, 909, 910, 1053
inverse Laplace transform, 478 secant method, 281~285, 292, 337, 356,
linear equations, 65, 67, 80--87, 90, 621,625, 654, 723, 909-911, 984,
101, 106, 780 1053, 1054, 1124
minimisation, 349-352, 358, 369, 376, convergence, 282, 338
923, 1064 generalised, 333
nonlinear equation, 272, 274, 282, 293, multiple root, 284, 338
308,311,313-316, 342, 795, 915, roundoff error, 282, 313
1059, 1060 Secrest, D., 196, 206
partial differential equation, 769 seed, 251, 908, 932, 1052, 1073, 1074
quadratic equation, 29, 30 separable boundary condition, 619, 630,
role of compilers, 42 979, 981, 985, 986, 1119, 1121,
statistical analysis, 32, 200 1126
summation, 33, 34, 36, 37, 39, 41, 199, series
234, 256, 777 asymptotic, 12, 116, 187, 230, 520
Routh-Hurwitz criterion, 649 divergent, 236, 268, 773
Runge, C., 125 series summation
Runge-Kutta method, 591, 600-607, 614, see summation
618, 687, 722, 724, 725, 973, 975, SEn,IAT, 981, 983-986, 1121, 1124-1126
977,1113,1115,1117 Shanno, D. F., 366
adjusting step size, 603, 604, 973, 1113 shell sort, 929, 1071
correcting for truncation error, 604, Sherman-Morrison formula, 108, 780
651, 816 shock, 734, 768
efficiency, 603 shooting method, 620-625, 654
eigenvalue problem, 640
for second-order equation, 652
limitation, 622
fourth-order, 602, 605, 651, 974, 1114
multiple, 622
implicit, 607, 651, 816
roundoff error, 623
second-order, 600, 974, 1114
SHSORT, 929, 1071
stability, 602, 604, 605, 651, 816 side lobes, 471
system of differential equations, 607 sign changes, 273, 290, 294, 295, 306, 438,
third-order, 602 545
truncation error, 601, 603, 604 signed-magnitude, 17
Runge-Kutta-Fehlberg method, 605 significant digit, 14
Runge-Kutta-Gill method, 602 significant figures, 23, 418
Runge-Kutta-Merson method, 604 similarity transform, 526, 541, 549, 554,
Rutishauser, H., 482 555, 557, 673
roundoff error, 564 integration, 189, 202, 209-219, 227,


SIMPL1, 508, 965, 966, 1105, 1106 228, 241, 246, 258, 887, 1032
simplex, 361 kernel, 678
simplex method, 361, 381, 385, 389, 504, nonlinear equation, 292, 327, 909,
508, 928, 929, 965, 966, 1069, 1053
1070, 1106 pseudo, 210, 215
cycling, 384, 399, 966, 1107 removal, 212, 665
degeneracy, 382, 384, 399, 929, 966, weakening, 214, 217, 665
1070, 1071, 1107 skewness, 402, 403, 408, 410, 411, 413, 415
finding a basic feasible vector, 386 slack variable, 381, 389, 504, 928, 929, 1070
tableau, 385, 389, 508 Smith, B. T., 523, 562
SIMPLX, 386, 928, 1069 SMOOTH, 876, 1021
Simpson's 3/8 rule, 180, 186, 687, 690 smoothing of data, 427, 440, 445, 456, 463,
Simpson's rule, 54, 180, 183, 186, 188, 219, 507, 509
227, 585, 594, 885, 1030 SN 1987 A, 423
composite formula, 182 SOR, 752, 1001, 1141
differential equation, 582, 584, 591, sorting, 403, 929, 1071
602 Spanier, J., 50
integral equation, 664, 667, 674, 687, sparse matrix, 61, 81, 98, 100, 108,334,366,
688, 690, 989, 995, 1129, 1135 557, 725, 741, 763
stability, 582, 690 conjugate gradient method, 101, 378
SIMPX, 506, 929, 965, 966, 1070, 1105, eigenvalue problem, 533, 557, 558
1106 iterative methods, 99, 748
SIMSON, 182, 221, 222, 224, 885, 1030 Lanczos method, 557, 558
simulated annealing, 390-395 spectral norm, 78
continuous variables, 393 spectral radius, 749, 753, 754, 821
sin- 1 x, 12, 520, 809 Spedicato, E., 60
sine, 41, 426 speed-up, 258
sine integral, 11 SPHBJN, 959, 1100
sine transform, 473, 516, 757 spherical harmonic, 964, 1105
single-step method, 576, 600, 610 Spheroidal harmonics, 638
stability, 602, 604, 605, 651, 816 SPHNO, 240, 242, 905, 1049
singular matrix, 61, 67, 89, 97, 524 SPLEVL, 121, 133, 161, 392, 864, 876, 885,
Gaussian elimination, 108 886, 1008, 1021, 1022, 1030, 1031
linear equations, 89 SPLINE, 133, 134, 392, 875, 876, 885, 886,
singular value decomposition (SVO), 88- 1021, 1022, 1030, 1031
97, 374, 376, 446, 677, 742, spline, 131,431,876, 1021, 1022, 1030, 1031
805, 873, 938, 939, 941-943, 945, B-spline, 136-140, 162, 431, 877, 878,
1018, 1079, 1080, 1082-1084, 880-884, 1022, 1023, 1025-1029
1086 cubic, 131-136, 150, 156, 160, 161,
complexity, 96 186, 786, 875, 876, 885, 1021,
convergence criterion, 94 1030
finding condition number, 89 differentiation, 140, 161, 162, 876,
finding null-space and range, 89 877, 1021, 1022
inverse problem, 682 integration, 186, 238, 885, 886, 1030,
least squares, 88-90, 436, 439-442, 1031
455, 513, 678, 874, 1019 linear, 131
overdetermined system, 90 truncated power basis, 136, 514
roundoff error, 90 SPLINT, 885, 1030
singular values, 88, 90, 436, 873, 993, 1018, spurious convergence, 7, 212, 222-224, 257,
1019, 1133 281, 298, 316-318, 368, 478, 534,
singularity 677, 699, 800
apparent, 210 spurious minimum, 350
differential equation, 575, 609, 625, square element, 772, 822
638, 643, 645, 653, 816 square root, 337, 521, 795, 809
integral equation, 659, 664, 666, 678 Squire, W., 213
stabilised elementary transformation, 556, Stirling's formula, 407, 423


972, 1112 Stoer, J., 143, 608, 623, 756
stability, 42 50 straight line fit, 450, 511, 946, 1087
absolute, 581, 604, 605, 607, 615, 624, STRINT, 246, 905-907, 1049-1051
692 STROUD, 906, 907, 1050, 1051
finite difference method, 709- 713, Stroud's formula, 245, 906, 907, 1050, 1051
717, 725, 727, 730-732, 735, 739 Stroud, A. H., 196, 206
Fourier method, 710, 712, 718, 725, STRT4, 975-977,1116,1117
727, 730-732, 735 Student's t distribution, 411, 412
hyperbolic equation, 730-732, 735, STURM, 546, 547, 970, 1110
739 Sturm sequence, 305-307, 545, 548, 552,
integral equation, 687, 689, 690, 692 558, 634, 673, 674, 799, 969, 970,
matrix method, 711, 718 1109, 1110
maximum principle, 709 Sturm's theorem, 306, 545
numerical integration method, 578, Sturm-Liouville problem, 633, 635, 675
580-585, 587, 588, 611, 614-616, subefficient rule, 243
649, 653, 814, 815 successive over-relaxation method, 750-
parabolic equation, 709--713, 717, 725, 753, 755, 763, 821, 1001, 1141
727 complexity, 751
predictor-corrector method, 588, 593, rate of convergence, 751
649, 650, 815 summation, 229- 236, 901, 1045
recurrence relation, 49, 50, 956, 960, accelerating the convergence, 232
961, 1097, 1100, 1102 cascade sum, 35, 36, 39, 866, 1011
relative, 584 convergence criterion, 233
single-step method, 602, 604, 605, E-algorithm, 229, 477
651, 816 Euler's transformation, 234-236, 901,
stiff, 612 1015
system of differential equations, 649 Euler-Maclaurin sum formula, 230,
standard coordinates, 759 231
standard deviation, 46, 200, 402, 403, 405, of oscillatory series, 234-236
410-414, 418, 419, 421, 433, 436 of rational function, 231-234
binomial distribution, 408 roundoff error, 33, 34, 36, 37, 39, 41,
Poisson distribution, 409 199, 234, 256, 777
standard form truncation error, 230
error curve, 498 super-accurate formula, 161, 165, 177, 178,
linear programming, 381, 504 180
starting values, 586, 590, 591, 593, 975, SVD, 95-97, 437, 440, 449, 456, 868, 869,
1115 873, 874, 927, 939, 940, 942, 943,
Day's method, 688 945, 946, 988, 992-994, 1013,
iterative method, 591, 650, 816 1014, 1018, 1019, 1068, 1080,
Runge-Kutta method, 591, 977,1117 1082-1084, 1086, 1087, 1128,
using Taylor series, 590, 650, 687 1133, 1134
stationary point, 347, 354, 356, 360, 368, SVDEVL, 95, 97, 437, 456, 873, 939,
397 940, 942, 943, 945, 946, 988,
statistics, 401-422 992,994, 1018, 1080, 1082-1084,
steepest descent, 330, 362, 377, 397, 453 1086, 108~ 1133, 1134
Steffensen's method, 336 symmetric kernel, 660, 670, 672, 673
Stegun, I. A., 50, 166, 168, 196, 206, 232, symmetric matrix, 62, 104, 526, 541, 544,
485 550,673
step function, 135 eigenvalue problem, 541-553, 565,
stiff differential equation, 579, 581, 588, 566, 968-970, 1108-1110
607, 609-618, 722, 977, 997, triangular decomposition, 73, 106,
1117,1138 871, 1016
stiffly stable method, 612, 613, 618, 722, symmetric tridiagonal matrix
724,974,976,977,1115-1117 eigenvalue problem, 544, 548-553,
pitfalls, 616 969, 970, 1109, 1110
stability, 614, 615 multiple eigenvalue, 545
QL algorithm, 550-553, 969, 1109 Cholesky's algorithm, 73, 106, 871,


Sturm sequence, 545-548, 969, 970, 1016
1109, 1110 Crout's algorithm , 74, 75, 868-870,
symmetry, 146, 492, 493, 498, 500, 638, 656, 1013-1015
714, 736, 770 Doolittle's algorithm, 74
systematic errors, 401 existence and uniqueness, 106, 781
roundoff error, 81
t test, 412, 424 triangular element, 759, 760, 772, 822
T-table, 170, 174, 188, 225 triangular matrix, 63, 69, 71, 72, 76, 87, 98,
tableau, 385, 389, 504, 508, 928, 929, 966, 103, 107, 549, 558, 636, 685, 779
1069, 1070, 1106 triangular window, 516
tan- 1 x, 483, 484, 491, 514, 515, 521, 522, TRIDlA, 546, 968, 969, 1108, 1109
805, 806, 810 tridiagonal matrix, 62, 103, 132, 537, 541,
Chebyshev expansion, 491 548, 550, 557, 632, 634, 712, 720,
minimax approximation, 500 722, 726, 753, 779
Tarantola, A., 685 trigonometric approximation, 431, 462,
Taylor series, 119, 168, 174, 213, 285, 313, 463, 513
339,578,601, 713, 729,745, 765, trigonometric interpolation, 462
777, 791 truncated power basis, 136, 514
multivariate function, 359 truncation error, 2, 4, 7, 61
Taylor series method, 590, 649, 687 approximation, 480, 489
temperature, 390 differential equation, 578, 582, 585--
Tewarson, R. P., 60 589, 591, 594, 601, 603, 604,
627 630, 635, 642, 654, 977, 979,
Thompson, M. J., 680, 684
1117,1120
three body problem, 654
differentiation, 160, 162, 165, 787
three-step formula, 582, 585, 649, 650, 815
integral equation, 663, 670, 674, 686,
maximal order, 584
687, 694
time constant, 610
integration, 177, 179, 181, 188, 190,
time domain, 460, 464
194, 196, 202, 246, 248, 256, 885,
time scale, 610
1030
time shift, 464, 472 interpolation, 114, 116, 119, 122, 133,
TINVIT, 537, 970, 1110 783-785
Titanium, 157, 392 inverse Laplace transformation, 476
top-hat function, 137 local,578,582,585, 587,588, 601, 603,
TQL2, 552, 895- 897, 968, 969, 971, 1040, 627, 709, 730
1041,1108,1109,1111 nonlinear equation, 279, 289, 336, 338,
trace, 533, 673 797,799
trapezoidal rule, 6, 12, 178, 179, 181, 183, partial differential equation, 708, 713,
186, 192, 211, 240, 476, 789, 790, 727, 729, 730, 742, 745, 765, 767,
885, 886, 1030, 1031 820,822
contour integral, 302, 303, 917 propagation, 578
differential equation, 577, 587, 590, series summation, 230
600, 612, 616, 653, 722 Tukey, J. W., 465
Fourier transform, 462 two's complement, 17
integral equation, 664, 667, 674, 687,
688, 693, 988, 995, 1129, 1135 underdetermined system, 90, 382
periodic integrand, 188, 228, 261 underflow, 20, 27, 30, 53, 68, 376, 482, 545,
stability, 653, 690 575, 637, 776, 787, 800, 868, 911,
truncation error, 5,179,181,188, 190 916, 1010, 1013, 1054, 1059
Traub, J. F., 288 graceful, 20
travelling salesman problem, 400 underrelaxation, 750
TRBAK, 968, 1108 unformatted write, 10
TRED2, 544, 552, 556, 968, 969, 971, 1108, unimodal function, 351
1109, 1111 unit lower triangular matrix, 63, 72
triangular decomposition, 71-76, 360, 536, unit upper triangular matrix, 63, 74
868-871, 1013-1016 unitary matrix, 63, 78, 548
unsymmetric matrix fully symmetric, 244


balancing, 553, 555, 562, 971, 1112 multiple integrals, 246, 258, 268, 794
eigenvalue problem, 553 563, 568, weights, 177, 195, 198, 200, 202, 242, 243.
971,973, 1112,1113 662, 685, 818, 988, 1128
eigenvector, 562 positive, 177, 183, 188, 197, 673
ill-conditioning, 553, 565, 566 product rule, 238, 239, 793
reduction to Hessenberg form, 555, tables, 206, 789, 790
972, 1112 well-conditioned, 43. 320
reduction to tridiagonal form, 557 Wells, D. C., 21
upper Hessenberg matrix, 63, 103, 559 Welsch, J. H., 207
upper triangular matrix, 63, 69, 76. 558 Wilkinson, J. H., 22, 33 , 59, 80- 83, 91 , 94,
upwind differencing, 739 95, 101, 109, 274, 312, 320, 381,
436, 523, 527, 530, 531, 534 536,
Vanderbilt, D. , 393
558, 781. 873, 1018
Vandermonde matrix, 141
window function, 470
variance, 248, 249. 253, 262, 402, 412, 415, Hanning, 471
433, 434, 450, 908, 1052 parabolic, 516
variance reduction, 251, 269, 794
triangular, 516
variational method, 758, 762
Winograd, S. , 469
vector norm, 79, 89
word, 17
vertex, 380, 382
Wynn. P., 192
VOLT, 995, 1135
VOLT2, 995, 1135 x sweep, 726
Volterra equation, 658, 660, 685- 695, 990,
995, 1130, 1135, 1136 y sweep, 726
classification, 659 YLM , 964, 1105
consistency, 689 Young, A .. 505, 508
convergence, 689, 695
conversion from the first to second zero padding, 469, 470, 473, 475
kind, 659, 692 zeros
deferred correction, 688, 695 see roots
existence of solution, 692 zeta function, 187, 230, 232, 233, 792
first kind, 692- 695, 995, 1135 ZROOT, 321. 323, 913. 915, 1057, 1058
nonlinear, 685, 689, 995, 1135 ZROOT2, 914, 916, 1057, 1060
quadrature method, 685-693, 695,
995, 1135
second kind, 660, 685-692, 995, 1135
smoothing of oscillations, 694, 995,
1135
stability, 687, 689, 690, 692
starting values, 687, 688
third kind, 696
truncation error. 686, 687, 694
variable order method, 688
W~ . 538, 546
Walsh, G. R., 515
Walsh, J., 658, 695
Wasow, W. R., 747
wave equation, 704, 729, 734, 736
higher order accuracy, 730
weakening the singularity, 214, 217, 665
weakly stable method, 584
Weierstrass continuous nondifferentiable
function, 267
Weierstrass theorem, 112, 197
weight function, 202- 209, 214, 216, 228,
248, 263, 428, 514, 665, 666, 670
