Numerical Computation and Error Analysis
September 2, 2021
1 Introduction
Scientific computing is a multidisciplinary field that uses advanced computing capabilities
to understand and solve complex problems (Wikipedia). Typically it is the application of
computer simulation and other forms of computation from numerical analysis and theoretical
computer science to solve problems. Numerical analysis is an important underpinning for
techniques used in scientific computing.
Floating-point precision has to be sufficiently accurate to handle tasks in various situations.
However, there are cases where this has not been so. For instance, consider the
problem of computing the derivative of a function f (x) at x = x0 numerically. From the
definition, we have
\[ \frac{df(x_0)}{dx} = \lim_{\Delta x \to 0} \frac{f(x_0 + \Delta x) - f(x_0)}{\Delta x}. \]
This definition cannot be implemented in a mathematically rigorous way for actual computation,
only in an approximate manner. Namely, for a small but finite step \(\Delta x\) we have
\[ \frac{df(x_0)}{dx} \approx \frac{f(x_0 + \Delta x) - f(x_0)}{\Delta x}. \]
The Patriot's internal clock counted tenths of a second, and 0.1 has no finite binary
representation; the Patriot stored it in a 24-bit register only, so an error of about
0.000000095 is introduced each time 0.1 is represented. Over a 100-hour operation this
accumulates to a time difference of about 0.34 seconds (0.000000095 × 100 × 60 × 60 × 10 ≈ 0.34 s).
This is enough time to miss a Scud, since during 0.34 s the Scud travels more than half a kilometer.
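The accumulated drift can be reproduced with a short computation (a sketch; keeping the first 23 bits of 0.1's binary expansion is the commonly reported account of the register contents):

```python
import math

# 0.1 has an infinite binary expansion: 0.0001100110011... The Patriot's
# 24-bit register effectively kept only the first 23 bits after the radix
# point, chopping the rest.
stored = math.floor(0.1 * 2**23) / 2**23   # value actually stored
error_per_tick = 0.1 - stored              # ~9.5e-8 s lost per 0.1 s tick

# 100 hours of operation, 10 ticks per second
ticks = 100 * 60 * 60 * 10
drift = error_per_tick * ticks
print(f"error per tick: {error_per_tick:.3e} s, drift after 100 h: {drift:.2f} s")
```

At roughly 1,700 m/s, a Scud moves more than half a kilometer in that third of a second.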
The Mars Climate Orbiter was lost in 1999. Engineers concluded that the spacecraft entered
the planet's atmosphere too low and probably burned up.
The primary cause of the much lower altitude of the orbiter than calculated was that one piece
of ground software supplied by Lockheed Martin produced results in a United States customary
unit, contrary to its Software Interface Specification (SIS), while a second system, supplied
by NASA, expected those results to be in SI units, in accordance with the SIS. Specifically,
software that calculated the total impulse produced by thruster firings produced results in
pound-force seconds. The trajectory calculation software then used these results – expected
to be in newton seconds (incorrect by a factor of 4.45) – to update the predicted position of
the spacecraft.
(https://en.wikipedia.org/wiki/Mars_Climate_Orbiter, retrieved in September, 2020)
Subtractive Cancellation
Cancellation occurs from the subtraction of two almost equal quantities. Consider the
following difference: 3.593 - 3.574 = 0.019. Using four decimal digits, these are represented
by 3593e-3, 3574e-3, and 1900e-5. All three appear to have the same precision: four decimal
digits in the mantissa. However, because 3.593 represents any number in the range
[3.5925, 3.5935], and 3.574 represents any number in the range [3.5735, 3.5745], the actual
difference may be any number in the interval [0.018, 0.020]; therefore, while we have an
apparent precision of four digits, we actually have zero significant digits.
This phenomenon, which occurs when we try to subtract similar numbers, is termed subtractive
cancellation. The easiest example is the forward-difference formula for the derivative,
\[ \frac{f(x + h) - f(x)}{h}, \]
which for small h subtracts two nearly equal values and, in the worked example at
https://ece.uwaterloo.ca/~dwharder/NumericalAnalysis/ (retrieved in September, 2020),
gives a poor approximation to the actual derivative 6.506. If we use an arithmetic system
with more significant digits, we get a correspondingly better approximation.
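The trade-off is easy to observe in double precision (a sketch using f(x) = eˣ at x = 1, not the example from the source; the exact derivative there is e). Shrinking h first reduces the truncation error, but once h is too small the subtraction f(x + h) − f(x) cancels almost all significant digits:

```python
import math

def forward_diff(f, x, h):
    # Forward-difference approximation (f(x+h) - f(x)) / h.
    return (f(x + h) - f(x)) / h

exact = math.e  # derivative of e^x at x = 1
for h in (1e-4, 1e-8, 1e-13):
    approx = forward_diff(math.exp, 1.0, h)
    print(f"h = {h:.0e}  error = {abs(approx - exact):.2e}")
```

The error shrinks as h decreases from 1e-4 to 1e-8, then grows again at 1e-13 as cancellation takes over.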
Recursion
A common method in scientific computing is to calculate a new value based on the previous
one, and to continue in that way, either in an iterative process or in a recursive process,
producing new values all the time. In either case the errors can accumulate and finally
destroy the computation.
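A classic illustration (a sketch, not taken from the source) is the recurrence I_n = 1 − n·I_{n−1} for the integrals I_n = ∫₀¹ xⁿ e^{x−1} dx, with I_0 = 1 − 1/e. Mathematically every I_n lies in (0, 1), yet the initial rounding error in I_0 is multiplied by n! as the forward recursion proceeds, while running the recurrence backwards divides the error away:

```python
import math

# Forward recursion: I_n = 1 - n * I_(n-1), starting from I_0 = 1 - 1/e.
I = 1.0 - 1.0 / math.e
for n in range(1, 21):
    I = 1.0 - n * I          # rounding error in I_0 is amplified by n!
print(f"forward  I_20 = {I:.3e}")   # wildly wrong, far outside (0, 1)

# Backward recursion: I_(n-1) = (1 - I_n) / n, from the crude start I_30 = 0.
J = 0.0
for n in range(30, 20, -1):
    J = (1.0 - J) / n        # the starting error is divided down at every step
print(f"backward I_20 = {J:.3e}")   # accurate despite the crude start
```

The backward direction is stable because each step shrinks the inherited error by a factor of n, so even starting from I_30 = 0 the result for I_20 is accurate.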
Integer overflow
Integer overflow is usually not signaled. This can cause numerical problems in the calculation
of the factorial function. Writing code that uses repeated multiplication on a computer with
32-bit integer arithmetic, the factorials up to 12! are all correctly evaluated, but 13! gets
the wrong value and 17! becomes negative.
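The effect can be reproduced by simulating 32-bit two's-complement wraparound (a sketch; Python's own integers are arbitrary-precision, so the wraparound must be applied explicitly):

```python
def factorial_int32(n):
    """Repeated multiplication with the result wrapped to a signed 32-bit int."""
    f = 1
    for i in range(2, n + 1):
        f = (f * i) & 0xFFFFFFFF    # keep the low 32 bits, as the hardware does
        if f >= 2**31:              # reinterpret as a signed two's-complement value
            f -= 2**32
    return f

print(factorial_int32(12))  # 479001600  (correct)
print(factorial_int32(13))  # 1932053504 (wrong: 13! = 6227020800)
print(factorial_int32(17))  # -288522240 (negative)
```

No exception is raised at any point; the wrong values appear silently.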
Overflow occurs when the exponent is too large to be represented with the given form.
According to IEEE 754, the appropriate representation of such a number is ±infinity.
Underflow occurs when the exponent is too small (too negative) to be represented by the
given form and should be represented by ±0.
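Both behaviours can be observed directly in IEEE 754 double precision (a sketch; Python floats are IEEE doubles):

```python
import math

huge = 1e308
print(huge * 10)     # inf: the exponent is too large for the format

tiny = 5e-324        # the smallest positive (subnormal) double
print(tiny / 2)      # 0.0: underflow past the smallest subnormal gives zero
```

Note that neither operation raises an exception; the special values simply propagate.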
Order of Additions
In performing a sequence of additions, the numbers should be added in order from the
smallest in magnitude to the largest in magnitude. This ensures that the cumulative
contribution of many small arguments is not lost.
(https://ece.uwaterloo.ca/~dwharder/NumericalAnalysis/02Numerics/Weaknesses/, retrieved
in September, 2020)
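The effect can be demonstrated in double precision (a sketch): adding ten thousand terms of 10⁻¹⁶ to 1.0 one at a time contributes nothing if the large value is accumulated first, because each individual 1.0 + 10⁻¹⁶ rounds back to 1.0.

```python
terms = [1.0] + [1e-16] * 10_000   # one large term, many small ones

# Largest first: each 1.0 + 1e-16 rounds back to 1.0, so the small terms vanish.
big_first = 0.0
for t in terms:
    big_first += t

# Smallest first: the small terms accumulate before meeting the large one.
small_first = 0.0
for t in reversed(terms):
    small_first += t

print(big_first, small_first)
```

The small-first sum retains the combined contribution of about 10⁻¹², while the large-first sum is exactly 1.0.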
Consider the factored polynomial (x − 1)^8 . Expanded, this produces the polynomial
\[ x^8 - 8x^7 + 28x^6 - 56x^5 + 70x^4 - 56x^3 + 28x^2 - 8x + 1, \]
and evaluating this expanded form near x = 1 in floating-point arithmetic suffers severe
subtractive cancellation among its large alternating terms.
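The cancellation is easy to provoke (a sketch): evaluate the expanded coefficients with Horner's rule at points near x = 1 and compare against the factored form, which stays accurate because x − 1 is computed exactly there.

```python
# Coefficients of (x - 1)^8 expanded, highest degree first.
coeffs = [1, -8, 28, -56, 70, -56, 28, -8, 1]

def horner(coeffs, x):
    # Evaluate a polynomial given by `coeffs` (highest degree first) at x.
    result = 0.0
    for c in coeffs:
        result = result * x + c
    return result

xs = [0.99 + 0.001 * k for k in range(21)]   # sample points around x = 1
max_gap = max(abs(horner(coeffs, x) - (x - 1.0) ** 8) for x in xs)
print(f"largest disagreement between the two forms: {max_gap:.2e}")
```

The true values near x = 1 are at most about 10⁻¹⁶, so the rounding noise in the expanded form dwarfs the quantity being computed.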
A floating-point number has the general form
\[ (\pm)\, . b_1 b_2 \cdots b_p \cdot 2^E. \]
3.2 ANSI/IEEE 754 Standard
The standard specifies the following:
• rounding modes
• signed zero, infinity (±∞), and not-a-number (NaN)
• floating-point exceptions and their handling
• conversion between formats
Example Using four-figure decimal arithmetic, suppose we wish to compute s = 1.000 +
1.000 × 10^4 − 1.000 × 10^4 , evaluated left to right. The standard way produces 0 instead of
1.000, the exact solution. This error is caused by cancellation. In particular, the addition
1.000 + 1.000 × 10^4 becomes 1.000 × 10^4 because of the arithmetic format, losing the
critical information of the 1 in this computation.
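The binary analogue appears in IEEE 754 single precision (a sketch; the helper `f32` rounds a Python double to the nearest 32-bit float via the `struct` module). At 2²⁴ the spacing between consecutive single-precision floats is 2, so adding 1 first loses it entirely:

```python
import struct

def f32(x):
    # Round a Python float (an IEEE double) to the nearest IEEE single.
    return struct.unpack('f', struct.pack('f', x))[0]

big = 16777216.0                            # 2**24: float32 spacing here is 2.0
left_to_right = f32(f32(1.0 + big) - big)   # (1 + 2^24) rounds back to 2^24
reassociated  = f32(1.0 + f32(big - big))   # exact: 1.0
print(left_to_right, reassociated)
```

The two orders of evaluation give 0.0 and 1.0; reassociating the sum preserves the small term.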
A floating-point number X is represented in terms of a sign bit s (0 or 1), an integer exponent
E, and a p-bit significand B (b_0 b_1 ⋯ b_{p−1} , b_i = 0, 1), where
\[ X = (-1)^s \, 2^E \, B. \]
B is computed as
\[ B = b_0 . b_1 b_2 \cdots b_{p-1} = b_0 2^0 + \sum_{i=1}^{p-1} b_i 2^{-i}. \]
For double-precision arithmetic, the standard defines p = 53 and −1022 ≤ E ≤ 1023.
The number X is defined as a 64-bit quantity with one sign bit s, an 11-bit biased exponent
e = E + 1023, and a 52-bit fractional mantissa m (b_1 b_2 ⋯ b_52). Since the exponent can
always be selected such that b_0 = 1 for a normalized number, the value of b_0 is constant
and need not be stored.
The integer value of the 11-bit biased exponent e is given as
\[ e = e_0 e_1 \cdots e_{10} = \sum_{i=0}^{10} e_i 2^{10-i}. \]
• if e = 0 and m ≠ 0, then X is a denormalized number:
\[ X = (-1)^s \, 2^{-1022} \, (0.m) = (-1)^s \, 2^{-1022} \sum_{i=1}^{52} b_i 2^{-i}. \]
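The field layout described above can be inspected directly (a sketch using Python's `struct` module to reinterpret the 64 bits of a double; `decode_double` and `reconstruct` are illustrative helper names):

```python
import struct

def decode_double(x):
    """Split an IEEE 754 double into sign bit, biased exponent, and mantissa."""
    bits = struct.unpack('<Q', struct.pack('<d', x))[0]
    s = bits >> 63                 # 1 sign bit
    e = (bits >> 52) & 0x7FF       # 11-bit biased exponent, e = E + 1023
    m = bits & ((1 << 52) - 1)     # 52-bit fractional mantissa
    return s, e, m

print(decode_double(1.0))          # (0, 1023, 0): 1.0 = (-1)^0 * 2^(1023-1023) * 1.0

def reconstruct(s, e, m):
    # Rebuild a normalized value from its fields; the hidden bit b0 = 1.
    return (-1) ** s * 2.0 ** (e - 1023) * (1 + m / 2 ** 52)

print(reconstruct(*decode_double(-6.5)))   # -6.5
```

Round-tripping a value through the two helpers reproduces it exactly, confirming the sign/exponent/mantissa decomposition for normalized numbers.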
4 Error
There are two methods for quantifying errors in approximation. The absolute error of an
approximation p∗ to a value p is |p − p∗ |; the relative error is |p − p∗ |/|p| (for p ≠ 0).
The relative error provides a method of characterizing the percentage error; when the relative
error is less than one, the negative of the log10 of the relative error gives the number of
significant decimal digits in the computed solution. In particular, p∗ approximates p to t
significant digits if
\[ \frac{|p - p^*|}{|p|} \le \alpha \times 10^{-t}, \]
where α is a constant of order one.
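These definitions translate directly into code (a sketch; `relative_error` and `significant_digits` are hypothetical helper names, shown here for π approximated by 3.1416):

```python
import math

def relative_error(p, p_star):
    # Relative error of the approximation p_star to the true value p (p != 0).
    return abs(p - p_star) / abs(p)

def significant_digits(p, p_star):
    # When the relative error is below one, -log10 of it estimates the
    # number of significant decimal digits in the approximation.
    return -math.log10(relative_error(p, p_star))

p, p_star = math.pi, 3.1416
print(f"relative error: {relative_error(p, p_star):.2e}")
print(f"about {significant_digits(p, p_star):.1f} significant digits")
```

For this pair the relative error is about 2.3 × 10⁻⁶, i.e. between five and six significant decimal digits.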
5 Error Analysis
The cumulative effects of errors in the computation are not clearly observed in most cases.
Such errors are caused by rounding or truncation, and propagate throughout the entire
computation. Consider a polynomial
\[ p(x) = p_0 + p_1 x + p_2 x^2 + \cdots + p_n x^n. \]
Here, we may ask under what conditions, if any, on the coefficients and on the evaluation
point α the computed value p(α) will be reasonable. To answer this question, an error
analysis needs to be performed.
Error analysis is concerned with establishing whether or not an algorithm is stable for the
problem at hand. A forward error analysis is concerned with how close the computed solution
is to the exact solution. A backward error analysis is concerned with how well the computed
solution satisfies the problem to be solved.
The distinction between backward and forward errors is explained using the following ex-
ample.
Example Let
\[ A = \begin{pmatrix} 99 & 98 \\ 100 & 99 \end{pmatrix}, \quad b = \begin{pmatrix} 1 \\ 1 \end{pmatrix}. \]
Then the exact solution of the equations Ax = b is given by
\[ x = (1, -1)^T. \]
Also let x̂ be an approximate solution to the equations and define the residual vector r as
r = b − Ax̂.
Of course, for the exact solution r = 0 and we might hope that for a solution close to x, r
should be small.
Consider the approximate solution
\[ \hat{x} = \begin{pmatrix} 2.97 \\ -2.99 \end{pmatrix}, \]
for which
\[ \hat{x} - x = \begin{pmatrix} 1.97 \\ -1.99 \end{pmatrix}, \]
and so x̂ looks to be a rather poor solution. But for this solution we have that
\[ r = \begin{pmatrix} -0.01 \\ 0.01 \end{pmatrix}, \]
and we have almost solved the original problem. On the other hand, the approximate
solution
\[ \hat{x} = \begin{pmatrix} 1.01 \\ -0.99 \end{pmatrix}, \]
for which
\[ \hat{x} - x = \begin{pmatrix} 0.01 \\ 0.01 \end{pmatrix}, \]
gives
\[ r = \begin{pmatrix} -1.97 \\ -1.99 \end{pmatrix}, \]
and it does not solve a problem close to the original problem.
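These residuals are easy to verify numerically (a sketch; for a 2×2 system plain Python suffices):

```python
def residual(A, x_hat, b):
    # r = b - A x_hat for a 2x2 system.
    return [b[i] - (A[i][0] * x_hat[0] + A[i][1] * x_hat[1]) for i in range(2)]

A = [[99.0, 98.0], [100.0, 99.0]]
b = [1.0, 1.0]

r_poor = residual(A, [2.97, -2.99], b)   # far from x = (1, -1), yet tiny residual
r_good = residual(A, [1.01, -0.99], b)   # close to x, yet large residual
print(r_poor)   # approximately [-0.01, 0.01]
print(r_good)   # approximately [-1.97, -1.99]
```

The solution with the large forward error has the small backward error, and vice versa, which is exactly the distinction the example is making.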
Once we have computed the solution to a system of linear equations Ax = b we can readily
compute the residual. If we can find a matrix E such that
Ex̂ = r,
then
(A + E) x̂ = b
and we thus have a measure of the perturbation in A required to make x̂ an exact solution.
A particular E that satisfies Ex̂ = r is given by
\[ E = \frac{r \hat{x}^T}{\hat{x}^T \hat{x}} = \frac{r \hat{x}^T}{\|\hat{x}\|_2^2}. \]
From this equation we have that
\[ \|E\|_2 \le \frac{\|r\|_2 \, \|\hat{x}\|_2}{\|\hat{x}\|_2^2} = \frac{\|r\|_2}{\|\hat{x}\|_2}. \]
On the other hand, for any E satisfying Ex̂ = r we have
\[ \|r\|_2 \le \|E\|_2 \, \|\hat{x}\|_2, \]
so that
\[ \|E\|_2 \ge \frac{\|r\|_2}{\|\hat{x}\|_2}. \]
Thus this particular E attains the lower bound, \( \|E\|_2 = \|r\|_2 / \|\hat{x}\|_2 \), and
hence minimizes \( \|E\|_2 \). This gives us an a posteriori bound on the backward error.
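The construction can be checked numerically (a sketch): build E = r x̂ᵀ/‖x̂‖₂² for the poor-looking solution from the example above and confirm that (A + E)x̂ = b.

```python
def matvec(M, v):
    # 2x2 matrix-vector product.
    return [M[0][0] * v[0] + M[0][1] * v[1],
            M[1][0] * v[0] + M[1][1] * v[1]]

A = [[99.0, 98.0], [100.0, 99.0]]
b = [1.0, 1.0]
x_hat = [2.97, -2.99]

# Residual r = b - A x_hat
Ax = matvec(A, x_hat)
r = [b[0] - Ax[0], b[1] - Ax[1]]

# Rank-one perturbation E = r x_hat^T / (x_hat^T x_hat)
nrm2 = x_hat[0] ** 2 + x_hat[1] ** 2
E = [[r[i] * x_hat[j] / nrm2 for j in range(2)] for i in range(2)]

# (A + E) x_hat should reproduce b up to rounding
ApE = [[A[i][j] + E[i][j] for j in range(2)] for i in range(2)]
print(matvec(ApE, x_hat))
```

The perturbed system reproduces b to rounding error, so x̂ is the exact solution of a problem only ‖E‖₂ = ‖r‖₂/‖x̂‖₂ away from the original one.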