
Advanced CAD/CAM


Numerical Computation and Error Analysis

September 2, 2021

1 Introduction
Scientific computing is a multidisciplinary field that uses advanced computing capabilities
to understand and solve complex problems (Wikipedia). Typically it is the application of
computer simulation and other forms of computation from numerical analysis and theoretical
computer science to solve problems. Numerical analysis is an important underpinning for
techniques used in scientific computing.
Floating-point arithmetic must be accurate enough for the task at hand; however, there have
been notable cases where it was not. For instance, consider the problem of computing the
derivative of a function f(x) at x = x0 numerically. From the definition, we have
$$\frac{df(x_0)}{dx} = \lim_{\Delta x \to 0} \frac{f(x_0 + \Delta x) - f(x_0)}{\Delta x}.$$
In actual computation this definition cannot be implemented exactly; it can only be approximated. Namely, we have

$$\frac{df(x_0)}{dx} \approx \frac{f(x_0 + \Delta x) - f(x_0)}{\Delta x},$$
where we take a very small value ∆x ≪ 1, for example ∆x = 10⁻⁸. In this case,
the computed value obviously contains an error, which can propagate throughout the entire
computation, resulting in an unexpected accumulation of error and therefore an incorrect
value.
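To make the effect concrete, here is a small sketch (in Python, with f(x) = sin x and x0 = 1 chosen purely for illustration) that evaluates the forward difference quotient for several step sizes: the error first decreases with ∆x and then grows again once rounding and cancellation dominate.

```python
import math

def forward_difference(f, x0, dx):
    """One-sided difference quotient approximating f'(x0)."""
    return (f(x0 + dx) - f(x0)) / dx

# Illustration: f(x) = sin(x), whose exact derivative at x0 is cos(x0).
x0 = 1.0
exact = math.cos(x0)

for dx in (1e-2, 1e-4, 1e-8, 1e-12, 1e-16):
    approx = forward_difference(math.sin, x0, dx)
    print(f"dx = {dx:.0e}   approx = {approx: .15f}   error = {abs(approx - exact):.1e}")
```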
Here are a few examples where numerical problems have occurred, rather than the more common
pure programming errors (bugs).

Patriot Missile Failure

This is an example of inaccurate time computation. The time in tenths of a second, as measured
by the system's internal clock, was multiplied by 1/10 to produce the time in seconds. The
calculation was performed using a 24-bit fixed-point register. The value 1/10 has a non-terminating
binary expansion, which was chopped at 24 bits.

$$\frac{1}{10} = \frac{1}{2^4} + \frac{1}{2^5} + \frac{1}{2^8} + \frac{1}{2^9} + \cdots$$


∗ This lecture note is prepared with excerpts from [1, 2, 3].

However, the Patriot missile stores only 24 bits, so an error of about 0.000000095 is introduced
in the conversion and accumulates with every 0.1 s tick. Over 100 hours of operation this amounts
to a time error of about 0.34 seconds (0.000000095 × 100 × 60 × 60 × 10 ≈ 0.34 s). This is enough
to miss a Scud: in 0.34 s the Scud travels more than half a kilometer.
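A minimal sketch of this accumulation, assuming the register chops 1/10 after 23 fractional bits (the reconstruction that reproduces the commonly quoted per-tick error of about 9.5 × 10⁻⁸):

```python
from fractions import Fraction

# Chop 1/10 after 23 fractional bits; this reproduces the commonly quoted
# conversion error of about 9.5e-8 (the bit count is an assumption about
# how the 24-bit register was laid out).
exact = Fraction(1, 10)
stored = Fraction(int(exact * 2**23), 2**23)     # drop everything past bit 23
error_per_tick = float(exact - stored)

ticks = 100 * 60 * 60 * 10                       # 100 hours of 0.1 s ticks
print(f"conversion error : {error_per_tick:.3e}")
print(f"drift after 100 h: {error_per_tick * ticks:.2f} s")   # about 0.34 s
```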

Ariane 5 Rocket Failure

This accident was caused by an overflow in a floating-point conversion rather than a lack of
precision. Ariane 5, a program worth about 7 billion dollars, was destroyed less than a minute
after lift-off. A small piece of software was to blame: it tried to stuff a 64-bit number into a
16-bit space. At the root of the disaster, the conversion of velocity data from a 64-bit
floating-point format to a 16-bit integer format triggered an overflow, which shut down the
rocket's core guidance system.

Collapse of the Sleipner A offshore platform

The collapse was caused by an incorrect finite element analysis of the structure.

Loss of Mars Climate Orbiter

The orbiter was lost in 1999. Engineers concluded that the spacecraft entered the planet’s
atmosphere too low and probably burned up.
The primary cause of the much lower altitude of the orbiter than calculated was that one piece
of ground software supplied by Lockheed Martin produced results in a United States customary
unit, contrary to its Software Interface Specification (SIS), while a second system, supplied
by NASA, expected those results to be in SI units, in accordance with the SIS. Specifically,
software that calculated the total impulse produced by thruster firings produced results in
pound-force seconds. The trajectory calculation software then used these results – expected
to be in newton seconds (incorrect by a factor of 4.45) – to update the predicted position of
the spacecraft.
(https://en.wikipedia.org/wiki/Mars_Climate_Orbiter, retrieved in September 2020)

2 Basic Problems in Numerical Computation


Rounding
Calculations are usually performed with a certain fixed number of significant digits, so after
each operation the result usually has to be rounded, introducing a rounding error whose
modulus in the optimal case is less than or equal to half a unit in the last digit. At the next
computation this rounding error has to be taken into account, as well as a new rounding error.
The propagation of the rounding error is therefore quite complex. Moreover, depending
on the precision used in the computation, you may obtain a different result although the
computation step remains the same.

Subtractive Cancellation
Cancellation occurs in the subtraction of two almost equal quantities. Consider the fol-
lowing difference: 3.593 − 3.574 = 0.019. Using four decimal digits, these are represented
by 3593e-3, 3574e-3, and 1900e-5. All three appear to have the same precision: four dec-
imal digits in the mantissa. However, because 3.593 represents any number in the range
[3.5925, 3.5935], and 3.574 represents any number in the range [3.5735, 3.5745], the actual

difference may be any number in the interval [0.018, 0.020], and therefore while we have an
apparent precision of four digits, we actually have zero significant digits.
This phenomenon, which occurs when we subtract nearly equal numbers, is termed subtractive
cancellation. The simplest example is the difference-quotient formula for the derivative:

(f(x + h) − f(x))/h.

If we use f(x) = x² with x = 3.253 and h = 0.002, we get the approximation

$$(3.255^2 - 3.253^2)/0.002 = (10.60 - 10.58)/0.002 = 10,$$

which is a poor approximation of the actual derivative 6.506
(https://ece.uwaterloo.ca/~dwharder/NumericalAnalysis/02Numerics/Weaknesses/, retrieved
in September 2020). If we use an arithmetic system with more significant digits,
we get

$$((x + h)^2 - x^2)/h = 0.013015999999998584/0.002 = 6.507999999999292,$$

which is quite close to the actual derivative value.
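The four-digit behaviour can be reproduced with Python's decimal module by limiting the working precision; the function and values below simply follow the example above.

```python
from decimal import Decimal, getcontext

def diff_quotient(x, h):
    """Difference quotient for f(x) = x**2, whose exact derivative is 2x."""
    return ((x + h)**2 - x**2) / h

# Four significant digits: the squares round to 10.60 and 10.58, so the
# subtraction leaves only one meaningful digit.
getcontext().prec = 4
print(diff_quotient(Decimal("3.253"), Decimal("0.002")))   # 1E+1, i.e. 10

# Ordinary double precision keeps enough digits to avoid the damage.
print(diff_quotient(3.253, 0.002))                          # ~6.508
```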


Any time you subtract two numbers which are almost equal, subtractive cancellation will
occur, and the difference will not have as much precision as either the subtrahend or the
minuend. In designing numerical algorithms, we must avoid such situations.

Recursion
A common method in scientific computing is to calculate a new value from the previous one
and to continue in that way, either iteratively or recursively, generating new values all the
time. In both cases the errors can accumulate and eventually destroy the computation.

Integer overflow
Integer overflow is usually not signaled. This can cause numerical problems in the calculation
of the factorial function. Using repeated multiplication on a computer with 32-bit integer
arithmetic, the factorials up to 12! are all evaluated correctly, but 13! gets the wrong value
and 17! comes out negative.
In floating-point arithmetic, overflow occurs when the exponent is too large to be represented
in the given format; according to IEEE 754, the appropriate representation of such a number is
±∞. Underflow occurs when the exponent is too small (too negative) to be represented in the
given format, and the number is then represented by ±0.
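A small sketch that emulates signed 32-bit wrap-around (Python's own integers are unbounded, so the wrapping is done by masking) reproduces the factorial behaviour described above.

```python
import math

def factorial_int32(n):
    """Factorial computed with signed 32-bit wrap-around (emulated by masking)."""
    result = 1
    for k in range(2, n + 1):
        result = (result * k) & 0xFFFFFFFF     # keep only the low 32 bits
        if result >= 2**31:                    # reinterpret as a signed value
            result -= 2**32
    return result

for n in (12, 13, 17):
    print(n, factorial_int32(n), math.factorial(n))
# 12! is still exact, 13! is wrong, and 17! comes out negative.
```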

Addition of Large and Small Numbers


Consider 1038. + 0.02353 = 1038.02353 ≈ 1038. Even though the second number is not zero,
adding it to 1038. produces no change in the first. This leads to the further problem that
floating-point addition is not associative, i.e., (a + b) + c != a + (b + c) in general.
(https://ece.uwaterloo.ca/~dwharder/NumericalAnalysis/02Numerics/Weaknesses/, retrieved
in September 2020)
If a large float is added to a small float, the small float may be too negligible to change
the value of the larger one. Algorithms which invite such behaviour should be avoided.
(https://ece.uwaterloo.ca/~dwharder/NumericalAnalysis/02Numerics/Weaknesses/, retrieved
in September 2020)
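A short illustration of absorption and the resulting loss of associativity; the particular constants are chosen only to make the effect visible in double precision.

```python
# Absorption: the small addend disappears completely.
print(1.0e20 + 1.0 == 1.0e20)         # True

# Loss of associativity: grouping changes the result.
a, b, c = 1.0e16, 1.0, 1.0
left = (a + b) + c                    # 1.0 is absorbed twice -> 1e16
right = a + (b + c)                   # 2.0 is large enough to survive
print(left == right, left, right)     # False 1e+16 1.0000000000000002e+16
```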

Order of Additions
In performing a sequence of additions, the numbers should be added in the order of the
smallest in magnitude to the largest in magnitude. This ensures that the cumulative sum of
many small arguments is still felt.
(https://ece.uwaterloo.ca/~dwharder/NumericalAnalysis/02Numerics/Weaknesses/, retrieved
in September 2020)
Consider the factored polynomial (x − 1)⁸. Expanded, this produces the polynomial

$$x^8 - 8x^7 + 28x^6 - 56x^5 + 70x^4 - 56x^3 + 28x^2 - 8x + 1.$$

First, we consider the original factored polynomial. For the interval x ∈ [1 − d, 1 + d],
where d = 0.000007237, the endpoints are [0.999992763000000, 1.000007237000000]. We then
use these values of x directly to evaluate (x − 1)⁸.
Usually, polynomials are not available in factored form, and consequently they must be
evaluated in expanded form. By default this may be done by summing the terms, either in
order of decreasing or of increasing degree. The result of the computation done in order of
decreasing degree is [2.66453525910038e−015, −1.24344978758018e−014]; for increasing degree,
it is [2.77555756156289e−015, −1.24344978758018e−014].
The error in these evaluations is larger by some 26 orders of magnitude.
Another means of evaluating expanded polynomials is Horner's rule, which rewrites the
polynomial in nested form and evaluates it from the inside out:

$$(((((((x - 8)x + 28)x - 56)x + 70)x - 56)x + 28)x - 8)x + 1.$$

Evaluating this at the same two endpoints gives

[−1.55431223447522e−015, 3.44169137633799e−015].

In this case the error is smaller than that of evaluating the expanded polynomial in either
decreasing or increasing order of degree.
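The sketch below compares the three evaluation strategies at the two endpoints used above: the factored form gives values of about 7.5 × 10⁻⁴², while the expanded and Horner forms give values many orders of magnitude larger (though Horner's rule does better than plain summation).

```python
# Coefficients of (x - 1)**8 in expanded form, from x**8 down to the constant.
coeffs = [1, -8, 28, -56, 70, -56, 28, -8, 1]

def expanded(x):
    """Sum the expanded terms in order of decreasing degree."""
    return sum(c * x**(8 - i) for i, c in enumerate(coeffs))

def horner(x):
    """Evaluate the same polynomial with Horner's rule (inside out)."""
    result = 0.0
    for c in coeffs:
        result = result * x + c
    return result

d = 0.000007237
for x in (1 - d, 1 + d):
    print(f"x = {x:.15f}  factored = {(x - 1)**8: .3e}  "
          f"expanded = {expanded(x): .3e}  horner = {horner(x): .3e}")
```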

3 Floating Point Arithmetic


3.1 Basic Concept
Current computer systems generally use floating-point arithmetic (FPA) for various computations.
However, arithmetic operations in FPA, especially division, may lead to significant numerical
errors.
It should be noted that any computation in FPA is carried out on a discrete grid of points.
For example, a floating-point number in general form is given by [1]

$$\pm\, 0.b_1 b_2 \cdots b_p \cdot 2^E,$$

where b₁ ⋯ b_p is the mantissa, with each bᵢ = 0 or 1 and b₁ ≠ 0, p is the number of significant
digits, and E is an integer exponent.

Example If p = 2 and −2 ≤ E ≤ 3, then the list of positive numbers in this arithmetic
system is 1/8, 3/16, 1/4, 3/8, 1/2, 3/4, 1, 3/2, 2, 3, 4, 6. The set defined by the given
arithmetic system is a finite subset of the rational numbers m/n (with m and n positive
integers) in the interval [1/8, 6]. The numbers are distributed non-uniformly in this interval.
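These twelve grid points can be enumerated directly; a minimal sketch, assuming the normalized form 0.b₁b₂ · 2^E described above:

```python
# Positive numbers of the toy system: mantissa 0.b1b2 with b1 = 1 (so 0.10 or
# 0.11 in binary), scaled by 2**E for -2 <= E <= 3.
values = sorted((0.5 + b2 * 0.25) * 2**E for b2 in (0, 1) for E in range(-2, 4))
print(values)
# [0.125, 0.1875, 0.25, 0.375, 0.5, 0.75, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```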

3.2 ANSI/IEEE 754 Standard
The standard specifies the following:

• floating-point number formats (single, double, extended)


• results of the basic floating-point operations

• rounding modes
• signed zero, infinity (±∞), and not-a-number (NaN)
• floating-point exceptions and their handling
• conversion between formats

These days, almost all computers use the standard.

Example Using four-figure decimal arithmetic, suppose we wish to compute s = 1.000 +
1.000 × 10⁴ − 1.000 × 10⁴. The standard left-to-right evaluation produces 0 instead of 1.000,
the exact solution. This error is caused by cancellation: the addition 1.000 + 1.000 × 10⁴
yields 1.000 × 10⁴ because of the four-digit format, losing the critical information of the 1
in this computation.
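The same computation can be reproduced with Python's decimal module by restricting the working precision to four significant digits.

```python
from decimal import Decimal, getcontext

one, big = Decimal("1.000"), Decimal("1.000E+4")

getcontext().prec = 4                 # four-figure decimal arithmetic
print(one + big - big)                # 0E+1, i.e. zero: the 1.000 is lost

getcontext().prec = 28                # with enough digits the 1.000 survives
print(one + big - big)                # 1.000
```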

Double Precision Floating Point Arithmetic defined by ANSI/IEEE Std 754-1985

A floating point number X is represented in terms of a sign bit s (0 or 1), an integer exponent
E, and a p-bit significand B = b₀.b₁ ⋯ b_{p−1} (bᵢ = 0 or 1), where

$$X = (-1)^s \, 2^E \, B.$$

B is computed as

$$B = b_0.b_1 b_2 \cdots b_{p-1} = b_0 \cdot 2^0 + \sum_{i=1}^{p-1} b_i \, 2^{-i}.$$

For double precision arithmetic, the standard defines p = 53 and −1022 ≤ E ≤ 1023.
The number X is defined as a 64-bit quantity with one sign bit s, an 11-bit biased exponent
e = E + 1023, and a 52-bit fractional mantissa m (b₁b₂ ⋯ b₅₂). Since the exponent can
always be selected such that b₀ = 1, the value of b₀ is constant and need not be stored.
The integer value of the 11-bit biased exponent e is given by

$$e = e_0 e_1 \cdots e_{10} = \sum_{i=0}^{10} e_i \, 2^{10-i}.$$

• If e = 2047 and m ≠ 0, then the value of X is the special flag NaN.

• If e = 2047 and m = 0, then the value of X is ±∞, depending on the sign bit.

• If 0 < e < 2047, then X is a normalized number:

$$X = (-1)^s \, 2^{e-1023} \, 1.m = (-1)^s \, 2^{e-1023} \left(1 + \sum_{i=1}^{52} b_i \, 2^{-i}\right).$$

• If e = 0 and m ≠ 0, then X is a denormalized number:

$$X = (-1)^s \, 2^{-1022} \, 0.m = (-1)^s \, 2^{-1022} \sum_{i=1}^{52} b_i \, 2^{-i}.$$

• If e = 0 and m = 0, then the value of X is ±0 depending on the sign bit value.
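As an illustration of this layout (not part of the standard's text), a small sketch using Python's struct module unpacks the sign bit, the biased exponent e, and the 52-bit mantissa of a few doubles:

```python
import struct

def decode_double(x):
    """Split a double into sign bit s, biased exponent e, and 52-bit mantissa m."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    s = bits >> 63
    e = (bits >> 52) & 0x7FF            # biased exponent, 0 <= e <= 2047
    m = bits & ((1 << 52) - 1)          # fraction bits b1 ... b52
    return s, e, m

for x in (1.0, -2.5, float("inf"), float("nan"), 5e-324):
    s, e, m = decode_double(x)
    print(f"{x!r:>12}  s = {s}  e = {e:4d}  m = {m:#015x}")
```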

4 Error
There are two standard measures for quantifying the error of an approximation.

Definition If p∗ is an approximation to p, the absolute error is |p − p∗|, and the relative
error is |p − p∗|/|p|, provided that p ≠ 0.

The relative error provides a way of characterizing the percentage error; when the relative
error is less than one, the negative of its log₁₀ gives the number of significant decimal
digits in the computed solution:

$$\frac{|p - p^*|}{|p|} \le \alpha \times 10^{-t}.$$

Here, t indicates the number of significant digits.


The relative error is not so useful a measure as p approaches 0; one often switches to the
absolute error in this case.
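A minimal sketch of these definitions, using π ≈ 22/7 purely as an illustration:

```python
import math

def error_report(p, p_star):
    """Absolute error, relative error, and an estimate of the number of
    significant decimal digits of the approximation p_star to p (p != 0)."""
    abs_err = abs(p - p_star)
    rel_err = abs_err / abs(p)
    digits = -math.log10(rel_err) if rel_err > 0 else math.inf
    return abs_err, rel_err, digits

abs_err, rel_err, digits = error_report(math.pi, 22 / 7)
print(f"absolute error = {abs_err:.3e}")
print(f"relative error = {rel_err:.3e}")
print(f"significant decimal digits ~ {digits:.1f}")
```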

5 Error Analysis
The cumulative effects of errors in the computation are not clearly observed in most cases.
Such errors are caused by rounding or truncation, and propagate throughout the entire
computation. Consider a polynomial

$$p(x) = p_0 + p_1 x + p_2 x^2 + \cdots + p_n x^n.$$

When the polynomial is evaluated at x = α, Horner’s scheme can be employed as follows.

$$p(\alpha) = p_0 + \alpha(p_1 + \cdots + \alpha(p_{n-2} + \alpha(p_{n-1} + \alpha p_n)) \cdots).$$

Here we may ask under what conditions, if any, on the coefficients and on α the computed
value will be reasonable. To answer this question, an error analysis needs to be performed.
Error analysis is concerned with establishing whether or not an algorithm is stable for the
problem at hand. A forward error analysis is concerned with how close the computed solution
is to the exact solution. A backward error analysis is concerned with how well the computed
solution satisfies the problem to be solved.
The distinction between backward and forward errors is explained using the following ex-
ample.

Example Let

$$A = \begin{pmatrix} 99 & 98 \\ 100 & 99 \end{pmatrix}, \qquad b = \begin{pmatrix} 1 \\ 1 \end{pmatrix}.$$

Then the exact solution of the equations Ax = b is

$$x = (1, -1)^T.$$

Also let x̂ be an approximate solution to the equations, and define the residual vector r as

$$r = b - A\hat{x}.$$
Of course, for the exact solution r = 0 and we might hope that for a solution close to x, r
should be small.
Consider the approximate solution

$$\hat{x} = \begin{pmatrix} 2.97 \\ -2.99 \end{pmatrix},$$

for which

$$\hat{x} - x = \begin{pmatrix} 1.97 \\ -1.99 \end{pmatrix},$$

and so x̂ looks to be a rather poor solution. But for this solution we have

$$r = \begin{pmatrix} -0.01 \\ 0.01 \end{pmatrix},$$
and we have almost solved the original problem. On the other hand, the approximate
solution

$$\hat{x} = \begin{pmatrix} 1.01 \\ -0.99 \end{pmatrix},$$

for which

$$\hat{x} - x = \begin{pmatrix} 0.01 \\ 0.01 \end{pmatrix},$$

gives

$$r = \begin{pmatrix} -1.97 \\ -1.99 \end{pmatrix},$$

and so it does not solve a problem close to the original problem.

Once we have computed the solution to a system of linear equations Ax = b, we can readily
compute the residual. If we can find a matrix E such that

$$E\hat{x} = r,$$

then

$$(A + E)\,\hat{x} = b,$$

and we thus have a measure of the perturbation in A required to make x̂ an exact solution.
A particular E that satisfies Ex̂ = r is

$$E = \frac{r\hat{x}^T}{\hat{x}^T\hat{x}}.$$

From this expression we have

$$\|E\|_2 \le \frac{\|r\|_2 \, \|\hat{x}\|_2}{\|\hat{x}\|_2^2} = \frac{\|r\|_2}{\|\hat{x}\|_2},$$

while from Ex̂ = r we also have

$$\|r\|_2 \le \|E\|_2 \, \|\hat{x}\|_2,$$

so that

$$\|E\|_2 = \frac{\|r\|_2}{\|\hat{x}\|_2}.$$

Thus this particular E minimizes ||E||₂, and it gives us an a posteriori bound on the backward
error.
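A short numerical check of this example (the sketch below simply applies the formulas above with NumPy; it is illustrative rather than a general routine):

```python
import numpy as np

A = np.array([[99.0, 98.0], [100.0, 99.0]])
b = np.array([1.0, 1.0])

for x_hat in (np.array([2.97, -2.99]), np.array([1.01, -0.99])):
    r = b - A @ x_hat                              # residual
    E = np.outer(r, x_hat) / (x_hat @ x_hat)       # perturbation with E @ x_hat = r
    print("x_hat =", x_hat)
    print("  r               =", r)
    print("  ||E||_2         =", np.linalg.norm(E, 2))
    print("  ||r||_2/||x||_2 =", np.linalg.norm(r) / np.linalg.norm(x_hat))
    print("  (A + E) @ x_hat =", (A + E) @ x_hat)  # recovers b (up to rounding)
```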

References
[1] B. Einarsson, editor. Accuracy and Reliability in Scientific Computing. SIAM, 2005.

[2] W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling. Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, 2nd edition, 1992.
[3] X.-S. Yang. Introduction to Computational Mathematics. World Scientific, 2015.
