
Lecture 3

Rounding Errors

Course Website
https://sites.google.com/view/kporwal/teaching/mtl107
In this Section
▶ Understand how numbers are stored in a computer.
▶ How roundoff errors can accumulate.
▶ Some recipes to avoid them.
Introduction I

One of the important tasks of numerical mathematics is the
determination of the accuracy of the results of some computation.
There are three types of errors that limit accuracy:
1. Errors in the mathematical model of the problem to be solved.
Simplified models are easier to solve (shape of objects,
'unimportant' chemical reactants, linearisation).
2. Discretization or approximation errors, which depend on the chosen
algorithm or the type of discretization. They may occur even when
computing without rounding errors:
▶ an infinite process is approximated by a finite number of terms (truncation error),
▶ a function is approximated by, e.g., a piecewise linear function,
▶ a derivative is approximated by finite differences,
▶ only a finite number of iterations is carried out.
Introduction II

3. Rounding errors occur when a real number (possibly an
intermediate result of some computation) is rounded to the
nearest machine number. The propagation of rounding
errors from one floating point operation to the next is the
most frequent source of numerical instabilities. Since
computer memory is finite, practically no real number can be
represented exactly in a computer.

We discuss floating point numbers as a representation of real
numbers.
Computation of Pi
Motivating example: quadrature of a circle
Let's try to compute π, which is the area of a circle with radius
r = 1. We approximate π by the area of an inscribed regular polygon:

αn := 2π/n,   Fn := cos(αn/2) sin(αn/2),   An := n·Fn → π as n → ∞.

Figure: Area of the inscribed regular polygon approaching the circle as n → ∞.

[See Gander, Gander & Kwok: Scientific Computing. Springer.]
Computation of Pi

▶ Define αn = 2π/n; then the area of one triangle is Fn = cos(αn/2) sin(αn/2).
▶ The area An covered by rotating this triangle n times is
An = n cos(αn/2) sin(αn/2).
▶ An → π as n → ∞.
▶ An = (n/2) sin(2π/n) = (n/2) sin(αn).
▶ sin(α2n) = sin(αn/2) = sqrt((1 − cos αn)/2) = sqrt((1 − sqrt(1 − sin² αn))/2).
▶ sin(α6) = √3/2.
Computation of Pi

s = sqrt(3)/2; A = 3*s; n = 6;      % initialization
z = [A-pi n A s];                   % store the results
while s > 1e-10                     % terminate if s = sin(alpha) is small
    s = sqrt((1-sqrt(1-s*s))/2);    % new sin(alpha/2) value
    n = 2*n; A = n/2*s;             % A = new polygonal area
    z = [z; A-pi n A s];
end
for i = 1:size(z,1)
    fprintf('%10d %20.15f %20.15f %20.15f\n', z(i,2), z(i,3), z(i,1), z(i,4))
end
Integers

Integers also suffer from the finiteness of computers.


Matlab represents integers by 32-bit signed ints (in the two's
complement format):

a = Σ_{i=0}^{30} a_i·2^i               if a_31 = 0,
a = −(2^32 − Σ_{i=0}^{30} a_i·2^i)     if a_31 = 1.

Therefore the range is −2147483648 ≤ a ≤ 2147483647. These
numbers are given by intmin and intmax, respectively.
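A quick check in Matlab (a minimal sketch; note that Matlab saturates
integer arithmetic rather than wrapping around):

intmin('int32')        % ans = -2147483648
intmax('int32')        % ans = 2147483647
intmax('int32') + 1    % saturates at 2147483647 (no wraparound in Matlab)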
Real Numbers

A number x ∈ R (in the binary number system) has the form

x = ±(1.d1 d2 d3 · · · dt−1 dt dt+1 · · · ) × 2e

where e is an integer exponent.


The binary digits di are either 0 or 1. So,

1.d1 d2 d3 · · · = 1 + d1/2 + d2/2^2 + d3/2^3 + · · ·
In general, infinitely many digits are needed to represent a real
number.
The choice of a binary representation is just one of many
possibilities. It is, indeed, a convenient choice when it comes to
computers.
Examples

1. −(1.101)₂ × 2 = −(1 + 1/2 + 1/8) × 2 = −3.25
2. (10011.01)₂ = 19.25
3. (0.010101 . . .)₂ = 1/3
4. (0.00110011 . . .)₂ = 1/5 = 0.2
The last example is of interest insofar as it shows that a finite
decimal number may correspond to a (nontrivial) infinite binary
representation. (This is not true the other way round. Why?) So,
one cannot assume that a finite decimal number is exactly
representable on a binary computer.
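This can be seen directly in Matlab (a small check; the printed digits
are what double precision actually stores for 0.2):

fprintf('%.20f\n', 0.2)    % prints 0.20000000000000001110, not 0.2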
Floating Point representation

Given any real number


 
x = ±(1 + d1/2 + d2/2^2 + · · · + dt/2^t + · · ·) × 2^e
  = ±(1.d1 d2 d3 · · · dt−1 dt dt+1 · · ·) × 2^e

with di ∈ {0, 1}.
We want to represent this number in the computer as

fl(x) = sign(x) × (1.d̃1 d̃2 · · · d̃t−1 d̃t) × 2^e.

So we need to store the sign, t bits (t digits), and the exponent e.

Floating point systems
Definition (Floating point system)
A floating point system can be characterized by a 4-tuple
(β, t, L, U), where
▶ β is the base of the number system,
▶ t is the number of digits (precision),
▶ L is the lower bound on the exponent e,
▶ U is the upper bound on the exponent e.

So,

fl(x) = ±(d0/β^0 + d1/β^1 + · · · + dt−1/β^(t−1)) × β^e

▶ Single precision: β = 2, t = 23, L = −126, U = 127
▶ Double precision: β = 2, t = 52, L = −1022, U = 1023
▶ The sign is stored separately.
Error in Floating point representation

How big is the relative error

|fl(x) − x| / |x| ?
Floating point systems: Rounding Unit and Significant Digits

The relative error in a floating point representation is called the
rounding unit, machine precision, or machine epsilon. For a general
floating point system the rounding unit is

η = (1/2) β^(1−t).

In that case, t − 1 is called the number of significant digits.
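For example, for IEEE double precision this gives the following (a quick
check; here t = 53 counts the significant bits, 52 stored plus 1 implicit,
and Matlab's eps is 2^−52, the spacing of the doubles between 1 and 2):

eta = 0.5 * 2^(1-53)    % = 2^-53, the rounding unit for double
eta == eps/2            % true: Matlab's eps = 2^-52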
Error in Floating point representation
▶ Chopping:

fl(x) = ±(1.d1 d2 d3 · · · dt−1 dt) × 2^e

The absolute error is bounded by 2^−t · 2^e.

▶ Rounding:

fl(x) = ±(1.d1 d2 d3 · · · dt−1 dt) × 2^e              if 0.dt+1 dt+2 · · · < 1/2,
fl(x) = ±((1.d1 d2 d3 · · · dt−1 dt) + 2^−t) × 2^e     if 0.dt+1 dt+2 · · · > 1/2.

The absolute error is bounded by (1/2) · 2^−t · 2^e, and the relative
error is bounded by the rounding unit (machine precision)

η = (1/2) · 2^−t.
(EXERCISE)
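The following sketch (a hypothetical helper fl_chop, not part of the
lecture code; save it as fl_chop.m) implements chopping for a toy binary
system with t fractional bits, so the bounds above can be checked
numerically:

function y = fl_chop(x, t)
% FL_CHOP  Chop a positive real x to t binary fraction digits
% (toy model of fl with chopping; hypothetical helper).
e = floor(log2(x));              % exponent such that 1 <= x/2^e < 2
m = x / 2^e;                     % normalized mantissa in [1, 2)
y = floor(m * 2^t) / 2^t * 2^e;  % keep t fractional bits (chopping)
end

For instance, fl_chop(pi, 5) returns 3.125, and |π − 3.125| ≈ 0.017
respects the bound 2^−t · 2^e = 2^−5 · 2^1 = 0.0625.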
Errors for general floating point systems

Theorem
Let x → fl(x) = g × β^e, where x ̸= 0 and g is the normalized, signed
mantissa. Then the absolute error committed in using the floating
point representation of x is bounded by

|x − fl(x)| ≤ β^(1−t) β^e          for chopping,
|x − fl(x)| ≤ (1/2) β^(1−t) β^e    for rounding.

The relative errors are:

|x − fl(x)| / |x| ≤ β^(1−t)        for chopping,
|x − fl(x)| / |x| ≤ (1/2) β^(1−t)  for rounding.
IEEE floating point numbers

ANSI/IEEE Standard 754-1985 for Binary Floating Point Arithmetic.
According to the IEEE standard, a 32-bit float has the following
structure (from en.wikipedia.org):

Figure: Layout of a 32-bit IEEE floating point number (sign, exponent, mantissa).

The exponent has 8 bits, the mantissa 23 bits. There is a sign bit.
The value of a normalized 32-bit IEEE floating point number V is

V = (−1)^S × 2^(E−127) × (1.M)

Normalized means 0 < E < 255 = 2^8 − 1. (127 is called a bias.)


Double

double:
▶ 1 sign bit
▶ 11 bits exponent
▶ 52 bits mantissa
The value of a normalized 64-bit IEEE floating point number V is

V = (−1)^S × 2^(E−1023) × (1.M)

Normalized means that 0 < E < 2047 = 2^11 − 1.
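The bit pattern is directly visible in Matlab via num2hex (a quick
check): for 1.0 the stored exponent is E = 1023 = 0x3ff and the
mantissa M is zero.

num2hex(1)    % '3ff0000000000000': sign 0, E = 0x3ff = 1023, M = 0
num2hex(2)    % '4000000000000000': E = 1024, so V = 2^(1024-1023) * 1.0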


Special Numbers

If the exponent has only zeros or ones, there are special cases:

▶ 0 (zero): e = all zeros, m = 0, s arbitrary.
▶ −Infinity, +Infinity: e = all ones, m = 0.
▶ NaN: e = all ones, m ̸= 0.
There are also non-normalized (denormalized) numbers.
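These special values arise naturally in Matlab (a quick illustration;
division by zero does not raise an error):

1/0     % Inf
-1/0    % -Inf
0/0     % NaN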
Rounding errors in IEEE
Parameters of the IEEE standard arithmetics with base β = 2.

Precision   t    e_min     e_max    η
Single      23   −125      128      2^−24 ≈ 6 · 10^−8
Double      52   −1021     1024     2^−53 ≈ 1.1 · 10^−16
Extended    63   −16381    16384    2^−64 ≈ 5 · 10^−20

Table 1: Rounding units for the various precisions

Lemma (EXERCISE)
If x ̸= 0 is a normalized floating point number and fl(x) is obtained
by rounding with t digits, then

|fl(x) − x| ≤ 2^(e−t)/2,    |fl(x) − x| / |x| ≤ 2^−t/2 ≡ η.
Rounding errors
We assume that all numbers are normalized.
Let t be the length of the mantissa.
Between powers of 2, the floating point numbers are equidistant.

Figure: Distribution of floating point numbers; here the length of the
mantissa is t = 2 and −2 ≤ e ≤ 2.

Definition
Machine precision = 2^−(t+1) (half of Matlab's eps).
This is half of the distance between consecutive numbers between 1 and 2.
Rounding errors

Rounding errors are random.

Figure: Sampling of rounding errors.
Floating point Arithmetic

It is important to use exact rounding: if x and y are machine numbers,
then

fl(x ± y) = (x ± y)(1 + ε1),
fl(x × y) = (x × y)(1 + ε2),
fl(x / y) = (x / y)(1 + ε3),

with |εi| ≤ η.
In other words: a basic operation with two floating point numbers
yields a result that is correct up to a relative error smaller than η.
Thus, the relative errors remain small after each such operation.
This is achieved by using guard digits (intermediate higher
precision).
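A small check of this model (a sketch; single precision is used so that
double can serve as the exact reference, and for single η = 2^−24):

x = single(1.1); y = single(2.3);
p = x * y;                       % fl(x*y), computed in single
ref = double(x) * double(y);     % the exact product of the two operands
rel = abs(double(p) - ref) / abs(ref)
rel <= 2^-24                     % relative error within eta, as claimed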
Guard Digit

Consider a floating point system with β = 10 and t = 4, so
η = (1/2) · 10^−3. Let

x = 0.1103 = 1.103 × 10^−1,   y = 9.963 × 10^−3.

Then x − y = 0.100337. Hence, exact rounding yields 0.1003.

Relative error: |0.100337 − 0.1003| / |0.100337| ≈ 0.37 × 10^−3 < η.

However, if we were to subtract these two numbers without guard
digits, we would obtain 0.1103 − 0.0099 = 0.1004. Now the
obtained relative error is ≈ 0.63 × 10^−3 > η.

Thus, guard digits must be used to produce exact rounding.


Rounding error example

For t = 5 we have η = 2^−6 = 0.015625.

fl(π) = fl(2^1 + 2^0 + 2^−3 + 2^−6 + · · ·) = 2^1 + 2^0 + 2^−3 = (1.10010)₂ × 2^1

|fl(π) − π| / |π| ≈ 0.0053

Similarly,

π^2 − fl(fl(π) · fl(π)) ≈ 0.12,    |π^2 − fl(fl(π) · fl(π))| / π^2 ≈ 0.012
Note on machine epsilon

▶ For any number α with α ≤ η we have fl(1 + α) = 1.
▶ The Matlab command eps returns 2^−52 = 2η, i.e., the
smallest positive number (for the data type double) for which
fl(1 + eps) > 1 (see the check below).
▶ eps can take a parameter; see help eps.
▶ In the finite difference example we had, for very small h, that
fl(f(x + h)) = fl(f(x)). Therefore,

| (fl(f(x0 + h)) − fl(f(x0)))/h − fl(f′(x0)) | = |fl(f′(x0))|,

since the computed difference quotient is exactly zero.
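A quick check of this behaviour (a sketch; with round-to-nearest-even,
adding exactly eps/2 to 1 still rounds back to 1):

(1 + eps/2) - 1    % 0: fl(1 + eps/2) = 1
(1 + eps)   - 1    % 2.2204e-16 = eps: the sum is exactly representable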
Rounding errors summary

Lemma
1. With the machine precision η we have fl(x) = x(1 + ε) with
|ε| ≤ η.
2. If ∗ is an elementary operation then fl(x ∗ y ) = (x ∗ y )(1 + ε)
with |ε| ≤ η.

Wilkinson’s Principle
The result of a numerical computation on the computer is the
exact result with slightly perturbed initial data.
This also holds for good implementations of (library) functions!
Cancellation

Cancellation is a special kind of rounding error. Consider the
following two numbers with 5 decimal digits:

1.2345e0 − 1.2344e0 = 0.0001e0 = 1.0000e−4

If the two numbers were exact, the result delivered by the
computer would also be exact. But if the two numbers had
been obtained by previous calculations and were affected by
rounding errors, then the result would at best be 1.xxxxe−4,
where the digits denoted by x are unknown.
Cancellation (cont.)

Suppose z = x − y, where x ≈ y. Then

|z − fl(z)| ≤ |x − fl(x)| + |y − fl(y)|,

from which it follows that the relative error satisfies

|z − fl(z)| / |z| ≤ (|x − fl(x)| + |y − fl(y)|) / |x − y|.

Numerator: OK.
Denominator: very close to zero if x ≈ y. So the relative error in
z could become large.
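A classic illustration in Matlab (a sketch; this is the same 1 − cos
cancellation that appears in the circle example below): (1 − cos x)/x²
tends to 1/2 as x → 0, yet for small x the subtraction destroys all
digits, while the mathematically equivalent form 2 sin²(x/2)/x² is fine.

x = 1e-8;
(1 - cos(x)) / x^2      % 0: cos(1e-8) rounds to exactly 1, total cancellation
2*sin(x/2)^2 / x^2      % 0.5000...: equivalent, cancellation-free form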
Library function for sinh

The sinus hyperbolicus is defined as

y = sinh(x) = (e^x − e^−x) / 2.

If x ≈ 0 we have to expect cancellation between the two terms. Use
instead the Taylor expansion of sinh:

sinh(x) = x + x^3/6 + cosh(ξ) · x^5/120,    |ξ| < |x|.

For small enough x we can expect very good approximations.
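A small comparison (a sketch; for x = 10^−9 the truncated series
x + x³/6 is already exact to machine precision, while the defining
formula loses roughly half of the significant digits to cancellation):

x = 1e-9;
naive  = (exp(x) - exp(-x)) / 2;   % cancellation: ~1e-7 relative error
series = x + x^3/6;                % accurate to machine precision
fprintf('%.16e\n%.16e\n', naive, series)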
Quadrature of a circle revisited

This is what happened in our previous example to compute π by
the inscribed regular polygons.
To compute sin(αn/2) from sin(αn), we used the recurrence

sin(αn/2) = sqrt( (1 − sqrt(1 − sin² αn)) / 2 ).

Since sin αn → 0, the numerator on the right,

1 − sqrt(1 − ε²)

with small ε = sin(αn), suffers from severe cancellation.
Therefore the algorithm performed badly, although theory and
program are both correct.
Fix

Quadrature of a circle revisited (cont.)

Multiplying numerator and denominator by sqrt(1 + sqrt(1 − sin² αn))
yields the equivalent, cancellation-free form

sin(αn/2) = sin αn / sqrt( 2 (1 + sqrt(1 − sin² αn)) ).
Modified Program

oldA = 0; s = sqrt(3)/2; A = 3*s; n = 6;   % initialization
z = [A-pi n A s];                          % store the results
while A > oldA                             % terminate if the area no longer increases
    oldA = A;
    s = s/sqrt(2*(1+sqrt((1-s)*(1+s))));   % new sin(alpha/2); (1-s)*(1+s) = 1-s^2
    n = 2*n; A = n/2*s;                    % A = new polygonal area
    z = [z; A-pi n A s];
end
for i = 1:size(z,1)
    fprintf('%10d %20.15f %20.15f\n', z(i,2), z(i,3), z(i,1))
end
Paper on high precision computation

David H. Bailey, Roberto Barrio, and Jonathan M. Borwein,
High-precision computation: Mathematical physics and dynamics,
Applied Mathematics and Computation, vol. 218 (2012), pp. 10106–10121.
http://dx.doi.org/10.1016/j.amc.2012.03.087
Gist of the paper:
In many very large scale problems it is difficult to achieve sufficient
accuracy: for a rapidly growing body of important scientific
computing applications, a higher level of numeric precision is
required. This is facilitated by high-precision software packages.
Such software is available, but it is awfully slow.
