
Lecture 3

Rounding Errors

Course Website
https://sites.google.com/view/kporwal/teaching/mtl107
In this Section
▶ Understand how numbers are stored in a computer.
▶ How roundoff errors can accumulate.
▶ Some recipes to avoid them.
Introduction I

One of the important tasks of numerical mathematics is the
determination of the accuracy of the results of some computation.
There are three types of errors that limit accuracy:
1. Errors in the mathematical model of the problem to be solved.
Simplified models are easier to solve (shape of objects,
'unimportant' chemical reactants, linearisation).
2. Discretization or approximation errors, which depend on the chosen
algorithm or the type of discretization. They may occur even when
computing without rounding errors:
▶ an infinite process is approximated by a finite number of terms (truncation error),
▶ a function is approximated by, e.g., a piecewise linear function,
▶ a derivative is approximated by finite differences,
▶ only a finite number of iterations is carried out.
Introduction II

3. Rounding errors occur when a real number (possibly an
intermediate result of some computation) is rounded to the
nearest machine number. The propagation of rounding
errors from one floating point operation to the next is the
most frequent source of numerical instabilities. Since
computer memory is finite, practically no real number can be
represented exactly in a computer.

We discuss floating point numbers as a representation of real
numbers.
Computation of Pi
Motivating example: quadrature of a circle
Let's try to compute π, which is the area of a circle with radius
r = 1. We approximate π by the area of an inscribed regular polygon:

αn := 2π/n,   Fn := cos(αn/2) sin(αn/2),   An := n·Fn → π as n → ∞.

Figure: Area of the inscribed regular polygon approaching the circle as n → ∞.

[See Gander, Gander & Kwok: Scientific Computing. Springer.]
Computation of Pi

▶ Define αn = 2π/n; then the area of one triangle is Fn = cos(αn/2) sin(αn/2).
▶ The area An covered by rotating this triangle n times is
An = n cos(αn/2) sin(αn/2).
▶ An → π as n → ∞.
▶ An = (n/2) sin(2π/n) = (n/2) sin(αn).
▶ sin(α2n) = sin(αn/2) = sqrt((1 − cos αn)/2) = sqrt((1 − sqrt(1 − sin² αn))/2).
▶ sin(α6) = √3/2.
Computation of Pi

s = sqrt(3)/2; A = 3*s; n = 6;      % initialization
z = [A-pi n A s];                   % store the results
while s > 1e-10                     % terminate if s = sin(alpha) is small
    s = sqrt((1-sqrt(1-s*s))/2);    % new sin(alpha/2) value
    n = 2*n; A = n/2*s;             % A = new polygonal area
    z = [z; A-pi n A s];
end
for i = 1:size(z,1)
    fprintf('%10d %20.15f %20.15f %20.15f\n', z(i,2), z(i,3), z(i,1), z(i,4))
end
Integers

Integers also suffer from the finiteness of computers.


Matlab represents integers by 32-bit signed ints (in the two's
complement format):

a = Σ_{i=0}^{30} a_i·2^i               if a_31 = 0,
a = −(2^32 − Σ_{i=0}^{30} a_i·2^i)     if a_31 = 1.

Therefore the range is −2147483648 ≤ a ≤ 2147483647. These
numbers are given by intmin and intmax, respectively.
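A quick check in Matlab (a minimal sketch; note that Matlab saturates
integer arithmetic rather than wrapping around):

intmin('int32')        % ans = -2147483648
intmax('int32')        % ans = 2147483647
intmax('int32') + 1    % saturates at 2147483647 (no wraparound in Matlab)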
Real Numbers

A number x ∈ R (in the binary number system) has the form

x = ±(1.d1 d2 d3 · · · dt−1 dt dt+1 · · · ) × 2e

where e is an integer exponent.


The binary digits di are either 0 or 1. So,

1.d1 d2 d3 · · · = 1 + d1/2 + d2/2^2 + d3/2^3 + · · ·
In general, infinitely many digits are needed to represent a real
number.
The choice of a binary representation is just one of many
possibilities. It is, indeed, a convenient choice when it comes to
computers.
Examples

1. −(1.101)₂ × 2 = −(1 + 1/2 + 1/8) × 2 = −3.25
2. (10011.01)₂ = 19.25
3. (0.010101 . . .)₂ = 1/3
4. (0.00110011 . . .)₂ = 1/5 = 0.2
The last example is of interest insofar as it shows that a finite
decimal number may correspond to a (nontrivial) infinite binary
representation. (This is not true the other way round. Why?) So,
one cannot assume that a finite decimal number is exactly
representable on a binary computer.
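This can be seen directly in Matlab (a small check; the printed digits
are what double precision actually stores for 0.2):

fprintf('%.20f\n', 0.2)    % prints 0.20000000000000001110, not 0.2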
Floating Point representation

Given any real number


 
x = ±(1 + d1/2 + d2/2^2 + · · · + dt/2^t + · · ·) × 2^e
  = ±(1.d1 d2 d3 · · · dt−1 dt dt+1 · · ·) × 2^e

with di ∈ {0, 1}.
We want to represent this number in the computer as

fl(x) = sign(x) × (1.d̃1 d̃2 · · · d̃t−1 d̃t) × 2^e.

So we need to store the sign, t bits (t digits), and the exponent e.

Floating point systems
Definition (Floating point system)
A floating point system can be characterized by a 4-tuple
(β, t, L, U), where
▶ β is the base of the number system,
▶ t is the number of digits (precision),
▶ L is the lower bound on the exponent e,
▶ U is the upper bound on the exponent e.

So,

fl(x) = ±(d0/β^0 + d1/β^1 + · · · + dt−1/β^(t−1)) × β^e

▶ Single precision: β = 2, t = 23, L = −126, U = 127
▶ Double precision: β = 2, t = 52, L = −1022, U = 1023
▶ The sign is stored separately.
Error in Floating point representation

How big is the relative error

|fl(x) − x| / |x| ?
Floating point systems: Rounding Unit and Significant Digits

The relative error in a floating point representation is called the
rounding unit, machine precision, or machine epsilon. For a general
floating point system the rounding unit is

η = (1/2) β^(1−t).

In that case, t − 1 is called the number of significant digits.
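For example, for IEEE double precision this gives the following (a quick
check; here t = 53 counts the significant bits, 52 stored plus 1 implicit,
and Matlab's eps is 2^−52, the spacing of the doubles between 1 and 2):

eta = 0.5 * 2^(1-53)    % = 2^-53, the rounding unit for double
eta == eps/2            % true: Matlab's eps = 2^-52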
Error in Floating point representation
▶ Chopping:

fl(x) = ±(1.d1 d2 d3 · · · dt−1 dt) × 2^e

The absolute error is bounded by 2^−t · 2^e.

▶ Rounding:

fl(x) = ±(1.d1 d2 d3 · · · dt−1 dt) × 2^e              if 0.dt+1 dt+2 · · · < 1/2,
fl(x) = ±((1.d1 d2 d3 · · · dt−1 dt) + 2^−t) × 2^e     if 0.dt+1 dt+2 · · · > 1/2.

The absolute error is bounded by (1/2) · 2^−t · 2^e, and the relative
error is bounded by the rounding unit (machine precision)

η = (1/2) · 2^−t.
(EXERCISE)
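The following sketch (a hypothetical helper fl_chop, not part of the
lecture code; save it as fl_chop.m) implements chopping for a toy binary
system with t fractional bits, so the bounds above can be checked
numerically:

function y = fl_chop(x, t)
% FL_CHOP  Chop a positive real x to t binary fraction digits
% (toy model of fl with chopping; hypothetical helper).
e = floor(log2(x));              % exponent such that 1 <= x/2^e < 2
m = x / 2^e;                     % normalized mantissa in [1, 2)
y = floor(m * 2^t) / 2^t * 2^e;  % keep t fractional bits (chopping)
end

For instance, fl_chop(pi, 5) returns 3.125, and |π − 3.125| ≈ 0.017
respects the bound 2^−t · 2^e = 2^−5 · 2^1 = 0.0625.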
Errors for general floating point systems

Theorem
Let x → fl(x) = g × β^e, where x ̸= 0 and g is the normalized, signed
mantissa. Then the absolute error committed in using the floating
point representation of x is bounded by

|x − fl(x)| ≤ β^(1−t) β^e          for chopping,
|x − fl(x)| ≤ (1/2) β^(1−t) β^e    for rounding.

The relative errors are:

|x − fl(x)| / |x| ≤ β^(1−t)        for chopping,
|x − fl(x)| / |x| ≤ (1/2) β^(1−t)  for rounding.
IEEE floating point numbers

ANSI/IEEE Standard 754-1985 for Binary Floating Point Arithmetic.
According to the IEEE standard, a 32-bit float has the following
structure (from en.wikipedia.org):

Figure: Layout of a 32-bit IEEE floating point number (sign, exponent, mantissa).

The exponent has 8 bits, the mantissa 23 bits. There is a sign bit.
The value of a normalized 32-bit IEEE floating point number V is

V = (−1)^S × 2^(E−127) × (1.M)

Normalized means 0 < E < 255 = 2^8 − 1. (127 is called a bias.)


Double

double:
▶ 1 sign bit
▶ 11 bits exponent
▶ 52 bits mantissa
The value of a normalized 64-bit IEEE floating point number V is

V = (−1)^S × 2^(E−1023) × (1.M)

Normalized means that 0 < E < 2047 = 2^11 − 1.
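The bit pattern is directly visible in Matlab via num2hex (a quick
check): for 1.0 the stored exponent is E = 1023 = 0x3ff and the
mantissa M is zero.

num2hex(1)    % '3ff0000000000000': sign 0, E = 0x3ff = 1023, M = 0
num2hex(2)    % '4000000000000000': E = 1024, so V = 2^(1024-1023) * 1.0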


Special Numbers

If the exponent has only zeros or ones, there are special cases:

▶ 0 (zero): e = all zeros, m = 0, s arbitrary.
▶ −Infinity, +Infinity: e = all ones, m = 0.
▶ NaN: e = all ones, m ̸= 0.
There are also non-normalized (denormalized) numbers.
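These special values arise naturally in Matlab (a quick illustration;
division by zero does not raise an error):

1/0     % Inf
-1/0    % -Inf
0/0     % NaN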
Rounding errors in IEEE
Parameters of the IEEE standard arithmetics with base β = 2.

Precision   t    e_min     e_max    η
Single      23   −125      128      2^−24 ≈ 6 · 10^−8
Double      52   −1021     1024     2^−53 ≈ 1.1 · 10^−16
Extended    63   −16381    16384    2^−64 ≈ 5 · 10^−20

Table 1: Rounding units for the various precisions

Lemma (EXERCISE)
If x ̸= 0 is a normalized floating point number and fl(x) is obtained
by rounding with t digits, then

|fl(x) − x| ≤ 2^(e−t)/2,    |fl(x) − x| / |x| ≤ 2^−t/2 ≡ η.
Rounding errors
We assume that all numbers are normalized.
Let t be the length of the mantissa.
Between powers of 2, the floating point numbers are equidistant.

Figure: Distribution of floating point numbers; here the length of the
mantissa is t = 2 and −2 ≤ e ≤ 2.

Definition
Machine precision = 2^−(t+1) (half of Matlab's eps).
This is half of the distance between consecutive numbers between 1 and 2.
Rounding errors

Rounding errors are random.

Figure: Sampling of rounding errors.
Floating point Arithmetic

It is important to use exact rounding: if x and y are machine numbers,
then

fl(x ± y) = (x ± y)(1 + ε1),
fl(x × y) = (x × y)(1 + ε2),
fl(x / y) = (x / y)(1 + ε3),

with |εi| ≤ η.
In other words: a basic operation with two floating point numbers
yields a result that is correct up to a relative error smaller than η.
Thus, the relative errors remain small after each such operation.
This is achieved by using guard digits (intermediate higher
precision).
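A small check of this model (a sketch; single precision is used so that
double can serve as the exact reference, and for single η = 2^−24):

x = single(1.1); y = single(2.3);
p = x * y;                       % fl(x*y), computed in single
ref = double(x) * double(y);     % the exact product of the two operands
rel = abs(double(p) - ref) / abs(ref)
rel <= 2^-24                     % relative error within eta, as claimed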
Guard Digit

Consider a floating point system with β = 10 and t = 4, so
η = (1/2) · 10^−3. Let

x = 0.1103 = 1.103 × 10^−1,   y = 9.963 × 10^−3.

Then x − y = 0.100337. Hence, exact rounding yields 0.1003.

Relative error: |0.100337 − 0.1003| / |0.100337| ≈ 0.37 × 10^−3 < η.

However, if we were to subtract these two numbers without guard
digits, we would obtain 0.1103 − 0.0099 = 0.1004. Now the
obtained relative error is ≈ 0.63 × 10^−3 > η.

Thus, guard digits must be used to produce exact rounding.


Rounding error example

For t = 5 we have η = 2^−6 = 0.015625.

fl(π) = fl(2^1 + 2^0 + 2^−3 + 2^−6 + · · ·) = 2^1 + 2^0 + 2^−3 = (1.10010)₂ × 2^1

|fl(π) − π| / |π| ≈ 0.0053

Similarly,

π^2 − fl(fl(π) · fl(π)) ≈ 0.12,    |π^2 − fl(fl(π) · fl(π))| / π^2 ≈ 0.012
Note on machine epsilon

▶ For any number α with α ≤ η we have fl(1 + α) = 1.
▶ The Matlab command eps returns 2^−52 = 2η, i.e., the
smallest positive number (for the data type double) for which
fl(1 + eps) > 1 (see the check below).
▶ eps can take a parameter; see help eps.
▶ In the finite difference example we had, for very small h, that
fl(f(x + h)) = fl(f(x)). Therefore,

| (fl(f(x0 + h)) − fl(f(x0)))/h − fl(f′(x0)) | = |fl(f′(x0))|,

since the computed difference quotient is exactly zero.
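A quick check of this behaviour (a sketch; with round-to-nearest-even,
adding exactly eps/2 to 1 still rounds back to 1):

(1 + eps/2) - 1    % 0: fl(1 + eps/2) = 1
(1 + eps)   - 1    % 2.2204e-16 = eps: the sum is exactly representable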
Rounding errors summary

Lemma
1. With the machine precision η we have fl(x) = x(1 + ε) with
|ε| ≤ η.
2. If ∗ is an elementary operation then fl(x ∗ y ) = (x ∗ y )(1 + ε)
with |ε| ≤ η.

Wilkinson’s Principle
The result of a numerical computation on the computer is the
exact result with slightly perturbed initial data.
This also holds for good implementations of (library) functions!
Cancellation

Cancellation is a special kind of rounding error. Consider the
following two numbers with 5 decimal digits:

1.2345e0 − 1.2344e0 = 0.0001e0 = 1.0000e−4

If the two numbers were exact, the result delivered by the
computer would also be exact. But if the two numbers had
been obtained by previous calculations and were affected by
rounding errors, then the result would at best be 1.xxxxe−4,
where the digits denoted by x are unknown.
Cancellation (cont.)

Suppose z = x − y, where x ≈ y. Then

|z − fl(z)| ≤ |x − fl(x)| + |y − fl(y)|,

from which it follows that the relative error satisfies

|z − fl(z)| / |z| ≤ (|x − fl(x)| + |y − fl(y)|) / |x − y|.

Numerator: OK.
Denominator: very close to zero if x ≈ y. So the relative error in
z could become large.
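A classic illustration in Matlab (a sketch; this is the same 1 − cos
cancellation that appears in the circle example below): (1 − cos x)/x²
tends to 1/2 as x → 0, yet for small x the subtraction destroys all
digits, while the mathematically equivalent form 2 sin²(x/2)/x² is fine.

x = 1e-8;
(1 - cos(x)) / x^2      % 0: cos(1e-8) rounds to exactly 1, total cancellation
2*sin(x/2)^2 / x^2      % 0.5000...: equivalent, cancellation-free form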
Library function for sinh

The sinus hyperbolicus is defined as

y = sinh(x) = (e^x − e^−x) / 2.

If x ≈ 0 we have to expect cancellation between the two terms. Use
instead the Taylor expansion of sinh:

sinh(x) = x + x^3/6 + cosh(ξ) · x^5/120,    |ξ| < |x|.

For small enough x we can expect very good approximations.
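A small comparison (a sketch; for x = 10^−9 the truncated series
x + x³/6 is already exact to machine precision, while the defining
formula loses roughly half of the significant digits to cancellation):

x = 1e-9;
naive  = (exp(x) - exp(-x)) / 2;   % cancellation: ~1e-7 relative error
series = x + x^3/6;                % accurate to machine precision
fprintf('%.16e\n%.16e\n', naive, series)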
Quadrature of a circle revisited

This is what happened in our previous example to compute π by
the inscribed regular polygons.
To compute sin(αn/2) from sin(αn), we used the recurrence

sin(αn/2) = sqrt( (1 − sqrt(1 − sin² αn)) / 2 ).

Since sin αn → 0, the numerator on the right,

1 − sqrt(1 − ε²)

with small ε = sin(αn), suffers from severe cancellation.
Therefore the algorithm performed badly, although theory and
program are both correct.
Fix

Quadrature of a circle revisited (cont.)

Multiplying numerator and denominator by sqrt(1 + sqrt(1 − sin² αn))
yields the equivalent, cancellation-free form

sin(αn/2) = sin αn / sqrt( 2 (1 + sqrt(1 − sin² αn)) ).
Modified Program

oldA = 0; s = sqrt(3)/2; A = 3*s; n = 6;   % initialization
z = [A-pi n A s];                          % store the results
while A > oldA                             % terminate if the area no longer increases
    oldA = A;
    s = s/sqrt(2*(1+sqrt((1-s)*(1+s))));   % new sin(alpha/2); (1-s)*(1+s) = 1-s^2
    n = 2*n; A = n/2*s;                    % A = new polygonal area
    z = [z; A-pi n A s];
end
for i = 1:size(z,1)
    fprintf('%10d %20.15f %20.15f\n', z(i,2), z(i,3), z(i,1))
end
Paper on high precision computation

David H. Bailey, Roberto Barrio, and Jonathan M. Borwein,
High-precision computation: Mathematical physics and dynamics,
Applied Mathematics and Computation, vol. 218 (2012), pp. 10106–10121.
http://dx.doi.org/10.1016/j.amc.2012.03.087
Gist of the paper:
In many very large scale problems it is difficult to achieve sufficient
accuracy: for a rapidly growing body of important scientific
computing applications, a higher level of numeric precision is
required. This is facilitated by high-precision software packages.
Such software is available, but it is awfully slow.
