T01 Floating Point

Floating Point Operations
Numerical Methods
2019-2020
Discrete World
• We cannot represent all the real numbers with a computer.
• Continuous processes in space or in time must be represented in a discrete world.
• The hardware of any computer implements only the elementary arithmetic operations: +, -, *, /
• Any other mathematical operation must be implemented through a specific algorithm. This includes:
– Any transcendental function: sin(x), ln(x)
– Differentiation
Floating Point Numbers
• To describe a real number we use an integer part, a decimal point and a fractional part:
37.21829, 0.002271828
(Figure: a number with its integer part, decimal point and fractional part labelled.)
• But we normally use the normalized scientific notation:
$37.21829 = 0.3721829 \times 10^{2}$
$0.002271828 = 0.2271828 \times 10^{-2}$
• Using normalized notation, each number is represented by a fraction multiplied by $10^{n}$:
$x = \pm 0.d_1 d_2 d_3 \ldots \times 10^{n}$
• With the additional constraint that the first digit of the fractional part must be nonzero:
$d_1 \neq 0$
• Thus, in normalized scientific form, a decimal number is represented as:
$x = \pm r \times 10^{n}, \quad \frac{1}{10} \le r < 1$
where r is called the normalized mantissa and n the exponent.
• Computers use the normalized scientific notation in a similar way, but with a different base: the binary (base-2) system.
$x = \pm q \times 2^{m}, \quad \frac{1}{2} \le q < 1$
• q is then a bit sequence $b_i$ (0 or 1) with the additional condition:
$b_1 = 1$
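The normalized binary form can be inspected directly. The sketch below is in Python (an assumption, since the deck itself works in Matlab/Octave) and relies on the standard `math.frexp`, which returns a mantissa with 1/2 <= |q| < 1 and an integer exponent. The helper name `normalized_binary` is only illustrative.

```python
import math

def normalized_binary(x):
    """Return (q, m) with x == q * 2**m and 1/2 <= |q| < 1 (for x != 0)."""
    q, m = math.frexp(x)
    return q, m

for value in (37.21829, 0.002271828):
    q, m = normalized_binary(value)
    # ldexp(q, m) computes q * 2**m and rebuilds the stored value exactly.
    print(f"{value} = {q} * 2**{m}", math.ldexp(q, m) == value)
```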
• The important point is that any computer has a finite word length to hold the representation of a real number.
• We can only represent numbers with a finite number of digits. Irrational numbers cannot be represented exactly.
• Real numbers can be too big or too small to be represented within a computer.
• The finite set of numbers that can be represented in a computer is called the set of machine numbers:
$x = \pm (0.1 b_2 b_3 \ldots b_n)_2 \times 2^{m}$
• The smallest positive number is of the form:
$(0.1)_2 \times 2^{m}$, with m at its most negative value
• And thus there will always be a "hole around zero".
• The detailed representation will depend on the machine architecture.
• To fix ideas, suppose that we deal with a 64-bit word divided as follows:
– Sign of q: 1 bit
– Sign of m: 1 bit
– Integer |m|: 14 bits
– Integer q: 48 bits
(Figure: layout of the 64-bit word. A 16-bit field holds the mantissa sign, the exponent sign and the exponent; a 48-bit field, starting at the radix point, holds the normalized mantissa.)
• With these constraints, the maximum exponent that can be represented is:
$2^{14} - 1 = 16383$
• On the other hand, the smallest step in the mantissa q that our machine can represent is:
$2^{-48} \approx 0.3553 \times 10^{-14}$
Computer Errors
• Some real numbers cannot be represented with this word structure:
$x = \pm (0.1 d_2 d_3 \ldots d_{48} d_{49} d_{50} \ldots)_2 \times 2^{m}$
• We are forced to discard some of the digits in the mantissa, starting at the 49th position:
$d_{49} d_{50} \ldots$
• We can do this in two ways. By cutting off (chopping):
$x' = \pm (0.1 d_2 d_3 \ldots d_{48})_2 \times 2^{m}$
• Or by rounding off. In this case we first form:
$x'' = \pm \left[ (0.1 d_2 d_3 \ldots d_{48})_2 + 2^{-48} \right] \times 2^{m}$
• And then we take whichever of x' and x'' is nearest to x.
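Chopping and rounding to a 48-bit mantissa can be imitated on an ordinary 64-bit double, whose own mantissa is slightly longer (53 bits). The following is a minimal Python sketch under that assumption; the names `chop` and `round_off` are hypothetical helpers, not part of the course material.

```python
import math

MANTISSA_BITS = 48  # mantissa length of the hypothetical machine in the text

def chop(x, bits=MANTISSA_BITS):
    """Cut the mantissa of x after `bits` binary digits."""
    q, m = math.frexp(x)                 # x = q * 2**m with 1/2 <= |q| < 1
    kept = math.trunc(q * 2.0**bits)     # drop every digit after position `bits`
    return math.ldexp(float(kept), m - bits)

def round_off(x, bits=MANTISSA_BITS):
    """Round the mantissa of x to the nearest value with `bits` binary digits."""
    q, m = math.frexp(x)
    kept = round(q * 2.0**bits)          # nearest integer number of 2**-bits steps
    return math.ldexp(float(kept), m - bits)

x = 0.1
print("chopped :", chop(x), "relative error", abs(x - chop(x)) / x)
print("rounded :", round_off(x), "relative error", abs(x - round_off(x)) / x)
```

Rounding never does worse than chopping, and its relative error stays below 2^-48.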
Absolute Error
• The absolute error is defined as:
$\varepsilon_a = |x - x'|$
• If we round, taking x' as the stored (rounded) value, then:
$\varepsilon_a = |x - x'| \le \frac{1}{2} |x'' - x'| = 2^{-49+m}$
• The absolute error depends on the magnitude of the number (through m).
Relative Error
• On the other hand, the relative error is defined as:
$\varepsilon_r = \frac{|x - x'|}{|x|}$
• And in the case of rounding we have:
$\frac{|x - x''|}{|x|} \le \frac{2^{-49+m}}{(0.1 d_2 d_3 \ldots)_2 \times 2^{m}} \le \frac{2^{-49}}{1/2} = 2^{-48}$
• And the error no longer depends on the magnitude of this particular number.
• In our machine the representation error of a real number will not be greater than this value:
$\varepsilon_r \le 2^{-48} \approx 0.36 \times 10^{-14}$
• Thus, all floating point operations are normally performed with 14 significant digits.
• This particular value is the machine rounding error.
Computer Errors
• If during a computation we get a number of the form:
$\pm q \times 2^{m}$
where m is bigger than the largest exponent of your computer, you will get an overflow error.
• And if you try to use a number smaller than the smallest representable number you get an underflow error.
• Invalid operations such as 0/0 give NaN (Not a Number).
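These three situations can be provoked on purpose. The sketch below uses Python and IEEE double precision (both assumptions; the hypothetical 64-bit word of the text has different limits). In this arithmetic, overflow yields `inf` and underflow yields zero rather than stopping the program.

```python
import sys

largest = sys.float_info.max      # largest finite double
smallest = sys.float_info.min     # smallest positive normalized double

overflowed = largest * 2.0               # exponent too large: overflow
underflowed = smallest / 2.0**60         # falls into the hole around zero
not_a_number = overflowed - overflowed   # inf - inf is an invalid operation

print(overflowed, underflowed, not_a_number)
```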
Arithmetic Operations
• An elementary arithmetic operation never improves the precision of your numbers:
$x \pm y = x' \pm y' \pm \left( \varepsilon_a(x) + \varepsilon_a(y) \right)$
• Your errors always pile up and never disappear. In the worst case:
$\varepsilon_a(x \pm y) = \varepsilon_a(x) + \varepsilon_a(y)$
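The accumulation of errors is easy to observe by adding the same inexact number many times. This is a small Python sketch (an assumption; the deck uses Matlab/Octave) that compares a plain running sum with `math.fsum`, which returns the correctly rounded sum of the same terms.

```python
import math

def naive_sum(value, count):
    total = 0.0
    for _ in range(count):
        total += value   # every addition is rounded, and the errors add up
    return total

count = 1_000_000
naive = naive_sum(0.1, count)
reference = math.fsum([0.1] * count)   # correctly rounded sum of the same terms
print(naive, reference, abs(naive - reference))
```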
Matlab/Octave
• We will make intensive use of the Matlab/Octave tools for numerical analysis.
• Octave+GNUPlot (QtOctave) is freely distributed software with functionality similar to Matlab.
• This software offers a great deal of integrated functionality: a programming language and many graphics facilities.
• All the useful standard functions are implemented.
• The standard logical operators are also available.
• And the operators follow the standard hierarchy of precedence.
• The smallest and largest numbers that your computer can use are stored in the constants realmin and realmax.
• By default, Matlab/Octave works with a precision of 16 decimal digits but, also by default, it shows values on screen in a short format with fewer digits (about 5 significant digits with format short).
• Consider the following simple arithmetic operation:
• The explanation of the former result lies in the fact that the computer works in base 2, and the floating point number t = 0.1 is not represented as 1/10 but as this infinite series:
$\frac{1}{10} = \frac{1}{2^{4}} + \frac{1}{2^{5}} + \frac{0}{2^{6}} + \frac{0}{2^{7}} + \frac{1}{2^{8}} + \frac{1}{2^{9}} + \frac{0}{2^{10}} + \cdots$
• And the computer will never be able to store all the mantissa digits.
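The binary expansion of 1/10 can be generated with exact rational arithmetic. The sketch below is in Python (an assumption) and uses the standard `fractions.Fraction`; the helper `binary_digits` is illustrative. The operation shown on the original slide is not preserved, so the last comparison is only a typical example of the effect.

```python
from fractions import Fraction

def binary_digits(fraction, count):
    """First `count` binary digits after the radix point of 0 <= fraction < 1."""
    digits = []
    for _ in range(count):
        fraction *= 2
        bit = int(fraction >= 1)
        digits.append(bit)
        fraction -= bit
    return digits

bits = binary_digits(Fraction(1, 10), 20)
print("1/10 = 0." + "".join(map(str, bits)) + "... (base 2)")

# The stored double is the nearest 53-bit value, not 1/10 itself.
print(Fraction(0.1) == Fraction(1, 10))   # False
print(0.1 + 0.2 == 0.3)                   # False
```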
• When representing floating point numbers the computer makes a representation error whose upper limit is:
$\frac{|x - \mathrm{fl}(x)|}{|x|} \le \frac{1}{2} \varepsilon$
• where fl(x) is the floating point representation. The constant ε is called the machine epsilon.
• This constant ε is the distance between 1 and the next larger floating point number, that is, the relative spacing between adjacent floating point numbers. It also sets the smallest relative error we can achieve. This value is predefined in eps and takes the value ε = 2^(-52).
• This constant can be calculated on any computer using the following algorithm:
• The constant b ends with the value b = eps/2. We will not be able to do floating point operations with a smaller relative error.
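The original listing is not preserved in the text. A common form of this algorithm, written here in Python as an assumption about what the slide showed, halves b until adding it to 1 no longer changes the result:

```python
import sys

b = 1.0
while 1.0 + b > 1.0:   # stop once adding b no longer changes 1.0
    b = b / 2.0

eps = 2.0 * b          # the last value of b that still made a difference
print(b, eps, eps == sys.float_info.epsilon)
```

In IEEE double precision the loop stops with b = 2^-53, which is eps/2.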
• There are many ways to compute the machine epsilon. Here we have another example:
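The slide's own second example is not preserved. One well-known alternative, given here in Python as an illustration and not necessarily the one the author used, exploits the fact that 4/3 cannot be stored exactly in base 2:

```python
import sys

# 4/3 is rounded when stored, so this expression is not exactly zero.
eps_estimate = abs(3.0 * (4.0 / 3.0 - 1.0) - 1.0)
print(eps_estimate, eps_estimate == sys.float_info.epsilon)
```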
Loss of Significance
• The limited precision means that we have to pay special attention if we want to write good programs.
• A common error appears when subtracting very similar numbers. The relative error can become really huge. This particular error is called cancellation of terms.
• In the case of subtraction the relative error can be computed as:
$\varepsilon_r = \frac{\varepsilon_a(x) + \varepsilon_a(y)}{|x - y|}$
• And if x and y are very similar we can obtain a huge relative error.
• Consider for instance the expression
$y = x - \sin(x)$
where:
$x = 0.6666666667 \times 10^{-1}$
$\sin(x) = 0.6661729492 \times 10^{-1}$
$x - \sin(x) = 0.0004937175 \times 10^{-1} = 0.4937175000 \times 10^{-4}$
• We have lost three significant digits, i.e., the relative error has grown by a factor of $10^{3}$.
• To avoid this problem we are forced to rewrite the mathematical expression. For instance, consider:
$f(x) = \sqrt{x^{2} + 1} - 1$
• If x is near zero it is much better to use instead:
$f(x) = \left( \sqrt{x^{2} + 1} - 1 \right) \frac{\sqrt{x^{2} + 1} + 1}{\sqrt{x^{2} + 1} + 1} = \frac{x^{2}}{\sqrt{x^{2} + 1} + 1}$
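The difference between the two forms shows up as soon as x is small. This is a minimal Python sketch (an assumption; the function names `f_naive` and `f_stable` are illustrative):

```python
import math

def f_naive(x):
    # Subtracts two nearly equal numbers when x is close to zero.
    return math.sqrt(x * x + 1.0) - 1.0

def f_stable(x):
    # Same function after multiplying by the conjugate: no subtraction left.
    return x * x / (math.sqrt(x * x + 1.0) + 1.0)

for x in (1e-3, 1e-6, 1e-9):
    print(x, f_naive(x), f_stable(x))
```

For x = 1e-9 the first form returns exactly zero, because x^2 + 1 rounds to 1, while the second keeps the correct value close to x^2/2.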
• With this program we can explore the difference (nm122.m)
• This program computes the two equivalent functions:
$f_1(x) = \sqrt{x} \left( \sqrt{x + 1} - \sqrt{x} \right)$ and $f_2(x) = \frac{\sqrt{x}}{\sqrt{x + 1} + \sqrt{x}}$
for different values of x. When $x = 10^{15}$ we have that:
$\sqrt{x + 1} = 3.162277660168381 \times 10^{7} = 31622776.60168381$
$\sqrt{x} = 3.162277660168379 \times 10^{7} = 31622776.60168379$
• And we obtain the following results:
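The listing of nm122.m and its output are not preserved in the text. The comparison it describes can be sketched in Python (an assumption, not the author's Octave code) as follows:

```python
import math

def f1(x):
    # Direct form: the difference of two nearly equal square roots.
    return math.sqrt(x) * (math.sqrt(x + 1.0) - math.sqrt(x))

def f2(x):
    # Equivalent form without the subtraction.
    return math.sqrt(x) / (math.sqrt(x + 1.0) + math.sqrt(x))

for exponent in range(0, 16, 3):
    x = 10.0 ** exponent
    print(f"x = 1e{exponent:<2d}  f1 = {f1(x):.15f}  f2 = {f2(x):.15f}")
```

Both functions tend to 1/2 for large x, but only f2 stays close to that value when x reaches 10^15.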
• Another example can be found when evaluating polynomials like
$p(x) = (x - 1)^{7}$
• Taking the x values within [0.988, 1.012] and expanding the polynomial to obtain:
$p(x) = x^{7} - 7x^{6} + 21x^{5} - 35x^{4} + 35x^{3} - 21x^{2} + 7x - 1$
• Evaluating these two expressions alternately we obtain
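The plot from the slide is not preserved. A short Python sketch (an assumption; the grid of 241 points is an arbitrary choice) is enough to compare the two evaluations numerically:

```python
def p_factored(x):
    return (x - 1.0) ** 7

def p_expanded(x):
    # Large terms of alternating sign that almost cancel near x = 1.
    return (x**7 - 7*x**6 + 21*x**5 - 35*x**4
            + 35*x**3 - 21*x**2 + 7*x - 1)

points = [0.988 + 0.024 * i / 240 for i in range(241)]
worst = max(abs(p_expanded(x) - p_factored(x)) for x in points)
largest = max(abs(p_factored(x)) for x in points)
print("largest |expanded - factored| on [0.988, 1.012]:", worst)
print("largest |p(x)| on the same interval:", largest)
```

The discrepancy between the two forms is comparable in size to p(x) itself on this interval, so the expanded form gives mostly noise there.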
• To evaluate the expression
$y(x) = x - \sin(x)$
• a clever solution is to use the Taylor series for sin(x). Thus
$f(x) = x - \left( x - \frac{x^{3}}{3!} + \frac{x^{5}}{5!} - \frac{x^{7}}{7!} + \cdots \right) = \frac{x^{3}}{3!} - \frac{x^{5}}{5!} + \frac{x^{7}}{7!} - \cdots$
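A minimal Python sketch (an assumption; truncating the series after three terms is an arbitrary choice that is adequate only for small x) compares the direct subtraction with the series:

```python
import math

def y_naive(x):
    return x - math.sin(x)   # cancellation when x is small

def y_series(x):
    # First three terms of x^3/3! - x^5/5! + x^7/7! - ...
    return x**3 / 6.0 - x**5 / 120.0 + x**7 / 5040.0

for x in (1.0 / 15.0, 1e-3, 1e-6):
    print(x, y_naive(x), y_series(x))
```

For x = 1/15, the value used earlier in the text, both agree with the hand computation to the digits that survived the subtraction.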
Transcendental Functions
• The last example introduces a new problem. A computer cannot directly evaluate expressions like:
ln(2)
• Transcendental functions are those functions that cannot be evaluated using a finite number of elementary arithmetic operations.
• We can use a polynomial to approximate this kind of function, thanks to the:
• Weierstrass Approximation Theorem: If f is defined and continuous on the interval [a,b], then given any ε > 0 there is some polynomial P, defined on [a,b], for which
$|f(x) - P(x)| < \varepsilon$
for any $x \in [a, b]$
Taylor’s Polynomial
• Taylor's Theorem: If $f \in C^{n}[a, b]$ and if $f^{(n+1)}$ exists within the interval (a,b), then for any points c and x in [a,b] there is some point ξ between c and x for which
$f(x) = \sum_{k=0}^{n} \frac{1}{k!} f^{(k)}(c) (x - c)^{k} + E_n(x),$
where
$E_n(x) = \frac{1}{(n+1)!} f^{(n+1)}(\xi) (x - c)^{n+1}.$
• In this way we can write
$\ln(x) = (x - 1) - \frac{1}{2}(x - 1)^{2} + \frac{1}{3}(x - 1)^{3} - \cdots + (-1)^{n-1} \frac{1}{n} (x - 1)^{n} + E_n(x)$
• This expression can be used in any computer as we are using only arithmetic operations.
• Suppose that we want to calculate ln(2) with a precision of 8 digits. Then, it must be true that:
$|E_n(2)| = \frac{1}{n+1} \, \frac{1}{\xi^{n+1}} \, (2 - 1)^{n+1} \le \frac{1}{n+1} \le 10^{-8}$
• Thus $n + 1 \ge 10^{8}$, and we will need about a hundred million terms.
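The slow convergence is easy to check for moderate n. The sketch below is in Python (an assumption) and compares the partial sums with the library value `math.log(2.0)`; the helper name `ln2_partial_sum` is illustrative.

```python
import math

def ln2_partial_sum(n):
    """Sum of the first n terms of 1 - 1/2 + 1/3 - ... (the series at x = 2)."""
    return sum((-1) ** (k - 1) / k for k in range(1, n + 1))

for n in (10, 1_000, 100_000):
    error = abs(ln2_partial_sum(n) - math.log(2.0))
    print(f"n = {n:>7d}  error = {error:.3e}  bound 1/(n+1) = {1.0 / (n + 1):.3e}")
```

The error decreases only in proportion to 1/n, in line with the bound above.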
• The Taylor polynomial is just a local approximation to a function.
• The Taylor polynomial will not be useful far from the expansion point c.
Error Propagation
• We can compute the values of the Bessel functions recursively to see how the propagation of errors can be catastrophic.
• The values of the Bessel functions $J_n(x)$ at x = 1 can be obtained with the recurrence relation:
$J_{n+1}(x) = 2n x^{-1} J_n(x) - J_{n-1}(x)$
where
$J_0(1) = 0.7651976865$
$J_1(1) = 0.4400505857$
• The Bessel functions are defined by the relation
$J_n(x) = \frac{1}{\pi} \int_0^{\pi} \cos(x \sin\theta - n\theta) \, d\theta$
• And it is true that
$|J_n(x)| \le \frac{1}{\pi} \int_0^{\pi} 1 \, d\theta = 1$
• But if we use the following simple program:
• You will get!?!
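Neither the program nor its output is preserved in the text. A minimal Python sketch of the forward recurrence (an assumption about what the program did; `bessel_forward` is a hypothetical name) starts from the two ten-digit values given above:

```python
def bessel_forward(n_max, x=1.0, j0=0.7651976865, j1=0.4400505857):
    """Forward recurrence J_{n+1} = (2n/x) J_n - J_{n-1}, starting from J_0 and J_1."""
    values = [j0, j1]
    for n in range(1, n_max):
        values.append(2.0 * n / x * values[n] - values[n - 1])
    return values

for n, value in enumerate(bessel_forward(20)):
    print(f"J_{n}(1) = {value: .10e}")
```

The first few values are accurate, but the small error in the starting data is amplified at every step, and the computed values soon exceed 1 in magnitude, which contradicts the bound on $|J_n(x)|$.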
