1.2 Round Off Errors and Computer Arithmetic

The purpose of this section is to study how errors in numbers can propagate through mathematical functions.

Why?
Example: in exact arithmetic $(\sqrt{19})^2 = 19$, but the value computed on a machine differs from 19:
Error = 0.000000000000004

Round-off errors are directly related to the manner in which numbers are stored in a computer.
* Numbers such as π or e cannot be expressed by a fixed number of significant figures.
* Numbers on the computer are represented with a binary, or base-2, system.
* Binary system → floating-point numbers.

The same stored value of π, displayed in two MATLAB output formats:
>> format short
>> pi
3.1416
>> format long
>> pi
3.141592653589793
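The two observations above can be reproduced outside MATLAB. The following is a minimal Python sketch (Python floats are IEEE 754 doubles); the variable names are illustrative and not taken from the slides.

```python
import math

# sqrt(19) is rounded to the nearest double before it is squared,
# so the square need not give back exactly 19.
root = math.sqrt(19)
error = root * root - 19
print(error)  # a tiny value, at most a few units of 1e-15 in magnitude

# The stored value of pi never changes; only the displayed digits do.
short_pi = f"{math.pi:.4f}"  # four decimals, comparable to "format short"
long_pi = repr(math.pi)      # shortest string that reproduces the stored double
print(short_pi)
print(long_pi)
```

The display format affects only how a number is printed, not how it is stored.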
Binary Machine Numbers (floating-point numbers)

In a computer, only a relatively small subset of the real number system is used for the representation of all the real numbers.

Real line: $-\infty \;\cdots\; -2 \quad -1 \quad 0 \quad 1 \quad 2 \;\cdots\; +\infty$

In 1985, the IEEE (Institute for Electrical and Electronic Engineers) published a report called Binary Floating Point Arithmetic Standard 754–1985. The standard, updated in 2008 as IEEE 754-2008, provides standards for binary and decimal floating point numbers, formats for data interchange, algorithms for rounding arithmetic operations, and for the handling of exceptions. Formats are specified for single, double, and extended precisions.

A 64-bit (binary digit) representation

s (first bit): sign
c (11 bits): characteristic (exponent)
f (52 bits): mantissa (fraction)

The machine number represents $(-1)^{s} \, 2^{c-1023} \, (1+f)$.

Example:
0 10000000011 1011100100010000000000000000000000000000000000000000
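A 64-bit (binary digit) representation: computing c and f

s (1 bit, sign), c (11 bits, exponent), f (52 bits, fraction)

Example:
0 10000000011 1011100100010000000000000000000000000000000000000000
$(-1)^{s} \, 2^{c-1023} \, (1+f)$

11-bit exponent: 10000000011
$c = 1\cdot 2^{10} + 0\cdot 2^{9} + \cdots + 1\cdot 2^{1} + 1\cdot 2^{0}$
$c = 1024 + 0 + \cdots + 2 + 1$
$c = 1027$

52-bit fraction: 1011100100010000000000000000000000000000000000000000
$f = 1\cdot\left(\tfrac{1}{2}\right)^{1} + 1\cdot\left(\tfrac{1}{2}\right)^{3} + 1\cdot\left(\tfrac{1}{2}\right)^{4} + 1\cdot\left(\tfrac{1}{2}\right)^{5} + 1\cdot\left(\tfrac{1}{2}\right)^{8} + 1\cdot\left(\tfrac{1}{2}\right)^{12}$
$f = \tfrac{1}{2} + \tfrac{1}{8} + \tfrac{1}{16} + \tfrac{1}{32} + \tfrac{1}{256} + \tfrac{1}{4096} = 0.722900390625$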
Example:
0 10000000011 1011100100010000000000000000000000000000000000000000
$s = 0$, $c = 1027$, $f = 0.722900390625$

$(-1)^{s} \, 2^{c-1023} \, (1+f) = (-1)^{0} \times 2^{1027-1023} \times (1 + 0.722900390625) = 27.56640625$
Example: (similar to 15a/29)
Use the 64-bit long real format to find the decimal equivalent of the following floating-point machine number.
0 01111111110 1001001100000000000000000000000000000000000000000000
$s = 0$
$c = 2^{9} + 2^{8} + 2^{7} + 2^{6} + 2^{5} + 2^{4} + 2^{3} + 2^{2} + 2^{1} = 1022$
$f = \left(\tfrac{1}{2}\right)^{1} + \left(\tfrac{1}{2}\right)^{4} + \left(\tfrac{1}{2}\right)^{7} + \left(\tfrac{1}{2}\right)^{8} = \tfrac{1}{2} + \tfrac{1}{16} + \tfrac{1}{128} + \tfrac{1}{256} = 0.57421875$
$(-1)^{0} \times 2^{1022-1023} \times (1 + 0.57421875) = 0.787109375$
Example: the same pattern with sign bit 1
1 01111111110 1001001100000000000000000000000000000000000000000000
$(-1)^{1} \times 2^{1022-1023} \times (1 + 0.57421875) = -0.787109375$
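The decoding steps used in the examples above can be sketched in Python. The helper names `decode_double` and `decode_with_struct` are hypothetical; the first applies the formula $(-1)^{s} 2^{c-1023} (1+f)$ directly, and the second asks the machine to interpret the same 64 bits as a cross-check. The sketch covers normalized numbers only.

```python
import struct

def decode_double(bits: str) -> float:
    """Decode 's c f' with (-1)^s * 2^(c-1023) * (1+f); assumes 1 <= c <= 2046."""
    s, c_bits, f_bits = bits.split()
    c = int(c_bits, 2)                       # 11-bit characteristic
    f = int(f_bits, 2) / 2 ** len(f_bits)    # fraction as a value in [0, 1)
    return (-1) ** int(s) * 2.0 ** (c - 1023) * (1 + f)

def decode_with_struct(bits: str) -> float:
    """Cross-check: reinterpret the 64 bits as a big-endian double."""
    raw = int(bits.replace(" ", ""), 2).to_bytes(8, "big")
    return struct.unpack(">d", raw)[0]

# Fraction fields are padded with zeros to 52 bits.
first = "0 10000000011 " + "101110010001".ljust(52, "0")
second = "0 01111111110 " + "10010011".ljust(52, "0")
third = "1 01111111110 " + "10010011".ljust(52, "0")

for pattern in (first, second, third):
    print(decode_double(pattern), decode_with_struct(pattern))
```

Both functions should agree on each pattern, giving 27.56640625, 0.787109375, and -0.787109375.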
Largest floating point
0 11111111110 1111111111111111111111111111111111111111111111111111
$s = 0$, $c = 2046$, $f = 1 - 2^{-52}$
$\text{number} = (-1)^{0} \times 2^{2046-1023} \times (2 - 2^{-52}) = 0.1797693134862316 \times 10^{309}$
>> 2^1023*(2-2^(-52))
1.797693134862316e+308
>> 2^1024-2^971
Inf
The second expression equals the same number in exact arithmetic, but 2^1024 overflows on its own, so the result is Inf.
Numbers greater than $2^{1023} \times (2 - 2^{-52})$ result in overflow (Inf).

Smallest normalized positive number
$s = 0$, $c = 1$, $f = 0$
$\text{number} = (-1)^{0} \times 2^{1-1023} \times (1 + 0) = 2.225073858507201 \times 10^{-308}$
Numbers occurring in calculations that have a magnitude less than $2^{-1022} \times (1 + 0)$ result in underflow and are generally set to zero.
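The two limits above can be checked against the values Python reports for its own doubles. This is a minimal sketch; the variable names are illustrative. Note that Python raises an error for `2.0**1024`, so overflow is shown with a multiplication instead.

```python
import sys

largest = 2.0 ** 1023 * (2 - 2.0 ** -52)   # c = 2046, f = 1 - 2^-52
smallest_normal = 2.0 ** -1022             # c = 1, f = 0

print(largest == sys.float_info.max)           # True
print(smallest_normal == sys.float_info.min)   # True

# Going past the largest number overflows to infinity.
overflowed = largest * 2
print(overflowed)                               # inf

# A result far below the smallest normalized number is set to zero.
underflowed = smallest_normal * 2.0 ** -60
print(underflowed)                              # 0.0
```

IEEE 754 doubles also include subnormal numbers just below $2^{-1022}$, so on such hardware a result is flushed to exactly zero only once it falls well under that threshold; the factor $2^{-60}$ is chosen to be safely past it.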
The machine number
0 10000000011 1011100100010000000000000000000000000000000000000000
represents 27.56640625. The next largest machine number is
0 10000000011 1011100100010000000000000000000000000000000000000001
Call its value $x^*$. Any real number in the interval from 27.56640625 up to the midpoint
$\dfrac{27.56640625 + x^*}{2} = 27.5664062500000017763568394002504646778106689453125$
is represented by the machine number 27.56640625.
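The spacing between 27.56640625 and its neighbour can be inspected directly. In this sketch, `next_up` is a hypothetical helper that adds 1 to the 64-bit pattern of a positive double, which is the "next largest machine number" step described above; the two decimal test strings are illustrative choices on either side of the midpoint.

```python
import struct
from fractions import Fraction

def next_up(x: float) -> float:
    """Next larger double for a positive finite x: add 1 to its bit pattern."""
    (n,) = struct.unpack(">Q", struct.pack(">d", x))
    return struct.unpack(">d", struct.pack(">Q", n + 1))[0]

x = 27.56640625
x_star = next_up(x)

gap = Fraction(x_star) - Fraction(x)             # exact spacing
midpoint = (Fraction(x) + Fraction(x_star)) / 2  # exact midpoint
print(gap == Fraction(1, 2 ** 48))               # True

# Decimal inputs below / above the midpoint are stored as x / x_star.
below = float("27.566406250000001")
above = float("27.566406250000003")
print(below == x, above == x_star)               # True True
```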
Binary Machine Numbers (floating-point numbers): summary

In a computer, only a relatively small subset of the real number system is used for the representation of all the real numbers:
$\{\text{floating points}\} \subset (-\infty, +\infty)$

How many real numbers are there between 0.0001 and 0.0002?
How many floating points are there between 0.0001 and 0.0002?
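There are infinitely many real numbers in that interval but only finitely many 64-bit floating points. One way to count them, sketched below, relies on the fact that the bit patterns of positive doubles, read as integers, are ordered the same way as the numbers themselves. The helper name `bit_pattern` is hypothetical, and the count is taken between the doubles actually stored for 0.0001 and 0.0002, since neither decimal value is exactly representable.

```python
import struct

def bit_pattern(x: float) -> int:
    """Return the 64 bits of a double as an unsigned integer."""
    (n,) = struct.unpack(">Q", struct.pack(">d", x))
    return n

lo, hi = 0.0001, 0.0002
count = bit_pattern(hi) - bit_pattern(lo) - 1   # doubles strictly between
print(count)   # 4503599627370495, i.e. 2**52 - 1
```

The stored 0.0002 is exactly twice the stored 0.0001, so the two patterns differ by one step in the exponent field, which accounts for the $2^{52}$ fraction patterns in between.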
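Decimal Machine Numbers

We assume that machine numbers are represented in the normalized decimal floating-point form
$\pm 0.d_1 d_2 \ldots d_k \times 10^{n}, \quad 1 \le d_1 \le 9, \quad 0 \le d_i \le 9 \ (i = 2, \ldots, k).$
Numbers of this form are called k-digit decimal machine numbers.

Example: Determine the 5-digit chopping and rounding values of π.
$y = \pi = 0.3141592653589793\ldots \times 10^{1}$
chopping: fl(y) = $0.31415 \times 10^{1}$
rounding: fl(y) = $0.31416 \times 10^{1}$

The floating-point form of y, denoted fl(y), is obtained by terminating the value of y at k decimal digits. There are two common ways of termination:
* chopping: discard the digits after $d_k$
* rounding: add $5 \times 10^{n-(k+1)}$ to y and then chop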
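Both termination rules can be sketched with Python's `decimal` module. The function name `fl` mirrors the notation above but is otherwise a hypothetical helper; `ROUND_DOWN` is used for chopping and `ROUND_HALF_UP` for rounding.

```python
from decimal import Context, Decimal, ROUND_DOWN, ROUND_HALF_UP

def fl(y: Decimal, k: int, mode: str = "chop") -> Decimal:
    """Terminate y at k significant decimal digits by chopping or rounding."""
    rounding = ROUND_DOWN if mode == "chop" else ROUND_HALF_UP
    return Context(prec=k, rounding=rounding).create_decimal(y)

y = Decimal("3.141592653589793")
chopped = fl(y, 5, "chop")
rounded = fl(y, 5, "round")
print(chopped, rounded)   # 3.1415 3.1416
```

The printed values are the same numbers as $0.31415 \times 10^{1}$ and $0.31416 \times 10^{1}$, written without the normalized leading zero.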
Three methods for measuring approximation errors

If p* is an approximation to p:
* actual error: $p - p^*$
* absolute error: $|p - p^*|$
* relative error: $\dfrac{|p - p^*|}{|p|}$, provided that $p \ne 0$

Example:
Determine the actual, absolute, and relative errors when approximating p by p*.

Exercise: E1-Term221, Exc. 3(d)
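The three measures can be sketched as one small function. The name `errors` is hypothetical, and the sample values p = 5000 and p* = 5002.4 are borrowed from the significant-digits example in this section, because the values of the original example are not available.

```python
def errors(p: float, p_star: float):
    """Return (actual, absolute, relative) error of p_star as an approximation to p."""
    actual = p - p_star
    absolute = abs(actual)
    relative = absolute / abs(p)   # requires p != 0
    return actual, absolute, relative

actual, absolute, relative = errors(5000.0, 5002.4)
print(actual, absolute, relative)   # about -2.4, 2.4, 0.00048
```

The relative error is usually the more meaningful measure because it takes the size of p into account.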

Significant digits (or figures)

The number p* is said to approximate p to t significant digits (or figures) if t is the largest nonnegative integer such that
$\dfrac{|p - p^*|}{|p|} \le 5 \times 10^{-t}$

Example: Determine the significant digits.
$p = 0.1, \quad p^* = 0.10005$
$\dfrac{0.00005}{0.1} = 0.0005 \le 5 \times 10^{-4}$, so $t = 4$.

Example:
$p = 5000, \quad p^* = 5002.4$
$\dfrac{2.4}{5000} = 0.00048 = 4.8 \times 10^{-4} \le 5 \times 10^{-4}$, so $t = 4$.

Example:
$p = 25000, \quad p^* = 25001.3$
$\dfrac{1.3}{25000} = \dfrac{5.2}{100000} = 0.000052 = 5.2 \times 10^{-5} \le 5 \times 10^{-t}, \quad t = ?$
Since $5.2 \times 10^{-5} > 5 \times 10^{-5}$ but $5.2 \times 10^{-5} \le 5 \times 10^{-4}$, the largest such t is $t = 4$.
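The definition can be turned into a short search for the largest t. The sketch below uses exact rational arithmetic so that boundary cases such as the first example, where the relative error equals $5 \times 10^{-4}$ exactly, are not disturbed by binary round-off. The function name `significant_digits` is hypothetical, and inputs are passed as decimal strings.

```python
from fractions import Fraction

def significant_digits(p: str, p_star: str) -> int:
    """Largest nonnegative integer t with |p - p*| / |p| <= 5 * 10**(-t)."""
    rel = abs(Fraction(p) - Fraction(p_star)) / abs(Fraction(p))
    if rel == 0:
        raise ValueError("p* equals p; the bound holds for every t")
    if rel > 5:
        raise ValueError("no nonnegative t satisfies the bound")
    t = 0
    while rel <= Fraction(5, 10 ** (t + 1)):   # does t + 1 still work?
        t += 1
    return t

for p, p_star in (("0.1", "0.10005"), ("5000", "5002.4"), ("25000", "25001.3")):
    print(p, p_star, significant_digits(p, p_star))   # t = 4 in all three cases
```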
Finite-Digit Arithmetic

Source of round-off error: number representations and arithmetic operations.

Machine addition, subtraction, multiplication, and division are represented by
$x \oplus y = fl(fl(x) + fl(y)), \quad x \ominus y = fl(fl(x) - fl(y)),$
$x \otimes y = fl(fl(x) \times fl(y)), \quad x \oslash y = fl(fl(x) \div fl(y)).$

Example:
Suppose that $x = 5/7$ and $y = 1/3$. Use five-digit chopping for calculating x + y.
$x = 5/7 = 0.714285\ldots$, fl(x) = $0.71428 \times 10^{0}$
$y = 1/3 = 0.333333\ldots$, fl(y) = $0.33333 \times 10^{0}$
$x \oplus y = fl(fl(x) + fl(y)) = fl(0.71428 + 0.33333) = fl(1.04761) = 0.10476 \times 10^{1}$
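Machine addition with five-digit chopping can be imitated with Python's `decimal` module. This is a sketch under an assumption: the values x = 5/7 and y = 1/3 are chosen here because they are consistent with fl(x) = 0.71428 and fl(y) = 0.33333 shown above. The names `chop5` and `fl` are illustrative.

```python
from decimal import Context, Decimal, ROUND_DOWN

chop5 = Context(prec=5, rounding=ROUND_DOWN)   # five-digit chopping

def fl(v: Decimal) -> Decimal:
    """Chop v to five significant decimal digits."""
    return chop5.create_decimal(v)

x = Decimal(5) / Decimal(7)   # assumed value of x
y = Decimal(1) / Decimal(3)   # assumed value of y

fx, fy = fl(x), fl(y)
x_plus_y = fl(fx + fy)        # x (+) y = fl(fl(x) + fl(y))
print(fx, fy, x_plus_y)       # 0.71428 0.33333 1.0476
```

The inner sum 0.71428 + 0.33333 = 1.04761 has six digits, so the outer fl chops it once more.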
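Avoid subtracting two nearly equal numbers: the result can have a small absolute error but a large relative error.

Example:
Suppose that $x = 5/7$ and $u = 0.714251$. Use five-digit chopping for calculating x − u.
fl(x) = $0.71428 \times 10^{0}$, fl(u) = $0.71425 \times 10^{0}$
$x \ominus u = fl(0.71428 - 0.71425) = 0.30000 \times 10^{-4}$
The exact difference is about $0.347143 \times 10^{-4}$, so this approximation has a small absolute error (about $0.471 \times 10^{-5}$) but a large relative error (about 0.136).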
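The loss of significance in this subtraction can be measured with a short sketch. It assumes x = 5/7, which matches the five-digit value 0.71428 used earlier in this section; the variable names are illustrative.

```python
from decimal import Context, Decimal, ROUND_DOWN

chop5 = Context(prec=5, rounding=ROUND_DOWN)   # five-digit chopping

x = Decimal(5) / Decimal(7)    # assumed value of x
u = Decimal("0.714251")

exact = x - u                  # computed with 28 digits, effectively exact here
computed = chop5.subtract(chop5.create_decimal(x), chop5.create_decimal(u))

abs_err = abs(exact - computed)
rel_err = abs_err / abs(exact)
print(computed, abs_err, rel_err)
```

The absolute error is below $5 \times 10^{-6}$, yet the relative error is roughly 0.136: only the leading digit of the computed difference carries information.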
Nested Arithmetic

Round-off error can also be reduced by rearranging calculations.

Example:
Evaluate $f(x) = x^{3} - 6.1x^{2} + 3.2x + 1.5$ at x = 4.71 using three-digit arithmetic.

Exact: f(4.71) = −14.263899

Standard form                                 Relative error
Three-digit (chopping): f(4.71) = −13.5       0.05
Three-digit (rounding): f(4.71) = −13.4       0.06

Nested form $f(x) = ((x - 6.1)x + 3.2)x + 1.5$    Relative error
Three-digit (chopping): f(4.71) = −14.2       0.0045
Three-digit (rounding): f(4.71) = −14.3       0.0025

Remark: The decreased error in this example is due to the reduction in computations from four multiplications and three additions to two multiplications and three additions.
One way to reduce round-off error is to reduce the number of computations.
Polynomials should always be expressed in nested form before they are evaluated.
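The four table entries can be reproduced by forcing every intermediate result to three digits. The sketch below assumes the polynomial $f(x) = x^{3} - 6.1x^{2} + 3.2x + 1.5$, which is consistent with the exact value −14.263899 quoted above; `standard` and `nested` are hypothetical helper names, and each `ctx` operation rounds its result to three significant digits.

```python
from decimal import Context, Decimal, ROUND_DOWN, ROUND_HALF_UP

D = Decimal
X, EXACT = D("4.71"), D("-14.263899")

def standard(ctx: Context) -> Decimal:
    """x^3 - 6.1 x^2 + 3.2 x + 1.5: four multiplications, three additions."""
    x2 = ctx.multiply(X, X)
    x3 = ctx.multiply(x2, X)
    t2 = ctx.multiply(D("6.1"), x2)
    t1 = ctx.multiply(D("3.2"), X)
    return ctx.add(ctx.add(ctx.subtract(x3, t2), t1), D("1.5"))

def nested(ctx: Context) -> Decimal:
    """((x - 6.1) x + 3.2) x + 1.5: two multiplications, three additions."""
    r = ctx.subtract(X, D("6.1"))
    r = ctx.add(ctx.multiply(r, X), D("3.2"))
    return ctx.add(ctx.multiply(r, X), D("1.5"))

chop = Context(prec=3, rounding=ROUND_DOWN)
rnd = Context(prec=3, rounding=ROUND_HALF_UP)

for label, ctx in (("chopping", chop), ("rounding", rnd)):
    for form in (standard, nested):
        value = form(ctx)
        rel = abs((EXACT - value) / EXACT)
        print(f"{label:9s} {form.__name__:9s} {value}  rel. error {rel:.4f}")
```

The nested form loses less accuracy here because fewer intermediate results are chopped or rounded.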
Round-off error can also be reduced by
* rationalizing the numerator
* rearranging calculations

Example:
The relative errors: 0.24 and 0.00062 (x2)

Remark: two common sources of large round-off error are
1) the subtraction of nearly equal numbers,
2) the division by a small number.
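The effect of rationalizing the numerator can be illustrated with the quadratic formula in ordinary double precision. The equation below, $x^{2} + 10^{8}x + 1 = 0$, is a hypothetical example chosen for this sketch and is not the example from the slides. The direct formula subtracts two nearly equal numbers, while the rationalized form $-2c / (b + \sqrt{b^{2} - 4ac})$ avoids that subtraction.

```python
import math

a, b, c = 1.0, 1.0e8, 1.0    # hypothetical coefficients, not from the slides
root = math.sqrt(b * b - 4 * a * c)

direct = (-b + root) / (2 * a)          # subtracts nearly equal numbers
rationalized = (-2 * c) / (b + root)    # same root, no cancellation

reference = -1.0e-8                     # small root, correct to about 16 digits
rel_direct = abs(direct - reference) / abs(reference)
rel_rationalized = abs(rationalized - reference) / abs(reference)
print(direct, rel_direct)
print(rationalized, rel_rationalized)
```

In this sketch the direct formula is wrong by roughly 25 percent, while the rationalized form agrees with the reference to nearly full double precision.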
