
Numerical Methods

Representing Numbers
Accuracy and Precision
• Accuracy – how closely a measured value agrees with the truth
• Precision – how closely measured values agree with each other
[Figure: target diagrams arranged along axes of increasing accuracy and increasing precision]
• Inaccurate measurements are due to some bias
• Imprecise values are caused by some uncertainty
• Inaccuracy and imprecision are measured using an error term
Errors – Measurement
• Absolute error: E_A = |true − calculated|
• Relative error: E_R = |(true − calculated) / true|

• true = 1.5 cm, calculated = 1 cm
• absolute = 0.5 cm
• relative = 0.333
• true = 1,000,000.5 cm, calculated = 1,000,000 cm
• absolute = 0.5 cm
• relative ≈ 0.5 × 10^−6
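The two definitions above can be sketched directly (a minimal C++ illustration; the helper names are ours, not standard library functions):

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helpers implementing the definitions above.
double absolute_error(double t, double c) {
    return std::fabs(t - c);           // E_A = |true - calculated|
}
double relative_error(double t, double c) {
    return std::fabs((t - c) / t);     // E_R = |(true - calculated) / true|
}
```

For the examples above, both measurements have the same absolute error of 0.5 cm, while the relative errors differ by about six orders of magnitude.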
Absolute vs. Relative Error
• Generally, scientific and engineering applications are less
sensitive to small errors in large values
• Relative error can be used to compare values at widely
varying sizes
• Most computing systems are designed to minimize relative
error
Relative Error: E_R = |(true − calculated) / true|

1) Undefined when the true value is zero
[Figure: plots of f(x) and f_c(x) over 1 ≤ x ≤ 5, with f(2.3) = 0]
• one workaround: E_R = |f(x) − f_c(x)| / max{|f_c(x)|, |f(x)|}

2) Relative measurements

              Celsius    Kelvin
true          10 °C      283 K
calculated    11 °C      284 K

E_R = |(10 − 11) / 10| = 0.1 = 10%
E_R = |(283 − 284) / 283| = 0.0035 = 0.35%
Sources of error
• Inaccuracy often results from a bias in the algorithm
• Euler’s method for explicit integration [discussed later] returns biased results based on the curvature of the solution to the differential equation
• Creating more accurate algorithms is a cornerstone of numerical methods

• Imprecision is more fundamental – it depends on the number of bits supported by the data type being used
• Imagine that we only have 3 digits to represent a value
• How do we represent , , and ?
• Our choices for improving precision are limited:
• increase the number of digits available
Number Bases
• Decimal – base 10
• measure precision in “digits”
• Binary – base 2
• measure precision in “bits”
• Hexadecimal – base 16
• useful for memory dumps
• matches neatly with data produced by 8-, 32-, 64-bit systems
• will explore later
• Octal – base 8
• used to view memory dumps from machines with smaller word sizes
• similar to how hexadecimal is used now
Binary
• When discussing numerical algorithms, it is convenient to discuss numbers using both binary and decimal notation
• For example, discussing significant bits and significant digits can be
useful for different applications
• Understand what both of these terms mean
• Numerical base is represented by a subscript when the value is ambiguous
Interpreting binary numbers
• Double-dabble algorithm
• initialize a register X = 0
• for each digit in the binary number
• if 0 – double X
• if 1 – dabble X (double it and add one)

• Do this with a few numbers:

0000 1101 → 13    0110 1010 → 106    1111 1111 → 255

• How would you handle signed integers?
• Introduce a “sign” bit as the most significant digit
• What are the following values (assuming a preceding sign bit)?

1101 1010 → −90    1011 1001 → −57    1111 1111 → −127
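The double-dabble scan above can be sketched over a string of bits (spaces used for grouping are skipped; sign handling would be a check on the leading bit before the scan):

```cpp
#include <cassert>
#include <string>

// Double-dabble conversion as described above.
int double_dabble(const std::string& bits) {
    int x = 0;                        // initialize a register X = 0
    for (char b : bits) {
        if (b == '0') x = 2 * x;          // double
        else if (b == '1') x = 2 * x + 1; // dabble: double and add one
    }
    return x;
}
```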
Overflow
• What is 1111 1111 + 1?
• or 0000 0000 – 1?
• Value “wraps around” to the opposite value
• One of the most common errors in programming
• What happens when we convert an 8-bit integer into a 4-bit integer by truncating the most significant bits?

0001 0011 -> 0011        1100 1010 -> 1010

• Keeping the least significant k bits of a number performs the operation n mod 2^k
• This is the defined behavior for C/C++ casting:

unsigned long m = ULONG_MAX; // all bits set
unsigned int n = m;          // n = m mod 2^32 on typical platforms – only the low bits survive
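The cast behaves the same as masking with bitwise AND; a small sketch of both forms:

```cpp
#include <cassert>
#include <climits>

// The narrowing cast above keeps the least significant bits,
// i.e. computes m mod 2^k where k is the destination type's width
// (defined behavior for unsigned types in C/C++).
unsigned int low_bits(unsigned long m) {
    return (unsigned int)m;
}

// The 8-bit -> 4-bit truncation from the slide is the same operation,
// written as an explicit mask:
unsigned char low_nibble(unsigned char b) {
    return b & 0x0F;                 // keep the low 4 bits
}
```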
Fixed Point and Arithmetic
• Digits to the right of the binary point represent fractional powers of two
• Evaluate:

• Demonstrate addition and multiplication:
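The slide's own operand values are not shown, so here is a minimal fixed-point sketch; the 16.16 format is an assumption chosen for illustration (the low 16 bits hold the fractional powers of two):

```cpp
#include <cassert>
#include <cstdint>

// Assumed 16.16 fixed-point format: value = raw / 2^16.
typedef int32_t fix16;

fix16  to_fix(double v)   { return (fix16)(v * 65536.0); }
double from_fix(fix16 v)  { return v / 65536.0; }

fix16 fix_add(fix16 a, fix16 b) { return a + b; }  // radix points already align
fix16 fix_mul(fix16 a, fix16 b) {
    return (fix16)(((int64_t)a * b) >> 16);        // rescale after the multiply
}
```

Addition works directly on the raw integers because both operands share the same radix point; multiplication doubles the number of fractional bits, so the product must be shifted back down.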
Floating Point, base 10

0.602 × 10^23
(mantissa × base^exponent)

• spend a few bits on the mantissa and a few on the exponent
• effectively represent very large and very small numbers
• Any base can be used
• normalization constraint: 1/base ≤ mantissa < 1
ensures that each number has a single representation
(basically, the first digit of the mantissa should be non-zero)
Floating Point, base 2
• Standard format: x = ±q × 2^e

• 2 is the base
• ± is the sign bit
• q is the mantissa, where 1/2 ≤ q < 1
• e is the exponent

• What is the decimal spacing between values when the exponent is fixed and the mantissa is 3 bits?
smallest value:
next value:
• How about when the exponent increases by one?
smallest value:
next value:
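The same spacing question can be asked of IEEE doubles: `std::nextafter` returns the adjacent representable value, so the gap between neighbors can be measured directly (the 3-bit example above is the same idea with a much coarser mantissa):

```cpp
#include <cassert>
#include <cmath>

// Gap between x and the next representable double above it.
double spacing_at(double x) {
    return std::nextafter(x, 2.0 * x) - x;
}
```

For a 52-bit mantissa the gap just above 1.0 is 2^−52, and just above 1024 = 2^10 it is 2^−42: the spacing doubles every time the exponent increases by one.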
Floating Point Multiplication

1) Add the sign bits (modulo 2):
2) Add the exponents:
3) Multiply the mantissas:
4) Adjust the exponent to normalize the mantissa:
5) Chop excess digits:
• Find the result of the binary operation given a 3-bit mantissa:
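The five steps above can be sketched in code. The slide works the exercise in binary with a 3-bit mantissa; this base-10, 3-digit version is an assumption made only so the digits are easy to read, and signs are stored as ±1 (multiplying them is equivalent to adding sign bits mod 2):

```cpp
#include <cassert>
#include <cmath>

// Assumed toy format: value = sign * mant * 10^exp, with 0.1 <= mant < 1.
// Nonzero operands are assumed (the normalize loop would not terminate on 0).
struct Fp { int sign; double mant; int exp; };

Fp fp_mul(Fp a, Fp b) {
    Fp r;
    r.sign = a.sign * b.sign;          // 1) combine the signs
    r.exp  = a.exp + b.exp;            // 2) add the exponents
    r.mant = a.mant * b.mant;          // 3) multiply the mantissas
    while (r.mant < 0.1) {             // 4) normalize so 0.1 <= mant < 1
        r.mant *= 10.0;
        r.exp  -= 1;
    }
    r.mant = std::floor(r.mant * 1000.0) / 1000.0;  // 5) chop to 3 digits
    return r;
}
```

For example, (0.123 × 10^2) × (0.456 × 10^3) gives mantissa product 0.056088, which normalizes to 0.56088 × 10^4 and chops to 0.560 × 10^4.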
Floating Point Addition

1) Convert both operands to the same exponent (and chop):
2) Add the mantissas:
3) Normalize:
• What is the actual solution and the corresponding relative error?
• Note that the actual solution could be represented
Loss of Precision
• This subtraction effectively reduced our numerical precision by one digit:

• Assuming a 3-digit mantissa, perform the following operations:
1)
2)
where is the correct result, the relative error is
Loss of Precision
• Each operation maintains the appropriate precision
• However, we end up with far less precision than we expect
• The result has one digit of precision
• The correct value of requires two digits of precision
• We expect , we only need , but we get
• This is known colloquially as catastrophic cancellation
• How could we have avoided loss of precision in this case?
• We will discuss ways to avoid loss of precision in more complex cases in the next lecture
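The slide's own operands are not shown, so a classic stand-in demonstrates the effect: (1 − cos x)/x² should approach 1/2 as x → 0, but the subtraction 1 − cos(x) cancels nearly all significant digits for small x; rewriting it with the half-angle identity avoids the subtraction entirely:

```cpp
#include <cassert>
#include <cmath>

// Catastrophic cancellation: 1 - cos(x) subtracts two nearly equal values.
double cancelled(double x) {
    return (1.0 - std::cos(x)) / (x * x);
}

// Algebraically identical form with no subtraction of near-equals:
// 1 - cos(x) = 2 sin^2(x/2).
double rewritten(double x) {
    double s = std::sin(x / 2.0);
    return 2.0 * s * s / (x * x);
}
```

At x = 10^−8 the naive form is not even close to 0.5, while the rewritten form is accurate to full precision.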
Loss of Precision Theorem
• Let x and y be normalized floating-point machine numbers such that x > y > 0, and

2^(−q) ≤ 1 − y/x ≤ 2^(−p)

for some positive integers p and q; then at most q and at least p significant digits are lost in the subtraction x − y.

• How many bits are lost when calculating x − y for: x = and y = ?
Lost Bits
• Reformulate the Loss of Precision Theorem:

• Lose between p and q bits
Lost Bits
•  How many bits are lost in , where
and ?

Between and bits of precision have been lost.


Lost Bits – Example 2
•  How many bits are lost in when ?

Between and bits of precision have been lost.

We will discuss what to do about this in the next lecture.


Representation Quirks
• Precision scales between decades – spacing between values increases with increasing exponent
• Machine epsilon (unit epsilon)
• smallest number ε such that 1 + ε > 1
• spacing between a value and an adjacent number is given by:

• Functions with steep slope skip values
• Imagine f has a large slope at all points. Several representable values of f(x) will never be produced by any representable x
• How is this a problem for finding roots or optimization?
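Machine epsilon can be found by halving until 1 + ε is indistinguishable from 1 (a standard sketch; the `volatile` keeps extended-precision registers from hiding the rounding):

```cpp
#include <cassert>
#include <cfloat>

// Smallest eps such that 1 + eps > 1 in double precision.
double machine_epsilon() {
    double eps = 1.0;
    for (;;) {
        volatile double probe = 1.0 + eps / 2.0;
        if (probe == 1.0) return eps;   // eps/2 no longer changes 1.0
        eps /= 2.0;
    }
}
```

For IEEE doubles this returns 2^−52, matching `DBL_EPSILON`.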
Chopping, Rounding, and Bias
• Chopping values: 0.1557 × 10^1
• Rounding values: 0.1557 × 10^1

• Both chopping and rounding introduce a bias
• rounding – 4 digits (1-4) round down while 5 digits (5-9) round up
• eliminating the bias can be done by randomly rounding up or down in the case of a ‘5’ – not worth it if running the same program twice returns two different results
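The two schemes can be sketched for d decimal places (d and the sample value below are illustrative, not from the slide):

```cpp
#include <cassert>
#include <cmath>

// Chop: discard everything past decimal place d.
double chop_digits(double v, int d) {
    double s = std::pow(10.0, d);
    return std::trunc(v * s) / s;
}

// Round: take the nearest value at decimal place d (ties away from zero).
double round_digits(double v, int d) {
    double s = std::pow(10.0, d);
    return std::round(v * s) / s;
}
```

Chopping 1.5578 to three places gives 1.557 while rounding gives 1.558; chopping always moves positive values toward zero, which is the bias the slide describes.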
IEEE 32-bit floating point
• Slightly more complicated
• no wasted 1 to force normalization – the leading 1 is implied
• exponent bias removes the need for a sign bit in the exponent
• 32-bit: 1 sign bit, 8-bit exponent, 23-bit mantissa
• exponent bias: 127
• machine epsilon: 2^−23 ≈ 1.19 × 10^−7

• 64-bit: 1 sign bit, 11-bit exponent, 52-bit mantissa
• exponent bias: 1023
• machine epsilon: 2^−52 ≈ 2.22 × 10^−16
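The 1/8/23 layout can be pulled apart with bit operations (`memcpy` sidesteps strict-aliasing issues when reinterpreting the float's bytes):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Split a 32-bit float into its sign, exponent, and mantissa fields.
void fields(float f, uint32_t& sign, uint32_t& exp, uint32_t& mant) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    sign = bits >> 31;
    exp  = (bits >> 23) & 0xFFu;     // biased by 127
    mant = bits & 0x7FFFFFu;         // stored fraction; leading 1 is implied
}
```

For example, 1.0f has sign 0, biased exponent 127, and mantissa 0; −2.0f has sign 1 and biased exponent 128.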
Hexadecimal
• Useful when looking at binary data
• memory dumps, binary files, etc.

0000 1010  0001 1011  0010 1100  0011 1101
   0    A     1    B     2    C     3    D

• Easier to view the data in 8-bit chunks
• 0A 1B 2C 3D
• Commonly used for
• memory dumps
• doing your own math, bitwise operations
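Formatting bytes in 8-bit hex chunks, as shown above, is a few lines of C++ (the function name is ours):

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Format a buffer as space-separated 8-bit hex chunks, e.g. "0A 1B 2C 3D".
std::string hex_dump(const uint8_t* p, size_t n) {
    std::string out;
    char byte[4];
    for (size_t i = 0; i < n; ++i) {
        std::snprintf(byte, sizeof byte, "%02X ", p[i]);
        out += byte;
    }
    if (!out.empty()) out.pop_back();  // drop the trailing space
    return out;
}
```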
Endianness
• Hardware-dependent byte ordering of the system
• Conversion is called the “NUXI problem”
• Intel (and by extension most home computers) use little-endian
• Important if you want to do your own bitwise operations
• Create binary masks
• Implement your own floating point
• Integer arithmetic
Testing Endianness
#include <iostream>

int main() {
    int a = 0x0A1B2C3D;
    unsigned char *c = (unsigned char *)(&a);
    if (*c == 0x3D)
        std::cout << "little-endian" << std::endl;
    else
        std::cout << "big-endian" << std::endl;
    return 0;
}
Numerical Methods – a PSA
• In 1991, failure of a MIM-104 Patriot Missile defense system was
caused by a software error. Time stamps from multiple radar
pulses were converted to floating point differently.
• http://en.wikipedia.org/wiki/MIM-104_Patriot#Failure_at_Dhahran
• In 1994, Professor Thomas R. Nicely discovered that the Intel
Pentium floating‐point processor returned erroneous results for
certain division operations.
• http://engineeringfailures.org/?p=466
• In 1996, the Ariane 5 rocket launched by the European Space
Agency exploded 40 seconds after lift-off from Kourou, French
Guiana, because of an overflow error.
• http://www.around.com/ariane.html
• IBM recently announced that they’re going little-endian
• https://www.business-cloud.com/articles/news/ibm-drops-linux-bombshell

h/t Prof. Lennart Johnsson (COSC)


Numerical Methods – a PSA
• Boeing 787 operators have been ordered to periodically reset their electrical systems to avoid an overflow error that happens every 2^31 centiseconds (around every 8 months)
• http://www.nytimes.com/2015/05/01/business/faa-orders-fix-for-possible-power-loss-in-boeing-787.html?_r=0
• Software patch is being released Fall 2015

• Donkey Kong breaks at level 22. You’re given 260 seconds to complete the level – however they’re only using an 8-bit unsigned integer to store the time...
Memory dump demo
• Try using https://hexed.it/ to view memory dumps and binary files
