Professional Documents
Culture Documents
Normally, we count things in base 10. The base is completely arbitrary. The only
reason that people have traditionally used base 10 is that they have 10 fingers
, which have made handy counting tools.
The number 532.25 in decimal (base 10) means the following:
(5 * 10^2) + (3 * 10^1) + (2 * 10^0) + (2 * 10^-1) + (5 * 10^-2)
500
+
30
+
2
+
2/10
+
5/100
_________
= 532.25
In the binary number system (base 2), each column
d of 10. For example, the number 101.01 means the
(1 * 2^2) + (0 * 2^1) + (1 * 2^0) + (0 * 2^-1)
4
+
0
+
1
+
0
_________
= 5.25 Decimal
Note that IEEE has more bits dedicated to the exponent, which allows it to repre
sent a wider range of values. MBF has more mantissa bits, which allows it to be
more precise within its narrower range.
General Floating-Point Concepts
It is very important to realize that any binary floating-point system can repres
ent only a finite number of floating-point values in exact form. All other value
s must be approximated by the closest representable value. The IEEE standard spe
cifies the method for rounding values to the "closest" representable value. Quic
kBasic for MS-DOS supports the standard and rounds according to the IEEE rules.
Also, keep in mind that the numbers that can be represented in IEEE are spread o
ut over a very wide range. You can imagine them on a number line. There is a hig
h density of representable numbers near 1.0 and -1.0 but fewer and fewer as you
go towards 0 or infinity.
The goal of the IEEE standard, which is designed for engineering calculations, i
s to maximize accuracy (to get as close as possible to the actual number). Preci
sion refers to the number of digits that you can represent. The IEEE standard at
tempts to balance the number of bits dedicated to the exponent with the number o
f bits used for the fractional part of the number, to keep both accuracy and pre
cision within acceptable limits.
IEEE Details
Floating-point numbers are represented in the following form, where [exponent] i
s the binary exponent:
X = Fraction * 2^(exponent - bias)
[Fraction] is the normalized fractional part of the number, normalized because t
he exponent is adjusted so that the leading bit is always a 1. This way, it does
not have to be stored, and you get one more bit of precision. This is why there
is an implied bit. You can think of this like scientific notation, where you ma
nipulate the exponent to have one digit to the left of the decimal point, except
in binary, you can always manipulate the exponent so that the first bit is a 1,
since there are only 1s and 0s.
[bias] is the bias value used to avoid having to store negative exponents.
The bias for single-precision numbers is 127 and 1023 (decimal) for double-preci
sion numbers.
The values equal to all 0's and all 1's (binary) are reserved for representing s
pecial cases. There are other special cases as well, that indicate various error
conditions.
Single-Precision Examples
2 = 1 * 2^1 = 0100 0000 0000 0000 ... 0000 0000 = 4000 0000 hex
Note the sign bit is zero, and the stored exponent is 128, or 100 0000 0 in bina
ry, which is 127 plus 1. The stored mantissa is (1.) 000 0000 ... 0000 0000, whi
ch has an implied leading 1 and binary point, so the actual mantissa is 1.
-2 = -1 * 2^1 = 1100 0000 0000 0000 ... 0000 0000 = C000 0000 hex
Same as +2 except that the sign bit is set. This is true for all IEEE format flo
ating-point numbers.
4 = 1 * 2^2 = 0100 0000 1000 0000 ... 0000 0000 = 4080 0000 hex
Same mantissa, exponent increases by one (biased value is 129, or 100 0000 1 in
binary.
6 = 1.5 * 2^2 = 0100 0000 1100 0000 ... 0000 0000 = 40C0 0000 hex
Same exponent, mantissa is larger by half -- it's (1.) 100 0000 ... 0000 0000, w
hich, since this is a binary fraction, is 1-1/2 (the values of the fractional di
gits are 1/2, 1/4, 1/8, etc.).
1 = 1 * 2^0 = 0011 1111 1000 0000 ... 0000 0000 = 3F80 0000 hex
Same exponent as other powers of 2, mantissa is one less than 2 at 127, or 011 1
111 1 in binary.
.75 = 1.5 * 2^-1 = 0011 1111 0100 0000 ... 0000 0000 = 3F40 0000 hex
The biased exponent is 126, 011 1111 0 in binary, and the mantissa is (1.) 100 0
000 ... 0000 0000, which is 1-1/2.
2.5 = 1.25 * 2^1 = 0100 0000 0010 0000 ... 0000 0000 = 4020 0000 hex
Exactly the same as 2 except that the bit which represents 1/4 is set in the man
tissa.
0.1 = 1.6 * 2^-4 = 0011 1101 1100 1100 ... 1100 1101 = 3DCC CCCD hex
1/10 is a repeating fraction in binary. The mantissa is just shy of 1.6, and the
biased exponent says that 1.6 is to be divided by 16 (it is 011 1101 1 in binar
y, which is 123 in decimal). The true exponent is 123 - 127 = -4, which means th
at the factor by which to multiply is 2**-4 = 1/16. Note that the stored mantiss
a is rounded up in the last bit. This is an attempt to represent the unrepresent
able number as accurately as possible. (The reason that 1/10 and 1/100 are not e
xactly representable in binary is similar to the way that 1/3 is not exactly rep
resentable in decimal.)
0 = 1.0 * 2^-128 = all zeros -- a special case.
Other Common Floating-Point Errors
The following are common floating-point errors:
Round-off error
This error results when all of the bits in a binary number cannot be used in a c
alculation.
Example: Adding 0.0001 to 0.9900 (Single Precision)
Decimal 0.0001 will be represented as:
(1.)10100011011011100010111 * 2^(-14+Bias) (13 Leading 0s in Binary!)
0.9900 will be represented as:
(1.)11111010111000010100011 * 2^(-1+Bias)
Now to actually add these numbers, the decimal (binary) points must be aligned.
For this they must be Unnormalized. Here is the resulting addition:
.000000000000011010001101 * 2^0 <- Only 11 of 23 Bits retained
+.111111010111000010100011 * 2^0
________________________________
.111111010111011100110000 * 2^0
This is called a round-off error because some computers round when shifting for
addition. Others simply truncate. Round-off errors are important to consider whe
never you are adding or multiplying two very different values.
Subtracting two almost equal values
.1235
-.1234
_____
.0001
This will be normalized. Note that although the original numbers each had four s
ignificant digits, the result has only one significant digit.
Overflow and underflow
This occurs when the result is too large or too small to be represented by the d
ata type.
Quantizing error
This occurs with those numbers that cannot be represented in exact form by the f
loating-point standard.
Division by a very small number
This can trigger a "divide by zero" error or can produce bad results, as in the
following example:
A = 112000000
B = 100000
C = 0.0009
X = A - B / C
In QuickBasic for MS-DOS, X now has the value 888887, instead of the correct ans
wer, 900000.
Output error
This type of error occurs when the output functions alter the values they are wo
rking with.
Properties
Article ID: 42980 - Last Review: 08/16/2005 21:27:52 - Revision: 3.1
Applies to
Microsoft Visual Basic 2.0 Standard Edition
Microsoft Visual Basic 3.0 Professional Edition
Microsoft Visual Basic 4.0 Standard Edition
Microsoft Visual Basic 1.0 Standard Edition
Microsoft Visual Basic 2.0 Professional Edition
Microsoft Visual Basic 3.0 Professional Edition
Microsoft Visual Basic 4.0 Professional Edition
Microsoft Visual Basic for MS-DOS
Microsoft QuickBasic 4.0
Microsoft QuickBASIC 4.0b
Microsoft QuickBasic 4.5 for MS-DOS
Microsoft BASIC Compiler 6.0
Microsoft BASIC Compiler 6.0b
Microsoft BASIC Professional Development System 7.0
Microsoft Cinemania 97 Standard Edition
Keywords:
KB42980
Feedback
Was this information helpful? Yes No Somewhat
Tell us what we can do to improve the article
Submit
Support
Account support
Supported products list
Product support lifecycle
Security