Floating Point Tutorial

Decimal and Binary Number Systems
Normally, we count things in base 10. The base is completely arbitrary. The only
reason that people have traditionally used base 10 is that they have 10 fingers
, which have made handy counting tools.
The number 532.25 in decimal (base 10) means the following:
(5 * 10^2) + (3 * 10^1) + (2 * 10^0) + (2 * 10^-1) + (5 * 10^-2)
500
+
30
+
2
+
2/10
+
5/100
_________
= 532.25
In the binary number system (base 2), each column
d of 10. For example, the number 101.01 means the
(1 * 2^2) + (0 * 2^1) + (1 * 2^0) + (0 * 2^-1)
4
+
0
+
1
+
0
_________
= 5.25 Decimal
represents a power of 2 instea

following:
+ (1 * 2^-2)
+
1/4
How Integers Are Represented in PCs

Because there is no fractional part to an integer, its machine representation is
much simpler than it is for floating-point values. Normal integers on personal
computers (PCs) are 2 bytes (16 bits) long with the most significant bit indicat
ing the sign. Long integers are 4 bytes long. Positive values are straightforwar
d binary numbers. For example:
1 Decimal = 1 Binary
2 Decimal = 10 Binary
22 Decimal = 10110 Binary, etc.
However, negative integers are represented using the two's complement scheme. To
get the two's complement representation for a negative number, take the binary
representation for the number's absolute value and then flip all the bits and ad
d 1. For example:
4 Decimal = 0000 0000 0000 0100
1111 1111 1111 1011
Flip the Bits
-4
= 1111 1111 1111 1100
Add 1
Note that -1 Decimal = 1111 1111 1111 1111 in Binary, which explains why Basic t
reats -1 as logical true (All bits = 1). This is a consequence of not having sep
arate operators for bitwise and logical comparisons. Often in Basic, it is conve
nient to use the code fragment below when your program will be making many logic
al comparisons. This greatly aids readability.
CONST TRUE = -1
CONST FALSE = NOT TRUE
Note that adding any combination of two's complement numbers together using ordi
nary binary arithmetic produces the correct result.
Floating-Point Complications
Every decimal integer can be exactly represented by a binary integer; however, t
his is not true for fractional numbers. In fact, every number that is irrational
in base 10 will also be irrational in any system with a base smaller than 10.
For binary, in particular, only fractional numbers that can be represented in th
e form p/q, where q is an integer power of 2, can be expressed exactly, with a f

inite number of bits.
Even common decimal fractions, such as decimal 0.0001, cannot be represented exa
ctly in binary. (0.0001 is a repeating binary fraction with a period of 104 bits
!)
This explains why a simple example, such as the following
SUM = 0
FOR I% = 1 TO 10000
SUM = SUM + 0.0001
NEXT I%
PRINT SUM
' Theoretically = 1.0.
will PRINT 1.000054 as output. The small error in representing 0.0001 in binary
propagates to the sum.
For the same reason, you should always be very cautious when making comparisons
on real numbers. The following example illustrates a common programming error:
item1# = 69.82#
item2# = 69.20# + 0.62#
IF item1# = item2# then print "Equality!"
This will NOT PRINT "Equality!" because 69.82 cannot be represented exactly in b
inary, which causes the value that results from the assignment to be SLIGHTLY di
fferent (in binary) than the value that is generated from the expression. In pra
ctice, you should always code such comparisons in such a way as to allow for som
e tolerance. For example:
IF (item1# < 69.83#) AND (item1# > 69.81#) then print "Equal"
This will PRINT "Equal".
IEEE Format Numbers
QuickBasic for MS-DOS, version 3.0 was shipped with an MBF (Microsoft Binary Flo
ating Point) version and an IEEE (Institute of Electrical and Electronics Engine
ers) version for machines with a math coprocessor. QuickBasic for MS-DOS, versio
ns 4.0 and later only use IEEE. Microsoft chose the IEEE standard to represent f
loating-point values in current versions of Basic for the following three primar
y reasons:
To allow Basic to use the Intel math coprocessors, which use IEEE format. The In
tel 80x87 series coprocessors cannot work with Microsoft Binary Format numbers.
To make interlanguage calling between Basic, C, Pascal, FORTRAN, and MASM much e
asier. Otherwise, conversion routines would have to be used to send numeric valu
es from one language to another.
To achieve consistency. IEEE is the accepted industry standard for C and FORTRAN
compilers.
The following is a quick comparison of IEEE and MBF representations for a double
-precision number:
Sign Bits Exponent Bits Mantissa Bits
--------- ------------- ------------IEEE
1
11
52 + 1 (Implied)
MBF
1
8
56
For more information on the differences between IEEE and MBF floating-point repr
esentation, query in the Microsoft Knowledge Base on the following words:
IEEE and floating and point and appnote
Note that IEEE has more bits dedicated to the exponent, which allows it to repre
sent a wider range of values. MBF has more mantissa bits, which allows it to be
more precise within its narrower range.
General Floating-Point Concepts
It is very important to realize that any binary floating-point system can repres
ent only a finite number of floating-point values in exact form. All other value
s must be approximated by the closest representable value. The IEEE standard spe
cifies the method for rounding values to the "closest" representable value. Quic
kBasic for MS-DOS supports the standard and rounds according to the IEEE rules.
Also, keep in mind that the numbers that can be represented in IEEE are spread o
ut over a very wide range. You can imagine them on a number line. There is a hig
h density of representable numbers near 1.0 and -1.0 but fewer and fewer as you
go towards 0 or infinity.
The goal of the IEEE standard, which is designed for engineering calculations, i
s to maximize accuracy (to get as close as possible to the actual number). Preci
sion refers to the number of digits that you can represent. The IEEE standard at
tempts to balance the number of bits dedicated to the exponent with the number o
f bits used for the fractional part of the number, to keep both accuracy and pre
cision within acceptable limits.
IEEE Details
Floating-point numbers are represented in the following form, where [exponent] i
s the binary exponent:
X = Fraction * 2^(exponent - bias)
[Fraction] is the normalized fractional part of the number, normalized because t
he exponent is adjusted so that the leading bit is always a 1. This way, it does
not have to be stored, and you get one more bit of precision. This is why there
is an implied bit. You can think of this like scientific notation, where you ma
nipulate the exponent to have one digit to the left of the decimal point, except
in binary, you can always manipulate the exponent so that the first bit is a 1,
since there are only 1s and 0s.
[bias] is the bias value used to avoid having to store negative exponents.
The bias for single-precision numbers is 127 and 1023 (decimal) for double-preci
sion numbers.
The values equal to all 0's and all 1's (binary) are reserved for representing s
pecial cases. There are other special cases as well, that indicate various error
conditions.
Single-Precision Examples
2 = 1 * 2^1 = 0100 0000 0000 0000 ... 0000 0000 = 4000 0000 hex
Note the sign bit is zero, and the stored exponent is 128, or 100 0000 0 in bina
ry, which is 127 plus 1. The stored mantissa is (1.) 000 0000 ... 0000 0000, whi
ch has an implied leading 1 and binary point, so the actual mantissa is 1.
-2 = -1 * 2^1 = 1100 0000 0000 0000 ... 0000 0000 = C000 0000 hex
Same as +2 except that the sign bit is set. This is true for all IEEE format flo
ating-point numbers.
4 = 1 * 2^2 = 0100 0000 1000 0000 ... 0000 0000 = 4080 0000 hex
Same mantissa, exponent increases by one (biased value is 129, or 100 0000 1 in
binary.
6 = 1.5 * 2^2 = 0100 0000 1100 0000 ... 0000 0000 = 40C0 0000 hex
Same exponent, mantissa is larger by half -- it's (1.) 100 0000 ... 0000 0000, w
hich, since this is a binary fraction, is 1-1/2 (the values of the fractional di
gits are 1/2, 1/4, 1/8, etc.).
1 = 1 * 2^0 = 0011 1111 1000 0000 ... 0000 0000 = 3F80 0000 hex
Same exponent as other powers of 2, mantissa is one less than 2 at 127, or 011 1
111 1 in binary.
.75 = 1.5 * 2^-1 = 0011 1111 0100 0000 ... 0000 0000 = 3F40 0000 hex
The biased exponent is 126, 011 1111 0 in binary, and the mantissa is (1.) 100 0
000 ... 0000 0000, which is 1-1/2.
2.5 = 1.25 * 2^1 = 0100 0000 0010 0000 ... 0000 0000 = 4020 0000 hex
Exactly the same as 2 except that the bit which represents 1/4 is set in the man
tissa.
0.1 = 1.6 * 2^-4 = 0011 1101 1100 1100 ... 1100 1101 = 3DCC CCCD hex
1/10 is a repeating fraction in binary. The mantissa is just shy of 1.6, and the
biased exponent says that 1.6 is to be divided by 16 (it is 011 1101 1 in binar
y, which is 123 in decimal). The true exponent is 123 - 127 = -4, which means th
at the factor by which to multiply is 2**-4 = 1/16. Note that the stored mantiss
a is rounded up in the last bit. This is an attempt to represent the unrepresent
able number as accurately as possible. (The reason that 1/10 and 1/100 are not e
xactly representable in binary is similar to the way that 1/3 is not exactly rep
resentable in decimal.)
0 = 1.0 * 2^-128 = all zeros -- a special case.
Other Common Floating-Point Errors
The following are common floating-point errors:
Round-off error
This error results when all of the bits in a binary number cannot be used in a c
alculation.
Example: Adding 0.0001 to 0.9900 (Single Precision)
Decimal 0.0001 will be represented as:
(1.)10100011011011100010111 * 2^(-14+Bias) (13 Leading 0s in Binary!)
0.9900 will be represented as:
(1.)11111010111000010100011 * 2^(-1+Bias)
Now to actually add these numbers, the decimal (binary) points must be aligned.
For this they must be Unnormalized. Here is the resulting addition:
.000000000000011010001101 * 2^0 <- Only 11 of 23 Bits retained
+.111111010111000010100011 * 2^0
________________________________
.111111010111011100110000 * 2^0
This is called a round-off error because some computers round when shifting for
addition. Others simply truncate. Round-off errors are important to consider whe
never you are adding or multiplying two very different values.
Subtracting two almost equal values
.1235
-.1234
_____
.0001
This will be normalized. Note that although the original numbers each had four s
ignificant digits, the result has only one significant digit.
Overflow and underflow
This occurs when the result is too large or too small to be represented by the d
ata type.
Quantizing error
This occurs with those numbers that cannot be represented in exact form by the f
loating-point standard.
Division by a very small number
This can trigger a "divide by zero" error or can produce bad results, as in the
following example:
A = 112000000
B = 100000
C = 0.0009
X = A - B / C
In QuickBasic for MS-DOS, X now has the value 888887, instead of the correct ans
wer, 900000.
Output error
This type of error occurs when the output functions alter the values they are wo
rking with.
Properties
Article ID: 42980 - Last Review: 08/16/2005 21:27:52 - Revision: 3.1
Applies to
Microsoft Visual Basic 2.0 Standard Edition
Microsoft Visual Basic 3.0 Professional Edition
Microsoft Visual Basic for MS-DOS
Microsoft QuickBasic 4.0
Microsoft QuickBASIC 4.0b
Microsoft QuickBasic 4.5 for MS-DOS
Microsoft BASIC Compiler 6.0
Microsoft BASIC Compiler 6.0b
Microsoft BASIC Professional Development System 7.0
Microsoft Cinemania 97 Standard Edition
Keywords:
KB42980
Feedback
Was this information helpful? Yes No Somewhat
Tell us what we can do to improve the article
Submit
Support
Account support
Supported products list
Product support lifecycle
Security

Floating Point Tutorial

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Floating Point Tutorial

Uploaded by

Copyright:

Available Formats

Decimal and Binary Number Systems

represents a power of 2 instea

How Integers Are Represented in PCs

e form p/q, where q is an integer power of 2, can be expressed exactly, with a f

You might also like