
Floating point is a method of representing real numbers that can support a wide range of values. A typical number that can be represented exactly is of the form:

significant digits × base^exponent

The term floating point refers to the fact that the radix point can "float", i.e., it can be placed anywhere relative to the significant digits of the number.

• Floating point numbers approximate real numbers.
• Floating point numbers have a large dynamic range.

The IEEE has produced a standard for floating point arithmetic, IEEE 754. This standard specifies how single precision (32 bit) and double precision (64 bit) floating point numbers are to be represented, as well as how arithmetic should be carried out on them.

The IEEE 754 standard specifies a binary32 as having:
• Sign bit: 1 bit
• Exponent width: 8 bits
• Significand precision: 24 bits (23 explicitly stored)
• The base is 2.

The sign bit determines the sign of the number, which is the sign of the significand as well. Sign bit = 0 if the number is positive, 1 if the number is negative.

The exponent field needs to represent both positive and negative exponents. To do this, a bias of 127 is added to the actual exponent in order to get the stored exponent. Thus, an exponent of zero means that 127 is stored in the exponent field. A stored value of 200 indicates an exponent of (200-127), or 73. Exponents of -127 (all 0s) and +128 (all 1s) are reserved for special numbers.

The significand is also known as the mantissa. The true significand includes 23 fraction bits to the right of the binary point and an implicit leading bit with value 1, unless the exponent is stored with all zeros. Thus only 23 fraction bits of the significand appear in the memory format, but the total precision is 24 bits.

The bits are laid out as follows:

sign      exponent      significand
31        30 ... 23     22 ... 0
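The layout above can be checked on a real machine. The following is a minimal sketch, assuming Python's standard `struct` module is used to reinterpret the four bytes of a single precision value as an unsigned integer; the helper name `float32_fields` is hypothetical and chosen only for illustration.

```python
import struct

def float32_fields(x):
    """Return (sign, stored_exponent, fraction) of x after rounding to binary32."""
    # Pack as a big-endian 32-bit float, then read the same 4 bytes as an unsigned int.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                # bit 31
    exponent = (bits >> 23) & 0xFF   # bits 30..23, still biased by 127
    fraction = bits & 0x7FFFFF       # bits 22..0, the explicitly stored fraction
    return sign, exponent, fraction

sign, exponent, fraction = float32_fields(1.0)
# The actual exponent is the stored exponent minus the bias of 127.
print(sign, exponent, exponent - 127, format(fraction, "023b"))
```

For 1.0 the stored exponent is 127, which matches the statement that an exponent of zero is stored as 127.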

The value v of the number represented in single precision format, with sign bit s, stored exponent e and fraction f, is as follows:
(a) If e = 255 and f ≠ 0, then v = NaN.
(b) If e = 255 and f = 0, then v = (-1)^s × ∞.
(c) If 0 < e < 255, then v = (-1)^s × 2^(e-127) × (1.f).
(d) If e = 0 and f ≠ 0, then v = (-1)^s × 2^(-126) × (0.f).
(e) If e = 0 and f = 0, then v = (-1)^s × 0 (zero).
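The five cases can be written out directly as a small decoder. This is an illustrative sketch, not a library routine: the function name `decode_binary32` is an assumption, and the input is taken to be the 32-bit pattern given as a non-negative integer.

```python
def decode_binary32(bits):
    """Decode a 32-bit pattern (as an int) following the IEEE 754 single precision cases."""
    s = (bits >> 31) & 1
    e = (bits >> 23) & 0xFF
    f = bits & 0x7FFFFF
    sign = -1.0 if s else 1.0
    if e == 255:
        # f != 0 is NaN; f == 0 is signed infinity.
        return float("nan") if f != 0 else sign * float("inf")
    if e == 0:
        # Denormalized numbers (0.f) * 2^-126; this also yields signed zero when f == 0.
        return sign * (f / 2**23) * 2.0**-126
    # Normalized numbers: implicit leading 1, exponent biased by 127.
    return sign * (1 + f / 2**23) * 2.0 ** (e - 127)

print(decode_binary32(0x3F800000))  # pattern of 1.0
print(decode_binary32(0xC0D9999A))  # pattern close to -6.8
```

Cases (d) and (e) share one branch here because (0.f) with f = 0 already evaluates to zero.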

In order to maximize the quantity of representable numbers, floating-point numbers are typically stored in normalized form. This basically puts the radix point after the first non-zero digit. In normalized form, five is represented as 5.0 × 10^0.

A nice little optimization is available to us in base two, since the only possible non-zero digit is 1. Thus, we can just assume a leading digit of 1, and don't need to represent it explicitly. As a result, the mantissa has effectively 24 bits of resolution, by way of 23 fraction bits.

The storage format of double precision is as shown:
• Sign bit: 1 bit
• Exponent width: 11 bits
• Significand precision: 53 bits (52 explicitly stored)
• The bias for the exponent is 1023.

sign      exponent      significand
63        62 ... 52     51 ... 0
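The same field extraction works for double precision with wider masks. A minimal sketch, again assuming Python's `struct` module; the helper name `float64_fields` is hypothetical.

```python
import struct

def float64_fields(x):
    """Return (sign, stored_exponent, fraction) of a double precision value."""
    # Reinterpret the 8 bytes of the double as an unsigned 64-bit integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                    # bit 63
    exponent = (bits >> 52) & 0x7FF      # bits 62..52, biased by 1023
    fraction = bits & ((1 << 52) - 1)    # bits 51..0
    return sign, exponent, fraction

sign, exponent, fraction = float64_fields(1.5)
print(sign, exponent - 1023, format(fraction, "052b"))
```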

Convert the following single-precision IEEE 754 number into a floating-point decimal value:

1 10000001 10110011001100110011010

First, put the bits in three groups. Bit 31 (the leftmost bit) shows the sign of the number. Bits 23-30 (the next 8 bits) are the exponent. Bits 0-22 (on the right) give the fraction.

Now, look at the sign bit. If this bit is a 1, the number is negative, otherwise positive. Here this bit is 1, so the number is negative.

Get the exponent and the correct bias. The exponent is simply a positive binary number: 10000001 (binary) = 129 (decimal). Remember that we will have to subtract a bias from this exponent to find the power of 2. Since this is a single-precision number, the bias is 127.

Convert the fraction string into base ten. This is the trickiest step. The binary string represents a fraction, so conversion is a little different. Binary fractions look like this:

0.1 = (1/2) = 2^-1
0.01 = (1/4) = 2^-2
0.001 = (1/8) = 2^-3

So, for this example, we multiply each digit by the corresponding power of 2:

0.10110011001100110011010 (binary) = 1*2^-1 + 0*2^-2 + 1*2^-3 + 1*2^-4 + 0*2^-5 + 0*2^-6 + ...
0.10110011001100110011010 (binary) = 1/2 + 1/8 + 1/16 + ...

Note that this number is just an approximation of some decimal number. There will most likely be some error. In this case, the fraction is about 0.7000000476837158.

This is all the information we need. We can put these numbers in the expression:

(-1)^(sign bit) * (1 + fraction) * 2^(exponent - bias)
= (-1)^1 * (1.7000000476837158) * 2^(129-127)
= -6.8 (approximately)

The answer is approximately -6.8.
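The worked example can be replayed step by step in code. The sketch below follows the same three groups and the same expression, then cross-checks the result against the machine's own interpretation of the bit pattern via `struct`; the variable names are illustrative only.

```python
import struct

bits = "1" + "10000001" + "10110011001100110011010"

sign = int(bits[0], 2)        # bit 31
exponent = int(bits[1:9], 2)  # bits 30..23, a plain positive binary number
# Each fraction digit is weighted by 2^-1, 2^-2, 2^-3, ...
fraction = sum(int(b) * 2.0 ** -(i + 1) for i, b in enumerate(bits[9:]))

bias = 127
value = (-1) ** sign * (1 + fraction) * 2.0 ** (exponent - bias)

# Cross-check: let the machine interpret the same 32 bits as a float.
(reference,) = struct.unpack(">f", int(bits, 2).to_bytes(4, "big"))

print(exponent, fraction, value, reference)
```

The hand computation and the hardware interpretation agree exactly, because every step above involves only sums and products of powers of two.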

Convert 0.1015625 to IEEE 32-bit floating point format.

Converting:

0.1015625 × 2 = 0.203125    0    Generate 0 and continue.
0.203125 × 2 = 0.40625      0    Generate 0 and continue.
0.40625 × 2 = 0.8125        0    Generate 0 and continue.
0.8125 × 2 = 1.625          1    Generate 1 and continue with the rest.
0.625 × 2 = 1.25            1    Generate 1 and continue with the rest.
0.25 × 2 = 0.5              0    Generate 0 and continue.
0.5 × 2 = 1.0               1    Generate 1 and nothing remains.

So 0.1015625 (decimal) = 0.0001101 (binary).

Normalize: 0.0001101 (binary) = 1.101 (binary) × 2^-4.

Mantissa is 10100000000000000000000, exponent is -4 + 127 = 123 = 01111011 (binary), sign bit is 0.

So 0.1015625 is 00111101110100000000000000000000.
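The repeated-doubling procedure and the normalization step can be sketched in a few lines. This sketch assumes a positive input below 1 whose binary expansion terminates within 23 significant bits, so no rounding is needed; the helper name `fraction_bits` is hypothetical.

```python
import struct

def fraction_bits(x, max_bits=40):
    """Binary digits after the point for 0 <= x < 1, by repeated doubling."""
    digits = []
    while x != 0 and len(digits) < max_bits:
        x *= 2
        bit = int(x)            # the integer part is the next generated digit
        digits.append(str(bit))
        x -= bit                # continue with the rest
    return "".join(digits)

digits = fraction_bits(0.1015625)

# Normalize: move the binary point just past the first 1 digit.
shift = digits.index("1") + 1
exponent = -shift
mantissa = digits[shift:].ljust(23, "0")         # drop the implicit leading 1, pad to 23 bits
stored_exponent = format(exponent + 127, "08b")  # add the bias
pattern = "0" + stored_exponent + mantissa       # sign bit 0 for a positive number

# Cross-check against the machine's own encoding.
reference = format(struct.unpack(">I", struct.pack(">f", 0.1015625))[0], "032b")
print(digits, exponent, pattern, pattern == reference)
```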

Binary Fractional Numbers
• "Even" when the least significant bit is 0
• Halfway when the bits to the right of the rounding position = 100... (binary)

Examples: round to nearest 1/4 (2 bits right of the binary point)

Value     Binary      Rounded   Action          Rounded Value
2 3/32    10.00011    10.00     (<1/2, down)    2
2 3/16    10.00110    10.01     (>1/2, up)      2 1/4
2 7/8     10.11100    11.00     (1/2, up)       3
2 5/8     10.10100    10.10     (1/2, down)     2 1/2
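The four rows of the table can be reproduced with exact rational arithmetic. The following is a small sketch of round-to-nearest-even at a resolution of 1/4, using Python's `fractions.Fraction` so that the halfway cases are detected exactly; the function name `round_to_quarter` is an illustrative assumption.

```python
from fractions import Fraction

def round_to_quarter(value):
    """Round to the nearest multiple of 1/4, sending halfway cases to the even neighbour."""
    scaled = Fraction(value) * 4                    # count in units of 1/4
    whole = scaled.numerator // scaled.denominator  # floor of the scaled value
    remainder = scaled - whole                      # bits to the right of the rounding position
    half = Fraction(1, 2)
    if remainder > half:
        whole += 1                                  # more than halfway: round up
    elif remainder == half and whole % 2 == 1:
        whole += 1                                  # exactly halfway: round up only if that gives an even last bit
    return Fraction(whole, 4)

for value in (Fraction(67, 32), Fraction(35, 16), Fraction(23, 8), Fraction(21, 8)):
    print(value, "->", round_to_quarter(value))
```

The inputs are 2 3/32, 2 3/16, 2 7/8 and 2 5/8 written as improper fractions.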

Operands: (-1)^s1 × M1 × 2^E1 and (-1)^s2 × M2 × 2^E2. Assume E1 > E2.

Exact result: (-1)^s × M × 2^E
• Sign s, significand M: result of signed align and add, with the second significand shifted right by E1-E2 positions
• Exponent E: E1

Fixing:
• If M ≥ 2, shift M right, increment E
• If M < 1, shift M left k positions, decrement E by k
• Overflow if E out of range
• Round M to fit frac precision

  3.25 x 10 ** 3
+ 2.63 x 10 ** -1
-----------------
First step: align decimal points.
Second step: add.

  3.25     x 10 ** 3
+ 0.000263 x 10 ** 3
--------------------
= 3.250263 x 10 ** 3
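The align-then-add procedure can be sketched for any base. This is a simplified illustration under stated assumptions: significands are kept as exact `Fraction` values, so the final rounding and overflow checks are deliberately left out, and the function name `fp_add` is hypothetical.

```python
from fractions import Fraction

def fp_add(m1, e1, m2, e2, base=10):
    """Add m1*base**e1 and m2*base**e2; return a normalized (significand, exponent) pair."""
    m1, m2 = Fraction(m1), Fraction(m2)
    if e1 < e2:
        # Make the first operand the one with the larger exponent.
        m1, e1, m2, e2 = m2, e2, m1, e1
    # Align: shift the smaller operand right by e1 - e2 digit positions.
    m2_aligned = m2 / base ** (e1 - e2)
    m, e = m1 + m2_aligned, e1
    # Fix up: bring the significand back into the range [1, base).
    while abs(m) >= base:
        m /= base
        e += 1
    while m != 0 and abs(m) < 1:
        m *= base
        e -= 1
    return m, e

m, e = fp_add("3.25", 3, "2.63", -1)
print(float(m), e)
```

Passing `base=2` gives the binary version, where a significand of 2 or more is shifted right and one below 1 is shifted left, as in the fixing rules.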