Week 5: IEEE Floating Point Revision Guide For Phase Test

Floating Point
15900000000000000
could be represented as
159 * 10^14 (mantissa 159, exponent 14)
15.9 * 10^15
1.59 * 10^16
A calculator might display 159 E14
Binary
The value of real binary numbers…
Scientific   2^2  2^1  2^0  .  2^-1  2^-2  2^-3
Fractions                   .  ½     ¼     ⅛
Decimal      4    2    1    .  .5    .25   .125
Binary       1    0    1    .  1     0     1
101.101 = 4 + 1 + 1/2 + 1/8
        = 4 + 1 + .5 + .125 = 5.625
        = 5 ⅝
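The place-value sum above can be checked with a short Python sketch. The function name binary_to_decimal is only an illustrative choice and is not part of the original slides.

```python
def binary_to_decimal(bits: str) -> float:
    """Evaluate a binary string such as '101.101' by summing place values."""
    integer_part, _, fraction_part = bits.partition(".")
    value = 0.0
    # Integer digits carry weights 2^0, 2^1, 2^2 ... reading right to left.
    for position, digit in enumerate(reversed(integer_part)):
        value += int(digit) * 2 ** position
    # Fraction digits carry weights 2^-1, 2^-2, 2^-3 ... reading left to right.
    for position, digit in enumerate(fraction_part, start=1):
        value += int(digit) * 2 ** -position
    return value


print(binary_to_decimal("101.101"))  # 5.625
```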
IEEE Single Precision
The number will occupy 32 bits.
The first bit represents the sign of the number:
1 = negative, 0 = positive.
The next 8 bits specify the exponent, stored in
biased 127 form.
The remaining 23 bits carry the mantissa,
normalised to be between 1 and 2,
i.e. 1 <= mantissa < 2
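To see the 1 + 8 + 23 bit layout on a real machine, the following sketch uses Python's standard struct module to pack a value as a single precision number and split the pattern into its three fields. The helper name float32_fields is an assumption made for this illustration.

```python
import struct


def float32_fields(value: float) -> tuple:
    """Return (sign, exponent, mantissa) bit strings of value as a 32-bit float."""
    # Pack as big-endian single precision, then reinterpret as an unsigned int.
    (raw,) = struct.unpack(">I", struct.pack(">f", value))
    bits = f"{raw:032b}"
    return bits[0], bits[1:9], bits[9:]


sign, exponent, mantissa = float32_fields(-1.0)
print(sign, exponent, mantissa)  # 1 01111111 00000000000000000000000
```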
Basic Conversion
Converting a decimal number to a floating
point number.
1. Take the integer part of the number and generate the
binary equivalent.
2. Take the fractional part and generate a binary fraction.
3. Then place the two parts together and normalise.
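Steps 1 and 2 can be sketched in Python as repeated division and repeated multiplication by 2. This is a minimal illustration for non-negative inputs only. The function name decimal_to_binary and the max_fraction_bits limit are assumptions, not part of the slides.

```python
def decimal_to_binary(value: float, max_fraction_bits: int = 23) -> str:
    """Convert a non-negative decimal number to a binary string, e.g. 6.75 -> '110.11'."""
    integer_part = int(value)
    fraction_part = value - integer_part

    # Step 1: repeated division by 2; each remainder is a bit, last remainder first.
    integer_bits = ""
    while integer_part > 0:
        integer_part, remainder = divmod(integer_part, 2)
        integer_bits = str(remainder) + integer_bits
    integer_bits = integer_bits or "0"

    # Step 2: repeated multiplication by 2; the integer carry is the next bit.
    fraction_bits = ""
    while fraction_part > 0 and len(fraction_bits) < max_fraction_bits:
        fraction_part *= 2
        bit = int(fraction_part)
        fraction_bits += str(bit)
        fraction_part -= bit

    # Step 3 (first half): place the two parts together.
    return integer_bits + "." + (fraction_bits or "0")


print(decimal_to_binary(6.75))  # 110.11
```

The bit limit matters because many decimal fractions never terminate in binary, so the loop has to stop somewhere.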
IEEE – Example 1
Convert 6.75 to 32 bit IEEE format.
1. The mantissa. The integer first.
6/2 = 3 r 0
3/2 = 1 r 1    = 110₂ (remainders read from bottom to top)
1/2 = 0 r 1
2. Fraction next.
.75 * 2 = 1.5  = 0.11₂ (integer parts read from top to bottom)
.5 * 2 = 1.0
3. Put the two parts together… 110.11
Now normalise: 1.1011 * 2^2
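The normalising step can be sketched as moving the binary point until a single 1 sits in front of it, counting the moves as the exponent. The function below is a hypothetical helper for non-zero binary strings, not something defined in the slides.

```python
def normalise(binary: str) -> tuple:
    """Move the binary point so one '1' sits before it; return (mantissa, exponent)."""
    integer_bits, _, fraction_bits = binary.partition(".")
    digits = integer_bits + fraction_bits
    first_one = digits.index("1")  # position of the leading 1 (value must be non-zero)
    # Exponent = how many places the point moved left (negative if it moved right).
    exponent = len(integer_bits) - 1 - first_one
    mantissa = digits[first_one] + "." + (digits[first_one + 1:] or "0")
    return mantissa, exponent


print(normalise("110.11"))  # ('1.1011', 2)
```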
IEEE Biased 127 Exponent
To generate a biased 127 exponent:
Take the value of the signed exponent and add 127.
Example.
For 2^16 the stored exponent is 127 + 16 = 143, so the value of the
exponent field would be 143 = 10001111₂
So it is now simply an unsigned value.
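As a quick check of the arithmetic, a minimal Python sketch of the biasing step follows. The names BIAS and biased_exponent are illustrative, and the sketch does not guard against exponents outside the storable range.

```python
BIAS = 127


def biased_exponent(exponent: int) -> str:
    """Return the 8-bit exponent field for a signed exponent in biased 127 form."""
    return f"{exponent + BIAS:08b}"


print(biased_exponent(16))  # 10001111
print(biased_exponent(2))   # 10000001
```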
Possible Representations of an Exponent
Binary      Sign Magnitude   2's Complement   Biased 127 Exponent
00000000          0                0          -127 {reserved}
00000001          1                1          -126
00000010          2                2          -125
01111110        126              126            -1
01111111        127              127             0
10000000         -0             -128             1
10000001         -1             -127             2
11111110       -126               -2           127
11111111       -127               -1           128 {reserved}
Why Biased ?
The smallest exponent is 00000000.
There is only one exponent zero: 01111111.
The highest exponent is 11111111.
To increase the exponent by one,
simply add 1 to the present pattern.
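One consequence of these points is that the stored patterns sort in the same order as the exponents they stand for. The short Python check below illustrates this for a handful of exponents chosen arbitrarily for the example.

```python
exponents = [-126, -1, 0, 1, 127]
# Biased 127 patterns for each exponent, as 8-bit strings.
patterns = [f"{e + 127:08b}" for e in exponents]

for e, p in zip(exponents, patterns):
    print(f"{e:>5} -> {p}")

# Sorting the bit patterns as plain unsigned values keeps the exponent order.
print(patterns == sorted(patterns))  # True
```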
Back to the example
Our original example revisited…. 1.1011 * 2^2
Exponent is 2 + 127 = 129, or 10000001 in binary.
NOTE: The mantissa always ends up with a value of ‘1’ before
the dot. This is a waste of storage, therefore it is implied
but not actually stored. 1.1000 is stored as .1000
6.75 in 32 bit floating point IEEE representation:
0 10000001 10110000000000000000000
sign(1) exponent(8) mantissa(23)
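The assembly of the three fields can be sketched in Python and cross-checked against the machine's own single precision encoding via the standard struct module. The helper name assemble is an assumption for this illustration.

```python
import struct


def assemble(sign: int, mantissa: str, exponent: int) -> str:
    """Build the 32-bit pattern from a normalised mantissa such as '1.1011'."""
    exponent_field = f"{exponent + 127:08b}"
    # Drop the implied leading '1.' and pad the remainder to 23 bits.
    mantissa_field = mantissa.split(".")[1].ljust(23, "0")
    return f"{sign} {exponent_field} {mantissa_field}"


pattern = assemble(0, "1.1011", 2)
print(pattern)  # 0 10000001 10110000000000000000000

# Cross-check against the hardware encoding of 6.75.
(raw,) = struct.unpack(">I", struct.pack(">f", 6.75))
print(pattern.replace(" ", "") == f"{raw:032b}")  # True
```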
Special cases
Zero, + infinity and - infinity.
Zero is a pattern that only contains ‘0’s
00000000000000000000000000000000
Positive infinity is the pattern
011111111….
Negative infinity is the pattern
111111111….
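The full 32-bit patterns for these special cases can be printed with Python's struct module. In IEEE 754 an infinity has an exponent field of all 1s and a mantissa field of all 0s. The helper name bit_pattern is illustrative only.

```python
import math
import struct


def bit_pattern(value: float) -> str:
    """Return the 32-bit single precision pattern of value as a string of 0s and 1s."""
    (raw,) = struct.unpack(">I", struct.pack(">f", value))
    return f"{raw:032b}"


for name, value in [("zero", 0.0), ("+infinity", math.inf), ("-infinity", -math.inf)]:
    print(f"{name:>9}: {bit_pattern(value)}")
```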
Truncation and Rounding
Following arithmetic operations on a floating point
number we may have increased the number of
mantissa bits.
Since we have fixed storage (23 places) for the
mantissa we need to limit these bits.
The simplest approach is to truncate the result prior to
storage.
Example: 0.1101101 stored in 4 bits
=> 0.1101 (loss 0.0000101)
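Truncation is simply a matter of dropping the extra digits. The sketch below works on binary fraction strings of the form 0.xxxx; the function name truncate is a hypothetical helper.

```python
def truncate(binary_fraction: str, bits: int) -> tuple:
    """Keep only the first `bits` fraction digits; return (kept, lost)."""
    fraction = binary_fraction.partition(".")[2]
    kept = "0." + fraction[:bits]
    # The lost part keeps its place values: zeros stand where the kept bits were.
    lost = "0." + "0" * bits + fraction[bits:]
    return kept, lost


print(truncate("0.1101101", 4))  # ('0.1101', '0.0000101')
```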
Rounding
If the lost digits are worth more than ½ of the LSB then add 1 to the LSB.
Example (in 4 bits):
0.1101101 -> 0.1101 + 0.0001 = 0.1110 (rounded UP)
0.1101011 -> 0.1101 (rounded DOWN)
NOTE:
Rounding is always preferred to truncation, partly because it
is intrinsically more accurate, and partly because we end up with a
FAIR error, i.e. one that is not biased in a single direction.
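The rounding rule can be sketched in Python on binary fraction strings. This follows the slide's "more than ½" rule literally, so a lost part of exactly ½ LSB is left unrounded; that tie handling is an assumption of this sketch (IEEE 754 hardware by default rounds ties to the nearest even value). The function name round_fraction is illustrative.

```python
def round_fraction(binary_fraction: str, bits: int) -> str:
    """Round a binary fraction '0.xxxx' to `bits` places (bits >= 1)."""
    fraction = binary_fraction.partition(".")[2].ljust(bits, "0")
    kept, lost = fraction[:bits], fraction[bits:]
    value = int(kept, 2)
    # Lost digits exceed half an LSB when they start with 1 and contain a later 1.
    if lost.startswith("1") and "1" in lost[1:]:
        value += 1  # add 1 to the LSB
    # A carry out of the fraction (e.g. 0.1111 + 0.0001) lands in the integer digit.
    return f"{value >> bits}.{value & ((1 << bits) - 1):0{bits}b}"


print(round_fraction("0.1101101", 4))  # 0.1110
print(round_fraction("0.1101011", 4))  # 0.1101
```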
Other Considerations
Truncation always undervalues the result, and can
lead to a systematic error situation.
Rounding has one major disadvantage: it requires
up to two further arithmetic operations.
Note: when we use floating point, care has to be taken
when comparing the size of numbers because we are
generating binary fractions of a predefined length.
There is always the chance of recurring
fractions, like 1/3 in decimal:
0.333333333333333333333 etc.
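A familiar illustration of the comparison problem is shown below. Note that Python floats are 64-bit double precision rather than the 32-bit format described here, but the principle is the same: 0.1 and 0.2 are recurring fractions in binary, so their stored values are only approximations.

```python
import math

total = 0.1 + 0.2

# Direct comparison fails because of the tiny representation errors.
print(total == 0.3)              # False

# Comparing within a tolerance is the safer approach.
print(math.isclose(total, 0.3))  # True
```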
From Floating Point Binary
to Decimal Example
1 01111011 11100000100000000000000
Sign = 1 therefore this number is a negative number.
Exponent 01111011 = 64+32+16+8+2+1
= 123
Subtract the 127 = -4
Mantissa = 1.111000001
1.111000001 * 2^-4
-ve 0.0001111000001
1/16 + 1/32 + 1/64 + 1/128 + 1/8192
or -0.1173095703125
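The decoding steps above can be sketched in Python as follows. This minimal version handles ordinary normalised numbers only, not the reserved exponent patterns, and the function name decode_float32 is an assumption for this illustration.

```python
def decode_float32(pattern: str) -> float:
    """Decode a normalised single precision pattern written as 'sign exponent mantissa'."""
    bits = pattern.replace(" ", "")
    sign = -1 if bits[0] == "1" else 1
    exponent = int(bits[1:9], 2) - 127  # remove the bias
    # Restore the implied leading 1, then add the stored 23-bit fraction.
    mantissa = 1 + int(bits[9:], 2) / 2 ** 23
    return sign * mantissa * 2.0 ** exponent


print(decode_float32("1 01111011 11100000100000000000000"))  # -0.1173095703125
```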
Floating Point Maths
Floating point addition and subtraction.
1. Make sure that the two numbers are of the same
magnitude: their exponents have to be equal.
2. We then add or subtract the mantissas.
3. Starting with the existing exponent, re-normalise if
needed.
Example
Example: 1.1 * 2^3 + 1.1 * 2^2
Select the smaller number and make the mantissa smaller by
moving the point whilst increasing the exponent until the
exponents match.
1.1 * 2^2 -> 0.11 * 2^3
Add the mantissas.
Re-normalise.
Example
  1.1 * 2^3  =  001.1  * 2^3
+ 1.1 * 2^2  =  000.11 * 2^3
                010.01 * 2^3
Re-normalise 010.01 * 2^3
= 1.001 * 2^4
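The three addition steps can be sketched in Python with the mantissas held as ordinary numbers (1.1 in binary is 1.5 in decimal). This is a simplified illustration for positive values only, and fp_add is a hypothetical name.

```python
def fp_add(m1: float, e1: int, m2: float, e2: int) -> tuple:
    """Add m1 * 2^e1 and m2 * 2^e2 (positive, normalised); return (mantissa, exponent)."""
    # Step 1: make the first operand the larger one, then shift the smaller
    # mantissa right until the exponents match.
    if e1 < e2:
        m1, e1, m2, e2 = m2, e2, m1, e1
    m2 = m2 / 2 ** (e1 - e2)
    # Step 2: add the mantissas.
    mantissa, exponent = m1 + m2, e1
    # Step 3: re-normalise so that 1 <= mantissa < 2.
    while mantissa >= 2:
        mantissa, exponent = mantissa / 2, exponent + 1
    return mantissa, exponent


# 1.1 * 2^3 + 1.1 * 2^2, with binary 1.1 written as decimal 1.5
print(fp_add(1.5, 3, 1.5, 2))  # (1.125, 4)
```

The result 1.125 is 1.001 in binary, matching the worked example.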
FP math
Floating Point Multiplication
Assume two numbers a x 2^m and b x 2^n
Result: (a x 2^m) x (b x 2^n) = (a x b) x 2^(m+n)
Floating Point Division
Assume two numbers a x 2^m and b x 2^n
Result: (a x 2^m) / (b x 2^n) = (a / b) x 2^(m-n)
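Both rules can be sketched in a few lines of Python, with a re-normalising step added because a product of two mantissas can reach 2 or more and a quotient can fall below 1. The function names are illustrative, and the sketch assumes positive, normalised mantissas.

```python
def renormalise(mantissa: float, exponent: int) -> tuple:
    """Adjust a positive mantissa into the range 1 <= mantissa < 2."""
    while mantissa >= 2:
        mantissa, exponent = mantissa / 2, exponent + 1
    while mantissa < 1:
        mantissa, exponent = mantissa * 2, exponent - 1
    return mantissa, exponent


def fp_multiply(a: float, m: int, b: float, n: int) -> tuple:
    """(a x 2^m) x (b x 2^n) = (a x b) x 2^(m+n)"""
    return renormalise(a * b, m + n)


def fp_divide(a: float, m: int, b: float, n: int) -> tuple:
    """(a x 2^m) / (b x 2^n) = (a / b) x 2^(m-n)"""
    return renormalise(a / b, m - n)


print(fp_multiply(1.5, 3, 1.5, 2))   # (1.125, 6)  i.e. 12 * 6 = 72
print(fp_divide(1.125, 4, 1.5, 2))   # (1.5, 1)    i.e. 18 / 6 = 3
```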