Lect 24 - IEEE Floating Point Units

IEEE Floating Point
The IEEE Floating Point Standard and execution units for it
1/8/2007 - L24 IEEE Floating Point Basics - Copyright 2006 - Joanne DeGroat, ECE, OSU
Lecture overview
 The standard
 Floating Point Basics
 A floating point adder design
The floating point standard
 Single Precision
 Fields: s (1 bit), e (8 bits), f (23 bits)
 Value v of the bits stored in the representation is:
 If e = 255 and f /= 0, then v is NaN regardless of s
 If e = 255 and f = 0, then v = (-1)^s ∞
 If 0 < e < 255, then v = (-1)^s 2^(e-127) (1.f) – normalized number
 If e = 0 and f /= 0, then v = (-1)^s 2^(-126) (0.f)
 Denormalized numbers – allow for graceful underflow
 If e = 0 and f = 0, then v = (-1)^s 0 (zero)
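The single precision rules above can be sketched in Python. This is an illustrative sketch only: the function name `decode_single` is an assumption made for this example, and the arithmetic is done in Python floats (doubles), which can hold every single precision value exactly.

```python
def decode_single(word):
    """Decode a 32-bit pattern using the single precision rules."""
    s = (word >> 31) & 0x1        # sign bit
    e = (word >> 23) & 0xFF       # 8-bit biased exponent
    f = word & 0x7FFFFF           # 23-bit stored fraction
    sign = -1.0 if s else 1.0
    if e == 255:
        # all-ones exponent: NaN if f /= 0, otherwise signed infinity
        return float("nan") if f != 0 else sign * float("inf")
    if e == 0:
        # zero or denormalized: no hidden 1, exponent fixed at -126
        return sign * (f / 2**23) * 2.0**-126
    # normalized: hidden leading 1, exponent unbiased by 127
    return sign * (1 + f / 2**23) * 2.0**(e - 127)

print(decode_single(0x42C80000))   # 100.0
print(decode_single(0x80000000))   # -0.0
```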
 Double Precision
 Fields: s (1 bit), e (11 bits), f (52 bits)
 Value v of the bits in the word representation is:
 If e = 2047 and f /= 0, then v is NaN regardless of s
 If e = 2047 and f = 0, then v = (-1)^s ∞
 If 0 < e < 2047, then v = (-1)^s 2^(e-1023) (1.f) – normalized number
 If e = 0 and f /= 0, then v = (-1)^s 2^(-1022) (0.f)
 Denormalized numbers – allow for graceful underflow
 If e = 0 and f = 0, then v = (-1)^s 0 (zero)
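To see the double precision fields on a real value, the following sketch splits a Python float into s, e and f. It assumes that Python's `float` is an IEEE double, which holds on common platforms; `double_fields` is a name chosen for this example.

```python
import struct

def double_fields(x):
    """Return (s, e, f) for a float stored as a 64-bit IEEE double."""
    # reinterpret the 8 bytes of the double as one unsigned 64-bit integer
    (word,) = struct.unpack(">Q", struct.pack(">d", x))
    s = word >> 63                  # 1 sign bit
    e = (word >> 52) & 0x7FF        # 11-bit biased exponent
    f = word & ((1 << 52) - 1)      # 52-bit stored fraction
    return s, e, f

s, e, f = double_fields(100.0)
print(s, e, hex(f))                 # 0 1029 0x9000000000000
print(e - 1023)                     # 6, the unbiased exponent
```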
 Notes on single and double precision
 The leading 1 of the mantissa is not stored for normalized numbers
 The representation allows both +0 and -0; the sign indicates the direction from which 0 was reached (this can matter when a result has been rounded to 0)
 Denormalized numbers allow graceful underflow towards 0
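A short sketch of the signed zeros and the smallest denormalized number in single precision. It relies on Python's standard `struct` module to round a float to single precision; the helper name `single_bits` is an assumption for this example.

```python
import math
import struct

def single_bits(x):
    """Bit pattern of x stored as an IEEE single, as an integer."""
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    return word

print(hex(single_bits(0.0)))          # 0x0
print(hex(single_bits(-0.0)))         # 0x80000000 (only the sign bit set)
print(0.0 == -0.0)                    # True: the two zeros compare equal
print(math.copysign(1.0, -0.0))       # -1.0: but the sign is still recorded
print(hex(single_bits(2.0**-149)))    # 0x1: smallest denormal, e = 0 and f = 1
```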
Conversion Examples
 Converting from base 10 to the representation
 Single precision example
 Convert 100 (base 10)
 Step 1 – convert to binary: 0110 0100
 128 64 32 16 8 4 2 1
   0  1  1  0 0 1 0 0
 Written in the binary form 1.xxx we have
 0110 0100 = 1.100100 x 2^6
Conversion Example Continued
 1.1001 x 2^6 is binary for 100
 Thus the exponent is 6
 Biased exponent will be 6 + 127 = 133 = 1000 0101
 Sign will be 0 for positive
 Stored fractional part f will be 1001 (followed by zeros)
 Thus we have
 s = 0, e = 1000 0101, f = 1001 0000 ….
 Regrouped in 4-bit groups: 0100 0010 1100 1000 0000 ….
 Or 4 2 C 8 0 0 0 0 in hexadecimal
 $42C8 0000 is the representation for 100
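The manual conversion steps can be mirrored in a few lines of Python. This is a sketch restricted to positive integers that fit in the 24-bit mantissa, so no rounding is needed; `encode_positive_int` is a hypothetical helper name.

```python
def encode_positive_int(n):
    """Encode a positive integer below 2**24 as a single precision word."""
    assert 0 < n < 2**24              # fits in 24 mantissa bits, so no rounding
    exponent = n.bit_length() - 1     # position of the leading 1 (6 for 100)
    e = exponent + 127                # biased exponent
    # drop the hidden leading 1 and left-align the remaining bits in 23 bits
    f = (n - (1 << exponent)) << (23 - exponent)
    s = 0                             # positive
    return (s << 31) | (e << 23) | f

print(hex(encode_positive_int(100)))  # 0x42c80000
```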
Another example
 Representation for -175
 175 = 128 + 32 + 8 + 4 + 2 +1 = 1010 1111
 Or 1.0101111 x 2^7
 S=1
 Exponent is 7 +127 = 134 = 1000 0110
 Fractional part f = 0101111
 Representation 1100 0011 0010 1111 0000 ….
 Or in Hex $C32F 0000
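A hand conversion like this one can be cross-checked against the machine's own single precision format. The sketch below uses Python's standard `struct` module; `single_hex` is a name chosen for this example.

```python
import struct

def single_hex(x):
    """Hex string of x stored as a big-endian IEEE single."""
    return struct.pack(">f", x).hex().upper()

print(single_hex(-175.0))   # C32F0000
```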
Converting back
 Convert $C32F 0000 into decimal
 Extract components from
 1100 0011 0010 1111
 S=1
 Exponent = 1000 0110 = 128+4+2 = 134
 Unbias: 134 – 127 = 7
 f = 0101111 so mantissa is 1.0101111
 Adjust by exponent: 1010 1111 (move binary point 7 places to the right)
 Or 128+32+15 = 175
 Sign is negative so -175
Another example
 Convert $41C8 0000 to decimal
 0100 0001 1100 1000 0000 ….
 S is 0 so positive number
 Exponent 1000 0011 = 128+3 = 131, and 131 – 127 = 4
 f = 1001 so mantissa is 1.1001
 Moving the binary point 4 positions to the right gives 11001 as the final number, or decimal 25
Arithmetic with floating point numbers
 Add op1 $42C8 0000 and op2 $41C8 0000
 First divide into component parts
 Op1 $42C8 0000 =0100 0010 1100 1000 0000 ….
 S=0
 E = 1000 0101 = 133 – 127 = 6
 Mop1 = 1.10010000…
 Op2 $41C8 0000 =0100 0001 1100 1000 0000 ….
 S=0
 E = 1000 0011 = 131 – 127 = 4
 Mop2 = 1.10010000…
Now add the mantissas
 But first align the mantissas
 Op1 1.1001000….
 Op2 1.1001000…. has the smaller exponent and needs to be aligned
 Exponent difference between op1 and op2 is 6 – 4 = 2
 So shift the op2 mantissa right by 2 binary places
 Op2 becomes 0.0110010000…
Add
 Add op1 mantissa with the aligned op2 mantissa
   1.1001000000…
 + 0.0110010000…
 = 1.1111010000…
 Result exponent is 6
 Value is 1111101 or 64+32+16+8+4+1 = 125
 Values added were 100 and 25
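The align-then-add procedure can be sketched with integer mantissas. This is a simplified illustration, not the adder design itself: it handles only positive normalized operands, truncates the bits shifted out during alignment, and does no rounding. The function name and the 24-bit mantissa convention (hidden 1 plus 23 fraction bits) are assumptions for this example.

```python
def add_aligned(e1, m1, e2, m2):
    """Add two positive numbers given as (unbiased exponent, 24-bit mantissa 1.f)."""
    if e1 < e2:                       # make operand 1 the one with the larger exponent
        e1, m1, e2, m2 = e2, m2, e1, m1
    m2_aligned = m2 >> (e1 - e2)      # shift the smaller operand right to align
    total = m1 + m2_aligned
    exponent = e1
    if total >> 24:                   # sum reached 2.0 or more: renormalize
        total >>= 1
        exponent += 1
    return exponent, total

# 100 = 1.1001 x 2^6 and 25 = 1.1001 x 2^4; both mantissas are 0xC80000
exp, mant = add_aligned(6, 0xC80000, 4, 0xC80000)
print(exp, hex(mant))                 # 6 0xfa0000, i.e. 1.111101 x 2^6
print(mant >> (23 - exp))             # 125
```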
Constructing Result Value
 Sign 0
 Exponent 6, stored as E = 6 + 127 = 133 = 1000 0101
 Mantissa of Result 1.1111010000
 Fractional Part 1111010000….
 Constructed Value
 s = 0, e = 1000 0101, f = 111 1010 0000 0000 0000 0000
 Regrouped in 4-bit groups: 0100 0010 1111 1010 0000 0000 0000 0000
 $42FA 0000 (125)
Floating point representation of 125
 Positive so s is 0
 Exponent is 6 + 127 = 133 = 1000 0101
 Fractional part from mantissa of 1.111101 is 111101
 Constructed value
 0 1000 0101 111101 00000000000000000
 $42FA 0000
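Assembling the word from its three fields is a matter of shifting each field into position. A minimal sketch (`pack_single` is a hypothetical helper name):

```python
def pack_single(s, e, f):
    """Assemble sign (1 bit), biased exponent (8 bits) and fraction (23 bits)."""
    assert s in (0, 1) and 0 <= e <= 255 and 0 <= f < (1 << 23)
    return (s << 31) | (e << 23) | f

f = 0b111101 << 17                       # fraction bits 111101 followed by 17 zeros
print(f"{pack_single(0, 133, f):08X}")   # 42FA0000
```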
Multiplication example
 Multiply op1 $42C8 0000 & op2 $41C8 0000
 First divide into component parts
 Op1 $42C8 0000 =0100 0010 1100 1000 0000 ….
 S=0
 E = 1000 0101 = 133 – 127 = 6
 Mop1 = 1.10010000…
 Op2 $41C8 0000 =0100 0001 1100 1000 0000 ….
 S=0
 E = 1000 0011 = 131 – 127 = 4
 Mop2 = 1.10010000…
Multiplication basics
 Base 10 example
 3 x 10^2 * 1.1 x 10^2 = 3.3 x 10^4
 Have 2 numbers A x 2^ea and B x 2^eb
 Multiply and get
 result = A*B x 2^(ea+eb)
So here
 The signs of both operands are +, so the result is +
 Exponent addition
 Both exponents are biased as stored
 If you add the stored binary exponents you need to subtract the extra bias of 127
 Or, using pencil and paper (or PowerPoint), you can just add the unbiased exponent of one operand to the biased exponent of the other
 Here we have 133 + 4 = 137 = 1000 1001
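Both ways of forming the product exponent give the same stored value. A small sketch for single precision (the names `BIAS` and `product_biased_exponent` are assumptions for this example):

```python
BIAS = 127   # single precision exponent bias

def product_biased_exponent(ea_stored, eb_stored):
    """Biased exponent of a product, before any normalization of the mantissa."""
    # each stored exponent carries one bias, so the sum carries two: remove one
    return ea_stored + eb_stored - BIAS

e = product_biased_exponent(133, 131)
print(e, format(e, "08b"))        # 137 10001001
print(133 + (131 - BIAS))         # 137, biased exponent plus unbiased exponent
```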
The mantissas
 Do a binary multiplication
        1.1001
      x 1.1001
         11001
      11001000
     110010000    and add
    1001110001
 Adjusting for the binary point (4 + 4 = 8 fractional places) have 10.01110001
Final result
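 Biased exponent is 137, so the unbiased exponent is 137 – 127 = 10
 Mantissa is 10.01110001 (not yet normalized; the stored form would be 1.001110001 x 2^11, biased exponent 138)
 Adjusted for exponent (move binary point 10 places to the right): 1001 1100 0100
 Value is 2048+256+128+64+4
 Or 2304+128+68 = 2432 + 68 = 2500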
 And we were multiplying 100 * 25
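The multiplication steps can be sketched with fixed-point integers. This is an illustration under simplifying assumptions (positive operands, exact product, no rounding), and the function name and argument layout are chosen for this example only. It also shows the normalization step that brings the mantissa back into the form 1.xxx.

```python
def multiply(ea, ma, eb, mb, frac_bits):
    """Multiply (ma x 2^ea) by (mb x 2^eb); mantissas have `frac_bits` fraction bits."""
    product = ma * mb                 # the product has 2 * frac_bits fraction bits
    frac = 2 * frac_bits
    exponent = ea + eb                # unbiased exponents simply add
    if product >> (frac + 1):         # mantissa is 2.0 or more: normalize
        frac += 1                     # same bits, binary point one place further left
        exponent += 1
    return exponent, product, frac

# 1.1001 x 2^6 (100) times 1.1001 x 2^4 (25)
exponent, product, frac = multiply(6, 0b11001, 4, 0b11001, 4)
print(exponent, bin(product), frac)   # 11 0b1001110001 9 -> 1.001110001 x 2^11
print((product << exponent) >> frac)  # 2500
```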
Specification of a FPA
 Floating Point Add/Subtract Unit
 Specification
 Inputs in IEEE 754 Double Precision
 Must perform both addition and subtraction
 Must handle the full floating point standard
 Normalized numbers
 Not-a-Number values – NaNs
 +/- Infinity
 Denormalized numbers
Specifications continued
 Result will be an IEEE 754 Double Precision representation
 Unit will correctly handle the invalid operation of adding +∞ and -∞, which gives NaN per the standard
 Unit latches its inputs into registers from parallel 64-bit data buses
 There is a separate signal line that indicates the operation, add or subtract
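The invalid-operation case can be observed in software. The sketch below uses Python floats, assumed here to be IEEE doubles as on common platforms; it only illustrates the expected result value, not the unit's flag outputs.

```python
import math

inf = float("inf")
result = inf + (-inf)        # invalid operation: +infinity plus -infinity
print(result)                # nan
print(math.isnan(result))    # True
print(inf + inf)             # inf: same-sign infinities are a valid addition
```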
Specifications continued
 Outputs
 The correctly represented result
 Flags that are output are
 Zero result
 Overflow to infinity from normalized numbers as inputs
 NaN result
 Overshift (result is the larger of the two operands)
 Denormalized result
 Inexact (result was rounded)
 Invalid operation for addition
High level block diagram
 Basic architecture interface
 Data – 64 bit A,B,& C Busses
 Control signals – Latch, Add/Sub, Asel, Drive
 Condition Flags Output – 7 Flag signals
 Clocks – Phi1 and Phi2 (a 2 phase clocked architecture)
[Figure: block diagram of the Floating Point Adder Unit, labeled with Abus, Bbus, Add/Sub, Latch, Phi1, Phi2, Asel, Drive, Cbus and Flags]
Start the VHDL
 The entity interface
 In the next lecture