
IEEE Floating Point

The IEEE Floating Point Standard and execution units for it

1/8/2007 - L24 IEEE Floating Point Basics - Copyright 2006 - Joanne DeGroat, ECE, OSU
Lecture overview
 The standard
 Floating Point Basics
 A floating point adder design

The floating point standard
 Single Precision
   Fields: s (1 bit) | e (8 bits) | f (23 bits)

 Value v of the bits stored in the representation is:
 If e = 255 and f /= 0, then v is NaN regardless of s
 If e = 255 and f = 0, then \(v = (-1)^s \, \infty\)
 If 0 < e < 255, then \(v = (-1)^s \, 2^{e-127} \, (1.f)\) – normalized number
 If e = 0 and f /= 0, then \(v = (-1)^s \, 2^{-126} \, (0.f)\)
 Denormalized numbers – allow for graceful underflow
 If e = 0 and f = 0, then \(v = (-1)^s \, 0\) (zero)
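The case analysis above can be sketched in Python. This is only an illustration under stated assumptions: the function name `decode_single` is hypothetical (it is not from the lecture), and the 32-bit pattern is passed in as an unsigned integer.

```python
import math

def decode_single(bits: int) -> float:
    """Decode a 32-bit single-precision pattern using the rules on the slide."""
    s = (bits >> 31) & 0x1        # sign bit
    e = (bits >> 23) & 0xFF       # 8-bit biased exponent
    f = bits & 0x7FFFFF           # 23-bit stored fraction
    sign = -1.0 if s else 1.0
    if e == 255:
        # all-ones exponent: NaN if f /= 0, otherwise signed infinity
        return math.nan if f != 0 else sign * math.inf
    if e == 0:
        # zero (f = 0) or denormalized number: 0.f x 2^-126
        return sign * (f / 2**23) * 2.0**-126
    # normalized number: 1.f x 2^(e-127)
    return sign * (1 + f / 2**23) * 2.0**(e - 127)

print(decode_single(0x42C80000))  # 100.0
print(decode_single(0xC32F0000))  # -175.0
```

The two patterns printed here are the ones worked by hand in the conversion examples later in the lecture.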
The floating point standard
 Double Precision
   Fields: s (1 bit) | e (11 bits) | f (52 bits)

 Value v of the bits in the word representation is:
 If e = 2047 and f /= 0, then v is NaN regardless of s
 If e = 2047 and f = 0, then \(v = (-1)^s \, \infty\)
 If 0 < e < 2047, then \(v = (-1)^s \, 2^{e-1023} \, (1.f)\) – normalized number
 If e = 0 and f /= 0, then \(v = (-1)^s \, 2^{-1022} \, (0.f)\)
 Denormalized numbers – allow for graceful underflow
 If e = 0 and f = 0, then \(v = (-1)^s \, 0\) (zero)
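Python floats are IEEE 754 doubles on common platforms, so the double-precision fields can be inspected directly. The sketch below assumes that, and uses a hypothetical helper name `fields_double`; it goes from a value to its s, e, and f fields rather than the other way round.

```python
import struct

def fields_double(x: float) -> tuple:
    """Split a Python float (assumed to be an IEEE 754 double) into (s, e, f)."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))  # raw 64-bit pattern
    s = bits >> 63                    # 1 sign bit
    e = (bits >> 52) & 0x7FF          # 11-bit biased exponent
    f = bits & ((1 << 52) - 1)        # 52-bit stored fraction
    return s, e, f

s, e, f = fields_double(100.0)
print(s, e, hex(f))   # 0 1029 0x9000000000000
```

Here 1029 = 6 + 1023, and the fraction starts with the same 1001 bits as the single-precision encoding of 100.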
The floating point standard
 Notes on single and double precision
 The leading 1 of the fractional part is not stored for normalized numbers
 Representation allows for +0 and -0, indicating the direction from which 0 was reached (a distinction that can matter when the result was rounded)
 Denormalized numbers allow graceful underflow towards 0
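Both notes can be observed with ordinary Python floats. This is a small sketch assuming CPython on hardware with IEEE 754 doubles, default round-to-nearest, and no flush-to-zero mode; the variable names are illustrative only.

```python
import math
import sys

pos_zero, neg_zero = 0.0, -0.0
print(pos_zero == neg_zero)            # True: the two zeros compare equal
print(math.copysign(1.0, neg_zero))    # -1.0: but the sign bit is still there

smallest_normal = sys.float_info.min   # 2^-1022, smallest normalized double
smallest_denormal = 5e-324             # 2^-1074, smallest denormalized double
print(smallest_normal / 2 > 0.0)       # True: the result is denormalized, not 0
print(smallest_denormal / 2)           # 0.0: underflow finally reaches zero
```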

Conversion Examples
 Converting from base 10 to the representation
 Single precision example
 Convert \(100_{10}\)
 Step 1 – convert to binary: 0110 0100
   128  64  32  16   8   4   2   1
     0   1   1   0   0   1   0   0

 Step 2 – write the binary value in the form 1.xxx:
 0110 0100 = \(1.100100 \times 2^6\)

Conversion Example Continued
 \(1.1001 \times 2^6\) is binary for 100
 Thus the exponent is 6
 Biased exponent will be 6 + 127 = 133 = 1000 0101
 Sign will be 0 for positive
 Stored fractional part f will be 1001 (followed by zeros)
 Thus we have
   s  e          f
   0  1000 0101  1001 0000 ….
 Regrouped into 4-bit digits: 0100 0010 1100 1000 0000 ….
   4 2 C 8 0 0 0 0 in hexadecimal
 $42C8 0000 is the representation for 100
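The hand conversion can be mirrored step by step in Python. The sketch below is deliberately limited: `encode_single_int` is a hypothetical helper that handles only nonzero integers small enough to fit exactly in the 24-bit mantissa, so no rounding is needed.

```python
def encode_single_int(n: int) -> int:
    """Encode a nonzero integer with |n| < 2**24 as a single-precision pattern."""
    assert n != 0 and abs(n) < 2**24, "sketch handles only small nonzero integers"
    s = 1 if n < 0 else 0
    mag = abs(n)
    exp = mag.bit_length() - 1              # position of the leading 1 (6 for 100)
    f = (mag - (1 << exp)) << (23 - exp)    # drop the hidden 1, left-align in 23 bits
    e = exp + 127                           # add the bias
    return (s << 31) | (e << 23) | f

print(f"${encode_single_int(100):08X}")    # $42C80000
```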
Another example
 Representation for -175
 175 = 128 + 32 + 8 + 4 + 2 + 1 = 1010 1111
 Or \(1.0101111 \times 2^7\)
 S = 1
 Exponent is 7 + 127 = 134 = 1000 0110
 Fractional part f = 0101111
 Representation 1100 0011 0010 1111 0000 ….
 Or in hex $C32F 0000
Converting back
 Convert $C32F 0000 into decimal
 Extract components from
 1100 0011 0010 1111 0000 ….
 S = 1
 Exponent = 1000 0110 = 128 + 4 + 2 = 134
 Unbias: 134 – 127 = 7
 f = 0101111, so the mantissa is 1.0101111
 Adjust by exponent: 1010 1111 (move the binary point 7 places right)
 Or 128 + 32 + 15 = 175
 Sign is negative, so -175
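The "move the binary point" step can be made concrete with integer arithmetic. This sketch assumes a normalized input (0 < e < 255); the variable names are illustrative and not part of the lecture.

```python
bits = 0xC32F0000

s = bits >> 31                  # 1, so the number is negative
e = (bits >> 23) & 0xFF         # 134
f = bits & 0x7FFFFF             # stored fraction: 0101111 followed by zeros

mantissa = (1 << 23) | f        # restore the hidden 1: 1.f as a 24-bit integer
exponent = e - 127              # unbias: 7

# As an integer, 'mantissa' has its binary point 23 places from the right.
# Moving the point right by 'exponent' places leaves 23 - exponent fraction bits.
value = mantissa / 2 ** (23 - exponent)
if s:
    value = -value
print(value)                    # -175.0
```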
Another example
 Convert $41C8 0000 to decimal
 0100 0001 1100 1000 0000 ….
 S is 0, so the number is positive
 Exponent 1000 0011 = 128 + 2 + 1 = 131, and 131 – 127 = 4
 f = 1001, so the mantissa is 1.1001
 Moving the binary point 4 places gives 11001, or decimal 25
Arithmetic with floating point numbers
 Add op1 $42C8 0000 and op2 $41C8 0000
 First divide into component parts
 Op1 $42C8 0000 = 0100 0010 1100 1000 0000 ….
 S = 0
 E = 1000 0101 = 133, and 133 – 127 = 6
 M_op1 = 1.10010000…
 Op2 $41C8 0000 = 0100 0001 1100 1000 0000 ….
 S = 0
 E = 1000 0011 = 131, and 131 – 127 = 4
 M_op2 = 1.10010000…

Now add the mantissas
 But first align the mantissas
 Op1 1.1001000…. (exponent 6)
 Op2 1.1001000…. (exponent 4), which is the smaller number and needs to be aligned
 Exponent difference between op1 and op2 is 2
 So shift op2 right by 2 binary places
 Op2 becomes 0.0110010000…

Add
 Add op1 mantissa to the aligned op2 mantissa
     1.1001000000…
   + 0.0110010000…
   = 1.1111010000…
 Result exponent is 6
 Value is 1111101, or 64 + 32 + 16 + 8 + 4 + 1 = 125
 Values added were 100 and 25
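The align-then-add procedure can be sketched on the raw bit patterns. This is a simplified illustration, not the adder design of the lecture: the hypothetical `add_positive_singles` assumes both operands are positive and normalized, truncates bits shifted out during alignment (no rounding), and does not check for exponent overflow.

```python
def add_positive_singles(a: int, b: int) -> int:
    """Add two positive, normalized single-precision patterns (truncating)."""
    ea, eb = (a >> 23) & 0xFF, (b >> 23) & 0xFF
    ma = (1 << 23) | (a & 0x7FFFFF)     # 1.f as a 24-bit integer
    mb = (1 << 23) | (b & 0x7FFFFF)
    if ea < eb:                          # make 'a' the operand with the larger exponent
        ea, eb, ma, mb = eb, ea, mb, ma
    mb >>= ea - eb                       # align: shift the smaller operand right
    total = ma + mb
    if total >> 24:                      # carry past the 1.xxx position: renormalize
        total >>= 1
        ea += 1
    return (ea << 23) | (total & 0x7FFFFF)

print(f"${add_positive_singles(0x42C80000, 0x41C80000):08X}")   # $42FA0000
```

The renormalization branch is not exercised by 100 + 25, but it is needed for sums such as 100 + 100, where the mantissa sum reaches 2 or more.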
Constructing Result Value
 Sign 0
 Exponent 6, stored as E = 6 + 127 = 133 = 1000 0101
 Mantissa of result 1.1111010000
 Fractional part 1111010000….

 Constructed Value
 0 100 0010 1 111 1010 0000 0000 0000 0000
 $4 2 F A 0 0 0 0 (125)
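Assembling the three fields into one word is a pair of shifts and ORs. In this sketch `pack_single` is a hypothetical helper; it takes the leading fraction bits as a string and pads them with zeros to 23 bits, as the "…." on the slide implies.

```python
def pack_single(s: int, e: int, f_bits: str) -> int:
    """Assemble sign, biased exponent, and leading fraction bits into 32 bits."""
    f = int(f_bits.ljust(23, "0"), 2)    # pad the fraction with trailing zeros
    return (s << 31) | (e << 23) | f

word = pack_single(0, 6 + 127, "111101")
print(f"${word:08X}")   # $42FA0000
```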
Floating point representation of 125
 Positive so s is 0
 Exponent is 6 + 127 = 133 = 1000 0101
 Fractional part from mantissa of
 1.111101 or 111101
 Constructed value
 0 1000 0101 111101 00000000000000000
 $42FA 0000

Multiplication example
 Multiply op1 $42C8 0000 and op2 $41C8 0000
 First divide into component parts
 Op1 $42C8 0000 = 0100 0010 1100 1000 0000 ….
 S = 0
 E = 1000 0101 = 133, and 133 – 127 = 6
 M_op1 = 1.10010000…
 Op2 $41C8 0000 = 0100 0001 1100 1000 0000 ….
 S = 0
 E = 1000 0011 = 131, and 131 – 127 = 4
 M_op2 = 1.10010000…

Multiplication basics
 Base 10 example
 \((3 \times 10^2)(1.1 \times 10^2) = 3.3 \times 10^4\)
 Have 2 numbers \(A \times 2^{ea}\) and \(B \times 2^{eb}\)
 Multiply and get
 result = \(A \cdot B \times 2^{ea+eb}\)

So here
 Both signs are +, so the result is +
 Exponent addition
 Both exponents are biased as stored
 If you add the stored binary exponents you need to subtract the extra bias of 127
 Or, using pencil and paper (or PowerPoint), you can just add the unbiased exponent of one operand to the other biased exponent
 Here we have 133 + 4 = 137 = 1000 1001
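The bias bookkeeping is a one-line computation. The sketch below uses a hypothetical function name and covers only the exponent; any adjustment from normalizing the mantissa product comes afterwards.

```python
BIAS = 127  # single-precision exponent bias

def product_biased_exponent(e1: int, e2: int) -> int:
    """Biased exponent of a product, before any mantissa normalization."""
    # Each stored exponent carries one bias; the result should carry only one.
    return e1 + e2 - BIAS

print(product_biased_exponent(133, 131))            # 137
print(f"{product_biased_exponent(133, 131):08b}")   # 10001001
```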
The mantissas
 Do a binary multiplication
        1.1001
      x 1.1001
      --------
         11001
      11001
     11001         and add
    ----------
    1001110001
 Adjusting for the binary point (4 + 4 = 8 fraction bits) have 10.01110001
Final result
 Biased exponent is 137, so the true exponent is 137 – 127 = 10
 Mantissa is 10.01110001
 Adjusted for exponent: 1001 1100 0100 (binary point moved 10 places right)
 Value is 2048 + 256 + 128 + 64 + 4
 Or 2304 + 128 + 68 = 2432 + 68 = 2500
 And we were multiplying 100 * 25
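To store this product, the mantissa 10.01110001 must first be normalized to 1.001110001, which raises the exponent from 10 to 11 (biased 138). The sketch below shows that step on the raw patterns. It rests on several assumptions: `mul_positive_singles` is a hypothetical helper for positive, normalized operands only, it truncates low product bits instead of rounding, and it does not check for exponent overflow or underflow.

```python
def mul_positive_singles(a: int, b: int) -> int:
    """Multiply two positive, normalized single-precision patterns (truncating)."""
    ea, eb = (a >> 23) & 0xFF, (b >> 23) & 0xFF
    ma = (1 << 23) | (a & 0x7FFFFF)      # 1.f with 23 fraction bits
    mb = (1 << 23) | (b & 0x7FFFFF)
    e = ea + eb - 127                     # remove the extra bias
    prod = ma * mb                        # 46 fraction bits, value in [1, 4)
    if prod >> 47:                        # product is 1x.xxx: normalize
        prod >>= 1
        e += 1
    f = (prod >> 23) & 0x7FFFFF           # drop hidden 1 and low bits (no rounding)
    return (e << 23) | f

print(f"${mul_positive_singles(0x42C80000, 0x41C80000):08X}")   # $451C4000
```

$451C 4000 has s = 0, e = 1000 1010 = 138, and f = 0011 1000 1 followed by zeros, which is 2500.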

Specification of an FPA
 Floating Point Add/Subtract Unit
 Specification
 Inputs in IEEE 754 Double Precision
 Must perform both addition and subtraction
 Must handle the full floating point standard
 Normalized numbers
 Not-a-Number values (NaNs)
 +/- Infinity
 Denormalized numbers
Specifications continued
 Result will be an IEEE 754 Double Precision representation
 Unit will correctly handle the invalid operation of adding +∞ and -∞, which yields NaN per the standard
 Unit latches its inputs into registers from parallel 64-bit data busses.
 There is a separate signal line that indicates the operation, add or subtract
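The invalid-operation case can be observed with Python floats, which follow IEEE 754 double-precision arithmetic on common platforms. This is a reference check of the expected behaviour, not a model of the unit itself.

```python
import math

pos_inf, neg_inf = math.inf, -math.inf
print(math.isnan(pos_inf + neg_inf))   # True: (+inf) + (-inf) is invalid
print(math.isnan(pos_inf - pos_inf))   # True: same case reached through subtraction
print(pos_inf + pos_inf)               # inf: like-signed infinities are fine
```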

Specifications continued
 Outputs
 The correctly represented result
 Flags that are output are
 Zero result
 Overflow to infinity from normalized numbers as inputs
 NaN result
 Overshift (result is the larger of the two operands)
 Denormalized result
 Inexact (result was rounded)
 Invalid operation for addition
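As a rough software model of the flag outputs, the seven flags can be written as a bit set. Everything named here is an assumption for illustration: the `FpaFlag` names and the `result_flags` helper are hypothetical, and only the three flags that can be read off the 64-bit result pattern alone are computed. Overflow, overshift, inexact, and invalid depend on the operands and the operation, so they are left unset.

```python
from enum import IntFlag, auto

class FpaFlag(IntFlag):
    """Hypothetical names for the seven flag outputs listed on the slide."""
    ZERO = auto()
    OVERFLOW = auto()
    NAN = auto()
    OVERSHIFT = auto()
    DENORMAL = auto()
    INEXACT = auto()
    INVALID = auto()

def result_flags(bits: int) -> FpaFlag:
    """Flags that depend only on the double-precision result pattern."""
    e = (bits >> 52) & 0x7FF              # 11-bit biased exponent
    f = bits & ((1 << 52) - 1)            # 52-bit stored fraction
    flags = FpaFlag(0)
    if e == 0 and f == 0:
        flags |= FpaFlag.ZERO
    if e == 0 and f != 0:
        flags |= FpaFlag.DENORMAL
    if e == 2047 and f != 0:
        flags |= FpaFlag.NAN
    return flags

print(result_flags(0x0000000000000001) == FpaFlag.DENORMAL)   # True
```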

High level block diagram
 Basic architecture interface
 Data – 64-bit A, B, & C busses
 Control signals – Latch, Add/Sub, Asel, Drive
 Condition flags output – 7 flag signals
 Clocks – Phi1 and Phi2 (a 2-phase clocked architecture)
[Block diagram: Floating Point Adder Unit. Inputs: Abus, Bbus, Add/Sub, Latch, Phi1, Phi2, Asel, Drive. Outputs: Cbus, Flags.]

Start the VHDL
 The entity interface

 In the next lecture
