
HCMC UNIVERSITY OF TECHNOLOGY AND EDUCATION

FACULTY OF HIGH QUALITY TRAINING


COMPUTER ENGINEERING TECHNOLOGY

FLOATING POINT ALU



Do Minh Quan 19119043


Nguyen Tran Duy Khanh 19119063
Huynh Ngoc Khanh 19119

HO CHI MINH CITY – 10/2021

Chapter 1: Introduction
Floating Point ALU
When a CPU executes a program that calls for a floating-point (FP) operation, there
are three ways in which it can carry out the operation. First, it may call a floating-point unit
emulator, which is a floating-point library that uses a series of simple fixed-point arithmetic
operations that can run on the integer ALU. These emulators save the added hardware
cost of an FPU but are significantly slower. Second, it may use an add-on FPU that is entirely
separate from the CPU and is typically sold as an optional add-on, purchased only
when it is needed to speed up math-intensive operations. Third, it may use an integrated FPU
present in the system. The FPU designed by us is a single-precision, IEEE 754 compliant
integrated unit. It handles not only basic floating-point operations such as addition, subtraction,
multiplication and division but also operations such as shifting and logical operations.

Floating Point ALU model


In this report, we review a single-precision ALU based on the IEEE 754 standard.
Floating-point ALUs are used for high-precision computing. This ALU uses 32-bit numbers,
which is a common computer word length. The numbers are represented in the IEEE 754
standard, which is widely used in floating-point arithmetic. The ALU can perform four
arithmetic operations: addition, subtraction, multiplication and division. Each module is connected
to a mux that checks for exception flags and determines whether an overflow or
underflow has occurred.

Selector    Operation
00          Addition
01          Subtraction
10          Multiplication
11          Division
Table of Operation
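
The selector table can be illustrated with a small behavioural model. This is only a sketch in Python, not the hardware design reviewed in this report: the names `to_f32`, `OPS` and `alu` are hypothetical, rounding to single precision is delegated to the standard `struct` module, and the exception flags are not modelled (for instance, dividing by zero raises a Python error here instead of setting a flag).

```python
import struct

def to_f32(x):
    """Round a Python float to the nearest single-precision value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

# Selector encoding taken from the table of operations.
OPS = {
    0b00: lambda a, b: a + b,   # Addition
    0b01: lambda a, b: a - b,   # Subtraction
    0b10: lambda a, b: a * b,   # Multiplication
    0b11: lambda a, b: a / b,   # Division
}

def alu(sel, a, b):
    """Apply the operation chosen by the 2-bit selector to two operands."""
    return to_f32(OPS[sel](to_f32(a), to_f32(b)))

print(alu(0b00, 29.625, 94.5))  # 124.125
print(alu(0b01, 94.5, 29.625))  # 64.875
```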

Single Precision IEEE 754 Format
Every floating-point number is composed of three components:
- Sign: indicates the sign of the number (0 for positive, 1 for negative).
- Mantissa: sets the value (the significant digits) of the number.
- Exponent: holds the biased power of the base; in single precision the exponent of a normalized number ranges from 1 to 254.
If the single-precision format is used, the bits are divided as follows:
- The first bit (bit 31) sets the sign (S) of the number (0 for positive, 1 for negative).
- The next 8 bits (bits 30 to 23) represent the exponent (E).
- The remaining 23 bits (bits 22 to 0) store the mantissa.

Single precision floating point
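
The bit layout can be made concrete with a few shifts and masks. The following is a minimal sketch, assuming the 32-bit word is held in a Python integer; the function name `split_fields` is hypothetical and chosen only for illustration.

```python
def split_fields(word):
    """Split a 32-bit single-precision word into (sign, exponent, mantissa)."""
    sign = (word >> 31) & 0x1        # bit 31
    exponent = (word >> 23) & 0xFF   # bits 30 to 23 (biased exponent)
    mantissa = word & 0x7FFFFF       # bits 22 to 0
    return sign, exponent, mantissa

# Sample word: sign = 1, exponent = 10000010, mantissa = 1010110 followed by zeros
s, e, m = split_fields(0b11000001010101100000000000000000)
print(s, e, format(m, "023b"))  # 1 130 10101100000000000000000
```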

Converting an IEEE 754 floating point binary into decimal

For single precision, the bias is 127.


We have the formula to convert an IEEE 754 floating-point binary number to decimal:
$X = (-1)^{S} \times 1.m \times 2^{e-127}$
- S: the sign bit, which is the most significant bit.
- m: the mantissa, which is all the bits after the exponent; 1.m means 1 + m, with m read as a binary fraction.
- e: the exponent, which is the next 8 bits after the sign bit.
Example: Converting 11000001010101100000000000000000 into decimal.
Step 1: Determine the sign bit
In this case, the sign bit is equal to 1, which means this is a negative number.
Step 2: Calculate e
After the sign bit, the next 8 bits are the exponent; in this case the exponent is 10000010.
10000010 in binary is 130 in decimal, so we have e = 130.
Step 3: Calculate m
The bits remaining after the sign bit and the exponent are the mantissa. In this case we have
10101100000000000000000. Read as binary digits after the point, this converts to
decimal as $2^{-1} + 2^{-3} + 2^{-5} + 2^{-6} = 0.671875$.
So our m is 0.671875.
Step 4: Convert the 32-bit floating-point number to decimal
$X = (-1)^{S} \times (1 + m) \times 2^{e-127}$
$X = (-1)^{1} \times 1.671875 \times 2^{130-127}$
$X = -13.375$
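
The four steps can be checked with a short Python sketch. It assumes a normalized input (exponent field between 1 and 254); the function name `decode_normalized` is hypothetical, and the standard `struct` module is used only as an independent cross-check.

```python
import struct

def decode_normalized(word):
    """Decode a normalized single-precision word with X = (-1)^S * (1 + m) * 2^(e - 127)."""
    sign = (word >> 31) & 0x1            # Step 1: sign bit
    e = (word >> 23) & 0xFF              # Step 2: biased exponent
    m = (word & 0x7FFFFF) / 2**23        # Step 3: mantissa as a binary fraction
    return (-1) ** sign * (1 + m) * 2.0 ** (e - 127)  # Step 4

word = 0b11000001010101100000000000000000
print(decode_normalized(word))                          # -13.375
print(struct.unpack(">f", word.to_bytes(4, "big"))[0])  # -13.375
```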
There are a few special cases, called exceptions, that we need to list. When A or B is one
of these cases, the result is determined according to the figure below:

Exceptions of single-precision floating point
The first case: When all the exponent bits are 1 ($11111111_2 = 255$) and all the mantissa
bits are 0, the exponent is out of range, because the exponent in single precision ranges
from 1 to 254. The number is larger than a 32-bit floating-point value can represent, so
this case represents infinity.
The second case: When all the exponent bits are 1 and the mantissa is not all zeros, we
consider the value not a number (NaN).
The third case: When all the exponent bits are 0 and all the mantissa bits are also 0, the
formula would give $1 \times 2^{-127}$; we consider this 0.
The fourth case: When all the exponent bits are 0 and the mantissa bits are not all 0, the
number is still representable, but it is very small (a denormalized number).
The largest finite number single precision can represent is
01111111011111111111111111111111, and the smallest nonzero magnitude is
10000000000000000000000000000001 (shown here with the sign bit set). In some cases,
adding two numbers gives a result larger than a 32-bit floating-point number can hold;
this is what we call overflow. Conversely, when the result is too small to be
represented, that is underflow.
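
The four special cases can be told apart from the exponent and mantissa fields alone. The sketch below is one possible way to express that check in Python; the function name `classify` and the returned labels are hypothetical and are not taken from the reviewed ALU.

```python
def classify(word):
    """Classify a 32-bit single-precision word by its exponent and mantissa fields."""
    e = (word >> 23) & 0xFF
    m = word & 0x7FFFFF
    if e == 0xFF:
        # Exponent all ones: infinity if the mantissa is zero, otherwise not a number.
        return "infinity" if m == 0 else "nan"
    if e == 0:
        # Exponent all zeros: zero if the mantissa is zero, otherwise a very small number.
        return "zero" if m == 0 else "denormalized"
    return "normalized"

print(classify(0b01111111011111111111111111111111))  # normalized (largest finite value)
print(classify(0b10000000000000000000000000000001))  # denormalized
```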

Floating point binary addition


Two 32-bit floating-point numbers can be added by executing the following steps:
Step 1: Make the smaller exponent match the larger exponent.
Step 2: Add the mantissas together.
Step 3: Normalise the result if necessary.

Example: We need to add the two numbers 01000001111011010000000000000000
and 01000010101111010000000000000000.
For A = 01000001111011010000000000000000, converting it first gives 29.625.
For B = 01000010101111010000000000000000, converting it first gives 94.5.
To add A and B together, we first make their exponents equal by making the
smaller exponent match the larger exponent.
We have,
$A = 1.11011010000000000000000 \times 2^{10000011_2 - 127} = 1.11011010000000000000000 \times 2^{4}$
$B = 1.01111010000000000000000 \times 2^{10000101_2 - 127} = 1.01111010000000000000000 \times 2^{6}$
Shifting the mantissa of A to the right 2 times, we have $A = 0.0111011010000000000000000 \times 2^{6}$.
Now we can add the mantissas of A and B together:
  0.0111011010000000000000000
+ 1.0111101000000000000000000
= 1.1111000010000000000000000
So we have $m = 1111000010000000000000000_2 = 2^{-1} + 2^{-2} + 2^{-3} + 2^{-4} + 2^{-9} = 0.939453125$.
$X = (-1)^{0} \times 1.939453125 \times 2^{6} = 124.125$.
So our answer in floating point is 01000010111110000100000000000000.

Comparing this with the result from the reviewed ALU, we see that the answer is correct.
Floating point binary addition
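
The addition steps (align, add, normalise) can be sketched at bit level as follows. This is a simplified Python illustration and not the reviewed ALU's implementation: it assumes both operands are positive and normalized, truncates the bits shifted out instead of rounding, and does not detect overflow. The function name `fp_add_positive` is hypothetical.

```python
def fp_add_positive(a_bits, b_bits):
    """Add two positive, normalized single-precision words (truncating sketch)."""
    ea, eb = (a_bits >> 23) & 0xFF, (b_bits >> 23) & 0xFF
    # Restore the hidden leading 1 to get 24-bit mantissas of the form 1.m
    ma = (a_bits & 0x7FFFFF) | 0x800000
    mb = (b_bits & 0x7FFFFF) | 0x800000
    # Align: shift the mantissa with the smaller exponent to the right
    if ea < eb:
        ma >>= eb - ea
        e = eb
    else:
        mb >>= ea - eb
        e = ea
    # Add the aligned mantissas
    total = ma + mb
    # Normalise: a carry into bit 24 means the sum is 1x.xxx, so shift right once
    if total & 0x1000000:
        total >>= 1
        e += 1
    return (e << 23) | (total & 0x7FFFFF)

a = 0b01000001111011010000000000000000  # 29.625
b = 0b01000010101111010000000000000000  # 94.5
print(format(fp_add_positive(a, b), "032b"))  # 01000010111110000100000000000000
```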
Floating point binary Subtraction

Two 32-bit floating-point numbers can be subtracted by executing the following
steps:
Step 1: Make the smaller exponent match the larger exponent.
Step 2: Since we are subtracting, negate the mantissa of the number being subtracted by inverting all its bits and adding 1 (two's complement).
Step 3: Add the mantissas together.
Step 4: Normalise the result if necessary.
For example: We need to subtract 01000001111011010000000000000000
from 01000010101111010000000000000000.
For A = 01000010101111010000000000000000, converting it first gives 94.5.
For B = 01000001111011010000000000000000, converting it first gives 29.625.
To subtract B from A, we first make their exponents equal by making the smaller
exponent match the larger exponent.
We have,
$A = 1.01111010000000000000000 \times 2^{10000101_2 - 127} = 1.01111010000000000000000 \times 2^{6}$
$B = 1.11011010000000000000000 \times 2^{10000011_2 - 127} = 1.11011010000000000000000 \times 2^{4}$
Shifting the mantissa of B to the right 2 times, we have $B = 0.0111011010000000000000000 \times 2^{6}$.
Now we negate B by inverting all the bits and adding 1 at the least significant bit:
B = 1.10001001011111111111111 + 1 = 1.10001001100000000000000
Now we can add the mantissas of A and B together, discarding the carry out of the top bit:
  1.01111010000000000000000
+ 1.10001001100000000000000
= 1.00000011100000000000000
So we have $m = 00000011100000000000000_2 = 2^{-7} + 2^{-8} + 2^{-9} = 0.013671875$.
$X = (-1)^{0} \times 1.013671875 \times 2^{6} = 64.875$.
So our answer in floating point is 01000010100000011100000000000000.

Comparing this with the result from the reviewed ALU, we see that the answer is correct.
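
The subtraction steps can be sketched in the same bit-level style, using a two's complement negation of the aligned mantissa. This Python illustration makes several simplifying assumptions that are not taken from the reviewed ALU: both operands are positive and normalized, A is greater than or equal to B, shifted-out bits are truncated, and underflow is not detected. The function name `fp_sub_positive` is hypothetical.

```python
def fp_sub_positive(a_bits, b_bits):
    """Compute A - B for positive, normalized words with A >= B (truncating sketch)."""
    mask = 0xFFFFFF  # 24-bit mantissa: hidden bit plus 23 fraction bits
    ea, eb = (a_bits >> 23) & 0xFF, (b_bits >> 23) & 0xFF
    ma = (a_bits & 0x7FFFFF) | 0x800000
    mb = (b_bits & 0x7FFFFF) | 0x800000
    # Step 1: align B to A's exponent (assumes ea >= eb)
    mb >>= ea - eb
    # Step 2: negate B's mantissa by inverting all bits and adding 1
    mb_neg = (~mb + 1) & mask
    # Step 3: add the mantissas and discard the carry out of the top bit
    diff = (ma + mb_neg) & mask
    if diff == 0:
        return 0  # equal operands give +0
    # Step 4: normalise by shifting left until the hidden bit is 1 again
    e = ea
    while not diff & 0x800000:
        diff <<= 1
        e -= 1
    return (e << 23) | (diff & 0x7FFFFF)

a = 0b01000010101111010000000000000000  # 94.5
b = 0b01000001111011010000000000000000  # 29.625
print(format(fp_sub_positive(a, b), "032b"))  # 01000010100000011100000000000000
```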
