
International Journal of Intelligent Information Technology Application, 2009, 2(6):273-278

Algorithm & Design of an Efficient Floating Point ADD/SUB Unit for an Experimental CPU
A. Joshi
The University of the West Indies, Department of Electrical and Computer Engg., St. Augustine, Trinidad and Tobago
ajoshi@eng.uwi.tt

S.L. Lam and Y.Y. Chan
Multimedia University, Faculty of Engg., Cyberjaya, Malaysia

Abstract: An 8-bit CPU is designed at gate level from scratch using a custom-chip approach. The CPU has an 8-bit integer unit and a 16-bit floating-point unit. The instruction set includes shift, logic, integer and floating-point arithmetic instructions. The circuits are optimized by using a more efficient algorithm. The algorithm discussed in this paper was applied to an 8-bit CPU design; however, there is no reason it could not be used in the development of more powerful CPUs. No attempt has yet been made to include any special support or design for parallel MUL/ADD/SUB operations [1][2]. Instead, an attempt has been made to improve the conventional algorithm [6]. Because not all the functional units can be covered here, this paper discusses only the design of the FP ADD/SUB unit, with respect to its algorithm and VHDL implementation. The project was implemented in VHDL and simulated using Altera MAX+PLUS II simulation software, which can map the design onto an Altera CPLD.

Index Terms: CPU, simulation, algorithm, floating-point unit, VHDL.

I. INTRODUCTION

This paper focuses on one functional unit of a 16-bit FPU that is part of a CPU with an 8-bit integer unit. The CPU has 4 x 16-bit FPU registers, 16-bit data and address buses, and a 16-bit program counter. The data path is where most operations are carried out under the direction of the processor's control unit. There are seven functional units, of which three belong to the FPU. This paper discusses one of them, the floating-point add unit. The logic of the algorithm is discussed in detail, together with the implementation block diagram, the VHDL code and the simulation results. The goal is not to address any issues of clock rate or IPC [3]; the main focus is on the improved algorithm and the relevant design.

A. 3 Functional units:

Barrel shifter: This unit uses five control signals: one enable signal, one shift (left/right) signal and three signals that determine the number of shifts. It accepts one operand from the ALU registers.

FP add/subtractor: Its operands, FA and FB, are sourced from the floating-point register file. It requires one control signal to indicate the start and another to select whether addition or subtraction is performed. The result is 16 bits wide.

FP multiplier: As with the FP add unit, the operands are sourced from the floating-point register file, but only one control signal, indicating the start, is required.

B. Specifications of Floating-point add/sub unit:

In short, the unit comprises 2 x 16-bit FP registers for operands, 1 x 16-bit FP register for the final result, 1 x 8-bit register counter, 1 x adder, 1 x 4-to-1 selector, 1 x 2-to-1 selector, 6 x 8-bit registers, 6 x 2-to-1 multiplexers, 3 x 3-to-1 multiplexers, a left barrel shifter, a right barrel shifter, 1 zero counter, 1 x 7-bit register and output signal logic. All simulations of the VHDL code were done using the MAX7000 device family in Altera MAX+PLUS II. Not everything can be discussed in this paper, so only the floating-point add unit is covered: the logic of the algorithm is discussed in detail, and an implementation block diagram of the add unit is shown together with the detailed design of the 16-bit floating-point register. It is worth mentioning that excellent work has been done on improving FP arithmetic [7][8].

C. The Algorithm:

- Initially, the operands are loaded into two temporary registers.
- The two biased exponents are compared.
- The difference is stored.
- The mantissa with the smaller biased exponent is shifted by the difference.

1999-2459/09/$25.00 © 2009 Engineering Technology Press


- The shifted mantissa is then subtracted from the other mantissa and the result is stored.
- The result is rounded up or down.
- The result is normalized and stored.

There are some exceptions to this algorithm:
- If the difference between the two biased exponents is greater than 7, which is the length of the mantissa, the operand with the higher exponent value is stored without going through the remaining steps.
- If the subtraction of the mantissas gives zero, zero is stored as the result and normalization is skipped.

The conventional algorithm uses the following steps:
- Zero checking.
- Significand adjustment.
- Addition/subtraction.
- Normalization.
- Rounding.

D. Rationale:

Compared with the conventional algorithm above, our method obtains the result in a slightly different way. The differences are as follows:
- There is no zero checking on the operands. Since operands with zero value do not occur very often, we expect no significant degradation in the performance of floating-point calculations. Further, one clock cycle is saved for every FADD/FSUB instruction with non-zero operands, and fewer gates are used.
- Our method does not compare the exponents and test the significand for zero on every clock cycle in order to make the exponents equal. Instead, we find the difference between the two exponents and store it (as a positive value). The larger exponent value is stored, and the significand with the smaller exponent is shifted by the difference using a barrel shifter (with the exception that the difference must not be larger than 7).
- The significand is not checked for zero after the signed significands are added. Since most additions do not produce zero, we feel it is unnecessary to introduce an extra cycle just for this check.
- If significand overflow occurs after adding both significands, exponent overflow is not checked immediately. The maximum biased exponent value that can be stored is 1111111 - 1 = 1111110; the value 1111111 indicates an overflow. In the worst case, the maximum value of the biased exponent after being incremented is 1111111. However, since the result is normalized later, which can decrement the biased exponent back into the permissible range, we check for overflow after normalization and rounding.
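The exponent comparison, difference and alignment shift described above can be sketched behaviourally as follows. This is only an illustrative sketch, not the authors' circuit: the entity name fp_align_sketch, the port names, the generic EXP_W and the use of numeric_std are all assumptions, and the handling of sign and of any hidden bit is left out. Only the 7-bit mantissa length and the "difference greater than 7" bypass are taken from the text.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical sketch of the exponent-difference and alignment step.
entity fp_align_sketch is
  generic (
    EXP_W : positive := 7;  -- assumed biased-exponent width
    MAN_W : positive := 7   -- mantissa length quoted in the text
  );
  port (
    exp_a, exp_b : in  unsigned(EXP_W - 1 downto 0);
    man_a, man_b : in  unsigned(MAN_W - 1 downto 0);
    exp_big      : out unsigned(EXP_W - 1 downto 0);  -- larger exponent
    man_big      : out unsigned(MAN_W - 1 downto 0);  -- its mantissa
    man_aligned  : out unsigned(MAN_W - 1 downto 0);  -- other mantissa, shifted
    bypass       : out std_logic                      -- '1' when difference > MAN_W
  );
end entity fp_align_sketch;

architecture rtl of fp_align_sketch is
begin
  process (exp_a, exp_b, man_a, man_b)
    variable diff  : unsigned(EXP_W - 1 downto 0);
    variable small : unsigned(MAN_W - 1 downto 0);
  begin
    -- Keep the larger exponent and form a non-negative difference.
    if exp_a >= exp_b then
      diff    := exp_a - exp_b;
      exp_big <= exp_a;
      man_big <= man_a;
      small   := man_b;
    else
      diff    := exp_b - exp_a;
      exp_big <= exp_b;
      man_big <= man_b;
      small   := man_a;
    end if;

    -- Exception from the text: a difference larger than the mantissa
    -- length means the larger operand is passed through unchanged.
    if diff > MAN_W then
      bypass      <= '1';
      man_aligned <= (others => '0');
    else
      bypass      <= '0';
      man_aligned <= shift_right(small, to_integer(diff));
    end if;
  end process;
end architecture rtl;
```

In this sketch a single subtraction yields the shift amount, so the mantissa is aligned in one barrel-shift operation rather than one bit position per clock cycle, which is the saving the rationale argues for.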

Rounding uses round to nearest: the representable significand value nearest to the result is stored. If the result lies exactly halfway between two representable values, the current least significant bit determines whether to round up or down, so that the stored result is even. For example, if the current least significant bit is zero the result is rounded down, and if it is one the result is rounded up.

The algorithm also allows for exceptions such as exponent underflow/overflow and significand overflow. It is faster, taking a maximum of 8 clock cycles to complete compared with 13 clock cycles for the conventional algorithm. There are different ways to improve performance [4][5]; our approach is different. A small drawback is that the algorithm requires more components, so the block diagram may look complicated and confusing. However, this is far outweighed by the benefit. The overall block diagram, Figure 2, can be found on page 5 of the paper.

II. DESIGN: ADD/SUB UNIT

We tried quite a few different designs, such as the carry look-ahead adder (CLA) and the ripple carry adder (RCA). The CLA provided good speed but was much larger and consumed more power than the RCA. The RCA, on the other hand, is compact but slower than the CLA. Hence we decided to design hybrid adders to take advantage of both. We tried quite a few hybrid adder designs as well; here we discuss the type 1 hybrid adder (HA-1). HA-1 is very fast but high in power consumption, and it is used for FADD/FSUB.

A. VHDL Code and Simulation

Code for CLA (3) with normal carry input:
LIBRARY ieee;
USE ieee.std_logic_1164.all;

ENTITY add_cla3_n IS
  PORT ( a0, a1, a2 : IN  STD_LOGIC;
         b0, b1, b2 : IN  STD_LOGIC;
         ci         : IN  STD_LOGIC;
         o0, o1, o2 : OUT STD_LOGIC;
         co         : OUT STD_LOGIC );
END ENTITY;

ARCHITECTURE a OF add_cla3_n IS
  SIGNAL g0, g1, g2 : STD_LOGIC; -- imm signal for G (generate)
  SIGNAL p0, p1, p2 : STD_LOGIC; -- imm signal for P (propagate)
  SIGNAL c1, c2, c3 : STD_LOGIC; -- imm signal for carry out
BEGIN
  g0 <= a0 AND b0;
  g1 <= a1 AND b1;
  g2 <= a2 AND b2;
  p0 <= a0 OR b0;
  p1 <= a1 OR b1;
  p2 <= a2 OR b2;
  c1 <= g0 OR (p0 AND ci);       -- carry generation for bit 1
  c2 <= g1 OR (p1 AND c1);       -- carry generation for bit 2
  c3 <= g2 OR (p2 AND c2);       -- carry generation for carry out
  o0 <= (a0 XOR b0) XOR ci;      -- sum output bit0
  o1 <= (a1 XOR b1) XOR c1;      -- sum output bit1
  o2 <= (a2 XOR b2) XOR c2;      -- sum output bit2
  co <= c3;                      -- carry output
END a;
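In the CLA (3) listing, each carry is written in terms of the previous one (c2 uses c1, c3 uses c2). Carry look-ahead logic is often written with these terms expanded, so that every carry depends only on the generate and propagate terms and on ci. The following is a minimal sketch of that expanded form, assuming the same definitions of g and p as in the listing; the entity name cla3_carry_flat is hypothetical and is not part of the original design.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical illustration: expanded carry equations for a 3-bit CLA.
entity cla3_carry_flat is
  port (
    g0, g1, g2 : in  std_logic;  -- generate terms  (a AND b)
    p0, p1, p2 : in  std_logic;  -- propagate terms (a OR b)
    ci         : in  std_logic;  -- carry input
    c1, c2, c3 : out std_logic   -- carries into bit 1, bit 2 and carry out
  );
end entity cla3_carry_flat;

architecture rtl of cla3_carry_flat is
begin
  -- Each line is the recursive form c(i+1) = g(i) OR (p(i) AND c(i))
  -- with the earlier carries substituted in.
  c1 <= g0 or (p0 and ci);
  c2 <= g1 or (p1 and g0) or (p1 and p0 and ci);
  c3 <= g2 or (p2 and g1) or (p2 and p1 and g0) or (p2 and p1 and p0 and ci);
end architecture rtl;
```

Both forms compute the same Boolean functions, and a synthesis tool may flatten the recursive form on its own. The expanded form simply makes explicit that no carry waits on another carry.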

Code for Hybrid Adder 1:

LIBRARY ieee;
USE ieee.std_logic_1164.all;

ENTITY add_ha1 IS
  PORT ( a, b : IN  STD_LOGIC_VECTOR (0 to 7);
         ci   : IN  STD_LOGIC;
         o    : OUT STD_LOGIC_VECTOR (0 to 7);
         co   : OUT STD_LOGIC );
END ENTITY;

ARCHITECTURE a OF add_ha1 IS
  COMPONENT add_full1 IS -- declare full adder1
    PORT ( a, b, ci : IN STD_LOGIC;
           o, co    : OUT STD_LOGIC );
  END COMPONENT;
  COMPONENT add_full2 IS -- declare full adder2
    PORT ( a, b, ci : IN STD_LOGIC;
           o, co    : OUT STD_LOGIC );
  END COMPONENT;
  COMPONENT add_cla3_0 IS -- declare cla3 with '0' carry in
    PORT ( a0, a1, a2 : IN STD_LOGIC;
           b0, b1, b2 : IN STD_LOGIC;
           o0, o1, o2 : OUT STD_LOGIC;
           co         : OUT STD_LOGIC );
  END COMPONENT;
  COMPONENT add_cla3_1 IS -- declare cla3 with '1' carry in
    PORT ( a0, a1, a2 : IN STD_LOGIC;
           b0, b1, b2 : IN STD_LOGIC;
           o0, o1, o2 : OUT STD_LOGIC;
           co         : OUT STD_LOGIC );
  END COMPONENT;
  COMPONENT add_cla3_n IS -- declare cla3 with normal carry in
    PORT ( a0, a1, a2 : IN STD_LOGIC;
           b0, b1, b2 : IN STD_LOGIC;
           ci         : IN STD_LOGIC;
           o0, o1, o2 : OUT STD_LOGIC;
           co         : OUT STD_LOGIC );
  END COMPONENT;
  SIGNAL x0, x1, x2, x3, x4, x5, x6, x7 : STD_LOGIC; -- imm signal for XOR2
  SIGNAL s0, s1, s2, s3, s40, s41       : STD_LOGIC; -- imm signal for sum
  SIGNAL s50, s51, s60, s61, s70, s71   : STD_LOGIC; -- imm signal for sum
  SIGNAL c1, c4, c4n, c70, c71, c80, c81 : STD_LOGIC; -- imm signal for carry
BEGIN
  -- XOR2 gates at the B input for ADD/SUB function
  x0 <= b(0) XOR ci;
  x1 <= b(1) XOR ci;
  x2 <= b(2) XOR ci;
  x3 <= b(3) XOR ci;
  x4 <= b(4) XOR ci;
  x5 <= b(5) XOR ci;
  x6 <= b(6) XOR ci;
  x7 <= b(7) XOR ci;
  -- connecting the different adders together in the way shown in
  -- the logic circuit diagram of HA1.
  g0 : add_full1  PORT MAP (a(0), x0, ci, s0, c1);
  g1 : add_cla3_n PORT MAP (a(1), a(2), a(3), x1, x2, x3, c1, s1, s2, s3, c4);
  g2 : add_cla3_0 PORT MAP (a(4), a(5), a(6), x4, x5, x6, s40, s50, s60, c70);
  g3 : add_full2  PORT MAP (a(7), x7, c70, s70, c80);
  g4 : add_cla3_1 PORT MAP (a(4), a(5), a(6), x4, x5, x6, s41, s51, s61, c71);
  g5 : add_full2  PORT MAP (a(7), x7, c71, s71, c81);
  -- The inverted signal from carry out C4.
  c4n <= NOT c4;
  o(0) <= s0;                              -- sum bit0 (from full adder1)
  o(1) <= s1;                              -- sum bit1 (from CLA3)
  o(2) <= s2;                              -- sum bit2 (from CLA3)
  o(3) <= s3;                              -- sum bit3 (from CLA3)
  o(4) <= (s40 AND c4n) OR (s41 AND c4);   -- sum bit4 (from CSA)
  o(5) <= (s50 AND c4n) OR (s51 AND c4);   -- sum bit5 (from CSA)
  o(6) <= (s60 AND c4n) OR (s61 AND c4);   -- sum bit6 (from CSA)
  o(7) <= (s70 AND c4n) OR (s71 AND c4);   -- sum bit7 (from CSA)
  co   <= (c80 AND c4n) OR (c81 AND c4);   -- carry out
END a;

The logic diagram in Figure 3 gives an idea of the circuit of HA-1.
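The add/subtract behaviour that HA-1 implements structurally (B inputs XORed with the control input, which also serves as the carry-in) can be cross-checked against a small behavioural model. The sketch below is such a model with a self-checking testbench. The names addsub8_ref, addsub8_ref_tb and sub are hypothetical, the vectors are declared with the most significant bit on the left (unlike the 0 to 7 ports of add_ha1), and numeric_std is assumed to be available. The operands 100 and 50 are the ones used in the paper's simulation.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical behavioural reference: o = a + b or o = a - b (two's complement).
entity addsub8_ref is
  port (
    a, b : in  unsigned(7 downto 0);
    sub  : in  std_logic;            -- '0' = add, '1' = subtract
    o    : out unsigned(7 downto 0);
    co   : out std_logic
  );
end entity addsub8_ref;

architecture behav of addsub8_ref is
begin
  process (a, b, sub)
    variable bx  : unsigned(7 downto 0);
    variable sum : unsigned(8 downto 0);
  begin
    -- Same idea as the XOR2 gates on the B input: invert b when subtracting.
    if sub = '1' then
      bx := not b;
    else
      bx := b;
    end if;
    sum := resize(a, 9) + resize(bx, 9);
    -- The control input doubles as the carry-in, completing the two's complement.
    if sub = '1' then
      sum := sum + 1;
    end if;
    o  <= sum(7 downto 0);
    co <= sum(8);
  end process;
end architecture behav;

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity addsub8_ref_tb is
end entity addsub8_ref_tb;

architecture sim of addsub8_ref_tb is
  signal a   : unsigned(7 downto 0) := to_unsigned(100, 8);
  signal b   : unsigned(7 downto 0) := to_unsigned(50, 8);
  signal sub : std_logic := '0';
  signal o   : unsigned(7 downto 0);
  signal co  : std_logic;
begin
  dut : entity work.addsub8_ref
    port map (a => a, b => b, sub => sub, o => o, co => co);

  stimulus : process
  begin
    wait for 10 ns;
    -- 100 + 50 = 150 with no carry out
    assert o = to_unsigned(150, 8) and co = '0'
      report "addition mismatch" severity error;
    sub <= '1';
    wait for 10 ns;
    -- 100 - 50 = 50 with carry out set (result not negative)
    assert o = to_unsigned(50, 8) and co = '1'
      report "subtraction mismatch" severity error;
    wait;
  end process;
end architecture sim;
```

A model like this is free of gate delays, so it shows only the settled values and none of the transient glitches that a timing simulation of a CPLD implementation can reveal.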


Figure 3. HA-1 Logic diagram

B. Simulation results:

As seen in the simulation diagram, Figure 1, on page 4 of this paper:
A = 01100100B = 100D
B = 00110010B = 50D

- The results of the addition simulation are correct. When control is 0, the add function is selected, so A + B = 150D = 10010110B. The output result as seen in the diagram shows 10010110, which confirms proper operation. Co = 0 indicates that the result does not overflow.
- The results of the subtraction simulation are correct. When control is 1, the sub function is selected, so A - B = 50D = 00110010B. The output result (o0 to o7) as seen in the diagram matches this value. Co = 1 indicates that the result is not a negative number after subtraction.

The MAX7000 CPLD device used in this simulation has 8.1 ns of output delay. Glitches appeared on the sum output during the simulation (from 200 ns to 208.1 ns). This means that a CPLD is not suitable for an HA-1 type of implementation, though the results were good.

III. CONCLUSION

The results obtained are very encouraging. We successfully optimized the design chosen from many versions of HA. HA-1 is not yet perfect: although it is very fast, it consumes a lot of power and is still a bit large. A more streamlined design could be implemented with some effort. As mentioned earlier, this paper discusses only the design of the hybrid adder (FP ADD/SUB unit) along with its VHDL code. We have managed to simulate almost all the functional units required by the CPU, but they cannot all be discussed here. We are also pleased that the functional units are flexible enough to serve as a base for a more powerful CPU, given sufficient effort and time.

REFERENCES

[1] A. Akkas and M.J. Schulte, Dual-mode floating-point multiplier architectures with parallel operations, Journal of Systems Architecture, vol. 52, pp. 549-562, October 2006.
[2] A. Akkas, Dual-Mode Quadruple Precision Floating-Point Adder, Proceedings of the 9th EUROMICRO Conference on Digital System Design, 2006, pp. 211-220, ISBN: 0-7695-2609-8.
[3] V. Agarwal, M.S. Hrishikesh, S.W. Keckler and D. Burger, Clock rate versus IPC: the end of the road for conventional microarchitectures, Proceedings of the 27th annual international symposium on Computer architecture, vol. 28, May 2000, pp. 248-259, ISSN: 0163-5964.

[4] A. Beaumont-Smith, N. Burgess, S. Lefrere and C.C. Lim, Reduced Latency IEEE Floating-Point Standard Adder Architectures, Proceedings of the 14th IEEE Symposium on Computer Arithmetic, p. 35, 1999, ISBN: 0-7695-0116-8.
[5] G. Even, S.M. Mueller and P.M. Seidel, A dual precision IEEE floating-point multiplier, Integration, the VLSI Journal, vol. 29, issue 2, 2000, pp. 167-180, ISSN: 0167-9260.
[6] W. Stallings, Computer Organization and Architecture, sixth edition, Pearson & Prentice-Hall, 2003.


[7] P.M. Seidel and G. Even, Delay-Optimized Implementation of IEEE Floating-Point Addition, IEEE Transactions on Computers, vol. 53, issue 2, February 2004, pp. 97-113, ISSN: 0018-9340.
[8] Y. Hida, X.S. Li and D.H. Bailey, Algorithms for Quad-Double Precision Floating Point Arithmetic, Proceedings of the 15th IEEE Symposium on Computer Arithmetic, p. 155, 2001.

A. Joshi holds a Ph.D. with specialization in parallel architecture from the University of Mumbai, India. He is a member of the Faculty of Engg., Department of Electrical and Computer Engg., The University of the West Indies, Trinidad and Tobago. Dr. Joshi is a member of IEEE and leads the Computer Systems group at the IEEE Trinidad chapter.

S.L. Lam graduated from Multimedia University, Cyberjaya, Malaysia. He later joined Xilinx in Malaysia as an Engineer.

Y.Y. Chan graduated from Multimedia University, Cyberjaya, Malaysia. He later joined Xilinx in Malaysia as an Engineer.

Figure 1. Simulation result for HA-1


Figure 2. FP ADD/SUB Unit Block diagram.
