
2013 International Conference on Advanced Electronic Systems (ICAES)

Design and Implementation of Single Precision Pipelined Floating Point Co-Processor

Manisha Sangwan, PG Student, M.Tech VLSI Design, SENSE, VIT University, Chennai, India - 600048, manishasangwan47@gmail.com
A Anita Angeline, Professor, SENSE, VIT University, Chennai, India - 600048

Abstract—Floating point numbers are used in various applications such as medical imaging, radar and telecommunications. This paper deals with the comparison of various arithmetic modules and the implementation of an optimized floating point ALU. A pipelined architecture is used to increase performance, and the design increases the operating frequency by 1.62 times. The logic is designed using Verilog HDL. Synthesis is done on Encounter by Cadence after timing and logic simulation.

Keywords—CLA; clock-cycles; GDM; HDL; IEEE 754; pipelining; Verilog

I. INTRODUCTION

These days computers are used in many applications such as medical imaging, radar, audio system design, signal processing, industrial control and telecommunications. Many key factors are considered before choosing the number system: the computational capabilities required by the application, processor and system costs, accuracy, complexity and performance. Over the years designers have moved from fixed point to floating point operations because of the wide dynamic range of floating point numbers, which can represent both very small and very large values; at the same time their accuracy is limited, so a trade-off has to be made to get an optimized architecture.

Almost twenty years ago the IEEE 754 standard was adopted for floating point numbers. Its single precision format is a 32-bit number and its double precision format is a 64-bit number.

The storage layout consists of three components: the sign, the exponent and the mantissa. The mantissa includes an implicit leading bit and the fractional part.

TABLE I. FLOATING POINT REPRESENTATION

                    Sign      Exponent     Fractional    Bias
Single Precision    1 [31]    8 [30-23]    23 [22-00]    127
Double Precision    1 [63]    11 [62-52]   52 [51-00]    1023

Sign Bit: defines whether the number is positive or negative. If it is 0 the number is positive, otherwise it is negative.

Exponent: both positive and negative exponent values are represented by this field. To do this, a bias is added to the actual exponent to get the stored exponent [10]. For single precision this bias is 127 and for double precision it is 1023.

Mantissa: the mantissa consists of the implicit leading bit and the fractional part. It is represented in the form 1.f, where 1 is implicit and f is the fractional part; the mantissa is also known as the significand.

II. IMPLEMENTATION

A. Adder and Subtractor

Algorithm

Fig. 1. Block diagram of Floating Point Adder and Subtractor

In adders the propagation of the carry from one adder block to the next consumes a lot of time, but the Carry Look Ahead (CLA) adder saves this propagation time by generating and propagating the carry simultaneously across consecutive blocks. So, for faster operation, the carry look ahead adder is used.
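As a quick illustration of the Table I layout (a Python sketch, not part of the paper's design), the three fields of a single precision number can be extracted from its bit pattern:

```python
import struct

# Extract the Table I fields from a single precision number (illustrative sketch).
def fields(x):
    bits = struct.unpack('>I', struct.pack('>f', x))[0]  # 32-bit pattern
    sign = bits >> 31                 # 1 bit  [31]
    exponent = (bits >> 23) & 0xFF    # 8 bits [30-23], stored with bias 127
    fraction = bits & 0x7FFFFF        # 23 bits [22-00], the f of 1.f
    return sign, exponent, fraction

# 1.0 -> (0, 127, 0): actual exponent 0 plus bias 127, fraction all zeros
```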


Authorized licensed use limited to: College of Engineering - THIRUVANANTHAPURAM. Downloaded on December 17,2022 at 16:06:32 UTC from IEEE Xplore. Restrictions apply.
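The carry-lookahead adder described in Section II-A can be sketched as follows (illustrative Python, not the paper's Verilog; P is taken as the OR of the inputs, as in the paper's equations):

```python
# Illustrative bitwise carry-lookahead adder (Python sketch, not the paper's
# Verilog). Bit lists are LSB-first.
def cla_add(x_bits, y_bits, c0=0):
    g = [x & y for x, y in zip(x_bits, y_bits)]   # generate:  G[i] = X[i]*Y[i]
    p = [x | y for x, y in zip(x_bits, y_bits)]   # propagate: P[i] = X[i]+Y[i]
    c = [c0]
    for i in range(len(x_bits)):
        # C[i+1] = G[i] + P[i]*C[i]; the hardware expands this recurrence so
        # every carry depends only on the inputs, removing the ripple delay
        c.append(g[i] | (p[i] & c[i]))
    s = [x ^ y ^ ci for x, y, ci in zip(x_bits, y_bits, c)]  # S = X xor Y xor C
    return s, c[-1]

# 11 + 6: bits [1,1,0,1] + [0,1,1,0] -> sum bits [1,0,0,0], carry-out 1 (= 17)
```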
2013 International Conference on Advanced Electronic Systems (ICAES)

S[i] = X[i] ⊕ Y[i] ⊕ C[i]
G[i] = X[i] · Y[i]
P[i] = X[i] + Y[i]
C[i+1] = X[i] · Y[i] + X[i] · C[i] + Y[i] · C[i] = G[i] + P[i] · C[i]
C[i+1] = G[i] + P[i] · G[i-1] + P[i] · P[i-1] · C[i-1]

B. Multiplication

Algorithm

Fig. 2. Block diagram of Floating Point Multiplier

Multiplication is an important block of the ALU. With high speed and low power multipliers comes complexity, so a trade-off has to be made between these to get an optimized algorithm with a regular layout. Different algorithms are available for multiplication, such as the Booth, modified Booth, Wallace, Baugh-Wooley and Braun multipliers. The issues with multipliers are speed and a regular layout, so keeping both parameters in mind the modified Booth algorithm was chosen. It is a powerful algorithm for signed-number multiplication, which treats both positive and negative numbers uniformly.

C. Division

Algorithm

Fig. 3. Block diagram of Floating Point Division

For the division process the Goldschmidt (GDM) algorithm is used. For this algorithm to be applied, both inputs need to be normalized first. The two multiplications in each iteration are independent of each other, so they can be executed in parallel, which reduces the latency.

Algorithm for GDM for Q = A/B using k iterations:

• B ≠ 0, |e0| < 1
• Initialize N = A, D = B, R = (1 - e0) / B
• For i = 0 to k
•   N = N * R
•   D = D * R
•   R = 2 - D
• End for
• Q = N
• Return Q

D. Pipelining

The speed of execution of any instruction can be varied by a number of methods, such as using a faster circuit technology to build the processor, or arranging the hardware so that multiple operations can be performed at the same time [11]. With pipelining, multiple operations are performed simultaneously without changing the execution time of an individual instruction. As shown in the example below, in sequential execution the third instruction is executed in the sixth clock cycle, but in the pipelined architecture the same instruction is executed in the fourth clock cycle, saving two clock cycles, where F is the fetch stage and E is the execute stage. Therefore, as the instruction count increases, more clock cycles are saved.

Clock cycle:  1    2    3    4    5    6
              F1   E1   F2   E2   F3   E3

Fig. 4. Sequential Execution

Clock cycle:  1    2    3    4
              F1   E1
                   F2   E2
                        F3   E3

Fig. 5. Pipelined Execution
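The clock-cycle savings illustrated in Fig. 4 and Fig. 5 can be expressed with a small sketch (an illustrative two-stage model, not the co-processor's actual pipeline depth):

```python
# Cycle counts for n instructions on the two-stage fetch/execute model used
# in the example above (illustrative only).
def sequential_cycles(n):
    return 2 * n        # F then E for each instruction, one at a time

def pipelined_cycles(n):
    return n + 1        # one cycle to fill, then one completion per cycle

def cycles_saved(n):
    return sequential_cycles(n) - pipelined_cycles(n)

# Three instructions: 6 cycles sequential vs 4 pipelined, saving 2
```

As the model shows, the saving grows linearly with the instruction count, which matches the observation in the text.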

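Returning to the division block, the GDM loop for Q = A/B listed in Section II can be sketched in Python floats (the hardware works on normalized fixed-width operands; this is only a model of the iteration):

```python
# Python-float sketch of the GDM (Goldschmidt) division loop (illustrative,
# not the paper's datapath). Convergence needs D close to 1, i.e. a
# normalized divisor; k is the iteration count.
def gdm_divide(a, b, k=5):
    assert b != 0
    n, d = a, b
    r = 2.0 - d              # simple initial approximation of 1/B for d near 1
    for _ in range(k):
        n *= r               # N = N*R and D = D*R are independent, so the
        d *= r               # two multiplications can run in parallel
        r = 2.0 - d
    return n                 # D converges to 1, so N converges to A/B
```

The error roughly squares on each pass, so a handful of iterations suffices for a normalized divisor.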
III. FUNCTIONAL AND TIMING VERIFICATION


The functional verification is done using both Cadence and Xilinx tools, and the arithmetic results are also verified theoretically. Timing, power and area analyses are done using both Cadence and Xilinx.




A. Adder and Subtractors


In addition and subtraction, the sign, exponent and fractional bits are operated on separately. The fractional bits are shifted to equate the exponents, and the addition is then performed on the fractional bits. The final result is combined to make the 32-bit output [Fig 6].

Fig. 6. Simulation Waveform for Adder and Subtractor
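The alignment step described above can be illustrated with a toy sketch (Python, ignoring rounding, normalization and signs; not the paper's Verilog):

```python
# Toy sketch of exponent alignment for floating point addition. Operands are
# (exponent, significand) pairs where the integer significand already
# includes the implicit leading 1.
def align_and_add(e1, f1, e2, f2):
    if e1 < e2:                      # make operand 1 the larger-exponent one
        e1, f1, e2, f2 = e2, f2, e1, f1
    f2 >>= (e1 - e2)                 # shift smaller operand to equate exponents
    return e1, f1 + f2               # add fractional bits; normalization follows

# With 3 fraction bits: 1.100b * 2^1 (= 3.0) plus 1.000b * 2^0 (= 1.0)
# gives significand 16 at exponent 1, i.e. (16/8) * 2 = 4.0
```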

B. Multiplier

In the multiplication block, the exponents are added and the fractional bits are multiplied according to the algorithm. To get the sign of the result, the XOR operation is performed on the two input sign bits. Finally, all the bits are combined to give the final result [Fig 7].

Fig. 7. Simulation Waveform for Multiplier

C. Division

The exponents are subtracted and the fractional bits are multiplied and subtracted according to the GDM algorithm, with continued iterations to approach the closest result; for the sign bit, the XOR operation is performed on the sign bits of the inputs. This block consumes the most time and area [Fig 8].

Fig. 8. Simulation Waveform for Division

D. ALU Layout

In Fig 9, the final layout of the circuit is shown.

Fig. 9. ALU Layout

IV. SYNTHESIS RESULT

Synthesis results are shown in Table II below.

TABLE II. COMPARATIVE ANALYSIS OF THE EXISTING AND PROPOSED DESIGNS

                     Existing            Proposed
Leakage power        2.880282 µW         3.50267 µW
Dynamic power        11.377751 mW        16.14882 mW
Total power          11.380632 mW        16.15232 mW
Gate count           2881                3712
Frequency            225.65 MHz          367.654 MHz
Critical path        4.43164 ns          2.70 ns
Logic utilization    1% (466/38000)      4% (1780/46560)
IOs                  44% (130/296)       65% (157/240)
Area                 75436               97194

V. CONCLUSION

In this paper various arithmetic modules are implemented and compared. The individual blocks are then combined into a floating point based ALU in a pipelined manner to minimize power and to increase the operating frequency at the same time. The comparative analyses are done on both Cadence and Xilinx, and the simulation results are verified theoretically. Verilog HDL (Hardware Description Language) is used to design the whole ALU block. The total power of the existing design, 11.380632 mW, is about 0.70 times that of the proposed design, but the operating frequency of the proposed design is about 1.63 times that of the existing design. Along with that, the gate count and the area are increased because of the number of iterations used in the algorithm.




VI. FUTURE WORK


Optimization of the source code to decrease the area and gate count will improve reliability. Low power techniques could be incorporated to obtain a better trade-off.
REFERENCES
[1] Addanki Purna Ramesh, Ch. Pradeep, “FPGA Based Implementation of
Double Precision Floating Point Adder/Subtractor Using VERILOG”,
International Journal of Emerging Technology and Advanced
Engineering, Volume 2, Issue 7, July 2012
[2] Semih Aslan, Erdal Oruklu and Jafar Saniie, “A High Level Synthesis
and Verification Tool for Fixed to Floating Point Conversion”, 55th
IEEE International Midwest Symposium on Circuits and Systems
(MWSCAS 2012)
[3] Prashanth B.U.V, P. Anil Kumai, G. Sreenivasulu, “Design and
Implementation of Floating Point ALU on a FPGA Processor”,
International Conference on Computing, Electronics and Electrical
Technologies (ICCEET 2012), 2012
[4] Subhajit Banerjee Purnapatra, Siddharth Kumar, Subrata Bhattacharya,
“Implementation of Floating Point Operations on Fixed Point Processor
– An Optimization Algorithm and Comparative Analysis”, IEEE 10th
International Conference on Computer Information Technology (CIT
2010), 2010
[5] Ghassem Jaberipur, Behrooz Parhami, and Saeid Gorgin,
“Redundant-Digit Floating-Point Addition Scheme Based on a Stored
Rounding Value”, IEEE Transactions on Computers, vol. 59, no.
[6] Alexandre F. Tenca, “Multi-operand Floating-point Addition”, 19th
IEEE International Symposium on Computer Arithmetic, 2009.
[7] Cornea, “IEEE 754-2008 Decimal Floating-Point for Intel®
Architecture Processors”, 19th IEEE International Symposium on
Computer Arithmetic, 2009.
[8] Joy Alinda P. Reyes, Louis P. Alarcon, and Luis Alarilla, “A Study of
Floating-Point Architectures for Pipelined RISC Processors”, IEEE
International Symposium on Circuits and Systems, 2006.
[9] Peter-Michael Seidel, “High-Radix Implementation of IEEE
Floating-Point Addition”, Proceedings of the 17th IEEE Symposium on
Computer Arithmetic, 2005.
[10] Guillermo Marcus, Patricia Hinojosa, Alfonso Avila and Juan
Nolazco-Flores, “A Fully Synthesizable Single-Precision, Floating-Point
Adder/Subtractor and Multiplier in VHDL for General and Educational
Use”, Proceedings of the 5th IEEE International Caracas Conference on
Devices, Circuits and Systems, Dominican Republic, Nov. 3-5, 2004.
[11] Carl Hamacher, Zvonko Vranesic, Safwat Zaky “Computer
Organization” 5th Edition, Tata McGraw-Hill Education, 2011.



