You are on page 1of 73

GRIFFITH UNIVERSITY

Design of Double Precision IEEE-754
Floating-Point Units
Extending the Open Source LTH FPU for the Gaisler Leon2 SPARC8 Microprocessor

DISSERTATION

SUBMITTED BY

MICHAEL KENNEDY B.I.T.

SUPERVISOR

PROFESSOR PETER LISNER

SCHOOL OF MICROELECTRONICS

SUBMITTED IN FULFILLMENT OF THE REQUIREMENTS OF THE DEGREE OF MASTER’S OF ENGINEERING
IN VLSI

MARCH 2007
©2007 Michael Kennedy
All Rights Reserved
This work has not previously been submitted for a degree or diploma in any university. To the best of my
knowledge and belief, the thesis contains no material previously published or written by another person
except where due reference is made in the thesis itself.

Michael Kennedy
Table of Contents

LIST OF FIGURES 7

ABBREVIATIONS AND SYMBOLS 9

ACKNOWLEDGEMENTS 10

ABSTRACT 11

1 INTRODUCTION 12

2 LITERATURE REVIEW 14

2.1 INTRODUCTION 14
2.2 FLOATING POINT NUMBERS 14
2.3 THE LTH FPU 14
2.4 ARITHMETIC OPERATIONS 15
2.5 DESIGN 18
2.6 VERIFICATION 18
2.7 PERFORMANCE 19
2.8 COMPARISON 19
2.9 CONCLUSION 20

3 DESIGN 21

3.1 INTRODUCTION 21
3.2 LTH FLOATING POINT UNIT 21
3.3 MULTIPLICATION 23
3.4 SRT DIVISION 27
3.5 SRT SQUARE ROOT 32
3.6 CONCLUSION 36

4 IMPLEMENTATION 37

4.1 INTRODUCTION 37
4.2 HDL MODELING 37
4.3 SYNTHESIS 41
4.4 CONCLUSION 42

5 RESULTS 43

5.1 INTRODUCTION 43
5.2 TIMING AND AREA ANALYSIS 43
5.3 IC LAYOUT 49
5.4 PERFORMANCE 51
5.5 CONCLUSION 52

6 CONCLUSION 53

6.1 INTRODUCTION 53
6.2 ACHIEVEMENTS 53
6.3 FUTURE DIRECTIONS 54
6.4 CONCLUSION 55

BIBLIOGRAPHY 56

APPENDIX A: ARITHMETIC BY EXAMPLE 60

INTRODUCTION 60
RADIX-2 UNSIGNED MULTIPLICATION 60
RADIX-4 MULTIPLICATION USING MODIFIED BOOTH ENCODING 61
SRT RADIX-2 DIVISION 62
SRT RADIX-2 SQUARE ROOT 64
RADIX-2 ON-THE-FLY CONVERSION 66

APPENDIX B: COUNTER STUDY 67

INTRODUCTION 67
TIMING ANALYSIS 67
AREA ANALYSIS 68
CONCLUSION 68
APPENDIX C: HDL VERIFICATION 69

INTRODUCTION 69
MULTIPLICATION 69
DIVISION 69
SQUARE ROOT 70

APPENDIX D: CODE LISTING 72

INTRODUCTION 72
FILE LOCATIONS AND NAMING CONVENTIONS 72
List of Figures

Figure 3.1 LTH Overview .................................................................................................................................................................. 21
Figure 3.2 Extending the LTH ......................................................................................................................................................... 22
Figure 3.3 Floating Point Multiplication Operation Properties ....................................................................................... 23
Figure 3.4 Modified Booth-2 Encoding ....................................................................................................................................... 23
Figure 3.5 Partial Product Sign Extension................................................................................................................................. 24
Figure 3.6 Partial Product Counter Over Sizing ...................................................................................................................... 24
Figure 3.7 CSA Counter Configuration ........................................................................................................................................ 25
Figure 3.8 106-Bit CSEA using 16-bit CLA Blocks.................................................................................................................. 26
Figure 3.9 Floating Point Division Operation Properties ................................................................................................... 27
Figure 3.10 SRT Radix-2 Division Equations ........................................................................................................................... 27
Figure 3.11 SRT Radix-2 Division Overview ............................................................................................................................ 28
Figure 3.12 SRT Radix-2 Division Circuit .................................................................................................................................. 28
Figure 3.13 SRT Radix-2 Quotient Selection ............................................................................................................................ 29
Figure 3.14 Radix-2 On-The-Fly Conversion Equations ..................................................................................................... 30
Figure 3.15 Radix-2 On-The-Fly Conversion for Q ................................................................................................................ 30
Figure 3.16 Radix-2 On-The-Fly Conversion for QM ............................................................................................................ 30
Figure 3.17 Stick Bit Formula for a Redundant Residual ................................................................................................... 31
Figure 3.18 SRT Radix-4 Division using Layer Radix-2 Stages ........................................................................................ 31
Figure 3.19 Floating Point Square Root Operation Properties ........................................................................................ 32
Figure 3.20 SRT Radix-2 Square Root Equations ................................................................................................................... 33
Figure 3.21 SRT Radix-2 Square Root Overview .................................................................................................................... 33
Figure 3.22 SRT Radix-2 Square Root Circuit .......................................................................................................................... 34
Figure 3.23 SRT Radix-2 Square Root Functional Divisor Equations ........................................................................... 35
Figure 3.24 SRT Radix-2 Division and Square Root Equations........................................................................................ 35
Figure 3.25 SRT Radix-2 Combined Division and Square Root Circuit ........................................................................ 36
Figure 4.1 Multiplier Pipeline ......................................................................................................................................................... 37
Figure 4.2 SRT Radix-4 Division and Square Root Pipeline .............................................................................................. 39
Figure 4.3 Leonardo Spectrum Settings ..................................................................................................................................... 41
Figure 4.4 Preparing a Netlist for IC Layout ............................................................................................................................. 42
Figure 5.1 Original LTH Timing Analysis ................................................................................................................................... 43
Figure 5.2 Original LTH Area Analysis ........................................................................................................................................ 43
Figure 5.3 Multiplier Timing Analysis......................................................................................................................................... 44
Figure 5.4 Multiplier Area Analysis.............................................................................................................................................. 45
Figure 5.5 Combined Division and Square Root Timing Analysis .................................................................................. 46
Figure 5.6 Combined Division and Square Root Area Analysis ....................................................................................... 47
Figure 5.7 Exponent Component Timing Analysis ................................................................................................................ 48
Figure 5.8 Exponent Component Area Analysis ..................................................................................................................... 48
Figure 5.9 Multiplier IC Layout ...................................................................................................................................................... 49
Figure 5.10 Combined Division and Square Root IC Layout ............................................................................................. 50
Figure 5.11 Pre-Layout Clock Frequency .................................................................................................................................. 51
Figure 5.12 Processor Latency and Frequency Comparison ............................................................................................ 51
Figure A.1 Unsigned Radix-2 Multiplication Example ......................................................................................................... 60
Figure A.2 Booth-2 Multiplicaiton Example ............................................................................................................................. 61
Figure A.3 Booth-2 Multiplication Sign Extension Dot Diagram ..................................................................................... 61
Figure A.4 SRT Radix-2 Division Example................................................................................................................................. 63
Figure A.5 SRT Radix-2 Square Root Example ........................................................................................................................ 65
Figure A.6 On-The-Fly Conversion Example ............................................................................................................................ 66
Figure B.1 Std. Cell v. Synthesized Counter Timing Analysis ............................................................................................ 67
Figure B.2 Std. Cell v. Synthesized Counter Area Analysis ................................................................................................. 68
Figure C.1 FPU Multiplication Test Bench ................................................................................................................................. 69
Figure C.2 FPU Division Test Bench ............................................................................................................................................. 70
Figure C.3 FPU SQRT Test Bench................................................................................................................................................... 70
Figure C.4 FPU Square Root Result Verification ..................................................................................................................... 71
Figure D.1 File Location Map .......................................................................................................................................................... 72
Figure D.2 File Naming Conventions ........................................................................................................................................... 72
Figure D.3 HDL Synthesis and Simulation Highlights .......................................................................................................... 73
Abbreviations and Symbols
ADK Academic Design Kit

CLA Carry Look-Ahead Adder

CPA Carry Propagate Adder

CSA Carry Save Adder or Full Adder

CSEA Carry Select Adder

D Divisor

DIV Division (may refer to combined division and square root unit)

FA Full Adder

F(i) Square root functional divisor (F(i) and –F(i), which are for positive and negative products)

FPU Floating Point Unit

G Guard Bit

I Iteration

J Bit Position

LTH Open Source FPU distributed by Gaisler Research (www.gaisler.com)

MUX Multiplexer

OTFC On-The-Fly-Conversion, used for quotient generation

PLA Programmable Logic Array

PPi Partial Product

PR(s,c) Partial Remainder in carry save form

Q Quotient

QM Negative quotient factor (Qi – 2-(i))

qi+1 QSLC result for current iteration

QSLC Quotient Selection Logic Circuit

r Radix

R Round bit

SQRT Square Root

Syn Synthesis or synthesized

T Sticky bit

ULP Unit in Least Significant Position
Acknowledgements
I would like to thank my supervisor, Professor Peter Lisner for his patience and support

My parents, Jennifer and Michael Kennedy, for making this all possible

Professor Nicholas Bellamy, for editorial advice

David Clarke, for research assistance

Mal Gillespie, for software support
Abstract
This thesis explores the design and implementation of a synchronous pipelined Floating Point Unit FPU
for the AMI 0.5 micron process. Specifically it focuses on designing a multiplier, divider and square root
unit for the LTH. The LTH is an open source FPU provided by Gaisler Research for their Leon2 SPARC8
compliant processor. It is a partial implementation of an arithmetic unit, for working with IEEE-754
Floating-Point single and double precision numbers. The new functions are modeled using the Verilog-97
hardware design language HDL, and implemented using the Mentor Graphics ADK design flow. The goal
is to produce a design that implements the essential operations of the floating point standard, and is
capable of running at and above 100MHz.
1 Introduction
This thesis presents an overview of designing a pipelined synchronous IEEE-754 double precision
floating point unit for the AMI 0.5 micron process, using the Mentor Graphics ADK 2.5 design flow. The
first section provides the relevant background information for working with floating point numbers and
the base implementation. The second section illustrates the choice of arithmetic design to provide the
missing functionality. The final two sections provide details of the implementation and the results
achieved.

Essentially floating point numbers are used to represent very small to very large numbers with varying
levels of precision. The normal set of mathematical operations such as addition, division, multiplication
and subtraction are defined, but are slightly more complex than those for standard integer numbers.
Floating point operations performed in software, where only an integer adder is available may take
several hundred or thousands of cycles. The speedup of these operations is quite important, because
floating point numbers are used in a wide range of applications including CAD, games, graphics,
multimedia, and scientific. Hence, the main objective to designing a FPU, is to decrease the number of
cycles to perform a floating point operation from that taken in software, down to a smaller number of
cycles performed in hardware.

The first section, literature review, provides the background for this thesis. It discusses the IEEE-754
floating point number standard and the existing implementation provided with the Gaisler Leon2 SPARC8
compatible processor. This first section also provides some details of how various implementations were
verified and their performance.

The second section, design, provides an overview of the arithmetic algorithms and architecture employed
to extend the LTH FPU. It is meant to illustrate the circuit design, from a high level perspective of how the
arithmetic algorithms are to be implemented in logic blocks, that fit together to provide the required
functionality.

In the third section, implementation, the design process is discussed with reference to the Mentor
Graphics ADK 2.5 design flow. It covers the basics from modeling the FPU with the Verilog HDL using
standard cells to layout with IC station. While not all elements that were used in the FPU are fully
detailed, a general outline for their construction is provided with a selection of examples from each stage
of the development cycle. Also covered in this section are issues discovered with the tool flow discovered
during the implementation process.

The fourth section, results, is a discussion of the developed FPU implementation. It is aimed at providing
some comments in relation to testing and verifying the design, and providing some performance
measurements. It covers verification of both the HDL design and the completed IC layouts for both logic
and electrical rule verification.

12
The final section, conclusion, outlines the success of the developed implementation. It details known
limitations and future improvements with the design. Performance measurements and comparisons with
industry implementations are also discussed. This section in brief provides a summary of the design,
results and where future modifications should be aimed to provide increased performance.

13
2 Literature Review

2.1 Introduction
This section provides an overview of the background literature required for understanding the floating
point standard, operations and the design of floating point units. In particular it focuses on
multiplication, division, and square root operations as they relate to extending the LTH into a more
complete implementation for processing IEEE-754 floating point numbers for the Leon2 Microprocessor.

2.2 Floating Point Numbers
Floating point numbers [8] [11] are essentially used to describe very small to very large numbers with a
varying level of precision. They are comprised of three fields, a sign, a fraction and an exponent field.
Floating point numbers are also defined for several levels of precision, which include single, double, quad,
and several extended forms. The IEEE-754 standard defines several floating point number formats and
the size of the fields that comprise them. Also making them more complex, is that several specific bit
patterns represent special numbers, which include zero, infinity, and Not-a-Number “NaN”. Additionally
floating point numbers may be in a normalised or de-normalized state. Further discussed in the IEEE-754
standard are several rounding schemes, which include round to zero, round to infinity, round to negative
infinity, and round to nearest [8] [ 9] [11] [14].

In particular for the design of this FPU, only double precision normalized floating point numbers are of
interest. This is because de-normalized numbers are not required to be implemented in an FPU for it to
be considered compatible with the IEEE-754 standard, and the LTH converts all floating point numbers to
double precision for internal use. Also, all special bit patterns result in special conditions, such as divide
by zero, which is instantly an exception, and requires no computation. This essentially means the floating
point numbers used will have a 1-bit sign field, 53-bit fraction field, and an 11-bit exponent field. The
format of the fraction and exponent fields is also specified, with the MSB of the fraction always being one
and the exponent never being all one’s or all zero’s.

2.3 The LTH FPU

2.3.1 The Leon2
The Leon2 [1] is a pipelined SPARC8 [4] software configurable synthesizable processor designed by
Gaisler Research for the European Space Agency and distributed under the GPL open source license [5].
Internally it does not include a floating point unit, but does define an interface for using one as a co-
processor [2]. It provides two methods of interfacing with a floating point unit, which include a direct
and an indirect approach. The main difference is that when a floating point operation is issued using the
direct approach, the CPU is stalled until the FPU completes that operation. The advantage to this is that
an operations queue is not needed, reducing the complexity of any implementation. While the Leon2
does not explicitly contain any FPU, Gaisler does distribute it with several options [3] [6], which include a
partial implementation called the LTH [3] that is made available under the same GPL license.

14
2.3.2 The LTH
The LTH [3] is a partial implementation of a floating point unit. In its original form it supports addition,
subtraction, and comparison operations on both single and double precision floating point numbers.
While the IEEE-754 standard [11] does not require all floating point operations to be implemented in
hardware, it does require several rounding schemes, and most implementations normally include
multiplier, divider and square root units [7] [14]. Further, as the LTH is aimed for the SPARC8 ISA it
should be able to handle single and double precision floating point operations. However while quad
precision instructions are listed in the SPARC8 standard, they are not often implemented in many
industry designs [29] [30]. Incorporating the new instructions into the LTH also requires limited changes
in the Leon2, for identifying the available floating point operations implemented in hardware.

2.4 Arithmetic Operations

2.4.1 Floating Point Arithmetic
Floating point units usually consist of two main data paths for computing the fraction and exponent fields
of a result [8]. However, the main idea of performing a floating point operation can be defined as
decoding, normalization, exception checking, fraction and exponent operations and rounding, where
operations on the fractions of the operands take the most computation [8] [11]. It should be noted that
the floating point standard is aimed at producing rounded results as if they were produced for an infinite
precision [11]. To accomplish this all operations are generally computed to several extra bits of precision
(guard, round, and sticky bits). Some designs can produce this extra information for some operations in
parallel to their normal execution [16] [25].

2.4.2 Multiplication
Multiplication of two floating point normalized numbers is performed by multiplying the fractional
components, adding the exponents, and an exclusive or operation of the sign fields of both of the
operands [9] [11] [14]. The most complicated part is performing the integer-like multiplication on the
fraction fields (Figure A.2). This can be done by utilizing a full tree [15], Wallace [20] or Dadda tree [12],
staged [17], or serial array multiplier architecture [8] to sum the partial products.

Essentially the multiplication is done in two steps, partial product generation and partial product
addition. For double precision operands (53-bit fraction fields), a total of 53 53-bit partial products are
generated [16]. To speed up this process, the two obvious solutions are to generate fewer partial
products and to sum them faster [8].

In it simplest form, producing the radix-2 unsigned partial products is accomplished by a logical AND
operation [8] on the operands. To produce fewer partial products, the normal method used is some form
of modified Booth encoding [15] [36]. This allows for computations at a higher radix, with partial
products produced in various forms. The choice of radix [14] determines how many partial products are
to be generated, and the style of encoding determines the form of partial product. Modified Booth-2
radix-4 encoding for double precision operands produces 27, and Booth-3 radix-8 produces 18. With an
increased radix also comes increased complexity and normally increased delay [8] [9]. The form of the

15
produced partial products can be as simple as 2’complement, or a more redundant number system [18]
that allows for a faster summation of the products.

After the partial products have been produced, they must all be added together. The fastest method is to
use a carry free method of addition that sums the products in parallel using a CSA tree [8] [15]. The CSA
tree can be implemented in many forms, the main variations including wiring regularity [8] [14] [20],
order of addition [8], and counter or compressor sizing [15] [16]. In the case of Booth-2 radix-4 encoding
with a full binary CSA tree, the benefits of producing fewer partial products do not include a major speed
increase, because only a decrease in the critical path of the one full adder is realized. Alternately, this
does provide significant area saving. Further, while it may be possible to generate fewer partial products
with higher levels of Booth encoding, the time and area penalties increases with higher radix encodings
and can outweigh any improvements offered by a smaller CSA tree [8]. Also care must be taken with the
layout of the counter tree [16] [20], as the natural shape is irregular and area intensive [19]. The CSA tree
from Samsung [15] attempts to eliminate much of the wiring irregularity, by defining the counter tree in
slices ranging from 28:2 to 4:2 counters built from seven levels of full adder cells. Alternately, the Parallel
Structured Full Array Multiplier [17] built using a Wallace tree from four levels of 4:2 counters offers
increased regularity. Assuming that the 4:2 counters are internally two full adder cells, the Samsung
design consists of one less level in the CSA tree, while offering a less regular wiring structure then the
Wallace tree design.

The last stage in the multiplication is a carry propagate addition using a wide 106-bit adder for double
precision numbers. Adders, while performing a simple operation, are still an area of interest and many
implementations exist for providing a final carry propagate addition CPA [8]. Amongst these are Carry-
Look-Ahead CLA [14], Carry-Select CSEA [11], Ling [14], and hybrid adder architectures comprised of
elements from several different design approaches [17]. Additionally, there are adders that operate on
different number systems with different levels of redundancy. Further, some partial products may be
produced in a highly redundant form making a full CPA irrelevant [18] and instead utilizing a slightly
faster but less area intensive conversion circuit to convert the redundant result into a standard binary
notation. However, many of the designs specified become less attractive when they are implemented
using standard cells, such as the 4ns 32-bit CMOS Ling Adder [14]. This presents its own problems, as the
selected adder must be feasible using standard cells and be able to perform the full addition in under
10ns. This makes the XOR based design offered by Samsung [15] more attractive.

Optimizing a multiplier is usually done by constructing the CSA tree into a regular layout instead of the
irregular trapezoid that results from the shifted partial products [20]. The goal here is to decrease the
total area used by the multiplier and to make the wiring delay between the adders more uniform. Some
other work has been performed using more redundant number systems, which allow for higher levels of
modified Booth encoding with out incurring the time penalties that normally result from forming the hard
multiples (such as the 3x) [18]. Generally however, the most common implements revolve around a
radix-4 Booth-2 encoding, some form of CSA tree, and a wide hybrid adder (at least 106-bits).

16
2.4.3 Division
Division of two floating normalized floating point numbers is performed by a subtraction on the
exponents, an integer style division on the fractions (Figure A.4), and an exclusive OR operation of the
sign fields of both of the operands [11] [14]. Similar to multiplication, the operation on the fraction fields
is the most time consuming part. There are two main ways of performing the integer division of the
fraction fields; a subtractive recursive approach [11] and a multiplicative approximation [8]. Both
present their own unique benefits [21] [22] in terms of area and performance, but also complexity. The
subtractive approach provides for a simpler implementation but produces a constant number of bits per
iteration. Alternately, the multiplicative approach produces an increasing number of bits per iteration,
and as a result requires fewer iterations, with each taking several cycles that need a modified multiplier.
The subtractive approach is a more popular architecture because it can be implemented independently,
and does need complicated modifications to the multiplier [28].

The subtractive approach, which is performed by a repeated addition, subtraction or shifting operation
produces a constant numbers of result bits per iteration (1-bit per iteration for radix-2, 2-bits per
iteration for radix-4). This does imply that for a 53-bit result, at least 53 iterations are needed. Two
ways to increase the performance are to increase the number of result bits generated during an iteration,
or to perform more iteration’s per cycle [8]. This can be done either through a higher radix more complex
iteration, or by layering several simple lower radix iterations together [23] [26]. Another way to reduce
the delay is to work with the partial remainder in a redundant form, such as carry-save [25], which
removes the need for a CPA in each iteration. The critical path then becomes the quotient selection logic
(QSL), which determines if a subtraction, addition or shift operation takes places during an iteration.

The quotient selection logic can be implemented as purely combinational logic or as a look up table (LUT),
or referred to in SRT as a Quotient Selection Table QST. In the case of the second approach several tools
are available to automatically generate the QST’s [8] [12] [13]. With a higher radix comes a more complex
QSL. Layering and overlapping of lower radix iterations allows for a computation at a higher radix
without the complexity [23]. A design by SUN Microsystems [25] demonstrates the layering of low radix
iterations, and also provides a more optimized SRT table than some other designs [24]. This optimization
has the potential to make unnecessary a final restoring iteration for the remainder normally required by
many SRT implementations. When used in conjunction with high radix cycles, this single cycle saving
provides a relatively large performance increase without any extra delay to the critical path.

2.4.4 Square Root
The square root operation is very similar to the division operation, where both multiplicative
approximation and subtractive recursive approaches are available [22] (Figure A.4). The reasons for
choosing an appropriate architecture are very similar to division [21], where either the multiplier can be
modified or an extra subtractive unit can be implemented. In the case of the latter choice, the difference
between the subtractive division and square root unit’s is essentially very small, and an SRT divider can
be modified to encompass square root operations [25] with only a few changes.

17
Essentially the biggest difference between the SRT division and square root operations, is that in the SRT
square root operation the divisor is adjusted during every iteration [11] [13], where as in SRT division it
remains unchanged. Another difference between the two operations is in the handling of exponents and
base exceptions. For the most part, the SRT division algorithm can be extended to implement the square
root operation [37], allowing for the advantages of the more complex division unit designs to be extended
for the square operations in a combined division square root unit.

2.5 Design
The design of the original LTH [3] is a single pipeline that has several pipeline stages operating on double
precision floating point numbers. Data are inserted into the pipeline in several stages including decode,
operands and rounding. The missing functionality is normally implemented in parallel independent
pipelines because of the marked difference in the requirements of their respective algorithms. They still
normally share some common stages, which may include decoding, rounding, and normalization. This
implies that the new functionality can be incorporated by adding the new functional blocks in parallel to
the existing addition/subtraction unit of the LTH. The new blocks will also be required to operate on a
single pair of double precision numbers. However, other industry designs offer a wider variety of word
sizes for computation [19] [30] [31].

The original design uses a pipelined architecture, but is not truly pipelined. While it is divided into stages,
it cannot accept new operations every cycle. This is an intended feature of the LTH because of its
interface with the Leon2 [2], which means with the inclusion of the new functional units, at most only one
stage in a single pipeline will be in operation at any time. While it may be possible to extend the modified
LTH to a more modern out-of-order superscalar design [7] [32], neither the SPARC8 standard [4] nor the
co-processor interface of the Leon2 requires such complexities, and therefore the need for executing
floating point instructions in parallel and out-of-order is beyond the scope of this project.

2.6 Verification
A floating point unit is a set of complex algorithms implemented in an integrated architecture [33] [34].
Demonstrating a satisfactory level of compliance with the IEEE-754 standard and implementation
reliability can be grouped into two separate concepts, design and implementation testing [36]. The first is
a logic proof of the well defined algorithm translated into Verilog HDL. These are tested by running a
limited number of test vectors against the developed HDL modules. The second, implementation testing,
is usually accomplished by generating test vectors to cover a specific set of test cases and possibly a
percent of total input coverage against a physic model. This testing can be accomplished through a
mechanized form on a taped out implementation. The importance of testing both the algorithm and
implementation can be seen by the problems encountered with the Intel Pentium division QRT table [35],
where the algorithms were sound but the implementation contained incomplete data sets.

18
2.7 Performance
Performance of a floating point unit is commonly expressed as a matter of its latency and throughput.
However the real world performance of an FPU may be more application specific, and is normally
measured using benchmarks that comprise tests from common applications [29]. While this does allow
for a comparison of different architectures based on execution cycles alone, it does not permit a
meaningful overall comparison of designs since area, power, and technology are not considered. FUPA, a
Floating-point Unit Performance Analysis metric discussed by Flynn and Obermann [14], is a more
complete metric for comparisons of different implementations as it incorporates latency, area, scaling,
power, minimum feature size, and application profiling.

2.8 Comparison
During this literature review several different approaches and designs were researched to determine the
optimal architecture for extending the LTH. The extended overall design is well defined by the original
LTH, and the marked differences between the functional requirements of the different operational units.
The largest issues are in concern to the additional units, which can be implemented using a variety of
different methods, but are limited to those designs that use standard cells [12] [13] and can achieve a
10ns cycle time.

The design of a multiplier is generally accepted to be a three stage process, modified Booth encoding, CSA
tree, and final CPA. The approach offered by Cho et al [15] offers both a complete modular description for
Booth-2 encoders and a CSA tree constructed from full adders. The other main advantage is that this
design avoids some of the layout issues with counter trees [8] as the blocks are clearly defined. Other
implementations utilizing a more redundant number system [18], while interesting, offer perhaps a faster
solution but are vastly more complex and at times not complete. Flynn [14] also suggests for both area
and latency, a modified Booth-2 encoder with a full binary adder tree is an optimal solution. The final
stage of multiplication, the 106-bit CPA, presents the only real choice. While many designs exist, only fast
adders that can be implemented using standard cells are relevant. The hybrid CSEA using 16-bit CLAs
offered by Mori [17], provides a suitable delay, a modular and well documented design optimal for layout,
especially in comparison to a fully synthesized random logic wide adder.

There is a larger difference in the design of division and square root units. Essentially the subtractive SRT
approaches are favored because they do not require any modification to the multiplier and can be
implemented independently. The next choice then becomes that of radix, where more bits per iteration
incurs more complexity, or more iterations per cycle increases the area. The implementation offered by
SUN Microsystems and others utilizing several lower radix iterations to form a higher radix iteration, is
clearly the best approach [25]. While the SUN design is not noticeably different from other layered
overlapped radix-2 designs, it does provide significantly improved QSLC, utilizing a fifth bit.

Essentially the methods used for testing the final product are not available for the extended LTH. This is
because the design approach of testing on an FPGA is not possible given the direct instantiation of ADK
standard cells in the HDL. Secondly, since there is no planned fabrication, the process of mechanical

19
testing of a physical processor is not possible, and if it were the approximate two month testing period
would be restrictive [36]. Testing is therefore limited to HDL testing and IC layout verification.

Finally in terms of performance, the least favored methods of comparison, latency and throughput, will be
the only metrics available for making a comparison between the extended LTH and other Floating Point
Units. This is because without a fabricated design it is impossible to verify the operational frequency or
power requirements. This makes running any application specific benchmarks [29] and generating any
FUPA metric [11] impossible. The other reason that running benchmarks is not possible is that while the
extended LTH may conform to the original LTH-Leon2 interface [2], the Leon2 needs some modification
to operate correctly with the new design, which is beyond the scope of this work.

2.9 Conclusion
Extending the LTH to perform division, multiplication and square root operations for single and double
precision floating point numbers can be accomplished by adding parallel pipelines to the
addition/subtraction pipeline. The best approach for designing a multiplier relies on the Samsung design
for the modified Booth encoding and CSA tree. However there are many alternate approaches to the final
wide fast adder, which is confined to those that are implementable using standard cells.

The optimal architecture of a joint division and square root unit is a layered subtractive radix-2 SRT style
approach, benefiting from quotient selection logic outlined by Sun. This design is appropriate because of
its independence from the multiplier, its ability to be layered to perform higher radix computations and
meet timing requirements.

In terms of implementation, the design decisions are fairly well dictated. To achieve a cycle time
comparable to that of industry designs using a similar process technology, optimizing for delay at the
expense of area is necessary, as a result of being confined to a standard cell library. The concept of using
standard cells for implementation also means a host of other designs are not feasible, including custom
transistor 4:2 counters and pseudo CMOS Ling adders. Further, since the design is intended to be
converted into an IC layout, a modular design is also beneficial. This can be seen in the selection of a CPA
for the wide adder, where a 106-bit CSEA made of 16-bit CLA blocks offers a more regular layout than the
random logic offered by a completely synthesized approach.

20
3 Design

3.1 Introduction
This section covers the basic details of the computation of the fractional component of double precision
floating point numbers. It illustrates the underlying elements for the multiplication, division and square
root designs selected. It also provides some details of how this new functionality can be incorporated into
the existing LTH architecture.

3.2 LTH Floating Point Unit

3.2.1 Introduction
The LTH is an open source FPU distributed with the Leon2 [3]. It is a partial implementation of a floating
point unit that uses the double precision format for all calculations. This section presents the original
datapath organization and the required changes for integrating missing functionality.

3.2.2 Original Design
The original LTH [3] is divided into five stages (Figure 3.1), “Decode”, “Prenorm”, “Add/Sub”, “Postnorm”,
and “Roundnorm”. The Decode stage unpacks the operands and instructions and checks for special
conditions. Prenorm, the second stage, formats the operands for addition or subtraction, by placing them
in the correct order and performing any necessary shift operations. The third stage Add/Sub, performs
the addition or subtraction operation. The Postnorm stage normalizes the result for the final stage, which
rounds and packages the result ready for output.

Input

Busy Signal Decode
Stage 1

Prenorm
Stage 2

Add/Sub
Stage 3

Postnorm
Stage 4

Roundnorm
Stage 5

Output

Figure 3.1 LTH Overview

This presents several problems, mostly related to the time taken to execute different operations. In the
case when an invalid operation is sent to the LTH, an exception should be generated by the floating point
unit indicating a problem with the instruction or operand [11]. In this design, when the error is known at

21
the end of the first stage, it takes a further four stages for the exception to be written to the output. The
same is true for the comparison operation which requires less then the five cycles. While this may not
appear to be a problem, since the Leon2 is stalled while any FPU operation is underway, for an invalid
operation, the CPU may have to wait four cycles for the FPU to do nothing.

3.2.3 Integrating Additional Functions
Input

Decode

Stage 1

Prenorm Multiplier DIV/SQRT

Stage 2

Add/Sub Normalization Normalization

Stage 3

Postnorm

Stage 4

Roundnorm

Stage 5

Output

Figure 3.2 Extending the LTH

Integrating the additional features, such as division, square root, and multiplication, requires extending
the LTH’s decoder, modifying its rounding stage to accept inputs from multiple datapaths, and
redesigning its busy signal unit. Further, an extra path from the decoding stage to the last stage should be
added, which for a division operation with invalid operands would avoid over 57 stalled CPU cycles. The
key here is that essentially the decoder recognizes all valid operations and correct operands for the new
features, and can dispatch the operation down the correct datapath. At the end of each of datapath, where
a result is available, the already defined rounding module can be utilized to round and package it for
output. This leaves only the design of a multiplier, divider, square root unit, and associated normalization
units required to construct a full FPU (Figure 3.2).

22
3.3 Multiplication

3.3.1 Introduction
Multiplication Operation Properties
Property Value
Operation
Operand A Range (Dividend) 1≤A<2
Operand B Range (Divisor) 1≤B<2
Result R Range 1 ≤ R < 4 (May Require left shift and exponent
increment)
Result Sign Sign A XOR Sign B
Result Exponent Exponent A + Exponent B - Bias
Figure 3.3 Floating Point Multiplication Operation Properties

Multiplication includes four logically separate stages, partial product generation, partial product addition,
a final carry propagate addition, and result normalization. To accomplish these tasks Booth encoding, a
full CSA tree, a fast CSEA CPA and a left shifter were designed. This section outlines the fundamentals
behind the algorithms used to implement each of the four stages.

3.3.2 Partial Product Generation
The first part of designing a multiplier circuit is to produce the partial products that have to be summed.
Modified Booth-2 radix-4 encoding was selected, as it produces half the number of partial products as an
un-encoded method, and does not require the calculation of any hard multiples as would be needed by a
higher level of encoding.

Booth-2 Products
Xj+1 XJ Xj-1 Booth Multiple Dir. Shift Add
0 0 0 0Y 0 0 0
0 0 1 1Y 0 - 1
0 1 0 1Y 0 - 1
0 1 1 2Y 0 1 0
1 0 0 -2Y 1 1 0
1 0 1 -1Y 1 - 1
1 1 0 -1Y 1 - 1
1 1 1 0Y 1 0 0
Figure 3.4 Modified Booth-2 Encoding

Instead of a simple AND operation on a bit J from operand A with all bits of operand B to produce a single
partial product (Figure A.1), Booth encoding uses three bits from operand A to generate a multiple of
operand B (Figure A.2). Essentially this requires constructing an encoder for the truth table presented
above (Figure 3.4), and a set of connected MUX to select the correct multiples of operand B. The design
from Samsung [15] offers a full description of both the Booth encoders and the MUX required for

23
implementation. Their encoders produce three select signals, “add”, “dir”, and “shift” to generate the
require Booth multiple.

3.3.3 Multi-Operand Addition
The modified Booth encoding produces 27 54-bit partial products in a 1’s complement form. To convert
these to the required 2’s complement, the sign bit can also be extracted, and added to the shifted products
LSB. For the multiplication of radix-2 unsigned numbers, the shifted results can be simply added
together. However, when signed numbers are concerned, a special sign extension method (Figure 3.5) is
generally used to keep the operands short [14], which significantly reduces area. The alternate would be
to sign extend each partial product to the maximum width of the operation, which is obviously
undesirable.

S S S • • • • • • • • • •
1 S • • • • • • • • • • S
1 S • • • • • • • • • • S
1 S • S• • • • • • • • • S
• • • • • • • • • • S

Figure 3.5 Partial Product Sign Extension

Adding these products together using a carry save adder CSA tree can be done through at least seven
levels [8] of 3:2 counters (also known as full adders). The design presented by Samsung, offers an
optimized form of carry propagation (Figure 3.6), where each counter has bit zero of its partial product
input port optimized for late arrival (Figure 3.7). This allows for counters to receive an input to their
partial product input port from their neighbouring counter’s carry out with no added delay.

0 S S S • • • • • • • • • •
0 C 1 S • • • • • • • • • • 0 S
0 C 1 S • • • • • • • • • • 0 S
1 S • • • • • • • • • • 0 S
• • • • • • • • • • 0 S

Figure 3.6 Partial Product Counter Over Sizing

24
4:2 Counter 4:2 Counter 5:2 Counter
GND

FA
GND

FA FA FA

FA FA FA

CARRY CARRY CARRY

SUM SUM SUM

Late Arrival

Figure 3.7 CSA Counter Configuration

While the Samsung design is fairly complete, it does present some concerns. It proposes a maximum
counter size of 28:2, which suggest that in a column there will be 28 bits to sum. As the last product
generated is never negative, because the MSB of the other operand is always zero (operand B [52:0], MSB
for Booth encoding is bit 53). As a result, the final partial product will always be a positive multiple,
which makes its sign also zero. This limits the maximum height of a column to 27, and the required
counter size to 27:2. In terms of implementation, this requires a range of counters to be constructed,
from 27:2 to 4:2, and have port zero optimized for late arrival to handle carry propagation. The other
result of carry propagation, is that some counters need to be oversized to handle the extra carry signals
generated by previous counters (Figure 3.6).

3.3.4 Fast Carry Propagation Addition
For the final stage, the wide adder, two candidates exist for performing the CPA. The fastest adder, but
also most area intensive is a Carry Look Ahead adder CLA [14], which is only suitable for at most 16 bits.
However, a hybrid approach using a Carry Select Adder CSEA comprised of 16-bit CLA blocks provides
most of the speed advantages of the CLA, without the vastly increased area requirements that would be
expected from such a wide adder [17]. The second possibility for performing the wide CPA is a fully
synthesized adder [8].

Using a 16-bit CLA with the output signal for bit 16 as a block carry output, it is possible to construct a
larger adder. This effectively makes each 16-bit CLA with no carry-out signal into a 15-bit adder cell with
both carry-in and carry-out signals defined. A 15-bit CSEA block can be made by using two of the CLA
blocks to compute the addition of the same input bits but with different carry in signals. Then a 106-bit
CPA can be constructed by chaining several 15-bit CLA CSEA blocks together, and for the most significant
block where no carry out signal is required, the 15-bit CSEA block can be used as a full 16-bit Adder cell.

25
Thus, a 106-bit CSEA can be built from 13 16-bit CLA’s, with the delay of slightly more then one 16-bit
CLA and six 16:8 multiplexes (Figure 3.8).

16-bit GND 15-bit GND 15-bit

GND
16-Bit CLA VDD 16-Bit CLA VDD 16-Bit CLA

16-Bit CLA GND 16-Bit CLA GND

MUX MUX
Carry out Carry out

Carry out

Result[15] Result[15] Result[15]
Result[14:0] Result[14:0] Result[14:0]

1-bit 15-bit 15-bit 15-bit

Result

Figure 3.8 106-Bit CSEA using 16-bit CLA Blocks

3.3.5 Normalization and Sticky Bit Generation
The result of a double precision operand multiplication is 106-bits long in the range of 1 ≤ R < 4. This
implies that if the leading bit is a one, the result is above the required range, and to normalize it a left shift
is needed. If a left shift is required, the exponent must also be incremented. At this stage, when there is
more than the required 53 bits of precision available, before the extra 50 or 51 bits are discarded, a sticky
bit “T” must also be generated. This sticky bit indicates if the truncated bits are all zero and is required
for the next stage, rounding.

26
3.4 SRT Division

3.4.1 Introduction
SRT Division Operation Properties
Property Value
Operation

Operand X Range (Dividend) 1 ≤ X < 2 and X ≤ D (Right shift X, 0.5 ≤ X < 1)
Operand D Range (Divisor) 1≤D<2
Result Q Range 0.5 ≤ Q < 2 (With right shift of X, 0.25 ≤ Q < 1)
Result Sign Sign X XOR Sign D
Result Exponent ( Exponent X – Exponent D ) + Bias
Figure 3.9 Floating Point Division Operation Properties

SRT is a non-restoring residual recurrence approach to division that estimates a constant number of
quotient bits per iteration [8] [11]. The estimation is performed by a quotient selection function QSLC,
which defines the type of operation that is to occur during an iteration. Further, as each quotient bit is
produced, it must be incorporated into the existing quotient. To accommodate for the possible negative
values, a scheme called “on-the-fly” conversion is presented. While predominately the designs presented
are for a radix-2 SRT division unit, extending the base architecture to perform calculation at a higher
radix is also possible.

3.4.2 Radix-2

Where
I = 0, 1, 2, 3,..., m
Q0 = 0, PR0 = X/2
Figure 3.10 SRT Radix-2 Division Equations

For a radix-2 or a bit-at-a-time implementation (Figure A.4), the iteration formula can be summarized in
the above figure. In its simplest form, it involves examining the most significant bits of a partial
remainder and performing an operation based on the result. Depending on the implementation of the
QSLC, the time to determine the correct operation can be quite long. By calculating all three possible
results in parallel to determine which operation to perform (Figure 3.11), it is possible to reduce the
critical path significantly [21].

27
Registers

Critical Path

QSLC -1 Product 0 Product +1 Product

MUX Q Q-1
Logic

Figure 3.11 SRT Radix-2 Division Overview

Generating the three different products for the new partial remainder when a carry-save form residual is
used (Figure 3.13), can be accomplished by using one level of Full Adders. When “qi+1” is zero, a left shift
of the partial remainder is sufficient. When “qi+1” is negative one or positive one, the new partial
remainder is the result of a left shifted remainder and addition or subtraction of the divisor respectively.
For the subtraction of the divisor, a 2’s complement form can be used, which involves only inverting the
divisor for the positive product and adding a one to the carry-in for the LSB.

Registers

2PRsj 2PRcj Dj 2PRsj 2PRcj Dj

QSLC

CSA CSA
-PScj+1

+PScj+1
-PScj
+PScj

- Product 0 Product + Product

MUX

PRsj (i+1) PRcj (i+1)

Parallel Product Cellj

Figure 3.12 SRT Radix-2 Division Circuit

To obtain the correct 53-bit result at least 56 iterations are required. These iterations are used to
produce 53 bits for the result, two guard bits G, and a round bit R. The first guard bit is used for the case
in which 0.25 ≤ Q < 0.5 and the second guard bit used to correct for any over estimation in the last
produced bit of the quotient. The round bit R, is a required bit for all floating point operations, it

28
effectively calculates the result to an extra level of precision which can be used in subsequent rounding
stages [8].

3.4.3 Quotient Selection
Quotient Selection
CPA Result QSL Product
000.0 and 5th MSB of PRs and PRc = 0 0
0XX.X +1
111.1 0
1XX.X -1
Figure 3.13 SRT Radix-2 Quotient Selection

Quotient selection is the process of estimating the next bit in the final quotient. It does this by sampling
several of the most significant bits of the partial remainder and estimating the next “qi+1” quotient bit
(Figure 3.13). Each estimate is either correct or an over estimation of the next quotient bit, which may
require correction at a later iteration, which is why the negative quotient bit is required. Further, since
only several of the most significant bits are sampled, the partial remainder is not required to be fully
calculated, and a redundant form is suitable [11] [24], such as carry-save. However, traditional QSL does
not function as expected for an exact result, such as “25/5”. An improved version proposed by SUN
Microsystems [25] corrects this behaviour, by modifying the case where the first four most significants
bits of the partial remainder sum to zero. In terms of implementation, this requires a fast adder to sum
the four most significant bits of the residual and some form of logic to determine which case fits.
Alternatively, by examining each case, a ROM containing slightly more then 256 entries can be
constructed [35].

3.4.4 Partial Quotient Generation
Every iteration of a radix-2 operation generates a new quotient bit “qi+1”. Since this bit can be positive or
negative, it would at first appear a CPA might be required for negative results. By using the “On-The-Fly”
conversion method [11], two registers Q and QM are maintained, which remove the need for a full CPA of
the quotient (Figure A.6). Q is the current quotient at iteration I, and QM is its negative factor (Qi -2-i).
Calculating Q and QM involves selecting one of the two registers and concatenating a value to its end in
position “2-i” (Figure 3.14).

In terms of implementation, this requires less than two 2:1 MUX and the distribution of the new quotient
bit. However, since the parallel product generation also includes a 3:1 MUX for selecting the correct
product, the critical path of a single iteration is only increased by less then a 4:1 MUX.

29
Figure 3.14 Radix-2 On-The-Fly Conversion Equations

Figure 3.14 Radix-2 On-The-Fly Conversion Equations

The radix-2 form of these equations is quite simple in comparison to the generic formula. This leads to a
basic design for a single bit of Q and QM logic, which is depicted in the next two figures. The only
complication then becomes knowing what “2-(i+1)” is during an iteration. By having a single hot state
machine called “I”, where a single high signal is propagated from left to right in a register one bit per
iteration, it becomes trivial to select the bit position that will be changed.

Qj QMj Ij

Q if QSL ≥ 0 e.g. QSL = 01 or
QSLMSB
QSL MUX 00 if QSL < 0 e.g. QSL = 11
QM

QSLLSB

If Bit is in position 2-(i+1)
Bit = QSLLSB
MUX
Else Bit = Bit read from selected
register
Qj

Figure 3.15 Radix-2 On-The-Fly Conversion for Q

Qj QMj Ij

Q if QSL > 0 e.g. QSL = 01
QSL NXOR2 MUX QM if QSL <= 0 e.g. QSL = 00 or
11
QSLLSB

If Bit is in position 2-(i+1)
Bit = ~QSLLSB
MUX
Else Bit = Bit read from selected
register
QMj

Figure 3.16 Radix-2 On-The-Fly Conversion for QM

30
3.4.5 Normalization and Sticky Bit Generation
Like multiplication, the division operation may produce a result not in the required range, in this case 0.5
≤ Q < 2. By detecting if the leading quotient bit is a zero (accounting for the initial required dividend right
shift), the range of the quotient can be seen to either be in the range 0.5 ≤ Q < 1 or 1 ≤ Q < 2. In the case of
the first, the quotient must be left shifted and the exponent decremented. However, unlike multiplication,
the sticky bit T is more complicated to calculate. For division and square root, it is produced from a
residual in carry save form [25] (Figure 3.17). Since it is not based on the quotient, it is normally
performed in the same stage as the recurrence iteration.

Where M = Residual Length

Figure 3.17 Stick Bit Formula for a Redundant Residual

3.4.6 Radix-4

Registers

Critical Path QSLC +1 Product 0 Product -1 Product

QSLC QSLC QSLC
MUX

+1 Product 0 Product -1 Product
MUX

MUX

Figure 3.18 SRT Radix-4 Division using Layer Radix-2 Stages

The largest delay of an SRT division algorithm is the quotient selection logic circuit QSLC and the fan-out
of its output. While it is possible to compute at a higher radix, the QSLC becomes more complicated, and
its delay and area increase [25]. Overlapping two lower radix stages allows for the second QSLC to begin
before the first has completed, essentially leaving only one radix-2 QSLC in the critical path (Figure 3.18).
This is done by calculating the second quotient products for the next stage for each of the three possible
products from the first stage before the correct product from the first stage has been selected. Once the

31
first stage QSLC has completed, and the correct product has been selected, the next parallel product
generation can commence, and by its completion, the second stage QSLC should have also completed.

In terms of the circuits described for the parallel product generation for radix-2, at radix-4 they remain
unchanged. While no extra logic may be required, the five most significant bits of each parallel product
need to be available for output to the second level QSLC before the correct product has been selected.

3.4.7 Higher Radix
It is possible to compute at radix-8 using three overlapped radix-2 stages. Essentially this would have a
critical path of a one QSLC, three 3:1 MUX and two parallel product generation stages. While this may be
possible, as the SUN design shows [25], using standard cells and an AMI0.5 process, the increased delay
and area compared to a radix-4 design may be restrictive. It is also important to remember that the
above diagram for radix-4 division does not indicate the extra delay of the first stage new input selection,
sticky bit generation, nor the partial quotient calculation. These are important to remember, as while it
may be possible to combine three stages in under 8ns, the extra delay may be problematic. Further, any
higher then radix-8 and the additional parallel product generation delay becomes a limiting factor [23].

3.5 SRT Square Root

3.5.1 Introduction
SRT Square Root Operation Properties
Property Value
Operation
Operand X Range 1≤X<2
Operand D Range N/A
Result Q Range 1≤Q<2
Result Sign Sign X (Should be positive from decoding)
Result Exponent

Figure 3.19 Floating Point Square Root Operation Properties

The SRT approach can be adapted to handle the square root operation [11]. However, unlike division,
where the integer SRT algorithm could handle the floating point numbers, the square root iteration has
some minor differences [37]. These differences can be seen by comparing the formula presented for both
division and square root, and noting the value of PRs0 for both. For the most part, the division and square
root algorithms can share an underlying design, and most approaches to speeding up SRT division are
applicable to the square root operation. This section covers a base SRT square root implementation and
the elements that make it different to division, such as the creation of a functional divisor.

32
3.5.2 Radix-2

Where
I = 0, 1, 2, 3,..., m
Q0 = 1, QM0 = 0, PR0 = (X-1) for an even exponent
Q0 =0, QM0 = 0, PR0 = (X/2) for an odd exponent

Figure 3.20 SRT Radix-2 Square Root Equations

The formula for SRT square root (Figure 3.20), unlike division has four base factor PRc, PRs, 2Q, and 2-
(i+1), which can be reduced to three PRc, PRs, and F(I) (Figure 3.23). This makes the approaches and
concepts of reducing the delay of SRT square root architectures (Figure 3.21) similar to that for division
(Figure 3.11). The only added complexity is the computation of F(i), which must occur before an iteration
can start.

Registers

F[i] –F[i] Generation

Critical Path

QSLC -1 Product 0 Product +1 Product

MUX Q Q-1
Logic

Figure 3.21 SRT Radix-2 Square Root Overview

While for the positive product, F(i) could be produced as -(F(i)), as is shown later, it is simpler to keep it
in the form F(i) (Figure 3.24). However, unlike division, where there is a single divisor D, the square root
algorithm uses two, a separate one for both the positive and negative products.

33
2PRsj 2PRcj -F[i]j +F[i]j
QSLC

CSA CSA

-PScj+1
+PSj+1
-PScj
+PScj

- Product 0 Product + Product

MUX

PRsj (i+1) PRcj (i+1)

Parallel Product Cellj

Figure 3.22 SRT Radix-2 Square Root Circuit

For floating point numbers in the range 1 ≤ X < 2, the square root operation produces a result guaranteed
to be in the range 1 ≤ Q < 2, and the first bit of which is produced at the initial stage. This means only 54
iterations are required, 52 for the remaining 52 bits, one for the round bit R, and a guard bit G to correct
the last quotient bit produced.

3.5.3 Square Root Functional Divisor
The original square root formula (Figure 3.20) can be simplified to involve only three base factors instead
of four (Figure 3.23), with the new factor being a functional divisor F(i). Two different divisors, F(i) and –
F(i), are required for the positive and negative products. Both versions of F(i) can be generated by using
the two different versions of the current quotient, Qi and QMi. From these elements, the four base factors
of a square root operation, PRsi, PRci, 2Qi and 2-(i+1), can be treated as three factors, PRsi, PRci, and F(i) or -
F(i). The advantage of using three factors instead of four, is that the underlying division architecture can
be used and a single level of full adders is sufficient (Figure 3.12). In terms of implementation, this only
requires a left shift of Q or QM and two OR operations with the iterations marker.

34
Figure 3.23 SRT Radix-2 Square Root Functional Divisor Equations

3.5.4 Integration with Division

Figure 3.24 SRT Radix-2 Division and Square Root Equations

The division and square root operations present a great deal of similarities (Figure 3.24). Integrating the
two functions into a single unit requires the design of a mechanism for selecting the correct divisor
(Figure 3.25). In division, the divisor is a constant single factor, in a square root operation the divisor is
the result of multiple changing factors (Figure 3.23). By constructing the square root functional divisors
prior to their input in to the division iteration, the four base factors are limited to three and the circuit
only requires an extra MUX to choose the correct divisor for both positive and negative products. The
integrated square root and division architecture then has a critical path of the original division unit with
the addition of a MUX and F(I) generation (which at most is a three input OR gate).

35
2PRsj 2PRcj Dj - +F[i]j Operation
F(i)J

QSLC
MUX MUX

CSA CSA

-PScj+1
+PSj+1
-PScj
+PScj

- Product 0 Product + Product

MUX

PRsj (i+1) PRcj (i+1)

Parallel Product Cellj

Figure 3.25 SRT Radix-2 Combined Division and Square Root Circuit

3.6 Conclusion
Extending the LTH to provide most of the capabilities of a modern FPU requires the development of a
double precision divider, multiplier, and square root unit. To meet the clock frequency of 100MHz using
standard cells on a 0.5µ process, a focus needs to be placed on reducing delay. This can be done by using
a radix-4 multiplier that consists of modified Booth-2 partial product generation, a full CSA tree, and a
hybrid 106-bit CSEA. To accommodate for the division and square root operations, a combined SRT
radix-4 division and square root unit is proposed. While the design should meet the target frequency, it
may come with a large area cost.

36
4 Implementation

4.1 Introduction
This section presents an overview of the HDL modeling, testing, and synthesis stages, used to develop the
multiplier and combined division and square root unit designs into an IC layout ready form. Due to the
complexity and large volumes of code and scripts produced, only a high level overview of the important
aspects of each design is presented. For a full code listing please see “Code Listing”.

4.2 HDL Modeling

4.2.1 Introduction
This section discusses the modeling of the designs presented in the last chapter using the Verilog-97 HDL.
It elaborates on the choice of coding styles, ranging from direct ADK standard cell instantiation to
synthesized alternatives. Further, the processes of macro and micro cell optimization are explored, and
the reasons for performing these levels of modeling are investigated.

4.2.2 Multiplication
Input

Booth Partial Product Generation
Stage 1

CSA Tree
Stage 2

106-Bit CSEA
Stage 3

Normalization
Stage 4

Output

Figure 4.1 Multiplier Pipeline

4.2.2.1 Partial Product Generation
The partial product generation step for multiplication is by all accounts one of the fastest stages [8] [11]
[14]. A design from literature was directly implemented [15], which consisted of two cells. The first cell
was a Booth encoder that read three bits from one of the operands, to determine the correct Booth
multiple. The second cell, Booth MUX, read in a single bit from the second operand, and based on the
outputs from the Booth encoder performed a shift or invert operation with its neighboring cells. As the
design was already defined, the only addition to the schematics presented by Samsung was an inverter
tree. The inverter tree was used to buffer the output of a Booth encoder and provide the signals to all
associated Booth MUX for that partial product. Therefore, each partial product contained a Booth MUX,
three inverter trees (one for each bit of the Booth encoder’s output), and 54 Booth MUX (one for each

37
possible bit of the partial product). These partial products were then grouped into a single module called
“Booth_Partial_Products” that produced all the required partial products.

4.2.2.2 CSA Tree
The CSA tree was the most complex part of this thesis. It involved designing a range of counters from
27:2 to 4:2 using full adders, connecting them in such a way that there was only a critical path of seven
adders, and retrieving columns of bits from the rows of partial products produced from the last stage for
each counter’s input.

The first step was to build the required counters. The Samsung design[15] specifies carry inputs and
outputs for each counter on all seven levels. For each carry signal on a level, there must also be an
associated counter. By using this as a guide, all 23 counters could be constructed, with the only
optimization being that port zero of the initial partial product column input for a particular counter,
would be used as late as possible. It was then observed, upon synthesizing the HDL, that the produced
netlist did not contain any full adder cells (ADK “Fadd1”). Since the design implied the cells were to be
used, a style of coding was adopted that directly instantiated the ADK standard cells. As a result, the
Verilog code is no longer portable outside of an ADK environment.

Once all counters were created, the CSA tree needed to be constructed. This entailed placing 106
counters ranging from 27:2 to half adder cells, in such a way that the carry signals propagated correctly,
and the partial product rows were correctly aligned into columns for counter input. At first by hand, this
seemed an impossible task due to the overwhelmingly vast amount of ports and nets used. Opting for a
more automated approach, a PERL script was used. This resulted in an approximately 1800 lines long
computer generated module that required only minor changes.

4.2.2.3 106-bit Wide Adder
In some texts, a fully synthesized design [8] is suggested to provide sufficient results for the final
multiplication CPA. This proposal resulted in a module with a single line of HDL in its body. While initial
timing and area reports were promising, the goal of 100MHz could not be met with this design.

Using a hybrid CSEA design [17] made from 16-bit CLA blocks [12] a better result was obtained.
However, the delay achieved was very close to the acceptable cut-off limit of 8ns. Since this design was
explicitly described in a structural Verilog, a version using ADK standard cells directly instantiated was
also created. This proved to be a better candidate.

4.2.2.4 Normalization
The normalization section for the multiplier was relatively trivial. It consists of a MUX and several OR
gates. The MUX is used to left shift the required bits if needed, and the OR gates are used to detect if any
of the discarded bits are ones.

4.2.3 Division and Square Root

38
Input

Input Selection

Div4_Iteration

First Level QSLC Square Root Functional Divisor Generation

OTFC Q and QM Calculation Parallel Product Generation

Second Level QSLC Square Root Functional Divisor Generation

OTFC Q and QM Calculation Parallel Product Generation

Sticky Bit Calculation

Radix-4 Iteration

Normalization
Normalization Stage

Output

Figure 4.2 SRT Radix-4 Division and Square Root Pipeline

4.2.3.1 QSLC
The quotient selection logic is one of the most critical sections of the SRT design. SUN [25] and Parhami
[8] suggested either using a 4-bit fast adder with some combinational logic, or implementing it as a ROM.
Both avenues were explored, but there was some hesitation with the second [35].

The first approach called for a fast 4-bit adder. The fastest adder suitable for this word size was taken to
be the CLA [12], and due to it being only four bits, the increased area used by this style of adder is
assumed not to be a problem.

The next approach was to be a ROM version. When testing the original CLA in ModelSIM, a test bench was
created that contained a simple behavioral QSLC. By modifying this, to output the resulting quotient bit
for all variations of the most significant four bits, and where the result was all zero adding an extra
internal condition, a Verilog module was created. This module has only a single CASE statement with 256
entries, which Leonardo Spectrum recognized as a ROM.

4.2.3.2 Parallel Product Generation
The parallel product generation essentially performs all three variations of the formula described in the
last chapter for division and square root at the same time (Figure 3.24). Transferring the designs

39
presented in the integrated division and square root section directly into an initial design was the first
target. This was done by defining a basic cell for a single bit that produced a single bit result for all three
variations of “qi+1” and had a final MUX to select the correct one. This resulted in a cell with a large
number of ports requiring to be connected to at least 58 other cells. Moving directly to an automated
approach, a PERL script was created that would instantiate all the required bit cells, and provide two
inverter trees to buffer the input bits from the QSLC.

In the first version, ADK cells were directly used. However the parallel product cell’s internal logic
presented itself as slightly more random then that of the CSA tree. By creating a second version of the
basic fraction bit cell, which used Verilog primitives and synthesizable full adder cells, as in the last
section, an alternative was produced. Initial timing reports were promising. However, it is important to
remember that the timing report tool assumed all inputs will be available at time zero. This is not the
case as the new quotient bits will arrive late as they have to be first generated and propagated. The
solution was to synthesize the full adders and provide the produced netlist as an input to a parallel
product bit cell synthesis run with the hierarchy set to “PRESERVE”.

The final step was to create a netlist for all bits using the synthesized bit cell. This could be described as
macro optimization of micro optimized cells. By noticing that the synthesis tool did not modify the types
of gates used when ADK cells were directly called in a module, a twofold approach was developed. The
first step was to synthesize the bit cells with the behavioral full adders. The second step was run the
synthesis tool again, providing it with the full 59 bit parallel product generation module and synthesized
netlist as inputs. The result is that synthesized logic is confined to a single micro cell, and the induced
added delay due to random logic wiring irregularity is prevented from propagating to the macro cell.

4.2.3.3 On-The-Fly Conversion
On-the-fly conversion is used to update the quotient at the end of each iteration. Its complexity derives
from the fact it must handle negative quotient bits. During the last chapter, a high level design was shown
for the updating of both Q and QM (Figure 3.14). This led to a basic module for a single bit consisting of
several MUX levels, which was expanded and replaced by two levels of NAND gates. These cells were
optimized in the same way as the parallel product generator, in that they were synthesized at the cell
level, linked together using a PERL script, and synthesized again together using the macro cell and the
synthesized micro cell netlists.

4.2.3.4 Radix-2 Iteration
This was one the easiest modules to design. Its purpose was to perform a single radix-2 iteration of a
division or square root operation on the fractional component of the operands. Essentially it only used
four sub modules. These include a QSLC, parallel fraction product generator, on-the-fly conversion, and a
square root functional divisor module. The inputs and outputs were all clearly defined by the internals.
While not used in the final design, it was an invaluable tool for testing, since with a higher radix, the
outputs are the result of multiple iterations, which may make it unclear where a problem has occurred.

4.2.3.5 Radix-4 Iteration

40
This module serves the same purpose as the radix-2 version. The difference is that it implements two
layered radix-2 iterations with overlapped QSLC. It uses slightly more then double the gates, as it
requires a total of four QSLC blocks instead of one.

4.2.3.6 Normalization
The normalization stage for division is much simpler than that of multiplication. It takes in a 57-bit
quotient and returns a 54-bit result, 53 bits for the quotient and an extra round bit R. For a square root
operation, the required result is the first 54 bits, where for division it is 54 bits but from the second or
third MSB. Essentially this can be done by using three levels of MUX.

4.3 Synthesis

4.3.1 Configuration
Mentor Graphics Leonardo Spectrum Synthesis Level 3 Configuration
Property Settings
Temperature 27
Voltage 5
Max Fan-out Load 5
Register 2 Register 10ns (Data arrival time 8.0ns)
Input 2 Register 10ns
Register 2 Output 10ns
Optimization Timing with Standard Effort (maximum possible)
Figure 4.3 Leonardo Spectrum Settings

Leonardo Spectrum was used to generate netlists for behavioral HDL modules and to flatten structural
hierarchies in preparation for layout. Further, it was also utilized to provide timing and area reports, and
to verify fan-out of no more than five for all key modules. While for the most part, the HDL was designed
in a structural form, the buffering and stage connections were done using a high level form of Verilog-97.

The most important elements of the above configuration are the timing settings. While any data arrival
times produced by Leonardo will only be a sum of the delay through each component, they do not include
wiring delays. As a result, any pre-layout timing will be increased post-layout. To counter for this, for a
cycle time of 10ns, approximately ~2ns must be set aside. For any design unit, a synthesized register to
register data arrival time of 8ns is assumed to indicate that it can meet a cycle time of 10ns post layout.

4.3.2 Automation and Reporting
The Mentor Graphics synthesis tool, Leonardo Spectrum, when run in a GUI mode was found to be highly
unresponsive. In comparison, the same tool when used under a terminal environment was more reliable.
This led to a series of scripts for generating the netlists and reporting timing and area reports.

4.3.3 IC Preparation
When creating an ADK netlist by flattening or synthesizing a design, the output from Leonardo may be
incompatible with the IC layout tool. This occurs in two areas, the first being signals assignment, and the

41
second power supply component problems. It is unclear why both of these issues occur, but the
workaround is to manually alter the produced netlist (Figure 4.4).

Preparing a Netlist for IC Layout
Synthesis Result IC Layout Compatible Verilog
Fake_gnd gFgXX(.Y(wGnd)); // Replace fake ground
Supply0 wGND;
Fake_vcc gFvccXX(.Y(wVdd)); // Replace fake power
Supply1 wVcc;
Assign j = k; // Replace assign statements with fastest ADK buffer
Buf02 gJKXX( .Y(j) , .A(k) );
Figure 4.4 Preparing a Netlist for IC Layout

After synthesis and some fine tuning of the produced netlist, it was time to proceed to creating a layout.
This is done in two stages; the first is to import the netlist into the schematic editor to generate a
technology viewport. The second stage uses this viewport to generate the layout. By synthesizing each
pipeline stage, and feeding the produced netlist into the synthesis run of the macro cell, where essentially
all pipeline stages are instantiated with the addition of registers, a generic floor plan can be created. For
example, synthesizing the multiplier, the netlists for the Booth partial products, CSA tree, and CPA are
required. For this to produce the correct result, the synthesis run should have its hierarchy set to
preserve. In this way both a semi automated or fully automated IC layout is possible.

4.4 Conclusion
The implementation of division, multiplication, and square root units is a complicated task. By designing
the units in a hierarchical manner, where each micro or leaf cell is modeled using both ADK standard cells
and Verilog primitives, a variety of options becomes available for the final implementations. To handle
the large number of interconnects required, especially for the CSA tree, an automated approach using
PERL was adopted. Also, scripting was further embraced to design the QSLC, where a Verilog test bench
for the CLA QSLC was modified to generate a full ROM QSLC.

42
5 Results

5.1 Introduction
This section presents the results of the HDL design in terms of area, timing, and IC layouts, then discusses
the reasons for the achieved area and timing for both the multiplier and combined division and square
root units, and elaborates on why integrating them into the original LTH was not possible. Finally a
comparison between the project performance of the designed architecture and industry implementations
is presented.

5.2 Timing and Area Analysis

5.2.1 LTH
LTH Pre-Layout Data Arrival Times (ns)
Module AMI 0.5µ Slow AMI 0.5µ Typical AMI 0.5µ Fast
Decode 8.65 4.64 3.13
Pre- normalization 23.72 12.59 9.94
Add/Sub 26.66 17.68 11.66
Post- normalization 30.37 15.98 11.23
Round-normalization 23.01 10.76 9.87
LTH (all stages and buffers) 36.87 24.73 16.78
Modules do not include buffering unless specified. *Unused Cells or Provided for Information Purposes only.
Figure 5.1 Original LTH Timing Analysis

LTH Area by Gate Count
Module AMI 0.5µ Slow AMI 0.5µ Typical AMI 0.5µ Fast
Decode 631 615 582
Pre-Normalization 1947 1771 1720
Add/Sub 5407 3975 4213
Post- normalization 2582 2384 2654
Round- normalization 1147 1049 1018
LTH (all stages and buffering) 11547 10200 9859
Modules do not include buffering unless specified. *Unused Cells or Provided for Information Purposes only.
Figure 5.2 Original LTH Area Analysis

The above is a timing and area analysis of the Original LTH FPU HDL separated into pipeline stages. By
the authors own admission [3], it is not optimized. The HDL is coded in a very high level of behavioral
VHDL, and as a result, the timing is much higher then desired. Using the maximum possible
computational effort of the synthesis tool, the original LTH fails to meet the required cycle time of 10ns,
which becomes worse considering this is pre-layout, and does not include any allowance for wiring delay
incurred during layout. While it was possible to generate the above tables with apparently borderline
results, the time taken often exceeded a week for synthesizing a single stage for a single variation of the

43
AMI 0.5 micron process. The situation is worsened by most stages requiring additional logic, which
would add extra delay, and more than likely increased synthesis time. This then begs the question,
whether the task is one of integrating, or completely redesigning the LTH.

Some possible solutions to fixing the large delay would involve splitting each of the pipeline stages that
exceed the 8.5ns for data arrival. For example, the three internal logic operations of the “Add/Sub” stage
could be split into three stages, namely a pre-shift, an addition using the multipliers CSEA, and a result
conversion. Since several stages require a similar form of pipeline stage splitting, this method would
move the number of stages for addition or subtraction operations closer to 10. This latency does not
compare well against other FPU implementations [14], which normally perform these operations in fewer
than five cycles. The problem is magnified by the fact the Leon2 stalls until a single FPU operation has
completed, and extra stages in the pipeline have significant effects on overall system performance.

5.2.2 Multiplier
Multiplier Pipeline Pre-Layout Data Arrival Times (ns)
Module AMI 0.5µ Slow AMI 0.5µ Typical AMI 0.5µ Fast
Partial Product Generation
Booth-2 Product Generation 6.26 3.59 2.31
Carry Save Adder Tree
CSA Tree (ADK Std. Cells) 21.06 11.68 7.34
*Synthesized CSA (Area) 13.74 7.88 5.03
*Synthesized CSA (Delay) 15.64 7.70 5.01
106-Bit Final CPA
106-Bit CSEA (ADK Std. Cells) 18.75 10.17 8.02
*Synthesized CSEA (Cells Preserved) 28.22 12.60 10.04
*Synthesized 106-Bit Adder 48.67 20.77 11.84
Normalization (Fraction Adjustment)
Normalization 4.56 2.66 1.73
Multiplier (Includes: Buffers and Exponent)
Radix-4 Multiplier 24.04 13.52 8.91
Modules do not include buffering unless specified. *Unused Cells or listed for Information Purposes only.
“Area” and “Delay” refer to synthesis optimization targets that may deviate from normal Leonardo Spectrum settings.
Figure 5.3 Multiplier Timing Analysis

Multiplier Pipeline Area by Gate Count
Module AMI 0.5µ Slow AMI 0.5µ Typical AMI 0.5µ Fast
Partial Product Generation
Booth-2 Product Generation 22946 22946 22946
Carry Save Adder Tree
CSA Tree (ADK Std. Cells) 7719 6415 6415
*Synthesized CSA (Area) 8202 7998 7997

44
*Synthesized CSA (Delay) 15303 13875 14397
106-Bit Final CPA
106-Bit CSEA (ADK Std. Cells) 3128 2963 2931
*Synthesized CSEA (Cells Preserved) 3074 3260 3126
*Synthesized 106-Bit Adder 2445 1588 1930
Normalization (Fraction Adjustment)
Normalization 188 185 185
Multiplier (Includes: Buffers and Exponent)
Radix-4 Multiplier 48908 47131 47132
Modules do not include buffering unless specified. *Unused Cells or listed for Information Purposes only.
“Area” and “Delay” refer to synthesis optimization targets that may deviate from normal Leonardo Spectrum settings.
Figure 5.4 Multiplier Area Analysis

Generating the partial products is one of the fastest operations in the multiplication process as is shown
by the timing analysis for “Booth-2 Product Generation” (Figure 5.3). It is also one of the largest in terms
of gate count. This can be explained by the large number of Booth MUX required (1458), with one needed
for each bit of each partial product (27 products each 54-bits long). Further, connecting a Booth encoder
to each Booth MUX in a Booth product required buffering to solve the fan-out problem. The solution was
to use an inverter tree, with three of these required for each partial product, and a total of 81 for all the
partial products. At first it may appear that this stage could include more logic levels. However, the extra
slack is beneficial for placing the input registers for the next stage in a more optimal position for the CSA
tree.

Adding the generated partial products using a full binary tree of Full Adder cells achieved a balanced
result for both area and timing. This stage whether implemented using Full Adders or 4:2 counters,
requires a significant number of Adder cells, with long carry propagation paths. As it can be seen from
the timing analysis, the CSEA is actually slower then the CSA, meaning that a satisfactory delay was
achieved for the Adder tree using ADK Fadd1 cells. While it is clear that a 4:2 counter approach (Figure
B.1) would offer a significantly shorter delay, it would also come at a vastly increased gate count (Figure
B.2), and no overall increase in performance. Further, the synthesized options, while apparently faster,
may have a significantly worse critical path post-layout then the one using Fadd1 cells. Thus while a 4:2
counter and fully synthesized CSA trees offer increased pre-layout performance, the gain is either
irrelevant or unlikely to be realized post-layout and comes with a cost of significantly more gates.

For the final stage it was initially hoped that a synthesized 106-bit CPA may be able to provide
satisfactory performance in terms of both gate count and delay. As the above timing and area analysis
shows, the custom ADK standard cell based CSEA is around 33% faster then the fully synthesized 106-bit
Adder option on a fast process, but at a cost of 50% more gates. The synthesized version of the CSEA
provides no advantages for either delay or area. To meet the target of a 10ns or approximately 8ns pre-
layout cycle time, the only viable option is to use the ADK std. cell version of the CSEA on a fast process.

45
5.2.3 Combined Division and Square Root
Combined Divide Square Root Pre-Layout Data Arrival Times (ns)
Module AMI 0.5µ Slow AMI 0.5µ Typical AMI 0.5µ Fast
Quotient Generation (Q and QM)
*OTFC (ADK Std. Cells) 4.84 2.79 1.79
OTFC (Syn.) 4.15 2.49 1.61
Quotient Selection Logic QSLC
*QSLC (CLA) 8.26 4.61 2.91
QSLC (ROM) 3.26 1.83 1.15
Square Root Function Divisor (F(i) and –F(i))
Square Root Functional Divisor 0.70 0.42 0.28
Parallel Product Generation (Includes: Adders, qi+1 Inverter Trees, and Selection MUX)
*Bit Slice (ADK Std. Cells) 4.97 2.81 1.77
*Bit Slice (Syn.) 3.77 2.01 1.34
Bit Slice (Micro Syn.) 4.30 2.43 1.54
Sticky Bit Generation 4.19 2.35 1.55
Iterations (Includes: QSLC, Functional Divisor Calculation, Parallel Product Generation, and OTFC)
*Radix-2 Iteration (ADK Std. Cells) 10.68 6.06 3.92
*Radix-2 Iteration (Syn.) 9.30 6.19 4.02
*Radix-2 Iteration (Micro Syn.) 9.64 5.89 3.83
Radix-4 Input Selection (Syn.) 4.39 3.09 2.09
*Radix-4 Iteration (ADK Std. Cells) 17.51 9.95 6.57
*Radix-4 Iteration (Syn. Product) 14.01 9.90 6.41
Radix-4 Iteration (Micro Syn.) 16.03 9.39 6.06
Combined Division and Square Root Unit (Including Buffers and Exponent)
SRT Divider Radix-4 22.08 12.92 8.93
Modules do not include buffering unless specified. *Unused Cells or Provided for Information Purposes only. Radix-4 iteration also
includes sticky bit generation.
Figure 5.5 Combined Division and Square Root Timing Analysis

Combined Divide Square Root Area by Gate Count
Module AMI 0.5µ Slow AMI 0.5µ Typical AMI 0.5µ Fast
Quotient Generation (Q and QM)
*OTFC (ADK Std. Cells) 856 856 856
OTFC (Syn.) 779 652 652
Quotient Selection Logic QSLC
*QSLC (CLA) 46 46 46
QSLC (ROM) 33 41 42
Square Root Function Divisor (F(i) and –F(i))
Square Root Functional Divisor 251 251 250

46
Parallel Product Generation (Includes: Adders, qi+1 Inverter Trees, and Selection MUX)
*Bit Slice (ADK Std. Cells) 34 34 34
*Bit Slice (Syn.) 38 33 37
Bit Slice (Micro Syn.) 29 29 29
Sticky Bit Generation 423 420 417
Iterations (Includes: QSLC, Functional Divisor Calculation, Parallel Product Generation, and OTFC)
*Radix-2 Iteration (ADK Std. Cells) 3156 3155 3155
*Radix-2 Iteration (Syn.) 3125 3124 3124
Radix-2 Iteration (Micro Syn.) 2872 2875 2875
Radix-4 Input Selection (Syn.) 1057 955 955
*Radix-4 Iteration (ADK Std. Cells) 7046 6987 6898
*Radix-4 Iteration (Syn.) 7075 6969 6969
Radix-4 Iteration (Micro Syn.) 6714 6530 6532
Combined Division and Square Root Unit (Including Buffers and Exponent)
SRT Divider Radix-4 12418 12524 12419
Modules do not include buffering unless specified. *Unused Cells or Provided for Information Purposes only. Radix-4 iteration also
includes sticky bit generation.
Figure 5.6 Combined Division and Square Root Area Analysis

The two key areas of the combined division and square root unit are the QSLC and the layering of
iterations. The fast CLA based QSLC was larger (Figure 5.6) and slower (Figure 5.5) than the synthesized
ROM QSLC. Also from the timing analysis, it is possible to see the benefits from overlapping QSLC, where
a two iterations are performed with a data arrival time significantly less the twice what would be
expected by layering two non-overlapped radix-2 iterations. However the second iteration was helped by
micro synthesis, where effectively the parallel product used bit slices containing several smaller
synthesized cells (“micro syn.”), instead of a completely synthesized bit cell (“syn.”). Surprisingly, each of
these bit cells used two synthesized full adders (Figure B.2), which would expectantly make the cell larger
than the ADK version, but instead they were both faster and smaller.

5.2.4 Exponent
Important Exponent Component Pre-Layout Data Arrival Times (ns)
Module AMI 0.5µ Slow AMI 0.5µ Typical AMI 0.5µ Fast
Addition or Subtraction of Exponents
*CLA 12-Bit (ADK Std. Cells) 11.17 6.52 4.12
CLA 12-Bit (Syn.) 9.67 5.73 3.63
Decrementing Exponent
*Dec 12-Bit (ADK Std. Cells) 6.55 3.26 2.06
Dec 12-Bit (Syn.) 4.79 2.74 1.99
Incrementing Exponent
*Inc 12-Bit (ADK Std. Cells) 5.14 2.96 1.92
Inc 12-Bit (Syn.) 3.73 2.25 1.45

47
*Unused Cells or Provided for Information Purposes only.
Figure 5.7 Exponent Component Timing Analysis

Important Exponent Component Area by Gate Count
Module AMI 0.5µ Slow AMI 0.5µ Typical AMI 0.5µ Fast
Addition or Subtraction of Exponents
*CLA 12-Bit (ADK Std. Cells) 157 150 50
CLA 12-Bit (Syn.) 104 96 95
Decrementing Exponent
*Dec 12-Bit (ADK Std. Cells) 46 42 42
Dec 12-Bit (Syn.) 49 49 47
Incrementing Exponent
*Inc 12-Bit (ADK Std. Cells) 47 46 45
Inc 12-Bit (Syn.) 46 45 45
*Unused Cells or Provided for Information Purposes only.
Figure 5.8 Exponent Component Area Analysis

The exponent datapaths for both the multiplier and combined division and square root units have to this
point not been discussed in any detail. While their design is trivial in comparison to the fraction for
multiplication, division, and square root operations, it is important to understand the associated timing
(Figure 5.7) and area (Figure 5.8), that they add to an FPU. Essentially, there are three different
operations performed on operands exponents, these include addition, increment, and decrement. While
the exponent field of IEEE-754 operands is only 11 bits, the extra 12th bit of precision is required for
overflow and underflow detection.

48
5.3 IC Layout

5.3.1 Multiplier

Figure 5.9 Multiplier IC Layout

When transferring the multipliers netlist into an IC layout (Figure 5.9), more standard cell limitations
were discovered. For example, a four port OR gate was missing in the Design Architect installation. It
was not possible to avoid using this cell in a similar way as the “AND04”, where the synthesized CPA was
used instead of the faster ADK CSEA. The result was to use two NOR and a NAND gate. Similar
occurrences happened for other ADK standard cells. At this stage it was realized that any IC layout would
not accurately reflect the merits of the HDL model, and with limited time, a fully automated place and
route were generated. That being said, placing the Booth MUX closer to their respective connections in
the CSA tree, may have been very difficult with a full SDL design using a manual layout process.

49
5.3.2 Combined Division and Square Root

Figure 5.10 Combined Division and Square Root IC Layout

Unlike the multiplier, the layout for the division and square root unit is more regular, which should make
a non-automated IC layout much simpler. This can be seen by the regularity of the integrated division
and square root cell (Figure 3.25) in comparison to that of the CSA tree counters [15]. While the netlist
produced from the HDL for the division unit contained no missing cells, time constraints limited any IC
layout to that of a fully automated place and route (Figure 5.10). The internal wiring and cell placement is
noticeably more uniform then the multiplier.

50
5.4 Performance

5.4.1 Pre-Layout
Pre-Layout Clock Frequency for FPU Components on an AMI 0.5µ Fast Process
Component Pre-Layout Clock Frequency (MHz)
LTH FPU(Original) 58.8
Multiplier 109.3
Multiplier (Using Synthesized CPA for IC Layout) 102.7
Combined Division and Square Root Unit 108.9
Figure 5.11 Pre-Layout Clock Frequency

The maximum possible clock frequency of a FPU containing the new components is 102.7MHz (Figure
5.11). Post-layout this figure is likely to decrease as the wiring delays are incorporated. It does however
illustrate the point, that the newly designed components have achieved a pre-layout frequency almost
double that of the original LTH. This figure is more interesting, when it is considered that the speed of the
multiplier was sacrificed for software compatibility issues, as a slower synthesized CPA was used for IC
layouts. Alternately, assuming there were no standard cell compatibility issues, the maximum clock rate
would be 108.9MHz. Allowing roughly 10% added wiring delay, results in a projected post-layout clock
frequency of 98.01MHz or 100MHz for the optimal design.

5.4.2 Industry Comparison
1994-95 0.5µ Processor Latency and Clock Frequency Comparison
Operation Latency
Processor Multiplication Division Square Root Clock Rate (MHz)
Intel P6 5 38 38 150
MIPS R10000 2 19 19 200
Sun UltraSPARC 3 22 22 167
DEC21164 4 23 23 200
AMDK5 7 39 39 75
PA8000 3 31 31 200
LTH Extended 6(4+2LTHL) 30(28+2LTHL) 31(29+2LTHL) 100
All operation latencies are assumed to include FPU overhead (e.g. decoding, rounding, and operand packaging). LTHL = LTH FPU
operation overhead latency (this includes decoding and rounding stages). Statistics provided by Flynn and Obermann [14].
Figure 5.12 Processor Latency and Frequency Comparison

In a comparison of 0.5µ microprocessors released during 1994-95 (Figure 5.12), the new components for
the LTH performed quite well. While all statistics for the new design are projected, as the LTH was not
updated, and the clock frequency is based on pre-layout data arrival times, the extended LTH is not the
slowest architecture, even if the post-layout data arrival times are decreased by 25%. However, the
mentioned clock frequencies of industry implementations may not reflect the operation clock of the
respective FPU, but rather the operation frequency of its CPU.

51
Operation latency and clock frequencies alone,, are not however adequate metrics for an architecture
design. The area, feature size, and power consumed are also important [14]. This provides a restriction
on measuring the benefits of the new modules. As the existing LTH was not updated due to its own
limitations, the FPU area could not be approximated, and a clock frequency could only be related to the
critical path of the new units (Figure 5.11).

5.5 Conclusion
For the multiplier and combined division square root units, both a structural form of Verilog with direct
ADK instantiation, and a synthesizable version were implemented. This led to a variety of options for
creating the final layouts. The ADK based models provided a satisfactory delay with a low gate count,
while the synthesized modules provided, in general, a vastly decreased pre-layout delay and increased
gate count,

When selecting the optimal solution for the multiplier, the decisions were determined by the delay of the
final CPA. While for the other stages, various solutions were available, the ones with the lowest gate
counts were selected. This is because no matter how fast the other stages may be, the cycle time is only
altered by the slowest part in the pipeline, which is the CPA, and there is no need to add extra gates for no
improved performance.

In the division and square root unit, the use of synthesized components was increased. The original CLA
based QSLC provided an expected delay. However, the synthesized ROM version was significantly faster
and with fewer gates. For the parallel product generation, which included buffering of the QSLC output, a
synthesized cell was also appropriate. By confining the results of the synthesis tool to a small area, the
random logic was not expected to produce a large increase to post-layout data arrival times.

While a variety of options were investigated for both the multiplier and division units, the LTH proved
elusive. Working with the original LTH was problematic as a result of unsatisfactory synthesis times,
where the components produced could not meet the required register to register delays. Nearly all stages
required some optimization, which would extend their critical path, providing worse data arrivals times.
This leaves two solutions, the first being to completely redesign the LTH, and the second is to use a
different open source FPU, such as that provided by SUN Microsystems [7].

For the implemented units, an IC layout was constructed using a fully automated process. The results
were for illustration only, as many sacrifices had to be made on the original design due to software
compatibility issues. Thus, any post-layout timings would not accurately reflect the true merits of the
architecture, and performance evaluation must be based on pre-layout data arrival times.

52
6 Conclusion

6.1 Introduction
This section provides an overview of what was achieved, what remains to be done, and the future
direction of extending the FPU. It discusses each of the functional units implemented in terms of design,
area and timing, and layout. The directions for which the original LTH could be extended and the new
units improved to further increase performance are also elaborated. Finally, there is some discussion of
the overall achievements of this project, and directions for future implementations.

6.2 Achievements

6.2.1 Design
A double precision floating point multiplier was designed with four pipelined stages. The first was a
Booth-2 radix-4 partial product generator, that produced 27 54-bit partial products. The second stage
was a full CSA tree, where each column of the partial products produced was an input into a counter. The
third stage was a CPA, which was constructed from 15-bit CLA blocks to form a 106-bit CSEA. The fourth
and final stage was normalization, where the result produced was prepared for rounding.

The division approach used was a radix-2 SRT subtractive algorithm using a modified QSL from SUN
Microsystems. It consisted of two pipeline stages, a recursively called division stage and a normalization
stage. The first stage used two layered radix-2 iterations with overlapped parallel quotient selection,
which permitted the calculation of a double precision result in 27 cycles. The square root operation was
also incorporated into the division unit. Like the CSA from the multiplier, the design was realized through
a bit slice implementation, However, unlike the multiplier, the delay of the SRT division and square root
unit was never a concern, as the number of SRT iterations could be scaled.

Both the multiplication and division units were verified against a set of over 1000 random test vectors.
The square root unit was also subjected to a similar number of vectors, however its results had to be
check by hand.

6.2.2 Implementation
Over a 100 modules were created to model the architecture of the multiplier and division unit for both
direct ADK and synthesized implementations. This led to a low area ADK full adder based CSA tree, a
quick ADK CSEA, and a fast synthesize QSLC ROM. Without this depth and variety of HDL modeling, it is
doubtful that a sub 10ns pre-layout cycle time would have been possible under any 0.5µ process.

Transferring the HDL to an IC layout was more complicated than expected. This resulted from software
compatibility issues, where the synthesized netlists would consist of unknown or unavailable cells. The
solution was to model the missing cells using available ones, creating a slower and larger design. This
meant no produced IC layout could come close in terms of clock frequency or area to the pre-layout
calculations. Considering this, and time limitations, a fully auto-place and route IC layout was produced, if
for no other reason than to demonstrate size and complexity.

53
6.2.3 Timing and Area
Using the Mentor Graphics ADK2.5 standard cell design flow for an aggressive AMI 0.5 micron static
CMOS process, a target pre-layout speed of 100MHz was possible. However, using a more typical AMI 0.5
micron process, a maximum data arrival time of 13.52ns was obtained for the multiplier, in contrast to a
more aggressive fast process, where the cycle time was reduced to 8.91ns. Thus for a 10ns cycle time or
an FPU clocked at 100MHz, an aggressive process is required.

This does not include the timing of the original LTH. Due to the high level of HDL, extensive synthesis was
required to generate a netlist. This provided several problems. In particular, some of the original code
was not synthesizable and some modules provided a critical path of at least 50ns for a typical AMI 0.5
micron process. This caused significant problems with integrating the data paths, as only a complete
redesign of the original LTH code would permit it to achieve a cycle time of 10ns, similar to that of the
new pipeline stages.

Regarding the new datapaths, the pipeline stage with the largest area was the modified Booth-2 radix-4
partial product generator. While using a fast AMI process, it was estimated to have an area of at least 22K
gates, but surprisingly it had one the fastest pre-layout critical paths of 3.5ns. Overall the multiplier
required 47K gates, and the combined division square root unit 12K gates, and since multiplication is a
more used operation [29], the ratio of required area and utilization is appropriate.

6.3 Future Directions

6.3.1 Design
There are several possibilities for improving this design that may lead to achieving a sub 10ns cycle time
under a typical AMI 0.5µ process. The first is modifying the multiplier to perform more computations in
the Booth encoding stage, perhaps extending the Booth encoding to radix-8. This would lead to an
increased delay during this stage, along with increased area. A higher radix multiplication would reduce
the delay and area of the CSA tree, as a result of requiring fewer counters.

The CSA tree could be modified to have its first few levels of counters placed in the Booth encoding stage.
The result would be a slightly longer Booth encoding Stage, a slightly shorter CSA stage, and a significant
saving of area as less partial products would need to be buffered. This would come at the expense of
increased design time, as many counters would need to be created that had pipeline registers directly
inserted after the first few levels of counters.

The third possibility is to perform the final CPA over several stages, where each CLA block could produce
an output for the CSEA, and the MUX selection could occur in the next stage. This reduces the critical path
of the first stage of the CPA to one 16-bit CLA and the second stage to six multiplexers. The reason this
was not done initially was that it adds an extra pipeline stage, and more gates, due to the extra buffering
required.

For the division and square root unit, the goal of implementing the Sun design [25], which used three
layered overlapped stages seemed out of reach for standard cells on the AMI 0.5µ process. Without

54
considering the input selection required for the SRT iterations, three levels of parallel product generation
and overlapped QSLC comes close to 8.5ns. This limits the number of iteration per cycle for this design to
two. Future implements may wish to investigate alternate architectures, such as a radix-512 [37] or a
multiplicative approach [19]. However, for an SRT approach, there is probably not much more
performance to be gained, without significant increases in area [23].

6.3.2 Implementation
Clearly the different versions of the ADK offered by the synthesis tool, and the missing cells of the
schematic entry and layout software, are a cause for concern. With these issues resolved, and a
significant amount of time, the HDL implementation could be realized in a hand placed IC layout.

6.3.3 Architecture Integration
The LTH FPU offered most of the required functionality for integrating the new units. It did not however,
provide a satisfactory level of performance. While the missing (multiplier, divider and square root) units
were fully realized in HDL, integrating them with the LTH would result in halving their projected clock
frequencies (Figure 5.11). For the designed units to be used as part of a larger FPU capable of a 100MHz
clock frequency, a more optimal solution than the current LTH needs to be found, which may include a
complete redesign or the use of another open source FPU [7].

6.4 Conclusion
The LTH FPU provides for Gaisler Researches Leon2, a partial implementation of a floating point unit, that
includes instruction and operand decoding, special number detection, addition and subtraction, and
rounding. However it does not have a multiplier, divider, or a square root unit. These components have
been designed using Verilog HDL with the ADK 2.5 design flow, and verified to operate at 100MHz pre-
layout, using an aggressive AMI 0.5 micron process. While the components have been built and tested
they have not been integrated into the original LTH, nor realized in an IC layout.

While the designs of the two new units are complete, the implementation could benefit from a hand
placed IC layout, specifically the CSA tree. This would drastically reduce area consumed and improve the
post-layout critical path of the multiplier compared to the auto-place and route layout produced.

Since the start of development, an open source CPU including a fully pipelined FPU with many
improvements over the Leon2 and original LTH, has become available from SUN Microsystems, with the
release of the Open SPARC T1. The FPU provided with the T1 offers a better choice for future innovations,
as it is not confined by the limitations placed on it by the Leon2 and the original LTH. Further, the
original LTH would need to be completely redesigned, to achieve a similar performance level to that of
the new units, or to be a comparable option to the T1’s FPU for future projects.

55
Bibliography

1 Gaisler Research. “Leon 2 Processor” [Online]. Available
http://www.gaisler.com/bin/leon2-1.0.32-xst.tar.gz

2 Gaisler Research. “Leon2 Processor Users Manual” [Online]. Available
http://www.gaisler.com/doc/leon2-1.0.30-xst.pdf

3 M. Kasprzyk. “Floating Point Unit Design IC Project 2001” [Online]. Available
http://www.gaisler.com/bin/leon2-1.0.32-xst.tar.gz

4 SPARC International. “The SPARC Architecture Manual V8” [Online]. Available
http://www.sparc.com/standards/V8.pdf.

5 GNU. “General Public License GPL” [Online]. Available
http://www.gnu.org/copyleft/gpl.html

6 Gaisler Research. “GRFPU High-Performance Floating-Point Unit”. Available
http://www.gaisler.com/cms4_5_3/index.php?option=com_content&task=view&id=138&
Itemid=54

7 SUN Microsystems. “OpenSPARC-T1” [Online]. Available http://opensparc-
t1.sunsource.net/index.html

8 B. Parhami, Computer Arithmetic: Algorithms and Hardware Designs, Oxford University
Press, 2000.

9 I. Koren, Computer Arithmetic Algorithms 2nd Ed., Prentice-Hall, 2002.

10 I. Koren. “Computer Arithmetic Algorithms Simulator” [Online]. Available:
http://www.ecs.umass.edu/ece/koren/arith/simulator/

11 M. Ercegovac and T. Lang, Digital Arithmetic, Morgan Kaufmann Publishers, 2004.

12 J. E. Stine, Digital Computer Arithmetic Datapath Design using Verilog HDL, Kluwer
Academic Publishers, 2004.

13 J. Deschamps, G. J. Antoine Bioul, and G. D. Sutter, Synthesis of Arithmetic Circuits, John
Wiley & Sons, 2006.

56
14 M. J. Flynn and S. F. Oberman, Advanced Computer Arithmetic Design, John Wiley & Sons,
2001.

15 K. Cho, J. Hong, and G Choi, “54x54-bit Radix-4 Multiplier based on Modified Booth
Algorithm,” in Proc. 13th ACM Symp. VLSI, pp 233-236, Apr. 2003.

16 R. K. Yu, G. B. Zyner, “167MHz Radix-4 Floating-Point Multiplier,” in Proc. 12th IEEE Symp.
On Computer Arithmetic, pp 149-154, Jul. 1995.

17 J. Mori, M. Nagamatsu, M. Hirano, ”A 10-ns 54*54-bit Parallel Structured Full Array
Multiplier with 0.5 micron CMOS technology,” IEEE Journal of Solid-State Circuits, vol. 26,
no. 4, pp 600-606, Apr. 1991.

18 N. Besli, R. G. DeshMukh, “A 54*54-bit Multiplier with a new Redundant Booth’s
Encoding,” IEEE Conf. Electrical and Computer Engineering, vol. 2, pp 597-602, 12-15 May
2002.

19 E. M. Schwarz, R. M. Averill, L. J. Sigal, “A Radix-8 S/390 Multiplier,” in Proc. 13th IEEE
Symp. Computer Arithmetic, pp 2-9, Jul. 1999.

20 N. Itoh, Y. Naemura, H. Makino, Y. Nakase, “A Compact 54*54-bit With Improved Wallace-
Tree Structure,” in Dig. Technical Papers of Symp. VLSI Circuits, pp 15-16, Jun. 1999.

21 M. J. Flynn and S. F. Oberman, “Division Algorithms and Implementations,” IEEE Trans.
Computers, vol. 46, no. 8, Aug. 1997.

22 P. Soderquist and M. Leeser, “Area and Performance Tradeoffs in Floating-Point Divide
and Square-Root Implementations,” ACM Computing Surveys, vol. 28, no. 3, Sep. 1996.

23 D. L. Harris, S. F. Oberman, and M. A. Horowitz, “SRT Division Architectures and
Implementations,” in Proc 13th IEEE Symp. Computer Arithmetic, pp 18-25, Jul. 1997.

24 P. Kornerup, “Quotient Digit Selection for SRT Division and Square Root,” IEEE Trans.
Computers, vol. 54, no. 3, Mar. 2005.

25 J. A. Prabhu and G. B. Zyner, “167MHz Radix-8 Divide and Square Root Using Overlapped
Radix-2 Stages,” in Proc. 12th IEEE Symp. Computer Arithmetic, pp. 155-162, Jul. 1995.

26 J. Chiang, H. Chung, and M. Tsai, “Carry-Free Radix-2 Subtractive Division Algorithm and

57
Implementation of the Divider,” Tamkang Journal of Science and Engineering, vol. 3, no. 4,
pp 249-255, 2000.

27 M. Kuhlmann and K. K. Parhi, “Fast Low-Power Shared Division and Square Root
Architecture,” in Proc. IEEE Int. Conf. on Computer Design, (ICCD), pp. 128-135, Oct. 1998.

28 G. Gerwig, H. Wetter, E. M. Schwarz, J Haess, “Higher Performance Floating-Point Unit with
116 wide Divider,” in Proc. 12th IEEE Symp. On Computer Arithmetic, pp 87-94, June 2003.

29 SPEC. “SPEC CFP2006 Benchmark Description” [Online]. Available:
http:www.spec.org/cpu2006/CFP2006.

30 M. Darley, B. Kronlage, D. Bural, B. Churchill, D. Pulling, P. Wang, ”The TMS390602A
Floating-Point Coprocessor for SPARC Systems,” IEEE Micro, vol. 10, no. 3, pp. 36-47,
May 1990.

31 B. J. Benchschneider, W. J. Bowhill, E. M. Cooper et al, ”A Pipeline 50-MHz CMOS 64-bit
Floating-Point Arithmetic Processor,” IEEE Journal of Solid-State Circuits, vol. 24, no. 5, Oct.
1989.

32 L. Min, B. Yong-Qiang, S. Xu-Bang, G. De-Yuan, “The implementation of an out-of-order
execution floating point unit,” Proc. IEEE Int. Conf. Solid State and Integrated Circuits, pp.
1384-1387, vol. 2, Oct. 2004.

33 M. Aagaard and C. H. Seger, “The Formal Verification of a Pipelined Double Precision IEEE
Floating Point Multiplier,” Proc. IEEE/ACM Int. Conf. Computer Aided Design, pp. 7-10,
1995.

34 M. Leeser and J. O’Leary, “The Verification of a Subtractive Radix-2 Square Root
Algorithm and Implementation,” IEEE Int. Conf. on Computer Design, (ICCD), pp.
526-531, Oct. 1995.

35 S. Siewert, “Big Iron Lessons, Part 1: FPU Architecture Now and Then” [Online],
Available: http://www-128.ibm.com/developerworks/library/pa-
bigiron1/?ca=dnt-65.

36 N. Weste, D. Harris, CMOS VLSI Design, 3rd Ed. Int. Ed, Addison Wesley, 2004.

58
37 A. Nannarelli, “Low Power Division and Square Root”, PhD. thesis, Univ. of
California, Irvine, 1999.

59
Appendix A: Arithmetic by Example

Introduction
The following sections provide a demonstration of some of the arithmetic methods used to compute the
multiplication, division, and square root operations in this design. They illustrate the broad concepts of
each function, and provide an insight into the correct operation of the most important sections of each
unit.

Radix-2 Unsigned Multiplication
Unsigned multiplication is one of the simplest of operations, where a set of partial products are produced
and summed (Figure A.1). Each partial product is the result of an AND operation on a bit from operand B
and the entire operand A. The partial product is then shifted to the left a number of times, depending on
the bits position in operand B.

A×B=R
2110 × 2610 = 54610
101012 × 110102 = 10001000102

A[4:0] 1 0 1 0 1 2110
B[4:0] × 1 1 0 1 0 2610
PP1 0 0 0 0 0 A and B0
PP2 1 0 1 0 1 A and B1
PP3 0 0 0 0 0 A and B2
PP4 1 0 1 0 1 A and B3
PP5 1 0 1 0 1 A and B4
Result 1 0 0 0 1 0 0 0 1 0 PP Sum = 54610
Figure A.1 Unsigned Radix-2 Multiplication Example

60
Radix-4 Multiplication using Modified Booth Encoding
The following example is of an unsigned multiplication of two numbers that uses a Modified Boot-2
algorithm. After the partial products have been generated, they are sign extended and then summed.
This is required as the Booth encoding can generate negative numbers, and using a special form of sign
encoding, requires only flipping the bits and shifting for all negative partial products.

A×B=R
2110 × 2610 = 54610
101012 × 110102 = 10001000102

A[4:0] 1 0 1 0 1 2110
B[4:0] × 1 1 0 1 0 2610
~S1 S1 S1 PP15 PP14 PP13 PP12 PP11 PP10
1 ~S2 PP25 PP24 PP23 PP22 PP21 PP20 S1
PP15 PP14 PP13 PP12 PP11 PP10 S2
PP1 0 1 1 0 1 0 1 0 1 (B1,B0,0) = -2A
PP2 1 0 1 0 1 0 1 0 1 (B3,B2,B1)= -1A
PP3 1 0 1 0 1 0 1 (0,B4,B3) = +2A
Result 1 0 0 0 1 0 0 0 1 0 PP Sum = 54610
Figure A.2 Booth-2 Multiplicaiton Example

The sign extension can extended to a larger number of partial products by repeating the form of encoding
on the second line for the rest of the other products (Figure A.3). In the following example of a generic 8-
bit multiplication, it may appear as if an error has occurred because the last product is shorter then the
rest. This is due to the modified Booth encoding, which will only receive “00X”, and can only product a
zero or positive one product. Further, the sign extension of the forth row was not continued to include a
one, because an 8-bit by 8-bit multiplication can only produce a 16-bit number, which results in the extra
sign extension on row four and the carry out from the most significant column being ignored.

• • • • • • • •
× • • • • • • • •
~S S S • • • • • • • • •
1 ~S • • • • • • • • • S
1 ~S • • • • • • • • • S
~S • • • • • • • • • S
• • • • • • • • S
• • • • • • • • • • • • • • • •
Figure A.3 Booth-2 Multiplication Sign Extension Dot Diagram

61
SRT Radix-2 Division
X/D=Q
2510 / 510 = 510
110012 / 001012 = 001012
In Floating-Point Terms 1.10012 / 1.01002 = 1.01002
Let PRs0 = 000.110012 and Result Exponent = Result Exponent + 1
Let PRc0 = 0
All QSLC calculations are based on the SUN Design [25]

Iteration 0
PRs0 0 0 0 . 1 1 0 0 1
PRc0 0 0 0 . 0 0 0 0 0
QSLC 4-bit Sum 0 0 0 . 1 Qi = 1 , Q1 = 1.0
2PRs0 0 0 1 . 1 0 0 1 0
2PRc0 0 0 0 . 0 0 0 0 0
-D 1 1 0 . 1 1 0 0 0 PR1 = 2PR0 - D
PRs1 1 1 1 . 0 1 0 1 0
PRc1 0 0 0 1 . 0 0 0 0 0

Iteration 1
PRs1 1 1 1 . 0 1 0 1 0
PRc1 0 0 1 . 0 0 0 0 0
QSLC 4-bit Sum 0 0 0 . 0 1 Q2 = 1 as MSB5 = 1, Q2 = 1.1
2PRs1 1 1 0 . 1 0 1 0 0
2PRc1 0 1 0 . 0 0 0 0 0
-D + 1 1 0 . 1 1 0 0 0 PR2 = 2PR1 - D
PRs2 0 1 0 . 0 1 1 0 0
PRc2 1 1 0 1 . 0 0 0 0 0

Iteration 2
PRs2 0 1 0 . 0 1 1 0 0
PRc2 1 0 1 . 0 0 0 0 0
QSLC 4-bit Sum 1 1 1 . 0 Q3 = -1 , Q3 = 1.01
2PRs2 1 0 0 . 1 1 0 0 0
2PRc2 0 1 0 . 0 0 0 0 0
+D + 0 0 1 . 0 1 0 0 0 PR3 = 2PR2 + D

62
PRs3 1 1 1 . 1 0 0 0 0
PRc3 0 0 0 0 . 1 0 0 0 0

Iteration 3
PRs3 1 1 1 . 1 0 0 0 0
PRc3 0 0 0 . 1 0 0 0 0
QSLC 4-bit Sum 0 0 0 . 0 0 Q4= 0 as MSB5= 0 , Q4= 1.010
2PRs3 1 1 1 . 0 0 0 0 0
2PRc3 0 0 1 . 0 0 0 0 0
+D 0 0 0 . 0 0 0 0 0 PR4 = 2PR3
PRs4 1 1 0 . 0 0 0 0 0
PRc4 0 0 1 0 . 0 0 0 0 0

Iteration 4
PRs4 1 1 0 . 0 0 0 0 0
PRc4 0 1 0 . 0 0 0 0 0
QSLC 4-bit Sum 0 0 0 . 0 0 Q5 = 0 , Q5 = 1.0100
2PRs4 1 0 0 . 0 0 0 0 0
2PRc4 1 0 0 . 0 0 0 0 0
+D 0 0 0 . 0 0 0 0 0 PR5 = 2PR4
PRs5 0 0 0 . 0 0 0 0 0
PRc5 1 0 0 0 . 0 0 0 0 0
Figure A.4 SRT Radix-2 Division Example

No more iteration’s are required, as in the next iteration, all of the bits for the QSLC input will be zero.
Thus, further iterations will only generate extra zero quotient bits. This method has produced a result of
“1.0100…..0”, which matches what was expected. In further iterations, the partial remainder will be
completely zero, thus the sticky bit generated would also be zero.

63
SRT Radix-2 Square Root
The following is a demonstration of a radix-2 SRT square root operation.

=Q
2510 1/2 = 510
SQRT 110012 = 001012
In Floating-Point Terms SQRT 1.10012 = 1.01002
Let PRs0 = 000.100102
Let PRc0 = 0
All QSLC calculations are based on the SUN Design [25]

Iteration 0
X 0 0 1 . 1 0 0 1 0 0 0 0
PRsi 0 0 0 . 1 0 0 1 0 0 0 0
PRCi 0 0 0 . 0 0 0 0 0 0 0 0
QSLC 0 0 0 . 1 qi+1=1 PRi+1=2PRi-(2Q+2-(i+))
Qi 0 0 1 . 0 0 0 0 0 0 0 0
2-(i+1) 0 0 0 . 1 0 0 0 0 0 0 0
2PRsi 0 0 1 . 0 0 1 0 0 0 0 0
2PRci 0 0 0 . 0 0 0 0 0 0 0 0
F[i] 1 0 1 . 0 1 1 1 1 1 1 1 +
PRsi+1 1 0 0 . 0 1 0 1 1 1 1 1
PRci+1 0 1 0 . 0 1 0 0 0 0 0 1 Carry in

Iteration 1
PRsi 1 0 0 . 0 1 0 1 1 1 1 1
PRCi 0 1 0 . 0 1 0 0 0 0 0 1
QSLC 1 1 0 . 0 qi+1=-1 PRi+1=2PRi+(2Q-2-(i+))
Qi 0 0 1 . 1 0 0 0 0 0 0 0
2-(i+1) 0 0 0 . 0 1 0 0 0 0 0 0
2PRsi 0 0 0 . 1 0 1 1 1 1 1 0
2PRci 1 0 0 . 1 0 0 0 0 0 1 0
F(i) 0 1 0 . 1 1 0 0 0 0 0 0 +
PRsi+1 1 1 0 . 1 1 1 1 1 1 0 0
PRci+1 0 0 1 . 0 0 0 0 0 1 0 0

Iteration 2

64
PRsi 1 1 0 . 1 1 1 1 1 1 0 0
PRCi 0 0 1 . 0 0 0 0 0 1 0 0
QSLC 1 1 1 . 1 qi+1=0 PRi+1=2PRi
Qi 0 0 1 . 0 1 0 0 0 0 0 0
2-(i+1) 0 0 0 . 0 0 1 0 0 0 0 0
2PRsi 1 0 1 . 1 1 1 1 1 0 0 0
2PRci 0 1 0 . 0 0 0 0 1 0 0 0
0 0 0 0 . 0 0 0 0 0 0 0 0
PRsi+1 1 0 1 . 1 1 1 1 1 0 0 0
PRci+1 0 1 0 . 0 0 0 0 1 0 0 0

Iteration 3
PRsi 1 0 1 . 1 1 1 1 1 0 0 0
PRCi 0 1 0 . 0 0 0 0 1 0 0 0
QSLC 1 1 1 . 1 qi+1=0 PRi+1=2PRi
Qi 0 0 1 . 0 1 0 0 0 0 0 0
2-(i+1) 0 0 0 . 0 0 0 1 0 0 0 0
2PRsi 0 1 1 . 1 1 1 1 0 0 0 0
2PRci 1 0 0 . 0 0 0 1 0 0 0 0
0 0 0 0 . 0 0 0 0 0 0 0 0
PRsi+1 0 1 1 . 1 1 1 1 0 0 0 0
PRci+1 1 0 0 . 0 0 0 1 0 0 0 0
Figure A.5 SRT Radix-2 Square Root Example

Q3 = 1.0100, which means after several iterations the desired result has been obtained. No further
iterations were completed because all further “qi+1” will be zero. The main difference between an integer
and the demonstrated floating point version of SRT square root, is that PRsi and Qi are initialized on the
first iteration to (X-1) and 1 respectively.

65
Radix-2 On-The-Fly Conversion
The following is an example of how the quotient is formed for division and square root operations.
Specifically it uses the SRT radix-2 division example from above. The key here is to notice that only the
bit position specified in the third column is new for both Q and QM. All other bits are simply the result of
a select statement from the old values. This is illustrated in iteration three, where the QSLC output is
negative one, and Q selects the previous value from QM.

The square root operation is more complicated then division, as its divisor “F(I)” is variable. There are
two separate versions of F(i), one for both positive and negative products for each iteration. They can,
however, simply be formed by using the final three columns.

Iteration qi 2-i Qi QMi
0 N/A 1.00000 0.XXXXX 0.XXXXX
1 1 0.10000 0.1XXXX 0.0XXXX
2 1 0.01000 0.11XXX 0.10XXX
3 -1 0.00100 0.101XX 0.100XX
4 0 0.00010 0.1010X 0.1001X
5 0 0.00001 0.10100 0.10011
Figure A.6 On-The-Fly Conversion Example

66
Appendix B: Counter Study

Introduction
A counter is a functional unit that is able to produce a value equal to the sum of its input bits, where each
input bit is worth the same. For example, a standard binary counter given the input vector “10011010”
may produce an output of “0100”, which means four bits are high. While many counters may produce the
same output given identical inputs, there is a great variation of implementation. This section aims to
delve into the various styles of counters, and discover what advantages they offer in terms of both
performance and area, for the multiplication and division units.

Timing Analysis
Counter Pre-Layout Data Arrival Times (ns)
Module AMI 0.5µ Slow AMI 0.5µ Typical AMI 0.5µ Fast
3 to 2 Counters (Full Adders)
ADK Fadd1 Std. Cell 2.17 1.23 0.78
Synthesized 1.25 0.71 0.45
4 to 2 Counters
Two ADK Fadd1 Std. Cells 5.26 2.95 1.86
Synthesized (Adders Preserved) 2.96 1.69 1.10
Custom ADK Design [11] 4.95 2.84 1.80
Synthesized (Adders Dissolved) 2.97 1.70 1.04
27 to 2 Counters
Samsung ADK Fadd1 Design [15] 21.06 11.68 7.34
Synthesized (Adders Preserved) 11.55 6.63 4.34
Synthesized (Adders Dissolved) 11.31 6.64 4.29
Figure B.1 Std. Cell v. Synthesized Counter Timing Analysis

Several different types of counters were studied (Figure B.1) to determine the best architecture for a
multiplier design. In particular this relates to the second stage of multiplication, fast multi-operand
addition. The two common methods for adding the partial products into a carry save form rely on either
seven levels of 3:2 counters or four levels of 4:2 counters. For both different types of counter, the
synthesized versions offer a greatly reduced delay. While the 3:2 counter may seem to have a data arrival
time of less then 50% of the larger counter, it is important to remember that these figures do not account
for carry propagation. As a result, the synthesized adders may suffer, since the synthesis tool assumes all
inputs are available at time zero, where the ADK std. cell versions may have a port optimized for late
arrival.

Also of particular interest to the multiplier, is a comparison of the delays of 4:2 and 27:2 counters. The
Samsung design [15] relies on seven levels of full adder cells to produce a 27:2 counter. An alternate
design [17], utilizes four levels of 4:2 counters, which would approximately produce a delay of at 4.16ns

67
(four levels of synthesized 4:2 counters with the cells dissolved). This second approach offers a
substantial decrease in delay in comparison to the required 7.34ns of the 27:2 ADK std. cell counter.
However, it is important to remember that the decreased delay may come at the cost of a significant area
increase. Also, providing a very fast CSA for the multiplier may not yield any performance increase, as
there may be other slower pipeline stages.

Area Analysis
Counter Area by Gate Count
Module AMI 0.5µ Slow AMI 0.5µ Typical AMI 0.5µ Fast
3 to 2 Counters (Full Adders)
ADK Fadd1 Std. Cell 5 5 5
Synthesized 7 7 8
4 to 2 Counters
Two ADK Fadd1 ADK Std. Cells 10 9 9
Synthesized (Adders Preserved) 14 14 15
Custom ADK Design [11] 16 16 16
Synthesized (Adders Dissolved) 14 14 18
27 to 2 Counters
Samsung ADK Fadd1 Design [15] 133 114 114
Synthesized (Adders Preserved) 170 170 190
Synthesized (Adders Dissolved) 199 197 246
Figure B.2 Std. Cell v. Synthesized Counter Area Analysis

In contrast to the timing analysis (Figure B.1), the gate counts for the synthesized counters (Figure B.2)
are vastly increased. The 4:2 counter with all of its internal full adder cells dissolved, uses 3.6 times the
number of gates as the ADK std. cell full adder. While an extreme example, it does illustrate the trend of
the synthesized adders requiring a larger area. It is important to remember, to achieve a satisfactory
delay for the AMI 0.5 micron process, all adders were optimized for delay. As a result of synthesizing with
optimization for delay, the required area increases.

Conclusion
The case for a full ADK standard cell instantiated HDL counter design is that of a low gate count, and
probably similar pre-layout and post-layout timing results. In comparison, the synthesized options
produce a vastly faster pre-layout result but with an increased gate count. The selection of which counter
is appropriate should then be based on how many counters are used and on the importance of the data
arrival times offered by each. For the multiplier, which uses a vast number of counters, area is the
overriding factor. In contrast, the combined division square root unit uses significantly few counters, and
delay is more important.

68
Appendix C: HDL Verification

Introduction
This section presents the outputs from the HDL simulator for the three main operations. For
multiplication and division, it was possible to model their complete functionality in a Verilog test bench.
In the case of square root, no simple arithmetic HDL constructs were available, and the produced results
were verified by hand. For all three operations, a separate test bench was constructed, and 1000 random
normalized operands were generated. While the majority of HDL written, has an accompanying test
bench, the units tested here utilize all of the other sub components, and as a result provide a means of
testing all designed modules.

Multiplication
# Griffith University Floating-Point Multiplier Test Bench.
# Simulation File: [./hdl/testbench/tb_fpu_multply_vec.v]
#
# Simulated and circuit responses being logged
# Log File: [./hdl/testbench/output/tb_fpu_multply_vec.out]
#
# Initializing test vectors...
#
# Performing 1000 verification tests using random normalized vectors.
#
# Test bench complete
#
# Details:
# Tests completed: 1000
# Calculation errors: 0
# Fraction error: 0
# Exponent error: 0
# Sign error: 0
#
# Processing Details:.
# Time required: 6004000
# Cycles required: 6004

Figure C.1 FPU Multiplication Test Bench

When testing the functionality of the designed multiplier, it was possible to model the calculations for all
three fields (exponent, fraction, and sign) in a test bench (Figure C.1). Essentially, one operation was run
at a time (similar conditions to LTH FPU), and when that operation completed, the produced results was
checked against a pre-calculated one. At this stage, the inputs were then reset to random values, and
remain constant until the next test had completed.

Division
# Griffith University Floating-Point Combined DIVSRT Test Bench.
# Performs only division tests.
# Simulation File: [./hdl/testbench/tb_fpu_divsqrt_vec.v]
#
# Simulated and circuit responses being logged
# Log File: [./hdl/testbench/output/tb_fpu_divsqrt_vec.out]
#

69
# Initializing test vectors...
#
# Performing 1000 verification tests using random normalized vectors.
#
# Test bench complete
#
# Details:
# Tests completed: 1000
# Calculation errors (after corrections): 0
# Fraction errors before correction: 119
# ULP Corrections required: 119
# Exponent error: 0
# Sign error: 0
#
# Processing Details:.
# Time required: 31004000
# Cycles required: 31004

Figure C.2 FPU Division Test Bench

Similar to multiplication, it was possible to model all aspects of a division operation in a Verilog test
bench (Figure C.2). However, in SRT calculations, the quotient can be over estimated, and corrected at a
latter stage by a negative QSLC result. However, this adjustment is not guaranteed to happen directly
after an over estimation, and as a result, the quotient can be over estimated by an ULP. Accounting for
this, the division test bench compared the original result produced by the circuit first, with a pre-
calculated one. When the original result did not match, the pre-calculated result was then compared to
the original result minus an ULP. From 1000 normalized operands, allowing for quotient correction, no
problems were observed.

Square Root
# Griffith University Floating-Point Combined DIVSRT Test Bench.
# Performs only square root tests.
# Simulation File: [./hdl/testbench/tb_fpu_sqrt_simple_vec.v]
#
# Simulated and circuit responses being logged
# Log File: [./hdl/testbench/output/tb_fpu_sqrt_simple_vec.out]
#
# Initializing test vectors...
#
# Performing 1000 verification tests using random normalized vectors.
#
# Test bench complete
#
# Processing Details:.
# Time required: 30004000
# Cycles required: 30004

Figure C.3 FPU SQRT Test Bench

Unlike the division and multiplication operations, the square root function could not be modeled in
Verilog. By modifying the division test bench, to simply generate operands and record results (Figure
C.3), it was possible to create a list of operations with results that could be verified at a later stage (Figure
C.4).

70
When verifying the recorded results, the logged quotients generated by VSIM were compared to those
produced using Microsoft Windows Power Toy Calculator. Given the square root operation also suffers
from a similar error as division, where the quotient may be over estimated by an ULP, the results were
assumed the same, if the simulated quotient or the simulated quotient minus an ULP matched the result
produced by the calculator. Further, as the division and square root functions share many common
blocks, by verifying the division operation with a large set of test vectors, the square root function is also
verified.

Square Root Test: 0 (First test with odd exponent)
Operand A 1.1001010111101000000100010010000101010011010100100100 Odd Exponent
Calculator 1.11000111111000001001100001000001001110010101111111001
Circuit 1.11000111111000001001100001000001001110010101111111010 Q = Q - ULP
Correction 1.11000111111000001001100001000001001110010101111111001
Square Root Test: 3 (First test with even exponent)
Operand A 1.0000011010011111001001000111111011001101101110001111 Even Exponent
Calculator 1.00000011010010100010100100000110011101000101100001000
Circuit 1.00000011010010100010100100000110011101000101100001000
Square Root Test: 612 (Selected at random)
Operand A 1.0101011111101110010100010111010110011011001100101110 Even Exponent
Calculator 1.00101000101110011101101100001010000100101011000100100
Circuit 1.00101000101110011101101100001010000100101011000100100
Square Root Test: 726 (Selected at random)
Operand A 1.0000010100001000010011100000010101001101000011000000 Even Exponent
Calculator 1.00000010100000010000010001110110111110001001011110010
Circuit 1.00000010100000010000010001110110111110001001011110010
Square Root Test: 996 (Last test with odd exponents)
Operand A 1.0101101110001010110000000001000011010101001100000010 Odd Exponent
Calculator 1.10100101110101001100010011001101010000110011101001011
Circuit 1.10100101110101001100010011001101010000110011101001011
Test cases are from the log file produced and referenced in the square root test bench (Figure C.3)
Figure C.4 FPU Square Root Result Verification

71
Appendix D: Code Listing

Introduction
A CD/DVD accompanies this dissertation. This section outlines all the contents of that disc. In particular
this includes HDL, area and timing reports, and produced netlists. Also included is some of the early work
performed on the LTH FPU, which includes the separation into pipeline stages and synthesis reports.

File Locations and Naming Conventions
File Locations
Location Description
./hdl Base directory for created modules
./hdl/scripts Contains all PERL and Verilog scripts used to generate other modules (e.g.
CSA tree and QSLC ROM)
./hdl/testbench Contains test benches for created modules
./hdl/testbench/output Stores the log files produces by a limited number of the main test benches.
./lth Has a similar directory structure to that of “./hdl”. It also contains a
“synth” directory, similar to that described below.
./synth Base directory for synthesis results
./synth/area Area reports (e.g. gate count)
./synth/delay Timing Reports
./synth/netlist Produced netlist, either from synthesized or flattened modules.
./synth/script Contains all synthesis settings and Linux Bash shell scripts to run them in
terminal mode.
Figure D.1 File Location Map

File Naming Conventions
File Pattern Description
*.ami05[fast|slow|typ].* Result or output of synthesis, where the name of the module that produced
this file precedes the process variation.
*.area.rpt Area report
*.delay.rpt Pre-layout delay report
*.out Log file produced from VSIM simulation
*_syn.* A module designed to be or produced from a design utilizing Verilog
primitives. There should also be a similar name file with out “-syn” that is
based on ADK cells.
Tb_*.v Verilog test bench
Figure D.2 File Naming Conventions

72
Highlights
File Description
./hdl/fpu_divsqrt.v Radix-4 combined division and square root unit
./hdl/fpu_multply.v Multiplication unit
./synth/delay/fpu_divsqrt.ami05fast.rpt Delay report for the division square root unit on an AMI
0.5 fast process.
./synth/delay/fpu_mulyply.ami05fast.rpt Delay report for the multiplication unit on an AMI 0.5 fast
process.
./hdl/scripts/csatree.pl Generates connectivity for the CSA tree, instantiates 27:2
to 4:2 counters.
./hdl/testbench/tb_fpu_multiply_vec.v FPU Multiplication test bench
./hdl/testbench/tb_fpu_divsqrt_vec.v FPU Division test bench
Figure D.3 HDL Synthesis and Simulation Highlights

73