
CHAPTER 1

INTRODUCTION
In the domain of digital signal processing, numbers are represented in either
fixed-point or floating-point form. This work deals with the binary
representation of real numbers. The advantage of floating-point representation
over fixed-point and integer representation is that it can support a much wider
range of values. To increase the speed and efficiency of real-number
computations, computers and Floating Point Units (FPUs) typically use a
binary floating-point format to represent real numbers. These representations
play a major role in digital and radar imaging, reducing complexity during
processing while achieving accuracy and efficiency.

The advancement of VLSI technology has enhanced the ability of circuits to
handle floating-point (FP) arithmetic. Floating-point units are designed for
applications such as spacecraft, rocket launch systems and big data, where
integer arithmetic lacks the range and precision required for accuracy; VLSI
technology makes such units practical to implement.

There are many processors with fixed- or floating-point representation, and
several functional blocks are used for arithmetic operations. Floating-Point (FP)
Fast Fourier Transform (FFT) processors are widely used in high-resolution
radar imaging applications for the task of pulse compression.

1.1 BASIC CONCEPTS OF FLOATING POINT

Floating point describes a system for representing numbers that would be
too large or too small to represent as integers. Compared to fixed-point
representation, floating-point representation retains its resolution and
accuracy over a much wider range of values. A floating-point number is
represented by a sign, a mantissa and an exponent, as shown in Fig. 1.1.1;
this representation is called sign and magnitude. S is the sign bit (0 is
positive and 1 is negative). E is the (signed) exponent field: very large
numbers have large positive exponents, and very small, close-to-zero
numbers have negative exponents. More bits in the exponent field increase
the range of representable values. M is the fraction or mantissa field (the
fraction after the binary point). More bits in the fraction field improve
the precision of FP numbers.

SIGN EXPONENT MANTISSA

Fig. 1.1.1 Representation of Floating Point
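As a concrete illustration of these three fields, the following minimal Python sketch (the helper name decode_float32 is ours, chosen for illustration) unpacks the sign, exponent and mantissa of an IEEE 754 single-precision number:

```python
import struct

def decode_float32(x):
    """Split a Python float into IEEE 754 single-precision fields."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]  # 32 raw bits
    sign = bits >> 31                  # 1 sign bit
    exponent = (bits >> 23) & 0xFF     # 8 exponent bits (biased by 127)
    mantissa = bits & 0x7FFFFF         # 23 fraction bits
    return sign, exponent, mantissa

# -6.5 = -1.101b x 2^2 -> sign 1, biased exponent 2 + 127 = 129
print(decode_float32(-6.5))  # (1, 129, 5242880)
```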

The IEEE Standard for Floating-Point Arithmetic (IEEE 754) is a technical


standard for floating-point arithmetic established in 1985 by the Institute of
Electrical and Electronics Engineers (IEEE). The standard addressed many
problems found in the diverse floating-point implementations that made them
difficult to use reliably and portably. Many hardware floating-point units use the
IEEE 754 standard.

Conversion of a decimal number to floating point according to IEEE Standard
754 consists of three steps. First, convert the decimal number to binary.
Second, write the converted binary number in scientific notation: shift the
binary point left until a single 1 remains before it, and multiply by 2^n,
where n is the number of shifts. Third, assemble the fields of the 754 format:
the stored exponent is the sum of the exponent bias and n. The exponent bias
is a constant, 127 for single precision and 1023 for double precision.
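These three steps can be sketched in Python; the function name to_ieee754_single is illustrative, and the sketch assumes a normal, nonzero input:

```python
import struct

def to_ieee754_single(value):
    """Assemble sign/exponent/mantissa fields for a normal, nonzero value."""
    sign = 0 if value >= 0 else 1
    mag = abs(value)
    # Step 2: scientific notation, mag = 1.xxx * 2^n
    n = 0
    while mag >= 2.0:
        mag /= 2.0
        n += 1
    while mag < 1.0:
        mag *= 2.0
        n -= 1
    exponent = n + 127                         # Step 3: add the bias
    mantissa = round((mag - 1.0) * (1 << 23))  # 23 fraction bits
    return (sign << 31) | (exponent << 23) | mantissa

bits = to_ieee754_single(10.5)  # 10.5 = 1.0101b * 2^3, biased exp = 130
assert bits == struct.unpack(">I", struct.pack(">f", 10.5))[0]
print(hex(bits))  # 0x41280000
```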

1.2 PRECISION IN FLOATING POINT


Generally, the smallest change that can be represented in a floating-point
representation is called its precision. Precision implies closeness or
accuracy. There are two types of precision, which are discussed below.
1.3 TYPES OF PRECISION

There are two types of precision:
1) Single precision
2) Double precision

1) Single precision

The single-precision floating-point standard representation requires a 32-bit
word, whose bits may be numbered from 0 to 31, left to right, as shown in
Fig. 1.1.2.

Sign (1) Exponent (8) Mantissa (23)

Fig. 1.1.2 Representation of Single Precision

2) Double Precision

The double-precision floating-point standard representation requires a 64-bit
word, whose bits may be numbered from 0 to 63, left to right, as shown in
Fig. 1.1.3.

Sign (1) Exponent (11) Mantissa (52)

Fig. 1.1.3 Representation of Double Precision
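The precision difference between the two formats can be seen by computing the gap between 1.0 and the next representable number in each format; next_up below is an illustrative helper, not a library routine:

```python
import struct

def next_up(x, fmt):
    """Next representable float above x in the given struct format."""
    width = {"f": "I", "d": "Q"}[fmt]
    bits = struct.unpack(">" + width, struct.pack(">" + fmt, x))[0]
    return struct.unpack(">" + fmt, struct.pack(">" + width, bits + 1))[0]

eps_single = next_up(1.0, "f") - 1.0   # 2**-23: 23 mantissa bits
eps_double = next_up(1.0, "d") - 1.0   # 2**-52: 52 mantissa bits
print(eps_single, eps_double)          # ~1.19e-07 ~2.22e-16
```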

1.4 ADDITION OF FLOATING-POINT NUMBERS:

It is desirable that floating-point numbers be normalized, i.e., that there be only
one significant digit (which in binary can only be 1) to the left of the radix point
of the mantissa M. Thus, the addition of two floating-point numbers typically
involves two normalized floating-point addends A and B, and yields a normalized
sum S. The mantissas of these two numbers are designated M1 and M2
respectively, and the exponents are designated E1 and E2. In floating-point
formats other than IEEE 754, mantissas M1 and M2 can be represented in 2's
complement format, allowing a single simple structure to perform addition
or subtraction. The general process for summing addends A and B consists of
three steps: 1) de-normalization of the addends, 2) mantissa addition, and 3)
normalization of the sum. In de-normalization of the addends, if exponents E1
and E2 are not equal, then one of the addends is de-normalized until E1
and E2 match. A typical method is to increase the smaller exponent by some
amount x to equal the larger exponent, and to shift the binary point of the
mantissa of the addend with the smaller exponent x places to the left, yielding
the de-normalized mantissa. For example, to add 1.0 x 2^2
(addend A) and 1.11111 x 2^7 (addend B), the method described above would
increase the smaller exponent (in this case E1, or 2) by 5 so that it equals the
larger exponent (in this case E2, or 7), turning addend A into 0.00001 x 2^7.
In normalization of the sum, if the mantissa sum is not already in normalized
form, then it is normalized to yield normalized sum S. In other words, if required,
the binary point of the mantissa sum is shifted left or right as appropriate until
there is only one significant digit to the left of the binary point, yielding the
normalized mantissa Ms. Then, the exponent is adjusted by the same number of
positions y to yield the exponent Es of normalized sum S. In the example above,
the mantissa sum 10.0 is normalized by shifting the binary point one
place to the left to yield Ms = 1.0. Then, the exponent 7 is increased by 1 to
yield Es = 8.

Exceptions: Overflow, Underflow, and Zero.

An exception occurs when a floating-point operation yields a result which
cannot be represented in the floating-point numbering system used. Three
common exceptions are overflow, underflow, and zero. Overflow and
underflow exceptions occur when addition results in a sum whose absolute value
is either too large (overflow) or too small (underflow) to be
represented in the floating-point numbering system used. For example, the IEEE
754 32-bit single-precision format cannot represent a positive number
greater than (2 - 2^-23) x 2^127 (positive overflow) or a positive normalized
number less than 2^-126 (positive underflow), nor a negative number whose
absolute value is greater than (2 - 2^-23) x 2^127 (negative overflow) or less
than 2^-126 (negative underflow). Furthermore, IEEE 754, with its implied
leading digit of 1, is incapable of naturally representing 0 (zero exception).
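The three-step process can be sketched in Python on the example above; mantissas are held as integers with FRAC fraction bits, all names are illustrative, and shifted-out bits are simply truncated (the guard, round and sticky bits of Chapter 2 are omitted here):

```python
FRAC = 5  # fraction bits used in this toy example

def fp_add(m1, e1, m2, e2):
    """Add two normalized (mantissa, exponent) pairs; mantissas are
    fixed-point integers, i.e. value = m * 2^(e - FRAC)."""
    # Step 1: de-normalize -- align the smaller exponent to the larger
    if e1 < e2:
        m1 >>= (e2 - e1); e1 = e2
    else:
        m2 >>= (e1 - e2); e2 = e1
    # Step 2: mantissa addition
    s, es = m1 + m2, e1
    # Step 3: normalize -- restore a single digit left of the binary point
    while s >= (2 << FRAC):
        s >>= 1; es += 1
    return s, es

# A = 1.00000b x 2^2, B = 1.11111b x 2^7
print(fp_add(0b100000, 2, 0b111111, 7))  # (0b100000, 8), i.e. 1.0 x 2^8
```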

1.5 APPLICATIONS

 OFDM systems
 4G communication technology
 Generating sine and cosine, which reduces the usage of large ROMs
 Spacecraft and rocket launching
 Deep learning
 Media processing
 Security
 Big data
 Analytic processing

1.6 EXISTING WORK

The primary goals of the design are reduced power consumption, area
consumption and power delay, together with increased efficiency. Existing work
concentrates on "Design and Implementation of a Floating Point ALU", which
describes arithmetic operations such as the addition of floating-point numbers
and a set of related methods for reducing the time required for floating-point
addition. In the existing work, floating-point addition is designed from basic
blocks: an adder to add the mantissa parts, a comparator to compare the sign
bits as well as the exponents, and an exception block to handle problems such
as overflow, underflow and zero. Carry look-ahead adders decrease the
propagation time, and the output is given to an encoder to achieve accuracy.
However, the existing work fails to achieve better efficiency and accuracy
together with lower power consumption, smaller area and reduced power delay.
1.7 PROPOSED WORK

Fig. 1.1.4 Architecture of floating point addition

The architecture of floating-point addition, shown in Fig. 1.1.4, consists of
blocks such as a comparator, sign control, bit inverter, adder, LZA logic (which
replaces the carry look-ahead adder and speeds up the process), counter, left
shifter, right shifter, exponent incrementer, exponent subtractor, incrementer,
rounding control, exception data format and a multiplexer, which replaces the
encoder for better results. Together these blocks increase efficiency and
reduce power consumption, area consumption and power delay.

CHAPTER 2

DESIGN ARCHITECTURE OF FLOATING POINT
ADDITION
2.1 STEPS TO DESIGN FLOATING POINT ADDITION
ARCHITECTURE

The architecture of floating-point addition has many blocks, which can be
designed individually in a simulator (e.g., ModelSim) and then connected to
form the overall architecture. The architecture is built step by step as a
pipelined process, with the following steps (a Python sketch of these steps
follows the list):

 Compare the exponents of the two numbers (E1 and E2) and calculate the
absolute value of the difference between them (|E1 - E2|). Take the
larger exponent as the tentative exponent of the result.
 Shift the significand of the number with the smaller exponent right
through a number of bit positions equal to the exponent difference.
Two of the shifted-out bits of the aligned significand are retained as guard
(G) and round (R) bits, so for a p-bit significand the effective width of the
aligned significand must be p + 2 bits. Append a third bit, namely the
sticky bit (S), at the right end of the aligned significand. The sticky bit is
the logical OR of all shifted-out bits.
 Add/subtract the two signed-magnitude significands using a p + 3 bit
adder. Let the result of this be SUM.
 Check SUM for a carry-out (Cout) from the MSB position during addition.
Shift SUM right by one bit position if a carry-out is detected and
increment the tentative exponent by 1. During subtraction, check SUM
for leading zeros. Shift SUM left until the MSB of the shifted result is a
1, and subtract the leading-zero count from the tentative exponent.
Evaluate exception conditions, if any.

 Round the result if the logical condition R'(M0 + S') is true, where M0
and R' represent the pth and (p + 1)st bits from the left end of the
normalized significand. The new sticky bit (S') is the logical OR of all bits
to the right of the R' bit. If the rounding condition is true, a 1 is
added at the pth bit (from the left side) of the normalized significand. If the p
MSBs of the normalized significand are all 1's, rounding can generate a
carry-out; in that case normalization (step 4) has to be done again.
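A behavioral Python sketch of these steps, including guard/round/sticky alignment and the rounding condition R'(M0 + S'), is given below; P and the function name are illustrative assumptions, not the hardware design itself:

```python
P = 24  # significand width (1 hidden + 23 fraction bits)

def fp_add_grs(m1, e1, m2, e2):
    """Add two positive normalized P-bit significands with G, R, S bits."""
    if e1 < e2:                          # operand 1 keeps the larger exponent
        m1, e1, m2, e2 = m2, e2, m1, e1
    d = e1 - e2
    m1 <<= 3                             # make room for G, R, S below the LSB
    shifted = m2 << 3
    sticky = 1 if d and (shifted & ((1 << d) - 1)) else 0
    m2 = (shifted >> d) | sticky         # sticky = OR of all shifted-out bits
    total, e = m1 + m2, e1               # p + 3 bit addition
    if total >> (P + 3):                 # carry-out: shift right, bump exponent
        total = (total >> 1) | (total & 1)
        e += 1
    m, g, r, s = total >> 3, (total >> 2) & 1, (total >> 1) & 1, total & 1
    if g and ((m & 1) or r or s):        # rounding condition R'(M0 + S')
        m += 1
        if m >> P:                       # rounding carried out: renormalize
            m >>= 1
            e += 1
    return m, e

# 1.1b x 2^0 + 1.1b x 2^0 = 11.0b -> normalized 1.1b x 2^1 (= 3.0)
print(fp_add_grs(0b11 << (P - 2), 0, 0b11 << (P - 2), 0))
```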
2.2 LZA

Leading-zero/one anticipator (LZA): the finite-state model of the
LZA accepts a string of serial inputs and, depending on
the bit length (N), is not always as fast as a carry look-ahead adder
[5]. It is therefore necessary to process the string of P-, G- and Z-
inputs using a parallel algorithm similar to the look-ahead structure.
The final construction processes the input data in discrete blocks
of length D. This approach can be interpreted as a parallel
implementation of the finite-state machine, considering its
combinatorial equivalents for different state and input combinations.
In the following, the leading-zero/one anticipation is carried out digit-
wise; i.e., the block length is 4 bits. The results of this study can
easily be extended to arbitrary block lengths. We assume that the
beginning of a block is the kth bit position.

Fig. 2.1.1 Leading zero/one anticipator operation

The LZA is thus a vital block for increasing the performance of FP processors.
It anticipates the leading-zero count of the result directly from the input
operands (with the chance of a one-bit error), obtaining it before the adder's
output is ready. Because it works in parallel with the adder, it reduces the
delay introduced by the normalization process.
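The sketch below models this behavior in Python under a stated assumption: instead of the hardware's parallel P/G/Z indicator string, it estimates the leading-zero count from the one's-complement sum A + ~B, which is within one position of the true count for A - B, and then applies the one-bit correction mentioned above (all names are illustrative):

```python
N = 8  # word width of this toy model

def clz(x, width=N):
    """Count leading zeros of x within a fixed-width word."""
    return width - x.bit_length()

def lza_estimate(a, b):
    """Anticipate clz(a - b) for a > b without the final sum.
    a + ~b equals (a - b) - 1 mod 2^N, whose leading-zero count is
    clz(a - b) or clz(a - b) + 1, so the shift below errs by at most one."""
    est = clz((a + (~b & (2**N - 1))) & (2**N - 1))
    return max(est - 1, 0)         # bias low so we never over-shift

def normalize_difference(a, b):
    """Normalize a - b using the anticipated shift plus a 1-bit correction."""
    diff, shift = a - b, lza_estimate(a, b)
    shifted = diff << shift
    if not (shifted >> (N - 1)):   # estimate was one short: fix up
        shifted <<= 1
        shift += 1
    return shifted, shift

# massive cancellation: 1000_0000 - 0111_1111 = 1, normalized with shift 7
print(normalize_difference(0b10000000, 0b01111111))  # (128, 7)
```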

CHAPTER 3
LITERATURE REVIEW
[1] J. Sohn and E. E. Swartzlander, Improved architectures for a
fused floating-point add-subtract unit, gives information about the fused
floating-point add-subtract unit employing pipelining and a dual-path
algorithm. The fused design reduces area and power consumption by 40 percent,
and the dual-path design reduces latency by 30 percent compared to discrete
floating-point designs. The fused dual-path floating-point add-subtract unit
can be split into two pipeline stages, with which the throughput is increased
by 80 percent.

[2] IEEE Standard for Floating-Point Arithmetic, 2008. This
standard specifies interchange and arithmetic methods for binary and
decimal floating-point arithmetic in computer systems, with formats such as
single, double and extendable precision. The standard provides a method for
computation with floating-point numbers that yields the same result whether
the processing is implemented in hardware, software or a combination of both:
the results are identical, independent of implementation, given the same
input data.

[3] M. Garrido, J. Grajal, M. Sanchez and O. Gustafsson, Pipelined radix-
2^k feedforward FFT architectures, gives information about pipelined
FFTs with radix-2^k feedforward (multi-path delay commutator) architectures.
In feedforward architectures, radix-2^k can be used for any number of parallel
samples that is a power of two. These designs achieve very high throughput,
which makes them suitable for demanding applications. Feedforward
architectures therefore provide higher throughput than single-input FFT
architectures, although the feedforward designs require additional hardware
components to process the many parallel samples.

[4] Xing Wei, Haigang Yang, Wei Li, Zhihong Huang, Tao Yin and Le
Yu, A reconfigurable 4-GS/s power-efficient floating-point FFT processor
design and implementation based on single-sided binary-tree
decomposition, describes the design of floating-point (FP) fast
Fourier transform operations to attain high throughput while improving
efficiency and reducing power and area consumption. This aim is achieved
by implementing the proposed design in silicon based on
SMIC's (Semiconductor Manufacturing International Corporation) 28 nm
CMOS technology with an active area of 1.39 mm^2. Compared with existing
work, the process moved from 45 nm at 1.25 V to 28 nm at 1.05 V, the working
frequency was reduced from 1.49 GHz to 0.5 GHz, area consumption was reduced
from 100% to 80.3%, and power consumption was reduced from 2.47 to 1.45.

[5] Neil Burgess, Chris Goodyer, Christopher N. Hinds and David R. Lutz,
High-Precision Anchored Accumulators for Reproducible Floating-Point
Summation, introduces a new data type, the High-Precision Anchored (HPA)
number, that allows reproducible accumulation of floating-point (FP) numbers
over a programmer-selectable range. It provides a larger significand and
smaller range than existing formats and has better arithmetic and computational
properties; HPA processing is much faster than other reproducible-arithmetic
methods. An HPA is a pair consisting of a long 2's complement integer and an
anchor value, which permits fast addition and high performance. It is equally
applicable to summations of unrounded products of floating-point values.

[6] Manish Kumar Jaiswal and Hayden K.-H. So, Design of quadruple
precision multiplier architectures with SIMD single and double precision
support, gives information about quadruple precision, an advanced precision
technique, with support for single and double precision. The manuscript
explores possible architectures for dual-mode QPdDP (quadruple precision with
dual (two parallel) double precision) and tri-mode QPdDPqSP (quadruple
precision with dual double precision and quad (four parallel) single precision)
arithmetic. These have two prime targets: first, to provide quadruple-precision
arithmetic support, and second, to include SIMD (single instruction multiple
data) processing support for double-precision and single-precision arithmetic.

CHAPTER 4
PROBLEM STATEMENT

The floating-point addition is designed for low power consumption,
increased efficiency and minimum power delay. In previous work the
conditions mentioned above could not be achieved. To overcome this, an LZA
logic block is included. LZA logic plays a major role: it speeds up the
operation and reduces the delay introduced by normalization. Using pipelining
methods, the design proceeds step by step sequentially and achieves minimum
delay, minimum power and less area consumption.

CHAPTER 5
COMPARATOR:

A magnitude digital comparator is a combinational circuit
that compares two digital or binary numbers in order to find out whether one
binary number is equal to, less than or greater than the other binary number. We
logically design a circuit with two inputs, one for A and the other
for B, and three output terminals: one for the A > B condition, one for the A = B
condition and one for the A < B condition.

Fig. 5.1.1 Comparator diagram

A comparator used to compare two bits is called a single-bit comparator.
It consists of two inputs, one for each single-bit number, and three outputs
that indicate the less-than, equal-to and greater-than relations between the
two binary numbers.

5.1 CHARACTERISTICS:
 Linear characteristics of scale
 Quick in results
 High magnification
 Versatility
 Compensation from Temperature effects

5.2 4-BIT COMPARATOR:

A comparator used to compare two binary numbers of four bits each is called
a 4-bit magnitude comparator. It consists of eight inputs, four for each
number, and three outputs to indicate the less-than, equal-to and greater-than
relations between the two binary numbers.

In a 4-bit comparator the condition A > B is possible in the
following four cases:

1. If A3 = 1 and B3 = 0
2. If A3 = B3 and A2 = 1 and B2 = 0
3. If A3 = B3, A2 = B2 and A1 = 1 and B1 = 0
4. If A3 = B3, A2 = B2, A1 = B1 and A0 = 1 and B0 = 0
Similarly, the condition A < B is possible in the following four cases:
1. If A3 = 0 and B3 = 1
2. If A3 = B3 and A2 = 0 and B2 = 1
3. If A3 = B3, A2 = B2 and A1 = 0 and B1 = 1
4. If A3 = B3, A2 = B2, A1 = B1 and A0 = 0 and B0 = 1

The condition A = B is possible only when all the individual bits of one
number exactly coincide with the corresponding bits of the other number, as
the sketch below shows.
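A minimal Python sketch of this cascaded comparison (the function name compare4 is illustrative) scans from A3/B3 downwards, exactly as in the cases listed above:

```python
def compare4(a, b):
    """4-bit magnitude comparator: returns (A>B, A=B, A<B) as 0/1 flags."""
    for i in (3, 2, 1, 0):          # scan from the most significant bit
        ai, bi = (a >> i) & 1, (b >> i) & 1
        if ai != bi:                # first differing bit decides the order
            return (1, 0, 0) if ai > bi else (0, 0, 1)
    return (0, 1, 0)                # all bits coincide: A = B

print(compare4(0b1010, 0b1001))  # (1, 0, 0): A > B by case 3 (A1=1, B1=0)
```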

Fig. 5.2.1 4-bit comparator

5.2.1 4-bit comparator simulated results:

A>B

A<B

A=B

CHAPTER 6

SHIFTER:
6.1 Operation of shifter:

Shifters are important elements in microprocessor design for arithmetic
shifting, logical shifting and barrel shifting. The two basic types are
the arithmetic left shift and the arithmetic right shift. For binary numbers, a
shift is a bitwise operation that moves all of the bits of its operand: every
bit is moved a given number of bit positions, and the vacant bit positions are
filled in. Arithmetic shifts can be useful as efficient ways to perform
multiplication or division of signed integers by powers of two. A shift,
applied to the representation of a number in a fixed-radix numeration system
and in a fixed-point representation system, moves only the characters
representing the fixed-point part of the number.

Fig. 6.1.1 Shifter

Fig. 6.1.2 Left arithmetic shift

Fig. 6.1.3 Right arithmetic shift

6.1.4 Simulated results for shifter:

Right shifter

Left shifter

Shifting left by n bits on a signed or unsigned binary number has the
effect of multiplying it by 2^n. Shifting right by n bits on a two's complement
signed binary number has the effect of dividing it by 2^n.

A left arithmetic shift of one position moves each bit to the left by
one; the vacant least significant bit (LSB) is filled with zero and the most
significant bit (MSB) is discarded. A right arithmetic shift of one position
moves each bit to the right by one; the LSB is discarded and the vacant MSB is
filled with a copy of the sign bit, so that the sign of the number is preserved.
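This multiply/divide behavior can be checked with a short Python sketch modelling an 8-bit two's complement word; the helper names are illustrative:

```python
N = 8  # word width

def arith_shift_left(x, n):
    """Arithmetic left shift: multiply by 2^n, discarding overflow bits."""
    return (x << n) & (2**N - 1)

def arith_shift_right(x, n):
    """Arithmetic right shift: vacant MSBs take copies of the sign bit."""
    sign = x >> (N - 1)
    fill = ((2**n - 1) << (N - n)) if sign else 0
    return (x >> n) | fill

print(bin(arith_shift_left(0b00000101, 1)))   # 0b1010: 5 * 2 = 10
print(bin(arith_shift_right(0b11110100, 2)))  # 0b11111101: -12 / 4 = -3
```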

CHAPTER 7

Multiplexer:

It is a combinational circuit which has many data inputs and a single
output, selected by control or select inputs. For 2^n input lines, n selection
lines are required. Multiplexers are also known as data selectors, parallel-to-
serial converters, many-to-one circuits or universal logic circuits.
Multiplexers are mainly used to increase the amount of data that can be sent
over a network within a certain amount of time and bandwidth.
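Behaviorally, a 4:1 multiplexer (four data inputs, two select lines) reduces to a table lookup, as in this illustrative Python sketch:

```python
def mux4(d0, d1, d2, d3, s1, s0):
    """4:1 multiplexer: two select bits s1 s0 choose one of four inputs."""
    return (d0, d1, d2, d3)[(s1 << 1) | s0]

print(mux4(0, 1, 0, 1, s1=1, s0=0))  # select code 10b routes d2 -> 0
```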

Fig. 7.1.1 4:1 multiplexer

Fig. 7.1.2 Circuit diagram of 4:1 multiplexer

7.1.3 Characteristics of multiplexer:


 Source impedance
 Common mode rejection
 Polarity
 Voltage range
7.1.4 Simulated results of 4:1 multiplexer:

4:1 Multiplexer

CHAPTER 8
ADDER:
8.1 Operation of adder:

An adder is a device that can add two binary digits. It is a type of
digital circuit that performs the addition of two numbers. It is
mainly designed for the addition of binary numbers, but adders are used in
various other applications such as binary-coded decimal arithmetic and address
decoding.

A full adder is an adder which adds three inputs and produces two
outputs. The first two inputs are A and B and the third input is an input carry,
designated C-IN. The output carry is designated C-OUT and the normal output is
designated S, the SUM.

Fig. 8.1.1 Truth table of the full adder

Fig. 8.1.2 Full adder

A full-adder logic is designed in such a manner that eight of them together
create a byte-wide adder, cascading the carry bit from one adder to the next,
as sketched below.
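A Python sketch of the full-adder equations and the byte-wide ripple-carry cascade described above (function names are illustrative):

```python
def full_adder(a, b, cin):
    """One-bit full adder: returns (sum, carry-out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout

def ripple_add8(x, y):
    """Byte-wide adder: cascade the carry through eight full adders."""
    carry, total = 0, 0
    for i in range(8):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        total |= s << i
    return total, carry

print(ripple_add8(0b10110101, 0b01001011))  # (0, 1): 181 + 75 = 256
```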

CHAPTER 9
Exponent Difference And Exponent Increment:
9.1 EXPONENT DIFFERENCE:
This block takes the difference between the exponents of the two inputs
given to the floating-point arithmetic unit. The two input exponents must be
made the same, so we first take the difference between the given input exponent
values: if the difference is zero, the process can proceed; otherwise an
exponent must be incremented or decremented until the exponent difference
becomes zero. This is achieved using the formula

D = Y - Z;
where, Y = (2**a);
Z = (2**b);

when the inputs are a ^ y and b ^ z, and D is the output.


9.1.1 Simulated results for exponent difference:

9.1.2 Simulated result for exponent increment:

For exponent increment:

Y = (2**(a+1));

when the inputs are a ^ y and b ^ z, where y < z.

For exponent decrement:

Y = (a-1);

when the inputs are a ^ y and b ^ z, where y > z.
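A behavioral Python sketch of this exponent alignment (illustrative names; each increment of the smaller exponent is paired with a one-place right shift of its mantissa so the represented value is unchanged):

```python
def align_exponents(m1, e1, m2, e2):
    """Equalize exponents; each increment of the smaller exponent
    shifts its mantissa one place right to keep the value the same."""
    while e1 != e2:            # loop until the difference D is zero
        if e1 < e2:
            e1 += 1; m1 >>= 1  # exponent increment on the smaller side
        else:
            e2 += 1; m2 >>= 1
    return m1, e1, m2, e2

# 1.0b x 2^2 and 1.1b x 2^4 with 4 fraction bits
print(align_exponents(0b10000, 2, 0b11000, 4))  # (0b00100, 4, 0b11000, 4)
```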
9.2 Bit inverter:

An inverter circuit outputs a voltage representing the opposite logic level
to its input; its main function is to invert the applied input signal. Digital
electronic circuits operate at fixed voltage levels corresponding to a logical
0 or 1.
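Behaviorally, an 8-bit inverter simply complements every bit of its input word, as in this one-line Python sketch:

```python
def invert8(x):
    """8-bit inverter: flip every bit of the input word."""
    return ~x & 0xFF

print(bin(invert8(0b10110010)))  # 0b1001101
```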

Fig. 9.2.1 Circuit diagram of the 8-bit inverter

CHAPTER 10

REFERENCES

 D. Bailey, "High-precision computation: applications and challenges," in Proc. 21st IEEE Symp. Comput. Arithmetic, Jun. 2013, Art. no. 3, keynote talk.
 J. Demmel and H. D. Nguyen, "Fast reproducible summation," IEEE Trans. Comput., vol. 64, no. 5, pp. 2060-2070, Jul. 2015.
 A. Kaivani and S. Ko, "Floating-point butterfly architecture based on binary signed-digit representation," IEEE Trans. VLSI Syst., vol. 24, no. 3, 2016.
 F. Qureshi and O. Gustafsson, "Generation of all radix-2 fast Fourier transform algorithms using binary trees," in European Conference on Circuit Theory and Design, 2011.
 IEEE Standard for Floating-Point Arithmetic, 2008.
 D. Tan, C. E. Lemonds and M. J. Schulte, "Low-power multiple-precision iterative floating-point multiplier with SIMD support," IEEE Trans. Comput., vol. 58, no. 2, 2009.
 M. K. Jaiswal and H. K.-H. So, "Dual-mode double precision/two-parallel single precision floating point multiplier architecture," in 2015 IFIP/IEEE International Conference on Very Large Scale Integration (VLSI-SoC), 2015, pp. 213-218, https://doi.org/10.1109/VLSI-SoC.2015.7314418.
 K. Manolopoulos, D. Reisis and V. Chouliaras, "An efficient multiple precision floating-point multiplier," in 18th IEEE International Conference on Electronics, Circuits and Systems (ICECS), 2011, pp. 153-156, https://doi.org/10.1109/ICECS.2011.6122237.

