
CHAPTER 1

INTRODUCTION

1.1 Motivation:
With the recent rapid advances in multimedia and communication systems, real-time signal processing such as audio signal processing, video/image processing, and large-capacity data processing is increasingly in demand. The multiplier and the Multiplier-and-Accumulator (MAC) are essential elements of Digital Signal Processing operations such as filtering, convolution, and inner products.

Most Digital Signal Processing methods use transforms such as the Discrete Cosine Transform (DCT) or the Discrete Wavelet Transform (DWT). Because these are basically accomplished by repetitive application of multiplication and addition, the speed of the multiplication and addition arithmetic determines the execution speed and performance of the entire calculation. Because the multiplier has the longest delay among the basic operational blocks in a digital system, the critical path is generally determined by the multiplier. For high-speed multiplication, the Modified radix-4 Booth's Algorithm (MBA) is commonly used. However, this cannot completely solve the problem, because the critical path for multiplication remains long.

In general, a multiplier uses Booth's algorithm and an array of full adders (FAs), or a Wallace tree instead of the FA array. The Wallace tree multiplier mainly consists of three parts: a Booth encoder, a tree to compress the partial products (the Wallace tree itself), and a final adder. Because the Wallace tree adds the partial products from the encoder in parallel as far as possible, its operation time is proportional to O(log2 N), where N is the number of inputs. It exploits the fact that counting the number of 1's among the inputs reduces the number of outputs to log2 N. In real implementations, many (3:2) or (7:3) counters are used to reduce the number of outputs in each pipeline stage. The most effective way to increase the speed of a multiplier is to reduce the number of partial products, because multiplication is followed by a series of additions of those partial products. To reduce the number of calculation steps for the partial products, the MBA has mostly been applied, with the Wallace tree taking the role of speeding up the addition of the partial products. To increase the speed of the MBA algorithm, many parallel multiplication architectures have been researched. Among them, architectures based on the Baugh–Wooley Algorithm (BWA) have been developed and applied to various digital filtering calculations.

One of the most advanced types of MAC for general-purpose Digital Signal Processing has been proposed by Elguibaly. It is an architecture in which accumulation has been combined with the Carry Save Adder (CSA) tree that compresses the partial products. In this architecture, the critical path is reduced by eliminating the adder for accumulation and decreasing the number of input bits in the final adder. While it performs better than previous MAC architectures because of the reduced critical path, its output rate can still be improved, since the final adder result must be fed back for accumulation. Therefore, a new architecture for a high-speed MAC is proposed.

1.2 Objective:
In this thesis, a new high-speed MAC architecture is proposed. In this MAC, the computations of multiplication and accumulation are combined, and a hybrid-type CSA structure is proposed to reduce the critical path and improve the output rate.

1.3 Thesis Organization:


The rest of this thesis is organized as follows:
Chapter 2 illustrates the different types of multipliers and adders and the different types of parallel MAC architectures.
Chapter 3 presents the FPGA and ASIC implementations of Elguibaly's and the proposed parallel MAC architectures.
Chapter 4 presents the simulation results of all the parallel MAC architectures.
Chapter 5 draws conclusions on this work and briefly discusses future scope.

CHAPTER 2
LITERATURE SURVEY

2.1 Introduction:
This chapter explains the significance of MAC and gives a survey on various
types of multipliers, adders and MAC architectures.

2.2 Significance of MAC:


Digital Signal Processing (DSP) is used in a wide range of applications such as speech and audio coding, image and video processing, pattern recognition, sonar, and so on. Real-time Very Large Scale Integration (VLSI) implementation of DSP instructions requires hardware architectures that can process input signal samples as they are received. Most DSP computation involves multiply and multiply-accumulate operations, and therefore the Multiplier Accumulator (MAC) unit is very important in DSP applications.

A digital multiplier is a fundamental component in general-purpose microprocessors and in digital signal processors. In addition, the multiply-accumulate (MAC) operation is very prevalent in many scientific and engineering applications; it is easy to find such operations in signal processing algorithms and matrix arithmetic. The multiplier speed is usually the bottleneck that determines the depth of pipelining in the ALU and how fast a processor can run. The current trend in ALU design is to implement the addition and multiplication operations using one hardware component.

MAC = Multiplication + Accumulation

2.3 Multiplier Background:


Multiplier circuits are found in virtually every computer, cellular telephone,
and digital audio/video equipment. In fact, essentially any digital device used to
handle speech, stereo, image, graphics, and multimedia content contains one or more
multiplier circuits. The multiplier circuits are usually integrated within
microprocessor, media co-processor, and digital signal processor chips. These
multipliers are used to perform a wide range of functions such as address generation,
Discrete Cosine Transforms (DCT), Fast Fourier Transforms (FFT), multiply-accumulate, etc. Thus multipliers play a critical role in processing audio, graphics, video, and multimedia data.

Multiplication operation is carried out by using different types of multipliers.


1. Serial multiplier
2. Parallel multiplier
i) Booth encoding
ii) Modified booth encoding

2.3.1 Binary serial multiplier:


A binary serial multiplier is an electronic hardware device used in digital electronics, a computer, or another electronic device to perform rapid multiplication of two numbers in binary representation. It is built using binary adders. The basic hardware required for a binary multiplier is shown in Fig. 2.1.

The rules for binary multiplication can be stated as follows:


i) If the multiplier digit is a 1, the multiplicand is simply copied down and represents
the product.
ii) If the multiplier digit is a 0 the product is also 0.

Fig. 2.1 Basic hardware required for multiplier

For designing a multiplier circuit, the following four points should be kept in
mind.
1. It should be capable of identifying whether a bit is 0 or 1.
2. It should be capable of shifting the partial products left.
3. It should be able to add all the partial products to give the product as the sum of
the partial products.

4. It should examine the sign bits: if they are alike, the sign of the product will be positive; if they are opposite, the product will be negative. The product's sign bit determined by this rule should be stored and displayed along with the product.

From the above discussion it is clear that it is not necessary to wait until all the partial products have been formed before summing them. In fact, the addition of each partial product can be carried out as soon as it is formed.

2.3.2 Parallel Multiplier:


Parallel multipliers can be implemented using two methods to increase the
speed of the multiplier. They are
1. Booth encoding Algorithm
2. Modified Booth Algorithm (MBA)

2.3.2.1 Booth encoding:


Booth multiplication is a technique that allows for smaller, faster multiplication circuits by recoding the numbers. It is a standard technique used in chip design and provides significant improvements over the long multiplication technique.

Shift and Add:


A standard approach that might be taken by a novice to perform multiplication is "shift and add", or normal "long multiplication". That is, for each column in the multiplier, shift the multiplicand by the appropriate number of columns and multiply it by the value of the digit in that column of the multiplier to obtain a partial product. The partial products are then added to obtain the final result.

Fig. 2.2 Illustration of multiplication using shift and add method


With this system, the number of partial products is exactly the number of
columns in the multiplier.
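The shift-and-add procedure above can be sketched in a few lines of Python. This is an illustrative model (the function name is not from any standard library), not a hardware description:

```python
def shift_and_add(multiplicand, multiplier, n=4):
    """Unsigned shift-and-add multiplication: one partial product
    per multiplier column, n partial products in total."""
    product = 0
    for i in range(n):
        bit = (multiplier >> i) & 1          # digit in column i of the multiplier
        partial = (multiplicand * bit) << i  # multiplicand shifted left i columns
        product += partial                   # accumulate the partial product
    return product
```

As noted above, the number of partial products equals the number of multiplier columns, regardless of how many of them are zero.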

The Booth algorithm gives a procedure for multiplying binary integers in signed 2's-complement representation.

Example: 2ten × (–4)ten, i.e., 0010two × 1100two

Step 1: Making the Booth table


I. From the two numbers, pick the number with the fewest changes between consecutive bits, and make it the multiplier.
i.e., 0010 -- from 0 to 0 no change, 0 to 1 one change, 1 to 0 another change, so there are two changes in this one.
1100 -- from 1 to 1 no change, 1 to 0 one change, 0 to 0 no change, so there is only one change in this one.
Therefore, for the multiplication 2 × (–4), 2ten (0010two) is the multiplicand and (–4)ten (1100two) is the multiplier.

II. Let X = 1100 (multiplier)
Let Y = 0010 (multiplicand)
Take the 2's complement of Y and call it –Y:
–Y = 1110
III. Load the X value into the table.
IV. Load 0 as the X-1 value; it acts as the bit to the right of the least significant bit of X.
V. Load 0 into the U and V rows, which together will hold the product of X and Y at the end of the operation.
VI. Make one row for each cycle; because this is a multiplication of four-bit numbers, there are four cycles.

Step 2: Booth Algorithm


The Booth algorithm requires examination of the multiplier bits and shifting of the partial product. Prior to the shifting, the multiplicand may be added to the partial product, subtracted from the partial product, or left unchanged, according to the following rules:
Look at the least significant bit of the multiplier X and the previous least significant bit X-1:
I. 0 0 Shift only
1 1 Shift only
0 1 Add Y to U, and shift
1 0 Subtract Y from U (i.e., add –Y to U), and shift

II. Take U and V together and perform an arithmetic right shift, which preserves the sign bit of a 2's-complement number. Thus a positive number remains positive, and a negative number remains negative.
III. Perform a circular right shift on X, because this avoids using two registers for the X value.

Illustration of the Booth multiplier is shown in Fig. 2.3.

Fig. 2.3 Illustration of Booth multiplier


After completion of the four cycles, the answer is shown in the last rows of U and V of Fig. 2.3, which is 11111000two (i.e., –8ten).
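The register-transfer steps above can be mirrored arithmetically in Python: the additions of +Y and –Y at weight 2^i stand in for the explicit U:V register shifts. This is an illustrative sketch (the function name is not standard), assuming x is an n-bit signed value; Python's arithmetic right shift handles the negative-multiplier bits:

```python
def booth_multiply(y, x, n=4):
    """Radix-2 Booth recoding: examine each bit pair (x_i, x_(i-1))
    and add +Y, -Y, or nothing at weight 2^i, per the rule table above.
    y = multiplicand (Y), x = multiplier (X), both n-bit signed values."""
    product = 0
    x_prev = 0                         # the appended X(-1) bit, initially 0
    for i in range(n):
        bit = (x >> i) & 1
        if (bit, x_prev) == (0, 1):    # end of a run of 1's: add Y
            product += y << i
        elif (bit, x_prev) == (1, 0):  # start of a run of 1's: subtract Y
            product -= y << i
        # (0,0) and (1,1): shift only, nothing added
        x_prev = bit
    return product
```

For the worked example, `booth_multiply(2, -4)` gives –8, whose 8-bit two's-complement pattern `(-8) & 0xFF` is 11111000, matching the last U:V row of Fig. 2.3.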

2.3.2.2 Modified Booth multiplication Algorithm for radix 4:


One way to realize high-speed multipliers is to enhance parallelism, which helps to decrease the number of subsequent calculation stages. The original version of the Booth algorithm (radix-2) had two drawbacks.
They are:
(i) The number of add/subtract operations and the number of shift operations become variable, which is inconvenient when designing parallel multipliers.
(ii) The algorithm becomes inefficient when there are isolated 1's.

These problems are overcome by the modified radix-4 Booth algorithm, which reduces the number of partial products by half. The basic idea is that, instead of shifting and adding for every column of the multiplier and multiplying by 1 or 0, only every second column is taken and multiplied by ±1, ±2, or 0 to obtain the same result. So, to multiply by 7, we can multiply the partial product aligned against the least significant bit by –1, and multiply the partial product aligned with the third column by 2.

Partial Product 0 = Multiplicand × –1, shifted left 0 bits (× –1)
Partial Product 1 = Multiplicand × 2, shifted left 2 bits (× 8)

This is the same result as the equivalent shift-and-add method:
Partial Product 0 = Multiplicand × 1, shifted left 0 bits (× 1)
Partial Product 1 = Multiplicand × 1, shifted left 1 bit (× 2)
Partial Product 2 = Multiplicand × 1, shifted left 2 bits (× 4)
Partial Product 3 = Multiplicand × 0, shifted left 3 bits (× 0)

The advantage of this method is the halving of the number of partial products. This is important in circuit design because it reduces the propagation delay of the circuit as well as the complexity and power consumption of its implementation.

Consider the multiplication of two 2's-complement N-bit numbers X and Y.

 Let X be the multiplicand and Y the multiplier.
 For modified Booth recoding, a "0" must always be concatenated to the right of Y.

To recode the Booth multiplier term, consider the multiplier bits in blocks of three, such that each block overlaps the previous block by one bit. Grouping starts from the LSB, and the first block uses only two bits of the multiplier (since there is no previous block to overlap, the third bit is taken as the appended 0), as shown in Fig. 2.4 (a).

The overlap is necessary to know what happened in the last block, as the MSB
of the block acts like a sign bit. Then consult the table given below to determine the
Booth encoded terms.

Table 2.1
Modified Booth Encoding Table

Fig. 2.4 Modified Booth encoder (a) Grouping of multiplier bits for N=8,
(b) Encoder circuit

Since the LSB of each block is used to know what the sign bit was in the previous block, and there are never any negative products before the least significant block, the LSB of the first block is always taken to be 0.

In the case where there are not enough bits to obtain an MSB for the last block, as below, the multiplier sign bit can be extended by one bit.

For Example: 0 0 1 1 1
Block 0 : 110 Encoding: * (-1)
Block 1 : 011 Encoding: * (2)
Block 2 : 000 Encoding: * (0)

The modified Booth encoder circuit is designed using the modified Booth encoding table and is shown in Fig. 2.4 (b).

In Fig. 2.4 (b),

 The Xsel signal selects the multiplicand itself (±X, rather than zero or a doubled multiple) as the partial product.

 The 2Xsel signal is used as the control to a 2:1 multiplexer, to select whether or
not the partial product bits are shifted left by one position.
 Finally, the NEGsel signal indicates whether or not to invert all of the bits to
create a negative product (which must be corrected by adding "1" at some later
stage).

From Fig. 2.4 (b), based on the values of Xsel, 2Xsel and NEGsel, the corresponding partial products (0, +X, –X, +2X, –2X) are generated; these partial products are arranged in the manner shown in Fig. 2.5 and added together to get the required multiplication result.

Fig. 2.5 Partial products for N = 8 which are to be added.

The modified radix-4 Booth algorithm scans strings of three bits according to the algorithm given below:
1) Extend the sign bit by 1 position if necessary to ensure that n is even.
2) Append a 0 to the right of the LSB of the multiplier.
3) According to the value of each vector, each partial product will be 0, +X, –X, +2X or –2X.
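The grouping and recoding steps can be sketched as follows. This is an illustrative Python model (function names are not standard): the table maps each overlapping 3-bit group (b2, b1, b0) to a digit in {0, ±1, ±2}, i.e., to a partial product 0, ±X or ±2X:

```python
def mba_recode(x, n=8):
    """Radix-4 modified Booth recoding of an n-bit multiplier x:
    append a 0 to the right, then scan overlapping 3-bit groups."""
    table = {(0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1, (0, 1, 1): 2,
             (1, 0, 0): -2, (1, 0, 1): -1, (1, 1, 0): -1, (1, 1, 1): 0}
    digits = []
    x_prev = 0                                  # the appended 0
    for j in range(n // 2):
        b0 = x_prev                             # one-bit overlap with previous group
        b1 = (x >> (2 * j)) & 1
        b2 = (x >> (2 * j + 1)) & 1
        digits.append(table[(b2, b1, b0)])
        x_prev = b2                             # MSB carried into the next group
    return digits                               # n/2 digits: half the partial products

def mba_multiply(y, x, n=8):
    """Sum the n/2 partial products Q_j * Y at weight 4^j."""
    return sum(q * y * 4**j for j, q in enumerate(mba_recode(x, n)))
```

For the example above (multiplier 00111 = 7, extended to 6 bits), `mba_recode(7, 6)` yields the digits [–1, +2, 0], matching Blocks 0 through 2.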

Table 2.2
Partial Product for each encoded yi’ with N=8
( pi , j: partial product bit, ni, 0: negation bit )

The previous example can be rewritten as:

2.4 Sign Extension:

Once the Booth recoded partial products have been generated, they need to be shifted and added together in the following fashion:

The problem with implementing this in hardware is that the first partial product needs to be sign-extended by 6 bits, the second by 4 bits, and so on. This is easily achievable in hardware, but requires more logic gates than if those bits could be kept permanently constant.

Fortunately, there is a technique that avoids full sign extension:

 Invert the most significant bit (MSB) of each partial product
 Add an additional '1' to the MSB of the first partial product
 Add an additional '1' in front of each partial product

This technique allows any sign bits to be correctly propagated, without the need to sign-extend all of the bits.

2.4.1 Sign Extension Simplification:
Fig. 2.6 shows a 16-bit radix-4 Booth partial-product array for an unsigned multiplier using the dot diagram notation. An extra bit '1' is added to the least significant bit of the next row to form the 2's complement of negative multiples. Inverting the implicit leading 0's generates leading 1's on negative multiples. PP8 is required in case PP7 is negative; for an unsigned multiplier this partial product is always zero.

Fig. 2.6 Booth-encoded partial products with sign extension

Observe that all the sign extension bits are either 1's or 0's. If a single 1 is added to the least significant position in a string of 1's, the result is a string of 0's plus a carry out of the top bit, which may be discarded. Therefore, the large number of 's' bits in each partial product can be replaced by an equal number of constant 1's plus the inverse of s added to the least significant position, as shown in Fig. 2.7. These constant values can mostly be optimized out of the array by pre-computing their sum. The simplified result is shown in Fig. 2.8.

Fig. 2.7 Booth-encoded partial products with simplified sign extension

Fig. 2.8 Booth-encoded partial products after optimized sign extension

2.5 Adders for Multiplication:


Different types of adder architectures are available to add the partial products, each with its own advantages and disadvantages:
1. Ripple Carry Adder
2. Carry Look-ahead Adder
3. Carry Select Adder
4. Carry Save Adder
5. Hybrid Adder

2.5.1 Ripple Carry Adder:


The ripple carry adder is constructed by cascading full adder (FA) blocks in series. One full adder is responsible for the addition of two binary digits at any stage of the ripple carry. The carry-out of one stage is fed directly to the carry-in of the next stage. A number of full adders may be added to the ripple carry adder, or ripple carry adders of different sizes may be cascaded, in order to accommodate binary vector strings of larger sizes. An n-bit parallel adder requires n computational elements (FAs). Fig. 2.9 shows an example of a parallel adder: a 4-bit ripple-carry adder. It is composed of four full adders. The augend bits of x are added to the addend bits of y according to their binary position. Each bit addition creates a sum and a carry-out. The carry-out is then transmitted to the carry-in of the next higher-order bit. The final result creates a sum of four bits plus a carry-out (c4).

Fig. 2.9 4-bit Ripple-Carry Adder Block Diagram

Even though this is a simple adder that can be used to add numbers of unrestricted bit length, it is not very efficient when large bit lengths are used.

One of the most serious drawbacks of this adder is that the delay increases linearly with the bit length. As mentioned before, each full adder has to wait for the carry-out of the previous stage to output a steady-state result. Therefore, even if the adder has a value at its output terminal, it has to wait for the propagation of the carry before the output reaches its correct value. In Fig. 2.9, the addition of x4 and y4 cannot reach steady state until c4 becomes available. In turn, c4 has to wait for c3, and so on down to c1. If one full adder takes TFA seconds to complete its operation, the final result will reach its steady-state value only after 4·TFA seconds. Its area is 4·AFA.
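The cascaded-FA structure can be modeled directly. This is an illustrative Python sketch of the logic, not a timing model (function names invented here):

```python
def full_adder(a, b, cin):
    """One FA cell: the sum and carry-out of three input bits."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout

def ripple_carry_add(x, y, n=4):
    """n-bit RCA: the carry-out of each stage feeds the carry-in of
    the next, which is why the delay grows linearly with n."""
    carry, result = 0, 0
    for i in range(n):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result, carry                 # the n-bit sum and c_n
```

For example, adding 9 and 7 in 4 bits gives a sum field of 0000 with the final carry c4 = 1, i.e., 16.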

2.5.2 Carry Look-ahead Adder (CLA):


As seen in the ripple-carry adder, its limiting factor is the time it takes to propagate the carry. The carry look-ahead adder solves this problem by calculating the carry signals in advance, based on the input signals. The result is a reduced carry propagation time. To understand how the carry look-ahead adder works, we have to manipulate the Boolean expressions describing the full adder. The propagate (P) and generate (G) signals of a full adder are given as:


Pi = Ai xor Bi (carry propagate)
Gi = AiBi (carry generate)

The new expressions for the output sum and the carry-out are given by:
Si = Pi xor Ci
Ci+1 = Gi + PiCi

These equations show that a carry signal will be generated in two cases:
1) If both bits Ai and Bi are 1
2) If either Ai or Bi is 1 and the carry-in Ci is 1.

Let's apply these equations for a 4-bit adder:


C1 = G0 + P0C0
C2 = G1 + P1C1 = G1 + P1 (G0 + P0C0) = G1 + P1G0 + P1P0C0
C3 = G2 + P2C2 = G2 + P2G1 + P2P1G0 + P2P1P0C0
C4 = G3 + P3C3 = G3 + P3G2 + P3P2G1 + P3P2P1G0 + P3P2P1P0C0

Fig. 2.10 4-bit carry look ahead (CLA)

These expressions show that C2, C3 and C4 do not ripple through the preceding stages. Therefore C4 does not need to wait for C3 to propagate: as soon as C0 and the P and G signals are available, C4 can reach steady state. The same is also true for C2 and C3.

The general expression is
Ci+1 = Gi + PiGi-1 + PiPi-1Gi-2 + … + PiPi-1…P1G0 + PiPi-1…P1P0C0
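The look-ahead carries can be computed exactly as the flattened expressions above. The following is an illustrative Python sketch (function names invented here), building each C_(i+1) directly from the G's, P's and C0:

```python
def cla_add(a, b, c0=0, n=4):
    """n-bit CLA: every carry is computed directly from the generate
    and propagate signals and C0, with no rippling between stages."""
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]          # G_i = A_i B_i
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]        # P_i = A_i xor B_i
    c = [c0]
    for i in range(n):
        # C_(i+1) = G_i + P_i G_(i-1) + ... + P_i...P1 G_0 + P_i...P0 C_0
        carry, prefix = g[i], p[i]
        for k in range(i - 1, -1, -1):
            carry |= prefix & g[k]
            prefix &= p[k]
        c.append(carry | (prefix & c0))
    s = sum((p[i] ^ c[i]) << i for i in range(n))            # S_i = P_i xor C_i
    return s, c[n]
```

Note that each carry uses only the inputs and C0, which is precisely why C4 need not wait for C3.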

For small n (n ≤ 4), the RCA is advantageous because it requires less area and is still fast. The CLA consumes more area and power because of its large number of logic gates.

2.5.3 Carry Select Adder:


The concept of the carry-select adder is to compute alternative results in parallel and subsequently select the correct result with single- or multiple-stage hierarchical techniques. In order to enhance its speed, the carry-select adder pays with increased area. In carry-select adders, both the sum and carry bits are calculated for the two alternative input carries, "0" and "1". Once the carry-in is delivered, the correct computation is chosen (using a MUX) to produce the desired output. Therefore, instead of waiting for the carry-in to calculate the sum, the sum is correctly output as soon as the carry-in arrives. The time otherwise spent computing the sum after the carry arrives is avoided, which results in a good improvement in speed. This concept is illustrated in Fig. 2.11.

Fig. 2.11 4-bit carry-select adder


Carry-select adders can be divided into equal or unequal sections. Fig. 2.11 shows the implementation of a carry-select adder built from 4-bit sections. For each section, the calculation of the two sums is accomplished using two 4-bit ripple-carry adders. One of these adders is fed a 0 as carry-in, whereas the other is fed a 1. Then, using a multiplexer, the correct sum is chosen depending on the real carry-out of the previous section. Similarly, the carry-out of the section is computed twice and chosen depending on the carry-out of the previous section.
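The select mechanism can be sketched as follows (illustrative Python, names invented here; n is the total width and k the section size, defaulting to 8-bit operands in 4-bit sections):

```python
def carry_select_add(x, y, cin=0, n=8, k=4):
    """Carry-select adder from k-bit sections. Each section computes
    two sums in parallel (assumed carry-in 0 and 1); the real incoming
    carry then selects one of them via a mux."""
    mask = (1 << k) - 1
    result, carry = 0, cin
    for s in range(0, n, k):
        xs, ys = (x >> s) & mask, (y >> s) & mask
        sum0, c0 = (xs + ys) & mask, (xs + ys) >> k          # assume carry-in 0
        sum1, c1 = (xs + ys + 1) & mask, (xs + ys + 1) >> k  # assume carry-in 1
        # mux: the previous section's carry-out picks the precomputed pair
        result |= (sum1 if carry else sum0) << s
        carry = c1 if carry else c0
    return result, carry
```

Both candidate sums of a section are ready before its carry-in arrives, so only the mux delay is added per section.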

2.5.4 Carry Save Adder:
Unlike the RCA, CLA and carry-select adders, the Carry Save Adder realizes concurrent addition of multiple operands, which is a basic requirement of multiplication. Carry Save Adder architectures are used to greatly increase the speed of the addition process.
In this technique, the carry output from bit i during step j is applied to the carry input of bit i+1 during the next step j+1. After the addition of the product components in the last row, one more step is required in which the carries are allowed to ripple from the least to the most significant bit. This technique does not save any hardware, but it reduces the propagation delay substantially.
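A (3:2) carry-save step and its use for multi-operand addition can be sketched as follows (illustrative Python, names invented here):

```python
def carry_save_step(a, b, c):
    """(3:2) compression: reduce three operands to a sum word and a
    carry word with no carry propagation between bit positions."""
    sum_word = a ^ b ^ c                              # per-bit sum
    carry_word = ((a & b) | (b & c) | (a & c)) << 1   # carries, moved to bit i+1
    return sum_word, carry_word

def carry_save_sum(operands):
    """Reduce many operands with repeated CSA steps, then perform one
    final carry-propagate addition (the ripple step at the end)."""
    a, b = operands[0], operands[1]
    for op in operands[2:]:
        a, b = carry_save_step(a, b, op)
    return a + b                                      # final carry-propagate add
```

Only the final addition propagates carries; every earlier step has the constant delay of a single full-adder layer, regardless of operand count.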

2.5.5 Hybrid Adder:


A hybrid adder is a combination of any two adders and is used in high-speed applications. The hybrid adder generally consists of two carry look-ahead adders and a multiplexer. Adding two n-bit numbers with a hybrid adder is done with two adders (two carry look-ahead adders) that perform the calculation twice, once with the assumption that the carry is zero and once assuming it is one. After the two results are calculated, the correct sum, as well as the correct carry, is selected with the multiplexer once the correct carry is known. The propagation delay of the hybrid adder is smaller, but at the same time it occupies a larger area than the other adders.

Fig. 2.12 Hybrid Adder

In this thesis, a hybrid CSA tree structure is used to increase the performance of the MAC architecture. In this hybrid adder architecture, a combination of a Carry Save Adder and a mix of 2-bit and 4-bit Carry Look-ahead Adders is used.

2.6 Overview of MAC:

In this section, the basic MAC operation is introduced. A multiplier can be divided into three operational steps. The first is radix-4 Booth encoding, in which partial products are generated from the multiplier X and the multiplicand Y. The second is the adder array, or partial-product compression, which adds all the partial products and converts them into the form of sum and carry. The last is the final addition, in which the final multiplication result is produced by adding the sum and the carry. If the process of accumulating the multiplied results is included, a MAC consists of the four steps shown in Fig. 2.13.

Fig.2.13 Basic arithmetic steps of multiplication and accumulation.

The general hardware architecture of this MAC is shown in Fig. 2.14. It executes the multiplication operation by multiplying the input multiplier X by the multiplicand Y. The result is added to the previous result Z as the accumulation step.

P=X×Y+Z

The N-bit 2's complement binary number X can be expressed as

X = -2^(N-1)·xN-1 + 2^(N-2)·xN-2 + … + 2·x1 + x0 (1)

Fig. 2.14 Hardware architecture of general MAC

If (1) is expressed in base-4 redundant signed-digit form in order to apply the radix-4 Booth's algorithm,

X = Σj Qj·4^j, j = 0, 1, …, N/2-1, where Qj = -2x2j+1 + x2j + x2j-1 and x-1 = 0 (2)

If (2) is used, the multiplication can be expressed as

X × Y = Σj Qj·Y·4^j (3)

If these equations are used, the aforementioned multiplication–accumulation result can be expressed as

P = X × Y + Z = Σj Qj·Y·4^j + Z (5)

Each of the two terms on the right-hand side of (5) is calculated independently
and the final result is produced by adding the two results. The MAC architecture
implemented by (5) is called the standard design.
If N-bit data are multiplied, the number of generated partial products is proportional to N, and adding them serially makes the execution time proportional to N as well. The fastest multiplier architecture uses radix-4 Booth encoding to generate the partial products and a Wallace tree based on CSAs as the adder array to add them. If radix-4 Booth encoding is used, the number of partial products, i.e., the inputs to the Wallace tree, is reduced to half, decreasing the number of CSA tree steps. In addition, signed multiplication based on 2's complement numbers is also possible. For these reasons, most current multipliers adopt Booth encoding.
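The four arithmetic steps of Fig. 2.13 can be strung together in one small model. This is an illustrative Python sketch (names invented here), assuming x is an n-bit signed multiplier; the carry-save loop stands in for the CSA tree:

```python
def mac(x, y, z, n=8):
    """The four MAC steps: (1) Booth-encode X into n/2 partial
    products of Y, (2) compress them with carry-save steps into a
    sum word and a carry word, (3) final addition, (4) accumulation."""
    # step 1: radix-4 Booth digits of x, partial products q*Y at weight 4^j
    pps, prev = [], 0
    for j in range(n // 2):
        g = prev + 2 * ((x >> 2 * j) & 1) + 4 * ((x >> (2 * j + 1)) & 1)
        q = [0, 1, 1, 2, -2, -1, -1, 0][g]
        pps.append((q * y) << (2 * j))
        prev = (x >> (2 * j + 1)) & 1
    # step 2: carry-save (3:2) compression down to two words
    s, c = pps[0], pps[1]
    for pp in pps[2:]:
        t = s ^ c ^ pp
        c = ((s & c) | (c & pp) | (s & pp)) << 1
        s = t
    # steps 3 and 4: final addition of sum and carry, then accumulation
    return (s + c) + z
```

In the merged architectures discussed later, the accumulation of step 4 is folded into the step-2 compression rather than performed after the final adder.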

2.7 Parallel MAC structure:


Fig. 2.15 shows the general structure of a parallel MAC for multiplying two numbers X and Y and adding the result to Z. The partial products in the figure can be generated with any multiplication algorithm, using bit-serial, serial-parallel, or full-parallel techniques, as explained below. The required tasks, organized from fastest to slowest, are the following.

1) Partial-Product Generation: This can be achieved using several techniques such as the BWA, the Booth algorithm (BA), or the MBA. For an n-bit multiplier, the number of summands is n for the BWA, ≤ n/2 for the BA, and n/2 for the MBA. In addition to the encoding step, the BA and MBA algorithms also require generation of the two's complement of the multiplicand, which introduces extra delay. The delay for two's-complement generation is not trivial, but has been consistently neglected in most of the proposed designs.

Fig. 2.15 Three major parts of a general parallel multiplier

2) Partial-Product Addition: This is done using a carry ripple adder for serial-
parallel multipliers. For parallel multipliers, the addition is accomplished using carry-
save techniques, Wallace trees, or summand skip. However, the last two techniques
require irregular wiring and extra hardware.

3) Final Adder: When the number of partial products is reduced to sum and carry words, a final adder is required to generate the multiplication result. The number of bits of the final adder is the sum of the numbers of bits of the multiplier and the multiplicand. Thus, the data path width is usually doubled, and the delay of this stage is the most severe. In this thesis, we use a mix of 2-bit and 4-bit CLAs to reduce the delay and area requirements.

4) Accumulator: The final adder produces a double-precision result of 2n bits that must be added to the accumulator content, which is also 2n bits wide. Typical microprocessors will complete the multiplication operation and follow it with a double-precision accumulation operation. Needless to say, this is delay intensive, since the multiplier and the accumulator will each have a delay that is almost 2n times the delay of a one-bit full adder. Recently, it was realized that the multiply and accumulate operations can be merged so that the MAC operation takes about as much delay as a regular multiply operation.

To increase the speed of the MBA algorithm, several parallel MAC architectures have been researched.

2.8 Standard design:

Fig. 2.16 MAC structure of the standard design

Fig. 2.17 Hardware architecture for the standard design

As shown in Fig. 2.16, first an n × n-bit multiplication is carried out using the Booth multiplier, and then the 2n-bit multiplication result is accumulated with the 2n-bit accumulator content. The final output of the MAC structure is therefore 2n+1 bits wide.

Internally, the Booth multiplier includes the following three blocks, which are shown in Fig. 2.17.

1. Booth encoder 2. CSA tree 3. Final addition

Booth Encoder: It is used to generate the partial products.

For an n × n-bit MAC operation, let us consider both the multiplicand (X) and the multiplier (Y) to be n bits wide. These two are given as inputs to the Booth encoder. Based on the modified Booth encoding table, one partial product from the Booth encoding set (0, +X, –X, +2X, –2X) is generated for each 3-bit grouping of the multiplier. The negative multiples are generated using the 2's complement method. Here, each partial product is n+1 bits wide.

CSA tree: The partial products generated in the previous step are given as input to the CSA tree. The CSA tree compresses the generated partial products and converts them into the form of sum and carry words. Here, the inputs and outputs are of the same width, i.e., n+1 bits.

Final Addition: When the number of partial products is reduced to sum and carry words, a final adder is required to generate the multiplication result. Here, the data path width is doubled, i.e., 2n bits. This multiplication result is added to the accumulator content, which is also 2n bits wide.

In order to increase the MAC speed, two major bottlenecks need to be considered. The first is the partial-product reduction network and the second is the accumulator; both of these stages require the addition of large operands involving long carry-propagation paths. Using a tree architecture represents an attractive solution to speed up the partial-product reduction process.

Since the accumulation has the longest delay in the MAC operation, the independent accumulation operation has been removed and merged into the compression process of the partial products, so that the overall MAC operation is improved.

One of the most advanced types of MAC for general-purpose digital signal
processing has been proposed by Elguibaly. It is an architecture in which
accumulation has been combined with the carry save adder (CSA) tree that
compresses partial products. In this architecture, the critical path was reduced by
eliminating the adder for accumulation and decreasing the number of input bits in the
final adder.

2.9 Elguibaly’s parallel MAC architecture:

In this architecture, a Dependence Graph (DG) of the merged MAC operation
based on MBA is developed.

2.9.1 Dependence Graph (DG):

Fig. 2.18 Bit level dependence graph


A Dependence Graph (DG) shown in Fig. 2.18 is a directed graph that shows
the dependence of the computations in an algorithm. The nodes in a DG represent
computations and the edges represent precedence constraints among nodes. DG
representation is similar to the DFG representation as it explicitly exhibits the
dependence of nodes on other nodes in the graph. The difference is that the nodes in
DFG only cover the computations in one iteration of the corresponding algorithm and
they are executed repetitively from iteration to iteration, whereas DG contains
computations for all iterations in an algorithm. Fig. 2.19 and Fig. 2.20 show the
multiplication operation using a DG-based parallel carry-save addition structure.

Fig.2.19 Parallel carry-save array multiplier

Fig. 2.20 Dependence graph for carry save addition with carry ripple vector merging
Dependence Graphs are used for systolic array design, where various
implementations can be derived from a single DG by exploiting the parallelism
present in the DG in different ways.

Fig. 2.21 Elguibaly’s Parallel MAC design

Fig. 2.22 Hardware architecture for Elguibaly’s design

2.9.2 DG of MAC algorithm:


The MAC operation can be written as

P = X × Y + Z

where the multiplier X and the multiplicand Y are assumed to have n bits each
and the addend Z has 2n bits. The number of partial products (summands) in the
MBA is given by

number of summands = n/2        for n even
                   = (n+1)/2    for n odd

where x_i = 0 or 1 and the Booth digits are given by Q_j = -2x_{2j+1} + x_{2j} + x_{2j-1}, with x_{-1} = 0.
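The digit recoding above can be sketched behaviorally (an illustrative Python model, not the hardware; the function names are ours):

```python
def booth_radix4_digits(x, n):
    """Radix-4 Booth recoding of an n-bit two's-complement number:
    Q_j = -2*x_{2j+1} + x_{2j} + x_{2j-1}, with x_{-1} = 0 and the top
    bit replicated (sign extension) when an index runs past n-1."""
    bits = [(x >> i) & 1 for i in range(n)]
    sign = bits[n - 1]
    digits, prev = [], 0                  # prev holds x_{2j-1}; x_{-1} = 0
    for j in range((n + 1) // 2):         # n/2 digits (n even), (n+1)/2 (n odd)
        b0 = bits[2 * j] if 2 * j < n else sign
        b1 = bits[2 * j + 1] if 2 * j + 1 < n else sign
        digits.append(-2 * b1 + b0 + prev)
        prev = b1
    return digits

def booth_value(digits):
    """Each digit Q_j carries weight 4^j, so the digits reconstruct x."""
    return sum(d * 4 ** j for j, d in enumerate(digits))
```

Every digit lies in {-2, -1, 0, 1, 2}, which is why each summand needs only the multiples Y, 2Y, -Y and -2Y.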

The MAC result is given by

P = Σ_{j=0}^{⌈n/2⌉-1} Q_j · Y · 2^{2j} + Z

where Y is the multiplicand and each summand Q_j · Y is a two's-complement
number, with n+1 bits to be able to represent Y, 2Y, -Y and -2Y. One more bit is
needed in order to be able to add two two's-complement numbers without the possibility
of overflow. Thus, each partial product will consist of n+2 bits.
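The (n+2)-bit claim can be checked exhaustively for small n (an illustrative Python check, not part of the design):

```python
# An n-bit two's-complement Y times a Booth digit Q in {-2, -1, 0, 1, 2}
# must fit in an (n+2)-bit two's-complement word.
n = 8
lo, hi = -(1 << (n + 1)), (1 << (n + 1)) - 1      # (n+2)-bit signed range
for y in range(-(1 << (n - 1)), 1 << (n - 1)):    # all n-bit signed values
    for q in (-2, -1, 0, 1, 2):
        assert lo <= q * y <= hi                  # never overflows n+2 bits
# n+1 bits would not suffice: (-2) * (-2**(n-1)) = 2**n exceeds the
# (n+1)-bit maximum of 2**n - 1.
```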

Fig. 2.23 shows the DG of the MAC operation for n = 8. Four rows,
representing the summands S0: S3, are generated by the Booth encoders. These
summands are added using carry-save addition as shown by the empty circles and
lines connecting bits of equal binary weight. To prevent overflow, sign extensions
have to be provided (as shown by the black circles at the right column). Traditionally,
sign extension to the full width of the data path is provided. For example, to add three
summands in the MBA using other techniques, one of the summands has to be
extended by 2 bits, the second by 4 bits and the third by 6 bits, and so on. Obviously,
this is extremely wasteful and does not lead to regular connections. In this technique,
we uniformly extend each summand by one bit only even though we are adding three
two’s-complement numbers at each step. This is justified by the fact that two of the
numbers are smaller than the third by a factor of four, since the MBA skips over two
bits of the multiplier at each step.

The dashed arrows on the right in Fig. 2.23 represent a “1” at the LSB for the
two's-complement operation, and the dashed arrows on the left represent sign
extensions. The diamonds at the bottom row represent the final adder, while the diamonds
on the right of the figure represent full adders that are required to produce the LSB of
the final product.

Fig. 2.23 DG of the MBA for the case n = 8. The dashed arrows on the right represent a
“1” at the LSB for the two's-complement operation, and the dashed arrows on the left
represent sign extensions. The circles and the empty diamonds represent full adders.

The accumulator value is traditionally introduced to the multiplier output.


However, since addition is commutative, we can introduce that value to the
input of the multiplier. This results in a design where multiply and accumulate
operations are merged. In Fig. 2.23, it can be seen that the addend is introduced to the
multiplier in two places. The least significant (n+1) bits are introduced at the top of
the figure as shown, while the most significant (n-1) bits are introduced after the
partial products are added. Notice the one-bit sign extension to prevent result
overflow for both the addend and the sum and carry words of the partial product.

2.9.3 Optimization of MAC hardware:

The DG for the MAC algorithm in Fig. 2.23 helps to optimize the hardware
design task. This section describes how an efficient hardware design is obtained using
several optimization decisions.

2.9.3.1. Carry-Save Two’s-Complement Generation:


As mentioned before, the MBA requires the generation of the X, -X, 2X and -2X
terms. This implies that the two's complement must be generated, which is an
expensive process since a “1” must be added to the LSB. This operation requires an n-
bit adder in the Booth encoder section. This delay is avoided by generating the one's
complement for -X and -2X within the encoder and adding an extra “1” to the encoder
outputs, as represented by the dashed arrows on the right of the figure. The delay
associated with obtaining the one's complement is just an inverter delay. As a result,
the delay associated with generating X, -X, 2X and -2X by the encoders is eliminated.
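The trick can be modeled in a few lines (illustrative Python; in hardware the inversion is a row of inverters and the “+1” is a free carry input to the CSA):

```python
def mask(w):
    return (1 << w) - 1

def neg_via_ones_complement(y, w):
    """Two's-complement negation split into a one's complement (inverter
    delay only) plus a '1' injected at the LSB, so no n-bit adder is
    needed inside the Booth encoder."""
    ones = ~y & mask(w)            # one's complement of y in a w-bit word
    return (ones + 1) & mask(w)    # the '+1' is the dashed-arrow LSB bit
```

For every w-bit y this equals (-y) mod 2^w, which is exactly the two's complement the encoder needs for -X and -2X.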

2.9.3.2. Final Adder Speedup:


The partial-product addition is accomplished by the empty circles in Fig. 2.23.
The carry signals traverse n/2 stages. The final adder is shown by the empty diamonds
at the right and bottom of the diagram in Fig. 2.23. Notice that the carry signal of that
final adder traverses 2n stages. Speeding up this addition with 4-bit CLA's is
typically done, but it would require an excessive amount of hardware. We
simplify the task by breaking the final adder into two stages: one to add the least
significant (n-1)-bit word and one to add the most significant (n+2)-bit word, so that
the performance of the final adder is improved compared to the standard design.

Fig. 2.24 shows the use of 2-bit CLA’s to add the LSB of the sum and carry
words. Those adders are indicated by A0, A1 and A2. It is not necessary to use 4-bit
CLA’s, since these add extra area and will not lead to any further speedup. In this
way, the LSB part of the resulting product is obtained after approximately n/2 full-
adder delays only.

The (n+2)-bits addition for the MSB of the final adder is done using the usual
4-bit CLA if a completely parallel implementation is contemplated. In this case, the
expected adder delay will be approximately half that of earlier implementations since
the number of bits involved is (n+2), as compared to 2n.
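A behavioral sketch of this split final addition (illustrative Python; the widths follow the text, and the small-CLA internals are not modeled):

```python
def split_final_add(s, c, n):
    """Split final adder: the low (n-1) bits of the sum and carry words
    are added early by small CLAs, and only the upper bits, plus the
    single carry-out of the low part, go through the MSB adder."""
    lo_mask = (1 << (n - 1)) - 1
    lo = (s & lo_mask) + (c & lo_mask)             # LSB part: small CLAs
    p_lo = lo & lo_mask                            # low product bits, ready early
    carry = lo >> (n - 1)                          # one carry into the MSB adder
    hi = (s >> (n - 1)) + (c >> (n - 1)) + carry   # MSB part
    return (hi << (n - 1)) | p_lo                  # identical to s + c
```

Since the low part finishes after roughly n/2 full-adder delays, the LSB word of the product is available before the MSB adder completes.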

Fig. 2.24 DG for the 8-bit MBA where carry propagation between stages is optimized.
2-bit CLA's A0, A1, and A2 are used to implement the LSB word of the final addition,
and 4-bit CLA's are used to implement the MSB word of the final addition.

2.10 Proposed parallel MAC architecture:


In this section, the expression for the new arithmetic will be derived from
equations of the standard design. From this result, a new VLSI architecture for the
parallel MAC will be proposed. In addition, a hybrid-type CSA architecture that can
satisfy the operation of the MAC will be proposed.

2.10.1 Derivation of MAC arithmetic:


2.10.1.1 Basic concept: If an operation to multiply two N-bit numbers and
accumulate into a 2N-bit number is considered, the critical path is determined by the
2N-bit accumulation operation. If a pipeline scheme is applied for each step in the
standard design, the delay of the last accumulator must be reduced in order to improve
the performance of the MAC. The overall performance of the proposed MAC is
improved by eliminating the accumulator itself by combining it with the CSA
function. If the accumulator has been eliminated, the critical path is then determined
by the final adder in the multiplier. The basic method to improve the performance of
the final adder is to decrease the number of input bits. In order to reduce this number
of input bits, the multiple partial products are compressed into a sum and a carry by
the CSA. The number of bits of the sums and carries to be transferred to the final adder is
reduced by adding the lower bits of sums and carries in advance within the range in
which the overall performance will not be degraded. A 2-bit CLA is used to add the
lower bits in the CSA. In addition, to increase the output rate when pipelining is
applied, the sums and carries from the CSA are accumulated instead of the outputs
from the final adder, in such a manner that the sum and carry from the CSA in the
previous cycle are fed back into the CSA. Due to this feedback of both sum and carry, the
number of inputs to the CSA increases compared to the standard design and Elguibaly's
design.
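The feedback of the sum and carry can be modeled behaviorally. In the sketch below (illustrative Python; `x * y` stands in for the Booth partial-product generation, and the pre-added lower bits are not modeled separately), only the carry-save state is kept between cycles and the slow carry-propagate addition runs once at the end:

```python
def mac_csa_accumulate(pairs):
    """Merged multiply-accumulate: accumulation is folded into the
    carry-save compression, so only a sum word and a carry word are fed
    back each cycle and the final adder is deferred until the
    accumulated result is actually needed."""
    s = c = 0                          # carry-save state fed back every cycle
    for x, y in pairs:
        p = x * y                      # stands in for the Booth partial products
        # 3:2 compression of (p, s, c): bitwise full adders, no carry ripple
        s, c = p ^ s ^ c, ((p & s) | (p & c) | (s & c)) << 1
    return s + c                       # final carry-propagate adder, run once
```

Because the per-cycle work is carry-free, the loop body corresponds to the short CSA critical path rather than a 2n-bit accumulation.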

2.10.1.2 Equation Derivation: The aforementioned concept is applied to (5) to


express the proposed MAC arithmetic. The multiplication is then mapped onto a
hardware architecture that complies with the proposed concept, in which the
feedback value for accumulation is modified and expanded for the new MAC.

First, if the multiplication in (4) is decomposed and rearranged, it becomes

If (6) is divided into the first partial product, the sum of the middle partial
products, and the final partial product, it can be re-expressed as (7). The reason for
separating the partial product addition as in (7) is that three types of data are fed back for
accumulation: the sum, the carry, and the pre-added result of the sum and
carry from the lower bits.

Now, the proposed concept is applied to Z in (5). If Z is first divided into


upper and lower bits and rearranged, (8) will be derived. The first term of the right-
hand side in (8) corresponds to the upper bits. It is the value that is fed back as the
sum and the carry. The second term corresponds to the lower bits and is the value that
is fed back as the addition result for the sum and carry.

The second term can be separated further into the carry term and sum term as

Thus, (8) is finally separated into three terms as

If (7) and (10) are used, the MAC arithmetic in (5) can be expressed as

If each term of (11) is matched to the bit position and rearranged, it can be
expressed as (12), which is the final equation for the proposed MAC. The first
parenthesis on the right is the operation to accumulate the first partial product with the
added result of the sum and the carry. The second parenthesis is the one to accumulate
the middle partial products with the sum of the CSA that was fed back. Finally, the
third parenthesis expresses the operation to accumulate the last partial product with
the carry of the CSA.

2.10.2 Proposed MAC architecture:


If the MAC process proposed in the previous section is rearranged, it appears
as in Fig. 2.25, in which the MAC is organized into three steps. When compared with
Fig. 2.13, it is easy to identify that the accumulation has been merged
into the process of adding the partial products. Another big difference from Fig. 2.13
is that the final addition in step 3 is not always run, even though this does not
appear explicitly in Fig. 2.25. Since accumulation is carried out using the result from
step 2 instead of that from step 3, step 3 does not have to be run until the point at
which the result of the final accumulation is needed.

Fig. 2.25 Proposed arithmetic operation of multiplication and accumulation.

The hardware architecture of the MAC that implements the process in Fig. 2.25 is
shown in Fig. 2.26. The n-bit MAC inputs X and Y are converted into (n+1)-bit
partial products by passing through the Booth encoder. In the CSA and accumulator,
accumulation is carried out along with the addition of the partial products. As a result,
the n-bit values S, C and Z (the result of adding the lower bits of the sum and carry) are
generated. These three values are fed back and used for the next accumulation. If the
final result of the MAC is needed, P[2n-1:n] is generated by adding S and C in the
final adder and combined with P[n-1:0], which was already generated.

Fig. 2.26 Hardware architecture of the proposed MAC


2.10.3 Proposed CSA architecture:
The architecture of the hybrid-type CSA that complies with the operation of
the proposed MAC is shown in Fig. 2.27, which performs an 8 × 8-bit operation. It was
formed based on (12). In Fig. 2.27, Si simplifies the sign extension and Ni
compensates the 1's-complement number into a 2's-complement number. S[i] and C[i]
correspond to the ith bit of the feedback sum and carry. Z[i] is the ith bit of the sum of
the lower bits of each partial product that were added in advance, and Z'[i] is the
previous result. In addition, Pj[i] corresponds to the ith bit of the jth partial product.
Since the multiplier is for 8 bits, a total of four partial products (P0[7:0] ~ P3[7:0]) are
generated by the Booth encoder. In (11), d_0·Y and d_{N/2-1}·2^{N-2}·Y correspond to P0[7:0]
and P3[7:0], respectively. This CSA requires at least four rows of FA's for the four
partial products. Thus, a total of five FA rows are necessary, since one more row is
needed for accumulation. For an n × n-bit MAC operation, the number of CSA levels
is (n/2 + 1). The white squares in Fig. 2.27 represent FA's and the gray squares are
half adders (HA's). The rectangular symbol with five inputs is a 2-bit CLA with a carry
input.

Fig. 2.27 Proposed hybrid CSA architecture

The critical path in this CSA is determined by the 2-bit CLA. It is also
possible to implement the CSA with FA's only, without the CLA. However, if the lower bits
of the previously generated partial products are not processed in advance by the
CLA's, the number of bits for the final adder will increase. When the entire multiplier
or MAC is considered, this degrades the performance.
In Table 2.3, the characteristics of the proposed CSA architecture are
summarized and briefly compared with other architectures. For the number system,
the proposed CSA uses a 1's-complement modified CSA array without sign extension.
The biggest difference between the proposed design and the others is the type of values
that are fed back for accumulation. The proposed design has the smallest number of inputs
to the final adder.
Table 2.3
Characteristics of CSA

                 Standard Design       Elguibaly's design     Proposed design
Number System    2's complement        1's complement         1's complement
Sign Extension   Used                  Used                   Not Used
Accumulation     Result Data of        Result Data of         Sum and Carry of
                 Final Addition        Final Addition         CSA
CSA Tree         FA, HA                FA, 2-bit CLA          FA, HA, 2-bit CLA
Final Adder      2n bits               n+2 bits               n bits

Table 2.4
Calculation of hardware resources

Component                Standard Design       Elguibaly's Design     Proposed design
                         General     8 bits    General      8 bits    General      8 bits
FA                       n²/2 + n    40        n²/2+2n+3    51        n²/2 + n/2   36
HA                       0           0         0            0         3n/2         12
2-bit CLA                0           0         n/2 - 1      3         n/2          4
4-bit CLA                0           0         0            -         n/4          2
(2n+1)-bit Accumulator   1           -         -            -         -            -
Final adder (CLA)        2n bits     16        (n+2) bits   10        n bits       8

2.11 Pipelining scheme:


2.11.1 Stage Analysis: The pipeline stages for Elguibaly's architecture and the
proposed parallel MAC architecture are shown in Fig. 2.28 and Fig. 2.29. Step 1 and
step 2 in Fig. 2.25, which correspond to the Booth encoding and the CSA operation,
respectively, are assigned to stage 1, and step 3, which corresponds to the final adder, is
assigned to stage 2.

2.11.2 Pipeline Structure and Operation:


The hardware which incorporates a pipelining scheme to increase the operation
speed is shown in Fig. 2.28, with the one for Elguibaly's scheme in Fig. 2.29 for the
purpose of comparison. The difference between the two is that the proposed architecture
carries out the accumulation by feeding back the final CSA outputs rather than the
final adder results as in Fig. 2.29.

Fig. 2.28 Pipelined hardware structure for the proposed design

Fig. 2.29 Pipelined hardware structure for the Elguibaly’s design

Fig. 2.30 Pipelined operational sequence of Elguibaly’s operation

Fig. 2.31 Pipelined operational sequence of proposed operation

These two schemes are also compared in the time sequences in Fig. 2.30 and
Fig. 2.31, corresponding to Fig. 2.29 and Fig. 2.28, respectively. While an accumulated
result cannot be output by Elguibaly's design in every clock period because of a structural
drawback in the accumulation, the proposed architecture can output a result in every
clock cycle. Thus, even though the delay of the proposed architecture is a little longer
than that of Elguibaly's design, it gives a much better overall output rate.

2.12 Conclusion
In this chapter, different types of multipliers, adders and parallel MAC
architectures were presented.

CHAPTER 3
DESIGN AND IMPLEMENTATION

3.1 Introduction:
Hardware Description Languages are modeling tools for creating a hardware
model. The Verilog Hardware Description Language is used to design
Elguibaly's architecture as well as the proposed parallel MAC architecture. All the
architectures are synthesized in 0.18-μm technology.

3.2 FPGA implementation:


3.2.1 Booth Encoder:
Input : x[7:0] is the 8-bit multiplier.
Outputs : mul, shift, twocom are the Booth-encoded bits, each 4 bits wide.

Fig. 3.1 Booth encoder port diagram

Fig. 3.2 RTL schematic of Booth encoder

3.2.2 Partial product generator:

Inputs : mul, shift, twocom are the Booth-encoded bits, each 4 bits wide.
Outputs : pp0, pp1, pp2, pp3 are the generated partial products, each
9 bits wide.

Fig. 3.3 Partial product generator port diagram

Fig. 3.4 RTL schematic of partial product generator

3.2.3 Standard design:
Inputs : x, y are the multiplier and multiplicand, each of 8 bits;
clk, reset.
Outputs : p is the 16-bit MAC result.

Fig. 3.5 Port diagram of standard design

Fig. 3.6 RTL schematic of standard design
3.2.4 Elguibaly’s parallel MAC architecture:

Fig. 3.7 Port diagram of Elguibaly’s parallel MAC architecture

Fig. 3.8 RTL schematic of Elguibaly’s parallel MAC architecture
3.2.5 Proposed parallel MAC architecture:

Fig. 3.9 Port diagram of proposed parallel MAC architecture

Fig. 3.10 RTL schematic of proposed parallel MAC architecture
3.2.6 Elguibaly’s parallel MAC architecture with 2-stage pipelining:

Fig. 3.11 RTL schematic of modified Elguibaly’s parallel MAC architecture

3.2.7 Proposed parallel MAC architecture with 2-stage pipelining:

Fig. 3.12 RTL schematic of modified proposed parallel MAC architecture
3.3 FPGA Synthesis reports:
3.3.1 Standard design:
============================================================
* Final Report *
=============================================================
Final Results
RTL Top Level Output File Name : mac_standard.ngr
Top Level Output File Name : mac_standard
Output Format : NGC
Optimization Goal : Speed
Keep Hierarchy : NO
Design Statistics
# IOs : 34
Cell Usage:
# BELS : 673
# AND2 : 190
# AND3 : 68
# AND8 : 1
# INV : 183
# OR2 : 115
# XOR2 : 116
# Flip Flops/Latches : 16
# FDC : 16
# IO Buffers : 34
# IBUF : 18
# OBUF : 16

Device Utilization Summary:


Selected Device : 3s400pq208-5
Number of Slices : 121 out of 3584 3%
Number of Slice Flip Flops : 16 out of 7168 0%
Number of 4 input LUTs : 217 out of 7168 3%

Number of IOs : 34
Number of bonded IOBs : 34 out of 141 24%
Number of GCLKs : 1 out of 8 12%

3.3.2 Elguibaly’s parallel MAC architecture:


=============================================================
* Final Report *
=============================================================
Final Results
RTL Top Level Output File Name : mac_elguibaly.ngr
Top Level Output File Name : mac_elguibaly
Output Format : NGC
Optimization Goal : Speed
Keep Hierarchy : NO
Design Statistics
# IOs : 34
Cell Usage:
# BELS : 575
# AND2 : 259
# AND3 : 12
# AND8 : 1
# INV : 72
# OR2 : 39
# OR5 : 32
# XOR2 : 160
# Flip Flops/Latches : 16
# FDC : 14
# FDP : 2
# IO Buffers : 34
# IBUF : 18
# OBUF : 16
=============================================================
Device utilization summary:
Selected Device : 3s400pq208-5

Number of Slices : 116 out of 3584 3%
Number of Slice Flip Flops : 16 out of 7168 0%
Number of 4 input LUTs : 221 out of 7168 3%
Number of IOs : 34
Number of bonded IOBs : 34 out of 141 24%
Number of GCLKs : 1 out of 8 12%

3.3.3 Proposed parallel MAC architecture:


=============================================================
* Final Report *
=============================================================
Final Results
RTL Top Level Output File Name : proposed.ngr
Top Level Output File Name : proposed
Output Format : NGC
Optimization Goal : Speed
Keep Hierarchy : NO
Design Statistics
# IOs : 34
Cell Usage:
# BELS : 477
# AND2 : 201
# AND3 : 12
# AND8 : 1
# INV : 76
# OR2 : 35
# OR5 : 32
# XOR2 : 120
# Flip Flops/Latches : 24
# FDC : 24
# IO Buffers : 34
# IBUF : 18
# OBUF : 16

Device utilization summary:
Selected Device : 3s400pq208-5
Number of Slices : 122 out of 3584 3%
Number of Slice Flip Flops : 24 out of 7168 0%
Number of 4 input LUTs : 228 out of 7168 3%
Number of IOs : 34
Number of bonded IOBs : 34 out of 141 24%
Number of GCLKs : 1 out of 8 12%

3.3.4 Elguibaly’s parallel MAC architecture with 2-stage pipelining:


=============================================================
* Final Report *
=============================================================
Final Results
RTL Top Level Output File Name : mac_elguibaly_pipe.ngr
Top Level Output File Name : mac_elguibaly_pipe
Output Format : NGC
Optimization Goal : Speed
Keep Hierarchy : NO
Design Statistics
# IOs : 34
Cell Usage:
# BELS : 255
# LUT2 : 9
# LUT3 : 60
# LUT3_D : 11
# LUT3_L : 4
# LUT4 : 128
# LUT4_D : 18
# LUT4_L : 5
# MUXF5 : 20
# Flip Flops/Latches : 40
# FDC : 38

# FDP : 2
# Clock Buffers : 1
# BUFGP : 1
# IO Buffers : 33
# IBUF : 17
# OBUF : 16
=============================================================
Device utilization summary:
Selected Device : 3s400pq208-5
Number of Slices : 125 out of 3584 3%
Number of Slice Flip Flops : 40 out of 7168 0%
Number of 4 input LUTs : 235 out of 7168 3%
Number of IOs : 34
Number of bonded IOBs : 34 out of 141 24%
Number of GCLKs : 1 out of 8 12%
=============================================================

3.3.5 Proposed parallel MAC architecture with 2-stage pipelining:


=============================================================
* Final Report *
=============================================================
Final Results
RTL Top Level Output File Name : proposed_pipe.ngr
Top Level Output File Name : proposed_pipe
Output Format : NGC
Optimization Goal : Speed
Keep Hierarchy : NO
Design Statistics
# IOs : 34
Cell Usage:
# BELS : 297
# LUT2 : 3
# LUT2_L : 1
# LUT3 : 55
# LUT3_D : 11
# LUT3_L : 4
# LUT4 : 165
# LUT4_D : 14
# LUT4_L : 13
# MUXF5 : 31
# Flip Flops/Latches : 72
# FDC : 70
# FDP : 2
# Clock Buffers : 1
# BUFGP : 1
# IO Buffers : 33
# IBUF : 17
# OBUF : 16
=============================================================

Device utilization summary:


Selected Device : 3s400pq208-5
Number of Slices : 141 out of 3584 3%
Number of Slice Flip Flops : 72 out of 7168 1%
Number of 4 input LUTs : 266 out of 7168 3%
Number of IOs : 34
Number of bonded IOBs : 34 out of 141 24%
Number of GCLKs : 1 out of 8 12%

3.4 Timing reports:


3.4.1 Standard design:
Timing Summary:
Speed Grade: -5
Minimum period: 5.751ns (Maximum Frequency : 173.887MHz)
Minimum input arrival time before clock : 32.263ns
Maximum output required time after clock : 11.229ns
Maximum combinational path delay : 37.742ns

3.4.2 Elguibaly’s parallel MAC architecture:
Timing Summary:
Speed Grade: -5
Minimum period : 10.854ns (Maximum Frequency : 92.132MHz)
Minimum input arrival time before clock : 20.506ns
Maximum output required time after clock : 16.332ns
Maximum combinational path delay : 25.984ns

3.4.3 Proposed parallel MAC architecture:


Timing Summary:
Speed Grade: -5
Minimum period: 7.203ns (Maximum Frequency : 138.825MHz)
Minimum input arrival time before clock : 16.537ns
Maximum output required time after clock : 21.509ns
Maximum combinational path delay : 30.693ns

Table 3.1
Delay comparison without pipelining

Standard Design Elguibaly’s Design Proposed Design

Delay (ns) 37.742 25.984 30.693

From Table 3.1, it is clear that the delay of the proposed design is greater
than that of Elguibaly's design. Even though the proposed design has more delay, it
is preferred over Elguibaly's design because the overall output rate is higher in
the proposed design after applying the pipelining scheme.

Here 2-stage pipelining is applied to both designs. So, for Elguibaly's
design, 2 clock cycles are required to get the first output, and for the second output,
again two clock cycles are required. But in the case of the proposed design, two clock
cycles are required for the first output and one clock cycle is enough to get the second
output. The difference between the two is that, in the proposed design, the Booth
encoding and carry-save addition for the second input are performed in the second cycle,
in parallel with the final addition of the first output, instead of in the third cycle.

These two schemes are illustrated in Fig. 2.30 and Fig. 2.31.

3.4.4 Elguibaly’s parallel MAC architecture with 2-stage pipelining:


Timing Summary:
Speed Grade: -5
Minimum period: 8.658ns (Maximum Frequency: 115.495MHz)
Minimum input arrival time before clock : 17.866ns
Maximum output required time after clock : 14.137ns
Maximum combinational path delay : 23.344ns

3.4.5 Proposed parallel MAC architecture with 2-stage pipelining:


Timing Summary:
Speed Grade: -5
Minimum period: 10.873ns (Maximum Frequency: 91.969MHz)
Minimum input arrival time before clock : 20.671ns
Maximum output required time after clock : 6.280ns
Maximum combinational path delay : No path found

Table 3.2
Pipelining analysis

Parameter                  Elguibaly's design    Proposed design
Output rate                2 clocks              1 clock
Pipeline delay (n inputs)  8.658(2n) ns          10.873(n+1) ns
Pipeline delay (5 inputs)  86.58 ns              65.238 ns
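The last two rows of Table 3.2 follow from simple arithmetic (illustrative Python; the clock periods and cycle counts are taken from the table):

```python
def pipeline_latency(clock_ns, cycles):
    """Total pipeline delay = clock period x number of clock cycles."""
    return clock_ns * cycles

# Elguibaly's design delivers one result every 2 clocks -> 2n cycles for
# n inputs; the proposed design streams one result per clock after a
# one-cycle fill -> n+1 cycles.
elg = pipeline_latency(8.658, 2 * 5)     # Elguibaly's design, 5 inputs
prop = pipeline_latency(10.873, 5 + 1)   # proposed design, 5 inputs
```

For five inputs this reproduces the 86.58 ns and 65.238 ns entries, so the proposed design wins despite its longer clock period.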

3.5 ASIC implementation:


3.5.1 Physical Design Flow:
Physical synthesis uses timing libraries and capacitance tables which give the
timing and fan-out information of the standard cells. Generic cells are mapped to
standard cells.

3.5.2 RTL Compiler:


The RTL compiler GUI is used for synthesis. RTL compiler uses physical
information to drive synthesis without creating the detailed physical
implementation. It serves as an analysis tool to identify design problems such as
timing and power.

Fig 3.13 Physical Design Flow

Generating log files:

By default, RTL compiler generates a log file named rc.log. The log file contains
the entire output of the current RTL compiler session.

Generating command file

By default, RTL compiler generates a command history file named rc.cmd,


which contains a record of all commands that were issued in a particular session. This
file is created in addition to the log file.

Specifying explicit search paths

Specify the search paths for libraries, scripts and HDL (Hardware Description
Language) files. The default search path is the directory in which RTL compiler is
invoked.

To set the search paths, use the following set_attribute commands:

rc:/> set_attribute lib_search_path path/

rc:/> set_attribute hdl_search_path path/

rc:/> set_attribute script_search_path path/

Where, path is the full path of the target library, HDL and script locations.

Setting the target technology library:

After setting the library search paths, we need to specify the target
technology library for synthesis using the library attribute.

To specify single library

rc:/> set_attribute library lib_name.lbr /

RTL compiler will use the library named lib_name.lbr for synthesis.

Setting the appropriate synthesis mode:

RTL compiler has two modes: wireload and PLE. These modes are set using
the interconnect_mode attribute. The default mode is wireload. In wireload mode,
wireload models are used to drive synthesis. In PLE mode, physical layout
estimators (PLE) are used to drive synthesis. PLE is a process of using physical information
such as LEF (Library Exchange Format) libraries to provide better closure with the
backend.

Loading the HDL files:


The read_hdl command is used to read HDL files into RTL compiler. After
executing the read_hdl command, RTL compiler reads the files and performs syntax
checks.

rc:/> read_hdl {file1.v file2.v}

The above command reads one or more files simultaneously.

Performing elaboration:
Elaboration is required only for the top-level design. The elaborate command
automatically elaborates the top-level design and all of its references. During
elaboration, RTL compiler:
 Builds data structures
 Infers registers in the design
 Performs high-level HDL optimization, such as dead code removal
 Checks semantics
At the end of elaboration, RTL compiler displays any unresolved references.
After elaboration, RTL compiler has an internally created data structure for the whole
design, so we can apply constraints and perform other operations.

Applying constraints:
After loading and elaborating the design, we must specify constraints. The
constraints include:
 Operating conditions
 Clock waveforms
 I/O timings
The constraints can also be read in from an SDC constraints file.

Performing synthesis:
After the constraints and optimizations are set for the design, proceed with synthesis by
issuing the synthesize command.

rc:/> synthesize –to_mapped

Analyze the synthesis results:


After synthesizing the design, detailed timing and area reports are generated
using the report commands. To generate a detailed area report, use report area. To
generate a detailed gate-selection and area report, use report gates. To generate
detailed timing reports, including the worst critical path of the current design, use
report timing.

Writing out files for place and route:


The last step in the flow involves writing out the gate-level netlist and the SDC or
Encounter configuration file for processing in the place-and-route tool. By default, the
write commands write to stdout. To save information to a file, use the redirection
symbol (>) and a file name.

To write out the gate-level netlist, use the write_hdl command.

rc:/> write_hdl > design.v

This command writes out the gate-level netlist to a file called design.v.

To write out the design constraints, use the write_script command.

rc:/> write_script >constraints.g

This command writes out constraints to a file named constraints.g.

To write the design constraints in HDC format, use the write_hdc command.

rc:/> write_hdc >constraints.hdc


This command writes the design constraints to a file called constraints.hdc.

3.5.3 RTL Synthesis diagrams:


3.5.3.1 Standard design:
RTL synthesis diagram of standard design is shown in Fig. 3.14.

Fig. 3.14 RTL synthesis diagram of the standard design

3.5.3.2 Elguibaly’s parallel MAC architecture:

Fig. 3.15 RTL synthesis diagram for the Elguibaly’s design


3.5.3.3 Proposed Parallel MAC architecture:

Fig. 3.16 RTL synthesis diagram for the proposed design

3.5.3.4 Elguibaly’s parallel MAC architecture with 2-stage pipelining:

Fig. 3.17 RTL synthesis diagram for the modified Elguibaly’s design

3.5.3.5 Proposed parallel MAC architecture with 2-stage pipelining:

Fig. 3.18 RTL synthesis diagram for the modified proposed design

3.5.4 Chip level synthesis of the modified proposed architecture:

Fig. 3.19 Chip level synthesis diagram

Chip-level module design involves both the core design and the IO pads. The pad
information is specified in the Encounter libraries. Fig. 3.20 shows the physical
synthesis diagram. In this design there are 40 pads in total:
Input pads---------------18
Output pads------------16
Supply pads------------2
Corner pads-------------4

The outputs of the RTL compiler are the netlist file, SDC file, config file,
Encounter setup file and Encounter mode files.

Fig. 3.20 Physical synthesis of top module

3.5.4.1 Critical path:


Fig. 3.21 and 3.22 show the critical timing paths, which contain slack and
clock skew information, for the modified proposed and Elguibaly's architectures, respectively.

Fig 3.21 Critical path of the modified proposed architecture

Fig 3.22 Critical path of the modified Elguibaly’s architecture

3.5.5 SOC Encounter Design:


3.5.5.1 Inputs for the SOC Encounter:
1. Netlist File…………..proposed_chip_enc.v
This netlist file is placed in the Encounter Lab/src/netlist folder.
2. Sdc file…………….proposed_chip_enc.sdc
This file is placed in Encounter Lab/src/sdc folder
3. IO File…………….proposed_chip.io
This file contains number of IO pads and their placement directions.
4. Configuration File………….proposed_chip_enc.conf
This file is placed in Encounter Lab/scripts folder.

3.5.5.2 SOC encounter design steps for the modified proposed architecture:
STEP-1: IMPORT DESIGN:
The Import Design form enables us to set up the design import into the Encounter
software. This form is used to import a full-chip design or a partial design, such
as a module or a partitioned design.

The inputs required to import a design into Encounter are the gate-level
netlist, the timing libraries and constraint file, the LEF files and the IO assignment file.

Fig. 3.23 has the information about the IO pads, core area and netlist file.

Fig. 3.23 Importing Design in SOC encounter

STEP-2: TIMING

The Timing Analysis Condition form provides the options for setting the modes for
extraction, timing analysis and delay calculation. The Operating Condition form is
used to select the operating temperature, process or voltage conditions for the
design. The operating conditions are contained in the timing library.

RC extraction mode: This mode is used to set the RC extraction mode, to specify the
threshold value (in ps), to perform RC reduction, to specify whether noise should be
considered during extraction, and to specify the database output file name.

The Timing Analysis form is used to build a timing graph for the design and to
generate a slack report and a detailed timing-violation report.

STEP-3: FLOORPLAN

Use the Specify Floorplan form to view or change floorplan specifications after
importing the design. The form specifies the dimensions by size, by die, or by core
coordinates. When the Specify Floorplan form is used, the floorplan is resized
automatically, as shown in Fig. 3.24. Relative floorplan constraints are derived on
the fly for blocks, fixed standard cells, fixed pre-routes and blockages. The
floorplan of the proposed parallel MAC architecture is shown in Fig. 3.24.

Fig. 3.24 Floorplan of the proposed design


STEP-4: POWER
The Add Ring form is used to create power rings for specified nets around the core
boundary, on selected power domains, blocks and groups of rows, as shown in
Fig. 3.25.

Fig. 3.25 Adding power rings to (a) individual blocks (b) group of blocks

The Add Stripes form is used to create power stripes within the specified range. If
block rings are encountered, the stripes connect to the block rings; if existing
stripes on the same nets are encountered, the new stripes connect to them; otherwise
the stripes stop at the core row boundary.

The Add X Stripes form is used to create diagonal power routes and stripes for an
Encounter X design; it adds diagonal stripes on diagonal routing layers only, and
only in the preferred routing direction of each layer. Power/ground (P/G) pins can be
created at the specified coordinates. The power stripes for the proposed parallel MAC
architecture are shown in Fig. 3.26.

Fig. 3.26 Adding power stripes to the proposed design

STEP-5: PLACE

The Specify Placement form enables us to specify and assign spare cells, scan cells,
JTAG cells and placement blockages for power and ground stripes. These objects must
be assigned before running placement.

Spare cells: Assign the cell types or modules that are designated as spare cells in the
design.

JTAG cells: The JTAG Cell form is used to specify the modules that contain JTAG logic,
and to save and load the specification data.

Placement blockages: The Placement Blockage form is used to create placement blockages
and to treat routing-blockage objects and wires according to their DEF attributes.

Check placement: Used to check fixed and placed cells and blocks for violations, to
add violation markers to the design, and to display the area and violation reports.
The placement of the proposed parallel MAC architecture is shown in Fig. 3.27.

Fig. 3.27 Placing the Design onto chip

STEP-6: IOP1

This step performs timing optimization on the placed design before the clock tree is
built. By default, it repairs DRVs and setup violations for all path groups. If the
worst negative slack found during the first optimization pass does not occur on a
register-to-register path, the software performs an additional optimization for the
register-to-register critical path. The timing optimization before the clock tree is
built for the proposed parallel MAC architecture is shown in Fig. 3.28.

Fig. 3.28 Pre-timing analysis before CTS built

STEP-7: CTS (Clock Tree Synthesis)


The Clock Tree Synthesis form enables us to run the complete clock tree synthesis
process. Use this form to build clock trees, route clock nets and resize instances
based on the CTS mode settings and SDC constraints, and to generate the standard
clock-skew and timing reports. The timing optimization when the clock tree is built
for the proposed parallel MAC architecture is shown in Fig. 3.29.

STEP-8: IOP2
This step performs timing optimization on the placed design after the clock tree is
built. By default, it repairs DRVs and setup violations for all path groups. If the
worst negative slack found during the first optimization pass does not occur on a
register-to-register path, the software performs an additional optimization for the
register-to-register critical path. In this mode, when running useful skew with the
clock already detail-routed, the EDI (Encounter Digital Implementation) software
performs ECO (engineering change order) routing using the NanoRoute router. The
timing optimization after the clock tree is built for the proposed parallel MAC
architecture is shown in Fig. 3.30.

Fig. 3.29 Timing Analysis when CTS is built

STEP-9: Power route


This step performs timing optimization on the routed design; by default it repairs
DRVs and setup violations for all path groups. If the worst negative slack found
during the first optimization pass does not occur on a register-to-register path, the
software optimizes the register-to-register critical paths. In this mode the software
performs ECO routing using the NanoRoute router.

Route power:
This form is used to limit connections to specified nets or routes. Allow jogging
specifies that jogs are allowed during routing to avoid DRC violations. The
connectivity of the blocks in the proposed design is shown in Fig. 3.31.

Fig. 3.30 Timing Analysis after CTS is built

Fig. 3.31 Route power stripes

STEP-10: ADD FILLERS:

The Add Fillers form is used to insert filler instances in the gaps between
standard-cell instances. If the design is routed, the software DRC-checks the added
filler cells against the wires in the design; it does not check the adjacent cells.
This ensures that adding filler cells is fast enough to be used many times in the
design flow. The insertion of fillers in the proposed design is shown in Fig. 3.32.

Fig. 3.32 Adding Fillers

STEP-11: ROUTE:

The Trial Route form is used to perform quick global and detailed routing for
estimating routing-related congestion and capacitance values. Special Route enables
routing of pins to nearby rings and stripes. NanoRoute specifies:

1. Data attributes
2. Most commonly used run time options
3. Routing type (global, detailed or both)

4. Congestion map style and options.

The connectivity of the blocks for the proposed design is shown in Fig. 3.33.

STEP-12: Verify:
Verify connectivity: The Verify Connectivity form is used to detect conditions such as
opens, unconnected wires, unconnected pins, loops, partial routing and unrouted nets.
When connectivity is verified, the software generates violation markers in the design
window.

Fig. 3.33 Routing the Placed Cells and blocks

Verify metal density: Used to check the metal density of each routing layer against
the values specified in the LEF file.
Cut density: Used to check the density of a specified cut layer, the density of an
area of cut layers, or the cut density of the whole chip.

Fig. 3.34 Verifying the Parameters of the design

3.5.6 ASIC implementation reports of proposed design:


3.5.6.1 Area report:
============================================================
Generated by: Encounter(R) RTL Compiler v08.10-p104_1
Generated on: Nov 02 2011 01:38:24 PM
Module: proposed_pipe_chip
Technology libraries: typical 1.13
tpz973gtc 230
physical_cells
Operating conditions: typical
Interconnect mode: ple
Area mode: physical library
============================================================

Instance              Cells   Cell Area   Net Area
------------------------------------------------
proposed_pipe_chip 317 329915 4474
a1 283 10315 4474
m2 8 53 679
m1 12 230 266
a4 6 173 120
a2 6 173 107
a3 6 173 99

a1 4 123 82
fa28 3 86 74
fa9 3 86 70
fa27 3 86 70
fa6 3 86 66
fa5 3 86 66
fa4 3 86 66
fa3 3 86 66
fa26 3 86 66
fa25 3 86 66
fa24 3 86 66
fa23 3 86 66
fa22 3 86 66
fa29 3 86 66
fa8 3 86 58
fa7 3 86 58
fa2 3 86 58
fa34 3 86 58
fa33 3 86 58
fa32 3 86 58
fa31 3 86 58
fa30 3 86 58
fa18 3 86 54
fa13 3 86 49
fa12 3 86 49
fa1 3 86 49
fa42 3 86 41
fa41 3 86 41
fa40 3 86 41
fa39 3 86 41
fa38 3 86 41
fa37 3 86 41
fa17 3 86 41
fa14 3 86 41
fa11 3 86 41
fa21 3 86 33
fa20 3 86 33
fa19 3 86 33
fa16 3 86 33
fa15 3 86 33
fa10 3 86 33
fa0 3 57 54
fa35 1 60 29
fa43 1 60 8
ha7 1 37 16
ha6 1 37 16
ha5 1 37 16
ha4 1 37 16
ha3 1 37 16
ha2 1 37 16
ha1 1 37 16
ha0 1 37 16
fa36 1 37 16

3.5.6.2 Power report:
============================================================
Generated by: Encounter(R) RTL Compiler v08.10-p104_1
Generated on: Nov 02 2011 01:38:24 PM
Module: proposed_pipe_chip
Technology libraries: typical 1.13
tpz973gtc 230
physical_cells
Operating conditions: typical
Interconnect mode: ple
Area mode: physical library
============================================================

Instance              Cells   Leakage     Dynamic       Total
                              Power(nW)   Power(nW)     Power(nW)
-------------------------------------------------------------
proposed_pipe_chip 317 58.595 44571818.037 44571876.632
a1 283 58.595 2337295.399 2337353.994
m1 12 1.498 9037.846 9039.343
a2 6 1.158 22882.367 22883.525
a3 6 1.158 32722.932 32724.090
a4 6 1.158 41632.921 41634.080
a1 4 0.810 8688.880 8689.690
fa35 1 0.601 11602.935 11603.536
fa43 1 0.601 8421.867 8422.468
fa1 3 0.579 7097.660 7098.239
fa10 3 0.579 14550.501 14551.080
fa11 3 0.579 13909.788 13910.368
fa12 3 0.579 10805.782 10806.361
fa13 3 0.579 16655.001 16655.580
fa14 3 0.579 19631.528 19632.107
fa15 3 0.579 16733.996 16734.575
fa16 3 0.579 21105.301 21105.880
fa17 3 0.579 17729.353 17729.932
fa18 3 0.579 15899.437 15900.016
fa19 3 0.579 17153.782 17154.361
fa2 3 0.579 9949.254 9949.833
fa20 3 0.579 17647.922 17648.501
fa21 3 0.579 25542.580 25543.159
fa22 3 0.579 21431.205 21431.784
fa23 3 0.579 22799.572 22800.151
fa24 3 0.579 22835.886 22836.466
fa25 3 0.579 20029.060 20029.639
fa26 3 0.579 13587.642 13588.221
fa27 3 0.579 14888.480 14889.059
fa28 3 0.579 29102.849 29103.428
fa29 3 0.579 29275.144 29275.723
fa3 3 0.579 9740.407 9740.986
fa30 3 0.579 29487.690 29488.269
fa31 3 0.579 29203.066 29203.645
fa32 3 0.579 24183.074 24183.653
fa33 3 0.579 20281.956 20282.535
fa34 3 0.579 21516.049 21516.628
fa37 3 0.579 28121.925 28122.504
fa38 3 0.579 28503.772 28504.351
fa39 3 0.579 28150.204 28150.783
fa4 3 0.579 7520.634 7521.213
fa40 3 0.579 22447.125 22447.704
fa41 3 0.579 18267.013 18267.593

72
fa42 3 0.579 18653.668 18654.247
fa5 3 0.579 9049.397 9049.976
fa6 3 0.579 13095.439 13096.019
fa7 3 0.579 16158.880 16159.459
fa8 3 0.579 18031.057 18031.636
fa9 3 0.579 15577.880 15578.460
fa0 3 0.335 11529.437 11529.772
fa36 1 0.231 12212.879 12213.110
ha0 1 0.231 1799.957 1800.188
ha1 1 0.231 1942.624 1942.855
ha2 1 0.231 2546.368 2546.599
ha3 1 0.231 2539.815 2540.046
ha4 1 0.231 2308.017 2308.248
ha5 1 0.231 2560.314 2560.545
ha6 1 0.231 1594.591 1594.822
ha7 1 0.231 4833.688 4833.919
m2 8 0.092 18434.621 18434.713
m0 0 0.000 2818.800 2818.800
m1 0 0.000 2817.585 2817.585
m2 0 0.000 2818.800 2818.800
m3 0 0.000 2818.800 2818.800

3.5.6.3 Timing report:


============================================================
Generated by: Encounter(R) RTL Compiler v08.10-p104_1
Generated on: Nov 02 2011 01:38:24 PM
Module: proposed_pipe_chip
Technology libraries: typical 1.13
tpz973gtc 230
physical_cells
Operating conditions: typical
Interconnect mode: ple
Area mode: physical library
============================================================

Pin            Type      Fanout   Load(fF)   Slew(ps)   Delay(ps)   Arrival(ps)
------------------------------------------------------------------
(clock pclk) launch 0R
(in_del_1) ext delay +100 100 F
py[2] in port 1 3359.5 0 +0 100 F
p14/PAD +0 100
p14/C PDIDGZ 1 8.8 57 +317 417 F
a1/y[2]
g186/A +0 417
g186/Y INVX1 2 14.1 168 +93 511 R
g182/C +0 511
g182/Y AND4X2 1 10.0 96 +138 648 R
g181/A +0 648
g181/Y NAND4X1 7 39.3 333 +173 822 F
g178/A +0 822
g178/Y AND2X2 5 30.5 104 +204 1025 F
a1/y[0]
g32/A +0 1025
g32/CO ADDHXL 1 11.1 102 +152 1177 F
g31/B +0 1177
g31/CO ADDHXL 1 8.7 90 +131 1308 F
g30/A +0 1308
g30/Y OR2X2 1 11.1 70 +139 1447 F

a1/cout
a2/cin
g44/B +0 1447
g44/CO ADDHXL 1 8.7 90 +123 1570 F
g43/A +0 1570
g43/Y OR2X2 1 9.9 67 +136 1707 F
g42/A +0 1707
g42/CO ADDHXL 1 8.7 89 +131 1838 F
g41/A +0 1838
g41/Y OR2X2 1 11.1 70 +138 1976 F
a2/cout
a3/cin
g44/B +0 1976
g44/CO ADDHXL 1 8.7 90 +123 2099 F
g43/A +0 2100
g43/Y OR2X2 1 9.9 67 +136 2236 F
g42/A +0 2236
g42/CO ADDHXL 1 8.7 88 +131 2367 F
g41/A +0 2367
g41/Y OR2X2 1 11.1 70 +138 2506 F
a3/cout
a4/cin
g44/B +0 2506
g44/CO ADDHXL 1 8.7 90 +123 2629 F
g43/A +0 2629
g43/Y OR2X2 1 9.9 67 +136 2765 F
g42/A +0 2765
g42/CO ADDHXL 1 8.7 88 +131 2896 F
g41/A +0 2896
g41/Y OR2X2 2 15.5 80 +146 3043 F
a4/cout
fa36/a
g15/B +0 3043
g15/CO ADDHXL 1 11.1 104 +136 3179 F
fa36/cout
fa37/cin
g52/B +0 3179
g52/CO ADDHXL 1 8.7 90 +132 3310 F
g51/A +0 3311
g51/Y OR2X2 1 9.9 67 +137 3447 F
fa37/cout
fa38/cin
g52/A +0 3447
g52/CO ADDHXL 1 8.7 89 +131 3578 F
g51/A +0 3578
g51/Y OR2X2 1 9.9 67 +136 3715 F
fa38/cout
fa39/cin
g52/A +0 3715
g52/CO ADDHXL 1 8.7 89 +131 3846 F
g51/A +0 3846
g51/Y OR2X2 1 9.9 67 +136 3982 F
fa39/cout
fa40/cin
g52/A +0 3982
g52/CO ADDHXL 1 8.7 89 +131 4113 F
g51/A +0 4113
g51/Y OR2X2 1 9.9 67 +136 4250 F
fa40/cout
fa41/cin

g52/A +0 4250
g52/CO ADDHXL 1 8.7 89 +131 4381 F
g51/A +0 4381
g51/Y OR2X2 1 9.9 67 +136 4517 F
fa41/cout
fa42/cin
g52/A +0 4517
g52/CO ADDHXL 1 8.7 89 +131 4648 F
g51/A +0 4648
g51/Y OR2X2 1 14.0 76 +144 4792 F
fa42/cout
fa43/cin
g21/C +0 4792
g21/Y XOR3X2 1 8.4 104 +132 4924 F
fa43/sum
preg2_reg[7]/D DFFRHQXL +0 4924
preg2_reg[7]/CK setup 0 +148 5072 R
---------------------------------
(clock pclk) capture 6000 R
------------------------------------------------------------------
Timing slack : 928ps
Start-point : py[2]
End-point : a1/preg2_reg[7]/D
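The reported slack follows directly from the capture edge and the final data arrival
time in the report above; a minimal sketch of the setup check, using only numbers
taken from the report:

```python
# Setup check at preg2_reg[7]/D, with the values from the timing report above.
clock_period_ps = 6000   # pclk capture edge (6 ns clock period)
arrival_ps = 5072        # data arrival time, including the 148 ps setup requirement

slack_ps = clock_period_ps - arrival_ps
print(slack_ps)          # 928 ps, matching "Timing slack : 928ps"
```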

Table 3.3
Comparison of ASIC reports

             Elguibaly's Design   Proposed Design
Area         327567               329915
Power (nW)   47002223.290         44571876.632
Delay (ns)   5.973                5.072

From the above table, the area required for the proposed design is slightly more than
that of Elguibaly's design. Even so, the proposed design is preferred because of its
overall performance improvement.

Table 3.4
Comparison of Delay Analysis with pipelining

                     Elguibaly's Design   Proposed Design
Delay for n inputs   5.973*(2n) ns        5.072*(n+1) ns
Delay for 5 inputs   59.73 ns             30.432 ns

From the above table it is clear that the overall performance is nearly doubled
compared to Elguibaly's design, i.e., a 49.05% improvement in overall performance.
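The figures in Table 3.4 can be reproduced directly from the per-MAC delays; a small
sketch (the 5.973 ns and 5.072 ns constants are the single-operation delays reported
above, and the function names are illustrative):

```python
def elguibaly_delay_ns(n, t=5.973):
    """Total delay for n accumulations: two clock periods per input."""
    return t * (2 * n)

def proposed_delay_ns(n, t=5.072):
    """Total delay for n accumulations: pipelined, so n + 1 periods in all."""
    return t * (n + 1)

n = 5
e, p = elguibaly_delay_ns(n), proposed_delay_ns(n)
improvement = 100 * (e - p) / e
print(round(e, 2), round(p, 3), round(improvement, 2))   # 59.73 30.432 49.05
```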

3.6 CONCLUSION:
FPGA and ASIC implementations of all the architectures were presented in
this chapter. RTL physical synthesis for the top-module chip is done using RTL
compiler tool. Placement and routing is done by using SOC Encounter tool using
180nm technology.

CHAPTER 4
SIMULATION RESULTS

4.1 Introduction:
This chapter gives the simulation results of all parallel MAC architectures.

4.2 FPGA Simulation results:


4.2.1 Standard design:

Fig. 4.1 simulation waveform of the standard design

In the above figure, the inputs are chosen randomly. Initially the multiplicand (x)
and the multiplier (y) are given as 15 and 12 respectively. z[15:0] holds the previous
MAC result and is initially set to 0. The current MAC result is stored in p
(p = x * y + z), so 15 * 12 + 0 = 180 is stored in the p register. In the second clock
cycle, the values of x and y are chosen as -25 and 20 respectively, and the previous
MAC result 180 is now stored in the z register, so p = -25 * 20 + 180 = -320 is stored
in the p register. This process repeats.
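The cycle-by-cycle behaviour described above can be sketched in a few lines (a
behavioural model only, not the RTL of the design):

```python
def mac_sequence(pairs):
    """Behavioural model of the MAC: each cycle computes p = x*y + z,
    and z picks up the previous cycle's result."""
    z = 0                      # z[15:0] is initially cleared
    results = []
    for x, y in pairs:
        p = x * y + z
        results.append(p)
        z = p                  # previous MAC result feeds the next cycle
    return results

print(mac_sequence([(15, 12), (-25, 20)]))   # [180, -320]
```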

4.2.2 Elguibaly’s parallel MAC architecture:

Fig. 4.2 simulation waveform for the Elguibaly’s parallel MAC architecture

The same procedure applies to the Elguibaly and proposed parallel MAC architectures
as well.

4.2.3 Proposed parallel MAC architecture:

Fig. 4.3 simulation waveform for the proposed parallel MAC architecture

4.2.4 Elguibaly’s parallel MAC architecture with 2-stage pipelining:

Fig. 4.4 simulation waveform for the modified Elguibaly’s architecture

4.2.5 Proposed parallel MAC architecture with 2-stage pipelining:

Fig. 4.5 simulation waveform for the modified proposed architecture

4.3 Conclusion:
In this chapter, simulation results for all the parallel MAC architectures were
presented.

CHAPTER 5
CONCLUSION AND FUTURE SCOPE

5.1 Conclusion
The MAC unit is proposed and designed by combining a hybrid-type CSA structure with
the modified Booth algorithm, using the Xilinx ISE Design Suite for the FPGA
implementation and the Cadence semi-custom design suite for the ASIC design in TSMC
180 nm technology.

The MAC unit without pipelining is designed and implemented using the standard,
Elguibaly's and proposed methods, with delays of 37.742 ns, 25.984 ns and 30.693 ns
respectively on the Xilinx Spartan-3 FPGA. To improve the performance of the MAC
units, pipelining is applied: the pipelined MAC units using Elguibaly's and the
proposed methods are designed with delays of 86.58 ns and 65.238 ns respectively.
Hence the proposed pipelined MAC unit is the faster design.

The Elguibaly’s method and proposed methods with pipeline are extended to
ASIC design and designed by using Cadence Semi-Custom design suite for TSMC
180nm Technology with the delay of 59.73 ns, 30.43 ns respectively.

The overall performance of the proposed pipelined MAC unit is improved by 49.05%
compared to the pipelined Elguibaly design.

5.2 Future scope:


The MAC unit can be extended by replacing the Booth-2 algorithm with the Booth-3
algorithm. With the Booth-2 algorithm the number of partial products is reduced to
half; with the Booth-3 algorithm it is reduced to about n/3, so the delay is further
reduced. The Booth-3 extension comes at the additional cost of hardware components.
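As a reference point for this extension, the radix-4 (Booth-2) recoding used in the
designs above can be sketched as follows; the helper names are illustrative, and a
Booth-3 variant would scan overlapping 4-bit groups instead of 3-bit ones:

```python
def booth2_digits(y, n=8):
    """Radix-4 (Booth-2) recoding of an n-bit two's-complement multiplier.
    Each digit comes from the overlapping triple (y[2i+1], y[2i], y[2i-1])
    and lies in {-2, -1, 0, 1, 2}, halving the number of partial products."""
    u = y & ((1 << n) - 1)                       # two's-complement bit pattern
    b = [0] + [(u >> i) & 1 for i in range(n)]   # b[0] is the appended y[-1] = 0
    return [-2 * b[2*i + 2] + b[2*i + 1] + b[2*i] for i in range(n // 2)]

def booth2_value(y, n=8):
    """Reconstruct the multiplier from its recoded digits (a sanity check)."""
    return sum(d * 4**i for i, d in enumerate(booth2_digits(y, n)))

print(booth2_digits(-25))    # [-1, 2, -2, 0]: four digits for an 8-bit value
print(booth2_value(-25))     # -25
```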

The pipelined MAC unit can be extended to n-stage pipelining, i.e., more than two
stages, to further improve the performance parameters.

CHAPTER 6
REFERENCES

[1] A. R. Omondi, Computer Arithmetic Systems. Englewood Cliffs, NJ: Prentice-


Hall, 1994.
[2] W. Wolf, Modern VLSI Design. Prentice-Hall.
[3] A. D. Booth, “A signed binary multiplication technique,” Quart. J.Math., vol. IV,
pp. 236–240, 1952.
[4] C. S. Wallace, “A suggestion for a fast multiplier,” IEEE Trans. Electron
Comput., vol. EC-13, no. 1, pp. 14–17, Feb. 1964.
[5] A. R. Cooper, “Parallel architecture modified Booth multiplier,” Proc. Inst. Electr.
Eng. G, vol. 135, pp. 125–128, 1988.
[6] A. Tawfik, F. Elguibaly, and P. Agathoklis, “New realization and implementation
of fixed-point IIR digital filters,” J. Circuits, Syst.,Comput., vol. 7, no. 3, pp. 191–
209, 1997.
[7] F. Elguibaly, “A fast parallel multiplier–accumulator using the modified Booth
algorithm,” IEEE Trans. Circuits Syst., vol. 27, no. 9, pp. 902–908, Sep. 2000.
[8] A. Fayed and M. Bayoumi, “A merged multiplier-accumulator for high speed
signal processing applications,” Proc. ICASSP, vol. 3, pp. 3212–3215, 2002.

CHAPTER 7
BIBLIOGRAPHY

1. www.wikipedia.com
2. www.ece.concordia.com
3. www.xilinx.com
4. http://portal.acm.org
5. http://citeseerx.ist.psu.edu
6. www.pudn.com

APPENDIX-A
FPGA DESIGN FLOW

A.1 Introduction:
An FPGA contains a two-dimensional array of logic blocks and the interconnections
between them. Both the logic blocks and the interconnects are programmable: the logic
blocks are programmed to implement a desired function, and the interconnects are
programmed, using the switch boxes, to connect the logic blocks.
To be clearer, if we want to implement a complex design (a CPU, for instance), the
design is divided into small sub-functions and each sub-function is implemented using
one logic block. Then, to obtain the desired design (the CPU), all the sub-functions
implemented in logic blocks must be connected, and this is done by programming the
interconnects.
The internal structure of an FPGA is depicted in the following figure A-1.

Fig A-1: FPGA Architecture

FPGAs, an alternative to custom ICs, can be used to implement an entire System On
Chip (SOC). The main advantage of an FPGA is its ability to be reprogrammed: the user
can reprogram an FPGA to implement a design after the device is manufactured, hence
the name "field programmable." Custom ICs are expensive and take a long time to
design, so they are useful only when produced in bulk. FPGAs, by contrast, are easy
to implement within a short time with the help of Computer Aided Design (CAD) tools,
because there is no physical layout process, no mask making and no IC manufacturing.
Their disadvantages are that they are slower than custom ICs, cannot handle very
complex designs, and draw more power.

A Xilinx logic block consists of one Look Up Table (LUT) and one flip-flop. An LUT
can implement a number of different functions: the input lines of the logic block go
into the LUT and address it, the output of the LUT gives the result of the logic
function it implements, and the output of the logic block is the registered or
unregistered output of the LUT. An LUT is implemented using SRAM: a k-input logic
function uses a 2^k x 1 SRAM, and the number of different possible functions for a
k-input LUT is 2^(2^k). The advantage of such an architecture is that it supports the
implementation of very many logic functions; the disadvantage is the unusually large
number of memory cells required when the number of inputs is large. Figure A-2 below
shows a 4-input LUT-based implementation of a logic block.

Fig A-2: Xilinx LUT

LUT-based design provides better logic-block utilization. A k-input LUT-based logic
block can be implemented in a number of different ways, with a trade-off between
performance and logic density. An n-LUT can be seen as a direct implementation of a
function truth table: each of the latches holds the value of the function
corresponding to one input combination. For example, a 2-LUT can implement any of 16
functions, such as AND, OR and A + NOT B.

A   B   AND   OR
0   0    0    0
0   1    0    1
1   0    0    1
1   1    1    1
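A k-input LUT is simply a 2^k-entry memory addressed by the input bits; a minimal
model of the idea (the function and variable names are illustrative, not part of any
Xilinx tool):

```python
def make_lut(func, k):
    """Fill the 2**k-entry 'SRAM' with the truth table of a k-input function."""
    return [func(*[(i >> j) & 1 for j in range(k)]) & 1 for i in range(2 ** k)]

def lut_read(table, *inputs):
    """Address the table with the input bits, as the FPGA hardware does."""
    addr = sum(bit << j for j, bit in enumerate(inputs))
    return table[addr]

and2 = make_lut(lambda a, b: a & b, 2)   # [0, 0, 0, 1]
or2  = make_lut(lambda a, b: a | b, 2)   # [0, 1, 1, 1]
print(lut_read(and2, 1, 1), lut_read(or2, 1, 0))   # 1 1
print(2 ** 2 ** 2)   # 16: the number of distinct 2-input functions
```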

A.2: Interconnects:
A wire segment can be described as two end points of an interconnect with no
programmable switch between them. A sequence of one or more wire segments in an FPGA
can be termed a track. Typically an FPGA has logic blocks, interconnects, switch
blocks and input/output blocks. Switch blocks lie in the periphery of the logic
blocks and interconnect; wire segments are connected to logic blocks through switch
blocks. Depending on the required design, one logic block is connected to another,
and so on. The remainder of this appendix gives a short introduction to the FPGA
design flow. A simplified version of the design flow is given in the following
figure A-3.

Fig A-3: FPGA Design Flow

A.3: Design Entry:
There are different techniques for design entry: schematic-based, Hardware
Description Language (HDL) based, a combination of both, etc. The selection of a
method depends on the design and the designer. If the designer wants to deal more
with the hardware, schematic entry is the better choice. When the design is complex,
or the designer thinks about the design in an algorithmic way, HDL is the better
choice. Language-based entry is faster but lags in performance and density; HDLs
represent a level of abstraction that can isolate designers from the details of the
hardware implementation. Schematic-based entry gives designers much more visibility
into the hardware and is the better choice for those who are hardware oriented.
Another, rarely used, method is state machines; it is the better choice for designers
who think of the design as a series of states, but the tools for state-machine entry
are limited. This documentation deals with HDL-based design entry.

A.4: Synthesis:
This is the process that translates VHDL or Verilog code into a device netlist
format, i.e., a complete circuit of logical elements (gates, flip-flops, etc.) for
the design. If the design contains more than one sub-design (for example, to
implement a processor we need a CPU as one design element and a RAM as another, and
so on), the synthesis process generates a netlist for each design element. Synthesis
checks the code syntax and analyzes the hierarchy of the design, which ensures that
the design is optimized for the architecture the designer has selected. The resulting
netlist(s) is saved to an NGC (Native Generic Circuit) file (for Xilinx® Synthesis
Technology (XST)).

Fig A-4: FPGA Synthesis

A.5: Implementation:
This process consists of a sequence of three steps:
1. Translate
2. Map
3. Place and Route

The Translate process combines all the input netlists and constraints into a logic
design file. This information is saved as an NGD (Native Generic Database) file,
which can be done using the NGDBuild program. Here, defining constraints means
assigning the ports in the design to the physical elements (e.g., pins, switches,
buttons) of the targeted device and specifying the timing requirements of the design.
This information is stored in a file named the UCF (User Constraints File).
Tools used to create or modify the UCF are PACE, the Constraint Editor, etc.

Fig A-5: FPGA Translate

The Map process divides the whole circuit of logical elements into sub-blocks such
that they can fit into the FPGA logic blocks. That is, the Map process fits the logic
defined by the NGD file into the targeted FPGA elements (Combinational Logic Blocks
(CLBs) and Input/Output Blocks (IOBs)) and generates an NCD (Native Circuit
Description) file, which physically represents the design mapped to the components of
the FPGA. The MAP program is used for this purpose.

Fig A-6: FPGA map

The PAR program is used for the Place and Route process. It places the sub-blocks
from the Map process into logic blocks according to the constraints and connects the
logic blocks. For example, if a sub-block is placed in a logic block very near an IO
pin it may save time, but it may affect some other constraint; the place-and-route
process therefore takes the trade-off between all the constraints into account. The
PAR tool takes the mapped NCD file as input and produces a completely routed NCD file
as output, which contains the routing information.

Fig A-7: FPGA Place and route

A.6 Device Programming:


Now the design must be loaded onto the FPGA, but it must first be converted to a
format the FPGA can accept; the BITGEN program performs this conversion. The routed
NCD file is given to the BITGEN program to generate a bitstream (a .BIT file), which
can be used to configure the target FPGA device. This is done using a cable; the
selection of the cable depends on the design.
A.7: Design Verification:
Verification can be done at different stages of the process steps. 

Behavioral Simulation (RTL Simulation): This is the first of the simulation steps
encountered throughout the hierarchy of the design flow. It is performed before the
synthesis process to verify the RTL (behavioral) code and to confirm that the design
functions as intended. Behavioral simulation can be performed on either VHDL or
Verilog designs. In this process, signals and variables are observed, procedures and
functions are traced, and breakpoints are set. This is a very fast simulation, so it
allows the designer to change the HDL code within a short time if the required
functionality is not met. Since the design is not yet synthesized to gate level,
timing and resource-usage properties are still unknown.

Functional Simulation (Post-Translate Simulation): Functional simulation gives
information about the logic operation of the circuit. The designer can verify the
functionality of the design using this process after the Translate process. If the
functionality is not as expected, the designer has to make changes in the code and
follow the design-flow steps again.

Static Timing Analysis: This can be done after the MAP or PAR processes. The post-MAP
timing report lists the signal-path delays of the design derived from the design
logic; the post-place-and-route timing report additionally incorporates routing-delay
information to provide comprehensive timing analysis.

APPENDIX-B
SEMICUSTOM DESIGN INPUT FILES

B.1 RTL scripts:


set_attribute information_level 9
set_attribute hdl_search_path /home/vlsi/RCLAB/rtl/
set_attribute lib_search_path /home/vlsi/RCLAB/library/
set_attribute library {slow.lib tpz973gtc.lib}
read_hdl { proposed.v booth_encoder.v pp_17.v mux_17.v ha.v fa.v cla2bits.v } -v2001
elaborate proposed
#set clk [define_clock -p 2000 [find /des* -port ports_in/pclk]]
#external_delay -input 200 -clock $clk [find /des* -port ports_in/*]
#external_delay -output 200 -clock $clk [find /des* -port ports_out/*]
#synthesize -to_generic -effort high
synthesize -to_mapped -effort high
write_hdl > netlist1.v

B.2 SOC encounter input files:


B.2.1 IO File:
Version:1
Pad: c01 NW

Pad: p01 N
Pad: p02 N
Pad: p03 N
Pad: p04 N
Pad: p05 N
Pad: p06 N
Pad: p07 N
Pad: p08 N
Pad: p09 N

Pad: c02 NE

Pad: p10 E
Pad: p11 E
Pad: p12 E
Pad: p13 E
Pad: p14 E
Pad: p15 E
Pad: p16 E
Pad: p17 E
Pad: p18 E

Pad: c03 SE

Pad: p19 S
Pad: p20 S
Pad: p21 S
Pad: p22 S
Pad: p23 S
Pad: p24 S
Pad: p25 S
Pad: p26 S
Pad: p27 S

Pad: c04 SW

Pad: p28 W
Pad: p29 W
Pad: p30 W
Pad: p31 W
Pad: p32 W
Pad: p33 W
Pad: p34 W
Pad: p35 W
Pad: p36 W

B.2.2 Configuration File
global rda_Input
set cwd [pwd]
set rda_Input(ui_netlist) {./proposed_chip_enc.v}
set rda_Input(ui_netlisttype) {Verilog}
set rda_Input(ui_settop) {1}
set rda_Input(ui_topcell) {proposed_chip}
set rda_Input(ui_timelib) {/home/vlsi/RCLAB/library/typical.lib
/home/vlsi/RCLAB/library/tpz973gtc.lib}
set rda_Input(ui_timingcon_file) {./proposed_chip_enc.sdc}
set rda_Input(ui_buf_footprint) {BUFX1}
set rda_Input(ui_inv_footprint) {INVX1}
set rda_Input(ui_leffile) {/home/vlsi/RCLAB/library/all.lef
/home/vlsi/RCLAB/library/tpz973g_6lm.lef
/home/vlsi/RCLAB/library/tsmc18_6lm_tech.lef}
set rda_Input(ui_cts_cell_list) {CLKBUFX20 CLKBUFXL CLKBUFX1
CLKBUFX2 CLKBUFX3 CLKINVX1 CLKINVX2 CLKINVX12 CLKINVX3
CLKINVX4}
set rda_Input(ui_core_cntl) {aspect}
set rda_Input(ui_aspect_ratio) {1.0000}
set rda_Input(ui_captbl_file) {/home/vlsi/RCLAB/library/t018s6mlv.capTbl}
set rda_Input(ui_defcap_scale) {1.0}
set rda_Input(ui_res_scale) {1.0}
set rda_Input(ui_shr_scale) {1.0}
set rda_Input(assign_buffer) {1}
set rda_Input(ui_gen_footprint) {1}

B.2.3.SDC File
####################################################################
# Created by Encounter(R) RTL Compiler v08.10-p104_1 on Mon Jul 11 13:26:33
IST 2011
#
####################################################################
set sdc_version 1.7
set_units -capacitance 1000.0fF
set_units -time 1000.0ps
# Set the current design
current_design proposed_chip

create_clock -name "pclk" -add -period 6.0 -waveform {0.0 3.0} [get_ports pclk]
set_clock_gating_check -setup 0.0
set_input_delay -clock [get_clocks pclk] -add_delay 0.1 [get_ports preset]
set_input_delay -clock [get_clocks pclk] -add_delay 0.1 [get_ports pclk]
set_input_delay -clock [get_clocks pclk] -add_delay 0.1 [get_ports {py[0]}]
set_input_delay -clock [get_clocks pclk] -add_delay 0.1 [get_ports {py[1]}]
set_input_delay -clock [get_clocks pclk] -add_delay 0.1 [get_ports {py[2]}]
set_input_delay -clock [get_clocks pclk] -add_delay 0.1 [get_ports {py[3]}]
set_input_delay -clock [get_clocks pclk] -add_delay 0.1 [get_ports {py[4]}]
set_input_delay -clock [get_clocks pclk] -add_delay 0.1 [get_ports {py[5]}]
set_input_delay -clock [get_clocks pclk] -add_delay 0.1 [get_ports {py[6]}]
set_input_delay -clock [get_clocks pclk] -add_delay 0.1 [get_ports {py[7]}]
set_input_delay -clock [get_clocks pclk] -add_delay 0.1 [get_ports {px[0]}]
set_input_delay -clock [get_clocks pclk] -add_delay 0.1 [get_ports {px[1]}]
set_input_delay -clock [get_clocks pclk] -add_delay 0.1 [get_ports {px[2]}]
set_input_delay -clock [get_clocks pclk] -add_delay 0.1 [get_ports {px[3]}]
set_input_delay -clock [get_clocks pclk] -add_delay 0.1 [get_ports {px[4]}]
set_input_delay -clock [get_clocks pclk] -add_delay 0.1 [get_ports {px[5]}]
set_input_delay -clock [get_clocks pclk] -add_delay 0.1 [get_ports {px[6]}]
set_input_delay -clock [get_clocks pclk] -add_delay 0.1 [get_ports {px[7]}]
set_output_delay -clock [get_clocks pclk] -add_delay 0.1 [get_ports {pp[0]}]
set_output_delay -clock [get_clocks pclk] -add_delay 0.1 [get_ports {pp[1]}]
set_output_delay -clock [get_clocks pclk] -add_delay 0.1 [get_ports {pp[2]}]
set_output_delay -clock [get_clocks pclk] -add_delay 0.1 [get_ports {pp[3]}]
set_output_delay -clock [get_clocks pclk] -add_delay 0.1 [get_ports {pp[4]}]
set_output_delay -clock [get_clocks pclk] -add_delay 0.1 [get_ports {pp[5]}]
set_output_delay -clock [get_clocks pclk] -add_delay 0.1 [get_ports {pp[6]}]
set_output_delay -clock [get_clocks pclk] -add_delay 0.1 [get_ports {pp[7]}]
set_output_delay -clock [get_clocks pclk] -add_delay 0.1 [get_ports {pp[8]}]
set_output_delay -clock [get_clocks pclk] -add_delay 0.1 [get_ports {pp[9]}]
set_output_delay -clock [get_clocks pclk] -add_delay 0.1 [get_ports {pp[10]}]
set_output_delay -clock [get_clocks pclk] -add_delay 0.1 [get_ports {pp[11]}]
set_output_delay -clock [get_clocks pclk] -add_delay 0.1 [get_ports {pp[12]}]
set_output_delay -clock [get_clocks pclk] -add_delay 0.1 [get_ports {pp[13]}]
set_output_delay -clock [get_clocks pclk] -add_delay 0.1 [get_ports {pp[14]}]
set_output_delay -clock [get_clocks pclk] -add_delay 0.1 [get_ports {pp[15]}]
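The per-bit input and output delay constraints above (for the `py`, `px`, and `pp` buses) are generated one line per bit by the tool. As a side note, since `get_ports` accepts bus wildcards in standard SDC, the same intent could be expressed more compactly as a sketch like the following (assuming the constraint reader in use supports wildcard port matching):

```tcl
# Compact equivalent of the per-bit constraints above (assumes SDC
# bus-wildcard support in get_ports; the tool-generated file expands
# these to one line per bit).
set_input_delay  -clock [get_clocks pclk] -add_delay 0.1 [get_ports {py[*]}]
set_input_delay  -clock [get_clocks pclk] -add_delay 0.1 [get_ports {px[*]}]
set_output_delay -clock [get_clocks pclk] -add_delay 0.1 [get_ports {pp[*]}]
```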
set_wire_load_selection_group "WireAreaCon" -library "tpz973gtc"
set_dont_use [get_lib_cells typical/RF1R1WX2]
set_dont_use [get_lib_cells typical/RF2R1WX2]
set_dont_use [get_lib_cells typical/RFRDX1]
set_dont_use [get_lib_cells typical/RFRDX2]
set_dont_use [get_lib_cells typical/RFRDX4]
set_dont_use [get_lib_cells typical/TIEHI]
set_dont_use [get_lib_cells typical/TIELO]
set_dont_use [get_lib_cells tpz973gtc/PVDD2DGZ]
set_dont_use [get_lib_cells tpz973gtc/PVSS2DGZ]

B.2.4. MODE File:
#####################################################################
# First Encounter mode file
# Created by Encounter(R) RTL Compiler on 07/11/11 13:26:34
#####################################################################
# General Mode Settings
###########################################################
if {[enc_version] >= 7.1} {
    setAnalysisMode -asyncChecks noAsync
} else {
    setAnalysisMode -noAsync
}
set_global timing_apply_default_primary_input_assertion false
set_global timing_clock_phase_propagation both
if {[enc_version] >= 7.1} {
    setAnalysisMode -multipleClockPerRegister true
} else {
    setAnalysisMode -multipleClockPerRegister
}
if {[enc_version] >= 7.1} {
    setPlaceMode -reorderScan false
} else {
    setPlaceMode -noReorderScan
}
if {[enc_version] >= 7.1} {
    setExtractRCMode -engine default
} else {
    setExtractRCMode -default
}
