INTRODUCTION
1.1 Motivation:
With the recent rapid advances in multimedia and communication systems, real-time
signal processing such as audio signal processing, video/image processing, and
large-capacity data processing is increasingly in demand. The multiplier and the
Multiplier-And-Accumulator (MAC) are the essential elements of Digital Signal
Processing operations such as filtering, convolution, and inner products.
Most Digital Signal Processing methods use transforms such as the Discrete
Cosine Transform (DCT) or the Discrete Wavelet Transform (DWT). Because these are
basically accomplished by repetitive application of multiplication and addition, the
speed of the multiplication and addition arithmetic determines the execution speed
and performance of the entire calculation. Because the multiplier has the longest
delay among the basic operational blocks in a digital system, the critical path is,
in general, determined by the multiplier. For high-speed multiplication, the Modified
radix-4 Booth’s Algorithm (MBA) is commonly used. However, this alone cannot
completely solve the problem, because the critical path for multiplication remains long.
In general, a multiplier uses Booth’s algorithm and an array of full adders (FAs), or a
Wallace tree instead of the array of FAs. The Wallace tree multiplier mainly consists
of three parts: a Booth encoder, a tree that compresses the partial products (such as a
Wallace tree), and a final adder. Because the Wallace tree adds the partial products from
the encoder as much in parallel as possible, its operation time is O(log2 N), where
N is the number of inputs. It uses the fact that counting the number of 1’s among N
inputs yields a result of about log2 N bits. In real implementations, many (3:2)
or (7:3) counters are used to reduce the number of outputs in each pipeline stage. The
most effective way to increase the speed of a multiplier is to reduce the number of
partial products, because multiplication proceeds as a series of additions of the partial
products. To reduce the number of calculation steps for the partial products, the MBA has
mostly been applied, while the Wallace tree has taken the role of speeding up the
addition of the partial products. To increase the speed of the MBA, many parallel
multiplication architectures have been researched. Among them, architectures
based on the Baugh–Wooley Algorithm (BWA) have been developed and
applied to various digital filtering calculations.
One of the most advanced types of MAC for general-purpose Digital Signal
Processing has been proposed by Elguibaly. It is an architecture in which
accumulation has been combined with the Carry Save Adder (CSA) tree that
compresses the partial products. In this architecture, the critical path is reduced by
eliminating the adder for accumulation and decreasing the number of input bits in the
final adder. While it performs better than previous MAC architectures because of the
reduced critical path, its output rate still needs improvement, because the final adder
results are used for accumulation. Therefore, a new architecture for a high-speed MAC
is proposed.
1.2 Objective:
In this thesis, a new high-speed MAC architecture is proposed. In this
MAC, the computations of multiplication and accumulation are combined, and a
hybrid-type CSA structure is proposed to reduce the critical path and improve the
output rate.
CHAPTER 2
LITERATURE SURVEY
2.1 Introduction:
This chapter explains the significance of MAC and gives a survey on various
types of multipliers, adders and MAC architectures.
pipelining in the ALU or how fast a processor can run. The current trend in ALU
design is to implement the addition and multiplication operations using one hardware
component.
MAC = multiplication + Accumulation
accumulate, etc. Thus multipliers play a critical role in processing audio, graphics,
video, and multimedia data.
For designing a multiplier circuit, the following four points should be kept in
mind.
1. It should be capable of identifying whether a bit is 0 or 1.
2. It should be capable of shifting partial products left.
3. It should be able to add all the partial products to give the product as the sum of
the partial products.
4. It should examine the sign bits. If they are alike, the sign of the product is
positive; if they are opposite, the product is negative. The sign bit of the
product, determined by the above criteria, should be displayed along with the product.
From the above discussion it is clear that it is not necessary to wait until all the
partial products have been formed before summing them. In fact, each partial product
can be added as soon as it is formed.
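The four design points and the running accumulation of partial products can be illustrated with a short Python sketch. This is only a software model for illustration, not a hardware description; the function names `shift_and_add` and `signed_multiply` and the default width of 8 bits are assumptions introduced here.

```python
def shift_and_add(multiplicand, multiplier, n_bits=8):
    """Unsigned shift-and-add multiplication of two n_bits-wide magnitudes."""
    product = 0
    for i in range(n_bits):
        bit = (multiplier >> i) & 1                   # point 1: is the bit 0 or 1?
        partial = (multiplicand << i) if bit else 0   # point 2: shift the partial product left
        product += partial                            # point 3: add it as soon as it is formed
    return product


def signed_multiply(x, y, n_bits=8):
    """Sign-magnitude wrapper: like signs give a positive product (point 4)."""
    negative = (x < 0) != (y < 0)
    magnitude = shift_and_add(abs(x), abs(y), n_bits)
    return -magnitude if negative else magnitude
```

For example, `shift_and_add(7, 7)` sums the partial products 7, 14 and 28, and `signed_multiply(-5, 3)` negates the magnitude because the sign bits differ.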
Booth’s algorithm gives a procedure for multiplying binary integers in signed
2’s complement representation.
0 1 Add Y to U, and shift
1 0 Subtract Y from U (or add (-Y) to U), and shift
II. Take U and V together and apply an arithmetic right shift, which preserves the sign
bit of a 2’s complement number. Thus a positive number remains positive, and a negative
number remains negative.
III. Apply a circular right shift to X, because this avoids the need for two registers
for the X value.
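The register-level procedure can be sketched in Python as follows. The sketch assumes that U and V hold the upper and lower halves of the product, that X is the multiplier and Y the multiplicand, that the bit pair (X0, X-1) selects the operation, and that the pairs 0 0 and 1 1 cause a shift only. The function name `booth_multiply` is hypothetical, and U is modelled as an unbounded Python integer, so register overflow is not modelled.

```python
def booth_multiply(x, y, n=8):
    """Radix-2 Booth multiplication of two n-bit 2's complement integers."""
    mask = (1 << n) - 1
    u, v = 0, 0            # U: upper half of the product, V: lower half
    xr = x & mask          # multiplier register X
    x_prev = 0             # the extra bit X(-1), initially 0
    for _ in range(n):
        x0 = xr & 1
        if (x0, x_prev) == (0, 1):
            u += y         # 0 1: add Y to U
        elif (x0, x_prev) == (1, 0):
            u -= y         # 1 0: subtract Y from U
        # arithmetic right shift of U and V taken together
        v = ((u & 1) << (n - 1)) | (v >> 1)
        u >>= 1            # Python's >> on a signed int preserves the sign
        # circular right shift of X, so no second register is needed
        x_prev = x0
        xr = (xr >> 1) | (x0 << (n - 1))
    return (u << n) | v
```

After n iterations X has rotated back to its original value, which is why a single register suffices for it.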
(ii) The algorithm becomes inefficient when there are isolated 1’s.
This is the same result as the equivalent shift-and-add method:
Partial Product 0 = Multiplicand * 1, shifted left 0 bits (x 1)
Partial Product 1 = Multiplicand * 1, shifted left 1 bits (x 2)
Partial Product 2 = Multiplicand * 1, shifted left 2 bits (x 4)
Partial Product 3 = Multiplicand * 0, shifted left 3 bits (x 0)
To recode the Booth multiplier term, consider the multiplier bits in blocks of
three, such that each block overlaps the previous block by one bit. Grouping starts
from the LSB, and the first block uses only two bits of the multiplier (since there is
no previous block to overlap, its third bit is assumed to be 0), as shown in Fig. 2.4 (a).
The overlap is necessary to know what happened in the previous block, as the MSB
of each block acts like a sign bit. Then consult the table given below to determine the
Booth-encoded terms.
Table 2.1
Modified Booth Encoding Table
Fig. 2.4 Modified Booth encoder (a) Grouping of multiplier bits for N=8,
(b) Encoder circuit
Since the LSB of each block is used to know what the sign bit was in the
previous block, and there are never any negative products before the least significant
block, the LSB of the first block is always assumed to be 0.
In the case where there are not enough bits to obtain an MSB for the last block,
as below, the multiplier sign bit can be extended by one bit.
For example, for the multiplier 0 0 1 1 1:
Block 0 : 110 Encoding: * (-1)
Block 1 : 011 Encoding: * (2)
Block 2 : 000 Encoding: * (0)
The modified Booth encoder circuit, shown in Fig. 2.4 (b), is designed using the
Modified Booth encoding table.
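The block grouping and the example for the multiplier 0 0 1 1 1 can be reproduced with a small Python sketch. The dictionary below is the standard modified Booth encoding, written out here as an assumption because the table itself is not reproduced in this text; the names `BOOTH_TABLE` and `booth_recode` are hypothetical.

```python
# (MSB, middle bit, LSB) of each overlapping 3-bit block -> encoded digit
BOOTH_TABLE = {
    (0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1, (0, 1, 1): 2,
    (1, 0, 0): -2, (1, 0, 1): -1, (1, 1, 0): -1, (1, 1, 1): 0,
}


def booth_recode(multiplier, n_bits):
    """Return the Booth-encoded digits, least significant block first."""
    if n_bits % 2:
        n_bits += 1        # extend the sign bit by one position

    def bit(i):
        if i < 0:
            return 0       # the LSB of the first block is assumed to be 0
        return (multiplier >> i) & 1   # >> sign-extends negative Python ints

    return [BOOTH_TABLE[(bit(i + 1), bit(i), bit(i - 1))]
            for i in range(0, n_bits, 2)]
```

Calling `booth_recode(0b00111, 5)` gives the digits of Block 0, Block 1 and Block 2 in that order; weighting digit k by 4^k recovers the multiplier value 7.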
The 2Xsel signal is used as the control to a 2:1 multiplexer, to select whether or
not the partial product bits are shifted left by one position.
Finally, the NEGsel signal indicates whether or not to invert all of the bits to
create a negative product (which must be corrected by adding "1" at some later
stage).
From Fig. 2.4 (b), based on the values of Xsel, 2Xsel and NEGsel, the
corresponding partial products (0, +X, -X, +2X, -2X) are generated. These
partial products are arranged in the manner shown in Fig. 2.5 and added
together to obtain the required multiplication result.
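How the three select signals produce a partial product can be sketched in Python. Because the encoder circuit of Fig. 2.4 (b) is not reproduced here, the logic equations in `booth_select` are one common realisation and should be read as an assumption; all function names are hypothetical. The row is returned together with the "1" that must be added later to complete the negation.

```python
def booth_select(b2, b1, b0):
    """Derive (Xsel, 2Xsel, NEGsel) from one block (MSB b2, middle b1, LSB b0)."""
    x_sel = b1 ^ b0                                     # select +/-X
    two_x_sel = int((b2, b1, b0) in ((0, 1, 1), (1, 0, 0)))  # select +/-2X
    neg_sel = b2                                        # invert all bits
    return x_sel, two_x_sel, neg_sel


def partial_product(x, sels, width):
    """Return (row bits, correction bit) for the multiplicand x."""
    x_sel, two_x_sel, neg_sel = sels
    mask = (1 << width) - 1
    if two_x_sel:
        mag = (x << 1) & mask      # 2:1 multiplexer picks the shifted bits
    elif x_sel:
        mag = x & mask
    else:
        mag = 0
    row = (~mag & mask) if neg_sel else mag
    return row, neg_sel            # neg_sel doubles as the "+1" added later


def to_signed(value, width):
    """Interpret the low `width` bits of value as a 2's complement number."""
    value &= (1 << width) - 1
    return value - (1 << width) if value >> (width - 1) else value
```

Adding the correction bit to the row and reading the result as a 2's complement number gives 0, +X, -X, +2X or -2X, including the block 1 1 1, where the inverted zero plus 1 wraps back to 0.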
The modified radix-4 Booth algorithm scans strings of three bits using the
algorithm given below:
1) Extend the sign bit by one position if necessary to ensure that n is even.
2) Append a 0 to the right of the LSB of the multiplier.
3) According to the value of each vector, each partial product will be 0, +X, -X, +2X
or -2X.
Table 2.2
Partial Product for each encoded yi’ with N=8
(p_{i,j}: partial product bit, n_{i,0}: negation bit)
The previous example can be rewritten as:
Once the Booth recoded partial products have been generated, they need to be
shifted and added together in the following fashion:
The problem with implementing this in hardware is that the first partial
product needs to be sign-extended by six bits, the second by four bits, and so on. This is
easily achievable in hardware, but it requires more logic gates than if those bits
could be permanently kept constant.
This technique allows any sign bits to be correctly propagated, without the
need to sign-extend all of the bits.
2.4.1 Sign Extension Simplification:
Fig. 2.6 shows a 16-bit radix-4 Booth partial product array for an unsigned
multiplier using dot diagram notation. An extra ‘1’ bit is added at the least significant
bit of the next row to form the 2’s complement of negative multiples. Inverting the
implicit leading 0’s generates leading 1’s on negative multiples. PP8 is required in
case PP7 is negative. This partial product is always zero (for an unsigned multiplier).
Observe that the sign extension bits of a partial product are either all 1’s or all 0’s.
If a single 1 is added to the least significant position in a string of 1’s, the result is
a string of 0’s plus a carry out of the top bit, which may be discarded. Therefore, the
large number of ‘s’ bits in each partial product can be replaced by an equal number of
constant 1’s plus the inverse of s added to the least significant position, as shown in
Fig. 2.7. Most of these constant values can be optimized out of the array by
pre-computing their sum. The simplified result is shown in Fig. 2.8.
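The replacement of the sign extension bits can be checked numerically. The following Python sketch compares a field of k copies of the sign bit s with the simplified form (k constant 1’s plus the inverse of s at the least significant position, carry out discarded); the function names are hypothetical and k is the assumed width of the extension field.

```python
def sign_field(s, k):
    """k copies of the sign bit s, as an unsigned k-bit value."""
    return s * ((1 << k) - 1)


def simplified_field(s, k):
    """k constant 1's plus NOT s at the LSB, with the carry out discarded."""
    ones = (1 << k) - 1
    return (ones + (1 - s)) & ones
```

For s = 1 nothing is added and the string of 1’s is kept; for s = 0 the added 1 ripples through the whole string and leaves all 0’s, which is exactly the original field.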
Fig. 2.7 Booth-encoded partial products with simplified sign extension
adders of different sizes may be cascaded in order to accommodate binary vector
strings of larger sizes. An n-bit parallel adder requires n computational elements
(FAs). Fig. 2.9 shows an example of a parallel adder: a 4-bit ripple-carry adder. It is
composed of four full adders. The augend bits of x are added to the addend bits of y
according to their binary positions. Each bit addition creates a sum and a carry out.
The carry out is then transmitted to the carry in of the next higher-order bit. The final
result is a sum of four bits plus a carry out (c4).
Even though this is a simple adder and can be used to add numbers of unrestricted
bit length, it is not very efficient when large bit lengths are used.
One of the most serious drawbacks of this adder is that the delay increases
linearly with the bit length. As mentioned before, each full adder has to wait for the
carry out of the previous stage before it can output a steady-state result. Therefore,
even if the adder has a value at its output terminal, it has to wait for the propagation
of the carry before the output reaches a correct value. In Fig. 2.9, the addition of x4
and y4 cannot reach steady state until c4 becomes available. In turn, c4 has to wait for
c3, and so on down to c1. If one full adder takes T_FA seconds to complete its operation,
the final result will reach its steady-state value only after 4·T_FA seconds. Its area is
4·A_FA.
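The carry chain of the ripple-carry adder can be modelled bit by bit in Python. This is a behavioural sketch only; the function names and the default width of 4 bits are assumptions chosen to match the 4-bit example.

```python
def full_adder(a, b, cin):
    """One full adder: returns (sum bit, carry out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def ripple_carry_add(x, y, n=4, cin=0):
    """Add two n-bit values; each stage waits for the previous carry."""
    carry = cin
    total = 0
    for i in range(n):     # n sequential full-adder delays
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        total |= s << i
    return total, carry
```

The loop makes the linear delay visible: stage i cannot be evaluated until the carry from stage i-1 is known, so the n stages contribute n full-adder delays in the worst case.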
propagation time. To understand how the carry look-ahead adder works, we have to
manipulate the Boolean expression dealing with the full adder. The Propagate P and
The new expressions for the output sum and the carryout are given by:
S_i = P_i xor C_i
C_{i+1} = G_i + P_i C_i
These equations show that a carry signal will be generated in two cases:
1) If both bits Ai and Bi are 1
2) If either Ai or Bi is 1 and the carry-in Ci is 1.
These expressions show that C2, C3 and C4 do not depend on their previous
carry-ins. Therefore C4 does not need to wait for C3 to propagate. As soon as C0 is
available, C4 can reach steady state. The same is also true for C2 and C3.
The general expression is
C_{i+1} = G_i + P_i G_{i-1} + P_i P_{i-1} G_{i-2} + ... + P_i P_{i-1} ... P_2 P_1 G_0 + P_i P_{i-1} ... P_1 P_0 C_0
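The general carry expression can be evaluated directly, without rippling, as in the following Python sketch. It assumes the usual definitions G_i = A_i and B_i, P_i = A_i xor B_i, which are not spelled out in the surviving text; the function name `cla_add` is hypothetical.

```python
def cla_add(a, b, c0=0, n=4):
    """n-bit carry look-ahead addition; returns (sum bits, list of carries C0..Cn)."""
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(n)]   # generate
    p = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(n)]   # propagate
    carries = [c0]
    for i in range(n):
        # C(i+1) = G_i + P_i G_(i-1) + ... + P_i ... P_1 G_0 + P_i ... P_0 C_0
        c = g[i]
        prod = p[i]
        for j in range(i - 1, -1, -1):
            c |= prod & g[j]
            prod &= p[j]
        c |= prod & c0
        carries.append(c)
    sum_bits = sum((p[i] ^ carries[i]) << i for i in range(n))
    return sum_bits, carries
```

Each carry is computed only from the G and P terms and C0, never from another carry, which is the property that removes the ripple delay. In hardware all the product terms are evaluated in parallel; the inner loop here only enumerates them.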
For small n (n<=4), the RCA is advantageous because it requires less area and is fast.
The CLA consumes more area and power because of its large number of logic gates.
2.5.4 Carry Save Adder:
Unlike the RCA, CLA and CSL adders, the Carry Save Adder realizes concurrent
addition of multiple operands, which is a basic requirement of multiplication. Carry
Save Adder architectures are therefore widely used to speed up the addition
process.
In this technique, the carry output from bit i during step j is applied to the
carry input of bit i+1 during the next step j+1. After the addition of the product
components in the last row, one more step is required in which the carries are allowed
to ripple from the least to the most significant bit. This technique does not save any
hardware, but it reduces the propagation delay substantially.
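A word-level Python sketch of carry-save addition is given below. It assumes non-negative operands for readability, and the names `carry_save_add` and `add_many` are hypothetical. Each step produces a sum word and a carry word with no carry propagation; only the last step uses a carry-propagate addition.

```python
def carry_save_add(a, b, c):
    """(3:2) compression: three operands in, sum word and carry word out."""
    s = a ^ b ^ c                                   # per-bit sum, no carry chain
    carry = ((a & b) | (a & c) | (b & c)) << 1      # carry of bit i goes to bit i+1
    return s, carry


def add_many(operands):
    """Add any number of operands with one carry-save step per operand."""
    s, c = 0, 0
    for op in operands:
        s, c = carry_save_add(s, c, op)
    return s + c        # the single final step in which carries ripple
```

The invariant at every step is that the sum word plus the carry word equals the total so far, so the expensive carry propagation is paid once rather than once per operand.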
In this thesis, a hybrid CSA tree structure is used to increase the performance of
the MAC architecture. This hybrid adder architecture combines a Carry Save
Adder with a mix of 2-bit and 4-bit Carry Look-ahead Adders.
P=X×Y+Z
The N-bit 2’s complement binary number X can be expressed as
If (1) is expressed in base-4 redundant signed-digit form in order to apply the
radix-4 Booth’s algorithm,
Each of the two terms on the right-hand side of (5) is calculated independently
and the final result is produced by adding the two results. The MAC architecture
implemented by (5) is called the standard design.
If N-bit data are multiplied, the number of generated partial products is
proportional to N. If they are added serially, the execution time is also
proportional to N. The fastest multiplier architecture uses radix-4
Booth encoding to generate the partial products and a Wallace tree based on CSAs as
the adder array to add them. If radix-4 Booth encoding is used, the
number of partial products, i.e., the inputs to the Wallace tree, is reduced to half,
resulting in fewer CSA tree steps. In addition, signed multiplication based
on 2’s complement numbers is also possible. For these reasons, most multipliers in
current use adopt Booth encoding.
2) Partial-Product Addition: This is done using a carry ripple adder for serial-
parallel multipliers. For parallel multipliers, the addition is accomplished using carry-
save techniques, Wallace trees, or summand skip. However, the last two techniques
require irregular wiring and extra hardware.
3) Final Adder: When the number of partial products is reduced to sum and carry
words, a final adder is required to generate the multiplication result. The number of
bits of the final adder is the sum of the number of bits of the multiplier and
multiplicand. Thus, the data path width is usually doubled and the delay of this stage
is most severe. In this thesis, we use a mix of 2-bit and 4-bit CLA’s to reduce the
delay and area requirements.
Fig. 2.16 MAC structure of the standard design
As shown in Fig. 2.16, an n × n-bit multiplication is first carried
out using the Booth multiplier, and the 2n-bit multiplication result is then
accumulated with the 2n-bit accumulator content. The output of the MAC structure is
therefore (2n+1) bits wide.
Internally, Booth multiplier includes the following three blocks, which are
shown in Fig. 2.17.
For an n × n-bit MAC operation, let both the multiplicand (X) and the multiplier
(Y) be n bits wide. These two are given as inputs to the Booth encoder. Based on
the Modified Booth Encoding table, one of the possible partial products from the Booth
encoding set (0, X, -X, 2X, -2X) is generated for each 3-bit group
of the multiplier. Booth encoding uses the 2’s complement method to generate
the numbers. Here, each partial product is (n+1) bits wide.
CSA tree: The partial products generated in the previous step are given as
inputs to the CSA tree. The CSA tree compresses the generated partial
products and converts them into sum and carry form. Here, both the inputs and
outputs are of the same width, i.e., (n+1) bits.
Final Addition: When the partial products have been reduced to sum and carry
words, a final adder is required to generate the multiplication result. Here, the data
path width is doubled, i.e., 2n bits. This multiplication result is added to the
accumulator content, which is also 2n bits wide.
In order to increase the MAC speed, there are two major bottlenecks that need
to be considered. The first is the partial product network and the second is
the accumulator. Both of these stages require the addition of large operands, which
involves long paths for carry propagation. Using a tree architecture is an
attractive solution to speed up the partial product reduction process.
Since the accumulation has the longest delay in the MAC operation, the
independent accumulation operation has been removed and merged into the
compression process of the partial products, so that the overall MAC performance is
improved.
One of the most advanced types of MAC for general-purpose digital signal
processing has been proposed by Elguibaly. It is an architecture in which
accumulation has been combined with the carry save adder (CSA) tree that
compresses partial products. In this architecture, the critical path was reduced by
eliminating the adder for accumulation and decreasing the number of input bits in the
final adder.
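The idea of merging accumulation into the CSA tree can be illustrated with a behavioural Python sketch: the accumulator value Z enters the tree as one more row, so a single final addition yields X×Y+Z. This is a simplified model and not Elguibaly's circuit; the function names are hypothetical, the Booth digit formula d = b(i-1) + b(i) - 2b(i+1) is the standard radix-4 recoding, and Python's unbounded signed integers stand in for sign-extended rows.

```python
def csa(a, b, c):
    """(3:2) carry-save step; also valid for negative Python integers."""
    return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1


def booth_digits(y, n):
    """Radix-4 Booth digits of the n-bit 2's complement multiplier y."""
    def bit(i):
        return 0 if i < 0 else (y >> i) & 1
    return [bit(i - 1) + bit(i) - 2 * bit(i + 1) for i in range(0, n, 2)]


def merged_mac(x, y, z, n=8):
    """Compute x*y + z with the accumulator merged into the CSA tree."""
    s, c = z, 0                      # Z is treated as an ordinary tree input
    for k, d in enumerate(booth_digits(y, n)):
        s, c = csa(s, c, (d * x) << (2 * k))   # one partial product per row
    return s + c                     # the only carry-propagate addition
```

Compared with multiplying first and accumulating afterwards, the separate accumulation adder disappears and only one carry-propagate addition remains.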
In this architecture, a Dependence Graph (DG) of the merged MAC operation
based on MBA is developed.
Fig.2.19 Parallel carry-save array multiplier
Fig. 2.20 Dependence graph for carry save addition with carry ripple vector merging
Dependence Graphs are used for systolic array design, where various
implementations can be derived from a single DG by exploiting the parallelism
present in the DG in different ways.
Fig. 2.21 Elguibaly’s Parallel MAC design
P=X×Y+Z
where the multiplier X and multiplicand Y are assumed to have n bits
each and the addend Z has 2n bits. The number of partial products (summands) in the
MBA is given by
number of summands = n/2,      for n even
                   = (n+1)/2,  for n odd
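As a quick check of the summand count, a one-function Python sketch follows; the name `num_summands` is hypothetical.

```python
def num_summands(n):
    """Number of MBA partial products (summands) for an n-bit multiplier."""
    return n // 2 if n % 2 == 0 else (n + 1) // 2
```

For n = 8 this gives the four rows S0 to S3 used in the DG of the MAC operation.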
Fig. 2.23 shows the DG of the MAC operation for n = 8. Four rows,
representing the summands S0 to S3, are generated by the Booth encoders. These
summands are added using carry-save addition, as shown by the empty circles and
the lines connecting bits of equal binary weight. To prevent overflow, sign extensions
have to be provided (as shown by the black circles at the right column). Traditionally,
sign extension to the full width of the data path is provided. For example, to add three
summands in the MBA using other techniques, one of the summands has to be
extended by 2 bits, the second by 4 bits, the third by 6 bits, and so on. Obviously,
this is extremely wasteful and does not lead to regular connections. In this technique,
each summand is uniformly extended by one bit only, even though three
two’s-complement numbers are added at each step. This is justified by the fact that two
of the numbers are smaller than the third by a factor of four, since the MBA skips over
two bits of the multiplier at each step.
The dashed arrows on the right in Fig. 2.23 represent a “1” at the LSB for the
two’s-complement operation, and the dashed arrows on the left represent sign
extensions. The diamonds at the bottom row represent the final adder, while the diamonds
on the right of the figure represent full adders that are required to produce the LSBs of
the final product.
Fig. 2.23 DG of the MBA for the case n = 8. The dashed arrows on the right represent
“1” at the LSB for the two’s-complement operation, and the dashed arrows on the left
represent sign extensions. The circles, as well as the empty diamonds, represent
full adders.
The DG for the MAC algorithm in Fig. 2.23 helps to optimize the hardware
design task. This section describes how an efficient hardware design is obtained using
several optimization decisions.
Fig. 2.24 shows the use of 2-bit CLAs to add the LSBs of the sum and carry
words. These adders are indicated by A0, A1 and A2. It is not necessary to use 4-bit
CLAs here, since these add extra area and will not lead to any further speedup. In this
way, the LSB part of the resulting product is obtained after approximately n/2
full-adder delays only.
The (n+2)-bit addition for the MSB part of the final adder is done using the usual
4-bit CLAs if a completely parallel implementation is contemplated. In this case, the
expected adder delay will be approximately half that of earlier implementations, since
the number of bits involved is (n+2), as compared to 2n.
Fig. 2.24 DG for the 8-bit MBA where carry propagation between stages is optimized.
2-bit CLAs (A0, A1, and A2) are used to implement the LSB word of the final addition.
4-bit CLAs are used to implement the MSB word of the final addition.
CSA. The number of bits of sums and carries to be transferred to the final adder is
reduced by adding the lower bits of the sums and carries in advance, within the range in
which the overall performance is not degraded. A 2-bit CLA is used to add the
lower bits in the CSA. In addition, to increase the output rate when pipelining is
applied, the sums and carries from the CSA are accumulated instead of the outputs
from the final adder, in such a way that the sum and carry from the CSA in the
previous cycle are fed back as inputs to the CSA. Due to this feedback of both sum and
carry, the number of inputs to the CSA increases compared to the standard design and
Elguibaly’s design.
If (6) is divided into the first partial product, the sum of the middle partial
products, and the final partial product, it can be re-expressed as (7). The reason for
separating the partial product addition as in (7) is that three types of data are fed
back for accumulation: the sum, the carry, and the pre-added results of the sum and
carry from the lower bits.
The second term can be separated further into the carry term and sum term as
Thus, (8) is finally separated into three terms as
If (7) and (10) are used, the MAC arithmetic in (5) can be expressed as
If each term of (11) is matched to the bit position and rearranged, it can be
expressed as (12), which is the final equation for the proposed MAC. The first
parenthesis on the right is the operation to accumulate the first partial product with the
added result of the sum and the carry. The second parenthesis is the one to accumulate
the middle partial products with the sum of the CSA that was fed back. Finally, the
third parenthesis expresses the operation to accumulate the last partial product with
the carry of the CSA.
Fig. 2.25 Proposed arithmetic operation of multiplication and accumulation.
The hardware architecture of the MAC that implements the process in Fig. 2.25 is
shown in Fig. 2.26. The n-bit MAC inputs, X and Y, are converted into (n+1)-bit
partial products by passing through the Booth encoder. In the CSA and accumulator,
accumulation is carried out along with the addition of the partial products. As a result,
n-bit S, C and Z (the result of adding the lower bits of the sum and carry) are
generated. These three values are fed back and used for the next accumulation. If the
final result of the MAC is needed, P[2n-1:n] is generated by adding S and C in the
final adder and is combined with P[n-1:0], which was already generated.
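The feedback of sum and carry can be sketched behaviourally in Python. The model below is a simplification made for illustration: it keeps only the S and C words in carry-save form across successive inputs and omits the separate pre-added lower-bit word Z and the 2-bit CLAs. The names `booth_rows` and `mac_stream` are hypothetical, and the rows are formed with the standard radix-4 Booth digits.

```python
def csa(a, b, c):
    """(3:2) carry-save step on Python integers (negative values allowed)."""
    return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1


def booth_rows(x, y, n):
    """Weighted radix-4 Booth partial products of x*y for an n-bit y."""
    def bit(i):
        return 0 if i < 0 else (y >> i) & 1
    return [((bit(i - 1) + bit(i) - 2 * bit(i + 1)) * x) << i
            for i in range(0, n, 2)]


def mac_stream(pairs, n=8):
    """Accumulate x*y over all pairs, feeding back S and C each cycle."""
    s, c = 0, 0
    for x, y in pairs:
        for row in booth_rows(x, y, n):
            s, c = csa(s, c, row)
        # S and C go straight back into the CSA; no final addition here
    return s + c        # final adder, used only when the result is needed
```

Because no carry-propagate addition sits inside the accumulation loop, a new input pair can be accepted every cycle, which is the source of the improved output rate.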
compensate 1’s complement number into 2’s complement number. S[i] and C[i]
correspond to the ith bit of the feedback sum and carry. Z[i] is the ith bit of the sum of
the lower bits for each partial product that were added in advance, and Z’[i] is the
previous result. In addition, Pj[i] corresponds to the ith bit of the jth partial product.
Since the multiplier is for 8 bits, a total of four partial products (P0[7:0] to P3[7:0]) are
generated from the Booth encoder. In (11), d_0 Y and d_{N/2-1} 2^{N-2} Y correspond to
P0[7:0] and P3[7:0], respectively. This CSA requires at least four rows of FAs for the four
partial products. Thus, a total of five FA rows are necessary, since one more row
is needed for accumulation. For an n × n-bit MAC operation, the number of CSA levels
is (n/2+1). The white square in Fig. 2.27 represents an FA and the gray square is a
half adder (HA). The rectangular symbol with five inputs is a 2-bit CLA with a carry
input.
The critical path in this CSA is determined by the 2-bit CLA. It is also
possible to use FAs to implement the CSA without CLAs. However, if the lower bits
of the previously generated partial product are not processed in advance by the
CLAs, the number of bits for the final adder will increase. When the entire multiplier
or MAC is considered, this degrades the performance.
In Table 2.3, the characteristics of the proposed CSA architecture are
summarized and briefly compared with other architectures. For the number system,
the proposed CSA uses a 1’s complement modified CSA array without sign extension.
The biggest difference between the proposed design and the others is the type of values
that are fed back for accumulation. The proposed design has the smallest number of
inputs to the final adder.
Table 2.3
Characteristics of CSA
Characteristic | Standard design | Elguibaly’s design | Proposed design
Number system | 2’s complement | 1’s complement | 1’s complement
Sign extension | Used | Used | Not used
Accumulation | Result data of final addition | Result data of final addition | Sum and carry of CSA
CSA tree | FA, HA | FA, 2-bit CLA | FA, HA, 2-bit CLA
Final adder | 2n bits | n+2 bits | n bits
Table 2.4
Calculation of hardware resources
Fig. 2.31 Pipelined operational sequence of proposed operation
These two schemes are also compared in the time sequences in Fig. 2.30 and
Fig. 2.31, corresponding to Fig. 2.28 and Fig. 2.29 respectively. While Elguibaly’s
design cannot output an accumulated result in every clock period because of a structural
drawback in its accumulation, the proposed architecture can output a result in every
clock cycle. Thus, even though the delay of the proposed architecture is a little longer
than that of Elguibaly’s design, it gives a much better overall performance, i.e., output
rate.
2.12 Conclusion
In this chapter, different types of multipliers, adders and parallel MAC
architectures were presented.
CHAPTER 3
DESIGN AND IMPLEMENTATION
3.1 Introduction:
Hardware Description Languages are modeling tools for creating a hardware
model. Structural Verilog HDL is used to design
Elguibaly’s architecture as well as the proposed parallel MAC architecture. All the
architectures are synthesized in 0.18-μm technology.
3.2.2 Partial product generator:
Inputs : mul, shift, twocom are the Booth-encoded bits, each of 4 bits.
Outputs : pp0, pp1, pp2, pp3 are the generated partial products, each of
9 bits.
3.2.3 Standard design:
Inputs : x, y are the multiplier and multiplicand, each of 8 bits,
clk, reset
Outputs : p is MAC result which is of 16 bits.
Fig. 3.6 RTL schematic of standard design
3.2.4 Elguibaly’s parallel MAC architecture:
Fig. 3.8 RTL schematic of Elguibaly’s parallel MAC architecture
3.2.5 Proposed parallel MAC architecture:
Fig. 3.10 RTL schematic of proposed parallel MAC architecture
3.2.6 Elguibaly’s parallel MAC architecture with 2-stage pipelining:
Fig. 3.12 RTL schematic of modified proposed parallel MAC architecture
3.3 FPGA Synthesis reports:
3.3.1 Standard design:
============================================================
* Final Report *
=============================================================
Final Results
RTL Top Level Output File Name : mac_standard.ngr
Top Level Output File Name : mac_standard
Output Format : NGC
Optimization Goal : Speed
Keep Hierarchy : NO
Design Statistics
# IOs : 34
Cell Usage:
# BELS : 673
# AND2 : 190
# AND3 : 68
# AND8 :1
# INV : 183
# OR2 : 115
# XOR2 : 116
# Flip Flops/Latches : 16
# FDC : 16
# IO Buffers : 34
# IBUF : 18
# OBUF : 16
Number of Ios : 34
Number of bonded IOBs : 34 out of 141 24%
Number of GCLKs : 1 out of 8 12%
Final Results
RTL Top Level Output File Name : mac_elguibaly.ngr
Top Level Output File Name : mac_elguibaly
Output Format : NGC
Optimization Goal : Speed
Keep Hierarchy : NO
Design Statistics
# IOs : 34
Cell Usage:
# BELS : 575
# AND2 : 259
# AND3 : 12
# AND8 :1
# INV : 72
# OR2 : 39
# OR5 : 32
# XOR2 : 160
# FlipFlops/Latches : 16
# FDC : 14
# FDP :2
# IO Buffers : 34
# IBUF : 18
# OBUF : 16
=============================================================
Device utilization summary:
Selected Device : 3s400pq208-5
Number of Slices : 116 out of 3584 3%
Number of Slice Flip Flops : 16 out of 7168 0%
Number of 4 input LUTs : 221 out of 7168 3%
Number of Ios : 34
Number of bonded IOBs : 34 out of 141 24%
Number of GCLKs : 1 out of 8 12%
Device utilization summary:
Selected Device : 3s400pq208-5
Number of Slices : 122 out of 3584 3%
Number of Slice Flip Flops : 24 out of 7168 0%
Number of 4 input LUTs : 228 out of 7168 3%
Number of Ios : 34
Number of bonded IOBs : 34 out of 141 24%
Number of GCLKs : 1 out of 8 12%
# FDP :2
# Clock Buffers :1
# BUFGP :1
# IO Buffers : 33
# IBUF : 17
# OBUF : 16
=============================================================
Device utilization summary:
Selected Device : 3s400pq208-5
Number of Slices : 125 out of 3584 3%
Number of Slice Flip Flops : 40 out of 7168 0%
Number of 4 input LUTs : 235 out of 7168 3%
Number of IOs : 34
Number of bonded IOBs : 34 out of 141 24%
Number of GCLKs : 1 out of 8 12%
=============================================================
3.4.2 Elguibaly’s parallel MAC architecture:
Timing Summary:
Speed Grade: -5
Minimum period : 10.854ns (Maximum Frequency : 92.132MHz)
Minimum input arrival time before clock : 20.506ns
Maximum output required time after clock : 16.332ns
Maximum combinational path delay : 25.984ns
Table 3.1
Delay comparison without pipelining
From Table 3.1, it is clear that the delay of the proposed design is greater
than that of Elguibaly’s design. Even so, the proposed design is preferred over
Elguibaly’s design because its overall output rate is higher once the pipelining
scheme is applied.
Here, 2-stage pipelining is applied to both designs. For Elguibaly’s
design, 2 clock cycles are required to get the first output, and another two
clock cycles are required for the second output. In the proposed design, two clock
cycles are required for the first output, but one clock cycle is enough to get the second
output. The difference between the two is that, in the proposed design, the Booth
encoding and carry-save addition for the second input are done in the second cycle
instead of the third, in parallel with the final addition of the first output.
These two schemes are illustrated in Fig. 3.4 and Fig. 3.5.
Table 3.2
Pipelining analysis
Parameter | Elguibaly’s design | Proposed design
Output rate | 2 clocks | 1 clock
Pipeline delay (n inputs) | 8.658(2n) ns | 10.873(n+1) ns
Pipeline delay (5 inputs) | 86.58 ns | 65.238 ns
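The pipeline delay expressions of Table 3.2 can be evaluated for any number of inputs with the short Python sketch below. The clock periods 8.658 ns and 10.873 ns are taken from the table; the function names are hypothetical.

```python
def elguibaly_delay(n, period_ns=8.658):
    """Pipeline delay in ns for n inputs: two clocks per output."""
    return period_ns * 2 * n


def proposed_delay(n, period_ns=10.873):
    """Pipeline delay in ns for n inputs: two clocks for the first, then one each."""
    return period_ns * (n + 1)
```

With these expressions the proposed design is slower only for a single input (21.746 ns against 17.316 ns); from two inputs onward its total delay is lower, and the gap widens as n grows.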
physical details. It serves as an analysis tool to identify design problems such as
timing and power.
Generating log files:
By default, RTL Compiler generates a log file named rc.log. The log file contains
the entire output of the current RTL Compiler session.
Specify the search paths for libraries, scripts and HDL (Hardware Description
Language) files. The default search path is the directory in which RTL compiler is
invoked.
where path is the full path of the target library, HDL and script locations.
After setting the library search paths, we need to specify the target
technology library for synthesis using the library attribute.
RTL Compiler will use the library named lib_name.lbr for synthesis.
RTL Compiler has two modes: Wireload and PLE. These modes are set using
the interconnect_mode attribute. The default mode is Wireload. In Wireload mode,
wire-load models are used to drive synthesis. In PLE mode, physical layout
estimators (PLE) are used to drive synthesis. PLE is a process of using physical
information, such as LEF (Library Exchange Format) libraries, to provide better closure
with the backend.
Performing elaboration:
Elaboration is required only for the top-level design. The elaborate command
automatically elaborates the top-level design and all of its references. During
elaboration, RTL Compiler:
Builds data structures
Infers registers in the design
Performs high-level HDL optimization, such as dead code removal
Checks semantics
At the end of elaboration, RTL Compiler displays any unresolved references.
After elaboration, RTL Compiler has an internally created data structure for the whole
design, so we can apply constraints and perform other operations.
Applying constraints:
After loading and elaborating the design, we must specify constraints. The
constraints include:
Operating conditions
Clock waveforms
I/O timings
Then read in the constraints file containing the SDC constraints.
Performing synthesis:
After the constraints and optimizations are set for the design, proceed with synthesis by
issuing the synthesize command.
rc:/> synthesize -to_mapped
After synthesis, the gate-level netlist is written out to a file called design.v.
To write the design constraints in SDC format, use the write_sdc command.
Fig. 3.14 RTL synthesis diagram of the standard design
Fig. 3.17 RTL synthesis diagram for the modified Elguibaly’s design
3.5.3.5 Proposed parallel MAC architecture with 2-stage pipelining:
Fig. 3.18 RTL synthesis diagram for the modified proposed design
Chip-level module design involves both the core design and the IO pads. The pad
information is specified in the Encounter libraries. Fig. 3.20 shows the physical
synthesis diagram. This design has 40 pads in total:
Input pads     18
Output pads    16
Supply pads     2
Corner pads     4
The outputs of RTL Compiler are the netlist file, SDC file, config file,
Encounter setup file and Encounter mode files.
Fig 3.21 Critical path of the modified proposed architecture
3.5.5.2 SOC encounter design steps for the modified proposed architecture:
STEP-1: IMPORT DESIGN:
The import design form is used to set up the design import into the Encounter
software. It can import a full-chip design, a partial design such as a
module, or a partitioned design.
The inputs required to import a design into Encounter are the gate-level
netlist, timing libraries, constraint file, LEF files and IO assignment file.
Fig. 3.23 shows the information about the IO pads, core area and netlist file.
STEP-2: TIMING
The timing analysis condition form provides options for setting the modes for
extraction, timing analysis and delay calculation. The operating condition form is used to
select the operating temperature, process and voltage conditions for the design. The
operating conditions are contained in the timing library.
RC extraction mode: This mode is used to set the RC extraction mode, to specify the
threshold value (in ps), to perform RC reduction, to specify whether noise should be
considered during extraction, and to specify the database output file name.
The timing analysis form is used to build a timing graph for the design and generate a
slack report and a detailed timing violation report.
STEP-3: FLOORPLAN
Use the specify floorplan form to view or change floorplan specifications after
importing the design. The dimensions can be specified by size, or by die or
core coordinates. When the form is applied, the floorplan is resized
automatically. Relative floorplan constraints are
automatically derived on the fly for blocks, fixed standard cells, fixed pre-routes and
blockages. The floorplan of the proposed parallel MAC architecture is shown in Fig. 3.24.
Fig. 3.25 Adding power rings to (a) individual blocks (b) group of blocks
The add stripes form is used to create power stripes within the specified range. If
block rings are encountered, the stripes connect to the block rings. If an obstruction is
encountered, the stripes connect to the last stripe on the same net; otherwise the
stripes stop at the core row boundary.
The add X stripes form is used to create diagonal power routes and stripes for
Encounter X designs. Diagonal stripes are added on diagonal routing layers only, and only in
the preferred routing direction of the layer. Power/ground pins (P/G pins) are created at
the specified coordinates. The power stripes for the proposed parallel MAC
architecture are shown in Fig. 3.26.
STEP-5: PLACE
The place specify form enables us to specify and assign spare cells, scan cells,
JTAG cells and placement blockages for power and ground stripes. We must assign
these objects before running placement.
Spare cells: Assign cell types or modules that are designated as spare cells in the
design.
JTAG cell: The JTAG cell form is used to specify modules that contain JTAG logic and
to save and load the specification data.
Placement blockage: Use the placement blockage, stripe and routing blockage form to treat
routing blockage objects and wires with the DEF attribute.
Check placement: Used to check fixed and placed cells and blocks for violations,
add violation markers to the design, and display the area and violation report. The placement
of the proposed parallel MAC architecture is shown in Fig. 3.27.
STEP-6: IOP1
It performs timing optimization on the placed design before the clock tree is built.
By default, it repairs DRVs and setup violations for all path groups. If the worst
negative slack found during the first optimization pass does not occur on a
register-to-register path, the software performs an additional optimization for the
register-to-register critical path. The timing optimization before the clock
tree is built for the proposed parallel MAC architecture is shown in Fig. 3.28.
STEP-8: IOP2
It performs timing optimization on the placed design after the clock tree is built.
By default, it repairs DRVs and setup violations for all path groups. If the worst
negative slack found during the first optimization pass does not occur on a
register-to-register path, the software performs an additional optimization for the
register-to-register critical path. In this mode, when useful skew is run and the clock
has already been detail routed, the EDI (Encounter Digital Implementation) system software
performs ECO (Engineering Change Order) routing using the NanoRoute router. The
timing optimization after the clock tree is built for the proposed parallel MAC
architecture is shown in Fig. 3.30.
Route power:
It is used to limit connections to specified nets or routes. The allow jogging option
specifies that jogs are allowed during routing to avoid DRC violations. The
connectivity of the blocks in the proposed design is shown in Fig. 3.31.
Fig. 3.30 Timing Analysis after CTS is built
STEP-10: ADD FILLERS:
The add fillers form is used to insert filler instances in the gaps between
standard cell instances. If the design is routed, the software performs DRC checks of the
added filler cells against the wires in the design. It does not check the adjacent cells. This
ensures that adding filler cells is fast enough to be used many times in the
design flow. The insertion of fillers in the proposed design is shown in Fig. 3.32.
STEP-11: ROUTE:
The trial route form is used to perform quick global and detailed routing for
estimating routing-related congestion and capacitance values. Special route is used to
route pins to nearby rings and stripes. NanoRoute specifies:
1. Data attributes
2. Most commonly used run time options
3. Routing type (global, detailed or both)
4. Congestion map style and options.
The connectivity of the blocks for the proposed design is shown in Fig. 3.33.
STEP-12: Verify:
Verify connectivity: The verify connectivity form is used to detect conditions such as opens,
unconnected wires, unconnected pins, loops, partial routing and unrouted nets. When
you verify connectivity, the software generates violation markers in the design
window.
Verify metal density: Used to check the metal density of each routing layer
against the values specified in the LEF file.
Cut density: Used to check the density of a specified cut layer and the area of cut layers, or
the cut density of the whole chip.
Fig. 3.34 Verifying the Parameters of the design
a1 4 123 82
fa28 3 86 74
fa9 3 86 70
fa27 3 86 70
fa6 3 86 66
fa5 3 86 66
fa4 3 86 66
fa3 3 86 66
fa26 3 86 66
fa25 3 86 66
fa24 3 86 66
fa23 3 86 66
fa22 3 86 66
fa29 3 86 66
fa8 3 86 58
fa7 3 86 58
fa2 3 86 58
fa34 3 86 58
fa33 3 86 58
fa32 3 86 58
fa31 3 86 58
fa30 3 86 58
fa18 3 86 54
fa13 3 86 49
fa12 3 86 49
fa1 3 86 49
fa42 3 86 41
fa41 3 86 41
fa40 3 86 41
fa39 3 86 41
fa38 3 86 41
fa37 3 86 41
fa17 3 86 41
fa14 3 86 41
fa11 3 86 41
fa21 3 86 33
fa20 3 86 33
fa19 3 86 33
fa16 3 86 33
fa15 3 86 33
fa10 3 86 33
fa0 3 57 54
fa35 1 60 29
fa43 1 60 8
ha7 1 37 16
ha6 1 37 16
ha5 1 37 16
ha4 1 37 16
ha3 1 37 16
ha2 1 37 16
ha1 1 37 16
ha0 1 37 16
fa36 1 37 16
3.5.6.2 Power report:
============================================================
Generated by: Encounter(R) RTL Compiler v08.10-p104_1
Generated on: Nov 02 2011 01:38:24 PM
Module: proposed_pipe_chip
Technology libraries: typical 1.13
tpz973gtc 230
physical_cells
Operating conditions: typical
Interconnect mode: ple
Area mode: physical library
============================================================
fa42 3 0.579 18653.668 18654.247
fa5 3 0.579 9049.397 9049.976
fa6 3 0.579 13095.439 13096.019
fa7 3 0.579 16158.880 16159.459
fa8 3 0.579 18031.057 18031.636
fa9 3 0.579 15577.880 15578.460
fa0 3 0.335 11529.437 11529.772
fa36 1 0.231 12212.879 12213.110
ha0 1 0.231 1799.957 1800.188
ha1 1 0.231 1942.624 1942.855
ha2 1 0.231 2546.368 2546.599
ha3 1 0.231 2539.815 2540.046
ha4 1 0.231 2308.017 2308.248
ha5 1 0.231 2560.314 2560.545
ha6 1 0.231 1594.591 1594.822
ha7 1 0.231 4833.688 4833.919
m2 8 0.092 18434.621 18434.713
m0 0 0.000 2818.800 2818.800
m1 0 0.000 2817.585 2817.585
m2 0 0.000 2818.800 2818.800
m3 0 0.000 2818.800 2818.800
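The rows above can be sanity-checked with a short script. The following is a minimal sketch in Python (not part of the original flow). It assumes that the three numeric columns after the cell count are leakage power, dynamic power and total power in nW; the column headers are not shown in this excerpt, so that interpretation is an assumption.

```python
# A few rows copied from the report above: instance, cells, then three power columns.
# Assumption: the columns are leakage, dynamic and total power (nW).
ROWS = """\
fa42 3 0.579 18653.668 18654.247
fa0 3 0.335 11529.437 11529.772
ha0 1 0.231 1799.957 1800.188
m0 0 0.000 2818.800 2818.800
"""

def parse_rows(text):
    """Split each report row into a small record with numeric fields."""
    records = []
    for line in text.splitlines():
        name, cells, leakage, dynamic, total = line.split()
        records.append({
            "name": name,
            "cells": int(cells),
            "leakage": float(leakage),
            "dynamic": float(dynamic),
            "total": float(total),
        })
    return records

def is_consistent(record, tolerance=0.002):
    """True when leakage + dynamic matches the reported total within rounding."""
    return abs(record["leakage"] + record["dynamic"] - record["total"]) <= tolerance

records = parse_rows(ROWS)
for r in records:
    print(r["name"], "ok" if is_consistent(r) else "mismatch")
```

The tolerance allows for the last-digit rounding visible in some rows of the report.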
a1/cout
a2/cin
g44/B +0 1447
g44/CO ADDHXL 1 8.7 90 +123 1570 F
g43/A +0 1570
g43/Y OR2X2 1 9.9 67 +136 1707 F
g42/A +0 1707
g42/CO ADDHXL 1 8.7 89 +131 1838 F
g41/A +0 1838
g41/Y OR2X2 1 11.1 70 +138 1976 F
a2/cout
a3/cin
g44/B +0 1976
g44/CO ADDHXL 1 8.7 90 +123 2099 F
g43/A +0 2100
g43/Y OR2X2 1 9.9 67 +136 2236 F
g42/A +0 2236
g42/CO ADDHXL 1 8.7 88 +131 2367 F
g41/A +0 2367
g41/Y OR2X2 1 11.1 70 +138 2506 F
a3/cout
a4/cin
g44/B +0 2506
g44/CO ADDHXL 1 8.7 90 +123 2629 F
g43/A +0 2629
g43/Y OR2X2 1 9.9 67 +136 2765 F
g42/A +0 2765
g42/CO ADDHXL 1 8.7 88 +131 2896 F
g41/A +0 2896
g41/Y OR2X2 2 15.5 80 +146 3043 F
a4/cout
fa36/a
g15/B +0 3043
g15/CO ADDHXL 1 11.1 104 +136 3179 F
fa36/cout
fa37/cin
g52/B +0 3179
g52/CO ADDHXL 1 8.7 90 +132 3310 F
g51/A +0 3311
g51/Y OR2X2 1 9.9 67 +137 3447 F
fa37/cout
fa38/cin
g52/A +0 3447
g52/CO ADDHXL 1 8.7 89 +131 3578 F
g51/A +0 3578
g51/Y OR2X2 1 9.9 67 +136 3715 F
fa38/cout
fa39/cin
g52/A +0 3715
g52/CO ADDHXL 1 8.7 89 +131 3846 F
g51/A +0 3846
g51/Y OR2X2 1 9.9 67 +136 3982 F
fa39/cout
fa40/cin
g52/A +0 3982
g52/CO ADDHXL 1 8.7 89 +131 4113 F
g51/A +0 4113
g51/Y OR2X2 1 9.9 67 +136 4250 F
fa40/cout
fa41/cin
g52/A +0 4250
g52/CO ADDHXL 1 8.7 89 +131 4381 F
g51/A +0 4381
g51/Y OR2X2 1 9.9 67 +136 4517 F
fa41/cout
fa42/cin
g52/A +0 4517
g52/CO ADDHXL 1 8.7 89 +131 4648 F
g51/A +0 4648
g51/Y OR2X2 1 14.0 76 +144 4792 F
fa42/cout
fa43/cin
g21/C +0 4792
g21/Y XOR3X2 1 8.4 104 +132 4924 F
fa43/sum
preg2_reg[7]/D DFFRHQXL +0 4924
preg2_reg[7]/CK setup 0 +148 5072 R
---------------------------------
(clock pclk) capture 6000 R
------------------------------------------------------------------
Timing slack : 928ps
Start-point : py[2]
End-point : a1/preg2_reg[7]/D
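The reported slack follows directly from the last rows of the path. The following is a minimal sketch in Python (not part of the original tool output) using the values from the report; the variable and function names are illustrative.

```python
# Values in ps, read from the timing report above.
CLOCK_PERIOD_PS = 6000   # capture edge of clock pclk
DATA_ARRIVAL_PS = 4924   # arrival time at preg2_reg[7]/D
SETUP_PS = 148           # setup requirement of the capturing flip-flop

def setup_slack_ps(capture_ps, arrival_ps, setup_ps):
    """Slack = capture edge - (data arrival + setup requirement)."""
    return capture_ps - (arrival_ps + setup_ps)

slack = setup_slack_ps(CLOCK_PERIOD_PS, DATA_ARRIVAL_PS, SETUP_PS)
print(f"slack = {slack} ps")
```

The sum of arrival time and setup (5072 ps) is the 5.072 ns delay quoted for the proposed design.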
Table 3.3
Comparison of ASIC reports
Parameter     Elguibaly’s Design    Proposed Design
Area          327567                329915
Power (nW)    47002223.290          44571876.632
Delay (ns)    5.973                 5.072
From the above table, the area required for the proposed design is slightly larger
than that of Elguibaly’s design. Even so, the proposed design is preferred
because of its lower power and delay.
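To quantify the comparison, the relative differences can be computed from the figures in Table 3.3. This is a minimal sketch in Python (not part of the original flow); the dictionary keys and the helper name are illustrative.

```python
# Figures copied from Table 3.3 (area in the units reported by the tool).
ELGUIBALY = {"area": 327567, "power_nw": 47002223.290, "delay_ns": 5.973}
PROPOSED = {"area": 329915, "power_nw": 44571876.632, "delay_ns": 5.072}

def percent_change(reference, value):
    """Signed change of value relative to reference, in percent."""
    return (value - reference) / reference * 100.0

changes = {key: percent_change(ELGUIBALY[key], PROPOSED[key]) for key in ELGUIBALY}
for key, pct in changes.items():
    print(f"{key:9s} {pct:+.2f} %")
```

With these figures the proposed design uses about 0.7% more area, about 5.2% less power and about 15.1% less delay.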
Table 3.4
Comparison of Delay Analysis with pipelining
Parameter             Elguibaly’s Design    Proposed Design
Delay for n inputs    5.973*(2n) ns         5.072*(n+1) ns
Delay for 5 inputs    59.73 ns              30.432 ns
From the above table it is clear that, for 5 inputs, the overall delay is nearly
halved compared to Elguibaly’s design, i.e., the overall performance is improved by
49.05%.
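The 49.05% figure can be reproduced from the delay formulas in Table 3.4. This is a minimal sketch in Python (not part of the original flow); the function names are illustrative.

```python
# Per-cycle delays (ns) from Table 3.4; cycle counts follow the 2n and n+1 formulas.
def total_delay_elguibaly(n, period_ns=5.973):
    """Total delay for n inputs at two cycles per output."""
    return period_ns * 2 * n

def total_delay_proposed(n, period_ns=5.072):
    """Total delay for n inputs at n + 1 cycles overall."""
    return period_ns * (n + 1)

def delay_reduction_percent(n):
    """Reduction in total delay of the proposed design relative to Elguibaly's."""
    old = total_delay_elguibaly(n)
    new = total_delay_proposed(n)
    return (old - new) / old * 100.0

print(f"{delay_reduction_percent(5):.2f} %")
```

Because the proposed design needs n+1 cycles instead of 2n, the reduction grows with the number of inputs and approaches about 57.5% for large n under these per-cycle delays.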
3.6 CONCLUSION:
FPGA and ASIC implementations of all the architectures were presented in
this chapter. RTL physical synthesis for the top-module chip was done using the RTL
Compiler tool. Placement and routing were done using the SOC Encounter tool in
180nm technology.
CHAPTER 4
SIMULATION RESULTS
4.1 Introduction:
This chapter gives the simulation results of all parallel MAC architectures.
In the above figure, the inputs are chosen randomly. Initially, the
multiplicand (x) and the multiplier (y) are given as 15 and 12 respectively. Z[15:0] is the
previous MAC result, and initially it is set to 0. The current MAC result is stored in p (p =
x * y + z), so 15 * 12 + 0 = 180 is stored in the p register. For the second clock cycle,
the values of x and y are chosen as -25 and 20 respectively. Now the previous MAC result, 180, is
stored in the z register, so p = -25 * 20 + 180 = -320 is stored in the p register. This
process repeats.
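The arithmetic in this walkthrough can be mirrored by a small behavioural model. The following is a minimal sketch in Python, not the authors' HDL; the class and helper names are hypothetical, and the 16-bit result width is taken from Z[15:0].

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    mask = (1 << bits) - 1
    value &= mask
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

class MacModel:
    """Behavioural model of p = x * y + z with a 16-bit result register."""

    def __init__(self, result_bits=16):
        self.result_bits = result_bits
        self.p = 0  # current MAC result, initially 0

    def step(self, x, y):
        z = self.p  # the previous result feeds back as z
        self.p = to_signed(x * y + z, self.result_bits)
        return self.p

mac = MacModel()
results = [mac.step(x, y) for x, y in [(15, 12), (-25, 20)]]
print(results)
```

The two steps give 180 and then -320, matching the values described for the waveform.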
Fig. 4.2 simulation waveform for the Elguibaly’s parallel MAC architecture
The same procedure applies to Elguibaly’s and the proposed parallel MAC
architectures as well.
Fig. 4.3 simulation waveform for the proposed parallel MAC architecture
4.3 Conclusion:
In this chapter simulation results for all the parallel MAC architectures are
presented.
CHAPTER 5
CONCLUSION AND FUTURE SCOPE
5.1 Conclusion
The MAC unit is proposed and designed by combining a hybrid-type CSA
structure and the Modified Booth’s Algorithm, using the Xilinx ISE Design Suite for FPGA
implementation and the Cadence Semi-Custom Design Suite for ASIC design in TSMC
180nm technology.
The MAC unit without pipelining is designed and implemented using the
standard, Elguibaly’s and proposed methods, with delays of 37.742 ns,
25.984 ns and 30.693 ns respectively on a Xilinx Spartan-3 FPGA. To improve the performance of the
MAC units, pipelining is applied. The pipelined MAC units using Elguibaly’s
algorithm and the proposed method have delays of 86.58 ns and 65.238 ns respectively for 5 inputs.
Hence the proposed MAC unit with pipelining is the faster design.
Elguibaly’s method and the proposed method with pipelining are extended to
ASIC design using the Cadence Semi-Custom Design Suite for TSMC
180nm technology, with delays of 59.73 ns and 30.43 ns respectively for 5 inputs.
The MAC unit with pipelining can be extended to n-stage pipelining, i.e., more
than 2-stage pipelining, to improve the performance parameters.
CHAPTER 6
REFERENCES
APPENDIX-A
FPGA DESIGN FLOW
A.1 Introduction:
An FPGA contains a two-dimensional array of logic blocks and interconnections
between the logic blocks. Both the logic blocks and the interconnects are programmable.
Logic blocks are programmed to implement a desired function, and the interconnects
are programmed using the switch boxes to connect the logic blocks.
To be more clear, if we want to implement a complex design (a CPU, for instance), then
the design is divided into small sub-functions and each sub-function is implemented
using one logic block. Now, to get the desired design (CPU), all the sub-functions
implemented in logic blocks must be connected, and this is done by programming the
interconnects.
The internal structure of an FPGA is depicted in the following Figure A-1.
FPGAs, an alternative to custom ICs, can be used to implement an entire
System on Chip (SOC). The main advantage of an FPGA is the ability to reprogram it.
The user can reprogram an FPGA to implement a design, and this is done after the FPGA
is manufactured; hence the name “Field Programmable.” Custom ICs are
expensive and take a long time to design, so they are useful when produced in bulk
amounts. FPGAs, in contrast, are easy to implement within a short time with the help of
Computer Aided Design (CAD) tools (because there is no physical layout process,
no mask making, and no IC manufacturing). Some disadvantages of FPGAs are that they
are slow compared to custom ICs, they cannot handle very complex designs, and
they draw more power.
A Xilinx logic block consists of one Look Up Table (LUT) and
one flip-flop. An LUT is used to implement a number of different functions. The
input lines to the logic block go into the LUT and enable it. The output of the LUT
gives the result of the logic function that it implements, and the output of the logic block is
the registered or unregistered output from the LUT. SRAM is used to implement a
LUT. A k-input logic function is implemented using a 2^k x 1 SRAM. The number of
different possible functions for a k-input LUT is 2^(2^k). The advantage of such an
architecture is that it supports the implementation of many logic functions; however,
the disadvantage is the unusually large number of memory cells required to implement
such a logic block when the number of inputs is large. Figure A-2 below shows a
4-input LUT based implementation of a logic block.
LUT based design provides better logic block utilization. A k-input LUT
based logic block can be implemented in a number of different ways, with a trade-off
between performance and logic density. An n-LUT can be viewed as a direct
implementation of a function truth table. Each of the latches holds the value of the
function corresponding to one input combination. For example, a 2-LUT can be used to
implement any of 16 functions, such as AND, OR, A + NOT B, etc.
A  B  AND  OR
0  0   0    0
0  1   0    1
1  0   0    1
1  1   1    1
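The truth-table view of a LUT can be illustrated with a few lines of code. The following is a minimal sketch in Python (a software model, not an FPGA primitive); the function names are chosen here for illustration.

```python
from itertools import product

def make_lut(func, k):
    """Fill a 2**k-entry table with func evaluated on every input combination."""
    return [func(*bits) for bits in product((0, 1), repeat=k)]

def lut_read(table, *inputs):
    """The inputs form the address; the stored bit is the function value."""
    address = 0
    for bit in inputs:
        address = (address << 1) | bit
    return table[address]

def count_functions(k):
    """Number of distinct k-input Boolean functions: 2 ** (2 ** k)."""
    return 2 ** (2 ** k)

and_lut = make_lut(lambda a, b: a & b, 2)
or_lut = make_lut(lambda a, b: a | b, 2)
a_or_not_b = make_lut(lambda a, b: a | (1 - b), 2)

print(and_lut, or_lut, a_or_not_b, count_functions(2))
```

Each table has 2^k entries (four for a 2-LUT), and changing the stored bits changes the implemented function without changing the read logic.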
A.2: Interconnects:
A wire segment can be described as two end points of an interconnect with no
programmable switch between them. A sequence of one or more wire segments in an
FPGA can be termed a track. Typically an FPGA has logic blocks, interconnects
and switch blocks (input/output blocks). Switch blocks lie on the periphery of the logic
blocks and interconnects. Wire segments are connected to logic blocks through switch
blocks. Depending on the required design, one logic block is connected to another and
so on. In this part of the tutorial we give a short introduction to the FPGA design flow.
A simplified version of the design flow is given in the following Figure A-3.
A.3: Design Entry:
There are different techniques for design entry: schematic based, Hardware
Description Language (HDL) based, a combination of both, etc. The selection of a method depends
on the design and the designer. If the designer wants to deal more with hardware, then
schematic entry is the better choice. When the design is complex or the designer
thinks of the design in an algorithmic way, then HDL is the better choice. Language
based entry is faster but lags in performance and density. HDLs represent a level of
abstraction that can isolate the designers from the details of the hardware
implementation. Schematic based entry gives designers much more visibility into the
hardware. It is the better choice for those who are hardware oriented. Another method,
though rarely used, is state machines. It is the better choice for designers who think of the
design as a series of states, but the tools for state machine entry are limited. In this
documentation we deal with HDL based design entry.
A.4: Synthesis:
This is the process which translates VHDL or Verilog code into a device netlist
format, i.e., a complete circuit with logical elements (gates, flip-flops, etc.) for the
design. If the design contains more than one sub-design (e.g., to implement a
processor, we need a CPU as one design element, RAM as another, and so on), then
the synthesis process generates a netlist for each design element. The synthesis process
checks the code syntax and analyzes the hierarchy of the design, which ensures that the
design is optimized for the design architecture the designer has selected. The
resulting netlist(s) is saved to an NGC (Native Generic Circuit) file (for Xilinx®
Synthesis Technology (XST)).
A.5: Implementation:
This process consists of a sequence of three steps:
1. Translate
2. Map
3. Place and Route
The Translate process combines all the input netlists and constraints into a logic
design file. This information is saved as an NGD (Native Generic Database) file. This
can be done using the NGDBuild program. Here, defining constraints means
assigning the ports in the design to the physical elements (e.g. pins, switches, buttons,
etc.) of the targeted device and specifying the timing requirements of the design. This
information is stored in a file named UCF (User Constraints File).
Tools used to create or modify the UCF are PACE, Constraint Editor, etc.
The Map process divides the whole circuit with logical elements into sub-blocks
such that they can fit into the FPGA logic blocks. That is, the Map process fits the
logic defined by the NGD file into the targeted FPGA elements (Configurable Logic
Blocks (CLB), Input Output Blocks (IOB)) and generates an NCD (Native Circuit
Description) file, which physically represents the design mapped to the components of
the FPGA. The MAP program is used for this purpose.
Place and Route: The PAR program is used for this process. The place and route
process places the sub-blocks from the Map process into logic blocks according to the
constraints and connects the logic blocks. For example, if a sub-block is placed in a logic block
which is very near to an IO pin, it may save time but it may affect some other
constraint. So the trade-off between all the constraints is taken into account by the place and
route process. The PAR tool takes the mapped NCD file as input and produces a
completely routed NCD file as output. The output NCD file contains the routing
information.
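The three implementation steps form a chain in which each program consumes the file produced by the previous one. The following is a minimal sketch in Python that only models this hand-off as data; it does not invoke any Xilinx tool, and the chaining check is an illustration added here.

```python
# Stage name, program named in the text, input files, output file.
# The file types come from the text; the chaining check is illustrative.
FLOW = [
    ("Translate", "NGDBuild", ["NGC", "UCF"], "NGD"),
    ("Map", "MAP", ["NGD"], "NCD"),
    ("Place and Route", "PAR", ["NCD"], "NCD (routed)"),
]

def check_chain(flow):
    """Verify that each stage consumes the file type produced by the previous one."""
    for (_, _, _, produced), (name, _, consumed, _) in zip(flow, flow[1:]):
        if produced.split()[0] not in consumed:
            raise ValueError(f"{name} does not consume {produced}")
    return True

for name, program, inputs, output in FLOW:
    print(f"{name:16s} {program:9s} {' + '.join(inputs):10s} -> {output}")
print("chain ok:", check_chain(FLOW))
```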
either VHDL or Verilog designs. In this process, signals and variables are observed,
procedures and functions are traced and breakpoints are set. This is a very fast
simulation and so allows the designer to change the HDL code within a short time period
if the required functionality is not met. Since the design is not yet
synthesized to gate level, timing and resource usage properties are still unknown.
Static Timing Analysis: This can be done after the MAP or PAR processes. The post MAP
timing report lists signal path delays of the design derived from the design logic. The post
Place and Route timing report incorporates timing delay information to provide a
comprehensive timing summary.
APPENDIX-B
SEMICUSTOM DESIGN INPUT FILES
Pad: p01 N
Pad: p02 N
Pad: p03 N
Pad: p04 N
Pad: p05 N
Pad: p06 N
Pad: p07 N
Pad: p08 N
Pad: p09 N
Pad: c02 NE
Pad: p10 E
Pad: p11 E
Pad: p12 E
Pad: p13 E
Pad: p14 E
Pad: p15 E
Pad: p16 E
Pad: p17 E
Pad: p18 E
Pad: c03 SE
Pad: p19 S
Pad: p20 S
Pad: p21 S
Pad: p22 S
Pad: p23 S
Pad: p24 S
Pad: p25 S
Pad: p26 S
Pad: p27 S
Pad: c04 SW
Pad: p28 W
Pad: p29 W
Pad: p30 W
Pad: p31 W
Pad: p32 W
Pad: p33 W
Pad: p34 W
Pad: p35 W
Pad: p36 W
B.2.2 Configuration File
global rda_Input
set cwd lpwd
set rda_Input(ui_netlist) {./proposed_chip_enc.v}
set rda_Input(ui_netlisttype) {Verilog}
set rda_Input(ui_settop) {1}
set rda_Input(ui_topcell) {proposed_chip}
set rda_Input(ui_timelib) {/home/vlsi/RCLAB/library/typical.lib
/home/vlsi/RCLAB/library/tpz973gtc.lib}
set rda_Input(ui_timingcon_file) {./proposed_chip_enc.sdc}
set rda_Input(ui_buf_footprint) {BUFX1}
set rda_Input(ui_inv_footprint) {INVX1}
set rda_Input(ui_leffile) {/home/vlsi/RCLAB/library/all.lef
/home/vlsi/RCLAB/library/tpz973g_6lm.lef
/home/vlsi/RCLAB/library/tsmc18_6lm_tech.lef}
set rda_Input(ui_cts_cell_list) {CLKBUFX20 CLKBUFXL CLKBUFX1
CLKBUFX2 CLKBUFX3 CLKINVX1 CLKINVX2 CLKINVX12 CLKINVX3
CLKINVX4}
set rda_Input(ui_core_cntl) {aspect}
set rda_Input(ui_aspect_ratio) {1.0000}
set rda_Input(ui_captbl_file) {/home/vlsi/RCLAB/library/t018s6mlv.capTbl}
set rda_Input(ui_defcap_scale) {1.0}
set rda_Input(ui_res_scale) {1.0}
set rda_Input(ui_shr_scale) {1.0}
set rda_Input(assign_buffer) {1}
set rda_Input(ui_gen_footprint) {1}
B.2.3 SDC File
####################################################################
# Created by Encounter(R) RTL Compiler v08.10-p104_1 on Mon Jul 11 13:26:33 IST 2011
#
####################################################################
set sdc_version 1.7
set_units -capacitance 1000.0fF
set_units -time 1000.0ps
# Set the current design
current_design proposed_chip
create_clock -name "pclk" -add -period 6.0 -waveform {0.0 3.0} [get_ports pclk]
set_clock_gating_check -setup 0.0
set_input_delay -clock [get_clocks pclk] -add_delay 0.1 [get_ports preset]
set_input_delay -clock [get_clocks pclk] -add_delay 0.1 [get_ports pclk]
set_input_delay -clock [get_clocks pclk] -add_delay 0.1 [get_ports {py[0]}]
set_input_delay -clock [get_clocks pclk] -add_delay 0.1 [get_ports {py[1]}]
set_input_delay -clock [get_clocks pclk] -add_delay 0.1 [get_ports {py[2]}]
set_input_delay -clock [get_clocks pclk] -add_delay 0.1 [get_ports {py[3]}]
set_input_delay -clock [get_clocks pclk] -add_delay 0.1 [get_ports {py[4]}]
set_input_delay -clock [get_clocks pclk] -add_delay 0.1 [get_ports {py[5]}]
set_input_delay -clock [get_clocks pclk] -add_delay 0.1 [get_ports {py[6]}]
set_input_delay -clock [get_clocks pclk] -add_delay 0.1 [get_ports {py[7]}]
set_input_delay -clock [get_clocks pclk] -add_delay 0.1 [get_ports {px[0]}]
set_input_delay -clock [get_clocks pclk] -add_delay 0.1 [get_ports {px[1]}]
set_input_delay -clock [get_clocks pclk] -add_delay 0.1 [get_ports {px[2]}]
set_input_delay -clock [get_clocks pclk] -add_delay 0.1 [get_ports {px[3]}]
set_input_delay -clock [get_clocks pclk] -add_delay 0.1 [get_ports {px[4]}]
set_input_delay -clock [get_clocks pclk] -add_delay 0.1 [get_ports {px[5]}]
set_input_delay -clock [get_clocks pclk] -add_delay 0.1 [get_ports {px[6]}]
set_input_delay -clock [get_clocks pclk] -add_delay 0.1 [get_ports {px[7]}]
set_output_delay -clock [get_clocks pclk] -add_delay 0.1 [get_ports {pp[0]}]
set_output_delay -clock [get_clocks pclk] -add_delay 0.1 [get_ports {pp[1]}]
set_output_delay -clock [get_clocks pclk] -add_delay 0.1 [get_ports {pp[2]}]
set_output_delay -clock [get_clocks pclk] -add_delay 0.1 [get_ports {pp[3]}]
set_output_delay -clock [get_clocks pclk] -add_delay 0.1 [get_ports {pp[4]}]
set_output_delay -clock [get_clocks pclk] -add_delay 0.1 [get_ports {pp[5]}]
set_output_delay -clock [get_clocks pclk] -add_delay 0.1 [get_ports {pp[6]}]
set_output_delay -clock [get_clocks pclk] -add_delay 0.1 [get_ports {pp[7]}]
set_output_delay -clock [get_clocks pclk] -add_delay 0.1 [get_ports {pp[8]}]
set_output_delay -clock [get_clocks pclk] -add_delay 0.1 [get_ports {pp[9]}]
set_output_delay -clock [get_clocks pclk] -add_delay 0.1 [get_ports {pp[10]}]
set_output_delay -clock [get_clocks pclk] -add_delay 0.1 [get_ports {pp[11]}]
set_output_delay -clock [get_clocks pclk] -add_delay 0.1 [get_ports {pp[12]}]
set_output_delay -clock [get_clocks pclk] -add_delay 0.1 [get_ports {pp[13]}]
set_output_delay -clock [get_clocks pclk] -add_delay 0.1 [get_ports {pp[14]}]
set_output_delay -clock [get_clocks pclk] -add_delay 0.1 [get_ports {pp[15]}]
set_wire_load_selection_group "WireAreaCon" -library "tpz973gtc"
set_dont_use [get_lib_cells typical/RF1R1WX2]
set_dont_use [get_lib_cells typical/RF2R1WX2]
set_dont_use [get_lib_cells typical/RFRDX1]
set_dont_use [get_lib_cells typical/RFRDX2]
set_dont_use [get_lib_cells typical/RFRDX4]
set_dont_use [get_lib_cells typical/TIEHI]
set_dont_use [get_lib_cells typical/TIELO]
set_dont_use [get_lib_cells tpz973gtc/PVDD2DGZ]
set_dont_use [get_lib_cells tpz973gtc/PVSS2DGZ]
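The per-bit delay constraints in this listing follow a regular pattern, so they can be generated rather than typed by hand. The following is a minimal sketch in Python (a helper script, not part of RTL Compiler); the function name is hypothetical, and the clock name, delay value and bus widths are taken from the listing above.

```python
def bus_delay_lines(command, clock, delay, bus, width):
    """Emit one SDC delay line per bit of a bus port."""
    return [
        f"{command} -clock [get_clocks {clock}] -add_delay {delay} "
        f"[get_ports {{{bus}[{bit}]}}]"
        for bit in range(width)
    ]

lines = []
lines += bus_delay_lines("set_input_delay", "pclk", 0.1, "py", 8)
lines += bus_delay_lines("set_input_delay", "pclk", 0.1, "px", 8)
lines += bus_delay_lines("set_output_delay", "pclk", 0.1, "pp", 16)
print("\n".join(lines))
```

The scalar ports (preset and pclk) are not buses and would still be written individually.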
B.2.4 MODE File:
#####################################################################
# First Encounter mode file
# Created by Encounter(R) RTL Compiler on 07/11/11 13:26:34
#####################################################################
# General Mode Settings
###########################################################
if {[enc_version] >= 7.1} {
setAnalysisMode -asyncChecks noAsync
} else {
setAnalysisMode -noAsync
}
set_global timing_apply_default_primary_input_assertion false
set_global timing_clock_phase_propagation both
if {[enc_version] >= 7.1} {
setAnalysisMode -multipleClockPerRegister true
} else {
setAnalysisMode -multipleClockPerRegister
}
if {[enc_version] >= 7.1} {
setPlaceMode -reorderScan false
} else {
setPlaceMode -noReorderScan
}
if {[enc_version] >= 7.1} {
setExtractRCMode -engine default
} else {
setExtractRCMode -default
}