
2012 Third International Conference on Intelligent Systems Modelling and Simulation

Design and Implementation of Automated Wave-Pipelined Circuit using ASIC


Rengaprabhu Paramasivam$, Venkatasubramanian Adhinarayanan*, and Seetharaman Gopalakrishnan#

$ Research Scholar, Dept. of Info. and Comm. Engineering, Anna University of Technology, Tiruchirappalli, India
* Research Scholar, Department of CSE, Sathyabama University, Chennai, India
# Principal, Oxford Engineering College, Tiruchirappalli, India

Abstract

Wave-pipelining enables a digital circuit to be operated at a higher frequency. In the literature, only trial-and-error and manual procedures are adopted for choosing the optimum clock period and the clock skew between the I/O registers of wave-pipelined circuits. The major contribution of this paper is a proposal for automating this procedure for the ASIC implementation of wave-pipelined circuits using a built-in self-test (BIST) approach. The proposal is studied on a multiplier using dedicated AND gates, implemented with three different schemes: wave-pipelining, pipelining and non-pipelining. From the implementation results, it is verified that the wave-pipelined multipliers are faster by a factor of 1.08 compared to the non-pipelined multipliers. The wave-pipelined multiplier dissipates less power, by a factor of 1.43, than the pipelined multiplier.

Keywords: pipelining; wave-pipelining; FPGA; self-testing; ASIC.

I. INTRODUCTION

Hardware components in a SOC may include one or more processors, memories, dedicated components for accelerating critical tasks and interfaces to various peripherals [1]. The increased gate count in a complex SOC results in increased power dissipation, clock routing complexity and clock skews between different parts of a synchronous system. These limitations may be partially overcome by adopting circuit design techniques such as wave-pipelining. Wave-pipelining enables a combinational logic circuit to be operated at a higher frequency without the use of internal registers and may result in lower power dissipation and clock routing complexity compared to a pipelined circuit. However, maximizing the operating speed of the wave-pipelined circuit requires three tasks: adjustment of the clock period, adjustment of the clock skew and equalization of path delays. The automation of these three tasks is proposed for the first time in this paper.

The effectiveness of the automation scheme is studied on a multiplier using dedicated AND gates as well as fast carry logic. The rest of the paper is organized as follows. In Section II, the previous work related to wave-pipelining and the challenges involved in the design of wave-pipelined circuits are described. In Section III, automation schemes for wave-pipelined circuits are presented. In Section IV, an overview of the multiplier using dedicated AND gates as well as fast carry logic is presented. In Section V, BIST approaches for the implementation of wave-pipelined circuits are discussed and the implementation results obtained using Cadence tools are presented. Section VI summarizes the conclusions.

II. REVIEW OF PREVIOUS WORK

Wave-pipelining has been employed for implementing a number of systems on both ASICs and FPGAs [2]. The concept of wave-pipelining has been described in a number of previous works [3], [4]. Wave-pipelining is proposed to improve logic utilization by minimizing the idle time and to allow the digital circuit to operate at its maximal rate. Fig. 1 shows a typical wave-pipelined circuit along with the input and output registers and the clocking circuit, which includes the clock generation and clock skewing circuits.

Fig. 1. Wave-pipelined circuit

An RTL model of a circuit consists of a combinational logic circuit placed between the input and output registers. The combinational logic circuit may be considered to be a wave-pipelined circuit if a number of waves are made to propagate through it simultaneously. In other words, at any point of time, a sequence of data is being processed in the combinational logic block. In the case of pipelining, only one data item is processed in the combinational logic block at a time. Further, the maximum data rate in the pipelined circuit depends only on Dmax, the maximum propagation delay in the combinational logic block. If Dmin denotes the minimum propagation delay of the signal through the combinational logic block, the maximum data rate of the wave-pipelined circuit depends on (Dmax − Dmin). Traditionally, in a wave-pipelined circuit, higher speeds are achieved by equalizing Dmax and Dmin. The output of the wave-pipelined circuit alternates between unstable and stable states. The stable

978-0-7695-4668-1/12 $26.00 © 2012 IEEE. DOI 10.1109/ISMS.2012.92


period decreases with the increase in the logic depth. By adjusting the latching instant at the output register to lie in the stable period, the wave-pipelined circuit can be made to work properly. However, for large logic depths, there may not be any stable period, so adjusting the latching instant by itself may not be adequate for storing the correct result at the output register. In such cases, the clock period has to be increased to lengthen the stable period.

In [5], an automation technique for tuning the clock period is proposed. This technique starts from the system clock: it first applies the system clock to the circuit and, if the circuit does not work with it, doubles the period and applies the new clock. This doubling is continued until the circuit works properly. However, this approach cannot ensure that the circuit works at the highest possible frequency, either because the highest operating frequency of the circuit is greater than the system clock frequency or because it is not exactly equal to the system clock frequency divided by 2^n, where n is an integer.

Equalization of path delays, adjustment of the clock period and adjustment of the clock skew are the three tasks carried out for maximizing the operating speed of the wave-pipelined circuit. All three tasks require the delays to be measured and, if required, altered. These tasks are carried out manually in [6], [7]. Automation of these three tasks is considered in the next section.

III. BIST APPROACH FOR WAVE-PIPELINED CIRCUIT

The self-tuned wave-pipelined circuit is obtained by including a BIST circuit that tunes the clock frequency and the clock skew. The block diagram of the self-tuned wave-pipelined circuit is shown in Fig. 2. It consists of the following functional blocks: PRSG block, PRBS sequence generator, signature analyzer, counter, programmable clock generator circuit, programmable skew generator circuit and FSM.
A self-tuned wave-pipelined circuit has two modes of operation, namely test mode and normal mode. The TM signal is used to select the mode of operation. The circuit is placed in test mode by setting TM to 1. In test mode, the FSM first gives CS = 0 and SS = 0 to the clock generator and skew generator circuits respectively. The programmable clock generator circuit then generates the first clock, and this clock is given to the PRSG circuit, the programmable skew generator circuit and the input register. The PRSG block is used for exhaustive testing: it generates all 2^n combinations of the inputs for an n-bit input. The programmable skew generator circuit generates the skew, and the skewed clock is applied to the output register and the counter circuit. The counter keeps track of the number of test vectors fed to the combinational block and generates the enable signal (sig_en) after all the test vectors have been applied. Instead of comparing every output with the expected output, a signature is generated from the outputs corresponding to all the applied inputs using the PRBS generator, and it is compared with the stored value in the signature analyzer circuit. The signature analyzer gives two control signals (sig_in and chng) to the FSM block, which indicate whether or not there is a match.

Depending upon the control signals received from the signature analyzer, the FSM generates the CS and SS values for the clock and skew generator circuits. If there is no match, the FSM steps the SS value from 0 to 7 for the current CS value. If there is still no match after all the skews have been applied for a particular CS value, it changes the CS value. In this way, the FSM changes the CS and SS values until it finds a match. When a match is found, the FSM fixes the CS and SS values and the circuit is placed in normal mode by setting TM = 0. In normal mode, user inputs can be applied.
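The tuning procedure described above can be sketched in software as a nested search over the clock select (CS) and skew select (SS) values, with a signature comparison deciding each trial. The sketch below is only illustrative and rests on several assumptions that are not in the paper: the function names, the 8-bit signature register and its tap positions, the number of CS settings, the ordering of CS from shortest to longest clock period, and the toy timing rule (a trial passes when a latching edge falls between Dmax and the arrival of the next wave) are all hypothetical. A failing trial is modelled crudely by corrupting one captured word.

```python
def signature(words, width=8, taps=(8, 6, 5, 4)):
    """Compact a stream of output words into one signature (MISR-style sketch)."""
    mask = (1 << width) - 1
    sig = 0
    for word in words:
        fb = 0
        for t in taps:                      # feedback = XOR of the tapped bits
            fb ^= (sig >> (t - 1)) & 1
        sig = (((sig << 1) | fb) ^ word) & mask
    return sig


def make_toy_circuit(n_bits=4, d_min=3, d_max=9, periods=(4, 6, 8, 10)):
    """Hypothetical stand-in for the hardware: an n_bits x n_bits multiplier
    whose output is latched correctly only inside the stable window."""
    def run_test(cs, ss):
        period, skew = periods[cs], ss      # assumed: one delay unit per SS step
        # Time from Dmax to the first latching edge at or after Dmax.
        offset = (skew - d_max) % period
        # Stable window lasts from Dmax until the next wave arrives.
        ok = offset < period + d_min - d_max
        # Exhaustive test: all 2^n input combinations, as the PRSG would apply.
        outs = [a * b for a in range(1 << n_bits) for b in range(1 << n_bits)]
        if not ok:
            outs[-1] ^= 1                   # crude fault model: one corrupted word
        return outs
    return run_test


def self_tune(run_test, golden, n_cs, n_ss=8):
    """Step SS through 0..n_ss-1 for each CS until the signature matches."""
    for cs in range(n_cs):
        for ss in range(n_ss):
            if signature(run_test(cs, ss)) == golden:
                return cs, ss               # match: fix CS/SS, leave test mode
    return None                             # no working setting found


if __name__ == "__main__":
    golden = signature(a * b for a in range(16) for b in range(16))
    print(self_tune(make_toy_circuit(), golden, n_cs=4))
```

Because the search stops at the first match, the selected setting is the fastest working one only if the CS values are ordered from the shortest to the longest clock period, which is assumed here.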

Fig. 2. Self-tuned wave-pipelined circuit.

A. Procedure for adjusting the clock period and skew

The adjustment of the clock skew and clock period can be automated by making the clock generator and the clock skew generator programmable. Fig. 3 gives the circuit diagram of a clock generation scheme which consists of a delay block and an inverter. The actual clock period depends on the interconnect delay. The select input of the multiplexer is varied by either a processor or a Finite State Machine (FSM) to achieve different clock frequencies. The same circuit is used for the clock skew generator, but the feedback connection is removed and the select line is varied by the processor or FSM to achieve different clock skews. The wave-pipelined circuit using the programmable clock and skew generators can be operated at a higher frequency than can be achieved using commercially available synthesis tools, which use Dmax for fixing the operating frequency. The automation may be carried out using either an off-chip processor or an on-chip processor. The off-chip processor is used when the FPGA serves as a coprocessor or hardware accelerator for a main processor or microcontroller.


Fig. 3. Programmable clock generator.

Since off-chip communication between the FPGA and a processor is bound to be slower than on-chip communication, the built-in self-test approach using the design-for-testability technique [8] is proposed for this case [9], in order to minimize the time required for adjusting the parameters of the wave-pipelined circuit (clock frequency and skew).

IV. DESIGN OF MULTIPLIER UNIT

Digital signal processing is used for a variety of applications such as frequency selective filters (low pass, band pass, high pass and band elimination types), adaptive filters, equalizers, block matching algorithms for motion estimation, computation of transforms (discrete Fourier transform (DFT), DCT, DWT etc.), vector quantization for image processing and compression, the Viterbi algorithm and dynamic programming, decimators and expanders, wavelets and filter banks for multirate signal processing, and modems. In all these applications, multipliers are used as one of the fundamental blocks. In other words, it is almost impossible to do any DSP operation without using multipliers. Most DSP algorithms require multiplication and addition in real time. The unit carrying out this function is called a multiply-accumulate (MAC) unit. Programmable DSP (P-DSP) chips have only a limited number of MAC units, so DSP operations such as convolution have to be done serially, and hence P-DSPs are not suited for high speed applications.

ASICs can have multiple dedicated MACs and can perform DSP functions in parallel. However, their high cost for low volume production and the inability to make design modifications after production make them less attractive [10]. In view of this, FPGAs are popular for applications requiring a large number of MAC units. A variety of choices exist for implementing the multipliers. Either the LUT or the embedded AND and XOR block can be used for implementing the multiplier. In addition, advanced FPGAs also contain embedded multipliers which can be configured as either 18×18 or 9×9 multipliers.

An Overview of the Multiplier Schemes

The multiplication of two unsigned numbers A and B creates the product

\[ P = A \cdot B \tag{5.1} \]

where A is called the multiplicand and B the multiplier. Given that A is an m-bit and B an n-bit positive whole number, the numeric representation of the product P requires (m+n) bits. The value of the product P results from the product of the values of the two operands A and B, i.e.

\[ V(P) = V(A)\, V(B) \tag{5.2} \]

Substituting the value of B,

\[ V(B) = \sum_{j=0}^{n-1} b_j 2^j \tag{5.3} \]

the product can be represented as a function of the individual bits b_j of the multiplier:

\[ V(P) = V(A) \sum_{j=0}^{n-1} b_j 2^j \tag{5.4} \]

\[ V(P) = \sum_{j=0}^{n-1} V(A)\, b_j 2^j \tag{5.5} \]

According to equation (5.4), partial products are formed by the product between A and the bits b_j of B, and shifted versions of the partial products are added to get the product P. The partial product V(A)b_j is 0 for b_j = 0 and V(A) otherwise. This method is used for computation by hand and can be used similarly for hardware implementations [Peter Pirsch 1996]. For example, sequential multiplication using arithmetic logic units (ALUs) is implemented in this fashion.
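As a minimal illustration of the shift-and-add scheme of equation (5.5), the sketch below accumulates the shifted partial products one multiplier bit at a time. The function name and argument order are assumptions made for this example only; they do not come from the paper.

```python
def shift_add_multiply(a, b, n):
    """Multiply unsigned a by the n-bit unsigned b using shifted partial products."""
    product = 0
    for j in range(n):
        b_j = (b >> j) & 1            # j-th bit of the multiplier B
        partial = a if b_j else 0     # partial product V(A)*b_j: either 0 or V(A)
        product += partial << j       # weight the partial product by 2^j
    return product


if __name__ == "__main__":
    print(shift_add_multiply(13, 11, 4))
```

A sequential hardware multiplier follows the same loop, with one adder reused over n clock cycles instead of a software accumulator.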

The product P can be represented as the sum of partial products at bit level by splitting the number A into its bits a_i:

\[ V(P) = \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} (a_i b_j)\, 2^{i+j} \tag{5.6} \]

The partial products that occur and their summation for the creation of the product bits p_k are illustrated for m = n = 4 in Fig. 4.
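The bit-level form of equation (5.6) can be checked with a short sketch that groups the AND terms a_i b_j by their weight 2^(i+j), which is how the partial products line up in columns for the m = n = 4 case. The helper names below are hypothetical and serve only this illustration.

```python
def partial_product_columns(a, b, m, n):
    """Group the bit products a_i AND b_j into columns of equal weight i + j."""
    columns = [[] for _ in range(m + n - 1)]
    for i in range(m):
        for j in range(n):
            a_i = (a >> i) & 1
            b_j = (b >> j) & 1
            columns[i + j].append(a_i & b_j)   # one AND gate per bit pair
    return columns


def product_from_columns(columns):
    """Sum each column and weight it by 2^k, following equation (5.6)."""
    return sum(sum(col) << k for k, col in enumerate(columns))


if __name__ == "__main__":
    cols = partial_product_columns(13, 11, 4, 4)
    print([len(col) for col in cols])          # number of terms per column
    print(product_from_columns(cols))
```

The column heights grow from one term to four and shrink again, which is why the middle product bits need the most addition and set the longest delay path through the array.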


Fig. 4. Partial products and product bits of a 4×4 multiplier.

The effectiveness of the proposed approach is studied by implementing self-tuned wave-pipelined circuits with a multiplier as the combinational logic block; the multiplier using dedicated AND gates is shown in Fig. 5.

The implementation of multipliers is considered first. Xilinx FPGAs such as Spartan-II and Virtex devices and Altera FPGAs such as APEX and Cyclone II devices have fast carry logic and a dedicated AND gate for each of the Look Up Tables (LUTs) in the Slices/Logic Elements (LEs). Since multiplying an N-bit number by a 2-bit number requires only AND gates and adders, fast N×2 multipliers can be implemented using this dedicated hardware [10 & 11]. To implement an N×4 multiplier, the outputs of two N×2 multipliers have to be added [12]. To implement an N×M multiplier, the outputs of log2M N×2 multipliers have to be added, two at a time in parallel, in log2M stages appropriately. The circuit diagram of the 10×10 multiplier using dedicated AND gates and fast carry logic is shown in Fig. 5. The dotted lines indicate points where registers may be inserted for pipelining. For wave-pipelining, all the stages are directly connected without registers; registers are used only at the inputs and outputs.

Fig. 5. 10×10 multiplier using dedicated AND gate and fast carry logic.

V. IMPLEMENTATION OF SELF TUNED WAVE-PIPELINED CIRCUITS USING BIST APPROACH

The multiplier using dedicated AND gates is implemented with Mentor tools using the BIST approach.

A. Implementation results on multiplier using Mentor tools

The self-tuned wave-pipelined circuit in Fig. 2 is studied using a 10×10 multiplier as the combinational logic block. These circuits are implemented as ASICs in 180 nm technology. Verilog HDL is used to describe the functionality of the circuit, and the functionality is then verified using the ModelSim simulation tool. Leospectrum is used for synthesizing the circuit.

Table 1. Implementation results of multipliers using dedicated AND gate and fast carry logic.

Scheme             Area    Freq. (MHz)
Pipelining         1623    343.4
Non-pipelining      916    197.1
Wave-pipelining    1431    203.2

From Table 1, it may be noted that the wave-pipelined multipliers are faster by a factor of 1.03 compared to the non-pipelined multipliers. The pipelined multipliers are faster by a factor of 1.69 compared to the wave-pipelined multiplier.

B. Implementation results on multiplier using Cadence tools

The self-tuned wave-pipelined circuit in Fig. 2 is studied using a 10×10 multiplier as the combinational logic block. These circuits are implemented as ASICs using the slow normal library. Verilog HDL is used to describe the functionality of the circuit, and the functionality is then verified using Cadence tools. The NC simulator is used for simulating the design, and it has been implemented with RTL Encounter. The results are shown in Table 2.

Table 2. Implementation results of multipliers using dedicated AND gate and fast carry logic.

Scheme             Area (cells)    Freq. (MHz)    Power (nW)
Pipelining         280             277.31         75,120.345
Non-pipelining     240             204.5          49,102.248
Wave-pipelining    364             221.15         52,354.821

From Table 2, it may be noted that the wave-pipelined multipliers are faster by a factor of 1.08 compared to the non-pipelined multipliers. The wave-pipelined multiplier dissipates less power, by a factor of 1.43, than the pipelined multiplier. The pipelined multipliers are faster by a factor of 1.25 compared to the wave-pipelined multiplier. The final tape-out (layout) diagram is shown in Fig. 6.
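The speed and power factors quoted from Tables 1 and 2 are simple ratios of the tabulated values, and they can be recomputed with the short sketch below. The numbers are copied from the two tables; the dictionary layout and function names are assumptions made for this illustration.

```python
# Frequencies in MHz, copied from Tables 1 and 2.
FREQ_MHZ = {
    "table_1": {"pipelined": 343.4, "non_pipelined": 197.1, "wave_pipelined": 203.2},
    "table_2": {"pipelined": 277.31, "non_pipelined": 204.5, "wave_pipelined": 221.15},
}

# Power in nW, copied from Table 2.
POWER_NW = {"pipelined": 75120.345, "non_pipelined": 49102.248, "wave_pipelined": 52354.821}


def ratio(numerator, denominator):
    """Ratio of two table entries, rounded to two decimals."""
    return round(numerator / denominator, 2)


def speed_factors(freq):
    """Speed-up of wave-pipelining over non-pipelining, and of pipelining over wave-pipelining."""
    return {
        "wave_vs_non": ratio(freq["wave_pipelined"], freq["non_pipelined"]),
        "pipe_vs_wave": ratio(freq["pipelined"], freq["wave_pipelined"]),
    }


if __name__ == "__main__":
    for name, freq in FREQ_MHZ.items():
        print(name, speed_factors(freq))
    print("power, pipelined vs wave-pipelined:",
          ratio(POWER_NW["pipelined"], POWER_NW["wave_pipelined"]))
```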

Fig. 6. Tape-out diagram of multiplier using dedicated AND gate and fast carry logic.

VI. CONCLUSION

The proposed automation scheme is implemented for the wave-pipelined circuit and is tested using multipliers with dedicated AND gates as well as fast carry logic. From the implementation results, it is verified that the wave-pipelined multipliers are faster by a factor of 1.08 compared to the non-pipelined multipliers. The wave-pipelined multiplier dissipates less power, by a factor of 1.43, than the pipelined multiplier. The pipelined multipliers are faster by a factor of 1.25 compared to the wave-pipelined multiplier.

REFERENCES

[1] Flavio R. Wagner, Wander O. Cesario, Luigi Carro and Ahmed A. Jerraya, Strategies for the integration of hardware and software IP components in embedded systems-on-chip, Elsevier Integration, The VLSI Journal, pp. 1-31, Nov. 2003.
[2] J. Nyathi and J. G. Delgado-Frias, A hybrid wave pipelined network router, IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, vol. 49, no. 12, pp. 1764-1772, Dec. 2002.
[3] O. Hauck, A. Katoch and S. A. Huss, VLSI system design using asynchronous wave pipelines: a 0.35 µm CMOS 1.5 GHz elliptic curve public key cryptosystem chip, Proc. of Sixth Intl. Symposium on Advanced Research in Asynchronous Circuits and Systems (ASYNC 2000), pp. 188-197, April 2000.
[4] W. P. Burleson, M. Ciesielski, F. Klass, and Liu, Wave-pipelining: a tutorial and research survey,
[5] IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 6, no. 3, pp. 464-474, Sep. 1998.
[6] WooKim, YongKim, Automating Wave-pipelined Circuit Design, IEEE Design & Test of Computers, vol. 20, Nov. 2003.
[7] C. Thomas Gray, W. Liu and R. Cavin, Wave Pipelining: Theory and Implementation, Kluwer Academic Publishers, 1993.
[8] E. I. Boemo, S. Lopez-Buedo and J. M. Meneses, Wave pipelines via look-up tables, IEEE International Symposium on Circuits and Systems (ISCAS '96), vol. 4, pp. 185-188, 1996.
[9] M. J. S. Smith, Application Specific Integrated Circuits, Pearson Education Asia Pvt. Ltd, Singapore, 2003.
[10] G. Seetharaman, B. Venkataramani and G. Lakshminarayanan, Design and FPGA implementation of self-tuned wave-pipelined filters, IETE Journal of Research, vol. 52, no. 4, pp. 305-313, July-August 2006.
[11] Steven Elzinga, Jeffrey Lin and Vinita Singhal, Design Tips for HDL Implementation of Arithmetic Functions, Xilinx Application Notes, XAPP-215 (v1.0), June 2000.
[12] Altera documentation library, Altera Corporation, USA, 2003.
[13] Xilinx documentation library, Xilinx Corporation, USA.
[14] U. Meyer-Baese, Digital Signal Processing with FPGAs, Springer Verlag, 2001.
