
International Journal of Advanced Computer Science, Vol. 2, No. 3, Pp. 120-124, Mar. 2012.

Power and Performance Optimization for De-synchronized Circuits


Wei Shi, Hongguang Ren, Zhiying Wang, Qiang Dou, & Li Luo
Manuscript received: 20 Jun. 2011; revised: 21 Oct. 2011; accepted: 1 Mar. 2012; published: 15 Apr. 2012

Keywords
de-synchronized circuits, design flow, power reduction, performance improvement, multiplier

Abstract: The de-synchronization methodology, which directly converts a synchronous circuit into an asynchronous counterpart according to the physical structure of its pipelines, is popular for its simplicity. However, this simplicity also introduces power redundancy and performance loss into de-synchronized circuits. This paper first investigates the influence of the actual operations and operands on de-synchronized circuits, and then proposes an improved de-synchronization flow to resolve the power and performance problems of conventional de-synchronized circuits. Finally, three specific schemes, namely early completion, decoupling and delay element optimization, are employed to optimize a traditional de-synchronized multiplier. Compared to the traditional de-synchronized multiplier, early completion achieves up to 72% power reduction and 54% performance improvement, while the power saving from decoupling is about 20%-25% and the performance improvement from optimizing delay elements is about 11%.

1. Introduction
As CMOS technology enters the deep submicron era, the richness of computational resources brings many problems, such as complex clock distribution, large clock skew and high power dissipation. As an alternative, asynchronous circuits have become significantly attractive. In asynchronous circuits the periodic global clock is replaced by local handshake signals, which naturally eliminates clock-related problems. Furthermore, asynchronous circuits work on demand and stop when not needed. As such, asynchronous circuits are an attractive option for power-efficient design. However, the design of asynchronous circuits is much more complicated.
This work was supported by the National Natural Science Foundation of China under grants 60873015 and 60903039. Wei Shi, Hongguang Ren, Zhiying Wang, Qiang Dou, & Li Luo are with the School of Computer, National University of Defense Technology (shi.wei@nudt.edu.cn; renn@nudt.edu.cn; zy.wang@nudt.edu.cn; q.dou@nudt.edu.cn; luo.li@nudt.edu.cn).

The de-synchronization methodology [1], which directly converts a synchronous circuit into an asynchronous counterpart, is very popular for its simplicity. In the de-synchronization flow, the control path is generated directly from the structure of the data path, and the data path remains almost the same as in the original synchronous design. This approach reduces the complexity of asynchronous circuit design. On the other hand, the resulting asynchronous circuits are not adequately optimized, because the power and performance of asynchronous circuits are also heavily influenced by the actual operands and operations performed in the circuit. As a result, many redundant operations are introduced, leading to wasted power; extra time may also be spent on these redundant operations. In fact, the operands and operations performed in circuits have often been exploited to optimize the power and performance of asynchronous circuits, and research on asynchronous multipliers is a good example. In the past decades, many asynchronous multipliers have been designed for low-power and high-performance purposes [2]-[4]. Two main kinds of methods are used: the first reduces redundant operations in the process of multiplication, while the second optimizes each stage or part of the multiplier. To reduce redundant operations, data-dependent and operation-dependent attributes are often exploited; after unnecessary operations are eliminated, the multiplication can terminate early and the power of the corresponding operations is also removed. To optimize each stage, new circuit logic and structures, such as split registers [2] and DCVSL (Differential Cascode Voltage Swing Logic) [4], may be adopted.

This paper first investigates the influence of the actual operations and operands on de-synchronized circuits, and then proposes an improved de-synchronization flow to resolve the power and performance problems of conventional de-synchronized circuits. Finally, three specific schemes, namely early completion, decoupling and delay element optimization, are employed to optimize a traditional de-synchronized multiplier.

2. Background
A. De-synchronization Methodology
The de-synchronization methodology converts synchronous circuits into asynchronous ones using existing EDA tools and flows. The main steps of the methodology are as follows: (1) conversion of the flip-flop based synchronous circuit into a latch based one by decoupling the master and slave latches of each register; (2) generation of a matched delay for the combinational logic of each stage in the pipeline; (3) combination of local controllers and delay elements into a control path that follows the structure of the corresponding data path. A synchronous circuit and its de-synchronized equivalent are shown in Fig. 1(a) and Fig. 1(b), respectively.
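As a minimal illustration of step (1), the following behavioral Verilog sketch shows a register whose master and slave latches are made transparent by separate enable signals from a local latch controller rather than by a global clock edge. The module and signal names are ours, not the authors'.

// Sketch of step (1): an edge-triggered register split into a master latch
// and a slave latch, each controlled by its own local enable signal.
// Generic illustration only, not the authors' gate-level implementation.
module decoupled_register #(parameter WIDTH = 32) (
    input  wire             en_master,  // enable from the local latch controller
    input  wire             en_slave,   // enable from the local latch controller
    input  wire [WIDTH-1:0] d,
    output wire [WIDTH-1:0] q
);
    reg [WIDTH-1:0] master, slave;

    // level-sensitive latches: hold their value while the enable is low
    always @(*) if (en_master) master = d;
    always @(*) if (en_slave)  slave  = master;

    assign q = slave;
endmodule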

Fig. 1. A synchronous circuit and its de-synchronized equivalent.

B. Multiplier and its De-synchronized Structure
Multipliers can be designed with different structures according to different performance requirements, power limitations and hardware budgets. These mainly include sequential multipliers, array multipliers and tree multipliers [5]. Sequential multipliers are quite power-efficient thanks to their hardware simplicity, but they have very poor performance. Array and tree multipliers are two of the most popular kinds of multipliers because of their high performance, but they are expensive in terms of hardware overhead and power consumption. As such, an alternative multiplier structure, which makes a trade-off between performance and hardware requirements, is often chosen [6]. The structure of the iterative multiplier used in this paper is shown in Fig. 2. The multiplier is implemented as a 3-stage pipeline: the first stage is an 8-2 adder tree consisting of Booth encoders and 4-2 adders; the second stage is a 4-2 adder tree whose results are fed back to the inputs of the adder tree; the third stage finally generates the sum and carry of the multiplication from the partial products. In this multiplier, the adder trees in the first and second stages are efficiently reused: in every cycle, 8 bits of the multiplier operand are used to generate partial products, and a 32×32-bit multiplication needs 4 iterations.

Fig. 2. The structure of an iterative multiplier.
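The following purely behavioral Verilog sketch captures the iterative scheme just described, consuming one 8-bit slice of the multiplier operand per iteration over four iterations. It models only the arithmetic, not the pipelined Booth/4-2 adder implementation; operands are treated as unsigned and all names are illustrative.

// Behavioral model of the iterative multiplication scheme: each of the four
// iterations accumulates the partial product of one 8-bit slice of b.
module iterative_mult_model (
    input  wire [31:0] a,        // multiplicand
    input  wire [31:0] b,        // multiplier operand, consumed 8 bits per iteration
    output reg  [63:0] product
);
    integer    i;
    reg [7:0]  slice;
    reg [63:0] acc;

    always @(*) begin
        acc = 64'd0;
        for (i = 0; i < 4; i = i + 1) begin
            slice = b[8*i +: 8];                              // current 8-bit slice
            acc   = acc + (({32'd0, a} * slice) << (8*i));    // shifted partial product
        end
        product = acc;                                        // equals a * b (unsigned)
    end
endmodule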

We use the de-synchronization methodology to implement an asynchronous multiplier based on the synchronous multiplier described above; the resulting asynchronous multiplier is shown in Fig. 3.

Fig. 3. The structure of a de-synchronized multiplier.

According to the de-synchronization flow, the data paths remain almost the same as in the former synchronous design, apart from the replacement of flip-flops by latches. In the process of generating the control path, control paths for linear structures are created first and are then combined for complex data paths with feedback. The result of stage 2 is used as an input of the combinational logic in stage 2, i.e., there exists a feedback signal from stage 3 to stage 2. As such, the signals req2 and req1 are collected with a C element, as illustrated in Fig. 3. In addition, there is also a C element to collect the acknowledge signals. More details of control path generation can be found in [7]. Handshake protocols for de-synchronized circuits are discussed in [8]; the semi-decoupled four-phase handshake protocol is used to implement the local latch controllers in this paper.

C. Problem Analysis
Without considering the influence of the operations and operands performed in the actual circuit, power and performance problems are often introduced into de-synchronized circuits. In order to explain this influence, we analyze the behavior of the de-synchronized multiplier described in Section 2.B. When performing a multiplication, each stage of the de-synchronized multiplier has to compute four times, which introduces unnecessary switching activity in the third stage. In the first 5 cycles, only stage 1 and stage 2 need to do useful work, while stage 3 should be idle and wait for the final partial products from stage 2. In practice, however, each computation of stage 2 triggers a computation of stage 3, since stage 2 and stage 3 are tightly coupled. Before the final partial products are computed, every operation in stage 3 triggered by stage 2 is meaningless. From the above, it is clear that the physical structure does not necessarily reflect the actual behavior of the computation; the unnecessary power and time can be eliminated by taking the influence of operations into account. Furthermore, the number of computations can be reduced when the operands are analyzed. Most integer operands have small absolute values, so data-dependent latency can be exploited: the multiplier need not always run the full number of iterations. The multiplier examines the multiplier operand and quits the iterations when the remaining operand bits are all ones or all zeros. In this case, the rest of the partial products are all zeros and the corresponding iterations are meaningless. This scheme is usually called early completion [4]. By adopting early completion in the iterative multiplier, we can improve the average performance of the multiplier; at the same time, redundant operations are eliminated, achieving a significant power reduction.
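A minimal sketch of the early-completion test is given below, assuming 8 bits of the multiplier operand are consumed per iteration; it simply checks whether the not-yet-consumed bits are all zeros or all ones (i.e., pure sign extension). The interface and signal names are hypothetical, not the design of [4] or the authors' implementation.

// Early-completion check: after iteration `iter`, decide whether the
// remaining multiplier bits can only produce all-zero partial products.
module early_completion_detect (
    input  wire [31:0] b,          // multiplier operand
    input  wire [1:0]  iter,       // index of the iteration just finished (0..3)
    output wire        quit_early  // high when the remaining iterations are redundant
);
    wire [5:0]  consumed  = (iter + 6'd1) * 6'd8;   // bits consumed so far
    wire [31:0] remaining = b >> consumed;          // unused bits, right-aligned
    wire [31:0] mask      = 32'hFFFF_FFFF >> consumed;

    wire all_zeros = (remaining == 32'd0);
    wire all_ones  = (remaining == mask);

    assign quit_early = all_zeros | all_ones;
endmodule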

3. Improved De-synchronization Flow


The power and performance of an asynchronous pipeline are often affected by the functional behavior of the pipeline, and they are also often data-dependent. However, these two important factors are not considered by the existing de-synchronization method. In this paper, an optimized de-synchronization method is proposed. As shown in Fig. 4, two schemes are employed to optimize the circuit: (1) modify the control path according to the functional behavior of the pipeline, so that power and performance are improved by avoiding redundant operations; (2) introduce a dynamic operand detection technique into the data path, so that power can be saved by exploiting the narrow-width operations in the pipeline. The optimized design flow is as follows:

Fig. 4. The improved de-synchronization flow.

Step 1. Use a synchronous circuit described in Verilog/VHDL as the basis, and perform a functional simulation of the whole design.
Step 2. Analyze the operands of the target applications, and optimize the synchronous circuit by applying operand detection methods.
Step 3. Use commercial synthesis tools to obtain a gate-level netlist of the synchronous circuit; a gate-level simulation is also done in this step.
Step 4. Divide the gate-level netlist into control paths and data paths.
Step 5. Refine the control paths by introducing new control components so that redundant operations can be avoided (details of this step are given below).
Step 6. Combine the refined control paths and the data paths to form an optimized asynchronous circuit, and use a functional simulation to verify the correctness of the design.
Step 7. Place and route (P&R) the asynchronous circuit and then do the post-layout simulation.

Our design flow differs from the existing de-synchronization method mainly in Step 2 and Step 5. In Step 2, the data paths are optimized based on an analysis of the operands, while in Step 5, the control paths are refined based on the actual functional behavior of the pipelines. It is not possible to use a general method to optimize the data paths, since different applications may lead to different application-specific hardware; thus the data paths are usually optimized manually, while the control paths can be optimized semi-automatically.

In the process of refining the control paths (Step 5), we replace some handshake components in the control paths with new ones to better control the data flow in the pipeline. These new handshake components include FORK, JOIN, MUX, DEMUX and MERGE. The gate-level implementations of these components can be found in [9]. Specifically, the detailed steps of the control path optimization are as follows:

Step 5-1. Design and implement FORK, JOIN, MUX, DEMUX and MERGE handshake components with different parameters, forming a handshake component library.
Step 5-2. Design and implement delay elements with different delay lengths, forming a delay element library.
Step 5-3. Analyze the data flow of the synchronous pipeline to obtain the relevant functional behavior of the pipeline.
Step 5-4. Select handshake components from the handshake component library based on the result of Step 5-3.
Step 5-5. Select delay elements for each pipeline stage based on the delay analysis of the data path.
Step 5-6. Combine the handshake components selected in Step 5-4 and the delay elements selected in Step 5-5 to form the final control path.

Step 5-4 is important because the handshake components must be selected according to the data flow and the functional behavior of the pipeline. For linear asynchronous pipelines, traditional four-phase latch controllers (FLCs) are used. For asynchronous pipelines with multiple branches, DEMUX and MERGE handshake components are usually used in the control paths. For cyclic asynchronous pipelines, the handshake components can only be decided by the real functional behavior of the pipeline. For example, the iterative multiplier uses MUX and DEMUX components, while a state machine can use FORK and JOIN components to construct its control paths.
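As an illustration of the basic primitive from which JOIN-like handshake components (and the request/acknowledge merging in Fig. 3) are built, a behavioral Verilog model of a two-input Muller C-element is sketched below. This is a generic simulation model, not the gate-level implementation of [9].

// Two-input Muller C-element: the output switches only when both inputs
// agree, and otherwise holds its previous value (state-holding behavior).
module c_element (
    input  wire a,
    input  wire b,
    output reg  c
);
    initial c = 1'b0;      // assume a defined reset state for simulation

    always @(a or b)
        if (a == b)
            c = a;         // both inputs equal: output follows them
        // otherwise: keep the previous value
endmodule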

4. Asynchronous Multiplier Implementations


In this section, we use the proposed de-synchronization flow to optimize the traditional de-synchronized multiplier. Concretely, three schemes, namely early completion, decoupling and delay element optimization, are employed [10]. Five different asynchronous 32-bit multipliers were implemented in UMC 0.18 μm technology for comparison. The first one is the de-synchronized multiplier, called ADE (Asynchronous DE-synchronized multiplier). Three further asynchronous multipliers are implemented, each with one of the three schemes, and they are named ADC (Asynchronous DeCoupled multiplier), ADL (Asynchronous multiplier with new DeLay elements) and AEC (Asynchronous multiplier with Early Completion), respectively. The last asynchronous multiplier, called ATS (Asynchronous multiplier with Three Schemes), applies all three proposed schemes to the conventional de-synchronized multiplier. When the synchronous multiplier (SYN) is synthesized, the logic delay is about 1.47 ns; however, when each stage of the pipeline is synthesized independently, the latencies of stage 1, stage 2 and stage 3 are 1.25 ns, 0.93 ns and 1.47 ns respectively.

A. Performance Comparisons
In this section, performance is compared among the six different multipliers, as shown in Table 1. In SYN, the clock cycle is 1.47 ns and the time required for a multiplication is 8.82 ns, while the time needed by the de-synchronized multiplier to do a multiplication increases to 9.76 ns. The extra time is due to the return-to-zero characteristic of the four-phase handshake protocol. The asynchronous multiplier with the decoupled control path needs almost the same time as the de-synchronized multiplier to perform a multiplication; however, the optimized latch controller is more complex and a little more time is expended. The delay of ADL is reduced to 8.65 ns thanks to the optimized logic and delay elements. In AEC, the computation time of a multiplication depends directly on the operand. We assume that the multiplier operand has fewer than 16 valid bits, so that just 2 iterations are needed; as a result, the time is greatly reduced to 6.19 ns. The more invalid bits there are, the higher the performance that can be achieved. Finally, the three schemes mentioned previously are all applied in ATS, and the time for a multiplication with 2 iterations is only about 5.21 ns.
TABLE 1
TIME REQUIRED FOR A MULTIPLICATION (ns)
multiplier   delay
SYN          8.82
ADE          9.76
ADC          9.83
ADL          8.65
AEC          6.19
ATS          5.21
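For reference, the matched delay elements optimized in ADL (and collected in the delay element library of Step 5-2) are essentially buffer chains whose total delay bounds the critical path of the corresponding stage. The Verilog below is a simulation-only sketch of such a chain; the per-stage delay of 0.2 ns and the module name are assumptions for illustration, since real delay elements are built from standard-cell buffers.

`timescale 1ns/1ps
// Simulation-only matched-delay element: a chain of identical buffer stages.
module delay_element #(
    parameter integer STAGES = 8       // number of buffer stages in the chain
) (
    input  wire d_in,
    output wire d_out
);
    wire [STAGES:0] chain;
    assign chain[0] = d_in;

    genvar i;
    generate
        for (i = 0; i < STAGES; i = i + 1) begin : buf_chain
            // each stage contributes a fixed unit delay (0.2 ns assumed here)
            assign #0.2 chain[i+1] = chain[i];
        end
    endgenerate

    assign d_out = chain[STAGES];
endmodule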

B. Power Comparisons
In this section, we extract the execution traces of multiplications from several typical multimedia benchmarks and run these traces on the different multipliers to evaluate their power consumption. The three multimedia applications used are idct32, mat32 and maxidx32, and the simulation cycle is set to 2 ns. In the simulation of mat32, we assume the elements of the input matrices are all small numbers. Table 2 gives the power consumption of the different multipliers when running these three programs.

TABLE 2
POWER COMPARISON AMONG 6 MULTIPLIERS (mW)
multiplier   idct32   mat32   maxidx32
SYN          18.17    13.80   8.12
ADE          12.89     8.62   10⁻³
ADC          10.32     6.43   10⁻³
ADL          12.96     8.65   10⁻³
AEC           8.73     2.41   10⁻³
ATS           6.57     2.45   10⁻³

From Table 2, it is obvious that the synchronous multiplier consumes more power than the asynchronous ones, mainly because the clock tree consumes a large amount of power in synchronous circuits. There is no global clock in ADE, so its power consumption is reduced. Due to the elimination of redundant operations in stage 3, ADC consumes 20%-25% less power than ADE. When running the idct32 trace, AEC saves about 32.3% power compared to ADE, and this saving becomes 72.0% when running the mat32 trace, which shows that the effect of the early completion scheme is heavily dependent on the actual input operands. Because the input numbers are small in the mat32 trace, AEC may need only one Booth encoding per multiplication instead of four. When running the maxidx32 trace, which contains no multiplication operations, all the asynchronous multipliers consume no dynamic power but only a very small static leakage power, while the synchronous multiplier consumes 9.88 mW.

C. Area Comparisons
The areas of the different multipliers are given in Table 3. From the table, the introduction of the control path leads to a 6.41% increase in area, and the multipliers with the different schemes are somewhat larger than the conventional de-synchronized multiplier. Specifically, the area of ATS is 0.105 mm², 34.6% larger than the synchronous version; about 19.2% of the area increase is due to latency optimization in the data path, and the contribution of the optimization in the control path is about 15.4%.

TABLE 3
AREA COMPARISON AMONG 6 MULTIPLIERS (mm²)
multiplier   area
SYN          0.078
ADE          0.083
ADC          0.085
ADL          0.097
AEC          0.089
ATS          0.105

5. Conclusion
The current de-synchronization method simply implements control paths according to the topology of the data paths, ignoring the actual behavior of the circuit and the features of the operands. In this paper, we propose an improved de-synchronization flow to reduce the power and enhance the performance of traditional de-synchronized circuits. A de-synchronized multiplier is then optimized using the proposed flow. Experiments show that the optimized multipliers have lower power dissipation and higher performance than the traditional de-synchronized multiplier, at the cost of a little more area.

References
[1] J. Cortadella, A. Kondratyev, L. Lavagno, & C. P. Sotiriou, "Desynchronization: synthesis of asynchronous circuits from synchronous specifications," (2006) IEEE Trans. on Computer-Aided Design, vol. 25, no. 10, pp. 1904-1921.
[2] Y. J. Liu & S. Furber, "The design of a low power asynchronous multiplier," (2004) Proc. IEEE Symp. Low Power Electronics and Design, pp. 301-306.
[3] A. Efthymiou, W. Suntiamorntut, J. Garside, & L. E. M. Brackenbury, "An asynchronous, iterative implementation of the original Booth multiplication algorithm," (2004) Proc. IEEE Symp. Advanced Research in Asynchronous Circuits and Systems, pp. 207-215.
[4] D. W. Kim & D. K. Jeong, "Iterative self-timed multiplier with early completion," (2000) Proc. of European Solid-State Circuits Conference, pp. 109-112.
[5] B. Parhami, Computer Arithmetic: Algorithms and Hardware Designs, New York: Oxford University Press, 2000.
[6] M. R. Santoro & M. A. Horowitz, "SPIM: a pipelined 64×64-bit iterative multiplier," (1989) IEEE Journal of Solid-State Circuits, vol. 24, no. 2, pp. 487-493.
[7] N. Andrikos, L. Lavagno, D. Pandini, & C. P. Sotiriou, "A fully-automated desynchronization flow for synchronous circuits," (2007) Proc. ACM/IEEE Design Automation Conference, pp. 982-985.
[8] I. Blunno, J. Cortadella, A. Kondratyev, L. Lavagno, K. Lwin, & C. Sotiriou, "Handshake protocols for de-synchronization," (2004) Proc. IEEE Symp. Advanced Research in Asynchronous Circuits and Systems, pp. 149-158.
[9] J. Sparsø & S. Furber, Principles of Asynchronous Circuit Design: A Systems Perspective, Boston: Kluwer Academic Publishers, 2001.
[10] W. Shi, W. Chen, & Y. Wang, "Power and performance optimization for a de-synchronized multiplier," (2010) Proc. of IEEE International Conference on Computer and Electrical Engineering, pp. 270-274.

