IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 15, NO. 4, APRIL 2007
A Methodology for Transistor-Efficient Supergate Design

Dimitri Kagaris and Themistoklis Haniotakis
Abstract: The number of transistors required for the implementation of a logic function is a fundamental consideration in digital VLSI design. While the determination of a series-parallel implementation can be straightforward once a simplified Boolean expression of the function is available, this may not be an optimum solution. In this paper, a methodology is developed for minimizing the number of transistors that starts from a sum-of-products expression and utilizes non-series-parallel structures. Experimental results demonstrate the efficiency of the approach.

Index Terms: Automatic synthesis, switching functions, transistors, VLSI.

I. INTRODUCTION

Optimizing the performance of a circuit with respect to implementation cost, operational speed, and power requirements is the fundamental problem in digital electronic design. In the custom design approach, a transistor-level implementation for the required functions is selected and an appropriate physical layout is made. For most commercial applications, the required effort for transistor-level implementations may be prohibitive, in which case standard cell libraries are used. In the standard-cell-library approach, the main effort is the technology mapping between alternative implementations of the design and the preexisting elements of the library. Such an approach may be beneficial for reducing the design effort, but may not be able to achieve a desired level of performance. If high performance is required, the custom design approach has to be followed at least for the critical parts of the circuit (see, e.g., [11]). In such performance-critical cases, selecting the appropriate transistor-level implementation for complex gates (supergates) is not trivial and the use of suitable computer-aided design (CAD) methods is needed. This is the context of this work.

In the open literature there are two main approaches for minimizing the number of transistors. The first approach is factorization (see, e.g., [1] and [2]), but it is typically limited to series-parallel implementations. Methods based on binary decision diagrams (BDDs) have also appeared in the literature (see, as a sample, [3], [5], [8], and [10]). However, some of these approaches (such as [5] and [8]) are limited to series-parallel implementations. Note also that [8] cannot generate transistor diagrams from all BDD representations: for example, the BDD for function $f = ab + \bar{a}c$ has three nodes, whereas the pull-up (or pull-down) network of the corresponding gate requires four transistors. Other approaches (such as [3] and [10]) can work with any BDD but do not fully utilize the bridging potential, as they impose forbidden directions on any formed bridges. In order to be able to find solutions that use bridging configurations and produce non-series-parallel networks, graph-oriented methods have been proposed (see, e.g., [4], [7], [13], [14], [15]). Approaches that catalog non-series-parallel implementations of $n$-input functions also exist (see, e.g., [9] for $n = 4$), but these are impractical even for moderate values of $n > 4$.

In the proposed method (a preliminary version of which appears in [6]), the transistor network is built explicitly by computing the most economical placement for the next product term of the function in the currently constructed transistor network. The most economical placement is chosen each time among several alternatives, one of which is bridging. Experimental results indicate that the proposed approach is particularly efficient and practical.

II. SYNTHESIS METHODOLOGY

The proposed synthesis methodology accepts a sum-of-products expression $P_1 + P_2 + \cdots + P_m$ that may have been simplified by any of the traditional approaches. It obtains from that an efficient (in terms of transistor count) network, which, in general, is not series-parallel. Some definitions needed for the description of the proposed methodology are given as follows.

A serial connection of transistors will be referred to as a spine [see Fig. 1(a)]. The source/drain nodes at the two ends of a spine will be referred to as endpoints. The set of transistors on a spine will be referred to as its spine set. Given a spine $s$, its spine set will also be denoted by $s$, while its endpoints will be denoted by $s_1$ and $s_2$ [see Fig. 1(a)]. A sequence of spines will be referred to as a path. The set of transistors on the path will also be referred to as a path. A set of spines $p$, $q$, $s$, $r$, and $t$ with $s_1 = p_2 = q_1$ and $s_2 = r_2 = t_1$ will be referred to as a bridge [see Fig. 1(b)]. Given a spine $s$ and a transistor set $\tau$, splicing is the operation of replacing $s$ with three other spines $p$, $q$, and $r$, such that $r_1$ is connected directly to a terminal (indicated in the following by TRM1 or TRM2), $q_2$ has an outlet to the terminal opposite to that of $r_1$, and $q = \tau \cap s$, $p = s - q$, $r = \tau - q$, $p_1 = s_1$, $p_2 = q_1 = r_2$, $q_2 = s_2$ [see Fig. 1(c)].

The fundamental part of the methodology is, given a partial transistor diagram $S$ (i.e., a set of spines implementing a subset of the given product terms) and a product term $\pi$ [represented as a set of literals (transistors)] to be placed next, to find the most economical placement
Manuscript received March 3, 2006; revised July 9, 2006 and November 27, 2006. The authors are with the Department of Electrical and Computer Engineering, Southern Illinois University, Carbondale, IL 62901-6603 USA (kagaris@engr.siu.edu; haniotak@siu.edu). Digital Object Identifier 10.1109/TVLSI.2007.895248
1063-8210/$25.00 © 2007 IEEE
Fig. 3. Parallel scenario example.
Fig. 1. Illustration of definitions: (a) spine; (b) bridge; (c) splicing of a spine by a transistor set $\tau$.
Fig. 2. Procedures PLACE($S$, $\pi$, $B$) and Fix_Esc($\Psi$, $\pi$).
of $\pi$ in $S$. This is given as procedure PLACE($S$, $\pi$, $B$) in Fig. 2(a). (The third parameter $B$ in PLACE($S$, $\pi$, $B$) is a user-specified bound on the allowed number of transistors in series. This is discussed at the end of the basic description that follows.) Alternatives (a)-(d) compute candidate costs for the placement of the current target product term $\pi$. The cost is the number of extra transistors that have to be included in the diagram such that 1) term $\pi$ is accommodated in the diagram and 2) no escape paths are formed. An escape (or sneak) path is defined as a path from TRM1 to TRM2 that does not include both a literal and its complement and is not a superset of a requested product term or a superset of a path from TRM1 to TRM2 already formed in the currently constructed diagram. These cost alternatives are explained in the following.

Alternative (a): The creation of a new isolated spine for the target $\pi$ corresponds to the trivial cost $|\pi|$.

Alternative (b): The search space for the cost minimization here consists of all spines in $S$. To see if Alternative (b) can be applied on a spine $s \in S$, the procedure tries to find an already placed product term $\pi'$ that makes use of spine $s$ and such that $\pi' - s$ is the largest subset of the target $\pi$. If such a product term exists, it computes the residual set $\rho = \pi - (\pi' - s)$. Then, for each TRM1-TRM2 path $\sigma \neq \pi'$ that passes through spine $s$, it checks whether $(\sigma - s) \cup \rho$ is an escape (sneak) path. If no escape path is found, then Alternative (b) is feasible with cost $|\rho - s|$ (minus any complementary literals that can be simplified) and spine $s$ is replaced by spines $p$, $q$, and $t$, such that $p = \rho \cap s$, $q = s - p = s - \rho$, and $t = \rho - p = \rho - s$. An illustration is given in Fig. 3.

Alternative (c): The search space for the cost minimization here consists of all spines in $S$. To see if Alternative (c) can be applied on a spine $s \in S$, the procedure first checks whether $\alpha = \pi \cap s$ is empty. If this is the case, then Alternative (c) cannot be used on spine $s$. If $\alpha = \pi \cap s$ is not empty, the procedure proceeds to check whether there is a path from endpoint $s_2$ to TRM2 that is a subset of $\pi$ (such a path will be referred to as an outlet path). Other possibilities include an outlet path through $s_1$ to TRM2, an outlet path through $s_2$ to TRM1, and an outlet path through $s_1$ to TRM1. If no such outlet path exists, then Alternative (c) cannot be used on spine $s$. Otherwise, let $\omega$ be the outlet path. Assuming (for presentation's sake) that $\omega$ is an outlet path through endpoint $s_2$ to TRM2, the procedure creates a temporary spine $t$ emanating from TRM1 (i.e., $t_1$ = TRM1) and replaces spine $s$ by spines $p$, $q$, and $t$, such that $p_1 = s_1$, $p_2 = q_1 = t_2$, $q_2 = s_2$, and $t_1$ = TRM1. The spine sets of spines $t$, $p$, and $q$ are set temporarily to $q = \alpha$, $p = s - \alpha$, and $t = \pi - (\alpha \cup \omega)$. Then the procedure checks whether there is any escape path $\psi$ which starts from $t_1$, includes at least spines $t$ and $q$ in succession, and terminates at TRM2. Any such escape path $\psi$ is put into a set $\Psi$ to be handled later by procedure Fix_Esc($\Psi$, $\pi$). Procedure Fix_Esc($\Psi$, $\pi$) computes a small set of transistors (to be placed on the aforementioned spine $t$) so that all escape paths in $\Psi$ are corrected. It is based on a greedy covering scheme, as shown in Fig. 2(b). Procedure PLACE($S$, $\pi$, $B$) proceeds by checking whether there is any escape path $\psi'$ which starts from $t_1$, includes at least spines $t$ and $p$ in succession, and terminates at TRM2, and adds any such path $\psi'$
Fig. 5. Bridge scenario example.
Fig. 4. Splicing scenario example.
into set $\Psi$. Then it calls Fix_Esc($\Psi$, $\pi$) to fix all escape paths in $\Psi$ and uses the set returned by it to extend $t$ (the new $t$ is the union of $t$ and the returned set). The same computation is repeated to find the value of $t$ under each of the other previous possibilities for the outlet path, and the minimum size of $t$ (minus any complementary literals that can be simplified) is taken as the cost of Alternative (c) with respect to the particular spine $s$. An illustration is given in Fig. 4.

Alternative (d): The search space for the cost minimization here consists of all pairs of spines in $S$. To see if Alternative (d) can be applied on a pair of spines $(s, t)$, $s \in S$ and $t \in S$, the procedure first checks whether $\alpha = \pi \cap s$ or $\alpha' = \pi \cap t$ is empty. If this is the case for either $\alpha$ or $\alpha'$, then Alternative (d) cannot be used on the specific pair of spines $s$, $t$. If neither $\alpha = \pi \cap s$ nor $\alpha' = \pi \cap t$ is empty, the procedure proceeds to check 1) whether there is an outlet path $\omega$ through endpoint $s_1$ or $s_2$ to TRM1 (or TRM2) that is a subset of $\pi$ and 2) whether there is an outlet path $\omega'$ through endpoint $t_1$ or $t_2$ to TRM2 (or TRM1) that is a subset of $\pi$. If no such outlet path exists for either 1) or 2), then Alternative (d) cannot be used on spines $s$, $t$. Otherwise, assuming (for presentation's sake) that outlet path $\omega$ is through endpoint $s_1$ to TRM1 and outlet path $\omega'$ is through endpoint $t_2$ to TRM2, the procedure creates a temporary spine $b$ and splits spines $s$ and $t$ into spines $s'$, $s''$ and $t'$, $t''$, respectively, such that $s'_1 = s_1$, $s'_2 = s''_1 = b_1$, $s''_2 = s_2$, $t'_1 = t_1$, $t'_2 = t''_1 = b_2$, and $t''_2 = t_2$. The spine sets of spines $b$, $s'$, $s''$, $t'$, and $t''$ are set temporarily to $s' = \alpha$, $s'' = s - \alpha$, $t' = t - \alpha'$, $t'' = \alpha'$, and $b = \pi - (\alpha \cup \omega \cup \alpha' \cup \omega')$. Then the procedure checks whether there is any escape path containing spine $b$. All such paths are included in set $\Psi$, and the set returned by Fix_Esc($\Psi$, $\pi$) is used to extend $b$ (the new $b$ is the union of $b$ and the returned set).
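The greedy covering idea behind Fix_Esc can be illustrated with a short Python sketch. Fig. 2(b) is not reproduced in this text, so the details below are assumptions made only for illustration: literals are encoded as strings with a trailing apostrophe marking a complement, a path is treated as corrected once the added transistors make it self-blocking or a superset of a requested product term (following the escape-path definition given earlier), and the names `fix_esc`, `is_corrected`, and `complement` are hypothetical rather than taken from the authors' implementation.

```python
from typing import FrozenSet, Iterable, List, Optional, Set

Literal = str  # assumed encoding: "a" is a literal, "a'" is its complement


def complement(lit: Literal) -> Literal:
    """Return the complementary literal under the apostrophe encoding."""
    return lit[:-1] if lit.endswith("'") else lit + "'"


def is_corrected(path: FrozenSet[Literal], added: Set[Literal],
                 terms: List[FrozenSet[Literal]]) -> bool:
    """True if the path plus the added literals is no longer an escape path.

    Assumption: a path is corrected when it becomes self-blocking or
    becomes a superset of some requested product term.
    """
    extended = path | added
    if any(complement(lit) in extended for lit in extended):
        return True
    return any(term <= extended for term in terms)


def fix_esc(escape_paths: Iterable[Iterable[Literal]],
            target: Iterable[Literal],
            terms: Iterable[Iterable[Literal]] = ()) -> Optional[Set[Literal]]:
    """Greedily choose literals of the target term that correct all escape paths.

    Returns the chosen literals, or None when this simplified model
    cannot correct every path.
    """
    remaining = [frozenset(p) for p in escape_paths]
    requested = [frozenset(t) for t in terms]
    candidates = sorted(set(target))  # sorted for deterministic tie-breaking
    chosen: Set[Literal] = set()
    while remaining:
        best, best_left = None, remaining
        for lit in candidates:
            if lit in chosen:
                continue
            # Paths that would still be escape paths if `lit` were added.
            left = [p for p in remaining
                    if not is_corrected(p, chosen | {lit}, requested)]
            if len(left) < len(best_left):
                best, best_left = lit, left
        if best is None:  # no single literal makes progress
            return None
        chosen.add(best)
        remaining = best_left
    return chosen
```

As a small usage example under these assumptions, calling `fix_esc([{"a", "m'"}, {"b", "m'"}, {"e", "c'"}], {"c", "d", "f", "m"})` first selects `m`, which blocks two paths at once, and then `c` for the remaining path. The greedy choice keeps the number of added transistors small but is not guaranteed to be minimal.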
The same computation is done to find the value of $b$ under each of the other possibilities for the outlet paths in 1) and 2), and the minimum size of $b$ (minus any complementary literals that can be simplified) is taken as the cost of Alternative (d) with respect to the particular spine pair $(s, t)$. An illustration is given in Fig. 5.

Each call PLACE($S$, $\pi$, $B$) will accommodate (by updating $S$) the target product term $\pi$ in the most economical way given $S$ and the previous operations. $S$ is updated with the creation of a new path between the opposite terminals housing exclusively $\pi$ (no product term can be a subset of another, as one of the two then has to be excluded from the original sum-of-products expression).

A first observation is that the particular path that houses $\pi$ can have the same number of literals as $\pi$ or more literals (but is always equal, as a set, to $\pi$). For example, in the splicing scenario of Fig. 4, the path that houses target term $\pi = \{c, d, f, m\}$ actually contains the literals $f$, $m$, $c$, $d$, and $f$, i.e., five transistors in series. In general, a large number of transistors in series is undesirable as it strongly affects the delay. The user-specified parameter $B$ in PLACE($S$, $\pi$, $B$) addresses exactly this concern by eliminating from further consideration any candidate placements that result in more than $B$ transistors in series. A common rule of thumb for the value of $B$ is five (see also [12] for the lower bound on the number of transistors in series for the implementation of any function), but this can be adjusted by the designer given the technology constraints.

A second observation is that in the creation of the path to accommodate $\pi$, other paths may be created in $S$ as well. These paths may correspond to product terms not yet examined [such as path $a$, $e$, $h$, and $f$ in Fig. 3(b)], to paths that are supersets (in terms of literals) of other examined or not yet examined product terms [such as path $m$, $i$, $e$, $h$, $k$, and $l$ in Fig. 4(b), which is a superset of the not yet examined product term $\{e, h, k, l, m, i\}$], or to paths that are self-blocking, i.e., contain both a literal and its complement [such as path $m$, $i$, $c$, $d$, $g$, and $m'$ in Fig. 4(b)]. Such paths are introduced when the set returned by Fix_Esc($\Psi$, $\pi$) in Alternatives (c) and (d) gives rise to a lower cost than other placements. Due to the explicit minimization of the transistor count and to the bound on the transistors in series, and since ties in the cost of different alternatives are broken in favor of the alternative with the simpler configuration [i.e., in the order (a), (b), (c), (d)], the overall number of paths in $S$ is bounded in practice by a small multiple of the given number $t$ of product terms, i.e., it is $O(t)$. This has also been verified experimentally.
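The path categories discussed above (self-blocking paths, paths covering a requested term or an already formed path, and escape paths) together with the series bound $B$ can be expressed as a small Python sketch. It is only an illustration of the definitions in the text: the apostrophe encoding for complements and the names `classify_path`, `within_series_bound`, and `complement` are assumptions, not part of the authors' implementation.

```python
from typing import Iterable, Sequence


def complement(lit: str) -> str:
    """Assumed encoding: "m'" is the complement of "m"."""
    return lit[:-1] if lit.endswith("'") else lit + "'"


def classify_path(transistors: Sequence[str],
                  terms: Iterable[Iterable[str]],
                  formed_paths: Iterable[Iterable[str]] = ()) -> str:
    """Classify a TRM1-to-TRM2 path according to the escape-path definition."""
    lits = frozenset(transistors)
    # A path holding a literal and its complement can never conduct.
    if any(complement(lit) in lits for lit in lits):
        return "self-blocking"
    # A superset of a requested product term is legitimate.
    if any(frozenset(term) <= lits for term in terms):
        return "covers requested term"
    # So is a superset of a terminal-to-terminal path already in the diagram.
    if any(frozenset(path) <= lits for path in formed_paths):
        return "covers formed path"
    return "escape"


def within_series_bound(transistors: Sequence[str], bound: int) -> bool:
    """Check the user-specified bound B on transistors in series."""
    return len(transistors) <= bound


# Paths quoted in the text for the splicing scenario of Fig. 4.
terms = [{"c", "d", "f", "m"}, {"e", "h", "k", "l", "m", "i"}]
housing_path = ["f", "m", "c", "d", "f"]  # five transistors in series
print(classify_path(housing_path, terms))
print(classify_path(["m", "i", "e", "h", "k", "l"], terms))
print(classify_path(["m", "i", "c", "d", "g", "m'"], terms))
print(within_series_bound(housing_path, 5))
```

The housing path is kept as a list rather than a set because the repeated literal $f$ counts as two transistors in series, which is exactly what the bound $B$ constrains.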
For the complexity of the PLACE subroutine, we first notice that steps 2(a), 2(b), 2(c), and 2(d) take at most $s^2$ checks [dominated by Alternative (d)], where $s$ is the current number of spines. Each check requires time $O(l)$ for union/intersection operations for the target product term, where $l$ is the total number of literals, plus time for checking the outlet and escape paths and fixing the latter by Fix_Esc. The latter time is $O(l \cdot t \cdot e)$, where $e$ is the maximum number of paths from an internal point to a terminal (TRM1 or TRM2). For the reasons previously given, $e$ is taken in practice to be $O(t)$. Checking for the existence of a product term already in $S$ in step 1 of PLACE takes $O(l \cdot e)$ time. So, the complexity of PLACE is $O(s^2 \cdot (l + l \cdot t \cdot e) + l \cdot e) = O(s^2 \cdot l \cdot t \cdot e + l \cdot e) = O(s^2 \cdot l \cdot t \cdot e)$, or in practice, $O(l \cdot s^2 \cdot t^2)$.

Procedure PLACE is called for each of the given product terms successively. The total number of product terms is denoted by $t$. The examination order of the product terms affects the final solution (the problem is NP-hard [15]). Heuristic measures can be applied, such as examining the terms in descending (or ascending) order of their literal counts, or examining next the product term that has maximum (or minimum) intersection with the already placed product
TABLE I COMPARATIVE RESULTS ON FUNCTIONS WITH MORE THAN FOUR VARIABLES
Fig. 6. Case3. (a) Original configuration. (b) Configuration obtained by the proposed approach.
terms, or examining next the product term that has maximum (or minimum) intersection with some terminal-to-terminal path in the currently constructed diagram, or simply examining the product terms in some random ordering. If there are $k$ candidate orderings, the complexity is $O(k \cdot t \cdot C)$, where $C$ is the complexity of the PLACE subroutine ($C = O(l \cdot s^2 \cdot t^2)$), i.e., $O(k \cdot l \cdot s^2 \cdot t^3)$. The actual time requirement of the algorithm in practice is very small (subsecond time per given ordering).

III. EXPERIMENTAL RESULTS

We have implemented the proposed algorithm in C and tried it extensively on different logic functions. To provide a comparison measure of the methodology's performance, we obtained three sets of comparative experimental results on: 1) supergates that we generated; 2) all four-variable functions, for which corresponding costs are available from other researchers; and 3) some functions with more than four variables, for which corresponding costs are also available from other researchers.

An example of the performance of the methodology on a supergate that we generated is shown in Fig. 6. In this supergate [see Fig. 6(a)], the nMOS and pMOS networks are duals of each other. We supplied the product terms of the nMOS and pMOS networks to the proposed procedure (in separate stages) with the transistors-in-series bound $B$ equal to the maximum literal count in the product terms. The configuration that we obtained is shown in Fig. 6(b). The original diagram [see Fig. 6(a)] needs 28 transistors in total, whereas the obtained diagram [see Fig. 6(b)] needs 23. The obtained pMOS and nMOS networks are no longer duals of each other. In the proposed solution, the pMOS network requires nine transistors instead of 14 for the original. For the nMOS network, the number of transistors remains the same as in the original, but an advantage in terms of delay is obtained by the reduction of transistors in series.
For additional analysis, we simulated in SPICE the designs of Fig. 6 using a 0.18-$\mu$m technology. All transistors were minimum size and the supply voltage used was 1.8 V. As output capacitance, the total input capacitance of the corresponding gate was used. For the first input, the pulse width was 5 ns and the period was 10 ns. For each subsequent input, the pulse width and the period were doubled. The total charge used by the original design is $5.93 \times 10^{-12}$ C and by the proposed design $5.08 \times 10^{-12}$ C. The corresponding energy used by the original is $10.674 \times 10^{-12}$ J and by the proposed $9.144 \times 10^{-12}$ J (a reduction of 14.33%). Also, the percentage reduction for the maximum delay was 26.7%.

For the second set of experiments (four-variable functions), we performed two evaluations. For the first evaluation, we ran through the catalog of the 402 representatives of the structurally equivalent classes [9] of all four-variable functions as given in [9, App. IV]. In all cases (except for function N 580, where there is a reduction by one transistor, as also observed in [13]), the proposed method obtained diagrams with the same number of transistors as Ninomiya (see [9, App. V]), but not necessarily with the same transistor configuration. For the second evaluation, we generated the list of the representatives of the 3984 permutation-equivalent classes of four-variable functions, as has also been done in, e.g., [10]. We obtained the CMOS transistor network for each of these representatives and computed the total transistor count, including the transistors needed for any inversions in each function. The proposed method resulted in fewer transistors (97 174 versus 98 311), an overall reduction of 1137 transistors. Since the savings over [10] are due mainly to the exploitation of the bridges, the advantage is expected to be larger for functions with more than four variables, as the latter provide in general more opportunities for bridging.

For the third set of experiments (functions with more than four variables), we used the logic functions $F_4$, $F_5$, $F_6$, and $F_{13}$ listed in [2], for which past results are available (see [2] and [15]). Table I contains the comparative transistor counts and percentage savings for [2] under the bounded case (columns 2-4), using bounds $B$ = 10, 5, 5, 10 for $F_4$, $F_5$, $F_6$, and $F_{13}$, respectively, and for [15] under the unbounded case (columns 5-7). As can be observed, the reduction in transistor count is substantial (a reduction of up to 13.8% was observed for the bounded case and 25% for the unbounded). The resulting diagrams for the unbounded case can be found in [6].

IV. CONCLUSION

We have described a methodology for transistor-efficient synthesis of complex gates. The methodology produces in general non-series-parallel implementations and compares favorably with past approaches. In particular, it reduces the number of transistors for implementing a logic function. Delay and power consumption are not explicitly addressed in this work, although minimizing the number of transistors tends in general to affect both positively. Future work will explicitly address these issues.
REFERENCES
[1] R. K. Brayton, R. Rudell, A. Sangiovanni-Vincentelli, and A. Wang, "Multilevel logic optimization and the rectangular covering problem," in Proc. Int. Conf. Comput.-Aided Design, 1987, pp. 66–69.
[2] G. Caruso, "Near optimal factorization of Boolean functions," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 10, no. 8, pp. 1072–1078, Aug. 1991.
[3] L. S. Da Rosa, Jr., F. S. Marques, T. M. G. Cardoso, R. P. Ribas, S. S. Sapatnekar, and A. I. Reis, "Fast disjoint transistor networks from BDDs," in Proc. 19th Annu. Symp. Integr. Circuits Syst. Design, 2006, pp. 137–142.
[4] D. L. Dietmeyer, Logic Design of Digital Systems. Boston, MA: Allyn & Bacon, 1971.
[5] S. Gavrilov, A. Glebov, S. Pullela, S. C. Moore, A. Dharchoudhury, R. Panda, G. Vijayan, and D. T. Blaauw, "Library-less synthesis for static CMOS combinational logic circuits," in Proc. IEEE/ACM Int. Conf. Comput.-Aided Design, 1997, pp. 658–662.
[6] D. Kagaris and T. Haniotakis, "Transistor-level optimization of supergates," in Proc. Int. Symp. Quality Electron. Design, 2006, pp. 682–690.
[7] Z. Kohavi, Switching and Finite Automata Theory. New York: McGraw-Hill, 1970.
[8] C. R. Liu and J. A. Abraham, "Transistor level synthesis for static CMOS combinational circuits," in Proc. Great Lakes Symp. VLSI, 1999, pp. 172–175.
[9] M. A. Harrison, Introduction to Switching and Automata Theory. New York: McGraw-Hill, 1965.
[10] R. E. B. Poli, F. R. Schneider, R. P. Ribas, and A. I. Reis, "Unified theory to build cell-level transistor networks from BDDs," in Proc. Symp. Integr. Circuits Syst. Design, 2002, pp. 199–204.
[11] R. Roy, D. Bhattacharya, and V. Boppana, "Transistor-level optimization of digital designs with flex cells," Computer, vol. 38, no. 2, pp. 53–61, Feb. 2005.
[12] F. R. Schneider, R. P. Ribas, S. S. Sapatnekar, and A. I. Reis, "Exact lower bound for the number of switches in series to implement a combinational logic cell," in Proc. Int. Conf. Comput. Design, 2005, pp. 357–362.
[13] K. Tanaka and Y. Kambayashi, "Transduction method for design of logic cell structures," in Proc. IEEE Asia-Pacific Design Autom. Conf. (ASP-DAC), 2004, pp. 600–603.
[14] M. Wu, W. Shu, and S. Chan, "A unified theory for MOS circuit design switching network logic," Int. J. Electron., vol. 58, no. 1, pp. 1–33, 1985.
[15] J. Zhu and M. Abd-El-Barr, "On the optimization of MOS circuits," IEEE Trans. Circuits Syst. I, Fundam. Theory Appl., vol. 40, no. 6, pp. 412–422, Jun. 1993.
