A Comparison of CMOS Implementations of an Asynchronous Circuits Primitive: the C-Element

Department of Electrical Engineering

Maitham Shams

Department of Computer Science

Jo C. Ebergen

Department of Electrical Engineering

Mohamed I. Elmasry

University of Waterloo, Waterloo, Ontario, Canada N2L 3G1

Abstract

One of the most frequently used primitives in asynchronous control circuits is the C-element. The three most popular CMOS implementations of the C-element are compared with respect to energy-eciency, delay, and area, with an emphasis on energy. The three implementations have been introduced by Sutherland, Martin, and Van Berkel. We show that in a typical environment, Van Berkel's implementation is superior to the other implementations with respect to energy consumption and area, for the same delay.

the proper operation of the C-element, it is also assumed that once both inputs become 0 (1), they will not change until the output changes. A state diagram is given in Figure 1 along with a commonly used schematic. The
100 011

a
000

b
110

a c
111

b
001

a c b

b
010

a c

b
101

a

1 Introduction

Various applications have demonstrated that asynchronous circuits have great potential for low-power and high-performance design. One of the most frequently used primitives in asynchronous control circuits is the C-element. The purpose of the paper is to compare the three most popular CMOS implementations of the C-element. One implementation of the C-element has been introduced by I.E. Sutherland [1] and is used often in high-performance micropipelines; a second implementation has been introduced by A.J. Martin and has been used in the Caltech asynchronous microprocessor [2] and an asynchronous, low-power version of the ARM developed at Manchester University [3]; a third implementation has been introduced by K. Van Berkel and is being used in the TANGRAM silicon compiler for low-power design at Philips Research Laboratories [5]. The di erent implementations are examined in two different setups. For each setup, we measure the delay and energy dissipation through HSPICE simulations.

Figure 1: State diagram and schematic of the C-element behavior of the output, c, of a C-element can be expressed in terms of the inputs a and b and the previous state of the output, c, by the following Boolean function ^ c = c  (a + b) + a  b ^ (1)

3 Measurement Setup

Two testing environments are considered: one to evaluate the e ect of the input capacitance and driving ability of an individual C-element gate, and another one to evaluate the performance of a series of cross-coupled C-element gates, where the performance of individual elements are mutually dependent. Figure 2 illustrates
Vdd Va a c Vb b t=T Vcc

Va Vb

2 The C-element

c Tf1 Tr1 Tf2 Tr2 Tf3 Tr3

The C-element has been introduced by D. E. Muller [6] and is therefore also called the \Muller C-element." A C-element has two inputs a and b and one output c. Traditionally, its logical behaviour has been described as follows. If both inputs are 0 (1) then the output becomes 0 (1); otherwise the output remains the same. For
ISLPED 1996 Monterey CA USA 0-7803-3571-8/96/$5.00 © 1996

Figure 2: First measurement setup: testing for optimal sizing for known fan-out the

The C-element is driven by two inverters which are fed by the ideal sources Va and 1 .rst measurement setup.

respectively. The delay through the C-element depends on the order and arriving times of the input signals.Vb . At the output. Exhaustive simulations are needed to . The NMOS and PMOS transistors of the inverters are 4 and 10 microns wide. the C-element is driving a number of inverters similar to the input inverters in size.

A slightly less accurate. As shown in the .nd the worst-case delay by altering the arriving interval between the input signals. but faster technique is to be only concerned with the arriving order of the input signals.

This setup di ers form the . D = max(Tf 1 . Tr3 ) (2) The energy consumption of the C-element per output transition also depends on the arriving times and order of the inputs. six cases are considered and the respective delays produced are measured. (3) E = P(Vdd ) T 3 where P(Vdd ) is the average power extracted from Vdd . and T is the total period of the output waveform. Tf 3 . We measure the average energy consumption per output cycle E.gure. The worst-case delay D is obtained by the following. Tr1 . using this method. corresponding to the illustrated waveforms. Tf 2. Tr2. The C-elements in the chain are all of the same size. The HSPICE simulations for this setup were performed at a frequency of 40 MHz. The second measurement setup is shown in Figure 3. from the following.

form the control circuit of an nstage micropipeline. A bubble at the input of a C-element schematic means that an inverted version of that input must be used. the oscillation of the nodes continues forever.rst setup in that the performance of the C-elements are now mutually dependent. Meanwhile. of course. A much simpli. The inverters are added to make the micropipeline self-driven and oscillating. can be interpreted as \request for the next stage". in turn. If not interfered. and the only possible event is the rising of r(0). This logic \one" then propagates through all the request nodes. which a ects the overall performance of the chain. where i represents the stage number. The chain of the C-elements shown. Initially. all nodes are set to low. The signals at the nodes indicated by r(i). This. without the inverters at the two ends. can be implemented using an inverter. other transitions are produced at r(0) which propagate toward the end.

ed waveform is shown in the same .

The results obtained through the .gure. The parameters of interest in this test are the latency L. is the inverse of . is E =  P(Vdd ) (4) 2 r(0) r(2) r(4) r(n) r(1) r(0) r(1) r(3) r(2) r(n) r(n-1) r(n+1) t=0 L Φ =1/F Figure 3: Second measurement setup: testing for optimal sizing of a chain structure in presence of feed-back where P(Vdd ) is the average power extracted from Vdd . and energy per throughput. which is half its throughput. The frequency of oscillation of the micropipeline F. The energy dissipation thus. throughput.

can give suggestions for the implementation of a subsystem in which the C-elements are mutually dependent. however.rst setup are useful when we want to insert a C-element in a spot where the input drive and output capacitance are known. The technology . The second setup.

i. we conclude that N1.8 m BICMOS process has been used in our HSPICE simulations. From the operation of VDD a a b c’ b a N2 N1 N5 N6 P1 P6 P2 P5 P3 P4 b c b N3 N4 a Figure 4: Conventional pull-up pull-down CMOS implementation of the C-element [1] the circuit. In this technology. N2. This implementation is ratioless.81 V and that of PMOS devices is 0. This circuit has been presented in [1] by Sutherland. Whereas N3. it does not impose any restrictions on the sizes of the transistors. they are of size W. and N6 are the main pull-down transistors which contribute to output switching.90 V. N4. 4 C-Element Implementations A conventional pull-up pull-down realization of the function is shown in Figure 4.le of a 0. Voltage Power supply of 3 V is assumed in all simulations. and N5 .e. the threshold voltage of NMOS devices is 0.

the normal pull-up transistors. For the circuit to have the same pull-up and pull-down resistances (when switching) as the previous two implementations. have widths Wp = 2:5W. The C-element circuit illustrated in Figure 5 has been presented by Martin [7]. Similarly. The circuit su ers from a race problem at node c.only provide the necessary feed-back to hold the state of the output when values of the inputs do not match. P4. must be made half the size. This circuit utilizes an inverter latch to maintain the state of the output when the inputs do not have a similar logic level. The feed-back transistors N3 and P3 are. There is an inherent re VDD P3 it is symmetrical with respect to the inputs. hence. 5 Results of the Tests a b c’ b a P1 N3 P2 c N2 P4 N1 N4 Figure 7 shows the energy dissipation versus the propagation delay for the three C-element implementations with a fan-out of 3 inverters under the . the feed-back transistors P3. except those of the output inverter. while P1 and P2. and N6 and P6 have a normal size to achieve load driving capability of the previous circuits. as usual. of minimum size. and P5 have minimum width. the normal N-tree and P-tree transistors. they are made as small as possible to reduce their loading e ect.

Point \M" in the . Thus.rst test. one might get two di erent energy readings for the same delay and implementation. The size of the C-element gate increases for each curve from the right hand side of the graph toward the top of the graph. but they correspond to two di erent sizes.

5 6 Vdd=3v Fan-out=3 inverters Figure 5: CMOS implementation of C-element using inverter latch (a) [7] sistance to switching the state of the latch that can be reduced. Accordingly. for the C-element gates in the . certain size ratios must be imposed on the transistors.3 1. energy versus delay. VDD a P1 c P4 b P3 b c’ b p2 P5 a P6 c N2 N3 N5 a N6 a N1 c N4 b Figure 6: CMOS implementation of C-element by Van Berkel [8] The next C-element implementationby Van Berkel [8] is shown in Figure 6. An advantage of this implementation is that 3 Figure 7: HSPICE simulation results. The feed-back inverter should be a weak one to allow changes in the state of the latch. For a proper operation of the circuit.15 1.2 Technology: 0. In this circuit.4 1.5 Minimum size transistors are chosen for the feed-back inverter to reduce the race problem.5 2 1. this circuit is also ratioless.25 1. Similar to Sutherland's circuit.5 4 3.5 7 6. the output state is maintained through a feed-back conducting path of three transistors in the pull-up tree or the pull-down tree.8 um BICMOS W > 2r L W P1P2 > 2 L N1N2 r W L W  N3 L P3 (5) 1.5 5 4.5 3 2. the following must hold:  Sutherland’s Martin’s Van Berkel’s M S B Energy (pJ) 5.gure shows 8 7.35 Propagation Delay (nSec) 1. but can not be eliminated.45 1.1 1.

For the same delay. if we chose the .rst test setup the minimum delay achievable using the second (Martin's) implementation of the C-element.

Notice also that delays between 1.27 nSec can be achieved by Van Berkel's circuit but not by the other two circuits. Van Berkel's circuit achieves a signi. if we chose the third (Van Berkel's) implementation.15 nSec to 1. Furthermore. Van Berkel's circuit obtains the same delay for 45% less energy . Although the minimum delays of the circuits are all within a 10% di erence. we would be saving 58% in energy over Martin's circuit.rst (Sutherland's) implementation we would be saving 42% in energy. point \S" shows the minimum obtainable delay using Sutherland's implementation. Similarly.

The results of the simulations for the second test environment are depicted in the form of an energy-frequency (E-F) graph in Figure 8.cant energy savings of about 50% for the same delay. The solid lines crossing the curves show the total gate area of the individual C- .

Similar to the E-D graph. Assume 80 75 Sutherland’s 70 Martin’s Van Berkel’s 65 60 55 50 45 40 35 30 25 20 27 15 10 17 5 0 0.175 0.125 0.15 A=178 (micron-square) 145 128 103 86 69 52 44 35 21 0. Alternatively. energy versus frequency.25 0.2 0. dissipating 30 pJ.3 Figure 8: HSPICE simulation results.225 Frequency (GHz) 0. The graph also shows that for higher frequencies.67 compared to Martin's implementation. Van Berkel's circuit o ers further savings in energy and area compared the other two implementations. if Van Berkel's C-element with a gate area of 35 m2 is used.09 in area and a factor of 2. de. However. in an E-F graph the outermost curve (here furthest from the origin) is the best alternative.elements used to form the micropipelines. which results in an energy consumption of 40 pJ per cycle. for the chains of C-element gates in the second test setup that we are interested in designing a micropipeline control circuit to operate at a maximum frequency of 250 MHz (or a throughput of 500MHz). the same throughput would be obtained at an energy of only 15 pJ. The savings over Sutherland's implementation is a factor of 2.275 0. This choice reduces silicon gate area by a factor of 2.74 and energy by a factor of 2. 25% less than the previous choice. Sutherland's C-element with a gate area of 72 m2 could be adopted.0 in energy or power. The graph shows that this is possible by using Martin's C-element with a gate area of 95 m2 .

For example. The second implementation has only two transistors dedicated to latching. Van Berkel's implementation has a nice symmetrical topology with respect to the inputs. Van Berkel's implementation has the advantages of both: minimumoverhead for latching with no resistance against output switching.nitely the right choice for low-power and highperformance applications. The superiority of Van Berkel's implementation can be explained intuitively as follows. this comparison demonstrated that minimizing the number of transistors dedicated to latching and avoiding topologies that resist output switching may result in signi. but the topology is such that they do resist output switching. The results obtained through this study are not limited to the C-element and may be extended to similar circuits in which the state of their outputs must be latched. Sutherland's implementation has six transistors dedicated to latching. Furthermore. which do not resist the output switching of the circuit.

\VLSI programming of asynchronous circuits for low power. June 1992. \Computing without clocks: Micropipelining the ARM processor. 204{243." in Proceedings Ban VIII Workshop: Asynchronous Digital Circuit Design (G. Feb. \Micropipelines.). [2] A. Kessels. Schalij. 6 Discussion and Conclusions A comparative study of CMOS implementations of the C-element in terms of energy." in Proceedings of an International Symposium on the Theory of Switching. Davis. Seitz." in Advanced Research in VLSI: Proceedings of the Decennial Caltech Conference on VLSI (C. Springer-Verlag. 720{738. Energy [pJ] References [1] I. The energy-delay and energy-frequency graphs illustrated in this paper are a convenient way of presenting the trade-o s between energy. Workshops in Computing. vol. 1995." Communications of the ACM. D. S. van Berkel. Martin. Sutherland. pp. Birtwistle and A. Hazewindus. 1959. 351{373. UT Year of Programming Series. M. pp. [7] A. van Berkel. 103{128. eds." Integration. [3] S. June 1989. \The design of an asynchronous microprocessor. one by Martin [7]. and area has been presented. M. Birtwistle and A. E. R." in Formal Development of Programs and Proofs (E. pp. Bartky. Davis. Burns. Three implementations from the literature have been considered: one by Sutherland [1]. \Beware the isochronic fork.). Although results may vary for other test environments. The techniques we discussed can be extended to compare di erent implementations of other primitives. J. and area of a design all in one graph. 1989. delay (or frequency). Addison-Wesley. J. Roncken. eds.). Martin. \Formal program transformations for VLSI circuit synthesis. and one by Van Berkel [8]. 1989. pp. Peeters. A. ed. we can say that Van Berkel's C-element is 4 . T. Lee. J. [8] K.cant energy savings. and F. [4] K. J. [6] D. 32. ed. Harvard University Press. Workshops in Computing. van Berkel. 13." in Proceedings Ban VIII Workshop: Asynchronous Digital Circuit Design (G. [5] K. 1995. W. \A theory of asynchronous circuits. Borkovic. Furber. pp.). Springer-Verlag. E. the VLSI journal. Muller and W." in International Solid State Circuits Conference. MIT Press. Dijkstra. Burgess. 1994. vol. pp. 88{89. \A fully-asynchronous lowpower error corrector for the DCC player. K. Apr. 59{80. and P. L. delay. S.

Sign up to vote on this title
UsefulNot useful