Abstract—This paper presents a new efficient architecture for the design of fast low-cost single-clock-cycle binary comparators. The proposed 64-bit circuit requires only 1051 transistors and, when implemented using the ST 90-nm 1-V CMOS technology, it exhibits a running frequency higher than 4 GHz with an average power dissipation of only 4 mW. Comparison with the fastest comparator known in the literature demonstrates that, for the same technology, the novel architecture is 12% faster and requires 69% fewer transistors.

Index Terms—CMOS dynamic circuits, comparator, digital arithmetic, VLSI circuits.

Manuscript received April 01, 2008; revised June 20, 2008. Current version published December 12, 2008. This paper was recommended by Associate Editor M. Anis.
The authors are with the Department of Electronics, Computer Science and Systems, University of Calabria, Arcavacata di Rende, 87036 Italy (e-mail: p.corsonello@unical.it).
Digital Object Identifier 10.1109/TCSII.2008.2008063
1549-7747/$25.00 © 2008 IEEE
Authorized licensed use limited to: UNIVERSIDAD DE ALICANTE . Downloaded on February 19,2020 at 20:15:17 UTC from IEEE Xplore. Restrictions apply.
1240 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: EXPRESS BRIEFS, VOL. 55, NO. 12, DECEMBER 2008

I. INTRODUCTION

IN THE LAST few years, the design of high-speed, low-power, and area-efficient binary comparators has received a great deal of attention, since, as is well known, comparison is a fundamental operation in almost all digital processors. Examples of efficient architectures of binary comparators are demonstrated in [1]–[4]. Among the hardware implementations existing in the literature, these are very representative solutions. In the following, these circuits are named full comparators to indicate that, A and B being two n-bit numbers, they are able to separately recognize the three possible conditions in which A > B, A < B, and A = B.

In order to reach very high throughput, the approach proposed in [1] uses two-phase clocking dynamic logic with all-N-transistor (ANT) blocks. Such a 64-bit comparator requires 1890 transistors and produces the correct result within 3.5 clock cycles.

Single-cycle two-phase clocking architectures are presented in [2] and [3]. To realize this kind of comparator, a priority encoding algorithm is used in [2], whereas a parallel MSB checking method is exploited in [3]. The latter is 22% faster than [2], but it requires 88% more transistors.

In order to further increase the achievable speed, a modification of the MSB checking algorithm used in [3] was recently proposed in [4], where a MUX-based structure specifically designed for high fan-in comparators is presented. The circuit demonstrated in [4] exhibits the highest computational speed, but it also requires 3386 transistors.

As always happens in the design of digital circuits, achieving high speed is not the only concern. In fact, ensuring low power consumption and reduced silicon area occupancy is also an important design goal. The novel single-cycle comparator proposed in this paper was designed bearing this issue in mind. A 64-bit comparator designed as described here requires only 1051 transistors, thus reducing the transistor count by about 44%, 36%, 66%, and 69% with respect to the circuits described in [1]–[4], respectively. Moreover, the proposed circuit is 12% faster than that in [4], which is the fastest single-cycle comparator known in the literature.

A very important feature of the novel comparator is that its architecture can be easily implemented by using both full-custom and standard-cell-based design approaches. In order to demonstrate the efficiency of the new circuit, several implementations were carried out with the STMicroelectronics (ST) 90-nm 1-V CMOS process and the Austria Micro Systems (AMS) 0.35-μm 3.3-V CMOS technology.

The paper is organized as follows. In Section II, a brief background is given to describe the design principles of existing comparators; the proposed circuit is then introduced in Section III; implementation results and comparisons with existing comparators are presented in Section IV; finally, conclusions are drawn.

II. DESIGN PRINCIPLES OF EXISTING BINARY COMPARATORS

Let A and B be two 64-bit numbers. As is well known, the comparison between A and B can be performed by preliminarily comparing their corresponding bits. In order to enhance throughput, the approach proposed in [1] adopts the strategy of comparing two bits of each input simultaneously through thirty-two 2-bit comparators. The results obtained in this way are then input to a further sixteen 2-bit comparators, and so on until the final result is produced. The comparator presented in [1] is realized using six levels of 2-bit comparators and exploiting the half-cycle stealing of the ANT logic. As a consequence, the comparison between A and B is computed within 3.5 clock cycles.

In order to provide the comparison result within only one clock cycle, the comparator proposed in [2] divides the logic functions required for comparing A and B into two stages. In the first one, eight 8-bit comparators process eight groups of 8 bits coming from the inputs, such that each 8-bit comparator receives a pair of corresponding 8-bit subwords of A and B as inputs and produces three output signals that report the outcome of the subword comparison. Logic equations implemented in this approach are given in …

… that, as shown in (3), indicates if the inputs are different. The signals obtained in this way are input to the priority encoder that operates as described by (4) and (5).
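The two-stage organization attributed to [2] can be illustrated with a short behavioral model. The Python sketch below is not the circuit of [2]: it only mimics the idea of eight independent 8-bit subword comparisons followed by a priority selection of the most significant subword that differs. The function name, the string result encoding, and the pair of flags kept per subword are assumptions made for illustration, since the actual signals and equations are not legible in the text.

```python
def compare_two_stage(a: int, b: int, width: int = 64, group: int = 8) -> str:
    """Behavioral model: return 'gt', 'lt', or 'eq' for unsigned a versus b."""
    mask = (1 << group) - 1

    # Stage 1: compare corresponding 8-bit subwords independently.
    # Index 0 is the least significant subword.
    stage1 = []
    for i in range(width // group):
        sub_a = (a >> (i * group)) & mask
        sub_b = (b >> (i * group)) & mask
        stage1.append((sub_a > sub_b, sub_a != sub_b))  # (greater, different)

    # Stage 2: priority selection. The most significant subword that
    # differs decides the overall result.
    for greater, different in reversed(stage1):
        if different:
            return "gt" if greater else "lt"
    return "eq"
```

Because the eight subword comparisons do not depend on each other, they can be evaluated in parallel in hardware; only the priority selection has to look at all of them.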
PERRI AND CORSONELLO: FAST LOW-COST IMPLEMENTATION OF SINGLE-CLOCK-CYCLE BINARY COMPARATOR 1241
TABLE I
COMPUTATIONAL STEPS OF COMPARED 64-bit COMPARATORS

… The signals obtained in this way can be grouped four by four using the CLA logic again, as shown in (7). The comparison result is finally obtained by applying (8).

It can be seen that … can recognize the cases in which … and …. In fact, it is equal to the carry-out signal obtained by summing … and …. However, this is not enough to provide the functionality of a full comparator. Therefore, the signal … is also computed to establish if …. Thus, when …, the output signals … and … are both equal to 1, whereas they are both equal to 0 if …. Finally, when …, … is 0 and … is 1. The logic used then makes the occurrence of the case in which … impossible.¹

¹ This condition differs from existing comparators [2]–[4], in which the two outputs cannot both be equal to 1. However, this does not affect the overall architecture at system level.

The novel comparator exploits logic equations that are completely different from, and much simpler than, those used in [1]–[4]. Therefore, it is expected to exhibit significantly reduced hardware complexity. This expectation comes from two basic considerations: 1) the 32 instantiations of the 2-bit CLA circuits required to implement (6) are certainly not more complex than the 64 2-input EXOR (or EXNOR) gates used within conventional comparators to compare corresponding bits of the operands A and B, and 2) the eight 4-bit CLA circuits required to implement (7) are certainly much less complex than the eight 8-bit comparators used in the conventional implementations. The simplifications introduced by the novel comparator can be better appreciated by analyzing the computational steps required to perform a comparison. In the example reported in Table I, … and …. For each computational step, the processed input data and the produced output data are shown. It can be seen that, in comparison with the approaches in [1]–[4], the proposed comparator requires fewer computational steps, and each step has fewer input and output signals, thus providing reduced hardware complexity.

B. Circuit Implementation

The top-level architecture of the proposed 64-bit comparator is illustrated in Fig. 1. It uses three levels of operations. The first and second levels perform their evaluation phase when the clock signal is high, whereas they perform the precharge phase when the clock is low. The third level of the comparator operates in the opposite manner, thus making the circuit able to compare two 64-bit inputs within only one clock cycle.

Fig. 2. Transistor-level implementations of the (a) 4-bit CLA and (b) 8-bit CLA.

The first stage consists of 32 CLA blocks that generate the signals defined in (6). These signals are then input to the eight 4-bit CLA modules of the second stage, in which (7) is implemented. Finally, the 8-bit comparator of the third level completes the comparison operation as given in (8).
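The three-level organization (32 two-bit CLA blocks, eight 4-input CLA modules, one 8-input final level) can be mimicked with a behavioral model. The Python sketch below rests on assumptions that the text as extracted does not confirm: the comparison is taken as the carry-out of A plus the bitwise complement of B with carry-in 0, the per-bit signals are the usual carry-lookahead generate/propagate pair, and the function names and the returned pair are hypothetical. It says nothing about the clocking, the latches, or the actual output encoding of the circuit.

```python
def merge(groups):
    """Combine (generate, propagate) pairs, least significant first, into one pair."""
    g, p = 0, 1  # identity element: generates nothing, propagates everything
    for gi, pi in groups:
        g = gi | (pi & g)  # higher group generates, or propagates a lower carry
        p = p & pi         # the whole span propagates only if every group does
    return g, p


def cla_compare(a: int, b: int):
    """Return (carry_out, all_propagate) of a + ~b over 64 bits, carry-in 0."""
    width = 64
    nb = ~b & ((1 << width) - 1)  # bitwise complement of b (assumed operand)

    # Per-bit generate and propagate signals.
    bits = []
    for i in range(width):
        ai = (a >> i) & 1
        ni = (nb >> i) & 1
        bits.append((ai & ni, ai ^ ni))

    # Level 1: 32 blocks, each combining 2 bits.
    level1 = [merge(bits[i:i + 2]) for i in range(0, width, 2)]
    # Level 2: eight blocks, each combining 4 level-1 results.
    level2 = [merge(level1[i:i + 4]) for i in range(0, len(level1), 4)]
    # Level 3: one block combining the eight level-2 results.
    return merge(level2)


print(cla_compare(5, 3))  # (1, 0): carry-out set, operands differ
print(cla_compare(3, 5))  # (0, 0)
print(cla_compare(7, 7))  # (0, 1): no carry-out, every bit position propagates
```

Under these assumptions the carry-out is 1 exactly when A > B, and the all-propagate flag is 1 exactly when A = B, so the pair is enough to separate the three comparison outcomes.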
Transistor-level implementations of the basic components used in the novel comparator are depicted in Fig. 2. It can be seen that both 4-bit and 8-bit CLA blocks exploit Manchester-Carry-Chain-based architectures. The latter were chosen to achieve sufficiently high speed with a good time-area-power tradeoff [5]–[8]. It can also be noted that the output signals produced by the generic 4-bit CLA block are latched. Latches are used to enable the comparator to operate fast within one clock cycle. The remaining modules, implemented to compute the signals …, use conventional domino circuits. Therefore, their schematics are not reported here.

In the AMS 0.35-μm implementation, the nMOS transistors forming the carry chains depicted in Fig. 2 were progressively sized using a tapering factor equal to 1.5 and a minimum transistor width equal to 1 μm. Moreover, all of the precharging transistors are 1.6 μm wide. For all other gates, the minimum sizing approach has been used, thus making the nMOS transistors in a series … μm wide. Taking into account the effective loads on the signals … and …, the two nMOS transistors of each latch, used to correctly interface the second and third stages, are made 6 μm wide, whereas a channel width equal to 4.8 μm is set for the pMOS transistor. All level restorers are minimum sized.

In order to avoid the usage of large transistors, the eight signals … are ANDed through two 4-input AND gates. Moreover, to limit the effects due to the charge sharing occurring in high fan-in dynamic gates, intermediate nodes of the 4-input AND gates are also precharged. Simulations demonstrated that in this case precharging only the central node is enough to remove charge sharing. Finally, for the computation of the output signal … defined in (8), a static CMOS 3-input AND gate is used. The latter was chosen to process low-precharged and high-precharged signals simultaneously, without requiring any additional circuitry.

IV. COMPARISON RESULTS

Full-custom implementations of the new 64-bit comparator were carried out using the AMS 0.35-μm 3.3-V CMOS technology and the commercial ST 90-nm 1-V CMOS process. Post-layout simulations have been performed considering, for both implementations, typical process and operating conditions (i.e., 25 °C and nominal supply voltage). The results obtained, summarized in Table II, show that the 0.35-μm implementation of the new architecture reaches a maximum running frequency higher than 720 MHz while dissipating 38 μW/MHz. When the ST 90-nm technology is used, the new comparator achieves a 4.3-GHz running frequency with an average power consumption of only 1 μW/MHz. It can be easily verified that the worst-case delay of the proposed comparator occurs when … and … (or vice versa), which is the case previously analyzed in Table I.

The AMS 0.35-μm (ST 90-nm) process has a baseline fan-out-of-4 (FO4) minimum-sized inverter delay of 145 ps (26.8 ps), a silicon area of 36 μm² (3.3 μm²), and a power dissipation of 0.17 μW/MHz (0.0027 μW/MHz). In Table II, delays normalized to FO4 are also reported for the 0.35-μm and
90-nm implementations. The 0.35-μm technology was chosen for purposes of comparison with the comparators described in [1]–[4]. In Table II, the two implementations available for the comparator presented in [2] are named “[2]_C1” and “[2]_C2.” Analogously, the implementations [3]_C1 and [3]_C2 are referenced for the comparator described in [3]. The results obtained demonstrate that the novel circuit improves speed performance with respect to the fastest comparator described in [4] by 12% and reduces the transistor count by 69%. Even though the average power consumption of the comparator in [4] is not available, it can be reasonably expected that such a large reduction in transistor count achieved with the novel circuit also leads to a significant reduction of power consumption.

The proposed architecture is also found advantageous in terms of both computational speed and transistor count over the comparators presented in [1]–[3]. When compared to [2], the new circuit is 37% faster and requires 36% fewer transistors. Furthermore, it reduces the computational delay and the transistor count by 20% and 66% with respect to [3].

Equations (6)–(8) discussed in Section III can also be implemented in hardware by using static CMOS standard cells …

² For purposes of comparison, only the portion required for comparing 2’s complement numbers has been synthesized and laid out.

REFERENCES
[1] C. C. Wang, C. F. Wu, and K. C. Tsai, “1 GHz 64-bit high-speed comparator using ANT dynamic logic with two-phase clocking,” IEE Proc. Comput. Digit. Tech., vol. 145, no. 6, pp. 433–436, Nov. 1998.
[2] C. H. Huang and J. S. Wang, “High-performance and power efficient CMOS comparators,” IEEE J. Solid-State Circuits, vol. 38, no. 2, pp. 254–262, Feb. 2003.
[3] H. M. Lam and C. Y. Tsui, “High-performance single clock cycle CMOS comparator,” Electron. Lett., vol. 42, no. 2, pp. 75–77, Jan. 2006.
[4] H. M. Lam and C. Y. Tsui, “A MUX-based high-performance single-cycle CMOS comparator,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 54, no. 7, pp. 591–595, Jul. 2007.
[5] V. Kantabutra, “A recursive carry-lookahead/carry-select hybrid adder,” IEEE Trans. Comput., vol. 42, no. 12, pp. 1495–1499, Dec. 1992.
[6] J. H. Lou and J. B. Kuo, “A 1.5-V bootstrapped pass-transistor-based Manchester carry chain circuit suitable for implementing low-voltage carry look-ahead adders,” IEEE Trans. Circuits Syst. I, Fundam. Theory Appl., vol. 45, no. 11, pp. 1191–1194, Nov. 1998.
[7] S. Perri, P. Corsonello, F. Pezzimenti, and V. Kantabutra, “Fast and energy-efficient Manchester carry-bypass adders,” IEE Proc. Circuits, Devices Syst., vol. 151, no. 6, pp. 497–502, Dec. 2004.
[8] P. Corsonello, S. Perri, and M. Margala, “Efficient addition circuits for modular design of processor-in-memory,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 52, no. 8, pp. 1557–1567, Aug. 2005.
[9] Cadence Documentation [Online]. Available: www.cadence.com
[10] J. E. Stine and M. J. Schulte, “A combined two’s complement and floating-point comparator,” in Proc. IEEE Int. Symp. Circuits Syst. ISCAS 2005, Kobe, Japan.