


Robert Rogenmoser Qiuting Huang

Swiss Federal Institute of Technology, Integrated Systems Laboratory

CH-8092 Zurich, Switzerland

Abstract

An 8-bit adder, operating at 800 MHz, and a single-stage bit-serial adder, running at more than 1 GHz, have been implemented and successfully tested in a standard 1.0 µm CMOS process. The performance was achieved through the use of carry-increment full adders, fine-grain pipelining, and merging the combinational logic into the pipeline registers.

1 Introduction

Many applications, such as digital video filtering and data compression, call for simple and cost-effective computational units to perform complex algorithms in real-time.

The purpose of this work is to show the feasibility and the techniques needed to implement high-speed computational units in a low-cost and proven technology. We present an 8-bit adder, measured at 800 MHz, and a single-stage bit-serial adder, running at more than 1 GHz. Both have been implemented in a commercially available 1.0 µm CMOS process.

The 8-bit adder chip performs addition of two 8-bit data words and has a carry-input provided for subtractions. The data throughput of 800 x 10^6 additions per second is achieved with a dedicated architecture, fine-grain pipelining, merging the combinational logic into the pipeline registers, and by careful transistor sizing. The cost of this approach is an increased latency. As applications working on large streams of data are limited in their performance by the throughput rather than latency, the latter is often of minor concern.

Furthermore, in low power applications, where lower supply voltages are used, fewer parallel processing units are required to maintain throughput when fast building blocks are used [1]. This alleviates the partitioning of algorithms into many parallel tasks.

2 Architecture

The architecture of the 8-bit adder is based on the carry-increment adders (CIA) proposed by Tyagi [2]. The scheme is similar to carry-select adders (CSA) but instead of selecting between two results, the carry-input increments or passes the intermediate result. The area compared to a standard CSA is thereby reduced by about one third. Figure 1 shows a 3-bit block of such a CIA. The delay from the block-carry-in C_pb of the preceding block to the block-carry-out q for any such block is always composed of only one gate delay.

Figure 1: Logic diagram of a 3-bit slice of the Tyagi adder (gate symbols: EXOR, AND)

The 8-bit adder presented has been constructed using two slices of 1 bit and two slices of 3 bits (cf. fig. 2). To maximize data throughput the adder is fine-grain pipelined, only one level of logic (EXOR, etc.) being allowed per pipeline-stage. This results in a latency of 6 clock cycles for the whole 8-bit adder. A 20-bit adder can be built similarly using blocks of 1- (twice), 3-, 4-, 5-, and 6-bits, achieving a latency of only 8 cycles.
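The carry-increment scheme described above can be illustrated with a small behavioural model. This is only a sketch under stated assumptions: it models the arithmetic of the blocks (intermediate result with an assumed carry-in of 0, then pass or increment), not the gate-level circuit or the pipeline stages of the implemented chip. The function name `carry_increment_add` and the constants `BLOCKS_8` and `BLOCKS_20` are hypothetical names chosen for this example; the block widths are the ones stated in the text.

```python
# Behavioural sketch of a carry-increment adder (not the authors' circuit).
BLOCKS_8 = (1, 1, 3, 3)          # block widths, LSB first, for the 8-bit adder
BLOCKS_20 = (1, 1, 3, 4, 5, 6)   # block widths quoted for a 20-bit adder


def carry_increment_add(a, b, cin=0, blocks=BLOCKS_8):
    """Add two unsigned words block by block; return (sum, carry_out)."""
    result, shift, carry = 0, 0, cin
    for width in blocks:
        mask = (1 << width) - 1
        # Intermediate block result, computed as if the block carry-in were 0.
        partial = ((a >> shift) & mask) + ((b >> shift) & mask)
        s0, c0 = partial & mask, partial >> width
        # The real block carry-in either passes (0) or increments (1) the result.
        result |= ((s0 + carry) & mask) << shift
        # Block carry-out: generated inside the block, or the carry-in
        # propagated through an all-ones intermediate sum.
        carry = c0 | (carry & int(s0 == mask))
        shift += width
    return result, carry
```

With the carry-input used for subtraction, `carry_increment_add(200, ~55 & 0xFF, cin=1)` returns `(145, 1)`, that is 200 - 55 in two's complement. The block carry-out depends on the block carry-in only through a single AND-OR term, which is consistent with the one-gate-delay property mentioned in the text.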
*This work has been supported by JESSI (EUREKA EU.127), KWF (2302.1), and Philips (Faselec AG, Zürich)


Figure 2: Structure of implemented 8-bit adder showing re-arranged logic elements and pipeline stages (small rectangles indicate delay elements)

3 Circuit technique of the logic flip-flops

In addition to an efficient architecture the circuit technique is another crucial aspect of high-speed digital design. The true single-phase clocking (TSPC) technique [3] has proved to be one of the fastest techniques for digital CMOS circuits [4]. Figure 3 shows such a dynamic TSPC D-flip-flop optimized for high-speed and glitch-free operation [5]. Prescalers that we have implemented in 1.0 µm CMOS using this flip-flop run at over 2.6 GHz.

Figure 3: TSPC D-flip-flop optimized for high-speed and glitch-free outputs (the dotted circuitry is used for glitch-free and low frequency operation)

This flip-flop is also the kernel for the new logic-flip-flops. Compared to a standard pipeline, where the combinational logic is located between the register stages, these gates have been merged into the registers. Thereby the propagation delay of the combinational gate is shared with the delays of the flip-flop, significantly reducing the cycle time of a pipeline stage.

Figure 4 shows such a logic-flip-flop with a NAND gate in its first stage. The setup-time, needed to charge or discharge the first intermediate node B, is slightly increased. The propagation delay, which is the time it takes to transfer the value of node B after the rising clock edge to the outputs, remains about the same as that for the D-flip-flop. The cycle time of such a NAND pipeline stage is therefore reduced from the delay of the NAND gate plus the delay of the D-flip-flop down to approximately the delay of the D-flip-flop itself.

Figure 4: Schematic of the NAND/AND flip-flop

Going one step further, we also put logic functions into the second stage of the TSPC D-flip-flop (cf. figure 5). The first stage has been duplicated, having the same setup time as the NAND-flip-flop. The additional transistor M_B in the second stage performs the NAND operation of the two first stages. This transistor increases the propagation delay of this AND-OR-INVERT (AOI-) flip-flop slightly. The cycle time of an AOI pipeline stage with external gates is nearly 70% longer than the cycle time of this AOI-flip-flop.

Figure 5: Schematic of the AOI-flip-flop (drawn without circuitry for glitch-free outputs and low frequency)

Table I gives a summary of the performance of these logic-flip-flops. Although the clock load increases with more functionality, one can argue that the clock load and area consumption would be even bigger if the circuit were built using finest-grain pipelining with only conventional NAND gates and flip-flops.

Additionally, a flip-flop for the three-input function A + B C was designed. Although only three different logic-flip-flops (plus the D-flip-flop) are available, any logic gate can be implemented because complementary outputs are available. The logic-flip-flops have similar circuitry for glitch-free and low frequency operation as the TSPC D-flip-flop.

Even more logic can be merged into the TSPC D-flip-flop. For the bit-serial adder two logic flip-flops have been designed: one for the sum and one for the carry operation.
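The division of the bit-serial adder into a sum function and a carry function can be sketched behaviourally as follows. The text does not give the logic equations of the two flip-flops, so the standard full-adder equations are assumed here; the names `bit_serial_add`, `to_bits`, and `from_bits` are hypothetical helpers for this illustration, and the model processes one LSB-first bit pair per clock cycle without modelling circuit timing.

```python
# Behavioural sketch of a bit-serial adder (assumed full-adder equations).
def bit_serial_add(a_bits, b_bits, carry=0):
    """Add two LSB-first bit streams, one bit pair per clock cycle."""
    sum_bits = []
    for a, b in zip(a_bits, b_bits):
        sum_bits.append(a ^ b ^ carry)        # function of the sum flip-flop
        carry = (a & b) | (carry & (a ^ b))   # function of the carry flip-flop
    return sum_bits, carry


def to_bits(value, width):
    """Return the LSB-first bit list of an unsigned integer."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    """Rebuild an unsigned integer from an LSB-first bit list."""
    return sum(bit << i for i, bit in enumerate(bits))
```

For example, `bit_serial_add(to_bits(173, 8), to_bits(94, 8))` yields the eight sum bits and a final carry whose combined value equals 173 + 94. Only the carry has to be held from one cycle to the next, which is why a single stage suffices for operands of any length.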
TABLE I
Performance of logic-flip-flops (1)

Circuit    t_s [ns]   t_pd [ns]   # Trans.   Width (2) [µm]   CLK load (3) [fF]
D-FF       0.26       0.GS        16 (11)    29               142
NAND-FF    0.38       0.70        19 (14)    31               187
AOI-FF     0.36       0.78        30 (21)    50               273

(1) Numbers stem from extracted layout and typical SPICE simulations.
(2) All cells have the same height of 75 µm.
(3) 100 fF load at each output.

4 Implementation

The bit-serial and the 8-bit parallel adder have been implemented in a two-metal single-poly 1.0 µm self-aligned CMOS process. The logic-flip-flops for the 8-bit adder are full-custom cells, designed like standard cells to allow easy abutment. The size of each transistor has been optimized using SPICE simulations in an iterative process to achieve minimal delays and to satisfy the internal loads of the adder.

The input and output signals are fed via dedicated high-frequency pads including ESD protection up to 1.7 kV. To synchronize the input signals and to generate their complements a row of D-flip-flops has been added in front of the first stage.

In high-speed designs the clock signal must be distributed with special care. The clock skew must be controlled by carefully balancing the clock tree. The TSPC technique needs no inverted clock signal and all flip-flops in this design operate on the rising clock edge. The architecture of the 8-bit adder itself reduces the total clock load by about one third compared to a [...]

The clock signal is distributed through a four stage clock-tree (cf. fig. 6). Two central buffers feed the clock signal to each pipeline-stage, starting at the last stage, going back to the first stage. The clock buffers of each stage were sized according to the extracted capacitance of the clock inputs of all its flip-flops. Each flip-flop is laid out with the clock line in its middle, parallel to the power rails. Therefore not only the power supplies but also the clock signal can be readily connected by abutment, further simplifying the clock distribution.

Figure 6: Clock distribution tree. Numbers below buffers indicate channel widths of PMOS and NMOS (in µm)

Simulations show that the clock skew is below 80 ps and the rise times are around 250 ps. The area between pad ring and the circuit core is used for power supply blocking capacitors. Die size (including pads) is 3.3 mm², the adder itself occupying only 0.37 mm². Figure 7 shows a micrograph of the 8-bit adder.

Figure 7: Photomicrograph of the pipelined 8-bit adder

For test purposes an additional 8-bit adder block was implemented with a boundary scan structure. These scan-registers (= AOI-flip-flops) at the inputs and outputs allow testing with only a few fast input and output signals.

5 Measurements

The 8-bit adder was bonded into a high speed ceramic package. The bonded chip was tested on an HP83000 ASIC tester, which allows clock frequencies up to 666 MHz. A full set of pseudo-random test vectors was generated using the algorithm of linear feedback shift-registers. At 5 V supply voltage the adder is fully operational to the limit of the tester, consuming 130 mA. The boundary-scan adder version was used to determine the maximum operating frequency. Using high-frequency probes, a maximum frequency of 800 MHz was measured on the wafer, consuming 155 mA at 5 V. Even at 3 V supply voltage the maximum operating frequency is 480 MHz, consuming 48 mA.

Figure 8: Input and output waveforms at 666 MHz on the ASIC tester.

TABLE II
Performance of the 8-bit adder

Technology                   1 µm CMOS, two layer metal
Power dissipation            777 mW (@800 MHz, 5 V Vdd)
                             376 mW (@645 MHz, 4 V Vdd)
                             144 mW (@482 MHz, 3 V Vdd)
Die size                     3.3 mm², #Transistors: 2881
Area of core (w/o buffers)   0.37 mm², #Transistors: 1832
Latency                      6 cycles (7.5 ns @800 MHz)
Table II summarizes the performance of the 8-bit adder and figure 8 shows the measured waveforms on the ASIC tester at 666 MHz, 5 V supply, and 30°C.

The bit-serial adder was successfully tested on the wafer at over 1 GHz (limited by the data generator) with a supply of 5 V and 12.1 mA. The maximum operating frequency at a supply voltage of 3 V was 730 MHz, consuming 5.6 mA.

References

[1] A. P. Chandrakasan, S. Sheng, and R. W. Brodersen. Low-Power CMOS Digital Design. IEEE Journal of Solid-State Circuits, 27(4):473-484, April 1992.
[2] A. Tyagi. A Reduced-Area Scheme for Carry-Select Adders. IEEE Transactions on Computers, 42(10):1162-1170, October 1993.
[3] J. Yuan and C. Svensson. High-Speed CMOS Circuit Technique. IEEE Journal of Solid-State Circuits, 24(1):62-70, February 1989.
[4] D. W. Dobberpuhl and R. T. Witek et al. A 200 MHz 64-b Dual-Issue CMOS Microprocessor. IEEE Journal of Solid-State Circuits, 27(11):1555-1565, November 1992.
[5] R. Rogenmoser, Q. Huang, and F. Piazza. 1.57 GHz Asynchronous and 1.4 GHz Dual-Modulus 1.2 µm CMOS Prescalers. In Proceedings of the IEEE 1994 Custom Integrated Circuits Conference, pages 16.3.1-16.3.4, May 1994.

6 Conclusions

By combining advanced design techniques both at the circuit and at the architectural level the performance required for many high-speed computation-intensive applications can be achieved using standard CMOS technologies.

The incorporation of logic functions into storage elements reduces the cycle time of pipeline stages. Such an approach improves the speed/throughput substantially over that of a standard pipeline, where logic gates are realized between pipeline