4, APRIL 2008
Furthermore, the clock load driven by RF1 and RF2 was lower
than either of its conventionally clocked counterparts, latch-
or flip-flop-based. Consequently, the clock power reductions
achieved by RF1 and RF2 are even greater than 76% and 84%,
respectively, in comparison with their conventionally clocked
implementations.
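As a rough illustration of what these percentages mean in absolute terms, the following Python sketch combines the RF1 figures quoted in the conclusion of this paper (131 mW total power, 10.8% of it spent on the clock, 76% reduction over conventional switching of the same load). The function names are hypothetical, and the calculation is only a back-of-the-envelope reading of those numbers, not a measurement procedure described by the authors.

```python
def clock_power_mw(total_power_mw: float, clock_fraction: float) -> float:
    """Clock power, given total chip power and the fraction attributed to the clock."""
    return total_power_mw * clock_fraction


def conventional_equivalent_mw(resonant_power_mw: float, reduction: float) -> float:
    """Power a conventional driver would need for the same load,
    given that the resonant clock dissipates `reduction` less (0.76 means 76% less)."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return resonant_power_mw / (1.0 - reduction)


# RF1 figures taken from the conclusion of the paper.
rf1_clock_mw = clock_power_mw(131.0, 0.108)
rf1_conventional_mw = conventional_equivalent_mw(rf1_clock_mw, 0.76)
print(f"RF1 resonant clock power: {rf1_clock_mw:.2f} mW")
print(f"Same load, conventional switching: {rf1_conventional_mw:.2f} mW")
```

Under this reading, the RF1 clock dissipates about 14.1 mW, against roughly 59 mW for conventional switching of the same 42 pF load. Because the conventionally clocked implementations drive a larger clock load, their clock power would be higher still, which is the point made above.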
The remainder of this paper has six sections. In Section II,
we discuss the RCL methodology in the context of RF1 and
RF2 design. In Section III, we present the FIR filter architecture
and the test setup used in RF1 and RF2. The implementation
of RF1 is discussed in Section IV. In Section V, we present the
design of RF2. Experimental results for RF1 and RF2 are given
in Section VI. Conclusions are provided in Section VII.
Fig. 3. RCL methodology. (a) Latch timing and pipeline design. (b) Derived clock waveform for synthesis.
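The derivation in Fig. 3(b) can be sketched numerically. The Python sketch below assumes an ideal sinusoidal clock that swings from 0 to its peak voltage (an assumption made here for illustration, not a waveform specification from the paper) and computes how wide the derived square pulse is when its amplitude is set to a given fraction of that peak. The function names are hypothetical.

```python
import math


def derived_clock_duty(amplitude_fraction: float) -> float:
    """Fraction of the clock period during which an ideal 0-to-peak sinusoid
    is at or above `amplitude_fraction` of its peak voltage.

    Model (assumed): v(t) / v_peak = 0.5 * (1 + sin(theta)), so
    v >= k * v_peak  <=>  sin(theta) >= 2k - 1.
    """
    if not 0.0 <= amplitude_fraction <= 1.0:
        raise ValueError("amplitude_fraction must be in [0, 1]")
    return 0.5 - math.asin(2.0 * amplitude_fraction - 1.0) / math.pi


def derived_pulse_width_ps(freq_ghz: float, amplitude_fraction: float) -> float:
    """Width of the derived square pulse in picoseconds."""
    period_ps = 1000.0 / freq_ghz
    return derived_clock_duty(amplitude_fraction) * period_ps


# Derived amplitude 10% below the sinusoidal peak, at an assumed 1 GHz clock.
print(f"duty = {derived_clock_duty(0.9):.4f}")
print(f"pulse width = {derived_pulse_width_ps(1.0, 0.9):.1f} ps")
```

In this idealized model, a derived amplitude 10% below the peak gives a pulse covering about 20% of the period. Raising the derived amplitude narrows the pulse further, which is the tradeoff between characterized latch performance and time-borrowing.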
tools use square clock descriptions to perform timing analysis. We have thus extended the original timing analysis framework for level-sensitive latches described in [11] to encompass sinusoidal clock waveforms [12]. In our extended framework, a square clock waveform is derived from the original sinusoidal clock waveform so that it captures the operation of the resonant-clocked latches. Fig. 3(b) illustrates the derivation of this conventional clock waveform from a sinusoidal clock waveform driving a level-sensitive latch. The amplitude of the derived clock is set equal to the lowest voltage of the sinusoidal clock within the transparency window of the latch. Characterized latch performance can be improved by deriving a clock with a higher amplitude, resulting in limited time-borrowing due to the resulting narrower pulse width. Therefore, this derivation methodology results in a tradeoff between time-borrowing and characterized latch performance. For timing verification purposes, the derived clock waveform is conservative, i.e., if the circuit meets timing with the derived clock waveform, then it is guaranteed to do so with the sinusoidal clock waveform, even in the presence of feedback loops. In designing RF1 and RF2, the amplitude of the derived clock was chosen to be 10% lower than in the sinusoidal clock to allow for some time-borrowing without substantially penalizing characterized latch performance.
B. Clock Network Design
The clock distribution networks of RCL-based designs do not include any buffers. Consequently, the energy dissipation of clock signal distribution is due to the resistive losses in the clock wires. Wire sizing optimization therefore plays a significant role in clock power reduction.
Clock distribution network design in RF1 and RF2 was driven by three main objectives: 1) minimize clock power dissipation; 2) minimize clock skew; and 3) minimize voltage attenuation. For the sinusoidal shape of the resonant clock waveforms, all three objectives were served through the judicious minimization of resistance in the clock distribution network. The choice of a CMOS process with thick top-level metals was key for reducing clock network resistance. Specifically, we used a process with a 4-µm-thick top-level metal layer, providing a low-resistance interconnect for the clock distribution network. Using a process with such a thick top-level metal also enabled the design of high-Q inductors. To reduce network resistance further, wide clock wires were used in the distribution network. Wire sizing increases clock capacitance, however, causing higher current flow and, therefore, increased resistive losses in the network. The optimal choice of wire widths in the distribution network was determined empirically by the tradeoff between the resistance and the capacitance of the clock wires [12].
To generate the resonant clock tree in an automated fashion, we developed a framework which interacts with the place and route tool to generate programmable levels of an H-tree driving programmable levels of a clock grid. Both RF1 and RF2 utilize a two-level H-tree [15], driving a two-metal clock grid shielded by supply and ground rails on each side. Such a clock grid topology results in clock networks with decreased resistance and increased capacitance. Relying on a heuristic that determines optimal wiring capacitance for a given design, we generated and evaluated several possible alternative networks with different wire widths.
III. FIR FILTER ARCHITECTURE
This section describes the FIR filter architecture of RF1 and RF2. Fig. 4 gives a block diagram of the 14-tap, 8-bit transpose-type FIR filter with built-in self-test (BIST). To efficiently
SATHE et al.: RESONANT-CLOCK LATCH-BASED DESIGN 867
Fig. 5. H-LAT and L-LAT. (a) Schematics and interleaved implementation for single-phase datapaths. (b) Simulation waveforms for critical path and race
conditions.
(1)
Fig. 7. BLAT. (a) Schematics and interleaved implementation for single-phase datapaths. (b) Timing example for critical path and race conditions.
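The natural frequency referred to as (1) is not legible in this copy. The Python sketch below therefore assumes the standard lumped LC resonance expression, f0 = 1/(2π√(LC)), and uses it to estimate the inductance implied by the RF1 figures quoted in the conclusion (42 pF clock load, 1.03 GHz natural frequency). The function names are hypothetical, and the resulting inductance is an estimate under the ideal-LC assumption, not a value reported by the authors.

```python
import math


def natural_frequency_hz(inductance_h: float, capacitance_f: float) -> float:
    """Natural frequency of an ideal lumped LC tank: 1 / (2*pi*sqrt(L*C))."""
    return 1.0 / (2.0 * math.pi * math.sqrt(inductance_h * capacitance_f))


def inductance_for_frequency_h(freq_hz: float, capacitance_f: float) -> float:
    """Inductance that resonates with `capacitance_f` at `freq_hz` (ideal LC)."""
    return 1.0 / ((2.0 * math.pi * freq_hz) ** 2 * capacitance_f)


# RF1 figures from the conclusion: 42 pF clock load, 1.03 GHz natural frequency.
l_rf1 = inductance_for_frequency_h(1.03e9, 42e-12)
print(f"implied inductance: {l_rf1 * 1e9:.3f} nH")
print(f"round-trip frequency: {natural_frequency_hz(l_rf1, 42e-12) / 1e9:.3f} GHz")
```

Under these assumptions the implied inductance is roughly 0.57 nH. Parasitic resistance and any additional capacitance in the real network would shift this estimate.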
to be equal to the energy stored in the magnetic field that results from the current buildup in the inductor. Notice that, in this driven resonant clock oscillator, the frequency of the driving pulses determines the oscillating frequency of the resonant clock in the neighborhood of its natural frequency given by (1).
Replenishing energy in the inductor every cycle itself incurs energy dissipation, since the clock driver is driven by a cascade of conventionally switching buffers. The current buildup in the inductor also leads to resistive losses. Consequently, the optimal choice for switch widths is governed by the tradeoff between resistive losses in the driver switches (reduced by wider switches) and dynamic power dissipation incurred in driving the driver switches (reduced by smaller switches).
To explore regions of efficient clock generation, the driver switches were implemented with programmable widths in the ranges 0–950 µm and 0–630 µm. Pulse duty cycles were also programmable in the range 0%–50%. The pulses were derived from a programmable ring oscillator, which enabled frequency tuning around the resonant frequency.
The use of pull-up switches in the clock generator also provides the flexibility of operating the rail driver without an additional power supply. In this configuration, the switch delivering the additional power supply is opened, and the on-chip decap is used to hold the dc voltage of the oscillating clock. Energy is supplied to the resonant system using only the main power supply.
V. RF2 DESIGN AND IMPLEMENTATION
This section presents the salient aspects of RF2 design, including latch, clock network, and pipeline design. RF2 is functionally identical to RF1, with both designs synthesized from the same Verilog description. Key differences between RF1 and RF2 lie in the latching scheme employed and the implementation of the resonant clocks. Whereas RF1 is a single-phase latch-based design, RF2 uses a more robust two-phase nonoverlapping clocking scheme. In contrast to RF1, which uses a driven oscillator to generate the single-phase resonant clock, RF2 deploys a self-resonating “blip” generator [14]. The different clocking schemes used in the two designs lead to differences in latch design, robustness to race conditions, and energy efficiency in the clock generators.
A. Latch Design
As in RF1, latch design plays a critical role in enabling RF2 to achieve robust, energy-efficient operation at GHz-class operating frequencies. Fig. 7(a) shows circuit schematics of BLAT, the resonant-clocked level-sensitive latch used in RF2. Like H-LAT and L-LAT, BLAT is a Svensson latch implementation [17], modified for low D-to-Q delay. Post-layout simulations of B-LAT-X2 with 15 fF load capacitance result in a D-to-Q delay of 97 ps. Similar to H-LAT and L-LAT, the absence of clock buffers or precharging techniques results in zero dynamic power dissipation in the latch if input data does not toggle. When the latch output toggles, the energy dissipation in the latch is 23 fJ. The cross-coupled nMOS devices shown in the B-LAT schematics are part of the distributed clock generator and are discussed in Section V-B.
Fig. 7(b) shows waveforms obtained from simulations of an example pipeline stage in RF2 at 1 GHz. Similar to RF1, the pipeline is designed so that data on the critical path arrives at the latch when the latching clock is near its peak voltage, resulting in a low D-to-Q delay. The two-phase nonoverlapping clocks deployed in RF2 yield race immunity. Simulations of a B-LAT pipeline stage with zero logic delay at the FF process corner (fast NMOS, fast PMOS) indicate that the design is immune to race conditions with clock skews of up to 350 ps. Since the estimated clock skew in RF2 is less than 10 ps, as obtained from post-layout simulations, hold constraints are comfortably satisfied.
B. Clock Design
The clock generator used in RF2 is a self-resonating oscillator similar to that implemented by [14]. The motivation behind the
870 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 43, NO. 4, APRIL 2008
An essential practical requirement of any synchronous design is the ability to clock the design at a predetermined frequency, regardless of variations in the fabrication process. Since the frequency of a fully self-resonating clock generator such as the one used in RF2 cannot be tuned, it is important that frequency variation arising from the fabrication process be kept to a minimum. To determine the variation of the resonant frequency of RF2 across multiple chips, the resonant frequencies of ten randomly selected chips were measured. All ten chips were operational, with average resonant frequency 1.012 GHz and relative standard deviation 0.012. For such a relatively low variation in resonant frequency, injection-locking may be an attractive option for setting the oscillation frequency [7].
VII. CONCLUSION
This paper describes RF1 and RF2, two level-clocked ASIC designs that deploy resonant clocking to drive the latches in their clock distribution networks with substantially reduced power dissipation over conventional switching. Fabricated in a commercial 0.13-µm bulk silicon process, the two test-chips have been designed using RCL, a general latch-based design methodology for energy-efficient high-performance resonant-clocked chips. The RCL methodology “resonates” the entire clock network, recovering energy from all clock-related switching capacitance and maximizing clock efficiency. Clock distribution is performed through a metal-only network without clock buffers, resulting in a sinusoidal clock waveform that directly drives the latches at the leaves of the clock network. High performance is achieved through the use of appropriately designed level-sensitive latches whose timing characteristics depend on the voltage level, rather than the slew rate, of the clock waveform.
RF1 is a single-phase 8-bit 14-tap FIR filter with an on-chip inductor that operates at frequencies of up to 1.2 GHz. Frequency-scaled operation in RF1 is achieved by forcing the clock network to oscillate off resonance, without altering its circuit parameters. At its natural frequency of 1.03 GHz, RF1 dissipates 143 nW/MHz/Tap/InBit/CoeffBit, with clock power accounting for only 10.8% of the 131 mW total power dissipation. In comparison with conventional switching, RF1 dissipates 76% less power for driving its 42 pF clock load.
RF2 has been designed using the same Verilog description as the one used for synthesizing RF1. Relying on a two-phase nonoverlapping resonant-clocked latching scheme, RF2 achieves robust and energy-efficient operation at its resonant frequency of 1.01 GHz. The distributed clock generator used in RF2 simplifies design and provides improved local clock skew control. By including the clock drivers into the resonant system, RF2 achieves greater clock power efficiency than RF1. Resonating 38 pF of clock per phase at 1.01 GHz, RF2 dissipates 84% less power than conventional switching. At 133 nW/MHz/Tap/InBit/CoeffBit, RF2 features the lowest figure of merit for digital FIR filters published to date.
Both RF1 and RF2 have the same latency, throughput, and latch counts as their conventional counterparts obtained from the same Verilog code. In general, resonant-clocked pipelines designed using the RCL methodology do not place any additional requirements on their combinational logic over conventional design. They are thus amenable to all nonclock-related power reduction techniques that can be applied to conventional designs, including gate sizing, multithreshold voltage assignment, power gating, and dynamic voltage scaling. Therefore, resonant clocking can be used to achieve further power savings, in addition to the power reductions already achievable by conventional logic optimization approaches.
ACKNOWLEDGMENT
The authors would like to thank C. Tokunaga for his invaluable assistance with design. They would also like to thank the anonymous reviewers for their constructive comments.
REFERENCES
[1] C. H. Ziesler, J. Kim, V. S. Sathe, and M. C. Papaefthymiou, “A 225 MHz resonant clocked ASIC chip,” in Proc. ISLPED, Aug. 2003, pp. 48–53.
[2] V. S. Sathe, J. C. Kao, and M. C. Papaefthymiou, “RF2: A 1GHz FIR filter with distributed resonant clock generator,” in Symp. VLSI Circuits Dig. Tech. Papers, Jun. 2007, pp. 44–45.
[3] V. S. Sathe, J. C. Kao, and M. C. Papaefthymiou, “A 0.8–1.2 GHz single-phase resonant-clocked FIR filter with level-sensitive latches,” in Proc. CICC, Sep. 2007, pp. 583–586.
[4] W. Athas, N. Tzartzanis, L. Svensson, and L. Peterson, “A low-power microprocessor based on resonant energy,” IEEE J. Solid-State Circuits, vol. 32, no. 11, pp. 1693–1701, Nov. 1997.
[5] S. Chan, P. Restle, and K. Shepard, “A 4.6 GHz resonant global clock distribution network,” in IEEE ISSCC Dig. Tech. Papers, Feb. 2004, pp. 42–43.
[6] A. J. Drake, K. J. Nowka, T. Y. Nguyen, J. L. Burns, and R. B. Brown, “Resonant clocking using distributed parasitic capacitance,” IEEE J. Solid-State Circuits, vol. 39, no. 9, pp. 1520–1528, Sep. 2004.
[7] S. Chan, P. Restle, and K. Shepard, “Distributed differential oscillators for global clock networks,” IEEE J. Solid-State Circuits, vol. 41, no. 9, pp. 2083–2094, Sep. 2006.
[8] M. Hansson, B. Mesgarzadeh, and A. Alvandpour, “1.56 GHz on-chip resonant clocking in 130 nm CMOS,” in Proc. CICC, Sep. 2006, pp. 241–244.
[9] P. Restle, “A clock distribution network for microprocessors,” IEEE J. Solid-State Circuits, vol. 36, no. 5, pp. 792–799, May 2001.
[10] T. G. Szymanski, “LEADOUT: A static timing analyzer for MOS circuits,” in Proc. ICCAD, Nov. 1986, pp. 130–133.
[11] K. A. Sakallah, T. N. Mudge, and O. A. Olukotun, “checkTc and minTc: Timing verification and optimal clocking of synchronous digital circuits,” in Proc. ICCAD, Nov. 1990, pp. 552–555.
[12] V. S. Sathe, “Hybrid resonant-clocked digital design,” Ph.D. dissertation, Univ. of Michigan, Ann Arbor, May 2007.
[13] MOSIS. [Online]. Available: http://www.mosis.com/products/fab/vendors/ibm/8rf dm/
[14] W. C. Athas, N. Tzartzanis, and L. J. Svensson, “A resonant signal driver for two-phase, almost-non-overlapping clocks,” in Proc. ISCAS, May 1996.
[15] P. J. Restle and A. Deutsch, “Designing the best clock distribution network,” in VLSI Circuits Symp. Dig., Jun. 1998, pp. 2–5.
[16] N. Weste and D. Harris, CMOS VLSI Design, 3rd ed. Reading, MA: Addison-Wesley.
[17] J. Yuan and C. Svensson, “High speed CMOS technique,” IEEE J. Solid-State Circuits, vol. 24, no. 1, pp. 62–70, Feb. 1989.
[18] R. B. Staszewski, K. Muhammad, and P. Balsara, “A 500-Msample/s 8-tap FIR digital filter for magnetic recording read channels,” IEEE J. Solid-State Circuits, vol. 35, no. 8, pp. 1205–1210, Aug. 2000.
[19] S. Rylov, “A 2.3Gsample/s 10-tap digital FIR filter for magnetic recording read channels,” in IEEE ISSCC Dig. Tech. Papers, Feb. 2001, pp. 190–191.
[20] J. Park, “Computation sharing programmable FIR filter for low-power and high-performance applications,” IEEE J. Solid-State Circuits, vol. 39, no. 2, pp. 348–357, Feb. 2004.
Visvesh S. Sathe (M’02) received the B.Tech. degree in electrical engineering from the Indian Institute of Technology, Bombay, in 2001, and the M.S. and Ph.D. degrees in electrical engineering and computer science from the University of Michigan, Ann Arbor, in 2004 and 2007, respectively.
While at the University of Michigan, his research focused on low-energy circuit design with a particular emphasis on resonant-clocked digital design. He has held internship positions at the IBM T. J. Watson Research Center and Cyclos Semiconductor. In 2007, he joined the Advanced Power Technology Group, Advanced Micro Devices, Fort Collins, CO, as a Senior Design Engineer. His current work focuses on the exploration and implementation of power reduction techniques for microprocessors.
Jerry C. Kao (M’04) received the B.S. degree in electrical engineering from Columbia University, New York, NY, in 2000, and the M.S. degree in electrical engineering and computer science from the University of Michigan, Ann Arbor, in 2002. He is currently working toward the Ph.D. degree at the University of Michigan.
From 2002 to 2005, he was with IBM, Rochester, MN, where he was involved in the design of the CELL processor and the XBOX 360 processor. His research interests include high-performance and low-power circuit technologies and design methodologies.
Marios C. Papaefthymiou (M’93–SM’02) received the B.S. degree in electrical engineering from the California Institute of Technology, Pasadena, in 1988 and the S.M. and Ph.D. degrees in electrical engineering and computer science from the Massachusetts Institute of Technology, Cambridge, in 1990 and 1993, respectively.
After a three-year term as an Assistant Professor at Yale University, he joined the University of Michigan, Ann Arbor, where he currently is a Professor of Electrical Engineering and Computer Science and Director of the Advanced Computer Architecture Laboratory. He is also cofounder and Chief Scientist of Cyclos Semiconductor, a start-up company commercializing low-power devices. His research interests encompass algorithms, architectures, and circuits for energy-efficient high-performance VLSI systems. He is also active in the field of parallel and distributed computing.
Dr. Papaefthymiou was the recipient of the ARO Young Investigator Award, the National Science Foundation CAREER Award, and a number of IBM Partnership Awards. Furthermore, together with his students, he has received a Best Paper Award at the 32nd ACM/IEEE Design Automation Conference and the First Prize (Operational Category) in the VLSI Design Contest of the 38th ACM/IEEE Design Automation Conference. He has served multiple terms as an Associate Editor for the IEEE TRANSACTIONS ON THE COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS, IEEE TRANSACTIONS ON COMPUTERS, and IEEE TRANSACTIONS ON VLSI SYSTEMS. He has served as the General Chair and as the Technical Program Chair for the ACM/IEEE International Workshop on Timing Issues in the Specification and Synthesis of Digital Systems. He has also participated several times in the Technical Program Committee of the IEEE/ACM International Conference on Computer-Aided Design.