
864 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 43, NO. 4, APRIL 2008

Resonant-Clock Latch-Based Design


Visvesh S. Sathe, Member, IEEE, Jerry C. Kao, Member, IEEE, and Marios C. Papaefthymiou, Senior Member, IEEE

Abstract—This paper describes RF1 and RF2, two level-clocked test-chips that deploy resonant clocking to reduce power consumption in their clock distribution networks. It also highlights RCL, a novel resonant-clock latch-based methodology that was used to design the two test-chips. RF1 and RF2 are 8-bit 14-tap finite-impulse response (FIR) filters with identical architectures. Designed using a fully automated ASIC design flow, they have been fabricated in a commercial 0.13 µm bulk silicon process. RF1 operates at clock frequencies in the 0.8–1.2 GHz range and uses a single-phase clocking scheme with a driven clock generator. Resonating its 42 pF clock load at 1.03 GHz with Vdd = 1.13 V, RF1 dissipates 132 mW, achieving a clock power reduction of 76% over conventional switching. RF2 achieves higher clock power efficiency than RF1 by relying on a two-phase clocking scheme with a distributed self-resonant clock generator. Resonating 38 pF of clock load per phase at 1.01 GHz with Vdd = 1.08 V, RF2 dissipates 124 mW and achieves 84% reduction in clock power over conventional switching. At 133 nW/MHz/Tap/InBit/CoeffBit, RF2 features the lowest figure of merit for FIR filters published to date.

Index Terms—Digital signal processing, low-power VLSI.

Manuscript received August 23, 2007; revised November 10, 2007. This work was supported in part by the U.S. Army Research Office under Grant DAADA19-03-1-0122. Fabrication was provided through MOSIS.
V. S. Sathe is with Advanced Micro Devices, Fort Collins, CO 80528 USA.
J. C. Kao and M. C. Papaefthymiou are with the Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI 48109 USA (e-mail: marios@umich.edu).
Digital Object Identifier 10.1109/JSSC.2008.917501
0018-9200/$25.00 © 2008 IEEE

I. INTRODUCTION

CLOCK POWER remains a major contributor to dynamic power dissipation in high-performance VLSI designs. Relying on inductance to efficiently resonate the capacitance of the entire clock distribution network, resonant clocking is a promising approach to the design of clock networks with substantially reduced power dissipation [1]–[3].

A simple example of a resonant-clock design is given in Fig. 1. In this example, the logic gates are implemented using conventionally switching CMOS. The clock signal is a sinusoidal waveform that is generated by setting up a resonant oscillation between an inductive element and the parasitic capacitance of the clock distribution network. A clock generator (shown as a single transistor) is used to sustain the oscillation in this tank by periodically replenishing resistive energy losses in the clock distribution network.

Fig. 1. Resonant-clocked pipeline example.

Previous work in resonant-clock digital design [4], [1], [5]–[8] has focused on the implementation of resonant clock networks driving flip-flops, adiabatic or otherwise. While the use of flip-flops in sequential designs greatly simplifies design and verification, this practice sacrifices performance and efficiency. Specifically, the relatively slow rise times of sinusoidal clocks degrade the clock-to-output times of the flip-flops. Moreover, although the deployment of adiabatic flip-flops with a buffer-less clock network enables resonating the capacitance of the entire clock distribution network [4], [1], these flip-flops require devices in series with the load capacitance, thus degrading overall system efficiency.

To achieve energy-efficient clocking at relatively high frequencies, recent resonant-clock implementations avoided the use of adiabatic techniques, relying on conventional master–slave flip-flops instead. In one such system, where capacitance is provided by the distributed clock wiring capacitance and the input capacitance of the flip-flop clock inputs, the increased clock slew of the sinusoidal clocks significantly degrades the timing parameters of the flip-flops [6].

To prevent performance degradation due to poorly slewing sinusoidal clocks, buffers may be introduced to generate clock waveforms with improved slew characteristics. Chan et al. have demonstrated efficient resonance on a global resonant-clock network, achieving significant jitter and skew reduction [5], [7]. Hansson et al. have implemented a resonant-clock network driving flip-flops using clock buffers [8]. In this case, the poor slew of the sinusoidal clocks causes significant short-circuit power dissipation. Furthermore, clock buffers isolate the local clock distribution from the resonant system, thus limiting the amount of capacitance being resonated. Since the dominant portion of clock-related power dissipation lies in the local clock distribution [9], such use of clock buffers limits clock efficiency.

In this paper, we investigate the deployment of resonant clocking in conjunction with level-sensitive latches. Unlike flip-flops, which rely on sharp clock edges for effective operation, latch performance is determined primarily by the voltage level of the clock waveform. Moreover, latch-based designs have the potential to achieve higher performance than flip-flop-based designs, because the transparent phase in the operation of level-sensitive latches allows data to ripple through latch boundaries and enables time-borrowing across logic stages [10], [11]. The proposed resonant-clocked latch-based (RCL) design methodology relies on metal-only networks to distribute the clock signal all the way to the clock inputs of
the level-sensitive latches without the use of any clock buffers. Substantial clock power reductions can thus be achieved by resonating the capacitance of the entire clock distribution network. To support automated design, RCL relies on a static timing analysis framework that has been developed for latch-based designs with nonideal clocks such as sinusoidal resonant clocks [12].

To demonstrate the energy efficiencies attainable by RCL design, we used the RCL methodology in a fully automated ASIC flow to design two test-chips, called RF1 and RF2. The two designs were fabricated in a 0.13 µm CMOS process with Cu interconnect [13]. Fig. 2 shows a microphotograph of the die containing RF1 and RF2. Both designs are 8-bit 14-tap transpose finite-impulse response (FIR) filters with identical architectures. The differences in the implementation lie in latch design, clocking scheme, and clock generator design. Specifically, RF1 is a single-phase 0.8–1.2 GHz, frequency-tunable latch-based design. In RF1, the clock accounts for 10.7% of total power while driving 42 pF of clock load, corresponding to a 76% reduction in switching power compared with an identical conventionally switched clock load.

Fig. 2. Microphotograph of RF1 and RF2.

RF2 uses a two-phase distributed self-resonant clock generator that relies on a blip generator topology to achieve efficient resonance [14]. Together with appropriate latch design, the two phases achieve robust operation through nonoverlapping durations of latch transparency. The resonant clock drivers in RF2 are part of the resonant domain, resulting in increased efficiencies. At its resonant frequency of 1.01 GHz and while driving 38 pF of clock load per clock phase, RF2 achieves 84% clock power reduction over an identical, conventionally-switching clock load. RF2 achieves the lowest energy-per-computation figure of merit (normalized for filter size) reported for FIR filters to date.

To compare the resonant-clocked implementations of the FIR filter architecture that was used to obtain RF1 and RF2 with their conventional-clock alternatives, we performed separate synthesis runs to obtain a conventional latch-based and a conventional flip-flop-based implementation with the same target clock frequency. In comparison with their latch-based counterparts, RF1 and RF2 had identical latency and latch counts. Furthermore, the clock load driven by RF1 and RF2 was lower than that of either of their conventionally clocked counterparts, latch- or flip-flop-based. Consequently, the clock power reductions achieved by RF1 and RF2 are even greater than 76% and 84%, respectively, in comparison with their conventionally clocked implementations.

The remainder of this paper has six sections. In Section II, we discuss the RCL methodology in the context of RF1 and RF2 design. In Section III, we present the FIR filter architecture and the test setup used in RF1 and RF2. The implementation of RF1 is discussed in Section IV. In Section V, we present the design of RF2. Experimental results for RF1 and RF2 are given in Section VI. Conclusions are provided in Section VII.

II. DESIGN METHODOLOGY

In this section, we discuss the RCL methodology in the context of the fully automated design of RF1 and RF2.

Fig. 3(a) gives an example of a CMOS pipeline implemented using the RCL methodology. This pipeline uses conventional CMOS combinational logic gates, level-sensitive latches, and a two-phase resonant clocking scheme. Robustness to race conditions is achieved by ensuring nonoverlapping transparency windows of consecutive latches. Due to the sinusoidal nature of the clock waveform, which directly drives transistors in the latch, the delay between the data arrival time at the input of a latch and the data departure time from the output of the latch (the data-to-output delay) varies with data arrival time, as shown in Fig. 3(a). Pipeline design ensures that critical signals arrive at the latches when the clock signal is at its peak, providing clocked transistors with maximum gate overdrive and yielding a region of low data-to-output delay. Pipeline performance is therefore similar to that obtained with conventional square clocks.

The data-to-output delays of the level-sensitive latches used in RF1 and RF2 exhibit timing monotonicity, i.e., early arriving signals at a latch input depart from the output of that latch earlier than later arriving signals. Timing monotonicity ensures that the latch output departure times of noncritical signals do not exceed those of critical ones, which is a key property for performing timing verification in systems with level-sensitive latches [11], [12].

Another characteristic of the sinusoidally clocked latches used in RF1 and RF2 is that their transparency windows are wider than the region of low data-to-output delay, as shown by the annotated clock waveform in Fig. 3(a). Therefore, it is possible for data arriving before the region of low data-to-output delay to ripple through the latch ahead of time, possibly causing race violations in the next pipeline stage. Such race possibilities are avoided through appropriate clock and latch design, as discussed in Sections IV and V.

A. ASIC Design Flow

Design entry for RF1 and RF2 was performed using synthesizable Verilog. Simulation-based verification of the architecture was carried out using Cadence NC-Verilog. Synthesis was performed using Synopsys Design Compiler. Physical design of the synthesized netlist was carried out using Cadence SOC-Encounter.

The sinusoidal shape of resonant-clock waveforms poses a challenge to the use of commercial synthesis tools, since such
tools use square clock descriptions to perform timing analysis. We have thus extended the original timing analysis framework for level-sensitive latches described in [11] to encompass sinusoidal clock waveforms [12]. In our extended framework, a square clock waveform is derived from the original sinusoidal clock waveform so that it captures the operation of the resonant-clocked latches. Fig. 3(b) illustrates the derivation of this conventional clock waveform from a sinusoidal clock waveform driving a level-sensitive latch. The amplitude of the derived clock is set equal to the lowest voltage of the sinusoidal clock within the transparency window of the latch. Characterized latch performance can be improved by deriving a clock with a higher amplitude, but the resulting narrower pulse width limits time-borrowing. Therefore, this derivation methodology results in a tradeoff between time-borrowing and characterized latch performance. For timing verification purposes, the derived clock waveform is conservative, i.e., if the circuit meets timing with the derived clock waveform, then it is guaranteed to do so with the sinusoidal clock waveform, even in the presence of feedback loops. In designing RF1 and RF2, the amplitude of the derived clock was chosen to be 10% lower than in the sinusoidal clock to allow for some time-borrowing without substantially penalizing characterized latch performance.

Fig. 3. RCL methodology. (a) Latch timing and pipeline design. (b) Derived clock waveform for synthesis.

B. Clock Network Design

The clock distribution networks of RCL-based designs do not include any buffers. Consequently, the energy dissipation of clock signal distribution is due to the resistive losses in the clock wires. Wire sizing optimization therefore plays a significant role in clock power reduction.

Clock distribution network design in RF1 and RF2 was driven by three main objectives: 1) minimize clock power dissipation; 2) minimize clock skew; and 3) minimize voltage attenuation. For the sinusoidal shape of the resonant clock waveforms, all three objectives were served through the judicious minimization of resistance in the clock distribution network. The choice of a CMOS process with thick top-level metals was key for reducing clock network resistance. Specifically, we used a process with a 4-µm-thick top-level metal layer, providing a low-resistance interconnect for the clock distribution network. Using a process with such a thick top-level metal also enabled the design of high-Q inductors. To reduce network resistance further, wide clock wires were used in the distribution network. Wire sizing increases clock capacitance, however, causing higher current flow and, therefore, increased resistive losses in the network. The optimal choice of wire widths in the distribution network was determined empirically by the tradeoff between the resistance and the capacitance of the clock wires [12].

To generate the resonant clock tree in an automated fashion, we developed a framework which interacts with the place and route tool to generate programmable levels of an H-tree driving programmable levels of a clock grid. Both RF1 and RF2 utilize a two-level H-tree [15], driving a two-metal clock grid shielded by supply and ground rails on each side. Such a clock grid topology results in clock networks with decreased resistance and increased capacitance. Relying on a heuristic that determines optimal wiring capacitance for a given design, we generated and evaluated several possible alternative networks with different wire widths.

III. FIR FILTER ARCHITECTURE

This section describes the FIR filter architecture of RF1 and RF2. Fig. 4 gives a block diagram of the 14-tap, 8-bit transpose-type FIR filter with built-in self-test (BIST).
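The transpose-type structure named above can be illustrated with a short behavioural model. The following Python sketch is an illustration under stated assumptions, not the RF1/RF2 datapath: the function names `transpose_fir` and `direct_fir` and the integer test data are hypothetical, and the pipelined multipliers, 4:2 compressors, and BIST logic of the actual design are not modelled. Each input sample is broadcast to all taps, multiplied by the tap coefficient, and added to the delayed partial sum arriving from the neighbouring tap.

```python
def transpose_fir(samples, coeffs):
    """Behavioural model of a transpose-form FIR filter (illustrative only)."""
    n_taps = len(coeffs)
    # partial[k] holds the delayed partial sum feeding tap k; the last entry
    # stays zero because nothing feeds the last tap.
    partial = [0] * n_taps
    outputs = []
    for x in samples:
        # The input sample is broadcast to every tap and premultiplied.
        products = [c * x for c in coeffs]
        # Tap 0 adds its product to the partial sum accumulated by later taps.
        outputs.append(products[0] + partial[0])
        # Each register captures the next tap's product plus that tap's
        # delayed partial sum, moving the sums one tap closer to the output.
        partial = [products[k + 1] + partial[k + 1]
                   for k in range(n_taps - 1)] + [0]
    return outputs


def direct_fir(samples, coeffs):
    """Reference form: y[n] = sum over k of coeffs[k] * x[n - k]."""
    return [
        sum(c * samples[n - k] for k, c in enumerate(coeffs) if n - k >= 0)
        for n in range(len(samples))
    ]


coeffs = list(range(1, 15))                      # 14 hypothetical coefficients
samples = [3, -1, 4, 1, -5, 9, 2, 6] + [0] * 14  # hypothetical input samples
assert transpose_fir(samples, coeffs) == direct_fir(samples, coeffs)
```

The comparison against `direct_fir` only checks that the transpose form computes the same convolution; it says nothing about the timing or word widths of the fabricated filters.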
To efficiently balance logic delays between different stages in the sequential circuit, a transpose filter implementation was used, where data is premultiplied in each of the taps with the corresponding coefficient. The transmission of data from the source to the filter taps distributed across the filter core is carried out by the data-broadcast block. Long latencies involved in broadcasting the data to all filter taps were addressed by pipelining the interconnect in the data-broadcast block so as to achieve high throughput. In each tap, pipelined multipliers scale the input data according to the programmable coefficients of the filter. The data-coefficient product obtained from each tap is merged with cycle-delayed products from the previous taps using 4:2 compressors [16]. The final vector merge addition is performed using a carry-save adder. The BIST block generates a pseudorandom input sequence for the multiplier. Filter output data are also compressed in the BIST block using a signature analyzer. A state machine enables the BIST block to capture the state of the signature analyzer at the end of a user-defined number of cycles.

Fig. 4. FIR filter architecture.

IV. RF1 DESIGN

Important aspects pertaining to the implementation of RF1 such as latch design, pipeline timing and design, and clock system design are discussed in this section.

A. Latch Design

RF1 is a single-phase level-sensitive latch design. Since the entire clock network is buffer-less, latch design plays a central role in the robust and high-performance operation of RF1. Latch requirements include: 1) the design of true single-phase latches; 2) low latch delay despite sinusoidal clock waveforms with poor slew; and 3) avoiding crowbar current in latches due to gradually transitioning sinusoidal clocks. Although the use of latches in conventionally clocked designs is well known, the benefits and challenges from their use in resonant-clocked datapaths have yet to be explored.

To operate with a single clock phase, RF1 pipelines consist of interleaved latches that become transparent during opposite polarities of the resonant clock. Fig. 5(a) shows circuit schematics for the level-sensitive high (H-LAT) and low (L-LAT) latches in RF1. H-LAT and L-LAT are Svensson latch implementations [17] that have been optimized for low latch delay by placing the data pins closer to the output. From post-layout simulations, H-LAT (L-LAT) achieves a latch delay of 99 ps (106 ps) with a total dissipation of 24 fJ (22 fJ) per toggle while driving 15 fF of load. Since the RCL methodology does not rely on precharging or clock buffers within the latch, clock-related power dissipation in RF1 occurs only in the clock generator and clock distribution network. For the same reason, the dynamic dissipation within the latch is zero for a clock cycle with no input data toggle. Connections in H-LAT (L-LAT) are made so that the pull-down (pull-up) clocked transistor is in series with a complementary logic stack. Thus, data input slews limit the otherwise significant crowbar current due to large rise and fall times of the clock.

B. Timing

Fig. 5(b) shows waveforms obtained from simulations of interleaved latches in possible race and critical path scenarios at 1 GHz. In the race scenario, data latched by H-LAT during the rising transition of the clock does not race through L-LAT during the same transition. Instead, L-LAT latches the data only in the subsequent transparency window. From 1 GHz post-layout simulations of RF1 at the fast process corner (fast NMOS, fast PMOS), the race setup in Fig. 5(b) with zero logic delay is immune to race for clock skews up to 125 ps. Through Hspice simulation of the extracted clock distribution network with parasitics, the insertion delay of the network was estimated to be 10 ps. Given that clock skew is bounded by the insertion delay of the clock network, RF1 comfortably satisfies the clock skew requirement.

Despite the apparent “overlap” between the transparency windows of the two latches, RF1 comfortably maintains race immunity due to the increased difference between the arrival time of the clock at the latch and the departure time of data from the output of the latch (the clock-to-output delay). This increased clock-to-output delay results from the low clock amplitude in the overlapping regions of transparency. Fig. 5(b) also shows that, consistent with the RCL methodology, pipeline signals on the critical path are designed to arrive at H-LAT (L-LAT) while the resonant clock is near its peak (trough). They thus provide maximum gate overdrive to clocked latch transistors, enabling low latch delays and high operating frequencies.

C. Clock Design

Fig. 6 shows the clock generator used in RF1. This clock generator consists of a ring oscillator, a pulse generator, and a resonant clock driver similar to that used in [1]. The clock driver periodically replenishes the energy losses in the resonant system through current injection in the inductor at the natural frequency of the design.

Unlike previous implementations of the clock drivers, the driver in RF1 uses on-chip decoupling capacitance (decap) to mitigate the effects of bondwire, package, and board-trace inductance and resistance. The clock rail driver operates by using a 0.6 nH integrated inductor to achieve efficient resonance.
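As a rough plausibility check on these component values, the ideal resonance frequency of a lossless LC tank, f0 = 1/(2π√(LC)), can be evaluated for the 0.6 nH inductor and the 42 pF clock load reported for RF1. This is only a sketch: it assumes the clock load is the only capacitance in the tank and ignores damping, so it is not the paper's expression (1), and the function name `lc_resonant_frequency` is hypothetical.

```python
import math


def lc_resonant_frequency(inductance_h, capacitance_f):
    """Ideal (undamped) LC resonance frequency in Hz."""
    return 1.0 / (2.0 * math.pi * math.sqrt(inductance_h * capacitance_f))


# Values reported for RF1: 0.6 nH integrated inductor, 42 pF clock load.
f0 = lc_resonant_frequency(0.6e-9, 42e-12)
print(f"{f0 / 1e9:.2f} GHz")  # prints "1.00 GHz"
```

Under these simplifying assumptions the estimate lands close to the 1.03 GHz resonant frequency reported for RF1; the remaining difference is outside what this idealized model can explain.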
Fig. 5. H-LAT and L-LAT. (a) Schematics and interleaved implementation for single-phase datapaths. (b) Simulation waveforms for critical path and race conditions.

Fig. 6. RF1 clock generator schematics.

The inductor resonates with the distributed parasitic capacitance of the clock network. Energy dissipation in the resistance of the network is replenished by the rail driver. As the clock approaches its minimum, the pull-down pulse causes the pull-down switch to conduct, discharging the output clock voltage to 0 V and causing a current buildup in the inductor. At the falling edge of the pull-down pulse, the system continues oscillating freely, with initial conditions given by the clock voltage and the inductor current at that instant, at its natural frequency defined as

[Eq. (1): the expression is missing from the extracted text]   (1)

where the symbols defined for (1) are the damping factor of the network and the current flowing in the inductor at the falling edge of the pull-down pulse. As the clock reaches its peak, the pull-up pulse causes the pull-up switch to conduct, resulting in a similar current buildup in the inductor. At the rising edge of the pull-up pulse, the system once again resumes a free oscillation at its natural frequency, with initial conditions given by the clock voltage and the current flowing in the inductor at that time. The current buildup in the inductor at the crest and trough of the clock enables the supply to periodically provide energy to the system, which is stored in the magnetic field of the inductor. The amount of current required to maintain stable oscillations is governed by the equation

[Eq. (2): the expression is missing from the extracted text]   (2)

where the energy term is the energy dissipation in the resonant network during the last cycle with the desired clock amplitude. This equation is obtained by setting the per-cycle energy dissipation
to be equal to the energy stored in the magnetic field that results from the current buildup in the inductor. Notice that, in this driven resonant clock oscillator, the frequency of the two driver pulses determines the oscillating frequency of the resonant clock in the neighborhood of its natural frequency given by (1).

Replenishing energy in the inductor every cycle itself incurs energy dissipation, since the clock driver is driven by a cascade of conventionally switching buffers. The current buildup in the inductor also leads to resistive losses. Consequently, the optimal choice for switch widths is governed by the tradeoff between resistive losses in the driver switches (reduced by wider switches) and dynamic power dissipation incurred in driving the driver switches (reduced by smaller switches).

To explore regions of efficient clock generation, the driver switches were implemented with programmable widths in the ranges 0–950 µm and 0–630 µm. Pulse duty cycles were also programmable in the range 0%–50%. The two driver pulses were derived from a programmable ring oscillator, which enabled frequency tuning around the resonant frequency.

The use of pull-up switches in the clock generator also provides the flexibility of operating the rail driver without an additional power supply. In this configuration, the switch delivering that additional power supply is opened, and the on-chip decap is used to hold the dc voltage of the oscillating clock. Energy is then supplied to the resonant system using only the remaining power supply.

V. RF2 DESIGN AND IMPLEMENTATION

This section presents the salient aspects of RF2 design, including latch, clock network, and pipeline design. RF2 is functionally identical to RF1, with both designs synthesized from the same Verilog description. Key differences between RF1 and RF2 lie in the latching scheme employed and the implementation of the resonant clocks. Whereas RF1 is a single-phase latch-based design, RF2 uses a more robust two-phase nonoverlapping clocking scheme. In contrast to RF1, which uses a driven oscillator to generate the single-phase resonant clock, RF2 deploys a self-resonating “blip” generator [14]. The different clocking schemes used in the two designs lead to differences in latch design, robustness to race conditions, and energy efficiency in the clock generators.

A. Latch Design

As in RF1, latch design plays a critical role in enabling RF2 to achieve robust, energy-efficient operation at GHz-class operating frequencies. Fig. 7(a) shows circuit schematics of BLAT, the resonant-clocked level-sensitive latch used in RF2. Like H-LAT and L-LAT, BLAT is a Svensson latch implementation [17], modified for low latch delay. Post-layout simulations of B-LAT-X2 with 15 fF load capacitance result in a latch delay of 97 ps. Similar to H-LAT and L-LAT, the absence of clock buffers or precharging techniques results in zero dynamic power dissipation in the latch if input data does not toggle. When the latch output toggles, the energy dissipation in the latch is 23 fJ. The cross-coupled nMOS devices shown in the BLAT schematics are part of the distributed clock generator and are discussed in Section V-B.

Fig. 7. BLAT. (a) Schematics and interleaved implementation for single-phase datapaths. (b) Timing example for critical path and race conditions.

Fig. 7(b) shows waveforms obtained from simulations of an example pipeline stage in RF2 at 1 GHz. Similar to RF1, the pipeline is designed so that data on the critical path arrives at the latch when the latching clock is near its peak voltage, resulting in a low latch delay. The two-phase nonoverlapping clocks deployed in RF2 yield race-immunity. Simulations of a BLAT pipeline stage with zero logic delay at the fast process corner (fast NMOS, fast PMOS) indicate that the design is immune to race conditions with clock skews of up to 350 ps. Since the estimated clock skew in RF2 is less than 10 ps, as obtained from post-layout simulations, hold constraints are comfortably satisfied.

B. Clock Design

The clock generator used in RF2 is a self-resonating oscillator similar to that implemented by [14]. The motivation behind the
use of self-resonating clock generators is improved energy efficiency, which is afforded at the expense of tunability in the operating frequency.

Fig. 8(a) shows circuit schematics for the self-resonating “blip” generator used to derive two-phase nonoverlapping clocks in RF2. A 1.32 nH symmetric center-tapped inductor is used to achieve resonance with the series-connected capacitance loads from each phase. Energy losses in the system are replenished by the power supply. To obtain a 1.2 V clock amplitude using the blip generator, the required voltage of this supply in RF2 is approximately 0.5 V. Unlike driven oscillators, such as the one used in RF1, the switches in the blip generator are driven by a resonant clock, enabling charge recovery from the switch capacitance. The resulting energy efficiency in driving the switches enables the use of wider switches with reduced resistive losses. For a given clock load, the blip generator is capable of achieving better energy efficiency than the clock generator used in RF1. Note the use of the on-chip decap in Fig. 8. This decap is necessary in a fully integrated blip generator to prevent package parasitics from affecting the resonant frequency of the design.

Fig. 8. RF2 clock generator. (a) Schematic. (b) Clock grid.

Unlike previous work, RF2 does not contain a separate clock generator block. Fig. 8(b) shows the distributed blip generator used in RF2 along with the clock network. The cross-coupled devices shown in the figure provide the required negative transconductance in the circuit and are embedded within each latch. Embedding the clock generator switches into the latches provides better local clock slew control and has the added advantage of simplifying design.

VI. MEASUREMENT RESULTS

Here, we present measurement results obtained from RF1 and RF2. We first give a summary of the results obtained for the two designs. We then discuss measurement results specific to each of the two designs. For all data points reported, correct functionality of the two designs has been verified through BIST.

Fig. 9 summarizes measurement results obtained from RF1 and RF2. At the natural frequency of 1.03 GHz, clock power dissipation in RF1 is 14.2 mW, accounting for only 10.7% of overall power. The resonant clock network in RF1 achieves a 76% power reduction over conventional switching of the same capacitance at the same rate. Driving 38 pF per clock phase, which amounts to 76 pF of total clock load, RF2 achieves 84% clock-power reduction over conventional switching. Based on their relative power efficiency, RF1 and RF2 had a system quality factor of approximately 3.3 and 4.9, respectively [12]. At the minimum overall energy point, clock power in RF2 is 19.9 mW, accounting for only 16% of the overall 124 mW chip power.

Fig. 9. Statistics and performance for RF1 and RF2.

In both RF1 and RF2, the load of the entire clock network hierarchy (all the way to the clock inputs of the latches) was lower than in conventional (i.e., nonresonant) synthesized implementations of the same FIR architecture that we obtained with the same target clock frequency and supply voltage. The increased clock load in the conventional clock networks was mainly due to the use of clock buffers. Consequently, with respect to their nonresonant counterparts, RF1 and RF2 attained clock power reduction levels that were even greater than 76% and 84%, respectively.

An often cited figure of merit for FIR filters is the energy dissipation of the filter normalized for filter size, measured in nW/MHz/Tap/InBit/CoeffBit [18]–[20]. For RF1 and RF2, this figure of merit compares favorably to previously published conventional FIR implementations. In particular, RF2 features the lowest figure of merit for FIR filters published to date, dissipating 133 nW/MHz/Tap/InBit/CoeffBit. In the case of RF1 and RF2, this energy metric has been obtained for a relatively high switching activity of 0.5.

A. RF1 Measurement Results

Fig. 10(a) shows measured clock and total energy dissipation versus operating frequency in RF1. The clock energy dissipation curve shown in the figure has been obtained for a 1.2 V setting, a fixed driver switch width, and a pulse duty cycle of 20%. The clock energy minimum at 1.03 GHz corresponds to the resonant
SATHE et al.: RESONANT-CLOCK LATCH-BASED DESIGN 871

frequency of the design. The lowest achievable per-cycle energy dissipation of the clock for a 1.2 V amplitude (clock network + clock generator) at resonance is 14.3 pJ. This energy dissipation corresponds to a 76% clock power reduction over the power dissipation incurred in conventionally clocking an identical clock load.

Fig. 10. Energy dissipation of (a) RF1 versus operating frequency and (b) RF2 versus clock supply voltage.

The increase in the energy dissipation of the clock at frequencies away from resonance does not necessarily imply increased total energy dissipation. Frequency reduction provides additional timing slack to the datapath, allowing operation at a reduced voltage supply level (voltage scaling) while still meeting timing constraints, and resulting in a reduction of overall power dissipation. As can be deduced from Fig. 10(a), the reduction in the logic energy dissipation of RF1 dominates the increase in clock dissipation due to off-resonance operation, resulting in an overall reduction in energy dissipation. At higher frequencies, both clock and logic power increase, leading to higher power dissipation.

RF1 can also be operated without the use of the additional power supply. The additional supply is removed by opening the switch shown in Fig. 6, and using pMOS switches to periodically deliver power to the resonant system. The effectiveness of this technique was diminished, however, since the addition of decoupling capacitance between the supplies was overlooked. Consequently, the injection current through the pull-up switches flows through the additional inductance of the package and bondwire. The increased inductance in the current buildup path requires wider pull-up switches for the same current injection. The resulting increase in pull-up switch width increases conventional power dissipation in the clock generator. Therefore, although correct operation was verified at the resonant frequency of 1.03 GHz, the energy dissipation per cycle in the power clock was higher at 26.2 pJ.

To explore possible energy savings by reducing dissipation in the clock generator, we experimented with “half pumping,” i.e., driving the clock generator pulse at half the natural frequency. In half-pumping mode, the clock generator circuitry switches at half of the frequency, resulting in decreased switching power dissipation. The energy replenished by the clock drivers accounts for energy losses over two cycles, however, resulting in increased current injection and higher resistive losses on the replenishing switches. For resonant systems with sufficiently high energy efficiency, the energy savings obtained by the reduced switching activity in the clock drivers are expected to dominate the increased resistive losses, leading to lower overall power consumption. In our half-pumping experiments, however, the lowest achievable per-cycle energy dissipation in the clock was approximately 16.8 pJ, which exceeded the 14.3 pJ dissipated per cycle during normal operation.

B. RF2 Measurement Results

Fig. 10(b) shows the clock, logic, and total energy-per-cycle dissipation of RF2 versus the clock supply voltage. The operating frequency of the design is 1.01 GHz. In contrast to conventional designs, in which insertion delay depends directly on supply voltage due to the use of clock buffers, insertion delays in RCL-derived designs depend primarily on interconnect delays. Consequently, in contrast to conventional designs with buffered clock networks, clock skews in RF2 remain largely unaffected by supply voltage scaling, as confirmed independently by simulations, providing additional timing margin for reducing supply voltage. Moreover, efficient clocking in RF2 allows for further supply voltage scaling in the logic, since the increased logic delay can be compensated by better latch performance obtained from higher clock amplitude. The improved energy efficiency that can be achieved by driving the clock at a higher amplitude than the voltage-scaled logic is demonstrated in Fig. 10(b). At each value of the clock supply voltage, the supply voltage V_dd was scaled to achieve minimum total power dissipation. As shown in the figure, the optimal energy point occurs for a clock supply voltage of 0.59 V and V_dd = 1.08 V. For lower values of the clock supply voltage, a higher supply voltage V_dd is required due to lower clock amplitude. Increasing the clock supply voltage increases clock amplitude, allowing for lower total energy dissipation through V_dd scaling. Beyond the optimal value of the clock supply voltage, the supply-voltage scaling afforded by improved latch performance cannot compensate for the increasing clock power dissipation, resulting in increased total power dissipation.
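The RF1 clock-energy figures reported in Section VI-A can be cross-checked with a few lines of arithmetic. The sketch below assumes that "conventional" clocking means full-swing, non-resonant switching of the 42 pF clock load at the 1.2 V clock amplitude, so that the baseline energy per cycle is C·V²; this section does not spell out the baseline, so that choice, and every name in the code, is an illustrative assumption rather than the authors' procedure.

```python
# Hypothetical cross-check of the RF1 clock-energy figures; all names are illustrative.
CLOCK_LOAD_F = 42e-12   # RF1 clock load reported in the paper (42 pF)
CLOCK_SWING_V = 1.2     # clock amplitude quoted for the measured curve

def conventional_energy_pj(c_load_f, v_swing):
    """Per-cycle energy (pJ) of full-swing, non-resonant switching: C * V^2.

    Assumption: this stands in for the 'conventional' baseline; any
    clock-buffer overhead is ignored.
    """
    return c_load_f * v_swing ** 2 * 1e12

def reduction_percent(measured_pj, baseline_pj):
    """Relative saving of a measured energy against the baseline, in percent."""
    return 100.0 * (1.0 - measured_pj / baseline_pj)

# Measured per-cycle clock energies quoted in Section VI-A (pJ).
MEASURED_PJ = {
    "resonant": 14.3,       # normal operation at 1.03 GHz
    "half_pumping": 16.8,   # generator pulsed at half the natural frequency
    "single_supply": 26.2,  # additional supply removed, pMOS pull-ups only
}

baseline_pj = conventional_energy_pj(CLOCK_LOAD_F, CLOCK_SWING_V)
reductions = {mode: reduction_percent(e, baseline_pj)
              for mode, e in MEASURED_PJ.items()}

for mode, pct in reductions.items():
    print(f"{mode}: {MEASURED_PJ[mode]:.1f} pJ/cycle, "
          f"{pct:.1f}% below the {baseline_pj:.2f} pJ baseline")
```

Under these assumptions the baseline is about 60.5 pJ per cycle and the resonant-mode saving works out to roughly 76%, in line with the reduction reported for RF1. The same arithmetic shows that the half-pumping and single-supply modes still undercut the assumed baseline, only by a smaller margin.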

An essential practical requirement of any synchronous design is the ability to clock the design at a predetermined frequency, regardless of variations in the fabrication process. Since the frequency of a fully self-resonating clock generator such as the one used in RF2 cannot be tuned, it is important that frequency variation arising from the fabrication process be kept to a minimum. To determine the variation of the resonant frequency of RF2 across multiple chips, the resonant frequencies of ten randomly selected chips were measured. All ten chips were operational, with average resonant frequency 1.012 GHz and relative standard deviation 0.012. For such a relatively low variation in resonant frequency, injection-locking may be an attractive option for setting the oscillation frequency [7].

VII. CONCLUSION

This paper describes RF1 and RF2, two level-clocked ASIC designs that deploy resonant clocking to drive the latches in their clock distribution networks with substantially reduced power dissipation over conventional switching. Fabricated in a commercial 0.13 μm bulk silicon process, the two test-chips have been designed using RCL, a general latch-based design methodology for energy-efficient high-performance resonant-clocked chips. The RCL methodology “resonates” the entire clock network, recovering energy from all clock-related switching capacitance and maximizing clock efficiency. Clock distribution is performed through a metal-only network without clock buffers, resulting in a sinusoidal clock waveform that directly drives the latches at the leaves of the clock network. High performance is achieved through the use of appropriately designed level-sensitive latches whose timing characteristics depend on the voltage level, rather than the slew rate, of the clock waveform.

RF1 is a single-phase 8-bit 14-tap FIR filter with an on-chip inductor that operates at frequencies of up to 1.2 GHz. Frequency-scaled operation in RF1 is achieved by forcing the clock network to oscillate off resonance, without altering its circuit parameters. At its natural frequency of 1.03 GHz, RF1 dissipates 143 nW/MHz/Tap/InBit/CoeffBit, with clock power accounting for only 10.8% of the 131 mW total power dissipation. In comparison with conventional switching, RF1 dissipates 76% less power for driving its 42 pF clock load.

RF2 has been designed using the same Verilog description as the one used for synthesizing RF1. Relying on a two-phase nonoverlapping resonant-clocked latching scheme, RF2 achieves robust and energy-efficient operation at its resonant frequency of 1.01 GHz. The distributed clock generator used in RF2 simplifies design and provides improved local clock skew control. By including the clock drivers in the resonant system, RF2 achieves greater clock power efficiency than RF1. Resonating 38 pF of clock load per phase at 1.01 GHz, RF2 dissipates 84% less power than conventional switching. At 133 nW/MHz/Tap/InBit/CoeffBit, RF2 features the lowest figure of merit for digital FIR filters published to date.

Both RF1 and RF2 have the same latency, throughput, and latch counts as their conventional counterparts obtained from the same Verilog code. In general, resonant-clocked pipelines designed using the RCL methodology do not place any additional requirements on their combinational logic over conventional design. They are thus amenable to all non-clock-related power reduction techniques that can be applied to conventional designs, including gate sizing, multithreshold voltage assignment, power gating, and dynamic voltage scaling. Therefore, resonant clocking can be used to achieve further power savings, in addition to the power reductions already achievable by conventional logic optimization approaches.

ACKNOWLEDGMENT

The authors would like to thank C. Tokunaga for his invaluable assistance with design. They would also like to thank the anonymous reviewers for their constructive comments.

REFERENCES

[1] C. H. Ziesler, J. Kim, V. S. Sathe, and M. C. Papaefthymiou, “A 225 MHz resonant clocked ASIC chip,” in Proc. ISLPED, Aug. 2003, pp. 48–53.
[2] V. S. Sathe, J. C. Kao, and M. C. Papaefthymiou, “RF2: A 1GHz FIR filter with distributed resonant clock generator,” in Symp. VLSI Circuits Dig. Tech. Papers, Jun. 2007, pp. 44–45.
[3] V. S. Sathe, J. C. Kao, and M. C. Papaefthymiou, “A 0.8–1.2 GHz single-phase resonant-clocked FIR filter with level-sensitive latches,” in Proc. CICC, Sep. 2007, pp. 583–586.
[4] W. Athas, N. Tzartzanis, L. Svensson, and L. Peterson, “A low-power microprocessor based on resonant energy,” IEEE J. Solid-State Circuits, vol. 32, no. 11, pp. 1693–1701, Nov. 1997.
[5] S. Chan, P. Restle, and K. Shepard, “A 4.6 GHz resonant global clock distribution network,” in IEEE ISSCC Dig. Tech. Papers, Feb. 2004, pp. 42–43.
[6] A. J. Drake, K. J. Nowka, T. Y. Nguyen, J. L. Burns, and R. B. Brown, “Resonant clocking using distributed parasitic capacitance,” IEEE J. Solid-State Circuits, vol. 39, no. 9, pp. 1520–1528, Sep. 2004.
[7] S. Chan, P. Restle, and K. Shepard, “Distributed differential oscillators for global clock networks,” IEEE J. Solid-State Circuits, vol. 41, no. 9, pp. 2083–2094, Sep. 2006.
[8] M. Hansson, B. Mesgarzadeh, and A. Alvandpour, “1.56 GHz on-chip resonant clocking in 130 nm CMOS,” in Proc. CICC, Sep. 2006, pp. 241–244.
[9] P. Restle, “A clock distribution network for microprocessors,” IEEE J. Solid-State Circuits, vol. 36, no. 5, pp. 792–799, May 2001.
[10] T. G. Szymanski, “LEADOUT: A static timing analyzer for MOS circuits,” in Proc. ICCAD, Nov. 1986, pp. 130–133.
[11] K. A. Sakallah, T. N. Mudge, and O. A. Olukotun, “checkT and minT: Timing verification and optimal clocking of synchronous digital circuits,” in Proc. ICCAD, Nov. 1990, pp. 552–555.
[12] V. S. Sathe, “Hybrid resonant-clocked digital design,” Ph.D. dissertation, Univ. of Michigan, Ann Arbor, May 2007.
[13] MOSIS. [Online]. Available: http://www.mosis.com/products/fab/vendors/ibm/8rf dm/
[14] W. C. Athas, N. Tzartzanis, and L. J. Svensson, “A resonant signal driver for two-phase, almost-non-overlapping clocks,” in Proc. ISCAS, May 1996.
[15] P. J. Restle and A. Deutsch, “Designing the best clock distribution network,” in VLSI Circuits Symp. Dig., Jun. 1998, pp. 2–5.
[16] N. Weste and D. Harris, CMOS VLSI Design, 3rd ed. Reading, MA: Addison-Wesley.
[17] J. Yuan and C. Svensson, “High speed CMOS technique,” IEEE J. Solid-State Circuits, vol. 24, no. 1, pp. 62–70, Feb. 1989.
[18] R. B. Staszewski, K. Muhammad, and P. Balsara, “A 500-Msample/s 8-tap FIR digital filter for magnetic recording read channels,” IEEE J. Solid-State Circuits, vol. 35, no. 8, pp. 1205–1210, Aug. 2000.
[19] S. Rylov, “A 2.3Gsample/s 10-tap digital FIR filter for magnetic recording read channels,” in IEEE ISSCC Dig. Tech. Papers, Feb. 2001, pp. 190–191.
[20] J. Park, “Computation sharing programmable FIR filter for low-power and high-performance applications,” IEEE J. Solid-State Circuits, vol. 39, no. 2, pp. 348–357, Feb. 2004.

Visvesh S. Sathe (M’02) received the B.Tech. degree in electrical engineering from the Indian Institute of Technology, Bombay, in 2001, and the M.S. and Ph.D. degrees in electrical engineering and computer science from the University of Michigan, Ann Arbor, in 2004 and 2007, respectively.
While at the University of Michigan, his research focused on low-energy circuit design with a particular emphasis on resonant-clocked digital design. He has held internship positions at the IBM T. J. Watson Research Center and Cyclos Semiconductor. In 2007, he joined the Advanced Power Technology Group, Advanced Micro Devices, Fort Collins, CO, as a Senior Design Engineer. His current work focuses on the exploration and implementation of power reduction techniques for microprocessors.

Jerry C. Kao (M’04) received the B.S. degree in electrical engineering from Columbia University, New York, NY, in 2000, and the M.S. degree in electrical engineering and computer science from the University of Michigan, Ann Arbor, in 2002. He is currently working toward the Ph.D. degree at the University of Michigan.
From 2002 to 2005, he was with IBM, Rochester, MN, where he was involved in the design of the CELL processor and the XBOX 360 processor. His research interests include high-performance and low-power circuit technologies and design methodologies.

Marios C. Papaefthymiou (M’93–SM’02) received the B.S. degree in electrical engineering from the California Institute of Technology, Pasadena, in 1988 and the S.M. and Ph.D. degrees in electrical engineering and computer science from the Massachusetts Institute of Technology, Cambridge, in 1990 and 1993, respectively.
After a three-year term as an Assistant Professor at Yale University, he joined the University of Michigan, Ann Arbor, where he currently is a Professor of Electrical Engineering and Computer Science and Director of the Advanced Computer Architecture Laboratory. He is also cofounder and Chief Scientist of Cyclos Semiconductor, a start-up company commercializing low-power devices. His research interests encompass algorithms, architectures, and circuits for energy-efficient high-performance VLSI systems. He is also active in the field of parallel and distributed computing.
Dr. Papaefthymiou was the recipient of the ARO Young Investigator Award, the National Science Foundation CAREER Award, and a number of IBM Partnership Awards. Furthermore, together with his students, he has received a Best Paper Award at the 32nd ACM/IEEE Design Automation Conference and the First Prize (Operational Category) in the VLSI Design Contest of the 38th ACM/IEEE Design Automation Conference. He has served multiple terms as an Associate Editor for the IEEE TRANSACTIONS ON THE COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS, IEEE TRANSACTIONS ON COMPUTERS, and IEEE TRANSACTIONS ON VLSI SYSTEMS. He has served as the General Chair and as the Technical Program Chair for the ACM/IEEE International Workshop on Timing Issues in the Specification and Synthesis of Digital Systems. He has also participated several times in the Technical Program Committee of the IEEE/ACM International Conference on Computer-Aided Design.
