
Automatic Register Transfer Level CAD Tool Design for Advanced Clock Gating and Low Power Schemes


Yunlong Zhang, Qiang Tong, Li Li, Wei Wang and Ken Choi
Electrical and Computer Engineering
Illinois Institute of Technology
3301 S Dearborn St., Chicago, IL 60616, USA
{yzhan167,qtong,kchoi12}@hawk.iit.edu

JongEun Jang, Hyobin Jung, and Si-Young Ahn
Epic Solution Inc.
1605 Sang-Am Dong, Ma-Po Gu
Seoul, Rep. of Korea
{jhb, jjang}@epicsolution.co.kr

Abstract – Power reduction has become the first consideration in VLSI design, and low power is one of the major concerns in deeply scaled CMOS technologies. A wide range of methods has been proposed to achieve this objective. The register-transfer level (RTL) has become the most effective stage in low power VLSI design because power optimization has a significant impact there and power estimation is accurate. In this paper, some representative low power design techniques at RTL are re-investigated in TSMC 45 nanometer CMOS technology. Clock gating (CG) is one of the most widely used and effective techniques in RTL low power design. Without the enable signal, bus-specific clock gating (BSC) and threshold-based clock gating (TCG) are considered. An improved activity-driven optimized bus-specific clock gating (OBSC), proposed in our laboratory, is also considered. When the enable signal is taken into account, this paper explains the local-explicit clock gating (LECG), enhanced clock gating (ECG), waste-toggle-rate-based (WTR) clock gating and single comparator-based clock gating (SCCG) techniques. Operand isolation is another useful design technique that reduces power consumption by blocking redundant operations. Memory splitting is an effective low power design solution as well. These techniques have been evaluated using the TSMC 45 nm technology library, and the proposed low-power RTL techniques are evaluated at gate level with logic synthesis results.

Keywords – Low power design, RTL, Clock gating, Operand isolation, Memory splitting

I. INTRODUCTION

Following Moore's Law and the trend of industrial technology, integrated circuit densities and operating speeds have continued to rise over the past decades, and this change will continue unabated [5]. This rapid progress brings larger chips, more complex designs and faster operation, and therefore tremendously increased power consumption. Under these circumstances, power reduction has become the first consideration in VLSI design. Because electricity resources are limited, reducing power is highly valued as a tradeoff between more complex designs and lower power consumption within an acceptable operation time. Many techniques, such as multiple-voltage design [9], pre-computation, clock gating, operand isolation [7] and memory splitting [8], have been proposed at levels ranging from system level to layout level to achieve power reduction. The register-transfer level (RTL) has become the most suitable stage because power optimization has a significant impact there and power estimation is accurate [1].

Fig. 1. Non-CG circuit: (a) Without enable (b) With enable [4]

This paper explains some representative low power techniques at RTL; for other techniques, the reader is referred to the cited references.

At the register-transfer level, digital circuits always contain some redundant computations. In other words, power can be reduced by eliminating these idle circuit operations [5]. Clock gating (CG) is the most widely used and effective technique at RTL. Two typical non-CG circuits are described in [4] and shown in Fig. 1: Fig. 1(a) is the circuit without an enable signal, and Fig. 1(b) is the circuit with an enable signal. For circuits without the enable signal, this paper explains the bus-specific clock gating (BSC) [2], threshold-based clock gating (TCG) [3] and optimized bus-specific clock gating (OBSC) [1] techniques. These techniques reduce power consumption by taking the switching activity of signals into account. For circuits with the enable signal, as shown in Fig. 1(b), this paper covers the local-explicit clock gating (LECG) [4], enhanced clock gating (ECG) [4], waste-toggle-rate-based (WTR) clock gating [4] and single comparator-based clock gating (SCCG) [6] techniques. In this kind of circuit, wasted clk toggles depend not only on all bits of the data values at two consecutive clk periods but also on the enable signal. The operand isolation technique reduces power by blocking the propagation of switching activity through the circuit [7]. Memory splitting saves power because an operation on a half-size memory consumes much less power than the same operation on the full-size memory.

The low power techniques mentioned above are explained in Section II. Section III shows experimental results, followed by the conclusion in Section IV.
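The notion of a wasted clock toggle for a register with an enable signal can be made concrete with a small sketch. The Python below is illustrative only and is not taken from the cited papers: it assumes a clock toggle is wasted in any cycle where en is low or the data input equals the value already stored, and the function name `wasted_toggle_rate` and the trace format are hypothetical.

```python
def wasted_toggle_rate(trace, reset_value=0):
    """Fraction of clock cycles whose clock toggle is wasted.

    trace: list of (en, d) pairs, one per clock cycle (assumed format).
    A toggle is counted as wasted when en is low or when d equals the
    value already held by the register, so the stored value cannot change.
    """
    q = reset_value          # value currently held by the register
    wasted = 0
    for en, d in trace:
        if en and d != q:
            q = d            # useful toggle: the register captures new data
        else:
            wasted += 1      # en low, or data unchanged: toggle is wasted
    return wasted / len(trace) if trace else 0.0


# Five cycles: two useful captures, three wasted toggles.
trace = [(1, 5), (1, 5), (0, 7), (1, 7), (0, 7)]
rate = wasted_toggle_rate(trace)
```

Under this assumed definition, a register with a high wasted-toggle rate is a good candidate for gating, while one with a low rate may not repay the extra gating logic.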

978-1-4673-2990-3/12/$31.00 ⓒ2012 IEEE - 21 - ISOCC 2012


II. RTL LOW POWER TECHNIQUES

Fig. 2. (a) Bus-specific clock gating (b) Local-explicit clock gating (c) Enhanced clock gating [4]

A. Bus-specific clock gating (BSC)
This CG style, shown in Fig. 2(a), was proposed in [2] for individual flip-flops (FFs). XOR gates compare the input and output bits. When the input data equal the current output data, every XOR output is 0, and so is the output of the N-bit OR gate that determines whether any bit changes. The FF is then gated and power is saved. When the input data differ from the current output data, the circuit operates normally. The AND gate and the latch safely disable the clock without allowing any glitch to reach the register clock [16]. However, if the output toggle rate of the FF is high, which indicates high signal switching activity, power consumption increases because of the extra logic.

B. Local-explicit clock gating (LECG)
As shown in Fig. 2(b), the clock of the FF is gated by an AND gate and a latch. When en is low for a significant percentage of the circuit operation, this technique saves a substantial amount of power. If en is high for a significant percentage of the circuit operation, however, power consumption increases because the power consumed by the extra circuit outweighs the savings [4].

C. Enhanced clock gating (ECG)
In Fig. 2(c), ECG combines the gating circuits of BSC and LECG so that the clock signal of the register is gated whenever its toggle is wasted [4]. It benefits when en is low in a significant percentage of the clock periods or when signal switching activity is low. When neither situation occurs, this gating circuit, which is more complex than LECG, consumes more power.

D. Threshold-based clock gating (TCG)
The BSC technique gates all FFs without considering signal activity. In [3], Bonanno et al. proposed this data-driven clock gating technique to improve on BSC. They set a threshold of 5% on output switching activity: any FF whose output toggle rate is below 5% is clustered and gated using one clock gating cell. The toggle rates of the FFs in the non-clock-gated circuit are measured first, and the FFs are then divided into two groups according to the list of toggle rates.

E. Optimized bus-specific clock gating (OBSC)
In [1], Li et al. propose a fine-grained activity-driven CG methodology that also improves on BSC. Unlike TCG, this technique determines which FFs should be clustered and gated by taking the relationship among the FFs into account [1]. Initially, all FFs are clustered and gated. The main idea of this fine-grained activity-driven OBSC scheme is then as follows: the larger the FF output toggle rate, the fewer clock cycles are wasted, so the earlier the FF should be excluded from gating [1]. This algorithm yields just N+1 possible CG solutions, compared with the 2^N power estimation results for N FFs. The solution with the minimum power is the final OBSC circuit. Given the complexity and density of VLSI circuits, however, the drawback is that this process takes too much time (e.g., 28 hours for s15850, a sequential circuit of the ISCAS'89 benchmark).

F. Single comparator-based clock gating (SCCG)
In [6], Wang et al. propose a novel single comparator-based clock gating (SCCG) scheme to enhance ECG for pipelines. In the proposed SCCG, the enable signal is moved from the data path to the control logic, the data signal is analyzed, and only a single comparator is used to implement clock gating for all the pipeline stages. The SCCG technique has an important constraint. If there is a constant input from outside the combinational logic circuit, the pipeline works properly. If there is a variable input from outside during an intact data transmission procedure, however, the pipeline cannot work properly, because such an input may change the data consistency in the data path [6].

Fig. 3. Single comparator-based clock gating [6]

G. Waste-toggle-rate-based (WTR) clock gating
In [4], an RT-level power reduction scheme is proposed. This technique can be used for any application that has a power problem when designers use a traditional design flow. A novel wasting-toggle-rate-based clock power reduction technique is introduced and verified along with the traditional design flow. The proposed technique selectively chooses the optimal clock-gating style to minimize power based on the proposed wasting-toggle-rate analysis at RT level, and the optimization is based on the proposed power equations without simulating the design at gate level [4].

H. Operand isolation
Designs which do not fully utilize their arithmetic datapath components typically exhibit a significant overhead in power consumption. Whenever a module performs an operation whose result is not used in the downstream circuit, power is consumed for an otherwise redundant computation. Operand isolation is a technique to minimize the power overhead incurred by redundant operations by selectively blocking the propagation of switching activity through the circuit [7].
Reference [7] discusses how redundant operations can be identified concurrently with normal circuit operation, and presents a model to estimate the power savings that can be obtained by isolating selected modules at the register transfer (RT) level, as shown in Fig. 4. Based on this model, an algorithm is presented to iteratively isolate modules while minimizing the cost incurred by RTL operand isolation.

Fig. 4. Design with operand isolation

I. Memory splitting
Fig. 5 shows a memory with a large number of words split into two smaller memories, each with half the number of words. For most memory architectures, the power of a read or write to the half-size memory is much less than the power of a read or write to the full-size memory. Even though the same number of reads and writes occur in the new architecture, the power is reduced because each read or write accesses only one of the smaller memories. To implement this change, replace the single memory with two memories, each having half the number of words. Select a bit of the address, usually the MSB or LSB, to operate as a "bank select". This bit should be used as the chip enable for one memory and, inverted, as the chip enable for the other memory. This bit should also be used as the select input of a new multiplexer on the memory outputs. This mux selects the appropriate memory bank.

Fig. 5. A memory with a large number of words split into two smaller memories

III. EXPERIMENTAL RESULTS

The BSC, TCG and OBSC techniques were evaluated on all the ISCAS'89 benchmark circuits using the TSMC 45 nm technology library and Synopsys Power Compiler. The circuits were simulated for 10000 clock cycles (clock frequency 250 MHz) with random inputs. Table I reports the area, delay and power of the non-CG circuit, the BSC circuit, the OBSC circuit and the TCG circuit for each ISCAS'89 circuit. Table II gives the comparative results of the three CG circuits versus the non-CG circuit.
As shown in Table II, the traditional BSC circuit increases power for many circuits, and the average power increases by 208.96%. In contrast, compared with the non-CG circuit, the OBSC circuit reduces power by 26.95% on average over all circuits, which is 16% more than the TCG circuit. The area and delay of the OBSC circuit increase by 14.44% and 5.77%, respectively. The impact of improvements in the synthesis process should also be kept in mind.
For operand isolation, a simple benchmark is shown in Fig. 4, and we use the TSMC 45 nm technology library and Synopsys Power Compiler. The circuits are again simulated for 10000 clock cycles (clock frequency 250 MHz). As shown in Table II, the AND isolation style gives the maximum power reduction, 17.67%, with only a 3.17% delay increase; the isolated candidate is a1.
For memory splitting, a simple benchmark is shown in Fig. 5, and we use the TSMC 45 nm technology library and Synopsys Power



TABLE I
EXPERIMENTAL RESULTS FOR ISCAS'89 CIRCUITS

Circuit | non-CG | BSC | OBSC | TCG
(for each technique, in order: Area [μm2], Delay [ns], Power [μW])
s27 60.54 0.37 29.50 87.29 0.43 33.43 73.21 0.41 28.09 60.54 0.37 29.50
s298 407.82 0.55 128.01 435.04 0.50 118.09 464.61 0.81 81.57 455.22 0.76 83.29
s344 448.65 0.68 174.49 538.76 0.78 204.30 491.83 0.78 157.45 464.61 0.65 172.62
s349 443.49 0.68 174.74 547.67 0.86 204.97 495.11 0.77 158.48 459.44 0.65 172.95
s382 518.58 0.58 163.29 581.46 0.46 154.78 637.78 0.88 85.46 618.07 0.69 86.00
s386 325.69 0.57 78.96 371.22 0.74 87.96 367.93 0.60 67.34 361.83 0.62 72.10
s400 518.58 0.57 161.94 590.85 0.46 155.49 626.05 0.75 84.20 618.54 0.81 84.74
s420 482.44 1.05 122.58 488.07 0.13 84.05 488.54 0.31 57.36 485.73 0.38 57.36
s444 515.76 0.59 163.49 600.70 0.46 156.19 630.27 0.87 87.20 635.43 0.78 89.40
s510 608.21 0.84 139.31 649.51 0.98 141.66 636.37 0.73 133.25 608.21 0.84 139.31
s526 618.07 0.67 167.10 657.49 0.58 154.64 710.05 0.87 87.43 718.97 0.79 88.41
s526n 620.41 0.67 165.83 654.67 0.58 153.72 711.93 0.87 87.69 719.44 0.82 87.76
s641 431.29 0.94 140.25 538.29 1.16 155.11 498.40 1.10 103.10 498.40 1.10 103.10
s713 431.29 0.94 140.68 535.94 1.16 156.42 498.40 1.10 103.59 498.40 1.10 103.59
s820 715.21 0.83 115.79 741.02 0.92 111.45 720.84 0.83 97.36 747.59 0.87 103.42
s832 703.95 0.83 112.63 744.31 1.01 106.35 726.48 0.90 94.56 726.48 0.90 94.56
s838 1002.42 1.97 223.21 1001.96 2.07 142.53 1005.24 0.38 67.35 1005.24 0.38 67.35
s953 1161.99 0.84 282.19 1467.97 1.10 296.56 1388.66 1.10 219.47 1256.79 1.02 238.60
s1196 1371.29 1.13 362.84 1512.08 1.14 410.50 1406.96 1.14 347.16 1387.72 1.05 352.35
s1238 1348.30 1.11 364.21 1502.23 1.20 408.73 1403.68 1.20 348.34 1386.78 1.01 358.00
s1243 2060.23 2.79 677.49 2277.51 2.87 751.12 2331.48 2.45 513.93 2321.63 2.51 514.88
s1488 1426.67 0.99 214.03 1429.49 1.04 194.72 1429.49 1.04 194.72 1445.91 1.04 206.45
s1494 1440.28 1.07 219.48 1451.54 1.08 199.61 1451.54 1.08 199.61 1462.81 1.09 212.54
s5378 4595.39 0.92 1770.87 5468.28 1.14 2080.89 4921.55 1.17 1257.17 4915.45 0.99 1260.99
s9234 3592.49 1.08 1069.87 6993.04 1.36 1916.52 6409.70 1.42 751.81 6456.63 1.48 820.70
s13207 11455.61 1.67 4468.86 11420.88 1.62 4467.97 14186.94 1.85 2641.32 14272.82 1.57 2978.21
s15850 10934.69 2.86 3482.67 10972.70 2.86 3523.70 12628.86 2.17 2244.16 13258.66 2.04 2880.70
s35932 31954.64 3.82 15429.47 45263.98 2.96 242558.76 37378.34 2.06 11823.30 34237.31 2.22 13844.43
s38417 31616.74 1.89 12763.25 44359.17 2.43 316289.90 44718.19 2.41 6680.10 41069.38 2.39 56389.14
s38584 29443.41 3.25 13201.90 46890.58 2.91 320556.78 30544.39 1.71 12540.20 31825.11 2.78 12911.38
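
As a hedged illustration of how the percentage columns relate to the raw measurements, the sketch below computes the relative power change of each gated circuit against its non-CG baseline. It uses only three rows copied from Table I. The helper name `power_change_percent` and the sign convention (negative means a reduction) are assumptions; the convention is inferred from the fact that the AND_a1 row of Table II (63.54 μW to 52.31 μW, listed as -17.67%) is reproduced by this formula. It does not attempt to recompute the averages quoted in Section III.

```python
# Power values in uW copied from three rows of Table I.
table_power = {
    "s298":  {"non-CG": 128.01,  "BSC": 118.09,  "OBSC": 81.57,  "TCG": 83.29},
    "s344":  {"non-CG": 174.49,  "BSC": 204.30,  "OBSC": 157.45, "TCG": 172.62},
    "s9234": {"non-CG": 1069.87, "BSC": 1916.52, "OBSC": 751.81, "TCG": 820.70},
}


def power_change_percent(baseline, gated):
    """Relative change versus the baseline; negative values are savings."""
    return 100.0 * (gated - baseline) / baseline


# Per-circuit change of each clock-gating style relative to non-CG.
changes = {
    circuit: {
        style: power_change_percent(row["non-CG"], power)
        for style, power in row.items()
        if style != "non-CG"
    }
    for circuit, row in table_power.items()
}

for circuit, row in changes.items():
    summary = ", ".join(f"{style} {value:+.2f}%" for style, value in row.items())
    print(f"{circuit}: {summary}")
```

For these rows, BSC raises power for s344 and s9234 while OBSC lowers it for all three, which is consistent with the trend described in Section III.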

TABLE II
POWER CONSUMPTION AND REDUCTION VS. AREA AND DELAY

circuit   | Power [μW] | %      | Delay [ns] | %      | Area [μm2] | %
non_iso   | 63.54      | n/a    | 1.26       | n/a    | 596.95     | n/a
AND_a0    | 58.76      | -7.52  | 1.51       | 19.84  | 641.53     | 7.47
AND_a1    | 52.31      | -17.67 | 1.30       | 3.17   | 573.02     | -4.01
OR_a0     | 59.86      | -5.79  | 1.45       | 15.08  | 640.13     | 7.23
OR_a1     | 56.22      | -11.52 | 1.23       | -2.38  | 621.35     | 4.09
LATCH_a0  | 60.87      | -4.20  | 1.27       | 0.79   | 686.59     | 15.02
LATCH_a1  | 57.92      | -8.84  | 0.68       | -46.03 | 642.47     | 7.63

Compiler. Splitting the memory into two half-size memories gives a 12.72% power reduction; however, the delay increases by 18.18% (the results table is not shown in the paper).

IV. CONCLUSION

Because power optimization is highly effective and power estimation is accurate at this level, RTL is the proposed stage for low power VLSI design. This paper re-investigated some representative RTL techniques for low power VLSI design, including BSC, TCG, OBSC, LECG, SCCG, WTR-based CG, operand isolation and memory splitting, with TSMC 45 nm technology. We found that if the modules or data paths are not on the critical path, OBSC, operand isolation and memory splitting are the applicable techniques for low power VLSI design.

REFERENCES
[1] L. Li and K. Choi, "Activity-driven optimized bus-specific-clock-gating for ultra-low-power smart space applications," IET Commun., 2011, Vol. 5, Iss. 17, pp. 2501–2508.
[2] T. Lang, E. Musoll, and J. Cortadella, "Individual flip-flops with gated clocks for low power datapaths," IEEE TCAS-II: Analog and Digital Signal Processing, 44(6), 1997.
[3] A. Bonanno, A. Bocca, A. Macii, E. Macii, and M. Poncino, "Data-driven clock gating for digital filters," Integrated Circuit and System Design. Power and Timing Modeling, Optimization and Simulation.
[4] L. Li and K. Choi, "Selective clock gating by using wasting toggle rate," Electro/Information Technology, 2009 (EIT '09), IEEE International Conference, pp. 399–404.
[5] M. Pedram and A. Abdollahi, "Low-power RT-level synthesis techniques: a tutorial," Computers and Digital Techniques, IEE Proceedings, Vol. 152, Iss. 3, pp. 333–343, 6 May 2005.
[6] W. Wang, Y. C. Tsao, K. Choi, S. Park, and M. K. Chung, "Pipeline power reduction through single comparator-based clock gating," SoC Design Conference (ISOCC), 2009 International, pp. 480–483.
[7] M. Munch, B. Wurth, R. Mehra, J. Sproch, and N. Wehn, "Automating RT-Level Operand Isolation to Minimize Power Consumption in Datapaths," Design Automation and Test in Europe Conference and Exhibition 2000 Proceedings, pp. 624–631, March 2000.
[8] Sequence Design, Inc., "PowerTheater User Guide," 2007.
[9] S. Raje and M. Sarrafzadeh, "Variable voltage scheduling," in Proc. Int'l. Workshop Low Power Design, Aug. 1995, pp. 9–14.
