
Power-Aware Clock Tree Planning:

Validation in an Industrial Setting

Monica Donno Enrico Macii Roberto Zafalon


BullDAST s.r.l. Politecnico di Torino STMicroelectronics
10131 Torino, Italy 10129 Torino, Italy 20041 Agrate Brianza, Italy

Abstract
Modern SoCs require the adoption of power-oriented design styles, because power consumption affects the
reliability, cost and manufacturability of circuits and systems built in nanometric technologies.
A significant fraction of the power consumed by a synchronous circuit is due to
the clock distribution network. This is for two reasons: First, the clock nets are very heavily loaded.
Second, they are subject to a very high switching activity.

The problem of automatically synthesizing a power efficient clock tree has been addressed in a few
research works. We take, as reference, the methodology introduced in [1], in which low-power clock trees
are obtained through aggressive exploitation of the clock gating technology. Distinguishing features of that
methodology are: (i) The capability of calculating powerful clock gating conditions that go beyond the
simple topological search of the HDL code [2]. (ii) The capability of determining the clock tree logical
structure starting from an RTL description. (iii) The capability of including in the cost function that drives
clock tree structure generation both functional (i.e., clock activation conditions) and physical (i.e.,
floorplanning) information. (iv) The capability of generating a clock tree structure that can be synthesized
and routed using standard back-end tools.

The design technology introduced in [1] and [2] has been embraced and turned into industry-strength
optimization engines by BullDAST s.r.l., originating the CGCap and LPClock tools. Thanks to their
architectures, CGCap and LPClock can be easily integrated into any synthesis and simulation flow.

The objective of this paper is to report on the use of the CGCap+LPClock design technology in the context of
the Synopsys-based design flow of STMicroelectronics. In particular, the paper shows, on a few real-life
designs, that superposing the CGCap+LPClock clock tree planning tools on Synopsys
DesignCompiler/PhysicalCompiler leads to significant reductions in the power consumption of the
clock distribution network, with no increase in clock skew and negligible cost in insertion delay.

The paper provides a summary of the design methods that are at the basis of the CGCap and LPClock tools,
as well as detailed information on how such tools have been interfaced to the Synopsys design flow
currently in use at STMicroelectronics. The results achieved on a few design cases are illustrated in order to
show the feasibility of the proposed solution.

1 Introduction
The complex digital circuits required by modern electronics applications dissipate considerable power.
Moreover, the portability requirements of hand-held computers and other mobile devices impose
severe restrictions on power consumption: The dissipated power is becoming a limiting factor for the
amount of functionality that can be placed in these devices. The extensive and continuous use of wireless
network communication services will only worsen this problem, which also affects overall device
reliability. Therefore, techniques and methodologies that reduce power consumption are of growing
interest to the design community, also because of the lack of tools that support low-power-oriented design
decisions. Accurate estimation and optimization of power consumption is an enabling
competitive factor in delivering high quality products at a reasonable cost.

In this paper we focus on clock net power optimization since, in most high-speed digital systems, the clock
net accounts for a significant fraction of the overall power budget. Hence, clock power reduction is a
critical and pressing issue that requires significant design effort and tool support. More specifically, we
describe how two optimization engines, CGCap and LPClock, which together produce a low-power gated clock
tree, have been interfaced with the Synopsys design flow currently in use at STMicroelectronics.

The paper is organized as follows. Section 2 reviews the methodology adopted by the CGCap and LPClock
tools. In Section 3, we discuss how we have realized the interaction between these optimization tools and
the Synopsys-based design flow of STMicroelectronics. We present experimental results in Section 4 and
Section 5 closes the paper.

2 Design Methods
In this section we illustrate the methodologies underlying the CGCap and LPClock tools, highlighting
the elements that distinguish these design methods from existing methodologies for low-power
distribution of the clock signal.

Clock Gating Methodology


It is well known that in complex digital circuits there are some units that do not perform useful computation
for some clock cycles. Clock gating provides a way to disable the corresponding logic when it is not in
use, in order to reduce the switching activity both in the logic and in the clock net.

Even though clock gating insertion can have a significant impact on the overall design power
consumption, computer-aided clock gating tools take a very conservative standpoint: They provide
good support for specifying gated clocks at the RTL and for controlling clock gating instantiation and
synthesis, but only limited support for detecting idle conditions when a designer
does not specify them through a suitable HDL construct at the RTL.

The CGCap clock gating methodology improves the quality of results of the currently available tools
through a scalable algorithm that detects non-trivial redundant clocking in large RTL designs using an
approximate computation of observability don’t cares (ODCs). Furthermore, the idleness detection logic that
controls the gated clock can be synthesized with minimal speed and area overhead.

The key idea in the CGCap clock gating is that idleness conditions can be extracted from an RTL netlist by
focusing on control signals that drive steering modules (e.g., multiplexers, three-state drivers, enabled
registers), while computational units are considered fully observable. Clearly, the latter assumption is a
conservative approximation, but it avoids computationally intensive calculations. ODCs for RTL steering
modules are shown in Fig. 1.
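
[Figure: three RTL steering modules. Multiplexer (inputs Data 0 and Data 1, selects S0 and S1, output Out); three-state driver Tri (Data, En, Out); enabled register Reg (Data, En, Out).]

ODC_Data0 = S0'     ODC_Data1 = S1'     (multiplexer)
ODC_Data = En'                          (three-state driver, enabled register)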

Fig. 1: Steering Modules in a RTL Description.

It is interesting to notice that the clock gating capabilities of current commercial tools can be viewed as a
particular case of ODCs. In fact, by detecting clock gating conditions only when enable signals are present,
they just find ODCs generated locally to the registers.

The ODC computation is performed as outlined in the following. Given a steering module, the ODC of its
output is the intersection of the ODCs of all its fanouts, while the ODC of its inputs is the logic sum of the
ODC of its output and the additional condition produced by the steering module itself. A
computational unit is completely transparent, hence the ODCs of all its inputs are assumed to be the same
as the ODC of its output. The ODCs of the circuit outputs are assumed to be empty if no information on the
external don’t cares is available. These rules are applied for each steering level moving backward in the
network; this backward traversal can also be started from a specified level away from the flip-flops to keep
the complexity of the ODC expressions under control. An example of the ODC computation is given in Fig. 2.

[Figure: example RTL network of steering modules, annotated with the observability don’t care conditions of its registers.]

ODC_REG1 = sum_en' + mux_sel' + ireg_en'
ODC_REG2 = sum_en' + mux_sel + treg_en'

Fig. 2: Example of ODC computation.
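
The backward ODC rules described above can be illustrated with a small Python sketch. It is not the CGCap implementation: the netlist (two registers feeding a 2-input multiplexer, a computational unit, and an enabled output register), the signal names `sel`, `out_en` and `aux_en`, and the encoding of an ODC as a Python predicate are all assumptions made for illustration.

```python
from itertools import product

# An ODC is modelled as a predicate over an assignment of control signals:
# it returns True when the data value cannot be observed downstream.

def lit(name, value):
    """ODC term that holds when control signal `name` equals `value`."""
    return lambda env: env[name] == value

def any_of(*terms):
    """Logic sum (OR) of ODC terms."""
    return lambda env: any(t(env) for t in terms)

def all_of(*terms):
    """Intersection (AND) of ODC terms."""
    return lambda env: all(t(env) for t in terms)

NEVER = lambda env: False  # empty ODC: the signal is always observable

def through_enable(odc_out, en):
    """Enabled register or three-state driver: data is ignored when en == 0."""
    return any_of(odc_out, lit(en, 0))

def through_mux(odc_out, sel, input_index):
    """2-input multiplexer: input i is ignored when sel != i."""
    return any_of(odc_out, lit(sel, 1 - input_index))

def through_unit(odc_out):
    """Computational unit: treated as transparent (fully observable)."""
    return odc_out

def fanout(*branch_odcs):
    """A net is unobservable only if every fanout branch is unobservable."""
    return all_of(*branch_odcs)

def dont_care_assignments(odc, names):
    """Enumerate the control assignments for which the ODC holds."""
    return [vals for vals in product((0, 1), repeat=len(names))
            if odc(dict(zip(names, vals)))]

# Hypothetical network, traversed backward from the output register.
odc_unit_out = through_enable(NEVER, "out_en")    # input of the output register
odc_mux_out = through_unit(odc_unit_out)          # computational unit is transparent
odc_reg_a = through_mux(odc_mux_out, "sel", 1)    # reg_a drives mux input 1
odc_reg_b = through_mux(odc_mux_out, "sel", 0)    # reg_b drives mux input 0
# reg_c drives mux input 0 and, on a second branch, a register enabled by aux_en.
odc_reg_c = fanout(through_mux(odc_mux_out, "sel", 0),
                   through_enable(NEVER, "aux_en"))

print(dont_care_assignments(odc_reg_a, ("out_en", "sel")))
```

In this sketch `odc_reg_a` corresponds to out_en' + sel', which has the same shape as the expressions of Fig. 2: one term per steering module crossed on the way to the outputs.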

The instantiation of the gating logic in the CGCap methodology involves an intrinsic difficulty that needs to be
carefully addressed: ODC conditions masking off the flip-flops in clock cycle k should be used to gate
their clock in cycle k-1. A brute-force method to deal with this problem consists of duplicating the
entire cone of logic between the flip-flops and the control inputs of the steering modules. The logic
duplication overhead can be lowered through the application of a retiming transformation. In particular, if
the flip-flops at the inputs of the cones of logic are moved to the cones’ outputs, the circuit remains
functionally equivalent to the original one, but control signals are now available at both the flip-flop inputs
and outputs. This retiming policy is subject to user-specified constraints to avoid unacceptable modifications
to the design delay.

Clock Tree Planning Methodology

The clock gating technique can significantly reduce the energy consumed by the clock net, hence it has
been viewed as one of the most effective approaches to power minimization. Unfortunately, implementing a
successful clock gating strategy is not a simple task, because the power and area overhead of the clock gating
logic can adversely impact the total consumption. If flip-flops are spread across the chip and/or the gating
logic is shared among a small number of sequential elements, power dissipation may increase even if the gated
clock switching activity is reduced, due to the significant extra wiring introduced in the design.
Therefore, clock gating and clock tree construction should not be seen, as done in the past, as two
independent steps; a synergistic strategy is needed.
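
The trade-off can be made concrete with a first-order estimate. The sketch below uses the standard dynamic power model (activity times capacitance times supply voltage squared times frequency) and lumps the extra wiring into the capacitance of the gated net. The model, the supply voltage, the frequency and the capacitance values are illustrative assumptions, not data from the experiments.

```python
def clock_power(activity, capacitance_f, vdd, freq_hz):
    """First-order dynamic power in watts: activity * C * Vdd^2 * f."""
    return activity * capacitance_f * vdd ** 2 * freq_hz

def gating_pays_off(c_subtree, c_extra_wiring, gated_activity,
                    vdd=1.2, freq_hz=100e6):
    """Compare an ungated subtree (activity 1.0) with its gated version.

    Assumption: the extra wiring needed to reach the gated flip-flops adds
    to the capacitance that switches with the gated clock.
    """
    ungated = clock_power(1.0, c_subtree, vdd, freq_hz)
    gated = clock_power(gated_activity, c_subtree + c_extra_wiring, vdd, freq_hz)
    return gated < ungated

# Hypothetical numbers: a 1 pF subtree whose clock is enabled 40% of the time.
print(gating_pays_off(1e-12, 0.5e-12, 0.4))  # moderate wiring overhead
print(gating_pays_off(1e-12, 2.0e-12, 0.4))  # flip-flops spread across the chip
```

With these numbers the first case saves power, while in the second the added capacitance outweighs the reduced activity, which is the situation described in the paragraph above.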

The LPClock methodology aims at building a power-optimal gated clock tree topology and uses a
state-of-the-art clock tree router to perform detailed clock routing and buffering. As a consequence, the
output of LPClock is a clock netlist that, taken as input by a clock synthesis tool, leads to a low-power gated
clock tree, while still accounting for all non-power-related requirements (e.g., controlled skew, low
crosstalk-induced noise, etc.).

The methodology requires two inputs: (i) An RTL structural description of a synchronous circuit in which a
set of clock gating cells with their respective control signals has already been inserted; (ii) A placement.
The placement and the control signal waveforms, extracted through a simulation of the RTL
design, are processed by the LPClock algorithms.

LPClock builds the gated tree in a bottom-up fashion, by a two-step procedure. The first step builds the
clock tree structure. Given the set of clock sinks for the design, where each sink corresponds to a clock
gating cell, a clock net structure based on a full binary tree representation is built subject to a cost
function. The cost function considers two parameters: The physical distance between each possible pair of
sinks and their logical distance. The latter reflects how similar the behavior of the considered sinks is; it
depends on the control signals and on the number of flip-flops driven by each gating cell.
Once the clock network is built, we have a complete binary tree in which leaves and internal nodes are
characterized by an activation function (i.e., the control signal that drives the sub-tree rooted at the
considered node), but no gating elements are yet inserted.
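
A minimal Python sketch of this first step follows. It is a greedy pairing under assumed definitions, not the LPClock algorithm: the Manhattan distance, the logical distance (fraction of cycles in which two activation waveforms differ, weighted by the number of driven flip-flops), the unit weights, the midpoint placement of merged nodes and the OR of the children's activations are all hypothetical choices.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Optional, Tuple

@dataclass(eq=False)  # nodes are compared by identity
class Node:
    name: str
    x: float
    y: float
    active: Tuple[int, ...]  # activation waveform, one 0/1 value per cycle
    flops: int               # flip-flops driven below this node
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def physical_distance(a, b):
    return abs(a.x - b.x) + abs(a.y - b.y)

def logical_distance(a, b):
    """Assumed metric: waveform mismatch, weighted by the driven flip-flops."""
    differing = sum(p != q for p, q in zip(a.active, b.active))
    return (a.flops + b.flops) * differing / len(a.active)

def cost(a, b, w_phys=1.0, w_logic=1.0):
    return w_phys * physical_distance(a, b) + w_logic * logical_distance(a, b)

def merge(a, b):
    """Parent node: active whenever at least one child is active."""
    active = tuple(p | q for p, q in zip(a.active, b.active))
    return Node(f"({a.name}+{b.name})", (a.x + b.x) / 2, (a.y + b.y) / 2,
                active, a.flops + b.flops, a, b)

def build_tree(sinks):
    """Repeatedly merge the cheapest pair until a single root remains."""
    nodes = list(sinks)
    while len(nodes) > 1:
        a, b = min(combinations(nodes, 2), key=lambda pair: cost(*pair))
        nodes.remove(a)
        nodes.remove(b)
        nodes.append(merge(a, b))
    return nodes[0]

sinks = [
    Node("g0", 0, 0, (1, 1, 0, 0), 8),
    Node("g1", 1, 0, (1, 1, 0, 0), 8),
    Node("g2", 10, 0, (0, 0, 1, 1), 8),
    Node("g3", 11, 0, (0, 0, 1, 0), 8),
]
root = build_tree(sinks)
print(root.name)  # ((g0+g1)+(g2+g3))
```

Sinks that are both close together and active in the same cycles end up as siblings, so that a single activation function can serve the whole sub-tree.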

The second step explores, through a heuristic, the opportunities to move the gating logic from the
leaves toward the upper levels of the clock tree; in this way, the dissipation is also reduced inside the
clock network. An example of the gated tree obtained through a traditional gating scheme versus the
application of the LPClock methodology is given in Fig. 3.


Fig. 3: Traditional Gated Clock Tree (left) and LPClock based Gated Clock Tree (right).
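
The paper does not detail the heuristic used in this second step. The Python sketch below shows one hypothetical rule, purely for illustration: walking the tree from the root, a gating cell is placed at a node only when the node is active noticeably less often than its nearest gated ancestor. The tree encoding, the `margin` threshold and the example waveforms are assumptions.

```python
def activity(active):
    """Fraction of simulated cycles in which the sub-tree must be clocked."""
    return sum(active) / len(active)

def place_gates(node, inherited=1.0, margin=0.1):
    """Return the names of the nodes that receive a gating cell.

    `node` is a (name, activation_waveform, children) tuple; `inherited` is
    the activity of the nearest gated ancestor (1.0 for the free-running root).
    """
    name, active, children = node
    gates = []
    act = activity(active)
    if inherited - act >= margin:
        gates.append(name)  # gating here saves switching below this node
        inherited = act
    for child in children:
        gates.extend(place_gates(child, inherited, margin))
    return gates

# Hypothetical tree: leaves g0..g3 are the original clock gating cells.
tree = ("root", (1, 1, 1, 1), [
    ("A", (1, 1, 0, 0), [("g0", (1, 1, 0, 0), []), ("g1", (1, 1, 0, 0), [])]),
    ("B", (0, 0, 1, 1), [("g2", (0, 0, 1, 1), []), ("g3", (0, 0, 1, 0), [])]),
])
print(place_gates(tree))  # ['A', 'B', 'g3']
```

In this example the gates of g0, g1 and g2 move up to the internal nodes A and B, so the wires below A and B also stop switching when idle; only g3 keeps its own gate because it is idle more often than its parent.
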
3 Tool Flow
In this section, we describe how CGCap and LPClock have been interfaced to the design flow of
STMicroelectronics, which is based on the Synopsys tools. We concentrate on the interaction between the
CGCap+LPClock tools and the Synopsys environment, while the application of the described flow to real
test cases will be discussed in the next section.

Since clock tree planning requires different kinds of information belonging to different levels of
abstraction, we needed to identify various possible contact points to exchange information between the
Synopsys environment and CGCap+LPClock. When possible, the interaction has been performed through
standard formats. The tool flow is presented in Fig. 4.

[Figure: flow diagram. Blocks and labels: VHDL/Verilog; Synopsys Design Compiler; GTECH design; Reports; Simulatable RTL; Synopsys VCS; VCD; CGCap; Reports; dc-tcl; LEF; DEF; Synopsys Physical Compiler; Updated DEF; Standard design flow.]

Fig. 4: Integrated Tool Flow.

A VHDL or Verilog description of the design is first read by DesignCompiler, which analyzes and
elaborates it to obtain an abstract representation of the design in the Synopsys GTECH format. The GTECH
description is given as input to the CGCap+LPClock tools through a filter that implements a recursive
parsing of the wide range of output reports available from the dc_shell (e.g., report_cell, report_net,
report_port).

A database containing a structural RTL GTECH description of the design is thus available for simulation.
The simulation is performed through the Synopsys VCS simulator and the resulting VCD file is parsed to
annotate the database with the proper switching activity.
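
As an illustration of the VCD annotation step, the following Python sketch counts the value changes of each scalar signal in a VCD dump. It is a deliberately simplified reader, not the parser used in the flow: it ignores vector signals and scopes, assumes one value change per line, and the function name `toggle_counts` and the sample dump are hypothetical.

```python
def toggle_counts(vcd_text):
    """Count value changes per scalar signal in a VCD dump (simplified)."""
    names = {}    # VCD identifier code -> signal name
    last = {}     # identifier code -> last seen value
    toggles = {}  # signal name -> number of value changes
    for line in vcd_text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        head = tokens[0]
        if head == "$var":
            # $var <type> <width> <code> <name> ... $end
            names[tokens[3]] = tokens[4]
            toggles[tokens[4]] = 0
        elif head[0] in "$#":
            continue  # other keywords and timestamps
        elif head[0] in "01xzXZ" and head[1:] in names:
            code, value = head[1:], head[0]
            # The first assignment is an initial value, not a toggle.
            if code in last and last[code] != value:
                toggles[names[code]] += 1
            last[code] = value
    return toggles

sample_vcd = """\
$timescale 1ns $end
$scope module top $end
$var wire 1 ! clk $end
$var wire 1 % en $end
$upscope $end
$enddefinitions $end
#0
$dumpvars
0!
0%
$end
#5
1!
#10
0!
1%
#15
1!
#20
0!
"""
print(toggle_counts(sample_vcd))  # {'clk': 4, 'en': 1}
```

Dividing each count by the simulated time, or by the number of clock cycles, gives a switching activity figure that can be attached to the corresponding net.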

The CGCap optimization engine is then run to introduce the ODC-based clock gating logic. All the
modifications made to the database (e.g., added cells, deleted nets, etc.) are recorded in the cgcap
report, whose role is to act as a bridge between the database and the Synopsys environment. In fact, the
cgcap report is parsed and translated into a dc-tcl script that modifies the Synopsys database description of
the design in order to take into account the clock gating insertion. An excerpt of the cgcap report,
along with the dc-tcl script used to perform the update of the Synopsys database, is given in Fig. 5.

CGCap report:

    # CGCap - Date : Tue Jan 22 10:57:04 2004
    # Current design : aless
    # Register banks to be clock gated : all
    ...
    ...
    Register bank num. 34
    Cells : a0_reg , a1_reg , a2_reg , a3_reg , a4_reg , a5_reg , a6_reg , a7_reg
    Clock net : ext_clk
    Synch Enable net : global_enable
    ODC clock gating cell type : ODC_CLOCK_GATING_HIGH
    ODC enable cells (total 5 cells) :
        Capcell_34 GTECH_AND2
        Capcell_35 GTECH_NOT
        Capcell_36 GTECH_OR2
        Capcell_37 GTECH_AND2
        Capcell_38 GTECH_NOT
    Duplicated cells for prediction (total 4 cells) :
        C43@2 GTECH_OR2
        C47@2 GTECH_AND2
        I_0@2 GTECH_NOT
        s1_reg@2 GTECH_BUF
    Total added cells to design 'aless' : 9 cells + 1 ODC clock gating cell

dc-tcl script:

    # Insert ODC clock gating logic on register bank num. 34
    disconnect_net global_enable [get_pins { a0_reg/synch_enable
        a1_reg/synch_enable a2_reg/synch_enable a3_reg/synch_enable
        a4_reg/synch_enable a5_reg/synch_enable a6_reg/synch_enable
        a7_reg/synch_enable }]
    connect_net CGCap_logic_one [get_pins { a0_reg/synch_enable
        a1_reg/synch_enable a2_reg/synch_enable a3_reg/synch_enable
        a4_reg/synch_enable a5_reg/synch_enable a6_reg/synch_enable
        a7_reg/synch_enable }]
    copy_design ODC_CLOCK_GATING_HIGH ODC_CLOCK_GATING_HIGH_34
    create_cell a0_reg_ODC_cell ODC_CLOCK_GATING_HIGH_34
    create_net n_123
    disconnect_net ext_clk [get_pins { a0_reg/clocked_on
        a1_reg/clocked_on a2_reg/clocked_on a3_reg/clocked_on
        a4_reg/clocked_on a5_reg/clocked_on a6_reg/clocked_on
        a7_reg/clocked_on }]
    ……
    connect_net Capnet_58 [get_pins Capcell_34/B]
    create_cell Capcell_36 GTECH_OR2
    connect_net Capnet_58 [get_pins Capcell_36/Z]
    create_net Capnet_59
    connect_net Capnet_59 [get_pins Capcell_36/A]
    create_cell Capcell_37 GTECH_AND2
    connect_net Capnet_59 [get_pins Capcell_37/Z]
    create_net Capnet_60
    connect_net Capnet_60 [get_pins Capcell_36/B]
    create_cell Capcell_38 GTECH_NOT
    connect_net Capnet_60 [get_pins Capcell_38/Z]
    create_cell C43@2 GTECH_OR2
    create_net Capnet_61
    connect_net Capnet_61 [get_pins C43@2/Z]
    connect_net Capnet_61 [get_pins Capcell_35/A]
    create_cell C47@2 GTECH_AND2
    create_net Capnet_62
    connect_net Capnet_62 [get_pins C47@2/Z]
    connect_net Capnet_62 [get_pins Capcell_38/A]
    create_cell I_0@2 GTECH_NOT
    create_net Capnet_63
    connect_net Capnet_63 [get_pins I_0@2/Z]
    connect_net Capnet_63 [get_pins Capcell_37/A]
    create_cell s1_reg@2 GTECH_BUF
    create_net Capnet_64
    connect_net Capnet_64 [get_pins s1_reg@2/Z]
    ….
    connect_net n_97 [get_pins C47@2/B]
    connect_net n_103 [get_pins I_0@2/A]
    connect_net n_10 [get_pins s1_reg@2/A]

Fig. 5: Example of CGCap Report and dc-tcl Corresponding Script.
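
The translation from report entries to script commands can be sketched in Python as follows. The record layout (lists of cells, nets and connections) and the function name `to_dc_tcl` are hypothetical; only the emitted commands (`create_cell`, `create_net`, `connect_net` with `get_pins`) and the example names are taken from Fig. 5.

```python
def to_dc_tcl(cells, nets, connections):
    """Emit dc-tcl commands for added cells, added nets and pin connections.

    cells:       list of (cell_name, reference_name) pairs
    nets:        list of net names
    connections: list of (net_name, pin_path) pairs
    Cells and nets are created first so that every connection can resolve.
    """
    lines = []
    for name, reference in cells:
        lines.append(f"create_cell {name} {reference}")
    for net in nets:
        lines.append(f"create_net {net}")
    for net, pin in connections:
        lines.append(f"connect_net {net} [get_pins {pin}]")
    return "\n".join(lines)

script = to_dc_tcl(
    cells=[("Capcell_36", "GTECH_OR2"), ("Capcell_37", "GTECH_AND2")],
    nets=["Capnet_59"],
    connections=[("Capnet_59", "Capcell_36/A"), ("Capnet_59", "Capcell_37/Z")],
)
print(script)
```

Unlike the script of Fig. 5, which interleaves creation and connection commands, this sketch groups them by kind; the resulting netlist modifications are the same.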

After the completion of the CGCap tool run, the GTECH RTL description of the design is again extracted
from the PowerChecker for a further simulation at the RTL. The obtained VCD file is used later to plan the
clock net through LPClock.

Once the DesignCompiler view of the design has been updated, the designer proceeds with the standard
flow to obtain a valid placement. A description of the placed design in DEF format, together with the LEF
technology information of the standard cell library onto which the design is mapped, completes the set of
inputs needed by the LPClock engine to plan the clock net.

All the previous information, processed by LPClock, leads to a clock net with the gating logic positioned
inside the clock tree. The LPClock tool outputs an updated placement for the design that can be introduced
into the Synopsys PhysicalCompiler database through an incremental ECO process.

In this way the Synopsys database contains a design whose clock network has been planned by the
CGCap+LPClock methodology and the design flow can proceed normally. The interaction between
LPClock and Synopsys PhysicalCompiler is further detailed in Fig. 6.

[Figure: flow diagram. Blocks and labels: Design Compiler; Floorplan (third party); Physical Compiler (physopt, physopt, with options -incremental and -eco); CTS; Router (third party); Gate level Netlist and DEF; DEF netlist update.]

Fig. 6: Interaction of the LPClock Engine with Synopsys PhysicalCompiler.

4 Experimental Results
The tool flow described in Section 3 has been applied to two benchmarks provided by STMicroelectronics,
whose details are summarized in Tab. 1.

Benchmark      Number of Modules   Number of Gates   Number of Clock Sinks
Industrial 1   6                   18176             2132
Industrial 2   9                   27776             3016

Tab 1: Benchmark Information.

Each design was first synthesized using the standard Synopsys flow (DesignCompiler+PhysicalCompiler)
with no clock-gating. The whole synthesis process was timing driven, and mapping was done onto the
0.13um HCMOS9 technology library by STMicroelectronics. Layout extraction was performed next, and
the gate-level netlists back-annotated using the extracted parameters. Finally, gate-level power estimation
was performed using PowerCompiler. The same process was repeated for the two benchmarks using
traditional clock gating and using the flow of Fig. 4.

The clock tree power consumption results are reported in Tab. 2. In particular, column Traditional
Clock-Gating shows the savings in the power consumed by the clock tree w.r.t. the original circuit
implementation (i.e., no clock-gating) achieved by inserting the clock-gating logic only at the inputs of the
RTL modules. On the other hand, column CGCap+LPClock shows the clock tree power savings against
the original circuits obtained by inserting the clock-gating logic as suggested by CGCap+LPClock. A
comparison of the clock power data for the two optimized circuits shows that CGCap+LPClock offers
significant additional savings (around 22%) over traditional clock-gating (last column of Tab. 2).

Benchmark      Traditional Clock-Gating   CGCap+LPClock   Additional Savings over Traditional
Industrial 1   18.46%                     37.05%          22.79%
Industrial 2   16.18%                     34.55%          21.91%

Tab 2: Results on Benchmark Circuits.
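
The last column of Tab. 2 is consistent with measuring the CGCap+LPClock clock power relative to the traditionally gated clock power, rather than subtracting the two savings percentages. The short Python check below makes that arithmetic explicit; the formula is our reading of the table, not a definition given in the text.

```python
def additional_saving(saving_traditional, saving_new):
    """Saving of the new flow relative to the traditionally gated design.

    Both arguments are savings w.r.t. the ungated design, as fractions.
    """
    remaining_traditional = 1.0 - saving_traditional
    remaining_new = 1.0 - saving_new
    return 1.0 - remaining_new / remaining_traditional

industrial_1 = additional_saving(0.1846, 0.3705)
industrial_2 = additional_saving(0.1618, 0.3455)
print(f"{industrial_1:.4f} {industrial_2:.4f}")
```

The two values come out within 0.01 percentage points of the figures listed in the table.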


Tab. 3 reports the results of the timing analysis performed with Synopsys PrimeTime on the synthesized
netlists, back-annotated with the capacitance information obtained after extraction. The data show that the
clock tree structure generated through CGCap+LPClock has the same skew as the clock tree generated
using traditional clock-gating. On the other hand, the tree built using CGCap+LPClock has a longer delay
from the root of the tree to the clock leaves (insertion delay). This larger insertion delay is due
to the more complex structure that is usually associated with a tree planned with CGCap+LPClock in order to
position the gating logic as close as possible to the clock root.

Benchmark      Optimization               Skew (ns)   Insertion Delay (ns)
Industrial 1   Traditional Clock-Gating   0.1         2.1
Industrial 1   CGCap+LPClock              0.1         3.2
Industrial 2   Traditional Clock-Gating   0.2         1.8
Industrial 2   CGCap+LPClock              0.2         2.4

Tab 3: Timing Analysis on Benchmark Circuits.

5 Conclusions
Interconnection capacitance is becoming more and more dominant in very deep-submicron technologies; as
a consequence the clock distribution network currently represents the major performance and power
consumption bottleneck.

In this paper we have presented the results of applying, in an industrial environment, new methodologies
that target a low-power clock tree. The CGCap methodology makes it possible to determine more powerful
conditions for clock gating, since the calculation of the observability don’t care functions considers both
topological and functional information about the design. A distinguishing feature of the CGCap+LPClock
methodology is its capability of exploiting both physical and logical information of the given design to
optimize the clock structure. Running the described experiments required a significant effort in order to
make the CGCap and LPClock engines able to interact with the Synopsys-based design flow currently in
use at STMicroelectronics.

For the considered benchmarks, experimental results showed clock power savings on the order of 22% over
traditional clock-gating, thus encouraging the superposition of clock optimization techniques on standard
design flows.

References
[1] L. Benini, A. Ivaldi, M. Donno, E. Macii,
“Clock Tree Power Optimization based on RTL Clock-Gating,”
DAC-40: ACM/IEEE Design Automation Conference, pp. 622-627, Anaheim, CA, June 2003.

[2] L. Benini, E. Macii, P. Babighian,
“A Scalable ODC-Based Algorithm for RTL Insertion of Gated Clocks,”
DATE-04: IEEE Design Automation and Test in Europe, pp. 500-505, Paris, France, February 2004.

[3] J. Oh, M. Pedram,
“Gated Clock Routing for Low-Power Microprocessor Design,”
IEEE Transactions on CAD/ICAS, Vol. 20, No. 6, pp. 715-722, June 2001.

[4] D. Garrett, M. Stan, A. Dean,
“Challenges in Clock Gating for a Low Power ASIC Methodology,”
ISLPED-99: ACM/IEEE Intl. Symp. on Low-Power Electronics and Design, pp. 176-181, San Diego, CA, August 1999.

[5] A. Farrahi, C. Chen, A. Srivastava, G. Tellez, M. Sarrafzadeh,
“Activity Driven Clock Design,”
IEEE Transactions on CAD/ICAS, Vol. 20, No. 6, pp. 705-714, June 2001.
