Professional Documents
Culture Documents
RUMBLE: An Incremental Timing-Driven Physical-Synthesis Optimization Algorithm
RUMBLE: An Incremental Timing-Driven Physical-Synthesis Optimization Algorithm
AbstractPhysical-synthesis tools are responsible for achieving synthesis flow. To this end, we review the major phases of such
timing closure. Starting with 130-nm designs, multiple cycles are flows as follows [1], [4].
required to cross the chip, making latch placement critical to
success. We present a new physical-synthesis optimization for latch 1) Global placement computes nonoverlapping physical
placement called Rip Up and Move Boxes with Linear Evaluation locations for gates and typically optimizes half-perimeter
(RUMBLE) that uses a linear timing model to optimize timing wirelength (HPWL) or weighted HPWL. As part of this
by simultaneously replacing multiple gates. RUMBLE runs incre-
mentally and in conjunction with static timing analysis to improve phase, usually, some amount of detail placement is done,
the timing for critical paths that have already been optimized and legalization is called to ensure a legal optimization
by placement, gate sizing, and buffering. Experimental results result.
validate the effectiveness of the approach: Our techniques improve 2) Electrical correction fixes capacitance and slew viola-
slack by 41.3% of cycle time on average for a large commercial tions with gate sizing and buffering.
ASIC design.
3) Legalization is an incremental-placement capability that
Index TermsStatic timing analysis, timing-driven placement. removes overlaps caused by optimization with minimal
disturbance to placement and timing.
I. I NTRODUCTION 4) Timing analysis assesses the speed of the design and
determines if performance targets are met. Among other
P HYSICAL synthesis is a complex multiphase process
primarily designed to achieve timing closure, although
power, area, yield, and routability also need to be optimized.
metrics, this phase determines the slack of every path in
the circuitthe difference between the clock period and
Starting with 130-nm designs, signals can no longer cross how long it takes a signal to traverse the path.
the chip in a single cycle, which means that pipeline latches 5) Detail placement moves gates to further reduce wire-
need to be introduced to create multicycle paths. This problem length and improve timing. In this phase, it is possible
becomes more pronounced for 90-, 65-, and 45-nm nodes, to do timing-driven detail placement wherein timing in-
where interconnect delay increasingly dominates gate delay formation is explicitly considered when optimizing gate
[10]. Indeed, for high-performance ASIC scaling trends, the placements.
number of pipeline latches increases by 2.9 at each technol- 6) Critical-path optimization identifies most-critical paths
ogy generation, accounting for as much as 10% of the area of and focuses on techniques to improve the slack for the
90-nm design [8] and as many as 18% of the gates in 32-nm worst-timing violations. Relevant optimizations include
design [22]. Hence, the proper placement of pipeline latches is buffering, gate sizing, and incremental synthesis [23].
a growing problem for timing closure. 7) Compression optimizes the remaining paths that violate
The choice of computational techniques for latch placement timing constraints when improvements on most-critical
depends on where this optimization is invoked in a physical- paths are no longer possible. The goal is to compress the
timing histogram and reduce number of negative-slack
Manuscript received May 15, 2008; revised July 22, 2008. Current version
paths that require designer intervention.
published November 19, 2008. This paper was recommended by Associate
Editor D. Z. Pan.
The flow can be repeated with net weighting and timing-driven
D. A. Papa is with the Electrical Engineering and Computer Science De- placement to further improve results.
partment, University of Michigan, Ann Arbor, MI 48109 USA, and also with One can think of physical synthesis as progressing with
IBM Austin Research Laboratory, Austin, TX 78758 USA (e-mail: iamyou@
umich.edu).
variable detail and variable accuracy. For example, during
T. Luo was with the Department of Electrical and Computer Engineering, global placement, large changes are made to the design using
University of Texas at Austin, Austin, TX 78712 USA. He is now with Magma a coarse objective (such as wirelength) that is oblivious to
Design Automation, Austin, TX 78759 USA (e-mail: tluo@magma-da.com).
M. D. Moffitt, C. N. Sze, Z. Li, G.-J. Nam, and C. J. Alpert are with IBM timing considerations. Later, one may perform more accurate
Austin Research Laboratory, Austin, TX 78758 USA (e-mail: mdmoffitt@ optimization using an Elmore interconnect delay model with
us.ibm.com; csze@us.ibm.com; lizhuo@us.ibm.com; gnam@us.ibm.com; Steiner-tree estimates for net capacitance. As timing begins
alpert@us.ibm.com).
I. L. Markov is with the Electrical Engineering and Computer Science to converge, one can apply more costly fine-grained buffering
Department, University of Michigan, Ann Arbor, MI 48109 USA (e-mail: along actual detailed routes using a statistical timing model.
imarkov@umich.edu). Fig. 1(a)(d) shows the complications of using existing
Color versions of one or more of the figures in this paper are available online
at http://ieeexplore.ieee.org. global placement techniques to solve the latch-placement prob-
Digital Object Identifier 10.1109/TCAD.2008.2006155 lem for a single two-pin net. Assume that, for all four figures,
0278-0070/$25.00 2008 IEEE
PAPA et al.: RUMBLE: TIMING-DRIVEN PHYSICAL-SYNTHESIS OPTIMIZATION ALGORITHM 2157
must move the buffer in order to improve the slack of the latch,
and other complications are shown in Fig. 3. At the same time,
the optimal new location of the latch depends on how the input
Fig. 2. Poorly placed latch with buffered interconnect. In this case, the buffer and output nets are buffered. As a result, the optimal approach
must be moved or removed in order to have the freedom to move the latch far is to simultaneously move the latch and perform buffering, but
enough to fix the path. this is computationally prohibitive because a typical multiple-
objective buffering algorithm runs in exponential time. As
combinational stage. This fact offers a particularly significant
mentioned in Section I, we propose a sequential approach in
boost to our basic approach and is enhanced even further when
which we first compute the new locations for a selected set of
surrounding gates are also free to move.
movable gates based on timing estimation considering buffers.
A solution to this problem is called a transform using the
Then, buffering is applied to the input and output nets of the
terminology of [23]. A transform is an optimization designed
selected movable gates. Being practical, effective, and efficient,
to incrementally improve timing. Other examples of transforms
this approach can be integrated into a typical very large scale
include buffering a single net, resizing a gate, cloning a cell,
integration physical-synthesis flow. The calculation of opti-
swapping pins on a gate, etc. The way transforms are invoked
mal movement depends on a simple but effective buffered-
in a physical-synthesis flow is determined by the drivers. For
interconnect delay model, which is discussed in the following
example, a driver designed for critical-path optimization may
section.
attempt a transform on the 100 most-critical cells. A driver
designed for the compression stage (see Section I) may attempt
a transform on every cell that fails to meet its timing constraints. A. Linear Buffered-Path Delay Estimation
A driver has the option of avoiding transforms that may
harm the design (e.g., generating new buffering solutions in- Buffering has become indispensable in timing closure and
ferior to the original) and can then reject this solution. This cannot be ignored during interconnect delay estimation [3],
do no harm philosophy of optimization has received significant [7], [22]. Therefore, to calculate new locations of movable
recognition in recent work [5], [20]. The RUMBLE approach gates, one must adopt a buffering-aware interconnect delay
adopts this same convention which makes it more trustworthy model that accounts for future buffers. We found that the linear
in a physical-synthesis flow. delay model described in [3] and [17] is best suited for this
While no previous work has attempted to solve this par- application. In this model, the delay along an optimally buffered
ticular problem, other solutions do exist that may be able to interconnect is
help with the placement of poorly placed latches. The authors
delay(L) = L(Rb C + RCb + 2Rb Cb RC) (1)
of [24] propose a linear-programming formulation that mini-
mizes downstream delay to choose locations for gates in field-
where L is the length of a two-pin buffered net and Rb and Cb
programmable gate arrays. The authors of [6] model static
are the intrinsic resistance and input capacitance of buffers and
timing analysis (STA) in a linear-programming formulation by
gates while R and C are unit wire resistance and capacitance,
approximating the quadratic delay of nets with a piecewise-
respectively.
linear function. Their formulations objective is to maximize
Empirical results in [3] indicate that (1) is accurate up to
the improvement in total negative slack of timing endpoints.
0.5% when at least one buffer is inserted along the net. Fur-
The authors of both approaches conclude that the addition of
thermore, our own empirical results in Section VI-B suggest a
buffering would improve their techniques [6], [24]. When these
97% correlation between this linear delay model and the output
transformations are applied at the same point in a physical-
of an industry timing-analysis tool.
synthesis flow that we propose, they will be restricted by
previous optimizations. When applied somewhat earlier (e.g.,
following global placement), they are incapable of certain im- B. Timing Graph
provements. Namely, downstream optimizations, such as buffer
In RUMBLE, a set of movable gates is selected, which must
insertion, gate sizing, and detail placement, may invalidate the
include fixed gates or input/output ports to terminate every path.
optimality of latch placement. Therefore, our technique focuses
Fixed gates and I/Os help formulate timing constraints and limit
on the bad latch placements that we observed in large com-
the locations of movables. In Fig. 4(a), we assume that new
mercial ASIC designs after state-of-the-art physical-synthesis
locations have to be computed for the latch and the two OR
optimizations. However, we believe that RUMBLE may be too
gates, while all NAND gates are kept fixed.
disruptive to use after routing.
In the timing graph, each logic gate is represented by a node,
while a latch is represented by two nodes because the inputs
III. RUMBLE T IMING M ODEL
and outputs of a latch are in different clock cycles and can
We now introduce the timing model critical to RUMBLEs have different slack values. Each edge represents a driversink
success. path along a net and is associated with a delay value which is
Fig. 2 shows an intuitive example of the problem when we try linearly proportional to the distance between the driver and the
to find new locations for movable gates. Similar to Fig. 1, the sink gate. In other words, we decompose each multipin net into
latch has to be moved to the right to improve timing. However, a set of two-pin edges that connect the driver to each sink of
since the latch drives a buffer which is placed next to it, we the net.This simplification is crucial to our linear delay model
PAPA et al.: RUMBLE: TIMING-DRIVEN PHYSICAL-SYNTHESIS OPTIMIZATION ALGORITHM 2159
Fig. 3. Layout in (a) has a poorly placed latch, and existing critical path optimizations do not solve the problem. Repowering the gates may improve the timing
some in (b), but if it cannot fix the problem, the latch must be moved. Moving the latch up to the next buffer, shown in (c), does not give optimization enough
freedom. If we move the latch but do not rebuffer in (d), timing may degrade. Fig. 9(d) shows the ideal solution to this problem.
Fig. 5. In many subcircuits, there are multiple slack-optimal placements. In RUMBLE we add a secondary objective to minimize the displacement from the
original placement. This helps to maintain the timing assumptions made initially and reduces legalization issues. (a) Initial state of and example subcircuit.
(b) Slack-optimal solution commonly returned by LP solvers, all optimal solutions lie on the dotted line. (c) Solution given by RUMBLE that maximizes worst
slack then minimizes displacement.
For every movable gate mi and gate gj that drives one of its mM : xm mM : ym
inputs via net nk mM : mx mM : xm (13)
mM : my mM : ym .
Ami Agj + Uxnk Lnx k + Uynk Lny k + Dg . (10)
For every movable gate mi , xmi and ymi denote the original x
For every timing arc in the subcircuit np,q associated with and y coordinates. The upper and lower bounds of the new and
net ni original coordinates and in each dimension are as follows:
Sni Rq Ap Uxni Lnx i + Uyni Lny i . (11) DISPLACEMENT CONSTRAINTS:
Fig. 7. Modeling feedback paths within logic requires a new type of gate. Pseudomovable gates have timing values that depend on the timing values of
neighboring gates, but they cannot be moved. (a) Ignoring the presence of feedback paths is overly pessimistic, and it appears that the timing of the latch
cannot meet its constraints. (b) Making the fixed gates along a feedback path pseudomovable allows the latch to meet its timing constraints, but doing only this
can lead to the wrong placement. (c) Including all gates connected to pseudomovables as fixed timing points properly models the problem as a convex subcircuit.
Fig. 8. Subcircuit selection transparently skips buffers when building a neighborhood of movable gates and requires detection of pseudomovables.
Fig. 9. RUMBLE algorithm proceeds by (a) selecting a subcircuit to work on. An LP is formulated and solved, with movable gates being relocated as
shown in (b). Existing repeater trees are no longer appropriate and are subsequently removed in (c). Finally, the nets are rebuffered, forming the final subcircuit
shown in (d).
TABLE I TABLE II
KEEPING BUFFERS INSTEAD OF REMOVING AND REINSERTING THEM RUMBLE MODEL ACCURATELY PREDICTS THE SOLUTION QUALITY
DEGRADES RUMBLEs PERFORMANCE IMPROVEMENTS IN THE REFERENCE TIMING MODEL
TABLE III In fact, for six out of ten one-hop subcircuits and for seven
COMPARISON OF RUMBLEs LP TO A SLACK-WEIGHTED
COG TECHNIQUE out of ten two-hop circuits, multimove actually brought
the FOM down to zero, meaning it fixed all the timing
violations. Iterative single move was able to fix two and
four, respectively.
2) On average, the worst-slack improvements were 849 and
673 ps, respectively, for one- and two-hop subcircuits.
The diminished improvement for larger subcircuits is
likely because we are including more nets, some of which
cannot be improved as much as those connected to the
imbalanced latch (Fig. 6 shows an example).
3) Solving the LP takes 53 ms for one-hop subcircuits and
325 ms for two-hop subcircuits, on average.
TABLE IV
RUMBLE SIMULTANEOUSLY MOVING A one-hop NEIGHBORHOOD COMPARED TO ITERATIVELY MOVING THE SAME GATES INDIVIDUALLY
TABLE V
RUMBLE SIMULTANEOUSLY MOVING A two-hop NEIGHBORHOOD COMPARED TO ITERATIVELY MOVING THE SAME GATES INDIVIDUALLY
[5] K.-H. Chang, I. L. Markov, and V. Bertacco, Safe delay optimization for Michael D. Moffitt received the B.S. degree in com-
physical synthesis, in Proc. ASP-DAC, 2007, pp. 628633. puter science and engineering, the dual M.S. degrees
[6] A. Chowdhary et al. How accurately can we model timing in a placement in computer science and operations research, and the
engine? in Proc. DAC, 2005, pp. 801806. Ph.D. degree from the University of Michigan, Ann
[7] J. Cong, L. He, C.-K. Koh, and P. H. Madden, Performance optimization Arbor, in 2003, 2005, 2006, and 2007, respectively.
of VLSI interconnect layout, Integr. VLSI J., vol. 21, no. 1/2, pp. 194, He is currently with the Design Productivity
Nov. 1996. Group, IBM Austin Research Laboratory, Austin,
[8] P. Cocchini, Concurrent flip-flop and repeater insertion for high perfor- TX. His principal contributions include new algorith-
mance integrated circuits, in Proc. ICCAD, 2002, pp. 268273. mic techniques for temporal reasoning, infeasibility
[9] B. Halpin, C. Y. R. Chen, and N. Sehgal, Timing driven placement using analysis, planning and scheduling, block packing,
physical net constraints, in Proc. DAC, 2001, pp. 780783. floor planning, global routing, and timing optimiza-
[10] International Technology Roadmap for Semiconductors, 2001. [Online]. tion. His primary research interests include constraint satisfaction, combinator-
Availabe: http://public.itrs.net ial optimization, and their application to large-scale problems commonly faced
[11] M. A. B. Jackson and E. S. Kuh, Performance-driven placement of cell in artificial intelligence and computer-aided design.
based ICs, in Proc. DAC, 1989, pp. 370375. Dr. Moffitt developed the winning entry in the ISPD 2007 Global Routing
[12] A. B. Kahng and I. L. Markov, Minmax placement for large-scale Contest, and was invited to speak at ISPD 2008 on modern advances in
timing optimization, in Proc. ISPD, 2002, pp. 143148. academic global routing. He has served on the Technical Program Committee
[13] T. T. Kong, A novel net weighting algorithm for timing-driven place- for Design Automation and Test on Europe 2009. He was the recipient of the
ment, in Proc. ICCAD, 2002, pp. 172176. Josef Raviv Memorial Postdoctoral Fellowship.
[14] T. Luo, D. Newmark, and D. Z. Pan, A new LP based incremental timing
driven placement for high performance designs, in Proc. DAC, 2006,
pp. 11151120.
[15] M. Marek-Sadowska and S. P. Lin, Timing driven placement, in Proc.
ICCAD, 1989, pp. 9497.
[16] R. Nair, C. Berman, P. Hauge, and E. Yoffa, Generation of perfor-
mance constraints for layout, IEEE Trans. Comput.-Aided Design Integr.
Circuits Syst., vol. 8, no. 8, pp. 860874, Aug. 1989.
[17] R. Otten, Global wires harmful? in Proc. ISPD, 1998, pp. 104109.
[18] S. L. Ou and M. Pedram, Timing-driven placement based on partitioning
C. N. Sze (S03M05) received the B.Eng. and
with dynamic cut-net control, in Proc. DAC, 2000, pp. 472476.
M.Phil. degrees from the Department of Computer
[19] D. A. Papa et al., RUMBLE: An incremental, timing-driven, physical-
Science and Engineering, The Chinese University
synthesis optimization algorithm, in Proc. ISPD, 2008, pp. 29.
of Hong Kong, Sha Tin, Hong Kong, in 1999 and
[20] H. Ren et al., Hippocrates: First-do-no-harm detailed placement, in
2001, respectively, and the Ph.D. degree in computer
Proc. ASP-DAC, 2007, pp. 141146.
engineering from the Department of Electrical En-
[21] S. Sapatnekar, Timing. New York: Springer-Verlag, 2004.
gineering, Texas A&M University, College Station,
[22] P. Saxena, N. Menezes, P. Cocchini, and D. A. Kirkpatrick, Repeater
in 2005.
scaling and its impact on CAD, IEEE Trans. Comput.-Aided Design
He is currently a Research Staff Member with
Integr. Circuits Syst., vol. 23, no. 4, pp. 451463, Apr. 2004.
IBM Austin Research Laboratory, Austin, TX, where
[23] L. Trevillyan et al., An integrated environment for technology closure
he focuses on integrated placement and timing opti-
of deep-submicron IC designs, IEEE Des. Test Comput., vol. 21, no. 1,
mization for ASIC and microprocessor designs, as well as the tools and designs
pp. 1422, Jan./Feb. 2004.
for high-performance clock-distribution networks.
[24] Q. Wang, J. Lillis, and S. Sanyal, An LP-based methodology
Dr. C. N. Sze was the recipient of a DAC Graduate Scholarship. He
for improved timing-driven placement, in Proc. ASP-DAC, 2005,
has served on the technical program committees for several international
pp. 11391143.
conferences and workshops including International Symposium on Physical
Design, International Symposium on Circuits And Systems, System-Level
Interconnected Prediction (SLIP), and System-on-Chip Conference.
David A. Papa (S05) received the B.S. and M.S.
degrees in computer science and engineering from
the University of Michigan, Ann Arbor, in 2002 and
2004, respectively. He is currently working toward
the Ph.D. degree in the Department of Electrical
Engineering and Computer Science, University of
Michigan.
He is currently also with IBM Austin Research
Laboratory, Austin, TX, where he is contributing to
their physical-synthesis tool PDS. His main area of Zhuo Li (S01M05) received the B.S. and M.S.
interest is in physical design, particularly VLSICAD degrees in electrical engineering from Xian Jiaotong
placement. University, Xian, China, in 1998 and 2001, respec-
tively, and the Ph.D. degree in computer engineer-
ing from Texas A&M University, College Station,
Tao Luo received the B.S. degree in electronic me- in 2005.
chanics from the University of Electronic Science From 2005 to 2006, he was a Senior Technical
and Technology of China, Chengdu, China, the M.S. Staff with Pextra Corporation, where he worked on
degree in management information systems from very large scale integration extraction-tool develop-
Tongji University, Shanghai, China, and the M.S. ment. He is currently with IBM Austin Research
and Ph.D. degrees in computer engineering from the Laboratory, Austin, TX. His research interests in-
University of Texas at Austin, Austin. cluded fast physical synthesis, wire synthesis, optimization for routability, fast
He was with StarCore LLC, Analog Devices Inc., routing estimation, circuit modeling and simulation, mask optimization, and
Advanced Micro Devices Inc., and IBM Austin Re- design and analysis of algorithms.
search Laboratory, Austin, for part-time and intern- Dr. Li was the recipient of the Applied Materials Fellowship in 2002, the
ship positions. He is currently with Magma Design Best Paper Award at the Asia and South Pacific Design Automation Conference
Automation, Austin, where he is working on developing placement and layout in 2007, and the IEEE Circuits and System Society Outstanding Young Author
migration tools for custom IC designs. Award in 2007. He has served as technical program committee member for the
Dr. Luo was the recipient of the Engineering Doctoral Fellowship from the International Conference on Computer Design; Design, Automation and Test
University of Texas in 20062007 and was the recipient of the best paper in in Europe Conference and Exhibition; and Design Automation Conference
session award at SRC Techon 2007. Ph.D. Forum.
2168 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 27, NO. 12, DECEMBER 2008
Gi-Joon Nam received the B.S. degree in computer Igor L. Markov (SM05) received the M.A. degree
engineering from Seoul National University, Seoul, in mathematics and the Ph.D. degree in computer sci-
Korea, and the M.S. and Ph.D. degrees in com- ence from the University of California, Los Angeles.
puter science and engineering from the University of He is currently an Associate Professor of electrical
Michigan, Ann Arbor. engineering and computer science with the Univer-
Since 2001, he has been with IBM Austin Re- sity of Michigan, Ann Arbor. He is a coauthor of
search Laboratory, Austin, TX, where he is primarily more than 150 refereed publications. Prof. Markov
working in the physical design space, particularly is a member of the Editorial Board of the Com-
on placement and routing and timing closure flow. munications of the ACM and ACM Transactions on
His general interests include computer-aided design Design Automation of Electronic Systems. He is
algorithms, combinatorial optimization, very large also a member of the Executive Board of the ACM
scale integration system design, and computer architecture. SIGDA and the Editorial Board of the IEEE TRANSACTIONS ON COMPUTERS
Dr. Nam has served on the technical program committees for various con- and the IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN (TCAD).
ferences including the International Conference on Computer Aided Design, His research interests include computers that make computers (software and
the International Symposium on Physical Design (ISPD), and the Asia and hardware), secure hardware design, and combinatorial optimization with ap-
South Pacific Design Automation Conference. He will be the General Chair plications to the design, verification, and debugging of integrated circuits and
of ISPD 2009. He was the recipient of first place award for the 38th Design quantum logic circuits.
Automation Conference Student Design Contest, numerous IBM awards, and Prof. Markov is a Senior Member of the Association for Computing
Special Interest Group on Design Automation technical leadership award from Machinery (ACM). He has served on and chaired a number of conference
the Association for Computing Machinery. program committees. He mentored six Ph.D. students and is now working
with seven graduate students. He was a recipient of the DAC Fellowship, the
ACM SIGDA Outstanding New Faculty Award, the ACM SIGDA Technical
Charles (Chuck) J. Alpert (S92M96SM02 Leadership Award, the NSF CAREER Award, the IBM Partnership Award, the
F05) received the B.S. degree in math and Synplicity Inc. Faculty Award, the Microsoft A. Richard Newton Breakthrough
computational sciences and the B.A. degree in Research Award, the EECS Department Outstanding Achievement Award from
history from Stanford University in 1991. He re- the University of Michigan, the Best Paper Awards at the Design Automation
ceived the Ph.D. degree in computer science from and Test in Europe Conference and the International Symposium on Physical
the University of California at Los Angeles (UCLA), Design, and the IEEE CAS Donald O. Pederson Award for Best Paper in the
in 1996. IEEE TCAD.
Since 1996, he has been with IBM Austin Re-
search Laboratory, Austin, TX, where he currently
manages the design productivity group, whose mis-
sion is to develop design automation tools and
methodologies to improve designer productivity and reduce design cost. He
has published over 100 conference and journal publications.
Dr. Alpert has been active in the academic community, serving as Chair
for the Tau Workshop on Timing Issues and the International Symposium
on Physical Design. He also serves as an Associate Editor of the IEEE
TRANSACTIONS ON COMPUTER-AIDED DESIGN. He was the recipient of the
Mahboob Khan Mentor Award for his work in mentoring in 2001 and 2007.
He was the recipient of three Best Paper Awards from the Association for
Computing Machinery/IEEE Design Automation Conference.