RUMBLE: An Incremental Timing-Driven Physical-Synthesis Optimization Algorithm

2156 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 27, NO.
12, DECEMBER 2008
David A. Papa, Student Member, IEEE, Tao Luo, Michael D. Moffitt, C. N. Sze, Member, IEEE,
Zhuo Li, Member, IEEE, Gi-Joon Nam, Charles J. Alpert, Fellow, IEEE, and Igor L. Markov, Senior Member, IEEE
Abstract: Physical-synthesis tools are responsible for achieving timing closure. Starting with 130-nm designs, multiple cycles are required to cross the chip, making latch placement critical to success. We present a new physical-synthesis optimization for latch placement called Rip Up and Move Boxes with Linear Evaluation (RUMBLE) that uses a linear timing model to optimize timing by simultaneously replacing multiple gates. RUMBLE runs incrementally and in conjunction with static timing analysis to improve the timing for critical paths that have already been optimized by placement, gate sizing, and buffering. Experimental results validate the effectiveness of the approach: Our techniques improve slack by 41.3% of cycle time on average for a large commercial ASIC design.

Index Terms: Static timing analysis, timing-driven placement.

Manuscript received May 15, 2008; revised July 22, 2008. Current version published November 19, 2008. This paper was recommended by Associate Editor D. Z. Pan.
D. A. Papa is with the Electrical Engineering and Computer Science Department, University of Michigan, Ann Arbor, MI 48109 USA, and also with IBM Austin Research Laboratory, Austin, TX 78758 USA (e-mail: iamyou@umich.edu).
T. Luo was with the Department of Electrical and Computer Engineering, University of Texas at Austin, Austin, TX 78712 USA. He is now with Magma Design Automation, Austin, TX 78759 USA (e-mail: tluo@magma-da.com).
M. D. Moffitt, C. N. Sze, Z. Li, G.-J. Nam, and C. J. Alpert are with IBM Austin Research Laboratory, Austin, TX 78758 USA (e-mail: mdmoffitt@us.ibm.com; csze@us.ibm.com; lizhuo@us.ibm.com; gnam@us.ibm.com; alpert@us.ibm.com).
I. L. Markov is with the Electrical Engineering and Computer Science Department, University of Michigan, Ann Arbor, MI 48109 USA (e-mail: imarkov@umich.edu).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TCAD.2008.2006155
0278-0070/$25.00 © 2008 IEEE

I. INTRODUCTION

PHYSICAL synthesis is a complex multiphase process primarily designed to achieve timing closure, although power, area, yield, and routability also need to be optimized. Starting with 130-nm designs, signals can no longer cross the chip in a single cycle, which means that pipeline latches need to be introduced to create multicycle paths. This problem becomes more pronounced for 90-, 65-, and 45-nm nodes, where interconnect delay increasingly dominates gate delay [10]. Indeed, for high-performance ASIC scaling trends, the number of pipeline latches increases by 2.9× at each technology generation, accounting for as much as 10% of the area of a 90-nm design [8] and as many as 18% of the gates in a 32-nm design [22]. Hence, the proper placement of pipeline latches is a growing problem for timing closure.

The choice of computational techniques for latch placement depends on where this optimization is invoked in a physical-synthesis flow. To this end, we review the major phases of such flows as follows [1], [4].

1) Global placement computes nonoverlapping physical locations for gates and typically optimizes half-perimeter wirelength (HPWL) or weighted HPWL. As part of this phase, some amount of detail placement is usually done, and legalization is called to ensure a legal optimization result.
2) Electrical correction fixes capacitance and slew violations with gate sizing and buffering.
3) Legalization is an incremental-placement capability that removes overlaps caused by optimization with minimal disturbance to placement and timing.
4) Timing analysis assesses the speed of the design and determines if performance targets are met. Among other metrics, this phase determines the slack of every path in the circuit, that is, the difference between the clock period and how long it takes a signal to traverse the path.
5) Detail placement moves gates to further reduce wirelength and improve timing. In this phase, it is possible to do timing-driven detail placement wherein timing information is explicitly considered when optimizing gate placements.
6) Critical-path optimization identifies the most-critical paths and focuses on techniques to improve the slack for the worst timing violations. Relevant optimizations include buffering, gate sizing, and incremental synthesis [23].
7) Compression optimizes the remaining paths that violate timing constraints when improvements on most-critical paths are no longer possible. The goal is to compress the timing histogram and reduce the number of negative-slack paths that require designer intervention.

The flow can be repeated with net weighting and timing-driven placement to further improve results.

One can think of physical synthesis as progressing with variable detail and variable accuracy. For example, during global placement, large changes are made to the design using a coarse objective (such as wirelength) that is oblivious to timing considerations. Later, one may perform more accurate optimization using an Elmore interconnect delay model with Steiner-tree estimates for net capacitance. As timing begins to converge, one can apply more costly fine-grained buffering along actual detailed routes using a statistical timing model.

Fig. 1(a)–(d) shows the complications of using existing global placement techniques to solve the latch-placement problem for a single two-pin net. Assume that, for all four figures,
the source A and sink B are fixed in their respective locations and that global placement must find the correct location for the latch. This example is representative of situations in which a fixed block in one corner of the chip must communicate with a block in the opposite corner, but signal delay inevitably exceeds a single clock period. All four placements have equal wirelength; therefore, unless global placement is timing driven, the placement of the latch between A and B is arbitrary. Consider the following scenarios.

1) Suppose the placement tool chooses the placement shown in Fig. 1(a), which is the worst location for the latch. In this case, the latch is so far from B that the timing constraint at B cannot be met. This results in a slack on the input net (U) of +5 ns and a slack on the output net (V) of −5 ns (even after optimal buffering).¹
2) With a second iteration of physical synthesis, timing-driven placement could try to optimize the location of this latch by adding net weights. Any net-weighting scheme will assign a higher weight to net V than U, resulting in a placement where the latch is very close to B, as shown in Fig. 1(b). While the timing is improved, there is now a slack violation on the other side of the latch with −3 ns of slack on U and +3 ns on V.
3) A global or detail placer could use a quadratic wirelength objective to handle these kinds of nets, giving the location shown in Fig. 1(c), which, while better than the location shown in Fig. 1(a) and (b), is suboptimal.
4) To achieve the optimal location with no critical nets (zero slack on U and V), the latch must be placed as shown in Fig. 1(d). In this case, there is only one location that meets both constraints.

This example suggests that wirelength optimization is not well-suited for latch placement, particularly when there is little room for error. Instead, one must be able to couple latch placement with timing analysis and model the impact of buffering.

Fig. 1. Placement of a pipeline latch impacts the slacks of both input and output paths. A wirelength objective does not capture the timing effects of this situation.

¹ The nets in each scenario could include buffers without changing the trends discussed.

The problem is more complex in practice, and some aspects are not illustrated earlier. In particular, many latches have buffer trees in the immediate fan-in and fan-out. Such complications pose additional challenges that we address in this paper. We make the following contributions.

1) We show that a linear wire-delay model is sufficient to model the impact of buffering for the latch-placement problem.
2) We develop RUMBLE, a linear-programming-based timing-driven placement algorithm, which includes buffering for slack-optimal placement of individual latches under this model and show its effectiveness experimentally.
3) We extend RUMBLE to improve the locations of individual logic gates other than latches. Furthermore, we show how to find the optimal locations of multiple gates (and latches) simultaneously, with additional objectives. Incremental placement of multiple cells requires additional care to preserve timing assumptions, optimizing a set of slacks instead of a single slack, while also biasing the solution toward placement stability. We describe how RUMBLE handles these situations.
4) We empirically validate proposed transforms and the entire RUMBLE flow. We show how these techniques can be used to significantly improve initial latch placement in a reasonably optimized ASIC design with "do no harm" acceptance criteria that reject solutions if any quality metrics are degraded. This facilitates the use of RUMBLE later in physical synthesis.

The remainder of this paper is organized as follows. Section II discusses background and previous work. Section III describes the timing model we use in this paper. Section IV describes how RUMBLE performs timing-driven placement. Section V describes the RUMBLE algorithm. Section VI shows experimental results. Conclusions are drawn in Section VII.

II. BACKGROUND

Several approaches improve IC performance by modifying wirelength-driven global placement through timing-based net weights [9], [11]–[13], [15], [18]. Such algorithms are generally referred to as timing-driven placement, but the literature has not yet considered the impact of buffering on latch placement during global placement. Due to the lack of such algorithms, it is inevitable that some latches will be suboptimally placed during global placement. Therefore, new algorithms are needed for postplacement performance-driven incremental latch movement.

We introduce a high-level description of the incremental latch-placement problem as follows and elaborate on its formulation in Section IV. Given an optimized design and a small set of gates M (M may consist of a single latch), find new locations for each gate in M and new buffering solutions for nets incident to M such that the timing characteristics of the design are improved. While moving a cell can improve delay, particularly if it has been poorly placed, moving a latch has special significance, since it facilitates time borrowing: reallocating circuit delay from a longer (slow) combinational stage to a shorter (fast)
combinational stage. This fact offers a particularly significant boost to our basic approach and is enhanced even further when surrounding gates are also free to move.

A solution to this problem is called a transform using the terminology of [23]. A transform is an optimization designed to incrementally improve timing. Other examples of transforms include buffering a single net, resizing a gate, cloning a cell, swapping pins on a gate, etc. The way transforms are invoked in a physical-synthesis flow is determined by the drivers. For example, a driver designed for critical-path optimization may attempt a transform on the 100 most-critical cells. A driver designed for the compression stage (see Section I) may attempt a transform on every cell that fails to meet its timing constraints.

A driver has the option of avoiding transforms that may harm the design (e.g., generating new buffering solutions inferior to the original) and can then reject this solution. This "do no harm" philosophy of optimization has received significant recognition in recent work [5], [20]. The RUMBLE approach adopts this same convention, which makes it more trustworthy in a physical-synthesis flow.

While no previous work has attempted to solve this particular problem, other solutions do exist that may be able to help with the placement of poorly placed latches. The authors of [24] propose a linear-programming formulation that minimizes downstream delay to choose locations for gates in field-programmable gate arrays. The authors of [6] model static timing analysis (STA) in a linear-programming formulation by approximating the quadratic delay of nets with a piecewise-linear function. Their formulation's objective is to maximize the improvement in total negative slack of timing endpoints. The authors of both approaches conclude that the addition of buffering would improve their techniques [6], [24]. When these transformations are applied at the same point in a physical-synthesis flow that we propose, they will be restricted by previous optimizations. When applied somewhat earlier (e.g., following global placement), they are incapable of certain improvements. Namely, downstream optimizations, such as buffer insertion, gate sizing, and detail placement, may invalidate the optimality of latch placement. Therefore, our technique focuses on the bad latch placements that we observed in large commercial ASIC designs after state-of-the-art physical-synthesis optimizations. However, we believe that RUMBLE may be too disruptive to use after routing.

III. RUMBLE TIMING MODEL

We now introduce the timing model critical to RUMBLE's success.

Fig. 2 shows an intuitive example of the problem when we try to find new locations for movable gates. Similar to Fig. 1, the latch has to be moved to the right to improve timing. However, since the latch drives a buffer which is placed next to it, we must move the buffer in order to improve the slack of the latch, and other complications are shown in Fig. 3. At the same time, the optimal new location of the latch depends on how the input and output nets are buffered. As a result, the optimal approach is to simultaneously move the latch and perform buffering, but this is computationally prohibitive because a typical multiple-objective buffering algorithm runs in exponential time. As mentioned in Section I, we propose a sequential approach in which we first compute the new locations for a selected set of movable gates based on timing estimation considering buffers. Then, buffering is applied to the input and output nets of the selected movable gates. Being practical, effective, and efficient, this approach can be integrated into a typical very large scale integration physical-synthesis flow. The calculation of optimal movement depends on a simple but effective buffered-interconnect delay model, which is discussed in the following section.

Fig. 2. Poorly placed latch with buffered interconnect. In this case, the buffer must be moved or removed in order to have the freedom to move the latch far enough to fix the path.

A. Linear Buffered-Path Delay Estimation

Buffering has become indispensable in timing closure and cannot be ignored during interconnect delay estimation [3], [7], [22]. Therefore, to calculate new locations of movable gates, one must adopt a buffering-aware interconnect delay model that accounts for future buffers. We found that the linear delay model described in [3] and [17] is best suited for this application. In this model, the delay along an optimally buffered interconnect is

$$\mathrm{delay}(L) = L\left(R_b C + R C_b + \sqrt{2 R_b C_b R C}\right) \tag{1}$$

where $L$ is the length of a two-pin buffered net and $R_b$ and $C_b$ are the intrinsic resistance and input capacitance of buffers and gates while $R$ and $C$ are unit wire resistance and capacitance, respectively.

Empirical results in [3] indicate that (1) is accurate up to 0.5% when at least one buffer is inserted along the net. Furthermore, our own empirical results in Section VI-B suggest a 97% correlation between this linear delay model and the output of an industry timing-analysis tool.

B. Timing Graph

In RUMBLE, a set of movable gates is selected, which must include fixed gates or input/output ports to terminate every path. Fixed gates and I/Os help formulate timing constraints and limit the locations of movables. In Fig. 4(a), we assume that new locations have to be computed for the latch and the two OR gates, while all NAND gates are kept fixed.

In the timing graph, each logic gate is represented by a node, while a latch is represented by two nodes because the inputs and outputs of a latch are in different clock cycles and can have different slack values. Each edge represents a driver-sink path along a net and is associated with a delay value which is linearly proportional to the distance between the driver and the sink gate. In other words, we decompose each multipin net into a set of two-pin edges that connect the driver to each sink of the net. This simplification is crucial to our linear delay model
and is valid because the linear relationship can be preserved for the most critical sinks by decoupling less critical paths with buffers. Therefore, the two-pin edge model in the timing graph can guide the computation of new locations for the movable gates.

Fig. 3. Layout in (a) has a poorly placed latch, and existing critical path optimizations do not solve the problem. Repowering the gates may improve the timing some in (b), but if it cannot fix the problem, the latch must be moved. Moving the latch up to the next buffer, shown in (c), does not give optimization enough freedom. If we move the latch but do not rebuffer in (d), timing may degrade. Fig. 9(d) shows the ideal solution to this problem.

In the timing graph, an edge which represents a timing arc is created only for the following conditions: 1) each connection between the movable gates and 2) each connection between a movable gate and a fixed gate. This is because we only care about the slack change due to the displacement of movable gates. For the subcircuit shown in Fig. 4(a), the resultant timing graph is shown in Fig. 4(b).

Fig. 4. (a) Example subcircuit and (b) corresponding timing graph used in RUMBLE. The AATs or RATs of (squares) unmovable objects are considered known. STA is performed on (round shapes) movable objects.

For each fixed gate, we assume that the required arrival time (RAT) and the actual arrival time (AAT) are fixed. The values of RAT and AAT are generated by a STA engine using a set of timing assertions created by designers. An in-depth exposition of STA can be found in [16] and [21] along with algorithms to generate RAT and AAT. A movable latch corresponds to two nodes in the timing graph: one for the data input pin and one for the output pin. For the input pin, the RAT is fixed based on the clock period. Similarly, the AAT is fixed for the latch's output pin. Based on all the fixed RAT and AAT at fixed gates and latches, the AAT and RAT are propagated along the edges according to the delay of the timing arcs. The values of AAT are propagated forward to fan-out edges, adding the edge delay to the AAT. On the contrary, RATs are propagated backward along the fan-in edges, subtracting the edge delay from the RAT values. Details of edge delay, RAT and AAT calculations in our algorithm are covered in the following section.

IV. TIMING-DRIVEN PLACEMENT

The goal of RUMBLE is to find new locations for movable gates in a given selected subcircuit such that the overall circuit timing improves. Therefore, we maximize the minimum (worst) slack of source-to-sink timing arcs in the subcircuit. In contrast to other objectives used in previous work, we select this objective because we are targeting critical-path optimization. Hence, we prefer one unit of worst-slack improvement over two units of slack improvement on less-critical nets. We introduce the timing-driven placement technique in RUMBLE that directly maximizes minimum slack in the following section. In the following placement formulation, we account for the timing impact of our changes by implicitly modeling STA in our timing graph. In this paper, we estimate net length by the HPWL and then scale it to represent net delay. More accurate models are possible but may complicate optimization.

A. Problem Formulation

Consider the problem of maximizing the minimum slack of a given subcircuit G with some movable gates and some fixed gates or ports.

Let the set of nets in the subcircuit be $N = \{n_0, n_1, \ldots, n_h\}$. Let the set of all gates in the subcircuit (movable and fixed) be $G = \{g_0, g_1, \ldots, g_f\}$. Let the set of movable gates in the subcircuit (a subset of $G$) be $M = \{m_0, m_1, \ldots, m_k\}$.

$\tau$ is a technology-dependent parameter that is equal to the ratio of the delay of an optimally buffered arbitrarily long wire segment to its length

$$\tau = \frac{\mathrm{delay}(\mathrm{wire})}{\mathrm{length}(\mathrm{wire})}. \tag{2}$$

The following equations govern STA and are used in the next section. A timing arc is specified for a given net $n$ driven by gate $u$ and having sink $v$ as $n_{u,v}$. The delay of a gate $g$ is $D_g$. The RAT of a combinational gate $g$ is

$$R_g = \min_{o_j : 0 \le j \le m} \left[ R_{o_j} - \tau \cdot \mathrm{HPWL}(n_{g,o_j}) \right] - D_g. \tag{3}$$

The AAT of a combinational gate $g$ is

$$A_g = \max_{i_j : 0 \le j \le l} \left[ A_{i_j} + \tau \cdot \mathrm{HPWL}(n_{i_j,g}) \right] + D_g. \tag{4}$$

Given a clocked latch $r$, we assume, for simplicity, that the RAT ($R_r$) and AAT ($A_r$) are fixed and come from the timer. Unclocked latches are treated similarly to the combinational gates earlier.

The slack of a timing arc $n_{p,q}$ connecting two gates (combinational or sequential, movable or fixed) $p$ and $q$ is

$$S_{n_{p,q}} = R_q - A_p - \tau \cdot \mathrm{HPWL}(n_{p,q}). \tag{5}$$
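As a minimal sketch of how (1), (2), and (5) fit together, the following Python evaluates the slack on either side of a latch placed on the segment between a fixed source and a fixed sink, in the spirit of Fig. 1. The function names, the 20-unit distance, the 10-unit timing budgets, and the value τ = 1 are illustrative assumptions and are not taken from the paper.

```python
import math

def tau(r_b, c_b, r, c):
    """Delay per unit length of an optimally buffered wire, following (1) and (2)."""
    return r_b * c + r * c_b + math.sqrt(2.0 * r_b * c_b * r * c)

def hpwl(pins):
    """Half-perimeter wirelength of a list of (x, y) pin locations."""
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

def arc_slack(rat_q, aat_p, p_xy, q_xy, tau_value):
    """Slack of a two-pin timing arc from p to q, following (5)."""
    return rat_q - aat_p - tau_value * hpwl([p_xy, q_xy])

def latch_slacks(latch_x, tau_value=1.0):
    """Hypothetical Fig. 1-style setup: source A at x=0, sink B at x=20.

    Assumed timing values: AAT 0 at A and at the latch output,
    RAT 10 at the latch input and at B.
    """
    a, b, latch = (0.0, 0.0), (20.0, 0.0), (latch_x, 0.0)
    slack_u = arc_slack(10.0, 0.0, a, latch, tau_value)  # input net U
    slack_v = arc_slack(10.0, 0.0, latch, b, tau_value)  # output net V
    return slack_u, slack_v

if __name__ == "__main__":
    for x in (5.0, 13.0, 10.0):
        print(x, latch_slacks(x))
```

Under these assumed numbers, only the midpoint balances the two slacks, which mirrors the observation that all four placements in Fig. 1 have equal wirelength but very different timing.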
B. RUMBLE LP

We define a linear program (LP) to maximize the minimum slack $S$ of a subcircuit as follows:

VARIABLES:

$$\begin{array}{ll}
S & \forall n \in N : S_n \\
\forall m \in M : \beta_x^{m} & \forall m \in M : \beta_y^{m} \\
\forall n \in N : U_x^{n} & \forall n \in N : U_y^{n} \\
\forall n \in N : L_x^{n} & \forall n \in N : L_y^{n} \\
\forall m \in M : R_m & \forall m \in M : A_m
\end{array} \tag{6}$$

where $\beta$ are independent variables for gate locations, $U$ and $L$ variables represent upper and lower bounds of nets (highest and lowest coordinates in the x- and y-directions) for computing HPWL, and $R$ and $A$ compute RATs and AATs. Each $S_n$ computes the slack of a particular net, while $S$ is the minimum slack of all nets.

OBJECTIVE: Maximize $S$

CONSTRAINTS: For every gate $g_j$ on net $n_i$

$$U_x^{n_i} \ge \beta_x^{g_j}, \qquad U_y^{n_i} \ge \beta_y^{g_j} \tag{7}$$

$$L_x^{n_i} \le \beta_x^{g_j}, \qquad L_y^{n_i} \le \beta_y^{g_j}. \tag{8}$$

For every movable gate $m_i$ and every sink $g_j$ that it drives via net $n_k$

$$R_{m_i} \le R_{g_j} - \tau\left(U_x^{n_k} - L_x^{n_k} + U_y^{n_k} - L_y^{n_k}\right) - D_g. \tag{9}$$

For every movable gate $m_i$ and gate $g_j$ that drives one of its inputs via net $n_k$

$$A_{m_i} \ge A_{g_j} + \tau\left(U_x^{n_k} - L_x^{n_k} + U_y^{n_k} - L_y^{n_k}\right) + D_g. \tag{10}$$

For every timing arc in the subcircuit $n_{p,q}$ associated with net $n_i$

$$S_{n_i} \le R_q - A_p - \tau\left(U_x^{n_i} - L_x^{n_i} + U_y^{n_i} - L_y^{n_i}\right). \tag{11}$$

For each net $n_i$

$$S \le S_{n_i}. \tag{12}$$

C. Extensions to Minimize Displacement

The LP of RUMBLE is defined to maximize the minimum slack of a subcircuit. Additional objectives can be considered as well, such as total cell displacement, which sums Manhattan distances between cells' original and new locations. We subtract a weighted total cell-displacement term from the minimum-slack objective to avoid unnecessary cell movement. The weight $W_d$ for the total cell-displacement objective is set to a small value. Therefore, the weighted total displacement component is used as a tiebreaker and has little impact on worst-slack maximization. Instead, the combined objective is maximized by a slack-optimal solution closest to cells' original locations. During incremental timing-driven placement, minimizing total cell displacement encourages higher placement stability and often translates into fewer legalization difficulties.

Fig. 5 shows an example of the RUMBLE formulation with and without the total displacement objectives. The only movable object in Fig. 5(a) is the latch. An input net $n_1$ and an output net $n_2$ are connected to the latch with slacks −2 and +2, respectively. Fig. 5(b) shows the optimal LP solution without the total displacement objective. The Manhattan net length of $n_1$ is reduced from 20 to 18, and the net length of $n_2$ is increased from 20 to 22. This improves the worst slack of the subcircuit from −2 to 0. However, the latch moves a large distance. Fig. 5(c) shows that including the total displacement objective may preserve optimal slack while minimizing latch displacement.

Fig. 5. In many subcircuits, there are multiple slack-optimal placements. In RUMBLE we add a secondary objective to minimize the displacement from the original placement. This helps to maintain the timing assumptions made initially and reduces legalization issues. (a) Initial state of an example subcircuit. (b) Slack-optimal solution commonly returned by LP solvers; all optimal solutions lie on the dotted line. (c) Solution given by RUMBLE that maximizes worst slack then minimizes displacement.

In order to minimize displacement by adding a new objective, we introduce the following variables and constraints to the LP:

DISPLACEMENT VARIABLES:

$$\begin{array}{ll}
\forall m \in M : \delta_x^{m} & \forall m \in M : \delta_y^{m} \\
\forall m \in M : \upsilon_x^{m} & \forall m \in M : \lambda_x^{m} \\
\forall m \in M : \upsilon_y^{m} & \forall m \in M : \lambda_y^{m}.
\end{array} \tag{13}$$

For every movable gate $m_i$, $\alpha_x^{m_i}$ and $\alpha_y^{m_i}$ denote the original x and y coordinates. The upper and lower bounds of the new and original coordinates in each dimension are as follows:

DISPLACEMENT CONSTRAINTS:

$$\begin{array}{ll}
\upsilon_x^{m_i} \ge \beta_x^{m_i} & \upsilon_x^{m_i} \ge \alpha_x^{m_i} \\
\upsilon_y^{m_i} \ge \beta_y^{m_i} & \upsilon_y^{m_i} \ge \alpha_y^{m_i} \\
\lambda_x^{m_i} \le \beta_x^{m_i} & \lambda_x^{m_i} \le \alpha_x^{m_i} \\
\lambda_y^{m_i} \le \beta_y^{m_i} & \lambda_y^{m_i} \le \alpha_y^{m_i}.
\end{array} \tag{14}$$

The displacements $\delta^{m_i}$ for a movable gate $m_i$ are defined as

$$\delta_x^{m_i} = \upsilon_x^{m_i} - \lambda_x^{m_i}, \qquad \delta_y^{m_i} = \upsilon_y^{m_i} - \lambda_y^{m_i}. \tag{15}$$
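The effect of the displacement tiebreaker can be sketched on a toy instance resembling Fig. 5. This is a minimal illustration under stated assumptions: a single movable latch, an exhaustive search over a small integer grid standing in for an LP solver, and hypothetical coordinates and timing budgets chosen so that the initial slacks are −2 and +2. It evaluates the same objective (worst slack minus a small weighted Manhattan displacement) but is not the LP formulation itself.

```python
def manhattan(a, b):
    """Manhattan distance between two (x, y) points."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def place_latch(source, sink, original, budget_in, budget_out,
                tau=1.0, w_d=1e-6, grid=range(0, 21)):
    """Pick the grid point maximizing worst slack minus w_d * displacement.

    budget_in  : assumed RAT(latch input) - AAT(source), a constant
    budget_out : assumed RAT(sink) - AAT(latch output), a constant
    Ties are resolved in favor of the first grid point visited.
    """
    best = None
    for x in grid:
        for y in grid:
            p = (x, y)
            slack_in = budget_in - tau * manhattan(source, p)
            slack_out = budget_out - tau * manhattan(p, sink)
            worst = min(slack_in, slack_out)
            score = worst - w_d * manhattan(original, p)
            if best is None or score > best[0]:
                best = (score, p, worst)
    return best[1], best[2]

if __name__ == "__main__":
    src, snk, orig = (0, 0), (20, 20), (10, 10)
    # Budgets 18 and 22 give slacks -2 and +2 at the original location.
    for w in (0.0, 1e-6):
        pos, worst = place_latch(src, snk, orig, 18, 22, w_d=w)
        print(w, pos, worst, manhattan(orig, pos))
```

With the weight set to zero, any point on the slack-optimal line is acceptable and the search may return one far from the original location; a small positive weight keeps the worst slack unchanged while selecting a point near the original placement.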
D. Extensions to Improve the Slack Histogram

The minimum slack is the worst slack in a subcircuit. For two subcircuits with identical worst slack, it is possible that one subcircuit has few critical paths with worst slack while the other one has many. A timing optimization has to improve both the worst slack and the overall figure-of-merit (FOM) in a subcircuit. FOM is defined as the sum of all slacks below a threshold. If the slack threshold is zero, FOM is equivalent to the total negative slack. With the minimum slack as the only objective, a small improvement in the worst slack may cause a large FOM degradation. Therefore, we must add a FOM component to the optimization objective. The balance between the minimum slack and the FOM is controlled by a parameter $W_f$, which is set to a relatively small value because the worst-slack objective is more important.

Fig. 6 shows another scenario where the FOM component may help. During optimization, it may not be always possible to improve the minimum slack of the subcircuit. In that case, we can still reduce the number of critical paths by improving the FOM. In Fig. 6, there are three movables in the subcircuit. The minimum slack of the subcircuit is −20, and it is not possible to improve the minimum slack by moving any of the gates. With the additional FOM component in the objective, the FOM of the subcircuit is improved from −90 to −85, as shown in Fig. 6(b).

Fig. 6. (a) Example subcircuit with an imbalanced latch whose worst slack cannot be improved. Nevertheless, it is possible to improve timing of the latch while maintaining slack-optimality. By including a FOM component in the objective, the total negative slack can be reduced, as shown in (b).

Let $S_n$ denote the slack on net $n$, then the combined objective has the displacement and FOM components

$$\text{Maximize:} \quad S - W_d \sum_{m \in M} \left(\delta_x^{m} + \delta_y^{m}\right) + W_f \sum_{n : n \in N,\, S_n < T_s} S_n \tag{16}$$

where $T_s$ is the small slack threshold used to compute the FOM. We have earlier assumed $W_f$ and $W_d$ to be small, with $W_d < W_f$. In our implementation, we set $W_f$ to 0.005 times the absolute value of the average slack in the subcircuit, and we set $W_d$ to $10^{-6}$. These additional terms change the optimal region, but because the weights are so small, the combined optimal region is very near the slack-optimal region.

E. Preventing Harm to FOM

The primary goal of the RUMBLE LP as presented in previous sections is to maximize the worst slack of the subcircuit. We define two additional objectives: one preserves the initial solution as much as possible, the other can improve the slack histogram when the worst slack cannot be further improved. However, it is possible that, in order to improve a single worst-slack path, multiple paths may degrade to the point of being critical. If RUMBLE is deployed late enough in a physical-synthesis flow, the corresponding FOM degradation may be undesirable. To address this problem, we have devised an additional constraint which, at the cost of reduced improvement in worst slack, can prevent this type of FOM degradation. When FOM should not be degraded, we add the following constraints to the RUMBLE LP to preserve FOM.

For each net $n_k$ whose slack is greater than the slack threshold $T_s$, add the following constraint:

$$S_{n_k} \ge T_s. \tag{17}$$

This addition may overconstrain the LP, in which case, it is not possible to improve the worst slack without harming FOM.

V. RUMBLE ALGORITHM

In this section, we discuss the details of the RUMBLE algorithm, which employs the LP from the previous section to incrementally improve the timing of poorly placed latches.

A. Subcircuit Selection

RUMBLE identifies imbalanced latches, which we define as those that exhibit positive slack on their inputs and negative slack on their outputs (or vice versa). As shown in Fig. 1, the movement of any such imbalanced latch has the potential to improve timing, even if all surrounding cells are held fixed. More generally, however, the neighbors and extended neighbors of the targeted latch may also be included to form a set M of movable cells. In our technique, shown in Fig. 8, we adopt a basic N-hop neighborhood approach, where any gate within N steps of the imbalanced latch is included in the set of movable cells. This requires both a forward sweep (to collect sinks) and a backward sweep (to collect sources), which are performed in tandem. Those cells that fall N + 1 steps from the latch form a set P of fixed peripheral nodes.²

In contrast to prior work that has assumed operation within a prebuffering stage, our subcircuit selection algorithm must address the presence of buffers. These buffers will be encountered in our neighborhood selection algorithm, as they are part of the current logic; however, since it is presumed that they would be ripped up when new locations are determined (a critical assumption that makes our linear-delay model possible), we must prevent their inclusion in our model of the subcircuit. Therefore, when fetching adjacent gates, we transparently skip these buffers and omit them from the set M. The recursive functions TRUE-SOURCE() and TRUE-SINK() in Fig. 8 provide this additional level of indirection, returning only those combinational gates that reflect the logical structure of the subcircuit.

² Variations on this theme, such as metrics that incorporate the degree of neighbors' criticality [14], [24] and the size of the subcircuit bounding box, are also possible.
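The N-hop selection with transparent buffer skipping can be sketched as follows. This is a minimal illustration and not the pseudocode of Fig. 8: the netlist representation (a dictionary from each gate to its fan-out list), the breadth-first traversal that expands sinks and sources together, and all function and gate names are assumptions made for the example.

```python
from collections import deque

def is_imbalanced(slack_in, slack_out):
    """A latch is imbalanced when its input and output slacks differ in sign."""
    return slack_in * slack_out < 0

def invert(fanout):
    """Build the fan-in map from a fan-out map."""
    fanin = {}
    for driver, sinks in fanout.items():
        for s in sinks:
            fanin.setdefault(s, []).append(driver)
    return fanin

def true_neighbors(gate, edges, buffers):
    """Follow edges from gate, looking through buffers to the first non-buffer gates."""
    found = []
    for nxt in edges.get(gate, []):
        if nxt in buffers:
            found.extend(true_neighbors(nxt, edges, buffers))
        else:
            found.append(nxt)
    return found

def select_subcircuit(latch, fanout, buffers, n_hops):
    """Return (movable set M, peripheral set P) around an imbalanced latch."""
    fanin = invert(fanout)
    dist = {latch: 0}
    queue = deque([latch])
    while queue:
        g = queue.popleft()
        if dist[g] > n_hops:
            continue  # peripheral nodes stay fixed and are not expanded
        neighbors = (true_neighbors(g, fanout, buffers)    # forward sweep
                     + true_neighbors(g, fanin, buffers))  # backward sweep
        for nb in neighbors:
            if nb not in dist:
                dist[nb] = dist[g] + 1
                queue.append(nb)
    movable = {g for g, d in dist.items() if d <= n_hops}
    peripheral = {g for g, d in dist.items() if d == n_hops + 1}
    return movable, peripheral

if __name__ == "__main__":
    # Hypothetical netlist: A -> buf1 -> latch -> buf2 -> {g1, g2}, g1 -> B
    fanout = {"A": ["buf1"], "buf1": ["latch"], "latch": ["buf2"],
              "buf2": ["g1", "g2"], "g1": ["B"]}
    print(select_subcircuit("latch", fanout, {"buf1", "buf2"}, n_hops=1))
```

Because buffers never enter either set, the resulting subcircuit reflects only the logical connectivity, which is consistent with the assumption that nets are rebuffered after the movable gates are relocated.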
B. Feedback Paths

As noted in [24], the process of extracting gates to form a subcircuit may suffer from complications when subpaths of combinatorial logic between peripheral nodes are not modeled. These subpaths introduce additional timing constraints that, if left absent from the model, could invalidate the optimality of the solution.

To illustrate, consider the example in Fig. 7, in which a single latch has been selected as a movable gate. After collecting its inputs and outputs, a simple subcircuit is constructed as shown in Fig. 7(a), with the two endpoints shown selected as fixed gates. With the timing constraints as given in the figure, an optimal solution to this problem will place the latch equidistantly from both endpoints to ensure that the slacks on either side are balanced. However, consider a scenario where a feedback path exists from the output to the input, as shown in Fig. 7(b); in such an event, the RAT of the output and the AAT of the input are dependent on the location of the latch. If this dependence is modeled, the solution may be biased toward one of the two neighbors. We loosely refer to these neighbors as pseudomovable gates. Although timing must be propagated through them (as it is for movable gates), their physical locations may be fixed.

Fig. 7. Modeling feedback paths within logic requires a new type of gate. Pseudomovable gates have timing values that depend on the timing values of neighboring gates, but they cannot be moved. (a) Ignoring the presence of feedback paths is overly pessimistic, and it appears that the timing of the latch cannot meet its constraints. (b) Making the fixed gates along a feedback path pseudomovable allows the latch to meet its timing constraints, but doing only this can lead to the wrong placement. (c) Including all gates connected to pseudomovables as fixed timing points properly models the problem as a convex subcircuit.

Pseudomovables are collected by intersecting the transitive cones of logic between inputs and outputs to detect feedback paths, as shown in the pseudocode of Fig. 8.³ To ensure accuracy, the inputs and outputs of pseudomovables themselves must be bounded by fixed endpoints, as shown in Fig. 7(c). These fringe nodes completely isolate the timing of the resulting convex subcircuit from outer cones of logic.

Fig. 8. Subcircuit selection transparently skips buffers when building a neighborhood of movable gates and requires detection of pseudomovables.

³ To improve runtime, one can limit the depth of these cones to a reasonably small constant, as opposed to the exhaustive expansion in [5].

C. Do No Harm Philosophy

After gates are moved, it is likely that timing has degraded due to, for example, a capacitance violation on a long wire. The subcircuit must be examined, and its interconnect improved through physical-synthesis optimizations, which might include
Fig. 9. RUMBLE algorithm proceeds by (a) selecting a subcircuit to work on. An LP is formulated and solved, with movable gates being relocated as
shown in (b). Existing repeater trees are no longer appropriate and are subsequently removed in (c). Finally, the nets are rebuffered, forming the final subcircuit
shown in (d).
gate sizing and buffer insertion for delay or electrical consider-

ations on nets.
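The cone-intersection idea used to detect feedback paths in Section V-B can be sketched as follows. This is an illustrative reading, not the procedure of Fig. 8: the netlist is assumed to be a plain dictionary mapping each gate name to the gates it drives, the function names are hypothetical, and `max_depth` stands in for the depth limit mentioned in the footnote.

```python
from collections import deque


def reachable(starts, adjacency, max_depth):
    """Breadth-first cone expansion from `starts`, at most `max_depth` hops."""
    seen = set(starts)
    frontier = deque((s, 0) for s in starts)
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_depth:
            continue  # depth-limited cone: do not expand further
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return seen


def find_pseudomovables(fanout, inputs, outputs, movable, max_depth=4):
    """Gates lying on a path from an output neighbor back to an input neighbor."""
    # Build the reverse (fan-in) adjacency from the fan-out map.
    fanin = {}
    for src, dsts in fanout.items():
        for dst in dsts:
            fanin.setdefault(dst, []).append(src)
    forward = reachable(outputs, fanout, max_depth)   # fan-out cone of the outputs
    backward = reachable(inputs, fanin, max_depth)    # fan-in cone of the inputs
    return (forward & backward) - {movable}
```

A gate appears in both cones only if some path leads from an output neighbor to an input neighbor, so an empty result means no feedback path was found within the depth limit. Bounding the resulting pseudomovables with fixed fringe gates, as in Fig. 7(c), would be a separate step.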
Even though the LP of Section IV-B can be solved optimally, it does not account for all the complexities of interconnect optimization. The LP is an abstraction of the subcircuit timing that models physical-synthesis optimizations (e.g., virtual buffering) by prorating wire delay constants based on upcoming physical-synthesis optimizations. Despite the high correlation to more accurate timing models in experimental results, the RUMBLE model could turn out to be too optimistic, and its solution might result in a timing degradation. For example, nets can cross blockages or congested regions with no nearby legal locations. As a result, legalization could create a timing degradation.
When running RUMBLE in our physical-synthesis flow, we mitigate the harmful effects of legalization by finding legal locations for gates and buffers when moving or inserting them. Insisting on legal locations can also contribute to a degradation not anticipated by the RUMBLE model. Fortunately, RUMBLE can examine the timing implications of its changes before committing to them. It simply stores the initial state of the subcircuit and restores it if a timing degradation occurs. In this way, RUMBLE will do no harm to the circuit by ensuring that whatever solution it keeps is no worse than what existed before. Such safe delay optimizations are more easily inserted into physical-synthesis flows [5], [20].

D. RUMBLE Algorithm
Fig. 10 shows pseudocode for the RUMBLE algorithm, which assumes a set of movable gates given at input, and Fig. 9 shows the process. First, the subcircuit that is necessary for incremental placement is extracted (for a single movable, we extract its one-hop neighborhood of input gates). During this process, buffers are ignored (viewed as wires) as described in Section V-A. Next, RUMBLE performs timing analysis so as to measure timing improvement later. Line 3 stores the state of the circuit (gates and nets) so that the most recent transformations under consideration can be undone. Once the initial state is safely stored, lines 4–6 use the LP of Section IV to compute new gate locations, followed by buffer removal. If the model shows improvement, we continue. Buffers are inserted on line 8, and other physical-synthesis optimizations could also be applied here (e.g., repowering, Vth assignment, etc.). Lines 9–12 measure improvement and, in the case of timing degradation, restore the initial solution.

Fig. 10. RUMBLE algorithm for moving one latch.

VI. EXPERIMENTAL RESULTS
RUMBLE is implemented in C++ (compiled with GCC 4.1.0) and integrated into an industrial physical-synthesis flow. For our experiments, we examined an already optimized 130-nm commercial ASIC with clock period 2.2 ns and 3 000 000 objects. We first examined the most critical latches and then filtered out the ones where the latch was already well placed. We use the algorithm from [2] to perform buffering after the cells have been moved. In practice, the LP-solving technique from RUMBLE requires only 17 ms; the buffering algorithm dominates the runtime (over 75%). Since the overall runtime is dependent on the choice of the buffering algorithm, we omit the (trivial) runtimes from our tables. Note that the "do no harm" approach of Section V-C is applied to all experiments, preventing timing degradation in our tables (i.e., a value of zero appears in the imprv. column).

A. Rebuffering in RUMBLE
Previously published LP techniques for timing-driven placement do not allow for rebuffering during optimization. Instead, they are either applied before buffers have been inserted or they do not differentiate the buffers from other gates. Our first experiment is designed to show how important it is to rip up buffers before replacing gates and subsequently rebuffering. We modified our pseudocode in Fig. 8 so that the function IS-BUFFER() always returns false. The effect of this is to stop seeing through the buffers and, instead, to consider them as fixed timing endpoints. This configuration models the work of [24]. We then calculate a new location for each latch with the LP in Section IV. The final change is to skip line 8 of Fig. 10, i.e., do not rebuffer. We call this algorithm KEEP-BUFFERS.
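The store-and-restore safeguard of Sections V-C and V-D can be illustrated with a short Python sketch. It is only a schematic under stated assumptions: `transform` stands in for the whole sequence of LP solve, buffer rip-up, and rebuffering, `worst_slack` stands in for a call to the timing analyzer, and both names are hypothetical rather than part of the RUMBLE implementation.

```python
import copy


def apply_if_no_harm(state, transform, worst_slack):
    """Try `transform` on a copy of `state`; keep it only if timing does not degrade.

    Returns the retained state and the worst-slack improvement, which is
    zero when the change is rejected and the initial state is kept.
    """
    before = worst_slack(state)            # timing before any change
    saved = copy.deepcopy(state)           # stored initial state
    candidate = transform(copy.deepcopy(state))
    after = worst_slack(candidate)         # timing after the change
    if after < before:
        return saved, 0                    # degradation: restore initial state
    return candidate, after - before
```

Because a rejected change reports an improvement of zero, a driver written this way never records a negative improvement, which matches the way rejected moves are described as appearing in the result tables.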
TABLE I
KEEPING BUFFERS INSTEAD OF REMOVING AND REINSERTING THEM DEGRADES RUMBLE'S PERFORMANCE

TABLE II
RUMBLE MODEL ACCURATELY PREDICTS THE SOLUTION QUALITY IMPROVEMENTS IN THE REFERENCE TIMING MODEL

Table I shows the results of RUMBLE on a single latch compared with KEEP-BUFFERS. Column 1 shows the name of the benchmark, and columns 2 and 5 show worst slacks in picoseconds before optimization. Columns 3 and 6 show the slacks after optimization of KEEP-BUFFERS and RUMBLE, respectively. Columns 4 and 7 show the improvements of each technique.
From the table, we observe the following.
1) Despite not ripping up buffers, KEEP-BUFFERS is still able to improve solution quality for nine out of ten test cases, although the improvement is never more than 220 ps.
2) When rip-up and rebuffering is allowed, RUMBLE is able to significantly outperform KEEP-BUFFERS for all ten test cases. On average, the improvement grows by 7.4×.
3) While KEEP-BUFFERS improves slack by an average of 123 ps, RUMBLE improves slack by 908 ps, which confirms how important it is to rip up buffers so that they do not anchor the latch into an artificially small region.

B. Accuracy of the RUMBLE Timing Model
Theoretical results published by Otten [17] and discussed in Section III indicate that optimal buffer insertion on a two-pin net results in a wire delay that is linearly proportional to its length. The RUMBLE model heavily relies on these results. Table II compares the model-predicted values for subcircuit slack to values measured by running a commercial static-timing analyzer. Measurements are taken after the RUMBLE LP is solved, the latches are moved, and connected nets are buffered. Columns 2–4 report the initial, final, and improvement in worst slack of the subcircuit measured by the timing model presented in Section III. Columns 5–7 report the same metrics measured by the STA engine.
We make the following observations.
1) On average, the RUMBLE model overestimates the actual timing improvement by about 15%. This makes sense since it assumes that an optimal ideal buffering will be achievable, but this is not always the case, particularly for multisink nets.
2) However, if one compares actual improvement to model improvement, there is a 97% correlation, suggesting that the model is reasonable enough to justify the latch location.
We now show how RUMBLE actually improves the design's timing characteristics.

C. RUMBLE on a Single Latch
Given that we are solving a new physical-synthesis problem, existing solutions are scarce. Therefore, we first consider straightforward approaches to solve this problem. One possibility is to take the center-of-gravity (COG) of adjacent pins. A timing-driven improvement of the COG technique weights each pin by its slack. A reasonable version of this heuristic works in the following way. For a slack threshold Ts (see Section IV-D), let the weight w of a pin p with slack Sp be

$$ w_p = \begin{cases} 1 + |S_p - T_s|, & S_p < 0 \\ \max\left(0.1,\ 1 - |S_p - T_s|\right), & S_p \ge 0. \end{cases} $$

Then, we compute the x coordinate of movable gate m as the weighted average of the x coordinates of the set of neighboring pins P

$$ m_x = \frac{\sum_{p \in P} w_p \, p_x}{\sum_{p \in P} w_p} $$

and similarly for the y coordinate.
We implemented the earlier COG technique within the RUMBLE framework in place of the LP solver presented in Section IV. We still allow COG the benefits of ripping up buffers and reinserting them after the latches are moved. Table VI shows a comparison between RUMBLE and slack-weighted COG on ten latches. Column 1 shows the same latches as reported in Table III. Columns 2–4 show the initial and final slacks and improvement for COG. Columns 5–7 show the same for RUMBLE.
We observe the following.
1) For all ten cases, RUMBLE generates a better solution than COG. For three of the cases, COG could not improve the latch placement. These new solutions are rejected by the driver so as not to make the design worse.
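The slack-weighted COG heuristic of Section VI-C is simple enough to write out directly. The sketch below follows the weight and weighted-average definitions given in the text; the function names and the representation of a pin as an `(x, y, slack)` tuple are assumptions made for illustration, and the slack units are taken to be whatever units the weight formula presumes.

```python
def pin_weight(slack, threshold):
    """Weight of one pin: grows for negative slack, floors at 0.1 otherwise."""
    gap = abs(slack - threshold)
    if slack < 0:
        return 1.0 + gap
    return max(0.1, 1.0 - gap)


def slack_weighted_cog(pins, threshold=0.0):
    """Weighted average location of `pins`, a list of (x, y, slack) tuples."""
    weights = [pin_weight(slack, threshold) for _, _, slack in pins]
    total = sum(weights)  # always positive because every weight is at least 0.1
    x = sum(w * px for w, (px, _, _) in zip(weights, pins)) / total
    y = sum(w * py for w, (_, py, _) in zip(weights, pins)) / total
    return x, y
```

With two pins at x = 0 and x = 10 whose slacks are -1.0 and 0.5 and a zero threshold, the weights are 2.0 and 0.5, so the gate is pulled toward the more critical pin and lands at x = 2.0.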
TABLE III
COMPARISON OF RUMBLE'S LP TO A SLACK-WEIGHTED COG TECHNIQUE

2) On average, COG improves slack by 20.9% of the 2.2-ns cycle time, whereas RUMBLE improves slack by 41.3%. This shows that one must incorporate slack constraints on cells incident on the latch to achieve the most balanced solution.

D. Optimizing Multiple Gates Simultaneously
For this experiment, we show how an even better solution can be obtained when one allows cells close to the latch to move. We show the effectiveness of this technique on two sets of circuits.
1) One-hop subcircuits include every gate (while ignoring buffers and inverters) incident to the latch of interest that shares an incident net with the latch. Typically, this results in four or five gates being moved.
2) Two-hop subcircuits in addition include all nonbuffer and inverter cells incident to cells in the one-hop neighborhood. These subcircuits range from 11 to 18 movables with a mean of 14.8 movables.
We compare this technique to iterated single-move RUMBLE, where we pick each cell in the neighborhood and solve the LP for that particular cell, fix it, and then move to the next cell. The experiment is designed to show that multiple cells need to be optimized simultaneously to obtain the best results.
To measure the improvement, one must now consider the slacks of all cells that may be moved, and the objective becomes the improvement of the worst slack of the entire subcircuit. However, when one cannot improve the most critical path, the other paths may have room for improvement. We use FOM to measure the total improvement of all the slacks in the subcircuit.
Tables IV and V compare iterating RUMBLE over each gate one at a time versus RUMBLE moving multiple gates simultaneously. Columns 2–4 show the original and final slack and the slack improvement for iterated single-move RUMBLE, while columns 5–7 show the corresponding FOM measurements for a zero-slack threshold. Columns 8–13 show the same measurements for multimove RUMBLE. We make the following observations.
1) Multimove RUMBLE is clearly more effective than iterative RUMBLE both for one- and two-hop neighborhoods. In fact, for six out of ten one-hop subcircuits and for seven out of ten two-hop circuits, multimove actually brought the FOM down to zero, meaning it fixed all the timing violations. Iterative single move was able to fix two and four, respectively.
2) On average, the worst-slack improvements were 849 and 673 ps, respectively, for one- and two-hop subcircuits. The diminished improvement for larger subcircuits is likely because we are including more nets, some of which cannot be improved as much as those connected to the imbalanced latch (Fig. 6 shows an example).
3) Solving the LP takes 53 ms for one-hop subcircuits and 325 ms for two-hop subcircuits, on average.

E. RUMBLE in a Physical Design Flow
In the experiments presented so far, we have compared the effects of RUMBLE to those of other techniques on the most critical latches of the design. Due to the high runtime of buffering all of the nets in multimove subcircuits, multimove RUMBLE for every critical latch in the design is expensive. Consequently, in this section, we demonstrate the cumulative effect of single-move RUMBLE when deployed in our physical-synthesis flow on all latches with a critical pin. Table VI shows two circuits that each contain a significant number of poorly placed latches. For each circuit, we report five statistics. An imbalanced latch is defined as one that has slack on the input pins that is greater than the slack threshold Ts (see Section IV-D) and slack on the output pins that is lower than Ts, or vice versa. The Imb. column reports the number of imbalanced latches found in the design. Let the set of imbalanced latches be I, and for each latch l, let ws(l) be the worst slack of any pin on l. We define imbalance FOM to be

$$ \sum_{l \in I} \left( T_s - ws(l) \right). \qquad (18) $$

The Imb. FOM column reports this value. A critical latch is defined as one that has pins on both sides that are below Ts. The Crit. column reports the number of critical latches found in the design. Similarly to imbalance FOM, if C is the set of critical latches and, for each latch c, ws(c) is the worst slack of any pin on c, then we define the critical FOM to be

$$ \sum_{c \in C} \left( T_s - ws(c) \right). \qquad (19) $$

The Crit. FOM column reports this value.
Finally, the FOM column reports the FOM for the entire design. We make the following observations.
1) RUMBLE reduces the number of imbalanced latches by 8.8% and 9.2% on ckt1 and ckt2, respectively.
2) RUMBLE has a harder time optimizing the critical latches than the imbalanced ones.
3) RUMBLE reduces circuit FOM by 8.6% and 5.4% on ckt1 and ckt2, respectively.
4) RUMBLE improves the characteristics of all columns and does no harm to the circuit metrics.
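The latch statistics of Section VI-E, including the imbalance and critical FOM of (18) and (19), can be computed as in the following sketch. The input format (a dictionary from latch name to its input-pin and output-pin slacks) and the function name are hypothetical, and taking the worst pin slack on each side as "the slack on that side" is an assumption about how the definitions are applied.

```python
def latch_statistics(latches, threshold=0.0):
    """Count imbalanced and critical latches and sum their FOM contributions.

    `latches` maps a latch name to (input_slacks, output_slacks).
    """
    stats = {"imb": 0, "imb_fom": 0.0, "crit": 0, "crit_fom": 0.0}
    for in_slacks, out_slacks in latches.values():
        worst_in, worst_out = min(in_slacks), min(out_slacks)
        worst = min(worst_in, worst_out)        # ws(l): worst slack of any pin
        if worst_in < threshold and worst_out < threshold:
            stats["crit"] += 1                  # both sides below Ts
            stats["crit_fom"] += threshold - worst
        elif (worst_in > threshold and worst_out < threshold) or (
            worst_out > threshold and worst_in < threshold
        ):
            stats["imb"] += 1                   # one side above Ts, the other below
            stats["imb_fom"] += threshold - worst
    return stats
```

Each contribution is Ts minus the worst slack, so every term is positive for a violating latch and a FOM of zero means no latch in that class remains.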
TABLE IV
RUMBLE SIMULTANEOUSLY MOVING A one-hop NEIGHBORHOOD COMPARED TO ITERATIVELY MOVING THE SAME GATES INDIVIDUALLY
TABLE V
RUMBLE SIMULTANEOUSLY MOVING A two-hop NEIGHBORHOOD COMPARED TO ITERATIVELY MOVING THE SAME GATES INDIVIDUALLY
TABLE VI
RUMBLE DEPLOYED IN A PHYSICAL DESIGN FLOW ON CIRCUITS THAT HAVE PIPELINE LATCH PLACEMENT PROBLEMS. CKT1 HAS 2.92 MILLION OBJECTS AND 629 THOUSAND LATCHES AND CKT2 HAS 4.74 MILLION OBJECTS AND 247 THOUSAND LATCHES. "OLD" REPORTS VALUES BEFORE RUMBLE, "NEW" REPORTS RESULTS AFTER, AND "DIFF" REPORTS THEIR DIFFERENCE. FOM IS REPORTED IN NANOSECONDS

In addition to these observations, we point out that the two most common reasons for being unable to fix a particular latch are as follows: 1) there is a high-fan-out net in the subcircuit, which would degrade the performance of buffering, and we therefore skip this case; or 2) the gates are moved to a fixed endpoint, which indicates that RUMBLE does not have enough freedom to solve the problem entirely. The addition of RUMBLE to our design flow adds about 4% to the total runtime in these experiments.

VII. CONCLUSION
In this paper, we observe that wirelength-driven placement leads to particularly poor timing of pipeline latches in modern physical-design flows, which is particularly problematic at sub-130-nm technology nodes. To address this challenge, we developed RUMBLE, a linear-programming-based incremental physical-synthesis algorithm that incorporates timing-driven placement and buffering. The latter justifies RUMBLE's linear-delay model, which exhibited a 97% correlation to the reference timing model in our experiments. Empirically, this delay model is accurate enough to guide optimization; RUMBLE improves slack by 41.3% of cycle time on average for a large commercial ASIC design.
The LP used in RUMBLE is general enough to optimize multiple gates and latches simultaneously. However, when moving multiple gates considering only the slack objective, we encountered two challenges: placement stability and FOM degradations. We present our extensions to address these problems directly in our LP objective. With these additions, moving several gates simultaneously improves upon RUMBLE used iteratively on the same movables.

REFERENCES
[1] C. J. Alpert, C. Chu, and P. G. Villarrubia, "The coming of age of physical synthesis," in Proc. ICCAD, 2007, pp. 246–249.
[2] C. J. Alpert et al., "Fast and flexible buffer trees that navigate the physical layout environment," in Proc. DAC, 2004, pp. 24–29.
[3] C. J. Alpert et al., "Accurate estimation of global buffer delay within a floorplan," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 25, no. 6, pp. 1140–1146, Jun. 2006.
[4] C. J. Alpert et al., "Techniques for fast physical synthesis," Proc. IEEE, vol. 95, no. 3, pp. 573–599, Mar. 2007.
[5] K.-H. Chang, I. L. Markov, and V. Bertacco, "Safe delay optimization for physical synthesis," in Proc. ASP-DAC, 2007, pp. 628–633.
[6] A. Chowdhary et al., "How accurately can we model timing in a placement engine?" in Proc. DAC, 2005, pp. 801–806.
[7] J. Cong, L. He, C.-K. Koh, and P. H. Madden, "Performance optimization of VLSI interconnect layout," Integr. VLSI J., vol. 21, no. 1/2, pp. 1–94, Nov. 1996.
[8] P. Cocchini, "Concurrent flip-flop and repeater insertion for high performance integrated circuits," in Proc. ICCAD, 2002, pp. 268–273.
[9] B. Halpin, C. Y. R. Chen, and N. Sehgal, "Timing driven placement using physical net constraints," in Proc. DAC, 2001, pp. 780–783.
[10] International Technology Roadmap for Semiconductors, 2001. [Online]. Available: http://public.itrs.net
[11] M. A. B. Jackson and E. S. Kuh, "Performance-driven placement of cell based ICs," in Proc. DAC, 1989, pp. 370–375.
[12] A. B. Kahng and I. L. Markov, "Min-max placement for large-scale timing optimization," in Proc. ISPD, 2002, pp. 143–148.
[13] T. T. Kong, "A novel net weighting algorithm for timing-driven placement," in Proc. ICCAD, 2002, pp. 172–176.
[14] T. Luo, D. Newmark, and D. Z. Pan, "A new LP based incremental timing driven placement for high performance designs," in Proc. DAC, 2006, pp. 1115–1120.
[15] M. Marek-Sadowska and S. P. Lin, "Timing driven placement," in Proc. ICCAD, 1989, pp. 94–97.
[16] R. Nair, C. Berman, P. Hauge, and E. Yoffa, "Generation of performance constraints for layout," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 8, no. 8, pp. 860–874, Aug. 1989.
[17] R. Otten, "Global wires harmful?" in Proc. ISPD, 1998, pp. 104–109.
[18] S. L. Ou and M. Pedram, "Timing-driven placement based on partitioning with dynamic cut-net control," in Proc. DAC, 2000, pp. 472–476.
[19] D. A. Papa et al., "RUMBLE: An incremental, timing-driven, physical-synthesis optimization algorithm," in Proc. ISPD, 2008, pp. 2–9.
[20] H. Ren et al., "Hippocrates: First-do-no-harm detailed placement," in Proc. ASP-DAC, 2007, pp. 141–146.
[21] S. Sapatnekar, Timing. New York: Springer-Verlag, 2004.
[22] P. Saxena, N. Menezes, P. Cocchini, and D. A. Kirkpatrick, "Repeater scaling and its impact on CAD," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 23, no. 4, pp. 451–463, Apr. 2004.
[23] L. Trevillyan et al., "An integrated environment for technology closure of deep-submicron IC designs," IEEE Des. Test Comput., vol. 21, no. 1, pp. 14–22, Jan./Feb. 2004.
[24] Q. Wang, J. Lillis, and S. Sanyal, "An LP-based methodology for improved timing-driven placement," in Proc. ASP-DAC, 2005, pp. 1139–1143.

Michael D. Moffitt received the B.S. degree in computer science and engineering, the dual M.S. degrees in computer science and operations research, and the Ph.D. degree from the University of Michigan, Ann Arbor, in 2003, 2005, 2006, and 2007, respectively.
He is currently with the Design Productivity Group, IBM Austin Research Laboratory, Austin, TX. His principal contributions include new algorithmic techniques for temporal reasoning, infeasibility analysis, planning and scheduling, block packing, floor planning, global routing, and timing optimization. His primary research interests include constraint satisfaction, combinatorial optimization, and their application to large-scale problems commonly faced in artificial intelligence and computer-aided design.
Dr. Moffitt developed the winning entry in the ISPD 2007 Global Routing Contest, and was invited to speak at ISPD 2008 on modern advances in academic global routing. He has served on the Technical Program Committee for Design Automation and Test in Europe 2009. He was the recipient of the Josef Raviv Memorial Postdoctoral Fellowship.

C. N. Sze (S'03–M'05) received the B.Eng. and M.Phil. degrees from the Department of Computer Science and Engineering, The Chinese University of Hong Kong, Sha Tin, Hong Kong, in 1999 and 2001, respectively, and the Ph.D. degree in computer engineering from the Department of Electrical Engineering, Texas A&M University, College Station, in 2005.
He is currently a Research Staff Member with IBM Austin Research Laboratory, Austin, TX, where he focuses on integrated placement and timing optimization for ASIC and microprocessor designs, as well as the tools and designs for high-performance clock-distribution networks.
Dr. C. N. Sze was the recipient of a DAC Graduate Scholarship. He has served on the technical program committees for several international conferences and workshops including International Symposium on Physical Design, International Symposium on Circuits And Systems, System-Level Interconnected Prediction (SLIP), and System-on-Chip Conference.
David A. Papa (S'05) received the B.S. and M.S. degrees in computer science and engineering from the University of Michigan, Ann Arbor, in 2002 and 2004, respectively. He is currently working toward the Ph.D. degree in the Department of Electrical Engineering and Computer Science, University of Michigan.
He is currently also with IBM Austin Research Laboratory, Austin, TX, where he is contributing to their physical-synthesis tool PDS. His main area of interest is in physical design, particularly VLSICAD placement.

Tao Luo received the B.S. degree in electronic mechanics from the University of Electronic Science and Technology of China, Chengdu, China, the M.S. degree in management information systems from Tongji University, Shanghai, China, and the M.S. and Ph.D. degrees in computer engineering from the University of Texas at Austin, Austin.
He was with StarCore LLC, Analog Devices Inc., Advanced Micro Devices Inc., and IBM Austin Research Laboratory, Austin, for part-time and internship positions. He is currently with Magma Design Automation, Austin, where he is working on developing placement and layout migration tools for custom IC designs.
Dr. Luo was the recipient of the Engineering Doctoral Fellowship from the University of Texas in 2006–2007 and was the recipient of the best paper in session award at SRC Techon 2007.

Zhuo Li (S'01–M'05) received the B.S. and M.S. degrees in electrical engineering from Xi'an Jiaotong University, Xi'an, China, in 1998 and 2001, respectively, and the Ph.D. degree in computer engineering from Texas A&M University, College Station, in 2005.
From 2005 to 2006, he was a Senior Technical Staff with Pextra Corporation, where he worked on very large scale integration extraction-tool development. He is currently with IBM Austin Research Laboratory, Austin, TX. His research interests include fast physical synthesis, wire synthesis, optimization for routability, fast routing estimation, circuit modeling and simulation, mask optimization, and design and analysis of algorithms.
Dr. Li was the recipient of the Applied Materials Fellowship in 2002, the Best Paper Award at the Asia and South Pacific Design Automation Conference in 2007, and the IEEE Circuits and System Society Outstanding Young Author Award in 2007. He has served as technical program committee member for the International Conference on Computer Design; Design, Automation and Test in Europe Conference and Exhibition; and Design Automation Conference Ph.D. Forum.

Gi-Joon Nam received the B.S. degree in computer engineering from Seoul National University, Seoul, Korea, and the M.S. and Ph.D. degrees in computer science and engineering from the University of Michigan, Ann Arbor.
Since 2001, he has been with IBM Austin Research Laboratory, Austin, TX, where he is primarily working in the physical design space, particularly on placement and routing and timing closure flow. His general interests include computer-aided design algorithms, combinatorial optimization, very large scale integration system design, and computer architecture.
Dr. Nam has served on the technical program committees for various conferences including the International Conference on Computer Aided Design, the International Symposium on Physical Design (ISPD), and the Asia and South Pacific Design Automation Conference. He will be the General Chair of ISPD 2009. He was the recipient of first place award for the 38th Design Automation Conference Student Design Contest, numerous IBM awards, and Special Interest Group on Design Automation technical leadership award from the Association for Computing Machinery.

Charles (Chuck) J. Alpert (S'92–M'96–SM'02–F'05) received the B.S. degree in math and computational sciences and the B.A. degree in history from Stanford University in 1991. He received the Ph.D. degree in computer science from the University of California at Los Angeles (UCLA), in 1996.
Since 1996, he has been with IBM Austin Research Laboratory, Austin, TX, where he currently manages the design productivity group, whose mission is to develop design automation tools and methodologies to improve designer productivity and reduce design cost. He has published over 100 conference and journal publications.
Dr. Alpert has been active in the academic community, serving as Chair for the Tau Workshop on Timing Issues and the International Symposium on Physical Design. He also serves as an Associate Editor of the IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN. He was the recipient of the Mahboob Khan Mentor Award for his work in mentoring in 2001 and 2007. He was the recipient of three Best Paper Awards from the Association for Computing Machinery/IEEE Design Automation Conference.

Igor L. Markov (SM'05) received the M.A. degree in mathematics and the Ph.D. degree in computer science from the University of California, Los Angeles.
He is currently an Associate Professor of electrical engineering and computer science with the University of Michigan, Ann Arbor. He is a coauthor of more than 150 refereed publications. Prof. Markov is a member of the Editorial Board of the Communications of the ACM and ACM Transactions on Design Automation of Electronic Systems. He is also a member of the Executive Board of the ACM SIGDA and the Editorial Board of the IEEE TRANSACTIONS ON COMPUTERS and the IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN (TCAD). His research interests include computers that make computers (software and hardware), secure hardware design, and combinatorial optimization with applications to the design, verification, and debugging of integrated circuits and quantum logic circuits.
Prof. Markov is a Senior Member of the Association for Computing Machinery (ACM). He has served on and chaired a number of conference program committees. He mentored six Ph.D. students and is now working with seven graduate students. He was a recipient of the DAC Fellowship, the ACM SIGDA Outstanding New Faculty Award, the ACM SIGDA Technical Leadership Award, the NSF CAREER Award, the IBM Partnership Award, the Synplicity Inc. Faculty Award, the Microsoft A. Richard Newton Breakthrough Research Award, the EECS Department Outstanding Achievement Award from the University of Michigan, the Best Paper Awards at the Design Automation and Test in Europe Conference and the International Symposium on Physical Design, and the IEEE CAS Donald O. Pederson Award for Best Paper in the IEEE TCAD.
