
LATCH-BASED PERFORMANCE OPTIMIZATION FOR FPGAS

Bill Teng
Dept. of ECE, Univ. of Toronto
email: bill.teng@eecg.toronto.edu

Jason H. Anderson
Dept. of ECE, Univ. of Toronto
email: janders@eecg.toronto.edu

ABSTRACT

We explore using pulsed latches for timing optimization, a first in the FPGA community. Pulsed latches are transparent latches driven by a clock with a non-standard (non-50%) duty cycle. We exploit existing functionality within commercial FPGA chips to implement latch-based optimizations that do not have the power or area drawbacks associated with other timing optimization approaches, such as clock skew and retiming. We propose an algorithm that iteratively replaces certain flip-flops in a logic design with latches for an improvement in circuit speed. Results show that the majority of the performance improvement that can be achieved by using multiple skewed clocks can also be achieved using a single clock and latches. We also consider the impact of short delay paths (i.e. min delays), which can cause hold-time violations. When short path delays are set to be 70% of long path delays, our latch-based optimization, operating on the routed design, provides a 5% performance improvement, on average, essentially for 'free' (i.e. without any rerouting/delay padding). We show that short paths greatly hinder the ability of using latches for speed improvement, motivating further work to reduce their effects.

1. INTRODUCTION

The speed of a sequential circuit is inversely related to the delay of the longest path between any pair of flip-flops. Time-borrowing, or cycle stealing, is a well-studied technique that allows the longest paths to "steal" time from adjacent combinational stages, in order to reduce the clock period.

One well-known approach, clock skew scheduling, intentionally delays some clocks to certain flip-flops to steal time from subsequent combinational stages [1]. For example, for a combinational logic path from a flip-flop j to flip-flop i, clock skew scheduling may delay the arrival of the clock signal to i, thereby allowing more time for a logic signal to arrive at i's input. Recent research [2, 3, 4] has shown that only a few skewed clocks are necessary to obtain appreciable improvements in circuit speed. Unfortunately, clocks comprise 20-39% of dynamic power consumption in commercial FPGAs [5, 6] and FPGAs consume 7-14× more dynamic power than ASICs [7]. It is clear that for FPGAs to remain competitive with ASICs, it would be desirable to improve circuit performance without using extra clocks.

Instead of intentionally skewing clocks to borrow time, retiming physically relocates flip-flops across combinational logic to balance the delays between combinational stages. Although extra clock lines are not required to borrow time, the practical usage of retiming is limited due to its impact on the verification methodology, i.e., equivalence checking and functional simulation [2]. Retiming can change the number of flip-flops in a design, for example, in the case of moving a flip-flop "upstream" from the output of a multi-input logic gate to its inputs. As such, retiming may increase circuit area, and make it difficult for a designer to verify functionality or to correlate the retimed design with the original RTL specification.

[Fig. 1: timing diagrams for a path FF1 → L2 → FF3 (labels: FF1 → L2 Min = 3, Max = 8; L2 → FF3 Min = 3, Max = 4; clock period: 6), with panels (a) Duty cycle: 50% and (b) Duty cycle: 33%.]
Fig. 1. Illustration showing how a 33% duty cycle can fix hold-time violation.

In this work, we use latches to perform time-borrowing, thereby overcoming the power barrier associated with using multiple clocks, and the netlist modifications required for retiming. Consider again a combinational path from a flip-flop j to a transparent latch i. The maximum allowable delay for the path is extended beyond the clock period. Specifically, a transition launched from j need not settle on i's input by the next rising clock edge; it may settle after the clock edge, during the time window when i is transparent. A downside of latches, however, is that they make timing analysis more difficult because transparency allows critical long (max delay) and short (min delay) paths to extend across multiple combinational stages, unlike standard timing analysis using flip-flops. This can lead to hold-time violations. Using pulsed latches driven by a clock with a non-standard duty cycle (i.e. not 50%), or pulse width, is one method of reducing the effects of short paths plaguing conventional latch-based circuits, while allowing time borrowing for long paths. This is a viable option as commercial FPGAs (e.g. Xilinx Virtex 6) can generate clocks with different duty cycles, as well as allow the sequential elements in CLBs to be used as either flip-flops or latches [8, 9]. That is, commercial FPGAs already contain the necessary hardware functionality to support pulsed latch-based timing optimization; however, to our knowledge, no prior work has explored the pulsed latch concept for FPGAs.
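To make the time-borrowing argument concrete, the short sketch below compares the longest delay a stage may have when it ends at a flip-flop versus a pulsed latch. It is only an illustration under simplifying assumptions: the function names are hypothetical, setup time and clock-to-Q delay are taken as zero, and the numbers (clock period 6, long-path delays 8 and 4, pulse width 2 for a 33% duty cycle) are borrowed from the labels in Fig. 1.

```python
def max_delay_flip_flop(period):
    """Longest path delay a stage ending at a flip-flop can tolerate."""
    return period


def max_delay_pulsed_latch(period, pulse_width):
    """A latch stays transparent for pulse_width after the rising edge,
    so data may arrive that much later than one clock period."""
    return period + pulse_width


def borrowed_time(delay, period):
    """Time taken from the next stage when a path overruns the period."""
    return max(0, delay - period)


period, pulse_width = 6, 2      # 33% duty cycle of a period-6 clock
d_ff1_l2, d_l2_ff3 = 8, 4       # long-path delays of the two stages

# The first stage is too slow for a flip-flop at L2 but fits a pulsed latch.
assert d_ff1_l2 > max_delay_flip_flop(period)
assert d_ff1_l2 <= max_delay_pulsed_latch(period, pulse_width)

# The second stage starts late by the borrowed amount and must still
# finish within one period at the flip-flop FF3.
borrow = borrowed_time(d_ff1_l2, period)
assert borrow + d_l2_ff3 <= max_delay_flip_flop(period)
print(borrow)  # 2
```

The second stage has exactly the slack that the first stage borrows, which is why the pair can run at a period of 6 even though one path takes 8.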
An illustration of pulsed latches is shown in Fig. 1. Fig. 1(a) shows that it is possible that two signals launched on two different clock cycles can arrive at one flip-flop (FF3) at the same time, which clearly is invalid. The solid and dashed lines represent long and short combinational paths, respectively, between FF1, L2, and FF3. The cause of this problem is the short path signal launched from FF1 arriving at L2 when it is still transparent: a short path violation. Fig. 1(b) shows that this violation can be fixed by reducing the pulse width until the short signal arrives at L2 when it is no longer transparent. Naturally, the use of large pulse widths is desirable in the sense that they permit more time-borrowing between adjacent combinational stages; however, large pulse widths are more likely to create short path violations. In our approach, we reduce the negative impact of short paths by forcing some flip-flops to remain as flip-flops. The main contributions of our work are:

• An algorithm that optimizes circuit speed by selectively converting some flip-flops to latches in FPGAs.

• Results showing that when short path (min) delays are 70% of long path (max) delays, our optimization, operating on the routed design, improves circuit speed by 5%, on average: a "no cost" performance gain.

• A comparison of pulsed latches (driven by a single clock) with multiple skewed clocks, showing that pulsed latches provide the majority of the benefit offered by skewed clocks.

The remainder of this paper is organized as follows: In Section 2, we summarize related work on clock skew and latches. In Section 3, we review latch timing constraints. In Section 4, we show how the latch timing constraints can be represented as a graph and formulate the timing optimization problem in a graph-theoretic manner. We explain our algorithm in Section 5. We present our results in Section 6. Finally, we provide insight into future directions of our work in Section 7.

2. RELATED WORK

Although there has been no work on using latches in FPGAs, time borrowing using clock skew has been explored on FPGAs. Singh and Brown showed that 4 shifted clock lines provide over a 20% improvement in circuit speed [4]. The downside to this approach is the use of extra clock lines to allow for time-borrowing, which increases overall power consumption. The work did not consider the impact of short paths, i.e. hold-time violations.

Other work [10, 11] involving FPGAs has focused on the use of programmable delay elements (PDEs) to purposely delay clock signals. The work in [10] used PDEs on the clock tree, whereas the PDEs were inserted into FPGA logic elements in [11]. Both methods incur a hardware penalty and require additional architectural considerations.

Timing optimization of latch-based circuits has been studied extensively for ASICs. Most prior work has represented the latch-based optimization problem using linear constraints and solved it using linear programming (LP) [12, 13, 14] or graph algorithms [15]. The prior efforts use latches exclusively and neglect to consider hold-time violations (assuming that such violations can be easily repaired). Our approach is inherently different from previous approaches because we utilize both flip-flops and latches, and also consider the impact of short delay paths (hold-times).

Pulsed latches have recently been explored in the ASIC context by Lee et al. [16, 17]. Their approach to time-borrowing uses multiple pulse widths and skewed clocks. Using flip-flop-like timing constraints, their optimization strategy relies on exploiting the difference between pulse widths and clock delays to steal time from neighboring combinational stages. This is different from our approach, which mimics the presence of multiple 'skews' using one pulse width.

3. LATCH TIMING CONSTRAINTS

Table 1 summarizes variables necessary to understand latch timing constraints.

    $T_{cq}$                clock-to-Q delay
    $a_i$, $A_i$            earliest and latest arrival times at latch i
    $P$                     clock period
    $cd_{ji}$, $CD_{ji}$    short and long j → i combinational path delay from latch j to latch i
    $T_{su}$                setup time
    $T_h$                   hold-time
    $W_i$                   pulse width of latch i

Table 1. Summary of latch timing parameters.

The main difference between latch and flip-flop timing constraints is the ability of short and long paths to extend through multiple latches. This manifests as non-linearity (a max function) in the constraints:

$$A_i = \max_{\forall j \to i} \left[ \max(T_{cq}, A_j) + CD_{ji} \right], \quad \forall i \tag{1}$$

$$a_i = \min_{\forall j \to i} \left[ \max(T_{cq}, a_j) + cd_{ji} \right], \quad \forall i \tag{2}$$

Equations (1) and (2) show that the latest and earliest arrival times of latch i are functions of the latest and earliest arrival times, respectively, of the latches connected to i through combinational paths. The latch setup constraint shown below assumes that latch i is driven by a pulse of width $W_i$ [17]:

$$A_i \le P + W_i - T_{su} \tag{3}$$

Combining (1) and (3), we obtain:

$$\max_{\forall j \to i} \left[ \max(T_{cq}, A_j) + CD_{ji} \right] \le P + W_i - T_{su}, \quad \forall i \tag{4}$$

We can remove the leftmost max term in (4) by writing one constraint for every j → i path:

$$\max(T_{cq}, A_j) + CD_{ji} \le P + W_i - T_{su}, \quad \forall i, j \tag{5}$$

The purpose of the remaining max term is to ensure that the signal at latch j launches no earlier than $T_{cq}$ after the rising edge. We can represent (5) with two equivalent linear constraints:

$$A_j + CD_{ji} \le P + W_i - T_{su}, \quad \forall i, j \tag{6}$$

$$A_j \ge T_{cq}, \quad \forall j \tag{7}$$
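As a quick sanity check on this linearization, the sketch below evaluates constraint (5) directly and compares it with the pair (6) and (7) for a single j → i path. The function names and the numeric values are illustrative assumptions, not taken from the paper. Whenever (7) holds, max(Tcq, Aj) equals Aj, so the two forms agree.

```python
def setup_ok_nonlinear(A_j, CD_ji, P, W_i, T_su, T_cq):
    """Constraint (5): max(T_cq, A_j) + CD_ji <= P + W_i - T_su."""
    return max(T_cq, A_j) + CD_ji <= P + W_i - T_su


def setup_ok_linear(A_j, CD_ji, P, W_i, T_su, T_cq):
    """Constraints (6) and (7) taken together."""
    return (A_j + CD_ji <= P + W_i - T_su) and (A_j >= T_cq)


# Hypothetical timing values, in arbitrary time units.
params = dict(P=6.0, W_i=2.0, T_su=0.2, T_cq=0.3)

# (latest arrival at j, long-path delay j -> i); every A_j satisfies (7).
cases = [(0.5, 7.0), (0.5, 7.5), (1.5, 6.0)]

results = []
for A_j, CD_ji in cases:
    nonlinear = setup_ok_nonlinear(A_j, CD_ji, **params)
    linear = setup_ok_linear(A_j, CD_ji, **params)
    assert nonlinear == linear  # both forms agree when A_j >= T_cq
    results.append(nonlinear)

print(results)  # [True, False, True]
```

The middle case fails because 0.5 + 7.5 exceeds the bound P + W_i - T_su = 7.8; widening the pulse at latch i would be one way to make it pass.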
Conservatively, we can assume that the latest arrival time always occurs at the falling edge of a pulse, that is, $A_j = W_j - T_{su}$. Plugging this into (6) and (7) gives:

$$W_j + CD_{ji} \le P + W_i, \quad \forall i, j \tag{8}$$

$$W_j - T_{su} \ge T_{cq}, \quad \forall j \tag{9}$$

Similarly, the hold-time constraint for latches, when combined with (2), yields:

$$\min_{\forall j \to i} \left[ \max(T_{cq}, a_j) + cd_{ji} \right] \ge W_i + T_h, \quad \forall i \tag{10}$$

We can relax this non-linear constraint by first transforming (10) to apply to every latch pair connected by a combinational path. We conservatively assume that every early signal launches at the beginning of a latch's opening window (i.e. the rising edge of the clock). This implies setting $a_j = 0$ and results in:

$$T_{cq} + cd_{ji} \ge W_i + T_h, \quad \forall i, j \tag{11}$$

Although the assumptions made to remove the max function in deriving (8) and (11) would appear to restrict the full potential of using latches, we will show that one clock can implement most of the gains achieved by clock skew requiring multiple clock lines.

4. GRAPH FORMULATION

We can formulate the problem of latch-based performance optimization as a graph problem. This section discusses how the constraints derived in Section 3 map into a graph problem. A proof of this equivalence can be found in [18]. Let $G = (V, E)$ be a strongly-connected directed graph. Let a vertex $v \in V$ represent a flip-flop or a latch in G. Every v has an associated $W_v$, the pulse width. Let an edge, $e(u, v)$, and its delay, $d(u, v)$, represent the delay on a u → v combinational path.

A path is a traversal of vertices through connecting edges with an arbitrary start and end vertex. A simple cycle is a path that does not repeat vertices, other than the starting and ending vertices. Let c and C represent a simple cycle and the set of all simple cycles in G, respectively. The maximum cycle ratio (MCR) is defined as:

$$MCR(G) = \max_{c \in C} \left( \frac{\sum_{e(u,v) \in c} d(u, v)}{|c|} \right) \tag{12}$$

where $|c|$ represents the number of edges on c. The MCR is the optimum clock period, P, for the circuit represented by G. In essence, the MCR is the maximum total delay of any cycle in G divided by the number of sequential elements on that cycle. It represents the best clock period that could be achieved by latches/retiming/clock skew for the circuit, given the specified delay values. A proof of this property can be found in [19]. Given P, the set of duty cycle values (pulse widths) for each latch, $\omega = \{W_0, W_1, \ldots, W_{|V|-1}\}$, can be determined (outlined below).

[Fig. 2: (a) a circuit fragment with sequential elements A, B, C and D; (b) its graph representation with integer edge delays, annotated with $W_B = +1$ and $W_D = +2$.]
Fig. 2. Sample circuit fragment and its graph representation.

Fig. 2 shows how a circuit fragment, (a), is represented in our graph formulation, (b), where the integers next to edges represent path delays in the circuit in (a). Fig. 2(b) also illustrates how $MCR(G)$ yields the optimum P. For this example, assume that every edge in (b) represents constraint (8). The cycle containing dashed edges, $v_A \to v_D \to v_C \to v_A$, determines the MCR in this example, with a value of 5: the sum of edge delays along the cycle is 15, and 15/(3 edges) = 5. Thus, we can operate this circuit with a clock period of 5. Observe, however, that the edge $v_A \to v_D$ has a path delay of 7. Consequently, we must set $W_D = +2$ in order for signals on the $v_A \to v_D$ path to arrive on time. In addition, since timing analysis requires EVERY constraint to be satisfied, we must set $W_B = +1$ to satisfy $v_A \to v_B$. A clock period taken from any other cycle in this graph, with a lower cycle ratio, would not satisfy every constraint. Note that in this example, the constraints (9) and (11) have been omitted for clarity; however, they can also be represented in the graph formulation.

The equations and constraints defined in Section 3 can be solved using linear programming. However, we use Howard's algorithm [20] to find the MCR. Howard's algorithm maintains a subgraph, $G_{sub} \subseteq G$, where each vertex has exactly one outgoing edge. Since G is strongly connected, such a $G_{sub}$ exists and contains at least one cycle. A cycle is found in $G_{sub}$ by traversing its edges, and its cycle ratio, $r_{sub}$, is used as an initial estimate for the MCR. To determine if $r_{sub}$ is the maximum cycle ratio (MCR), Bellman-Ford is applied to G. If $r_{sub}$ is not the MCR, $G_{sub}$ is "improved upon" and the algorithm continues until the MCR is found. The interested reader is referred to [21] for details. Although the algorithm has no known upper bound on the worst-case time complexity, it runs in $O(E)$ time in practice.

5. ALGORITHM

Although we can incorporate (11) into G (i.e. short path constraints) and find the optimum P using the MCR approach, the solution to this problem would imply that every sequential element is a latch with a (possibly different) pulse width, thereby requiring multiple clocks and significant power consumption. We wish to use a single clock with a single pulse width, and therefore, must decide the specific pulse width to use, and also whether each sequential element should be a flip-flop or a latch. Although flip-flops cannot borrow time,
their short path constraints are less restrictive and because of this, we use flip-flops to prevent very short paths from limiting the pulse width. The flip-flop/latch choice adds a binary decision element to the optimization problem, and would require Mixed Integer Linear Programming (MILP) to solve. As MILP is NP-hard, unlike LP and Howard's algorithm [18], our algorithm is a greedy heuristic that tackles the problem iteratively.

We first solve for the best-case clock period, $P_{init}$, and associated pulse widths, $\omega_{init}$, without considering short path (hold-time) constraints (11), and then we use $\omega_{init}$ in conjunction with (11) to guide the process of selecting some sequential elements to be latches and some to be flip-flops and settle on a single pulse width $W_{final}$. The approach we take is to start with a large value for $W_{final}$ and then scale it back based on the short path delay violations it imposes. We then assign each sequential element to be either a flip-flop or a latch, and then re-solve for P based on the flip-flop/latch assignments and $W_{final}$. The full algorithm is detailed in Figure 3.

    Input:  G(V, E), E_min
    Output: P_final, W_final, ω_final
     1: P_init, ω_init ← Howard(G)
     2: Sort edges in E in ascending order of their short path delays in E_min
     3: W_final ← max(ω_init)
     4: for e(u, v) ∈ sorted E do
     5:     d(u, v) ← short path delay of e(u, v) from E_min
     6:     if T_cq + d(u, v) < ω_init(v) + T_h then
     7:         W_final ← T_cq + d(u, v) − T_h
     8:         EXIT FOR
     9:     end if
    10: end for
    11: for e(u, v) ∈ E do
    12:     d(u, v) ← short path delay of e(u, v) from E_min
    13:     if T_cq + d(u, v) < W_final + T_h then
    14:         forceFlipFlop(v)
    15:     else
    16:         forceLatch(v)
    17:     end if
    18: end for
    19: P_final, ω_final ← Howard(G)

Fig. 3. Pulsed latch timing optimization.

The inputs to the algorithm are $G(V, E)$, which represents the constraint set depicted in Figure 2, and the set of minimum delays between every pair of sequential elements, $E_{min}$. In line (1), we begin by calculating $P_{init}$ and $\omega_{init}$ subject to only the constraints described in Section 4 (no consideration of short path delays). In line (2), we sort the edges in E in ascending order of the values in $E_{min}$. This step identifies the first possible hold-time violation without having to iterate through all of $E_{min}$ every time. Line (3) initializes $W_{final}$ to be the maximum pulse width in the set $\omega_{init}$. In lines (4) - (10), we look for the first violation of (11). If we do find a violation, we know that using a pulse width larger than $T_{cq} + d(u, v) - T_h$ would not satisfy the current violated constraint. Subsequent edges in the sorted edge set would not generate a more constraining pulse width because the edge set is sorted in ascending order. Therefore, iteration through E should stop (line (8)). Once $W_{final}$ has been determined, lines (11) - (18) constrain sequential elements that have incoming edges with delay less than $W_{final}$ to be flip-flops because if we converted them to latches, other hold-time violations would occur. Sequential elements that do not have incoming edges with delay less than $W_{final}$ are allowed to be latches. Finally, we compute $P_{final}$ and $\omega_{final}$ subject to the new constraints at line (19), i.e. the selected $W_{final}$ and the flip-flop/latch decisions. In the next section, we show this approach delivers most of the performance benefits of multiple skewed clocks using a single clock with pulse width $W_{final}$.

6. EXPERIMENTAL STUDY

We implemented our approach within the VPR 5.0 framework [22]. Our results use a mix of VPR 5.0 and MCNC benchmarks mapped to an architecture using a LUT size of 6, and cluster size of 8. We used an island style FPGA architecture consisting of 50% length-4 and 50% length-2 routing segments. Each benchmark was given an extra 30% in channel capacity on top of the minimum required for routing, in order to emulate a medium stress scenario. The results for each circuit are averaged over 5 runs, each using a different placement seed value. The VPR framework does not model hold-times on flip-flops/latches. Consequently, we set Th = 21 Tcq, and Tcq = Tsu, based on values for the Xilinx Virtex 6.

In this work, we show the impact of latches and clock skew both with and without considering short delay path (hold-time) violations. Naturally, smaller short path delays lead to more hold-time violations and degraded performance results. However, the VPR timing model does not incorporate short path delays for combinational logic and routing paths. To handle this, we present results for several short path delay scenarios, ranging from optimistic to pessimistic. Specifically, we set a combinational/routing path's minimum (short) delay as a fraction, f%, of its maximum (long) delay (which is computed by VPR). We consider three settings for f: 80% (optimistic), 70% (medium), and 60% (pessimistic).

Table 2 contrasts the gains possible by using pulsed latches and flip-flops (columns labeled PL), clock skew using flip-flops only (columns labeled CS) with no restrictions on the number of clock lines available, and theoretically possible gains (column labeled MCR) without any short path constraints. The results in the table, in essence, represent different ways of analyzing a set of fully placed and routed designs. Values in the table represent achievable clock periods in ns, as reported by VPR and our latch-based timing analysis framework. Geometric mean results and normalized geometric means appear at the bottom of each column.

The column labeled "Critical Path" presents the clock period (averaged across 5 placement seeds) of each circuit assuming only flip-flops are used, and the flops are driven by a single clock. We refer to this as the "traditional" flow. The "MCR" column shows the clock period when latches are used (or equivalently clock skew with arbitrarily many
Minimum Delay Assumptions                                 None               80%               70%               60%
Clock Period (ns)                       Critical Path      MCR       PL       CS       PL       CS       PL       CS
bigkey                                          2.827    2.574    2.574    2.574     2.58     2.58    2.772    2.598
cf_cordic_v_18_18_18                            5.139    3.242    4.739    4.727     4.77     4.76    4.977    4.792
cf_cordic_v_8_8_8                               3.173    2.164    2.785    2.761    2.815    2.794    2.845    2.826
cf_fir_24_16_16                                 12.43    4.717   12.018    9.543   12.051     9.56   12.083    9.576
cf_fir_3_8_8                                   10.903     4.56   10.492     7.25   10.524    7.278   10.557    7.307
clma                                            7.439    7.061    7.083    7.061    7.083    7.061    7.083    7.061
des_area                                        5.195     5.02     5.02     5.02     5.02     5.02     5.02     5.02
des_perf                                        3.479    2.591    3.067    3.067      3.1      3.1    3.479    3.132
diffeq                                          4.862     3.51    4.483     4.45    4.511    4.482     4.54    4.515
diffeq_paj_convert                             37.059   36.976   36.976   36.976   36.976   36.976   36.976   36.976
elliptic                                        4.415    3.532    4.025    4.004    4.055    4.036    4.349    4.069
fir_scu_rtl_restructured_for_cmm_exp           14.669    5.409   14.269   14.257   14.301    14.29   14.503   14.323
iir                                            15.805    6.195   15.393   15.094   15.426   15.164   15.459   15.234
iir1                                           15.456   14.314   15.085   15.044   15.113   15.077    15.19    15.11
mac1                                           12.168   11.526   11.797   11.797   11.825   11.825   12.168   11.852
mac2                                           20.333   19.727   19.963   19.921    19.99   19.954   20.155   19.987
oc54_cpu                                       17.819   11.616   17.428   17.407   17.458    17.44   17.661   17.473
paj_boundtop_hierarchy_no_mem                   6.057    5.408     5.68    5.645    5.695    5.678    5.912    5.711
paj_framebuftop_hierarchy_no_mem                4.339    3.303    3.928    3.928     3.96     3.96    4.339    3.993
paj_raygentop_hierarchy_no_mem                 11.648    5.702   11.236    7.588   11.269    7.628   11.302    7.668
rs_decoder_1                                   14.039   13.287   13.627   13.627    13.66    13.66   14.039   13.693
rs_decoder_2                                   17.876   17.507   17.507   17.507   17.515   17.515   17.703   17.531
s38417                                          4.929    4.362    4.517    4.517     4.55     4.55    4.929    4.583
s38584.1                                         4.94    4.682    4.682    4.682    4.682    4.682    4.838    4.684
sv_chip0_hierarchy_no_mem                       4.198    3.707    3.786    3.786    3.819    3.819    4.198    3.852
sv_chip1_hierarchy_no_mem                      12.415    3.626   12.003    8.973   12.035    8.989   11.996    8.886
tseng                                           5.011    3.775    4.621    4.598    4.651    4.631    4.838    4.664
Geomean                                         8.228    5.847    7.777    7.398    7.806    7.429    8.043     7.40
Ratio to Critical Path                              1    0.711    0.945    0.899    0.949    0.903    0.978    0.892
Ratio to Maximum Cycle Ratio                                 1     1.33     1.27     1.34     1.27    1.356    1.255
Ratio of Clock Skew to Pulsed Latches                                 1    0.951        1    0.952        1    0.912

Table 2. Achievable clock period (ns) using flip-flops without any time-borrowing (Critical Path), theoretical minimum with
time-borrowing (MCR), pulsed latches (PL) and clock skew (CS) subject to different minimum delay assumptions.
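The "Ratio to Critical Path" row of Table 2 can be re-derived from the "Geomean" row. The sketch below does this for the None, 80% and 70% columns only; the dictionary and helper names are hypothetical, and the values are copied from the table.

```python
# Geometric-mean clock periods (ns) copied from the "Geomean" row of Table 2.
geomean_ns = {
    "Critical Path": 8.228,
    "MCR": 5.847,
    "PL 80%": 7.777,
    "CS 80%": 7.398,
    "PL 70%": 7.806,
    "CS 70%": 7.429,
}


def ratio_to(baseline, column):
    """Geometric mean of `column` normalized to that of `baseline`."""
    return round(geomean_ns[column] / geomean_ns[baseline], 3)


ratios = {col: ratio_to("Critical Path", col) for col in geomean_ns}
for col, value in ratios.items():
    # A ratio of 0.711 corresponds to a 28.9% shorter clock period.
    print(f"{col:14s} {value:.3f}  ({(1 - value) * 100:.1f}% reduction)")
```

Read this way, the MCR column gives roughly a 29% reduction in clock period, and the pulsed-latch columns at 80% and 70% minimum delays give roughly 5%.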

clocks). On average, clock periods are reduced by 29%, which matches fairly closely with prior work [4] (see third last row in the table). The MCR column does not consider short path delay (hold-time) violations and thus, it represents a floor on the achievable clock period.

We now move on to the results that consider hold-time violations. The columns grouped under 80% represent results for the case of minimum path delays set to be 80% of maximum path delays. Observe that the performance results degrade considerably when hold-time constraints must be honored, which underscores the necessity of considering such constraints. Pulsed latches provide 5% performance improvement, on average, relative to the traditional flow (flip-flops with single clock); clock skew with arbitrarily many clocks provides 10% improvement. Similar results are observed when minimum delays are set to 70% of maximum delays. Essentially, with 70% min delays, 5% clock period improvement can be achieved, on average, by latch conversion on a routed design, without requiring any delay padding (hold-time violation repair).

Comparing the results for pulsed latches and clock skew with 70% min delays, we observe identical clock periods for many circuits. Moreover, we found that latch-based optimization with a single clock achieved 90% of the clock period reductions possible with clock skew (with an arbitrary number of clocks) for 23 of the 27 circuits. A more detailed analysis revealed that the circuits for which pulsed latches did poorly compared to clock skew contained what we term self loops: combinational paths that begin and end at the same sequential elements. Such loops do not present a problem for clock skew, which uses edge triggered flip-flops; however, they do present a problem for latches, and limit the performance improvement for some latch circuits. With 60% short path (min) delays, the pulsed latch results degrade further: the short path delays become so small that it is difficult, using a single pulse width, to achieve a performance improvement.

After observing that short paths have a considerable negative impact on both pulsed latch and clock skew-based optimization, we investigated the extent of delay padding (i.e. hold-time repair) needed to improve the clock period results. Specifically, we investigated how each circuit's clock period would be affected if it were possible to pad x% of its paths¹. The aggregated results, normalized to the clock period without any short path repairs, are given in Fig. 4. The rightmost bar indicates that about 3% additional reduction in clock period would be achieved if it were possible to lengthen about 10% of paths. Circuit-by-circuit results could not be included due to page limits; however, we did see reductions of up to 8% for some circuits with 10% of paths delay-padded.

¹ For this analysis, we removed the 6 circuits that already achieved the optimum clock period using latch-based optimization.

[Fig. 4: bar chart; y-axis: clock period (normalized); x-axis: % of connections fixed (None, 3%, 5%, 10%).]
Fig. 4. Clock period as a result of extending 3%, 5%, or 10% of the total number of short paths under 70% minimum delay assumptions.

The results above are based on placement/routing solutions generated in a manner that was unaware of pulsed latch-based optimization. We wondered whether improved results could be achieved if the placement and routing tools were aware of latches. VPR gauges the criticality of a connection using its slack ratio, which is the ratio of the worst-case path delay through a connection to the current critical path delay. We altered VPR's criticality notion to drive place and route using cycle-slacks [19, 23]. The cycle-slack-ratio of a connection is the ratio of the MCR of any cycle that uses the connection to the current MCR of the design. Its definition is thus analogous to slack ratio, making it straightforward to integrate into VPR. Fig. 5 shows the benefits of pulsed latch optimization with place and route using cycle-slacks, and our optimization with regular place and route, normalized to 1.

[Fig. 5: bar chart with two series, "Post P&R Optimization" and "Integrated + Post P&R Optimization"; y-axis: clock period (normalized); x-axis: minimum delay assumptions (None, 80%, 70%, 60%).]
Fig. 5. Comparing the benefits of the pulsed latch optimization with and without using cycle-slacks in VPR.

If we ignore short paths completely, we observe a 3% reduction in the MCR when VPR is driven with cycle-slacks. However, with minimum delay constraints, our results indicate an increase in the MCR, i.e. slightly worse results than if VPR is driven by normal slacks! We believe this is a consequence of cycle-slacks being unaware of short paths, which constrain the amount of time-borrowing allowed. While cycle-slacks indeed yield higher performance when short paths are not considered, said higher performance requires more time borrowing to be realized, and is therefore more susceptible to short path violations. We noticed that VPR had to deal with many near critical cycles throughout the flow, and since it is difficult to equally optimize all equally-critical cycles, cycle-slacks may drive VPR to optimize the wrong cycles. Altering VPR to be capable of lengthening paths in certain cases is likely a fruitful direction for improving the results.

7. CONCLUSIONS AND FUTURE WORK

In this paper, we considered performance optimization using pulsed latches for FPGAs. We showed that most of the gains of using skewed clocks can be realized by a post place-and-route latch-based optimization that has no area or power drawbacks, and that exploits existing features in commercial FPGAs. However, short paths still limit the possible gains of pulsed latches, and integration with the place-and-route algorithms could yield even better results.

Results in Fig. 4 and Fig. 5 show that padding short paths can give considerable benefits, while driving VPR with cycle-slacks gives varied results. It is clear that optimizing the MCR throughout place and route is a flawed approach if short paths are not considered. This calls for an integrated approach that exploits the time-borrowing benefits of pulsed latches and mitigates the short paths that hinder the benefits of pulsed latches.

8. REFERENCES

[1] J. Fishburn, "Clock skew optimization," Computers, IEEE Transactions on, vol. 39, no. 7, Jul. 1990.
[2] K. Ravindran, A. Kuehlmann, and E. Sentovich, "Multi-domain clock skew scheduling," in ACM/IEEE ICCAD.
[3] J. Casanova and J. Cortadella, "Multi-level clustering for clock skew optimization," in ACM/IEEE ICCAD, 2009.
[4] D. P. Singh and S. D. Brown, "Constrained clock shifting for field programmable gate arrays," in ACM FPGA, 2002.
[5] L. Shang, A. S. Kaviani, and K. Bathala, "Dynamic power consumption in Virtex-II FPGA family," in ACM FPGA, 2002.
[6] V. Degalahal and T. Tuan, "Methodology for high level estimation of FPGA power consumption," in IEEE/ACM ASP-DAC, vol. 1, 2005.
[7] I. Kuon and J. Rose, "Measuring the gap between FPGAs and ASICs," in ACM FPGA, 2006.
[8] Virtex-6 FPGA Configurable Logic Block, Xilinx, Inc., San Jose, CA, 2009.
[9] Virtex-6 FPGA Clocking Resources, Xilinx, Inc., San Jose, CA, 2011.
[10] C.-Y. Yeh and M. Marek-Sadowska, "Skew-programmable clock design for FPGA and skew-aware placement," in ACM FPGA, 2005.
[11] X. Dong and G. Lemieux, "PGR: Period and glitch reduction via clock skew scheduling, delay padding and glitchless," in IEEE FPT, 2009.
[12] K. A. Sakallah, T. N. Mudge, and O. A. Olukotun, "Analysis and design of latch-controlled synchronous digital circuits," in ACM/IEEE DAC, 1990.
[13] T. G. Szymanski, "Computing optimal clock schedules," in ACM/IEEE DAC, ser. DAC '92.
[14] B. Taskin and I. Kourtev, "Delay insertion method in clock skew scheduling," IEEE TCAD, vol. 25, no. 4, 2006.
[15] N. Shenoy, R. K. Brayton, and A. L. Sangiovanni-Vincentelli, "Graph algorithms for clock schedule optimization," in ACM/IEEE ICCAD, 1992.
[16] S. Lee, S. Paik, and Y. Shin, "Retiming and time borrowing: Optimizing high-performance pulsed-latch-based circuits," in ACM/IEEE ICCAD, 2009.
[17] H. Lee, S. Paik, and Y. Shin, "Pulse width allocation and clock skew scheduling: Optimizing sequential circuits based on pulsed latches," IEEE TCAD, vol. 29, no. 3, 2010.
[18] T. H. Cormen, C. Stein, R. L. Rivest, and C. E. Leiserson, Introduction to Algorithms, 2nd ed. McGraw-Hill Higher Education, 2001.
[19] "Maximum mean weight cycle in a digraph and minimizing cycle time of a logic chip," Discrete Applied Mathematics, vol. 123, no. 1-3.
[20] J. Cochet-Terrasson, G. Cohen, S. Gaubert, M. Mc Gettrick, and J.-P. Quadrat, "Numerical computation of spectral elements in max-plus algebra," 1998.
[21] A. Dasdan, S. Irani, and R. Gupta, "Efficient algorithms for optimum cycle mean and optimum cost to time ratio programs," IEEE/ACM DAC, 1999.
[22] J. Luu, I. Kuon, P. Jamieson, T. Campbell, A. Ye, W. M. Fang, and J. Rose, "VPR 5.0: FPGA CAD and architecture exploration tools with single-driver routing, heterogeneity and process scaling," in ACM FPGA, 2009.
[23] D. P. Singh and S. D. Brown, "Integrated retiming and placement for field programmable gate arrays," in ACM FPGA, 2002.
