
In-Place Power Optimization for LUT-Based FPGAs

Balakrishna Kumthekar #, Luca Benini †, Enrico Macii ‡, Fabio Somenzi #

# University of Colorado, Boulder, CO 80309
† Universita di Bologna, Bologna, ITALY 40122
‡ Politecnico di Torino, Torino, ITALY 10129

Abstract

This paper presents a new technique to perform power-oriented re-configuration of a system implemented using LUT FPGAs. The main features of our approach are: Accurate exploitation of degrees of freedom, concurrent optimization of multiple LUTs based on Boolean relations, and in-place re-programming without re-routing. Our tool optimizes the combinational component of the CLBs after layout, and does not require any re-wiring. Hence, delay and CLB usage are left unchanged, while power is minimized. As the algorithm operates locally on the various LUT clusters, it performs best on large examples, as demonstrated by our experimental results: An average power reduction of 20.6% has been obtained on standard benchmarks.

1 Introduction

Synthesis and optimization of FPGA circuits is a challenging problem. The freedom provided by the programmable architectures reduces the effectiveness of most of the design automation tools adopted for standard cell devices. Therefore, ad-hoc techniques need to be adopted to handle area, delay, and power optimization of FPGA-based designs.
In this paper, we focus our attention on power minimization. The method we propose, based on the concept of neighborhood of a cluster of functions, targets power optimization through in-place re-programming of one or more configurable logic blocks (CLBs) used in the circuit implementation. This distinctive feature of our approach makes it particularly suitable to deal with post-layout descriptions, i.e., with designs for which the internal routing has already been fixed. In fact, even though for some architectures (e.g., the Xilinx LUT-based FPGAs) the internal routing can be modified by re-configuring the switch and connection blocks, the first-time choice of the interconnections is usually made so that tight timing and resource (i.e., number of used CLBs) constraints are met. Successive power-oriented modifications that affect the routing may thus negatively impact the timing behavior or the feasibility of the resulting solution.
A further advantage of the post-layout re-configuration procedure we propose is that, since full information on wiring capacitance and load is available even for designs that span multiple FPGAs, accurate power estimation is possible; thus, the iterative optimization procedure can exploit such information to re-program the most power-critical CLBs.
Concerning previous work, several approaches to FPGA power minimization have been proposed in the literature [1, 2, 3, 4]. Our technique is not mutually exclusive with any of these methods, but it is applied at a later stage in the design flow.

Relevant to us is also the method of [6], where a re-programming technique for solving the engineering change problem, when the FPGA implementation of the design is already in place, is presented. Differently from the engineering change problem, our purpose is to obtain a circuit whose input-output behavior is equivalent to the original one, but whose power dissipation is reduced. Similar to the work of [6], however, we enforce the constraint that the connectivity of the design remains unchanged, and only the CLB functionality can be altered.
Finally, another work which is related to ours is the generalized matching approach to gate re-mapping proposed in [7]. The idea here is to replace groups of logic gates with functionally equivalent (but less power consuming) groups; as in our case, this objective is achieved by exploiting the concurrent replacement of more than one gate to achieve a more aggressive global power optimization. However, our target technology and the core optimization algorithm differ from those in [7].

2 Background

In this section, we review some basic concepts and terminology used throughout the discussion. We assume that the reader is familiar with Boolean algebras and BDD-based Boolean function manipulation. We denote vectors and matrices in bold, i.e., $\mathbf{x} = [x_1, x_2, \ldots, x_n]$. We use the symbols:

$$\forall_x f = f|_x \cdot f|_{x'} \quad \text{and} \quad \exists_x f = f|_x + f|_{x'}$$

to designate the consensus (or universal quantification) and the smoothing (or existential quantification) of Boolean function $f$ with respect to variable $x$, respectively.
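The two quantification operators can be illustrated on small functions without BDDs. The following is a minimal sketch that represents a Boolean function as a Python callable over 0/1 arguments; the helper names `cofactor`, `consensus` and `smoothing` are ours and are not part of any tool mentioned in this paper.

```python
from itertools import product

def cofactor(f, var, value):
    """Return f with the argument at position `var` fixed to `value`."""
    def g(*args):
        fixed = list(args)
        fixed[var] = value
        return f(*fixed)
    return g

def consensus(f, var):
    """Universal quantification: (f with var = 1) AND (f with var = 0)."""
    pos, neg = cofactor(f, var, 1), cofactor(f, var, 0)
    return lambda *args: pos(*args) & neg(*args)

def smoothing(f, var):
    """Existential quantification: (f with var = 1) OR (f with var = 0)."""
    pos, neg = cofactor(f, var, 1), cofactor(f, var, 0)
    return lambda *args: pos(*args) | neg(*args)

# Example: f(a, b) = a AND b, quantified with respect to a (position 0).
f = lambda a, b: a & b
forall_a = consensus(f, 0)   # identically 0: no b makes f true for both values of a
exists_a = smoothing(f, 0)   # equals b: f can be made true whenever b = 1

for a, b in product((0, 1), repeat=2):
    assert forall_a(a, b) == 0
    assert exists_a(a, b) == b
```

The quantified functions still accept the quantified argument for convenience, but their value no longer depends on it.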
We consider Boolean functions that model a portion (or cluster) of a combinational circuit, and we refer to them as cluster functions. We denote by $\mathbf{f} = [f_1, f_2, \ldots, f_m]$ a generic multi-output cluster function.
The concept of Boolean relation is at the basis of the optimization algorithms detailed in Section 3. For the sake of brevity, we provide here only some basic definitions.
Definition 1 A Boolean relation $F$ is a subset of the Cartesian product of two Boolean spaces $B^n$ and $B^m$, where $B = \{0, 1\}$: $F \subseteq B^n \times B^m$.
Boolean relations can be used to represent sets of Boolean functions. In this case, the Boolean relation is said to be well defined. The formal definition of this property is the following:
Definition 2 A Boolean relation $F$ is well defined if, $\forall \mathbf{x} \in B^n$, $\exists \mathbf{y} \in B^m$ such that $(\mathbf{x}, \mathbf{y}) \in F$.
Furthermore, we define the property of a function contained in a Boolean relation as follows:
Definition 3 For a given Boolean relation $F$, a Boolean function $\mathbf{f} : B^n \mapsto B^m$ is compatible with $F$ if, for every minterm $\mathbf{x} \in B^n$, $(\mathbf{x}, \mathbf{f}(\mathbf{x})) \in F$. Otherwise, $\mathbf{f}$ is incompatible with $F$.
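Definitions 2 and 3 can be checked by enumeration on small examples. The sketch below stores a relation explicitly as a set of (input, output) pairs of bit tuples; this explicit representation, the helper names, and the example relation are illustrative assumptions (the actual implementation relies on BDDs).

```python
from itertools import product

def minterms(width):
    """All bit tuples of the given width."""
    return list(product((0, 1), repeat=width))

def is_well_defined(relation, n):
    """Definition 2: every input minterm has at least one allowed output."""
    inputs_with_output = {x for (x, _) in relation}
    return all(x in inputs_with_output for x in minterms(n))

def is_compatible(func, relation, n):
    """Definition 3: (x, func(x)) belongs to the relation for every minterm x."""
    return all((x, func(x)) in relation for x in minterms(n))

# Hypothetical relation over B^1 x B^2: the AND of the two outputs must equal the input.
F = {(x, y) for x in minterms(1) for y in minterms(2) if (y[0] & y[1]) == x[0]}

copy_input = lambda x: (x[0], x[0])   # compatible with F
all_ones = lambda x: (1, 1)           # violates F when the input is 0

assert is_well_defined(F, 1)
assert is_compatible(copy_input, F, 1)
assert not is_compatible(all_ones, F, 1)
```

For input 0 the relation allows the outputs 00, 01 and 10, a set of choices that cannot be expressed by independent don't cares on each output.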

3 Power Optimization Algorithm

The starting point for our optimization algorithm is a netlist of K-input LUTs. The netlist is completely specified, and it has been back-annotated with post-layout information concerning loads, parasitics, and wiring capacitances.
Given a generic LUT, we denote with $\mathbf{i}$ the vector of its input variables, with $o$ the output variable, and with $f(\mathbf{i})$ the function it implements. In addition, we denote with $C$ the output load, whose value is known with high accuracy, since it comes from post-layout analysis.

The optimization algorithm iteratively selects clusters of two or more LUTs, and finds low-power alternative personalizations (i.e., functions that can be implemented by the LUTs) exploiting the degrees of freedom produced by the environment around the target cluster, as shown in Figure 1. For the sake of clarity, in the following we consider a simple two-output cluster (that is, two LUTs). However, the algorithms and equations we present generalize to clusters with any number of outputs.

Figure 1: A Multi-Output Cluster and Its Neighborhood. (Diagram not reproduced; its labels are p(x), q(o,x), h(x), f(i), x, z, and LUTs with inputs i1, i2, ..., in and outputs o1, o2, ..., on.)

The degrees of freedom on the functions in the cluster are represented by a Boolean relation whose characteristic function can be computed with the following symbolic equation [8]:

$$F(\mathbf{i}, \mathbf{o}) = \forall_{\mathbf{x}, \mathbf{z}} \left[ (P(\mathbf{x}, \mathbf{i}) \cdot Q(\mathbf{o}, \mathbf{x}, \mathbf{z})) \Rightarrow H(\mathbf{z}, \mathbf{x}) \right] \qquad (1)$$

where $P$, $Q$ and $H$ are the characteristic functions corresponding to the Boolean functions $p$, $q$ and $h$ describing the neighborhood of the cluster (see Figure 1).
Boolean relation $F$ represents a set of compatible functions that is richer than the set represented by don't cares. The LUTs are optimized for low power by finding the minimum-power pair of functions $f_1^{opt}$ and $f_2^{opt}$ that are compatible with $F$ and that have the same support $\mathbf{i}_1$, $\mathbf{i}_2$ as the original functions $f_1(\mathbf{i}_1)$ and $f_2(\mathbf{i}_2)$, respectively. Notice that the constraint on the support of $f_1^{opt}$ and $f_2^{opt}$ is enforced because we want to perform in-place optimization of CLBs without changing the connectivity of the FPGA implementation we start with.
Several points must be addressed to achieve an effective implementation of in-place power optimization. First, we need to specify how the candidate clusters of the LUTs and their neighborhoods are selected. Second, we need to formulate a strategy for estimating the power improvements and to decide when to stop the optimization loop. Third, we need to be able to find $f_1^{opt}$ and $f_2^{opt}$ given $F$ and to enforce the constant support constraint. The first two issues are discussed in Section 3.1, where the main optimization loop is described. The third problem is described in Section 3.2.

3.1 Main Loop

The block diagram of the main loop of the optimization algorithm is shown in Figure 2. The first step consists of simulating the FPGA network to estimate the power dissipation. The user can provide typical long input pattern streams, possibly coming from behavioral/RT-level simulation. Alternatively, the input probability distributions can be supplied. Within the loop, the network is re-simulated every $N_{int}$ iterations in order to update the switching statistics and to ascertain that there has been a decrease in the power consumption after every $N_{int}$ optimization steps. In order to speed up the procedure, only a few patterns ($m_{int}$ % of the total) are used for the in-loop simulations. Both $N_{int}$ and $m_{int}$ are user-definable parameters.

Figure 2: Main Power Optimization Loop. (Flowchart not reproduced; its boxes are Original LUTs, Power Estimate (SW), Sorted LUT List, Build Cluster Around LUT, Build Neighborhood, Compute Boolean Relation, Find Min Power Replacement, Mark and Record Gain, Next LUT, Lock Node with Max Gain, All Locked, Power Estimate (SW), Optimized LUTs.)


The nodes in the network (each LUT is a network node, and to each node we associate the function $f$ stored in the LUT) are first sorted according to the product of their switching activity times output load ($SW \times C$). Let these sorted nodes constitute the set $S$. The LUTs are then picked according to this order. For each selected LUT, the companion cluster members and the neighborhoods are computed. Only the members of $S$ which have not been locked due to re-programming in earlier iterations of the algorithm are chosen. (The procedures for cluster selection and neighborhood construction are not described here for space reasons; however, the interested reader can find details on such procedures in [9].)
With the selection of the cluster and its neighborhood, functions $P$, $Q$ and $H$ (Equation 1) can be computed. The stage is then ready for the construction of Boolean relation $F$ and the computation of the minimum-power compatible functions $f^{opt}$ to be implemented by the LUTs in the cluster. These are the key steps of the optimization algorithm, and will be described in the following subsection.
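To make the construction of $F$ concrete, Equation 1 can be evaluated by brute-force enumeration on a toy neighborhood. The sketch below assumes, as our reading of the argument lists in Equation 1, that $P$, $Q$ and $H$ are the characteristic functions of $\mathbf{i} = p(\mathbf{x})$, $\mathbf{z} = q(\mathbf{o}, \mathbf{x})$ and $\mathbf{z} = h(\mathbf{x})$; under that assumption the quantification over $\mathbf{z}$ reduces to the test $q(\mathbf{o}, \mathbf{x}) = h(\mathbf{x})$. The function `compute_relation` and the example functions are hypothetical, and an actual implementation would operate symbolically on BDDs.

```python
from itertools import product

def bits(width):
    """Iterate over all bit tuples of the given width."""
    return product((0, 1), repeat=width)

def compute_relation(p, q, h, n_x, n_i, n_o):
    """Brute-force Equation 1; returns a dict {(i, o): True/False}."""
    F = {}
    for i in bits(n_i):
        for o in bits(n_o):
            # (i, o) is allowed if, for every x that produces i at the cluster
            # inputs, driving the cluster outputs with o still yields z = h(x).
            F[(i, o)] = all(q(o, x) == h(x) for x in bits(n_x) if p(x) == i)
    return F

# Toy neighborhood (illustrative only): three inputs feed the cluster directly,
# and the two cluster outputs are ANDed to produce the single output z.
p = lambda x: x                    # i = x
q = lambda o, x: (o[0] & o[1],)    # z = o1 AND o2
h = lambda x: (x[0] & x[1],)       # required behavior: z = x1 AND x2

F = compute_relation(p, q, h, n_x=3, n_i=3, n_o=2)

def allowed(i):
    """Output pairs that the relation permits for cluster input i."""
    return sorted(o for o in bits(2) if F[(i, o)])

assert allowed((1, 1, 0)) == [(1, 1)]
assert allowed((0, 1, 1)) == [(0, 0), (0, 1), (1, 0)]
```

In the second case any output pair except (1, 1) is acceptable, which is the kind of flexibility across multiple LUTs that the relation captures.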
Cluster and neighborhood selection, and minimum-power compatible function computation are repeated for each of the nodes in the set $S$. After all the nodes in $S$ have been processed, the LUT with the highest gain is selected. This LUT is re-programmed with the minimum-power compatible function and then it is locked, meaning that it cannot be re-programmed in future iterations. The algorithm iterates (Figure 2) until the improvement is below a minimum user-defined threshold d, or all nodes have been locked. Finally, full-delay simulation with the complete set of input patterns is run again on the optimized network to confirm the reduction in power.
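The ordering and locking steps of the main loop can be sketched as follows. The `Node` record, its field names and the numeric values are hypothetical; in the actual flow, switching activities come from simulation and loads from the post-layout capacitance data.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    switching_activity: float   # SW, estimated by simulation
    load_capacitance: float     # C, from post-layout data
    locked: bool = False

def sort_unlocked_nodes(nodes):
    """Return the unlocked nodes ordered by decreasing SW * C."""
    unlocked = [n for n in nodes if not n.locked]
    return sorted(unlocked,
                  key=lambda n: n.switching_activity * n.load_capacitance,
                  reverse=True)

def lock_best_node(gains, nodes_by_name):
    """Lock the node with the largest recorded gain; return its name, or None."""
    if not gains:
        return None
    best = max(gains, key=gains.get)
    nodes_by_name[best].locked = True
    return best

nodes = [Node("lut_a", 0.50, 2.0),
         Node("lut_b", 0.25, 10.0),
         Node("lut_c", 0.40, 1.0, locked=True)]
order = [n.name for n in sort_unlocked_nodes(nodes)]   # lut_b first: 0.25 * 10.0 = 2.5
assert order == ["lut_b", "lut_a"]
```

Once a node is locked it no longer appears in the sorted list, so each iteration of the outer loop removes at most one candidate.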

3.2 Optimization Core

Once the neighborhood and the cluster members are identified, the Boolean relation $F$ is computed through Equation 1. The key optimization performed by our procedure consists of finding minimum-power compatible functions for re-programming the LUTs in the cluster. In the following, let $f_j(\mathbf{i}_j)$ and $f_j^{opt}(\mathbf{i}_j)$, $j = 1, \ldots, maxCluster$, represent the functions implemented by the LUTs in the cluster before and after re-programming, respectively. Notice that the support variables $\mathbf{i}$ of the multi-output cluster function $\mathbf{f}(\mathbf{i})$ are the union of the $\mathbf{i}_j$.
To enforce the constraint that the network connectivity be left unchanged, the Boolean relation $F$ must be restricted according to the constant support constraint to yield $R \subseteq F$. Relation $R$ is a restriction of Boolean relation $F$ with the property that if functions $f_j^{opt}$, $j = 1, \ldots, maxCluster$, compatible with $F$ and with the same support as the original $f_j$ exist, then they are compatible with $R$ as well. The usefulness of $R$ is that it eliminates many compatible functions of $F$ that do not satisfy support constraints, without excluding any valid solution. We have followed the algorithm proposed by Kukimoto and Fujita [6] for the computation of $R$. Unfortunately, although $R$ discards many functions that are compatible with $F$ but do not meet the support constraint, it is not guaranteed to contain only valid compatible functions. Hence, the correctness of the solutions extracted from Boolean relation $R$ must be checked against the support constraints.
After $R$ has been computed, the LUTs in the cluster are ordered for decreasing $SW \times C$. Starting from the LUT with highest switched capacitance, the min-power compatible functions are computed. Assume that LUT $j$ has been selected. We determine the lower bound $l_j(\mathbf{i}_j)$ and the upper bound $u_j(\mathbf{i}_j)$ for it. Function $l_j$ has the same support as $f_j$ and it is the function with minimum ON-set compatible with $R$, while $u_j$ is the function with maximum ON-set (and same support as $f_j$) compatible with $R$. In symbols: $l_j(\mathbf{i}_j) \le h(\mathbf{i}_j) \le u_j(\mathbf{i}_j)$, $\forall h(\mathbf{i}_j)$ compatible with $R$.
To compute $l_j$ and $u_j$ we first extract from $R$ the Boolean relation $R_j \subseteq R$ which contains all and only the functions with support $\mathbf{i}_j$:

$$R_j(\mathbf{i}_j, o_j) = \forall_{i \in (\mathbf{i} \setminus \mathbf{i}_j)} \; \exists_{o \in (\mathbf{o} \setminus o_j)} \; R(\mathbf{i}, \mathbf{o}) \qquad (2)$$

where $o_j$ is the output variable corresponding to LUT $j$.
If $R_j$ is well-defined, the two bounds are easily computed as follows:

$$l_j(\mathbf{i}_j) = R_j|_{o_j}(\mathbf{i}_j, o_j) \cdot \overline{R_j|_{o_j'}(\mathbf{i}_j, o_j)} \qquad (3)$$

$$u_j(\mathbf{i}_j) = R_j|_{o_j}(\mathbf{i}_j, o_j) \qquad (4)$$

The minimum-power function $f_j^{opt}$ for re-programming LUT $j$ is then selected to be either $u_j$ or $l_j$. The rationale for this choice is the following. Assuming no temporal correlation, the switching activity of output $j$ can be computed as $2 p_j (1 - p_j)$, where $p_j$ is the probability of output $j$ being one. Notice that this is a single-maximum function. Since $l_j$ is smaller than any function with support $\mathbf{i}_j$ compatible with $R$, its probability is guaranteed to be minimum. Along the same lines, since $u_j$ is larger than any function with support $\mathbf{i}_j$ compatible with $R$, its probability is guaranteed to be maximum. Consequently, the minimum switching activity is attained either in $u_j$ or in $l_j$ or both. This conclusion is valid in the absence of temporal correlation, and may not be accurate in some cases. However, it provides an efficient way of getting function $f_j^{opt}$, as shown by the results of Section 4.
Once the choice between $l_j$ and $u_j$ is made, node $j$ is marked as a potential candidate for re-programming and its gain is recorded. It is also marked as processed, so that it is not considered again in the same iteration. The relation $R$ is restricted by taking into account the choice of $f_j^{opt}$ made. Then, another cluster output is considered.
It should be noted that our choice of $f_j^{opt}$ is greedy. We try to get the best possible re-programming for LUT $j$ disregarding the ones that come after $j$. As a consequence, the choice for $j$ may result in the Boolean relation being ill-defined when making a choice for, say, $j + 1$. If this is the case, the re-programming algorithm backtracks and chooses another function, different from the one previously chosen, such that the new choice also reduces the switching activity. If this also fails, the node's function is not changed and the algorithm continues with a choice for the next cluster member. Node $j$ is moved to the bottom of the list of nodes yet to be processed. It will be processed again, but it will not be moved again and it will not be allowed to cause backtracking.
Having described the main optimization loop and the core re-programming algorithm, we can now provide the pseudo-code of the in-place optimization procedure (Figure 3).

    stop = iter = 0;
    frequency = N_int;
    do {
        if (iter % frequency == 0) {
            if (iter == 0)
                PerformCompleteSimulation();
            else
                PerformPartialSimulation();
            if (Power is increased) {
                Undo the last N_int changes;
                Leave nodes locked;
            }
        }
        sortedNodes = SortUnlockedNodes(network);
        foreach (node in sortedNodes)
            if (node is not processed) {
                cluster = SelectCluster(node, maxCluster);
                neighbor = SelectNeighborhood(cluster, maxZNodes);
                F(i, o) = ComputeRelation(cluster, neighbor);
                R(i, o) = RestrictAccordingToSupportConstraints(F(i, o));
                ChooseCompatibleFunctions(R(i, o), potNodeTable);
            }
        if (size of potNodeTable > 0)
            ReProgramAndLockBestNode(potNodeTable);
        else
            stop = 1;
        iter++;
    } while (stop == 0);
    PerformCompleteSimulation();

Figure 3: In-Place LUT Re-Programming Algorithm.


The first part of the loop dispatches simulations to assess the quality of the optimization. Simulation of the complete input stream is performed at the first iteration, and simulations of shorter streams are performed every $N_{int}$ iterations.
After simulation, the nodes are sorted based on their output power consumption. The sorted list sortedNodes is then processed, one node at a time. For each node, if the node has not been locked in previous iterations, cluster and neighborhood are constructed. Boolean relations $F$ and $R$ are then computed, and the min-power compatible functions are obtained. The variable potNodeTable returns a set of nodes for potential re-programming. The node in the cluster for which power savings are maximum is selected, re-programmed and locked. The iteration stops when no more savings or unlocked nodes can be found. The procedure is guaranteed to terminate and has a number of iterations which is linear in the number of LUTs.
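The bound computation and the choice between the lower and upper bound described in this section can be sketched on an explicit, well-defined single-output relation. The dictionary representation of $R_j$, the helper names and the minterm probabilities below are illustrative assumptions; the switching activity uses the $2p(1-p)$ model, which ignores temporal correlation.

```python
from itertools import product

def bounds(R_j, width):
    """Lower and upper bounds from a relation given as {(i_j, o_j): bool}."""
    lower, upper = {}, {}
    for i in product((0, 1), repeat=width):
        can_be_one = R_j[(i, 1)]
        can_be_zero = R_j[(i, 0)]
        lower[i] = int(can_be_one and not can_be_zero)   # output forced to 1
        upper[i] = int(can_be_one)                       # output allowed to be 1
    return lower, upper

def one_probability(func, minterm_prob):
    """Probability that the output is 1, given input minterm probabilities."""
    return sum(prob for i, prob in minterm_prob.items() if func[i])

def switching_activity(p_one):
    """Switching activity under the no-temporal-correlation model."""
    return 2.0 * p_one * (1.0 - p_one)

def pick_min_power(lower, upper, minterm_prob):
    """Return 'l' or 'u', whichever bound has the smaller switching activity."""
    candidates = {"l": lower, "u": upper}
    return min(candidates,
               key=lambda k: switching_activity(one_probability(candidates[k], minterm_prob)))

# Hypothetical relation: output forced to 0 on 00, forced to 1 on 11, free elsewhere.
R_j = {((0, 0), 0): True,  ((0, 0), 1): False,
       ((0, 1), 0): True,  ((0, 1), 1): True,
       ((1, 0), 0): True,  ((1, 0), 1): True,
       ((1, 1), 0): False, ((1, 1), 1): True}
lower, upper = bounds(R_j, 2)          # lower is AND, upper is OR of the two inputs

# Assumed input statistics: each input is 1 with probability 0.8, independently.
prob = {(0, 0): 0.04, (0, 1): 0.16, (1, 0): 0.16, (1, 1): 0.64}
choice = pick_min_power(lower, upper, prob)
assert choice == "u"
```

With these statistics the upper bound is 1 with probability 0.96 and therefore toggles less often than the lower bound, which is 1 with probability 0.64.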

4 Implementation and Results

The flow we have used for benchmarking the capabilities of our in-place LUT re-programming algorithm is illustrated in Figure 4. The original .blif description is fed to SIS for logic optimization and mapping onto 5-input LUTs. The script we used for this task is the one suggested in the SIS manual. The mapped circuit (.fpga) is supplied to a tool (PREX) that generates a file (.cap) containing the post-layout capacitances for the nodes in the circuit. The .fpga description and the corresponding capacitance file are input to VIS, the framework in which the optimization algorithm has been developed. VIS is interfaced with a power simulator (PSIM), whose task is that of providing power estimates at all stages of execution of the re-programming algorithm using a set of user-supplied patterns (file .pat), or user-specified primary input statistics (file .i_stat). The output of the flow is the optimized LUT netlist (file .opt).
Figure 4: Experimental Flow. (Diagram not reproduced; it connects the tools SIS, PREX, VIS and PSIM through the files .blif, .fpga, .cap, .pat, .i_stat and .opt.)


Program PREX mimics the job of the tools for placement, routing, and capacitance extraction usually distributed by the FPGA vendors together with their parts; in fact, it produces a capacitance file to be used for accurate timing and power simulation. For our experiments, PREX generates capacitances based on the fanout of the LUTs and some additive random noise (that models the uncertainties in the routing process). This is for the purpose of running experiments which are not biased by the results of different layout tools. Notice, however, that any capacitance distribution would be acceptable, as long as it is the same before and after the optimization, because the interconnections of the various LUTs never change.
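A capacitance generator in the spirit of PREX could look like the sketch below. This is not the PREX implementation: the linear dependence on fanout, the `unit_cap` and `noise` parameters, and the uniform non-negative noise are all assumptions made for illustration.

```python
import random

def synthetic_capacitances(fanout, unit_cap=1.0, noise=0.1, seed=0):
    """Map each node to (unit_cap * fanout) plus uniform additive noise.

    `fanout` is a dict {node name: fanout count}. The noise is drawn from
    [0, noise * base] so that a larger fanout never yields a negative load.
    """
    rng = random.Random(seed)          # fixed seed: same capacitances on every run
    caps = {}
    for node, count in fanout.items():
        base = unit_cap * count
        caps[node] = base + rng.uniform(0.0, noise * base)
    return caps

caps = synthetic_capacitances({"n1": 1, "n2": 4})
assert 1.0 <= caps["n1"] <= 1.1
assert 4.0 <= caps["n2"] <= 4.4
```

Fixing the seed keeps the capacitance file identical before and after optimization, which is the property the comparison relies on.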
Several parameters can be specified to control the optimizer: The depth of the neighborhood, the number of nodes in the cluster, and the number of nodes in the neighborhood. Our goal was to be able to show a practical realization of the in-place power optimization algorithm. Therefore, we implemented the algorithm targeting robustness and conservatively set the parameters to avoid failure even in corner cases and produce results in a reasonable amount of time. Needless to say, with this choice we gave up the opportunity to explore the entire space spanned by the various degrees of freedom. A case in point is the limit of 2 on the number of nodes in the cluster and of 1 on the depth of the neighborhood.
We have assumed the propagation delay through the CLBs, that is, the pin-to-pin block delay of the CLB, to be constant. The total delay of CLB $j$, including the delay due to the output capacitance, can thus be approximated as:

$$D(j) = BlockDelay + C_j \cdot fanoutCount(j) \qquad (5)$$

where $C_j$ is the capacitance of node $j$ taken from the .cap file.
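Equation 5 translates directly into code. In the short sketch below the function name and the numeric values are hypothetical, and units are left arbitrary because the text does not specify them.

```python
def clb_delay(block_delay, node_capacitance, fanout_count):
    """Equation 5: constant block delay plus a load-dependent term."""
    return block_delay + node_capacitance * fanout_count

# Hypothetical CLB with block delay 2.0, node capacitance 0.5 and fanout 3.
d = clb_delay(block_delay=2.0, node_capacitance=0.5, fanout_count=3)
assert d == 3.5
```

Since neither the capacitances nor the fanout counts change during in-place re-programming, this delay estimate is the same before and after optimization.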
Our algorithm disregards the internal power of CLBs. A recent
work [5] has shown that, for the same function, the internal
power dissipation of a CLB can be reduced by reordering the
inputs to the CLB. As we do not allow any change in the interconnect, the above procedure cannot be used, and the constant
power model seems reasonable. In addition, it should be noted
that the output switching power of the CLB usually dominates
the internal power.

Table 1 reports the results we have obtained on a sample of the large MCNC'91 combinational multi-level circuits [10]. Columns CLB, PI and PO report the number of CLBs, primary inputs and primary outputs of the circuit. Columns Init. and Fin. give the power dissipation of the circuit before and after optimization. Column R gives the number of CLBs that have been re-programmed, column Sav. the percentage of power saving, and column Time the CPU time, measured in seconds on a 200MHz Pentium Pro Linux machine with 128MB of RAM, required by the algorithm to complete.
Circ.    CLB    PI    PO   Init.   Fin.     R   Sav. (%)   Time (s)
c499      70    41    32     155    118    22       23.8        343
c1355     94    41    32     199    101    77       48.9        765
alu2     113    10     6     156    117    52       25.3        221
c880     115    60    26     167    151    26        9.4        881
c1908    150    33    25     254    200    52       21.4       1570
alu4     214    14     8     327    246    91       24.8       1696
c2670    307   233   140     330    313    59        5.1       1110
pair     440   173   137     572    562    58        1.7       1582
c3540    470    50    22     652    482   206       26.2      17222
c5315    515   178   123     907    723   207       20.2      12443
t481     628    16     1     652    627    39        3.8       1558
i8       774   133    81    1170    787   102       32.7       1475
c7552    801   207   108    1393   1188   319       14.7      24920
i10      830   257   224     919    782   163       14.9       6948
des     1406   256   245    1887   1341   461       28.9      28675

Table 1: Power Optimization Results.


On all the benchmarks, power was estimated through real-delay simulation of user-specified input patterns. Between 20,000 and 50,000 vectors were used for the initial and final simulations, while only an average of 50% of such vectors was simulated during the optimization phase. The internal simulations were performed every 5 to 10 iterations of the algorithm. The peak power reduction we have obtained is around 48.9%, and the average power reduction is around 20.6%.
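The summary figures can be recomputed from the columns of Table 1. The sketch below is only a consistency check on the published numbers: it shows that the reduction of the summed Init. and Fin. columns rounds to 20.6%, while the plain mean of the Sav. column is slightly lower. The text does not say which averaging the authors used, so the interpretation is ours.

```python
# Columns Init., Fin. and Sav. of Table 1, in row order (c499 ... des).
init = [155, 199, 156, 167, 254, 327, 330, 572, 652, 907, 652, 1170, 1393, 919, 1887]
fin = [118, 101, 117, 151, 200, 246, 313, 562, 482, 723, 627, 787, 1188, 782, 1341]
sav = [23.8, 48.9, 25.3, 9.4, 21.4, 24.8, 5.1, 1.7, 26.2, 20.2, 3.8, 32.7, 14.7, 14.9, 28.9]

# Reduction of the total power over all circuits, in percent.
total_reduction = 100.0 * (sum(init) - sum(fin)) / sum(init)
# Unweighted mean of the per-circuit savings, in percent.
mean_saving = sum(sav) / len(sav)
# Largest per-circuit saving, in percent.
peak_saving = max(sav)

print(f"total: {total_reduction:.1f}%  mean: {mean_saving:.1f}%  peak: {peak_saving:.1f}%")
```

The aggregate figure weights large circuits more heavily, which is consistent with the observation that the method performs best on large examples.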

5 Conclusions

We have presented a technique to perform power-oriented re-configuration of a system implemented using LUT-based FPGAs. Our approach has the distinctive property of being applicable to designs for which the layout has already been generated. Our method operates locally on the various LUT clusters of the original network and performs best on large examples, as demonstrated by the experimental results we have reported.

References

[1] A. Farrahi, M. Sarrafzadeh, "FPGA Technology Mapping for Power Minimization," IWFPLA, pp. 66-77, Sep. 1994.
[2] K-R. Pan, M. Pedram, "FPGA Synthesis for Minimum Area, Delay and Power Consumption," EDTC-96, page 603, Mar. 1996.
[3] C-C. Wang, C-P. Kwan, "Low Power Technology Mapping by Hiding High-Transition Paths in Invisible Edges for LUT-Based FPGAs," ISCAS-97, pp. 1536-1540, May 1997.
[4] C-S. Chen, T. Hwang, C. L. Liu, "Low Power FPGA Design - A Re-Engineering Approach," DAC-34, pp. 656-661, Jun. 1997.
[5] M. Alexander, "Power Optimization for FPGA Look-Up Tables," ISPD-97, pp. 156-162, Apr. 1997.
[6] Y. Kukimoto, M. Fujita, "Rectification Method for Lookup-Table Type FPGAs," ICCAD-92, pp. 54-61, Nov. 1992.
[7] P. Vuillod, L. Benini, G. De Micheli, "Re-Mapping for Low Power Under Tight Timing Constraints," ISLPED-97, pp. 287-292, Aug. 1997.
[8] Y. Watanabe, L. Guerra, R. K. Brayton, "Permissible Functions for Multi-Output Components in Combinational Logic Optimization," IEEE TCAD, Vol. 15, No. 7, pp. 734-744, Jul. 1996.
[9] B. Kumthekar, L. Benini, E. Macii, F. Somenzi, In-Place Power Optimization for LUT-Based FPGAs, Tech. Rep., Dept. of ECE, Univ. of Colorado, Oct. 1997.
[10] S. Yang, Logic Synthesis and Optimization Benchmarks User Guide Version 3.0, Tech. Rep., MCNC, Jan. 1991.