
2014 20th IEEE International Symposium on Asynchronous Circuits and Systems

Integrated Fanout Optimization and Slack Matching of Asynchronous Circuits
Mehrdad Najibi and Peter A. Beerel †

Ming Hsieh Department of Electrical Engineering


University of Southern California
Los Angeles, California 90089
najibiko@usc.edu, pabeerel@usc.edu

Abstract: The integrated fanout optimization/slack matching problem is the problem of jointly building fanout trees and slack matching the circuit. In particular, we insert a minimum number of asynchronous pipelined buffers to achieve a performance target while simultaneously ensuring the fanout count of all the nodes of the circuit is less than specified limits. Our results show that by solving the problem jointly, up to a 45% reduction in the total number of buffers can be achieved compared to the state-of-the-art independent formulations.

I. INTRODUCTION

Well-designed asynchronous circuits can often achieve higher performance and lower latency compared to their synchronous counterparts [1][2][3]. This advantage comes partly from efficient fine-grain pipelining (e.g., [4]) and pipeline optimization techniques that avoid throughput bottlenecks, including pipeline stalls and starvation. These techniques often require specialized CAD algorithms and tools that understand the unique performance aspects of asynchronous pipelined designs [5].

Pipelined asynchronous circuits can be viewed as a collection of processes that communicate by sending tokens via asynchronous channels. An asynchronous channel is a collection of wires which implement some form of handshaking protocol through exchanging request/acknowledge signals to guarantee data delivery. When a process needs to send the same data to multiple processes it often does so through multiple point-to-point channels. However, internally the acknowledgement signals from all these channels must be merged using trees of C-elements to ensure the data was properly delivered to all fanouts. For high-performance systems these C-element trees may become a throughput bottleneck. To address this problem the acknowledgement trees themselves can be pipelined using specialized pipelined fanout buffers that receive data from one input channel and copy it to up to a maximum number of output channels. To support sending data to a large number of fanouts, these pipelined buffers can be organized in trees, and the task of forming these trees is often referred to as fanout optimization [6] and/or fanout fixing [7].

On the other hand, to reduce or remove pipeline stalls created by unbalanced fork-join pipelines or short cycles in a pipelined asynchronous circuit, special pipeline buffers with single input/output channels can be inserted on slow channels. Slack matching is generally defined as the task of adding a minimum number of such pipeline buffers to achieve a given performance target [5][8][9].

We postulate that fanout optimization and slack-matching (FOSM) are highly correlated problems and therefore should be solved in an integrated fashion to achieve optimal results. In particular, we believe the shape of the fanout tree determines the static arrival times associated with the pipeline stages, which highly impact slack matching. Therefore we expect that by properly skewing the fanout trees the total pipelined buffer count, including both slack matching and fanout buffers, will be reduced. However, given the large set of feasible fanout trees and the fact that the slack matching problem is itself NP-Complete [10], solving the two problems together is very challenging.

To the best of our knowledge, the state-of-the-art approach is to solve the fanout optimization problem first with no or limited slack-matching considerations [7][6]. Unfortunately, this means fanout optimization may generate fanout trees which are not properly skewed for slack matching and therefore can result in a higher number of slack matching buffers and far-from-optimal results.

This paper describes a complete mixed integer linear program (MILP) formulation and a heuristic LP relaxation algorithm to solve the problem. While the MILP version is intractable for even small circuits, the LP relaxation algorithm is quite fast and can easily handle large circuits in a matter of a few minutes. Our performance model assumes the circuit is unconditional and as a result our approach is conservative for circuits with conditional communication. Extension to conditional circuits is left as future work.

The remainder of this paper is organized as follows. Section II precisely defines the joint optimization problem and is followed by an overview of the related work in Section III. Section IV then describes the performance model adopted for this work. Subsequently, Section V presents our MILP formulation, followed by Section VI which presents our proposed relaxation algorithm. Section VII explains how the results can be further improved through buffer sharing. Section VIII explains our experimental results and is followed by some conclusions in Section IX.

This research is partly funded by NSF award CCF-1116416 and a research grant from Intel.
† Peter A. Beerel is also Chief Scientist of Technology Development in the Communication and Storage Group, Intel, Calabasas, CA 91302.

1522-8681/14 $31.00 © 2014 IEEE
DOI 10.1109/ASYNC.2014.17
II. PROBLEM FORMULATION

The new problem of integrated slack-matching and fanout optimization can be defined as follows. We are given the following:
• The netlist of an asynchronous circuit (image netlist) as a directed graph $G_c(V_c, E_c)$ where $v \in V_c$ is an asynchronous node, and $c_{ij} = \langle v_i, v_j \rangle \in E_c$ is a channel from node $v_i$ to $v_j$.
• Forward and backward latency values for each channel.
• The maximum acceptable fanout count (number of outgoing edges) for each node $v_i$, denoted $FO_i$.

Our goal is to concurrently build fanout trees by inserting fanout buffers into the graph such that the fanout count of node $v_i$ is less than $FO_i$, and to insert slack-matching buffers into the graph to achieve the performance target, with a minimum number of inserted fanout and slack-matching buffers.

III. RELATED WORK

Although there exists no prior work which considers solving the two problems jointly, in [6] Dimou discussed the relationship between slack matching and fanout optimization for buffer sharing. The method suggests the use of one slack variable at the root of each fanout tree to represent the maximum value of the slack variables on the fanout channels. By using only the maximum slack matching variable in the cost function of the slack matching MILP, Dimou demonstrated that the slack matching buffers can be shared among different branches of the tree and the cost of buffer sharing is accurately modeled.

Several works have addressed the slack matching problem for both unconditional and conditional circuits. Beerel et al. [8] and Prakash et al. [9] presented MILP formulations which guarantee a target cycle time for unconditional asynchronous pipelines by adding the minimal number of pipeline buffers. Gill et al. [11] proposed an efficient method for estimating performance of hierarchical asynchronous circuits with conditional behavior. The method extends the "canopy graph" method developed in [12] for conditional behavior. In [13] the method is used for bottleneck detection in asynchronous circuits. Although slack matching of conditional circuits is addressed, no worst/average-case performance bound is claimed for the slack-matched circuit. Venkataramani et al. [14] proposed a heuristic, iterative algorithm for slack matching an asynchronous circuit with conditional behavior. More recently, Najibi and Beerel [15] presented an optimal MILP and relaxation algorithm for conditional slack matching.

IV. PERFORMANCE MODELING

The performance of an unconditional asynchronous circuit can be modeled using a subclass of Petri nets. A Petri net is a tuple $\langle P, T, F, M_0 \rangle$ where $P$ is the set of places, $T$ is the set of transitions, $F \subseteq (P \times T) \cup (T \times P)$ is the flow relation, and $M_0 : P \to \mathbb{N}$ is the initial marking. For $x \in P \cup T$ we define the preset of $x$ as $\bullet x = \{y \mid \langle y, x \rangle \in F\}$ and the postset of $x$ as $x\bullet = \{y \mid \langle x, y \rangle \in F\}$.

A transition $t$ is enabled in marking $M$ if $\forall p \in \bullet t:\ M(p) \ge 1$, and may eventually fire by removing a token from each place in $\bullet t$ and adding a token to each place in $t\bullet$. To model circuit timing, we assume that delays are assigned to the places of the Petri net by the timing function $d : P \to \mathbb{R}^+$. A place is called a choice if its postset includes more than one transition, and it is called a merge if its preset contains more than one transition. In unconditional asynchronous circuits, the performance model is a marked graph, which is a Petri net with no choice or merge places.

As an example, Figure 1 shows the marked graph model for a generic fork-join structure. This model is called the Full Buffer Channel Net (FBCN) model. Each asynchronous pipeline stage is modeled using a transition, and asynchronous channels between stages are modeled with a pair of places: a forward place (circles) and a backward place (squares), which are labeled with the forward and backward latencies of the corresponding channel [8]. Forward latency is the time needed by the pipeline stage in its initially ready state to generate the output after it receives all of its inputs. Backward latency models the time it takes for the pipeline stage to get back to its initially ready state after it generates the output. The sum of forward and backward latencies represents the local cycle time of the channel.

Figure 1. FBCN model for Fork Join Structure

V. MILP FOR INTEGRATED FANOUT OPTIMIZATION AND SLACK MATCHING

In this section we present the general form of the proposed MILP to solve the joint FOSM problem.

First, note that an unconditional asynchronous circuit is slack matched for target cycle time, $\tau$, by solving the following MILP [8].

$$
\begin{aligned}
&\text{Minimize } \sum_{c} cost_c \cdot s_c \quad \text{subject to} \\
&\forall\, c = \langle v_i, v_j \rangle \in E_c: \\
&\qquad a_j = a_i + l(c) - m(c) \cdot \tau + f_c + l_s \cdot s_c \\
&\qquad 0 \le f_c \le \tau - \tau_c + s_c (\tau - \tau_s) \\
&a_i \ge 0,\ \forall\, v_i \in V_c \\
&s_c \in \mathbb{Z}^+
\end{aligned}
$$

Here, $s_c$ and $cost_c$ are the number and cost of slack-matching buffers added to channel $c$, respectively. $a_i$ is the arrival time variable associated with the asynchronous node $v_i$. $m(c)$ is 1 if channel $c$ has an initial data token upon reset and is 0 otherwise.¹ $l(c)$ and $\bar{l}(c)$ are the forward and backward latency associated with the channel. $\tau_c = l(c) + \bar{l}(c)$ is the local cycle time of the channel. $l_s$ and $\tau_s$ are the forward latency and local cycle time of slack-matching buffers. Finally, $f_c$ is the free slack [8] of the channel, which intuitively represents the maximum stall the channel can tolerate without violating the target cycle time.
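The constraints of the slack-matching MILP above can be checked mechanically for a candidate solution. The following is a minimal sketch, not the authors' tool: it only verifies a given assignment of arrival times and buffer counts and does not solve the MILP. The `Channel` class, the function names, and the small fork-join instance with its latency values are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Channel:
    src: str        # source node v_i
    dst: str        # sink node v_j
    fwd: float      # forward latency l(c)
    bwd: float      # backward latency of the channel
    token: int = 0  # m(c): 1 if the channel holds a data token at reset


def free_slack(ch, arrival, s_c, tau, l_s):
    """Solve a_j = a_i + l(c) - m(c)*tau + f_c + l_s*s_c for f_c."""
    return (arrival[ch.dst] - arrival[ch.src]
            - ch.fwd + ch.token * tau - l_s * s_c)


def is_slack_matched(channels, arrival, slack, tau, l_s, tau_s, eps=1e-9):
    """Check a_i >= 0 and 0 <= f_c <= tau - tau_c + s_c*(tau - tau_s)."""
    if any(a < -eps for a in arrival.values()):
        return False
    for ch, s_c in zip(channels, slack):
        f_c = free_slack(ch, arrival, s_c, tau, l_s)
        upper = tau - (ch.fwd + ch.bwd) + s_c * (tau - tau_s)
        if f_c < -eps or f_c > upper + eps:
            return False
    return True


# Hypothetical unbalanced fork-join: A->B->C->D (long) and A->D (short).
channels = [Channel("A", "B", 2, 6), Channel("B", "C", 2, 6),
            Channel("C", "D", 2, 6), Channel("A", "D", 2, 6)]
arrival = {"A": 0, "B": 2, "C": 4, "D": 6}
TAU, L_S, TAU_S = 10, 2, 8  # assumed target cycle time and buffer parameters

unbuffered = is_slack_matched(channels, arrival, [0, 0, 0, 0], TAU, L_S, TAU_S)
buffered = is_slack_matched(channels, arrival, [0, 0, 0, 1], TAU, L_S, TAU_S)
print(unbuffered, buffered)
```

In this assumed instance the short branch A->D would need a free slack of 4 while its bound is only 2, so the unbuffered assignment is rejected; one slack-matching buffer on that channel lowers the required free slack to 2 and raises the bound to 4, so the buffered assignment is accepted.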
We refer to a high-fanout node along with its buffer tree as a fanout cone and the fanout channels of the node as branches. For the $i$th fanout cone, with a fanout of $F_i$, a feasible fanout tree is a tree which distributes the value of the root channel to all the sinks such that all the fanout buffers as well as the source node meet their fanout constraints. There are many feasible fanout trees with different skews, and such implementations can be partially specified by assigning a level to each branch, using an $F_i \times F_i$ matrix, $X^i = [x^i_{jl}]$.

$$
x^i_{jl} =
\begin{cases}
1 & \text{branch } j \text{ is assigned to level } l \\
0 & \text{otherwise}
\end{cases}
$$

Notice that each branch has to be assigned to one and only one level, meaning that $\sum_l x^i_{jl} = 1$. This property however does not guarantee that the tree specified by an arbitrary $X$ is feasible. As an example, consider a single rooted fanout tree² with $F_i = 4$ and fanout buffers with a maximum of 2 outgoing branches, and the matrices:

$$
X_1 = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}
\quad
X_2 = \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix}
\quad
X_3 = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}
$$

$X_1$ is not implementable because branches 1 and 2 are assigned to level one, and since each buffer has a maximum of two branches and there is only a single root, the tree cannot grow to branches 3 and 4. On the other hand, $X_2$ is implementable with all branches assigned to level 2. $X_3$ is also implementable, forming a skewed fanout tree. Both $X_2$ and $X_3$ can be implemented with 3 buffers, as shown in Figure 2, but as these implementations can result in different static arrival times for slack matching, one of them may be superior.

Figure 2. Implementation of fanout trees of X2 and X3

¹ A module which sends a token to its output upon reset is often called a TokBuf, and channels connected to the output of TokBufs will have $m(c) = 1$.
² Some fanout trees may have multiple roots as the root node may support multiple fanouts by itself.

To eliminate unimplementable matrices, we keep track of the used and available branches in each level of the tree as follows. We define integer $b_l$ to be the number of buffers connected to the tree at level $l$. $b_0$ always equals the number of roots of the fanout tree. The number of branches available at level $l$ is calculated as $K \times b_{l-1}$, where $K$ is a constant equal to the maximum number of outgoing branches of each fanout buffer. For instance, in the $X_2$ example, $b_0 = 1$, $b_1 = 2$, and $b_2 = 0$. For a tree to be implementable, we must guarantee that there are enough available buffers at each level of the tree to support all the assigned branches as well as the available buffers at the next level of the tree. This property can be stated as follows:

$$
K \times b^i_{l-1} \ge \sum_j x^i_{jl} + b^i_l
$$

For the $i$th fanout cone, given the values of $b^i_l$ and $x^i_{jl}$ one can optimally build the tree because the number of buffers and the levels of each branch are determined. The total number of fanout buffers for the cone can be calculated as $\sum_l b^i_l$. Based on the above formulation, an MILP can be used to find optimal implementable trees with an appropriate cost function. However, solving such an MILP for optimal trees can be quite impractical because the set of feasible trees can grow exponentially as the fanout count increases.

The MILP presented in Figure 3 jointly solves the fanout optimization and slack-matching problems. In this formulation $l_f$ is the latency of each fanout buffer and superscript $i$ represents the $i$th fanout cone rooted at node $v_i$. The forward (i) and backward (ii) constraints on the static arrival times are analogous to the slack matching formulation presented in [8]. The only change here is that the delay of fanout buffers is accounted for in the forward paths by the term $l_f \cdot \sum_l l \cdot x^i_{jl}$. Also note that the backward delay constraints are kept unchanged because we assume fanout buffers provide zero free slack. Intuitively, the first constraint asserts that for any channel $c$, the arrival time of the sink node, $j$, is larger than the arrival time of the source node $i$ plus the latencies of the fanout tree and slack matching buffers. The second constraint guarantees that the channel can reset quickly so that the target cycle time can be achieved.

Note that the assumption that fanout buffers do not have any free slack is conservative. In particular, channels that connect to fanout buffers can sometimes have small cycle times, which yields some free slack that can help in slack matching the circuit. Ignoring this potential free slack still leads to a correct slack matching solution, although possibly one using a slightly higher number of slack matching buffers.

Modeling the true free slack of fanout buffers seems to be difficult in the current formulation and is thus left as future work. However, the true free slack of fanout buffers can easily be calculated once the shape of the fanout tree is fully determined. Therefore, the slack matching problem can be reformulated to
account for the true slack of the fanout buffers once the fanout trees are implemented.

$$
\begin{aligned}
&\text{Minimize } \sum_{c} cost_c \cdot s_c + \sum_i \sum_l cost_b \cdot b^i_l \quad \text{subject to} \\
&\forall\, c = \langle v_i, v_j \rangle \in E_c,\ \forall\, l \le F_i: \\
&\qquad a_j = a_i + l(c) - m(c) \cdot \tau + f_c + l_s \cdot s_c + l_f \cdot \sum_l l \cdot x^i_{jl} && \text{(i)} \\
&\qquad 0 \le f_c \le \tau - \tau_c + s_c (\tau - \tau_s) && \text{(ii)} \\
&\qquad K \times b^i_{l-1} \ge \sum_j x^i_{jl} + b^i_l && \text{(iii)} \\
&\qquad \sum_l x^i_{jl} = 1 && \text{(iv)} \\
&a_i \ge 0,\ \forall\, v_i \in V_c \\
&s_c, b^i_l \in \mathbb{Z}^+, \quad x^i_{jl} \in \{0, 1\}
\end{aligned}
$$

Figure 3. MILP to solve fanout optimization and slack matching (FOSM)

VI. RELAXATION ALGORITHM

Our experiments show that solving the proposed MILP in Section V directly is intractable for even moderately sized circuits. Therefore, to solve this problem in a reasonable amount of time, we developed a heuristic algorithm that solves related sub-problems iteratively to obtain high quality suboptimal solutions. To do so we use the same formulation presented in Figure 3 but relax the binary and integer variables to real values. In particular, we let:

$$
s_c, b^i_l \ge 0, \qquad 0 \le x^i_{jl} \le 1
$$

The main reason for the complexity of the presented MILP is that the MILP solver has to branch and bound on a relatively large search space resulting from constraints associated with large fanout trees. The relaxed LP version of the problem can be solved relatively fast, enabling us to define iterative LP sub-problems that decide whether to assign a specific branch to the current level of the fanout tree or to postpone its assignment to a later iteration of the algorithm associated with later layers of the tree. Our iterative relaxation algorithm is presented below and is composed of three major steps.

In each iteration of the algorithm, starting from level $L = 1$ of the fanout trees, our algorithm performs the following three steps and proceeds until all branches are assigned to a layer.
• Step 1: Concurrently assign all the critical branches to level $L$ of the fanout trees.
• Step 2: Round the fractional number of buffers ($b^i_L$) connected to level $L$ to the next largest integer and fix them.
• Step 3: Greedily assign non-critical branches to the current level $L$ of the fanout tree while there is room and only if doing so reduces the overall cost of the FOSM problem.

Step 1 of the algorithm considers all the branches of all fanout trees concurrently and decides which branches can be safely postponed to levels further away from the root of the trees. In particular, we define a branch as critical if it has to be assigned to the current level of the tree because otherwise a timing constraint of the problem will be violated.

Heuristic Rule I: Critical branches are assigned to the current level of the fanout tree because they cannot be postponed.

Note that the criticality of each branch depends on the assignment of other branches that are processed concurrently as well as the assignment of other branches that were processed in previous iterations of the algorithm. To demonstrate this point, let us assume that two fanout trees f1 and f2, as shown in Figure 4, are located in a loop. Let us assume that the slack matching target cycle time for the circuit is 18 transitions and also that before fanout optimization there are 6 pipeline stages in this loop. Therefore, the latency along the loop is initially 12 transitions. At iteration 1 of the algorithm neither of the b11 and b21 branches shown in the figure is critical, as the assignment of both branches can be postponed to level 2, which results in a latency of 16 around the loop. Note that in the next iteration, the available slack for fanout buffers is only 2, as postponing either of these branches leads to a latency of 18. This means that if the assignment of b11 is postponed to the next level of the fanout tree, b21 becomes critical and has to be assigned to level 2. Also note that b11 becomes critical in the next iteration and has to be assigned to level 3.

Figure 4. Interdependency of branch assignment to fanout tree levels

Detection of critical branches at iteration $L$ of the relaxation algorithm is done by solving the relaxed linear program with a simple modification of the cost function. In particular, the following term is added to the LP's cost function:

$$
\sum_{ij} Z_{ij} \cdot x^i_{jL}
$$

where $Z_{ij}$ is a large positive unique coefficient for each branch of each fanout cone. Since the coefficient is large, this modification guarantees that the $x^i_{jL}$ are pushed strongly toward zero. Once the LP is solved, any branch with non-zero $x^i_{jL}$ is conservatively considered to be critical and is forced to be assigned to level $L$. Note that if the branch is already assigned to a previous level of the tree, $x^i_{jl}$ for some $l < L$ is 1 and therefore $x^i_{jL}$ is forced to be zero. If the branch is not assigned to any layer in a previous iteration of the algorithm, $x^i_{jl} = 0$ for all $l < L$, and the fact that $x^i_{jL}$ is non-zero suggests that the branch cannot be pushed to a later stage due to a timing constraint violation. On the other hand, if $x^i_{jL}$ is equal to zero for an unassigned branch, it is guaranteed that it can be pushed back to a later layer of the tree and is therefore non-critical.
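The classification rule just described, applied after the penalized LP has been solved, can be sketched as follows. This is only an illustration of the thresholding step, not of the LP solve itself: the function name, the dictionary layout keyed by (cone, branch), and the numerical tolerance `eps` are assumptions made here, since an LP solver may return tiny non-zero values where the exact solution is zero.

```python
def classify_branches(x_at_level, assigned, eps=1e-6):
    """Split unassigned branches into critical and postponable ones.

    x_at_level maps a (cone, branch) pair to its relaxed value x^i_{jL}
    returned by the LP with the penalty term added to the cost function.
    assigned holds the (cone, branch) pairs fixed at earlier levels.
    """
    critical, postponable = [], []
    for branch, value in x_at_level.items():
        if branch in assigned:
            continue  # already placed at some level l < L
        if value > eps:
            critical.append(branch)     # non-zero despite the penalty
        else:
            postponable.append(branch)  # can be pushed to a later level
    return critical, postponable


# Hypothetical relaxed values at the current level L.
x_at_level = {(1, 1): 0.0, (1, 2): 0.35, (2, 1): 1.0, (2, 2): 0.0}
assigned = {(2, 2)}
critical, postponable = classify_branches(x_at_level, assigned)
print(critical, postponable)
```

Fractional values such as 0.35 are treated the same as 1.0 here, which matches the conservative reading in the text: any non-zero value marks the branch as critical.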
Heuristic Rule II: For each iteration $L$ of the algorithm, once critical branches are assigned, the fractional $b^i_L$ values are rounded up to the next integer.

Once critical branches are assigned to level $L$, at Step 2, we use the ceiling function to fix the value of $b^i_L$ to the next largest integer. For example, if fanout cone $i$ needs 2.3 buffers to ensure a feasible implementation, $b^i_L$ will be forced to 3. This heuristic guarantees that once the relaxation is complete all $b^i_l$ values are integral. The idea behind this heuristic is that if, due to this ceiling operation, some unassigned branches which tend to stay at the current layer are forced to be scheduled at the next level of the tree, there is always available room for these branches in the next layer. For instance, if each fanout buffer has a fanout of 4 in the previous example, we may push a branch with $x^i_{jL} = 0.7$ from layer $L$ to $L + 1$ by ceiling 2.3 to 3. But we know that $0.7 \times 4$ branches are thus made available to the next layer of the tree. Recall that the branches postponed at Step 2 are not critical because otherwise they would have been forced to level $L$ at Step 1.

Heuristic Rule III: At Step 3, branches with a tendency to be scheduled at level $L$ are greedily scheduled to level $L$.

At Step 3, we solve the original relaxed LP once again after modification of the values of $b^i_L$ in Step 2. We then assign branches which have a tendency to be scheduled at level $L$ to this level by forcing their $x^i_{jL} = 1$ if there is any branch available at level $L$. In particular, to determine the availability of branches we examine the values of $b^i_{L-1}$, $b^i_L$, $x^i_{jL}$ and check that the availability constraints of the LP are not violated after the assignment of the branch to this level. More specifically, we say a branch has a tendency to be scheduled in stage $L$ if $x^i_{jL}$ has the maximum value among all $x^i_{jl}$.

VII. BUFFER SHARING AND TREE IMPLEMENTATION

As mentioned before, in the formulation presented for integrated fanout and slack matching (Figure 3), a slack buffer variable is added for each channel. These slack variables are independent from each other even for channels with the same source node which compose a fanout cone. It is however expected that by sharing slack matching buffers between channels of the same fanout cone the total number of slack matching buffers can be considerably reduced. For example, if a fanout cone with 3 channels turns out to need 2, 4, and 5 slack matching buffers on its channels, it would be possible to share 2 slack matching buffers at the root of the fanout tree and add 0, 2, and 3 slack buffers to the channels. As a result, the total slack matching cost can be reduced from 11 to 7 buffers. Although sharing buffers at the source is relatively straightforward, further extension of this idea highly depends on the shape of the fanout tree.

Consider Figure 5, which shows different implementations of the fanout tree of the example above with different multi-level sharing of the slack matching buffers. Figure 5(a) depicts the original implementation of the tree without sharing. Figure 5(b) shows single level sharing of the slack matching buffers at the root of the tree. The two different implementations of the tree shown in Figure 5(c) and (d), while having the same functionality at the circuit level, result in different slack matching costs. In Figure 5(c), since the first two branches labeled with 0 and 2 slack matching buffers share the same source fanout buffer, further sharing is impossible. Meanwhile, in Figure 5(d) sharing can be done more effectively to obtain an even lower slack matching cost of 5. Compared to Figure 5(a), further sharing of slack matching buffers requires a different implementation of the fanout tree with one extra fanout buffer, and therefore the overall reduction in the number of buffers is 5 buffers instead of 6.

Figure 5. Different implementations of a fanout tree and its impact on buffer sharing

The above discussion highlights the dependency between fanout optimization and slack matching. In this section we discuss ideas on shaping the fanout tree optimally to reduce the total cost of fanout tree and slack matching buffers. Note that the solution of the linear program of FOSM only partially determines the shape of the tree by determining the number of available buffers and the level assignment of the branches. The number of required slack matching buffers for each branch is also known by solving the LP. We will present a greedy algorithm to shape the fanout trees to minimize total buffer cost based on the number of slack matching buffers and level of each branch. In the next subsection, we also present several heuristic modifications to the LP to improve the possibility of sharing slack matching buffers by reshaping the fanout tree.

A. Modifications to the LP to Improve Fanout Sharing

The FOSM linear program presented in Figure 3 can be slightly modified to improve fanout sharing. These heuristic modifications are designed to push the final solution of the LP toward fanout tree shapes which enable the maximum possible buffer sharing.

The buffer sharing at the root of the tree can also be formulated at the cost of a slightly more complicated LP. A similar technique is introduced in [6], which is extended in our formulation as follows. For each set of channels in the same fanout cone we add a separate slack matching variable to its channels which models the number of slack matching buffers shared at the root of the tree.

$$
\begin{aligned}
&a_j = a_i + l(c) - m(c) \cdot \tau + f_{ij} + l_s \cdot (s_i + s_{ij}) + l_f \cdot \sum_l l \cdot x^i_{jl} \\
&0 \le f_{ij} \le \tau - \tau_c + (s_{ij} + s_i) \cdot (\tau - \tau_s)
\end{aligned}
$$
In the above formulation, all the channels in the ith fanout    
cone share the same slack matching variable, si , at the root
and therefore the slack matching variables associated to the 
branches sij can be traded off with si . However since the cost   
of si is the same as sij in the formulation, this can reduce the
Figure 6. Example: Implementation of a tree associated with hierarchical
cost. For example, if s12 = 3 and s13 = 1, by setting si = 1 clustering {1, 3, {{4, {5, 6, 7}, 2}, 8}
and reducing s12 and s13 by one, the cost function can be
reduced by 1 unit.
Note that the above change in the formulation improves the solution does not lead to extra slack matching buffers it is
the precision of the cost function and enables the LP to find essential to keep the arrival time of the nodes unchanged.
solutions which are in favor of sharing buffers at the root of The actual implementation of the fanout trees can be ob-
the tree. tained through hierarchical clustering of branches of each tree.
Extension of this idea to the sharing at different levels of the Our greedy algorithm forms the fanout trees simply by finding
tree is not straight forward as adding dedicated slack variables a clustering of the branches from the last level of the tree to-
at the output of the fanout buffers in the mid-levels of the tree ward the root. As an example, for a fanout tree with 8 branches
requires knowledge about the exact shape of the tree which is a hierarchical clustering such as {1, 3, {{4, {5, 6, 7}, 2}, 8} can
not available in the LP. However, a heuristic can be used to completely defines the shape of the tree as shown in Figure 6.
mimic the same concept. Therefore the goal of our greedy algorithm is to find a
Heuristic Rule IV: Slack matching buffer are placed closer clustering which maximizes buffer sharing. Note that while the
to the root of the tree because these can lead to better buffer greedy algorithm always find a feasible clustering the result is
sharing. not always optimal.
To enable the above heuristic in the LP, for each fanout cone Note that in the process of building fanout trees, new
a slack matching variable is added per level of the tree. That channels have to be added to the initial set of channels to
is for a fanout cone with l branches, the fanout tree can have connect fanout buffers to each other. We denote the set of
l levels. We add l slack variables one for each level of the channels which include both original and added channels as
tree namely, s1i , ..., sli where s1i models the number of slack Ê.
matching buffers at the root of the tree. To encourage the LP A cluster, Ck ⊂ Ê, is a set of channels which are connected
to use slack variables closer to root we set the cost of variable to the output of the k th fanout buffer. Obviously the size of
further apart from the root to increase as l goes up. Note that each cluster has to be less than the maximum number of fanout
intuitively, s1i should have 14 of the cost of s3i as each buffer channels a fanout buffer can support, |Ck | ≤ F OM AX .
at level 1 can be potentially shared among 4 branches at level Heuristic Rule V: Only branches which are at the same
3. level of the tree are clustered together to guarantee feasibility
This modification leads to a more complicated LP and to and to keep the arrival time variable associated to the sink
keep the implementation feasible this optimization is only node unchanged to avoid violation of target cycle time.
enabled for fanout cones with less than 10 channels in the Our algorithm sorts the channels in each level of the fanout
current implementation. cone in the descending order of the number of slack matching
⎧  buffers inserted on the channels and then adds the channels

⎨aj = ai + l(c) − m(c) · τ + fij + ls · l (l · sli )+ one by one to the current cluster until the value of cluster

ls · sij + lf · l l · xijl drops such that creation of a new cluster is justified.

⎩ 
0 ≤ fc ≤ τ − τc + (sij + l (sli )) · (τ − τs ) Heuristic Rule VI - Value of the cluster: The value of each
cluster is defined as the minimum value of the slack matching
During level by level relaxation of the FOSM, once a branch variable associated to the channels included in the cluster.
is assigned to a particular level of the tree, say L, sli variables
for any l > L will be dropped from the constraints related V (Ck ) = min sij
to that branch to indicate that sharing of the slack matching cij ∈Ck
buffers associated to the branch is not possible in higher levels As an example if cluster Ck includes two channels with
of the tree. the following values of slack matching buffers {4.1, 3.3} the
Our results showed that this heuristic is very effective value of the cluster is V (Ck ) = 3.3. Adding a new branch
mainly because it increases the precision of the cost function. with slack variable equal to 1.2 to this cluster reduced the
value of the cluster by (3.3 − 1.2) = 2.1. If the cost of
B. Implementing the Fanout Trees - The Greedy Algorithm
creating a new cluster is Ccost = 1, given that there are enough
As mentioned before, the solution of the LP partially available buffers in this level, it make sense to be greedy and
determines the shape of the tree by deriving the level of each terminate the current cluster with just two branches and add
branch (xijl ) and the number of available fanout buffers at each the new branch to a new cluster to minimize the reduction in
level (bli ). For a given solution of the LP however different the total value. However, if we assume that F OM ax = 4,
implementations of fanout trees are possible. To ensure that terminating this cluster with only 2 branches may lead to

buffer availability problems. Therefore the algorithm has to look ahead and ensure that terminating a cluster is safe. For example, if 5 branches would be left unassigned at this level once the cluster is terminated and the number of available buffers is 2, then the cluster cannot be terminated. This is because we have to use one buffer to implement the current cluster and the next buffer can only implement 4 branches. Therefore at this point we have to accept the cost of adding the current branch to the current cluster despite the drop in the cluster value to 1.2.
The complete greedy algorithm to implement the fanout trees is presented in Algorithm 1.

Algorithm 1 Greedy Alg. to Implement Fanout Trees
For a fanout cone $i$, with maximum $L_i$ levels, set current level $curLevel = L_i$, and pick the next empty cluster, $C_k$.
Repeat until $curLevel = 0$:
1) Set UsedBuffers = 0.
2) Sort all the branches assigned to curLevel in descending order based on the number of slack matching buffers added to each channel, $s_{ij}$, in the list of Available Branches (AB).
   • Remove the first branch $c_{ij} = \langle i, j \rangle$, with maximum value of slack matching buffer, from AB, $c_{ij}$ = AB.pop().
   • Add $c_{ij}$ to the current cluster, $C_k$.Add($c_{ij}$).
3) Repeat until AB is empty:
   Select the next channel from AB, $c_{ij}$ = AB.pop(),
   • Terminate $C_k$ as follows if $|C_k| = FOMax$ or $(V(C_k) - V(C_k + c_{ij})) > C_{cost}$ and $|AB| + 1 < FOMax \cdot (b^i_{curLevel} - UsedBuffers)$.
     – Instantiate a fanout buffer with input channel $c_{buf}$ and output channel $C_k$.
     – $\hat{E} = \hat{E} + c_{buf}$.
     – Add $c_{buf}$ to the list of branches assigned to level $curLevel - 1$.
     – UsedBuffers = UsedBuffers + 1.
     – Pick a new empty cluster to continue, $k = k + 1$.
     – Add $c_{ij}$ to the new $C_k$.
   • Otherwise, add $c_{ij}$ to $C_k$.
4) curLevel = curLevel − 1.

VIII. EXPERIMENTAL RESULTS

To demonstrate the effectiveness of the presented method for the integration of fanout optimization and slack matching, the LP formulation and the relaxation and greedy algorithms presented in the previous sections have been implemented and tested on a dual-processor machine with 2 GHz CPUs. In particular, the results are obtained for the ISCAS89 benchmark circuits. As done in [8], we consider each gate as a PCHB pipeline stage, and wires between gates are implemented using asynchronous four-phase dual-rail channels. In our implementation, sequential gates are converted to token buffers. The area of the various implementations is based on an available TSMC 65 nm QDI library [7].

The results obtained from the FOSM method compared to the previous fanout optimization method [6] are presented in Table I and Figure 7. The new integrated fanout optimization method saves on average 22% of the total needed fanout and slack matching buffers. This improvement is mainly due to the creation of skewed fanout trees, which are optimized for slack matching, and to the greedy buffer sharing. While the previous fanout optimization algorithm uses the minimal number of fanout buffers, our results indicate that ignoring slack matching constraints in fanout optimization leads to more slack matching buffers in the later steps of the synthesis flow.

In addition to the percentage of reduction in total buffer count, Figure 7 also depicts the ratio of the number of fanout buffers to the total buffer count. Notice that the optimally skewed fanout trees tend to use almost 50% of the total buffer count in the formation of fanout trees, compared to the roughly 10% fanout buffer ratio of the previous fanout optimization algorithm, and yet, on average, achieve a 22% reduction in overall buffer count.

Figure 7 shows that the cost of this improvement is an increased run time that can be as much as 10X that of the previous algorithm. However, even for our largest benchmark, s13207, which has 5346 pipeline stages and 1160 channels, the proposed algorithm completes in under 7 minutes (382 seconds).

IX. SUMMARY AND CONCLUSIONS

The problem of integrated fanout and slack-matching optimization of asynchronous circuits is formulated. The overall result is a saving in the total fanout and slack matching buffers by an average of 20%, which translates to about 8% average reduction in the total area of the circuit. Our results clearly indicate the dependency of the FO and SM problems. The resulting area improvement comes at the cost of up to a 10 times increase in execution time for larger benchmark circuits, although all the benchmarks could be processed in less than 10 minutes.

In addition, since our formulation processes all the fanout trees concurrently while making decisions, we believe our approach also improves the feasibility of fanout optimization by finding solutions which the original fanout optimization may fail to find. In comparison, the original fanout optimization builds fanout trees sequentially in a random order, and because of the order dependency in forming fanout trees it may fail if the timing constraints are tight and the circuit includes competing large fanout cones.

The generalization of this work to conditional circuits is an interesting area of future work. Similar to what was done for conditional slack matching [16], for a circuit with multiple modes of operation, we believe we can define a mode-based approach for fanout optimization and target average-case delay. The idea is that since different modes of operation have different assigned target cycle times, the same gate may have higher maximum fanout constraints in slower modes of operation. As a result, fanout trees in the slower modes may be implemented with fewer slack-matching buffers and shallower trees, saving additional area and power.

REFERENCES
[1] A. J. Martin, A. Lines, R. Manohar, M. Nystrom, P. Penzes, R. Southworth, U. Cummings, and T. K. Lee, “The design of an asynchronous MIPS R3000 microprocessor,” in Seventeenth Conference on Advanced Research in VLSI, Sept. 1997, pp. 164–181.

TABLE I
INTEGRATED FANOUT & SLACK MATCHING RESULTS
[Table body not recoverable from the extracted text.]

Figure 7. Integrated Fanout & Slack Matching Results
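
The greedy clustering step (Heuristic Rule VI together with the look-ahead check of Algorithm 1) can be illustrated with a minimal Python sketch for a single tree level. The function names (`cluster_value`, `value_drop`, `safe_to_terminate`, `cluster_level`) are hypothetical and do not come from the paper. The look-ahead test is an assumption written to match the worked example in the text (closing a cluster spends one buffer, and each remaining buffer can drive at most FOMax branches); it is an interpretation, not the literal termination condition printed in Algorithm 1.

```python
def cluster_value(slacks):
    """V(C_k): minimum slack-matching value among the cluster's channels."""
    return min(slacks)

def value_drop(slacks, new_slack):
    """Reduction in V(C_k) caused by adding a branch with slack new_slack."""
    return cluster_value(slacks) - cluster_value(list(slacks) + [new_slack])

def safe_to_terminate(remaining_branches, available_buffers, fo_max):
    """Assumed look-ahead: closing the cluster spends one buffer, and the
    buffers left over must be able to drive every still-unassigned branch."""
    return remaining_branches <= fo_max * (available_buffers - 1)

def cluster_level(slacks, available_buffers, fo_max, c_cost):
    """Greedily group the branches of one level into clusters (a sketch)."""
    ab = sorted(slacks, reverse=True)      # Available Branches, largest slack first
    clusters = [[ab.pop(0)]]               # seed the first cluster
    used = 0                               # buffers spent on closed clusters
    while ab:
        s = ab.pop(0)
        current = clusters[-1]
        remaining = len(ab) + 1            # s itself plus what is still in AB
        full = len(current) == fo_max
        worth_it = value_drop(current, s) > c_cost
        safe = safe_to_terminate(remaining, available_buffers - used, fo_max)
        if full or (worth_it and safe):
            used += 1                      # close the current cluster
            clusters.append([s])           # s starts a new cluster
        else:
            current.append(s)
    return clusters

# Worked example from the text: cluster {4.1, 3.3}, new branch with slack 1.2.
drop = value_drop([4.1, 3.3], 1.2)                   # 3.3 - 1.2
can_terminate = safe_to_terminate(5, 2, 4)           # 5 branches left, 2 buffers, FOMax = 4
small = cluster_level([4.1, 3.3, 1.2], 2, 4, 1.0)
tight = cluster_level([4.1, 3.3, 1.2, 1.1, 1.0, 0.9, 0.8], 2, 4, 1.0)
```

With enough buffers, the branch with slack 1.2 starts its own cluster; in the tighter case the look-ahead forces it into the first cluster, as in the example in the text. The sketch assumes the LP solution already guarantees enough buffers at the level and does not handle infeasible inputs.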

[2] S. Bo, W. Zhiying, H. Libo, S. Wei, and W. Yourui, “Reducing power consumption of floating-point multiplier via asynchronous technique,” in Fourth International Conference on Computational and Information Sciences (ICCIS), Aug. 2012, pp. 1360–1363.
[3] N. Jamadagni and J. Ebergen, “An asynchronous divider implementation,” in 18th IEEE International Symposium on Asynchronous Circuits and Systems (ASYNC), May 2012, pp. 97–104.
[4] A. M. Lines, “Pipelined asynchronous circuits,” Master’s thesis, California Institute of Technology, 1998.
[5] M. R. Greenstreet and K. Steiglitz, “Bubbles can make self-timed pipelines fast,” The Journal of VLSI Signal Processing, vol. 2, pp. 139–148, 1990.
[6] G. Dimou, “Clustering and fanout optimization of asynchronous circuits,” Ph.D. dissertation, University of Southern California, 2009.
[7] P. A. Beerel, G. D. Dimou, and A. M. Lines, “Proteus: An ASIC flow for GHz asynchronous designs,” Design & Test of Computers, vol. 28, no. 5, pp. 36–51, Sept. 2011.
[8] P. A. Beerel, A. M. Lines, M. Davies, and N.-H. Kim, “Slack matching asynchronous designs,” in 12th IEEE International Symposium on Asynchronous Circuits and Systems, Mar. 2006, pp. 183–194.
[9] P. Prakash and A. J. Martin, “Slack matching quasi delay-insensitive circuits,” in 12th IEEE International Symposium on Asynchronous Circuits and Systems, Mar. 2006, pp. 194–204.
[10] S. Kim and P. A. Beerel, “Pipeline optimization for asynchronous circuits: complexity analysis and an efficient optimal algorithm,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 25, no. 3, pp. 389–402, Mar. 2006.
[11] G. Gill, V. Gupta, and M. Singh, “Performance estimation and slack matching for pipelined asynchronous architectures with choice,” in IEEE/ACM International Conference on Computer-Aided Design, Nov. 2008, pp. 449–456.
[12] T. Williams, M. Horowitz, L. Alverson, and T. Yang, “A self-timed chip for division,” in Advanced Research in VLSI: Proceedings of the 1987 Stanford Conference, Mar. 1987, pp. 75–95.
[13] G. Gill and M. Singh, “Bottleneck analysis and alleviation in pipelined systems: A fast hierarchical approach,” in 15th IEEE Symposium on Asynchronous Circuits and Systems, May 2009, pp. 195–205.
[14] G. Venkataramani and S. C. Goldstein, “Leveraging protocol knowledge in slack matching,” in IEEE/ACM International Conference on Computer-Aided Design. ACM Press, 2006, pp. 5–9.
[15] M. Najibi and P. A. Beerel, “Deriving performance bounds for conditional asynchronous circuits using linear programing,” in 19th IEEE International Symposium on Asynchronous Circuits and Systems, May 2013, pp. 19–22.
[16] M. Najibi and P. A. Beerel, “Slack matching mode-based asynchronous circuits for average-case performance,” in IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2013.
