
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 29, NO. 7, JULY 2021, p. 1437

Reusable Delay Path Synthesis for Lightening Asynchronous Pipeline Controller
Jeongwoo Heo and Taewhan Kim, Senior Member, IEEE

Abstract— In this work, we address the synthesis problem of two-phase bundled-data asynchronous pipeline controllers, in which the insertion of buffers is essential for guaranteeing the correct handshaking operation on every pipeline stage at the expense of considerable area increase. To lighten the pipeline controllers, we introduce a new logic synthesis concept called delay path sharing and reusing, by which we can significantly reduce the amount of the costly delay buffers. Precisely, first, we propose a technique of synthesizing an asynchronous pipeline controller in a way to share delay buffers among setup timing paths on pipeline stages for minimally allocating total delay buffers. In addition, we devise an area-efficient delay circuit structure called delay path unit (DPU) by extending the proposed delay path sharing concept and propose an in-depth synthesis flow of an asynchronous pipeline controller using DPUs. Through experiments with benchmark circuits using a 45-nm cell library, it is shown that our techniques of synthesizing asynchronous pipeline controllers are able to reduce the controller area by up to 46.3%–59.4% and the leakage power by up to 33.0%–49.0% on average while retaining the same level of performance.

Index Terms— Asynchronous circuit, delay path, logic synthesis, pipeline controller, timing.

Manuscript received November 10, 2020; revised February 18, 2021 and March 20, 2021; accepted April 8, 2021. Date of publication June 2, 2021; date of current version June 29, 2021. This work was supported in part by Samsung Electronics Company, Ltd. under Projects IO201216-08205-01 and FOUNDARY-202010DD003F, in part by the National Research Foundation of Korea (NRF) Grant funded by the Korea Government (MEST) under Grant 2020R1A4A4079177 and Grant 2021R1A2C2008864, in part by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by Korea government (MSIT) under Grant 2021-0-00754 (Software Systems for AI Semiconductor Design), and in part by the BK21 Four Program of the Education and Research Program for Future ICT Pioneers, Seoul National University in 2021. The EDA tool was supported by the IC Design Education. (Corresponding author: Taewhan Kim.)
Jeongwoo Heo is with Memory Business, Samsung Electronics Company, Ltd., Gyeonggi-do 18448, South Korea (e-mail: jw20.heo@samsung.com).
Taewhan Kim is with the Department of Electrical and Computer Engineering, Seoul National University, Seoul 08826, South Korea (e-mail: tkim@snucad.snu.ac.kr).
Color versions of one or more figures in this article are available at https://doi.org/10.1109/TVLSI.2021.3073383.
Digital Object Identifier 10.1109/TVLSI.2021.3073383
1063-8210 © 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

I. INTRODUCTION

AN ASYNCHRONOUS circuit is one of the attractive alternatives to the synchronous circuit design style. Contrary to the synchronization mechanism used in a global clock network in synchronous circuits, asynchronous circuits exploit a handshaking protocol for the communication between circuit components, by which they consume less dynamic power and operate at a higher frequency than their synchronous counterparts in general [1], [2]. One of the barriers to the adoption of an asynchronous handshaking mechanism is the large area of a handshaking controller. One notable method to lower the barrier is employing handshaking controller templates [3]–[5], and those adapted to pipeline structure are widely used for high-performance applications [6] in particular.

For example, Sutherland and Fairbanks [7] customized their dynamic logic and inserted it for fast data transmission, and Singh and Nowick [8] proposed the pipeline controller template with full capacity storage. Later, Fant and Brandt [9], Martin and Nyström [10], and Xia et al. [11] suggested the methods of employing dynamic logic with dual outputs. However, their solutions require substantial care or experience to fit into industrial design flows. On the other hand, Sutherland [12] employed his custom latch (i.e., capture-pass) and a C-element in his pipeline controller template, and Singh and Nowick [13] devised MOUSETRAP, a high-performance pipeline controller template using a normally transparent latch. Recently, on top of [13], Ho and Chang [14] achieved a significant reduction in dynamic power consumption in datapath by blocking glitches in data transmission using a normally opaque latch. Toan et al. [15] slightly ameliorated its energy efficiency and performance using a C-element, but the underlying operation is identical to that in [14].

Meanwhile, delay variation among manufactured chips has grown considerably over the past years due to operation at low $V_{dd}$ and enlargement of process variations. Fig. 1(a) shows the drastic increase of delay variation caused by $V_{dd}$ scaling for 28-nm process technology, and the curves in Fig. 1(b) indicate that the maximum operating frequency can be increased by more than 10× when the temperature changes from 0 °C to 75 °C for the subthreshold voltage (sub-$V_{th}$) regime [16]. Various kinds of post-silicon tuning techniques have been proposed and used to contract this wide performance gap by adjusting delay-tunable components on individual chips based on their physical properties. One representative method is clock skew tuning, which resolves timing violations by inserting delay circuits such as adjustable delay buffers (ADBs) to control local clock skews. One of the examples is a capacitor bank-based ADB implementation [17], and ARM recently used a long delay chain of ring oscillators called tunable delay stages (TDSs) for tracking temperature variation in the system [16].

For a bundled-data protocol-based asynchronous circuit, a receiver should not accept a request signal before the data signals from its sender arrive. To ensure this, most of the previous studies focused on the optimization and logical operation of pipeline controller templates with the assumption of using these post-silicon tunable delay circuits [4], [14]. As a result, they required a large number of delay buffers


to intentionally provide a long delay path on each pipeline stage, causing a considerable area overhead. Fig. 2 shows the change of the whole controller and delay buffer area as the required delay for satisfying handshaking communication increases.

Lightening a pipeline controller directly impacts two domains: 1) mitigating the increase of controller area and 2) reducing leakage power consumption. In this work, we target employing a new delay circuit structure as well as resynthesizing the conventional state-of-the-art pipeline controller in [14] to achieve the two factors in Domains 1 and 2 while retaining all the benefits (e.g., a considerable saving in glitch power) reaped from the controller in [14]. Our work can be summarized as follows.

1) Delay Path Sharing: We propose a technique of synthesizing an asynchronous pipeline controller in a way to share delay buffers among setup timing paths on pipeline stages so that the total delay buffers should be minimally allocated, formulating the allocation problem into linear programming (LP).
2) Delay Path Reusing: By extending the delay path sharing concept, we devise a new area-efficient delay circuit structure called delay path unit (DPU) and propose an in-depth synthesis flow of an asynchronous pipeline controller using DPUs.

This work extends the short version in [18] by additionally proposing the DPU circuit structure, verifying its functionality, and generalizing the synthesis flow with DPUs.

Fig. 1. (a) Impact of $V_{dd}$ scaling on a period of an inverter-based ring oscillator for 28-nm process technology. (b) Change of system frequency with $V_{dd}$ scaling when the temperatures are 0 °C and 75 °C for 65-nm process technology [16].

Fig. 2. Change of the area of an asynchronous pipeline controller when all delay circuits are delay buffer chains (blue curve) and the proportion of delay buffers (red curve) on a pipeline stage. We measured the area and delay according to 45-nm Silvaco Open Cell library.

Fig. 3. Structure of a bundled-data asynchronous pipeline circuit. A delay circuit, e.g., a delay buffer chain, should be inserted on each pipeline stage (the long green and short gray bars) to build up the setup and hold timing paths.

II. BACKGROUND AND RELATED WORK

A. Bundled-Data Versus Delay-Insensitive Asynchronous Circuits

Bundled-data encoding is a coding style that transmits each data bit using exactly one signal wire, for which a handshaking mechanism through request and acknowledgment signal wires between two components should be installed, as shown in Fig. 3. A bundled-data asynchronous circuit consists of a datapath, which is the same structure as a synchronous circuit, and a controller. It is particularly attractive since it has relatively less area overhead than its delay-insensitive counterpart. However, since the delays of handshaking signals (req, ack) control all the timings of datapath operation, sufficient timing margins should be allotted to endure variation.

Contrary to the bundled-data encoding, the delay-insensitive encoding is a technique that conveys a data value and its validity simultaneously by using multiple wires for every data bit. Thus, it is robust to timing variation with no timing margins. However, it incurs a substantial area overhead because of multiple signal wires and detection logic for checking the completion of logic operations.

B. Two-Phase Versus Four-Phase Bundled-Data Protocol

A transaction of two-phase bundled-data protocol starts with issuing data and making a transition on a request signal by a sender. When the receiver accepts this signal, it starts to read the transaction data and finishes the transaction by making a transition on its acknowledgment signal. On the other hand, a four-phase bundled-data protocol operates as follows. First, a sender issues data and sets the request signal to 1. Then, the receiver detects this signal and begins to read data while setting the acknowledgment signal to 1. The sender accepts this acknowledgment signal and then initializes the request signal to 0, and finally, the receiver follows it by setting the acknowledgment signal to 0, completing the transaction.

C. Conventional State-of-the-Art Pipeline Controller Template

Fig. 4 shows a part of the state-of-the-art asynchronous pipeline controller template in [14] to be installed on each pipeline stage. The structure consists of two XOR gates ($\mathrm{XOR}_u^i$ and $\mathrm{XOR}_l^i$), one NOR gate ($\mathrm{NOR}^i$), and a resettable transparent latch (pink box, $\mathrm{LAT}_f^i$), all of which are connected to support the communication protocol on that pipeline stage. Procedures


of the transactions are as follows; let us assume that all request and acknowledgment signals are 0 initially. Then, $\mathrm{XOR}_u^i$ and $\mathrm{XOR}_l^i$ are 1 and 0, respectively, and thus, $\mathrm{NOR}^i$ is 0, which closes $\mathrm{LAT}_f^i$. As $\mathrm{req}^{(i-1)}$ becomes 1 at the latch input, $\mathrm{XOR}_u^i$ goes to 0, causing $\mathrm{NOR}^i$ to go to 1, which makes $\mathrm{LAT}_f^i$ transparent. After that, $\mathrm{XOR}_l^i$ becomes 1, causing $\mathrm{NOR}^i$ to go to 0, which closes $\mathrm{LAT}_f^i$ again. This short time interval (i.e., the sum of $\mathrm{NOR}^i$ and $\mathrm{XOR}_l^i$ delays) during which $\mathrm{LAT}_f^i$ is transparent, thus reducing glitches, is the most significant advantage of this controller.¹ After a relatively long time through the delay circuit, the request signal $\mathrm{req}^i$ of logic value 1 goes to the controller on pipeline stage $i+1$ and comes back as the acknowledgment signal $\mathrm{ack}^{(i+1)}$, which completes the transaction on pipeline stage $i$. Then, it initiates the next event on pipeline stage $i$ with the opposite polarities, i.e., the request and acknowledgment signals of logic value 1.

¹ Note that our proposed pipeline template never enlarges this time interval.

Fig. 4. Asynchronous pipeline controller template proposed in [14].

III. DELAY PATH SHARING FOR LIGHTENING PIPELINE CONTROLLER TEMPLATE

We introduce a new concept called “sharable delay path” in Section III-A, followed by validating its logical correctness, reformulating timing constraints, and minimally allocating delay buffers in the subsequent three subsections.

A. Synthesizing Sharable Delay Paths

Fig. 5(a) shows a section of an asynchronous pipeline circuit using the conventional controller template in [14], in which the red and blue paths indicate the setup timing paths for pipeline stages $i$ and $i+1$, respectively. Note that the setup timing path on each pipeline stage should be long enough so that its delay exceeds the timing critical path delay of the combinational circuit on that pipeline stage, which is achieved by inserting a delay circuit such as a delay buffer chain [green triangles in Fig. 5(a)]. Fig. 5(a) shows that the two setup timing paths in pipeline stages $i$ and $i+1$ are physically disjoint, which means that the total number of delay buffers is the sum of the numbers of delay buffers in those pipeline stages. On the other hand, Fig. 5(b) shows our pipeline controller template with sharable delay paths, which uses one more latch for each pipeline stage than that in [14], and its setup timing paths, exhibiting two distinct features.

Fig. 5. Asynchronous pipeline circuits using (a) controller template in [14] and (b) our pipeline controller template with sharable delay paths. The setup timing paths are highlighted in red and blue colors, and the newly added logic cells are marked with yellow color. Note that the POD on each pipeline stage indicates the last common point between launch and capture timing paths.

1) The red setup timing path is passing through both delay circuits in pipeline stages $i$ and $i+1$, whereas the blue setup timing path is passing through both delay circuits in pipeline stages $i+1$ and $i+2$. Thus, the two setup timing paths partially overlap.
2) Delay circuits, marked as yellow triangles in Fig. 5(b), are inserted into a template spot between the NOR and the upper XOR. We call such delay circuits sharing delay circuits (SDCs) and the rest of delay circuits (i.e., green triangles) non-SDCs (NSDCs).

Section III-B will show that although the new setup timing paths are physically conflicting by Feature 1, the logical behavior of all the setup timing paths will fulfill the original mission. Feature 2 then provides an opportunity to reduce the total number of delay buffers by allocating as many of them as possible in SDCs while satisfying all necessary timing constraints. We will explain the timing constraints for correct operations on asynchronous pipeline circuits and the detailed formulation of minimizing the total number of delay buffers in Sections III-C and III-D, respectively.

It should be noted that some delay circuits are also required at the location of the gray rectangles in Fig. 3 to build up the hold timing paths for guaranteeing reliable data sampling. However, since local enable networks in an asynchronous pipeline circuit are balanced generally, the number of delay buffers in the delay circuits is small in comparison with that of setup timing paths. Hence, we focus on minimizing the number of delay buffers on setup timing paths in this work.

B. Validating Logical Correctness for Sharable Delay Paths

Fig. 6 shows the circuit structure of our controller template, on which two setup timing paths share $\mathrm{XOR}_u^i$ and SDC. We also allocate one latch (yellow rectangle, $\mathrm{LAT}_r^i$) right before feeding NSDC in each pipeline stage. The role of this


latch is to prevent the request signal $\mathrm{req}^i$ starting from the ordinary controller latch (pink rectangle, $\mathrm{LAT}_f^i$) from passing through NSDC until the latch enable signal $\mathrm{en}^{i\to(i+1)}$ coming through $\mathrm{XOR}_u^i$ and SDC is ON. Consequently, the delay of the setup timing path can be manipulated by controlling the numbers of delay buffers in SDC and NSDC. Note that $\mathrm{en}^i$ changes to 0 when either of the outputs of the two XORs becomes 1. In other words, regardless of the delay of SDC, the glitch power consumption in data transmission can be reduced as in the controller template in [14], since the propagation delay of the path coming through $\mathrm{LAT}_f^i$ and $\mathrm{XOR}_l^i$ is short enough.

Fig. 6. Structure of our controller template on a pipeline stage. It is composed of SDCs, an NSDC, and a few subsidiary control logic components.

Fig. 7. Simulation waveforms of our controller template in Fig. 6.

Fig. 7 shows a part of the waveforms produced by SPICE simulation. The state change on $\mathrm{req}^{(i-1)}$ (crimson wave) enables signal $\mathrm{en}^i$ to make $\mathrm{LAT}_f^i$ transparent (green wave), which sets $w_{\mathrm{xoru}}^i$ back to 1 (pink wave). Then, signal $\mathrm{en}^{i\to(i+1)}$ is enabled to make $\mathrm{LAT}_r^i$ transparent (purple wave) after a certain period, which is exactly the delay of SDC (yellow triangles). As $\mathrm{req}^{(i-1)}$ passes through NSDC (green triangles), it becomes $\mathrm{req}^i$ (red wave). Fig. 8 describes all feasible scenarios of the logic behavior of our controller template. Two branches from each of logic states A and B consider the race between the request (from left) and the acknowledgment (from right) signals. The behavior corresponding to the black and blue dashed boxes reveals that our setup timing path is working correctly for sending 1-state and 0-state of request signals, respectively.

C. Reformulating Timing Constraints of Controller Template

Timing correctness of a bundled-data asynchronous circuit is composed of two groups of constraints: template-level constraints and protocol-level constraints [19]. The former assures the quasi-delay insensitive assumptions of a controller template and the latter ensures successful data sampling and transmission. In the following, we formulate the two groups of timing constraints.

1) Template-Level Constraints: Like the controller template in [14], our controller template with SDCs and NSDCs should also be hazard-free, for which it has to satisfy the following two constraints.

Constraint 1: Once the request signal $\mathrm{req}^{(i-1)}$ to the input of the latch $\mathrm{LAT}_f^i$ on pipeline stage $i$ has begun to change its logic value, the acknowledgment signal $\mathrm{ack}^{(i+1)}$ to this pipeline stage should not change its logic value until the template on that pipeline stage is at the suspended (steady) state.

Note that we say the template for a pipeline stage $i$ is at the suspended (steady) state if: 1) both of the output signals of $\mathrm{XOR}_u^i$ and $\mathrm{XOR}_l^i$ are high; 2) the output signal of $\mathrm{NOR}^i$ is low; and 3) $\mathrm{LAT}_f^i$ is closed [14]. We can express this constraint as

$$
\begin{aligned}
\min\big\{\, &\mathrm{delay}\big(\mathrm{req}^{(i-1)} \to w_{\mathrm{xoru}}^{i} \to \mathrm{en}^{i\to(i+1)} \to \mathrm{en}^{i} \to qn_f^{i} \to w_{\mathrm{xoru}}^{i} \to \mathrm{en}^{i\to(i+1)} \to \mathrm{en}^{i}\big), \\
&\mathrm{delay}\big(\mathrm{req}^{(i-1)} \to w_{\mathrm{xoru}}^{i} \to \mathrm{en}^{i\to(i+1)} \to \mathrm{en}^{i} \to q_f^{i} \to w_{\mathrm{xorl}}^{i} \to \mathrm{en}^{i}\big) \,\big\} \\
\le\ &\mathrm{delay}\big(\mathrm{req}^{(i-1)} \to w_{\mathrm{xoru}}^{i} \to \mathrm{en}^{i\to(i+1)} \to \mathrm{en}^{i} \to qn_f^{i} \to w_{\mathrm{xoru}}^{i} \to \mathrm{en}^{i\to(i+1)} \to q_r^{i} \to \mathrm{req}^{i} \\
&\qquad \to \cdots \to \mathrm{ack}^{(i+1)} \to w_{\mathrm{xorl}}^{i} \to \mathrm{en}^{i}\big).
\end{aligned}
\tag{1}
$$

The left-hand side of (1) represents the time lapse between the arrival of request signal $\mathrm{req}^{(i-1)}$ and the close of $\mathrm{LAT}_f^i$. On the other hand, the right-hand side denotes the time interval to the moment that the next acknowledgment signal $\mathrm{ack}^{(i+1)}$ affects the enable signal $\mathrm{en}^i$.

Constraint 2: Once the request signal $\mathrm{req}^{(i-1)}$ to the input of the latch $\mathrm{LAT}_f^i$ on pipeline stage $i$ has begun to change its logic value, the request signal $\mathrm{req}^{(i-1)}$ should keep its logic value until the template on that pipeline stage is at the suspended (steady) state.

We can express this constraint as

$$
\begin{aligned}
\min\big\{\, &\mathrm{delay}\big(\mathrm{req}^{(i-1)} \to w_{\mathrm{xoru}}^{i} \to \mathrm{en}^{i\to(i+1)} \to \mathrm{en}^{i} \to qn_f^{i} \to w_{\mathrm{xoru}}^{i} \to \mathrm{en}^{i\to(i+1)} \to \mathrm{en}^{i}\big), \\
&\mathrm{delay}\big(\mathrm{req}^{(i-1)} \to w_{\mathrm{xoru}}^{i} \to \mathrm{en}^{i\to(i+1)} \to \mathrm{en}^{i} \to q_f^{i} \to w_{\mathrm{xorl}}^{i} \to \mathrm{en}^{i}\big) \,\big\} \\
\le\ &\mathrm{delay}\big(\mathrm{req}^{(i-1)} \to w_{\mathrm{xoru}}^{i} \to \mathrm{en}^{i\to(i+1)} \to \mathrm{en}^{i} \to q_f^{i} \to \mathrm{ack}^{i} \to \cdots \to \mathrm{req}^{(i-1)}\big).
\end{aligned}
\tag{2}
$$
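Constraints (1) and (2) share the same left-hand side, the earlier of the two ways the latch can close, and differ only in the path on the right-hand side, so both can be checked with one comparison per constraint. The following is a minimal Python sketch of that check, assuming the path delays have already been extracted (e.g., from a static timing report). The function names, argument names, and numeric delays are hypothetical illustrations and are not part of the synthesis flow described in this article.

```python
def template_slack(close_via_upper, close_via_lower, return_path):
    """Slack of one template-level constraint.

    close_via_upper / close_via_lower: delays of the two latch-closing
    paths on the left-hand side of (1) and (2).
    return_path: delay of the right-hand-side path.
    All delays use the same unit; a non-negative slack means the
    constraint holds.
    """
    return return_path - min(close_via_upper, close_via_lower)


def template_constraints_hold(close_via_upper, close_via_lower,
                              ack_return_path, req_return_path):
    """Check the shape of Constraints 1 and 2 for one pipeline stage."""
    slack_1 = template_slack(close_via_upper, close_via_lower, ack_return_path)
    slack_2 = template_slack(close_via_upper, close_via_lower, req_return_path)
    return slack_1 >= 0 and slack_2 >= 0


# Hypothetical delays in picoseconds, chosen only for illustration.
print(template_constraints_hold(180.0, 95.0, 640.0, 410.0))  # True
print(template_constraints_hold(180.0, 95.0, 640.0, 80.0))   # False
```

In the second call the request path returns after 80 ps, before the faster closing path (95 ps) completes, so the sketch reports a violation of the Constraint 2 shape.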


The meaning of the left-hand side of (2) is the same as that in (1), while the right-hand side denotes the time lapse between two successive transitions of the request signal $\mathrm{req}^{(i-1)}$.

Note that we should make sure that there is no hazard on the nets shared by the setup timing paths, which trivially holds when Constraint 2 is satisfied. Besides, the data signal $q_f^i$ should arrive no earlier than the falling of the enable signal $\mathrm{en}^{i\to(i+1)}$ to the new latch $\mathrm{LAT}_r^i$ for preventing invalid propagation of the request signal $\mathrm{req}^{(i-1)}$. Similarly, $q_f^i$ should arrive at $\mathrm{LAT}_r^i$ before the rising of $\mathrm{en}^{i\to(i+1)}$ to $\mathrm{LAT}_r^i$, which should be handled carefully when the delay of SDC is small in particular.

Fig. 8. Logic behavior of our controller template in Fig. 6. The parts in red color indicate the change of logic values.

2) Protocol-Level Constraints: Table I defines the list of notations used in the formulation. For a bundled-data asynchronous pipeline circuit shown in Fig. 9, we can express its protocol-level timing constraints as

$$
D_R^{i} + T_R^{(i+1)} + D_L^{(i+1)} \ge D_L^{i} + T_C^{i} + D_P^{i} + T_s^{(i+1)} \tag{3a}
$$

$$
D_A^{i} + T_A^{i} + D_L^{i} + T_C^{i} + D_P^{i} \ge D_L^{(i+1)} + T_h^{(i+1)}. \tag{3b}
$$

TABLE I
DEFINITION OF NOTATIONS USED IN TIMING CONSTRAINTS

Fig. 9. View of timing paths on a bundled-data asynchronous circuit. The red and blue lines denote the launch and capture paths of the setup and hold timing paths between pipeline stages $i$ and $i+1$, respectively.

The two inequalities (3a) and (3b) indicate the setup and hold timing constraints, respectively, for reliable data transmission at the pipeline registers. In the following, we will reformulate the timing constraints based on the two inequalities (3a) and (3b), which should be satisfied for every asynchronous pipeline circuit that employs our controller template.

First, the point-of-divergence (POD) of setup timing path on each pipeline stage is the location [red and blue dots in Fig. 5(b)] at which the enable signal feeds that pipeline stage latch. The launch path starts from the POD and terminates at the data pin to the next pipeline stage registers of an asynchronous circuit, passing through the local enable buffers driving the enable signals to the pipeline registers. On the other hand, the capture path passes through the request signal line and the local enable buffers and finally terminates at the enable pins to the pipeline registers. Thus, it is possible to rewrite the setup timing constraint in (3a) as

$$
\begin{aligned}
&\mathrm{delay}\big(\mathrm{en}^{i} \to qn_f^{i} \to w_{\mathrm{xoru}}^{i} \to \mathrm{en}^{i\to(i+1)} \to q_r^{i} \to \mathrm{req}^{i} \to w_{\mathrm{xoru}}^{(i+1)} \to \mathrm{en}^{(i+1)\to(i+2)} \to \mathrm{en}^{(i+1)}\big) \\
&\qquad + D_L^{(i+1)} \ge D_L^{i} + T_C^{i} + D_P^{i} + T_s^{(i+1)}.
\end{aligned}
\tag{4}
$$

Second, likewise, the POD of hold timing path on each pipeline stage is the location where an enable feeds the next pipeline stage latch. The launch path starts from the POD,


passes through the acknowledgment signal line, the local enable buffers, and the datapath in the current pipeline stage, and terminates at the data pin to the pipeline registers on the next pipeline stage at last. On the other hand, the capture path is quite short, which starts from the same POD, goes through the local enable buffers, and terminates at the enable pin to the pipeline registers on the next pipeline stage. Hence, we can reformulate the hold timing constraint in (3b) as

$$
\mathrm{delay}\big(\mathrm{en}^{(i+1)} \to q_f^{(i+1)} \to \mathrm{ack}^{(i+1)} \to w_{\mathrm{xorl}}^{i} \to \mathrm{en}^{i}\big) + D_L^{i} + T_C^{i} + D_P^{i} \ge D_L^{(i+1)} + T_h^{(i+1)}. \tag{5}
$$

D. Minimally Allocating Delay Buffers

Unlike the conventional asynchronous pipeline controller template, ours flexibly distributes delay buffers to multiple locations of every setup timing path while sharing some delay buffers among the setup timing paths. Consequently, it is possible to minimize the total number of delay buffers in an asynchronous pipeline controller while satisfying the template-level and the protocol-level timing constraints reformulated in Section III-C. Let $S = \{0, 1, 2, \ldots\}$ denote the set of all pipeline stage indexes. If we implement all delay circuits in an asynchronous pipeline controller using delay buffer chains, the propagation delay of the delay circuit will be linearly proportional to the number of delay buffers in it. As a result, we can formulate the equivalent problem of minimizing the total number of delay buffers into an LP as

$$
\begin{aligned}
\min\ & \sum_{i \in S} \big( d_{\mathrm{SDC}}^{i} + d_{\mathrm{NSDC}}^{i} \big) \\
\text{s.t.}\ & \text{Eqs. (1), (2), (4), (5)}, \quad \forall i \in S \\
& \mathrm{delay}\big(\mathrm{en}^{i} \to qn_f^{i} \to w_{\mathrm{xoru}}^{i} \to \mathrm{en}^{i\to(i+1)} \to q_r^{i} \to \mathrm{req}^{i} \to w_{\mathrm{xoru}}^{(i+1)} \to \mathrm{en}^{(i+1)\to(i+2)} \to \mathrm{en}^{(i+1)}\big) \\
& \qquad \le \mathrm{Delay}^{i} + \epsilon, \quad \forall i \in S \\
& \mathrm{delay}\big(\mathrm{en}^{i} \to qn_f^{i} \to w_{\mathrm{xoru}}^{i} \to \mathrm{en}^{i\to(i+1)} \to q_r^{i} \to \mathrm{req}^{i} \to w_{\mathrm{xoru}}^{(i+1)} \to \mathrm{en}^{(i+1)\to(i+2)} \to \mathrm{en}^{(i+1)} \\
& \qquad \to q_f^{(i+1)} \to \mathrm{ack}^{(i+1)} \to w_{\mathrm{xorl}}^{i} \to \mathrm{en}^{i}\big) \le \mathrm{Cycle}^{i} + \epsilon, \quad \forall i \in S.
\end{aligned}
\tag{6}
$$

Note that $\epsilon$ is for avoiding the infeasibility of the formulation caused by the discrepancy in gate delay models. It is an empirically defined timing margin and is usually a small value in comparison to $\mathrm{Delay}^{i}$ and $\mathrm{Cycle}^{i}$. The first four ensure the satisfaction of the template-level and the protocol-level timing constraints, and the last two inequalities prevent the performance degradation caused by the inappropriate distribution of delay buffers. For example, when one additional delay buffer is required to meet the setup timing constraint on pipeline stage $i$, we can distribute it to either $d_{\mathrm{SDC}}^{i}$, $d_{\mathrm{NSDC}}^{i}$, or $d_{\mathrm{SDC}}^{(i+1)}$. If all the timing constraints of pipeline stages $i-1$ and $i+1$ have already been satisfied, the allocation of the delay buffer to $d_{\mathrm{SDC}}^{i}$ or $d_{\mathrm{SDC}}^{(i+1)}$ will make the cycle time or delay of those pipeline stages unnecessarily longer and finally deteriorate the overall circuit performance. However, with the consideration of the last two inequalities, an LP optimizer is able to select $d_{\mathrm{NSDC}}^{i}$ for the delay buffer distribution in such a situation, thereby avoiding the unnecessary degradation of the circuit performance.

IV. IN-DEPTH PIPELINE CONTROLLER TEMPLATE SYNTHESIS WITH DELAY PATH REUSING

A. Synthesizing Delay Path Units

In Section III, we converted the asynchronous pipeline controller template in [14] to ours with sharable delay paths for minimizing the total number of delay buffers. In the same way, for reusing the delay buffers, we can apply this conversion technique, with slight modification, recursively to any delay buffer chains in SDCs and NSDCs in the controller as long as it saves area. We call the delay circuits produced by applying the technique to delay buffer chains DPUs. Fig. 10 shows three DPU types called DPU-1, DPU-2, and DPU-3, which are the delay circuit structures produced by applying the conversion technique once, twice, and three times recursively on delay buffer chains, respectively. For example, we can obtain DPU-2 by inserting DPU-1 instead of the delay buffer chain in between $\mathrm{NOR}_{1}^{\mathrm{DPU}}$ and $\mathrm{XOR}_{1\text{-}u}^{\mathrm{DPU}}$ (labeled as layer-1 delay buffers) in DPU-1 in Fig. 10. Similarly, it is possible to make DPU-3 by replacing the delay buffer chain located in between the uppermost XOR and NOR (labeled as layer-2 delay buffers) in DPU-2 with DPU-1. In this way, we can synthesize any arbitrary DPUs. Note that we use DPU-0 to denote a simple delay buffer chain.

The red path in Fig. 10 indicates the input signal flow in DPU-1, which passes the layer-1 delay buffers twice. Likewise, for DPU-2 and DPU-3, the timing paths of input signal propagation pass through the top-layer delay buffers up to $2^2$ times and $2^3$ times, respectively. By generalizing this, we can state that the timing path of the input signal propagation goes through the delay buffers in DPU-$k$ up to $2^k$ times. Thus, in comparison with a simple delay buffer chain, the number of delay buffers required in the corresponding DPU-$k$ is theoretically reducible up to $1/2^k$ of that of the simple delay buffer chain.

B. Validating Logical Correctness of Delay Path Units

Since the logical operation on each layer inside of DPU-$k$ is the same as that of DPU-1 containing one layer, we will discuss the logical correctness by using the circuit of DPU-1 in Fig. 10. DPU-1 has two latches ($\mathrm{LAT}_{1\text{-}f}^{\mathrm{DPU}}$ and $\mathrm{LAT}_{1\text{-}r}^{\mathrm{DPU}}$, marked as yellow rectangles): one placed next to the input signal $\mathrm{signal}_{in}$ and the other next to the layer-0 delay buffers. Like the operation of our controller template explained in Section III, $\mathrm{signal}_{in}$ has to go through the layer-1 delay buffers first and sets $\mathrm{en}_{1\text{-}f}^{\mathrm{DPU}}$ to 1 to pass $\mathrm{LAT}_{1\text{-}f}^{\mathrm{DPU}}$. Meanwhile, $\mathrm{LAT}_{1\text{-}r}^{\mathrm{DPU}}$ blocks the signal $q_{1\text{-}f}^{\mathrm{DPU}}$ from reaching the layer-0 delay buffers until the latch enable signal $\mathrm{en}_{1\text{-}r}^{\mathrm{DPU}}$ coming from $\mathrm{XOR}_{1\text{-}u}^{\mathrm{DPU}}$ through the layer-1 delay buffers becomes on. Note that $\mathrm{LAT}_{1\text{-}f}^{\mathrm{DPU}}$ is closed by resetting $\mathrm{en}_{1\text{-}f}^{\mathrm{DPU}}$ to 0 through $\mathrm{XOR}_{1\text{-}l}^{\mathrm{DPU}}$, regardless of the number of layer-1 delay buffers. If


$w_{1\text{-}\mathrm{xorl}}^{\mathrm{DPU}}$ changed to 0 before the delayed $w_{1\text{-}\mathrm{xoru}}^{\mathrm{DPU}}$ (i.e., $\mathrm{en}_{1\text{-}r}^{\mathrm{DPU}}$) is back to 1, $\mathrm{en}_{1\text{-}f}^{\mathrm{DPU}}$ would become on once again, and as a result, the next $\mathrm{signal}_{in}$ would pass through $\mathrm{LAT}_{1\text{-}f}^{\mathrm{DPU}}$. However, $\mathrm{en}_{1\text{-}f}^{\mathrm{DPU}}$ will never be triggered more than once by the single transition of $\mathrm{signal}_{in}$ since one of the inputs to $\mathrm{XOR}_{1\text{-}l}^{\mathrm{DPU}}$ comes from $q_{1\text{-}r}^{\mathrm{DPU}}$.

Fig. 10. DPUs that are recursively to be applied to the modified circuit structure of the asynchronous pipeline controller template introduced in Section III. The synthesized timing path of the signal propagation is highlighted in red color, and the subsidiary logic cells are marked with yellow color.

Fig. 11. Simulation waveforms of DPU-1 in Fig. 10.

Figs. 11 and 12 show the waveform from the SPICE simulation on DPU-1 in Fig. 10 and all of the possible logic behavior scenarios in DPU-1, respectively. First, the logic state of $\mathrm{signal}_{in}$ (crimson wave) becomes 1, which triggers the transition of $w_{1\text{-}\mathrm{xoru}}^{\mathrm{DPU}}$ (pink wave) from 1 to 0. The output signal of $\mathrm{XOR}_{1\text{-}u}^{\mathrm{DPU}}$ delayed by layer-1 delay buffers, i.e., $\mathrm{en}_{1\text{-}r}^{\mathrm{DPU}}$ (purple wave), passes $\mathrm{NOR}_{1}^{\mathrm{DPU}}$ and sets $\mathrm{en}_{1\text{-}f}^{\mathrm{DPU}}$ (green wave) to 1, making $\mathrm{LAT}_{1\text{-}f}^{\mathrm{DPU}}$ transparent. After passing $\mathrm{LAT}_{1\text{-}f}^{\mathrm{DPU}}$, the inverted $\mathrm{signal}_{in}$ resets $w_{1\text{-}\mathrm{xoru}}^{\mathrm{DPU}}$ to 1, and this transition goes through the layer-1 delay buffers once again and makes $\mathrm{LAT}_{1\text{-}r}^{\mathrm{DPU}}$ transparent by setting $\mathrm{en}_{1\text{-}r}^{\mathrm{DPU}}$ from 0 to 1. As a result, $\mathrm{signal}_{in}$ can pass $\mathrm{LAT}_{1\text{-}r}^{\mathrm{DPU}}$ and the layer-0 delay buffers.

C. Updating Timing Constraints for Delay Path Units

Since the timing path of the input signal propagation will reuse the delay buffers in DPU, these signal transitions should be hazard-free to be logically correct operations, for which it has to satisfy the following constraint.

Constraint 3: Once the input signal $\mathrm{signal}_{in}$ on the $l$th layer in DPU has begun to change its logic value, this signal should keep its state until the transition comes back to $\mathrm{xor}_{l\text{-}u}^{\mathrm{DPU}}$ again through $\mathrm{nor}_{l}^{\mathrm{DPU}}$ and $\mathrm{LAT}_{l\text{-}f}^{\mathrm{DPU}}$.

Let $\tau^{(l-1)}$ be the minimum holding time of the input signal to the $l$th layer in DPU. Then, DPU should satisfy the following inequality:

$$
\mathrm{delay}\big(w_{l\text{-}\mathrm{xoru}}^{\mathrm{DPU}} \to \mathrm{en}_{l\text{-}r}^{\mathrm{DPU}} \to \mathrm{en}_{l\text{-}f}^{\mathrm{DPU}} \to qn_{l\text{-}f}^{\mathrm{DPU}} \to w_{l\text{-}\mathrm{xoru}}^{\mathrm{DPU}}\big) \le \tau^{(l-1)}.
$$

Fig. 13 describes this constraint by waveforms and signal transition relation. In Fig. 13(a), the transition of $\mathrm{signal}_{in}$


passes $XOR^{DPU}_{l\text{-}u}$, the delay buffers, and then $LAT^{DPU}_{l\text{-}f}$, setting $w^{DPU}_{l\text{-}xoru}$ back to 1. In the meantime, the next $signal_{in}$ transition also repeats the same procedure, starting from $XOR^{DPU}_{l\text{-}u}$. Thus, $w^{DPU}_{l\text{-}xoru}$ must be back to 1 before it is set to 0 by that transition. Fig. 13(b) shows two (temporally) nonconflicting timing paths in a layer of DPU.

Fig. 12. Logic behavior in DPU-1. The parts in red color indicate the change of logic values.

Fig. 13. Timing waveforms and (temporal) nonconflicting timing path describing Constraint 3. (a) Timing waveforms. (b) Two hazard-free timing paths.

As explained previously, a DPU can replace either SDCs or NSDCs. The input signal to NSDC on pipeline stage $i$ keeps its logic state during the cycle time of that pipeline stage so that the minimum holding time for $l = 1$ will be $Cycle^{i}$ for DPU replacing that NSDC. On the other hand, the minimum holding time of the input signal to SDC on pipeline stage $i$ is the smaller of the two time intervals during which the output of $XOR^{i}_{u}$ on that stage sustains 0-state and 1-state. Thus, we can calculate $\Delta_{0}$ as

$$
\min\Big(\mathrm{delay}\left(w^{i}_{xoru} \to en^{i\to(i+1)} \to en^{i} \to qn^{i}_{f} \to w^{i}_{xoru}\right),\; Cycle^{(i-1)} - \mathrm{delay}\left(w^{i}_{xoru} \to en^{i\to(i+1)} \to en^{i} \to qn^{i}_{f} \to w^{i}_{xoru}\right)\Big).
$$

Two terms in the min-function indicate the duration of sustaining 0-state and 1-state on $w^{i}_{xoru}$ in Fig. 6, which corresponds to $signal_{in}$ of DPUs in Fig. 10, and the minimum holding time of the input signal to $XOR^{i}_{u}$ on pipeline stage $i$ (i.e., $req^{i}$) corresponds to the value of $Cycle^{(i-1)}$. Recursively, for the $l$th layer in DPU, the minimum holding time can be expressed in terms of the minimum holding time of the lower $(l-1)$th layer $\Delta_{(l-1)}$ as

$$
\min\Big(\mathrm{delay}\left(w^{DPU}_{l\text{-}xoru} \to en^{DPU}_{l\text{-}r} \to en^{DPU}_{l\text{-}f} \to qn^{DPU}_{l\text{-}f} \to w^{DPU}_{l\text{-}xoru}\right),\; \Delta_{(l-2)} - \mathrm{delay}\left(w^{DPU}_{l\text{-}xoru} \to en^{DPU}_{l\text{-}r} \to en^{DPU}_{l\text{-}f} \to qn^{DPU}_{l\text{-}f} \to w^{DPU}_{l\text{-}xoru}\right)\Big). \tag{7}
$$

To summarize, the DPU structure can employ additional XOR, NOR, and latches to save delay buffers. For example, for implementing a target delay $d^{0}_{target}$ smaller than that implemented through DPU-1, say $d^{min}_{DPU\text{-}1}$, we should exploit delay buffer chains (DPU-0) to avoid a large gap with actual delay. For a target delay $d^{1}_{target}$ larger than $d^{min}_{DPU\text{-}1}$, we can make it through both DPU-0 and DPU-1. Of the two, DPU-1 would be more area-efficient because it can reduce the number of delay buffers by about half by inserting two XORs, one NOR, and one latch in that circuit. In the same way, DPU-2 could be more area-efficient than DPU-1 for a much larger target delay $d^{2}_{target}$ (i.e., $d^{2}_{target} > d^{1}_{target}$), and for that, we will require four XORs, two NORs, and four latches in the delay circuit. However, a deeper DPU structure does not always provide better area efficiency due to Constraint 3 pointed out in this section (note that the number of delay buffers in each layer, the minimum holding time of each layer, and Constraint 3 are closely intertwined).

D. In-Depth Synthesis Flow Utilizing Delay Path Units

While implementing asynchronous pipeline circuits using the LP formulation in (6) in Section III, it is not clear how to apply the formulation directly to synthesizing our controller template with various types of DPUs. Hence, we propose a step-wise implementation procedure shown in Fig. 14.

Step 1 (Preprocessing): For each pipeline stage, we first estimate the cycle time of the pipeline stage from its target delay computed from the timing analysis of its corresponding datapath after the completion of combinational logic synthesis. Note that for each pipeline stage in our controller template, both the cycle time and the delay of the pipeline stage include the same SDCs and NSDCs. Thus, we can easily find out the cycle time of the pipeline stage from the target delay before starting to allocate DPU.
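The holding-time bookkeeping of Section IV-C can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each layer's feedback loop is summarized by a single delay number, follows the index convention used in (8) (the input to layer $l$ is held for the holding time of layer $l-1$), and works in integer picoseconds. All function names are hypothetical.

```python
def sdc_holding_time(loop_delay, prev_cycle):
    """Holding time of the input to a DPU that replaces an SDC.

    The XOR output stays in one state for `loop_delay` and in the other
    for `prev_cycle - loop_delay`; the shorter interval is the bound.
    """
    return min(loop_delay, prev_cycle - loop_delay)


def nsdc_holding_time(cycle):
    """Holding time of the input to a DPU that replaces an NSDC."""
    return cycle


def layer_holding_times(hold0, loop_delays):
    """Propagate holding times through the layers of a DPU.

    hold0       : holding time of the input to layer 1
    loop_delays : loop_delays[k] is the feedback-loop delay of layer k+1
    Returns (holds, feasible); holds[k] is the holding time of the input
    to layer k+1, and feasible is False once Constraint 3 is violated.
    """
    holds = [hold0]
    for loop in loop_delays:
        if loop > holds[-1]:
            # Constraint 3: the loop must close while the input is held.
            return holds, False
        holds.append(min(loop, holds[-1] - loop))
    return holds, True


hold0 = sdc_holding_time(loop_delay=400, prev_cycle=1000)
print(layer_holding_times(hold0, [150, 60]))
print(layer_holding_times(hold0, [150, 200]))
```

Because each layer can only shorten the holding time passed to the next one, deeper layers can host progressively fewer delay buffers, which is the coupling between layer sizes and Constraint 3 noted above.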


Fig. 14. Flow of synthesizing an asynchronous pipeline controller.

Step 2 (Constructing DPU Area Prediction Models): Then, for each target propagation delay sample, we produce all feasible DPUs from which we build a piecewise linear model between the target delays and the area-minimal DPU implementation. Since the minimum holding times for SDCs and NSDCs are different, we construct the piecewise linear model for SDC as well as NSDC on every pipeline stage. Except for implementing the simple delay buffer chain (i.e., DPU-0), for synthesizing DPUs, all timing constraints in Section IV-C should be met. Hence, for synthesizing DPU-$L$ for a given minimum holding time $\Delta_{0}$ and target delay $d_{target}$, we can compute the number of delay buffers minimally required in each layer in the DPU while satisfying all the constraints by solving the following integer LP (ILP) formulation:

$$
\begin{aligned}
\min\ & \sum_{l=0}^{L} n_{l} \\
\text{s.t.}\ & \Delta_{l} \le \mathrm{delay}\left(w^{DPU}_{l\text{-}xoru} \to en^{DPU}_{l\text{-}r} \to en^{DPU}_{l\text{-}f} \to qn^{DPU}_{l\text{-}f} \to w^{DPU}_{l\text{-}xoru}\right), \quad \forall l \in \{1, \ldots, L-1\} \\
& \Delta_{l} \le \Delta_{(l-1)} - \mathrm{delay}\left(w^{DPU}_{l\text{-}xoru} \to en^{DPU}_{l\text{-}r} \to en^{DPU}_{l\text{-}f} \to qn^{DPU}_{l\text{-}f} \to w^{DPU}_{l\text{-}xoru}\right), \quad \forall l \in \{1, \ldots, L-1\} \\
& \mathrm{delay}\left(w^{DPU}_{l\text{-}xoru} \to en^{DPU}_{l\text{-}r} \to en^{DPU}_{l\text{-}f} \to qn^{DPU}_{l\text{-}f} \to w^{DPU}_{l\text{-}xoru}\right) \le \Delta_{(l-1)}, \quad \forall l \in \{1, \ldots, L\} \\
& d_{target} \le \mathrm{delay}\left(signal_{in} \to \cdots \to signal_{out}\right) \le d_{target} + d_{margin}.
\end{aligned}
\tag{8}
$$

The first two constraints describe (7) using two inequalities and restrict the minimum holding time of each layer in DPU-$L$. The third constraint corresponds to Constraint 3 in Section IV-C, and the last one controls the difference between the target delay and our implementation. Note that $n_{l}$ indicates the number of layer-$l$ delay buffers and $d_{margin}$ represents the modeling margin.

Step 3 (Assigning Delays for DPUs): By referring to the piecewise linear models, we formulate the DPU allocation problem into an ILP

$$
\begin{aligned}
\min\ & \sum_{i \in S} \left( Area^{i}_{S}\left(d^{i}_{SDC}\right) + Area^{i}_{NS}\left(d^{i}_{NSDC}\right) \right) \\
\text{s.t.}\ & \text{Eqs. (1), (2), (4), (5)}, \quad \forall i \in S \\
& \mathrm{delay}\left(en^{i} \to qn^{i}_{f} \to w^{i}_{xoru} \to en^{i\to(i+1)} \to qr^{i} \to req^{i} \to w^{(i+1)}_{xoru} \to en^{(i+1)\to(i+2)} \to en^{(i+1)}\right) \le Delay^{i} + \epsilon, \quad \forall i \in S \\
& \mathrm{delay}\left(en^{i} \to qn^{i}_{f} \to w^{i}_{xoru} \to en^{i\to(i+1)} \to qr^{i} \to req^{i} \to w^{(i+1)}_{xoru} \to en^{(i+1)\to(i+2)} \to en^{(i+1)} \to q^{(i+1)}_{f} \to ack^{(i+1)} \to w^{i}_{xorl} \to en^{i}\right) \le Cycle^{i} + \epsilon, \quad \forall i \in S.
\end{aligned}
\tag{9}
$$

$Area^{i}_{S}$ and $Area^{i}_{NS}$ are the mapping functions of the DPU of minimum area on pipeline stage $i$ for SDC and NSDC, respectively. Unlike (6), the objective of (9) includes the piecewise linear model, and thus, it becomes an ILP problem due to the inclusion of integer variables to indicate the selection among the intervals in the models. However, the time for solving this ILP formulation is not that significant since there are just a few intervals in the piecewise linear model.

Step 4 (Implementing DPUs): By solving (8) once again with the delay values assigned to all the DPUs produced by the solution of (9), we can complete the synthesis of the pipeline controller.

V. EXPERIMENTAL RESULTS

A. Environment Setup

To demonstrate the effectiveness of our asynchronous pipeline controller template and the delay circuit structure, we first generated four groups (Group-1∼4) of pipeline circuits as follows.
1) Group-1 were made by serially linking ten copies of the same ISCAS'85 circuits. For example, the circuit labeled C2670 × 10 has ten copies of combinational circuit C2670, placed each one in between two consecutive pipeline stages.
2) Group-2 is similar to Group-1. The only difference is that we set their combinational circuits and the number of pipeline stages randomly among ISCAS'85 designs and 4, 5, ..., 21.
3) Group-3 is a set of simple FIFO buffers with 5, 10, 15, and 20 stages.
4) Group-4 includes two simple pipeline circuits shown in Fig. 15, composed of forks and joins. All the combinational circuits between two consecutive pipeline stages are one of the ISCAS'85 circuits.

We implemented each combinational circuit using 45-nm Silvaco Open Cell Library [20] with Synopsys Design Compiler. Then, we extracted the maximum datapath delays between the pipeline stages using Synopsys PrimeTime for


one specific corner for simplicity.² The delay extraction can be easily extended to the case of multicorner multimode scenarios by repeating the whole process. After that, we used the extracted delays as the delays of pipeline stages and constructed asynchronous pipeline controllers according to those values. We used the IBM CPLEX optimizer [21] through the PICOS interface [22] for solving the LP formulation (6) in Section III-D and ILP formulations (8) and (9) in Section IV-D with $\epsilon = 10$ ps and $d_{margin} = 30$ ps. We also characterized the delays of timing paths with zero delay buffer, observed the timing behavior, and measured the amounts of the dynamic and leakage power consumption of the controllers using Synopsys FineSim.

Note that it is enough to run STA on asynchronous pipeline controllers for checking the timing constraints in Section III-C and IV-C. One implementation method is to prepare asynchronous pipeline controller circuits except for delay buffers as hard-IP blocks. With those blocks and timing constraints in Section III-C and IV-C, it could be possible to optimize the number of delay buffers using synthesis tools (e.g., Synopsys Design Compiler, Fusion Compiler, or Cadence SoC Encounter). In our experiment, we resolved all those violations using that approach. For considering realistic delay circuit implementation, we used the delay buffer cell having the weakest driving strength.

² In the experiments, the temperature is set to 25 °C. However, as the temperature goes up to 70 °C–80 °C while using deep submicrometer technology beyond 14 nm, the leakage power consumption could significantly increase.

Fig. 15. Pipelined circuits in Group-4 used in our experiments. Labels in the figure denote combinational circuits placed in between two consecutive pipeline stages. (a) ForkJoin1. (b) ForkJoin2.

Fig. 16. Changes of the minimum implementation area for DPU allocation for NSDC and SDC corresponding to C6288 by varying the target delay. The blue dots and red lines represent the modeling samples and the piecewise linear prediction, respectively. (a) DPU area for replacing NSDC (C6288). (b) DPU area for replacing SDC (C6288).

Fig. 17. DPU implementations for target delays $d_{target}$ of 5.38, 5.41, and 5.44 ns in Fig. 16(a). The red values indicate the number of delay buffers.

TABLE II. Runtime of the Data Samples Preparation Used in the Piecewise Linear Modeling Step

B. Piecewise Linear Modeling of Delay Path Unit Area

The red dots in Fig. 16 show the minimum-area changes of DPU for SDC and NSDC corresponding to circuit C6288 as the target delay varies, derived by solving (8) in Section IV-D. Fig. 16(a) and (b) show the DPU allocation results for NSDC and SDC, respectively, for which in experiments, we varied target delay $d_{target}$ with a step size of 0.01 ns. Note that for DPU replacing NSDC, the minimum holding time of input signal is decided by the cycle time of its pipeline stage, regardless of the delay generated by the DPU. On the other hand, for DPU replacing SDC, its holding time is limited by roughly half of the cycle time, and additional delay by the subsidiary cells makes this upper bound on minimum holding time tight. The blue lines in Fig. 16 indicate the piecewise linear modeling results obtained from the samples.

We can observe that the slope of the DPU area increase becomes gradually gentle as target delay increases [e.g., 0 ns ≤ $d_{target}$ ≤ 5.41 ns in Fig. 16(a)]. The reason is that we replaced the simple delay buffer chain (i.e., DPU-0) with DPUs with


deep layers (e.g., DPU-1, DPU-2, and DPU-3) that occupy less area. Meanwhile, even though the same type of DPU was used, the DPU area increases sharply for some target delays. Fig. 17 shows the implementations for target delays of 5.38, 5.41, and 5.44 ns in Fig. 16(a), and as can be seen from the figure, we could substitute a smaller number of delay buffers in deeper layers for many delay buffers in the other layers. However, as target delay exceeds some limit [e.g., 5.41 ns in Fig. 16(a)], delay buffer insertion in deeper layers violates Constraint 3, so the earlier rate of increase of the DPU area can no longer be maintained.

The discontinuity shown in Fig. 16 is mainly due to two factors: one is the modeling margin $d_{margin}$ in (8) for guaranteeing marginal performance degradation relative to the implementation in [14], and the other is the gap between rising and falling delays of one delay buffer. In our experiment, we assumed the same criterion for both rising and falling of the signal propagation and tightly set $d_{margin}$ for minimal performance penalty compared to the implementation in [14]. In addition, we observed that the difference between rising and falling delays of the delay circuit accumulated as the size of the delay circuit grew. Finally, after some target delay, the gap exceeded $d_{margin}$; consequently, our ILP solver cannot find a feasible solution for (8) that satisfies our strict constraints, which makes discontinuous transitions in Figs. 16(a) and (b).

Table II summarizes the runtime taken to prepare the data samples used in piecewise linear modeling for each combinational circuit in experiments. In the modeling, we used DPU-0 (i.e., a simple buffer chain), DPU-1, DPU-2, and DPU-3 in Fig. 10 and set the step size of a target delay to 0.01 ns. The preparation of all data samples was completed within two and a half minutes for every test case.

TABLE III. Comparison of Asynchronous Pipeline Controller Implementations Produced in [13], [14], [15], DPSyn, and DPSyn+ in Terms of the Number of Additional Logic Gates and Total Controller Area

C. Comparison of Power, Performance, and Area

We use DPSyn to refer to our controller synthesis technique in Section III that uses DPU-0. We also use DPSyn+ to refer to our synthesis technique in Section IV that uses DPU-1, DPU-2, DPU-3, as well as DPU-0.

Area: The column headed "Circuit" in Table III gives the names of the pipeline circuits, and the number in each parenthesis represents the number of pipeline stages of the randomly generated one. Note that the numbers in parentheses in the second big column show the number of additional XORs and NORs for DPSyn+. The ratios of the total area of controllers produced in [13] (MOUSETRAP), [15] (PP-1 and PP-2), DPSyn, and DPSyn+ to those in [14] are shown in the last big column, from which DPSyn and DPSyn+ use 29.7%–46.3% and 49.7%–59.4% smaller controllers on average for the pipeline circuits in Group-1, Group-2, and Group-4 than that in [14]. We observed that the area reduction effects of our DPSyn and DPSyn+ are similar in the comparison with the other implementations as well. On the other hand, since ours include one more latch than those of [14], the total controller area generated by DPSyn (and DPSyn+) is larger by 37.9% on average for simple FIFO designs in Group-3.

Performance: Cycle times and delays of pipeline circuits in Group-1, Group-2, and Group-4 slightly increase over those of [14] using DPSyn and DPSyn+. The changes are due to ensuring pessimism considering the gate delay models, but we can control it through $\epsilon$ in (6) and (9) and $d_{margin}$ in (8). With the setting of $\epsilon = 10$ ps and $d_{margin} = 30$ ps, the cycle time and delay of each pipeline stage increase only by 1.0%–1.2% and 1.1%–1.3% for DPSyn and by 1.1%–1.5% and 1.2%–1.7% for DPSyn+ on average for the pipeline circuits.


On the other hand, for the simple FIFO designs in Group-3, the cycle times and delays of DPSyn (and DPSyn+) are longer than those of [14] by 47.0% and 91.2% on average, respectively, as shown in Table IV, since timing paths in DPSyn (and DPSyn+) pass one additional latch. Therefore, from the observations, we can conclude that our proposed technique might not be one of the best solutions for the pipeline circuits having very short combinational delays such as FIFO buffers. For those designs, some specialized asynchronous pipeline controllers, such as MOUSETRAP [13], will be better solutions since the glitches in their datapaths are not that significant.

TABLE IV. Comparison of Asynchronous Pipeline Controller Implementations for FIFO Buffer Designs Produced in [13]–[15], DPSyn, and DPSyn+ in Terms of Cycle Time and Delay of Pipeline Circuits

TABLE V. Comparison of Asynchronous Pipeline Controller Implementations Produced in [13], [14], [15], DPSyn, and DPSyn+ in Terms of Leakage and Dynamic Power Consumptions. Note That All the Data Are Measured With the Nominal Voltage (1.2 V) Operation

Fig. 18. Changes in dynamic power consumption of the controller as the cost of extra latch per delay buffer increases. DPSyn uses a much lower cost than DPSyn+.

Power: From Table V, we can observe that DPSyn and DPSyn+ save the leakage power by 33.0%–45.6% and 43.3%–49.0% for the pipeline circuits in Group-1, Group-2, and Group-4, which is mainly caused by the reduced number of delay buffers. On the other hand, since the dynamic power depends on the number of internal and external transitions regardless of the total cell count, the power consumption remains almost the same for DPSyn. The slight increase is from one new latch inserted into each pipeline stage. However, for DPSyn+, the dynamic power increases by 10.4%–50.2% from the insertion of multiple new latches to DPUs for the same pipeline circuits. Fig. 18 clearly shows that the dynamic power consumption increases as the latch cost invested per delay buffer increases. Meanwhile, for the FIFO designs in Group-3, the leakage power consumption of the controllers produced by DPSyn (and DPSyn+) increases by 39.4% on average due to one additional latch per pipeline stage.
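Since the savings above come mostly from leakage while dynamic power stays roughly unchanged, the effect on total power depends on how large dynamic power is relative to leakage. The following sketch shows only that arithmetic; the function name and all input values are hypothetical and are not measurements from this work.

```python
def total_power_saving(dyn_to_leak_ratio, leak_saving, dyn_increase=0.0):
    """Fractional saving in total (dynamic + leakage) power.

    dyn_to_leak_ratio : baseline dynamic power divided by leakage power
    leak_saving       : fractional reduction of leakage power (0.4 = 40%)
    dyn_increase      : fractional increase of dynamic power, if any
    """
    r = dyn_to_leak_ratio
    # Normalize baseline leakage to 1, so dynamic is r and the total is 1 + r.
    saved = leak_saving - dyn_increase * r
    return saved / (1.0 + r)


# Hypothetical ratios: leakage matters more as the ratio shrinks.
for ratio in (12.0, 6.0, 2.0):
    saving = total_power_saving(ratio, leak_saving=0.4)
    print(f"dynamic/leakage = {ratio:4.1f} -> total saving = {saving:.1%}")
```

With the same leakage reduction, the total saving grows as the dynamic-to-leakage ratio shrinks, and a sufficiently large dynamic increase can turn the saving negative.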


TABLE VI. Comparison of Asynchronous Pipeline Controller Implementations Produced in [14] and DPSyn in Terms of Leakage and Dynamic Power Consumption for the Three Supply Voltage Regimes. All the Values Are Normalized to the Dynamic Power Consumption of [14] in the Same Supply Voltage Regime

Table VI shows the comparison between [14] and DPSyn in superthreshold voltage (super-$V_{th}$), near-threshold voltage (near-$V_{th}$), and subthreshold voltage (sub-$V_{th}$) regimes, in terms of leakage and dynamic power consumption. Since dynamic power consumption is the dominant factor of total energy efficiency over leakage power in the super-$V_{th}$ regime, the benefit of leakage power reduction from our proposed technique has a limited impact. On the other side, as the supply voltage decreases, the portion of leakage power among overall power consumption grows significantly. For example, we can see that the amount of dynamic power consumption in [14] is about 12.2×–14.7× larger than that of leakage power consumption for the pipeline circuits in Group-1, Group-2, and Group-4. However, this ratio drops to 6.0×–9.5× and 1.9×–2.7× in near-$V_{th}$ and sub-$V_{th}$ regimes, respectively. As a result, for those pipeline designs, the power consumption saving of DPSyn over [14] rises sharply from 0.5%–5.2% in the super-$V_{th}$ regime to 7.0%–14.8% and 16.8%–24.7% in near-$V_{th}$ and sub-$V_{th}$ regimes, respectively.

D. Comparison With Synchronous Style

TABLE VII. Comparison of Our DPSyn With Synchronous Implementations

Table VII compares the area and power consumption of the synchronous style and our DPSyn-based implementations (we normalized the area and power numbers to make the presentation in all tables consistent). We targeted the same timing behavior in datapaths for the same external input signal transitions. It is shown that the area of the pipelined designs produced by our DPSyn is smaller than that of the synchronous counterparts. It means that the area of the controller template used by DPSyn is smaller than the clock network area used by the synchronous circuit implementations. Since the reported data of the synchronous ones in Table VII do not include the clock generator circuits such as PLL, the actual area saving will be larger.

The power consumption of asynchronous designs depends on the activity of external input signals. Hence, we measured the power consumption based on the following three cases of input signal activity: 1) the highest activity; 2) 1/3× of the highest activity; and 3) 1/5× of the highest activity. For guaranteeing the correct operations for all the possible input signal behaviors, we also assumed the shortest clock period for the synchronous pipelined circuits. For case 1, i.e., when the inputs are very active, it diminishes the benefit of the event-driven execution on the asynchronous pipeline


circuits. As a result, DPSyn-based implementations consume more power than the synchronous circuit implementations. For cases 2 and 3, i.e., the low activity of input signals, it is shown that the power consumption by DPSyn significantly decreases due to the prevention of unnecessary switching in asynchronous controllers. In other words, our DPSyn is more energy-efficient as the input signal activity is sparse and irregular.

VI. CONCLUSION

This work addressed the synthesis problem of two-phase bundled-data asynchronous pipeline controllers. To lighten the pipeline controllers, we developed a new logic synthesis concept called delay path sharing and reusing, by which we could significantly reduce the amount of the costly delay buffers. Precisely, the following conditions hold: 1) we proposed a technique of synthesizing a pipeline controller in a way to share delay buffers among the setup timing paths for minimally allocating them and 2) we devised an area-efficient delay circuit structure called DPU by extending the delay path sharing and proposed an in-depth synthesis flow of an asynchronous pipeline controller using DPUs.

REFERENCES

[1] P. A. Beerel, R. O. Ozdag, and M. Ferretti, A Designer's Guide to Asynchronous VLSI. Cambridge, U.K.: Cambridge Univ. Press, 2010.
[2] N. C. Paver, "The design and implementation of an asynchronous microprocessor," Ph.D. dissertation, Dept. Comput. Sci., Univ. Manchester, Manchester, U.K., 1994.
[3] M. Ferretti and P. A. Beerel, "High performance asynchronous design using single-track full-buffer standard cells," IEEE J. Solid-State Circuits, vol. 41, no. 6, pp. 1444–1454, Jun. 2006.
[4] D. Hand et al., "Blade – A timing violation resilient asynchronous template," in Proc. 21st IEEE Int. Symp. Asynchronous Circuits Syst., May 2015, pp. 21–28.
[5] J. Simatic, A. Cherkaoui, F. Bertrand, R. P. Bastos, and L. Fesquet, "A practical framework for specification, verification, and design of self-timed pipelines," in Proc. 23rd IEEE Int. Symp. Asynchronous Circuits Syst. (ASYNC), May 2017, pp. 65–72.
[6] S. M. Nowick and M. Singh, "High-performance asynchronous pipelines: An overview," IEEE Des. Test. Comput., vol. 28, no. 5, pp. 8–22, Sep. 2011.
[7] I. Sutherland and S. Fairbanks, "GasP: A minimal FIFO control," in Proc. 7th Int. Symp. Asynchronous Circuits Syst. (ASYNC), 2001, pp. 46–53.
[8] M. Singh and S. M. Nowick, "The design of high-performance dynamic asynchronous pipelines: High-capacity style," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 15, no. 11, pp. 1270–1283, Nov. 2007.
[9] K. M. Fant and S. A. Brandt, "NULL convention logic: A complete and consistent logic for asynchronous digital circuit synthesis," in Proc. Int. Conf. Appl. Specific Syst., Archit. Processors, Aug. 1996, pp. 261–273.
[10] A. J. Martin and M. Nystrom, "Asynchronous techniques for system-on-chip design," Proc. IEEE, vol. 94, no. 6, pp. 1089–1120, Jun. 2006.
[11] Z. Xia, S. Ishihara, M. Hariyama, and M. Kameyama, "Dual-rail/single-rail hybrid logic design for high-performance asynchronous circuit," in Proc. IEEE Int. Symp. Circuits Syst., May 2012, pp. 3017–3020.
[12] I. E. Sutherland, "Micropipelines," Commun. ACM, vol. 32, no. 6, pp. 720–738, Jun. 1989.
[13] M. Singh and S. M. Nowick, "MOUSETRAP: high-speed transition-signaling asynchronous pipelines," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 15, no. 6, pp. 684–698, Jun. 2007.
[14] K.-H. Ho and Y.-W. Chang, "A new asynchronous pipeline template for power and performance optimization," in Proc. 51st Annu. Design Autom. Conf. (DAC), 2014, pp. 1–6.
[15] N. Van Toan, D. M. Tung, and J.-G. Lee, "Energy-efficient and high performance 2-phase asynchronous micropipelines," in Proc. IEEE 60th Int. Midwest Symp. Circuits Syst. (MWSCAS), Aug. 2017, pp. 1188–1191.
[16] J. Myers et al., "A 12.4pJ/cycle sub-threshold, 16pJ/cycle near-threshold ARM Cortex-M0+ MCU with autonomous SRPG/DVFS and temperature tracking clocks," in Proc. Symp. VLSI Circuits, Jun. 2017, pp. C332–C333.
[17] A. Kapoor, N. Jayakumar, and S. P. Khatri, "A novel clock distribution and dynamic de-skewing methodology," in Proc. IEEE/ACM Int. Conf. Comput. Aided Design (ICCAD), Nov. 2004, pp. 626–631.
[18] J. Heo and T. Kim, "Lightening asynchronous pipeline controller through resynthesis and optimization," in Proc. 25th Asia South Pacific Design Autom. Conf. (ASP-DAC), Jan. 2020, pp. 587–592.
[19] G. Gimenez, A. Cherkaoui, G. Cogniard, and L. Fesquet, "Static timing analysis of asynchronous bundled-data circuits," in Proc. 24th IEEE Int. Symp. Asynchronous Circuits Syst. (ASYNC), May 2018, pp. 110–118.
[20] (2011). Silvaco 45 nm Open Cell Library. [Online]. Available: https://silvaco.co.kr/products/nangate/FreePDK45_Open_Cell_Library/
[21] (2017). IBM Ilog Cplex Optimizer 12.8.0. [Online]. Available: https://www.ibm.com/analytics/cplex-optimizer
[22] G. Sagnol, "Picos, a python interface to conic optimization solvers," Zuse Inst. Berlin, Berlin, Germany, Tech. Rep. 12-48, 2012. [Online]. Available: http://picos.zib.de

Jeongwoo Heo received the B.S. and Ph.D. degrees in electrical and computer engineering from Seoul National University, Seoul, South Korea, in 2014 and 2020, respectively.
He is currently with Memory Business, Samsung Electronics Company Ltd., Gyeongi-do, Korea. His current research interests include timing analysis of asynchronous circuits and hardware performance monitoring methodology.

Taewhan Kim (Senior Member, IEEE) received the B.S. degree in computer science and statistics and the M.S. degree in computer science from Seoul National University, Seoul, South Korea, in 1985 and 1987, respectively, and the Ph.D. degree in computer science from the University of Illinois at Urbana–Champaign, Urbana, IL, USA, in 1993.
He is currently a Professor with the School of Electrical Engineering and Computer Science, Seoul National University. He has authored or coauthored over 300 technical articles in international journals and conferences. His current research interests include computer-aided design of integrated circuits ranging from architectural synthesis through physical designs, specifically focusing on logic and physical synthesis and automatic cell layout generation.
Dr. Kim is serving as an Associate Editor for Integration-VLSI Journal.
