Abstract— In this work, we address the synthesis problem of two-phase bundled-data asynchronous pipeline controllers, in which the insertion of buffers is essential for guaranteeing the correct handshaking operation on every pipeline stage at the expense of considerable area increase. To lighten the pipeline controllers, we introduce a new logic synthesis concept called delay path sharing and reusing, by which we can significantly reduce the amount of the costly delay buffers. Precisely, first, we propose a technique of synthesizing an asynchronous pipeline controller in a way to share delay buffers among setup timing paths on pipeline stages for minimally allocating total delay buffers. In addition, we devise an area-efficient delay circuit structure called delay path unit (DPU) by extending the proposed delay path sharing concept and propose an in-depth synthesis flow of an asynchronous pipeline controller using DPUs. Through experiments with benchmark circuits using a 45-nm cell library, it is shown that our techniques of synthesizing asynchronous pipeline controllers are able to reduce the controller area by up to 46.3%–59.4% and the leakage power by up to 33.0%–49.0% on average while retaining the same level of performance.

Index Terms— Asynchronous circuit, delay path, logic synthesis, pipeline controller, timing.

I. INTRODUCTION

AN ASYNCHRONOUS circuit is one of the attractive alternatives to the synchronous circuit design style. Contrary to the synchronization mechanism used in a global clock network in synchronous circuits, asynchronous circuits exploit handshaking protocol for the communication between circuit components, by which they consume less dynamic power and operate at a higher frequency than their synchronous counterparts in general [1], [2]. One of the barriers to the adoption of an asynchronous handshaking mechanism is a large area of a handshaking controller. One notable method to lower the barrier is employing handshaking controller templates [3]–[5], and those adapted to pipeline structure are widely used for high-performance applications [6] in particular.

For example, Sutherland and Fairbanks [7] customized their dynamic logic and inserted it for fast data transmission, and Singh and Nowick [8] proposed the pipeline controller template with full capacity storage. Later, Fant and Brandt [9], Martin and Nyström [10], and Xia et al. [11] suggested the methods of employing dynamic logic with dual outputs. However, their solutions require substantial care or experience to fit into industrial design flows. On the other hand, Sutherland [12] employed his custom latch (i.e., capture-pass) and a C-element in his pipeline controller template, and Singh and Nowick [13] devised MOUSETRAP, a high-performance pipeline controller template using a normally transparent latch. Recently, on top of [13], Ho and Chang [14] achieved a significant reduction in dynamic power consumption in datapath by blocking glitches in data transmission using a normally opaque latch. Toan et al. [15] slightly ameliorated its energy efficiency and performance using a C-element, but the underlying operation is identical to that in [14].

Meanwhile, delay variation among manufactured chips has grown a lot over the past years due to operation at low Vdd and enlargement of process variations. Fig. 1(a) shows the drastic increase of delay variation caused by Vdd scaling for 28-nm process technology, and the curves in Fig. 1(b) indicate that the maximum operating frequency can be increased by more than 10× when the temperature changes from 0 °C to 75 °C for the subthreshold voltage (sub-Vth) regime [16]. Various kinds of post-silicon tuning techniques have been proposed and used to contract this wide performance gap through adjusting delay tunable components on individual chips based on their physical properties. One representative method is clock skew tuning, which resolves timing violations by inserting delay circuits such as adjustable delay buffers (ADBs) to control local clock skews. One of the examples is a capacitor bank-based ADB implementation [17], and ARM recently used a long delay chain of ring oscillators called tunable delay stages (TDSs) for tracking temperature variation in the system [16].

For a bundled-data protocol-based asynchronous circuit, a receiver should not accept a request signal before arriving data signals from its sender. To ensure this, most of the previous studies focused on the optimization and logical operation of pipeline controller templates with the assumption of using these post-silicon tunable delay circuits [4], [14]. As a result, they required a large number of delay buffers

Manuscript received November 10, 2020; revised February 18, 2021 and March 20, 2021; accepted April 8, 2021. Date of publication June 2, 2021; date of current version June 29, 2021. This work was supported in part by Samsung Electronics Company, Ltd. under Projects IO201216-08205-01 and FOUNDARY-202010DD003F, in part by the National Research Foundation of Korea (NRF) Grant funded by the Korea Government (MEST) under Grant 2020R1A4A4079177 and Grant 2021R1A2C2008864, in part by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by Korea government (MSIT) under Grant (2021-0-00754, Software Systems for AI Semiconductor Design), and in part by the BK21 Four Program of the Education and Research Program for Future ICT Pioneers, Seoul National University in 2021. The EDA tool was supported by the IC Design Education. (Corresponding author: Taewhan Kim.)
Jeongwoo Heo is with Memory Business, Samsung Electronics Company, Ltd., Gyeonggi-do 18448, South Korea (e-mail: jw20.heo@samsung.com).
Taewhan Kim is with the Department of Electrical and Computer Engineering, Seoul National University, Seoul 08826, South Korea (e-mail: tkim@snucad.snu.ac.kr).
Color versions of one or more figures in this article are available at https://doi.org/10.1109/TVLSI.2021.3073383.
Digital Object Identifier 10.1109/TVLSI.2021.3073383
1063-8210 © 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: University of Prince Edward Island. Downloaded on July 04,2021 at 06:50:18 UTC from IEEE Xplore. Restrictions apply.
1438 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 29, NO. 7, JULY 2021
HEO AND KIM: REUSABLE DELAY PATH SYNTHESIS 1439
and blue dashed boxes reveals that our setup timing path is
working correctly for sending 1-state and 0-state of request
signals, respectively.
Fig. 8. Logic behavior of our controller template in Fig. 6. The parts in red color indicate the change of logic values.
TABLE I
DEFINITION OF NOTATIONS USED IN TIMING CONSTRAINTS
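As a rough illustration of why sharing delay buffers among setup timing paths reduces the total count, the following sketch searches integer buffer allocations for a toy three-stage pipeline. It is not the paper's LP. It assumes only that the setup path of stage i passes through the shared delay circuits $d^{i}_{\mathrm{SDC}}$ and $d^{(i+1)}_{\mathrm{SDC}}$ and the non-shared delay circuit $d^{i}_{\mathrm{NSDC}}$ (the three locations Section III-D names for stage i). The function name, the per-stage requirements, and the exhaustive search are hypothetical, and the hold, cycle-time, and delay constraints are ignored.

```python
from itertools import product


def min_shared_allocation(required, max_per_site):
    """Exhaustively find the smallest total buffer count (toy model).

    Assumption (not the paper's full LP): setup path i passes through
    d_sdc[i], d_nsdc[i], and d_sdc[i + 1], and must see at least
    required[i] buffers in total. All other constraints are ignored.
    """
    n = len(required)
    best_total, best_alloc = None, None
    counts = range(max_per_site + 1)
    # n + 1 shared sites (SDC) followed by n non-shared sites (NSDC).
    for alloc in product(counts, repeat=2 * n + 1):
        d_sdc, d_nsdc = alloc[: n + 1], alloc[n + 1:]
        feasible = all(
            d_sdc[i] + d_nsdc[i] + d_sdc[i + 1] >= required[i]
            for i in range(n)
        )
        total = sum(alloc)
        if feasible and (best_total is None or total < best_total):
            best_total, best_alloc = total, (d_sdc, d_nsdc)
    return best_total, best_alloc


# Hypothetical per-stage setup requirements, expressed in buffers.
required = [3, 4, 2]
total, (d_sdc, d_nsdc) = min_shared_allocation(required, max(required))
unshared_total = sum(required)  # one private buffer chain per setup path
print(total, unshared_total)
```

With these made-up requirements, the shared allocation needs 5 buffers where private chains would need 9, because a buffer placed in a shared site counts toward two adjacent setup paths.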
passes through the acknowledgment signal line, the local enable buffers, and the datapath in the current pipeline stage, and terminates at the data pin to the pipeline registers on the next pipeline stage at last. On the other hand, the capture path is quite short, which starts from the same POD, goes through the local enable buffers, and terminates at the enable pin to the pipeline registers on the next pipeline stage. Hence, we can reformulate the hold timing constraint in (3b) as

$$
\mathrm{delay}\bigl(en^{(i+1)} \to q^{(i+1)}_{f} \to ack^{(i+1)} \to w^{i}_{xorl} \to en^{i}\bigr) + D^{i}_{L} + T^{i}_{C} + D^{i}_{P} \ge D^{(i+1)}_{L} + T^{(i+1)}_{h}. \tag{5}
$$

D. Minimally Allocating Delay Buffers

Unlike the conventional asynchronous pipeline controller template, ours flexibly distributes delay buffers to multiple locations of every setup timing path while sharing some delay buffers among the setup timing paths. Consequently, it is possible to minimize the total number of delay buffers in an asynchronous pipeline controller while satisfying the template-level and the protocol-level timing constraints reformulated in Section III-C. Let $S = \{0, 1, 2, \ldots\}$ denote the set of all pipeline stage indexes. If we implement all delay circuits in an asynchronous pipeline controller using delay buffer chains, the propagation delay of the delay circuit will be linearly proportional to the number of delay buffers in it. As a result, we can formulate the equivalent problem of minimizing the total number of delay buffers into an LP as

$$
\begin{aligned}
\min\quad & \sum_{i \in S} \bigl( d^{i}_{\mathrm{SDC}} + d^{i}_{\mathrm{NSDC}} \bigr) \\
\text{s.t.}\quad & \text{Eqs. (1), (2), (4), (5)}, \quad \forall i \in S \\
& \mathrm{delay}\bigl( en^{i} \to qn^{i}_{f} \to w^{i}_{xoru} \to en^{i \to (i+1)} \to q^{i}_{r} \to req^{i} \to w^{(i+1)}_{xoru} \\
& \qquad \to en^{(i+1) \to (i+2)} \to en^{(i+1)} \bigr) \le \mathrm{Delay}^{i} + \epsilon, \quad \forall i \in S \\
& \mathrm{delay}\bigl( en^{i} \to qn^{i}_{f} \to w^{i}_{xoru} \to en^{i \to (i+1)} \to q^{i}_{r} \to req^{i} \to w^{(i+1)}_{xoru} \\
& \qquad \to en^{(i+1) \to (i+2)} \to en^{(i+1)} \to q^{(i+1)}_{f} \to ack^{(i+1)} \to w^{i}_{xorl} \to en^{i} \bigr) \le \mathrm{Cycle}^{i} + \epsilon, \quad \forall i \in S.
\end{aligned}
\tag{6}
$$

Note that $\epsilon$ is for avoiding the infeasibility of the formulation caused by the discrepancy in gate delay models. It is an empirically defined timing margin and is usually a small value in comparison to $\mathrm{Delay}^{i}$ and $\mathrm{Cycle}^{i}$. The first four constraints ensure the satisfaction of the template-level and the protocol-level timing constraints, and the last two inequalities prevent the performance degradation caused by the inappropriate distribution of delay buffers. For example, when one additional delay buffer is required to meet the setup timing constraint on pipeline stage $i$, we can distribute it to either $d^{i}_{\mathrm{SDC}}$, $d^{i}_{\mathrm{NSDC}}$, or $d^{(i+1)}_{\mathrm{SDC}}$. If all the timing constraints of pipeline stages $i-1$ and $i+1$ have already been satisfied, the allocation of the delay buffer to $d^{i}_{\mathrm{SDC}}$ or $d^{(i+1)}_{\mathrm{SDC}}$ will make the cycle time or delay of those pipeline stages unnecessarily longer and finally deteriorate the overall circuit performance. However, with the consideration of the last two inequalities, an LP optimizer is able to select $d^{i}_{\mathrm{NSDC}}$ for the delay buffer distribution in such a situation, thereby avoiding the unnecessary degradation of the circuit performance.

IV. IN-DEPTH PIPELINE CONTROLLER TEMPLATE SYNTHESIS WITH DELAY PATH REUSING

A. Synthesizing Delay Path Units

In Section III, we converted the asynchronous pipeline controller template in [14] to ours with sharable delay paths for minimizing the total number of delay buffers. In the same way, for reusing the delay buffers, we can apply this conversion technique, with slight modification, recursively to any delay buffer chains in SDCs and NSDCs in the controller as long as it saves area. We call the delay circuits produced by applying the technique to delay buffer chains DPUs. Fig. 10 shows three DPU types called DPU-1, DPU-2, and DPU-3, which are the delay circuit structures produced by applying the conversion technique once, twice, and three times recursively on delay buffer chains, respectively. For example, we can obtain DPU-2 by inserting DPU-1 instead of the delay buffer chain in between $\mathrm{NOR}^{\mathrm{DPU}}_{1}$ and $\mathrm{XOR}^{\mathrm{DPU}}_{1\text{-}u}$ (labeled as layer-1 delay buffers) in DPU-1 in Fig. 10. Similarly, it is possible to make DPU-3 by replacing the delay buffer chain located in between the uppermost XOR and NOR (labeled as layer-2 delay buffers) in DPU-2 with DPU-1. In this way, we can synthesize any arbitrary DPUs. Note that we use DPU-0 to denote a simple delay buffer chain.

The red path in Fig. 10 indicates the input signal flow in DPU-1, which passes the layer-1 delay buffers twice. Likewise, for DPU-2 and DPU-3, the timing paths of input signal propagation pass through the top-layer delay buffers up to $2^{2}$ times and $2^{3}$ times, respectively. By generalizing this, we can state that the timing path of the input signal propagation goes through the delay buffers in DPU-$k$ up to $2^{k}$ times. Thus, in comparison with a simple delay buffer chain, the number of delay buffers required in the corresponding DPU-$k$ is theoretically reducible up to $1/2^{k}$ of that of the simple delay buffer chain.

B. Validating Logical Correctness of Delay Path Units

Since the logical operation on each layer inside of DPU-$k$ is the same as that of DPU-1 containing one layer, we will discuss the logical correctness by using the circuit of DPU-1 in Fig. 10. DPU-1 has two latches ($\mathrm{LAT}^{\mathrm{DPU}}_{1\text{-}f}$ and $\mathrm{LAT}^{\mathrm{DPU}}_{1\text{-}r}$, marked as yellow rectangles): one placed next to the input signal $signal_{in}$ and the other next to the layer-0 delay buffers. Like the operation of our controller template explained in Section III, $signal_{in}$ has to go through the layer-1 delay buffers first and sets $en^{\mathrm{DPU}}_{1\text{-}f}$ to 1 to pass $\mathrm{LAT}^{\mathrm{DPU}}_{1\text{-}f}$. Meanwhile, $\mathrm{LAT}^{\mathrm{DPU}}_{1\text{-}r}$ blocks the signal $q^{\mathrm{DPU}}_{1\text{-}f}$ from reaching the layer-0 delay buffers until the latch enable signal $en^{\mathrm{DPU}}_{1\text{-}r}$ coming from $\mathrm{XOR}^{\mathrm{DPU}}_{1\text{-}u}$ through the layer-1 delay buffers becomes on. Note that $\mathrm{LAT}^{\mathrm{DPU}}_{1\text{-}f}$ is closed by resetting $en^{\mathrm{DPU}}_{1\text{-}f}$ to 0 through $\mathrm{XOR}^{\mathrm{DPU}}_{1\text{-}l}$, regardless of the number of layer-1 delay buffers. If
$w^{\mathrm{DPU}}_{1\text{-}xorl}$ changed to 0 before the delayed $w^{\mathrm{DPU}}_{1\text{-}xoru}$ (i.e., $en^{\mathrm{DPU}}_{1\text{-}r}$) is back to 1, $en^{\mathrm{DPU}}_{1\text{-}f}$ would become on once again, and as a
≤ (l−1) .
Fig. 13 describes this constraint by waveforms and signal transition relation. In Fig. 13(a), the transition of $signal_{in}$
Fig. 12. Logic behavior in DPU-1. The parts in red color indicate the change of logic values.
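The $2^{k}$ reuse argument of Section IV-A can be illustrated with a few lines of arithmetic. The sketch below is an idealized bound only: it assumes every buffer sits in the top layer and is traversed the full $2^{k}$ times, and it ignores the area of the latches, XORs, and NORs that each DPU layer adds. The function names and the 64-buffer chain are hypothetical.

```python
import math


def max_top_layer_passes(k):
    """Upper bound on how often the input signal traverses the
    top-layer delay buffers of DPU-k (DPU-0 is a plain buffer chain)."""
    if k < 0:
        raise ValueError("DPU depth must be non-negative")
    return 2 ** k


def ideal_buffer_lower_bound(chain_buffers, k):
    """Fewest buffers DPU-k could need to match a chain of
    `chain_buffers` buffers, if every buffer were reused 2**k times."""
    return math.ceil(chain_buffers / max_top_layer_passes(k))


# Hypothetical 64-buffer chain replaced by DPU-0 .. DPU-3.
for k in range(4):
    print(f"DPU-{k}: at least {ideal_buffer_lower_bound(64, k)} buffers")
```

In practice the saving is smaller than this bound, since buffers in lower layers are traversed fewer times and the added gates cost area, which is why the conversion is applied only as long as it saves area.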
Fig. 16. Changes of the minimum implementation area for DPU allocation for NSDC and SDC corresponding to C6288 by varying the target delay. The blue dots and red lines represent the modeling samples and the piecewise linear prediction, respectively. (a) DPU area for replacing NSDC (C6288). (b) DPU area for replacing SDC (C6288).
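A piecewise linear prediction of the kind plotted in Fig. 16 can be sketched as plain interpolation between neighboring samples. The code below is a minimal illustration, not the modeling procedure used in the experiments: the function name is hypothetical, and the area values attached to the 5.38, 5.41, and 5.44 ns target delays are made up for the example.

```python
from bisect import bisect_right


def piecewise_linear_area(samples, target_delay_ns):
    """Linearly interpolate area between sorted (delay_ns, area) samples."""
    delays = [d for d, _ in samples]
    if not delays[0] <= target_delay_ns <= delays[-1]:
        raise ValueError("target delay outside the sampled range")
    j = bisect_right(delays, target_delay_ns)
    if j == len(delays):  # target equals the last sampled delay
        return samples[-1][1]
    (d0, a0), (d1, a1) = samples[j - 1], samples[j]
    return a0 + (a1 - a0) * (target_delay_ns - d0) / (d1 - d0)


# Hypothetical samples: (target delay in ns, area in arbitrary units).
samples = [(5.38, 100.0), (5.41, 106.0), (5.44, 130.0)]
mid_area = piecewise_linear_area(samples, 5.395)
print(round(mid_area, 3))
```

The steeper second segment in the made-up data mimics the sharp area increase the text reports beyond some target delays. A query outside the sampled range raises an error instead of extrapolating.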
TABLE III
COMPARISON OF ASYNCHRONOUS PIPELINE CONTROLLER IMPLEMENTATIONS PRODUCED IN [13], [14], [15], DPSYN, AND DPSYN+ IN TERMS OF THE NUMBER OF ADDITIONAL LOGIC GATES AND TOTAL CONTROLLER AREA
deep layers (e.g., DPU-1, DPU-2, and DPU-3) that occupy less area. Meanwhile, even though the same type of DPU was used, the DPU area increases sharply for some target delays. Fig. 17 shows the implementations for target delays of 5.38, 5.41, and 5.44 ns in Fig. 16(a), and as can be seen from the figure, we could substitute a smaller number of delay buffers in deeper layers for many delay buffers in the remaining layers. However, as the target delay exceeds some limit [e.g., 5.41 ns in Fig. 16(a)], delay buffer insertion in deeper layers violates Constraint 3, so the earlier rate of increase of the DPU area can no longer be maintained.

The discontinuity shown in Fig. 16 is mainly due to two factors: one is the modeling margin $d_{margin}$ in (8) for guaranteeing only marginal performance degradation relative to the implementation in [14], and the other is the gap between rising and falling delays of one delay buffer. In our experiment, we assumed the same criterion for both rising and falling of the signal propagation and tightly set $d_{margin}$ for minimal performance penalty compared to the implementation in [14]. In addition, we observed that the difference between rising and falling delays of a delay circuit accumulated as the size of the delay circuit grew. Finally, after some target delay, the gap exceeded $d_{margin}$; consequently, our ILP solver cannot find a feasible solution for (8) that satisfies our strict constraints, which makes the discontinuous transitions in Figs. 16(a) and (b).

Table II summarizes the runtime taken to prepare the data samples used in piecewise linear modeling for each combinational circuit in experiments. In the modeling, we used DPU-0 (i.e., a simple buffer chain), DPU-1, DPU-2, and DPU-3 in Fig. 10 and set the step size of a target delay to 0.01 ns. The preparation of all data samples was completed within two and a half minutes for every test case.

C. Comparison of Power, Performance, and Area

We use DPSYN to refer to our controller synthesis technique in Section III that uses DPU-0. We also use DPSYN+ to refer to our synthesis technique in Section IV that uses DPU-1, DPU-2, and DPU-3, as well as DPU-0.

Area: The column headed “Circuit” in Table III lists the names of the pipeline circuits, and the number in each parenthesis represents the number of pipeline stages of the randomly generated one. Note that the numbers in parentheses in the second big column show the number of additional XORs and NORs for DPSYN+. The ratios of the total area of controllers produced in [13] (MOUSETRAP), [15] (PP-1 and PP-2), DPSYN, and DPSYN+ to those in [14] are shown in the last big column, which shows that DPSYN and DPSYN+ produce controllers that are 29.7%–46.3% and 49.7%–59.4% smaller on average, respectively, than those of [14] for the pipeline circuits in GROUP-1, GROUP-2, and GROUP-4. We observed that the area reduction effects of our DPSYN and DPSYN+ are similar in comparison with the other implementations as well. On the other hand, since ours include one more latch than those of [14], the total controller area generated by DPSYN (and DPSYN+) is larger by 37.9% on average for simple FIFO designs in GROUP-3.

Performance: Cycle times and delays of pipeline circuits in GROUP-1, GROUP-2, and GROUP-4 slightly increase over those of [14] using DPSYN and DPSYN+. The changes are due to ensuring pessimism considering the gate delay models, but we can control it through $\epsilon$ in (6) and (9) and $d_{margin}$ in (8). With the setting of $\epsilon$ = 10 ps and $d_{margin}$ = 30 ps, the cycle time and delay of each pipeline stage increase only by 1.0%–1.2% and 1.1%–1.3% for DPSYN and by 1.1%–1.5% and 1.2%–1.7% for DPSYN+ on average for the pipeline circuits.
TABLE IV
COMPARISON OF ASYNCHRONOUS PIPELINE CONTROLLER IMPLEMENTATIONS FOR FIFO BUFFER DESIGNS PRODUCED IN [13]–[15], DPSYN, AND DPSYN+ IN TERMS OF CYCLE TIME AND DELAY OF PIPELINE CIRCUITS

TABLE V
COMPARISON OF ASYNCHRONOUS PIPELINE CONTROLLER IMPLEMENTATIONS PRODUCED IN [13], [14], [15], DPSYN, AND DPSYN+ IN TERMS OF LEAKAGE AND DYNAMIC POWER CONSUMPTIONS. NOTE THAT ALL THE DATA ARE MEASURED WITH THE NOMINAL VOLTAGE (1.2 V) OPERATION
On the other hand, for the simple FIFO designs in GROUP-3, the cycle times and delays of DPSYN (and DPSYN+) are longer than those of [14] by 47.0% and 91.2% on average, respectively, as shown in Table IV, since timing paths in DPSYN (and DPSYN+) pass one additional latch. Therefore, from the observations, we can conclude that our proposed technique might not be one of the best solutions for the pipeline circuits having very short combinational delays such as FIFO buffers. For those designs, some specialized asynchronous pipeline controllers, such as MOUSETRAP [13], will be better solutions since the glitches in their datapaths are not that significant.

Fig. 18. Changes in dynamic power consumption of the controller as the cost of extra latch per delay buffer increases. DPSYN uses a much lower cost than DPSYN+.

Power: From Table V, we can observe that DPSYN and DPSYN+ save the leakage power by 33.0%–45.6% and 43.3%–49.0% for the pipeline circuits in GROUP-1, GROUP-2, and GROUP-4, which is mainly caused by the reduced number of delay buffers. On the other hand, since the dynamic power depends on the number of internal and external transitions regardless of the total cell count, the power consumption remains almost the same for DPSYN. The slight increase is from one new latch inserted into each pipeline stage. However, for DPSYN+, the dynamic power increases by 10.4%–50.2% from the insertion of multiple new latches to DPUs for the same pipeline circuits. Fig. 18 clearly shows that the dynamic power consumption increases as the latch cost invested per delay buffer increases. Meanwhile, for the FIFO designs in GROUP-3, the leakage power consumption of the controllers produced by DPSYN (and DPSYN+) increases by 39.4% on average due to one additional latch per pipeline stage.
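How a leakage saving and a dynamic power change combine into a total power saving depends on how strongly dynamic power dominates. The following sketch works through that arithmetic under simple assumptions: total power is the sum of leakage and dynamic power, and the baseline is normalized to one unit of leakage. The function name and all numeric inputs are hypothetical and are not taken from the tables.

```python
def total_power_saving(leak_saving, dyn_increase, dyn_to_leak_ratio):
    """Fractional total-power saving relative to a baseline controller.

    leak_saving:       fraction of leakage power removed (0.4 = 40%)
    dyn_increase:      fractional growth of dynamic power (0.1 = +10%)
    dyn_to_leak_ratio: baseline dynamic power divided by baseline leakage
    """
    leak_before = 1.0                      # normalize baseline leakage
    dyn_before = dyn_to_leak_ratio
    before = leak_before + dyn_before
    after = leak_before * (1.0 - leak_saving) + dyn_before * (1.0 + dyn_increase)
    return (before - after) / before


# Hypothetical inputs: 40% leakage saving, unchanged dynamic power.
dominant_dynamic = total_power_saving(0.40, 0.0, 10.0)   # dynamic = 10x leakage
balanced = total_power_saving(0.40, 0.0, 2.0)            # dynamic = 2x leakage
# Same leakage saving, but dynamic power grows by 10%.
with_latch_cost = total_power_saving(0.40, 0.10, 10.0)
print(dominant_dynamic, balanced, with_latch_cost)
```

Under these made-up inputs, the same leakage saving yields a few percent of total saving when dynamic power dominates and a much larger share when the two are comparable, while a modest rise in dynamic power can turn the net result negative.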
TABLE VI
COMPARISON OF ASYNCHRONOUS PIPELINE CONTROLLER IMPLEMENTATIONS PRODUCED IN [14] AND DPSYN IN TERMS OF LEAKAGE AND DYNAMIC POWER CONSUMPTION FOR THE THREE SUPPLY VOLTAGE REGIMES. ALL THE VALUES ARE NORMALIZED TO THE DYNAMIC POWER CONSUMPTION OF [14] IN THE SAME SUPPLY VOLTAGE REGIME
Table VI shows the comparison between [14] and DPSYN in superthreshold voltage (super-Vth), near-threshold voltage (near-Vth), and subthreshold voltage (sub-Vth) regimes, in terms of leakage and dynamic power consumption. Since dynamic power consumption is the dominant factor of total energy efficiency over leakage power in the super-Vth regime, the benefit of leakage power reduction from our proposed technique has a limited impact. On the other hand, as the supply voltage decreases, the portion of leakage power among overall power consumption grows significantly. For example, we can see that the amount of dynamic power consumption in [14] is about 12.2×–14.7× larger than that of leakage power consumption for the pipeline circuits in GROUP-1, GROUP-2, and GROUP-4. However, this ratio drops to 6.0×–9.5× and 1.9×–2.7× in near-Vth and sub-Vth regimes, respectively. As a result, for those pipeline designs, the power consumption saving of DPSYN over [14] rises sharply from 0.5%–5.2% in the super-Vth regime to 7.0%–14.8% and 16.8%–24.7% in near-Vth and sub-Vth regimes, respectively.

D. Comparison With Synchronous Style

Table VII compares the area and power consumption of the synchronous style and our DPSYN-based implementations (we normalized the area and power numbers to make the presentation in all tables consistent). We targeted the same timing behavior in datapaths for the same external input signal transitions. It is shown that the area of the pipelined designs produced by our DPSYN is smaller than that of the synchronous counterparts. It means that the area of the controller template used by DPSYN is smaller than the clock network area used by the synchronous circuit implementations. Since the reported data of the synchronous ones in Table VII do not include clock generator circuits such as a PLL, the actual area saving will be even larger.

TABLE VII
COMPARISON OF OUR DPSYN WITH SYNCHRONOUS IMPLEMENTATIONS

The power consumption of asynchronous designs depends on the activity of external input signals. Hence, we measured the power consumption based on the following three cases of input signal activity: 1) the highest activity; 2) 1/3× of the highest activity; and 3) 1/5× of the highest activity. For guaranteeing the correct operations for all the possible input signal behaviors, we also assumed the shortest clock period for the synchronous pipelined circuits. For case 1, i.e., when the inputs are very active, it diminishes the benefit of the event-driven execution on the asynchronous pipeline
circuits. As a result, DPSYN-based implementations consume more power than the synchronous circuit implementations. For cases 2 and 3, i.e., the low activity of input signals, it is shown that the power consumption by DPSYN significantly decreases due to the prevention of unnecessary switching in asynchronous controllers. In other words, our DPSYN is more energy-efficient as the input signal activity is sparse and irregular.

VI. CONCLUSION

This work addressed the synthesis problem of two-phase bundled-data asynchronous pipeline controllers. To lighten the pipeline controllers, we developed a new logic synthesis concept called delay path sharing and reusing, by which we could significantly reduce the amount of the costly delay buffers. Precisely, the following conditions hold: 1) we proposed a technique of synthesizing a pipeline controller in a way to share delay buffers among the setup timing paths for minimally allocating them and 2) we devised an area-efficient delay circuit structure called DPU by extending the delay path sharing and proposed an in-depth synthesis flow of an asynchronous pipeline controller using DPUs.

REFERENCES

[1] P. A. Beerel, R. O. Ozdag, and M. Ferretti, A Designer’s Guide to Asynchronous VLSI. Cambridge, U.K.: Cambridge Univ. Press, 2010.
[2] N. C. Paver, “The design and implementation of an asynchronous microprocessor,” Ph.D. dissertation, Dept. Comput. Sci., Univ. Manchester, Manchester, U.K., 1994.
[3] M. Ferretti and P. A. Beerel, “High performance asynchronous design using single-track full-buffer standard cells,” IEEE J. Solid-State Circuits, vol. 41, no. 6, pp. 1444–1454, Jun. 2006.
[4] D. Hand et al., “Blade—A timing violation resilient asynchronous template,” in Proc. 21st IEEE Int. Symp. Asynchronous Circuits Syst., May 2015, pp. 21–28.
[5] J. Simatic, A. Cherkaoui, F. Bertrand, R. P. Bastos, and L. Fesquet, “A practical framework for specification, verification, and design of self-timed pipelines,” in Proc. 23rd IEEE Int. Symp. Asynchronous Circuits Syst. (ASYNC), May 2017, pp. 65–72.
[6] S. M. Nowick and M. Singh, “High-performance asynchronous pipelines: An overview,” IEEE Des. Test. Comput., vol. 28, no. 5, pp. 8–22, Sep. 2011.
[7] I. Sutherland and S. Fairbanks, “GasP: A minimal FIFO control,” in Proc. 7th Int. Symp. Asynchronous Circuits Syst. (ASYNC), 2001, pp. 46–53.
[8] M. Singh and S. M. Nowick, “The design of high-performance dynamic asynchronous pipelines: High-capacity style,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 15, no. 11, pp. 1270–1283, Nov. 2007.
[9] K. M. Fant and S. A. Brandt, “NULL convention logic: A complete and consistent logic for asynchronous digital circuit synthesis,” in Proc. Int. Conf. Appl. Specific Syst., Archit. Processors, Aug. 1996, pp. 261–273.
[10] A. J. Martin and M. Nystrom, “Asynchronous techniques for system-on-chip design,” Proc. IEEE, vol. 94, no. 6, pp. 1089–1120, Jun. 2006.
[11] Z. Xia, S. Ishihara, M. Hariyama, and M. Kameyama, “Dual-rail/single-rail hybrid logic design for high-performance asynchronous circuit,” in Proc. IEEE Int. Symp. Circuits Syst., May 2012, pp. 3017–3020.
[12] I. E. Sutherland, “Micropipelines,” Commun. ACM, vol. 32, no. 6, pp. 720–738, Jun. 1989.
[13] M. Singh and S. M. Nowick, “MOUSETRAP: High-speed transition-signaling asynchronous pipelines,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 15, no. 6, pp. 684–698, Jun. 2007.
[14] K.-H. Ho and Y.-W. Chang, “A new asynchronous pipeline template for power and performance optimization,” in Proc. 51st Annu. Design Autom. Conf. (DAC), 2014, pp. 1–6.
[15] N. Van Toan, D. M. Tung, and J.-G. Lee, “Energy-efficient and high performance 2-phase asynchronous micropipelines,” in Proc. IEEE 60th Int. Midwest Symp. Circuits Syst. (MWSCAS), Aug. 2017, pp. 1188–1191.
[16] J. Myers et al., “A 12.4pJ/cycle sub-threshold, 16pJ/cycle near-threshold ARM Cortex-M0+ MCU with autonomous SRPG/DVFS and temperature tracking clocks,” in Proc. Symp. VLSI Circuits, Jun. 2017, pp. C332–C333.
[17] A. Kapoor, N. Jayakumar, and S. P. Khatri, “A novel clock distribution and dynamic de-skewing methodology,” in Proc. IEEE/ACM Int. Conf. Comput. Aided Design (ICCAD), Nov. 2004, pp. 626–631.
[18] J. Heo and T. Kim, “Lightening asynchronous pipeline controller through resynthesis and optimization,” in Proc. 25th Asia South Pacific Design Autom. Conf. (ASP-DAC), Jan. 2020, pp. 587–592.
[19] G. Gimenez, A. Cherkaoui, G. Cogniard, and L. Fesquet, “Static timing analysis of asynchronous bundled-data circuits,” in Proc. 24th IEEE Int. Symp. Asynchronous Circuits Syst. (ASYNC), May 2018, pp. 110–118.
[20] (2011). Silvaco 45 nm Open Cell Library. [Online]. Available: https://silvaco.co.kr/products/nangate/FreePDK45_Open_Cell_Library/
[21] (2017). IBM Ilog Cplex Optimizer 12.8.0. [Online]. Available: https://www.ibm.com/analytics/cplex-optimizer
[22] G. Sagnol, “Picos, a python interface to conic optimization solvers,” Zuse Inst. Berlin, Berlin, Germany, Tech. Rep. 12-48, 2012. [Online]. Available: http://picos.zib.de

Jeongwoo Heo received the B.S. and Ph.D. degrees in electrical and computer engineering from Seoul National University, Seoul, South Korea, in 2014 and 2020, respectively.
He is currently with Memory Business, Samsung Electronics Company Ltd., Gyeonggi-do, Korea. His current research interests include timing analysis of asynchronous circuits and hardware performance monitoring methodology.

Taewhan Kim (Senior Member, IEEE) received the B.S. degree in computer science and statistics and the M.S. degree in computer science from Seoul National University, Seoul, South Korea, in 1985 and 1987, respectively, and the Ph.D. degree in computer science from the University of Illinois at Urbana–Champaign, Urbana, IL, USA, in 1993.
He is currently a Professor with the School of Electrical Engineering and Computer Science, Seoul National University. He has authored or coauthored over 300 technical articles in international journals and conferences. His current research interests include computer-aided design of integrated circuits ranging from architectural synthesis through physical designs, specifically focusing on logic and physical synthesis and automatic cell layout generation.
Dr. Kim is serving as an Associate Editor for Integration-VLSI Journal.