IEEE TRANSACTIONS ON APPLIED SUPERCONDUCTIVITY, VOL. 30, NO. 7, OCTOBER 2020 1302706

Design of an 8-bit Bit-Parallel RSFQ Microprocessor


Pei-Yao Qu, Guang-Ming Tang, Jia-Hong Yang, Xiao-Chun Ye, Dong-Rui Fan, Zhi-Min Zhang, and Ning-Hui Sun

Abstract—With Moore’s law approaching its physical limits, low-temperature computing technology is ushering in unprecedented development opportunities. Rapid single-flux-quantum (RSFQ) circuit technology is currently the most mature superconducting integrated circuit technology. Based on the current fabrication process, we propose an 8-bit bit-parallel RSFQ microprocessor. The proposed microprocessor processes 8-bit data each clock cycle. Ten different instructions are executed. The microprocessor mainly consists of an on-chip instruction memory, two data registers, an instruction decoder, an 8-bit bit-parallel arithmetic logic unit, and a program counter. The microprocessor contains 7702 JJs (based on the Open Dataset of CONNECT Cell Library for AIST ADP2) without considering splitters, Josephson transmission lines, and passive transmission lines. We perform a logic-level simulation of the proposed microprocessor. The simulation results show correct operation with a target frequency of 16.7 GHz.

Index Terms—Digital circuit, rapid single-flux-quantum (RSFQ), superconducting microprocessor.

Manuscript received December 29, 2019; revised May 26, 2020 and August 13, 2020; accepted August 15, 2020. Date of publication August 18, 2020; date of current version August 28, 2020. This work was supported in part by the Innovation Program of Institute of Computing Technology, Chinese Academy of Sciences under Grant 20186120, in part by the Strategic Priority Research Program of Chinese Academy of Sciences under Grant XDA18000000, and in part by the National Natural Science Foundation of China under Grant 61732018. This article was recommended by Associate Editor N. Yoshikawa. (Corresponding author: Guang-Ming Tang.)
Pei-Yao Qu and Dong-Rui Fan are with the State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China, and also with the University of Chinese Academy of Sciences, Beijing 100049, China.
Guang-Ming Tang, Jia-Hong Yang, Xiao-Chun Ye, Zhi-Min Zhang, and Ning-Hui Sun are with the State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China (e-mail: tangguangming@ict.ac.cn).
Color versions of one or more of the figures in this article are available online at https://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TASC.2020.3017527

I. INTRODUCTION

ENERGY-EFFICIENT computing technology is required by many large-scale computing systems. Superconducting computing may be able to serve the needs of these systems significantly better than conventional technology [1]. Rapid single-flux-quantum (RSFQ) circuit technology [2] is currently the most mature superconducting integrated circuit technology, with clock frequencies of up to tens of gigahertz and energy dissipation as low as several milliwatts. Several RSFQ derivatives have been developed, such as ERSFQ [3], LR-RSFQ [4], and RQL [5]. We propose an 8-bit bit-parallel RSFQ microprocessor based on the Open Dataset of CONNECT Cell Library for AIST ADP2.

So far, several RSFQ microprocessors have been developed. An 8-bit bit-serial microprocessor, FLUX, has been designed and fabricated [6], [7]. The CORE series of 8-bit bit-serial microprocessors [8]–[11] has been designed, fabricated, and demonstrated. A bit-serial asynchronous microprocessor (SCRAM2) [12] and a bit-serial 8-bit microprocessor (CORE e4) [13] have been designed. A prototype of a 32-bit superconducting microprocessor based on a 4-bit bit-slice architecture was presented at ASC 2016 [14].

In addition, some RSFQ datapath circuits have been designed. The arithmetic logic unit (ALU) is the most important component of a microprocessor. An 8-bit parallel ALU was developed and demonstrated successfully in low-frequency tests [15] and high-frequency tests [16]. An 8-bit parallel ALU [17] with a target frequency of 30 GHz was developed. A serial ALU [18] with a frequency of 80 GHz was developed. A 4-bit bit-slice ALU [19] for a 32-bit RSFQ microprocessor was demonstrated successfully. A logic design of a 16-bit bit-slice ALU [20] for 32-/64-bit RSFQ microprocessors was proposed.

In this article, we propose an 8-bit bit-parallel RSFQ microprocessor. The proposed microprocessor processes 8-bit data in one clock cycle. Its throughput is higher than that of the bit-serial and bit-slice processing in previous work. Ten instructions are executed: addition (ADD), immediate addition (ADDI), input (IN), output (OUT), load immediate (LOADI), shift right logical (SRL), data movement (MOV), PC-relative branch (JB), no operation (NOP), and stop operation (HALT).

The remainder of this article is organized as follows. Section II describes the microarchitectural and logic design of the 8-bit bit-parallel microprocessor. Section III shows the simulation results. Section IV evaluates the results of our design. Finally, Section V concludes this article.

II. 8-bit BIT-PARALLEL RSFQ MICROPROCESSOR

The proposed microprocessor is based on the Harvard architecture. It includes an on-chip instruction memory (IM) and two data registers. The IM and the data registers are addressed by a 4-bit address and a 1-bit address, respectively. It can be implemented using present-day fabrication processes. The proposed microprocessor can execute seven different types of instructions.

A. Instruction Set

Table I lists the instruction set of the proposed microprocessor. “D” and “S” represent the destination register and source register

1051-8223 © 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.

Authorized licensed use limited to: Universidad de Ingeniería y Tecnología (UTEC). Downloaded on October 05,2020 at 19:40:27 UTC from IEEE Xplore. Restrictions apply.
TABLE I
INSTRUCTION SET

Fig. 1. Microarchitecture of the proposed microprocessor (the red lines show the datapath of IN and OUT instructions).

number, respectively. “I” represents the immediate data. “A” represents the immediate address.

The instruction set includes seven different types of instructions: ALU instructions (ADD, MOV), immediate instructions (ADDI, LOADI), a conditional branch (JB), a shifter instruction (SRL), data I/O instructions (IN, OUT), a no-operation instruction (NOP), and a halt instruction (HALT). The immediate instructions take the four least significant bits of the instruction code, sign-extended, as the immediate data. The JB instruction takes the four least significant bits of the instruction code, zero-extended, as the immediate address. In addition, the NOP instruction is implemented by moving the data from R0 to R0 and is used to generate a bubble between two dependent instructions. The HALT instruction is implemented by looping an operation that moves the data from R1 to R1, which halts the microprocessor until an external signal starts it again.

B. Microarchitecture

Fig. 1 shows the microarchitecture of the proposed microprocessor. We only consider using on-chip memory and registers to store instructions and data. The instructions and data are loaded by hand. The main circuit components are six multiplexers (MUXs), a 4-bit program counter (PC), an instruction decoder (ID), an on-chip 8-bit IM, two 8-bit data registers (DRs), a control signal buffer, an 8-bit bit-parallel ALU, and I/O port registers.

To illustrate how the microprocessor executes instructions, we take the IN and OUT instructions as an example. The red lines in Fig. 1 show the datapath for executing the IN and OUT instructions. At the first step, the PC sends a 4-bit address to the IRs. At the second step, the IN instruction is read out to the ID according to the address; meanwhile, the PC generates the next address to fetch the OUT instruction. At the third step, the ID sends the control signals M1 and STORE_R1 to the buffers; meanwhile, the OUT instruction is read out to the ID. At the fourth step, MUX1 sends the input data to R1 according to M1; meanwhile, the ID sends the control signals M2 and OUT_OP to the buffers. At the fifth step, R1 stores the input data according to STORE_R1 and sends it to MUX2. At the sixth step, MUX2 sends the data to the output register according to M2. Finally, according to OUT_OP, the data are output externally. The other instructions are executed in the same way according to the corresponding control signals.

C. Logic Design

We have designed the logic circuits of all the components in the proposed microprocessor. The following cells from the Open Dataset of CONNECT Cell Library for AIST ADP2 are used: XOR gate, AND gate, NOT gate, D flip-flop (DFF), dual-port D flip-flop (D2FF), resettable D flip-flop (RDFF), confluence buffer (CB), and nondestructive readout (NDRO). The dataset includes the delays, timing constraints, number of JJs, area units, and power consumption of the cells. Fully synchronous timing is adopted in the microprocessor, which means that there is only one clock source.

TABLE II
DECODING TRUTH TABLE

1) Arithmetic Logic Unit: The ALU is the main component of the datapath of the proposed microprocessor. It is used for the implementation of the ADD, ADDI, LOADI, and SRL instructions. The instruction set and the logic circuit design of the ALU are defined in [21]. The ALU contains 1180 JJs and 6 gate-level pipeline stages.
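As an illustration of what the four ALU instructions compute, the following is a minimal functional sketch in Python. It is not the gate-level design from [21]: the function names, the one-bit shift amount for SRL (inferred from the simulation example, where 0011 0011 becomes 0001 1001), and the wrap-around on overflow are all assumptions made for this sketch.

```python
MASK8 = 0xFF  # the datapath and registers are 8 bits wide


def sign_extend4(imm4: int) -> int:
    """Sign-extend a 4-bit immediate to 8 bits (used by ADDI and LOADI)."""
    imm4 &= 0xF
    return (imm4 | 0xF0) if (imm4 & 0x8) else imm4


def zero_extend4(addr4: int) -> int:
    """Zero-extend a 4-bit field to 8 bits (used by JB for the address)."""
    return addr4 & 0xF


def alu(op: str, a: int = 0, b: int = 0, imm4: int = 0) -> int:
    """Hypothetical functional model of the four ALU instructions.

    Assumptions: results wrap modulo 256 and SRL shifts by one bit.
    """
    if op == "ADD":
        return (a + b) & MASK8
    if op == "ADDI":
        return (a + sign_extend4(imm4)) & MASK8
    if op == "LOADI":
        return sign_extend4(imm4)
    if op == "SRL":
        return (a & MASK8) >> 1
    raise ValueError(f"not an ALU instruction: {op}")
```

With the operand values used in the simulation section, `alu("SRL", a=0b00110011)` gives `0b00011001`, `alu("LOADI", imm4=0b0110)` gives `0b00000110`, and adding those two results gives `0b00011111`.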
2) Multiplexers: We use the MUXs as arbitration circuits in the microprocessor to release the required data. We have designed synchronous and asynchronous MUXs. A synchronous MUX consists of 17 DFFs, a NOT gate, 16 AND gates, and 8 CBs, for a total of 409 JJs in two gate-level pipeline stages. An asynchronous MUX consists of a DFF, a NOT gate, 16 RDFFs, and 8 CBs, for a total of 233 JJs in only one gate-level pipeline stage. To reduce the number of gate-level pipeline stages and the hardware cost, the asynchronous timing method is adopted in the MUXs.
3) Instruction Memory and Instruction Decoder: Table II lists the decoding truth table of the ID. I0–I7 represent the 8-bit instruction code. Z0 and Z1 indicate whether the data from R0 and R1, respectively, equal 0. Mn (0 < n < 7) represents the control signal of MUXn. STORE_R0 and STORE_R1 control the operations of data registers R0 and R1, respectively. OUT_OP is the control signal of the output port register. The FINISH signal controls whether the HALT instruction is executed.
Considering the fabrication process and the subsequent physical layout, we adopt the simplest cells to implement the ID, as shown in Fig. 2. It consists of 7 NOT gates, 22 AND gates, 2 XOR gates, 12 CBs, and 38 DFFs, for a total of 719 JJs in 3 gate-level pipeline stages. The on-chip IM is realized by 128 NDROs and 12 CBs, which contain 2248 JJs. The delay for reading an instruction from the on-chip IM is the main factor limiting the minimum clock cycle.
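The IM size quoted above can be cross-checked with a little arithmetic. The sketch below assumes that each NDRO cell stores exactly one instruction bit, which the text does not state explicitly; the function names are hypothetical.

```python
ADDRESS_BITS = 4      # IM address width given in Section II
INSTRUCTION_BITS = 8  # instruction code width (I0-I7)


def im_entries(address_bits: int) -> int:
    """Number of instructions addressable with the given address width."""
    return 2 ** address_bits


def im_storage_bits(address_bits: int, instruction_bits: int) -> int:
    """Total storage bits, assuming one NDRO cell per stored bit."""
    return im_entries(address_bits) * instruction_bits
```

Under that assumption, a 4-bit address selects one of 16 instructions, and 16 instructions of 8 bits each need 128 storage cells, which matches the 128 NDROs reported for the IM.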
Fig. 2. Logic design circuit of the ID.

4) I/O Port Registers: Input and output in the proposed microprocessor are achieved by simple port registers. If 8-bit data are presented to the input port by an external device, the data are then moved into the microprocessor to be processed. Conversely, the data from the output port register can be displayed externally. The I/O port registers contain 208 JJs in total.

III. SIMULATION

We have performed logic-level simulation of the designed microprocessor with Icarus Verilog [22] and GTKWave software [23]. We used the Verilog hardware description language [24] for the logic-level simulation, based on the Open Dataset of CONNECT Cell Library for AIST ADP2. The simulation results show that the proposed microprocessor operates correctly at the target frequency of 16.7 GHz. As mentioned earlier, the delay for reading an instruction from the on-chip IM limits the clock cycle. From the results, the delay of the on-chip IM is 46.8 ps. According to the data from the Open Dataset of CONNECT Cell Library for AIST ADP2, the maximum setup time of the cells we used is 7.4 ps, and the maximum hold time of the cells we used is 4.1 ps. Thus, we set the clock period to 60 ps. The simulation results of the datapath circuits for the proposed microprocessor are shown in [21].

To verify the functionality of the whole design, we have defined a testbench for the proposed microprocessor, as shown in Table III. The testbench includes all instruction types. In Table III, “NC” means “not necessary” and “……” means that the instructions between the JB and OUT instructions are omitted. In the design, we use software bubbles to avoid data and structural hazards. Because the latency for an instruction to execute completely is twice the time the PC needs to generate an address, a NOP instruction is inserted between two dependent instructions and before or after a JB instruction to ensure the correctness of program execution. This method reduces the number of IM entries available for useful instructions, but it can reduce the hardware cost to some extent.

TABLE III
TESTBENCH OF THE PROPOSED MICROPROCESSOR

We have successfully obtained correct results for the proposed microprocessor. Fig. 3 shows the simulation results of the microprocessor. In the figure, the first trace is the clock signal. The next eight traces are the instructions read from the on-chip IM according to the corresponding address. The next eight traces are the 8-bit input data from outside. The next eight traces are the 8-bit output data, which represent the corresponding output of the microprocessor. The nine traces at the bottom are the control signals of the microprocessor, which determine the corresponding operation based on the instructions.

Fig. 3. Simulation results of the proposed microprocessor.

According to the test vectors in Fig. 3, at 541.9 ps, the IN instruction is read from the IM. After 12 clock cycles (60 ps per clock cycle), control signal M1 = 1 and the 8-bit data (0011 0011) are input into the microprocessor. At the next clock cycle, control signal STORE_R1 = 1 and 0011 0011 is stored in data register R1. At 1981.9 ps, the SRL instruction is read from the IM, and the control signals M4 and STORE_R0 are set to 1. After 13 clock cycles, the right-shifted result 0001 1001 is stored in R0. At 3421.9 ps, the LOADI instruction is read from the IM. With M3, M5, and STORE_R1 = 1, the immediate 0000 0110 is loaded into R1 after 13 clock cycles. At 4861.9 ps, the ADD instruction is read from the IM. After 3 clock cycles, control signals M2 = 0 and M4 = 1, and 0001 1001 and 0000 0110 are released by MUX2 and MUX4, respectively. After 8 clock cycles, M3 = 1 and MUX3 releases the result from the ALU (0001 1111). After 2 clock cycles, control signal STORE_R1 = 1 and 0001 1111 is stored in data register R1. At 6301.9 ps, the JB instruction is read from the IM, and the address 1100 is sent to the PC. After PC+1, the OUT instruction is read from the IM. After 3 clock cycles, control signal M2 = 0 and 0001 1111 from R0 is released by MUX2. At the next clock cycle, control signal OUT_OP = 1, and the 8-bit data (0001 1111) are output externally. At 9181.9 ps, the HALT instruction is read from the IM. Then, the microprocessor stops all operations, although the clock signal is still applied.

The waveform shows a 16.7-GHz clock frequency, with 24 cycles (24 cycles × 60 ps = 1440 ps) between executable instructions. There is a NOP instruction between every two executable instructions. Thus, the clock cycles per instruction (CPI) of the microprocessor is 12; this value includes the NOP operations. There is room for improvement if a system clock is adopted. Among the 12 cycles, the delay for the PC waiting for the input address amounts to six cycles, and the delay for the PC calculating the output address amounts to six cycles.

From the simulation results, the delay of the ALU amounts to 385.8 ps, the delay of the ID amounts to 133.2 ps, and the PC sends an address to the IRs every 720 ps. The latency for executing the designed testbench (IN-SRL-LOADI-ADD-JB-OUT-HALT) is 9181.9 ps at 16.7 GHz.

IV. EVALUATION

Table IV shows the number of cells and JJs used in the microprocessor circuits, based on the RSFQ logic circuits of all the components and without considering splitters (SPLs), Josephson transmission lines (JTLs), and passive transmission lines (PTLs). Table IV shows that the number of JJs for the on-chip IM is much larger than for the other components, whereas the number of JJs for the I/O port registers is the smallest. Also, DFFs outnumber the other cells; most of them are used as buffers to synchronize the data and control signals.

TABLE IV
HARDWARE CONSUMPTION OF THE PROPOSED MICROPROCESSOR

Table V compares the proposed microprocessor with previous work. From the results, the proposed microprocessor achieves the highest million instructions per second (MIPS) among the compared designs.
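The timing and throughput figures quoted in Section III can be reproduced with a short calculation. The sketch below uses only numbers stated in the text; the function names are hypothetical, the simple sum of read delay, setup time, and hold time is one plausible reading of how the 60-ps period was chosen rather than the authors' stated method, and the MIPS value is derived here as clock frequency divided by CPI, not taken from Table V.

```python
# Values quoted in Section III of the text.
IM_READ_DELAY_PS = 46.8
MAX_SETUP_PS = 7.4
MAX_HOLD_PS = 4.1
CLOCK_PERIOD_PS = 60.0
CYCLES_BETWEEN_EXECUTABLE = 24  # includes the NOP slot after each instruction
CPI = 12


def timing_budget_ps() -> float:
    """Assumed budget: IM read delay plus worst-case setup and hold times."""
    return IM_READ_DELAY_PS + MAX_SETUP_PS + MAX_HOLD_PS


def clock_frequency_ghz(period_ps: float) -> float:
    """Clock frequency in GHz for a period given in picoseconds."""
    return 1000.0 / period_ps


def mips(period_ps: float, cpi: int) -> float:
    """Million instructions per second, assuming MIPS = f_clk / CPI."""
    return 1e6 / (period_ps * cpi)


def fetch_time_ps(first_fetch_ps: float, slot: int) -> float:
    """Fetch time of the executable instruction in the given slot (0-based)."""
    return first_fetch_ps + slot * CYCLES_BETWEEN_EXECUTABLE * CLOCK_PERIOD_PS
```

Under these assumptions the budget is 58.3 ps, which fits within the 60-ps period; the clock frequency is about 16.7 GHz; and 12 cycles per instruction correspond to roughly 1389 MIPS. With the first fetch at 541.9 ps, slot 1 falls at 1981.9 ps (SRL) and slot 6 at 9181.9 ps (HALT), matching the waveform timestamps.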

TABLE V
COMPARISON OF 8-bit MICROPROCESSOR

Note that this result is based on simulation only, not on a physical realization. Once the physical realization is taken into account, the useful peak throughput will be negatively impacted, and the advantage will decline.

V. CONCLUSION

We have designed the RSFQ logic circuits of all the components for the proposed microprocessor and simulated the microprocessor at the gate level. It contains 7702 JJs (without considering JTLs, PTLs, and SPLs), among which 2248 are for the on-chip IM, 1310 are for the MUXs, and 1180 are for the ALU. The hardware cost of the microprocessor is relatively low, so it can be fabricated with the current fabrication process. The simulation results show that it operates correctly at the target frequency of 16.7 GHz.

In this article, we only designed and logically simulated the proposed microprocessor; we did not carry out physical realization or physical simulation. Wiring and precise timing control are the two challenges in implementing the physical design of the microprocessor. We will realize the physical design in future work.

ACKNOWLEDGMENT

The authors would like to thank Nagoya University, Yokohama National University, and AIST in Japan for providing the Open Dataset of CONNECT Cell Library for AIST ADP2.


REFERENCES

[1] D. Scott Holmes, A. L. Ripple, and M. A. Manheimer, “Energy-efficient superconducting computing—Power budgets and requirements,” IEEE Trans. Appl. Supercond., vol. 23, no. 3, Jun. 2013, Art. no. 1701610.
[2] K. K. Likharev and V. K. Semenov, “RSFQ logic/memory family: A new Josephson-junction technology for subterahertz-clock-frequency digital systems,” IEEE Trans. Appl. Supercond., vol. 1, no. 1, pp. 3–28, Mar. 1991.
[3] O. A. Mukhanov, “Energy-efficient single flux quantum technology,” IEEE Trans. Appl. Supercond., vol. 21, no. 3, pp. 760–769, Jun. 2011.
[4] Y. Yamanashi, T. Nishigai, and N. Yoshikawa, “Study of LR-loading technique for low-power single flux quantum circuits,” IEEE Trans. Appl. Supercond., vol. 17, no. 2, pp. 150–153, Jun. 2007.
[5] Q. Herr, A. Herr, O. Oberg, and A. Ioannidis, “Ultra-low-power superconductor logic,” J. Appl. Phys., vol. 109, 2011, Art. no. 103903.
[6] M. Dorojevets, P. Bunyk, and D. Zinoviev, “FLUX chip: Design of a 20-GHz 16-bit ultrapipelined RSFQ processor prototype based on 1.75-μm LTS technology,” IEEE Trans. Appl. Supercond., vol. 11, no. 1, pp. 326–332, Mar. 2001.
[7] M. Dorojevets and P. Bunyk, “Architectural and implementation challenges in designing high-performance RSFQ processors: A FLUX-1 microprocessor and beyond,” IEEE Trans. Appl. Supercond., vol. 13, no. 2, pp. 446–449, Jun. 2003.
[8] M. Tanaka, F. Matsuzaki, T. Kondo, and N. Nakajima, “A single-flux-quantum logic prototype microprocessor,” in Proc. IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 2004, pp. 298–529.
[9] Y. Yamanashi et al., “Design and implementation of a pipelined bit-serial SFQ microprocessor, CORE1β,” IEEE Trans. Appl. Supercond., vol. 17, no. 2, pp. 474–477, Jun. 2007.
[10] A. Fujimaki et al., “Bit-serial single flux quantum microprocessor CORE,” IEICE Trans. Electron., vol. E91-C, pp. 342–349, Mar. 2008.
[11] M. Tanaka et al., “Design and implementation of a pipelined 8 bit-serial single-flux-quantum microprocessor with cache memories,” Supercond. Sci. Technol., vol. 20, no. 11, pp. S305–S309, Nov. 2007.
[12] Y. Nobumori, T. Nishigai, K. Nakamiya, and N. Yoshikawa, “Design and implementation of a fully asynchronous SFQ microprocessor: SCRAM2,” IEEE Trans. Appl. Supercond., vol. 17, no. 12, pp. 478–481, Jun. 2007.
[13] Y. Ando, R. Sato, M. Tanaka, K. Takagi, N. Takagi, and A. Fujimaki, “Design and demonstration of an 8-bit bit-serial RSFQ microprocessor: CORE e4,” IEEE Trans. Appl. Supercond., vol. 26, no. 5, Aug. 2016, Art. no. 1301205.
[14] G. Tang, Y. Ohmomo, K. Takagi, and N. Takagi, “Conceptual design of a 4-bit bit-slice 32-bit RSFQ microprocessor,” in Proc. Appl. Supercond. Conf., Sep. 2016, Art. no. 4EOr2B-03.
[15] T. Filippov, M. Dorojevets, A. Sahu, and A. Kirichenko, “8-bit asynchronous wave-pipelined RSFQ arithmetic-logic unit,” IEEE Trans. Appl. Supercond., vol. 21, no. 3, pp. 847–851, Jun. 2011.
[16] T. Filippov et al., “20 GHz operation of an asynchronous wave-pipelined RSFQ arithmetic logic unit,” Phys. Procedia, vol. 36, pp. 59–65, Jun. 2012.
[17] M. Dorojevets, C. L. Ayala, N. Yoshikawa, and A. Fujimaki, “8-bit asynchronous sparse-tree superconductor RSFQ arithmetic-logic unit with a rich set of operations,” IEEE Trans. Appl. Supercond., vol. 23, no. 3, Jun. 2013, Art. no. 1700104.
[18] Y. Ando, R. Sato, M. Tanaka, and K. Takagi, “80-GHz operation of an 8-bit RSFQ arithmetic logic unit,” in Proc. 15th Int. Supercond. Electron. Conf., Jul. 2015, Art. no. DS-P17.
[19] G. Tang, K. Takata, M. Tanaka, A. Fujimaki, K. Takagi, and N. Takagi, “4-bit bit-slice arithmetic logic unit for 32-bit RSFQ microprocessors,” IEEE Trans. Appl. Supercond., vol. 26, no. 1, Jan. 2016, Art. no. 1300106.
[20] G. Tang, P. Qu, X. Ye, and D. Fan, “Logic design of a 16-bit bit-slice arithmetic logic unit for 32-/64-bit RSFQ microprocessors,” IEEE Trans. Appl. Supercond., vol. 28, no. 4, Jun. 2018, Art. no. 1300305.
[21] P. Qu, G. Tang, X. Ye, D. Fan, and N. Sun, “Design of datapath circuits for a bit-parallel 8-bit RSFQ microprocessor,” in Proc. IEEE Int. Supercond. Electron. Conf., 2019, pp. 1–4.
[22] [Online]. Available: http://iverilog.icarus.com/
[23] [Online]. Available: http://gtkwave.sourceforge.net/
[24] IEEE Standard Hardware Description Language Based on the Verilog Hardware Description Language, IEEE Standard 1364-1995, Oct. 1996.
