6, JUNE 2021
Abstract—Design of high-performance processors with very low power requirements is the primary goal of many contemporary and futuristic applications. This brief presents a novel processor micro-architecture which is capable of achieving these requirements. The micro-architecture is based on the RISC-V Instruction Set Architecture (ISA). The core is implemented and verified on a Xilinx Virtex-7 FPGA board with a resource requirement of 7617 LUTs and 2319 FFs. This core could achieve a Dhrystone benchmark score of 1.71 DMIPS per MHz, which is higher than ARM Cortex-M3 (1.50 DMIPS per MHz) and ARM Cortex-M4 (1.52 DMIPS per MHz). The Coremark benchmark is also tested on this core and it gives 4.13 Coremark per MHz. The physical design result of the core using commercial tools shows that it can achieve a maximum frequency of 198.02 MHz with 0.036 mm² area and 17.36 µW/MHz power requirement at the UMC 40 nm technology node. The core consumes a dynamic power of 19.75 µW/MHz at UMC 90 nm, which is 36% and 40% better than ARM Cortex-M3 and Cortex-M4 respectively and also lower than many other cores. The results show that this core can outperform many existing commercial and open-source cores.

Index Terms—Micro-architecture, RV32IM, RISC-V, functional unit, Baugh Wooley, Booth, Vedic, Dadda, FPGA, ARM.

Manuscript received August 15, 2020; revised November 2, 2020; accepted December 4, 2020. Date of publication December 8, 2020; date of current version May 27, 2021. This brief was recommended by Associate Editor A. Calimera. (Corresponding author: Satyajit Bora.)
The authors are with the Department of Electronics and Electrical Engineering, Indian Institute of Technology Guwahati, Guwahati 781039, India (e-mail: b.satyajit@iitg.ac.in; roypaily@iitg.ac.in).
Color versions of one or more figures in this article are available at https://doi.org/10.1109/TCSII.2020.3043204.
Digital Object Identifier 10.1109/TCSII.2020.3043204

I. INTRODUCTION

In the modern technological era, the demand for appliances such as portable phones, medical devices, wired and wireless communication devices, Internet of Things (IoT) devices, etc. is increasing. All these systems require a high-performance processor to carry out their tasks. Also, most of these applications are portable and battery-powered. Therefore, these systems demand a high-performance low-power processor that can execute instructions without draining much power from the battery. In the last four decades, processor performance has increased through CMOS scaling. But this trend is now coming to an end as CMOS technology approaches its physical limitations. So, designers are forced to look into other aspects of processor design to get better power and performance within the aforementioned constraints. The two most important aspects of processor design are Instruction Set Architecture (ISA) and micro-architecture. There are mainly two ISAs that dominate the current processor industry: ARM, which is used in portable devices such as mobile phones, and x86, which is used in personal computers and servers. These ISAs may not be the best developed so far, but with time, they have become so popular that no other ISA could replace them. However, these ISAs are proprietary, because of which hardware designers cannot freely use them for their designs. A new ISA called RISC-V [1] has been developed by UC Berkeley to overcome these issues and to support computer architecture research, education, and industry implementation. The main advantage of this ISA is that it is in the public domain and anyone can use it for their research.

Based on the RISC-V ISA, many cores have been developed and many are under development. Rocket [2], [3] is the first approach towards a RISC-V micro-architecture by the same researchers who developed the RISC-V instruction set. It is a 6-stage single-issue in-order pipeline processor based on the RV64G ISA. The authors achieved a 10% higher performance in DMIPS/MHz score compared to Cortex-A5 and a 49% area improvement. Pulpino [4]–[6] is an open-source microcontroller system, based on an RV32IMC RISC-V core developed at ETH Zurich. It is a platform organized in clusters of RISC-V cores sharing a tightly-coupled data memory. Shakti-F [7] is another approach where researchers at IIT Madras are trying to develop a reliable processor for radiation prone environments like nuclear and space applications. Some variants of Shakti like Shakti-E, Shakti-C, Shakti-I are also under development [8]. A few more examples of RISC-V cores are ORCA [9], mriscv [10], VexRiscv [11], LowRISC [12], RI5CY [13], etc.

In this brief, a processor micro-architecture is proposed, which is capable of achieving high performance with a very low power requirement. The salient features of the design are: 1) the pipeline is organized in four stages and all integer instructions, except load and store, are executed in four stages; 2) the memory instructions, load and store, are separated from the mainstream of the pipeline but with an extra stage of memory read/write; 3) most of the arithmetic operations are executed in a single clock cycle, but the multiplication and division operations are executed in multiple cycles to reduce the critical path delay; 4) the Baugh Wooley multiplier is modified so that it can operate in two clock cycles; 5) a 32-bit radix-16 divider is designed to complete in 8 clock cycles; 6) a dynamic branch predictor (DBP) with a 2-bit counter is incorporated. The results showed that the core could achieve better performance compared to many existing core micro-architectures at a lower cost in area and power.

II. MICRO-ARCHITECTURE

In this section, we will detail micro-architectural optimizations for increasing the efficiency of the processor core. The starting point of this brief is RISC-V ISA. The core supports
1549-7747
© 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: Univ of Calif Santa Barbara. Downloaded on June 22,2021 at 18:15:29 UTC from IEEE Xplore. Restrictions apply.
BORA AND PAILY: HIGH-PERFORMANCE CORE MICRO-ARCHITECTURE BASED ON RISC-V ISA FOR LOW POWER APPLICATIONS 2133
2134 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: EXPRESS BRIEFS, VOL. 68, NO. 6, JUNE 2021
TABLE I
SYNTHESIS RESULTS OF DIFFERENT MULTIPLIER CIRCUITS AT UMC 40 NM TECHNOLOGY NODE
TABLE II
SYNTHESIS RESULTS OF DIFFERENT DIVIDER CIRCUITS AT UMC 40 NM TECHNOLOGY NODE
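Tables I and II concern the multiplier and divider circuits, which the introduction describes as a modified Baugh Wooley multiplier and a 32-bit radix-16 divider completing in 8 clock cycles. The following is a minimal behavioral sketch in Python, not the authors' hardware: it models the textbook Baugh Wooley partial-product form (the two-cycle modification is not reproduced) and an unsigned restoring divider that retires one radix-16 digit per step. The function names, the restoring scheme and the unsigned-only divider are assumptions made for illustration.

```python
def baugh_wooley_multiply(a: int, b: int, n: int = 32) -> int:
    """Signed n-bit by n-bit multiply via the textbook Baugh Wooley array.

    Assumes -2**(n-1) <= a, b < 2**(n-1). Illustrative model only.
    """
    a_bits = [(a >> i) & 1 for i in range(n)]  # two's complement bits of a
    b_bits = [(b >> j) & 1 for j in range(n)]  # two's complement bits of b
    total = 0
    # Ordinary AND terms for all magnitude-bit pairs.
    for i in range(n - 1):
        for j in range(n - 1):
            total += (a_bits[i] & b_bits[j]) << (i + j)
    # Product of the two sign bits.
    total += (a_bits[n - 1] & b_bits[n - 1]) << (2 * n - 2)
    # Cross terms with a sign bit are inverted instead of subtracted.
    for k in range(n - 1):
        total += (1 - (a_bits[k] & b_bits[n - 1])) << (k + n - 1)
        total += (1 - (a_bits[n - 1] & b_bits[k])) << (k + n - 1)
    # Correction constants that compensate for the inversions.
    total += (1 << n) + (1 << (2 * n - 1))
    total &= (1 << (2 * n)) - 1  # keep 2n result bits
    if total >> (2 * n - 1):  # reinterpret as a signed 2n-bit value
        total -= 1 << (2 * n)
    return total


def radix16_divide(dividend: int, divisor: int, width: int = 32):
    """Unsigned restoring division, one radix-16 digit (4 bits) per step.

    Returns (quotient, remainder, steps). For width=32 there are 8 steps,
    matching the 8 clock cycles quoted for the radix-16 divider.
    """
    if divisor == 0:
        raise ZeroDivisionError("divisor must be non-zero in this sketch")
    quotient = 0
    remainder = 0
    steps = 0
    for shift in range(width - 4, -1, -4):
        # Shift the next 4 dividend bits into the partial remainder.
        remainder = (remainder << 4) | ((dividend >> shift) & 0xF)
        # Select the quotient digit; hardware would compare in parallel.
        digit = 0
        while digit < 15 and remainder >= (digit + 1) * divisor:
            digit += 1
        remainder -= digit * divisor
        quotient = (quotient << 4) | digit
        steps += 1
    return quotient, remainder, steps
```

For instance, `baugh_wooley_multiply(-3, 5, 4)` returns -15 and `radix16_divide(100, 7)` returns `(14, 2, 8)`. The step count follows directly from the radix: 32 quotient bits at 4 bits per iteration give 8 iterations.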
TABLE III
ASIC IMPLEMENTATION RESULTS FOR THE PROPOSED MICRO-ARCHITECTURE AT UMC 90 NM AND 40 NM TECHNOLOGY NODES
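The power figures in this brief are normalized to clock frequency (µW/MHz). As a reading aid rather than a result taken from the paper, the unit is equivalent to energy per clock cycle, so the 40 nm and 90 nm figures quoted in the abstract can be restated in picojoules. This assumes only the definition of the units:

```latex
\[
1\,\frac{\mu\mathrm{W}}{\mathrm{MHz}}
  = \frac{10^{-6}\,\mathrm{J/s}}{10^{6}\,\mathrm{cycles/s}}
  = 10^{-12}\,\mathrm{J/cycle}
  = 1\,\mathrm{pJ/cycle}
\]
\[
17.36\,\mu\mathrm{W/MHz} \equiv 17.36\,\mathrm{pJ/cycle}\ (\text{UMC 40 nm}), \qquad
19.75\,\mu\mathrm{W/MHz} \equiv 19.75\,\mathrm{pJ/cycle}\ (\text{UMC 90 nm})
\]
```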
this brief, the RVT variant of UMC 40 nm technology is used. The use of the ULVt variant can give a better frequency result, but it also increases the power losses. Similarly, HVT gives better power results by reducing the leakage, but slows down the design. Hence, the RVT variant is used to get a balance between power and speed. From Table III, it can be seen that the energy efficiency of the proposed core is much higher than that of the other existing micro-architectures. These benchmark values are also compared with some commercial Intel and ARM single core processors in Figure 5. The graph shows that the proposed core outperforms many commercial single core processors. The comparison of the proposed design with existing micro-architectures also shows an improvement in power, performance and area efficiency of the proposed design. Therefore, this micro-architecture can be used for low power applications like IoT devices, which demand high performance at low power.

The design is primarily implemented for ASIC, but for initial hardware verification, the design is tested on an FPGA platform, a Xilinx Virtex-7 board. For FPGA verification, the core part remains untouched; however, the cache memory part is modified depending on the resources available on the board. The FPGA module has a dual-port memory of 128 KB for data as well as for instructions. This memory is implemented in the Block RAM of the FPGA, and depending on the available resources, it is limited to 128 KB. The processor has its instruction start address at 0x00000220 and runs at a clock frequency of 100 MHz. The core functionality is first verified with some basic C programs like addition, multiplication, factorial, loops, string, etc. Once the functionality of the core is verified, two popular benchmark programs, i.e., Dhrystone and Coremark, are run on the core to measure its performance. All of these programs are in C and compiled using riscv-gnu-toolchain, a RISC-V cross-compiler, to generate the binaries of the programs. The binary data is uploaded to the memory of the core on the FPGA to execute. The FPGA synthesis results show that the core can be implemented with 7617 LUTs and 2319 FFs. The timing report shows that the core has a maximum delay path of 7.472 ns, which means it can go up to a maximum frequency of 134 MHz on Virtex-7 (1/7.472 ns ≈ 133.8 MHz).

IV. CONCLUSION

This brief presents a 4-stage pipelined micro-architecture for low power applications. The core supports the full base integer instruction set of RISC-V along with multiplication and division instructions. The core micro-architecture is arranged in such a way that execution of instructions can be carried out with a minimum possibility of data, control and structural hazards. Since low power application is the target of this brief, the core is designed with minimum essential features. This brief has mainly focused on optimizing the area, power, delay and performance of hardware circuits for the multiplier, divider, DBP, etc. and optimizing their control in the pipeline. The proposed core is suitable for applications that need high performance with a low power requirement. On benchmarks related to these applications, such as Dhrystone and Coremark, the proposed core shows better performance while consuming less power. This micro-architecture can be used as the base for RISC-V processor design, and more extensions can be added to it depending on specific applications.

REFERENCES

[1] A. Waterman, Y. Lee, D. A. Patterson, and K. Asanovic, “The RISC-V instruction set manual, volume I: User-level ISA, version 2.0,” EECS Dept., Univ. California at Berkeley, Berkeley, CA, USA, Rep. UCB/EECS-2014-54, May 2014. [Online]. Available: http://www2.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-54.html
[2] B. Zimmer et al., “A RISC-V vector processor with simultaneous-switching switched-capacitor DC–DC converters in 28 nm FDSOI,” IEEE J. Solid-State Circuits, vol. 51, no. 4, pp. 930–942, Apr. 2016. [Online]. Available: http://ieeexplore.ieee.org/document/7422720/
[3] Y. Lee et al., “A 45 nm 1.3 GHz 16.7 double-precision GFLOPS/W RISC-V processor with vector accelerators,” in Proc. 40th Eur. Solid-State Circuits Conf. (ESSCIRC), Sep. 2014, pp. 199–202. [Online]. Available: http://ieeexplore.ieee.org/document/6942056/
[4] M. Gautschi, M. Schaffner, F. K. Gurkaynak, and L. Benini, “An extended shared logarithmic unit for nonlinear function kernel acceleration in a 65-nm CMOS multicore cluster,” IEEE J. Solid-State Circuits, vol. 52, no. 1, pp. 98–112, Jan. 2017. [Online]. Available: http://ieeexplore.ieee.org/document/7756672/
[5] F. Conti et al., “An IoT endpoint system-on-chip for secure and energy-efficient near-sensor analytics,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 64, no. 9, pp. 2481–2494, Sep. 2017. [Online]. Available: http://ieeexplore.ieee.org/document/7927716/
[6] M. Gautschi et al., “Near-threshold RISC-V core with DSP extensions for scalable IoT endpoint devices,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 25, no. 10, pp. 2700–2713, Oct. 2017. [Online]. Available: http://ieeexplore.ieee.org/document/7864441/
[7] S. Gupta, N. Gala, G. S. Madhusudan, and V. Kamakoti, “SHAKTI-F: A fault tolerant microprocessor architecture,” in Proc. IEEE 24th Asian Test Symp. (ATS), Nov. 2015, pp. 163–168. [Online]. Available: http://ieeexplore.ieee.org/document/7422253/
[8] Shakti Processor Program. [Online]. Available: http://shakti.org.in/
[9] ORCA RV32IM. [Online]. Available: https://github.com/vectorblox/orca
[10] MRISCV. [Online]. Available: https://github.com/onchipuis/mriscv
[11] VEXRISCV. [Online]. Available: https://github.com/SpinalHDL/VexRiscv
[12] Lowrisc. [Online]. Available: https://github.com/lowRISC/lowrisc-chip
[13] RI5CY a Small 4-Stage RISC-V Core. [Online]. Available: https://github.com/pulp-platform/riscv
[14] N. Petra, D. D. Caro, V. Garofalo, E. Napoli, and A. G. M. Strollo, “Design of fixed-width multipliers with linear compensation function,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 58, no. 5, pp. 947–960, May 2011.
[15] S. Kuang, J. Wang, and C. Guo, “Modified Booth multipliers with a regular partial product array,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 56, no. 5, pp. 404–408, May 2009.
[16] E. Antelo, P. Montuschi, and A. Nannarelli, “Improved 64-bit radix-16 Booth multiplier based on partial product array height reduction,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 64, no. 2, pp. 409–418, Feb. 2017.
[17] C. R. Baugh and B. A. Wooley, “A two’s complement parallel array multiplication algorithm,” IEEE Trans. Comput., vol. C-22, no. 12, pp. 1045–1047, Dec. 1973.
[18] R. Muralidharan and C. Chang, “Radix-4 and radix-8 Booth encoded multi-modulus multipliers,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 60, no. 11, pp. 2940–2952, Nov. 2013.
[19] Y. Bansal, C. Madhu, and P. Kaur, “High speed vedic multiplier designs—A review,” in Proc. Recent Adv. Eng. Comput. Sci. (RAECS), Mar. 2014, pp. 1–6.
[20] L. Dadda, “On serial-input multipliers for two’s complement numbers,” IEEE Trans. Comput., vol. 38, no. 9, pp. 1341–1345, Sep. 1989.
[21] L. Chen, J. Han, W. Liu, P. Montuschi, and F. Lombardi, “Design, evaluation and application of approximate high-radix dividers,” IEEE Trans. Multi-Scale Comput. Syst., vol. 4, no. 3, pp. 299–312, Jul. 2018.
[22] M. Gautschi et al., “Tailoring instruction-set extensions for an ultra-low power tightly-coupled cluster of OpenRISC cores,” in Proc. IFIP/IEEE Int. Conf. Very Large Scale Integr. (VLSI-SoC), Oct. 2015, pp. 25–30.
[23] The ARM Cortex-M4 Processor, ARM’s High Performance Embedded Processor. [Online]. Available: https://developer.arm.com/products/processors/cortex-m/cortex-m4
[24] The ARM Cortex-M3 Processor, the Industry-Leading 32-Bit Processor for Highly Deterministic Real-Time Applications. [Online]. Available: https://developer.arm.com/products/processors/cortex-m/cortex-m3
[25] C. Arm et al., “Low-power 32-bit dual-MAC 120 µW/MHz 1.0 V ICYFLEX1 DSP/MCU core,” IEEE J. Solid-State Circuits, vol. 44, no. 7, pp. 2055–2064, Jul. 2009.
[26] SiFive E24, a High-Performance Microcontroller for Motor Control, Sensor Fusion, and IoT Applications. [Online]. Available: https://www.sifive.com/cores/e24
[27] Dhrystone Scores. [Online]. Available: http://www.roylongbottom.org.uk/dhrystone
[28] Coremark Scores. [Online]. Available: https://www.eembc.org/coremark/scores.php
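The conclusion names the DBP among the optimized circuits, and the introduction describes it as a dynamic branch predictor with a 2-bit counter. As an illustration of that general technique only, the sketch below models a table of 2-bit saturating counters in Python. The table size, the indexing by low program-counter bits, the initial counter state and all names are hypothetical choices, not details taken from the brief.

```python
class TwoBitPredictor:
    """Table of 2-bit saturating counters (hypothetical sizing and indexing).

    Counter values 0 and 1 predict not taken; 2 and 3 predict taken.
    """

    def __init__(self, entries: int = 64):
        self.entries = entries
        self.counters = [1] * entries  # assumed start state: weakly not taken

    def _index(self, pc: int) -> int:
        # Assumes 4-byte aligned instructions, so the two low bits are dropped.
        return (pc >> 2) % self.entries

    def predict(self, pc: int) -> bool:
        return self.counters[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)  # saturate at 3
        else:
            self.counters[i] = max(0, self.counters[i] - 1)  # saturate at 0


def count_mispredictions(predictor: TwoBitPredictor, pc: int, outcomes) -> int:
    """Replay a sequence of branch outcomes for one branch and count misses."""
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


# A loop branch taken nine times and then not taken, executed twice.
loop_outcomes = ([True] * 9 + [False]) * 2
misses = count_mispredictions(TwoBitPredictor(), 0x00000220, loop_outcomes)
```

With these assumptions `misses` is 3 out of 20 branches: one warm-up miss plus one miss at each loop exit. The second pass through the loop starts with a correct prediction because a single not-taken outcome only moves the counter from strongly taken to weakly taken, which is the usual motivation for using two bits rather than one.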