
2132 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: EXPRESS BRIEFS, VOL. 68, NO. 6, JUNE 2021

A High-Performance Core Micro-Architecture Based on RISC-V ISA for Low Power Applications

Satyajit Bora and Roy Paily, Member, IEEE

Abstract—Design of high-performance processors with very low power requirement is the primary goal of many contemporary and futuristic applications. This brief presents a novel processor micro-architecture which is capable of achieving these requirements. The micro-architecture is based on the RISC-V Instruction Set Architecture (ISA). The core is implemented and verified on a Xilinx Virtex-7 FPGA board with a resource requirement of 7617 LUTs and 2319 FFs. The core achieves a Dhrystone benchmark score of 1.71 DMIPS per MHz, which is higher than ARM Cortex-M3 (1.50 DMIPS per MHz) and ARM Cortex-M4 (1.52 DMIPS per MHz). The Coremark benchmark is also run on this core and gives 4.13 Coremark per MHz. The physical design result of the core using commercial tools shows that it can achieve a maximum frequency of 198.02 MHz with 0.036 mm² area and 17.36 µW/MHz power requirement at the UMC 40 nm technology node. The core consumes a dynamic power of 19.75 µW/MHz at UMC 90 nm, which is 36% and 40% better than ARM Cortex-M3 and Cortex-M4, respectively, and also lower than many other cores. The results show that this core can outperform many existing commercial and open-source cores.

Index Terms—Micro-architecture, RV32IM, RISC-V, functional unit, Baugh Wooley, Booth, Vedic, Dadda, FPGA, ARM.

Manuscript received August 15, 2020; revised November 2, 2020; accepted December 4, 2020. Date of publication December 8, 2020; date of current version May 27, 2021. This brief was recommended by Associate Editor A. Calimera. (Corresponding author: Satyajit Bora.)
The authors are with the Department of Electronics and Electrical Engineering, Indian Institute of Technology Guwahati, Guwahati 781039, India (e-mail: b.satyajit@iitg.ac.in; roypaily@iitg.ac.in).
Color versions of one or more figures in this article are available at https://doi.org/10.1109/TCSII.2020.3043204.
Digital Object Identifier 10.1109/TCSII.2020.3043204
1549-7747 © 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

I. INTRODUCTION

IN THE modern technological era, the demand for appliances such as portable phones, medical devices, wired and wireless communication devices, several Internet of Things (IoT) devices, etc. is increasing. All these systems require a high-performance processor to carry out their tasks. Also, most of these applications are portable and battery-powered. Therefore, these systems demand a high-performance low-power processor that can execute instructions without draining much power from the battery. In the last four decades, processor performance has increased through CMOS scaling. But now this trend is coming to an end as CMOS technology approaches its physical limitations. So, designers are forced to look into other aspects of processor design to get better power and performance within the aforementioned constraints. The two most important aspects of processor design are the Instruction Set Architecture (ISA) and the micro-architecture. Two ISAs dominate the current processor industry: ARM, which is used in portable devices like mobile phones, and x86, which is used in personal computers and servers. These ISAs may not be the best developed so far, but with time, they have become so popular that no other ISA could replace them. However, these ISAs are proprietary, because of which hardware designers cannot freely use them for their own designs. A new ISA called RISC-V [1] has been developed by UC Berkeley to overcome these issues and to support computer architecture research, education, and industry implementation. The main advantage of this ISA is that it is in the public domain and anyone can use it for their research.

Based on the RISC-V ISA, many cores have been developed and many are under development. Rocket [2], [3] is the first approach towards a RISC-V micro-architecture by the same researchers who developed the RISC-V instruction set. It is a 6-stage single-issue in-order pipeline processor based on the RV64G ISA. The authors achieved a 10% higher DMIPS/MHz score compared to Cortex-A5 and a 49% area improvement. Pulpino [4]–[6] is an open-source microcontroller system based on an RV32IMC RISC-V core developed at ETH Zurich. It is a platform organized in clusters of RISC-V cores sharing a tightly-coupled data memory. Shakti-F [7] is another approach, where researchers at IIT Madras are trying to develop a reliable processor for radiation-prone environments like nuclear and space applications. Some variants of Shakti, like Shakti-E, Shakti-C, and Shakti-I, are also under development [8]. A few more examples of RISC-V cores are ORCA [9], mriscv [10], VexRiscv [11], LowRISC [12], RI5CY [13], etc.

In this brief, a processor micro-architecture is proposed which is capable of achieving high performance with a very low power requirement. The salient features of the design are: 1) the pipeline is organized in four stages, and all integer instructions except load and store are executed in four stages; 2) the memory instructions, load and store, are separated from the mainstream of the pipeline, with an extra stage for memory read/write; 3) most of the arithmetic operations are executed in a single clock cycle, but the multiplication and division operations are executed in multiple cycles to reduce the critical path delay; 4) the Baugh Wooley multiplier is modified so that it can operate in two clock cycles; 5) a 32-bit radix-16 divider is designed to complete in 8 clock cycles; 6) a dynamic branch predictor (DBP) with a 2-bit counter is incorporated. The results show that the core achieves better performance than many existing core micro-architectures at a lower cost of area and power.

II. MICRO-ARCHITECTURE

In this section, we detail micro-architectural optimizations for increasing the efficiency of the processor core. The starting point of this brief is the RISC-V ISA. The core supports


the base integer instruction set of the RISC-V ISA with the integer multiplication and division extension. This micro-architecture can be used as the base of an advanced core supporting further extensions and features. The main focus of this brief is on optimizing the core power and performance. The efficient design of the primary components of any system leads to an overall improvement of that system. So, in the proposed micro-architecture, basic components of the core like the multiplier, divider, DBP, etc. are optimized to get the best possible result. The critical path of the core is also optimized to achieve a maximum frequency close to 200 MHz. Pipeline stages are organized so that data, control, and structural hazards are minimized. These points are discussed in more detail in the following part of this section.

The number of pipeline stages is one of the critical aspects of any processor design. A higher number of pipeline stages allows a designer to achieve a higher operating frequency, which leads to an increase in overall throughput. However, it also increases data and control hazards and reduces the IPC (Instructions per Cycle). For achieving high performance, processors are also optimized using branch prediction, pre-fetch buffers, and speculation. However, these optimizations increase power consumption and are usually not suitable for low power applications. For example, ARM Cortex-M4 is a three-stage pipelined micro-architecture with a single write-back port in the register bank. The absence of a second write-back port and a fourth pipeline stage causes stalls while executing load operations. A three-stage pipeline is good for achieving high IPC, but the frequency of such a pipeline is limited because of multiple operations in a single stage. The proposed micro-architecture is designed with four pipeline stages in order to achieve a balance between IPC and frequency. The pipeline stages of the proposed core are Instruction Fetch (IF), Instruction Decode (ID), Instruction Execute (IE) and Register Write (RW), as shown in Figure 1. Most of the instructions can be completed within these four stages. But there are a few instructions for which this pipeline is modified to get optimum performance. The unconditional jump instruction JAL is completed after the first stage as it does not involve any register read/write operation. Load instructions require another stage (Memory Read) before the RW stage because the execution of these instructions involves memory access delays.

Fig. 1. Pipeline stages of the proposed micro-architecture.

The proposed core supports the base integer instruction set RV32I with the multiplication and division extension M. Multiplication and division of 32-bit numbers is an expensive process in terms of critical path delay. So, these operations are executed in multiple clock cycles to reduce the critical path delay. The core has a 2-bit branch predictor with a history table. A 2-bit prediction counter offers a good balance between hardware complexity and accuracy. A counter with more than two bits results in higher complexity with only a small improvement in accuracy, whereas a 1-bit counter is less complex but its prediction accuracy is poorer.

A. Instruction Execute

Figure 2 shows the block-level representation of the IE stage. In the IE stage, register values are first checked. If they are ready to read, execution continues; otherwise, values are fetched from the IE/RW stage, depending on previous instructions. These register values are then fed into the different Functional Units (FUs) of the IE stage for execution. The IE stage has five FUs. These FUs, with their operations and clock cycles, are listed below:
1) ALU - add, sub, and, or, xor (1 clock)
2) Comparator - signed/unsigned comparison (1 clock)
3) Shifter - logical/arithmetic shift (1 clock)
4) Multiplier - signed/unsigned multiplication (2 clocks)
5) Divider - signed/unsigned division (8 clocks).

Fig. 2. Micro-architecture of Instruction Execute.

The IE stage also generates the memory read/write signals and branch signals. The memory read/write address is obtained from the ALU, and the memory read/write data is obtained from the second source register, i.e., rs2. The memory read/write enable signal comes from the ID stage. These three signals are then combined to control the memory read/write operations. A branch consists of two signals: branch enable and branch address. The branch address is assigned from the ALU output, and the comparator output triggers the branch enable signal depending on the type of instruction.

Multiplier and divider circuits are vital parts of any core micro-architecture, and their hardware complexity is very high compared to the other functional units of the IE stage. So, choosing the right hardware circuit for multiplication and division plays a vital role in the overall area, power, and performance of a core.

1) Multiplier: There are many binary multiplication methods [14]–[16] available in the literature, namely Baugh and Wooley [17], Booth [18], Vedic [19], Dadda [20], etc. The hardware circuits for some popular multiplication methods are implemented using Verilog HDL, and the most suitable circuit is selected based on our requirements. These designs are then synthesized using Synopsys DC at the UMC 40 nm technology node and results are shown in


Table I. Among all the implementations, the Baugh Wooley signed multiplier is found to be the most suitable one for our micro-architecture.

TABLE I
SYNTHESIS RESULTS OF DIFFERENT MULTIPLIER CIRCUITS AT UMC 40 NM TECHNOLOGY NODE

From Table I, it can be seen that the delay of the Baugh Wooley signed multiplier circuit is 5.97 ns. If this circuit is used directly in the IE stage, the critical path of the overall system increases. Therefore, the hardware circuit of the multiplier is split into two stages. In the first stage, partial products are calculated using (1). Then, they are divided into six parts as shown in Figure 3 and are added. The adder results are then buffered to the next stage. In the second stage, the six buffered results are added again to get the final result. The critical path of the first stage consists of 1 HA (Half Adder) and 29 FAs (Full Adders), whereas for the second stage it consists of 1 HA and 47 FAs. The partial product segmentation is done in such a way that the delay of the first stage remains less than that of the second stage. This is because the input data of the multiplier may come from the register bank or from other stages of the pipeline, so a multiplexer delay is expected at the inputs of the multiplier circuit.

Fig. 3. Addition of Partial Products of 33×33 signed multiplier.

$$
A \cdot B = a_{n-1} b_{n-1} 2^{2n-2} + \sum_{i=0}^{n-2} \sum_{j=0}^{n-2} a_i 2^i \, b_j 2^j + 2^{n-1} \sum_{i=0}^{n-2} \overline{a_i b_{n-1}} \, 2^i + 2^{n-1} \sum_{j=0}^{n-2} \overline{a_{n-1} b_j} \, 2^j + 2^n - 2^{2n-1} \tag{1}
$$

Similar to the implementation of multiplication, the hardware circuits for available division algorithms [21] are implemented using Verilog HDL and synthesized at the UMC 40 nm technology node. Table II shows the area, power, delay, latency and throughput of these circuits. The throughput is measured in terms of the number of inputs the circuit can execute per clock cycle. From Table II, it can be seen that as we move to higher radix algorithms, the latency of the circuit decreases and, as a result, the throughput increases. If the results of radix-16 and radix-32 are compared, it can be seen that there is not much improvement in latency and throughput, but there is a significant increase in power and area. So, radix-16 is selected as the divider unit for the proposed micro-architecture. The pseudo-code for a 32-bit radix-16 unsigned divider is described in Algorithm 1.

TABLE II
SYNTHESIS RESULTS OF DIFFERENT DIVIDER CIRCUITS AT UMC 40 NM TECHNOLOGY NODE

Algorithm 1 32-Bit Radix-16 Divider

    Radix = 16; Rn = log2(Radix); Quotient = 0; D = Divisor;
    if (32 % Rn) != 0 then
        Zn = 32 + Rn - (32 % Rn);
    else
        Zn = 32;
    end if
    N = { Zn{0}, Dividend };
    Nmax = size of N; Nmin = Nmax - 32 - Rn;
    for i = 1 to ceil(32/Rn) do
        for n = (Radix-1) downto 0 do
            if N[(Nmax-1):Nmin] >= n*D then
                Quotient = {Quotient, n};
                N[(Nmax-1):Nmin] = N[(Nmax-1):Nmin] - n*D;
                break;
            end if
        end for
        Remainder = N[(Nmax-1):Nmin];
        N = N << Rn;
    end for

The block diagram of the hardware architecture for the divider is shown in Figure 4. In the hardware circuit, the divisor is first multiplied by constants and the results are stored in registers. Computing these constant multiplications directly would require many multipliers, which are highly expensive in terms of area, delay and power. So, in the hardware circuit, these multiplications are implemented using shifters and adders. Multiples of 2, 4 and 8 are calculated by left shifting the divisor, and the other multiples are determined by adding these results together. The critical path of the circuit is found to be 3.10 ns at the 40 nm technology node.

In the division process, the next step is to update the quotient and remainder. At every clock cycle, the most significant 36 bits of the dividend are compared with the pre-calculated multiples of the divisor. At first, the most significant bits of the dividend are compared with 15*divisor. If they are greater than or equal to 15*divisor, then the quotient is left-shifted by 4 bits and the value "1111" is appended, 15*divisor is subtracted from the upper 36 bits of the dividend, and the remainder is updated with the subtraction result. If the comparison indicates that they are less than 15*divisor, then the comparison continues for lower multiples of the divisor.
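The digit-selection loop of Algorithm 1 can be sketched in software as follows. This is a minimal Python model under stated assumptions, not the authors' Verilog: the function name `radix16_divide` and all variable names are hypothetical, the 36-bit window of the dividend register is modelled as a plain integer, and one loop iteration stands in for one clock cycle.

```python
def radix16_divide(dividend: int, divisor: int) -> tuple[int, int]:
    """Behavioural sketch of an unsigned 32-bit radix-16 divider.

    Produces one 4-bit quotient digit per iteration (8 iterations in total).
    Names and structure are illustrative assumptions, not the paper's RTL.
    """
    if not (0 <= dividend < 2**32 and 0 < divisor < 2**32):
        raise ValueError("operands must be unsigned 32-bit, divisor non-zero")

    # Pre-compute 0*D .. 15*D. The 2x, 4x and 8x multiples are left shifts;
    # the remaining multiples are sums of those, as in the shifter/adder scheme.
    d1, d2, d4, d8 = divisor, divisor << 1, divisor << 2, divisor << 3
    multiples = [
        (d1 if n & 1 else 0) + (d2 if n & 2 else 0)
        + (d4 if n & 4 else 0) + (d8 if n & 8 else 0)
        for n in range(16)
    ]

    window = 0    # partial remainder; stays below the divisor between steps
    quotient = 0
    for step in range(8):
        # Shift the next dividend nibble into the (at most 36-bit) window.
        nibble = (dividend >> (28 - 4 * step)) & 0xF
        window = (window << 4) | nibble
        # Try 15*D first, then lower multiples; 0*D always matches.
        for n in range(15, -1, -1):
            if window >= multiples[n]:
                quotient = (quotient << 4) | n
                window -= multiples[n]
                break
    return quotient, window  # window now holds the remainder


if __name__ == "__main__":
    print(radix16_divide(100, 7))
```

Because the partial remainder is always smaller than the divisor before a new nibble is shifted in, the window never exceeds 16 times the divisor, which is why 36 bits are enough for the comparison.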


This process continues for eight clock cycles, and at the end of the 8th cycle we get the desired quotient and remainder values.

Fig. 4. 32-bit radix-16 unsigned divider.

The divider unit discussed above takes eight clock cycles to complete one division and is an unsigned divider. But the ISA requires both signed and unsigned division. To enable signed operations, the magnitudes of the inputs are first determined. These 32-bit magnitudes are then fed to the divider circuit and the magnitude of the result is obtained. This result is finally converted to 2's complement depending on the type of inputs (signed/unsigned) and the type of instruction. Whenever the result should be negative, the output of the unsigned divider is converted to a 32-bit negative number in 2's complement representation; otherwise, it is left unchanged. Because of these signed-unsigned conversions, two additional clock cycles are required: one prior to division, to determine the magnitude of the inputs, and one after division, to convert the result back to a signed number.

III. RESULTS

The proposed core supports the base integer instruction set RV32I of the RISC-V ISA with the multiplier and divider extension M. The entire design is implemented using Verilog HDL. For the estimation of hardware, power, and energy efficiency, the proposed micro-architecture is synthesized using Synopsys DC. The place and route of the synthesized design is done in Cadence Innovus at the UMC 90 nm and UMC 40 nm technology nodes. For the post-layout estimation of power and delay, the Synopsys PrimeTime tool is used, and Synopsys VCS is used for simulation. The area, power, delay, and performance results are compared with some commercial as well as some RISC-V cores in Table III. It can be seen that the core results are better than some of the existing micro-architectures, and it is suitable for low power applications.

TABLE III
ASIC IMPLEMENTATION RESULTS FOR THE PROPOSED MICRO-ARCHITECTURE AT UMC 90 NM AND 40 NM TECHNOLOGY NODES

At the UMC 40 nm technology node, the core occupies an area of 0.036 mm², which consists of 20 K cells and 2.7 K buffers, and is almost half of the area occupied by [6]. The critical path delay of the proposed micro-architecture is 5.05 ns and 5.39 ns at UMC 40 nm and UMC 90 nm, respectively. That means the core could achieve a maximum frequency of 198.02 MHz at 40 nm and 166.94 MHz at 90 nm. While running at a 100 MHz clock frequency at the 40 nm technology node, the core consumes 1.745 mW total power. Of this, 99.5%, i.e., 1.736 mW, is dynamic and the leakage is only 0.5% (7.63 µW). The core consumes 2.225 mW total power at the 90 nm technology node while operating at a clock frequency of 100 MHz. The dynamic power consumption is 1.975 mW, which is 88.8% of the total power. To make a fair comparison among the cores, dynamic power consumption is normalized to the clock frequency. The proposed micro-architecture consumes 19.75 µW/MHz dynamic power, which is 36% better than Cortex-M3 and 40% better than Cortex-M4. The dynamic power consumption is also comparable to some other cores evaluated at lower technology nodes shown in Table III.

Fig. 5. Comparison of Dhrystone and Coremark scores with other commercial cores [27], [28].

For performance measurement, two benchmark programs, Dhrystone and Coremark, are run on the FPGA module of the proposed micro-architecture. The core performs 1.71 DMIPS per MHz and 4.13 Coremark per MHz. These Dhrystone and Coremark scores are better than almost all other cores. The energy efficiency of the core is measured in terms of Coremark/mW. The maximum operating frequencies of [6] and [22] are higher than that of the proposed design. This is due to the different variants of a single technology node. For example, the UMC 40 nm technology node has variants like ULVt (Ultra Low Vt), HVT (High Vt), RVT (Regular Vt), etc. Every variant has its own advantages and disadvantages. In


this brief, the RVT variant of UMC 40 nm technology is used. The use of the ULVt variant can give a better frequency result, but it also increases the power losses. Similarly, HVT gives better power results by reducing the leakage, but slows down the design. Hence, the RVT variant is used to get a balance between power and speed. From Table III, it can be seen that the energy efficiency of the proposed core is much higher than that of the other existing micro-architectures. These benchmark values are also compared with some commercial Intel and ARM single core processors in Figure 5. The graph shows that the proposed core outperforms many commercial single core processors. The comparison of the proposed design with existing micro-architectures also shows an improvement in power, performance and area efficiency of the proposed design. Therefore, this micro-architecture can be used for low power applications such as IoT devices, which demand high performance at low power.

The design is primarily implemented for ASIC, but for initial hardware verification, the design is tested on an FPGA platform, a Xilinx Virtex-7 board. For FPGA verification, the core part remains untouched; however, the cache memory part is modified depending on the resources available on the board. The FPGA module has a dual-port memory of 128 KB for data as well as for instructions. This memory is implemented in the Block RAM of the FPGA, and depending on the available resources, it is limited to 128 KB. The processor has its instruction start address at 0x00000220 and runs at a clock frequency of 100 MHz. The core functionality is first verified with some basic C programs like addition, multiplication, factorial, loops, string, etc. Once the functionality of the core is verified, two popular benchmark programs, i.e., Dhrystone and Coremark, are run on the core to measure its performance. All of these programs are written in C and compiled using riscv-gnu-toolchain, a RISC-V cross-compiler, to generate binaries of the programs. The binary data is uploaded to the memory of the core on the FPGA for execution. The FPGA synthesis results show that the core can be implemented with 7617 LUTs and 2319 FFs. The timing report shows that the core has a maximum delay path of 7.472 ns, which means it can go up to a maximum frequency of 134 MHz on Virtex-7.

IV. CONCLUSION

This brief presents a 4-stage pipelined micro-architecture for low power applications. The core supports the full base integer instruction set of RISC-V along with multiplication and division instructions. The core micro-architecture is arranged in such a way that execution of instructions can be carried out with a minimum possibility of data, control and structural hazards. Since low power application is the target of this brief, the core is designed with minimum essential features. This brief has mainly focused on optimizing the area, power, delay and performance of hardware circuits for the multiplier, divider, DBP, etc. and optimizing their control in the pipeline. The proposed core is suitable for applications that need high performance with a low power requirement. On benchmarks related to these applications, such as Dhrystone and Coremark, the proposed core shows better performance while consuming less power. This micro-architecture can be used as the base for RISC-V processor design, and more extensions can be added to it depending on specific applications.

REFERENCES

[1] A. Waterman, Y. Lee, D. A. Patterson, and K. Asanovic, “The RISC-V instruction set manual, volume I: User-level ISA, version 2.0,” EECS Dept., Univ. California at Berkeley, Berkeley, CA, USA, Rep. UCB/EECS-2014-54, May 2014. [Online]. Available: http://www2.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-54.html
[2] B. Zimmer et al., “A RISC-V vector processor with simultaneous-switching switched-capacitor DC–DC converters in 28 nm FDSOI,” IEEE J. Solid-State Circuits, vol. 51, no. 4, pp. 930–942, Apr. 2016. [Online]. Available: http://ieeexplore.ieee.org/document/7422720/
[3] Y. Lee et al., “A 45 nm 1.3 GHz 16.7 double-precision GFLOPS/W RISC-V processor with vector accelerators,” in Proc. 40th Eur. Solid-State Circuits Conf. (ESSCIRC), Sep. 2014, pp. 199–202. [Online]. Available: http://ieeexplore.ieee.org/document/6942056/
[4] M. Gautschi, M. Schaffner, F. K. Gurkaynak, and L. Benini, “An extended shared logarithmic unit for nonlinear function kernel acceleration in a 65-nm CMOS multicore cluster,” IEEE J. Solid-State Circuits, vol. 52, no. 1, pp. 98–112, Jan. 2017. [Online]. Available: http://ieeexplore.ieee.org/document/7756672/
[5] F. Conti et al., “An IoT endpoint system-on-chip for secure and energy-efficient near-sensor analytics,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 64, no. 9, pp. 2481–2494, Sep. 2017. [Online]. Available: http://ieeexplore.ieee.org/document/7927716/
[6] M. Gautschi et al., “Near-threshold RISC-V core with DSP extensions for scalable IoT endpoint devices,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 25, no. 10, pp. 2700–2713, Oct. 2017. [Online]. Available: http://ieeexplore.ieee.org/document/7864441/
[7] S. Gupta, N. Gala, G. S. Madhusudan, and V. Kamakoti, “SHAKTI-F: A fault tolerant microprocessor architecture,” in Proc. IEEE 24th Asian Test Symp. (ATS), Nov. 2015, pp. 163–168. [Online]. Available: http://ieeexplore.ieee.org/document/7422253/
[8] Shakti Processor Program. [Online]. Available: http://shakti.org.in/
[9] ORCA RV32IM. [Online]. Available: https://github.com/vectorblox/orca
[10] MRISCV. [Online]. Available: https://github.com/onchipuis/mriscv
[11] VEXRISCV. [Online]. Available: https://github.com/SpinalHDL/VexRiscv
[12] Lowrisc. [Online]. Available: https://github.com/lowRISC/lowrisc-chip
[13] RI5CY a Small 4-Stage RISC-V Core. [Online]. Available: https://github.com/pulp-platform/riscv
[14] N. Petra, D. D. Caro, V. Garofalo, E. Napoli, and A. G. M. Strollo, “Design of fixed-width multipliers with linear compensation function,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 58, no. 5, pp. 947–960, May 2011.
[15] S. Kuang, J. Wang, and C. Guo, “Modified Booth multipliers with a regular partial product array,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 56, no. 5, pp. 404–408, May 2009.
[16] E. Antelo, P. Montuschi, and A. Nannarelli, “Improved 64-bit radix-16 Booth multiplier based on partial product array height reduction,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 64, no. 2, pp. 409–418, Feb. 2017.
[17] C. R. Baugh and B. A. Wooley, “A two’s complement parallel array multiplication algorithm,” IEEE Trans. Comput., vol. C-22, no. 12, pp. 1045–1047, Dec. 1973.
[18] R. Muralidharan and C. Chang, “Radix-4 and radix-8 Booth encoded multi-modulus multipliers,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 60, no. 11, pp. 2940–2952, Nov. 2013.
[19] Y. Bansal, C. Madhu, and P. Kaur, “High speed vedic multiplier designs—A review,” in Proc. Recent Adv. Eng. Comput. Sci. (RAECS), Mar. 2014, pp. 1–6.
[20] L. Dadda, “On serial-input multipliers for two’s complement numbers,” IEEE Trans. Comput., vol. 38, no. 9, pp. 1341–1345, Sep. 1989.
[21] L. Chen, J. Han, W. Liu, P. Montuschi, and F. Lombardi, “Design, evaluation and application of approximate high-radix dividers,” IEEE Trans. Multi-Scale Comput. Syst., vol. 4, no. 3, pp. 299–312, Jul. 2018.
[22] M. Gautschi et al., “Tailoring instruction-set extensions for an ultra-low power tightly-coupled cluster of OpenRISC cores,” in Proc. IFIP/IEEE Int. Conf. Very Large Scale Integr. (VLSI-SoC), Oct. 2015, pp. 25–30.
[23] The ARM Cortex-M4 Processor, ARM’s High Performance Embedded Processor. [Online]. Available: https://developer.arm.com/products/processors/cortex-m/cortex-m4
[24] The ARM Cortex-M3 Processor, the Industry-Leading 32-Bit Processor for Highly Deterministic Real-Time Applications. [Online]. Available: https://developer.arm.com/products/processors/cortex-m/cortex-m3
[25] C. Arm et al., “Low-power 32-bit dual-MAC 120 µW/MHz 1.0 V ICYFLEX1 DSP/MCU core,” IEEE J. Solid-State Circuits, vol. 44, no. 7, pp. 2055–2064, Jul. 2009.
[26] SiFive E24, a High-Performance Microcontroller for Motor Control, Sensor Fusion, and IoT Applications. [Online]. Available: https://www.sifive.com/cores/e24
[27] Dhrystone Scores. [Online]. Available: http://www.roylongbottom.org.uk/dhrystone
[28] Coremark Scores. [Online]. Available: https://www.eembc.org/coremark/scores.php
