Professional Documents
Culture Documents
xx XXXX 2020
1
PAPER Special Section on Parallel, Distributed, and Reconfigurable Computing, and Networking
SUMMARY RISC-V is a RISC based open and loyalty free instruction this paper because it is sufficient to support the operating
set architecture which has been developed since 2010, and can be used for system environment and suits for embedded systems. RV32I
cost-effective soft processors on FPGAs. The basic 32-bit integer instruc-
can emulate other extensions of M, F, and D, and can be
tion set in RISC-V is defined as RV32I, which is sufficient to support the
operating system environment and suits for embedded systems. configured with fewer hardware resources than processors
In this paper, we propose an optimized RV32I soft processor named supporting RV32G. Although several soft processors that
RVCoreP adopting five-stage pipelining. The processor applies three ef- support RV32I have been released [5], they are not highly
fective optimization methods to improve the operating frequency. These optimized for FPGAs.
methods are instruction fetch unit optimization including pipelined branch
prediction mechanism, ALU optimization, and data alignment and sign-
In this paper, we propose an optimized RV32I soft pro-
extension optimization for data memory output. We implement RVCoreP cessor named RVCoreP of five-stage pipelining which is
in Verilog HDL and verify the behavior using Verilog simulation and an highly optimized for FPGAs. The main contributions of this
actual Xilinx Atrix-7 FPGA board. We evaluate IPC (instructions per paper are as follows.
cycle), operating frequency, hardware resource utilization, and processor
performance. From the evaluation results, we show that RVCoreP achieves • We propose an optimized RV32I soft processor of five-
30.0% performance improvement compared with VexRiscv, which is a high- stage pipelining highly optimized for FPGAs. To im-
performance and open source RV32I processor selected from some related
works.
prove the operating frequency, it applies instruction
key words: soft processor, FPGA, RISC-V, RV32I, Verilog HDL, five-stage fetch unit optimization including pipelined branch pre-
pipelining diction mechanism, ALU optimization, and data align-
ment and sign-extension optimization for data memory
1. Introduction output.
• We implement the proposal in Verilog HDL and evalu-
RISC-V [1] is becoming popular as an open and loyalty free ate IPC (instructions per cycle), operating frequency,
instruction set architecture (ISA) which has been developed hardware resource utilization, and processor perfor-
at the University of California, Berkeley since 2010. It can mance. From the evaluation results, we show that the
be used for cost-effective soft processors on FPGAs like proposed processor achieves much better performance
MicroBlaze [2] and Nios II [3]. than VexRiscv, which is a high-performance and open
The RISC-V ISA is defined as a basic integer instruc- source RV32I processor.
tion set and other extended instruction sets, and we can sup-
port necessary instruction sets by the application require-
2. Related works
ments [4]. The basic 32-bit integer instruction set is defined
as RV32I. Other typical extended instruction sets are defined
Rocket Core [6] is a RISC-V in-order scalar processor de-
as M for integer multiplication and division instructions,F for
veloped by the University of California, Berkeley. It is a
single-precision floating-point ones, D for double-precision
pipelined processor supporting RV32G and RV64G. It sup-
floating-point ones, and A for atomic ones. In addition to
ports processing of privilege levels, and has an MMU (mem-
these, a 32-bit general-purpose instruction set is defined as
ory management unit) with virtual memory and data cache,
RV32G as the set of RV32I, M, A, F, and D. This is an
and a branch prediction unit. Because of this rich function-
instruction set architecture for general-purpose computing
ality and hard customization, it is not suitable for embedded
systems of a broad range. RV64G is a 64-bit version of a
systems.
general-purpose instruction set.
Rocket Core has another drawback. It is written in
Among these instruction sets, we focus on RV32I in
Chisel [7], a domain-specific language based on Scala. Be-
Manuscript received January 7, 2020. cause Chisel is a new hardware description language since
† The authors are with the School of Computing, Tokyo Institute
2012, it may be difficult for hardware developers who have
of Technology. not mastered Chisel to change the design effortlessly. Ac-
a) E-mail: miyazaki@arch.cs.titech.ac.jp
cording to the work [5], Verilog HDL and SystemVerilog are
b) E-mail: kanamori@arch.cs.titech.ac.jp
c) E-mail: ashraful@arch.cs.titech.ac.jp the dominant languages used to implement the processors,
d) E-mail: kise@c.titech.ac.jp and they may be the best choice for easy-to-use processor
DOI: 10.1587/transinf.E0.D.1 implementations. Therefore, we implement our processors
Copyright © 2020 The Institute of Electronics, Information and Communication Engineers
IEICE TRANS. INF. & SYST., VOL.Exx–??, NO.xx XXXX 2020
2
w_bmis
IdEx_rd
Load-use
r_BHR
Id_rs1
Mux
join
Id_luse IdEx_pc
Id_rs2
+
left 1
shift
decoder_imm
comb
+ m_PHT Id_imm
IdEx_imm
align/extend
w_btkn D_ADDR D_IN Ma_rslt
m_dmem
Mux
m_BTB
w_btb IfId_pc
Mux
r_pc Id_rs1
4 Ex_b_rslt
comb
Id_rrs1 IdEx_rrs1 Wb_rslt
+
r_pc
r_pc+4 Id_rs2
Mux
regfile Ex_rrs1
ALU
w_npc Ex_rslt
ExMa_rslt
Id_rrs2
Mux
I_ADDR I_IN Ex_rrs2
Mux
m_imem Id_imm
IfId_ir
Wb_rslt
ExMa_rslt
Wb_rslt
in Verilog HDL, a dominant hardware description language. instruction decode stage (Id stage), instruction execution
VexRiscv [8] is a RISC-V pipelined soft processor sup- stage (Ex stage), memory access stage (Ma stage), and write
porting RV32I. The integer multiplication and division, other back stage (Wb stage).
extensions, and the MMU with instruction cache and data The green rectangles are registers that are updated at
cache can be added as options. In addition, the branch pre- the positive clock edge. The yellow rectangles are modules
diction scheme, implementation choice of shift instruction, including the memory which is composed of block RAM
data forwarding path, and so on can be tuned for imple- on Xilinx FPGA. The gray rectangle is a register file read
mentation. VexRiscv is written in an open source and new asynchronously consisting of 32 registers. The red mod-
hardware description language called SpinalHDL [9], and ules are an ALU or adders, and the other blue modules are
the corresponding RTL description can be generated as a combinational circuits.
Verilog HDL file. Since the generated Verilog HDL code is The baseline has an instruction memory named m_imem
not hierarchical, debugging and understanding this generated shown at the bottom left of the figure, and a data memory
code is not easy. named m_dmem shown at the right of the figure.
VexRiscv has won the 1st place at the highest- The branch prediction scheme is gshare [15] which con-
performance implementation category of the RISC-V Soft- tains a branch history register (BHR) named r_BHR, a pat-
CPU Contest in 2018 hosted by the RISC-V Founda- tern history table (PHT) named m_PHT, and a branch target
tion [10]. Therefore, it is an optimized soft processor for buffer (BTB) named m_BTB. To mitigate the data hazard, it
high-performance, and the highest performance RV32I soft has two forwarding paths. The red path from Ma stage to
processor available as an open source as far as we know. We Ex stage provides the register value for the next dependent
use VexRiscv as a reference for making the comparison with instruction. Similarly, the blue path provides register value
our proposed processors. from the Wb stage to the Ex stage.
There are other RISC-V processors for education such In the If stage, the instruction is fetched from the instruc-
as riscv-mini [11] and Sodor Processor [12] both are de- tion memory using the program counter (PC) as an address.
veloped by the University of California, Berkeley, and The register for PC named r_pc is updated in every cycle
Clarvi [13] developed by the University of Cambridge. with the next PC value named w_npc.
These educational RISC-V processors are easy-to-use, but There are four candidates for w_npc in following de-
their performance is not high as VexRiscv because they are scending priority order. The highest priority one is the cor-
not highly optimized for high-performance. rect PC value named ExMa_pc_true from the Ma stage. The
second priority one is the current PC value from r_pc in
3. Design of a typical five-stage pipelined processor case of pipeline stalling. The third priority one is the branch
target address named w_btb which is output from the BTB.
We design a typical five-stage pipelined processor with The lowest priority one is r_pc+4 for the instruction of the
branch prediction referring [14], and this design is used as a next address.
baseline for the proposal. There are three control signals to select the proper one
Fig.1 shows a block diagram of the baseline consists of among four candidates. The first signal is named w_bmis
five-stage indicated by the instruction fetch stage (If stage), which indicates whether a branch misprediction has oc-
MIYAZAKI et al.: RVCOREP : AN OPTIMIZED RISC-V SOFT PROCESSOR OF FIVE-STAGE PIPELINING
3
IfId IfId
sign-extension or zero extension is needed for load byte and
load halfword instructions. Finally, the unit selects a proper
r_BHR
r_BHR
Mux
Mux
join
join
value using a large multiplexer depends on the operation
code of the instruction.
r_pcx
comb
comb
+ m_PHT + m_PHT
r_btb
the similar approach to the ALU optimization which is one- m_BTB
w_btb
m_BTB
r_btb
Mux
Mux
hot encoding and using exclusive OR for value selection. 4
r_pc
4
r_pc
+
r_pc
r_pc
r_pc+4 r_pc+4
decoder_if
to improve the operating frequency of the instruction fetch
unit, which contains the critical path on the baseline proces- preIf stage If stage
sor.
(a) General branch prediction mechanism (b) Pipelined branch prediction mechanism
The related works [18], [19] have shown that the pipelin-
ing of the branch predictor can improve the operating fre- Fig. 2 A general configuration and the two-stage pipelining one for
quency of the soft processor when a complex branch predic- branch prediction mechanism including gshare, BTB and instruction mem-
tor is used. The similar approach is applied to the proposed ory.
branch prediction including gshare and BTB.
Fig.2 (a) shows a block diagram of a general branch
predictor and BTB in the baseline where the prediction is Clock cycle 1 2 3 4 5 6
made in a single cycle. In gshare branch predictor, the index Inst A (add) If
to access the PHT named m_PHT is obtained by exclusive PC=0x100 preIf
Id Ex Ma Wb
OR of PC and BHR named r_BHR. If the fetched instruction
Inst B (beq) If When previous PC + 4 PC,
is predicted as a conditional branch instruction using the PC=0x104 Id Ex Ma
the branch prediction isWb
invalid
preIf
BTB, it updates the BHR speculatively using the branch
prediction in the If stage. Inst M (bne) If
Id Ex Ma
PC=0x130
The combinational circuit named join shifts BHR left preIf
by 1-bit, and connects the branch prediction result to the least Inst N (sub) If
Id Ex
significant bit of the BHR. If the branch prediction missed, PC=0x134 preIf
the BHR is updated with the correct branch history. The
combinational circuit named comb that receives the value Fig. 3 The pipeline diagram of instruction fetching using pipelined
read from the BTB and the value read from the PHT generates branch prediction mechanism in RVCoreP.
the branch prediction named w_btkn.
The critical path of a general branch prediction mech-
anism is the red data path in Fig.2 (a) which includes the preIf stage, accessing the BTB and exclusive OR processing
access of the BTB, three LUTs, a multiplexer, wiring delays, to determine the PHT index are performed. In the second
and clock skew. In our preliminary evaluation, the access cycle in the If stage, the value of the next PC is determined
of the BTB composed of block RAM on a Xilinx Artix-7 by using the results from the preIf stage and the instruction
FPGA takes about 2.54ns on the red path. Since the access is fetched from the instruction memory.
of one LUT takes about 0.12ns, the access to the three LUTs Fig.3 shows the pipeline diagram of instruction fetch-
necessary on the red path takes about 0.36ns. Also, the ac- ing using the pipelined branch prediction in RVCoreP. The
cess to a multiplexer implemented using hard macro takes rectangles written as preIf represents the processing of the
about 0.24ns, and other wiring delays and clock skew takes preIf stage, and the rectangles written as If represents the
about 3ns. Therefore, the total delay of the path exceeds processing of the If stage. Assuming that four instructions
6.1ns (165MHz) by adding the above delays. are fetched in the order of Inst A, Inst B, Inst M, and Inst N.
To improve the operating frequency for the proposed Inst A and Inst N are add and sub instructions, and these in-
processor, we split this critical path by two registers. struction addresses are 0x100 and 0x134, respectively. Inst
Fig.2 (b) shows the block diagram of the pipelined B and Inst M are beq (branch if equal) and bne (branch if not
gshare and pipelined BTB for RVCoreP. The red critical equal) instruction, and these instruction addresses are 0x104
path in Fig.2 (a) is divided into three paths by two inserted and 0x130, respectively. The next PC of Inst B is 0x130
registers named r_btb and r_pcx. The data acquired from the when the branch is taken.
BTB is stored in the register r_btb, and the register r_pcx is In the clock cycle 1 when the value of PC is 0x100, the
inserted before exclusive OR to generate the index of PHT. If stage for Inst A and the preIf stage for the next instruction
It takes two cycles to determine the value of the next are executed. In the preIf stage, the value of BTB and PHT
PC in the instruction fetch stage. In the first cycle in the index used in the If stage for the next instruction 0x104 are
MIYAZAKI et al.: RVCOREP : AN OPTIMIZED RISC-V SOFT PROCESSOR OF FIVE-STAGE PIPELINING
5
Id_alu_ctrl ExMa_tkn_pc
decoder_id
Mux
r_BHR
ExMa_npc
Mux
Id_bru_ctrl
join
w_stall w_bmis
Id_imm
ExMa_b_rslt
r_pcx
comb
+ m_PHT
IdEx_imm
align/extend
w_btkn
D_ADDR D_IN
m_dmem
+
Ex_rrs1
Ma_rslt MaWb_rslt
r_btb
Mux
m_BTB
IfId_ir
r_btb
Mux
r_pc IfId_rs1
4
Id_rrs1 IdEx_rrs1
+
r_pc
comb
r_pc+4 IfId_rs2
Mux
regfile
ALU_opt
Ex_b_rslt ExMa_b_rslt
Ex_rrs1
w_npc
Ex_rslt ExMa_rslt
Id_rrs2
Mux
I_ADDR I_IN Ex_rrs2
Mux
m_imem
IfId_ir Id_imm
decoder_if
IdEx_alu_ctrl
MaWb_rslt IdEx_bru_ctrl
If_rs1 IfId_rd
Load-use
Table 1 The evaluation results of IPC and branch prediction accuracy obtained by Verilog simulation.
Dhrystone Coremark
Label Average IPC
IPC prediction hit prediction miss hit rate IPC prediction hit prediction miss hit rate
VR-nobp 0.661 N/A N/A N/A 0.591 N/A N/A N/A 0.626
VR-bp 0.836 146,180 29,452 0.832 0.766 348,010 109,701 0.760 0.801
RVP-simple 0.946 205,127 12,507 0.943 0.828 366,726 91,247 0.801 0.887
RVP-optALU 0.946 205,127 12,507 0.943 0.828 366,726 91,247 0.801 0.887
RVP-optIF 0.935 201,153 16,481 0.924 0.823 363,439 94,534 0.794 0.879
RVP-optALL 0.935 201,153 16,481 0.924 0.823 363,439 94,534 0.794 0.879
Table 2 The evaluation results of frequency, hardware resource utilization, and performance.
Operating Slice Slice Increase rate Average Processor Normalized
Label Slice
frequency LUT register of slice IPC performance performance
VR-nobp 205 936 562 284 1.000 0.626 128.4 1.000
VR-bp 140 944 611 300 1.056 0.801 112.1 0.873
RVP-simple 160 1,020 715 349 1.229 0.887 141.9 1.105
RVP-optALU 170 1,070 730 375 1.320 0.887 150.8 1.174
RVP-optIF 180 1,044 749 390 1.373 0.879 158.2 1.232
RVP-optALL 190 1,073 764 397 1.398 0.879 167.0 1.300
1.0
Dhrystone 200
0.9
Coremark
0.8
Operating frequency
Average
0.7 150
0.6
IPC
0.5 100
0.4
0.3
50
0.2
0.1
0.0 0
VR-nobp VR-bp RVP-simple RVP-optALU RVP-optIF RVP-optALL VR-nobp VR-bp RVP-simple RVP-optALU RVP-optIF RVP-optALL
Fig. 6 The IPC for each configuration obtained by Verilog simulation. Fig. 7 The maximum operating frequency for each configuration on
Artix-7 FPGA.
1.2
1.0 value in the graph is the performance improvement rate
0.8 from VR-nobp. RVP-optALL achieves 30.0% performance
0.6
improvement compared to VR-nobp, which is the highest
0.4
performance configuration of VexRiscv. The other config-
0.2
0.0
urations of RVCoreP achieve performance improvement of
VR-nobp VR-bp RVP-simple RVP-optALU RVP-optIF RVP-optALL 10% or more compared to VR-nobp.
The performance improvement from RVP-simple to
Fig. 8 The processor performance by IPC assuming that the operating
frequency is the same where VR-nobp is normalized as 1. RVP-optALL is 1.176. Therefore, we achieve 17.6% perfor-
mance improvement by using three proposed optimizations.
1.4 6. Conclusion
Processor performance rate
1.2
[10] RISC-V Foundation, “RISC-V SoftCPU Contest, October 8, 2018.” Md Ashraful Islam have graduated from
https://riscv.org/2018/10/risc-v-contest/. the University of Rajshahi, Bangladesh. I am a
[11] University of California, Berkeley, “riscv-mini: Simple RISC-V 3- 1st-year doctoral student at the Tokyo Institute
stage Pipeline in Chisel.” https://github.com/ucb-bar/riscv-mini. of Technology. I have 8-years of experience in
[12] University of California, Berkeley, “The Sodor Processor: edu- the Semiconductor Industry in ASIC, SoC de-
cational microarchitectures for risc-v isa.” https://github.com/ucb- sign and Verification. My research interest is in
bar/riscv-sodor. Computer Architecture, especially on Processor
[13] University of Cambridge, “Clarvi: simple RISC-V processor for design and memory sub-system design.
teaching.” https://github.com/ucam-comparch/clarvi.
[14] D.A. Patterson and J.L. Hennessy, Computer Organization and De-
sign The Hardware / Software Interface, RISC-V Edition, Morgan
Kaufmann, 2018.
[15] S. McFarling, “Combining branch predictors,” tech. rep., Technical Kenji Kise received the B.E. degree from
Report TN-36, Digital Western Research Laboratory, 1993. Nagoya University in 1995, the M.E. degree
[16] P. Metzgen, “A High Performance 32-bit ALU for Programmable and the Ph.D. degree in information engineer-
Logic,” Proceedings of the 2004 ACM/SIGDA 12th International ing from the University of Tokyo in 1997 and
Symposium on Field Programmable Gate Arrays, FPGA ’04, New 2000, respectively. He is currently an associate
York, NY, USA, pp.61–70, ACM, 2004. professor of the School of Computing, Tokyo
[17] Xilinx, HDL Synthesis for FPGAs Design Guide, 1995. Institute of Technology. His research interests
[18] D.A. Jimenez, “Reconsidering complex branch predictors,” The include computer architecture and parallel pro-
Ninth International Symposium on High-Performance Computer Ar- cessing. He is a member of ACM, IEEE, IEICE,
chitecture, 2003. HPCA-9 2003. Proceedings., pp.43–52, Feb 2003. and IPSJ.
[19] K. Matsui, M. Ashraful Islam, and K. Kise, “An Efficient Im-
plementation of a TAGE Branch Predictor for Soft Processors on
FPGA,” 2019 IEEE 13th International Symposium on Embedded
Multicore/Many-core Systems-on-Chip (MCSoC), pp.108–115, Oct
2019.
[20] Digilent, Inc., Nexys 4 DDR Reference Manual, rev.c ed., 2016.
[21] R.P. Weicker, “Dhrystone: A Synthetic Systems Programming
Benchmark,” Commun. ACM, vol.27, no.10, pp.1013–1030, Oct.
1984.
[22] EEMBC, “CoreMark | CPU Benchmark âĂŞ MCU Benchmark.”
https://www.eembc.org/coremark/.
[23] RISC-V Foundation, “riscv-tests.” https://github.com/riscv/riscv-
tests.
[24] UC Berkeley Architecture Research, “Setup scripts and files
needed to compile CoreMark on RISC-V.” https://github.com/riscv-
boom/riscv-coremark.