Transactions Briefs
A Dual-Core RISC-V Vector Processor With On-Chip Fine-Grain Power Management in 28-nm FD-SOI
John Charles Wright, Colin Schmidt, Ben Keller, Daniel Palmer Dabbelt, Jaehwa Kwak, Vighnesh Iyer, Nandish Mehta, Pi-Feng Chiu, Stevo Bailey, Krste Asanović, and Borivoje Nikolić
Abstract— This work demonstrates a dual-core RISC-V system-on-chip (SoC) with integrated fine-grain power management. The 28-nm fully depleted silicon-on-insulator (FD-SOI) SoC integrates switched-capacitor voltage converters and 4-Gb/s off-chip serial links. The SoC runs applications with operating system support on dual RISC-V Rocket cores with vector accelerators. Runtime monitoring of microarchitectural counters allows prediction of future compute intensity, enabling the voltage state of the managed core to be adjusted quickly to optimize energy efficiency without sacrificing overall performance.

Index Terms— DC–DC converter, multicore processing, power management, RISC-V processor, system-on-chip (SoC), vector processor.

Manuscript received April 13, 2020; revised June 27, 2020 and August 29, 2020; accepted September 18, 2020. Date of publication October 27, 2020; date of current version November 24, 2020. This work was supported in part by the Defense Advanced Research Projects Agency (DARPA) Power Efficiency for Embedded Computing Technologies (PERFECT) program under Award HR0011-12-2-0016 and in part by the Intel Science and Technology Center (ISTC) on Agile Design. (John Charles Wright, Colin Schmidt, and Ben Keller contributed equally to this work.) (Corresponding author: John Charles Wright.)
The authors are with the Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA 94720 USA (e-mail: johnwright@berkeley.edu).
Color versions of one or more of the figures in this article are available online at https://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TVLSI.2020.3030243

I. INTRODUCTION

Modern computing applications continue to demand increasing performance under a fixed power envelope. With the end of Dennard scaling, the power consumption from performance improvements must be offset by architectural and circuit design techniques that increase energy efficiency. This brief demonstrates a dual-core RISC-V processor that combines on-chip dynamic voltage scaling (DVS) with accurate, tightly integrated monitoring of real-time application performance requirements and specialization for data-parallel applications.

Since its early demonstration [1], DVS has become an indispensable technique for power management in digital systems. The basic concept trades off an approximately linear reduction in system performance for quadratic savings in switching energy and more-than-quadratic savings in leakage energy at the scaled supply. Given a bounded system throughput requirement, efficient use of DVS reduces total application energy accordingly [2]. Although the overall energy efficiency of the processor system is determined in part by the conversion efficiency of the voltage regulators, it also depends on the ability to predict upcoming workloads and on the speed of the voltage transition. These changes in voltage–frequency state must be performed quickly to adapt efficiently to the dynamic performance requirements of an application. Energy is wasted when compute intensity changes and the core remains in a suboptimal voltage–frequency state, so reducing the latency of voltage–frequency state transitions improves overall system energy efficiency. Similarly, decreasing the size of voltage–frequency domains enables finer-resolution changes in system performance, reducing the energy overhead due to spatial quantization of voltage domains. Recent mainstream products employ per-core supply voltages in a multicore microprocessor system by using platform-level voltage regulator modules and integrated per-core linear dropout regulators to achieve 19% power savings [3]. Additionally, voltage scaling can be used in conjunction with circuit-level error detection techniques to dynamically recover supply voltage margin [4].

This brief describes a system with fine-grain voltage domains powered by fully integrated, on-chip, high-density, switched-capacitor dc–dc converters capable of fast mode switching and supported by architectural counters. By switching from a nominal supply voltage of 0.9 V during active compute periods to 0.55 V during idle, memory-bound periods, this scheme can approach an ideal 62% reduction in dynamic switching power during idle phases with little penalty to overall compute throughput.

II. SYSTEM DESCRIPTION

The system, shown in Fig. 1, comprises a dual-core, 64-bit RISC-V processor with a custom vector accelerator, manufactured in a 28-nm fully depleted silicon-on-insulator (FD-SOI) technology. The system is the first reported RISC-V multicore design to place each core in its own voltage domain and to use high-speed serial links for memory traffic. The supply voltage for each core is provided by a bank of 24 on-chip switched-capacitor dc–dc converter cells [2], which can be bypassed if needed. This system includes the same dc–dc conversion subsystem as [5], which is capable of switching modes within 2 μs. Each core can be clocked by an externally supplied clock or by an on-chip adaptive clock generator [6]. The clock selection is made by writing the select value to a memory-mapped register using the off-chip memory interface while the cores are in reset. The two cores share a 256-KiB L2 cache, which resides in a separate, fixed voltage and clock domain [2]. The L2 cache is also responsible for routing memory traffic to memory-mapped control and status registers, and it includes a set of counters that track how many total memory accesses have hit and missed. The L2 cache has two possible paths to backing memory: a custom, low-speed, eight-bit parallel interface, or a bank of eight high-speed serial links. Consistent with our goal of innovating at the circuit level while maintaining a functional system, the parallel interface is included as a backup to the experimental high-speed interface. Each of these paths forwards memory transactions to a separate field-programmable gate array (FPGA) board that contains the backing dynamic random access memory (DRAM) for the system. The system also contains multiple distributed ring-oscillator-based temperature sensors [7] and a body-bias generator [8].

The Rocket application processors are scalar, in-order, single-issue cores with five pipeline stages [9]. This version of Rocket supports version 2.1-draft of the RISC-V RV64G instruction set architecture (ISA) variant with supervisor mode. These Rocket instances were generated from parameters chosen for a typical in-order, general-purpose core. Each core has a 64-entry branch target buffer, a 256-entry two-level branch predictor, a return address stack, a two-cycle latency for single-precision fused multiply-adds (FMAs), a three-cycle latency for double-precision FMAs, and separate 32-KiB instruction and data caches.
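The quadratic dependence of dynamic switching power on supply voltage, and the ideal 62% idle-phase figure quoted in the abstract, can be checked with a few lines. This is an illustrative back-of-the-envelope calculation using the supply values reported in this brief, not a measurement:

```python
# Dynamic switching power scales as C * V^2 * f; at a fixed clock
# frequency, dropping the supply from 0.9 V to 0.55 V reduces the
# switching energy per cycle by the square of the voltage ratio.
V_HI = 0.90  # nominal supply (V)
V_LO = 0.55  # idle-mode supply (V)

relative_power = (V_LO / V_HI) ** 2           # fraction of nominal power remaining
savings_pct = (1.0 - relative_power) * 100.0  # ~62.7%, the "ideal 62%" figure

print(f"relative dynamic power: {relative_power:.3f}")
print(f"ideal savings: {savings_pct:.1f}%")
```

Leakage savings are steeper still at a scaled supply, which is why the brief describes the 62% number as a floor on what voltage scaling alone can recover during idle, memory-bound phases.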
1063-8210 © 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: University of Gothenburg. Downloaded on December 20,2020 at 18:26:42 UTC from IEEE Xplore. Restrictions apply.
2722 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 28, NO. 12, DECEMBER 2020
Fig. 3. DC–DC mode transitions in response to cache activity for a matrix multiplication workload.
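The decision rule that produces these transitions — poll cumulative counters, difference successive readings, and compare the resulting miss rate against a threshold — can be sketched as follows. All names here are hypothetical illustrations; on the chip, the PMU reads memory-mapped L2 counter registers and writes a dc–dc mode register rather than calling Python functions:

```python
# Sketch of the PMU polling policy: every polling interval, compute the
# L2 miss rate from cumulative counters and choose a dc-dc voltage mode.
MISS_THRESHOLD = 1 / 1000  # fewer than one miss per 1000 cycles => compute-bound
V_HI_MV, V_LO_MV = 900, 550

def choose_mode(miss_delta: int, cycle_delta: int) -> int:
    """Return the target core voltage (mV) for one polling interval."""
    miss_rate = miss_delta / cycle_delta
    return V_HI_MV if miss_rate < MISS_THRESHOLD else V_LO_MV

def pmu_loop(samples):
    """samples: iterable of (cumulative_misses, cumulative_cycles) readings."""
    modes = []
    it = iter(samples)
    prev_miss, prev_cyc = next(it)
    for miss, cyc in it:
        modes.append(choose_mode(miss - prev_miss, cyc - prev_cyc))
        prev_miss, prev_cyc = miss, cyc
    return modes

# Toy trace: a memory-bound phase (many misses) then a compute-bound phase.
trace = [(0, 0), (500, 100_000), (900, 200_000), (905, 300_000), (910, 400_000)]
print(pmu_loop(trace))  # [550, 550, 900, 900]
```

Differencing cumulative counters, rather than reading rate registers, keeps the hardware simple and lets the polling interval be tuned entirely in software.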
Fig. 5. Architectural simulation of Linux boot and energy analysis at 0.9-V/250-MHz and 0.55-V/50-MHz operating modes.
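The relative wall-time and dynamic-energy model used in this section's analysis can be written out directly. The cycle counts below are made-up placeholders chosen only to exercise the formulas (the operating points match the 0.9-V/250-MHz and 0.55-V/50-MHz modes of Fig. 5); the 69%/34% savings reported in the text come from simulated counter data not reproduced here:

```python
# Model of relative wall time (t%) and dynamic energy savings (E%) when
# the cycles below the miss-rate threshold are moved to a low
# voltage/frequency state. cycles_hi / cycles_lo are counted with
# everything running in the high state.
def dvfs_model(cycles_hi, cycles_lo, ipc_lo, f_hi, f_lo, vdd_hi, vdd_lo):
    # Busy (retiring) cycles stretch by f_hi/f_lo; stall cycles, bounded
    # by memory latency rather than the core clock, do not.
    cycles_lo_rel = ipc_lo * cycles_lo * (f_hi / f_lo) + (1 - ipc_lo) * cycles_lo
    # Additional wall time as a percentage of the all-high baseline.
    t_pct = 100.0 * (cycles_hi + cycles_lo_rel) / (cycles_hi + cycles_lo) - 100.0
    # Dynamic energy scales as V^2 per cycle; the f_lo/f_hi factor converts
    # stretched reference cycles back into actual low-state clock cycles.
    e_rel = cycles_hi + cycles_lo_rel * (vdd_lo / vdd_hi) ** 2 * (f_lo / f_hi)
    e_pct = 100.0 - 100.0 * e_rel / (cycles_hi + cycles_lo)
    return t_pct, e_pct

# Placeholder inputs for illustration only.
t_pct, e_pct = dvfs_model(cycles_hi=6e8, cycles_lo=4e8, ipc_lo=0.1,
                          f_hi=250e6, f_lo=50e6, vdd_hi=0.9, vdd_lo=0.55)
print(f"extra wall time: {t_pct:.0f}%, energy savings: {e_pct:.0f}%")
```

Sweeping the miss-rate threshold shifts cycles between cycles_hi and cycles_lo, which is how the model traces out the savings-versus-slowdown tradeoff curve of Fig. 5.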
TABLE I
COMPARISON WITH STATE OF THE ART. *ESTIMATED

The effectiveness of the fully programmable PMU is demonstrated with a program that monitors the rate of L2 cache misses and the average number of instructions retired per cycle (IPC) of the compute core, changing the voltage mode of the compute core in response while keeping the L2 supply constant. In this experiment, the frequency is held constant, although it is possible to implement adaptive clocking as in previous designs [5]. The PMU polls these counters every 100 μs and records their values while the voltage of the compute core is monitored with an external oscilloscope, as shown in Fig. 3. Each time the counters are polled, the current and prior cumulative cache miss counts are subtracted to determine the miss rate. When the program detects fewer than one cache miss per 1000 cycles, it infers that the compute core is in a compute-bound section of the application and thus raises the voltage mode to the maximum, 900 mV. Otherwise, the program infers that the compute core is in a memory-bound section and reduces the operating voltage.

Fig. 3 illustrates this PMU program executing mode transitions while the compute core performs a matrix multiplication sized to fit in the L2 cache. This compute kernel is chosen to highlight two distinct program phases: a memory-dominated phase arising from compulsory cache misses at the start of the program, and a compute-driven phase after the L2 is full. The behavior of workloads with more complex memory access patterns can be extrapolated from the phases in this example.

The threshold of one miss per 1000 cycles has been set empirically for this application but, given its programmable nature, can be tuned for different workloads. Using FireSim [21], an architectural simulation of a similarly configured core demonstrates this technique during Linux boot, shown in Fig. 5. While the first-stage bootloader implementation spins in a loop, causing IPC to stay near 1, the subsequent activity is shown to benefit from DVFS. From the simulated cache miss counter data and the measured voltage and frequency data, the relative changes to wall time, t_%, and dynamic energy, E_%, are modeled with the following formulas. The time spent in the low voltage and frequency state, cycles_lo,rel, is calculated relative to the cycles in the high state, cycles_hi, by scaling the instructions retired per cycle when the miss rate is below the threshold, IPC_lo, by the ratio of the high and low frequencies:

  cycles_lo,rel = IPC_lo · cycles_lo · (f_hi / f_lo) + (1 − IPC_lo) · cycles_lo

  t_% = 100% · (cycles_hi + cycles_lo,rel) / (cycles_hi + cycles_lo) − 100%

  E_% = 100% − 100% · (cycles_hi + cycles_lo,rel · (vdd_lo / vdd_hi)² · (f_lo / f_hi)) / (cycles_hi + cycles_lo)

A range of thresholds and polling frequencies, shown in Fig. 5, allows for control between a 69% relative energy savings for 113% additional wall time and a less aggressive 34% energy savings for only 6% additional wall time. By increasing the miss rate threshold, the voltage state is changed less frequently and only when more or larger memory accesses are performed, leading to an increase in idle time while the core voltage is high. Decreasing the miss rate threshold causes the voltage state to be more sensitive to memory activity, which may increase the nonidle time spent at the lower voltage state, thus hurting performance.

IV. CONCLUSION

This brief presents a system that can respond to workload characteristics with DVS entirely on-chip. An implementation is fabricated in 28-nm FD-SOI technology and tested to prove the efficacy of this technique. Tight integration allows rapid and precise adaptation to the variations seen within and among workloads on a typical system-on-chip. Future work includes introducing heterogeneity to reduce the cost of the PMU, adding higher fidelity microarchitectural counters, further reducing the size of voltage and frequency domains, and implementing a profiling scheme to build models that adaptively tune the power management parameters to workloads.

ACKNOWLEDGMENT

The authors would like to thank STMicroelectronics for donating fabrication, Cadence and Synopsys for CAD tools, and Xilinx for FPGAs. The authors would also like to thank the sponsors, students, faculty, and staff of the Berkeley Wireless Research Center (BWRC), especially James Dunn, Brian Richards, and Anita Flynn.