
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 28, NO. 12, DECEMBER 2020

Transactions Briefs
A Dual-Core RISC-V Vector Processor With On-Chip Fine-Grain Power
Management in 28-nm FD-SOI
John Charles Wright, Colin Schmidt, Ben Keller, Daniel Palmer Dabbelt, Jaehwa Kwak, Vighnesh Iyer,
Nandish Mehta, Pi-Feng Chiu, Stevo Bailey, Krste Asanović, and Borivoje Nikolić
Abstract— This work demonstrates a dual-core RISC-V system-on-chip (SoC) with integrated fine-grain power management. The 28-nm fully depleted silicon-on-insulator (FD-SOI) SoC integrates switched-capacitor voltage converters and 4-Gb/s off-chip serial links. The SoC runs applications with operating system support on dual RISC-V Rocket cores with vector accelerators. Runtime monitoring of microarchitectural counters allows prediction of future compute intensity, enabling the voltage state of the managed core to be adjusted quickly to optimize energy efficiency without sacrificing overall performance.

Index Terms— DC–DC converter, multicore processing, power management, RISC-V processor, system-on-chip (SoC), vector processor.

Manuscript received April 13, 2020; revised June 27, 2020 and August 29, 2020; accepted September 18, 2020. Date of publication October 27, 2020; date of current version November 24, 2020. This work was supported in part by Defense Advanced Research Projects Agency (DARPA) Power Efficiency for Embedded Computing Technologies (PERFECT) under Award HR0011-12-2-0016 and in part by Intel Science and Technology Centers (ISTC) on Agile Design. (John Charles Wright, Colin Schmidt, and Ben Keller contributed equally to this work.) (Corresponding author: John Charles Wright.) The authors are with the Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA 94720 USA (e-mail: johnwright@berkeley.edu). Color versions of one or more of the figures in this article are available online at https://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TVLSI.2020.3030243

I. INTRODUCTION

Modern computing applications continue to demand increasing performance under a fixed power envelope. With the end of Dennard scaling, the power consumption from performance improvements must be offset by architectural and circuit design techniques to increase energy efficiency. This brief demonstrates a dual-core RISC-V processor that combines on-chip dynamic voltage scaling (DVS) with accurate, tightly integrated monitoring of real-time application performance requirements and specialization for data-parallel applications.

Since its early demonstration [1], DVS has become an indispensable technique for power management in digital systems. The basic concept trades off an approximately linear reduction in system performance for quadratic savings in switching energy and more-than-quadratic savings in leakage energy with the scaled supply. Given a bounded system throughput requirement, efficient use of DVS reduces total application energy accordingly [2]. Although the overall energy efficiency of the processor system is determined in part by the conversion efficiency of voltage regulators, it also depends on the ability to predict upcoming workloads and on the speed of the voltage transition. These changes in voltage–frequency state must be performed quickly to adapt efficiently to the dynamic performance requirements of an application. Energy is wasted when compute intensity changes and the core remains in a suboptimal voltage–frequency state, so reducing the latency of voltage–frequency state transitions improves overall system energy efficiency. Similarly, decreasing the size of voltage–frequency domains enables finer-resolution changes in system performance, reducing the energy overhead due to spatial quantization of voltage domains. Recent mainstream products employ per-core supply voltages in a multicore microprocessor system by using platform-level voltage regulator modules and integrated per-core linear dropout regulators to achieve 19% power savings [3]. Additionally, voltage scaling can be used in conjunction with circuit-level error detection techniques to dynamically recover supply voltage margin [4].

This brief describes a system with fine-grain voltage domains powered by fully integrated, on-chip, high-density, switched-capacitor dc–dc converters capable of fast mode switching and supported by architectural counters. By switching from a nominal supply voltage of 0.9 V during active compute periods to 0.55 V during idle, memory-bound periods, this scheme can approach an ideal 62% reduction in dynamic switching power during idle phases with little penalty to overall compute throughput.
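As a point of reference for the quoted figure, if dynamic switching power is assumed to scale with the square of the supply voltage at fixed switching activity, the two voltage levels above imply an idle-phase saving of

\[
1 - \left(\frac{0.55\,\mathrm{V}}{0.9\,\mathrm{V}}\right)^{2} \approx 1 - 0.37 = 0.63,
\]

in line with the ideal 62% reduction cited for idle, memory-bound phases.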
II. SYSTEM DESCRIPTION

The system, shown in Fig. 1, comprises a dual-core, 64-bit RISC-V processor with a custom vector accelerator, manufactured in a 28-nm fully depleted silicon-on-insulator (FD-SOI) technology. The system is the first reported RISC-V multicore design to place each core in its own voltage domain and utilize high-speed serial links for memory traffic. The supply voltage for each core is provided by a bank of 24 on-chip switched-capacitor dc–dc converter cells [2], which can be bypassed if needed. This system includes the same dc–dc conversion subsystem as [5], which is capable of switching modes within 2 μs. Each core can be clocked by an externally supplied clock or an on-chip adaptive clock generator [6]. The clock selection is made by writing the select value to a memory-mapped register using the off-chip memory interface while the cores are in reset. The two cores share a 256-KiB L2 cache, which is in a separate, fixed voltage and clock domain [2]. The L2 cache is also responsible for routing memory traffic to memory-mapped control and status registers. It includes a set of counters that track how many total memory accesses have hit and missed. The L2 cache has two possible paths for backing memory: a custom, low-speed, eight-bit parallel interface, or a bank of eight high-speed serial links. Consistent with our goal of innovating at the circuit level while maintaining a functional system, the parallel interface is included as a backup to the experimental high-speed interface. Each of these paths forwards memory transactions to a separate field-programmable gate array (FPGA) board which contains the backing dynamic random access memory (DRAM) for the system. The system also contains multiple, distributed ring-oscillator-based temperature sensors [7] and a body-bias generator [8].

The Rocket application processors are scalar, in-order, single-issue cores with five pipeline stages [9]. This version of Rocket supports version 2.1-draft of the RISC-V RV64G instruction set architecture (ISA) variant with supervisor mode. These Rocket instances were generated from parameters chosen for a typical in-order, general-purpose core. Each has a 64-entry branch target buffer, a 256-entry two-level branch predictor, a return address stack, a two-cycle latency on single-precision fused multiply-adds (FMAs), a three-cycle latency on double-precision FMAs, and separate 32-KiB instruction and data caches.


Fig. 1. System block diagram.

The Rocket core contains a set of performance counters that track how many instructions have been retired and the number of cycles executed.
The custom vector accelerator, Hwacha [10], is a configurable,
multilane decoupled vector pipeline optimized for an application-specific
integrated circuit (ASIC) process that executes the Hwacha
ISA version 3.8.1 [11]. In this design, each core is configured with
a single-lane variant. Hwacha more closely resembles traditional
Cray [12] vector pipelines than the single instruction, multiple data
(SIMD) units in streaming SIMD extensions (SSE) or advanced vec-
tor extensions (AVX) [13]. Hwacha improves efficiency by offloading
vector instructions from the scalar core to the vector unit, allowing the
scalar core to continue to execute in parallel and enabling the vector
unit to prefetch its own data from memory [14] effectively. Hwacha
is connected to the scalar core via the Rocket Custom Coprocessor
(RoCC) interface [9]. This instance of Hwacha includes four banks of
256 × 128 dual-port static random access memories (SRAMs) for the
vector register file, per-bank integer arithmetic logic units (ALUs),
four double-precision FMA units, eight single-precision FMA units,
16 half-precision FMA units, eight master sequencer slots, a 16-KiB
vector instruction cache, and a single 128-bit-wide port to the L2.
The cores operate independently and are cache-coherent. The SoC can execute programs with multiple concurrently running threads, with each core capable of performing its own power management, which is done by setting its dc–dc configuration register to one of three switching modes: 1/2 × 1 V, 2/3 × 1 V, or 1/2 × 1.8 V [5]. These registers are memory-mapped and accessible by either core or the external memory interface. When adaptive clocking [6] is enabled, the clock will automatically transition along with the changing voltage profile.
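As a software-level illustration, a mode change amounts to a single store to a core's memory-mapped dc–dc configuration register, along the lines of the sketch below; the register address and mode encodings are invented placeholders, since the brief does not publish the actual memory map.

```c
#include <stdint.h>

/* Hypothetical address of core 0's dc-dc configuration register; the real
 * memory map is not given in this brief. */
#define DCDC_CFG_CORE0 ((volatile uint32_t *)0x02004000u)

/* Hypothetical encodings of the three switched-capacitor modes. */
enum dcdc_mode {
    DCDC_MODE_HALF_1V0  = 0,  /* 1/2 x 1.0 V -> ~0.5 V  */
    DCDC_MODE_2_3RD_1V0 = 1,  /* 2/3 x 1.0 V -> ~0.67 V */
    DCDC_MODE_HALF_1V8  = 2   /* 1/2 x 1.8 V -> ~0.9 V  */
};

static inline void set_core0_dcdc_mode(enum dcdc_mode m)
{
    /* One store requests the mode switch; the converter bank settles within ~2 us. */
    *DCDC_CFG_CORE0 = (uint32_t)m;
}
```

As noted above, either core or the external memory interface can perform such a store.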
It is possible to write customized power management code for a specific application, but this either requires building a specialized toolchain or embedding power management code into the application code. A more productive approach is to use the standard, open-source RISC-V toolchain to build the application and provide autonomous power management code separately. To demonstrate such autonomous power management in this system, one core is used as an applications core and the second core is programmed to act as a power management unit (PMU) for the first. The PMU can execute independent power management code which monitors the performance and power utilization of the system by reading a set of memory-mapped counters and control registers.

Fig. 2. Chip micrograph.


Fig. 3. DC–DC mode transitions in response to cache activity for a matrix multiplication workload.

Switched-capacitor dc–dc converters offer the ability to perform


direct power measurements by monitoring the toggle rate of the
switches within the converter [5]. In a contrasting approach used
by this work, architectural counters offer a prediction of near-future
workloads [17]–[20]. The set of counters implemented in this chip
enables power management by measuring the short-term off-chip
memory bandwidth used by the L2 cache, the rate of instruction
execution, and the rate of energy consumption by the dc–dc convert-
ers. Prior work has identified multiple classes of counters to observe
the state of each core, queues, and the memory system [18], [19].
This work utilizes cache miss counters to provide insight into near-
term core activity, as a cache miss will cause a core to be idle for
hundreds of cycles. Therefore, by using on-chip dc–dc converters
with small response times, the core supply can be lowered until the
data are retrieved without losing state, after which the voltage can
be quickly increased to achieve the desired throughput. Although not
demonstrated in this work, other energy-saving techniques, such as
reverse body biasing [8], can also be actuated in a similar manner.
The PMU is programmable using standard C with header files that define the locations of the performance and system monitor counters. The fully featured PMU architecture enables many experiments and studies on effective power management strategies in this system. However, a more production-level system might instead include a tiny, feature-reduced core as a dedicated PMU or, alternatively, use the OS or interrupts to run power management code on one of the application cores only a fraction of the time.
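The brief does not reproduce these headers, but their general shape follows from the description; a minimal sketch might look as follows, with every address and name a hypothetical placeholder rather than the chip's actual memory map.

```c
/* pmu_counters.h -- illustrative sketch only; all addresses are placeholders. */
#ifndef PMU_COUNTERS_H
#define PMU_COUNTERS_H

#include <stdint.h>

/* Shared L2 counters (fixed voltage/clock domain). */
#define L2_CTR_HITS         (*(volatile uint64_t *)0x02010000u)
#define L2_CTR_MISSES       (*(volatile uint64_t *)0x02010008u)

/* Compute-core performance counters. */
#define CORE0_INSTRET       (*(volatile uint64_t *)0x02020000u)
#define CORE0_CYCLES        (*(volatile uint64_t *)0x02020008u)

/* Compute-core dc-dc configuration register and converter toggle counter. */
#define CORE0_DCDC_CFG      (*(volatile uint32_t *)0x02004000u)
#define CORE0_DCDC_TOGGLES  (*(volatile uint64_t *)0x02004008u)

#endif /* PMU_COUNTERS_H */
```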
The chip is fabricated in 28-nm FD-SOI, measuring 2.8 mm by 2.8 mm, as shown in Fig. 2. To run tests, the chip is connected to an FPGA board for tethering, control, and backing memory. The chip is capable of running complex workloads, including booting Linux and running applications under operating-system support, through both the slow parallel interface and a high-speed link. This system includes many features of embedded systems that enable it to be an effective test platform for fine-grain adaptive dynamic voltage and frequency scaling (DVFS) experiments. Table I demonstrates the efficacy of this system compared to prior work.

The chip is tested by running a double-precision general matrix multiplication (DGEMM) and verifying correctness while sweeping voltage and frequency to determine the optimal operating points for the available power states. Fig. 3 demonstrates the voltage and microarchitectural counter states during a DGEMM execution, while Fig. 4 displays the results of the sweep of operating points, outlining the optimal voltage–frequency curve. The nonmonotonic behavior around 175 MHz is attributed to a small supply resonance observed on-die.

Fig. 4. DGEMM Shmoo plot.

III. RESULTS

The most energy-efficient operating point at which the system operates error-free is 525 mV at 28.3 MHz and results in a core energy efficiency of 19.6 GFLOPS/W. When compared to our previous work [5], this work is less efficient in this application because the cores are not clock gated and its larger die requires a slightly higher minimum operating voltage. This operating mode uses the second, full-featured core as a PMU. This core contains inactive hardware such as vector and floating-point units and is clocked at a fixed frequency; it is, therefore, a pessimistic efficiency model for a comparable single-core system.


Fig. 5. Architectural simulation of Linux boot and energy analysis at 0.9-V/250-MHz and 0.55-V/50-MHz operating modes.

TABLE I
COMPARISON WITH STATE OF THE ART. *ESTIMATED

The effectiveness of the fully programmable PMU is demonstrated with a program that monitors the rate of L2 cache misses and the average number of instructions-retired-per-cycle (IPC) of the compute core, changing the voltage mode of the compute core in response, while keeping the L2 supply constant. In this experiment, the frequency is held constant, although it is possible to implement adaptive clocking as in previous designs [5]. The PMU polls these counters every 100 μs and records their values while the voltage of the compute core is monitored with an external oscilloscope, as shown in Fig. 3. Each time the counter is polled, the current and prior cumulative cache miss counts are subtracted to determine the miss rate. When the program detects fewer than one cache miss per 1000 cycles, it infers that the compute core is in a compute-bound section of the application and thus increases the voltage mode to the maximum, 900 mV. Otherwise, the program infers that the compute core is in a memory-bound section and reduces the operating voltage.
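The polling loop just described can be sketched as follows; only the 100-μs period, the one-miss-per-1000-cycles threshold, and the two voltage modes come from the text, while the accessor functions are assumed stand-ins for the memory-mapped reads and writes discussed in Section II.

```c
#include <stdint.h>

#define POLL_INTERVAL_US 100u                 /* polling period from the text */
#define MISS_THRESHOLD   (1.0 / 1000.0)       /* < 1 miss per 1000 cycles     */

/* Assumed accessors for memory-mapped counters and registers (not the real API). */
extern uint64_t read_l2_misses(void);
extern uint64_t read_core_cycles(void);
extern void set_dcdc_mode_high(void);         /* raise core supply to 900 mV      */
extern void set_dcdc_mode_low(void);          /* drop core supply for idle phases */
extern void delay_us(unsigned int us);

void pmu_loop(void)
{
    uint64_t prev_misses = read_l2_misses();
    uint64_t prev_cycles = read_core_cycles();

    for (;;) {
        delay_us(POLL_INTERVAL_US);

        uint64_t misses = read_l2_misses();
        uint64_t cycles = read_core_cycles();
        uint64_t d_miss = misses - prev_misses;   /* misses in this window */
        uint64_t d_cyc  = cycles - prev_cycles;   /* cycles in this window */
        prev_misses = misses;
        prev_cycles = cycles;

        if (d_cyc == 0)
            continue;                             /* nothing to classify yet */

        double miss_rate = (double)d_miss / (double)d_cyc;
        if (miss_rate < MISS_THRESHOLD)
            set_dcdc_mode_high();                 /* compute-bound section */
        else
            set_dcdc_mode_low();                  /* memory-bound section  */
    }
}
```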
Fig. 3 illustrates this PMU program executing mode transitions while the compute core performs a matrix multiplication, sized to fit in the L2 cache. This compute kernel is chosen to highlight two distinct program phases: A memory-dominated phase arising from compulsory cache misses at the start of the program and a compute-driven phase after the L2 is full. The behavior of workloads with more complex memory access patterns can be extrapolated from the phases in this example.

The threshold of one miss per 1000 cycles has been set empirically for this application but, given its programmable nature, can be tuned for different workloads. Using FireSim [21], an architectural simulation of a similarly configured core demonstrates this technique during Linux boot, shown in Fig. 5. While this first stage bootloader implementation spins in a loop, causing IPC to stay near 1, the subsequent activity is shown to benefit from DVFS. From the simulated cache miss counter data and measured voltage and frequency data, the relative changes to wall time, t_%, and dynamic energy, E_%, are modeled with the following formulas. The time spent in the low voltage and frequency state, cycles_{lo,rel}, is calculated relative to the cycles in the high state, cycles_{hi}, by scaling the instructions retired per cycle when the miss rate is below the threshold, IPC_{lo}, by the ratio of high and low frequencies:

\[
\mathrm{cycles}_{lo,rel} = \mathrm{IPC}_{lo} \cdot \mathrm{cycles}_{lo} \cdot \frac{f_{hi}}{f_{lo}} + (1 - \mathrm{IPC}_{lo}) \cdot \mathrm{cycles}_{lo}
\]

\[
t_{\%} = \frac{\mathrm{cycles}_{hi} + \mathrm{cycles}_{lo,rel}}{\mathrm{cycles}_{hi} + \mathrm{cycles}_{lo}} - 100\%
\]

\[
E_{\%} = 100\% - \frac{\mathrm{cycles}_{hi} + \mathrm{cycles}_{lo,eff} \cdot \left(\frac{v_{dd,lo}}{v_{dd,hi}}\right)^{2} \cdot \frac{f_{lo}}{f_{hi}}}{\mathrm{cycles}_{hi} + \mathrm{cycles}_{lo}}
\]
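As a worked sketch of the model, the two expressions can be evaluated directly from cycle-count totals; the C below mirrors the formulas above. The inputs in main are placeholders chosen only to exercise the functions (the 0.9-V/250-MHz and 0.55-V/50-MHz operating modes are those in Fig. 5), not measured results.

```c
#include <stdio.h>

/* Relative change in wall time, in percent (positive means slower). */
static double wall_time_pct(double cyc_hi, double cyc_lo, double ipc_lo,
                            double f_hi, double f_lo)
{
    double cyc_lo_rel = ipc_lo * cyc_lo * (f_hi / f_lo) + (1.0 - ipc_lo) * cyc_lo;
    return 100.0 * (cyc_hi + cyc_lo_rel) / (cyc_hi + cyc_lo) - 100.0;
}

/* Relative dynamic-energy saving, in percent. */
static double energy_saving_pct(double cyc_hi, double cyc_lo, double cyc_lo_eff,
                                double vdd_lo, double vdd_hi,
                                double f_lo, double f_hi)
{
    double scale = (vdd_lo / vdd_hi) * (vdd_lo / vdd_hi) * (f_lo / f_hi);
    return 100.0 - 100.0 * (cyc_hi + cyc_lo_eff * scale) / (cyc_hi + cyc_lo);
}

int main(void)
{
    /* Placeholder counter totals purely to exercise the formulas. */
    double t = wall_time_pct(1.0e6, 5.0e5, 0.2, 250.0e6, 50.0e6);
    double e = energy_saving_pct(1.0e6, 5.0e5, 5.0e5, 0.55, 0.9, 50.0e6, 250.0e6);
    printf("t%% = %.1f, E%% = %.1f\n", t, e);
    return 0;
}
```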
A range of thresholds and polling frequencies, shown in Fig. 5, allows for control between a 69% relative energy savings for 113% additional wall time and a less aggressive 34% energy savings for only 6% additional wall time. By increasing the miss rate threshold, the voltage state is changed less frequently and only when more or larger memory accesses are performed, leading to an increase in idle time while the core voltage is high. Decreasing the miss rate threshold causes the voltage state to be more sensitive to memory activity, which may increase the nonidle time spent at the lower voltage state, thus hurting performance.

IV. CONCLUSION

This brief presents a system which can respond to workload characteristics with DVS completely on-chip. An implementation is fabricated in 28-nm FD-SOI technology and tested to prove the efficacy of this technique. Tight integration allows rapid and precise adaptation to the variations seen within and among workloads on a typical system-on-chip. Future work includes introducing heterogeneity to reduce the cost of the PMU, adding higher fidelity microarchitectural counters, further reducing the size of voltage and frequency domains, and implementing a profiling scheme to build models that adaptively tune the power management parameters to workloads.

ACKNOWLEDGMENT

The authors would like to thank STMicroelectronics for donating fabrication, Cadence and Synopsys for CAD tools, and Xilinx for FPGAs. The authors would also like to thank the sponsors, students, faculty, and staff of the Berkeley Wireless Research Center (BWRC), especially James Dunn, Brian Richards, and Anita Flynn.


REFERENCES

[1] T. D. Burd, T. A. Pering, A. J. Stratakos, and R. W. Brodersen, “A dynamic voltage scaled microprocessor system,” IEEE J. Solid-State Circuits, vol. 35, no. 11, pp. 1571–1580, Nov. 2000.
[2] B. Zimmer et al., “A RISC-V vector processor with tightly-integrated switched-capacitor DC-DC converters in 28nm FDSOI,” in Proc. Symp. VLSI Circuits (VLSI Circuits), Jun. 2015.
[3] S. Naffziger, K. Lepak, M. Paraschou, and M. Subramony, “2.2 AMD chiplet architecture for high-performance server and desktop products,” in IEEE ISSCC Dig. Tech. Papers, Feb. 2020, pp. 44–45.
[4] D. Ernst et al., “Razor: A low-power pipeline based on circuit-level timing speculation,” in Proc. 22nd Digit. Avionics Syst. Conf., Dec. 2003, pp. 7–18.
[5] B. Keller et al., “A RISC-V processor SoC with integrated power management at submicrosecond timescales in 28 nm FD-SOI,” IEEE J. Solid-State Circuits, vol. 52, no. 7, pp. 1863–1875, Jul. 2017.
[6] J. Kwak and B. Nikolic, “A 550–2260MHz self-adjustable clock generator in 28 nm FDSOI,” in Proc. IEEE Asian Solid-State Circuits Conf. (A-SSCC), Nov. 2015, pp. 1–4.
[7] M. Cochet et al., “A 225 μm² probe single-point calibration digital temperature sensor using body-bias adjustment in 28 nm FD-SOI CMOS,” IEEE Solid-State Circuits Lett., vol. 1, no. 1, pp. 14–17, Jan. 2018.
[8] M. Blagojevic, M. Cochet, B. Keller, P. Flatresse, A. Vladimirescu, and B. Nikolic, “A fast, flexible, positive and negative adaptive body-bias generator in 28nm FDSOI,” in Proc. IEEE Symp. VLSI Circuits (VLSI-Circuits), Jun. 2016, pp. 1–2.
[9] K. Asanović et al., “The rocket chip generator,” EECS Dept., Univ. California, Berkeley, CA, USA, Tech. Rep. UCB/EECS-2016-17, Apr. 2016. [Online]. Available: http://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-17.html
[10] C. Schmidt and A. Ou, “Hwacha: A data-parallel RISC-V extension and implementation,” in Proc. Inaugural RISC-V Summit, Santa Clara, CA, USA, Dec. 2018. [Online]. Available: https://riscv.org//wp-content/uploads/2018/12/Hwacha-A-Data-Parallel-RISC-V-Extension-and-Implementation-Schmidt-Ou-.pdf
[11] Y. Lee, C. Schmidt, A. Ou, A. Waterman, and K. Asanović, “The Hwacha vector-fetch architecture manual, version 3.8.1,” EECS Dept., Univ. California, Berkeley, CA, USA, Tech. Rep. UCB/EECS-2015-262, Dec. 2015. [Online]. Available: http://www2.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-262.html
[12] R. M. Russell, “The CRAY-1 computer system,” Commun. ACM, vol. 21, no. 1, pp. 63–72, Jan. 1978, doi: 10.1145/359327.359336.
[13] C. Lomont, “Introduction to Intel advanced vector extensions,” Intel, Santa Clara, CA, USA, White Paper 183287, 2011. [Online]. Available: https://software.intel.com/content/dam/develop/external/us/en/documents/intro-to-intelavx-183287.pdf
[14] Y. Lee, A. Ou, C. Schmidt, S. Karandikar, H. Mao, and K. Asanović, “The Hwacha microarchitecture manual, version 3.8.1,” EECS Dept., Univ. California, Berkeley, CA, USA, Tech. Rep. UCB/EECS-2015-263, Dec. 2015. [Online]. Available: http://www2.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-263.html
[15] T. Webel et al., “Robust power management in the IBM z13,” IBM J. Res. Develop., vol. 59, nos. 4–5, p. 16, Jul./Sep. 2015.
[16] E. Rotem, A. Naveh, D. Rajwan, A. Ananthakrishnan, and E. Weissmann, “Power management architecture of the 2nd generation Intel core microarchitecture, formerly codenamed Sandy Bridge,” in Proc. IEEE Hot Chips 23 Symp. (HCS), Aug. 2011, pp. 1–33.
[17] D. C. Snowdon, S. M. Petters, and G. Heiser, “Accurate on-line prediction of processor and memory energy usage under voltage scaling,” in Proc. 7th ACM IEEE Int. Conf. Embedded Softw. (EMSOFT), Sep. 2007, pp. 84–93.
[18] E. Rotem, A. Naveh, A. Ananthakrishnan, E. Weissmann, and D. Rajwan, “Power-management architecture of the Intel microarchitecture code-named Sandy Bridge,” IEEE Micro, vol. 32, no. 2, pp. 20–27, Mar. 2012.
[19] S. Eyerman and L. Eeckhout, “Fine-grained DVFS using on-chip regulators,” ACM Trans. Archit. Code Optim., vol. 8, no. 1, pp. 1–24, Apr. 2011.
[20] R. Rodrigues, A. Annamalai, I. Koren, and S. Kundu, “A study on the use of performance counters to estimate power in microprocessors,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 60, no. 12, pp. 882–886, Dec. 2013.
[21] S. Karandikar et al., “FireSim: FPGA-accelerated cycle-exact scale-out system simulation in the public cloud,” in Proc. ACM/IEEE 45th Annu. Int. Symp. Comput. Archit. (ISCA), Piscataway, NJ, USA, Jun. 2018, pp. 29–42, doi: 10.1109/ISCA.2018.00014.
