AN AGILE APPROACH TO BUILDING RISC-V MICROPROCESSORS

Yunsup Lee, Andrew Waterman, Henry Cook, Brian Zimmer, Ben Keller, Alberto Puggelli, Jaehwa Kwak, Ruzica Jevtić, Stevo Bailey, Milovan Blagojević, Pi-Feng Chiu, Rimas Avizienis, Brian Richards, Jonathan Bachrach, David Patterson, Elad Alon, Borivoje Nikolić, Krste Asanović

THE AUTHORS ADOPTED AN AGILE HARDWARE DEVELOPMENT METHODOLOGY FOR 11 RISC-V MICROPROCESSOR TAPE-OUTS ON 28-NM AND 45-NM CMOS PROCESSES. THIS ENABLED SMALL TEAMS TO QUICKLY BUILD ENERGY-EFFICIENT, COST-EFFECTIVE, AND COMPETITIVE HIGH-PERFORMANCE MICROPROCESSORS. THE AUTHORS PRESENT A CASE STUDY OF ONE PROTOTYPE FEATURING A RISC-V VECTOR MICROPROCESSOR INTEGRATED WITH SWITCHED-CAPACITOR DC–DC CONVERTERS AND AN ADAPTIVE CLOCK GENERATOR IN A 28-NM, FULLY DEPLETED SILICON-ON-INSULATOR PROCESS.

The era of rapid transistor performance improvement is coming to an end, increasing pressure on architects and circuit designers to improve energy efficiency. Modern systems on chip (SoCs) incorporate a large and growing number of specialized hardware units to execute specific tasks efficiently, often organized into multiple dynamic voltage and clock frequency domains to further save energy under varying computational loads. This growing complexity has led to a design productivity crisis, stimulating development of new tools and methodologies to enable the completion of complex chip designs on schedule and within budget.

We propose leveraging lessons learned from the software world by applying aspects of the software agile development model to hardware. Traditionally, software was developed via the waterfall development model, a sequential process that consists of distinct phases that rigidly follow one another, just as is done for hardware design today. Over-budget, late, and abandoned software projects were commonplace, motivating a revolution in software development, demarcated by the publication of The Agile Manifesto in 2001.¹ The philosophy of agile software development emphasizes individuals and interactions over processes and tools, working software over comprehensive documentation, customer collaboration over contract negotiation, and responding to change over following a plan. In practice, the agile approach leads to small teams iteratively refining a set of working-but-incomplete prototypes until the end result is acceptable.

Inspired by the positive disruption of The Agile Manifesto on software development, we propose a set of principles to guide a new agile hardware development methodology. Our proposal is informed by our experiences as a small group of researchers designing and fabricating 11 processor chips in five years. Lacking the massive resources of industrial design teams, we were forced to abandon standard industry practices and explore different approaches to hardware design. We detail this approach’s benefits with a case study of one of these chips, Raven-3, which achieved

8 Published by the IEEE Computer Society 


0272-1732/16/$33.00 © 2016 IEEE

Authorized licensed use limited to: NED UNIV OF ENGINEERING AND TECHNOLOGY. Downloaded on November 15,2023 at 21:35:43 UTC from IEEE Xplore. Restrictions apply.
groundbreaking energy efficiency by integrating a novel RISC-V vector architecture with efficient on-chip DC–DC converters. Taken together, our experiences developing these academic prototype chips make a powerful argument for the promise of agile hardware design in the post-Moore’s law era.

An Agile Hardware Manifesto

Borrowing heavily from the agile software development manifesto, we propose the following principles for hardware design:

• Incomplete, fabricatable prototypes over fully featured models.
• Collaborative, flexible teams over rigid silos.
• Improvement of tools and generators over improvement of the instance.
• Response to change over following a plan.

Here, we explain the importance of these principles for building hardware and contrast them with the original principles of agile software development.

Incomplete, Fabricatable Prototypes over Fully Featured Models

We believe the primary benefit of adopting the agile model for hardware design is that it lets us drastically reduce the cost of verification and validation by iterating through many working prototypes of our designs. In this article, we use the standard definition of verification as testing that determines whether each component and the assembly of components correctly meets its specification (“Are we building the thing right?”). In hardware design, the term validation is usually narrowly applied to postsilicon testing to ensure that manufactured parts operate within design parameters. We use the broader and, to our minds, more useful definition of validation from the software world, in which validation ensures that the product serves its intended purposes (“Are we building the right thing?”).

Figure 1 contrasts the agile hardware development model with the waterfall model. The waterfall model, shown in Figure 1a, relies on Gantt charts, high-level architecture models, rigid processes such as register-transfer level (RTL) freezes, and CPU centuries of simulations in an attempt to achieve a single viable design point. In our agile hardware methodology, we first push a trivial prototype with a minimal working feature set all the way through the toolflow to a point where it could be taped out for fabrication. We refer to this tape-out-ready design as a tape-in. Then we begin adding features iteratively, moving from one fabricatable tape-in to the next as we add features. The agile hardware methodology will always have an available tape-out candidate with some subset of features for any tape-out deadline, whereas the conventional waterfall approach is in danger of grossly overshooting any tape-out deadline because of issues at any stage of the process.

Although the agile methodology provides some benefits in terms of verification effort, it particularly shines at validation; agile designs are more likely to meet energy and performance expectations using the iterative process. Unlike conventional approaches that use high-level architectural models of the whole design for validation with the intent to later refine these models into an implementation, we emphasize validation using a full RTL implementation of each incomplete prototype with the intent to add features if there is time. Unlike high-level architectural models, the actual RTL is guaranteed to be cycle accurate. Furthermore, the actual RTL design can be pushed through the entire very large-scale integration (VLSI) toolflow to give early feedback on back-end physical design issues that will affect the quality of results (QoR), including cycle time, area, and power. For example, Figure 1b shows a feature F1 being reworked into feature F1′ after validation showed that it would not meet performance or QoR goals, whereas feature F3 was dropped after validation showed that it exceeded physical design budgets or did not add sufficient value to the end system. These decisions on F1 and F3 can be informed by the results of the attempted physical design before completing the final feature F4.

Figure 1c illustrates our iterative approach to validation of a single prototype. Each circle’s circumference represents the relative time and effort it takes to begin validating a design using a particular methodology. Although the latency and cost of these

MARCH/APRIL 2016 9
HOT CHIPS

[Figure 1 graphic. (a) Waterfall model: features F1 through F4 pass together through specification, design, implementation, verification, and validation before a single tape-out. (b) Agile model: features F1, F2, F1′, F3, and F4 each pass through specification, design, implementation, and verification; the tape-in and validation rows show F1/F2, then F1′/F2/F3, then F1′/F2/F4, and the tape-out row shows F1′/F2/F4. (c) Nested circles around "specification + design + implementation", labeled C++, FPGA, ASIC flow, tape-in, small tape-out, and big tape-out.]

Figure 1. Contrasting the agile and waterfall models of hardware design. The labels FN represent various desired features of
the design. (a) The waterfall model steps all features through each activity sequentially, producing a tape-out candidate only
when all features are complete. (b) The agile model adds features incrementally, resulting in incremental tape-out candidates
as individual features are completed, reworked, or abandoned. (c) As validation progresses, designs are subjected to lengthier
and more accurate evaluation methodologies.

prototype-modeling efforts increase toward the outer circles, our confidence in the validation results produced increases in tandem. Using fabricatable prototypes increases validation bandwidth, because a complete RTL design can be mapped to field-programmable gate arrays (FPGAs) to run end-application software stacks orders of magnitude faster than with software simulators. In agile hardware development, the FPGA models of iterative prototype RTL designs (together with accompanying QoR numbers from the VLSI toolflow) fulfill the same function as working prototypes in agile software development, providing a way for customers to give early and frequent feedback to validate design choices. This enables customer collaboration as envisioned in the original agile software manifesto.

Collaborative, Flexible Teams over Rigid Silos

Traditional hardware design teams usually are organized around the waterfall model, with engineers assigned to particular functions, such as architecture specification, microarchitecture design, RTL design, verification, physical design, and validation. This specialization of skills is more extreme than for pure software development, as reflected in the specificity of designers’ job titles (architect, hardware engineer, or verification engineer). Sometimes, entire functions, such as back-end physical design, are even outsourced to different companies. As the design progresses through these functional stages, extensive effort is required to communicate design intent for each block, and consequently to understand the

10 IEEE MICRO
documentation received from the preceding stage. Misunderstandings can lead to subtle errors that require extensive verification to prevent, and QoR suffers because no engineer views the whole flow. End-to-end design iterations are too expensive to contemplate. Considerable effort is required to push high-level changes down through the implementation hierarchy, particularly across company boundaries. This leads to innovation-stunting practices such as the RTL freeze, in which only show-stopping bugs are allowed to unfreeze a design. In addition, balancing the workload across the different stages is difficult because engineers focused in one functional area often lack the training and experience to be effective in other roles.

In the agile hardware model, engineers work in collaborative, flexible teams that can drive a feature through all implementation stages, avoiding most of the documentation overhead required to map a feature to the hardware. Each team can iterate on the high-level architectural design with full visibility into low-level physical implementation and can validate system-level software impacts on an intermediate prototype. Figure 1b shows how multiple teams can work on different features in a pipelined fashion, each iterating through multiple levels of abstraction (as shown in Figure 1c). A given tape-in will accumulate some small number of feature updates and integrate these for whole-system validation. Engineers naturally develop a complete vertical skill set, and the focus of project management is on prioritizing features for implementation, not communication between silos.

Improvement of Tools and Generators over Improvement of the Instance

Modern software systems could not be built economically without the extreme reuse enabled by extensive software libraries written in modern programming languages, or without functioning compilers and an automated build process. In contrast, most commercial hardware design flows rely on rather primitive languages to describe hardware, and frequent manual intervention on tool outputs to work around tool bugs and long tool runtimes. The poor languages and buggy tools hinder development of truly reusable hardware components and lead to a focus on building a single instance instead of designing components that can be easily reused in future designs.

An agile hardware methodology relies on better hardware description languages (HDLs) to enable reuse. Instead of constructing an optimized instance, engineers develop highly parameterized generators that can be used to automatically explore a design space or to support future use in a different design. Better reuse is also the primary way in which an agile hardware methodology can reduce verification costs, because the verification infrastructure of these generators is also portable across designs.

An agile hardware methodology also strives to eliminate manual intervention in the chip’s build process. Replacing or fixing the many commercial CAD tools required to complete a chip design is not practical, but most of a chip’s flow can be effectively automated with sufficient effort. Perhaps the most labor-intensive part of a modern chip flow is dealing with physical design issues for a new technology node. However, most new chip designs are in older technologies, because only a few products can justify the cost of using an advanced node. Also, as technology scaling continues to slow, back-end flows and CAD tools will have time to mature and should become more automatable, even for the most advanced nodes. The end of scaling will also coincide with a greater need for a variety of specialized designs, where the agile model should show greater benefit.

Our emphasis on tools and generators is not at odds with the agile software development manifesto, which emphasizes collaboration over tools. Our proposed methodology also values collaboration more than specific tools, but we feel that modern hardware design is far less automated and exhibits far less reuse than modern software development (or even software development in 2001). Accordingly, there is a need to emphasize the importance of reuse and automation enabled by new tools in the hardware design sphere, because few modern hardware design teams emphasize these universally accepted software principles.
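
To make the contrast between a generator and a single instance concrete, the following is a minimal sketch in plain Scala (it does not use Chisel). Everything in it is an assumption made for illustration: the names GeneratorSweep, CacheParams, and CacheInstance, the choice of a cache as the component, and the simple capacity and tag-width formulas are hypothetical and do not come from the article. The point is only that one parameterized function can elaborate many legal design points, which a short loop can then sweep.

```scala
// Hypothetical example: CacheParams, CacheInstance, and the sizing formulas are
// invented for illustration; they are not taken from the article or any library.
object GeneratorSweep {
  final case class CacheParams(sets: Int, ways: Int, lineBytes: Int)
  final case class CacheInstance(params: CacheParams, capacityBytes: Int, tagBits: Int)

  // log2 for exact powers of two.
  private def log2(n: Int): Int = 31 - Integer.numberOfLeadingZeros(n)

  // The "generator": one function that elaborates any legal parameter choice.
  def generate(p: CacheParams, addrBits: Int = 32): CacheInstance = {
    require(Integer.bitCount(p.sets) == 1 && Integer.bitCount(p.lineBytes) == 1,
      "sets and lineBytes must be powers of two")
    val tagBits = addrBits - log2(p.sets) - log2(p.lineBytes)
    CacheInstance(p, p.sets * p.ways * p.lineBytes, tagBits)
  }

  // Design-space exploration: elaborate every combination, keep those under budget.
  def explore(budgetBytes: Int): Seq[CacheInstance] =
    for {
      sets <- Seq(64, 128, 256)
      ways <- Seq(1, 2, 4)
      inst = generate(CacheParams(sets, ways, lineBytes = 64))
      if inst.capacityBytes <= budgetBytes
    } yield inst
}
```

Calling GeneratorSweep.explore(16384) would return only the configurations whose capacity fits in 16 KiB. Because every instance comes from the same generate function, a check written once against that function applies to every point in the sweep, which is the reuse argument made above.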

Response to Change over Following a Plan

This principle mirrors the original manifesto. Unlike software, a chip design will ultimately be frozen into a mask set and mass-manufactured, with no scope for frequent updates. However, even over the timeframe of a single chip’s development, requirements will change owing to market shifts, customer feature additions, or standards-body deliberations. In agile design, change can result from validation testing (as opposed to the waterfall model, in which designs will only change in response to verification problems).

An agile hardware development process embraces continual change in chip specifications as a means to enhance competitiveness. Every incremental fabricatable prototype, not only the most recent, becomes a starting point for the changed set of design requirements, with any change treated as a reprioritization of features to implement, drop, or rework. By first implementing features that are unlikely to change, designers minimize the effort lost to late changes.

Implementing the Agile Hardware Methodology

Here, we describe the steps we took toward implementing an agile hardware design process. All of our processors, both general purpose and specialized, were built on the free and open RISC-V instruction set architecture (ISA). We developed our RTL source code using Chisel (Constructing Hardware in a Scala-Embedded Language), an open-source hardware construction language, and expressed the code as libraries of highly parameterized hardware generators. We developed a highly automated back-end design flow to reduce manual effort in pushing designs through to layout. Finally, we present the resulting chips from our agile hardware design process.

RISC-V

RISC-V is a free, open, and extensible ISA that forms the basis of all of our programmable processor designs.² Unlike many earlier efforts that designed open processor cores, RISC-V is an ISA specification, intended to allow many different hardware implementations to leverage common software development. RISC-V is a modular architecture, with variants covering 32-, 64-, and 128-bit address spaces. The base integer instruction set is lean, requiring fewer than 50 user-level hardware instructions to support a full modern software stack, which enables microprocessor designers to quickly bring up fully functional prototypes and add additional features incrementally. In addition to simplifying the implementation of new microarchitectures, the RISC-V design provides an ideal base for custom accelerators. Not only can accelerators reuse common control-processor implementations, they can share a single software stack, including the compiler toolchain and operating system binaries. This dramatically reduces the cost of designing and bringing up custom accelerators.

Further information on RISC-V is available from the nonprofit RISC-V Foundation, which was incorporated in August 2015 to help coordinate, standardize, protect, and promote the RISC-V ecosystem.³ A free and open ISA is a critical ingredient in an agile hardware methodology, because having an active and diverse community of open-source contributors amortizes the overhead of maintaining and updating the software ecosystem, which makes it possible for smaller teams to focus on developing custom hardware.⁴ A free ISA is also a prerequisite for shared open-source processor implementations, which can act as a base for further customization.⁵

Chisel

To facilitate agile hardware development by improving designer productivity, we developed Chisel⁶ as a domain-specific extension to the Scala programming language. Chisel is not a high-level synthesis tool in which hardware is inferred from Scala code. Rather, Chisel is intended to be a substrate that provides a Scala abstraction of primitive hardware components, such as registers, muxes, and wires. Any Scala program whose execution generates a graph of such hardware components provides a blueprint to fabricate hardware designs; the Chisel compiler translates a graph of hardware components into a fast, cycle-accurate, bit-accurate C++ software simulator, or low-level synthesizable Verilog that maps to standard FPGA or ASIC flows. Because Chisel is embedded in Scala, hardware developers can tap into Scala’s modern

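[Figure 2a graphic. Left: the AHB-Lite crossbar as a single block with ports masters(0,1) and slaves(0,1,2). Middle: the two building blocks, an AHB-Lite bus with one master port and slaves(0,1,2), and an AHB-Lite slave mux containing an arbiter, with ports ins(0,1) and out. Right: the final design, in which masters(0) and masters(1) each drive an AHB-Lite bus, and three arbiter and slave mux pairs drive slaves(0), slaves(1), and slaves(2).]

(a)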

A  class AHBXbar(nMasters: Int, nSlaves: Int) extends Module
B  {
C    val io = new Bundle {
D      val masters = Vec(new AHBMasterIO, nMasters).flip
E      val slaves = Vec(new AHBSlaveIO, nSlaves).flip
F    }
G
H    val buses = List.fill(nMasters){Module(new AHBBus(nSlaves))}
I    val muxes = List.fill(nSlaves){Module(new AHBSlaveMux(nMasters))}
J
K    (buses.map(b => b.io.master) zip io.masters) foreach { case (b, m) => b <> m }
L    (muxes.map(m => m.io.out) zip io.slaves) foreach { case (x, s) => x <> s }
M    for (m <- 0 until nMasters; s <- 0 until nSlaves) yield {
N      buses(m).io.slaves(s) <> muxes(s).io.ins(m)
O    }
P  }

(b)

Figure 2. Using functional programming to write a parameterized AHB-Lite crossbar switch in Chisel. (a) Block diagrams. (b)
Chisel source code. The code is short and easy to verify by inspection, but can support arbitrary numbers of masters and slave
ports.
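
The wiring pattern of Figure 2b can also be mimicked without any hardware library, which may help readers who want to experiment with the idea before installing Chisel. The sketch below is plain Scala and only records which endpoint would be connected to which; the names XbarSketch, Wire, and wires are hypothetical and are not part of Chisel or of the authors' code. It assumes the same three connection steps as lines K, L, and M through N of the figure.

```scala
// Plain-Scala model of the wiring pattern; it does not use Chisel.
// All names here (XbarSketch, Wire, wires) are hypothetical, chosen for illustration.
object XbarSketch {
  // One point-to-point connection, described by the names of its two endpoints.
  final case class Wire(from: String, to: String)

  def wires(nMasters: Int, nSlaves: Int): List[Wire] = {
    // Like line K: pair each crossbar master port with the master port of its bus.
    val masterSide = (0 until nMasters).toList.map { mst =>
      Wire(s"io.masters($mst)", s"buses($mst).master")
    }
    // Like line L: pair each slave mux output with a crossbar slave port.
    val slaveSide = (0 until nSlaves).toList.map { slv =>
      Wire(s"muxes($slv).out", s"io.slaves($slv)")
    }
    // Like lines M and N: the cross-product of buses and muxes.
    val cross = for (mst <- 0 until nMasters; slv <- 0 until nSlaves)
      yield Wire(s"buses($mst).slaves($slv)", s"muxes($slv).ins($mst)")
    masterSide ++ slaveSide ++ cross
  }
}
```

For two masters and three slaves this yields 2 + 3 + 2 × 3 = 11 connections, and the same function covers any other port counts without modification.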

programming language features (such as object-oriented programming, functional programming, parameterized types, abstract data types, operator overloading, and type inference) to improve designer productivity by raising the abstraction level and increasing code reuse. Furthermore, metaprogramming, code generation, and hardware design tasks are all expressed in the same source language, which encourages developers to write parameterized hardware generators rather than discrete instances of individual blocks. This in turn improves code reuse within a given design and across generations of design iterations.

To see how these language features interact in real designs, we describe a crossbar that connects two AHB-Lite master ports to three AHB-Lite slave ports in Figure 2a. In the figure, the high-level block diagram is on the left, the two building blocks (AHB-Lite buses

   Idiom                                                    Result
A  (1,2,3) map { n => n + 1 }                               (2,3,4)
B  (1,2,3) zip (a,b,c)                                      ((1,a),(2,b),(3,c))
C  ((1,a),(2,b),(3,c)) map { case (left,right) => left }    (1,2,3)
D  (1,2,3) foreach { n => print(n) }                        123
E  for (x <- 1 to 3; y <- 1 to 3) yield (x,y)               (1,1),(1,2),(1,3),(2,1),(2,2),(2,3),(3,1),(3,2),(3,3)

Figure 3. Generic functional programming idioms to help understand the crossbar example.
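
The idioms in Figure 3 are written in shorthand; as a minimal sketch, they could be expressed in compilable Scala as follows. The object name Fig3Idioms and the value names are illustrative only, and the symbols a, b, and c from the figure are assumed here to be the strings "a", "b", and "c".

```scala
// Plain Scala, no Chisel required. The object name Fig3Idioms is illustrative only.
object Fig3Idioms {
  val nums = List(1, 2, 3)
  val syms = List("a", "b", "c") // assumption: a, b, c in the figure stand for any values

  // Idiom A: map applies a function to every element and collects the results.
  val incremented: List[Int] = nums map { n => n + 1 }

  // Idiom B: zip pairs up elements at the same position in two lists.
  val zipped: List[(Int, String)] = nums zip syms

  // Idiom C: a case pattern names the tuple fields; here only the left one is kept.
  val lefts: List[Int] = zipped map { case (left, right) => left }

  // Idiom D: foreach runs a side effect per element and collects nothing.
  def printAll(): Unit = nums foreach { n => print(n) }

  // Idiom E: a for-comprehension with yield builds the cross-product.
  val pairs: Seq[(Int, Int)] = for (x <- 1 to 3; y <- 1 to 3) yield (x, y)
}
```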

and AHB-Lite slave muxes) are in the middle, and the final design is shown on the right. To help understand the Chisel code, Figure 3 shows some generic functional programming idioms. Idiom A maps a list of numbers to an expression that adds 1 to them. Idiom B zips two lists together. Idiom C iterates over a list of tuples, uses Scala’s case statement to provide names of elements, and then operates on those names. Idiom D iterates over a list when the output values are not collected. Idiom E is a for-comprehension with a yield statement.

Figure 2b shows the Chisel code that builds the crossbar. Line A declares the AHBXbar module with two parameters: number of master ports (nMasters) and slave ports (nSlaves). Lines C through F declare the module’s I/O ports. Lines H and I instantiate collections of AHB-Lite buses and AHB-Lite slave muxes. Lines K through N connect the building blocks together. Note that the Chisel operator <> connects module I/O ports together. Line K extracts the master I/O port of each bus, zips it to the corresponding master I/O port of the crossbar itself, and then connects them together. Line L does the same thing for each slave mux output and each slave I/O port of the crossbar. Lines M and N use for-comprehension to generate the cross-product of all port combinations, and use each pair to connect a specific slave I/O port of a specific bus to the matching input of the corresponding mux.

This code’s brevity ensures that it is easily verified by inspection. Although the example shows two buses and three slave muxes, these 10 lines of code would be identical for any number of buses and slave muxes. In contrast, implementations of this crossbar module written in traditional HDLs, such as Verilog or VHDL, would be difficult to generalize into a parameterizable crossbar generator. This is just a small example of Chisel’s power to bring modern programming language constructs to hardware development, increasing code maintainability, facilitating reuse, and enhancing designer productivity.

Rocket Chip Generator

Chisel’s first-class support for object orientation and metaprogramming lets hardware designers write generators, rather than individual instances of designs, which in turn encourages the development of families of customizable designs. By encoding microarchitectural and domain knowledge in these generators, we can quickly create different chip instances customized for particular design goals and constraints.⁷ Construction of hardware generators requires support for reconfiguring individual components based on the context in which they are deployed; thus, a particular focus with Chisel is to improve on the limited module parameterization facilities of traditional HDLs.

The Rocket chip generator is written in Chisel and constructs a RISC-V-based platform. The generator consists of a collection of parameterized chip-building libraries that we can use to generate different SoC variants. By standardizing the interfaces used to connect different libraries’ generators to one another, we have created a plug-and-play environment in which it is trivial to swap out substantial components of the design simply by changing configuration files, leaving the hardware source code untouched. We can also test the output of individual generators and perform integration tests on the whole design, where the tests are also parameterized, so as to exercise the entire design under test.
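
The plug-and-play idea described above, in which components are swapped by editing configuration rather than generator code, can be illustrated with a small plain-Scala sketch. This is not the Rocket chip generator's actual API: the names PlugAndPlay, CoreGen, ScalarCore, VectorCore, SoCConfig, and buildSoC, and the string-based output, are all hypothetical and exist only to show the shape of the technique, namely a generator that depends on a standardized interface and receives the concrete choice as data.

```scala
// Hypothetical sketch: these traits, classes, and config fields are invented for
// illustration and are not the Rocket chip generator's actual API.
object PlugAndPlay {
  // A standardized interface that every core generator must implement.
  trait CoreGen { def describe: String }
  final case class ScalarCore(hasFpu: Boolean) extends CoreGen {
    def describe: String = if (hasFpu) "scalar core + FPU" else "scalar core"
  }
  final case class VectorCore(lanes: Int) extends CoreGen {
    def describe: String = s"vector core, $lanes lanes"
  }

  // The configuration is plain data, kept separate from the generator code.
  final case class SoCConfig(nTiles: Int, core: CoreGen, l2Banks: Int)

  // The SoC generator depends only on the CoreGen interface, never on a concrete core.
  def buildSoC(cfg: SoCConfig): List[String] =
    List.tabulate(cfg.nTiles)(i => s"tile$i: ${cfg.core.describe}") ++
      List.tabulate(cfg.l2Banks)(b => s"l2bank$b")
}
```

Switching from SoCConfig(2, ScalarCore(hasFpu = true), 1) to SoCConfig(1, VectorCore(4), 2) changes the generated system while buildSoC itself stays untouched. A parameterized test can take the same SoCConfig value and derive its expectations from it.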

Figure 4 presents the collection of library generators and their interfaces within the Rocket chip generator. These generators can be parameterized and composed into many different SoC designs. Their current capabilities include the following:

• Core: the Rocket scalar core generator with an optional floating-point unit, configurable pipelining of functional units, and customizable branch-prediction structures.
• Caches: a family of cache and translation look-aside buffer generators with configurable sizes, associativities, and replacement policies.
• Rocket Custom Coprocessor: the RoCC interface, a template for application-specific coprocessors that can expose their own parameters.
• Tile: a tile-generator template for cache-coherent tiles. The number and type of cores and accelerators are configurable, as is the organization of private caches.
• TileLink: a generator for networks of cache-coherent agents and the associated cache controllers. Configuration options include the number of tiles, coherence policy, presence of shared backing storage, and implementation of underlying physical networks.
• Peripherals: generators for AXI4-/AHB-Lite-/APB-compatible buses and various converters and controllers.

[Figure 4 graphic. Blocks shown, top to bottom: two RocketTiles, each containing a Rocket core with FPU, L1I$, L1D$, and a RoCC accelerator; the L1 network; an L2$ bank; the L2 network; a TileLink/AXI4 bridge; an AXI4 crossbar; and a DRAM controller, a high-speed I/O device, and AHB and APB peripherals. Legend: A, Rocket; B, Cache; C, RoCC; D, Tile; E, TileLink; F, Periph.]

Figure 4. The Rocket chip generator comprises the following subcomponents: (a) a core generator, (b) cache generator, (c) Rocket Custom Coprocessor-compatible coprocessor generator, (d) tile generator, (e) TileLink generator, and (f) peripherals. These generators can be parameterized and composed into many different system-on-chip designs.

Physical Design Flow

We frequently translate Chisel’s Verilog output into a physical design through an ASIC CAD toolflow. We have automated this flow and optimized it to generate industry-standard GDS-format mask files without designer intervention. Automatic nightly regressions provide the designer with regular feedback on cycle time, area, and energy consumption, and they also detect physical design problems early in the design cycle. Automatic verification and reporting on the quality of results allow more points in the design space to be evaluated, and reduce the barrier to implementing more significant changes at later stages in the development process.

One major hindrance to agile methodology for hardware design is CAD tool runtime. The deployment process for hardware is significantly more time-consuming than for software. Using an initial hardware description to generate a GDS that is electrically and logically correct can require many iterations. Because each iteration can take many hours to run to completion, the process could require weeks or months for larger designs. To avoid the productivity loss of these long-latency iterations, we initially focus on smaller designs, from which GDS can be generated more rapidly. Iterating on these smaller-scale tape-ins has proven valuable, because major problems are uncovered much earlier in the design process, and even more so when we target a new process technology. Additionally, the parameterizable nature of hardware generators reduces the

[Figure 5 graphic: timeline spanning 2011 to 2015, with month ticks May, Apr., Aug., Feb., July, Sept., Mar., Nov., Mar., and Apr., marking the tape-outs Raven-1, Raven-2, Raven-3, Raven-4, EOS14, EOS16, EOS18, EOS20, EOS22, EOS24, and Swerve.]

Figure 5. Recent UC Berkeley tape-outs using RISC-V processors for three lines of research. The Raven series investigated
low-voltage reliable operation with on-chip DC–DC converters, the EOS series developed monolithically integrated silicon
photonic links, and Swerve experimented with dynamic redundancy techniques to handle low-voltage faults.

effort to scale up to larger designs later in the design process. Regardless of design size, we ensure that each tape-in passes static timing analysis, is functionally equivalent to the RTL model, and has only a small number of design-rule violations. Together, these properties ensure that the design is indeed tape-out ready.

As a particular tape-out deadline approaches, we evaluate whether to tape out the most recently taped-in design, usually on the basis of whether the set of new features validated by the tape-in is sufficiently large. If we decide to tape out, we clean up the remaining design-rule violations and run final checks on the resulting GDS. The full cycle of taping out the GDS and testing the resulting chip is more involved than a tape-in, but qualitatively it is just another, longer, iteration in the agile methodology. Thus, we focus on producing lineages of increasingly complicated functioning chips, incorporating feedback from the previous chip into the next in the series. We believe that this rapid iterative process is essential to improve the productivity of hardware designers and enable iterative design-space exploration of alternative system microarchitectures.

Silicon Results
Figure 5 presents a timeline of UC Berkeley RISC-V microprocessor tape-outs that we created using earlier versions of the Rocket chip generator. The 28-nm Raven chips combine a 64-bit RISC-V vector microprocessor with on-chip switched-capacitor DC–DC converters and adaptive clocking. The 45-nm chips integrate a 64-bit dual-core RISC-V vector processor with monolithically integrated silicon photonic links.8,9 In total, we taped out four Raven chips on STMicroelectronics’ 28-nm fully depleted silicon-on-insulator (FD-SOI) process, six photonics chips on IBM’s 45-nm SOI process, and one Swerve chip on TSMC’s 28-nm process. The Rocket chip generator forms the shared code base for these 11 distinct systems; we incorporated the best ideas from each design back into the code base, ensuring maximal reuse even though the three distinct families of chips were specialized differently to test out distinct research ideas.

Case Study: RISC-V Vector Microprocessor with On-chip DC–DC Converters
We applied our agile design methodology to Raven-3, a 28-nm microprocessor that integrates switched-capacitor DC–DC converters, adaptive clocking circuitry, and low-voltage static RAM macros with a RISC-V vector processor instantiated by the Rocket chip generator.10 The successful development of the Raven-3 prototype demonstrates how architecture and circuit research was supported by the combination of the freely extensible RISC-V ISA, the parameterizable Chisel-based hardware implementation, and

[Figure 6 image: block diagram. Core domain (1.19 mm²): RISC-V scalar core with scalar RF, FPU, 16-Kbyte scalar I-cache, and 32-Kbyte shared D-cache; vector accelerator with vector issue unit, vector memory unit, crossbar, 16-Kbyte vector RF (eight custom 8T SRAM macros), and 8-Kbyte vector I-cache; shared functional units (64-bit integer multiplier, single-precision FMA, double-precision FMA); all caches use custom 8T SRAM macros. Power and clock: 24 switched-capacitor DC–DC unit cells fed from 1.0-V and 1.8-V supplies, a DC–DC controller comparing Vout against Vref, and an adaptive clock generator with a 2-GHz reference clock; Vout is also routed to an off-chip oscilloscope. An asynchronous FIFO connects the core to the 1.0-V uncore, which links the wire-bonded chip-on-board to an off-chip FPGA FSB and DRAM.]

Figure 6. Raven-3 system-level block diagram. A RISC-V processor core with a vector coprocessor runs with varying voltage
and frequency generated by on-chip switched-capacitor DC–DC converters and an adaptive clock generator.
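
To see why a clock that tracks the supply is useful when the core voltage ripples, consider the toy comparison below. It is a sketch under loud assumptions, not a model of the Raven-3 clock generator: the safe frequency is assumed to vary linearly with voltage between the two operating points this article reports for Raven-3 (93 MHz at 0.45 V and 961 MHz at 1 V), and the ripple samples are invented. A fixed clock must be slow enough for the lowest voltage in the ripple, whereas an adaptive clock can follow each sample.

```python
# Toy model: adaptive vs. fixed clocking on a rippling supply.
# Assumptions (not from the article): frequency is linear in voltage between
# the two reported Raven-3 operating points, and the ripple samples are invented.

V_LO, F_LO_MHZ = 0.45, 93.0    # reported minimum-voltage operating point
V_HI, F_HI_MHZ = 1.00, 961.0   # reported maximum-frequency operating point


def max_freq_mhz(v: float) -> float:
    """Linearly interpolate the assumed safe clock frequency at voltage v."""
    slope = (F_HI_MHZ - F_LO_MHZ) / (V_HI - V_LO)
    return F_LO_MHZ + slope * (v - V_LO)


def average_freqs(voltages: list[float]) -> tuple[float, float]:
    """Return (fixed, adaptive) average frequency in MHz over the samples."""
    # A fixed clock has to be safe at the worst-case voltage dip.
    fixed = max_freq_mhz(min(voltages))
    # An adaptive clock runs each sample at the frequency that voltage allows.
    adaptive = sum(max_freq_mhz(v) for v in voltages) / len(voltages)
    return fixed, adaptive


ripple = [0.90, 0.85, 0.80, 0.75, 0.70, 0.75, 0.80, 0.85]  # hypothetical samples
fixed, adaptive = average_freqs(ripple)
print(f"fixed clock:    {fixed:.0f} MHz")
print(f"adaptive clock: {adaptive:.0f} MHz")
```

Under these assumptions the adaptive clock averages a higher frequency than the fixed one, because only one sample sits at the bottom of the ripple. The real frequency-voltage relationship is not linear in general, so the numbers are illustrative only.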

fast feedback via iteration and automation as part of the agile methodology.

Figure 6 shows the Raven-3 system-level block diagram. The Raven-3 microprocessor instantiates the left RocketTile in Figure 4 with the Hwacha vector unit implemented as a RoCC accelerator. We did not include an L2 cache in this design iteration, leaving this feature for implementation in future tape-ins and tape-outs. The chip consists of two voltage domains: the varying core domain and the fixed uncore domain. The core domain connects to the output of fully integrated noninterleaved switched-capacitor DC–DC converters11 and powers the processor and its L1 caches. The fixed 1-V domain powers the uncore. The DC–DC converters generate four voltage modes between 0.5 and 1 V from 1 and 1.8 V supplies; they achieve better than 80 percent conversion efficiency, while requiring no off-chip passives, which greatly simplifies package design.10 The clock generator selects clock edges from a tunable replica circuit powered by the rippling core voltage that mimics the processor’s critical path delay, and it adapts the clock frequency to the instantaneous voltage.12

The design of Raven-3 benefitted greatly from the agile hardware design methodology. By aspiring to build an incomplete prototype (rather than, for instance, a many-core design implementing these technologies), we dramatically improved the odds of project success. The unique integration of the power delivery, clocking, and processor systems was only possible because the design team broadly shared expertise about all aspects of the project. The processor RTL could largely be reused from previous tape-outs, with only incremental improvements and configuration

[Figure 7 image: panels (a) through (d). Labels in the floorplan and micrograph: vector RF, VI$, SC DC-DC unit cells, scalar core + vector accelerator, D$, I$, BIST, SC-DCDC controller, adaptive clock.]

Figure 7. Raven-3 implementation: the (a) floorplan, (b) micrograph of the resulting chip, (c) wire-bonded chip on the package
board, and (d) test setup.
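
The operating points reported in this section lend themselves to a quick back-of-the-envelope check. The snippet below only restates the article's numbers as average energy per clock cycle (power divided by frequency) and as the ratio of the two efficiency figures. The helper name `energy_per_cycle_pj` is hypothetical, and the calculation assumes nothing about how the reported power splits between the core and the converters.

```python
# Back-of-the-envelope arithmetic on the reported Raven-3 measurements.

def energy_per_cycle_pj(power_mw: float, freq_mhz: float) -> float:
    """Average energy per clock cycle in picojoules: power / frequency."""
    watts = power_mw * 1e-3
    hertz = freq_mhz * 1e6
    return watts / hertz * 1e12


fast = energy_per_cycle_pj(power_mw=173.0, freq_mhz=961.0)  # 1 V operating point
slow = energy_per_cycle_pj(power_mw=8.0, freq_mhz=93.0)     # 0.45 V operating point
ratio = 27.0 / 34.0  # Gflops per watt with converters vs. without

print(f"{fast:.0f} pJ/cycle at 961 MHz")
print(f"{slow:.0f} pJ/cycle at 93 MHz")
print(f"efficiency ratio with/without converters: {ratio:.2f}")
```

This works out to roughly 180 pJ per cycle at 961 MHz and 86 pJ per cycle at 93 MHz, and a with/without-converter efficiency ratio of about 0.79.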

changes required to validate it for the particular feature set desired. As physical design on partial feature sets gave incremental results, we were able to adjust design features to improve feasibility and QoR.

The agile hardware design methodology was especially important in this design due to the complexity and risk introduced by the custom voltage conversion and clocking, as illustrated by the following example. Tape-ins of a simple digital system with a single DC–DC unit cell were small enough to simulate at the transistor level, and these simulations found a critical bug in the DC–DC controller. These transistor-level simulations showed that a large current event could cause the output voltage to remain below the lower-bound reference voltage even after the converter switched. A bug in the controller would cause the converter to stop switching and drop the processor voltage to zero for this case. We believe this bug, which was fixed early in the design process by the addition of a simple state machine in the lower-bound controller, would have been found very late (if at all) with waterfall design approaches.

Figure 7 shows the floorplan and micrograph of the resulting chip, the wire-bonded chip on the package board, and the test setup. The resulting 2.37-mm² Raven chip is fully functional and boots Linux and executes compiled scalar and vector application code with the adaptive clock generator at all operating modes, while the DC–DC converter can transition between voltage modes in less than 20 ns. Raven-3 runs at a maximum clock frequency of 961 MHz at 1 V, consuming 173 mW, and at a minimum voltage of 0.45 V at 93 MHz, consuming 8 mW. Raven-3 achieves a maximum energy efficiency of 27 and 34 Gflops per watt while running a double-precision matrix-multiplication kernel on the vector accelerator with and without the switched-capacitor DC–DC converters, respectively. These results validate the agile approach to hardware design.

Our agile hardware development methodology played an important role in successfully taping out 11 RISC-V microprocessors in five years at UC Berkeley. With a combination of our agile methodology, Chisel, RISC-V, and the Rocket chip generator, a small team of graduate students at UC Berkeley was able to design multiple microprocessors that are competitive with industrial designs. Although there is much work left to do to show that this approach can scale up to the largest SoCs, we hope these projects serve as an opportunity to shed some light on inefficiencies in the current design process, to rethink how modern SoCs are built, and to revitalize hardware design.

Acknowledgments
We thank Tom Burd, James Dunn, Olivier Thomas, and Andrei Vladimirescu for their contributions and STMicroelectronics for the fabrication donation. This work was funded in part by BWRC, ASPIRE, DARPA PERFECT Award number HR0011-12-2-0016, STARnet C-FAR, Intel ARO, AMD, SRC/TxACE, Marie Curie FP7, NSF GRFP, and an Nvidia graduate fellowship.

References
1. K. Beck et al., The Agile Manifesto, tech. report, Agile Alliance, 2001.
2. A. Waterman et al., The RISC-V Instruction Set Manual, Volume I: User-Level ISA, Version 2.0, tech. report UCB/EECS-2014-54, EECS Dept., Univ. California, Berkeley, May 2014.
3. R. Merritt, “Google, HP, Oracle Join RISC-V,” EE Times, 28 Dec. 2015.
4. V. Patil et al., “Out of Order Floating Point Coprocessor for RISC-V ISA,” Proc. 19th Int’l Symp. VLSI Design and Test, 2015; doi:10.1109/ISVDAT.2015.7208116.
5. K. Asanović and D. Patterson, “The Case for Open Instruction Sets,” Microprocessor Report, Aug. 2014.
6. J. Bachrach et al., “Chisel: Constructing Hardware in a Scala Embedded Language,” Proc. Design Automation Conf., 2012, pp. 1216–1225.
7. O. Shacham et al., “Rethinking Digital Design: Why Design Must Change,” IEEE Micro, vol. 30, no. 6, 2010, pp. 9–24.
8. Y. Lee et al., “A 45nm 1.3GHz 16.7 Double-Precision GFLOPS/W RISC-V Processor with Vector Accelerators,” Proc. 40th European Solid-State Circuits Conf., 2014, pp. 199–202.
9. C. Sun et al., “Single-Chip Microprocessor that Communicates Directly Using Light,” Nature, vol. 528, 2015, pp. 534–538.
10. B. Zimmer et al., “A RISC-V Vector Processor with Tightly-Integrated Switched-Capacitor DC–DC Converters in 28nm FDSOI,” Proc. IEEE Symp. VLSI Circuits, 2015, pp. C316–C317.
11. H.-P. Le, S. Sanders, and E. Alon, “Design Techniques for Fully Integrated Switched-Capacitor DC–DC Converters,” IEEE J. Solid-State Circuits, vol. 46, no. 9, 2011, pp. 2120–2131.
12. J. Kwak and B. Nikolić, “A 550-2260MHz Self-Adjustable Clock Generator in 28nm FDSOI,” Proc. IEEE Asian Solid-State Circuits Conf., 2015; doi:10.1109/ASSCC.2015.7387471.

Yunsup Lee is a PhD candidate in the Department of Electrical Engineering and Computer Sciences at the University of California, Berkeley. He received an MS in computer science from the University of California, Berkeley. Contact him at yunsup@eecs.berkeley.edu.

Andrew Waterman is a PhD candidate in the Department of Electrical Engineering and Computer Sciences at the University of California, Berkeley. He received an MS in computer science from the University of California, Berkeley. Contact him at waterman@eecs.berkeley.edu.

Henry Cook is a PhD candidate in the Department of Electrical Engineering and Computer Sciences at the University of California, Berkeley. He received an MS in computer science from the University of California, Berkeley. Contact him at hcook@eecs.berkeley.edu.

Brian Zimmer is a research scientist in the Circuits Research Group at Nvidia. He received a PhD in electrical engineering and computer science from the University of California, Berkeley, where he performed the research for this article. Contact him at brianzimmer@gmail.com.

Ben Keller is a PhD student in the Department of Electrical Engineering and Computer Sciences at the University of California, Berkeley. He received an MS in electrical engineering from the University of California, Berkeley. Contact him at bkeller@eecs.berkeley.edu.

Alberto Puggelli is the director of technology at Lion Semiconductor. He received a PhD in electrical engineering from the University of California, Berkeley, where he performed the research for this article. Contact him at alberto.puggelli@gmail.com.

Jaehwa Kwak is a PhD candidate in the Department of Electrical Engineering and Computer Science at the University of California, Berkeley. He received an MS in electrical engineering and computer sciences
from Seoul National University. Contact him at jhkwak@eecs.berkeley.edu.

Ruzica Jevtić is an assistant professor at the Universidad de Antonio de Nebrija in Madrid. She received a PhD in electrical engineering from Technical University of Madrid and was a postdoctoral fellow at the University of California, Berkeley, where she performed the research for this article. Contact her at rjevtic21@gmail.com.

Stevo Bailey is a PhD student in the Department of Electrical Engineering and Computer Science at the University of California, Berkeley. He received an MS from the University of California, Berkeley. Contact him at stevo.bailey@eecs.berkeley.edu.

Milovan Blagojević is a PhD student in a CIFRE program at the Berkeley Wireless Research Center, STMicroelectronics, and Institut Superieur d’Electronique de Paris. He received an MSc in electrical engineering from the University of Belgrade, Serbia. Contact him at milovan.blagojevic@st.com.

Pi-Feng Chiu is a PhD student in the Department of Electrical Engineering and Computer Sciences at the University of California, Berkeley. She received an MS in electrical engineering from the National Tsing Hua University. Contact her at pfchiu@eecs.berkeley.edu.

Rimas Avizienis is a PhD candidate in the Department of Electrical Engineering and Computer Sciences at the University of California, Berkeley. He received an MS in computer science from the University of California, Berkeley. Contact him at rimas@eecs.berkeley.edu.

Brian Richards is a research staff engineer at the University of California, Berkeley, and a founding member of the Berkeley Wireless Research Center. He received an MS in electrical engineering and computer science from the University of California, Berkeley. Contact him at richards@eecs.berkeley.edu.

Jonathan Bachrach is an adjunct assistant professor in the Department of Electrical Engineering and Computer Sciences at the University of California, Berkeley. He received a PhD in computer science from the University of Massachusetts, Amherst. Contact him at jrb@pobox.com.

David Patterson is the Pardee Professor of Computer Science at the University of California, Berkeley. He received a PhD in computer science from the University of California, Los Angeles. Contact him at pattrsn@eecs.berkeley.edu.

Elad Alon is an associate professor in the Department of Electrical Engineering and Computer Sciences at the University of California, Berkeley, and a codirector of the Berkeley Wireless Research Center. He received a PhD in electrical engineering from Stanford University. Contact him at elad@eecs.berkeley.edu.

Borivoje Nikolić is the National Semiconductor Distinguished Professor of Engineering at the University of California, Berkeley. He received a PhD in electrical and computer engineering from the University of California, Davis. Contact him at bora@eecs.berkeley.edu.

Krste Asanović is a professor in the Department of Electrical Engineering and Computer Sciences at the University of California, Berkeley. He received a PhD in computer science from the University of California, Berkeley. Contact him at krste@berkeley.edu.