
OpenPiton+Ariane: The RISC-V

Hardware Research Platform

Princeton University and ETH Zürich

http://openpiton.org
http://pulp-platform.org
Princeton Parallel Research Group
• Computer Architecture after Moore’s Law
• Redesigning the Data Center of the Future
• Biodegradable Computing (Materials)

• 12 PhD Students
• 3 Undergraduates

Winter 2018 Ski Trip (group photo)


Support

This work was partially supported by the NSF under Grants No. CNS-1823222, CCF-1823032, CCF-1217553, CCF-1453112, and CCF-1438980, the Air Force Research Laboratory (AFRL) and Defense Advanced Research Projects Agency (DARPA) under agreements No. FA8650-18-2-7846, FA8650-18-2-7852, and FA8650-18-2-7862, AFOSR under Grant No. FA9550-14-1-0148, and DARPA under Grants No. N66001-14-1-4040 and HR0011-13-2-0005. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Air Force Research Laboratory (AFRL) and Defense Advanced Research Projects Agency (DARPA), the NSF, AFOSR, or the U.S. Government.

The world’s first open source, general purpose,
multithreaded manycore processor
• Open source manycore
• Written in Verilog RTL
• Scales to ½ billion cores
• Configurable core and uncore (see the parameterization sketch below)
• Includes synthesis and back-end flow
• Simulate in VCS, ModelSim, NCSim, Verilator, Icarus
• ASIC & FPGA verified
• ASIC power and energy fully characterized [HPCA 2018]
• Runs full stack multi-user Debian Linux
• Used for Architecture, Programming Language, Compilers, Operating Systems, Security, and EDA research
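As a rough illustration of what "configurable core and uncore" means at the RTL level, here is a minimal SystemVerilog sketch of a parameterized 2D grid of tiles. It is not OpenPiton's actual code: the mesh_top and tile module names, ports, and flit widths are placeholders chosen for this example, and only the horizontal mesh links are shown.

// Hypothetical sketch (not OpenPiton's actual RTL): a generate loop that
// stamps out an X_TILES x Y_TILES grid of tiles and wires up the horizontal
// links of a 2D mesh. Module, port, and parameter names are placeholders.
module tile #(
  parameter int FLIT_W = 64,
  parameter int X_POS  = 0,
  parameter int Y_POS  = 0
) (
  input  logic              clk,
  input  logic              rst_n,
  output logic [FLIT_W-1:0] east_out,
  input  logic [FLIT_W-1:0] east_in,
  output logic [FLIT_W-1:0] west_out,
  input  logic [FLIT_W-1:0] west_in
);
  // Placeholder body: a real tile holds the core, caches, and P-Mesh routers.
  always_ff @(posedge clk or negedge rst_n) begin
    if (!rst_n) begin
      east_out <= '0;
      west_out <= '0;
    end else begin
      east_out <= west_in;  // pass traffic straight through
      west_out <= east_in;
    end
  end
endmodule

module mesh_top #(
  parameter int X_TILES = 4,
  parameter int Y_TILES = 4,
  parameter int FLIT_W  = 64
) (
  input logic clk,
  input logic rst_n
);
  // Boundary b carries the links between column b-1 and column b; sizing the
  // arrays X_TILES+1 wide keeps every index in range.
  logic [FLIT_W-1:0] east_go [X_TILES+1][Y_TILES];
  logic [FLIT_W-1:0] west_go [X_TILES+1][Y_TILES];

  for (genvar x = 0; x < X_TILES; x++) begin : g_x
    for (genvar y = 0; y < Y_TILES; y++) begin : g_y
      tile #(.FLIT_W(FLIT_W), .X_POS(x), .Y_POS(y)) u_tile (
        .clk      (clk),
        .rst_n    (rst_n),
        .east_out (east_go[x+1][y]),  // drive the boundary to our east
        .east_in  (west_go[x+1][y]),  // receive westbound traffic from the east
        .west_out (west_go[x][y]),
        .west_in  (east_go[x][y])
      );
    end
  end

  // Tie off the mesh edges (a real chip would hook these to the chip bridge).
  for (genvar y = 0; y < Y_TILES; y++) begin : g_edge
    assign east_go[0][y]       = '0;
    assign west_go[X_TILES][y] = '0;
  end
endmodule

Changing X_TILES and Y_TILES regenerates a differently sized fabric at elaboration time, which is the basic mechanism behind a build-time-configurable manycore.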
OpenPiton+Ariane
• Collaboration between Princeton University and the PULP team from ETH Zürich
• Goal: develop a permissively licensed, Linux-capable manycore research platform based on RISC-V
• Ariane
  – RV64GC core
  – Linux capable
• OpenPiton
  – Research manycore system
  – OpenSPARC T1 based
  – Coherent NoC, distributed cache
Parallel Ultra Low Power (PULP)
• Project started in 2013 by Luca Benini
• A collaboration between University of Bologna and ETH Zürich
– Large team: about 60 people in total, not all working on PULP

• Key goal: how to get the most BANG for the ENERGY consumed in a computing system
• We were able to start with a clean slate, with no need to remain compatible with legacy systems.
How we started with open source
processors
• Our research was not about developing processors…
• … but we needed good processors for the systems we build for research
• Initially (2013) our options were
– Build our own (support for SW and tools)
– Use a commercial processor (licensing, collaboration issues)
– Use what is openly available (OpenRISC, …)
• We started with OpenRISC
– First chips until mid-2016 were all using OpenRISC cores
– We spent time improving the microarchitecture
• Moved to RISC-V later
– Larger community, more momentum
– Transition was relatively simple (new decoder)
PULP RISC-V Family Explained
32-bit cores:
§ Low Cost Core: Zero-riscy (RV32-ICM), Micro-riscy (RV32-CE)
§ Core with DSP enhancements: RI5CY (RV32-ICMX) – SIMD, HW loops, bit manipulation, fixed point
§ Floating-point capable Core: RI5CY + FPU (RV32-ICMFX)
64-bit core:
§ Linux capable Core: Ariane (RV64-GC) – full privileged specification, the "OS Core"

See also other tutorials on PULP/HERO
An Application class processor
• Virtual Memory
  – Multi-program environment
  – Efficient sharing and protection
  – Larger address space (64-bit)
  – Requires more hardware support: MMU (TLBs, PTW), privilege levels, more exceptions (page fault, illegal access) – see the sketch below
• Operating System
  – Highly sequential code
  – Increase frequency to gain performance
• Large software infrastructure
  – Drivers for hardware (PCIe, Ethernet)
  – Application SW (e.g. Tensorflow, …)
→ Ariane: an application class processor
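For concreteness, the SV39 address split that the TLB and hardware page-table walker operate on looks like this. The field widths come from the RISC-V privileged specification, while the typedef and function names below are illustrative, not Ariane's internal ones.

// Illustrative sketch: the SV39 virtual-address fields defined by the RISC-V
// privileged specification. The typedef and function names are ours.
typedef struct packed {
  logic [8:0]  vpn2;    // VA[38:30] - indexes the root (level-2) page table
  logic [8:0]  vpn1;    // VA[29:21] - indexes the level-1 page table
  logic [8:0]  vpn0;    // VA[20:12] - indexes the leaf (level-0) page table
  logic [11:0] offset;  // VA[11:0]  - byte offset within the 4 KiB page
} sv39_va_t;

// A hardware page-table walker consumes one VPN field per level to index the
// current page table; the TLB caches the resulting VPN-to-PPN translation.
function automatic logic [8:0] vpn_at_level(input sv39_va_t va,
                                            input int unsigned level);
  unique case (level)
    2:       return va.vpn2;
    1:       return va.vpn1;
    default: return va.vpn0;
  endcase
endfunction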
ARIANE: Linux capable 64-bit core
• Application class processor
• Linux capable
  – Tightly integrated D$ and I$
  – M, S and U privilege modes
  – TLB, SV39
  – Hardware PTW
• Optimized for performance
  – Frequency: 1.5 GHz (22 FDX)
  – Area: ~175 kGE
  – Critical path: ~25 logic levels
• 6-stage pipeline
  – In-order issue
  – Out-of-order write-back
  – In-order commit
• Scoreboarding
• Designed for extendibility
• Branch prediction (see the sketch below)
  – Return Address Stack (RAS)
  – Branch Target Buffer (BTB)
  – Branch History Table (BHT)
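As a rough sketch of the branch-prediction structures listed above, here is a minimal 2-bit saturating-counter branch history table in SystemVerilog. It only illustrates the idea; parameter and port names are made up, and it is not Ariane's BHT implementation.

// Minimal branch history table sketch: 2-bit saturating counters indexed by
// low PC bits. Structure only; not Ariane's actual RTL.
module bht_sketch #(
  parameter int unsigned NR_ENTRIES = 128
) (
  input  logic        clk_i,
  input  logic        rst_ni,
  // lookup from the frontend
  input  logic [63:0] vpc_i,
  output logic        taken_o,
  // update on branch resolution
  input  logic        valid_i,
  input  logic [63:0] resolved_pc_i,
  input  logic        resolved_taken_i
);
  localparam int unsigned IDX_W = $clog2(NR_ENTRIES);

  // 00/01 = strongly/weakly not taken, 10/11 = weakly/strongly taken
  logic [1:0] counters_q [NR_ENTRIES];

  // Index with the PC bits above the 2-bit word offset (assumes 4-byte
  // aligned branches; compressed instructions would need one more bit).
  logic [IDX_W-1:0] rd_idx, wr_idx;
  assign rd_idx  = vpc_i[IDX_W+1:2];
  assign wr_idx  = resolved_pc_i[IDX_W+1:2];
  assign taken_o = counters_q[rd_idx][1];  // MSB of the counter = predict taken

  always_ff @(posedge clk_i or negedge rst_ni) begin
    if (!rst_ni) begin
      for (int i = 0; i < NR_ENTRIES; i++) counters_q[i] <= 2'b01; // weakly not taken
    end else if (valid_i) begin
      // Saturate at the extremes.
      if (resolved_taken_i && counters_q[wr_idx] != 2'b11)
        counters_q[wr_idx] <= counters_q[wr_idx] + 2'b01;
      else if (!resolved_taken_i && counters_q[wr_idx] != 2'b00)
        counters_q[wr_idx] <= counters_q[wr_idx] - 2'b01;
    end
  end
endmodule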
ARIANE: Linux capable 64-bit core
(pipeline block diagram)
RISC-V Debug
• Draft specification 0.13
  – More or less frozen
• Defines debug registers for
  – run/halt/single-step
  – reading/writing GPRs, FPRs and CSRs
  – querying hart status
• JTAG interface
• OpenOCD support
• SiFive influenced
• RI5CY/Ariane contain performance counters (see the sketch below)
  – SoC performance monitoring is not part of the RISC-V spec
• Trace task group working on PC tracing
  – UltraSoC leading efforts
  – PULP actively engaging
  – Working on an implementation for PULPissimo
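To make the performance-counter point concrete, here is a toy SystemVerilog sketch of the two basic RISC-V counters (mcycle and minstret). The CSR addresses are from the privileged spec, but the module itself is illustrative and not the RI5CY/Ariane counter unit.

// Sketch of the two mandatory RISC-V performance counters (mcycle, minstret).
// CSR addresses follow the privileged spec; everything else is a toy example.
module basic_counters (
  input  logic        clk_i,
  input  logic        rst_ni,
  input  logic        instr_retired_i,   // pulses once per retired instruction
  input  logic [11:0] csr_addr_i,
  output logic [63:0] csr_rdata_o
);
  localparam logic [11:0] CSR_MCYCLE   = 12'hB00;
  localparam logic [11:0] CSR_MINSTRET = 12'hB02;

  logic [63:0] mcycle_q, minstret_q;

  always_ff @(posedge clk_i or negedge rst_ni) begin
    if (!rst_ni) begin
      mcycle_q   <= '0;
      minstret_q <= '0;
    end else begin
      mcycle_q   <= mcycle_q + 64'd1;
      minstret_q <= minstret_q + (instr_retired_i ? 64'd1 : 64'd0);
    end
  end

  always_comb begin
    unique case (csr_addr_i)
      CSR_MCYCLE:   csr_rdata_o = mcycle_q;
      CSR_MINSTRET: csr_rdata_o = minstret_q;
      default:      csr_rdata_o = '0;
    endcase
  end
endmodule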
OpenPiton System Overview
(system block diagram, built up across several slides)
• Chip: tiles connected by the P-Mesh NoC, linked to the chipset through the chip bridge (see the routing sketch below)
• Chipset: P-Mesh off-chip routers (3) and P-Mesh chipset crossbars (3)
• Chipset peripherals: DRAM, SDHC (Wishbone), and I/O (AXI)
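For intuition about how a tile-to-tile packet finds its way across a 2D mesh such as P-Mesh, here is a generic dimension-ordered (X-then-Y) route computation in SystemVerilog. It is purely illustrative; the real P-Mesh packet format and routing rules are defined in the OpenPiton documentation.

// Generic dimension-ordered (X-then-Y) route computation for a 2D mesh,
// shown only to illustrate how a mesh router picks an output port.
typedef enum logic [2:0] {PORT_LOCAL, PORT_EAST, PORT_WEST, PORT_NORTH, PORT_SOUTH} port_e;

function automatic port_e route_xy(input logic [7:0] my_x,  input logic [7:0] my_y,
                                   input logic [7:0] dst_x, input logic [7:0] dst_y);
  // Travel along X until the column matches, then along Y.
  if      (dst_x > my_x) return PORT_EAST;
  else if (dst_x < my_x) return PORT_WEST;
  else if (dst_y > my_y) return PORT_SOUTH;  // assuming y grows toward the bottom of the die
  else if (dst_y < my_y) return PORT_NORTH;
  else                   return PORT_LOCAL;  // destination tile reached
endfunction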
Tile Overview
(tile block diagram)
• L2 cache slice + directory cache
• P-Mesh routers (3), connecting to other tiles
• MITTS (memory traffic shaper)
• Modified OpenSPARC T1 core with FPU
• L1.5 cache and CCX arbiter
Silicon Proven Designs: Ariane
• Ariane has been taped out in Globalfoundries 22nm FDX in 2017 and 2018 (QUENTIN, KERBIN, HYPERDRIVE, POSEIDON, KOSMODROM)
• The system features 16 kByte of instruction and 32 kByte of data cache
• Poseidon (layout shown):
  – Area: 0.23 mm² (~175 kGE)
  – 0.2 - 1.7 GHz (0.5 V - 1.15 V)
• Kosmodrom (layout shown, with Ariane HP, Ariane LP, NTX, and L2 blocks):
  – RV64GCXsmallFloat
  – Transprecision / vector FPU
  – Ariane HP: 8T library, 0.8 V, 1.3 GHz; 55 mW @ 1 GHz
  – Ariane LP: 7.5T ULP library, 0.5 V, 250 MHz; 5 mW @ 200 MHz
Silicon Proven Designs: Piton Chip
(die photo showing tiles, on-chip network links, and the off-chip memory and I/O interface)
• 25-core
  – 2 threads per core
  – 64-bit architecture
  – Modified OpenSPARC T1 core
• 3 NoCs (P-Mesh)
  – 64-bit, 2D mesh
  – Extend off-chip, enabling multichip systems
• Directory-based cache system
  – 64KB L2 cache per core (shared)
  – 8KB L1.5 data cache
  – 8KB L1 data cache
  – 16KB L1 instruction cache
• IBM 32nm SOI process
  – 6mm x 6mm
  – 460 million transistors
• Target: 1GHz clock @ 900mV
• 208-pin CQFP package
Piton Test Setup
(photo of the test board: Piton chip + heat sink, bridge FPGA (Spartan 6), chipset FPGA (Kintex 7), DRAM + I/O, bulk decoupling, misc. configuration, power supply)
[McKeown et al., Hot Chips 2016] [McKeown et al., IEEE MICRO 2017] [McKeown et al., HPCA 2018]
Putting it all together
(tile block diagram, as in the Tile Overview)
§ Native L1.5 interface is the ideal point to attach a new core
§ Well defined interface, similar to CCX from OpenSPARC
§ Write-through cache protocol
§ Coherency mechanism: only need to support invalidation messages (see the sketch below)
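A minimal sketch of why a write-through protocol keeps coherence simple: because the L1 never holds the only up-to-date copy of a line, the only coherence message it must handle is an invalidation. The module below is illustrative only; signal names and the interface are placeholders, not OpenPiton's L1.5 transaction types.

// Invalidation path of a direct-mapped, write-through L1 (sketch).
module l1_inval_sketch #(
  parameter int unsigned NR_LINES   = 256,
  parameter int unsigned LINE_BYTES = 16
) (
  input  logic        clk_i,
  input  logic        rst_ni,
  input  logic        inval_req_i,   // invalidation message from the L1.5
  input  logic [63:0] inval_addr_i
);
  localparam int unsigned OFF_W = $clog2(LINE_BYTES);
  localparam int unsigned IDX_W = $clog2(NR_LINES);

  logic [NR_LINES-1:0] valid_q;
  logic [IDX_W-1:0]    inval_idx;

  assign inval_idx = inval_addr_i[OFF_W +: IDX_W];

  always_ff @(posedge clk_i or negedge rst_ni) begin
    if (!rst_ni)          valid_q            <= '0;
    else if (inval_req_i) valid_q[inval_idx] <= 1'b0;  // drop the line
    // A real cache would also compare the stored tag before clearing, and
    // fills/hits (omitted here) would set and read valid_q.
  end
endmodule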
FPGA Prototyping Platforms
Available:
• Digilent Genesys2
  – $999 ($600 academic)
  – 1-2 cores at 66MHz
• Xilinx VC707
  – $3500
  – 1-4 cores at 60MHz
• Digilent Nexys Video
  – $500 ($250 academic)
  – 1 core at 30MHz
In progress:
• Xilinx VCU118, BittWare XUPP3R
  – $7000-8000
  – >100MHz
• Amazon AWS F1
  – Rent by the hour
OpenPiton Philosophy
• Focus/Value is in the Uncore
– Not religious about ISA
– Provide whole working system
• We are practical
– Use Verilog (Ariane is SV)
– Industry standard tools
– Use the best tool for the job (including commercial CAD tools)
• Primarily for research, but industry is welcome too
• Licensing
– All our code and the hypervisor are under BSD-like licenses
– Linux, T1 core (GPL or LGPL)
– Ariane (Solderpad)
• Scalability (Million Core)

OpenPiton Community
• Building a community
  – We welcome community contributions
  – Thousands of downloads
  – Google Group
• Visit http://openpiton.org
• openpiton@princeton.edu
Doing Research with OpenPiton + Ariane
(layer-stack figure: Apps / Compiler, Runtime / HV, OS / ISA / HW)
• Software
  – Install on Debian, test scalability
• Operating System
  – Recompile kernel, rebuild SW, run
• Hardware/Software Co-design
  – Add new instructions, change compiler/HV/OS/SW (see the sketch below)
• Architecture
  – Change parameters, rebuild HW, run
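As a sketch of the hardware side of adding a new instruction, the RISC-V ISA reserves the custom-0 major opcode (7'b0001011) for vendor extensions; decoding an R-type custom instruction could look like the following. The struct and function names are ours, not Ariane's actual decoder interface.

// Decode sketch for a custom R-type instruction on the custom-0 opcode.
localparam logic [6:0] OPCODE_CUSTOM0 = 7'b0001011;

typedef struct packed {
  logic       valid;
  logic [4:0] rd, rs1, rs2;
  logic [6:0] funct7;
  logic [2:0] funct3;
} custom_op_t;

function automatic custom_op_t decode_custom0(input logic [31:0] instr);
  custom_op_t op;
  op = '0;
  if (instr[6:0] == OPCODE_CUSTOM0) begin
    op.valid  = 1'b1;
    op.rd     = instr[11:7];    // standard R-type field positions
    op.funct3 = instr[14:12];
    op.rs1    = instr[19:15];
    op.rs2    = instr[24:20];
    op.funct7 = instr[31:25];   // selects among your custom operations
  end
  return op;
endfunction

On the software side, the GNU assembler's .insn directive can emit such an encoding before full compiler support exists, which is where the compiler/HV/OS/SW changes in the co-design loop come in.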

Enabled Research
(figures: per-application coherence domains; Execution Drafting pipeline stages; MITTS inter-arrival time distributions)
• Coherence Domain Restriction
  – Fu et al., MICRO 2015
• Execution Drafting
  – McKeown et al., MICRO 2014
• Memory Inter-arrival Time Traffic Shaper (MITTS)
  – Zhou et al., ISCA 2016
• Oblivious RAM
  – Fletcher et al., ASPLOS 2015
• DVFS modelling
• Multiple outside papers
• Numerous class research projects
