Professional Documents
Culture Documents
Fares Elsabbagh Blaise Tine Priyadarshini Roshan Ethan Lyons Euna Kim
Georgia Tech Georgia Tech Georgia Tech Georgia Tech Georgia Tech
fsabbagh@gatech.edu btine3@gatech.edu priya77darshini@gatech.edu elyons8@gatech.edu ekim79@gatech.edu
D. Warp Barriers
0 0
2x2 2x4 2x8 2x16 2x32 4x32 8x32 16x32 32x32
Fig. 8: Synthesized results for power, area and cell counts for
different number of warps and threads
1.2
0.6
Fig. 7: GDS layouts for our Vortex with 8 warp, 4 thread 0.4
0.2
configuration (4KB register file). The design was synthesized 0
2x2
2x4
2x8
2x16
2x32
4x32
8x32
16x32
32x32
2x2
2x4
2x8
2x16
2x32
4x32
8x32
16x32
32x32
2x2
2x4
2x8
2x16
2x32
4x32
8x32
16x32
32x32
2x2
2x4
2x8
2x16
2x32
4x32
8x32
16x32
32x32
2x2
2x4
2x8
2x16
2x32
4x32
8x32
16x32
32x32
2x2
2x4
2x8
2x16
2x32
4x32
8x32
16x32
32x32
2x2
2x4
2x8
2x16
2x32
4x32
8x32
16x32
32x32
for 300Mhz and produced a total power output of 46.8mW.
sgemm saxpy sfilter BFS gaussian vecadd nearn
Energy (J)
4
3
0.15 complexity for hazard management.
0.1
2 Simty [6] processor implements a specialized RISC-V ar-
1 0.05
0 0
chitecture that supports SIMT execution similar to Vortex,
2x2
2x4
2x8
2x16
2x32
4x32
8x32
16x32
32x32
2x2
2x4
2x8
2x16
2x32
4x32
8x32
16x32
32x32
2x2
2x4
2x8
2x16
2x32
4x32
8x32
16x32
32x32
2x2
2x4
2x8
2x16
2x32
4x32
8x32
16x32
32x32
2x2
2x4
2x8
2x16
2x32
4x32
8x32
16x32
32x32
2x2
2x4
2x8
2x16
2x32
4x32
8x32
16x32
32x32
2x2
2x4
2x8
2x16
2x32
4x32
8x32
16x32
32x32
but with different control flow divergence handling. In their
sgemm saxpy sfilter BFS gaussian vecadd nearn
work, only microarchitecture was implemented as a proof of
concept and there was no software stack, and none of GPGPU
Fig. 10: Power efficiency (# of warps x # of threads (Power applications were executed with the architecture.
efficiency and Energy)
VII. C ONCLUSIONS
MLP; Thus, the benchmark that benefited the most from the In this paper we proposed Vortex that supports an extended
high warp count is BFS which is an irregular benchmark. version of RISC-V for GPGPU applications. We also modi-
As we increase the number of threads and warps, the power fied OpenCL software stack (POCL) to run various OpenCL
consumption increases but they do not necessarily produce kernels and demonstrated that. We plan to release the Vortex
more performance. Hence, the most power efficient design RTL and POCL modifications to the public.2 We believe that
points vary depending on the benchmark. Figure 10 shows an Open Source version of RISC-V GPGPU will enrich the
a power efficiency metric (similar to performance per watt) RISC-V echo systems and accelerate other researchers that
which is normalized to the 2 warps x 2 threads configuration. study GPGPUs in wider topics since the entire software stack
The results show that for many benchmarks, the most power is also based on Open Source implementations.
efficient design is the one with fewer number of warps and
R EFERENCES
32 threads except for the BFS benchmark. As we discussed
earlier since BFS benchmark gets the best performance from [1] K. Asanovic, RISC-V Vector Extension. [Online]. Available: https:
the 32 warps x 32 threads configuration, it also shows the most //github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc
[2] K. Asanović et al., “Instruction sets should be free: The case for risc-
power efficient design point. v,” EECS Department, University of California, Berkeley, Tech. Rep.
UCB/EECS-2014-146, 2014.
E. Placement and layout [3] M. A. Cavalcante et al., “Ara: A 1 GHz+ scalable and energy-efficient
RISC-V vector processor with multi-precision floating point support in
We synthesized our RTL using a 15-nm educational library. 22 nm FD-SOI,” CoRR, vol. abs/1906.00478, 2019.
Using Innovus, we also performed Place and route (PnR). [4] C. Celio et al., “Boomv2: an open-source out-of-order risc-v core,”
in First Workshop on Computer Architecture Research with RISC-V
Figure 7 shows the GDS layout and the power density map (CARRV), 2017.
of our Vortex processor. From the power density map, we [5] S. Che et al., “Rodinia: A benchmark suite for heterogeneous comput-
observe that the power is well distributed among the cell area. ing,” ser. IISWC ’09. IEEE Computer Society, 2009.
In addition, we observe the memory including the GPR, data [6] S. Collange, “Simty: generalized simt execution on risc-v,” in First
Workshop on Computer Architecture Research with RISC-V (CARRV
cache, instruction icache and the shared memory have a higher 2017), 2017, p. 6.
power consumption. [7] J. J. Corinna Vinschen, “Newlib,” http://sourceware.org/newlib, 2001.
[8] W. W. L. Fung et al., “Dynamic warp formation and scheduling for
VI. R ELATED W ORK efficient gpu control flow,” ser. MICRO 40. IEEE Computer Society,
2007, pp. 407–420.
ARA [3] is a RISC-V Vector Processor that implements [9] M. Gautschi et al., “Near-threshold risc-v core with dsp extensions for
scalable iot endpoint devices,” IEEE Transactions on Very Large Scale
a variable-length single-instruction multiple-data execution Integration (VLSI) Systems, vol. 25, no. 10, pp. 2700–2713, 2017.
model where vector instructions are streamed into vector [10] G. Gobieski et al., “Manic: A vector-dataflow architecture for ultra-low-
lanes and their execution is time-multiplexed over the shared power embedded systems,” ser. MICRO ’52. ACM, 2019, pp. 670–684.
[11] J. Gray, “Grvi phalanx: A massively parallel risc-v fpga accelerator
execution units to increase energy efficiency. Ara design is accelerator,” in 2016 IEEE 24th Annual International Symposium on
based on the open-source RISC-V Vector ISA Extension Field-Programmable Custom Computing Machines (FCCM). IEEE,
proposal [1] taking advantage of its vector-length agnostic 2016, pp. 17–20.
[12] Green500, “Green500 list - june 2019,” 2019. [Online]. Available:
ISA and its relaxed architectural vector registers. Maximizing https://www.top500.org/lists/2019/06/
the utilization of the vector processors can be challenging, [13] P. Jaaskelainen et al., “Pocl: Portable computing language,” International
specifically when dealing with data dependent control flow. Journal of Parallel Programming, pp. 752–785, 2015.
That is where SIMT architectures like Vortex present an [14] P. O. Jskelinen et al., “Opencl-based design methodology for
application-specific processors,” in 2010 International Conference on
advantage with their flexible scalar-threads that can diverge Embedded Computer Systems: Architectures, Modeling and Simulation,
independently. July 2010, pp. 223–230.
HWACHA [15] is a RISC-V scalar processor with a vector [15] Y. Lee et al., “A 45nm 1.3ghz 16.7 double-precision gflops/w risc-v
processor with vector accelerators,” in ESSCIRC 2014 - 40th European
accelerator that implements a SIMT-like architecture where Solid State Circuits Conference (ESSCIRC), Sep. 2014, pp. 199–202.
vector arithmetic instructions are expanded into micro-ops and
scheduled on separate processing threads on the accelerator. 2 currently the github is private for blind review.
[16] Y. Lee et al., “A 45nm 1.3 ghz 16.7 double-precision gflops/w risc-v
processor with vector accelerators,” in ESSCIRC 2014-40th European
Solid State Circuits Conference (ESSCIRC). IEEE, 2014, pp. 199–202.
[17] A. Munshi, “The opencl specification,” in 2009 IEEE Hot Chips 21
Symposium (HCS), Aug 2009, pp. 1–314.
[18] V. Narasiman et al., “Improving gpu performance via large warps and
two-level warp scheduling,” ser. MICRO-44. New York, NY, USA:
ACM, 2011, pp. 308–317.
[19] NVIDIA, “Cuda binary utilities,” NVIDIA Application Note, 2014.
[20] A. Waterman et al., “The risc-v instruction set manual. volume 1: User-
level isa, version 2.0,” EECS Department, UC Berkeley, Tech. Rep.,
2014.
[21] A. Waterman et al., “The risc-v instruction set manual, volume i: Base
user-level isa,” EECS Department, UC Berkeley, Tech. Rep. UCB/EECS-
2011-62, vol. 116, 2011.
[22] B. Zimmer et al., “A risc-v vector processor with tightly-integrated
switched-capacitor dc-dc converters in 28nm fdsoi,” in 2015 Symposium
on VLSI Circuits (VLSI Circuits). IEEE, 2015, pp. C316–C317.