
Scalable Memory Architecture for Soft-core Processors
Tiago T. Jost, Gabriel L. Nazar, Luigi Carro
Instituto de Informática
Universidade Federal do Rio Grande do Sul
Porto Alegre, Brazil
{ttjost, glnazar, carro}@inf.ufrgs.br

Abstract—Restrictions on memory performance have always had a great impact on soft-core processors. The reduced number of ports on FPGAs' block RAMs may limit the exploitation of parallelism on soft-core processors implemented on top of these devices. Multi-ported memories on FPGAs are cumbersome and do not scale well, incurring high area and power costs when implemented. To mitigate the impact of the memory bottleneck on such devices, we propose a scalable memory architecture for soft-cores. We make use of software-managed memories to build a memory system capable of improving performance and instruction-level parallelism (ILP) on soft-core processors. Results show that our architecture overcomes the limited parallelism achievable on a dual-ported processor, reducing execution time by 16.5%. These improvements come with no area cost, as the processor keeps the same total memory. Automated code transformations implemented within the LLVM compiler keep changes in application code to a minimum. We also show that our architecture scales better when the number of functional units in the system is increased.

Keywords—FPGA; BRAM; soft-core; compiler; memory architecture

I. INTRODUCTION

Memory bandwidth has long been a limiting factor in performance for many applications. Even when exploiting access locality through cache memories to minimize this limitation, multi-ported memories are known to introduce significant costs [1]. When developing applications on Field Programmable Gate Arrays (FPGAs), providing an efficient on-chip memory hierarchy may become even harder, as we are restricted to fixed resources. Modern FPGAs may include tens of megabytes of on-chip memory, distributed over numerous independent banks typically referred to as block RAMs (BRAMs). Each BRAM has its own memory ports and address space, giving enormous aggregate bandwidth to these internal memories. High bandwidth is achieved when data is fetched in parallel from multiple uncorrelated BRAMs, i.e., when there is no need for a unified address space. Nevertheless, from a software programmer's point of view, memory is seen as one sizable storage with a single address space. Hence, the traditional FPGA memory model is not particularly suited to meet software's requirements. FPGAs' flexibility allows multiple BRAMs to be combined into a unified and larger address space, with the downside of significantly increasing area and compromising memory bandwidth. Such constraints might reduce instruction-level parallelism (ILP) and performance on soft-cores implemented on top of FPGAs [2], [3].

Due to area restrictions on FPGAs, soft-core processors lean towards simpler designs than hardwired processors, usually lacking features such as register renaming or branch prediction. Memory constraints add a further complication due to the limited number of ports on BRAMs. Using BRAMs to build multi-ported memories for soft-core processors does not scale well, as area increases and frequency decreases significantly as the number of ports grows [4]. The number of memory ports is, therefore, a limiting factor for performance in many applications.

In this paper, we propose a scalable memory architecture (SMA) for boosting soft-core processors through the use of multiple software-managed memories (SMMs). We have implemented an LLVM-based compiler that automates code generation for the memory architecture. We have found no prior work that focuses primarily on using software-managed memories to increase performance and ILP on soft-cores. Some previous works [5], [6] use software-controlled memories, called scratchpad memories, as a replacement for caches with the purpose of reducing energy consumption; others [7], [8] focus on enhancing the performance of embedded processors, although through a reduction of cache misses rather than an increase in parallelism. The major difference between the scratchpads of previous works and our memories is how address spaces are treated. Scratchpads typically share the address space of the memory hierarchy, i.e., a range of addresses is scratchpad-addressable while the rest is handled through the cache. Our software-managed memories, in contrast, have unique address spaces, with no correlation among them whatsoever.

FPGAs' restrictions impose a challenge on increasing the bandwidth available to soft-core processors. Our approach avoids high costs in area and latency while providing parallel memory ports and, in consequence, ILP improvements. We demonstrate how our memory architecture improves performance on a soft-core very long instruction word (VLIW) processor through a set of benchmarks, including matrix multiplication, sum of absolute differences (SAD), and x264. We also show that our architecture scales better when the number of functional units in the system is increased.
This work is sponsored by CNPq (Conselho Nacional de
Desenvolvimento Científico e Tecnológico, Brazil).


The rest of the paper is organized as follows. Section II presents related work. Section III describes the hardware and software modifications required for our technique. Section IV presents the results, and Section V concludes the paper.

II. RELATED WORK

Software-controlled memories are an attractive option for replacing caches on embedded processors due to their simpler hardware. There is no need for any kind of hardware coherency mechanism, as software is responsible for their control. They can, therefore, provide considerable area and latency improvements over caches. Our technique uses BRAMs as software-controlled memories to boost scalability with regard to parallel memory accesses. Hence, our work shares similarities with two different topics of research: (1) the use of software-controlled memories as a replacement for power-hungry caches [5], [8], and (2) efficient ways of using memories on FPGAs [4], [9]–[12].

The use of software-managed memories as a replacement for caches has been vastly explored in the past decade. The work in [5] shows an average reduction of 20-44% in energy consumption with instruction scratchpads, while Avissar et al. [8] use an automatic compiler method for statically allocating program data to scratchpad memories. In contrast to previous works [7], [8] that primarily focused on performance improvements from decreasing the miss penalty of the memory hierarchy, we implement a scalable memory architecture that combines software-managed memories and a memory cache. We aim at reusing data to improve performance and ILP in soft-core processors, showing that multiple memories can also help increase performance on such processors by enabling multiple accesses at once. Moreover, if one considers the use of FPGAs for the implementation of processors, additional care must be taken due to the restrictions of BRAMs, which are typically dual-ported.

Recent research has provided various contributions on the use of BRAMs. Adler et al. [10] propose the Logic-based Environment for Application Programming (LEAP), an automatic memory management tool for reconfigurable logic. The main idea is to use LEAP scratchpads to provide an easy way of managing memory on FPGAs, so developers can focus primarily on their core algorithms and less on managing memory. Yang et al. [9] and Winterstein et al. [12] extend LEAP scratchpads to automate the construction of application-specific memory hierarchies, so that every application gets an optimized memory hierarchy that best fits its characteristics. CoRAM++ [11] provides a convenient approach for designing efficient complex data structures in memory without penalizing access performance. It uses an application-level interface that models data structures such as multi-dimensional arrays, linked lists, and simple data patterns.

All the works mentioned above help to address the mismatch between the needs of applications and memory mapping on FPGAs. The study in [13] demonstrates that memory-aware designs improve energy consumption over non-optimal designs by more than a factor of 2. Moreover, the work in [14] presents a comparison of performance and energy consumption between FPGAs and graphics processing units (GPUs) and illustrates how memory bandwidth can affect those results. Our approach is orthogonal to techniques like [10] and [11], i.e., they could even be coupled with ours to improve performance further.

The work in [4] introduces a new approach for creating multi-ported memories in FPGAs that efficiently combines BRAMs while achieving significant performance improvements over other methods. It replicates memories according to the number of read/write ports: an 8-read/4-write memory, for example, can be implemented using eight 2-read/1-write memories. It also uses live value tables (LVTs) to identify where the most recent copy of the data is located. Although this approach is better than others, it still requires at least doubling the area to build multi-ported memories with more than two ports.

III. HARDWARE AND SOFTWARE REQUIREMENTS

Software-managed memories are particularly beneficial when reuse can be guaranteed, i.e., one needs to ensure that temporal and spatial locality are present in the program. Our approach aims at using these memories in soft-cores, processors that struggle with limited memory bandwidth due to how FPGAs traditionally handle large address spaces. FPGAs are very fast and effective when handling multiple address spaces; large, unified memory spaces, however, become cumbersome when more than two ports per BRAM are needed.

Our main goal is to mitigate memory complexity in soft-cores by creating a memory architecture that exploits multiple BRAMs in parallel. That way, single-ported memories are mapped to single-ported BRAMs and no overhead for adding new ports is necessary. The rest of this section details the hardware changes and software implementation required for the technique.

A. Hardware

Minimal hardware changes are necessary to support SMMs in our system. Figure 1 shows an example of how they are placed within a hypothetical 4-issue VLIW processor with a 5-stage pipeline. A memory architecture is created by placing a software-managed memory within each lane. With the exception of Lane0, which can also access the regular data memory hierarchy through the L1 data cache (the MEM block in Figure 1), each lane can only access its own memory's content. The main difference between SMMs and caches lies in who controls them: the former are controlled by software and require no extra hardware for tag storage and checking, while the latter are hardware-controlled and give the programmer little control over where data is placed in memory.
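As a toy model of this organization (the names and types below are illustrative, not the actual RTL), the per-lane topology of Figure 1 can be summarized as:

```cpp
// Toy model of the Figure 1 topology: four lanes, each owning a private
// 2 KB SMM; only lane 0 also reaches the regular memory hierarchy through
// the L1 data cache (the MEM block).
#include <array>
#include <cstdint>

struct Lane {
    std::array<int32_t, 512> smm{};  // 2 KB private software-managed memory
    bool has_cache_port = false;     // true only for lane 0
};

struct VliwCore {
    std::array<Lane, 4> lanes{};
    VliwCore() { lanes[0].has_cache_port = true; }
};
```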
Because each lane can only operate on one memory, there is no need to specify which memory SMM instructions should access. Each lane only sees its own memory, and the address spaces are independent of one another. Thereby, we avoid the costs of multi-ported memories while still providing parallelism for the processor. SMM instructions operate in the same way as loads and stores: one register serves as the base register, an immediate offset is added to the base register, and a second register holds the value to be stored to or loaded from memory.
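The following fragment models these semantics in software; the function names and the word-addressed 2 KB array are illustrative assumptions, not the actual instruction encoding.

```cpp
// Software model of one lane's SMM access semantics (illustrative names;
// the real operations are processor instructions, not function calls).
#include <cassert>
#include <cstdint>

constexpr int32_t kSmmBytes = 2 * 1024;  // one 2 KB SMM
static int32_t g_smm[kSmmBytes / 4];     // this lane's private memory

// SMM load: effective address = base register + immediate offset,
// resolved in the lane's own address space, exactly like a regular load.
int32_t smm_load(int32_t base, int32_t offset) {
    const int32_t addr = base + offset;
    assert(addr >= 0 && addr + 4 <= kSmmBytes && addr % 4 == 0);
    return g_smm[addr / 4];
}

// SMM store: the second register operand holds the value to be written.
void smm_store(int32_t base, int32_t offset, int32_t value) {
    const int32_t addr = base + offset;
    assert(addr >= 0 && addr + 4 <= kSmmBytes && addr % 4 == 0);
    g_smm[addr / 4] = value;
}
```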


Fig. 1. 4-issue VLIW processor with software-managed memories.

B. Software

In order to minimize the programmer's effort, the LLVM compiler was extended to support SMMs. The compiler guarantees correct and efficient code generation through a series of data-flow analyses and transformations. Users decide which variables are placed within the memories through code annotation. The compiler then modifies the code by replacing memory instructions with instructions that manipulate SMMs. We aim at using these memories primarily to store vector and matrix variables, since they offer more opportunities for parallelism.
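To make the flow concrete, the fragment below sketches what such an annotation could look like at the source level; the SMM macro, its use of Clang's generic annotate attribute, and the variable partitioning are illustrative assumptions, since the paper does not fix an annotation syntax.

```cpp
// Hypothetical annotation marking arrays for SMM placement. The paper does
// not define the syntax; Clang's generic annotate attribute stands in here.
#define SMM __attribute__((annotate("smm")))

enum { N = 16 };

static int SMM a[N][N];  // candidate for one lane's software-managed memory
static int SMM b[N][N];  // candidate for another lane's memory
static int c[N][N];      // result stays in the cached memory hierarchy

void matmul(void) {
    // After the compiler pass runs, loads from a and b would be rewritten
    // into SMM loads served by independent per-lane memories, allowing
    // several memory operations to issue in the same VLIW bundle.
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) {
            int acc = 0;
            for (int k = 0; k < N; ++k)
                acc += a[i][k] * b[k][j];
            c[i][j] = acc;
        }
}
```

In this sketch, the pass would replace the memory instructions touching a and b with their SMM counterparts, while accesses to c keep going through Lane0's cache port.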
A preamble code is also a requirement for most applications. This piece of code, generated by the compiler, is responsible for fetching values that should reside in the SMMs from the regular memory hierarchy, i.e., through the L1 data cache. The unveiling of ILP relies on using multiple software-managed memories in parallel: once data is loaded into the SMMs, it can be accessed with high bandwidth through the multiple independent ports. The compiler's goal is thus to maximize the usage of these memories while still reasoning about parallel operations.
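A minimal sketch of such a preamble follows, using plain arrays as a stand-in for the per-lane SMMs; the lane count, layout, and helper name are hypothetical.

```cpp
// Sketch of a compiler-generated preamble: values that should reside in
// the SMMs are fetched through the regular hierarchy (ordinary loads via
// the L1 data cache) and written into each lane's private memory.
#include <cstddef>
#include <cstdint>

constexpr int kLanes = 4;                 // hypothetical lane count
constexpr std::size_t kSmmWords = 512;    // 2 KB per SMM, word-addressed
static int32_t g_smm[kLanes][kSmmWords];  // software model of the SMMs

// Copy `words` words of `src` into the SMM of `lane`, starting at word 0.
// Each iteration is a cached load from src followed by an SMM store.
void smm_preamble_copy(int lane, const int32_t *src, std::size_t words) {
    for (std::size_t i = 0; i < words; ++i)
        g_smm[lane][i] = src[i];
}
```

Once the preamble has run, the hot loop reads its operands with one independent port per lane instead of competing for the cache's ports.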
IV. EXPERIMENTAL RESULTS

For the performance evaluation, we developed an LLVM [15] backend for a VLIW processor and added the technique to the framework. LLVM has proven to be very modular, which made it a perfect fit for our work. Code optimizations take the form of passes, which may either collect information or transform the program. We added a new pass that analyzes and transforms code right before register allocation takes place. That way, the standard LLVM heuristics deal with register allocation after the new code is added. Code transformation is performed on virtual registers, which eases the process of inserting and modifying code.
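The pass itself is not listed in the paper; as a rough sketch of how such a transformation is typically structured in LLVM, a machine-function pass run before register allocation could look like this (the pass name is illustrative and the opcode rewriting is elided).

```cpp
// Sketch of a machine-level LLVM pass scheduled before register allocation.
// Operands are still virtual registers at this point, so the standard
// allocator handles the rewritten instructions afterwards.
#include "llvm/CodeGen/MachineFunctionPass.h"

using namespace llvm;

namespace {
struct SMMRewrite : public MachineFunctionPass {
  static char ID;
  SMMRewrite() : MachineFunctionPass(ID) {}

  bool runOnMachineFunction(MachineFunction &MF) override {
    bool Changed = false;
    for (MachineBasicBlock &MBB : MF)
      for (MachineInstr &MI : MBB) {
        // Here: detect loads/stores whose data was annotated for SMM
        // placement and swap their opcodes for the SMM equivalents.
        (void)MI;
      }
    return Changed;
  }
};
char SMMRewrite::ID = 0;
} // end anonymous namespace

// A target backend would register the pass in its TargetPassConfig,
// e.g. from the addPreRegAlloc() hook, so it runs right before the
// register allocator.
```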

Our VLIW processor implements the VEX instruction set architecture (ISA) [16], developed by HP. The company provides a handful of tools [17] for the architecture, including a customizable simulation system that we used for simulation. We modified it and added new instructions to handle SMMs. The tool already provides a built-in level-one cache simulator, so a complete simulation platform is available.

We compared our proposed solution by measuring execution time on processors with different configurations. Processors that implement the technique are denoted SMA processors, while those that make no use of software-managed memories are the regular processors. We first consider 8-issue processors. Six benchmarks were used to measure gains: a discrete Fourier transform (DFT), three matrix multiplications with different matrix sizes, the sum of absolute differences (SAD), and an x264 video encoder algorithm. For a fairer comparison, we reduced the size of the level-one data cache on the SMA processors in proportion to the size of all SMMs together. SMMs are 2 KB each, for a total of 16 KB. These dimensions are in tune with the BRAM blocks offered by current state-of-the-art FPGAs [18]. Thus, SMA processors have 16 KB of data cache, while regular processors have 32 KB. This is a conservative approach, as the extra 16 KB of cache in the baseline still require additional tag storage, which is not necessary for the SMAs.

Figure 2 illustrates the comparison between 8-issue dual-ported processors, as BRAMs in FPGAs typically have this characteristic. Results show how the SMA processor reduces execution time over the baseline for all benchmarks. The use of software-managed memories reduces execution time by an average of 16.5%. As the number of memory ports increases, for instance to four or even eight ports, the software-managed memories become more dispensable, since the memory hierarchy can already provide the system with high throughput. As discussed, however, such caches are very costly, especially on the FPGA substrate. Still, for the dual-ported case, the SMA processor provides performance improvements over the baseline, with execution time reductions between 10% and 25%.

Our last measurement verified whether our memory architecture helps boost performance as the number of issues on the processor increases. Figure 3 depicts the performance comparison for 16-issue dual-ported processors. The average improvement over the baseline went from 16.5% to 20.3%.

Fig. 2. Comparison between 8-issue dual-ported processors (normalized execution time for DFT, Matrix 10x10, Matrix 16x16, Matrix 32x32, SAD, and x264; Regular - 2 Mem vs. SMA - 2 Mem). Average execution time reduction of 16.5%.


Fig. 3. Comparison between 16-issue dual-ported processors (normalized execution time; Regular - 16i - 2 Mem vs. SMA - 16i - 2 Mem). Average execution time reduction of 20.3%.

The improvements offered by the proposed technique, when compared to those observed for 8-issue processors, were greater in four out of six benchmarks. One benchmark, Matrix 10x10, showed virtually the same ratio. Matrix 16x16, on the other hand, went from a 19.1% to a 28.8% reduction. These results indicate that, when the application exhibits high ILP, the traditional memory hierarchy is a limiting factor that can be alleviated with independent SMMs. Moreover, DFT showed a performance reduction on both 16-issue dual-ported processors compared to their 8-issue counterparts: the register file cannot accommodate all live variables at the same time, so some variables are spilled to memory and performance is reduced. Because the SMA processor exploits more parallelism, more variables are kept alive at a given point in the program, introducing more spill code and reducing the difference between the 16-issue processors.

V. CONCLUSION

This work proposed a scalable memory architecture for soft-cores through the use of software-managed memories. We have developed an LLVM-based backend that analyzes and transforms code to operate such memories in cooperation with the memory hierarchy. Results show an execution time reduction of 16.5% for the dual-ported processor. Our architecture also shows promising gains when the number of issues is raised to 16, scaling better than the baseline as the issue width increases. The same area as the baseline processor was kept in all cases.

Future work will focus on eliminating the need for code annotation on the user's part. The SMA is not restricted to VLIW processors, although special care will have to be taken when applying the technique to other ILP-capable architectures, e.g., superscalar processors. Extending the technique to such cases is relevant future work. We also plan to extend the work to reason about energy consumption.

REFERENCES

[1] Y. Tatsumi and H. J. Mattausch, "Fast quadratic increase of multiport-storage-cell area with port number," Electron. Lett., vol. 35, no. 25, pp. 2185–2187, 1999.
[2] J. G. Tong, I. D. L. Anderson, and M. A. S. Khalid, "Soft-core processors for embedded systems," in Microelectronics, 2006. ICM'06. International Conference on, 2006, pp. 170–173.
[3] S. Wong, T. Van As, and G. Brown, "ρ-VEX: A reconfigurable and extensible softcore VLIW processor," in ICECE Technology, 2008. FPT 2008. International Conference on, 2008, pp. 369–372.
[4] C. E. LaForest and J. G. Steffan, "Efficient multi-ported memories for FPGAs," in Proceedings of the 18th Annual ACM/SIGDA International Symposium on Field Programmable Gate Arrays, 2010, pp. 41–50.
[5] M. Verma, L. Wehmeyer, and P. Marwedel, "Cache-aware scratchpad allocation algorithm," in Proceedings of the Conference on Design, Automation and Test in Europe - Volume 2, 2004, p. 21264.
[6] S. Steinke, L. Wehmeyer, B.-S. Lee, and P. Marwedel, "Assigning program and data objects to scratchpad for energy reduction," in Design, Automation and Test in Europe Conference and Exhibition, 2002. Proceedings, 2002, pp. 409–415.
[7] F. Angiolini, F. Menichelli, A. Ferrero, L. Benini, and M. Olivieri, "A post-compiler approach to scratchpad mapping of code," in Proceedings of the 2004 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, 2004, pp. 259–267.
[8] O. Avissar, R. Barua, and D. Stewart, "An optimal memory allocation scheme for scratch-pad-based embedded systems," ACM Trans. Embed. Comput. Syst., vol. 1, no. 1, pp. 6–26, Nov. 2002.
[9] H.-J. Yang, K. Fleming, M. Adler, F. Winterstein, and J. Emer, "Scavenger: Automating the construction of application-optimized memory hierarchies," in Field Programmable Logic and Applications (FPL), 2015 25th International Conference on, 2015, pp. 1–8.
[10] M. Adler, K. E. Fleming, A. Parashar, M. Pellauer, and J. Emer, "LEAP scratchpads: Automatic memory and cache management for reconfigurable logic," in Proceedings of the 19th ACM/SIGDA International Symposium on Field Programmable Gate Arrays, 2011, pp. 25–28.
[11] G. Weisz and J. C. Hoe, "CoRAM++: Supporting data-structure-specific memory interfaces for FPGA computing," in Field Programmable Logic and Applications (FPL), 2015 25th International Conference on, 2015, pp. 1–8.
[12] F. Winterstein, K. Fleming, H.-J. Yang, S. Bayliss, and G. Constantinides, "MATCHUP: Memory abstractions for heap manipulating programs," in Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2015, pp. 136–145.
[13] E. Kadric, D. Lakata, and A. DeHon, "Impact of memory architecture on FPGA energy consumption," in Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2015, pp. 146–155.
[14] B. Betkaoui, D. B. Thomas, and W. Luk, "Comparing performance and energy efficiency of FPGAs and GPUs for high productivity computing," in Field-Programmable Technology (FPT), 2010 International Conference on, 2010, pp. 94–101.
[15] C. Lattner and V. Adve, "LLVM: A compilation framework for lifelong program analysis & transformation," in Proceedings of the International Symposium on Code Generation and Optimization: Feedback-Directed and Runtime Optimization, 2004, p. 75.
[16] J. A. Fisher, P. Faraboschi, and C. Young, Embedded Computing: A VLIW Approach to Architecture, Compilers and Tools. Elsevier, 2005.
[17] Hewlett-Packard Laboratories, "VEX Toolchain." [Online]. Available: http://www.hpl.hp.com/downloads/vex/.
[18] Xilinx, "UltraScale Architecture and Product Overview," DS890, pp. 1–31, 2015.


