
2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
978-1-6654-3577-2/21/$31.00 ©2021 IEEE | DOI: 10.1109/IPDPSW52791.2021.00030

Fast HBM Access with FPGAs: Analysis, Architectures, and Applications

Philipp Holzinger∗, Daniel Reiser∗, Tobias Hahn∗ and Marc Reichenbach∗
∗Department of Computer Science, Chair of Computer Architecture
Friedrich-Alexander University Erlangen-Nürnberg (FAU), Germany
{philipp.holzinger, daniel.g.reiser, tobias.hahn, marc.reichenbach}@fau.de

Abstract—Over the past few decades, the gap between rapidly increasing computational power and almost stagnating memory bandwidth has steadily worsened. Recently, 3D die-stacking in form of High Bandwidth Memory (HBM) enabled the first major jump in external memory throughput in years. In contrast to traditional DRAM it compensates its lower clock frequency with wide busses and a high number of separate channels. However, this also requires data to be spread out over all channels to reach the full throughput. Previous research relied on manual HBM data partitioning schemes and handled each channel as an entirely independent entity. This paper in contrast also considers scalable hardware adaptions and approaches system design holistically. In this process we first analyze the problem with real world measurements on a Xilinx HBM FPGA. Then we derive several architectural changes to improve throughput and ease accelerator design. Finally, a Roofline based model to more accurately estimate the expected performance in advance is presented. With these measures we were able to increase the throughput by up to 3.78× with random and 40.6× with certain strided access patterns compared to Xilinx' state-of-the-art switch fabric.

Keywords—Field Programmable Gate Arrays (FPGA), High-Bandwidth Memory (HBM), Computational Modeling

I. INTRODUCTION

The ever increasing demand for computational power recently reached new heights with the widespread adoption of deep learning and big data techniques. However, these requirements are getting increasingly harder to meet due to the ongoing slowdown of Moore's Law and Dennard Scaling. For this reason, a shift to more application-specific circuits and heterogeneous hardware can be seen in both academia and industry [1]. However, their full potential is often limited by the available memory bandwidth since data often cannot be fetched fast enough [2]. This bottleneck is mainly caused by the much slower pace at which DRAM manufacturers are able to develop faster device generations. In the past, this effect was particularly pronounced for FPGAs since they had far slower external memory access than GPUs [3]. The resulting processor-memory gap has been a major challenge for more than two decades [4].

Recent advancements in 3D die-stacking and packaging technology made it possible to significantly increase the memory throughput [5]. This resulted in the creation of the High Bandwidth Memory (HBM) standard for this type of memory organization. Since then, it has been implemented in commercial GPUs and FPGAs that promise unprecedented performance. For this reason, it has recently been utilized by several application-specific accelerators for various use cases [6]–[8]. In contrast to traditional DRAM, HBM is usually clocked lower but uses wider busses and significantly more channels. However, this requires a much higher degree of memory access parallelization than before. To facilitate this use, crossbars for global addressing have been implemented on FPGAs. They considerably simplify system design and cooperation between cores with different access patterns. However, depending on their usage, these bus fabrics can also introduce severe limitations in throughput and latency [8]. Therefore, it cannot be assumed that designs scale linearly with the theoretical bandwidth when switching from traditional DRAM to HBM. This makes HBM devices in general more difficult to handle and requires architectural adjustments to accelerators.

Previous work that analyzes system behavior either targeted traditional DRAM [9], [10] or GPUs [11], [12]. These approaches did not consider the interaction challenges between FPGA accelerators and novel HBM memory interfaces. Several further papers implemented FPGA designs with HBM but only focused on workload specific data partitionings [6], [8], [13]. As a consequence, they gave up the advantages of global addressing and did not fully explore the possibilities of this design space. In contrast, this paper presents a holistic analysis and evaluation based on the Roofline model [14] to guide designers through a transition to HBM. It eases accelerator design by more accurately estimating the expected performance in advance and offering design guidelines. Furthermore, we integrated several of these optimizations into an IP core. All in all, our main contributions can be summarized as:

• provide a comprehensive analysis of efficient HBM access
• derive architecture guidelines for HBM usage from this analysis
• provide a ready-to-use Memory Access Optimizer (MAO) IP core which eases implementing these guidelines
• prove our approach by applying the methodology to state-of-the-art accelerators

The paper is structured as follows: Section II explains HBM2 memory subsystem architectures at the example of Xilinx devices and their problems. Then, prior work is discussed in Section III. Section IV provides a more comprehensive analysis of the memory characteristics and derives necessary architecture considerations. Based on this, our methodology is evaluated with two example designs in Section V. Finally, Section VI summarizes the paper.

Fig. 1: Structure of HBM interfaces on the example of Xilinx Virtex UltraScale+ FPGAs [15]. (The figure shows the accelerator bus masters (BM) in the programmable logic, the Memory Access area, the segmented network of crossbar switches with lateral connections, the memory controllers MC 0–15 in the ASIC, and the two 4-Hi HBM stacks with pseudo channels PCH 0–31 and their associated memories; AXI and DDR master/slave ports as well as fast and slow paths are marked.)

II. BACKGROUND AND PROBLEM DESCRIPTION

To understand general system level implications of HBM usage, it is first necessary to analyze the underlying technology constraints. Fig. 1 shows the structure of such a system on the example of Xilinx Virtex UltraScale+ FPGAs [15]. To reach the promised high throughput, HBM uses by concept a very high number of independent memory channels instead of increasing the frequency. These are provided by parallelizing access to one or more stacks of N DRAM chips (N-Hi) each. Hereby, every chip has two completely independent channels that are each further split into two 64 bit pseudo channels (PCH) with a common command path. Nevertheless, every PCH provides exclusive access to only its own associated memory subsection via through-silicon vias. These can then be directly used by an equal number of bus masters (BM) in an accelerator. However, this strict separation is generally not suitable for all applications. This is for example the case when the working set of a BM is bigger than its PCH capacity, as with graph algorithms where data anywhere in the memory might be accessed. Therefore, an interconnect structure is inevitable to provide more flexibility. This is called global addressing in the following. One drawback of such designs is their high complexity. Especially due to the high number of I/O pins that cover a great physical distance on the chip, extensive bus pipelining is needed to compensate the high wire delays. These factors considerably complicate HBM interface design. For this reason, compromises between throughput, latency, die size, and flexibility are made. In fact, both Xilinx and Intel currently segment this structure into smaller crossbars [15], [16]. Due to this inherent challenge of HBM based systems, our results obtained on the Xilinx devices presented in the following can be seen as indicative for a larger group of systems.

As depicted in Fig. 1, Xilinx devices use two 4-Hi stacks with 4 GB capacity each. This results in a total of 32 PCH which are presented to the programmable logic (PL) as just as many 256 bit AXI3 busses with half the frequency. Hereby, every two share a common memory controller (MC) to perform the AXI to DDR protocol conversion. The global addressing interconnect structure is implemented as a segmented switch network in which every four AXI slaves and two MCs are directly connected by a local crossbar switch. In contrast to Intel, these are part of the FPGA ASIC and contain lateral connections to route AXI transactions towards the destination when the address does not belong to the local switch. However, only two outgoing busses are available in every direction. Hereby, the memory capacity of every PCH is contiguously mapped into successive sections of the global address space.

Although this kind of segmentation enables the desired global addressing, it has several major drawbacks. First, it changes the latency of memory accesses since AXI transactions potentially need to be routed over several hops. This non-uniformity might be an issue for accelerators which expect data to arrive approximately at the same time. Second, the limited number of lateral connections effectively reduces the bandwidth if requests need to be routed over the same bus. Third, the used assignment of PCH address spaces makes interoperation with a CPU and common programming languages much more difficult. These usually assume that data lies contiguously in memory from their perspective. If this data is simply copied to HBM with such an address layout, it will be placed in the same PCH until its maximum capacity is reached. This potentially causes BMs to contend for the same PCH and therefore severely decreases latency and throughput. Overall these issues can lead to worse results than with traditional DRAM due to the lower clock frequency.

These problems need to be considered when architectures for HBM are drafted. Here, a model is required that factors in both memory throughput and compute capabilities to quickly estimate the overall system performance. One of the most illustrative ones for this purpose is the Roofline model [14]. It places algorithm implementations in a 2D graph that limits achievable performance by simple computational and bandwidth ceilings. However, these predictions can only be as good as the underlying assumptions. As the HBM throughput can vary widely, this behavior must also be accounted for. Otherwise the model would be misleading for system designers and lead to wrong architecture decisions. Therefore, a closer investigation of the performance impediments is necessary and presented in the following.

III. RELATED WORK

After the launch of HBM, its first productive use took place in GPUs. In the course of this, the Roofline model has also been used to predict application performance on these systems [11], [12]. However, due to their very homogeneous architecture and memory hierarchy, these models are not representative for FPGAs. To accommodate this design specialization, several papers explored Roofline modelling of application-specific accelerators. These approaches relate computational performance, memory bandwidth, and resource consumption for e.g. HLS generated circuits [17] or CNN accelerators [10]. Although they consider external memory bandwidth an important limiting factor for design performance, they also assume that the theoretical maximum can simply always be reached. However, as shown in Section II this is not accurate enough with HBM. In contrast, Göbel et al. take this factor more into consideration by analyzing memory access traces of software that is used for HLS [9]. A similar approach has been taken by Siracusa et al. who use additional random access and gather-scatter bandwidth ceilings in the Roofline model [18]. Although an accelerator connected to HBM has also been shown, they did not further investigate it. Therefore, it is still unclear which consequences its usage has on different kinds of hardware.

TABLE I: Used memory access patterns

          Single Channel   Cross Channel
Stride    SCS              CCS
Random    SCRA             CCRA

Fig. 2: Achievable throughput when AXI reads and writes are issued with different ratios at 300 MHz.
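The ratio effect that Fig. 2 describes can be approximated with a small first-order model. This is our own simplification, not the paper's measurement: it assumes one data beat per cycle and AXI direction and an effective per-channel limit of about 13.3 GB/s (roughly 7–8% below the 14.4 GB/s maximum), and it ignores bus turnaround penalties:

```python
def channel_throughput_gbs(freq_mhz: float, wr_per_rd: float,
                           width_bits: int = 256,
                           mem_eff_gbs: float = 13.3) -> float:
    """First-order combined read+write throughput of one pseudo channel.

    wr_per_rd: concurrent write traffic per unit of read traffic
    (e.g. 0.5 for a 2:1 read/write ratio). The independent AXI read and
    write channels each sustain at most one beat per cycle, but the
    bidirectional DDR bus behind them caps the combined total.
    """
    port = (width_bits / 8) * freq_mhz / 1000.0  # GB/s per AXI direction
    return min(port * (1.0 + wr_per_rd), mem_eff_gbs)

# Reads alone at 300 MHz leave the memory underutilized (9.6 GB/s),
# while a 2:1 read/write mix at 300 MHz already saturates the same
# ceiling as a unidirectional 450 MHz design.
```

In this model, lowering the clock from 450 MHz to 300 MHz costs nothing once a 2:1 mix keeps the DDR bus busy, which qualitatively matches the roughly 2% loss reported for the measured system.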
Since then HBM usage on FPGAs has also become the main focus of several publications [6]–[8], [13]. A first exploration of its characteristics on Xilinx devices has been presented by Wang et al. [13]. Although they benchmarked several basic latencies and throughputs for different MC configurations, they only investigated isolated transactions between one BM and one PCH each. This neglects the interference of concurrent bus accesses that designs can suffer from in real systems. A more accurate representation of the attainable throughput of HBM FPGAs has been presented by several application specific accelerators [6]–[8]. Although a speedup over non-HBM systems could be shown, they also experienced the limits presented in Section II. For this reason, they manually partitioned and duplicated data such that every BM mainly accesses one PCH. However, a comprehensive exploration of the HBM subsystem characteristics that can guide developers when conceiving accelerators has not been shown.

The goal of this paper is to provide such a general methodology for HBM. We approach this in the following by first analyzing the parameters that affect the performance and deriving suitable architecture guidelines. Then we present a ready-to-use IP core which eases implementing designs adhering to these guidelines. Lastly, we prove our methodology by applying it to our own accelerators in a Roofline model.

IV. SYSTEM ANALYSIS AND DESIGN

Our HBM analysis in the following is based on measurements conducted on a Xilinx XCVU37P-2E FPGA. Due to the regularity of HBM stacks, we assume that our methodology is also applicable to other stack configurations. In general, we used the memory controller settings Wang et al. found to be the best [13]. However, in contrast to them we do not only measure point-to-point characteristics but the overall system performance under certain workloads. Since tasks are often decomposable into several more basic patterns, we used the extreme cases listed in Table I as a basis. First, accelerators can be differentiated by their access locality to memory channels. Single Channel (SC) restricts every BM to its directly attached PCH in a 1:1 port mapping. This eliminates interference from other BMs but requires data to be manually prepartitioned and possibly duplicated. Cross Channel (CC) in contrast assumes data lies globally contiguous in memory. Therefore, every BM works on every PCH in an N:M mapping. Second, the order of accesses needs to be separated. Stride (S) linearly increases the addresses by a certain length after every bus transaction. With CC it is assigned in a way such that every BM requests the globally subsequent data chunk in turn. Random Access (RA) in contrast exposes no definite pattern and scatters accesses over a larger area of the address space. However, each bus transaction works on a small (≤ 512 B) contiguous chunk of data. The four combinations of these characteristics listed in Table I are used as basic patterns in the following.

A. Performance Analysis

To evaluate the most efficient accelerator design, developers need to systematically analyze all parameters that affect HBM access. Therefore, we identify and present them in the following. Similar to traditional DRAM, the impact on a single channel is considered at first. The chips used on the selected FPGA have a theoretical maximum bandwidth Th_sc_max of 14.4 GB/s per PCH which leads to a total of 460 GB/s over all 32. However, as with all DRAM based technologies, this limit cannot be reached due to the cell refresh cycles. Xilinx states the effective HBM throughput Th_sc_eff on their devices as 7-9% lower [15]. In practice it can be even lower depending on the following accelerator design choices.

The first factor is the design frequency f_acc itself. With an AXI bus width W of 256 bit, 450 MHz are needed to reach the theoretical limit. Similar to Kara et al. [8] we consider this clock challenging for many accelerators to reach timing closure with. In this case, a common alternative is to double W to provide the same throughput at half the frequency. As a consequence, more functional cores and thus more resources are used to be able to process more data in parallel. However, as stated before the hardware is already always underutilized due to the DRAM intrinsic throughput degradation. Furthermore, accelerators typically read and write concurrently over the independent AXI channels while the HBM DDR protocol provides only a common bidirectional bus. Therefore, the memory is usually not able to serve all AXI transactions at full clock speed anyway. For this reason, it is in many cases not efficient to parallelize processing. Instead, designs with lower frequencies are often sufficient if a proportional ratio RW_rat of concurrent read and write transactions compensates the lower speed in one direction. Fig. 2 shows this effect on throughput for a more common 300 MHz clock. It can be seen that DRAM bus turnaround delays for concurrent reads and writes reduced the performance by only 2% compared to reported unidirectional 450 MHz references [13], [15]. This maximum was already reached with the commonly encountered 2:1 ratio. Since the losses are small and timing closure problems occur often, we also restrict the clock to a more conservative 300 MHz in the following to further explore the effect of this trade-off. In general, it is effective to reduce the clock frequency of HBM accelerators if it is compensated by an appropriate ratio of concurrent reads and writes.

A further factor that impacts throughput is the used access pattern. As with all DRAM-based technologies, data must first be latched before it can be accessed. This is done in the granularity of a page with consecutive data that needs to be opened. Since accesses to open pages are faster, long continuous bursts of data requests perform better. Fig. 3 demonstrates this effect of the access pattern and burst length BL up to its AXI3 limit of 16.
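Before looking at the measurements, a toy model (our own simplification, not derived from the data) shows why short bursts are expensive: every AXI transaction pays a roughly fixed command, activation, and turnaround cost that its BL data beats must amortize:

```python
def burst_efficiency(bl: int, overhead_beats: float = 1.0) -> float:
    """Fraction of peak bus utilization if every transaction of `bl`
    data beats also costs a fixed overhead (page activation, command
    handling, bus turnaround), expressed in equivalent beats."""
    return bl / (bl + overhead_beats)

# With one overhead beat per transaction, BL=1 wastes half the bus,
# BL=2 already recovers two thirds, and BL=16 is close to the optimum.
ratios = {bl: burst_efficiency(bl) for bl in (1, 2, 4, 16)}
```

The model reproduces the qualitative shape of the measurements below: a large jump from length one to two, then a plateau once the fixed cost is amortized.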
Fig. 3: HBM burst length comparison. (a) SCS. (b) CCS. (c) SCRA. (d) CCRA.

Fig. 4: Effect of Xilinx HBM switch fabric on throughput. (a) Total throughput for rotations of 2, 4, 6, and 8. (b) Additional bus routing contention between masters 0–7 and channels 0–7 across the crossbar switches.

It can be seen that length one accesses perform significantly worse with all patterns. In particular, the unidirectional single channel cases show a 50% performance increase when BL is increased to two. Since the frequency has only been reduced by 33.3%, this implies that the effect is not caused by an insufficient number of issued AXI requests but by a limitation of the memory subsystem. Beyond that, Fig. 3a shows that even a burst length of two almost maximizes throughput for strided patterns in one direction. These generally access data in a contiguous manner. Therefore, there is only a negligible difference when the burst length is doubled and the number of AXI transactions halved since the same data is accessed. This means performance plateaus early. However, it can also be seen that for mixed load/store workloads longer bursts are required. Here, more requests arrive at the memory controller which makes it less effective in avoiding the bus turnarounds. The same effects are true for SCRA as seen in Fig. 3c. This pattern minimizes page hits between consecutive AXI transactions which reduces the throughput of a channel to a minimum. However, a larger access granularity again increases hits within a transaction such that it plateaus similar to SCS when the length is doubled. Therefore, long bursts generally increase throughput but even shorter ones can be sufficient for both SCS and SCRA to access HBM memory. However, about four times longer bursts are required to compensate a reduced frequency.

When deciding on a burst length, the number of concurrently outstanding AXI transactions N_ot has to be considered as a factor as well. In general, BMs must be able to issue enough requests such that data can be continuously delivered for the whole AXI round-trip time. Otherwise stalls in the bus pipeline due to missing data are inevitable. We measured a minimum closed page latency of 48 cycles (160 ns) for reads and 17 cycles (57 ns) for writes when global addressing is enabled and the local PCH is accessed. When the destination is the farthest PCH the latency increased to up to 72 cycles (240 ns) for reads and 41 cycles (137 ns) for writes due to the gradual routing. This means that even if the highest possible burst length of 16 is used, it is not sufficient to send only one AXI transaction at a time and wait for the response. Therefore, accelerators must always have multiple active AXI transactions on every bus to prefetch data.

As one of the core concepts of HBM in contrast to traditional DRAM is the multi-channel architecture, developers must determine in the second step how much parallelization can actually be achieved. In this regard, throughput is heavily impacted by the ability to spread access to as many PCHs as possible. Therefore, the number of effectively used ones at a time N_ch_eff can be lower than the device maximum N_ch_max. This is especially pronounced with the global addressing scheme used on current FPGAs which maps contiguous data regions to the same PCH. This effect can be seen in Fig. 3b where all BMs simultaneously access the same one that contains all data. Here, throughput is severely limited to only 2.8% of the total device bandwidth. If only reads or writes are issued the throughput further drops to 9.6 GB/s (2.1%). In case source and destination arrays are mapped to different PCHs, a combined throughput of 2 · 9.6 GB/s = 19.2 GB/s (4.2%) is also conceivable. This bottleneck is also known as a "hot-spot" traffic pattern. In contrast, a perfect SCS subdivision where each BM uses exactly one PCH yielded 416.7 GB/s (90.6%) as depicted in Fig. 3a. The importance of channel parallelization can also be seen with the CCRA pattern. Fig. 3d shows that even when BMs use usually worse completely random addresses, there can be a 5.4× greater total throughput than the maximum a single PCH can deliver. This is caused by the fact that requests are spread over all 32 HBM PCHs and thus have a higher memory-level parallelism. Therefore, accelerators must access all memory channels at every point in time. To achieve this, system designers must keep track of the location where data resides and how many masters generate addresses to the same region.

Even when BM addresses and data are equally spread out there are cases where the maximum throughput still cannot be reached. This is a result of the current hardware implementation of the bus fabric that enables global addressing.
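Whether a given access stream suffers from the hot-spot effect described above can be estimated directly from its addresses. A sketch for the contiguous vendor mapping of Section II (8 GB over 32 PCHs, i.e. 256 MiB per PCH; the helper function and constant names are ours):

```python
PCH_BYTES = 8 * 2**30 // 32   # 256 MiB of address space per pseudo channel

def effective_channels(addresses) -> int:
    """N_ch_eff under the contiguous mapping: the number of distinct
    PCHs a stream of byte addresses actually touches."""
    return len({addr // PCH_BYTES for addr in addresses})

# A contiguous 1 MiB working set stays inside a single PCH (hot spot) ...
hot = effective_channels(range(0, 2**20, 64))
# ... while one access per 256 MiB section spreads over all 32 PCHs.
spread = effective_channels(range(0, 32 * PCH_BYTES, PCH_BYTES))
```

A designer can run a short address trace of each BM through such a check to see whether data placement limits N_ch_eff long before synthesizing the design.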
With the segmented switch networks described in Section II, requests are gradually transported over lateral connections. Therefore, the effective number of parallel busses that route data to the destination N_lat_eff also serves as an upper bound on throughput. Fig. 4a shows this limit by assigning every BM m through an offset i a unique PCH (m + i) mod N_ch_max. Since this is a rotation and the fabric architecture is also symmetric, offsets bigger than N_ch_max/2 were equal to a rotation in the other direction. In total every PCH always uniquely served a single BM with contiguous SCS bursts. This is also evident from reaching the full throughput of 416.7 GB/s when the closest PCHs were accessed. In contrast, when a rotation was applied, requests were forced to use the switch network. The contended busses of an interface section are illustrated in Fig. 4b. With an offset of one, the performance was still ideal since no data paths were shared. However, a lateral connection was needed by every fourth BM. The first drop in throughput could already be observed with a rotation of two. Although the used FPGA provides N_lat_max = 2 paths in every direction, the static assignment forced two BMs to use the same lateral connection. In this case in theory two BMs get 100% throughput over the local switch, but the contending ones effectively only 50%. This coincides with our measurements that show 74.9% performance. The same effect happened to the other remaining ones when the rotation was increased to four PCHs. Here, the overall throughput dropped to 49.8%. With every additional offset up to eight the performance further decreased. When looking at Fig. 4b, it can be seen that at this point the BMs of a switch not only compete among themselves but also with the neighboring ones over their lateral connections. Since the rotation operation also causes data movements in the other direction, all four lateral paths over the complete length of the device were now used to their full extent. This led to a saturation at 4/32 = 12.5%. Although no further reduction in total bandwidth occurred after this point, the time until all BMs received all data still increased due to the routing over even longer distances. These results show that the accumulated channel throughput also needs to be effectively provided to the accelerator. For this purpose, developers need to additionally keep track of the relative position of master and data channel pairs to identify routing conflicts.

Bus fabrics are also susceptible to further overall throughput reductions C_cont. Common causes are routing delays due to additional dead cycles for bus multiplexing. These are usually inserted for timing closure reasons, which is even more complicated to achieve for the high number of I/O pins HBM uses. Therefore, they are also present in the Xilinx interconnect when switching between masters for lateral connections. Since the switches use round-robin scheduling, a slowdown is always to be expected when BMs share a bus to their destination PCH. This can be seen when the burst length of the measurements in Fig. 4a is reduced from 16 to two. In this case, the minimal write throughput is further reduced by 16.8% due to the additional bus switching caused by smaller AXI transactions. Similarly, the CCRA pattern in Fig. 3d also suffers from the same effect. Furthermore, the varying DRAM latencies due to page latching make the extent of switching delays often difficult to determine exactly. Therefore, the number of concurrent AXI transactions to different channels should be reduced (e.g. by increasing the burst length) if contention in the bus fabric is to be expected.

TABLE II: HBM latency comparison (mean / σ)

Traffic  Setup  CCS Read         CCS Write       CCRA Read       CCRA Write
Single   XLNX   71.8 / 19.8      46.3 / 24.6     66.5 / 17.7     29.1 / 7.9
         MAO    73.7 / 12.5      32.0 / 0.1      81.9 / 15.7     32.0 / 0.3
Burst    XLNX   3020.8 / 1478.8  585.4 / 522.9   651.8 / 353.5   197.3 / 122.2
         MAO    264.5 / 13.4     72.0 / 0.7      546.2 / 158.4   93.2 / 23.8

TABLE III: MAO implementation results

Intercon.  f_max  Latency RD/WR  LUTs             FFs              BRAM
Full       130    12 / 12        285327 (21.89%)  274879 (10.54%)  260 (12.90%)
Full       150    25 / 12        278800 (21.38%)  255122 (9.79%)   260 (12.90%)
Partial    350    12 / 12        152771 (11.72%)  197831 (7.59%)   132 (6.55%)
Partial    360    25 / 12        147798 (11.34%)  251676 (9.65%)   260 (12.90%)

When designing an accelerator, care must often be taken not only of throughput but also of latency. Table II shows the effect of bus traffic on the average latency and its standard deviation. The Single test case issues one transaction at a time with burst length one on every BM, whereas the Burst scenario increases the traffic to 32 outstanding requests with a burst length of 16 per BM. It can be seen that the Xilinx fabric (XLNX) introduces considerable delays with a high variance for the CC patterns. This behavior is directly caused by the previously explained contention effects of PCHs and switches. However, as demonstrated with the MAO IP in the next Section, accelerators can reduce this influence when accesses over lateral connections are minimized. Therefore, if uniform latencies are important to an accelerator design, routing AXI transactions laterally should be avoided as much as possible.

B. Architectural Adaptions

Section IV-A provided measurements and guidelines derived from them that developers always need to consider when dealing with HBM. These rules imply three universal architectural adaptions which can equally improve the performance of a variety of designs. First, we recommend replacing the lateral connections between switches with a hierarchical distribution network. This is in general a trade-off between latency, throughput, and chip space. However, our analysis showed major throughput bottlenecks when data is moved laterally with the current structure. In these cases even traditional DRAM can lead to better results. Therefore, we consider it worthwhile to increase the minimum latency for a higher throughput. Second, we suggest to further adjust the address mapping scheme such that the assignment to memory channels can be changed. With this ability data can be interleaved between the PCHs such that consecutive addresses automatically access multiple different ones. This reduces contention and increases parallelization which again leads to a higher performance. Third, we recommend buffers near the BM and a high degree of transaction reordering. On the one hand, the HBM memory controllers themselves are then able to more efficiently coalesce accesses and increase DRAM page hits. On the other hand, since access times between PCHs can differ, further reorder buffers on the BM side can free the bus fabric by accepting and storing out-of-order transactions early. This reduces contention for AXI requests of other BMs.
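The second adaption, interleaving consecutive addresses over the PCHs, can be illustrated with a simple address swizzle. This is a sketch of the idea only; the block size and helper names are our assumptions, not the actual mapping of the MAO core:

```python
N_PCH = 32
BLOCK = 4096                  # assumed interleave granularity in bytes

def contiguous_pch(addr: int, pch_bytes: int = 2**28) -> int:
    """Current vendor mapping: each PCH fills one contiguous section."""
    return addr // pch_bytes

def interleaved_pch(addr: int) -> int:
    """Adjusted mapping: consecutive BLOCK-sized chunks rotate round
    robin over all PCHs, so linear traffic spreads automatically."""
    return (addr // BLOCK) % N_PCH

# A linear 1 MiB transfer now uses all 32 PCHs instead of hammering one.
linear = list(range(0, 2**20, BLOCK))
```

Because the remapping is a pure function of the address, it can sit transparently between accelerator and memory interface, which is exactly the intermediate-layer role the MAO core takes.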
TABLE IV: HBM throughput comparison [GB/s]
CCS CCRA
XLNX MAO XLNX MAO
RD 9.6 (2.1%) 307 (66.7%) 32.0× 36.0 (7.8%) 134 (29.1%) 3.72×
WR 9.6 (2.1%) 307 (66.7%) 32.0× 48.0 (10.4%) 144 (32.3%) 3.00×
Both 13.0 (2.8%) 414 (90.0%) 40.6× 70.4 (15.3%) 266 (57.8%) 3.78×

Since these guidelines are universal, we have furthermore developed our ready-to-use Memory Access Optimizer (MAO) IP core. It is simply used as an intermediate layer between accelerator and HBM interface. Its resource usage and fmax can be seen in Table III. Hereby, Partial describes a configuration which uses the 4 x 4 crossbar switches depicted in Fig. 1 but leaves all lateral connections unused. On the contrary, Full completely replaces the vendor bus fabric. Furthermore, the variants with 12 cycles latency use one hierarchical stage while the others use two. Overall, the size is similar to the 250k LUTs Xilinx states for their own fabric [19]. For strict CCS accesses this can also be further optimized by reducing the number of BMs and accordingly increasing the bus width. This lowers the complexity and resource count, but the architecture is not as flexible anymore. Since an architecture like MAO could be used as an ASIC solution in the future, we decided to also offer 32 BMs to be more comparable.

Fig. 5: Effect of stride length on throughput with MAO.

Fig. 6: Effect of reordering on CCRA throughput with MAO.
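The BM-side reorder buffers recommended above (and built into MAO) can be sketched as a minimal behavioral model; the class and its interface are our illustration, not the actual RTL.

```python
from collections import deque

class ReorderBuffer:
    """Minimal sketch of a BM-side reorder buffer: completions may arrive
    out of order (independent AXI IDs), are accepted immediately to free
    the bus fabric, and are released to the accelerator in issue order."""

    def __init__(self):
        self.issue_order = deque()   # transaction IDs in request order
        self.arrived = {}            # id -> data, buffered on arrival

    def issue(self, txn_id):
        self.issue_order.append(txn_id)

    def complete(self, txn_id, data):
        # Accept the completion the moment it arrives, in any order.
        self.arrived[txn_id] = data

    def drain(self):
        # Release everything that is ready at the head, in issue order.
        out = []
        while self.issue_order and self.issue_order[0] in self.arrived:
            out.append(self.arrived.pop(self.issue_order.popleft()))
        return out
```

Because completions are stored as soon as they arrive, the fabric is freed early even when PCH access times differ, while the accelerator still observes data in request order.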
In general, MAO improves the presented issues by distributing AXI transactions such that Nch,eff and Nlat,eff can often be considered equal to Nch,max. To demonstrate this capability, we compare the CC access pattern performance of a standard Xilinx design (XLNX) with our MAO core in Table IV. Hereby, the exact same pattern with burst length 16 was used in both cases. The only difference was that version four of MAO from Table III was inserted into the Memory Access area of Fig. 1. As predicted, the CCS pattern can be dramatically improved, by a factor of up to 40.6x. In particular, we were able to reach 99.4% of our best-case SCS performance. This improvement can be mainly attributed to the automatic redistribution of requests to all PCHs without using lateral connections. This capability is further explored in Fig. 5, where the stride length of addresses is varied such that data is either fetched multiple times or skipped. With strides smaller than 16 kB, the same data is always accessed by several subsequent BMs. Here, DRAM-related differences in latency cause higher nondeterministic bus contention between overlapping AXI requests. This is confirmed by a comparison measurement with BRAM instead of HBM, which still achieved the maximal throughput. However, this stride range is not as important since such data is usually forwarded internally within the accelerator for efficiency reasons. On the opposite side, strides longer than 256 kB cause DRAM page misses that dominate the achievable throughput. Between these bounds the maximal performance was reached.

We were also able to improve the performance of a CCRA pattern, by 3.78x. Again, the reduction of lateral routing conflicts greatly affects the throughput. In contrast to strided patterns, the ability to reorder requests has a bigger influence here. Fig. 6 illustrates this effect by increasing the number of consecutive AXI transactions that can be reordered (independent AXI IDs). First, a higher number allowed the memory controller to schedule requests more efficiently. Second, our reorder buffers on the BM side also effectively freed the fabric from outstanding transactions such that bus contentions were resolved faster. The effect of our measures can also be seen in the overall latency in Table II. Although in some low-traffic scenarios it slightly increased due to the 25 additional cycles in the MAO core, it improved for the intended high-throughput cases. In contrast, the standard deviation decreases overall, which is an indication of less contention. Therefore, our proposed MAO IP proved to be a benefit for these patterns.

V. METHODOLOGY EVALUATION

While the previous section analyzed access properties and architectural adaptions, it has to be shown how this translates to actual hardware. Therefore, in the following we evaluate the upgrade opportunities of two exemplary real-world FPGA accelerators that were previously not optimized for HBM, using our guidelines. We chose to implement a matrix multiplication since it is a versatile operation used in many applications (e.g. neural networks). Starting from the memory transfers, several accelerators with different access patterns can be designed.

Table V gives an overview of the two accelerators and their characteristics. Numbers in red denote configurations that need too many resources for our FPGA, while optimal ones are colored green. The first one, further on called A, is built with an array of processing elements (PEs) similar to a systolic array. It initially loads data from one input matrix (Mh x Mw) into local memory inside its PEs. Afterwards, it continuously streams data from the second input matrix in and the output matrix back to memory. The other accelerator, from now on called B, is based on an adder tree and buffers. It keeps parts of one input matrix as well as partial sums in local memory. This saves memory bandwidth, as only one matrix has to be reloaded and only final results need to be written back to main memory.

To increase the accelerator performance, the number of calculations that can be executed in parallel must be maximized. In general, this can be done by adding computational

Fig. 7: Roofline model of matrix multiplication accelerators for different P: (a) Accelerator A, (b) Accelerator B. (Both panels plot Performance [GOPS/s] over Operational Intensity [OPS/B] for 4, 8, 16, and 32 ports against the XLNX and MAO memory bandwidth ceilings; the annotated speedups are 18.4x for A and 28.5x for B.)

TABLE V: Overview of Matrix Multiplication Accelerators

Parameter            Accelerator A                     Accelerator B
Architecture         PE array (systolic)               adder tree + buffers
RWrat                2:1                               Mh:1 (Mh >> 2)
facc                 300 MHz                           300 MHz
P                    4      8      16     32           4      8      16     32
OpI [OPs/B]          42     84     167    328          2      2      2      2
Ccomp [GOPs/s]       2458   9831   39322  157286       68     137    274    547
Util (Core)          14%    56%    223%   895%         3%     6%     12%    24%
Util (Core+MAO)      36%    79%    245%   917%         25%    28%    34%    46%
SU (HBM)             —      2x     3.9x   7.7x         —      1x     1x     1x
SU (HBM+MAO)         4.6x   18.4x  73.8x  248.2x       3.6x   7.1x   14.3x  28.5x
Benefit (HBM)        +                                 ◦
Benefit (HBM+MAO)    ++                                +++

resources. However, these must also be consistently provided with the necessary data. This is only possible due to the much higher memory bandwidth HBM offers by increasing the number of BMs P. Therefore, P directly corresponds to the degree of compute parallelization. Accelerator A scales with the PE array dimensions PAw x PAh. Accelerator B can be scaled in multiples of adder trees PAh. Additionally, Table V also lists the operational intensity OpI, the computational ceiling Ccomp, the speedup SU compared to the baseline P = 4, as well as the FPGA resource utilization Util. Ccomp is calculated by summing all operations executed by the accelerator and dividing the sum by the runtime. OpI is calculated by dividing all operations by all bytes read and written by the accelerator. It can be seen that Ccomp increases with P, but at the cost of higher Util. Therefore, this growth is ultimately limited by the FPGA capacity. Whether this additional computing power can be exploited further depends on the real achievable memory throughput. These factors are analyzed in the following.

As shown before, memory requests consisting of long bursts with many outstanding transactions are beneficial for a high throughput. Therefore, both cores immediately request as much data as possible to process all matrix values. This behavior fulfills both requirements as long as the matrices are big enough. In a heterogeneous system where accelerators interact with other cores, data can often not be partitioned in a way that the memory access from all of them is optimal. Therefore, we assume an allocation strategy where every matrix is contiguously stored in memory without gaps in the address space. While this CCS pattern simplifies design, resource contention is expected, as analyzed in Section IV, since many ports access the same PCH in parallel. This bottleneck is later mitigated without additional design effort with our MAO IP core.

Fig. 7a shows the Roofline model of Accelerator A for four different configurations of P. For this implementation, the operational intensity not only depends on the problem size, but also on the accelerator implementation. With a bigger PE array, more values can be kept in local storage. Therefore data must be reloaded less often and more operations can be performed on the same number of bytes on average. To check the overall achievable memory throughput of the accelerator, we have to take a look at RWrat as well as facc (see Table V). As derived in Section IV, we estimate the maximal achievable memory throughput to be about 13 GB/s for the access pattern of Accelerator A in a system without MAO. In contrast, with MAO we expect an increase to about the maximum HBM throughput of 416 GB/s. Then we measured the actual throughput to see if our estimation holds up. The results for P = 32 amount to 12.55 GB/s without and 403.75 GB/s with the MAO IP core. This is about 3% off from what we estimated in both cases. Therefore, our model is sufficiently accurate for this early design space exploration. Furthermore, it can be seen that without optimized memory access the accelerator is memory bound for all configurations. The measured maximal throughput of 12.55 GB/s also shows that without any optimization HBM does not bring any improvements here, as this could also be achieved with traditional non-stacked DRAM. Therefore, an optimization of the memory access pattern is crucial. When using our MAO IP core we were able to improve the accelerator performance by up to 32.2x compared to unoptimized access. Here, the system became compute bound again for P < 32 even with MAO. For P = 32 the accelerator is still memory bound, but as it utilizes the highest amount of available bandwidth (403.75 GB/s) it should be chosen for an implementation using HBM. However, when the resource usage is also considered, neither P = 16 nor P = 32 fits on our FPGA. Therefore, the P = 8 configuration is the best achievable, with a memory throughput of 116 GB/s and a SU of 18.4x. At this point the accelerator is compute bound and does not use the HBM benefits completely. Nevertheless, it still achieves a high performance due to the grid structure

that enables a high internal data reuse. Therefore, it is less susceptible to external memory bottlenecks. While this kind of accelerator seems very promising, the improvement gained by using HBM is not as high as expected, as the most beneficial configurations are not implementable on current FPGAs.

The Roofline model for Accelerator B is depicted in Fig. 7b. Here, OpI only depends on the matrix size and therefore does not change with P. Only Ccomp differs, as it depends on P and facc. According to our analysis in Section IV, the maximum achievable bandwidth of Accelerator B is inherently limited by facc to roughly 2/3 of the maximum throughput, caused by the core running at 300 MHz while not having a 2:1 read-write ratio. In a system without MAO we estimate a maximum overall throughput of about 10 GB/s. In contrast, with MAO an increase to about 67% of the maximum throughput (277 GB/s) is to be expected again. Measuring the actual throughput of this accelerator yields 9.59 GB/s for unoptimized memory access, which is only 4% off from our estimation. With our MAO IP core it could be greatly increased to 273 GB/s, which is only about 2% off from our estimation. Our model was again sufficiently accurate. It can be seen that the maximal achievable performance of this accelerator is only about 6% of that of the P = 8 configuration of Accelerator A. In general, all configurations are memory bound for an unoptimized memory access. When using our MAO IP core, we could again measure an improvement of 28.5x. In this case all configurations became compute bound. However, the P = 32 configuration is less than 0.1% away from the memory ceiling. Therefore, this design is already very close to its overall optimum. When looking at its corresponding resource allocation, we see that it is less than half as big as the P = 8 configuration of Accelerator A. Therefore, a design using Accelerator B and the MAO IP core can be implemented on the FPGA. This shows that this type of accelerator can really benefit from HBM and an optimized memory access.

While Accelerator B profits the most from the transition to HBM at the least cost in hardware resources, for raw matrix multiplication performance Accelerator A should be implemented. A potential improvement for Accelerator B could be to rework the design to achieve a higher facc. Furthermore, future FPGAs with more HBM stacks and therefore a higher memory throughput would make it possible to increase Ccomp even further. For Accelerator A, the design could be optimized to better exploit the available throughput with a smaller design, for example by applying a local buffer structure to redistribute values and scale the PE array linearly.

VI. CONCLUSION AND FUTURE WORK

This paper analyzed the behavior of HBM and the influence of accelerator and bus fabric design decisions on it. It further provided a simple model to perform first performance estimations. With two example cores, we could show that our estimates and derived design guidelines were a benefit for system designers. In the future, this data can be used to develop additional SystemC models aiming for higher accuracy.

ACKNOWLEDGMENT

This work has been supported by the Xilinx University Program (XUP) with the Virtex UltraScale+ HBM FPGA chip.

REFERENCES

[1] R. S. Williams, "What's Next? [The End of Moore's Law]," IEEE Comput. Sci. Eng. Mag., vol. 19, no. 2, pp. 7–13, Mar. 2017.
[2] N. S. Kim, D. Chen, J. Xiong, and W.-m. W. Hwu, "Heterogeneous Computing Meets Near-Memory Acceleration and High-Level Synthesis in the Post-Moore Era," IEEE Micro, vol. 37, no. 4, pp. 10–18, 2017.
[3] J. Cong, Z. Fang, M. Lo, H. Wang, J. Xu, and S. Zhang, "Understanding Performance Differences of FPGAs and GPUs," in IEEE 26th Annu. Int. Symp. Field-Prog. Custom Comput. Mach. (FCCM), 2018, pp. 93–96.
[4] M. V. Wilkes, "The Memory Gap and the Future of High Performance Memories," SIGARCH Comput. Archit. News, vol. 29, no. 1, p. 27, Mar. 2001.
[5] H. Jun, J. Cho, K. Lee, H. Son, K. Kim, H. Jin, and K. Kim, "HBM (High Bandwidth Memory) DRAM Technology and Architecture," in IEEE Int. Memory Workshop (IMW), May 2017.
[6] G. Singh, D. Diamantopoulos, C. Hagleitner, J. Gomez-Luna, S. Stuijk, O. Mutlu, and H. Corporaal, "NERO: A Near High-Bandwidth Memory Stencil Accelerator for Weather Prediction Modeling," in 30th Int. Conf. Field-Prog. Log. and Applicat. (FPL), 2020, pp. 9–17.
[7] R. Ben Abdelhamid and Y. Yamaguchi, "A Block-Based Systolic Array on an HBM2 FPGA for DNA Sequence Alignment," in Appl. Reconfigurable Comput. Archit., Tools, and Applicat., F. Rincón, J. Barba, H. K. H. So, P. Diniz, and J. Caba, Eds. Cham: Springer International Publishing, 2020, pp. 298–313.
[8] K. Kara, C. Hagleitner, D. Diamantopoulos, D. Syrivelis, and G. Alonso, "High Bandwidth Memory on FPGAs: A Data Analytics Perspective," in 30th Int. Conf. Field-Prog. Log. and Applicat. (FPL), 2020.
[9] M. Göbel, A. Elhossini, and B. Juurlink, "A Methodology for Predicting Application-Specific Achievable Memory Bandwidth for HW/SW-Codesign," in Euromicro Conf. Digital Syst. Des. (DSD), Aug. 2017, pp. 533–537.
[10] C. Park, S. Park, and C. S. Park, "Roofline-Model-Based Design Space Exploration for Dataflow Techniques of CNN Accelerators," IEEE Access, vol. 8, 2020.
[11] M. Ujaldón, "HPC Accelerators with 3D Memory," in IEEE Int. Conf. Comput. Sci. and Eng. (CSE) and IEEE Int. Conf. Embedded and Ubiquitous Comput. (EUC) and 15th Int. Symp. Distrib. Comput. and Applicat. for Business Eng. (DCABES), Aug. 2016, pp. 320–328.
[12] N. Ding and S. Williams, "An Instruction Roofline Model for GPUs," in IEEE/ACM Perform. Model., Benchmarking and Simul. High Perform. Comput. Syst. (PMBS), 2019, pp. 7–18.
[13] Z. Wang, H. Huang, J. Zhang, and G. Alonso, "Shuhai: Benchmarking High Bandwidth Memory on FPGAs," in IEEE 28th Annu. Int. Symp. Field-Prog. Custom Comput. Mach. (FCCM), 2020, pp. 111–119.
[14] S. Williams, A. Waterman, and D. Patterson, "Roofline: An Insightful Visual Performance Model for Multicore Architectures," Commun. ACM, vol. 52, no. 4, pp. 65–76, Apr. 2009.
[15] Xilinx, Inc., "AXI High Bandwidth Memory Controller," Jul. 2020. [Online]. Available: https://www.xilinx.com/support/documentation/ip_documentation/hbm/v1_0/pg276-axi-hbm.pdf
[16] Intel Corporation, "High Bandwidth Memory (HBM2) Interface Intel FPGA IP User Guide," Dec. 2020. [Online]. Available: https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/ug/ug-20031.pdf
[17] B. da Silva, A. Braeken, E. H. D'Hollander, and A. Touhafi, "Performance Modeling for FPGAs: Extending the Roofline Model with High-Level Synthesis Tools," Int. J. Reconfig. Comput., vol. 2013, Jan. 2013.
[18] M. Siracusa, M. Rabozzi, E. Del Sozzo, L. Di Tucci, S. Williams, and M. D. Santambrogio, "A CAD-based Methodology to Optimize HLS Code via the Roofline Model," in IEEE/ACM Int. Conf. Computer Aided Des. (ICCAD), 2020.
[19] M. Wissolik, D. Zacher, A. Torza, and B. Day, "Virtex UltraScale+ HBM FPGA: A Revolutionary Increase in Memory Performance," Jul. 2019. [Online]. Available: https://www.xilinx.com/support/documentation/white_papers/wp485-hbm.pdf

