ABSTRACT

Processing-in-memory (PIM) architectures have gained significant importance as an alternative paradigm to von-Neumann architectures to alleviate the memory wall and technology scaling problems. PIM architectures have achieved significant latency and energy consumption improvements for various emerging and widely used workloads such as deep neural networks, graph analytics, databases, and computational genomics. In this work, we propose IMC-Sort, an in-memory parallel sorting architecture using the hybrid memory cube (HMC) for accelerating sort workloads. Sort is one of the fundamental and most widely used algorithms in applications such as databases, networking, and data analytics. The IMC-Sort architecture augments the hybrid memory cube memory system by incorporating a custom sorting network at each HMC vault's logic layer. IMC-Sort uses an optimized folded Bitonic sort and merge network to sort input sequences of arbitrary length at each vault, and an optimized address mapping mechanism to distribute the input data across HMC vaults. Merging of the sorted results across individual vaults is also performed using the vaults' sorting networks, which communicate with other vaults through the HMC's crossbar network. Overall, IMC-Sort achieves 16.8x and 1.1x speedup, and 375.5x and 13.6x savings in energy consumption, compared to a widely used CPU implementation and a state-of-the-art near-memory custom sort accelerator, respectively.

KEYWORDS

processing-in-memory, sort, merge, vault, hybrid memory cube

ACM Reference Format:
Zheyu Li, Nagadastagiri Challapalle, Akshay Krishna Ramanathan, and Vijaykrishnan Narayanan. 2020. IMC-Sort: In-Memory Parallel Sorting Architecture using Hybrid Memory Cube. In Proceedings of the Great Lakes Symposium on VLSI 2020 (GLSVLSI '20), September 7–9, 2020, Virtual Event, China. ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/3386263.3407581

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
GLSVLSI '20, September 7–9, 2020, Virtual Event, China
© 2020 Association for Computing Machinery.
ACM ISBN 978-1-4503-7944-1/20/09...$15.00
https://doi.org/10.1145/3386263.3407581

Session 1B: Emerging Memory-Enabled Computing for Future Electronics, GLSVLSI '20, September 7–9, 2020, Virtual Event, China

1 INTRODUCTION

In recent times, processing-in-memory (PIM) architectures tailored for specific workloads have gained significant attention owing to their computational efficiency. Moving compute closer to memory bridges the ever-increasing latency gap between memory systems and compute systems and overcomes the memory wall in conventional von-Neumann architectures. PIM architectures can be categorized into two variants. The first variant leverages the in-situ compute capabilities of crosspoint memory architectures augmented with peripheral circuitry such as digital-to-analog converters, analog-to-digital converters, and shift-and-add units [1, 2]. The second variant incorporates compute elements such as arithmetic-logic units and multiply-and-accumulate units near the memory at various granularities, such as the SRAM sub-array level [3], DRAM rank level, and DRAM bank level [4, 5]. These PIM architectures have shown significant speedups and energy savings for various application domains such as deep neural networks, computational genomics, and databases. To complement the advancements in application PIM accelerator architectures, enterprise companies such as VMware have introduced platforms like vSphere Bitfusion to simplify the runtime management and scheduling of various emerging workloads elastically across pools of hardware accelerators and conventional CPUs.

Sort is one of the fundamental algorithms in computational science and an important building block for various data management and warehousing applications such as database management systems and data analytics systems. For instance, sort is used in several functions in relational data management such as merge, aggregation, and selection. Several big data analytics applications and frameworks such as MapReduce [6] and Hive [7] also use sort in their computing pipelines. Several prior works have established that sort is one of the primary bottlenecks in these applications. In addition, this bottleneck is further aggravated by the big data explosion, where there is substantial variation in the width of the data records. Several custom hardware and FPGA-based accelerators have been proposed to alleviate the sort bottleneck in these applications; techniques such as unrolling and assigning multiple hardware mergers have also been discussed to scale designs such as Bonsai [10] across various sort workloads. Amin et al. [8] proposed an ASIC-based accelerator architecture for sorting; they construct large sorting networks by grouping and reusing
smaller sorting units. Pugsley et al. [9] accelerated the sort operation in the map phase of MapReduce applications by incorporating low-power cores near each of the Hybrid Memory Cube (HMC) vaults. Bonsai [10] proposed an adaptive merge-tree-based sorting scheme near the memory with custom hardware compute elements to leverage the massive bandwidth of emerging main memory systems such as High Bandwidth Memory (HBM). Several FPGA architectures [11, 12] implement parallel merge sort, parallel Bitonic, or hybrid merge-Bitonic sort algorithms using custom dataflow and pipeline mechanisms to achieve superior compute and energy efficiency.

In this work, we propose the IMC-Sort architecture, which augments the logic layer in the vaults of a hybrid memory cube (HMC) with a custom parallel sorting unit (the intra-vault sorting unit) that realizes a customized Bitonic sorting network. IMC-Sort proposes two variants of the intra-vault sorting unit. The first variant leverages the existing compute capabilities (using the HMC's native instruction set) of the logic layer in each HMC vault to perform the sort operation; we use the compare-and-swap (CAS) operation available in current-generation HMC systems to build a custom sort accelerator. The second variant incorporates custom compute elements near each HMC vault to perform the intra-vault sort operation, along with efficient merge hardware to aggregate and sort the results from the intra-vault sorting phase. We leverage the massive vault-level bandwidth to perform parallel sorting across all the vaults and merge the results across the vaults using the HMC's crossbar network and the intra-vault sorting units.

The key contributions of the paper include:
• Two variations of the intra-vault sorting unit, utilizing the HMC's native instruction set and custom compute logic, to sort input sequences of arbitrary length in each vault with fixed hardware units, together with custom optimizations to the input permutation unit that reduce the number of compare-and-swap units.
• Merging of the sorted results across vaults using the intra-vault sorting units, leveraging the HMC's crossbar communication network to move data elements across different vaults, and a custom address mapping strategy that distributes the input data across HMC vaults to aid the intra-vault and inter-vault sort and merge operations.
• A detailed performance evaluation of the IMC-Sort architecture and a comparison against widely used CPU implementations and a state-of-the-art custom near-memory accelerator architecture [10]. Overall, IMC-Sort achieves 16.8x and 1.1x speedup, and 375.5x and 13.6x savings in energy consumption, compared to the widely used CPU implementations and the state-of-the-art near-memory custom sort accelerator, respectively.

2 ARCHITECTURE

In this section, we discuss the overall architecture of the proposed in-memory sort accelerator, IMC-Sort. We augment the conventional HMC architecture with custom compute elements to perform a highly efficient parallel sort operation. Figure 1 illustrates the overall IMC-Sort architecture. IMC-Sort consists of custom intra-vault sorting units at the logic layer of each HMC vault. The control unit of the HMC vault is also augmented with the necessary logic to perform the sort operation. These sorting units can be accessed in parallel, and they can communicate with other units using the HMC crossbar network. In the subsequent sections, we elaborate on the intra-vault sorting unit and its operation, merging of the sorted values (of different input sequences) within a vault (intra-vault merging), and merging of the sorted values across different vaults using the HMC crossbar network (inter-vault sorting).

Figure 1: Overall architecture overview of the IMC-Sort (3D-stacked DRAM banks per vault, vault controllers, crossbar network, and host).

2.1 Intra-vault sorting

Figure 2: IMC-Sort intra-vault sorter architecture.

An HMC vault, in the form of a single stack, consists of multiple DRAM banks connected to the logic layer through TSVs (through-silicon vias). A significant advantage of such an architecture is the high-bandwidth, low-energy access within the local vault provided by the TSVs. A highly parallel, flexible, and low-overhead intra-vault sorting unit is implemented at the logic layer to leverage
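As a behavioral reference for the intra-vault sorter, the Bitonic network of compare-and-swap (CAS) units can be modeled in software. The following Python sketch is illustrative only: the function names are ours, not the paper's RTL, and padding with +inf sentinels is one simple way to sort sequences of arbitrary length on a fixed power-of-two network, whereas the paper's folded network and input permutation optimizations achieve this with less hardware.

```python
import math

def cas(a, i, j):
    """One compare-and-swap (CAS) unit: enforce a[i] <= a[j]."""
    if a[i] > a[j]:
        a[i], a[j] = a[j], a[i]

def bitonic_sort(data):
    """Sort a sequence of arbitrary length on a fixed power-of-two
    Bitonic network by padding with +inf sentinels."""
    if len(data) <= 1:
        return list(data)
    n = 1 << math.ceil(math.log2(len(data)))
    a = list(data) + [float("inf")] * (n - len(data))
    k = 2
    while k <= n:                  # build bitonic runs of length k
        j = k // 2
        while j >= 1:              # comparator columns with stride j
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    if (i & k) == 0:
                        cas(a, i, partner)   # ascending half of the run
                    else:
                        cas(a, partner, i)   # descending half of the run
            j //= 2
        k *= 2
    return a[:len(data)]
```

Every (i, partner) pair within one inner pass is disjoint, which is what allows a hardware implementation to fire a whole column of CAS units in parallel.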
[Figure: example Bitonic merge network with half-cleaner stages BH8, BH4, and BH2; the runs at each stage's output are marked "sorted".]

sequences using the HMC crossbar network. Due to the address mapping optimization, every vault holds a similar amount of data, so either vault can be activated to operate the merging; the merge step across the vaults is repeated until a single global sorted list is obtained.

3 EVALUATION

In this section, we discuss the implementation and evaluation methodology used for the evaluation of the IMC-Sort architecture. We also compare the performance and energy efficiency of the IMC-Sort architecture with a GNU implementation of parallel sort on a CPU and a state-of-the-art FPGA-based near-memory accelerator architecture [10].
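The inter-vault merge step described in Section 2, in which locally sorted runs are merged across vaults until one global list remains, can be sketched behaviorally as follows. This is our simplified Python model, not the hardware: `sorted_merge` stands in for a vault sorting unit merging two runs, with data movement over the crossbar left implicit.

```python
from heapq import merge as sorted_merge  # streaming two-way merge of sorted inputs

def inter_vault_merge(vault_runs):
    """vault_runs: one locally sorted list per HMC vault.
    Repeatedly merge disjoint pairs of runs until one sorted list remains."""
    runs = [list(r) for r in vault_runs]
    while len(runs) > 1:
        nxt = []
        for i in range(0, len(runs) - 1, 2):
            # In hardware, these pairwise merges can proceed in
            # parallel across the vault sorting units.
            nxt.append(list(sorted_merge(runs[i], runs[i + 1])))
        if len(runs) % 2:
            nxt.append(runs[-1])   # odd run carried into the next round
        runs = nxt
    return runs[0] if runs else []
```

For example, 32 locally sorted vault runs would be reduced to a single global sorted list in five such rounds, matching the repeated merge step described above.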
2.0 [18] to analyze the performance. The resultant execution cycles are normalized and compared with the IMC-Sort custom intra-vault sorting unit implementation. As shown in Figure 7, the custom intra-vault sorting unit design achieves an average speedup of 17.1x and average energy savings of 14.5x. The performance and energy gains can be attributed to sufficiently utilizing the available intra-vault bandwidth with parallel CAS units. Instead of executing a sequence of CAS instructions (at 8- or 16-byte granularity) per vault, the custom sorting unit accesses data at a 256-byte granularity and performs several CAS operations in parallel, resulting in the performance and energy gains.

Figure 7: Speedup in execution time and energy savings of the IMC-Sort architecture with custom intra-vault logic, normalized to the IMC-Sort architecture using the native HMC instruction set (x-axis: number of data elements x1024, from 4 to 64).

Figure 8 shows the speedup in execution time and energy savings of the IMC-Sort architecture normalized to the recently proposed near-memory FPGA-based sorting accelerator architecture Bonsai [10]. We have used datasets ranging from 32 MB to 4 GB with 64-bit keys for the evaluation. Bonsai proposed an unrolled merge-tree sorting architecture for a similar multi-channel stacked memory.

Figure 8: Speedup in execution time and energy savings of the IMC-Sort architecture normalized to the Bonsai architecture [10] (x-axis: dataset size in MB, from 32 to 4096).

4 CONCLUSION

In this paper, we presented a processing-in-memory architecture called IMC-Sort, which incorporates a custom intra-vault sorting unit at each HMC vault's logic layer to perform parallel sort operations. The intra-vault sorting unit, along with the vault controller, is further used to merge the sorted results from the individual sorts by moving the data elements across the vaults using the HMC's crossbar network. IMC-Sort also uses custom address mapping to distribute input data elements across vaults, and custom Bitonic sorting network input permutation units and folded compare-and-swap units to reduce the hardware requirements. IMC-Sort achieved 16.8x and 1.1x speedup, and 375.5x and 13.6x savings in energy consumption, compared to the widely used CPU implementations and a state-of-the-art near-memory custom sort accelerator. The performance and energy gains of the IMC-Sort architecture are due to the efficient utilization of the high intra-vault bandwidth at each HMC vault with custom, efficient parallel sorting hardware.