Session 1B: Emerging Memory-Enabled Computing for Future Electronics GLSVLSI ’20, September 7–9, 2020, Virtual Event, China

IMC-Sort: In-Memory Parallel Sorting Architecture using Hybrid Memory Cube

Zheyu Li, Nagadastagiri Challapalle, Akshay Krishna Ramanathan, Vijaykrishnan Narayanan
The Pennsylvania State University, State College, PA, USA
zil5126@psu.edu, nrc53@psu.edu, axr499@psu.edu, vxn9@psu.edu

ABSTRACT
Processing-in-memory (PIM) architectures have gained significant importance as an alternative paradigm to von-Neumann architectures, alleviating the memory wall and technology scaling problems. PIM architectures have achieved significant latency and energy consumption improvements for various emerging and widely used workloads such as deep neural networks, graph analytics, databases, and computational genomics. In this work, we propose IMC-Sort, an in-memory parallel sorting architecture that uses the hybrid memory cube (HMC) to accelerate sort workloads. Sort is one of the fundamental and most widely used algorithms in applications such as databases, networking, and data analytics. IMC-Sort augments the hybrid memory cube memory system by incorporating a custom sorting network at each HMC vault's logic layer. IMC-Sort uses an optimized folded Bitonic sort-and-merge network to sort input sequences of arbitrary length at each vault, and an optimized address mapping mechanism to distribute the input data across HMC vaults. Merging of the sorted results across individual vaults is also performed using the vault's sorting network, by communicating with other vaults through the HMC's crossbar network. Overall, IMC-Sort achieves 16.8×, 1.1× speedup and 375.5×, 13.6× savings in energy consumption compared to a widely used CPU implementation and a state-of-the-art near-memory custom sort accelerator, respectively.

KEYWORDS
processing-in-memory, sort, merge, vault, hybrid memory cube

ACM Reference Format:
Zheyu Li, Nagadastagiri Challapalle, Akshay Krishna Ramanathan, and Vijaykrishnan Narayanan. 2020. IMC-Sort: In-Memory Parallel Sorting Architecture using Hybrid Memory Cube. In Proceedings of the Great Lakes Symposium on VLSI 2020 (GLSVLSI ’20), September 7–9, 2020, Virtual Event, China. ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/3386263.3407581

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
GLSVLSI ’20, September 7–9, 2020, Virtual Event, China
© 2020 Association for Computing Machinery.
ACM ISBN 978-1-4503-7944-1/20/09...$15.00
https://doi.org/10.1145/3386263.3407581

1 INTRODUCTION
In recent times, processing-in-memory (PIM) architectures tailored for specific workloads have gained significant attention owing to their computational efficiency. Moving compute closer to memory bridges the ever-increasing latency gap between memory systems and compute systems and overcomes the memory wall in conventional von-Neumann architectures. PIM architectures can be categorized into two variants. The first variant leverages the in-situ compute capabilities of crosspoint memory architectures augmented with peripheral circuitry such as digital-to-analog converters, analog-to-digital converters, and shift-and-add units [1, 2]. The second variant incorporates compute elements such as arithmetic-logic units and multiply-and-accumulate units near the memory at various granularities, such as the SRAM sub-array level [3], the DRAM rank level, and the DRAM bank level [4, 5]. These PIM architectures have shown significant speedups and energy savings for various application domains such as deep neural networks, computational genomics, and databases. To complement the advancements in application-specific PIM accelerator architectures, enterprise companies such as VMware have introduced platforms like vSphere Bitfusion to simplify the runtime management and scheduling of various emerging workloads elastically across pools of hardware accelerators and conventional CPUs.

Sort is one of the fundamental algorithms in computational science and an important building block for various data management and warehousing applications such as database management systems and data analytics systems. For instance, sort is used in several functions in relational data management such as merge, aggregation, and selection. Several big data analytics applications and frameworks such as MapReduce [6] and Hive [7] also use sort in their computing pipelines. Several prior works have established that sort is one of the primary bottlenecks in these applications. In addition, this bottleneck is further aggravated by the big data explosion, where there is a lot of variation in the width of the data records. Techniques such as unrolling and assigning multiple hardware mergers have been discussed to scale the Bonsai [10] design for various sort workloads. Several custom hardware and FPGA-based accelerators have been proposed to alleviate the sort bottleneck in these applications. Farmahini-Farahani et al. [8] proposed an ASIC-based accelerator architecture for sorting; they construct large sorting networks by grouping and reusing


smaller sorting units. Pugsley et al. [9] accelerated the sort operation in the map phase of MapReduce applications by incorporating low-power cores near each Hybrid Memory Cube (HMC) vault. Bonsai [10] proposed adaptive merge-tree-based sorting near the memory with custom hardware compute elements to leverage the massive bandwidth of emerging main memory systems such as High Bandwidth Memory (HBM). Several FPGA architectures [11, 12] implement parallel merge sort, parallel Bitonic sort, or hybrid merge-Bitonic sort algorithms using custom dataflow and pipeline mechanisms to achieve superior compute and energy efficiency.

In this work, we propose the IMC-Sort architecture, which augments the logic layer in the vaults of the hybrid memory cube (HMC) with a custom parallel sorting unit (intra-vault sorting unit) that realizes a customized Bitonic sorting network. IMC-Sort proposes two variants of the intra-vault sorting unit. The first variant leverages the existing compute capabilities (using the HMC's native instruction set) of the logic layer in each HMC vault for the sort operation; we use the compare-and-swap (CAS) operation available in current-generation HMC systems to build a custom sort accelerator. The second variant incorporates custom compute elements near each HMC vault to perform the intra-vault sort operation, along with efficient merge hardware to aggregate and sort the results from the intra-vault sorting phase. We leverage the massive vault-level bandwidth to perform parallel sorting across all the vaults, and merge the results across the vaults using the HMC's crossbar network and the intra-vault sorting units.

The key contributions of the paper include:
• Two variants of the intra-vault sorting unit, utilizing the HMC's native instruction set and custom compute logic, to sort input sequences of arbitrary length in each vault with fixed hardware units; custom optimizations to the input permutation unit to reduce the number of compare-and-swap units.
• Merging of the sorted results across vaults using the intra-vault sorting units, leveraging the HMC's crossbar communication network to move data elements across different vaults; a custom address mapping strategy to distribute the input data across HMC vaults to aid the intra-vault and inter-vault sort and merge operations.
• A detailed performance evaluation of the IMC-Sort architecture and comparison against widely used CPU implementations and a state-of-the-art custom near-memory accelerator architecture [10]. Overall, IMC-Sort achieves 16.8×, 1.1× speedup and 375.5×, 13.6× savings in energy consumption compared to the widely used CPU implementations and the state-of-the-art near-memory custom sort accelerator.

2 ARCHITECTURE
In this section, we discuss the overall architecture of the proposed in-memory sort accelerator, IMC-Sort. We augment the conventional HMC architecture with custom compute elements to perform highly efficient parallel sort operations. Figure 1 illustrates the overall IMC-Sort architecture. IMC-Sort consists of custom intra-vault sorting units at the logic layer of each HMC vault. The control unit of the HMC vault is also augmented with the necessary logic to perform the sort operation. These sorting units can be accessed in parallel, and they can communicate with other units using the HMC crossbar network. In the subsequent sections, we elaborate on the intra-vault sorting unit and its operation, merging of the sorted values (of different input sequences) within a vault (intra-vault merging), and merging of the sorted values across different vaults using the HMC crossbar network (inter-vault sorting).

[Figure 1: Overall architecture overview of the IMC-Sort (3D-stacked DRAM banks per vault, vault controllers, crossbar network, and host)]

2.1 Intra-vault sorting

[Figure 2: IMC-Sort intra-vault sorter architecture]

An HMC vault, in the form of a single stack, consists of multiple DRAM banks connected to the logic layer through TSVs (through-silicon vias). A significant advantage of such an architecture is the high-bandwidth, low-energy access within the local vault provided by the TSVs. A highly parallel, flexible, and low-overhead intra-vault sorting unit is implemented at the logic layer to leverage

the high bandwidth at each HMC vault. Figure 2 illustrates the architecture of the intra-vault sorter. It consists of a data loading unit, input permutation logic, compare-and-swap (CAS) units, and a control unit. The data loading unit is responsible for loading inputs from the local DRAM banks into the sorting unit's internal registers and for writing results in-place to the DRAM banks. The n-input permutation unit, grouped with n compare-and-swap (CAS) units, sorts inputs in parallel. The sorting mechanism is based on Batcher's Bitonic network described in [13]. Figure 3 illustrates a simple 8-input Bitonic sorting network. A naive implementation of an n-input Bitonic sorting network takes a significant amount of space, since the network has O(log² n) compare-and-swap stages. In our architecture, we fold the network into a single stage of n parallel compare-and-swap units along with an inter-stage input permutation unit. The permutation unit shuffles the intermediate results according to the current Bitonic sorting stage and feeds them to the appropriate CAS units. In addition, to further minimize the performance and space overhead, the input permutation unit is implemented as a semi-crossbar permutation network by exploiting the regularity of data movement in the Bitonic network. A Bitonic sorting network works by recursively applying a Bitonic half cleaner. A Bitonic sorting network for an n-input sequence consists of log(n)(log(n)+1)/2 Bitonic half cleaners. Each Bitonic half cleaner BH(x) is a single compare-and-swap stage in which every input i is compared with input i+x/2. For example, an 8-input Bitonic sorter consists of 6 Bitonic half cleaners connected in the sequence BH2 → BH4 → BH2 → BH8 → BH4 → BH2, as illustrated in Figure 3. In this example, the data permutation unit needs to support only 3 permutations: BH2, BH4, and BH8. Thus, for a sorting network of size n, we only need O(log n) data permutations instead of a fully connected O(n²) crossbar network. Figure 4 shows the logical view of an 8-input semi-crossbar permutation network. The sequence of required input patterns is implemented as a lookup table. During execution, the control unit activates the required permutation according to the network stage. Similarly, the ascending/descending direction of each CAS unit is predetermined and implemented as a simple lookup table.

[Figure 3: Overview of the steps in the Bitonic sorting network (an 8-input example passing through the half cleaner sequence BH2, BH4, BH2, BH8, BH4, BH2)]

[Figure 4: IMC-Sort intra-vault N-input semi-crossbar permutation unit]

2.2 Intra-vault merging
In order to support various lengths of input sequences with a fixed number of CAS units and a fixed input permutation unit, we sort the sequence in chunks (the chunk size is determined by the number of CAS units) and merge the sorted values into a single sorted sequence. Given two sorted lists a and b, each of size n/2, the concatenation of a with the inversion of b is a Bitonic sequence. Thus, applying a single stage of the Bitonic half cleaner BH(n) separates the lower half and the higher half in one stage; the result is two Bitonic sequences in the lower and higher halves. By scheduling the half cleaner operations recursively (BH(n) → BH(n/2) → BH(n/4), etc.), a globally sorted list is obtained in O(log n) stages. An example of merging two lists of size 4 is illustrated in Figure 5. With this merging capability, the intra-vault sorter can easily sort data larger than its input size. In the larger context, when merging two lists of arbitrary size, assuming a sorter input size (number of CAS units) of n, a single intra-vault sorting unit can emit n/2 numbers every O(log n) stages, which is significantly higher than ordinary software-based merging at 1 number per cycle.

2.3 Inter-vault sorting
The intra-vault sorting network performs sorting of arbitrary-length sequences within a vault. The inter-vault sorting mechanism merges the sorted values/sequences from all the vaults to obtain a globally sorted sequence. In the subsequent sections, we discuss the address mapping strategy used to distribute the input sequence across different vaults, and strategies for merging the results across vaults.

2.3.1 Address Mapping: In the stacked HMC memory organization, direct high-bandwidth access is guaranteed only within the local vault. Also, the intra-vault sorter operates by accessing a block of data within consecutive addresses, which means that consecutive data addresses should be mapped to the same vault. On the other hand, in order to enable inter-vault parallelism, blocks of data need to be distributed evenly across all vaults.
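These two constraints can be captured in a short sketch (our own illustration, not part of the IMC-Sort design files; the 32-vault count matches the HMC organization considered in this paper, and the 4KB page granularity is the one IMC-Sort adopts):

```python
# Sketch: page-granular vault mapping vs. naive fine-grained interleaving.
# Assumptions for illustration: 32 vaults, 4 KB pages, byte addresses.
NUM_VAULTS = 32
PAGE_SIZE = 4096  # bytes

def naive_vault(addr: int, block: int = 256) -> int:
    """Fine-grained interleaving: consecutive 256 B blocks land in
    different vaults, so a sorter's block of consecutive addresses
    straddles vaults."""
    return (addr // block) % NUM_VAULTS

def page_vault(addr: int) -> int:
    """Page-granular mapping: every address inside a 4 KB page maps to
    one vault, while successive pages are spread across the vaults."""
    return (addr // PAGE_SIZE) % NUM_VAULTS

# All addresses within one page stay in a single vault...
assert {page_vault(a) for a in range(0, PAGE_SIZE, 256)} == {0}
# ...while consecutive pages are distributed evenly across the vaults.
assert [page_vault(p * PAGE_SIZE) for p in range(NUM_VAULTS)] == list(range(NUM_VAULTS))
```

Under the naive interleaving, a single 4KB block of consecutive addresses touches many vaults; under the page-granular mapping it stays local while the pages themselves round-robin across vaults.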


The default HMC address mapping, which interleaves addresses across all vaults, is therefore not suitable. We adopt a simple optimization that maps every 4KB page to the same vault, to facilitate both the intra-vault and inter-vault sorting mechanisms. We leverage the user-defined address mapping mode of the HMC to perform this custom mapping.

[Figure 5: Overview of the steps in the Bitonic sorting network based merge operation (two sorted lists of size 4 merged through BH8, BH4, and BH2 stages)]

2.3.2 Sorting Strategy: As each vault can sort and merge blocks of data independently in parallel, two types of strategies can be adopted to distribute the global sorting task. The first strategy is to sample and bucket the task into non-overlapping chunks of data, and then redistribute them across vaults. Thus, after each vault finishes its sub-task (sorting) independently, a globally sorted list is obtained. Such a mechanism relies on the host CPU's sampling strategy; the performance can vary greatly depending on the data distribution and sampling, which does not represent a general case. However, in ideal cases where the value distribution of the sort task is known beforehand, sampling and bucketing can be highly efficient. The second strategy is to progressively merge across vaults after each vault finishes its local task independently. This represents a more general case, where data already resides across vaults and a global sort is requested. In the HMC memory organization, each vault acts as an independent memory controller that can also send read requests into the device crossbar, like a host, to read data from other vaults. This allows the intra-vault sorting unit to bring in non-local data from other vaults and merge across two different vaults; the result is then sent to a pre-allocated memory space. We adopt the second strategy in the IMC-Sort architecture, for the flexibility to support sorting in a more generalized manner. Consider an HMC organization with 32 vaults. First, each of the vaults performs sort and merge operations to sort its input sequence of arbitrary length (the data to be sorted within each vault) independently in parallel, resulting in 32 sorted sequences. Next, in parallel, every pair of vaults merges its data into a single list, resulting in 16 sorted sequences, using the HMC crossbar network. Due to the address mapping optimization, every vault should hold a similar amount of data, so either vault of a pair can be activated to perform the merge. The merge step across the vaults is repeated until a single globally sorted list is obtained.

3 EVALUATION
In this section, we discuss the implementation and the evaluation methodology used for the evaluation of the IMC-Sort architecture. We also compare the performance and energy efficiency of the IMC-Sort architecture with a GNU implementation of parallel sort on a CPU and a state-of-the-art FPGA-based near-memory accelerator architecture [10].

3.1 Area and Power Model
For the performance and energy evaluation, we have configured the IMC-Sort architecture to accept 64 inputs, each 64 bits wide, per cycle. In this configuration, the intra-vault sorting unit can sort and output 64 inputs every 21 cycles, which translates to approximately 18 GB/s throughput per vault. Considering the HMC vault's bandwidth of 16 GB/s, the chosen configuration can sufficiently saturate the intra-vault bandwidth. A larger design would have minimal benefit due to bandwidth saturation and would also introduce significant routing overhead. For the area and power estimation of the custom intra-vault sorting logic, we have synthesized the design at 800 MHz using Synopsys Design Compiler at the 32nm technology node. Each intra-vault sorting unit has an area footprint of 0.19 mm² and an average power consumption of 21 mW. Considering the total of 32 vaults in the HMC memory system, the overall area and average power consumption of the intra-vault sorting units are 6.08 mm² and 0.67 W, respectively. Each vault in the HMC memory system has a headroom of 3.5 mm² [14]; the IMC-Sort architecture's intra-vault sorting unit is well within this area headroom. The IMC-Sort architecture is functionally verified using a custom cycle-accurate model which models the parallel vault accesses and the crossbar interconnection in the HMC memory system. For performance and power evaluation purposes, we fed the area and power numbers from synthesis for the intra-vault sorting unit into the model. For the energy consumption of HMC accesses, we have assumed 4 pJ/bit for low-level DRAM accesses, 2 pJ/bit for vault access, 0.05 pJ/bit for TSV, and 4 pJ/bit for the off-chip SerDes link, as reported in [15]. The custom cycle-accurate model is built using the PyMTL3 [16] framework.

3.2 Performance and Energy Evaluation
We have compared the performance and energy consumption of the IMC-Sort architecture with the GNU parallel sort implementation [17] executed on an 8-threaded CPU in an Amazon cloud instance. Figure 6 shows the speedup in execution time and the energy savings of the IMC-Sort architecture normalized to the CPU implementation. We have used datasets ranging from 32 MB to 4 GB with 64-bit keys for the evaluation. As shown in Figure 6, the IMC-Sort architecture achieves an average speedup of 16.8× and average energy savings of 375.5× with respect to the CPU implementation. The performance and energy gains over the CPU implementation can be attributed to the efficient utilization of the available high bandwidth at each HMC vault by the custom sorting unit in the logic layer.
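The per-vault throughput figure quoted in Section 3.1 follows directly from the stated configuration; a quick back-of-the-envelope check (our own arithmetic sketch; interpreting the quoted GB/s as GiB/s is our assumption):

```python
# Check of the per-vault sorter throughput stated in Section 3.1:
# 64 inputs of 64 bits are emitted every 21 cycles at 800 MHz.
INPUTS = 64
BYTES_PER_INPUT = 8      # 64-bit keys
CYCLES_PER_BATCH = 21
CLOCK_HZ = 800e6         # 800 MHz synthesis target

bytes_per_batch = INPUTS * BYTES_PER_INPUT            # 512 bytes
batches_per_sec = CLOCK_HZ / CYCLES_PER_BATCH
throughput_gib = bytes_per_batch * batches_per_sec / 2**30

# Consistent with the "approximately 18 GB/s" figure in the text.
assert 18.0 < throughput_gib < 18.5
```

This is comfortably above the 16 GB/s vault bandwidth, which is why the text argues a larger sorter would yield minimal benefit.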


[Figure 6: Speedup in execution time and energy savings of the IMC-Sort architecture normalized to the GNU implementation on CPU, for dataset sizes from 32 MB to 4096 MB]

Figure 7 shows the comparison of the IMC-Sort architecture with the custom intra-vault sorting unit against the IMC-Sort architecture that uses the native HMC instruction set (the atomic compare-and-swap (CAS) instruction) to perform the intra-vault sorting operation, for various input sequence lengths. The atomic CAS instruction only supports a single 8- or 16-byte swap of a memory operand with an immediate operand. We generated the sequences of atomic CAS requests and interfaced with HMC-Sim 2.0 [18] to analyze the performance. The resulting execution cycles are normalized and compared with the IMC-Sort custom intra-vault sorting unit implementation. As shown in Figure 7, the custom intra-vault sorting unit design achieves an average speedup of 17.1× and average energy savings of 14.5×. The performance and energy gains can be attributed to sufficiently utilizing the available intra-vault bandwidth with parallel CAS units. Instead of executing a sequence of CAS instructions (at 8- or 16-byte granularity) per vault, the custom sorting unit accesses data at a 256-byte granularity and performs several CAS operations in parallel, resulting in the performance and energy gains.

[Figure 7: Speedup in execution time and energy savings of the IMC-Sort architecture with custom intra-vault logic normalized to the IMC-Sort architecture using the native HMC instruction set, for 4K to 64K data elements]

Figure 8 shows the speedup in execution time and energy savings of the IMC-Sort architecture normalized to the recently proposed near-memory FPGA-based sorting accelerator architecture Bonsai [10]. We have used datasets ranging from 32 MB to 4 GB with 64-bit keys for the evaluation. Bonsai proposed an unrolled merge-tree sorting architecture for a similar multi-channel stacked memory system, HBM. We apply the same mechanism for the performance comparison, where an unrolled merge tree is attached to every vault. In order to saturate the available bandwidth of 16 GB/s per vault, a 2-way merge tree with a throughput of 16 is used. As shown in Figure 8, the IMC-Sort architecture achieves an average speedup of 1.1× and average energy savings of 13.6× with respect to the Bonsai architecture. By analyzing the custom cycle-accurate model traces, we infer that the performance gain mainly comes from the initial sorting phase, where small sorted lists are created. In our design, a sorted list of size 64 can be obtained quickly within one invocation of the intra-vault sorting unit, while the Bonsai merge tree requires multiple runs to obtain a sorted list of size 16 and then progressively merges into a sorted list of size 64. The energy gains can be attributed to the custom optimized implementation of the Bitonic sorting network-based intra-vault sorting unit, which has lower hardware requirements in comparison to the Bonsai architecture.

[Figure 8: Speedup in execution time and energy savings of the IMC-Sort architecture normalized to the Bonsai architecture [10], for dataset sizes from 32 MB to 4096 MB]

4 CONCLUSION
In this paper, we present a processing-in-memory architecture called IMC-Sort, which incorporates a custom intra-vault sorting unit at the HMC vault's logic layer to perform parallel sort operations. The intra-vault sorting unit, along with the vault controller, is further used to merge the sorted results from the individual sorts by moving data elements across the vaults using the HMC's crossbar network. IMC-Sort also uses a custom address mapping for distributing input data elements across vaults, and custom Bitonic sorting network input permutation units and folded compare-and-swap units for reducing the hardware requirements. IMC-Sort achieved 16.8×, 1.1× speedup and 375.5×, 13.6× savings in energy consumption compared to the widely used CPU implementations and a state-of-the-art near-memory custom sort accelerator. The performance and energy gains of the IMC-Sort architecture are due to the efficient utilization of the high intra-vault bandwidth at each HMC vault with custom, efficient parallel sorting hardware.
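As a concluding illustration, the folded Bitonic sort and merge described in Sections 2.1 and 2.2 can be modeled in a few lines of software. This is our own sketch of Batcher's network schedule, not the hardware implementation: the hardware folds the network into a single stage of CAS units plus a permutation unit, whereas this model simply iterates the half cleaner schedule.

```python
# Software model of the Bitonic sort/merge schedule (illustration only).
def bitonic_sort(data):
    """Sort a list whose length is a power of two using Batcher's network:
    for each block size k, apply half cleaners BH(k), BH(k/2), ..., BH(2)."""
    n = len(data)
    assert n and n & (n - 1) == 0, "length must be a power of two"
    a = list(data)
    k = 2
    while k <= n:          # outer stages: BH2, then BH4->BH2, then BH8->..., ...
        j = k // 2
        while j >= 1:      # half cleaners BH(2j): compare input i with i + j
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0   # sort direction per k-block
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a

# The same network merges two sorted lists: concatenating one list with the
# reversal of the other yields a Bitonic sequence, and the final half cleaner
# stages BH(n) -> BH(n/2) -> ... -> BH(2) then produce a sorted output.
def bitonic_merge(lo, hi):
    a = lo + hi[::-1]
    n = len(a)
    j = n // 2
    while j >= 1:
        for i in range(n):
            partner = i ^ j
            if partner > i and a[i] > a[partner]:
                a[i], a[partner] = a[partner], a[i]
        j //= 2
    return a
```

In hardware, the `i ^ j` pairing is what the semi-crossbar permutation unit realizes for each half cleaner, while the per-CAS-unit sort direction comes from the predetermined lookup table.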

49
Session 1B: Emerging Memory-Enabled Computing for Future Electronics GLSVLSI ’20, September 7–9, 2020, Virtual Event, China

ACKNOWLEDGMENTS
This work was supported in part by the Semiconductor Research Corporation (SRC) Center for Research in Intelligent Storage and Processing in Memory (CRISP).

REFERENCES
[1] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar, “ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars,” in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pp. 14–26, 2016.
[2] P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and Y. Xie, “PRIME: A Novel Processing-in-Memory Architecture for Neural Network Computation in ReRAM-Based Main Memory,” in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pp. 27–39, 2016.
[3] S. Gudaparthi, S. Narayanan, R. Balasubramonian, E. Giacomin, H. Kambalasubramanyam, and P.-E. Gaillardon, “Wire-aware architecture and dataflow for CNN accelerators,” in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, pp. 1–13, 2019.
[4] J. Ahn, S. Hong, S. Yoo, O. Mutlu, and K. Choi, “A scalable processing-in-memory accelerator for parallel graph processing,” in 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA), pp. 105–117, June 2015.
[5] G. Li, G. Dai, S. Li, Y. Wang, and Y. Xie, “GraphIA: An In-situ Accelerator for Large-scale Graph Processing,” in Proceedings of the International Symposium on Memory Systems, pp. 79–84, 2018.
[6] H.-c. Yang, A. Dasdan, R.-L. Hsiao, and D. S. Parker, “Map-reduce-merge: Simplified relational data processing on large clusters,” in Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, pp. 1029–1040, 2007.
[7] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Antony, H. Liu, and R. Murthy, “Hive - a petabyte scale data warehouse using Hadoop,” in 2010 IEEE 26th International Conference on Data Engineering (ICDE 2010), pp. 996–1005, 2010.
[8] A. Farmahini-Farahani, H. J. Duwe III, M. J. Schulte, and K. Compton, “Modular design of high-throughput, low-latency sorting units,” IEEE Trans. Comput., vol. 62, pp. 1389–1402, July 2013.
[9] S. H. Pugsley, A. Deb, R. Balasubramonian, and F. Li, “Fixed-function hardware sorting accelerators for near data MapReduce execution,” in 2015 33rd IEEE International Conference on Computer Design (ICCD), pp. 439–442, 2015.
[10] N. Samardzic, W. Qiao, V. Aggarwal, M. F. Chang, and J. Cong, “Bonsai: High-Performance Adaptive Merge Tree Sorting,” in 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), 2020.
[11] S. Zhou, C. Chelmis, and V. K. Prasanna, “High-throughput and energy-efficient graph processing on FPGA,” in 2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), pp. 103–110, 2016.
[12] A. Srivastava, R. Chen, V. K. Prasanna, and C. Chelmis, “A hybrid design for high performance large-scale sorting on FPGA,” in 2015 International Conference on ReConFigurable Computing and FPGAs (ReConFig), pp. 1–6, 2015.
[13] K. E. Batcher, “Sorting networks and their applications,” in Proceedings of the April 30–May 2, 1968, Spring Joint Computer Conference, pp. 307–314, 1968.
[14] M. Gao, J. Pu, X. Yang, M. Horowitz, and C. Kozyrakis, “TETRIS: Scalable and efficient neural network acceleration with 3D memory,” in Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’17), pp. 751–764, 2017.
[15] B. Akin, F. Franchetti, and J. C. Hoe, “Data reorganization in memory using 3D-stacked DRAM,” in Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA ’15), pp. 131–143, 2015.
[16] S. Jiang, P. Pan, Y. Ou, and C. Batten, “PyMTL3: A Python framework for open-source hardware modeling, generation, simulation, and verification,” IEEE Micro, vol. 40, no. 4, pp. 58–66, 2020.
[17] H. Chen, S. Madaminov, M. Ferdman, and P. Milder, “FPGA-accelerated samplesort for large data sets,” in The 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 222–232, 2020.
[18] J. D. Leidel and Y. Chen, “HMC-Sim-2.0: A simulation platform for exploring custom memory cube operations,” in 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 621–630, 2016.
