A ReRAM-Based Processing-In-Memory Architecture for Hyperdimensional Computing

Cong Liu, Kaibo Wu, Haikun Liu, Member, IEEE, Hai Jin, Fellow, IEEE, Xiaofei Liao, Member, IEEE, Zhuohui Duan, Jiahong Xu, Huize Li, Yu Zhang, and Jing Yang
Abstract—Hyperdimensional computing (HDC) is a human brain-inspired computing paradigm that processes neural activity patterns with high-dimensional vectors. Existing HDC accelerators usually utilize different hardware architectures to process the encoding phases and comparison phases of HDC applications separately. They are unable to adapt to dynamic workloads for various datasets, resulting in resource underutilization. In this article, we propose a resistive random access memory (ReRAM)-based HDC accelerator called ReHDC for general HDC. We abstract the computing paradigms in the encoding and comparison phases, and provide uniform primitive operators to efficiently process these two phases with the same hardware architecture. In the unified processing engine, ReHDC utilizes analog crossbar arrays to accelerate accumulation operations, and digital crossbar arrays to speed up high-dimensional element-wise operations (XOR). Experimental results show that ReHDC can accelerate HDC training by 69.4× and 1.93×, and can also improve the energy efficiency by 51.5× and 2.2×, compared with an NVIDIA Tesla P100 GPU and the ReRAM-based HDC accelerator DUAL, respectively. Moreover, the performance speedup and energy efficiency for HDC inference are similar to those of HDC training.

Index Terms—Element-wise operations, hyperdimensional computing (HDC), processing-in-memory (PIM) architecture, resistive random access memory (ReRAM).

Manuscript received 27 August 2023; revised 29 December 2023 and 16 February 2024; accepted 27 February 2024. Date of publication 19 August 2024; date of current version 22 January 2025. This work was supported in part by the National Key Research and Development Program of China under Grant 2022YFB4500303; in part by the National Natural Science Foundation of China (NSFC) under Grant 62332011; and in part by Huawei Technologies Co., Ltd under Grant YBN2021035018A7. This article was recommended by Associate Editor A. Gamatie. (Corresponding author: Haikun Liu.)

Cong Liu, Kaibo Wu, Haikun Liu, Hai Jin, Xiaofei Liao, Zhuohui Duan, Jiahong Xu, Huize Li, and Yu Zhang are with the National Engineering Research Center for Big Data Technology and System, Services Computing Technology and System Lab, Cluster and Grid Computing Lab, School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China.

Jing Yang is with the School of Computer Science and Technology, Hainan University, Haikou 570288, China.

Digital Object Identifier 10.1109/TCAD.2024.3445812

I. INTRODUCTION

IN RECENT years, deep neural networks (DNNs) have demonstrated high accuracy in complex classification tasks [1], [2]. However, conventional machine learning models running on traditional computing architectures usually incur significant computing and energy resource overheads.

Hyperdimensional computing (HDC) has emerged as a new learning paradigm inspired by the neural activities of human brains [3]. It offers several advantages over DNNs [4] because of its memory-centric and one-trial learning capability. For example, HDC delivers low latency and high energy efficiency because backpropagation and sparse coding are unnecessary in HDC processing. Furthermore, HDC achieves higher robustness than traditional DNNs [5]. These advantages make HDC a promising solution for various recognition tasks, including text classification [6], behavior signal classification [7], image classification [8], DNA sequencing [9], and emotion recognition [10].

The computing paradigm of HDC is mainly dominated by bitwise operations. Similar to DNNs, HDC also requires a large amount of data to train an HDC model before inference. All input data, such as images and texts, should be converted into binary sample hypervectors (HVs) through element-wise binding and bundling operations [8]. This phase is also called data encoding. Besides the encoding phase, HDC training also contains an aggregation phase in which different features of all encoded sample HVs are accumulated into a single class HV for each class of input data, while HDC inference contains a comparison phase in which the similarity between the query HV and each class HV is measured by the distance between the two HVs. Thus, both HDC training and inference involve a large amount of operations on HVs.

Traditional von Neumann architectures usually suffer from low computing efficiency and high energy consumption for HDC applications due to massive data movement between main memory and processing units. Existing ASIC processors [3], [5] often cannot satisfy the large memory and computing resource requirements of large-scale HDC applications. Resistive random access memory (ReRAM)-based processing-in-memory (PIM) accelerators achieve in-situ computing with high data parallelism [11], [12]. They have demonstrated significant performance and energy efficiency in matrix-vector multiplication (MVM) operations [13] and bitwise operations of binary vectors [14].
Generally, there are two types of ReRAM-based PIM architectures: 1) analog PIM (APIM) and 2) digital PIM (DPIM). APIM utilizes ReRAM crossbar arrays to perform MVM operations efficiently in only one step, offering a promising approach to DNN inference applications in which the weight matrices of different DNN layers are fixed. For example, ISAAC [13] and GraphR [15] exploit APIM to efficiently perform MVM operations in neural networks and graph computing applications, significantly reducing data processing latency and energy consumption. However, although APIM achieves significant performance speedup in MVM operations, it is unsuitable for other mathematical operations, such as addition, multiplication, and bitwise operations. In contrast, DPIM can exploit ReRAM cells to perform logical NOR operations [16] in the digital computing paradigm. Since all arithmetic operations can be converted into a series of NOR operations, DPIM is able to perform a wide range of arithmetic operations [11]. FloatPIM [17] leverages DPIM to enhance precision and execute mathematical operations. DUAL [14] fully exploits the high parallelism of DPIM to efficiently perform cluster analysis. DPIM also demonstrates high parallelism for bitwise operations of HVs, delivering significant performance speedup for data-intensive applications.

Previous ReRAM-based HDC accelerators [8], [18] usually exploit customized hardware for the encoding and comparison phases separately, and often cause low utilization of the limited ReRAM resource because the customized hardware cannot satisfy the unbalanced resource requirements of these two phases for different datasets. Fortunately, we find that most operations in the encoding and comparison phases of HDC can be abstracted as element-wise XOR and accumulation operations. Thus, we can design a unified ReRAM-based PIM architecture to accelerate these two computation-intensive phases of HDC using the same processing engines (PEs). Notably, a single PE is comprised of multiple APIMs and DPIMs for MVM operations and HV bitwise operations, respectively.

There are several challenges in designing a unified ReRAM-based PIM architecture for HDC. First, most ReRAM-based APIM accelerators [11] only support a special computing paradigm, i.e., MVM operations. It is not easy to support bitwise operations on vectors by adding some extra peripheral circuits. Second, although computation-intensive operations in the encoding and comparison phases can be converted into element-wise XOR and accumulation operations, it is challenging to directly process these two phases with the same crossbar arrays. For instance, the encoding phase requires horizontal accumulation of XOR results for all rows, while the comparison phase requires vertical accumulation of XOR results. Third, data movement of intermediate results among crossbar arrays causes nontrivial time overhead due to ReRAM write operations. In this article, we design a heterogeneous ReRAM-based PIM architecture to tackle these challenges. The major contributions of this article are summarized as follows.
1) We analyze the computing paradigm of the encoding and comparison phases in HDC, and abstract different operations with uniform primitive operators that can be efficiently accelerated by ReRAM-based PIM architectures.
2) We design a heterogeneous ReRAM-based PIM architecture called ReHDC to process the encoding and comparison phases of HDC uniformly. In the unified PE of ReHDC, we employ APIM modules to process accumulation operations with array-level parallelism, and exploit digital PIM (DPIM) modules to process element-wise XOR operations with row-level parallelism.
3) We design dual access paths for APIM arrays to achieve both horizontal and vertical accumulation operations in the encoding phase and the comparison phase, respectively. Moreover, to further improve the efficiency of ReHDC, we design a fine-grained pipeline to overlap data transfer and data processing, and thus hide the ReRAM write latency as much as possible.
4) We evaluate the efficiency of ReHDC using four datasets. Experimental results show that ReHDC can accelerate HDC training by 69.4× and 1.93×, and can also improve the energy efficiency by 51.5× and 2.2×, compared with an NVIDIA Tesla P100 GPU and the ReRAM-based HDC accelerator DUAL [14], respectively. Furthermore, the performance speedup and energy efficiency achieved in HDC inference are comparable to those observed in HDC training.

The remainder of this article is organized as follows. Section II first introduces the background of HDC and two kinds of ReRAM-based PIM architectures, and then presents the motivation of this article. Section III presents the details of the ReHDC architecture. Section IV describes the experimental methodologies and results. Section V studies the related work. We conclude in Section VI.

II. BACKGROUND AND MOTIVATION

In this section, we first present the necessary background of HDC and ReRAM-based PIM architectures, and then describe the motivation of this work.

A. Hyperdimensional Computing

HDC is a novel computing model inspired by the way that neurons work in human brains [19]. It has been applied in various areas, such as image recognition [4] and text classification [20]. The basic data structure in HDC is the high-dimensional vector (i.e., HV), whose dimension may exceed 10 000. HVs can represent different features of real items in various real-world recognition tasks, such as language characters and bio-signals. An HV is generated randomly and holographically with independent and identically distributed (i.i.d.) elements which are binary (0/1) or bipolar (+1/−1) codes [21]. No element of an HV stores more information than the others, which improves the robustness of HDC. When the dimension of HVs is rather high (e.g., D = 10 000), any two random HVs are almost orthogonal.

There are several typical operations in HDC [22], such as binding operations that bind two HVs to generate a new HV, and bundling operations that bundle many HVs together into a new HV. The bundling operation retains the information of the input HVs, while the binding operation generates a new HV that is nearly orthogonal to the input HVs [23]. These operations sustain the dimension of HVs, and thus keep the computing paradigm of HDC stable.
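These similarity properties are easy to check numerically. The following minimal Python sketch is our own illustration, not code from the paper: it assumes binary HVs, XOR as the binding operator, and a majority rule over three HVs as the bundling operator.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 10_000  # hypervector dimension

def hamming(a: np.ndarray, b: np.ndarray) -> float:
    """Normalized Hamming distance; ~0.5 means quasi-orthogonal."""
    return np.count_nonzero(a != b) / D

a, b, c = (rng.integers(0, 2, D, dtype=np.uint8) for _ in range(3))
print(hamming(a, b))  # ~0.5: two random HVs are almost orthogonal

bound = a ^ b  # binding via element-wise XOR
print(hamming(bound, a), hamming(bound, b))  # both ~0.5: dissimilar to inputs

bundled = ((a + b + c) > 1).astype(np.uint8)  # majority-rule bundling of 3 HVs
print(hamming(bundled, a))  # ~0.25: the bundled HV stays similar to its inputs
```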
In the following, we describe the two key phases of HDC inference.

1) Encoding Phase: As shown in Fig. 1, the first step of HDC is to map real-world data into a hyperdimensional space to generate HVs. This operation is called the encoding phase. Several encoding approaches have been proposed in previous studies [5], [24]. Generally, they linearly sum up all HVs associated with each feature. A feature is composed of a position HV and a level HV. Position and level HVs are
Fig. 1. HDC process for classification tasks.

Fig. 2. Two kinds of PIM architecture. (a) ReRAM-based APIM. (b) ReRAM-based DPIM.
between the HRS and the LRS. The initial output ReRAM cell should be set to the LRS. According to Kirchhoff's law, in a parallel circuit, the overall resistance of the input ReRAM cells is lower than that of each individual input ReRAM cell. Thus, when an input ReRAM cell is in an LRS, the current flowing out of the circuit increases, allowing the output ReRAM cell to change from an LRS to an HRS.

The circuit characteristic described above can be exploited to implement a logical NOR operation (i.e., $A \text{ NOR } B = \overline{A + B}$), which can be represented as follows: 0 NOR 0 = 1, 0 NOR 1 = 1 NOR 0 = 0, and 1 NOR 1 = 0. Thus, the DPIM naturally supports NOR operations. Furthermore, the DPIM array can also support row- or column-level parallelism, and thus has higher elasticity than the APIM. As most arithmetic operations can be converted into a series of NOR operations, the DPIM is able to flexibly execute arithmetic and logical operations. For example, the 1-bit XOR logic operation can be implemented with 6 NOR operations based on De Morgan's law. Assume A and B represent the two input operands, and $\overline{A}$ represents the NOT operation on A. The XOR operation $A \oplus B$ can be represented by (4). Fig. 3 illustrates the six steps of the XOR operation in a DPIM:

$$A \oplus B = A \cdot \overline{B} + \overline{A} \cdot B = \overline{\overline{A} + B} + \overline{A + \overline{B}} = \overline{\overline{\overline{\overline{A} + B} + \overline{A + \overline{B}}}} \tag{4}$$
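To make the six steps concrete, here is a minimal Python sketch (our own illustration; the `nor` helper only models the result a DPIM NOR step produces, not the circuit itself):

```python
import numpy as np

def nor(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Bitwise NOR, the only primitive assumed available in a DPIM."""
    return 1 - (a | b)

def xor_via_nor(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Compute A XOR B with exactly the six NOR operations of (4)."""
    not_a = nor(a, a)    # step 1: NOT A
    not_b = nor(b, b)    # step 2: NOT B
    t1 = nor(not_a, b)   # step 3: NOR(~A, B) = A AND ~B
    t2 = nor(a, not_b)   # step 4: NOR(A, ~B) = ~A AND B
    t3 = nor(t1, t2)     # step 5: NOR of the two product terms
    return nor(t3, t3)   # step 6: final inversion yields A XOR B

a = np.random.randint(0, 2, size=16)
b = np.random.randint(0, 2, size=16)
assert np.array_equal(xor_via_nor(a, b), a ^ b)
```

Each `nor` call corresponds to one of the six steps in (4).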
C. Motivation

In this section, we discuss the design challenges of an ReRAM-based PIM architecture for HDC applications.

The Limitation of Computing Paradigms: Traditional APIM accelerators are designed for accelerating MVM operations. However, in HDC applications, other arithmetic and logical operations like bitwise XOR cannot be efficiently executed by APIM, posing a significant limitation for fully utilizing ReRAM-based HDC accelerators. In contrast, DPIM accelerators can perform NOR operations in parallel. Nevertheless, the DPIM is much slower than the APIM when it performs MVM operations in HDC. Hence, the integration of APIM and DPIM is crucial for effectively accelerating HDC applications, which are comprised of both MVM and bitwise logical operations.

Disparities of ReRAM Resource Requirements Between the Encoding and Comparison Phases for Various Datasets: The most recent ReRAM-based HDC accelerator [4] allocates fixed-size ReRAM resources to processing units for the encoding and comparison phases, and thus cannot fairly satisfy the resource requirements of these two phases for different datasets. Fig. 4 shows the breakdown of operations in the encoding and comparison phases of HDC for four datasets. We can find that the percentage of operations used for the encoding phase and the comparison phase varies significantly across datasets. For instance, the comparison phase accounts for only 1% of total operations in the MNIST dataset, while there are 41% comparison operations in the BEIJING dataset [29]. The proportion of operations in the encoding and comparison phases is mainly determined by the number of features and classes in the dataset. A dedicated hardware architecture designed for these two phases cannot adapt to different datasets, usually resulting in low utilization of computing resources. This motivates us to develop a unified ReRAM-based HDC accelerator that can dynamically allocate PEs to the two phases based on their ratio of operations. In this way, our architecture can adapt to various datasets more flexibly, and improves the utilization of ReRAM resources and the performance of HDC applications.

Fixed Access Paths of APIM Arrays: APIM plays a crucial role in ReRAM-based HDC accelerators. During the encoding phase, APIM accumulates feature HVs to obtain the sample HV. During the comparison phase, APIM calculates the Hamming distance between two HVs. However, conventional APIM arrays only support bitline accumulation in one direction, which limits the utilization of APIM in an ReRAM-based HDC accelerator. To enable these two types of computations for HDC applications, it is essential to redesign the APIM architecture so that it can perform MVM operations in two directions.

Write Overhead Caused by Data Dependencies: During HDC processing, data dependencies exist in both the encoding and comparison phases. In addition, while multiple ReRAM arrays offer the potential for parallel computing, the large amount of data involved in HDC processing results in substantial write overhead during intermediate data transfers, leading to performance degradation. Therefore, it is essential to hide the write latency of ReRAM.

These challenges motivate us to design an ReRAM-based accelerator for general HDC processing. Particularly, it can be dynamically configured according to the characteristics of different datasets.
III. REHDC ARCHITECTURE

A. Overview of ReHDC

Fig. 5 demonstrates the architecture of ReHDC. As shown in Fig. 5(a), ReHDC is comprised of multiple tiles that are interconnected through a global bus. Each tile is composed of a controller (CTRL), several PEs, and two buffers. The input buffer is responsible for caching data that is written to the DPIM, while the output buffer stores the results of the PE. The controller generates the control signals for each component. The sALU is responsible for computing the sum of the bitlines in multiple arrays and performing other simple operations.

In each PE, the DPIM crossbar arrays are utilized to perform element-wise XOR logic operations, while the APIM crossbar arrays are employed for array-level accumulation operations. As shown in Fig. 5(b), each wordline in the APIM is connected to a DAC, which converts the input data into a vector of voltages applied to the wordlines of the crossbar array. In contrast, the ADC converts the output current into digital signals for further processing, and is shared by multiple bitlines of the crossbar array. To capture the intermediate data of PEs, each PE contains both an input register (IR) and an output register (OR) to latch the data. Each PE also contains multiple comparators (CMPs). As shown in Fig. 5(c), some CMPs are arranged as a CMP tree to determine the minimum Hamming distance between the sample HV and each class HV, and the others are employed to binarize the output results of the APIM in the encoding phase based on a predetermined threshold. To improve data parallelism for DPIMs, multiple cascaded DPIM arrays are exploited to support row-level parallel data transfer and element-wise XOR operations, as shown in Fig. 5(d).

B. Data Processing in the Encoding Phase

The encoding phase of HDC mainly consists of two operations: 1) binding and 2) bundling. Binding is employed to handle element-wise XOR operations between the level HV and position HV of each feature. The bundling operation is used to accumulate multiple feature HVs of a sample.

We choose DPIM arrays to perform element-wise XOR operations with high parallelism. We employ a judicious HV-to-DPIM mapping scheme to leverage row parallelism and perform element-wise XOR operations. Given that an HV in HDC usually contains thousands of dimensions, element-wise XOR using DPIM can significantly exploit data parallelism and enhance the efficiency of the encoding phase in HDC. To perform multiple XOR operations in parallel, both of their operands should be placed in the same row of a crossbar, and the output results are stored in the same row of the crossbar as well.

Fig. 6. Data mapping in the encoding phase.

Fig. 6 shows the data mapping of the encoding phase. We partition the DPIM array into three regions, which are used for storing preprocessed position HVs, level HVs, and output HVs. The size of the crossbar arrays assigned for storing input position HVs and level HVs is identical. We map the preprocessed input HVs to the columns of DPIMs and then perform element-wise XOR operations. The generated feature HVs are then stored in the output region. Multiple DPIMs can perform the XOR operation concurrently to maximize data parallelism. In this way, the computation latency can be significantly reduced.
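The overall dataflow of the encoding phase can be summarized in a few lines of Python. The sketch below is only a software analogy of the hardware behavior (DPIMs produce the XOR results, the APIM accumulates them column-wise, and a CMP binarizes the sums, as detailed in the remainder of this section); the dimension, feature count, level count, and majority threshold are illustrative assumptions rather than ReHDC parameters.

```python
import numpy as np

D = 10_000       # hypervector dimension (example value)
N_FEATURES = 64  # assumed number of features per sample
N_LEVELS = 16    # assumed number of quantization levels

rng = np.random.default_rng(42)
position_hvs = rng.integers(0, 2, (N_FEATURES, D), dtype=np.uint8)
level_hvs = rng.integers(0, 2, (N_LEVELS, D), dtype=np.uint8)

def encode(sample_levels: np.ndarray) -> np.ndarray:
    """Bind each position HV with its level HV (element-wise XOR, done by
    DPIMs), then bundle all feature HVs (vertical accumulation, done by
    the APIM) and binarize against a threshold (done by CMPs)."""
    feature_hvs = position_hvs ^ level_hvs[sample_levels]  # binding
    counts = feature_hvs.sum(axis=0)                       # bundling
    return (counts > N_FEATURES // 2).astype(np.uint8)     # thresholding

sample = rng.integers(0, N_LEVELS, N_FEATURES)  # quantized feature values
sample_hv = encode(sample)
```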
When the XOR operation is completed in the DPIMs, we use APIMs to perform the accumulation operations, since APIMs exhibit higher performance than DPIMs for MVM operations. At first, the feature HVs are written into the APIM array row by row using the peripheral circuits (DACs). Then, we apply an all-one vector (111···111) to the APIM to perform the accumulation.
The class with the minimum Hamming distance is deemed as the classified result.

D. Bidirectional Accumulation

During the encoding phase, the APIM should accumulate the elements of all feature HVs in the vertical direction, i.e., all elements of a column in the matrix are accumulated. During the comparison phase, the APIM should accumulate values in the horizontal direction, i.e., all elements of a row in the matrix are accumulated. However, traditional APIM arrays only support bitline accumulation in one direction, and thus limit their usage in a unified ReRAM-based HDC accelerator, resulting in low utilization of APIM arrays.

To address this problem, we reorganize ReRAM arrays to support bidirectional accumulation, as shown in Fig. 5(b). In this architecture, some peripheral circuits are added to enable bidirectional accumulation, without any modification to the ReRAM array [30]. Each wordline or bitline in the ReRAM array is connected to a sense amplifier, a write driver, a MUX, and an address decoder. Thus, the input vector (voltage) can be applied to the APIM array in the vertical or horizontal direction. As each APIM array only performs multiply-accumulate operations in a single direction at any time, two neighboring APIM arrays can share sense amplifiers and write drivers in both the vertical and horizontal directions, as shown in Fig. 5(b). To this end, the bidirectional design introduces acceptable area overhead.

Since the bidirectional APIM array is able to perform accumulation operations in both the vertical and horizontal directions, it can be used by both the encoding and comparison phases of HDC. Thus, the proposed bidirectional APIM empowers the design of unified PEs and the adaptive configuration of PEs (Section III-E). As a result, the resource utilization can be significantly improved by dynamically allocating PEs to the encoding and comparison phases according to the requirements of different datasets.

E. Adaptive Configuration

The HDC model consists of two phases: 1) encoding and 2) comparison. The HDC training is dominated by the encoding phase, while HDC inference involves both the encoding and comparison phases. Thus, it is essential to improve the performance of both the encoding and comparison phases in HDC. However, the conventional ReRAM-based HDC accelerator usually allocates a fixed proportion of computing resources to the encoding and comparison engines. Yet, the demand for computing resources between the encoding and comparison phases in HDC may vary significantly for different datasets. Thus, static computing resource allocation for different datasets usually results in a waste of computing resources.

To address this issue, we introduce an adaptive configuration scheme in our ReHDC architecture. Since the ReRAM-based HDC accelerator performs XOR and accumulation operations in both the encoding and comparison phases, ReHDC dynamically allocates unified PEs to different datasets according to the data volume generated in different phases. In this way, the utilization of computing resources is improved, eventually improving the performance of HDC.

During HDC training, the DPIMs and APIMs in all unified PEs are allocated to the encoding phase and accumulation phase. There is no resource contention for those DPIMs and APIMs. In contrast, during HDC inference, since there is DPIM resource contention between the encoding and comparison phases, we should carefully allocate these arrays to the two phases so that they can be executed in a pipeline. Thus, for a given dataset, the resource allocation should be done before HDC inference. The term dynamic allocation in our paper denotes that the array resources allocated to the encoding and comparison phases are dynamically changed for different datasets.

We propose a resource allocation model to determine the number of arrays required by these two phases. In the encoding phase, we assume all input samples are encoded into Q sample HVs, and each input sample has n features. As shown in Fig. 6, n position HVs and n level HVs are bound to generate n feature HVs, which are then accumulated to generate one sample HV. Thus, a single sample (or query) HV requires 3 × n columns in DPIM arrays, and all sample HVs require 3 × n × Q columns in total. In the comparison phase, we assume m class HVs in the dataset should be compared with each query HV. As shown in Fig. 8, about q + m + qm columns are required in each DPIM array, where q denotes the number of samples that can be processed in each DPIM array. Thus, Q samples require a total of Q(q + m + qm)/q columns. At last, the number of PEs allocated to the encoding phase ($PE_{enc}$) and the comparison phase ($PE_{com}$) is determined according to the following model:

$$\frac{PE_{enc}}{PE_{com}} = \frac{3 \times n \times q}{q + m + q \times m} \tag{5}$$
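As a sanity check of (5), the following sketch converts the ratio into an integer split of a PE budget (the rounding policy and the example numbers are our assumptions, not part of the paper):

```python
def allocate_pes(total_pes: int, n: int, m: int, q: int) -> tuple[int, int]:
    """Split a PE budget between encoding and comparison per (5).

    n: features per sample; m: classes; q: samples per DPIM array.
    """
    ratio = (3 * n * q) / (q + m + q * m)  # PE_enc / PE_com
    pe_com = max(1, round(total_pes / (1 + ratio)))
    pe_enc = total_pes - pe_com
    return pe_enc, pe_com

# A feature-heavy dataset (many features, few classes) gets almost all
# PEs for encoding; a class-heavy one shifts PEs toward comparison.
print(allocate_pes(64, n=784, m=10, q=8))  # -> (63, 1)
print(allocate_pes(64, n=32, m=100, q=8))  # -> (29, 35)
```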
F. Scalability of DPIM Crossbar Arrays

Due to the limitations of current technology, the array size of a DPIM is about 1000 × 1000. Thus, it is impossible to process an HDC classification task with only a single DPIM array. To reduce the total write latency of DPIM rows, we employ multiple cascaded DPIM arrays to support row-level parallel data transfer, as shown in Fig. 5(d). This significantly enhances the data parallelism of the ReHDC architecture and improves the performance of HDC processing. Assuming the HV dimension is D, D/1024 crossbar arrays are required to process all HV dimensions due to the limited size of a single crossbar array. However, since all DPIM arrays can process HVs in parallel, ReHDC still achieves rather high parallelism for HDC.

IV. EVALUATION

In this section, we first present the experimental setup, and then evaluate the performance and energy efficiency of ReHDC using four datasets. Finally, we evaluate the area, dimension, and endurance of the ReHDC architecture.

A. Experimental Setup

We simulate the functionality of ReHDC in MHSim [31], a simulation framework for ReRAM-based PIM architectures.
TABLE I
ReHDC Configurations

Fig. 10. Normalized performance speedup and energy saving of ReHDC inference. (a) Performance speedup. (b) Energy saving.

C. Inference Efficiency

As shown in Fig. 10, ReHDC significantly speeds up HDC inference by 1.77× and 63.35× on average, compared with the DUAL and GPU platforms, respectively. The processing latency in the inference stage mainly stems from the encoding of query data and the calculation of the Hamming distance between the query HV and all class HVs. Compared with the GPU platform, ReHDC significantly reduces the latency of query data encoding by leveraging the high parallelism of APIM. ReHDC can efficiently perform the accumulation operation using APIMs in the encoding phase, and thus significantly reduces the processing latency. In contrast, the DUAL architecture should perform multiple addition operations in the encoding phase, resulting in higher latency. During the comparison phase, ReHDC converts the calculation of the Hamming distance into an accumulation operation, which is also efficiently processed by APIM. APIM completes array-level MVM operations in just one cycle, while DUAL requires nonlinear sampling to calculate the Hamming distance by leveraging the timing characteristic of the match-line (ML) discharging current. Thus, our approach can significantly reduce the latency of computing the Hamming distance.
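Functionally, the comparison phase is just the encoding primitives applied in the other direction: an element-wise XOR, a horizontal (row-wise) accumulation, and a minimum search. A minimal sketch of this classification rule (our own illustration; it models neither DUAL's match-line timing nor the APIM circuit):

```python
import numpy as np

def classify(query_hv: np.ndarray, class_hvs: np.ndarray) -> int:
    """Pick the class whose HV has the minimum Hamming distance.

    class_hvs: (num_classes, D) binary matrix; query_hv: (D,) binary vector.
    """
    diffs = class_hvs ^ query_hv       # element-wise XOR (DPIM-style)
    distances = diffs.sum(axis=1)      # horizontal accumulation (APIM-style)
    return int(np.argmin(distances))   # minimum search (CMP-tree-style)
```

Note that encoding accumulates along axis 0 (columns) while comparison accumulates along axis 1 (rows), which is precisely the asymmetry the bidirectional APIM design of Section III-D resolves in hardware.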
ReHDC also reduces the energy consumption of HDC inference by 2.67× and 45.65× on average, compared with the DUAL and GPU platforms, respectively. The primary advantage of ReHDC lies in the energy saving of MVM operations, which are more efficiently processed by APIMs than by CMOS circuits. By combining the use of APIMs and DPIMs, ReHDC demonstrates extremely high energy efficiency in accelerating HDC applications.

D. Latency/Energy Breakdown

Fig. 11. Breakdown of different phases for energy consumption and latency. (a) Breakdown of energy. (b) Breakdown of latency.

Here, we analyze the breakdown of latency and energy consumption for the encoding and comparison phases in ReHDC, DUAL, and the GPU platform, respectively. As shown in Fig. 11, all three accelerators exhibit higher energy consumption in the encoding phase than in the comparison phase. This is attributed to more data being processed in the encoding phase. On the other hand, the proportion of latency required by the encoding phase of ReHDC is much lower than that of the other accelerators. This implies that ReHDC can efficiently process a large volume of input data in a very short time due to the high parallelism of DPIMs and APIMs, and thus significantly reduce the execution latency. In contrast, the GPU and DUAL require more time to process the input data. In summary, ReHDC demonstrates extremely high efficiency in accelerating the encoding phase of HDC, due to the high parallelism of its architecture.

E. Impact of Hypervector Dimensions

To explore the impact of HV dimensions on the performance of ReHDC, we vary the HV dimension from 2000 to 10 000, with an increment of 2000. As shown in Figs. 12 and 13, when the dimension of HVs decreases, ReHDC demonstrates more performance speedup and energy savings relative to the GPU baseline for both HDC training and inference. This implies that GPUs have higher performance and energy efficiency for HVs with larger dimensions because they can fully exploit massive GPU cores for highly parallel computing with more input data.

F. Endurance Management

ReRAMs have been widely exploited for in-situ computing in PIM architectures. However, they have limited write endurance, typically up to 10^12 writes. Without wear leveling, a ReRAM cell may be worn out quickly. Thus, it is crucial to prolong the lifetime of ReRAMs by optimizing the wear leveling in ReHDC.
Fig. 13. Performance/energy efficiency of ReHDC inference on different dimensions.

Fig. 14. Effect of dynamic resource allocation in ReHDC.

In APIM, an MVM operation only incurs one write for each ReRAM cell uniformly, and the class HVs are fixed in the crossbar arrays. Therefore, it is unnecessary to consider the wear leveling problem in APIM. However, in DPIM, six columns are frequently used for XOR operations, which may quickly wear out these columns. To solve this problem, we periodically and dynamically change the position of these six columns among a total of 1024 columns to achieve wear leveling, and thus increase the ReRAM's lifetime by 170.6×. This simple wear leveling scheme can effectively prolong the ReRAM's lifetime in all kinds of DPIMs.
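A sketch of one possible bookkeeping for this scheme follows; the round-robin window and the rotation period below are our assumptions, as the paper only states that the six columns are remapped periodically and dynamically.

```python
class DpimWearLeveler:
    """Rotate the six scratch columns used for XOR across all
    1024 columns of a DPIM array."""

    def __init__(self, total_cols: int = 1024, scratch_cols: int = 6,
                 rotate_period: int = 10_000):
        self.total_cols = total_cols
        self.scratch_cols = scratch_cols
        self.rotate_period = rotate_period  # XOR ops between rotations
        self.base = 0                       # start of the current window
        self.ops = 0

    def scratch_columns(self) -> list[int]:
        """Physical columns currently serving as XOR scratch space."""
        return [(self.base + i) % self.total_cols
                for i in range(self.scratch_cols)]

    def record_op(self) -> None:
        """Count one XOR; slide the window after each period so writes
        spread evenly over the whole array."""
        self.ops += 1
        if self.ops % self.rotate_period == 0:
            self.base = (self.base + self.scratch_cols) % self.total_cols
```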
G. Effectiveness of Adaptive Configuration

To evaluate the effectiveness of our adaptive resource configuration scheme in ReHDC for various datasets, we implement a fixed HDC architecture in which the numbers of PEs assigned to the encoding phase and the comparison phase are fixed (10:1). Using this fixed architecture as a baseline, we can evaluate the efficiency and effectiveness of ReHDC when it is adapted to different datasets. As shown in Fig. 14, ReHDC's adaptive architecture achieves 2.34× to 7.56× energy-delay product (EDP) reduction compared with the fixed architecture for the four datasets. The fixed architecture cannot fully utilize the ReRAM arrays for different datasets, resulting in a large amount of resource waste. In contrast, empowered by the design of bidirectional APIM arrays, our adaptive architecture can dynamically allocate unified PE resources to the encoding and comparison phases, and thus significantly improves the resource utilization of PEs.

H. Inference Accuracy

We evaluate the end-to-end inference accuracy of our PIM architecture for different datasets, and find that the accuracy is comparable to that of a full-precision GPU platform, with less than 1% deviation. The root causes are threefold. First, our ReRAM-based PIM accelerator uses single-level cells (SLCs) to store only one bit per cell, and thus can provide relatively high precision compared with multilevel ReRAM cells (MLCs). Second, DPIM arrays perform calculations in the digital domain, and thus can guarantee high precision. Although APIM arrays perform calculations in the analog domain, their output results are compared with a given threshold for classification, and thus can tolerate a reasonable accuracy loss of calculations. Third, the HDC paradigm and its applications generally show higher robustness than traditional DNNs due to high-dimensional features. For datasets with 10K dimensions, HDC can even tolerate about 20% of data errors while still offering inference accuracy similar to that of error-free cases [4], [8].
V. RELATED WORK

HDC: HDC is a novel computing paradigm inspired by the human brain. It has been explored for a variety of classification applications, such as text classification [6], image classification [8], audio recognition [38], sentiment recognition [10], and biometric recognition [9]. HDC has demonstrated the potential for performance improvement of various classification applications. Rahimi et al. [39] proposed an HDC-based algorithm for hand gesture recognition. It achieves high accuracy with a smaller scale of training datasets. Jain et al. [40] proposed an efficient binarized network for text classification by fully exploiting the potential of binary representations. To improve the performance and accuracy of HDC, some HDC algorithms are tailored specifically for large-scale data [21], [41]. However, there is still vast room for improvement and optimization of HDC computing models and algorithms.

PIM: PIM has emerged as a promising computing paradigm to solve the memory-wall problem in traditional von Neumann architectures [42]. Many studies have demonstrated that NVM-based PIM architectures can significantly improve the performance of various applications, including neural networks [43], graph computing [44], and clustering [14]. These studies leverage the unique physical features of NVMs to achieve high-performance in-situ computing. ISAAC [13] is a typical DNN accelerator based on ReRAM. ISAAC accelerates MVM operations in DNNs via ReRAM-based PIM accelerators. GraphR [15] is a graph computing accelerator that employs a ReRAM-based PIM architecture to accelerate MVM operations involving sparse matrices. LerGAN [45] exploits a 3D-stacked ReRAM-based PIM architecture to accelerate generative adversarial networks (GANs). It can efficiently process complex data streams and significantly improve the performance of GANs. These studies all explore a traditional APIM architecture to accelerate MVM operations. Despite its high performance for MVM operations, it cannot efficiently perform other operations like addition and XOR due to its limited computing paradigm. ReHDC addresses this limitation.
[24] A. Mitrokhin, P. Sutor, C. Fermüller, and Y. Aloimonos, "Learning sensorimotor control with neuromorphic sensors: Toward hyperdimensional active perception," Sci. Robot., vol. 4, no. 30, 2019, Art. no. eaaw6736.
[25] H. Akinaga and H. Shima, "Resistive random access memory (ReRAM) based on metal oxides," Proc. IEEE, vol. 98, no. 12, pp. 2237–2251, Dec. 2010.
[26] H. Li, H. Jin, L. Zheng, Y. Huang, and X. Liao, "ReCSA: A dedicated sort accelerator using ReRAM-based content addressable memory," Front. Comput. Sci., vol. 17, no. 2, 2023, Art. no. 172103.
[27] H. Li, H. Jin, L. Zheng, and X. Liao, "ReSQM: Accelerating database operations using ReRAM-based content addressable memory," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 39, no. 11, pp. 4030–4041, Nov. 2020.
[28] P.-Y. Chen, X. Peng, and S. Yu, "NeuroSim: A circuit-level macro model for benchmarking neuro-inspired architectures in online learning," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 37, no. 12, pp. 3067–3080, Dec. 2018.
[29] S. Zhang, B. Guo, A. Dong, J. He, Z. Xu, and S. X. Chen, "Cautionary tales on air-quality improvement in Beijing," Proc. R. Soc. A, Math., Phys. Eng. Sci., vol. 473, no. 2205, 2017, Art. no. 20170457.
[30] S. Li et al., "RC-NVM: Dual-addressing non-volatile memory architecture supporting both row and column memory accesses," IEEE Trans. Comput., vol. 68, no. 2, pp. 239–254, Feb. 2019.
[31] H. Liu, J. Xu, X. Liao, H. Jin, Y. Zhang, and F. Mao, "A simulation framework for memristor-based heterogeneous computing architectures," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 41, no. 12, pp. 5476–5488, Dec. 2022.
[32] S. Kvatinsky, M. Ramadan, E. G. Friedman, and A. Kolodny, "VTEAM: A general model for voltage-controlled memristors," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 62, no. 8, pp. 786–790, Aug. 2015.
[33] Y. Huang et al., "Accelerating graph convolutional networks using crossbar-based processing-in-memory architectures," in Proc. IEEE Int. Symp. High-Perform. Comput. Archit., 2022, pp. 1029–1042.
[34] N. Muralimanohar, R. Balasubramonian, and N. P. Jouppi, "CACTI 6.0: A tool to model large caches," in Proc. Int. Symp. Microarchit., 2009, pp. 1–25.
[35] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
[36] R. Cole and M. Fanty, ISOLET (Isolated Letter Speech Recognition), UCI Mach. Learn. Repository, Irvine, CA, USA, 1994, doi: 10.24432/C51G69.
[37] A. K. Debnath, R. L. Lopez de Compadre, G. Debnath, A. J. Shusterman, and C. Hansch, "Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds. Correlation with molecular orbital energies and hydrophobicity," J. Med. Chem., vol. 34, no. 2, pp. 786–797, 1991.
[38] M. Hersche, S. Sangalli, L. Benini, and A. Rahimi, "Evolvable hyperdimensional computing: Unsupervised regeneration of associative memory to recover faulty components," in Proc. 2nd IEEE Int. Conf. Artif. Intell. Circuits Syst., 2020, pp. 281–285.
[39] A. Rahimi, S. Benatti, P. Kanerva, L. Benini, and J. M. Rabaey, "Hyperdimensional biosignal processing: A case study for EMG-based hand gesture recognition," in Proc. IEEE Int. Conf. Reboot. Comput., 2016, pp. 1–8.
[40] H. Jain, A. Agarwal, K. Shridhar, and D. Kleyko, "End to end binarized neural networks for text classification," 2020, arXiv:2010.05223.
[41] Z. Zou, H. Alimohamadi, Y. Kim, M. H. Najafi, N. Srinivasa, and M. Imani, "EventHD: Robust and efficient hyperdimensional learning with neuromorphic sensor," Front. Neurosci., vol. 16, p. 1147, Jul. 2022.
[42] C. Liu et al., "ReGNN: A ReRAM-based heterogeneous architecture for general graph neural networks," in Proc. 59th ACM/IEEE Design Autom. Conf., 2022, pp. 469–474.
[43] P. Chi et al., "PRIME: A novel processing-in-memory architecture for neural network computation in ReRAM-based main memory," ACM SIGARCH Comput. Archit. News, vol. 44, no. 3, pp. 27–39, 2016.
[44] Y. Huang, L. Zheng, X. Liao, H. Jin, P. Yao, and C. Gui, "RAGra: Leveraging monolithic 3D ReRAM for massively-parallel graph processing," in Proc. Design, Autom. Test Europe Conf. Exhibit., 2019, pp. 1273–1276.
[45] H. Mao, M. Song, T. Li, Y. Dai, and J. Shu, "LerGAN: A zero-free, low data movement and PIM-based GAN architecture," in Proc. 51st Annu. IEEE/ACM Int. Symp. Microarchit., 2018, pp. 669–681.

Cong Liu received the bachelor's degree from Dalian Maritime University, Dalian, China, in 2018. She is currently pursuing the Ph.D. degree with the Huazhong University of Science and Technology, Wuhan, China. Her research interests include in-memory computing and ReRAM-based accelerators.

Kaibo Wu received the bachelor's degree from Nanchang University, Nanchang, China, in 2021. He is currently pursuing the master's degree with the Huazhong University of Science and Technology, Wuhan, China. His research interests include non-volatile memory and in-memory computing.

Haikun Liu (Member, IEEE) received the Ph.D. degree in computer science and technology from the Huazhong University of Science and Technology, Wuhan, China, in 2012. He is currently a Professor with the School of Computer Science and Technology, Huazhong University of Science and Technology. He has co-authored more than 80 papers in most prestigious conferences and journals. His current research interests include in-memory computing, virtualization technologies, cloud computing, and distributed systems. Prof. Liu is a Senior Member of CCF.

Hai Jin (Fellow, IEEE) received the Ph.D. degree in computer engineering from the Huazhong University of Science and Technology (HUST), Wuhan, China, in 1994. He is a Chair Professor of Computer Science and Engineering with HUST. He was with The University of Hong Kong, Hong Kong, from 1998 to 2000, and a Visiting Scholar with the University of Southern California, Los Angeles, CA, USA, from 1999 to 2000. He has co-authored more than 20 books and published over 900 research papers. His research interests include computer architecture, parallel and distributed computing, big data processing, data storage, and system security. Dr. Jin was awarded the Excellent Youth Award from the National Science Foundation of China in 2001. In 1996, he was awarded the German Academic Exchange Service Fellowship to visit the Technical University of Chemnitz in Germany. He is a Fellow of CCF and a Life Member of the ACM.

Xiaofei Liao (Member, IEEE) received the Ph.D. degree in computer science and engineering from the Huazhong University of Science and Technology (HUST), Wuhan, China, in 2005. He is currently a Professor with the School of Computer Science and Technology, HUST. His research interests include computer architecture, system software, and big data processing. Prof. Liao was the recipient of the Excellent Youth Award from the National Science Foundation of China in 2018 and the CCF-IEEE CS Young Computer Scientist Award in 2017. He is a member of the IEEE Computer Society.

Zhuohui Duan received the Ph.D. degree in computer science and technology from the Huazhong University of Science and Technology (HUST), Wuhan, China, in 2022. He is currently a Postdoctoral Research Fellow with the School of Computer Science and Technology, HUST. His research interests mainly include hybrid memory, distributed memory pools, and disaggregated memory.

Huize Li received the Ph.D. degree from the School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China, in 2022. His current research interests include computer architecture, emerging non-volatile memory, and processing in memory.

Yu Zhang received the Ph.D. degree in computer science from the Huazhong University of Science and Technology (HUST), Wuhan, China, in 2016. He is currently a Professor with the School of Computer Science and Technology, HUST. His research interests include big data processing, graph computing, and distributed systems. His current topic mainly focuses on application-driven big data processing and optimizations.

Jing Yang received the Ph.D. degree in computer science and technology from the Huazhong University of Science and Technology, Wuhan, China, in 2022. She is currently an Associate Researcher with the School of Computer Science and Technology, Hainan University, Haikou, China. Her research interests include computer architecture, edge intelligence, and hyperdimensional computing.