
A ReRAM-Based Processing-In-Memory
Architecture for Hyperdimensional Computing
Cong Liu, Kaibo Wu, Haikun Liu, Member, IEEE, Hai Jin, Fellow, IEEE, Xiaofei Liao, Member, IEEE,
Zhuohui Duan, Jiahong Xu, Huize Li, Yu Zhang, and Jing Yang

Abstract—Hyperdimensional computing (HDC) is a human brain-inspired computing paradigm that processes neural activity patterns with high-dimensional vectors. Existing HDC accelerators usually utilize different hardware architectures to process the encoding and comparison phases of HDC applications separately. They are unable to adapt to dynamic workloads for various datasets, resulting in resource underutilization. In this article, we propose a resistive random access memory (ReRAM)-based HDC accelerator called ReHDC for general HDC. We abstract the computing paradigms in the encoding and comparison phases, and provide uniform primitive operators to efficiently process these two phases with the same hardware architecture. In the unified processing engine, ReHDC utilizes analog crossbar arrays to accelerate accumulation operations, and digital crossbar arrays to speed up high-dimensional element-wise operations (XOR). Experimental results show that ReHDC can accelerate HDC training by 69.4× and 1.93×, and can also improve energy efficiency by 51.5× and 2.2×, compared with an NVIDIA Tesla P100 GPU and the ReRAM-based HDC accelerator DUAL, respectively. Moreover, the performance speedup and energy efficiency for HDC inference are similar to those of HDC training.

Index Terms—Element-wise operations, hyperdimensional computing (HDC), processing-in-memory (PIM) architecture, resistive random access memory (ReRAM).

Manuscript received 27 August 2023; revised 29 December 2023 and 16 February 2024; accepted 27 February 2024. Date of publication 19 August 2024; date of current version 22 January 2025. This work was supported in part by the National Key Research and Development Program of China under Grant 2022YFB4500303; in part by the National Natural Science Foundation of China (NSFC) under Grant 62332011; and in part by Huawei Technologies Co., Ltd under Grant YBN2021035018A7. This article was recommended by Associate Editor A. Gamatie. (Corresponding author: Haikun Liu.)

Cong Liu, Kaibo Wu, Haikun Liu, Hai Jin, Xiaofei Liao, Zhuohui Duan, Jiahong Xu, Huize Li, and Yu Zhang are with the National Engineering Research Center for Big Data Technology and System, Services Computing Technology and System Lab, Cluster and Grid Computing Lab, School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China.

Jing Yang is with the School of Computer Science and Technology, Hainan University, Haikou 570288, China.

Digital Object Identifier 10.1109/TCAD.2024.3445812

I. INTRODUCTION

In recent years, deep neural networks (DNNs) have demonstrated high accuracy in complex classification tasks [1], [2]. However, conventional machine learning models running on traditional computing architectures usually incur significant computing and energy resource overheads.

Hyperdimensional computing (HDC) has emerged as a new learning paradigm inspired by the neural activities of human brains [3]. It offers several advantages over DNNs [4] because of its memory-centric and one-trial learning capability. For example, HDC delivers low latency and high energy efficiency because backpropagation and sparse coding are unnecessary in HDC processing. Furthermore, HDC achieves higher robustness than traditional DNNs [5]. These advantages make HDC a promising solution for various recognition tasks, including text classification [6], behavior signal classification [7], image classification [8], DNA sequencing [9], and emotion recognition [10].

The computing paradigm of HDC is mainly dominated by bitwise operations. Similar to DNNs, HDC also requires a large amount of data to train an HDC model before inference. All input data, such as images and texts, should be converted into binary sample hypervectors (HVs) through element-wise binding and bundling operations [8]. This phase is also called data encoding. Besides the encoding phase, HDC training also contains an aggregation phase in which the different features of all encoded sample HVs are accumulated into a single class HV for each class of input data, while HDC inference contains a comparison phase in which the similarity between the query HV and each class HV is measured by the distance between the two HVs. Thus, both HDC training and inference involve a large number of operations on HVs.

Traditional von Neumann architectures usually suffer from low computing efficiency and high energy consumption for HDC applications due to massive data movement between main memory and processing units. Existing ASIC processors [3], [5] often cannot satisfy the large memory and computing resource requirements of large-scale HDC applications. Resistive random access memory (ReRAM)-based processing-in-memory (PIM) accelerators achieve in-situ computing with high data parallelism [11], [12]. They have demonstrated significant performance and energy efficiency in matrix-vector multiplication (MVM) operations [13] and bitwise operations on binary vectors [14].

Generally, there are two types of ReRAM-based PIM architectures: 1) analog PIM (APIM) and 2) digital PIM (DPIM). APIM utilizes ReRAM crossbar arrays to perform MVM operations efficiently in only one step, offering a promising approach for DNN inference applications in which the weight matrices of different DNN layers are fixed.
For example, ISAAC [13] and GraphR [15] exploit APIM to efficiently perform MVM operations in neural networks and graph computing applications, significantly reducing data processing latency and energy consumption. However, although APIM achieves significant performance speedup in MVM operations, it is unsuitable for other mathematical operations, such as addition, multiplication, and bitwise operations. In contrast, DPIM can exploit ReRAM cells to perform logical NOR operations [16] in the digital computing paradigm. Since all arithmetic operations can be converted into a series of NOR operations, DPIM is able to perform a wide range of arithmetic operations [11]. FloatPIM [17] leverages DPIM to enhance precision and execute mathematical operations. DUAL [14] fully exploits the high parallelism of DPIM to efficiently perform cluster analysis. DPIM also demonstrates high parallelism for bitwise operations on HVs, delivering significant performance speedup for data-intensive applications.

Previous ReRAM-based HDC accelerators [8], [18] usually exploit customized hardware for the encoding and comparison phases separately, and often cause low utilization of the limited ReRAM resource because the customized hardware cannot satisfy the unbalanced resource requirements of these two phases for different datasets. Fortunately, we find that most operations in the encoding and comparison phases of HDC can be abstracted as element-wise XOR and accumulation operations. Thus, we can design a unified ReRAM-based PIM architecture to accelerate these two computation-intensive phases of HDC using the same processing engines (PEs). To this end, a single PE is comprised of multiple APIMs and DPIMs for MVM operations and HV bitwise operations, respectively.

There are several challenges in designing a unified ReRAM-based PIM architecture for HDC. First, most ReRAM-based APIM accelerators [11] only support a special computing paradigm, i.e., MVM operations. It is not easy to support bitwise operations on vectors by adding extra peripheral circuits. Second, although the computation-intensive operations in the encoding and comparison phases can be converted into element-wise XOR and accumulation operations, it is challenging to directly process these two phases with the same crossbar arrays. For instance, the encoding phase requires horizontal accumulation of XOR results for all rows, while the comparison phase requires vertical accumulation of XOR results. Third, data movement of intermediate results among crossbar arrays causes nontrivial time overhead due to ReRAM write operations. In this article, we design a heterogeneous ReRAM-based PIM architecture to tackle these challenges. The major contributions of this article are summarized as follows.

1) We analyze the computing paradigm of the encoding and comparison phases in HDC, and abstract different operations with uniform primitive operators that can be efficiently accelerated by ReRAM-based PIM architectures.

2) We design a heterogeneous ReRAM-based PIM architecture called ReHDC to process the encoding and comparison phases of HDC uniformly. In the unified PE of ReHDC, we employ APIM modules to process accumulation operations with array-level parallelism, and exploit DPIM modules to process element-wise XOR operations with row-level parallelism.

3) We design dual access paths for APIM arrays to achieve both horizontal and vertical accumulation operations in the encoding phase and the comparison phase, respectively. Moreover, to further improve the efficiency of ReHDC, we design a fine-grained pipeline to overlap data transfer and data processing, and thus hide the ReRAM write latency as much as possible.

4) We evaluate the efficiency of ReHDC using four datasets. Experimental results show that ReHDC can accelerate HDC training by 69.4× and 1.93×, and can also improve energy efficiency by 51.5× and 2.2×, compared with an NVIDIA Tesla P100 GPU and the ReRAM-based HDC accelerator DUAL [14], respectively. Furthermore, the performance speedup and energy efficiency achieved in HDC inference are comparable to those observed in HDC training.

The remainder of this article is organized as follows. Section II first introduces the background of HDC and two kinds of ReRAM-based PIM architectures, and then presents the motivation of this article. Section III presents the details of the ReHDC architecture. Section IV describes the experimental methodologies and results. Section V studies the related work. We conclude in Section VI.

II. BACKGROUND AND MOTIVATION

In this section, we first present the necessary background of HDC and ReRAM-based PIM architectures, and then describe the motivation of this work.

A. Hyperdimensional Computing

HDC is a novel computing model inspired by the way that neurons work in human brains [19]. It has been applied in various areas, such as image recognition [4] and text classification [20]. The basic data structure in HDC is the high-dimensional vector (i.e., HV), whose dimension may exceed 10 000. HVs can represent different features of real items in various real-world recognition tasks, such as language characters and bio-signals. An HV is generated randomly and holographically with independent and identically distributed (i.i.d.) elements which are binary (0/1) or bipolar (+1/−1) codes [21]. Since no element of an HV stores more information than the others, the robustness of HDC is improved. When the dimension of HVs is rather high (e.g., D = 10 000), any two random HVs are almost orthogonal.

There are several typical operations in HDC [22], such as binding operations that bind two HVs to generate a new HV, and bundling operations that bundle many HVs together into a new HV. The binding operation retains one half of the information from the input HVs, while the bundling operation generates new HVs that are orthogonal to the input HVs [23]. These operations sustain the dimension of HVs, and thus keep the computing paradigm of HDC stable. In the following, we describe the two key phases of HDC inference.

Fig. 1. HDC process for classification tasks.

1) Encoding Phase: As shown in Fig. 1, the first step of HDC is to map the real-world data into a hyperdimensional space to generate HVs. This operation is called the encoding phase. Several encoding approaches have been proposed in previous studies [5], [24]. Generally, they linearly sum up all HVs associated with each feature. A feature is composed of a position HV and a level HV. Position and level HVs are generated in the same way for both training and inference.

In the encoding phase, an n-dimensional feature vector is converted into n D-dimensional HVs, where D represents the number of dimensions and each element in the HV is a binary number. The encoder looks for the minimum and maximum feature values and linearly quantizes this range into m levels. At the same time, it generates a D-dimensional random binary HV for each level, i.e., {L1, ..., Lm}. Similarly, the n feature indexes are encoded into n random D-dimensional binary HVs, i.e., {P1, ..., Pn}. These position HVs are generated randomly, and any two position HVs are orthogonal to each other. In contrast, there are correlations between level HVs: if two feature values are closer, the similarity between their level HVs is higher.
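The article does not specify how the correlated level HVs are constructed. A common construction in the HDC literature, which we assume here purely for illustration (the function name and parameters are ours), starts from a random HV for the first level and flips a fresh random subset of bits at each step, so that neighboring levels remain similar while distant levels become nearly orthogonal:

```python
import numpy as np

def make_base_hvs(n_features, m_levels, D=10_000, seed=0):
    """Illustrative generation of position and level HVs (our assumption,
    not a scheme prescribed by this article)."""
    rng = np.random.default_rng(seed)
    # Position HVs: i.i.d. random bits, so any two are nearly orthogonal.
    pos_hvs = rng.integers(0, 2, size=(n_features, D), dtype=np.uint8)
    # Level HVs: flip D/(2*(m-1)) random bits per step, so closer levels
    # share more bits, i.e., closer feature values get more similar HVs.
    lvl_hvs = np.empty((m_levels, D), dtype=np.uint8)
    lvl_hvs[0] = rng.integers(0, 2, size=D, dtype=np.uint8)
    flips = D // (2 * (m_levels - 1))
    for lvl in range(1, m_levels):
        lvl_hvs[lvl] = lvl_hvs[lvl - 1]
        lvl_hvs[lvl, rng.choice(D, size=flips, replace=False)] ^= 1
    return pos_hvs, lvl_hvs
```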
Assume $\vec{P}_m$ and $\vec{L}_m$ represent the position HV and the level HV of the mth feature, respectively. The mth feature HV can be defined as follows:

$$\vec{F}_m = \vec{P}_m \oplus \vec{L}_m \qquad (1)$$

where ⊕ is a bitwise binding (i.e., XOR) operation between two HVs. Assume there are a total of n features in a sample. HDC combines all feature HVs to generate the sample HV, as shown in

$$\vec{H} = \sum_{m=1}^{n} \vec{F}_m. \qquad (2)$$

2) Comparison Phase: During HDC inference, the queried data are converted into a query HV through the encoding phase. Then, the similarity between the query HV and all class HVs is calculated based on the Hamming distance. The queried data are classified into the class to which their Hamming distance is smallest. At last, the class label with the highest similarity is returned to the query request.

HDC training includes an encoding phase and an accumulation phase. Each training sample is encoded into a sample HV in the encoding phase. Then, all sample HVs that belong to the same class are accumulated to characterize the general attributes of that class in the training dataset. Assuming there are a total of s samples in the lth class, the lth class HV can be represented by (3). In the end, all class HVs are stored in the associative memory for comparison

$$\vec{C}_l = \sum_{i=1}^{s} \vec{H}_i. \qquad (3)$$

Fig. 2. Two kinds of PIM architecture. (a) ReRAM-based APIM. (b) ReRAM-based DPIM.

B. ReRAM-Based PIM Accelerators

ReRAM is a typical nonvolatile memory (NVM) that operates by changing the resistance state of ReRAM cells [25]. The metal-insulator-metal (MIM) structure of an ReRAM cell comprises an upper electrode, a lower electrode, and a metal oxide layer sandwiched between them. By applying an external voltage, an ReRAM cell can achieve state transitions between a high resistance state (HRS) and a low resistance state (LRS), which correspond to logic "0" and "1," respectively. A set operation changes HRS (0) to LRS (1). To set an ReRAM cell, a positive voltage should be applied to generate sufficient write current. Since an ReRAM cell can sustain up to 10¹² writes [15], its endurance issue is less significant compared to other NVMs.

ReRAM-based PIM accelerators enable in-situ computing, which eliminates extensive data movement between processors and main memory, and thus achieve a significant reduction in energy consumption and execution time [26]. Therefore, ReRAM-based PIM accelerators are extensively explored to enhance the performance of data-centric applications, including DNNs [13], graph processing [15], and database operations [27]. Generally, PIM architectures can be categorized into APIM and DPIM based on their analog or digital computing paradigms, respectively.

1) ReRAM-Based APIM Architecture: Generally, most APIM accelerators are designed to accelerate MVM operations [28]. As shown in Fig. 2(a), APIM performs MVM operations through the ReRAM crossbar array. At first, digital signals are transformed into analog signals via digital-to-analog converters (DACs). After applying a voltage to the wordline of ReRAM cells, a corresponding current is generated on the bitline based on Ohm's law. The currents on the same bitline are then accumulated according to Kirchhoff's law. This enables APIM to perform array-level MVM operations in a single cycle. The analog signal is subsequently transferred for further calculations. Similarly, in the output circuit, analog-to-digital converters (ADCs) are employed to convert the analog results into digital values. This fixed computing paradigm of APIM is only applicable to a few applications that contain a large proportion of MVM operations.

2) ReRAM-Based DPIM Architecture: A DPIM is able to perform in-situ NOR operations without additional peripheral circuits [17]. Fig. 2(b) demonstrates the design of the DPIM array. For an ReRAM cell, there is a significant difference between the HRS and the LRS. The initial output ReRAM cell should be set to the LRS. According to Kirchhoff's law, in a parallel circuit, the overall resistance of the input ReRAM cells is lower than that of each individual input ReRAM cell. Thus, when an input ReRAM cell is in an LRS, the current flowing out of the circuit increases, allowing the output ReRAM cell to change from an LRS to an HRS.

The circuit characteristic described above can be exploited to implement a logical NOR operation (i.e., $A\,\mathrm{NOR}\,B = \overline{A + B}$), whose truth table is: 0 NOR 0 = 1, 1 NOR 1 = 0, and 0 NOR 1 = 0. Thus, the DPIM naturally supports NOR operations. Furthermore, the DPIM array can also support row- or column-level parallelism, and thus has higher elasticity than the APIM. As most arithmetic operations can be converted into a series of NOR operations, the DPIM is able to flexibly execute arithmetic and logical operations. For example, the 1-bit XOR logic operation can be implemented with six NOR operations based on De Morgan's law. Assume A and B represent the two input operands, and $\bar{A}$ represents the NOT operation on A. The XOR operation A ⊕ B can be represented by (4). Fig. 3 illustrates the six steps of the XOR operation in a DPIM

$$A \oplus B = \bar{A} \cdot B + A \cdot \bar{B} = \overline{A + \bar{B}} + \overline{\bar{A} + B} = \overline{\overline{\overline{A + \bar{B}} + \overline{\bar{A} + B}}}. \qquad (4)$$

Fig. 3. Processing of a XOR operation in a DPIM.

Fig. 4. Breakdown of operations in the encoding and comparison phases for different datasets.

C. Motivation

In this section, we discuss the design challenges of an ReRAM-based PIM architecture for HDC applications.

The Limitation of Computing Paradigms: Traditional APIM accelerators are designed for accelerating MVM operations. However, in HDC applications, other arithmetic and logical operations like bitwise XOR cannot be efficiently executed by APIM, posing a significant limitation for fully utilizing ReRAM-based HDC accelerators. In contrast, DPIM accelerators can perform NOR operations in parallel. Nevertheless, the DPIM is much slower than APIM when it performs MVM operations in HDC. Hence, the integration of APIM and DPIM is crucial for effectively accelerating HDC applications, which are comprised of both MVM and bitwise logical operations.

Disparities of ReRAM Resource Requirements Between the Encoding and Comparison Phases for Various Datasets: The most recent ReRAM-based HDC accelerator [4] allocates fixed-size ReRAM resources to processing units for the encoding and comparison phases, and thus cannot fairly satisfy the resource requirements of these two phases for different datasets. Fig. 4 shows the breakdown of operations in the encoding and comparison phases of HDC for four datasets. We can find that the percentage of operations used for the encoding phase and the comparison phase varies significantly across datasets. For instance, the comparison phase accounts for only 1% of total operations in the MNIST dataset, while there are 41% comparison operations in the BEIJING dataset [29]. The proportion of operations in the encoding and comparison phases is mainly determined by the number of features and classes in the dataset. A dedicated hardware architecture designed for these two phases cannot adapt to different datasets, usually resulting in low utilization of computing resources. This motivates us to develop a unified ReRAM-based HDC accelerator that can dynamically allocate PEs to the two phases based on their ratio of operations. In this way, our architecture can adapt to various datasets more flexibly, and improves the utilization of ReRAM resources and the performance of HDC applications.

Fixed Access Paths of APIM Arrays: APIM plays a crucial role in ReRAM-based HDC accelerators. During the encoding phase, APIM accumulates feature HVs to obtain the sample HV. During the comparison phase, APIM calculates the Hamming distance between two HVs. However, conventional APIM arrays only support bitline accumulation in one direction, which limits the utilization of APIM in an ReRAM-based HDC accelerator. To enable these two types of computations for HDC applications, it is essential to redesign the APIM architecture so that it can perform MVM operations in two directions.

Write Overhead Caused by Data Dependencies: During HDC processing, data dependencies exist in both the encoding and comparison phases. In addition, while multiple ReRAM arrays offer the potential for parallel computing, the large amount of data involved in HDC processing results in substantial write overhead during intermediate data transfers, leading to performance degradation. Therefore, it is essential to hide the write latency of ReRAM.

These challenges motivate us to design an ReRAM-based accelerator for general HDC processing. Particularly, it can be dynamically configured according to the characteristics of different datasets.
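Before presenting the architecture, it is worth fixing the computational skeleton that any HDC accelerator must implement. The following Python sketch restates Section II-A and (4) in software form; it is purely illustrative (all names are ours, and the binarization of class HVs is our assumption, since (3) leaves that step implicit):

```python
import numpy as np

def nor(a, b):
    """Bitwise NOR: the only primitive a DPIM array executes natively."""
    return 1 - (a | b)

def xor_via_nor(a, b):
    """XOR decomposed into six NOR operations, following (4)."""
    not_a = nor(a, a)        # A'
    not_b = nor(b, b)        # B'
    t1 = nor(a, not_b)       # (A + B')' = A'.B
    t2 = nor(not_a, b)       # (A' + B)' = A.B'
    t3 = nor(t1, t2)         # (A'B + AB')'
    return nor(t3, t3)       # final inversion yields A XOR B

def encode(levels, pos_hvs, lvl_hvs):
    """Encoding phase, (1)-(2): bind position/level HVs, bundle, binarize.
    'levels' holds the quantized level index of each of the n features."""
    n = len(levels)
    feature_hvs = pos_hvs ^ lvl_hvs[levels]        # binding, (1)
    counts = feature_hvs.sum(axis=0)               # bundling, (2)
    return (counts > n / 2).astype(np.uint8)       # majority threshold

def train(encoded_samples, labels, n_classes):
    """Aggregation phase, (3): accumulate sample HVs per class.
    Binarizing by per-class majority is our assumption."""
    sums = np.zeros((n_classes, encoded_samples.shape[1]))
    counts = np.zeros(n_classes)
    for hv, y in zip(encoded_samples, labels):
        sums[y] += hv
        counts[y] += 1
    return (sums > counts[:, None] / 2).astype(np.uint8)

def classify(query_hv, class_hvs):
    """Comparison phase: the smallest Hamming distance wins."""
    return int((class_hvs ^ query_hv).sum(axis=1).argmin())
```

ReHDC implements exactly these kernels spatially: the XOR lines map to DPIM arrays, and the two summations map to current accumulation in APIM crossbars, as Section III details.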
III. ReHDC ARCHITECTURE

In this section, we first provide an overview of the ReHDC architecture, and then elaborate the details of the data processing and mapping schemes for the HDC encoding and comparison phases.

Fig. 5. Overview of ReHDC architecture.

A. Overview of ReHDC

Fig. 5 demonstrates the architecture of ReHDC. As shown in Fig. 5(a), ReHDC is comprised of multiple tiles that are interconnected through a global bus. Each tile is composed of a controller (CTRL), several PEs, and two buffers. The input buffer is responsible for caching data that is written to the DPIM, while the output buffer stores the results of the PE. The controller generates the control signals for each component. The sALU is responsible for computing the sum of the bitlines in multiple arrays and performing other simple operations.

In each PE, the DPIM crossbar arrays are utilized to perform element-wise XOR logic operations, while the APIM crossbar arrays are employed for array-level accumulation operations. As shown in Fig. 5(b), each wordline in the APIM is connected to a DAC, which converts the input data into a vector of voltages and applies it to each bitline of the crossbar array. In turn, the ADC converts the output current into digital signals for further processing, and is shared by multiple bitlines of the crossbar array. To capture the intermediate data of PEs, each PE contains both an input register (IR) and an output register (OR) to latch the data. Each PE also contains multiple comparators (CMPs). As shown in Fig. 5(c), some CMPs are arranged as a CMP tree to determine the minimum Hamming distance between the sample HV and each class HV, and the others are employed to binarize the output results of the APIM in the encoding phase based on a predetermined threshold. To improve data parallelism for DPIMs, multiple cascaded DPIM arrays are exploited to support row-level parallel data transfer and element-wise XOR operations, as shown in Fig. 5(d).

B. Data Processing in the Encoding Phase

The encoding phase of HDC mainly consists of two operations: 1) binding and 2) bundling. Binding is employed to handle the element-wise XOR operations between the level HV and position HV of each feature. The bundling operation is used to accumulate the multiple feature HVs of a sample.

We choose DPIM arrays to perform element-wise XOR operations with high parallelism, and employ a judicious HV-to-DPIM mapping scheme to leverage row parallelism. Given that an HV in HDC usually contains thousands of dimensions, element-wise XOR using DPIM can significantly exploit data parallelism and enhance the efficiency of the encoding phase in HDC. To perform multiple XOR operations in parallel, both of their operands should be placed in the same row of a crossbar, and the output results are stored in the same row of the crossbar as well.

Fig. 6. Data mapping in the encoding phase.

Fig. 6 shows the data mapping of the encoding phase. We partition the DPIM array into three regions, which are used for storing the preprocessed position HVs, the level HVs, and the output HVs. The sizes of the crossbar arrays assigned for storing input position HVs and level HVs are identical. We map the preprocessed input HVs to columns of DPIMs and then perform the element-wise XOR operations. The generated feature HVs are then stored in the output region. Multiple DPIMs can perform the XOR operations concurrently to maximize data parallelism. In this way, the computation latency can be significantly reduced.

When the XOR operations are completed in the DPIMs, we use APIMs to perform the accumulation operations, since APIMs exhibit higher performance than DPIMs for MVM operations. At first, the feature HVs are written into the APIM array row by row using the peripheral circuits (DACs). Then, we apply an all-one vector (111···111) to the APIM to perform the MVM operation, which actually accumulates the elements of all feature HVs in the vertical direction. The output result of the MVM operation is converted into digital values (a sample HV) through the ADCs.
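Functionally, this accumulate-and-binarize step is a single matrix-vector product with an all-one input vector. The toy model below (our notation, not a ReHDC interface) treats the APIM array as a matrix whose rows hold the feature HVs; one MVM step with the all-one vector yields the per-dimension counts, which the CMPs binarize against the threshold of n/2 explained later in this subsection:

```python
import numpy as np

def apim_vertical_accumulate(feature_hvs):
    """Encoding-phase APIM model: feature HVs are the matrix rows, and
    an all-one input vector accumulates every column in one MVM step."""
    n, D = feature_hvs.shape
    all_one = np.ones(n)                       # the 111...111 input vector
    counts = all_one @ feature_hvs             # one array-level MVM
    return (counts > n / 2).astype(np.uint8)   # CMP thresholding at n/2

# Example: n = 1000 features of dimension D = 10 000 -> threshold 500.
rng = np.random.default_rng(1)
sample_hv = apim_vertical_accumulate(
    rng.integers(0, 2, size=(1000, 10_000), dtype=np.uint8))
```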
Although APIMs can perform MVM operations efficiently, the data mapping (writes) to APIMs is costly and becomes a performance bottleneck. To address this problem, we design a fine-grained pipeline to hide the majority of the write latency, as shown in Fig. 7. When a DPIM completes a binding operation for a feature, the generated feature HV is stored in columns of the DPIM, as shown in Fig. 6. We write these XOR results to the APIM in rows for the following calculations. At the same time, the DPIM starts the binding operation for the next feature. In ReHDC, two cycles (i.e., Set and Reset) are required to write a sample HV to an APIM row, while six cycles are needed to execute an XOR operation in the DPIM. Since these XOR and write operations can be performed concurrently on DPIMs and APIMs, respectively, the fine-grained pipeline can hide the majority of the write latency of APIM rows by overlapping it with the XOR operations on DPIMs. In this way, after the DPIM performs an XOR, only one extra cycle is required to write one feature HV or one result HV to an APIM row. Once the result HVs are written to the APIM, we apply an all-one vector to the APIM to calculate the sample HV by accumulating all feature HVs of the same sample.

Fig. 7. Fine-grained pipeline in HDC inference.

The value of each element in the sample HV ranges from 0 to n. We exploit CMPs to convert the sample HV into a binary sample HV efficiently. CMPs are designed to compare each element in the sample HV with a threshold value of n/2. If the value of an element is greater than the threshold, it is set to 1; otherwise, it is set to 0. For example, when n is 1000, the threshold is set to 500. CMPs operate on all elements of the sample HV to convert it into a binary HV.

C. Data Processing in the Comparison Phase

During HDC inference, the comparison phase is performed to calculate the similarity between the testing data and each class HV, and returns the class label with the highest similarity. In ReHDC, the Hamming distance is employed to check the similarity between the query HV and all class HVs. The calculation of the Hamming distance can be decomposed into the same operators as the encoding phase, namely element-wise XOR and accumulation. To this end, we design a unified architecture for both the encoding and comparison phases in HDC. It can dynamically configure the number of PEs for the encoding and comparison phases based on the proportion of operations performed on different datasets.

DPIMs are used to execute the element-wise XOR operations between the query HV and each class HV. All class HVs are written to DPIMs during HDC training, while the query HV is written to DPIMs during HDC inference. Fig. 8 illustrates the data mapping strategy in the comparison phase. The data parallelism corresponds to the dimension (D) of the HVs. To obtain the Hamming distance, APIMs are also exploited to accumulate all XOR results generated by the DPIMs.

Fig. 8. Data mapping in the comparison phase.

The DPIM array is mainly partitioned into three regions, which are used to store the HVs of each class, the query HVs, and the results of the XOR operations, respectively. We note that during the inference phase, a class HV is obtained by summing up all sample HVs belonging to the same class. Assuming there are m classes with class HVs (C1, C2, ..., Cm), ReHDC performs element-wise XOR operations on n query HVs and these m class HVs. In this way, a total of n×m result HVs (R11, R12, ..., Rnm) are obtained and written into the APIM array. Then, ReHDC applies an all-one vector to the APIM array through the peripheral circuits to perform the accumulation operation in the horizontal direction, i.e., to count the total number of ones in each result HV. The final results are the Hamming distances between each query HV and all class HVs. A smaller Hamming distance implies higher similarity. To figure out the minimum Hamming distance efficiently, we use a CMP tree to compare these Hamming distances. The class with the minimum Hamming distance is deemed the classification result.
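In software terms, the comparison flow above reduces to a popcount per result HV followed by an argmin. The functional sketch below (names are ours) mirrors the hardware mapping: DPIMs produce the result HVs, one horizontal all-one MVM yields the Hamming distances, and the CMP tree selects the minimum:

```python
import numpy as np

def compare(query_hvs, class_hvs):
    """Comparison-phase model: XOR in DPIMs, horizontal accumulation in
    a bidirectional APIM, and minimum selection in the CMP tree."""
    # Result HVs R_ij = Q_i XOR C_j, computed row-parallel in DPIMs.
    results = query_hvs[:, None, :] ^ class_hvs[None, :, :]
    # Applying the all-one vector along rows counts the ones per result
    # HV, i.e., the Hamming distance between each query and class HV.
    dists = results.sum(axis=2)
    # CMP tree: the class with the smallest distance is the most similar.
    return dists.argmin(axis=1)
```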
D. Bidirectional Accumulation

During the encoding phase, the APIM should accumulate the elements of all feature HVs in the vertical direction, i.e., all elements of a column in the matrix are accumulated. During the comparison phase, the APIM should accumulate values in the horizontal direction, i.e., all elements of a row in the matrix are accumulated. However, traditional APIM arrays only support bitline accumulation in one direction, which limits their usage in a unified ReRAM-based HDC accelerator, resulting in low utilization of the APIM arrays.

To address this problem, we reorganize the ReRAM arrays to support bidirectional accumulation, as shown in Fig. 5(b). In this architecture, some peripheral circuits are added to enable bidirectional accumulation, without any modification to the ReRAM array itself [30]. Each wordline or bitline in the ReRAM array is connected to a sense amplifier, a write driver, an MUX, and an address decoder. Thus, the input vector (voltage) can be applied to the APIM array in the vertical or horizontal direction. As each APIM array only performs multiply-accumulate operations in a single direction at any time, two neighboring APIM arrays can share sense amplifiers and write drivers in both the vertical and horizontal directions, as shown in Fig. 5(b). To this end, the bidirectional design introduces acceptable area overhead.

Since the bidirectional APIM array is able to perform accumulation operations in both the vertical and horizontal directions, it can be used by both the encoding and comparison phases of HDC. Thus, the proposed bidirectional APIM empowers the design of unified PEs and the adaptive configuration of PEs (Section III-E). As a result, the resource utilization can be significantly improved by dynamically allocating PEs to the encoding and comparison phases according to the requirements of different datasets.

E. Adaptive Configuration

The HDC model consists of two phases: 1) encoding and 2) comparison. HDC training is dominated by the encoding phase, while HDC inference involves both the encoding and comparison phases. Thus, it is essential to improve the performance of both the encoding and comparison phases in HDC. However, a conventional ReRAM-based HDC accelerator usually allocates a fixed proportion of computing resources to the encoding and comparison engines. Yet, the demand for computing resources between the encoding and comparison phases in HDC may vary significantly for different datasets. Thus, static computing resource allocation for different datasets usually results in a waste of computing resources.

To address this issue, we introduce an adaptive configuration scheme in our ReHDC architecture. Since the ReRAM-based HDC accelerator performs XOR and accumulation operations in both the encoding and comparison phases, ReHDC dynamically allocates the unified PEs to different datasets according to the data volume generated in the different phases. In this way, the utilization of computing resources is improved, eventually improving the performance of HDC.

During HDC training, the DPIMs and APIMs in all unified PEs are allocated to the encoding phase and the accumulation phase, so there is no resource contention for those DPIMs and APIMs. In contrast, during HDC inference, since there is DPIM resource contention between the encoding and comparison phases, we should carefully allocate these arrays to the two phases so that they can be executed in a pipeline. Thus, for a given dataset, the resource allocation should be done before HDC inference. The term dynamic allocation in our paper denotes that the array resources allocated to the encoding and comparison phases are dynamically changed for different datasets.

We propose a resource allocation model to determine the number of arrays required by these two phases. In the encoding phase, we assume all input samples are encoded into Q sample HVs, and each input sample has n features. As shown in Fig. 6, n position HVs and n level HVs are bound to generate n feature HVs, which are then accumulated to generate one sample HV. Thus, a single sample (or query) HV requires 3 × n columns in DPIM arrays, and all sample HVs require a total of 3 × n × Q columns. In the comparison phase, we assume m class HVs in the dataset should be compared with each query HV. As shown in Fig. 8, about q + m + qm columns are required in each DPIM array, where q denotes the number of samples that can be processed in each DPIM array. Thus, Q samples require a total of Q(q + m + qm)/q columns. At last, the number of PEs allocated to the encoding phase ($PE_{enc}$) and the comparison phase ($PE_{com}$) is determined according to the following model:

$$\frac{PE_{enc}}{PE_{com}} = \frac{3 \times n \times q}{q + m + q \times m}. \qquad (5)$$

F. Scalability of DPIM Crossbar Arrays

Due to the limitations of current technology, the array size of a DPIM is about 1000 × 1000. Thus, it is impossible to process an HDC classification task with only a single DPIM array. To reduce the total write latency of DPIM rows, we employ multiple cascaded DPIM arrays to support row-level parallel data transfer, as shown in Fig. 5(d). This significantly enhances the data parallelism of the ReHDC architecture and improves the performance of HDC processing. Assuming the HV dimension is D, D/1024 crossbar arrays are required to process all HV dimensions due to the limited size of a single crossbar array. However, since all DPIM arrays can process HVs in parallel, ReHDC still achieves rather high parallelism for HDC.
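Read as a recipe, (5) together with the D/1024 partitioning rule translates into a small sizing calculation. The helper below sketches how a runtime might split the PEs of a tile for a given dataset; it is our rendering of the model (the rounding and clamping are our assumptions), with q the per-array sample capacity defined above:

```python
import math

def configure(n_features, m_classes, q, n_pes, hv_dim=10_000):
    """Split PEs between encoding and comparison per (5), and size the
    cascaded DPIM group per the D/1024 rule (a sketch, not ReHDC code)."""
    ratio = (3 * n_features * q) / (q + m_classes + q * m_classes)  # (5)
    pe_enc = round(n_pes * ratio / (1 + ratio))
    pe_enc = min(max(pe_enc, 1), n_pes - 1)     # keep both phases alive
    pe_com = n_pes - pe_enc
    arrays_per_hv = math.ceil(hv_dim / 1024)    # cascaded 1024x1024 arrays
    return pe_enc, pe_com, arrays_per_hv

# Example: an MNIST-like dataset (784 features, 10 classes), 8 PEs/tile.
print(configure(n_features=784, m_classes=10, q=8, n_pes=8))  # (7, 1, 10)
```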
IV. EVALUATION

In this section, we first present the experimental setup, and then evaluate the performance and energy efficiency of ReHDC using four datasets. Finally, we evaluate the area, dimension, and endurance of the ReHDC architecture.

A. Experimental Setup

We simulate the functionality of ReHDC in MHSim [31], a simulation framework for ReRAM-based PIM architectures. ReHDC consists of two kinds of ReRAM arrays. We adopt the VTEAM [32] model as the configuration of the ReRAM crossbars. In this model, an ReRAM state transition only consumes 1 ns, and 1 V and 2 V voltage pulses are used for RESET and SET operations, respectively. The energy consumption of ReHDC is estimated to be 7 pJ per bit [33]. The parameters of our experimental setup are summarized in Table I. ReHDC is composed of 32 tiles, and each tile contains 8 PEs. Each PE includes 24 1024 × 1024 crossbars configured as DPIM arrays and 12 1024 × 1024 crossbars configured as APIM arrays. For each tile, we use CACTI 6.5 [34] at the 32 nm technology node to model the area and energy of all buffers and registers. Within each tile, the DPIM and APIM arrays dominate the area and energy consumption. In contrast, the peripheral circuits, such as ADCs and DACs, have a relatively small area and energy consumption. A single tile requires 2.368 W of power and occupies an area of 1.213 mm². The total area and average power consumption of ReHDC are 41 mm² and 76 W, respectively. We use nvidia-smi to measure the latency and energy consumption of the GPU platform.

TABLE I. ReHDC Configurations.

Datasets: We use four popular datasets, including MNIST [35], ISOLET [36], MUTAG [37], and BEIJING [29], to evaluate ReHDC. Table II summarizes the number of features, classes, and samples for training and testing in the four datasets. The MNIST dataset is used for image classification with a total of ten classes. Each sample in MNIST has 784 features. We use 60 000 and 10 000 samples for training and testing, respectively. Similarly, the ISOLET, BEIJING, and MUTAG datasets are used for speech recognition, weather prediction, and chemical classification, respectively.

TABLE II. Details of Classification Datasets.

Comparison: We compare ReHDC with a GPU-based HDC implementation on the NVIDIA Tesla P100 using the PyTorch framework, and with an HDC implementation using the ReRAM-based PIM accelerator DUAL [14]. DUAL is a digital-based unsupervised learning acceleration framework that supports a wide range of popular clustering algorithms. Instead of using raw data, DUAL maps all data into a high-dimensional space and replaces complex clustering operations with memory-friendly operations. In its encoding phase, DUAL calculates the sum of multiple HVs. To calculate the Hamming distance, DUAL uses the characteristics of the discharge current in circuits. Specifically, more mismatched bits imply a higher discharge current. By detecting the magnitude of the discharge current in the circuit, DUAL can determine the Hamming distance between the query vector and other vectors. Both DUAL and ReHDC contain encoding operations and Hamming distance calculations. Therefore, we implement the encoding operations and Hamming distance calculation of DUAL with ReRAM arrays and compare it with ReHDC. For a fair comparison, we also optimize the pipeline design of DUAL to maximize its data processing efficiency and throughput.

B. Training Efficiency

Fig. 9. Normalized performance speedup and energy saving of ReHDC training. (a) Performance speedup. (b) Energy saving.

As shown in Fig. 9, ReHDC significantly speeds up HDC training by 1.93× and 69.4× on average compared with the DUAL and GPU platforms. Notably, the encoding phase accounts for a majority of the latency in the training phase. In ReHDC, an XOR operation is converted into six NOR operations, which can be efficiently executed in the DPIM by fully leveraging array parallelism and row parallelism. ReHDC can significantly reduce the execution time of HDC workloads compared with the GPU platform. Moreover, ReHDC also outperforms DUAL because ReHDC achieves array-level parallelism while DUAL can only achieve row-level parallelism for accumulation operations during the encoding phase. In ReHDC, an MVM operation on APIM can be completed in only one cycle. In contrast, DUAL can only use DPIM to accumulate two HVs in one cycle. Thus, ReHDC spends much less time on accumulation operations than DUAL in the encoding phase, thereby optimizing the training stage.

ReHDC also achieves 2.2× and 51.5× energy savings for HDC training compared with the DUAL and GPU platforms, respectively. ReHDC can significantly reduce energy consumption because the in-situ computing paradigm of DPIM avoids a majority of data movements. Furthermore, the MVM operation performed by APIM is highly energy efficient. In contrast, DUAL often incurs much higher energy consumption due to data transfer and bit-level addition in the encoding phase.

C. Inference Efficiency

Fig. 10. Normalized performance speedup and energy saving of ReHDC inference. (a) Performance speedup. (b) Energy saving.

As shown in Fig. 10, ReHDC significantly speeds up HDC inference by 1.77× and 63.35× on average, compared with the DUAL and GPU platforms, respectively. The processing latency in the inference stage mainly stems from the encoding of query data and the calculation of the Hamming distance between the query HV and all class HVs. Compared with the GPU platform, ReHDC significantly reduces the latency of query data encoding by leveraging the high parallelism of APIM. ReHDC can efficiently perform the accumulation operation using APIMs in the encoding phase, and thus significantly reduces the processing latency. In contrast, the DUAL architecture has to perform multiple addition operations in the encoding phase, resulting in higher latency. During the comparison phase, ReHDC converts the calculation of the Hamming distance into an accumulation operation, which is also efficiently processed by APIM. APIM completes array-level MVM operations in just one cycle, while DUAL requires nonlinear sampling to calculate the Hamming distance by leveraging the timing characteristics of the match-line (ML) discharging current. Thus, our approach can significantly reduce the latency of computing the Hamming distance.

ReHDC also reduces energy consumption by 2.67× and 45.65× on average in HDC inference, compared with the DUAL and GPU platforms, respectively. The primary advantage of ReHDC lies in the energy savings of MVM operations, which are more efficiently processed by APIMs than by CMOS circuits. By combining the use of APIMs and DPIMs, ReHDC demonstrates extremely high energy efficiency in accelerating HDC applications.

D. Latency/Energy Breakdown

Fig. 11. Breakdown of different phases for energy consumption and latency. (a) Breakdown of energy. (b) Breakdown of latency.

Here, we analyze the breakdown of latency and energy consumption for the encoding and comparison phases in ReHDC, DUAL, and the GPU platform, respectively. As shown in Fig. 11, all three accelerators exhibit higher energy consumption in the encoding phase than in the comparison phase. This is attributed to the larger amount of data processed in the encoding phase. On the other hand, the proportion of latency required by the encoding phase of ReHDC is much lower than that of the other accelerators. This implies that ReHDC can efficiently process a large volume of input data in a very short time due to the high parallelism of DPIMs and APIMs, and thus significantly reduces the execution latency. In contrast, the GPU and DUAL require more time to process the input data. In summary, ReHDC demonstrates extremely high efficiency in accelerating the encoding phase of HDC, due to the high parallelism of its architecture.

E. Impact of Hypervector Dimensions

Fig. 12. Performance/energy efficiency of ReHDC training on different dimensions.

To explore the impact of HV dimensions on the performance of ReHDC, we vary the HV dimensions from 2000 to 10 000, with an increment of 2000. As shown in Figs. 12 and 13, when the dimension of HVs decreases, ReHDC demonstrates more performance speedup and energy savings relative to the GPU baseline for both HDC training and inference. This implies that GPUs achieve higher performance and energy efficiency for HVs with larger dimensions because they can fully exploit their massive cores for highly parallel computing over more input data.

F. Endurance Management

ReRAMs have been widely exploited for in-situ computing in PIM architectures. However, they have limited write endurance, typically up to 10¹² writes. Without wear leveling, an ReRAM cell may wear out quickly. Thus, it is crucial to prolong the lifetime of ReRAMs by optimizing the wear leveling in ReHDC.

In APIM, an MVM operation only incurs one write for each ReRAM cell uniformly, and the class HVs are fixed in the crossbar arrays. Therefore, it is unnecessary to consider the wear leveling problem in APIM. However, in DPIM, six columns are frequently used for XOR operations, which may quickly wear out these columns. To solve this problem, we periodically and dynamically change the position of these six columns among the total of 1024 columns to achieve wear leveling, which increases the ReRAM's lifetime by 170.6×. This simple wear leveling scheme can effectively prolong the ReRAM's lifetime in all kinds of DPIMs.
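The rotation policy just described is easy to state precisely. A minimal sketch of such a schedule (our formulation; the actual controller logic is not published in this article) is:

```python
def xor_scratch_columns(op_index, total_cols=1024, group=6):
    """Round-robin placement of the six XOR scratch columns. Rotating the
    group across all columns spreads writes evenly, which yields roughly
    total_cols / group = 170.6x longer lifetime, as reported above."""
    start = (op_index * group) % total_cols
    return [(start + k) % total_cols for k in range(group)]

# The scratch group walks across the array as operations proceed.
print(xor_scratch_columns(0))   # [0, 1, 2, 3, 4, 5]
print(xor_scratch_columns(1))   # [6, 7, 8, 9, 10, 11]
```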
G. Effectiveness of Adaptive Configuration

Fig. 13. Performance/energy efficiency of ReHDC inference on different dimensions.

Fig. 14. Effect of dynamic resource allocation in ReHDC.

To evaluate the effectiveness of our adaptive resource configuration scheme in ReHDC for various datasets, we implement a fixed HDC architecture in which the number of PEs is fixed for the encoding phase and the comparison phase (10:1). Using this fixed architecture as a baseline, we can evaluate the efficiency and effectiveness of ReHDC when it is adapted to different datasets. As shown in Fig. 14, ReHDC's adaptive architecture achieves 2.34× to 7.56× energy-delay product (EDP) reduction compared with the fixed architecture for the four datasets. The fixed architecture cannot fully utilize the ReRAM arrays for different datasets, resulting in a large amount of resource waste. In contrast, empowered by the design of bidirectional APIM arrays, our adaptive architecture can dynamically allocate the unified PE resources to the encoding and comparison phases, and thus significantly improves the resource utilization of PEs.

H. Inference Accuracy

We evaluate the end-to-end inference accuracy of our PIM architecture for different datasets, and find that the accuracy is comparable to that of a full-precision GPU platform, with less than 1% deviation. The root causes are threefold. First, our ReRAM-based PIM accelerator uses single-level cells (SLCs) to store only one bit per cell, and thus can provide relatively high precision compared with multilevel ReRAM cells (MLCs). Second, DPIM arrays perform calculations in the digital domain, and thus can guarantee high precision. Although APIM arrays perform calculations in the analog domain, their output results are compared with a given threshold for classification, and thus a reasonable accuracy loss in the calculations can be tolerated. Third, the HDC paradigm and its applications generally show higher robustness than traditional DNNs due to their high-dimensional features. For datasets with 10K dimensions, HDC can even tolerate about 20% data errors while still offering inference accuracy similar to the error-free case [4], [8].

V. RELATED WORK

HDC: HDC is a novel computing paradigm inspired by the human brain. It has been explored for a variety of classification applications, such as text classification [6], image classification [8], audio recognition [38], sentiment recognition [10], and biometric recognition [9]. HDC has demonstrated the potential for performance improvement of various classification applications. Rahimi et al. [39] proposed an HDC-based algorithm for hand gesture recognition; it achieves high accuracy with a smaller scale of training datasets. Jain et al. [40] proposed an efficient binarized network for text classification by fully exploiting the potential of binary representations. To improve the performance and accuracy of HDC, some HDC algorithms are tailored specifically for large-scale data [21], [41]. However, there is still vast room for improvement and optimization of HDC computing models and algorithms.

PIM: PIM has emerged as a promising computing paradigm to solve the memory-wall problem in traditional von Neumann architectures [42]. Many studies have demonstrated that NVM-based PIM architectures can significantly improve the performance of various applications, including neural networks [43], graph computing [44], and clustering [14]. These studies leverage the unique physical features of NVMs to achieve high-performance in-situ computing. ISAAC [13] is a typical ReRAM-based DNN accelerator, which accelerates MVM operations in DNNs via ReRAM-based PIM. GraphR [15] is a graph computing accelerator that employs an ReRAM-based PIM architecture to accelerate MVM operations involving sparse matrices. LerGAN [45] exploits a 3D-stacked ReRAM-based PIM architecture to accelerate generative adversarial networks (GANs); it can efficiently process complex data streams and significantly improve the performance of GANs. These studies all explore a traditional APIM architecture to accelerate MVM operations. Despite its high performance for MVM operations, APIM cannot efficiently perform other operations like addition and XOR due to its limited computing paradigm. ReHDC addresses this limitation by combining APIM and DPIM flexibly: it leverages APIM for MVM acceleration, and exploits DPIM for arithmetic and Boolean logical operations, such as bitwise addition and XOR.
522 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 44, NO. 2, FEBRUARY 2025

limitation by combining APIM and DPIM flexibly. ReHDC R EFERENCES


leverages APIM for MVM acceleration, and exploits DPIM [1] S. K. A. Fahad and A. E. Yahya, “Inflectional review of deep learning
for arithmetic and Boolean logical operations, such as bitwise on natural language processing,” in Proc. Int. Conf. Smart Comput.
addition and XOR. Electron. Enterprise, 2018, pp. 1–4.
[2] L. Deng and D. Yu, “Deep learning: Methods and applications,” Found.
HDC Accelerators: There are only a few studies on PIM- Trends Signal Process., vol. 7, nos. 3–4, pp. 197–387, 2014.
based HDC acceleration. Although the ASIC-based HDC [3] A. Rahimi, P. Kanerva, and J. M. Rabaey, “A robust and energy-efficient
accelerator [5] can improve the performance and energy classifier using brain-inspired hyperdimensional computing,” in Proc.
Int. Symp. Low Power Electron. Design, 2016, pp. 64–69.
efficiency of HDC, it suffers heavy overhead of die area.
[4] P. Poduval, Z. Zou, H. Najafi, H. Homayoun, and M. Imani, “Stochd:
SearcHD [8] is an ReRAM-based HDC algorithm designed Stochastic hyperdimensional system for efficient and robust learning
for accelerating encoding, training, and inference entirely in from raw data,” in Proc. 58th ACM/IEEE Design Autom. Conf., 2021,
memory. However, SearcHD has to modify sense amplifiers pp. 1195–1200.
[5] M. Imani, C. Huang, D. Kong, and T. Rosing, “Hierarchical hyperdi-
to support different functions, and thus still incur relatively mensional computing for energy efficient classification,” in Proc. 55th
high overheads of die area and energy. StocHD [4] is a Annu. Design Autom. Conf., 2018, pp. 1–6.
PIM architecture for accelerating HDC. It exploits DPIMs [6] K. Shridhar, H. Jain, A. Agarwal, and D. Kleyko, “End to end binarized
neural networks for text classification,” in Proc. Workshop Simple
to perform both element-wise XOR operations and accumu- Efficient Nat. Lang. Process., pp. 29–34, 2020.
lation operations. However, since the efficiency of DPIM [7] Y. Kim, M. Imani, and T. S. Rosing, “Efficient human activity recogni-
is much lower than that of APIM, there is still vast room tion using hyperdimensional computing,” in Proc. 8th Int. Conf. Internet
Things, 2018, pp. 1–6.
for performance optimization for accumulation operations. [8] M. Imani et al., “Searchd: A memory-centric hyperdimensional comput-
Dual [14] is a DPIM architecture designed to accelerate ing with stochastic training,” IEEE Trans. Comput.-Aided Design Integr.
clustering algorithms with HVs. It uses an encoder to project Circuits Syst., vol. 39, no. 10, pp. 2422–2433, Oct. 2020.
[9] M. Eggimann, A. Rahimi, and L. Benini, “A 5 µw standard cell
raw data into a high-dimensional space and then perform clus- memory-based configurable hyperdimensional computing accelerator for
tering algorithms by computing the correlation among HVs. always-on smart sensing,” IEEE Trans. Circuits Syst. I, Reg. Papers,
Dual computes Hamming distance between the query HV vol. 68, no. 10, pp. 4116–4128, Oct. 2021.
and in-memory HVs using the discharge current property of [10] A. Menon et al., “Efficient emotion recognition using hyperdimensional
computing with combinatorial channel encoding and cellular automata,”
circuits. These previous proposals only exploit one ReRAM- Brain Informat., vol. 9, no. 1, pp. 1–13, 2022.
enabled computing paradigm (either analog computing or [11] H. Jin et al., “ReHy: A ReRAM-based digital/analog hybrid PIM archi-
digital computing) to accelerate one kind of operations (MVM tecture for accelerating CNN training,” IEEE Trans. Parallel Distribut.
Syst., vol. 33, no. 11, pp. 2872–2884, Nov. 2022.
or XOR). In contrast, ReHDC integrates both APIMs and [12] J. An et al., “Design memristor-based computing-in-memory for AI
DPIMs into the same PEs to accelerate both element-wise accelerators considering the interplay between devices, circuits, and
XOR operations and accumulation operations in the encoding system,” Sci. China Inf. Sci., vol. 66, no. 8, pp. 1–11, 2023.
[13] A. Shafiee et al., “ISAAC: A convolutional neural network accelerator
and comparison phases of HDC uniformly. This design also with in-situ analog arithmetic in crossbars,” in Proc. 43rd Annu. Int.
enables adaptive resource configuration to satisfy the ReRAM Symp. Comput. Archit., 2016, pp. 14–26.
resource requirements in the two phases for different datasets. [14] M. Imani et al., “Dual: Acceleration of clustering algorithms using
digital-based processing in-memory,” in Proc. 53rd Annu. IEEE/ACM
Moreover, ReHDC exploits dual access paths for APIM arrays to perform both horizontal and vertical accumulation operations, and thus can significantly accelerate HV accumulations in the encoding phase and Hamming distance calculations in the comparison phase.
VI. CONCLUSION
In this article, we propose ReHDC, a heterogeneous ReRAM-based architecture for HDC. We decompose the computing-intensive encoding and comparison phases into primitive XOR and accumulation operations. ReHDC leverages DPIM to perform the element-wise operations and APIM to accelerate the accumulation operations of these two phases within a unified hardware architecture. We also design an adaptive resource configuration scheme to dynamically allocate processing units based on the characteristics of different datasets. Experimental results demonstrate that ReHDC achieves 69.4× and 1.93× speedups for HDC training, and improves energy efficiency by 51.5× and 2.2×, compared with the NVIDIA Tesla P100 GPU and the ReRAM-based HDC accelerator DUAL, respectively. Moreover, ReHDC achieves performance and energy-efficiency improvements for HDC inference comparable to those observed for HDC training.
REFERENCES
[4] P. Poduval et al., "StocHD: Stochastic hyperdimensional system for efficient and robust learning from raw data," in Proc. 58th ACM/IEEE Design Autom. Conf., 2021, pp. 1195–1200.
[5] M. Imani, C. Huang, D. Kong, and T. Rosing, "Hierarchical hyperdimensional computing for energy efficient classification," in Proc. 55th Annu. Design Autom. Conf., 2018, pp. 1–6.
[6] K. Shridhar, H. Jain, A. Agarwal, and D. Kleyko, "End to end binarized neural networks for text classification," in Proc. Workshop Simple Efficient Nat. Lang. Process., 2020, pp. 29–34.
[7] Y. Kim, M. Imani, and T. S. Rosing, "Efficient human activity recognition using hyperdimensional computing," in Proc. 8th Int. Conf. Internet Things, 2018, pp. 1–6.
[8] M. Imani et al., "SearcHD: A memory-centric hyperdimensional computing with stochastic training," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 39, no. 10, pp. 2422–2433, Oct. 2020.
[9] M. Eggimann, A. Rahimi, and L. Benini, "A 5 µW standard cell memory-based configurable hyperdimensional computing accelerator for always-on smart sensing," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 68, no. 10, pp. 4116–4128, Oct. 2021.
[10] A. Menon et al., "Efficient emotion recognition using hyperdimensional computing with combinatorial channel encoding and cellular automata," Brain Informat., vol. 9, no. 1, pp. 1–13, 2022.
[11] H. Jin et al., "ReHy: A ReRAM-based digital/analog hybrid PIM architecture for accelerating CNN training," IEEE Trans. Parallel Distrib. Syst., vol. 33, no. 11, pp. 2872–2884, Nov. 2022.
[12] J. An et al., "Design memristor-based computing-in-memory for AI accelerators considering the interplay between devices, circuits, and system," Sci. China Inf. Sci., vol. 66, no. 8, pp. 1–11, 2023.
[13] A. Shafiee et al., "ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars," in Proc. 43rd Annu. Int. Symp. Comput. Archit., 2016, pp. 14–26.
[14] M. Imani et al., "DUAL: Acceleration of clustering algorithms using digital-based processing in-memory," in Proc. 53rd Annu. IEEE/ACM Int. Symp. Microarchit., 2020, pp. 356–371.
[15] L. Song, Y. Zhuo, X. Qian, H. Li, and Y. Chen, "GraphR: Accelerating graph processing using ReRAM," in Proc. IEEE Int. Symp. High Perform. Comput. Archit., 2018, pp. 531–543.
[16] A. Haj-Ali, R. Ben-Hur, N. Wald, R. Ronen, and S. Kvatinsky, "Not in name alone: A memristive memory processing unit for real in-memory processing," IEEE Micro, vol. 38, no. 5, pp. 13–21, Oct. 2018.
[17] M. Imani, S. Gupta, Y. Kim, and T. Rosing, "FloatPIM: In-memory acceleration of deep neural network training with high precision," in Proc. 46th Int. Symp. Comput. Archit., 2019, pp. 802–815.
[18] J. Liu, M. Ma, Z. Zhu, Y. Wang, and H. Yang, "HDC-IM: Hyperdimensional computing in-memory architecture based on RRAM," in Proc. 26th IEEE Int. Conf. Electron., Circuits Syst., 2019, pp. 450–453.
[19] P. Kanerva, "Hyperdimensional computing: An introduction to computing in distributed representation with high-dimensional random vectors," Cogn. Comput., vol. 1, no. 2, pp. 139–159, 2009.
[20] P. Alonso, K. Shridhar, D. Kleyko, E. Osipov, and M. Liwicki, "HyperEmbed: Tradeoffs between resources and performance in NLP tasks with hyperdimensional computing enabled embedding of N-gram statistics," in Proc. Int. Joint Conf. Neural Netw., 2021, pp. 1–9.
[21] A. Kazemi, M. M. Sharifi, Z. Zou, M. Niemier, X. S. Hu, and M. Imani, "MIMHD: Accurate and efficient hyperdimensional inference using multi-bit in-memory computing," in Proc. IEEE/ACM Int. Symp. Low Power Electron. Design, 2021, pp. 1–6.
[22] G. Karunaratne, M. Le Gallo, G. Cherubini, L. Benini, A. Rahimi, and A. Sebastian, "In-memory hyperdimensional computing," Nature Electron., vol. 3, no. 6, pp. 327–337, 2020.
[23] S. Zhang, R. Wang, J. J. Zhang, A. Rahimi, and X. Jiao, "Assessing robustness of hyperdimensional computing against errors in associative memory," in Proc. 32nd Int. Conf. Appl.-Specific Syst., Archit. Process., 2021, pp. 211–217.
[24] A. Mitrokhin, P. Sutor, C. Fermüller, and Y. Aloimonos, "Learning sensorimotor control with neuromorphic sensors: Toward hyperdimensional active perception," Sci. Robot., vol. 4, no. 30, 2019, Art. no. eaaw6736.
[25] H. Akinaga and H. Shima, "Resistive random access memory (ReRAM) based on metal oxides," Proc. IEEE, vol. 98, no. 12, pp. 2237–2251, Dec. 2010.
[26] H. Li, H. Jin, L. Zheng, Y. Huang, and X. Liao, "ReCSA: A dedicated sort accelerator using ReRAM-based content addressable memory," Front. Comput. Sci., vol. 17, no. 2, 2023, Art. no. 172103.
[27] H. Li, H. Jin, L. Zheng, and X. Liao, "ReSQM: Accelerating database operations using ReRAM-based content addressable memory," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 39, no. 11, pp. 4030–4041, Nov. 2020.
[28] P.-Y. Chen, X. Peng, and S. Yu, "NeuroSim: A circuit-level macro model for benchmarking neuro-inspired architectures in online learning," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 37, no. 12, pp. 3067–3080, Dec. 2018.
[29] S. Zhang, B. Guo, A. Dong, J. He, Z. Xu, and S. X. Chen, "Cautionary tales on air-quality improvement in Beijing," Proc. R. Soc. A, Math., Phys. Eng. Sci., vol. 473, no. 2205, 2017, Art. no. 20170457.
[30] S. Li et al., "RC-NVM: Dual-addressing non-volatile memory architecture supporting both row and column memory accesses," IEEE Trans. Comput., vol. 68, no. 2, pp. 239–254, Feb. 2019.
[31] H. Liu, J. Xu, X. Liao, H. Jin, Y. Zhang, and F. Mao, "A simulation framework for memristor-based heterogeneous computing architectures," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 41, no. 12, pp. 5476–5488, Dec. 2022.
[32] S. Kvatinsky, M. Ramadan, E. G. Friedman, and A. Kolodny, "VTEAM: A general model for voltage-controlled memristors," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 62, no. 8, pp. 786–790, Aug. 2015.
[33] Y. Huang et al., "Accelerating graph convolutional networks using crossbar-based processing-in-memory architectures," in Proc. IEEE Int. Symp. High-Perform. Comput. Archit., 2022, pp. 1029–1042.
[34] N. Muralimanohar, R. Balasubramonian, and N. P. Jouppi, "CACTI 6.0: A tool to model large caches," in Proc. Int. Symp. Microarchit., 2009, pp. 1–25.
[35] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
[36] R. Cole and M. Fanty, ISOLET (Isolated Letter Speech Recognition), UCI Mach. Learn. Repository, Irvine, CA, USA, 1994, doi: 10.24432/C51G69.
[37] A. K. Debnath, R. L. Lopez de Compadre, G. Debnath, A. J. Shusterman, and C. Hansch, "Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds. Correlation with molecular orbital energies and hydrophobicity," J. Med. Chem., vol. 34, no. 2, pp. 786–797, 1991.
[38] M. Hersche, S. Sangalli, L. Benini, and A. Rahimi, "Evolvable hyperdimensional computing: Unsupervised regeneration of associative memory to recover faulty components," in Proc. 2nd IEEE Int. Conf. Artif. Intell. Circuits Syst., 2020, pp. 281–285.
[39] A. Rahimi, S. Benatti, P. Kanerva, L. Benini, and J. M. Rabaey, "Hyperdimensional biosignal processing: A case study for EMG-based hand gesture recognition," in Proc. IEEE Int. Conf. Reboot. Comput., 2016, pp. 1–8.
[40] H. Jain, A. Agarwal, K. Shridhar, and D. Kleyko, "End to end binarized neural networks for text classification," 2020, arXiv:2010.05223.
[41] Z. Zou, H. Alimohamadi, Y. Kim, M. H. Najafi, N. Srinivasa, and M. Imani, "EventHD: Robust and efficient hyperdimensional learning with neuromorphic sensor," Front. Neurosci., vol. 16, p. 1147, Jul. 2022.
[42] C. Liu et al., "ReGNN: A ReRAM-based heterogeneous architecture for general graph neural networks," in Proc. 59th ACM/IEEE Design Autom. Conf., 2022, pp. 469–474.
[43] P. Chi et al., "PRIME: A novel processing-in-memory architecture for neural network computation in ReRAM-based main memory," ACM SIGARCH Comput. Archit. News, vol. 44, no. 3, pp. 27–39, 2016.
[44] Y. Huang, L. Zheng, X. Liao, H. Jin, P. Yao, and C. Gui, "RAGra: Leveraging monolithic 3D ReRAM for massively-parallel graph processing," in Proc. Design, Autom. Test Europe Conf. Exhibit., 2019, pp. 1273–1276.
[45] H. Mao, M. Song, T. Li, Y. Dai, and J. Shu, "LerGAN: A zero-free, low data movement and PIM-based GAN architecture," in Proc. 51st Annu. IEEE/ACM Int. Symp. Microarchit., 2018, pp. 669–681.
Cong Liu received the bachelor's degree from Dalian Maritime University, Dalian, China, in 2018. She is currently pursuing the Ph.D. degree with the Huazhong University of Science and Technology, Wuhan, China.
Her research interests include in-memory computing and ReRAM-based accelerators.

Kaibo Wu received the bachelor's degree from Nanchang University, Nanchang, China, in 2021. He is currently pursuing the master's degree with the Huazhong University of Science and Technology, Wuhan, China.
His research interests include non-volatile memory and in-memory computing.

Haikun Liu (Member, IEEE) received the Ph.D. degree in computer science and technology from the Huazhong University of Science and Technology, Wuhan, China, in 2012.
He is currently a Professor with the School of Computer Science and Technology, Huazhong University of Science and Technology. He has co-authored more than 80 papers in prestigious conferences and journals. His current research interests include in-memory computing, virtualization technologies, cloud computing, and distributed systems.
Prof. Liu is a Senior Member of CCF.

Hai Jin (Fellow, IEEE) received the Ph.D. degree in computer engineering from the Huazhong University of Science and Technology (HUST), Wuhan, China, in 1994.
He is a Chair Professor of Computer Science and Engineering with HUST. He was with The University of Hong Kong, Hong Kong, from 1998 to 2000, and was a Visiting Scholar with the University of Southern California, Los Angeles, CA, USA, from 1999 to 2000. He has co-authored more than 20 books and published over 900 research papers. His research interests include computer architecture, parallel and distributed computing, big data processing, data storage, and system security.
Dr. Jin was awarded the Excellent Youth Award from the National Science Foundation of China in 2001. In 1996, he was awarded the German Academic Exchange Service Fellowship to visit the Technical University of Chemnitz, Germany. He is a Fellow of CCF and a Life Member of the ACM.

Xiaofei Liao (Member, IEEE) received the Ph.D. degree in computer science and engineering from the Huazhong University of Science and Technology (HUST), Wuhan, China, in 2005.
He is currently a Professor with the School of Computer Science and Technology, HUST. His research interests include computer architecture, system software, and big data processing.
Prof. Liao was the recipient of the Excellent Youth Award from the National Science Foundation of China in 2018 and the CCF-IEEE CS Young Computer Scientist Award in 2017. He is a member of the IEEE Computer Society.
Zhuohui Duan received the Ph.D. degree in computer science and technology from the Huazhong University of Science and Technology (HUST), Wuhan, China, in 2022.
He is currently a Postdoctoral Research Fellow with the School of Computer Science and Technology, HUST. His research interests mainly include hybrid memory, distributed memory pools, and disaggregated memory.

Jiahong Xu received the B.S. degree from the School of Control and Computer Engineering, North China Electric Power University, Beijing, China, in 2018. He is currently pursuing the Ph.D. degree with the School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China.
His research interests mainly include non-volatile memory and memristor-based accelerators.

Huize Li received the Ph.D. degree from the School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China, in 2022.
His current research interests include computer architecture, emerging non-volatile memory, and processing in memory.

Yu Zhang received the Ph.D. degree in computer science from the Huazhong University of Science and Technology (HUST), Wuhan, China, in 2016.
He is currently a Professor with the School of Computer Science and Technology, HUST. His research interests include big data processing, graph computing, and distributed systems. His current work mainly focuses on application-driven big data processing and optimizations.

Jing Yang received the Ph.D. degree in computer science and technology from the Huazhong University of Science and Technology, Wuhan, China, in 2022.
She is currently an Associate Researcher with the School of Computer Science and Technology, Hainan University, Haikou, China. Her research interests include computer architecture, edge intelligence, and hyperdimensional computing.