Allspark: PIM Optimization for Visual Transformers
Abstract—The advent of Transformers has revolutionized computer vision, offering a powerful alternative to convolutional neural networks (CNNs), especially with the local attention mechanism that excels at capturing local structures within the input and achieves state-of-the-art performance. Processing in-memory (PIM) architecture offers extensive parallelism, low data movement costs, and scalable memory bandwidth, making it a promising solution to accelerate Transformers with memory-intensive operations. However, the crucial challenge lies in efficiently deploying the entire model onto a resource-limited PIM system while parallelizing each transformer block with potentially many computational branches based on local attention mechanisms.

We present Allspark, which focuses on workload orchestration for visual Transformers on PIM systems, aiming at minimizing inference latency. Firstly, to fully utilize the massive parallelism of PIM, Allspark employs a finer-grained partitioning scheme for computational branches and formats a systematic layout and interleaved dataflow with maximized data locality and reduced data movement. Secondly, Allspark formulates the scheduling of the complete model on a resource-limited distributed PIM system as an integer linear programming (ILP) problem. Thirdly, as local-global data interactions exhibit complex yet regular dependencies, Allspark provides a greedy-based mapping method to allocate computational branches onto the PIM system and minimize NoC communication costs. Extensive experiments on 3D-stacked DRAM-based PIM systems show that Allspark brings 1.2×∼24.0× inference speedup for various visual Transformers over baselines, and that the Allspark-enriched PIM system yields average speedups of 2.3× and energy savings of 20×∼55× over an Nvidia V100 GPU.

Index Terms—Processing in-memory, Scheduling, Visual Transformer, Spatial architecture, Model parallelism.

(Corresponding authors: Song Chen and Yi Kang.)
Mengke Ge is with the Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei, China.
Junpeng Wang, Binhan Chen, Haitao Du, Song Chen, and Yi Kang are with the School of Microelectronics, University of Science and Technology of China, Hefei, China.
Yingjian Zhong is with Anhui University, Hefei, China.
Email: mengke.ge@iai.ustc.edu.cn, songch@ustc.edu.cn, ykang@ustc.edu.cn

I. INTRODUCTION

TRANSFORMER, an attention-based neural network, has attracted tremendous interest due to its effectiveness in various domains such as language, computer vision (CV), and reinforcement learning [1], [2]. Visual Transformers, such as ViT [3] and PVT [4], have shown impressive performance in CV tasks such as image classification, object detection, and semantic segmentation, even outperforming the go-to architecture, CNNs, thanks to their larger receptive fields capable of capturing long-range dependencies between patches [2].

Recently, local-attention visual Transformers (LVTs) [5]–[9] have enjoyed great popularity. By virtue of their adeptness at effectively capturing the local structure in the input, LVTs have shown noteworthy improvements in performance compared to original visual Transformers, and are well ahead on the leaderboards of various CV datasets [10].

Unfortunately, Transformers run significantly sluggishly on general-purpose platforms such as GPUs and CPUs. Due to the high memory footprint, low data reuse, and complex data movement of operations such as self-attention, Transformers exhibit memory-intensive characteristics [11]–[13], and the communication overhead required to transfer weights from memory to computing modules becomes a crucial bottleneck, resulting in under-utilization of computational units and low computational intensity (FLOPs/Byte). To accelerate Transformer inference, domain-specific accelerators [12], [14]–[16] have been developed to offload either the self-attention operation or the entire model from general-purpose platforms. However, these accelerators necessitate loading data from off-chip memory, and still have limited parallelism and restricted memory bandwidth, thus limiting acceleration performance.

Processing in-memory (PIM) or processing near-memory architectures have the ability to effectively alleviate the memory wall problem by moving the computations closer to the data locations in the main memory, and have been used to accelerate memory-intensive applications [17]–[23]. For DNN acceleration, PIM architectures typically employ a spatial structure (tiled structure) [18], [20], [21], [24], [25], which is a scalable 2D array of compute nodes connected via a network-on-chip (NoC). Each node (PIM-node) comprises a memory subsystem (typically DRAM [18], [20], [21] or SRAM [24]) and a processing engine. With its extensive parallelism, low cost of data movement, and scalable memory bandwidth, the PIM architecture has reaped more attention to accelerate Transformers. TransPIM [11], a PIM accelerator based on high bandwidth memory (HBM), delivers more than 20× speedup on NLP Transformers compared to GPUs. MAT [26] employs a PIM architecture based on the hybrid memory cube (HMC) to accelerate long-sequence attention, yielding significant performance and energy efficiency gains.

Original Transformers employ global attention, whose computational complexity is quadratic over the whole input [2], and the attention computation acts as a bottleneck when dealing with dense visual inputs, such as pixel-level semantic segmentation.
TABLE I
SUMMARY OF SOTA TRANSFORMER ACCELERATORS.

Fig. 1. Overall structure of visual Transformers: patch embedding, S stages of transformer blocks separated by patch merging, and, within each block, linear projection of QKV, h local-attention heads over local regions of size R_h × R_w (N = R_h × R_w, complexity O(R_h · R_w · HW)), a fully-connected projection, and a feed-forward network.
visual Transformers on PIM systems with distributed characteristics. Crucially, model parallelism must contemplate efficient allocation of limited on-chip computational and memory resources, which has not been covered by existing research.

III. BACKGROUND AND MOTIVATION

A. Visual Transformers

Structure: A visual Transformer generally consists of many successive transformer blocks with a multi-stage hierarchical representation, as shown in Figure 1. Most models are divided into S (typically 4) stages, and each stage s ∈ [1, S] contains N_s^bk sequential transformer blocks (encoders) that extract feature representations. Moreover, cutting-edge models (e.g., LVTs) introduce local attention for reducing computational complexity and local-global interactions for performance enhancement.

The fixed-size input RGB image I_F ∈ R^{H×W×3} is first partitioned into non-overlapping patches of size a×a, resulting in (H/a) × (W/a) visual patches whose features are the concatenation of the original pixel RGB values. These patches are then projected into an arbitrary dimension C using a linear embedding layer. Patch merging is applied in stage s ∈ [2, S] to reduce the number of patches by 2× down-sampling of the resolution, and a linear layer is applied to the concatenated features while setting the output dimension of the s-th stage to C_s = C · 2^{s−1}. As the network goes deeper, patch merging is repeatedly employed, so the resolution of the s-th stage is P_s = H/(a·2^{s−1}) × W/(a·2^{s−1}). By varying C_s and N_s^bk, the model size and complexity can be scaled accordingly. Differently, there is no patch merging or downsampling in the ViT model.

Transformer Block: Each transformer block consists of a linear projection (LP), an MSA module, and a feed-forward network (FFN) containing two fully-connected (FC) layers. A Layernorm (LN) layer is applied before MSA and FFN, and a residual connection is applied after each module.

In LP, the input matrix is projected into three spaces, Query (Q), Key (K), and Value (V), by multiplying the weights (W^Q, W^K, and W^V). Moreover, multiple groups of QKV corresponding to the multiple heads of MSA are generated. In each head, Q is multiplied by K to obtain the attention matrix, which represents the relevance of every two patches. Each row of the attention matrix is normalized to probabilities (attention scores) using the softmax function. Finally, the scores are used to weighted-sum V to obtain the output embedding. These GeMMs are parameter-free, unlike the parameterized matrix multiplications in the FC layers of MSA and FFN. Then, the FC layer takes the output embedding of all attention heads as input and performs a linear projection. Thus, for each transformer block, there are (3 + 1 + 2·4) · C_s² = 12 · C_s² parameters, including W^Q, W^K, W^V, W^{MSA_FC}, W^{FFN_FC1}, and W^{FFN_FC2}.

Local Attention: For LVTs, to reduce the complexity to be linear in the input size, all input patches of each transformer block at the s-th stage are divided into H/(a·2^{s−1}·R_h) × W/(a·2^{s−1}·R_w) small-sized local regions of size R_h × R_w, each of which independently serves as an input to a computational branch based on local attention. As in Swin [5], a large and uneven number of computational branches appears at each stage (1225, 324, 81, and 25, respectively) when the input image is 960×960 and R_h = R_w = 7. Note that the computational procedures and weights are the same for all branches, and there are no data dependencies between them, except for the local-global interactions.

Local-global Interaction: Prior to interaction, different computational branches extract only the local information of their own regions. To regain the ability to understand the global context and long-term dependencies of the whole input, LVT models exert great efforts in implementing local-global interactions, which interchange feature matrices or KV matrices between local regions via some sophisticated operations. Some models adopt a dual-block architecture, with the first block us-
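As a concrete check of the branch counts quoted above, the following short Python sketch (our own illustration, not part of the paper's tooling) reproduces the per-stage number of local-attention branches for a Swin-style model on a 960×960 input with R_h = R_w = 7, together with the 12·C_s² per-block parameter count. The patch size a = 4 and the embedding dimension C = 96 used below are assumed Swin-like defaults and are not stated in the excerpt.

```python
import math

def branch_counts(img, a=4, Rh=7, Rw=7, stages=4):
    """Number of local-attention branches (local regions) per stage.

    Stage s has a patch resolution of img/(a*2^(s-1)) per side, which is
    tiled into ceil(./Rh) x ceil(./Rw) local regions of size Rh x Rw.
    """
    counts = []
    for s in range(1, stages + 1):
        res = img // (a * 2 ** (s - 1))          # patches per side at stage s
        counts.append(math.ceil(res / Rh) * math.ceil(res / Rw))
    return counts

def params_per_block(C, s):
    """Weights of one transformer block at stage s: W_Q, W_K, W_V, the MSA
    FC projection, and the two FFN FC layers, i.e. (3 + 1 + 2*4) * Cs^2."""
    Cs = C * 2 ** (s - 1)
    return 12 * Cs * Cs

print(branch_counts(960))           # -> [1225, 324, 81, 25], matching Section III-A
print(params_per_block(C=96, s=1))  # assumed Swin-T-like embedding dim C = 96
```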
PIM architecture, Allspark can fully automatically generate the optimal deployment scheme and the corresponding hardware instructions by evaluating the cost of memory accesses, NoC data transfers, and intra-PIM-node processing. Since the inference workloads are deterministic (fixed image size and model size), we generate the deployment solutions at compile-time.

Fig. 5. Allspark overview. Given the PIM architecture parameters and the visual Transformer workload, Allspark performs ① partitioning and dataflow formation (Sec. V): finer-grained partitioning, systematic layout, and interleaved cyclic dataflows, producing all partitioning candidates; ② temporal-layer-based scheduling for end-to-end inference (Sec. VI): ILP-based constrained optimization; and ③ local-global interaction aware mapping (Sec. VII): structured layout and greedy-based placement. The best solution and instructions are evaluated (Sec. VIII) with a memory access analysis, a NoC data transfer model, and a single-engine estimator.

Partitioning. To fully utilize the massive parallelism capability of PIM, Allspark employs a finer-grained partitioning scheme for computational branches, and formats a systematic layout and NoC-based interleaved dataflow, which realizes operator fusion-like effects on each PIM-node. That is, the output of the previous operator is stored in the local memory submodule and can be directly fetched as input by the next operator, reducing frequent data movement between PIM-nodes and maximizing data locality. The proposed scheme provides all partitioning candidates for all computational branches.

Scheduling. Optimal exploitation of the distributed computing and memory resources is the primary concern. To run the full model on a resource-limited distributed PIM system and minimize the inference latency, Allspark formulates static scheduling as a constrained optimization problem. The optimal solution is selected from all candidate partitions, and all computational branches are temporally and spatially scheduled.

Mapping. To cope with the complex dependencies of local-global interactions, Allspark provides a detailed mapping of computational branches after scheduling of the whole model. The reason is that the interaction communication cost is quite small relative to the within-branch computation cost under ideal congestion-free NoC traffic conditions. For instance, the complexity of a local region is about 12nC² + 2n²C, and its local feature map is of size nC. Assuming a PE array size of 8 × 8 and a NoC width of 64 bits, a rough estimate indicates that, under ideal conditions, the time required for data interchange in local-global interactions is within 1% of the total computation time. Hence, in the detailed mapping, each computational branch is assigned to a group of PIM-nodes while minimizing the communication cost under global-local interactions.

V. PARTITIONING AND DATAFLOW FORMATION

Allspark provides flexible and finer-grained partitioning, a systematic layout, and interleaved cyclic dataflows (see Figure 6). This enables an operator fusion-like effect in the case of massive parallelism, maximizing data locality and reducing data movement.

A. Key Ideas

Finer-grained partitioning and systematic layout: In each branch, all GeMMs in the MSA and MLP are consistently sliced into smaller matrix chunks based on fully flexible attention-head-level and patch-level partitioning. These chunks are parallelized across a PIM-node subarray of variable size u × v (u, v ≥ 1).

The critical part, MSA, consists of h (= C/d) attention heads that involve complex data exchanges while being independent of each other and generally without interdependencies. We introduce an attention-head-level partitioning, where the h heads are assigned to v groups of nodes, each group being lined up by u nodes and being given b (= ⌈h/v⌉, h ≥ v) heads.

For each attention head, the feature matrix and the derived QKV matrices are composed of embedding vectors corresponding to the N (= R_h × R_w) patches in each local region. We evenly distribute these embedding vectors among the u nodes, and each node handles p embedding vectors, thereby achieving patch-level partitioning. Finally, the feature matrix F ∈ R^{N×C} is partitioned and each node holds a partial feature matrix F′ ∈ R^{p×q}, where p = ⌈N/u⌉ and q = b · d = C/v.

Interleaved cyclic dataflows: The inter-layer cyclic dataflow is used for broadcasting different parts of data/weights when a single DNN layer is distributed over multiple nodes [27]. Differently, in this work, we devise interleaved cyclic dataflows for both MSA and FC layers based on the consistent partitioning and systematic layout. MSA requires each PIM-node to receive the partial K and V matrices generated on the remaining u − 1 nodes, and to broadcast its own partial matrices to them. In the cyclic data transmission scheme, each node only transmits the partial K and V matrices to the nearest succeeding node, except for the last node, which transmits them to the first node, forming a cyclic path. Additionally, for FC layers, the embedding matrices generated by different attention heads are concatenated and multiplied by the weight matrix. Since each embedding vector is scattered over v nodes in different groups, we also employ a cyclic data path to propagate embedding matrices for GeMM parallelism.

Subtly, after each processing phase based on the cyclic dataflows, the output matrix still maintains the same distribution as the input matrix, and each node always holds a partial matrix of size p × q. Moreover, the output result of the previous operator can be directly used by the next operator, avoiding complex data dependencies. Thanks to the well-organized systematic layout, these interleaved cyclic dataflows stream data multicasting in a pattern of alternating vertical and horizontal circulation (top right of Figure 6), which avoids contention for NoC bandwidth and relieves traffic pressure.
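To make the partitioning concrete, the illustrative Python sketch below (our own, following the 4×3 subarray example of Figure 6) computes p, q, and b and the per-node slice of the feature matrix for a given u × v subarray. The function name and the toy dimensions (C = 192 with h = 3 heads) are ours, chosen only so the numbers line up with the figure.

```python
import math

def partition(N, C, h, u, v):
    """Attention-head-level + patch-level partitioning of one branch.

    N = Rh*Rw patches per local region, C = embedding dim, h = #heads.
    The branch runs on a u x v PIM-node subarray: each of the v node groups
    gets b = ceil(h/v) heads, and the N patch embeddings are split into u
    chunks of p = ceil(N/u) rows, so every node holds a p x q tile (q = b*d).
    """
    d = C // h                      # per-head dimension
    b = math.ceil(h / v)            # heads per node group
    p = math.ceil(N / u)            # patch rows per node
    q = b * d                       # columns per node (= C/v when v divides h)
    tiles = {}
    for col in range(v):            # node group (head group)
        for row in range(u):        # position inside the group (patch chunk)
            node = row * v + col    # row-major node index, as drawn in Fig. 6
            tiles[node] = dict(
                heads=list(range(col * b, min((col + 1) * b, h))),
                patch_rows=(row * p, min((row + 1) * p, N)),
                tile_shape=(p, q))
    return p, q, b, tiles

# Toy configuration echoing Figure 6: N = 4 patches, 3 heads on a 4 x 3 subarray.
p, q, b, tiles = partition(N=4, C=192, h=3, u=4, v=3)
print(p, q, b)       # 1 64 1
print(tiles[0])      # node 1 (index 0): head 0, patch row 0, a 1 x 64 tile
```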
Fig. 6. Finer-grained partitioning and interleaved cyclic dataflows on a u × v = 4 × 3 PIM-node subarray. Attention-head-level partitioning: Head 1 is assigned to nodes 1/4/7/10, Head 2 to nodes 2/5/8/11, and Head 3 to nodes 3/6/9/12. Patch-level partitioning: Patch 1 is assigned to nodes 1/2/3, Patch 2 to nodes 4/5/6, Patch 3 to nodes 7/8/9, and Patch 4 to nodes 10/11/12. In self-attention, the partial K (and V) matrices circulate vertically over u − 1 rounds so that every node obtains the full-size K and V; the FC layers and LayerNorm use v − 1 rounds of horizontal circulation of the partial Z matrices and weight slices W_1/W_2/W_3 under attention-head-level partitioning.
F_{p×C} through v − 1 rounds of horizontal cyclic dataflows. For each attention head, the weights W^Q, W^K, and W^V ∈ R^{C×d} are pre-stored on the counterpart PIM-nodes. On each node, F_{p×C} is multiplied by these weight matrices to obtain three sub-matrices Q′, K′, and V′ ∈ R^{p×d}, respectively. Then, Q, K, and V ∈ R^{N×d} are distributed across the u PIM-nodes.

Attention: For each attention head, softmax(Q_{p×d} · K^T_{N×d}) · V_{N×d} is calculated on each node. Although only K′_{p×d} and V′_{p×d} are stored on each node, the full-size K and V distributed over all u nodes are available to every node, using u − 1 rounds of vertical cyclic dataflows (bottom left of Figure 6). In each round, each node sends only a partial matrix of size p × d to the succeeding node; each node sends the locally generated one in the first round, and sends the one received from the previous round in subsequent rounds. The assigned b attention heads are processed sequentially on each node, and the above steps are repeated b times to obtain the output matrix Z_{p×q}. Ultimately, the volume of the partial F, K, or V matrix to be transmitted at each node is D_F = D_K = D_V = b × (u − 1) × p × d = (u − 1) × p × q.

Fully-connected layer and patch merging: After the self-attention, each node starts with only a partial matrix Z_{p×q}, and we use v − 1 rounds of horizontal cyclic dataflows so that each node acquires the matrices Z_{p×aC} distributed across the v nodes, as shown in the bottom right of Figure 6. For the FC layer and patch merging, each node calculates Z_{p×aC} · W′_{aC×q} = O_{p×q}, where a ≥ 1 expresses the scaling of the weight size. The full-size weight matrix W_{C×aC} or W_{aC×C} is also pre-partitioned and pre-stored to the counterpart v node groups, respectively, and all u nodes in the same group store the same W′_{aC×q}. Note that no weight matrices or intermediate results are transferred between nodes in these phases. After computation, the full-size output matrix O remains evenly distributed across the node subarray of size u × v, so subsequent FC layers can still be executed using the cyclic data transfer described above.

LayerNorm: LayerNorm normalizes the embedding vector of each patch. Although each vector is sliced across v PIM-nodes, each node's ACU calculates the mean and standard deviation of the vector segment it holds. These two intermediate result matrices of size p × 1 are aggregated from the other nodes to each node during the computation, using the v − 1 rounds of horizontal cyclic dataflows of Section V-B.

Finally, for the one FC layer of MSA, the two FC layers of FFN, layernorm, and patch merging, the amount of data transmitted per PIM-node is D_{MSA_FC} = (v − 1) × p × q, D_{FFN} = (v − 1) × p × q + (v − 1) × p × a_1·q, D_{LN} = 2 × (v − 1) × p × 1, and D_{PM} = (v − 1) × p × a_2·q, where a_1 = 4 and a_2 = 2.

C. Processing Procedures and Weight Burdens per PIM-node

Each PIM-node of the subarray executes the same computational procedure consisting of nine phases, as shown in Figure 7. At each phase, the PE array on every PIM-node performs a specified GeMM. Then, in some phases, each PIM-node needs to retrieve partial matrices scattered across other PIM-nodes, utilizing the cyclic dataflows described above.

Briefly, all the weights used by each branch (totaling 12·C_s²) are stored in the PIM sub-array of size u × v. Once an individual branch is partitioned into v groups of nodes at the attention-head level, each group of nodes will use distinct weights, so we tentatively divide each of the weight matrices into v parts, with only one part pre-stored for each group of PIM-nodes.
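The vertical cyclic dataflow can be pictured with a tiny simulation. The sketch below (illustrative only, assuming ideal lossless links; the function name is ours) shows how, in u − 1 rounds in which every node forwards exactly one p × d partial matrix to its successor in the column ring, each node ends up seeing all u partial K (or V) matrices, consistent with the per-node transmitted volume D_K = (u − 1) × p × q.

```python
def vertical_cyclic_rounds(u):
    """Simulate the u-1 rounds of the vertical cyclic dataflow for one head.

    Node i initially holds only its own partial matrix K_i.  In each round,
    every node forwards the partial it received in the previous round
    (its own partial in round 0) to node (i+1) mod u, so after u-1 rounds
    every node has seen K_0 ... K_{u-1}.
    """
    seen = [{i} for i in range(u)]        # partials already available at node i
    in_flight = list(range(u))            # partial currently forwarded by node i
    for _ in range(u - 1):
        nxt = [0] * u
        for i in range(u):
            nxt[(i + 1) % u] = in_flight[i]   # send only to the nearest successor
        for i in range(u):
            seen[i].add(nxt[i])
        in_flight = nxt
    return seen

print(vertical_cyclic_rounds(4))
# [{0, 1, 2, 3}, {0, 1, 2, 3}, {0, 1, 2, 3}, {0, 1, 2, 3}]: every node can form
# the full-size K after 3 rounds, each round moving one p x d partial per node.
```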
with the same size. Constraint 4 is that the number of branches allocated to different temporal layers decreases progressively.

Constraint 5 implies that, under the current scheduling solution, the sum of the stored weights and the required workspace should not exceed the memory submodule's capacity on each PIM-node. The required workspace is used to store the input/output feature maps and intermediate data during the computation of each computational branch. For each PIM-node, V_s^{wt} is the volume of weight parameters that need to be stored for each block in the s-th model stage, while V_{s,j}^{dyn} is the volume of workspace required during the computation of the j-th temporal layer in the s-th model stage. According to Figure 7, for each temporal layer, the computation per node consists of nine phases. Notably, phases 2, 6, and 8 exhibit a higher demand for workspace, surpassing the other phases.

p_{s,j} = ⌈N/u_{s,j}⌉;  q_{s,j} = ⌈C_s/v_{s,j}⌉;  b_{s,j} = ⌈h/v_{s,j}⌉   (6)

V_s^F = Σ_j (p_{s,j} × q_{s,j}) · N^{bt}   (7)

V_{s,j,p2}^{dyn} = V_s^F + C_s × (p_{s,j} + d) + (p_{s,j} × d) · N^{bt}   (8)

V_{s,j,p6}^{dyn} = V_s^F + N × (p_{s,j} + q_{s,j}) + (p_{s,j} × q_{s,j}) · N^{bt}   (9)

V_{s,j,p8}^{dyn} = V_s^F + (C_s + 1) × (p_{s,j} + a·q_{s,j}) · N^{bt}   (10)

where V_s^F is the volume of the feature matrix, and N^{bt} is the batch size (default 1). Thus, V_{s,j}^{dyn} = max{V_{s,j,p2}^{dyn}, V_{s,j,p6}^{dyn}, V_{s,j,p8}^{dyn}} ∈ [0, node_cap], which is easily rewritten as linear expressions with some auxiliary variables.

However, if we were to pre-store all weights associated with the computations assigned to each node across all temporal layers, it would impose a substantial storage burden. For PIM systems with limited memory resources, Constraint 5 is so hard that no feasible solution may even exist, and even if a feasible solution exists, it does not yield the ideal inference speedup.

C. Memory Constraint-driven Weight Sharing and Reuse

To alleviate the memory burden on each PIM-node, we propose weight sharing and weight reuse. The former refers to the sharing of weights across the nodes of a sub-array on the same temporal layer, whereas the latter relates to the reuse of weights among nodes across different temporal layers.

Weight sharing: At each temporal layer, each branch is assigned to a PIM-node subarray of size u_{s,j} × v_{s,j}, on which the weight parameters of W^{PM}_{C_s×q}, W^Q_{C_s×d}, W^K_{C_s×d}, W^V_{C_s×d}, W^{FC}_{C_s×q}, and W^{FFN}_{C_s×aq+C_s×q} need to be stored. Specifically, as shown in Figures 6 and 7, in attention-head-level partitioning, the weight parameters are equally distributed among the v_{s,j} node sets. Moreover, in patch-level partitioning, each node set consisting of u_{s,j} nodes stores identical weight parameters. We distribute these identical weights equally across the u_{s,j} PIM-nodes. Then, u_{s,j} − 1 rounds of cyclic dataflows are executed before phases 2∼4 and 7∼9 of each computation, allowing each node to share its own stored weights with the other nodes. To minimize the cost of weight sharing, the weights at each phase are used for batch inputs and released promptly after each phase to minimize memory usage.

Ultimately, the volumes of the weights to be transmitted at each node in PM, MSA, FC, and FFN are W^{PM} = (u_{s,j} − 1) × ⌈2C_s × q_{s,j} / u_{s,j}⌉, W^{MSA} = b_{s,j} × 3 × (u_{s,j} − 1) × ⌈C_s × d / u_{s,j}⌉, W^{FC} = (u_{s,j} − 1) × ⌈C_s × q_{s,j} / u_{s,j}⌉, and W^{FFN} = (u_{s,j} − 1) × (⌈C_s × a·q_{s,j} / u_{s,j}⌉ + ⌈C_s × q_{s,j} / u_{s,j}⌉), respectively.

Weight reuse: All branches of the same transformer block use the same weights, but each node stores only a part of them; the following cases allow weights to be reused across temporal layers:
1) If, at a specific temporal layer, the branch is assigned to only one PIM-node, i.e., u_{s,j} = v_{s,j} = 1, then this PIM-node stores all the weight parameters;
2) If branches on two temporal layers of the same transformer block follow the same partitioning scheme.
These cases can be expressed as follows:

V_s^{wt} = 12 · C_s², if ∃ i, j: X_{s,i,j,0,0} = 1 (Term 1); otherwise V_s^{wt} = Σ_j vol_{s,j}.   (11)

vol_{s,j} = 0, if ∃ j′, 0 ≤ j′ < j < N_s^{tl}: Σ_i X_{s,i,j,α,β} = Σ_i X_{s,i,j′,α,β} (Term 2); otherwise vol_{s,j} = 12 · C_s² / (U_s[α] × V_s[β]).   (12)

By introducing two auxiliary binary variables Y_{s,j,j′,α,β} and Z_{s,j,α,β}, the above can be transformed into a linear expression:

1 − Σ_i X_{s,i,j′,0,0} ≥ Y_{s,j,j′,α,β};
Σ_i X_{s,i,j,α,β} − Σ_i X_{s,i,j′,α,β} − Σ_i X_{s,i,j′,0,0} ≤ Y_{s,j,j′,α,β};
Σ_i X_{s,i,j,α,β} ≥ Y_{s,j,j′,α,β};
1 − Σ_i X_{s,i,j′,α,β} ≥ Y_{s,j,j′,α,β};
0 ≤ j′ < j < N_s^{tl}.   (13)

j − Σ_{j′=0}^{j−1} Y_{s,j,j′,α,β} ≤ j · Z_{s,j,α,β};
Σ_{j′=0}^{j−1} Y_{s,j,j′,α,β} + Z_{s,j,α,β} ≤ j.   (14)

where Z_{s,j,α,β} = 1 indicates that branches at the j-th temporal layer will reuse weights from other layers for the transformer block of the s-th model stage, and 0 otherwise. Y_{s,j,j′,α,β} = 1 says that the branches at the j-th temporal layer will reuse weights from the j′-th temporal layer (meeting case 1 or 2), and 0 otherwise. With reuse, the weights for each transformer block stored on each PIM-node in the s-th stage are expressed as:

V_s^{wt} = Σ_j vol_{s,j}   (15)

vol_{s,j} = Σ_{α,β} (1 − Z_{s,j,α,β}) · 12 · C_s² / (U_s[α] × V_s[β])   (16)

D. Objectives

All transformer blocks shall be executed in temporal-layer order, and the inference latency includes the data transfer and weight sharing between nodes, and the intra-node computation, across all temporal layers of the entire model. N^{bt} defaults to 1; otherwise, to mitigate the latency associated with weight sharing, the weights for each processing phase on each PIM-node are used for small-batch inference. The objective is expressed as:

Minimize  T_{total} = Σ_{s,j} [N^{bt} · (T^c_{s,j} + T^t_{s,j}) + T^{ws}_{s,j}]   (17)
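To give a feel for how this formulation can be handed to an off-the-shelf solver, the sketch below builds a heavily simplified gurobipy model: binary variables X_{s,i,j,α,β}, the latency objective of Eq. (17), and a per-assignment version of the memory constraint (Constraint 5). The coefficient dictionaries (Tc, Tt, Tws, Vwt, Vdyn) are hypothetical placeholders standing in for Allspark's analytical cost model; the weight-reuse linearization (Eqs. 11–16) and the remaining scheduling constraints are omitted, so this is only a flavor of the real ILP, not its full definition.

```python
import gurobipy as gp
from gurobipy import GRB

def build_schedule_ilp(keys, Tc, Tt, Tws, Vwt, Vdyn, node_cap, Nbt=1):
    """Simplified sketch of the Allspark scheduling ILP.

    keys        : list of (s, i, j, alpha, beta) candidate assignments
    Tc, Tt, Tws : dicts keyed like X with computation / transfer /
                  weight-sharing latency of that assignment (Eqs. 18-20)
    Vwt, Vdyn   : per-key weight volume and peak workspace (Eqs. 6-10)
    node_cap    : memory-submodule capacity of one PIM-node (Constraint 5)
    """
    m = gp.Model("allspark_schedule")
    X = m.addVars(keys, vtype=GRB.BINARY, name="X")

    # Constraint 5 (simplified): stored weights + workspace fit in each node.
    for k in keys:
        m.addConstr((Vwt[k] + Vdyn[k]) * X[k] <= node_cap,
                    name="mem_" + "_".join(map(str, k)))

    # Objective (17): total latency over all temporal layers of the model.
    m.setObjective(gp.quicksum(X[k] * (Nbt * (Tc[k] + Tt[k]) + Tws[k])
                               for k in keys), GRB.MINIMIZE)
    m.optimize()
    if m.Status == GRB.OPTIMAL:
        return {k for k, var in X.items() if var.X > 0.5}
    return set()
```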
Thanks to the well-organized systematic layout and cyclic dataflows of Section V, there is no contention for NoC bandwidth resources between the packets, and it is practical to conduct a runtime assessment at each phase based on the amount of data transferred between every two PIM-nodes.

Weight sharing: As stated in Section VI-C, prior to each weight-inclusive matrix multiplication, each node must fetch the required weights using the NoC-based cyclic dataflows. The time cost of weight sharing is as follows:

T^{ws}_{s,j} = Σ_{i,α,β} X_{s,i,j,α,β} · [W^{PM}_{s,j,α,β} + (W^{MSA}_{s,j,α,β} + W^{FC}_{s,j,α,β} + W^{FFN}_{s,j,α,β}) · N_s^{bk}] / BW.   (18)

Data Transfer: Pursuant to Section V-A, the parallelization of computational branches causes feature maps to be transferred between nodes during the intervals of matrix computation operations via vertical or horizontal cyclic dataflows.

T^t_{s,j} = Σ_{i,α,β} X_{s,i,j,α,β} · [D^{PM}_{s,j,α,β} + (D^K_{s,j,α,β} + D^V_{s,j,α,β} + D^{MSA_FC}_{s,j,α,β} + D^{FFN}_{s,j,α,β} + 2 · D^{LN}_{s,j,α,β}) · N_s^{bk}] / BW   (19)

Intra-node Computation: For each temporal layer, the processing engine on each node performs the nine phases of matrix multiplication sequentially (as depicted in Figure 7), and we can derive the overall computing time for each temporal layer:

T^c_{s,j} = Σ_{i,α,β} X_{s,i,j,α,β} · [t^{PM}_{s,j,α,β} + (t^{LT}_{s,j,α,β} + t^{MSA}_{s,j,α,β} + t^{FFN}_{s,j,α,β} + t^{NC}_{s,j,α,β}) · N_s^{bk}]   (20)

Given the matrix multiplications and their dimensions for each phase, we can employ existing search-based intra-node mapping strategies [29], [39] for loop ordering and tiling to assess the computing time on the processing engine. GeMM expressed as three nested loops has low data reuse relative to convolutional operations, and the PIM-node has a concise memory hierarchy, which allows the search to be done in a short time. Based on the quantization methods [40], [41], the delays t^{NC}_{s,j,α,β} are obtained for the nonlinear computations on the ACUs.

There are in total Σ_s |U_s| · |V_s| · N_s^{br} · (2N_s^{br} + 1) binary variables and Σ_s N_s^{br} integer variables, where all variables relate to the model size, and |V_s| and N_s^{br} rely on the input size. We use Gurobi [42] to solve the ILP problem. To combat possibly oversized models, the complexity can be reduced by coarsening the partition granularity and shrinking |U_s|.

VII. LOCAL-GLOBAL INTERACTION AWARE MAPPING

Given the temporal-layer-based partitioning and scheduling results of the entire model, which specify the number of temporal layers n^{tl}_s for each transformer block, the branch count n^{br}_{s,j} for each temporal layer j = 1, ..., n^{tl}_s, and their respective partitioning schemes [u_{s,j}, v_{s,j}], this section maps all the branches of the transformer blocks onto the PIM-node array to minimize the communication cost of local-global interactions.

A. Data Dependency for Local-Global Interaction

As demonstrated in Section III-A, LVTs employ various complex operations to achieve local-global interaction within or between transformer blocks in the same model stage. Despite their complexity, these operations exhibit regular patterns, leading to regularized data dependencies between computational branches. To represent the distinct local-global interactions employed by LVTs, we construct a graph G_dp(V_bk, E_bk) to represent the data dependencies, as in Figure 9.

Fig. 9. Data dependencies between branches. (a) Swin: shifted window partition on the feature map and data dependencies. (b) CSWin: cross-shaped window and data dependencies (the h heads are split into two halves over 14 vertical stripes of size H×7 and 14 horizontal stripes of size 7×W and then concatenated, so each local region has one vertical and one horizontal stripe). Given an input image of size 224², the figures depict data dependencies between branches of the transformer block in the second model stage. Red edges mark the data dependencies of the 6th branch. As each PIM-node has to broadcast data to all the remaining nodes (e.g., CSWin), to avoid extremely cumbersome NoC communication, we also employ cyclic data transfers.

B. Two-stage Mapping Method

The issue at hand involves mapping the computational branches of |V_bk| local regions onto n^{tl} temporal layers, where within each temporal layer, n^{br} uniformly-sized sub-arrays are to be arranged on a node array of size H_A × W_A in a non-overlapping manner. This issue falls under the category of simplified 3D floor-planning problems in VLSI design. To address it, we propose a two-stage heuristic of layout followed by placement, as shown in Figure 10.

1) Structured layout: For each temporal layer, we systematically arrange the n^{br} uniformly-sized sub-arrays on the PIM-node array to ensure maximum PIM-node occupancy, as shown in Algorithm 1. Note that temporal layers with the same partition scheme adopt the same sub-array layout. The following example illustrates the structured layout when H_A = 3, W_A = 4, u = 2, and v = 1, with up to 6 sub-arrays:

[∅ ∅ ∅ ∅]        [1 2 3 4]   Rotate & Place   [1 2 3 4]
[∅ ∅ ∅ ∅]   →    [1 2 3 4]   ─────────────→   [1 2 3 4]
[∅ ∅ ∅ ∅]        [∅ ∅ ∅ ∅]                    [5 5 6 6]

where ∅ stands for an unoccupied PIM-node.

2) Greedy-based placement: With the regularized data dependency graph and the well-structured sub-array layout for the transformer block of each model stage, we propose a greedy-based placement method to place the branches onto specific PIM-nodes. Firstly, we select the first computational branch and place it on the top-left sub-array of the first temporal layer. Next, based on G_dp, we select the branch with the most dependencies on already mapped branches and place it on the sub-array with the minimal distance among all temporal layers. This process is repeated iteratively until all computational branches are mapped. As in Figure 10, the feature map of each computational branch, distributed across its sub-array nodes, is first i) gathered to the top-leftmost PIM-node, then ii) transferred to the top-leftmost PIM-node of the other sub-array, and finally iii) scattered to the other PIM-nodes. Therefore, the distance is computed as the Manhattan distance between the top-leftmost PIM-nodes of every two sub-arrays. The method has a time complexity of O(n²_{br}).
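A compressed version of this greedy loop is sketched below in Python (our paraphrase of the described procedure, not the paper's implementation). The dependency graph G_dp, the per-layer structured layouts, and any tie-breaking rules are abstracted into simple inputs: sub-arrays are identified by the coordinates of their top-leftmost PIM-node, and the cost is the Manhattan distance used in the paper.

```python
def greedy_place(branches, deps, free_slots):
    """Greedy placement of computational branches onto sub-arrays.

    branches   : list of branch ids (local regions)
    deps       : dict mapping a branch to the set of branches it exchanges
                 data with (edges of the dependency graph G_dp)
    free_slots : (row, col) coordinates of the top-leftmost PIM-node of every
                 unoccupied sub-array over all temporal layers, ordered so
                 that the top-left sub-array of the first layer comes first
    Returns a dict branch -> slot (one slot per branch is assumed available).
    """
    def manhattan(a, b):
        return abs(a[0] - b[0]) + abs(a[1] - b[1])

    placed = {}
    remaining = set(branches)
    free = list(free_slots)

    # Seed: the first branch goes to the top-left sub-array of the first layer.
    first = branches[0]
    placed[first] = free.pop(0)
    remaining.remove(first)

    while remaining:
        # Pick the branch with the most already-placed neighbours in G_dp.
        nxt = max(remaining,
                  key=lambda b: sum(n in placed for n in deps.get(b, ())))
        anchors = [placed[n] for n in deps.get(nxt, ()) if n in placed]
        # Place it on the free sub-array minimizing the total Manhattan
        # distance to the sub-arrays of its already-placed neighbours.
        slot = min(free, key=lambda s: sum(manhattan(s, a) for a in anchors))
        placed[nxt] = slot
        free.remove(slot)
        remaining.remove(nxt)
    return placed
```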
Fig. 10. Local-global interaction aware mapping. Partitioning and scheduling result: temporal layer 1: [2, 2] and 4 branches; temporal layer 2: [2, 4] and 2 branches; temporal layer 3: [2, 2] and 2 branches; temporal layer 4: [4, 4] and 1 branch. 1) Structured layout; 2) greedy-based placement; local-global interaction (④ → ⑨).

VIII. EXPERIMENTS

A. Experiment Setting

The proposed Allspark is implemented using Python on a Linux server with an Intel Xeon Gold 6254 CPU@3.10GHz.

PIM architecture: We adopt emerging 3D-stacked DRAM [20], [23] as the substrate of our PIM architecture, which provides a high-density, high-energy-efficiency PIM solution built with logic-to-DRAM hybrid bonding technology. Furthermore, the processing engine is built with the widely used NVDLA-style architecture [33], [36]. This configuration allows the processing engines to be positioned closer to the memory, which significantly reduces the latency and power consumption of memory accesses, and achieves an impressive energy efficiency of 0.66 pJ/bit [20]. The detailed configurations of the PIM systems are shown in Table III.

Simulation: We utilize a DNN accelerator evaluation tool, Timeloop+Accelergy [29], to obtain the computation, memory accesses, and energy consumption of the NVDLA-style processing engines in the PIM system. To simulate DRAM accesses within and between PIM-nodes and the energy consumption, we

Baselines: State-of-the-art accelerators use either attention-head-level (AH) partitioning [14], [16] or patch-level (P) partitioning [11], [12], [15], [26] for parallelization. We conduct comparative assessments against AH, P, and branch-level (B) partitioning on the PIM system for various models.

In addition, we compare the inference latency of visual Transformers on a DRAM-based PIM system with Allspark against an Nvidia V100 GPU. We use PyTorch to implement the inference of the different models running on the GPU, and measure power consumption using nvidia-smi. To this end, we scale the size of the node arrays and PE arrays of the PIM system to achieve a peak throughput (14.7 TOPS@INT8) equivalent to that of the GPU (15.7 TFLOPS). The GPU and PIM have aggregate memory capacities of 32 GiB and 4.5 GiB, respectively, and bandwidths of 900 GB/s and 3.35 TB/s, respectively.

Workloads: We evaluate the mapping methods across ViT [3], PVT [4], and four state-of-the-art LVT models, namely Swin [5], Focal [6], Twins [7], and CSWin [9], using the ImageNet dataset. Differently, PVTs are deployed onto a node array of size 8 × 8 due to their low parallelism. Each model is available in several sizes: tiny (T), small (S), base (B), large (L), and huge (H). These models typically have parameters ranging from 20 to 700 million and computational loads of several hundred GFLOPs.

B. Effectiveness of Allspark

The average searching overheads of Allspark are 0.8 s (small models) and 3.4 s (largest models) for input sizes of 224 × 224, increasing to 12.5 min (small models) and 2.1 h (largest models) for input sizes of 960 × 960. For different models, the overhead difference is not significant. The Allspark optimization evaluation is within 10% deviation from the simulation results.

1) Inference Speed-up: As shown in Figure 11, with the support of the proposed Allspark, the inference latency of models on the PIM system significantly improves. For original
visual Transformers, compared to the B, P, and AH partitioning methods, the inference latency is improved on average by 28.8×, 6.7×, and 1.2×, respectively. For LVTs, the average speedup is 40.0×, 5.1×, and 4.2×, respectively. This improvement is particularly pronounced for larger input image sizes and larger models (except for CSWin). In such cases, each transformer block has a greater number of attention heads and patches in each branch based on local regions, resulting in more partitioning possibilities. Other methods use fixed and uni-patterned partitioning, while Allspark allows fully flexible partitioning for each transformer block of the models. For larger input image and model sizes, the CSWin Transformer has a higher number of branches (local regions) within its transformer block compared to other models. Therefore, on CSWin, there is no significant speedup of Allspark compared to the branch-level method. The figure also gives the ideal performance, which assumes perfect hardware utilization.

Fig. 11. Normalized speedups (w.r.t. B) achieved by Allspark: (a) ViT and PVT variants; (b) Swin, Twins, and Focal variants, for input sizes 224×224, 640×640, and 960×960, compared against B, AH, P, and the ideal case.

2) PIM-node Utilization: We provide the PIM-node utilization for each stage and for all stages of the models, expressed as

Node_Util_s = Σ_j N^{s,j}_{node_occupied} · Exec_Time_{s,j} / Σ_j (H_A × W_A) · Exec_Time_{s,j}   (21)

where Exec_Time_{s,j} and N^{s,j}_{node_occupied} are the runtime and the number of occupied nodes for the j-th temporal layer of the transformer block in the s-th model stage, respectively.

Fig. 12. Node utilization at each stage and over all stages (Swin-B and Twins-B at 640×640 and 960×960).

Figure 12 shows the node utilization of LVTs with different mapping methods at each stage and for all stages. Using Allspark, the PIM system has the highest node utilization in every stage of the LVTs, and an overall utilization of over 96% (Swin-B) and over 74% (Twins-B), respectively. In contrast, the node utilization of the PIM system under the other methods is below 20%. Overall, the node utilization tends to decrease from stage to stage due to the downsampling performed in each stage, which causes a decreasing number of local regions (branches) in the transformer blocks. Furthermore, the node utilization is dominated by the third model stage, which has the most transformer blocks (about 3 to 9 times more than the other stages) and accounts for a large proportion of the computation.

Fig. 13. (a) Latency breakdown (cycles) into MSA, FFN, and other parts; (b) normalized MAC counts, NoC workload, DRAM accesses, and energy for AH, P, and Allspark (PIM-node array: 16×16; memory submodule capacity: 8 MiB; local regions: (img_size/head_size)²).

3) Latency Breakdown: Figure 13a reveals the inference latency breakdown, highlighting that Allspark enables LVTs to have minimal inference latency due to its high PIM-node utilization. Furthermore, the time taken for feature map transfer and weight sharing is only 13% to 41% of that of the other methods. As shown in the normalized comparison of Figure 13b, apart from the unchanged computation and MAC counts, Allspark enables the PIM system to have less NoC workload, fewer DRAM accesses, and lower energy consumption, with the patch-level partitioning approach showing the highest values, which are 6.7, 3.6, and 2.8 times higher than those of Allspark, respectively. The methods based on fixed and uni-pattern partitioning cause heavy transmission of key-value (KV) matrices and massive weight sharing between PIM-nodes.
For different batch sizes (BS), Figure 14 shows the normalized average latency. Overall, the computing time and feature map transfer time exhibit slight differences, while the average time taken for weight sharing decreases gradually as the batch size increases, since the weights are used for all batches after weight sharing at each computation phase, and when the batch size is very large the weight sharing time is negligible. As in Section III-C, the time spent on feature map transmission between transformer blocks (part of the feature map transfer in Figure 14) is less than 1%, confirming the soundness of prioritizing resource-driven partitioning and scheduling in Allspark.

Fig. 14. Normalized average latency for small-batch inference (BS = 1, 4, 8, 16), broken down into computing, feature map transfer, and weight sharing.

4) Memory Usage: The solution provided by Allspark satisfies Constraint 5, where the memory capacity per PIM-node meets the space requirements during inference. For Swin-B (R_h = R_w = 7) and images of size 640 × 640, with a memory capacity of 8 MiB per PIM-node, the solution of Allspark is given in Table IV. Table V provides an overview of the memory usage. After weight reuse and sharing, the total memory usage drops to 7.96 MiB. However, without either of them, the memory required for each PIM-node would exceed its capacity (8 MiB).

TABLE IV
PARTITIONING AND SCHEDULING RESULT OF ALLSPARK.

Stage   Temporal Layers   Partitioning Scheme§      Weight Reuse
1       3                 [1,1], [1,1], [4,2]       Yes
2       2                 [2,1], [4,4]              No
3       3                 [8,2], [8,2], [16,4]      Yes
4       1                 [4,4]                     No

§ Blue text indicates the temporal layer at which weight reuse occurs.

TABLE V
MEMORY REQUIREMENTS (MiB) FOR A SINGLE NODE.

Case               Weights   Workspace   Total
w/ Reuse+Sharing   6.94      1.02        7.96 (✓)
w/o Reuse          11.64     1.02        12.66 (✗)
w/o Sharing        59.8      1.02        60.82 (✗)

5) Impact of PIM system configurations: The effectiveness of Allspark varies across PIM architectures, and Figure 15 shows the average normalized speedup for all base-size models. As the node array size and memory submodule scale up, Allspark achieves increasingly better performance over the baselines due to its fully flexible partitioning and scheduling. The memory submodule capacity per PIM-node is a hard constraint that governs the solution and affects the acceleration performance.

Fig. 15. Effectiveness of Allspark against the PIM-node array size (12×12, 16×16, 20×20) and the memory capacity on each PIM-node (Allspark with 4 MiB, 8 MiB, and 12 MiB versus B, AH, and P; image size 640×640, local region 7×7, workload: all Base models).

C. Performance of Allspark-enriched PIM Systems

The normalized speedup and energy efficiency for small-batch inference on a 3D-stacked DRAM-based PIM system with Allspark versus an Nvidia V100 GPU are shown in Figure 16. Allspark-enriched PIM systems have on average a 2.3× speedup in latency as well as an improvement of 20×∼55× in energy efficiency compared to the GPU for different models, model sizes, and batch sizes. Especially for non-batched inference, the PIM system delivers the most significant speed gains, up to 3.5×, and the speedups stabilize as the batch size increases. For the same model, the PIM system exhibits greater speedup when processing larger image sizes, attributed to a higher utilization of PIM-nodes, as in Figure 12. The difference between small and base models lies in the channel size, resulting in no significant variation in speedup. Compared with the other methods, Allspark brings over 2× speedup, exploiting the strengths of the PIM system to achieve a high node utilization due to its flexible partitioning strategy and scheduling scheme.

Fig. 16. Speed and energy efficiency improvements of DRAM-based PIM systems for small-batch inference (batch sizes 1, 4, 8, 16) over the V100 GPU, for Swin-S, Swin-B, Twins-S, and Twins-B at 640×640 and 960×960.

Subsequently, we shrink the aggregate memory bandwidth of the PIM to be the same as that of the GPU, at which point the PIM system has a smaller node array but a larger PE array per node. As shown in Table VI, the Allspark-enriched PIM system still earns a 1.95× average inference speedup.

TABLE VI
PERFORMANCE GAINS WITH SCALED BANDWIDTH.

Memory Bandwidth   900 GB/s   1.12 TB/s   1.49 TB/s   3.35 TB/s
Average speedup    1.95       2.15        2.30        2.30

IX. CONCLUSIONS

Allspark endeavours to orchestrate the workloads of visual Transformers on PIM systems, with the goal of minimizing inference latency. Addressing the distributed nature of the PIM system, Allspark provides a flexible partitioning strategy and elegant dataflows, partitioning and scheduling for end-to-end execution, and interaction-oriented mapping, so that the system acquires high computational node utilization, reasonable weight arrangement, and efficacious data communication. Evaluations on 3D-stacked DRAM-based PIM systems across various visual Transformers show that Allspark delivers significant inference speedups over the baselines, and that the Allspark-
enriched PIM system yields notable improvements in inference latency and also energy efficiency over the Nvidia V100 GPU.

Visual models are evolving rapidly and may continue to spawn variants of Transformer models, but the Transformer serves as the key constituent with a significant computational and memory footprint, and remains the bottleneck for deployment. Allspark still works for the Transformer module and offers fertile ground for further enhancements.

REFERENCES

[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS). Red Hook, NY, USA: Curran Associates Inc., 2017, pp. 6000–6010.
[2] K. Han, Y. Wang, H. Chen, X. Chen, J. Guo, Z. Liu, Y. Tang, A. Xiao, C. Xu, Y. Xu, Z. Yang, Y. Zhang, and D. Tao, "A survey on vision transformer," IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 45, no. 1, pp. 87–110, 2023.
[3] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, "An image is worth 16x16 words: Transformers for image recognition at scale," in International Conference on Learning Representations (ICLR), 2021.
[4] W. Wang, E. Xie, X. Li, D.-P. Fan, K. Song, D. Liang, T. Lu, P. Luo, and L. Shao, "Pyramid vision transformer: A versatile backbone for dense prediction without convolutions," in 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 548–558.
[5] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, "Swin transformer: Hierarchical vision transformer using shifted windows," in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2021, pp. 10012–10022.
[6] J. Yang, C. Li, P. Zhang, X. Dai, B. Xiao, L. Yuan, and J. Gao, "Focal attention for long-range interactions in vision transformers," in Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS), vol. 34. Curran Associates, Inc., 2021, pp. 30008–30022.
[7] X. Chu, Z. Tian, Y. Wang, B. Zhang, H. Ren, X. Wei, H. Xia, and C. Shen, "Twins: Revisiting the design of spatial attention in vision transformers," in Advances in Neural Information Processing Systems, vol. 34. Curran Associates, Inc., 2021, pp. 9355–9366.
[8] Z. Huang, Y. Ben, G. Luo, P. Cheng, G. Yu, and B. Fu, "Shuffle transformer: Rethinking spatial shuffle for vision transformer," arXiv preprint arXiv:2106.03650, 2021.
[9] X. Dong, J. Bao, D. Chen, W. Zhang, N. Yu, L. Yuan, D. Chen, and B. Guo, "Cswin transformer: A general vision transformer backbone with cross-shaped windows," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2022, pp. 12124–12134.
[10] Papers with code: Latest papers with code. [Online]. Available: https://paperswithcode.com/sota
[11] M. Zhou, W. Xu, J. Kang, and T. Rosing, "Transpim: A memory-based acceleration via software-hardware co-design for transformer," in 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2022, pp. 1071–1085.
[12] Y. Qin, Y. Wang, D. Deng, Z. Zhao, X. Yang, L. Liu, S. Wei, Y. Hu, and S. Yin, "Fact: Ffn-attention co-optimized transformer architecture with eager correlation prediction," in Proceedings of the 50th Annual International Symposium on Computer Architecture (ISCA). New York, NY, USA: Association for Computing Machinery, 2023. [Online]. Available: https://doi.org/10.1145/3579371.3589057
[13] S.-C. Kao, S. Subramanian, G. Agrawal, A. Yazdanbakhsh, and T. Krishna, "Flat: An optimized dataflow for mitigating attention bottlenecks," in Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). New York, NY, USA: Association for Computing Machinery, 2023, pp. 295–310. [Online]. Available: https://doi.org/10.1145/3575693.3575747
[14] H. You, Z. Sun, H. Shi, Z. Yu, Y. Zhao, Y. Zhang, C. Li, B. Li, and Y. Lin, "Vitcod: Vision transformer acceleration via dedicated algorithm and accelerator co-design," in 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2023, pp. 273–286.
[15] H. Wang, Z. Zhang, and S. Han, "Spatten: Efficient sparse attention architecture with cascade token and head pruning," in 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2021, pp. 97–110.
[16] Z. Qu, L. Liu, F. Tu, Z. Chen, Y. Ding, and Y. Xie, "Dota: Detect and omit weak attentions for scalable transformer acceleration," in Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), New York, NY, USA, 2022, pp. 14–26.
[17] G. Dai, T. Huang, Y. Chi, J. Zhao, G. Sun, Y. Liu, Y. Wang, Y. Xie, and H. Yang, "Graphh: A processing-in-memory architecture for large-scale graph processing," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), vol. 38, no. 4, pp. 640–653, 2019.
[18] D. Kim, J. Kung, S. Chai, S. Yalamanchili, and S. Mukhopadhyay, "Neurocube: A programmable digital neuromorphic architecture with high-density 3d memory," in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), 2016, pp. 380–392.
[19] M. He, C. Song, I. Kim, C. Jeong, S. Kim, I. Park, M. Thottethodi, and T. N. Vijaykumar, "Newton: A dram-maker's accelerator-in-memory (aim) architecture for machine learning," in 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2020, pp. 372–385.
[20] S. Wang, B. Yu, W. Xiao, F. Bai, X. Long, L. Bai, X. Jia, F. Zuo, J. Tan, Y. Guo, P. Sun, J. Zhou, Q. Zhan, S. Hu, Y. Zhou, Y. Kang, Q. Ren, and X. Jiang, "A 135 gbps/gbit 0.66 pj/bit stacked embedded dram with multilayer arrays by fine pitch hybrid bonding and mini-tsv," in 2023 IEEE Symposium on VLSI Technology and Circuits (VLSI Technology and Circuits), 2023, pp. 1–2.
[21] M. Gao, J. Pu, X. Yang, M. Horowitz, and C. Kozyrakis, "Tetris: Scalable and efficient neural network acceleration with 3d memory," ACM SIGARCH Computer Architecture News, vol. 45, no. 1, pp. 751–764, 2017. [Online]. Available: https://doi.org/10.1145/3093337.3037702
[22] S. Lee, K. Kim, S. Oh, J. Park, G. Hong, D. Ka, K. Hwang, J. Park, K. Kang, J. Kim, J. Jeon, N. Kim, Y. Kwon, K. Vladimir, W. Shin, J. Won, M. Lee, H. Joo, H. Choi, J. Lee, D. Ko, Y. Jun, K. Cho, I. Kim, C. Song, C. Jeong, D. Kwon, J. Jang, I. Park, J. Chun, and J. Cho, "A 1ynm 1.25v 8gb, 16gb/s/pin gddr6-based accelerator-in-memory supporting 1tflops mac operation and various activation functions for deep-learning applications," in 2022 IEEE International Solid-State Circuits Conference (ISSCC), vol. 65, 2022, pp. 1–3.
[23] D. Niu, S. Li, Y. Wang, W. Han, Z. Zhang, Y. Guan, T. Guan, F. Sun, F. Xue, L. Duan, Y. Fang, H. Zheng, X. Jiang, S. Wang, F. Zuo, Y. Wang, B. Yu, Q. Ren, and Y. Xie, "184qps/w 64mb/mm2 3d logic-to-dram hybrid bonding with process-near-memory engine for recommendation system," in 2022 IEEE International Solid-State Circuits Conference (ISSCC), vol. 65, 2022, pp. 1–3.
[24] E. Talpes, D. Williams, and D. D. Sarma, "Dojo: The microarchitecture of tesla's exa-scale computer," in 2022 IEEE Hot Chips 34 Symposium (HCS), 2022, pp. 1–28.
[25] J. Wang, M. Ge, B. Ding, Q. Xu, S. Chen, and Y. Kang, "Nicepim: Design space exploration for processing-in-memory dnn accelerators with 3d-stacked-dram," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), pp. 1–1, 2023.
[26] M. Zhou, Y. Guo, W. Xu, B. Li, K. W. Eliceiri, and T. Rosing, "Mat: Processing in-memory acceleration for long-sequence attention," in 2021 58th ACM/IEEE Design Automation Conference (DAC), 2021, pp. 25–30.
[27] M. Gao, X. Yang, J. Pu, M. Horowitz, and C. Kozyrakis, "Tangram: Optimized coarse-grained dataflow for scalable nn accelerators," in Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). New York, NY, USA: Association for Computing Machinery, 2019, pp. 807–820. [Online]. Available: https://doi.org/10.1145/3297858.3304014
[28] J. Wang, H. Du, B. Ding, Q. Xu, S. Chen, and Y. Kang, "Ddam: Data distribution-aware mapping of cnns on processing-in-memory systems," ACM Transactions on Design Automation of Electronic Systems (TODAES), vol. 28, no. 3, Mar. 2023. [Online]. Available: https://doi.org/10.1145/3576196
[29] A. Parashar, P. Raina, Y. S. Shao, Y.-H. Chen, V. A. Ying, A. Mukkara, R. Venkatesan, B. Khailany, S. W. Keckler, and J. Emer, "Timeloop: A systematic approach to dnn accelerator evaluation," in 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2019, pp. 304–315.
[30] S. Zheng, X. Zhang, L. Liu, S. Wei, and S. Yin, "Atomic dataflow based graph-level workload orchestration for scalable dnn accelerators," in 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2022, pp. 475–489.
[31] Q. Huang, M. Kang, G. Dinh, T. Norell, A. Kalaiah, J. Demmel,
J. Wawrzynek, and Y. S. Shao, “Cosa: Scheduling by constrained
optimization for spatial accelerators,” in 2021 ACM/IEEE 48th Annual
International Symposium on Computer Architecture (ISCA), 2021, pp.
554–566.
[32] J. Cai, Y. Wei, Z. Wu, S. Peng, and K. Ma, “Inter-layer scheduling
space definition and exploration for tiled accelerators,” in Proceedings
of the 50th Annual International Symposium on Computer Architecture
(ISCA). New York, NY, USA: Association for Computing Machinery,
2023. [Online]. Available: https://doi.org/10.1145/3579371.3589048
[33] Y. S. Shao, J. Clemons, R. Venkatesan, B. Zimmer, M. Fojtik,
N. Jiang, B. Keller, A. Klinefelter, N. Pinckney, P. Raina, S. G.
Tell, Y. Zhang, W. J. Dally, J. Emer, C. T. Gray, B. Khailany, and
S. W. Keckler, “Simba: Scaling deep-learning inference with multi-
chip-module-based architecture,” in Proceedings of the 52nd Annual
IEEE/ACM International Symposium on Microarchitecture (MICRO).
New York, NY, USA: Association for Computing Machinery, 2019, pp.
14–27. [Online]. Available: https://doi.org/10.1145/3352460.3358302
[34] S. M. Nabavinejad, M. Baharloo, K.-C. Chen, M. Palesi, T. Kogel,
and M. Ebrahimi, “An overview of efficient interconnection networks
for deep neural network accelerators,” IEEE Journal on Emerging and
Selected Topics in Circuits and Systems, vol. 10, no. 3, pp. 268–282,
2020.
[35] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa,
S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, P.-l. Cantin,
C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb,
T. V. Ghaemmaghami, R. Gottipati, W. Gulland, R. Hagmann, C. R.
Ho, D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey,
A. Jaworski, A. Kaplan, H. Khaitan, D. Killebrew, A. Koch, N. Kumar,
S. Lacy, J. Laudon, J. Law, D. Le, C. Leary, Z. Liu, K. Lucke,
A. Lundin, G. MacKean, A. Maggiore, M. Mahony, K. Miller, R. Na-
garajan, R. Narayanaswami, R. Ni, K. Nix, T. Norrie, M. Omernick,
N. Penukonda, A. Phelps, J. Ross, M. Ross, A. Salek, E. Samadiani,
C. Severn, G. Sizikov, M. Snelham, J. Souter, D. Steinberg, A. Swing,
M. Tan, G. Thorson, B. Tian, H. Toma, E. Tuttle, V. Vasudevan, R. Wal-
ter, W. Wang, E. Wilcox, and D. H. Yoon, “In-datacenter performance
analysis of a tensor processing unit,” in 2017 ACM/IEEE 44th Annual
International Symposium on Computer Architecture (ISCA), 2017, pp.
1–12.
[36] Nvidia, “Nvdla deep learning accelerator,” 2017. [Online]. Available:
http://nvdla.org.
[37] J. Lee and J. Lee, “Specializing cgras for light-weight convolutional
neural networks,” IEEE Transactions on Computer-Aided Design of
Integrated Circuits and Systems (TCAD), vol. 41, no. 10, pp. 3387–
3399, 2022.
[38] Y. Sheng, L. Zheng, B. Yuan, Z. Li, M. Ryabinin, B. Chen, P. Liang,
C. Re, I. Stoica, and C. Zhang, “Flexgen: High-throughput generative
inference of large language models with a single GPU,” in Proceedings
of the 40th International Conference on Machine Learning (ICML), vol.
202. PMLR, 23–29 Jul 2023, pp. 31 094–31 116. [Online]. Available:
https://proceedings.mlr.press/v202/sheng23a.html
[39] A. Samajdar, J. M. Joseph, Y. Zhu, P. Whatmough, M. Mattina, and
T. Krishna, “A systematic methodology for characterizing scalability of
dnn accelerators using scale-sim,” in 2020 IEEE International Sympo-
sium on Performance Analysis of Systems and Software (ISPASS), 2020,
pp. 58–68.
[40] S. Kim, A. Gholami, Z. Yao, M. W. Mahoney, and K. Keutzer, “I-bert:
Integer-only bert quantization,” arXiv preprint arXiv:2101.01321, 2021.
[41] Y. Lin, T. Zhang, P. Sun, Z. Li, and S. Zhou, “Fq-vit: Post-training
quantization for fully quantized vision transformer,” in Proceedings of
the Thirty-First International Joint Conference on Artificial Intelligence
(IJCAI), 7 2022, pp. 1173–1179.
[42] G. Optimization. (2023) Gurobi optimizer reference manual. [Online].
Available: https://www.gurobi.com/
[43] Y. Kim, W. Yang, and O. Mutlu, “Ramulator: A fast and extensible
dram simulator,” IEEE Computer Architecture Letters, vol. 15, no. 1,
pp. 45–49, 2016.
[44] N. Jiang, D. U. Becker, G. Michelogiannakis, J. Balfour, B. Towles, D. E.
Shaw, J. Kim, and W. J. Dally, “A detailed and flexible cycle-accurate
network-on-chip simulator,” in 2013 IEEE International Symposium on
Performance Analysis of Systems and Software (ISPASS), 2013, pp. 86–
96.
[45] K. Chandrasekar, C. Weis, Y. Li, S. Goossens, M. Jung, O. Naji, B. Akesson, N. Wehn, and K. Goossens, "DRAMPower: Open-source DRAM power & energy estimation tool," 2022. [Online]. Available: http://www.drampower.info