Professional Documents
Culture Documents
on Spatial Architectures
Khushal Sethi
khushal@stanford.edu
Stanford University
resented in the form of Graph Processing such as MapReduce etc. internet usage.
Consequently, a lot of designs for Domain-Specific Accelerators Hence, a large number of systems for efficient execution of large
have been proposed for Graph Processing. Spatial Architectures graph workloads on CPU, GPUs, FPGAs, Hybrid Memory Cube
have been promising in the execution of Graph Processing, where (HMC) and also custom accelerators have been proposed in the past
the graph is partitioned on several nodes and each node works in [1].
parallel. In this paper, we analyze a Spatial architecture based domain-
We conduct experiments to analyze the on-chip movement of data specific accelerator that utilizes Content Addressable Memories to
in graph processing on a Spatial Architecture. Based on the obser- accelerate Graph Applications efficiently with high storage den-
vations, we identify a data movement bottleneck, in the execution sity and efficiency by modeling both computation and memory in
of such highly-parallel processing accelerators. To mitigate the bot- graph processing. Although, several works for graph processing
tleneck we propose a novel power-law aware Graph Partitioning in-memory have been investigated, we recognize a data movement
and Data Mapping scheme to reduce the communication latency bottleneck, in the highly-parallel execution of such accelerators.
by minimizing the hop counts on a scalable network-on-chip. The To mitigate the bottleneck, we propose a novel "power-law" aware
experimental results on popular graph algorithms show that our Graph Partitioning and Data Mapping scheme to reduce the com-
implementation makes the execution 2 − 5× faster and 2.7 − 4× munication latency. Our mapping scheme minimizes hop counts on
energy-efficient by reducing the data movement time in comparison a scalable Network on Chip. The experimental results on popular
to a baseline implementation. graph algorithms, show that our implementation achieves 2 − 5×
speedup and 2.7 − 4× energy reduction in comparison to the base-
CCS CONCEPTS lines.
The paper is organized as follows : Section 2 provides a background
• Computer systems organization → Spatial Architecture; Do-
on the Graph Frameworks (Vertex-Centric Processing) and Spatial
main Specific Accelerators; • Networks → Network on chip.
Architectures used in this paper. Section 3 describes our Experi-
mental Setup, Section 4 contains the discussion for analyzing data
KEYWORDS movement and our proposed data mapping scheme. Section 5 lists
Graph processing, Application-Specific Accelerators, On-chip Com- our key results and performance of our implementation. Section 6
munication, Spatial Architecture, Content-Addressable Memory and 7 contain previous related work and conclusions respectively.
ACM Reference Format:
Khushal Sethi. 2021. Efficient On-Chip Communication for Parallel Graph- 2 BACKGROUND
Analytics on Spatial Architectures. In Proceedings of ACM Conference (Con-
ference’17). ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/ 2.1 Graph Frameworks
nnnnnnn.nnnnnnn
The vertex-centric programming model for graph processing [3], GraphR [4], GraphSAR [5] and HyVE [6]), achieving great
consists of three phases : Process, Reduce and Apply. This func- speedup and energy efficiency improvement. The implementations
tioning takes several iterations for processing to converge, in each in GraphR, GraphSAR and HyVE utilize the matrix-vector multi-
iteration the search frontier of vertices increases. Algorithm 1 de- plication capabilitites of ReRAM in analog domain. In the design
scribes the vertex-centric programming model in graph-processing. of RPBFS, GraphR, and HyVE, edges in the graph are stored using
The execution is stopped when a converged solution is found. The the edge list format. GraphR converts stores the graph in edge list
Process-Reduce-Apply functions for processing graphs for popular format and converts it into adjacency matrix for MVM operation
algorithms in shown in Table 1. The process and apply phases do and stores back into the ReRAM crossbar. GraphSAR proposed a
not contain dependencies and can be executed in parallel where-as sparsity-aware data partitioning for improving the performance
reduce phase should be executed serially, but can be parallelized by of GraphR. HyVE proposed a design for a memory hierarchy of
addition of extra fields in the Graph [2]. hetergenous storage over various levels of caches and ReRAM to op-
timize its execution, which involves off-chip data movement. HyVE
GE GE GE GE GE GE GE GE uses conventional CMOS circuits to process the subgraphs edge
by edge. GRAM [2] uses 2-D ReCAM based architecture, which
operates search operations in the digital domain and all the blocks
GE GE GE GE GE GE GE GE are serially connected to a memory controller which dictates the
execution and data movement on a shared bus. RADAR[7] similar
to us analyzes 3D-ReCAM for genome sequencing.
GE GE GE GE GE GE
GE GE The high-level architecture of a spatial architecture is shown in Fig.
1. It typically consists of multiple Graph Engines, the fundamental
GE GE GE GE GE GE GE GE
computational blocks. Each Graph Engine can execute in parallel
which provides a high degree of parallelism for execution in differ-
(b) ent graph processing stages. The On-Chip Controller is responsible
(a) for broadcasting operations to each Graph Engine.
Each Graph Engine may consist of input and output buffers, an
Figure 1: Spatial Architecture for Graph Processing. Each on-node storage (for storing the partitioned Graph), Arithmetic
Processing Engine is connected with a NOC (a) 2D Mesh (b) Logic Units (ALUs) for post processing the data.
Flattened Butterfly. The Graph Engines are connected in a "memory-centric" NOC (such
as 2D Mesh, DragonFly, Torus) structure by which they can route
packets of data.
In-Memory
Data Structure Process Reduce Apply
Check Active
Read Vertices
2.3 Data Flow
Vertex Table Read Vertices Read Vertices
The processing flow in Fig 2. follows through an in-memory layout.
Edge Table Read Edges Read Edges The graph structure in memory is represented by a Edge Table
ET and a Vertex Table VT. The edge table and vertex table are ar-
Compute Edge
eProp Vector Property
Copy eProp
ranged column-wise in the dense memory to provide fast content
vTemp Vector Copy vTemp
addressable search during the execution [2]. For graph G, with M
edges and N vertices, properties related to Edges and Vertices have
vProp Vector Copy vProp
Compute
vTemp
Compute
vProp to be stored in the memory. The Process-Reduce-Apply execution
involves movement of data between these Graph Engines shown in
Fig. 1. A parallel search in the Edge list looks up the data and neigh-
Figure 2: In-Memory Graph Processing Flow
bours. Thus large data movement happens between Edge List and
Vertex Prop (vprop), Vertex Temp (vtemp) data structures. No move-
ment of data is between the Edge Property (eprop) and Edge list
2.2 Spatial Architectures based Domain during the execution. Similarly, data movement happens between
Specific Accelerators the Edge Property (eprop) and Vertex Prop (vprop), Vertex Temp
Many spatial graph processing architectures/accelerators based (vtemp) data structures. The data-structures will be distributed on
have been proposed in the literature (e.g., ReRAM based RPBFS the spatial architecture, on-chip content addressable memory at
Efficient On-Chip Communication for Parallel Graph-Analytics on Spatial Architectures Conference’17, July 2017, Washington, DC, USA
3 EXPERIMENTAL SETUP
We study the design in a trace-driven simulator to model the func-
tionality for the spatial architecture considered. We modified open-
source GraphMAT [8] to generate traces for our simulation. We Figure 3: Total On-Chip Data Movement (Normalized with
demonstrate on three commonly used graph-processing algorithms: the size of Graph) in Graph Processing of different algo-
Breadth First Search (BFS), Single Shortest Source Path (SSSP), and rithms for Process and Reduce phase is shown.
Page-Rank(PR). The vertex-centric model for each is shown in Table
1. The experiments are run on real world graphs obtained from the
SNAP Datasets [9]. equal (𝑓𝑖,𝑗 ∈ {0, 1}) then, latency (T) of a packet through a network
can be summarized as
4 ANALYSIS OF DATA MOVEMENT 𝑇 = 𝐻 ∗ (𝑇𝑟 + 𝑇𝑤 ) (2)
The amount of data movement (normalized w/ size of graph) is
where H is the hop count and 𝑇𝑟 is the per-hop latency through
shown in Fig. 3 for 3 different algorithms considered in our ex-
the switch and 𝑇𝑤 is the channel link latency. For minimizing the
periments. The amount of data transferred is nearly equal for pro-
average hop count an efficient data mapping is required.
cess and reduce phase, whereas data movement in apply phase is
The in-memory graph structure denoted by M (V,E) consists of Edge
negligible compared to the first two phases. PageRank requires
Table (ET) where edges are denoted by E, Vertex Temp (vtemp),
more data-movement because it takes more number of iterations
Vertex Property (vprop) and Edge Property (eprop) data structures,
to converge, hence data is updated several times. The bulk of data
with data movements between them as described in the previous
movement in Process phase occurs from reading the Edge Table
sections.
and correspondingly transferring data to Vertex Property (vprop)
The "power-law" aware graph partitioning and mapping in each of
and then to Edge Property (eprop) for updates (Figure 1). In reduce
the data structures and their spatial arrangement is described in
phase also, the data movement between arises from updating Tem-
Algorithm 2.
porarily stored vertex data (vTemp) by Edge Property (eprop) data
structure and reading Edge Table (ET) for neighbours (Figure 1).
If additional book-keeping is done for parallel-reduce operations
5.1 Graph Partitioning to Nodes :
(as suggested in [2]) the network traffic in considerably increased The graph is partitioned in a source-cut (edge) fashion. As a first
further. step, the vertices are sorted by descending order of degree, for
Figure 4 shows the skewed distribution of edges on different ver- ease. The higher degree vertices will have more number of edges
tices, where sometimes even less than 10% of vertices are connected (combined) compared to lower degree vertices.
in 90% of the edges. This skewness is even greater in larger graph- For edge-list partitioning, each edge stored in a Node will have
databases. This skewness is explained by power-law which is that one vertex from a small set of higher degree vertices, from the top
the degree distribution of vertices vector n(d) with non-zero entries of sorted vertex list. In other words, the edges from these higher
that follows the relation degree vertices are distributed on to the nodes.
Now, the edges/vertices are allocated to nodes in a load-balancing
𝑛(𝑑) ∝ 1/𝑑 𝛼 (1) aware scheme.
where 𝛼 > 0 is the slope of the power law (in log scale), d is the Load balancing is carried by using modulo scheduling, which
number of edges in the vertex of a graph, and n(d) is the number means among the 𝑣 vertices in sorted vertex list, are distributed in
vertices with a specific degree d. a cyclic fashion to the nodes.
Algorithm 2 Graph Partitioning in Nodes This optimization (in Algorithm 4) is solved computationally
1: Inputs : NoC Topology Graph P (U,F) and M (V,E) with the constraints defined in Algorithm 3 as an ILP.
2: Output : M (V,E) → P (U,F) : 𝑉 → 𝑈 , such that, 𝑣𝑖 ∈ 𝑉 , 𝑢 𝑗 ∈ 𝑈
and map (𝑣𝑖 ) = 𝑢 𝑗 . Algorithm 4 ILP Formulation of the Optimization Objective
3: V’ ← sort all v ∈ 𝑉 in order of out-degree, V ← V’
1: Inputs : NoC Topology Graph P (U,F)
4: Edge-list Partitioning (∀𝑢 ∈ 𝑈 , 𝑢.𝑖𝑛𝑑𝑒𝑥 ∈ {1, 4}):
2: Output : (𝑥𝑖 , 𝑦𝑖 ) ↔ 𝑢𝑖
5: for 𝑣 ∈ 𝑉 %𝑛𝑢𝑚_𝑜 𝑓 _𝑛𝑜𝑑𝑒𝑠 𝑎𝑛𝑑 𝑒 ∋ 𝑒.𝑠𝑟𝑐 = 𝑣 do
H = min 𝑖 𝑗 𝑓𝑖 𝑗 𝑐𝑜𝑠𝑡 ((𝑥𝑖 , 𝑦𝑖 ) ↔ (𝑥 𝑗 , 𝑦 𝑗 ))
Í Í
3:
6: while 𝑢.𝑠𝑖𝑧𝑒 < 𝑢.𝑚𝑎𝑥𝑠𝑖𝑧𝑒 do
4: Where 𝑐𝑜𝑠𝑡 ((𝑥𝑖 , 𝑦𝑖 ) ↔ (𝑥 𝑗 , 𝑦 𝑗 )) depends on the Topological
7: 𝑢 ←𝑒
Distance,
8: 𝑢.𝑟𝑎𝑛𝑘 = 𝑚𝑖𝑛(𝑣)∀𝑣 ∈ 𝑢
5: For 2-D Mesh : 𝑐𝑜𝑠𝑡 ((𝑥𝑖 , 𝑦𝑖 ) ↔ (𝑥 𝑗 , 𝑦 𝑗 )) = (| 𝑥𝑖 − 𝑥 𝑗 | + |
9: Vertex Partitioning (∀𝑢 ∈ 𝑈 , 𝑢.𝑖𝑛𝑑𝑒𝑥 ∈ 2, 3) 𝑦𝑖 − 𝑦 𝑗 |)
10: for 𝑣 ∈ 𝑉 %𝑛𝑢𝑚_𝑜 𝑓 _𝑛𝑜𝑑𝑒𝑠 do 6: For Flattened Butterfly : 𝑐𝑜𝑠𝑡 ((𝑥𝑖 , 𝑦𝑖 ) ↔ (𝑥 𝑗 , 𝑦 𝑗 )) = (| 𝑥𝑖 − 𝑥 𝑗 |
11: while 𝑢.𝑠𝑖𝑧𝑒 < 𝑢.𝑚𝑎𝑥𝑠𝑖𝑧𝑒 do + | 𝑦𝑖 − 𝑦 𝑗 |)
12: 𝑢 ←𝑣
13: 𝑢.𝑟𝑎𝑛𝑘 = 𝑚𝑖𝑛(𝑣)∀𝑣 ∈ 𝑢
Fig. 5 shows the decrease in hop-count from the proposed strat-
egy after finding the node coordinates by solving the ILP, and using
the following order, and a rank is allocated to link corresponding those coordinates in our simulation in a comparison with the ran-
vertex-cut Nodes with different index. domized mapping strategy for the nodes.
Suppose a node 𝑢𝑖 contains data for Vertex Property, then the
𝑟𝑎𝑛𝑘𝑖 indicates the minimum vertex id contained and 𝑖𝑛𝑑𝑒𝑥𝑖 = 0.
Then with the corresponding node 𝑢 𝑗 containing Edge Table (Increasing degree)
with same 𝑟𝑎𝑛𝑘, we have 𝑓𝑖,𝑗 = 1. Edge Table
in Fig 4, where the mapping starts from the center of the cluster. (Increasing degree)
We define regularity constraints on each on the data-structure, 2
E.Prop
Column
Row
Direction
3DReCAM
communication and for efficiently routing traffic. We have proposed
another considerable optimization for reducing communication
Direction
Input Buffer Content
Driver
Addressable 1 bit based on efficient mapping of data NOCs by utilizing "power-law".
Output Buffer Memories ReRAM
Plane
Electrode
Our work is hence, orthogonal to their approach and can be im-
Simple ALU Sample and Hold
SA Matcher
proved by combining their approaches.
Shift and
SA Matcher Previous works have explored mapping static and dynamic tasks
Graph Engine
efficiently to processor-centric NOCs ([14–18]) by using Integer Lin-
ADD SA Matcher
(a) (b) ear Programming, Heuristic Search and Deterministic Search. How-
ever, this has not been explored for memory-centric PIM systems
with high parallelism which have a high programmable complexity.
Figure 6: (a) Node Configuration as presented in GRAM [2] In such systems a regular structure is also required in mapping the
showing I/O Buffers, ALUs and Content Addressable Memo- in-memory data structures to the nodes which is scalable and easy
ries (b) 3DReCAM implementation to implement in the memory controller. This requires good choice
of methodologies for efficient and regular data and processing map-
6 RESULTS ping to network-on-chip for processing-in-memory technologies.
Figure 7: Speedup from our proposed optimization due to lower data movement : for a 3D-ReCAM (Resistive Content Address-
able Memory) Architecture with 2D-Mesh and F-Butterfly NoC Configuration
Figure 8: Energy Consumption Reduction : for a 3D-ReCAM (Resistive Content Addressable Memory) Architecture with 2D-
Mesh and F-Butterfly NoC Configuration
[13] A. B. Kahng, B. Li, L.-S. Peh, and K. Samadi, “Orion 2.0: A power-area simulator [16] T. Maqsood, S. Ali, S. U. Malik, and S. A. Madani, “Dynamic task mapping for
for interconnection networks,” IEEE Transactions on Very Large Scale Integration network-on-chip based systems,” Journal of Systems Architecture, vol. 61, no. 7,
(VLSI) Systems, vol. 20, no. 1, pp. 191–196, 2011. pp. 293–306, 2015.
[14] P. K. Sahu and S. Chattopadhyay, “A survey on application mapping strategies [17] P. K. Sahu, T. Shah, K. Manna, and S. Chattopadhyay, “Application mapping onto
for network-on-chip design,” Journal of Systems Architecture, vol. 59, no. 1, pp. mesh-based network-on-chip using discrete particle swarm optimization,” IEEE
60–76, 2013. Transactions on Very Large Scale Integration (VLSI) Systems, vol. 22, no. 2, pp.
[15] L. Yang, W. Liu, W. Jiang, M. Li, J. Yi, and E. H.-M. Sha, “Application mapping 300–312, 2014.
and scheduling for network-on-chip-based multiprocessor system-on-chip with [18] C. Wu, C. Deng, L. Liu, J. Han, J. Chen, S. Yin, and S. Wei, “An efficient application
fine-grain communication optimization,” IEEE Transactions on Very Large Scale mapping approach for the co-optimization of reliability, energy, and performance
Integration (VLSI) Systems, vol. 24, no. 10, pp. 3027–3040, 2016. in reconfigurable noc architectures,” IEEE Transactions on Computer-Aided Design
of Integrated Circuits and Systems, vol. 34, no. 8, pp. 1264–1277, 2015.