
PIMMoCC: Towards PIM Based on Two-layered Graph Representation of Applications

Abstract—With the increasing demand for big data analytics and data-intensive programs, processing-in-memory (PIM) architectures that integrate 3D-stacked memory layers with one logic layer provide new opportunities to reduce the time and energy of off-chip data communication. Towards that goal, however, the PIM architecture should carefully partition the data while confining data movement across different vaults. In this paper, we introduce PIMMoCC, a novel PIM-based framework that considers the models of computation and communication (MoCC) and the interactions between them for the purpose of energy efficiency and performance. First, from input applications we construct a two-layered graph representing both the model of computation and the model of communication: in the computation layer, nodes denote computation IR instructions and edges denote data dependencies among instructions, while in the communication layer, nodes denote memory operations such as load/store and edges denote memory dependencies detected by alias analysis. Second, we develop an optimization model to partition the graph into interdependent clusters that maximize the workload in each cluster while balancing the load among clusters. Third, we design a scalable HMC-based system where vaults are connected by a network-on-chip (NoC) rather than a crossbar, which allows easy scaling to hundreds of vaults in each cube. We then perform community-to-vault mapping to improve performance and reduce energy. Experimental results confirm that, compared to conventional systems, PIMMoCC with 64 vaults in an HMC improves system performance by at most 9.8x and achieves a 2.3x energy reduction.

I. INTRODUCTION

The era of big data enables programmers to write memory-intensive applications. However, traditional systems are unable to handle such large volumes of data with fast response times because they are designed primarily to execute computations. Once a last-level cache miss is generated, data has to be fetched from main memory via off-chip links, and memory bandwidth becomes a bottleneck for those applications. One technique to address this issue is to bring processing units close to main memory [7], which was proposed decades ago but did not succeed at the time due to design complexity. Nowadays, processing-in-memory (PIM) is regaining popularity because 3D-stacking technologies allow memory layers to be stacked on each other and connected via TSVs (through-silicon vias). The Hybrid Memory Cube (HMC) provided by Micron [5] is an example of a commercial PIM system. As shown in Figure 2, according to the HMC 2.1 specification, one cube contains eight memory layers and one logic layer stacked on top of each other and organized into 32 partitions, which are also called vaults.

However, two key challenges must be solved to exploit the benefits of PIM systems: (1) Where should data reside among different vaults to reduce data movement and utilize internal memory bandwidth? [1] reported that performing 512-way multi-constraint graph partitioning improves the performance of the PIM accelerator due to reduced off-chip network traffic. (2) PIM systems should be scalable to hundreds of vaults in the future. We try to alleviate these challenges by (1) formulating the first challenge as an optimization problem and partitioning the graph to have minimal inter-vault communication, and (2) designing a scalable PIM system with a NoC that efficiently routes packets to the destination vault.

The goal of this paper is to find a data partitioning approach across different vaults in HMC-based systems that exploits high intra-vault memory bandwidth while improving performance and reducing the energy consumption caused by data movement. Therefore, we propose the PIMMoCC framework, which takes into account the interactions between computations and communication. First, through code modification, LLVM IR conversion, dynamic trace generation, reduction, profiling, and graph generation, we represent each C/C++ application as an interdependent two-layered graph where nodes denote LLVM IR instructions, edges denote data/control dependencies between instructions, and edge weights denote how many cycles cores need to wait from one node to another. In this graph, one layer represents the model of computation, in which nodes denote computation operations such as add and mul, while the other layer represents the model of communication, in which all nodes denote communication operations, i.e., load and store. Second, we propose an optimization model to partition the two-layered graph into several clusters of highly interconnected nodes such that data are almost confined within each cluster. In other words, we try to minimize the energy consumption induced by data accesses to other clusters. Third, we connect vaults with a NoC communication substrate rather than the crossbar used in the commercial HMC due to scalability issues, and then map each cluster of interconnected computations/communications onto a vault, making use of the high internal memory bandwidth provided by TSVs. In this design, inter-vault data movement, i.e., edges between different clusters, is handled by a scalable NoC substrate that efficiently routes data to the destination vault, while intra-vault data are stored in and accessed from the DRAM layers connected by TSVs.

In summary, our main contributions are as follows.
- We construct a two-layered graph based on dependent LLVM IR instructions from a C/C++ application. One layer models memory operations, i.e., load and store, while the other layer models computational operations.
- We propose a mathematical optimization model to partition the graph into highly connected clusters to ensure that data movement is confined within each cluster, thereby reducing energy consumption.
- We propose a scalable HMC-based system where vaults are connected via underlying NoC hardware, and we perform community-to-vault mapping to improve performance.

The rest of the paper provides related work on state-of-the-art PIM/NDP, background, details of the PIMMoCC framework, evaluation, and conclusions.
Figure 1: Overview of the PIMMoCC framework. The PIMMoCC framework follows three steps. In step 1, we transform an application into a two-layered graph, with one layer representing the model of communication, where nodes denote memory operations, i.e., load and store, and the other representing the model of computation, where nodes denote non-memory operations such as xor and zext. This transformation is performed through code modification, LLVM IR conversion, dynamic trace generation, reduction, profiling, and graph generation. In step 2, we propose an optimization model to partition the graph into highly connected communities so as to minimize the energy consumption caused by data accesses to other communities. In step 3, we add a router to the logic layer to form a scalable and efficient NoC substrate and perform community-to-vault mapping.
Figure 2: HMC Architecture

II. RELATED WORK

A significant amount of ongoing research [1][8][9][10][11][12][18][22] on PIM and NDP has produced specialized systems for applications such as graph processing and neural networks. Ahn et al. [1] propose Tesseract, a scalable PIM accelerator for parallel graph processing, which utilizes the maximum memory bandwidth, communicates efficiently between different memory partitions, and provides a new programming interface for the new hardware design. Chi et al. [4] propose PRIME, a PIM architecture that speeds up neural network (NN) applications using ReRAM main memory. PRIME partitions a ReRAM bank into memory, full function (FF), and buffer subarrays. While memory subarrays are only able to store data, FF subarrays have both storage and computation capabilities to calculate forward propagation. Nai et al. [18] present GraphPIM, which offloads 18 atomic instructions supported by HMC 2.0 into HMC memory systems for graph computing. Their offloading approach defines a PIM memory region as uncacheable and maps host atomic instructions into this region to bypass the cache system. Gu et al. [10] propose Biscuit, a framework for NDP that filters out extraneous data transferred between CPUs and storage to reduce the amount of data over the network. It allows programmers to write applications that run on host processors and the storage system in a distributed fashion by providing abstract data communication between the host and storage.

III. THE PIMMoCC FRAMEWORK

In this section, as shown in Figure 1, we present our PIMMoCC framework in three steps: (A) Application Transformation, (B) Community Detection, and (C) Community-to-vault Mapping.

A. Application Transformation

Figure 3 shows the high-level flow of application transformation from input C/C++ applications to two-layered graphs. First, each input C/C++ application is converted to LLVM IR instructions by Clang. Then, we modify and use Contech [20] to collect dynamic traces of the application and the latency of memory operations. Next, we remove all IR instructions corresponding to control statements in C such as if-else, for, and while. Finally, we construct a two-layered graph by analyzing data and control dependencies to preserve strict program order.

1) LLVM IR Conversion: We transform each C/C++ application into its corresponding LLVM IR instructions using the Clang compiler: clang -emit-llvm -S.
Figure 3: Overview of application transformation. First, we convert a C program to LLVM IR instructions. Second, we
profile and execute the instructions in order to collect dynamic traces including computations, latency and data size for each
memory operation. Third, we remove control IR statements by identifying a series of patterns. Fourth, we analyze data and
control dependencies between instructions and construct a two-layered graph. Black dotted lines represent memory dependencies
detected by alias analysis.
2) Profiling: We profile the program by instrumenting the lightweight function rdtsc() and some inline code before and after each memory operation to obtain its latency and data size. The weights used in the two-layered graph are calculated as latency times data size. This profiling is architecture independent, but the results can indicate the underlying memory hierarchy and where data reside: the larger the latency and data size, the further the data are from the cores (possibly in the LLC or main memory), since data in memory have to be fetched via off-chip links, which is time-consuming. Therefore, we encode data placement and the memory hierarchy into the weights used in the graph.
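To make the weight computation concrete, the sketch below shows one way such rdtsc()-based instrumentation could look. The helper name weigh_load and the surrounding scaffolding are our own illustrative assumptions, not part of the PIMMoCC tool chain.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <x86intrin.h>   // __rdtsc() with GCC/Clang on x86

// Hypothetical helper: time one load with rdtsc and return the edge weight
// used in the two-layered graph, i.e., latency x data size.
static uint64_t weigh_load(const volatile int* addr, std::size_t bytes) {
    uint64_t start = __rdtsc();
    int value = *addr;                    // the memory operation being profiled
    uint64_t latency = __rdtsc() - start;
    (void)value;
    return latency * bytes;               // weight = latency x data size
}

int main() {
    static int data[1024] = {0};
    // Profile a single load; a real pass would wrap every load/store in the trace.
    uint64_t w = weigh_load(&data[512], sizeof(int));
    std::printf("edge weight (cycles x bytes): %llu\n",
                static_cast<unsigned long long>(w));
    return 0;
}
```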
3) Dynamic Trace Generation & Trace Reduction: We utilize Contech to collect dynamic IR traces. Like full loop unrolling, dynamic traces are aware of how many iterations each loop executes, which leads to fine-grained load balancing when traces are partitioned into clusters mapped to vaults. Furthermore, because dynamic traces capture the execution flow of the application, there is no need to store IR instructions corresponding to control statements in C such as if-else, for, and while. Therefore, we perform trace reduction to lower execution overhead by identifying the patterns associated with control statements and removing them. For example, if statements have the following structure: the first instruction is a load and the second instruction, which depends on the first, is an icmp. Whenever we find this pattern in a basic block consisting of only these two instructions, we remove the basic block. As illustrated in Figure 3, lines 2 and 3 in the third file correspond to the for statement. We check whether this basic block has only two instructions, one representing a load and the other an icmp that depends on the first load instruction. If all requirements are satisfied, we remove lines 2 and 3, as indicated in the fourth file without the green-colored text.
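The following is a simplified illustration of this pattern-based filter over a toy trace representation. The data types are ours, and the dependence check between the load and the icmp is omitted for brevity; it is a sketch of the idea, not the framework's actual pass.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy trace record: one IR instruction with its opcode and basic-block id.
struct TraceInst {
    int block;            // basic block the instruction belongs to
    std::string opcode;   // e.g. "load", "icmp", "add"
};

// Drop every basic block that consists of exactly a load followed by an icmp,
// i.e., the pattern associated with control statements such as for/if.
// (A full implementation would also verify that the icmp depends on the load.)
std::vector<TraceInst> reduce_trace(const std::vector<TraceInst>& trace) {
    std::vector<TraceInst> out;
    for (std::size_t i = 0; i < trace.size(); ) {
        std::size_t j = i;
        while (j < trace.size() && trace[j].block == trace[i].block) ++j;
        bool control_block = (j - i == 2) &&
                             trace[i].opcode == "load" &&
                             trace[i + 1].opcode == "icmp";
        if (!control_block)
            out.insert(out.end(), trace.begin() + i, trace.begin() + j);
        i = j;
    }
    return out;
}
```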
4) Graph Generation: We encode communications, computations, and their interconnected dependencies by constructing two-layered graphs in which one layer represents the model of communication and the other represents the model of computation. Nodes in one layer denote computations, whereas nodes in the other layer denote communications. Edges in the communication layer are detected by alias analysis (in LLVM, we perform alias analysis using -basicaa -aa-eval -print-all-alias-modref-info), whereas the remaining edges are derived from data and control dependencies. As shown in Figure 3, except for lines 8 and 11, which are computation nodes, all nodes belong to the communication layer. We analyze dependencies in two phases. During the first phase, we analyze data and control dependencies among instructions, which correspond to the non-dotted edges in Figure 3. During the second phase, we perform alias analysis to connect the different subcomponents of the communication layer into one graph; the black dotted edges in Figure 3 illustrate these connections.
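The sketch below shows one plausible in-memory representation of such a two-layered graph, with layer 1 holding memory instructions and layer 2 holding computation instructions; the type and field names are ours, not the framework's.

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// One node of the two-layered graph: an LLVM IR instruction tagged with the
// layer it lives in (1 = communication/memory, 2 = computation).
struct GraphNode {
    int layer;            // 1 or 2
    std::string opcode;   // "load", "store", "add", "mul", ...
};

// A weighted directed edge; for memory edges the weight is latency x data size.
struct GraphEdge {
    int src;
    int dst;
    uint64_t weight;
};

struct TwoLayeredGraph {
    std::vector<GraphNode> nodes;
    std::vector<GraphEdge> edges;

    int add_node(int layer, std::string opcode) {
        nodes.push_back({layer, std::move(opcode)});
        return static_cast<int>(nodes.size()) - 1;
    }
    void add_edge(int src, int dst, uint64_t weight) {
        edges.push_back({src, dst, weight});
    }
};

// Example: a load feeding an add, plus a memory edge found by alias analysis.
TwoLayeredGraph tiny_example() {
    TwoLayeredGraph g;
    int ld0 = g.add_node(1, "load");
    int ld1 = g.add_node(1, "load");
    int add = g.add_node(2, "add");
    g.add_edge(ld0, add, 64 * 4);  // data dependence, weight = latency x size
    g.add_edge(ld0, ld1, 64 * 4);  // memory dependence detected by alias analysis
    return g;
}
```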
Figure 4: Building on the two-layered graph we generated in Section A, we partition the graph into interdependent communities, each representing a set of IR instructions to be executed sequentially.

B. Community Detection

Community detection is applied to partition the two-layered graph into interconnected communities that are then mapped to vaults. To this end, we build a community graph, similar to a task graph, where nodes represent communities consisting of a series of instructions executed sequentially, while edges and their weights represent dependencies and the communication cost between communities. The goal of this section is therefore to formulate a mathematical optimization model and partition the graph into communities while balancing the load among them.

Before formulating the optimization model, we introduce two formal definitions for the input and output graphs.

Definition 1: A two-layered graph TG is a weighted directed graph G1 = TG(n_i, l_i, e_ij, w_ij | i, j ∈ {1, ..., N}; l_i ∈ {1, 2}) where nodes n_i in layer l_i (1 or 2) represent memory or non-memory instructions; edges e_ij represent dependencies found by alias analysis in the memory layer (l_i = 1) as well as data/control dependencies; and edge weights w_ij represent latency times data size for memory instructions.

Definition 2: A community graph CG is a weighted directed graph G2 = CG(V_i, d_i, E_ij, W_ij | i, j ∈ {1, ..., N}) where nodes V_i represent a series of IR instructions to be executed sequentially, which we call communities; edges E_ij represent dependencies between communities; edge weights W_ij represent the communication cost from node i to node j; and the depth d_i represents the largest number of hops from node i to the root, which is considered the starting point. (Note that the depth of node 2 should be 2 rather than 1 because the longest path is {2, 3, 1}; the depth can be found using levelization.)
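As a concrete illustration of how such depths can be computed by levelization, the sketch below performs a topological sweep over a small community graph and records the longest hop count from the starting communities; the data layout and the edge-direction convention are our own simplifications and may differ from the framework's.

```cpp
#include <algorithm>
#include <vector>

// Community graph as an adjacency list: edges[u] lists the communities that
// depend on u. depth[i] = longest number of hops from a starting community.
std::vector<int> levelize(const std::vector<std::vector<int>>& edges) {
    int n = static_cast<int>(edges.size());
    std::vector<int> indeg(n, 0), depth(n, 0);
    for (int u = 0; u < n; ++u)
        for (int v : edges[u]) ++indeg[v];

    // Kahn-style topological sweep, relaxing depths along the way.
    std::vector<int> ready;
    for (int u = 0; u < n; ++u)
        if (indeg[u] == 0) ready.push_back(u);
    while (!ready.empty()) {
        int u = ready.back();
        ready.pop_back();
        for (int v : edges[u]) {
            depth[v] = std::max(depth[v], depth[u] + 1);
            if (--indeg[v] == 0) ready.push_back(v);
        }
    }
    return depth;
}
```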
Based on these definitions, we formulate an optimization model as follows: given a two-layered graph TG, find communities which maximize the following function

max F = Q - \lambda R    (1)

Q = \frac{1}{2W} \sum_{i,j} \left[ W_{ij} - \frac{s_i s_j}{2W} \right] \delta(C_i, C_j)    (2)

R = \frac{1}{2W} \sum_{\substack{1 \le u \le n_c \\ 1 \le v \le n_c \\ u \ne v}} |W_u - W_v| \, \delta(d_u, d_v)    (3)

where W is the sum of the total weights in TG; W_i is the sum of weights in community i; W_ij is the weight between nodes i and j; s_i, the strength of node i, is the sum of the weights of the edges adjacent to node i; C_i is the community to which node i belongs; n_c is the number of communities; d_i is the depth of community i; δ(i, j) equals 1 when i = j; and λ controls the importance of load balancing.

The function Q measures the difference between the sum of weights within a community, W_in = \sum_{i,j} W_{ij} \delta(C_i, C_j), and that adjacent to the community, W_adj = \sum_{i,j} \frac{s_i s_j}{2W} \delta(C_i, C_j). By maximizing F, Q is also maximized; therefore W_in, which represents the workload inside a community, increases, while W_adj, which represents the communication cost, decreases, leading to limited data movement. The function R quantifies the load balance at any depth. As shown in Figure 4, communities at the same depth can be executed in parallel, so the loads of those communities should be balanced to reduce the overhead of cores idling. W_u computes the load of community u, and δ(d_u, d_v) ensures that communities u and v are at the same depth. Hence, to maximize F, R should be minimized, which drives W_u and W_v toward equal values at the same depth (δ(d_u, d_v) = 1).
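For clarity, the sketch below evaluates F = Q - λR for a given partition, following Eqs. (1)-(3) directly. It is a scoring helper under our own data-layout assumptions (a dense weight matrix summed over ordered pairs as 2W, in the usual modularity convention), not the search procedure itself; maximizing F still requires a separate community-detection heuristic.

```cpp
#include <cmath>
#include <vector>

// Evaluate F = Q - lambda * R (Eqs. 1-3) for a given partition.
// w[i][j]   : edge weight between nodes i and j (0 if no edge; at least one edge assumed)
// comm[i]   : community of node i
// commDepth : depth d_u of each community
// commLoad  : W_u, the sum of weights inside each community
double objective(const std::vector<std::vector<double>>& w,
                 const std::vector<int>& comm,
                 const std::vector<int>& commDepth,
                 const std::vector<double>& commLoad,
                 double lambda) {
    const int n = static_cast<int>(w.size());
    std::vector<double> strength(n, 0.0);   // s_i
    double twoW = 0.0;                      // 2W: sum of w_ij over all ordered pairs
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            strength[i] += w[i][j];
            twoW += w[i][j];
        }

    double Q = 0.0;
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            if (comm[i] == comm[j])                       // delta(C_i, C_j)
                Q += w[i][j] - strength[i] * strength[j] / twoW;
    Q /= twoW;

    double R = 0.0;
    const int nc = static_cast<int>(commLoad.size());
    for (int u = 0; u < nc; ++u)
        for (int v = 0; v < nc; ++v)
            if (u != v && commDepth[u] == commDepth[v])   // delta(d_u, d_v)
                R += std::fabs(commLoad[u] - commLoad[v]);
    R /= twoW;

    return Q - lambda * R;
}
```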
C. Community-to-vault Mapping

Figure 5: Community-to-vault mapping

In this section, based on the community graph, we build a scalable PIM system and map communities to vaults.

1) Scalable PIM System: Some memory-intensive applications require more memory to store huge amounts of data. To increase memory capacity in PIM systems, multiple HMCs are typically connected via high-speed Serializer/Deserializer (SerDes) links to provide high memory bandwidth. However, SerDes links consume almost half of an HMC's power [1][18][19]. To save the power spent on SerDes links, we propose a scalable PIM system with a NoC in the logic layer that efficiently routes packets to the destination vault instead of the crossbar used in the HMC, as shown in Figure 1. Therefore, to provide more memory, we simply add extra vaults with routers to this design instead of connecting additional HMCs via SerDes links, thereby reducing energy consumption.

2) Mapping: We propose Algorithm 1 to map the communities detected in Section B to the available vaults in the scalable PIM system. First, we rank the priorities of the communities: communities at a lower depth get higher priorities, and among communities at the same depth, the one with the higher communication cost gets the higher priority. For example, in Figure 4, the starting community at depth 0 gets the highest priority. If the communication costs of communities 3, 4, and 5 are 102, 12, and 27 respectively, then the priority order at this depth is 3 > 5 > 4. After priority assignment, we map communities onto the NoC in a greedy way: more important communities (with higher priorities) take up the better locations, i.e., closer to the center of the chip.

Algorithm 1: Community-to-vault Mapping
Input: The community graph (CG)
Output: A set of tuples T indicating which community maps to which vault
1: /* Priority Assignment */
2: PriorityQueue = ()
3: nodes = CG.nodesWithoutInEdges
4: Sort nodes by communication cost in descending order
5: PriorityQueue.append(nodes)
6: /* Community-to-vault Mapping */
7: Mapping = ()
8: for node ∈ PriorityQueue do
9:   if node is the starting point then
10:    place = the center of the mesh-based NoC
11:  else
12:    place = closest available vault to the parent node it depends on (greedy)
13:  end if
14:  Mapping.append((node, place))
15: end for
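Below is a compact sketch of how Algorithm 1's greedy placement could be realized on a k x k mesh. The priority model and tie-breaking follow the description above, but all identifiers are ours, community ids are assumed to be 0..n-1, and the number of communities is assumed not to exceed the number of vaults.

```cpp
#include <algorithm>
#include <cstdlib>
#include <utility>
#include <vector>

struct Community {
    int id;
    int depth;        // depth in the community graph
    double commCost;  // total communication cost (used for priority)
    int parent;       // community it depends on, -1 for the starting point
};

// Greedy community-to-vault mapping on a k x k mesh NoC.
std::vector<std::pair<int, int>> map_to_vaults(std::vector<Community> comms, int k) {
    // Higher priority = lower depth, then higher communication cost.
    std::sort(comms.begin(), comms.end(), [](const Community& a, const Community& b) {
        if (a.depth != b.depth) return a.depth < b.depth;
        return a.commCost > b.commCost;
    });

    std::vector<bool> used(k * k, false);
    std::vector<int> vaultOf(comms.size(), -1);   // community id -> vault
    std::vector<std::pair<int, int>> mapping;     // (community id, vault)

    // Closest free vault (Manhattan distance on the mesh) to a target vault.
    auto place_near = [&](int target) {
        int best = -1, bestDist = 1 << 30;
        for (int v = 0; v < k * k; ++v) {
            if (used[v]) continue;
            int d = std::abs(v / k - target / k) + std::abs(v % k - target % k);
            if (d < bestDist) { bestDist = d; best = v; }
        }
        return best;
    };

    const int center = (k / 2) * k + k / 2;
    for (const Community& c : comms) {
        int anchor = (c.parent < 0 || vaultOf[c.parent] < 0) ? center : vaultOf[c.parent];
        int vault = place_near(anchor);
        used[vault] = true;
        vaultOf[c.id] = vault;
        mapping.push_back({c.id, vault});
    }
    return mapping;
}
```

With the example above (communication costs 102, 12, and 27 at one depth), the sort places community 3 before 5 and 5 before 4, matching the 3 > 5 > 4 order in the text.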
Table I: Configuration parameters
Processor:
  Cores: In-order, 16 MSHRs
  L1 private caches: 64KB, 4-way associative, 32-byte blocks
  L2 shared caches: 256KB, distributed
Memory Layer:
  Configuration: 16 GB cube, 64 vaults, 8 DRAM layers
  Internal Bandwidth: 16 GB/s per vault
  DRAM Timing: DDR3-1600
  Scheduling Policy: FR-FCFS
Network:
  Topology: Mesh
  Routing Algorithm: XY routing
  Flow Control: Virtual channel flit-based

Table II: Benchmarks and descriptions
  BS: Calculate European options (PARSEC [2])
  SC: Solve the online clustering problem (PARSEC [2])
  BP: Back-propagation (Rodinia [3])
  KM: K-means (Rodinia [3])
  MD: Simulate molecular dynamics (OmpSCR [6])
  FFT: Compute Fast Fourier Transform (OmpSCR [6])
  MDB: Calculate Mandelbrot Set (OmpSCR [6])

IV. EVALUATION
A. System Configuration
1) DDR3: We utilize 64 in-order cores to model a DDR3-based system. Each core has 64KB private L1 caches and 256KB distributed shared L2 caches, as shown in Table I, with a memory controller connecting to the memory subsystem, i.e., DDR3. This system is the baseline of our evaluation.

2) HMC: Table I shows the configuration parameters of our evaluated scalable HMC-based system, which includes 64 vaults with eight memory layers and one logic layer. In the logic layer, each vault contains the same cores used in the DDR3-based system and a NoC to connect them. To further evaluate different data partitioning schemes, we apply METIS [15], a multi-way graph partitioning scheme, and our proposed community detection (CD) to partition applications into clusters that are mapped onto different vaults in our system.
B. Simulation Configuration

Figure 6: Simulation Flow

We use Contech [20] as the frontend functional simulator to generate dynamic LLVM traces from C/C++ applications, and we write a compiler-based parser to construct the two-layered graph and perform community detection to partition the graph into clusters. We model 3D-stacked memory layers that follow the HMC 2.1 specification [5] using ASMSim [21] and the NoC communication substrate using Booksim2 [14] as backend timing simulators. Both simulators are cycle-accurate and trace-based (Booksim2 supports Netrace traces as simulation input). The simulation flow is shown in Figure 6. Table II lists the 7 benchmarks we use to validate the system.

For our energy evaluation, we model the energy consumption of the caches in the cores using CACTI 6.0 [17] and compute the energy of memory layer accesses, which is 3.7 pJ/bit [19], assuming memory operations dominate. Next, following [13], we derive the total energy consumption of a transaction from node i to node j as follows:

E_{ij} = N \left( n_{hops} E_{router} + (n_{hops} - 1) E_{flit} \right)

where N, n_hops, E_router, and E_flit represent the number of bits to be transferred, the number of hops, and the energy consumption of routers and of flit transfers, respectively. We assume that the interconnect consumes 2 pJ/bit for flit transfer (E_flit) and 1.5 pJ/bit for routers to process flits (E_router) [16].
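A direct transcription of this per-transaction energy model into code might look like the following; the constants are the ones quoted above and the function name is ours.

```cpp
#include <cstdio>

// Energy (in pJ) of moving one transaction of `bits` bits across `hops` routers
// on the mesh, following E_ij = N * (n_hops * E_router + (n_hops - 1) * E_flit).
double transaction_energy_pj(double bits, int hops,
                             double e_router_pj_per_bit = 1.5,
                             double e_flit_pj_per_bit   = 2.0) {
    return bits * (hops * e_router_pj_per_bit + (hops - 1) * e_flit_pj_per_bit);
}

int main() {
    // Example: a 64-byte (512-bit) packet crossing 3 hops.
    std::printf("%.1f pJ\n", transaction_energy_pj(512.0, 3));
    return 0;
}
```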
C. Experimental Results

Figure 7: Speedup comparison among DDR3, HMC+METIS, and HMC+CD

1) Performance: Figure 7 compares the speedup of the DDR3-based and HMC-based systems. The embarrassingly parallel application (i.e., MDB) and applications such as MD and BS may not benefit much from PIM, because their off-chip memory bandwidth usage is low compared to what PIM can provide. Therefore, their speedup is only at most 4x over the DDR3-based system. However, for applications such as SC that require high off-chip memory bandwidth, our proposed HMC-based system can provide 1 TB/s, whereas DDR3-based systems provide no more than 100 GB/s. The distinction may be even more pronounced if we increase the number of vaults per cube. As a result, the speedup improvement for SC is 9.8x over DDR3-based systems.

In the HMC, we adopt METIS and CD to partition the graph into interconnected clusters. However, for embarrassingly parallel programs, our graph representation cannot guarantee that the clusters after graph partitioning are independent of each other. Therefore, the performance improvement of applications such as MD and MDB is at most 1x compared to DDR3-based systems, where such programs are easy to parallelize using threads. Nevertheless, our graph partitioning scheme outperforms METIS because it tries to minimize communication while balancing the load.
Figure 8: NoC traffic with different parallelism approaches

2) NoC Traffic: Figure 8 illustrates the normalized NoC traffic for threads/OpenMP and for our proposed graph partitioning. For applications such as MD and MDB, there are few data dependencies among threads, while the clusters in our graph representation remain interconnected. Therefore, the NoC traffic for those applications degrades somewhat compared to threads running almost independently on the cores. However, for most non-embarrassingly parallel applications, community detection tries its best to confine data movement within a cluster, leading to lower energy.
Figure 9: Comparison of normalized energy consumption among DDR3, HMC, and PIMMoCC

3) Energy Consumption: Figure 9 shows the comparison of normalized energy consumption among the DDR3, HMC, and PIMMoCC systems. The computation energy (green bars in Figure 9) remains the same for all systems, while HMC improves the energy spent on off-chip links compared to DDR3, since HMC has higher off-chip memory bandwidth and a shorter distance between cores and memory, resulting in shorter execution times. PIMMoCC further improves energy consumption compared to HMC because we apply community detection to partition the graph into clusters that minimize data communication between vaults. In other words, NoC traffic is reduced for most applications except MD and MDB, so the NoC energy consumption (yellow bars in Figure 9) improves significantly.
V. CONCLUSIONS

In this paper, we present PIMMoCC, an optimization framework that finds the best data partitioning scheme for PIM systems to improve performance and energy consumption. PIMMoCC exploits the high memory bandwidth (about 1 TB/s) of PIM systems by (1) representing each application as a two-layered graph where, in the computation layer, nodes denote computation instructions and edges denote data dependencies, and in the communication layer, nodes denote load/store instructions and edges are formed by alias analysis; (2) performing community detection to find interconnected clusters, ensuring that data movement is almost confined within each cluster and that workloads among clusters are balanced; and (3) designing a scalable PIM system where vaults are connected via a NoC rather than a crossbar and mapping clusters to vaults in a greedy fashion. Our evaluation with 64 vaults and one in-order core per vault demonstrates a performance improvement of at most 9.8x compared to traditional DDR3-based systems and 0.38x compared to PIM systems with METIS graph partitioning. The energy consumption improvement is 2.3x over a PIM system without community detection, as PIMMoCC reduces the NoC traffic between different vaults.

REFERENCES

[1] Junwhan Ahn et al. "A scalable processing-in-memory accelerator for parallel graph processing." In: ISCA.
[2] C. Bienia et al. "The PARSEC benchmark suite: Characterization and architectural implications." In: PACT.
[3] Shuai Che et al. "Rodinia: A benchmark suite for heterogeneous computing." In: IISWC. 2009.
[4] Ping Chi et al. "PRIME: A novel processing-in-memory architecture for neural network computation in ReRAM-based main memory." In: ISCA. 2016.
[5] Hybrid Memory Cube Consortium. "Hybrid memory cube specification 2.1." 2013.
[6] A. J. Dorta et al. "The OpenMP source code repository." In: PDP. 2005.
[7] Jeff Draper et al. "The architecture of the DIVA processing-in-memory chip." In: ICS. 2002.
[8] Mingyu Gao et al. "Practical near-data processing for in-memory analytics frameworks." In: PACT. 2015.
[9] Mingyu Gao and Christos Kozyrakis. "HRL: Efficient and flexible reconfigurable logic for near-data processing." In: HPCA. 2016.
[10] Boncheol Gu et al. "Biscuit: A framework for near-data processing of big data workloads." In: ISCA. 2016.
[11] Kevin Hsieh et al. "Accelerating pointer chasing in 3D-stacked memory: Challenges, mechanisms, evaluation." In: ICCD. 2016.
[12] Kevin Hsieh et al. "Transparent offloading and mapping (TOM): Enabling programmer-transparent near-data processing in GPU systems." In: ISCA. 2016.
[13] Jingcao Hu and Radu Marculescu. "Energy- and performance-aware mapping for regular NoC architectures." In: IEEE TCAD. 2005.
[14] Nan Jiang et al. "A detailed and flexible cycle-accurate network-on-chip simulator." In: ISPASS. 2013.
[15] George Karypis and Vipin Kumar. "A fast and high quality multilevel scheme for partitioning irregular graphs." In: SISC. 1998.
[16] Gwangsun Kim et al. "Memory-centric system interconnect design with hybrid memory cubes." In: PACT.
[17] Naveen Muralimanohar et al. "CACTI 6.0: A tool to model large caches." HP Laboratories. 2009.
[18] Lifeng Nai et al. "GraphPIM: Enabling instruction-level PIM offloading in graph computing frameworks." In: HPCA. 2017.
[19] Seth H. Pugsley et al. "NDC: Analyzing the impact of 3D-stacked memory+logic devices on MapReduce workloads." In: ISPASS. 2014.
[20] Brian P. Railing et al. "Contech: Efficiently generating dynamic task graphs for arbitrary parallel programs." In: TACO. 2015.
[21] Lavanya Subramanian et al. "The application slowdown model: Quantifying and controlling the impact of inter-application interference at shared caches and main memory." In: MICRO. 2015.
[22] Dongping Zhang et al. "TOP-PIM: Throughput-oriented programmable processing in memory." In: HPDC.
