
Microelectronics Journal 42 (2011) 1380–1390. doi:10.1016/j.mejo.2011.09.013


3D NOC for many-core processors


Aamir Zia, Sachhidh Kannan*, H. Jonathan Chao, Garrett S. Rose
Polytechnic Institute of New York University, Electrical and Computer Engineering, 5 Metrotech Center, Brooklyn, NY 11201, United States
* Corresponding author. E-mail addresses: zaamir@gmail.com (A. Zia), skanna01@students.poly.edu (S. Kannan), chao@poly.edu (H. Jonathan Chao), garrettrose@ieee.org (G.S. Rose).

Article history: Received 25 April 2011; received in revised form 21 September 2011; accepted 23 September 2011; available online 22 October 2011.

Keywords: 3D integration; CMP; Clos network; Through-silicon vias; Scalability; Low-power

Abstract

With an increasing number of processors forming many-core chip multiprocessors (CMP), there is a need for an easily scalable, high-performance and low-power intra-chip communication infrastructure for emerging systems. In CMPs with hundreds of processing elements, 3D integration can be utilized to shorten the long wires that form communication links. In this paper, we propose a Clos network-on-chip (CNOC), in conjunction with 3D integration, as a viable network topology for many-core CMPs. The primary benefits of 3D CNOC are scalability and a clear upper bound on power dissipation. We present the architectural and physical design of the 3D CNOC and compare its performance with several other topologies (fat tree, flattened butterfly and mesh), showing that the power consumption of a 3D CNOC increases only minimally as the network size is scaled from 64 to 512 nodes relative to the other topologies. Furthermore, in a 512-node system, the 3D CNOC consumes about 15% less average power than any other topology. We also compare 3D partitioning strategies for these topologies and discuss their effect on wire delay and the number of through-silicon vias.

© 2011 Elsevier Ltd. All rights reserved.

1. Introduction

Chip multiprocessors (CMP) have become ubiquitous to the extent that their use has extended from high-performance computers to multi-purpose desktop computers. As CMOS transistor technology scaling continues, there is a push towards accommodating more and more processing elements (PE) on a single die. This trend has led to the evolution of many-core CMPs in which tens to several hundreds of processing elements, or cores, are integrated onto a single chip [1,2]. Concurrently, as the number of PEs has continued to increase, interconnect limitations have become more pronounced in modern microprocessors. Traditionally, dedicated buses have formed the interconnection fabric in CMPs, but with a growing number of PEs, more sophisticated power-efficient on-chip interconnection networks are required to meet delay and power constraints. These requirements have led to the emergence of network-on-chip (NOC) interconnection options in which packets of data are routed between many on-chip components [3–6]. NOC is the on-chip communication paradigm in which chip modules (called nodes in this context) exchange data through a packet-switching network rather than through wires that directly connect these modules. Data, in the form of packets, is routed from a source node to a destination by traveling through a series of routers and links.

Most NOC topologies employed in CMPs consist of routers with a small number of ports (low radix). This avoids a proportional increase in router power and delay with an increasing number of ports. For example, a low-radix mesh network topology has been presented in [7] and a reduced unidirectional fat-tree topology has been considered in [8]. Another attraction of such low-radix topologies, when employed in small networks, is that they facilitate simple network organization with low latency. However, in networks with hundreds of PEs, there is a corresponding increase in the number of routers, so such networks suffer from high power dissipation and a large network diameter (i.e., a large average minimum distance between nodes), especially if a majority of the traffic is not limited to nearby nodes. A possible solution that minimizes both the number of routers and the network diameter is to use high-radix routers. As the network size grows, topologies with high-radix routers become more feasible than low-radix topologies in terms of both power and latency. On the other hand, very high-radix routers suffer from increased router latency, due to the growing size of the required crossbar switch, and from higher power dissipation. Thus, the choice of an optimum router radix in many-core CMPs is a trade-off between router radix and network diameter.

Several high-radix NOC topologies have been proposed in recent years. These include the flattened butterfly (FBfly) [9], concentrated mesh (CMesh) [10], fat tree [11] and SPIN [12].
However, FBfly and CMesh do not scale very well, since the router size increases linearly with any increase in PE count and dimensionality. Fat-tree networks are attractive for bounding the router radix as the network scales up; however, large fat-tree networks need a large number of routers, and they also need increasing bandwidth (and consequently more wires) on the upper branches of the tree as more nodes are added.

As CMPs built around an NOC with high-radix routers scale to accommodate hundreds of cores, long interconnection wires also become commonplace, which can lead to prohibitive increases in wire delay. Traditionally, repeaters are added along long wires to improve their delay, but an increasing number of repeaters results in increased chip area and power dissipation. With the use of three-dimensional (3D) integration [13] to shorten the long interconnects, we can alleviate this problem. 3D integration has emerged as a promising technology for high-performance digital integrated circuits by improving wire delays and, consequently, power dissipation. There are several ways of improving CMP performance using 3D integration. These include integrating processor and cache to decrease cache access time and power dissipation [14], increasing the bandwidth between processor and cache [15], and splitting the processor core logic among stacked dies [16]. Recently, several research efforts have also investigated 3D integration for performance improvement in NOCs. A 3D NOC architecture is proposed in [17] with dedicated buses between mesh networks on each layer (or tier) of the 3D IC. Several techniques are presented in [18] for organizing 3D mesh networks, and their zero-load latency is compared with their 2D counterparts. Feero and Pande have evaluated the performance of 3D NOC architectures with SoC PEs [19]. In [20], the authors present a 3D NOC design with low-radix routers, but that work considers a very small number of nodes and the design does not scale efficiently. In [21,22], the authors propose 3D partitioning of the router architecture instead of the interconnection fabric. It has been shown that a performance improvement of up to 40% and a decrease of up to 62% in power consumption can be achieved by 3D NOCs as compared to planar NOC topologies. None of the current literature, however, directly addresses design issues arising from the scaling of CMPs to hundreds of nodes. In this paper, we propose a simple and highly scalable 3D NOC design for many-core chip multiprocessors. In particular, we present the design of a 512-node CMP built around a Clos NOC. A Clos network is a type of multi-stage interconnection network, explained in detail in Section 3. Since the implementation of such a large system in a single planar chip is impractical, we perform 3D stacking to reduce chip area and overcome issues resulting from very long wires. The performance of several NOC topologies, namely the 3D mesh, flattened butterfly and fat tree, is analyzed and compared with our design.

The paper is organized as follows. Section 2 briefly describes 3D integration technology. In Section 3, our proposed high-radix 3D NOC architecture is discussed. In Section 4, models for the NOC fabrics are presented. Section 5 presents the experimental results and Section 6 concludes the paper.

2. 3D integration

3D integration technology allows stacking and bonding of semiconductor dies to form an electrically connected vertical stack of integrated circuits, effectively forming a single 3D chip. Although there are several ways of implementing 3D integration, the most promising technique is stacking and bonding independent wafers on top of each other. Electrical connections between stacked dies are realized using through-silicon vias (TSVs), also known as 3D vias. In current 3D integration technologies, these 3D vias can be as narrow as a few micrometers, with vertical lengths of a few tens of micrometers [23]; furthermore, they can be placed very close together. Utilization of the third dimension in integrated circuits allows additional flexibility in routing and floorplanning. As a result, three-dimensional integrated circuits offer smaller chip area and shorter interconnects, resulting in lower wire delays and less power dissipation than conventional planar integrated circuits. In addition, 3D integration provides high packaging density and a smaller chip footprint. 3D technology can also be used to integrate heterogeneous technologies on the same chip: different types of circuits (analog, digital, RF) can be built separately on wafers with different process technologies before being integrated into a 3D stack [24].

As discussed earlier, the primary benefit of 3D integration is the improvement in chip area and wire length; improvements in power consumption and wire latency follow from the reduction in wire length. Furthermore, the availability of short and dense 3D vias in the form of TSVs enables the realization of wide buses that would otherwise incur prohibitive power and area costs if implemented in planar chips [25]. In general, systems with regular structures are more suitable for implementation in a 3D stack, since they often have long, heavily loaded wires that can be conveniently partitioned into 3D, resulting in visible performance improvements. NOCs, such as those in many CMPs, are highly regular and have symmetric structures that can benefit immensely from 3D integration. In particular, long wires in tree-based NOC topologies can easily be shortened using TSVs, and mesh-based topologies, if implemented in 3D, can benefit from a reduced number of hops and a smaller network diameter.

Although several advantages exist, designing 3D chips involves its own peculiar challenges [26]. These include increased power density at locations where multiple layers of dense logic are stacked, and possible yield issues. In wafer-level 3D integration, yield issues can arise during wafer alignment, bonding and 3D via etching. Thermal issues can arise because chip temperature is likely to increase with current density where modules with active logic are placed on top of each other. Consequently, 3D chip design involves careful consideration of thermal and reliability issues. Another issue is the optimal placement of TSVs. Due to yield and manufacturability considerations, current 3D technologies require TSVs to be at least some minimum distance away from active transistors and planar wires. In the context of 3D NOCs, the layout includes both planar and 3D wires to be routed through a given area forming the routing channel. A key challenge in 3D NOC floorplanning is, therefore, to place TSVs in such a way that they have minimal impact on wire density within the routing channel. For the rest of this paper we assume the use of the 3D technology process from MIT Lincoln Labs [27].

3. NOC architecture for many-core CMPs

A number of topologies can be employed in NOC fabrics for many-core CMPs, and the choice of a particular topology in any given case depends on the application. Mesh-based topologies are very popular in CMPs and, at least theoretically, can be scaled to as many PEs as needed; however, the network diameter increases severely as more hops are added for the additional PEs. Tree-based topologies can also be easily scaled to add more PEs. Fat-tree and flattened-butterfly topologies are also popular candidates for many-core CMPs. However, the former requires more routers of the same radix for the same size of network.
Alternatively, for the flattened butterfly, the router radix must increase as the NOC scales up. As mentioned earlier, a very high router radix can make the NOC prohibitively slow and power hungry, primarily due to the need for very large crossbar structures. A compromise would be a topology that provides a small network diameter (with a relatively small number of routers) while keeping the router radix within some acceptable bound. With these considerations in mind, we propose the Clos NOC (CNOC) as a plausible choice for many-core CMPs. A Clos network, which can also be viewed as a generalized Benes network, is a type of multi-stage interconnection network that can be made non-blocking using relatively small crossbars [28], as compared to a network in which the entire system is a single crossbar. In many-core CMPs with hundreds of PEs, the size of the crossbar becomes a major limiting factor in obtaining a non-blocking interconnection network. For example, if 1 GHz is chosen as the target operating frequency of the CMP, a non-blocking Clos network with a reasonable router radix can easily accommodate hundreds of network nodes while meeting delay and throughput requirements. A reasonable compromise is to use 8 x 8 routers, which can operate at 1 GHz in 45 nm technology [29]. Fig. 1(a) illustrates a 64-node, 3-stage CNOC, which forms the baseline structure in our design; Fig. 1(b) shows its logically equivalent folded notation. All links in the topology are unidirectional rather than bidirectional, which means that every flit travels exactly three hops to reach its destination. The routers in the first, second and third stages are denoted input modules (IMs), center modules (CMs) and output modules (OMs), respectively. The first stage of the switch fabric consists of eight IMs, each of which is connected to all eight CMs in the second stage. The third stage consists of eight OMs, each of which is also connected to all eight CMs. In Fig. 1, all routers have eight input ports and eight output ports, and packets always travel from the IMs to the OMs.
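To make the wiring pattern concrete, the sketch below builds the connectivity of the 64-node, 3-stage CNOC just described (eight IMs, eight CMs, eight OMs, unidirectional links, radix-8 routers) and checks that every source–destination pair is reached in exactly three router hops. The function and variable names are illustrative only and are not taken from the authors' design flow.

```python
# Sketch of the 64-node, 3-stage Clos NOC connectivity (8 IMs, 8 CMs, 8 OMs).
# Each IM serves 8 source nodes and connects to every CM; each CM connects to
# every OM; each OM serves 8 destination nodes.  Names are illustrative.

RADIX = 8  # 8x8 routers

def route(src, dst, cm_choice=0):
    """Return the router path (IM -> CM -> OM) for one packet."""
    im = src // RADIX           # input module serving the source node
    om = dst // RADIX           # output module serving the destination node
    cm = cm_choice % RADIX      # any center module works (non-blocking choice)
    return [("IM", im), ("CM", cm), ("OM", om)]

# Every source/destination pair traverses exactly three routers.
assert all(len(route(s, d)) == 3 for s in range(64) for d in range(64))

if __name__ == "__main__":
    print(route(5, 42))   # e.g. [('IM', 0), ('CM', 0), ('OM', 5)]
```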
3.1. Scaling to 512 nodes

The 64-node CNOC shown in Fig. 1(a) and (b) can easily be scaled to 128 nodes by adding two more stages to the network, as shown in Fig. 1(c). Here there are two blocks, where each block consists of 64 nodes in a five-stage network, and the two blocks are interconnected through the wires between stages 2, 3 and 4. Continuing in a similar way, a 512-node network is obtained by interconnecting eight blocks, as shown in Fig. 2. Since all links are unidirectional, all flits traverse exactly five hops from source to destination. The routers forming the middle three stages are denoted CM1, CM2 and CM3, respectively.

The chosen CNOC router architecture is a baseline virtual channel (VC) router [30]. Fig. 3 shows the router architecture with multiple input components (one for each input port), multiple output components (one for each output port), a crossbar switch fabric, a VC allocator and a switch allocator.

3.2. 3D floorplanning of the CNOC

Implementing a 512-node CMP in a planar 2D chip would be impractical, since it would result in very long global wires in addition to very large chip area and power dissipation. In order to circumvent these issues, 3D integration technology is employed. The 512-node CNOC, shown in Fig. 4, is readily realized as a 3D chip by stacking the eight blocks shown in the figure on separate tiers.

Fig. 1. (a) 64-node CLOS NOC; (b) its folded notation; (c) 128-node folded CNOC.

Fig. 2. Configuration of the 512-node, 5-stage CLOS NOC (blocks 1–8; stages IM, CM1, CM2, CM3, OM).


Fig. 3. Baseline VC router architecture [30].

Fig. 4. 3D partitioning and layout of the 512-node CNOC (eight tiers, Layer 1–Layer 8).

Fig. 5. Partitioning of a 512-node, 5-stage CLOS network: estimated wirelength (x1000 mm) versus number of layers (1–10), with the minimum at eight layers.
This results in a 512-node, 8-tier 3D CMP with no significant difference in footprint compared to a 64-node CMP.

While determining the number of layers into which to partition the design, we employ a wire length estimation tool that gives an estimate of the total interconnect wire length for a given number of layers [31]. We then pick the layer assignment that gives the minimum wire length. Fig. 5 demonstrates this approach for a 512-node Clos network; similar results were obtained for all topologies. It is interesting to note that the wire length is reduced by adding layers up to eight layers, after which it increases because the TSV lengths are no longer negligible.
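The layer-count decision described above can be expressed as a simple search: evaluate the wire-length estimate for each candidate number of layers and keep the minimum. The sketch below assumes a hypothetical wirelength_estimate() callback standing in for the estimation tool of [31]; the toy cost model in the example is invented purely to produce an interior minimum and is not the paper's data.

```python
# Minimal sketch of the layer-count selection step: evaluate an external
# wire-length estimator for 1..max_layers and keep the layer count with the
# smallest total estimate.  `wirelength_estimate` is a hypothetical stand-in
# for the estimation tool of [31].

def choose_layer_count(wirelength_estimate, max_layers=10):
    estimates = {n: wirelength_estimate(n) for n in range(1, max_layers + 1)}
    best = min(estimates, key=estimates.get)
    return best, estimates

if __name__ == "__main__":
    # Toy model with arbitrary constants: planar wiring shrinks roughly as 1/n
    # while TSV length grows linearly with n, so an interior minimum appears.
    toy = lambda n: 150_000.0 / n + 2_350.0 * n   # arbitrary units
    best, _ = choose_layer_count(toy)
    print("layer count with minimum estimated wirelength:", best)
```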
The choice of eight tiers also suits a key design objective, namely minimizing manufacturing cost by using an identical mask set for every layer instead of a separate mask set for each layer. This results in lower fabrication costs and a shortened design cycle. On each layer, the number of PEs should be in a practical range for current manufacturing technologies, neither too high nor too low. 3D partitioning is performed at the block level using our own two-step algorithm, detailed in [31]. In the first step, the floorplan and layout of an individual layer, consisting of 64 PEs and 40 routers, is obtained using our wire-length-minimizing place and route algorithm. The resulting network partitioning is shown in Fig. 4. For simplicity, all PEs (or tiles) are assumed to be 1.5 x 1.5 mm². Interconnect routing is done by forming dedicated channels between the PEs in the grid. While placing the routers, a folded topology is used in which each IM is placed adjacent to a corresponding OM; similarly, CM1 routers are placed adjacent to CM2. In the floorplan, we assume the width of the horizontal semi-global wires to be 0.4 µm. In the initial design, each link is 36 bits wide (32 data wires and 4 control wires). Therefore, the maximum space required for routing horizontal wires on a tile edge is 374.4 µm, which is less than 25% of the tile edge. In this calculation, we ignore the horizontal links from CM1 to CM2 and from CM2 to CM3, which remain in the same block, since these routers are placed adjacent to each other and their interconnections are therefore of negligible length.

Physical links between CM1–CM2 and CM2–CM3 connect individual blocks and are implemented using TSVs. As noted earlier, we assume the use of the 3D technology process from MIT Lincoln Labs in this analysis. In this technology, the TSV openings are 1.25 x 1.25 µm² with a via pitch of 3.9 µm, as shown in Fig. 6, and the vertical length of each TSV is about 8 µm. With 36-bit links from each CM1 in both directions, the number of TSVs required for the vertical links from each CM in the radix-8 CNOC is 504. Additionally, there are 864 TSVs that connect CMs on upper and lower tiers but pass through the middle tiers. If 50% of the tile edge is available for routing interconnects [32], then after 25% is occupied by planar interconnects, only 25% remains for placing the TSVs that act as the corresponding vertical interconnects. However, 1368 TSVs would need about 620 µm, or 41% of the tile edge. Consequently, placing TSV clusters in the routing channels is not a feasible option.
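The feasibility check described above, with planar wire tracks and TSV clusters competing for the tile edge, can be written as a small budget calculation. The sketch below is a rough helper under stated assumptions: the number of links crossing a tile edge and the number of TSV rows are free parameters chosen here for illustration, not values taken from the paper's floorplan, so its outputs only approximate the figures quoted in the text.

```python
# Rough tile-edge budget check for a routing channel: planar wire tracks plus
# a TSV cluster must fit in the fraction of the tile edge reserved for routing.
# `links_per_edge` and `tsv_rows` are assumed parameters, not the paper's values.

TILE_EDGE_UM = 1500.0      # 1.5 mm x 1.5 mm PE tile
WIRE_PITCH_UM = 0.4        # semi-global wire pitch/width
TSV_PITCH_UM = 3.9         # TSV pitch (MIT Lincoln Labs process)

def edge_usage(links_per_edge, bits_per_link=36, tsv_count=0, tsv_rows=1):
    planar_um = links_per_edge * bits_per_link * WIRE_PITCH_UM
    tsv_um = (tsv_count / tsv_rows) * TSV_PITCH_UM if tsv_count else 0.0
    return {
        "planar_um": planar_um,
        "planar_frac": planar_um / TILE_EDGE_UM,
        "tsv_um": tsv_um,
        "tsv_frac": tsv_um / TILE_EDGE_UM,
    }

if __name__ == "__main__":
    # Example with assumed values: 26 links crossing one tile edge and a
    # 1368-TSV cluster folded into 9 rows.
    print(edge_usage(links_per_edge=26, tsv_count=1368, tsv_rows=9))
```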
1384 A. Zia et al. / Microelectronics Journal 42 (2011) 1380–1390

3.9 um

um

um
9

9
3.

3.
8 um

3.9 um

1.25 um

Fig. 6. TSV size and pitch.

To meet this constraint, the TSV clusters are instead placed in the center of the chip. Floorplanning of the chip is carried out with the primary objective of minimizing power and wire length. The floorplan of a single tier is effectively a "square donut": the TSV clusters are placed in the donut hole, whereas the PEs and routers are placed in the donut ring. The IM/OM and CM1/CM2/CM3 routers form two separate rings in the donut, the former forming the outer ring and the latter the inner ring.

3.3. Other 3D topologies

For the sake of comparison, we also discuss other NOC topologies, namely the 3D mesh, flattened butterfly and fat-tree networks.

A number of techniques have been proposed to realize a 3D mesh. The most conventional is to add ports to the router for the upward and downward links, resulting in radix-7 routers. Li et al. have proposed a bus-based topology in which a 3D TDMA bus connects the different tiers of the 3D chip using radix-6 routers [17]. However, [33] has shown that such a topology does not scale well and that a conventional 3D mesh with seven-port routers shows better latency and throughput in large networks with more than 200 nodes. Fig. 7 shows the 3D mesh topology employed in this analysis. Due to its highly regular structure, horizontal or vertical 3D partitioning of a planar mesh yields an identical 3D structure. In a 512-node CMP, each router requires 144 TSVs for the bidirectional links to adjacent routers on the upper and lower tiers. These TSVs occupy approximately 80 µm, which is about 5% of the tile edge, and thus have minimal effect on wiring density in the routing channels. Each tier consists of 64 PEs, so the 512-node CMP occupies eight tiers. The minimum and maximum numbers of possible hops are 2 and 22, respectively. The primary benefit of implementing a 512-node mesh in 3D is a considerable reduction in interconnect power and chip footprint. In addition, the interconnects connecting routers on adjacent tiers are significantly shorter than the planar wires that connect adjacent PEs on the same plane.

Fig. 7. 3D organization of the mesh NOC.
Fig. 8. 3D organization of the flattened butterfly NOC.

The 512-node 3D flattened butterfly is shown in Fig. 8. It is basically a scaled-up version of the 64-node 2D flattened butterfly topology described in [10]. The NOC is scaled up by increasing the number of network dimensions to three. Each tier of the 3D chip consists of a 64-node 2D flattened butterfly network; dimensions 1 and 2 consist of four routers each, whereas the third dimension consists of eight routers. Each router in this topology is radix-17, which is much higher than the radix-8 routers in the Clos NOC.
An alternative scheme could use hierarchical or hybrid scaling, in which eight 64-node networks are connected using mesh-type links in the third dimension. Such a topology requires radix-12 routers; however, the maximum number of hops would then be 11, which is too high. For that reason, in order to limit the number of hops, we choose the former scheme to scale up the flattened butterfly. Each packet travels a minimum of 1 hop and a maximum of 4 hops. The 3D floorplan is easily obtained by stacking instances of the 2D flattened butterfly network. In the middle tiers, each router includes 504 vertical links implemented using TSVs. Besides those, there are 576 additional TSVs connecting adjacent routers on upper and lower tiers that pass through the same vicinity. These take roughly 494 µm, or about 33% of the tile edge.

The second alternative topology compared with the proposed 3D CNOC is the fat tree, shown in Fig. 9. Fig. 9(a) shows the topology of the 512-node fat-tree network. All routers are radix-8 and all links are bidirectional. There are five levels in the tree with 128 routers in each level, for a total of 640 radix-8 routers. The minimum number of possible hops is one, whereas the maximum is nine. In order to maintain network capacity, the total link bandwidth is kept equal at each level of the tree: since the number of upward links at each level is half the number of downward links, the upward links carry twice as many bits as the downward links.

There are several ways to partition a fat-tree topology into a 3D stack. One is to partition at the bottom level of the tree, so that all links between level-1 and level-2 routers are realized using TSVs. Another is to partition at the top of the tree, with TSVs forming the links between levels 3 and 4 and between levels 4 and 5. The 512-node CMP occupies eight tiers, with one 64-node block on each tier. However, the number of TSVs in this case would be much higher than in the first case due to the large number of wires at the first level. On the other hand, the upper-level links would result in long wires if implemented using planar interconnects. In this design, we choose the first scheme in order to minimize long wires. Although the resulting 3D floorplan, shown in Fig. 9(b), is similar to that of the 3D CNOC and both topologies use radix-8 routers, the number of routers and of planar and vertical wires is much larger in the fat tree than in the 3D CNOC.

4. Modeling the NOC fabrics

In order to compare the power and delay performance of the various topologies, a combination of analytical models and Cadence Spectre simulation results is used to obtain power and delay numbers for the NOC fabrics. These are then used to obtain the power and delay of the full NOC.

Fig. 9. (a) 512-node fat-tree network. (b) Its 3D floorplan.


Table 1 shows some of the technology parameters of the 45-nm process assumed for this analysis.

4.1. Virtual channel router

A baseline virtual channel router consists of three pipeline stages, namely the VC allocator, the switch allocator and the crossbar. The delay through these stages is described by adapting the analytical models of [34] to our VC router architecture described in the previous section:

T_router = T_VC_Alloc + T_SW_Alloc + T_Xbar

T_VC_Alloc = (29 log4(p·v) + 91/6) τ    (1)

T_SW_Alloc = (14 log4(v) + 6 log4(p) + 11) τ    (2)

T_Xbar = (15 log4(p) + 5 log4(w) − 25) τ    (3)

where p, v and w denote the number of router ports, the number of virtual channels and the number of bits in the channel, respectively, and τ is the minimum gate delay achievable in the process technology.
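As a reading aid, the sketch below evaluates Eqs. (1)–(3) as transcribed above, using τ from Table 1 and a 36-bit channel. It is only an illustration of the analytical model, not the authors' evaluation scripts, and small differences from the values plotted in Fig. 10 are to be expected.

```python
# Evaluate the router pipeline delay model of Eqs. (1)-(3) as transcribed above.
# tau is the minimum gate delay from Table 1; p, v, w are the number of ports,
# virtual channels and channel bits.  Illustrative helper only.
import math

TAU_PS = 9.38          # minimum gate delay (Table 1)

def log4(x):
    return math.log(x, 4)

def stage_delays_ps(p, v=8, w=36):
    t_vc   = (29 * log4(p * v) + 91 / 6) * TAU_PS         # Eq. (1), VC allocator
    t_sw   = (14 * log4(v) + 6 * log4(p) + 11) * TAU_PS   # Eq. (2), switch allocator
    t_xbar = (15 * log4(p) + 5 * log4(w) - 25) * TAU_PS   # Eq. (3), crossbar
    return {"VC_alloc": t_vc, "SW_alloc": t_sw, "Xbar": t_xbar}

if __name__ == "__main__":
    for radix in (5, 7, 8, 10, 13, 17):
        d = stage_delays_ps(radix)
        # In a pipelined router the clock period is set by the slowest stage.
        print(radix, {k: round(val, 1) for k, val in d.items()},
              "slowest:", round(max(d.values()), 1), "ps")
```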
We use Orion to estimate the power consumption of the VC router [35]. The total power consumed includes the dynamic and leakage power dissipated in the buffers, VC allocator, switch allocator and crossbar of the router. The router has eight virtual channels per input port and a 64-flit buffer per port; all eight VCs share the buffer in each port to increase buffer utilization. The number of input/output ports varies depending upon the topology and network size.

4.2. Physical link

Spectre simulations are used to obtain power and delay for the planar links in the CMPs. We assume these links are implemented in semi-global metal layers with 400 nm pitch and width. Repeaters are inserted at appropriate intervals so that the wire delay stays within the target clock period (1 GHz operation) and the wires do not become the critical path in the network. Similarly, simulation results for power and delay are obtained for the vertical wires. The model of a vertical link consists of the horizontal wires between the logic and the TSVs, plus the TSVs themselves, which are modeled as thick tungsten vertical wires [36]. Tables 2 and 3 list the physical and electrical parameters of the horizontal and vertical wires, respectively, including intrinsic resistivity and total capacitance. Tables 4 and 5 give the simulated power dissipation and delay for various horizontal and vertical wire lengths.

Table 1
Technology parameters.

Parameter | Value
Vdd | 1.1 V
Vt | 0.3 V
Cg | 95.19 fF/mm
τ (min. gate delay) | 9.38 ps

Table 2
Physical and electrical parameters of planar interconnects (global metal layer).

Parameter | Value
Resistivity | 4.08 x 10⁻⁸ Ω·m
Capacitance | 0.048 fF/µm
Width | 0.4 µm

Table 3
Physical and electrical parameters of vertical interconnects.

Parameter | Value
Resistivity | 5.6 x 10⁻⁸ Ω·m
Capacitance | 0.82 fF/µm
Length | 8 µm
Area | 1 x 1 µm²

Table 4
Power dissipation and delay along various planar wire lengths.

Length (mm) | 10.5 | 9.0 | 7.5 | 6.0 | 4.5 | 3.0 | 1.5
Power dissipation (mW) | 1.74 | 1.513 | 1.306 | 1.025 | 0.8 | 0.513 | 0.306
Delay (ps) | 287 | 251 | 208 | 171 | 128 | 87.1 | 45.4

Table 5
Power dissipation and delay along vertical interconnects.

Length (µm) | 8 | 16 | 24 | 32 | 40 | 48 | 56
Power dissipation (mW) | 0.014 | 0.017 | 0.018 | 0.017 | 0.021 | 0.022 | 0.022
Delay (ps) | 0.017 | 0.0212 | 0.0281 | 0.0453 | 0.0505 | 0.067 | 0.07

5. Analysis and results

In this section, we compare the latency and power of the CNOC, flattened butterfly, fat-tree and mesh topologies under zero-load conditions as the network size scales from 64 to 512 nodes. For all topologies, the 64-node 2D CMP is the baseline design, and scaling is done by stacking subsequent tiers onto the baseline. It follows that the CMP is a planar chip for the 64-node network and a 3D stack for all larger network sizes. We use the network fabric models described in the previous section to calculate performance data. All flits are 32 bits in length. Table 6 lists the architectural parameters of the various NOC topologies for the 512-node CMP system.

Table 6
Architectural parameters of various NOC topologies for the 512-node CMP.

NOC topology | Router radix | Number of routers | Link type | Number of hops (min/max) | Longest wire (mm)
3D FBfly | 17 | 128 | Bidirectional | 1/4 | 9
3D CNOC | 8 | 320 | Unidirectional | 5/5 | 7.5
3D fat tree | 8 | 640 | Bidirectional | 1/9 | 6
3D mesh | 7 | 512 | Bidirectional | 2/22 | 1.5
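As a cross-check on the hop counts in Table 6, the short sketch below counts the routers traversed on minimal and maximal paths for an 8 x 8 x 8 mesh and for the 5-stage Clos, under the convention used in the table that a hop count is the number of routers a flit passes through. It is a reading aid only, not part of the paper's evaluation flow.

```python
# Routers traversed (the hop-count convention of Table 6) for two topologies:
# an 8x8x8 3D mesh with one router per PE, and the 5-stage, 512-node Clos.

def mesh_routers_traversed(src, dst):
    """Dimension-ordered path in an 8x8x8 mesh, counting both end routers."""
    manhattan = sum(abs(a - b) for a, b in zip(src, dst))
    return manhattan + 1          # +1 because the source router is also traversed

# Minimum: adjacent routers -> 2; maximum: opposite corners -> 7+7+7+1 = 22.
assert mesh_routers_traversed((0, 0, 0), (0, 0, 1)) == 2
assert mesh_routers_traversed((0, 0, 0), (7, 7, 7)) == 22

# In the 512-node CNOC every packet crosses IM, CM1, CM2, CM3 and OM: 5 routers.
CNOC_HOPS = 5
print("mesh min/max:", 2, "/", 22, "| CNOC:", CNOC_HOPS, "/", CNOC_HOPS)
```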
The most important factors in network latency are the amount of traffic in the network and the design of the router. In general, large routers are slow since they require bigger crossbars. Using (1), (2) and (3), we obtain the latency of the three pipeline stages of the canonical VC router for various radices. As can be seen in Fig. 10, a radix of 10 is the largest that still allows a clock cycle of less than 1 ns (1 GHz). From Fig. 10 and Tables 4 and 5, the VC allocator stage of the router is clearly the critical path in all topologies. Fig. 11 shows the critical path delay in the compared topologies with network sizes varying from 64 to 512 nodes. Since the router radix stays constant for both CNOC and fat tree as the network size scales, the critical path delay in both topologies also stays the same. The difference between the 2D and 3D mesh delay is due to the increased radix (number of ports). Similarly, the result for the flattened butterfly is not unexpected, since its radix increases linearly as the network scales.

Fig. 10. Latency in pipeline stages (VC allocator, switch allocator, crossbar) of routers with various radices: 1—Radix-5 (64-node mesh), 2—Radix-7 (128–512-node mesh), 3—Radix-8 (CNOC, FTree), 4—Radix-10 (64-node FBfly), 5—Radix-13 (256-node FBfly), 6—Radix-17 (512-node FBfly).

Fig. 11. Critical path delay in various network topologies with scaling network sizes.

Figs. 12 and 13 illustrate the minimum and maximum power dissipation for various network sizes. Here, we assume a zero-load condition in order to estimate the minimum and maximum power dissipation. In this model, a single packet is assumed to travel through the network from source to destination. Although this model discounts the impact of packet contention for resources on network performance, it provides a lower bound on the actual network performance metrics, which are a good approximation of the network characteristics in the absence of saturation due to heavy traffic.

Fig. 12. Minimum zero-load power dissipation in various NOC topologies with scaling network size.

Fig. 13. Maximum zero-load power dissipation in various NOC topologies with scaling network size.

We define the minimum power dissipation as the power dissipated while a single packet traverses the minimum number of hops with the minimum path length from source to destination. This includes the sum of the router power for all routers in the path and the power dissipated in the horizontal and vertical links along that path. The router power dissipation dominates the total network power consumption. In CNOC, since packet traversal always takes a constant number of hops (3 in the 64-node network, 5 in the rest), its minimum power dissipation stays constant. All other topologies have bidirectional links, so the minimum number of hops is always one (FBfly, FTree) or two (mesh). The router radix in a flattened butterfly network increases as the network size is scaled up, resulting in increasing power dissipation in that topology. In the fat-tree and mesh networks, the router radix does not progressively increase as the network scales.
A combination of comparatively low radix and a low number of hops means that the minimum power dissipation is lowest in these topologies, making them suitable for systems with highly localized traffic between nodes.

Maximum power dissipation is the power consumed when a single packet travels the maximum number of hops, with each hop consisting of the maximum possible wire length for that link; the maximum wire length of each link depends on the floorplan and layout of the chip. The power consumption, as in the previous case, is the sum of router power and link power (horizontal and vertical). In this case, CNOC outperforms all other topologies. For a 64-node system, CNOC consumes 23% less power than the second-lowest topology (flattened butterfly). In a 512-node system, CNOC outperforms its closest rival (fat tree) by consuming almost 41% less power. From these results, it can be argued that CNOC has better power dissipation when employed in systems where the majority of inter-PE traffic is non-local.
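The minimum- and maximum-power definitions above amount to summing router power and link power along a best-case or worst-case path. The sketch below expresses that bookkeeping; the per-router and per-link power values are hypothetical stand-ins for the Orion and Spectre results, not the paper's numbers.

```python
# Zero-load path power: sum router power over all routers traversed plus the
# power of every horizontal/vertical link on the path.  router_power_w and
# link_power_w are hypothetical stand-ins for the Orion / Spectre models.

def path_power_w(hops, link_lengths_mm, router_power_w, link_power_w):
    routers = hops * router_power_w
    links = sum(link_power_w(length) for length in link_lengths_mm)
    return routers + links

def min_max_power_w(topology, router_power_w, link_power_w):
    """topology provides (hops, link lengths) for its best- and worst-case paths."""
    best = path_power_w(*topology["min_path"], router_power_w, link_power_w)
    worst = path_power_w(*topology["max_path"], router_power_w, link_power_w)
    return best, worst

if __name__ == "__main__":
    # Example with assumed numbers: a CNOC-like case where every packet takes
    # exactly 5 router hops, with assumed link lengths in mm.
    cnoc = {"min_path": (5, [1.5, 1.5, 0.0, 1.5, 1.5]),
            "max_path": (5, [7.5, 7.5, 0.0, 7.5, 7.5])}
    print(min_max_power_w(cnoc,
                          router_power_w=0.05,                   # assumed W per router
                          link_power_w=lambda mm: 0.17e-3 * mm)) # ~slope of Table 4
```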
Fig. 14 illustrates the maximum zero-load power consumption in the 512-node NOC topologies as the flit size is increased from 32 to 256 bits. The flattened butterfly degrades the most as the flit size is scaled up while keeping the same architectural design: its power dissipation increases 22 times when the flit size is increased from 32 to 256 bits, due to the significant increase in the power consumption of its radix-17 crossbar. The power dissipation of the CLOS network is the lowest among the four topologies.

Fig. 15 shows the average power dissipation in the various interconnection networks, obtained by cycle-accurate simulation of the 512-node systems with varying injection rate [29]. In the cycle-accurate simulator, each node can generate and receive packets with equal probability. Packet generation is a Bernoulli process with a success probability equal to the packet injection rate, and the destination addresses are uniformly distributed for each generator. Each packet consists of eight flits, and each flit consists of 32 data bits and four control bits. As expected, power consumption in all topologies increases with increasing injection rate until saturation. The saturation point corresponds to the maximum throughput achievable in the network; if the injection rate is increased any further, packets may be lost. As can be seen, throughput in the mesh network saturates at an injection rate of about 25%. The throughput of CLOS and FBfly is almost identical, with CLOS consuming about 15% less power.
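The traffic model used in the cycle-accurate simulation, Bernoulli packet generation at a given injection rate with uniformly distributed destinations, can be stated in a few lines. The sketch below is a generic illustration of that traffic generator, not the authors' simulator.

```python
# Bernoulli traffic generator: at every cycle each node independently injects a
# packet with probability equal to the injection rate, with a uniformly chosen
# destination.  Generic illustration of the traffic model, not the simulator.
import random

NUM_NODES = 512
FLITS_PER_PACKET = 8          # 8 flits of 32 data bits + 4 control bits each

def inject(cycle, injection_rate, rng=random):
    """Return the list of (src, dst, cycle) packets generated this cycle."""
    packets = []
    for src in range(NUM_NODES):
        if rng.random() < injection_rate:
            dst = rng.randrange(NUM_NODES)      # uniformly distributed destination
            if dst != src:
                packets.append((src, dst, cycle))
    return packets

if __name__ == "__main__":
    random.seed(0)
    offered = sum(len(inject(c, injection_rate=0.25)) for c in range(100))
    print("packets offered over 100 cycles at 25% injection:", offered)
```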

Fig. 14. Maximum power dissipation for various NOC topologies with scaling network size.

Fig. 15. Average power consumption in the 512-node NOC systems with varying flit arrival rate.

In order to evaluate the impact of 3D integration on the power consumption of the CNOC for many-core CMPs, we compare cycle-accurate simulation results for a 512-node CNOC organized in a single tier (planar), two tiers, four tiers and eight tiers. Since a planar 512-node CMP would be impractical in practice due to its enormous area, we consider a hypothetical baseline planar 512-node CNOC. The framework of our cycle-accurate simulator is kept the same when simulating these CNOC structures, but since 3D partitioning of the planar CNOC changes the wire lengths between routers, we obtain interconnect data for each of the evaluated 3D CNOC structures separately; hence, the input to the cycle-accurate simulator changes for each configuration. In the planar 512-node CMP, all nodes and SMs are considered to be on a single layer and link routing is done accordingly. In the 2-tier system, four blocks (256 nodes and 160 routers) are placed on each tier, whereas the 4-tier system has two blocks (128 nodes and 80 routers) on each tier.

Fig. 16 shows the power consumption of the simulated configurations of the 512-node CNOC. Power consumption is highest in the planar design and decreases as the number of tiers increases. Going from one tier to two, the maximum power consumption decreases by 40%; scaling the 3D chip up to four tiers reduces power by another 25%. Going from four tiers to eight tiers provides only an additional 4% reduction in power dissipation, but still reduces the chip footprint by almost 50%.

Fig. 16. Impact on power consumption in a 512-node CNOC with increasing tiers.

Table 7
Number of TSVs and their area for the interconnection links in the 512-node NOC.

Topology | Link TSVs | TSV area on one tier (mm²)
Flattened butterfly | 21888 | 5.2
CNOC | 10944 | 1.89
Fat-tree | 14976 | 2.96
Mesh | 9216 | 1.46
An important yardstick in the choice of a 3D NOC topology should be the number of TSVs; a higher number obviously affects the yield of the 3D chip, so the designer's goal should be to choose a topology with the minimum number of TSVs while providing acceptable performance. Table 7 lists the maximum number of TSVs needed to realize the communication links for 32-bit flits in the 512-node NOC and their corresponding area overhead. Here, the area consumed by the TSVs is calculated assuming that two semi-global metal layers are used to connect planar wires to these TSVs. The number of TSVs can be expected to increase linearly if the flit size is increased from 32 bits to 64 and beyond. It should be noted that the required number of TSVs differs from tier to tier in the 3D chip, because the middle tiers also carry all TSVs connecting the tiers above and below them; the central tiers therefore carry the most TSVs. Besides link TSVs, there are separate TSVs for the power grid and clock network. Although the exact numbers depend heavily on the 3D partitioning strategy employed for a given topology, the trend can be expected to be similar, independent of the partitioning scheme. As can be seen, the 3D mesh utilizes the minimum number of TSVs (9216), which is a little less than the 10944 TSVs utilized in the 3D CNOC. If 3D stacking yield is quantified in terms of TSVs, the 3D CNOC thus also provides a good balance between performance and yield.

A major concern in 3D processors is the possibility of worsened thermal behavior compared to their planar counterparts, due to increased power density and a lack of heat dissipation paths. However, this is partially compensated by the fact that 3D integration reduces overall power dissipation due to shorter wires. It has been shown in [37] that the temperature increase with increasing layers of active logic is very modest. Furthermore, microarchitectural techniques can also be used to improve thermal management in 3D multiprocessor chips [38,39].

6. Conclusion

Intra-chip communication fabrics in many-core CMPs present significant challenges that cannot be overcome using conventional NOC techniques. Chip power budgets place significant constraints on the maximum allowable power consumption in NOC fabrics, so the key requirement for the network topology in many-core CMPs is scalability with low power consumption. 3D integration offers a promising way to manage wires that can become prohibitively long in many-core CMPs. In this paper, we have presented a 3D CNOC topology that scales very well while providing an upper bound on power consumption. The key feature of the 3D CNOC is an upper bound on router radix as the network scales up. 3D integration is employed to overcome the long wires that would become burdensome if the CNOC were implemented in a conventional planar chip. 3D partitioning and floorplanning are discussed in conjunction with their effect on wire density and the number of TSVs. We compare the power dissipation of several topologies employed in a 512-node system and find that both the maximum and the average power consumption are lowest in the 3D CNOC.

Acknowledgments

This work was supported by a Polytechnic Institute of NYU Angel Fund grant and US Army CERDEC. The authors would like to thank Yu-Hsiang Kao and Yang Yang for their insightful discussions and valuable help with simulations.

References

[1] S.R. Vangal, et al., An 80-tile sub-100-W teraFLOPS processor in 65-nm CMOS, IEEE J. Solid-State Circuits 43 (1) (2008) 29–41.
[2] A. Agarwal, et al., Tile processor: embedded multicore for networking and multimedia, in: Proceedings of IEEE Hot Chips 19, Stanford, CA, USA, August 2007.
[3] W.J. Dally, B. Towles, Route packets, not wires: on-chip interconnection networks, in: Proceedings of the 38th Design Automation Conference (DAC 2001), June 2001, pp. 683–689.
[4] L. Benini, G. De Micheli, Networks on chips: a new SoC paradigm, IEEE Computer (2002) 70–78.
[5] B. Grot, et al., Express cube topologies for on-chip interconnects, HPCA 2009 (2009) 163–174.
[6] R. Das, et al., Design and evaluation of a hierarchical on-chip interconnect for next-generation CMPs, HPCA 2009 (2009) 175–186.
[7] M.B. Taylor, W. Lee, S. Amarasinghe, A. Agarwal, Scalar operand networks: on-chip interconnect for ILP in partitioned architectures, in: Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA), Anaheim, CA, USA, 2003, pp. 341–353.
[8] C. Gomez, F. Gilabert, M.E. Gomez, P. Lopez, J. Duato, RUFT: simplifying the fat-tree topology, in: Proceedings of the Fourteenth IEEE International Conference on Parallel and Distributed Systems, 2008.
[9] J. Kim, J. Balfour, W.J. Dally, Flattened butterfly topology for on-chip networks, IEEE Comput. Archit. Lett. 6 (2) (2007) 37–40.
[10] J. Balfour, W.J. Dally, Design tradeoffs for tiled CMP on-chip networks, in: Proceedings of the 20th Annual International Conference on Supercomputing (ICS 2006), 2006, pp. 187–198.
[11] D. Ludovici, F. Gilabert, S. Medardoni, C. Gomez, Assessing fat-tree topologies for regular network-on-chip design under nanoscale technology constraints, in: Proceedings of the Conference on Design, Automation and Test in Europe, 2009.
[12] A. Adriahantenaina, H. Charlery, A. Greiner, L. Mortiez, C.A. Zeferino, SPIN: a scalable, packet switched, on-chip micro-network, in: Proceedings of the Design, Automation and Test in Europe Conference and Exhibition, 2003.
[13] K. Banerjee, et al., 3-D ICs: a novel chip design for improving deep submicrometer interconnect performance and system-on-chip integration, Proc. IEEE 89 (5) (2001) 602–633.
[14] K. Puttaswamy, G. Loh, Implementing caches in a 3D technology for high performance processors, in: Proceedings of the IEEE International Conference on Computer Design, October 2005, pp. 525–532.
[15] P. Jacob, et al., Mitigating memory wall effects in high-clock-rate and multicore CMOS 3D processor memory stacks, Proc. IEEE (special issue on 3D Integration Technology) 97 (1) (2009) 108–122.
[16] B. Black, D. Nelson, C. Webb, N. Samra, 3D processing technology and its impact on IA32 microprocessors, in: Proceedings of the IEEE International Conference on Computer Design (ICCD 2004), October 2004, pp. 316–318.
[17] F. Li, C. Nicopoulos, T. Richardson, Y. Xie, V. Narayanan, M. Kandemir, Design and management of 3D chip multiprocessors using network-in-memory, in: Proceedings of the 33rd International Symposium on Computer Architecture (ISCA 2006), June 2006, pp. 130–141.
[18] V. Pavlidis, E. Friedman, 3-D topologies for networks-on-chip, IEEE Trans. VLSI Syst. 15 (10) (October 2007) 1081–1090.
[19] B. Feero, P. Pande, Networks-on-chip in a three-dimensional environment: a performance evaluation, IEEE Trans. Comput. 58 (1) (2009) 32–45.
[20] Y. Xu, Y. Du, B. Zhao, X. Zhou, Y. Zhang, J. Yang, A low-radix and low-diameter 3D interconnection network design, in: Proceedings of the Fifteenth IEEE International Symposium on High Performance Computer Architecture (HPCA 2009), February 2009, pp. 30–42.
[21] J. Kim, C. Nicopoulos, D. Park, R. Das, Y. Xie, N. Vijaykrishnan, C. Das, A novel dimensionally-decomposed router for on-chip communication in 3D architectures, in: Proceedings of the 34th International Symposium on Computer Architecture (ISCA 2007), June 2007, pp. 138–149.
[22] D. Park, S. Eachempati, R. Das, A. Mishra, Y. Xie, V. Narayanan, C. Das, MIRA: a multi-layer on-chip interconnect router architecture, in: Proceedings of the 35th International Symposium on Computer Architecture (ISCA 2008), June 2008, pp. 251–261.
[23] J.-Q. Lu, 3-D hyperintegration and packaging technologies for micro-nano systems, Proc. IEEE 97 (1) (January 2009) 18–30.
[24] M. Wolf, P. Ramm, A. Klumpp, H. Reichl, Technologies for 3D wafer level heterogeneous integration, in: Proceedings of the IEEE Symposium on Design, Test, Integration and Packaging of MEMS/MOEMS, 2008.
[25] A. Zia, P. Jacob, J.-W. Kim, M. Chu, R. Kraft, J. McDonald, A 3-D cache with ultra-wide data bus for 3-D processor-memory integration, IEEE Trans. VLSI Syst. 18 (6) (June 2010) 967–977.
[26] W.R. Davis, et al., Demystifying 3D ICs: the pros and cons of going vertical, IEEE Des. Test Comput. 22 (6) (2005) 498–510.
[27] FDSOI Design Guide, MIT Lincoln Laboratory, Cambridge, MA, 2008.
[28] C. Clos, A study of non-blocking switching networks, AT&T Tech. J. (March 1953) 406–424.
[29] Y. Kao, et al., Design of high-radix Clos network-on-chip, in: Proceedings of the Fourth ACM/IEEE International Symposium on Networks-on-Chip (NOCS 2010), May 2010.
[30] W.J. Dally, B. Towles, Principles and Practices of Interconnection Networks, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2003.
[31] S. Kannan, G.S. Rose, A hierarchical 3-D floorplanning algorithm for many-core CMP networks, in: Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS 2011), 2011.
[32] D.N. Jayasimha, B. Zafar, Y. Hoskote, On-chip interconnection networks: why they are different and how to compare them, Technical Report, Intel Corporation, 2006.
[33] A. Weldezion, et al., Scalability of network-on-chip communication architecture for 3-D meshes, in: Proceedings of the Third ACM/IEEE International Symposium on Networks-on-Chip (NOCS 2009), May 2009, pp. 114–123.
[34] L.-S. Peh, W.J. Dally, A delay model for router microarchitectures, IEEE Micro 21 (1) (January/February 2001) 26–34.
[35] A.B. Kahng, ORION 2.0: a fast and accurate NoC power and area model for early-stage design space exploration, in: Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE 2009), April 2009, pp. 423–428.
[36] J.A. Burns, B. Aull, C. Chen, C.-L. Chen, C. Keast, J. Knecht, V. Suntharalingam, K. Warner, P. Wyatt, D.-R.W. Post, A wafer-scale 3-D circuit integration technology, IEEE Trans. Electron Devices 53 (10) (October 2006) 2507–2516.
[37] K. Puttaswamy, G.H. Loh, Thermal analysis of a 3D die-stacked high-performance microprocessor, in: Proceedings of GLSVLSI 2006.
[38] X. Zhou, J. Yang, Y. Xu, Y. Zhang, J. Zhao, Thermal-aware task scheduling for 3D multicore processors, IEEE Trans. Parallel Distrib. Syst. 21 (1) (January 2010) 60–71.
[39] A. Coskun, J. Ayala, D. Atienza, T. Rosing, Y. Leblebici, Dynamic thermal management in 3D multicore architectures, in: Proceedings of the Conference on Design, Automation and Test in Europe (DATE 2009), 2009.
