ICCAD05 Long Links Final v5

Application-Specific Network-on-Chip Architecture
Customization via Long-Range Link Insertion

Umit Y. Ogras Radu Marculescu
Department of Electrical and Computer Engineering Department of Electrical and Computer Engineering
Carnegie Mellon University Carnegie Mellon University
Pittsburgh, PA 15213-3890, USA Pittsburgh, PA 15213-3890, USA
e-mail: uogras@ece.cmu.edu e-mail: radum@ece.cmu.edu
Abstract Networks-on-Chip (NoCs) represent a promising

solution to complex on-chip communication problems. The
NoC communication architectures considered so far are based
on either completely regular or fully customized topologies. In
this paper, we present a methodology to automatically synthe-
size an architecture where a few application-specific long-range
links are inserted on top of a regular mesh network. This way,
we can better exploit the benefits of both complete regularity
and partial customization. Indeed, our experimental results
show that inserting application-specific long-range links signif-
icantly increases the critical traffic workload at which the net- Figure 1. Illustration of adding long-range links to a 4x4
standard mesh network.
work state transits from a free to a congested regime. This, in
turn, results in a significant reduction in the average packet Fortunately, these two extreme points in the design space
latency and a major improvement in the achievable network (i.e. designs based on purely regular or completely customized
throughput. topologies) are not the only solutions possible for NoC architec-
tures. In fact, it is interesting to note that many technological,
I. INTRODUCTION biological, and social networks are neither completely regular,
Continuous scaling of CMOS technology makes it possible nor irregular [14,15]. One can view these networks as a super-
to put many heterogeneous devices on a single chip. Large- position of clustered nodes with short links and a collection of
scale integration of these blocks onto a single chip calls for random long-range links that produce “shortcuts” among differ-
truly scalable Networks-on-Chip (NoC) communication archi- ent regions of the network. Regular lattice networks with a
tectures [1,2,3]. Regular NoC architectures based on grid-like number of additional random long-range links, similar to the
(or 2D lattice) topologies provide well-controlled electrical one shown in Figure 1, can be used to model such networks
parameters and reduced power consumption across the links. [15]. This paper explores precisely the potential of using stan-
However, due to the lack of fast paths between remotely situ- dard mesh networks in conjunction with a few additional long-
ated nodes, such architectures may suffer from long packet range links to improve the performance of regular NoCs.
latencies. Indeed, having many hops between different commu- Inserting long-range links to regular architectures clearly
nicating nodes, not only increases the message latency, but also reduces the average distance between remote nodes. However,
increases the message blocking probability thus making the inserting long-range links cannot be done randomly for NoCs
end-to-end packet latency more unpredictable. Consequently, because adding extra links has a more pronounced, yet barely
such generic platforms may become easily less attractive for studied, impact on the dynamic properties of the network,
application-specific designs that need to guarantee a given level which are characterized by traffic congestion. At low traffic
of performance. loads, the average packet latency exhibits a weak dependence
On the other hand, fully customized topologies [8-11] can on the traffic injection rate. However, when the traffic injection
improve the overall network performance, but they distort the rate exceeds a critical value, the packet delivery times rise
regularity of the grid structure. This results in links with widely abruptly and the network throughput starts collapsing
varying lengths, performance and power consumption. Conse- (Figure 2). The state of the network before congestion (i.e. the
quently, better logical connectivity comes at the expense of a region on left hand side of the critical value) is the free state,
penalty in the structured nature of the wiring which is anyway while the state beyond the critical value (right hand side) is said
one of the main advantages offered by the regular on-chip net- to be the congested state. Finally, the transition from the free
works [1]. In the extreme case, fully customized solutions may state to the congested one is known as phase transition region.
end up resembling ASIC-style designs where individual mod- As it turns out, the phase transition in regular networks can
ules communicate by packet switching. Hence, the usual prob- be significantly delayed by introducing additional long-range
lems of cross-talk, timing closure, global wires etc. may links (see Figure 2) [16]. Due to the exponential increase in the
undermine the overall gain obtainable through customization. latency beyond the critical point, even a small right hand shift
uniform traffic. The packets in the network consist of a single
atomic entity containing address information only. Also, due to
the infinite buffer assumption, the authors do not deal with
deadlock states explicitly.
In contrast to this prior work, we do consider wormhole
routing with arbitrary flit sizes and network routers with
bounded input buffers. Most importantly, instead of uniform
traffic, we assume application-specific traffic patterns and
present an algorithm which inserts the long-range links in a
smart manner rather than randomly. Due to the bounded input
buffers, additional long-range links may cause deadlock states.
For this reason, we also present a deadlock-free routing algo-
rithm that exploits the long-range links to achieve the desired
performance boost.
III. LONG-RANGE LINK INSERTION ALGORITHM

We start by formulating the long-range link inserting prob-
lem. After that, we present the details of the solution following
a top-down approach.
Figure 2. Shift in the critical traffic load after the insertion of long-
range links to a 6x6 mesh network under hotspot traffic (Section V). A. System model and basic assumptions
The system of interest consists of a set T of m × n tiles
in the critical traffic value may result in orders of magnitude interconnected by a 2D mesh network1. The tiles of the network
reduction for latency. Similarly, the achievable throughput (referred to as PEs) are populated with processing and/or stor-
grows with the right shift of the critical traffic value. This phe- age elements that communicate with each other via the net-
nomenon is at the very heart of our optimization technique. work. We do not make any assumption about the distribution of
The main objective of our optimization technique is to boost the packet injection rate into the network, but only consider the
the network performance (i.e. reduce the average packet latency frequencies at which the PEs communicate with each other.
and increase the network throughput) by maximizing the value Due to limited on-chip buffer resources and low latency
of the critical traffic load via smart insertion of application-spe- requirements, it makes sense to assume wormhole switching for
cific long-range links. To this end, our contribution is twofold: the network; the results derived here, however, are also applica-
• First, for a given application, we propose an algorithm that ble to packet- and virtual cut-through switching networks. Fur-
determines the most beneficial long-range links to be inserted in ther, we do not assume any particular routing algorithm for the
a mesh network. mesh network; the only requirement is that the underlying rout-
• Second, we present a deadlock-free decentralized routing ing algorithm has to be deadlock-free and minimal. Deadlock-
algorithm that exploits the long-range links to achieve the free property is desirable for on-chip networks for two reasons:
desired performance level. First, implementing deadlock detection and recovery mecha-
The paper is organized as follows. Section II reviews nisms is expensive in terms of silicon resources. Second, such
related work. The proposed approach for smart link insertion is mechanisms can cause unpredictable delays, which need to be
explained in Section III. Practical considerations in implement- avoided for most embedded applications.
ing long-range links are discussed in Section IV, while the After inserting the long-range links as explained in
experimental results appear in Section V. Finally, Section VI Section C, the routers without extra links simply use the default
concludes the paper by summarizing our main contribution. routing strategy. Since this default strategy cannot route the
packets across the newly added links, we define a deadlock-free
II. RELATED WORK
routing strategy which enables the use of the newly added long-
The use of NoCs as a scalable communication architecture range links (Section E).
is discussed in [1,2,3]. Design methodologies for application-
specific NoCs are discussed in [4-11]. Studies in [6,7] consider B. Problem formulation
regular network topologies and present algorithms for applica- The communication volume between the PE located at tile
tion mapping under different routing strategies. On the other i ∈ T and the PE located at tile j∈ T is denoted by Vij. We
hand, fully customized communication architecture synthesis compute the frequency of communication, fij, between PEs i
for a given application is addressed in [8,9,10]. and j by normalizing the inter-tile communication volume as
To the best of our knowledge, the idea of optimizing a follows:
generic grid-like network with application-specific long-range
links is first addressed in this paper. Previous work on a similar
idea comes from the networks theory side and uses very idealis- 1. The following discussion assumes a 2D mesh but the proposed tech-
tic assumptions. For instance, the authors of [16] investigate the nique is applicable to other topologies for which a distance definition as
effect of adding random links to mesh and torus networks under in equations 6 and 7 (in Section III) exists.
V ij
∀ i, j, p, q ∈ T f ij = -------------------------- (1)
∑ ∑ Vpq
p q≠p
The addition of long-range links introduces an overhead due
to the additional wires and repeaters1 connecting the wire seg-
ments to ensure the latency-insensitive operation. Hence, we
need to model the cost of long-range links to have a measure of
this overhead.
Without losing the generality, we measure the size of long-
range links, s ( l ) , in multiples of basic link units which are
identical to the regular links used in the mesh network. This is
reasonable, since the long-range links consist of a number of
standard links connected by repeaters, as shown in Figure 8.
The number of repeaters required by a long-range link is given
by s ( l ) – 1 . Consequently, the maximum amount of permissible Figure 3. The flow of the long-range link insertion algorithm. The
overhead can be expressed as a multiple of the standard link evaluation step is detailed in Section D.
segments that make up the long-range link. For example, a The algorithm starts with a standard mesh network and
resource constraint of S means that only long-range links con- takes the communication frequencies between the network tiles,
sisting of at most S units of standard links, total, can be added. the default routing algorithm and the amount of resources
We can now state the application-specific long-range link allowed to use as inputs. Then, the algorithm selects all possible
insertion problem as follows: pairs of tiles (i.e. C ( T , 2 ) pairs where T is the number of
Given nodes in the network) and inserts links between them. After
inserting each long-range link, the resulting network is evalu-
• fij ∀i, j ∈ T
ated to find out the gain obtained over the previous configura-
• Maximum number of links that can be added, S tion. Since our goal is to maximize λ c , we compare different
• Routing strategy for the mesh network, R configurations in terms of critical traffic load as detailed in
Determine Section D. After the most beneficial long-range link is found,
• The set of long-range links to be added on top of the mesh the information about this link is stored and the amount of uti-
network, L S lized resources updated. For example, if a long-range link con-
• A deadlock-free routing strategy that governs the use of the sisting of four equivalent segments of standard links is added,
newly added long-range links, R L the utilization is incremented by four.
The procedure described above repeats until all available
such that
resources are used up. Once this happens, an architecture con-
max ( λ c ) subject to ∑ s(l ) < S (2) figuration file is generated. Then, the routing strategy govern-
l ∈ LS
ing the use of long-range links is produced and written to a
where λ c is the critical load at which the network enters the routing configuration file, as described in Section E.
congested phase. To give some intuition, the long-range links
D. Evaluation of the critical traffic value
are added to maximize the critical traffic, λ c , subject to the
total amount of available resources; that is, the phase transition While the impact of routing strategy, switching techniques
region is delayed beyond the value a standard mesh network (of and network topology on the critical point have been studied via
exactly same size) can offer. Maximizing λ c increases the simulation [18], no work has been done to maximize the traffic
achievable throughput and reduces the latency compared to the critical value subject to resource constraints. The major obstacle
original critical load, as shown later in the experiments. in optimizing the critical load comes from the difficulty in mod-
Note that, the objective of inserting long-range links is by elling the variation of critical point as a function of the design
no means limited to maximizing λ c . Indeed, other objective decisions. Several theoreticians [16,17] estimate the criticality
functions (e.g. increased fault-tolerance, guaranteed service, point using mean field models. However, unlike our work, these
etc.), can replace or augment the objective of maximizing λc in studies assume uniform traffic, infinite buffers, and the esti-
order to take the full advantage of the inserted links. mates are valid only for regular grids without long-range con-
nections.
C. Iterative link addition algorithm The key idea of our contribution is to reduce the estimation
We propose an efficient iterative algorithm that inserts the of critical point of the network to just one parameter that can be
most beneficial long-range links to the current configuration of computed analytically, much faster than simulation. This is
the network, provided that the available resources are not used important since using very accurate estimates obtained through
up yet. The link insertion algorithm is summarized in Figure 3. simulation would be simply too costly to use within an optimi-
zation loop. The optimization goal can still be achieved using
1. In our terminology, the repeaters act primarily as storage elements the simple parameter, as long as the comparison between two
(like FIFO buffers), as explained in Section IV in more detail. network configurations matches the one with the critical load.
In the following, we relate λ c to the free packet delay (τ0) 8x8 Mesh Network
in the network, which can be efficiently computed. Let the λ
number of messages in the network (at time t) be N ( t ) and the Nave / τave
aggregated packet injection rate be λ , i.e.
packets/cycle
Nave / τ0
λ = ∑ λi , λi :the injection rate of tile i ∈ T
i∈T
In the free state (i.e. when λ < λ c ), the network is in the
steady-state, so the packet injection rate equals the packet ejec-
tion rate. As a result, we can equate the injection and ejection
rates to obtain the following approximation:
N ave (3) λ (packets/cycle)
λ ≈ ----------- Figure 4. Experimental verification of Equation 3 and Equation 4
τ ave
for a 8x8 mesh network.
where τave is the average time each packet spends in the net- line with triangular markers in Figure 4 illustrates the upper
work, and Nave = <N ( t )> is the average number of packets in bound given in Equation 4. We observe that this expression pro-
the network. The exact value of τave is a function of the traffic vides a good approximation at lower data rates and holds the
injection rate, as well as topology, routing strategy, etc. While upper bound property.
no exact analytical model is available in the literature, we Computation of τ0
observe that τave shows a weak dependence on the traffic injec- For arbitrary traffic patterns characterized by the communi-
tion rate when the network is in the free state. Hence, τ0 can be cation frequencies fij ∀i, j ∈ T , τ0 can be written as
used to approximate τave. If we denote the average number of
c L
packets in the network, at the onset of the criticality, by Nave , τ0 = ∑ ∑ fij d ( i, j ) ( t r + t s + t w ) + max ( ts + t w ) -----
W
(5)
we can write the following relation: i j≠i
c
Nave (4) where d ( i, j ) is the distance from routers i to router j, and tr, ts,
λ c ≈ ------------- tw are the is architectural parameters representing time to make
τ0
the routing decision, traverse the switch and the link, respec-
This approximation is also an upper bound for the critical tively. Finally, L is the length of the packet, while W is the width
load λ c , since τ 0 ≤ τ ave ( λ c ). We note that λ = N ave ⁄ τ ave fol- of the network channel.
lows also Little’s law. Indeed, other theoretical studies proposed For the standard mesh network, the Manhattan distance
c
to approximate λ c = Nave ⁄ τ 0 using mean field [16] and distance (dM) is used to compute d ( i, j ), i.e.
models [17] under uniform traffic.
Given that the number of messages in the network, at the d M ( i, j ) = i x – j x + i y – j y (6)
c
onset of the criticality, is bounded by its capacity, Nave , the crit-
where subscripts x and y denote the x-y coordinates, respec-
ical traffic load λ c and the average packet latency are inversely
tively. For the routers with long-range links, an extended dis-
proportional to each other. Indeed, if the average packet latency
tance definition is needed in order to take the long-range
decreases, the phase transition point is delayed, as demonstrated
connections into account. Hence, we use the following general-
in Figure 2, where the reduction in the latency is obtained due
ized definition:
to the long-range links. Our optimization technique uses the
relationship between λ c and τave to maximize the critical load. ⎧ d M ( i, j ) if no long-range link is attached to i (7)
More specifically, we minimize τ0 which can be efficiently d ( i, j )= ⎨
⎩ min ( d M ( i, j ), 1 + d M ( k, j ) ) if l ( i, k ) exists
computed in the optimization loop, as opposed to λ c for which
there is no known analytical result to date. In this equation, l(i,k) means that node i is connected to
Experimental Verification of the Equations 3 and 4 node k via a long-range link. The applicability of the distance
For completeness, we verified experimentally both definition is illustrated in Figure 5. Note that the distance com-
Equation 3 and Equation 4, as shown in Figure 4. The dotted putation does not require any global knowledge thus making the
line shows the actual packet injection rate (λ), for reference. routing decision algorithm decentralized.
The solid line with the square marker is obtained for a 8 × 8
network under the hotspot traffic, as the ratio between the aver-
age number of packets in the network and the average packet
delay at that particular injection rate. From these plots, it can be
clearly seen that there is a good agreement between the actual
value obtained through simulation and the one predicted by
Equation 3 before entering the criticality. Since the network is
not in steady-state beyond the critical traffic value, the
Equation 3 does not hold for higher injection rates. As men-
tioned before, the exact value of the average packet delay at a
given load, τ ( λ ), can only be found by simulation. The dashed Figure 5. Illustration of distance definition (see Eqn. 7).
Interestingly enough, Equation 4 also confirms that the as being illegal, as in Figure 7. Therefore, we check whether or
average number of hops between the nodes, a common perfor- not the long-range links cause any of these prohibited turns. If it
mance metric, has indeed a positive impact on the dynamic is legal to use a long-range link, then the packet is forwarded to
behavior of the network. this link. Otherwise, the default routing algorithm is employed.
E. Routing strategy for long-range links
The routers without any extra link use the default routing
strategy. Defining a strategy for the routers with long-range
links is a necessity dictated by two factors:
• Without a customized mechanism in place, the newly added
long-range links cannot be utilized at all by the default routing
strategy;
• Introducing long-range links may result in cycling depen-
Figure 7. Possible routing directions, basic and extended turn models.
dencies. Therefore, arbitrary use of these links may result in
deadlock states.
The routing strategy proposed in this section tries to pro- A long-range link may become a traffic attractor and jam
duce minimal paths towards the destination by utilizing the the network earlier than the regular short links. For this reason,
long-range links effectively, as shown in Figure 6. To this end, if the routing algorithm is not adaptive (as it is our case), one
we first check whether or not there exists a long-range connec- more check point is needed before assigning a traffic stream to
tion to the current router. If there is one, the distance to the des- a new link. For instance, by assessing the amount of traffic
tination with and without the long-range link is computed using assigned to a long-range link, further traffic can be routed over
Equation 7. It is interesting to note that we can obtain global the link only if it is not likely to become a bottleneck.
improvements in the network dynamics by using local informa- IV. PRACTICAL CONSIDERATIONS
tion only.
We analyze next the implications of customizing the regular
If a newly added long-range link produces a shorter distance
network architecture with long-range links on the actual design
to the destination, we check whether or not using this link may
implementation.
cause deadlock before accepting it as a viable route. In order to
guarantee that using the newly added link does not cause a A. Implementation of long-range links
deadlock state, some limitations on the use of long-range links In order to preserve the advantages of structured wiring, the
are introduced. long-range links are segmented into regular, fixed-length, net-
We achieve deadlock-free operation by extending the turn- work links connected by repeaters. The use of repeaters with
model [13] to long-range links. More precisely, in the original buffering capabilities guarantees latency-insensitive operation,
turn model, one out of four possible turns is prohibited to avoid as discussed in [19]. Repeaters can be thought as simplified
cyclic dependencies (Figure 7). However, unlike standard links, routers consisting of only two ports that accept an incoming flit,
the long-range links can extend in two directions, such as NE, stores it in a FIFO buffer, and finally forwards it to the output
NW, etc. For this reason, one has to consider the rotation from port, as illustrated in Figure 8. If the depth of buffers in the
middle directions, NE, NW, SE, SW to the main directions N, repeaters is at least 2 flits, then the packets can be effectively
S, E and W. In the extended model, we arbitrarily chose the S- pipelined to take the full advantage of the long-range links [20].
to-E, S-to-W, SE-to-E, SE-to-W, SW-to-E, and SW-to-W turns The final consideration in terms of implementation is the
increased size of the routers with extra links due to increased
number of ports (Figure 8). To measure the area overhead, we
implemented routers with 5 and 6 ports using Verilog and syn-
thesized them for a 1M gate Xilinx Virtex2 FPGA. The router
with 5 ports utilizes 387 slices (about 7% of total resources),
while the one with 6 ports utilizes 471 slices of the target
device. We also synthesized a pure 4×4 mesh network, and a
4×4 mesh network with 4 long-range links and observed that
the extra links induce about 10% area overhead. This overhead
has to be taken into account, while computing the maximum
number of long-range links that can be added to the regular
mesh network. While there is no theoretical limitation imposed
by our approach on the number of additional links a router can
have, a maximum of one long-range link per router is used in
our experiments. This way the regularity is minimally altered,
while still providing significant improvements over the stan-
Figure 6. The description of the routing strategy. dard mesh networks, as explained in Section V.
The worst-case complexity of the technique, that is link
α
insertion and routing table generation, is O ( SN ) where
(a) 2< α < 3 . The run-time of the algorithm for the examples ana-
lyzed ranges from 0.14sec for a 4×4 network to less than half
hour for a 8 × 8 network on a Pentium III machine with 768MB
memory under Linux OS.
A. Experiments with synthetic traffic workloads
(b)
We first demonstrate the effectiveness of adding long-range
links to standard mesh networks by using synthetic traffic
inputs. Table 1 compares the critical traffic load, average packet
latency and throughput at the edge of criticality under hotspot
traffic pattern for 4×4 and 6 × 6 networks. For the hotspot traf-
fic three nodes are selected arbitrarily to act as hotspot nodes.
Figure 8. Implementation of the repeaters. Routers 1 and 3 are both
connected by Router 2 (a), the underlying mesh network, and the Each node in the network sends packets to these hotspot nodes
inserted long-range link (b). with a higher probability compared to the remaining nodes.
Critical load Latency at the critical load
B. Energy-related considerations (packet/cycle) (cycles)
One can measure the energy consumption using the Ebit
λ Mc λ Lc L M ( λ Mc ) L L ( λ Mc )
metric [12], defined as the energy required to transmit one bit of
information from the source to the destination. Ebit is given by hotspot 4x4 0.41 0.50 196.9 34.4
E bit = E L + E B + E (8) hotspot 6x6 0.62 0.75 224.5 38.2
bit bit S bit
Table 1: Critical load (packet/cycle) and latency comparison
where E Lbit , E Bbit and E S bit represent the energy consumed by (cycles) for regular mesh (M) and mesh with long links (L).
the interconnect, buffering and switching in the router, respec- As shown in Table 1, inserting 4 long-range links (consist-
tively. Analyzing the energy consumption before and after the ing of 10 short link segments) to a 4×4 network makes the
insertion of long-range links shows that the proposed approach phase transition region shift from 0.41 packet/cycle to 0.50
does not induce a significant penalty in the total communication packet/cycle (the resulting network appears in Figure 1). Simi-
energy consumption of the network. Indeed, since the long- larly, the average packet latency at 0.41 packet/cycle injection
range links consist of several regular links with repeaters rate drops from 196.9 to 34.4 cycles. We also show the variation
between them (instead of routers), the link energy consumption of the network throughput and average packet latency as a func-
stays approximately the same whether the traffic flows over the tion of traffic injection rate using a much denser scale in
long-range link or over the original path provided by the mesh Figure 9.
network. On the other hand, the energy consumption due to the
switch and routing logic is greatly simplified in the repeater Hotspot Benchmark
design compared to the original routers. This results in a reduc-
tion in the switching energy consumption. Finally, the routers
with extra links will have slightly increased energy consump-
tion due to the larger crossbar switch.
We compared the energy consumption obtained by simula-
tion before and after the insertion of the long-range links, for
the traffic patterns reported in Section 6. We observed that the
link and buffer energy consumptions increase about 2% after
the insertion of long-range links, while the switch energy con-
sumption drops about 7%, on average. The results show that the
overall energy consumption increases by only about 1%.
V. EXPERIMENTAL RESULTS
We present next an extensive experimental study involving
a set of benchmarks with synthetic and real traffic patterns. The
NoCs under study are simulated using an in-house cycle accu-
rate C++-based NoC simulator developed specifically for this
project. The simulator models the long-range links precisely as
explained in Section IV. The configuration files describing the
additional long-range links and the routing strategy for a given
traffic pattern are generated using the proposed technique and Figure 9. Traffic injection rate vs. average packet latency and
network throughput for hotspot traffic is shown. The
supplied to the simulator as an input. improvement in terms of critical point and latency values at
criticality are indicated on the plots.
Similar results have been obtained for a 6 × 6 grid, as
shown in Table 1. In this case, the phase transition region shifts
from 0.62 packet/cycle to 0.75 packet/cycle. Likewise, with the
addition of long-range links, the average packet latency at 0.62
packet/cycle injection rate drops from 224.5 to 38.2 cycles.
Comparison with the torus network
We also compared the performance of the proposed
approach against that achievable with a torus network, which
provides wrap around links added in a systematic manner. Our
simulations show that application-specific insertion of only 4 (a) (b)
long-range links, with an overhead of 12 extra standard link Figure 10. Scalability results. Performance of the proposed technique
segments, provides 4% improvement in the critical traffic load for larger network sizes.
compared to a 4×4 torus under hotspot traffic. Furthermore, Comparison with the mesh network with extra buffers
the average packet latency at the critical point of the torus net- Implementation of long-range links requires buffers in the
work, 0.48 packet/cycle, drops from 77.0 to 34.4 cycles. This repeaters. To demonstrate that the savings are the result of using
significant gain is obtained over the standard torus network by the long-range links, we also added extra amount of buffers to
utilizing only half of the additional links, since the extra links the corresponding channels of the pure mesh network, equal to
are inserted in a smart way considering the underlying applica- the amount of buffers utilized for the long-range links.
tion rather than blindly adding wrap-around channels. Table 2 summarizes the results for standard mesh network
Scalability Analysis (M), standard mesh network with extra buffers (MB) and the
To evaluate the scalability of the proposed technique, we network with long-range links (L). We observe that insertion of
also performed experiments involving networks of sizes rang- buffers improves the critical load by 3.5% for the auto industry
ing from 4×4 to 10 ×10 . Figure 10 shows that the proposed benchmark. On the other hand, the corresponding improvement
technique results in consistent improvements when the network due to long-range links is 13.6% over initial mesh network and
size scales up. For example, by inserting only 6 long-range 10% over the mesh network with additional buffers. Likewise,
links, consisting of 32 regular links in total, the critical load of a we note that with the insertion of long-range links, the average
10 ×10 network under hotspot traffic shifts from 1.18 packet/ packet latency reduces by 69% compared to the original value
cycle to 1.40 packet/cycle giving a 18.7% improvement. This and 57.0% compared to the mesh network with extra buffers.
result is similar to the gain obtained for smaller networks. Consistent results have been obtained for the synthetic traf-
Figure 10(a) also reveals that the critical traffic load grows with fic workloads mentioned in the previous section and for the
the network size due to the increase in the total bandwidth. telecom benchmark. Due to the limited space, we report only
Likewise, we observe consistent reduction in the average packet the results for telecom benchmark (Table 2). The results show
latency across different network sizes, as shown in that, with the addition of extra buffers, the critical traffic point
Figure 10(b). shifts only from 0.44 packet/cycle to 0.46 packet/cycle. Insert-
ing long-range links, on the other hand, shifts the critical point
B. Experiments involving real applications
to 0.60 packet/cycle which is a huge improvement in the net-
In this section, we evaluate the performance of the link work capability. Similarly, the average packet latency obtained
insertion algorithm using two applications with realistic traffic: by the proposed technique is almost 1/3 of the latency provided
A 4x4 auto industry benchmark and a 5x5 telecom benchmark by standard mesh and about 1/2 of the latency provided by
retrieved from E3S benchmark suite [21]. mesh with extra buffers.
The variation of average packet latency and network
throughput as a function of traffic injection rates for auto indus- Critical load Latency at critical load
try benchmark is given in Figure 11. These plots show that the (packet/cycle) (cycles)
insertion of long-range links shifts the critical traffic load from L ( λ Mc )
λc
0.29 packet/cycle to 0.33 packet/cycle resulting in a 13.6%
improvement. Similarly, as shown in Table 2, we observe that auto-indust M 0.29 98.0
the average packet latency for the network with long-range auto-indust MB 0.30 70.5
links is consistently lower compared to that of a pure mesh net- auto-indust L 0.33 30.3
work. For instance, at 0.29 packet/cycle injection rate, the telecom M 0.44 73.1
latency drops from 98.0 cycles to 30.3 cycles giving about
telecom MB 0.46 56.0
69.0% reduction.
telecom L 0.60 28.2
Similar improvements have been observed for the telecom
benchmark as shown in Figure 12. Specifically, the critical traf- Table 2: Critical load (packet/cycle) and latency (cycles)
fic load is delayed from 0.44 packet/cycle to 0.60 packet/cycle comparison for pure mesh (M), mesh with extra buffers
showing a 36.3% improvement due to the addition of long- (MB) and mesh with long links (L).
range links. Likewise, the latency at 0.44 packet/cycle traffic
injection rate drops from 73.1 cycles to 28.2 cycles (Table 2).
Figure 11. Traffic rate vs. packet latency and network Figure 12. Traffic injection rate vs. average packet
throughput for auto industry benchmark. latency and network throughput for telecom benchmark.
VI. CONCLUSION AND FUTURE WORK [7] S. Murali, G. De Micheli, “SUNMAP: A tool for automatic topol-
ogy selection and generation for NoCs,” In Proc. DAC, June 2004.
We have presented a design methodology to insert applica-
[8] A. Pinto, L. P. Carloni, A. L. Sangiovanni-Vincentelli, “Efficient
tion-specific long-range links to standard grid-like networks. It synthesis of networks on chip,” In Proc. ICCD, Oct. 2003.
is analytically and experimentally demonstrated that insertion [9] K. Srinivasan, K. S. Chatha, G. Konjevod “Linear programming
of long-range links has an important impact on the dynamics, as based techniques for synthesis of Network-on-Chip architectures,” In
well as static properties of the network. Specifically, additional Proc. ICCD, Oct. 2004.
long-range links increase the critical traffic workload. We have [10] U. Y. Ogras, R. Marculescu, “Energy- and performance- driven
customized architecture synthesis using a decomposition approach,”
also demonstrated that this increase means significant reduction In Proc. DATE, March 2005.
in the average packet latency in the network, as well as [11] A. Jalabert, S. Murali, L. Benini, G. De Micheli, “xpipesCom-
improvement in the achievable throughput. piler: A tool for instantiating application specific networks on chip,”
Our current work employs oblivious routing to utilize the In Proc. DATE, March 2004.
long-range links. We plan to extend this work to employ adap- [12] T. T. Ye, L. Benini, G. De Micheli, “Analysis of power consump-
tive routing instead. Other possible extensions include inserting tion on switch fabrics in network routers,” In Proc. DAC, June 2003.
long-range links for different objective functions such as fault- [13] C. J. Glass, L. M. Ni, “The turn model for adaptive routing,” In
Proc. ISCA, May 1992.
tolerance and QoS operation.
[14] D. J. Watts, S. H. Strogatz, “Collective dynamics of 'small-
Acknowledgements: This research is supported by world' networks,” Nature, 393:440-442, 1998.
Marco GSRC, NSF CCR-00-93104, and SRC 2004-HJ-1189. [15] J. Kleinberg, “The Small-World phenomenon and decentralized
search,” SIAM News 37(3), April 2004.
VII. REFERENCES [16] H. Fuks, A. Lawniczak, “Performance of data networks with
random links,” Mathematics and Computers in Simulation, vol 51,
[1] W. Dally, B. Towles, “Route packets, not wires: On-chip intercon- 1999.
nection networks,” In Proc. DAC, June 2001.
[17] M. Woolf, D. Arrowsmith, R. J. Mondragon, J. M. Pitts, “Opti-
[2] L. Benini, G. De Micheli. “Networks on chips: A new SoC para- mization and phase transitions in a chaotic model of data traffic,”
digm,” IEEE Computer. 35(1), 2002. Physical Review E, vol. 66 (2002).
[3] A. Jantsch, H. Tenhunen (Eds.). Networks on Chip. Kluwer, [18] J. Duato, S. Yalamanchili, N. Lionel, Interconnection Networks:
2003. an Engineering Approach. Morgan Kaufmann, 2002.
[4] M. Millberg, E. Nilsson, R. Thid, S. Kumar, A. Jantsch, “The [19] L. P. Carloni, K. L. McMillan, A. L. Sangiovanni-Vincentelli,
Nostrum backbone - a communication protocol stack for networks on “Theory of latency-insensitive design,” IEEE Trans. on CAD of Inte-
chip,” In Proc. VLSI Design, Jan. 2004. grated Circuits and Systems. 20(9), 2001.
[5] K. Goossens, et. al. “A Design flow for application-specific net- [20] V. Chandra, A. Xu, H. Schmit, L. Pileggi, “An interconnect
works on chip with guaranteed performance to accelerate SOC channel design methodology for high performance integrated cir-
design and verification,” In Proc. DATE, March 2005. cuits,” In Proc. DATE, Feb. 2004.
[6] J. Hu, R. Marculescu, “Exploiting the routing flexibility for [21] R. Dick, “Embedded system synthesis benchmarks suites
energy/performance aware mapping of regular NoC architectures,” (E3S),” http://helsinki.ee.princeton.edu/dickrp/e3s/.
In Proc. DATE, March 2003.

ICCAD05 Long Links Final v5

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

ICCAD05 Long Links Final v5

Uploaded by

Copyright:

Available Formats

Application-Specific Network-on-Chip Architecture

Customization via Long-Range Link Insertion

Abstract Networks-on-Chip (NoCs) represent a promising

III. LONG-RANGE LINK INSERTION ALGORITHM

You might also like