Figure 4. Experimental verification of Equation 3 and Equation 4 for an 8x8 mesh network.

λ = ∑_{i ∈ T} λ_i , where λ_i is the injection rate of tile i ∈ T.

In the free state (i.e., when λ < λ_c), the network is in the steady state, so the packet injection rate equals the packet ejection rate. As a result, we can equate the injection and ejection rates to obtain the following approximation:

λ ≈ N_ave / τ_ave  (packets/cycle)    (3)
where τ_ave is the average time each packet spends in the network, and N_ave = ⟨N(t)⟩ is the average number of packets in the network. The exact value of τ_ave is a function of the traffic injection rate, as well as topology, routing strategy, etc. While no exact analytical model is available in the literature, we observe that τ_ave shows a weak dependence on the traffic injection rate when the network is in the free state. Hence, τ_0 can be used to approximate τ_ave. If we denote the average number of packets in the network, at the onset of the criticality, by N_ave^c, we can write the following relation:

λ_c ≈ N_ave^c / τ_0    (4)

This approximation is also an upper bound for the critical load λ_c, since τ_0 ≤ τ_ave(λ_c). We note that λ = N_ave / τ_ave also follows Little's law. Indeed, other theoretical studies proposed to approximate λ_c = N_ave^c / τ_0 using mean field [16] and distance models [17] under uniform traffic.

Given that the number of messages in the network at the onset of the criticality, N_ave^c, is bounded by the network capacity, the critical traffic load λ_c and the average packet latency are inversely proportional to each other. Indeed, if the average packet latency decreases, the phase transition point is delayed, as demonstrated in Figure 2, where the reduction in the latency is obtained due to the long-range links. Our optimization technique uses the relationship between λ_c and τ_ave to maximize the critical load. More specifically, we minimize τ_0, which can be efficiently computed in the optimization loop, as opposed to λ_c, for which there is no known analytical result to date.

Experimental Verification of Equations 3 and 4

For completeness, we verified experimentally both Equation 3 and Equation 4, as shown in Figure 4. The dotted line shows the actual packet injection rate (λ) for reference. The solid line with the square marker is obtained for an 8×8 network under hotspot traffic, as the ratio between the average number of packets in the network and the average packet delay at that particular injection rate. From these plots, it can be clearly seen that there is good agreement between the actual value obtained through simulation and the one predicted by Equation 3 before entering the criticality. Since the network is not in steady state beyond the critical traffic value, Equation 3 does not hold for higher injection rates. As mentioned before, the exact value of the average packet delay at a given load, τ(λ), can only be found by simulation. The dashed line with triangular markers in Figure 4 illustrates the upper bound given in Equation 4. We observe that this expression provides a good approximation at lower data rates and holds the upper bound property.

Computation of τ_0

For arbitrary traffic patterns characterized by the communication frequencies f_ij, ∀i, j ∈ T, τ_0 can be written as

τ_0 = ∑_i ∑_{j≠i} f_ij [ d(i,j) (t_r + t_s + t_w) + max(t_s, t_w) · L/W ]    (5)

where d(i,j) is the distance from router i to router j, and t_r, t_s, t_w are the architectural parameters representing the time to make the routing decision, traverse the switch, and traverse the link, respectively. Finally, L is the length of the packet, while W is the width of the network channel.

For the standard mesh network, the Manhattan distance (d_M) is used to compute d(i,j), i.e.

d_M(i,j) = |i_x − j_x| + |i_y − j_y|    (6)

where the subscripts x and y denote the x-y coordinates, respectively. For the routers with long-range links, an extended distance definition is needed in order to take the long-range connections into account. Hence, we use the following generalized definition:

d(i,j) = d_M(i,j),                        if no long-range link is attached to i
d(i,j) = min( d_M(i,j), 1 + d_M(k,j) ),   if l(i,k) exists    (7)

In this equation, l(i,k) means that node i is connected to node k via a long-range link. The applicability of the distance definition is illustrated in Figure 5. Note that the distance computation does not require any global knowledge, thus making the routing decision algorithm decentralized.

Figure 5. Illustration of distance definition (see Eqn. 7).
Interestingly enough, Equation 4 also confirms that the average number of hops between the nodes, a common performance metric, has indeed a positive impact on the dynamic behavior of the network.

E. Routing strategy for long-range links

The routers without any extra link use the default routing strategy. Defining a strategy for the routers with long-range links is a necessity dictated by two factors:
• Without a customized mechanism in place, the newly added long-range links cannot be utilized at all by the default routing strategy;
• Introducing long-range links may result in cyclic dependencies. Therefore, arbitrary use of these links may result in deadlock states.

The routing strategy proposed in this section tries to produce minimal paths towards the destination by utilizing the long-range links effectively, as shown in Figure 6. To this end, we first check whether or not there exists a long-range connection at the current router. If there is one, the distance to the destination with and without the long-range link is computed using Equation 7. It is interesting to note that we can obtain global improvements in the network dynamics by using local information only.

If a newly added long-range link produces a shorter distance to the destination, we check whether or not using this link may cause deadlock before accepting it as a viable route. In order to guarantee that using the newly added link does not cause a deadlock state, some limitations on the use of long-range links are introduced.

We achieve deadlock-free operation by extending the turn model [13] to long-range links. More precisely, in the original turn model, one out of four possible turns is prohibited to avoid cyclic dependencies (Figure 7). However, unlike standard links, the long-range links can extend in two directions, such as NE, NW, etc. For this reason, one has to consider the rotation from the middle directions NE, NW, SE, SW to the main directions N, S, E and W. In the extended model, we arbitrarily chose the S-to-E, S-to-W, SE-to-E, SE-to-W, SW-to-E, and SW-to-W turns as being illegal, as in Figure 7. Therefore, we check whether or not the long-range links cause any of these prohibited turns. If it is legal to use a long-range link, then the packet is forwarded to this link. Otherwise, the default routing algorithm is employed.

Figure 7. Possible routing directions, basic and extended turn models.

A long-range link may become a traffic attractor and jam the network earlier than the regular short links. For this reason, if the routing algorithm is not adaptive (as is the case here), one more checkpoint is needed before assigning a traffic stream to a new link. For instance, by assessing the amount of traffic assigned to a long-range link, further traffic can be routed over the link only if it is not likely to become a bottleneck.

IV. PRACTICAL CONSIDERATIONS

We analyze next the implications of customizing the regular network architecture with long-range links on the actual design implementation.

A. Implementation of long-range links

In order to preserve the advantages of structured wiring, the long-range links are segmented into regular, fixed-length network links connected by repeaters. The use of repeaters with buffering capabilities guarantees latency-insensitive operation, as discussed in [19]. Repeaters can be thought of as simplified routers consisting of only two ports that accept an incoming flit, store it in a FIFO buffer, and finally forward it to the output port, as illustrated in Figure 8. If the depth of the buffers in the repeaters is at least 2 flits, then the packets can be effectively pipelined to take full advantage of the long-range links [20].
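As an illustration of the repeater structure described above (a two-port stage with a small FIFO), the sketch below models its accept/store/forward behavior at the flit level. The class name and the cycle-level interface are our own illustrative choices, not the paper's actual implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <optional>

using Flit = std::uint32_t;

// A repeater: a simplified two-port router that accepts an incoming flit,
// stores it in a FIFO buffer, and forwards it to the output port.
class Repeater {
public:
    explicit Repeater(std::size_t depth = 2) : depth_(depth) {}  // >= 2 flits
                                                                 // for pipelining
    bool canAccept() const { return fifo_.size() < depth_; }

    // Upstream side: accept one flit if a buffer slot is free; otherwise
    // signal backpressure to the upstream link segment.
    bool push(Flit f) {
        if (!canAccept()) return false;
        fifo_.push_back(f);
        return true;
    }

    // Downstream side: forward the oldest buffered flit, if any.
    std::optional<Flit> pop() {
        if (fifo_.empty()) return std::nullopt;
        Flit f = fifo_.front();
        fifo_.pop_front();
        return f;
    }

private:
    std::size_t depth_;
    std::deque<Flit> fifo_;
};
```

With a depth of at least 2 flits, a repeater can accept a new flit in the same cycle it forwards the previous one, so a chain of such stages sustains one flit per cycle, which is the pipelining property noted above.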
increased size of the routers with extra links due to increased
number of ports (Figure 8). To measure the area overhead, we
implemented routers with 5 and 6 ports using Verilog and syn-
thesized them for a 1M gate Xilinx Virtex2 FPGA. The router
with 5 ports utilizes 387 slices (about 7% of total resources),
while the one with 6 ports utilizes 471 slices of the target
device. We also synthesized a pure 4×4 mesh network, and a
4×4 mesh network with 4 long-range links and observed that
the extra links induce about 10% area overhead. This overhead
has to be taken into account, while computing the maximum
number of long-range links that can be added to the regular
mesh network. While there is no theoretical limitation imposed
by our approach on the number of additional links a router can
have, a maximum of one long-range link per router is used in
our experiments. This way the regularity is minimally altered,
while still providing significant improvements over the stan-
Figure 6. The description of the routing strategy. dard mesh networks, as explained in Section V.
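The decision flow summarized in Figure 6 combines the generalized distance of Equation 7 with the extended turn model: a long-range link is taken only if it shortens the path and the resulting turn is legal. A minimal sketch follows; the function names and the precomputed-distance arguments are illustrative, not the paper's interface.

```cpp
#include <set>
#include <utility>

// Directions in the extended turn model: long-range links may point in the
// "middle" directions (NE, NW, SE, SW) in addition to N, S, E, W.
enum class Dir { N, S, E, W, NE, NW, SE, SW };

// Turns prohibited by the extended turn model (Figure 7):
// S-to-E, S-to-W, SE-to-E, SE-to-W, SW-to-E, SW-to-W.
bool turnAllowed(Dir from, Dir to) {
    static const std::set<std::pair<Dir, Dir>> forbidden = {
        {Dir::S,  Dir::E}, {Dir::S,  Dir::W},
        {Dir::SE, Dir::E}, {Dir::SE, Dir::W},
        {Dir::SW, Dir::E}, {Dir::SW, Dir::W},
    };
    return forbidden.count({from, to}) == 0;
}

// Decision at a router with a long-range link: forward the packet over the
// link only if it yields a shorter Equation 7 distance to the destination
// AND the turn is legal; otherwise fall back to the default routing algorithm.
bool useLongRangeLink(int distViaLink, int distDefault,
                      Dir incoming, Dir linkDir) {
    return distViaLink < distDefault && turnAllowed(incoming, linkDir);
}
```

Note that both checks use purely local information (the attached link and two distance values), which is what keeps the routing decision decentralized.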
Figure 8. Implementation of the repeaters. Routers 1 and 3 are connected through Router 2 by (a) the underlying mesh network and (b) the inserted long-range link.

B. Energy-related considerations

One can measure the energy consumption using the E_bit metric [12], defined as the energy required to transmit one bit of information from the source to the destination. E_bit is given by

E_bit = E_Lbit + E_Bbit + E_Sbit    (8)

where E_Lbit, E_Bbit and E_Sbit represent the energy consumed by the interconnect, the buffering and the switching in the router, respectively. Analyzing the energy consumption before and after the insertion of long-range links shows that the proposed approach does not induce a significant penalty in the total communication energy consumption of the network. Indeed, since the long-range links consist of several regular links with repeaters between them (instead of routers), the link energy consumption stays approximately the same whether the traffic flows over the long-range link or over the original path provided by the mesh network. On the other hand, the switch and routing logic is greatly simplified in the repeater design compared to the original routers. This results in a reduction in the switching energy consumption. Finally, the routers with extra links will have slightly increased energy consumption due to the larger crossbar switch.

We compared the energy consumption obtained by simulation before and after the insertion of the long-range links, for the traffic patterns reported in Section V. We observed that the link and buffer energy consumptions increase by about 2% after the insertion of long-range links, while the switch energy consumption drops by about 7%, on average. The results show that the overall energy consumption increases by only about 1%.

The worst-case complexity of the technique, that is, link insertion and routing table generation, is O(SN^α), where 2 < α < 3. The run-time of the algorithm for the examples analyzed ranges from 0.14 sec for a 4×4 network to less than half an hour for an 8×8 network on a Pentium III machine with 768MB memory under Linux OS.

V. EXPERIMENTAL RESULTS

We present next an extensive experimental study involving a set of benchmarks with synthetic and real traffic patterns. The NoCs under study are simulated using an in-house cycle-accurate C++-based NoC simulator developed specifically for this project. The simulator models the long-range links precisely, as explained in Section IV. The configuration files describing the additional long-range links and the routing strategy for a given traffic pattern are generated using the proposed technique and supplied to the simulator as an input.

A. Experiments with synthetic traffic workloads

We first demonstrate the effectiveness of adding long-range links to standard mesh networks by using synthetic traffic inputs. Table 1 compares the critical traffic load, average packet latency and throughput at the edge of criticality under the hotspot traffic pattern for 4×4 and 6×6 networks. For the hotspot traffic, three nodes are selected arbitrarily to act as hotspot nodes. Each node in the network sends packets to these hotspot nodes with a higher probability compared to the remaining nodes.

Table 1: Critical load (packet/cycle) and latency (cycles) comparison for regular mesh (M) and mesh with long links (L).

                  Critical load (packet/cycle)   Latency at the critical load (cycles)
                  λ_c^M      λ_c^L               L_M(λ_c^M)      L_L(λ_c^M)
  hotspot 4x4     0.41       0.50                196.9           34.4
  hotspot 6x6     0.62       0.75                224.5           38.2

As shown in Table 1, inserting 4 long-range links (consisting of 10 short link segments) to a 4×4 network makes the phase transition region shift from 0.41 packet/cycle to 0.50 packet/cycle (the resulting network appears in Figure 1). Similarly, the average packet latency at 0.41 packet/cycle injection rate drops from 196.9 to 34.4 cycles. We also show the variation of the network throughput and average packet latency as a function of the traffic injection rate, using a much denser scale, in Figure 9.

Figure 9. Traffic injection rate vs. average packet latency and network throughput for hotspot traffic. The improvements in terms of critical point and latency values at criticality are indicated on the plots.

Similar results have been obtained for a 6×6 grid, as shown in Table 1. In this case, the phase transition region shifts from 0.62 packet/cycle to 0.75 packet/cycle. Likewise, with the addition of long-range links, the average packet latency at 0.62 packet/cycle injection rate drops from 224.5 to 38.2 cycles.
Comparison with the torus network

We also compared the performance of the proposed approach against that achievable with a torus network, which provides wrap-around links added in a systematic manner. Our simulations show that application-specific insertion of only 4 long-range links, with an overhead of 12 extra standard link segments, provides a 4% improvement in the critical traffic load compared to a 4×4 torus under hotspot traffic. Furthermore, the average packet latency at the critical point of the torus network, 0.48 packet/cycle, drops from 77.0 to 34.4 cycles. This significant gain is obtained over the standard torus network by utilizing only half of the additional links, since the extra links are inserted in a smart way considering the underlying application, rather than by blindly adding wrap-around channels.

Scalability Analysis

To evaluate the scalability of the proposed technique, we also performed experiments involving networks of sizes ranging from 4×4 to 10×10. Figure 10 shows that the proposed technique results in consistent improvements when the network size scales up. For example, by inserting only 6 long-range links, consisting of 32 regular links in total, the critical load of a 10×10 network under hotspot traffic shifts from 1.18 packet/cycle to 1.40 packet/cycle, giving an 18.7% improvement. This result is similar to the gain obtained for smaller networks. Figure 10(a) also reveals that the critical traffic load grows with the network size due to the increase in the total bandwidth. Likewise, we observe a consistent reduction in the average packet latency across different network sizes, as shown in Figure 10(b).

Figure 10. Scalability results. Performance of the proposed technique for larger network sizes.

B. Experiments involving real applications

In this section, we evaluate the performance of the link insertion algorithm using two applications with realistic traffic: a 4x4 auto industry benchmark and a 5x5 telecom benchmark retrieved from the E3S benchmark suite [21].

The variation of average packet latency and network throughput as a function of the traffic injection rate for the auto industry benchmark is given in Figure 11. These plots show that the insertion of long-range links shifts the critical traffic load from 0.29 packet/cycle to 0.33 packet/cycle, resulting in a 13.6% improvement. Similarly, as shown in Table 2, we observe that the average packet latency for the network with long-range links is consistently lower compared to that of a pure mesh network. For instance, at 0.29 packet/cycle injection rate, the latency drops from 98.0 cycles to 30.3 cycles, giving about a 69.0% reduction.

Similar improvements have been observed for the telecom benchmark, as shown in Figure 12. Specifically, the critical traffic load is delayed from 0.44 packet/cycle to 0.60 packet/cycle, showing a 36.3% improvement due to the addition of long-range links. Likewise, the latency at 0.44 packet/cycle traffic injection rate drops from 73.1 cycles to 28.2 cycles (Table 2).

Figure 11. Traffic rate vs. packet latency and network throughput for the auto industry benchmark.

Figure 12. Traffic injection rate vs. average packet latency and network throughput for the telecom benchmark.

Comparison with the mesh network with extra buffers

Implementation of long-range links requires buffers in the repeaters. To demonstrate that the savings are the result of using the long-range links, we also added an extra amount of buffers to the corresponding channels of the pure mesh network, equal to the amount of buffers utilized for the long-range links.

Table 2 summarizes the results for the standard mesh network (M), the standard mesh network with extra buffers (MB) and the network with long-range links (L). We observe that the insertion of buffers improves the critical load by 3.5% for the auto industry benchmark. On the other hand, the corresponding improvement due to long-range links is 13.6% over the initial mesh network and 10% over the mesh network with additional buffers. Likewise, we note that with the insertion of long-range links, the average packet latency reduces by 69% compared to the original value and 57.0% compared to the mesh network with extra buffers.

Consistent results have been obtained for the synthetic traffic workloads mentioned in the previous section and for the telecom benchmark. Due to the limited space, we report only the results for the telecom benchmark (Table 2). The results show that, with the addition of extra buffers, the critical traffic point shifts only from 0.44 packet/cycle to 0.46 packet/cycle. Inserting long-range links, on the other hand, shifts the critical point to 0.60 packet/cycle, which is a huge improvement in the network capability. Similarly, the average packet latency obtained by the proposed technique is almost 1/3 of the latency provided by the standard mesh and about 1/2 of the latency provided by the mesh with extra buffers.

Table 2: Critical load (packet/cycle) and latency (cycles) comparison for pure mesh (M), mesh with extra buffers (MB) and mesh with long links (L).

                    Critical load λ_c (packet/cycle)   Latency at critical load L(λ_c^M) (cycles)
  auto-indust M     0.29                               98.0
  auto-indust MB    0.30                               70.5
  auto-indust L     0.33                               30.3
  telecom M         0.44                               73.1
  telecom MB        0.46                               56.0
  telecom L         0.60                               28.2
VI. CONCLUSION AND FUTURE WORK

We have presented a design methodology to insert application-specific long-range links to standard grid-like networks. It is analytically and experimentally demonstrated that the insertion of long-range links has an important impact on the dynamic, as well as the static, properties of the network. Specifically, additional long-range links increase the critical traffic workload. We have also demonstrated that this increase translates into a significant reduction in the average packet latency in the network, as well as an improvement in the achievable throughput.

Our current work employs oblivious routing to utilize the long-range links. We plan to extend this work to employ adaptive routing instead. Other possible extensions include inserting long-range links for different objective functions such as fault-tolerance and QoS operation.

Acknowledgements: This research is supported by Marco GSRC, NSF CCR-00-93104, and SRC 2004-HJ-1189.

VII. REFERENCES

[1] W. Dally, B. Towles, "Route packets, not wires: On-chip interconnection networks," In Proc. DAC, June 2001.
[2] L. Benini, G. De Micheli, "Networks on chips: A new SoC paradigm," IEEE Computer, 35(1), 2002.
[3] A. Jantsch, H. Tenhunen (Eds.), Networks on Chip. Kluwer, 2003.
[4] M. Millberg, E. Nilsson, R. Thid, S. Kumar, A. Jantsch, "The Nostrum backbone - a communication protocol stack for networks on chip," In Proc. VLSI Design, Jan. 2004.
[5] K. Goossens, et al., "A design flow for application-specific networks on chip with guaranteed performance to accelerate SOC design and verification," In Proc. DATE, March 2005.
[6] J. Hu, R. Marculescu, "Exploiting the routing flexibility for energy/performance aware mapping of regular NoC architectures," In Proc. DATE, March 2003.
[7] S. Murali, G. De Micheli, "SUNMAP: A tool for automatic topology selection and generation for NoCs," In Proc. DAC, June 2004.
[8] A. Pinto, L. P. Carloni, A. L. Sangiovanni-Vincentelli, "Efficient synthesis of networks on chip," In Proc. ICCD, Oct. 2003.
[9] K. Srinivasan, K. S. Chatha, G. Konjevod, "Linear programming based techniques for synthesis of Network-on-Chip architectures," In Proc. ICCD, Oct. 2004.
[10] U. Y. Ogras, R. Marculescu, "Energy- and performance-driven customized architecture synthesis using a decomposition approach," In Proc. DATE, March 2005.
[11] A. Jalabert, S. Murali, L. Benini, G. De Micheli, "xpipesCompiler: A tool for instantiating application specific networks on chip," In Proc. DATE, March 2004.
[12] T. T. Ye, L. Benini, G. De Micheli, "Analysis of power consumption on switch fabrics in network routers," In Proc. DAC, June 2003.
[13] C. J. Glass, L. M. Ni, "The turn model for adaptive routing," In Proc. ISCA, May 1992.
[14] D. J. Watts, S. H. Strogatz, "Collective dynamics of 'small-world' networks," Nature, 393:440-442, 1998.
[15] J. Kleinberg, "The Small-World phenomenon and decentralized search," SIAM News, 37(3), April 2004.
[16] H. Fuks, A. Lawniczak, "Performance of data networks with random links," Mathematics and Computers in Simulation, vol. 51, 1999.
[17] M. Woolf, D. Arrowsmith, R. J. Mondragon, J. M. Pitts, "Optimization and phase transitions in a chaotic model of data traffic," Physical Review E, vol. 66, 2002.
[18] J. Duato, S. Yalamanchili, L. Ni, Interconnection Networks: An Engineering Approach. Morgan Kaufmann, 2002.
[19] L. P. Carloni, K. L. McMillan, A. L. Sangiovanni-Vincentelli, "Theory of latency-insensitive design," IEEE Trans. on CAD of Integrated Circuits and Systems, 20(9), 2001.
[20] V. Chandra, A. Xu, H. Schmit, L. Pileggi, "An interconnect channel design methodology for high performance integrated circuits," In Proc. DATE, Feb. 2004.
[21] R. Dick, "Embedded system synthesis benchmarks suites (E3S)," http://helsinki.ee.princeton.edu/dickrp/e3s/.