

INTEGRATION, the VLSI journal 38 (2005) 353–382


www.elsevier.com/locate/vlsi

Quality-of-service and error control techniques for mesh-based network-on-chip architectures


Praveen Vellanki, Nilanjan Banerjee, Karam S. Chatha
Department of CSE, Arizona State University, P.O. BOX 875406, Tempe, AZ 85287-5406, USA Received 16 June 2004; received in revised form 19 July 2004; accepted 21 July 2004

Abstract

Network-on-a-chip (NoC) has been proposed as a solution for addressing the design challenges of future high-performance system-on-chip architectures in the nanoscale regime. Many real-time applications require input data that arrives with low delay jitter. Such communication traffic can only be supported by incorporating multiple levels of service in the interconnection network. Further, as technology scales toward deep submicron, on-chip interconnects are becoming increasingly sensitive to noise sources such as power supply noise, crosstalk, and radiation-induced effects that are likely to reduce the reliability of data. Hence, effective error control schemes are required for ensuring data integrity. This paper addresses two important aspects of NoC architectures, quality of service and error control, and makes the following contributions: (i) it presents techniques for supporting guaranteed throughput (for low delay jitter traffic) and best-effort traffic quality levels in a NoC router, (ii) it presents architectures for integrating error control schemes in the NoC router architecture, and (iii) it presents cycle-accurate power and performance models of the two architecture enhancements for a mesh-based NoC architecture.
© 2004 Elsevier B.V. All rights reserved.
Keywords: Network-on-chip; Quality-of-service; Error control; Power consumption; Performance

Corresponding author. Department of Computer Science and Engineering, Arizona State University, Brickyard Suite 501, 699 South Mill Avenue, Tempe, AZ 85281, USA. Tel.: +480-727-7850; fax: +480-965-2751.
E-mail addresses: pvellanki@asu.edu (P. Vellanki), nbanerjee@asu.edu (N. Banerjee), kchatha@asu.edu (K.S. Chatha).
0167-9260/$ - see front matter © 2004 Elsevier B.V. All rights reserved.
doi:10.1016/j.vlsi.2004.07.009


1. Introduction

The physical characteristics of nanoscale technologies will pose several challenges to system-on-chip (SoC) designers. Global signal delays will span multiple clock cycles [1,2]. Signal integrity will also be compromised due to increased RC effects, inductance, and cross-coupling capacitances [3]. Nanoscale packet-switched networks, or networks-on-chip (NoC), have been proposed as an architectural solution for SoC design in the nanoscale regime [4–8]. Packet switching supports asynchronous transfer of information. It provides extremely high bandwidth by distributing the propagation delay across multiple switches, thus pipelining the signal transmission. Packet-switching networks also support error detection and correction schemes that can be applied towards improving the signal integrity. Quality of service (QoS) can be ensured by distinguishing between different types of traffic. In this paper, we present techniques for supporting multiple levels of service and error control schemes for a mesh-based NoC architecture.

Fig. 1 plots the variation in packet latency for destinations that are uniformly 3 hops away in a 4 × 4 mesh-based NoC architecture, for a router with 4 virtual channels at an injection rate of 0.05 packets/cycle/node. The x-axis denotes the latency of the various packets, and the y-axis denotes the number of packets. The mean latency of the plot is 97.87 clock cycles, which is close to the peak of the plot. However, a large number of packets (214, over 50%) experience a transmission latency that is more than double the average latency. Such a large variance in latency is unacceptable for many NoC implementations, such as traffic between a cache and lower level memory, or between different processing elements of a multimedia application. We present techniques for supporting both low-jitter guaranteed throughput and best-effort traffic in a NoC router. Cycle-accurate power and performance models for trade-off analysis of the two techniques are also presented.

In the nanoscale regime, crosstalk on long global wires will be a major source of errors. Switching activity on aggressor links can cause errors by either forcing a logic transition

Fig. 1. Variation in packet latency for BE traffic (x-axis: latency in cycles; y-axis: number of packets).


on a stable victim link or by delaying the transition on a switching victim link. Both of these instances result in the capture of an incorrect logic level at the receiver. A number of error control schemes [9] have been proposed for general communication networks. In a NoC architecture, due to the stringent performance and power constraints, low-complexity and low-power error control schemes are desirable. Hence, we have implemented two low-overhead error control schemes: single error detection and retransmission (PAR), and single error correction (SEC). We also present power and performance trade-offs of the two schemes under variable traffic profiles.

The trade-off between performance and power consumption of the interconnection network is a key question. The performance of the nanoscale interconnection network can be specified by the average latency of sending a message through the network, and the bandwidth of the network. The power consumption of the network consists of the dynamic and leakage power consumption of the various components. This paper also presents results for power versus performance trade-off analysis for different service levels of traffic and error control schemes.

We integrated the QoS and error control schemes into a VHDL based cycle-accurate power and performance model of the NoC architecture. The model is a parameterized register transfer level (RTL) design of the NoC architecture elements. The design is parameterized on (i) the size of packets, (ii) the length and width of physical links, (iii) the number and depth of virtual channels, and (iv) the switching technique. The model is annotated with delay, dynamic energy, and leakage energy estimates of the various components. The model can estimate the latency, throughput, and dynamic and leakage power consumption of a NoC architecture. The RTL design for the QoS and error control circuitry was synthesized, and the SPICE level netlist was extracted from the layout. The design was then characterized for delay, and dynamic and leakage power consumption at 0.18 μm. The characterized values were integrated into the VHDL based RTL design to build the cycle-accurate performance model.

The paper is organized as follows: Section 2 discusses previous work, Section 3 gives a quick overview of the NoC architecture and the cycle-accurate performance model, Section 4 discusses the QoS schemes, Section 5 discusses the error control techniques, Section 6 discusses the packet format and protocol, Section 7 presents the experimental results, and Section 8 concludes the paper.
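To make this parameterization concrete, the sketch below shows one way such a model could expose its design parameters. The class, field names, and default values are illustrative assumptions for this paper's 4-virtual-channel, 5-flit-packet configuration, not the actual VHDL generics of the model.

```python
from dataclasses import dataclass

@dataclass
class NoCModelConfig:
    """Illustrative parameter set for a cycle-accurate NoC model (names and defaults are assumptions)."""
    flits_per_packet: int = 5       # (i) size of packets
    link_length_um: int = 4500      # (ii) length of the physical links
    link_width_bits: int = 32       # (ii) width of the physical links (assumed value)
    num_virtual_channels: int = 4   # (iii) number of virtual channels per unit router
    vc_depth_flits: int = 4         # (iii) depth of each virtual channel (assumed value)
    gt_virtual_channels: int = 2    # virtual channels reserved for GT traffic
    switching: str = "wormhole"     # (iv) switching technique (assumed default)

    def validate(self) -> None:
        assert 0 <= self.gt_virtual_channels <= self.num_virtual_channels
        assert self.flits_per_packet >= 3   # header flit + data flit(s) + tail flit

config = NoCModelConfig()
config.validate()
```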

2. Previous work

In recent years a number of researchers have proposed architectures, performance evaluation techniques, and optimization approaches for NoC. This section classifies and presents the existing research under four categories: seminal work, router architectures, performance models, and automated optimization approaches. Our paper discusses innovative router architectures for supporting guaranteed throughput and error control schemes in mesh-based on-chip interconnection networks, and presents power and performance evaluation models for the same. The work presented in our paper can be classified under both router architectures and performance models. Hence, in the following sections we compare and contrast our work with existing techniques in both categories.


2.1. Seminal work

Guerrier et al. [4] presented a NoC design called SPIN that was based on a fat-tree topology. They also presented the router architecture and a cycle-accurate performance model for their NoC design. Sgroi et al. [5] discussed a platform-based SoC design methodology that proposed the inclusion of NoC for supporting on-chip communication. Dally et al. [6] demonstrated the feasibility of the NoC and estimated that the NoC places an area overhead of 6.6%. Benini et al. [7], in their conceptual paper on NoC, predict that packet-switched on-chip interconnection networks will be essential to address the complexity of future SoC designs. Kumar et al. [8] presented a conceptual system-level architecture that allowed a mesh-based NoC to accommodate large resources such as memory banks, FPGA areas, or high-performance multi-processors. With the exception of Guerrier et al. [4], the above-mentioned works did not present detailed architectures or performance models. We will address [4] in more detail when we discuss NoC architectures and performance models.

2.2. NoC architectures

Several researchers have proposed architectures, and related optimizations, for on-chip interconnection networks. We classify the related research on NoC architectures based on the supported levels of traffic service classes, error control schemes, and power optimizations.

2.2.1. Architectures for best effort traffic

In this paragraph we review the NoC architectures that support only the best effort traffic class. SPIN [4,10,11] was one of the seminal works to propose a detailed NoC architecture built with a fat-tree topology. Proteo [12,13] is a VSIA-compliant NoC architecture that can be configured for ring, star, and bus topologies. Xpipes [14] is a parameterized router architecture that can be utilized in arbitrary NoC topologies. As shown in Fig. 1, the best effort traffic class suffers from a large deviation in latency, which is not desirable for many real-time applications. In this paper we present a technique for supporting low-jitter guaranteed throughput traffic.

2.2.2. Architectures for guaranteed throughput traffic

Nostrum [15,16] is a protocol stack for a mesh-based NoC architecture that supports both best effort and guaranteed throughput traffic classes. Nostrum ensures bandwidth for guaranteed throughput traffic by reserving time slots called looped containers for its transmission on inter-router links. If no guaranteed throughput traffic is injected into the network the time slots are not utilized. In contrast, we support guaranteed throughput traffic by reserving a certain number of virtual channels (buffers). Hence, if no guaranteed throughput traffic is injected into the network the best effort traffic can be transported with maximum bandwidth. AEthereal [17,18] is also a mesh-based NoC architecture; it supports guaranteed throughput traffic by utilizing a centralized scheduler for allocation of link bandwidth. Our architecture utilizes a distributed scheme where the traffic producer sets up a guaranteed throughput connection by reserving virtual channels, transfers the data, and then tears down the connection by giving up the virtual channels. Finally, neither of these two works presented detailed results for the performance and power consumption of their respective architectures.


2.2.3. Architectures with error control schemes

Bertozzi et al. [19] presented power versus performance results for point-to-point error control in an on-chip bus protocol based on the AMBA bus. Their work did not address NoC architectures, and did not consider the influence of network traffic on the performance of the error control schemes. Zimmer et al. [20] presented a fault model for NoC architectures. They also proposed a QoS scheme that treated control traffic with higher reliability than data traffic. In contrast, our paper presents a QoS scheme for guaranteed throughput and best-effort traffic. The performance and power consumption of the error control schemes in the presence of variable traffic profiles for a mesh-based NoC architecture have also been discussed.

2.2.4. Architectural optimizations for low power

Worm et al. [21] proposed an adaptive low-power transmission scheme for NoC that minimized the voltage swing and frequency subject to the workload requirement. Chen et al. [22] proposed a power-aware buffer policy that minimized the leakage power consumption in virtual channels. Simunic et al. [23] proposed a system-level power reduction scheme for SoC architectures with on-chip interconnection networks. Their scheme applied dynamic power management and dynamic voltage scaling policies based on both local and global workload information. Our work is focused on architecture extensions and performance models for supporting guaranteed throughput and error control schemes.

2.3. Performance evaluation

Innovative performance evaluation models are required to address the design challenges of NoC based interconnection architectures. Although there are a number of models for network performance evaluation [24–27], these models do not consider the power consumption characteristics. Current system-level performance evaluation tools [28–30] are targeted towards shared bus architectures and do not consider interconnection networks. Traditional solutions for on-chip global communication include models for various shared-bus [31–33] and ad hoc point-to-point interconnections. Wassal et al. [34] proposed system-level performance and power models for a shared-memory internet protocol/asynchronous transfer mode switching fabric. Ye et al. [35] analyzed the power consumption in the switch fabrics of network routers and proposed system-level models for the same. Pamunuwa et al. [36] performed a system-level analysis and estimated the wiring overhead and the gate count for implementing a mesh-based NoC architecture. They also estimated the power consumption by assuming switching activity on 50% of the gates. Wang et al. [37] proposed a power-performance simulator for interconnection networks called Orion. None of these models incorporates the QoS and error control schemes. Bolotin et al. [38] proposed analytical models for system-level performance and cost estimation of NoC architectures. They did not address the power consumption in NoC.

2.4. Automated design techniques

In the recent past researchers have begun to address the problem of synthesizing custom NoC architectures, and mapping communication traffic onto them. Pinto et al. [39] presented a quadratic


programming-based approach for the synthesis of custom NoC architectures. Hu et al. [40] presented an integrated task and communication scheduling approach for mapping applications onto mesh-based NoC architectures. Murali et al. [41] presented a technique for bandwidth-constrained mapping of cores onto mesh-based NoC architectures. As opposed to these synthesis techniques, this paper focuses on architectural extensions and performance models.

3. NoC architecture and characterization

In the following paragraphs we describe the architecture of the various NoC elements (physical links, routers), and the techniques applied for their characterization.

3.1. Physical links

The physical links include the data and control wires for communication between two router elements of the interconnection network.

3.1.1. Characterization of physical links

The power and performance of a physical link is determined by its width (number of bits of data and control signals), its length, and the capacitive load of the router. In nanoscale technologies, individual wires are modeled by distributed RLC expressions for an accurate description of their physical characteristics [42]. The RLC and cross-coupling capacitances of the interconnection model were obtained from the Berkeley Predictive Technology Model website [43]. We characterized the links in sets of three, two and a single wire, respectively, for 0.18 μm technology. The three and two wire sets included the distributed RLC effects and cross-coupling capacitances, while the single wire model only included the distributed RLC effects. We considered three different types of links: local (≤1000 μm), intermediate (>1000 μm and ≤4000 μm), and global (>4000 μm) [1]. We obtained energy values for 64 (8 × 8), 16 (4 × 4) and 4 (2 × 2) different switching combinations for the three, two and single wire sets, respectively. The wire lengths were incremented in steps of 100 μm up to 1000 μm, steps of 500 μm up to 4000 μm, and steps of 1000 μm up to 5000 μm. Table 1 summarizes the switching energy consumed in 0.18 μm technology for three wire-set switching at 100, 1000 and 5000 μm, respectively.

3.1.2. Performance evaluation of physical links

We included the link characterization values as a table in our performance model. The energy consumed by an n-bit wide link can be calculated from the energy consumed by the three, two and single wire sets of similar length. For example, consider the 9-bit (odd) wide link shown in the left-hand side of Fig. 2. The total switching energy consumed by the link can be calculated by adding the switching energy consumed by the three wire sets S0, S1, S2 and S3, and subtracting the energy consumed by the single wire links A, B, and C, respectively. In the case of an 8-bit (even) wide link, shown in the right-hand side of Fig. 2, the energy consumed by the two wire set S3 is included in the calculation. The length of the physical link, which is a major factor in determining its power consumption and performance, is specified by the designer.

Table 1
3 wire-set characterization: switching energy (in fJ) for the three-wire transition patterns, characterized at wire lengths of 100, 1000 and 5000 μm.

Fig. 2. Performance evaluation of links: an n-bit link is decomposed into wire sets S0, S1, S2 and S3 that overlap on single wires A, B and C; for the odd-width case, Total Energy = E(S0) + E(S1) + E(S2) + E(S3) - E(A) - E(B) - E(C).
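A minimal sketch of this calculation is given below, assuming (as suggested by Fig. 2) that adjacent three-wire sets overlap by a single shared wire. The table and function names are illustrative; the energy tables would be populated from the wire-set characterization of Section 3.1.1.

```python
def link_switching_energy(prev_bits, next_bits, e3, e2, e1):
    """Estimate switching energy of an n-bit link from 3-, 2- and 1-wire set tables.

    prev_bits, next_bits: tuples of 0/1 giving the link state before/after the transition.
    e3[(a, b)]: energy of a 3-wire set switching from pattern a to pattern b (fJ).
    e2[(a, b)]: energy of a 2-wire set transition (fJ).
    e1[(a, b)]: energy of a single-wire transition (fJ).
    Adjacent 3-wire sets overlap on one wire, so that wire's single-wire energy
    is subtracted once per overlap (wires A, B, C in Fig. 2).
    """
    n = len(prev_bits)
    total = 0.0
    i = 0
    while i + 3 <= n:                       # full three-wire sets S0, S1, ...
        total += e3[(prev_bits[i:i+3], next_bits[i:i+3])]
        if i + 3 < n:                       # overlap wire shared with the next set
            total -= e1[(prev_bits[i+2:i+3], next_bits[i+2:i+3])]
        i += 2                              # consecutive sets overlap by one wire
    if i + 2 == n:                          # even width: trailing two-wire set
        total += e2[(prev_bits[i:i+2], next_bits[i:i+2])]
    return total
```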

3.2. The NoC router

A router architecture that can be utilized in a 2D mesh topology is shown in Fig. 3. The router consists of five unit routers that communicate in the X-minus, X-plus, Y-minus, and Y-plus directions, and with the processor. The unit routers inside a single router are connected through a 5 × 5 crossbar. Data is transferred across routers, or between the processor and the corresponding router, by an asynchronous handshaking protocol. A single unit router is highlighted in the lower half of Fig. 3. It consists of input and output link controllers, virtual channels, a header decoder and an arbiter. Data arrives at an input virtual channel of a unit router from either the previous router or the processor connected to the same router. The header decoder decodes the header flit of the packet after receiving data from the input virtual channel, decides the packet's destination direction (X-, X+, Y-, Y+, processor), and sends a request to the arbiter of the unit router in

Fig. 3. Router architecture: five unit routers (to/from the X-, X+, Y-, Y+ routers and the processor) connected through a 5 × 5 crossbar; each unit router contains an input link controller with error decoder, a header decoder, an in FIFO of GT and BE virtual channels, an arbiter, an out FIFO of GT and BE virtual channels, and an output link controller with error encoder.

Table 2
Unit components: characterized switching energy (for an output transition, and for an input change with no output change) and leakage energy of the unit full adder (including 1-bit and 2-bit flips at the output), 2-1 multiplexer, NAND gate, 2-bit comparator, D flip-flop, and XOR gate in 0.18 μm technology.

the corresponding direction. Once the grant is received, the header decoder starts sending data from the input to the output virtual channel through the crossbar. The complete architecture and the detailed implementation can be found in [44].

We designed RTL models for each of the components separately. The larger components were characterized in terms of unit components such as the unit full adder, 2-bit comparator, 2:1 1-bit multiplexer, D flip-flop, and logic gates. SPICE net-lists for 0.18 μm technology were extracted for each component and characterized for energy and performance (shown in Table 2). The power consumption of the entire router architecture is computed by including the characterized energy values as table lookups in the RTL model.
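As an illustration of the direction decision made by the header decoder, the sketch below applies the dimension-ordered (XY) source routing policy described in Section 4; the function name and coordinate encoding are assumptions for illustration.

```python
def route_direction(cur_x: int, cur_y: int, dst_x: int, dst_y: int) -> str:
    """Dimension-ordered (XY) routing decision made by the header decoder.

    The packet is first forwarded along X until the x-offset is zero,
    then along Y, and finally delivered to the local processor.
    """
    if dst_x > cur_x:
        return "X+"
    if dst_x < cur_x:
        return "X-"
    if dst_y > cur_y:
        return "Y+"
    if dst_y < cur_y:
        return "Y-"
    return "processor"   # packet has reached its destination router
```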

4. Quality-of-service schemes

In this section we describe the QoS schemes that are supported by our architecture, and their performance and power characterization. The NoC architecture supports two levels of service: best effort (BE) and guaranteed throughput (GT). Each packet is divided into multiple flits. The flit is the unit of transfer between two routers. The packets are routed by a deterministic dimension-ordered source routing strategy. This deadlock-free strategy first transmits the packet in the X-dimension until the x-offset is zero, and then the packet is transmitted in the Y-dimension. Both service levels ensure guaranteed and in-order delivery of packets. In the following paragraphs we first describe the BE service level, and then the GT service level.

4.1. Best effort traffic service level

The BE traffic service level packets are injected from the input queue into the input virtual channel of the router by the processor if the channel is not full. The processor checks the full


signal before injecting the packet. Inside the network, the same strategy is followed to transmit each flit of the packet from the output virtual channel of one router to the input virtual channel of the neighboring router. Such a transmission strategy acts as an explicit hop-to-hop flow control mechanism, and together with the dimension-ordered routing ensures guaranteed and in-order delivery of packets. A round-robin priority-based scheduling mechanism is used for each of the following tasks:

• Selection of an input virtual channel by the header decoder.
• Selection of an output virtual channel by the arbiter.
• Grant of the crossbar to the header decoder by the arbiter.
• Selection of the output virtual channel by the link controller.
In all the above decision mechanisms the scheduler is invoked (i) if the packet is partially transmitted and blocked, or (ii) after the complete transmission of each packet. Since all the packets are of the same size, the BE round-robin priority scheme approximates the theoretically optimal, work-conserving generalized processor sharing (GPS). The GPS scheme provides fair allocation of link bandwidth to all the packets.

4.2. Guaranteed throughput traffic service level

Many applications demonstrate bursty traffic behavior that must be transmitted from source to destination with a required throughput and low jitter. Examples are traffic between a cache and lower level memory, or between various processing blocks of a multimedia processing engine. As demonstrated in Fig. 1, the BE traffic service level is unable to support the desired QoS. We support guaranteed throughput traffic by dividing the virtual channels between the GT and BE service levels. The number of virtual channels assigned to each service level is a design parameter that is specified by the designer. In the case of heavy network load the GT traffic can be transmitted on the BE virtual channels, but not vice versa. The round-robin service mechanism is modified to give priority to the GT traffic over the BE traffic. Within each of the two service levels, every virtual channel gets equal priority.

The GT traffic is always transmitted as a stream of packets with a designer-specified fixed size. At the processor, the GT packets are queued until the stream size is reached. Once the desired stream size is reached, the GT protocol performs the following three steps: connection set-up, transmission, and tear-down. In the connection set-up, the virtual channels are reserved for the stream all the way from the source to the destination. The connection set-up stage might take a variable amount of time based on the network load. Once the connection is set up the stream can be transmitted with maximum throughput. After the entire stream has been transmitted the reserved virtual channels are set free by the tear-down step. Since the GT traffic is always transmitted as a stream with maximum throughput, under-utilization of resources is prevented. Further, since the GT traffic is transmitted in discrete streams of fixed sizes, starvation of other GT traffic is also prevented. As the GT traffic can utilize virtual channels that are allocated for BE traffic, there is a possibility of starvation of BE traffic at high injection rates. However, as the experimental results will demonstrate, this starvation can easily be avoided by limiting the ratio of GT/BE traffic


to be around 0.25 for a router with 4 virtual channels (two virtual channels allocated to GT). This is not unrealistic, as only a small portion of the total network traffic is expected to be supported on the GT traffic class.

4.3. Architecture and characterization for QoS schemes

The basic router [44] supporting only the BE service level has been enhanced to support multiple levels of service as shown in Fig. 3. The round-robin priority-based scheduling units present in the header decoder, arbiter and output link controller have been modified to give priority to channels transferring GT traffic. For instance, if there are N virtual channels per node in the router, and K of these N channels have been allocated to transmit GT traffic, then the schedulers assign priority to these K channels to transfer data. If GT traffic is not present, then BE packets are allocated resources in a round-robin manner. The energy model for the modified architecture is implemented utilizing the unit components shown in Table 2.
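A minimal sketch of this GT-over-BE priority with round-robin selection inside each class is shown below; the data structures and method names are illustrative assumptions rather than the modified RTL scheduling units.

```python
class VCArbiter:
    """Round-robin arbitration with GT virtual channels prioritized over BE ones.

    Channels 0..k-1 are assumed to be reserved for GT traffic and channels
    k..n-1 for BE traffic (GT packets may also use BE channels, which is
    left to the caller).
    """
    def __init__(self, num_vcs: int, gt_vcs: int):
        self.num_vcs = num_vcs
        self.gt_vcs = gt_vcs
        self.gt_ptr = 0   # round-robin pointer within the GT class
        self.be_ptr = 0   # round-robin pointer within the BE class

    def select(self, requests):
        """requests: list of bools, one per virtual channel. Returns the granted VC or None."""
        # The GT class is always scanned first, so GT traffic pre-empts BE traffic.
        for offset in range(self.gt_vcs):
            vc = (self.gt_ptr + offset) % self.gt_vcs
            if requests[vc]:
                self.gt_ptr = (vc + 1) % self.gt_vcs
                return vc
        be_count = self.num_vcs - self.gt_vcs
        for offset in range(be_count):
            vc = self.gt_vcs + (self.be_ptr + offset) % be_count
            if requests[vc]:
                self.be_ptr = (vc - self.gt_vcs + 1) % be_count
                return vc
        return None
```

In the model, the same selection policy would be instantiated in the header decoder, the arbiter, and the output link controller.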

5. Error control schemes

In the nanoscale regime, crosstalk in long global communication wires is expected to be the major source of errors. In this paper, we focus on crosstalk errors in the links between the routers. The error control schemes are incorporated into the output and input link controllers, respectively. The output link controller includes the encoder, and the input link controller includes the corresponding decoder. Due to the strict low-latency and low-power requirements, we have implemented low-overhead error control schemes. The two schemes that we implemented are PAR and SEC.

Single error detection and retransmission (PAR): The basic single-bit parity check method is used to detect an error, and re-transmission of the data is requested in the presence of an error. The main idea behind this scheme is to enable error recovery based on re-transmission. The hardware overhead is negligible since it requires only one extra bit of information per flit of data transfer. However, the latency per packet increases in the case of retransmission.

Single error correction (SEC): The basic (15,11) Hamming code [9] implementation with a single error correction capability is utilized for this scheme. The decoder present in the input link controller of a router is more complex than the encoder at the output link controller, because of the correction circuitry. The hop-to-hop transmission of 11-bit data requires 4 additional check bits.

5.1. Architecture and characterization of error control schemes

In our architecture, we have primarily concentrated on modification of the link controllers to incorporate the error model, as shown in Fig. 3. The data is encoded at the output link controller and is subsequently decoded at the input link controller before progressing through the next router towards its destination. This hop-based error detection and correction mechanism allows strong error control. The functionality and characterization of the link controllers are described below.
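The sketch below illustrates the two schemes at the bit level: a single even-parity bit per flit for PAR, and a (15,11) Hamming encoder/decoder for SEC. The bit ordering follows the textbook Hamming construction and is an assumption about, not a copy of, the paper's encoder circuitry; on detecting an error the PAR decoder would raise the retransmit signal described in Section 5.1, whereas the SEC decoder corrects the flit in place.

```python
def parity_encode(bits):
    """PAR: append one even-parity bit to a flit (detects any single-bit error)."""
    return bits + [sum(bits) % 2]

def parity_check(codeword):
    """Return True if the received flit passes the parity check."""
    return sum(codeword) % 2 == 0

def hamming_15_11_encode(data):
    """SEC: (15,11) Hamming code; positions 1..15, check bits at powers of two."""
    assert len(data) == 11
    code = [0] * 16                       # index 0 unused, positions 1..15
    data_iter = iter(data)
    for pos in range(1, 16):
        if pos not in (1, 2, 4, 8):
            code[pos] = next(data_iter)
    for p in (1, 2, 4, 8):                # each check bit covers positions with that bit set
        code[p] = sum(code[pos] for pos in range(1, 16) if pos & p and pos != p) % 2
    return code[1:]

def hamming_15_11_decode(codeword):
    """Correct a single-bit error (if any) and return the 11 data bits."""
    code = [0] + list(codeword)
    syndrome = 0
    for p in (1, 2, 4, 8):
        if sum(code[pos] for pos in range(1, 16) if pos & p) % 2:
            syndrome += p
    if syndrome:                          # the syndrome gives the erroneous position
        code[syndrome] ^= 1
    return [code[pos] for pos in range(1, 16) if pos not in (1, 2, 4, 8)]
```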


Single error detection and retransmission (PAR): This scheme is implemented as shown in Fig. 4. The input link controller has 2 states, S0 and S1. S0 represents the idle state, in which the state machine waits for a req from the output virtual channel of the neighboring router. Once it receives a req, it checks the output of the parallel error detection circuitry. In the absence of an error, it goes to S1, raising the ack signal and also the write signal (to its own in FIFO) high. In state S1, it lowers the write signal and stays in this state as long as the req signal remains high. Once the req signal is lowered, it returns to S0, lowering the ack signal in the transition. In the presence of an error, it shifts to state S1 raising the ack signal to the previous output link controller. However, it keeps the write signal low in this case and waits for the req signal to go low before shifting back to S0, while raising the re-transmit signal. The output link controller has a complementary state sequence as shown in Fig. 5. The characterized energy values for both link controllers are also shown in Figs. 4 and 5.

Single error correction (SEC): This scheme is similar to the above scheme, with the controllers having 2 states each. The difference lies in state S0 of the input link controller, where an error detection leads to a subsequent correction of the error before shifting to state S1. The

Fig. 4. Input link controller state machine (states S0, S1). Transitions: error = '0', REQ = '1' / ACK = '1', write = '1' (E = 0.275 pJ); REQ = '0' / ACK = '0', write = '0' (E = 0.024 pJ); error = '1', REQ = '1' / ACK = '1', write = '0' (E = 0.225 pJ); REQ = '0' / ACK = '0', write = '0', retransmit = error (E = 0.024 pJ); REQ = '1' / ACK = '1', write = '0'. Leakage energy value for the circuit = 0.6 fJ.

Fig. 5. Output link controller state machine (states S0, S1; ivc = input virtual channel, ovc = output virtual channel). Transitions: ACK = '0' and ivc != full and ovc != empty / REQ = '0', read = !retransmit (E = 0.261 pJ); ACK != '0' / REQ = '0', read = '0' (E = 0.09 pJ); ACK = '0' / REQ = '1', read = '0' (E = 0.09 pJ); ACK = '1' / REQ = '1', read = '0' (E = 0.219 pJ). Leakage energy value of the circuit = 0.2 fJ.


characterized values of both link controllers are similar to those shown in Figs. 4 and 5. We characterize the PAR and the SEC circuitry in terms of unit XOR gates (energy values shown in Table 2).

5.2. Error generation model

Hegde et al. [45] developed a model for noise from various sources in CMOS circuitry as a Gaussian source. The model has been applied towards error estimation in SoC architectures [19,46]. In the model, it is assumed that the gate input is in error when the noise voltage $V_N$ exceeds the gate decision threshold voltage $V_{th}$, which is defined as
$$V_{th} = \frac{V_{dd}}{2}.$$
The model assumes that a signaling waveform has a certain noise $V_N$ added on to it, and that $V_N$ has a normal distribution with a variance of $\sigma_N^2$ and a mean of 0. The probability of error $\epsilon$ is given by
$$\epsilon = Q\left(\frac{V_{dd}}{2\sigma_N}\right), \quad \text{where } Q(x) = \frac{1}{\sqrt{2\pi}}\int_x^{\infty} e^{-y^2/2}\,dy$$
is the Gaussian Q-function. We utilize the above model to generate errors in the individual wires of the NoC links.
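As a sanity check on this model, the sketch below evaluates ε = Q(V_dd/(2σ_N)) and injects independent bit errors with that probability; with V_dd = 1.8 V it reproduces approximately the bit error rates used in Section 7.2 (ε ≈ 0.036 for σ_N = 0.5 V and ε ≈ 0.006 for σ_N = 0.36 V). The function names are illustrative assumptions.

```python
import math
import random

def bit_error_rate(vdd: float, sigma_n: float) -> float:
    """epsilon = Q(Vdd / (2*sigma_N)), with Q the Gaussian tail probability."""
    x = vdd / (2.0 * sigma_n)
    return 0.5 * math.erfc(x / math.sqrt(2.0))   # Q(x) expressed via erfc

def inject_errors(bits, vdd=1.8, sigma_n=0.5, rng=random):
    """Flip each wire of a link independently with probability epsilon."""
    eps = bit_error_rate(vdd, sigma_n)
    return [b ^ (1 if rng.random() < eps else 0) for b in bits]

# Q(1.8/1.0) ~= 0.0359 and Q(1.8/0.72) = Q(2.5) ~= 0.0062, close to the
# "high" (0.035) and "low" (0.0063) bit error rates reported in Section 7.2.
```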

6. Packet format and protocol

The message is partitioned into fixed-length packets that are in turn broken down into flits for efficient data transfer. A packet consists of three kinds of flits: the header flit, the data flit and the tail flit, which are differentiated by two bits of control information. The header flit contains the (X, Y) coordinates of the destination router for each packet. The header flit also contains one additional bit to indicate whether the packet is a best effort or a guaranteed throughput packet.
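One possible bit-level layout consistent with this description is sketched below; the flit width, field positions, and type encodings are assumptions, since the paper specifies only the flit types, the two control bits, the (X, Y) destination, and the GT/BE bit.

```python
HEADER, DATA, TAIL = 0b01, 0b00, 0b10   # assumed encodings of the 2-bit flit-type field

def make_header_flit(dst_x: int, dst_y: int, is_gt: bool, flit_width: int = 16) -> int:
    """Pack a header flit: [type(2) | GT/BE(1) | dst_x(3) | dst_y(3) | unused]."""
    assert 0 <= dst_x < 8 and 0 <= dst_y < 8      # enough for a 4 x 4 (up to 8 x 8) mesh
    flit = HEADER << (flit_width - 2)
    flit |= (1 if is_gt else 0) << (flit_width - 3)
    flit |= dst_x << (flit_width - 6)
    flit |= dst_y << (flit_width - 9)
    return flit

# Example: header flit of a GT packet destined for router (2, 3)
hdr = make_header_flit(2, 3, is_gt=True)
```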

7. Results

We performed design space exploration and performance versus power trade-off analysis for a 4 × 4 mesh topology of a NoC based interconnection network. Each unit router consisted of 4 virtual channels, with 2 channels each allocated to the GT and BE traffic service levels. The physical channels supported unidirectional communication with both data and control bits. The International Technology Roadmap for Semiconductors (ITRS) predicts that in the future the die size for high-end SoC architectures will be around 22 mm × 22 mm. Kumar et al. [8] have also made similar predictions. Hence, we assume a chip dimension of 20 mm × 20 mm and consider the inter-router links to be 4.5 mm long. In our experiments, the simulator generated two varieties of traffic to random destinations: uniformly distributed traffic and Poisson distributed traffic. The traffic was injected through the 16 processors by utilizing a uniform/Poisson distribution over a designer-specified


time interval. In our architecture, due to the asynchronous communication protocol, it takes two clock cycles to transfer each flit. The network was allowed to stabilize for the first 1000 cycles, after which it was run for 10,000 clock cycles. At the end of the 10,000 clock cycles the total number of packets reaching their destinations, their acceptance rate, and their latencies were calculated. The acceptance rate is the number of packets received at the destination per cycle per node. The average dynamic and leakage power consumption of the various components was also calculated over the 10,000 clock cycles. The clock period was assumed to be 3 ns.

In the following plots, we distinguish between queue and network latency. The queue latency denotes the amount of time spent by a packet at the source queue after its generation and before its injection into the network. The network latency denotes the time required by the packet to travel from source to destination. The total latency of the packet is the summation of the queue and network latencies. Additionally, for the GT traffic packets, we consider the set-up latency as the time required to reserve the virtual channels from source to destination. The BE packets were assumed to consist of 5 flits. The GT packets also consisted of 5 flits, and the GT stream was assumed to be 15 packets long. At a particular injection rate, the numbers of GT and BE packets to be generated are specified as a ratio r = GT/BE. The queue latency of the GT traffic is calculated as the difference between the time when the total stream has been generated and the time when the stream is injected into the network.

7.1. Evaluation of QoS schemes

Fig. 6 plots the variation in the network latencies of GT and BE traffic when the destination is 3 hops away from the source, at an injection rate of 0.05 packets/cycle/node. While the BE traffic experiences a wide spectrum of network latencies, the GT traffic latency spectrum has a sharp

Fig. 6. Latency spectrum for BE/GT traffic (x-axis: latency in cycles; y-axis: number of packets).


spike. This plot validates that our router is able to provide guaranteed low-jitter latency for GT traffic transmission. Figs. 7 and 8 plot the network latency of the BE and GT traffic as the injection rate is varied from 0.025 to 0.1, and r is varied from 0.25 to 1. As can be observed from the plots, for all values of r, as the injection rate is increased the average network latency of the BE traffic increases. There is also an increase in the average BE network latency with increasing r values, since more priority is given to GT traffic over BE traffic. The average network latency for the GT traffic, on the other hand, remains almost constant.

Fig. 7. Network latency for BE traffic versus injection rate (packets/cycle/node) for GT/BE = 0.25, 0.5 and 1.

Fig. 8. Network latency for GT traffic versus injection rate (packets/cycle/node) for GT/BE = 0.25, 0.5 and 1.


Figs. 9 and 10 plot the variation in queue latency of the BE and GT traffic, respectively. The queue latency for BE traffic increases dramatically with rising injection rate and r. This observation is supported by the BE acceptance rate plot shown in Fig. 12. It should be noted that the BE queue latency soars to around 3000 clock cycles for r = 1 and an injection rate of 0.1. The queue latency for GT traffic at lower injection rates and low values of r remains negligible, as fewer GT packets are generated by the processors and the resources can easily cater to them without any congestion. However, for higher values of r with higher injection rates we observe a considerable increase in queue latency because of high network congestion among the GT traffic.

Fig. 9. Queue latency for BE traffic versus injection rate (packets/cycle/node) for GT/BE = 0.25, 0.5 and 1.

Fig. 10. Queue latency for GT traffic versus injection rate (packets/cycle/node) for GT/BE = 0.25, 0.5 and 1.


Fig. 11 plots the variation in connection set-up latency for the GT traffic. The set-up latency increases with both the injection rate and the ratio r. The increase has the smallest slope for r = 0.25. Figs. 12–14 plot the acceptance rates for BE, GT, and combined traffic, respectively. As can be seen from the plots, at a particular r value the BE acceptance rate initially increases with increase in injection rate. It peaks at around a 0.05 injection rate, and then falls. However, the acceptance rate for GT traffic increases linearly with increase in injection rate and r. The priority of GT traffic over BE traffic explains the variation in the BE acceptance rate. The combined network acceptance rate rises linearly with the injection rate before the network is congested, and is constant after congestion.

Fig. 11. Setup latency for GT traffic versus injection rate (packets/cycle/node) for GT/BE = 0.25, 0.5 and 1.

Fig. 12. Acceptance rate for BE traffic versus injection rate (packets/cycle/node) for GT/BE = 0.25, 0.5 and 1.

Fig. 13. Acceptance rate for GT traffic versus injection rate (packets/cycle/node) for GT/BE = 0.25, 0.5 and 1.

Fig. 14. Acceptance rate for combined BE and GT traffic versus injection rate (packets/cycle/node) for GT/BE = 0.25, 0.5 and 1.

Figs. 15 and 16 plot the variation in the average dynamic and leakage power of the NoC with the injection rate and r, respectively. The dynamic power consumption closely follows the combined BE and GT acceptance rate plot shown in Fig. 14. At higher acceptance rates, the dynamic power consumption is high, and vice versa. Also, the peaks in the dynamic power consumption plots are mirrored by troughs in the leakage power consumption, and vice versa. The virtual channel buffers are the main contributors to both dynamic and leakage power consumption in the NoC. Fig. 17 plots the power consumed by the buffers at a 0.05 injection rate. There is an increase in the power consumption of the GT virtual channel buffers as the GT/BE

Fig. 15. Dynamic power for BE/GT traffic versus injection rate (packets/cycle/node) for GT/BE = 0.25, 0.5 and 1.

Fig. 16. Leakage power for BE/GT traffic versus injection rate (packets/cycle/node) for GT/BE = 0.25, 0.5 and 1.

ratio increases from 0.25 to 1.0, since the utilization of the GT virtual channels increases with increasing values of r. However, for a GT/BE ratio of 0.5 the power consumption in the BE virtual channel buffers is higher than that of the GT virtual channel buffers. This is observed because the BE virtual channels can be used to transfer GT traffic, but not vice versa.

The power consumption of the individual components of the router network for an injection rate of 0.05 and the different values of r is shown in Fig. 18. It can be seen from the plots that the virtual channel buffers are the dominant consumers of total power. It can also be


Fig. 17. FIFO (virtual channel buffer) power for BE and GT at an injection rate of 0.05 for GT/BE = 0.25, 0.5 and 1.

Fig. 18. Power consumption of the individual router components (FIFOs, header decoders, arbiters, crossbar, link controllers, links) at an injection rate of 0.05 for GT/BE = 0.25, 0.5 and 1.

observed that the header decoders, arbiters, and the link controllers also contribute significantly to the total power consumption.

Figs. 19–29 show similar plots for the router network under a Poisson traffic distribution. It should be noted that the latencies, acceptance rates and power consumption for the Poisson traffic model are very similar to those of the uniform random traffic model. This demonstrates that our router design can effectively support both kinds of traffic profiles.

Fig. 19. Network latency for BE traffic (Poisson) versus injection rate (packets/cycle/node) for GT/BE = 0.25, 0.5 and 1.


Fig. 20. Network latency for GT traffic (Poisson) versus injection rate (packets/cycle/node) for GT/BE = 0.25, 0.5 and 1.

The following conclusion can be inferred from the extensive experimentation performed with our router architecture supporting multiple levels of service: for a low value of r = 0.25, the GT traffic experiences almost zero queue latency and a low setup latency, and the acceptance rate of BE traffic remains high. Hence, a low value of r (around 0.25) should be utilized when designing a NoC with GT and BE traffic service levels.

7.2. Evaluation of error control schemes

We characterized the NoC for 0.18 μm technology, and consider V_dd = 1.8 V. We evaluated the performance of the error control schemes by setting the noise voltage deviation σ_N to 0.5 V

Fig. 21. Queue latency for BE traffic (Poisson) versus injection rate (packets/cycle/node) for GT/BE = 0.25, 0.5 and 1.

Fig. 22. Queue latency for GT traffic (Poisson) versus injection rate (packets/cycle/node) for GT/BE = 0.25, 0.5 and 1.

[45] and 0.36 V, respectively. The corresponding bit error rates ε are 0.035 (high, H in plots) and 0.0063 (low, L in plots), respectively. The ratio of GT/BE packets generated, r, has been taken to be 0.25. Fig. 30 plots the overall acceptance rate of the NoC under low and high error rates using both the PAR and SEC error control schemes. The acceptance rate for the PAR scheme is lower than that of the SEC scheme at higher injection rates because of the latency involved in retransmission. For lower injection rates, the difference in the acceptance rates between the two schemes diminishes due to the lighter traffic in the network.

Fig. 23. Setup latency for GT traffic (Poisson) versus injection rate (packets/cycle/node) for GT/BE = 0.25, 0.5 and 1.

Fig. 24. Acceptance rate for BE traffic (Poisson) versus injection rate (packets/cycle/node) for GT/BE = 0.25, 0.5 and 1.

Fig. 31 plots the network latencies under various injection and bit error rates. The network latency is always higher for the PAR scheme due to the retransmission delay. This is reinforced by the overall acceptance plot in Fig. 30. The average latency is higher at high bit error rates because more flits are prone to error and are hence retransmitted.

Fig. 32 shows the network power consumption for low and high error rates using both the PAR and SEC schemes. The SEC power consumption at high injection rates is more than that of PAR due to the higher acceptance rates of SEC. For low injection and low bit error rates, the power consumption of the SEC scheme is almost equal to that of the PAR scheme. However, the area consumed by the PAR implementation is lower than that of the SEC scheme, making it an attractive technique for error control

Fig. 25. Acceptance rate for GT traffic (Poisson) versus injection rate (packets/cycle/node) for GT/BE = 0.25, 0.5 and 1.

Fig. 26. Acceptance rate for combined BE and GT traffic (Poisson) versus injection rate (packets/cycle/node) for GT/BE = 0.25, 0.5 and 1.

in this case. For all other cases ({low bit error rate, high injection rate}, {high bit error rate, low injection rate}, {high bit error rate, high injection rate}), SEC is the preferred choice due to its higher acceptance rates. Moreover, for the low injection and high bit error case, the power consumed by the retransmission circuitry offsets the power consumed by error correction. The results for the error control schemes are summarized in Table 3. The table shows the appropriate error control scheme under the different injection and bit error rates, respectively. Fig. 33 shows the leakage power consumption for low and high error rates using both the PAR and SEC schemes. Leakage power consumption is higher in the PAR scheme than in the SEC scheme since its dynamic power consumption is lower, and vice versa.

Fig. 27. Dynamic power for BE and GT traffic (Poisson) versus injection rate (packets/cycle/node) for GT/BE = 0.25, 0.5 and 1.

Fig. 28. Leakage power for BE/GT traffic (Poisson) versus injection rate (packets/cycle/node) for GT/BE = 0.25, 0.5 and 1.

8. Conclusion

In this paper, we presented a cycle-accurate performance and power evaluation model for BE and GT traffic with error correction/detection on a mesh-based NoC. We presented results from extensive design space exploration and performance versus power trade-off analysis of a 4 × 4 mesh architecture. The experimental results were presented for both uniform and Poisson traffic distributions. The results demonstrated that our architecture is able to provide excellent support for both GT and BE traffic schemes as long as the GT/BE traffic ratio is around 0.25. On

Fig. 29. Component power (Poisson) at an injection rate of 0.05 for GT/BE = 0.25, 0.5 and 1.

Fig. 30. Acceptance rate (packets/cycle/node) for the PAR and SEC schemes under low (L) and high (H) bit error rates versus injection rate.

the basis of their performance and power consumption characteristics, it was also shown that the PAR (single error detection and retransmission) scheme is better than SEC (single error correction) at low injection and low error rates. In all other circumstances the SEC scheme gives better performance.

The current version of the model is limited to mesh-based topologies supporting deterministic routing schemes and synthetically generated traffic. Future work will address developing

Fig. 31. Network latency for the PAR and SEC schemes under low (L) and high (H) bit error rates versus injection rate.

Fig. 32. Dynamic power for the PAR and SEC schemes under low (L) and high (H) bit error rates versus injection rate.

router architectures and related power and performance models for generic topologies. Adaptive routing schemes will also be explored. Finally, design space exploration will be performed with communication traces of realistic benchmark applications that are mapped to NoC architectures.


Table 3
Summary of error control schemes

                        Injection rate (low)    Injection rate (high)
Bit error rate (low)    PAR                     SEC
Bit error rate (high)   SEC                     SEC

Fig. 33. Leakage power for the PAR and SEC schemes under low (L) and high (H) bit error rates versus injection rate (packets/cycle/node).

References
[1] D. Sylvester, K. Keutzer, A global wiring paradigm for deep submicron design, IEEE Trans. Comput. Aided Design Integrated Circuits Systems (2000) 242–252.
[2] R. Ho, K. Mai, M. Horowitz, The future of wires, Proc. IEEE (2001) 490–504.
[3] J. Davis, D. Meindl, Compact distributed RLC interconnect models, Part II: coupled line transient expressions and peak crosstalk in multilevel networks, IEEE Trans. Electron Devices 47 (11) (2000) 2078–2087.
[4] P. Guerrier, A. Greiner, A generic architecture for on-chip packet-switched interconnections, in: DATE, Paris, France, March 2000.
[5] M. Sgroi, M. Sheets, A. Mihal, K. Keutzer, S. Malik, J. Rabaey, A. Sangiovanni-Vincentelli, Addressing the system-on-a-chip interconnect woes through communication-based design, in: Proceedings of Design Automation Conference, June 2001, pp. 667–672.
[6] William J. Dally, Brian Towles, Route packets, not wires: on-chip interconnection networks, in: Proceedings of DAC, June 2001.
[7] Luca Benini, Giovanni De Micheli, Networks on chips: a new SoC paradigm, IEEE Comput. (2002) 70–78.


ARTICLE IN PRESS
P. Vellanki et al. / INTEGRATION, the VLSI journal 38 (2005) 353382 381

[8] S. Kumar, A. Jantsch, M. Millberg, J. Oberg, J.P. Soininen, M. Forsell, K. Tiensyrja, A. Hemani, A network on chip architecture and design methodology, in: IEEE Computer Society Annual Symposium on VLSI, Pittsburgh, Pennsylvania, April 2002.
[9] S. Lin, D.J. Costello, Error Control Coding: Fundamentals and Applications, Prentice-Hall, Englewood Cliffs, NJ, 1983.
[10] A. Andriahantenaina, A. Greiner, Micro-network for SoC: implementation of a 32-port SPIN network, in: DATE, Munich, Germany, March 2003.
[11] A. Andriahantenaina, H. Charlery, A. Greiner, L. Mortiez, C.A. Zeferino, SPIN: a scalable, packet switched, on-chip micro-network, in: DATE, Munich, Germany, March 2003.
[12] D. Siguenza-Tortosa, J. Nurmi, Proteo: a new approach to network-on-chip, in: Proceedings of IASTED International Conference on Communication Systems and Networks, Malaga, Spain, 2002.
[13] D. Siguenza-Tortosa, J. Nurmi, VHDL-based simulation environment for Proteo NoC, in: High-Level Design Validation and Test Workshop, Paris, France, October 2002.
[14] M. Dall'Osso, G. Biccari, L. Giovanninni, D. Bertozzi, L. Benini, Xpipes: a latency insensitive parameterized network-on-chip architecture for multi-processor SoCs, in: Proceedings of ICCD, San Jose, CA, October 2003.
[15] M. Millberg, E. Nilsson, R. Thid, S. Kumar, A. Jantsch, The Nostrum backbone: a communication protocol stack for networks on chip, in: VLSI Design Conference, Mumbai, India, January 2004.
[16] M. Millberg, E. Nilsson, R. Thid, A. Jantsch, Guaranteed bandwidth using looped containers in temporally disjoint networks within the Nostrum network on chip, in: DATE, February 2004, pp. 890–895.
[17] J. Dielissen, A. Radulescu, K. Goossens, E. Rijpkema, Concepts and implementation of the Philips network-on-chip, in: IP-Based SOC Design, November 2003.
[18] E. Rijpkema, K.G.W. Goossens, A. Radulescu, Trade offs in the design of a router with both guaranteed and best-effort services for networks on chip, in: DATE, 2004.
[19] D. Bertozzi, L. Benini, G. De Micheli, Low power error resilient encoding for on-chip data buses, in: DATE, 2003.
[20] H. Zimmer, A. Jantsch, A fault model notation and error-control scheme for switch-to-switch buses in a network-on-chip, in: ISSS/CODES, 2003.
[21] F. Worm, P. Ienne, P. Thiran, G. De Micheli, An adaptive low-power transmission scheme for on-chip networks, in: Proceedings of ISSS, Kyoto, Japan, 2002.
[22] X. Chen, L.-S. Peh, Leakage power modeling and optimization in interconnection networks, in: Proceedings of ISLPED, Seoul, Korea, 2003.
[23] T. Simunic, S. Boyd, Managing power consumption in networks on chips, in: Proceedings of DATE, Paris, France, 2002.
[24] J. Duato, S. Yalamanchili, L. Ni, Interconnection Networks: An Engineering Approach, IEEE Computer Society, 1997.
[25] H.J. Seigel, A model of SIMD machines and a comparison of various interconnection networks, IEEE Trans. Comput. 28 (12) (1979) 907–917.
[26] W.J. Dally, Performance analysis of k-ary n-cube interconnection networks, IEEE Trans. Comput. 39 (6) (1990) 775–785.
[27] J.F. Draper, J. Ghosh, A comprehensive analytical model for wormhole routing in multicomputer systems, J. Parallel Distributed Comput. 23 (1994) 202–214.
[28] D. Brooks, V. Tiwari, M. Martonosi, Wattch: a framework for architectural-level power analysis and optimizations, in: International Symposium on Computer Architecture, 2000, pp. 83–94.
[29] W. Ye, N. Vijaykrishnan, M. Kandemir, M.J. Irwin, The design and use of SimplePower: a cycle-accurate energy estimation tool, in: Proceedings of Design Automation Conference, June 2000.
[30] T. Givargis, F. Vahid, J. Henkel, Instruction-based system-level power evaluation of system-on-a-chip peripheral cores, IEEE Trans. VLSI 10 (6) (2002).
[31] Arm Inc., AMBA specification, 1999.
[32] IBM, The CoreConnect bus architecture, 1999.
[33] D. Wingard, MicroNetwork-based integration of SOCs, in: DAC, Las Vegas, Nevada, June 2001.
[34] A.G. Wassal, M.A. Hasan, Low-power system-level design of VLSI packet switching fabrics, IEEE Trans. CAD 20 (2001) 723–738.


[35] Terry T. Ye, Luca Benini, Giovanni De Micheli, Analysis of power consumption on switch fabrics in network routers, in: Proceedings of DAC, 2002.
[36] D. Pamunuwa, J. Oberg, L.R. Zheng, M. Millberg, A. Jantsch, H. Tenhunen, Layout, performance and power trade-offs in mesh-based network-on-chip architectures, in: IFIP International Conference on Very Large Scale Integration (VLSI-SOC), Darmstadt, Germany, December 2003, pp. 362–367.
[37] H.-S. Wang, L.-S. Peh, S. Malik, Orion: a power-performance simulator for interconnection networks, in: International Symposium on Microarchitecture, Istanbul, Turkey, November 2002.
[38] E. Bolotin, I. Cidon, R. Ginosar, A. Kolodny, Cost considerations in network on chip, in: Integration, the VLSI Journal, November 2003.
[39] A. Pinto, L.P. Carloni, A.L. Sangiovanni-Vincentelli, Efficient synthesis of networks on chip, in: ICCD, 2003.
[40] J. Hu, R. Marculescu, Energy-aware communication and task scheduling for network-on-chip architectures under real-time constraints, in: DATE, Paris, France, February 2004.
[41] S. Murali, G. De Micheli, Bandwidth constrained mapping of cores onto NoC architectures, in: DATE, Paris, France, February 2004.
[42] P. Sotiriadis, A. Chandrakasan, A bus energy model for deep sub-micron technology, IEEE Trans. VLSI 10 (3) (2002).
[43] Berkeley predictive technology modeling, http://www-device.eecs.berkeley.edu/~ptm, technical report.
[44] N. Banerjee, P. Vellanki, K.S. Chatha, A power and performance model for network-on-chip architectures, in: DATE, 2004.
[45] R. Hegde, N. Shanbhag, Towards achieving energy efficiency in presence of deep submicron noise, IEEE Trans. VLSI 8 (4) (2000) 379–391.
[46] L. Li, N. Vijaykrishnan, M. Kandemir, M.J. Irwin, Adaptive error protection for energy efficiency, in: ICCAD, 2003.