Research Article
Usman Ali Gulzari1,2, Sheraz Anjum3, Shahrukh Agha1, Sarzamin Khan4, Frank Sill Torres5
1Department of Electrical Engineering, COMSATS Institute of Information Technology, Islamabad, Pakistan
2Department of Electrical Engineering, The University of Lahore, Islamabad Campus, Pakistan
3Department of Computer Science, COMSATS Institute of Information Technology, WahCantt, Pakistan
4Department of Electrical Engineering, COMSATS Institute of Information Technology, WahCantt, Pakistan
5Department of Electronic Engineering, Federal University of Minas Gerais, Belo Horizonte, Brazil
E-mail: usmangulzari2000@yahoo.com
Abstract: This study presents an efficient and scalable network-on-chip (NoC) topology termed cross-by-pass-mesh (CBP-Mesh).
The proposed architecture is derived from the traditional mesh topology by adding cross-by-pass links to the network.
The design of the cross-by-pass links and their impact on the topology are analysed in detail with the help of synthetic,
hotspot and embedded traffic traces. The advantages of the proposed CBP-Mesh over its competitor topologies
include a reduced network diameter, an increased bisection bandwidth, a lower average number of hops, and improved
symmetry and regularity of the network. Synthetic traffic traces and several real embedded system workloads are applied
to the proposed CBP-Mesh and to its competitor two-dimensional NoC topologies. The comparison of analytical results in
terms of performance and costs for different network dimensions indicates that the proposed CBP-Mesh offers short latency,
high throughput and good scalability at a small increase in power and energy.
limitation by using additional D-Links [12]. However, the D-Mesh topology requires higher degree routers, resulting in considerably increased costs in terms of power consumption [14]. In detail, D-Mesh applies M- and D-Links [14]. Thereby, the number of D-Links is higher than in mesh and Torus. It can be concluded that D-Mesh is a complex and costly network topology [11].
XD-Mesh (Fig. 1d) and C2-Mesh (Fig. 1e) have been proposed to reduce complexity and costs while keeping high scalability [13, 16]. In detail, XD-Mesh applies XD-Links between corner nodes through the centre node, while C2-Mesh adds a single link between two opposite nodes of each 3 × 3 mesh network [13]. It should be noted that XD-Mesh and C2-Mesh offer simplicity and low cost, but have a lower performance in comparison to D-Mesh [16].
Fig. 2 CBP-Mesh router (Ro) with router links for interlinking the connection with neighbour router nodes
3.1 Motivation
To increase the performance of the mesh network, the worst-case hop count for a packet travelling from its source to its destination node should be addressed. In a 3 × 3 mesh network, the worst cases are the opposite corner pairs Ro0,0 ↔ Ro2,2 and Ro0,2 ↔ Ro2,0 (see Fig. 1a), which are four hops apart. The T-Links of the Torus network cover this distance in two hops by hopping between the corner nodes. By using D-Links, the XD-Mesh and D-Mesh topologies take two hops through the central node to reach the opposite corner node. C2-Mesh adds one extra C-Link to a 3 × 3 mesh network, which reduces the hop count between the two opposite corner nodes Ro0,0 ↔ Ro2,2 in Fig. 1e. However, the other pair of opposite corner nodes, Ro0,2 ↔ Ro2,0 in Fig. 1e, is ignored. The proposed CBP-Mesh design adds two CBP-Links to a mesh network, placed between both pairs of opposite corner nodes, Ro0,0 ↔ Ro2,2 and Ro0,2 ↔ Ro2,0, to reduce the hop count and connect them directly (see Fig. 1f for details). The introduction of the CBP-Links reduces the distance from two hops, as in the case of the Torus, XD-Mesh and D-Mesh networks, to a single hop. Addressing the worst-case scenario of opposite corners in the CBP-Mesh network results in higher performance of the network.
The proposed CBP-Mesh is a scalable topology whose basic building block is shown in Fig. 1f. The CBP-Mesh architecture can be extended to an odd (5 × 5) number of nodes, a higher number of nodes or an odd/even (3 × 4) number of nodes in the network, as shown in Figs. 3, 4 and 6c, respectively. CBP-Links in addition to M-Links provide multiple paths, which helps to accommodate more adaptive and dynamic routing algorithms in the proposed network.
3.2 Principle architecture
IET Comput. Digit. Tech., 2017, Vol. 11 Iss. 4, pp. 140-148 141
© The Institution of Engineering and Technology 2017
\[
N_{\text{links, CBP\_even}} = 2n^2 - 2n + 2\left\lfloor \frac{n}{3} \right\rfloor^2 \quad (n \ge 6) \tag{1b}
\]
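As a worked check of the link-count expressions, the values for two small networks can be evaluated by hand. The lines below are only a sketch: they assume that the n/2 and n/3 terms are rounded down, and they take the odd-size expression 2n² − 2n + 2⌊n/2⌋² from Table 1 alongside (1b).

```latex
\begin{align*}
N_{\text{links, CBP\_odd}}\big|_{n=5}  &= 2\cdot 5^2 - 2\cdot 5 + 2\left\lfloor 5/2 \right\rfloor^2 = 50 - 10 + 8 = 48,\\
N_{\text{links, CBP\_even}}\big|_{n=6} &= 2\cdot 6^2 - 2\cdot 6 + 2\left\lfloor 6/3 \right\rfloor^2 = 72 - 12 + 8 = 68.
\end{align*}
```

The first value agrees with the 48 links listed for the 5 × 5 CBP-Mesh in Table 3.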
Fig. 4 Different CBP-Mesh network architectures
(a) 5 × 5 CBP-Mesh, (b) 3 × 7 CBP-Mesh, (c) 3 × 9 CBP-Mesh
↔ Ro2,6 take three hops by using the CBP-Links, and adjacent green router nodes take one more hop to route packets to their destinations.
4 Topology characteristics
This section presents the characteristics of NoC topologies and compares the proposed CBP-Mesh with other selected topologies.
4.1 Characteristics of a CBP-Mesh
A NoC topology can be characterised by its network diameter, bisection width, number of links, degree of routers and the existence of path diversity [17–20]. Table 1 compares the general characteristics of the analysed NoC topologies, whereby symmetric meshes are assumed (size n × n).
The network diameter is defined as the maximum shortest path between all terminal node pairs of the NoC [18]. For example, the network diameter of an n × n mesh is 2n − 2. Reducing the network diameter leads to the minimisation of the hop count between nodes and thus to a decrease of the overall latency. It follows that mesh has the longest diameter, while CBP-Mesh offers the shortest one.
The bisection width is the minimum number of links that need to be removed in order to separate a network into two equal parts [13]. For example, the bisection width of an n × n mesh is n [14]. The bisection bandwidth, which is the bandwidth available between both parts, results from the product of bisection width and link bandwidth. Adding links to the network increases the bisection width due to the increased number of paths between the two sub-networks. Consequently, the throughput is higher and the traffic flow in the network improves. The topology with the smallest bisection bandwidth is mesh, followed by XD-Mesh and C2-Mesh. In contrast, mesh has the lowest number of links, followed by C2-Mesh, XD-Mesh and the proposed CBP-Mesh.
Fig. 5 Comparison of proposed and selected topologies
(a) Average network latency, (b) Average network throughput, (c) Total network power, (d) Energy per data transferred packets
A link is the connection between two routing elements, i.e. routers, in a NoC. For example, the number Nlinks of links of an n × n mesh is 2n² − 2n. The highest number of links is applied in D-Mesh.
The degree of a router is the number of links that can be connected to the router. Thereby, a higher degree leads to an increase in complexity and costs in terms of power consumption of the router. The comparison of required router types reveals that mesh and Torus topologies apply only medium degree routers, while the other topologies use very complex routers with a degree of up to 9.
Path diversity refers to the number of available paths between two nodes in a NoC. It should be noted that higher path diversity increases the fault tolerance of the network.
Table 1 General characteristics of different symmetric mesh network topologies (size n × n)
| Characteristics | Mesh | Torus | D-Mesh | XD-Mesh | C2-Mesh | CBP-Mesh |
|---|---|---|---|---|---|---|
| number of nodes | n² | n² | n² | n² | n² | n² |
| diameter | 2n − 2 | n − 1 | n − 1 | n − 1 | n − 1 | n − 1 |
| bisection width | n | 2n | 3n − 2 | n + 2 | n + 2 | 2(n + 1) |
| number of links | 2n² − 2n | 2n² | 4n² − 6n + 2 | 2n² − 2n + 8 | 2n² − 2n + 4 | 2n² − 2n + 2⌊n/2⌋² |
| router degree | 3 to 5 | 5 | 4 to 9 | 4 to 9 | 4 to 9 | 4 to 9 |
| path diversity | yes | yes | yes | yes | yes | yes |
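The diameter entries of Table 1 can be checked numerically with a breadth-first search. The sketch below is illustrative only: it builds a 5 × 5 mesh and Torus, and the 3 × 3 CBP-Mesh building block under the assumption that the two CBP-Links simply join the opposite corners, as described in Section 3.1. All function names are hypothetical.

```python
from collections import deque
from itertools import product


def mesh_links(n):
    """Undirected M-Links of an n x n mesh as a set of frozenset node pairs."""
    links = set()
    for r, c in product(range(n), repeat=2):
        if r + 1 < n:
            links.add(frozenset({(r, c), (r + 1, c)}))
        if c + 1 < n:
            links.add(frozenset({(r, c), (r, c + 1)}))
    return links


def torus_links(n):
    """Mesh links plus one wrap-around T-Link per row and per column."""
    links = mesh_links(n)
    for i in range(n):
        links.add(frozenset({(i, 0), (i, n - 1)}))
        links.add(frozenset({(0, i), (n - 1, i)}))
    return links


def hop_counts(links, source):
    """Breadth-first search: minimum hop count from source to every node."""
    neighbours = {}
    for link in links:
        a, b = tuple(link)
        neighbours.setdefault(a, []).append(b)
        neighbours.setdefault(b, []).append(a)
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in neighbours[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def diameter(links):
    """Largest shortest-path length over all node pairs."""
    nodes = {node for link in links for node in link}
    return max(max(hop_counts(links, s).values()) for s in nodes)


# Assumed 3 x 3 CBP-Mesh block: mesh plus two corner-to-corner CBP-Links.
cbp_3x3 = mesh_links(3) | {
    frozenset({(0, 0), (2, 2)}),
    frozenset({(0, 2), (2, 0)}),
}

if __name__ == "__main__":
    n = 5
    print("mesh diameter:", diameter(mesh_links(n)))
    print("Torus diameter:", diameter(torus_links(n)))
    print("3x3 mesh, corner to corner:", hop_counts(mesh_links(3), (0, 0))[(2, 2)])
    print("3x3 CBP block, corner to corner:", hop_counts(cbp_3x3, (0, 0))[(2, 2)])
    print("3x3 CBP block diameter:", diameter(cbp_3x3))
```

For n = 5 the search gives 8 hops for the mesh (2n − 2) and 4 hops for the Torus (n − 1, which holds for odd n). In the 3 × 3 block the corner-to-corner distance drops from four hops to one and the diameter becomes 2.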
Table 2 Types of links used by analysed symmetric mesh topologies
| Topology | M-Links | T-Links | D-Links | CBP-Links |
|---|---|---|---|---|
| mesh | 2n² − 2n | - | - | - |
| Torus | 2n² − 2n | 2n | - | - |
| D-Mesh | 2n² − 2n | - | 2(n − 1)² | - |
| XD-Mesh | 2n² − 2n | - | 2(n − 1) | - |
| C2-Mesh | 2n² − 2n | - | - | (n − 1) |
| CBP-Mesh (odd) | 2n² − 2n | - | - | 2⌊n/2⌋² |
| CBP-Mesh (even) | 2n² − 2n | - | - | 2⌊n/3⌋² |

Table 2 compares the type of links each analysed topology applies (see also Fig. 2), whereby again symmetric meshes are assumed (size n × n). The C- and CBP-Links are treated as one class because of their similar shape; likewise, XD- and D-Links are considered one class. As all topologies are mesh-like topologies, the number of M-Links is the same for all. A further analysis reveals that Torus and C2-Mesh require only a small number of additional links, while the number of D-Links for D-Mesh increases quadratically with n. This tendency is the same for the number of CBP-Links in the proposed CBP-Mesh, but with a growth factor that is four or nine times lower, depending on the type of mesh (odd or even).
Table 3 details the number of required routers and links for a 5 × 5 NoC realised in the analysed topologies. The data indicate that D-Mesh requires a considerable number of routers with degree 6 and/or 9 and possesses a very high number of links. In contrast, mesh and Torus apply solely routers with up to five ports and a low number of links. Finally, XD-Mesh, C2-Mesh and CBP-Mesh have balanced requirements of routers and links.
The advantages gained in the proposed CBP-Mesh network can be summarised as follows:
• Reduction in network diameter.
• Reduction in number of hops between nodes.
• Increase in bisection bandwidth of the network.
• Availability of multi-paths to the network's centre node.
• Additional fault tolerance of the network.
• Scalability.
5 Simulation results
To compare the proposed CBP-Mesh to existing approaches, different network topologies are implemented and analysed using the NoCTweak [21] simulator. The results are presented in this section.
5.1 Simulation environment
The NoC topologies mesh, Torus, XD-Mesh, C2-Mesh, D-Mesh and CBP-Mesh are implemented in the NoCTweak simulator, which is an open-source, cycle-level accurate tool written in SystemC [21]. NoCTweak is selected for implementation and simulation of all networks due to the availability of large sets of workloads of different synthetic traffic and real embedded system application traces.
The existing source routing algorithm is used to compute the shortest path, and the NMAP algorithm is used to map embedded applications onto the processing cores of the network [21]. The hotspot, synthetic and real embedded traffic traces are applied to the proposed CBP-Mesh and its competitor topologies for comparison. The other simulator configurations used in the simulations are given in Table 4.

Table 4 Experiments performed in NoCTweak with following parameters
| Parameter | Value |
|---|---|
| technology | 65 nm |
| operating voltage | 1.0 V |
| clock frequency | 1 GHz |
| input buffer size | 16 flits |
| number of virtual | 8 |
| packet length (flit units) | 150 |
| flit length | 32 bits |
| router | 3-stage pipeline |
| each simulation runs | 100,000 cycles |
| warm-up cycle time | 20,000 cycles |
| links length | 1000 μm |
| flit injection rate | 0.02 flits/cycle/node |

5.2 Scalability analysis
The synthetic traffic traces of the hotspot workload are applied to the proposed CBP-Mesh and its competitor networks. The analysis focuses on the scalability of the network topologies in terms of latency, throughput, power and energy. Therefore, each network topology has been implemented for different symmetric mesh sizes from 3 × 3 to 9 × 9, and the results are depicted in Figs. 5a–d.
As expected, the mesh topology scales badly. Its latency increases more and more for complex networks, while the throughput drops significantly (see Figs. 5a and b). However, the mesh network has a low cost in terms of total network power and energy due to its simple network design, as shown in Figs. 5c and d.
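One way to see why the plain mesh scales poorly is to look at how its average hop count grows with the network size. The sketch below is not taken from the simulations: it assumes uniform random traffic (every ordered pair of distinct nodes equally likely, unlike the hotspot workload used here) and minimal routing, so the hop count of a pair equals its Manhattan distance. The function name `mesh_average_hops` is hypothetical.

```python
from itertools import product


def mesh_average_hops(n):
    """Mean minimal hop count over all ordered pairs of distinct nodes
    in an n x n mesh (uniform traffic assumed)."""
    nodes = list(product(range(n), repeat=2))
    total = 0
    pairs = 0
    for (r1, c1), (r2, c2) in product(nodes, repeat=2):
        if (r1, c1) != (r2, c2):
            total += abs(r1 - r2) + abs(c1 - c2)  # Manhattan distance
            pairs += 1
    return total / pairs


if __name__ == "__main__":
    for n in (3, 5, 7, 9):
        print(f"{n} x {n} mesh: {mesh_average_hops(n):.2f} hops on average")
```

Under these assumptions the mean works out to 2n/3, so it grows linearly with n: 2 hops for the 3 × 3 mesh and 6 hops for the 9 × 9 mesh.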
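Table 3 Number of links and routers with different degrees for a 5 × 5 network in the selected topologies
| Topology | 3-Port | 4-Port | 5-Port | 6-Port | 7-Port | 9-Port | Nlinks |
|---|---|---|---|---|---|---|---|
| mesh | 4 | 12 | 9 | 0 | 0 | 0 | 40 |
| Torus | 0 | 0 | 25 | 0 | 0 | 0 | 50 |
| D-Mesh | 0 | 4 | 0 | 12 | 0 | 9 | 62 |
| XD-Mesh | 0 | 16 | 4 | 0 | 4 | 1 | 48 |
| C2-Mesh | 0 | 16 | 8 | 0 | 0 | 1 | 44 |
| CBP-Mesh | 0 | 12 | 8 | 4 | 0 | 1 | 48 |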
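The rows of Table 3 can be cross-checked against the link expressions of Table 1. The sketch below assumes that each router has one local port towards its processing element, so a k-port router terminates k − 1 network links, and that every link is shared by exactly two routers. The dictionary layout and function names are illustrative only.

```python
# Router-type counts for the 5 x 5 networks, copied from Table 3
# (number of ports -> number of routers).
ROUTERS_5X5 = {
    "mesh": {3: 4, 4: 12, 5: 9},
    "Torus": {5: 25},
    "D-Mesh": {4: 4, 6: 12, 9: 9},
    "XD-Mesh": {4: 16, 5: 4, 7: 4, 9: 1},
    "C2-Mesh": {4: 16, 5: 8, 9: 1},
    "CBP-Mesh": {4: 12, 5: 8, 6: 4, 9: 1},
}

# Closed-form link counts from Table 1, evaluated for a given n.
LINK_FORMULAS = {
    "mesh": lambda n: 2 * n * n - 2 * n,
    "Torus": lambda n: 2 * n * n,
    "D-Mesh": lambda n: 4 * n * n - 6 * n + 2,
    "XD-Mesh": lambda n: 2 * n * n - 2 * n + 8,
    "C2-Mesh": lambda n: 2 * n * n - 2 * n + 4,
    # Rounding n/2 down is an assumption about the CBP-Mesh expression.
    "CBP-Mesh": lambda n: 2 * n * n - 2 * n + 2 * (n // 2) ** 2,
}


def implied_links(router_counts, local_ports=1):
    """Links implied by a router mix: network ports summed, divided by two."""
    network_ports = sum(
        (ports - local_ports) * count for ports, count in router_counts.items()
    )
    if network_ports % 2:
        raise ValueError("every link needs two endpoints")
    return network_ports // 2


if __name__ == "__main__":
    n = 5
    for name, counts in ROUTERS_5X5.items():
        print(
            f"{name:9s} routers={sum(counts.values()):2d} "
            f"implied links={implied_links(counts):2d} "
            f"Table 1 expression={LINK_FORMULAS[name](n):2d}"
        )
```

Every row sums to 25 routers, and the implied link count equals the Table 1 expression for all six topologies. For D-Mesh both give 72 links, whereas the Nlinks column of Table 3 lists 62.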
Similar observations can be made for Torus. The Torus topology shows an average increase in latency and a moderate reduction in throughput for rising network complexity. However, larger networks require long interconnections, resulting in designs that are not regular. Consequently, router design and routing strategies have to be more complex, resulting in lower scalability. In contrast, XD-Mesh and C2-Mesh offer good performance parameters at a reasonable increase in power and energy. The XD-Mesh and C2-Mesh networks possess low complexity and low cost. Their performance is lower than that of D-Mesh, though. The results indicate good performance parameters in latency and throughput for D-Mesh. However, the extensive use of links and high degree routers leads to high costs in terms of power and energy. In contrast, the proposed CBP-Mesh scales with very good performance, with considerably lower penalties in performance against D-Mesh (see Figs. 5a–d).
As can be seen, the proposed 7 × 7 CBP-Mesh has the lowest average latency and good throughput, which is only outperformed by the costly D-Mesh and the poorly scalable Torus. Compared with the classic mesh, latency and throughput improve by 36.9 and 38.9%, while power consumption and energy increase by 13.5 and 13.6%. In comparison to D-Mesh, latency and throughput of the CBP-Mesh are 13.0 and 33.3% lower, at 23.7 and 18.2% lower costs in terms of power and energy. In the case of the 9 × 9 network topologies, CBP-Mesh has the lowest average latency, which is 21.8 and 45.7% lower in comparison to D-Mesh and the standard mesh, respectively. Furthermore, the 11 × 11 CBP-Mesh has 34.4% lower power consumption and 35.5% lower energy consumption in comparison to its D-Mesh counterpart.
5.3 Performance for embedded application
Besides the synthetic traffic, the NoCTweak simulator provides several real-time embedded application traces. The NMAP algorithm is selected to map all the tasks of an embedded application's task graph to the tiles of the NoC. Some real-world embedded applications selected for analysis and comparison of topologies are given in Table 5.

Table 5 Some embedded applications with required number of cores
| Embedded applications | Applications required task |
|---|---|
| MPEG4 decoder | Mpeg4 with 12 cores |
| Wifirx baseband receiver | WiFi with 25 cores |
| video object plane and decoder | Vopd with 16 cores |
| video conference encoder | Vce with 25 cores |
| multimedia system | mms with 25 cores |

The complete task graph of one of the chosen applications, i.e. the MPEG-4 decoder with 12 cores, showing the bandwidth requirement and information flow among different tasks, is depicted in Fig. 6a. The NMAP algorithm is applied to map the tasks of embedded applications onto the tiles of the networks. The placement of cores for the D-Mesh and CBP-Mesh networks is shown in Figs. 6b and c. The SDRAM (C9) has a higher traffic load than the other cores in the MPEG-4 application and therefore requires more links to connect to other cores in the network (see Fig. 6a). C9 thus plays a vital role in the overall performance of the network. The role of C9 is therefore illustrated for two competitors, i.e. D-Mesh and CBP-Mesh. D-Mesh provides C9 with eight direct links to connect to other PEs in the network. C9 is directly connected to C0, C1, C2, C4, C5, C8, C6 and C10 in D-Mesh, which reduces the hop count in the network. However, the traffic from C2 → C8 has to pass through C9 in order to take the shortest path, and therefore packets have to be buffered at C9. Also, the traffic from C3 → C8 may traverse the path through the C9 and C5 routers. Similarly, packets from C4 → C7 may also pass through C9. This means that C9 becomes a bottleneck in the network due to its heavy traffic load. D-Mesh has the highest number of links in the network, which increases the complexity of the routers and the cost of the network to a great extent.
The proposed CBP-Links in CBP-Mesh provide direct connectivity between the C0 → C9 and C2 → C8 pairs of nodes. The C9 node directly connects with C0, C4, C6 and C7. The traffic from C3 → C8 takes only one hop through C2 using the M- and CBP-Links, as compared with three hops in the D-Mesh network. C5 and C10 both require only one hop to their destination node C9. CBP-Mesh divides the traffic with good balance in the network and therefore does not create a bottleneck at C9 as in the D-Mesh network. The individual latency of C9 in both D-Mesh and CBP-Mesh is compared using NoCTweak. The CBP-Mesh V9 and V10 took 37 and 75% less latency than the D-Mesh C9 node. The average network latency, throughput, total power and energy of the networks are analysed by applying five different real-world embedded application traffic traces to six different NoC networks, including the proposed network. The results for these four parameters are shown in Figs. 7a–d.
Simulation results indicate that the MPEG-4 application of a CBP-Mesh improves latency and throughput by up to 8.9 and
15.7% from its predecessor C2-Mesh, while costs in terms of power and energy increase by up to 3.6 and 8.3%. Compared with the more complex D-Mesh NoC topology, latency is reduced by 7.6%, while the throughput is up to 15.7% lower than D-Mesh. However, CBP-Mesh outperforms the D-Mesh topology in terms of power and energy penalty, which are 22 and 31% lower. Similarly, the proposed CBP-Mesh has been implemented with five real embedded system application workload traces and compared with the other selected topologies. Simulation results indicate that the application of a CBP-Mesh improves latency compared with all other topologies.
Across all applications, the throughput of CBP-Mesh is higher than that of the other topologies except D-Mesh, which has a higher number of links than the CBP-Mesh. However, CBP-Mesh outperforms D-Mesh in terms of power and energy penalty, which are lower in all the applications (see Figs. 7a–d).

6 Conclusion
NoC is a promising paradigm to enable fast and reliable on-chip communication in large-scale multiprocessor systems. The performance of a NoC is driven by several parameters including the network topology. The impact of topology further increases with increasing system complexity. The principal difference between NoC topologies is the type and the application of the interconnections, which have been classified into four basic links in this study.
Additionally, this work proposes a new NoC topology termed CBP-Mesh that improves the standard mesh by adding new CBP-Links to the network. This modification helps in the reduction of the network diameter, minimises the average number of hops and increases the bisection width of the NoC.
The proposed CBP-Mesh and five other topologies proposed in previous works are implemented using NoCTweak. Synthetic as well as five different real embedded system workloads are applied to analyse average network latency, throughput, total power and energy of all the networks. Simulation results indicate that CBP-Mesh efficiently reduces the distance amongst nodes as compared with other selected topologies. The proposed network also improves the average network latency and throughput at a lower cost of power and energy than its predecessors, i.e. mesh, Torus, C2-Mesh and XD-Mesh. Compared with its competitor D-Mesh topology, CBP-Mesh exhibits lower latency as well as lower throughput. However, CBP-Mesh outperforms the D-Mesh network in terms of power and energy penalty. Furthermore, the analytical results also indicate good scalability of the proposed network with increasing network complexity. In short, the proposed NoC topology can be used for communication among cores with lower latency and greater throughput at a reasonable penalty in terms of power and energy consumption.

7 Acknowledgment
The authors are thankful to COMSATS Institute of Information Technology for providing the platform to carry out this research work. This work was partially supported by the Brazilian agencies FAPEMIG and CNPq.

8 References
[1] Zarandi, H.R.: ‘A fault-tolerant core mapping technique in networks-on-chip’, 2013, 7, (August), pp. 238–245
[2] Sehgal, V.K., Chauhan, D.S.: ‘State observer controller design for packets flow control in networks-on-chip’, J. Supercomput., 2010, 54, (3), pp. 298–329
[3] Khawaja, S.G., Mushtaq, M.H., Khan, S.A., et al.: ‘Designing area optimized application-specific network-on-chip architectures while providing hard QoS guarantees’, PLoS One, 2015, 10, (4), pp. 1–17
[4] Pomante, L.: ‘HW/SW co-design of dedicated heterogeneous parallel systems: an extended design space exploration approach’, IET Comput. Digit. Tech., 2013, 7, (6), pp. 246–254
[5] Morgan, A.A., El-Kharashi, M.W., Elmiligi, H., et al.: ‘Unified multi-objective mapping and architecture customisation of networks-on-chip’, IET Comput. Digit. Tech., 2013, 7, (6), pp. 282–293
[6] Ju, X., Yang, L.: ‘Performance analysis and comparison of 2 × 4 network on chip topology’, Microprocess. Microsyst., 2012, 36, (6), pp. 505–509
[7] Catania, V., Mineo, A., Monteleone, S., et al.: ‘Energy efficient transceiver in wireless network on chip architectures’. Proc. DATE ‘16, 2016, pp. 1321–1326
[8] Balfour, J., Dally, W.J.: ‘Design tradeoffs for tiled CMP on-chip networks’. Proc. 20th Annual Int. Conf. Supercomputing ICS 06, 2006, vol. 28, no. 1, p. 187
[9] Kim, J., Balfour, J., Dally, W.J.: ‘Flattened butterfly topology for on-chip networks’
[10] Anjum, S., Chen, J., Yue, P., et al.: ‘Delay optimized architecture for on-chip communication’, 2009, 7, (2), pp. 104–109
[11] Ya-gang, W., Hui-min, D.U., Xu-bang, S.: ‘Topological properties and routing algorithm for semi-diagonal torus networks’, 2011, 18, (October), pp. 64–70
[12] Hu, W., Lee, S.E., Bagherzadeh, N.: ‘DMesh: a diagonally-linked mesh network-on-chip architecture.’
[13] Arora, L.K.: ‘C 2 Mesh’, 2012, pp. 282–286
[14] Gulzari, U.A., Anjum, S., Agha, S.: ‘Cross by pass-mesh architecture for on-chip communication’. Proc. – IEEE 9th Int. Symp. Embedded Multicore/Manycore SoCs, MCSoC 2015, 2015, pp. 267–274
[15] Ouyang, Y., Zhu, B., Liang, H., et al.: ‘Networks on chip based on diagonal interlinked mesh topology structure’, Comput. Eng., 2009
[16] Swaminathan, K., Lakshminarayanan, G., Ko, S.: ‘A novel hybrid topology for network on chip’, 2014, pp. 1–6
[17] Via, O., Insertion, L.L.: ‘“It’s a small world after all”: NoC performance’, 2006, 14, (7), pp. 693–706
[18] Sanju, V., Chiplunkar, N., Khalid, M., et al.: ‘A performance study of 2D mesh & torus for network on chip based system’, 2013, pp. 0–4
[19] Elmiligi, H., Morgan, A., El-Kharashi, M.: ‘Power optimization for application-specific networks-on-chips: a topology-based approach’, Microprocessors, 2009
[20] Grecu, C., Ivanov, A., Pande, P., et al.: ‘Towards open network-on-chip benchmarks’. Proc. Int. Symp. Networks-on-Chips, NOCS, 2007
[21] Tran, A., Baas, B.: ‘NoCTweak: a highly parameterizable simulator for early exploration of performance and energy of networks on-chip’, 2012

9 Appendix
Link assignment algorithm for a CBP-Mesh network with size m × n. The current router node is ro(x,y), the connecting links to neighbour routers are $l_i^{N}$, $l_i^{S}$, $l_i^{E}$, $l_i^{W}$ and the CBP-Links are $l_i^{NE}$, $l_i^{NW}$, $l_i^{SE}$, $l_i^{SW}$ (see Fig. 8).
Fig. 8 Link assignment algorithm for a CBP-Mesh network
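The following Python sketch illustrates one possible link assignment for an m × n CBP-Mesh with odd m and n. It is an assumption rather than the algorithm of Fig. 8: the network is tiled with 3 × 3 building blocks that share their border rows and columns, so that block corners sit at even coordinates, and each block receives the two corner-to-corner CBP-Links of Fig. 1f. The direction labels follow the appendix (N, S, E, W, NE, NW, SE, SW), x is taken as the row index, and the function names are hypothetical.

```python
from collections import Counter

# (row, column) offsets of the mesh links and of the CBP-Links, which span
# a whole 3 x 3 building block diagonally.
M_OFFSETS = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}
CBP_OFFSETS = {"NE": (-2, 2), "NW": (-2, -2), "SE": (2, 2), "SW": (2, -2)}


def assign_links(m, n):
    """Return {(x, y): {direction: neighbour}} for an m x n CBP-Mesh.

    Only odd m and n are covered by this sketch.
    """
    if m % 2 == 0 or n % 2 == 0:
        raise ValueError("this sketch only covers odd dimensions")
    links = {}
    for x in range(m):
        for y in range(n):
            ports = {}
            for name, (dx, dy) in M_OFFSETS.items():
                if 0 <= x + dx < m and 0 <= y + dy < n:
                    ports[name] = (x + dx, y + dy)
            # Assumed rule: only block corners (even row, even column)
            # carry CBP-Links.
            if x % 2 == 0 and y % 2 == 0:
                for name, (dx, dy) in CBP_OFFSETS.items():
                    if 0 <= x + dx < m and 0 <= y + dy < n:
                        ports[name] = (x + dx, y + dy)
            links[(x, y)] = ports
    return links


def port_histogram(links, local_ports=1):
    """Count routers by port number (network links plus local port)."""
    return Counter(len(ports) + local_ports for ports in links.values())


def link_count(links):
    """Each undirected link is stored at both of its end routers."""
    return sum(len(ports) for ports in links.values()) // 2


if __name__ == "__main__":
    net = assign_links(5, 5)
    print(sorted(port_histogram(net).items()))
    print(link_count(net))
```

For the 5 × 5 case this tiling yields 12 four-port, 8 five-port, 4 six-port and 1 nine-port router with 48 links, the same values as the CBP-Mesh row of Table 3. In a 3 × 7 network it also gives the three-hop path Ro0,0 → Ro2,2 → Ro0,4 → Ro2,6 over CBP-Links.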