ABSTRACT Conventional theory for designing strictly nonblocking networks, such as crossbars or Clos
networks, assumes that these networks have a centralized topology. Many new applications, however, require
networks to have a distributed topology, like 2-D mesh or torus. In this paper, we present a new theoretical
framework for designing such nonblocking networks. The framework is based on a linear programming
formulation originally proposed for solving the hose-model traffic routing problem. The main difference,
however, is that the link bandwidths in our problem must be discrete. This makes the problem much
more challenging. We show how to apply the developed theorems to tackle the problem of designing
bufferless NoCs (networks-on-chip). The proposed bufferless NoCs are deadlock/livelock-free and consume
significantly less power than their buffered counterparts. We also present a multi-slice technique to reduce
node capacity variations. This can make the proposed NoC architecture more cost efficient in a VLSI
(very large-scale integration) implementation. In addition, we offer a detailed delay/throughput performance
evaluation of the proposed bufferless NoC in the paper.
I. INTRODUCTION
Conventional nonblocking networks, such as crossbars and Clos networks, are designed for a centralized topology. However, new applications require networks with a distributed topology. One example is networks-on-chip (NoCs) [1]–[3], which often use a mesh or torus topology [4]–[6]. Most conventional NoCs adopt a buffered architecture, but deadlocks and the inability to handle adversarial traffic patterns are well-known problems in buffered NoCs [7]. A buffered NoC has another major drawback: buffers consume a large amount of power and silicon real estate [8]–[11]. The work in [8] showed that removing buffers in its proposed architecture could lower power consumption by 40% and reduce silicon area by 70%. A buffered architecture is also incompatible with emerging optical NoC technologies, such as silicon photonic rings [12], [13], which are bufferless by nature.
However, bufferless NoCs have their own problems. For example, one such architecture is based on deflection routing [8], [9], but in this approach, if the outgoing link is busy, the router simply sends the packet to any available output port. This will obviously degrade routing efficiency, and the throughput of the network may drop even when the input load increases. To prevent this from happening, sophisticated congestion control and routing arbitration schemes are required. Another bufferless NoC is based on a circuit/packet hybrid architecture [14]–[16], where best-effort traffic as well as circuit setup commands are sent through the packet network and quality-guaranteed traffic is transmitted through the circuit network. One problem with this architecture is that, when an intended circuit is not available, network reconfiguration is required, and this process obviously takes time. The frequency of reconfiguration depends on the degree of blocking of the network. If reconfiguration activities are frequent, the performance can be severely degraded.
In theory, all these problems associated with a bufferless architecture would be removed if the entire bufferless NoC could perform like a nonblocking crossbar switch. Under this condition, no traffic pattern could degrade the performance of the network, and no deadlock or livelock could happen. The only problem is that there is no theory available for the design
The associate editor coordinating the review of this manuscript and approving it for publication was Gian Domenico Licciardo.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
VOLUME 10, 2022 8399
B.-C. Lin, C.-T. Lea: Designing Nonblocking Networks With General Topology
implementing link capacity $c_e$ is linearly proportional to $c_e$.

$$\min \; C_{total} \tag{2a}$$
$$\text{s.t.} \quad \sum_{e \in \Gamma^{+}(v)} x_{ij}^{e} - \sum_{e \in \Gamma^{-}(v)} x_{ij}^{e} = 0, \quad \forall i, j, v \in V,\ v \neq i, j, \tag{2b}$$
$$\sum_{e \in \Gamma^{+}(v)} x_{ij}^{e} - \sum_{e \in \Gamma^{-}(v)} x_{ij}^{e} = 1, \quad \forall i, j \in V,\ v = i, \tag{2c}$$
$$\sum_{e \in \Gamma^{+}(v)} x_{ij}^{e} - \sum_{e \in \Gamma^{-}(v)} x_{ij}^{e} = -1, \quad \forall i, j \in V,\ v = j, \tag{2d}$$
$$\sum_{i, j \in V} t_{ij} \cdot x_{ij}^{e} \le c_e, \quad \forall e \in E,\ \forall \{t_{ij}\} \in T \in D, \tag{2e}$$
$$\sum_{e \in \Gamma^{+}(v)} c_e + \sum_{e \in \Gamma^{-}(v)} c_e \le M_{nd}, \tag{2f}$$
$$x_{ij}^{e},\ t_{ij} \in \{0, 1\}, \quad c_e \ge 0, \quad \forall e \in E,\ \forall i, j \in V. \tag{2g}$$

For a given V and E, (2b)-(2d) are the flow conservation constraints, and (2e) means that the total traffic routed through a link does not exceed the link capacity. In (2e), D is the set of all legal traffic matrices; that is, all traffic $\{t_{ij}\}$ is subject to the following constraints:

$$\sum_{j \in V} t_{ij} \le 1, \quad \forall i \in V. \tag{3a}$$
$$\sum_{i \in V} t_{ij} \le 1, \quad \forall j \in V. \tag{3b}$$

$M_{nd}$ in (2f) is given. Although this constraint is not required, we use it to limit the maximum capacity of each node. Equivalently, this can reduce the node capacity variance. To select the value of $M_{nd}$, we can set $M_{nd}$ to infinity in the beginning and solve the formulation. Based on the derived maximum node capacity, we can reduce $M_{nd}$ gradually and see its impact on the total cost and the node capacity variance. More discussion will be provided later, in Sec. IV.
The formulation of (2), however, cannot be solved directly because there are too many traffic matrices T satisfying (3), and they all need to be included in (2). In the following, we show how to tackle this problem.
Theorem 1: Let $t_{ij} \in [0, 1]$ (i.e., $t_{ij} \in \mathbb{R}$). This relaxation will not change the result of (2).
Proof:
(i) Assume $t_{ij} \in \{0, 1\}$. Let $x_{ij}^{e}$ denote the derived routing and $C_{min}$ the solution of (2).
(ii) Assume $t_{ij} \in [0, 1]$. Let $x_{ij}^{\prime e}$ be the derived routing and $C'_{min}$ the solution of (2).
(iii) Assume $t_{ij} \in [0, 1]$ and that the network uses routing $x_{ij}^{e}$ defined above. Let $C''_{min}$ be the minimum total capacity required to support all traffic patterns.
We are going to prove the following three statements:
(a) $C'_{min} \le C''_{min}$ is true.
(b) $C_{min} \le C'_{min}$ must hold.
(c) $C''_{min} \le C_{min}$ is true.
With (a)-(c) above, it is trivial to see that $C'_{min} = C_{min}$ must hold.
We now show that all three statements are true.
(a) This statement is trivially true because under $t_{ij} \in [0, 1]$, the optimal routing is $x_{ij}^{\prime e}$, not $x_{ij}^{e}$. Thus the derived $C''_{min}$ is not necessarily optimal, which means $C'_{min} \le C''_{min}$ will hold.
(b) Every valid traffic matrix in (2e) under $t_{ij} \in \{0, 1\}$ is also valid under $t_{ij} \in [0, 1]$, which means that the constraint equations under $t_{ij} \in \{0, 1\}$ are only a subset of the constraint equations under $t_{ij} \in [0, 1]$. Thus the result derived from the formulation under $t_{ij} \in [0, 1]$ cannot be better (i.e., cannot have a lower cost) than the result from the formulation under $t_{ij} \in \{0, 1\}$, which means $C_{min} \le C'_{min}$ must hold.
(c) Assume the network uses routing $x_{ij}^{e}$ derived from (i). Note that the network is intended to be a crossbar-like nonblocking network, and is a single-path network since $x_{ij}^{e} \in \{0, 1\}$. Let <p, q> denote the path from source node p to destination node q. Let $\Psi$ be the set of source nodes and $\Phi$ the set of destination nodes of all such paths <p, q> that pass through a given link e. Note that if $p \in \Psi$ and $q \in \Phi$, it does not imply that <p, q> must pass through e. For example, if <1, 3> and <2, 4> are all the paths passing through e, then $\Psi = \{1, 2\}$ and $\Phi = \{3, 4\}$. Obviously <2, 3> does not pass through e.
We now construct a bipartite graph $G'$ with $\Psi$ and $\Phi$. Let $p \in \Psi$ and $q \in \Phi$. We add an edge $l_{pq}$ between p and q if <p, q> passes through link e. Let $z_{pq}$ be the weight assigned to $l_{pq}$. A legal traffic pattern (i.e., a traffic matrix satisfying (3)) routed through e can be considered as a matching in the bipartite graph $G'$, and the amount of traffic in the pattern equals the total weight of the matching. Assume $z_{pq} \in \{0, 1\}$. The total weight can be calculated as

$$\max \sum_{p, q \in V} z_{pq} \tag{4a}$$
$$\text{s.t.} \quad \sum_{q \in \Phi} z_{pq} \le 1, \qquad \sum_{p \in \Psi} z_{pq} \le 1, \qquad z_{pq} \in \{0, 1\}.$$

In (iii), we assume the same routing $x_{ij}^{e}$ is used, but $z_{pq} \in [0, 1]$. Under this condition, a source node can be connected to multiple destination nodes simultaneously and vice versa. The maximum amount of traffic that can be routed through e can now be calculated through the following fractional matching formulation:

$$\max \sum_{p, q \in V} z_{pq} \tag{4b}$$
$$\text{s.t.} \quad \sum_{q \in \Phi} z_{pq} \le 1, \qquad \sum_{p \in \Psi} z_{pq} \le 1, \qquad z_{pq} \in [0, 1].$$

Comparing (4a) and (4b), we can see that (4a) is an integral matching formulation and (4b) a fractional matching formulation. From graph theory [22], in a bipartite graph the maximum integral matching equals the maximum fractional matching. Therefore, the link capacity of e derived from (i) can support any traffic pattern under the assumption of (iii). Thus, $C''_{min} \le C_{min}$ must hold.
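Step (c) of the proof relies on the fact that, in a bipartite graph, the maximum integral matching and the maximum fractional matching have the same value. The following Python sketch illustrates this on small path sets rather than proving it. It computes a maximum integral matching with augmenting paths and a minimum vertex cover by brute force. Any fractional matching satisfying the constraints of (4b) has total weight at most the size of a vertex cover, so when the two numbers agree, the fractional optimum of (4b) cannot exceed the integral optimum of (4a). The function names and the second and third edge lists are illustrative assumptions; only the first edge list comes from the example in the text.

```python
from itertools import combinations


def max_integral_matching(edges):
    """Size of a maximum matching in a bipartite graph (augmenting paths)."""
    adj = {}
    for p, q in edges:
        adj.setdefault(p, []).append(q)
    match_of_dest = {}  # destination q -> source p currently matched to it

    def try_augment(p, seen):
        for q in adj.get(p, []):
            if q in seen:
                continue
            seen.add(q)
            # q is free, or the source holding q can be re-matched elsewhere
            if q not in match_of_dest or try_augment(match_of_dest[q], seen):
                match_of_dest[q] = p
                return True
        return False

    return sum(try_augment(p, set()) for p in adj)


def min_vertex_cover_size(edges):
    """Brute-force minimum vertex cover; sources and destinations are tagged
    so that a node id may appear on both sides of the bipartite graph."""
    vertices = sorted({("src", p) for p, _ in edges} | {("dst", q) for _, q in edges})
    for k in range(len(vertices) + 1):
        for cover in combinations(vertices, k):
            chosen = set(cover)
            if all(("src", p) in chosen or ("dst", q) in chosen for p, q in edges):
                return k
    return len(vertices)


EXAMPLES = [
    [(1, 3), (2, 4)],          # paths <1,3> and <2,4> through e (from the text)
    [(1, 3), (1, 4), (2, 3)],  # hypothetical: source 1 and destination 3 are shared
    [(1, 3), (1, 4), (1, 5)],  # hypothetical: all paths start at source 1
]

for edges in EXAMPLES:
    # Integral optimum of (4a) equals the vertex-cover bound on (4b).
    assert max_integral_matching(edges) == min_vertex_cover_size(edges)
```

The brute-force cover search is exponential in the number of nodes, so it is only meant for checking small instances by hand.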
C. FINAL FORMULATION
With Theorem 1, we can assume $t_{ij} \in [0, 1]$ and derive the same result. Once we assume $t_{ij} \in [0, 1]$, we can use the technique described in [23] to remove the requirement of exhaustively listing T in (2).
Theorem 2: Assume $t_{ij} \in \mathbb{R}$ in (2). Routing $x_{ij}^{e}$ is feasible (i.e., satisfying (2e)) if and only if there exist non-negative $\pi_e(i)$ and $\lambda_e(j)$ for all $e \in E$ and $i, j \in V$ such that

$$\sum_{i \in V} \pi_e(i) + \sum_{j \in V} \lambda_e(j) \le c_e, \quad \forall e \in E, \tag{2e$'$}$$
$$x_{ij}^{e} \le \pi_e(i) + \lambda_e(j), \quad \forall i, j \in V,\ \forall e \in E. \tag{2e$''$}$$

Proof: The proof is similar to that in [23]. Thus, it is omitted here.
Note that in the above, $\pi_e(i)$ and $\lambda_e(j)$ are the auxiliary variables introduced by the duality transformation of the following linear programming (LP) formulation:

$$\max \sum_{i, j \in V} t_{ij} \cdot x_{ij}^{e}, \quad \forall e \in E$$
$$\text{s.t.} \quad \sum_{j \in V} t_{ij} \le 1, \quad \forall i \in V,$$
$$\sum_{i \in V} t_{ij} \le 1, \quad \forall j \in V.$$

With Theorem 2 and $\pi_e(i)$ and $\lambda_e(j)$, we derive the final mixed-integer LP formulation as follows:

$$\min \; C_{total} \tag{5a}$$
$$\text{s.t.} \quad \sum_{e \in \Gamma^{+}(v)} x_{ij}^{e} - \sum_{e \in \Gamma^{-}(v)} x_{ij}^{e} = 0, \quad \forall i, j, v \in V,\ v \neq i, j, \tag{5b}$$
$$\sum_{e \in \Gamma^{+}(v)} x_{ij}^{e} - \sum_{e \in \Gamma^{-}(v)} x_{ij}^{e} = 1, \quad \forall i, j \in V,\ v = i, \tag{5c}$$
$$\sum_{e \in \Gamma^{+}(v)} x_{ij}^{e} - \sum_{e \in \Gamma^{-}(v)} x_{ij}^{e} = -1, \quad \forall i, j \in V,\ v = j, \tag{5d}$$
$$\sum_{i \in V} \pi_e(i) + \sum_{j \in V} \lambda_e(j) \le c_e, \quad \forall e \in E, \tag{5e}$$
$$x_{ij}^{e} \le \pi_e(i) + \lambda_e(j), \quad \forall e \in E,\ \forall i, j \in V, \tag{5f}$$
$$\sum_{e \in \Gamma^{+}(v)} c_e + \sum_{e \in \Gamma^{-}(v)} c_e \le M_{nd}, \tag{5g}$$
$$x_{ij}^{e} \in \{0, 1\}, \quad \pi_e,\ \lambda_e,\ c_e \ge 0, \quad \forall e \in E,\ \forall i, j \in V. \tag{5h}$$

In (5), only $x_{ij}^{e}$ is discrete and the other variables are real.

III. APPLICATION OF NONBLOCKING THEORY: BUFFERLESS NoCs
In this section, we use bufferless NoCs as an example application of the nonblocking theory presented in Section II. Although the theory is for designing a network that performs like a crossbar switch, the bufferless NoC application, as shown below, is for a packet switch.

A. ROUTER NODE DESIGN
A packet in NoCs is usually divided into flits (flow control digits), which form the basic data-transmission and buffering units at the link layer. The term "phit" (physical unit), on the other hand, represents the number of bits that can be transferred in parallel in a single cycle. As with many NoC designs, we assume that the flit size is equal to the phit size. In other words, the entire flit is transmitted in one clock cycle.
If an NoC uses a buffered approach [2], [3], different flits can be held in different routers, with the head flit holding the routing information. Packets from different input ports may be headed for the same output port, and the head flit of a packet may be blocked in a router. When that happens, flow control signals will be sent to downstream routers, and the subsequent flits may be buffered in different routers.
However, the NoC proposed in this paper uses a bufferless approach. Each router only holds one flit. After one clock cycle, the entire packet moves one flit toward the destination. The entire NoC is nonblocking in the sense that, if a destination is free, a packet from a source station can definitely reach the destination node without being blocked inside the switch (more details are provided below).
The hardware implementation of the router node of our bufferless NoC is quite simple. It only comprises a crossbar and a controller. No virtual channels, flow control, deadlock prevention mechanisms, or adaptive routing are needed. Since more than 60% of NoCs adopt a mesh or torus topology, we focus on these two in our discussion below.
In a bufferless NoC, a router node of a 2-D mesh or torus topology has at most four bi-directional links plus one link connecting to the local processor. The link connecting to the local processor contains only one channel, but the other links may have multiple channels. Each channel consists of two sub-channels, DATA and ACK:
DATA: For sending data. The width of a channel is the same as the width of a flit.
ACK: A 1-bit sub-channel for sending the ACK signal back to the source. Its direction is the opposite of the DATA sub-channel.
The first bit of each flit is the VALID bit (Fig. 2b): 1 for valid flits and 0 for idle flits. A change from 0 to 1 in the VALID bit indicates that the first flit of a packet has arrived, and a change from 1 to 0 means the end of the current packet transmission.
There is a unique path between any two processors. The specification of the entire path is given in the head flit(s) of a packet. This style of routing is usually called source routing. It requires 2 bits to specify the routing direction at each node (north, south, east, west), and the "hop count" field indicates how many routers the packet needs to go through. The "hops passed" field indicates how many hops the packet has already passed through (this field is incremented by each router). When hop count = hops passed, the arriving packet
FIGURE 2. (a) The node design of a bufferless NoC. (b) Packet format.
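The source-routing fields described in Section III-A (2 bits of direction per node, a "hop count" field, and a "hops passed" field incremented by each router) can be sketched as follows. This is a minimal illustration, not the paper's packet format: the 2-bit direction codes, the bit ordering, the assumption of one direction entry per router, and the names `RouteHeader`, `encode_route`, and `forward` are all hypothetical. The sketch only reports when the route is exhausted and does not model what the router does with the packet at that point.

```python
from dataclasses import dataclass

# Hypothetical 2-bit codes; the text only states that 2 bits per node suffice.
DIRECTIONS = ("north", "south", "east", "west")


@dataclass(frozen=True)
class RouteHeader:
    hop_count: int    # number of routers the packet must go through
    hops_passed: int  # incremented by each router on the path
    route_bits: int   # packed 2-bit direction codes, first hop in the lowest bits


def encode_route(directions):
    """Pack a list of direction names into a header (assumed layout)."""
    bits = 0
    for hop, name in enumerate(directions):
        bits |= DIRECTIONS.index(name) << (2 * hop)
    return RouteHeader(hop_count=len(directions), hops_passed=0, route_bits=bits)


def route_exhausted(header):
    """True when hop count equals hops passed."""
    return header.hops_passed == header.hop_count


def forward(header):
    """Return (output direction, updated header) for one router traversal."""
    if route_exhausted(header):
        raise ValueError("route exhausted: hop count equals hops passed")
    code = (header.route_bits >> (2 * header.hops_passed)) & 0b11
    updated = RouteHeader(header.hop_count, header.hops_passed + 1, header.route_bits)
    return DIRECTIONS[code], updated


header = encode_route(["east", "east", "south"])
while not route_exhausted(header):
    direction, header = forward(header)
    print(direction, header.hops_passed)
```

Because each router only reads two bits and increments a counter, the per-hop decision needs no routing table, which is consistent with the simple crossbar-plus-controller node described above.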
$$\frac{1}{M} \sum_{i=1}^{M} \frac{n_i^2}{n_{max}^2}. \tag{6}$$

where M is the total number of nodes and $n_i^2$ represents the number of crosspoints required in node i. We call the formula of (6) the node capacity variance indicator: the larger the value, the smaller the variance.

B. MULTI-SLICE ARCHITECTURES
In a multi-slice architecture, we divide the network into K independent slices, and the width of each channel in a slice is only 1/K the width of the channel of a single-slice architecture. We then combine K independent slices into the final solution. The advantage of a multi-slice architecture is three-fold: (i) it reduces node capacity variance, (ii) it reduces the path set-up overhead, by a factor of K in a K-slice architecture, and (iii) it reduces the impact of a link failure (a single link failure only removes 1/K of the total bandwidth in a K-slice network).
To minimize the node capacity variance in a multi-slice NoC, how to combine the slices becomes important. Let A,
TABLE 2. Total link cost and node capacity variance of 1-slice 4 × 4 torus networks.
TABLE 3. Total link cost and node capacity variance of 4-slice 4 × 4 torus networks.
TABLE 4. Total link cost and node capacity variance of 4-slice 6 × 6 torus NoCs.
B, C, and D represent the four slices of the solution shown in Fig. 5a. Each slice is a torus, although the capacities of some links are zero. We can horizontally or vertically shift or rotate each slice and then combine them. The result is still a torus due to the symmetry possessed by a torus network.
Denote the shift and rotation operation of a slice by a two-tuple $(O_S, R_S)$, $S \in \{A, B, C, D\}$, where $O_S$ represents the shift and $R_S$ the clockwise rotation. For an n × n network, we use the two-tuple (c, r) with $0 \le c, r \le n - 1$ to represent $O_S$, where the network is shifted c columns to the right and r rows downward. $R_S$ can assume four values: 0°, 90°, 180°, or 270°. An example is given in Fig. 5, where the slice of a 4 × 4 torus network given in Fig. 5b is obtained by applying the operation ((2, 0), 0°) to the slice given in Fig. 5a. By changing the $(O_S, R_S)$ of each slice, we intend to find the locations of the four slices such that the node capacity variance is minimized (Fig. 5c). Note that the torus network given in Fig. 5c combines the four slices $S \in \{A, B, C, D\}$, each of which is obtained by changing the $(O_S, R_S)$ of the torus given in Fig. 5a, where $(O_A, R_A)$ = ((0, 0), 0°), $(O_B, R_B)$ = ((2, 0), 0°), $(O_C, R_C)$ = ((1, 2), 0°), and $(O_D, R_D)$ = ((3, 2), 0°).
For 8 × 8 networks, testing all possible locations of different slices can be too time consuming. Thus, we use a coarse-resolution approach to tackle this problem. For example, the x or y coordinates range from 0 to 7 in an 8 × 8 network. We can divide the x and y axes into four regions with boundary points at 0, 2, 4, 6, and 8. As a result, we only need to consider 16 square regions on the plane, each square containing four points of the original plane (see Fig. 6). We randomly select a point as the location of a slice in that region.

C. COMPARISON IN TERMS OF NODE CAPACITY VARIANCE
We first consider a single-slice architecture. Table 2 presents the results of a 2-D torus network (Fig. 4), where the total link cost is computed from (5) with $d_e = 1$, in which (5a) is replaced with (5a′), for a link between adjacent nodes and with $d_e = n - 1$ for a link between the first and the last node in an n × n torus. We can see that the node capacity variance derived by (5) is large. The total link cost of a nonblocking bufferless NoC in this case is even smaller than that of a conventional buffered NoC. Note that the node complexity of the former is much lower and its performance much better (see Sec. V).
For multi-slice architectures, the results of four-slice 4 × 4 and 6 × 6 NoCs are given in Tables 3 and 4, respectively, where the total link cost is computed from (5) with $d_e = 1$, in which (5a) is replaced with (5a′). A node in a multi-slice NoC contains multiple nodes of different slices. Although they are independent, we can still combine them into one crossbar switch (with some configuration logic inside, of course). An interesting result is provided by the 4 × 4 torus NoC. With $M_{nd} = 24$, all nodes in the bufferless 4-slice torus NoC (Fig. 5c) and the conventional buffered torus NoC (Fig. 4) have the same capacity (Fig. 7). Furthermore, the two networks have the same total link cost.

V. DELAY/THROUGHPUT PERFORMANCES
In addition to computing, telecom applications are gaining more importance for multi-core systems [24]. For such applications, a trace-based simulation, that is, a system simulation performed by looking at traces of program execution or system component access with the purpose of performance prediction, is meaningless. Therefore, we use traffic-pattern-based simulation to compare the performance of the proposed bufferless NoC with that of a conventional buffered NoC.
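The four synthetic patterns used in this comparison (uniform, bit complement, tornado, and transpose; see the captions of Figs. 8 and 9) are not defined in this section. The Python sketch below uses definitions that are common in the interconnection-network literature, which is an assumption about what the simulations use: bit complement inverts the bits of the node index, transpose swaps the x and y coordinates, and tornado shifts the x coordinate by ⌈k/2⌉ − 1 positions on a k × k torus (applied to the x dimension only here). The helper `is_permutation` checks that a deterministic pattern sends at most one flow to each destination, so the resulting traffic matrix satisfies (3) with $t_{ij} \in \{0, 1\}$.

```python
import math
import random


def bit_complement(node, num_nodes):
    """Destination index is the bitwise complement of the source index."""
    assert num_nodes & (num_nodes - 1) == 0, "num_nodes must be a power of two"
    return ~node & (num_nodes - 1)


def transpose(xy):
    """Node (x, y) sends to node (y, x)."""
    x, y = xy
    return (y, x)


def tornado(xy, k):
    """Shift x by ceil(k/2) - 1 positions around the ring (x dimension only)."""
    x, y = xy
    return ((x + math.ceil(k / 2) - 1) % k, y)


def uniform(xy, k, rng):
    """Uniformly random destination different from the source (requires k >= 2)."""
    while True:
        dest = (rng.randrange(k), rng.randrange(k))
        if dest != xy:
            return dest


def is_permutation(pattern, nodes):
    """True if the pattern maps the node set one-to-one onto itself."""
    return sorted(pattern(n) for n in nodes) == sorted(nodes)


k = 4
grid = [(x, y) for x in range(k) for y in range(k)]
rng = random.Random(0)  # fixed seed so that runs are repeatable
for src in grid[:3]:
    print(src, transpose(src), tornado(src, k), uniform(src, k, rng))
```

Permutation patterns such as these are admissible under (3), so a network designed with the formulation in (5) is expected to carry each of them without blocking.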
FIGURE 8. Delay performance of a 4 × 4 bufferless torus NoC (4-slice) and of a buffered torus NoC with XY routing: (a) uniform traffic, (b) bit complement traffic, (c) tornado traffic, and (d) transpose traffic. The delay unit is the average packet transmission time in a buffered NoC.
FIGURE 9. Delay performance of 8 × 8 bufferless torus NoC (4-slice) and of the conventional 8 × 8 buffered torus NoC with XY routing: (a) uniform traffic, (b) bit complement traffic, (c) tornado traffic, and (d) transpose traffic. The delay unit is the average packet transmission time in a buffered NoC.
of a bufferless node, these figures clearly demonstrate the advantage of the new approach.

VI. CONCLUSION
In this paper, we have presented the theory for the construction of nonblocking networks with a general topology. We use bufferless NoCs to demonstrate an application of the new theory. The presented bufferless NoC is nonblocking, deadlock-free, livelock-free, and congestion-free. It has a simple node design and a throughput close to 100%. Its performance is traffic pattern independent, which greatly simplifies task mapping for a multi-core system. The bufferless nature of the new architecture can achieve great savings in terms of power consumption and silicon area.
The developed theorems can also be applied to a wavelength selective switch (WSS)-based optical network, which also adopts a distributed topology. A workstation can use different wavelengths for different destinations. But such a WSS network is often a blocking network, and its routing and wavelength assignment (RWA) problem is NP-hard. Using the theorems developed in the paper, we can derive a nonblocking WSS-based network in which the RWA problem is greatly simplified.

ACKNOWLEDGMENT
The author Bey-Chi Lin would like to thank Prof. Zhe XuanYuan of UIC and Dr. Vincent Wing-Hei Luk for giving suggestions for the programming.

REFERENCES
[1] W. J. Dally and B. Towles, "Route packets, not wires: On-chip interconnection networks," in Proc. Design Autom., New York, NY, USA, 2001, pp. 684–689.
[2] T. Bjerregaard and S. Mahadevan, "A survey of research and practices of network-on-chip," ACM Comput. Surv., vol. 38, no. 1, pp. 1–51, 2006.
[3] S. Hesham, J. Rettkowski, D. Goehringer, and M. A. A. El Ghany, "Survey on real-time networks-on-chip," IEEE Trans. Parallel Distrib. Syst., vol. 28, no. 5, pp. 1500–1517, May 2017.
[4] S. Kumar, A. Jantsch, J.-P. Soininen, M. Forsell, M. Millberg, J. Oberg, K. Tiensyrja, and A. Hemani, "A network on chip architecture and design methodology," in Proc. IEEE Comput. Soc. Annu. Symp. VLSI. New Paradigms, Oct. 2002, pp. 117–124.
[5] P. Guerrier and A. Greiner, "A generic architecture for on-chip packet-switched interconnections," in Proc. Design, Autom. Test Eur. Conf. Exhib., 2000, pp. 250–256.
[6] P. P. Pande, C. Grecu, M. Jones, A. Ivanov, and R. Saleh, "Performance evaluation and design trade-offs for network-on-chip interconnect architectures," IEEE Trans. Comput., vol. 54, no. 8, pp. 1025–1040, Aug. 2005.
[7] W. Dally and B. Towles, Principles and Practices of Interconnection Networks. Burlington, MA, USA: Morgan Kaufmann, 2004.
[8] T. Moscibroda and O. Mutlu, "A case for bufferless routing in on-chip networks," in Proc. ISCA, Austin, TX, USA, 2009, pp. 196–207.
[9] G. Nychis, C. Fallin, T. Moscibroda, S. Seshan, and O. Mutlu, "Congestion control for scalability in bufferless on-chip networks," SAFARI, Macworld U.K., Tech. Rep. 003–2011, Jul. 2011.
[10] A. Kodi, A. Louri, and J. Wang, "Design of energy-efficient channel buffers with router bypassing for network-on-chips (NoCs)," in Proc. 10th Int. Symp. Qual. Electron. Design, 2009, pp. 826–832.
[11] E. Ofori-Attah and M. O. Agyeman, "A survey of recent contributions on low power NoC architectures," in Proc. Comput. Conf., London, U.K., 2017, pp. 1086–1090.
[12] C.-T. Lea and B.-C. Lin, "A new approach to the wavelength nonuniformity problem of silicon photonic microrings," J. Lightw. Technol., vol. 29, no. 17, pp. 2552–2559, Sep. 14, 2011.
[13] P. Guo, W. Hou, L. Guo, Q. Yang, Y. Ge, and H. Liang, "Low insertion loss and non-blocking microring-based optical router for 3D optical network-on-chip," IEEE Photon. J., vol. 10, no. 2, Apr. 2018, Art. no. 7901010.
[14] P. T. Wolkotte, G. J. M. Smit, G. K. Rauwerda, and L. T. Smit, "An energy-efficient reconfigurable circuit-switched network-on-chip," in Proc. 19th IEEE Int. Parallel Distrib. Process. Symp., Apr. 2005, pp. 1–8.
[15] N. Enright Jerger, M. Lipasti, and L.-S. Peh, "Circuit-switched coherence," IEEE Comput. Archit. Lett., vol. 6, no. 1, pp. 5–8, Jan. 2007.
[16] K. Goossens, J. Dielissen, and A. Radulescu, "Æthereal network on chip: Concepts, architectures, and implementations," IEEE Des. Test. Comput., vol. 22, no. 5, pp. 414–421, Sep./Oct. 2005.
[17] E. Salminen, A. Kulmala, and T. Hamalainen, "Survey of network-on-chip proposals," in Proc. OCP-IP, 2008, pp. 1–13.
[18] N. G. Duffield, P. Goyal, A. Greenberg, P. Mishra, K. K. Ramakrishnan, and J. E. V. der Merwe, "A flexible model for resource management in virtual private networks," in Proc. ACM SIGCOMM, San Diego, CA, USA, Aug. 1999, pp. 95–108.
[19] J. Chu and C.-T. Lea, "New architecture and algorithms for hose-model VPN construction," IEEE Trans. Netw., vol. 16, no. 3, pp. 670–679, 2008.
[20] Y. Xuan and C.-T. Lea, "Network-coding multicast networks with QoS guarantees," IEEE/ACM Trans. Netw., vol. 19, no. 1, pp. 265–274, Feb. 2011.
[21] J. A. Fingerhut, S. Suri, and J. S. Turner, "Designing least-cost nonblocking broadband networks," J. Algorithms, vol. 24, no. 2, pp. 287–309, Aug. 1997.
[22] A. Schrijver, Combinatorial Optimization: Polyhedra and Efficiency. Berlin, Germany: Springer, 2003.
[23] D. Applegate and E. Cohen, "Making intra-domain routing robust to changing and uncertain traffic demands: Understanding fundamental tradeoffs," in Proc. ACM SIGCOMM, Karlsruhe, Germany, 2003, pp. 313–324.
[24] Proceeding of 2013 Data Center Conference. Accessed: Feb. 5, 2003. [Online]. Available: http://www.linleygroup.com/events/event.php?num=21
[25] J. Lee, C. Nicopoulos, S. J. Park, M. Swaminathan, and J. Kim, "Do we need wide flits in networks-on-chip?" in Proc. IEEE Comput. Soc. Annu. Symp., Aug. 2013, pp. 1–7.
[26] E. Bolotin, I. Cidon, R. Ginosar, and A. Kolodny, "Cost considerations in network on chip," Integration, vol. 38, no. 1, pp. 19–42, Oct. 2004.
[27] A. Singh, W. J. Dally, A. K. Gupta, and B. Towles, "GOAL: A load-balanced adaptive routing algorithm for torus networks," ACM SIGARCH Comput. Archit. News, vol. 32, no. 2, pp. 194–205, 2003.

BEY-CHI LIN received the B.S. degree in mathematics education from National Taichung Teachers College, Taiwan, in 1999, and the Ph.D. degree in applied mathematics from National Chiao Tung University, Taiwan, in 2005. She was a Research Assistant with the Department of ECE, The Hong Kong University of Science and Technology, from 2005 to 2008. She is currently a Professor with the Department of Applied Math, National Tainan University, Taiwan. Her research interests include switching network analysis and combinatorial mathematics.

CHIN-TAU LEA received the B.S. and M.S. degrees in electrical engineering from National Taiwan University, in 1976 and 1978, respectively, and the Ph.D. degree in electrical engineering from the University of Washington, Seattle, in 1982. He is a Professor Emeritus with The Hong Kong University of Science and Technology, in 1996. Prior to that, he was with AT&T Bell Labs, from 1982 to 1985 and with the Georgia Institute of Technology, from 1985 to 1995. His research interests include the general area of switching and networking.