
2010 2nd International Conference on Industrial and Information Systems

The design and implementation of arbiters for Network-on-chips

Zhizhou Fu, Xiang Ling

National Key Laboratory of Science and Technology on Communications of UESTC
Chengdu, China
Email: zhizhoufu@163.com, xiangling@uestc.edu.cn

Abstract-Round-robin and matrix arbiters are widely used in networks-on-chip (NoCs). Both mechanisms are implemented in this paper, and their performance in a 2D-mesh topology is tested on an FPGA platform. The resource consumption and throughput of the round-robin arbiter and the matrix arbiter are compared. The experimental results show that the matrix arbiter achieves higher throughput than the round-robin arbiter, whereas the round-robin arbiter uses far fewer resources. A tradeoff between the two mechanisms should therefore be considered when designing NoC arbiters.

Keywords-Network-on-chips, round robin arbiter, matrix arbiter

978-1-4244-8217-7/10/$26.00 ©2010 IEEE    IIS 2010

I. INTRODUCTION

As the area and speed achievable on a single chip face growing challenges, more and more processing elements are being placed on a system-on-chip (SoC). Network-on-chip (NoC) is a new approach to on-chip communication that addresses these challenges. The physical interconnect on a chip has become a primary factor limiting performance and power consumption. As the switching speed of crossbar switches increases rapidly, one major problem is to implement a fast and fair arbiter that maximizes switch throughput and timing performance for NoCs.

Compared with traditional bus-based SoCs, NoC has advantages in architecture, performance, reusability and scalability. In an NoC implementation, each basic module needs to be designed carefully to guarantee low latency and high throughput. Among these basic modules, the flow control of the virtual channels plays an important role in alleviating packet congestion. The architecture and the flow control scheme (such as Ack/Nack or credit-based) significantly affect the design of the NoC arbiter. The arbitration should guarantee fairness in scheduling, avoid starvation, and provide high throughput [1]. The primary arbiter can be used to construct the virtual channel allocator and the crossbar switch allocator [3].

NoC switches should provide a high-speed and cost-effective contention resolution scheme when multiple packets from different input ports compete for the same output port. A fast arbiter is one of the most dominant factors for high-performance NoC switches [4]. For these reasons, analyzing and comparing the performance of arbiters is of great significance in the design of NoCs. In this paper, the behaviors of two popular arbiters, the round-robin arbiter and the matrix arbiter, are analyzed. In [1], a High-Speed and Decentralized Round-robin Arbiter (HDRA) is presented; the authors analyze its area and critical path and compare the proposal with the round-robin arbiter. In this paper we compare the behavior of the matrix arbiter with that of the round-robin arbiter.

Fig. 1 shows the block diagram of an N x N switch. Each input port contains N virtual channels (VCs) to avoid head-of-line blocking. The task of the arbiter is to decide a set of contention-free connections between input and output ports. For high-performance switching, actions such as packet arrival, scheduling and switching, and departure are performed in a pipelined way [5].

Fig. 1. The block diagram of an N x N switch

In addition to a crossbar switch fabric, the internal structure of an N x N network switch consists of VCs and arbiters (there may be additional hardware components, such as memory at the input port, in case VC overflows occur).

This paper is organized as follows. Section II introduces the round-robin arbiter and its implementation on FPGA. In Section III, the matrix arbiter and its gate-level architecture are described and implemented. Section IV analyzes the performance and cost of these arbiters. Finally, conclusions are drawn in Section V.

II. THE MECHANISM OF ROUND-ROBIN ARBITER

The canonical architecture of a router used in an NoC is shown in Figure 2. The number of input ports in the router varies with the interconnect architecture, but the basic architecture of a router port remains the same. The arbiter is one of the key components of an NoC router. Flits may arrive simultaneously at more than one virtual channel, so an arbitration mechanism is necessary to allow only one virtual channel to access a single physical port at a time. Candidate arbitration schemes for the router include round-robin, matrix and other arbitration mechanisms. In this paper we consider only round-robin and matrix arbitration, and emulate them on an FPGA platform.
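As a concrete reference for the round-robin scheme named above, the following is a minimal behavioral sketch in Python, not the authors' gate-level design. It assumes the usual rotating-priority rule, in which the requester just served becomes lowest priority and the priority is unchanged when nothing is requested. The names `round_robin_arbiter`, `simulate` and the `pointer` state variable are illustrative assumptions.

```python
def round_robin_arbiter(requests, pointer):
    """Run one arbitration cycle of a behavioral round-robin arbiter.

    requests: list of bools, one per request input.
    pointer:  index of the input that currently has the highest priority
              (hypothetical state variable, standing in for the priority vector).
    Returns (granted index or None, pointer for the next cycle).
    """
    n = len(requests)
    for offset in range(n):
        idx = (pointer + offset) % n
        if requests[idx]:
            # The winner becomes lowest priority next cycle, so the input
            # right after it receives the highest priority.
            return idx, (idx + 1) % n
    # No request asserted: the priority is left unchanged.
    return None, pointer


def simulate(request_trace, n):
    """Apply a sequence of request vectors and collect the grants."""
    pointer = 0  # assumption: input 0 starts with the highest priority
    grants = []
    for requests in request_trace:
        assert len(requests) == n
        grant, pointer = round_robin_arbiter(requests, pointer)
        grants.append(grant)
    return grants


if __name__ == "__main__":
    # Three inputs that all request continuously are served in turn.
    print(simulate([[True, True, True]] * 4, 3))
```

With three inputs requesting in every cycle, the grants rotate through inputs 0, 1, 2 and back to 0, so no request waits more than n - 1 cycles.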

Fig. 2. General router architecture (input buffer, router, arbiter, output buffer)

Whenever a resource such as a buffer, a channel, or a switch port is contended for by several requests, an arbiter is required to assign access to the resource to one request at a time. For example, an n-input arbiter can arbitrate the use of an input port among the virtual channels (VCs) connected to it. Each virtual channel that has a flit to send requests access to the input port by asserting its request line. Assume there are 8 virtual channels, labeled VC1, VC2, ..., VC8. If VC1, VC3 and VC6 request the resource simultaneously and VC3 has higher priority than the other two, the resource is allocated to VC3; VC1 and VC6 wait and reissue their requests in the next clock cycle.

When many input ports request the same output or resource in a given period, the arbiter resolves the priorities among the request inputs. The arbiter releases the output port connected to the crossbar once the last flit of the packet has been transmitted, so that other waiting packets can obtain the output through arbitration. A round-robin arbiter operates on the principle that a request which was just served should have the lowest priority on the next round of arbitration. This can be accomplished by generating the next priority vector p from the current grant vector g.

Figure 3 shows the gate-level architecture of a round-robin arbiter that implements 3-bit rotating arbitration. A round-robin arbiter gives the last winning request the lowest priority for the next round of arbitration. When there are no requests, the priority is unchanged. If a resource is allocated in the current clock cycle, one bit $g_i$ of the grant vector is high, which sets the corresponding bit $p_{i+1}$ of the priority vector to '1' in the next clock period. The next request input therefore receives the highest priority in the next clock cycle, while the request that was just granted the resource has the lowest. In the rotating arbitration mechanism, the priority vector for the next round of arbitration, Prio, thus depends on the grant vector gnt of the request that was just served.

Fig. 3. The architecture of the round-robin arbiter

In NoC design, when an arbiter with fixed priority is used, there is no bound on the waiting time of a request with lower priority. In a round-robin arbiter, the request input that obtains the resource then has the lowest priority, so the round-robin arbiter provides strong fairness.

III. THE MECHANISM OF MATRIX-ARBITER

A matrix arbiter adopts an N x N matrix and implements a least-recently-served priority scheme by maintaining a triangular array of state bits $w_{ij}$ for all $i < j$. The element $w_{ij}$ in row i and column j indicates that request i takes priority over request j: if request input i has higher priority than request input j, $w_{ij}$ is set to 1; otherwise it is 0. Only the upper triangular portion of the matrix needs to be maintained, since $w_{ij} = \overline{w_{ji}}$ for $i \neq j$. When request i has higher priority than all other requests, it obtains the resource, such as an output port or a virtual channel. Each time a request i is granted, it clears all bits in its row, $w_{i,*}$, which means that every other request now has higher priority than request i, and sets all bits in its column, $w_{*,i}$. Request i thereby gives itself the lowest priority, since it was the most recently served. The priority between request i and itself has no physical meaning, so the diagonal elements $w_{ii}$ are marked by X in the matrix of Figure 4.

$$
W = \begin{bmatrix}
X & w_{1,2} & w_{1,3} & w_{1,4} \\
w_{2,1} & X & w_{2,3} & w_{2,4} \\
w_{3,1} & w_{3,2} & X & w_{3,4} \\
w_{4,1} & w_{4,2} & w_{4,3} & X
\end{bmatrix}
$$

Fig. 4. The priority matrix

Figure 5 depicts an example. Before the arbiter executes arbitration, request 3 has the highest priority and request 4 has the lowest. When request 3 and request 4 contend for the same resource, request 3 obtains it because it has higher priority. After the arbitration, the elements [3, *] are cleared to 0 and the elements [*, 3] are set to 1, as shown in Figure 5.
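The row-clear and column-set update described above can be sketched as a small behavioral model in Python. This is a minimal illustration under stated assumptions, not the authors' S-R latch implementation: requests are numbered from 0 (so the paper's request 3 is index 2), the full matrix is stored rather than only the upper triangle, and the class name `MatrixArbiter` and its `order` argument are hypothetical.

```python
class MatrixArbiter:
    """Behavioral model of an n-input matrix (least recently served) arbiter."""

    def __init__(self, order):
        # order lists the request indices from highest to lowest initial
        # priority. Deriving the matrix from a permutation is an assumption
        # made here so that the initial state is always a legal one.
        self.n = len(order)
        rank = {req: pos for pos, req in enumerate(order)}
        # w[i][j] is True when request i takes priority over request j.
        self.w = [[i != j and rank[i] < rank[j] for j in range(self.n)]
                  for i in range(self.n)]

    def arbitrate(self, requests):
        """Grant at most one asserted request and update the matrix."""
        for i in range(self.n):
            if not requests[i]:
                continue
            # Request i is blocked if another asserted request outranks it.
            blocked = any(requests[j] and self.w[j][i]
                          for j in range(self.n) if j != i)
            if not blocked:
                for j in range(self.n):
                    if j != i:
                        self.w[i][j] = False  # clear row i
                        self.w[j][i] = True   # set column i
                return i
        return None


if __name__ == "__main__":
    # Request 3 (index 2) highest, request 4 (index 3) lowest; the relative
    # order of requests 1 and 2 is not given in the text and is assumed.
    arbiter = MatrixArbiter([2, 0, 1, 3])
    contend = [False, False, True, True]  # requests 3 and 4 contend
    print(arbiter.arbitrate(contend), arbiter.arbitrate(contend))
```

In this sketch the first arbitration grants index 2 (request 3) and moves it to the lowest priority, so the second arbitration with the same requests grants index 3 (request 4).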
Fig. 5. The state transition after arbitration

Figure 6 shows the four-input gate-level architecture of a matrix arbiter. In the figure, each block drawn with a dotted line represents an S-R latch, and the state is maintained in the six S-R latches shown as dotted blocks in the upper triangular portion of the matrix. Each dotted block in the lower triangular portion of the matrix represents the complementary output of the diagonally symmetric box.

Fig. 6. The gate architecture of the matrix arbiter (inputs Req1, Req2, ...)

Suppose the initial states in the upper triangle of the matrix arbiter are set to 1; by complementarity, the elements in the lower triangle are 0. In this initial matrix, request 1 has the highest priority, followed by request 2 and request 3, and request 4 has the lowest priority. If the request vector is 4'b0110, inputs 2 and 3 request the resource simultaneously. In every clock cycle, only one request is granted according to the priority matrix. After each decision, the priority matrix is updated from the previous priority matrix and the current grant vector. Figure 7 shows the analysis: with the input request vector Req = 4'b0110, the grant vector gnt obtained from the priority matrix shows that ports 2, 3, 2, 3 obtain the resource successively.

Req = 4'b0110
gnt = 0100 → 0010 → 0100 → 0010

Fig. 7. The changes of the priority matrix after receiving requests

To operate effectively, the matrix arbiter must be initialized to a legal state. Not all states of the array are meaningful: for an arbiter with n input ports there are $2^{n(n-1)/2}$ states, and only $n!$ of them correspond to a permutation of the inputs. If the initial state is illegal, the arbiter can easily deadlock. For example, if $w_{01} = w_{12} = 1$ and $w_{02} = 0$, then when inputs 0, 1 and 2 issue requests simultaneously, the requests disable one another and no grant is issued. The initial state must therefore be chosen carefully in the design.

IV. EMULATION RESULTS AND COMPARISONS

The round-robin arbiter and the matrix arbiter are emulated on an FPGA platform. We vary the number of request inputs, that is, the length of the request vector, and collect statistics on the resource utilization, maximum clock frequency and power consumption of the two arbitration mechanisms. When packets from the virtual channels of the inputs request the crossbar switch simultaneously, the number of request inputs of the arbiter increases. For example, with 5 inputs, each with 6 virtual channels, the arbiter needs 5 x 6 = 30 request inputs to resolve the contention when packets from the virtual channels are transferred to the crossbar.

Fig. 8. The slices used by the two mechanisms (x-axis: the request inputs of arbiter; series: Matrix-arbiter, Round-robin)

Figure 8 shows that the matrix arbiter and the round-robin arbiter consume similar resources when there are few requests, about 100 slices each. However, in a 3D NoC the number of VCs in a router can exceed 30. As the number of input requests increases, the matrix arbiter consumes far more resources, whereas the round-robin arbiter does not. When the number of request inputs approaches 32, the matrix arbiter uses 1003 slices, while the round-robin arbiter uses just 98 slices. The designer of an NoC arbiter should therefore choose sensibly between the two arbitration mechanisms. Furthermore, we have synthesized the designs with Design Compiler, and the silicon area of the NoC chip follows the same trend.

Figure 9 describes how the maximum clock frequency changes as the number of inputs increases for the two arbiters. The matrix arbiter reaches a higher clock frequency than the round-robin arbiter, although it consumes more resources. When the number of input ports is 4, the two mechanisms can reach the same clock
frequency, approximately 547 MHz. In most other cases, the maximum clock frequency of the matrix arbiter is far higher than that of the round-robin arbiter, which means that the matrix arbiter has higher throughput and faster computation. In particular, when there are 7 request inputs, the former is 1.4 times the latter. The maximum clock frequency declines as the number of request inputs increases in both mechanisms. The matrix arbiter achieves high-speed computation at the cost of considerably more resources and silicon area.

Fig. 9. The throughput changes with different request inputs (x-axis: the request inputs of arbiter)

Finally, we analyze and compare the power consumption of the two mechanisms. Figure 10 shows that the power consumption increases with the number of inputs. The round-robin arbiter consumes less power than the matrix arbiter, and the difference is nearly 1 mW. In arbiter design, a trade-off must be made among resource usage or silicon area, maximum clock frequency and power consumption, and a suitable arbitration mechanism chosen accordingly.

Fig. 10. The power analysis of the two mechanisms (x-axis: the request inputs of arbiter; series: Matrix-arbiter, Round-robin)

V. CONCLUSION

Several arbitration mechanisms can be chosen when designing an NoC, such as the round-robin arbiter and the matrix arbiter. However, the related literature lacks analysis of them, especially of their resource usage, performance and power consumption. In this paper, the two mechanisms are emulated on an FPGA platform. Comparing them is meaningful for arbiter design. The matrix arbiter consumes more resources but can reach a higher clock frequency, which means that it can process data more quickly. In terms of power consumption, the two are similar.

In future work, we will analyze the queuing arbiter and the fixed-priority arbiter. In NoC design, the primary arbiter will be studied further in order to design more complex virtual channel allocators.

References

[1] Yun-Lung Lee, Jer Min Jou and Yen-Yu Chen, "A High-Speed and Decentralized Arbiter Design for NoC," pp. 350-353.
[2] Gao Xiaopeng, Zhang Zhe and Long Xiang, "Round Robin Arbiters for Virtual Channel Router," IMACS Multiconference on "Computational Engineering in Systems Applications", pp. 1610-1614.
[3] Li-Shiuan Peh and William J. Dally, "A Delay Model and Speculative Architecture for Pipelined Routers," the 7th International Symposium on High-Performance Computer Architecture, pp. 255-266.
[4] Eung S. Shin, Vincent J. Mooney III and George F. Riley, "Round-robin Arbiter Design and Generation," ISSS'02.
[5] H. J. Chao, C. H. Lam and E. Oki, Broadband Packet Switching Technologies, John Wiley & Sons, Inc., 2001.
[6] Chang Wu, Yubai Li, Song Chai and Zhongming Yang, "Lottery Router: A Customized Arbitral Priority NOC Router," 2008 International Conference on Computer Science and Software Engineering, pp. 411-414.
[7] Jongman Kim and Chrysostomos Nicopoulos, "A Gracefully Degrading and Energy-Efficient Modular Router Architecture for On-Chip Networks," Proceedings of the 33rd International Symposium on Computer Architecture (ISCA'06).
[8] Robert Mullins, Andrew West and Simon Moore, "Low-Latency Virtual-Channel Routers for On-Chip Networks," Proceedings of the 31st Annual International Symposium on Computer Architecture, pp. 1-10.
[9] L. Benini and G. Micheli, "Networks on Chips: A New SoC Paradigm," Computer Magazine, vol. 35, issue 1, pp. 70-78, Jan. 2002.
[10] S. Q. Zheng and Mei Yang, "Algorithm-Hardware Codesign of Fast Parallel Round-Robin Arbiters," IEEE Transactions on Parallel and Distributed Systems, vol. 18, issue 1, pp. 84-95, Jan. 2007.