
A Practical NoC Design for Parallel DES Computation

R. Yuan, S.-J. Ruan and J. Götze


National Taiwan University of Science and Technology
Low-Power Systems Lab, Taipei, Taiwan
E-Mails: {D9902102, sjruan}@mail.ntust.edu.tw
TU Dortmund
Information Processing Lab, Dortmund, Germany
E-Mail: juergen.goetze@tu-dortmund.de
Abstract: The Network-on-Chip (NoC) is considered a new SoC paradigm for the next generation of systems supporting a large number of processing cores. Combining a NoC with homogeneous processors to construct a Multi-Core NoC (MCNoC) is one way to achieve high computational throughput for specific purposes such as cryptography. Many studies use cryptographic standards for performance demonstration but rarely discuss a NoC suited to such a standard. The goal of this paper is to present a practical methodology, without complicated virtual-channel or pipeline technologies, that provides high-throughput Data Encryption Standard (DES) computation on FPGA. The results show that a mesh-based NoC whose packet and Processing Element (PE) design follows the DES specification achieves considerable performance gains over previous works. Moreover, the deterministic XY routing algorithm proves competitive in a high-throughput NoC, and the West-First routing offers the best performance among the Turn-Model routings, representatives of adaptive routing.

I. INTRODUCTION
Advantages of the Network-on-Chip (NoC) over the traditional bus-based architecture have been reported in many studies. The NoC architecture offers both scalability and flexibility, so it can be organized to run homogeneous cores in parallel and improve performance for specific purposes [1]. Such an approach makes the NoC a suitable way to realize a high-throughput computational system on FPGA.
Data encryption/decryption is a computational workload often implemented for performance demonstration. The characteristics of a cipher affect the choice of flit size for routing, the packet size for traffic communication, and the architecture of the Processing Element (PE). Together with today's widespread demand for data protection, a high-performance NoC specific to cryptography deserves careful analysis.
Our work realizes a 5×5 2-D mesh with virtual cut-through (VCT) switching, running 25 Data Encryption Standard (DES) computations in parallel. The goal of this paper is to evaluate the throughput of a heavily loaded NoC. The main contribution is the performance verification of MCNoC architectures for parallel DES computation. Our results indicate that the proposed work achieves a considerable speedup over previous works.
This paper is organized as follows: Section II describes related work on DES over other NoC systems. Section III introduces the proposed architecture. Section IV describes the configuration of the proposed MCNoC, including the packet format, routing algorithm, flow control and PE architecture. Section V describes the experimental methodology and presents the results. The last section concludes the paper.
II. RELATED WORKS
Some NoC proposals use soft-core processors, such as MicroBlaze, or the Networked Processor Array (NePA) as processing elements to implement DES computations [2], [3], [4]. These processors provide far more functionality than plain DES needs, so adding cores becomes costly because of the dramatic increase in complexity and traffic load, resulting in limited performance improvement.
The work in [5] realizes a single DES encryption engine used sporadically in the network for brute-force testing. Its performance is not fully exploited because the architecture is essentially not designed with high throughput in mind.
This paper presents a practical MCNoC for parallel DES processing to meet a high-throughput demand. The proposed MCNoC keeps all boundary ports open to other resources for high-throughput purposes; this sharing scheme has been applied in state-of-the-art commercial NoC chips such as the Tilera TILE64. Without complicated pipeline or virtual-channel designs, the routing and flow-control components can be kept simple, so the NoC is low-cost and consumes little power.
III. ARCHITECTURE OF MCNOC FOR DES
The MCNoC is a specific NoC whose parallel computing power can be shared by the multiple components connected to it. A typical architecture of a 5×5 NoC is shown in Fig. 1. Tiles numbered 11, 12, 13, 14, 15, 21, 25, 31, 35, 41, 45, 51, 52, 53, 54, and 55 are boundary tiles. Each boundary tile has either one or two sides connecting to external resources rather than to switches. These external resources can be packet generators, receivers, or both; they are represented as tiles drawn with dotted lines, called terminal tiles, and numbered 01, 02, 03, 04, 05, 10, 16, 20, 26, 30, 36, 40, 50, 56, 61, 62, 63, 64, and 65. Terminal tiles are dummy tiles, so no PE is connected to them. N, E, W and S represent North, East, West and South, respectively. The remaining tiles are normal tiles without any specific name. Every tile except a terminal tile is composed of one router and one PE.

Fig. 1: A 5×5 mesh-based MCNoC

A terminal tile injects packets into only one boundary tile. The boundary tile accepts a packet if its PE is available; otherwise it routes the packet to a neighboring tile according to its routing algorithm. The packet is routed until an available PE is found. When the DES operation is finished, the done bit of the packet is set HIGH and the packet starts to be routed back towards the terminal tile it came from.
Unlike other works that use end-to-end traffic or only a few input nodes as packet-injection points, our work takes input traffic from North, East, West and South and returns packets in all directions. To keep the NoC low-cost and scalable, it is constructed according to the following principles:
1) No virtual channel or pipeline is used.
2) All PEs are designed for data processing only.
3) The architectures of all tiles are identical.
4) An unfinished packet must not leave the NoC.
The third and fourth principles are the difficult ones. Unlike boundary tiles, tiles at the center or at certain other locations do not need criteria for blocking packets. Applying these controls adds two extra clock states to the routing decision in each traffic direction, and 9 out of 25 tiles are affected in a 5×5 NoC. For the overall flexibility and scalability of the MCNoC, we consider this overhead acceptable.

IV. NETWORK ON CHIP CONFIGURATIONS

A. Packet Format
A packet is transferred as five consecutive flits, each 32 bits long, as shown in Fig. 2. The first flit is the Header flit, which stores information for routing decisions and PE computation. The source and terminal values hold the coordinates of the packet generator and receiver, respectively. The destination number always stores the coordinate of one boundary tile. The two-bit done signal indicates whether the data flit carries the final data of the DES operation. The iteration number tells which stage of the DES operation this packet is in. Finally, the serial number is used as a tracking number, and bit 7 is reserved.
The Key flit, composed of the second and third flits, stores the complete 64-bit initial key needed by the DES operation, e.g. for subkey generation according to the iteration number. The fourth and fifth flits are grouped as the Data flit, which stores the plaintext at the beginning, the intermediate cipher during DES processing, and the final ciphertext when DES encryption is done. These Header, Key and Data flits are referred to repeatedly in the following sections, and their relationship will be clarified there.

Fig. 2: Packet flits
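To make the format concrete, the following minimal sketch (Python) packs and unpacks a 32-bit Header flit. The paper does not give exact bit positions, so the field widths and the integer encoding of tile coordinates below are assumptions chosen only to make the example self-contained; they are not the authors' actual layout.

# Assumed header layout: field widths sum to one 32-bit flit.
FIELDS = [
    ("source",      6),   # coordinates of the packet generator (terminal tile)
    ("terminal",    6),   # coordinates of the receiving terminal tile
    ("destination", 6),   # coordinates of the target boundary tile
    ("done",        2),   # set when the DES result is ready to leave the NoC
    ("iteration",   5),   # current DES stage (0..16)
    ("serial",      6),   # tracking number of the packet
    ("reserved",    1),   # the paper reserves one bit
]
assert sum(w for _, w in FIELDS) == 32

def pack_header(values):
    """Pack a dict of field values into a 32-bit header flit."""
    flit, offset = 0, 0
    for name, width in FIELDS:
        v = values[name]
        assert 0 <= v < (1 << width), f"{name} out of range"
        flit |= v << offset
        offset += width
    return flit

def unpack_header(flit):
    """Inverse of pack_header: recover the field dict from a 32-bit flit."""
    out, offset = {}, 0
    for name, width in FIELDS:
        out[name] = (flit >> offset) & ((1 << width) - 1)
        offset += width
    return out

# Tile ids are encoded as small integers purely for this example.
hdr = pack_header({"source": 1, "terminal": 49, "destination": 9,
                   "done": 0, "iteration": 0, "serial": 42, "reserved": 0})
assert unpack_header(hdr)["serial"] == 42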
B. Routing Algorithm
Four routing algorithms are tested in this proposal: deterministic XY routing and three adaptive Turn-Model routings, WF, NL and NF. All of these algorithms are deadlock-free and livelock-free, and each is implemented to evaluate its behaviour in a heavily loaded DES MCNoC.
Fig. 3 shows the architecture of the router. When a packet enters a tile, the router first checks its done bit. If the bit is LOW, the packet is either sent to the PE or directed to another tile. If the done bit is HIGH, the packet is routed towards its designated boundary tile; once it reaches that boundary tile it is immediately bursted out to the desired terminal tile. By checking the destination and terminal numbers and the done bit in the Header flit, an unfinished packet is unable to pass through a boundary tile until it finishes all stages of the DES computation.
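The per-hop decision just described can be summarized as follows. This is a behavioural sketch, not the authors' router RTL; it assumes tiles addressed by (x, y) coordinates and only illustrates the done-bit and free-PE checks together with deterministic XY routing and a simplified West-First choice.

def next_port(cur, dst, algorithm="XY"):
    """Return 'E', 'W', 'N', 'S' or 'LOCAL' for one routing step."""
    dx, dy = dst[0] - cur[0], dst[1] - cur[1]
    if dx == 0 and dy == 0:
        return "LOCAL"
    if algorithm == "XY":                        # resolve X first, then Y
        if dx:
            return "E" if dx > 0 else "W"
        return "N" if dy > 0 else "S"
    if algorithm == "WF":                        # West-First turn model
        if dx < 0:
            return "W"                           # all westward hops come first
        options = []                             # afterwards choose adaptively
        if dx > 0:
            options.append("E")
        if dy:
            options.append("N" if dy > 0 else "S")
        return options[0]                        # e.g. pick the first free one
    raise ValueError(algorithm)

def route_step(packet, cur, pe_busy):
    """One router's decision for an incoming packet."""
    if packet["done"]:                           # finished: head back out
        return next_port(cur, packet["destination"])
    if not pe_busy:                              # local PE free: consume it
        return "LOCAL"
    return next_port(cur, packet["destination"])  # otherwise keep searching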


Fig. 3: Router Architecture

C. Flow Control
Virtual cut-through (VCT) switching contributes higher throughput as the load increases, because wormhole switching suffers from the drawback of quickly saturating resources when packet blocking occurs [6]. Banerjee et al. [7] show that VCT gives lower latencies at higher acceptance rates and provides better performance than wormhole switching.
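A toy contrast of the two switching schemes, assuming the five-flit packet of Section IV-A; the buffer-accounting details of the actual routers are not published, so this only illustrates the principle.

PACKET_FLITS = 5          # one DES packet = 5 flits of 32 bits (Section IV-A)

def vct_can_forward(free_slots: int) -> bool:
    # VCT forwards only when the whole packet fits downstream, so a blocked
    # packet drains completely into one buffer instead of spanning links.
    return free_slots >= PACKET_FLITS

def wormhole_can_forward(free_slots: int) -> bool:
    # Wormhole advances flit by flit; a blocked packet can stay stretched
    # across several routers and saturate their channels.
    return free_slots >= 1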

D. Architecture of PE
According to the structure of DES, the reasonable numbers of iterations per PE are the divisors of 16, i.e. 1, 2, 4, 8 and 16 (with 16, one PE completes one whole DES operation). Using a small PE that performs only one iteration requires another 15 computations elsewhere to complete one DES operation, which puts more packets into the network. By contrast, a large PE containing all 16 iterations keeps a packet inside the PE longer, so the network has capacity for more data to fill up the other PEs. Whether the fast turnaround of a small PE actually improves the throughput of the overall network is therefore a factor to consider; this trade-off is evaluated in Section V.
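The trade-off can be sketched as follows, assuming one DES encryption consists of 16 rounds and a k-round PE advances the packet's iteration counter by k per visit. des_round() is a structural placeholder only, not a real DES round.

ROUNDS = 16

def des_round(left, right, subkey):
    """Placeholder Feistel round: real DES uses expansion, S-boxes, P-box."""
    return right, left ^ subkey                  # structure only

def pe_process(packet, rounds_per_pe, subkeys):
    """One PE visit: run rounds_per_pe rounds, then update the header fields."""
    l, r = packet["left"], packet["right"]
    start = packet["iteration"]
    stop = min(start + rounds_per_pe, ROUNDS)
    for i in range(start, stop):
        l, r = des_round(l, r, subkeys[i])
    packet["left"], packet["right"] = l, r
    packet["iteration"] = stop
    packet["done"] = stop == ROUNDS              # ready to leave the NoC
    return packet

# With 1-round PEs a packet makes 16 PE visits (heavy traffic); with 16-round
# PEs it makes a single visit, occupying the PE longer but freeing the links.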


V. EXPERIMENT METHODOLOGY AND RESULTS


A. Experiment Setup
The full circuit design was done in Xilinx ISE 11.4, targeting the XC5VLX220-1FF1760 device. All simulations were run in ModelSim-SE 6.2g under a high consecutive packet-injection rate with over 95% overlap between any two inputs, which guaranteed that the MCNoC was fully loaded and that the maximum throughput was measured without saturating it.


B. Simulation Results
1) Simulation Results of PE Size: Table I summarizes the performance and resource evaluation of the 1-, 8- and 16-iterative PEs. The resulting slice utilization shows that the 16-iterative PE architecture fits the slice architecture better than the others.
In the throughput experiments, the benefit of the short data-processing period of a low-iterative PE does not compensate for the throughput loss caused by congestion. The 1-iterative PE saturates the NoC quickly because its routing time is much longer than its data-processing time; consequently more packets stay on the links rather than in the PEs, resulting in congestion. When the insertion rate reaches 727 Mbps, packet congestion also occurs in the 8-iterative design, and only 15.73% of the packets return to their original terminal tiles.
Analyzing the per-packet processing times in Table II, the 1- and 8-iterative PEs process faster than the router because they implement only part of the DES computation. A 16-iterative PE is able to hold a packet longer, giving the router more opportunity to service other packets, which further helps reduce traffic in the network.
TABLE I: DES MCNoC Performance Comparison with Various PE Architectures
(each test case reports average latency / packet return ratio)

PE Size in      Maximum       Slice Usage        100 packets @200MHz,    10,000 packets @200MHz,   10,000 packets @200MHz,
Tested MCNoC    Frequency     Register   LUT     320 Mbps insertion      320 Mbps insertion        727 Mbps insertion
1 iteration     229.764 MHz   12%        28%     2.02 us / 75%           2.71 us / 0.89%           X / X
8 iterations    263.832 MHz   15%        23%     2.26 us / 100%          279.48 us / 100%          23.70 us / 15.73%
16 iterations   263.832 MHz   13%        24%     2.25 us / 100%          279.45 us / 100%          141.62 us / 100%
TABLE II: Processing Time of One Packet

PE Size         PE        Router
1 iteration     35 ns     80 ns
8 iterations    75 ns     80 ns
16 iterations   115 ns    90 ns

This section concludes that the DES MCNoC equipped with 16-iterative PEs offers the best performance and slice utilization. To achieve high throughput, the PE processing time has to be longer than the router processing time, which improves the utilization of both the PE and the router.
2) Simulation Results of Traffic Latency on Terminal Tiles: Fig. 4 presents the average latencies measured on all terminal tiles for the XY and Turn-Model routing algorithms. The X-axis is the terminal tile number and the Y-axis is the latency in nanoseconds. With fewer than 4,000 packets there is no discernible difference between the designs. As the number of packets increases, delays appear that reflect the congestion of a few boundary tiles, especially those sitting at the corners. This situation is most obvious in XY routing, since a packet at a corner waits for the only available route, blocking other packets in neighboring tiles. The adaptive routings, on the other hand, show different distributions of hot-spot corners due to their different routing sequences.

Fig. 4: Packet Latency on Terminal Tiles in the DES MCNoC: (a) XY routing, (b) WF routing, (c) NF routing, (d) NL routing


For instance, WF routing takes West as the first choice and South as the second choice for packets leaving a PE; combined with the fact that packets from the North take South as their first choice and packets from the East take West and South as their first and second choices, most packets take West or South as their first routing path. This makes it hard for the terminal tiles at the lower-left corner to inject more packets, so the terminal tiles numbered 04, 05 and 06 become high-latency entries. In this experiment the number of high-latency entries for XY, WF, NF and NL is 9, 3, 5 and 6, respectively.
3) Simulation Results of PE Utilization: The computing power of the MCNoC is the sum of the computing power of every PE. In theory, if the MCNoC could distribute the workload equally to each PE, the throughput would be maximal. Fig. 5 presents the utilization rate of each PE for the different routing algorithms in the 10,000-packet test set.
The XY routing utilizes the PEs most unevenly: it shows very high utilization at the tiles numbered 21, 25 and 52, but extremely low or zero utilization at the tiles around the center of the NoC. The WF and NL routings have five, and the NF routing seven, PEs with a utilization below 1%. This illustrates that the Turn-Model is a biased routing approach that loses approximately a quarter of the total computing power in the MCNoC.



Fig. 5: PE Utilization in DES MCNoC

4) Simulation Results of Throughput: According to the comparisons in the previous sections, the DES MCNoC using the WF routing algorithm performs best overall. It has the highest packet insertion rate and the lowest processing latency, owing to higher PE utilization and lower traffic contention than the other algorithms. The XY routing has a higher packet insertion rate than the NF and NL routings but gives the lowest throughput because of its vulnerability to network congestion; even so, XY shows very competitive performance for a high-throughput design. All designs reach maximum frequencies above 250 MHz, and the throughputs, calculated in gigabits per second, are listed in Table III.
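As a back-of-envelope check, the Gbps figures in Table III are reproduced within rounding if the throughput is taken as the total encrypted payload (10,000 packets of one 64-bit DES block each) divided by the reported latency; whether this is exactly how the authors compute it is an assumption.

packets, block_bits = 10_000, 64
for routing, latency_us in [("XY", 133), ("WF", 113), ("NF", 133), ("NL", 116)]:
    gbps = packets * block_bits / (latency_us * 1e-6) / 1e9   # bits / seconds
    print(f"{routing}: ~{gbps:.2f} Gbps")                     # ~4.81, ~5.66, ~4.81, ~5.52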
Compared with the previous works listed in Table IV, the proposed design is 6.17 times faster than [3], which combines soft-core processors with pipelining, and 14.71 times faster than [4], which is also a complicated design based on NePA and group pipelining.
TABLE III: Throughput Testing Results (10,000 packets @250MHz)

Routing   Max. Freq.   Insertion Rate   Average Latency   Throughput of DES
XY        265 MHz      784 Mbps         133 us            4.80 Gbps
WF        264 MHz      909 Mbps         113 us            5.65 Gbps
NF        265 MHz      769 Mbps         133 us            4.82 Gbps
NL        264 MHz      889 Mbps         116 us            5.54 Gbps

TABLE IV: Throughput Comparison with Previous Works

Work                    PE Arch.            Frequency   Throughput
Proposed MCNoC (XY)     16-iterative DES    250 MHz     4.80 Gbps
Proposed MCNoC (WF)     16-iterative DES    250 MHz     5.65 Gbps
Proposed MCNoC (NF)     16-iterative DES    250 MHz     4.82 Gbps
Proposed MCNoC (NL)     16-iterative DES    250 MHz     5.54 Gbps
[2]                     MicroBlaze          100 MHz     12.8 Mbps
[3]                     MicroBlaze          100 MHz     915 Mbps
[4]                     NePA                100 MHz     384 Mbps
VI. CONCLUSIONS
The results show that a high-throughput DES computation design can be achieved with low-cost switching, packet format and routing algorithms in a 5×5 mesh-based MCNoC. Using a large PE is area-efficient on FPGA, and keeping the PE processing time longer than the routing time is a key factor in PE architecture selection.



This paper also shows that the uneven PE utilization of XY routing and the bias of the Turn-Model routing algorithms cause a non-negligible performance loss. Finally, a NoC designed with the DES architecture in mind contributes substantial throughput to the final design, about 5 times faster than the best performance among previous works.
To the best of our knowledge this is the first NoC design constructed from cryptology's point of view for high-throughput purposes.
VII. ACKNOWLEDGMENTS
This work was supported by Taiwan National Science Council grants PPP 101-2911-I-011-502.
REFERENCES
[1] H.C. Freitas, L.M. Schnorr, M.A.Z. Alves, and P.O.A. Navaux. Impact of Parallel Workloads on NoC Architecture Design. In Parallel, Distributed and Network-Based Processing (PDP), 2010 18th Euromicro International Conference on, pages 551-555, Feb. 2010.
[2] T.T.-O. Kwok and Y.-K. Kwok. On the Design, Control, and Use of a Reconfigurable Heterogeneous Multi-Core System-on-a-Chip. In IPDPS 2008, pages 1-11, Apr. 2008.
[3] X. Li and O. Hammami. An Automatic Design Flow for Data Parallel and Pipelined Signal Processing Applications on Embedded Multiprocessor with NoC: Application to Cryptography. Int. J. Reconfig. Comput., 2009, Jan. 2009.
[4] Y.S. Yang, J.H. Bahn, S.E. Lee, and N. Bagherzadeh. Parallel and Pipeline Processing for Block Cipher Algorithms on a Network-on-Chip. In ITNG 2009, pages 849-854, Apr. 2009.
[5] G. Schelle and D. Grunwald. Exploring FPGA Network on Chip Implementations across Various Application and Network Loads. In FPL 2008, pages 41-46, Sep. 2008.
[6] J. Duato, A. Robles, F. Silla, and R. Beivide. A Comparison of Router Architectures for Virtual Cut-Through and Wormhole Switching in a NOW Environment. In Proceedings of the 13th International Symposium on Parallel Processing and the 10th Symposium on Parallel and Distributed Processing, pages 240-247, 1999.
[7] N. Banerjee, P. Vellanki, and K.S. Chatha. A Power and Performance Model for Network-on-Chip Architectures. In Design, Automation and Test in Europe Conference and Exhibition, Proceedings, volume 2, pages 1250-1255, Feb. 2004.
