
2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)

A Comparison-free Hardware Sorting Engine


Surajeet Ghosh∗ , Shaon Dasgupta† , Sanchita Saha Ray‡
∗ Dept. of Computer Science & Technology, Indian Institute of Engineering Science & Technology, Shibpur, Howrah, India
† Bentley Systems India Private Limited, Kolkata
‡ Dept. of Information Technology, St. Thomas’ College of Engineering & Technology, Khidderpore, Kolkata, India
∗ surajeetghosh@ieee.org, † dasgupta.shaon@gmail.com, ‡ saharay.sanchita@gmail.com

978-1-7281-3391-1/19/$31.00 ©2019 IEEE
DOI 10.1109/ISVLSI.2019.00110

Abstract—This paper proposes a novel comparison-free hardware sorting engine that sorts N data elements in approximately N clock cycles and detects the largest element in the 1st clock cycle. The sorting engine is designed using n symmetric cascaded blocks (one per data bit) that are built from a few fundamental logic components. In this architecture, sorting and storing operations are performed in a pipelined manner. The complete design is synthesized for several data sets, ranging from pseudo-randomly generated data elements, to all-unique elements, to all-identical elements, and also from random to completely sorted data elements. It has been observed that the engine appears impartial to the input ordering. Synthesis results indicate that the proposed approach consumes reasonably low FPGA resources. The architecture is evaluated using the delay components of a 65-nm standard cell library. For sorting 128 to 64K data elements at 125 MHz, this architecture takes a per-element sorting delay of approximately 5.3 to 8 ns (1 clock cycle) for 16-bit elements, 7.72 to 11.8 ns (1 to 2 clock cycles) for 24-bit elements, and 10.16 to 15.5 ns (2 clock cycles) for 32-bit elements. The engine achieves a sorting throughput of approximately 125 to 200 Million Elements per second (MEps) for 16 bits, 85 to 130 MEps for 24 bits, and 65 to 98 MEps for 32 bits when sorting 128 to 64K data elements.

Index Terms—Hardware Sorting Engine; Comparison-free sorter; Largest Element Detector

I. INTRODUCTION

Sorting is a prerequisite in most data-centric applications that involve big-data analysis, such as image processing, video processing, database systems, and ATM switching. In large database systems, computation-intensive operations are handled by highly parallel multi-core processing systems. However, the scalability of multiple CPUs is restricted by large communication delays. Therefore, the development of efficient sorting techniques is very important. Numerous sorting algorithms have been developed using both software and hardware techniques. Owing to the many advantages of hardware sorters over software-based sorting algorithms, hardware sorting architectures have been an area of interest for many researchers and computer scientists. Numerous architectures, algorithms and circuit designs have been developed to date with the aim of solving the problem of high-speed sorting. Most of these works show effectiveness in terms of time and speed, but they have some limitations. In this section we discuss some of the notable works related to hardware sorters.

A hardware-algorithm for sorting N elements using either a p-sorter or a sorting network of fixed input-output size p is presented in [1]; it runs in O((N × log2 N)/(p × log2 p)) time. To reduce sorting time, some recent hardware-based sorting approaches using the merge sorting technique have been proposed in [2] and [3]. In [3], a FIFO-based parallel merge sorter called K-Sorter is proposed that sorts N keys with a latency of (N + log2 N) − 1, using log2 N comparators and N × log2 N comparisons. Performance superior to optimized algorithms on CPUs and GPUs is reported in [2], with an estimated latency of N × log2 N cycles for sorting N elements. However, both [2] and [3] require quite a large number of comparators. [4] and [5] proposed parallel merge sorter trees, but regardless of the parallelism exploited in these FPGA-based merge sorters to achieve good best-case throughput, they suffer a significant throughput drop for skewed data distributions. A stable throughput is achieved in [6] by employing a high-bandwidth sort merger to utilize the sort-merge passes; however, it is difficult to scale this design up for large volumes of streaming data. Bitonic-sorter-based approaches are found in [7], [8], [9], but they require more comparisons than merge sort.

Some sorters are built for specialized applications. In [10], a systolic sorter is used; in spite of its acceptable latency, i.e., N, the numbers of registers (N × (N + 1)) and compare-and-swap (CS) units ((N/2) × (N − 1)) are very large. A comparison-free sorting approach is found in [11]; it uses complex matrix-mapping operations, involving a number of matrices, to carry out the actual sorting. In summary, most of the existing sorting techniques require large and complex compare-and-swap operations and are based on well-known sorting algorithms, which demand a large pre-processing time to split the list of data elements into a number of parts. Moreover, to satisfy the specifications on both the algorithm side and the hardware-performance side, the desired sorting architecture is expected to be stable in the presence of duplicate data entries and able to sustain high sorting throughput. Additionally, when implemented in hardware, resources should be utilized optimally. To address these challenges, a novel hardware-based comparison-free sorting engine is proposed in this paper.

The key contributions are summarized as follows:
• This sorting architecture does not use any of the existing well-known sorting algorithms or their improved variants; instead, it proposes a novel hardware structure for sorting.
• It exhibits a comparison-free sorting mechanism that does not require any comparators, complex circuitry, or complex algorithms (e.g., matrix manipulation), and instead processes data sorting and storing in a pipelined fashion involving only a few basic logic gates.
• It completely sorts N data elements (regardless of unique or duplicate entries) with a linear sorting delay of O(N) clock cycles, and is able to find the largest data element in just a single (the 1st) clock cycle.

The rest of the paper is organized as follows. Section II describes the concept of the proposed hardware sorting engine. The performance of the proposed architecture is evaluated in Section III. Finally, Section IV concludes the paper.

II. HARDWARE SORTING ENGINE

A novel hardware-based comparison-free sorting technique is introduced in this paper that completely sorts N elements in N iterations. The proposed architecture finds the largest element (LE) in the 1st iteration; thereafter, in every iteration, it finds the next LE among the remaining data elements. A simplified schematic of the proposed sorting architecture is shown in Fig. 1. The architecture receives data elements from an unsorted memory (UM); in every clock cycle it detects the largest element, which is thereafter stored in a sorted memory (SM) unit. The overall control, in terms of clock initiation and sending the proper control signals to the respective units, is handled by the sort controller. The following subsections describe the internal architecture and its operation.

Fig. 1. A Simple Architecture of the Proposed Hardware Sorting Engine

A. Organization of hardware sorting engine

The sorting engine consists of a number of cascaded blocks (shown in Fig. 2), selected one after another. Given a set of N n-bit-wide data elements, the proposed architecture requires n blocks. Each block filters out the smaller elements and forwards the larger elements to the next block for further filtering, finally deciding the LE among the participating elements. Each block consists of N basic cells (irrespective of duplicate data entries) that operate in parallel. The internal structure of a block is shown in Fig. 3(a). Each cell consists of a 2-input AND gate and a tiny switch (2:1 multiplexer), shown in Fig. 3(b). Since the cells in a block run concurrently, the delay incurred by each block is negligibly small, primarily composed of a 2-input AND gate delay (t_AND*), an N-input OR gate delay ((log2 N)/2 × t_OR#), termed the selection delay (T_sel), and a multiplexer delay (T_MUX), as shown in (1). In (1), only T_sel (which grows with N) is variable, while t_AND* and T_MUX are constant. For a large N, T_sel becomes the dominating factor. The time required to sort each element, T_i, is shown in (2); the T_ENC delay in (2) is expressed in (4). The sorting time required for N elements is shown in (3). Here, the sorting and storing operations run in a pipelined manner with a lag of a single clock cycle.

    T_block = t_AND* + T_sel + T_MUX ≈ T_sel                    (1)
    T_i = (n × T_block) + T_ENC                                  (2)
    T_sort = (N + 1) × T_i                                       (3)
    T_ENC = ((log2 N)/2 × t_OR#) + T_sel + t_AND* + t_INV        (4)

TABLE I
CONSIDERED DELAY COMPONENTS IN 65-nm TECHNOLOGY

    T_sel (ns)   T_MUX(2:1) (ps)   t_AND* (ps)   T_ENC (ns)   t_OR* (ps)   t_OR# (ps)
    0.18 - 0.39  87                39            0.38 - 0.8   36           48
    *: 2-input;  #: 4-input

At the beginning of every iteration, the element vector table (EVT) reflects the data elements yet to be sorted. The 1st block receives two inputs, one from the unsorted memory (UM) and another from the EVT. The outputs of this block (the cells present in that block) are passed through an OR logic (a hierarchy of parallel 4-input OR gates) as a selection input to all the multiplexers present in that block. These multiplexers also receive two inputs, one from the respective cell's AND gate and another from the corresponding cell of the previous block. Continuing this way, in the final block the output of only one of the switches is found high (for unique data entries). This indicates the bit position of the largest data element, and with the help of the largest element detector (LED) unit (shown in Fig. 2), the LE of that iteration is identified. For duplicate entries, however, the outputs of multiple switches might turn high. This is resolved by imposing a masking logic prior to an ordinary encoder circuit; this masking logic, together with the encoder unit, is termed the LED (shown in Fig. 4).

Let us take an example set of five data elements, say 13 (=1101₂), 4 (=0100₂), 6 (=0110₂), 1 (=0001₂) and 10 (=1010₂). Here, the EVT is a vector of 5 bits, where each bit represents a data element in UM, initialized to 11111. Thereafter, in every iteration, the EVT is updated by the final output (FO) of the previous iteration, as shown in Fig. 6. In this particular example, there are four blocks (for 4-bit data elements). The 1st block (from the left) receives the MSB (D3) of these five data elements as one input and the EVT as the other. The output of the 1st block is 10001₂, which is fed to the 2nd block (as the output of its OR gate is '1') along with the next higher-order bit (i.e., D2).
Fig. 2. Block level structure of the Hardware Sorting Engine

Continuing this way, the output of the final block is 10000₂. For unique data elements, under no circumstances would multiple 1's appear in the FO. The presence of a 1 in the FO indicates the largest element among the participating data elements and thereby completes an iteration. In the next cycle, while the engine evaluates the next sorted element, the last identified sorted element is stored in the sorted memory (SM) in a pipelined fashion. At the end of the 1st iteration, FO is 10000₂, indicating that the 1st data element, i.e., 13, is the largest element; it is stored at an address indicated by the counter unit. At the beginning of the next iteration, the EVT is updated by bit-wise AND-ing it with the complemented output of FO, i.e., with 01111₂. This process continues until all the bit positions in the EVT become zero, or alternatively, until all the data elements are sorted. After the completion of the 5th iteration, the data elements are completely sorted, as shown in Fig. 6.
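The block-cascade mechanism and the worked example above can be modelled in software. The sketch below is our own illustration, not the authors' RTL: each "block" ANDs one bit column (MSB first) with the vector of surviving candidates, the block's OR gate decides whether its output or the previous block's vector is forwarded (the 2:1 switches), and the EVT is masked with the complement of FO after each iteration.

```python
# Software model (ours) of the comparison-free sorting engine described
# above. One "block" per data bit, processed MSB -> LSB; the final
# output FO marks the position of the largest remaining element.

def sort_engine(data, width):
    n = len(data)
    evt = [1] * n                    # element vector table: all unsorted
    result = []
    while any(evt):
        fo = evt[:]                  # candidate vector entering block 1
        for b in range(width - 1, -1, -1):      # one block per bit
            # AND stage: keep candidates whose bit b is 1
            out = [c & ((d >> b) & 1) for c, d in zip(fo, data)]
            # OR stage drives the 2:1 switches: if no candidate has a 1
            # in this bit, forward the previous block's vector unchanged
            if any(out):
                fo = out
        # LED masking logic: with duplicates several bits of FO may be
        # high; keep only the first set bit before encoding its address
        idx = fo.index(1)
        result.append(data[idx])
        evt[idx] = 0                 # EVT update: EVT AND NOT(masked FO)
    return result

print(sort_engine([13, 4, 6, 1, 10], 4))   # [13, 10, 6, 4, 1]
```

Running the model on the paper's example yields the descending order 13, 10, 6, 4, 1; duplicate entries are handled by the masking step, e.g. `sort_engine([5, 7, 5, 2], 3)` returns `[7, 5, 5, 2]`.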


Fig. 3. Internal structure of (a) an intermediate Block j, (b) a Cell.

B. Operation of hardware sorting engine

The sorting engine starts functioning upon receiving a start-of-operation signal (SO) at every iteration to detect the LE, and an end-of-operation signal (EO) initiates the storing of that LE at an incremental address in SM. At the end of every iteration, the EVT is updated (by the EVT update unit, shown in Fig. 2) to reflect the new set of participating data elements, eliminating those data elements which were found to be the largest in previous iterations. The timing diagram of the sorting process is shown in Fig. 5.

III. PERFORMANCE ANALYSIS

The proposed sorting engine is simulated and synthesized in Xilinx ISE 14.7 on a Virtex-5 XC5VLX50T development board running at a frequency of 125 MHz. Test stimuli vary from pseudo-randomly generated data elements, to all-unique elements, to all-identical elements, and also from random to completely sorted data elements. In all cases, the structure appears impartial to the ordering of the data elements. Due to resource constraints of the XC5VLX50T platform, we could not test beyond 1024 elements of 32 bits and 2048 elements of 11 bits. However, we have theoretically calculated the expected sorting delay trend up to 64K elements of 32 bits.
We have considered the delay components of a 65-nm standard cell library, the technology node of the Virtex-5, as per Table I. The architecture takes a per-element sorting delay of approximately 5.3 to 8 ns (1 clock cycle) for 16 bits, 7.72 to 11.8 ns (1 to 2 clock cycles) for 24 bits, and 10.16 to 15.5 ns (2 clock cycles) for 32 bits when sorting 128 to 64K data elements. The total sorting time (without considering the EVT initialization delay) of these elements is shown in Fig. 7. The engine achieves a sorting throughput of approximately 125 to 200 Million Elements per second (MEps) for 16 bits, 85 to 130 MEps for 24 bits, and 65 to 98 MEps for 32 bits when sorting 128 to 64K data elements.

Fig. 6. An illustrative example to show the internal operation performed by the proposed hardware sorting engine.

Fig. 4. Structure of the final block with largest element detector

Fig. 5. Timing diagram of the sorting process.
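To make the delay figures concrete, the following numeric sketch (ours) evaluates Eqs. (1)–(3) with the Table I values at the best-case 16-bit corner. The inverter delay t_INV in Eq. (4) is not listed in Table I, so the value used below is an assumed placeholder.

```python
# Delay model of Eqs. (1)-(4) with Table I figures (all times in ns):
# T_sel = 0.18-0.39 ns, T_MUX = 87 ps, t_AND* = 39 ps, T_ENC = 0.38-0.8 ns,
# t_OR# = 48 ps. t_INV (20 ps) is an assumption, not a Table I value.

import math

T_MUX, T_AND, T_OR4 = 0.087, 0.039, 0.048

def t_block(t_sel):
    # Eq. (1): per-block delay, dominated by the OR-tree selection delay
    return T_AND + t_sel + T_MUX

def t_enc(n_elems, t_sel, t_inv=0.020):
    # Eq. (4): encoder delay; the 4-input OR tree has depth log2(N)/2
    return (math.log2(n_elems) / 2) * T_OR4 + t_sel + T_AND + t_inv

def t_i(width, t_sel, t_enc_val):
    # Eq. (2): per-element sorting delay through the n-bit block cascade
    return width * t_block(t_sel) + t_enc_val

def t_sort(n_elems, ti):
    # Eq. (3): one extra T_i accounts for the sort/store pipeline lag
    return (n_elems + 1) * ti

# Best-case 16-bit corner (T_sel = 0.18 ns, T_ENC = 0.38 ns):
# 16 * (0.039 + 0.18 + 0.087) + 0.38 = 5.276 ns
print(round(t_i(16, 0.18, 0.38), 3))   # -> 5.276
```

The fast corner reproduces the ~5.3 ns per-element delay quoted above for 16-bit data; sweeping T_sel and T_ENC across their Table I ranges gives the order of magnitude of the other reported per-element delays.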
A comparative analysis of the number of clock cycles required to perform the sorting operation, to find the largest element, and to find the smallest element, for [2], [12], [13], [14], [15] and our proposed work, is shown in Table II. It is clear from the table that the proposed sorting engine requires N clock cycles to completely sort N elements, and it also requires N clock cycles to determine the smallest element. However, the largest data element is obtained after the first clock cycle, i.e., in a single clock cycle. Table III provides a comparison of the FPGA resource utilization of our approach against various recent sorting methods [2], [11], and [15]. Table IV depicts the resource utilization of the proposed sorting engine for data sets of various sizes and data widths.

A comparison of sorting time between the comparison-free method of [11] and our proposed scheme is depicted in Fig. 8. For a small number of elements, [11] and the proposed engine consume approximately the same sorting time (time difference < 1 μs); however, for a large set of elements our engine reasonably outperforms [11], owing to the latter's large matrix-manipulation time.

TABLE II
COMPARATIVE ANALYSIS OF CLOCK CYCLES REQUIRED TO SORT, TO FIND THE LARGEST ELEMENT, AND TO FIND THE SMALLEST ELEMENT, FOR DIFFERENT METHODS

    Method    Width  Elements  Sorting  Largest  Smallest
    [12]      16     128       2550     2550     2550
                     512       10200    10200    10200
                     1024      20400    20400    20400
              32     128       2550     2550     2550
                     512       10200    10200    10200
                     1024      20400    20400    20400
    [13]      16     128       2048     2048     2048
                     512       8192     8192     8192
                     1024      16384    16384    16384
              32     128       4096     4096     4096
                     512       16384    16384    16384
                     1024      32768    32768    32768
    [14]      16     128       5382     5382     5382
                     512       21510    21510    21510
                     1024      43014    43014    43014
              32     128       5382     5382     5382
                     512       21510    21510    21510
                     1024      43014    43014    43014
    [15]      16     128       64       64       64
                     512       256      256      256
                     1024      512      512      512
              32     128       64       64       64
                     512       256      256      256
                     1024      512      512      512
    [2]       16     128       448      448      448
                     512       2304     2304     2304
                     1024      5120     5120     5120
              32     128       448      448      448
                     512       2304     2304     2304
                     1024      5120     5120     5120
    Proposed  16     128       128      1        128
                     512       512      1        512
                     1024      1024     1        1024
              32     128       256      2        256
                     512       1024     2        1024
                     1024      2048     2        2048

TABLE III
RESOURCE UTILIZATION OF VARIOUS METHODS FOR SORTING 128 ELEMENTS OF 32-BIT WIDTH

    Parameter        [2]       [11]      [15]      Proposed
    Device           Virtex-7  Virtex-5  Virtex-7  Virtex-5
    Occupied Slices  11142     1665      6832      504
    LUTs             71310     3750      43719     2016
    Flip-Flops       90169     3330      133584    2016
    Delay (μs)       0.4       1.5       1.6       1.3

TABLE IV
RESOURCE UTILIZATION OF THE PROPOSED SORTING ARCHITECTURE

    Width    Elements (N)  Slices (%)     LUTs (%)        Flip-Flops (%)
    16 bits  256           1225 (17.01)   2288 (7.94)     4125 (14.32)
             512           2188 (30.39)   4561 (15.84)    8222 (28.55)
             1024          3189 (44.29)   9126 (31.69)    15660 (54.38)
    24 bits  256           1441 (20.01)   2499 (8.68)     6181 (21.46)
             512           2675 (37.15)   6174 (21.44)    12326 (42.80)
             1024          3991 (55.43)   12387 (43.01)   19112 (66.36)
    32 bits  256           1857 (25.79)   2913 (10.11)    8237 (28.60)
             512           3617 (50.24)   8089 (28.09)    16190 (56.22)
             1024          5113 (71.01)   16571 (57.54)   23762 (82.51)

IV. CONCLUSION

We have proposed a novel hardware-based comparison-free sorting technique with a complexity of O(N) with respect to sorting speed, built from basic circuit components. The proposed architecture consumes approximately 1 clock cycle of per-element sorting delay for 16-bit input data at a 125 MHz clock frequency. Roughly, it consumes K clock cycles of per-element sorting delay for n-bit input data, where K ≈ n/16. Our sorter forwards the data elements across n cascaded stages that involve only a few basic logic gates and tiny 2:1 multiplexers, and thereby requires a reasonably small time delay per iteration.
Due to pipelining, the sorting and storing operations run in an overlapped fashion, which eliminates the additional memory cycles otherwise required for storing the sorted elements. Simulation results show minimal resource utilization compared with some of the recent hardware-based sorting approaches, with reasonably small sorting delay. Future work includes the exploitation of spatial parallelism in the proposed pipelined structure to further improve throughput for large volumes of data. Finally, this hardware sorting engine would be a useful embedded component as an accelerator for data-aware applications.

Fig. 7. Analysis of sorting time required, theoretical (T) and simulated (S), for various numbers of elements of different widths (W).

Fig. 8. Sorting delay comparison of comparison-free hardware architectures.

ACKNOWLEDGMENTS

This work is supported and funded under grant no. 149(Sanc.)/ST/P/S&T/6G-16/2018 by the Department of Science & Technology and Bio-technology, Govt. of West Bengal, India.

REFERENCES

[1] S. Olariu, M. C. Pinotti, and S. Q. Zheng, "An optimal hardware-algorithm for sorting using a fixed-size parallel sorting device," IEEE Transactions on Computers, vol. 49, no. 12, pp. 1310–1324, Dec 2000.
[2] S. Mashimo, T. V. Chu, and K. Kise, "High-performance hardware merge sorter," in IEEE 25th Annual Intl. Symp. on Field-Programmable Custom Computing Machines (FCCM), April 2017, pp. 1–8.
[3] N. Matsumoto, K. Nakano, and Y. Ito, "Optimal parallel hardware k-sorter and top k-sorter, with FPGA implementations," in 14th Intl. Symp. on Parallel and Distributed Computing, June 2015, pp. 138–147.
[4] W. Song, D. Koch, M. Luján, and J. Garside, "Parallel hardware merge sorter," in 2016 IEEE 24th Annual Intl. Symp. on Field-Programmable Custom Computing Machines (FCCM), May 2016, pp. 95–102.
[5] T. Usui, T. V. Chu, and K. Kise, "A cost-effective and scalable merge sorter tree on FPGAs," in 2016 Fourth International Symposium on Computing and Networking (CANDAR), Nov 2016, pp. 47–56.
[6] J. Casper and K. Olukotun, "Hardware acceleration of database operations," in ACM/SIGDA Intl. Symp. on Field-Programmable Gate Arrays (FPGA '14), New York, NY, USA: ACM, 2014, pp. 151–160.
[7] J. P. Agrawal, "Arbitrary size bitonic (ASB) sorters and their applications in broadband ATM switching," in IEEE 15th Annual Intl. Phoenix Conf. on Computers and Communications, Mar 1996, pp. 454–458.
[8] A. Greß and G. Zachmann, "GPU-ABiSort: optimal parallel sorting on stream architectures," in Proceedings 20th IEEE International Parallel & Distributed Processing Symposium, April 2006, pp. 1–10.
[9] S. Sengupta, M. Harris, Y. Zhang, and J. D. Owens, "Scan primitives for GPU computing," in 22nd ACM SIGGRAPH/EUROGRAPHICS Symposium on Graphics Hardware (GH '07), 2007, pp. 97–106.
[10] V. A. Pedroni et al., "Panning sorter: A minimal-size architecture for hardware implementation of 2D data sorting coprocessors," in IEEE Asia Pacific Conf. on Circuits and Systems, Dec 2010, pp. 923–926.
[11] S. Abdel-Hafeez and A. Gordon-Ross, "An efficient O(n) comparison-free sorting algorithm," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 25, no. 6, pp. 1930–1942, June 2017.
[12] R. Paul, S. Sau, and A. Chakrabarti, "Architecture for real time continuous sorting on large width data volume for FPGA based applications," CoRR, vol. abs/1206.1567, 2012.
[13] S.-W. Cheng, "Arbitrary long digit integer sorter HW/SW co-design," in Proceedings of the ASP-DAC Asia and South Pacific Design Automation Conference, Jan 2003, pp. 538–543.
[14] V. Sananda, "Hardware accelerated crypto merge sort: MEMOCODE 2008 design contest," in 2008 6th ACM/IEEE International Conference on Formal Methods and Models for Co-Design, June 2008, pp. 159–162.
[15] A. Rjabov, "Hardware-based systems for partial sorting of streaming data," in 2016 15th Biennial Baltic Electronics Conference (BEC), Oct 2016, pp. 59–62.