Ghosh 2019
Abstract—This paper proposes a novel comparison-free hardware sorting engine that sorts N data elements in approximately N clock cycles and detects the largest element in the 1st clock cycle. The engine is designed using N symmetric cascaded blocks built from a few fundamental logic components. In this architecture, sorting and storing operations are performed in a pipelined manner. The complete design is synthesized for several data sets, from pseudo-randomly generated elements, to all-unique and all-identical elements, and from random to completely sorted orderings; the engine is observed to be impartial to the input ordering. Synthesis results indicate that the proposed approach consumes reasonably little FPGA resource. The architecture is also evaluated with the delay components of a 65-nm standard cell library. For sorting 128 to 64K data elements of size 16, 24, and 32 bits at 125 MHz, the architecture takes a per-element sorting delay of approximately 5.3 to 8 ns (1 clock cycle) for 16-bit, 7.72 to 11.8 ns (1 to 2 clock cycles) for 24-bit, and 10.16 to 15.5 ns (2 clock cycles) for 32-bit elements. The engine achieves a sorting throughput of approximately 125 to 200 Million Elements per second (MEps) at 16 bits, 85 to 130 MEps at 24 bits, and 65 to 98 MEps at 32 bits for sorting 128 to 64K data elements.

Index Terms—Hardware Sorting Engine; Comparison-free Sorter; Largest Element Detector

I. INTRODUCTION

Sorting is a prerequisite in most data-centric applications that involve big-data analysis, such as image processing, video processing, database systems, and ATM switching. In large database systems, computation-intensive operations are handled by highly parallel multi-core processing systems. However, the scalability of multiple CPUs is restricted by large communication delays; therefore, the development of efficient sorting techniques is very important. Numerous sorting algorithms have been devised using both software and hardware techniques. Owing to the many advantages of hardware sorters over software-based sorting algorithms, hardware sorting architectures have been an area of interest for many researchers and computer scientists. Numerous architectures, algorithms, and circuit designs have been developed to date with the aim of solving the problem of high-speed sorting. Most of these works are effective in terms of time and speed, but they have limitations. In this section we discuss some of the notable works related to hardware sorters.

A hardware-algorithm for sorting N elements using either a p-sorter or a sorting network of fixed input-output size p is presented in [1]; it runs in O((N × log2 N)/(p × log2 p)) time. To reduce sorting time, some recent hardware-based sorting approaches using the merge sorting technique have been proposed in [2] and [3]. In [3], a FIFO-based parallel merge sorter called K-Sorter is proposed that sorts N keys with a latency of (N + log2 N) − 1, using log2 N comparators in N × log2 N comparisons. Performance superior to optimized algorithms on CPUs or GPUs is reported in [2], with an estimated latency of (N × log2 N) cycles for sorting N elements. Both [2] and [3] require a rather large number of comparators. [4] and [5] proposed parallel merge sorter trees; despite the parallelism exploited in these FPGA-based merge sorters to achieve good best-case throughput, they suffer a significant throughput drop for poorly skewed data distributions. A stable throughput is achieved in [6] by employing a high-bandwidth sort merger for the sort-merge passes; however, this design is difficult to scale up for large volumes of data streams. Bitonic-sorter-based approaches are found in [7], [8], [9], but they require more comparisons than merge sort.

Some sorters are made for specialized applications. In [10], a systolic sorter is used; in spite of its acceptable latency of N, the numbers of registers (N × (N + 1)) and compare-and-swap (CS) units (N/2 × (N − 1)) are very large. A comparison-free sorting approach is found in [11], which uses complex matrix-mapping operations involving a number of matrices to carry out the actual sorting. In summary, most of the existing sorting techniques require large and complex compare-and-swap operations and are based on well-known sorting algorithms, which demand a large pre-processing time to split the list of data elements into a number of parts. Moreover, to satisfy specifications on both the algorithm side and the hardware-performance side, the desired sorting architecture is expected to be stable under duplicate data entries and able to handle the overhead of high sorting throughput. Additionally, when implemented in hardware, resources should be utilized optimally. To address these challenges, a novel hardware-based comparison-free sorting engine is proposed in this paper.

The key contributions are summarized as follows:
• This sorting architecture does not use any of the existing well-known sorting algorithms or their improved variants; instead, it proposes a novel hardware structure for sorting.
• It exhibits a comparison-free sorting mechanism that requires no comparators, no complex circuitry, and no complex algorithm (e.g., matrix manipulation), and
instead processes data sorting and storing in a pipelined fashion involving only a few basic logic gates.
• It completely sorts N data elements (regardless of unique or duplicate entries) with a linear sorting delay of O(N) clock cycles, and it can find the largest data element in just a single cycle (the 1st cycle).

The rest of the paper is organized as follows. Section II describes the concept of the proposed hardware sorting engine. The performance of the proposed architecture is evaluated in Section III. Finally, Section IV concludes the paper.

II. HARDWARE SORTING ENGINE

A novel hardware-based comparison-free sorting technique is introduced in this paper that sorts N elements completely in N iterations, while the proposed architecture is able to find the largest element (LE) in the 1st iteration. Thereafter, in every iteration, it finds the next LE from the remaining data elements. A simplified schematic of the proposed sorting architecture is shown in Fig. 1. The architecture receives data elements from an unsorted memory; in every clock cycle it detects the largest element, which is then stored in a sorted memory (SM) unit. The overall control, in terms of clock initiation and sending the proper control signals to the respective units, is handled by the sort controller. The following subsections describe the internal architecture and its operation.

Fig. 1. A Simple Architecture of the Proposed Hardware Sorting Engine

A. Organization of hardware sorting engine

The sorting engine consists of a number of cascaded blocks (shown in Fig. 2) selected one after another. Given a set of N n-bit-wide data elements, the proposed architecture requires n blocks. In effect, each block filters out the smaller elements and forwards the larger elements to the next block for further filtering, finally deciding the LE among the participating elements. Each of these blocks consists of N basic cells (irrespective of duplicate data entries) that operate in parallel. The internal structure of a block is shown in Fig. 3(a). Each cell consists of a 2-input AND gate and a tiny switch (2:1 multiplexer), shown in Fig. 3(b). Since the cells present in a block run concurrently, the delay incurred by each of these blocks is negligibly small, primarily composed of a 2-input AND gate delay (t_AND*), an N-input OR gate delay ((log2 N)/2 × t_OR#), termed the selection delay (T_sel), and a multiplexer delay (T_MUX), as shown in (1). In (1), only T_sel (∝ N) is variable, while t_AND* and T_MUX are constant; for large N, T_sel becomes the dominating factor. The time required to sort each element is denoted T_i, shown in (2); the encoder delay T_ENC in (2) is expressed in (4). The sorting time required for N elements is shown in (3). Here, the sorting and storing operations run in a pipelined manner with a lag of a single clock cycle.

T_block = t_AND* + T_sel + T_MUX ≈ T_sel    (1)
T_i = (n × T_block) + T_ENC    (2)
T_sort = (N + 1) × T_i    (3)
T_ENC = ((log2 N)/2 × t_OR#) + T_sel + t_AND* + t_INV    (4)

At the beginning of every iteration, the element vector table (EVT) reflects the data elements yet to be sorted. The 1st block receives two inputs, one from an unsorted memory (UM) and another from the EVT. The outputs of this block (the cells present in that block) are passed through an OR logic (a hierarchy of parallel 4-input OR gates) as a selection input to all the multiplexers present in that block. These multiplexers also receive two inputs, one from the respective cell's AND gate and another from the corresponding cell of the previous block. Continuing this way, in the final block, the output of exactly one of the switches is found high (for unique data entries). This indicates the bit position of the largest data element, and with the help of the largest element detector (LED) unit (shown in Fig. 2), the LE of a given iteration is identified. For duplicate entries, however, the outputs of multiple switches might turn high. This is resolved by imposing a masking logic prior to an ordinary encoder circuit; this masking logic together with the encoder unit is termed the LED (shown in Fig. 4).

Let us take an example set of five data elements, say 13 (=1101₂), 4 (=0100₂), 6 (=0110₂), 1 (=0001₂) and 10 (=1010₂). Here, the EVT is a vector of 5 bits, where each bit represents a data element in the UM, initialized to 11111. Thereafter, in every iteration, the EVT is updated by the final output (FO) of the previous iteration, as shown in Fig. 6. In this particular example there are four blocks (for 4-bit data elements). The 1st block (from the left) receives the MSB (D3) of these five data elements as one of its inputs and the EVT as the other. The output of the 1st block is 10001₂ and is fed to the 2nd block (as the output of the OR gate is '1') along with the next lower-order bit (i.e., D2). Continuing this way, the output of the
[Fig. 2: Block 1 … Block n cascade with the EVT, the EVT Update Unit, and the Largest Element Detector producing the address of the largest element.]

final block is 10000₂. For unique data elements, under no circumstances would multiple 1's result in the FO. The presence of a 1 in the FO indicates the largest element among the participating data elements and thereby completes an iteration. In the next cycle, while the engine evaluates the next sorted element, the last identified element is stored in the sorted memory (SM) in a pipelined fashion. At the end of the 1st iteration, the FO is 10000₂, indicating that the 1st data element, i.e., 13, is the largest element; it is stored at the address indicated by the counter unit (shown in Fig. 3). At the beginning of the next iteration, the EVT is updated by bit-wise AND-ing it with the complemented output of the FO, i.e., 01111₂. This process continues until all the bit positions in the EVT become zero, or alternatively, all the data elements are sorted. After the completion of the 5th iteration, the data elements are completely sorted.
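The iteration-by-iteration behaviour described above — bit-by-bit filtering to isolate the largest remaining element, followed by the EVT update — can be sketched as a software model. This is a behavioural sketch only; the function name is ours, and clearing a single EVT bit per iteration (so that duplicates are emitted one per iteration) is our reading of the masking logic, not a statement of the hardware's exact wiring.

```python
def comparison_free_sort(data, n_bits):
    """Behavioural model of the proposed engine: repeatedly isolate the
    largest remaining element by bitwise filtering (no magnitude
    comparators), store it, and mask it out of the EVT."""
    N = len(data)
    evt = [1] * N          # element vector table: 1 = not yet sorted
    sm = []                # sorted memory, filled largest-first
    while any(evt):
        # Cascade of n_bits blocks, MSB first: each block keeps only the
        # candidates whose current bit is 1, unless that would empty the set.
        cand = list(evt)
        for bit in range(n_bits - 1, -1, -1):
            out = [c & ((data[i] >> bit) & 1) for i, c in enumerate(cand)]
            if any(out):   # OR over the block's cells drives the multiplexers
                cand = out # pass the filtered vector to the next block
            # else: all candidates had a 0 here; keep the previous vector
        # cand is the final output FO: 1s mark the largest element(s);
        # duplicates give several 1s, resolved here by taking the first.
        idx = cand.index(1)
        sm.append(data[idx])
        evt[idx] = 0       # EVT update: clear the stored element's bit
    return sm
```

Running it on the paper's example set {13, 4, 6, 1, 10} with 4-bit elements reproduces the SM contents of Fig. 6, largest element first.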
Fig. 3. Internal structure of (a) an intermediate Block j, (b) a Cell.

III. PERFORMANCE ANALYSIS
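The theoretical sorting time follows equations (1)–(4). A small numeric model can make the scaling behaviour concrete; the gate delays below are purely illustrative placeholders, not the 65-nm standard cell library values used in the paper.

```python
from math import log2

def latency_model(N, n, t_and=0.05, t_or=0.07, t_mux=0.08, t_inv=0.03):
    """Evaluate equations (1)-(4) for N elements of n bits.
    Gate delays (in ns) are illustrative placeholders only."""
    # Selection delay: OR-tree of 4-input gates over N cell outputs,
    # so its depth is log2(N)/2 -- the only term that grows with N.
    t_sel = (log2(N) / 2) * t_or
    t_block = t_and + t_sel + t_mux                        # (1), ≈ T_sel for large N
    t_enc = (log2(N) / 2) * t_or + t_sel + t_and + t_inv   # (4)
    t_i = n * t_block + t_enc                              # (2), per-element time
    t_sort = (N + 1) * t_i                                 # (3), total sorting time
    return t_i, t_sort
```

As (1) suggests, T_sel comes to dominate T_block as N grows, so the per-element time T_i increases only logarithmically with N while the total time T_sort scales linearly.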
Fig. 6. An illustrative example to show the internal operation performed by the proposed hardware sorting engine. [Tables for the 1st, 2nd, and final iterations over the data set {13, 4, 6, 1, 10}, showing the EVT, the per-block AND/OR outputs for bits D3–D0, the FO, the bit-wise AND update, and the SM being filled at addresses 0, 1, 2, … with 13, 10, 6, 4, 1.]
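For duplicate entries, the FO can carry several 1s, and the LED's masking logic must reduce it to a single position before the encoder produces an address. The mask-then-encode step can be modelled as below; the keep-first-1 priority is our assumption about the masking logic, not taken from the paper.

```python
def led_address(fo):
    """Model of the Largest Element Detector: mask the final output FO
    down to a single 1 (duplicates may raise several switches), then
    encode its position as the address. Assumes FO contains at least
    one 1, as guaranteed while the EVT is non-zero."""
    masked = [0] * len(fo)
    seen = 0
    for i, bit in enumerate(fo):
        masked[i] = (bit & ~seen) & 1  # pass the first 1, suppress the rest
        seen |= bit
    return masked, masked.index(1)
```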
TABLE II
COMPARATIVE ANALYSIS OF CLOCK CYCLES REQUIRED TO SORT, FIND THE LARGEST ELEMENT, AND FIND THE SMALLEST ELEMENT, FOR DIFFERENT METHODS
[Table contents not recovered.]

TABLE III
RESOURCE UTILIZATION OF VARIOUS METHODS FOR SORTING 128 ELEMENTS OF 32 BITS WIDTH
[Table contents not recovered.]

Fig. 7. Analysis of the sorting time required, theoretical (T) and simulated (S), for various numbers of elements of different widths (W). [Plot data for W = 16, 24, 32 not recovered.]
[Figure: sorting time (μs) comparison between the proposed engine and [11]; caption and plot data not recovered.]

REFERENCES
[1] S. Olariu, M. C. Pinotti, and S. Q. Zheng, "An optimal hardware-algorithm for sorting using a fixed-size parallel sorting device," IEEE Transactions on Computers, vol. 49, no. 12, pp. 1310–1324, Dec 2000.
[2] S. Mashimo, T. V. Chu, and K. Kise, "High-performance hardware merge