
2011 Fourth International Symposium on Parallel Architectures, Algorithms and Programming

Thread-level Parallel Algorithm for Sorting Integer Sequence on Multi-core Computers

Zhong Cheng1, Ke Qi1,2, Liu Jun1, Huang Yi-Ran1


1. School of Computer and Electronics and Information, Guangxi University, Nanning, Guangxi, 530004, P. R. China
2. School of Information and Statistics, Guangxi University for Finance and Economics, Nanning, Guangxi, 530004, P. R. China
Email: chzhong@gxu.edu.cn, keqikeruzhen@126.com, liujunzky@163.com, hyr@gxu.edu.cn

Abstract

According to the characteristics of multi-core architectures and the binary storage property of integer sequences, this paper proposes an efficient thread-level parallel algorithm for sorting integer sequences on multi-core computers. The algorithm divides the input integer sequence into several data blocks in main memory and distributes these blocks to the shared L2 cache and the private L1 caches respectively, implements dynamic load balancing among the processing cores, and uses data-level parallel SIMD instructions and a thread-binding technique to speed up the sorting procedure. Experimental results show that the algorithm obtains high speedup and good scalability, and that its execution efficiency is not affected by the data distribution of the input integer sequence.

Keywords: Sorting integers; Multi-core computers; Multi-level caches; Thread-level parallelism; Data-level parallelism; Mapping; Prefix sum

1. Introduction

It is a new challenge to design efficient and scalable sorting algorithms with thread-level parallelism on multi-core systems. Inoue et al. [1] implemented a thread-level parallel sorting algorithm that applies SIMD instructions to merge vectors and eliminate unaligned memory accesses on shared-memory PowerPC 970 MP and Cell computers; however, the algorithm must frequently exchange data between main memory and the caches, so its execution efficiency is low and its speedup is limited. Qu et al. [2] investigated how multi-level data partitioning and thread-level parallelism affect the performance of a parallel multiset sorting algorithm on multi-core computers. Zhong et al. [3] designed an aperiodic multi-round distribution strategy and applied it to design and implement an efficient parallel multiset sorting algorithm on heterogeneous clusters of multi-core computers. Ramprasad and Baruah [4] implemented a parallel radix sort algorithm on the Cell machine that effectively utilizes the capability of each processing core. Cederman and Tsigas [5] implemented quicksort on the GPU, and Greb and Zachmann [6] studied bitonic sorting on the GPU.

Sorting an integer sequence is an important and special data sorting problem. This paper investigates key factors that influence the performance of parallel integer sorting on multi-core computers: the construction of the mapping function, data partitioning and exchange between main memory and the caches, the data distribution of the input integer sequence, data-level parallelism, thread binding, the number of usable processing cores, the number of parallel threads, and load balance among the processing cores and parallel threads. The remainder of this paper is organized as follows. In Section 2, a thread-level parallel algorithm for sorting integer sequences on multi-core computers is proposed and its complexity is analyzed. Section 3 evaluates the execution time, speedup and scalability of the proposed algorithm. Section 4 concludes the paper and outlines the next research direction.

2. Sorting integers with thread-level and data-level parallelism

2.1. Key issues in designing the algorithm

Assume that the input sequence S has N integers, S={s0, s1, s2, …, sN-1}, 0 ≤ si ≤ m, 0 ≤ i < N, where m is a given integer; the shared L2 cache of the multi-core computer can store at most D2 integers, and the private L1 cache of each processing core can store at most D1 integers. Let c2 be the number of integers in main memory distributed to the shared L2 cache each time, c2 = α×D2, 0 < α ≤ 1, and let c1 be the number of integers in the L2 cache distributed to an L1 cache each time, c1 = β×D1, 0 < β ≤ 1, where the optimal values of α and β are determined by experiment.

To reduce the cache miss rate and the access cost between main memory and the L2 cache during parallel sorting, we divide the input sequence of N integers in main memory into ⌈N/c2⌉ data blocks of c2 integers each. To reduce the miss rate between the L2 cache and the L1 caches, each block in the shared L2 cache is further partitioned into c2/c1 segments of c1 integers each.
The performance of a mapping-based sorting algorithm depends mainly on the computational efficiency of the hash function, the degree of hashing conflict, and the number of mapping buckets. If the space required by the buckets exceeds the usable capacity of the L2 cache, some buckets cannot be fully loaded into the L2 cache at one time, page switching is frequently required, and cache misses result. Thus the number of mapping rounds determines the number of sorting iterations, while the number of buckets determines the mapping efficiency, the time-space overhead and the cache misses. Since an integer can be divided into binary segments of any length, we use the value range {0, 1, 2, …, m} of the integers to solve adaptively for the mapping base, so that the required number of mapping buckets is available, there is enough space in the L2 cache to store the mapped data, and the number of mapping rounds is the least.

Let lnum denote the number of mapping rounds, and let lun be the minimum value of lnum that satisfies 2^(⌈log m⌉/lnum) < c2. The hash table consists of bnum buckets A[0..bnum-1], where bnum = 2^(⌈log m⌉/lun), and the mapping base is bs = 2^(⌈log m⌉/lun) - 1.
In each round of parallel mapping, the distribution of the input integers assigned to each processing core and each thread is unknown, and the amount of data mapped to each bucket may be uneven. This would leave the next round of parallel mapping load-imbalanced, so loads must be re-balanced among the processing cores and threads to improve the parallel sorting efficiency. We add a flag bit to each integer of the sequence and assign a continuous ID number to each datum in the previous round of mapping. Using these ID numbers, we can redistribute the data to each processing core and each thread in a load-balanced manner.

To avoid access conflicts among parallel threads, we apply the thread-level parallel prefix-sum algorithm [7] to compute the initial position of each bucket when the data assigned to each thread are mapped into the hash table in the last round of sorting, which also yields each datum's start position in the final sorted array. Using this information, each thread can put its integers directly into their exact locations in the final sorted array.
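The prefix-sum step can be organized, for example, as a two-pass scheme over the per-bucket counts. The sketch below is a generic OpenMP formulation (the exact algorithm of [7] is not reproduced here): each thread sums its chunk of buckets, the chunk totals are combined serially, and a second parallel pass produces the exclusive prefix sums, i.e. the start offsets of the buckets.

    #include <omp.h>

    #define MAX_THREADS 256   /* assumed upper bound on t_num */

    void bucket_offsets(const long *count, long *offset, int bnum)
    {
        long chunk_total[MAX_THREADS] = {0};
        #pragma omp parallel
        {
            int t_num = omp_get_num_threads();
            int t = omp_get_thread_num();
            int lo = (int)((long)t * bnum / t_num);
            int hi = (int)((long)(t + 1) * bnum / t_num);
            long s = 0;
            for (int j = lo; j < hi; j++) s += count[j];   /* pass 1 */
            chunk_total[t] = s;
            #pragma omp barrier
            #pragma omp single
            for (int u = 1; u < t_num; u++)                /* combine totals */
                chunk_total[u] += chunk_total[u - 1];
            /* implicit barrier after single; finish exclusive sums */
            long base = (t == 0) ? 0 : chunk_total[t - 1];
            for (int j = lo; j < hi; j++) {                /* pass 2 */
                offset[j] = base;
                base += count[j];
            }
        }
    }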
We construct the mapping function from the "and" and "right shift" operations supported by the SSE2 SIMD instruction set, and put the integers directly into the corresponding buckets of the hash table. In the l-th mapping round, the binary value of each integer x to be sorted is first shifted ⌈log bs⌉×l bits to the right, where 0 ≤ l < lun; the shifted result, denoted x', is then "and"-ed with bs. The result y is regarded as a bucket number, and x is put directly into bucket y. To implement data-level parallelism, each thread executes SIMD instructions that process 4 integers simultaneously in a 128-bit register. In addition, by aligning the integers of the input sequence, a SIMD instruction reads its data from a single cache line. Hence the parallel integer sorting algorithm eliminates unaligned memory accesses and avoids competition for the same cache line.
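A minimal sketch of this mapping step with SSE2 intrinsics is shown below (our illustration of the technique, assuming 16-byte-aligned input as described above): four 32-bit keys are shifted and masked in one 128-bit register per iteration.

    #include <emmintrin.h>   /* SSE2 intrinsics */
    #include <stdint.h>

    /* Compute bucket numbers y[k] = (s[k] >> shift) & bs for one round,
       where shift = ceil(log2 bs) * l as in the text. */
    void map_round(const int32_t *s, int32_t *y, long n,
                   int shift, int32_t bs)
    {
        __m128i mask = _mm_set1_epi32(bs);
        long k = 0;
        for (; k + 4 <= n; k += 4) {
            __m128i v = _mm_load_si128((const __m128i *)(s + k)); /* aligned load */
            v = _mm_srli_epi32(v, shift);   /* logical right shift of 4 keys */
            v = _mm_and_si128(v, mask);     /* "and" with the mapping base   */
            _mm_store_si128((__m128i *)(y + k), v);
        }
        for (; k < n; k++)                  /* scalar tail */
            y[k] = (s[k] >> shift) & bs;
    }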
If the number of running parallel threads is greater than the number of processing cores, threads will migrate dynamically between cores. We therefore apply a thread-binding technique to bind the threads evenly to the processing cores, so that each processing core is assigned the same number of threads.
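On the experimental platform (Linux, C with OpenMP), the binding can be realized with CPU affinity, for example as in the following sketch; the modulo placement spreads the threads evenly over p cores. This is our illustration, and newer OpenMP runtimes offer OMP_PROC_BIND for the same purpose.

    #define _GNU_SOURCE
    #include <omp.h>
    #include <pthread.h>
    #include <sched.h>

    void bind_threads(int p)            /* p = number of processing cores */
    {
        #pragma omp parallel
        {
            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET(omp_get_thread_num() % p, &set);   /* even spread */
            pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
        }
    }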

2.2. Algorithm description and analysis

The proposed thread-level parallel algorithm for sorting integers on multi-core computers is described as follows.

Algorithm 1 Thread-level parallel sorting of integers on multi-core computers
// Integer sequence S={s0, s1, …, sN-1}, si∈{0, 1, …, m}, i=0~N-1; p is the number of processing cores;
// t_num is the number of threads, Tri is the i-th thread, lun is the number of mapping rounds;
// bs is the mapping base, A is the bucket array, bktj is the j-th bucket;
// Tprefix is a temporary array recording the amount of data in each bucket;
// sorted[0..N-1] is the final sorted result array
Begin
(1) i=0; j=0; k=0; l=0;
(2) for i=1 to ⌈N/c2⌉ do
        start[Tri] = Tri*c2/t_num;
        end[Tri] = (Tri+1)*c2/t_num;
    endfor
(3) for i=0 to t_num-1 do in parallel
        if (l==0) then
            YS = simd_and(s[k], bs, (end[Tri]-start[Tri]));
        else
            temp = simd_sr(s[k], ⌈log bs⌉*l, (end[Tri]-start[Tri]));
            YS = simd_and(temp, bs, (end[Tri]-start[Tri]));
        endif
        for j=start[Tri] to end[Tri] do
            A[bktj] ← YS;
        endfor
        l = l+1;
    endfor
(4) for i=0 to t_num-1 do in parallel
        for j=Tri*bnum/t_num to (Tri+1)*bnum/t_num do
            lengthj += A[bktj-1].length;
        endfor
        for k=start[Tri] to end[Tri]-1 do
            s[k].label = lengthj++;
        endfor
    endfor
(5) if (lun-1 > 0) then goto step (3);
(6) for i=0 to t_num-1 do in parallel
        for k=start[Tri] to end[Tri]-1 do
            s[k].sign ← simd_sr(s[k], lun*⌈log bs⌉, (end[Tri]-start[Tri]));
            Tprefix[Tri][s[k].sign] += 1;
        endfor
    endfor
(7) for i=0 to t_num-1 do in parallel
        Tprefix ← call algorithm prefixComputation(bnum) [7];
    endfor
(8) for i=0 to t_num-1 do in parallel
        for k=s[start[Tri]].label to s[end[Tri]].label-1 do
            sorted[Tprefix[Tri][bkt[s[k].sign]]] = s[k];
        endfor
    endfor
End
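In a C realization, the backward jump in step (5) becomes an outer loop over the lun mapping rounds. The following skeleton is only a structural sketch of Algorithm 1; all helper names are illustrative, with the per-step work described in the text above.

    #include <stdint.h>

    /* Hypothetical per-step helpers, declared for illustration only. */
    void assign_thread_ranges(long N, int t_num);                   /* step (2) */
    void map_round_all_threads(int32_t *s, long N, unsigned l);     /* step (3) */
    void rebalance_and_label(long N, long bnum, int t_num);         /* step (4) */
    void count_buckets(const int32_t *s, long N, int t_num);        /* step (6) */
    void prefix_sum_buckets(long bnum, int t_num);                  /* step (7) */
    void scatter_to_sorted(const int32_t *s, int32_t *out, long N); /* step (8) */

    void algorithm1(int32_t *s, int32_t *sorted, long N,
                    int t_num, unsigned lun, long bnum)
    {
        assign_thread_ranges(N, t_num);
        for (unsigned l = 0; l < lun; l++) {      /* steps (3)-(5) */
            map_round_all_threads(s, N, l);
            rebalance_and_label(N, bnum, t_num);
        }
        count_buckets(s, N, t_num);
        prefix_sum_buckets(bnum, t_num);
        scatter_to_sorted(s, sorted, N);
    }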
The partitioning operation in step (2) requires O(⌈N/c1⌉/(p×t_num)) time. Step (3) uses SIMD instructions, each of which processes 4 integers simultaneously; since each mapping operation takes O(1) time, step (3) requires O(⌈N/c1⌉/(p×t_num×4)) time. Step (4) requires O(bnum/(p×t_num)) + O(⌈N/c1⌉/(p×t_num)) = O((bnum+⌈N/c1⌉)/(p×t_num)) time. Step (5) executes steps (3) and (4) a total of (lun-1) times, so it requires (lun-1)×(O(⌈N/c1⌉/(p×t_num×4)) + O((bnum+⌈N/c1⌉)/(p×t_num))) time. Step (6) also uses SIMD instructions, and its time complexity is O(⌈N/c1⌉/(p×t_num×4)). From reference [7] we know that step (7) requires O(⌈bnum/c1⌉/(p×t_num)) time. Step (8) requires O(⌈N/c1⌉/(p×t_num)) time. Therefore, the time complexity of Algorithm 1 is T(n) = O((lun×(⌈N/c1⌉+bnum) + ⌈bnum/c1⌉)/(p×t_num)).

Note that the hash table has bnum = 2^(⌈log m⌉/lun) buckets, and the design of Algorithm 1 must satisfy bnum = 2^(⌈log m⌉/lun) < c2, that is, lun must be at least ⌈log m⌉/⌈log c2⌉. Hence, when the number of mapping rounds is lun = ⌈log m⌉/⌈log c2⌉, the time complexity of Algorithm 1, T(n) = O((⌈log m⌉/⌈log c2⌉×(⌈N/c1⌉+bnum) + ⌈bnum/c1⌉)/(p×t_num)), is the least.
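As an illustrative instance (numbers chosen by us, not taken from the experiments): for keys with ⌈log m⌉ = 20 bits and an L2 share of c2 = 2^11 integers, lun = 2 is the least round count satisfying 2^(20/2) = 2^10 < 2^11, which gives bnum = 1024 buckets and mapping base bs = 1023; each key is then consumed in two 10-bit digits, and T(n) is evaluated with lun = 2.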
3. Experiment

3.1. Experimental environment

The multi-core machine is a quad-core Intel Core(TM) 2 Quad computer with 2 GB of main memory; the processing cores share a 12 MB L2 cache, and each processing core has a private 32 KB L1 cache. The operating system is Red Hat Enterprise Linux 5. The programming language is C with OpenMP.

3.2. Experimental results and analysis

For an input sequence of 16 million integers, Figure 1 shows the time required to execute Algorithm 1 using 4 processing cores, 8 threads, and different values of α and β.

Figure 1 Execution time of Algorithm 1 using 4 processing cores, 8 threads, 16 million integers and different values of α and β (α = 0.3-1.0, β = 0.1-1.0)

From Figure 1 we can see that the execution time of Algorithm 1 is least when α=0.5 and β=0.9. That is, when the number c2 of integers distributed from main memory to the L2 cache each time equals 0.5×D2 and the number c1 of integers distributed from the L2 cache to an L1 cache each time equals 0.9×D1, the utilization of the shared L2 cache and the private L1 caches is highest, the amount of data exchanged between cache and main memory is decreased, and the cache miss rate is reduced.

For input integer sequences of different sizes, Figures 2 and 3 give the time required to execute Algorithm 1 using 4 and 3 processing cores respectively, as the number of parallel threads increases.

Figure 2 Execution time of Algorithm 1 using 4 processing cores as the number of parallel threads increases (1-32 million integers, 4-16 threads)

Figure 3 Execution time of Algorithm 1 using 3 processing cores as the number of parallel threads increases (1-32 million integers, 3-12 threads)
Figures 2 and 3 show that the execution time of Algorithm 1 is least when the number of running parallel threads is twice the number of processing cores; this number of parallel threads is called the optimal number of parallel threads. If Algorithm 1 is executed with more parallel threads, the overhead of starting, stopping and switching threads grows, and the execution time of Algorithm 1 gradually increases. In addition, Algorithm 1 uses the thread-binding technique to assign the threads to the processing cores so that each processing core executes the same number of threads. In other words, Algorithm 1 balances the load among the processing cores by thread binding, and its execution time is less than that of Algorithm 1 without thread binding.
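This rule of thumb can be applied at start-up, as in the following minimal sketch (assuming the OpenMP runtime reports the number of cores correctly):

    #include <omp.h>

    int main(void)
    {
        int p = omp_get_num_procs();    /* available processing cores     */
        omp_set_num_threads(2 * p);     /* optimal thread count, 2x cores */
        /* ... run Algorithm 1 ... */
        return 0;
    }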
Figure 4 shows the execution times of Algorithm 1; of Algorithm 1 without SIMD instructions, denoted Algorithm 1-1; of Algorithm 1 without multi-level data partitioning, denoted Algorithm 1-2; and of Algorithm 1 without load balancing, denoted Algorithm 1-3.

Figure 4 Execution time for Algorithm 1, Algorithm 1-1, Algorithm 1-2 and Algorithm 1-3 (16 and 32 million integers; 2, 3 and 4 cores)

Firstly, Figure 4 shows that Algorithm 1, which uses data-level parallel SIMD instructions, is faster than Algorithm 1-1, which does not: compared with Algorithm 1-1, the execution time of Algorithm 1 is reduced by 15% to 24%, since Algorithm 1 fully develops the advantage of fine-grained parallelism. Secondly, the execution time of Algorithm 1 is less than that of Algorithm 1-2 without multi-level data partitioning, because Algorithm 1 uses the L2 and L1 caches efficiently to reduce data exchange between cache and main memory. Finally, the execution time of Algorithm 1 is also less than that of Algorithm 1-3 without load balancing.
For input integer sequences of different sizes, we executed Algorithm 1 with the optimal number of threads to test how the data distribution of the input sequence affects the performance of parallel sorting. The experimental results are shown in Figure 5.

Figure 5 Execution time of Algorithm 1 with different data distributions (almost sorted, uniform, random and very uneven; 16 and 32 million integers; 2, 3 and 4 cores)

Figure 5 shows that for an almost sorted input sequence, a uniformly distributed input sequence, a randomly distributed input sequence and a very unevenly distributed one, the execution time of Algorithm 1 is essentially the same. That is, the execution performance of Algorithm 1 is almost independent of the distribution of the integers in the input sequence.
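The paper does not define the four input classes precisely; the following C generators are one plausible reading (our assumptions), useful for reproducing tests of this kind. m is assumed small enough that rand() % (m+1) is meaningful.

    #include <stdint.h>
    #include <stdlib.h>

    void gen_uniform(uint32_t *a, long n, uint32_t m)
    {
        for (long i = 0; i < n; i++)
            a[i] = (uint32_t)rand() % (m + 1);
    }

    void gen_almost_sorted(uint32_t *a, long n, uint32_t m)
    {
        for (long i = 0; i < n; i++)            /* sorted ramp over [0, m] */
            a[i] = (uint32_t)((double)i / (double)n * m);
        for (long i = 0; i < n / 100; i++) {    /* perturb about 1% of keys */
            long j = rand() % n, k = rand() % n;
            uint32_t t = a[j]; a[j] = a[k]; a[k] = t;
        }
    }

    void gen_very_uneven(uint32_t *a, long n, uint32_t m)
    {
        for (long i = 0; i < n; i++)            /* ~90% of keys in a narrow band */
            a[i] = (rand() % 10) ? (uint32_t)rand() % (m / 100 + 1)
                                 : (uint32_t)rand() % (m + 1);
    }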
Figure 6 displays the execution time of Algorithm 1 using 4 processing cores and the optimal number of parallel threads, together with the execution time of a serial radix sort algorithm.

Figure 6 Execution time for Algorithm 1 and the serial radix sort algorithm (2-32 million integers)

From Figure 6 we can see that the execution time of Algorithm 1 is much less than that of the serial radix sort algorithm. Furthermore, as the size of the input increases, the execution time of Algorithm 1 grows slowly while that of the serial radix sort grows sharply.

For input integer sequences of different sizes, Figure 7 displays the speedup obtained by Algorithm 1 using different numbers of processing cores and the optimal number of parallel threads.

Figure 7 Speedup of Algorithm 1 using the optimal number of parallel threads (2 cores/4 threads, 3 cores/6 threads, 4 cores/8 threads; 1-32 million integers)

Figure 7 shows that Algorithm 1 obtains good speedup and that the growth trend of its speedup is sublinear.
Figure 8 gives the curve of the equivalent efficiency function of Algorithm 1 using the optimal number of parallel threads as the number of running processing cores is gradually increased.

Figure 8 Curve of the equivalent efficiency function for Algorithm 1 (number of data, 0-40 million, versus 1-4 processing cores)
We can see from Figure 8 that Algorithm 1 still maintains equivalent efficiency when the workload increases in sublinear proportion to the number of running processing cores. Therefore Algorithm 1, the presented thread-level parallel algorithm for sorting integers on multi-core computers, has good scalability.

4. Conclusion

For the problem of sorting integer sequences, this paper proposes an efficient and scalable thread-level parallel algorithm on multi-core computers. The algorithm has several characteristics. Firstly, it adaptively solves for the value of the mapping base so that the shared L2 cache can completely hold the data in the mapping buckets, avoiding cache misses. Secondly, it partitions the input sequence into data blocks and segments according to the sizes of the L2 and L1 caches, to reduce data exchange and communication overhead between cache and main memory. Thirdly, it balances the load dynamically among the parallel threads. Fourthly, it applies a thread-level parallel prefix-sum algorithm to compute the final storage location of each integer in the sorted array, avoiding access conflicts among parallel threads. Fifthly, it uses data-level parallel SIMD instructions to speed up the sorting process. Sixthly, it reduces thread migration by applying the thread-binding technique. Finally, the execution performance of the algorithm is not affected by the distribution of the integers in the input sequence. The idea of the presented algorithm is also helpful for developing thread-level parallel algorithms for sorting floating-point data on multi-core computers. The next step is to design an efficient and scalable integer sorting algorithm with process-level and thread-level parallelism on heterogeneous clusters of multi-core computers.

ACKNOWLEDGMENTS

This work is supported in part by the National Natural Science Foundation of China under grant No. 60963001, and by the Project of Outstanding Innovation Teams Construction Plans at Guangxi University.

References

[1] H. Inoue, T. Moriyama, H. Komatsu, and T. Nakatani, "AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors", Proc. of PACT '07, 2007, pp. 189-198.
[2] Zeng-yan Qu, Cheng Zhong, Xia Li, "Parallel Sorting for Multisets on Multi-core Computers", Proc. of the Second International Symposium on Parallel Architectures, Algorithms and Programming, University of Science and Technology of China Press, 2009, pp. 135-162.
[3] Cheng Zhong, Zeng-yan Qu, Feng Yang, Meng-xiao Yin, "Parallel Multisets Sorting Using Aperiodic Multi-round Distribution Strategy on Heterogeneous Multi-core Clusters", Proc. of the Third International Symposium on Parallel Architectures, Algorithms and Programming, IEEE Computer Society Press, 2010, pp. 247-254.
[4] N. Ramprasad, Pallav Kumar Baruah, "Radix Sort on the Cell Broadband Engine", Int'l Conf. High Performance Computing (HiPC) Posters, Dec. 18-21, 2007, Goa, India.
[5] Daniel Cederman, Philippas Tsigas, "On Sorting and Load Balancing on GPU", ACM SIGARCH Computer Architecture News, Vol. 36, No. 5, 2008, pp. 11-18.
[6] A. Greb, G. Zachmann, "GPU-ABiSort: Optimal Parallel Sorting on Stream Architectures", Proceedings of the 2006 International Parallel and Distributed Processing Symposium (IPDPS '06), 2006, pp. 25-29.
[7] Ke Qi, Cheng Zhong, Zhi Li, Gangqiang Wang, "Thread-level Parallel Algorithm for Maximum Sum Subsequence on Multi-core Computers", 2010 Progress in Computer Technology and Applications, University of Science and Technology of China Press, 2010, pp. 586-590.
among parallel threads. Fourthly, it applies thread-level
