
2010 Fourth Asia International Conference on Mathematical/Analytical Modelling and Computer Simulation

Clump Sort: A Stable Alternative To Heap Sort For Sorting Medical Data

Visvasuresh Victor Govindaswamy Matthew Caudill Jeff Wilson


Computer Science Computer Science Computer Science
Texas A&M University-Texarkana Texas A&M University-Texarkana Texas A&M University-Texarkana
Texarkana, USA Texarkana, USA Texarkana, USA
lovebat814@yahoo.com

Daniel Brower G. Balasekaran, FACSM


Computer Science Medical and Sports Science
Texas A&M University-Texarkana Nanyang Technology University
Texarkana, USA Singapore

Abstract—Sorting data sets is a thoroughly researched field. Several sorting algorithms have been introduced and these include Bubble, Insertion, Selection, Shell, Quick, Merge and Heap. In this paper, we present a novel sorting algorithm, named Clump Sort, to take advantage of ordered segments already present in medical data sets. It succeeds in sorting the medical data considerably better than all the sorts except when using totally non-clumped data. In this test using totally non-clumped data, Heap sort does only slightly better than Clump sort. However, Clump sort has the advantage of being a stable sort, as the original order of equal elements is preserved, whereas Heap sort is not, since it does not guarantee that equal elements will appear in their original order after sorting. As such, Clump Sort will have considerably better data cache performance with both clumped and non-clumped data, outperforming Heap Sort on a modern desktop PC, because it accesses the elements in order. Sorting equal elements in the correct order is essential for sorting medical data.

Keywords-algorithms; design; experimentation; performance

I. INTRODUCTION

Bubble sort, one of the simplest and one of the slowest, is a famous example used for teaching algorithms to students. It works by cycling through the list repeatedly and swapping each pair of out-of-order data until the entire list is in ascending or descending order. On a similar order of inefficiency with the Bubble sort algorithm is the Selection sort algorithm, which works by identifying the lowest remaining array element and relocating it at the end of the already sorted segment. In contrast, the Insertion sort algorithm places each element directly into its position. These three sorts have a computational complexity that is roughly O(n^2), where n is the size of the list. Hence, they are not competitive for real-world situations, especially when dealing with large unsorted data sets. On the other hand, Shell sort, a modification of the Insertion sort algorithm, employs more efficient swapping methods and accomplishes an efficiency of roughly O(n log^2 n). The Quick sort algorithm employs a divide and conquer approach and manages to achieve an average of O(n log n), but in the worst case achieves O(n^2). The Merge sort algorithm takes advantage of a special technique for merging sorted lists. When merging two sorted lists, it compares the smallest element of each list. The Heap sort algorithm involves building a binary tree with the array elements and then reloading the elements into the list. Both Merge and Heap also attain O(n log n) efficiency.

All of the above sorts [1] have been evaluated using a list of random integers, and are designed to sort a data set about which no assumptions are made. However, in real-world situations, characteristics of data are rarely random and unknown. Often, data contains some order or partial level of organization. We developed an algorithm designed specifically to take advantage of ordered segments already present in a data set. We call it the Clump sort.

Section II explains the Clump sort algorithm and Section III describes the sorting model used. In Section IV, we carry out comparison studies and analysis of our sort algorithm with other well-known sort algorithms. Finally, in Section V, we conclude the paper.

II. CLUMP SORT ALGORITHM

The Clump Sort Algorithm (Figs. 1, 2 and 3) is as follows:
1. Cycle through the list, record the start and end of all already ordered segments, even if these segments consist of only one element. Each segment is called a clump. The size of a clump could be anywhere between 1 and n/2, where n is the size of the list.
2. Merge these segments or clumps.

978-0-7695-4062-7/10 $26.00 © 2010 IEEE 227
DOI 10.1109/AMS.2010.53

Authorized licensed use limited to: Ingrid Nurtanio. Downloaded on October 01,2020 at 02:20:45 UTC from IEEE Xplore. Restrictions apply.

Fig. 1 shows that the Clump Sort identifies segments that
are already in order prior to sorting. The start and end
points of each of these segments are then recorded. We call
each of the sorted segments a clump. In Fig. 2, these sorted
clumps are merged by repeatedly taking the smaller element of
the two segments and moving it to the top of a third
segment. This results in a larger ordered third segment, or
clump. In Fig. 3, each iteration alternates between the “to” and
“from” lists. In the first iteration, the clumps from the first
array are merged into an equivalent segment of the second
array. In the second iteration, the algorithm takes these
merged clumps and further merges them, writing over the
first array. “Non-partnered” clumps are passed directly
between lists.
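
As an illustration only, the two steps and the alternating “to” and “from” arrays described above can be sketched in Python. The paper gives no source code, so the function names find_clumps, merge_pair and clump_sort, the optional key parameter, and the choice to merge adjacent clumps pairwise are all assumptions made for this sketch; structurally it resembles what is commonly called a natural merge sort. Taking the element from the left clump whenever two keys are equal is what keeps such a sketch stable.

```python
def find_clumps(data, key):
    """Step 1: record (start, end) of each already ordered segment (end exclusive)."""
    clumps = []
    start = 0
    for i in range(1, len(data)):
        # A strict decrease ends the current clump; equal keys stay together.
        if key(data[i]) < key(data[i - 1]):
            clumps.append((start, i))
            start = i
    if data:
        clumps.append((start, len(data)))
    return clumps


def merge_pair(src, dst, lo, mid, hi, key):
    """Merge the adjacent clumps src[lo:mid] and src[mid:hi] into dst[lo:hi]."""
    i, j, k = lo, mid, lo
    while i < mid and j < hi:
        # Take from the right clump only when it is strictly smaller,
        # so equal elements keep their original order.
        if key(src[j]) < key(src[i]):
            dst[k] = src[j]
            j += 1
        else:
            dst[k] = src[i]
            i += 1
        k += 1
    # Copy whatever remains of the clump that was not exhausted.
    dst[k:hi] = src[i:mid] if i < mid else src[j:hi]


def clump_sort(items, key=lambda x: x):
    """Step 2: merge clumps pairwise, alternating between two arrays."""
    src = list(items)   # "from" list
    dst = src[:]        # "to" list (second array of the same size)
    clumps = find_clumps(src, key)
    while len(clumps) > 1:
        merged = []
        for c in range(0, len(clumps) - 1, 2):
            lo, mid = clumps[c]
            _, hi = clumps[c + 1]
            merge_pair(src, dst, lo, mid, hi, key)
            merged.append((lo, hi))
        if len(clumps) % 2 == 1:
            # A "non-partnered" clump is passed directly between lists.
            lo, hi = clumps[-1]
            dst[lo:hi] = src[lo:hi]
            merged.append((lo, hi))
        clumps = merged
        src, dst = dst, src  # swap the roles of the two lists
    return src


if __name__ == "__main__":
    # Hypothetical records (label, key); equal keys must keep their input order.
    records = [("b", 2), ("a", 1), ("c", 2), ("d", 1)]
    print(clump_sort(records, key=lambda r: r[1]))
    # prints [('a', 1), ('d', 1), ('b', 2), ('c', 2)]
```

In this sketch the second array has the same length as the input, and each pass halves the number of clumps, so fully clumped (already sorted) input needs no merge pass at all.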

Figure 1. Identification of ordered segments prior to sorting.

Figure 2. Merging of 3 clumps.

Figure 3. Four iteration sorting.

III. SORTING MODEL

We used the traditional comparison sorting model, which is a very abstract model of sorting, to test how efficient our algorithm is when compared to other existing sorting algorithms. The comparison sorting model sorts a list based only on comparisons of pairs and does not use other information about what is being sorted, for instance arithmetic on numbers. In our experiments, we used the same random set of medical data for all the algorithms, and these sets were sorted in ascending order.
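
To make the comparison sorting model concrete, the following Python sketch sorts a list while touching the elements only through a pairwise comparison function, and counts how many comparisons were made. It is not the test harness used in our experiments; the helper name sort_counting_comparisons is hypothetical, and Python's built-in sorted is used merely as a stand-in for any comparison sort.

```python
from functools import cmp_to_key


def sort_counting_comparisons(items):
    """Sort ascending using only pairwise comparisons.

    Returns the sorted list and the number of comparisons performed.
    """
    count = 0

    def compare(a, b):
        # The only operation allowed on elements: compare one pair.
        nonlocal count
        count += 1
        return (a > b) - (a < b)

    result = sorted(items, key=cmp_to_key(compare))
    return result, count


if __name__ == "__main__":
    ordered, comparisons = sort_counting_comparisons([3, 1, 2, 5, 4])
    print(ordered, comparisons)
```

Counting comparisons in this way gives a machine-independent measure that can be reported alongside wall-clock time.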

IV. COMPARISON STUDIES AND ANALYSIS

To test how efficient our algorithm was, we ran many experiments. They were performed on multicore CPUs providing 2–4 scalar cores. We used the same random set of medical data for all the algorithms. We also varied the data set from 0 to 100 GB of medical data. Due to space limitations, we present here two comparison tests against the three other sorts that achieved the best results in our preliminary tests. For clarity of the plots, we show data sets ranging from 0 to 10,000 data elements. The other three sorts were the Merge, Quick and Heap Sorts. The results are shown as plots with explanation in Figs. 4 and 5. The y-axis represents the time in seconds while the x-axis represents the number of data elements. All tests were run consecutively without interference so as to avoid the effects of any other processes running at the same time. The data for Fig. 4 was non-clumped or completely disordered while, for Fig. 5, the data had an average clump size of roughly one fourth the size of the data set.

[Plot: time in seconds (0 to 1.6) versus number of data elements (0 to 10000); series: clump, heap, merge, quick.]
Figure 4. Sorting with non-clumped data.

In the first test, shown in Fig. 4, the ranking of the sorting algorithms from best to worst was: Heap, Clump, Quick and Merge, while in the second test, shown in Fig. 5, it was: Clump, Heap, Merge and Quick. Hence, in all of our experiments, except when dealing with non-clumped data, Clump performed the best. With non-clumped data, Heap performs better than Clump. However, due to its inherent nature, Heap makes poor use of the principle of locality [1], which is the phenomenon of the same value or related storage locations being frequently accessed. Moreover, Clump is more easily parallelizable than Heap and can be further improved by the methodology presented in [2][3]. Clump is also a stable sort, as the original order of equal elements is preserved, whereas Heap [4] is not, since it does not guarantee that equal elements will appear in their original order after sorting. As such, Clump Sort will have considerably better data cache performance with both clumped and non-clumped data, outperforming Heap Sort on a modern desktop PC, because it accesses the elements in order.

[Plot: time in seconds (0 to 5) versus number of data elements (0 to 10000); series: clump, heap, merge, quick.]
Figure 5. Sorting with clumped data.

Another anomaly worthy of remark in our results is the fluctuation in Quicksort's performance on the clumped data set, compared to its relatively uniform performance on the non-clumped data set. It is likely that this is due to Quicksort's sensitivity to the order already present within the data. This differs from the Merge Sort, which mostly ignores how ordered the initial data is, and the Heap sort, which performs some preconditioning of the data before loading its binary trees. Clump sort not only shows potential to compete with the more prominent sorting algorithms, but also shows definite advantages in cases where the data contains sorted segments or clumps, regardless of whether they overlap in range or not. Table 1 below shows the various computational complexities and memory usages for the Clump Sort and the more prominent sorts. Clump Sort uses less memory than Heap Sort.

V. CONCLUSION

Clump Sort succeeded at sorting the data considerably better than all the sorts except for the test using non-clumped data. In that test, Heap Sort performed better. We plan to
improve upon the Clump Sort algorithm to handle non-clumped data better.

TABLE 1. COMPLEXITIES AND MEMORY USAGES

Sort        Computational Complexity   Memory
Bubble      O(n^2)                     O(1)
Selection   O(n^2)                     O(1)
Insertion   O(n^2)                     O(1)
Shell       O(n log^2 n)               O(1)
Quick       O(n log n)                 O(log n)
Merge       O(n log n)                 O(n)
Heap        O(n log n)                 O(n)
Clump       O(n log n)                 O(log n)

REFERENCES

[1] Clifford A. Shaffer, Practical Introduction to Data Structures and Algorithm Analysis (C++ Edition), 2/E, Prentice Hall, 2000.
[2] V.V. Govindaswamy, G. Balasekaran, J. Marquis and B.A. Shirazi, “A faster implementation of sequential sorting algorithms using the PARSA methodology,” Electrical and Computer Engineering, 2003. IEEE CCECE 2003, Canadian Conference on, Volume 2, 4-7 May 2003, Page(s): 1313-1316 vol. 2.
[3] V. Govindaswamy, “Sorting Algorithms for PARSA,” Technical Report, University of Texas at Arlington, 2002.
[4] J. Teuhola and L. Wegner, “The External Heapsort,” IEEE Trans. on Software Eng., Vol. 15, No. 7 (July 1989) 917-925.
