
Parallel Sorting

John Mellor-Crummey
Department of Computer Science, Rice University
johnmc@rice.edu

COMP 422

Lecture 19, 5 April 2011

Topics for Today
• Introduction
• Issues in parallel sorting
• Sorting networks and Batcher’s bitonic sort
• Bubble sort and odd-even transposition sort
• Parallel quicksort


Why Study Parallel Sorting?
• One of the most common operations performed
• Close relation to the task of routing on parallel computers
  —e.g., the HPC Challenge RandomAccess benchmark


Sorting Algorithm Attributes
• Internal vs. external
  —internal: data fits in memory
  —external: uses tape or disk
• Comparison-based or not
  —comparison sort
    – basic operation: compare elements and exchange as necessary
    – Θ(n log n) comparisons to sort n numbers
  —non-comparison-based sort
    – e.g., radix sort, based on the binary representation of data
    – Θ(n) operations to sort n numbers
• Parallel vs. sequential


Parallel Sorting Is Intrinsically Interesting

Different algorithms for different architecture variants
• Abstract parallel architecture
  —PRAM
• Network topology
  —hypercube
  —mesh
• Communication mechanism
  —shared address space
  —message passing

Today’s focus: parallel comparison-based sorting

Parallel Sorting Basics
• Where are the input and output lists stored?
  —we assume that both input and output lists are distributed
• What is a parallel sorted sequence?
  —sequence partitioned among the processors
  —each processor’s sub-sequence is sorted
  —all elements in Pj’s sub-sequence < all elements in Pk’s sub-sequence if j < k
    – the best process numbering can depend on network topology

Element-wise Parallel Compare-Exchange
When partitioning is one element per process
1. Processes Pj and Pk send their elements to each other [communication step]
   —each process now has both elements aj and ak
2. Process Pj keeps min(aj, ak) and Pk keeps max(aj, ak) [comparison step]
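
Below is a minimal C/MPI sketch of this step; the helper name compare_exchange and its parameters (partner rank, keep_min flag) are illustrative, not from the slides.

#include <mpi.h>

/* One compare-exchange step, one element per process (illustrative
   sketch).  Both processes exchange elements in a single MPI_Sendrecv;
   the caller says whether this process keeps the min or the max. */
double compare_exchange(double my_elem, int partner, int keep_min,
                        MPI_Comm comm)
{
    double other;
    /* communication step: send our element, receive the partner's */
    MPI_Sendrecv(&my_elem, 1, MPI_DOUBLE, partner, 0,
                 &other,   1, MPI_DOUBLE, partner, 0,
                 comm, MPI_STATUS_IGNORE);
    /* comparison step: keep the appropriate element of the pair */
    if (keep_min)
        return (my_elem < other) ? my_elem : other;
    return (my_elem > other) ? my_elem : other;
}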

Bulk Parallel Compare-Split
1. Send block of size n/p to partner
2. Each partner now has both blocks
3. Merge received block with own block
4. Retain only the appropriate half of the merged block
   —Pi retains smaller values; Pj retains larger values
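
A hedged C/MPI sketch of the four steps above; the name compare_split, the block size m = n/p, and the keep_small flag are illustrative, and both local blocks are assumed to be already sorted.

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* Compare-split, n/p elements per process (illustrative sketch).
   Assumes block[0..m) is sorted; keep_small selects which half of
   the merged sequence this process retains. */
void compare_split(double *block, int m, int partner, int keep_small,
                   MPI_Comm comm)
{
    double *other  = malloc(m * sizeof *other);
    double *merged = malloc(2 * m * sizeof *merged);
    int i = 0, j = 0, k = 0;

    /* steps 1-2: exchange blocks so each partner has both */
    MPI_Sendrecv(block, m, MPI_DOUBLE, partner, 0,
                 other, m, MPI_DOUBLE, partner, 0,
                 comm, MPI_STATUS_IGNORE);

    /* step 3: merge the two sorted blocks in O(n/p) time */
    while (i < m && j < m)
        merged[k++] = (block[i] <= other[j]) ? block[i++] : other[j++];
    while (i < m) merged[k++] = block[i++];
    while (j < m) merged[k++] = other[j++];

    /* step 4: retain only the appropriate half */
    memcpy(block, keep_small ? merged : merged + m, m * sizeof *block);
    free(other);
    free(merged);
}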

Basic Analysis
• Assumptions
  —Pi and Pj are neighbors
  —communication channels are bi-directional
• Element-wise compare-exchange: 1 element per processor
  —time = ts + tw
• Bulk compare-split: n/p elements per processor
  —after compare-split on a pair of processors Pi and Pj, i < j
    – smaller n/p elements are at processor Pi
    – larger n/p elements at Pj
  —time = ts + tw·(n/p)
  —merge in O(n/p) time, as long as partial lists are sorted

Sorting Network
• Network of comparators designed for sorting
• Comparator: two inputs x and y; two outputs x' and y'
  —types
    – increasing (denoted ⊕): x' = min(x,y) and y' = max(x,y)
    – decreasing (denoted Ө): x' = max(x,y) and y' = min(x,y)
• Sorting network speed is proportional to its depth

Sorting Networks
• Network structure: a series of columns
• Each column consists of a vector of comparators (operating in parallel)
• Sorting network organization: [figure]

Example: Bitonic Sorting Network
• Bitonic sequence
  —two parts: increasing and decreasing
    – 〈1,2,4,7,6,0〉: first increases and then decreases (or vice versa)
  —cyclic rotation of a bitonic sequence is also considered bitonic
    – 〈8,9,2,1,0,4〉: cyclic rotation of 〈0,4,8,9,2,1〉
• Bitonic sorting network
  —sorts n elements in Θ(log² n) time
  —network kernel: rearranges a bitonic sequence into a sorted one

Bitonic Split
• Let s = 〈a0, a1, …, an-1〉 be a bitonic sequence such that
  —a0 ≤ a1 ≤ ··· ≤ an/2-1, and
  —an/2 ≥ an/2+1 ≥ ··· ≥ an-1
• Consider the following subsequences of s
  —s1 = 〈min(a0, an/2), min(a1, an/2+1), …, min(an/2-1, an-1)〉
  —s2 = 〈max(a0, an/2), max(a1, an/2+1), …, max(an/2-1, an-1)〉
• Sequence properties
  —s1 and s2 are both bitonic
  —∀x ∈ s1, ∀y ∈ s2: x < y
• Apply recursively on s1 and s2 to produce a sorted sequence
• Works for any bitonic sequence, even if |s1| ≠ |s2|
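
A small shared-address-space sketch of one bitonic split in C; the name bitonic_split is illustrative, and n is assumed even.

/* One bitonic split of a bitonic sequence a[0..n) (illustrative
   sketch, n even).  Afterwards a[0..n/2) = s1 and a[n/2..n) = s2:
   both halves are bitonic, and every element of s1 <= every element
   of s2. */
void bitonic_split(double *a, int n)
{
    for (int i = 0; i < n / 2; i++)
        if (a[i] > a[i + n / 2]) {        /* compare-exchange */
            double tmp = a[i];
            a[i] = a[i + n / 2];
            a[i + n / 2] = tmp;
        }
}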

Splitting Bitonic Sequences - I
Sequence properties
—s1 and s2 are both bitonic
—∀x ∈ s1, ∀y ∈ s2: x < y
[figure: min and max halves of the split]

Splitting Bitonic Sequences - II
Sequence properties
—s1 and s2 are both bitonic
—∀x ∈ s1, ∀y ∈ s2: x < y
[figure: min and max halves of the split]

Bitonic Merge
Sort a bitonic sequence through a series of bitonic splits
Example: use bitonic merge to sort a 16-element bitonic sequence
How: perform a series of log₂ 16 = 4 bitonic splits

Sorting via Bitonic Merging Network
• Sorting network can implement the bitonic merge algorithm
  —bitonic merging network
• Network structure
  —log₂ n columns
  —each column
    – n/2 comparators
    – performs one step of the bitonic merge
• Bitonic merging network with n inputs: ⊕BM[n]
  —yields increasing output sequence
• Replacing ⊕ comparators by Ө comparators: ӨBM[n]
  —yields decreasing output sequence

Bitonic Merging Network, ⊕BM[16]
• Input: bitonic sequence
  —input wires are numbered 0, 1, …, n-1 (shown in binary)
• Output: sequence in sorted order
• Each column of comparators is drawn separately

Bitonic Sort
How do we sort an unsorted sequence using a bitonic merge?
Two steps
• Build a bitonic sequence
• Sort it using a bitonic merging network

Building a Bitonic Sequence
• Build a single bitonic sequence from the given sequence
  —any sequence of length 2 is a bitonic sequence
  —build a bitonic sequence of length 4
    – sort first two elements using ⊕BM[2]
    – sort next two using ӨBM[2]
• Repeatedly merge to generate larger bitonic sequences
  —⊕BM[k] & ӨBM[k]: bitonic merging networks of size k
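
The whole recursion fits in a compact sequential C sketch (assumes n is a power of two; dir = 1 plays the role of ⊕ and dir = 0 of Ө; function names are illustrative).

/* Bitonic merge: recursive bitonic splits turn a bitonic sequence
   into a sorted one; dir = 1 sorts increasing (like ⊕BM[n]),
   dir = 0 decreasing (like ӨBM[n]). */
void bitonic_merge(double *a, int n, int dir)
{
    if (n == 1) return;
    for (int i = 0; i < n / 2; i++)       /* one bitonic split */
        if ((a[i] > a[i + n / 2]) == dir) {
            double tmp = a[i]; a[i] = a[i + n / 2]; a[i + n / 2] = tmp;
        }
    bitonic_merge(a, n / 2, dir);
    bitonic_merge(a + n / 2, n / 2, dir);
}

/* Bitonic sort: sort the halves in opposite directions to build a
   bitonic sequence, then merge the whole thing. */
void bitonic_sort(double *a, int n, int dir)
{
    if (n == 1) return;
    bitonic_sort(a, n / 2, 1);            /* ascending half  (⊕) */
    bitonic_sort(a + n / 2, n / 2, 0);    /* descending half (Ө) */
    bitonic_merge(a, n, dir);
}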

Building a Bitonic Sequence
Input: sequence of 16 unordered numbers
Output: a bitonic sequence of 16 numbers

Bitonic Sort, n = 16
• First 3 stages create the bitonic sequence input to stage 4
• Last stage (⊕BM[16]) yields the sorted sequence

Complexity of Bitonic Sorting Networks
• Depth of the network is Θ(log² n)
  —log₂ n merge stages
  —the jth merge stage has depth log₂ 2^j = j
  —depth = ∑_{j=1}^{log₂ n} j = (log₂ n)(log₂ n + 1)/2 = Θ(log² n)
• Each stage of the network contains n/2 comparators
• Complexity of a serial implementation = Θ(n log² n)

Mapping Bitonic Sort to a Hypercube
Consider one item per processor
• How do we map wires in the bitonic network onto a hypercube?
• In earlier examples
  —compare-exchange between two wires whose labels differ in 1 bit
• Direct mapping of wires to processors
  —all communication is nearest-neighbor

Mapping Bitonic Merge to a Hypercube
Communication during the last merge stage of bitonic sort
• Each number is mapped to a hypercube node
• Each connection represents a compare-exchange

Mapping Bitonic Sort to Hypercubes
Communication in bitonic sort on a hypercube
• Processes communicate along the dimensions shown in each stage
• Algorithm is cost optimal w.r.t. its serial counterpart
• Not cost optimal w.r.t. the best sorting algorithm
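
The pairing logic is easy to express in C: the partner at each step is the node whose label differs in one bit. This sketch assumes one element per process on a 2^d-node hypercube and reuses the compare_exchange helper sketched earlier; names are illustrative.

#include <mpi.h>

double compare_exchange(double my_elem, int partner, int keep_min,
                        MPI_Comm comm);   /* from the earlier sketch */

/* Bitonic sort on a d-dimensional hypercube, one element per process.
   Stage i (i = 1..d) does compare-exchanges along dimensions
   i-1, ..., 0; each partner's label differs in exactly one bit. */
double hypercube_bitonic_sort(double my_elem, int d, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);
    for (int stage = 1; stage <= d; stage++) {
        /* this block sorts ascending if bit `stage` of rank is 0 */
        int ascending = ((rank >> stage) & 1) == 0;
        for (int dim = stage - 1; dim >= 0; dim--) {
            int partner  = rank ^ (1 << dim);   /* flip one bit */
            int keep_min = ascending ? (rank < partner)
                                     : (rank > partner);
            my_elem = compare_exchange(my_elem, partner, keep_min, comm);
        }
    }
    return my_elem;
}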

Batcher’s Bitonic Sort in NESL

function merge(a) =
  if (#a == 1) then a
  else let
      halves = bottop(a);
      mins = {min(x, y) : x in halves[0]; y in halves[1]};
      maxs = {max(x, y) : x in halves[0]; y in halves[1]};
  in flatten({merge(x) : x in [mins, maxs]});

function bitonic_sort(a) =
  if (#a == 1) then a
  else let b = {bitonic_sort(x) : x in bottop(a)};
  in merge(b[0] ++ reverse(b[1]));

bitonic_sort([2, 5, 22, 6, 3, -8, -7, 12]);

Try it at: http://www.cs.rice.edu/~johnmc/nesl.html

Bubble Sort and Variants
Sequential bubble sort algorithm: compares and exchanges adjacent elements in the sequence
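
For reference, a standard C rendering of the sequential algorithm (a sketch, not the slide's own pseudocode):

/* Bubble sort: pass i bubbles the largest remaining element into
   position i by exchanging adjacent out-of-order pairs. */
void bubble_sort(double *a, int n)
{
    for (int i = n - 1; i > 0; i--)
        for (int j = 0; j < i; j++)
            if (a[j] > a[j + 1]) {        /* compare-exchange neighbors */
                double tmp = a[j]; a[j] = a[j + 1]; a[j + 1] = tmp;
            }
}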

Bubble Sort and Variants
• Bubble sort complexity: Θ(n²)
• Difficult to parallelize
  —the algorithm has no concurrency
• A simple variant uncovers concurrency

Sequential Odd-Even Transposition Sort
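
A standard C sketch of the algorithm (n phases, alternating even and odd neighbor pairs; a sketch rather than the slide's own pseudocode):

/* Odd-even transposition sort: even phases compare pairs
   (a[0],a[1]), (a[2],a[3]), ...; odd phases compare
   (a[1],a[2]), (a[3],a[4]), ...  After n phases, a[] is sorted. */
void odd_even_sort(double *a, int n)
{
    for (int phase = 0; phase < n; phase++)
        for (int j = phase % 2; j + 1 < n; j += 2)
            if (a[j] > a[j + 1]) {
                double tmp = a[j]; a[j] = a[j + 1]; a[j + 1] = tmp;
            }
}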

Odd-Even Transposition Sort, n = 8
In each phase, n = 8 elements are compared

Odd-Even Transposition Sort
• After n phases of odd-even exchanges, the sequence is sorted
• Each phase of the algorithm requires Θ(n) comparisons
• Serial complexity is Θ(n²)

Parallel Odd-Even Transposition
Consider one item per processor
• n iterations
  —in each iteration, each processor does one compare-exchange
• Parallel run time of this formulation is Θ(n)
• Cost optimal with respect to the base serial algorithm
  —but not with respect to the optimal one!

Parallel Odd-Even Transposition Sort
Note: if partner id < 1 or > n, then skip the compare
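
A hedged MPI sketch of the parallel formulation, one element per process, reusing the compare_exchange helper from earlier; ranks here are 0-based, so the "skip" test becomes partner < 0 or partner >= p.

#include <mpi.h>

double compare_exchange(double my_elem, int partner, int keep_min,
                        MPI_Comm comm);   /* from the earlier sketch */

/* p phases; in each phase a process pairs with one neighbor (if any)
   and compare-exchanges; the lower rank keeps the smaller element. */
double parallel_odd_even_sort(double my_elem, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);
    for (int phase = 0; phase < p; phase++) {
        int partner = (phase % 2 == rank % 2) ? rank + 1 : rank - 1;
        if (partner < 0 || partner >= p)   /* no partner: skip compare */
            continue;
        my_elem = compare_exchange(my_elem, partner, rank < partner, comm);
    }
    return my_elem;
}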

Quicksort
• Popular sequential sorting algorithm
  —simplicity, low overhead, optimal average complexity
• Operation
  —select an entry in the sequence to be the pivot
  —divide the sequence into two halves
    – one with all elements less than the pivot
    – the other with all elements greater
• Apply the process recursively to each sublist
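
The operation just described, as a compact sequential C sketch (the choice of the last element as pivot is illustrative only):

/* Partition a[lo..hi] around the pivot a[hi]; return the pivot's
   final index.  Elements < pivot end up to its left. */
static int partition(double *a, int lo, int hi)
{
    double pivot = a[hi];
    int i = lo;
    for (int j = lo; j < hi; j++)
        if (a[j] < pivot) {
            double tmp = a[i]; a[i] = a[j]; a[j] = tmp;
            i++;
        }
    double tmp = a[i]; a[i] = a[hi]; a[hi] = tmp;
    return i;
}

/* Apply the process recursively to each sublist. */
void quicksort(double *a, int lo, int hi)
{
    if (lo >= hi) return;
    int mid = partition(a, lo, hi);
    quicksort(a, lo, mid - 1);     /* elements less than the pivot */
    quicksort(a, mid + 1, hi);     /* elements greater (or equal)  */
}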

Parallelizing Quicksort
• First, recursive decomposition
  —partition the list serially
  —handle each subproblem on a different processor
• Time for this algorithm is lower-bounded by Ω(n)!
• Can we parallelize the partitioning step?
  —can we use n processors to partition a list of length n around a pivot in O(1) time?
• Tricky on real machines

Practical Parallel Quicksort
Each processor is initially responsible for n/p elements
• Shared memory formulation
  —select first pivot & broadcast
  —each processor partitions its own data
  —globally rearrange data into smaller and larger parts (in place)
  —recurse with a proportional # of processors on each part
• Message passing formulation
  —partitioning
    – each processor first partitions its local portion of the array
    – determine which processes will be responsible for each partition (based on the sizes of the smaller-than-pivot and larger-than-pivot groups)
    – divide up the data among the processor subsets responsible for each part
  —continue recursively
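
A hedged sketch of the first message-passing steps (pivot broadcast plus local partition); local_partition and its parameters are illustrative, and the subset assignment and data redistribution that follow are elided.

#include <mpi.h>

/* Root broadcasts the pivot; each process then partitions its local
   block in place so elements < pivot come first.  The returned split
   point gives the local group sizes, which the caller can combine
   (e.g., with a reduction) to assign processor subsets. */
int local_partition(double *a, int n_local, double *pivot, int root,
                    MPI_Comm comm)
{
    MPI_Bcast(pivot, 1, MPI_DOUBLE, root, comm);
    int split = 0;                 /* a[0..split) will be < pivot */
    for (int j = 0; j < n_local; j++)
        if (a[j] < *pivot) {
            double tmp = a[split]; a[split] = a[j]; a[j] = tmp;
            split++;
        }
    return split;
}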

Data Parallel Quicksort in NESL
[eight-line NESL quicksort code listing]
• Total work is O(n log n)
• Recursion depth is O(log n)
• Depth of each operation is constant
• Total depth is O(log n) as well

Other Sorting Algorithms
• Shellsort: another variant of bubble sort
  —two-stage process
    – log p rounds of long-distance exchanges
    – followed by rounds of odd-even transposition sort until done
  —key idea: the long-distance moves of the first stage reduce the number of rounds necessary in the second stage
• Radix sort: in a series of rounds, sort elements into buckets by digit
• Bucket and sample sort
  —assumes evenly distributed items in an interval
  —buckets represent evenly-sized subintervals
• Enumeration sort
  —determine the rank of each element
  —place it in the correct position
  —CRCW PRAM algorithm: n² processors, sort in Θ(1) time
    – assumes that all concurrent writes to a location deposit their sum
    – n processes in column j test element j against the rest; each writes 1 into C[j]
    – place A[j] into A[C[j]]
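
The CRCW PRAM formulation of enumeration sort has a direct sequential transcription in C; each inner-loop iteration below stands in for one of the n² PRAM processors, and index-based tie-breaking keeps ranks unique (an illustrative sketch).

/* Enumeration sort: rank counts the elements that must precede a[j];
   each (i, j) comparison corresponds to one PRAM processor writing 1
   into C[j], with concurrent writes summed. */
void enumeration_sort(const double *a, double *out, int n)
{
    for (int j = 0; j < n; j++) {
        int rank = 0;
        for (int i = 0; i < n; i++)
            if (a[i] < a[j] || (a[i] == a[j] && i < j))
                rank++;
        out[rank] = a[j];          /* place A[j] into A[C[j]] */
    }
}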

Other Sorting Algorithms (Cont)
• Histogram sort
  —goal: divide keys into p evenly sized pieces
  —use an iterative approach to do so
    – initiating processor broadcasts k > p-1 splitter guesses
    – each processor determines how many keys fall in each bin
    – sum histograms with a global reduction
    – one processor examines the guesses to see which are satisfactory
    – broadcast finalized splitters and the number of keys for each processor
  —each processor sends local data to the appropriate processors using all-to-all communication
  —each processor merges the chunks it receives
  —Kale and Solomonik improved this (IPDPS 2010)
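
One refinement step of the histogram idea, sketched in C/MPI; the splitter guesses are assumed sorted, and the names and the use of MPI_Allreduce for the "global reduction" are illustrative assumptions.

#include <mpi.h>

/* Count local keys per bin induced by k sorted splitter guesses,
   then sum the histograms across all processors. */
void histogram_step(const double *keys, int n_local,
                    const double *splitters, int k,
                    long *global_counts /* k+1 entries */,
                    MPI_Comm comm)
{
    long local_counts[k + 1];      /* C99 VLA: one bin per interval */
    for (int b = 0; b <= k; b++)
        local_counts[b] = 0;
    for (int i = 0; i < n_local; i++) {
        int lo = 0, hi = k;        /* first splitter >= key, or k */
        while (lo < hi) {
            int mid = (lo + hi) / 2;
            if (keys[i] <= splitters[mid]) hi = mid;
            else lo = mid + 1;
        }
        local_counts[lo]++;
    }
    /* global reduction: element-wise sum of all local histograms */
    MPI_Allreduce(local_counts, global_counts, k + 1, MPI_LONG,
                  MPI_SUM, comm);
}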

References
• Adapted from slides “Sorting” by Ananth Grama
• Based on Chapter 9 of “Introduction to Parallel Computing” by Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar. Addison Wesley, 2003.
• Guy Blelloch. “Programming Parallel Algorithms.” Communications of the ACM, volume 39, number 3, March 1996.
• http://www.cs.cmu.edu/~scandal/nesl/algorithms.html#sort
• Edgar Solomonik and Laxmikant V. Kale. Highly Scalable Parallel Sorting. Proceedings of IPDPS 2010.