
SWAMEGA PUBLICATIONS

DATA STRUCTURES
AND
ALGORITHMS
For M.Sc., M.C.A, & M.Tech Programmes

Dr. S. RAMAMOORHTY & Dr. S. SATHYALAKSHMI

2014
Data Structures & Algorithms
UNIT - III

Sorting algorithm
A sorting algorithm is an algorithm that puts elements of a list in a certain
order. The most-used orders are numerical order and lexicographical order. Efficient sorting
is important for optimizing the use of other algorithms (such as search and merge algorithms)
which require input data to be in sorted lists; it is also often useful for canonicalizing data and
for producing human-readable output. More formally, the output must satisfy two conditions:

1. The output is in non-decreasing order (each element is no smaller than the previous
element according to the desired total order);
2. The output is a permutation (reordering) of the input.

Further, the data is often taken to be in an array, which allows random access,
rather than a list, which only allows sequential access, though often algorithms can be applied
with suitable modification to either type of data.

Since the dawn of computing, the sorting problem has attracted a great deal of
research, perhaps due to the complexity of solving it efficiently despite its simple, familiar
statement. For example, bubble sort was analyzed as early as 1956. A fundamental limit of
comparison sorting algorithms is that they require linearithmic time – O(n log n) – in the
worst case, though better performance is possible on real-world data (such as almost-sorted
data), and algorithms not based on comparison, such as counting sort, can have better
performance. Although many consider sorting a solved problem – asymptotically optimal
algorithms have been known since the mid-20th century – useful new algorithms are still
being invented, with the now widely used Timsort dating to 2002, and the library sort being
first published in 2006.

Sorting algorithms are prevalent in introductory computer science classes, where
the abundance of algorithms for the problem provides a gentle introduction to a variety of
core algorithm concepts, such as big O notation, divide and conquer algorithms, data
structures such as heaps and binary trees, randomized algorithms, best, worst and average
case analysis, time-space tradeoffs, and upper and lower bounds.

Classification
Sorting algorithms are often classified by:

• Computational complexity (worst, average and best behavior) of element comparisons
in terms of the size of the list (n). For typical serial sorting algorithms good behavior
is O(n log n), with parallel sorts in O(log² n), and bad behavior is O(n²). (See Big O
notation.) Ideal behavior for a serial sort is O(n), but this is not possible in the average
case; optimal parallel sorting is O(log n). Comparison-based sorting algorithms,
which evaluate the elements of the list via an abstract key comparison operation, need
at least O(n log n) comparisons for most inputs.
• Computational complexity of swaps (for "in place" algorithms).
• Memory usage (and use of other computer resources). In particular, some sorting
algorithms are "in place". Strictly, an in-place sort needs only O(1) memory beyond
the items being sorted; sometimes O(log n) additional memory is considered "in
place".
• Recursion. Some algorithms are either recursive or non-recursive, while others may
be both (e.g., merge sort).
• Stability: stable sorting algorithms maintain the relative order of records with equal
keys (i.e., values).
• Whether or not they are a comparison sort. A comparison sort examines the data only
by comparing two elements with a comparison operator.
• General method: insertion, exchange, selection, merging, etc. Exchange sorts include
bubble sort and quick sort. Selection sorts include shaker sort and heap sort. Also
whether the algorithm is serial or parallel. The remainder of this discussion almost
exclusively concentrates upon serial algorithms and assumes serial operation.
• Adaptability: whether or not the presortedness of the input affects the running time.
Algorithms that take this into account are known to be adaptive.

Stability

An example of stable sorting on playing cards. When the cards are sorted by
rank with a stable sort, the two 5s must remain in the same order in the sorted output that they
were originally in. When they are sorted with a non-stable sort, the 5s may end up in the
opposite order in the sorted output.
When sorting some kinds of data, only part of the data is examined when
determining the sort order. For example, in the card sorting example above, the cards are
being sorted by their rank, and their suit is being ignored. This allows the possibility of
multiple different correctly sorted versions of the original list. Stable sorting algorithms
choose one of these, according to the following rule: if two items compare as equal, like the
two 5 cards, then their relative order will be preserved, so that if one came before the other in
the input, it will also come before the other in the output.

More formally, the data being sorted can be represented as a record or tuple of
values, and the part of the data that is used for sorting is called the key. In the card example,
cards are represented as a record (rank, suit), and the key is the rank. A sorting algorithm is
stable if whenever there are two records R and S with the same key, and R appears before S
in the original list, then R will always appear before S in the sorted list.

When equal elements are indistinguishable, such as with integers, or more
generally, any data where the entire element is the key, stability is not an issue. Stability is
also not an issue if all keys are different.

Unstable sorting algorithms can be specially implemented to be stable. One way
of doing this is to artificially extend the key comparison, so that comparisons between two
objects with otherwise equal keys are decided using the order of the entries in the original
input list as a tie-breaker. Remembering this order, however, may require additional time and
space.

One application for stable sorting algorithms is sorting a list using a primary and
secondary key. For example, suppose we wish to sort a hand of cards such that the suits are in
the order clubs (♣), diamonds (♦), hearts (♥), spades (♠), and within each suit, the cards are
sorted by rank. This can be done by first sorting the cards by rank (using any sort), and then
doing a stable sort by suit:

Within each suit, the stable sort preserves the ordering by rank that was already
done. This idea can be extended to any number of keys, and is leveraged by radix sort. The
same effect can be achieved with an unstable sort by using a lexicographic key comparison,
which e.g. compares first by suits, and then compares by rank if the suits are the same.
Popular sorting algorithms

While there are a large number of sorting algorithms, in practical
implementations a few algorithms predominate. Insertion sort is widely used for small data
sets, while for large data sets an asymptotically efficient sort is used, primarily heap sort,
merge sort, or quick sort. Efficient implementations generally use a hybrid algorithm,
combining an asymptotically efficient algorithm for the overall sort with insertion sort for
small lists at the bottom of a recursion. Let us discuss some of these algorithms in the
following pages.

Bubble Sort:

Procedure Bubblesort; // Sorting in Ascending Order //


Var i, j : index; x: item; a[1..n] : array of elements;
Begin
For i = 2 to n do
Begin
For j = n down to i do
If a[j-1] > a[j] then
Begin
x = a[j-1]; a[j-1] = a[j]; a[j] = x;
End
End
End. { Bubble Sort }

An alternate algorithm is as follows:

Procedure Bubblesort; // Sorting in Ascending Order //


Var i, j : index; x : item; a[1..n] : array of elements;
Begin
For i = 1 to n-1 do
Begin
For j = i+1 to n do
If a[i] > a[j] then
Begin
x = a[i]; a[i] = a[j] ; a[j] = x;
End
End
End

Illustration:
Input for Bubble sort : a[1..8] = {12, 18, 42, 44, 55, 67, 94, 06}

Output : [ First algorithm ] Output : [ Alternate algorithm ]

12 18 42 44 55 67 06 94 06 18 42 44 55 67 94 12

12 18 42 44 55 06 67 94 06 12 42 44 55 67 94 18

12 18 42 44 06 55 67 94 06 12 18 44 55 67 94 42

12 18 42 06 44 55 67 94 06 12 18 42 55 67 94 44

12 18 06 42 44 55 67 94 06 12 18 42 44 67 94 55

12 06 18 42 44 55 67 94 06 12 18 42 44 55 94 67

06 12 18 42 44 55 67 94 06 12 18 42 44 55 67 94

Note: To sort in descending order, reverse the comparison in the If statement (for example,
If a[i] < a[j] then in the alternate algorithm); alternatively, print the final output of the same
algorithms in reverse order. We can verify that the least element floats to the beginning of the
array just as an air bubble floats upward in water. If an element appears more than once in the
array, and the sort keeps such equal elements in their original relative order in the output,
then the algorithm is said to be stable.
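The first Bubblesort procedure above can be turned directly into a small C program. The following
is a sketch (the function and variable names are our own, and 0-based array indices replace the
1-based indices of the pseudocode):

#include <stdio.h>

/* A sketch in C of the first Bubblesort procedure above (0-based indices). */
void bubble_sort(int a[], int n)
{
    for (int i = 1; i < n; i++)             /* pass i of the n-1 passes        */
        for (int j = n - 1; j >= i; j--)    /* sweep from the right end        */
            if (a[j - 1] > a[j]) {          /* neighbours out of order: swap   */
                int x = a[j - 1];
                a[j - 1] = a[j];
                a[j] = x;
            }
}

int main(void)
{
    int a[] = {12, 18, 42, 44, 55, 67, 94, 6};
    bubble_sort(a, 8);
    for (int i = 0; i < 8; i++) printf("%d ", a[i]);   /* 6 12 18 42 44 55 67 94 */
    printf("\n");
    return 0;
}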

Selection Sort:

Procedure Selectionsort; // Sorting in Ascending Order //

Var i, j, k : index; x : item, a[1..n] : array of elements;


Begin
for i = 1 to n-1 do
Begin
k = i;
x = a[i];
for j = i+1 to n do
if a[j] < x then
Begin
k=j;
x = a[j] ;
End
a[k] = a[i] ;
a[i] = x ;
End
End

Illustration:

Input for Selection Sort : a[1..8] = { 44, 55, 12, 42, 94, 18, 06, 67 }
Output :

Initial array : 44 55 12 42 94 18 06 67

06 55 12 42 94 18 44 67

06 12 55 42 94 18 44 67

06 12 18 42 94 55 44 67

06 12 18 42 44 55 94 67

06 12 18 42 44 55 67 94

Note: This method is based on the following: (i) select the item with the least value, (ii) exchange it
with the first element in the array, and (iii) repeat these operations with the remaining n-1 elements,
then with n-2 elements, until only one item, the largest, is left.
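As a sketch (function names and 0-based indices are our own), the Selectionsort procedure above
reads in C as follows:

#include <stdio.h>

/* A sketch in C of the Selectionsort procedure: in each pass the smallest
   remaining item is found and exchanged with a[i]. */
void selection_sort(int a[], int n)
{
    for (int i = 0; i < n - 1; i++) {
        int k = i;                          /* index of the smallest item so far */
        int x = a[i];                       /* its value                         */
        for (int j = i + 1; j < n; j++)
            if (a[j] < x) { k = j; x = a[j]; }
        a[k] = a[i];                        /* exchange a[i] and a[k]            */
        a[i] = x;
    }
}

int main(void)
{
    int a[] = {44, 55, 12, 42, 94, 18, 6, 67};
    selection_sort(a, 8);
    for (int i = 0; i < 8; i++) printf("%d ", a[i]);   /* 6 12 18 42 44 55 67 94 */
    printf("\n");
    return 0;
}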

Insertion Sort:

Procedure Insertionsort; // Sorting in Ascending Order //

Var i, j : index; x : item, a[1..n] : array of elements;


Begin
for i = 2 to n do
Begin
x = a[i];
a[0] = x;
j = i-1;
while x < a[j] do
Begin
a[j+1] = a[j];
j = j-1;
End
a[j+1] = x;
End
End

Note: This method is widely used by card players. The items (cards) are conceptually
divided into a destination sequence a1, a2, ...., a(i-1) and a source sequence ai, a(i+1), ....., an. In
each step, starting with i=2 and incrementing i by unity, the i-th element of the source
sequence is picked and transferred into the destination sequence by inserting it at the
appropriate place.

Illustration:

Input for Insertion Sort : a[1..8] = {44, 55, 12, 42, 94, 18, 06, 67}

Output : initially 44 55 12 42 94 18 06 67
i = 2 44 55 12 42 94 18 06 67 // no change
i = 3 12 44 55 42 94 18 06 67 // 12 is inserted
i = 4 12 42 44 55 94 18 06 67 // 42 is inserted
i = 5 12 42 44 55 94 18 06 67 // no change
i = 6 12 18 42 44 55 94 06 67 // 18 is inserted
i = 7 06 12 18 42 44 55 94 67 // 06 is inserted
i = 8 06 12 18 42 44 55 67 94 // 67 is inserted
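A C sketch of the Insertionsort procedure follows (our own names and 0-based indices; the
pseudocode keeps a sentinel in a[0], while here an explicit j >= 0 test is used instead):

#include <stdio.h>

/* A sketch in C of the Insertionsort procedure above, without the sentinel. */
void insertion_sort(int a[], int n)
{
    for (int i = 1; i < n; i++) {
        int x = a[i];                       /* item to insert                   */
        int j = i - 1;
        while (j >= 0 && x < a[j]) {        /* shift larger items to the right  */
            a[j + 1] = a[j];
            j--;
        }
        a[j + 1] = x;                       /* drop x into its place            */
    }
}

int main(void)
{
    int a[] = {44, 55, 12, 42, 94, 18, 6, 67};
    insertion_sort(a, 8);
    for (int i = 0; i < 8; i++) printf("%d ", a[i]);   /* 6 12 18 42 44 55 67 94 */
    printf("\n");
    return 0;
}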

Heap Sort:

Procedure Heapsort; // Sorting in Descending Order //

Var l, r : index; x : item, a[1..n] : array of elements;


Procedure Shift;
label 13 ;
Var i, j : index;
Begin
i = l ; j = 2*i ; x = a[i] ;
while j ≤ r do
Begin
If j < r then
If a[j] > a[j+1] then j = j + 1 ; // select the smaller of the two children //
If x < a[j] then goto 13 ;
a[i] = a[j] ; i = j ;
j = 2*i;
End
13 : a[i] = x ;
End
Begin
l = (n div 2) + 1; r = n ;
While l > 1 do
Begin
l = l – 1 ; Shift;
End
While r > 1 do
Begin
x = a[1] ; a[1] = a[r] ; a[r] = x ; r = r – 1 ; Shift;
End
End // End of Heap Sort //
Illustration:

Input for Heap Sort : a[1..8] = {06, 42, 12, 55, 94, 18, 44, 67}

Output :

12 42 18 55 94 67 44 06
18 42 44 55 94 67 12 06
42 55 44 67 94 18 12 06
44 55 94 67 42 18 12 06
55 67 94 44 42 18 12 06
67 94 55 44 42 18 12 06
94 67 55 44 42 18 12 06
An alternate algorithm is as follows:

Heap Sort :
Phase – I:

(0) Construct a binary tree from the given input, laid out level by level from left to right, as in
the illustration (output) below.

(1) Process the node which is the parent of the right most node on the lowest level. If its
value is less than the value of its largest child, swap those values, otherwise do nothing.

(2) Move left on the same level. Compare the value of the parent node with the values of the
child nodes. If the parent is smaller than the largest child, swap them.

(3) When the left end of this level is reached, move up a level, and beginning with the right
most parent node, repeat step (2). Continue swapping the original parent with the largest of
its children until it is larger than its children. In effect, the original parent is being walked
down the tree in a fashion that ensures that numbers will be in increasing order along the path.

(4) Repeat step (3) until all level 1 nodes have been processed (Remember that the root is at
level 0 ).

Phase – II:
(1) Compare the root node with its children, swapping it with the largest child if the largest
child is larger than the root.

(2) If a swap occurred in step (1), then continue swapping the value which was originally in
the root position until it is larger than its children. In effect, the original root-node value is
now being walked down a path in the tree to ensure that all paths retain values arranged in
ascending order from leaf node to root node.

(3) Swap the root node with the bottom right most child, sever the new bottom right most
child from the tree and insert it into a STACK. This is the largest value given in the input.

(4) Repeat steps (1) through (3) until only two elements are left. Then among these two
elements insert the largest element into the STACK followed by the last element which is the
least of all.

(5) Now print the elements of the STACK to get the Ascending Order of the input elements.

Illustration:
Input for Heap Sort : a[1..11] = {11, 1, 5, 7, 6, 12, 17, 8, 4, 10, 2}

Output : The given input is first laid out as a binary tree, level by level from left to right (the
tree diagrams of this illustration are not reproduced here). The figures trace the following
sequence of operations: swap 6 & 10, swap 7 & 8, swap 5 & 17, swap 1 & 10, swap 1 & 6,
swap 11 & 17, swap 11 & 12, swap 12 & 11 and swap 2 & 17; then 17 is severed from the tree
and pushed onto the STACK. Repeating Phase II in the same way pushes the remaining
elements (12, 11, 10, ...) onto the STACK.
The STACK contains the output.
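For reference, the first Heapsort procedure can be written in C as the following sketch (our own
function names; 0-based indices, so the children of node i are 2i+1 and 2i+2). Like the pseudocode,
it builds a MIN-heap and therefore leaves the array in descending order:

#include <stdio.h>

static void sift(int a[], int l, int r)     /* restore the heap below index l   */
{
    int i = l, j = 2 * i + 1, x = a[i];
    while (j <= r) {
        if (j < r && a[j] > a[j + 1]) j++;  /* pick the smaller child           */
        if (x <= a[j]) break;               /* x already below both children    */
        a[i] = a[j];                        /* pull the child up                */
        i = j;
        j = 2 * i + 1;
    }
    a[i] = x;
}

void heap_sort_desc(int a[], int n)
{
    for (int l = n / 2 - 1; l >= 0; l--)    /* build the min-heap               */
        sift(a, l, n - 1);
    for (int r = n - 1; r > 0; r--) {       /* move the minimum to position r   */
        int x = a[0]; a[0] = a[r]; a[r] = x;
        sift(a, 0, r - 1);
    }
}

int main(void)
{
    int a[] = {6, 42, 12, 55, 94, 18, 44, 67};
    heap_sort_desc(a, 8);
    for (int i = 0; i < 8; i++) printf("%d ", a[i]);   /* 94 67 55 44 42 18 12 6 */
    printf("\n");
    return 0;
}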

Quick Sort:

Procedure Quicksort; // Sorting in Ascending Order //


Procedure Sort ( l, r : index );
Var i, j : index; x, w : item; a[1..n] : array of elements;
Begin
i = l ; j = r ; x = a[(l+r) div 2];
repeat
while a[i] < x do i = i + 1;
while x < a[j] do j = j-1;
if i ≤ j then
Begin
w = a[i] ; a[i] = a[j]; a[j] = w; i = i+1; j= j-1;
End
until i > j;
if l < j then Sort(l, j);
if i < r then Sort(i, r);
End;
Begin Sort ( 1, n); End.
Let the pivot element X = 44. Then rearrange the sequence so that the elements which are
less than X are placed to its left and the others to its right. Then take the left half of the new
sequence and continue the process, and then repeat for the right half of the sequence. This
process is repeated till all the elements are ordered.

Illustration:

Input for QuickSort : a[1..8] = {12, 18, 42, 44, 55, 67, 94, 06}

Output : 12 18 42 06 44 55 67 94

12 06 18 42 44 55 67 94

06 12 18 42 44 55 67 94
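A C sketch of the Quicksort procedure above follows (our own names; 0-based indices, with the
middle element taken as the pivot, as in the pseudocode):

#include <stdio.h>

/* Partition around the middle element x, then sort the two parts recursively. */
static void sort(int a[], int l, int r)
{
    int i = l, j = r, x = a[(l + r) / 2];
    do {
        while (a[i] < x) i++;               /* find a left item >= pivot        */
        while (x < a[j]) j--;               /* find a right item <= pivot       */
        if (i <= j) {
            int w = a[i]; a[i] = a[j]; a[j] = w;
            i++; j--;
        }
    } while (i <= j);
    if (l < j) sort(a, l, j);               /* sort the left part               */
    if (i < r) sort(a, i, r);               /* sort the right part              */
}

void quick_sort(int a[], int n) { if (n > 1) sort(a, 0, n - 1); }

int main(void)
{
    int a[] = {12, 18, 42, 44, 55, 67, 94, 6};
    quick_sort(a, 8);
    for (int i = 0; i < 8; i++) printf("%d ", a[i]);   /* 6 12 18 42 44 55 67 94 */
    printf("\n");
    return 0;
}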

Merge sort:
In computer science, merge sort (also commonly spelled mergesort) is
an O(n log n) comparison-based sorting algorithm. Most implementations produce a stable
sort, which means that the implementation preserves the input order of equal elements in the
sorted output. Merge sort is a divide and conquer algorithm that was invented by John von
Neumann in 1945.

Merge Sort:

Conceptually, a merge sort works as follows:

(1) Split the sequence (given array) a into two halves, called b and c.
(2) Merge b and c by combining single items into ordered pairs.
(3) Call the merged sequence a, and repeat steps (1) & (2), this time merging ordered pairs
into ordered quadruples.
(4) Repeat the previous steps, merging quadruples into octets, and continue doing this, each
time doubling the lengths of the merged sub-sequences, until the entire sequence is ordered.

Illustration:

Input for Merge Sort : a[1..8] = {44, 55, 12, 42, 94, 18, 06, 67}

Output :

In step 1, the split results in the sequences


44 55 12 42
94 18 06 67
The merging of single components (which are ordered sequences of length 1), into
ordered pairs yields
44 94 18 55 06 12 42 67
Splitting again in the middle and merging ordered pairs yields
44 94 18 55
06 12 42 67 ...... after splitting
06 12 44 94 18 42 55 67 ...... after merging
A third split and merge operation finally produces the desired result
06 12 44 94
18 42 55 67 ....... after splitting
06 12 18 42 44 55 67 94 ..... final result (sorted)
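Before the detailed pseudocode given later, here is a C sketch of the conceptual (bottom-up) merge
sort just described (our own names; it merges adjacent runs of doubling width through an auxiliary
buffer, rather than working from both ends of a 2n-element array as the later pseudocode does):

#include <stdio.h>
#include <string.h>

/* Merge the adjacent runs a[lo..mid) and a[mid..hi) via the buffer b. */
static void merge(int a[], int b[], int lo, int mid, int hi)
{
    int i = lo, j = mid, k = lo;
    while (i < mid && j < hi)               /* take the smaller head item       */
        b[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
    while (i < mid) b[k++] = a[i++];        /* copy the tail of the left run    */
    while (j < hi)  b[k++] = a[j++];        /* copy the tail of the right run   */
    memcpy(a + lo, b + lo, (hi - lo) * sizeof(int));
}

void merge_sort(int a[], int n)
{
    int b[n];                               /* temporary buffer (C99 VLA)       */
    for (int width = 1; width < n; width *= 2)          /* run lengths 1, 2, 4, ... */
        for (int lo = 0; lo + width < n; lo += 2 * width) {
            int mid = lo + width;
            int hi  = (lo + 2 * width < n) ? lo + 2 * width : n;
            merge(a, b, lo, mid, hi);
        }
}

int main(void)
{
    int a[] = {44, 55, 12, 42, 94, 18, 6, 67};
    merge_sort(a, 8);
    for (int i = 0; i < 8; i++) printf("%d ", a[i]);   /* 6 12 18 42 44 55 67 94 */
    printf("\n");
    return 0;
}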

Variants

Variants of merge sort are primarily concerned with reducing the space
complexity and the cost of copying.
A simple alternative for reducing the space overhead to n/2 is to maintain left
and right as a combined structure, copy only the left part of m into temporary space, and to
direct the merge routine to place the merged output into m.

With this version it is better to allocate the temporary space outside the merge
routine, so that only one allocation is needed. The excessive copying mentioned previously is
also mitigated, since the last pair of lines before the return result statement become
superfluous.

In-place sorting is possible, and still stable, but is more complicated, and
slightly slower, requiring non-linearithmic quasilinear time O(n log² n). One way to sort in-
place is to merge the blocks recursively. Like the standard merge sort, in-place merge sort is
also a stable sort. Stable sorting of linked lists is simpler. In this case the algorithm does not
use more space than that already used by the list representation, apart from the O(log(k)) used
for the recursion trace.

An alternative to reduce the copying into multiple lists is to associate a new
field of information with each key (the elements in m are called keys). This field will be used
to link the keys and any associated information together in a sorted list (a key and its related
information is called a record). Then the merging of the sorted lists proceeds by changing the
link values; no records need to be moved at all.

A field which contains only a link will generally be smaller than an entire
record so less space will also be used. This is a standard sorting technique, not restricted to
merge sort.

Merge Sort: // Sorting in Ascending Order //


Procedure MergeSort;
Var i, j, k, el, t : index;
h, m, p, q, r : integer;
a[1..n] of elements;
up : Boolean;
// Note that the array a has indices 1.. 2*n //
Begin
up = true; p = 1;
repeat
h=1; m=n;
if up then
Begin i = 1; j = n; k = n+1 ; el = 2*n End
else
Begin k = 1 ; el = n ; i = n+1 ; j = 2*n ; End
repeat // merge a run from i and j to k //
// q = length of i-run ; r = length of j-run //
if m ≥ p then q = p;
else q = m ; m = m-q ;
if m ≥ p then r = p;
else r = m ; m = m-r ;
while (q ≠ 0) AND (r ≠ 0) do
Begin // merge //
if a[i] < a[j] then
Begin a[k] = a[i] ; k = k+h ; i = i+1 ; q = q-1 ; End
else Begin a[k] = a[j] ; k = k+h ; j = j-1 ; r = r-1 ; End
End
// copy tail of j-run //
while r ≠ 0 do
Begin a[k] = a[j]; k = k+h ; j = j-1 ; r = r-1 ; End
// copy tail of i-run //
while q ≠ 0 do
Begin a[k] = a[i] ; k = k+h ; i = i+1 ; q = q-1 ; End
h = -h ; t = k ; k = el ; el = t ;
until m = 0;
up = NOT (up); p = 2*p;
until p ≥ n ;
if NOT (up) then
for ( i = 1 to n ) do a[i] = a[i+n] ;
End // End of Merge Sort //

Analysis

(Figure: a recursive merge sort algorithm used to sort an array of 7 integer values; the figure
shows the steps a human would take to emulate merge sort, top-down.)

In sorting n objects, merge sort has an average and worst-case performance of
O(n log n). If the running time of merge sort for a list of length n is T(n), then the recurrence
T(n) = 2T(n/2) + n follows from the definition of the algorithm (apply the algorithm to two
lists of half the size of the original list, and add the n steps taken to merge the resulting two
lists). The closed form follows from the master theorem.
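As a quick check of this recurrence (a sketch, assuming n is a power of two, n = 2^k, and taking
T(1) = 0), repeated unrolling gives

T(n) = 2T(n/2) + n
     = 4T(n/4) + 2n
     = 8T(n/8) + 3n
     = \dots
     = 2^k T(1) + k \cdot n = n \log_2 n,

which is O(n log n), in agreement with the master theorem.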

In the worst case, the number of comparisons merge sort makes is equal to or
slightly smaller than (n⌈lg n⌉ - 2^⌈lg n⌉ + 1), which is between (n lg n - n + 1) and (n lg n + n +
O(lg n)). For large n and a randomly ordered input list, merge sort's expected (average)
number of comparisons approaches α·n fewer than the worst case, where α is a constant
approximately equal to 0.2645.

In the worst case, merge sort does about 39% fewer comparisons than quick
sort does in the average case. In terms of moves, merge sort's worst case complexity is
O(n log n)—the same complexity as quick sort's best case, and merge sort's best case takes
about half as many iterations as the worst case.

Merge sort is more efficient than quick sort for some types of lists if the data to
be sorted can only be efficiently accessed sequentially, and is thus popular in languages such
as Lisp, where sequentially accessed data structures are very common. Unlike some
(efficient) implementations of quick sort, merge sort is a stable sort. Merge sort's most
common implementation does not sort in place; therefore, the memory size of the input must
be allocated for the sorted output to be stored in (see below for versions that need only n/2
extra spaces).

Merge sort also has some demerits. One is its use of 2n locations; the
additional n locations are commonly used because merging two sorted sets in place is more
complicated and would need more comparisons and move operations. But despite the use of
this space the algorithm still does a lot of work: the contents of m are first copied into left
and right and later into the list result on each invocation of merge_sort (variable names as in
the usual recursive formulation of the algorithm).

Use with tape drives

Merge sort type algorithms allowed large data sets to be sorted on early
computers that had small random access memories by modern standards. Records were stored
on magnetic tape and processed on banks of magnetic tape drives, such as the IBM 729.
An external merge sort is practical to run using disk or tape drives when the data to be sorted
is too large to fit into memory. External sorting explains how merge sort is implemented with
disk drives. A typical tape drive sort uses four tape drives. All I/O is sequential (except for
rewinds at the end of each pass). A minimal implementation can get by with just 2 record
buffers and a few program variables.

Naming the four tape drives as A, B, C, D, with the original data on A, and
using only 2 record buffers, the algorithm is similar to Bottom-up implementation, using
pairs of tape drives instead of arrays in memory. The basic algorithm can be described as
follows:

1. Merge pairs of records from A; writing two-record sublists alternately to C and D.


2. Merge two-record sublists from C and D into four-record sublists; writing these
alternately to A and B.
3. Merge four-record sublists from A and B into eight-record sublists; writing these
alternately to C and D
4. Repeat until you have one list containing all the data, sorted, in log2(n) passes.

Instead of starting with very short runs, usually a hybrid algorithm is used,
where the initial pass will read many records into memory, do an internal sort to create a long
run, and then distribute those long runs onto the output set. This step avoids many early
passes. For example, an internal sort of 1024 records will save 9 passes. The internal sort is
often large because it has such a benefit. In fact, there are techniques that can make the initial
runs longer than the available internal memory. A more sophisticated merge sort that
optimizes tape (and disk) drive usage is the polyphase merge sort.

Optimizing merge sort:

On modern computers, locality of reference can be of paramount importance
in software optimization, because multilevel memory hierarchies are used. Cache-aware
versions of the merge sort algorithm, whose operations have been specifically chosen to
minimize the movement of pages in and out of a machine's memory cache, have been
proposed. For example, the tiled merge sort algorithm stops partitioning subarrays when
subarrays of size S are reached, where S is the number of data items fitting into a CPU's
cache. Each of these subarrays is sorted with an in-place sorting algorithm such as insertion
sort, to discourage memory swaps, and normal merge sort is then completed in the standard
recursive fashion.

This algorithm has demonstrated better performance on machines that benefit
from cache optimization. (LaMarca & Ladner 1997). Kronrod (1969) suggested an
alternative version of merge sort that uses constant additional space. This algorithm was later
refined (Katajainen, Pasanen & Teuhola 1996). Also, many applications of external sorting
use a form of merge sorting where the input gets split up into a larger number of sublists,
ideally to a number for which merging them still makes the currently processed set of pages
fit into main memory.

Radix Sort: // Sorting in Ascending Order //


#define RADIX 10

/* Note: the abstract type Queue and the helpers CreateQueue, InsertQ, DeleteQ,
   IsEmpty, getradix( key, i ) (the i-th decimal digit of key, counted from the
   least significant digit) and the constant SIZE are assumed to be provided
   elsewhere. */
int numofdigits ( int data[ ] ) ;

void RadixSort ( int data[ ] )
{   int i, pos, j, digits ;
    Queue queue[RADIX] ;
    for ( i = 0 ; i < RADIX ; i++ )               /* one queue (bucket) per digit 0..9 */
        CreateQueue ( queue[i] ) ;
    digits = numofdigits ( data ) ;
    for ( i = 1 ; i <= digits ; i++ )             /* one pass per digit, least significant first */
    {   for ( j = 0 ; j < SIZE ; j++ )            /* distribute each key to the queue of its i-th digit */
            InsertQ ( queue[ getradix( data[j], i ) ], data[j] ) ;
        pos = 0 ;
        for ( j = 0 ; j < RADIX ; j++ )           /* collect the queues back in order 0..9 */
            while ( !IsEmpty ( queue[j] ) )
            {   data[pos] = DeleteQ ( queue[j] ) ; pos++ ; }
    }
}

int numofdigits ( int data[ ] )                   /* number of decimal digits of the largest key */
{   int i, max = 0 ;
    for ( i = 0 ; i < SIZE ; i++ )
        if ( max < data[i] ) max = data[i] ;
    return ( int ) log10 ( (double) max ) + 1 ;   /* requires <math.h> */
}

Illustration for RADIX Sort:


Input : 452 615 26 125 137 269 788 961 302

Pass 1 (units digit) - queue contents:
  queue 1: 961;  queue 2: 452, 302;  queue 5: 615, 125;  queue 6: 26;
  queue 7: 137;  queue 8: 788;  queue 9: 269
  Collected : 961 452 302 615 125 26 137 788 269

Pass 2 (tens digit) - queue contents:
  queue 0: 302;  queue 1: 615;  queue 2: 125, 26;  queue 3: 137;
  queue 5: 452;  queue 6: 961, 269;  queue 8: 788
  Collected : 302 615 125 26 137 452 961 269 788

Pass 3 (hundreds digit) - queue contents:
  queue 0: 26;  queue 1: 125, 137;  queue 2: 269;  queue 3: 302;
  queue 4: 452;  queue 6: 615;  queue 7: 788;  queue 9: 961

Result / Output : 26 125 137 269 302 452 615 788 961 (Sorted)

Multiway Merging:
The effort involved in a sequential sort is proportional to the number of
required passes since, by definition, every pass involves the copying of the entire set of data.
One way to reduce this number is to distribute the runs onto more than two files. Merging r
runs which are equally distributed on N tapes results in a sequence of r/N runs. A second
pass reduces their number to r/N², a third pass to r/N³, and after k passes there are r/N^k runs
left. The total number of passes required to sort n items by N-way merging is therefore
k = ⌈log_N n⌉. Since each pass requires n copy operations, the total number of copy
operations is in the worst case M = n · ⌈log_N n⌉. Here ⌈X⌉ denotes the ceiling of
X, that is, the smallest integer ≥ X.
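For example, as a quick check of the formula: distributing r = 64 initial runs onto N = 4 tapes
needs k = ⌈log_4 64⌉ = 3 merge passes, whereas two-way (N = 2) merging of the same runs would
need ⌈log_2 64⌉ = 6 passes; with n items copied in every pass, the total copying drops from 6n to 3n.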

procedure tapemergesort;
var i, j: tapeno;
el : integer; // no. Of runs distributed //
t : array [tapeno] of tapeno;
Begin // distribute initial runs to t[1], t[2], ..... t[nh] //
j = nh ; el = 0 ;
repeat if j < nh then j = j+1; else j = 1;
“ copy one run from f0 to tape j” ;
until eof(f0) ;
for i = 1 to n do t[i] = i;
repeat // merge from t[1],.....t[nh] to t[nh+1], ... t[n] //
“ reset input tapes” ;
el = 0;
j = nh + 1; // j is index of output tape //
repeat
el = el + 1;
“ merge a run from inputs to t[j] “ ;
if j < n then j = j+1 else j= nh+1 ;
until “ all inputs exhausted” ;
“ switch tapes “
until el = 1 ;
// sorted tape is t[1] //
end
Shake Sort: A careful programmer will, however, notice a peculiar asymmetry in bubble
sort: A single misplaced bubble (smallest element) in the “heavy” end of an otherwise sorted
array will shift into order in a single pass, but a misplaced item in the “light” end will sink
toward its correct position only one step in each pass. For example, the array:
12 18 42 44 55 67 94 06
will be sorted by the improved bubble sort in a single pass, but the array
94 06 12 18 42 44 55 67
will require 7 passes for sorting. This unnatural asymmetry suggests a third improvement:
alternating the direction of consecutive passes. We appropriately call the resulting algorithm
Shakesort. This algorithm is illustrated below:
Procedure shakesort;
Var j, k, el, r : index; x : item; a[1..n] : the given array of elements
begin
el = 2 ; r = n ; k = n ;
repeat
for j = r down to el do
if a[j-1] > a[j] then
begin // exchange a[j-1] and a[j] //
x = a[j-1] ; a[j-1] = a[j] ; a[j] = x ; k = j ;
end
el = k+1 ;
for j = el to r do
if a[j-1] > a[j] then
begin // exchange a[j-1] and a[j] //
x = a[j-1] ; a[j-1] = a[j] ; a[j] = x ; k = j;
end
r = k-1;
until el > r
end // shake sort //

Illustration:
Input for Shake Sort : = { 44, 55, 12, 42, 94, 18, 06, 67 }
Output :
el = 2 3 3 4 4
r= 8 8 7 7 4

44 06 06 06 06
55 44 44 12 12
12 55 12 44 18
42 12 42 18 42
94 42 55 42 44
18 94 18 55 55
06 18 67 67 67
67 67 94 94 94

The last column gives the sorted order of the input. That is, 06, 12, 18, 42, 44, 55, 67, 94

Shell Sort: A refinement of the straight insertion sort was proposed by D. L. Shell in 1959.
The method is explained and demonstrated with an example of eight items. First, all items
which are four positions apart are grouped and sorted separately. This process
is called a 4-sort. In this example of eight items, each group contains exactly two items.
After this first pass, the items are regrouped into groups with items two positions apart and
then sorted anew. This process is called a 2-sort. Finally, in a third pass all items are sorted
in an ordinary sort or 1-sort. One may at first wonder if the necessity of several sorting
passes, each of which involves all items, will not introduce more work than it saves.
However, each sorting step over a chain involves either relatively few items or the items are
already quite well ordered and comparatively few re-arrangements are required.

Procedure Shell sort;


const t = 4;
var i, j, k, s : index; x : item; m : 1.. t;
h : array [1..t] of integer; a: array of given elements;
begin
h[1] = 9 ; h[2] = 5 ; h[3] = 3 ; h[4] = 1 ;
for m = 1 to t do
begin
k = h[m] ; s = - k ; // sentinel position //
for i = k+1 to n do
begin
x = a[i] ; j = i - k ;
if s = 0 then s = - k ; s = s+1 ; a[s] = x ; // sentinel //
while x < a[j] do
begin
a[j+k] = a[j] ; j = j-k ;
end
a[j+k] = x;
end
end
end

Illustration:
Input for Shell Sort: 44 55 12 42 94 18 06 67
ᵝ ᵞ ᵟ ᵠ ᵝ ᵞ ᵟ ᵠ
Output: 4-sort yields
44 18 06 42 94 55 12 67
ᵝ ᵞ ᵝ ᵞ ᵝ ᵞ ᵝ ᵞ
2-sort yields
06 18 12 42 44 55 94 67
ᵝ ᵝ ᵝ ᵝ ᵝ ᵝ ᵝ ᵝ
1-sort yields
06 12 18 42 44 55 67 94 - (Sorted Order)
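A C sketch of Shell sort follows (our own names; it uses the same decreasing increments 9, 5, 3, 1
as the pseudocode, whereas the illustration uses 4, 2, 1, and it replaces the negative-index sentinel
with an explicit j >= 0 test):

#include <stdio.h>

/* Within each increment k the algorithm is an insertion sort on items k apart. */
void shell_sort(int a[], int n)
{
    static const int h[4] = {9, 5, 3, 1};
    for (int m = 0; m < 4; m++) {
        int k = h[m];
        for (int i = k; i < n; i++) {       /* k-sort: insertion sort with step k */
            int x = a[i];
            int j = i - k;
            while (j >= 0 && x < a[j]) {
                a[j + k] = a[j];
                j -= k;
            }
            a[j + k] = x;
        }
    }
}

int main(void)
{
    int a[] = {44, 55, 12, 42, 94, 18, 6, 67};
    shell_sort(a, 8);
    for (int i = 0; i < 8; i++) printf("%d ", a[i]);   /* 6 12 18 42 44 55 67 94 */
    printf("\n");
    return 0;
}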

The table below shows the times ( in milliseconds ) consumed by the sorting methods so far
discussed, as executed by the PASCAL system on a CDC 6400 computer. The three
columns contain the times used to sort the already ordered array, a random permutation, and
the inversely ordered array. The left figure in each column is for 256 items, the right one for
512 items.

                      Ordered            Random            Inversely ordered
                    256     512        256      512         256       512
Insertion Sort       12      23        366     1444         704      2836
Selection Sort      489    1907        509     1956         695      2675
Bubble Sort         540    2165       1026     4054        1492      5931
Shake Sort            5       9        961     3642        1619      6520
Shell Sort           58     116        127      349         157       492
Heap Sort           116     253        110      241         104       226
Quick Sort           31      69         60      146          37        79
Merge Sort           99     234        102      242          99       232

Noteworthy are the following points:


1) Bubble Sort is definitely the worst sorting method among all compared. Its improved
version "Shake Sort" is still worse than Insertion Sort and Selection Sort.
2) Quick Sort beats Heap Sort by a factor of 2 to 3. It sorts the inversely ordered array with
speed practically identical to the one which is already sorted.
The table below shows the influence of enlarging the size of the data items.
In the example chosen the associated data occupy seven times the storage space of the key.
The left figure in each column displays the time needed for sorting records without associated
data, the right figure relates to sorting with associated data; n=256.

                      Ordered            Random            Inversely ordered
                    key  +data         key   +data          key    +data
Insertion Sort       12     46         366    1129          704     2150
Selection Sort      489    547         509     607          695     1430
Bubble Sort         540    610        1026    3212         1492     5599
Shake Sort            5      5         961    3071         1619     5757
Shell Sort           58    186         127     373          157      435
Heap Sort           116    264         110     246          104      227
Quick Sort           31     55          60     137           37       75
Merge Sort           99    196         102     195           99      187

The following points should be noted:


1) Selection Sort gains significantly and now emerges as the best of the sorting methods.
2) Bubble Sort is still the worst method by a large margin, and its "improvement", Shake Sort,
is even slightly worse in the case of the inversely ordered array.
3) Quick Sort has even strengthened its position as the quickest method and appears as the
best array sorter by far.

To conclude this discussion of sorting methods, we shall try to compare their
effectiveness. If n denotes the number of items to be sorted, C and M shall again stand for
the number of comparisons required and the number of moves of the items respectively.
Closed analytical formulae can be given for a few of the sorting methods; the table below
shows them. The column indicators min, ave, max specify the respective minimum, average
and maximum values over all n! permutations of n items.
                       Min             Ave                   Max
Insertion Sort   C     n-1             (n² + n - 2)/4        (n² + n)/2 - 1
                 M     2(n - 1)        (n² + 9n - 10)/4      (n² + 3n - 4)/2
Selection Sort   C     (n² - n)/2      (n² - n)/2            (n² - n)/2
                 M     3(n - 1)        n(ln n + 0.57)        n²/4 + 3(n - 1)
Bubble Sort      C     (n² - n)/2      (n² - n)/2            (n² - n)/2
                 M     0               (n² - n) * 0.75       (n² - n) * 1.5
No reasonably simple accurate formulas are available on the advanced methods.


These formulas merely provide a rough measure of performance as functions of n, and they
allow the classification of sorting algorithms into primitive, straight methods (n²) and
advanced or "logarithmic" methods (n log n).

PARALLELISM:

Traditionally, computer software has been written for serial computation. To
solve a problem, an algorithm is constructed and implemented as a serial stream of
instructions. These instructions are executed on a central processing unit on one computer.
Only one instruction may execute at a time—after that instruction is finished, the next is
executed. Parallel computing, on the other hand, uses multiple processing elements
simultaneously to solve a problem. This is accomplished by breaking the problem into
independent parts so that each processing element can execute its part of the algorithm
simultaneously with the others. The processing elements can be diverse and include resources
such as a single computer with multiple processors, several networked computers, specialized
hardware, or any combination of the above.

Parallel computing is a form of computation in which many calculations are carried
out simultaneously, operating on the principle that large problems can often be divided into smaller
ones, which are then solved concurrently ("in parallel"). There are several different forms of parallel
computing: bit-level, instruction level, data, and task parallelism. Parallelism has been employed for
many years, mainly in high-performance computing, but interest in it has grown lately due to the
physical constraints preventing frequency scaling. As power consumption (and consequently heat
generation) by computers has become a concern in recent years, parallel computing has become the
dominant paradigm in computer architecture, mainly in the form of multi-core processors.

Parallel computers can be roughly classified according to the level at which the
hardware supports parallelism, with multi-core and multi-processor computers having
multiple processing elements within a single machine, while clusters, MPPs, and grids use
multiple computers to work on the same task. Specialized parallel computer architectures are
sometimes used alongside traditional processors, for accelerating specific tasks.

Parallel computer programs are more difficult to write than sequential ones,
because concurrency introduces several new classes of potential software bugs, of which race
conditions are the most common. Communication and synchronization between the different
subtasks are typically some of the greatest obstacles to getting good parallel program
performance. The maximum possible speed-up of a single program as a result of
parallelization is given by Amdahl's law.
Types of parallelism

There are different types of parallelism, namely: I/O Parallelism, Inter Query
Parallelism, Intra Query Parallelism, Inter Operator Parallelism, Independent Parallelism, Bit
Level Parallelism, Instruction Level Parallelism, Task Level Parallelism and Memory Level
Parallelism.

I/O Parallelism- Helps in reducing the time required to retrieve relations(data tables) from
disk by partitioning the relations on multiple disks in parallel.

Inter Query Parallelism- Queries / transactions (with reference to data base management
system) can be executed in parallel with one another. It increases transaction throughput and
it is the earliest form of parallelism to support, particularly in shared memory parallel
databases.

Intra Query Parallelism- execution of a single query in parallel on multiple processors /
disks. It is important to speed up the long-running queries.

Inter Operator Parallelism- Consider a join of four relations:


R1 |X| R2 |X| R3 |X| R4
Set up a pipeline that computes three joins in parallel.
i.e. let P1 be assigned Temp1 = R1 |X| R2,
P2 be assigned Temp2 = Temp1 |X| R3, and
P3 be assigned Temp3 = Temp2 |X| R4;
(or)
P1 be assigned R1 |X| R2,
P2 be assigned R3 |X| R4, and
P3 be assigned P1 |X| P2 (the join of the two intermediate results).

Independent Parallelism- Consider a join of four relations:


R1 |X| R2 |X| R3 |X| R4

Let P1 be assigned R1 |X| R2,
P2 be assigned R3 |X| R4, and
P3 be assigned P1 |X| P2.
Here P1 and P2 can work independently and P3 has to wait for input from P1 and P2.

Bit-level parallelism

From the advent of very-large-scale integration (VLSI) computer-chip
fabrication technology in the 1970s until about 1986, speed-up in computer architecture was
driven by doubling computer word size—the amount of information the processor can
manipulate per cycle. Increasing the word size reduces the number of instructions the
processor must execute to perform an operation on variables whose sizes are greater than the
length of the word. For example, where an 8-bit processor must add two 16-bit integers, the
processor must first add the 8 lower-order bits from each integer using the standard addition
instruction, then add the 8 higher-order bits using an add-with-carry instruction and the carry
bit from the lower order addition; thus, an 8-bit processor requires two instructions to
complete a single operation, where a 16-bit processor would be able to complete the
operation with a single instruction.
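The situation described above can be made concrete with a small C sketch (our own names; the
splitting into bytes mimics what an 8-bit processor has to do with an add and an add-with-carry
instruction, whereas a 16-bit processor would perform the addition in one instruction):

#include <stdio.h>
#include <stdint.h>

uint16_t add16_with_8bit_ops(uint16_t x, uint16_t y)
{
    uint8_t xl = x & 0xFF, xh = x >> 8;     /* split operands into low/high bytes */
    uint8_t yl = y & 0xFF, yh = y >> 8;

    uint16_t low  = (uint16_t)xl + yl;      /* first 8-bit addition               */
    uint8_t carry = (low > 0xFF) ? 1 : 0;   /* carry out of the low byte          */
    uint8_t high  = xh + yh + carry;        /* second addition, with the carry    */

    return ((uint16_t)high << 8) | (low & 0xFF);
}

int main(void)
{
    uint16_t a = 0x12F0, b = 0x0345;
    printf("%#x\n", (unsigned)add16_with_8bit_ops(a, b));   /* prints 0x1635 */
    return 0;
}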

Historically, 4-bit microprocessors were replaced with 8-bit, then 16-bit, then
32-bit microprocessors. This trend generally came to an end with the introduction of 32-bit
processors, which has been a standard in general-purpose computing for two decades. Not
until recently (c. 2003–2004), with the advent of x86-64 architectures, have 64-bit processors
become commonplace.

Instruction-level parallelism

A canonical five-stage pipeline in a RISC machine (IF = Instruction Fetch, ID = Instruction
Decode, EX = Execute, MEM = Memory access, WB = Register write back)

A computer program is, in essence, a stream of instructions executed by a
processor. These instructions can be re-ordered and combined into groups which are then
executed in parallel without changing the result of the program. This is known as instruction-
level parallelism. Advances in instruction-level parallelism dominated computer architecture
from the mid-1980s until the mid-1990s.

Modern processors have multi-stage instruction pipelines. Each stage in the
pipeline corresponds to a different action the processor performs on that instruction in that
stage; a processor with an N-stage pipeline can have up to N different instructions at different
stages of completion. The canonical example of a pipelined processor is a RISC processor,
with five stages: instruction fetch, decode, execute, memory access, and write back. The
Pentium 4 processor had a 35-stage pipeline.

A five-stage pipelined superscalar processor, capable of issuing two instructions per cycle. It
can have two instructions in each stage of the pipeline, for a total of up to 10 instructions
(shown in green) being simultaneously executed.
In addition to instruction-level parallelism from pipelining, some processors can
issue more than one instruction at a time. These are known as superscalar processors.
Instructions can be grouped together only if there is no data dependency between them.
Scoreboarding and the Tomasulo algorithm (which is similar to scoreboarding but makes use
of register renaming) are two of the most common techniques for implementing out-of-order
execution and instruction-level parallelism.

Task parallelism

Task parallelism is the characteristic of a parallel program that "entirely
different calculations can be performed on either the same or different sets of data". This
contrasts with data parallelism, where the same calculation is performed on the same or
different sets of data. Task parallelism involves the decomposition of a task into sub-tasks
and then allocating each sub-task to a processor for execution. The processors would then
execute these sub-tasks simultaneously and often cooperatively. Task parallelism does not
usually scale with the size of a problem.
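A minimal sketch of task parallelism using POSIX threads follows (the API choice and all names
are our own assumptions, not prescribed by the text): two entirely different calculations run at the
same time on the same data set. Compile with -lpthread and run; it prints sum = 25, max = 8.

#include <stdio.h>
#include <pthread.h>

static int data[6] = {5, 2, 3, 8, 1, 6};    /* the shared data set (read-only)  */

static void *sum_task(void *arg)            /* sub-task 1: compute the sum      */
{
    static long sum = 0;
    (void)arg;
    for (int i = 0; i < 6; i++) sum += data[i];
    return &sum;
}

static void *max_task(void *arg)            /* sub-task 2: find the maximum     */
{
    static long max = 0;
    (void)arg;
    for (int i = 0; i < 6; i++) if (data[i] > max) max = data[i];
    return &max;
}

int main(void)
{
    pthread_t t1, t2;
    void *r1, *r2;
    pthread_create(&t1, NULL, sum_task, NULL);   /* both sub-tasks run concurrently */
    pthread_create(&t2, NULL, max_task, NULL);
    pthread_join(t1, &r1);
    pthread_join(t2, &r2);
    printf("sum = %ld, max = %ld\n", *(long *)r1, *(long *)r2);
    return 0;
}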

Memory-level parallelism

Memory Level Parallelism (MLP) is a term in computer architecture referring to the
ability to have multiple memory operations pending at the same time, in particular cache misses
or translation lookaside buffer (TLB) misses. In a single processor, MLP may be
considered a form of instruction-level parallelism (ILP). However, ILP is often conflated with
superscalar, the ability to execute more than one instruction at the same time. E.g., a
processor such as the Intel Pentium Pro is five-way superscalar, with the ability to start
executing five different microinstructions in a given cycle, but it can handle four different
cache misses for up to 20 different load microinstructions at any time.

It is possible to have a machine that is not superscalar but which nevertheless
has high MLP. Arguably a machine that has no ILP, which is not superscalar, which executes
one instruction at a time in a non-pipelined manner, but which performs hardware prefetching
(not software instruction level prefetching) exhibits MLP (due to multiple prefetches
outstanding) but not ILP. This is because there are multiple memory operations outstanding,
but not instructions. Instructions are often conflated with operations.

Furthermore, multiprocessor and multithreaded computer systems may be said
to exhibit MLP and ILP due to parallelism—but not intra-thread, single process, ILP and
MLP. Often, however, we restrict the terms MLP and ILP to refer to extracting such
parallelism from what appears to be non-parallel single threaded code.

Some applications that can be solved using parallelism:


(1) Adding n numbers (say n is even), where pairs of numbers are added
simultaneously (in parallel) and the results are again paired for the next level of summation,
and so on, as illustrated below:

5 2 3 8 1 6 4 5
7 11 7 9

18 16

34

The sum of all the eight given numbers is found to be 34 by implementing parallelism, and the
time complexity of this method is O(log2 n); a small sketch of this scheme appears after these examples.

(2) ∫ ( x² – 3x³ + sin 12x ) dx - can be solved in parallel by splitting the given problem into
three sub-problems as shown below:
∫ x² dx ,  –3 ∫ x³ dx  and  ∫ sin 12x dx
These sub-problems can be solved in parallel (simultaneously) and then the results can be added
to get the final result (answer).

(3) Displaying a web-page with multimedia data dynamically (that is, each component of the
web-page can be prepared in parallel and displayed simultaneously).
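The following C sketch illustrates example (1) above (the use of OpenMP is our own assumption;
compile with -fopenmp, or without it the loop simply runs sequentially). The numbers are added in
adjacent pairs, level by level; within one level the pair-additions are independent and can be done
in parallel, and the number of levels is log2(n).

#include <stdio.h>

int main(void)
{
    int a[8] = {5, 2, 3, 8, 1, 6, 4, 5};
    int b[8];                                    /* results of the current level  */
    int n = 8;

    while (n > 1) {                              /* one tree level per iteration  */
        #pragma omp parallel for                 /* the n/2 additions are independent */
        for (int i = 0; i < n / 2; i++)
            b[i] = a[2 * i] + a[2 * i + 1];      /* adjacent pairs, as in the figure  */
        for (int i = 0; i < n / 2; i++)          /* carry the level forward           */
            a[i] = b[i];
        n = n / 2;
    }
    printf("sum = %d\n", a[0]);                  /* prints: sum = 34 */
    return 0;
}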

Important considerations in Parallel programming(computing)


Three points are considered very important with respect to
parallel programming: Data Independence- tasks cannot be run in parallel if one task
uses data that depends on another (say, the output of one task is the input of the other); Load
Balancing- distributing tasks among the processors so as to minimize idle time; and
Granularity- a measure of the ratio of computation to communication.

Some design issues and scalability in parallelism:


It is better to discuss some of the issues in designing parallel systems and also the
scalability of parallel systems.
Issues: (1) Parallel loading of data from external sources is needed in order to handle large
volumes of incoming data. (2) Resistance to failure of some processors or disks. (3) On-line
reorganization of data and schema changes must be supported. (4) Also need support for on-
line repartitioning and schema changes (executed concurrently with other processing).
Scalability: A scalable pair (parallel system, parallel algorithm) is one in which the speedup
is roughly linear in the number of processors (Gustafson's Law). Efficiency can be
measured as the ratio of the speedup to the number of processors.

Hardware

Memory and communication

Main memory in a parallel computer is either shared memory (shared between
all processing elements in a single address space), or distributed memory (in which each
processing element has its own local address space). Distributed memory refers to the fact
that the memory is logically distributed, but often implies that it is physically distributed as
well. Distributed shared memory and memory virtualization combine the two approaches,
where the processing element has its own local memory and access to the memory on non-
local processors. Accesses to local memory are typically faster than accesses to non-local
memory.

A logical view of a Non-Uniform Memory Access (NUMA) architecture. Processors in one directory
can access that directory's memory with less latency than they can access memory in the other
directory's memory.

Computer architectures in which each element of main memory can be
accessed with equal latency and bandwidth are known as Uniform Memory Access (UMA)
systems. Typically, that can be achieved only by a shared memory system, in which the
memory is not physically distributed. A system that does not have this property is known as a
Non-Uniform Memory Access (NUMA) architecture. Distributed memory systems have non-
uniform memory access.

Computer systems make use of caches: small, fast memories located close to
the processor which store temporary copies of memory values (nearby in both the physical
and logical sense). Parallel computer systems have difficulties with caches that may store the
same value in more than one location, with the possibility of incorrect program execution.
These computers require a cache coherency system, which keeps track of cached values and
strategically purges them, thus ensuring correct program execution. Bus snooping is one of
the most common methods for keeping track of which values are being accessed (and thus
should be purged).

Designing large, high-performance cache coherence systems is a very difficult
problem in computer architecture. As a result, shared-memory computer architectures do not
scale as well as distributed memory systems do. Processor–processor and processor–memory
communication can be implemented in hardware in several ways, including via shared (either
multiported or multiplexed) memory, a crossbar switch, a shared bus or an interconnect
network of a myriad of topologies including star, ring, tree, hypercube, fat hypercube (a
hypercube with more than one processor at a node), or n-dimensional mesh.

Parallel computers based on interconnect networks need to have some kind of
routing to enable the passing of messages between nodes that are not directly connected. The
medium used for communication between the processors is likely to be hierarchical in large
multiprocessor machines.

Classes of parallel computers


Parallel computers can be roughly classified according to the level at which the
hardware supports parallelism. This classification is broadly analogous to the distance
between basic computing nodes. These are not mutually exclusive; for example, clusters of
symmetric multiprocessors are relatively common.

Multicore computing

A multicore processor is a processor that includes multiple execution units
("cores") on the same chip. These processors differ from superscalar processors, which can
issue multiple instructions per cycle from one instruction stream (thread); in contrast, a
multicore processor can issue multiple instructions per cycle from multiple instruction
streams. IBM's Cell microprocessor, designed for use in the Sony PlayStation 3, is a
prominent multicore processor.

Each core in a multicore processor can potentially be superscalar as well; that
is, on every cycle, each core can issue multiple instructions from one instruction stream.
Simultaneous multithreading (of which Intel's HyperThreading is the best known) was an
early form of pseudo-multicoreism. A processor capable of simultaneous multithreading has
only one execution unit ("core"), but when that execution unit is idling (such as during a
cache miss), it uses that execution unit to process a second thread.

Symmetric multiprocessing

A symmetric multiprocessor (SMP) is a computer system with multiple identical
processors that share memory and connect via a bus. Bus contention prevents bus
architectures from scaling. As a result, SMPs generally do not comprise more than
32 processors. "Because of the small size of the processors and the significant reduction in
the requirements for bus bandwidth achieved by large caches, such symmetric
multiprocessors are extremely cost-effective, provided that a sufficient amount of memory
bandwidth exists."

Distributed computing

A distributed computer (also known as a distributed memory multiprocessor) is
a distributed memory computer system in which the processing elements are connected by a
network. Distributed computers are highly scalable.

Cluster computing
A Beowulf cluster

A cluster is a group of loosely coupled computers that work together closely,
so that in some respects they can be regarded as a single computer. Clusters are composed of
multiple standalone machines connected by a network. While machines in a cluster do not
have to be symmetric, load balancing is more difficult if they are not. The most common type
of cluster is the Beowulf cluster, which is a cluster implemented on multiple identical
commercial off-the-shelf computers connected with a TCP/IP Ethernet local area network.
Beowulf technology was originally developed by Thomas Sterling and Donald Becker. The
vast majority of the TOP500 supercomputers are clusters.

Massive parallel processing

A massively parallel processor (MPP) is a single computer with many
networked processors. MPPs have many of the same characteristics as clusters, but MPPs
have specialized interconnect networks (whereas clusters use commodity hardware for
networking). MPPs also tend to be larger than clusters, typically having "far more" than
100 processors. In a MPP, "each CPU contains its own memory and copy of the operating
system and application. Each subsystem communicates with the others via a high-speed
interconnect."

A cabinet from Blue Gene/L, ranked as the fourth fastest supercomputer in the world according to the
11/2008 TOP500 rankings. Blue Gene/L is a massively parallel processor; by the June 2009 TOP500
ranking it was the fifth fastest supercomputer in the world.
Grid computing

Grid computing is the most distributed form of parallel computing. It makes use
of computers communicating over the Internet to work on a given problem. Because of the
low bandwidth and extremely high latency available on the Internet, distributed computing
typically deals only with embarrassingly parallel problems. Many distributed computing
applications have been created, of which SETI@home and Folding@home are the best-
known examples.

Most grid computing applications use middleware, software that sits between
the operating system and the application to manage network resources and standardize the
software interface. The most common distributed computing middleware is the Berkeley
Open Infrastructure for Network Computing (BOINC). Often, distributed computing software
makes use of "spare cycles", performing computations at times when a computer is idling.

Specialized parallel computers

Within parallel computing, there are specialized parallel devices that remain
niche areas of interest. While not domain-specific, they tend to be applicable to only a few
classes of parallel problems.

Reconfigurable computing with field-programmable gate arrays

Reconfigurable computing is the use of a field-programmable gate array
(FPGA) as a co-processor to a general-purpose computer. An FPGA is, in essence, a
computer chip that can rewire itself for a given task. FPGAs can be programmed with
hardware description languages such as VHDL or Verilog. However, programming in these
languages can be tedious. Several vendors have created C to HDL languages that attempt to
emulate the syntax and semantics of the C programming language, with which most
programmers are familiar. The best known C to HDL languages are Mitrion-C, Impulse C,
DIME-C, and Handel-C. Specific subsets of SystemC based on C++ can also be used for this
purpose.

AMD's decision to open its HyperTransport technology to third-party vendors
has become the enabling technology for high-performance reconfigurable computing.
According to Michael R. D'Amour, Chief Operating Officer of DRC Computer Corporation,
"when we first walked into AMD, they called us 'the socket stealers.' Now they call us their
partners."

General-purpose computing on graphics processing units (GPGPU)

Nvidia's Tesla GPGPU card

General-purpose computing on graphics processing units (GPGPU) is a fairly
recent trend in computer engineering research. GPUs are co-processors that have been
heavily optimized for computer graphics processing. Computer graphics processing is a field
dominated by data parallel operations—particularly linear algebra matrix operations.

In the early days, GPGPU programs used the normal graphics APIs for executing
programs. However, several new programming languages and platforms have been built to do
general purpose computation on GPUs with both Nvidia and AMD releasing programming
environments with CUDA and Stream SDK respectively. Other GPU programming
languages include BrookGPU, PeakStream, and RapidMind. Nvidia has also released specific
products for computation in their Tesla series. The technology consortium Khronos Group
has released the OpenCL specification, which is a framework for writing programs that
execute across platforms consisting of CPUs and GPUs. AMD, Apple, Intel, Nvidia and
others are supporting OpenCL.

Application-specific integrated circuits

Several application-specific integrated circuit (ASIC) approaches have been devised for
dealing with parallel applications. Because an ASIC is (by definition) specific to a given
application, it can be fully optimized for that application. As a result, for a given application,
an ASIC tends to outperform a general-purpose computer. However, ASICs are created by X-
ray lithography. This process requires a mask, which can be extremely expensive. A single
mask can cost over a million US dollars. (The smaller the transistors required for the chip,
the more expensive the mask will be.) Meanwhile, performance increases in general-purpose
computing over time (as described by Moore's Law) tend to wipe out these gains in only one
or two chip generations. High initial cost, and the tendency to be overtaken by Moore's-law-
driven general-purpose computing, has rendered ASICs unfeasible for most parallel
computing applications. However, some have been built. One example is the peta-flop
RIKEN MDGRAPE-3 machine which uses custom ASICs for molecular dynamics
simulation.

Vector processors

The Cray-1 is the most famous vector processor.

A vector processor is a CPU or computer system that can execute the same instruction on
large sets of data. "Vector processors have high-level operations that work on linear arrays of
numbers or vectors. An example vector operation is A = B × C, where A, B, and C are each
64-element vectors of 64-bit floating-point numbers." They are closely related to Flynn's
SIMD classification.
Cray became famous for its vector-processing computers in the 1970s and
1980s. However, vector processors, both as CPUs and as full computer systems, have
generally disappeared. Modern processor instruction sets do include some vector-processing
instructions, such as AltiVec and Streaming SIMD Extensions (SSE).
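
To make the quoted example concrete, the C sketch below uses the SSE intrinsics mentioned
above to compute A = B × C four single-precision elements at a time. The function name is
illustrative, and the array length is assumed to be a multiple of four.

#include <xmmintrin.h>   /* SSE intrinsics for 128-bit float vectors */

/* Computes a[i] = b[i] * c[i], four floats per iteration.  n is assumed to
   be a multiple of 4; unaligned loads/stores are used so the arrays need
   not be 16-byte aligned. */
void vec_mul_sse(float *a, const float *b, const float *c, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 vb = _mm_loadu_ps(b + i);            /* load 4 floats from b */
        __m128 vc = _mm_loadu_ps(c + i);            /* load 4 floats from c */
        _mm_storeu_ps(a + i, _mm_mul_ps(vb, vc));   /* multiply and store   */
    }
}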

Software

Parallel programming languages

Concurrent programming languages, libraries, APIs, and parallel programming
models (such as algorithmic skeletons) have been created for programming parallel
computers. These can generally be divided into classes based on the assumptions they make
about the underlying memory architecture: shared memory, distributed memory, or shared
distributed memory. Shared-memory programming languages communicate by manipulating
shared-memory variables, whereas distributed-memory programming uses message passing.
POSIX Threads and OpenMP are two of the most widely used shared-memory APIs, and the
Message Passing Interface (MPI) is the most widely used message-passing API. One concept
used in parallel programming is the future, where one part of a program promises to deliver a
required datum to another part of the program at some future time.
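
As a small sketch of the shared-memory, directive-based style referred to above (assuming an
OpenMP-capable C compiler), the fragment below sums an array across threads; a
distributed-memory version would instead exchange partial sums as MPI messages. The
function name is illustrative.

#include <omp.h>   /* OpenMP (compile with -fopenmp or equivalent) */

/* Sums an array in parallel: the directive divides the loop iterations
   among threads, and the reduction clause combines the partial sums. */
double parallel_sum(const double *x, int n)
{
    double total = 0.0;
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; i++)
        total += x[i];
    return total;
}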

CAPS entreprise and PathScale are also coordinating their efforts to make the
HMPP (Hybrid Multicore Parallel Programming) directives an open standard called
OpenHMPP. The OpenHMPP directive-based programming model offers a syntax to
efficiently offload computations onto hardware accelerators and to optimize data movement
to/from the hardware memory. OpenHMPP directives describe remote procedure calls (RPCs)
on an accelerator device (e.g. a GPU) or, more generally, a set of cores. The directives annotate
C or Fortran code to describe two sets of functionality: the offloading of procedures
(denoted codelets) onto a remote device and the optimization of data transfers between the
CPU main memory and the accelerator memory.
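
The sketch below, modelled on published OpenHMPP examples, shows the two kinds of
directive: one declaring a codelet and one marking its call site. The codelet label, function,
and variable names are illustrative, and the exact clause spellings should be checked against
the OpenHMPP specification; without an HMPP compiler the pragmas are simply ignored and
the program runs sequentially.

#include <stdio.h>

#define SIZE 1024

/* Codelet declaration: a procedure that may be offloaded to an accelerator.
   The directive names the target and the in/out behaviour of the arguments. */
#pragma hmpp scale1 codelet, target=CUDA, args[v].io=inout
static void myscale(int n, float alpha, float v[n])
{
    for (int i = 0; i < n; i++)
        v[i] = alpha * v[i];
}

int main(void)
{
    static float data[SIZE];
    for (int i = 0; i < SIZE; i++)
        data[i] = (float)i;

    /* The callsite directive triggers the remote execution of the codelet
       and the associated transfers between CPU and accelerator memory. */
    #pragma hmpp scale1 callsite
    myscale(SIZE, 2.0f, data);

    printf("data[1] = %f\n", data[1]);
    return 0;
}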

Automatic parallelization

Automatic parallelization of a sequential program by a compiler is the holy
grail of parallel computing. Despite decades of work by compiler researchers, automatic
parallelization has had only limited success.

Mainstream parallel programming languages remain either explicitly parallel
or (at best) partially implicit, in which a programmer gives the compiler directives for
parallelization. A few fully implicit parallel programming languages exist: SISAL, Parallel
Haskell, SystemC (for FPGAs), Mitrion-C, VHDL, and Verilog.
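
One reason for this limited success can be seen in the two hand-written loops below (an
illustration, not the output of any particular compiler): the first has independent iterations
that could in principle be distributed across processors, while the second carries a dependence
from one iteration to the next and cannot be parallelized without restructuring.

/* Iterations are independent: a[i] depends only on b[i] and c[i], so a
   parallelizing compiler could, in principle, run them concurrently. */
void independent(float *a, const float *b, const float *c, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}

/* Loop-carried dependence: each a[i] needs a[i-1] from the previous
   iteration, so the iterations cannot simply be run in parallel. */
void dependent(float *a, const float *b, int n)
{
    for (int i = 1; i < n; i++)
        a[i] = a[i - 1] + b[i];
}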

Application check-pointing
As a computer system grows in complexity, the mean time between failures
usually decreases. Application check-pointing is a technique whereby the computer system
takes a "snapshot" of the application: a record of all current resource allocations and
variable states, akin to a core dump. This information can be used to restore the program if
the computer should fail. Check-pointing means that the program need only restart from its
last checkpoint rather than from the beginning. While check-pointing provides benefits in a
variety of situations, it is especially useful in highly parallel systems with a large number of
processors, as used in high-performance computing.
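
A minimal sketch of the idea in C, assuming a hypothetical long-running loop whose only
state is an iteration counter and a partial result, is given below; the file name and function
names are illustrative. Real checkpointing systems record far more state (open files,
communication buffers, and so on), usually with library or operating-system support.

#include <stdio.h>

/* Write the loop state to a checkpoint file; returns 0 on success. */
static int save_checkpoint(long i, double partial)
{
    FILE *f = fopen("checkpoint.dat", "wb");
    if (!f) return -1;
    fwrite(&i, sizeof i, 1, f);
    fwrite(&partial, sizeof partial, 1, f);
    fclose(f);
    return 0;
}

/* Try to restore the saved state; returns 1 if a checkpoint was found. */
static int load_checkpoint(long *i, double *partial)
{
    FILE *f = fopen("checkpoint.dat", "rb");
    if (!f) return 0;
    int ok = fread(i, sizeof *i, 1, f) == 1 &&
             fread(partial, sizeof *partial, 1, f) == 1;
    fclose(f);
    return ok;
}

On restart, the program calls load_checkpoint first and, if a checkpoint is found, resumes the
loop from the saved iteration instead of from iteration zero, calling save_checkpoint
periodically as it runs.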

Algorithmic methods

As parallel computers become larger and faster, it becomes feasible to solve
problems that previously took too long to run. Parallel computing is used in a wide range of
fields, from bioinformatics (protein folding and sequence analysis) to economics
(mathematical finance). Common types of problems found in parallel computing applications
are:

 Dense linear algebra
 Sparse linear algebra
 Spectral methods (such as Cooley–Tukey fast Fourier transform)
 n-body problems (such as Barnes–Hut simulation)
 Structured grid problems (such as Lattice Boltzmann methods)
 Unstructured grid problems (such as found in finite element analysis)
 Monte Carlo simulation (a short sketch follows this list)
 Combinational logic (such as brute-force cryptographic techniques)
 Graph traversal (such as sorting algorithms)
 Dynamic programming
 Branch and bound methods
 Graphical models (such as detecting hidden Markov models and constructing
Bayesian networks)
 Finite-state machine simulation
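
As a small illustration of the Monte Carlo entry above (a sketch assuming an OpenMP-capable
C compiler; the function names are illustrative), the code below estimates pi from independent
random trials. Because the trials are independent, they divide naturally among processors.

#include <stdint.h>

/* A small stateless mixing function (splitmix64); feeding it consecutive
   integers yields well-scrambled 64-bit values, so no random-number state
   needs to be shared between threads. */
static uint64_t mix64(uint64_t x)
{
    x += 0x9E3779B97F4A7C15ULL;
    x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ULL;
    x = (x ^ (x >> 27)) * 0x94D049BB133111EBULL;
    return x ^ (x >> 31);
}

/* Monte Carlo estimate of pi: throw random points into the unit square and
   count how many land inside the quarter circle.  OpenMP divides the trials
   among threads and sums the per-thread hit counts. */
double estimate_pi(long trials)
{
    long hits = 0;
    #pragma omp parallel for reduction(+:hits)
    for (long t = 0; t < trials; t++) {
        double x = (double)(mix64(2 * (uint64_t)t)     >> 11) / 9007199254740992.0;  /* 2^53 */
        double y = (double)(mix64(2 * (uint64_t)t + 1) >> 11) / 9007199254740992.0;
        if (x * x + y * y <= 1.0)
            hits++;
    }
    return 4.0 * (double)hits / (double)trials;
}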

Fault-tolerance

Parallel computing can also be applied to the design of fault-tolerant computer
systems, particularly via lockstep systems performing the same operation in parallel. This
provides redundancy in case one component should fail, and also allows automatic error
detection and error correction if the results differ.
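
A toy sketch of the voting idea, not any particular lockstep implementation, is shown below:
the same computation is performed three times and the majority result is kept, so a single
divergent result can be both detected and corrected.

/* Triple modular redundancy: run the same computation three times (e.g. on
   three lockstep units) and take a majority vote on the results.  A single
   divergent result is outvoted (corrected); if all three results differ,
   the fault is detected but cannot be corrected. */
int vote3(int a, int b, int c, int *fault_detected)
{
    *fault_detected = (a != b) || (b != c) || (a != c);
    if (a == b || a == c) return a;   /* a agrees with at least one copy */
    if (b == c)           return b;   /* a was the odd one out           */
    return a;                         /* all three differ: uncorrectable */
}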

Buffer Handling

In computing, buffer underrun or buffer underflow is a state occurring when
a buffer used to communicate between two devices or processes is fed with data at a lower
speed than the data is being read from it. This requires the program or device reading from
the buffer to pause its processing while the buffer refills. This can cause undesired and
sometimes serious side effects, because the data being buffered is generally not suited to
stop-start access of this kind.

General causes and solutions

The term should not be confused with buffer overflow, a condition where a
portion of memory being used as a buffer has a fixed size but is filled with more than that
amount of data. Whereas buffer overflows are usually the result of programming errors, and
thus preventable, buffer underruns are often the result of transitory issues involving the
connection which is being buffered: either a connection between two processes, with others
competing for CPU time, or a physical link, with devices competing for bandwidth.

The simplest guard against such problems is to increase the size of the buffer—if
an incoming data stream needs to be read at 1 bit per second, a buffer of 10 bits would allow
the connection to be blocked for up to 10 seconds before failing, whereas one of 60 bits
would allow a blockage of up to a minute. However, this requires more memory to be
available to the process or device, which can be expensive. It assumes that the buffer starts
full—requiring a potentially significant pause before the reading process begins—and that it
will always remain full unless the connection is currently blocked. If the data does not, on
average, arrive faster than it is needed, any blockages on the connection will be cumulative;
"dropping" one bit every minute on a hypothetical connection with a 60-bit buffer would lead
to a buffer underrun if the connection remained active for an hour. In real-time applications, a
large buffer size also increases the latency between input and output, which is undesirable in
low-latency applications such as video conferencing.
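
The trade-off can be made concrete with a small C sketch; the structure and field names
below are hypothetical, and the producer/consumer synchronization is omitted. The consumer
checks how much data is available before starting its next unit of work, and too little data
constitutes an underrun.

#include <stddef.h>

/* Illustrative ring buffer shared by a producer (filling it) and a
   consumer (draining it). */
typedef struct {
    unsigned char data[4096];   /* buffer capacity in bytes           */
    size_t head;                /* next position the producer writes  */
    size_t tail;                /* next position the consumer reads   */
} ring_buffer;

/* Bytes currently available to the consumer. */
static size_t rb_available(const ring_buffer *rb)
{
    return (rb->head + sizeof rb->data - rb->tail) % sizeof rb->data;
}

/* The consumer needs a given number of bytes for its next unit of work
   (for example, one audio frame).  If fewer bytes are available, that is
   an underrun: the consumer must pause until the producer catches up, or
   cope in some other way (drop a frame, repeat the last one, or output
   silence). */
static int rb_underrun(const ring_buffer *rb, size_t needed)
{
    return rb_available(rb) < needed;
}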

CD and DVD recording issues

Buffer underruns can cause serious problems during CD/DVD burning, because
once the writing is started, it cannot stop and resume flawlessly; thus the pause needed by the
underrun can cause the data on the disc to become invalid. Since the buffer is generally being
filled from a relatively slow source, such as a hard disk or another CD/DVD, a heavy CPU or
memory load from other concurrent tasks can easily exhaust the capacity of a small buffer.
Therefore, a technique called buffer underrun protection was implemented by various
CD/DVD writer vendors under trademarks such as Plextor BurnProof, Nero UltraBuffer,
Yamaha SafeBurn, JustLink, and Seamless Link. With this technique, the laser is able to stop
writing for any length of time and resume when the buffer is full again, leaving only an
extremely small gap between successive writes. Another way to protect against
the problem, when using rewritable media (CD-RW, DVD-RW, DVD-RAM), is to use the
UDF file system, which organizes data in smaller "packets", referenced by a single, updated
address table, which can therefore be written in shorter bursts.

Multimedia playback
If the frame buffer of the graphics controller is not updated, the picture on the
computer screen will appear to hang until the buffer receives new data. Many video player
programs (e.g. MPlayer) can drop frames if the system is overloaded, intentionally
allowing a buffer underrun in order to keep up the tempo of playback.

The buffer in an audio controller is a ring buffer. If an underrun occurs and the
audio controller is not stopped, it will either keep repeating the sound contained in the buffer,
which may hold about a quarter of a second of audio, or output silence, depending on the
implementation. This effect is commonly referred to as the "machine-gun" or Max Headroom
stuttering effect. It can happen if the operating system hangs during audio playback; an
error-handling routine (e.g. the blue screen of death) may eventually stop the audio
controller.

Questions:

2 Marks:

1) State the two conditions that must be satisfied by any sorting algorithm.

2) What is parallel computing?

3) State the advantage of parallel computing.

4) What are the characteristics of a good algorithm?

5) What do you mean by the complexities of an algorithm?

5 Marks:

1) Explain the various parameters based on which the sorting algorithms are classified.

2) Explain the process of Insertion Sort.

3) Explain the process of Quick Sort.

4) Explain the process of Merge Sort.

5) Explain the process of Heap Sort.

8 Marks:

1) Tabulate the Minimum, Average, and Maximum number of Moves and Exchanges of
elements in the following sorting algorithms:
(a) Insertion sort, (b) Quick Sort, (c) Merge Sort, and (d) Heap Sort

2) Illustrate the Insertion Sort.

3) Illustrate the Quick Sort.

4) Illustrate the Merge Sort.

5) Explain the Buffer handling for Parallel operation.

6) Explain the various types of parallelism.

16 Marks:

1) Illustrate the Non-recursive version of the Quick Sort.

2) Illustrate the Non-recursive version of the Merge Sort.

3) Illustrate the Heap Sort.

4) Illustrate the K-way merge sort.

5) Discuss the time complexities of the various sorting algorithms.
