Professional Documents
Culture Documents
Board of Studies
Subject Expert Panel
Content Review Panel
Copyright ©
Printed by
Universal Training Solutions Private Limited
Address
5th Floor, I-Space,
Bavdhan, Pune 411021.
All rights reserved. No part of this book may be reproduced, distributed, transmitted, broadcast, or stored in a
retrieval system in any form or by any means, including electronic, mechanical, photocopying, or recording.
Index
I. Content....................................................................... II
IV. Abbreviations..........................................................IX
Book at a Glance
Contents
Chapter I........................................................................................................................................................ 1
Introduction to Parallel Computing............................................................................................................ 1
Aim................................................................................................................................................................. 1
Objectives....................................................................................................................................................... 1
Learning outcome........................................................................................................................................... 1
1.1 Introduction............................................................................................................................................... 2
1.2 Types of Parallel Computing..................................................................................................................... 4
1.3 Need for Parallel Computing.................................................................................................................... 6
1.4 Applications of Parallel Computing.......................................................................................................... 6
Summary........................................................................................................................................................ 8
References...................................................................................................................................................... 8
Recommended Reading................................................................................................................................ 8
Self Assessment.............................................................................................................................................. 9
Chapter II.....................................................................................................................................................11
Laws of Parallel Computing.......................................................................................................................11
Aim................................................................................................................................................................11
Objectives......................................................................................................................................................11
Learning outcome..........................................................................................................................................11
2.1 Amdahl’s Law......................................................................................................................................... 12
2.2 Minsky’s Conjecture [Minsky 1970]...................................................................................................... 13
2.3 Moore’s Law........................................................................................................................................... 15
Summary...................................................................................................................................................... 17
References.................................................................................................................................................... 17
Recommended Reading.............................................................................................................................. 17
Self Assessment............................................................................................................................................ 18
Chapter III................................................................................................................................................... 20
Evolution of Computer Architecture......................................................................................................... 20
Aim............................................................................................................................................................... 20
Objectives..................................................................................................................................................... 20
Learning outcome......................................................................................................................................... 20
3.1 Introduction............................................................................................................................................. 21
3.2 Brief History of Computer Architecture................................................................................................. 21
3.2.1 First Generation (1945-1958)................................................................................................. 22
3.2.2 Second Generation (1958-1964)............................................................................................. 23
3.2.3 Third Generation (1964-1974)................................................................................................ 24
3.2.4 Fourth Generation (1974-present).......................................................................................... 26
Summary...................................................................................................................................................... 29
References.................................................................................................................................................... 29
Recommended Reading.............................................................................................................................. 29
Self Assessment............................................................................................................................................ 30
Chapter IV................................................................................................................................................... 32
System Architectures.................................................................................................................................. 32
Aim............................................................................................................................................................... 32
Objectives..................................................................................................................................................... 32
Learning outcome......................................................................................................................................... 32
4.1 Parallel Architectures.............................................................................................................................. 33
4.2 Single Instruction - Single Data (SISD).................................................................................................. 33
4.3 Single Instruction - Multiple Data (SIMD)............................................................................................. 34
4.4 Multiple Instruction - Multiple Data (MIMD)........................................................................................ 35
4.5 Shared Memory....................................................................................................................................... 35
4.6 Distributed Memory................................................................................................................................ 37
4.7 ccNUMA................................................................................................................................................. 39
4.8 Cluster..................................................................................................................................................... 39
4.9 Multiple Instruction - Single Data (MISD)............................................................................................. 40
4.10 Some Examples..................................................................................................................................... 40
4.10.1 Intel Pentium D..................................................................................................................... 40
4.10.2 Intel Core 2 Duo................................................................................................................... 41
4.10.3 AMD Athlon 64 X2 & Opteron............................................................................................ 41
4.10.4 IBM pSeries.......................................................................................................................... 41
4.10.5 IBM BlueGene...................................................................................................................... 41
4.10.6 NEC SX-8............................................................................................................................. 41
4.10.7 Cray XT3.............................................................................................................................. 41
4.10.8 SGI Altix 3700...................................................................................................................... 41
Summary...................................................................................................................................................... 42
References.................................................................................................................................................... 42
Recommended Reading.............................................................................................................................. 43
Self Assessment............................................................................................................................................ 44
Chapter V..................................................................................................................................................... 46
Parallel Programming Models and Paradigms........................................................................................ 46
Aim............................................................................................................................................................... 46
Objectives..................................................................................................................................................... 46
Learning outcome......................................................................................................................................... 46
5.1 Introduction............................................................................................................................................. 47
5.2 A Cluster Computer and its Architecture................................................................................................ 47
5.3 Parallel Applications and their Development......................................................................................... 49
5.3.1 Strategies for Developing Parallel Applications..................................................................... 50
5.3.2 Code Granularity and Levels of Parallelism........................................................................... 51
5.4 Parallel Programming Models and Tools................................................................................................ 51
5.4.1 Parallelising Compilers........................................................................................................... 52
5.4.2 Parallel Languages.................................................................................................................. 52
5.4.3 High Performance Fortran...................................................................................................... 52
5.4.4 Message Passing..................................................................................................................... 53
5.4.5 Virtual Shared Memory........................................................................................................... 53
5.4.6 Parallel Object-Oriented Programming.................................................................................. 54
5.4.7 Programming Skeletons.......................................................................................................... 54
5.5 Methodical Design of Parallel Algorithms............................................................................................. 54
5.5.1 Partitioning.............................................................................................................................. 55
5.5.2 Communication....................................................................................................................... 55
5.5.3 Agglomeration........................................................................................................................ 55
5.5.4 Mapping.................................................................................................................................. 55
5.6 Parallel Programming Paradigms........................................................................................................... 55
5.6.1 Choice of Paradigms............................................................................................................... 55
5.6.2 Task-Farming (or Master/Slave)............................................................................................. 57
5.6.3 Single-Program Multiple-Data (SPMD)................................................................................. 58
5.6.4 Data Pipelining....................................................................................................................... 59
5.6.5 Divide and Conquer................................................................................................................ 59
5.6.6 Speculative Parallelism........................................................................................................... 60
5.6.7 Hybrid Models........................................................................................................................ 61
5.7 Programming Skeletons and Templates.................................................................................................. 61
5.7.1 Programmability..................................................................................................................... 62
5.7.2 Reusability.............................................................................................................................. 62
5.7.3 Portability................................................................................................................................ 62
5.7.4 Efficiency................................................................................................................................ 62
Summary...................................................................................................................................................... 63
References.................................................................................................................................................... 63
Recommended Reading.............................................................................................................................. 63
Self Assessment............................................................................................................................................ 64
Chapter VI................................................................................................................................................... 66
Interconnection Networks for Parallel Computers.................................................................................. 66
Aim............................................................................................................................................................... 66
Objectives..................................................................................................................................................... 66
Learning outcome......................................................................................................................................... 66
6.1 Introduction............................................................................................................................................. 67
6.2 Network Topologies................................................................................................................................ 67
6.3 Metrics for Interconnection Networks.................................................................................................... 67
6.4 Classification of Interconnection Networks .......................................................................................... 67
6.5 Static Network........................................................................................................................................ 68
6.5.1 Completely-connected Network............................................................................................. 68
6.5.2 Star-Connected Network......................................................................................................... 68
6.5.3 Linear Array............................................................................................................................ 69
6.5.4 Mesh........................................................................................................................................ 69
6.5.5 Tree Network ......................................................................................................................... 70
6.5.6 Hypercube............................................................................................................................... 71
6.6 Dynamic Networks................................................................................................................................. 72
6.6.1 Bus-Based Networks............................................................................................................... 73
6.6.2 Crossbar Networks.................................................................................................................. 73
6.6.3 Multistage Networks............................................................................................................... 74
6.6.4 Omega Network...................................................................................................................... 75
Summary...................................................................................................................................................... 77
References.................................................................................................................................................... 77
Recommended Reading.............................................................................................................................. 77
Self Assessment............................................................................................................................................ 78
Chapter VII................................................................................................................................................. 80
Parallel Sorting............................................................................................................................................ 80
Aim............................................................................................................................................................... 80
Objectives..................................................................................................................................................... 80
Learning outcome......................................................................................................................................... 80
7.1 Introduction............................................................................................................................................. 81
7.2 Merge-based Parallel Sorting.................................................................................................................. 81
7.3 Splitter-Based Parallel Sorting................................................................................................................ 81
7.4 Splitter-based Basic Histogram Sort....................................................................................................... 82
7.5 Bitonic Sort............................................................................................................................................. 83
7.6 Sample Sort............................................................................................................................................. 84
7.7 Radix Sort................................................................................................................................................ 85
7.8 Histogram Sort........................................................................................................................................ 85
7.9 Odd-even Transposition Sort on Linear Array........................................................................................ 86
Summary...................................................................................................................................................... 87
References.................................................................................................................................................... 87
Recommended Reading.............................................................................................................................. 87
Self Assessment............................................................................................................................................ 87
Chapter VIII................................................................................................................................................ 90
Message-Passing Programming ................................................................................................................ 90
Aim............................................................................................................................................................... 90
Objectives..................................................................................................................................................... 90
Learning outcome......................................................................................................................................... 90
8.1 Principles of Message-Passing Programming......................................................................................... 91
8.2 The Building Blocks: Send and Receive Operations.............................................................................. 91
8.3 Non-Buffered Blocking Message Passing Operations............................................................................ 91
8.4 Buffered Blocking Message Passing Operations.................................................................................... 92
8.5 Non-Blocking Message Passing Operations........................................................................................... 93
8.6 Message Passing Interface (MPI)........................................................................................................... 94
8.7 Starting and Terminating the MPI Library.............................................................................................. 94
8.8 Communicators....................................................................................................................................... 94
8.9 Querying Information............................................................................................................................. 94
8.10 Sending and Receiving Messages......................................................................................................... 95
8.11 Avoiding Deadlocks.............................................................................................................................. 95
8.12 Sending and Receiving Messages Simultaneously............................................................................... 96
8.13 Creating and Using Cartesian Topologies............................................................................................. 97
8.14 Overlapping Communication with Computation.................................................................................. 97
8.15 Collective Communication and Computation Operations.................................................................... 98
8.16 Collective Communication Operations................................................................................................. 98
Summary.................................................................................................................................................... 100
References.................................................................................................................................................. 100
Recommended Reading............................................................................................................................ 100
Self Assessment.......................................................................................................................................... 101
List of Figures
Fig. 1.1 Serial computing................................................................................................................................ 3
Fig. 1.2 Parallel computing............................................................................................................................. 4
Fig. 2.1 Efficiency and the sequential fraction............................................................................................. 13
Fig. 2.2 Energy diagram showing loss of energy.......................................................................................... 14
Fig. 2.3 Speedup as a function of the number of processors........................................................................ 15
Fig. 2.4 Computing power doubles every 18 months, for the same price.................................................... 16
Fig. 3.1 Structure of IAS computer............................................................................................................... 23
Fig. 3.2 IBM 7094 ....................................................................................................................................... 24
Fig. 3.3 Relationship between Wafer, Chip, and Gate.................................................................................. 25
Fig. 3.4 CPU structure of IBM S/360-370 series......................................................................................... 25
Fig. 3.5 Intel 8080......................................................................................................................................... 27
Fig. 3.6 Motorola 68000 .............................................................................................................................. 27
Fig. 3.7 Intel386 CPU................................................................................................................................... 28
Fig. 3.8 Alpha 21264..................................................................................................................................... 28
Fig. 4.1 Summation of two numbers............................................................................................................. 33
Fig. 4.2 Summation of two numbers in a pipeline........................................................................................ 34
Fig. 4.3 Structure of a shared memory system.............................................................................................. 36
Fig. 4.4 Shared memory system with a bus connection................................................................................ 36
Fig. 4.5 Shared memory system with crossbar switch.................................................................................. 36
Fig. 4.6 UMA and NUMA............................................................................................................................ 37
Fig. 4.7 Distributed memory......................................................................................................................... 38
Fig. 4.8 Structure of a distributed memory system....................................................................................... 38
Fig. 4.9 Structure of a ccNUMA system....................................................................................................... 39
Fig. 4.10 Structure of a cluster of SMP nodes.............................................................................................. 40
Fig. 5.1 Cluster computer architecture.......................................................................................................... 48
Fig. 5.2 Porting strategies for parallel applications...................................................................................... 50
Fig. 5.3 Detecting parallelism....................................................................................................................... 52
Fig. 5.4 A static master/slave structure......................................................................................................... 58
Fig. 5.5 Basic structure of a SPMD program................................................................................................ 59
Fig. 5.6 Data pipeline structure..................................................................................................................... 59
Fig. 5.7 Divide and conquer as a virtual tree................................................................................................ 60
Fig. 6.1 Classification of interconnection networks (a) a static network (b) a dynamic network................ 67
Fig. 6.2 A completely-connected network of eight nodes............................................................................. 68
Fig. 6.3 Two representations of the star topology......................................................................................... 69
Fig. 6.4 Linear arrays (a) with no wraparound links (b) with wraparound link........................................... 69
Fig. 6.5 Meshes............................................................................................................................................. 70
Fig. 6.6 Ring . ............................................................................................................................................ 70
Fig. 6.7 Binary tree....................................................................................................................................... 71
Fig. 6.8 A fat tree network of 16 processing nodes....................................................................................... 71
Fig. 6.9 Hypercube........................................................................................................................................ 72
Fig. 6.10 Construction of hypercubes from hypercubes of lower dimension............................................... 72
Fig. 6.11 Bus based interconnects................................................................................................................. 73
Fig. 6.12 A completely non-blocking crossbar network connecting ‘p’ processors to ‘b’ memory banks... 74
Fig. 6.13 The schematic of a typical multistage interconnection network................................................... 74
Fig. 6.14 A perfect shuffle interconnection for eight inputs and outputs...................................................... 75
Fig. 6.15 A complete omega network connecting eight inputs and eight outputs........................................ 75
Fig. 6.16 An example of blocking in omega network . ................................................................................ 76
Fig. 7.1 Splitter-based parallel sorting.......................................................................................................... 82
Fig. 7.2 Splitter on key density function....................................................................................................... 82
Fig. 7.3 Basic histogram sort........................................................................................................................ 83
Fig. 7.4 Sample Sort..................................................................................................................................... 84
Fig. 7.5 Odd-even transposition sort on linear array.................................................................................... 86
Fig. 8.1 Message Passing Model.................................................................................................................. 91
Fig. 8.2 Non-Buffered blocking message passing operations....................................................................... 92
Fig. 8.3 Buffered blocking message passing operations............................................................................... 92
Fig. 8.4 Non-blocking message passing operations...................................................................................... 93
Fig. 8.5 An example use of the MPI_MINLOC and MPI_MAXLOC operators......................................... 98
List of Tables
Table 2.1 Speed-up and the number of processors....................................................................................... 15
Table 3.1 Computer generations................................................................................................................... 28
Table 5.1 Code Granularity and Parallelism................................................................................................. 51
Table 8.1 The minimal set of MPI routines.................................................................................................. 94
Table 8.2 MPI datatypes............................................................................................................................... 99
Abbreviations
ALU - Arithmetic/Logical Unit
CCC - Cube-Connected Cycles
ccNUMA - Cache Coherent Non-Uniform Memory Access
CODINE - Computing in Distributed Networked Environments
CPU - Central Processing Unit
DM - Distributed Memory
DSM - Distributed Shared Memory
ENIAC - Electronic Numerical Integrator and Calculator
FPU - Floating Point Unit
GCA - Grand Challenge Applications
HPF - High Performance Fortran
I/O - Input/Output
LAN - Local Area Network
LSF - Load Sharing Facility
MIMD - Multiple Instruction, Multiple Data Stream
MIPS - Microprocessor without Interlocked Pipeline Stages
MISD - Multiple Instruction, Single Data Stream
MITS - Micro Instrumentation Telemetry Systems
MPI - Message Passing Interface
NIC - Network Interface Cards
NOW - Network of Workstations
NUMA - Non-Uniform Memory Access
PC - Personal Computers
PE - Processing Element
PET - Personal Electronic Transactor
PUL - Parallel Utilities
PVM - Parallel Virtual Machine
RAM - Random-Access Memory
RIPS - Reduced Instruction Set Computer
RISC - Reduced Instruction Set Computing
SFC - Sequential Fraction of Computing
SIMD - Single Instruction, Multiple Data Stream
SISD - Single Instruction - Single Data
SM - Shared Memory
SMP - Symmetric Multiprocessing
SOP - Skeleton Oriented Programming
SSI - Single System Image
ULSI - Ultra Large Scale Integration
UMA - Uniform Memory Access
UNIVAC - Universal Automatic Computer
VLSI - Very-Large-Scale Integration
VSM - Virtual Shared Memory
Chapter I
Introduction to Parallel Computing
Aim
The aim of this chapter is to:
Objectives
The objectives of this chapter are to:
Learning outcome
At the end of this chapter, you will be able to:
Parallel Computing
1.1 Introduction
Parallel computing means to divide a job into several tasks and use more than one processor simultaneously to
perform these tasks. Assume you have developed a new estimation method for the parameters of a complicated
statistical model. To perform many simulations to assure the correctness of the method for reasonable numbers of
data values and for different values of parameters, you must generate simulated data, say, 100000 times for each
length and parameter value. The total simulation work requires a huge number of random number generations and
takes a long time on your PC. If you use 100 PCs in your institute to run these simulations simultaneously, you can expect the total execution time to drop to roughly 1/100 of the original. This is the simple idea behind parallel computing.
Computer scientists noticed the importance of parallel computing many years ago. The recent development of computer hardware has been very rapid: for roughly 40 years from 1961, the so-called 'Moore's law' has held that "the number of transistors per silicon chip has doubled approximately every 18 months." This means that the capacity of
memory chips and processor speeds have also increased roughly exponentially. In addition, hard disk capacity has
increased dramatically. Consequently, modern personal computers are more powerful than ‘super computers’ were
a decade ago. Even such powerful personal computers are not sufficient for our requirements. In statistical analysis,
while computers are becoming more powerful, data volumes are becoming larger and statistical techniques are
becoming more computer intensive. We are continuously forced to realise more powerful computing environments
for statistical analysis.
Parallel computing is thought to be the most promising technique. However, parallel computing has not been popular
among statisticians until recently. One reason is that parallel computing was available only on very expensive
computers, which were installed at some computer centers in universities or research institutes. Few statisticians
could use these systems easily. Further, software for parallel computing was not well prepared for general use.
Recently, cheap and powerful personal computers have changed this situation. The Beowulf project realised a powerful computer system by using many PCs connected by a network. This project was a milestone in parallel computer
development. Freely available software products for parallel computing have become more mature. Thus, parallel
computing has now become easy for statisticians to access.
The simultaneous use of more than one processor or computer to solve a problem is called parallel computing.
Traditionally, software has been written for serial computation:
For example, in serial computation a problem is turned into a single stream of instructions executed on one CPU during discrete time steps t1, t2, t3, ..., tN. A payroll program do_payroll() processes one employee at a time, computing each employee's hours, rate, tax, deductions and check (emp1_hrs, emp1_rate, emp1_tax, emp1_deduc, ..., emp2_hrs, ...) one instruction after another.
Parallel computing is the simultaneous use of multiple compute resources to solve a computational problem:
• The problem is run using multiple CPUs
• It is broken into discrete parts that can be solved concurrently
• Each part is further broken down into a series of instructions
• Instructions from each part execute simultaneously on different CPUs
Here the problem is broken into concurrent parts, each with its own stream of instructions on a separate CPU, all executing during the same time steps t1, t2, t3, ..., tN. In the payroll example, do_payroll() runs simultaneously for emp1, emp2, emp3, ..., empN, with one employee assigned to each CPU.
Parallel computing is an evolution of serial computing that attempts to emulate what has always been the state of
affairs in the natural world: many complex, interrelated events happening at the same time, yet within a sequence.
Bit level parallelism
Bit-level parallelism is a form of parallelism based on increasing the processor's word size. From the advent of Very-Large-Scale Integration (VLSI) computer-chip fabrication technology in the 1970s until about 1986, speed-up in computer architecture was driven by doubling the computer word size, the amount of information the processor can manipulate per cycle. Increasing the word size reduces the number of instructions the processor must execute to perform an operation on variables whose sizes are greater than the length of the word.
For example, where an 8-bit processor must add two 16-bit integers, the processor must first add the 8 lower-order
bits from each integer using the standard addition instruction, then add the 8 higher-order bits using an add-with-
carry instruction and the carry bit from the lower order addition; thus, an 8-bit processor requires two instructions
to complete a single operation, where a 16-bit processor would be able to complete the operation with a single
instruction.
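The two-step addition described above can be sketched in Python (a simulation of the ADD and add-with-carry instructions, not actual machine code; the function names are illustrative):

```python
def add8(a, b, carry_in=0):
    """Simulate an 8-bit ADD/ADC instruction: returns (8-bit result, carry-out)."""
    total = a + b + carry_in
    return total & 0xFF, total >> 8

def add16_on_8bit_cpu(x, y):
    """Add two 16-bit integers the way an 8-bit CPU must: two instructions."""
    lo, carry = add8(x & 0xFF, y & 0xFF)       # ADD: lower-order bytes
    hi, _     = add8(x >> 8, y >> 8, carry)    # ADC: higher-order bytes plus carry
    return (hi << 8) | lo

print(hex(add16_on_8bit_cpu(0x12FF, 0x0001)))  # 0x1300
```

A 16-bit processor performs the same addition in a single instruction, which is exactly the saving bit-level parallelism provides.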
Historically, 4-bit microprocessors were replaced with 8-bit, then 16-bit, then 32-bit microprocessors. This trend generally came to an end with the introduction of 32-bit processors, which were a standard in general-purpose computing for two decades. Only recently (c. 2003–2004), with the advent of x86-64 architectures, have 64-bit processors become commonplace.
Instruction level parallelism
Modern processors have multi-stage instruction pipelines. Each stage in the pipeline corresponds to a different action the processor performs on the instruction in that stage; a processor with an N-stage pipeline can have up to N different instructions at different stages of completion. The canonical example of a pipelined processor is a RISC processor, with five stages: instruction fetch, decode, execute, memory access and write back. (The Pentium 4 processor had a 35-stage pipeline.) A five-stage pipelined superscalar processor capable of issuing two instructions per cycle can have two instructions in each stage of the pipeline, for a total of up to 10 instructions being executed simultaneously.
In addition to instruction-level parallelism from pipelining, some processors can issue more than one instruction
at a time. These are known as superscalar processors. Instructions can be grouped together only if there is no data
dependency between them. Scoreboarding and the Tomasulo algorithm (which is similar to Scoreboarding but
makes use of register renaming) are two of the most common techniques for implementing out-of-order execution
and instruction-level parallelism.
Task parallelism
Task parallelism is a form of parallelisation in which different tasks of a program are distributed across different processors. It is also called function parallelism. Task parallelism is the characteristic of a parallel program that "entirely different calculations can be performed on either the same or different sets of data". This contrasts with data parallelism, where the same calculation is performed on the same or different sets of data. Task parallelism does not usually scale with the size of a problem.
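A minimal sketch of task parallelism in Python, assuming two illustrative tasks (mean and spread) that perform entirely different calculations on the same data set:

```python
from concurrent.futures import ThreadPoolExecutor

# Two entirely different calculations run as separate tasks
# on the same data -- the essence of task parallelism.
def mean(data):
    return sum(data) / len(data)

def spread(data):
    return max(data) - min(data)

data = [4, 8, 15, 16, 23, 42]

with ThreadPoolExecutor(max_workers=2) as pool:
    f1 = pool.submit(mean, data)    # task 1 on one worker
    f2 = pool.submit(spread, data)  # task 2 on another worker
    print(f1.result(), f2.result())
```

Note that adding more data does not create more tasks here: there are only two functions to run, which is why task parallelism does not usually scale with problem size.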
Data parallelism
Data parallelism is parallelism inherent in program loops, which focuses on distributing the data across different
computing nodes to be processed in parallel. Parallelising loops often leads to similar (not necessarily identical)
operation sequences or functions being performed on elements of a large data structure. Many scientific and
engineering applications exhibit data parallelism.
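A minimal sketch of data parallelism in Python: the same function is applied to different chunks of one data structure (the chunking and worker count are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def scale(chunk, factor=2.0):
    # The same operation is applied to every element of the chunk.
    return [factor * x for x in chunk]

data = list(range(8))
# Distribute the data (not the code) across workers: each worker
# runs the identical function on its own chunk of the array.
chunks = [data[0:4], data[4:8]]

with ThreadPoolExecutor(max_workers=2) as pool:
    parts = list(pool.map(scale, chunks))

result = parts[0] + parts[1]
print(result)
```

Unlike the task-parallel case, doubling the data simply doubles the work each chunk carries, so data parallelism scales naturally with problem size.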
Provide concurrency
A single compute resource can only do one thing at a time. Multiple computing resources can be doing many things
simultaneously. For example, the Access Grid provides a global collaboration network where people from around
the world can meet and conduct work ‘virtually’.
Use of non-local resources
Compute resources on a wide area network, or even the Internet, are used when local compute resources are scarce.
For example:
• SETI@home uses 2.9 million computers in 253 countries (July 2011).
• Folding@home uses over 450,000 CPUs globally (July 2011).
Today, commercial applications provide an equal or greater driving force in the development of faster computers.
These applications require the processing of large amounts of data in sophisticated ways. For example:
• Databases, data mining
• Oil exploration
• Web search engines, web based business services
• Medical imaging and diagnosis
• Pharmaceutical design
• Management of national and multi-national corporations
• Financial and economic modeling
• Advanced graphics and virtual reality, particularly in the entertainment industry
• Networked video and multi-media technologies
• Collaborative work environments
Summary
• Parallel computing means to divide a job into several tasks and use more than one processor simultaneously to
perform these tasks.
• The simultaneous use of more than one processor or computer to solve a problem is called parallel
computing.
• Bit-level parallelism is a form of parallelism based on increasing the processor's word size. It reduces the number of instructions that the system must run in order to perform a task on variables which are greater in size.
• A computer program is, in essence, a stream of instructions executed by a processor.
• Task parallelism is a form of parallelisation in which different tasks of a program are distributed across different processors.
• Data parallelism is parallelism inherent in program loops, which focuses on distributing the data across different
computing nodes to be processed in parallel.
• Parallel computers can be built from cheap, commodity components.
• A single compute resource can only do one thing at a time.
• Multiple computing resources can be doing many things simultaneously.
• Parallelising loops often leads to similar (not necessarily identical) operation sequences or functions being
performed on elements of a large data structure.
References
• Grama, 2010. An Introduction to Parallel Computing: Design and Analysis of Algorithms, 2/e, Pearson Education
India.
• Bhujade, M. R., 1995. Parallel Computing, New Age International.
• Barney, B., Introduction to Parallel Computing [Online] Available at: <https://computing.llnl.gov/tutorials/
parallel_comp/> [Accessed 21 June 2012].
• Parallel Computing [Online] Available at: <http://www.cs.ucf.edu/courses/cot4810/fall04/presentations/
Parallel_Computing.ppt> [Accessed 21 June 2012].
• 2011. Parallel Vs. Serial [Video Online] Available at: <http://www.youtube.com/watch?v=Jeo83akN44o>
[Accessed 21 June 2012].
• 2012. 4 Exploiting Instruction Level Parallelism [Video Online] Available at: <http://www.youtube.com/
watch?v=54E9LGG1hnQ> [Accessed 21 June 2012].
Recommended Reading
• Lafferty, E. L., 1993. Parallel Computing: An Introduction, William Andrew.
• Pacheco, P., 2011. An Introduction to Parallel Programming, Elsevier.
• Foster, I., 1995. Designing and Building Parallel Programs: Concepts and Tools for Parallel Software
Engineering, Addison-Wesley.
Self Assessment
1. In parallel computing, the computer resources can be ________.
a. a single computer
b. a single computer with multiple processors
c. an arbitrary number of computers
d. a single computer with single processor
3. Which form of parallelism is inherent in program loops and focuses on distributing the data across different computing nodes to be processed in parallel?
a. Bit level parallelism
b. Instruction level parallelism
c. Task parallelism
d. Data parallelism
4. In which form of parallelism are different tasks of a program distributed across different processors?
a. Bit level parallelism
b. Instruction level parallelism
c. Task parallelism
d. Data parallelism
6. Which form of parallelism is a measure of how many operations can be carried out by a processor at the same time?
a. Bit level parallelism
b. Instruction level parallelism
c. Task parallelism
d. Data parallelism
7. _________ reduces the number of instructions that the system must run in order to perform a task on variables which are greater in size.
a. Bit level parallelism
b. Instruction level parallelism
c. Task parallelism
d. Data parallelism
9. __________is the characteristic of a parallel program that “entirely different calculations can be performed on
either the same or different sets of data”.
a. Bit level parallelism
b. Instruction level parallelism
c. Task parallelism
d. Data parallelism
10. Which parallelism does not usually scale with the size of a problem?
a. Bit level parallelism
b. Instruction level parallelism
c. Task parallelism
d. Data parallelism
Chapter II
Laws of Parallel Computing
Aim
The aim of this chapter is to:
Objectives
The objectives of this chapter are to:
Learning outcome
At the end of this chapter, you will be able to:
Amdahl's law states that the speedup S of a parallel computer is limited by:

S ≤ 1 / (f + (1 – f)/n)

where,
f = sequential fraction for a given program
n = number of processors

Proof of the above statement is quite simple. Assuming that the total time is T, the sequential component of this time will be f.T. The parallelisable fraction of time is therefore (1 – f).T. The time (1 – f).T can be reduced by employing n processors to operate in parallel, giving the time (1 – f).T/n. The total time taken by the parallel computer is thus at least f.T + (1 – f).T/n, while the sequential processor takes time T. The speedup S is thus limited by:

S ≤ T / (f.T + (1 – f).T/n)

i.e.,

S ≤ 1 / (f + (1 – f)/n) ≤ 1/f

This result throws some light on the way parallel computers should be built. A computer architect can use the following two basic approaches while designing a parallel computer:
• Connect a small number of extremely powerful processors (few elephants approach).
• Connect a very large number of inexpensive processors (million ants approach).
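The Amdahl bound can be explored numerically with a short sketch (the 5% sequential fraction is an illustrative value):

```python
def amdahl_speedup(f, n):
    """Upper bound on speedup: S <= 1 / (f + (1 - f)/n)."""
    return 1.0 / (f + (1.0 - f) / n)

# Even with 1000 processors, a 5% sequential fraction caps speedup below 1/f = 20.
for n in (10, 100, 1000):
    print(n, round(amdahl_speedup(0.05, n), 2))
```

Increasing n yields rapidly diminishing returns once (1 – f)/n becomes small compared with f, which is exactly the lesson of the law.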
Consider two parallel computers Me and Ma. The computer Me is built using the approach of few elephants (very
powerful processors) such that each processor is capable of executing computations at a speed of M Megaflops.
The computer Ma on the other hand is built using ants approach (off-the-shelf inexpensive processors) and each
processor of Ma executes r.M Megaflops, where 0 < r < 1.
Theorem 1
If the machine Me attempts a computation whose sequential fraction f is greater than r, then Ma will execute the
computations more slowly compared to a single processor of the computer Me.
Proof:
Let
W = Total work (computing job)
M = speed (in Mflop) of a PE of machine Me, then
r.M = speed of a PE of Ma (r is the fraction)
f.W = sequential work of the job
T(Ma) = time taken by Ma for the work W
T(Me) = time taken by Me for the work W
Time taken by any computer is:
T=
=+
From equation 1 and 2, it is clear that if f > r then T(Ma) > T(Me). The above theorem is quite interesting. It gives
guidelines in building parallel processor using the expensive and the inexpensive technology. The above theorem
implies that a sequential component fraction acceptable for the machine Me may not be acceptable for the machine
Ma. It does no good to have a computing power of a very large number of processors that goes waste. Processors
must maintain some level of efficiency. Let us see how the efficiency ‘e’ and sequential fraction ‘f’ are related.
The efficiency is E = S/n and hence,

S = E.n ≤ 1 / (f + (1 – f)/n)

E ≤ 1 / (n.f + (1 – f))
This result states that for a constant efficiency, the fraction of sequential component of an algorithm must be inversely
proportional to the number of processors. Figure below shows the graph of sequential component and efficiency
with ‘n’ as a parameter.
(The graph plots efficiency E, from 0 to 1, against the sequential fraction f, from 0 to 1, with one curve for each value of n from n = 2 to n = 64.)
Fig. 2.1 Efficiency and the sequential fraction
(Source: http://www.newagepublishers.com/samplechapter/000573.pdf)
The idea of using very large number of processors may thus be good only for specific applications for which it is
known that the algorithms have a known very small sequential fraction f.
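The relation can be checked numerically; the sketch below evaluates the efficiency at the Amdahl bound for the two extreme curves of Fig. 2.1, n = 2 and n = 64:

```python
def efficiency(f, n):
    """E = S/n at the Amdahl bound, i.e., E <= 1 / (n*f + (1 - f))."""
    return 1.0 / (n * f + (1.0 - f))

# Efficiency collapses with the sequential fraction f much
# faster for n = 64 than for n = 2.
for f in (0.0, 0.1, 0.5):
    print(f, round(efficiency(f, 2), 2), round(efficiency(f, 64), 2))
```

At f = 0.1 the 64-processor machine already runs below 15% efficiency, while the 2-processor machine stays above 90%, illustrating why a large processor count pays off only for very small f.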
Here C1 is the vector of Booleans and E1 and E2 are two statements. Every PE of a SIMD machine has one element
of the condition vector C. If this condition is true, then it will execute the corresponding then part (E1 statement).
PEs having false element of C1 shall execute the else part (E2 statement). For a SIMD computer, execution of the
above has to be done sequentially. The first set of PEs (with true elements of C1) executes the then part. Other PEs
are masked off from doing the work. They execute NOP. After the first set completes the execution, the second set
(PEs having false elements of C1) shall execute the else part and the first set of PEs shall be idle. If there is another
IF nested in the E1/E2 statements, one more division amongst the active PEs shall take place. This nesting may
go on repetitively. The figure below shows the computing power of the PEs over time. This diagram assumes that the true and false values in the condition vector are divided half and half. The successive time slots show that the number of working PEs is reduced by a factor of two.
(The diagram plots computing power, i.e., the number of active processing elements, against time slots 1 to 13: the count of working PEs halves in successive slots, falling from N towards 1.)
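The two-phase masked execution can be sketched sequentially in Python; each loop below stands for one SIMD time slot, and e1 and e2 are illustrative element-wise statements:

```python
def simd_if_else(cond, data, e1, e2):
    """Simulate SIMD execution of IF cond THEN E1 ELSE E2.

    Slot 1: PEs whose condition element is True run E1; the rest do NOP.
    Slot 2: the remaining PEs run E2; the first set is idle.
    Both slots take full machine time, so part of the machine idles in each.
    """
    result = [None] * len(data)
    for i in range(len(data)):          # time slot 1: 'then' PEs active
        if cond[i]:
            result[i] = e1(data[i])
    for i in range(len(data)):          # time slot 2: 'else' PEs active
        if not cond[i]:
            result[i] = e2(data[i])
    return result

out = simd_if_else([True, False, True, False], [1, 2, 3, 4],
                   lambda x: x + 10, lambda x: -x)
print(out)  # [11, -2, 13, -4]
```

With a half-and-half condition vector, only half the PEs do useful work in each slot; a nested IF inside e1 or e2 would halve the active set again.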
The Minsky’s conjecture was very bad for the proponents of the large scale parallel architectures. Flynn and Hennessy
gave yet another formulation that is presented in the following theorem.
Theorem 2
The speedup of an n-processor parallel system is limited as:
S ≤ n/log2n
Proof:
Let T1 = time taken by a single-processor system, and
fk = probability of k PEs working simultaneously (fk can be thought of as the program fraction with a degree of parallelism k).
A plot of speedup as a function of n for the bounds discussed above, together with the typical expected (linear) speedup, is presented in the figure below. It is to be noted that the speedup curve for large values of n has an almost linear shape, because the function n/log n approximates a straight line when n > 1000.
(The log–log plot shows speedup against the number of processors n, from 1 to 1000, for three curves: the ideal linear speedup S = n, the bound n/log2n, and the Minsky bound log2n.)
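The three curves of the plot can be tabulated with a short sketch:

```python
import math

# Tabulate the three speedup curves: the ideal linear speedup n,
# the Flynn-Hennessy bound n/log2(n) and the Minsky bound log2(n).
for n in (2, 8, 64, 1024):
    print(n, round(math.log2(n), 1), round(n / math.log2(n), 1))
```

The table makes the gap concrete: at n = 1024 the Minsky bound allows a speedup of only 10, while n/log2n allows roughly 102 out of an ideal 1024.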
Lee [1980] presented an empirical table for speedup for different values of n. His observations are listed in table
below.
The popular perception of Moore's Law is that computer chips are compounding in their complexity at near constant per-unit cost. This abstraction of Moore's Law relates to the compounding of transistor density in two dimensions. Other abstractions relate to speed (the signals have less distance to travel) and computational power (speed × density).
Fig. 2.4 Computing power doubles every 18 months, for the same price
(Source: http://www.scs.ryerson.ca/mfiala/courses/cps310_win09/murdocca_Ch01CAO.pdf)
Summary
• Amdahl’s law is based on a very simple observation. A program requiring total time T for the sequential execution
shall have some part called sequential fraction of computing (SFC) which is inherently sequential (that cannot
be made to run in parallel).
• Minsky’s conjecture states that due to the need for parallel processors to communicate with each other, speedup
increases as the logarithm of the number of processing elements.
• For a parallel computer with n processors, the speedup S shall be proportional to log2n.
• The Minsky’s conjecture was very bad for the proponents of the large scale parallel architectures.
• Moore’s Law is commonly reported as a doubling of transistor density every 18 months.
• The popular perception of Moore’s Law is that computer chips are compounding in their complexity at near
constant per unit cost.
References
• Pacheco, P., 2011. An Introduction to Parallel Programming, Elsevier.
• Bhujade, M. R., 1995. Parallel Computing, New Age International.
• Thiébaut, D., Parallel Programming in C for the Transputer [Online] Available at: <http://maven.smith.
edu/~thiebaut/transputer/chapter8/chap8-2.html> [Accessed 21 June 2012].
• Parallel Computer Taxonomy - Conclusion [Online] Available at: <http://www.gigaflop.demon.co.uk/comp/
chapt8.htm> [Accessed 21 June 2012].
• 2012. Amdahl’s Law [Video Online] Available at: <http://www.youtube.com/watch?v=r7Ffc4WOLb8> [Accessed
21 June 2012].
• 2006. Moore’s Law [Video Online] Available at: <http://www.youtube.com/watch?v=XvaQcuLr2cE> [Accessed
21 June 2012].
Recommended Reading
• Padua, D., 2011. Encyclopedia of Parallel Computing, Volume 4, Springer.
• Null, L. & Lobur, J., 2010. The Essentials of Computer Organisation and Architecture, Jones & Bartlett
Publishers.
• Gebali, F., 2011. Algorithms and Parallel Computing, John Wiley & Sons.
Self Assessment
1. SFC refers to ___________________.
a. Sequential Fraction of Computing
b. Specification Form Computer
c. Software Flow Control
d. Static Frequency Converter
3. For a constant _______, the fraction of sequential component of an algorithm must be inversely proportional
to the number of processors.
a. speed up
b. efficiency
c. speed
d. sequential fraction
4. _________states that due to the need for parallel processors to communicate with each other, speedup increases
as the logarithm of the number of processing elements.
a. Amdahl’s law
b. Moore’s law
c. Minsky’s conjecture
d. Flynn’s law
5. Moore’s law is commonly reported as a doubling of transistor density every ______ months.
a. 8
b. 18
c. 11
d. 7
6. The popular perception of Moore’s Law is that computer chips are compounding in their complexity at near
constant _________.
a. per unit cost
b. per unit time
c. per unit speed
d. per unit efficiency
8. Which of the following statements is false?
a. Gordon Moore was the co-founder of Intel.
b. For a parallel computer with n processors, the speedup S shall be proportional to log2n2.
c. The proof why parallel computers behave this way was first given by Flynn [1972].
d. The idea of using very large number of processors may thus be good only for specific applications for which
it is known that the algorithms have a known very small sequential fraction f.
10. For a parallel computer with n processors, the speedup S shall be proportional to _______.
a. log2n2
b. log2n3
c. log2n
d. log2n4
Chapter III
Evolution of Computer Architecture
Aim
The aim of this chapter is to:
Objectives
The objectives of this chapter are to:
Learning outcome
At the end of this chapter, you will be able to:
• understand microelectronics
3.1 Introduction
The evolution of computers has been characterised by increasing processor speed, decreasing component size,
increasing memory size, and increasing I/O capacity and speed. One factor responsible for the great increase in
processor speed is the shrinking size of the microprocessor components; this reduces the distance between components
and hence increases speed. However, the real gains in speed in recent years have come from the organisation of the
processor, including heavy use of pipelining and parallel execution techniques and the use of speculative execution
techniques, which results in the tentative execution of future instructions that might be needed. All of these techniques
are designed to keep the processor busy as much of the time as possible.
A critical issue in computer system design is balancing the performance of the various elements, so that gains in
performance in one area are not dragged behind by a lag in other areas. In particular, processor speed has increased
more rapidly than memory access time. A variety of techniques are used to compensate for this mismatch, including
caches, wider data paths from memory to processor, and more intelligent memory chips.
Thus, computer architecture is the science and art of selecting and interconnecting hardware components to create
computers that meet functional performance and cost goals. It refers to those attributes of the system that are visible
to a programmer and have a direct impact on the execution of a program. A computer architect coordinates many levels of abstraction and translates business and technology drivers into efficient systems for computing tasks.
Computer architecture concerns machine organisation, interfaces, application, technology, measurement and
simulation. It includes:
• Instruction set
• Data formats
• Principle of operation (textual or formal description of every operation)
• Features (organisation of programmable storage, registers used, interrupts mechanism, and so on)
In short, it is the combination of instruction set architecture, machine organisation and the underlying hardware.
The different usages of the ‘computer architecture’ term:
• The design of a computer’s CPU architecture, instruction set, addressing modes.
• Description of the requirements (especially speed and interconnection requirements) or the design implementation for the various parts of a computer, such as memory, motherboard, electronic peripherals, or most commonly the CPU.
• Architecture is often defined as the set of machine attributes that a programmer should understand in order to
successfully program the specific computer. So, in general, computer architecture refers to attributes of the
system visible to a programmer that have a direct impact on the execution of a program.
Modern high performance CPU chip designs incorporate aspects of both architectures. On chip cache memory is
divided into an instruction cache and a data cache. Harvard architecture is used as the CPU accesses the cache and
von Neumann architecture is used for off chip memory access.
• ENIAC (Electronic Numerical Integrator and Calculator), 1943-46, by J. Mauchly and J. Presper Eckert,
first general purpose electronic computer. The size of its numerical word was 10 decimal digits, and it could
perform 5000 additions and 357 multiplications per second. It was built to calculate trajectories for ballistic
shells during World War II, programmed by setting switches and plugging and unplugging cables. It used 18,000 tubes, weighed 30 tons and consumed 160 kilowatts of electrical power.
• Whirlwind computer, 1949, by Jay Forrester, with 5000 vacuum tubes; its main innovation was magnetic core memory.
• UNIVAC (Universal Automatic Computer), 1951, the first commercial computer, built by Eckert and Mauchly; it cost around $1 million and 46 machines were sold. UNIVAC had an add time of 120 microseconds, a multiply time of 1800 microseconds and a divide time of 3600 microseconds, and used magnetic tape as input.
• IBM’s 701, 1953, the first commercially successful general-purpose computer. The 701 had electrostatic storage
tube memory, used magnetic tape to store information, and had binary, fixed-point, single-address hardware.
• IBM 650, the first mass-produced computer (450 machines sold in one year).
[Figure: structure of the IAS machine - the arithmetic-logic unit (with AC, MQ and DR registers) and the program
control unit (with IBR, PC, IR and AR registers and control circuits) exchange instructions, data and addresses with
main memory and input/output equipment.]
The IBM 7090 was the most powerful data processing system of its time. The fully transistorised system had computing
speeds six times faster than those of its vacuum-tube predecessor, the IBM 709. Although the IBM 7090 was a general-
purpose data processing system, it was designed with special attention to the needs of the design of missiles, jet engines,
nuclear reactors and supersonic aircraft. It contained more than 50,000 transistors plus extremely fast magnetic core
storage. The system could simultaneously read and write at the rate of 3,000,000 bits per second when eight
data channels were in use. In 2.18 millionths of a second, it could locate and make ready for use any of 32,768 data or
instruction numbers (each of 10 digits) in the magnetic core storage. The 7090 could perform any of the following
operations in one second: 229,000 additions or subtractions, 39,500 multiplications, or 32,700 divisions. The basic
cycle time was 2.18 µs.
Microelectronics
It means ‘small electronics’. The computer consists of logic gates, memory cells and interconnections. It is
manufactured on a semiconductor such as silicon. Many transistors can be produced on a single wafer of silicon.
[Figure: from silicon wafer to packaged chip - many chips are produced on a single wafer; each chip contains many
gates.]
The IBM System/360 Model 91 was introduced in 1966 as the fastest, most powerful computer then in use. It was
specifically designed to handle high-speed data processing for scientific applications such as space exploration,
theoretical astronomy, subatomic physics and global weather forecasting. IBM estimated that each day in use, the
Model 91 would solve more than 1,000 problems involving about 200 billion calculations.
[Figure: IBM System/360 datapath - 16 32-bit general-purpose registers and 4 64-bit floating-point registers linked
by internal buses to the AR, IR, PC and DR registers; the program status word and memory control unit connect the
CPU to main memory.]
1971 - The Intel 4004 was the world’s first universal microprocessor, invented by Federico Faggin, Ted Hoff, and Stan
Mazor. With just over 2,300 MOS transistors in an area of only 3 by 4 millimetres, it had as much power as the
ENIAC.
• 4-bit CPU
• 1K data memory and 4K program memory
• Clock rate: 740 kHz
• Just a few years later, the word size of the 4004 was doubled to form the 8008.
1974 – 1977 - the first personal computers were introduced on the market as kits (major assembly required).
• SCELBI (SCientific, ELectronic and BIological), designed by the SCELBI Computer Consulting Company, was
based on Intel’s 8008 microprocessor with 1K of programmable memory. The SCELBI sold for $565, with an
additional 15K of memory available for $2,760.
• Mark-8 (also Intel 8008 based) designed by Jonathan Titus.
• Altair (based on the new Intel 8080 microprocessor), built by MITS (Micro Instrumentation Telemetry Systems).
The computer kit contained an 8080 CPU, a 256-byte RAM card, and a new Altair bus design for the price of
$400.
1976 - Steve Wozniak and Steve Jobs released the Apple I computer and started Apple Computer. The Apple I
was the first single-circuit-board computer. It came with a video interface, 8 KB of RAM and a keyboard. The system
incorporated some economical components, including the 6502 processor (designed and produced by MOS
Technology, with Rockwell as a later second source) and dynamic RAM.
1977 - The Apple II was released, also based on the 6502 processor, but it had color graphics (a first for a personal
computer) and used an audio cassette drive for storage. Its original configuration came with 4 KB of RAM; a year
later this was increased to 48 KB, and the cassette drive was replaced by a floppy disk drive.
1977 - The Commodore PET (Personal Electronic Transactor), designed by Chuck Peddle, also ran on the 6502
chip, but at half the price of the Apple II. It included 4 KB of RAM, monochrome graphics and an audio cassette
drive for data storage.
1981 - IBM released its new computer, the IBM PC, which ran on a 4.77 MHz Intel 8088 microprocessor and was
equipped with 16 kilobytes of memory, expandable to 256K. The PC came with one or two 160K floppy disk drives
and an optional color monitor. It was the first one built from off-the-shelf parts (called open architecture) and
marketed by outside distributors.
1974-present
The inventions during this period include:
• Intel 8080 (1974): 8-bit data, 16-bit address, 6 µm NMOS, 6K transistors, 2 MHz
• Motorola 68000 (1979): 32-bit architecture internally, but 16-bit data bus; 16 32-bit registers (8 data and 8
address registers); 2-stage pipeline; no virtual memory support; the 68020 was fully 32-bit externally
• Intel386 CPU (1985): 32-bit data; improved addressing; security modes (kernel, system services, application
services, applications)
• Alpha 21264 (600 MHz): 64-bit address/data; superscalar with out-of-order execution; 256 TLB entries; 128 KB
cache; adaptive branch prediction; 0.35 µm CMOS process; 15.2M transistors
Summary
• The evolution of computers has been characterised by increasing processor speed, decreasing component size,
increasing memory size, and increasing I/O capacity and speed.
• Computer architecture is the science and art of selecting and interconnecting hardware components to create
computers that meet functional performance and cost goals.
• Computer architecture concerns machine organisation, interfaces, application, technology, measurement and
simulation.
• The IBM 7090 was the most powerful data processing system of its time.
• Although the IBM 7090 was a general-purpose data processing system, it was designed with special attention to
the needs of the design of missiles, jet engines, nuclear reactors and supersonic aircraft.
• The 7090 could perform any of the following operations in one second: 229,000 additions or subtractions, 39,500
multiplications, or 32,700 divisions.
• Microelectronics means ‘small electronics’. The computer consists of logic gates, memory cells and
interconnections.
• The IBM System/360 Model 91 was introduced in 1966 as the fastest, most powerful computer then in use.
References
• Blaauw, G. A. & Brooks, F. P., 1997. Computer Architecture: Concepts and Evolution, Pearson Education India.
• Chandra, R. R., Modern Computer Architecture, Galgotia Publications.
• Arnaoudova, E., Brief History of Computer Architecture [pdf] Available at: <http://www.mgnet.org/~douglas/
Classes/cs521/arch/ComputerArch2005.pdf > [Accessed 21 June 2012].
• Learning Computing History [Online] Available at: <http://www.comphist.org/computing_history/new_page_4.
htm> [Accessed 21 June 2012].
• 2008. Lecture -2 History of Computers [Video Online] Available at: <http://www.youtube.com/watch?v=TS2o
dp6rQHU&feature=results_main&playnext=1&list=PLF33FAF1A694F4F69 > [Accessed 21 June 2012].
• 2008. Generation’s of computer (HQ) [Video Online] Available at: <http://www.youtube.com/watch?v=7rkG
FqEfdJk&feature=related> [Accessed 21 June 2012].
Recommended Reading
• Hwang, K., 2003. Advanced Computer Architecture, Tata McGraw-Hill Education.
• Null, L. & Lobur, J., 2010. The Essentials of Computer Organization and Architecture, Jones & Bartlett
Publishers.
• Anita, G., 2011. Computer Fundamentals, Pearson Education India.
Self Assessment
1. Which of these does not characterise the evolution of computers?
a. Increasing processor speed
b. Increasing component size
c. Increasing memory size
d. Increasing I/O capacity and speed
2. Who defined computer architecture as “the design of the integrated system which provides a useful tool to the
programmer”?
a. Baer
b. Hayes
c. Hennessy and Patterson
d. Foster
3. Who defined computer architecture as “the interface between the hardware and the lowest level software”?
a. Baer
b. Hayes
c. Hennessy and Patterson
d. Foster
8. Which of these is not a characteristic feature of Motorola 68000?
a. 32 bit architecture internally, but 16 bit data bus
b. 16 32-bit registers, 8 data and 8 address registers
c. Virtual memory support
d. 68020 was fully 32 bit externally
9. The IBM System/360 _________was introduced in 1966 as the fastest, most powerful computer then in use.
a. Model 91
b. Model 92
c. Model 90
d. Model 94
10. ________is the first family of computers making a clear distinction between architecture and implementation.
a. IBM’s System 360
b. IBM’s System 260
c. IBM’s System 340
d. IBM’s System 380
Chapter IV
System Architectures
Aim
The aim of this chapter is to:
Objectives
The objectives of this chapter are to:
Learning outcome
At the end of this chapter, you will be able to:
4.1 Parallel Architectures
A system for the categorisation of the system architectures of computers was introduced by Flynn (1972). Parallel
or concurrent operation takes many different forms within a computer system. The different kinds of parallelism
can be represented using a model based on the different streams used in the computation process. A stream is a
sequence of objects such as data, or of actions such as instructions. Each stream is independent of all other streams,
and each element of a stream can consist of one or more objects or actions. Parallel computers are those that
emphasise parallel processing of the operations in some way.
Parallel computers can be characterised based on the data and instruction streams forming various types of computer
organisations. They can also be classified based on the computer structure, e.g. multiple processors having separate
memory or one shared global memory. Parallel processing levels can also be defined based on the size of instructions
in a program called grain size. Thus, parallel computers can be classified based on various criteria.
The four combinations that describe most familiar parallel architectures are:
• SISD (Single Instruction, Single Data Stream): This is the traditional uniprocessor.
• SIMD (Single Instruction, Multiple Data Stream): This includes vector processors as well as massively parallel
processors.
• MISD (Multiple Instruction, Single Data Stream): These are typically systolic arrays.
• MIMD (Multiple Instruction, Multiple Data Stream): This includes traditional multiprocessors as well as the
newer networks of workstations.
Each of these combinations characterises a class of architectures and a corresponding type of parallelism.
In reality, each of the steps shown in the figure above is actually composed of several sub-steps, increasing the number
of cycles required for one summation even more. The solution to this inefficient use of processing power is pipelining.
If there is one functional unit available for each of the five steps required, the addition still requires five cycles. The
advantage is that, with all functional units busy at the same time, one result is produced every cycle. For the
summation of n pairs of numbers, only (n − 1) + 5 cycles are then required.
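The cycle counts above can be sketched in a few lines of Python (the function names are illustrative, not from the text):

```python
def cycles_unpipelined(n, stages=5):
    # Without pipelining, every summation occupies all five stages in turn.
    return n * stages

def cycles_pipelined(n, stages=5):
    # With pipelining, the first result appears after `stages` cycles,
    # and one further result is produced in every subsequent cycle.
    return (n - 1) + stages

# For 64 pairs of numbers and a 5-stage addition unit:
print(cycles_unpipelined(64))  # 320 cycles
print(cycles_pipelined(64))    # 68 cycles
```

The gap widens with n: for large n the pipelined version approaches one result per cycle, as the text states.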
[Figure: pipeline timing diagram - once the pipeline is filled, one result is saved per cycle.]
As the execution of instructions usually takes more than five steps, pipelines are made longer in real processors.
Long pipelines are also a prerequisite for achieving high CPU clock speeds. These long pipelines generate a new
problem. If there is a branching event (such as one due to an if-statement), the pipeline has to be emptied and filled
again, and a number of cycles equal to the pipeline length passes until results are again delivered. To circumvent
this, the number of branches should be kept small (by avoiding and/or smartly placing if-statements). Compilers and
CPUs also try to minimise this problem by ‘guessing’ the outcome (branch prediction). The power of a processor
can be increased by combining several pipelines. This is then called a superscalar processor.
Fixed-point and logical calculations (performed in the ALU - Arithmetic/Logic Unit) are usually separated from
floating-point math (done by the FPU – Floating Point Unit). The FPU is commonly subdivided into a unit for addition
and one for multiplication. These units may be present several times, and some processors have additional functional
units for division and the computation of square roots. To actually gain a benefit from having several pipelines, these
have to be used at the same time. Parallelisation is essential to achieve this.
Vector computers work similarly to the pipelined scalar computer described above. The difference is that, instead
of processing single values, vectors of data are processed in one cycle. The number of values in a vector is limited
by the CPU design. A vector processor that can simultaneously work with 64 vector elements can also generate
64 results per cycle, whereas a scalar processor would require at least 64 cycles for this. To actually use the theoretically
possible performance of a vector computer, the calculations themselves need to be vectorised. If a vector processor
is fed with single values only, it cannot perform decently. Just as with a scalar computer, the pipelines need to be
kept filled.
Vector computers used to be very common in the field of high performance computing, as they allowed very high
performance even at lower CPU clock speeds. In the last years, they have begun to slowly disappear. Vector processors
are very complex and thus expensive, and perform poorly with non-vectorisable problems. Today’s scalar processors
are much cheaper and achieve higher CPU clock speeds. With the Pentium III, Intel introduced SSE (Streaming
SIMD Extensions), which is a set of vector instructions. In certain applications, such as video encoding, the use of
these vector instructions can offer quite impressive performance increases. More vector instructions were added
with SSE2 (Pentium 4) and SSE3 (Pentium 4 Prescott).
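The idea of vector execution can be sketched as a toy model in Python (the names and the 64-element vector length are illustrative, not a description of any real instruction set):

```python
# A toy model of vector (SIMD) execution: one "instruction" operates on a
# whole vector of elements at once instead of one element at a time.
VECTOR_LENGTH = 64  # fixed by the CPU design in real hardware

def vector_add(a, b):
    # Model one vector instruction: elementwise add of up to 64 elements.
    assert len(a) == len(b) <= VECTOR_LENGTH
    return [x + y for x, y in zip(a, b)]

def add_arrays(a, b):
    # Add two long arrays by issuing one vector instruction per chunk.
    result, issued = [], 0
    for i in range(0, len(a), VECTOR_LENGTH):
        result += vector_add(a[i:i + VECTOR_LENGTH], b[i:i + VECTOR_LENGTH])
        issued += 1
    return result, issued

data = list(range(256))
total, issued = add_arrays(data, data)
print(issued)  # 4 vector instructions instead of 256 scalar additions
```

This also shows why the calculation must be vectorised: only when the data arrive as full vectors does each "instruction" do 64 elements' worth of work.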
Global memory can be accessed by all processors of a parallel computer. Data in the global memory can be read
or written by any of the processors. Examples: Sun HPC, Cray T90.
In MIMD systems with shared memory (SM-MIMD), all processors are connected to a common memory (RAM
- Random Access Memory). Usually all processors are identical and have equal memory access. This is called
symmetric multiprocessing (SMP).
The connection between processors and memory is of predominant importance. Figure below shows a shared
memory system with a bus connection. The advantage of a bus is its expandability. A huge disadvantage is that all
processors have to share the bandwidth provided by the bus, even when accessing different memory modules. Bus
systems can be found in desktop systems and small servers (frontside bus).
To circumvent the problem of limited memory bandwidth, direct connections from each CPU to each memory module
are desired. This can be achieved by using a crossbar switch. Crossbar switches can be found in high performance
computers and some workstations.
The problem with crossbar switches is their high complexity when many connections need to be made. This problem
can be mitigated by using multi-stage crossbar switches, which in turn leads to longer communication times. For
this reason, the number of CPUs and memory modules that can be connected by crossbar switches is limited.
The advantages are:
• Global address space provides a user-friendly programming perspective to memory
• Data sharing between tasks is both fast and uniform due to the proximity of memory to CPUs
[Figure: a shared memory system - several CPUs connected to memory via a bus interconnect.]
The big advantage of shared memory systems is that all processors can make use of the whole memory. This makes
them easy to program and efficient to use. The limiting factor to their performance is the number of processors and
memory modules that can be connected to each other. Due to this, shared memory systems usually consist of rather
few processors.
• Because each processor has its own local memory, it operates independently. Changes it makes to its local memory
have no effect on the memory of other processors. Hence, the concept of cache coherency does not apply.
• When a processor needs access to data in another processor, it is usually the task of the programmer to explicitly
define how and when data is communicated. Synchronisation between tasks is likewise the programmer’s
responsibility.
• The network “fabric” used for data transfer varies widely, though it can be as simple as Ethernet.
Each processor in a parallel computer has its own memory (local memory); no other processor can access this
memory. Data can only be shared by message passing. Examples: Cray T3E, IBM SP2.
The number of processors and memory modules cannot be increased arbitrarily in the case of a shared memory
system. Another way to build a MIMD system is distributed memory (DM-MIMD). Each processor has its own
local memory. The processors are connected to each other. The demands imposed on the communication network
are lower than in the case of a shared memory system, as the communication between processors may be slower
than the communication between processor and memory.
[Figure: a distributed memory system - processors, each with its own local memory, linked by a connection network.]
Distributed memory systems can be hugely expanded. Several thousand processors are not uncommon; this is
called massively parallel processing (MPP). To actually use the theoretical performance, much more programming
effort than with shared memory systems is required. The problem has to be subdivided into parts that require little
communication. The processors can only access their own memory. Should they require data from the memory of
another processor, then these data have to be copied. Due to the relatively slow communications network between
the processors, this should be avoided as much as possible.
4.7 ccNUMA
The shared memory systems suffer from a limited system size, while distributed memory systems suffer from the
arduous communication between the memories of the processors. A compromise is the ccNUMA (cache coherent
non-uniform memory access) architecture.
A ccNUMA system basically consists of several SMP systems. These are connected to each other by means of a
fast communications network, often crossbar switches. Access to the whole, distributed or non-unified memory is
possible via a common cache.
A ccNUMA system is as easy to use as a true shared memory system, and at the same time it is much easier to expand.
To achieve optimal performance, it has to be made sure that local memory is used, and not the memory of the
other modules, which is only accessible via the slow communications network. The modular structure is another
big advantage of this architecture. Most ccNUMA systems consist of modules that can be plugged together to get
systems of various sizes.
[Figure: a ccNUMA system - several SMP modules, each with its own RAM, coupled through a common cache.]
4.8 Cluster
For some years now, clusters have been very popular in the high performance computing community. A cluster consists
of several cheap computers (nodes) linked together. The simplest case is the combination of several desktop computers,
known as a Network of Workstations (NOW). Mostly, SMP systems (usually dual-CPU systems with Intel or
AMD CPUs) are used because of their good value for money. They form hybrid systems: the nodes, which are
themselves shared memory systems, form a distributed memory system.
39
Parallel Computing
The nodes are connected via a fast network, usually Myrinet or InfiniBand. Gigabit Ethernet has approximately the
same bandwidth of about 100 MB/s and is a lot cheaper, but the latency (travel time of a data package) is much
higher. It is about 100 µs for Gigabit Ethernet compared to only 10 - 20 µs for Myrinet. Even this is a lot of time.
At a clock speed of 2 GHz, one cycle takes 0.5 ns. A latency of 10 µs amounts to 20,000 cycles of travel time
before the data package reaches its target. Clusters offer lots of computing power for little money. It is not that easy
to actually use that power, though. Communication between the nodes is slow, and as with conventional distributed
memory systems, each node can only access its local memory directly. The commonly employed PC architecture also
limits the amount of memory per node: 32-bit systems cannot address more than 4 GB of RAM, and x86-64 systems
are limited by the number of memory slots, the size of the available memory modules, and the chip sets. Despite these
disadvantages, clusters are very successful and have given traditional, more expensive distributed memory systems
a hard time. They are ideally suited to problems with a high degree of parallelism, and their modularity makes it
easy to upgrade them. In recent years, the cluster idea has been expanded to connecting computers all over the
world via the internet. This makes it possible to aggregate enormous computing power. Such a widely distributed
system is known as a grid.
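The cycle count quoted for network latency can be checked with a few lines of arithmetic (assuming a 2 GHz clock and a 10 µs latency, the Myrinet-class figure):

```python
clock_hz = 2e9                 # 2 GHz clock
cycle_time_s = 1 / clock_hz    # one cycle: 0.5 ns
latency_s = 10e-6              # 10 µs network latency

cycles_wasted = latency_s / cycle_time_s
print(cycles_wasted)           # 20000.0 cycles spent waiting per message
```

Twenty thousand idle cycles per message is why cluster programs must be partitioned to communicate as rarely as possible.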
4.10.2 Intel Core 2 Duo
Intel’s successor to the Pentium D is similar in design to the popular Pentium M, which in turn is based on the
Pentium III, with ancestry reaching back to the Pentium Pro. It abandons high clock frequencies in favour of
more efficient computation. Like the Pentium D, it uses the frontside bus for memory access by both CPUs. The
Core 2 Duo supports SSE3 and x86-64.
Summary
• A system for the categorisation of the system architectures of computers was introduced by Flynn (1972).
• A stream is a sequence of objects such as data, or of actions such as instructions.
• Parallel computers are those that emphasise the parallel processing between the operations in some way.
• Parallel computers can be characterised based on the data and instruction streams forming various types of
computer organisations.
• Parallel processing levels can also be defined based on the size of instructions in a program called grain size.
• Long pipelines are also a prerequisite for achieving high CPU clock speeds.
• Fixed-point and logical calculations (performed in the ALU - Arithmetic/Logic Unit) are usually separated
from floating-point math (done by the FPU – Floating Point Unit).
• The scalar computer of the previous section performs one instruction on one data set only.
• A computer that performs one instruction on several data sets is called a vector computer.
• Global memory can be accessed by all processors of a parallel computer. Data in the global memory can be
read or written by any of the processors.
• In MIMD systems with shared memory (SM-MIMD), all processors are connected to a common memory
(RAM - Random Access Memory).
• The advantage of a bus is its expandability. A huge disadvantage is that all processors have to share the bandwidth
provided by the bus, even when accessing different memory modules.
• Bus systems can be found in desktop systems and small servers (frontside bus).
• The big advantage of shared memory systems is that all processors can make use of the whole memory.
• A ccNUMA system basically consists of several SMP systems.
• A ccNUMA system is as easy to use as a true shared memory system, at the same time it is much easier to
expand.
• A cluster consists of several cheap computers (nodes) linked together.
• The Intel Pentium D was introduced in 2005. It is Intel’s first dual-core processor. It integrates two cores, based
on the NetBurst design of the Pentium 4, on one chip.
• The pSeries is IBM’s server- and workstation line based on the POWER processor.
• BlueGene is an MPP (massively parallel processing) architecture by IBM.
• The SGI Altix 3700 is a ccNUMA system using Intel’s Itanium 2 processor.
References
• Grama, 2010. An Introduction to Parallel Computing: Design and Analysis of Algorithms, 2/e, Pearson Education
India.
• Bhujade, M. R., 1995. Parallel Computing, New Age International.
• Wittwer, T., Introduction to Parallel Programming [Online] Available at: <http://www.scribd.com/doc/23585346/
An-Introduction-to-Parallel-Programming> [Accessed 21 June 2012].
• Introduction to Parallel Computing [Online] Available at: <https://computing.llnl.gov/tutorials/parallel_
comp/#Whatis > [Accessed 21 June 2012].
• 2011. Intro to Computer Architecture [Video Online] Available at: <http://www.youtube.com/watch?v=HEjPop-
aK_w> [Accessed 21 June 2012].
• 2011. x64 Assembly and C++ Tutorial 38: Intro to Single Instruction Multiple Data [Video Online] Available
at: <http://www.youtube.com/watch?v=cbL88Ic6uPw > [Accessed 21 June 2012].
Recommended Reading
• Lafferty, E. L., 1993. Parallel Computing: An Introduction, William Andrew.
• Pacheco, P., 2011. An Introduction to Parallel Programming, Elsevier.
• Foster, I., 1995. Designing and Building Parallel Programs: Concepts and Tools for Parallel Software
Engineering, Addison-Wesley.
Self Assessment
1. Which of the following is the traditional uniprocessor?
a. SISD
b. SIMD
c. MISD
d. MIMD
5. A computer that performs one instruction on several data sets is called a _____.
a. vector computer
b. scalar computer
c. vector processor
d. scalar processor
6. Usually all processors are identical and have equal memory access; this is called _________.
a. SMP
b. SISD
c. NUMA
d. DM-DIMD
8. A _______consists of several cheap computers (nodes) linked together.
a. cluster
b. shared memory
c. distributed memory
d. hybrid system
10. _____integrates two cores, based on the NetBurst design of the Pentium 4, on one chip.
a. Intel Pentium D
b. Intel Core 2 Duo
c. IBM pSeries
d. IBM BlueGene
Chapter V
Parallel Programming Models and Paradigms
Aim
The aim of this chapter is to:
• define cluster
Objectives
The objectives of this chapter are to:
Learning outcome
At the end of this chapter, you will be able to:
5.1 Introduction
In the 1980s it was believed computer performance was best improved by creating faster and more efficient processors.
This idea was challenged by parallel processing, which in essence means linking together two or more computers
to jointly solve a computational problem. Since the early 1990s there has been an increasing trend to move away
from expensive and specialised proprietary parallel supercomputers (vector-supercomputers and massively parallel
processors) towards networks of computers (PCs/Workstations/SMPs). Among the driving forces that have enabled
this transition has been the rapid improvement in the availability of commodity high performance components for
PCs/workstations and networks. These technologies are making a network/cluster of computers an appealing vehicle
for cost-effective parallel processing and this is consequently leading to low-cost commodity supercomputing.
Scalable computing clusters, ranging from a cluster of (homogeneous or heterogeneous) PCs or workstations, to
SMPs, are rapidly becoming the standard platforms for high-performance and large-scale computing. The main
attractiveness of such systems is that they are built using affordable, low-cost, commodity hardware (Pentium PCs),
fast LAN such as Myrinet, and standard software components such as UNIX, MPI, and PVM parallel programming
environments. These systems are scalable, i.e., they can be tuned to available budget and computational needs and
allow efficient execution of both demanding sequential and parallel applications.
Clusters use intelligent mechanisms for dynamic and network-wide resource sharing, which respond to resource
requirements and availability. These mechanisms support scalability of cluster performance and allow a flexible use
of workstations, since the cluster or network-wide available resources are expected to be larger than the available
resources at any one node/workstation of the cluster. These intelligent mechanisms also allow clusters to support
multiuser, time-sharing parallel execution environments, where it is necessary to share resources and at the same
time distribute the workload dynamically to utilise the global resources efficiently.
The idea of exploiting the significant computational capability available in networks of workstations (NOWs) has
gained enthusiastic acceptance within the high-performance computing community, and the current tendency favours
this sort of commodity supercomputing. This is mainly motivated by the fact that most of the scientific community
wishes to minimise economic risk and rely on consumer-based off-the-shelf technology. Cluster computing
has been recognised as the wave of the future for solving large scientific and commercial problems.
A computer node can be a single or multiprocessor system (PCs, workstations, or SMPs) with memory, I/O facilities,
and an operating system. A cluster generally refers to two or more computers (nodes) connected together. The nodes
can exist in a single cabinet or be physically separated and connected via a LAN. An interconnected (LAN-based)
cluster of computers can appear as a single system to users and applications. Such a system can provide a cost-
effective way to gain features and benefits (fast and reliable services) that have historically been found only on more
expensive proprietary shared memory systems. The typical architecture of a cluster is shown in figure below.
[Figure: typical cluster architecture - parallel applications run on top of cluster middleware (single system image
and availability infrastructure), which spans multiple nodes, each with its own communication software and network
interface hardware.]
The network interface hardware acts as a communication processor and is responsible for transmitting and receiving
packets of data between cluster nodes via a network/switch.
Communication software offers a means of fast and reliable data communication among cluster nodes and to the
outside world. Often, clusters with a special network/switch like Myrinet use communication protocols such as
active messages for fast communication among its nodes. They potentially bypass the operating system and thus
remove the critical communication overheads providing direct user-level access to the network interface.
The cluster nodes can work collectively as an integrated computing resource, or they can operate as individual
computers. The cluster middleware is responsible for offering an illusion of a unified system image (single system
image) and availability out of a collection of independent but interconnected computers. Programming environments
can offer portable, efficient, and easy-to-use tools for the development of applications. They include message passing
libraries, debuggers, and profilers. It should not be forgotten that clusters can also be used for the execution of
sequential or parallel applications.
Grand Challenge Applications (GCAs) are distinguished by the scale of their resource requirements, such as processing
time, memory, and communication needs. A typical example of a grand challenge problem is the simulation of some
phenomenon that cannot be measured through experiments. GCAs include massive crystallographic and
microtomographic structural problems, protein dynamics and biocatalysis, relativistic quantum chemistry of actinides,
virtual materials design and processing, global climate modelling, and discrete event simulation.
Although the technology of clusters is currently being deployed, the development of parallel applications is really
a complex task.
• First of all, it is largely dependent on the availability of adequate software tools and environments.
• Second, parallel software developers must contend with problems not encountered during sequential programming,
namely: non-determinism, communication, synchronisation, data partitioning and distribution, load-balancing,
fault-tolerance, heterogeneity, shared or distributed memory, deadlocks, and race conditions.
All these issues present important new challenges. Currently, only some specialised programmers have the
knowledge to use parallel and distributed systems for executing production codes. This programming technology is
still somewhat distant from the average sequential programmer, who does not feel very enthusiastic about moving
into a different programming style with increased difficulties, though they are aware of the potential performance
gains.
Parallel computing can only be widely successful if parallel software is able to accomplish some expectations of
the users, such as:
• Provide architecture/processor type transparency
• Provide network/communication transparency
• Be easy-to-use and reliable
• Provide support for fault-tolerance
• Accommodate heterogeneity
• Assure portability
• Provide support for traditional high-level languages
• Be capable of delivering increased performance
• The holy grail is to provide parallelism transparency
This last expectation is still at least one decade away, but most of the others can be achieved today. The internal
details of the underlying architecture should be hidden from the user and the programming environment should
provide high-level support for parallelism. Otherwise, if the programming interface is difficult to use, it makes the
writing of parallel applications highly unproductive and painful for most programmers. There are basically two
main approaches for parallel programming:
Parallel Computing
• The first one is based on implicit parallelism. This approach has been followed by parallel languages and
parallelising compilers. The user does not specify, and thus cannot control, the scheduling of calculations and/
or the placement of data.
• The second one relies on explicit parallelism. In this approach, the programmer is responsible for most of the
parallelisation effort such as task decomposition, mapping tasks to processors, and the communication structure.
This approach is based on the assumption that the user is often the best judge of how parallelism can be exploited
for a particular application.
It is also observed that explicit parallelism generally achieves better efficiency than parallel languages or compilers that rely on implicit parallelism.
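As a concrete illustration of explicit parallelism, the sketch below shows the programmer performing the decomposition, mapping, and final reduction by hand. Python threads stand in for processors here, and all the names are illustrative, not taken from any real parallel library.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum_of_squares(chunk):
    # The task body: each worker computes over its own piece of the data.
    return sum(x * x for x in chunk)

def explicit_parallel_sum_of_squares(data, n_workers=4):
    """Explicit parallelism: the programmer controls every step."""
    if not data:
        return 0
    # Task decomposition: the programmer chooses the chunk boundaries.
    size = (len(data) + n_workers - 1) // n_workers
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    # Mapping: one chunk per worker, managed explicitly.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_sum_of_squares, chunks))
    # Communication structure: a final reduction combines partial results.
    return sum(partials)
```

None of these decisions are taken by a compiler; the efficiency rests entirely on the programmer's choice of chunking and mapping.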
[Figure: porting strategies for an existing source code — identify and replace subroutines, relink the application against a parallel library, or develop new parallel code.]
The first strategy is based on automatic parallelisation, the second is based on the use of parallel libraries, while the
third strategy (major recoding) resembles from-scratch application development. The goal of automatic parallelisation
is to relieve the programmer from the parallelising tasks. Such a compiler would accept dusty-deck codes and produce
efficient parallel object code without any (or, at least, very little) additional work by the programmer. However, this
is still very hard to achieve and is well beyond the reach of current compiler technology.
Another possible approach for porting parallel code is the use of parallel libraries. This approach has been more
successful than the previous one. The basic idea is to encapsulate some of the parallel code that is common to several
applications into a parallel library that can be implemented in a very efficient way. Such a library can then be reused
by several codes. Parallel libraries can take two forms:
• They encapsulate the control structure of a class of applications.
• They provide a parallel implementation of some mathematical routines that are heavily used in the kernel of
some production codes.
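A minimal sketch of the second form, a parallel implementation of a heavily used mathematical routine, might look like the following. Python threads act as a stand-in for a real parallel back-end, and the function name is illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

def par_matvec(matrix, vector, n_workers=4):
    """Parallel matrix-vector product: a mathematical kernel that many
    production codes could reuse without knowing how it is parallelised."""
    def row_dot(row):
        # Each row's dot product is an independent task.
        return sum(a * b for a, b in zip(row, vector))
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(row_dot, matrix))
```

The calling code simply invokes the routine as it would a sequential one; the parallelism is encapsulated inside the library.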
The third strategy, which involves writing a parallel application from the very beginning, gives more freedom to the programmer, who can choose the language and the programming model. However, it may make the task very difficult since little of the code can be reused. Compiler assistance techniques can be of great help, although with limited applicability. Usually the tasks that can be effectively provided by a compiler are data distribution and placement.
The first two levels of parallelism (signal and circuit level) are implemented implicitly by the hardware, a technique called hardware parallelism. The remaining two levels (component and system) are mostly expressed implicitly or explicitly using various software techniques, popularly known as software parallelism. Levels of parallelism
can also be based on the lumps of code (grain size) that can be a potential candidate for parallelism. Table below
lists categories of code granularity for parallelism. All approaches of creating parallelism based on code granularity
have a common goal to boost processor efficiency by hiding latency of a lengthy operation such as a memory/disk
access. To conceal latency, there must be another activity ready to run whenever a lengthy operation occurs. The idea
is to execute concurrently two or more single-threaded applications, such as compiling, text formatting, database
searching, and device simulation, or parallelised applications having multiple tasks simultaneously.
The different levels of parallelism are depicted in figure below. Among the four levels of parallelism, the first two
levels are supported transparently either by the hardware or parallelising compilers. The programmer mostly handles
the last two levels of parallelism. The three important models used in developing applications are shared-memory
model, distributed memory model (message passing model), and distributed-shared memory model.
[Figure: levels of parallelism — at the fine-grain (data) level, independent operations such as a[0]=…, a[1]=…, a[2]=… execute in parallel, while coarser levels coordinate through messages.]
5.4.4 Message Passing
Message passing libraries allow efficient parallel programs to be written for distributed memory systems. These
libraries provide routines to initiate and configure the messaging environment as well as sending and receiving
packets of data. Currently, the two most popular high-level message-passing systems for scientific and engineering
application are the PVM (Parallel Virtual Machine) from Oak Ridge National Laboratory and MPI (Message Passing
Interface) defined by the MPI Forum.
Currently, there are several implementations of MPI, including versions for networks of workstations, clusters of
personal computers, distributed-memory multiprocessors, and shared-memory machines. Almost every hardware
vendor is supporting MPI. This gives the user a comfortable feeling since an MPI program can be executed on
almost all of the existing computing platforms without the need to rewrite the program from scratch. The goal of
portability, architecture, and network transparency has been achieved with these low-level communication libraries
like MPI and PVM. Both communication libraries provide an interface for C and Fortran, and additional support
of graphical tools.
However, these message-passing systems are still stigmatised as low-level because most tasks of the parallelisation
are still left to the application programmer. When writing parallel applications using message passing, the programmer
still has to develop a significant amount of software to manage some of the tasks of the parallelisation, such as the communication and synchronisation between processes, data partitioning and distribution, mapping of processes onto processors, and input/output of data structures. If the
application programmer has no special support for these tasks, it then becomes difficult to widely exploit parallel
computing. The easy-to-use goal is not accomplished with a bare message-passing system, and hence requires
additional support.
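The send/receive style of PVM and MPI can be sketched without either library by using queues between Python threads as stand-in communication channels. The names below are illustrative, not actual MPI calls; real MPI code would use routines such as send and receive on a communicator instead.

```python
import threading
import queue

def worker(inbox, outbox):
    # Blocking receive: waits until a message arrives, in the style of
    # a message-passing recv call.
    msg = inbox.get()
    # Process the packet of data and send the result back.
    outbox.put([x * 2 for x in msg])

def ping_pong(data):
    """One explicit send/receive exchange between two 'processes'."""
    to_worker, from_worker = queue.Queue(), queue.Queue()
    t = threading.Thread(target=worker, args=(to_worker, from_worker))
    t.start()
    to_worker.put(data)         # explicit send
    result = from_worker.get()  # explicit blocking receive
    t.join()
    return result
```

The point of the sketch is the programming style: the programmer explicitly packages data, sends it, and blocks on the reply, exactly the kind of bookkeeping that higher-level environments try to hide.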
Other ways to provide alternate-programming models are based on Virtual Shared Memory (VSM) and parallel
object-oriented programming. Another way is to provide a set of programming skeletons in the form of run-time
libraries that already support some of the tasks of parallelisation and can be implemented on top of portable message-
passing systems like PVM or MPI.
Distributed Shared Memory (DSM) is the extension of the well-accepted shared memory programming model on
systems without physically shared memory.
The shared data space is flat and accessed through normal read and write operations. In contrast to message passing, in a DSM system a process that wants to fetch some data value does not need to know its location; the system will find and fetch it automatically. In most DSM systems, shared data may be replicated to enhance the parallelism and the efficiency of the applications.
While scalable parallel machines are mostly based on distributed memory, many users may find it easier to write
parallel programs using a shared-memory programming model. This makes DSM a very promising model, provided
it can be implemented efficiently.
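The key DSM property, that a process reads or writes shared data without knowing which node owns it, can be sketched with a toy class. This is purely illustrative: real DSM systems add replication, caching, and coherence protocols on top of this idea.

```python
class ToyDSM:
    """Toy distributed shared memory: values are scattered across per-node
    stores, but readers and writers never need to know which node owns a
    given key -- the system locates the data itself."""

    def __init__(self, n_nodes=4):
        # Each dict stands in for the local memory of one node.
        self.nodes = [dict() for _ in range(n_nodes)]

    def _owner(self, key):
        # Placement policy hidden from the application (hash-based here).
        return self.nodes[hash(key) % len(self.nodes)]

    def write(self, key, value):
        self._owner(key)[key] = value   # looks like a normal store

    def read(self, key):
        return self._owner(key)[key]    # looks like a normal load
```

From the application's point of view, read and write behave like ordinary memory operations; the location lookup in _owner is exactly what a DSM runtime performs transparently.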
The object-oriented programming model is by now well established as the state-of-the-art software engineering
methodology for sequential programming, and recent developments are also emerging to establish object-orientation
in the area of parallel programming. The current lack of acceptance of this model among the scientific community can
be explained by the fact that computational scientists still prefer to write their programs using traditional languages
like FORTRAN. This is the main difficulty that has been faced by the object-oriented environments, though it is
considered as a promising technique for parallel programming. Some interesting object-oriented environments such
as CC++ and Mentat have been presented in the literature.
After the parallelisable parts have been recognised and an appropriate algorithm identified, a lot of development time is wasted on programming routines closely related to the paradigm and not the application itself. With the aid
of a good set of efficiently programmed interaction routines and skeletons, the development time can be reduced
significantly.
The skeleton hides from the user the specific details of the implementation and allows the user to specify the
computation in terms of an interface tailored to the paradigm. This leads to a style of Skeleton Oriented Programming
(SOP) which has been identified as a very promising solution for parallel computing.
Skeletons are more general programming methods since they can be implemented on top of message-passing,
object-oriented, shared-memory or distributed memory systems, and provide increased support for parallel
programming.
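A minimal task-farm skeleton might look like the following illustrative sketch: the skeleton owns the control structure (distribution and collection), while the user supplies only the application-specific compute routine.

```python
from concurrent.futures import ThreadPoolExecutor

def farm_skeleton(compute, tasks, n_workers=4):
    """Task-farm skeleton: distribution and collection are fixed by the
    skeleton; the user plugs in only the 'compute' routine unique to the
    application."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # The skeleton hides how tasks are handed out and gathered back.
        return list(pool.map(compute, tasks))
```

For example, a user parallelising an image filter would pass the per-image filter function as compute and a list of images as tasks, and never touch the coordination code.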
To summarise, there are basically two ways of looking at an explicit parallel programming system. In the first one,
the system provides some primitives to be used by the programmer. The structuring and the implementation of most
of the parallel control and communication is the responsibility of the programmer. The alternative is to provide some
enhanced support for those control structures that are common to a parallel programming paradigm. The main task
of the programmer would be to provide those few routines unique to the application, such as computation and data
generation. Numerous parallel programming environments are available, and many of them do attempt to exploit
the characteristics of parallel paradigms.
As suggested by Ian Foster, this methodology organises the design process into four distinct stages:
• Partitioning
• Communication
• Agglomeration
• Mapping
The first two stages seek to develop concurrent and scalable algorithms, and the last two stages focus on locality
and performance-related issues as summarised below:
5.5.1 Partitioning
It refers to decomposing the computational activities and the data on which they operate into several small tasks. The
decomposition of the data associated with a problem is known as domain/data decomposition, and the decomposition
of computation into disjoint tasks is known as functional decomposition.
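The two decomposition styles can be sketched as follows. These are illustrative helper functions, not part of any standard API.

```python
def domain_decomposition(data, n_parts):
    """Domain/data decomposition: split the *data* into near-equal blocks;
    every task runs the same computation on its own block."""
    k, r = divmod(len(data), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        end = start + k + (1 if i < r else 0)  # spread the remainder
        parts.append(data[start:end])
        start = end
    return parts

def functional_decomposition(stages, value):
    """Functional decomposition: split the *computation* into disjoint
    stages; each stage could run as a separate task, as in a pipeline."""
    for stage in stages:
        value = stage(value)
    return value
```

In a real parallel program each block (or each stage) would be assigned to a different process; here the helpers only show how the problem is carved up.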
5.5.2 Communication
It focuses on the flow of information and coordination among the tasks that are created during the partitioning
stage. The nature of the problem and the decomposition method determine the communication pattern among
these cooperative tasks of a parallel program. The four popular communication patterns commonly used in parallel
programs are: local/global, structured/unstructured, static/dynamic, and synchronous/asynchronous.
5.5.3 Agglomeration
In this stage, the tasks and communication structure defined in the first two stages are evaluated in terms of performance
requirements and implementation costs. If required, tasks are grouped into larger tasks to improve performance or to
reduce development costs. Also, individual communications may be bundled into a super communication. This will
help in reducing communication costs by increasing computation and communication granularity, gaining flexibility
in terms of scalability and mapping decisions, and reducing software-engineering costs.
5.5.4 Mapping
It is concerned with assigning each task to a processor such that it maximises utilisation of system resources (such as
CPU) while minimising the communication costs. Mapping decisions can be taken statically (at compile-time/before
program execution) or dynamically at runtime by load-balancing methods. Several grand challenging applications
have been built using the above methodology.
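Two common static mapping policies can be sketched as below. The functions are illustrative; a real mapper would also weigh communication costs and processor load.

```python
def block_mapping(n_tasks, n_procs):
    """Static block mapping: contiguous ranges of tasks per processor,
    decided before program execution."""
    size = (n_tasks + n_procs - 1) // n_procs
    return {t: t // size for t in range(n_tasks)}

def cyclic_mapping(n_tasks, n_procs):
    """Static cyclic (round-robin) mapping: task t goes to processor
    t mod n_procs, which tends to even out irregular workloads."""
    return {t: t % n_procs for t in range(n_tasks)}
```

Block mapping keeps neighbouring tasks together (good when neighbours communicate), while cyclic mapping interleaves them (good when task costs vary systematically along the index).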
Experience to date suggests that there are a relatively small number of paradigms underlying most parallel programs.
The choice of paradigm is determined by the available parallel computing resources and by the type of parallelism
inherent in the problem. The computing resources may define the level of granularity that can be efficiently supported
on the system. The type of parallelism reflects the structure of either the application or the data and both types may
exist in different parts of the same application. Parallelism arising from the structure of the application is termed functional parallelism. In this case, different parts of the program can perform different tasks in a concurrent and
cooperative manner. But parallelism may also be found in the structure of the data. This type of parallelism allows
the execution of parallel processes with identical operation but on different parts of the data.
consider this paradigm in the parallel computing area. In the world of parallel computing there are several authors who present a paradigm classification. Not all of them propose exactly the same one, but we can create a superset of the paradigms detected in parallel applications.
For instance, one theoretical classification of parallel programs breaks them into three classes of parallelism:
• Processor farms, which are based on replication of independent jobs
• Geometric decomposition, based on the parallelisation of data structures
• Algorithmic parallelism, which results in the use of data flow
Another survey of several parallel applications identified the following set of paradigms:
• Pipelining and ring-based applications
• Divide and conquer
• Master/slave
• Cellular automata applications, which are based on data parallelism
Another author also proposed a very appropriate classification, dividing the problems into a few decomposition techniques, namely:
• Geometric decomposition: the problem domain is broken up into smaller domains and each process executes
the algorithm on each part of it.
• Iterative decomposition: some applications are based on loop execution where each iteration can be done in an
independent way. This approach is implemented through a central queue of runnable tasks, and thus corresponds
to the task-farming paradigm.
• Recursive decomposition: this strategy starts by breaking the original problem into several subproblems and
solving these in a parallel way. It clearly corresponds to a divide and conquer approach.
• Speculative decomposition: some problems can use a speculative decomposition approach: N solution techniques
are tried simultaneously, and (N-1) of them are thrown away as soon as the first one returns a plausible answer.
In some cases this could result optimistically in a shorter overall execution time.
• Functional decomposition: the application is broken down into many distinct phases, where each phase executes
a different algorithm within the same problem. The most used topology is the process pipelining.
A somewhat different classification was presented based on the temporal structure of the problems. The applications
were thus divided into:
• Synchronous problems, which correspond to regular computations on regular data domains
• Loosely synchronous problems, that are typified by iterative calculations on geometrically irregular data
domains
• Asynchronous problems, which are characterised by functional parallelism that is irregular in space and time
• Embarrassingly parallel applications, which correspond to the independent execution of disconnected components
of the same program
Synchronous and loosely synchronous problems present a somewhat different synchronisation structure, but both
rely on data decomposition techniques. According to an extensive analysis of 84 real applications, it was estimated
that these two classes of problems dominated scientific and engineering applications being used in 76 percent of the
applications. Asynchronous problems, which are for instance represented by event-driven simulations, represented
10 percent of the studied problems. Finally, embarrassingly parallel applications that correspond to the master/slave
model, accounted for 14 percent of the applications.
The most systematic definition of paradigms and application templates was presented in [6]. It describes a generic
tuple of factors which fully characterises a parallel algorithm including: process properties (structure, topology and
execution), interaction properties, and data properties (partitioning and placement). That classification included
most of the paradigms referred to so far, albeit described in deeper detail.
Task-farming may either use static load-balancing or dynamic load-balancing. In the first case, the distribution of
tasks is all performed at the beginning of the computation, which allows the master to participate in the computation
after each slave has been allocated a fraction of the work. The allocation of tasks can be done once or in a cyclic
way. Figure below presents a schematic representation of this first approach.
The other way is to use a dynamically load-balanced master/slave paradigm, which can be more suitable when the
number of tasks exceeds the number of available processors, or when the number of tasks is unknown at the start of
the application, or when the execution times are not predictable, or when we are dealing with unbalanced problems.
An important feature of dynamic load-balancing is the ability of the application to adapt itself to changing conditions
of the system, not just the load of the processors, but also a possible reconfiguration of the system resources. Due
to this characteristic, this paradigm can respond quite well to the failure of some processors, which simplifies the
creation of robust applications that are capable of surviving the loss of slaves or even the master.
At an extreme, this paradigm can also encompass some applications that are based on a trivial decomposition approach:
the sequential algorithm is executed simultaneously on different processors but with different data inputs. In such
applications there are no dependencies between different runs so there is no need for communication or coordination
between the processes.
This paradigm can achieve high computational speedups and an interesting degree of scalability. However, for a large
number of processors the centralised control of the master process can become a bottleneck to the applications. It
is, however, possible to enhance the scalability of the paradigm by extending the single master to a set of masters,
each of them controlling a different group of process slaves.
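A dynamically load-balanced farm can be sketched with a shared task queue from which slaves pull work as soon as they become free. Python threads are an illustrative stand-in here; a cluster implementation would use message passing between master and slaves instead.

```python
import threading
import queue

def dynamic_farm(work_fn, tasks, n_slaves=3):
    """Dynamically load-balanced master/slave: slaves pull the next task
    the moment they finish the previous one, so faster slaves naturally
    take on more of the work."""
    todo = queue.Queue()
    for t in tasks:
        todo.put(t)
    results, lock = {}, threading.Lock()

    def slave():
        while True:
            try:
                t = todo.get_nowait()  # ask the "master" for more work
            except queue.Empty:
                return                 # no tasks left: terminate
            r = work_fn(t)
            with lock:
                results[t] = r         # return the result to the master
    threads = [threading.Thread(target=slave) for _ in range(n_slaves)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results
```

Because no slave is pre-assigned a fixed share, the scheme adapts automatically to unbalanced task sizes and uneven processor speeds, which is exactly the advantage described above.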
[Figure: static task-farming — the master distributes the tasks to the slaves, the slaves return their results over the communications links, and the master collects the results and terminates the computation.]
SPMD applications can be very efficient if the data is well distributed among the processes and the system is homogeneous. If the processes present different workloads or capabilities, then the paradigm requires the support of a load-balancing scheme able to adapt the data distribution layout during run-time execution. This paradigm is highly
sensitive to the loss of some process. Usually, the loss of a single process is enough to cause a deadlock in the
calculation in which none of the processes can advance beyond a global synchronisation point.
[Figure: SPMD — the data is distributed among the processes, each process executes the same program on its own part of the data, and the results are collected.]
We can identify three generic computational operations for divide and conquer: split, compute, and join. The
application is organised in a sort of virtual tree: some of the processes create subtasks and have to combine the
results of those to produce an aggregate result. The tasks are actually computed by the compute processes at the
leaf nodes of the virtual tree. Figure below presents this execution.
[Figure: divide and conquer — the main problem is split into sub-problems, the sub-problems are computed at the leaves of the virtual tree, and the partial results are joined back into the final result.]
The task-farming paradigm can be seen as a slightly modified, degenerate form of divide and conquer: problem decomposition is performed before tasks are submitted, the split and join operations are done only by the master process, and all the other processes are responsible only for the computation.
In the divide and conquer model, tasks may be generated during runtime and may be added to a single job queue
on the manager processor or distributed through several job queues across the system.
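The split/compute/join structure can be sketched as a recursive parallel reduction. The sketch is illustrative: internal nodes of the virtual tree block while waiting for their children, so the worker count must be large enough that blocked nodes cannot exhaust the pool.

```python
from concurrent.futures import ThreadPoolExecutor

def dc_sum(data, pool, threshold=4):
    """Divide and conquer: split until pieces are small, compute at the
    leaves, then join partial results back up the virtual tree."""
    if len(data) <= threshold:           # leaf node: compute
        return sum(data)
    mid = len(data) // 2                 # split into sub-problems
    left = pool.submit(dc_sum, data[:mid], pool, threshold)
    right = dc_sum(data[mid:], pool, threshold)
    return left.result() + right        # join the partial results

def parallel_sum(data, n_workers=8, threshold=4):
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return dc_sum(data, pool, threshold)
```

Here one half of each split is submitted as a new runtime-generated task while the other half is computed in place, mirroring how tasks may be generated during execution and added to the system's job queues.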
The programming paradigms can be mainly characterised by two factors: decomposition and distribution of the
parallelism. For instance, in geometric parallelism both the decomposition and distribution are static. The same
happens with the functional decomposition and distribution used in data pipelining. In task farming, the work is statically
decomposed but dynamically distributed. Finally, in the divide and conquer paradigm both decomposition and
distribution are dynamic.
In some asynchronous problems, like discrete-event simulation, the system will attempt the look-ahead execution
of related activities in an optimistic assumption that such concurrent executions do not violate the consistency of
the problem execution. Sometimes they do, and in such cases it is necessary to rollback to some previous consistent
state of the application.
Another use of this paradigm is to employ different algorithms for the same problem; the first one to give the final
solution is the one that is chosen.
Hiding the underlying structure from the user by presenting a simple interface results in programs that are easier
to understand and maintain, as well as less prone to error. In particular, the programmer can now focus on the
computational task rather than the control and coordination of the parallelism.
Exploiting the observation that parallel applications follow some well-identified structures, much of the parallel software specific to the paradigm is potentially reusable. Such software can be encapsulated in parallel libraries
to promote the reuse of code, reduce the burden on the parallel programmer, and to facilitate the task of recycling
existing sequential programs. This guideline was followed by the PUL project, the TINA system, and the ARNIA
package.
A project developed at the Edinburgh Parallel Computing Centre involved the writing of a package of parallel
utilities (PUL) on top of MPI that gives programming support for the most common programming paradigms as
well as parallel input/output. Apart from the libraries for global and parallel I/O, the collection of the PUL utilities
includes a library for task-farming applications (PUL-TF), another that supports regular domain decomposition
applications (PUL-RD), and another one that can be used to program irregular mesh-based problems (PUL-SM).
This set of PUL utilities hides the hard details of the parallel implementation from the application programmer and
provides a portable programming interface that can be used on several computing platforms. To ensure programming
flexibility, the application can make simultaneous use of different PUL libraries and have direct access to the MPI
communication routines.
The ARNIA package includes a library for master/slave applications, another for the domain decomposition paradigm,
a special library for distributed computing applications based on the client/server model, and a fourth library that
supports a global shared memory emulation. ARNIA allows the combined use of its building libraries for those
applications that present mixed paradigms or distinct computational phases.
A skeleton generator called TINA supports the reusability and portability of parallel program components and
provides a complete programming environment.
Another graphical programming environment, named TRACS, provides a graphical toolkit to design distributed/
parallel applications based on reusable components, such as farms, grids, and pipes.
Porting and rewriting application programs requires a support environment that encourages code reuse, portability
among different platforms, and scalability across similar systems of different size. This approach, based on skeletal
frameworks, is a viable solution for parallel programming. It can significantly increase programmer productivity
because programmers will be able to develop parts of programs simply by filling in the templates. The development
of software templates has been increasingly receiving the attention of academic research and is seen as one of the
key directions for parallel software.
The most important advantages of this approach for parallel programming are summarised below.
5.7.1 Programmability
A set of ready-to-use solutions for parallelisation will considerably increase the productivity of the programmers:
the idea is to hide the lower level details of the system, to promote the reuse of code, and relieve the burden of
the application programmer. This approach will increase the programmability of the parallel systems since the
programmer will have more time to spend in optimising the application itself, rather than on low-level details of
the underlying programming system.
5.7.2 Reusability
Reusability is a hot-topic in software engineering. The provision of skeletons or templates to the application
programmer increases the potential for reuse by allowing the same parallel structure to be used in different applications.
This avoids the replication of efforts involved in developing and optimising the code specific to the parallel template.
It was reported that the percentage of code reuse rose from 30 percent up to 90 percent when using skeleton-oriented programming. In the Chameleon system, 60 percent of the code was reusable, while an average fraction of 80 percent of the code was reused with the ROPE library.
5.7.3 Portability
Providing portability of the parallel applications is a problem of paramount importance. It allows applications
developed on one platform to run on another platform without the need for redevelopment.
5.7.4 Efficiency
There can be a conflicting trade-off between optimal performance and portability/programmability. Both
portability and efficiency of parallel programming systems play an important role in the success of parallel
computing.
Summary
• Scalable computing clusters, ranging from a cluster of (homogeneous or heterogeneous) PCs or workstations,
to SMPs, are rapidly becoming the standard platforms for high-performance and large-scale computing.
• A cluster is a type of parallel or distributed processing system, which consists of a collection of interconnected
stand-alone computers working together as a single, integrated computing resource.
• A computer node can be a single or multiprocessor system (PCs, workstations, or SMPs) with memory, I/O
facilities, and an operating system.
• A cluster generally refers to two or more computers (nodes) connected together.
• The network interface hardware acts as a communication processor and is responsible for transmitting and
receiving packets of data between cluster nodes via a network/switch.
• Communication software offers a means of fast and reliable data communication among cluster nodes and to
the outside world.
• The class of applications that a cluster can typically cope with includes demanding sequential applications and grand challenge/supercomputing applications.
• In modern computers, parallelism appears at various levels both in hardware and software: signal, circuit,
component, and system levels.
• The first two levels (signal and circuit level) of parallelism are implemented implicitly by the hardware, a technique called hardware parallelism.
• Parallelising compilers are still limited to applications that exhibit regular parallelism, such as computations
in loops.
• The High Performance Fortran (HPF) initiative seems to be a promising solution to solve the dusty-deck problem
of Fortran codes.
• Message passing libraries allow efficient parallel programs to be written for distributed memory systems.
• Virtual shared memory implements a shared-memory programming model in a distributed-memory
environment.
• Distributed Shared Memory (DSM) is the extension of the well-accepted shared memory programming model
on systems without physically shared memory.
• A programming paradigm is a class of algorithms that solve different problems but have the same control structure.
References
• Grama, 2010. An Introduction to Parallel Computing: Design and Analysis of Algorithms, 2/e, Pearson Education India.
• Bhujade, M. R., 1995. Parallel Computing, New Age International.
• Buyya, R., Parallel Programming Models and Paradigms [pdf] Available at: <http://www.buyya.com/cluster/
v2chap1.pdf> [Accessed 21 June 2012].
• Dr. Dobb’s, Designing Parallel Algorithms: Part 1 [Online] Available at: <http://www.drdobbs.com/article/print?articleId=223100878&siteSectionName=parallel> [Accessed 21 June 2012].
• 2011. High-Performance Computing - Episode 1 - Introducing MPI [Video Online] Available at: <http://www.youtube.com/watch?v=kHV6wmG35po> [Accessed 21 June 2012].
• 2009. Lec-7 Pipeline Concept-I [Video Online] Available at: <http://www.youtube.com/watch?v=AXgfeV568c8&feature=results_video&playnext=1&list=PLD4E8A4E592F7A7D8> [Accessed 21 June 2012].
Recommended Reading
• Joubert, G. R., 2004. Parallel Computing: Software Technology, Algorithms, Architectures and Applications,
Elsevier.
• Pacheco, P., 2011. An Introduction to Parallel Programming, Elsevier.
• Quinn, 2003. Parallel Programming In C With Mpi And Open Mp, Tata McGraw-Hill Education.
Self Assessment
1. Which of the following statements is false?
a. Scalable computing clusters, ranging from a cluster of (homogeneous or heterogeneous) PCs or workstations,
to SMPs, are rapidly becoming the standard platforms for high-performance and large-scale computing.
b. A cluster is a type of parallel or distributed processing system, which consists of a collection of interconnected
stand-alone computers working together as a single, integrated computing resource.
c. A computer node can be a single or multiprocessor system (PCs, workstations, or SMPs) with memory, I/O
facilities, and an operating system.
d. The network interface software is responsible for transmitting and receiving packets of data between cluster
nodes via a network/switch.
4. Communication software offers a means of fast and reliable data communication among ________and to the
outside world.
a. multiprocessors
b. message passing libraries
c. cluster nodes
d. compilers
7. Providing ________of the parallel applications allows applications developed on one platform to run on another
platform without the need for redevelopment.
a. efficiency
b. portability
c. reusability
d. programmability
10. Which of these is not one of the three generic computational operations for divide and conquer?
a. Split
b. Compute
c. Join
d. Replace
Chapter VI
Interconnection Networks for Parallel Computers
Aim
The aim of this chapter is to:
Objectives
The objectives of this chapter are to:
Learning outcome
At the end of this chapter, you will be able to:
6.1 Introduction
One of the most important components of a multiprocessor machine is the interconnection network. In order to
solve a problem, the processing elements have to cooperate and to exchange their computed data over the network.
Both the shared memory and distributed memory architectures require an interconnection network to connect the
processor and memory or the modules respectively.
Fig. 6.1 Classification of interconnection networks (a) a static network (b) a dynamic network
(Source: http://www.cs.csi.cuny.edu/~yumei/csc770/Lec3-InnerconnectionNetworks.ppt)
This arrangement can be useful as it matches the way peripherals are configured about a computer. Usually a computer will have only one screen, printer and backing store. With star topologies these resources would be placed at the central node, giving all processors even access to them. Stars have a mixed degree. The terminal nodes are only linked to the central node, so their degree is one. The central node is linked to all the other processors, so its degree is N-1. If the degree is a function of N, as in this case, it is said to be variable. The routing algorithm is simple, and need only reside at the central node: it receives messages from a port and simply redirects them to the port corresponding to the destination.
Extending the star involves increasing the fan-out of the central node rather than the depth, as with a tree. This makes the growth complexity one, which is not only better than most other topologies but also the best theoretically possible. The central node has to be modified in order to cope with the extra node; however, all the other nodes can remain the same.
The longest path starts at a terminal node, runs along the branch to the central node and then back out to another terminal node; the diameter is, therefore, two. As the diameter is not a function of N, it is said to be ‘fixed’.
A disadvantage of this topology lies with the central processor, which can become a communication bottleneck; consequently, star networks tend to be heterogeneous, with the central processor designed with a much higher communications bandwidth than the others. The problem of this topology is obvious: for every communication, the central node has to be passed (bottleneck).
Another, more serious, problem is that should the central processor fail, the whole network fails, along with all access
to peripherals. Again the central processor design needs to be different from the others. The reliability of the central
processor can be improved by using higher grade components and adding error tolerant memory. The bottleneck
at the central node makes it impractical to have many processors in such a computer. However, star networks are
commonly found in local area networks.
Fig. 6.4 Linear arrays (a) with no wraparound links (b) with wraparound link
(Source: http://www.cs.csi.cuny.edu/~yumei/csc770/Lec3-InnerconnectionNetworks.ppt)
2D-array
In a 2D-array, the processors are ordered to form a 2-dimensional structure (square), so that each processor is connected to its four neighbours (north, south, east, west), except perhaps for the processors on the boundary.
6.5.4 Mesh
The simplest connection topology is an n-dimensional mesh. In a 1-D mesh, all nodes are arranged in a line, where the interior nodes have two neighbours and the boundary nodes have one.
A 2-D mesh can be created by having each node in a two-dimensional array connected to all its four nearest neighbors.
3D-mesh refers to a generalisation of a 2D-mesh in three dimensions. If the free ends of a mesh circulate back to
the opposite sides, the network is then called torus.
Fig. 6.5 Meshes (a) 2-D mesh with no wraparound (b) 2-D mesh with wraparound link (2-D torus) (c) a 3-D
mesh with no wraparound
(Source: http://www.cs.csi.cuny.edu/~yumei/csc770/Lec3-InnerconnectionNetworks.ppt)
Binary tree
In the basic (binary) tree, the root node has two sons. The interior nodes have three connections (two sons and one father), whereas the leaves just have one father. There is only one path between any pair of nodes. A communication bottleneck occurs at the higher levels of the tree; the solution can be to increase the number of communication links and switching nodes closer to the root.
Fig. 6.7 Binary tree
(Source: http://www.gigaflop.demon.co.uk/comp/chapt3.htm#s3.2)
Fat trees
Links higher up the tree potentially carry more traffic than those at the lower levels. For this reason, a variant
called a fat-tree fattens the links as we go up the tree. Trees can be laid out in 2D with no wire crossings. This is an
attractive property of trees.
6.5.6 Hypercube
Simple cubes have three dimensions. ‘Hypercubes’ are produced by increasing the number of dimensions in a cube.
The term hypercube refers to any cube with four or more dimensions. A four-dimensional cube is called a ‘tesseract’
and can be thought of as two three-dimensional cubes whose corresponding vertices have been connected.
A hypercube configuration in dimension n has 2^n vertices and n·2^(n-1) edges. For instance, if n = 1, there are two vertices and one edge; in two dimensions, we have a square with four edges and four vertices. A vertex of a hypercube in dimension n can be represented by an n-bit vector. Two nodes are connected if and only if their bit representations differ in exactly one position.
In a ring, every processing element (PE) is wired to just two neighbours, whereas in a 4-D hypercube every PE is connected to four neighbours. If a node wants to communicate with a node that is not its neighbour, the message has to pass through the intermediate nodes between the two communicating processors.
Another hypercubic network is the cube-connected cycles (CCC) network: a hypercube whose nodes are replaced by rings/cycles of length n, so that the resulting network has constant degree. This network can also be viewed as a butterfly whose first and last levels collapse into one.
6.6.1 Bus-Based Networks
The simplest network consists of a shared medium (bus) that is common to all the nodes. The distance between any two nodes in the network is constant. It is ideal for broadcasting information among nodes, and it is scalable in terms of cost, but not in terms of performance.
• The bounded bandwidth of a bus places limitations on the overall performance as the number of nodes
increases.
• Typical bus-based machines are limited to dozens of nodes.
Sun Enterprise servers and Intel Pentium based shared-bus multiprocessors are examples of such architectures.
Fig. 6.11 Bus based interconnects (a) with no local caches (b) with local memory/caches
(Source: http://sydney.edu.au/engineering/it/~comp5426/sem12012/LectureNotes/lecture02-3-12.pdf)
A crossbar network connects p processors to b memory banks through a grid of switches in a non-blocking manner. The total number of switching nodes required is Θ(pb) (it is reasonable to assume b ≥ p). The crossbar is scalable in terms of performance, but not in terms of cost. Examples of machines that employ crossbars include the Sun Ultra HPC 10000 and the Fujitsu VPP500.
Fig. 6.12 A completely non-blocking crossbar network connecting ‘p’ processors to ‘b’ memory banks
(Source: http://sydney.edu.au/engineering/it/~comp5426/sem12012/LectureNotes/lecture02-3-12.pdf)
6.6.4 Omega Network
An omega network consists of log p stages, where p is the number of inputs (processing nodes) and also the number of outputs (memory banks). Each stage consists of an interconnection pattern that connects p inputs and p outputs by a perfect shuffle (left rotation):

j = 2i,         for 0 ≤ i ≤ p/2 − 1
j = 2i + 1 − p, for p/2 ≤ i ≤ p − 1
Fig. 6.14 A perfect shuffle interconnection for eight inputs and outputs
(Source: http://sydney.edu.au/engineering/it/~comp5426/sem12012/LectureNotes/lecture02-3-12.pdf)
A complete omega network with the perfect shuffle interconnects and switches can now be illustrated. An omega network has (p/2 × log p) switching nodes, and the cost of such a network grows as Θ(p log p).
Fig. 6.15 A complete omega network connecting eight inputs and eight outputs
(Source: http://www.cs.csi.cuny.edu/~yumei/csc770/Lec3-InnerconnectionNetworks.ppt)
Fig. 6.16 An example of blocking in omega network: one of the messages (010 to 111 or 110 to 100) is
blocked at link AB
(Source: http://parallelcomp.uw.hu/ch02lev1sec4.html)
Summary
• Network topologies describe how to connect processors and memories to other processors and memories.
• Interconnection networks carry data between processors and to memory.
• Interconnects are made of switches and links (wires, fiber). Interconnects are classified as static or dynamic.
• A star has one central node, which is connected to all other nodes. The star topology is the degenerate case of
a tree.
• Star networks are commonly found in local area networks.
• In a linear array, each node (except the two nodes at the ends) has two neighbors, one each to its left and
right.
• The simplest connection topology is an n-dimensional mesh. In a 1-D mesh, all nodes are arranged in a line, where the interior nodes have two neighbours and the boundary nodes have one.
• Trees are hierarchical structures that have some resemblance to natural trees.
• Trees start with a node at the top called the root; this node is connected to other nodes by ‘edges’ or ‘branches’.
• In the basic (binary) tree, the root node has two sons. The interior nodes have three connections (two sons and one father), whereas the leaves just have one father.
• Simple cubes have three dimensions.
• ‘Hypercubes’ are produced by increasing the number of dimensions in a cube. The term hypercube refers to any
cube with four or more dimensions.
• Communication links are connected to one another dynamically by the switches to establish paths among
processing nodes and memory banks.
• The simplest network consists of a shared medium (bus) that is common to all the nodes. The distance between any two nodes in the network is constant.
References
• Grama, 2010. An Introduction to Parallel Computing: Design and Analysis of Algorithms, 2/e, Pearson Education
India.
• Sengupta, J., 2005. Interconnection Networks For Parallel Processing, Deep and Deep Publications.
• Lecture 3: Interconnection Networks for Parallel Computers [Online] Available at: <http://www.cs.csi.cuny.edu/~yumei/csc770/Lec3-InnerconnectionNetworks.ppt> [Accessed 21 June 2012].
• Physical Organization of Parallel Platforms [Online] Available at: <http://parallelcomp.uw.hu/ch02lev1sec4.
html> [Accessed 21 June 2012].
• 2012. Network Topology [Video Online] Available at: <http://www.youtube.com/watch?v=POkzLHoZJ0Y>
[Accessed 21 June 2012].
• 2009. Computer Networking Tutorial - 3 - Network Topology [Video Online] Available at: <http://www.youtube.
com/watch?v=kfEDPQAYH4k> [Accessed 21 June 2012].
Recommended Reading
• Treleaven, P. C. & Vanneschi, M., 1987. Future Parallel Computers: An Advanced Course, Pisa, Italy, June
9-20, 1986, Proceedings, Springer.
• Rauber, T. & Rünger, G., 2010. Parallel Programming: For Multicore and Cluster Systems, Springer.
• Duato, 2003. Interconnection Networks: An Engineering Approach, Morgan Kaufmann.
Self Assessment
1. Interconnection networks carry data between processors and to _________.
a. memory
b. nodes
c. shared medium
d. memory banks
2. A star has one central _______, which is connected to all other nodes.
a. processing element
b. memory bank
c. node
d. processor
6. Which of these are hierarchical structures that have some resemblance to natural trees?
a. Trees
b. Hypercube
c. Torus
d. Tesseract
8. What refers to the minimum number of wires that needs to be cut to divide the network into two equal parts?
a. Arc connectivity
b. Bisection width
c. Diameter
d. Scalability
10. In a _______every processing element (PE) is wired with just two neighbors.
a. ring
b. mesh
c. torus
d. hypercube
Chapter VII
Parallel Sorting
Aim
The aim of this chapter is to:
• define sorting
Objectives
The objectives of this chapter are to:
Learning outcome
At the end of this chapter, you will be able to:
7.1 Introduction
Sorting is the process of reordering a sequence taken as input and producing one that is ordered according to
an attribute. Parallel sorting is the process of using multiple processing units to collectively sort an unordered
sequence. The unsorted sequence is composed of disjoint sub-sequences, each of which is associated with a unique
processing unit. Parallel sorting produces a fully sorted sequence composed of ordered sub-sequences, each of which
is associated with a unique processing unit. The produced sequences are typically ordered according to the given
processor ordering and are of roughly equal length.
Sorting is one of the most common operations performed by a computer. Because sorted data are easier to manipulate
than randomly-ordered data, many algorithms require sorted data. Sorting is of additional importance to parallel
computing because of its close relation to the task of routing data among processes, which is an essential part of
many parallel algorithms. Many parallel sorting algorithms have been investigated for a variety of parallel computer
architectures.
Sorting is defined as the task of arranging an unordered collection of elements into monotonically increasing (or decreasing) order. Specifically, let S = <a1, a2, ..., an> be a sequence of n elements in arbitrary order; sorting transforms S into a monotonically increasing sequence S' = <a1', a2', ..., an'> such that ai' ≤ aj' for 1 ≤ i ≤ j ≤ n, and S' is a permutation of S.
There are n unsorted keys, distributed evenly over p processors. The distribution of keys in the range is unknown
and possibly skewed. The objective of parallel sorting is to:
• Sort the data globally according to keys
• Ensure no processor has more than (n/p) + threshold keys
The majority of parallel sorting algorithms can be classified as either merge-based or splitter-based.
However, newer and much larger architectures have changed the problem statement further. Therefore, traditional
approaches, including splitter-based sorting algorithms, require reevaluation and improvement. We will briefly
detail some of these methods below.
A bitonic sequence is a sequence of values a0, ..., an-1 with the property that there exists an index i, where 0 ≤ i ≤ n − 1, such that a0 through ai is monotonically increasing and ai through an-1 is monotonically decreasing, or there exists a cyclic shift of indices so that the first condition is satisfied.

A bitonic sequence increases monotonically, then decreases monotonically. For n/p = 1, Θ(lg n) merges are required, with each merge having a cost of Θ(lg n). The composed running time of bitonic sort for n/p = 1 is thus Θ(lg² n). Bitonic sort can be generalised for n/p > 1, with a complexity of Θ(n lg² n). Adaptive bitonic sorting, a modification of bitonic sort, avoids unnecessary comparisons, which results in an improved, optimal complexity of Θ(n lg n).
Unfortunately, each step of bitonic sort requires movement of data between pairs of processors. Like most merge-
based algorithms, bitonic sort can perform very well when n/p (where n is the total number of keys and p is the
number of processors) is small, since it operates in-place and effectively combines messages.
On the other hand, its performance quickly degrades as n/p becomes large, which is the much more realistic scenario for typical scientific applications. The major drawback of bitonic sort on modern architectures is that it moves the data Θ(lg p) times, which turns into a costly bottleneck if we scale to higher problem sizes or a larger number of processors. Since this algorithm is old and very well-studied, we will not go into any deep analysis or testing of it. Nevertheless, bitonic sort has laid the groundwork for much of parallel sorting research and continues to influence modern algorithms. One good comparative study of this algorithm has been documented by Blelloch et al.
In sample sort with regular sampling, the sample is typically regularly spaced in the locally sorted data, with a sample size of s = p − 1 per processor. The worst-case final load imbalance is 2 × (n/p) keys; in practice, the load imbalance is typically very small. The combined sample becomes a bottleneck, however, since its size (s × p) grows as ~p². With 64-bit keys, if p = 8192, sample is 16 GB.
Sorting by random sampling
Sorting by random sampling is an interesting alternative to regular sampling. The main difference between the two
sampling techniques is that a random sample is flexible in size and collected randomly from each processor’s local
data rather than as a regularly spaced sample. The advantage of sorting by random sampling is that often sufficient
load balance can be achieved for s < p, which allows for potentially better scaling.
Additionally, a random sample can be retrieved before sorting the local data, which allows for overlap between sorting and splitter calculation. However, the technique is not wholly reliable and can result in severe load imbalance, especially with larger numbers of processors.
Radix sort can be parallelised simply by assigning some subset of buckets to each processor. In addition, it can
deal with uneven distributions efficiently by assigning a varying number of buckets to all processors every step.
This number can be determined by having each processor count how many of its keys will go to each bucket, then
summing up these histograms with a reduction. Once a processor receives the combined histogram, it can adaptively
assign buckets to processors. Radix sort is not a comparison-based sort. However, it is a reasonable assumption to equate a comparison operation to a bit manipulation, since both are likely to be dominated by the memory access time. Nevertheless, radix sort is not bound by Θ((n lg n)/p), as any comparison-based parallel sorting algorithm would be.
In fact, this algorithm’s complexity varies linearly with n. The performance of the sort can be expressed as Θ(bn/p), where b is the number of bits in a key. This expression is evident in that doubling the number of bits in the keys entails doubling the number of iterations of radix sort.
The main drawback to parallel radix sort is that it requires multiple iterations of costly all-to-all data exchanges.
The cache efficiency of this algorithm can also be comparatively weak. In a comparison-based sorting algorithm,
we generally deal with contiguous allocations of keys. During sequential sorting (specifically in the partitioning
phase of Quicksort), we iterate through keys with only two iterators and swap them between two already accessed
locations.
Communication in comparison-based sorting is also cache efficient because we can usually copy sorted blocks into
messages. In radix sort, at every iteration any given key might be moved to any bucket (there are 64 thousand of these
for a 16-bit radix), completely independent of the destination of the previously indexed key. However, Thearling et
al. propose a clever scheme for improving the cache efficiency during the counting stage.
Nevertheless, radix sort is a simple and commonly accepted approach to parallel sorting. Therefore, despite its
limitations, we implemented Radix Sort and analysed its performance.
We used histogram sort as the basis for our work because it has the essential quality of only moving the actual data
once, combined with an efficient method for dealing with uneven distributions. In fact, histogram sort is unique in
its ability to reliably achieve a defined level of load balance. Therefore, we decided this algorithm is a theoretically
well suited base for scaling sorting on modern architectures.
[Figure: a four-step example of redistributing the keys 4, 3, 1, 2 among processors]
Summary
• Sorting is the process of reordering a sequence taken as input and producing one that is ordered according to
an attribute.
• Parallel sorting is the process of using multiple processing units to collectively sort an unordered sequence.
• Because sorted data are easier to manipulate than randomly-ordered data, many algorithms require sorted
data.
• Sorting is defined as the task of arranging an unordered collection of elements into monotonically increasing
(or decreasing) order.
• Merge-based parallel sorting algorithms rely on merging data between pairs of processors.
• Splitter-based parallel sorting algorithms aim to define a vector of splitters that subdivides the data into p
approximately equalised sections.
• Bitonic sort, a merge-based algorithm, was one of the earliest procedures for parallel sorting. It was introduced
in 1968 by Batcher.
• Bitonic sort is based on repeatedly merging two bitonic sequences to form a larger bitonic sequence.
• A bitonic sequence increases monotonically then decreases monotonically.
• Sample sort is a popular and widely analysed splitter based method for parallel sorting.
• The advantage of sorting by random sampling is that often sufficient load balance can be achieved for s < p,
which allows for potentially better scaling.
• Radix sort is a sorting method that uses the binary representation of keys to migrate them to the appropriate
bucket in a series of steps.
• Radix sort can be parallelised simply by assigning some subset of buckets to each processor.
• The main drawback to parallel radix sort is that it requires multiple iterations of costly all-to-all data
exchanges.
• Like sample sort, histogram sort also determines a set of p-1 splitters to divide the keys into p evenly sized
sections.
References
• Grama, 2010. An Introduction to Parallel Computing: Design and Analysis of Algorithms, 2/e, Pearson Education
India.
• Feitelson, G., 2002. Job Scheduling Strategies for Parallel Processing, Springer.
• 2010, Highly Scalable Parallel Sorting [pdf] Available at: <http://charm.cs.illinois.edu/talks/SortingIPDPS10.
pdf> [Accessed 21 June 2012].
• Sorting [pdf] Available at: <http://www.corelab.ntua.gr/courses/parallel.postgrad/Sorting.pdf> [Accessed 21
June 2012].
• 2009. Algorithms Lesson 3: Merge Sort [Video Online] Available at: <http://www.youtube.com/
watch?v=GCae1WNvnZM> [Accessed 21 June 2012].
• 2012. Radix Sort Tutorial [Video Online] Available at: <http://www.youtube.com/watch?v=xhr26ia4k38>
[Accessed 21 June 2012].
Recommended Reading
• Roosta, S. H., 2000. Parallel Processing and Parallel Algorithms: Theory and Computation, Springer.
• Pacheco, P., 2011. An Introduction to Parallel Programming, Elsevier.
• Culler, D. E., Singh, J. & Gupta, A., 1999. Parallel Computer Architecture: A Hardware/Software Approach,
Gulf Professional Publishing.
Self Assessment
1. What refers to the process of reordering a sequence taken as input and producing one that is ordered according
to an attribute?
a. Merging
b. Sorting
c. Splitting
d. Sampling
3. Which of the following is a key that partitions the global set of keys at a desired location?
a. Splitter
b. Processor
c. Bitonic sequence
d. Probe
4. The sorting by regular sampling technique is a reliable and practical variation of sample sort that uses a sample
size of s = ____________.
a. p-2
b. n/p
c. p-1
d. p/n
5. Bitonic sort is based on repeatedly merging ______bitonic sequences to form a larger bitonic sequence.
a. two
b. three
c. four
d. six
7. The major drawback of bitonic sort on modern architectures is that it moves the data ________ times.
a. Θ(log p)
b. Θ(lg p)
c. Θ(lg p²)
d. Θ(lg p³)
8. The performance of the sort can be expressed as Ɵ(bn/p), where b is the number of ___in a key.
a. bits
b. probes
c. sequences
d. splitters
9. Communication in comparison-based sorting is also cache efficient because we can usually _________sorted
blocks into messages.
a. copy
b. merge
c. replace
d. split
10. Parallel sorting is the process of using multiple processing units to collectively sort an _______.
a. ordered sequence
b. unordered sequence
c. attribute
d. equalised section
Chapter VIII
Message-Passing Programming
Aim
The aim of this chapter is to:
Objectives
The objectives of this chapter are to:
Learning outcome
At the end of this chapter, you will be able to:
• understand MPI
• identify the basic functions for sending and receiving messages in MPI
8.1 Principles of Message-Passing Programming
The logical view of a machine supporting the message passing paradigm consists of p processes, each with its own
exclusive address space. Each data element must belong to one of the partitions of the space; hence, data must be
explicitly partitioned and placed. All interactions (read-only or read/write) require cooperation of two processes - the
process that has the data and the process that wants to access the data. These two constraints, while onerous, make
underlying costs very explicit to the programmer.
Message-passing programs are often written using the asynchronous or loosely synchronous paradigms. In the
asynchronous paradigm, all concurrent tasks execute asynchronously. In the loosely synchronous model, tasks
or subsets of tasks synchronise to perform interactions. Between these interactions, tasks execute completely
asynchronously. Most message-passing programs are written using the single program multiple data (SPMD)
model.
[Figure: the logical view of a message-passing platform — p processing nodes, each with a CPU and its own exclusive memory, connected by an interconnection network]
Consider the following simple example, in which process P0 sends the value of a to process P1:

P0:                        P1:
a = 100;
send(&a, 1, 1);            receive(&a, 1, 0);
a = 0;                     printf("%d\n", a);

The semantics of the send operation require that the value received by process P1 must be 100 as opposed to 0. This motivates the design of the send and receive protocols.
In the non-buffered blocking send, the operation does not return until the matching receive has been encountered at the receiving process. Idling and deadlocks are major issues with non-buffered blocking sends. In buffered blocking sends, the sender simply copies the data into the designated buffer and returns after the copy operation has been completed. The data is copied to a buffer at the receiving end as well. Buffering alleviates idling at the expense of copying overheads.
[Figure: blocking non-buffered sends — (a) sender comes first: idling at sender; (b) sender and receiver come at about the same time: idling minimised; (c) receiver comes first: idling at receiver]

It is easy to see that in cases where the sender and receiver do not reach the communication point at similar times, there can be considerable idling overheads.
[Figure: blocking buffered transfer — the sender returns after copying the data to a buffer; the data is copied to the receiver when the receive is posted]
Blocking buffered transfer protocols:
• In the presence of communication hardware with buffers at send and receive ends
• In the absence of communication hardware, sender interrupts receiver and deposits data in buffer at receiver
end
Bounded buffer sizes can have significant impact on performance.
Deadlocks are still possible with buffering, since receive operations block: if two processes both execute a blocking receive before their matching sends, neither can proceed.
MPI (the Message Passing Interface) is the most popular message-passing specification supporting parallel programming. It is standardised and portable, functioning on a wide variety of parallel computers, and it allows the development of portable and scalable large-scale parallel applications.
MPI_Init is called prior to any other MPI routine in order to initialise the MPI environment; it also strips off any MPI-related command-line arguments. All MPI routines, data-types, and constants are prefixed by “MPI_”. The return code for successful completion is MPI_SUCCESS.
8.8 Communicators
A communicator defines a communication domain - a set of processes that are allowed to communicate with each
other. Information about communication domains is stored in variables of type MPI_Comm. Communicators are used
as arguments to all message transfer MPI routines. A process can belong to many different (possibly overlapping)
communication domains. MPI defines a default communicator called MPI_COMM_WORLD which includes all
the processes.
The rank of a process is an integer that ranges from zero up to the size of the communicator minus one.
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int npes, myrank;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &npes);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    printf("From process %d out of %d, Hello World!\n",
           myrank, npes);
    MPI_Finalize();
    return 0;
}
The datatype MPI_BYTE corresponds to a byte (8 bits) and MPI_PACKED corresponds to a collection of data
items that has been created by packing non-contiguous data. The message-tag can take values ranging from zero up
to the MPI defined constant MPI_TAG_UB.
MPI allows specification of wildcard arguments for both source and tag. If source is set to MPI_ANY_SOURCE,
then any process of the communication domain can be the source of the message. If tag is set to MPI_ANY_TAG,
then messages with any tag are accepted. On the receive side, the message must be of length equal to or less than
the length field specified.
On the receiving end, the status variable can be used to get information about the MPI_Recv operation.
int a[10], b[10], myrank;
MPI_Status status;
...
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank == 0) {
    MPI_Send(a, 10, MPI_INT, 1, 1, MPI_COMM_WORLD);
    MPI_Send(b, 10, MPI_INT, 1, 2, MPI_COMM_WORLD);
}
else if (myrank == 1) {
    MPI_Recv(b, 10, MPI_INT, 0, 2, MPI_COMM_WORLD, &status);
    MPI_Recv(a, 10, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
}
...
If MPI_Send is blocking, there is a deadlock.
Consider the following piece of code, in which process i sends a message to process i + 1 (modulo the number of processes) and receives a message from process i − 1 (modulo the number of processes).
int a[10], b[10], npes, myrank;
MPI_Status status;
...
MPI_Comm_size(MPI_COMM_WORLD, &npes);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
MPI_Send(a, 10, MPI_INT, (myrank+1)%npes, 1,
MPI_COMM_WORLD);
MPI_Recv(b, 10, MPI_INT, (myrank-1+npes)%npes, 1,
MPI_COMM_WORLD, &status);
...
Once again, we have a deadlock if MPI_Send is blocking.
To avoid such deadlocks, MPI provides the MPI_Sendrecv function, which performs the send and the receive as a single operation; its arguments include the arguments to both the send and receive functions. If we wish to use the same buffer for both send and receive, we can use:
int MPI_Sendrecv_replace(void *buf, int count,
MPI_Datatype datatype, int dest, int sendtag,
int source, int recvtag, MPI_Comm comm,
MPI_Status *status)
The MPI_Cart_create function takes the processes in the old communicator and creates a new communicator with dims dimensions. Each processor can now be identified in this new Cartesian topology by a vector of dimension dims.
Since sending and receiving messages still require (one-dimensional) ranks, MPI provides the routines MPI_Cart_rank and MPI_Cart_coords to convert ranks to Cartesian coordinates and vice versa.
The most common operation on Cartesian topologies is a shift. To determine the ranks of the source and destination of such a shift, MPI provides the MPI_Cart_shift function.
In order to overlap communication with computation, MPI provides the non-blocking MPI_Isend and MPI_Irecv operations, which return before the operations have been completed. The function MPI_Test tests whether or not the non-blocking send or receive operation identified by its request has finished.
int MPI_Test(MPI_Request *request, int *flag,
MPI_Status *status)
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank == 0) {
MPI_Send(a, 10, MPI_INT, 1, 1, MPI_COMM_WORLD);
MPI_Send(b, 10, MPI_INT, 1, 2, MPI_COMM_WORLD);
}
else if (myrank == 1) {
MPI_Recv(b, 10, MPI_INT, 0, 2, MPI_COMM_WORLD, &status);
MPI_Recv(a, 10, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
}
...
Replacing either the send or the receive operations with non-blocking counterparts fixes this deadlock.
Value:    15  17  11  12  17  11
Process:   0   1   2   3   4   5
MPI Datatype            C Datatype
MPI_2INT                pair of ints
MPI_SHORT_INT           short and int
MPI_LONG_INT            long and int
MPI_LONG_DOUBLE_INT     long double and int
MPI_FLOAT_INT           float and int
MPI_DOUBLE_INT          double and int
Table 8.2 MPI datatypes for data-pairs used with the MPI_MAXLOC and MPI_MINLOC reduction
operations.
If the result of the reduction operation is needed by all processes, MPI provides the MPI_Allreduce function. MPI also provides the MPI_Allgather function, in which the data are gathered at all the processes.
Summary
• The logical view of a machine supporting the message passing paradigm consists of p processes, each with its
own exclusive address space.
• Message-passing programs are often written using the asynchronous or loosely synchronous paradigms.
• A simple method for forcing send/receive semantics is for the send operation to return only when it is safe to
do so.
• In the non-buffered blocking send, the operation does not return until the matching receive has been encountered
at the receiving process.
• A simple solution to the idling and deadlocking problem outlined above is to rely on buffers at the sending and
receiving ends.
• The sender simply copies the data into the designated buffer and returns after the copy operation has been
completed.
• MPI defines a standard library for message-passing that can be used to develop portable message-passing
programs using either C or Fortran.
• The MPI standard defines both the syntax as well as the semantics of a core set of library routines.
• Vendor implementations of MPI are available on almost all commercial parallel computers.
• MPI_Init is called prior to any calls to other MPI routines. Its purpose is to initialise the MPI environment.
• A communicator defines a communication domain - a set of processes that are allowed to communicate with
each other.
• MPI provides equivalent datatypes for all C datatypes.
• MPI allows specification of wildcard arguments for both source and tag.
• In order to overlap communication with computation, MPI provides a pair of functions for performing non-
blocking send and receive operations (“I” stands for “Immediate”).
• MPI provides an extensive set of functions for performing common collective communication operations.
References
• Lastovetsky, A., 2003. Parallel Computing on Heterogeneous Networks, John Wiley & Sons.
• Padua, D., 2011. Encyclopedia of Parallel Computing, Volume 4, Springer.
• Message-Passing Programming [pdf] Available at: <http://www.cs.umsl.edu/~sanjiv/classes/cs5740/lectures/
mpi.pdf> [Accessed 21 June 2012].
• Sarkar, V., 2008. Programming Using the Message Passing Paradigm (Chapter 6) [pdf] Available at: <http://
www.cs.rice.edu/~vs3/comp422/lecture-notes/comp422-lec16-s08-v2.pdf> [Accessed 21 June 2012].
• 2011. 00 11 Message Passing [Video Online] Available at: <http://www.youtube.com/watch?v=c5NKVAPf2OE>
[Accessed 21 June 2012].
• 2012. Message Passing Algorithms - SixtySec [Video Online] Available at: <http://www.youtube.com/
watch?v=7IdLzEoiPY4> [Accessed 21 June 2012].
Recommended Reading
• Gropp, W., Lusk, E. & Skjellum, A., 2006. Using Mpi: Portable Parallel Programming With the Message-
Passing Interface, Volume 1, MIT Press.
• Rauber, T. & Rünger, G., 2010. Parallel Programming: For Multicore and Cluster Systems, Springer.
• Tokhi, 2003. Parallel Computing for Real-Time Signal Processing and Control, Springer.
Self Assessment
1. The logical view of a machine supporting the message passing paradigm consists of p processes, each with its
own exclusive ____________.
a. communication domain
b. address space
c. datatype
d. paradigms
2. What defines a standard library for message-passing that can be used to develop portable message-passing
programs using either C or Fortran?
a. SPMD
b. MPI
c. MIMD
d. CPU
3. A simple solution to the idling and deadlocking problem outlined above is to rely on __________at the sending
and receiving ends.
a. buffers
b. overheads
c. libraries
d. communicators
4. What refers to a set of processes that are allowed to communicate with each other?
a. Communication domain
b. Datatype
c. Message passing programming
d. Debugging
5. Each data element must belong to one of the partitions of the __________; hence, data must be explicitly
partitioned and placed.
a. time
b. order
c. space
d. volume
8. Idling and _________ are major issues with non-buffered blocking sends.
a. partitions
b. communication
c. deadlocks
d. buffering
Application I
Composite: Content Management Solution Uses Parallelisation to Deliver Huge Performance Gains
Content management system vendor Composite needed to parallelise its software to realise the performance gains
enabled by today’s multicore processors. Composite took advantage of the new parallel-programming tools provided
in the Microsoft Visual Studio 2010 development system and the .NET Framework 4 to parallelise its code. The
company’s efforts have yielded impressive performance gains. An eight-core server is delivering a 60 percent
reduction in page-rendering times and an 80 percent reduction in the compilation of dynamic types upon system
initialisation. What’s more, by using the latest Microsoft aids for parallel programming, Composite was able to
implement parallelism in its solution quickly and cost-effectively, with very little developer effort.
Situation
Microsoft Gold Certified Partner Composite develops and sells Composite C1, a content management system (CMS)
designed to help companies build Web sites that combine solid marketing infrastructure, innovative design, and
strong usability. Originally founded as a Web development shop, Composite decided to build its second-generation
CMS in 2005 after realising that the Microsoft Visual Studio 2005 development system and the Microsoft .NET
Framework 2.0 presented an opportunity to build a solution that could meet the needs of both Web developers and
designers.
As new versions of Visual Studio and the .NET Framework become available, Composite examines them closely to
determine how they can be applied to improve its own product. For example, the company took advantage of Visual
Studio 2008 and the .NET Framework 3.5 to add support for Language-Integrated Query (LINQ) in version 1.2 of
Composite C1. Composite began this exercise again in 2009, when it joined an early adopter program for Visual
Studio 2010 and the .NET Framework 4 and began planning for the development of Composite C1 version 1.3.
One area on which Composite decided to focus was performance—specifically, how to optimise its software to
get the most out of modern, multicore processors and multiprocessor servers. “For the past few decades, we’ve all
benefited from rapidly increasing processor clock speeds,” says Marcus Wendt, Cofounder and Product Manager at
Composite. “However, this extended ‘free lunch’ is over, in that clock speeds have leveled off and chip manufacturers
are turning to multiple-processor cores for further gains in processing power. Therein lies the challenge, in that most
applications today—Composite C1 version 1.2 included—are not multicore optimised, which can result in one core
running at 100 percent while the rest remain idle. To get the best performance out of today’s multicore processors,
we needed to introduce parallelisation into our code.”
Solution
Composite took advantage of the new parallel-programming tools provided in the Microsoft Visual Studio 2010
development system and the .NET Framework 4, and the company’s efforts have yielded significant performance
gains in multiple areas. “Parallel programming has traditionally been difficult, tedious, and hard to debug, with very
limited tool support,” says Wendt. “The System.Threading namespace in the .NET Framework has existed for years,
but in the past it required a lot of ‘plumbing code’ to use effectively. Visual Studio 2010 and the .NET Framework
4 eliminate a lot of complexity to make parallel programming much easier.”
• Parallel Language-Integrated Query (PLINQ), a parallel implementation of LINQ to Objects that combines
the simplicity and readability of LINQ syntax with the power of parallel programming. PLINQ implements
the full set of LINQ standard query operators as extension methods in the System.Linq namespace, along with
additional operators to control the execution of parallel operations. As with code that targets the Task Parallel
Library on top of which PLINQ is built, PLINQ queries scale in the degree of concurrency according to the
capabilities of the host computer.
• New Data Structures for Parallel Programming, which include concurrent collection classes that are scalable
and thread safe; lightweight synchronisation primitives; and types for lazy initialisation and producer/consumer
scenarios. Developers can use these new types with any multithreaded application code, including that which
uses the Task Parallel Library and PLINQ.
Composite also took advantage of the new Parallel Stacks and Parallel Tasks windows for debugging code, which
are provided in Visual Studio 2010 Ultimate, Premium, and Professional. Visual Studio 2010 Premium and Ultimate
also have a new Concurrency Visualiser, which is integrated with the profiler to provide graphical, tabular, and
numerical data about how multithreaded applications interact with themselves and with other programs. “The
Concurrency Visualiser and other parallel-programming tools in Visual Studio 2010 are a great help in that they
enable developers to quickly identify areas of concern and navigate through call stacks and to relevant call sites in
the source code,” says Martin Ingvar Jensen, Senior Developer at Composite.
Parallelisation was achieved by changing a classic foreach loop in the application’s compilation manager to a
Parallel.ForEach loop—a task that required changing three lines of code. “The time that Composite C1 spends
compiling dynamic types has been significantly reduced on multicore systems, with performance increasing steadily
as more cores are available,” says Wendt. “Not only does this reduce startup times upon initial deployment, but it
also makes developers more efficient and productive when they’re working with our software.
“Given the very limited amount of code work we had to do, it’s fair to say that the support for parallel programming
provided in the .NET Framework 4 worked very well for us,” says Wendt.
The company parallelised the rendering process by using the new data structures for parallel programming. “Rather
than using a C# statement such as foreach to declare that concurrency is desired, we call the static ForEach method
on the Parallel class, passing a collection of data and a lambda expression you want to execute,” explains Wendt.
“The .NET Framework 4 handles all of the complex thread management in accordance with the underlying hardware
platform, firing off more threads as more cores are available. This is work that most developers will happily let the
underlying programming framework handle—in a way that’s likely more efficient and optimised than what they
could implement by hand.”
Composite’s use of ConcurrentQueue<T> instead of List<T> to store calculation results is also noteworthy. “We
do this because List<T> is not thread safe, meaning that you need to add locking to your code or brace yourself for
some unexpected results at runtime,” explains Wendt. “Developers still need to think about thread safety, but the
.NET Framework 4 makes the coding much easier, transforming the process from ‘be very careful and do the hard
work’ to just ‘be careful.’ ”
Developers also are taking advantage of the new support for covariance in generics. “Our system uses interfaces
as the generic parameter when querying data from our data layer,” explains Wendt. “Because Visual C# 3.5 did
not support covariance, we had to do a lot of expression tree transformation when using LINQ to SQL. Generic
covariance should give us a performance boost, make our code simpler and thus easier to maintain, and enable us
to add ‘data schema inheritance’ to our data layer to enable some pretty interesting new features.”
Benefits
By taking advantage of the new parallel-programming aids provided in Visual Studio 2010 and the .NET Framework
4, Composite was able to easily capitalise on the performance gains enabled by modern multicore processors.
“With Visual Studio 2010 and the .NET Framework 4, Microsoft is providing tools that immensely simplify
parallel development,” says Wendt. “Developers can simply ‘declare intent’ to do parallelisation and leave it to
the underlying framework to handle the rest. The process isn’t foolproof in that developers still need to understand
parallel programming, but it’s quite easy to use and, when used correctly, can enable applications to utilise modern
microprocessors much more efficiently.”
The test that Composite constructed to measure the effects of parallelisation on page-rendering times showed a
decrease of more than 60 percent. “Our use of parallelisation reduced the time required to render a test Web page
containing eight functions from 115 to 40.5 milliseconds—even with one page element that takes 40 milliseconds
to render on its own,” says Wendt. “That’s the beauty of parallelisation, in that it enables us to break up a chunk of
work into independent tasks and execute them concurrently to get the job done faster.”
(Source: Composite, Content Management Solution Uses Parallelisation to Deliver Huge Performance Gains
[Online] Available at: <http://www.microsoft.com/casestudies/Case_Study_Detail.aspx?casestudyid=4000006833>
[Accessed 22 June 2012].)
Questions
1. What were the benefits offered by taking advantage of the new parallel-programming aids provided in Visual
Studio 2010 and the .NET Framework 4?
Answer: By taking advantage of the new parallel-programming aids provided in Visual Studio 2010 and the
.NET Framework 4, Composite was able to easily capitalise on the performance gains enabled by modern
multicore processors. With Visual Studio 2010 and the .NET Framework 4, Microsoft is providing tools that
immensely simplify parallel development. Developers can simply ‘declare intent’ to do parallelisation and leave
it to the underlying framework to handle the rest. The process isn’t foolproof in that developers still need to
understand parallel programming, but it’s quite easy to use and, when used correctly, can enable applications
to utilise modern microprocessors much more efficiently.
The test that Composite constructed to measure the effects of parallelisation on page-rendering times showed a
decrease of more than 60 percent. The use of parallelisation reduced the time required to render a test Web page
containing eight functions from 115 to 40.5 milliseconds—even with one page element that takes 40 milliseconds
to render on its own. Parallelisation enables us to break up a chunk of work into independent tasks and execute
them concurrently to get the job done faster.
Application II
Introduction
Future many-core systems, with thousands of cores on a single chip, will be significantly different from present
multi-core chips. Busses scale poorly and will be replaced by on-chip interconnect networks with local subnets linked
by routers, and inter-core message-passing latencies varying by maybe an order of magnitude depending on the
physical distance. Cores (or groups of cores) will have their own clocks, synchronised via protocols running across
interconnect. Cores (or groups of cores) will have significant amounts of local memory. The only shared memory is
likely to be off-chip; even the closest off-chip memory may be distributed, due to three-dimensional integration. This
implies that shared memory will be several orders of magnitude more expensive to access than local memory.
The need to keep yields reasonably high will force manufacturers to ship chips with many dead cores and interconnects.
These will be detected and removed from the software-visible hardware during a burn-in phase or even at boot
time. The result is that the actual topology of the chip will not be determined by the part number, but will need to
be discovered at boot time, and the software must adapt to the topology.
Such a many-core chip will therefore look not unlike a contemporary workstation cluster: a distributed system with
a high-speed interconnect. Compared to local memory, the shared off-chip memory will be expensive to access and
will appear like some network attached storage in a cluster, and will be seen as backing store rather than being directly
accessed during computation. This observation makes it natural to explore programming paradigms developed for
distributed systems: explicit message passing and distributed shared memory (DSM). Message-passing approaches,
such as MPI, are an obvious approach that maps well to the message-passing nature of the hardware (as it does in
distributed systems), and is likely to be the approach of choice for highly-parallel applications. However, shared
memory is a more convenient programming paradigm in many circumstances and provides better support for many
legacy programs. For that reason, DSM is still popular in distributed systems today, and likely to remain so in the
future.
In this paper we make the case for a DSM-like approach as a way to manage many-core systems. We argue that,
contrary to classical DSM approaches, such as Ivy, Munin or Treadmarks, shared memory for many-core systems
should not be implemented as middleware (i.e. on top of the OS), but below the OS(es) inside a system virtualisation
layer. This extends the classical OS notion of virtual memory across the distributed system. To distinguish it from
DSM, we refer to this approach as virtual shared memory (VSM). We have recently developed a prototype VSM
system, called vNUMA (“virtual NUMA”), on a conventional workstation cluster. We argue that the vNUMA
approach presents a promising way of managing many-core chips, as it simplifies dealing with the distributed nature
of the hardware. It can do so without introducing overheads to high-performance applications, as explicit message
passing still works with full performance.
vNUMA Overview
vNUMA is a Type-I hypervisor which presents a shared-memory multiprocessor to the (guest) operating system.
Specifically, as vNUMA is implemented on distributed-memory hardware, where the cost of accessing remote
memory is orders of magnitude higher than for local memory, the virtual hardware provides cache-coherent non-
uniform memory access (ccNUMA). Hence, NUMA-aware software will perform better on vNUMA than software
that assumes uniform memory-access costs.
Implementing shared memory inside the hypervisor has a number of advantages. For one, all the memory in the
system becomes part of the VSM, and therefore the OS can access all memory from all nodes. (Of course, a side
effect of this virtualisation is that the hypervisor can also partition the system, dividing the complete physical memory
into several virtual machines, each running a separate, isolated OS. Besides other advantages of virtualisation, this
supports the deployment of OSes that will not scale to thousands of processors.)
Another advantage is that running in the machine’s most privileged mode gives a VSM system access to optimisations
that are beyond the reach of middleware such as MPI or DSM libraries. These include the efficient emulation of individual
instructions, and the use of the performance-monitoring unit (PMU) to track the execution of specific instructions.
Implementing VSM inside the hypervisor also changes some of the trade-offs compared to middleware systems,
and as a result requires different protocols. For example, software running on DSM systems is typically aware of
this, and specifically of the fact that the unit of migration and coherency is a hardware page. This is not the case
for a multiprocessor OS, especially a NUMA-unaware one, which expects data migration and coherency to have a
cacheline granularity. vNUMA therefore includes a number of enhancements to established DSM protocols to support
efficient write-sharing within a page. For example, vNUMA supports three different modes of write sharing of pages:
write-invalidate, write-update/multiple-writer, and write-update/single-writer. vNUMA adapts the write-sharing mode
based on the observed sharing patterns. In particular, vNUMA detects and efficiently handles atomic instructions
(such as compare-exchange) used by the OS to implement locks. For some optimisations (e.g. batching write-update
messages), vNUMA makes use of the weak store order provided by modern processors.
VSM on many-cores
Compared to a cluster, a VSM implementation will benefit from a number of advantages a multi-core chip has over
a traditional cluster. For one, network latencies (measured in CPU-core cycles) are orders of magnitude lower for
an on-chip interconnect compared to Ethernet. More importantly, if the VSM approach is adopted, support for it
will be designed into future many-core chips. This has the potential to significantly reduce overheads on a number
of operations that we found expensive in vNUMA (on current COTS hardware). For example, vNUMA traps all
writes to multiple-writer shared pages in order to determine when updates need to be distributed. While this provides
better performance than single-writer protocols in the presence of a limited degree of false sharing, if such writes are
frequent, performance will suffer. Architectural support, for example, in the form of write protection at a cache-line
granularity, could reduce this bottleneck.
Furthermore, if the VSM paradigm is widely adopted, then software will adapt to it, for example by changing OS
data structures to avoid false sharing. According to our (limited) experience with vNUMA scalability, NUMA-aware
software will generally work well, and programs that do not share memory at all will not be affected by VSM in
their performance.
The VSM approach can provide some obvious benefits to processor manufacturers:
• The hypervisor can transparently deal with a small amount of message loss in the interconnect. This allows
chipmakers to more aggressively optimise the network, to the degree where it is no longer fully reliable. In
fact, vNUMA, designed for notionally unreliable Ethernet, makes use of the fact that in a cluster environment,
Ethernet is in reality “almost” reliable, losing or damaging messages very rarely. vNUMA deals with this by using
checksums and timeouts, rather than more sophisticated protocols designed for really unreliable networks.
• Cache coherence does not have to be provided by hardware. With a growing number of cores, hardware solutions
become more complicated and costly. High-performance applications do not need them, as they deal with
distribution explicitly, reducing the benefit of providing coherence in hardware. Shifting coherence protocols
into the hypervisor has the added benefit that software can more easily adapt protocols to access patterns.
• The hypervisor can transparently re-map memory addresses and core IDs. This not only allows it to deal with
unreliable hardware, but also naturally supports turning off cores for power management. The virtual-memory
paradigm can be taken to its logical conclusion by transparently swapping out local memory to off-chip backing
store.
• Core heterogeneity is easy to support, as individual cores or groups of cores can run their own OS, with the
hypervisor simplifying access to remote memory. Heterogeneous OSes are also easy to support; the main
requirement is that each core’s ISA supports the hypervisor.
It would be possible to implement VSM inside native OSes running on many-cores. However, we expect virtualisation
to be widely used in future systems anyway, for reasons of resource isolation / quality of service, and for dynamic
resource management, in particular saving energy by shutting down unused cores. A single chip will typically run
multiple, heterogeneous operating systems, each with varying allocations of physical resources. Only the hypervisor
has access to the whole system, and as such is the ideal place to implement VSM. For example, it could use capabilities
for controlling access to pages or coarser-grain memory regions by guest OSes.
More precisely, the system should not impose communication overhead on applications that do not communicate. VSM
is designed precisely to achieve this: the coherency protocols ensure that pages which are only written by a single
core will be owned exclusively by that core, while read-only pages are shared. Coherency overheads only arise when
pages are shared, or change their mode. As such, the small-cluster scalability results we obtained from our vNUMA
prototype should be representative of parallel applications running on a small subset of cores of a large many-core
system; the main difference being that the many-core system should be more VSM-friendly than a cluster for the
reasons discussed in the introduction.
Furthermore, the coherency protocols ensure that message passing middleware like MPI on top of VSM should be
able to perform as well as without the VSM layer: as it never shares data, but copies it between nodes by explicit
messaging, it does not create coherency traffic. Hence, the VSM stays out of the way of software that does not need
it, but is there to support software that benefits from a shared memory model.
Related work
VSM is based on the ideas of DSM, pioneered by Ivy. Mirage moved DSM into the OS to improve transparency.
Munin utilised weaker memory consistency to support simultaneous writers. Disco carves a NUMA system into
multiple virtual SMP nodes for the benefit of existing operating systems that may not support a NUMA architecture.
This is, in a way, the opposite of VSM, which combines separate nodes into a single virtual NUMA system, allowing
a single operating system instance to span multiple nodes that do not share memory.
Since our initial publication on vNUMA, systems using similar ideas have emerged: Virtual Iron’s VFe hypervisor and
the Virtual Multiprocessor from the University of Tokyo. While these systems demonstrate combining virtualisation with
distributed shared memory, they are limited in scope and performance. Virtual Iron attempted to address some of the
performance issues by using high-end hardware, such as InfiniBand rather than Gigabit Ethernet, which effectively
makes the network more similar to what we expect from future many-cores. Virtual Iron has since abandoned the
product for commercial reasons, which largely seems to stem from its dependence on such high-end hardware.
More recently, startup company ScaleMP started to market their vSMP system, which seems similar in nature
(also uses InfiniBand). This supports our claim that there is on-going interest in SMP as a programming model on
distributed-memory hardware. Catamount partitions shared memory between nodes but makes remote partitions
available via virtual-memory mapping. Work on many-core scheduling is orthogonal to the VSM concept. Barrelfish
deals with many-core resource heterogeneity by making it explicit. While we agree that this is the best way to achieve
best performance, it only benefits applications that are designed to deal with explicit heterogeneity.
Conclusions
We made a case for virtual shared memory, i.e., a virtual-memory abstraction implemented over physically distributed
memories by a hypervisor, as an attractive model for managing future many-core chips. Based on our experience
with a cluster-based prototype, we argue that VSM provides a shared-memory abstraction for software that needs
it, without imposing significant overheads on software that does not share (virtual) memory. We have argued that
the approach integrates well with the use of virtualisation for resource management on many-cores, and simplifies
dealing with faulty cores, faulty interconnects and heterogeneity. It may allow processor manufacturers to move
cache coherence protocols from hardware into software.
(Source: Heiser, Many-Core Chips — A Case for Virtual Shared Memory [Online] Available at: <http://ssrg.nicta.
com.au/publications/papers/Heiser_09a.pdf> [Accessed 27 June 2012])
Questions
1. Give a brief description about vNUMA.
2. Mention the VSM approaches that can provide some obvious benefits to processor manufacturers.
3. Enumerate the applications of VSM.
Application III
FOAM: Expanding the Horizons of Climate Modeling
Introduction
Climate modeling consistently has been among the most computationally demanding applications of scientific
computing, and an important initial consumer of advances in scientific computation. We report here on the Fast
Ocean-Atmosphere Model (FOAM), a new coupled climate model that continues in this tradition by using a
combination of new model formulation and parallel computing to expand the time horizon that may be addressed
by explicit fluid dynamical representations of the climate system.
Climate is the set of statistical properties that emerges at large temporal and spatial scales from well-known physical
principles acting at much smaller scales. Thus, successfully representing such large-scale climate phenomena
while specifying only small-scale physics is an extremely computationally demanding endeavor. Until recently,
long-duration simulations with physically realistic models have been infeasible. Through improvements in model
formulation and in computational efficiency, however, our work breaks new ground in climate modeling.
Our progress in model formulation has been based on a new, efficient representation of the world ocean. Success in
this endeavor has further directed our efforts to establishing the minimum spatial resolution that can be considered
a successful representation of the earth’s climate for the purposes of understanding decade to century variability.
Our work on computational efficiency has focused on developing a coupled model that can execute efficiently
using message passing on massively parallel distributed-memory computer systems. Such systems have proved
powerful and cost-effective in those applications to which they can effectively be applied. The result of this work is
a coupled model that can sustain a simulation speed of 6,000 times faster than real time on a moderate-sized parallel
computer, while representing coupled atmosphere-ocean dynamical processes of very long duration which are of
current scientific interest. In this important class of application, we have obtained a significant cost performance
advantage relative to leading contemporary climate models. We believe that the achieved throughput in terms of
wall clock time is the highest of any coupled general circulation model to date.
FOAM has already been used to obtain significant scientific results. Modes of variability with time scales on the
order of a century have been identified in the model. These results provide a basis for observational and theoretical
studies of climate dynamics that may permit climatologists to observe and explain such phenomena.
General circulation models
Models which are intended to explicitly represent the fluid dynamics of an entire planet starting from the equations
of fluid motion are known as general circulation models (GCMs). In the case of the earth, there are two distinct but
interacting global scale fluids, and hence three classes of GCM: atmospheric GCMs, oceanic GCMs, and coupled
ocean-atmosphere GCMs.
GCM calculation is essentially a time integration of a first-order-in-time, second-order-in-space weakly nonlinear set
of partial differential equations. As is typical in such problems, the system is represented by a discretisation in space
and time. The maximum time step of this class of computational model is approximately inversely proportional to the
spatial resolution, while the number of spatial points is inversely proportional to its square. Hence, the computational
cost, even without increases in vertical resolution which may be required, is roughly proportional to the inverse
cube of the horizontal spacing of represented points.
Since planets are large and the scale of fluid phenomena is small, representation in GCMs is quite coarse, typically
on a grid scale of hundreds of kilometers. Nevertheless, on the order of 10^10 floating point operations are required
at a typical modest resolution to represent the flow of the atmosphere for a single day. The calculations are further
complicated by the necessity to represent other physical processes that are inputs into the fluid dynamics, such
as radiative physics, cloud convective physics, and land surface interactions in atmospheric models, as well as
atmospheric forcing and equations of state in oceanic models.
Despite these daunting constraints, atmospheric and oceanic GCMs that succeed in representing the broad features
of the earth’s climate have been available for about two decades. Continuing refinement and demonstrable progress
have been evident in intervening years. These GCMs have a remarkable range of applications, which can usefully
and neatly be divided into two classes: those within and those outside the limits of dynamic predictability set by
the chaotic nature of nonlinear fluid dynamics.
In the first class of model, predictions or representations of specific flow configurations are sought. Weather models
are the best known such applications, but there are operational ocean models and research models of both the ocean
and atmosphere systems that fall into this class.
In the second class are climate applications, where the duration of the simulation is longer than the predictability of
the instantaneous flow patterns. Here, the interest is in representing the statistical properties of the flow rather than
its instantaneous details. Since prediction of specific fluid dynamical events is replaced by studies of the statistics
of such events, much longer durations must be calculated.
Meaningful climate statistics emerge after some years while the limit of dynamic predictability is at most tens of
days. Hence, the cost at a given spatial resolution is at least two orders of magnitude greater for climate applications
than for applications that simulate individual events within the limits of dynamic predictability. As a result, for a
given spatial resolution and set of represented physical processes, climate modeling is intrinsically a much more
computationally demanding application than weather modeling. Furthermore, additional physics may be required
in climate applications that may be left out of weather models.
In order to understand the implications of past or future simulated climates at particular locations, as well as to
represent the global dynamics more accurately, the push in climate modeling has been toward higher spatial resolution
as more computational resources become available. Still, integrations of climate models for about a century have
represented the maximum attainable until very recently.
A second push has been towards coupled models. There is a rough symmetry between atmosphere and ocean in
that each provides an important boundary condition for the other. Early long-duration models studied the ocean or
atmosphere in isolation, using observed data for the other boundary condition. Efforts to link models of the two
systems into a coupled model have been frustrated by the independent and, until recently, mutually inconsistent
representations of the physics of the air-sea boundary. Substantial progress in this area has been reported recently,
particularly at the National Center for Atmospheric Research (NCAR). Our project benefits directly from this
progress.
Interest in coupled modeling applications has been intense because there are variations in climate at all time scales,
many of which are poorly understood. Such phenomena cannot be represented by either atmospheric or oceanic
GCMs. The coupling between atmosphere and ocean, as well as with land and ice processes in some applications,
thus becomes critical.
Additionally, longer simulations are required than in more established climate modeling approaches, since the time
constants of the relevant processes are much longer, on the order of decades. Finally, there is enormous practical
and theoretical interest in transient climate responses to rapid changes in atmospheric conditions, such as changes in
atmospheric concentrations of radiatively active (‘‘greenhouse’’) gases and aerosol (airborne fine dust) distributions.
A few such runs have been performed [24], but in these cases it is difficult to separate intrinsic climate variability
from variability in response to the changes in atmospheric composition. To address this question rigorously would
require ensembles of similar runs, again multiplying the requisite computational resources substantially.
Thus, calculations of ten thousand years, requiring on the order of 10¹⁷ floating point operations, are of immediate
interest, even at the modest resolutions currently used for climate models.
Parallel Computing
One measure of simulation performance is ‘‘model speedup,’’ i.e., simulated time per wall clock time. We have
adopted as our objective a speedup of 10,000 for a model capable of representing deep ocean dynamics using a fully
bi-directionally coupled ocean-atmosphere model. Achieving this level of performance would make thousand-year simulations practical on a routine basis.
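To make the target concrete, the arithmetic behind "model speedup" can be written out directly (the function name is ours; a 365.25-day year is assumed):

```python
def wall_days(simulated_years, speedup):
    """Wall-clock days needed for a run at a given 'model speedup',
    defined above as simulated time per unit of wall-clock time."""
    return simulated_years * 365.25 / speedup

# At the target speedup of 10,000, a thousand-year simulation
# takes about 36.5 wall-clock days.
print(wall_days(1000, 10_000))
```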
We adopted several strategies to approach this goal. The simplest of these strategies is the sacrifice of spatial resolution
for expansion of available simulated time. Since computational cost is directly proportional to simulation duration,
but roughly proportional to the inverse cube of the horizontal spacing, simulated duration can be extended greatly
by using the lowest resolution that captures the phenomena of interest.
Investigations revealed that it was important to maintain a reasonable resolution within the ocean, due to the relatively
small scale of important ocean dynamics. However, we determined that a coarse representation of the atmosphere
is sufficient to represent multi-decadal coupled variability. In conventional coupled models, approximately equal
amounts of time are spent in the ocean and atmosphere; hence, reducing atmosphere resolution can make a large
difference to overall performance only if we are able to speed up the ocean simulation performance in some
other way. This observation led us to adopt as a principal algorithmic focus the improvement of the ocean model
efficiency in terms of the number of computations required per unit of simulated time. As we describe below, we
were successful in this endeavor; this success allowed for an excellent scaling of coupled performance from the
use of a low atmospheric resolution, since the ocean part of the model now accounts for only a small fraction of
the resources used.
A second strategy was to use established representations of system physics. As far as possible, we did not endeavor
to participate in the ongoing improvement of representation and parameterisation of the many relevant physical
processes of the climate system. Our objective was not to improve the representation of climate but to expand the
applicability of those representations.
A third strategy was to use massively parallel distributed-memory computing platforms, that is, computing platforms
built of relatively large numbers of processing units, each with a conventional memory and high speed message
links to other processors. This type of computer is well-suited to fluid dynamics applications, where low-latency,
high-bandwidth exchanges are necessary, but in a predictable sequence and thus subject to direct tuning. The
distributed-memory architecture avoids the hardware complexity of shared-memory configurations, improving cost
per performance and providing a straightforward hardware upgrade path.
A fourth component of our approach was to use the standard Message Passing Interface (MPI) to implement inter-
processor communication. The use of MPI not only facilitates the design of the communication-intensive parts of
the model, but also enhances our ability to run on a wide variety of platforms. While the coupled model has to date
been run only on IBM SP platforms, both the ocean and atmosphere models have been benchmarked on a variety
of machines. As processors continue to improve, migration to commercial mass market platforms connected by
commodity networks may further improve cost efficiency.
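FOAM itself uses MPI for its inter-processor communication. The plain-Python sketch below (no MPI installation required) illustrates only the nearest-neighbor "halo exchange" pattern that point-to-point messages implement in a distributed-memory fluid code; the tiny 1-D periodic domain and all names are illustrative, not FOAM's actual decomposition:

```python
# Each "processor" owns a strip of rows plus one ghost row on each side;
# before a time step, ghost rows are filled from the neighboring strips,
# mimicking what MPI send/receive calls would do between adjacent ranks.
def exchange_halos(strips):
    for i, strip in enumerate(strips):
        lo = strips[(i - 1) % len(strips)]
        hi = strips[(i + 1) % len(strips)]
        strip[0] = lo[-2]   # ghost <- neighbor's last interior row
        strip[-1] = hi[1]   # ghost <- neighbor's first interior row

# Two strips of a periodic domain; index 0 and -1 are ghost rows.
a = [None, 1, 2, None]
b = [None, 3, 4, None]
exchange_halos([a, b])
print(a, b)  # [4, 1, 2, 3] [2, 3, 4, 1]
```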
The fifth element of our strategy was to design an independent piece of code, the ‘‘coupler,’’ to link pre-existing
atmosphere and ocean models. This structure minimised the changes required for those already tested pieces of
software. The coupler also includes a water runoff model, representing river flows and thus allowing for a closed
hydrological cycle. Efficiency and parallel scalability, that is, the ability to make optimal use of large numbers of
processors, were paramount in these design efforts. The development of model physics was avoided by this project,
in that the relevant algorithms, with the important exception of the new surface hydrology routines, were imported
without modification. This incremental approach allowed us to focus our attention on the computationally demanding
issues of the fluid dynamics and parallelisation.
Components of FOAM
As noted above, FOAM comprises an atmosphere model, ocean model, and coupler. We describe each of these
components in turn.
Calculations in the third, vertical dimension, particularly those representing radiative transfer, are tightly coupled,
so there is relatively less advantage to a vertical decomposition. As a result, the physics processes in CCM2, which
occur entirely in vertical columns, are represented without any information exchange between processors. Thus no
changes to the relevant code were needed.
The principal modification to PCCM2 to support its use in FOAM was to replace the lower boundary condition
routine with new code responsible for transferring data to the coupler and hence to the ocean model. As we noted
above, FOAM uses a low-resolution atmosphere. The vertical coordinate is a hybrid of a terrain following coordinate
and pressure, with 18 vertical levels. We use a 15th order rhomboidal (R15) horizontal resolution; this corresponds
to 40 latitudes arranged in a Gaussian distribution centered on the equator and 48 longitudes for an average grid size
of 4.5 degrees of latitude and 7.5 degrees of longitude. FOAM uses a 30 minute time step and the recommended diffusion coefficient values for an R15 CCM2.
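As a quick consistency check, the quoted 4.5 by 7.5 degree average grid size follows directly from the 40 x 48 point counts:

```python
# Average R15 grid size implied by the point counts above.
n_lat, n_lon = 40, 48
print(180 / n_lat, "degrees latitude,", 360 / n_lon, "degrees longitude")
```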
After we began building the coupled model, a new implementation of the Community Climate Model, CCM3, was released. The radiative and hydrologic changes to the physics from CCM2 to CCM3 have since been implemented in FOAM. The bulk of the code of FOAM, including the remainder of the physics, remains that of CCM2.
The radiation parameterisation used in the FOAM atmosphere model is based on PCCM2 but includes the CCM3
additions and improvements. Several other CCM3 innovations have also been incorporated in FOAM. In CCM2, all types of moist convection were handled by a simple mass flux scheme; in CCM3, and hence in FOAM, this scheme is used in conjunction with a deep convection parameterisation. The evaporation of stratiform precipitation is also now included. The boundary layer model of CCM2 has likewise been modified following CCM3. Finally, the surface fluxes over the ocean are now calculated using a diagnosed surface roughness that is a function of wind speed and stability.
The ocean model uses a standard vertical mixing scheme, but with a steeper Reynolds number dependency consistent with observational analysis. The revised mixing values appear to improve the tropical Pacific SST field by reducing the model's cold bias in the west equatorial Pacific.
A simple, unstaggered Mercator 128 x 128 point grid is used, yielding a discretisation of approximately 1.4 degrees latitude by 2.8 degrees longitude. A spatial filter similar to the sort used in atmospheric models is used to maintain
numerical stability in the Arctic. The topography used is somewhat tuned to preserve basin topology at the represented
resolution but is not smoothed.
The vertical discretisation is with height, with a stretched vertical coordinate maximising resolution in the upper layers.
For the runs reported here, a sixteen layer version was used. The central importance of the surface thermodynamics
in the coupling process and the objective of a minimal computational load led to this choice in preference to an
isopycnal model.
Three separate techniques are used to speed the performance of the ocean model. Unlike in some ocean models, the
free surface is explicitly represented, but its dynamics are artificially slowed, an approach which has been shown to
make little difference to the internal motions. In addition, the still relatively fast, and therefore difficult to represent,
free surface is modeled as a separate two-dimensional system coupled to the internal ocean in a way that correctly
reproduces the free surface while allowing a much longer time step in the internal ocean. Finally, that time step
itself is used only for the calculation of the fastest parts of the internal dynamics, while yet a longer step is used for
diffusive and advective processes.
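The three-tier time stepping just described can be sketched as nested sub-cycling: diffusive and advective processes advance on the longest step, the internal dynamics sub-cycle within it, and the two-dimensional free-surface system sub-cycles fastest of all. The step lengths below are illustrative placeholders of ours, not FOAM's actual values:

```python
def simulate_day(dt_slow_h=12, dt_internal_h=6, dt_surface_h=1):
    """Count how often each tier runs in one simulated day of a
    multi-rate scheme: slow processes outermost, internal dynamics
    inside, and the fast free-surface system innermost."""
    calls = {"slow": 0, "internal": 0, "surface": 0}
    for _ in range(24 // dt_slow_h):                    # diffusion/advection
        calls["slow"] += 1
        for _ in range(dt_slow_h // dt_internal_h):     # internal dynamics
            calls["internal"] += 1
            for _ in range(dt_internal_h // dt_surface_h):  # free surface
                calls["surface"] += 1
    return calls

print(simulate_day())  # {'slow': 2, 'internal': 4, 'surface': 24}
```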
We believe that the combination of these techniques yields the most computationally efficient ocean model in
existence. That is, for a given resolution, we believe that this model requires fewer floating point operations to
integrate the ocean for a given time than any other model. When compared with other state-of-the-art ocean models,
this improvement corresponds to roughly a tenfold increase in the amount of simulated time represented per unit
of computation.
We have benchmarked the ocean code at 128 x 128 resolution on 64 SP2 nodes running at over 105,000 times
real time. The ocean model also scales well to higher resolutions, and other applications beside long-term climate
modeling are anticipated.
The land surface in FOAM (and in CCM2) is represented by a four-layer diffusion model with heat capacities,
thicknesses and thermal conductivities specified for each layer. Soil types vary in the horizontal direction, with 5
distinct types derived from the vegetation data. Roughness lengths and albedos for two different radiation bands
are also specified. The fluxes of latent and sensible heat and momentum between the land and the atmosphere are
calculated using the bulk transfer formulas of CCM2, with stability-dependent coefficients. Between
the ocean and the atmosphere, the new bulk transfer formulas of CCM3 are used. These are also stability dependent
but do not assume a constant roughness length.
The hydrology in FOAM is a simple box model. This model was an option in early versions of CCM2 and was
also present in CCM1. Precipitation is added to a 15 cm soil moisture box or to the snow cover, if the ground and
lowest two atmosphere levels are below freezing. The soil moisture is used to calculate a wetness factor Dw used in the latent heat flux calculation (Dw equals 1 for land ice, sea ice, snow-covered, and ocean surfaces). Evaporation removes water from the box, and any excess over 15 cm is designated as runoff and sent to the river model. Snow
cover modifies the properties of the upper soil layer for purposes of the albedo and surface temperature calculations.
Snow melt is calculated and added to the local soil moisture. Snow depths greater than 1 m liquid water equivalent
are also sent to the river model to mimic the near-equilibrium of the Greenland and Antarctic ice sheets.
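The box hydrology above can be sketched in a few lines. The function `step_soil` and its bookkeeping are our own simplification (snow cover and the 1 m snow-depth rule are omitted); all quantities are in cm of liquid water:

```python
CAPACITY_CM = 15.0  # bucket depth stated in the text

def step_soil(moisture_cm, precip_cm, evap_cm):
    """One step of the bucket model: add precipitation, remove
    evaporation, and spill anything above capacity as runoff.
    Returns (new_moisture, runoff)."""
    moisture_cm = max(0.0, moisture_cm + precip_cm - evap_cm)
    runoff_cm = max(0.0, moisture_cm - CAPACITY_CM)
    return moisture_cm - runoff_cm, runoff_cm

# A nearly full bucket overflows: 14 + 2 - 0.5 = 15.5, so 0.5 cm runs off.
print(step_soil(14.0, 2.0, 0.5))  # (15.0, 0.5)
```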
Because fresh water cycling may be of interest in the phenomena to be studied by the model, and also to avoid long-
term ocean salinity drift, a closed hydrological cycle is implemented by the coupler, with a simple explicit river
model that results in a finite fresh water delay and a set of point sources (river mouths) for continental runoff. This
strategy enables the model to represent phenomena that involve coupling between variations in continental rainfall
and delayed resultant variations in ocean salinity, in turn affecting weather patterns through altered sea surface
temperatures resulting from altered ocean circulation.
The river model is based on the work of Miller et al. A similar implementation is also used in the coupled model of
Russell et al. First, a river flow direction is set for each land point. Although this can be automated by processing the topography file, in practice many of the river directions had to be set by hand so that the resulting basin boundaries resemble the observed ones.
The flow F in cubic meters per second out of a cell is F = V · (u/d), where V is the total river volume equal to the
local runoff plus the sum of the flow from up to seven of the eight neighboring cells, u is an effective flow velocity
which is taken as a constant 0.35 meters per second, and d is the downstream distance. Precipitation and evaporation
do not act directly on the river water and the temperature of the river water is not taken into account. V for an ocean
point near the coast is then calculated as the sum of the outflow from neighboring land points and converted back to
a flux by dividing by the area of that ocean point. This river freshwater flux is then added to the local precipitation
and evaporation rates to form the total freshwater flux at that point and close the hydrologic cycle.
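A worked example of the outflow formula, using the constant 0.35 m/s effective velocity from the text (the cell volume and downstream distance are made-up inputs for illustration):

```python
U_FLOW = 0.35  # effective flow velocity in m/s, as stated in the text

def outflow_m3s(volume_m3, downstream_dist_m):
    """River outflow F = V * (u / d) from the formula above."""
    return volume_m3 * (U_FLOW / downstream_dist_m)

# A cell holding 1e9 m^3 of river water, 100 km upstream of the next
# cell, drains at roughly 3,500 m^3/s.
print(outflow_m3s(1e9, 100_000.0))
```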
The temperature of the sea ice is determined by treating it as another soil type. The sea surface may continue to
lose heat by conduction with the lowest ice layer so a clamp on temperature is imposed by the ocean model at -1.92
degrees Celsius. Sea ice roughness and albedos are prescribed. For the hydrologic cycle, the formation of sea ice
is treated as a flux of 2 m of water out of the ocean. The stress between the ice and the atmosphere is arbitrarily
divided by 15 before passing to the ocean model.
Running FOAM
We used FOAM to perform a series of long-duration simulations with the goal of determining whether a low resolution
atmosphere could yield a reasonable representation of coupled ocean-atmosphere phenomena. We found that the
coarse R15 atmospheric resolution sufficed to yield a realistic ocean circulation. Although R15 is an extremely coarse resolution (40 latitudes and 48 longitudes), it still requires approximately 16 times as much processor time
as our ocean with 128 x 128 resolution on IBM SP platforms. This difference in execution time is attributable to
the relatively complicated atmospheric physics code and to the efficiency of our ocean model. Accordingly, we
typically run on 17 or 34 nodes, with 1 or 2 of those processors, respectively, dedicated to the ocean. To optimise
inter-processor communication, the coupler runs on the same nodes as the atmosphere.
Fig. 2 Time allocation for a typical FOAM run: Horizontal axis is labelled in seconds. Each bar represents
a single SP processor.
Fig. 2 shows the allocation of computational resources as a function of time for a typical 17 node run performing the
calculations for one simulated day. The bulk of the computation is allocated to the atmosphere implementation. The
ocean time step is six hours, so the ocean is called four times per simulated day. The faster atmospheric dynamics must
be represented on a half-hour time step, called 48 times. Twice per day, the radiative properties of the atmosphere
are recalculated, yielding particularly long atmosphere steps.
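The call counts just described follow from the stated time steps; a minimal scheduler sketch (the function name is ours) reproduces them:

```python
def daily_schedule(atmos_step_min=30, ocean_step_h=6, rad_per_day=2):
    """Count component invocations in one simulated day: atmosphere
    dynamics every 30 minutes, ocean every 6 hours, and radiation
    recalculated twice daily, matching the run described above."""
    minutes = 24 * 60
    calls = {"atmosphere": 0, "ocean": 0, "radiation": 0}
    for t in range(0, minutes, atmos_step_min):
        calls["atmosphere"] += 1
        if t % (ocean_step_h * 60) == 0:
            calls["ocean"] += 1
        if t % (minutes // rad_per_day) == 0:
            calls["radiation"] += 1
    return calls

print(daily_schedule())  # {'atmosphere': 48, 'ocean': 4, 'radiation': 2}
```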
All atmosphere nodes must integrate synchronously, as their results are dependent on results of neighboring
processors. This is seen by the simultaneous exit of all atmosphere processors from the coupler routine, which handles
all communication. The fact that all processors do not enter the coupler at the same time indicates imperfect load
balancing in the atmosphere calculations, typically because cloud distributions are not uniform. It is seen that one
ocean processor has no difficulty keeping up with 16 atmosphere processors, but that it cannot keep up with 32.
We are still tuning the code for performance. To date, our best performance has been approximately 6,000 times real time in a run on 68 nodes of an IBM SP2 using 120 MHz P2SC processors. However, because of constraints on the domain decomposition used in low-resolution applications of PCCM2, this represents poor scaling relative to our production runs: we typically achieve peak performance faster than 4,000 times real time on 34 nodes, and we see almost linear scaling on the 8, 16, and 32 atmosphere processor configurations that we normally use.
In practice, with the production of large output files and the sharing of the platform with other users, we are
producing results over months of real time at over 2,000 times real time and have established that we can scale this
up significantly with the availability of additional resources at contemporary levels of technology. We are hopeful
that even better results may be obtained as the platform configuration is tuned for scientific applications. We are
also investigating parallelisation of the input and output to further increase our efficiency.
The performance of FOAM can be compared directly to the NCAR CSM coupled model which accomplishes only
a third of FOAM’s maximum throughput using 16 nodes of a Cray C90. Our principal time advantages are in the
extremely effective ocean code and in the reduced resolution of the atmosphere, which remains adequate to capture
decadal ocean variability.
A further advantage of FOAM lies in the lower cost and complexity of the distributed memory systems on which
FOAM is executed. While determining the true cost of supercomputers is difficult, we estimate that the cost per
unit of performance of FOAM is already more than ten times better than that of other current models of the same
phenomena.
Near this point in the development of our model, NCAR released a new version of their Community Climate Model,
CCM3. We were fortunate in that the software interfaces to updated physics routines were largely unchanged. We
found that including the new CCM3 moisture physics into our model vastly improved its representation of the
tropical Pacific.
Fig. 3 Sea surface temperature patterns (degrees C): (a) FOAM output, (b) observations (see text), and (c) model minus observations
Annual average sea surface temperature as modeled by FOAM improved with CCM3 moisture physics is shown in
Fig. 3(a). For comparison, observational data is shown in Fig. 3(b) and the difference between the true and modeled
fields is shown in Fig. 3(c). The broad features of the temperature field are captured, though the tight gradients in
western boundary currents such as the Gulf Stream and the Kuroshio are somewhat smeared.
Except in the Antarctic Ocean, the results are comparable to those obtained with higher resolution atmospheres, less
efficient ocean models, and more expensive computational platforms. The errors in the Antarctic are attributable to
the crude representation of sea ice that we currently use. Updating this part of the model is currently a high priority.
The otherwise good agreement indicates that we are capturing the large scale features of ocean circulation.
Fig. 4 Two-basin variability: scales are arbitrary, but their product is the sea surface temperature anomaly amplitude as a function of space and time, in degrees C. (a) spatial pattern, and (b) temporal pattern
We have now successfully run the model for over 500 simulated years, and our first results regarding low frequency
variability of the coupled system are emerging. Fig. 4 shows a pattern (obtained by VARIMAX rotation of empirical
orthogonal function decomposition) that accounts for fully 15 percent of 60 month low-pass filtered variance in sea
surface temperature. The associated time series is also shown, indicating the long time scale of this phenomenon.
This correlation between North Atlantic and North Pacific, until recently unanticipated, corroborates recent model
and observational results by Latif and Barnett.
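For readers unfamiliar with the analysis behind Fig. 4: an empirical orthogonal function (EOF) decomposition is essentially principal component analysis of the space-time anomaly field. The sketch below, on synthetic data of our own construction, recovers a leading spatial pattern and its time series via SVD; the VARIMAX rotation and 60-month low-pass filtering used in the actual analysis are omitted:

```python
import numpy as np

# Rows are time samples, columns are grid points. The leading EOF is the
# spatial pattern explaining the most variance; the corresponding column
# of u gives its temporal pattern, as in Fig. 4.
rng = np.random.default_rng(0)
t = np.linspace(0, 20, 200)
pattern = np.array([1.0, -0.5, 0.25, -1.0])   # synthetic anomaly map
field = np.outer(np.sin(t), pattern) + 0.05 * rng.standard_normal((200, 4))

anom = field - field.mean(axis=0)             # remove the time mean
u, s, vt = np.linalg.svd(anom, full_matrices=False)
explained = s**2 / np.sum(s**2)               # variance fraction per EOF
print(round(float(explained[0]), 2))          # leading EOF dominates
```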
Conclusions
The FOAM project has accomplished a substantial improvement in the performance that can be achieved by coupled
climate models, without sacrificing physical realism. While reducing atmosphere resolution, it maintains a good
representation of the ocean. The result is that the model is able to identify and study phenomena of interest on
decadal and century time scales, and has succeeded in replicating recent model and observational results. In addition,
the model is able to exploit parallel computer systems that offer improved cost performance relative to the vector
supercomputers that have traditionally been used for such simulations. Given these successes, FOAM may now be
used for its intended purpose, to implement very long simulations for studying variability on the longest time scales.
Currently, we are performing these simulations on multi-computers such as the IBM SP.
In the longer term, we intend to examine the feasibility of using PC clusters to improve cost performance yet further.
We also hope to exploit high-speed networks to expand the utility of the model, by enabling remote browsing of
the large datasets generated by FOAM, hence making these datasets more accessible to the community, and by
using remote I/O techniques to enable seamless execution on remote computers, with files maintained at a central
location.
Since von Neumann used a weather model as his first test case of scientific computing, leading developments of scientific computing platforms have been put to some of their earliest tests by meteorological applications. This
will continue to be the case for the foreseeable future. The vast nature of the physical system will allow it to make
use of whatever resources become available. The ultimate goal of very high resolution, very long duration, and
very complete models remains far in the future. In the meantime, the FOAM project will endeavor to continue to
provide the first glimpses of climate variability on the longest time scales.
(Source: FOAM: Expanding the Horizons of Climate Modeling [pdf] Available at: <http://www.stat.cmu.
edu/~cschafer/Pubs/tobis97foam.pdf> [Accessed 22 June 2012])
Questions
1. Give a brief description about FOAM Coupler.
2. Define general circulation models (GCM).
3. Explain the FOAM Ocean Model.
Bibliography
References
Recommended Reading
• Anita, G., 2011. Computer Fundamentals, Pearson Education India.
• Culler, D. E., Singh, J. & Gupta, A., 1999. Parallel Computer Architecture: A Hardware/Software Approach,
Gulf Professional Publishing.
• Duato, J., 2003. Interconnection Networks: An Engineering Approach, Morgan Kaufmann.
• Foster, I., 1995. Designing and Building Parallel Programs: Concepts and Tools for Parallel Software
Engineering, Addison-Wesley.
• Gebali, F., 2011. Algorithms and Parallel Computing, John Wiley & Sons.
• Gropp, W., Lusk, E. & Skjellum, A., 2006. Using MPI: Portable Parallel Programming with the Message-Passing Interface, Volume 1, MIT Press.
• Hwang, K., 2003. Advanced Computer Architecture, Tata McGraw-Hill Education.
• Joubert, G. R., 2004. Parallel Computing: Software Technology, Algorithms, Architectures and Applications,
Elsevier.
• Lafferty, E. L., 1993. Parallel Computing: An Introduction, William Andrew.
• Null, L. & Lobur, J., 2010. The Essentials of Computer Organization and Architecture, Jones & Bartlett
Publishers.
• Pacheco, P., 2011. An Introduction to Parallel Programming, Elsevier.
• Padua, D., 2011. Encyclopedia of Parallel Computing, Volume 4, Springer.
• Quinn, M. J., 2003. Parallel Programming in C with MPI and OpenMP, Tata McGraw-Hill Education.
• Rauber, T. & Rünger, G., 2010. Parallel Programming: For Multicore and Cluster Systems, Springer.
• Roosta, S. H., 2000. Parallel Processing and Parallel Algorithms: Theory and Computation, Springer.
• Tokhi, M. O., 2003. Parallel Computing for Real-Time Signal Processing and Control, Springer.
• Treleaven, P. C. & Vanneschi, M., 1987. Future Parallel Computers: An Advanced Course, Pisa, Italy, June
9-20, 1986, Proceedings, Springer.
Chapter II
1. a
2. a
3. b
4. c
5. b
6. a
7. d
8. b
9. a
10. c
Chapter III
1. b
2. a
3. c
4. d
5. a
6. b
7. d
8. c
9. a
10. a
Chapter IV
1. a
2. b
3. c
4. d
5. a
6. a
7. a
8. a
9. b
10. a
Chapter V
1. d
2. a
3. b
4. c
5. c
6. a
7. b
8. a
9. a
10. d
Chapter VI
1. a
2. c
3. b
4. a
5. c
6. a
7. a
8. b
9. a
10. a
Chapter VII
1. b
2. a
3. a
4. c
5. a
6. a
7. b
8. a
9. a
10. b
Chapter VIII
1. b
2. b
3. a
4. a
5. c
6. a
7. a
8. c
9. b
10. a