
Parallel Computing

Board of Studies

Prof. H. N. Verma
Vice-Chancellor
Jaipur National University, Jaipur

Prof. M. K. Ghadoliya
Director, School of Distance Education and Learning
Jaipur National University, Jaipur

Dr. Rajendra Takale
Prof. and Head Academics
SBPIM, Pune

___________________________________________________________________________________________
Subject Expert Panel

Dr. Ramchandra G. Pawar
Director, SIBACA, Lonavala

Vaibhav Bedarkar
Subject Matter Expert, Pune

___________________________________________________________________________________________
Content Review Panel

Gaurav Modi
Subject Matter Expert

Shubhada Pawar
Subject Matter Expert

___________________________________________________________________________________________
Copyright ©

This book contains the course content for Parallel Computing.

First Edition 2014

Printed by
Universal Training Solutions Private Limited

Address
05th Floor, I-Space,
Bavdhan, Pune 411021.

All rights reserved. This book or any portion thereof may not be reproduced, distributed, transmitted, stored in a
retrieval system, or broadcast, in any form or by any means, electronic or mechanical, including photocopying
and recording.

___________________________________________________________________________________________
Index

I. Content....................................................................... II

II. List of Figures...........................................................VI

III. List of Tables........................................................VIII

IV. Abbreviations..........................................................IX

V. Application. ............................................................ 103

VI. Bibliography.......................................................... 123

VII. Self Assessment Answers .................................. 126

Book at a Glance

Contents
Chapter I........................................................................................................................................................ 1
Introduction to Parallel Computing............................................................................................................ 1
Aim................................................................................................................................................................. 1
Objectives....................................................................................................................................................... 1
Learning outcome........................................................................................................................................... 1
1.1 Introduction............................................................................................................................................... 2
1.2 Types of Parallel Computing..................................................................................................................... 4
1.3 Need for Parallel Computing.................................................................................................................... 6
1.4 Applications of Parallel Computing.......................................................................................................... 6
Summary . .............................................................................................................................................. 8
References . .............................................................................................................................................. 8
Recommended Reading................................................................................................................................ 8
Self Assessment.............................................................................................................................................. 9

Chapter II.....................................................................................................................................................11
Laws of Parallel Computing.......................................................................................................................11
Aim................................................................................................................................................................11
Objectives......................................................................................................................................................11
Learning outcome..........................................................................................................................................11
2.1 Amdahl’s Law......................................................................................................................................... 12
2.2 Minsky’s Conjecture [Minsky 1970]...................................................................................................... 13
2.3 Moore’s Law........................................................................................................................................... 15
Summary...................................................................................................................................................... 17
References.................................................................................................................................................... 17
Recommended Reading.............................................................................................................................. 17
Self Assessment............................................................................................................................................ 18

Chapter III................................................................................................................................................... 20
Evolution of Computer Architecture......................................................................................................... 20
Aim............................................................................................................................................................... 20
Objectives..................................................................................................................................................... 20
Learning outcome......................................................................................................................................... 20
3.1 Introduction............................................................................................................................................. 21
3.2 Brief History of Computer Architecture................................................................................................. 21
3.2.1 First Generation (1945-1958)................................................................................................. 22
3.2.2 Second Generation (1958-1964)............................................................................................. 23
3.2.3 Third Generation (1964-1974)................................................................................................ 24
3.2.4 Fourth Generation (1974-present).......................................................................................... 26
Summary...................................................................................................................................................... 29
References.................................................................................................................................................... 29
Recommended Reading.............................................................................................................................. 29
Self Assessment............................................................................................................................................ 30

Chapter IV................................................................................................................................................... 32
System Architectures.................................................................................................................................. 32
Aim............................................................................................................................................................... 32
Objectives..................................................................................................................................................... 32
Learning outcome......................................................................................................................................... 32
4.1 Parallel Architectures.............................................................................................................................. 33
4.2 Single Instruction - Single Data (SISD).................................................................................................. 33
4.3 Single Instruction - Multiple Data (SIMD)............................................................................................. 34
4.4 Multiple Instruction - Multiple Data (MIMD)........................................................................................ 35
4.5 Shared Memory....................................................................................................................................... 35

4.6 Distributed Memory................................................................................................................................ 37
4.7 ccNUMA................................................................................................................................................. 39
4.8 Cluster..................................................................................................................................................... 39
4.9 Multiple Instruction - Single Data (MISD)............................................................................................. 40
4.10 Some Examples..................................................................................................................................... 40
4.10.1 Intel Pentium D..................................................................................................................... 40
4.10.2 Intel Core 2 Duo................................................................................................................... 41
4.10.3 AMD Athlon 64 X2 & Opteron............................................................................................ 41
4.10.4 IBM pSeries.......................................................................................................................... 41
4.10.5 IBM BlueGene...................................................................................................................... 41
4.10.6 NEC SX-8............................................................................................................................. 41
4.10.7 Cray XT3.............................................................................................................................. 41
4.10.8 SGI Altix 3700...................................................................................................................... 41
Summary...................................................................................................................................................... 42
References.................................................................................................................................................... 42
Recommended Reading.............................................................................................................................. 43
Self Assessment............................................................................................................................................ 44

Chapter V..................................................................................................................................................... 46
Parallel Programming Models and Paradigms........................................................................................ 46
Aim............................................................................................................................................................... 46
Objectives..................................................................................................................................................... 46
Learning outcome......................................................................................................................................... 46
5.1 Introduction............................................................................................................................................. 47
5.2 A Cluster Computer and its Architecture................................................................................................ 47
5.3 Parallel Applications and their Development......................................................................................... 49
5.3.1 Strategies for Developing Parallel Applications..................................................................... 50
5.3.2 Code Granularity and Levels of Parallelism........................................................................... 51
5.4 Parallel Programming Models and Tools................................................................................................ 51
5.4.1 Parallelising Compilers........................................................................................................... 52
5.4.2 Parallel Languages.................................................................................................................. 52
5.4.3 High Performance Fortran...................................................................................................... 52
5.4.4 Message Passing..................................................................................................................... 53
5.4.5 Virtual Shared Memory........................................................................................................... 53
5.4.6 Parallel Object-Oriented Programming.................................................................................. 54
5.4.7 Programming Skeletons.......................................................................................................... 54
5.5 Methodical Design of Parallel Algorithms............................................................................................. 54
5.5.1 Partitioning.............................................................................................................................. 55
5.5.2 Communication....................................................................................................................... 55
5.5.3 Agglomeration........................................................................................................................ 55
5.5.4 Mapping.................................................................................................................................. 55
5.6 Parallel Programming Paradigms........................................................................................................... 55
5.6.1 Choice of Paradigms............................................................................................................... 55
5.6.2 Task-Farming (or Master/Slave)............................................................................................. 57
5.6.3 Single-Program Multiple-Data (SPMD)................................................................................. 58
5.6.4 Data Pipelining....................................................................................................................... 59
5.6.5 Divide and Conquer................................................................................................................ 59
5.6.6 Speculative Parallelism........................................................................................................... 60
5.6.7 Hybrid Models........................................................................................................................ 61
5.7 Programming Skeletons and Templates.................................................................................................. 61
5.7.1 Programmability..................................................................................................................... 62
5.7.2 Reusability.............................................................................................................................. 62
5.7.3 Portability................................................................................................................................ 62
5.7.4 Efficiency................................................................................................................................ 62

Summary...................................................................................................................................................... 63
References.................................................................................................................................................... 63
Recommended Reading.............................................................................................................................. 63
Self Assessment............................................................................................................................................ 64

Chapter VI................................................................................................................................................... 66
Interconnection Networks for Parallel Computers.................................................................................. 66
Aim............................................................................................................................................................... 66
Objectives..................................................................................................................................................... 66
Learning outcome......................................................................................................................................... 66
6.1 Introduction............................................................................................................................................. 67
6.2 Network Topologies................................................................................................................................ 67
6.3 Metrics for Interconnection Networks.................................................................................................... 67
6.4 Classification of Interconnection Networks .......................................................................................... 67
6.5 Static Network........................................................................................................................................ 68
6.5.1 Completely-connected Network............................................................................................. 68
6.5.2 Star-Connected Network......................................................................................................... 68
6.5.3 Linear Array............................................................................................................................ 69
6.5.4 Mesh........................................................................................................................................ 69
6.5.5 Tree Network ......................................................................................................................... 70
6.5.6 Hypercube............................................................................................................................... 71
6.6 Dynamic Networks................................................................................................................................. 72
6.6.1 Bus-Based Networks............................................................................................................... 73
6.6.2 Crossbar Networks.................................................................................................................. 73
6.6.3 Multistage Networks............................................................................................................... 74
6.6.4 Omega Network...................................................................................................................... 75
Summary...................................................................................................................................................... 77
References.................................................................................................................................................... 77
Recommended Reading.............................................................................................................................. 77
Self Assessment............................................................................................................................................ 78

Chapter VII................................................................................................................................................. 80
Parallel Sorting............................................................................................................................................ 80
Aim............................................................................................................................................................... 80
Objectives..................................................................................................................................................... 80
Learning outcome......................................................................................................................................... 80
7.1 Introduction............................................................................................................................................. 81
7.2 Merge-based Parallel Sorting . ............................................................................................................... 81
7.3 Splitter-Based Parallel Sorting................................................................................................................ 81
7.4 Splitter-based Basic Histogram Sort....................................................................................................... 82
7.5 Bitonic Sort............................................................................................................................................. 83
7.6 Sample Sort............................................................................................................................................. 84
7.7 Radix Sort . ............................................................................................................................................ 85
7.8 Histogram Sort........................................................................................................................................ 85
7.9 Odd-even Transposition Sort on Linear Array........................................................................................ 86
Summary...................................................................................................................................................... 87
References.................................................................................................................................................... 87
Recommended Reading.............................................................................................................................. 87
Self Assessment............................................................................................................................................ 87

Chapter VIII................................................................................................................................................ 90
Message-Passing Programming ................................................................................................................ 90
Aim............................................................................................................................................................... 90
Objectives..................................................................................................................................................... 90
Learning outcome......................................................................................................................................... 90
8.1 Principles of Message-Passing Programming......................................................................................... 91
8.2 The Building Blocks: Send and Receive Operations.............................................................................. 91
8.3 Non-Buffered Blocking Message Passing Operations............................................................................ 91
8.4 Buffered Blocking Message Passing Operations.................................................................................... 92
8.5 Non-Blocking Message Passing Operations........................................................................................... 93
8.6 Message Passing Interface (MPI)........................................................................................................... 94
8.7 Starting and Terminating the MPI Library.............................................................................................. 94
8.8 Communicators....................................................................................................................................... 94
8.9 Querying Information............................................................................................................................. 94
8.10 Sending and Receiving Messages......................................................................................................... 95
8.11 Avoiding Deadlocks.............................................................................................................................. 95
8.12 Sending and Receiving Messages Simultaneously............................................................................... 96
8.13 Creating and Using Cartesian Topologies............................................................................................. 97
8.14 Overlapping Communication with Computation.................................................................................. 97
8.15 Collective Communication and Computation Operations.................................................................... 98
8.16 Collective Communication Operations................................................................................................. 98
Summary.................................................................................................................................................... 100
References.................................................................................................................................................. 100
Recommended Reading............................................................................................................................ 100
Self Assessment.......................................................................................................................................... 101

List of Figures
Fig. 1.1 Serial computing................................................................................................................................ 3
Fig. 1.2 Parallel computing............................................................................................................................. 4
Fig. 2.1 Efficiency and the sequential fraction............................................................................................. 13
Fig. 2.2 Energy diagram showing loss of energy.......................................................................................... 14
Fig. 2.3 Speedup as a function of the number of processors........................................................................ 15
Fig. 2.4 Computing power doubles every 18 months, for the same price.................................................... 16
Fig. 3.1 Structure of IAS computer . ............................................................................................................ 23
Fig. 3.2 IBM 7094 ....................................................................................................................................... 24
Fig. 3.3 Relationship between Wafer, Chip, and Gate.................................................................................. 25
Fig. 3.4 CPU structure of IBM S/360-370 series......................................................................................... 25
Fig. 3.5 Intel 8080 . ...................................................................................................................................... 27
Fig. 3.6 Motorola 68000 .............................................................................................................................. 27
Fig. 3.7 Intel386 CPU................................................................................................................................... 28
Fig. 3.8 Alpha 21264..................................................................................................................................... 28
Fig. 4.1 Summation of two numbers............................................................................................................. 33
Fig. 4.2 Summation of two numbers in a pipeline........................................................................................ 34
Fig. 4.3 Structure of a shared memory system.............................................................................................. 36
Fig. 4.4 Shared memory system with a bus connection................................................................................ 36
Fig. 4.5 Shared memory system with crossbar switch.................................................................................. 36
Fig. 4.6 UMA and NUMA............................................................................................................................ 37
Fig. 4.7 Distributed memory......................................................................................................................... 38
Fig. 4.8 Structure of a distributed memory system....................................................................................... 38
Fig. 4.9 Structure of a ccNUMA system....................................................................................................... 39
Fig. 4.10 Structure of a cluster of SMP nodes.............................................................................................. 40
Fig. 5.1 Cluster computer architecture.......................................................................................................... 48
Fig. 5.2 Porting strategies for parallel applications...................................................................................... 50
Fig. 5.3 Detecting parallelism....................................................................................................................... 52
Fig. 5.4 A static master/slave structure......................................................................................................... 58
Fig. 5.5 Basic structure of a SPMD program................................................................................................ 59
Fig. 5.6 Data pipeline structure..................................................................................................................... 59
Fig. 5.7 Divide and conquer as a virtual tree................................................................................................ 60
Fig. 6.1 Classification of interconnection networks (a) a static network (b) a dynamic network................ 67
Fig. 6.2 A completely-connected network of eight nodes............................................................................. 68
Fig. 6.3 Two representations of the star topology......................................................................................... 69
Fig. 6.4 Linear arrays (a) with no wraparound links (b) with wraparound link........................................... 69
Fig. 6.5 Meshes............................................................................................................................................. 70
Fig. 6.6 Ring . ............................................................................................................................................ 70
Fig. 6.7 Binary tree....................................................................................................................................... 71
Fig. 6.8 A fat tree network of 16 processing nodes....................................................................................... 71
Fig. 6.9 Hypercube........................................................................................................................................ 72
Fig. 6.10 Construction of hypercubes from hypercubes of lower dimension............................................... 72
Fig. 6.11 Bus based interconnects................................................................................................................. 73
Fig. 6.12 A completely non-blocking crossbar network connecting ‘p’ processors to ‘b’ memory banks... 74
Fig. 6.13 The schematic of a typical multistage interconnection network................................................... 74
Fig. 6.14 A perfect shuffle interconnection for eight inputs and outputs...................................................... 75
Fig. 6.15 A complete omega network connecting eight inputs and eight outputs........................................ 75
Fig. 6.16 An example of blocking in omega network . ................................................................................ 76
Fig. 7.1 Splitter-based parallel sorting.......................................................................................................... 82
Fig. 7.2 Splitter on key density function....................................................................................................... 82
Fig. 7.3 Basic histogram sort........................................................................................................................ 83
Fig. 7.4 Sample Sort..................................................................................................................................... 84

Fig. 7.5 Odd-even transposition sort on linear array.................................................................................... 86
Fig. 8.1 Message Passing Model.................................................................................................................. 91
Fig. 8.2 Non-Buffered blocking message passing operations....................................................................... 92
Fig. 8.3 Buffered blocking message passing operations............................................................................... 92
Fig. 8.4 Non-blocking message passing operations...................................................................................... 93
Fig. 8.5 An example use of the MPI_MINLOC and MPI_MAXLOC operators......................................... 98

List of Tables
Table 2.1 Speed-up and the number of processors....................................................................................... 15
Table 3.1 Computer generations................................................................................................................... 28
Table 5.1 Code Granularity and Parallelism................................................................................................. 51
Table 8.1 The minimal set of MPI routines.................................................................................................. 94
Table 8.2 MPI datatypes............................................................................................................................... 99

Abbreviations
ALU - Arithmetic/Logical Unit
CCC - Cube-Connected Cycles
ccNUMA - Cache Coherent Non-Uniform Memory Access
CODINE - Computing in Distributed Networked Environments
CPU - Central Processing Unit
DM - Distributed Memory
DSM - Distributed Shared Memory
ENIAC - Electronic Numerical Integrator and Computer
FPU - Floating Point Unit
GCA - Grand Challenge Applications
HPF - High Performance Fortran
I/O - Input/Output
LAN - Local Area Network
LSF - Load Sharing Facility
MIMD - Multiple Instruction, Multiple Data Stream
MIPS - Microprocessor without Interlocked Pipeline Stages
MISD - Multiple Instruction, Single Data Stream
MITS - Micro Instrumentation Telemetry Systems
MPI - Message Passing Interface
NIC - Network Interface Cards
NOW - Network of Workstations
NUMA - Non-Uniform Memory Access
PC - Personal Computers
PE - Processing Element
PET - Personal Electronic Transactor
PUL - Parallel Utilities
PVM - Parallel Virtual Machine
RAM - Random-Access Memory
RIPS - Reduced Instruction Set Computer
RISC - Reduced Instruction Set Computing
SFC - Sequential Fraction of Computing
SIMD - Single Instruction, Multiple Data Stream
SISD - Single Instruction - Single Data
SM - Shared Memory
SMP - Symmetric Multiprocessor
SOP - Skeleton Oriented Programming
SSI - Single System Image
ULSI - Ultra Large Scale Integration
UMA - Uniform Memory Access
UNIVAC - Universal Automatic Computer
VLSI - Very-Large-Scale Integration
VSM - Virtual Shared Memory

Chapter I
Introduction to Parallel Computing

Aim
The aim of this chapter is to:

• define parallel computing 

• identify the need for parallel computing 

• elucidate bit level parallelism

Objectives
The objectives of this chapter are to:

• explain data parallelism

• enumerate the limits to serial computing

• examine the applications of parallel computing

Learning outcome
At the end of this chapter, you will be able to:

• classify types of parallelism

• understand instruction level parallelism

• compare serial and parallel computing


1.1 Introduction
Parallel computing means to divide a job into several tasks and use more than one processor simultaneously to
perform these tasks. Assume you have developed a new estimation method for the parameters of a complicated
statistical model. To verify that the method behaves correctly for realistic numbers of data values and for different
parameter values, you must run many simulations, generating simulated data, say, 100,000 times for each data
length and parameter value. The total simulation work requires a huge number of random number generations and
takes a long time on your PC. If you use 100 PCs in your institute to run these simulations simultaneously, you can
expect the total execution time to be roughly 1/100 of the original. This is the simple idea behind parallel computing.

Computer scientists noticed the importance of parallel computing many years ago. The recent development of
computer hardware has been very rapid. For roughly 40 years from 1961, the so-called 'Moore's law' has held: "the
number of transistors per silicon chip has doubled approximately every 18 months." This means that the capacity of
memory chips and processor speeds have also increased roughly exponentially. In addition, hard disk capacity has
increased dramatically. Consequently, modern personal computers are more powerful than ‘super computers’ were
a decade ago. Even such powerful personal computers are not sufficient for our requirements. In statistical analysis,
while computers are becoming more powerful, data volumes are becoming larger and statistical techniques are
becoming more computer intensive. We are continuously forced to realise more powerful computing environments
for statistical analysis.

Parallel computing is thought to be the most promising technique. However, parallel computing has not been popular
among statisticians until recently. One reason is that parallel computing was available only on very expensive
computers, which were installed at some computer centers in universities or research institutes. Few statisticians
could use these systems easily. Further, software for parallel computing was not well prepared for general use.

Recently, cheap and powerful personal computers have changed this situation. The Beowulf project realised a powerful
computer system by using many PCs connected by a network. This project was a milestone in parallel computer
development. Freely available software products for parallel computing have become more mature. Thus, parallel
computing has now become easy for statisticians to access.

The simultaneous use of more than one processor or computer to solve a problem is called parallel computing.
Traditionally, software has been written for serial computation:

• To be run on a single computer having a single Central Processing Unit (CPU)


• A problem is broken into a discrete series of instructions
• Instructions are executed one after another
• Only one instruction may execute at any moment in time

For example:

(Diagram: a problem is broken into a stream of instructions that a single CPU executes one at a time over time steps
t1 … tN; the example shows a do_payroll() routine processing each employee's hours, rate, tax, deductions and
check in sequence.)

Fig. 1.1 Serial computing


(Source: https://computing.llnl.gov/tutorials/parallel_comp/#Whatis)

Parallel computing is the simultaneous use of multiple compute resources to solve a computational problem:
• To be run using multiple CPUs
• A problem is broken into discrete parts that can be solved concurrently
• Each part is further broken down to a series of instructions
• Instructions from each part execute simultaneously on different CPUs


(Diagram: a problem is broken into parts, and the instructions of each part execute simultaneously on a separate
CPU over time steps t1 … tN; the example shows do_payroll() running for emp1 … empN on different CPUs at
the same time.)

Fig. 1.2 Parallel computing


(Source: https://computing.llnl.gov/tutorials/parallel_comp/#Whatis)

The computer resources might be:


• A single computer with multiple processors
• An arbitrary number of computers connected by a network
• A combination of both

The computational problem should be able to:


• Be broken apart into discrete pieces of work that can be solved simultaneously
• Execute multiple program instructions at any moment in time
• Be solved in less time with multiple compute resources than with a single compute resource

Parallel computing is an evolution of serial computing that attempts to emulate what has always been the state of
affairs in the natural world: many complex, interrelated events happening at the same time, yet within a sequence.

1.2 Types of Parallel Computing


There are several types of parallel computing used worldwide. These are:

Bit level parallelism
It is a form of parallelism based on increasing the processor word size. It reduces the number of instructions
that the system must run in order to perform a task on variables that are larger than the word. From the advent of Very-
Large-Scale Integration (VLSI) computer-chip fabrication technology in the 1970s until about 1986, speed-up in
computer architecture was driven by doubling computer word size—the amount of information the processor can
manipulate per cycle. Increasing the word size reduces the number of instructions the processor must execute to
perform an operation on variables whose sizes are greater than the length of the word.

For example, where an 8-bit processor must add two 16-bit integers, the processor must first add the 8 lower-order
bits from each integer using the standard addition instruction, then add the 8 higher-order bits using an add-with-
carry instruction and the carry bit from the lower order addition; thus, an 8-bit processor requires two instructions
to complete a single operation, where a 16-bit processor would be able to complete the operation with a single
instruction.
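
To make this concrete, here is a small illustrative C sketch (added for this edition; it is not part of the original
example) that emulates the two-step, 8-bit addition described above:

#include <stdint.h>
#include <stdio.h>

/* Illustration only: emulate a 16-bit addition the way an 8-bit CPU would,
 * using an 8-bit add on the low bytes followed by an add-with-carry on the
 * high bytes. A 16-bit processor would do this with a single instruction. */
uint16_t add16_on_8bit(uint16_t a, uint16_t b)
{
    uint8_t a_lo = a & 0xFF, a_hi = a >> 8;
    uint8_t b_lo = b & 0xFF, b_hi = b >> 8;

    uint8_t sum_lo = a_lo + b_lo;               /* standard 8-bit addition          */
    uint8_t carry  = (sum_lo < a_lo) ? 1 : 0;   /* carry out of the low-byte add    */
    uint8_t sum_hi = a_hi + b_hi + carry;       /* add-with-carry on the high bytes */

    return ((uint16_t)sum_hi << 8) | sum_lo;
}

int main(void)
{
    printf("%u\n", (unsigned)add16_on_8bit(300, 500));   /* prints 800 */
    return 0;
}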

Historically, 4-bit microprocessors were replaced with 8-bit, then 16-bit, then 32-bit microprocessors. This trend
generally came to an end with the introduction of 32-bit processors, which have been a standard in general-purpose
computing for two decades. Not until recently (c. 2003–2004), with the advent of x86-64 architectures, have 64-bit
processors become commonplace.

Instruction level parallelism


It is a form of parallel computing that exploits the ability of a processor to carry out several operations from a
single program at the same time. A computer program is, in essence, a stream of instructions executed by a processor. These
instructions can be re-ordered and combined into groups which are then executed in parallel without changing the
result of the program. This is known as instruction-level parallelism. Advances in instruction-level parallelism
dominated computer architecture from the mid-1980s until the mid-1990s.

Modern processors have multi-stage instruction pipelines. Each stage in the pipeline corresponds to a different
action the processor performs on that instruction in that stage; a processor with an N-stage pipeline can have up
to N different instructions at different stages of completion. The canonical example of a pipelined processor is a
RISC processor, with five stages: instruction fetch, decode, execute, memory access, and write back. The Pentium
4 processor had a 35-stage pipeline.

For example, a five-stage pipelined superscalar processor capable of issuing two instructions per cycle can have two
instructions in each stage of the pipeline, for a total of up to 10 instructions being executed simultaneously.

In addition to instruction-level parallelism from pipelining, some processors can issue more than one instruction
at a time. These are known as superscalar processors. Instructions can be grouped together only if there is no data
dependency between them. Scoreboarding and the Tomasulo algorithm (which is similar to Scoreboarding but
makes use of register renaming) are two of the most common techniques for implementing out-of-order execution
and instruction-level parallelism.
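
As a brief illustration (added for this edition, not taken from the original text), the C fragment below shows why
data dependencies limit instruction grouping: a superscalar or out-of-order processor may issue the first three
statements together because they are independent, while the last statement must wait for all of them.

/* Illustration only: independent versus dependent instructions. */
void ilp_example(int x, int y, int *out)
{
    int a = x + y;      /* independent of b and c                      */
    int b = x * 4;      /* independent of a and c                      */
    int c = y - 7;      /* independent of a and b                      */
    *out = a + b + c;   /* data-dependent: must follow the three above */
}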

Task parallelism
Task parallelism is a form of parallelisation in which different processors execute entirely different code (different
tasks or functions), on the same or different data. It is also called function parallelism. Task parallelism is the characteristic of a parallel program that
“entirely different calculations can be performed on either the same or different sets of data”. This contrasts with
data parallelism, where the same calculation is performed on the same or different sets of data. Task parallelism
does not usually scale with the size of a problem.
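
A minimal sketch of task parallelism using POSIX threads is shown below (added for this edition; the two tasks,
compute_sum and count_down, are invented examples). Each thread performs an entirely different calculation, and
on a multiprocessor the two tasks can run on different processors at the same time (compile with -pthread).

#include <pthread.h>
#include <stdio.h>

/* Task 1: sum the integers 1..1000000. */
static void *compute_sum(void *arg)
{
    (void)arg;
    long total = 0;
    for (long i = 1; i <= 1000000; i++)
        total += i;
    printf("sum task: %ld\n", total);
    return NULL;
}

/* Task 2: an unrelated piece of work. */
static void *count_down(void *arg)
{
    (void)arg;
    for (int i = 5; i > 0; i--)
        printf("countdown task: %d\n", i);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    /* Entirely different code runs concurrently: this is task parallelism. */
    pthread_create(&t1, NULL, compute_sum, NULL);
    pthread_create(&t2, NULL, count_down, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}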

Data parallelism
Data parallelism is parallelism inherent in program loops, which focuses on distributing the data across different
computing nodes to be processed in parallel. Parallelising loops often leads to similar (not necessarily identical)
operation sequences or functions being performed on elements of a large data structure. Many scientific and
engineering applications exhibit data parallelism.
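
The loop below is a minimal data-parallel sketch (added for this edition; it assumes an OpenMP-capable compiler,
for example gcc with the -fopenmp flag, and OpenMP is not otherwise covered in this book). Every thread applies
the same element-wise addition, but to a different slice of the arrays.

#include <stdio.h>

#define N 1000000

static double a[N], b[N], c[N];

int main(void)
{
    for (int i = 0; i < N; i++) {
        a[i] = i;
        b[i] = 2.0 * i;
    }

    /* Data parallelism: the same operation is applied to different
     * parts of the data by different threads.                       */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[N-1] = %f\n", c[N - 1]);
    return 0;
}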


1.3 Need for Parallel Computing

The main reasons include:


Save time and/or money
In theory, throwing more resources at a task will shorten its time to completion, with potential cost savings. Parallel
computers can be built from cheap, commodity components.

Solve larger problems


Many problems are so large and/or complex that it is impractical or impossible to solve them on a single computer,
especially given limited computer memory. For example:
• ‘Grand Challenge’ problems requiring PetaFLOPS and PetaBytes of computing resources.
• Web search engines/databases processing millions of transactions per second

Provide concurrency
A single compute resource can only do one thing at a time. Multiple computing resources can be doing many things
simultaneously. For example, the Access Grid provides a global collaboration network where people from around
the world can meet and conduct work ‘virtually’.

Use of non-local resources
Compute resources on a wide area network, or even the Internet, can be used when local compute resources are scarce.
For example:
• SETI@home uses 2.9 million computers in 253 countries. (July 2011)
• Folding@home uses over 450,000 CPUs globally (July 2011)

Limits to serial computing


Both physical and practical reasons pose significant constraints to simply building ever faster serial computers:
• Transmission speeds: The speed of a serial computer is directly dependent upon how fast data can move through
hardware. Absolute limits are the speed of light (30 cm/nanosecond) and the transmission limit of copper wire
(9 cm/nanosecond). Increasing speeds necessitate increasing proximity of processing elements.
• Limits to miniaturisation: Processor technology is allowing an increasing number of transistors to be placed
on a chip. However, even with molecular or atomic-level components, a limit will be reached on how small
components can be.
• Economic limitations: It is increasingly expensive to make a single processor faster. Using a larger number of
moderately fast commodity processors to achieve the same (or better) performance is less expensive.
• Hardware level parallelism: Current computer architectures are increasingly relying upon hardware level
parallelism to improve performance (multiple execution units, pipelined instructions and multi-core).

1.4 Applications of Parallel Computing


Historically, parallel computing has been considered to be ‘the high end of computing’, and has been used to model
difficult problems in many areas of science and engineering:
• Atmosphere, earth, environment
• Physics - applied, nuclear, particle, condensed matter, high pressure, fusion, photonics
• Bioscience, biotechnology, genetics
• Chemistry, molecular sciences
• Geology, seismology
• Mechanical engineering from prosthetics to spacecraft
• Electrical engineering, circuit design, microelectronics
• Computer science, mathematics

Today, commercial applications provide an equal or greater driving force in the development of faster computers.
These applications require the processing of large amounts of data in sophisticated ways. For example:
• Databases, data mining
• Oil exploration
• Web search engines, web based business services
• Medical imaging and diagnosis
• Pharmaceutical design
• Management of national and multi-national corporations
• Financial and economic modeling
• Advanced graphics and virtual reality, particularly in the entertainment industry
• Networked video and multi-media technologies
• Collaborative work environments


Summary
• Parallel computing means to divide a job into several tasks and use more than one processor simultaneously to
perform these tasks.
• The simultaneous use of more than one processor or computer to solve a problem is called parallel
computing.
• Bit level parallelism is a form of parallelism based on increasing the processor word size. It reduces the number
of instructions that the system must run in order to perform a task on variables that are larger than the word.
• A computer program is, in essence, a stream of instructions executed by a processor.
• Task parallelism, also called function parallelism, is a form of parallelisation in which different processors
execute entirely different code, on the same or different sets of data.
• Data parallelism is parallelism inherent in program loops, which focuses on distributing the data across different
computing nodes to be processed in parallel.
• Parallel computers can be built from cheap, commodity components.
• A single compute resource can only do one thing at a time.
• Multiple computing resources can be doing many things simultaneously.
• Parallelising loops often leads to similar (not necessarily identical) operation sequences or functions being
performed on elements of a large data structure.

References
Grama, A., 2010. An Introduction to Parallel Computing: Design and Analysis of Algorithms, 2/e, Pearson Education
India.
• Bhujade, M. R., 1995. Parallel Computing, New Age International.
• Barney, B., Introduction to Parallel Computing [Online] Available at: <https://computing.llnl.gov/tutorials/
parallel_comp/> [Accessed 21 June 2012].
• Parallel Computing [Online] Available at: <http://www.cs.ucf.edu/courses/cot4810/fall04/presentations/
Parallel_Computing.ppt> [Accessed 21 June 2012].
• 2011. Parallel Vs. Serial [Video Online] Available at: <http://www.youtube.com/watch?v=Jeo83akN44o>
[Accessed 21 June 2012].
• 2012. 4 Exploiting Instruction Level Parallelism [Video Online] Available at: <http://www.youtube.com/
watch?v=54E9LGG1hnQ> [Accessed 21 June 2012].

Recommended Reading
• Lafferty, E. L., 1993. Parallel Computing: An Introduction, William Andrew.
• Pacheco, P., 2011. An Introduction to Parallel Programming, Elsevier.
• Foster, I., 1995. Designing and Building Parallel Programs: Concepts and Tools for Parallel Software
Engineering, Addison-Wesley.

Self Assessment
1. In parallel computing, the computer resources can be ________.
a. a single computer
b. a single computer with multiple processors
c. an arbitrary number of computers
d. a single computer with single processor

2. Which form of parallelism is based on increasing the processor word size?


a. Bit level parallelism
b. Instruction level parallelism
c. Task parallelism
d. Data parallelism

3. Which form of parallelism is inherent in program loops and focuses on distributing the data across different
computing nodes to be processed in parallel?
a. Bit level parallelism
b. Instruction level parallelism
c. Task parallelism
d. Data parallelism

4. In which form of parallelism do different processors execute entirely different code?
a. Bit level parallelism
b. Instruction level parallelism
c. Task parallelism
d. Data parallelism

5. Which form of parallelism is also called function parallelism?


a. Bit level parallelism
b. Instruction level parallelism
c. Task parallelism
d. Data parallelism

6. Which form of parallelism exploits the ability of a processor to carry out several operations from a single
program at the same time?
a. Bit level parallelism
b. Instruction level parallelism
c. Task parallelism
d. Data parallelism

7. _________ reduces the number of instructions that the system must run in order to perform a task on variables
that are larger than the word.
a. Bit level parallelism
b. Instruction level parallelism
c. Task parallelism
d. Data parallelism


8. The Pentium 4 processor had a ______pipeline.


a. 15-stage
b. 5-stage
c. 35-stage
d. 85-stage

9. __________ is the characteristic of a parallel program that "entirely different calculations can be performed on
either the same or different sets of data”.
a. Bit level parallelism
b. Instruction level parallelism
c. Task parallelism
d. Data parallelism

10. Which parallelism does not usually scale with the size of a problem?
a. Bit level parallelism
b. Instruction level parallelism
c. Task parallelism
d. Data parallelism

Chapter II
Laws of Parallel Computing

Aim
The aim of this chapter is to:

• define sequential fraction of computing

• demonstrate the relationship between efficiency and the sequential fraction

• elucidate Amdahl’s law

Objectives
The objectives of this chapter are to:

• explain Minsky’s conjecture

• describe Moore’s Law

• examine doubling of computing power

Learning outcome
At the end of this chapter, you will be able to:

• plot speedup as a function of the number of processors

• understand allocation of resources

• describe parallel processors


2.1 Amdahl’s Law


Amdahl’s law is based on a very simple observation. A program requiring total time T for sequential execution
has some part, called the Sequential Fraction of Computing (SFC), which is inherently sequential (it cannot be
made to run in parallel). In terms of the total time taken to solve a problem, this fraction is an important parameter
of a program.

The Amdahl’s law states that the speed up (S) of a parallel computer is limited by:
S≤

where,
f = sequential fraction for a given program
n = no. of processors

Proof of the above statement is quite simple. Assuming that total time is T, then the sequential component of this time
will be f.T. The parallelisable fraction of time is therefore (1 – f).T. The time (1 – f).T can be reduced by employing
n processors to operate in parallel to give the time as (1 – f).T/n. The total time taken by the parallel computer thus,
is at least f.T + (1 – f).T/n, while the sequential processor takes time T. The speedup S is thus limited by:
S ≤ T / (f.T + (1 – f).T/n), i.e., S ≤ 1 / (f + (1 – f)/n)
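As a quick check of this bound, the following small C program (a minimal sketch, not taken from the text; the sequential fraction f = 0.05 is an assumed example value) tabulates the speedup limit for a few processor counts:

#include <stdio.h>

/* Amdahl bound: S <= 1 / (f + (1 - f)/n) */
static double amdahl_bound(double f, int n)
{
    return 1.0 / (f + (1.0 - f) / n);
}

int main(void)
{
    double f = 0.05;                      /* assumed sequential fraction */
    int n_values[] = {2, 8, 64, 1024};
    for (int i = 0; i < 4; i++)
        printf("n = %4d   S <= %6.2f\n", n_values[i], amdahl_bound(f, n_values[i]));
    return 0;
}

Even with 1024 processors the speedup stays below 1/f = 20, which is exactly the point of the law.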
This result throws some light on the way parallel computers should be built. A computer architect can use the following
two basic approaches while designing a parallel computer:
• Connect a small number of extremely powerful processors (few elephants approach).
• Connect a very large number of inexpensive processors (million ants approach).

Consider two parallel computers Me and Ma. The computer Me is built using the approach of few elephants (very
powerful processors) such that each processor is capable of executing computations at a speed of M Megaflops.
The computer Ma on the other hand is built using ants approach (off-the-shelf inexpensive processors) and each
processor of Ma executes r.M Megaflops, where 0 < r < 1.

Theorem 1
If the machine Me attempts a computation whose sequential fraction f is greater than r, then Ma will execute the
computations more slowly compared to a single processor of the computer Me.

Proof:
Let
W = Total work (computing job)
M = speed (in Mflop) of a PE of machine Me, then
r.M = speed of a PE of Ma (r is the fraction)
f.W = sequential work of the job
T(Ma) = time taken by Ma for the work W
T(Me) = time taken by Me for the work W
Time taken by any computer is:
T = W / (speed of the computer)

T(Ma) = Time for sequential part + Time for parallel part

= f.W/(r.M) + (1 – f).W/(n.r.M)

T(Ma) = f.W/(r.M), if n is infinitely large ..........(1)

T(Me) = W/M (assuming only one PE of Me) ..........(2)

From equations (1) and (2), it is clear that if f > r then T(Ma) > T(Me). The above theorem is quite interesting. It gives guidelines for building parallel processors using expensive or inexpensive technology. The theorem implies that a sequential fraction acceptable for the machine Me may not be acceptable for the machine Ma. It does no good to have the computing power of a very large number of processors go to waste; processors must maintain some level of efficiency. Let us see how the efficiency E and the sequential fraction f are related.
The efficiency E = S/n and hence,
S = E.n ≤ 1 / (f + (1 – f)/n)

E ≤ 1 / (n.f + (1 – f))

This result states that for a constant efficiency, the fraction of sequential component of an algorithm must be inversely
proportional to the number of processors. Figure below shows the graph of sequential component and efficiency
with ‘n’ as a parameter.

[Figure: efficiency E plotted against the sequential fraction f, with curves for n = 2 up to n = 64]
Fig. 2.1 Efficiency and the sequential fraction
(Source: http://www.newagepublishers.com/samplechapter/000573.pdf)

The idea of using a very large number of processors may thus be good only for specific applications whose algorithms are known to have a very small sequential fraction f.
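The relation E ≤ 1/(n.f + (1 – f)) behind the figure can also be tabulated directly. The sketch below (illustrative values only, not from the text) prints the efficiency bound for n = 2 and n = 64 over a range of sequential fractions, reproducing the trend shown in the graph:

#include <stdio.h>

/* Efficiency bound that follows from Amdahl's law: E <= 1 / (n.f + (1 - f)) */
static double efficiency_bound(double f, int n)
{
    return 1.0 / (n * f + (1.0 - f));
}

int main(void)
{
    int ns[] = {2, 64};
    for (int i = 0; i < 2; i++) {
        printf("n = %d\n", ns[i]);
        for (int k = 0; k <= 10; k += 2) {
            double f = k / 10.0;
            printf("  f = %.1f   E <= %.2f\n", f, efficiency_bound(f, ns[i]));
        }
    }
    return 0;
}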

2.2 Minsky’s Conjecture [Minsky 1970]


Minsky’s conjecture states that due to the need for parallel processors to communicate with each other, speedup
increases as the logarithm of the number of processing elements. For a parallel computer with n processors, the
speedup S shall be proportional to log2n. This conjecture was motivated by the overall performance that parallel computers deliver on many problems. The proof of why parallel computers behave this way was first given by Flynn [1972]; it is limited to SIMD parallel computers. The essence of Flynn's proof is given next using an energy diagram for the SIMD computer. Flynn's basic argument was based on the execution of nested IF statements by a SIMD parallel computer. Consider the parallel if statement:
If C1 then E1
else E2.


Here C1 is a vector of Booleans and E1 and E2 are two statements. Every PE of a SIMD machine has one element of the condition vector C1. If this element is true, the PE executes the corresponding then part (statement E1). PEs holding a false element of C1 shall execute the else part (statement E2). For a SIMD computer, the execution of the above has to be done sequentially. The first set of PEs (those with true elements of C1) executes the then part; the other PEs are masked off from doing the work and execute a NOP. After the first set completes its execution, the second set (PEs holding false elements of C1) executes the else part while the first set of PEs is idle. If there is another IF nested in the E1/E2 statements, one more division of the active PEs takes place, and this nesting may go on repeatedly. Figure below shows the energy diagram for the PEs. The diagram assumes that the true and false values in the condition vector are split in half, so the successive time slots show the number of working PEs being reduced by a factor of two.

[Figure: computing power (number of active processing elements, from N down to 1) plotted against successive time slots]

Fig. 2.2 Energy diagram showing loss of energy


(Source: http://www.newagepublishers.com/samplechapter/000573.pdf)
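The masking that causes this loss of computing power can be sketched in a few lines of C (a simulation of the idea only, not real SIMD code; the PE count and the operations standing in for E1 and E2 are invented for illustration):

#include <stdio.h>
#include <stdbool.h>

#define N 8   /* number of processing elements (an assumed, small example) */

int main(void)
{
    bool c1[N] = {1, 0, 1, 1, 0, 0, 1, 0};   /* condition vector C1        */
    int  x[N]  = {1, 2, 3, 4, 5, 6, 7, 8};   /* one data element per PE    */

    /* Pass 1: only PEs whose condition is true execute E1 (here: x *= 2). */
    for (int pe = 0; pe < N; pe++)
        if (c1[pe]) x[pe] *= 2;              /* the rest are masked off (NOP) */

    /* Pass 2: only PEs whose condition is false execute E2 (here: x += 10). */
    for (int pe = 0; pe < N; pe++)
        if (!c1[pe]) x[pe] += 10;            /* the first set is now idle     */

    for (int pe = 0; pe < N; pe++)
        printf("PE %d: %d\n", pe, x[pe]);
    return 0;
}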

The Minsky’s conjecture was very bad for the proponents of the large scale parallel architectures. Flynn and Hennessy
gave yet another formulation that is presented in the following theorem.

Theorem 2
Speedup of an n-processor parallel system is limited as:
S ≤ n/log2n

Proof:
Let T1 = time taken by a single processor system,

fk = probability of k PEs working simultaneously (fk can be thought of as program fraction with a degree of parallelism
k)

Tn = Time taken by the parallel machine with n processors.

Tn = [f1/1 + f2/2 + f3/3 + .. + fn/n ] .T1

The speedup S < T1/Tn = 1 /( f1/1 + f2/2 + f3/3 + ... + fn/n ).


Assume (for simplicity) fk = 1/n for all k = 1, 2, ..., n (i.e., every degree of parallelism is equally likely). Then

T1/Tn = 1 / ((1/n)(1/1 + 1/2 + 1/3 + ... + 1/n)) = n / (1 + 1/2 + 1/3 + ... + 1/n)

Hence, approximating the harmonic sum 1 + 1/2 + ... + 1/n by log2n, S ≤ n/log2n.
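The derivation can be checked numerically. The sketch below (illustrative only) compares the exact expression, n divided by the harmonic sum, with the n/log2n form used in the theorem for a few values of n:

#include <stdio.h>
#include <math.h>

int main(void)
{
    int ns[] = {10, 100, 1000, 10000};
    for (int i = 0; i < 4; i++) {
        int n = ns[i];
        double harmonic = 0.0;
        for (int k = 1; k <= n; k++)
            harmonic += 1.0 / k;             /* 1 + 1/2 + ... + 1/n */
        printf("n = %5d   n/harmonic sum = %8.1f   n/log2(n) = %8.1f\n",
               n, n / harmonic, n / log2((double)n));
    }
    return 0;
}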

A plot of speedup as a function of n for the bounds discussed above, together with the typically expected (linear) speedup, is presented in the figure below. Note that the n/log2n curve looks almost linear for large values of n, because n/log n grows nearly linearly once n > 1000.

[Figure: speedup plotted against the number of processors n on a log-log scale, showing the curves n, n/log2n and log2n]

Fig. 2.3 Speedup as a function of the number of processors


(Source: http://www.newagepublishers.com/samplechapter/000573.pdf)

Lee [1980] presented an empirical table for speedup for different values of n. His observations are listed in table
below.

Value of n (Number of processors)          Value of S (Speedup)
1                                          O(n)
Few 10's up to 100                         O(n log n)
Around 1000                                O(n log n) < S < O(n/log n)
10000 and above                            O(n)

Table 2.1 Speed-up and the number of processors

2.3 Moore’s Law


Moore’s law is commonly reported as a doubling of transistor density every 18 months. It was given by the co-founder
of Intel, Gordon Moore. It is a nice blending of his two predictions; in 1965, he predicted an annual doubling of
transistor counts in the most cost effective chip and revised it in 1975 to every 24 months. With a little hand waving,
most reports attribute 18 months to Moore’s Law, but there is quite a bit of variability.

The popular perception of Moore's Law is that computer chips are compounding in their complexity at near constant per unit cost. This abstraction of Moore's Law relates to the compounding of transistor density in two dimensions. Other abstractions relate to speed (the signals have less distance to travel) and computational power (speed × density).
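A minimal sketch of the 18-month-doubling reading follows; the 1971 starting point of roughly 2,300 transistors is taken from the Intel 4004 mentioned later in this book, and the projection itself is only illustrative:

#include <stdio.h>
#include <math.h>

int main(void)
{
    double start = 2300.0;   /* transistor count of the Intel 4004, 1971 (assumed base) */
    for (int years = 0; years <= 30; years += 10) {
        double doublings = (years * 12.0) / 18.0;   /* one doubling every 18 months */
        printf("1971 + %2d years: ~%.0f transistors\n",
               years, start * pow(2.0, doublings));
    }
    return 0;
}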


Fig. 2.4 Computing power doubles every 18 months, for the same price
(Source: http://www.scs.ryerson.ca/mfiala/courses/cps310_win09/murdocca_Ch01CAO.pdf)

Summary
• Amdahl’s law is based on a very simple observation. A program requiring total time T for the sequential execution
shall have some part called sequential fraction of computing (SFC) which is inherently sequential (that cannot
be made to run in parallel).
• Minsky’s conjecture states that due to the need for parallel processors to communicate with each other, speedup
increases as the logarithm of the number of processing elements.
• For a parallel computer with n processors, the speedup S shall be proportional to log2n.
• The Minsky’s conjecture was very bad for the proponents of the large scale parallel architectures.
• Moore’s Law is commonly reported as a doubling of transistor density every 18 months.
• The popular perception of Moore’s Law is that computer chips are compounding in their complexity at near
constant per unit cost.

References
• Pacheco, P., 2011. An Introduction to Parallel Programming, Elsevier.
• Bhujade, M. R., 1995. Parallel Computing, New Age International.
• Thiébaut, D., Parallel Programming in C for the Transputer [Online] Available at: <http://maven.smith.
edu/~thiebaut/transputer/chapter8/chap8-2.html> [Accessed 21 June 2012].
• Parallel Computer Taxonomy - Conclusion [Online] Available at: <http://www.gigaflop.demon.co.uk/comp/
chapt8.htm> [Accessed 21 June 2012].
• 2012. Amdahl’s Law [Video Online] Available at: <http://www.youtube.com/watch?v=r7Ffc4WOLb8> [Accessed
21 June 2012].
• 2006. Moore’s Law [Video Online] Available at: <http://www.youtube.com/watch?v=XvaQcuLr2cE> [Accessed
21 June 2012].

Recommended Reading
• Padua, D., 2011. Encyclopedia of Parallel Computing, Volume 4, Springer.
• Null, L. & Lobur, J., 2010. The Essentials of Computer Organisation and Architecture, Jones & Bartlett
Publishers.
• Gebali, F., 2011. Algorithms and Parallel Computing, John Wiley & Sons.


Self Assessment
1. SFC refers to ___________________.
a. Sequential Fraction of Computing
b. Specification Form Computer
c. Software Flow Control
d. Static Frequency Converter

2. Time taken by any computer = Amount of work/ ___________.


a. Speed
b. Efficiency
c. Sequential fraction
d. Number of processors

3. For a constant _______, the fraction of sequential component of an algorithm must be inversely proportional
to the number of processors.
a. speed up
b. efficiency
c. speed
d. sequential fraction

4. _________states that due to the need for parallel processors to communicate with each other, speedup increases
as the logarithm of the number of processing elements.
a. Amdahl’s law
b. Moore’s law
c. Minsky’s conjecture
d. Flynn’s law

5. Moore’s law is commonly reported as a doubling of transistor density every ______ months.
a. 8
b. 18
c. 11
d. 7

6. The popular perception of Moore’s Law is that computer chips are compounding in their complexity at near
constant _________.
a. per unit cost
b. per unit time
c. per unit speed
d. per unit efficiency

7. Which of the following statements is false?


a. Minsky’s conjecture states that due to the need for parallel processors to communicate with each other,
speedup increases as the logarithm of the number of processing elements.
b. For a parallel computer with n processors, the speedup S shall be proportional to log2n.
c. The Minsky’s conjecture was very bad for the proponents of the large scale parallel architectures.
d. Moore’s Law is commonly reported as a doubling of transistor density every 10 months.

8. Which of the following statements is false?
a. Gordon Moore was the co-founder of Intel.
b. For a parallel computer with n processors, the speedup S shall be proportional to log2n2.
c. The proof why parallel computers behave this way was first given by Flynn [1972].
d. The idea of using very large number of processors may thus be good only for specific applications for which
it is known that the algorithms have a known very small sequential fraction f.

9. T(Ma) = Time for sequential part + Time for _______ part.


a. parallel
b. serial
c. efficiency
d. sequential fraction

10. For a parallel computer with n processors, the speedup S shall be proportional to _______.
a. log2n2
b. log2n3
c. log2n
d. log2n4


Chapter III
Evolution of Computer Architecture

Aim
The aim of this chapter is to:

• define computer architecture

• enlist the features of first generation

• elucidate two types of models for a computing machine

Objectives
The objectives of this chapter are to:

• enlist the features of third generation

• enumerate the features of second generation

• explain evolution of computer architecture

Learning outcome
At the end of this chapter, you will be able to:

• understand microelectronics

• discuss fourth generation technologies

• describe components of computer architecture

3.1 Introduction
The evolution of computers has been characterised by increasing processor speed, decreasing component size,
increasing memory size, and increasing I/O capacity and speed. One factor responsible for the great increase in
processor speed in the shrinking size of the microprocessor components; this reduces the distance between components
and hence increases speed. However, the real gains in speed in recent years have come from the organisation of the
processor, including heavy use of pipelining and parallel execution techniques and the use of speculative execution
techniques, which results in the tentative execution of future instructions that might be needed. All of these techniques
are designed to keep the processor busy as much of the time as possible.

A critical issue in computer system design is balancing the performance of the various elements, so that gains in
performance in one area are not dragged behind by a lag in other areas. In particular, processor speed has increased
more rapidly than memory access time. A variety of techniques are used to compensate for this mismatch, including
caches, wider data paths from memory to processor, and more intelligent memory chips.

The various definitions of computer architecture are given below:


• Baer: ‘The design of the integrated system, which provides a useful tool to the programmer’
• Hayes: ‘The study of the structure, behavior and design of computers’
• Hennessy and Patterson: ‘The interface between the hardware and the lowest level software’
• Foster: ‘The art of designing a machine that will be a pleasure to work with’

Thus, computer architecture is the science and art of selecting and interconnecting hardware components to create
computers that meet functional performance and cost goals. It refers to those attributes of the system that are visible
to a programmer and have a direct impact on the execution of a program. A computer architect coordinates many levels of abstraction and translates business and technology drivers into efficient systems for computing tasks.

Computer architecture concerns machine organisation, interfaces, application, technology, measurement and
simulation.  It includes:
• Instruction set
• Data formats
• Principle of operation (textual or formal description of every operation)
• Features (organisation of programmable storage, registers used, interrupts mechanism, and so on)

In short, it is the combination of instruction set architecture, machine organisation and the underlying hardware.
The different usages of the ‘computer architecture’ term:
• The design of a computer’s CPU architecture, instruction set, addressing modes.
• Description of the requirements (especially speeds and interconnection requirements) or design implementation for the various parts of a computer, such as memory, motherboard, electronic peripherals, or most commonly the CPU
• Architecture is often defined as the set of machine attributes that a programmer should understand in order to
successfully program the specific computer. So, in general, computer architecture refers to attributes of the
system visible to a programmer that have a direct impact on the execution of a program.

3.2 Brief History of Computer Architecture


The history of computer architecture is divided into various generations. Each of them is discussed below.


3.2.1 First Generation (1945-1958)


The features of first generation include:
• Vacuum tubes
• Machine code, assembly language
• Computers contained a central processor that was unique to that machine
• Different types of supported instructions, few machines could be considered ‘general purpose’
• Use of drum memory or magnetic core memory, programs and data are loaded using paper tape or punch
cards
• 2 Kb memory, 10 KIPS

Two types of models for a computing machine:


• Harvard architecture - physically separate storage and signal pathways for instructions and data. The term
originated from the Harvard Mark I, relay-based computer, which stored instructions on punched tape and data
in relay latches.
• Von Neumann architecture - a single storage structure to hold both the set of instructions and the data. Such
machines are also known as stored-program computers. Von Neumann bottleneck - the bandwidth, or the data
transfer rate, between the CPU and memory is very small in comparison with the amount of memory.

Modern high performance CPU chip designs incorporate aspects of both architectures. On chip cache memory is
divided into an instruction cache and a data cache. Harvard architecture is used as the CPU accesses the cache and
von Neumann architecture is used for off chip memory access.
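The stored-program (von Neumann) idea can be illustrated with a toy simulator (a minimal sketch with an invented three-word instruction format, not any real machine): the same memory array holds both the instructions and the data they operate on, and the CPU fetches both over the same path.

#include <stdio.h>

enum { HALT = 0, LOAD = 1, ADD = 2, STORE = 3 };   /* invented opcodes */

int main(void)
{
    /* A single storage structure holds both the program and its data. */
    int mem[16] = {
        LOAD,  12, 0,    /* acc = mem[12]   */
        ADD,   13, 0,    /* acc += mem[13]  */
        STORE, 14, 0,    /* mem[14] = acc   */
        HALT,   0, 0,
        3, 4, 0, 0       /* data area starts at mem[12] */
    };
    int pc = 0, acc = 0;

    for (;;) {                               /* fetch-decode-execute cycle */
        int op = mem[pc], addr = mem[pc + 1];
        pc += 3;
        if      (op == LOAD)  acc = mem[addr];
        else if (op == ADD)   acc += mem[addr];
        else if (op == STORE) mem[addr] = acc;
        else                  break;         /* HALT */
    }
    printf("mem[14] = %d\n", mem[14]);       /* prints 7 */
    return 0;
}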
• ENIAC (Electronic Numerical Integrator and Calculator), 1943-46, by J. Mauchly and J. Presper Eckert,
first general purpose electronic computer. The size of its numerical word was 10 decimal digits, and it could
perform 5000 additions and 357 multiplications per second. It was built to calculate trajectories for ballistic
shells during World War II, programmed by setting switches and plugging and unplugging cables. It used 18000
tubes, weighed 30 tons and consumed 160 kilowatts of electrical power.

• Whirlwind computer, 1949, by Jay Forrester, with 5000 vacuum tubes; its main innovation was magnetic core memory.

• UNIVAC (Universal Automatic Computer), 1951 - the first commercial computer, built by Eckert and Mauchly; it cost around $1 million and 46 machines were sold. UNIVAC had an add time of 120 microseconds, a multiply time of 1800 microseconds and a divide time of 3600 microseconds, and used magnetic tape as input.

• IBM’s 701, 1953, the first commercially successful general-purpose computer. The 701 had electrostatic storage
tube memory, used magnetic tape to store information, and had binary, fixed-point, single address hardware.

• IBM 650 - 1st mass-produced computer (450 machines sold in one year).

[Figure: block diagram of the IAS computer - an arithmetic-logic unit (AC, MQ and DR registers), a program control unit (IBR, PC, IR, AR and control circuits), main memory holding instructions and data, and input/output equipment]
Fig. 3.1 Structure of IAS computer


(Source: http://www.mgnet.org/~douglas/Classes/cs521/arch/ComputerArch2005.pdf)

3.2.2 Second Generation (1958-1964)


The features of second generation include:
• Transistors – small, low-power, low-cost, more reliable than vacuum tubes
• Magnetic core memory
• Two’s complement, floating point arithmetic
• Reduced the computational time from milliseconds to microseconds
• High level languages
• First operating systems: handled one program at a time
• 1959 - IBM's 7000 series mainframes were the company's first transistorised computers

The IBM 7090 was the most powerful data processing system of its time. The fully transistorised system had computing speeds six times faster than those of its vacuum-tube predecessor, the IBM 709. Although the IBM 7090 was a general purpose data processing system, it was designed with special attention to the needs of the design of missiles, jet engines, nuclear reactors and supersonic aircraft. It contained more than 50,000 transistors plus extremely fast magnetic core
storage. The new system can simultaneously read and write at the rate of 3,000,000 bits per second, when eight
data channels are in use. In 2.18 millionths of a second, it can locate and make ready for use any of 32,768 data or
instruction numbers (each of 10 digits) in the magnetic core storage. The 7090 can perform any of the following
operations in one second: 229,000 additions or subtractions, 39,500 multiplications, or 32,700 divisions. The basic
cycle time is 2.18 µsecs.


[Figure: block diagram of the IBM 7094 - central processing unit with AC, MQ, index registers and control unit; memory control unit (multiplexer) and core memory; I/O processors (channels) connecting magnetic drum, disk and tape storage, card reader, printer and the operator's console]
Fig. 3.2 IBM 7094


(Source: http://www.mgnet.org/~douglas/Classes/cs521/arch/ComputerArch2005.pdf)

3.2.3 Third Generation (1964-1974)


The features of third generation include:
• Introduction of integrated circuits combining thousands of transistors on a single chip
• Semiconductor memory
• Timesharing, graphics, structured programming
• 2 Mb memory, 5 MIPS
• Use of cache memory
• IBM’s System 360 - the first family of computers making a clear distinction between architecture and
implementation

Microelectronics
It means ‘small electronics’. The computer consists of logic gates, memory cells and interconnections. It is
manufactured on a semiconductor such as silicon. Many transistors can be produced on a single wafer of silicon.

[Figure: a silicon wafer is divided into chips, each chip contains many gates, and a chip is enclosed in a package]

Fig. 3.3 Relationship between Wafer, Chip, and Gate


(Source: http://www.csc.sdstate.edu/~gamradtk/csc317/csc317l13.pdf)

The IBM System/360 Model 91 was introduced in 1966 as the fastest, most powerful computer then in use. It was
specifically designed to handle high-speed data processing for scientific applications such as space exploration,
theoretical astronomy, subatomic physics and global weather forecasting. IBM estimated that each day in use, the
Model 91 would solve more than 1,000 problems involving about 200 billion calculations.

[Figure: CPU structure of the IBM S/360-370 series - 16 32-bit general purpose registers, 4 64-bit floating-point registers, fixed-point, decimal and floating-point arithmetic units, internal buses, AR, IR, PC and DR registers, program status word, and a memory control unit connected to main memory]
Fig. 3.4 CPU structure of IBM S/360-370 series


(Source: http://www.mgnet.org/~douglas/Classes/cs521/arch/ComputerArch2005.pdf)


3.2.4 Fourth Generation (1974-present)


The features of fourth generation include:
• Introduction of Very Large-Scale Integration (VLSI)/Ultra Large Scale Integration (ULSI) - combines millions
of transistors
• Single-chip processor and the single-board computer emerged
• Smallest in size because of the high component density
• Creation of the Personal Computer (PC)
• Wide spread use of data communications
• Object-Oriented programming: Objects & operations on objects
• Artificial intelligence: Functions & logic predicates

1971 - The 4004 was the world’s first universal microprocessor, invented by Federico Faggin, Ted Hoff, and Stan
Mazor. With just over 2,300 MOS transistors in an area of only 3 by 4 millimeters, it had as much power as the ENIAC.
• 4-bit CPU
• 1K data memory and 4K program memory
• Clock rate: 740kHz
• Just a few years later, the word size of the 4004 was doubled to form the 8008.

1974 – 1977 - the first personal computers – introduced on the market as kits (major assembly required).
• SCELBI (SCientific, ELectronic and BIological), designed by the SCELBI Computer Consulting Company and based on Intel's 8008 microprocessor with 1K of programmable memory; it sold for $565, with an additional 15K of memory available for $2760.
• Mark-8 (also Intel 8008 based) designed by Jonathan Titus.
• Altair (based on the new Intel 8080 microprocessor), built by MITS (Micro Instrumentation Telemetry Systems).
The computer kit contained an 8080 CPU, a 256 Byte RAM card, and a new Altair Bus design for the price of
$400.

1976 - Steve Wozniak and Steve Jobs released the Apple I computer and started Apple Computers. The Apple I
was the first single circuit board computer. It came with a video interface, 8k of RAM and a keyboard. The system
incorporated some economical components, including the 6502 processor (designed and produced by MOS Technology, and later second-sourced by Rockwell) and dynamic RAM.

1977 - Apple II computer model was released, also based on the 6502 processor, but it had color graphics (a first for
a personal computer), and used an audio cassette drive for storage. Its original configuration came with 4 kb of RAM,
but a year later this was increased to 48 kb of RAM and the cassette drive was replaced by a floppy disk drive.

1977 - Commodore PET (Personal Electronic Transactor) was designed by Chuck Peddle, ran also on the 6502
chip, but at half the price of the Apple II. It included 4 kb of RAM, monochrome graphics and an audio cassette
drive for data storage.

1981 - IBM released their new computer IBM PC which ran on a 4.77 MHz Intel 8088 microprocessor and equipped
with 16 kilobytes of memory, expandable to 256k. The PC came with one or two 160k floppy disk drives and an
optional color monitor. First one built from off the shelf parts (called open architecture) and marketed by outside
distributors.

1974-present
The inventions during this period include:
• Intel 8080
‚‚ 8-bit Data
‚‚ 16-bit Address
‚‚ 6 µm NMOS
‚‚ 6K Transistors
‚‚ 2 MHz
‚‚ 1974

Fig. 3.5 Intel 8080


(Source: http://www.mgnet.org/~douglas/Classes/cs521/arch/ComputerArch2005.pdf)

• Motorola 68000
‚‚ 32 bit architecture internally, but 16 bit data bus
‚‚ 16 32-bit registers, 8 data and 8 address registers
‚‚ 2 stage pipeline
‚‚ no virtual memory support
‚‚ 68020 was fully 32 bit externally
‚‚ 1979

Fig. 3.6 Motorola 68000


(Source: http://www.mgnet.org/~douglas/Classes/cs521/arch/ComputerArch2005.pdf)

• Intel386 CPU
‚‚ 32-bit Data
‚‚ improved addressing
‚‚ security modes (kernel, system services, application services, applications)
‚‚ 1985


Fig. 3.7 Intel386 CPU


(Source: http://www.mgnet.org/~douglas/Classes/cs521/arch/ComputerArch2005.pdf)

• Alpha 21264
‚‚ 64-bit Address/Data
‚‚ Superscalar
‚‚ Out-of-Order Execution
‚‚ 256 TLB entries
‚‚ 128KB Cache
‚‚ Adaptive Branch Prediction
‚‚ 0.35 µm CMOS Process
‚‚ 15.2M Transistors
‚‚ 600 MHz

Fig. 3.8 Alpha 21264


(Source: http://www.mgnet.org/~douglas/Classes/cs521/arch/ComputerArch2005.pdf)

Generation    Period           Technology

First         1945-1958        Vacuum Tubes
Second        1958-1964        Transistors
Third         1964-1974        Integrated Circuits (SSI, MSI)
Fourth        1974-present     Integrated Circuits (LSI, VLSI)
Fifth         ------           Artificial Intelligence, Neural Networks, Digital/Analog Hybrids, Web Computing

Table 3.1 Computer generations

Summary
• The evolution of computers has been characterised by increasing processor speed, decreasing component size,
increasing memory size, and increasing I/O capacity and speed.
• Computer architecture is the science and art of selecting and interconnecting hardware components to create
computers that meet functional performance and cost goals.
• Computer architecture concerns machine organisation, interfaces, application, technology, measurement and
simulation. 
• The IBM 7090 was the most powerful data processing system of its time.
• Although the IBM 7090 was a general purpose data processing system, it was designed with special attention to the needs of the design of missiles, jet engines, nuclear reactors and supersonic aircraft.
• The 7090 can perform any of the following operations in one second: 229,000 additions or subtractions, 39,500
multiplications, or 32,700 divisions.
• Microelectronics means ‘small electronics’. The computer consists of logic gates, memory cells and
interconnections.
• The IBM System/360 Model 91 was introduced in 1966 as the fastest, most powerful computer then in use.

References
• Blaauw, G. A. & Brooks, F. P., 1997. Computer Architecture: Concepts and Evolution, Pearson Education India.
• Chandra, R. R., Modern Computer Architecture, Galgotia Publications.
• Arnaoudova, E., Brief History of Computer Architecture [pdf] Available at: <http://www.mgnet.org/~douglas/
Classes/cs521/arch/ComputerArch2005.pdf > [Accessed 21 June 2012].
• Learning Computing History [Online] Available at: <http://www.comphist.org/computing_history/new_page_4.
htm> [Accessed 21 June 2012].
• 2008. Lecture -2 History of Computers [Video Online] Available at: <http://www.youtube.com/watch?v=TS2o
dp6rQHU&feature=results_main&playnext=1&list=PLF33FAF1A694F4F69 > [Accessed 21 June 2012].
• 2008. Generation’s of computer (HQ) [Video Online] Available at: <http://www.youtube.com/watch?v=7rkG
FqEfdJk&feature=related> [Accessed 21 June 2012].

Recommended Reading
• Hwang, 2003. Advanced Computer Architecture, Tata McGraw-Hill Education.
• Null, L. & Lobur, J., 2010. The Essentials of Computer Organization and Architecture, Jones & Bartlett
Publishers.
• Anita, G., 2011. Computer Fundamentals, Pearson Education India.


Self Assessment
1. Which of these does not characterise the evolution of computers?
a. Increasing processor speed
b. Increasing component size
c. Increasing memory size
d. Increasing I/O capacity and speed

2. Who defined computer architecture as 'the design of the integrated system which provides a useful tool to the programmer'?
a. Baer
b. Hayes
c. Hennessy and Patterson
d. Foster

3. Who defined computer architecture as “the interface between the hardware and the lowest level software”?
a. Baer
b. Hayes
c. Hennessy and Patterson
d. Foster

4. Which of these is not a component of computer architecture?


a. Instruction set architecture
b. Machine organisation
c. Underlying hardware
d. Software

5. Which of the following is a feature of first generation computers?


a. Vacuum tubes
b. Transistors
c. Integrated circuits
d. Personal computers

6. Which of the following is a feature of second generation computers?


a. Vacuum tubes
b. Transistors
c. Integrated circuits
d. Personal computers

7. Which of these is not a characteristic feature of Intel 8080?


a. 8-bit Data
b. 16-bit Address
c. 6 µm NMOS
d. 2 stage pipeline

8. Which of these is not a characteristic feature of Motorola 68000?
a. 32 bit architecture internally, but 16 bit data bus
b. 16 32-bit registers, 8 data and 8 address registers
c. Virtual memory support
d. 68020 was fully 32 bit externally

9. The IBM System/360 _________was introduced in 1966 as the fastest, most powerful computer then in use.
a. Model 91
b. Model 92
c. Model 90
d. Model 94

10. ________is the first family of computers making a clear distinction between architecture and implementation.
a. IBM’s System 360
b. IBM’s System 260
c. IBM’s System 340
d. IBM’s System 380


Chapter IV
System Architectures

Aim
The aim of this chapter is to:

• define parallel architectures

• enlist the types of parallel architectures

• elucidate Single Instruction - Single Data (SISD)

Objectives
The objectives of this chapter are to:

• explain shared memory

• enumerate the characteristics of distributed memory

• enlist the advantages of shared memory

Learning outcome
At the end of this chapter, you will be able to:

• identify the structure of a shared memory system

• understand the structure of a cluster of SMP nodes

• compare shared and distributed memory

4.1 Parallel Architectures
A system for the categorisation of the system architectures of computers was introduced by Flynn (1972). Parallel
or concurrent operation has many different forms within a computer system. Using a model based on the different
streams used in the computation process, the different kinds of parallelism are represented. A stream is a sequence
of objects such as data, or of actions such as instructions. Each stream is independent of all other streams, and each
element of a stream can consist of one or more objects or actions. Parallel computers are those that emphasise the
parallel processing between the operations in some way.

Parallel computers can be characterised based on the data and instruction streams forming various types of computer
organisations. They can also be classified based on the computer structure, e.g. multiple processors having separate
memory or one shared global memory. Parallel processing levels can also be defined based on the size of instructions
in a program called grain size. Thus, parallel computers can be classified based on various criteria.

The four combinations that describe most familiar parallel architectures are:
• SISD (Single Instruction, Single Data Stream): This is the traditional uniprocessor.
• SIMD (Single Instruction, Multiple Data Stream): This includes vector processors as well as massively parallel
processors.
• MISD (Multiple Instruction, Single Data Stream): These are typically systolic arrays.
• MIMD (Multiple Instruction, Multiple Data Stream): This includes traditional multiprocessors as well as the
newer networks of workstations.

Each of these combinations characterises a class of architectures and a corresponding type of parallelism.

4.2 Single Instruction - Single Data (SISD)


The most simple type of computer performs one instruction (such as reading from memory, addition of two values)
per cycle, with only one set of data or operand (in case of the examples a memory address or a pair of numbers).
Such a system is called a scalar computer. Figure below shows the summation of two numbers in a scalar computer.
As a scalar computer performs only one instruction per cycle, five cycles are needed to complete the task - only one
of them dedicated to the actual addition. To add n pairs of numbers, n · 5 cycles would be required.

[Figure: the five steps of the addition - load instruction, load value 1, load value 2, add values, save result]

Fig. 4.1 Summation of two numbers


(Source: http://mail.vssd.nl/hlf/9071301788.pdf)

In reality, each of the steps shown in figure above is actually composed of several sub-steps, increasing the number of
cycles required for one summation even more. The solution to this inefficient use of processing power is pipelining.
If there is one functional unit available for each of the five steps required, the addition still requires five cycles. The
advantage is that with all functional units being busy at the same time, one result is produced every cycle. For the
summation of n pairs of numbers, only (n − 1) + 5 cycles are then required.
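The cycle counts just quoted can be compared directly; the following sketch (illustrative only) assumes the five steps of the figure above:

#include <stdio.h>

int main(void)
{
    long stages = 5;                       /* the five steps of the addition */
    long ns[] = {1, 10, 100, 1000000};
    for (int i = 0; i < 4; i++) {
        long n = ns[i];
        printf("n = %8ld   without pipeline: %8ld cycles   with pipeline: %8ld cycles\n",
               n, n * stages, (n - 1) + stages);
    }
    return 0;
}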


[Figure: pipelined summation - in each time step a new pair enters the pipeline, so the load, add and save stages all work simultaneously on different pairs and one result is completed per cycle once the pipeline is full]

Fig. 4.2 Summation of two numbers in a pipeline


(Source: http://www.scribd.com/doc/23585346/An-Introduction-to-Parallel-Programming)

As the execution of instructions usually takes more than five steps, pipelines are made longer in real processors.
Long pipelines are also a prerequisite for achieving high CPU clock speeds. These long pipelines generate a new
problem. If there is a branching event (such as one caused by an if-statement), the pipeline has to be emptied and filled again, and a number of cycles equal to the pipeline length passes before results are delivered again. To circumvent this, the number of branches should be kept small (by avoiding and/or smartly placing if-statements). Compilers and
CPUs also try to minimise this problem by ‘guessing’ the outcome (branch prediction). The power of a processor
can be increased by combining several pipelines. This is then called a superscalar processor.

Fixed-point and logical calculations (performed in the ALU - Arithmetic/Logical Unit) are usually separated from
floating-point math (done by the FPU – Floating Point Unit). The FPU is commonly subdivided in a unit for addition
and one for multiplication. These units may be present several times, and some processors have additional functional
units for division and the computation of square roots. To actually gain a benefit from having several pipelines, these
have to be used at the same time. Parallelisation is essential to achieve this.

4.3 Single Instruction - Multiple Data (SIMD)


The scalar computer performs one instruction on one data set only. With numerical computations, larger data sets
are to be handled on which the same operation (instruction) has to be performed. A computer that performs one
instruction on several data sets is called a vector computer.

Vector computers work similar to the pipelined scalar computer represented in the figure above. The difference is that
instead of processing single values, vectors of data are processed in one cycle. The number of values in a vector is
limited by the CPU design. A vector processor that can simultaneously work with 64 vector elements can also generate
64 results per cycle - the scalar processor would require at least 64 cycles for this. To actually use the theoretically
possible performance of a vector computer, the calculations themselves need to be vectorised. If a vector processor
is fed with single values only, it cannot perform decently. Just like with a scalar computer, the pipelines need to be
kept filled.

Vector computers used to be very common in the field of high performance computing, as they allowed very high
performance even at lower CPU clock speeds. In recent years, they have begun to slowly disappear. Vector processors
are very complex and thus expensive, and perform poorly with non-vectorisable problems. Today’s scalar processors
are much cheaper and achieve higher CPU clock speeds. With the Pentium III, Intel introduced SSE (Streaming

SIMD Extensions), which is a set of vector instructions. In certain applications, such as video encoding, the use of
these vector instructions can offer quite impressive performance increases. More vector instructions were added
with SSE2 (Pentium 4) and SSE3 (Pentium 4 Prescott).
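As an illustration of such vector instructions, the sketch below uses the SSE intrinsics available in C compilers on x86 processors (the data values are arbitrary); a single _mm_add_ps call adds four floats in one instruction:

#include <stdio.h>
#include <xmmintrin.h>   /* SSE intrinsics */

int main(void)
{
    float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    float c[4];

    __m128 va = _mm_loadu_ps(a);       /* load four elements at once      */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb);    /* one instruction, four additions */
    _mm_storeu_ps(c, vc);

    for (int i = 0; i < 4; i++)
        printf("%.1f ", c[i]);
    printf("\n");
    return 0;
}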

4.4 Multiple Instruction - Multiple Data (MIMD)


The one instruction per cycle limit applies to all computers containing only one processing core. With multi-core CPUs, even single-CPU systems have more than one processing core, making them MIMD systems. Combining several processing cores or processors (scalar or vector processors) yields a computer that can process several instructions
and data sets per cycle. All high performance computers belong to this category. MIMD systems can be further
subdivided, mostly based on their memory architecture.

4.5 Shared Memory


General characteristics:
• Shared memory parallel computers vary widely, but generally have in common the ability for all processors to
access all memory as global address space.
• Multiple processors can operate independently but share the same memory resources.
• Changes in a memory location effected by one processor are visible to all other processors.
• Shared memory machines can be divided into two main classes based upon memory access times: UMA and
NUMA.

Uniform Memory Access (UMA)


• Most commonly represented today by Symmetric Multiprocessor (SMP) machines
• Identical processors
• Equal access and access times to memory
• Sometimes called Cache Coherent UMA (CC-UMA). Cache coherent means if one processor updates a location
in shared memory, all the other processors know about the update. Cache coherency is accomplished at the
hardware level.

Non-Uniform Memory Access (NUMA)


• Often made by physically linking two or more SMPs
• One SMP can directly access memory of another SMP
• Not all processors have equal access time to all memories
• Memory access across link is slower
• If cache coherency is maintained, then may also be called CC-NUMA - Cache Coherent NUMA

Global memory can be accessed by all processors of a parallel computer. Data in the global memory can be read or written by any of the processors. Examples: Sun HPC, Cray T90.

In MIMD systems with shared memory (SM-MIMD), all processors are connected to a common memory (RAM
- Random Access Memory). Usually all processors are identical and have equal memory access. This is called
symmetric multiprocessing (SMP).


[Figure: several CPUs and several RAM modules attached to a common connection]

Fig. 4.3 Structure of a shared memory system


(Source: http://www.scribd.com/doc/23585346/An-Introduction-to-Parallel-Programming)

The connection between processors and memory is of predominant importance. Figure below shows a shared
memory system with a bus connection. The advantage of a bus is its expandability. A huge disadvantage is that all
processors have to share the bandwidth provided by the bus, even when accessing different memory modules. Bus
systems can be found in desktop systems and small servers (frontside bus).

[Figure: all CPUs share a single bus to the RAM modules]

Fig. 4.4 Shared memory system with a bus connection


(Source: http://www.scribd.com/doc/23585346/An-Introduction-to-Parallel-Programming)

To circumvent the problem of limited memory bandwidth, direct connections from each CPU to each memory module
are desired. This can be achieved by using a crossbar switch. Crossbar switches can be found in high performance
computers and some workstations.

[Figure: a crossbar switch gives every CPU a direct connection to every RAM module]

Fig. 4.5 Shared memory system with crossbar switch


(Source: http://www.scribd.com/doc/23585346/An-Introduction-to-Parallel-Programming)

The problem with crossbar switches is their high complexity when many connections need to be made. This problem
can be weakened by using multi-stage crossbar switches, which in turn leads to longer communication times. For
this reason, the number of CPUs and memory modules than can be connected by crossbar switches is limited.

The advantages are:
• Global address space provides a user-friendly programming perspective to memory
• Data sharing between tasks is both fast and uniform due to the proximity of memory to CPUs

The disadvantages are:


• Primary disadvantage is the lack of scalability between memory and CPUs. Adding more CPUs can geometrically
increase traffic on the shared memory-CPU path and, for cache coherent systems, geometrically increase traffic
associated with cache/memory management.
• Programmer responsibility for synchronisation constructs that ensure “correct” access of global memory.
• Expense: it becomes increasingly difficult and expensive to design and produce shared memory machines with
ever increasing numbers of processors.

[Figure: UMA - four CPUs sharing a single memory; NUMA - several such SMP blocks, each with its own memory, joined by a bus interconnect]

Fig. 4.6 UMA and NUMA


(Source: https://computing.llnl.gov/tutorials/parallel_comp/#Whatis)

The big advantage of shared memory systems is that all processors can make use of the whole memory. This makes
them easy to program and efficient to use. The limiting factor to their performance is the number of processors and
memory modules that can be connected to each other. Due to this, shared memory systems usually consist of rather
few processors.
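Shared memory systems are typically programmed with threads. A minimal sketch follows (assuming a compiler with OpenMP support, for example gcc -fopenmp; the array size and workload are invented): every thread works on the same array in the common address space, and the reduction clause handles the synchronisation of the shared sum.

#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void)
{
    static double a[N];    /* one array in the global address space */
    double sum = 0.0;

    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        a[i] = (double)i;  /* every thread reads and writes the same shared array */
        sum += a[i];
    }
    printf("sum = %.0f (up to %d threads)\n", sum, omp_get_max_threads());
    return 0;
}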

4.6 Distributed Memory


General characteristics:
• Like shared memory systems, distributed memory systems vary widely but share a common characteristic.
Distributed memory systems require a communication network to connect inter-processor memory.
• Processors have their own local memory. Memory addresses in one processor do not map to another processor,
so there is no concept of global address space across all processors.


• Because each processor has its own local memory, it operates independently. Changes it makes to its local memory
have no effect on the memory of other processors. Hence, the concept of cache coherency does not apply.
• When a processor needs access to data in another processor, it is usually the task of the programmer to explicitly
define how and when data is communicated. Synchronisation between tasks is likewise the programmer’s
responsibility.
• The network "fabric" used for data transfer varies widely, though it can be as simple as Ethernet.

The advantages are:


• Memory is scalable with the number of processors. Increase the number of processors and the size of memory
increases proportionately.
• Each processor can rapidly access its own memory without interference and without the overhead incurred with
trying to maintain cache coherency.
• Cost effectiveness: can use commodity, off-the-shelf processors and networking.

The disadvantages are:


• The programmer is responsible for many of the details associated with data communication between
processors.
• It may be difficult to map existing data structures, based on global memory, to this memory organisation.

[Figure: four nodes, each pairing a CPU with its own memory, connected by a network]

Fig. 4.7 Distributed memory


(Source: https://computing.llnl.gov/tutorials/parallel_comp/#Whatis)

Each processor in a parallel computer has its own memory (local memory); no other processor can access this
memory. Data can only be shared by message passing. Examples: Cray T3E, IBM SP2.

The number of processors and memory modules cannot be increased arbitrarily in the case of a shared memory
system. Another way to build a MIMD system is distributed memory (DM-MIMD). Each processor has its own
local memory. The processors are connected to each other. The demands imposed on the communication network
are lower than in the case of a shared memory system, as the communication between processors may be slower
than the communication between processor and memory.

[Figure: each CPU has its own local RAM; the CPUs communicate over a connection network]

Fig. 4.8 Structure of a distributed memory system


(Source: http://www.scribd.com/doc/23585346/An-Introduction-to-Parallel-Programming)

Distributed memory systems can be hugely expanded. Several thousand processors are not uncommon; this is
called massively parallel processing (MPP). To actually use the theoretical performance, much more programming
effort than with shared memory systems is required. The problem has to be subdivided into parts that require little
communication. The processors can only access their own memory. Should they require data from the memory of
another processor, then these data have to be copied. Due to the relatively slow communications network between
the processors, this should be avoided as much as possible.
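Message passing makes such copies explicit. The sketch below (assuming an MPI library, compiled with mpicc and started with mpirun -np 2; the value sent is arbitrary) moves one integer from the local memory of process 0 into the local memory of process 1:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* explicit copy */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d from rank 0\n", value);
    }
    MPI_Finalize();
    return 0;
}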

4.7 ccNUMA
The shared memory systems suffer from a limited system size, while distributed memory systems suffer from the
arduous communication between the memories of the processors. A compromise is the ccNUMA (cache coherent
non-uniform memory access) architecture.

A ccNUMA system basically consists of several SMP systems. These are connected to each other by means of a
fast communications network, often crossbar switches. Access to the whole, distributed or non-unified memory is
possible via a common cache.

A ccNUMA system is as easy to use as a true shared memory system, at the same time it is much easier to expand.
To achieve optimal performance, it has to be made sure that local memory is used, and not the memory of the
other modules, which is only accessible via the slow communications network. The modular structure is another
big advantage of this architecture. Most ccNUMA systems consist of modules that can be plugged together to get
systems of various sizes.

[Figure: several SMP modules, each with its own CPUs and local RAM, connected to each other through a common cache]

Fig. 4.9 Structure of a ccNUMA system


(Source: http://www.scribd.com/doc/23585346/An-Introduction-to-Parallel-Programming)

4.8 Cluster
For some years now clusters are very popular in the high performance computing community. A cluster consists of
several cheap computers (nodes) linked together. The simplest case is the combination of several desktop computers.
It is known as a Network Of Workstations (NOW). Mostly SMP systems (usually dual-CPU systems with Intel or
AMD CPUs) are used because of their good value for money. They form hybrid systems. The nodes, which are
themselves shared memory systems, form a distributed memory system.


The nodes are connected via a fast network, usually Myrinet or Infiniband. Gigabit Ethernet has approximately the same bandwidth of about 100 MB/s and is a lot cheaper, but the latency (travel time of a data package) is much higher. It is about 100 µs for Gigabit Ethernet compared to only 10 - 20 µs for Myrinet. Even this is a lot of time. At a clock speed of 2 GHz, one cycle takes 0.5 ns. A latency of 10 µs amounts to 20,000 cycles of travel time
before the data package reaches its target. Clusters offer lots of computing power for little money. It is not that easy
to actually use the power. Communication between the nodes is slow, and as with conventional distributed memory
systems, each node can only access its local memory directly. The mostly employed PC architecture also limits
the amount of memory per node. 32 bit systems cannot address more than 4 GB of RAM and x86-64 systems are
limited by the number of memory slots, the size of the available memory modules, and the chip sets. Despite these
disadvantages, clusters are very successful and have given traditional, more expensive distributed memory systems
a hard time. They are ideally suited to problems with a high degree of parallelism, and their modularity makes it
easy to upgrade them. In recent years, the cluster idea has been expanded to connecting computers all over the
world via the internet. This makes it possible to aggregate enormous computing power. Such a widely distributed
system is known as a grid.
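The latency figures above translate into CPU cycles as in the following sketch (the clock speed and latencies are the ones quoted in this section):

#include <stdio.h>

int main(void)
{
    double clock_hz = 2.0e9;                  /* 2 GHz, i.e. 0.5 ns per cycle       */
    double latency_us[] = {100.0, 10.0};      /* Gigabit Ethernet, Myrinet (quoted) */
    const char *name[] = {"Gigabit Ethernet", "Myrinet"};

    for (int i = 0; i < 2; i++)
        printf("%-17s ~%.0f cycles of travel time\n",
               name[i], latency_us[i] * 1.0e-6 * clock_hz);
    return 0;
}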

[Figure: many SMP nodes, each consisting of two CPUs with shared RAM, connected by a network]

Fig. 4.10 Structure of a cluster of SMP nodes


(Source: http://www.scribd.com/doc/23585346/An-Introduction-to-Parallel-Programming)

4.9 Multiple Instruction - Single Data (MISD)


The attentive reader may have noticed that one system architecture is missing: Multiple instruction - Single Data
(MISD). Such a computer is neither theoretically nor practically possible in a sensible way.

4.10 Some Examples


This section presents a few common multiprocessor/multi-core architectures. A much more extensive and detailed
description is given in the “Overview of recent supercomputers”, which is updated once a year. Twice a year, a list of
the 500 fastest computers in the world is published. The ranking is based on the LINPACK benchmark. Although this
is an old benchmark with little practical reference, the Top 500 list gives a good overview of the fastest computers
and the development of supercomputers.

4.10.1 Intel Pentium D


The Intel Pentium D was introduced in 2005. It is Intel’s first dual-core processor. It integrates two cores, based
on the NetBurst design of the Pentium 4, on one chip. The cores have their own caches and access the common
memory via the frontside bus. This limits memory bandwidth and slows the system down in the case of memory-
intensive computations. The Pentium D’s long pipelines allow for high clock frequencies (at the time of writing
up to 3.73 GHz with the Pentium D 965), but may cause poor performance in the case of branches. The Pentium D
is not dual-CPU-capable. This capability is reserved for the rather expensive Xeon CPU. The Pentium D supports
SSE3 and x86-64 (the 64bit-extension of the x86 instruction set).

4.10.2 Intel Core 2 Duo
Intel’s successor to the Pentium D is similar in design to the popular Pentium M design, which in turn is based on
the Pentium III, with ancestry reaching back to the Pentium Pro. It abandons high clock frequencies in the favour
of more efficient computation. Like the Pentium D, it uses the frontside bus for memory access by both CPUs. The
Core 2 Duo supports SSE3 and x86-64.

4.10.3 AMD Athlon 64 X2 & Opteron


AMD’s dual-core CPUs Athlon 64 X2 (single-CPU only) and Opteron (depending on model up to 8 CPUs in one
system possible) are very popular CPUs for Linux clusters. They offer good performance at affordable prices and reasonable power consumption. Each core has its own HyperTransport channel for memory access, making these
CPUs well suited for memory intensive applications. They also support SSE3 and x86-64.

4.10.4 IBM pSeries


The pSeries is IBM’s server- and workstation line based on the POWER processor. The newer POWER processors
are multi-core designs and feature large caches. IBM builds shared memory systems with up to 32 CPUs. One large
pSeries installation is the JUMP cluster at Kernforschungszentrum Jülich, Germany.

4.10.5 IBM BlueGene


BlueGene is an MPP (massively parallel processing) architecture by IBM. It uses rather slow 700 MHz PowerPC
processors. These processors form very large, highly integrated distributed memory systems, with fast communication
networks (a 3D-Torus, like the Cray T3E). At the time of writing, position one and three of the Top 500 list were
occupied by BlueGene systems. The fastest system, BlueGene/L (http://www.llnl.gov/asc/computing_resources/
bluegenel/), consists of 131,072 CPUs, and delivers a performance of up to 360 TeraFLOPS.

4.10.6 NEC SX-8


The NEC SX-8 is the one of the few vector supercomputers in production at the moment. It performs vector
operations at a speed of 2 GHz, with eight operations per clock cycle. One SX-8 node consists of eight CPUs,
up to 512 nodes can be connected. The biggest SX-8 installation is, at the time of writing, the 72-node system at
Höchstleistungsrechenzentrum Stuttgart (HLRS), Germany.

4.10.7 Cray XT3


Cray is the most famous name in supercomputing. Many of its designs were known not only for their performance,
but also for their design. The Cray XT3 is a massively-parallel system using AMD’s Opteron CPU. The biggest
installation of an XT3 is “Red Storm” at Sandia National Laboratories (http://www.sandia.gov/ASC/redstorm.html)
with 26,544 dual-core Opteron CPUs, good for a performance of more than 100 TFLOPS and the second position
in the November 2006 Top 500 list.

4.10.8 SGI Altix 3700


The SGI Altix 3700 is a ccNUMA system using Intel’s Itanium 2 processor. The Itanium 2 has large caches and
good floating point performance. Being ccNUMA, the Altix 3700 is easy to program. Aster at SARA, Amsterdam,
the Netherlands is an Altix 3700 with 416 CPUs.


Summary
• A system for the categorisation of the system architectures of computers was introduced by Flynn (1972).
• A stream is a sequence of objects such as data, or of actions such as instructions.
• Parallel computers are those that emphasise the parallel processing between the operations in some way.
• Parallel computers can be characterised based on the data and instruction streams forming various types of
computer organisations.
• Parallel processing levels can also be defined based on the size of instructions in a program called grain size.
• Long pipelines are also a prerequisite for achieving high CPU clock speeds.
• Fixed-point and logical calculations (performed in the ALU - Arithmetic/Logical Unit) are usually separated from floating-point math (done by the FPU - Floating Point Unit).
• A scalar computer performs one instruction on one data set only.
• A computer that performs one instruction on several data sets is called a vector computer.
• Global memory can be accessed by all processors of a parallel computer. Data in the global memory can be read or written by any of the processors.
• In MIMD systems with shared memory (SM-MIMD), all processors are connected to a common memory
(RAM - Random Access Memory).
• The advantage of a bus is its expandability. A huge disadvantage is that all processors have to share the bandwidth
provided by the bus, even when accessing different memory modules.
• Bus systems can be found in desktop systems and small servers (frontside bus).
• The big advantage of shared memory systems is that all processors can make use of the whole memory.
• A ccNUMA system basically consists of several SMP systems.
• A ccNUMA system is as easy to use as a true shared memory system, at the same time it is much easier to
expand.
• A cluster consists of several cheap computers (nodes) linked together.
• The Intel Pentium D was introduced in 2005. It is Intel’s first dual-core processor. It integrates two cores, based
on the NetBurst design of the Pentium 4, on one chip.
• The pSeries is IBM’s server- and workstation line based on the POWER processor.
• BlueGene is an MPP (massively parallel processing) architecture by IBM.
• The SGI Altix 3700 is a ccNUMA system using Intel’s Itanium 2 processor.

References
• Grama, 2010. An Introduction to Parallel Computing: Design and Analysis of Algorithms, 2/e, Pearson Education
India.
• Bhujade, M. R., 1995. Parallel Computing, New Age International.
• Wittwer, T., Introduction to Parallel Programming [Online] Available at: <http://www.scribd.com/doc/23585346/
An-Introduction-to-Parallel-Programming> [Accessed 21 June 2012].
• Introduction to Parallel Computing [Online] Available at: <https://computing.llnl.gov/tutorials/parallel_
comp/#Whatis > [Accessed 21 June 2012].
• 2011. Intro to Computer Architecture [Video Online] Available at: <http://www.youtube.com/watch?v=HEjPop-
aK_w> [Accessed 21 June 2012].
• 2011. x64 Assembly and C++ Tutorial 38: Intro to Single Instruction Multiple Data [Video Online] Available
at: <http://www.youtube.com/watch?v=cbL88Ic6uPw > [Accessed 21 June 2012].

Recommended Reading
• Lafferty, E. L., 1993. Parallel Computing: An Introduction, William Andrew.
• Pacheco, P., 2011. An Introduction to Parallel Programming, Elsevier.
• Foster, I., 1995. Designing and Building Parallel Programs: Concepts and Tools for Parallel Software
Engineering, Addison-Wesley.


Self Assessment
1. Which of the following is the traditional uniprocessor?
a. SISD
b. SIMD
c. MISD
d. MIMD

2. What includes vector processors as well as massively parallel processors?


a. SISD
b. SIMD
c. MISD
d. MIMD

3. Which of the following are typically systolic arrays?


a. SISD
b. SIMD
c. MISD
d. MIMD

4. What includes traditional multiprocessors as well as the newer networks of workstations?


a. SISD
b. SIMD
c. MISD
d. MIMD

5. A computer that performs one instruction on several data sets is called a _____.
a. vector computer
b. scalar computer
c. vector processor
d. scalar processor

6. Usually all processors are identical and have equal memory access; this is called _________.
a. SMP
b. SISD
c. NUMA
d. DM-MIMD

7. A ________system basically consists of several SMP systems.


a. ccNUMA
b. NUMA
c. UMA
d. MIMD

8. A _______consists of several cheap computers (nodes) linked together.
a. cluster
b. shared memory
c. distributed memory
d. hybrid system

9. Gigabit Ethernet has approximately the same bandwidth of about _________.


a. 50 MB/s
b. 100 MB/s
c. 60 MB/s
d. 90 MB/s

10. _____integrates two cores, based on the NetBurst design of the Pentium 4, on one chip.
a. Intel Pentium D
b. Intel Core 2 Duo
c. IBM pSeries
d. IBM BlueGene


Chapter V
Parallel Programming Models and Paradigms

Aim
The aim of this chapter is to:

• define cluster

• explain cluster computer architecture

• elucidate parallel applications and their development

Objectives
The objectives of this chapter are to:

• explain shared memory

• enumerate the strategies for developing parallel applications

• elucidate virtual shared memory

Learning outcome
At the end of this chapter, you will be able to:

• understand the parallelising compilers

• identify message passing

• describe code granularity and levels of parallelism

5.1 Introduction
In the 1980s it was believed computer performance was best improved by creating faster and more efficient processors.
This idea was challenged by parallel processing, which in essence means linking together two or more computers
to jointly solve a computational problem. Since the early 1990s there has been an increasing trend to move away
from expensive and specialised proprietary parallel supercomputers (vector-supercomputers and massively parallel
processors) towards networks of computers (PCs/Workstations/SMPs). Among the driving forces that have enabled
this transition has been the rapid improvement in the availability of commodity high performance components for
PCs/workstations and networks. These technologies are making a network/cluster of computers an appealing vehicle
for cost-effective parallel processing and this is consequently leading to low-cost commodity supercomputing.

Scalable computing clusters, ranging from a cluster of (homogeneous or heterogeneous) PCs or workstations, to
SMPs, are rapidly becoming the standard platforms for high-performance and large-scale computing. The main
attractiveness of such systems is that they are built using affordable, low-cost, commodity hardware (Pentium PCs),
fast LAN such as Myrinet, and standard software components such as UNIX, MPI, and PVM parallel programming
environments. These systems are scalable, i.e., they can be tuned to available budget and computational needs and
allow efficient execution of both demanding sequential and parallel applications.

Clusters use intelligent mechanisms for dynamic and network-wide resource sharing, which respond to resource
requirements and availability. These mechanisms support scalability of cluster performance and allow a flexible use
of workstations, since the cluster or network-wide available resources are expected to be larger than the available
resources at any one node/workstation of the cluster. These intelligent mechanisms also allow clusters to support
multiuser, time-sharing parallel execution environments, where it is necessary to share resources and at the same
time distribute the workload dynamically to utilise the global resources efficiently.

The idea of exploiting this significant computational capability available in networks of workstations (NOWs) has
gained an enthusiastic acceptance within the high-performance computing community, and the current tendency favors
this sort of commodity supercomputing. This is mainly motivated by the fact that most of the scientific community
has the desire to minimise economic risk and rely on consumer-based, off-the-shelf technology. Cluster computing
has been recognised as the wave of the future to solve large scientific and commercial problems.

5.2 A Cluster Computer and its Architecture


A cluster is a type of parallel or distributed processing system, which consists of a collection of interconnected
stand-alone computers working together as a single, integrated computing resource.

A computer node can be a single or multiprocessor system (PCs, workstations, or SMPs) with memory, I/O facilities,
and an operating system. A cluster generally refers to two or more computers (nodes) connected together. The nodes
can exist in a single cabinet or be physically separated and connected via a LAN. An interconnected (LAN-based)
cluster of computers can appear as a single system to users and applications. Such a system can provide a cost-
effective way to gain features and benefits (fast and reliable services) that have historically been found only on more
expensive proprietary shared memory systems. The typical architecture of a cluster is shown in figure below.


[Figure: layered view of a cluster. Sequential and parallel applications run on top of parallel programming environments; cluster middleware (single system image and availability infrastructure) spans several PCs/workstations, each with its own communication software and network interface hardware, all linked by a high-speed network/switch.]

Fig. 5.1 Cluster computer architecture


(Source: http://www.buyya.com/cluster/v2chap1.pdf)

The following are some prominent components of cluster computers:


• Multiple high performance computers (PCs, workstations, or SMPs)
• State-of-the-art operating systems (layered or micro-kernel based)
• High performance networks/switches (such as Gigabit Ethernet and Myrinet)
• Network Interface Cards (NICs)
• Fast communication protocols and services (such as active and fast messages)
• Cluster middleware (Single System Image (SSI) and System Availability Infrastructure)
• Hardware (such as Digital (DEC) Memory Channel, hardware DSM, and SMP techniques)
• Operating system kernel or gluing layer (such as Solaris MC and GLUnix)
• Applications and Subsystems
‚‚ Applications (such as system management tools and electronic forms)
‚‚ Run-time systems (such as software DSM and parallel file-system)
‚‚ Resource management and scheduling software (such as LSF (Load Sharing Facility) and CODINE
(COmputing in DIstributed Networked Environments))
• Parallel programming environments and tools (such as compilers, PVM (Parallel Virtual Machine), and MPI
(Message Passing Interface))
• Applications
‚‚ Sequential
‚‚ Parallel or distributed

The network interface hardware acts as a communication processor and is responsible for transmitting and receiving
packets of data between cluster nodes via a network/switch.

Communication software offers a means of fast and reliable data communication among cluster nodes and to the
outside world. Often, clusters with a special network/switch like Myrinet use communication protocols such as
active messages for fast communication among its nodes. They potentially bypass the operating system and thus
remove the critical communication overheads providing direct user-level access to the network interface.

The cluster nodes can work collectively as an integrated computing resource, or they can operate as individual
computers. The cluster middleware is responsible for offering an illusion of a unified system image (single system
image) and availability out of a collection of independent but interconnected computers. Programming environments
can offer portable, efficient, and easy-to-use tools for development of applications. They include message passing
libraries, debuggers, and profilers. It should not be forgotten that clusters could be used for the execution of sequential
or parallel applications.

5.3 Parallel Applications and their Development


The class of applications that a cluster can typically cope with would be considered demanding sequential applications
and grand challenge/supercomputing applications (GCAs). GCAs are fundamental problems in science and engineering
with broad economic and scientific impact. They are generally considered intractable without the use of state-of-
the-art parallel computers.

The scale of their resource requirements, such as processing time, memory, and communication needs, distinguishes
GCAs. A typical example of a grand challenge problem is the simulation of some phenomena that cannot be measured
through experiments. GCAs include massive crystallographic and microtomographic structural problems, protein
dynamics and biocatalysis, relativistic quantum chemistry of actinides, virtual materials design and processing,
global climate modeling, and discrete event simulation.

Although the technology of clusters is currently being deployed, the development of parallel applications is really
a complex task.
• First of all, it is largely dependent on the availability of adequate software tools and environments.
• Second, parallel software developers must contend with problems not encountered during sequential programming,
namely: non-determinism, communication, synchronisation, data partitioning and distribution, load-balancing,
fault-tolerance, heterogeneity, shared or distributed memory, deadlocks, and race conditions.

All these issues present some new important challenges. Currently, only some specialised programmers have the
knowledge to use parallel and distributed systems for executing production codes. This programming technology is
still somewhat distant from the average sequential programmer, who does not feel very enthusiastic about moving
into a different programming style with increased difficulties, though they are aware of the potential performance
gains.

Parallel computing can only be widely successful if parallel software is able to meet the expectations of
the users, such as:
• Provide architecture/processor type transparency
• Provide network/communication transparency
• Be easy-to-use and reliable
• Provide support for fault-tolerance
• Accommodate heterogeneity
• Assure portability
• Provide support for traditional high-level languages
• Be capable of delivering increased performance
• The holy grail is to provide parallelism transparency

This last expectation is still at least one decade away, but most of the others can be achieved today. The internal
details of the underlying architecture should be hidden from the user and the programming environment should
provide high-level support for parallelism. Otherwise, if the programming interface is difficult to use, it makes the
writing of parallel applications highly unproductive and painful for most programmers. There are basically two
main approaches for parallel programming:


• The first one is based on implicit parallelism. This approach has been followed by parallel languages and
parallelising compilers. The user does not specify, and thus cannot control, the scheduling of calculations and/
or the placement of data.
• The second one relies on explicit parallelism. In this approach, the programmer is responsible for most of the
parallelisation effort such as task decomposition, mapping tasks to processors, and the communication structure.
This approach is based on the assumption that the user is often the best judge of how parallelism can be exploited
for a particular application.

It is also observed that the use of explicit parallelism achieves better efficiency than parallel languages or
compilers that use implicit parallelism.

5.3.1 Strategies for Developing Parallel Applications


Undoubtedly, the main software issue is to decide between either porting existing sequential applications or developing
new parallel applications from scratch. There are three strategies for creating parallel applications.

[Figure: three porting strategies. Strategy 1 (automatic parallelisation): existing source code plus minor code modification is fed through automatic parallelisation to produce a parallel application. Strategy 2 (parallel libraries): subroutines in the existing source are identified and replaced by calls to a newly developed parallel library, and the code is relinked. Strategy 3 (major recoding): the existing source code undergoes major recoding with compiler-assisted parallelisation to produce a parallel application.]

Fig. 5.2 Porting strategies for parallel applications


(Source: http://www.buyya.com/cluster/v2chap1.pdf)

The first strategy is based on automatic parallelisation, the second is based on the use of parallel libraries, while the
third strategy (major recoding) resembles from-scratch application development. The goal of automatic parallelisation
is to relieve the programmer from the parallelising tasks. Such a compiler would accept dusty-deck codes and produce
efficient parallel object code without any (or, at least, very little) additional work by the programmer. However, this
is still very hard to achieve and is well beyond the reach of current compiler technology.

Another possible approach for porting parallel code is the use of parallel libraries. This approach has been more
successful than the previous one. The basic idea is to encapsulate some of the parallel code that is common to several
applications into a parallel library that can be implemented in a very efficient way. Such a library can then be reused
by several codes. Parallel libraries can take two forms:
• They encapsulate the control structure of a class of applications.
• They provide a parallel implementation of some mathematical routines that are heavily used in the kernel of
some production codes.

The third strategy, which involves writing a parallel application from the very beginning, gives more freedom to the
programmer who can choose the language and the programming model. However, it may make the task very difficult
since little of the code can be reused. Compiler assistance techniques can be of great help, although with a limited
applicability. Usually the tasks that can be effectively provided by a compiler are data distribution and placement.

5.3.2 Code Granularity and Levels of Parallelism


In modern computers, parallelism appears at various levels both in hardware and software: signal, circuit, component,
and system levels. That is, at the very lowest level, signals travel in parallel along parallel data paths. At a slightly
higher level, multiple functional units operate in parallel for faster performance, popularly known as instruction
level parallelism. For instance, a PC processor such as Pentium Pro has the capability to process three instructions
simultaneously. Many computers overlap CPU and I/O activities; for instance, a disk access for one user while
executing an instruction of another user. Some computers use a memory interleaving technique (several banks of
memory can be accessed in parallel for faster access to memory). At a still higher level, SMP systems have multiple
CPUs that work in parallel. At an even higher level of parallelism, one can connect several computers together and
make them work as a single machine, popularly known as cluster computing.

The first two levels of parallelism (signal and circuit level) are handled implicitly by the hardware and are called
hardware parallelism. The remaining two levels (component and system) are mostly expressed implicitly or explicitly
by using various software techniques, popularly known as software parallelism. Levels of parallelism
can also be based on the lumps of code (grain size) that can be a potential candidate for parallelism. Table below
lists categories of code granularity for parallelism. All approaches of creating parallelism based on code granularity
have a common goal to boost processor efficiency by hiding latency of a lengthy operation such as a memory/disk
access. To conceal latency, there must be another activity ready to run whenever a lengthy operation occurs. The idea
is to execute concurrently two or more single-threaded applications, such as compiling, text formatting, database
searching, and device simulation, or parallelised applications having multiple tasks simultaneously.

Grain size     Code item                                    Parallelised by
Very fine      Instruction                                  Processor
Fine           Loop/instruction block                       Compiler
Medium         Standard one-page function                   Programmer
Large          Program (separate heavyweight process)       Programmer

Table 5.1 Code Granularity and Parallelism

Parallelism in an application can be detected at several levels. They are:


• Very-fine grain (multiple instruction issue)
• Fine-grain (data-level)
• Medium-grain (or control-level)
• Large-grain (or task-level)

The different levels of parallelism are depicted in figure below. Among the four levels of parallelism, the first two
levels are supported transparently either by the hardware or parallelising compilers. The programmer mostly handles
the last two levels of parallelism. The three important models used in developing applications are shared-memory
model, distributed memory model (message passing model), and distributed-shared memory model.
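
As a rough illustration of these grain sizes (this example is not from the text), the C sketch below uses an OpenMP directive to express fine-grain, loop-level parallelism, the kind a compiler or run-time system can handle, while the coarser, task-level grain corresponds to running whole functions or programs concurrently under explicit programmer control. OpenMP is assumed here purely for illustration.

/* Fine-grain (loop-level) parallelism expressed with OpenMP.
 * Compile with, e.g., gcc -fopenmp granularity.c */
#include <stdio.h>

#define N 1000000

int main(void)
{
    static double a[N], b[N];
    double sum = 0.0;

    /* Each iteration is an independent unit of work that the compiler/
     * run-time distributes across threads (fine grain, data level). */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        a[i] = i * 0.5;
        b[i] = i * 2.0;
        sum += a[i] * b[i];
    }

    /* Large-grain (task-level) parallelism would instead run whole
     * functions or programs concurrently, e.g. as separate processes. */
    printf("dot product = %f\n", sum);
    return 0;
}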

5.4 Parallel Programming Models and Tools


This section presents a brief overview on the area of parallel programming and describes the main approaches and
models, including parallelising compilers, parallel languages, message-passing, virtual shared memory, object-
oriented programming, and programming skeletons.


5.4.1 Parallelising Compilers


There has been some research in parallelising compilers and parallel languages but their functionality is still very
limited. Parallelising compilers are still limited to applications that exhibit regular parallelism, such as computations
in loops. Parallelising/vectorising compilers have proven to be relatively successful for some applications on shared-
memory multiprocessors and vector processors with shared memory, but are largely unproven for distributed-memory
machines. The difficulties are due to the non-uniform access time of memory in the latter systems. The currently
existing compiler technology for automatic parallelisation is thus limited in scope and only rarely provides adequate
speedup.

[Figure: the four levels at which parallelism can be detected. Large grain (task level): tasks i-1, i, i+1 exchanging messages; medium grain (control level): functions func1(), func2(), func3(); fine grain (data level): independent statements such as a[0]=..., a[1]=..., a[2]=...; very fine grain (multiple issue): individual operations such as +, x, /.]

Fig. 5.3 Detecting parallelism


(Source: http://www.buyya.com/cluster/v2chap1.pdf)

5.4.2 Parallel Languages


Some parallel languages, like SISAL and PCN, have found little favor with application programmers. This is because
users are not willing to learn a completely new language for parallel programming. They really would prefer to use
their traditional high-level languages (like C and Fortran) and try to recycle their already available sequential software.
For these programmers, the extensions to existing languages or run-time libraries are a viable alternative.

5.4.3 High Performance Fortran


The High Performance Fortran (HPF) initiative seems to be a promising solution to solve the dusty-deck problem
of Fortran codes. However, it only supports applications that follow the SPMD paradigm and have a very regular
structure. Other applications that are missing these characteristics and present a more asynchronous structure are
not as successful with the current versions of HPF. Current and future research will address these issues.

5.4.4 Message Passing
Message passing libraries allow efficient parallel programs to be written for distributed memory systems. These
libraries provide routines to initiate and configure the messaging environment as well as sending and receiving
packets of data. Currently, the two most popular high-level message-passing systems for scientific and engineering
applications are PVM (Parallel Virtual Machine) from Oak Ridge National Laboratory and MPI (Message Passing
Interface) defined by the MPI Forum.

Currently, there are several implementations of MPI, including versions for networks of workstations, clusters of
personal computers, distributed-memory multiprocessors, and shared-memory machines. Almost every hardware
vendor is supporting MPI. This gives the user a comfortable feeling since an MPI program can be executed on
almost all of the existing computing platforms without the need to rewrite the program from scratch. The goal of
portability, architecture, and network transparency has been achieved with these low-level communication libraries
like MPI and PVM. Both communication libraries provide an interface for C and Fortran, and additional support
of graphical tools.
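
As a minimal, illustrative sketch (not taken from the text) of what such a message-passing program looks like, the following C code uses a handful of standard MPI calls: the environment is initialised, each process learns its rank, and process 0 sends a single integer to process 1. The exact compile and launch commands vary between MPI implementations.

/* Minimal MPI point-to-point example in C.
 * Typically compiled with mpicc and run with, e.g., mpirun -np 2 ./a.out */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size, value;

    MPI_Init(&argc, &argv);                   /* configure the messaging environment */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);     /* which process am I?  */
    MPI_Comm_size(MPI_COMM_WORLD, &size);     /* how many processes?  */

    if (size < 2) {
        if (rank == 0) printf("run with at least 2 processes\n");
    } else if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("process 1 received %d from process 0\n", value);
    }

    MPI_Finalize();
    return 0;
}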

However, these message-passing systems are still stigmatised as low-level because most tasks of the parallelisation
are still left to the application programmer. When writing parallel applications using message passing, the programmer
still has to develop a significant amount of software to manage some of the tasks of the parallelisation, such as: the
communication and synchronisation between processes, data partitioning and distribution, mapping of processes
onto processors, and input/output of data structures. If the
application programmer has no special support for these tasks, it then becomes difficult to widely exploit parallel
computing. The easy-to-use goal is not accomplished with a bare message-passing system, and hence requires
additional support.

Other ways to provide alternate-programming models are based on Virtual Shared Memory (VSM) and parallel
object-oriented programming. Another way is to provide a set of programming skeletons in the form of run-time
libraries that already support some of the tasks of parallelisation and can be implemented on top of portable message-
passing systems like PVM or MPI.

5.4.5 Virtual Shared Memory


Virtual shared memory implements a shared-memory programming model in a distributed-memory environment.
Linda is an example of this style of programming. It is based on the notion of a generative communication model
and on a virtual shared associative memory, called tuple space, that is accessible to all the processes by using in
and out operations.

Distributed Shared Memory (DSM) is the extension of the well-accepted shared memory programming model on
systems without physically shared memory.

The shared data space is flat and accessed through normal read and write operations. In contrast to message passing, in
a DSM system a process that wants to fetch some data value does not need to know its location; the system will find
and fetch it automatically. In most of the DSM systems, shared data may be replicated to enhance the parallelism
and the efficiency of the applications.

While scalable parallel machines are mostly based on distributed memory, many users may find it easier to write
parallel programs using a shared-memory programming model. This makes DSM a very promising model, provided
it can be implemented efficiently.


5.4.6 Parallel Object-Oriented Programming


The idea behind parallel object-oriented programming is to provide suitable abstractions and software engineering
methods for structured application design. As in the traditional object model, objects are defined as abstract data types,
which encapsulate their internal state through well-defined interfaces and thus represent passive data containers. If
we treat this model as a collection of shared objects, we can find an interesting resemblance with the shared data
model.

The object-oriented programming model is by now well established as the state-of-the-art software engineering
methodology for sequential programming, and recent developments are also emerging to establish object-orientation
in the area of parallel programming. The current lack of acceptance of this model among the scientific community can
be explained by the fact that computational scientists still prefer to write their programs using traditional languages
like FORTRAN. This is the main difficulty that has been faced by the object-oriented environments, though it is
considered as a promising technique for parallel programming. Some interesting object-oriented environments such
as CC++ and Mentat have been presented in the literature.

5.4.7 Programming Skeletons


Another alternative to the use of message-passing is to provide a set of high-level abstractions which provides
support for the most commonly used parallel paradigms. A programming paradigm is a class of algorithms that solve different
problems but have the same control structure. Programming paradigms usually encapsulate information about useful
data and communication patterns, and an interesting idea is to provide such abstractions in the form of programming
templates or skeletons. A skeleton corresponds to the instantiation of a specific parallel programming paradigm, and
it encapsulates the control and communication primitives of the application into a single abstraction.

After the recognition of parallelisable parts and an identification of the appropriate algorithm, a lot of developing
time is wasted on programming routines closely related to the paradigm and not the application itself. With the aid
of a good set of efficiently programmed interaction routines and skeletons, the development time can be reduced
significantly.

The skeleton hides from the user the specific details of the implementation and allows the user to specify the
computation in terms of an interface tailored to the paradigm. This leads to a style of Skeleton Oriented Programming
(SOP) which has been identified as a very promising solution for parallel computing.

Skeletons are more general programming methods since they can be implemented on top of message-passing,
object-oriented, shared-memory or distributed memory systems, and provide increased support for parallel
programming.
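
A minimal sketch of what such a skeleton interface might look like in C is given below. All of the names (farm_run, task_t, result_t) are hypothetical and do not belong to any particular library; the sequential body of farm_run merely stands in for the parallel control structure a real skeleton would provide. The point is that the programmer supplies only the routines unique to the application.

/* Hypothetical task-farm skeleton: the library owns the control
 * structure, the user supplies only the application-specific routines. */
#include <stdio.h>
#include <stddef.h>

typedef struct { int id; double input;  } task_t;
typedef struct { int id; double output; } result_t;

/* Stand-in skeleton body: a real implementation would distribute
 * compute() calls over worker processes or threads. */
static void farm_run(size_t ntasks,
                     task_t   (*make_task)(size_t),
                     result_t (*compute)(task_t),
                     void     (*combine)(result_t))
{
    for (size_t i = 0; i < ntasks; i++)
        combine(compute(make_task(i)));
}

/* Application-specific pieces supplied by the programmer: */
static task_t   make_task(size_t i) { task_t t = { (int)i, (double)i }; return t; }
static result_t square(task_t t)    { result_t r = { t.id, t.input * t.input }; return r; }

static double total;
static void add_up(result_t r) { total += r.output; }

int main(void)
{
    farm_run(1000, make_task, square, add_up);
    printf("total = %f\n", total);
    return 0;
}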

To summarise, there are basically two ways of looking at an explicit parallel programming system. In the first one,
the system provides some primitives to be used by the programmer. The structuring and the implementation of most
of the parallel control and communication is the responsibility of the programmer. The alternative is to provide some
enhanced support for those control structures that are common to a parallel programming paradigm. The main task
of the programmer would be to provide those few routines unique to the application, such as computation and data
generation. Numerous parallel programming environments are available, and many of them do attempt to exploit
the characteristics of parallel paradigms.

5.5 Methodical Design of Parallel Algorithms


There is no simple recipe for designing parallel algorithms. However, it can benefit from a methodological approach
that maximises the range of options, that provides mechanisms for evaluating alternatives, and that reduces the
cost of backtracking from wrong choices. The design methodology allows the programmer to focus on machine-
independent issues such as concurrency in the early stages of the design process, while machine-specific aspects of
design are delayed until late in the design process.

As suggested by Ian Foster, this methodology organises the design process into four distinct stages:
• Partitioning
• Communication
• Agglomeration
• Mapping

The first two stages seek to develop concurrent and scalable algorithms, and the last two stages focus on locality
and performance-related issues as summarised below:

5.5.1 Partitioning
It refers to decomposing the computational activities and the data on which they operate into several small tasks. The
decomposition of the data associated with a problem is known as domain/data decomposition, and the decomposition
of computation into disjoint tasks is known as functional decomposition.

5.5.2 Communication
It focuses on the flow of information and coordination among the tasks that are created during the partitioning
stage. The nature of the problem and the decomposition method determine the communication pattern among
these cooperative tasks of a parallel program. The four popular communication patterns commonly used in parallel
programs are: local/global, structured/unstructured, static/dynamic, and synchronous/asynchronous.

5.5.3 Agglomeration
In this stage, the tasks and communication structure defined in the first two stages are evaluated in terms of performance
requirements and implementation costs. If required, tasks are grouped into larger tasks to improve performance or to
reduce development costs. Also, individual communications may be bundled into a super communication. This will
help in reducing communication costs by increasing computation and communication granularity, gaining flexibility
in terms of scalability and mapping decisions, and reducing software-engineering costs.

5.5.4 Mapping
It is concerned with assigning each task to a processor such that it maximises utilisation of system resources (such as
CPU) while minimising the communication costs. Mapping decisions can be taken statically (at compile-time/before
program execution) or dynamically at runtime by load-balancing methods. Several grand challenge applications
have been built using the above methodology.

5.6 Parallel Programming Paradigms


It has been widely recognised that parallel applications can be classified into some well defined programming
paradigms. A few programming paradigms are used repeatedly to develop many parallel programs. Each paradigm
is a class of algorithms that have the same control structure.

Experience to date suggests that there are a relatively small number of paradigms underlying most parallel programs.
The choice of paradigm is determined by the available parallel computing resources and by the type of parallelism
inherent in the problem. The computing resources may define the level of granularity that can be efficiently supported
on the system. The type of parallelism reflects the structure of either the application or the data and both types may
exist in different parts of the same application. Parallelism arising from the structure of the application is named as
functional parallelism. In this case, different parts of the program can perform different tasks in a concurrent and
cooperative manner. But parallelism may also be found in the structure of the data. This type of parallelism allows
the execution of parallel processes with identical operation but on different parts of the data.

5.6.1 Choice of Paradigms


Most of the typical distributed computing applications are based on the very popular client/server paradigm. In this
paradigm, the processes usually communicate through Remote Procedure Calls (RPCs), but there is no inherent
parallelism in this sort of applications. They are instead used to support distributed services, and thus we do not
consider this paradigm in the parallel computing area. In the world of parallel computing there are several authors
who present a paradigm classification. Not all of them propose exactly the same one, but we can create a superset
of the paradigms detected in parallel applications.

For instance, one theoretical classification of parallel programs breaks them into three classes of
parallelism:
• Processor farms, which are based on replication of independent jobs
• Geometric decomposition, based on the parallelisation of data structures
• Algorithmic parallelism, which results in the use of data flow

Another study surveyed several parallel applications and identified the following set of paradigms:
• Pipelining and ring-based applications
• Divide and conquer
• Master/slave
• Cellular automata applications, which are based on data parallelism

Another author proposed a very appropriate classification. The problems were divided into a few decomposition
techniques, namely:
• Geometric decomposition: the problem domain is broken up into smaller domains and each process executes
the algorithm on each part of it.
• Iterative decomposition: some applications are based on loop execution where each iteration can be done in an
independent way. This approach is implemented through a central queue of runnable tasks, and thus corresponds
to the task-farming paradigm.
• Recursive decomposition: this strategy starts by breaking the original problem into several subproblems and
solving these in a parallel way. It clearly corresponds to a divide and conquer approach.
• Speculative decomposition: some problems can use a speculative decomposition approach: N solution techniques
are tried simultaneously, and (N-1) of them are thrown away as soon as the first one returns a plausible answer.
In some cases this could result optimistically in a shorter overall execution time.
• Functional decomposition: the application is broken down into many distinct phases, where each phase executes
a different algorithm within the same problem. The most used topology is the process pipelining.

A somewhat different classification was presented based on the temporal structure of the problems. The applications
were thus divided into:
• Synchronous problems, which correspond to regular computations on regular data domains
• Loosely synchronous problems, that are typified by iterative calculations on geometrically irregular data
domains
• Asynchronous problems, which are characterised by functional parallelism that is irregular in space and time
• Embarrassingly parallel applications, which correspond to the independent execution of disconnected components
of the same program

Synchronous and loosely synchronous problems present a somewhat different synchronisation structure, but both
rely on data decomposition techniques. According to an extensive analysis of 84 real applications, it was estimated
that these two classes of problems dominated scientific and engineering applications, being used in 76 percent of the
applications. Asynchronous problems, which are for instance represented by event-driven simulations, represented
10 percent of the studied problems. Finally, embarrassingly parallel applications that correspond to the master/slave
model, accounted for 14 percent of the applications.

The most systematic definition of paradigms and application templates was presented in [6]. It describes a generic
tuple of factors which fully characterises a parallel algorithm including: process properties (structure, topology and
execution), interaction properties, and data properties (partitioning and placement). That classification included
most of the paradigms referred to so far, albeit described in deeper detail.

The following paradigms are popularly used in parallel programming:


• Task-farming (or Master/Slave)
• Single Program Multiple Data (SPMD)
• Data pipelining
• Divide and conquer
• Speculative parallelism

5.6.2 Task-Farming (or Master/Slave)


The task-farming paradigm consists of two entities: master and multiple slaves. The master is responsible for
decomposing the problem into small tasks (and distributes these tasks among a farm of slave processes), as well as
for gathering the partial results in order to produce the final result of the computation. The slave processes execute
in a very simple cycle: get a message with the task, process the task, and send the result to the master. Usually, the
communication takes place only between the master and the slaves.

Task-farming may either use static load-balancing or dynamic load-balancing. In the first case, the distribution of
tasks is all performed at the beginning of the computation, which allows the master to participate in the computation
after each slave has been allocated a fraction of the work. The allocation of tasks can be done once or in a cyclic
way. Figure below presents a schematic representation of this first approach.

The other way is to use a dynamically load-balanced master/slave paradigm, which can be more suitable when the
number of tasks exceeds the number of available processors, or when the number of tasks is unknown at the start of
the application, or when the execution times are not predictable, or when we are dealing with unbalanced problems.
An important feature of dynamic load-balancing is the ability of the application to adapt itself to changing conditions
of the system, not just the load of the processors, but also a possible reconfiguration of the system resources. Due
to this characteristic, this paradigm can respond quite well to the failure of some processors, which simplifies the
creation of robust applications that are capable of surviving the loss of slaves or even the master.

At an extreme, this paradigm can also enclose some applications that are based on a trivial decomposition approach:
the sequential algorithm is executed simultaneously on different processors but with different data inputs. In such
applications there are no dependencies between different runs so there is no need for communication or coordination
between the processes.

This paradigm can achieve high computational speedups and an interesting degree of scalability. However, for a large
number of processors the centralised control of the master process can become a bottleneck to the applications. It
is, however, possible to enhance the scalability of the paradigm by extending the single master to a set of masters,
each of them controlling a different group of process slaves.


[Figure: the master distributes tasks to slaves 1-4, communicates with them while they work, collects the results, and terminates.]

Fig. 5.4 A static master/slave structure


(Source: http://www.buyya.com/cluster/v2chap1.pdf)
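
A sketch of a dynamically load-balanced master/slave program in MPI is shown below (an illustration, not taken from the text). The master hands out one task at a time; whenever a slave returns a result it immediately receives the next task, so faster slaves automatically end up doing more work. The task itself (do_task) is a dummy placeholder.

/* Dynamically load-balanced master/slave sketch with MPI. */
#include <stdio.h>
#include <mpi.h>

#define NTASKS   100
#define TAG_WORK 1
#define TAG_STOP 2

static double do_task(int t) { return (double)t * t; }   /* dummy work */

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                               /* master */
        MPI_Status st;
        double result, sum = 0.0;
        int next = 0, active = 0;

        /* seed every slave with one task (or stop it if no work is left) */
        for (int s = 1; s < size; s++) {
            if (next < NTASKS) {
                MPI_Send(&next, 1, MPI_INT, s, TAG_WORK, MPI_COMM_WORLD);
                next++; active++;
            } else {
                MPI_Send(&next, 0, MPI_INT, s, TAG_STOP, MPI_COMM_WORLD);
            }
        }

        /* collect results, handing out remaining tasks as slaves free up */
        while (active > 0) {
            MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);
            sum += result;
            active--;
            if (next < NTASKS) {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD);
                next++; active++;
            } else {
                MPI_Send(&next, 0, MPI_INT, st.MPI_SOURCE, TAG_STOP, MPI_COMM_WORLD);
            }
        }
        printf("sum of all results = %f\n", sum);
    } else {                                       /* slave */
        MPI_Status st;
        int task;
        double r;
        while (1) {
            MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;
            r = do_task(task);
            MPI_Send(&r, 1, MPI_DOUBLE, 0, TAG_WORK, MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;
}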

5.6.3 Single-Program Multiple-Data (SPMD)


The SPMD paradigm is the most commonly used paradigm. Each process executes basically the same piece of code
but on a different part of the data. This involves the splitting of application data among the available processors. This
type of parallelism is also referred to as geometric parallelism, domain decomposition, or data parallelism. Figure
below presents a schematic representation of this paradigm. Many physical problems have an underlying regular
geometric structure, with spatially limited interactions. This homogeneity allows the data to be distributed uniformly
across the processors, where each one will be responsible for a defined spatial area. Processors communicate with
neighbouring processors and the communication load will be proportional to the size of the boundary of the element,
while the computation load will be proportional to the volume of the element. It may also be required to perform
some global synchronisation periodically among all the processes. The communication pattern is usually highly
structured and extremely predictable. The data may initially be self-generated by each process or may be read from
the disk during the initialisation stage.

SPMD applications can be very efficient if the data is well distributed by the processes and the system is homogeneous.
If the processes present different workloads or capabilities, then the paradigm requires the support of some load-
balancing scheme able to adapt the data distribution layout during run-time execution. This paradigm is highly
sensitive to the loss of some process. Usually, the loss of a single process is enough to cause a deadlock in the
calculation in which none of the processes can advance beyond a global synchronisation point.

[Figure: data is distributed among the processes; each process alternately calculates on its own portion and exchanges boundary data with the others, and the results are finally collected.]

Fig. 5.5 Basic structure of a SPMD program


(Source: http://www.buyya.com/cluster/v2chap1.pdf)
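
A minimal SPMD sketch in MPI is given below (illustrative only). Every process executes the same program; its rank selects the block of the index range it works on, and a collective reduction combines the partial results. A realistic SPMD code would also exchange boundary (halo) data between neighbouring ranks.

/* SPMD sketch: same code on every process, different data blocks. */
#include <stdio.h>
#include <mpi.h>

#define N 1000000            /* global problem size (assumed for the example) */

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* simple block decomposition of indices 0..N-1 */
    int chunk = N / size;
    int lo = rank * chunk;
    int hi = (rank == size - 1) ? N : lo + chunk;

    double local = 0.0, global = 0.0;
    for (int i = lo; i < hi; i++)        /* same code, different data */
        local += 1.0 / (1.0 + (double)i);

    /* combine the partial sums on process 0 */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f\n", global);

    MPI_Finalize();
    return 0;
}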

5.6.4 Data Pipelining


This is a more fine-grained parallelism, which is based on a functional decomposition approach: the tasks of the
algorithm, which are capable of concurrent operation, are identified and each processor executes a small part of the
total algorithm. The pipeline is one of the simplest and most popular functional decomposition paradigms. Figure
below presents the structure of this model. Processes are organised in a pipeline; each process corresponds to a
stage of the pipeline and is responsible for a particular task. The communication pattern can be very simple since
the data flows between the adjacent stages of the pipeline. For this reason, this type of parallelism is also sometimes
referred to as data flow parallelism. The communication may be completely asynchronous. The efficiency of this
paradigm is directly dependent on the ability to balance the load across the stages of the pipeline. The robustness
of this paradigm against reconfigurations of the system can be achieved by providing multiple independent paths
across the stages. This paradigm is often used in data reduction or image processing applications.

[Figure: processes 1, 2 and 3 form a pipeline of phases A, B and C; input enters the first stage and output leaves the last.]

Fig. 5.6 Data pipeline structure


(Source: http://www.buyya.com/cluster/v2chap1.pdf)
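
The following MPI sketch (an illustration, not from the text) maps the three stages of Fig. 5.6 onto three processes: rank 0 produces data items, rank 1 transforms them, and rank 2 consumes them, with data flowing only between adjacent stages.

/* Three-stage data pipeline sketch; run with exactly 3 MPI processes. */
#include <stdio.h>
#include <mpi.h>

#define NITEMS 10

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size != 3) {
        if (rank == 0) printf("this sketch expects exactly 3 processes\n");
        MPI_Finalize();
        return 0;
    }

    for (int i = 0; i < NITEMS; i++) {
        double x;
        if (rank == 0) {                       /* stage A: produce */
            x = (double)i;
            MPI_Send(&x, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {                /* stage B: transform */
            MPI_Recv(&x, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            x = x * x;
            MPI_Send(&x, 1, MPI_DOUBLE, 2, 0, MPI_COMM_WORLD);
        } else {                               /* stage C: consume */
            MPI_Recv(&x, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("output item %d = %f\n", i, x);
        }
    }

    MPI_Finalize();
    return 0;
}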

5.6.5 Divide and Conquer


The divide and conquer approach is well known in sequential algorithm development. A problem is divided up
into two or more subproblems. Each of these subproblems is solved independently and their results are combined
to give the final result. Often, the smaller problems are just smaller instances of the original problem, giving rise
to a recursive solution. Processing may be required to divide the original problem or to combine the results of the
subproblems. In parallel divide and conquer, the subproblems can be solved at the same time, given sufficient
parallelism. The splitting and recombining process also makes use of some parallelism, but these operations require
some process communication. However, because the subproblems are independent, no communication is necessary
between processes working on different subproblems.


We can identify three generic computational operations for divide and conquer: split, compute, and join. The
application is organised in a sort of virtual tree: some of the processes create subtasks and have to combine the
results of those to produce an aggregate result. The tasks are actually computed by the compute processes at the
leaf nodes of the virtual tree. Figure below presents this execution.

[Figure: the main problem is split into sub-problems down a virtual tree, and the partial results are joined back up the tree.]

Fig. 5.7 Divide and conquer as a virtual tree


(Source: http://www.buyya.com/cluster/v2chap1.pdf)
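
The split/compute/join structure can be sketched as follows; OpenMP tasks are used here only as one convenient way of expressing it (the text does not prescribe a particular tool). The recursion splits the array until a cut-off size is reached, the leaves compute sequentially, and the joins add the partial sums back up the virtual tree.

/* Divide-and-conquer sum with OpenMP tasks.
 * Compile with, e.g., gcc -fopenmp dq.c */
#include <stdio.h>

#define N       (1 << 20)
#define CUTOFF  (1 << 12)   /* below this size, just compute sequentially */

static double dq_sum(const double *a, int n)
{
    if (n <= CUTOFF) {                     /* compute (leaf of the tree) */
        double s = 0.0;
        for (int i = 0; i < n; i++) s += a[i];
        return s;
    }

    double left, right;
    int half = n / 2;                      /* split */

    #pragma omp task shared(left)
    left = dq_sum(a, half);
    #pragma omp task shared(right)
    right = dq_sum(a + half, n - half);
    #pragma omp taskwait                   /* join */

    return left + right;
}

int main(void)
{
    static double a[N];
    for (int i = 0; i < N; i++) a[i] = 1.0;

    double total;
    #pragma omp parallel
    #pragma omp single
    total = dq_sum(a, N);

    printf("sum = %f (expected %d)\n", total, N);
    return 0;
}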

The task-farming paradigm can be seen as a slightly modified, degenerated form of divide and conquer; that is,
where problem decomposition is performed before tasks are submitted, the split and join operations are done only
by the master process and all the other processes are only responsible for the computation.

In the divide and conquer model, tasks may be generated during runtime and may be added to a single job queue
on the manager processor or distributed through several job queues across the system.

The programming paradigms can be mainly characterised by two factors: decomposition and distribution of the
parallelism. For instance, in geometric parallelism both the decomposition and distribution are static. The same
happens with the functional decomposition and distribution of data pipelining. In task farming, the work is statically
decomposed but dynamically distributed. Finally, in the divide and conquer paradigm both decomposition and
distribution are dynamic.

5.6.6 Speculative Parallelism


This paradigm is employed when it is quite difficult to obtain parallelism through any one of the previous paradigms.
Some problems have complex data dependencies, which reduces the possibilities of exploiting the parallel execution.
In these cases, an appropriate solution is to execute the problem in small parts but use some speculation or optimistic
execution to facilitate the parallelism.

In some asynchronous problems, like discrete-event simulation, the system will attempt the look-ahead execution
of related activities in an optimistic assumption that such concurrent executions do not violate the consistency of
the problem execution. Sometimes they do, and in such cases it is necessary to rollback to some previous consistent
state of the application.

Another use of this paradigm is to employ different algorithms for the same problem; the first one to give the final
solution is the one that is chosen.
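
A minimal sketch of this "competing algorithms" form of speculative parallelism is shown below (illustrative only; solver_a and solver_b are dummy stand-ins for two different algorithms). Both run concurrently, the first plausible answer is kept, and the other computation simply notices the shared flag and abandons its work.

/* Speculative (competitive) execution sketch using OpenMP sections.
 * Compile with, e.g., gcc -fopenmp speculative.c */
#include <stdio.h>

static volatile int done = 0;     /* set once some solver has an answer */
static double answer = 0.0;

/* dummy stand-ins for two alternative algorithms for the same problem */
static int solver_a(double *out) { *out = 42.0; return 1; }

static int solver_b(double *out)
{
    double x = 0.0;
    for (long i = 0; i < 100000000L && !done; i++)   /* poll the flag */
        x += 1e-8;
    *out = x;
    return !done;        /* only plausible if we were not beaten to it */
}

int main(void)
{
    #pragma omp parallel sections
    {
        #pragma omp section
        {
            double r;
            if (solver_a(&r)) {
                #pragma omp critical
                if (!done) { answer = r; done = 1; }
            }
        }
        #pragma omp section
        {
            double r;
            if (solver_b(&r)) {
                #pragma omp critical
                if (!done) { answer = r; done = 1; }
            }
        }
    }
    printf("first plausible answer: %f\n", answer);
    return 0;
}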

5.6.7 Hybrid Models


The boundaries between the paradigms can sometimes be fuzzy and, in some applications, there could be the need
to mix elements of different paradigms. Hybrid methods that include more than one basic paradigm are usually
observed in some large-scale parallel applications. These are situations where it makes sense to mix data and task
parallelism simultaneously or in different parts of the same program.

5.7 Programming Skeletons and Templates


The term skeleton has been identified by two important characteristics:
• It provides only an underlying structure that can be hidden from the user;
• It is incomplete and can be parameterised, not just by the number of processors, but also by other factors, such
as granularity, topology and data distribution.

Hiding the underlying structure from the user by presenting a simple interface results in programs that are easier
to understand and maintain, as well as less prone to error. In particular, the programmer can now focus on the
computational task rather than the control and coordination of the parallelism.

Exploiting the observation that parallel applications follow some well-identified structures, much of the parallel
software specific to the paradigm can be potentially reusable. Such software can be encapsulated in parallel libraries
to promote the reuse of code, reduce the burden on the parallel programmer, and to facilitate the task of recycling
existing sequential programs. This guideline was followed by the PUL project, the TINA system, and the ARNIA
package.

A project developed at the Edinburgh Parallel Computing Centre involved the writing of a package of parallel
utilities (PUL) on top of MPI that gives programming support for the most common programming paradigms as
well as parallel input/output. Apart from the libraries for global and parallel I/O, the collection of the PUL utilities
includes a library for task-farming applications (PUL-TF), another that supports regular domain decomposition
applications (PUL-RD), and another one that can be used to program irregular mesh-based problems (PUL-SM).
This set of PUL utilities hides the hard details of the parallel implementation from the application programmer and
provides a portable programming interface that can be used on several computing platforms. To ensure programming
flexibility, the application can make simultaneous use of different PUL libraries and have direct access to the MPI
communication routines.

The ARNIA package includes a library for master/slave applications, another for the domain decomposition paradigm,
a special library for distributed computing applications based on the client/server model, and a fourth library that
supports a global shared memory emulation. ARNIA allows the combined use of its building libraries for those
applications that present mixed paradigms or distinct computational phases.

A skeleton generator called TINA supports the reusability and portability of parallel program components and
provides a complete programming environment.

Another graphical programming environment, named TRACS, provides a graphical toolkit to design distributed/
parallel applications based on reusable components, such as farms, grids, and pipes.

Porting and rewriting application programs requires a support environment that encourages code reuse, portability
among different platforms, and scalability across similar systems of different size. This approach, based on skeletal
frameworks, is a viable solution for parallel programming. It can significantly increase programmer productivity
because programmers will be able to develop parts of programs simply by filling in the templates. The development
of software templates has been increasingly receiving the attention of academic research and is seen as one of the
key directions for parallel software.


The most important advantages of this approach for parallel programming are summarised below.

5.7.1 Programmability
A set of ready-to-use solutions for parallelisation will considerably increase the productivity of the programmers:
the idea is to hide the lower level details of the system, to promote the reuse of code, and relieve the burden of
the application programmer. This approach will increase the programmability of the parallel systems since the
programmer will have more time to spend in optimising the application itself, rather than on low-level details of
the underlying programming system.

5.7.2 Reusability
Reusability is a hot topic in software engineering. The provision of skeletons or templates to the application
programmer increases the potential for reuse by allowing the same parallel structure to be used in different applications.
This avoids the replication of efforts involved in developing and optimising the code specific to the parallel template.
It was reported that the percentage of code reuse rose from 30 percent up to 90 percent when using skeleton-oriented
programming. In the Chameleon system, 60 percent of the code was reusable, while it was reported that an average
fraction of 80 percent of the code was reused with the ROPE library.

5.7.3 Portability
Providing portability of the parallel applications is a problem of paramount importance. It allows applications
developed on one platform to run on another platform without the need for redevelopment.

5.7.4 Efficiency
There could be some conflicting trade-offs between optimal performance and portability/programmability. Both
portability and efficiency of parallel programming systems play an important role in the success of parallel
computing.

Summary
• Scalable computing clusters, ranging from a cluster of (homogeneous or heterogeneous) PCs or workstations,
to SMPs, are rapidly becoming the standard platforms for high-performance and large-scale computing.
• A cluster is a type of parallel or distributed processing system, which consists of a collection of interconnected
stand-alone computers working together as a single, integrated computing resource.
• A computer node can be a single or multiprocessor system (PCs, workstations, or SMPs) with memory, I/O
facilities, and an operating system.
• A cluster generally refers to two or more computers (nodes) connected together.
• The network interface hardware acts as a communication processor and is responsible for transmitting and
receiving packets of data between cluster nodes via a network/switch.
• Communication software offers a means of fast and reliable data communication among cluster nodes and to
the outside world.
• The class of applications that a cluster can typically cope with would be considered demanding sequential
applications and grand challenge/supercomputing applications.
• In modern computers, parallelism appears at various levels both in hardware and software: signal, circuit,
component, and system levels.
• The first two levels of parallelism (signal and circuit level) are handled implicitly by the hardware and are
called hardware parallelism.
• Parallelising compilers are still limited to applications that exhibit regular parallelism, such as computations
in loops.
• The High Performance Fortran (HPF) initiative seems to be a promising solution to solve the dusty-deck problem
of Fortran codes.
• Message passing libraries allow efficient parallel programs to be written for distributed memory systems.
• Virtual shared memory implements a shared-memory programming model in a distributed-memory
environment.
• Distributed Shared Memory (DSM) is the extension of the well-accepted shared memory programming model
on systems without physically shared memory.
• A programming paradigm is a class of algorithms that solve different problems but have the same control structure.

References
• Grama, 2010. An Introduction to Parallel Computing: Design and Analysis of Algorithms, 2/e, Pearson Education India.
• Bhujade, M. R., 1995. Parallel Computing, New Age International.
• Buyya, R., Parallel Programming Models and Paradigms [pdf] Available at: <http://www.buyya.com/cluster/
v2chap1.pdf> [Accessed 21 June 2012].
• Dr. Dobb’s, Designing Parallel Algorithms: Part 1 [Online] Available at: <http://www.drdobbs.com/article/pr
int?articleId=223100878&siteSectionName=parallel> [Accessed 21 June 2012].
• 2011. High-Performance Computing - Episode 1 - Introducing MPI [Video Online] Available at: <http://www.
youtube.com/watch?v=kHV6wmG35po> [Accessed 21 June 2012].
• 2009. Lec-7 Pipeline Concept-I [Video Online] Available at: <http://www.youtube.com/watch?v=AXgfeV568
c8&feature=results_video&playnext=1&list=PLD4E8A4E592F7A7D8> [Accessed 21 June 2012].

Recommended Reading
• Joubert, G. R., 2004. Parallel Computing: Software Technology, Algorithms, Architectures and Applications,
Elsevier.
• Pacheco, P., 2011. An Introduction to Parallel Programming, Elsevier.
• Quinn, 2003. Parallel Programming In C With Mpi And Open Mp, Tata McGraw-Hill Education.


Self Assessment
1. Which of the following statements is false?
a. Scalable computing clusters, ranging from a cluster of (homogeneous or heterogeneous) PCs or workstations,
to SMPs, are rapidly becoming the standard platforms for high-performance and large-scale computing.
b. A cluster is a type of parallel or distributed processing system, which consists of a collection of interconnected
stand-alone computers working together as a single, integrated computing resource.
c. A computer node can be a single or multiprocessor system (PCs, workstations, or SMPs) with memory, I/O
facilities, and an operating system.
d. The network interface software is responsible for transmitting and receiving packets of data between cluster
nodes via a network/switch.

2. A cluster generally refers to ______or more computers (nodes) connected together.


a. two
b. four
c. one
d. three

3. The network interface hardware acts as a communication _________.


a. software
b. processor
c. node
d. cluster

4. Communication software offers a means of fast and reliable data communication among ________and to the
outside world.
a. multiprocessors
b. message passing libraries
c. cluster nodes
d. compilers

5. Which of the following statements is false?


a. The class of applications that a cluster can typically cope with would be considered demanding sequential
applications and grand challenge/supercomputing applications.
b. In modern computers, parallelism appears at various levels both in hardware and software: signal, circuit,
component, and system levels.
c. Parallelising compilers are not limited to applications that exhibit regular parallelism, such as computations
in loops.
d. The High Performance Fortran (HPF) initiative seems to be a promising solution to solve the dusty-deck
problem of Fortran codes.

6. Which of the following statements is false?


a. Message passing libraries do not allow efficient parallel programs to be written for distributed memory systems.
b. Virtual shared memory implements a shared-memory programming model in a distributed-memory
environment.
c. Distributed Shared Memory (DSM) is the extension of the well-accepted shared memory programming
model on systems without physically shared memory.
d. A programming paradigm is a class of algorithms that solve different problems but have the same control
structure.

7. Providing ________of the parallel applications allows applications developed on one platform to run on another
platform without the need for redevelopment.
a. efficiency
b. portability
c. reusability
d. programmability

8. Which paradigm is the most commonly used paradigm?


a. SPMD
b. Data pipelining
c. Divide and conquer
d. Speculative parallelism

9. Which paradigm may either use static load-balancing or dynamic load-balancing?


a. Task-farming
b. Data pipelining
c. Divide and conquer
d. Speculative parallelism

10. Which of these is not one of the three generic computational operations for divide and conquer?
a. Split
b. Compute
c. Join
d. Replace


Chapter VI
Interconnection Networks for Parallel Computers

Aim
The aim of this chapter is to:

• define network topology

• explain metrics for interconnection networks

• elucidate star-connected network

Objectives
The objectives of this chapter are to:

• explain linear arrays

• compare mesh and ring topology

• explain tree network

Learning outcome
At the end of this chapter, you will be able to:

• classify interconnection networks

• understand static network

• describe multi-stage network

6.1 Introduction
One of the most important components of a multiprocessor machine is the interconnection network. In order to
solve a problem, the processing elements have to cooperate and to exchange their computed data over the network.
Both the shared memory and distributed memory architectures require an interconnection network to connect the
processor and memory or the modules respectively.

6.2 Network Topologies


A network topology describes how processors and memories are connected to other processors and memories. Formally, an interconnection network topology is a mapping function from the set of processors and memories onto the same set of processors and memories. A variety of network topologies have been proposed and implemented. These topologies trade off performance for cost and scalability. Commercial machines often implement hybrids of multiple topologies for reasons of packaging, cost, and available components.

6.3 Metrics for Interconnection Networks


The common metrics include:
• Diameter: the distance between the farthest two nodes in the network; a metric for worst-case latency.
• Bisection width: the minimum number of wires that must be cut to divide the network into two equal parts; a metric for worst-case bandwidth.
• Arc connectivity: the minimum number of wires that must be cut to partition the network into two (not necessarily equal) parts; a metric for fault tolerance. Arc connectivity is always ≤ bisection width.
• Cost: the number of links or switches (whichever is asymptotically higher) is an important contributor to cost. However, other factors, such as the ability to lay out the network and the length of the wires, also factor into the cost. Illustrative values of these metrics for a few of the topologies introduced below are sketched after this list.
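The following minimal sketch prints standard textbook values of the diameter and bisection width for three static topologies covered later in this chapter; the specific values and the assumption that p is a power of two are illustrative and not taken from this text.

#include <stdio.h>

/* Standard values (assumption: p is a power of two):
 *   completely connected: diameter 1,       bisection width p*p/4
 *   ring:                 diameter p/2,     bisection width 2
 *   hypercube:            diameter log2(p), bisection width p/2     */
void print_metrics(int p)
{
    int d = 0;
    for (int q = p; q > 1; q >>= 1)
        d++;                                   /* d = log2(p) */
    printf("completely connected: diameter 1, bisection width %d\n", (p / 2) * (p / 2));
    printf("ring:                 diameter %d, bisection width 2\n", p / 2);
    printf("hypercube:            diameter %d, bisection width %d\n", d, p / 2);
}

int main(void)
{
    print_metrics(16);                         /* e.g. a 16-node machine */
    return 0;
}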

6.4 Classification of Interconnection Networks


Interconnection networks carry data between processors and to memory. Interconnects are made of switches and
links (wires, fiber). Interconnects are classified as static or dynamic.

Fig. 6.1 Classification of interconnection networks (a) a static network (b) a dynamic network
(Source: http://www.cs.csi.cuny.edu/~yumei/csc770/Lec3-InnerconnectionNetworks.ppt)


6.5 Static Network


A static network consists of point-to-point communication links among processing nodes; it is also referred to as a direct network. In static networks the nodes are connected to each other directly by wires. Once the connections between nodes have been established, they cannot be changed.

6.5.1 Completely-connected Network


Each node has a direct communication link to every other node in the network. It is ideal in the sense that a node
can send a message to another node in a single step. It is the static counterpart of the crossbar switching network and is non-blocking.

Fig. 6.2 A completely-connected network of eight nodes


(Source: http://www.cs.csi.cuny.edu/~yumei/csc770/Lec3-InnerconnectionNetworks.ppt)

6.5.2 Star-Connected Network


One processor acts as the central processor. Every other processor has a communication link connecting it to this
central processor. It is similar to a bus-based network, and the central processor is the bottleneck. A star has one central node, which is connected to all other nodes. The star topology is the degenerate case of a tree. It is a tree of depth two and usually has a high fan-out. The central node acts as a switching network, routing messages from one processor
to another, rather like a telephone exchange.

This arrangement can be useful as it matches the way peripherals are configured about a computer. Usually a
computer will have only one screen, printer and backing store. With a star topology these resources would be placed at the central node, giving all processors equal access. Stars have a mixed degree. The terminal nodes
are only linked to the central node so their degree is one. The central node is linked to all the other processors, so
its degree is N-1. If the degree is a function of N, as in this case, it is said to be variable. The routing algorithm is
simple, and need only reside at the central node. It receives messages from a port and simply redirects them to the
port corresponding to the destination.

Extending the star would involve increasing the fan-out of the central node rather than the depth, as with a tree.
This makes the growth complexity one, which is not only better than most other topologies but also the best theoretically possible. The central node has to be modified in order to cope with the extra node; however, all the other nodes can remain the same.

The longest path starts at a terminal node, goes along the branch to the central node, and then back out to another terminal node; the diameter is, therefore, two links. As the diameter is not a function of N, it is said to be 'fixed'.
A disadvantage of this topology lies with the central processor. This can become a communication bottleneck;
consequently star networks tend to be heterogeneous, with the central processor being designed with a much higher
communications bandwidth than the others. The problem with this topology is obvious: every communication has to pass through the central node (the bottleneck).

Another, more serious, problem is that should the central processor fail, the whole network fails, along with all access
to peripherals. Again the central processor design needs to be different from the others. The reliability of the central
processor can be improved by using higher grade components and adding error tolerant memory. The bottleneck
at the central node makes it impractical to have many processors in such a computer. However, star networks are
commonly found in local area networks.

Fig. 6.3 Two representations of the star topology


(Source: http://www.gigaflop.demon.co.uk/comp/chapt3.htm#s3.2)

6.5.3 Linear Array


In a linear array, each node (except the two nodes at the ends) has two neighbors, one each to its left and right. An
extension of linear array can be a ring or 1-D torus (linear array with wraparound).

Fig. 6.4 Linear arrays (a) with no wraparound links (b) with wraparound link
(Source: http://www.cs.csi.cuny.edu/~yumei/csc770/Lec3-InnerconnectionNetworks.ppt)

2D-array
In a 2D-array, the processors are ordered to form a two-dimensional structure (square) so that each processor is connected to its four neighbours (north, south, east, west), except perhaps for the processors on the boundary.

6.5.4 Mesh
The simplest connection topology is an n-dimensional mesh. In a 1-D mesh all nodes are arranged in a line, where the interior nodes have two neighbours and the boundary nodes have one.

A 2-D mesh can be created by having each node in a two-dimensional array connected to all its four nearest neighbors.
A 3-D mesh is a generalisation of a 2-D mesh in three dimensions. If the free ends of a mesh wrap around to the opposite sides, the network is called a torus.


Fig. 6.5 Meshes (a) 2-D mesh with no wraparound (b) 2-D mesh with wraparound link (2-D torus) (c) a 3-D
mesh with no wraparound
(Source: http://www.cs.csi.cuny.edu/~yumei/csc770/Lec3-InnerconnectionNetworks.ppt)

Connecting the two ends of a 1-D mesh forms a ring.

Fig. 6.6 Ring


(Source: http://hal.archives-ouvertes.fr/docs/00/14/95/27/PDF/RR.pdf)

6.5.5 Tree Network


Trees are hierarchical structures that have some resemblance to natural trees. Trees start with a node at the top called the root; this node is connected to other nodes by 'edges' or 'branches'. These nodes may spawn further nodes, forming a multilayered structure. Nodes at one level can only connect to nodes in adjacent levels; furthermore, a node may only stem from one other node (that is, it may have only one parent), even though it may give rise to several nodes
(children). The connections are such that the branches are disjoint and there are no loops in the structure.

Binary tree
The basic (binary) tree has two sons for the root node. The interior nodes have three connections (two sons, and
one father), whereas all the leaves just have one father. There is only one path between any pair of nodes. A communication bottleneck occurs at the higher levels of the tree. The solution is to increase the number of communication links and switching nodes closer to the root.

Fig. 6.7 Binary tree
(Source: http://www.gigaflop.demon.co.uk/comp/chapt3.htm#s3.2)

Fat trees
Links higher up the tree potentially carry more traffic than those at the lower levels. For this reason, a variant
called a fat-tree fattens the links as we go up the tree. Trees can be laid out in 2D with no wire crossings. This is an
attractive property of trees.

Fig. 6.8 A fat tree network of 16 processing nodes


(Source: http://www.cs.csi.cuny.edu/~yumei/csc770/Lec3-InnerconnectionNetworks.ppt)

6.5.6 Hypercube
Simple cubes have three dimensions. ‘Hypercubes’ are produced by increasing the number of dimensions in a cube.
The term hypercube refers to any cube with four or more dimensions. A four-dimensional cube is called a ‘tesseract’
and can be thought of as two three-dimensional cubes whose corresponding vertices have been connected.

A hypercube configuration in dimension n has 2^n vertices and n·2^(n-1) edges. For instance, if n = 1, there are two vertices and one edge; in two dimensions, we have a square with four vertices and four edges. A vertex of a hypercube in dimension n can be represented by an n-bit vector. Two nodes are connected if and only if their bit representations differ in exactly one position.

In a ring, every processing element (PE) is wired to just two neighbours, whereas in a 4-D hypercube every PE is connected to four neighbours. If a node wants to communicate with a node that is not its neighbour, the message has to pass through the nodes that lie between the two communicating processors.
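As a hedged illustration of the bit-vector labelling described above (the node label and the 3-D example are assumptions chosen for this sketch, not taken from the text), the neighbours of a node can be generated by flipping one bit of its label, and the number of hops between two nodes equals the number of bit positions in which their labels differ:

#include <stdio.h>

/* In a d-dimensional hypercube, flipping bit k of a node's label gives its
 * neighbour across dimension k; the minimum number of hops between two
 * nodes is the Hamming distance between their labels. */
static int hamming_distance(unsigned a, unsigned b)
{
    unsigned x = a ^ b;
    int hops = 0;
    while (x) {
        hops += x & 1u;
        x >>= 1;
    }
    return hops;
}

int main(void)
{
    unsigned node = 5u;                /* label 101 in a 3-D hypercube */
    int d = 3;
    for (int k = 0; k < d; k++)
        printf("neighbour across dimension %d: %u\n", k, node ^ (1u << k));
    printf("hops from node 000 to node 111: %d\n", hamming_distance(0u, 7u));
    return 0;
}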


Fig. 6.9 Hypercube


(Source: http://hal.archives-ouvertes.fr/docs/00/14/95/27/PDF/RR.pdf)

Fig. 6.10 Construction of hypercubes from hypercubes of lower dimension


(Source: http://www.cs.csi.cuny.edu/~yumei/csc770/Lec3-InnerconnectionNetworks.ppt)

Another hypercubic network is the cube-connected cycles (CCC) network, a hypercube whose nodes are replaced by rings/cycles of length n so that the resulting network has constant degree. This network can also be viewed as a butterfly whose first and last levels collapse into one.

6.6 Dynamic Networks


These are built using switches (switching elements) and links. Communication links are connected to one another dynamically by the switches to establish paths among processing nodes and memory banks. In contrast to static networks, whose connections are fixed, dynamic networks can vary their topology at runtime by means of switches. Generally, dynamic networks are distinguished by their switching strategy and by their number of switching stages (single-stage or multi-stage networks). Single-stage networks have two main representatives: buses and crossbars. Bus systems are quite simple in concept: they connect all the components together using one data path. The peak bandwidth of a bus can be estimated by multiplying the width of the bus by its clock frequency. In a crossbar each node
has just two connections, leading to a central switch. The switch has a constant delay corresponding to one logic
gate independent of the number of nodes. Hence, a crossbar is a grid of logic gates and a number of nodes which
are connected via these switches.
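For illustration of the bus bandwidth estimate above (the figures are assumptions, not taken from the text): an 8-byte (64-bit) wide bus clocked at 100 MHz has a peak bandwidth of roughly 8 bytes × 100 × 10^6 cycles/s = 800 MB/s, which all processors attached to the bus must share.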

6.6.1 Bus-Based Networks
A bus is the simplest network: it consists of a shared medium that is common to all the nodes. The distance between any two nodes in the network is constant. It is ideal for broadcasting information among nodes. It is scalable in terms of cost, but not scalable in terms of performance.
• The bounded bandwidth of a bus places limitations on the overall performance as the number of nodes
increases.
• Typical bus-based machines are limited to dozens of nodes.

Sun Enterprise servers and Intel Pentium based shared-bus multiprocessors are examples of such architectures.


Fig. 6.11 Bus-based interconnects (a) with no local caches (b) with local memory/caches
(Source: http://sydney.edu.au/engineering/it/~comp5426/sem12012/LectureNotes/lecture02-3-12.pdf)

6.6.2 Crossbar Networks


The crossbar network is a conceptually simple interconnection network built as a two-dimensional grid of switches. It is a non-blocking network: it provides connectivity between inputs and outputs, and any input can be joined to any output. It employs a grid of switches or switching nodes to connect p processors to b memory banks. Non-blocking means that the connection of a processing node to a memory bank does not block the connection of any other processing node to any other memory bank.

The total number of switching nodes required is Θ(pb) (it is reasonable to assume b ≥ p). The crossbar is scalable in terms of performance but not in terms of cost. Examples of machines that employ crossbars include the Sun Ultra HPC 10000 and the Fujitsu VPP500.



Fig. 6.12 A completely non-blocking crossbar network connecting ‘p’ processors to ‘b’ memory banks
(Source: http://sydney.edu.au/engineering/it/~comp5426/sem12012/LectureNotes/lecture02-3-12.pdf)

6.6.3 Multistage Networks


Multistage networks are an intermediate class of networks between bus-based and crossbar networks. They are blocking networks: access to a memory bank by one processor may disallow access to another memory bank by another processor. They are more scalable than bus-based networks in terms of performance and more scalable than crossbar networks in terms of cost.


Fig. 6.13 The schematic of a typical multistage interconnection network


(Source: http://www.cs.csi.cuny.edu/~yumei/csc770/Lec3-InnerconnectionNetworks.ppt)

6.6.4 Omega Network
An omega network consists of log p stages, where p is the number of inputs (processing nodes) and also the number of outputs (memory banks). Each stage consists of an interconnection pattern that connects p inputs and p outputs as follows:

j = 2i            for 0 ≤ i ≤ p/2 − 1
j = 2i + 1 − p    for p/2 ≤ i ≤ p − 1

This pattern is the perfect shuffle: output j is the left rotation of the binary representation of input i. For p = 8:

000 = left_rotate(000)    100 = left_rotate(010)
001 = left_rotate(100)    101 = left_rotate(110)
010 = left_rotate(001)    110 = left_rotate(011)
011 = left_rotate(101)    111 = left_rotate(111)

Fig. 6.14 A perfect shuffle interconnection for eight inputs and outputs
(Source: http://sydney.edu.au/engineering/it/~comp5426/sem12012/LectureNotes/lecture02-3-12.pdf)

Each switch has two connection modes:


• Pass-through connection: the inputs are sent straight through to the outputs.
• Cross-over connection: the inputs to the switching node are crossed over and then sent out.

A complete omega network combines the perfect shuffle interconnects with these switches, as illustrated below. It has (p/2) × log p switching nodes, so the cost of such a network grows as Θ(p log p).

Fig. 6.15 A complete omega network connecting eight inputs and eight outputs
(Source: http://www.cs.csi.cuny.edu/~yumei/csc770/Lec3-InnerconnectionNetworks.ppt)


Multistage omega network – routing


Let s be the binary representation of the source and d that of the destination processor. The data traverses the link to the first switching node. If the most significant bits of s and d are the same, the data is routed in pass-through mode by the switch; otherwise, it is routed in cross-over mode. This process is repeated, using the next most significant bit, for each of the log p switching stages. Note that the omega network is not non-blocking.
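A minimal sketch of this routing rule, assuming p is a power of two (the function and variable names are illustrative, not part of any standard API):

#include <stdio.h>

/* At each of the log p stages, compare the corresponding bit of the source
 * and destination labels (most significant bit first): equal bits mean the
 * switch is set to pass-through, different bits mean cross-over. */
void omega_route(unsigned s, unsigned d, int stages)
{
    for (int k = stages - 1; k >= 0; k--) {
        unsigned sbit = (s >> k) & 1u;
        unsigned dbit = (d >> k) & 1u;
        printf("stage %d: %s\n", stages - k,
               (sbit == dbit) ? "pass-through" : "cross-over");
    }
}

int main(void)
{
    omega_route(2u, 7u, 3);            /* source 010, destination 111, p = 8 */
    return 0;
}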

Fig. 6.16 An example of blocking in omega network: one of the messages (010 to 111 or 110 to 100) is
blocked at link AB
(Source: http://parallelcomp.uw.hu/ch02lev1sec4.html)

Summary
• Network topologies describe how to connect processors and memories to other processors and memories.
• Interconnection networks carry data between processors and to memory.
• Interconnects are made of switches and links (wires, fiber). Interconnects are classified as static or dynamic.
• A star has one central node, which is connected to all other nodes. The star topology is the degenerate case of
a tree.
• Star networks are commonly found in local area networks.
• In a linear array, each node (except the two nodes at the ends) has two neighbors, one each to its left and
right.
• The simplest connection topology is an n-dimensional mesh. In a 1-D mesh all nodes are arranged in a line, where the interior nodes have two neighbours and the boundary nodes have one.
• Trees are hierarchical structures that have some resemblance to natural trees.
• Trees start with a node at the top called the root, this node is connected to other nodes by ‘edges’ or
‘branches’.
• The basic (binary) tree has two sons for the root node. The interior nodes have three connections (two sons, and
one father), whereas all the leaves just have one father.
• Simple cubes have three dimensions.
• ‘Hypercubes’ are produced by increasing the number of dimensions in a cube. The term hypercube refers to any
cube with four or more dimensions.
• Communication links are connected to one another dynamically by the switches to establish paths among
processing nodes and memory banks.
• The bus is the simplest network: it consists of a shared medium that is common to all the nodes. The distance between any two nodes in the network is constant.

References
• Grama, 2010. An Introduction to Parallel Computing: Design and Analysis of Algorithms, 2/e, Pearson Education
India.
• Sengupta, J., 2005. Interconnection Networks For Parallel Processing, Deep and Deep Publications.
• Springer, Lecture 3 Interconnection Networks for Parallel Computers [Online] Available at: <http://www.cs.csi.
cuny.edu/~yumei/csc770/Lec3-InnerconnectionNetworks.ppt> [Accessed 21 June 2012].
• Physical Organization of Parallel Platforms [Online] Available at: <http://parallelcomp.uw.hu/ch02lev1sec4.
html> [Accessed 21 June 2012].
• 2012. Network Topology [Video Online] Available at: <http://www.youtube.com/watch?v=POkzLHoZJ0Y>
[Accessed 21 June 2012].
• 2009. Computer Networking Tutorial - 3 - Network Topology [Video Online] Available at: <http://www.youtube.
com/watch?v=kfEDPQAYH4k> [Accessed 21 June 2012].

Recommended Reading
• Treleaven, P. C. & Vanneschi, M., 1987. Future Parallel Computers: An Advanced Course, Pisa, Italy, June
9-20, 1986, Proceedings, Springer.
• Rauber, T. & Rünger, G., 2010. Parallel Programming: For Multicore and Cluster Systems, Springer.
• Duato, 2003. Interconnection Networks: An Engineering Approach, Morgan Kaufmann.


Self Assessment
1. Interconnection networks carry data between processors and to _________.
a. memory
b. nodes
c. shared medium
d. memory banks

2. A star has one central _______, which is connected to all other nodes.
a. processing element
b. memory bank
c. node
d. processor

3. Simple cubes have ________ dimensions.


a. four
b. three
c. two
d. six

4. Which term refers to any cube with four or more dimensions?


a. Hypercube
b. Tesseract
c. Mesh
d. Torus

5. Connecting the two ends of a 1-D mesh, it forms a ______.


a. hypercube
b. torus
c. ring
d. tesseract

6. Which of these are hierarchical structures that have some resemblance to natural trees?
a. Trees
b. Hypercube
c. Torus
d. Tesseract

7. __________have a mixed degree.


a. Stars
b. Trees
c. Linear arrays
d. Rings

8. What refers to the minimum number of wires that needs to be cut to divide the network into two equal parts?
a. Arc connectivity
b. Bisection width
c. Diameter
d. Scalability

9. Arc connectivity is always ≤ __________.


a. Bisection width
b. Cost
c. Diameter
d. Scalability

10. In a _______every processing element (PE) is wired with just two neighbors.
a. ring
b. mesh
c. torus
d. hypercube


Chapter VII
Parallel Sorting

Aim
The aim of this chapter is to:

• define sorting

• classify parallel sorting

• elucidate merge-based parallel sorting

Objectives
The objectives of this chapter are to:

• explain splitter-based parallel sorting

• enlist the advantages of histogram sort

• explain splitter-based basic histogram sort

Learning outcome
At the end of this chapter, you will be able to:

• understand sorting by regular sampling

• identify bitonic sort

• describe sample sort

7.1 Introduction
Sorting is the process of reordering a sequence taken as input and producing one that is ordered according to
an attribute. Parallel sorting is the process of using multiple processing units to collectively sort an unordered
sequence. The unsorted sequence is composed of disjoint sub-sequences, each of which is associated with a unique
processing unit. Parallel sorting produces a fully sorted sequence composed of ordered sub-sequences, each of which
is associated with a unique processing unit. The produced sequences are typically ordered according to the given
processor ordering and are of roughly equal length.

Sorting is one of the most common operations performed by a computer. Because sorted data are easier to manipulate
than randomly-ordered data, many algorithms require sorted data. Sorting is of additional importance to parallel
computing because of its close relation to the task of routing data among processes, which is an essential part of
many parallel algorithms. Many parallel sorting algorithms have been investigated for a variety of parallel computer
architectures.

Sorting is defined as the task of arranging an unordered collection of elements into monotonically increasing
(or decreasing) order. Specifically, let S = <a1, a2, ..., an> be a sequence of n elements in arbitrary order; sorting transforms S into a monotonically increasing sequence S' = <a1', a2', ..., an'> such that ai' ≤ aj' for 1 ≤ i ≤ j ≤ n, and S' is a permutation of S.

There are n unsorted keys, distributed evenly over p processors. The distribution of keys in the range is unknown
and possibly skewed. The objective of parallel sorting is to:
• Sort the data globally according to keys
• Ensure no processor has more than (n/p) + threshold keys

The majority of parallel sorting algorithms can be classified as either merge-based or splitter-based.

7.2 Merge-based Parallel Sorting


Merge-based parallel sorting algorithms rely on merging data between pairs of processors. Such algorithms generally
achieve a globally sorted order by constructing a sorting network between processors, which facilitates the necessary
sequence of merges. Merge-based parallel sorting algorithms have been extensively studied, though largely in the
context of sorting networks, which consider n/p ~1. For n/p>>1, these algorithms begin to suffer from heavy use of
communication and difficulty of load balancing. Therefore, splitter-based algorithms have been the primary choice
on most modern machines due to their minimal communication attribute and scalability.

7.3 Splitter-Based Parallel Sorting


Splitter-based parallel sorting algorithms aim to define a vector of splitters that subdivides the data into p approximately
equalised sections. A splitter is a key that partitions the global set of keys at a desired location. Each of the p-1
splitters is simply a key-value within the dataset. Once these splitters are determined, each processor can send
the appropriate portions of its data directly to the destination processors, which results in one round of all-to-all
communication. Notably, such splitter-based algorithms utilise minimal data movement, since the data associated
with each key only move once.
• p-1 global splitters are needed to subdivide the data into p contiguous chunks (a minimal bucketing sketch follows this list).
• Each processor can send out its local data based on the splitters; data moves only once.
• Each processor merges the data chunks as it receives them.
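A minimal bucketing sketch, assuming integer keys and the p-1 splitters stored in ascending order (the function name and the convention that keys equal to a splitter go to the higher-ranked processor are assumptions of this illustration):

/* Returns the rank (0 .. p-1) of the destination processor for a key, given
 * the p-1 global splitters in ascending order, using a binary search for the
 * first splitter that is greater than the key. */
int dest_processor(int key, const int *splitters, int p)
{
    int lo = 0, hi = p - 1;            /* candidate destinations 0 .. p-1 */
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (key < splitters[mid])
            hi = mid;
        else
            lo = mid + 1;
    }
    return lo;
}

Each processor applies this (or an equivalent binary search of the splitters within its locally sorted data) to form the p chunks exchanged in the single all-to-all step.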


Fig. 7.1 Splitter-based parallel sorting


(Source: http://charm.cs.illinois.edu/talks/SortingIPDPS10.pdf)

Fig. 7.2 Splitter on key density function


(Source: http://charm.cs.illinois.edu/talks/SortingIPDPS10.pdf)

7.4 Splitter-based Basic Histogram Sort


It uses iterative guessing to find splitters.
• O(p) probe rather than O(p²) combined sample
• Probe refinement based on global histogram
Histogram is calculated by applying splitters to data.


Fig. 7.3 Basic histogram sort


(Source: http://charm.cs.illinois.edu/talks/SortingIPDPS10.pdf)

The advantages are:


• Splitter-based: single all-to-all data transpose
• Can achieve arbitrarily small threshold
• Probing technique is scalable compared to sample sort, O(p) vs O(p²)
• Allows good overlap between communication and computation

The disadvantages are:


• Harder to implement
• Running time dependent on data distribution

However, newer and much larger architectures have changed the problem statement further. Therefore, traditional
approaches, including splitter-based sorting algorithms, require reevaluation and improvement. We will briefly
detail some of these methods below.

7.5 Bitonic Sort


Bitonic sort, a merge-based algorithm, was one of the earliest procedures for parallel sorting. It was introduced in
1968 by Batcher. The basic idea behind bitonic sort is to sort the data by merging bitonic sequences. Bitonic sort is
based on repeatedly merging two bitonic sequences to form a larger bitonic sequence.

A bitonic sequence is a sequence of values a0, ..., an-1 with the property that there exists an index i, where 0 ≤ i ≤ n-1, such that a0 through ai is monotonically increasing and ai through an-1 is monotonically decreasing, or there exists a cyclic shift of indices so that the first condition is satisfied.

A bitonic sequence increases monotonically and then decreases monotonically. For n/p = 1, Ɵ(lg n) merges are required, with each merge having a cost of Ɵ(lg n). The combined running time of Bitonic Sort for n/p = 1 is therefore Ɵ(lg² n). Bitonic Sort can be generalised for n/p > 1, with a complexity of Ɵ(n lg² n). Adaptive Bitonic Sorting, a modification of Bitonic Sort, avoids unnecessary comparisons, which results in an improved, optimal complexity of Ɵ(n lg n).
Unfortunately, each step of bitonic sort requires movement of data between pairs of processors. Like most merge-
based algorithms, bitonic sort can perform very well when n/p (where n is the total number of keys and p is the
number of processors) is small, since it operates in-place and effectively combines messages.


On the other hand, its performance quickly degrades as n/p becomes large, which is the much more realistic scenario
for typical scientific applications. The major drawback of bitonic sort on modern architectures is that it moves the
data Ɵ(lg p) times, which turns into a costly bottleneck if we scale to higher problem sizes or a larger number of processors. Since this algorithm is old and very well-studied, we will not go into any deep analysis or testing of it. Nevertheless, Bitonic Sort has laid the groundwork for much of parallel sorting research and continues to influence
modern algorithms. One good comparative study of this algorithm has been documented by Blelloch et al.

7.6 Sample Sort


Sample sort is a popular and widely analysed splitter based method for parallel sorting. This algorithm acquires a
sample of data of size s from each processor, and then combines the samples on a single processor. This processor
then produces p-1 splitters from the sp-sized combined sample and broadcasts them to all other processors. The
splitters allow each processor to send each key to the correct final destination immediately. Some implementations
of sample sort also perform localised load balancing between neighboring processors after the all-to-all.

The sample is typically regularly spaced in the locally sorted data, with s = p-1. The worst-case final load imbalance is 2 × (n/p) keys; in practice, the load imbalance is typically very small. The combined sample becomes a bottleneck, since s × p ~ p². With 64-bit keys, if p = 8192, the sample is 16 GB.


Fig. 7.4 Sample Sort


(Source: http://charm.cs.illinois.edu/talks/SortingIPDPS10.pdf)

Sorting by regular sampling


The sorting by regular sampling technique is a reliable and practical variation of sample sort that uses a sample size
of s = p-1. The algorithm is simple and executes as follows.
• Each processor sorts its local data.
• Each processor selects a sample vector of size p-1 from its local data. The kth element of the sample vector is
element (n/p) × ((k+1)/p) of the local data.
• The samples are sent to and merged on processor 0, producing a combined sorted sample of size p(p-1).
• Processor 0 defines and broadcasts a vector of p-1 splitters, with the kth splitter as element p(k + 1/2) of the combined sorted sample.
• Each processor sends its local data to the appropriate destination processors, as defined by the splitters, in one
round of all-to-all communication.
• Each processor merges the data chunks it receives.

Sorting by random sampling
Sorting by random sampling is an interesting alternative to regular sampling. The main difference between the two
sampling techniques is that a random sample is flexible in size and collected randomly from each processor’s local
data rather than as a regularly spaced sample. The advantage of sorting by random sampling is that often sufficient
load balance can be achieved for s < p, which allows for potentially better scaling.

Additionally, a random sample can be retrieved before sorting local data, which allows for overlap between sorting
and splitter calculation. However, the technique is not wholly reliable and can result in severe load imbalance,
especially on a larger number of processors.

7.7 Radix Sort


Radix sort is a sorting method that uses the binary representation of keys to migrate them to the appropriate bucket
in a series of steps. During every step, the algorithm puts every key in a bucket corresponding to the value of some
subset of the key’s bits. A k-bit radix sort looks at k bits every iteration. Thus, a 16-bit radix on 64-bit keys would
take 4 steps and use 2^16 buckets every step. The algorithm correctly sorts the keys by starting with the least significant bits of the keys and moving the keys out of the lower indexed buckets first.
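As a hedged illustration of one counting pass (this is a sequential sketch using an 8-bit digit rather than the 16-bit radix of the example above, so that the bucket arrays stay small; the names are illustrative):

#define RADIX_BITS 8
#define BUCKETS    (1 << RADIX_BITS)

/* One stable counting pass of an LSD radix sort on 64-bit keys: histogram
 * the current digit, prefix-sum the counts to get each bucket's starting
 * offset, then scatter the keys into 'out' in bucket order. */
void radix_pass(const unsigned long long *in, unsigned long long *out,
                int n, int shift)
{
    int count[BUCKETS] = {0};
    int offset[BUCKETS];

    for (int i = 0; i < n; i++)
        count[(in[i] >> shift) & (BUCKETS - 1)]++;

    offset[0] = 0;
    for (int b = 1; b < BUCKETS; b++)
        offset[b] = offset[b - 1] + count[b - 1];

    for (int i = 0; i < n; i++)
        out[offset[(in[i] >> shift) & (BUCKETS - 1)]++] = in[i];
}

For 64-bit keys the pass would be repeated with shift = 0, 8, ..., 56, swapping the roles of the input and output arrays between passes.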

Radix sort can be parallelised simply by assigning some subset of buckets to each processor. In addition, it can
deal with uneven distributions efficiently by assigning a varying number of buckets to all processors every step.
This number can be determined by having each processor count how many of its keys will go to each bucket, then
summing up these histograms with a reduction. Once a processor receives the combined histogram, it can adaptively
assign buckets to processors. Radix sort is not a comparison-based sort. However, it is a reasonable assumption to
equate a comparison operation to a bit manipulation, since both are likely to be dominated by the memory access
time. Nevertheless, radix sort is not bound by Ɵ((n lg n)/p), as any comparison-based parallel sorting algorithm
would be.

In fact, this algorithm’s complexity varies linearly with n. The performance of the sort can be expressed as Ɵ(bn/p),
where b is the number of bits in a key. This expression is evident in that doubling the number of bits in the keys
entails doubling the number of iterations of radix sort.

The main drawback to parallel radix sort is that it requires multiple iterations of costly all-to-all data exchanges.
The cache efficiency of this algorithm can also be comparatively weak. In a comparison-based sorting algorithm,
we generally deal with contiguous allocations of keys. During sequential sorting (specifically in the partitioning
phase of Quicksort), we iterate through keys with only two iterators and swap them between two already accessed
locations.

Communication in comparison-based sorting is also cache efficient because we can usually copy sorted blocks into
messages. In radix sort, at every iteration any given key might be moved to any bucket (there are 64 thousand of these
for a 16-bit radix), completely independent of the destination of the previously indexed key. However, Thearling et
al. propose a clever scheme for improving the cache efficiency during the counting stage.

Nevertheless, radix sort is a simple and commonly accepted approach to parallel sorting. Therefore, despite its
limitations, we implemented Radix Sort and analysed its performance.

7.8 Histogram Sort


Histogram sort is another splitter-based method for parallel sorting. Like sample sort, histogram sort also determines
a set of p-1 splitters to divide the keys into p evenly sized sections. However, it achieves this task more accurately
by taking an iterative approach rather than simply collecting one big sample. The procedure begins by broadcasting
k (where k ≥p -1) splitter guesses, which we call a probe, from the initiating processor to every other processor.
These initial guesses are usually spaced evenly over the data range. Once the probe is received, the following steps
are performed.


• Each processor sorts its local data.


• Each processor determines how many of its keys fit into every range produced by the guesses, creating a histogram (see the sketch after this list).
• A reduction sums up these histograms from every processor. Then one processor analyses the data sequentially to see which splitter guesses were satisfactory (that is, fell within ½ t_thresh of the ideal splitter location).
• If there are any unsatisfied splitters, a new probe is generated sequentially then broadcasted to all processors,
returning to step 2. If all splitters have been satisfied, we continue to the next step.
• The desired splitting of data has been achieved, so the p-1 finalised splitters and the numbers of keys expected
to end up on every processor are broadcasted.
• Each processor sends its local data to the appropriate destination processors, as defined by the splitters, in one
round of all-to-all communication.
• Each processor merges the data chunks it receives.
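A minimal sketch of the local histogramming step, assuming integer keys, locally sorted data and k sorted probe guesses (the names are illustrative, not taken from the source):

/* Counts how many locally sorted keys fall into each of the k+1 ranges
 * induced by the k sorted probe guesses; the element-wise sum of these
 * local histograms over all processors is the global histogram used to
 * refine the probe. */
void local_histogram(const int *sorted_keys, int n,
                     const int *probe, int k, int *hist)
{
    int r = 0;                          /* index of the current range, 0 .. k */
    for (int j = 0; j <= k; j++)
        hist[j] = 0;
    for (int i = 0; i < n; i++) {
        while (r < k && sorted_keys[i] >= probe[r])
            r++;                        /* advance past guesses already exceeded */
        hist[r]++;
    }
}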

We used histogram sort as the basis for our work because it has the essential quality of only moving the actual data
once, combined with an efficient method for dealing with uneven distributions. In fact, histogram sort is unique in
its ability to reliably achieve a defined level of load balance. Therefore, we decided this algorithm is a theoretically
well suited base for scaling sorting on modern architectures.

7.9 Odd-even Transposition Sort on Linear Array


N numbers can be sorted on an N-cell linear array in O(N) time: the processors alternate operations with their
neighbours.
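A minimal sequential sketch of the same idea (one element per cell is assumed; phases alternate between the two pairings of neighbours, matching the trace shown in Fig. 7.5 below):

/* Odd-even transposition sort: n phases, each comparing and exchanging
 * disjoint neighbouring pairs, alternating between the two pairings. */
void odd_even_transposition_sort(int *a, int n)
{
    for (int phase = 0; phase < n; phase++) {
        int start = (phase % 2 == 0) ? 0 : 1;   /* which pairing this phase uses */
        for (int i = start; i + 1 < n; i += 2) {
            if (a[i] > a[i + 1]) {
                int t = a[i];
                a[i] = a[i + 1];
                a[i + 1] = t;
            }
        }
    }
}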

step 1:  4 3 1 2
step 2:  3 4 1 2
step 3:  3 1 4 2
step 4:  1 3 2 4

Fig. 7.5 Odd-even transposition sort on linear array


(Source: http://www.corelab.ntua.gr/courses/parallel.postgrad/Sorting.pdf)

Summary
• Sorting is the process of reordering a sequence taken as input and producing one that is ordered according to
an attribute.
• Parallel sorting is the process of using multiple processing units to collectively sort an unordered sequence.
• Because sorted data are easier to manipulate than randomly-ordered data, many algorithms require sorted
data.
• Sorting is defined as the task of arranging an unordered collection of elements into monotonically increasing
(or decreasing) order.
• Merge-based parallel sorting algorithms rely on merging data between pairs of processors.
• Splitter-based parallel sorting algorithms aim to define a vector of splitters that subdivides the data into p
approximately equalised sections.
• Bitonic sort, a merge-based algorithm, was one of the earliest procedures for parallel sorting. It was introduced
in 1968 by Batcher.
• Bitonic sort is based on repeatedly merging two bitonic sequences to form a larger bitonic sequence.
• A bitonic sequence increases monotonically then decreases monotonically.
• Sample sort is a popular and widely analysed splitter based method for parallel sorting.
• The advantage of sorting by random sampling is that often sufficient load balance can be achieved for s < p,
which allows for potentially better scaling.
• Radix sort is a sorting method that uses the binary representation of keys to migrate them to the appropriate
bucket in a series of steps.
• Radix sort can be parallelised simply by assigning some subset of buckets to each processor.
• The main drawback to parallel radix sort is that it requires multiple iterations of costly all-to-all data
exchanges.
• Like sample sort, histogram sort also determines a set of p-1 splitters to divide the keys into p evenly sized
sections.

References
• Grama, 2010. An Introduction to Parallel Computing: Design and Analysis of Algorithms, 2/e, Pearson Education
India.
• Feitelson, G., 2002. Job Scheduling Strategies for Parallel Processing, Springer.
• 2010, Highly Scalable Parallel Sorting [pdf] Available at: <http://charm.cs.illinois.edu/talks/SortingIPDPS10.
pdf> [Accessed 21 June 2012].
• Sorting [pdf] Available at: <http://www.corelab.ntua.gr/courses/parallel.postgrad/Sorting.pdf> [Accessed 21
June 2012].
• 2009. Algorithms Lesson 3: Merge Sort [Video Online] Available at: <http://www.youtube.com/
watch?v=GCae1WNvnZM> [Accessed 21 June 2012].
• 2012. Radix Sort Tutorial [Video Online] Available at: <http://www.youtube.com/watch?v=xhr26ia4k38>
[Accessed 21 June 2012].

Recommended Reading
• Roosta, S. H., 2000. Parallel Processing and Parallel Algorithms: Theory and Computation, Springer.
• Pacheco, P., 2011. An Introduction to Parallel Programming, Elsevier.
• Culler, D. E., Singh, J. & Gupta, A., 1999. Parallel Computer Architecture: A Hardware/Software Approach,
Gulf Professional Publishing.


Self Assessment
1. What refers to the process of reordering a sequence taken as input and producing one that is ordered according
to an attribute?
a. Merging
b. Sorting
c. Splitting
d. Sampling

2. Bitonic sort, a merge-based algorithm, was introduced in _________ by Batcher.


a. 1968
b. 1978
c. 1958
d. 1948

3. Which of the following is a key that partitions the global set of keys at a desired location?
a. Splitter
b. Processor
c. Bitonic sequence
d. Probe

4. The sorting by regular sampling technique is a reliable and practical variation of sample sort that uses a sample
size of s = ____________.
a. p-2
b. n/p
c. p-1
d. p/n

5. Bitonic sort is based on repeatedly merging ______bitonic sequences to form a larger bitonic sequence.
a. two
b. three
c. four
d. six

6. Histogram is calculated by applying ________to data.


a. splitters
b. probes
c. keys
d. sequences

7. The major drawback of bitonic sort on modern architectures is that it moves the data ________times.
a. Ɵ (log p)
b. Ɵ (lg p)
c. Ɵ (lg p2)
d. Ɵ (lg p3)

8. The performance of the sort can be expressed as Ɵ(bn/p), where b is the number of ___in a key.
a. bits
b. probes
c. sequences
d. splitters

9. Communication in comparison-based sorting is also cache efficient because we can usually _________sorted
blocks into messages.
a. copy
b. merge
c. replace
d. split

10. Parallel sorting is the process of using multiple processing units to collectively sort an _______.
a. ordered sequence
b. unordered sequence
c. attribute
d. equalised section


Chapter VIII
Message-Passing Programming

Aim
The aim of this chapter is to:

• define message passing 

• explain the principles of message passing 

• describe message passing model

Objectives
The objectives of this chapter are to:

• explain non-buffered blocking message passing operations

• enlist the steps for voiding deadlocks

• examine the buffered blocking message passing operations

Learning outcome
At the end of this chapter, you will be able to:

• understand MPI

• enlist the minimal set of MPI routines

• identify the basic functions for sending and receiving messages in MPI

8.1 Principles of Message-Passing Programming
The logical view of a machine supporting the message passing paradigm consists of p processes, each with its own
exclusive address space. Each data element must belong to one of the partitions of the space; hence, data must be
explicitly partitioned and placed. All interactions (read-only or read/write) require cooperation of two processes - the
process that has the data and the process that wants to access the data. These two constraints, while onerous, make
underlying costs very explicit to the programmer.

Message-passing programs are often written using the asynchronous or loosely synchronous paradigms. In the
asynchronous paradigm, all concurrent tasks execute asynchronously. In the loosely synchronous model, tasks
or subsets of tasks synchronise to perform interactions. Between these interactions, tasks execute completely
asynchronously. Most message-passing programs are written using the single program multiple data (SPMD)
model.


Fig. 8.1 Message Passing Model


(Source: http://www.cs.umsl.edu/~sanjiv/classes/cs5740/lectures/mpi.pdf)

8.2 The Building Blocks: Send and Receive Operations


The prototypes of these operations are as follows:
send(void *sendbuf, int nelems, int dest)
receive(void *recvbuf, int nelems, int source)

Consider the following code segments:

P0:                          P1:
a = 100;                     receive(&a, 1, 0);
send(&a, 1, 1);              printf("%d\n", a);
a = 0;

The semantics of the send operation require that the value received by process P1 must be 100 as opposed to 0.
This motivates the design of the send and receive protocols.

8.3 Non-Buffered Blocking Message Passing Operations


A simple method for forcing send/receive semantics is for the send operation to return only when it is safe to do so.
In the non-buffered blocking send, the operation does not return until the matching receive has been encountered at


the receiving process. Idling and deadlocks are major issues with non-buffered blocking sends. In buffered blocking
sends, the sender simply copies the data into the designated buffer and returns after the copy operation has been
completed. The data is copied at a buffer at the receiving end as well. Buffering alleviates idling at the expense of
copying overheads.

Fig. 8.2 Non-buffered blocking message passing operations: (a) sender comes first, idling at sender; (b) sender and receiver come at about the same time, idling minimised; (c) receiver comes first, idling at receiver


(Source: http://www.cs.rice.edu/~vs3/comp422/lecture-notes/comp422-lec16-s08-v2.pdf)

It is easy to see that in cases where the sender and receiver do not reach the communication point at similar times, there can be considerable idling overheads.

8.4 Buffered Blocking Message Passing Operations


A simple solution to the idling and deadlocking problem outlined above is to rely on buffers at the sending and
receiving ends. The sender simply copies the data into the designated buffer and returns after the copy operation
has been completed. The data must be buffered at the receiving end as well. Buffering trades off idling overhead
for buffer copying overhead.


Fig. 8.3 Buffered blocking message passing operations


(Source: http://www.cs.rice.edu/~vs3/comp422/lecture-notes/comp422-lec16-s08-v2.pdf)

Blocking buffered transfer protocols:
• In the presence of communication hardware with buffers at send and receive ends
• In the absence of communication hardware, sender interrupts receiver and deposits data in buffer at receiver
end
Bounded buffer sizes can have significant impact on performance.
P0:                              P1:
for (i = 0; i < 1000; i++) {     for (i = 0; i < 1000; i++) {
    produce_data(&a);                receive(&a, 1, 0);
    send(&a, 1, 1);                  consume_data(&a);
}                                }

Deadlocks are still possible with buffering since receive operations block.
P0:                   P1:
receive(&a, 1, 1);    receive(&a, 1, 0);
send(&b, 1, 1);       send(&b, 1, 0);

8.5 Non-Blocking Message Passing Operations


The programmer must ensure the semantics of the send and receive. This class of non-blocking protocols returns from the
send or receive operation before it is semantically safe to do so. Non-blocking operations are generally accompanied
by a check-status operation. When used correctly, these primitives are capable of overlapping communication
overheads with useful computations. Message passing libraries typically provide both blocking and non-blocking
primitives.
Fig. 8.4 Non-blocking message passing operations (a) in the absence of communication hardware (b) in the presence of communication hardware


(Source: http://www.cs.rice.edu/~vs3/comp422/lecture-notes/comp422-lec16-s08-v2.pdf)

Non-blocking non-buffered send and receive operations:


• in absence of communication hardware
• in presence of communication hardware


8.6 Message Passing Interface (MPI)


MPI defines a standard library for message-passing that can be used to develop portable message-passing programs
using either C or Fortran. The MPI standard defines both the syntax as well as the semantics of a core set of library
routines. Vendor implementations of MPI are available on almost all commercial parallel computers. It is possible
to write fully-functional message-passing programs by using only the six routines listed in Table 8.1.

MPI is the most popular message-passing specification for parallel programming. It is standardised and portable, running on a wide variety of parallel computers, and it allows the development of portable, scalable, large-scale parallel applications.

MPI_Init Initialises MPI


MPI_Finalize Terminates MPI
MPI_Comm_size Determines the number of processes
MPI_Comm_rank Determines the label of calling process
MPI_Send Sends a message
MPI_Recv Receives a message

Table 8.1 The minimal set of MPI routines

8.7 Starting and Terminating the MPI Library


MPI_Init is called prior to any calls to other MPI routines. Its purpose is to initialise the MPI environment.
MPI_Finalize is called at the end of the computation, and it performs various clean-up tasks to terminate the MPI
environment. The prototypes of these two functions are:
int MPI_Init(int *argc, char ***argv)
int MPI_Finalize()

MPI_Init also strips off any MPI related command-line arguments. All MPI routines, data-types, and constants are
prefixed by “MPI_”. The return code for successful completion is MPI_SUCCESS.

8.8 Communicators
A communicator defines a communication domain - a set of processes that are allowed to communicate with each
other. Information about communication domains is stored in variables of type MPI_Comm. Communicators are used
as arguments to all message transfer MPI routines. A process can belong to many different (possibly overlapping)
communication domains. MPI defines a default communicator called MPI_COMM_WORLD which includes all
the processes.

8.9 Querying Information


The MPI_Comm_size and MPI_Comm_rank functions are used to determine the number of processes and the label
of the calling process, respectively. The calling sequences of these routines are as follows:
int MPI_Comm_size(MPI_Comm comm, int *size)

int MPI_Comm_rank(MPI_Comm comm, int *rank)

The rank of a process is an integer that ranges from zero up to the size of the communicator minus one.

Our first MPI program


#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int npes, myrank;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &npes);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    printf("From process %d out of %d, Hello World!\n", myrank, npes);
    MPI_Finalize();
    return 0;
}

8.10 Sending and Receiving Messages


The basic functions for sending and receiving messages in MPI are the MPI_Send and MPI_Recv, respectively. The
calling sequences of these routines are as follows:

int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)
MPI provides equivalent datatypes for all C datatypes. This is done for portability reasons.

The datatype MPI_BYTE corresponds to a byte (8 bits) and MPI_PACKED corresponds to a collection of data
items that has been created by packing non-contiguous data. The message-tag can take values ranging from zero up
to the MPI defined constant MPI_TAG_UB.

MPI allows specification of wildcard arguments for both source and tag. If source is set to MPI_ANY_SOURCE,
then any process of the communication domain can be the source of the message. If tag is set to MPI_ANY_TAG,
then messages with any tag are accepted. On the receive side, the message must be of length equal to or less than
the length field specified.

On the receiving end, the status variable can be used to get information about the MPI_Recv operation.

The corresponding data structure contains:


typedef struct MPI_Status {
int MPI_SOURCE;
int MPI_TAG;
int MPI_ERROR; };
The MPI_Get_count function returns the precise count of data items received.
int MPI_Get_count(MPI_Status *status, MPI_Datatype
datatype, int *count)
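As a hedged fragment in the style of the examples in this chapter (the buffer size and the printed message are assumptions), the wildcards and the status object can be combined as follows:

int buf[100], count;
MPI_Status status;
...
/* accept a message of at most 100 ints from any source with any tag */
MPI_Recv(buf, 100, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
         MPI_COMM_WORLD, &status);
MPI_Get_count(&status, MPI_INT, &count);
printf("received %d ints from rank %d with tag %d\n",
       count, status.MPI_SOURCE, status.MPI_TAG);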

8.11 Avoiding Deadlocks


Consider:
int a[10], b[10], myrank;
MPI_Status status;
...
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank == 0) {
MPI_Send(a, 10, MPI_INT, 1, 1, MPI_COMM_WORLD);
MPI_Send(b, 10, MPI_INT, 1, 2, MPI_COMM_WORLD);


}
else if (myrank == 1) {
MPI_Recv(b, 10, MPI_INT, 0, 2, MPI_COMM_WORLD, &status);
MPI_Recv(a, 10, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
}
...
If MPI_Send is blocking, there is a deadlock.
Consider the following piece of code, in which process i sends a message to process i + 1 (modulo the number of
processes) and receives a message from process i - 1 (modulo the number of processes).
int a[10], b[10], npes, myrank;
MPI_Status status;
...
MPI_Comm_size(MPI_COMM_WORLD, &npes);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
MPI_Send(a, 10, MPI_INT, (myrank+1)%npes, 1,
MPI_COMM_WORLD);
MPI_Recv(b, 10, MPI_INT, (myrank-1+npes)%npes, 1,
MPI_COMM_WORLD, &status);
...
Once again, we have a deadlock if MPI_Send is blocking.

We can break the circular wait to avoid deadlocks as follows:

int a[10], b[10], npes, myrank;


MPI_Status status;
...
MPI_Comm_size(MPI_COMM_WORLD, &npes);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank%2 == 1) {
MPI_Send(a, 10, MPI_INT, (myrank+1)%npes, 1,
MPI_COMM_WORLD);
MPI_Recv(b, 10, MPI_INT, (myrank-1+npes)%npes, 1,
MPI_COMM_WORLD, &status);
}
else {
MPI_Recv(b, 10, MPI_INT, (myrank-1+npes)%npes, 1,
MPI_COMM_WORLD, &status);
MPI_Send(a, 10, MPI_INT, (myrank+1)%npes, 1,
MPI_COMM_WORLD);
}
...

8.12 Sending and Receiving Messages Simultaneously


To exchange messages, MPI provides the following function:
int MPI_Sendrecv(void *sendbuf, int sendcount,
MPI_Datatype senddatatype, int dest, int
sendtag, void *recvbuf, int recvcount,
MPI_Datatype recvdatatype, int source, int recvtag,
MPI_Comm comm, MPI_Status *status)

The arguments include arguments to the send and receive functions. If we wish to use the same buffer for both send
and receive, we can use:

int MPI_Sendrecv_replace(void *buf, int count,
MPI_Datatype datatype, int dest, int sendtag,
int source, int recvtag, MPI_Comm comm,
MPI_Status *status)
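A hedged sketch of the ring shift from Section 8.11 rewritten with MPI_Sendrecv (the array sizes and tags follow the earlier examples):

int a[10], b[10], npes, myrank;
MPI_Status status;
...
MPI_Comm_size(MPI_COMM_WORLD, &npes);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
/* send a to the right neighbour and receive b from the left neighbour in a
 * single call, so no manual ordering is needed to avoid deadlock */
MPI_Sendrecv(a, 10, MPI_INT, (myrank+1)%npes, 1,
             b, 10, MPI_INT, (myrank-1+npes)%npes, 1,
             MPI_COMM_WORLD, &status);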

8.13 Creating and Using Cartesian Topologies


We can create Cartesian topologies using the function:
int MPI_Cart_create(MPI_Comm comm_old, int ndims,
int *dims, int *periods, int reorder,
MPI_Comm *comm_cart)

This function takes the processes in the old communicator and creates a new communicator with dims dimensions.
Each processor can now be identified in this new Cartesian topology by a vector of dimension dims.

Since sending and receiving messages still require (one dimensional) ranks, MPI provides routines to convert ranks
to Cartesian coordinates and vice-versa.

int MPI_Cart_coords(MPI_Comm comm_cart, int rank, int maxdims, int *coords)
int MPI_Cart_rank(MPI_Comm comm_cart, int *coords, int *rank)

The most common operation on cartesian topologies is a shift. To determine the rank of source and destination of
such shifts, MPI provides the following function:

int MPI_Cart_shift(MPI_Comm comm_cart, int dir, int s_step, int *rank_source, int *rank_dest)
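A hedged fragment illustrating these routines on a periodic two-dimensional grid (MPI_Dims_create, used here to factor the number of processes into grid dimensions, is a standard MPI routine not otherwise discussed in this chapter):

int dims[2] = {0, 0}, periods[2] = {1, 1}, coords[2];
int npes, my2drank, left, right;
MPI_Comm comm_2d;
...
MPI_Comm_size(MPI_COMM_WORLD, &npes);
MPI_Dims_create(npes, 2, dims);               /* choose a 2-D grid shape      */
MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &comm_2d);
MPI_Comm_rank(comm_2d, &my2drank);
MPI_Cart_coords(comm_2d, my2drank, 2, coords);/* rank -> (row, column)        */
MPI_Cart_shift(comm_2d, 1, 1, &left, &right); /* neighbours along dimension 1 */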

8.14 Overlapping Communication with Computation


In order to overlap communication with computation, MPI provides a pair of functions for performing non-blocking
send and receive operations (“I” stands for “Immediate”):
int MPI_Isend(void *buf, int count, MPI_Datatype datatype,
int dest, int tag, MPI_Comm comm,
MPI_Request *request)
int MPI_Irecv(void *buf, int count, MPI_Datatype datatype,
int source, int tag, MPI_Comm comm,
MPI_Request *request)

These operations return before the operations have been completed. The function MPI_Test tests whether or not the non-blocking send or receive operation identified by its request has finished.
int MPI_Test(MPI_Request *request, int *flag,
MPI_Status *status)

MPI_Wait waits for the operation to complete.


int MPI_Wait(MPI_Request *request, MPI_Status *status)
Using non-blocking operations removes most deadlocks. Consider:
int a[10], b[10], myrank;
MPI_Status status;
...


MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank == 0) {
MPI_Send(a, 10, MPI_INT, 1, 1, MPI_COMM_WORLD);
MPI_Send(b, 10, MPI_INT, 1, 2, MPI_COMM_WORLD);
}
else if (myrank == 1) {
MPI_Recv(b, 10, MPI_INT, 0, 2, MPI_COMM_WORLD, &status);
MPI_Recv(a, 10, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
}
...
Replacing either the send or the receive operations with non-blocking counterparts fixes this deadlock.
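For example, a hedged sketch of the receiving side rewritten with MPI_Irecv and MPI_Wait (the request array is an assumption of this illustration):

int a[10], b[10], myrank;
MPI_Request requests[2];
MPI_Status status;
...
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank == 1) {
    /* post both receives at once; neither blocks, so the order in which the
     * matching sends from process 0 complete no longer matters */
    MPI_Irecv(b, 10, MPI_INT, 0, 2, MPI_COMM_WORLD, &requests[0]);
    MPI_Irecv(a, 10, MPI_INT, 0, 1, MPI_COMM_WORLD, &requests[1]);
    MPI_Wait(&requests[0], &status);
    MPI_Wait(&requests[1], &status);
}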

8.15 Collective Communication and Computation Operations


MPI provides an extensive set of functions for performing common collective communication operations. Each
of these operations is defined over a group corresponding to the communicator. All processors in a communicator
must call these operations.

The barrier synchronisation operation is performed in MPI using:


int MPI_Barrier(MPI_Comm comm)
The one-to-all broadcast operation is:
int MPI_Bcast(void *buf, int count, MPI_Datatype datatype,
int source, MPI_Comm comm)

The all-to-one reduction operation is:

int MPI_Reduce(void *sendbuf, void *recvbuf, int count,
               MPI_Datatype datatype, MPI_Op op, int target,
               MPI_Comm comm)
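
For illustration, the following minimal sketch combines MPI_Bcast and MPI_Reduce: the root broadcasts a problem size n, every process sums a different subset of 1..n, and the partial sums are combined on the root with MPI_SUM. The value n = 1000 is an illustrative choice.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, n = 0, i;
    double local = 0.0, total = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0)
        n = 1000;                       /* problem size known only to the root */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* each process sums a different subset of 1..n */
    for (i = rank + 1; i <= n; i += size)
        local += (double)i;

    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("Sum of 1..%d = %.0f\n", n, total);

    MPI_Finalize();
    return 0;
}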

8.16 Collective Communication Operations


The operation MPI_MAXLOC combines pairs of values (v_i, l_i) and returns the pair (v, l) such that v is the maximum among all the v_i and l is the corresponding l_i (if several pairs attain the maximum, l is the smallest of their l_i). MPI_MINLOC does the same for the minimum value of v_i.

Value:    15  17  11  12  17  11
Process:   0   1   2   3   4   5

MinLoc(Value, Process) = (11, 2)
MaxLoc(Value, Process) = (17, 1)

Fig. 8.5 An example use of the MPI_MINLOC and MPI_MAXLOC operators


(Source: http://www.cs.rice.edu/~vs3/comp422/lecture-notes/comp422-lec16-s08-v2.pdf)

MPI Datatype            C Datatype
MPI_2INT                pair of ints
MPI_SHORT_INT           short and int
MPI_LONG_INT            long and int
MPI_LONG_DOUBLE_INT     long double and int
MPI_FLOAT_INT           float and int
MPI_DOUBLE_INT          double and int

Table 8.2 MPI datatypes for data-pairs used with the MPI_MAXLOC and MPI_MINLOC reduction operations
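
For illustration, a minimal sketch of MPI_MAXLOC with the MPI_DOUBLE_INT pair type: each process contributes a (value, rank) pair and the root receives the largest value together with the rank that owns it. The per-process values are arbitrary illustrative numbers.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank;
    struct { double value; int rank; } in, out;   /* layout matches MPI_DOUBLE_INT */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    in.value = (double)((rank * 3) % 7);   /* some per-process value */
    in.rank  = rank;

    MPI_Reduce(&in, &out, 1, MPI_DOUBLE_INT, MPI_MAXLOC, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("Maximum value %.1f found on process %d\n", out.value, out.rank);

    MPI_Finalize();
    return 0;
}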

If the result of the reduction operation is needed by all processes, MPI provides:

int MPI_Allreduce(void *sendbuf, void *recvbuf,
                  int count, MPI_Datatype datatype, MPI_Op op,
                  MPI_Comm comm)

To compute prefix-sums, MPI provides:

int MPI_Scan(void *sendbuf, void *recvbuf, int count,
             MPI_Datatype datatype, MPI_Op op,
             MPI_Comm comm)
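
For illustration, a minimal prefix-sum sketch: each process contributes the value rank + 1, so after MPI_Scan with MPI_SUM, process i holds 1 + 2 + ... + (i + 1).

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, x, prefix;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    x = rank + 1;
    MPI_Scan(&x, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    printf("Process %d: prefix sum = %d\n", rank, prefix);

    MPI_Finalize();
    return 0;
}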

The gather operation is performed in MPI using:

int MPI_Gather(void *sendbuf, int sendcount,
               MPI_Datatype senddatatype, void *recvbuf,
               int recvcount, MPI_Datatype recvdatatype,
               int target, MPI_Comm comm)

MPI also provides the MPI_Allgather function in which the data are gathered at all the processes.

int MPI_Allgather(void *sendbuf, int sendcount,
                  MPI_Datatype senddatatype, void *recvbuf,
                  int recvcount, MPI_Datatype recvdatatype,
                  MPI_Comm comm)

The corresponding scatter operation is:

int MPI_Scatter(void *sendbuf, int sendcount,
                MPI_Datatype senddatatype, void *recvbuf,
                int recvcount, MPI_Datatype recvdatatype,
                int source, MPI_Comm comm)
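
For illustration, the following minimal sketch scatters one integer from the root to every process, doubles it locally, and gathers the results back on the root. The fixed array size of 16 is an illustrative upper bound on the number of processes.

#include <mpi.h>
#include <stdio.h>

#define MAXP 16     /* assumes the program is run with at most 16 processes */

int main(int argc, char *argv[])
{
    int rank, size, i, mine;
    int senddata[MAXP], recvdata[MAXP];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0)
        for (i = 0; i < size; i++)
            senddata[i] = i + 1;

    MPI_Scatter(senddata, 1, MPI_INT, &mine, 1, MPI_INT, 0, MPI_COMM_WORLD);

    mine = 2 * mine;    /* local work on the scattered element */

    MPI_Gather(&mine, 1, MPI_INT, recvdata, 1, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0)
        for (i = 0; i < size; i++)
            printf("recvdata[%d] = %d\n", i, recvdata[i]);

    MPI_Finalize();
    return 0;
}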


Summary
• The logical view of a machine supporting the message passing paradigm consists of p processes, each with its
own exclusive address space.
• Message-passing programs are often written using the asynchronous or loosely synchronous paradigms.
• A simple method for forcing send/receive semantics is for the send operation to return only when it is safe to
do so.
• In the non-buffered blocking send, the operation does not return until the matching receive has been encountered
at the receiving process.
• A simple solution to the idling and deadlocking problem outlined above is to rely on buffers at the sending and
receiving ends.
• The sender simply copies the data into the designated buffer and returns after the copy operation has been
completed.
• MPI defines a standard library for message-passing that can be used to develop portable message-passing
programs using either C or Fortran.
• The MPI standard defines both the syntax as well as the semantics of a core set of library routines.
• Vendor implementations of MPI are available on almost all commercial parallel computers.
• MPI_Init is called prior to any calls to other MPI routines. Its purpose is to initialise the MPI environment.
• A communicator defines a communication domain - a set of processes that are allowed to communicate with
each other.
• MPI provides equivalent datatypes for all C datatypes.
• MPI allows specification of wildcard arguments for both source and tag.
• In order to overlap communication with computation, MPI provides a pair of functions for performing non-
blocking send and receive operations (“I” stands for “Immediate”).
• MPI provides an extensive set of functions for performing common collective communication operations.

References
• Lastovetsky, A., 2003. Parallel Computing on Heterogeneous Networks, John Wiley & Sons.
• Padua, D., 2011. Encyclopedia of Parallel Computing, Volume 4, Springer.
• Message-Passing Programming [pdf] Available at: <http://www.cs.umsl.edu/~sanjiv/classes/cs5740/lectures/
mpi.pdf> [Accessed 21 June 2012].
• Sarkar, V., 2008. Programming Using the Message Passing Paradigm (Chapter 6) [pdf] Available at: <http://
www.cs.rice.edu/~vs3/comp422/lecture-notes/comp422-lec16-s08-v2.pdf> [Accessed 21 June 2012].
• 2011. 00 11 Message Passing [Video Online] Available at: <http://www.youtube.com/watch?v=c5NKVAPf2OE>
[Accessed 21 June 2012].
• 2012. Message Passing Algorithms - SixtySec [Video Online] Available at: <http://www.youtube.com/
watch?v=7IdLzEoiPY4> [Accessed 21 June 2012].

Recommended Reading
• Gropp, W., Lusk, E. & Skjellum, A., 2006. Using Mpi: Portable Parallel Programming With the Message-
Passing Interface, Volume 1, MIT Press.
• Rauber, T. & Rünger, G., 2010. Parallel Programming: For Multicore and Cluster Systems, Springer.
• Tokhi, 2003. Parallel Computing for Real-Time Signal Processing and Control, Springer.

Self Assessment
1. The logical view of a machine supporting the message passing paradigm consists of p processes, each with its
own exclusive ____________.
a. communication domain
b. address space
c. datatype
d. paradigms

2. What defines a standard library for message-passing that can be used to develop portable message-passing
programs using either C or Fortran?
a. SPMD
b. MPI
c. MIMD
d. CPU

3. A simple solution to the idling and deadlocking problem outlined above is to rely on __________at the sending
and receiving ends.
a. buffers
b. overheads
c. libraries
d. communicators

4. What refers to a set of processes that are allowed to communicate with each other?
a. Communication domain
b. Datatype
c. Message passing programming
d. Debugging

5. Each data element must belong to one of the partitions of the __________; hence, data must be explicitly
partitioned and placed.
a. time
b. order
c. space
d. volume

6. All interactions (read-only or read/write) require cooperation of _______ processes.


a. two
b. three
c. four
d. five

7. Most message-passing programs are written using the _________ model.


a. SPMD
b. SIMD
c. MISD
d. MIMD


8. Idling and _________ are major issues with non-buffered blocking sends.
a. partitions
b. communication
c. deadlocks
d. buffering

9. Message passing _______ typically provide both blocking and non-blocking primitives.


a. domains
b. libraries
c. interface
d. programming

10. Which of the following statements is false?


a. MPI defines a standard library for message-passing that can be used to develop portable message-passing
programs using only C not Fortran.
b. The MPI standard defines both the syntax as well as the semantics of a core set of library routines.
c. Vendor implementations of MPI are available on almost all commercial parallel computers.
d. It is possible to write fully-functional message-passing programs by using only the six routines.

Application I
Composite: Content Management Solution Uses Parallelisation to Deliver Huge Performance Gains
Content management system vendor Composite needed to parallelise its software to realise the performance gains
enabled by today’s multicore processors. Composite took advantage of the new parallel-programming tools provided
in the Microsoft Visual Studio 2010 development system and the .NET Framework 4 to parallelise its code. The
company’s efforts have yielded impressive performance gains. An eight-core server is delivering a 60 percent
reduction in page-rendering times and an 80 percent reduction in the compilation of dynamic types upon system
initialisation. What’s more, by using the latest Microsoft aids for parallel programming, Composite was able to
implement parallelism in its solution quickly and cost-effectively, with very little developer effort.

Situation
Microsoft Gold Certified Partner Composite develops and sells Composite C1, a content management system (CMS)
designed to help companies build Web sites that combine solid marketing infrastructure, innovative design, and
strong usability. Originally founded as a Web development shop, Composite decided to build its second-generation
CMS in 2005 after realising that the Microsoft Visual Studio 2005 development system and the Microsoft .NET
Framework 2.0 presented an opportunity to build a solution that could meet the needs of both Web developers and
designers.

As new versions of Visual Studio and the .NET Framework become available, Composite examines them closely to
determine how they can be applied to improve its own product. For example, the company took advantage of Visual
Studio 2008 and the .NET Framework 3.5 to add support for Language-Integrated Query (LINQ) in version 1.2 of
Composite C1. Composite began this exercise again in 2009, when it joined an early adopter program for Visual
Studio 2010 and the .NET Framework 4 and began planning for the development of Composite C1 version 1.3.

One area on which Composite decided to focus was performance—specifically, how to optimise its software to
get the most out of modern, multicore processors and multiprocessor servers. “For the past few decades, we’ve all
benefited from rapidly increasing processor clock speeds,” says Marcus Wendt, Cofounder and Product Manager at
Composite. “However, this extended ‘free lunch’ is over, in that clock speeds have leveled off and chip manufacturers
are turning to multiple-processor cores for further gains in processing power. Therein lies the challenge, in that most
applications today—Composite C1 version 1.2 included—are not multicore optimised, which can result in one core
running at 100 percent while the rest remain idle. To get the best performance out of today’s multicore processors,
we needed to introduce parallelisation into our code.”

Solution
Composite took advantage of the new parallel-programming tools provided in the Microsoft Visual Studio 2010
development system and the .NET Framework 4, and the company’s efforts have yielded significant performance
gains in multiple areas. “Parallel programming has traditionally been difficult, tedious, and hard to debug, with very
limited tool support,” says Wendt. “The System.Threading namespace in the .NET Framework has existed for years,
but in the past it required a lot of ‘plumbing code’ to use effectively. Visual Studio 2010 and the .NET Framework
4 eliminate a lot of complexity to make parallel programming much easier.”

New Parallel-Programming Aids


Composite parallelised its code by using Visual Studio 2010 and the new parallelisation libraries in the .NET
Framework 4, including:
• Task Parallel Library (TPL), which includes parallel implementations of for and foreach loops (For and For Each
in the Visual Basic language), as well as rich support for coordinating the asynchronous execution of individual
tasks. Implemented as a set of public types and APIs in the System.Threading.Tasks namespace, TPL relies
on an extensible task-scheduling system that is integrated with the .NET ThreadPool and scales the degree of
concurrency dynamically, so that all available processors and processing cores are used most efficiently.


• Parallel Language-Integrated Query (PLINQ), a parallel implementation of LINQ to Objects that combines
the simplicity and readability of LINQ syntax with the power of parallel programming. PLINQ implements
the full set of LINQ standard query operators as extension methods in the System.Linq namespace, along with
additional operators to control the execution of parallel operations. As with code that targets the Task Parallel
Library on top of which PLINQ is built, PLINQ queries scale in the degree of concurrency according to the
capabilities of the host computer.
• New Data Structures for Parallel Programming, which include concurrent collection classes that are scalable
and thread safe; lightweight synchronisation primitives; and types for lazy initialisation and producer/consumer
scenarios. Developers can use these new types with any multithreaded application code, including that which
uses the Task Parallel Library and PLINQ.

Composite also took advantage of the new Parallel Stacks and Parallel Tasks windows for debugging code, which
are provided in Visual Studio 2010 Ultimate, Premium, and Professional. Visual Studio 2010 Premium and Ultimate
also have a new Concurrency Visualiser, which is integrated with the profiler to provide graphical, tabular, and
numerical data about how multithreaded applications interact with themselves and with other programs. “The
Concurrency Visualiser and other parallel-programming tools in Visual Studio 2010 are a great help in that they
enable developers to quickly identify areas of concern and navigate through call stacks and to relevant call sites in
the source code,” says Martin Ingvar Jensen, Senior Developer at Composite.

Faster compilation of dynamic types


In parallelising its code, Composite identified two areas that were prime candidates for concurrency. One area was
the compilation of dynamic types, which is done upon reinitialising Composite C1 and can take up to 90 seconds
for large Web sites. “Our dynamic type system enables developers to design data types using our Web interface and
treat them as ‘real’ .NET Framework types, such as executing LINQ statements against them,” says Wendt. “This is
achieved by generating and compiling code upon initialisation, which is time-consuming and processor intensive.
Without parallelisation, the CPU Usage History indicator would show one core running at 100 percent utilisation
while the other cores were doing no work at all.”

Parallelisation was achieved by changing a classic foreach loop in the application’s compilation manager to a Parallel.ForEach loop—a task that required changing three lines of code. “The time that Composite C1 spends
compiling dynamic types has been significantly reduced on multicore systems, with performance increasing steadily
as more cores are available,” says Wendt. “Not only does this reduce startup times upon initial deployment, but it
also makes developers more efficient and productive when they’re working with our software.”
“Given the very limited amount of code work we had to do, it’s fair to say that the support for parallel programming
provided in the .NET Framework 4 worked very well for us,” says Wendt.

Improved page-rendering performance


Composite also saw the potential for big performance gains in page rendering. “In Composite C1, the rendering
process handles the construction of page elements such as navigation aids, news listings, search results, and so on,”
explains Wendt. “In the past, these dynamic page elements were rendered serially, one after the other. Web pages
often contain multiple renderings, so the ability to perform those operations in parallel provides a huge opportunity
to improve performance—and thus deliver a better end-user experience.”

The company parallelised the rendering process by using the new data structures for parallel programming. “Rather
than using a C# statement such as foreach to declare that concurrency is desired, we call the static ForEach method on the Parallel class, passing a collection of data and a lambda expression you want to execute,” explains Wendt.
“The .NET Framework 4 handles all of the complex thread management in accordance with the underlying hardware
platform, firing off more threads as more cores are available. This is work that most developers will happily let the
underlying programming framework handle—in a way that’s likely more efficient and optimised than what they
could implement by hand.”

Composite’s use of ConcurrentQueue<T> instead of List<T> to store calculation results is also noteworthy. “We
do this because List<T> is not thread safe, meaning that you need to add locking to your code or brace yourself for
some unexpected results at runtime,” explains Wendt. “Developers still need to think about thread safety, but the
.NET Framework 4 makes the coding much easier, transforming the process from ‘be very careful and do the hard
work’ to just ‘be careful.’ ”

Other useful new features and capabilities


Beyond parallelisation, Composite developers are finding Visual Studio 2010 and the .NET Framework 4 useful in
other new ways, such as application lifecycle management. “We’re planning to upgrade our current Visual Studio
Team System 2008 Team Foundation Server implementation to Visual Studio Team Foundation Server 2010, so
we can take advantage of the new Scrum templates,” says Wendt. “There are a lot of great new features in the new
Scrum templates provided with Visual Studio Team Foundation Server 2010, and we’re expecting that they will be
of great benefit to the development process.”

Developers also are taking advantage of the new support for covariance in generics. “Our system uses interfaces
as the generic parameter when querying data from our data layer,” explains Wendt. “Because Visual C# 3.5 did
not support covariance, we had to do a lot of expression tree transformation when using LINQ to SQL. Generic
covariance should give us a performance boost, make our code simpler and thus easier to maintain, and enable us
to add ‘data schema inheritance’ to our data layer to enable some pretty interesting new features.”

Benefits
By taking advantage of the new parallel-programming aids provided in Visual Studio 2010 and the .NET Framework
4, Composite was able to easily capitalise on the performance gains enabled by modern multicore processors.
“With Visual Studio 2010 and the .NET Framework 4, Microsoft is providing tools that immensely simplify
parallel development,” says Wendt. “Devel-opers can simply ‘declare intent’ to do parallelisation and leave it to
the under-lying framework to handle the rest. The process isn’t foolproof in that developers still need to understand
parallel programming, but it’s quite easy to use and, when used correctly, can enable applications to utilise modern
microprocessors much more efficiently.”

Up to 80 percent reduction in processing time


Composite’s use of parallel programming is delivering significant performance gains, in turn improving the Composite
C1 user experience for both Web developers and Web site end users. “On a server configured with two quad-core
processors, our use of parallel programming delivered an 80 percent reduction in the time required to compile
dynamic types, which means that Web developers don’t need to wait as long when working with Composite C1,”
says Wendt.

The test that Composite constructed to measure the effects of parallelisation on page-rendering times showed a
decrease of more than 60 percent. “Our use of parallelisation reduced the time required to render a test Web page
containing eight functions from 115 to 40.5 milliseconds—even with one page element that takes 40 milliseconds
to render on its own,” says Wendt. “That’s the beauty of parallelisation, in that it enables us to break up a chunk of
work into independent tasks and execute them concurrently to get the job done faster.”


Minimal developer effort required


Composite was able to implement parallelism in its application quickly and cost-effectively, with very little developer
effort. “The hardest part was figuring out where we could benefit from parallelisation—an area where the Concurrency
Visualiser in Visual Studio 2010 was very helpful,” says Wendt. “Visual Studio 2010 and the .NET Framework
4 extend the ‘free lunch’ enabled by increasing processor speeds over the past few decades with an ‘almost free
lunch’—achieved through a set of tools that make it far easier and more practical to implement parallelisation. The
new parallel-programming aids provided by Microsoft aren’t a ‘magic wand’ that will make existing code run in
parallel by itself, but they do make the work required a whole lot easier.”

Microsoft Visual Studio 2010


Microsoft Visual Studio 2010 is an integrated development system that helps simplify the entire development process
from design to deployment. Unleash your creativity with powerful prototyping, modeling, and design tools that help
you bring your vision to life. Work within a personalised environment that helps accelerate the coding process and
supports the use of your existing skills, and target a growing number of platforms, including Microsoft SharePoint
Server 2010 and cloud services. Also, work more efficiently thanks to integrated testing and debugging tools that
you can use to find and fix bugs quickly and easily to help ensure high-quality solutions.

Microsoft parallel computing platform


Developers today face an unprecedented opportunity to deliver new software experiences that take advantage of
multicore and many-core systems. Microsoft is taking a comprehensive approach to simplifying parallel programming,
working at all levels of the solution stack to make it simple for both native-code and managed-code developers to
safely and productively build robust, scalable, and responsive parallel applications.

(Source: Composite, Content Management Solution Uses Parallelisation to Deliver Huge Performance Gains
[Online] Available at: <http://www.microsoft.com/casestudies/Case_Study_Detail.aspx?casestudyid=4000006833>
[Accessed 22 June 2012].)
Questions
1. What were the benefits offered by taking advantage of the new parallel-programming aids provided in Visual
Studio 2010 and the .NET Framework 4?
Answer: By taking advantage of the new parallel-programming aids provided in Visual Studio 2010 and the
.NET Framework 4, Composite was able to easily capitalise on the performance gains enabled by modern
multicore processors. With Visual Studio 2010 and the .NET Framework 4, Microsoft is providing tools that
immensely simplify parallel development. Developers can simply ‘declare intent’ to do parallelisation and leave
it to the underlying framework to handle the rest. The process isn’t foolproof in that developers still need to
understand parallel programming, but it’s quite easy to use and, when used correctly, can enable applications
to utilise modern microprocessors much more efficiently.

2. Give a brief description about Microsoft Visual Studio 2010.


Answer: Microsoft Visual Studio 2010 is an integrated development system that helps simplify the entire
development process from design to deployment. Unleash your creativity with powerful prototyping, modeling,
and design tools that help you bring your vision to life. Work within a personalised environment that helps
accelerate the coding process and supports the use of your existing skills, and target a growing number of
platforms, including Microsoft SharePoint Server 2010 and cloud services. Also, work more efficiently thanks
to integrated testing and debugging tools that you can use to find and fix bugs quickly and easily to help ensure
high-quality solutions.

3. How did parallel programming help in reducing processing time?


Answer: Composite’s use of parallel programming is delivering significant performance gains, in turn improving
the Composite C1 user experience for both Web developers and Web site end users. On a server configured
with two quad-core processors, the use of parallel programming delivered an 80 percent reduction in the time
required to compile dynamic types, which means that Web developers don’t need to wait as long when working
with Composite C1.

The test that Composite constructed to measure the effects of parallelisation on page-rendering times showed a
decrease of more than 60 percent. The use of parallelisation reduced the time required to render a test Web page
containing eight functions from 115 to 40.5 milliseconds—even with one page element that takes 40 milliseconds
to render on its own. Parallelisation enables us to break up a chunk of work into independent tasks and execute
them concurrently to get the job done faster.


Application II

Many-Core Chips — A Case for Virtual Shared Memory

Introduction
Future many-core systems, with thousands of cores on a single chip, will be significantly different from present
multi-core chips. Busses scale poorly and will be replaced by on-chip interconnect networks with local subnets linked
by routers, and inter-core message-passing latencies varying by maybe an order of magnitude depending on the
physical distance. Cores (or groups of cores) will have their own clocks, synchronised via protocols running across
interconnect. Cores (or groups of cores) will have significant amounts of local memory. The only shared memory is
likely to be off-chip; even the closest off-chip memory may be distributed, due to three-dimensional integration. This
implies that shared memory will be several orders of magnitude more expensive to access than local memory.

The need to keep yields reasonably high will force manufacturers to ship chips with many dead cores and interconnects.
These will be detected and removed from the software-visible hardware during a burn-in phase or even at boot
time. The result is that the actual topology of the chip will not be determined by the part number, but will need to
be discovered at boot time, and the software must adapt to the topology.

Such a many-core chip will therefore look not unlike a contemporary workstation cluster: a distributed system with
a high-speed interconnect. Compared to local memory, the shared off-chip memory will be expensive to access and
will appear like some network attached storage in a cluster, and will be seen as backing store rather than being directly
accessed during computation. This observation makes it natural to explore programming paradigms developed for
distributed systems: explicit message passing and distributed shared memory (DSM). Message-passing approaches,
such as MPI, are an obvious approach that maps well to the message-passing nature of the hardware (as it does in
distributed systems), and is likely to be the approach of choice for highly-parallel applications. However, shared
memory is a more convenient programming paradigm in many circumstances and provides better support for many
legacy programs. For that reason, DSM is still popular in distributed systems today, and likely to remain so in the
future.

In this paper we make the case for a DSM-like approach as a way to manage many-core systems. We argue that,
contrary to classical DSM approaches, such as Ivy, Munin or Treadmarks, shared memory for many-core systems
should not be implemented as middleware (i.e. on top of the OS), but below the OS(es) inside a system virtualisation
layer. This extends the classical OS notion of virtual memory across the distributed system. To distinguish it from
DSM, we refer to this approach as virtual shared memory (VSM). We have recently developed a prototype VSM
system, called vNUMA (“virtual NUMA”), on a conventional workstation cluster. We argue that the vNUMA
approach presents a promising way of managing many-core chips, as it simplifies dealing with the distributed nature
of the hardware. It can do so without introducing overheads to high-performance applications, as explicit message
passing still works with full performance.

vNUMA Overview
vNUMA is a Type-I hypervisor which presents a shared-memory multiprocessor to the (guest) operating system.
Specifically, as vNUMA is implemented on distributed-memory hardware, where the cost of accessing remote
memory is orders of magnitude higher than for local memory, the virtual hardware provides cache-coherent non-uniform memory access (ccNUMA). Hence, NUMA-aware software will perform better on vNUMA than software
that assumes uniform memory-access costs.

Implementing shared memory inside the hypervisor has a number of advantages. For one, all the memory in the
system becomes part of the VSM, and therefore the OS can access all memory from all nodes. (Of course, a side
effect of this virtualisation is that the hypervisor can also partition the system, dividing the complete physical memory
into several virtual machines, each running a separate, isolated OS. Besides other advantages of virtualisation, this
supports the deployment of OSes that will not scale to thousands of processors.)

Another advantage is that running in the machine’s most privileged mode gives a VSM system access to optimisations
that are beyond the reach of DSM middleware such as MPI. These include the efficient emulation of individual
instructions, and the use of the performance-monitoring unit (PMU) to track the execution of specific instructions.
Implementing VSM inside the hypervisor also changes some of the trade-offs compared to middleware systems,
and as a result requires different protocols. For example, software running on DSM systems is typically aware of
this, and specifically of the fact that the unit of migration and coherency is a hardware page. This is not the case
for a multiprocessor OS, especially a NUMA-unaware one, which expects data migration and coherency to have a
cacheline granularity. vNUMA therefore includes a number of enhancements to established DSM protocols to support
efficient write-sharing within a page. For example, vNUMA supports three different modes of write sharing of pages:
write-invalidate, write-update/multiple writer, and write-update/single writer. vNUMA adapts the write-sharing mode
based on the observed sharing patterns. In particular, vNUMA detects and efficiently handles atomic instructions
(such as compare-exchange) used by the OS to implement locks. For some optimisations (e.g. batching write-update
messages), vNUMA makes use of the weak store order provided by modern processors.

VSM on many-cores
Compared to a cluster, a VSM implementation will benefit from a number of advantages a multi-core chip has over
a traditional cluster. For one, network latencies (measured in CPU-core cycles) are orders of magnitude lower for
an on-chip interconnect compared to Ethernet. More importantly, if the VSM approach is adopted, support for it
will be designed into future many-core chips. This has the potential to significantly reduce overheads on a number
of operations that we found expensive in vNUMA (on current COTS hardware). For example, vNUMA traps all
writes to multiple-writer shared pages in order to determine when updates need to be distributed. While this provides
better performance than single-writer protocols in the presence of a limited degree of false sharing, if such writes are
frequent, performance will suffer. Architectural support, for example, in the form of write protection at a cache-line
granularity, could reduce this bottleneck.

Furthermore, if the VSM paradigm is widely adopted, then software will adapt to it, for example by changing OS
data structures to avoid false sharing. According to our (limited) experience with vNUMA scalability, NUMA-aware
software will generally work well, and programs that do not share memory at all will not be affected by VSM in
their performance.

The VSM approach can provide some obvious benefits to processor manufacturers:
• The hypervisor can transparently deal with a small amount of message loss in the interconnect. This allows
chipmakers to optimise the network more aggressively, to the degree where it is no longer fully reliable. In
fact, vNUMA, designed for notionally unreliable Ethernet, makes use of the fact that in a cluster environment,
Ethernet is in reality “almost” reliable, losing or damaging messages very rarely. vNUMA deals with this by using
checksums and timeouts, rather than more sophisticated protocols designed for really unreliable networks.
• Cache coherence does not have to be provided by hardware. With a growing number of cores, hardware solutions become more complicated and costly. High-performance applications do not need them, as they deal with distribution explicitly, reducing the benefit of providing coherence in hardware. Shifting coherence protocols into the hypervisor has the added benefit that software can adapt the protocols to access patterns more easily.
• The hypervisor can transparently re-map memory addresses and core IDs. This not only allows it to deal with
unreliable hardware, but also naturally supports turning off cores for power management. The virtual-memory
paradigm can be taken to its logical conclusion by transparently swapping out local memory to off-chip backing
store.


• Core heterogeneity is easy to support, as individual cores or groups of cores can run their own OS, with the
hypervisor simplifying access to remote memory. Heterogeneous OSes are also easy to support, the main
requirement is that each core’s ISA supports the hypervisor.

It would be possible to implement VSM inside native OSes running on many-cores. However, we expect virtualisation
to be widely used in future systems anyway, for reasons of resource isolation / quality of service, and for dynamic
resource management, in particular saving energy by shutting down unused cores. A single chip will typically run
multiple, heterogeneous operating systems, each with varying allocations of physical resources. Only the hypervisor
has access to the whole system, and as such is the ideal place to implement VSM. For example, it could use Capabilities
for controlling access to pages or coarser-grain memory regions by guest OSes.

VSM and applications


It would obviously be a fallacy to expect a VSM to scale for parallel applications utilising hundreds or thousands of nodes. On the one hand, the latency of communication cannot be hidden by presenting a shared-memory model. On the other hand, the more nodes access the same data, the more coherency traffic is required, and for n nodes the cost of this is O(n^2). Communication-intensive applications are best served by explicit message passing as supported by MPI or similar middleware. The real scalability test of a VSM system is whether it can support a large number of processors in the absence of contention.

More precisely, the system should not impose communication overhead on applications that do not communicate. vNUMA is designed precisely to achieve this: the coherency protocols ensure that pages which are only written by a single
core will be owned exclusively by that core, while read-only pages are shared. Coherency overheads only arise when
pages are shared, or change their mode. As such, the small-cluster scalability results we obtained from our vNUMA
prototype should be representative of parallel applications running on a small subset of cores of a large many-core
system; the main difference being that the many-core system should be more VSM-friendly than a cluster for the
reasons discussed in the introduction.

Furthermore, the coherency protocols ensure that message passing middleware like MPI on top of VSM should be
able to perform as well as without the VSM layer: as it never shares data, but copies it between nodes by explicit
messaging, it does not create coherency traffic. Hence, the VSM stays out of the way of software that does not need
it, but is there to support software that benefits from a shared memory model.

Related work
VSM is based on the ideas of DSM, pioneered by Ivy. Mirage moved DSM into the OS to improve transparency.
Munin utilised weaker memory consistency to support simultaneous writers. Disco carves a NUMA system into
multiple virtual SMP nodes for the benefit of existing operating systems that may not support a NUMA architecture.
This is, in a way, the opposite of VSM, which combines separate nodes into a single virtual NUMA system, allowing
a single operating system instance to span multiple nodes that do not share memory.

Since our initial publication on vNUMA, systems using similar ideas have emerged: Virtual Iron’s VFe hypervisor and the
Virtual Multiprocessor from the University of Tokyo. While these systems demonstrate combining virtualisation with
distributed shared memory, they are limited in scope and performance. Virtual Iron attempted to address some of the
performance issues by using high-end hardware, such as InfiniBand rather than Gigabit Ethernet, which effectively
makes the network more similar to what we expect from future many-cores. Virtual Iron has since abandoned the
product for commercial reasons, which largely seems to stem from its dependence on such high-end hardware.

More recently, startup company ScaleMP started to market their vSMP system, which seems similar in nature
(also uses InfiniBand). This supports our claim that there is on-going interest in SMP as a programming model on
distributed-memory hardware. Catamount partitions shared memory between nodes but makes remote partitions
available via virtual-memory mapping. Work on many-core scheduling is orthogonal to the VSM concept. Barrelfish

deals with many-core resource heterogeneity by making it explicit. While we agree that this is the best way to achieve
best performance, it only benefits applications that are designed to deal with explicit heterogeneity.

Conclusions
We made a case for virtual shared memory, i.e., a virtual-memory abstraction implemented over physically distributed
memories by a hypervisor, as an attractive model for managing future many core chips. Based on our experience
with a cluster-based prototype, we argue that VSM provides a shared-memory abstraction for software that needs
it, without imposing significant overheads on software that does not share (virtual) memory. We have argued that
the approach integrates well with the use of virtualisation for resource management on many-cores, and simplifies
dealing with faulty cores, faulty interconnects and heterogeneity. It may allow processor manufacturers to move
cache coherence protocols from hardware into software.

(Source: Heiser, Many-Core Chips — A Case for Virtual Shared Memory [Online] Available at: <http://ssrg.nicta.
com.au/publications/papers/Heiser_09a.pdf> [Accessed 27 June 2012])

Questions
1. Give a brief description about vNUMA.
2. Mention the VSM approaches that can provide some obvious benefits to processor manufacturers.
3. Enumerate the applications of VSM.


Application III
FOAM: Expanding the Horizons of Climate Modeling

Introduction
Climate modeling consistently has been among the most computationally demanding applications of scientific
computing, and an important initial consumer of advances in scientific computation. We report here on the Fast
Ocean-Atmosphere Model (FOAM), a new coupled climate model that continues in this tradition by using a
combination of new model formulation and parallel computing to expand the time horizon that may be addressed
by explicit fluid dynamical representations of the climate system.

Climate is the set of statistical properties that emerges at large temporal and spatial scales from well-known physical
principles acting at much smaller scales. Thus, successfully representing such large-scale climate phenomena
while specifying only small-scale physics, is an extremely computationally demanding endeavor. Until recently,
long-duration simulations with physically realistic models have been infeasible. Through improvements in model
formulation and in computational efficiency, however, our work breaks new ground in climate modeling.

Our progress in model formulation has been based on a new, efficient representation of the world ocean. Success in
this endeavor has further directed our efforts to establishing the minimum spatial resolution that can be considered
a successful representation of the earth’s climate for the purposes of understanding decade to century variability.

Our work on computational efficiency has focused on developing a coupled model that can execute efficiently
using message passing on massively parallel distributed-memory computer systems. Such systems have proved
powerful and cost-effective in those applications to which they can effectively be applied. The result of this work is
a coupled model that can sustain a simulation speed 6,000 times faster than real time on a moderate-sized parallel
computer, while representing coupled atmosphere-ocean dynamical processes of very long duration which are of
current scientific interest. In this important class of application, we have obtained a significant cost performance
advantage relative to leading contemporary climate models. We believe that the achieved throughput in terms of
wall clock time is the highest of any coupled general circulation model to date.

FOAM has already been used to obtain significant scientific results. Modes of variability with time scales on the
order of a century have been identified in the model. These results provide a basis for observational and theoretical
studies of climate dynamics that may permit climatologists to observe and explain such phenomena.

General circulation models
Models which are intended to explicitly represent the fluid dynamics of an entire planet starting from the equations
of fluid motion are known as general circulation models (GCMs). In the case of the earth, there are two distinct but
interacting global scale fluids, and hence three classes of GCM: atmospheric GCMs, oceanic GCMs, and coupled
ocean-atmosphere GCMs.

GCM calculation is essentially a time integration of a first-order-in-time, second-order-in-space weakly nonlinear set
of partial differential equations. As is typical in such problems, the system is represented by a discretisation in space
and time. The maximum time step of this class of computational model is approximately inversely proportional to the
spatial resolution, while the number of spatial points is inversely proportional to its square. Hence, the computational
cost, even without increases in vertical resolution which may be required, is roughly proportional to the inverse
cube of the horizontal spacing of represented points.

Since planets are large and the scale of fluid phenomena is small, representation in GCMs is quite coarse, typically
on a grid scale of hundreds of kilometers. Nevertheless, on the order of 10^10 floating point operations are required
at a typical modest resolution to represent the flow of the atmosphere for a single day. The calculations are further
complicated by the necessity to represent other physical processes that are inputs into the fluid dynamics, such
as radiative physics, cloud convective physics, and land surface interactions in atmospheric models, as well as
atmospheric forcing and equations of state in oceanic models.

Despite these daunting constraints, atmospheric and oceanic GCMs that succeed in representing the broad features
of the earth’s climate have been available for about two decades. Continuing refinement and demonstrable progress
have been evident in intervening years. These GCMs have a remarkable range of applications, which can usefully
and neatly be divided into two classes--those within and those outside the limits of dynamic predictability set by
the chaotic nature of nonlinear fluid dynamics.

In the first class of model, predictions or representations of specific flow configurations are sought. Weather models
are the best known such applications, but there are operational ocean models and research models of both the ocean
and atmosphere systems that fall into this class.

In the second class are climate applications, where the duration of the simulation is longer than the predictability of
the instantaneous flow patterns. Here, the interest is in representing the statistical properties of the flow rather than
its instantaneous details. Since prediction of specific fluid dynamical events is replaced by studies of the statistics
of such events, much longer durations must be calculated.

Meaningful climate statistics emerge after some years while the limit of dynamic predictability is at most tens of
days. Hence, the cost at a given spatial resolution is at least two orders of magnitude greater for climate applications
than for applications that simulate individual events within the limits of dynamic predictability. As a result, for a
given spatial resolution and set of represented physical processes, climate modeling is intrinsically a much more
computationally demanding application than weather modeling. Furthermore, additional physics may be required
in climate applications that may be left out of weather models.

In order to understand the implications of past or future simulated climates at particular locations, as well as to
represent the global dynamics more accurately, the push in climate modeling has been toward higher spatial resolution
as more computational resources become available. Still, integrations of climate models for about a century have
represented the maximum attainable until very recently.

A second push has been towards coupled models. There is a rough symmetry between atmosphere and ocean in
that each provides an important boundary condition for the other. Early long-duration models studied the ocean or
atmosphere in isolation, using observed data for the other boundary condition. Efforts to link models of the two
systems into a coupled model have been frustrated by the independent and, until recently, mutually inconsistent
representations of the physics of the air-sea boundary. Substantial progress in this area has been reported recently,
particularly at the National Center for Atmospheric Research (NCAR). Our project benefits directly from this
progress.

Interest in coupled modeling applications has been intense because there are variations in climate at all time scales,
many of which are poorly understood. Such phenomena cannot be represented by either atmospheric or oceanic
GCMs. The coupling between atmosphere and ocean, as well as with land and ice processes in some applications,
thus becomes critical.

Additionally, longer simulations are required than in more established climate modeling approaches, since the time
constants of the relevant processes are much longer, on the order of decades. Finally, there is enormous practical
and theoretical interest in transient climate responses to rapid changes in atmospheric conditions, such as changes in
atmospheric concentrations of radiatively active (‘‘greenhouse’’) gases and aerosol (airborne fine dust) distributions.
A few such runs have been performed [24], but in these cases it is difficult to separate intrinsic climate variability
from variability in response to the changes in atmospheric composition. To address this question rigorously would
require ensembles of similar runs, again multiplying the requisite computational resources substantially.

Thus, calculations of ten thousand years requiring on the order of 1010 floating point operations are of immediate
interest, even at the modest resolutions currently used for climate models.


Design Strategies for FOAM


The principal objective of the FOAM modeling effort is to study the long-duration variability of the climate system.
Variations in deep ocean circulation are believed to be the dominant mechanism for climate changes on long time
scales. In turn, these variations are forced by the spectrum of atmospheric phenomena, each individual event of
which occurs on much shorter time scales, and which in turn are the direct effect of phenomena which, while mostly
well understood, occur on time scales of minutes. It is this need to address processes occurring at multiple time
scales that leads to the high computational cost of climate models and that motivates the FOAM project focus on
improving model performance.

One measure of simulation performance is ‘‘model speedup,’’ i.e., simulated time per wall clock time. We have
adopted as our objective a speedup of 10,000 for a model capable of representing deep ocean dynamics using a fully
bi-directionally coupled ocean-atmosphere model. This level of performance as our goal would makes thousand-
year simulations practical on a routine basis.

We adopted several strategies to approach this goal. The simplest of these strategies is the sacrifice of spatial resolution
for expansion of available simulated time. Since computational cost is directly proportional to simulation duration,
but roughly proportional to the inverse cube of the horizontal spacing, simulated duration can be extended greatly
by using the lowest resolution that captures the phenomena of interest.

Investigations revealed that it was important to maintain a reasonable resolution within the ocean, due to the relatively
small scale of important ocean dynamics. However, we determined that a coarse representation of the atmosphere
is sufficient to represent multi-decadal coupled variability. In conventional coupled models, approximately equal
amounts of time are spent in the ocean and atmosphere; hence, reducing atmosphere resolution can make a large
difference to overall performance only if we are able to speed up the ocean simulation performance in some
other way. This observation led us to adopt as a principal algorithmic focus the improvement of the ocean model
efficiency in terms of the number of computations required per unit of simulated time. As we describe below, we
were successful in this endeavor; this success allowed for an excellent scaling of coupled performance from the
use of a low atmospheric resolution, since the ocean part of the model now accounts for only a small fraction of
the resources used.

A second strategy was to use established representations of system physics. As far as possible, we did not endeavor
to participate in the ongoing improvement of representation and parameterisation of the many relevant physical
processes of the climate system. Our objective was not to improve the representation of climate but to expand the
applicability of those representations.

A third strategy was to use massively parallel distributed-memory computing platforms, that is, computing platforms
built of relatively large numbers of processing units, each with a conventional memory and high speed message
links to other processors. This type of computer is well-suited to fluid dynamics applications, where low-latency,
high-bandwidth exchanges are necessary, but in a predictable sequence and thus subject to direct tuning. The
distributed-memory architecture avoids the hardware complexity of shared-memory configurations, improving cost
per performance and providing a straightforward hardware upgrade path.

A fourth component of our approach was to use the standard Message Passing Interface (MPI) to implement inter-
processor communication. The use of MPI not only facilitates the design of the communication-intensive parts of
the model, but also enhances our ability to run on a wide variety of platforms. While the coupled model has to date
been run only on IBM SP platforms, both the ocean and atmosphere models have been benchmarked on a variety
of machines. As processors continue to improve, migration to commercial mass market platforms connected by
commodity networks may further improve cost efficiency.

The fifth element of our strategy was to design an independent piece of code, the ‘‘coupler,’’ to link pre-existing
atmosphere and ocean models. This structure minimised the changes required for those already tested pieces of
software. The coupler also includes a water runoff model, representing river flows and thus allowing for a closed
hydrological cycle. Efficiency and parallel scalability, that is, the ability to make optimal use of large numbers of

processors, were paramount in these design efforts. The development of model physics was avoided by this project,
in that the relevant algorithms, with the important exception of the new surface hydrology routines, were imported
without modification. This incremental approach allowed us to focus our attention on the computationally demanding
issues of the fluid dynamics and parallelisation.

Components of FOAM
As noted above, FOAM comprises an atmosphere model, ocean model, and coupler. We describe each of these
components in turn.

The FOAM Atmosphere Model


The atmospheric component of FOAM is derived from the PCCM2 version of the NCAR Community Climate Model
(CCM2) plus, as we explain below, selected components from CCM3. PCCM2 is functionally equivalent to CCM2,
but has been adapted to support efficient execution on massively parallel distributed-memory computers.

Parallelisation of GCMs is generally accomplished by a one- or two-dimensional decomposition of the spatial


domain into sub-regions and assignment of those sub-regions to distinct processors. The parallelisation process
for CCM is complicated by the fact that some of the computation is performed in a spectral transform space. The
spectral transform approach has useful properties from a numerical methods point of view (avoiding aliasing,
accurate differentiation) at the cost of some complexity in sequential implementation. In a parallel implementation,
however, it also introduces a need for global communication. PCCM2 addresses these issues by incorporating
parallel spectral transform algorithms developed at Argonne and Oak Ridge National Laboratories that support the
use of several hundred or more processors, depending on model resolution. Additional modifications involved the
semi-Lagrangian representation of advection and techniques for load balancing.

Calculations in the third, vertical dimension, particularly those representing radiative transfer, are tightly coupled,
so there is relatively less advantage to a vertical decomposition. As a result, the physics processes in CCM2, which
occur entirely in vertical columns, are represented without any information exchange between processors. Thus no
changes to the relevant code were needed.

The principal modification to PCCM2 to support its use in FOAM was to replace the lower boundary condition
routine with new code responsible for transferring data to the coupler and hence to the ocean model. As we noted
above, FOAM uses a low-resolution atmosphere. The vertical coordinate is a hybrid of a terrain following coordinate
and pressure, with 18 vertical levels. We use a 15th order rhomboidal (R15) horizontal resolution; this corresponds
to 40 latitudes arranged in a Gaussian distribution centered on the equator and 48 longitudes for an average grid size
of 4.5 degrees of latitude and 7.5 degrees of longitude. FOAM uses a 30-minute time step and the recommended values of the diffusion coefficient for an R15 CCM2.

After we began building the coupled model, a new implementation of the Community Climate Model was released. The radiative and hydrologic changes to the physics from CCM2 to CCM3 have since been implemented
in FOAM. The bulk of the code of FOAM, including the remainder of the physics, remains that of CCM2.

The radiation parameterisation used in the FOAM atmosphere model is based on PCCM2 but includes the CCM3
additions and improvements. Several other CCM3 innovations have also been incorporated in FOAM. In CCM2,
all types of moist convection were handled by a simple mass flux scheme; in CCM3, and in FOAM, this scheme is used in conjunction with a deep convection parameterisation. The evaporation of stratiform precipitation is also now included, and the boundary layer model of CCM2 has been modified. Finally, the surface
fluxes over the ocean are now calculated using a diagnosed surface roughness which is a function of wind speed
and stability.


The FOAM Ocean Model


FOAM uses the parallel ocean model developed at the University of Wisconsin - Madison by Anderson and Tobis.
The dynamics of this model are similar to those of the Modular Ocean Model developed by the Geophysical Fluid
Dynamics Laboratory. As with PCCM2, the focus of our work was not in new representations of the ocean’s physical
properties but in efficient implementation for message-passing parallel platforms.

The ocean model uses a vertical mixing scheme with a steeper Reynolds-number dependency, consistent with observational analysis. The revised mixing values appear to improve the tropical Pacific SST field by reducing
the model cold bias in the west equatorial Pacific.

A simple, unstaggered Mercator 128 x 128 point grid is used, yielding a discretisation of approximately 1.4 degrees latitude by 2.8 degrees longitude. A spatial filter similar to the sort used in atmospheric models is used to maintain
numerical stability in the Arctic. The topography used is somewhat tuned to preserve basin topology at the represented
resolution but is not smoothed.

The vertical discretisation is with height, with a stretched vertical coordinate maximising resolution in the upper layers.
For the runs reported here, a sixteen layer version was used. The central importance of the surface thermodynamics
in the coupling process and the objective of a minimal computational load led to this choice in preference to an
isopycnal model.

Three separate techniques are used to speed the performance of the ocean model. Unlike in some ocean models, the
free surface is explicitly represented, but its dynamics are artificially slowed, an approach which has been shown to
make little difference to the internal motions. In addition, the still relatively fast, and therefore difficult to represent,
free surface is modeled as a separate two-dimensional system coupled to the internal ocean in a way that correctly
reproduces the free surface while allowing a much longer time step in the internal ocean. Finally, that internal time step itself is used only for the fastest parts of the internal dynamics, while a still longer step is used for diffusive and advective processes.
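
The multi-rate time stepping described above can be sketched as nested loops over three step sizes. The sketch below is illustrative only and assumes hypothetical update routines (step_free_surface, step_internal_dynamics, step_diffusion_advection); it is not the FOAM ocean code.

def advance_ocean(state, dt_slow, n_dyn_substeps, n_surf_substeps):
    """Advance the ocean by one slow step using three nested time steps."""
    dt_dyn = dt_slow / n_dyn_substeps        # internal-dynamics step
    dt_surf = dt_dyn / n_surf_substeps       # free-surface step (fastest)
    for _ in range(n_dyn_substeps):
        for _ in range(n_surf_substeps):
            state.step_free_surface(dt_surf)      # separate 2-D system
        state.step_internal_dynamics(dt_dyn)      # fast internal dynamics
    state.step_diffusion_advection(dt_slow)       # slow processes, longest step
    return state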

We believe that the combination of these techniques yields the most computationally efficient ocean model in
existence. That is, for a given resolution, we believe that this model requires fewer floating point operations to
integrate the ocean for a given time than any other model. When compared with other state-of-the-art ocean models,
this improvement corresponds to roughly a tenfold increase in the amount of simulated time represented per unit
of computation.

We have benchmarked the ocean code at 128 x 128 resolution on 64 SP2 nodes running at over 105,000 times
real time. The ocean model also scales well to higher resolutions, and other applications beside long-term climate
modeling are anticipated.

The FOAM Coupler


The separately developed atmosphere and ocean models are integrated into a functioning whole by a set of routines
called the coupler. The coupler is essentially a model of the land surface and atmosphere-ocean interface. The coupler
also handles the calculation of fluxes between the ocean and atmosphere, organises the exchange of information
between them, and calls a new parallel river model for routing the runoff found by the hydrology model to the
oceans.

The land surface in FOAM (and in CCM2) is represented by a four-layer diffusion model with heat capacities,
thicknesses and thermal conductivities specified for each layer. Soil types vary in the horizontal direction, with 5
distinct types derived from the vegetation data. Roughness lengths and albedos for two different radiation bands
are also specified. The fluxes of latent and sensible heat and momentum between the land and the atmosphere are
calculated using bulk transfer formulas with the stability-dependent coefficients of CCM2. Between
the ocean and the atmosphere, the new bulk transfer formulas of CCM3 are used. These are also stability dependent
but do not assume a constant roughness length.
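
To illustrate the kind of calculation referred to above, a generic bulk-aerodynamic formula for the sensible heat flux is sketched below; the actual CCM2/CCM3 schemes use more elaborate, stability-dependent exchange coefficients, so this is a hedged simplification rather than the model's formulation.

RHO_AIR = 1.2     # nominal air density, kg m^-3
CP_AIR = 1004.0   # specific heat of air, J kg^-1 K^-1

def sensible_heat_flux(c_h, wind_speed, t_surface, t_air):
    """Upward sensible heat flux (W m^-2) for a given exchange coefficient c_h."""
    return RHO_AIR * CP_AIR * c_h * wind_speed * (t_surface - t_air)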

The hydrology in FOAM is a simple box model. This model was an option in early versions of CCM2 and was
also present in CCM1. Precipitation is added to a 15 cm soil moisture box or, if the ground and the lowest two atmosphere levels are below freezing, to the snow cover. The soil moisture is used to calculate a wetness factor Dw used in the latent heat flux calculation (Dw equals 1 for land ice, sea ice, snow-covered, and ocean surfaces). Evaporation removes water from the box, and any excess over 15 cm is designated as runoff and sent to the river model. Snow
cover modifies the properties of the upper soil layer for purposes of the albedo and surface temperature calculations.
Snow melt is calculated and added to the local soil moisture. Snow depths greater than 1 m liquid water equivalent
are also sent to the river model to mimic the near-equilibrium of the Greenland and Antarctic ice sheets.
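
A minimal sketch of the 15 cm bucket hydrology described above is given below. The function and variable names are hypothetical, snow melt is omitted, and the wetness factor is simplified, so this illustrates the scheme rather than reproducing the FOAM code.

BUCKET_CAPACITY = 0.15   # m of liquid water (the 15 cm box)
SNOW_RUNOFF_DEPTH = 1.0  # m liquid-water equivalent sent to the river model

def step_hydrology(soil, snow, precip, evap, below_freezing):
    """Advance soil moisture and snow (all quantities in m of water)."""
    if below_freezing:
        snow += precip                  # precipitation accumulates as snow
    else:
        soil += precip                  # otherwise it enters the soil box
    soil = max(soil - evap, 0.0)        # evaporation removes water from the box
    runoff = max(soil - BUCKET_CAPACITY, 0.0)  # excess over 15 cm is runoff
    soil -= runoff
    if snow > SNOW_RUNOFF_DEPTH:        # mimic ice-sheet near-equilibrium
        runoff += snow - SNOW_RUNOFF_DEPTH
        snow = SNOW_RUNOFF_DEPTH
    wetness = 1.0 if snow > 0.0 else soil / BUCKET_CAPACITY  # wetness factor Dw
    return soil, snow, runoff, wetness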

Because fresh water cycling may be of interest in the phenomena to be studied by the model, and also to avoid long-
term ocean salinity drift, a closed hydrological cycle is implemented by the coupler, with a simple explicit river
model that results in a finite fresh water delay and a set of point sources (river mouths) for continental runoff. This
strategy enables the model to represent phenomena that couple variations in continental rainfall to delayed variations in ocean salinity, which in turn affect weather patterns through altered ocean circulation and hence altered sea surface temperatures.

The river model is based on the work of Miller et al. A similar implementation is also used in the coupled model of
Russell et al. First, a river flow direction is set for each land point. Although this can be automated by processing the topography file, in practice many of the river directions had to be set by hand so that the resulting basin boundaries resemble those observed.

The flow F in cubic meters per second out of a cell is F = V · (u/d), where V is the total river volume, equal to the
local runoff plus the sum of the flow from up to seven of the eight neighboring cells, u is an effective flow velocity
which is taken as a constant 0.35 meters per second, and d is the downstream distance. Precipitation and evaporation
do not act directly on the river water and the temperature of the river water is not taken into account. V for an ocean
point near the coast is then calculated as the sum of the outflow from neighboring land points and converted back to
a flux by dividing by the area of that ocean point. This river freshwater flux is then added to the local precipitation
and evaporation rates to form the total freshwater flux at that point and close the hydrologic cycle.
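
The routing rule F = V · (u/d) for a single cell might be sketched as follows; the inputs (local runoff, upstream inflows, downstream distance) are hypothetical placeholders rather than FOAM data structures.

U_FLOW = 0.35  # constant effective flow velocity u, m s^-1

def river_outflow(local_runoff, upstream_inflows, downstream_distance):
    """Outflow F (m^3 s^-1) from one river-model cell: F = V * (u / d)."""
    volume = local_runoff + sum(upstream_inflows)   # total river volume V
    return volume * (U_FLOW / downstream_distance)  # F = V * (u / d)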

The temperature of the sea ice is determined by treating it as another soil type. The sea surface may continue to lose heat by conduction to the lowest ice layer, so the ocean model clamps the sea surface temperature at -1.92
degrees Celsius. Sea ice roughness and albedos are prescribed. For the hydrologic cycle, the formation of sea ice
is treated as a flux of 2 m of water out of the ocean. The stress between the ice and the atmosphere is arbitrarily
divided by 15 before passing to the ocean model.


Fig. 1 FOAM coupler tiles


The problem of representing atmosphere-ocean interactions is complicated by the fact that these two components
are represented on different spatial grids. Couplers must represent these fluxes consistently in both domains. FOAM
handles this in a simple but highly efficient way, as illustrated in Fig. 1. The model represents the globe as being
divided into two grids, one for the atmosphere and another for the ocean. A third decomposition of the surface is
constructed by laying one grid on top of the other, as shown schematically in Fig. 1(a). The atmosphere/ocean
exchanges, which depend on the properties of both, are calculated for each piece of this overlap grid and are then
averaged for passing back to the ocean (Fig. 1(b), region i) and the atmosphere (Fig. 1(b), region ii). No effort
is made to interpolate all state variables to a single grid, thus greatly simplifying the task of maintaining consistent
representations on both grids.
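
The overlap-grid averaging step can be illustrated with a short sketch: each overlap cell carries a flux and an area, and the fluxes are area-averaged onto whichever coarse cell (ocean or atmosphere) contains them. The data structures below are assumptions made for illustration, not FOAM's.

def average_to_coarse_grid(overlap_fluxes, overlap_areas, parent_cell):
    """Area-average per-overlap-cell fluxes onto their parent coarse cells.

    overlap_fluxes[k], overlap_areas[k]: flux and area of overlap cell k;
    parent_cell[k]: index of the coarse (ocean or atmosphere) cell containing k.
    """
    totals, areas = {}, {}
    for k, flux in enumerate(overlap_fluxes):
        c = parent_cell[k]
        totals[c] = totals.get(c, 0.0) + flux * overlap_areas[k]
        areas[c] = areas.get(c, 0.0) + overlap_areas[k]
    return {c: totals[c] / areas[c] for c in totals}  # area-weighted averages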

Running FOAM
We used FOAM to perform a series of long-duration simulations with the goal of determining whether a low resolution
atmosphere could yield a reasonable representation of coupled ocean-atmosphere phenomena. We found that the
coarse R15 atmospheric resolution sufficed to yield a realistic ocean circulation. Although R15 is an extremely
coarse resolution (40 latitudes and 48 longitudes), it still requires approximately 16 times as much processor time
as our ocean with 128 x 128 resolution on IBM SP platforms. This difference in execution time is attributable to
the relatively complicated atmospheric physics code and to the efficiency of our ocean model. Accordingly, we
typically run on 17 or 34 nodes, with 1 or 2 of those processors, respectively, dedicated to the ocean. To optimise
inter-processor communication, the coupler runs on the same nodes as the atmosphere.

Fig. 2 Time allocation for a typical FOAM run: Horizontal axis is labelled in seconds. Each bar represents
a single SP processor.

Fig. 2 shows the allocation of computational resources as a function of time for a typical 17 node run performing the
calculations for one simulated day. The bulk of the computation is allocated to the atmosphere implementation. The
ocean time step is six hours, so the ocean is called four times per simulated day. The faster atmospheric dynamics must
be represented on a half-hour time step, called 48 times. Twice per day, the radiative properties of the atmosphere
are recalculated, yielding particularly long atmosphere steps.
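
The call cadence implied by Fig. 2 might be sketched as follows for one simulated day; the component objects and their methods are hypothetical placeholders, not FOAM routines.

def simulate_one_day(atmosphere, ocean, coupler):
    """One simulated day: 48 half-hour atmosphere steps, 4 ocean steps."""
    for step in range(48):
        if step % 24 == 0:                    # twice per simulated day
            atmosphere.update_radiation()     # the particularly long steps
        atmosphere.step()                     # 30-minute dynamics/physics step
        if (step + 1) % 12 == 0:              # every 6 simulated hours
            coupler.exchange_fluxes(atmosphere, ocean)
            ocean.step()                      # ocean called 4 times per day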

All atmosphere nodes must integrate synchronously, as their results are dependent on results of neighboring
processors. This is seen by the simultaneous exit of all atmosphere processors from the coupler routine, which handles
all communication. The fact that all processors do not enter the coupler at the same time indicates imperfect load
balancing in the atmosphere calculations, typically because cloud distributions are not uniform. It is seen that one
ocean processor has no difficulty keeping up with 16 atmosphere processors, but that it cannot keep up with 32.

We are still tuning the code for performance. To date, our best performance has been approximately 6,000 times real
time in a run on 68 nodes of an IBM SP2 using 120 MHz P2SC processors. However, because of constraints on the domain decomposition used in low-resolution applications of PCCM2, this represents poor scaling relative to our production runs. We have seen almost linear scaling on 8, 16, and 32 atmosphere processors,
which is what we normally use.

We typically achieve peak performance of more than 4,000 times real time on 34 nodes. (This lack of scaling to 68
nodes is due to limitations in the spatial decomposition technique as applied to the low atmosphere resolution we
use.)

In practice, with the production of large output files and the sharing of the platform with other users, we are
producing results, sustained over months of wall-clock time, at over 2,000 times real time, and have established that we can scale this
up significantly with the availability of additional resources at contemporary levels of technology. We are hopeful
that even better results may be obtained as the platform configuration is tuned for scientific applications. We are
also investigating parallelisation of the input and output to further increase our efficiency.
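
As a rough illustration of what such throughput figures mean in practice (simple arithmetic, not a measurement), the wall-clock time needed for a long run can be estimated as follows:

def wallclock_days(simulated_years, speedup_over_real_time):
    """Wall-clock days needed to integrate the given number of simulated years."""
    return simulated_years * 365.25 / speedup_over_real_time

# e.g., a 500-year run at the quoted 2,000 times real time:
print(wallclock_days(500, 2000))   # roughly 91 days of wall-clock time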

The performance of FOAM can be compared directly with that of the NCAR CSM coupled model, which achieves only a third of FOAM’s maximum throughput using 16 nodes of a Cray C90. Our principal time advantages are in the
extremely effective ocean code and in the reduced resolution of the atmosphere, which remains adequate to capture
decadal ocean variability.


A further advantage of FOAM lies in the lower cost and complexity of the distributed memory systems on which
FOAM is executed. While determining the true cost of supercomputers is difficult, we estimate that the cost per
unit of performance of FOAM is already more than ten times better than that of other current models of the same
phenomena.

Results and Refinements


Initial simulation results with FOAM, performed with CCM2 physics, were somewhat discouraging. In particular,
the tropical Pacific, an important region for climate variability because of the strong phenomenon known as El Nino,
was poorly represented. A more detailed representation of ocean mixing processes helped only slightly.

Near this point in the development of our model, NCAR released a new version of their Community Climate Model,
CCM3. We were fortunate in that the software interfaces to updated physics routines were largely unchanged. We
found that including the new CCM3 moisture physics into our model vastly improved its representation of the
tropical Pacific.

Fig. 3 Sea surface temperature patterns (degrees C): (a) FOAM output, (b) observations (see text), and (c) model minus observations

Annual average sea surface temperature as modeled by FOAM with the improved CCM3 moisture physics is shown in
Fig. 3(a). For comparison, observational data is shown in Fig. 3(b) and the difference between the true and modeled
fields is shown in Fig. 3(c). The broad features of the temperature field are captured, though the tight gradients in
western boundary currents such as the Gulf Stream and the Kuroshio are somewhat smeared.

Except in the Antarctic Ocean, the results are comparable to those obtained with higher resolution atmospheres, less
efficient ocean models, and more expensive computational platforms. The errors in the Antarctic are attributable to
the crude representation of sea ice that we currently use. Updating this part of the model is currently a high priority.
The otherwise good agreement indicates that we are capturing the large scale features of ocean circulation.

Fig. 4 Two-basin variability: scales are arbitrary, but their product is the anomaly sea surface temperature amplitude as a function of space and time, in degrees C: (a) spatial pattern, and (b) temporal pattern

We have now successfully run the model for over 500 simulated years, and our first results regarding low frequency
variability of the coupled system are emerging. Fig. 4 shows a pattern (obtained by VARIMAX rotation of empirical
orthogonal function decomposition) that accounts for fully 15 percent of 60 month low-pass filtered variance in sea
surface temperature. The associated time series is also shown, indicating the long time scale of this phenomenon.
This correlation between the North Atlantic and the North Pacific, until recently unanticipated, corroborates recent model and observational results by Latif and Barnett.
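
The empirical orthogonal function decomposition mentioned above can be sketched, in unrotated form, via the singular value decomposition; the published analysis additionally applies 60-month low-pass filtering and a VARIMAX rotation, which are omitted in this illustrative sketch.

import numpy as np

def leading_eofs(sst_anomalies, n_modes=3):
    """sst_anomalies: array of shape (n_times, n_gridpoints), time mean removed."""
    u, s, vt = np.linalg.svd(sst_anomalies, full_matrices=False)
    variance_fraction = s**2 / np.sum(s**2)        # variance explained per mode
    patterns = vt[:n_modes]                        # spatial patterns (EOFs)
    time_series = u[:, :n_modes] * s[:n_modes]     # associated time series
    return patterns, time_series, variance_fraction[:n_modes]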

Conclusions
The FOAM project has accomplished a substantial improvement in the performance that can be achieved by coupled
climate models, without sacrificing physical realism. While reducing atmosphere resolution, it maintains a good
representation of the ocean. The result is that the model is able to identify and study phenomena of interest on
decadal and century time scales, and has succeeded in replicating recent model and observational results. In addition,
the model is able to exploit parallel computer systems that offer improved cost performance relative to the vector
supercomputers that have traditionally been used for such simulations. Given these successes, FOAM may now be
used for its intended purpose, to implement very long simulations for studying variability on the longest time scales.
Currently, we are performing these simulations on multi-computers such as the IBM SP.


In the longer term, we intend to examine the feasibility of using PC clusters to improve cost performance yet further.
We also hope to exploit high-speed networks to expand the utility of the model, by enabling remote browsing of
the large datasets generated by FOAM, hence making these datasets more accessible to the community, and by
using remote I/O techniques to enable seamless execution on remote computers, with files maintained at a central
location.

Since von Neumann used a weather model as his first test case of scientific computing, leading developments of scientific computing platforms have been put to some of their earliest tests by meteorological applications. This
will continue to be the case for the foreseeable future. The vast nature of the physical system will allow it to make
use of whatever resources become available. The ultimate goal of very high resolution, very long duration, and
very complete models remains far in the future. In the meantime, the FOAM project will endeavor to continue to
provide the first glimpses of climate variability on the longest time scales.

(Source: FOAM: Expanding the Horizons of Climate Modeling [pdf] Available at: <http://www.stat.cmu.
edu/~cschafer/Pubs/tobis97foam.pdf> [Accessed 22 June 2012])

Questions
1. Give a brief description about FOAM Coupler.
2. Define general circulation models (GCM).
3. Explain the FOAM Ocean Model.

Bibliography
References

• 2006. Moore’s Law [Video Online] Available at: <http://www.youtube.com/watch?v=XvaQcuLr2cE> [Accessed 21 June 2012].
• 2008. Generation’s of computer (HQ) [Video Online] Available at: <http://www.youtube.com/watch?v=7rkG
FqEfdJk&feature=related> [Accessed 21 June 2012].
• 2008. Lecture -2 History of Computers [Video Online] Available at: <http://www.youtube.com/watch?v=TS2o
dp6rQHU&feature=results_main&playnext=1&list=PLF33FAF1A694F4F69 > [Accessed 21 June 2012].
• 2009. Algorithms Lesson 3: Merge Sort [Video Online] Available at: <http://www.youtube.com/
watch?v=GCae1WNvnZM> [Accessed 21 June 2012].
• 2009. Computer Networking Tutorial - 3 - Network Topology [Video Online] Available at: <http://www.youtube.
com/watch?v=kfEDPQAYH4k> [Accessed 21 June 2012].
• 2009. Lec-7 Pipeline Concept-I [Video Online] Available at: <http://www.youtube.com/watch?v=AXgfeV568
c8&feature=results_video&playnext=1&list=PLD4E8A4E592F7A7D8> [Accessed 21 June 2012].
• 2010, Highly Scalable Parallel Sorting [pdf] Available at: <http://charm.cs.illinois.edu/talks/SortingIPDPS10.
pdf> [Accessed 21 June 2012].
• 2011. 00 11 Message Passing [Video Online] Available at: <http://www.youtube.com/watch?v=c5NKVAPf2OE>
[Accessed 21 June 2012].
• 2011. High-Performance Computing - Episode 1 - Introducing MPI [Video Online] Available at: <http://www.
youtube.com/watch?v=kHV6wmG35po> [Accessed 21 June 2012].
• 2011. Intro to Computer Architecture [Video Online] Available at: <http://www.youtube.com/watch?v=HEjPop-
aK_w> [Accessed 21 June 2012].
• 2011. Parallel Vs. Serial [Video Online] Available at: <http://www.youtube.com/watch?v=Jeo83akN44o>
[Accessed 21 June 2012].
• 2011. x64 Assembly and C++ Tutorial 38: Intro to Single Instruction Multiple Data [Video Online] Available
at: <http://www.youtube.com/watch?v=cbL88Ic6uPw > [Accessed 21 June 2012].
• 2012. 4 Exploiting Instruction Level Parallelism [Video Online] Available at: <http://www.youtube.com/
watch?v=54E9LGG1hnQ> [Accessed 21 June 2012].
• 2012. Amdahl’s Law [Video Online] Available at: <http://www.youtube.com/watch?v=r7Ffc4WOLb8> [Accessed
21 June 2012].
• 2012. Message Passing Algorithms - SixtySec [Video Online] Available at: <http://www.youtube.com/
watch?v=7IdLzEoiPY4> [Accessed 21 June 2012].
• 2012. Network Topology [Video Online] Available at: <http://www.youtube.com/watch?v=POkzLHoZJ0Y>
[Accessed 21 June 2012].
• 2012. Radix Sort Tutorial [Video Online] Available at: <http://www.youtube.com/watch?v=xhr26ia4k38>
[Accessed 21 June 2012].
• Arnaoudova, E., Brief History of Computer Architecture [pdf] Available at: <http://www.mgnet.org/~douglas/
Classes/cs521/arch/ComputerArch2005.pdf > [Accessed 21 June 2012].
• Balaauw, 1997. Computer Architecture: Concepts And Evolution, Pearson Education India.
• Barney, B., Introduction to Parallel Computing [Online] Available at: <https://computing.llnl.gov/tutorials/
parallel_comp/> [Accessed 21 June 2012].
• Bhujade, M. R., 1995. Parallel Computing, New Age International.
• Buyya, R., Parallel Programming Models and Paradigms [pdf] Available at: <http://www.buyya.com/cluster/
v2chap1.pdf> [Accessed 21 June 2012].


• Chandra, R. R., Modern Computer Architecture, Galgotia Publications.
• Dr. Dobb’s, Designing Parallel Algorithms: Part 1 [Online] Available at: <http://www.drdobbs.com/article/pr
int?articleId=223100878&siteSectionName=parallel> [Accessed 21 June 2012].
• Feitelson, G., 2002. Job Scheduling Strategies for Parallel Processing, Springer.
• Grama, 2010. An Introduction to Parallel Computing: Design and Analysis of Algorithms, 2/e, Pearson Education
India.
• Introduction to Parallel Computing [Online] Available at: <https://computing.llnl.gov/tutorials/parallel_
comp/#Whatis > [Accessed 21 June 2012].
• Lastovetsky, A., 2003. Parallel Computing on Heterogeneous Networks, John Wiley & Sons.
• Learning Computing History [Online] Available at: <http://www.comphist.org/computing_history/new_page_4.
htm> [Accessed 21 June 2012].
• Message-Passing Programming [pdf] Available at: <http://www.cs.umsl.edu/~sanjiv/classes/cs5740/lectures/
mpi.pdf> [Accessed 21 June 2012].
• Pacheco, P., 2011. An Introduction to Parallel Programming, Elsevier.
• Padua, D., 2011. Encyclopedia of Parallel Computing, Volume 4, Springer.
• Parallel Computer Taxonomy - Conclusion [Online] Available at: <http://www.gigaflop.demon.co.uk/comp/
chapt8.htm> [Accessed 21 June 2012].
• Parallel Computing [Online] Available at: <http://www.cs.ucf.edu/courses/cot4810/fall04/presentations/
Parallel_Computing.ppt> [Accessed 21 June 2012].
• Physical Organization of Parallel Platforms [Online] Available at: <http://parallelcomp.uw.hu/ch02lev1sec4.
html> [Accessed 21 June 2012].
• Sarkar, V., 2008. Programming Using the Message Passing Paradigm (Chapter 6) [pdf] Available at: <http://
www.cs.rice.edu/~vs3/comp422/lecture-notes/comp422-lec16-s08-v2.pdf> [Accessed 21 June 2012].
• Sengupta, J., 2005. Interconnection Networks For Parallel Processing, Deep and Deep Publications.
• Sorting [pdf] Available at: <http://www.corelab.ntua.gr/courses/parallel.postgrad/Sorting.pdf> [Accessed 21
June 2012].
• Springer, Lecture 3 Interconnection Networks for Parallel Computers [Online] Available at: <http://www.cs.csi.
cuny.edu/~yumei/csc770/Lec3-InnerconnectionNetworks.ppt> [Accessed 21 June 2012].
• Thiébaut, D., Parallel Programming in C for the Transputer [Online] Available at: <http://maven.smith.
edu/~thiebaut/transputer/chapter8/chap8-2.html> [Accessed 21 June 2012].
• Wittwer, T., Introduction to Parallel Programming [Online] Available at: <http://www.scribd.com/doc/23585346/
An-Introduction-to-Parallel-Programming> [Accessed 21 June 2012].

Recommended Reading
• Anita, G., 2011. Computer Fundamentals, Pearson Education India.
• Culler, D. E., Singh, J. & Gupta, A., 1999. Parallel Computer Architecture: A Hardware/Software Approach,
Gulf Professional Publishing.
• Duato, 2003. Interconnection Networks: An Engineering Approach, Morgan Kaufmann.
• Foster, I., 1995. Designing and Building Parallel Programs: Concepts and Tools for Parallel Software
Engineering, Addison-Wesley.
• Gebali, F., 2011. Algorithms and Parallel Computing, John Wiley & Sons.
• Gropp, W., Lusk, E. & Skjellum, A., 2006. Using Mpi: Portable Parallel Programming With the Message-
Passing Interface, Volume 1, MIT Press.
• Hwang, 2003. Advanced Computer Architecture, Tata McGraw-Hill Education.
• Joubert, G. R., 2004. Parallel Computing: Software Technology, Algorithms, Architectures and Applications,
Elsevier.

• Lafferty, E. L., 1993. Parallel Computing: An Introduction, William Andrew.
• Null, L. & Lobur, J., 2010. The Essentials of Computer Organization and Architecture, Jones & Bartlett
Publishers.
• Pacheco, P., 2011. An Introduction to Parallel Programming, Elsevier.
• Padua, D., 2011. Encyclopedia of Parallel Computing, Volume 4, Springer.
• Quinn, 2003. Parallel Programming In C With Mpi And Open Mp, Tata McGraw-Hill Education.
• Rauber, T. & Rünger, G., 2010. Parallel Programming: For Multicore and Cluster Systems, Springer.
• Roosta, S. H., 2000. Parallel Processing and Parallel Algorithms: Theory and Computation, Springer.
• Tokhi, 2003. Parallel Computing for Real-Time Signal Processing and Control, Springer.
• Treleaven, P. C. & Vanneschi, M., 1987. Future Parallel Computers: An Advanced Course, Pisa, Italy, June
9-20, 1986, Proceedings, Springer.


Self Assessment Answers


Chapter I
1. b
2. a
3. d
4. c
5. c
6. b
7. a
8. c
9. c
10. c

Chapter II
1. a
2. a
3. b
4. c
5. b
6. a
7. d
8. b
9. a
10. c

Chapter III
1. b
2. a
3. c
4. d
5. a
6. b
7. d
8. c
9. a
10. a

Chapter IV
1. a
2. b
3. c
4. d
5. a
6. a
7. a
8. a
9. b
10. a

Chapter V
1. d
2. a
3. b
4. c
5. c
6. a
7. b
8. a
9. a
10. d

Chapter VI
1. a
2. c
3. b
4. a
5. c
6. a
7. a
8. b
9. a
10. a

Chapter VII
1. b
2. a
3. a
4. c
5. a
6. a
7. b
8. a
9. a
10. b

Chapter VIII
1. b
2. b
3. a
4. a
5. c
6. a
7. a
8. c
9. b
10. a

