
Technische Universität München

Parallel Programming
and High-Performance Computing
Part 1: Introduction

Dr. Ralf-Peter Mundani


CeSIM / IGSSE
Technische Universität München

1 Introduction
General Remarks
• materials: http://www5.in.tum.de/lehre/vorlesungen/parhpp/SS08/

• Ralf-Peter Mundani
– email mundani@tum.de, phone 289–25057, room 3181 (city centre)
– consultation-hour: Tuesday, 4:00—6:00 pm (room 02.05.058)
• Ioan Lucian Muntean
– email muntean@in.tum.de, phone 289–18692, room 02.05.059

• lecture (2 SWS)
– weekly
– Tuesday, start at 12:15 pm, room 02.07.023

• exercises (1 SWS)
– fortnightly
– Wednesday, start at 4:45 pm, room 02.07.023

• content
– part 1: introduction
– part 2: high-performance networks
– part 3: foundations
– part 4: programming memory-coupled systems
– part 5: programming message-coupled systems
– part 6: dynamic load balancing
– part 7: examples of parallel algorithms

Overview
• motivation
• classification of parallel computers
• levels of parallelism
• quantitative performance evaluation

“I think there is a world market for maybe five computers.”
—Thomas Watson, chairman of IBM, 1943

Motivation
• numerical simulation: from phenomena to predictions
– starting point: physical phenomenon / technical process
1. modelling: determination of parameters, expression of relations
2. numerical treatment: model discretisation, algorithm development
3. implementation: software development, parallelisation
4. visualisation: illustration of abstract simulation results
5. validation: comparison of results with reality
6. embedding: insertion into working process
– involved disciplines: mathematics, computer science, application

• why parallel programming and HPC?
– complex problems (especially the so-called “grand challenges”) demand
more computing power
• climate or geophysics simulation (tsunami, e. g.)
• structure or flow simulation (crash test, e. g.)
• development systems (CAD, e. g.)
• large data analysis (Large Hadron Collider at CERN, e. g.)
• military applications (crypto analysis, e. g.)
• …
– performance increase due to
• faster hardware, more memory (“work harder”)
• more efficient algorithms, optimisation (“work smarter”)
• parallel computing (“get some help”)

• objectives (assuming all resources were available N times over)
– throughput: compute N problems simultaneously
• running N instances of a sequential program with different data sets
(“embarrassing parallelism”); SETI@home, e. g.
• drawback: limited resources of single nodes
– response time: compute one problem at a fraction (1/N) of time
• running one instance (i. e. N processes) of a parallel program for
jointly solving a problem; finding prime numbers, e. g.
• drawback: writing a parallel program; communication
– problem size: compute one problem with N-times larger data
• running one instance (i. e. N processes) of a parallel program, using
the sum of all local memories for computing larger problem sizes;
iterative solution of SLE, e. g.
• drawback: writing a parallel program; communication

Classification of Parallel Computers
• definition: “A collection of processing elements that communicate and
cooperate to solve large problems” (ALMASI and GOTTLIEB, 1989)
• possible appearances of such processing elements
– specialised units (steps of a vector pipeline, e. g.)
– parallel features in modern monoprocessors (superscalar architectures,
instruction pipelining, VLIW, multithreading, multicore, …)
– several uniform arithmetical units (processing elements of array
computers, e. g.)
– processors of a multiprocessor computer (i. e. the actual parallel
computers)
– complete stand-alone computers connected via LAN (work station or PC
clusters, so called virtual parallel computers)
– parallel computers or clusters connected via WAN (so called
metacomputers)

• reminder: dual core, quad core, manycore, and multicore
– observation: frequency (and thus core voltage) has been increasing over the past years
– problem: thermal power dissipation increases linearly in frequency and
with the square of the core voltage

• reminder: dual core, quad core, manycore, and multicore (cont’d)
– 25% reduction in frequency (and thus core voltage) leads to 50%
reduction in dissipation

[chart: dissipation and performance of a normal CPU compared to a CPU with reduced frequency and core voltage]

• reminder: dual core, quad core, manycore, and multicore (cont’d)
– idea: installation of two cores per die with the same dissipation as the
single-core system

[chart: dissipation and performance of a single-core die compared to a dual-core die running at reduced frequency]
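The argument on the last two slides can be made concrete with a rough back-of-the-envelope model. The sketch below assumes dynamic dissipation proportional to f ⋅ V² with the core voltage scaling roughly with the frequency f; the numbers are illustrative only, not vendor data.

    /* rough power/performance model behind the dual-core argument          */
    /* assumption: dynamic dissipation ~ f * V^2, core voltage V scales ~ f */
    #include <stdio.h>

    static double rel_dissipation(double f)   /* relative to full-speed core */
    {
        return f * f * f;                     /* f * (V ~ f)^2               */
    }

    int main(void)
    {
        double f = 0.75;   /* 25% reduction in frequency and core voltage    */
        printf("reduced core: performance %.2f, dissipation %.2f\n",
               f, rel_dissipation(f));
        printf("dual core:    performance %.2f, dissipation %.2f\n",
               2.0 * f, 2.0 * rel_dissipation(f));
        return 0;
    }

Under these assumptions a dual-core die at 75% frequency delivers roughly 1.5 times the performance of the original single core while staying in the region of the original power budget.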

• commercial parallel computers
– manufacturers: starting from 1983, big players and small start-ups (see
table; “out of business” means no longer in the parallel business)
– names have been coming and going rapidly
– in addition: several manufacturers of vector computers and non-standard
architectures

company country year status in 2003


Sequent U.S. 1984 acquired by IBM
Intel U.S. 1984 out of business
Meiko U.K. 1985 bankrupt
nCUBE U.S. 1985 out of business
Parsytec Germany 1985 out of business
Alliant U.S. 1985 bankrupt

• commercial parallel computers (cont’d)

company country year status in 2003


Encore U.S. 1986 out of business
Floating Point Systems U.S. 1986 acquired by SUN
Myrias Canada 1987 out of business
Ametek U.S. 1987 out of business
Silicon Graphics U.S. 1988 active
C-DAC India 1991 active
Kendall Square Research U.S. 1992 bankrupt
IBM U.S. 1993 active
NEC Japan 1993 active
SUN Microsystems U.S. 1993 active
Cray Research U.S. 1993 active

• arrival of clusters
– in the late eighties, PCs became a commodity market with rapidly
increasing performance, mass production, and decreasing prices
– this made commodity hardware increasingly attractive for building parallel computers
– 1994: Beowulf, the first parallel computer built completely out of
commodity hardware
• NASA Goddard Space Flight Centre
• 16 Intel DX4 processors
• multiple 10 Mbit Ethernet links
• Linux with GNU compilers
• MPI library
– 1996: Beowulf cluster performing more than 1 GFlops
– 1997: a 140-node cluster performing more than 10 GFlops

Dr. Ralf-Peter Mundani - Parallel Programming and High-Performance Computing - Summer Term 2008 1−15
Technische Universität München

1 Introduction
Classification of Parallel Computers
• arrival of clusters (cont’d)
– 2005: InfiniBand cluster at TUM
• 36 Opteron nodes (quad boards)
• 4 Itanium nodes (quad boards)
• 4 Xeon nodes (dual boards) for interactive tasks
• InfiniBand 4× Switch, 96 ports
• Linux (SuSE and Redhat)

• supercomputers
– supercomputing or high-performance scientific computing as the most
important application of the big number crunchers
– national initiatives due to huge budget requirements
• Accelerated Strategic Computing Initiative (ASCI) in the U.S.
– in the wake of the nuclear testing moratorium in 1992/93
– decision: develop, build, and install a series of five
supercomputers of up to $100 million each in the U.S.
– start: ASCI Red (1997, Intel-based, Sandia National Laboratory,
the world’s first TFlops computer)
– then: ASCI Blue Pacific (1998, LLNL), ASCI Blue Mountain, ASCI
White, …
• meanwhile new high-end computing memorandum (2004)

• supercomputers (cont’d)
– federal “Bundeshöchstleistungsrechner” initiative in Germany
• decision in the mid-nineties
• three federal supercomputing centres in Germany (Munich,
Stuttgart, and Jülich)
• one new installation every second year (i. e. a six-year upgrade cycle
for each centre)
• the newest one to be among the top 10 of the world
– overview and state of the art: Top500 list (updated every six months), see
http://www.top500.org

• MOORE’s law
– observation by Intel co-founder Gordon E. MOORE (1965), describing an
important trend in the history of computer hardware
– the number of transistors that can be placed on an integrated circuit
increases exponentially, doubling approximately every two years
• some numbers: Top500 [charts from the Top500 list, not reproduced here]
– cluster: #nodes > #processors/node
– constellation: #nodes < #processors/node
• The Earth Simulator – world’s #1 from 2002—04
– installed in 2002 in Yokohama, Japan
– ES-building (approx. 50m × 65m × 17m)
– based on NEC SX-6 architecture
– developed by three governmental agencies
– highly parallel vector supercomputer
– consists of 640 nodes (plus 2 control and 128 data-switching nodes)
• 8 vector processors (8 GFlops each) per node
• 16 GB shared memory per node
⇒ 5120 processors (40.96 TFlops peak performance) and 10 TB
memory; 35.86 TFlops sustained performance (Linpack)
– nodes connected by a 640×640 single-stage crossbar (83,200 cables with
a total length of 2,400 km; 8 TBps total bandwidth)
– further 700 TB disc space and 1.6 PB mass storage

• BlueGene/L – world’s #1 since 2004
– installed in 2005 at LLNL, CA, USA
(beta-system in 2004 at IBM)
– cooperation of DoE, LLNL, and IBM
– massive parallel supercomputer
– consists of 65,536 nodes (plus 12 front-end and 1,024 I/O nodes)
• 2 PowerPC 440d processors (2.8 GFlops each) per node
• 512 MB memory per node
⇒ 131,072 processors (367 TFlops peak performance) and
33.5 TB memory; 280.6 TFlops sustained performance (Linpack)
– nodes configured as 3D torus (32 × 32 × 64); global reduction tree for
fast operations (global max / sum) in a few microseconds
– 1024 Gbps link to global parallel file system
– further 806 TB disc space; operating system SuSE SLES 9

• HLRB II (world’s #6 for 04/2006)
– installed in 2006 at LRZ, Garching
– installation costs 38 M€
– monthly costs approx. 400,000 €
– upgrade in 2007 (finished)
– one of Germany’s 3 supercomputers
– SGI Altix 4700
– consists of 19 nodes (SGI NUMA link 2D torus)
• 256 blades per node (ccNUMA link with partition fat tree)
– Intel Itanium2 Montecito Dual Core (12.8 GFlops) per blade
– 4 GB memory per core
⇒ 9728 processor cores (62.3 TFlops peak performance) and 39 TB
memory; 56.5 TFlops sustained performance (Linpack)
– footprint 24m × 12m; total weight 103 metric tons
• standard classification according to FLYNN
– global data and instruction streams as criterion
• instruction stream: sequence of commands to be executed
• data stream: sequence of data subject to instruction streams
– two-dimensional subdivision according to
• the number of instructions a computer can execute per unit of time
• the number of data elements a computer can process per unit of time
– hence, FLYNN distinguishes four classes of architectures
• SISD: single instruction, single data
• SIMD: single instruction, multiple data
• MISD: multiple instruction, single data
• MIMD: multiple instruction, multiple data
– drawback: very different computers may belong to the same class

• standard classification according to FLYNN (cont’d)
– SISD
• one processing unit that has access to one data memory and to one
program memory
• classical monoprocessor following VON NEUMANN’s principle

[diagram: one processor connected to one data memory and one program memory]

• standard classification according to FLYNN (cont’d)
– SIMD
• several processing units, each with separate access to a (shared or
distributed) data memory; one program memory
• synchronous execution of instructions
• example: array computer, vector computer
• advantages: easy programming model due to control flow with a
strict synchronous-parallel execution of all instructions
• drawbacks: specialised hardware necessary, easily becomes outdated
due to rapid developments on the commodity market
[diagram: several processors, each with its own data memory, all driven by one program memory]
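As a minimal illustration of the SIMD idea in source code: the loop below applies one instruction stream to many data elements, which a vectorising compiler or a vector/array unit can execute several elements at a time (the function name and signature are purely illustrative).

    /* data-parallel loop in the SIMD spirit: one operation, many elements */
    #include <stddef.h>

    void saxpy(size_t n, float a, const float *x, float *y)
    {
        for (size_t i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];   /* same instruction on every element */
    }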

• standard classification according to FLYNN (cont’d)
– MISD
• several processing units that have access to one data memory;
several program memories
• not very popular class (mainly for special applications such as Digital
Signal Processing)
• operating on a single stream of data, forwarding results from one
processing unit to the next
• example: systolic array (network of primitive processing elements
that “pump” data)
[diagram: one data memory feeding several processors, each with its own program memory]

• standard classification according to FLYNN (cont’d)
– MIMD
• several processing units, each with separate access to a (shared or
distributed) data memory; several program memories
• classification according to (physical) memory organisation
– shared memory ⇒ shared (global) address space
– distributed memory ⇒ distributed (local) address space
• example: multiprocessor systems, networks of computers

[diagram: several processors, each with its own data memory and its own program memory]

• processor coupling
– cooperation of processors / computers as well as their shared use of
various resources require communication and synchronisation
– the following types of processor coupling can be distinguished
• memory-coupled multiprocessor systems (MemMS)
• message-coupled multiprocessor systems (MesMS)

                            | global memory | distributed memory
shared address space        | MemMS, SMP    | Mem-MesMS (hybrid)
distributed address space   | ∅             | MesMS

• processor coupling (cont’d)
– central issues
• scalability: costs for adding new nodes / processors
• programming model: costs for writing parallel programs
• portability: costs for porting (migration), i. e. transferring a program from
one system to another while preserving executability and flexibility
• load distribution: costs for obtaining a uniform load distribution
among all nodes / processors
– MemMS are advantageous concerning scalability, MesMS are typically
better concerning the rest
– hence, combination of MemMS and MesMS for exploiting all
advantages ⇒ distributed / virtual shared memory (DSM / VSM)
– physical distributed memory with global shared address space

• processor coupling (cont’d)
– uniform memory access (UMA)
• each processor P has direct access via the network to each memory
module M with same access times to all data
• standard programming model can be used (i. e. no explicit send /
receive of messages necessary)
• communication and synchronisation via shared variables
(inconsistencies (write conflicts, e. g.) have in general to be
prevented by the programmer; see the sketch below)
M M … M

network

P P … P
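A minimal sketch of communication via shared variables, assuming POSIX threads are available (compile with -pthread); the mutex is the programmer's means of preventing the write conflicts mentioned above.

    /* UMA-style programming model: threads share one address space;   */
    /* a mutex protects the shared variable against write conflicts    */
    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;                     /* shared variable    */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *work(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 1000000; i++) {
            pthread_mutex_lock(&lock);           /* synchronisation    */
            counter++;                           /* safe shared update */
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[2];
        for (int i = 0; i < 2; i++) pthread_create(&t[i], NULL, work, NULL);
        for (int i = 0; i < 2; i++) pthread_join(t[i], NULL);
        printf("counter = %ld\n", counter);      /* 2000000 with the lock */
        return 0;
    }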

• processor coupling (cont’d)
– symmetric multiprocessor (SMP)
• only a small number of processors, in most cases a central bus, one
address space (UMA), but poor scalability
• cache-coherence implemented in hardware (i. e. a read always
provides a variable’s value from its last write)
• example: double or quad boards, SGI Challenge

C: cache
C C C

P P P

• processor coupling (cont’d)
– non-uniform memory access (NUMA)
• memory modules physically distributed among processors
• shared address space, but access times depend on location of data
(i. e. local addresses faster than remote addresses)
• differences in access times are visible in the program
• example: DSM / VSM, Cray T3E

network

M M

P … P

• processor coupling (cont’d)
– cache-coherent non-uniform memory access (ccNUMA)
• caches for local and remote addresses; cache-coherence
implemented in hardware for entire address space
• problem with scalability due to frequent cache updates
• example: SGI Origin 2000

network

M M

C … C
P P

• processor coupling (cont’d)
– cache-only memory access (COMA)
• each processor has only cache-memory
• entirety of all cache-memories = global shared memory
• cache-coherence implemented in hardware
• example: Kendall Square Research KSR-1

network

C C C

P P P

• processor coupling (cont’d)
– no remote memory access (NORMA)
• each processor has direct access to its local memory only
• access to remote memory possible only via explicit message
exchange (due to the distributed address space); see the sketch below
• synchronisation implicitly via the exchange of messages
• performance improvement between memory and I/O possible due to
parallel data transfer (Direct Memory Access, e. g.)
• example: IBM SP2, ASCI Red / Blue / White

network

P P P

M M M
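A minimal sketch of explicit message exchange, assuming an MPI library is available (the MPI programming model itself is covered in part 5); rank 0 sends a value that rank 1 can only obtain via a message, which at the same time synchronises the two processes.

    /* NORMA / message coupling: remote data only via explicit messages */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            value = 42;                                     /* local data  */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                    /* remote data */
            printf("process 1 received %d\n", value);
        }
        MPI_Finalize();
        return 0;
    }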

Levels of Parallelism
• the suitability of a parallel architecture for a given parallel program strongly
depends on the granularity of parallelism
• some remarks on granularity
– quantitative meaning: ratio of computational effort to communication /
synchronisation effort (≈ number of instructions between two necessary
communication / synchronisation steps)
– qualitative meaning: level on which work is done in parallel

– levels of parallelism, from coarse-grain to fine-grain:
• program level (coarse-grain parallelism)
• process level
• block level
• instruction level
• sub-instruction level (fine-grain parallelism)

• program level
– parallel processing of different programs
– independent units without any shared data
– no or only small amount of communication / synchronisation
– organised by the OS
• process level
– a program is subdivided into processes to be executed in parallel
– each process consists of a larger amount of sequential instructions and
has a private address space
– synchronisation necessary (in case all processes belong to one program)
– communication in most cases necessary (data exchange, e. g.)
– support by OS via routines for process management, process
communication, and process synchronisation
– such a process is often referred to as a heavy-weight process

• block level
– blocks of instructions are executed in parallel
– each block consists of a smaller amount of instructions and shares the
address space with other blocks
– communication via shared variables; synchronisation mechanisms
– such a block is often referred to as a light-weight process (thread); see the sketch after this list
• instruction level
– parallel execution of machine instructions
– optimising compilers can increase this potential by modifying the order
of commands (better exploitation of superscalar architecture and
pipelining mechanisms)
• sub-instruction level
– instructions are further subdivided into units to be executed in parallel or
via overlapping (vector operations, e. g.)
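A minimal sketch of block-level parallelism, assuming an OpenMP-capable compiler (e. g. gcc -fopenmp); the loop is split into blocks that run as threads sharing one address space, and the reduction clause takes care of the necessary synchronisation.

    /* block-level parallelism: loop blocks executed by threads        */
    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        double sum = 0.0;
        #pragma omp parallel for reduction(+:sum)   /* blocks -> threads */
        for (int i = 1; i <= 1000000; i++)
            sum += 1.0 / i;
        printf("partial harmonic sum: %f (max threads: %d)\n",
               sum, omp_get_max_threads());
        return 0;
    }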

Quantitative Performance Evaluation
• execution time
– time T of a parallel program between start of the execution on one
processor and end of all computations on the last processor
– during execution all processors are in one of the following states
• compute
– computation time TCOMP
– time spent for computations
• communicate
– communication time TCOMM
– time spent for send and receive operations
• idle
– idle time TIDLE
– time spent for waiting (sending / receiving messages)
– hence T = TCOMP + TCOMM + TIDLE
• parallel profile
– measures the amount of parallelism of a parallel program
– graphical representation
• x-axis shows time, y-axis shows amount of parallel activities
• identification of computation, communication, and idle periods
– example
[chart: parallel profile of three processes A, B, and C over time; the y-axis shows the number of parallel activities (0–3), with periods marked as compute, communicate, or idle]

• parallel profile (cont’d)
– degree of parallelism
• P(t) indicates the number of processes (of one application) that can
be executed in parallel at any point in time (i. e. the y-values of the
previous example for any time t)
– average parallelism (often referred to as parallel index)
• A(p) indicates the average number of processes that can be
executed in parallel, hence

A(p) = 1/(t2 − t1) ⋅ ∫[t1, t2] P(t) dt     or     A(p) = (∑i=1..p i ⋅ ti) / (∑i=1..p ti)

where p is the number of processes and ti is the time during which
exactly i processes are busy
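A small sketch of the discrete formula, using the times of the worked example below (exactly 1, 2, and 3 processes busy for 18, 4, and 13 time units, respectively):

    /* average parallelism A(p) = (sum i * t_i) / (sum t_i)             */
    #include <stdio.h>

    int main(void)
    {
        double t[] = {18.0, 4.0, 13.0};          /* t_1 .. t_3          */
        double num = 0.0, den = 0.0;
        for (int i = 0; i < 3; i++) {
            num += (i + 1) * t[i];
            den += t[i];
        }
        printf("A(p) = %.2f\n", num / den);      /* 65 / 35 = 1.86      */
        return 0;
    }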

• parallel profile (cont’d)
– previous example: A(p) = (1⋅18 + 2⋅4 + 3⋅13) / 35 = 65/35 = 1.86
[chart: degree of parallelism P(t) for the previous example plotted over time]
– several theoretical (typically quite pessimistic) estimates exist for A(p);
they are often used as arguments against parallel systems
– example: estimate of MINSKY (1971)
• problem: the number of processors used is halved in every step
• parallel summation of 2p numbers on p processors, e. g.
• result?
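One way to approach the question is to trace the processor usage of the halving pattern; the sketch below only counts how many processors are busy in each step (the value p = 8 is illustrative):

    /* MINSKY's scenario: the number of busy processors halves per step */
    #include <stdio.h>

    int main(void)
    {
        int p = 8;                         /* p processors, 2p numbers   */
        for (int active = p, step = 1; active >= 1; active /= 2, step++)
            printf("step %d: %d processors busy\n", step, active);
        return 0;
    }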
• comparison multiprocessor / monoprocessor
– correlation of multi- and monoprocessor systems’ performance
– important: program that can be executed on both systems
– definitions
• P(1): number of unit operations of a program on the monoprocessor
system
• P(p): number of unit operations of a program on the multiprocessor
system with p processors
• T(1): execution time of a program on the monoprocessor system
(measured in steps or clock cycles)
• T(p): execution time of a program on the multiprocessor system
(measured in steps or clock cycles) with p processors

• comparison multiprocessor / monoprocessor (cont’d)
– simplifying preconditions
• T(1) = P(1)
– one operation to be executed in one step on the monoprocessor
system
• T(p) ≤ P(p)
– more than one operation to be executed in one step (for p ≥ 2)
on the multiprocessor system with p processors

• comparison multiprocessor / monoprocessor (cont’d)
– speed-up
• S(p) indicates the improvement in processing speed
S(p) = T(1) / T(p)
• in general, 1 ≤ S(p) ≤ p
– efficiency
• E(p) indicates the relative improvement in processing speed
E(p) = S(p) / p
• improvement is normalised by the amount of processors p
• in general, 1/p ≤ E(p) ≤ 1

• comparison multiprocessor / monoprocessor (cont’d)
– speed-up and efficiency can be seen in two different ways
• algorithm-independent
– best known sequential algorithm for the monoprocessor system
is compared to the respective parallel algorithm for the
multiprocessor system
⇒ absolute speed-up
⇒ absolute efficiency
• algorithm-dependent
– parallel algorithm is treated as sequential one to measure the
execution time on the monoprocessor system; “unfair” due to
communication and synchronisation overhead
⇒ relative speed-up
⇒ relative efficiency

• comparison multiprocessor / monoprocessor (cont’d)
– overhead
• O(p) indicates the necessary overhead of a multiprocessor system
for organisation, communication, and synchronisation
O(p) = P(p) / P(1)
• in general, 1 ≤ O(p)
– parallel index
• I(p) indicates the number of operations executed on average per time
unit

I(p) = P(p) / T(p)
• I(p) ≈ relative speed-up (taking into account the overhead)

• comparison multiprocessor / monoprocessor (cont’d)
– utilisation
• U(p) indicates the number of operations each processor executes on
average per time unit

U(p) = I(p) / p
• conforms to the normalised parallel index
– conclusions
• all defined expressions have a value of 1 for p = 1
• the parallel index is an upper bound for the speed-up
1 ≤ S(p) ≤ I(p) ≤ p
• the utilisation is an upper bound for the efficiency
1/p ≤ E(p) ≤ U(p) ≤ 1

• comparison multiprocessor / monoprocessor (cont’d)
– example (1)
• a monoprocessor system needs 6000 steps for the execution of
6000 operations to compute some result
• a multiprocessor system with five processors needs 6750 operations
for the computation of the same result, but it needs only 1500 steps
for the execution
• thus P(1) = T(1) = 6000, P(5) = 6750, and T(5) = 1500
• speed-up and efficiency can be computed as

S(5) = 6000/1500 = 4 and E(5) = 4/5 = 0.8

⇒ there is an acceleration by a factor of 4 compared to the
monoprocessor system, i. e. on average an improvement of 80% for
each processor of the multiprocessor system

• comparison multiprocessor / monoprocessor (cont’d)
– example (2)
• parallel index and utilisation can be computed as

I(5) = 6750/1500 = 4.5 and U(5) = 4.5/5 = 0.9

⇒ on average 4.5 processors are simultaneously busy, i. e. each
processor is working only for 90% of the execution time
• overhead can be computed as

O(5) = 6750/6000 = 1.125

⇒ there is an overhead of 12.5% on the multiprocessor system
compared to the monoprocessor system
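The figures of examples (1) and (2) follow directly from the definitions above; a minimal sketch that reproduces them:

    /* speed-up, efficiency, parallel index, utilisation, overhead      */
    #include <stdio.h>

    int main(void)
    {
        double P1 = 6000, T1 = 6000;             /* monoprocessor       */
        double Pp = 6750, Tp = 1500, p = 5;      /* multiprocessor, p=5 */

        double S = T1 / Tp;                      /* speed-up:       4.0   */
        double E = S / p;                        /* efficiency:     0.8   */
        double I = Pp / Tp;                      /* parallel index: 4.5   */
        double U = I / p;                        /* utilisation:    0.9   */
        double O = Pp / P1;                      /* overhead:       1.125 */

        printf("S = %.2f  E = %.2f  I = %.2f  U = %.2f  O = %.3f\n",
               S, E, I, U, O);
        return 0;
    }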

• scalability
– objective: adding further processing elements to the system shall reduce
the execution time without any program modifications
– i. e. a linear performance increase with an efficiency close to 1
– important for the scalability is a sufficient problem size
• one porter may carry one suitcase in a minute
• 60 porters won’t do it in a second
• but 60 porters may carry 60 suitcases in a minute
– in case of a fixed problem size and an increasing number of processors,
saturation will occur for a certain value of p, hence scalability is limited
– when scaling the number of processors together with the problem size
(so-called scaled problem analysis) this effect will not appear for
well-scalable hardware and software systems

• AMDAHL’s law
– probably the most important and most famous estimate for the
speed-up (even if quite pessimistic)
– underlying model
• each program consists of a sequential part s, 0 ≤ s ≤ 1, that can only
be executed in a sequential way; synchronisation, data I/O, e. g.
• furthermore, each program consists of a parallelisable part 1−s that
can be executed in parallel by several processes; finding the
maximum value within a set of numbers, e. g.
– hence, the execution time for the parallel program executed on p
processors can be written as

T(p) = s ⋅ T(1) + ((1 − s) / p) ⋅ T(1)

• AMDAHL’s law (cont’d)
– the speed-up can thus be computed as

S(p) = T(1) / T(p) = T(1) / (s ⋅ T(1) + ((1 − s) / p) ⋅ T(1)) = 1 / (s + (1 − s) / p)

– when increasing p → ∞ we finally get AMDAHL's law

lim_{p→∞} S(p) = lim_{p→∞} 1 / (s + (1 − s) / p) = 1 / s

⇒ speed-up is bounded: S(p) ≤ 1/s
– the sequential part can have a dramatic impact on the speed-up
– therefore central effort of all (parallel) algorithms: keep s small
– many parallel programs have a small sequential part (s < 0.1)
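A small sketch of AMDAHL's law for s = 0.1; the speed-up approaches, but never exceeds, the bound 1/s = 10:

    /* AMDAHL's law: S(p) = 1 / (s + (1 - s) / p)                      */
    #include <stdio.h>

    int main(void)
    {
        double s = 0.1;                          /* sequential part    */
        for (int p = 1; p <= 1024; p *= 4)
            printf("p = %4d   S(p) = %.2f\n", p, 1.0 / (s + (1.0 - s) / p));
        return 0;
    }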

• AMDAHL’s law (cont’d)
– example
• s = 0.1 and, thus, S(p) ≤ 10
• independent from p the speed-up is bounded by this limit
• where’s the error?
[chart: S(p) for s = 0.1 plotted over p, approaching the bound of 10]
• GUSTAFSON’s law
– addresses the shortcomings of AMDAHL's law as it states that any
sufficiently large problem can be efficiently parallelised
– instead of a fixed problem size it supposes a fixed time concept
– underlying model
• execution time on the parallel machine is normalised to 1
• this contains a non-parallelisable part σ, 0 ≤ σ ≤ 1
– hence, the execution time for the sequential program on the
monoprocessor can be written as

T(1) = σ + p⋅(1−σ)

– the speed-up can thus be computed as

S(p) = σ + p⋅(1−σ) = p + σ⋅(1−p)

• GUSTAFSON’s law (cont’d)
– difference to AMDAHL
• sequential part s(p) is not constant, but gets smaller with increasing p

s(p) = σ / (σ + p ⋅ (1 − σ)),   s(p) ∈ ]0, 1[

• often more realistic, because more processors are used for a larger
problem size, and here parallelisable parts typically increase (more
computations, less declarations, …)
• speed-up is not bounded for increasing p
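A small sketch of GUSTAFSON's scaled speed-up for σ = 0.1; S(p) keeps growing with p while the effective sequential part s(p) shrinks:

    /* GUSTAFSON's law: S(p) = p + sigma * (1 - p)                       */
    #include <stdio.h>

    int main(void)
    {
        double sigma = 0.1;                      /* non-parallelisable part */
        for (int p = 1; p <= 1024; p *= 4) {
            double S = p + sigma * (1.0 - p);
            double s = sigma / (sigma + p * (1.0 - sigma));
            printf("p = %4d   S(p) = %7.2f   s(p) = %.4f\n", p, S, s);
        }
        return 0;
    }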

• GUSTAFSON’s law (cont’d)
– some more thoughts about speed-up
• theory says: a superlinear speed-up does not exist
– each parallel algorithm can be simulated on a monoprocessor
system by emulating, in a loop, one step of each processor of the
multiprocessor system at a time
• but superlinear speed-up can be observed
– when improving an inferior sequential algorithm
– when a parallel program (that does not fit into the main memory
of the monoprocessor system) completely runs in cache and
main memory of the nodes from the multiprocessor system

• communication-to-computation ratio (CCR)
– important quantity measuring the success of a parallelisation
• gives the ratio of pure communication time to pure computing
time
• a small CCR is favourable
• typically: CCR decreases with increasing problem size
– example
• N×N matrix distributed among p processors (N/p rows each)
• iterative method: in each step, each matrix element is replaced by
the average of its eight neighbour values
• hence, the two neighbouring rows are always necessary
• computation time: 8N⋅N/p
• communication time: 2N
• CCR: p/4N – what does this mean?
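A small sketch of the resulting ratio CCR = p / (4N) for a fixed number of processors; it illustrates that the ratio shrinks as the problem size N grows (the values of p and N are illustrative):

    /* CCR of the row-wise distributed N x N iteration: p / (4 * N)     */
    #include <stdio.h>

    int main(void)
    {
        int p = 16;                              /* processors          */
        for (int N = 64; N <= 4096; N *= 4)
            printf("N = %4d   CCR = %.5f\n", N, (double)p / (4.0 * N));
        return 0;
    }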
