
Introduction

• Computer technology has made incredible progress over roughly the last 55 years.
• Today, less than a thousand dollars will purchase a personal computer that has more performance than a computer bought in 1980 for one million dollars.
• Presently, workstation performance (measured in SPECmarks) improves by roughly 50% per year (2x every 18 months).

1
The Changing Face of Computing

• Large mainframes
• Desktop computing
• Servers
• Embedded computers

2
Three computer classes

Price of system:
- Desktop: $1,000 - $10,000
- Server: $10,000 - $10,000,000
- Embedded: $10 - $100,000

Price of microprocessor:
- Desktop: $100 - $1,000
- Server: $200 - $2,000 (per processor)
- Embedded: $0.20 - $200 (per processor)

Microprocessors sold per year (estimates for 2000):
- Desktop: 150,000,000
- Server: 4,000,000
- Embedded: 300,000,000 (32-bit & 64-bit processors only)

Critical system design issues:
- Desktop: price-performance, graphics performance
- Server: throughput, availability, scalability
- Embedded: price, power consumption, application-specific performance

3
Technology Trends

• Integrated circuit logic technology
• Semiconductor DRAM
• Magnetic disk technology
• Network technology

4
A taxonomy of parallel architectures

• Control mechanism
  - SIMD
  - MIMD
• Address-space organization
  - message-passing architecture
  - shared-address-space architecture
    - UMA
    - NUMA
• Interconnection networks
  - static
  - dynamic
• Processor granularity
  - coarse-grain computers
  - medium-grain computers
  - fine-grain computers

5
Flynn’s classification
– Single-instruction-stream, single-data-stream (SISD) computers
  - typical uniprocessors
  - parallelism through pipelining
– Multiple-instruction-stream, single-data-stream (MISD) computers
  - not often used in practice
– Single-instruction-stream, multiple-data-stream (SIMD) computers
  - vector and array processors
– Multiple-instruction-stream, multiple-data-stream (MIMD) computers
  - multiprocessors

6
Typical shared-address-space architecture

[Figure: two diagrams of processors (P) and memory modules (M) connected through an interconnection network - a uniform-memory-access (UMA) computer, and a non-uniform-memory-access (NUMA) computer with local memory only.]

7
A typical message-passing architecture

[Figure: several processors (P), each with its own local memory (M), connected by an interconnection network.]

P - processor
M - memory

8
Dynamic interconnection networks

Crossbar switching networks

Bus-based networks

Multistage interconnection networks

9
A completely nonblocking crossbar switch connecting p processors to b memory banks

[Figure: processors P0 ... Pp-1 connected to memory banks M0 ... Mb-1 through a grid of switching elements; one crossing point is highlighted as "a switching element".]

10
A typical bus-based architecture with no cache

[Figure: several processors sharing a single bus that connects them to a global memory.]

11
Multistage interconnection network

[Figure: processors 0 ... p-1 connected to memory banks 0 ... b-1 through a multistage interconnection network with stages 1, 2, ..., n.]

12
Cost and performance

[Figure: two qualitative plots against the number of processors - the performance and the cost of bus-based, multistage, and crossbar networks.]

13
Omega network

[Figure: an 8 x 8 omega network connecting inputs 000-111 to outputs 000-111 through stages of 2 x 2 switching elements; each element can be set to pass-through or cross-over.]

14
An example of blocking in an omega network

[Figure: an 8 x 8 omega network in which two messages contend for the same link, so one of them is blocked.]

15
Static interconnection networks

• Completely-connected network
• Star-connected network
• Linear array
• Ring
• Mesh (2D, 3D, wraparound)
• Hypercube

16
Examples of static interconnection networks

[Figure: a completely-connected network, a star-connected network, a linear array, a ring, a two-dimensional mesh, and a two-dimensional wraparound mesh.]

17
Examples of static interconnection networks

[Figure: a three-dimensional mesh, and a complete binary tree network (processors at the leaves, switching elements at the internal nodes) with message routing.]

18
Hypercube

[Figure: 0-D, 1-D, 2-D, 3-D, and 4-D hypercubes; each node is labeled with a binary string, and nodes whose labels differ in exactly one bit are connected.]

19
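A quick illustrative sketch (not from the slides): in a d-dimensional hypercube, two nodes are connected exactly when their binary labels differ in a single bit, so each node's d neighbors can be enumerated by flipping one bit at a time.

```python
def hypercube_neighbors(node: int, d: int) -> list:
    """Return the labels of the d neighbors of `node` in a d-dimensional hypercube."""
    return [node ^ (1 << bit) for bit in range(d)]

# Example: node 5 (binary 0101) in a 4-D hypercube.
print([format(n, "04b") for n in hypercube_neighbors(0b0101, 4)])
# -> ['0100', '0111', '0001', '1101']
```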
Evaluating Static Interconnection Networks

• Diameter - the maximum distance between any two processors in the network.
• Connectivity - a measure of the multiplicity of paths between any two processors.
• Arc connectivity - the minimum number of arcs that must be removed from the network to break it into two disconnected networks.
• Bisection width - the minimum number of communication links that have to be removed to partition the network into two equal halves.
• Channel width - the number of bits that can be communicated simultaneously over a link connecting two processors.
• Bisection bandwidth - the minimum volume of communication allowed between any two halves of the network with an equal number of processors.
• Cost - for example, the number of communication links.

20
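For concreteness, the closed-form values of two of these metrics for a few of the static topologies above can be tabulated in a short sketch. The formulas below are the standard ones for p-processor networks; they are supplied here for illustration, not taken from the slides.

```python
import math

# Diameter and bisection width for some static networks with p processors
# (standard closed forms; assumes p is a power of 2 where the topology requires it).
def metrics(p: int):
    return {
        "completely-connected": (1, p * p // 4),
        "ring":                 (p // 2, 2),
        "hypercube":            (int(math.log2(p)), p // 2),
    }

for name, (diameter, bisection) in metrics(16).items():
    print(f"{name:22s} diameter={diameter:2d}  bisection width={bisection}")
```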
Communication costs in static interconnection networks

Principal parameters
- startup time (ts)
- per-hop time (th)
- per-word transfer time (tw)

Routing techniques
- store-and-forward routing
- cut-through routing

21
Passing a message from processor P0 to P3

[Figure: three time diagrams for processors P0-P3 -
 a single message sent over a store-and-forward network,
 a single message broken into two parts and sent over a cut-through network,
 a single message broken into four parts and sent over a cut-through network.]

22
Communication costs depend on the routing strategy

Store-and-forward routing - the message travels between processors hop by hop, and each intermediate processor stores it in its local memory until the whole message has been received:

tcomm = ts + (m*tw + th)*l

Cut-through routing - the message is divided into parts, which are forwarded between processors without waiting for the whole message:

tcomm = ts + l*th + m*tw

where m is the message size in words and l is the number of links traversed.

23
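As a quick illustration, the two cost models can be compared directly. The parameter values below (startup, per-hop, and per-word times, message length, and link count) are made up for the example, not measurements from the slides.

```python
def store_and_forward(ts: float, th: float, tw: float, m: int, l: int) -> float:
    """t_comm = ts + (m*tw + th) * l"""
    return ts + (m * tw + th) * l

def cut_through(ts: float, th: float, tw: float, m: int, l: int) -> float:
    """t_comm = ts + l*th + m*tw"""
    return ts + l * th + m * tw

# Hypothetical values: startup 10 us, per-hop 1 us, per-word 0.5 us,
# a 100-word message crossing 4 links.
print(store_and_forward(10, 1, 0.5, 100, 4))  # 10 + (50 + 1)*4 = 214 us
print(cut_through(10, 1, 0.5, 100, 4))        # 10 + 4 + 50     = 64 us
```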
Program Performance Metrics

⇒ The parallel run time (Tpar) is the time from the moment the computation starts to the moment the last processor finishes its execution.

⇒ The speedup (S) is defined as the ratio of the time needed to solve the problem on a single processor (Tseq) to the time required to solve the same problem on a parallel system with p processors (Tpar):
– relative - Tseq is the execution time of the parallel algorithm executing on one of the processors of the parallel computer
– real - Tseq is the execution time of the best-known algorithm using one of the processors of the parallel computer
– absolute - Tseq is the execution time of the best-known algorithm using the best-known computer

24
Program Performance Metrics

⇒ The efficiency (E) of a parallel program is defined as the ratio of the speedup to the number of processors.

⇒ The cost is usually defined as the product of the parallel run time and the number of processors.

⇒ The scalability of a parallel system is a measure of its capacity to increase speedup in proportion to the number of processors.

25
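These definitions translate directly into code. A minimal sketch with hypothetical helper functions, where S is computed as Tseq/Tpar as defined on the previous slide:

```python
def speedup(t_seq: float, t_par: float) -> float:
    return t_seq / t_par

def efficiency(t_seq: float, t_par: float, p: int) -> float:
    return speedup(t_seq, t_par) / p

def cost(t_par: float, p: int) -> float:
    return t_par * p

# Example: a job that takes 100 s sequentially and 16 s on 8 processors.
print(speedup(100, 16))        # 6.25
print(efficiency(100, 16, 8))  # ~0.78
print(cost(16, 8))             # 128 processor-seconds (vs. 100 for the serial run)
```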
Amdahl’s Law

⇒ When executing a parallel program we can distinguish two parts: a sequential part (Pseq), which must be executed sequentially on one processor, and a parallel part (1 − Pseq), which can be executed independently on a number of processors.
⇒ Let’s assume that executing this program on a single processor takes serial time t1. Then, if p indicates the number of processors used during parallel execution, the parallel run time can be expressed by

Tpar = t1*Pseq + (1 − Pseq)*t1/p

⇒ The speedup is then expressed by

S = t1 / (t1*Pseq + (1 − Pseq)*t1/p) ≤ 1/Pseq

26
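A small sketch of this bound (with an illustrative 10% sequential fraction, not a value from the slides): even with a modest sequential part, the speedup saturates at 1/Pseq.

```python
def amdahl_speedup(p_seq: float, p: int) -> float:
    """S = t1 / (t1*Pseq + (1 - Pseq)*t1/p); t1 cancels out."""
    return 1.0 / (p_seq + (1.0 - p_seq) / p)

# With a 10% sequential part, the speedup is capped at 1/0.1 = 10.
for p in (2, 8, 32, 1024):
    print(p, round(amdahl_speedup(0.1, p), 2))
# 2 -> 1.82, 8 -> 4.71, 32 -> 7.8, 1024 -> 9.91
```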
Gustafson Speedup

⇒ Let’s assume that the execution of the maximum-size problem using the parallel algorithm on p processors takes Pseq + Ppar = 1 time, where Pseq indicates the sequential part of the program and Ppar the parallel part, respectively.
⇒ Its sequential time (using one processor) will be Pseq + p*Ppar.
⇒ Then we obtain the following expression for the speedup:

SG = (Pseq + p*Ppar) / (Pseq + Ppar) = Pseq + p*Ppar ≥ p*Ppar

27
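The contrast with Amdahl's fixed-size bound can be seen in a couple of lines (again an illustrative sketch; the 10% sequential share is a made-up value):

```python
def gustafson_speedup(p_seq: float, p: int) -> float:
    """SG = Pseq + p*Ppar with Pseq + Ppar = 1 (scaled problem size)."""
    return p_seq + p * (1.0 - p_seq)

# With Pseq = 0.1, the scaled speedup keeps growing with p
# instead of saturating at 10 as in Amdahl's fixed-size model.
for p in (2, 8, 32, 1024):
    print(p, round(gustafson_speedup(0.1, p), 1))
# 2 -> 1.9, 8 -> 7.3, 32 -> 28.9, 1024 -> 921.7
```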
Other laws

⇒ Lee’s law (1980):   S ≤ p / log2(p)

It is a generalisation of Amdahl’s Law.

⇒ Minsky’s law (1990):   S ≤ log2(p)

It can be used for programs with branch points on a SIMD structure.

28
Other laws
Stone’s Table

Speedup (S)        Type of algorithms
------------------------------------------------------------------------------------
α * p              matrix computations, discretization
α * p / log2(p)    sorts, linear recursions, evaluation of polynomials
α * log2(p)        search for an element in a set
α                  certain non-linear recursions
------------------------------------------------------------------------------------

where p is the number of processors used and α is a positive number smaller than 1 that depends on the machine.

29
The Scalability of Parallel Systems

Let’s consider the problem of adding n numbers on a hypercube.

In the first step each processor locally adds its n/p numbers; in each subsequent step half of the partial sums are transmitted to adjacent processors and added; the procedure finishes when one chosen processor holds the final sum.
Assume that it takes one unit of time both to add two numbers and to communicate a number between two directly connected processors.
Then adding the n/p numbers local to each processor takes n/p − 1 time.
The p partial sums are added in log p steps (each consisting of one addition and one communication).

30
The Scalability of Parallel Systems

Thus the total parallel run time can be approximated by

Tp = n/p + 2*log p

Since the serial run time can be approximated by n, the expressions for speedup and efficiency are as follows:

S = n*p / (n + 2*p*log p)          E = n / (n + 2*p*log p)

These expressions can be used to calculate the speedup and efficiency for any pair of n and p.

31
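A short sketch that evaluates these two expressions; under the stated unit-cost assumptions it reproduces the kind of numbers shown in the efficiency table two slides below.

```python
from math import log2

def parallel_time(n: int, p: int) -> float:
    return n / p + 2 * log2(p)

def speedup(n: int, p: int) -> float:
    return n * p / (n + 2 * p * log2(p))

def efficiency(n: int, p: int) -> float:
    return n / (n + 2 * p * log2(p))

# Adding n numbers on a p-processor hypercube.
for n in (64, 192, 320, 512):
    print(n, [round(efficiency(n, p), 2) for p in (1, 4, 8, 16, 32)])
# 64  -> [1.0, 0.8, 0.57, 0.33, 0.17]
# 512 -> [1.0, 0.97, 0.91, 0.8, 0.62]
```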
The Scalability of Parallel Systems
Speedup versus the number of processors

[Figure: speedup S as a function of the number of processors for n = 64, 192, 320, and 512, together with the linear-speedup line.]

32
The Scalability of Parallel Systems

Efficiency as a function of n and p

n      p = 1   p = 4   p = 8   p = 16   p = 32
64     1.0     0.80    0.57    0.33     0.17
192    1.0     0.92    0.80    0.60     0.38
320    1.0     0.95    0.87    0.71     0.50
512    1.0     0.97    0.91    0.80     0.62

⇒ For a given problem instance, the speedup does not increase linearly as the number of processors increases.
⇒ A larger instance of the same problem yields higher speedup and efficiency for the same number of processors.

33
The Scalability of Parallel Systems

⇒ A parallel system is scalable if it can maintain efficiency at a fixed value by simultaneously increasing the number of processors and the size of the problem.

⇒ The scalability of a parallel system is a measure of its capacity to increase speedup in proportion to the number of processors.

⇒ Scalability reflects a parallel system’s ability to utilize increasing processing resources effectively.

34
Measuring and Reporting Performance

• Two key aspects
  • execution time
  • throughput
  • making one faster may slow the other

• Comparing performance
  • performance = 1 / execution time
  • if X is n times faster than Y: n = Execution timeY / Execution timeX
  • similar for throughput comparisons
  • improved performance ==> decreasing execution time

35
Measuring Performance

Execution time can be defined in different ways:

• wall-clock time (response time, elapsed time) - it’s what you see, but it depends on
  • computer load
  • I/O delays
  • OS overhead
• CPU time - the time spent computing your program
  • does not include time spent waiting for I/O
  • includes the OS + your program
• hence system CPU time and user CPU time

36
Unix time command

• The Unix time command reports:
  • user CPU time
  • system CPU time
  • total elapsed time
  • % of elapsed time that is user + system CPU time

• An answer of 90.7u 12.9s 2:39 65% means:
  • user CPU time - 90.7 seconds
  • system CPU time - 12.9 seconds
  • elapsed time - 2 minutes and 39 seconds
  • percentage of elapsed time - (90.7 + 12.9) / 159 ≈ 65%

37
Reporting Performance Results

• The guiding principle of reporting performance measurements should be reproducibility - another experimenter should be able to duplicate the results.

However:
  • A system’s software configuration can significantly affect the performance results for a benchmark.
  • Similarly, compiler technology can play a big role in the performance of compute-oriented benchmarks.

• For these reasons it is important to describe exactly the software system being measured, as well as whether any special or nonstandard modifications have been made.

38
Other Problems

• Which is better?
• By how much?
• Are the programs equally important?

[Table: execution times of two programs (P1, P2) on three machines (A, B, C), from Smith [1988].]

39
Total Execution Time: A Consistent Summary Measure

• The simplest approach to summarizing relative performance is to use the total execution time of the two programs:
  - B is 9.1 times faster than A for programs P1 and P2.
  - C is 25 times faster than A for programs P1 and P2.
  - C is 2.75 times faster than B for programs P1 and P2.
• If the workload consisted of running programs P1 and P2 an equal number of times, the statements above would predict the relative execution times for the workload on each machine.
• An average of the execution times that tracks total execution time is the arithmetic mean:

Arithmetic mean = (1/n) * Σ Timei

where Timei is the execution time for the i-th program of a total of n in the workload.

40
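A sketch of these comparisons in code. The individual execution times below (P1: 1 s, 10 s, 20 s and P2: 1000 s, 100 s, 20 s on machines A, B, C) are the values commonly quoted for Smith's example; they are consistent with the 9.1x, 25x and 2.75x ratios above but are filled in here as an assumption rather than read off the missing table.

```python
# Execution times in seconds (assumed values, see note above).
times = {"A": {"P1": 1, "P2": 1000},
         "B": {"P1": 10, "P2": 100},
         "C": {"P1": 20, "P2": 20}}

def total(machine: str) -> float:
    return sum(times[machine].values())

def arithmetic_mean(machine: str) -> float:
    return total(machine) / len(times[machine])

print(round(total("A") / total("B"), 2))  # 1001 / 110 ≈ 9.1  -> B is 9.1x faster than A
print(round(total("A") / total("C"), 2))  # 1001 / 40  ≈ 25   -> C is 25x faster than A
print(round(total("B") / total("C"), 2))  # 110 / 40   = 2.75 -> C is 2.75x faster than B
print(arithmetic_mean("A"), arithmetic_mean("B"), arithmetic_mean("C"))  # 500.5 55.0 20.0
```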
Total Execution Time: A Consistent Summary Measure

Then the question arises:

• What is the proper mixture of programs for the workload?

• Are programs P1 and P2 in fact run equally often in the workload, as assumed by the arithmetic mean?

• If not, then there are two approaches that have been tried for summarizing performance.

41
Weighted Execution Time

• The first approach, when given an unequal mix of programs in the workload, is to assign a weighting factor wi to each program to indicate the relative frequency of the program in that workload.
• If, for example, 20% of the tasks in the workload were program P1 and 80% of the tasks in the workload were program P2, then the weighting factors would be 0.2 and 0.8 (the weighting factors add up to 1).
• By summing the products of the weighting factors and the execution times, a clear picture of the performance of the workload is obtained.

42
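A minimal sketch of the weighted arithmetic mean, using the 0.2/0.8 weights from the slide and the same assumed execution times as in the earlier sketch:

```python
# Weighted arithmetic mean = sum(w_i * Time_i); the weights must add up to 1.
weights = {"P1": 0.2, "P2": 0.8}
times = {"A": {"P1": 1, "P2": 1000},
         "B": {"P1": 10, "P2": 100},
         "C": {"P1": 20, "P2": 20}}

def weighted_mean(machine: str) -> float:
    return sum(weights[p] * t for p, t in times[machine].items())

for m in ("A", "B", "C"):
    print(m, round(weighted_mean(m), 1))
# A 800.2, B 82.0, C 20.0
```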
Weighted Execution Time

43
Normalized Execution Time and the Pros and Cons of Geometric Means

• A second approach to an unequal mixture of programs in the workload is to normalize execution times to a reference machine and then take the average of the normalized execution times.
• The average normalized execution time can be expressed as either an arithmetic or a geometric mean.
• The formula for the geometric mean is

Geometric mean = (Π Execution time ratioi)^(1/n)

where Execution time ratioi is the execution time, normalized to the reference machine, for the i-th program of a total of n in the workload.
• The geometric mean has the nice property:
  - ratio of the means = mean of the ratios

44
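A sketch of the geometric mean of normalized execution times (same assumed times as before, normalized to machine A), illustrating the "ratio of the means = mean of the ratios" property:

```python
from math import prod

times = {"A": {"P1": 1, "P2": 1000},
         "B": {"P1": 10, "P2": 100},
         "C": {"P1": 20, "P2": 20}}

def geometric_mean(machine: str, reference: str) -> float:
    ratios = [times[machine][p] / times[reference][p] for p in times[machine]]
    return prod(ratios) ** (1 / len(ratios))

# Normalized to A, the geometric means of B and C ...
gm_b, gm_c = geometric_mean("B", "A"), geometric_mean("C", "A")
print(round(gm_b, 3), round(gm_c, 3))                       # 1.0  0.632
# ... and their ratio equals the geometric mean of C's times normalized to B.
print(round(gm_c / gm_b, 3), round(geometric_mean("C", "B"), 3))  # 0.632  0.632
```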
Normalized Execution Time and the Pros and Cons of
Geometric Means

45
Amdahl’s Law – hardware approach

• Amdahl’s Law states that the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used.

• Amdahl’s Law defines the speedup that can be gained by using a particular feature.

46
What is speedup?

• Suppose that we can make an enhancement to a machine that will improve performance when it is used. Speedup is the ratio:

Speedup = Performance for the entire task using the enhancement when possible / Performance for the entire task without using the enhancement

• Alternatively:

Speedup = Execution time for the entire task without using the enhancement / Execution time for the entire task using the enhancement when possible

• Speedup tells us how much faster a task will run using the machine with the enhancement as opposed to the original machine.

47
What does Amdahl’s Law say?

Amdahl’s Law gives us a quick way to find the speedup from some enhancement, which depends on two factors:

• The fraction of the computation time in the original machine that can be converted to take advantage of the enhancement.
  – For example, if 20 seconds of the execution time of a program that takes 60 seconds in total can use an enhancement, the fraction is 20/60. This value, which we will call Fractionenhanced, is always less than or equal to 1.

• The improvement gained by the enhanced execution mode; that is, how much faster the task would run if the enhanced mode were used for the entire program.
  – This value is the time of the original mode over the time of the enhanced mode: if the enhanced mode takes 2 seconds for some portion of the program that can completely use the mode, while the original mode took 5 seconds for the same portion, the improvement is 5/2. We will call this value, which is always greater than 1, Speedupenhanced.

48
How do we calculate the speedup?

The execution time using the original machine with the enhanced mode will be the time spent using the unenhanced portion of the machine plus the time spent using the enhancement:

Execution timenew = Execution timeold * ((1 − Fractionenhanced) + Fractionenhanced / Speedupenhanced)

The overall speedup is the ratio of the execution times:

Speedupoverall = Execution timeold / Execution timenew = 1 / ((1 − Fractionenhanced) + Fractionenhanced / Speedupenhanced)

49
An Example

• Suppose that we are considering an enhancement to the processor of a server system used for Web serving. The new CPU is 10 times faster on computation in the Web serving application than the original processor.
• Assuming that the original CPU is busy with computation 40% of the time and is waiting for I/O 60% of the time, what is the overall speedup gained by incorporating the enhancement?

50
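A worked sketch of the answer, computed here from the numbers on the slide with the overall-speedup formula (the answer itself is not given on the slide): Fractionenhanced = 0.4 and Speedupenhanced = 10, so the overall speedup is 1 / (0.6 + 0.4/10) ≈ 1.56.

```python
def overall_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Amdahl's Law: 1 / ((1 - F) + F / S)."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

# Web-serving example: 40% of the time in computation, new CPU 10x faster on it.
print(round(overall_speedup(0.4, 10), 2))  # 1.56
```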
An Example
• Implementations of floating-point (FP) square root vary significantly in performance, especially among processors designed for graphics.
• Suppose FP square root (FPSQR) is responsible for 20% of the execution time of a critical graphics benchmark. One proposal is to enhance the FPSQR hardware and speed up this operation by a factor of 10. The other alternative is just to try to make all FP instructions in the graphics processor run faster by a factor of 1.6.
• FP instructions are responsible for a total of 50% of the execution time for the application. The design team believes that they can make all FP instructions run 1.6 times faster with the same effort as required for the fast square root.

51
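Again a worked sketch (the slide only poses the question; the comparison is computed here with the same Amdahl formula): speeding up FPSQR by 10x on 20% of the time gives 1/(0.8 + 0.2/10) ≈ 1.22, while speeding up all FP by 1.6x on 50% of the time gives 1/(0.5 + 0.5/1.6) ≈ 1.23, so the broader but smaller improvement is slightly better.

```python
def overall_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

print(round(overall_speedup(0.20, 10.0), 2))  # FPSQR 10x faster   -> ≈ 1.22
print(round(overall_speedup(0.50, 1.6), 2))   # all FP 1.6x faster -> ≈ 1.23
```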
Calculating CPU Performance

• Presently all computers are constructed using a clock running at a constant rate.
• These discrete time events are called clock periods, cycles, clock cycles, etc.
• Computer designers refer to the time of a clock period by its duration (e.g., 1 ns) or by its rate (e.g., 1 GHz). CPU time for a program can then be expressed in two ways:

CPU time = CPU clock cycles for a program * Clock cycle time

or

CPU time = CPU clock cycles for a program / Clock rate

52
Clock cycles per instruction

• In addition to the number of clock cycles needed to execute a program, we can also count the number of instructions executed - the instruction path length or instruction count (IC). If we know the number of clock cycles and the instruction count, we can calculate the average number of clock cycles per instruction (CPI).
• Designers sometimes also use instructions per clock (IPC), which is the inverse of CPI.
• CPI is computed as:

CPI = CPU clock cycles for a program / Instruction count

53
Calculating CPU time I

• By transposing the instruction count in the formula for CPI, the number of clock cycles can be defined as IC * CPI. The execution time can then be calculated using the following formula:

CPU time = IC * CPI * Clock cycle time

• Expanding the formula above into the units of measurement and inverting the clock rate shows how the pieces fit together:

(Instructions / Program) * (Clock cycles / Instruction) * (Seconds / Clock cycle) = Seconds / Program = CPU time

• Unfortunately, it is difficult to change one parameter in complete isolation from the others, because the basic technologies involved in changing each characteristic are interdependent:

  - Clock cycle time - hardware technology and organization
  - CPI - organization and instruction set architecture
  - Instruction count - instruction set architecture and compiler technology

54
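A small sketch of the CPU performance equation; the instruction count, CPI, and clock rate below are made-up illustrative values, not data from the slides.

```python
def cpu_time(instruction_count: int, cpi: float, clock_rate_hz: float) -> float:
    """CPU time = IC * CPI * Clock cycle time = IC * CPI / Clock rate."""
    return instruction_count * cpi / clock_rate_hz

# Hypothetical program: 2 billion instructions, average CPI of 1.5, 1 GHz clock.
print(cpu_time(2_000_000_000, 1.5, 1e9))  # 3.0 seconds
```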
Calculating CPU time II

• Sometimes it is useful in designing the CPU to calculate the number of total CPU clock cycles as:

CPU clock cycles = Σ ICi * CPIi

where ICi represents the number of times instruction i is executed and CPIi represents the average number of clock cycles per instruction for instruction i.

• This form can be used to express CPU time as:

CPU time = (Σ ICi * CPIi) * Clock cycle time

and the overall CPI as:

CPI = (Σ ICi * CPIi) / Instruction count = Σ (ICi / Instruction count) * CPIi

55
An example I

Suppose we have made the following measurements:

• Frequency of FP operations (other than FPSQR) = 25%
• Average CPI of FP operations = 4.0
• Average CPI of other instructions = 1.33
• Frequency of FPSQR = 2%
• CPI of FPSQR = 20

Assume that the two design alternatives are to decrease the CPI of FPSQR to 2 or to decrease the average CPI of all FP operations to 2.5.
Compare these two design alternatives using the CPU performance equation.

56
An example – answer I

• Since only the CPI changes, the original CPI with neither enhancement is:

CPIoriginal = Σ (ICi / Instruction count) * CPIi

• The CPI for the enhanced FPSQR can be computed by subtracting the cycles saved from the original CPI:

CPIwith new FPSQR = CPIoriginal − FrequencyFPSQR * (CPIold FPSQR − CPInew FPSQR only)

57
An example – answer II

• We can compute the CPI for the enhancement of all FP instructions the same way, or by summing the FP and non-FP CPIs. Using the latter gives us:

CPInew FP = (Frequency of non-FP ops * CPI of non-FP ops) + (Frequency of FP ops * CPI of new FP ops)

• Then the speedup for the overall FP enhancement is:

Speedupnew FP = CPU timeoriginal / CPU timenew FP = CPIoriginal / CPInew FP

58
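A hedged numerical sketch of this comparison. It follows the standard textbook treatment of this example, in which the 4.0 average CPI of FP operations is applied to a 25% FP frequency and the FPSQR savings are subtracted separately; under that interpretation the original CPI is about 2.0 and the all-FP enhancement wins narrowly (speedup ≈ 1.23 versus ≈ 1.22). The exact accounting of the FPSQR frequency within the FP mix is an assumption, since the answer slides' equations did not survive extraction.

```python
# Measurements from the slide (frequencies as fractions of the instruction count).
freq_fp, cpi_fp = 0.25, 4.0          # FP operations
freq_other, cpi_other = 0.75, 1.33   # all other instructions
freq_fpsqr, cpi_fpsqr = 0.02, 20.0   # FPSQR (its cycles assumed folded into the FP average)

cpi_original = freq_fp * cpi_fp + freq_other * cpi_other      # ≈ 2.0

# Alternative 1: FPSQR CPI drops from 20 to 2; subtract the cycles saved.
cpi_new_fpsqr = cpi_original - freq_fpsqr * (cpi_fpsqr - 2.0)  # ≈ 1.64

# Alternative 2: all FP operations run 1.6x faster, i.e. the FP CPI drops to 2.5.
cpi_new_fp = freq_other * cpi_other + freq_fp * 2.5            # ≈ 1.62

print(round(cpi_original / cpi_new_fpsqr, 2))  # speedup of the FPSQR enhancement ≈ 1.22
print(round(cpi_original / cpi_new_fp, 2))     # speedup of the all-FP enhancement ≈ 1.23
```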
