
Introduction

• Computer technology has made incredible progress over roughly the last 55 years.
• Today, less than a thousand dollars will purchase a personal computer that has more performance than a computer bought in 1980 for one million dollars.
• Presently, workstation performance (measured in SPECmarks) improves by roughly 50% per year (2x every 18 months).

1
The Changing Face of Computing

• Large mainframes
• Desktop computing
• Servers
• Embedded computers

2
Three computer classes

Price of system:
- Desktop: $1,000 - $10,000
- Server: $10,000 - $10,000,000
- Embedded: $10 - $100,000

Price of microprocessor:
- Desktop: $100 - $1,000
- Server: $200 - $2,000 (per processor)
- Embedded: $0.20 - $200 (per processor)

Microprocessors sold per year (estimates for 2000):
- Desktop: 150,000,000
- Server: 4,000,000
- Embedded: 300,000,000 (32-bit & 64-bit processors only)

Critical system design issues:
- Desktop: price-performance, graphics performance
- Server: throughput, availability, scalability
- Embedded: price, power consumption, application-specific performance

3
Technology Trends

• Integrated circuit logic technology
• Semiconductor DRAM
• Magnetic disk technology
• Network technology

4
A taxonomy of parallel architectures

• Control mechanism
  - SIMD
  - MIMD
• Address-space organization
  - message-passing architecture
  - shared-address-space architecture
    - UMA
    - NUMA
• Interconnection networks
  - static
  - dynamic
• Processor granularity
  - coarse-grain computers
  - medium-grain computers
  - fine-grain computers

5
Flynn’s classification
– Single-instruction-stream, single-data-stream (SISD) computers
  - typical uniprocessors
  - parallelism through pipelining
– Multiple-instruction-stream, single-data-stream (MISD) computers
  - not often used in practice
– Single-instruction-stream, multiple-data-stream (SIMD) computers
  - vector and array processors
– Multiple-instruction-stream, multiple-data-stream (MIMD) computers
  - multiprocessors

6
Typical shared-address-space architecture

[Figure: two diagrams of processors (P) and memory modules (M) connected through an interconnection network - a uniform-memory-access (UMA) computer, and a non-uniform-memory-access (NUMA) computer with local memory only.]

7
A typical message-passing architecture

[Figure: several processors (P), each with its own local memory (M), connected by an interconnection network.]

P - processor
M - memory

8
Dynamic interconnection networks

Crossbar switching networks

Bus-based networks

Multistage interconnection networks

9
A completely nonblocking crossbar switch connecting p processors to b memory banks

[Figure: processors P0 ... Pp-1 connected to memory banks M0 ... Mb-1 through a grid of switching elements; one crossing point is highlighted as "a switching element".]

10
A typical bus-based architecture with no cache

[Figure: several processors sharing a single bus that connects them to a global memory.]

11
Multistage interconnection network

[Figure: processors 0 ... p-1 connected to memory banks 0 ... b-1 through a multistage interconnection network with stages 1, 2, ..., n.]

12
Cost and performance

[Figure: two qualitative plots against the number of processors - the performance and the cost of bus-based, multistage, and crossbar networks.]

13
Omega network

[Figure: an 8 x 8 omega network connecting inputs 000-111 to outputs 000-111 through stages of 2 x 2 switching elements; each element can be set to pass-through or cross-over.]

14
An example of blocking in an omega network

[Figure: an 8 x 8 omega network in which two messages contend for the same link, so one of them is blocked.]

15
Static interconnection networks

• Completely-connected network
• Star-connected network
• Linear array
• Ring
• Mesh (2D, 3D, wraparound)
• Hypercube

16
Examples of static interconnection networks

[Figure: a completely-connected network, a star-connected network, a linear array, a ring, a two-dimensional mesh, and a two-dimensional wraparound mesh.]

17
Examples of static interconnection networks

[Figure: a three-dimensional mesh, and a complete binary tree network (processors at the leaves, switching elements at the internal nodes) with message routing.]

18
Hypercube

[Figure: 0-D, 1-D, 2-D, 3-D, and 4-D hypercubes; each node is labeled with a binary string, and nodes whose labels differ in exactly one bit are connected.]

19
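A quick illustrative sketch (not from the slides): in a d-dimensional hypercube, two nodes are connected exactly when their binary labels differ in a single bit, so each node's d neighbors can be enumerated by flipping one bit at a time.

```python
def hypercube_neighbors(node: int, d: int) -> list:
    """Return the labels of the d neighbors of `node` in a d-dimensional hypercube."""
    return [node ^ (1 << bit) for bit in range(d)]

# Example: node 5 (binary 0101) in a 4-D hypercube.
print([format(n, "04b") for n in hypercube_neighbors(0b0101, 4)])
# -> ['0100', '0111', '0001', '1101']
```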
Evaluating Static Interconnection Networks

• Diameter - the maximum distance between any two processors in the network.
• Connectivity - a measure of the multiplicity of paths between any two processors.
• Arc connectivity - the minimum number of arcs that must be removed from the network to break it into two disconnected networks.
• Bisection width - the minimum number of communication links that have to be removed to partition the network into two equal halves.
• Channel width - the number of bits that can be communicated simultaneously over a link connecting two processors.
• Bisection bandwidth - the minimum volume of communication allowed between any two halves of the network with an equal number of processors.
• Cost - for example, the number of communication links.

20
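For concreteness, the closed-form values of two of these metrics for a few of the static topologies above can be tabulated in a short sketch. The formulas below are the standard ones for p-processor networks; they are supplied here for illustration, not taken from the slides.

```python
import math

# Diameter and bisection width for some static networks with p processors
# (standard closed forms; assumes p is a power of 2 where the topology requires it).
def metrics(p: int):
    return {
        "completely-connected": (1, p * p // 4),
        "ring":                 (p // 2, 2),
        "hypercube":            (int(math.log2(p)), p // 2),
    }

for name, (diameter, bisection) in metrics(16).items():
    print(f"{name:22s} diameter={diameter:2d}  bisection width={bisection}")
```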
Communication costs in static interconnection networks

Principal parameters
- startup time (ts)
- per-hop time (th)
- per-word transfer time (tw)

Routing techniques
- store-and-forward routing
- cut-through routing

21
Passing a message from processor P0 to P3

[Figure: three time diagrams for processors P0-P3 -
 a single message sent over a store-and-forward network,
 a single message broken into two parts and sent over a cut-through network,
 a single message broken into four parts and sent over a cut-through network.]

22
Communication costs depend on the routing strategy

Store-and-forward routing - the message travels between processors hop by hop, and each intermediate processor stores it in its local memory until the whole message has been received:

tcomm = ts + (m*tw + th)*l

Cut-through routing - the message is divided into parts, which are forwarded between processors without waiting for the whole message:

tcomm = ts + l*th + m*tw

where m is the message size in words and l is the number of links traversed.

23
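As a quick illustration, the two cost models can be compared directly. The parameter values below (startup, per-hop, and per-word times, message length, and link count) are made up for the example, not measurements from the slides.

```python
def store_and_forward(ts: float, th: float, tw: float, m: int, l: int) -> float:
    """t_comm = ts + (m*tw + th) * l"""
    return ts + (m * tw + th) * l

def cut_through(ts: float, th: float, tw: float, m: int, l: int) -> float:
    """t_comm = ts + l*th + m*tw"""
    return ts + l * th + m * tw

# Hypothetical values: startup 10 us, per-hop 1 us, per-word 0.5 us,
# a 100-word message crossing 4 links.
print(store_and_forward(10, 1, 0.5, 100, 4))  # 10 + (50 + 1)*4 = 214 us
print(cut_through(10, 1, 0.5, 100, 4))        # 10 + 4 + 50     = 64 us
```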
Program Performance Metrics

⇒ The parallel run time (Tpar) is the time from the moment the computation starts to the moment the last processor finishes its execution.

⇒ The speedup (S) is defined as the ratio of the time needed to solve the problem on a single processor (Tseq) to the time required to solve the same problem on a parallel system with p processors (Tpar):
– relative - Tseq is the execution time of the parallel algorithm executing on one of the processors of the parallel computer
– real - Tseq is the execution time of the best-known algorithm using one of the processors of the parallel computer
– absolute - Tseq is the execution time of the best-known algorithm using the best-known computer

24
Program Performance Metrics

⇒ The efficiency (E) of a parallel program is defined as the ratio of the speedup to the number of processors.

⇒ The cost is usually defined as the product of the parallel run time and the number of processors.

⇒ The scalability of a parallel system is a measure of its capacity to increase speedup in proportion to the number of processors.

25
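These definitions translate directly into code. A minimal sketch with hypothetical helper functions, where S is computed as Tseq/Tpar as defined on the previous slide:

```python
def speedup(t_seq: float, t_par: float) -> float:
    return t_seq / t_par

def efficiency(t_seq: float, t_par: float, p: int) -> float:
    return speedup(t_seq, t_par) / p

def cost(t_par: float, p: int) -> float:
    return t_par * p

# Example: a job that takes 100 s sequentially and 16 s on 8 processors.
print(speedup(100, 16))        # 6.25
print(efficiency(100, 16, 8))  # ~0.78
print(cost(16, 8))             # 128 processor-seconds (vs. 100 for the serial run)
```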
Amdahl’s Law

⇒ When executing a parallel program we can distinguish two parts: a sequential part (Pseq), which must be executed sequentially on one processor, and a parallel part (1 − Pseq), which can be executed independently on a number of processors.
⇒ Let’s assume that executing this program on a single processor takes serial time t1. Then, if p indicates the number of processors used during parallel execution, the parallel run time can be expressed by

Tpar = t1*Pseq + (1 − Pseq)*t1/p

⇒ The speedup is then expressed by

S = t1 / (t1*Pseq + (1 − Pseq)*t1/p) ≤ 1/Pseq

26
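A small sketch of this bound (with an illustrative 10% sequential fraction, not a value from the slides): even with a modest sequential part, the speedup saturates at 1/Pseq.

```python
def amdahl_speedup(p_seq: float, p: int) -> float:
    """S = t1 / (t1*Pseq + (1 - Pseq)*t1/p); t1 cancels out."""
    return 1.0 / (p_seq + (1.0 - p_seq) / p)

# With a 10% sequential part, the speedup is capped at 1/0.1 = 10.
for p in (2, 8, 32, 1024):
    print(p, round(amdahl_speedup(0.1, p), 2))
# 2 -> 1.82, 8 -> 4.71, 32 -> 7.8, 1024 -> 9.91
```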
Gustafson Speedup

⇒ Let’s assume that the execution of the maximum-size problem using the parallel algorithm on p processors takes Pseq + Ppar = 1 time, where Pseq indicates the sequential part of the program and Ppar the parallel part, respectively.
⇒ Its sequential time (using one processor) will be Pseq + p*Ppar.
⇒ Then we obtain the following expression for the speedup:

SG = (Pseq + p*Ppar) / (Pseq + Ppar) = Pseq + p*Ppar ≥ p*Ppar

27
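The contrast with Amdahl's fixed-size bound can be seen in a couple of lines (again an illustrative sketch; the 10% sequential share is a made-up value):

```python
def gustafson_speedup(p_seq: float, p: int) -> float:
    """SG = Pseq + p*Ppar with Pseq + Ppar = 1 (scaled problem size)."""
    return p_seq + p * (1.0 - p_seq)

# With Pseq = 0.1, the scaled speedup keeps growing with p
# instead of saturating at 10 as in Amdahl's fixed-size model.
for p in (2, 8, 32, 1024):
    print(p, round(gustafson_speedup(0.1, p), 1))
# 2 -> 1.9, 8 -> 7.3, 32 -> 28.9, 1024 -> 921.7
```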
Other laws

⇒ Lee’s law (1980):   S ≤ p / log2(p)

It is a generalisation of Amdahl’s Law.

⇒ Minsky’s law (1990):   S ≤ log2(p)

It can be used for programs with branch points on a SIMD structure.

28
Other laws
Stone’s Table

Speedup (S)        Type of algorithms
------------------------------------------------------------------------------------
α * p              matrix computations, discretization
α * p / log2(p)    sorts, linear recursions, evaluation of polynomials
α * log2(p)        search for an element in a set
α                  certain non-linear recursions
------------------------------------------------------------------------------------

where p is the number of processors used and α is a positive number smaller than 1 that depends on the machine.

29
The Scalability of Parallel Systems

Let’s consider the problem of adding n numbers on a hypercube.

In the first step each processor locally adds its n/p numbers; in each subsequent step half of the partial sums are transmitted to adjacent processors and added; the procedure finishes when one chosen processor holds the final sum.
Assume that it takes one unit of time both to add two numbers and to communicate a number between two directly connected processors.
Then adding the n/p numbers local to each processor takes n/p − 1 time.
The p partial sums are added in log p steps (each consisting of one addition and one communication).

30
The Scalability of Parallel Systems

Thus the total parallel run time can be approximated by

Tp = n/p + 2*log p

Since the serial run time can be approximated by n, the expressions for speedup and efficiency are as follows:

S = n*p / (n + 2*p*log p)          E = n / (n + 2*p*log p)

These expressions can be used to calculate the speedup and efficiency for any pair of n and p.

31
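A short sketch that evaluates these two expressions; under the stated unit-cost assumptions it reproduces the kind of numbers shown in the efficiency table two slides below.

```python
from math import log2

def parallel_time(n: int, p: int) -> float:
    return n / p + 2 * log2(p)

def speedup(n: int, p: int) -> float:
    return n * p / (n + 2 * p * log2(p))

def efficiency(n: int, p: int) -> float:
    return n / (n + 2 * p * log2(p))

# Adding n numbers on a p-processor hypercube.
for n in (64, 192, 320, 512):
    print(n, [round(efficiency(n, p), 2) for p in (1, 4, 8, 16, 32)])
# 64  -> [1.0, 0.8, 0.57, 0.33, 0.17]
# 512 -> [1.0, 0.97, 0.91, 0.8, 0.62]
```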
The Scalability of Parallel Systems
Speedup versus the number of processors

[Figure: speedup S as a function of the number of processors for n = 64, 192, 320, and 512, together with the linear-speedup line.]

32
The Scalability of Parallel Systems

Efficiency as a function of n and p

n      p = 1   p = 4   p = 8   p = 16   p = 32
64     1.0     0.80    0.57    0.33     0.17
192    1.0     0.92    0.80    0.60     0.38
320    1.0     0.95    0.87    0.71     0.50
512    1.0     0.97    0.91    0.80     0.62

⇒ For a given problem instance, the speedup does not increase linearly as the number of processors increases.
⇒ A larger instance of the same problem yields higher speedup and efficiency for the same number of processors.

33
The Scalability of Parallel Systems

⇒ A parallel system is scalable if it can maintain efficiency at a fixed value by simultaneously increasing the number of processors and the size of the problem.

⇒ The scalability of a parallel system is a measure of its capacity to increase speedup in proportion to the number of processors.

⇒ Scalability reflects a parallel system’s ability to utilize increasing processing resources effectively.

34
Measuring and Reporting Performance

• Two key aspects
  • execution time
  • throughput
  • making one faster may slow the other

• Comparing performance
  • performance = 1 / execution time
  • if X is n times faster than Y: n = Execution timeY / Execution timeX
  • similar for throughput comparisons
  • improved performance ==> decreasing execution time

35
Measuring Performance

Execution time can be defined in different ways:

• wall-clock time (response time, elapsed time) - it’s what you see, but it depends on
  • computer load
  • I/O delays
  • OS overhead
• CPU time - the time spent computing your program
  • does not include time spent waiting for I/O
  • includes the OS + your program
• hence system CPU time and user CPU time

36
Unix time command

• The Unix time command reports:
  • user CPU time
  • system CPU time
  • total elapsed time
  • % of elapsed time that is user + system CPU time

• An answer of 90.7u 12.9s 2:39 65% means:
  • user CPU time - 90.7 seconds
  • system CPU time - 12.9 seconds
  • elapsed time - 2 minutes and 39 seconds
  • percentage of elapsed time - (90.7 + 12.9) / 159 ≈ 65%

37
Reporting Performance Results

• The guiding principle of reporting performance measurements should be reproducibility - another experimenter should be able to duplicate the results.

However:
  • A system’s software configuration can significantly affect the performance results for a benchmark.
  • Similarly, compiler technology can play a big role in the performance of compute-oriented benchmarks.

• For these reasons it is important to describe exactly the software system being measured, as well as whether any special or nonstandard modifications have been made.

38
Other Problems

• Which is better?
• By how much?
• Are the programs equally important?

[Table: execution times of two programs (P1, P2) on three machines (A, B, C), from Smith [1988].]

39
Total Execution Time: A Consistent Summary Measure

• The simplest approach to summarizing relative performance is to use the total execution time of the two programs:
  - B is 9.1 times faster than A for programs P1 and P2.
  - C is 25 times faster than A for programs P1 and P2.
  - C is 2.75 times faster than B for programs P1 and P2.
• If the workload consisted of running programs P1 and P2 an equal number of times, the statements above would predict the relative execution times for the workload on each machine.
• An average of the execution times that tracks total execution time is the arithmetic mean:

Arithmetic mean = (1/n) * Σ Timei

where Timei is the execution time for the i-th program of a total of n in the workload.

40
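A sketch of these comparisons in code. The individual execution times below (P1: 1 s, 10 s, 20 s and P2: 1000 s, 100 s, 20 s on machines A, B, C) are the values commonly quoted for Smith's example; they are consistent with the 9.1x, 25x and 2.75x ratios above but are filled in here as an assumption rather than read off the missing table.

```python
# Execution times in seconds (assumed values, see note above).
times = {"A": {"P1": 1, "P2": 1000},
         "B": {"P1": 10, "P2": 100},
         "C": {"P1": 20, "P2": 20}}

def total(machine: str) -> float:
    return sum(times[machine].values())

def arithmetic_mean(machine: str) -> float:
    return total(machine) / len(times[machine])

print(round(total("A") / total("B"), 2))  # 1001 / 110 ≈ 9.1  -> B is 9.1x faster than A
print(round(total("A") / total("C"), 2))  # 1001 / 40  ≈ 25   -> C is 25x faster than A
print(round(total("B") / total("C"), 2))  # 110 / 40   = 2.75 -> C is 2.75x faster than B
print(arithmetic_mean("A"), arithmetic_mean("B"), arithmetic_mean("C"))  # 500.5 55.0 20.0
```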
Total Execution Time: A Consistent Summary Measure

Then the question arises:

• What is the proper mixture of programs for the workload?

• Are programs P1 and P2 in fact run equally often in the workload, as assumed by the arithmetic mean?

• If not, then there are two approaches that have been tried for summarizing performance.

41
Weighted Execution Time

• The first approach, when given an unequal mix of programs in the workload, is to assign a weighting factor wi to each program to indicate the relative frequency of the program in that workload.
• If, for example, 20% of the tasks in the workload were program P1 and 80% of the tasks in the workload were program P2, then the weighting factors would be 0.2 and 0.8 (the weighting factors add up to 1).
• By summing the products of the weighting factors and the execution times, a clear picture of the performance of the workload is obtained.

42
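A minimal sketch of the weighted arithmetic mean, using the 0.2/0.8 weights from the slide and the same assumed execution times as in the earlier sketch:

```python
# Weighted arithmetic mean = sum(w_i * Time_i); the weights must add up to 1.
weights = {"P1": 0.2, "P2": 0.8}
times = {"A": {"P1": 1, "P2": 1000},
         "B": {"P1": 10, "P2": 100},
         "C": {"P1": 20, "P2": 20}}

def weighted_mean(machine: str) -> float:
    return sum(weights[p] * t for p, t in times[machine].items())

for m in ("A", "B", "C"):
    print(m, round(weighted_mean(m), 1))
# A 800.2, B 82.0, C 20.0
```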
Weighted Execution Time

43
Normalized Execution Time and the Pros and Cons of Geometric Means

• A second approach to an unequal mixture of programs in the workload is to normalize execution times to a reference machine and then take the average of the normalized execution times.
• The average normalized execution time can be expressed as either an arithmetic or a geometric mean.
• The formula for the geometric mean is

Geometric mean = (Π Execution time ratioi)^(1/n)

where Execution time ratioi is the execution time, normalized to the reference machine, for the i-th program of a total of n in the workload.
• The geometric mean has the nice property:
  - ratio of the means = mean of the ratios

44
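A sketch of the geometric mean of normalized execution times (same assumed times as before, normalized to machine A), illustrating the "ratio of the means = mean of the ratios" property:

```python
from math import prod

times = {"A": {"P1": 1, "P2": 1000},
         "B": {"P1": 10, "P2": 100},
         "C": {"P1": 20, "P2": 20}}

def geometric_mean(machine: str, reference: str) -> float:
    ratios = [times[machine][p] / times[reference][p] for p in times[machine]]
    return prod(ratios) ** (1 / len(ratios))

# Normalized to A, the geometric means of B and C ...
gm_b, gm_c = geometric_mean("B", "A"), geometric_mean("C", "A")
print(round(gm_b, 3), round(gm_c, 3))                       # 1.0  0.632
# ... and their ratio equals the geometric mean of C's times normalized to B.
print(round(gm_c / gm_b, 3), round(geometric_mean("C", "B"), 3))  # 0.632  0.632
```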
Normalized Execution Time and the Pros and Cons of
Geometric Means

45
Amdahl’s Law – hardware approach

• Amdahl’s Law states that the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used.

• Amdahl’s Law defines the speedup that can be gained by using a particular feature.

46
What is speedup?

• Suppose that we can make an enhancement to a machine that will improve performance when it is used. Speedup is the ratio:

Speedup = Performance for the entire task using the enhancement when possible / Performance for the entire task without using the enhancement

• Alternatively:

Speedup = Execution time for the entire task without using the enhancement / Execution time for the entire task using the enhancement when possible

• Speedup tells us how much faster a task will run using the machine with the enhancement as opposed to the original machine.

47
What does Amdahl’s Law say?

Amdahl’s Law gives us a quick way to find the speedup from some enhancement, which depends on two factors:

• The fraction of the computation time in the original machine that can be converted to take advantage of the enhancement.
  – For example, if 20 seconds of the execution time of a program that takes 60 seconds in total can use an enhancement, the fraction is 20/60. This value, which we will call Fractionenhanced, is always less than or equal to 1.

• The improvement gained by the enhanced execution mode; that is, how much faster the task would run if the enhanced mode were used for the entire program.
  – This value is the time of the original mode over the time of the enhanced mode: if the enhanced mode takes 2 seconds for some portion of the program that can completely use the mode, while the original mode took 5 seconds for the same portion, the improvement is 5/2. We will call this value, which is always greater than 1, Speedupenhanced.

48
How do we calculate the speedup?

The execution time using the original machine with the enhanced mode will be the time spent using the unenhanced portion of the machine plus the time spent using the enhancement:

Execution timenew = Execution timeold * ((1 − Fractionenhanced) + Fractionenhanced / Speedupenhanced)

The overall speedup is the ratio of the execution times:

Speedupoverall = Execution timeold / Execution timenew = 1 / ((1 − Fractionenhanced) + Fractionenhanced / Speedupenhanced)

49
An Example

• Suppose that we are considering an enhancement to the processor of a server system used for Web serving. The new CPU is 10 times faster on computation in the Web serving application than the original processor.
• Assuming that the original CPU is busy with computation 40% of the time and is waiting for I/O 60% of the time, what is the overall speedup gained by incorporating the enhancement?

50
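A worked sketch of the answer, computed here from the numbers on the slide with the overall-speedup formula (the answer itself is not given on the slide): Fractionenhanced = 0.4 and Speedupenhanced = 10, so the overall speedup is 1 / (0.6 + 0.4/10) ≈ 1.56.

```python
def overall_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Amdahl's Law: 1 / ((1 - F) + F / S)."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

# Web-serving example: 40% of the time in computation, new CPU 10x faster on it.
print(round(overall_speedup(0.4, 10), 2))  # 1.56
```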
An Example
• Implementations of floating-point (FP) square root vary significantly in performance, especially among processors designed for graphics.
• Suppose FP square root (FPSQR) is responsible for 20% of the execution time of a critical graphics benchmark. One proposal is to enhance the FPSQR hardware and speed up this operation by a factor of 10. The other alternative is just to try to make all FP instructions in the graphics processor run faster by a factor of 1.6.
• FP instructions are responsible for a total of 50% of the execution time for the application. The design team believes that they can make all FP instructions run 1.6 times faster with the same effort as required for the fast square root.

51
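Again a worked sketch (the slide only poses the question; the comparison is computed here with the same Amdahl formula): speeding up FPSQR by 10x on 20% of the time gives 1/(0.8 + 0.2/10) ≈ 1.22, while speeding up all FP by 1.6x on 50% of the time gives 1/(0.5 + 0.5/1.6) ≈ 1.23, so the broader but smaller improvement is slightly better.

```python
def overall_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

print(round(overall_speedup(0.20, 10.0), 2))  # FPSQR 10x faster   -> ≈ 1.22
print(round(overall_speedup(0.50, 1.6), 2))   # all FP 1.6x faster -> ≈ 1.23
```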
Calculating CPU Performance

• Presently all computers are constructed using a clock running at a constant rate.
• These discrete time events are called clock periods, cycles, clock cycles, etc.
• Computer designers refer to the time of a clock period by its duration (e.g., 1 ns) or by its rate (e.g., 1 GHz). CPU time for a program can then be expressed in two ways:

CPU time = CPU clock cycles for a program * Clock cycle time

or

CPU time = CPU clock cycles for a program / Clock rate

52
Clock cycles per instruction

• In addition to the number of clock cycles needed to execute a program, we can also count the number of instructions executed - the instruction path length or instruction count (IC). If we know the number of clock cycles and the instruction count, we can calculate the average number of clock cycles per instruction (CPI).
• Designers sometimes also use instructions per clock (IPC), which is the inverse of CPI.
• CPI is computed as:

CPI = CPU clock cycles for a program / Instruction count

53
Calculating CPU time I

• By transposing the instruction count in the formula for CPI, the number of clock cycles can be defined as IC * CPI. The execution time can then be calculated using the following formula:

CPU time = IC * CPI * Clock cycle time

• Expanding the formula above into the units of measurement and inverting the clock rate shows how the pieces fit together:

(Instructions / Program) * (Clock cycles / Instruction) * (Seconds / Clock cycle) = Seconds / Program = CPU time

• Unfortunately, it is difficult to change one parameter in complete isolation from the others, because the basic technologies involved in changing each characteristic are interdependent:

  - Clock cycle time - hardware technology and organization
  - CPI - organization and instruction set architecture
  - Instruction count - instruction set architecture and compiler technology

54
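A small sketch of the CPU performance equation; the instruction count, CPI, and clock rate below are made-up illustrative values, not data from the slides.

```python
def cpu_time(instruction_count: int, cpi: float, clock_rate_hz: float) -> float:
    """CPU time = IC * CPI * Clock cycle time = IC * CPI / Clock rate."""
    return instruction_count * cpi / clock_rate_hz

# Hypothetical program: 2 billion instructions, average CPI of 1.5, 1 GHz clock.
print(cpu_time(2_000_000_000, 1.5, 1e9))  # 3.0 seconds
```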
Calculating CPU time II

• Sometimes it is useful in designing the CPU to calculate the number of total CPU clock cycles as:

CPU clock cycles = Σ ICi * CPIi

where ICi represents the number of times instruction i is executed and CPIi represents the average number of clock cycles per instruction for instruction i.

• This form can be used to express CPU time as:

CPU time = (Σ ICi * CPIi) * Clock cycle time

and the overall CPI as:

CPI = (Σ ICi * CPIi) / Instruction count = Σ (ICi / Instruction count) * CPIi

55
An example I

Suppose we have made the following measurements:

• Frequency of FP operations (other than FPSQR) = 25%
• Average CPI of FP operations = 4.0
• Average CPI of other instructions = 1.33
• Frequency of FPSQR = 2%
• CPI of FPSQR = 20

Assume that the two design alternatives are to decrease the CPI of FPSQR to 2 or to decrease the average CPI of all FP operations to 2.5.
Compare these two design alternatives using the CPU performance equation.

56
An example – answer I

• Since only the CPI changes, the original CPI with neither enhancement is:

CPIoriginal = Σ (ICi / Instruction count) * CPIi

• The CPI for the enhanced FPSQR can be computed by subtracting the cycles saved from the original CPI:

CPIwith new FPSQR = CPIoriginal − FrequencyFPSQR * (CPIold FPSQR − CPInew FPSQR only)

57
An example – answer II

• We can compute the CPI for the enhancement of all FP instructions the same way, or by summing the FP and non-FP CPIs. Using the latter gives us:

CPInew FP = (Frequency of non-FP ops * CPI of non-FP ops) + (Frequency of FP ops * CPI of new FP ops)

• Then the speedup for the overall FP enhancement is:

Speedupnew FP = CPU timeoriginal / CPU timenew FP = CPIoriginal / CPInew FP

58
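A hedged numerical sketch of this comparison. It follows the standard textbook treatment of this example, in which the 4.0 average CPI of FP operations is applied to a 25% FP frequency and the FPSQR savings are subtracted separately; under that interpretation the original CPI is about 2.0 and the all-FP enhancement wins narrowly (speedup ≈ 1.23 versus ≈ 1.22). The exact accounting of the FPSQR frequency within the FP mix is an assumption, since the answer slides' equations did not survive extraction.

```python
# Measurements from the slide (frequencies as fractions of the instruction count).
freq_fp, cpi_fp = 0.25, 4.0          # FP operations
freq_other, cpi_other = 0.75, 1.33   # all other instructions
freq_fpsqr, cpi_fpsqr = 0.02, 20.0   # FPSQR (its cycles assumed folded into the FP average)

cpi_original = freq_fp * cpi_fp + freq_other * cpi_other      # ≈ 2.0

# Alternative 1: FPSQR CPI drops from 20 to 2; subtract the cycles saved.
cpi_new_fpsqr = cpi_original - freq_fpsqr * (cpi_fpsqr - 2.0)  # ≈ 1.64

# Alternative 2: all FP operations run 1.6x faster, i.e. the FP CPI drops to 2.5.
cpi_new_fp = freq_other * cpi_other + freq_fp * 2.5            # ≈ 1.62

print(round(cpi_original / cpi_new_fpsqr, 2))  # speedup of the FPSQR enhancement ≈ 1.22
print(round(cpi_original / cpi_new_fp, 2))     # speedup of the all-FP enhancement ≈ 1.23
```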
