ppa svnit

Performance

Speedup

Efficiency

Amdahl's law

Analyzing the performance of a parallel algorithm helps determine whether coding it is worthwhile, and how much improvement can be realized by increasing the number of processors used.

Speedup

Speedup measures the improvement in running time due to parallelism:

S(n) = ts / tp

where ts is the worst case running time of the fastest known sequential algorithm for the problem, and tp is the worst case running time of the parallel algorithm using n PEs (processing elements).

Speedup

The maximum possible speedup of parallel computers with n PEs for traditional problems is n.

Proof: Assume a computation is partitioned perfectly into n processes of equal duration, and assume no overhead is incurred as a result of this partitioning (e.g., the partitioning process itself, information passing, coordination of processes, etc.). Under these ideal conditions, the parallel computation will execute n times faster than the sequential computation: the parallel running time is ts/n, so the parallel speedup is

S(n) = ts / (ts/n) = n

Speedup greater than n (superlinear speedup) is possible only for nontraditional problems. Unfortunately, the best speedup possible for most applications is much smaller than n, since the ideal partitioning above is usually unfeasible:

Usually some parts of programs are sequential and allow only one PE to be active.

Sometimes a large number of processors are idle for certain portions of the program, e.g., while waiting to receive or to send data (recall that blocking can occur in message passing).

Superlinearity

For some nonstandard problems that can be solved only by parallel computation, it seems fair to say that ts = ∞. Since for a fixed tp > 0, S(n) = ts/tp is greater than 1 for all sufficiently large values of ts, it seems reasonable to consider these solutions to be superlinear.

Examples include nonstandard problems involving:

Real-time requirements, where deadlines are part of the problem requirements.

Problems where all data is not initially available, but has to be processed after it arrives.

Problems for which sequential solutions are inefficient.

Efficiency

Speedup: S = Time(the most efficient sequential algorithm) / Time(parallel algorithm)

Efficiency: E = S / N, where N is the number of processors.
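These two definitions translate directly into code. A minimal sketch (function names and the timing values are illustrative, not from the slides):

```python
def speedup(t_seq: float, t_par: float) -> float:
    """S = time of the most efficient sequential algorithm / time of the parallel algorithm."""
    return t_seq / t_par

def efficiency(t_seq: float, t_par: float, n: int) -> float:
    """E = S / N for N processors; 1.0 means perfect (linear) speedup."""
    return speedup(t_seq, t_par) / n

# Ideal partitioning onto 4 processors: ts = 100 s becomes tp = 25 s
print(speedup(100.0, 25.0))        # 4.0 (linear speedup)
print(efficiency(100.0, 25.0, 4))  # 1.0
```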

Amdahl's Law

Gene Amdahl, of Amdahl Corporation and other companies, found that there were some fairly stringent restrictions on how much of a speedup one could get for a given parallelized task. These observations were wrapped up in Amdahl's Law.

Programmers must have a very deep understanding of Amdahl's Law if they want to avoid having unrealistic expectations: the sequential portion of a task limits the benefit of adding processors, so it should be reduced as much, and as soon, as possible.

Amdahl's Law

Let f be the fraction of the computation that must be performed sequentially, where 0 ≤ f ≤ 1. The maximum speedup S achievable by a parallel computer with n processors is

S(n) ≤ 1 / ( f + (1 - f)/n )

Amdahl's Law

If the fraction of the computation that cannot be divided into concurrent tasks is f, and no overhead incurs when the computation is divided into concurrent parts, then the time to perform the computation with n processors is given by

tp ≥ f·ts + [(1 - f)·ts] / n

[Figure: execution time split into a sequential portion f·ts and a parallelized portion (1 - f)·ts / n]

The speedup is therefore

S(n) = ts / tp = ts / ( f·ts + (1 - f)·ts / n )

Dividing numerator and denominator by ts establishes Amdahl's law:

S(n) = 1 / ( f + (1 - f)/n )

Multiplying numerator & denominator by n

produces the following alternate version of this

formula:

n

n

S ( n)

nf (1 f ) 1 (n 1) f

16

S(n) ≤ 1/f

Note that S(n) never exceeds 1/f, no matter how large n becomes. As n increases, the speedup increases monotonically toward this bound. In practice, the achievable speedup is typically an increasing function of the problem size.

Amdahl's law

If F is the fraction of a calculation that is sequential, and (1 - F) is the fraction that can be parallelized, then the maximum speed-up that can be achieved by using n processors is 1/(F + (1 - F)/n).

Amdahl's law

If 90% of a calculation can be parallelized (i.e. 10% is sequential), then the maximum speed-up which can be achieved on 5 processors is 1/(0.1 + (1 - 0.1)/5), or roughly 3.6 (i.e. the program can theoretically run 3.6 times faster on five processors than on one).

Amdahl's law

The maximum speed-up on 10 processors is 1/(0.1 + (1 - 0.1)/10), or 5.3 (i.e. investing twice as much hardware speeds the calculation up by about 50%).

The maximum speed-up on 20 processors is 1/(0.1 + (1 - 0.1)/20), or 6.9 (i.e. doubling the hardware again speeds up the calculation by only 30%).

Amdahl's law

The maximum speed-up on 1000 processors is 1/(0.1 + (1 - 0.1)/1000), or 9.9 (i.e. throwing an absurd amount of hardware at the calculation results in a maximum theoretical speed-up of 9.9 vs. a single processor; actual results will be worse).
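The numbers in the examples above follow directly from the formula; a quick sketch checking them:

```python
def amdahl_speedup(f: float, n: int) -> float:
    """Maximum speedup with sequential fraction f on n processors: 1 / (f + (1-f)/n)."""
    return 1.0 / (f + (1.0 - f) / n)

# 10% sequential fraction, as in the examples above
for n in (5, 10, 20, 1000):
    print(n, round(amdahl_speedup(0.1, n), 1))
# 5 -> 3.6, 10 -> 5.3, 20 -> 6.9, 1000 -> 9.9
```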

[Figure: speedup vs. number of processors, with curves for problem sizes n = 100, n = 1,000, and n = 10,000]

Example

Suppose 95% of a program's execution time occurs inside a loop that can be executed in parallel. What is the maximum speedup we should expect from a parallel version of the program executing on 8 CPUs?

S ≤ 1 / ( 0.05 + (1 - 0.05)/8 ) ≈ 5.9

Essence

Amdahl's Law argues that using lots of parallel processors is not a practical way of achieving the sort of speed-up that people were looking for; i.e. it is essentially an argument in support of putting effort into making single-processor systems run faster.

In practice, actual speed-ups fall short of even Amdahl's bound. Splitting the work up and gathering the results back together is extra work required in the parallel version which isn't required in the serial version. And even in the parallel parts of the code, it is unlikely that all of the processors will be computing all of the time; some of them will likely run out of work to do before others have finished their part of the parallel work.

Amdahl's Law gives the maximum theoretical speed-up you can ever hope to achieve. The actual speed-ups are always less than the speed-up predicted by Amdahl's Law.

Degree of Parallelism

The number of processors used at any instant to execute a program is called the degree of parallelism (DOP); the plot of the DOP as a function of time is the parallelism profile of the program. The profile assumes an unbounded number of processors is available; this is not achievable in real machines, so some parallel program segments must be executed sequentially as smaller parallel segments.

[Figure: program execution on a parallel computer]

Parallelism profile

Fluctuation of a profile during an observation period depends on the algorithmic structure, program optimization, resource utilization, and run-time conditions.

With a limited number of processors in a system, some parallel branches must be executed in chunks sequentially. Parallelism still exists within each chunk, limited by the machine size. The DOP may also be limited by memory and by other nonprocessor resources.

Average Parallelism

Consider a parallel computer with n homogeneous processors. The maximum parallelism in a profile is m. The computing capacity Δ of a single processor is approximated by its execution rate in MIPS or Mflops, without considering the penalties from memory access, communication latency, or system overhead. When i processors are busy during an observation period, we have DOP = i.

Average Parallelism

The total amount of work W (instructions or computations) performed is proportional to the area under the profile curve:

W = Δ ∫ DOP(t) dt, taken over the observation period (t1, t2).

In discrete form, W = Δ Σ i·ti (i = 1..m), where ti is the total amount of time during which DOP = i, and Σ ti = t2 - t1 is the total elapsed time.

The average parallelism A is computed as

A = ( Σ i·ti ) / ( Σ ti ), for i = 1..m.

Example 3.1

The parallelism profile of a divide-and-conquer algorithm increases from 1 to its peak value m = 8 and decreases to 0 over its observation period (t1, t2).
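For a discrete profile, A = Σ i·ti / Σ ti can be computed directly. A sketch using a made-up profile (the times below are illustrative, not the example's actual data):

```python
# profile[i] = total time (in time units) during which DOP == i (hypothetical observation)
profile = {1: 2, 2: 3, 3: 2, 4: 1, 8: 2}

total_work = sum(i * t for i, t in profile.items())  # proportional to area under the curve
elapsed = sum(profile.values())                      # t2 - t1
A = total_work / elapsed                             # average parallelism
print(A)  # 3.4
```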

Available Parallelism

The potential parallelism in scientific and engineering calculations can be hundreds or thousands of instructions per clock cycle, but in real machines the actual exploited parallelism is much smaller (e.g. 10 or 20).

There is less available parallelism in non-numeric code than in scientific codes; it has relatively little parallelism even when basic block boundaries are ignored.

Available Parallelism

A basic block is a sequence of instructions with one entry and one exit. Basic blocks are frequently used as the unit of optimization in compilers (since it is easier to manage the use of registers utilized within a block). Compiler optimization and algorithm redesign may increase the available parallelism in an application, but limiting optimization to basic blocks limits the instruction-level parallelism that can be obtained (to about 2 to 5 in typical code). The DOP may reach thousands in some scientific codes when multiple processors are used.

Asymptotic Speedup - 1

Let ti be the total time during which DOP = i, and let Δ be the computing capacity of a single processor. Then:

Wi = Δ·i·ti (work performed while DOP = i)

W = Σ Wi, for i = 1..m (total work)

ti(1) = Wi / Δ (execution time of Wi with a single processor)

ti(k) = Wi / (k·Δ) (execution time of Wi with k processors)

ti(∞) = Wi / (i·Δ) (execution time of Wi with unlimited processors), for 1 ≤ i ≤ m

Asymptotic Speedup - 2

T(1) = Σ ti(1) = Σ Wi / Δ (time on a single processor)

T(∞) = Σ ti(∞) = Σ Wi / (i·Δ) (time with unlimited processors)

S∞ = T(1) / T(∞) = ( Σ Wi ) / ( Σ Wi / i ) = A (asymptotic speedup S∞ in the ideal case)

with all sums taken over i = 1..m.

Asymptotic Speedup - 3

A = ( Σ i·ti ) / ( Σ ti ), for i = 1..m

S∞ = T(1) / T(∞) = ( Σ Wi ) / ( Σ Wi / i ) = A

In the ideal case, S∞ = A. S∞ < A if communication latency and other system overhead are considered.
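The asymptotic speedup is easy to compute from the per-DOP work amounts. A sketch with hypothetical values (Δ is taken as 1 work unit per time unit):

```python
# W[i-1] = work executed while DOP == i, for i = 1..4 (hypothetical values)
W = [10.0, 0.0, 30.0, 40.0]

T1 = sum(W)                                    # T(1): single-processor time
Tinf = sum(w / i for i, w in enumerate(W, 1))  # T(inf): unlimited processors
S_inf = T1 / Tinf                              # equals average parallelism A in the ideal case
print(S_inf)
```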

Mean Performance

Consider the mean, or average, performance of a set of benchmark programs with potentially many different execution modes (e.g. scalar, vector, sequential, parallel). We may also weight the programs to emphasize these different modes and yield a more meaningful performance measure. Consider executing m programs in various modes with different performance levels.

Arithmetic Mean

The arithmetic mean is familiar (the sum of the terms divided by the number of terms). Our measures will use execution rates expressed in MIPS or Mflops.

Let Ri be the execution rates of the programs, i = 1, 2, ..., m. The arithmetic mean execution rate is defined as

Ra = ( Σ Ri ) / m

With a weight distribution {fi | i = 1, 2, ..., m}, the weighted arithmetic mean execution rate is Ra* = Σ fi·Ri.

The arithmetic mean execution rate is proportional to the sum of the inverses of the execution times; it is not inversely proportional to the sum of the execution times. Thus the arithmetic mean fails to represent the real time consumed by the benchmark programs when executed.

Harmonic Mean

We instead need the harmonic mean execution rate, which is just the inverse of the arithmetic mean of the execution time (thus guaranteeing the inverse relation not exhibited by the other means).

Let Ti = 1/Ri be the mean execution time per instruction for program i. The arithmetic mean execution time per instruction is

Ta = ( Σ 1/Ri ) / m

and the harmonic mean execution rate is defined by Rh = 1/Ta:

Rh = m / ( Σ 1/Ri ), for i = 1..m

With weights {fi} we compute the weighted harmonic mean:

Rh* = 1 / ( Σ fi / Ri ), for i = 1..m
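The difference between the two means is easy to see numerically. A sketch with hypothetical benchmark rates:

```python
rates = [10.0, 20.0, 40.0]  # MIPS for three benchmark programs (hypothetical values)
m = len(rates)

Ra = sum(rates) / m                   # arithmetic mean rate: overstates throughput
Rh = m / sum(1.0 / r for r in rates)  # harmonic mean rate: matches real elapsed time

print(Ra, Rh)  # Rh < Ra whenever the rates differ
```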

Harmonic Mean Speedup

The execution rate Ri is used to reflect the speed with i processors:

T1 = 1/R1 = 1 is the sequential execution time on a single processor with rate R1 = 1.

Ti = 1/Ri = 1/i is the execution time using i processors with a combined execution rate of Ri = i.

Now suppose a program has n execution modes with associated weights f1, ..., fn. The weighted harmonic mean speedup is defined as

S = T1 / T* = 1 / ( Σ fi / Ri ), for i = 1..n

where T* = 1/Rh* is the weighted arithmetic mean execution time.

Example 3.2


Amdahl's Law

Suppose Ri = i, f1 = α, fn = 1 - α, and fi = 0 for all other i. Basically this means the system is used sequentially (with probability α) or with all n processors (with probability 1 - α). Substituting into the weighted harmonic mean speedup yields the speedup equation known as Amdahl's law:

Sn = n / ( 1 + (n - 1)·α )

The implication is that the best speedup possible is 1/α, regardless of n (as n approaches infinity).
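Amdahl's law drops out of the weighted harmonic mean speedup numerically. A small sketch (function names are illustrative):

```python
def harmonic_speedup(weights, rates):
    """Weighted harmonic mean speedup: S = 1 / sum(f_i / R_i)."""
    return 1.0 / sum(f / r for f, r in zip(weights, rates))

def amdahl(alpha: float, n: int) -> float:
    # mode 1 (sequential, rate 1) with weight alpha; mode n (rate n) with weight 1 - alpha
    return harmonic_speedup([alpha, 1.0 - alpha], [1.0, float(n)])

print(amdahl(0.1, 10))            # ~5.26
print(10 / (1 + (10 - 1) * 0.1))  # same value via n / (1 + (n-1)*alpha)
```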

System Efficiency 1

O(n) = total number of unit operations performed by an n-processor system in completing a program P.

T(n) = execution time required to execute the program P on an n-processor system.

O(n) is usually the total number of instructions executed by the n processors, perhaps scaled by a constant factor.

O(1) = T(1) for a uniprocessor system (one unit operation per unit time).

T(n) < O(n) if more than one operation is performed by the n processors per unit time, i.e. when n ≥ 2.

System Efficiency 2

Speedup (the factor by which execution time is reduced with n processors) can now be expressed as

S(n) = T(1) / T(n)

Efficiency measures the effective utilization of the system as compared with the maximum possible speedup:

E(n) = S(n) / n = T(1) / ( n·T(n) )

Thus 1/n ≤ E(n) ≤ 1. The value is 1/n when only one processor is used (regardless of n), and the value is 1 when all processors are fully utilized.

Redundancy

Redundancy measures the extra operations performed by the parallel version:

R(n) = O(n) / O(1)

What values can R(n) attain?

R(n) = 1 when the number of operations performed is independent of the number of processors n. This is the ideal case: the parallelism is carried over to the hardware implementation without extra operations being performed.

R(n) = n when all processors perform the same number of operations as when only a single processor is used; this implies that n completely redundant computations are performed!

System Utilization

System utilization is defined as

U(n) = R(n)·E(n) = O(n) / ( n·T(n) )

U(n) indicates the degree to which the n processors were kept busy during execution of the program. Since 1 ≤ R(n) ≤ n and 1/n ≤ E(n) ≤ 1, the best possible value for U(n) is 1 and the worst is 1/n:

1/n ≤ E(n) ≤ U(n) ≤ 1

1 ≤ R(n) ≤ 1/E(n) ≤ n

Quality of Parallelism

The quality of a parallel computation is defined as

Q(n) = S(n)·E(n) / R(n) = T³(1) / ( n·T²(n)·O(n) )

This measure is directly related to speedup (S) and efficiency (E), and inversely related to redundancy (R). The quality measure is bounded above by the speedup (that is, Q(n) ≤ S(n)).
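All five measures can be computed together from T(1), T(n), O(1), and O(n). A sketch with hypothetical measurements:

```python
def metrics(T1: float, Tn: float, O1: float, On: float, n: int):
    """Speedup, efficiency, redundancy, utilization, and quality for n processors."""
    S = T1 / Tn    # speedup
    E = S / n      # efficiency
    R = On / O1    # redundancy
    U = R * E      # utilization = O(n) / (n * T(n))
    Q = S * E / R  # quality of parallelism
    return S, E, R, U, Q

# Hypothetical: 4 processors, 20% extra (redundant) operations performed
S, E, R, U, Q = metrics(T1=100.0, Tn=30.0, O1=100.0, On=120.0, n=4)
print(S, E, R, U, Q)  # Q <= S always holds
```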

MIPS and Mflops ratings are not very useful measures of system performance, since their interpretation depends on machine clock cycles and instruction sets. For example, is a 20 MIPS RISC computer faster than a machine with a different instruction set and clock rate? We cannot tell without knowing the instruction sets on the machines. Even the question "which machine is faster?" is suspect, since we really need to say "faster at doing what?"

Doing What?

To answer "faster at doing what?", benchmark suites of standard programs are frequently used.

The Dhrystone benchmark is a synthetic program that uses no operating system instructions, system calls, or library functions. It uses exclusively integer data items. Each execution of the entire set of high-level language statements is a Dhrystone, and a machine is rated as having a performance of some number of Dhrystones per second (sometimes reported as KDhrystones/sec).

The Whetstone benchmark is a synthetic program involving floating point and integer data, arrays, subroutines with parameters, conditional branching, and library functions. A machine's rating depends in large measure on the compiler used to generate the machine language.

Other Measures

Transactions per second (TPS) measures throughput for on-line transaction processing systems, such as those used to support ATMs, reservation systems, and point-of-sale terminals. The measure may include communication overhead, database search and update, and logging operations. The benchmark is also useful for rating relational database performance.

KLIPS (kilo logic inferences per second) measures the number of logic inferences per second that can be performed by a system, to indicate how well that system will perform at certain AI applications. Since one inference requires about 100 instructions (in the benchmark), a rating of 400 KLIPS is roughly equivalent to 40 MIPS.

Problem 1

Suppose we benchmark a program on two computer systems.

On system A, the object code executed 80 million Arithmetic Logic Unit operations (ALU ops), 40 million load instructions, and 25 million branch instructions.

On system B, the object code executed 50 million ALU ops, 50 million loads, and 40 million branch instructions.

In both systems, each ALU op takes 1 clock cycle, each load takes 3 clock cycles, and each branch takes 5 clock cycles.

A. Compute the relative frequency of occurrence of each type of instruction executed in both systems.

B. Find the CPI for each system.
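The requested quantities follow directly from the instruction counts. A sketch of the computation (variable names are illustrative; the counts come from the problem statement above):

```python
# instruction counts in millions, per system: (ALU ops, loads, branches)
counts = {"A": (80, 40, 25), "B": (50, 50, 40)}
cycles = (1, 3, 5)  # clock cycles per ALU op, load, branch

for name, c in counts.items():
    total = sum(c)
    freqs = [x / total for x in c]                       # relative frequencies
    cpi = sum(f * cyc for f, cyc in zip(freqs, cycles))  # cycles per instruction
    print(name, [round(f, 3) for f in freqs], round(cpi, 2))
# A: freqs ~ [0.552, 0.276, 0.172], CPI ~ 2.24
# B: freqs ~ [0.357, 0.357, 0.286], CPI ~ 2.86
```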

Problem 1 Solution

A. The relative frequency of each type of instruction executed in both systems:

System A (145 million instructions total): ALU ops 80/145 ≈ 0.552, loads 40/145 ≈ 0.276, branches 25/145 ≈ 0.172.

System B (140 million instructions total): ALU ops 50/140 ≈ 0.357, loads 50/140 ≈ 0.357, branches 40/140 ≈ 0.286.

B. The CPI for each system:

CPI(A) = (80·1 + 40·3 + 25·5) / 145 = 325/145 ≈ 2.24

CPI(B) = (50·1 + 50·3 + 40·5) / 140 = 400/140 ≈ 2.86

(b) MIPS Rate for 4 40-MHz Processors

(c) Speed-up

(d) Efficiency

