
Your Roll No....

CEN-704
VII Semester Examination 2018
B.Tech. (Computer Engineering)
Parallel and Distributed Computing

Paper No.: CEN-704

Max. Marks: 60
Time: 03 Hours

Write your Roll No. on top immediately on receipt of the question paper.
Attempt all questions by attempting any two parts from each question. All questions carry equal marks.

CO1
Q1(a) Three enhancements with the following speedups are proposed for a new architecture:
Speedup1 = 30, Speedup2 = 20, and Speedup3 = 15. Only one enhancement is usable at a time.
Assume the enhancements can be used 25%, 35% and 10% of the time for enhancements 1, 2
and 3 respectively. For what fraction of the reduced execution time is no enhancement in use?
If only two enhancements are to be used, then which two enhancements should be used? 06

Q1(b)
(i) An application program is executed on a nine-computer cluster. A benchmark program takes
time T on this cluster. Further, 25% of T is time in which the application runs simultaneously on
all nine computers. For the remaining time, the application has to run on a single computer.
Calculate the effective speedup under the aforementioned condition as compared to executing the
program on a single computer. Also calculate the percentage of code that has been parallelized in
the preceding program. 03
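The arithmetic in (i) can be sanity-checked with a short script (a sketch, with T normalized to 1; the normalization is an assumption, not part of the question):

```python
# Hedged worked check for Q1(b)(i), with T normalized to 1.
# On the cluster: 25% of T runs on all 9 computers, 75% on one computer.
T = 1.0
parallel_time = 0.25 * T   # wall-clock time spent running on 9 machines
serial_time = 0.75 * T     # wall-clock time spent on a single machine

# The same work on one computer: serial part unchanged,
# parallel part takes 9 times longer.
single_machine_time = serial_time + parallel_time * 9

speedup = single_machine_time / T                              # 3.0
parallel_fraction = (parallel_time * 9) / single_machine_time  # 0.75

print(speedup, parallel_fraction)  # 3.0 0.75
```

So the effective speedup is 3 and 75% of the single-machine execution has been parallelized.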

(ii) Let a be the percentage of program code that can be executed simultaneously by n
computers in a cluster, each computer using a different set of parameters or initial conditions.
Assume that the remaining code must be executed sequentially by a single processor. Each
processor has an execution rate of x MIPS. Determine an expression for the effective MIPS rate
when using the system for exclusive execution of this program, in terms of a, n and x. 03
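For (ii), the effective rate works out to n·x / (a + (1 - a)·n): for W instructions, the parallel fraction takes a·W/(n·x) time and the serial fraction takes (1 - a)·W/x time. A hedged numeric check (the function name and sample values are illustrative, not from the paper):

```python
# Sketch of a numeric check for Q1(b)(ii).
# rate = W / total_time = n*x / (a + (1 - a)*n), independent of W.
def effective_mips(a: float, n: int, x: float) -> float:
    # time per instruction: parallel share over n processors, serial share on one
    total_time_per_instruction = a / (n * x) + (1 - a) / x
    return 1 / total_time_per_instruction

# Boundary checks: fully parallel code runs at n*x, fully serial at x.
print(effective_mips(1.0, 8, 100.0))  # close to 800.0
print(effective_mips(0.0, 8, 100.0))  # close to 100.0
```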

Q1(c) Consider a computer which can execute a program in two operational modes, regular mode
versus enhanced mode, with a probability distribution of {A, 1-A} respectively. If A varies
between a and b, and 0 <= a <= 1, derive an expression for the average speedup factor using
the harmonic mean concept. 06
CO2
Q2(a) Describe branch prediction techniques with the help of suitable examples and diagrams. 06

Q2(b)
(i) Consider the following:

N = no. of instructions to be executed
M = no. of segments in the pipeline
P = probability that a given instruction is an unconditional branch instruction
Q = probability that a given instruction is a conditional branch instruction
R = probability that a given conditional branch instruction will cause branching

Calculate the following:
i. Speedup
ii. Throughput
iii. Efficiency
iv. Average no. of instructions executed per instruction cycle. 04

(ii) Derive an expression for the optimal number of stages in a pipeline. 02
Q2(c) Three functional pipelines f1, f2 and f3 are characterized by the following reservation
tables. 06

[Reservation tables over clock cycles 1-4: f1 uses stages S1-S3, f2 uses stages T1-T3, f3 uses stages U1-U3.]

Using these three pipelines, a composite pipeline is formed as shown below.

[Diagram: a dual multiplexer selects the inputs and feeds f1; f1 feeds f2 and f3, whose results return to f1, which produces the output.]

Each task goes through this composite pipeline in the following order: f1 first, f2 and f3 next, f1
again, and then the output is obtained. The dual multiplexer selects a pair of inputs (A, B) or
(X, Y) and feeds them into the input of f1.
(i) Draw the state transition diagram for collision-free scheduling.
(ii) List all simple cycles and greedy cycles.
(iii) Find the MAL from the reservation table.
CO3
Q3(a) Design and verify a SIMD algorithm for NxN matrix multiplication with computational
complexity O(N). 06
Q3(b) What is the PRAM model? Design and verify an EREW PRAM algorithm for NxN matrix
multiplication with computational complexity O(N). 06
Q3(c) Parallelize the Quick Sort algorithm. Compute its speedup with respect to the sequential
algorithm. 06

CO4
Q4(a) What is an MPI program? Write an MPI program for computing the value of N!. 06

Q4(b) What do you mean by an OpenMP program? Write an OpenMP program to compute the
value of PI. 06

Q4(c) Explain the following terms: (i) GPU (ii) CUDA. Explain how a CUDA program is written
with the help of a suitable example. 06

CO5

Q5. Describe in detail any two of the following with the help of suitable examples and
diagrams.
(a) Mobile Agent
(b) Object Request Broker
(c) Matrix Logical Clock 2x6=12


Parallel Algorithm
RAM (Random Access Machine)
• Unbounded number of local memory cells
• Each memory cell can hold an integer of unbounded size
• Instruction set includes simple operations: data operations, comparisons, branches
• All operations take unit time
• Time complexity = number of instructions executed
• Space complexity = number of memory cells used
PRAM (Parallel Random Access Machine)
• Definition:
◦ An abstract machine for designing algorithms applicable to parallel computers
◦ M' is a system <M, X, Y, A> of infinitely many
• RAMs M1, M2, ..., each Mi is called a processor of M'. All the processors are assumed to be
identical. Each has the ability to recognize its own index i
• Input cells X(1), X(2), ...
• Output cells Y(1), Y(2), ...
• Shared memory cells A(1), A(2), ...
PRAM (Parallel RAM)
• Unbounded collection of RAM processors P0, P1, ...
• Each processor has unbounded registers
• Unbounded collection of shared memory cells
• All processors can access all memory cells in unit time
• All communication via shared memory
PRAM (Parallel RAM)
• Some subset of the processors can remain idle

[Diagram: processors P0, P1, P2, ..., PN all connected to the shared memory cells]

• Two or more processors may read simultaneously from the same cell
• A write conflict occurs when two or more processors try to write simultaneously into the same cell
Shared Memory Access Conflicts
• PRAMs are classified based on their read/write abilities (realistic and useful):
◦ Exclusive Read (ER): all processors can simultaneously read from distinct memory locations
◦ Exclusive Write (EW): all processors can simultaneously write to distinct memory locations
◦ Concurrent Read (CR): all processors can simultaneously read from any memory location
◦ Concurrent Write (CW): all processors can simultaneously write to any memory location
◦ Combinations: EREW, CREW, CRCW
Concurrent Write (CW)
• What value gets written finally?
• Priority CW: processors have priorities, based on which the value is decided; the
highest-priority processor is allowed to complete the WRITE
• Common CW: all processors are allowed to complete the WRITE iff all the values
to be written are equal
• Arbitrary/Random CW: one randomly chosen processor is allowed to complete the WRITE
• Combining CW: a function maps the multiple values into a single value; the
function may be max, min, sum, multiply, etc.
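The four resolution policies above can be illustrated with a small simulation (a sketch; the function name and the priority-by-lowest-id convention are illustrative assumptions, not from the slides):

```python
# Illustrative resolution of simultaneous writes (processor_id, value)
# arriving at one shared memory cell in the same cycle.
def resolve_cw(writes, policy):
    """writes: list of (processor_id, value) pairs from the same cycle."""
    if policy == "priority":      # here: lowest processor id has highest priority
        return min(writes)[1]
    if policy == "common":        # all written values must agree
        values = {v for _, v in writes}
        if len(values) != 1:
            raise ValueError("common CW requires equal values")
        return values.pop()
    if policy == "arbitrary":     # any one writer may win; pick the first here
        return writes[0][1]
    if policy == "combining":     # combine the values, e.g. by summation
        return sum(v for _, v in writes)
    raise ValueError(policy)

writes = [(0, 5), (1, 7), (2, 5)]
print(resolve_cw(writes, "priority"))   # 5 (processor 0 wins)
print(resolve_cw(writes, "combining"))  # 17 (5 + 7 + 5)
```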
Strengths of PRAM
• PRAM is an attractive and important model for designers of parallel algorithms. Why?
◦ It is natural: the number of operations executed per cycle on p processors is at most p
◦ It is strong: any processor can read/write any shared memory cell in unit time
◦ It is simple: it abstracts from any communication or synchronization overhead,
which makes the complexity and correctness of PRAM algorithms easier to analyze
◦ It can be used as a benchmark: if a problem has no feasible/efficient solution
on PRAM, it has no feasible/efficient solution on any parallel machine
An initial example
• How do you add N numbers residing in memory locations A[0, 1, ..., N]?

• Serial algorithm: O(N)

• PRAM algorithm using N processors P0, P1, P2, ..., PN?

PRAM Algorithm (Parallel Addition)

[Diagram: pairwise addition tree — four processors add pairs in step 1, two in step 2, one in step 3]

• Program in P(i):
• L = n
• Repeat
•   L = L / 2
•   If (i < L) then begin
•     read A[2i] from SM
•     read A[2i+1] from SM
•     compute sum = A[2i] + A[2i+1]
•     store sum in A[i]
•   end
• Until (L = 1)
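The program above can be simulated sequentially, one PRAM round per loop iteration (a sketch assuming n is a power of two; the function name is illustrative):

```python
# Sequential simulation of the slide's PRAM addition: in each round every
# active processor i (i < L) reads A[2i] and A[2i+1] and stores their sum
# in A[i], halving the problem size until one value remains.
def pram_sum(values):
    A = list(values)
    L = len(A)
    while L > 1:
        L //= 2
        # all processors read before any stores, as in a synchronous PRAM round
        sums = [A[2 * i] + A[2 * i + 1] for i in range(L)]
        for i in range(L):
            A[i] = sums[i]
    return A[0]

print(pram_sum([1, 2, 3, 4, 5, 6, 7, 8]))  # 36
```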
PRAM Algorithm (Parallel Addition)
• log(n) steps = time needed
• n/2 processors needed
• Speedup = n / log(n)
• Efficiency = 1 / log(n)
• Applicable to other operations: +, *, <, >, etc.
Another algorithm to find sum
• Program in P(i):
• L = n
• Repeat
•   L = L / 2
•   If (i < L) then begin
•     read A[i] from SM
•     read A[i+L] from SM
•     compute sum = A[i] + A[i+L]
•     store sum in A[i]
•   end
• Until (L = 1)
Program to find maximum of N nos
• Program in P(i):
• L = n
• Repeat
•   L = L / 2
•   If (i < L) then begin
•     read A[i] from SM
•     read A[i+L] from SM
•     compute Max(A[i], A[i+L])
•     store the result in A[i]
•   end
• Until (L = 1)
Program to find minimum of N nos
• Program in P(i):
• L = n
• Repeat
•   L = L / 2
•   If (i < L) then begin
•     read A[i] from SM
•     read A[i+L] from SM
•     compute Min(A[i], A[i+L])
•     store the result in A[i]
•   end
• Until (L = 1)
Program to find Product of N nos
• Program in P(i):
• L = n
• Repeat
•   L = L / 2
•   If (i < L) then begin
•     read A[i] from SM
•     read A[i+L] from SM
•     compute Product(A[i], A[i+L])
•     store the result in A[i]
•   end
• Until (L = 1)
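The four programs above differ only in the combining operation, so a single sketch parameterized by op covers sum, max, min and product (assumes the input length is a power of two; names are illustrative):

```python
# Sequential simulation of the A[i], A[i+L] halving scheme, with the
# combining operation passed in as op.
def pram_reduce(values, op):
    A = list(values)
    L = len(A)
    while L > 1:
        L //= 2
        # processors i < L combine A[i] with A[i+L] in the same round
        A[:L] = [op(A[i], A[i + L]) for i in range(L)]
    return A[0]

data = [5, 1, 8, 3, 2, 7, 4, 6]
print(pram_reduce(data, lambda a, b: a + b))  # 36
print(pram_reduce(data, max))                 # 8
print(pram_reduce(data, min))                 # 1
print(pram_reduce(data, lambda a, b: a * b))  # 40320
```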
Program to find sum of N nos using M processors
• Program in P(I):
• sum = 0
• For (J = I; J < N; J = J + M)
•   sum = sum + A[J]
• A[I] = sum
• L = M
• Repeat
•   L = L / 2
•   If (I < L) then begin
•     read A[I] from SM
•     read A[I+L] from SM
•     compute sum = A[I] + A[I+L]
•     store sum in A[I]
•   end
• Until (L = 1)
Program to find sum of N nos using M processors - another way of writing
• PAR For (I = 0; I < M; I++)
• {
•   sum = 0
•   For (J = I; J < N; J = J + M)
•     sum = sum + A[J]
•   A[I] = sum
• }
• L = M
• Repeat
•   L = L / 2
•   If (I < L) then begin
•     read A[I] from SM
•     read A[I+L] from SM
•     compute sum = A[I] + A[I+L]
•     store sum in A[I]
•   end
• Until (L = 1)
Program to find sum of N nos using M processors - another way of writing
• PAR For (I = 0; I < M; I++)
• {
•   sum = 0
•   K = N/M
•   For (J = K*I; J < (I+1)*K; J = J + 1)
•     sum = sum + A[J]
•   A[I] = sum
• }
• L = M
• Repeat
•   L = L / 2
•   If (I < L) then begin
•     read A[I] from SM
•     read A[I+L] from SM
•     compute sum = A[I] + A[I+L]
•     store sum in A[I]
•   end
• Until (L = 1)
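Both M-processor variants — the strided partition (J = I, I+M, ...) and the block partition (J = K·I up to (I+1)·K) — can be simulated with one sketch (assumes M divides N for the block case, and M is a power of two; names are illustrative):

```python
# Sequential simulation of the M-processor sum: each processor first
# accumulates a private partial sum over its share of A, then the M
# partial sums are combined by the same halving tree reduction as before.
def pram_sum_m(A, M, block=False):
    N = len(A)
    if block:   # processor I handles the contiguous block [K*I, K*(I+1))
        K = N // M                      # assumes M divides N
        partial = [sum(A[K * I:K * (I + 1)]) for I in range(M)]
    else:       # processor I handles indices I, I+M, I+2M, ...
        partial = [sum(A[I::M]) for I in range(M)]
    L = M                               # tree-reduce the M partial sums
    while L > 1:
        L //= 2
        partial[:L] = [partial[i] + partial[i + L] for i in range(L)]
    return partial[0]

data = list(range(1, 17))               # 1 + 2 + ... + 16 = 136
print(pram_sum_m(data, 4))              # 136
print(pram_sum_m(data, 4, block=True))  # 136
```

Both partitions give the same answer; they differ only in which elements each processor touches, which matters for memory-access conflicts on a real machine.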
Matrix multiplication using n Processors on CRCW PRAM
• For (i = 0; i < n; i++)
• {
•   For (j = 0; j < n; j++)
•   {
•     C[i][j] = 0
•     PAR For (k = 0; k < n; k++)
•     {
•       read A[i][k]
•       read B[k][j]
•       compute A[i][k] * B[k][j]
•       store in C[i][j] (the n concurrent writes are resolved by a combining (sum) CW)
•     }
•   }
• }
Matrix multiplication using n Processors on CREW PRAM
• For (i = 0; i < n; i++)
• {
•   For (j = 0; j < n; j++)
•   {
•     C[i][j] = 0
•     PAR For (k = 0; k < n; k++)
•     {
•       read A[i][k]
•       read B[k][j]
•       compute the partial product A[i][k] * B[k][j]
•     }
•     sum the n partial products into C[i][j] (on CREW the accumulation needs a
tree reduction, since concurrent writes are not allowed)
•   }
• }
Matrix multiplication using nxn Processors on CRCW PRAM
• PAR For (i = 0; i < n; i++)
• {
•   PAR For (j = 0; j < n; j++)
•   {
•     C[i][j] = 0
•     For (k = 0; k < n; k++)
•     {
•       read A[i][k]
•       read B[k][j]
•       compute C[i][j] = C[i][j] + A[i][k] * B[k][j]
•       store in C[i][j]
•     }
•   }
• }
Matrix multiplication using nxn Processors on CRCW PRAM
• For (i = 0; i < n; i++)
• {
•   PAR For (j = 0; j < n; j++)
•   {
•     PAR For (k = 0; k < n; k++)
•     {
•       read A[i][k]
•       read B[k][j]
•       compute A[i][k] * B[k][j]
•       store in C[i][j] (the n concurrent writes per (i, j) are resolved by a combining (sum) CW)
•     }
•   }
• }
Matrix multiplication using nxn Processors on CREW PRAM
• PAR For (i = 0; i < n; i++)
• {
•   PAR For (j = 0; j < n; j++)
•   {
•     C[i][j] = 0
•     For (k = 0; k < n; k++)
•     {
•       read A[i][k]
•       read B[k][j]
•       compute C[i][j] = C[i][j] + A[i][k] * B[k][j]
•       store in C[i][j]
•     }
•   }
• }
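The n×n-processor scheme can be simulated sequentially: processor (i, j) owns C[i][j] and runs the k-loop locally, so all writes are exclusive (a sketch; the function name and sample matrices are illustrative):

```python
# Sequential simulation of the n x n-processor matrix multiplication:
# one "simultaneous" round per k, in which every (i, j) processor adds
# its partial product to the C[i][j] cell it alone owns.
def pram_matmul_n2(A, B):
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for k in range(n):          # the sequential k-loop of each processor
        for i in range(n):      # all (i, j) processors act in this round
            for j in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(pram_matmul_n2(A, B))  # [[19, 22], [43, 50]]
```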
Matrix multiplication using nxnxn Processors on CRCW PRAM
• PAR For (i = 0; i < n; i++)
• {
•   PAR For (j = 0; j < n; j++)
•   {
•     PAR For (k = 0; k < n; k++)
•     {
•       read A[i][k]
•       read B[k][j]
•       compute A[i][k] * B[k][j]
•       store in C[i][j] (the n simultaneous writes to each C[i][j] are resolved by a combining (sum) CW)
•     }
•   }
• }
Matrix multiplication using nxnxn Processors on CREW PRAM
• PAR For (i = 0; i < n; i++)
• {
•   PAR For (j = 0; j < n; j++)
•   {
•     PAR For (k = 0; k < n; k++)
•     {
•       read A[i][k]
•       read B[k][j]
•       compute the partial product A[i][k] * B[k][j]
•     }
•     sum the n partial products into C[i][j] by a log n tree reduction (CREW
forbids concurrent writes)
•   }
• }
Matrix multiplication using nxn Processors on EREW PRAM
• PAR For (i = 0; i < n; i++)
• {
•   PAR For (j = 0; j < n; j++)
•   {
•     C[i][j] = 0
•     For (k = 0; k < n; k++)
•     {
•       Ik = (i + j + k) mod n
•       read A[i][Ik]
•       read B[Ik][j]
•       compute C[i][j] = C[i][j] + A[i][Ik] * B[Ik][j]
•       store in C[i][j]
•     }
•   }
• }
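The skew Ik = (i + j + k) mod n can be checked by simulation: the assertion below verifies that within each step no memory cell is read twice, as EREW requires, and that the product is still correct (a sketch; names are illustrative):

```python
# Sequential simulation of the skewed EREW schedule: in step k, processor
# (i, j) reads A[i][Ik] and B[Ik][j] with Ik = (i + j + k) mod n, which
# staggers the accesses so every read in a step touches a distinct cell.
def erew_matmul(A, B):
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for k in range(n):                  # one EREW round per k
        reads = set()                   # cells touched in this round only
        for i in range(n):
            for j in range(n):
                Ik = (i + j + k) % n
                for cell in (("A", i, Ik), ("B", Ik, j)):
                    assert cell not in reads, "read conflict"
                    reads.add(cell)
                C[i][j] += A[i][Ik] * B[Ik][j]
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(erew_matmul(A, B))  # [[19, 22], [43, 50]]
```

The exclusivity holds because, for a fixed step k, distinct processors on the same row of A (or column of B) get distinct Ik values.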
Matrix Multiplication on Mesh with NxN processors
Parallel Algorithm Complexity
• Processor P(i,j) receives its input after i-1+j-1 steps from the beginning of the computation
• After getting its input, P(i,j) takes n steps to compute C(i,j)
• So C(i,j) is computed in i-1+j-1+n steps
• So the complexity of the algorithm is O(i-1+j-1+n) = O(n-1+n-1+n) = O(n)
Matrix Multiplication on SIMD machines
[Slide figures: memory allocation for matrix multiplication; successive contents of the C array in memory; initial distribution of the rows of A; 4-way broadcast of the rows of A; initial distribution of the rows of B-transpose; 4-way broadcast of the rows of B-transpose; CREW, EREW and CRCW matrix multiplication]
