
Chapter 6

Abstract Models

Advanced Computer Architecture and Parallel Processing Hesham El-Rewini & Mostapha Abd-El-Barr
6.1 The PRAM Model and Its
Variations
• The PRAM model was introduced by Fortune and Wyllie
in 1978 for modeling idealized parallel computers in
which communication cost and synchronization
overhead are negligible.
• During a computational step, an active processor may
read a data value from a memory location, perform a
single operation and finally write back the result into a
memory location.
• This model is referred to as the shared memory, single
instruction, multiple data (SM SIMD) machine.

6.1 The PRAM Model and Its
Variations
[Figure: PRAM model for Parallel Computations. Under a common control unit, p processors P1, P2, ..., Pp, each with its own private memory, share access to a global memory.]
6.1 The PRAM Model and Its
Variations
• There are different modes for read and write operations
in a PRAM:
– Exclusive Read (ER): only 1 processor can read from any
memory location at a time.
– Exclusive Write (EW): only 1 processor can write to any memory
location at a time.
– Concurrent Read (CR): multiple processors can read from the
same memory location simultaneously.
– Concurrent Write (CW): multiple processors can write to the
same memory location simultaneously.

6.1 The PRAM Model and Its
Variations
• Write conflicts must be resolved using a well-defined
policy such as:
– Common: concurrent writes are allowed only if all the processors attempt to store the same value.
– Arbitrary: only one value selected arbitrarily is stored.
– Minimum: the value written by the processor with the smallest
index is stored.
– Reduction: all the values are reduced to only one value using
some reduction function such as sum, minimum, maximum, etc.

6.1 The PRAM Model and Its
Variations
• The PRAM can be divided into the following subclasses:
– EREW PRAM: access to any memory cell is exclusive. It is the
most restrictive PRAM model.
– ERCW PRAM: this allows concurrent writes to the same memory
location by multiple processors, but read accesses remain
exclusive.
– CREW PRAM: concurrent read accesses allowed, but write
accesses are exclusive.
– CRCW PRAM: both concurrent read and write accesses are
allowed.

6.2 Simulating Multiple Accesses
On An EREW PRAM
• The following broadcasting mechanism is followed:
– P1 reads x and makes it known to P2
– P1 and P2 make x known to P3 and P4, respectively, in parallel.
– P1, P2, P3 and P4 make x known to P5, P6, P7 and P8,
respectively in parallel
– These 8 processors will make x known to another 8 processors
and so on.
• Since the number of processors that have read x doubles in each iteration, the procedure terminates in O (log p) time (a sequential sketch of this doubling broadcast is given below).
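The following is a minimal sequential sketch of this doubling broadcast in Python (the function name broadcast_erew and the explicit round counting are assumptions made for illustration; on an actual EREW PRAM every round runs in parallel and each memory cell is touched by at most one processor per step):

import math

def broadcast_erew(x, p):
    """Simulate Broadcast_EREW: spread x to p processors in ceil(log2 p) rounds."""
    local = [None] * p          # local[i] is the private copy of x held by processor P(i+1)
    local[0] = x                # P1 reads x from shared memory (an exclusive read)
    known = 1                   # number of processors that currently hold x
    rounds = 0
    while known < p:
        # Every processor that already holds x passes it to one processor that does not.
        # All reads and writes in a round touch distinct cells, so each step is EREW-legal.
        for i in range(min(known, p - known)):
            local[known + i] = local[i]
        known = min(2 * known, p)
        rounds += 1
    assert rounds == math.ceil(math.log2(p))    # O(log p) rounds in total
    return local

print(broadcast_erew(42, 8))    # all 8 processors end up holding the value 42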

6.2 Simulating Multiple Accesses
On An EREW PRAM

[Figure: Simulating Concurrent Read on EREW PRAM with 8 Processors Using Algorithm Broadcast_EREW. Panels (a)-(d) show the value x spreading from P1 to P2, then from P1 and P2 to P3 and P4, and finally from P1-P4 to P5, P6, P7, and P8.]
6.3 Analysis of Parallel Algorithms
The complexity of a sequential algorithm is generally determined by its time and space complexity.

The time complexity of an algorithm refers to its execution time as a function of the problem's size. Similarly, the space complexity refers to the amount of memory required by the algorithm as a function of the size of the problem.

Time complexity is generally regarded as the most important measure of the performance of an algorithm. An algorithm whose time complexity is bounded by a polynomial is called a polynomial-time algorithm. An algorithm is considered to be efficient if it runs in polynomial time; inefficient algorithms are those that require a search of the whole enumerated space and have an exponential time complexity.

For parallel algorithms, the time complexity remains an important measure of performance. Additionally, the number of processors plays a major role in determining the complexity of a parallel algorithm.
6.3 Analysis of Parallel Algorithms
• The performance of a parallel algorithm is measured
quantitatively as follows:
– Run time, which is defined as the time spent during the
execution of the algorithm
– Number of processors the algorithm uses to solve a problem
– The cost of the parallel algorithm, which is the product of the run
time and the number of processors.

Cost = Run Time × Number of Processors

6.3 Analysis of Parallel Algorithms
The run time of a parallel algorithm is the length of the time period between the moment the first processor begins execution and the moment the last processor terminates.
The cost of a parallel algorithm is basically the total number of steps executed
collectively by all processors.
A parallel algorithm is said to be cost optimal if its cost matches the lower bound on
the number of sequential operations to solve a given problem within a constant
factor. It follows that a parallel algorithm is not cost optimal if there exists a
sequential algorithm whose run time is smaller than the cost of the parallel algorithm.

6.3 Analysis of Parallel Algorithms

1. Sequential algorithms are evaluated by categorizing the problems they solve into different classes, as either tractable or intractable problems.
2. A problem belongs to class P if a solution of the problem can be obtained by a
Polynomial–Time Algorithm (PTA).
3. A problem belongs to class NP if the correctness of a solution for the problem
can be verified by a polynomial–time algorithm.
4. The NP-complete problems are the problems that are strongly suspected to be
computationally intractable. There is a host of important problems that are
roughly equivalent in complexity and form the class of NP-complete problems.
5. The class of NP-complete problems includes many classical problems in combinatorics, graph
theory, and computer science, such as the traveling salesman problem, the
Hamiltonian circuit problem, and integer programming. The best known
algorithms for these problems could take exponential time on some inputs.

6.3 Analysis of Parallel Algorithms
• The NC-class and P-completeness
– A problem belongs to class P if a solution of the
problem can be obtained by a polynomial-time
algorithm.
– A problem belongs to class NP if the correctness of a
solution for the problem can be verified by a
polynomial-time algorithm.
– In parallel computation, NC is the class of well-parallelizable problems, that is, the problems that have efficient parallel solutions.
– A problem is in the class P-complete if it is as hard to
parallelize as any problem in P.
6.3 Analysis of Parallel Algorithms
• The NC-class and P-completeness
[Figure: Venn diagram in which NC (problems with efficient parallel solutions) and P-Complete (problems as hard to parallelize as any problem in P) are subclasses of P (solutions obtainable by a polynomial-time algorithm); P lies inside NP (solutions verifiable by a polynomial-time algorithm); and the NP-Complete problems, which are strongly suspected to be computationally intractable, lie in both NP and NP-Hard.]

The Relationships among P, NP, NP-Complete, NP-Hard, NC, and P-Complete

6.4 Computing Sum And All Sums
• Sum of an array of numbers on the EREW model
– Algorithm Sum_EREW
for i = 1 to log n do
    forall Pj, where 1 <= j <= n/2, do in parallel
        if (2j modulo 2^i) = 0 then
            A[2j] <- A[2j] + A[2j - 2^(i-1)]
        endif
    endfor
endfor
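The algorithm can be checked with a short sequential Python simulation (the one-based array A[1..n], emulated here with a dummy A[0], and the name sum_erew are assumptions of this sketch; the parallel forall is replaced by an ordinary loop):

import math

def sum_erew(values):
    """Simulate Algorithm Sum_EREW; the total ends up in A[n]."""
    n = len(values)                          # n is assumed to be a power of 2
    A = [0] + list(values)                   # A[1..n], one-based as in the slides
    for i in range(1, int(math.log2(n)) + 1):
        for j in range(1, n // 2 + 1):       # forall Pj, 1 <= j <= n/2, in parallel
            if (2 * j) % (2 ** i) == 0:
                A[2 * j] += A[2 * j - 2 ** (i - 1)]
    return A[n]

print(sum_erew([5, 2, 10, 1, 8, 12, 7, 3]))  # prints 48, as in the n = 8 example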

6.4 Computing Sum And All Sums
• Sum of an array of numbers on the EREW
model
– Complexity analysis
• Run time, T(n) = O(log n)
• Number of processors, P(n) = n/2
• Cost, C(n) = O (n log n)
– Since a good sequential algorithm can sum the list of
n elements in O (n), this algorithm is not cost optimal.

6.4 Computing Sum And All Sums
• Sum of an array of numbers on the EREW
model
            A[1]  A[2]  A[3]  A[4]  A[5]  A[6]  A[7]  A[8]    Active processors
Initially     5     2    10     1     8    12     7     3
After i = 1   5     7    10    11     8    20     7    10     P1, P2, P3, P4
After i = 2   5     7    10    18     8    20     7    30     P2, P4
After i = 3   5     7    10    18     8    20     7    48     P4

Example of Algorithm Sum_EREW when n = 8


6.4 Computing Sum And All Sums
• All partial sums of an array
– Algorithm AllSums_EREW
for i = 1 to log n do
    forall Pj, where 2^(i-1) + 1 <= j <= n, do in parallel
        A[j] <- A[j] + A[j - 2^(i-1)]
    endfor
endfor
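A sequential Python sketch of the same idea is shown below (the snapshot of the old array models the synchronous read-then-write behaviour of the PRAM step; the function name all_sums_erew is an assumption of this illustration):

import math

def all_sums_erew(values):
    """Simulate Algorithm AllSums_EREW; A[j] ends up holding A[1] + ... + A[j]."""
    n = len(values)                          # n is assumed to be a power of 2
    A = [0] + list(values)                   # one-based indexing as in the slides
    for i in range(1, int(math.log2(n)) + 1):
        shift = 2 ** (i - 1)
        old = A[:]                           # all processors read before anyone writes
        for j in range(shift + 1, n + 1):    # forall Pj, 2^(i-1)+1 <= j <= n, in parallel
            A[j] = old[j] + old[j - shift]
    return A[1:]

print(all_sums_erew([5, 2, 10, 1, 8, 12, 7, 3]))   # [5, 7, 17, 18, 26, 38, 45, 48]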

6.4 Computing Sum And All Sums
• All partial sums of an array
– Complexity analysis
• Run time, T(n) = O (log n).
• Number of processors, P(n) = n - 1.
• Cost, C (n) = O (n log n).

6.4 Computing Sum And All Sums
• All partial sums of an array
[Figure: after round 1 (P2, ..., P8 active) each A[j] holds the sum of the two elements ending at position j; after round 2 (P3, ..., P8 active) each A[j] holds the sum of up to four elements ending at position j; after round 3 (P5, ..., P8 active) each A[j] holds the full prefix sum A[1] + ... + A[j].]

Computing Partial Sums of an Array of 8 Elements

6.5 Matrix Multiplication
• Using n^3 processors
– The algorithm consists of 2 steps:
• Each processor Pi,j,k computes the product A[i, k] * B[k, j] and stores it in C[i, j, k].
• The idea of algorithm Sum_EREW is applied along the k dimension, n^2 times in parallel, to compute C[i, j, n], where 1 <= i, j <= n (a sequential sketch of both steps is given below).
– Complexity analysis
• Run time, T (n) = O (log n).
• Number of processors, P (n) = n^3.
• Cost, C (n) = O (n^3 log n).
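A sequential Python sketch of the two steps is given below (the function name matmult_pram, the zero-based indexing, and the explicit strides are assumptions of this illustration; on the PRAM the loops over i, j, and k would run in parallel):

import math

def matmult_pram(A, B):
    """Simulate MatMult with n^3 virtual processors: one product per processor,
    then a Sum_EREW-style reduction along the k dimension."""
    n = len(A)                                   # n is assumed to be a power of 2
    # Step 1: processor P(i,j,k) computes C[i][j][k] = A[i][k] * B[k][j]
    C = [[[A[i][k] * B[k][j] for k in range(n)]
          for j in range(n)] for i in range(n)]
    # Step 2: the Sum_EREW idea applied along k, for all n^2 pairs (i, j) in parallel
    for step in range(1, int(math.log2(n)) + 1):
        stride = 2 ** (step - 1)
        for i in range(n):
            for j in range(n):
                for k in range(2 * stride - 1, n, 2 * stride):
                    C[i][j][k] += C[i][j][k - stride]
    # The complete dot product for entry (i, j) ends up in C[i][j][n-1]
    return [[C[i][j][n - 1] for j in range(n)] for i in range(n)]

print(matmult_pram([[1, 2], [3, 4]], [[5, 6], [7, 8]]))   # [[19, 22], [43, 50]]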

6.5 Matrix Multiplication

• Using n^3 processors
– Complexity analysis
• This algorithm is not cost optimal because an n x n matrix multiplication can be done sequentially in less than O (n^3) time.

6.5 Matrix Multiplication
• Using n^3 processors

Step 1 (products computed in parallel):
    k = 1:  P1,1,1: C[1,1,1] ← A[1,1] * B[1,1]     P1,2,1: C[1,2,1] ← A[1,1] * B[1,2]
            P2,1,1: C[2,1,1] ← A[2,1] * B[1,1]     P2,2,1: C[2,2,1] ← A[2,1] * B[1,2]
    k = 2:  P1,1,2: C[1,1,2] ← A[1,2] * B[2,1]     P1,2,2: C[1,2,2] ← A[1,2] * B[2,2]
            P2,1,2: C[2,1,2] ← A[2,2] * B[2,1]     P2,2,2: C[2,2,2] ← A[2,2] * B[2,2]

Step 2 (partial products summed along k):
    k = 2:  P1,1,2: C[1,1,2] ← C[1,1,2] + C[1,1,1]     P1,2,2: C[1,2,2] ← C[1,2,2] + C[1,2,1]
            P2,1,2: C[2,1,2] ← C[2,1,2] + C[2,1,1]     P2,2,2: C[2,2,2] ← C[2,2,2] + C[2,2,1]

Multiplying two 2 x 2 matrices using Algorithm MatMult_CREW
6.5 Matrix Multiplication
• Reducing the number of processors
– Modify MatMult_CREW as follows (a sketch of this grouping is given below):
• Each processor Pi,j,k, where 1 <= k <= n/log n, computes the sum of log n products. This step produces n^3/log n partial sums.
• The partial sums produced in step 1 are then added to produce the resulting matrix.
– Complexity analysis:
• Run time, T (n) = O (log n).
• Number of processors, P (n) = n^3/log n.
• Cost, C (n) = O (n^3).
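A rough Python sketch of this grouping idea follows (the names matmult_reduced, g, and groups are assumptions; the two sums written here sequentially correspond to the log n products handled by one processor and the subsequent O(log n)-time parallel reduction):

import math

def matmult_reduced(A, B):
    """Sketch of the processor-reduction variant: each of the roughly n/log n
    virtual processors per output element first sums log n products, then the
    partial sums are combined."""
    n = len(A)
    g = max(1, int(math.log2(n)))            # log n products handled by one processor
    groups = math.ceil(n / g)                # about n/log n processors per entry (i, j)
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Step 1: each virtual processor forms one partial sum of g products
            partial = [sum(A[i][k] * B[k][j]
                           for k in range(p * g, min((p + 1) * g, n)))
                       for p in range(groups)]
            # Step 2: combine the partial sums (O(log n) time on the PRAM)
            C[i][j] = sum(partial)
    return C

print(matmult_reduced([[1, 2], [3, 4]], [[5, 6], [7, 8]]))   # [[19, 22], [43, 50]]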

6.6 Sorting
• The algorithm consists of 2 steps:
– Each row i of processors computes C [i], the number of elements smaller than A [i]: each processor Pi,j compares A [i] and A [j] and updates C [i] appropriately.
– The first processor in each row, Pi,1, places A [i] in its proper position in the sorted list, namely position C [i] + 1 (a sequential sketch of this enumeration sort is given below).
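The sketch below simulates this enumeration sort sequentially in Python (the tie-breaking rule that compares indices for equal elements is an assumption added so that equal keys receive distinct ranks; the example in the figure uses distinct keys, so the rule does not affect it):

def enumeration_sort(A):
    """Simulate the CRCW enumeration sort that uses n^2 virtual processors."""
    n = len(A)
    # Step 1: row i counts the elements smaller than A[i]; processor P(i,j)
    # contributes 1 if A[j] should precede A[i] in the sorted order.
    C = [sum(1 for j in range(n)
             if A[j] < A[i] or (A[j] == A[i] and j < i))
         for i in range(n)]
    # Step 2: processor P(i,1) writes A[i] into position C[i] + 1 of the result
    result = [None] * n
    for i in range(n):
        result[C[i]] = A[i]                  # zero-based position C[i] here
    return result

print(enumeration_sort([6, 1, 3]))           # [1, 3, 6], matching the example below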

6.6 Sorting
• Complexity analysis
– Run time, T (n) = O (1).
– Number of processors, P (n) = n^2.
– Cost, C (n) = O (n^2).

6.6 Sorting
Initially A = [6, 1, 3]

P1,1 compares 6 & 6    P1,2 compares 6 & 1    P1,3 compares 6 & 3
P2,1 compares 1 & 6    P2,2 compares 1 & 1    P2,3 compares 1 & 3
P3,1 compares 3 & 6    P3,2 compares 3 & 1    P3,3 compares 3 & 3

After Step 1: C = [2, 0, 1]
After Step 2: A = [1, 3, 6]

Enumeration Sort of [6,1,3] on a CRCW PRAM


6.7 Message Passing Model
• Synchronous message passing model
– The behavior of this system can be described as
follows:
• System is initialized and set to an arbitrary initial state.

• For each process i ∈ V, repeat the following 2 steps in synchronized rounds:
– Send messages to the outgoing neighbors by applying some
message generation function to the current state.
– Obtain the new state by applying a state transition function to
the current state and the message received from incoming
neighbors.

6.7 Message Passing Model

• Synchronous message passing model


– This system can be modeled as a state machine with
the following components:
• M, a fixed message alphabet.
• A process i can be modeled as:
– Qi : a set of states
– q0, i : the initial state in the state set Qi
– GenMsgi : a message generation function. It is applied to the
current system state to generate messages to the outgoing
neighbors from elements in M.
– Transi: a state transition function that maps the current state and the incoming messages into a new state (a round-based sketch of this model is given below).
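A minimal round-based simulator for this state-machine view can be sketched in Python as follows (the function run_synchronous and its parameters graph, init_states, gen_msg, trans, and rounds are assumptions chosen to mirror Qi, q0,i, GenMsgi, and Transi; they are not part of the model's formal definition):

def run_synchronous(graph, init_states, gen_msg, trans, rounds):
    """Sketch of the synchronous message-passing model.

    graph:       dict mapping each process id to the list of its outgoing neighbors
    init_states: dict mapping each process id to its initial state q0,i
    gen_msg:     GenMsg(state, neighbor) -> message (or None) for that neighbor
    trans:       Trans(state, received messages) -> new state
    """
    states = dict(init_states)
    for _ in range(rounds):
        # First, every process generates its outgoing messages from its current state...
        inbox = {i: [] for i in graph}
        for i, neighbors in graph.items():
            for j in neighbors:
                m = gen_msg(states[i], j)
                if m is not None:
                    inbox[j].append(m)
        # ...then every process applies its transition function to the messages it received.
        states = {i: trans(states[i], inbox[i]) for i in graph}
    return states

# Tiny usage example (an assumed 3-process ring flooding the maximum identifier):
ring = {1: [2], 2: [3], 3: [1]}
print(run_synchronous(ring, {1: 1, 2: 2, 3: 3},
                      gen_msg=lambda state, neighbor: state,
                      trans=lambda state, msgs: max([state] + msgs),
                      rounds=3))            # {1: 3, 2: 3, 3: 3}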

6.7 Message Passing Model
• Synchronous message passing model

[Figure: starting from the initial state q0,i, process i moves through states q1,i, q2,i, q3,i, ..., qk,i after rounds 1, 2, 3, ..., k, with messages Msg1, Msg2, Msg3, ..., Msgk driving the transitions.]

6.7 Message Passing Model
• Synchronous message passing model
– The complexity analysis for algorithms following this
model is measured quantitatively using:
• Message complexity:
– Defined as the number of messages sent between neighbors
during the execution of the algorithm.

• Time complexity:
– Defined as the time spent during the execution of the algorithm.

6.8 Leader Election Problem
• A leader among n processors is the processor
recognized by the other processors as distinguished to
perform a special task.

• The leader election problem arises when the processors of a distributed system must choose one of them as a leader.
• A leader is needed, for example, to coordinate the reestablishment of allocation and routing functions.
• The leader election problem is meaningless in the context of anonymous systems, in which processors have no unique identifiers.

6.9 Leader Election In
Synchronous Rings
• Simple leader election algorithm
– Each process sends its identifier to its outgoing
neighbor.
– When a process receives an identifier from its
incoming neighbor, then:
• The process sends null to its outgoing neighbor, if the
received identifier is less than its own identifier.
• The process sends the received identifier to its outgoing
neighbor, if the received identifier is greater than its own
identifier.
• The process declares itself as the leader, if the received identifier is equal to its own identifier (a sequential simulation of this algorithm is sketched below).
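A sequential Python simulation of this simple algorithm is sketched below (the clockwise orientation of the ring, the representation of null messages by None, and the message counter are assumptions of this illustration):

def simple_leader_election(ids):
    """Simulate the simple leader election algorithm on a synchronous ring;
    ids[i] is the identifier of process i, which sends to process (i + 1) % n."""
    n = len(ids)
    leader = None
    outgoing = list(ids)              # round 1: every process sends its own identifier
    messages = 0
    for _ in range(n):                # the winning identifier needs n hops to return
        messages += sum(1 for m in outgoing if m is not None)
        incoming = [outgoing[(i - 1) % n] for i in range(n)]   # deliver one hop
        outgoing = [None] * n
        for i in range(n):
            m = incoming[i]
            if m is None:
                continue
            if m > ids[i]:
                outgoing[i] = m       # forward the larger identifier
            elif m == ids[i]:
                leader = i            # own identifier came back: declare leader
            # if m < ids[i] the process sends null, i.e. the message is swallowed
    return leader, messages

winner, count = simple_leader_election([3, 1, 4, 2])
print(winner, count)   # process 2 (identifier 4) wins; 8 non-null messages, O(n^2) worst case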

6.9 Leader Election In
Synchronous Rings
• Simple leader election algorithm
– Complexity analysis
• Time complexity: O (n)
• Message complexity: O (n2)

6.9 Leader Election In
Synchronous Rings
• Simple leader election algorithm
– Leader Election in a Synchronous Ring using Algorithm S_Elect_Leader_Simple
[Figure: a ring of four processes with identifiers u = 1, 2, 3, 4, each holding a buffer buff and a status field. Panels (a)-(e) show the initial states and the states after the first through fourth rounds; after the fourth round the process with u = 4 sets its status to leader while the others remain unknown.]
6.9 Leader Election In
Synchronous Rings
• Improved leader election algorithm
– k <- 0
– Each process sends its identifier in messages to its neighbors in both directions, intending that they will travel 2^k hops and then return to their origin.
– If the identifier is proceeding in the outbound direction, when a process on the path receives the identifier from its neighbor, then:
• The process sends null to its outgoing neighbor, if the received identifier is less than its own identifier.
• The process sends the received identifier to its outgoing neighbor, if the received identifier is greater than its own identifier.

6.9 Leader Election In
Synchronous Rings
• Improved leader election algorithm
• The process declares itself as the leader, if the received identifier is equal to its own identifier.
– If the identifier is proceeding in the inbound direction, when a process on the path receives the identifier, it sends the received identifier to its outgoing neighbor on the path, if the received identifier is greater than its own identifier.
– If the 2 original messages make it back to their origin, then k <- k + 1 and the process goes back to step 2 (a phase-level sketch of the algorithm is given below).
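The improved algorithm exchanges (identifier, direction, hop count) messages; the Python sketch below collapses each phase into a neighborhood comparison, which is an assumed simplification: a candidate survives phase k exactly when its identifier beats every identifier within 2^k hops in both directions, and the election ends once the probes would wrap around the whole ring.

def improved_leader_election(ids):
    """Phase-level sketch of the improved (hop-doubling) leader election."""
    n = len(ids)
    candidates = set(range(n))
    k = 0
    while True:
        reach = 2 ** k
        if reach >= n:
            # a surviving candidate's messages would travel all the way around the ring
            return max(candidates, key=lambda i: ids[i])
        survivors = set()
        for i in candidates:
            neighborhood = [ids[(i + d) % n] for d in range(1, reach + 1)]
            neighborhood += [ids[(i - d) % n] for d in range(1, reach + 1)]
            if all(ids[i] > x for x in neighborhood):
                survivors.add(i)      # its identifier returned from both 2^k-hop trips
        candidates = survivors
        k += 1

ring = [5, 1, 8, 3, 7, 2, 6, 4]
print(improved_leader_election(ring))   # 2, the index of the process with identifier 8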

6.9 Leader Election In
Synchronous Rings
• Improved leader election algorithm
[Figure: in phases k = 0, 1, and 2, the messages initiated at process i travel 1, 2, and 4 hops, respectively, in both directions around the ring before returning to process i.]
Messages initiated at Process i Using The Improved Leader Election Algorithm

6.9 Leader Election In
Synchronous Rings
• Improved leader election algorithm
– Complexity analysis
• Time complexity: O (n)
• Message complexity: O (n log n)

6.9 Leader Election In Synchronous Rings
• Improved leader election algorithm
[Figure: improved leader election in a synchronous ring of four processes with identifiers u = 1, 2, 3, 4. Each process holds its identifier u, its phase counter k, two message buffers buff+ and buff-, and its status; messages are triples of the form (identifier, direction, hop count), such as (4, out, 1). Panels (a)-(d) show the initial states and the states after the first, second, and third rounds.]
