
Chapter 6

Abstract Models

Advanced Computer Architecture and Parallel Processing Hesham El-Rewini & Mostafa Abd-El-Barr
6.1 The PRAM Model and Its Variations
• The PRAM model was introduced by Fortune and Wyllie
in 1978 for modeling idealized parallel computers in
which communication cost and synchronization
overhead are negligible.
• During a computational step, an active processor may
read a data value from a memory location, perform a
single operation and finally write back the result into a
memory location.
• This model is referred to as the shared memory, single
instruction, multiple data (SM SIMD) machine.

6.1 The PRAM Model and Its Variations
Figure: PRAM model for parallel computations. Processors P1, P2, ..., Pp, each with its own private memory, are connected to a shared global memory and operate under a common control unit.


6.1 The PRAM Model and Its Variations
• There are different modes for read and write operations
in a PRAM:
– Exclusive Read (ER): only 1 processor can read from any
memory location at a time.
– Exclusive Write (EW): only 1 processor can write to any memory
location at a time.
– Concurrent Read (CR): multiple processors can read from the
same memory location simultaneously.
– Concurrent Write (CW): multiple processors can write to the
same memory location simultaneously.

6.1 The PRAM Model and Its Variations
• Write conflicts must be resolved using a well-defined
policy such as:
– Common: the write succeeds only if all the concurrent writes attempt to store the same value.
– Arbitrary: only one value selected arbitrarily is stored.
– Minimum: the value written by the processor with the smallest
index is stored.
– Reduction: all the values are reduced to only one value using
some reduction function such as sum, minimum, maximum, etc.
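To make the four policies concrete, here is a small illustrative Python sketch (not from the book; the function name and structure are assumptions) that resolves a set of conflicting write requests to a single memory cell:

def resolve_concurrent_write(requests, policy, reduce_fn=None):
    """requests: list of (processor_index, value) pairs targeting one cell."""
    values = [v for _, v in requests]
    if policy == "common":
        # Write succeeds only if every processor attempts the same value.
        if len(set(values)) != 1:
            raise ValueError("common-CRCW conflict: differing values")
        return values[0]
    if policy == "arbitrary":
        return values[0]                      # any one request wins
    if policy == "minimum":
        # Processor with the smallest index wins.
        return min(requests, key=lambda r: r[0])[1]
    if policy == "reduction":
        result = values[0]
        for v in values[1:]:
            result = reduce_fn(result, v)     # e.g. sum, min, max
        return result
    raise ValueError("unknown policy")

# Example: processors 3, 1 and 7 all write to the same cell.
reqs = [(3, 5), (1, 9), (7, 2)]
print(resolve_concurrent_write(reqs, "minimum"))                        # 9
print(resolve_concurrent_write(reqs, "reduction", lambda a, b: a + b))  # 16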

6.1 The PRAM Model and Its Variations
• The PRAM can be divided into the following subclasses:
– EREW PRAM: access to any memory cell is exclusive. It is the
most restrictive PRAM model.
– ERCW PRAM: this allows concurrent writes to the same memory
location by multiple processors, but read accesses remain
exclusive.
– CREW PRAM: concurrent read accesses allowed, but write
accesses are exclusive.
– CRCW PRAM: both concurrent read and write accesses are
allowed.

6.2 Simulating Multiple Accesses On An EREW PRAM
• To simulate a concurrent read, in which several processors need the value x held in one memory location, the following broadcasting mechanism is followed:
– P1 reads x and makes it known to P2.
– P1 and P2 make x known to P3 and P4, respectively, in parallel.
– P1, P2, P3 and P4 make x known to P5, P6, P7 and P8,
respectively in parallel
– These 8 processors will make x known to another 8 processors
and so on.
• Since the number of processors having read x doubles in
each iteration, the procedure terminates in O (log p)
time.
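A minimal Python sketch of this doubling broadcast (assumed names; it only mimics Algorithm Broadcast_EREW round by round rather than reproducing the book's pseudocode):

def broadcast_erew(x, p):
    """Simulate broadcasting a value x to p processors with exclusive reads.

    L[i] is the private copy held by processor i (None until it has seen x).
    In every round each informed processor passes x to one distinct new
    processor, so the number of informed processors doubles per round.
    """
    L = [None] * p
    L[0] = x                      # P1 reads x from the shared location
    informed, rounds = 1, 0
    while informed < p:
        for i in range(informed):             # informed processors act in parallel
            target = informed + i
            if target < p:
                L[target] = L[i]              # each write goes to a distinct cell
        informed = min(2 * informed, p)
        rounds += 1
    return rounds

print(broadcast_erew("x", 8))     # 3 rounds, i.e. O(log p)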

6.2 Simulating Multiple Accesses On An EREW PRAM
Figure: Simulating concurrent read on an EREW PRAM with 8 processors using Algorithm Broadcast_EREW. In panels (a) through (d) the value x spreads from P1 to P2, then from P1 and P2 to P3 and P4, and finally from P1 through P4 to P5 through P8.

6.3 Analysis of Parallel Algorithms
• The performance of a parallel algorithm is measured
quantitatively as follows:
– Run time, which is defined as the time spent during the
execution of the algorithm.
– Number of processors the algorithm uses to solve a problem.
– The cost of the parallel algorithm, which is the product of the run
time and the number of processors.
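As a toy illustration (not from the book), the cost can be computed directly from the other two measures; the numbers below anticipate Algorithm Sum_EREW from Section 6.4:

import math

def parallel_cost(run_time, num_processors):
    """Cost of a parallel algorithm = run time x number of processors."""
    return run_time * num_processors

# Sum_EREW on n elements: T(n) ~ log2 n using n/2 processors,
# versus O(n) work for a good sequential algorithm.
n = 1024
print(parallel_cost(math.log2(n), n // 2))   # 5120.0, i.e. Theta(n log n) cost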

6.3 Analysis of Parallel Algorithms
• The NC-class and P-completeness
– A problem belongs to class P if a solution of the
problem can be obtained by a polynomial-time
algorithm.
– A problem belongs to class NP if the correctness of a
solution for the problem can be verified by a
polynomial-time algorithm.
– In parallel computation, NC is the class of well-parallelizable problems: those that can be solved in polylogarithmic time using a polynomial number of processors.
– A problem is P-complete if it belongs to P and is at least as hard to parallelize as any other problem in P.

6.3 Analysis of Parallel Algorithms
• The NC-class and P-completeness

Figure: The relationships among P, NP, NP-Complete, NP-Hard, NC, and P-Complete (NC and the P-Complete problems lie inside P; the NP-Complete problems lie in the intersection of NP and NP-Hard).

6.4 Computing Sum And All Sums
• Sum of an array of numbers on the EREW
model:
– Algorithm Sum_EREW
for i = 1 to log n do
    forall Pj, where 1 <= j <= n/2, do in parallel
        if (2j modulo 2^i) = 0 then
            A[2j] <- A[2j] + A[2j - 2^(i-1)]
        endif
    endfor
endfor
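A runnable Python sketch that sequentially simulates the parallel loop above (0-based indexing and power-of-two n are assumptions):

import math

def sum_erew(A):
    """Simulate Algorithm Sum_EREW: after log2(n) steps the last cell holds the total.

    A is indexed from 0 here, so pseudocode cell A[2j] becomes A[2*j - 1].
    """
    A = list(A)
    n = len(A)
    for i in range(1, int(math.log2(n)) + 1):
        # Processors P1 .. P_{n/2} act in parallel; here we simply loop over j.
        for j in range(1, n // 2 + 1):
            if (2 * j) % (2 ** i) == 0:
                A[2 * j - 1] += A[2 * j - 1 - 2 ** (i - 1)]
    return A[n - 1]

print(sum_erew([5, 2, 10, 1, 8, 12, 7, 3]))   # 48, matching the example below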

6.4 Computing Sum And All Sums
• Sum of an array of numbers on the EREW
model:
– Complexity analysis
• Run time, T(n) = O(log n).
• Number of processors, P(n) = n/2.
• Cost, C(n) = O (n log n).
– Since a good sequential algorithm can sum the list of
n elements in O (n), this algorithm is not cost optimal.

6.4 Computing Sum And All Sums
• Sum of an array of numbers on the EREW model:

Active processors      A[1]  A[2]  A[3]  A[4]  A[5]  A[6]  A[7]  A[8]
(initially)              5     2    10     1     8    12     7     3
P1, P2, P3, P4           5     7    10    11     8    20     7    10
P2, P4                   5     7    10    18     8    20     7    30
P4                       5     7    10    18     8    20     7    48

Example of Algorithm Sum_EREW when n = 8


6.4 Computing Sum And All Sums
• All partial sums of an array:
– Algorithm AllSums_EREW
for i = 1 to log n do
    forall Pj, where 2^(i-1) + 1 <= j <= n, do in parallel
        A[j] <- A[j] + A[j - 2^(i-1)]
    endfor
endfor
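A runnable Python sketch of the same prefix-sum recurrence (again 0-based indexing, n a power of two; the copy of the old array stands in for the PRAM's synchronized read-then-write step):

import math

def all_sums_erew(A):
    """Simulate Algorithm AllSums_EREW: A[j] ends up holding A[1] + ... + A[j]."""
    A = list(A)
    n = len(A)
    for i in range(1, int(math.log2(n)) + 1):
        step = 2 ** (i - 1)
        # On the PRAM all additions of one iteration happen simultaneously,
        # so every processor reads the old values before anyone writes.
        old = list(A)
        for j in range(step + 1, n + 1):
            A[j - 1] = old[j - 1] + old[j - 1 - step]
    return A

print(all_sums_erew([1, 2, 3, 4, 5, 6, 7, 8]))
# [1, 3, 6, 10, 15, 21, 28, 36]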

6.4 Computing Sum And All Sums
• All partial sums of an array:
– Complexity analysis
• Run time, T(n) = O (log n).
• Number of processors, P(n) = n - 1.
• Cost, C (n) = O (n log n).

6.4 Computing Sum And All Sums
• All partial sums of an array (Σ(i..j) denotes A[i] + A[i+1] + ... + A[j]):

Step                    A[1]   A[2]     A[3]     A[4]     A[5]     A[6]     A[7]     A[8]
After i = 1 (P2..P8)    A[1]  Σ(1..2)  Σ(2..3)  Σ(3..4)  Σ(4..5)  Σ(5..6)  Σ(6..7)  Σ(7..8)
After i = 2 (P3..P8)    A[1]  Σ(1..2)  Σ(1..3)  Σ(1..4)  Σ(2..5)  Σ(3..6)  Σ(4..7)  Σ(5..8)
After i = 3 (P5..P8)    A[1]  Σ(1..2)  Σ(1..3)  Σ(1..4)  Σ(1..5)  Σ(1..6)  Σ(1..7)  Σ(1..8)

Computing Partial Sums of an Array of 8 Elements

6.5 Matrix Multiplication
• Using n^3 processors:
– The algorithm consists of 2 steps:
• Each processor Pi,j,k computes the product A[i, k] * B[k, j]
and stores it in C[i, j, k].
• The idea of algorithm Sum_EREW is applied along the k
dimension n^2 times in parallel to compute C[i, j, n], where
1 <= i, j <= n.
– Complexity analysis:
• Run time, T (n) = O (log n).
• Number of processors, P (n) = n^3.
• Cost, C (n) = O (n^3 log n).
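The sketch below (assumed names, sequential simulation only) follows the two steps: step 1 fills the n^3 products and step 2 applies the Sum_EREW doubling along the k dimension:

import math

def matmult_crew_sketch(A, B):
    """Sequentially simulate the two steps of the n^3-processor algorithm.

    Step 1: processor P(i,j,k) computes C[i][j][k] = A[i][k] * B[k][j].
    Step 2: the Sum_EREW doubling scheme is applied along k, so after
            log2(n) iterations C[i][j][n-1] holds the dot product.
    Assumes n is a power of two; indices are 0-based.
    """
    n = len(A)
    C = [[[A[i][k] * B[k][j] for k in range(n)] for j in range(n)]
         for i in range(n)]                                   # step 1
    for step in range(1, int(math.log2(n)) + 1):              # step 2
        for i in range(n):
            for j in range(n):
                for k in range(n):
                    if (k + 1) % (2 ** step) == 0:
                        C[i][j][k] += C[i][j][k - 2 ** (step - 1)]
    return [[C[i][j][n - 1] for j in range(n)] for i in range(n)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmult_crew_sketch(A, B))   # [[19, 22], [43, 50]]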

6.5 Matrix Multiplication
• Using n^3 processors
– Complexity analysis:
• This algorithm is not cost optimal because an n x n matrix
multiplication can be done sequentially in less than O (n^3) time.

6.5 Matrix Multiplication
• Using n^3 processors

After Step 1 (each Pi,j,k computes one product):
k = 1:  P1,1,1: C[1,1,1] ← A[1,1] * B[1,1]     P1,2,1: C[1,2,1] ← A[1,1] * B[1,2]
        P2,1,1: C[2,1,1] ← A[2,1] * B[1,1]     P2,2,1: C[2,2,1] ← A[2,1] * B[1,2]
k = 2:  P1,1,2: C[1,1,2] ← A[1,2] * B[2,1]     P1,2,2: C[1,2,2] ← A[1,2] * B[2,2]
        P2,1,2: C[2,1,2] ← A[2,2] * B[2,1]     P2,2,2: C[2,2,2] ← A[2,2] * B[2,2]

After Step 2 (summation along the k dimension):
k = 2:  P1,1,2: C[1,1,2] ← C[1,1,2] + C[1,1,1]  P1,2,2: C[1,2,2] ← C[1,2,2] + C[1,2,1]
        P2,1,2: C[2,1,2] ← C[2,1,2] + C[2,1,1]  P2,2,2: C[2,2,2] ← C[2,2,2] + C[2,2,1]

Multiplying two 2 x 2 matrices using Algorithm MatMult_CREW
6.5 Matrix Multiplication
• Reducing the number of processors:
– Modify MatMult_CREW as follows:
• Each processor Pi,j,k, where 1 <= k <= n/log n, computes the
sum of log n products. This step produces (n^3/log n) partial sums.
• The partial sums produced in step 1 are then added to
produce the resulting matrix.
– Complexity analysis:
• Run time, T(n) = O (log n).
• Number of processors, P (n) = n^3/log n.
• Cost, C (n) = O (n^3).
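A rough sequential sketch of this processor-reduction idea (assumed names; each simulated processor first accumulates about log n products locally before the remaining partial sums are added):

import math

def matmult_reduced_processors(A, B):
    """Each simulated processor P(i,j,k), 1 <= k <= n/log n, accumulates a block
    of about log2(n) products locally, so only about n^3/log n partial sums
    remain before the final reduction. Assumes n is a power of two, n >= 2.
    """
    n = len(A)
    block = int(math.log2(n))                    # products handled per processor
    nblocks = math.ceil(n / block)               # roughly n / log n blocks
    partial = [[[0] * nblocks for _ in range(n)] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(nblocks):             # step 1: local sums of log n products
                for t in range(k * block, min((k + 1) * block, n)):
                    partial[i][j][k] += A[i][t] * B[t][j]
    # Step 2: add the remaining partial sums (a tree reduction on the PRAM).
    return [[sum(partial[i][j]) for j in range(n)] for i in range(n)]

A = [[1, 2, 0, 1], [0, 1, 1, 0], [2, 0, 1, 1], [1, 1, 0, 2]]
B = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 0, 0], [0, 0, 1, 1]]
print(matmult_reduced_processors(A, B))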

6.6 Sorting
• The enumeration sort algorithm, which uses n^2 processors on a CRCW PRAM, consists of 2 steps:
– Each row of processors i computes C [i], the number
of elements smaller than A [i]. Each processor Pi,j
compares A [i] and A [j], then updates C [i]
appropriately.
– The first processor in each row Pi,1 places A [i] in its
proper position in the sorted list (C [i] +1).
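A compact Python sketch of this enumeration (rank) sort, with ties broken by processor index so equal keys receive distinct positions (an assumption consistent with the example shown later):

def enumeration_sort(A):
    """Rank-based (enumeration) sort, mirroring the two CRCW PRAM steps.

    Step 1: C[i] = number of elements that must precede A[i]; conceptually
            processor P(i,j) performs one comparison and row i's results are
            accumulated into C[i] with a sum-reduction concurrent write.
    Step 2: processor P(i,1) writes A[i] into position C[i] + 1 (1-based).
    """
    n = len(A)
    C = [0] * n
    for i in range(n):                 # row i of processors
        for j in range(n):             # processor P(i,j)
            if A[j] < A[i] or (A[j] == A[i] and j < i):
                C[i] += 1
    result = [None] * n
    for i in range(n):                 # step 2: place A[i] at its final position
        result[C[i]] = A[i]
    return result

print(enumeration_sort([6, 1, 3]))     # [1, 3, 6], with C = [2, 0, 1]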

6.6 Sorting
• Complexity Analysis:
– Run time, T (n) = O (1).
– Number of processors, P (n) = n^2.
– Cost, C (n) = O (n^2).

6.6 Sorting
Initially A = [6, 1, 3]

P1,1 compares 6 & 6    P1,2 compares 6 & 1    P1,3 compares 6 & 3
P2,1 compares 1 & 6    P2,2 compares 1 & 1    P2,3 compares 1 & 3
P3,1 compares 3 & 6    P3,2 compares 3 & 1    P3,3 compares 3 & 3

After Step 1: C = [2, 0, 1]
After Step 2: A = [1, 3, 6]

Enumeration Sort of [6, 1, 3] on a CRCW PRAM


6.7 Message Passing Model
• Synchronous Message Passing Model
– The behavior of this system can be described as
follows:
• System is initialized and set to an arbitrary initial state.
• For each process i ∈ V, repeat the following 2 steps in
synchronized rounds:
– Send messages to the outgoing neighbors by applying some
message generation function to the current state.
– Obtain the new state by applying a state transition function to
the current state and the messages received from the incoming
neighbors.

6.7 Message Passing Model
• Synchronous Message Passing Model
– This system can be modeled as a state machine with
the following components:
• M, a fixed message alphabet.
• A process i can be modeled as:
– Qi : a set of states
– q0,i: the initial state in the state set Qi
– GenMsgi : a message generation function. It is applied to the
current system state to generate messages to the outgoing
neighbors from elements in M.
– Transi: a state transition function that maps the current state
and the incoming messages into a new state.
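A minimal Python sketch of this synchronous state-machine view (class and function names are assumptions, not the book's notation): in every round each process first generates messages from its current state, then all processes apply their transition functions.

class Process:
    """One process i of the synchronous model: (Q_i, q0_i, GenMsg_i, Trans_i)."""

    def __init__(self, initial_state, gen_msg, trans):
        self.state = initial_state   # q0,i
        self.gen_msg = gen_msg       # state -> dict {neighbor: message}
        self.trans = trans           # (state, incoming messages) -> new state

def run_round(processes, edges):
    """Execute one synchronized round over the directed edges {(i, j), ...}."""
    # Phase 1: every process generates its outgoing messages from its state.
    outgoing = {i: p.gen_msg(p.state) for i, p in processes.items()}
    # Phase 2: deliver the messages and apply the state transition functions.
    for j, p in processes.items():
        incoming = {i: outgoing[i].get(j) for (i, k) in edges if k == j}
        p.state = p.trans(p.state, incoming)

# Toy example: two processes on a 2-cycle exchange values and keep the larger.
keep_max = lambda s, msgs: max([s] + [m for m in msgs.values() if m is not None])
edges = {(0, 1), (1, 0)}
procs = {
    0: Process(5, lambda s: {1: s}, keep_max),
    1: Process(9, lambda s: {0: s}, keep_max),
}
run_round(procs, edges)
print(procs[0].state, procs[1].state)   # 9 9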

6.7 Message Passing Model
• Synchronous Message Passing Model

Figure: An example of a state diagram for process i. Starting from q0,i, the process moves to q1,i after round 1, to q2,i after round 2, and so on up to qk,i after round k, consuming messages Msg1, Msg2, ..., Msgk along the way.

6.7 Message Passing Model
• Synchronous Message Passing Model
– The complexity analysis for algorithms following this
model is measured quantitatively using:
• Message complexity:
– Defined as the number of messages sent between neighbors
during the execution of the algorithm.
• Time complexity:
– Defined as the time spent during the execution of the algorithm.

6.8 Leader Election Problem
• A leader among n processors is a processor that all the other
processors recognize as distinguished to perform a special task.
• The leader election problem arises when the processors
of a distributed system must choose one of them as a
leader.
• A leader is needed to coordinate the reestablishment of
allocation and routing functions.
• The leader election problem is meaningless in the context of
anonymous systems, since identical processors without unique
identifiers cannot be distinguished to break the symmetry.

6.9 Leader Election In Synchronous Rings
• Simple Leader Election Algorithm
– Each process sends its identifier to its outgoing
neighbor.
– When a process receives an identifier from its
incoming neighbor, then:
• The process sends null to its outgoing neighbor, if the
received identifier is less than its own identifier.
• The process sends the received identifier to its outgoing
neighbor, if the received identifier is greater than its own
identifier.
• The process declares itself as the leader, if the received
identifier is equal to its own identifier.
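A small Python simulation of this simple algorithm on a synchronous unidirectional ring (layout and names are assumptions; with unique identifiers the maximum identifier is elected after n rounds):

def simple_ring_election(ids):
    """Synchronous rounds on a unidirectional ring; ids[i] sends to ids[(i+1) % n].

    Returns (leader identifier, number of rounds taken).
    """
    n = len(ids)
    buff = list(ids)                      # round 0: everyone sends its own id
    rounds = 0
    while True:
        rounds += 1
        incoming = [buff[(i - 1) % n] for i in range(n)]   # receive from predecessor
        buff = [None] * n
        for i in range(n):
            received = incoming[i]
            if received is None:
                continue
            if received == ids[i]:
                return ids[i], rounds     # own id came all the way around: leader
            if received > ids[i]:
                buff[i] = received        # forward larger identifiers
            # received < ids[i]: swallow it (send null)

print(simple_ring_election([3, 1, 4, 2]))   # (4, 4): id 4 returns after n rounds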

6.9 Leader Election In Synchronous Rings
• Simple Leader Election Algorithm
– Complexity analysis:
• Time complexity: O (n).
• Message complexity: O (n^2).

6.9 Leader Election In Synchronous Rings
• Simple Leader Election Algorithm
– Leader Election in a Synchronous Ring using Algorithm S_Elect_Leader_Simple

Figure: A synchronous ring of four processes with identifiers u = 1, 2, 3, 4, each holding a buffer (buff) and a status field. In the initial state (a) every process has its own identifier in its buffer and status unknown. During rounds (b) through (d) only the largest identifier, 4, keeps circulating while the smaller identifiers are replaced by null. After the fourth round (e), identifier 4 returns to its originator, which changes its status to leader; all other processes remain unknown.
6.9 Leader Election In Synchronous Rings
• Improved Leader Election Algorithm
– k = 0.
– Each process sends its identifier in messages to its
neighbors in both directions, intending that they will
travel 2^k hops and then return to their origin.
– If the identifier is proceeding in the outbound
direction, when a process on the path receives the
identifier from its neighbor, then:
• The process sends null to its outneighbor, if the received
identifier is less than its own identifier.
• The process sends the received identifier to its outneighbor,
if the received identifier is greater than its own identifier.

6.9 Leader Election In Synchronous Rings
• Improved Leader Election Algorithm
• The process declares itself as the leader, if the received
identifier is equal to its own identifier.
– If the identifier is proceeding in the inbound direction,
when a process on the path receives the identifier, it
sends the received identifier to its outgoing neighbor
on the path, if the received identifier is greater than its
own identifier.
– If the 2 original messages make it back to their origin,
then k <- k + 1; go to step 2.
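A high-level Python sketch of these phases (it simulates the effect of the 2^k-hop probes directly instead of exchanging individual messages; names and structure are assumptions): a process survives phase k only if its identifier exceeds every identifier within 2^k hops on either side, and it wins once its probes span the whole ring.

def improved_ring_election(ids):
    """Phase-by-phase simulation of the improved (probe-doubling) algorithm.

    In phase k, each still-active process conceptually sends its identifier
    2**k hops in both directions; a probe survives only if no identifier on
    its path is larger. A process whose probes span the entire ring and
    return becomes the leader. Identifiers are assumed unique.
    """
    n = len(ids)
    active = set(range(n))
    k = 0
    while True:
        hops = 2 ** k
        if hops >= n:
            # A surviving probe now travels all the way around the ring.
            winner = max(active, key=lambda i: ids[i])
            return ids[winner], k
        survivors = set()
        for i in active:
            neighbourhood = [ids[(i + d) % n] for d in range(1, hops + 1)]
            neighbourhood += [ids[(i - d) % n] for d in range(1, hops + 1)]
            if all(ids[i] > v for v in neighbourhood):
                survivors.add(i)          # both probes returned to process i
        active = survivors
        k += 1

print(improved_ring_election([5, 2, 8, 1, 7, 3, 6, 4]))   # (8, 3)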

6.9 Leader Election In Synchronous Rings
• Improved Leader Election Algorithm
Figure: Messages initiated at process i using the improved leader election algorithm. In phase k = 0 the identifier travels 1 hop in each direction and returns; in phase k = 1 it travels 2 hops; in phase k = 2 it travels 4 hops, doubling the reach in every phase.

6.9 Leader Election In Synchronous Rings
• Improved Leader Election Algorithm
– Complexity analysis
• Time complexity: O (n)
• Message complexity: O (n log n)

6.9 Leader Election In Synchronous Rings
• Improved Leader Election Algorithm

Figure: Leader election in a synchronous ring of four processes with identifiers u = 1, 2, 3, 4 using the improved algorithm. Each process keeps a phase counter k and two buffers, buff+ and buff-, for the messages travelling in the two directions; messages are triples of the form (identifier, direction, hop count), such as (4, out, 1). Panel (a) shows the initial states, where every process loads its own identifier into both buffers with status unknown; panels (b) through (d) show the buffer contents after the first, second, and third rounds, during which only the probes of the largest identifier, 4, survive and process 4 advances to phase k = 1.
6.10 Summary
• PRAM has played an important role in the introduction of
parallel programming paradigms and design techniques
that have been used in real parallel systems.
• A large number of PRAM algorithms for solving many
fundamental problems have been introduced and
efficiently implemented on real systems.
• An important characteristic of a message system is the
degree of synchrony, which reflects the different types of
timing information that can be used by an algorithm.

