
Chapter 6

Abstract Models

Advanced Computer Architecture and Parallel Processing Hesham El-Rewini & Mostafa Abd-El-Barr
6.1 The PRAM Model and Its Variations
• The PRAM model was introduced by Fortune and Wyllie
in 1978 for modeling idealized parallel computers in
which communication cost and synchronization
overhead are negligible.
• During a computational step, an active processor may
read a data value from a memory location, perform a
single operation and finally write back the result into a
memory location.
• This model is referred to as the shared memory, single
instruction, multiple data (SM SIMD) machine.

6.1 The PRAM Model and Its Variations
Figure: PRAM model for parallel computations. Processors P1, P2, ..., Pp, each with its own private memory, are connected to a shared global memory and operate under a common control unit.


6.1 The PRAM Model and Its Variations
• There are different modes for read and write operations
in a PRAM:
– Exclusive Read (ER): only 1 processor can read from any
memory location at a time.
– Exclusive Write (EW): only 1 processor can write to any memory
location at a time.
– Concurrent Read (CR): multiple processors can read from the
same memory location simultaneously.
– Concurrent Write (CW): multiple processors can write to the
same memory location simultaneously.

6.1 The PRAM Model and Its Variations
• Write conflicts must be resolved using a well-defined
policy such as:
– Common: the write succeeds only if all the concurrent writes attempt to store the same value.
– Arbitrary: only one value selected arbitrarily is stored.
– Minimum: the value written by the processor with the smallest
index is stored.
– Reduction: all the values are reduced to only one value using
some reduction function such as sum, minimum, maximum, etc.
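To make the four policies concrete, here is a small illustrative Python sketch (not from the book; the function name and structure are assumptions) that resolves a set of conflicting write requests to a single memory cell:

def resolve_concurrent_write(requests, policy, reduce_fn=None):
    """requests: list of (processor_index, value) pairs targeting one cell."""
    values = [v for _, v in requests]
    if policy == "common":
        # Write succeeds only if every processor attempts the same value.
        if len(set(values)) != 1:
            raise ValueError("common-CRCW conflict: differing values")
        return values[0]
    if policy == "arbitrary":
        return values[0]                      # any one request wins
    if policy == "minimum":
        # Processor with the smallest index wins.
        return min(requests, key=lambda r: r[0])[1]
    if policy == "reduction":
        result = values[0]
        for v in values[1:]:
            result = reduce_fn(result, v)     # e.g. sum, min, max
        return result
    raise ValueError("unknown policy")

# Example: processors 3, 1 and 7 all write to the same cell.
reqs = [(3, 5), (1, 9), (7, 2)]
print(resolve_concurrent_write(reqs, "minimum"))                        # 9
print(resolve_concurrent_write(reqs, "reduction", lambda a, b: a + b))  # 16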

6.1 The PRAM Model and Its Variations
• The PRAM can be divided into the following subclasses:
– EREW PRAM: access to any memory cell is exclusive. It is the
most restrictive PRAM model.
– ERCW PRAM: this allows concurrent writes to the same memory
location by multiple processors, but read accesses remain
exclusive.
– CREW PRAM: concurrent read accesses allowed, but write
accesses are exclusive.
– CRCW PRAM: both concurrent read and write accesses are
allowed.

6.2 Simulating Multiple Accesses On An EREW PRAM
• To simulate a concurrent read, in which several processors need the value x held in one memory location, the following broadcasting mechanism is followed:
– P1 reads x and makes it known to P2.
– P1 and P2 make x known to P3 and P4, respectively, in parallel.
– P1, P2, P3 and P4 make x known to P5, P6, P7 and P8,
respectively in parallel
– These 8 processors will make x known to another 8 processors
and so on.
• Since the number of processors having read x doubles in
each iteration, the procedure terminates in O (log p)
time.
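A minimal Python sketch of this doubling broadcast (assumed names; it only mimics Algorithm Broadcast_EREW round by round rather than reproducing the book's pseudocode):

def broadcast_erew(x, p):
    """Simulate broadcasting a value x to p processors with exclusive reads.

    L[i] is the private copy held by processor i (None until it has seen x).
    In every round each informed processor passes x to one distinct new
    processor, so the number of informed processors doubles per round.
    """
    L = [None] * p
    L[0] = x                      # P1 reads x from the shared location
    informed, rounds = 1, 0
    while informed < p:
        for i in range(informed):             # informed processors act in parallel
            target = informed + i
            if target < p:
                L[target] = L[i]              # each write goes to a distinct cell
        informed = min(2 * informed, p)
        rounds += 1
    return rounds

print(broadcast_erew("x", 8))     # 3 rounds, i.e. O(log p)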

6.2 Simulating Multiple Accesses On An EREW PRAM
Figure: Simulating concurrent read on an EREW PRAM with 8 processors using Algorithm Broadcast_EREW. In panels (a) through (d) the value x spreads from P1 to P2, then from P1 and P2 to P3 and P4, and finally from P1 through P4 to P5 through P8.

6.3 Analysis of Parallel Algorithms
• The performance of a parallel algorithm is measured
quantitatively as follows:
– Run time, which is defined as the time spent during the
execution of the algorithm.
– Number of processors the algorithm uses to solve a problem.
– The cost of the parallel algorithm, which is the product of the run
time and the number of processors.
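As a toy illustration (not from the book), the cost can be computed directly from the other two measures; the numbers below anticipate Algorithm Sum_EREW from Section 6.4:

import math

def parallel_cost(run_time, num_processors):
    """Cost of a parallel algorithm = run time x number of processors."""
    return run_time * num_processors

# Sum_EREW on n elements: T(n) ~ log2 n using n/2 processors,
# versus O(n) work for a good sequential algorithm.
n = 1024
print(parallel_cost(math.log2(n), n // 2))   # 5120.0, i.e. Theta(n log n) cost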

6.3 Analysis of Parallel Algorithms
• The NC-class and P-completeness
– A problem belongs to class P if a solution of the
problem can be obtained by a polynomial-time
algorithm.
– A problem belongs to class NP if the correctness of a
solution for the problem can be verified by a
polynomial-time algorithm.
– In parallel computation, NC is the class of well-parallelizable problems: those that can be solved in polylogarithmic time using a polynomial number of processors.
– A problem is P-complete if it belongs to P and is at least as hard to parallelize as any other problem in P.

6.3 Analysis of Parallel Algorithms
• The NC-class and P-completeness

Figure: The relationships among P, NP, NP-Complete, NP-Hard, NC, and P-Complete (NC and the P-Complete problems lie inside P; the NP-Complete problems lie in the intersection of NP and NP-Hard).

6.4 Computing Sum And All Sums
• Sum of an array of numbers on the EREW
model:
– Algorithm Sum_EREW
for i = 1 to log n do
    forall Pj, where 1 <= j <= n/2, do in parallel
        if (2j modulo 2^i) = 0 then
            A[2j] <- A[2j] + A[2j - 2^(i-1)]
        endif
    endfor
endfor
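A runnable Python sketch that sequentially simulates the parallel loop above (0-based indexing and power-of-two n are assumptions):

import math

def sum_erew(A):
    """Simulate Algorithm Sum_EREW: after log2(n) steps the last cell holds the total.

    A is indexed from 0 here, so pseudocode cell A[2j] becomes A[2*j - 1].
    """
    A = list(A)
    n = len(A)
    for i in range(1, int(math.log2(n)) + 1):
        # Processors P1 .. P_{n/2} act in parallel; here we simply loop over j.
        for j in range(1, n // 2 + 1):
            if (2 * j) % (2 ** i) == 0:
                A[2 * j - 1] += A[2 * j - 1 - 2 ** (i - 1)]
    return A[n - 1]

print(sum_erew([5, 2, 10, 1, 8, 12, 7, 3]))   # 48, matching the example below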

6.4 Computing Sum And All Sums
• Sum of an array of numbers on the EREW
model:
– Complexity analysis
• Run time, T(n) = O(log n).
• Number of processors, P(n) = n/2.
• Cost, C(n) = O (n log n).
– Since a good sequential algorithm can sum the list of
n elements in O (n), this algorithm is not cost optimal.

6.4 Computing Sum And All Sums
• Sum of an array of numbers on the EREW model:

Active processors      A[1]  A[2]  A[3]  A[4]  A[5]  A[6]  A[7]  A[8]
(initially)              5     2    10     1     8    12     7     3
P1, P2, P3, P4           5     7    10    11     8    20     7    10
P2, P4                   5     7    10    18     8    20     7    30
P4                       5     7    10    18     8    20     7    48

Example of Algorithm Sum_EREW when n = 8


6.4 Computing Sum And All Sums
• All partial sums of an array:
– Algorithm AllSums_EREW
for i = 1 to log n do
    forall Pj, where 2^(i-1) + 1 <= j <= n, do in parallel
        A[j] <- A[j] + A[j - 2^(i-1)]
    endfor
endfor
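A runnable Python sketch of the same prefix-sum recurrence (again 0-based indexing, n a power of two; the copy of the old array stands in for the PRAM's synchronized read-then-write step):

import math

def all_sums_erew(A):
    """Simulate Algorithm AllSums_EREW: A[j] ends up holding A[1] + ... + A[j]."""
    A = list(A)
    n = len(A)
    for i in range(1, int(math.log2(n)) + 1):
        step = 2 ** (i - 1)
        # On the PRAM all additions of one iteration happen simultaneously,
        # so every processor reads the old values before anyone writes.
        old = list(A)
        for j in range(step + 1, n + 1):
            A[j - 1] = old[j - 1] + old[j - 1 - step]
    return A

print(all_sums_erew([1, 2, 3, 4, 5, 6, 7, 8]))
# [1, 3, 6, 10, 15, 21, 28, 36]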

6.4 Computing Sum And All Sums
• All partial sums of an array:
– Complexity analysis
• Run time, T(n) = O (log n).
• Number of processors, P(n) = n - 1.
• Cost, C (n) = O (n log n).

6.4 Computing Sum And All Sums
• All partial sums of an array (Σ(i..j) denotes A[i] + A[i+1] + ... + A[j]):

Step                    A[1]   A[2]     A[3]     A[4]     A[5]     A[6]     A[7]     A[8]
After i = 1 (P2..P8)    A[1]  Σ(1..2)  Σ(2..3)  Σ(3..4)  Σ(4..5)  Σ(5..6)  Σ(6..7)  Σ(7..8)
After i = 2 (P3..P8)    A[1]  Σ(1..2)  Σ(1..3)  Σ(1..4)  Σ(2..5)  Σ(3..6)  Σ(4..7)  Σ(5..8)
After i = 3 (P5..P8)    A[1]  Σ(1..2)  Σ(1..3)  Σ(1..4)  Σ(1..5)  Σ(1..6)  Σ(1..7)  Σ(1..8)

Computing Partial Sums of an Array of 8 Elements

6.5 Matrix Multiplication
• Using n^3 processors:
– The algorithm consists of 2 steps:
• Each processor Pi,j,k computes the product A[i, k] * B[k, j]
and stores it in C[i, j, k].
• The idea of algorithm Sum_EREW is applied along the k
dimension n^2 times in parallel to compute C[i, j, n], where
1 <= i, j <= n.
– Complexity analysis:
• Run time, T (n) = O (log n).
• Number of processors, P (n) = n^3.
• Cost, C (n) = O (n^3 log n).
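The sketch below (assumed names, sequential simulation only) follows the two steps: step 1 fills the n^3 products and step 2 applies the Sum_EREW doubling along the k dimension:

import math

def matmult_crew_sketch(A, B):
    """Sequentially simulate the two steps of the n^3-processor algorithm.

    Step 1: processor P(i,j,k) computes C[i][j][k] = A[i][k] * B[k][j].
    Step 2: the Sum_EREW doubling scheme is applied along k, so after
            log2(n) iterations C[i][j][n-1] holds the dot product.
    Assumes n is a power of two; indices are 0-based.
    """
    n = len(A)
    C = [[[A[i][k] * B[k][j] for k in range(n)] for j in range(n)]
         for i in range(n)]                                   # step 1
    for step in range(1, int(math.log2(n)) + 1):              # step 2
        for i in range(n):
            for j in range(n):
                for k in range(n):
                    if (k + 1) % (2 ** step) == 0:
                        C[i][j][k] += C[i][j][k - 2 ** (step - 1)]
    return [[C[i][j][n - 1] for j in range(n)] for i in range(n)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmult_crew_sketch(A, B))   # [[19, 22], [43, 50]]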

6.5 Matrix Multiplication
• Using n^3 processors
– Complexity analysis:
• This algorithm is not cost optimal because an n x n matrix
multiplication can be done sequentially in less than O (n^3) time.

6.5 Matrix Multiplication
• Using n^3 processors

After Step 1 (each Pi,j,k computes one product):
k = 1:  P1,1,1: C[1,1,1] ← A[1,1] * B[1,1]     P1,2,1: C[1,2,1] ← A[1,1] * B[1,2]
        P2,1,1: C[2,1,1] ← A[2,1] * B[1,1]     P2,2,1: C[2,2,1] ← A[2,1] * B[1,2]
k = 2:  P1,1,2: C[1,1,2] ← A[1,2] * B[2,1]     P1,2,2: C[1,2,2] ← A[1,2] * B[2,2]
        P2,1,2: C[2,1,2] ← A[2,2] * B[2,1]     P2,2,2: C[2,2,2] ← A[2,2] * B[2,2]

After Step 2 (summation along the k dimension):
k = 2:  P1,1,2: C[1,1,2] ← C[1,1,2] + C[1,1,1]  P1,2,2: C[1,2,2] ← C[1,2,2] + C[1,2,1]
        P2,1,2: C[2,1,2] ← C[2,1,2] + C[2,1,1]  P2,2,2: C[2,2,2] ← C[2,2,2] + C[2,2,1]

Multiplying two 2 x 2 matrices using Algorithm MatMult_CREW
6.5 Matrix Multiplication
• Reducing the number of processors:
– Modify MatMult_CREW as follows:
• Each processor Pi,j,k, where 1 <= k <= n/log n, computes the
sum of log n products. This step produces (n^3/log n) partial sums.
• The partial sums produced in step 1 are then added to
produce the resulting matrix.
– Complexity analysis:
• Run time, T(n) = O (log n).
• Number of processors, P (n) = n^3/log n.
• Cost, C (n) = O (n^3).
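A rough sequential sketch of this processor-reduction idea (assumed names; each simulated processor first accumulates about log n products locally before the remaining partial sums are added):

import math

def matmult_reduced_processors(A, B):
    """Each simulated processor P(i,j,k), 1 <= k <= n/log n, accumulates a block
    of about log2(n) products locally, so only about n^3/log n partial sums
    remain before the final reduction. Assumes n is a power of two, n >= 2.
    """
    n = len(A)
    block = int(math.log2(n))                    # products handled per processor
    nblocks = math.ceil(n / block)               # roughly n / log n blocks
    partial = [[[0] * nblocks for _ in range(n)] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(nblocks):             # step 1: local sums of log n products
                for t in range(k * block, min((k + 1) * block, n)):
                    partial[i][j][k] += A[i][t] * B[t][j]
    # Step 2: add the remaining partial sums (a tree reduction on the PRAM).
    return [[sum(partial[i][j]) for j in range(n)] for i in range(n)]

A = [[1, 2, 0, 1], [0, 1, 1, 0], [2, 0, 1, 1], [1, 1, 0, 2]]
B = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 0, 0], [0, 0, 1, 1]]
print(matmult_reduced_processors(A, B))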

6.6 Sorting
• The enumeration sort algorithm, which uses n^2 processors on a CRCW PRAM, consists of 2 steps:
– Each row of processors i computes C [i], the number
of elements smaller than A [i]. Each processor Pi,j
compares A [i] and A [j], then updates C [i]
appropriately.
– The first processor in each row Pi,1 places A [i] in its
proper position in the sorted list (C [i] +1).
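A compact Python sketch of this enumeration (rank) sort, with ties broken by processor index so equal keys receive distinct positions (an assumption consistent with the example shown later):

def enumeration_sort(A):
    """Rank-based (enumeration) sort, mirroring the two CRCW PRAM steps.

    Step 1: C[i] = number of elements that must precede A[i]; conceptually
            processor P(i,j) performs one comparison and row i's results are
            accumulated into C[i] with a sum-reduction concurrent write.
    Step 2: processor P(i,1) writes A[i] into position C[i] + 1 (1-based).
    """
    n = len(A)
    C = [0] * n
    for i in range(n):                 # row i of processors
        for j in range(n):             # processor P(i,j)
            if A[j] < A[i] or (A[j] == A[i] and j < i):
                C[i] += 1
    result = [None] * n
    for i in range(n):                 # step 2: place A[i] at its final position
        result[C[i]] = A[i]
    return result

print(enumeration_sort([6, 1, 3]))     # [1, 3, 6], with C = [2, 0, 1]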

6.6 Sorting
• Complexity Analysis:
– Run time, T (n) = O (1).
– Number of processors, P (n) = n^2.
– Cost, C (n) = O (n^2).

6.6 Sorting
Initially A = [6, 1, 3]

P1,1 compares 6 & 6    P1,2 compares 6 & 1    P1,3 compares 6 & 3
P2,1 compares 1 & 6    P2,2 compares 1 & 1    P2,3 compares 1 & 3
P3,1 compares 3 & 6    P3,2 compares 3 & 1    P3,3 compares 3 & 3

After Step 1: C = [2, 0, 1]
After Step 2: A = [1, 3, 6]

Enumeration Sort of [6, 1, 3] on a CRCW PRAM


6.7 Message Passing Model
• Synchronous Message Passing Model
– The behavior of this system can be described as
follows:
• System is initialized and set to an arbitrary initial state.
• For each process i ∈ V, repeat the following 2 steps in
synchronized rounds:
– Send messages to the outgoing neighbors by applying some
message generation function to the current state.
– Obtain the new state by applying a state transition function to
the current state and the messages received from the incoming
neighbors.

6.7 Message Passing Model
• Synchronous Message Passing Model
– This system can be modeled as a state machine with
the following components:
• M, a fixed message alphabet.
• A process i can be modeled as:
– Qi : a set of states
– q0,i: the initial state in the state set Qi
– GenMsgi : a message generation function. It is applied to the
current system state to generate messages to the outgoing
neighbors from elements in M.
– Transi: a state transition function that maps the current state
and the incoming messages into a new state.
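A minimal Python sketch of this synchronous state-machine view (class and function names are assumptions, not the book's notation): in every round each process first generates messages from its current state, then all processes apply their transition functions.

class Process:
    """One process i of the synchronous model: (Q_i, q0_i, GenMsg_i, Trans_i)."""

    def __init__(self, initial_state, gen_msg, trans):
        self.state = initial_state   # q0,i
        self.gen_msg = gen_msg       # state -> dict {neighbor: message}
        self.trans = trans           # (state, incoming messages) -> new state

def run_round(processes, edges):
    """Execute one synchronized round over the directed edges {(i, j), ...}."""
    # Phase 1: every process generates its outgoing messages from its state.
    outgoing = {i: p.gen_msg(p.state) for i, p in processes.items()}
    # Phase 2: deliver the messages and apply the state transition functions.
    for j, p in processes.items():
        incoming = {i: outgoing[i].get(j) for (i, k) in edges if k == j}
        p.state = p.trans(p.state, incoming)

# Toy example: two processes on a 2-cycle exchange values and keep the larger.
keep_max = lambda s, msgs: max([s] + [m for m in msgs.values() if m is not None])
edges = {(0, 1), (1, 0)}
procs = {
    0: Process(5, lambda s: {1: s}, keep_max),
    1: Process(9, lambda s: {0: s}, keep_max),
}
run_round(procs, edges)
print(procs[0].state, procs[1].state)   # 9 9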

6.7 Message Passing Model
• Synchronous Message Passing Model

Figure: An example of a state diagram for process i. Starting from q0,i, the process moves to q1,i after round 1, to q2,i after round 2, and so on up to qk,i after round k, consuming messages Msg1, Msg2, ..., Msgk along the way.

6.7 Message Passing Model
• Synchronous Message Passing Model
– The complexity analysis for algorithms following this
model is measured quantitatively using:
• Message complexity:
– Defined as the number of messages sent between neighbors
during the execution of the algorithm.
• Time complexity:
– Defined as the time spent during the execution of the algorithm.

6.8 Leader Election Problem
• A leader among n processors is a processor that all the other
processors recognize as distinguished to perform a special task.
• The leader election problem arises when the processors
of a distributed system must choose one of them as a
leader.
• A leader is needed to coordinate the reestablishment of
allocation and routing functions.
• The leader election problem is meaningless in the context of
anonymous systems, since identical processors without unique
identifiers cannot be distinguished to break the symmetry.

6.9 Leader Election In Synchronous Rings
• Simple Leader Election Algorithm
– Each process sends its identifier to its outgoing
neighbor.
– When a process receives an identifier from its
incoming neighbor, then:
• The process sends null to its outgoing neighbor, if the
received identifier is less than its own identifier.
• The process sends the received identifier to its outgoing
neighbor, if the received identifier is greater than its own
identifier.
• The process declares itself as the leader, if the received
identifier is equal to its own identifier.
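A small Python simulation of this simple algorithm on a synchronous unidirectional ring (layout and names are assumptions; with unique identifiers the maximum identifier is elected after n rounds):

def simple_ring_election(ids):
    """Synchronous rounds on a unidirectional ring; ids[i] sends to ids[(i+1) % n].

    Returns (leader identifier, number of rounds taken).
    """
    n = len(ids)
    buff = list(ids)                      # round 0: everyone sends its own id
    rounds = 0
    while True:
        rounds += 1
        incoming = [buff[(i - 1) % n] for i in range(n)]   # receive from predecessor
        buff = [None] * n
        for i in range(n):
            received = incoming[i]
            if received is None:
                continue
            if received == ids[i]:
                return ids[i], rounds     # own id came all the way around: leader
            if received > ids[i]:
                buff[i] = received        # forward larger identifiers
            # received < ids[i]: swallow it (send null)

print(simple_ring_election([3, 1, 4, 2]))   # (4, 4): id 4 returns after n rounds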

6.9 Leader Election In Synchronous Rings
• Simple Leader Election Algorithm
– Complexity analysis:
• Time complexity: O (n).
• Message complexity: O (n^2).

6.9 Leader Election In Synchronous Rings
• Simple Leader Election Algorithm
– Leader Election in a Synchronous Ring using Algorithm S_Elect_Leader_Simple

Figure: A synchronous ring of four processes with identifiers u = 1, 2, 3, 4, each holding a buffer (buff) and a status field. In the initial state (a) every process has its own identifier in its buffer and status unknown. During rounds (b) through (d) only the largest identifier, 4, keeps circulating while the smaller identifiers are replaced by null. After the fourth round (e), identifier 4 returns to its originator, which changes its status to leader; all other processes remain unknown.
6.9 Leader Election In Synchronous Rings
• Improved Leader Election Algorithm
– k = 0.
– Each process sends its identifier in messages to its
neighbors in both directions, intending that they will
travel 2^k hops and then return to their origin.
– If the identifier is proceeding in the outbound
direction, when a process on the path receives the
identifier from its neighbor, then:
• The process sends null to its outneighbor, if the received
identifier is less than its own identifier.
• The process sends the received identifier to its outneighbor,
if the received identifier is greater than its own identifier.

6.9 Leader Election In Synchronous Rings
• Improved Leader Election Algorithm
• The process declares itself as the leader, if the received
identifier is equal to its own identifier.
– If the identifier is proceeding in the inbound direction,
when a process on the path receives the identifier, it
sends the received identifier to its outgoing neighbor
on the path, if the received identifier is greater than its
own identifier.
– If the 2 original messages make it back to their origin,
then k <- k + 1; go to step 2.
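A high-level Python sketch of these phases (it simulates the effect of the 2^k-hop probes directly instead of exchanging individual messages; names and structure are assumptions): a process survives phase k only if its identifier exceeds every identifier within 2^k hops on either side, and it wins once its probes span the whole ring.

def improved_ring_election(ids):
    """Phase-by-phase simulation of the improved (probe-doubling) algorithm.

    In phase k, each still-active process conceptually sends its identifier
    2**k hops in both directions; a probe survives only if no identifier on
    its path is larger. A process whose probes span the entire ring and
    return becomes the leader. Identifiers are assumed unique.
    """
    n = len(ids)
    active = set(range(n))
    k = 0
    while True:
        hops = 2 ** k
        if hops >= n:
            # A surviving probe now travels all the way around the ring.
            winner = max(active, key=lambda i: ids[i])
            return ids[winner], k
        survivors = set()
        for i in active:
            neighbourhood = [ids[(i + d) % n] for d in range(1, hops + 1)]
            neighbourhood += [ids[(i - d) % n] for d in range(1, hops + 1)]
            if all(ids[i] > v for v in neighbourhood):
                survivors.add(i)          # both probes returned to process i
        active = survivors
        k += 1

print(improved_ring_election([5, 2, 8, 1, 7, 3, 6, 4]))   # (8, 3)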

6.9 Leader Election In Synchronous Rings
• Improved Leader Election Algorithm
Figure: Messages initiated at process i using the improved leader election algorithm. In phase k = 0 the identifier travels 1 hop in each direction and returns; in phase k = 1 it travels 2 hops; in phase k = 2 it travels 4 hops, doubling the reach in every phase.

6.9 Leader Election In Synchronous Rings
• Improved Leader Election Algorithm
– Complexity analysis
• Time complexity: O (n)
• Message complexity: O (n log n)

6.9 Leader Election In Synchronous Rings
• Improved Leader Election Algorithm

Figure: Leader election in a synchronous ring of four processes with identifiers u = 1, 2, 3, 4 using the improved algorithm. Each process keeps a phase counter k and two buffers, buff+ and buff-, for the messages travelling in the two directions; messages are triples of the form (identifier, direction, hop count), such as (4, out, 1). Panel (a) shows the initial states, where every process loads its own identifier into both buffers with status unknown; panels (b) through (d) show the buffer contents after the first, second, and third rounds, during which only the probes of the largest identifier, 4, survive and process 4 advances to phase k = 1.
6.10 Summary
• PRAM has played an important role in the introduction of
parallel programming paradigms and design techniques
that have been used in real parallel systems.
• A large number of PRAM algorithms for solving many
fundamental problems have been introduced and
efficiently implemented on real systems.
• An important characteristic of a message system is the
degree of synchrony, which reflects the different types of
timing information that can be used by an algorithm.

