Abstract Models
Advanced Computer Architecture and Parallel Processing Hesham El-Rewini & Mostapha Abd-El-Barr
6.1 The PRAM Model and Its Variations
• The PRAM model was introduced by Fortune and Wyllie
in 1978 for modeling idealized parallel computers in
which communication cost and synchronization
overhead are negligible.
• During a computational step, an active processor may
read a data value from a memory location, perform a
single operation and finally write back the result into a
memory location.
• This model is referred to as the shared memory, single
instruction, multiple data (SM SIMD) machine.
[Figure: The PRAM model — p processors P1..Pp, each with its own private memory, connected under a common control unit to a shared global memory.]
• Write conflicts must be resolved using a well-defined
policy such as:
– Common: all concurrent writes store the same value.
– Arbitrary: only one value selected arbitrarily is stored.
– Minimum: the value written by the processor with the smallest
index is stored.
– Reduction: all the values are reduced to only one value using
some reduction function such as sum, minimum, maximum, etc.
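As an illustration (not from the text), the four policies can be modeled by a small dispatch function over the set of concurrent writes to one cell; the function name, argument shapes, and the deterministic "arbitrary" choice below are our own:

```python
# Sketch only: `resolve_write` and its argument shapes are our own.
# `writes` maps processor index -> value that processor attempts to
# write into one shared cell during the same PRAM step.

def resolve_write(writes, policy, reduce_fn=None):
    """Return the value actually stored under the given policy."""
    if policy == "common":
        values = set(writes.values())
        if len(values) > 1:
            raise ValueError("common policy requires identical values")
        return values.pop()
    if policy == "arbitrary":
        # Any single write may win; we pick one deterministically here.
        return next(iter(writes.values()))
    if policy == "minimum":
        # The processor with the smallest index wins.
        return writes[min(writes)]
    if policy == "reduction":
        # Combine all values with the supplied function (sum, min, ...).
        return reduce_fn(writes.values())
    raise ValueError(f"unknown policy: {policy!r}")

writes = {3: 7, 1: 5, 2: 9}                    # P1, P2, P3 write concurrently
print(resolve_write(writes, "minimum"))        # P1's value: 5
print(resolve_write(writes, "reduction", sum)) # sum reduction: 21
```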
• The PRAM can be divided into the following subclasses:
– EREW PRAM: access to any memory cell is exclusive. It is the
most restrictive PRAM model.
– ERCW PRAM: this allows concurrent writes to the same memory
location by multiple processors, but read accesses remain
exclusive.
– CREW PRAM: concurrent read accesses allowed, but write
accesses are exclusive.
– CRCW PRAM: both concurrent read and write accesses are
allowed.
6.2 Simulating Multiple Accesses On An EREW PRAM
• The following broadcasting mechanism is followed:
– P1 reads x and makes it known to P2
– P1 and P2 make x known to P3 and P4, respectively, in parallel.
– P1, P2, P3 and P4 make x known to P5, P6, P7 and P8, respectively, in parallel.
– These eight processors then make x known to another eight, and so on.
• Since the number of processors having read x doubles in
each iteration, the procedure terminates in O (log p)
time.
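A minimal sketch of this doubling broadcast, assuming p processors indexed 0..p-1 and one synchronous round per iteration (the function and its bookkeeping are ours):

```python
# Sketch, assuming p processors indexed 0..p-1 and one synchronous
# round per iteration; `broadcast_erew` and its bookkeeping are ours.

def broadcast_erew(x, p):
    """Rounds needed until all p processors know x, with the set of
    informed processors doubling each round (exclusive reads/writes)."""
    knows = [False] * p
    knows[0] = True                    # P1 reads x from shared memory
    rounds = 0
    while not all(knows):
        informed = [i for i in range(p) if knows[i]]
        uninformed = [i for i in range(p) if not knows[i]]
        # Each informed processor passes x to a distinct uninformed
        # processor, so no memory cell is accessed twice in a round.
        for _, dst in zip(informed, uninformed):
            knows[dst] = True
        rounds += 1
    return rounds

print(broadcast_erew("x", 8))   # 3 rounds = log2(8)
```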
[Figure: Broadcasting x to processors P1..P8 on an EREW PRAM via an auxiliary array L; panels (a)-(d) show the set of processors that know x doubling in each step.]
6.3 Analysis of Parallel Algorithms
The complexity of a sequential algorithm is generally determined by its time and
space complexity.
The time complexity of an algorithm refers to its execution time as a function of the
problem’s size. Similarly, the space complexity refers to the amount of memory
required by the algorithm as a function of the size of the problem.
Time complexity is generally regarded as the most important measure of an algorithm's performance. An algorithm whose time complexity is bounded by a polynomial is called a polynomial-time algorithm.
• The performance of a parallel algorithm is measured
quantitatively as follows:
– Run time, which is defined as the time spent during the
execution of the algorithm
– Number of processors the algorithm uses to solve a problem
– The cost of the parallel algorithm, which is the product of the run
time and the number of processors.
Cost = RT * NP
The run time of a parallel algorithm is the length of the time period between the time
the first processor to begin execution starts and the time the last processor to finish
execution terminates.
The cost of a parallel algorithm is the total number of steps executed collectively by all processors.
A parallel algorithm is said to be cost optimal if its cost matches the lower bound on
the number of sequential operations to solve a given problem within a constant
factor. It follows that a parallel algorithm is not cost optimal if there exists a
sequential algorithm whose run time is smaller than the cost of the parallel algorithm.
• The NC-class and P-completeness
– A problem belongs to class P if a solution of the
problem can be obtained by a polynomial-time
algorithm.
– A problem belongs to class NP if the correctness of a
solution for the problem can be verified by a
polynomial-time algorithm.
– In parallel computation, NC, the class of well-parallelizable problems, is the class of problems solvable in polylogarithmic time using a polynomial number of processors.
– A problem is in the class P-complete if it is as hard to
parallelize as any problem in P.
[Figure: Venn diagram of complexity classes — NC and P-Complete contained in P, P contained in NP, and NP-Hard overlapping NP from outside.]
6.4 Computing Sum And All Sums
• Sum of an array of numbers on the EREW model
– Algorithm Sum_EREW
for i = 1 to log n do
    forall Pj, where 1 <= j <= n/2, do in parallel
        if (2j mod 2^i) = 0 then
            A[2j] <- A[2j] + A[2j - 2^(i-1)]
        endif
    endfor
endfor
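A sequential simulation of Sum_EREW may help check the index arithmetic; this sketch assumes n is a power of two and maps the pseudocode's 1-based A[1..n] onto a 0-based Python list:

```python
# Sequential simulation of Sum_EREW; assumes n is a power of two and
# maps the pseudocode's 1-based A[1..n] onto a 0-based Python list.
import math

def sum_erew(A):
    A = A[:]                                   # work on a copy
    n = len(A)
    for i in range(1, int(math.log2(n)) + 1):
        for j in range(1, n // 2 + 1):         # "in parallel" on a PRAM
            if (2 * j) % (2 ** i) == 0:
                # pseudocode A[2j] is A[2*j - 1] in 0-based indexing
                A[2 * j - 1] += A[2 * j - 1 - 2 ** (i - 1)]
    return A[-1]                               # the sum ends up in A[n]

print(sum_erew([5, 2, 10, 1, 8, 12, 7, 3]))    # 48
```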
• Sum of an array of numbers on the EREW
model
– Complexity analysis
• Run time, T(n) = O(log n)
• Number of processors, P(n) = n/2
• Cost, C(n) = O (n log n)
– Since a good sequential algorithm can sum the list of
n elements in O (n), this algorithm is not cost optimal.
[Figure: Sum_EREW applied to A = (5, 2, 10, 1, 8, 12, 7, 3). After iteration 1 (P1, P2, P3, P4): A = (5, 7, 10, 11, 8, 20, 7, 10); after iteration 2 (P2, P4): A = (5, 7, 10, 18, 8, 20, 7, 30); after iteration 3 (P4): A[8] = 48.]
• All partial sums of an array
– Complexity analysis
• Run time, T(n) = O (log n).
• Number of processors, P(n) = n - 1.
• Cost, C (n) = O (n log n).
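The partial-sums algorithm itself is not listed on the slide; the sketch below follows the standard EREW scheme these complexity figures describe (in iteration i, each active Pj adds A[j - 2^(i-1)] into A[j]), with names of our choosing:

```python
# Sketch of the standard EREW partial-sums scheme these figures
# describe; names are ours. In iteration i, each Pj with j > 2^(i-1)
# adds A[j - 2^(i-1)] into A[j]; after log n iterations A[j] holds
# the sum of the first j elements.
import math

def all_sums_erew(A):
    A = A[:]
    n = len(A)
    for i in range(1, int(math.log2(n)) + 1):
        d = 2 ** (i - 1)
        old = A[:]              # read phase of the synchronous step
        for j in range(d, n):   # these updates run in parallel on a PRAM
            A[j] = old[j] + old[j - d]
    return A

print(all_sums_erew([1, 2, 3, 4, 5, 6, 7, 8]))
# [1, 3, 6, 10, 15, 21, 28, 36]
```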
[Figure: Computing Partial Sums of an Array of 8 Elements. Active processors: P2, ..., P8 in round 1; P3, ..., P8 in round 2; P5, ..., P8 in round 3.]
6.5 Matrix Multiplication
• Using n^3 processors
– The algorithm consists of 2 steps:
• Each processor Pi,j,k computes the product A[i, k] * B[k, j] and stores it in C[i, j, k].
• The idea of algorithm Sum_EREW is applied along the k dimension n^2 times in parallel to compute C[i, j, n], where 1 <= i, j <= n.
– Complexity analysis
• Run time, T(n) = O(log n).
• Number of processors, P(n) = n^3.
• Cost, C(n) = O(n^3 log n).
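The two steps can be sketched as follows; the simulation below is sequential, so the parallel cost structure (one step for the products, O(log n) steps for the reduction) lives only in the comments:

```python
# Sketch of the two steps above; this simulation is sequential, so the
# PRAM cost model (one step for n^3 products, O(log n) steps for the
# reduction along k) lives only in the comments.

def matmult_pram(A, B):
    n = len(A)
    # Step 1: each (conceptual) processor P(i,j,k) computes
    # A[i][k] * B[k][j] and stores it in C[i][j][k] -- one PRAM step.
    C = [[[A[i][k] * B[k][j] for k in range(n)]
          for j in range(n)] for i in range(n)]
    # Step 2: reduce along k as in Sum_EREW -- O(log n) PRAM steps,
    # with the n^2 reductions running in parallel.
    return [[sum(C[i][j]) for j in range(n)] for i in range(n)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmult_pram(A, B))   # [[19, 22], [43, 50]]
```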
• This algorithm is not cost optimal, because an n x n matrix multiplication can be done sequentially in less than O(n^3) time (e.g., by Strassen's algorithm).
6.5 Matrix Multiplication
j
i P P
1,1,1k = 1 1,2,1
P P
2,1,1 2,2,1
j
i
P P
1,1,2k =2 1,2,2
P P
2,1,2 2,2,2
After Step 1
j
i P P1,2,2
k=
1,1,2 2
P P
2,1,2
2,2,2
After Step 2
Multiplying two 2 x 2 matrices using Algorithm MatMult_CREW
• Reducing the number of processors
– Modify MatMult_CREW as follows:
• Each processor Pi,j,k, where 1 <= k <= n/log n, computes the sum of log n products. This step produces n^3/log n partial sums.
• The partial sums produced in step 1 are added to produce the resulting matrix.
– Complexity analysis:
• Run time, T(n) = O(log n).
• Number of processors, P(n) = n^3/log n.
• Cost, C(n) = O(n^3).
6.6 Sorting
• The algorithm consists of 2 steps:
– Each row of processors i computes C[i], the number of elements smaller than A[i]: each processor Pi,j compares A[i] and A[j] and updates C[i] appropriately.
– Each A[i] is then placed in its sorted position, C[i] + 1.
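A sketch of this enumeration (rank) sort; the tie-break by processor index is our own addition so that equal keys receive distinct positions:

```python
# Sketch of enumeration (rank) sort on an n x n processor grid; the
# tie-break by index is our own addition so equal keys get distinct
# ranks.

def enumeration_sort(A):
    n = len(A)
    # Step 1: each P(i,j) contributes 1 to C[i] when A[j] < A[i]
    # (concurrent writes to C[i] combined by summation on a CRCW PRAM).
    C = [sum(1 for j in range(n)
             if A[j] < A[i] or (A[j] == A[i] and j < i))
         for i in range(n)]
    # Step 2: A[i] moves to position C[i] (0-based here; C[i] + 1 in
    # the slides' 1-based numbering).
    result = [None] * n
    for i in range(n):
        result[C[i]] = A[i]
    return result

print(enumeration_sort([6, 1, 3]))   # C = [2, 0, 1] -> [1, 3, 6]
```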
• Complexity analysis
– Run time, T(n) = O(1).
– Number of processors, P(n) = n^2.
– Cost, C(n) = O(n^2), which is not cost optimal, since sequential sorting takes only O(n log n).
[Figure: Sorting A = (6, 1, 3) with a 3 x 3 grid of processors Pi,j. After step 1, C = (2, 0, 1); after step 2, A = (1, 3, 6).]
6.7 Message Passing Model
• Synchronous message passing model
– The complexity of algorithms following this model is measured quantitatively using:
• Message complexity:
– Defined as the number of messages sent between neighbors
during the execution of the algorithm.
• Time complexity:
– Defined as the time spent during the execution of the algorithm.
6.8 Leader Election Problem
• A leader among n processors is the processor
recognized by the other processors as distinguished to
perform a special task.
6.9 Leader Election In Synchronous Rings
• Simple leader election algorithm
– Each process sends its identifier to its outgoing
neighbor.
– When a process receives an identifier from its
incoming neighbor, then:
• The process sends null to its outgoing neighbor, if the
received identifier is less than its own identifier.
• The process sends the received identifier to its outgoing
neighbor, if the received identifier is greater than its own
identifier.
• The process declares itself as the leader, if the received
identifier is equal to its own identifier.
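The rounds above can be simulated directly; this sketch (our own naming and message accounting) also counts the messages sent, illustrating the O(n^2) message bound:

```python
# Round-by-round simulation sketch of the simple algorithm on a
# synchronous unidirectional ring (process i sends to (i+1) mod n);
# the function name and message accounting are our own.

def elect_leader_simple(ids):
    """Return (leader_id, rounds, messages) for distinct ids."""
    n = len(ids)
    outgoing = ids[:]              # round 1: each process sends its id
    messages = 0
    for rnd in range(1, n + 1):
        messages += sum(1 for m in outgoing if m is not None)
        incoming = [outgoing[(i - 1) % n] for i in range(n)]
        outgoing = [None] * n
        for i, received in enumerate(incoming):
            if received is None:
                continue
            if received == ids[i]:
                return ids[i], rnd, messages   # own id came back: leader
            if received > ids[i]:
                outgoing[i] = received         # forward the larger id
            # received < own id: send null (message dies here)
    raise RuntimeError("no leader elected")

print(elect_leader_simple([3, 1, 4, 2]))   # (4, 4, 8)
```

The maximum identifier always makes a full circuit, so the leader is found after exactly n rounds.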
– Complexity analysis
• Time complexity: O (n)
• Message complexity: O (n^2)
[Figure: Leader Election in a Synchronous Ring using Algorithm S_Elect_Leader_Simple — each process's identifier u, buffer buff, and status shown in panels (a) initial states through (e) after the fourth round, when the process with u = 4 becomes the leader.]
• Improved leader election algorithm
– k <- 0
– Each process sends its identifier in messages to its neighbors in both directions, intending that they will travel 2^k hops and then return to their origin.
– If the identifier is proceeding in the outbound direction, when a process on the path receives the identifier from its neighbor, then:
• The process sends null to its outneighbor, if the received identifier is less than its own identifier.
• The process sends the received identifier to its outneighbor, if the received identifier is greater than its own identifier.
• The process declares itself as the leader, if the received
identifier is equal to its own identifier.
– If the identifier is proceeding in the inbound direction,
when a process on the path receives the identifier, it
sends the received identifier to its outgoing neighbor
on the path, if the received identifier is greater than its
own identifier.
– If the 2 original messages make it back to their origin, then k <- k + 1; go to step 2.
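A phase-level sketch of this doubling scheme (the classic Hirschberg-Sinclair pattern): instead of relaying messages hop by hop, each phase k is collapsed into a direct test of whether a process's identifier beats every identifier within 2^k hops, which selects the same survivors as the message relay:

```python
# Phase-level sketch of the doubling scheme above (the classic
# Hirschberg-Sinclair pattern). Each phase k is collapsed into a
# direct test: a process stays active only if its id beats every id
# within 2^k hops -- the same survivors the message relay selects.

def elect_leader_improved(ids):
    """Return (leader_id, phases) for distinct ids on a ring."""
    n = len(ids)
    active = set(range(n))
    k = 0
    while True:
        d = 2 ** k
        if d >= n:
            # A token that would travel all the way around comes home:
            # the remaining maximum declares itself leader.
            return max(ids[i] for i in active), k
        survivors = set()
        for i in active:
            window = [ids[(i + h) % n]
                      for h in range(-d, d + 1) if h != 0]
            if ids[i] > max(window):
                survivors.add(i)
        active = survivors
        k += 1

print(elect_leader_improved([5, 1, 8, 3, 7, 2, 6, 4]))   # (8, 3)
```

At most one process in any stretch of 2^k + 1 survives phase k, which is what caps the total message count at O(n log n).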
[Figure: In the improved algorithm, process i sends its identifier 2^k hops in each direction, for k = 0, 1, 2, ...]
• Improved leader election algorithm
– Complexity analysis
• Time complexity: O (n)
• Message complexity: O (n log n)
[Figure: Leader Election in a Synchronous Ring using the improved algorithm — panels (a) initial states through (d) after the third round, showing each process's identifier u, phase k, buffers buff+ and buff-, and status.]