[Figure: Processes A, B, and C accessing a shared variable in common memory]
Critical Section
• A Critical Section (CS) is a code segment that accesses a shared variable, which must be executed by only one process at a time and which, once started, must be completed without interruption.
Critical Section Requirements
• A CS should satisfy the following requirements (a minimal sketch follows the list):
Mutual Exclusion
At most one process executes the CS at a time.
No deadlock in waiting
No circular wait by two or more processes.
No preemption
No interruption until completion.
Eventual Entry
Once a process has entered the CS, it must exit after completion.
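To make the requirements concrete, here is a minimal sketch in C, assuming POSIX threads; the mutex plays the role of the CS guard, and the shared counter and thread names are purely illustrative.

#include <pthread.h>
#include <stdio.h>

static int shared = 0;                        /* the shared variable */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    for (int n = 0; n < 100000; n++) {
        pthread_mutex_lock(&lock);            /* enter CS: at most one thread inside */
        shared++;                             /* critical section body */
        pthread_mutex_unlock(&lock);          /* exit CS so waiters eventually enter */
    }
    return NULL;
}

int main(void)
{
    pthread_t a, b, c;                        /* processes A, B, C from the figure */
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_create(&c, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    pthread_join(c, NULL);
    printf("shared = %d\n", shared);          /* 300000 when mutual exclusion holds */
    return 0;
}

Without the lock/unlock pair the increments interleave and the final count is unpredictable, which is exactly the violation of mutual exclusion the requirements rule out.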
Protected Access
• The granularity of the CS affects performance.
• If the CS is too large, it may limit parallelism due to excessive waiting by processes.
• If the CS is too small, it may add unnecessary code complexity and software overhead.
Four Operational Modes
• Multiprogramming
• Multiprocessing
• Multitasking
• Multithreading
Multiprogramming
• Multiple independent programs run on a single processor or multiprocessor by time-sharing the system resources.
• When a program enters I/O mode, the processor switches to another program.
Multiprocessing
• When multiprogramming is implemented at the
process level on a multiprocessor, it is called
multiprocessing.
• Two types of multiprocessing:
If interprocessor communication is handled at the instruction level, the multiprocessor operates in MIMD mode.
If interprocessor communication is handled at the program, subroutine, or procedure level, the multiprocessor operates in MPMD mode.
Multitasking
• A single program can be partitioned into multiple interrelated tasks that execute concurrently on a multiprocessor.
• Multitasking thus provides parallel execution of two or more parts of a single program.
Multithreading
• The traditional UNIX/OS has a single-threaded kernel, in which only one process can receive OS kernel service at a time.
• On a multiprocessor we extend the single kernel to be multithreaded.
• The purpose is to allow multiple threads of lightweight processes to share the same address space.
Partitioning and Replication
• The goal of parallel processing is to exploit parallelism as much as possible with the lowest overhead.
• Program partitioning is a technique for decomposing a large program and data set into many small pieces for parallel execution by multiple processors.
• Program partitioning involves both programmers and compilers.
• Program replication refers to duplication of the same program code for parallel execution on multiple processors over different data sets.
Scheduling and Synchronization
• Scheduling is further classified as:
Static Scheduling
• Conducted at post-compile time.
• Its advantage is low overhead, but its shortcoming is a possible mismatch with the run-time profile of each task.
Dynamic Scheduling
• Catches run-time conditions.
• Requires fast context switching, preemption, and much more OS support.
• Its advantage is better resource utilization, at the expense of higher scheduling overhead.
• One can use atomic memory operations such as Test&Set and Fetch&Add to achieve synchronization, as sketched below.
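As a minimal sketch, assuming C11 atomics (<stdatomic.h>): atomic_flag_test_and_set behaves as Test&Set and atomic_fetch_add as Fetch&Add; the function names here are illustrative.

#include <stdatomic.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;   /* Test&Set lock word */
static atomic_int counter = 0;

void enter_cs(void)
{
    /* Test&Set: atomically set the flag and return its previous value;
       spin while another process already holds the lock */
    while (atomic_flag_test_and_set(&lock))
        ;                                     /* busy-wait */
}

void exit_cs(void)
{
    atomic_flag_clear(&lock);                 /* release the lock */
}

void add_one(void)
{
    /* Fetch&Add: a single atomic read-modify-write, no lock needed */
    atomic_fetch_add(&counter, 1);
}

The spin lock gives mutual exclusion directly; for simple counters, Fetch&Add avoids the lock altogether.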
Cache Coherence & Protection
• The multicache coherence problem demands an invalidation or update after each write operation.
Message Passing Model
• Two processes D and E residing at different processor nodes may communicate with each other by passing messages through a direct network.
• The messages may be instructions, data, synchronization, or interrupt signals.
• Multicomputers are considered loosely coupled multiprocessors.
IPC using Message Passing
[Figure: Process D and Process E exchanging messages via Send/Receive]
Synchronous Message Passing
• No shared memory.
• No mutual exclusion.
• Synchronization of sender and receiver processes, just like a telephone call.
• No buffer is used.
• If one process is ready to communicate and the other is not, the one that is ready must be blocked.
Asynchronous Message Passing
• Does not require that message sending and receiving be synchronized in time or space.
• Arbitrary communication delay may be experienced, because the sender may not know if and when the message has been received until an acknowledgement arrives from the receiver.
• This scheme is like a postal service using a mailbox, with no synchronization between senders and receivers (see the sketch below).
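The section does not name a message-passing library; as one concrete illustration, the C sketch below uses MPI, where MPI_Ssend blocks until the receiver is engaged (the telephone-call rendezvous above), while the nonblocking MPI_Isend would give the mailbox-style asynchronous behavior.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)               /* run with: mpirun -np 2 ./a.out */
{
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {                          /* process D: sender */
        value = 42;
        /* synchronous send: completes only once the matching receive starts */
        MPI_Ssend(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {                   /* process E: receiver */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("E received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}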
Data Parallel Model
• Used in SIMD computers.
• Parallelism is handled by hardware synchronization and flow control.
• Fortran 90 is a data-parallel language.
• Requires predistributed data sets.
Data Parallelism
• This technique is used in array processors (SIMD).
• Issue: matching the problem size with the machine size.
• The Connection Machine, for example, used 16,384 PEs working in a single configuration.
• No mutual exclusion problem.
Array Language Extensions
• Various data-parallel languages are used.
• Parallelism is represented by high-level data types.
• Examples: CFD for the Illiac IV, DAP Fortran for the Distributed Array Processor, and C* for the Connection Machine.
• The target is to match the number of PEs to the problem size.
Compiler Support
• The compiler can be used for optimization purposes.
• It decides the dimensioning of arrays.
Object Oriented Model
• Objects are dynamically created and manipulated.
• Processing is performed by sending and receiving messages among objects.
Concurrent OOP
• OOP is needed for its abstraction and reusability concepts.
• Objects are program entities that encapsulate data and operations in a single unit.
• COOP supports concurrent manipulation of objects.
Actor Model
• This is a framework for Concurrent OOP.
• Actors are independent components.
• They communicate via asynchronous message passing.
• Three primitives: create, send-to, and become.
Parallelism in COOP
• Three common patterns of parallelism:
1) Pipeline concurrency
2) Divide and conquer
3) Cooperative problem solving
Functional and Logic Model
• Functional programming languages: Lisp, SISAL, and Strand 88.
• Logic programming languages: Concurrent Prolog and Parlog.
Functional Programming Model
• Should not produce any side effects.
• No concept of storage, assignment, or branching.
• Single-assignment and dataflow languages are functional in nature.
Logic Programming Models
• Used for knowledge processing from large databases.
• Supports an implicit search strategy.
• AND-parallel execution and OR-parallel reduction techniques are used.
• Used in artificial intelligence.
Parallel Language and Compilers
• A programming environment is a collection of software tools and system support.
• A parallel software programming environment is needed.
• Users are still forced to focus on hardware details rather than expressing parallelism through high-level abstractions.
Language Features For Parallelism
• Optimization Features
• Availability Features
• Synchronization/communication Features
• Control Of Parallelism
• Data Parallelism Features
• Process Management Features
Optimization Features
• Control dependence: a dependence that arises as a result of control flow.
  S1: IF (T.EQ.0.0) GOTO S3
  S2: A = A / T
  S3: CONTINUE
Here S2 executes only when the branch at S1 is not taken, so S2 is control dependent on S1.
Data Dependence Classification
• Dependence relations are similar to hardware hazards (which may cause pipeline stalls):
– True dependences (also called flow dependences), denoted S1 δ S2, are the same as RAW (Read After Write) hazards.
– Anti dependences, denoted S1 δ⁻¹ S2, are the same as WAR (Write After Read) hazards.
– Output dependences, denoted S1 δ° S2, are the same as WAW (Write After Write) hazards.
  Flow (S1 δ S2):     S1: X = …    S2: … = X
  Anti (S1 δ⁻¹ S2):   S1: … = X    S2: X = …
  Output (S1 δ° S2):  S1: X = …    S2: X = …
• When the data object is an array indexed by multidimensional subscripts, dependence is more difficult to determine at compile time.
• Dependence analysis is the process of detecting all data dependences in a program.
Dependence Testing
• Data dependence testing for arrays is difficult because it must decide whether two array references can access the same memory location.
• We consider subscripted references here.
Example:
DO i1 = L1, U1
  DO i2 = L2, U2
    ...
    DO in = Ln, Un
S1      A(f1(i1,...,in),...,fm(i1,...,in)) = ...
S2      ... = A(g1(i1,...,in),...,gm(i1,...,in))
    ENDDO
    ...
ENDDO
Iteration Space
• The n-dimensional Cartesian space for n loops is called the iteration space.
• Example (next slide).
Iteration Space Example
DO I = 1, 3
  DO J = 1, I
S1    A(I,J) = A(J,I)
  ENDDO
ENDDO
• The iteration space of the statement at S1 is {(1,1), (2,1), (2,2), (3,1), (3,2), (3,3)}.
• At iteration i = (2,1) the value of A(1,2) is assigned to A(2,1).
Dependence Equation (flow dependence)
DO I = 1, N
S1  A(f(I)) = A(g(I))
ENDDO
To prove flow dependence: for which values of α < β is f(α) = g(β)?
DO I = 1, N
S1  A(I+1) = A(I)
ENDDO
Here f(α) = g(β) gives α + 1 = β, which has the solution α − β = −1, so a flow dependence exists.
Dependence Equation (anti dependence)
DO I = 1, N
S1  A(f(I)) = A(g(I))
ENDDO
To prove anti dependence: for which values of α > β is f(α) = g(β)?
DO I = 1, N
S1  A(I+1) = A(I)
ENDDO
Here α + 1 = β has no solution with α > β, so there is no anti dependence (both checks are sketched in code below).
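Both dependence equations can be checked mechanically. The C sketch below brute-forces f(α) = g(β) over the iteration range for affine subscripts of the form a·I + c; the function name and test values are illustrative.

#include <stdbool.h>
#include <stdio.h>

/* Does f(alpha) = a1*alpha + c1 (the def) ever equal
   g(beta) = a2*beta + c2 (the use) within L..U?
   Flow dependence needs alpha < beta; anti dependence needs alpha > beta. */
static bool has_dependence(int a1, int c1, int a2, int c2,
                           int L, int U, bool flow)
{
    for (int alpha = L; alpha <= U; alpha++)
        for (int beta = L; beta <= U; beta++)
            if ((flow ? alpha < beta : alpha > beta) &&
                a1 * alpha + c1 == a2 * beta + c2)
                return true;
    return false;
}

int main(void)
{
    /* S1: A(I+1) = A(I), i.e. f(I) = I + 1 and g(I) = I, for I = 1..100 */
    printf("flow: %d\n", has_dependence(1, 1, 1, 0, 1, 100, true));   /* 1 */
    printf("anti: %d\n", has_dependence(1, 1, 1, 0, 1, 100, false));  /* 0 */
    return 0;
}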
Dependence Direction Vector
• Definition: Suppose that there is a dependence from S1 on iteration i to S2 on iteration j; then the dependence direction vector D(i, j) is defined componentwise as
  D(i, j)_k = "<" if d(i, j)_k > 0
  D(i, j)_k = "=" if d(i, j)_k = 0
  D(i, j)_k = ">" if d(i, j)_k < 0
Distance Vector
• The distance vector is defined in terms of the difference between sink and source iteration vectors: d(i, j)_k = j_k − i_k.
Example: for the loop below, the direction vector is (<, =) and the distance vector is (1, 0).
DO I = 1, 3
  DO J = 1, I
S1    A(I+1,J) = A(I,J)
  ENDDO
ENDDO
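A small C sketch, assuming the source and sink iterations of the dependence are known, maps each distance component to its direction symbol exactly as defined above.

#include <stdio.h>

/* For A(I+1,J) = A(I,J): the write at (i,j) touches A(i+1,j), which the
   read at iteration (i+1,j) uses, so the sink is one I-iteration later. */
int main(void)
{
    int src[2] = {1, 1};              /* a source iteration (i, j)   */
    int snk[2] = {2, 1};              /* its sink iteration (i+1, j) */
    for (int k = 0; k < 2; k++) {
        int d = snk[k] - src[k];      /* distance component d(i,j)_k */
        char dir = d > 0 ? '<' : (d == 0 ? '=' : '>');
        printf("d[%d] = %d  direction '%c'\n", k, d, dir);
    }
    return 0;                         /* prints d = (1,0), direction (<,=) */
}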
Subscript Separability and Partitioning
Subscript Separability
• A subscript is separable if its indices do not occur in any other subscript.
• If two different subscripts contain the same index, we say they are coupled.
Separability and Coupled Subscripts
DO I = …
  DO J = …
    DO K = …
S1      A(I,J,J) = A(I,J,K)
    ENDDO
  ENDDO
ENDDO
• When testing multidimensional arrays, subscripts are separable if their indices do not occur in other subscripts.
• ZIV (zero index variable) and SIV (single index variable) subscript pairs are separable.
• MIV (multiple index variable) pairs may give rise to coupled subscript groups.
• In S1, the SIV pair (I, I) is separable, while the MIV pair (J, K) shares J with the second subscript and so gives coupled indices.
Dependence Testing Overview
1. Partition subscripts into separable and coupled groups
2. Classify each subscript as ZIV, SIV, or MIV
3. For each separable subscript, apply the applicable single
subscript test (ZIV, SIV, or MIV) to prove independence or
produce direction vectors for dependence
4. For each coupled group, apply a multiple subscript test
5. If any test yields independence, no further testing needed
6. Otherwise, merge all dependence vectors into one set
ZIV Test
• The ZIV test compares two loop-invariant subscripts.
• If the expressions are proved unequal, no dependence can exist.
DO I = 1, N
S1  A(5) = A(6)
ENDDO

K = 10
DO I = 1, N
S1  A(5) = A(K)
ENDDO

K = 10
DO I = 1, N
  DO J = 1, N
S1    A(I,5) = A(I,K)
  ENDDO
ENDDO
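Once both subscripts are known loop-invariant values, the ZIV test is a single comparison; the helper below is an illustrative sketch.

#include <stdbool.h>
#include <stdio.h>

/* ZIV test: neither subscript mentions a loop index, so the pair is
   independent exactly when the two constant values differ. */
static bool ziv_independent(int c1, int c2)
{
    return c1 != c2;
}

int main(void)
{
    printf("%d\n", ziv_independent(5, 6));    /* A(5) vs A(6): independent        */
    printf("%d\n", ziv_independent(5, 10));   /* A(5) vs A(K), K = 10: independent */
    return 0;
}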
Strong SIV Test
• Requires subscript pairs of the form aI + c1 and aI' + c2 for the def and use, respectively.
• The dependence equation aI + c1 = aI' + c2 has a solution if the dependence distance d = (c1 − c2)/a is an integer and |d| ≤ U − L, with L and U the loop bounds.
DO I = 1, N
S1  A(I+1) = A(I)
ENDDO

DO I = 1, N
S1  A(2*I+2) = A(2*I)
ENDDO

DO I = 1, N
S1  A(I+N) = A(I)
ENDDO
Strong SIV Test (formulation)
The strong SIV test applies when
• f(...) = a·i_k + c1 and g(…) = a·i_k + c2.
• Plug in α, β and solve for dependence: β − α = (c1 − c2)/a.
• A dependence exists from S1 to S2 if
  β − α is an integer, and
  |β − α| ≤ U_k − L_k.
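A C sketch of the strong SIV test exactly as formulated above; the function name and the driver's loop bounds are illustrative.

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Strong SIV: subscripts a*I + c1 (def) and a*I + c2 (use) in a loop
   L..U depend iff d = (c1 - c2)/a is an integer and |d| <= U - L. */
static bool strong_siv_dependent(int a, int c1, int c2, int L, int U)
{
    if ((c1 - c2) % a != 0)
        return false;                 /* non-integer distance: independent */
    int d = (c1 - c2) / a;
    return abs(d) <= U - L;           /* distance must fit within the loop */
}

int main(void)
{
    /* A(I+1)   = A(I):    a=1, c1=1, c2=0 -> d = 1, dependent  */
    printf("%d\n", strong_siv_dependent(1, 1, 0, 1, 100));
    /* A(2*I+2) = A(2*I):  a=2, c1=2, c2=0 -> d = 1, dependent  */
    printf("%d\n", strong_siv_dependent(2, 2, 0, 1, 100));
    /* A(I+N)   = A(I), N=100: d = 100 > U - L, independent     */
    printf("%d\n", strong_siv_dependent(1, 100, 0, 1, 100));
    return 0;
}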
Weak-Zero SIV Test
• Requires subscript pairs of the form a1I + c1 and a2I' + c2 with either a1 = 0 or a2 = 0.
• If a2 = 0, the dependence equation I = (c2 − c1)/a1 has a solution if (c2 − c1)/a1 is an integer and L ≤ (c2 − c1)/a1 ≤ U.
• If a1 = 0, the case is similar.
DO I = 1, N
S1  A(I) = A(1)
ENDDO

DO I = 1, N
S1  A(2*I+2) = A(2)
ENDDO

DO I = 1, N
S1  A(1) = A(I) + A(1)
ENDDO
Weak-Zero SIV Test (formulation)
The weak-zero SIV test applies when
• f(...) = a·i_k + c1 and g(…) = c2.
• Plug in α, β and solve for dependence: α = (c2 − c1)/a.
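And the matching weak-zero SIV check, again an illustrative sketch under the same conventions:

#include <stdbool.h>
#include <stdio.h>

/* Weak-zero SIV: subscripts a*I + c1 (def) and constant c2 (use, a2 = 0)
   depend iff I = (c2 - c1)/a is an integer lying within L..U. */
static bool weak_zero_siv_dependent(int a, int c1, int c2, int L, int U)
{
    if ((c2 - c1) % a != 0)
        return false;
    int i = (c2 - c1) / a;
    return L <= i && i <= U;
}

int main(void)
{
    /* A(I) = A(1):      a=1, c1=0, c2=1 -> I = 1, dependent (first iteration) */
    printf("%d\n", weak_zero_siv_dependent(1, 0, 1, 1, 100));
    /* A(2*I+2) = A(2):  a=2, c1=2, c2=2 -> I = 0, outside 1..100, independent */
    printf("%d\n", weak_zero_siv_dependent(2, 2, 2, 1, 100));
    return 0;
}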
Control Flow Graph (CFG)
1. sum = 0
2. i = 1
3. if i > n goto 15   (T: branch to 15; F: fall through to 4)
4. t1 = addr(a) - 4
5. t2 = i*4
6. t3 = t1[t2]
7. t4 = addr(a) - 4
8. t5 = i*4
9. t6 = t4[t5]
10. t7 = t3*t6
11. t8 = sum + t7
12. sum = t8
13. i = i + 1
14. goto 3
15. …
Common Subexpression Elimination
Before:
1. sum = 0
2. i = 1
3. if i > n goto 15
4. t1 = addr(a) - 4
5. t2 = i*4
6. t3 = t1[t2]
7. t4 = addr(a) - 4
8. t5 = i*4
9. t6 = t4[t5]
10. t7 = t3*t6
11. t8 = sum + t7
12. sum = t8
13. i = i + 1
14. goto 3
15. …
After (t4, t5, and t6 recompute t1, t2, and t3, so 10 is replaced by 10a and the dead instructions 7–10 are removed):
1. sum = 0
2. i = 1
3. if i > n goto 15
4. t1 = addr(a) - 4
5. t2 = i*4
6. t3 = t1[t2]
10a. t7 = t3*t3
11. t8 = sum + t7
12. sum = t8
13. i = i + 1
14. goto 3
15. …
Copy Propagation
Before:
1. sum = 0
2. i = 1
3. if i > n goto 15
4. t1 = addr(a) - 4
5. t2 = i * 4
6. t3 = t1[t2]
10a. t7 = t3 * t3
11. t8 = sum + t7
12. sum = t8
13. i = i + 1
14. goto 3
15. …
After (the copy sum = t8 is propagated, giving 11a; the dead instructions 11 and 12 are removed):
1. sum = 0
2. i = 1
3. if i > n goto 15
4. t1 = addr(a) - 4
5. t2 = i * 4
6. t3 = t1[t2]
10a. t7 = t3 * t3
11a. sum = sum + t7
13. i = i + 1
14. goto 3
15. …
Invariant Code Motion
Before:
1. sum = 0
2. i = 1
3. if i > n goto 15
4. t1 = addr(a) - 4
5. t2 = i * 4
6. t3 = t1[t2]
10a. t7 = t3 * t3
11a. sum = sum + t7
13. i = i + 1
14. goto 3
15. …
After (the loop-invariant computation of t1 is hoisted out of the loop as 2a):
1. sum = 0
2. i = 1
2a. t1 = addr(a) - 4
3. if i > n goto 15
5. t2 = i * 4
6. t3 = t1[t2]
10a. t7 = t3 * t3
11a. sum = sum + t7
13. i = i + 1
14. goto 3
15. …
Strength Reduction
Before:
1. sum = 0
2. i = 1
2a. t1 = addr(a) - 4
3. if i > n goto 15
5. t2 = i * 4
6. t3 = t1[t2]
10a. t7 = t3 * t3
11a. sum = sum + t7
13. i = i + 1
14. goto 3
15. …
After (the multiplication i * 4 is reduced to an addition: t2 is initialized once at 2b and stepped by 4 at 11b):
1. sum = 0
2. i = 1
2a. t1 = addr(a) - 4
2b. t2 = i * 4
3. if i > n goto 15
6. t3 = t1[t2]
10a. t7 = t3 * t3
11a. sum = sum + t7
11b. t2 = t2 + 4
13. i = i + 1
14. goto 3
15. …
Constant Propagation and Dead Code Elimination
Before (the loop test has already been rewritten in terms of t2: 2c computes t9 = n * 4 and 3a compares t2 with t9, eliminating the induction variable i from the test):
1. sum = 0
2. i = 1
2a. t1 = addr(a) - 4
2b. t2 = i * 4
2c. t9 = n * 4
3a. if t2 > t9 goto 15
6. t3 = t1[t2]
10a. t7 = t3 * t3
11a. sum = sum + t7
11b. t2 = t2 + 4
14. goto 3a
15. …
After (since i = 1, constant propagation folds t2 = i * 4 into 2d: t2 = 4; instructions 2 and 2b become dead and disappear in the new CFG below):
1. sum = 0
2. i = 1
2a. t1 = addr(a) - 4
2b. t2 = i * 4
2c. t9 = n * 4
2d. t2 = 4
3a. if t2 > t9 goto 15
6. t3 = t1[t2]
10a. t7 = t3 * t3
11a. sum = sum + t7
11b. t2 = t2 + 4
14. goto 3a
15. …
New Control Flow Graph
1. sum = 0
2. t1 = addr(a) - 4
3. t9 = n * 4
4. t2 = 4
5. if t2 > t9 goto 11   (T: branch to 11; F: fall through to 6)
6. t3 = t1[t2]
7. t7 = t3 * t3
8. sum = sum + t7
9. t2 = t2 + 4
10. goto 5
11. …
Analysis and Optimizing Transformations
Before:
1. a = y + 2
2. z = x + w
3. x = y + 2
4. z = b + c
5. b = y + 2
After (the first assignment to z is dead, and x = y + 2 and b = y + 2 reuse the value already in a):
1'. a = y + 2
2'. x = a
3'. z = b + c
4'. b = a
Example 3: Local Constant Propagation
Assume a, k, t3, and t4 are used beyond the basic block.
Before:
1. t1 = 1
2. a = t1
3. t2 = 1 + a
4. k = t2
5. t3 = cvttoreal(k)
6. t4 = 6.2 + t3
7. t3 = t4
After:
1'. a = 1
2'. k = 2
3'. t4 = 8.2
4'. t3 = 8.2
Global Optimizations --- Require Analysis Outside of Basic Blocks
• Global common subexpression elimination
• Dead code elimination
• Constant propagation
• Loop optimizations
– Loop invariant code motion
– Strength reduction
– Induction variable elimination
Vectorization and Parallelization
• Vectorization: the process of converting scalar loop operations into equivalent vector instructions.
• Parallelization: converting sequential code into parallel code.
• Vectorizing compiler: a compiler that performs vectorization automatically.
• Vector hardware is used to speed up vector operations.
• Multiprocessors are used to speed up parallel code.
Vectorization Methods
• Use of temporary storage
• Loop Interchange
• Loop Distribution
• Vector Reduction
• Node Splitting
Vectorization Example (loop distribution):
DO I = 1, N
S1 A(I+1) = B(I) + C
S2 D(I) = A(I) + E
ENDDO
• transformed to:
DO I = 1, N
S1 A(I+1) = B(I) + C
ENDDO
DO I = 1, N
S2 D(I) = A(I) + E
ENDDO
• leads to:
S1 A(2:N+1) = B(1:N) + C
S2 D(1:N) = A(1:N) + E
Vector Reduction
• Theme: produce a scalar value from one or more data arrays.
Example: the sum, product, maximum, or minimum of all elements in a single array.
Example:
DO 40 I = 1, N
  A(I) = B(I) + C(I)
  S = S + A(I)
  AMAX = MAX(AMAX, A(I))
40 CONTINUE
After vector reduction:
A(1:N) = B(1:N) + C(1:N)
S = S + SUM(A(1:N))
AMAX = MAX(AMAX, MAXVAL(A(1:N)))
Node Splitting Example
The loop
DO 50 I = 2, N
S1:  T(I) = A(I-1) + A(I+1)
S2:  A(I) = B(I) + C(I)
50 CONTINUE
contains a dependence cycle: S1 reads A(I+1), which S2 overwrites on a later iteration, and S2 writes the A(I-1) that S1 reads later. Node splitting breaks the cycle by copying A(I+1) into a new array X:
DO 50 I = 2, N
S1a: X(I) = A(I+1)
S2:  A(I) = B(I) + C(I)
S1b: T(I) = A(I-1) + X(I)
50 CONTINUE
Thus the new loop structure can be vectorized as follows:
S1a: X(2:N) = A(3:N+1)
S2:  A(2:N) = B(2:N) + C(2:N)
S1b: T(2:N) = A(1:N-1) + X(2:N)
Vectorization Inhibitors
• Conditional statements that depend on run-time conditions
• Multiple loop entries
• Function or subroutine calls
• Recurrences and their variations
Code Parallelization
• A single program is divided into many threads for parallel execution in a multiprocessor environment.
• Each thread executes on a single processor.
• Parallelization is applied to the outer loop, provided it carries no dependence; here the inner J loop carries a dependence, so only the outer I loop runs in parallel (a C rendering follows the example).
Example:
DoAll 20 I = 2, N
  Do 10 J = 2, N
S1:   A(I,J) = (A(I,J-1) + A(I,J+1))/2
  EndDo
EndAll
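In C with OpenMP, the same loop nest might be written as sketched below (compile with -fopenmp; the array size is illustrative): the pragma parallelizes the dependence-free outer I loop while the inner J loop stays serial.

#include <stdio.h>

#define N 512
static double A[N + 1][N + 2];        /* sized so J-1 and J+1 stay in bounds */

void smooth(void)
{
    #pragma omp parallel for          /* the DOALL: I-iterations are independent */
    for (int i = 2; i <= N; i++)
        for (int j = 2; j <= N; j++)  /* serial: A[i][j-1] was written at j-1 */
            A[i][j] = (A[i][j - 1] + A[i][j + 1]) / 2.0;
}

int main(void)
{
    smooth();
    printf("%f\n", A[2][2]);
    return 0;
}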
Parallelization Inhibitors
• Loop dependence
• Function calls
• Multiple entries
Code Generation and Scheduling
• Directed Acyclic Graph(DAG)
• Register Allocation
• List Scheduling
• Cycle Scheduling
• Trace Scheduling Compilation
DAG
• A DAG is constructed from each basic block.
• A basic block is straight-line code: control enters at the top and leaves at the bottom, with no branching back (no backtracking).
• We therefore construct one DAG per basic block.
Example of Constructing the DAG
[Figure: step-by-step construction of the DAG for the basic block]
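To make the construction concrete, here is a toy C sketch that builds the DAG for the statements t3 = a[i]; t6 = a[i]; t7 = t3 * t6 from the earlier basic block. Looking for an identical existing node before creating a new one is what merges common subexpressions into a single shared node; all names and the encoding are illustrative.

#include <stdio.h>
#include <string.h>

#define MAXN 64

struct node {
    char label[8];                    /* operator, or variable name for a leaf */
    int  l, r;                        /* child node indices, -1 for leaves */
};

static struct node dag[MAXN];
static int nnodes = 0;

/* Reuse an existing node with the same label and children if one exists;
   otherwise append a new node. The reuse turns the tree into a DAG. */
static int find_or_add(const char *label, int l, int r)
{
    for (int k = 0; k < nnodes; k++)
        if (dag[k].l == l && dag[k].r == r &&
            strcmp(dag[k].label, label) == 0)
            return k;
    snprintf(dag[nnodes].label, sizeof dag[nnodes].label, "%s", label);
    dag[nnodes].l = l;
    dag[nnodes].r = r;
    return nnodes++;
}

int main(void)
{
    int a  = find_or_add("a", -1, -1);        /* leaves */
    int i  = find_or_add("i", -1, -1);
    int t3 = find_or_add("[]", a, i);         /* load a[i] */
    int t6 = find_or_add("[]", a, i);         /* same node as t3 */
    int t7 = find_or_add("*", t3, t6);        /* t7 = t3 * t6 */
    printf("t3=node%d t6=node%d t7=node%d (total %d nodes)\n",
           t3, t6, t7, nnodes);               /* t3 and t6 share one node */
    return 0;
}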