
Parallel Programming Models,

Languages and Compilers


Module 6
Points to be covered
• Parallel Programming Models-
Shared-Variable Model, Message-Passing Model, Data-Parallel Model,
Object Oriented Model, Functional and Logic Models
• Parallel Languages and Role of Compilers-
Language Features for Parallelism, Parallel Language Constructs,
Optimizing Compilers for Parallelism
• Dependence Analysis of Data Arrays-
Iteration Space and Dependence Analysis, Subscript Separability and
Partitioning, Categorized Dependence Tests
• Code Optimization and Scheduling-
Scalar Optimization with Basic Blocks, Local and Global Optimizations,
Vectorization and Parallelization Methods, Code Generation and
Scheduling, Trace Scheduling Compilation
Parallel Programming Model
• A programming model is a simplified and
transparent view of the computer
hardware/software system.
• Parallel programming models are specifically
designed for multiprocessors, multicomputers,
or vector/SIMD computers.
Classification
• There are five parallel programming models:
 Shared-Variable Model
 Message-Passing Model
 Data-Parallel Model
 Object Oriented Model
 Functional and Logic Models
Shared Variable Model
• In all programming systems, processors are active
resources, while memory and I/O devices are passive
resources.
• A program is a collection of processes.
• Parallelism depends on how IPC (interprocess
communication) is implemented.
• The process address space is shared.
• To ensure orderly IPC, the mutual exclusion property
requires that a shared object be accessed by only one
process at a time.
Shared Variable Model(Subpoints)
• Shared Variable communication
• Critical section
• Protected access
• Partitioning and replication
• Scheduling and synchronization
• Cache coherence problem
Shared Variable communication

• Used in multiprocessor programming


• Shared-variable IPC demands the use of shared
memory and mutual exclusion among multiple
processes accessing the same set of variables.

[Figure: processes A, B, and C communicate through a shared variable in common memory]
Critical Section
• A critical section (CS) is a code segment
accessing shared variables, which must be
executed by only one process at a time and
which, once started, must be completed
without interruption.
Critical Section Requirements
• It should satisfy the following requirements:
 Mutual exclusion
At most one process executes the CS at a time.
 No deadlock in waiting
No circular wait by two or more processes.
 No preemption
No interruption until completion.
 Eventual entry
A process attempting to enter its CS will eventually succeed.
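As an illustration (not part of the original slides), a minimal C sketch of a critical section protected by a POSIX mutex; the shared counter, loop count, and thread count are arbitrary choices:

/* Minimal sketch of a critical section guarded by a POSIX mutex.
   The shared counter and the number of threads are illustrative choices. */
#include <pthread.h>
#include <stdio.h>

static long shared_counter = 0;                        /* shared variable */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);    /* enter CS: at most one thread inside */
        shared_counter++;             /* access to the shared variable       */
        pthread_mutex_unlock(&lock);  /* leave CS: others may now enter      */
    }
    return NULL;
}

int main(void) {
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    printf("counter = %ld\n", shared_counter);         /* expect 400000 */
    return 0;
}

Here pthread_mutex_lock/unlock enforce mutual exclusion, and the blocking behavior of the lock provides eventual entry as long as no thread holds the lock forever.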
Protected Access
• The granularity of the CS affects performance.
• If the CS is too large, it may limit parallelism due to
excessive waiting by processes.
• If the CS is too small, it may add unnecessary
code complexity/software overhead.
Four Operational Modes
• Multiprogramming
• Multiprocessing
• Multitasking
• Multithreading
Multiprogramming
• Multiple independent programs run on a
single processor or multiprocessor by time-shared
use of system resources.
• When a program enters I/O mode, the
processor switches to another program.
Multiprocessing
• When multiprogramming is implemented at the
process level on a multiprocessor, it is called
multiprocessing.
• There are two types of multiprocessing:
 If interprocessor communications are handled at the
instruction level, the multiprocessor operates in
MIMD mode.
 If interprocessor communications are handled at the
program, subroutine, or procedure level, the
multiprocessor operates in MPMD mode.
Multitasking
• A single program can be partitioned into
multiple interrelated tasks concurrently
executed on a multiprocessor.
• Thus multitasking provides the parallel
execution of two or more parts of a single
program.
Multithreading
• The traditional UNIX/OS has a single-threaded
kernel, in which only one process can receive OS
kernel service at a time.
• In a multiprocessor we extend the single kernel to
be multithreaded.
• The purpose is to allow multiple threads of
lightweight processes to share the same address
space.
Partitioning and Replication
• Goal of parallel processing is to exploit
parallelism as much as possible with lowest
overhead.
• Program partitioning is a technique for
decomposing a large program and data set into
many small pieces for parallel execution by
multiple processors.
• Program partitioning involves both
programmers and compilers.
• Program replication refers to duplication of the
same program code for parallel execution on
multiple processors over different data sets.
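A hedged C sketch of partitioning and replication with POSIX threads (the data set, chunking scheme, and helper names are illustrative assumptions): the same worker code is replicated across threads, each working on its own partition of the data.

/* Program partitioning and replication sketch: identical worker code is
   replicated over threads, each handling a different chunk of the data. */
#include <pthread.h>

#define N        1000000
#define NWORKERS 4

static double data[N];
static double partial[NWORKERS];

typedef struct { int lo, hi, id; } Chunk;

static void *worker(void *arg) {            /* replicated program code       */
    Chunk *c = (Chunk *)arg;
    double s = 0.0;
    for (int i = c->lo; i < c->hi; i++)     /* private partition of the data */
        s += data[i];
    partial[c->id] = s;
    return NULL;
}

void partitioned_sum(double *result) {
    pthread_t t[NWORKERS];
    Chunk ch[NWORKERS];
    for (int k = 0; k < NWORKERS; k++) {
        ch[k].lo = k * (N / NWORKERS);
        ch[k].hi = (k == NWORKERS - 1) ? N : (k + 1) * (N / NWORKERS);
        ch[k].id = k;
        pthread_create(&t[k], NULL, worker, &ch[k]);
    }
    double total = 0.0;
    for (int k = 0; k < NWORKERS; k++) {
        pthread_join(t[k], NULL);
        total += partial[k];                /* combine the partial results  */
    }
    *result = total;
}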
Scheduling and Synchronization
• Scheduling is further classified as:
Static Scheduling
• Conducted at post-compile time.
• Its advantage is low overhead, but its shortcoming is a possible
mismatch with the run-time profile of each task.
Dynamic Scheduling
• Catches the run-time conditions.
• Requires fast context switching, preemption, and much more
OS support.
• Its advantage is better resource utilization, at the expense of
higher scheduling overhead.
• One can use atomic memory operations such
as Test&Set and Fetch&Add to achieve
synchronization.
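For illustration, a minimal sketch of Test&Set-style synchronization using C11 atomics; the busy-wait spinlock shown here is one possible realization, not the only one:

/* Synchronization with an atomic test-and-set primitive (C11 atomics
   used as an illustrative stand-in for a hardware Test&Set instruction). */
#include <stdatomic.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;
static int shared_count = 0;

void ts_acquire(void) {
    /* spin until the flag was previously clear; test_and_set returns the old value */
    while (atomic_flag_test_and_set(&lock))
        ;                                   /* busy-wait */
}

void ts_release(void) {
    atomic_flag_clear(&lock);
}

void increment_shared(void) {
    ts_acquire();
    shared_count++;      /* protected update; atomic_fetch_add on an
                            atomic_int would play the role of Fetch&Add */
    ts_release();
}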
Cache Coherence & Protection
• The multicache coherence problem demands an
invalidation or update after each write
operation.
Message Passing Model
• Two processes D and E residing at different
processor nodes may communicate with each
other by passing messages through a direct
network.
• The messages may be instructions, data,
synchronization, or interrupt signals, etc.
• Multicomputers are considered loosely
coupled multiprocessors.
IPC using Message Passing

[Figure: process D and process E exchange messages via send/receive]
Synchronous Message Passing
• No shared Memory
• No mutual Exclusion
• Synchronization of the sender and receiver
processes, just like a telephone call.
• No buffers are used.
• If one process is ready to communicate and the
other is not, the one that is ready must be
blocked.
Asynchronous Message Passing
• Does not require that message sending and
receiving be synchronised in time and space.
• Arbitrary communication delays may be
experienced because the sender may not know if
and when the message has been received until an
acknowledgement is received from the receiver.
• This scheme is like a postal service using mailboxes,
with no synchronization between senders and
receivers.
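A minimal MPI sketch of two processes exchanging a message (MPI is used here as a representative message-passing library; the tag and payload are arbitrary):

/* Message passing between two processes "D" (rank 0) and "E" (rank 1). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                 /* "process D": sender   */
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {          /* "process E": receiver */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("process E received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}

Blocking MPI_Send/MPI_Recv behave much like the synchronous scheme above, while the nonblocking MPI_Isend/MPI_Irecv calls correspond to asynchronous message passing with explicit completion checks.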
Data Parallel Model
• Used in SIMD computers
• Parallelism is handled by hardware
synchronization and flow control.
• Fortran 90 is a data-parallel language.
• Requires predistributed data sets.
Data Parallelism
• This technique is used in array processors (SIMD).
• Issue: matching the problem size with the machine size.
• The Connection Machine has been used with
16,384 PEs working in a single configuration.
• No mutual exclusion problem.
Array Language Extensions
• Various data-parallel languages are used,
represented by high-level data types.
• Examples: CFD for the Illiac IV, DAP Fortran for the
Distributed Array Processor, and C* for the Connection Machine.
• The target is to match the number of PEs to the
problem size.
Compiler Support
• The compiler can be used for optimization
purposes.
• It decides the dimensions of the arrays.
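As a present-day illustration of the data-parallel style (OpenMP is an assumption here, not part of the original material), the Fortran 90 statement A(1:N) = B(1:N) + C(1:N) corresponds to a loop the compiler can vectorize:

/* Data-parallel element-wise addition, in the spirit of Fortran 90's
   A(1:N) = B(1:N) + C(1:N).  The OpenMP simd directive is one way to ask
   the compiler to vectorize the loop; it is illustrative, not from the text. */
void vadd(int n, const float *b, const float *c, float *a) {
    #pragma omp simd
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];    /* same operation applied to every element */
}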
Object Oriented Model
• Objects dynamically created and manipulated.
• Processing is performed by sending and
receiving messages among objects.
Concurrent OOP
• The need for OOP arises from its abstraction and
reusability concepts.
• Objects are program entities which
encapsulate data and operations in a single unit.
• Concurrent OOP supports the concurrent manipulation of objects.
Actor Model
• This is a framework for Concurrent OOP.
• Actors are independent components.
• They communicate via asynchronous message
passing.
• Three primitives: create, send-to, and become.
Parallelism in COOP
• Three common patterns of parallelism:
1)Pipeline concurrency
2)Divide and conquer
3)Cooperative Problem Solving
Functional and logic Model
• Functional programming languages:
Lisp, SISAL, and Strand 88.
• Logic programming languages:
Concurrent Prolog and Parlog.
Functional Programming Model
• Should not produce any side effects.
• No concept of storage, assignment, or
branching.
• Single-assignment and dataflow languages are
functional in nature.
Logic Programming Models
• Used for knowledge processing from large
databases.
• Supports an implicit search strategy.
• AND-parallel execution and OR-parallel
reduction techniques are used.
• Used in artificial intelligence.
Parallel Language and Compilers
• A programming environment is a collection of software tools and
system support.
• A parallel software programming environment is needed.
• Users are still forced to focus on hardware details
rather than on parallelism using high-level
abstractions.
Language Features For Parallelism
• Optimization Features
• Availability Features
• Synchronization/communication Features
• Control Of Parallelism
• Data Parallelism Features
• Process Management Features
Optimization Features

• Theme: conversion of a sequential program to a
parallel program.
• The purpose is to match software parallelism with
hardware parallelism.
• Software in practice:
1) Automated parallelizers
Express C automated parallelizer and the Alliant
FX Fortran compiler.
2) Semiautomated parallelizers
Need compiler directives or programmer
interaction.
3) Interactive restructuring support
Static analyzer, run-time statistics, data flow
graph, and code translator for restructuring
Fortran code.
Availability Features
• Theme: enhance user-friendliness, make the
language portable to a large number of parallel
computers, and expand the applicability of
software libraries.
1) Scalability
The language should be scalable to the number of
processors and independent of hardware topology.
2) Compatibility
Compatible with sequential languages.
3) Portability
The language should be portable to shared-memory
multiprocessors, message-passing multicomputers, or both.
Synchronization/Communication Features

• Shared Variable (locks) for IPC


• Remote Procedure Calls
• Data Flow languages
• Mailbox,Semaphores,Monitors
Control Of Parallelism
• Coarse,Medium and fine grain
• Explicit vs implicit parallelism
• Global Parallelism
• Loop Parallelism
• Task Parallelism
• Divide and Conquer Parallelism
Data Parallelism Features
Theme: how data are accessed and distributed in
either SIMD or MIMD computers.
1) Run-time automatic decomposition
Data are automatically distributed with no user
intervention.
2) Mapping specification
The user specifies patterns and input data are mapped to
hardware.
3) Virtual processor support
The compiler creates virtual processors statically and maps them to physical
processors.
4) Direct access to shared data
Shared data is directly accessed by the operating
system.
Process Management Features
Theme:
Support efficient creation of parallel
processes, multithreading/multitasking, program
partitioning and replication, and dynamic load
balancing at run time.
1)Dynamic Process Creation at Run Time.
2)Creation of lightweight processes.
3)Replication technique.
4)Partitioned Networks.
5)Automatic Load Balancing
Optimizing Compilers for Parallelism

• The role of the compiler is to remove the burden of
program optimization and code generation from the programmer.
Three phases:
1) Flow analysis
2) Optimization
3) Code generation
Flow Analysis
• Reveals the program flow patterns needed to determine data
and control dependences.
• Flow analysis is carried out at various execution
levels:
1) Instruction level: VLIW or superscalar
processors.
2) Loop level: SIMD and systolic computers.
3) Task level: multiprocessors/multicomputers.
Program Optimization
• Transformation of the user program to exploit
hardware capabilities.
• Aims for better performance.
• Goals: maximize the speed of code execution
and minimize code length.
• Includes local and global optimizations.
• Includes machine-dependent transformations.
Parallel Code Generation
• Compiler directives can be used to generate
parallel code.
• Two representative optimizing compiler families:
1) Parafrase and Parafrase 2
2) PFC and ParaScope
Parafrase and Parafrase 2
• Parafrase transforms sequential Fortran 77 programs
into parallel programs.
• Parafrase consists of about 100 program passes that are
encoded and applied as needed.
• The pass list identifies dependences and
converts the program into concurrent form.
• Parafrase 2 handles C and Pascal in addition to
Fortran.
PFC and ParaScope
• PFC translates Fortran 77 into Fortran 90 code.
• The PFC package was extended to PFC+ for parallel
code generation on shared-memory
multiprocessors.
• PFC performs program analysis in the following steps:
1) Interprocedural flow analysis
2) Transformation
3) Dependence analysis
4) Vector code generation
Dependence Analysis of Data Arrays

• Iteration Space and Dependence Analysis
• Subscript Separability and Partitioning
• Categorized Dependence Tests
Iteration Space And Dependence Analysis
• Data and control dependences
• Scalar data dependences
– True-, anti-, and output-dependences
Data and Control Dependences
• Data dependence: data is produced and consumed in the correct order.
S1  PI = 3.14
S2  R = 5.0
S3  AREA = PI * R ** 2

• Control dependence: a dependence that arises as a result of control flow.
S1  IF (T.EQ.0.0) GOTO S3
S2  A = A / T
S3  CONTINUE
Data Dependence Classification
• Dependence relations are similar to hardware hazards
(which may cause pipeline stalls):
– True dependences (also called flow dependences), denoted S1 δ S2, are the
same as RAW (Read After Write) hazards.
– Anti-dependences, denoted S1 δ⁻¹ S2, are the same as WAR (Write After
Read) hazards.
– Output dependences, denoted S1 δ° S2, are the same as WAW (Write After
Write) hazards.

True (flow):        Anti:               Output:
S1  X = …           S1  … = X           S1  X = …
S2  … = X           S2  X = …           S2  X = …
S1 δ S2             S1 δ⁻¹ S2           S1 δ° S2
• When the data object is a data array indexed by
multidimensional subscripts, dependence is
more difficult to determine at compile time.
• Dependence analysis is the process of detecting all
data dependences in a program.
Dependence Testing
• Data dependence between array references is difficult to determine because we must decide whether two references may access the same memory location.
• We consider subscripted references here.
Example:
DO i1 = L1, U1
  DO i2 = L2, U2
    ...
    DO in = Ln, Un
S1      A(f1(i1,...,in),...,fm(i1,...,in)) = ...
S2      ... = A(g1(i1,...,in),...,gm(i1,...,in))
    ENDDO
    ...
ENDDO
Iteration Space
• The n-dimensional Cartesian space for n loops is
called the iteration space.
• Example (next slide).
Iteration Space Example
DO I = 1, 3
  DO J = 1, I
S1  A(I,J) = A(J,I)
  ENDDO
ENDDO

• The iteration space of the statement at S1 is
{(1,1), (2,1), (2,2), (3,1), (3,2), (3,3)}.
• At iteration i = (2,1) the value of A(1,2) is
assigned to A(2,1).
Dependence Equation (flow dependence)

DO I = 1, N
S1  A(f(I)) = A(g(I))
ENDDO

To prove flow dependence: for which values of α < β is f(α) = g(β)?

DO I = 1, N
S1  A(I+1) = A(I)
ENDDO

Here f(I) = I+1 and g(I) = I, so α + 1 = β has a solution: β - α = 1.
Dependence Equation (anti dependence)

DO I = 1, N
S1  A(f(I)) = A(g(I))
ENDDO

To prove anti dependence: for which values of α > β is f(α) = g(β)?

DO I = 1, N
S1  A(I+1) = A(I)
ENDDO

Here α + 1 = β has no solution with α > β.
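The two dependence equations above can also be checked by brute force over a small iteration space. A C sketch (illustrative only; the bound N is arbitrary):

/* Brute-force check of the flow-dependence equation f(alpha) = g(beta)
   for the loop  DO I = 1, N : A(I+1) = A(I)   (f(I) = I+1, g(I) = I). */
#include <stdio.h>

static int f(int i) { return i + 1; }   /* subscript of the write (def) */
static int g(int i) { return i; }       /* subscript of the read (use)  */

int main(void) {
    const int N = 8;
    /* flow dependence: a later iteration beta reads what an earlier
       iteration alpha wrote, i.e. alpha < beta and f(alpha) == g(beta) */
    for (int alpha = 1; alpha <= N; alpha++)
        for (int beta = alpha + 1; beta <= N; beta++)
            if (f(alpha) == g(beta))
                printf("flow dependence: alpha=%d beta=%d (distance %d)\n",
                       alpha, beta, beta - alpha);
    return 0;
}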
Dependence Direction Vector
• Definition
– Suppose that there is a dependence from S1 on
iteration i to S2 on iteration j; then the
dependence direction vector D(i, j) is defined
componentwise as

              "<"  if d(i, j)k > 0
  D(i, j)k =  "="  if d(i, j)k = 0
              ">"  if d(i, j)k < 0
Distance Vector
• Its defined in terms of -
Example-:Direction vector is (<,=)
Distance vector is(-1,0)
DO I = 1, 3
DO J = 1, I
S1 A(I+1,J) = A(I,J)
ENDDO
ENDDO
Subscript Separability and Partitioning

• A subscript is a subscript position in an
array reference.
Subscripts fall into three categories:
1) ZIV (Zero Index Variable)
2) SIV (Single Index Variable)
3) MIV (Multiple Index Variable)
ZIV, SIV, and MIV
DO I = …
  DO J = …
    DO K = …
S1    A(5,I+1,J) = A(N,I,K)
    ENDDO
  ENDDO
ENDDO

• ZIV (zero index variable) subscript pairs: here (5, N)
• SIV (single index variable) subscript pairs: here (I+1, I)
• MIV (multiple index variable) subscript pairs: here (J, K)
Subscript Separability
• A subscript is separable if its indices do not occur in
other subscripts.
• If two different subscripts contain the same index,
they are said to be coupled.
Separability and Coupled Subscripts
DO I = …
  DO J = …
    DO K = …
S1    A(I,J,J) = A(I,J,K)
    ENDDO
  ENDDO
ENDDO

• When testing multidimensional arrays, subscripts are separable if their
indices do not occur in other subscripts.
• ZIV and SIV subscript pairs are separable.
• MIV pairs may give rise to coupled subscript groups.
• Here the SIV pair (I, I) is separable, while the MIV pair (J, K) gives
coupled indices (J also appears in the second subscript).
Dependence Testing Overview
1. Partition subscripts into separable and coupled groups
2. Classify each subscript as ZIV, SIV, or MIV
3. For each separable subscript, apply the applicable single
subscript test (ZIV, SIV, or MIV) to prove independence or
produce direction vectors for dependence
4. For each coupled group, apply a multiple subscript test
5. If any test yields independence, no further testing needed
6. Otherwise, merge all dependence vectors into one set

ZIV Test
• The ZIV test compares two subscripts.
• If the expressions are proved unequal, no dependence can exist.

DO I = 1, N
S1  A(5) = A(6)
ENDDO

K = 10
DO I = 1, N
S1  A(5) = A(K)
ENDDO

K = 10
DO I = 1, N
  DO J = 1, N
S1    A(I,5) = A(I,K)
  ENDDO
ENDDO
Strong SIV Test
• Requires subscript pairs of the form aI+c1 and aI'+c2 for the def and use, respectively.
• The dependence equation
      aI + c1 = aI' + c2
  has a solution if the dependence distance d = (c1 - c2)/a is an integer
  and |d| < U - L, with L and U the loop bounds.

Examples:
DO I = 1, N
S1  A(I+1) = A(I)
ENDDO

DO I = 1, N
S1  A(2*I+2) = A(2*I)
ENDDO

DO I = 1, N
S1  A(I+N) = A(I)
ENDDO
Strong SIV Test (formulation)
The strong SIV test applies when
• f(...) = a*ik + c1 and g(...) = a*ik + c2
• Plug in α, β and solve for the dependence: β - α = (c1 - c2)/a
• A dependence exists from S1 to S2 if:
  – β - α is an integer
  – |β - α| ≤ Uk - Lk
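A small C sketch of the strong SIV test as stated above (the function name and parameter order are illustrative choices):

/* Strong SIV test sketch: def subscript a*I + c1, use subscript a*I + c2,
   loop index range [L, U].  Returns 1 if a dependence may exist. */
#include <stdlib.h>

int strong_siv_dependent(int a, int c1, int c2, int L, int U) {
    if (a == 0) return c1 == c2;        /* degenerates to the ZIV test  */
    if ((c1 - c2) % a != 0) return 0;   /* distance is not an integer   */
    int d = (c1 - c2) / a;              /* dependence distance          */
    return abs(d) <= U - L;             /* distance must fit in range   */
}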
Weak-Zero SIV Test
• Requires subscript pairs of the form a1*I + c1 and a2*I' + c2 with either a1 = 0 or a2 = 0.
• If a2 = 0, the dependence equation I = (c2 - c1)/a1 has a solution if
  (c2 - c1)/a1 is an integer and L < (c2 - c1)/a1 < U.
• If a1 = 0, the case is similar.

Examples:
DO I = 1, N
S1  A(I) = A(1)
ENDDO

DO I = 1, N
S1  A(2*I+2) = A(2)
ENDDO

DO I = 1, N
S1  A(1) = A(I) + A(1)
ENDDO
Weak-Zero SIV Test (formulation)
The weak-zero SIV test applies when
• f(...) = a*ik + c1 and g(...) = c2
• Plug in α, β and solve for the dependence: α = (c2 - c1)/a
• A dependence exists from S1 to S2 if:
  – α is an integer
  – Lk ≤ α ≤ Uk
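Similarly, a hedged C sketch of the weak-zero SIV test (again, the function name is an illustrative choice):

/* Weak-zero SIV test sketch: def subscript a*I + c1, use subscript c2
   (the use coefficient is zero), loop index range [L, U].
   Returns 1 if a dependence may exist. */
int weak_zero_siv_dependent(int a, int c1, int c2, int L, int U) {
    if (a == 0) return c1 == c2;        /* both constant: ZIV case        */
    if ((c2 - c1) % a != 0) return 0;   /* solution is not an integer     */
    int i = (c2 - c1) / a;              /* the single iteration involved  */
    return L <= i && i <= U;            /* must lie within the loop bounds */
}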
MIV Tests
• Similar to the SIV tests, except that i and j are distinct
loop indices.
Code Optimization and Scheduling
• A compiler is software that translates source
code into object code.
• Optimization demands effort from both the
programmer and the compiler.
Scalar Optimization Within Basic Blocks
• Two types of scheduling:
 Static scheduling (the compiler performs the
dependence analysis)
 Dynamic scheduling (hardware or the operating system
performs the dependence analysis at run time)
• Code scheduling methods ensure that control
dependences, data dependences, and resource
dependences are properly handled during
concurrent execution.
Precedence Constraints
• A hardware scheduler is needed to direct the flow of
instructions after a branch.
• A software scheduler is used to handle conditional
execution statements.
• Program profiling can be used.
• If a flow dependence is detected, the write must be
completed before the dependent read operation.
• Care should also be taken with output dependence
problems.
Basic Blocks
• A basic block is a sequence of statements with
the properties that
(a) The flow of control can only enter the basic
block through the first instruction in the block.
That is, there are no jumps into the middle of
the block.
(b) Control will leave the block without halting or
branching, except possibly at the last
instruction in the block.
Partitioning instructions into basic blocks

• INPUT: A sequence of three-address instructions.
• OUTPUT: A list of the basic blocks for that
sequence in which each instruction is assigned
to exactly one basic block
• METHOD: First, we determine those
instructions in the intermediate code that are
leaders, that is, the first instructions in some
basic block.
• The rules for finding leaders are:
1. The first three-address instruction in the
intermediate code is a leader.
2. Any instruction that is the target of a
conditional or unconditional jump is a leader.
3. Any instruction that immediately follows a
conditional or unconditional jump is a leader.
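A compact C sketch of the leader-marking step; the instruction encoding and helper names are hypothetical, introduced only for illustration:

/* Sketch of leader identification over a toy instruction array.
   jump_target < 0 means "this instruction is not a jump". */
typedef struct {
    const char *text;   /* three-address instruction, for display only */
    int jump_target;    /* index of the target instruction, or -1      */
} Instr;

void mark_leaders(const Instr *code, int n, int *is_leader) {
    for (int i = 0; i < n; i++) is_leader[i] = 0;
    if (n > 0) is_leader[0] = 1;                 /* rule 1: first instruction        */
    for (int i = 0; i < n; i++) {
        if (code[i].jump_target >= 0) {
            is_leader[code[i].jump_target] = 1;  /* rule 2: target of a jump         */
            if (i + 1 < n) is_leader[i + 1] = 1; /* rule 3: instruction after a jump */
        }
    }
}

Once leaders are marked, each basic block runs from a leader up to, but not including, the next leader or the end of the instruction sequence.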
Example
sum = 0
do 10 i = 1, n
10 sum = sum + a[i]*a[i]
Three Address Code
1. sum = 0 initialize sum
2. i = 1 initialize loop counter
3. if i > n goto 15 loop test, check for limit
4. t1 = addr(a) – 4
5. t2 = i * 4 a[i]
6. t3 = t1[t2]
7. t4 = addr(a) – 4
8. t5 = i * 4 a[i]
9. t6 = t4[t5]
10. t7 = t3 * t6 a[i]*a[i]
11. t8 = sum + t7
12. sum = t8 increment sum
13. i = i + 1 increment loop counter
14. goto 3

15. …
Control Flow Graph (CFG)
1. sum = 0
2. i = 1
3. if i > n goto 15      (T → 15. … ;  F → fall through)
4. t1 = addr(a) – 4
5. t2 = i*4
6. t3 = t1[t2]
7. t4 = addr(a) – 4
8. t5 = i*4
9. t6 = t4[t5]
10. t7 = t3*t6
11. t8 = sum + t7
12. sum = t8
13. i = i + 1
14. goto 3
Common Subexpression Elimination
Before:
1.  sum = 0
2.  i = 1
3.  if i > n goto 15
4.  t1 = addr(a) - 4
5.  t2 = i*4
6.  t3 = t1[t2]
7.  t4 = addr(a) - 4
8.  t5 = i*4
9.  t6 = t4[t5]
10. t7 = t3*t6
11. t8 = sum + t7
12. sum = t8
13. i = i + 1
14. goto 3
15. …

After (t4, t5, t6 recompute t1, t2, t3, so statement 10 is rewritten as 10a; the now-redundant statements 7-10 are removed by dead code elimination):
1.  sum = 0
2.  i = 1
3.  if i > n goto 15
4.  t1 = addr(a) - 4
5.  t2 = i*4
6.  t3 = t1[t2]
10a. t7 = t3*t3
11. t8 = sum + t7
12. sum = t8
13. i = i + 1
14. goto 3
15. …
Copy Propagation
Before:
1.  sum = 0
2.  i = 1
3.  if i > n goto 15
4.  t1 = addr(a) - 4
5.  t2 = i * 4
6.  t3 = t1[t2]
10a. t7 = t3 * t3
11. t8 = sum + t7
12. sum = t8
13. i = i + 1
14. goto 3
15. …

After (the copy through t8 is propagated into 11a; statements 11 and 12 become dead and are removed before the next step):
1.  sum = 0
2.  i = 1
3.  if i > n goto 15
4.  t1 = addr(a) - 4
5.  t2 = i * 4
6.  t3 = t1[t2]
10a. t7 = t3 * t3
11a. sum = sum + t7
13. i = i + 1
14. goto 3
15. …
Invariant Code Motion
Before:
1.  sum = 0
2.  i = 1
3.  if i > n goto 15
4.  t1 = addr(a) - 4
5.  t2 = i * 4
6.  t3 = t1[t2]
10a. t7 = t3 * t3
11a. sum = sum + t7
13. i = i + 1
14. goto 3
15. …

After (the loop-invariant computation of t1 is hoisted out of the loop as 2a, making statement 4 redundant):
1.  sum = 0
2.  i = 1
2a. t1 = addr(a) - 4
3.  if i > n goto 15
5.  t2 = i * 4
6.  t3 = t1[t2]
10a. t7 = t3 * t3
11a. sum = sum + t7
13. i = i + 1
14. goto 3
15. …
Strength Reduction
Before:
1.  sum = 0
2.  i = 1
2a. t1 = addr(a) - 4
3.  if i > n goto 15
5.  t2 = i * 4
6.  t3 = t1[t2]
10a. t7 = t3 * t3
11a. sum = sum + t7
13. i = i + 1
14. goto 3
15. …

After (the multiplication i * 4 is replaced by the initialization 2b and the increment 11b, so statement 5 becomes redundant):
1.  sum = 0
2.  i = 1
2a. t1 = addr(a) - 4
2b. t2 = i * 4
3.  if i > n goto 15
6.  t3 = t1[t2]
10a. t7 = t3 * t3
11a. sum = sum + t7
11b. t2 = t2 + 4
13. i = i + 1
14. goto 3
15. …
Constant Propagation and Dead Code
Elimination
(Between the previous step and this one, induction variable elimination replaces the loop test on i with a test on t2, adding 2c: t9 = n * 4 and 3a: if t2 > t9 goto 15, and removing the now-unneeded statements 3 and 13.)

Before:
1.  sum = 0
2.  i = 1
2a. t1 = addr(a) - 4
2b. t2 = i * 4
2c. t9 = n * 4
3a. if t2 > t9 goto 15
6.  t3 = t1[t2]
10a. t7 = t3 * t3
11a. sum = sum + t7
11b. t2 = t2 + 4
14. goto 3a
15. …

After (constant propagation of i = 1 turns 2b into 2d: t2 = 4; dead code elimination then removes 2 and 2b, giving the renumbered code in the new control flow graph below):
1.  sum = 0
2.  i = 1
2a. t1 = addr(a) - 4
2b. t2 = i * 4
2c. t9 = n * 4
2d. t2 = 4
3a. if t2 > t9 goto 15
6.  t3 = t1[t2]
10a. t7 = t3 * t3
11a. sum = sum + t7
11b. t2 = t2 + 4
14. goto 3a
15. …
New Control Flow Graph
1. sum = 0
2. t1 = addr(a) - 4
3. t9 = n * 4
4. t2 = 4

5. if t2 > t9 goto 11    (T → 11. … ;  F → fall through)
6. t3 = t1[t2]
7. t7 = t3 * t3
8. sum = sum + t7
9. t2 = t2 + 4
10. goto 5
Analysis and Optimizing Transformations

• Local optimizations – performed by local
analysis of a basic block
• Global optimizations – requires analysis of
statements outside a basic block
• Local optimizations are performed first,
followed by global optimizations
Local Optimizations --- Optimizing
Transformations of a Basic Block
• Local common subexpression elimination
• Dead code elimination
• Copy propagation
• Renaming of compiler-generated
temporaries to share storage
Example 1: Local Common Subexpression
Elimination
1. t1 = 4 * i
2. t2 = a [ t1 ]
3. t3 = 4 * i
4. t4 = b [ t3 ]
5. t5 = t2 * t4
6. t6 = prod * t5
7. prod = t6
8. t7 = i + 1
9. i = t7
10.if i <= 20 goto 1
Example 2: Local Dead Code Elimination

Before:
1. a = y + 2
2. z = x + w
3. x = y + 2
4. z = b + c
5. b = y + 2

After:
1'. a = y + 2
2'. x = a
3'. z = b + c
4'. b = a
Example 3: Local Constant Propagation
Assume a, k, t3, and t4 are used beyond the basic block.

Before:
1. t1 = 1
2. a = t1
3. t2 = 1 + a
4. k = t2
5. t3 = cvttoreal(k)
6. t4 = 6.2 + t3
7. t3 = t4

After:
1'. a = 1
2'. k = 2
3'. t4 = 8.2
4'. t3 = 8.2
Global Optimizations --- Require Analysis
Outside of Basic Blocks
• Global common subexpression elimination
• Dead code elimination
• Constant propagation
• Loop optimizations
– Loop invariant code motion
– Strength reduction
– Induction variable elimination
Vectorization and Parallelization
• Vectorization: the process of converting scalar looping
operations into equivalent vector instructions.
• Parallelization: converting sequential code into
parallel code.
• Vectorizing compiler: a compiler that performs
vectorization automatically.
• Vector hardware speeds up vector operations.
• Multiprocessors are used to speed up parallel code.
Vectorization Methods
• Use of temporary storage
• Loop Interchange
• Loop Distribution
• Vector Reduction
• Node Splitting
Vectorization Example-:

Consider a scalar loop for addition:
Do 20 I=8,120,2
20 A(I)=B(I+3)+C(I+1)
After Vectorization-:
A(8:120:2)=B(11:123:2)+C(9:121:2)
Use of Temporary Storage
• Theme: in order to have pipelined execution, use a
temporary array to produce vector code.

Do 20 I=1,N
   A(I)=B(I)+C(I)
20 B(I)=2*A(I+1)

Sequential execution order:
A(1)=B(1)+C(1)
B(1)=2*A(2)
A(2)=B(2)+C(2)
B(2)=2*A(3)
…

By use of the temporary storage method we have:
TEMP(1:N)=A(2:N+1)
A(1:N)=B(1:N)+C(1:N)
B(1:N)=2*TEMP(1:N)
Loop Interchange
• Theme: exchange the inner loop with the outer loop to
increase profitability (improvement in execution
time).
Example-:
Do 20 I=2,N
Do 10 J=2,N
S1: A(I,J)=(A(I,J-1)+A(I,J+1))/2
10 Continue
20 Continue
After Loop Interchange
Do 20 J=2,N
Do 20 I=2,N
S1: A(I,J)=(A(I,J-1)+A(I,J+1))/2
20 Continue
Now the inner loop can be vectorized as follows:
Do 20 J=2,N
A(2:N,J) =(A(2:N,J-1)+A(2:N,J+1))/2
20 Continue
Loop Distribution
• Theme: distribute the loop and vectorize it.
Example:
DO I = 1, N
S1  A(I+1) = B(I) + C
S2  D(I) = A(I) + E
ENDDO
• transformed to:

DO I = 1, N
S1 A(I+1) = B(I) + C
ENDDO

DO I = 1, N
S2 D(I) = A(I) + E
ENDDO

• leads to:

S1 A(2:N+1) = B(1:N) + C
S2 D(1:N) = A(1:N) + E
Vector Reduction
• Theme: produces a scalar value from one or more data
arrays.
Examples: sum, product, maximum, or minimum of all elements
in a single array.

Example:
DO 40 I=1,N
   A(I)=B(I)+C(I)
   S=S+A(I)
   AMAX=MAX(AMAX,A(I))
40 CONTINUE

After vector reduction:
A(1:N)=B(1:N)+C(1:N)
S=S+SUM(A(1:N))
AMAX=MAX(AMAX,MAXVAL(A(1:N)))

where SUM, MAX, and MAXVAL are all vector
operations.
Node Splitting
• The data dependence cycle can sometimes be
broken by node splitting.
Example:

Before:
Do 50 I=2,N
S1:  T(I)=A(I-1)+A(I+1)
S2:  A(I)=B(I)+C(I)
50 Continue

After node splitting:
Do 50 I=2,N
S1a: X(I)=A(I+1)
S2:  A(I)=B(I)+C(I)
S1b: T(I)=A(I-1)+X(I)
50 Continue
Thus new loop structure can be vectorized as
follows-:
S1a: X(2:N)=A(3:N+1)
S2: A(2:N)=B(2:N)+C(2:N)
S1b: T(2:N)=A(1:N-1) +X(2:N)
Vectorization Inhibitors
• Conditional statements which depend on
runtime conditions
• Multiple loop entries
• Function or subroutine calls
• Recurrences and their variations
Code Parallelization
• A single program is divided into many threads for
parallel execution in a multiprocessor environment.
• Each thread executes on a single processor.
• Parallelization is performed on the outer loop,
provided it carries no dependence.
Example-:
DoAll 20 I=2,N
Do 10 J=2,N
S1: A(I,J)=(A(I,J-1)+A(I,J+1))/2
EndDo
EndAll
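For comparison, a hedged sketch of the same loop nest in C with an OpenMP parallel outer loop (OpenMP and the array shape are assumptions, not part of the original example):

/* The outer I loop carries no dependence, so its iterations are distributed
   across threads; the inner J loop stays sequential because A(I,J) uses
   A(I,J-1) computed in the previous J iteration. */
void smooth(int n, double a[n + 2][n + 2]) {
    #pragma omp parallel for            /* parallelize the outer loop   */
    for (int i = 2; i <= n; i++)
        for (int j = 2; j <= n; j++)    /* inner loop runs sequentially */
            a[i][j] = (a[i][j - 1] + a[i][j + 1]) / 2.0;
}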
Parallelization Inhibitors
• Loop dependence
• Function calls
• Multiple entries
Code Generation and Scheduling
• Directed Acyclic Graph(DAG)
• Register Allocation
• List Scheduling
• Cycle Scheduling
• Trace Scheduling Compilation
DAG
• A DAG is constructed from a basic block.
• A basic block contains no backtracking, so its
computation can be represented as a DAG.
Example of Constructing the DAG

(1) t1 := 4 * i
    Step (1): create nodes 4 and i0
    Step (2): create node *
    Step (3): attach identifier t1

(2) t2 := a[t1]
    Step (1): create nodes labeled [] and a0
    Step (2): find the previously created node for t1
    Step (3): attach the label t2

(3) t3 := 4 * i
    Here we determine that node (4), node (i0), and node (*) were already
    created, so we just attach t3 to the existing * node (which is now
    labeled t1, t3).
Compiler Design
[Figure: compiler structure. The front end translates the source program into an intermediate language; the back end performs scheduling and register allocation.]


Why Register Allocation?

• Storing and accessing variables from registers is
much faster than accessing data from memory.
This is the way operations are performed in load/store
(RISC) processors.
• Therefore, in the interests of performance, if not
by necessity, variables ought to be stored in
registers.
• For performance reasons, it is useful to keep
variables in registers as long as possible, once they
are loaded.
The Goal

• Primarily to assign registers to variables.
• However, the allocator quite often runs out of
registers.
• It must decide which variables to "flush" out of
registers to free them up, so that other variables
can be brought in.
This important indirect consequence of allocation is
referred to as spilling.
Scheduling
• List
• Cycle
• Trace
Trace Scheduling
• Code compaction
• Compensation code
Thank You
All the best for exams….!!!
