• The instruction set architecture (ISA) covers the instruction formats and addressing modes of a machine, which can be used by the compiler writer.
• The ISA is the portion of the computer visible to the programmer or compiler writer.
• High-level languages (HLLs), such as Fortran, Algol, and Cobol, were introduced along with compilers.
• Integrated circuits (ICs) were used for both logic and memory, in small-scale or medium-scale integration (SSI or MSI).
• Microprogrammed control, pipelining, and cache memory were introduced.
• Multiuser applications
• Hardware, software, and programming elements of a modern computer system are briefly
introduced below in the context of parallel processing.
• Computing problems have been labeled as numerical computing (tedious integer or floating-point computation), transaction processing, and large database management and information retrieval.
• Processors, memory, and peripheral devices form the hardware core of a computer
system.
• The compiler assigns variables to registers or to memory words and reserves functional
units for operators.
• A loader is used to initiate the program execution through the OS kernel.
• Resource binding demands the use of the compiler, assembler, loader, and OS kernel to
commit physical machine resources to program execution.
• Parallel software can be developed using entirely new languages designed specifically with parallel support as their goal, or by using extensions to existing sequential languages.
Compiler Support
• Parallelizing compilers require full detection of parallelism in the source code and transformation of sequential code into parallel constructs.
• Compiler directives are often inserted into source code to aid compiler parallelizing
efforts
Evolution of Computer Architecture
• Evolution started with the von Neumann architecture, built as a sequential machine executing scalar data.
Flynn's Classification
• Michael Flynn (1972) introduced a classification of various computer architectures based
on notions of instruction and data streams processed simultaneously.
• SISD (single instruction stream over a single data stream) machines
• SIMD (single instruction stream over multiple data streams) machines
• MISD (multiple instruction streams over a single data stream) machines
• MIMD (multiple instruction streams over multiple data streams) machines
• Most parallel computers are built with the MIMD model for general-purpose computations. The SIMD and MISD models are more suitable for special-purpose computations. For this reason, MIMD is the most popular model, SIMD next, and MISD the least popular model applied in commercial machines.
Development Layers
System Attributes
T = Ic × (p + m × k) × τ
Five performance factors: Ic, p, m, k, τ,
where Ic is the instruction count, p is the number of processor cycles needed for instruction decode and execution, m is the number of memory references needed, k is the memory-access latency (in cycles), and τ is the processor cycle time.
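As an illustration of the formula, a short Python sketch (the values chosen for the five factors are assumptions, not data from these notes):

# Hypothetical values for the five performance factors (assumptions):
# Ic = 200,000 instructions, p = 4 cycles for decode/execute, m = 2 memory
# references per instruction, k = 3 cycles of memory latency, tau = 2 ns.
Ic, p, m, k, tau = 200_000, 4, 2, 3, 2e-9

cpi = p + m * k            # effective cycles per instruction
T = Ic * cpi * tau         # CPU time: T = Ic * (p + m*k) * tau
print(f"CPI = {cpi}, T = {T * 1e3:.2f} ms")   # CPI = 10, T = 4.00 ms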
These are influenced by four system attributes:
• Instruction-set architecture
• Compiler technology
• CPU implementation and control
• Cache and memory hierarchy
MIPS Rate
• The processor speed is often measured in terms of million instructions per second
(MIPS).
• The MIPS rate of a given processor is
MIPS rate = Ic / (T × 10^6) = f / (CPI × 10^6) = (f × Ic) / (C × 10^6),
where f is the clock rate, CPI is the average cycles per instruction, and C is the total number of clock cycles needed to execute a given program.
2. A 400 MHz processor was used to execute a benchmark program with the following
instruction mix and clock cycle counts:
Determine the effective CPI, MIPS rate and execution time for this program.
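Since the instruction-mix table is not reproduced here, the following Python sketch uses a hypothetical mix only to show how the effective CPI, MIPS rate, and execution time of the 400 MHz processor would be computed:

# Hypothetical instruction mix (assumed values, not the original table):
# (instruction type, count, cycles per instruction)
mix = [
    ("Integer arithmetic", 45000, 1),
    ("Data transfer",      32000, 2),
    ("Floating point",     15000, 2),
    ("Control transfer",    8000, 2),
]
clock_hz = 400e6  # 400 MHz processor, as stated in the problem

Ic = sum(count for _, count, _ in mix)              # total instruction count
cycles = sum(count * cpi for _, count, cpi in mix)  # total clock cycles C
effective_cpi = cycles / Ic
exec_time = cycles / clock_hz                       # T = C * tau
mips = clock_hz / (effective_cpi * 1e6)             # MIPS = f / (CPI * 10^6)

print(f"Effective CPI = {effective_cpi:.2f}")       # 1.55 for this assumed mix
print(f"MIPS rate = {mips:.1f}")                    # about 258 MIPS
print(f"Execution time = {exec_time * 1e3:.3f} ms") # 0.388 ms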
Programming Environments
• When using a parallel computer, one desires a parallel environment where parallelism is
automatically exploited.
• Two approaches to parallel programming
Implicit Parallelism and Explicit Parallelism
Implicit Parallelism
• An implicit approach uses a conventional language, such as C or C++, to write the source program.
• The sequentially coded source program is translated into parallel object code by a
parallelizing compiler.
• This compiler must be able to detect parallelism and assign target machine resources.
• This compiler approach has been applied in programming shared-memory
multiprocessors.
Explicit Parallelism
• The second approach requires more effort by the programmer to develop a source
program using parallel dialects of C, C++.
• Parallelism is explicitly specified in the user programs.
• This reduces the burden on the compiler to detect parallelism.
• Instead, the compiler needs to preserve parallelism.
Multiprocessors and Multicomputers
• Two categories of parallel computers.
• These physical models are distinguished by having a shared common memory or
unshared distributed memories.
Shared-Memory Multiprocessors
Three shared-memory multiprocessor models. These models differ in how the memory and
peripheral resources are shared or distributed.
• Uniform Memory-Access (UMA) model
• Non-Uniform Memory Access (NUMA) model
• Cache-only Memory Access (COMA) model
The UMA Model
• Uniform memory-access (UMA) model
• In a UMA multiprocessor model, the physical memory is uniformly shared by all the
processors. All processors have equal access time to all memory words. Each processor
may use a private cache. Peripherals are also shared in some fashion. Communication
among processors takes place using variables in the common memory.
• The system interconnect takes the form of a common bus, a crossbar switch, or a multistage network. The UMA model is suitable for general-purpose applications by multiple users.
• It can be used to speed up the execution of a single large program in time-critical applications. To coordinate parallel events, synchronization and communication among processors are done using shared variables in the common memory.
Non-Uniform Memory Access (NUMA) model
• A NUMA multiprocessor is a shared-memory system in which the access time varies
with the location of the memory word.
• Two NUMA machine models:
Distributed-Memory Multicomputers
Vector Supercomputers
• A vector computer is often built on top of a scalar processor: the vector processor is attached to the scalar processor. Program and data are first loaded into the main memory through a host computer. All instructions are first decoded by the scalar control unit. If the decoded instruction is a scalar operation, it is directly executed by the scalar processor using the scalar functional pipelines.
• If the instruction is decoded as a vector operation, it will be sent to the vector control
unit.
The vector control unit supervises and coordinates the flow of vector data between the main memory and the vector functional pipelines. A number of vector functional pipelines may be built into a vector processor.
SIMD Supercomputers
Conditions of parallelism
• The ability to execute several program segments in parallel requires each segment to be
independent of the other segments.
• The independence comes in various forms as defined below separately.
• We consider the dependence relations among instructions in a program. In general, each
code segment may contain one or more statements. We use a dependence graph to
describe the relations. The nodes of a dependence graph correspond to the program
statements [instructions], and the directed edges with different labels show the ordered
relations among the statements. The analysis of dependence graphs shows where
opportunity exists for parallelization and vectorization.
Data Dependence
• The ordering relationship between statements is indicated by the data dependence.
• Five types of data dependence are defined below:
1. Flow dependence: A statement S2 is flow-dependent on S1 if an execution path exists from S1 to S2 and if at least one output (a variable assigned) of S1 feeds in as an input (operand) to S2; this is also called a RAW hazard. Flow dependence is denoted S1 -> S2.
2. Antidependence: Statement S2 is antidependent on statement S1 if S2 follows S1 in program order and if the output of S2 overlaps the input to S1; this is also called a WAR hazard. Antidependence is denoted by a crossed arrow from S1 to S2.
3. Output dependence: Two statements are output-dependent if they produce (write) the same output variable; this is also called a WAW hazard.
4. I/O dependence: Read and write are I/O statements. I/O dependence occurs when the same file is referenced by both I/O statements.
5. Unknown dependence: The dependence relation between two statements cannot be
determined in the following situations:
• The subscript of a variable is itself subscripted (indirect addressing).
• The subscript does not contain the loop index variable.
• A variable appears more than once with subscripts having different coefficients of the
loop variable.
• The subscript is nonlinear in the loop index variable.
• The above data dependence relations should not be arbitrarily violated during program execution; otherwise, erroneous results may be produced when the program order is changed. On a multiprocessor system, the program order may or may not be preserved, depending on the memory model used.
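A minimal sketch (the statements are illustrative, not from the notes) of how the flow, anti-, and output dependences defined above can be detected from each statement's read and write sets:

# Each statement is described by (name, read set, write set).
statements = [
    ("S1", {"b", "c"}, {"a"}),   # a = b + c
    ("S2", {"a"},      {"d"}),   # d = a * 2
    ("S3", {"e"},      {"a"}),   # a = e - 1
]

for i, (si, ri, wi) in enumerate(statements):
    for sj, rj, wj in statements[i + 1:]:
        if wi & rj:
            print(f"flow dependence    {si} -> {sj}   (RAW on {wi & rj})")
        if ri & wj:
            print(f"antidependence     {si} -/-> {sj} (WAR on {ri & wj})")
        if wi & wj:
            print(f"output dependence  {si} o-> {sj}  (WAW on {wi & wj})")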
Control Dependence
• This refers to the situation where the order of the execution of statements cannot be
determined before run time.
• For example, in conditional statements, the statements executed next depend on the outcome of the condition.
• Paths taken after a conditional branch may introduce or eliminate data dependence among
instructions.
• Dependence may also exist between operations performed in successive iterations of a
looping procedure.
• The successive iterations of a loop may be control-independent (each iteration's branches depend only on data computed in that iteration) or control-dependent (a branch in one iteration depends on results of a previous iteration), as illustrated in the sketch below.
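A sketch of the two kinds of loops referred to above (illustrative code; the array values are assumptions):

n = 8
c = [3, -1, 4, -1, 5, -9, 2, 6]
a = [0] * n

# Control-independent iterations: the branch in iteration i depends only on
# data produced in the same iteration, so all iterations may run in parallel.
for i in range(n):
    a[i] = c[i]
    if a[i] < 0:
        a[i] = 1

# Control-dependent iterations: the branch in iteration i tests a value from
# iteration i - 1, so the iterations cannot be executed independently.
for i in range(1, n):
    if a[i - 1] < 0:
        a[i] = 1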
• Resource dependence is concerned with conflicts in using shared resources rather than data: ALU conflicts are called ALU dependence, and memory (storage) conflicts are called storage dependence.
The detection of parallelism in programs requires a check of the various dependence
relations.
Bernstein's Conditions
• Bernstein revealed a set of conditions based on which two processes can execute in
parallel.
• We define the input set Ii of a process Pi as the set of all input variables needed to execute the process, and the output set Oi as the set of all output variables generated after execution of Pi. Bernstein's conditions, which apply to the input and output sets of processes, must be satisfied for parallel execution of processes.
• Consider two processes P1 and P2 with their input sets I1 and I2 and output sets O1 and O2, respectively. These two processes can execute in parallel, denoted P1 || P2, if they are independent and do not create confusing results.
• The conditions are stated as follows:
I1 ∩ O2 = ∅
I2 ∩ O1 = ∅
O1 ∩ O2 = ∅
These three conditions are known as Bernstein's conditions.
• In terms of data dependences, Bernstein's conditions imply that two processes can execute in parallel if they are flow-independent, anti-independent, and output-independent.
• In general, a set of processes P1, P2, P3, ..., Pk can execute in parallel if Bernstein's conditions are satisfied on a pairwise basis; that is, P1 || P2 || P3 || ... || Pk if and only if Pi || Pj for all i ≠ j.
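A small Python sketch of the pairwise Bernstein test, using made-up input and output sets (assumptions, not the textbook's example program):

# P1 || P2 iff I1∩O2 = ∅, I2∩O1 = ∅ and O1∩O2 = ∅.
def parallel(p1, p2):
    i1, o1 = p1
    i2, o2 = p2
    return not (i1 & o2) and not (i2 & o1) and not (o1 & o2)

# Hypothetical processes, each described by (input set, output set).
P1 = ({"b", "c"}, {"a"})
P2 = ({"d", "e"}, {"f"})
P3 = ({"a", "f"}, {"g"})

print(parallel(P1, P2))  # True: all three conditions hold
print(parallel(P1, P3))  # False: I3 ∩ O1 = {"a"}, so P3 is flow-dependent on P1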
Ex: Detection of parallelism in a program using Bernstein's conditions
• The dependence graph that shows Data dependence (solid arrows) and resource
dependence (dashed arrows). It demonstrates data dependence as well as resource
dependence.
• Sequential execution and Parallel execution of the above program.In sequential execution
five steps are needed. If two adders are available simultaneously, the parallel execution
requires only three steps as shown.
• There are 10 pairs of statements to check against Bernstein's conditions. Only 5 pairs, P1 || P5, P2 || P3, P2 || P5, P5 || P3, and P4 || P5, can execute in parallel if there are no resource conflicts. Collectively, only P2 || P3 || P5 is possible, because P2 || P3, P3 || P5, and P5 || P2 are all possible.
------------------------------------------------------------------------------------------------------------
Violation of any one or more of the three Bernstein conditions prohibits parallelism between two processes. In general, violation of any one or more of the 3n(n-1)/2 Bernstein conditions among n processes prohibits parallelism collectively or partially.
• Hardware and software parallelism: This refers to the type of parallelism defined by the
machine architecture and hardware multiplicity.
Hardware parallelism:
• It refers to the support for parallelism provided by hardware multiplicity. It is characterized by the number of instruction issues per machine cycle. If a processor issues k instructions per machine cycle, it is called a k-issue processor.
• For example, the Intel i960CA is a three-issue processor, with one arithmetic, one memory-access, and one branch instruction issued per cycle.
• Software parallelism: This type of parallelism is revealed in the program flow graph, which displays the patterns of simultaneously executable operations; it indicates the number of instructions that can be executed per machine cycle.
• Consider the example program graph. There are eight instructions (four loads and four arithmetic operations) to be executed in three consecutive machine cycles: four load operations are performed in the first cycle, followed by two multiply operations in the second cycle and two add/subtract operations in the third cycle.
• The software parallelism thus varies from 4 to 2 over the three cycles. The average software parallelism is 8/3 = 2.67 instructions per cycle in this example program.
• Consider execution of the same program by a two-issue processor which can execute one memory access (load or write) and one arithmetic (add, subtract, multiply, etc.) operation simultaneously. With this hardware restriction, the program must execute in seven machine cycles.
• Hardware parallelism displays an average value of 8/7 = 1.14 instructions executed per
cycle.
• This demonstrates a mismatch between the software parallelism and the hardware
parallelism.
• To match the software parallelism, we consider a hardware platform of a dual-processor system, where single-issue processors are used. Here, L/S stands for load/store operations. Six processor cycles are needed to execute the 12 instructions on the two processors. S1 and S2 are two inserted store operations, and L5 and L6 are two inserted load operations. These added instructions are needed for interprocessor communication through the shared memory.
• Hardware parallelism now displays an average value of 12/6 = 2 instructions executed per cycle.
• To solve the mismatch problem between software parallelism and hardware parallelism,
one approach is to develop compilation support, and the other is through hardware
redesign.
• The Role of Compilers: Compiler techniques are used to exploit hardware features to
improve performance. Loop transformation, software pipelining, and features of
optimizing compilers are used for supporting parallelism.
• One must design the compiler and the hardware jointly at the same time. Interaction
between the two can lead to a better solution to the mismatch problem between software
and hardware parallelism.
Instruction Level: At the instruction or statement level, a typical grain contains fewer than 20 instructions, called fine grain. Depending on individual programs, fine-grain parallelism at this level may range from two to thousands. The exploitation of fine-grain parallelism can be assisted by an optimizing compiler, which should be able to automatically detect parallelism and translate the source code into a parallel form recognizable by the run-time system.
Loop Level: This corresponds to iterative loop operations. A typical loop contains fewer than 500 instructions. Independent loop iterations can be vectorized for pipelined execution. The loop level is considered a fine grain of computation.
Procedure Level This level corresponds to medium-grain size at the task, procedural,
subroutine, and coroutine levels. A typical grain at this level contains less than 2000 instructions.
Detection of parallelism at this level is much more difficult than at the finer-grain levels. The
communication requirement is often less.
Subprogram Level : This corresponds to the level of job steps and related subprograms.
The grain size may typically contain thousands of instructions. Parallelism at this level has been
exploited by algorithm designers or programmers, rather than by compilers. We do not have
good compilers for exploiting medium- or coarse-grain parallelism at present.
Job (Program) Level : This corresponds to the parallel execution of independent jobs
(programs) on a parallel computer. The grain size can be as high as tens of thousands of
instructions in a single program.
The basic concepts of program partitioning, grain packing, and scheduling are introduced below.
• Ex: Figure (a) shows a schedule without duplicating any of the five nodes. This schedule contains idle time as well as long interprocessor delays (8 units) between P1 and P2. In Fig. (b), node A is duplicated into A' and assigned to P2, besides retaining the original copy A in P1. Similarly, a duplicated node C' is copied into P1, besides the original node C in P2.
• The new schedule is almost 50% shorter. The reduction in schedule time is caused by elimination of the (a, 8) and (c, 8) delays between the two processors.
• Four major steps are involved in the grain determination and the process of scheduling
optimization:
Step l . Construct a fine-grain program graph.
Step 2. Schedule the fine-grain computation.
Step 3. Perform grain packing to produce the coarse grains.
Step 4. Generate a parallel schedule based on the packed graph.
• The dataflow graph shows that 24 instructions are to be executed (8 divides, 8 multiplies,
and 8 adds).
• Assume that each add, multiply, and divide requires 1, 2, and 3 cycles to complete, respectively. Sequential execution of the 24 instructions on a control-flow uniprocessor takes 48 cycles to complete. On the other hand, a dataflow multiprocessor completes the execution in 14 cycles.
Demand-Driven Mechanisms
Reduction machines are based on Demand-Driven Mechanisms.
• Reduction machines trigger an instruction's execution based on the demand for its results by other computations.
• Consider the evaluation of a nested arithmetic expression, where each operation is performed only when its result is demanded by an enclosing computation.
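An illustrative sketch (the particular expression and its values are assumptions) of demand-driven evaluation, where each operation is wrapped in a thunk and computed only when its value is demanded:

b, c, d, e = 3, 4, 10, 2

t1 = lambda: b + 1          # evaluated only if the product below is demanded
t2 = lambda: t1() * c       # evaluated only if the final difference is demanded
t3 = lambda: d / e
a = lambda: t2() - t3()     # nothing has been computed yet

print(a())                  # the demand for a's value triggers the whole chain: 11.0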
Perfect Shuffle and Exchange:
• Perfect shuffle is a special permutation function for parallel processing applications. The
mapping corresponding to a perfect shuffle is shown in Fig. Its inverse is shown on the
right-hand side.
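A sketch of the perfect-shuffle mapping for N = 2^n nodes: the destination address is a one-bit circular left shift of the source address, and the inverse shuffle is the corresponding right shift (n = 3 is chosen here only for illustration):

def shuffle(x, n):
    """Rotate the n-bit address x left by one bit (perfect shuffle)."""
    return ((x << 1) | (x >> (n - 1))) & ((1 << n) - 1)

def inverse_shuffle(x, n):
    """Rotate the n-bit address x right by one bit (inverse shuffle)."""
    return ((x >> 1) | ((x & 1) << (n - 1))) & ((1 << n) - 1)

n = 3  # N = 8 nodes
for src in range(1 << n):
    print(f"{src:03b} -> {shuffle(src, n):03b}")   # e.g. 001 -> 010, 100 -> 001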
Barrel Shifter: For a network of N = 16 nodes, the barrel shifter is obtained from the ring by adding extra links from each node to those nodes at a distance equal to an integer power of 2.
For N = 16, the barrel shifter has a node degree of 7 and a diameter of 2. The barrel shifter complexity is still much lower than that of the completely connected network.
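A small sketch that enumerates the barrel-shifter links for N = 16 and confirms the node degree of 7 quoted above:

N = 16

def neighbours(i):
    # Links to nodes at distance 1, 2, 4, 8 (mod N), in both directions.
    dist = [1 << k for k in range(N.bit_length() - 1)]
    nbrs = {(i + d) % N for d in dist} | {(i - d) % N for d in dist}
    return sorted(nbrs)

print(neighbours(0))       # [1, 2, 4, 8, 12, 14, 15]
print(len(neighbours(0)))  # 7, matching the node degree stated above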
Fat tree
• The channel width of a fat tree increases as we ascend from leaves to the root.
• Branches get thicker toward the root.
• The traffic toward the root becomes heavier in the conventional binary tree.
• The fat tree has been proposed to alleviate the problem.
Illiac mesh: A variation of the mesh obtained by allowing wraparound connections. The Illiac mesh is topologically equivalent to a chordal ring of degree 4 (as shown in the above Fig. d) for an n × n configuration.
In general, an n × n Illiac mesh has a diameter of d = n - 1, which is only half of the diameter of a pure mesh.
Torus
• The torus can be viewed as another variant of the mesh with an even shorter diameter.
• This topology combines the ring and mesh and extends to higher dimensions.
Systolic Arrays :
• This is a class of multidimensional pipelined array architectures.
• In the example systolic array shown, the interior node degree is 6.
• In general, static systolic arrays are pipelined with multidirectional flow of data streams.
Hypercubes
• This is a binary n-cube architecture which has been implemented in the iPSC, nCUBE,
and CM-2 systems.
• In general, an n-cube consists of N = 2^n nodes spanning n dimensions, with two nodes per dimension. A 3-cube with 8 nodes is shown.
• A 4-cube can be formed by interconnecting the corresponding nodes of two 3-cubes.
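A sketch of hypercube connectivity: in an n-cube, the neighbours of a node are exactly the addresses that differ from it in one bit position.

def cube_neighbours(node, n):
    # Flip each of the n address bits in turn to obtain the n neighbours.
    return [node ^ (1 << k) for k in range(n)]

n = 3                                   # a 3-cube with 2^3 = 8 nodes
print(cube_neighbours(0b000, n))        # [1, 2, 4] -> nodes 001, 010, 100
print(cube_neighbours(0b101, n))        # [4, 7, 1] -> nodes 100, 111, 001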
Cube-Connected Cycle: This architecture is modified from the hypercube.
• A 3-cube is modified to form a 3-cube-connected-cycles (CCC) network.
• The idea is to cut off the corner nodes of the 3-cube and replace each by a ring (cycle) of 3 nodes.
• The CCC is a better architecture for building scalable systems if latency can be tolerated
in some way.
Dynamic Connection Networks
• These networks provide dynamic connectivity: the route through which data move from one PE to another is established at the time the communication has to be performed.
• Communication patterns are established based on program demands.
• Instead of using fixed connections, switches or arbiters must be used along the connecting paths.
• More difficult to expand compared with static networks.
• Paths are established as needed between processors.
Bus-based Networks
• In a bus-based network, processors share a single communication resource [the bus].
• A bus is a highly non-scalable architecture, because only one processor can communicate on the bus at a time.
• Used in shared-memory parallel computers to communicate read and write requests to a shared
global memory.
• A bus-based interconnection network, used here to implement a shared-memory parallel
computer. Each processor (P) is connected to the bus, which in turn is connected to the
global memory. A cache associated with each processor stores recently accessed memory
values in an effort to reduce the bus traffic.
Switch Modules
• Each input can be connected to one or more of the outputs. One-to-one and one-to-many
mappings are allowed.
• When only one-to-one mappings (permutations) are allowed, we call the module an n x n
crossbar switch.
• For example, a 2 × 2 crossbar switch can provide two connection patterns: straight or crossover.
Omega Network
• A 16 × 16 Omega network.
• Four possible connections of 2 × 2 switches are used in constructing the Omega network.
• Four stages of 2 × 2 switches are needed.
• There are 16 inputs on the left and 16 outputs on the right.
• The ISC (interstage connection) pattern is the perfect shuffle over 16 objects.
[Note: figure of an 8 × 8 Omega network.]
• In general, an n-input Omega network requires log2 n stages of 2 × 2 switches. Each stage requires n/2 switch modules.
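A routing sketch (a simplified model, not a full switch-level description) of destination-tag routing through an n-input Omega network, where each of the log2 n stages applies a perfect shuffle to the line number and then sets its low bit from the destination address:

def omega_route(src, dst, n):
    k = n.bit_length() - 1            # number of stages = log2(n); n a power of two
    pos = src
    path = [pos]
    for stage in range(k):
        pos = ((pos << 1) | (pos >> (k - 1))) & (n - 1)   # perfect shuffle of line numbers
        bit = (dst >> (k - 1 - stage)) & 1                # next destination-tag bit (MSB first)
        pos = (pos & ~1) | bit                            # switch output: straight or crossover
        path.append(pos)
    return path

print(omega_route(5, 12, 16))   # [5, 11, 7, 14, 12]: the message ends on output line 12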
Crossbar Network
• A crossbar network can be visualized as a single-stage switch network.
• The cross point switches provide dynamic connections between (source, destination)
pairs.
• Each cross point switch can provide a dedicated connection path between a pair.
• The switch can be set on or off dynamically upon program demand.
• The above figure shows a 16 × 16 crossbar network which connected 16 PDP-11 processors to 16 memory modules. The 16 memory modules could be accessed by the processors in parallel. Each memory module can satisfy only one processor request at a time.
• Each processor can generate a sequence of addresses to access multiple memory modules simultaneously. Only one crosspoint switch can be set on in each column; however, several crosspoint switches can be set on simultaneously in order to support parallel memory accesses.
• The above crossbar network is for interprocessor communication.
• The PEs are processors with attached memory. The CPs stand for control processors
which are used to supervise the entire system operation, including the crossbar networks.
In this crossbar, at one time only one crosspoint switch can be set on in each row and
each column.Only one-to-one connections are provided.
• The n × n crossbar connects at most n (source, destination) pairs at a time.
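A small sketch (not from the notes) that checks whether a set of requested (source, destination) connections can be set up simultaneously on an interprocessor crossbar, where at most one crosspoint may be on in each row and in each column:

def crossbar_ok(requests):
    # Valid only if every source (row) and every destination (column) is unique.
    sources = [s for s, _ in requests]
    dests = [d for _, d in requests]
    return len(set(sources)) == len(sources) and len(set(dests)) == len(dests)

print(crossbar_ok([(0, 3), (1, 2), (2, 0)]))   # True: all rows and columns distinct
print(crossbar_ok([(0, 3), (1, 3)]))           # False: column 3 is requested twice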
1. With a neat diagram, explain the operation of tagged-token data flow computer.
2. Briefly explain the different types of data dependencies. For the following code segment,
draw the dependence graph:
S1 Load R1, A
S2 Add R2, R1
S3 Move R1, R3
S4 Store B, R1
3. Write a note on speedup performance laws.
4. Explain the different performance metrics.
5. Explain the following
Clock rate and CPI
Performance factors.
MIPS rate
Floating point operations per second
Throughput rate
6. Write a note on grain sizes.
1. Explain two categories of parallel computers with their architecture. (10 m)
2. Explain Flynn's classification of computer architectures. (6 m)
3. Explain data, control and resource dependence. (10 m)
4. Explain Bernstein's conditions. (4 m)
5.
II. Calculate the average CPI when the program is executed on a uniprocessor.
Determine the effective CPI, MIPS rate and execution time for this program.
11.
15. Explain the following static connection network topologies a) Linear array b) Ring and
chordal ring c)Barrel shifter d) Tree and star e) Fat tree f) Mesh and Torus
16. Compare control flow and data flow architectures.
17.