
Parallel Processing

Chapter - 3

Instruction Level Parallelism

Dr. Basant Tiwari


basanttiw@gmail.com
Department of Computer Science, Hawassa University
Instruction Level Parallelism

• Basic concepts of pipelining


• Instruction pipelines
• Hazards in a pipeline: Structural, Data, and Control Hazards
• Overview of hazard resolution techniques
• Dynamic instruction scheduling
• Branch prediction techniques
• Instruction-level parallelism using software approaches
• Superscalar techniques
Basic concepts of pipelining
• The term “PIPELINING” refers to the temporal overlapping of processing.

• To understand the concept of pipelining, we first need to understand the
concept of assembly lines in an automated production plant, where items
are assembled from separate parts in stages and the output of one stage
becomes the input to the next stage.
• In computing, a pipeline, also known as a data pipeline, is a set of data
processing elements connected in series, where the output of one element is the
input of the next one. The elements of a pipeline are often executed in parallel.

“A pipeline is a technique of decomposing a sequential process into subprocesses,
with each subprocess being executed in a special dedicated segment that operates
concurrently with all other segments. The result obtained from one segment is
transferred to the next segment in the pipeline. The final result is obtained after
the data have passed through all segments.”
Basic concepts of pipelining Contd…

• To introduce pipelining in a processor P, the following steps must


be followed:
• Sub-divide the input process into a sequence of subtasks. These
subtasks form the stages of the pipeline, which are also known as segments.
• Each stage Si of the pipeline, according to its subtask, performs some
operation on a distinct set of operands.
• When stage Si has completed its operation, the results are passed to the next
stage Si+1 for the next operation.
• Stage Si then receives a new set of inputs from the previous stage Si-1.
• A pipeline can be implemented in two ways:
• Linear pipeline
• Non-Linear Pipeline
Linear Pipeline
• A linear pipeline is a cascade of processing stages which are linearly
connected to perform a fixed function over a stream of data flowing from
one end to the other.
• In modern computers, linear pipelines are applied for instruction execution,
arithmetic computation, and memory-access operations.
• A linear pipeline processor is constructed with k processing stages. External
inputs (operands) are fed into the pipeline at the first stage S1. The processed
results are passed from stage Si to stage Si+1, for all i = 1, 2, ..., k-1. The final
result emerges from the pipeline at the last stage Sk.
• Depending on the control of data flow along the pipeline, we model linear
pipelines in two categories:
• Asynchronous, and
• Synchronous.
Asynchronous Linear Pipeline
• Data flow between adjacent stages in an asynchronous pipeline is controlled by
a handshaking protocol.
• When stage Si is ready to transmit, it sends a ready signal to stage Si+1. After
stage Si+1 receives the incoming data, it returns an acknowledge signal to Si.
• Asynchronous pipelines may have a variable throughput rate. Different
amounts of delay may be experienced in different stages.
Synchronous Linear Pipeline
• In a synchronous pipeline, data flow between stages is controlled by a common
clock, with latches used to interface between stages.
• The latches are made with master-slave flip-flops, which can isolate inputs
from outputs.
• Upon the arrival of a clock pulse, all latches transfer data to the next stage
simultaneously.
• The pipeline stages are combinational logic circuits. It is desired to have
approximately equal delays in all stages. These delays determine the clock
period and thus the speed of the pipeline.
Synchronous Linear Pipeline contd...
• The utilization pattern of successive stages in a synchronous pipeline is
specified by a reservation table. For a linear pipeline, the utilization follows the
diagonal streamline pattern as shown in the figure.
• This table is essentially a space-time diagram depicting the precedence
relationship in using the pipeline stages. For a k-stage linear pipeline, k clock
cycles are needed to flow through the pipeline.
• Successive tasks or operations are initiated one per cycle to enter the pipeline.
Once the pipeline is filled up, one result emerges from the pipeline for each
additional cycle. This throughput is sustained only if the successive tasks are
independent of each other.
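The fill-then-stream behavior just described can be sketched numerically. This is an illustrative model, assuming one clock cycle per stage and fully independent tasks:

```python
# Sketch: timing of a k-stage synchronous linear pipeline.
# Assumes one clock cycle per stage and independent tasks.

def pipeline_cycles(k, n):
    """Clock cycles to push n independent tasks through a k-stage pipeline:
    k cycles to fill, then one result per additional cycle."""
    return k + (n - 1)

def speedup(k, n):
    """Speedup over a non-pipelined unit that needs k cycles per task."""
    return (n * k) / pipeline_cycles(k, n)

# A 4-stage pipeline processing 100 independent tasks:
print(pipeline_cycles(4, 100))      # 103 cycles instead of 400
print(round(speedup(4, 100), 2))    # 3.88, approaching k = 4 as n grows
```

Note how the speedup approaches k only for long streams of independent tasks, which is exactly the "sustained throughput" caveat above.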
Instruction Pipeline
• Instruction pipelining is a technique for implementing instruction-level
parallelism within a single processor.
• In a pipelined computer, instructions flow through the Central processing
unit (CPU) in stages. For example, it might have one stage for each step of the
instruction processing cycle.
• An instruction pipeline reads consecutive instructions from memory while
previous instructions are being executed in other segments. This causes the
instruction fetch and execute phases to overlap and perform simultaneous
operations.
• The process of executing the instruction involves the following major steps:
• Fetch the instruction from the main memory
• Decode the instruction
• Fetch the operand
• Execute the decoded instruction
• These four steps become the candidate stages of the pipeline, which we call
an instruction pipeline.
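The overlap of these four steps can be visualized with a small script. This is only a sketch; the stage abbreviations (FI, DI, FO, EX) and the one-cycle-per-stage assumption are for illustration:

```python
# Sketch: space-time diagram for the four-step instruction pipeline
# (Fetch Instruction, Decode, Fetch Operand, Execute), assuming one
# cycle per stage and no hazards.

STAGES = ["FI", "DI", "FO", "EX"]

def diagram(n_instructions):
    rows = []
    for i in range(n_instructions):
        # Instruction i+1 enters the pipeline at cycle i+1 and then
        # advances one stage per cycle.
        row = ["  "] * i + STAGES + ["  "] * (n_instructions - 1 - i)
        rows.append("I%d: %s" % (i + 1, " ".join(row)))
    return "\n".join(rows)

print(diagram(3))
```

Each column is a clock cycle; reading down a column shows all four stages busy with different instructions once the pipeline is full.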
Factors that affect Throughput of an Instruction Pipeline

Three sources of architectural problems may affect the throughput of an instruction


pipeline. They are fetching, bottleneck, and issuing problems.
• The fetching problem. In general, supplying instructions rapidly to a
pipeline is costly in terms of chip area. Buffering the instructions to be sent
to the pipeline is one simple way of improving overall pipeline utilization.
• The bottleneck problem. The bottleneck problem relates to the amount of load
(work) assigned to a stage in the pipeline. If too much work is applied to one
stage, the time taken to complete an operation at that stage can become
unacceptably long. This relatively long time spent by the instruction at one stage
will inevitably create a bottleneck in the pipeline system. In such a system, it is
better to remove the bottleneck that is the source of congestion. One solution to
this problem is to further subdivide the stage. Another solution is to build
multiple copies of this stage into the pipeline.
Factors that affect Throughput of an Instruction Pipeline

• The issuing problem. If an instruction is available, but cannot be executed for


some reason, a hazard exists for that instruction. These hazards create issuing
problems; they prevent issuing an instruction for execution.
Pipeline Hazard
• Any condition that causes the pipeline to stall is called a hazard.
• Pipeline hazards are situations that prevent the next instruction in the
instruction stream from executing during its designated clock cycle.
• So, hazards are problems within the CPU's instruction pipeline that arise when
the next instruction cannot execute in the following clock cycle; they can
potentially lead to incorrect computation results.
• There are primarily three types of hazards:
i. Data Hazards
ii. Control Hazards or instruction Hazards
iii. Structural Hazards.
Data Hazard
• A data hazard is any condition in which either the source or the destination
operands of an instruction are not available at the time expected in the
pipeline.
• As a result, some operation has to be delayed and the pipeline stalls. This
happens whenever there are two instructions, one of which depends on data
obtained from the other:
• A=3+A
• B=A*4
• For the above sequence, the second instruction needs the value of ‘A’
computed in the first instruction. Thus the second instruction is said to
depend on the first.
• If the execution is done in a pipelined processor, it is highly likely that the
interleaving of these two instructions can lead to incorrect results due to data
dependency between the instructions. Thus the pipeline needs to be stalled as
and when necessary to avoid errors.
Solution for Data Hazard
• We have two solutions for data hazards:
1. Forwarding (or bypassing): The needed data is forwarded as soon as
possible to the instruction which depends on it.
2. Stalling: the dependent instruction is “pushed back” for one or more clock
cycles. Alternatively, you can think of stalling as the execution of a no-op for
one or more cycles.
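As a rough sketch of the two solutions, the snippet below counts bubbles for the A/B pair from the previous slide. The stage timings (FI=1, DI=2, FO=3, EX=4) follow this chapter's four-step pipeline, but the exact forwarding timing is an illustrative assumption:

```python
# Sketch: estimating stall cycles for the dependent pair
#   I1: A = 3 + A   (writes A)
#   I2: B = A * 4   (reads A)
# in a 4-stage pipeline (FI, DI, FO, EX), one cycle per stage.

def stalls(result_ready, operand_needed):
    """Bubbles inserted so the consumer reads a valid operand."""
    return max(0, result_ready - operand_needed)

# Without forwarding: I1's A is available only after its EX completes
# (start of cycle 5), while I2 would naturally read operands in FO at
# cycle 4, so one bubble is inserted.
print(stalls(result_ready=5, operand_needed=4))  # 1

# With forwarding: I1's EX output is bypassed straight to I2's EX input
# at cycle 5, so no bubble is needed.
print(stalls(result_ready=5, operand_needed=5))  # 0
```

In deeper pipelines or with longer-latency producers, the same formula yields several bubbles, which is why forwarding alone cannot always eliminate stalls.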
Instruction or Control Hazard
• This hazard results from branches and other instructions that change the flow
of the program (i.e., change the PC).
• The instruction fetch unit of the CPU is responsible for providing a stream of
instructions to the execution unit. Normally, the instructions fetched by the
fetch unit are in consecutive memory locations, and they are executed in order.
• However, a problem arises when one of the instructions is a branch to
some other memory location. All the instructions fetched into the pipeline
from consecutive memory locations then become invalid and need to be
removed (also called flushing the pipeline). This induces a stall until new
instructions are fetched from the memory address specified in the branch
instruction.
• The time lost as a result of this is called the branch penalty. Often,
dedicated hardware is incorporated in the fetch unit to identify branch
instructions and compute branch addresses as soon as possible, reducing
the resulting delay.
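The cost of branch penalties can be illustrated with a back-of-the-envelope CPI model. The branch frequency, penalty, and predictor accuracy below are assumed numbers for illustration, not measurements:

```python
# Sketch: how the branch penalty inflates average CPI.

def effective_cpi(base_cpi, branch_fraction, penalty_cycles, mispredict_rate=1.0):
    """Average CPI when a fraction of instructions are branches and each
    mispredicted branch costs penalty_cycles of flushed/stalled work.
    mispredict_rate=1.0 models a pipeline that always stalls (no prediction)."""
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles

# 20% branches, 3-cycle penalty, no prediction:
print(round(effective_cpi(1.0, 0.20, 3), 2))        # 1.6
# The same machine with a 90%-accurate branch predictor:
print(round(effective_cpi(1.0, 0.20, 3, 0.10), 2))  # 1.06
```

This is why the branch-handling techniques later in the chapter focus on reducing either the penalty itself or the fraction of branches that pay it.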
Solution for Instruction or Control Hazard
1. Stall until the branch is resolved.
2. Delayed branch: Redefine the runtime behavior of branches to take effect only
after the partially fetched/executed instructions flow through the pipeline.
3. Branch prediction: Predict (statically or dynamically) the outcome of the branch
and fetch from there.
Structural Hazards
• This situation arises mainly when two instructions require a given hardware
resource at the same time and hence for one of the instructions the pipeline
needs to be stalled.
• The most common case is when memory is accessed at the same time by two
instructions. One instruction may need to access memory as part of the
Execute or Write-back phase while another instruction is being fetched.
• If both instructions and data reside in the same memory, the two
instructions cannot proceed together, and one of them must be stalled until
the other is done with its memory access. Thus, in general, sufficient
hardware resources are needed to avoid structural hazards.
Solution for Structural Hazards
1. Stall the pipeline.

2. Refactor pipeline.

3. Duplicate/split the resource (split I/D caches to alleviate memory pressure).

4. Build instruction buffers to alleviate memory pressure.


Dynamic Instruction Scheduling
• Dynamic scheduling, as its name implies, is a method in which the hardware
determines which instructions to execute, as opposed to a statically scheduled
machine, in which the compiler determines the order of execution.
• Data hazards in a program cause a processor to stall.
• With static scheduling, the compiler tries to reorder these instructions at
compile time to reduce pipeline stalls. Such machines use in-order instruction
issue, where a stall of one instruction stalls all instructions behind it.
• Uses less hardware
• Can use more powerful algorithms

• With dynamic scheduling, the hardware tries to rearrange the instructions at
run time to reduce pipeline stalls, using out-of-order execution.
• Simplifies the compiler
• Handles dependencies not known at compile time
• Allows code compiled for a different machine to run efficiently.
Out-of-Order Execution
• In introducing out-of-order execution, we have essentially split the ID pipeline
stage into two stages:

• Issue: decode instructions and check for structural hazards.
• Read operands: wait until there are no data hazards, then read the operands.

• Instruction fetch proceeds ahead of the issue stage and may fetch instructions
either into a single-entry latch or into a queue. A design of this type
uses the instruction queue to hold instructions that have been fetched but
are waiting to be executed.
• Instructions are then issued from the latch or queue.
• The EX stage follows the read operands stage.
• Thus, we may need to distinguish when an instruction begins execution and
when it completes execution; between the two times, the instruction is in
execution. 
• This allows multiple instructions to be in execution at the same time.
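A toy model of this issue/read-operands split is sketched below. The instruction mix, register names, and latencies are invented for illustration, and functional units are assumed unlimited, so the model exposes only the data-flow limit on execution:

```python
# Sketch: instructions issue in program order, but each begins execution
# only when its source operands are ready, so completion can be out of order.

def schedule(instrs):
    """instrs: list of (name, dests, srcs, latency). Returns the cycle at
    which each instruction completes, assuming unlimited functional units."""
    ready = {}   # register -> cycle its value becomes available
    done = {}
    for name, dests, srcs, lat in instrs:  # in-order issue
        start = max([ready.get(r, 0) for r in srcs], default=0)
        finish = start + lat
        for d in dests:
            ready[d] = finish
        done[name] = finish
    return done

prog = [
    ("DIV", ["F0"], ["F2", "F4"], 10),  # long-latency divide
    ("ADD", ["F6"], ["F0", "F8"], 2),   # depends on DIV, so it waits
    ("SUB", ["F8"], ["F2", "F4"], 2),   # independent, finishes first
]
print(schedule(prog))  # {'DIV': 10, 'ADD': 12, 'SUB': 2}
```

SUB completes at cycle 2, long before the earlier DIV and ADD: in-order issue, out-of-order completion, exactly the situation the issue/read-operands split enables.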
Branch Handling Techniques
• One of the major problems in instruction pipelining is the occurrence of branch
instructions.
• A branch is an instruction in a computer program that can cause the computer to
begin executing a different instruction sequence, and thus deviate from its
default behavior of executing instructions in order.
• A branch instruction can be Conditional or Unconditional.
• An Unconditional branch always alters the sequential program flow by loading
the program counter with the target address.
• In a Conditional branch, control selects the target instruction if the condition
is satisfied, or the next sequential instruction if the condition is not
satisfied.
• The branch instruction breaks the normal sequence of instruction stream,
causing difficulties in the operation of the instruction pipeline.
• Pipelined computers employ various hardware techniques to minimize the
performance degradation caused by instruction branching.
Branch Handling Techniques
1. Prefetch Target Instruction:
• One way of handling a conditional branch is to prefetch the target instruction in addition to
the instruction following the branch.
• If the branch condition is successfully satisfied, the pipeline continues from the branch
target instruction.

2. Branch Target Buffer (BTB)


• The BTB is a kind of table, implemented in cache memory, that stores information
about the most recently executed branch instructions.
• A typical BTB is an associative memory where the addresses of taken branch instructions
are stored together with their target addresses.
• It has three fields:
A. Branch Instruction Address: stores the address of the instruction that issues the
branch condition.
B. Branch Target Instruction: stores the address of the target instruction to which
the branch points.
C. Prediction Statistics bit: stores the likelihood (based on previous executions of
this instruction) that the branch will be taken.
Branch Handling Techniques
3. Branch Prediction
• Branch prediction is a technique used in CPUs that attempts to guess the outcome of a
conditional operation and prepare for the most likely result. A digital circuit that performs
this operation is known as a branch predictor.
• A pipeline with branch prediction uses the branch predictor to guess the outcome of a
conditional branch instruction before it is executed.
• The pipeline then begins prefetching the instruction stream from the predicted path. A
correct prediction eliminates the wasted time caused by branch penalties.

4. Delayed Branch
• Whenever a statement is encountered that has a high probability of
branching, or the statement at the branch target address has a high probability
of branching, the instructions following the current statement are kept out of
the pipeline, and the instruction at the branch target address is allowed to
enter the pipeline instead, reducing the branch penalty.
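The "Prediction Statistics bit" mentioned for the BTB is commonly implemented as a 2-bit saturating counter; the minimal sketch below (with an arbitrarily chosen starting state) shows why two bits tolerate a single anomalous outcome:

```python
# Sketch: a 2-bit saturating-counter branch predictor.
# States 0-1 predict not-taken, 2-3 predict taken; each actual outcome
# nudges the counter one step toward that outcome.

class TwoBitPredictor:
    def __init__(self):
        self.state = 2  # start at "weakly taken" (arbitrary choice)

    def predict(self):
        return self.state >= 2  # True = predict taken

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

# A typical loop branch: taken 9 times, then falls through once.
p = TwoBitPredictor()
hits = 0
for actual in [True] * 9 + [False]:
    hits += (p.predict() == actual)
    p.update(actual)
print(hits)  # 9 of 10 predicted correctly
```

A 1-bit scheme would mispredict twice per loop (the exit and the first iteration of the next entry); the second bit absorbs the single exit before the prediction flips.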
Instruction Level Parallelism
• Instruction-level parallelism (ILP) is a measure of how many of
the instructions in a computer program can be executed simultaneously.
• Pipelining can overlap the execution of instructions when they are independent
of one another. This potential overlap among instructions is called instruction-
level parallelism (ILP) since the instructions can be evaluated in parallel.
• ILP must not be confused with concurrency: ILP is about the parallel execution
of a sequence of instructions belonging to one specific thread or process.
Concurrency, by contrast, is about threads of one or more processes being
assigned to a CPU's cores in strict alternation, or in true parallelism if there
are enough cores, ideally one core for each runnable thread.
• There are two approaches to instruction level parallelism:
Hardware and Software.
Instruction Level Parallelism contd…
• Hardware level works upon dynamic parallelism, whereas the software level
works on static parallelism.
• Dynamic parallelism means the processor decides at run time which instructions
to execute in parallel, whereas static parallelism means the compiler decides
which instructions to execute in parallel.
• Consider the following program:
1. e = a + b
2. f = c + d
3. m = e * f
• Operation 3 depends on the results of operations 1 and 2, so it cannot be
calculated until both of them are completed.
• However, operations 1 and 2 do not depend on any other operation, so they can
be calculated simultaneously.
• If we assume that each operation can be completed in one unit of time then these
three instructions can be completed in a total of two units of time, giving an ILP
of 3/2.
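The ILP of this example can be computed as the number of operations divided by the critical-path length, assuming one time unit per operation; a small sketch:

```python
# Sketch: ILP = (number of operations) / (critical-path length),
# with each operation taking one time unit.

def ilp(deps, n_ops):
    """deps maps an operation number to the operations it depends on."""
    depth = {}
    def level(op):
        # The earliest level an operation can run at: one more than the
        # deepest of its dependencies (1 if it has none).
        if op not in depth:
            depth[op] = 1 + max((level(d) for d in deps.get(op, [])), default=0)
        return depth[op]
    critical_path = max(level(op) for op in range(1, n_ops + 1))
    return n_ops / critical_path

# 1: e = a + b   2: f = c + d   3: m = e * f   (3 depends on 1 and 2)
print(ilp({3: [1, 2]}, 3))  # 1.5, i.e. 3/2 as computed above
```

Operations 1 and 2 sit at level 1 and run together; operation 3 sits at level 2, giving 3 operations in 2 time units.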
Instruction Level Parallelism contd…
• A goal of compiler and processor designers is to identify and take advantage of as much
ILP as possible.
• Ordinary programs are typically written under a sequential execution model where
instructions execute one after the other and in the order specified by the programmer.
• ILP allows the compiler and the processor to overlap the execution of multiple
instructions or even to change the order in which instructions are executed.
• How much ILP exists in programs is application specific. In certain fields, such as
graphics and scientific computing the amount can be very large.
• However, workloads such as cryptography may exhibit much less parallelism.

• Micro-architectural techniques that are used to exploit ILP include:


• Instruction pipelining
• Superscalar execution
• Out-of-order execution
• Branch prediction etc.
Superscalar Pipeline Processor
• Normally, pipelines decode and issue one instruction at a time to the
execution unit. This results in a steady-state CPI of 1.

• A superscalar pipeline decodes and issues more than one instruction at a time,
reducing the steady-state CPI to less than 1.

• A superscalar processor is a CPU that implements a form


of parallelism called instruction-level parallelism within a single processor. In
contrast to a scalar processor that can execute at most one single instruction per clock
cycle, a superscalar processor can execute more than one instruction during a clock
cycle by simultaneously dispatching multiple instructions to different execution
units on the processor. It therefore allows for more throughput (the number of
instructions that can be executed in a unit of time).

• Superscalar processing is used to increase the performance of a computer system by

adding more than one execution unit, which function simultaneously.

• Each unit of a superscalar processor has its own Fetch, Decode, and Store units. Execution
units may be shared or duplicated according to the complexity of the computation.
Superscalar Pipeline Processor

A dual-pipeline superscalar processor with four functional units in the
execution stage and a lookahead window producing out-of-order issues.
Superscalar Pipeline Processor
• The mechanism of a superscalar pipeline is almost similar to an instruction
pipeline, except that the control unit functions differently, allotting
execution units to the different pipelines as and when necessary.
• Also, if the hardware has a single execution unit accessed by all the
pipelines, there must be an interconnection network between the decode and
execute stages, and between the execute and store stages.
• There is a lookahead window with its own fetch and decode logic. This window
is used for instruction lookahead when out-of-order instruction issue is
required to achieve better pipeline throughput.
Multipipeline Scheduling or Superscalar Scheduling

• Instruction issue and completion policies are critical to superscalar processor

performance. Two scheduling policies are introduced below:

1. In-Order Issue.

2. Out-of-Order issue.

• When instructions are issued in program order, we call it in-order issue.


When program order is violated, out-of-order issue is being practiced.
• Similarly, if the instructions are completed in program order, it is called
in-order completion. Otherwise, out-of-order completion may result.
• In-order issue is easier to implement but may not yield the optimal
performance. In-order issue may result in either in-order or out-of-order
completion.
• Out-of-order issue usually ends up with out-of-order completion. The
purpose of out-of-order issue and completion is to improve performance.
Out-of-order issue is especially useful when the program contains jumps and
branches.
Speculative execution
• Speculative execution is a technique used by modern CPUs to speed up
performance.
• The CPU may execute certain tasks ahead of time, "speculating" that they will
be needed. If the tasks are required, a speed-up is achieved, because the work is
already complete.
• It is a kind of out-of-order execution, also known as dynamic execution.
• Along with multiple branch prediction (used to predict the instructions most
likely to be needed in the near future) and dataflow analysis (used to align
instructions for optimal execution, as opposed to executing them in the order
they came in), speculative execution delivered a dramatic performance
improvement.
End of Chapter - 3
