
Pipelined Instruction Processing

In

Computer Organization and Architecture

Sumit Gupta
Reg. No = 3050060107

99883-80416 (M)
(0181) 4639-871 (Home)
vaidgupta1988@gmail.com

Term Paper for CSE-211


Computer Arithmetic

Fifth Term 2008

ABSTRACT

Technology is advancing day by day in every field, such as science, medicine, and defence. The topic I was given for research, i.e. "Pipelined Instruction Processing", is a technique that increases the rate of instruction processing by using a pipelined method. In this paper, I present a detailed description of this topic in the field of computer arithmetic. The procedure starts with the important step of understanding the paper topic and continues by finding the relevant aspects that come under it. I describe the principle, the problems and the solutions for those particular problems, and the advantages and disadvantages. After that, I give examples which help us to understand more about the topic, and then the implementation. At last, we come to the developments in the pipelined method, because technology keeps changing, which leads to development.

Keywords: Principle, Problems and Solutions, Advantages & Disadvantages, Examples, Implementation, Development, Conclusion, References

Introduction:-

This paper presents an instruction processor with a pipelined structure, which differs from various other processor technologies. The main processor families are CISC, RISC, superscalar, VLIW, superpipelined, and symbolic processors. These families can be mapped onto a coordinate space of clock rate versus cycles per instruction (CPI), as shown in the figure below. As implementation technology evolves rapidly, the clock rates of various processors are gradually moving from low to higher speeds, toward the right of the design space. Another trend is that processor manufacturers are trying to lower the CPI using both hardware and software approaches.

[Figure: Processor families mapped onto a design space of cycles per instruction (vertical axis, 0.1 to 20) versus clock rate (horizontal axis, 5 to 1000 MHz). Scalar CISC processors sit at high CPI and low clock rates; scalar RISC processors, superpipelined processors, and vector supercomputers occupy progressively lower CPI at higher clock rates, with the most likely future processor speeds toward the lower right of the space.]

Conventional processors like the Intel i486, Motorola M68040, IBM 390 etc. fall into the family known as complex-instruction-set computing (CISC) architecture. The typical clock rate of today's CISC processors ranges from 33 to 50 MHz. On the other hand, today's reduced-instruction-set computing (RISC) processors, such as the Intel i860, SPARC, IBM RS/6000 etc., have faster clock rates ranging from 20 to 120 MHz, determined by the implementation technology employed. With the use of hardwired control, the CPI of most RISC instructions has been reduced to one to two cycles. The processors in vector supercomputers are mostly superpipelined and use multiple functional units for concurrent scalar and vector operations.
Now we come to the main topic I was given for research, i.e. "Pipelined Instruction Processing". The idea is to divide the logic into stages and to work on different data within each stage. An often-used real-world analogy involves doing the laundry: if you have two loads of laundry to do, you can either wash the first load and then dry it before moving on to the next, or you can wash the first load and, when you put it in to dry, put the next load in to wash. If each step takes 20 minutes, you will finish in 60 minutes instead of 80.

An instruction pipeline is a technique used in the design of computers and other digital electronic devices to increase their instruction throughput (the number of instructions that can be executed in a unit of time). The fundamental concept is to split the processing of a computer instruction into a series of independent steps, with storage at the end of each step. This allows the computer's control circuitry to issue instructions at the processing rate of the slowest step, which is much faster than the time needed to perform all steps at once. The term pipeline refers to the fact that each step is carrying data at once (like water), and each step is connected to the next (like the links of a pipe). The origin of pipelining is thought to be the IBM Stretch project, which proposed the terms "Fetch, Decode, and Execute" that became common usage.

A non-pipelined architecture is inefficient because some CPU components (modules) are idle while another module is active during the instruction cycle. Pipelining does not completely cancel out idle time in a CPU, but making those modules work in parallel improves program execution significantly. Processors with pipelining are organized internally into stages which can work semi-independently on separate jobs. Each stage is organized and linked into a 'chain', so each stage's output is fed to the next stage until the job is done. This organization of the processor allows overall processing time to be significantly reduced.

Take another example: in a CPU or other circuit, previous data may have an effect on later data. For instance, if a CPU is processing C = A + B followed by E = C + D, the value of C must finish being calculated before it can be used in the second instruction. This type of problem is called a data dependency conflict. In order to resolve these conflicts, even more logic must be added to stall or otherwise deal with the incoming data. A significant part of the effort in modern CPU design goes into resolving these sorts of dependencies.
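
To make the conflict concrete, here is a minimal sketch in Python (my own illustration; the tuple encoding of instructions is an assumption, not anything specified in this paper) of the check a pipeline must perform before overlapping the two instructions:

# Minimal sketch: detecting a read-after-write (RAW) dependency between
# two register-style instructions before they are allowed to overlap.

def raw_hazard(first, second):
    """Return True if `second` reads a value that `first` writes.

    Each instruction is modeled as (dest, src1, src2), e.g. C = A + B
    becomes ('C', 'A', 'B').
    """
    dest, _, _ = first
    _, src1, src2 = second
    return dest in (src1, src2)

# C = A + B followed by E = C + D: the second instruction needs C,
# which the first has not yet written back, so the pipeline must stall
# (or forward the result) before issuing the second instruction.
i1 = ('C', 'A', 'B')
i2 = ('E', 'C', 'D')
print(raw_hazard(i1, i2))  # True -> stall or forward

Real processors perform this comparison in hardware on register numbers, and often forward the result directly between stages instead of stalling.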

An instruction has a number of stages, and the various stages can be worked on simultaneously by different blocks of the processor. This is a pipeline, and the process is also referred to as instruction pipelining. The figure below shows a pipeline of two independent stages: instruction fetch and instruction execution. The first stage fetches an instruction and buffers it. While the second stage is executing that instruction, the first stage takes advantage of any unused memory cycles to fetch and buffer the next instruction. This process speeds up instruction execution.

[Figure: Two-stage instruction pipeline]

Pipelined Instruction Principle:-

In order to speed up the operation of a computer system beyond what is possible with
sequential execution, methods must be found to perform more than one task at a time. One
method for gaining significant speedup with modest hardware cost is the technique of
pipelining. In this technique, a task is broken down into multiple steps, and independent
processing units are assigned to each step. Once a task has completed its initial step, another
task may enter that step while the original task moves on to the following step.
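
To put a number on this, here is the standard idealized speedup formula (a textbook model, not taken from this paper's own analysis; the notation is mine): a k-stage pipeline with stage time t finishes n tasks in k + n - 1 stage-times, since it needs k cycles to fill and then delivers one result per cycle:

\[
T_{\text{sequential}} = n k t, \qquad T_{\text{pipelined}} = (k + n - 1)\, t
\]
\[
S = \frac{T_{\text{sequential}}}{T_{\text{pipelined}}} = \frac{n k}{k + n - 1} \longrightarrow k \quad (n \to \infty)
\]

For the laundry example above, n = 2 loads and k = 2 steps of t = 20 minutes give (2 + 2 - 1) x 20 = 60 minutes pipelined versus 2 x 2 x 20 = 80 minutes sequential, matching the earlier figure.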

Most modern CPUs are driven by a clock. The CPU consists internally of logic and memory (flip-flops). When the clock signal arrives, the flip-flops take their new values and the logic then requires a period of time to decode them. Then the next clock pulse arrives, the flip-flops again take their new values, and so on. By breaking the logic into smaller pieces and inserting flip-flops between the pieces, the delay before the logic gives valid outputs is reduced. In this way the clock period can be reduced. For example, the classic RISC pipeline is broken into five stages with a set of flip-flops between each stage:

1. Instruction fetch
2. Instruction decode and register fetch
3. Execute
4. Memory access
5. Register write back
[Figure: Basic five-stage pipeline in a RISC machine (IF = Instruction Fetch, ID = Instruction Decode, EX = Execute, MEM = Memory access, WB = Register write back). The vertical axis shows successive instructions; the horizontal axis shows time. In the green column, the earliest instruction is in the WB stage and the latest instruction is undergoing instruction fetch.]
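
The same diagram can be generated programmatically. Below is a small Python sketch (my own illustration, not from the paper) that prints which stage each instruction occupies in each clock cycle:

# Minimal sketch: print the occupancy of a classic 5-stage RISC pipeline,
# one row per instruction, one column per clock cycle.

STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def print_pipeline(n_instructions):
    cycles = len(STAGES) + n_instructions - 1
    print("      " + " ".join(f"c{c + 1:<3}" for c in range(cycles)))
    for i in range(n_instructions):
        row = ["    "] * cycles
        for s, name in enumerate(STAGES):
            row[i + s] = f"{name:<4}"  # instruction i reaches stage s at cycle i + s
        print(f"i{i + 1:<4} " + " ".join(row))

print_pipeline(4)
# Once the pipeline fills, one instruction completes per cycle: while
# instruction 1 is in WB, a fifth instruction (if any) would be in IF.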

Problems in Instruction Pipelining

Several difficulties prevent instruction pipelining from being as simple as the above
description suggests. The principal problems are:

TIMING VARIATIONS:

Not all stages take the same amount of time, which means that the speed gain of a pipeline is determined by its slowest stage. This problem is particularly acute in instruction processing, since different instructions have different operand requirements and sometimes vastly different processing times. Moreover, synchronization mechanisms are required to ensure that data is passed from stage to stage only when both stages are ready.

DATA HAZARDS:

When several instructions are in partial execution, a problem arises if they reference the same
data. We must ensure that a later instruction does not attempt to access data sooner than a
preceding instruction, if this will lead to incorrect results. For example, instruction N+1 must
not be permitted to fetch an operand that is yet to be stored into by instruction N.
BRANCHING:

In order to fetch the "next" instruction, we must know which one is required. If the present
instruction is a conditional branch, the next instruction may not be known until the current
one is processed.

INTERRUPTS:

Interrupts insert unplanned "extra" instructions into the instruction stream. The interrupt must
take effect between instructions, that is, when one instruction has completed and the next has
not yet begun. With pipelining, the next instruction has usually begun before the current one
has completed.

All of these problems must be solved in the context of our need for high speed performance.
If we cannot achieve sufficient speed gain, pipelining may not be worth the cost.

Solutions

Possible solutions to the problems described above include the following strategies:

Timing Variations

To maximize the speed gain, stages must first be chosen to be as uniform as possible in
timing requirements. However, a timing mechanism is needed. A synchronous method could
be used, in which a stage is assumed to be complete in a definite number of clock cycles.
However, asynchronous techniques are generally more efficient. A flag bit or signal line is
passed forward to the next stage indicating when valid data is available. A signal must also be
passed back from the next stage when the data has been accepted.

In all cases there must be a buffer register between stages to hold the data; sometimes this
buffer is expanded to a memory which can hold several data items. Each stage must take care
not to accept input data until it is valid, and not to produce output data until there is room in
its output buffer.

Data Hazards

To guard against data hazards it is necessary for each stage to be aware of the operands in use by stages further down the pipeline. The type of use must also be known, since two successive reads do not conflict and are no cause to slow the pipeline. Only when writing is involved is there a possible conflict.
The pipeline is typically equipped with a small associative check memory which can store the
address and operation type (read or write) for each instruction currently in the pipe. The
concept of "address" must be extended to identify registers as well. Each instruction can
affect only a small number of operands, but indirect effects of addressing must not be
neglected.

As each instruction prepares to enter the pipe, its operand addresses are compared with those already stored. If there is a conflict, the instruction (and usually those behind it) must wait. When there is no conflict, the instruction enters the pipe and its operand addresses are stored in the check memory. When the instruction completes, these addresses are removed. The memory must be associative to handle the high-speed lookups required.
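
In software terms, the check memory behaves roughly like the following Python sketch (a simplified model of my own; real hardware uses an associative memory, and write-after-read conflicts are ignored here for brevity):

# Minimal sketch of the check memory described above: it records which
# operand addresses (or register names) are pending a write by
# instructions already in the pipe, and blocks a new instruction that
# would conflict with one of them.

class CheckMemory:
    def __init__(self):
        self.pending_writes = set()  # addresses still to be written

    def can_enter(self, reads, writes):
        """A new instruction may enter only if none of its operands
        conflict with a pending write (two reads never conflict)."""
        touched = set(reads) | set(writes)
        return not (touched & self.pending_writes)

    def enter(self, writes):
        self.pending_writes |= set(writes)

    def complete(self, writes):
        self.pending_writes -= set(writes)

cm = CheckMemory()
cm.enter(writes={"C"})                                # C = A + B is in the pipe
print(cm.can_enter(reads={"C", "D"}, writes={"E"}))   # False: RAW conflict on C
cm.complete(writes={"C"})                             # first instruction finishes
print(cm.can_enter(reads={"C", "D"}, writes={"E"}))   # True: safe to enter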

Branching

The problem in branching is that the pipeline may be slowed down by a branch instruction
because we do not know which branch to follow. In the absence of any special help in this
area, it would be necessary to delay processing of further instructions until the branch
destination is resolved. Since branches are extremely frequent, this delay would be
unacceptable.

One solution which is widely used, especially in RISC architectures, is deferred branching.
In this method, the instruction set is designed so that after a conditional branch instruction,
the next instruction in sequence is always executed, and then the branch is taken. Thus every
branch must be followed by one instruction which logically precedes it and is to be executed
in all cases. This gives the pipeline some breathing room. If necessary this instruction can be
a no-op, but frequent use of no-ops would destroy the speed benefit.

Use of this technique requires a coding method which is confusing for programmers but not
too difficult for compiler code generators. A widely-used strategy in many current
architectures is some type of branch prediction. This may be based on information provided
by the compiler or on statistics collected by the hardware. The goal in any case is to make the
best guess as to whether or not a particular branch will be taken, and to use this guess to
continue the pipeline.
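
A common concrete form of hardware-collected statistics is the 2-bit saturating counter. The sketch below (a standard textbook mechanism, not something this paper specifies) shows why one anomalous outcome does not flip a strongly established prediction:

# Minimal sketch: a 2-bit saturating counter branch predictor.
# Counter states 0-1 predict "not taken", 2-3 predict "taken"; each
# actual outcome nudges the counter by one, so a single anomalous
# branch does not overturn a strongly established prediction.

class TwoBitPredictor:
    def __init__(self):
        self.counter = 2  # start weakly "taken"

    def predict(self):
        return self.counter >= 2

    def update(self, taken):
        if taken:
            self.counter = min(3, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 1)

p = TwoBitPredictor()
for taken in [True, True, False, True, True]:  # a mostly-taken loop branch
    guess = p.predict()
    print(f"predicted {'taken' if guess else 'not taken'}, "
          f"actually {'taken' if taken else 'not taken'}")
    p.update(taken)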

A more costly solution occasionally used is to split the pipeline and begin processing both
branches. This idea is receiving new attention in some of the newest processors.
Interrupts

The fastest but most costly solution to the interrupt problem would be to include as part of the
saved "hardware state" of the CPU the complete contents of the pipeline, so that all
instructions may be restored to their original state in the pipeline. This strategy is too
expensive in other ways and is not practical.

The simplest solution is to wait until all instructions in the pipeline complete, that is, flush the
pipeline from the starting point, before admitting the interrupt sequence. If interrupts are
frequent, this would greatly slow down the pipeline; moreover, critical interrupts would be
delayed.

A compromise solution identifies a "point of no return," the point in the pipe at which
instructions may first perform an irreversible action such as storing operands. Instructions
which have passed this point are allowed to complete, while instructions that have not
reached this point are cancelled.

Advantages and Disadvantages:-

Pipelining does not help in all cases. There are several possible disadvantages. An instruction
pipeline is said to be fully pipelined if it can accept a new instruction every clock cycle. A
pipeline that is not fully pipelined has wait cycles that delay the progress of the pipeline.

Advantages of Pipelining:

1. The cycle time of the processor is reduced, thus increasing instruction issue-rate in
most cases.
2. Some combinational circuits, such as adders or multipliers, can be made faster by adding more circuitry. If pipelining is used instead, it can save circuitry compared with a more complex combinational circuit.

Disadvantages of Pipelining:

1. A non-pipelined processor executes only a single instruction at a time. This prevents branch delays (in effect, every branch is delayed) and problems with serial instructions being executed concurrently. Consequently the design is simpler and cheaper to manufacture.
2. The instruction latency in a non-pipelined processor is slightly lower than in a
pipelined equivalent. This is due to the fact that extra flip flops must be added to the
data path of a pipelined processor.
3. A non-pipelined processor will have a stable instruction bandwidth. The performance
of a pipelined processor is much harder to predict and may vary more widely between
different programs.
4. One of the major problems in designing an instruction pipeline is assuring a steady flow of instructions to the initial stages of the pipeline. However, 15-20% of instructions in an assembly-level stream are (conditional) branches, and of these, 60-70% take the branch to a target address. Until the instruction is actually executed, it is impossible to determine whether the branch will be taken or not (a rough calculation of the cost follows below).
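
Using the figures quoted above, a rough back-of-envelope calculation in Python shows how much taken branches can cost (the 2-cycle penalty per taken branch is an assumption for illustration only):

# Rough back-of-envelope arithmetic using the figures quoted above,
# not a measurement: the cost of taken branches if each one stalls
# the pipeline for a fixed number of cycles.

branch_fraction = 0.18   # 15-20% of instructions are branches
taken_fraction = 0.65    # 60-70% of branches are taken
stall_cycles = 2         # assumed penalty per taken branch

base_cpi = 1.0           # ideal fully pipelined CPI
effective_cpi = base_cpi + branch_fraction * taken_fraction * stall_cycles
print(f"effective CPI ~= {effective_cpi:.2f}")  # ~1.23: over a 20% slowdown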

Example:-

1) Generic pipeline

[Figure: Generic 4-stage pipeline; the colored boxes represent instructions independent of each other]

The figure shows a generic pipeline with four stages:

1. Fetch
2. Decode
3. Execute
4. Write-back
The top gray box is the list of instructions waiting to be executed; the bottom gray box is the
list of instructions that have been completed; and the middle white box is the pipeline.

Execution proceeds as follows:

Time 0: Four instructions are waiting to be executed.

Time 1:
• the green instruction is fetched from memory

Time 2:
• the green instruction is decoded
• the purple instruction is fetched from memory

Time 3:
• the green instruction is executed (the actual operation is performed)
• the purple instruction is decoded
• the blue instruction is fetched

Time 4:
• the green instruction's results are written back to the register file or memory
• the purple instruction is executed
• the blue instruction is decoded
• the red instruction is fetched

Time 5:
• the green instruction is completed
• the purple instruction is written back
• the blue instruction is executed
• the red instruction is decoded

Time 6:
• the purple instruction is completed
• the blue instruction is written back
• the red instruction is executed

Time 7:
• the blue instruction is completed
• the red instruction is written back

Time 8:
• the red instruction is completed

Time 9: All instructions are executed.

2) Bubble

[Figure: A bubble in cycle 3 delays execution]

When a "hiccup" (difficulty) in execution occurs, a "bubble" is created in the pipeline in which nothing useful happens. In cycle 2, the fetching of the purple instruction is delayed, and the decoding stage in cycle 3 therefore contains a bubble. Everything "behind" the purple instruction is delayed as well, but everything "ahead" of the purple instruction continues with execution.

Clearly, when compared to the execution above, the bubble yields a total execution time of 8 clock ticks instead of 7. Bubbles are like stalls, in which nothing useful happens in fetch, decode, execute, or write-back.
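
Both timings follow from the same counting argument. Here is a minimal Python sketch (my own model, not from the paper) of the tick counts, with and without the bubble:

# Minimal sketch: total clock ticks for n instructions in a k-stage
# pipeline. A bubble occupies a stage slot without doing useful work,
# so every instruction behind it slips by one cycle.

def total_ticks(n_instructions, n_stages, bubbles=0):
    # k + n - 1 ticks to drain an ideal pipeline; each bubble adds one.
    return n_stages + n_instructions - 1 + bubbles

print(total_ticks(4, 4))             # 7 ticks, matching the table above
print(total_ticks(4, 4, bubbles=1))  # 8 ticks once the bubble is inserted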

Implementations:-

Buffered, Synchronous pipelines

Conventional microprocessors are synchronous circuits that use buffered, synchronous pipelines. In the synchronous method, one timing signal causes all outputs of the units to be transferred to the succeeding units. The timing signal occurs at fixed intervals, taking into account the slowest unit. Instruction and arithmetic pipelines both use the synchronous method. There is a staging register between each unit to hold the information, and the clock signal activates all the staging registers simultaneously.
Buffered, Asynchronous pipelines

Asynchronous pipelines are used in asynchronous circuits and have their pipeline registers clocked asynchronously. In the asynchronous method, a pair of "handshaking" signals is used between each unit and the next:

- A ready signal
- An acknowledge signal

The ready signal informs the next unit that the sending unit has finished the present operation and is ready to pass the task and any results onwards. The acknowledge signal is returned when the receiving unit has accepted the task and results.
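
The handshake can be mimicked in software. The following Python sketch (an analogy of my own using threads and a bounded queue, not a hardware description) treats a blocking put() as waiting for the acknowledge signal:

# Software analogy of the ready/acknowledge handshake between two
# asynchronous pipeline stages. The one-deep bounded queue plays the
# role of the buffer register: put() blocks until the receiver has
# made room, the software analogue of waiting for the acknowledge.

import threading
import queue

buffer = queue.Queue(maxsize=1)  # one-deep staging buffer between units

def producer_stage():
    for item in ["fetch-1", "fetch-2", "fetch-3"]:
        buffer.put(item)  # "ready": blocks until the next unit accepts
        print(f"stage 1 handed over {item}")

def consumer_stage():
    for _ in range(3):
        item = buffer.get()  # "acknowledge": accept the task and result
        print(f"stage 2 accepted {item}")
        buffer.task_done()

t1 = threading.Thread(target=producer_stage)
t2 = threading.Thread(target=consumer_stage)
t1.start(); t2.start()
t1.join(); t2.join()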

The AMULET microprocessor is an example of a microprocessor that uses buffered, asynchronous pipelines.

Unbuffered pipelines

Unbuffered pipelines, called "wave pipelines", do not have registers in-between pipeline
stages. Instead, the delays in the pipeline are "balanced" so that, for each stage, the difference
between the first stabilized output data and the last is minimized. Thus, data flows in "waves"
through the pipeline, and each wave is kept as short (synchronous) as possible.

The maximum rate that data can be fed into a wave pipeline is determined by the maximum
difference in delay between the first piece of data coming out of the pipe and the last piece of
data, for any given wave. If data is fed in faster than this, it is possible for waves of data to
interfere with each other.
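
This feed-rate limit is often written as a timing constraint (a commonly quoted form; the notation D_max, D_min, t_s is my own here, not from the paper): with D_max and D_min the longest and shortest propagation delays through the logic, and t_s an allowance for sampling overhead (setup time and clock skew), new data may be launched roughly every

\[
T_{\min} \approx (D_{\max} - D_{\min}) + t_s
\]

rather than every D_max as in a conventional unpipelined block, so the achievable rate is governed by the delay difference, exactly as described above.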
Pipelining Developments:-

In order to make processors even faster, various methods of optimizing pipelines have been
devised.

Super pipelining refers to dividing the pipeline into more steps. The more pipe stages there
are, the faster the pipeline is because each stage is then shorter. Ideally, a pipeline with five
stages should be five times faster than a non-pipelined processor (or rather, a pipeline with
one stage). The instructions are executed at the speed at which each stage is completed, and
each stage takes one fifth of the amount of time that the non-pipelined instruction takes.
Thus, a processor with an 8-step pipeline (the MIPS R4000) will be even faster than its 5-step
counterpart. The MIPS R4000 chops its pipeline into more pieces by dividing some steps into
two. Instruction fetching, for example, is now done in two stages rather than one. The stages
are as shown:

1. Instruction Fetch (First Half)
2. Instruction Fetch (Second Half)
3. Register Fetch
4. Instruction Execute
5. Data Cache Access (First Half)
6. Data Cache Access (Second Half)
7. Tag Check
8. Write Back

Superscalar pipelining involves multiple pipelines in parallel. Internal components of the processor are replicated so it can launch multiple instructions in some or all of its pipeline stages. The RISC System/6000 has a forked pipeline with different paths for floating-point and integer instructions. If there is a mixture of both types in a program, the processor can keep both forks running simultaneously. Both types of instructions share two initial stages (Instruction Fetch and Instruction Dispatch) before they fork. Often, however, superscalar pipelining refers to multiple copies of all pipeline stages (in terms of laundry, this would mean four washers, four dryers, and four people who fold clothes). Many of today's machines attempt to find two to six instructions that they can execute in every pipeline stage. If some of the instructions are dependent, however, only the first instruction or instructions are issued.
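
The issue logic's dependence check can be sketched as follows (my own simplified illustration; real machines check register numbers in hardware):

# Minimal sketch: the issue check a superscalar front end performs on a
# candidate group of instructions. Instructions are issued together only
# up to the first one that depends on an earlier instruction in the group.

def issue_group(window):
    """window: list of (dest, srcs) tuples in program order."""
    issued = []
    written = set()
    for dest, srcs in window:
        if any(s in written for s in srcs) or dest in written:
            break  # depends on an instruction in this group: stop issuing
        issued.append((dest, srcs))
        written.add(dest)
    return issued

window = [("C", ("A", "B")), ("F", ("D", "E")), ("G", ("C", "F"))]
print(issue_group(window))  # first two issue together; G must wait for C, F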

Dynamic pipelines have the capability to schedule around stalls. A dynamic pipeline is divided into three units: the instruction fetch and decode unit, five to ten execute or functional units, and a commit unit. Each execute unit has reservation stations, which act as buffers and hold the operands and operations.
While the functional units have the freedom to execute out of order, the instruction
fetch/decode and commit units must operate in-order to maintain simple pipeline behavior.
When the instruction is executed and the result is calculated, the commit unit decides when it
is safe to store the result. If a stall occurs, the processor can schedule other instructions to be
executed until the stall is resolved. This, coupled with the efficiency of multiple units
executing instructions simultaneously, makes a dynamic pipeline an attractive alternative.
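
At the risk of oversimplifying, a reservation station can be modeled as a buffer that waits for its operands (a very rough Python sketch of my own, far simpler than real out-of-order hardware):

# Minimal sketch: a reservation station buffers an operation until both
# operands are available, so the functional unit can execute out of
# order while fetch/decode and commit stay in order.

class ReservationStation:
    def __init__(self, op):
        self.op = op        # e.g. "add"
        self.operands = {}  # name -> value, filled in as results arrive

    def supply(self, name, value):
        self.operands[name] = value

    def ready(self):
        return len(self.operands) == 2

rs = ReservationStation("add")
rs.supply("A", 3)
print(rs.ready())  # False: still waiting on the second operand
rs.supply("B", 4)
print(rs.ready())  # True: the functional unit may now execute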
Conclusion: -

After completing the paper, I conclude that the more pipe stages there are, the faster the pipeline is, because each stage is then shorter. Ideally, a pipeline with five stages should be five times faster than a non-pipelined processor. Superpipelining is one example in which the number of pipe stages is increased over the previous pipelined structure. This statement is also supported by the examples I have described in the paper. However, I also find that the pipelined structure has some disadvantages: certain problems occur in a pipelined structure, but they can be removed by using their respective possible solutions.
References:-

• http://en.wikipedia.org/wiki/Instruction_pipeline
• www.freepatentsonline.com/5333280.html
• www.cs.princeton.edu/courses/archive/spr02/cs217/lectures/pipeline.pdf
• http://alexandria.tue.nl/extra1/wskrap/publichtml/200612.pdf
• www.csee.wvu.edu/~jdm/classes/cs455/notes/tech/instrpipe.html
• http://www.wipo.int/pctdb/en/wo.jsp?wo=2004084065
• Kai Hwang, "Advanced Computer Architecture: Parallelism, Scalability, Programmability", McGraw-Hill
