
2.2 Pipelining
Pipelining is a technique of decomposing a sequential process into
sub-processes, with each sub-process being executed in a special
dedicated segment that operates concurrently with all other
segments. Any operation that can be decomposed into a sequence
of sub-operations of about the same complexity can be
implemented by a pipeline processor.

2.2.1 Linear Pipeline Processors


A linear pipeline processor is a cascade of processing stages which
are linearly connected to perform a fixed function over a stream of
data flowing from one end to the other. In modern computers,
linear pipelines are applied for instruction execution, arithmetic
computation and memory access operations.

2.2.1.1 Asynchronous and Synchronous Models


A linear pipeline is constructed with k processing stages
(segments). External inputs (operands) are fed into the pipeline at
the first stage S1. The processed results are passed from stage Si to
stage Si+1 for all i = 1, 2, ..., k-1. The final result emerges from
the pipeline at the last stage Sk. Depending on how data flow along
the pipeline is controlled, linear pipelines are modeled in two
categories: asynchronous and synchronous.

Asynchronous Model
As shown in Fig.2-2a, data flow between adjacent stages in an
asynchronous pipeline is controlled by a handshaking protocol.
When stage Si is ready to transmit, it sends a ready signal to stage
Si+1. After stage Si+1 receives the incoming data, it returns an
acknowledge signal to Si.

Synchronous Model
Synchronous pipelines are illustrated in Fig.2-2b. The operands
pass through all segments in a fixed sequence. Each segment
consists of a combinational circuit Si that performs a sub-operation
over the data stream flowing through the pipe. Isolating registers R
(latches) are used to interface between stages and hold the
intermediate results between the stages. Upon the arrival of a clock
pulse, all registers transfer data to the next stage simultaneously.

Fig.2-2 Linear pipeline models: (a) an asynchronous pipeline model, in which stages S1 to Sk exchange Ready and Ack signals; (b) a synchronous pipeline model, in which stages S1 to Sk are separated by registers R driven by a common clock.

The synchronous pipeline is illustrated by the following example.

Example 2.1
In certain scientific computations it is necessary to perform the
arithmetic operation (Ai + Bi)*(Ci + Di) with a stream of numbers.
Specify a pipeline configuration to carry out this task. List the
contents of all registers in the pipeline for i = 1 through 6.

Solution
Each sub-operation is to be implemented in a segment within the
pipeline. Each segment will have one or more registers and a
combinational circuit as shown. The sub-operations performed in
each segment of the pipeline are as follows:

R1 ← Ai, R2 ← Bi, R3 ← Ci, R4 ← Di     Input Ai, Bi, Ci and Di
R5 ← R1 + R2, R6 ← R3 + R4             Perform addition of Ai + Bi and Ci + Di
R7 ← R5 * R6                           Multiply (Ai + Bi)*(Ci + Di)

Fig.2-3 Pipeline for Example 2.1: Segment 1 holds the inputs Ai, Bi, Ci and Di in R1 to R4; Segment 2 contains two adders feeding R5 and R6; Segment 3 contains the multiplier feeding R7.

The seven registers are loaded with new data every clock pulse.
The effect of each clock will be as shown:

Clock    Segment 1         Segment 2        Segment 3
pulse    R1  R2  R3  R4    R5     R6        R7
1        A1  B1  C1  D1    -      -         -
2        A2  B2  C2  D2    A1+B1  C1+D1     -
3        A3  B3  C3  D3    A2+B2  C2+D2     (A1+B1)*(C1+D1)
4        A4  B4  C4  D4    A3+B3  C3+D3     (A2+B2)*(C2+D2)
5        A5  B5  C5  D5    A4+B4  C4+D4     (A3+B3)*(C3+D3)
6        A6  B6  C6  D6    A5+B5  C5+D5     (A4+B4)*(C4+D4)
7        -   -   -   -     A6+B6  C6+D6     (A5+B5)*(C5+D5)
8        -   -   -   -     -      -         (A6+B6)*(C6+D6)
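
The register contents listed above can be reproduced with a short simulation. The following is a minimal sketch, not part of the original example: the function name `simulate` and the use of `None` to mark an empty register are assumptions made for illustration. It clocks the three segments of Example 2.1 and records R7 after every clock pulse.

```python
def simulate(a, b, c, d):
    """Clock the three-segment pipeline of Example 2.1 and return R7 after each pulse."""
    n = len(a)
    r1 = r2 = r3 = r4 = r5 = r6 = r7 = None   # None marks an empty register (assumption)
    outputs = []
    for pulse in range(n + 2):                 # n input pulses plus 2 pulses to drain the pipe
        # All registers load on the same clock edge, so every new value
        # is computed from the OLD register contents before any update.
        new_r7 = r5 * r6 if r5 is not None else None
        new_r5 = r1 + r2 if r1 is not None else None
        new_r6 = r3 + r4 if r3 is not None else None
        if pulse < n:
            new_r1, new_r2, new_r3, new_r4 = a[pulse], b[pulse], c[pulse], d[pulse]
        else:
            new_r1 = new_r2 = new_r3 = new_r4 = None
        r1, r2, r3, r4 = new_r1, new_r2, new_r3, new_r4
        r5, r6, r7 = new_r5, new_r6, new_r7
        outputs.append(r7)
    return outputs


# Six sample operand sets (arbitrary values chosen for illustration).
A = [1, 2, 3, 4, 5, 6]
B = [6, 5, 4, 3, 2, 1]
C = [1, 1, 1, 1, 1, 1]
D = [0, 1, 2, 3, 4, 5]
for pulse, r7 in enumerate(simulate(A, B, C, D), start=1):
    print(pulse, r7)
```

The first product appears in R7 on the third pulse and the sixth on the eighth pulse, matching the table.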

2.2.2 Space-time diagram


This is a diagram that shows segment utilization as a function of
time. The space-time diagram of a four-segment pipeline is shown
in Fig.2-4. The horizontal axis displays the time in clock cycles and
the vertical axis gives the segment number. This diagram shows
six tasks T1 through T6 executed in four segments. The first task
T1 is completed after the fourth (k-th) clock cycle. From then on,
no matter how many segments there are in the system, the pipeline
completes one task every clock period.

Task: A task is defined as the total operation performed going
through all the segments in the pipe.

Clock cycles: 1 2 3 4 5 6 7 8 9
Segments: 1 T1 T2 T3 T4 T5 T6 - - -
2 - T1 T2 T3 T4 T5 T6 - -
3 - - T1 T2 T3 T4 T5 T6 -
4 - - - T1 T2 T3 T4 T5 T6

Fig.2-4

Example 2.2
Draw the space-time diagram for a six-segment pipeline showing
the time it takes to process eight tasks.

Solution
Clock cycles: 1 2 3 4 5 6 7 8 9 10 11 12 13
Segments: 1 T1 T2 T3 T4 T5 T6 T7 T8 - - - - -
2 - T1 T2 T3 T4 T5 T6 T7 T8 - - - -
3 - - T1 T2 T3 T4 T5 T6 T7 T8 - - -
4 - - - T1 T2 T3 T4 T5 T6 T7 T8 - -
5 - - - - T1 T2 T3 T4 T5 T6 T7 T8 -
6 - - - - - T1 T2 T3 T4 T5 T6 T7 T8

It takes 13 clock cycles to process 8 tasks.
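
The diagrams in Fig.2-4 and Example 2.2 follow one rule: at clock cycle c, segment s holds task c - s + 1, provided that number lies between 1 and n. A minimal sketch of that rule is given below; the function name `space_time` is an assumption made for illustration.

```python
def space_time(k, n):
    """Return the rows of a space-time diagram for n tasks in a k-segment pipeline."""
    cycles = k + n - 1                      # total clock cycles needed
    rows = []
    for seg in range(1, k + 1):
        row = []
        for clock in range(1, cycles + 1):
            task = clock - seg + 1          # task occupying this segment at this clock
            row.append(f"T{task}" if 1 <= task <= n else "-")
        rows.append(row)
    return rows


# Six segments, eight tasks, as in Example 2.2.
for seg, row in enumerate(space_time(6, 8), start=1):
    print(seg, " ".join(f"{cell:>3}" for cell in row))
```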

2.2.3 Pipeline Speedup


Consider the case where a k-segment pipeline with a clock cycle
time tp is used to execute n tasks. The first task T1 will take a
time equal to ktp. All remaining n-1 tasks will take a time equal to
(n-1)tp (see section 2.2). Therefore, to complete n tasks through a
k-segment pipeline requires k + (n - 1) clock cycles.
Next, consider a non-pipeline unit that performs the same
operation and takes a time equal to tn to complete each task. The
total time required for n tasks is n tn. The speedup of pipeline
processing over equivalent non-pipeline processing is defined
by the ratio:

$$S = \frac{\text{non-pipeline time}}{\text{pipeline time}} = \frac{n\,t_n}{(k + n - 1)\,t_p}$$

As the number of tasks increases, n becomes much larger than
k - 1, and (k + n - 1) approaches the value of n. Under this
condition, the speedup becomes

$$S = \frac{t_n}{t_p}$$

If we assume that the time it takes to process a task is the same in
the pipeline and non-pipeline circuits, we will have tn = k tp.
Including this assumption, the speedup reduces to

$$S = \frac{t_n}{t_p} = \frac{k\,t_p}{t_p} = k$$

This shows that the theoretical maximum speedup that a pipeline
can provide is k, where k is the number of segments in the
pipeline. Fig.2-5 plots the speedup as a function of n, the number
of tasks performed by the pipeline.

Fig.2-5 Speedup factor versus number of tasks (1 to 1024), plotted for k = 6 and k = 10.

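
The way the speedup climbs toward k can be checked numerically. The sketch below evaluates the speedup ratio n tn / ((k + n - 1) tp) for the two pipeline lengths plotted in Fig.2-5; the function name `speedup`, the sample task counts, and the unit clock period are assumptions made for illustration, and tn = k tp is the assumption stated in the text.

```python
def speedup(k, n, tp, tn):
    """Speedup of a k-segment pipeline over a non-pipelined unit for n tasks."""
    return (n * tn) / ((k + n - 1) * tp)


tp = 1.0                                    # clock period in arbitrary units (assumption)
for k in (6, 10):
    tn = k * tp                             # assumption from the text: tn = k * tp
    for n in (1, 8, 64, 1024):
        print(f"k={k:2d}  n={n:4d}  S={speedup(k, n, tp, tn):.2f}")
```

For a single task the speedup is 1, and it approaches k only when n is much larger than k - 1.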
2.2.4 Pipeline Efficiency
The efficiency Ek of a linear k-segment pipeline is defined as

$$E_k = \frac{\text{speedup}}{\text{number of segments}} = \frac{S_k}{k}$$

If we assume that the time it takes to process a task is the same in
the pipeline and non-pipeline circuits (tn = k tp), the speedup will be

$$S_k = \frac{k\,n\,t_p}{(k + n - 1)\,t_p}$$

Then the efficiency becomes

$$E_k = \frac{S_k}{k} = \frac{n}{k + (n - 1)}$$

Example 2.3
A non-pipeline system takes 50 ns to process a task. The same
task can be processed in a six-segment pipeline with a clock
cycle of 10 ns. Determine the speedup and the efficiency of the
pipeline for 100 tasks. What is the maximum speedup and
efficiency that can be achieved?

Solution
Given:

- For the non-pipeline system: tn = 50 ns
- For the pipeline system: k = 6, tp = 10 ns
- Number of tasks: n = 100

Required:

- Speedup:

$$S = \frac{n\,t_n}{(k + n - 1)\,t_p} = \frac{100 \times 50}{(6 + 100 - 1) \times 10} \approx 4.76$$

- Maximum speedup: the maximum speedup is approached when the number of
  tasks grows much larger than k - 1, so that k - 1 can be neglected.
  The ratio then reduces to

$$S_{max} = \frac{n\,t_n}{n\,t_p} = \frac{t_n}{t_p} = \frac{50}{10} = 5$$

- Efficiency:

$$E = \frac{\text{speedup}}{\text{number of segments}} = \frac{S}{k} = \frac{4.762}{6} \approx 0.794 = 79.4\,\%$$

- Maximum efficiency:

$$E_{max} = \frac{\text{maximum speedup}}{\text{number of segments}} = \frac{S_{max}}{k} = \frac{5}{6} = 83.33\,\%$$
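
The figures in Example 2.3 can be recomputed with a few lines of code. This is a minimal sketch under the definitions given in this section; the function names `speedup` and `efficiency` are assumptions made for illustration.

```python
def speedup(k, n, tp, tn):
    """S = n*tn / ((k + n - 1)*tp)."""
    return (n * tn) / ((k + n - 1) * tp)


def efficiency(k, n, tp, tn):
    """E = S / k."""
    return speedup(k, n, tp, tn) / k


# Data of Example 2.3 (times in ns).
k, tp, tn, n = 6, 10, 50, 100
s = speedup(k, n, tp, tn)
e = efficiency(k, n, tp, tn)
s_max = tn / tp                 # limit of S when n is much larger than k - 1
e_max = s_max / k
print(f"S = {s:.3f}, E = {e:.3f}, S_max = {s_max:.0f}, E_max = {e_max:.4f}")
```

Because tn (50 ns) is smaller than k tp (60 ns) in this example, the maximum speedup is 5 rather than k = 6.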

2.2.5 Arithmetic pipeline


Pipelining techniques can be applied to speed up numerical
arithmetic computations. Pipeline arithmetic units are usually
found in very high speed computers. They are used to implement
floating-point operations, multiplication of fixed-point numbers,
and similar computations encountered in scientific problems.

Fixed-Point Operations
Fixed-point numbers are represented internally in machines in
sign-magnitude, one's complement, or two's complement
notation. Add, subtract, multiply, and divide are the four primitive
arithmetic operations.

Floating-Point Numbers
A floating-point number X is represented by a pair (m, e), where
m is the mantissa (or fraction) and e is the exponent with an
implied base (or radix) r. The algebraic value is represented as

$$X = m \times r^{e}$$

The sign of X can be embedded in the mantissa.

Floating-Point Operations
The four primitive arithmetic operations are defined below for a
pair of floating-point numbers represented by X = (mx, ex) and Y
= (my, ey), assuming ex <= ey and base r = 2.

$$X + Y = (m_x \times 2^{e_x - e_y} + m_y) \times 2^{e_y}$$

$$X - Y = (m_x \times 2^{e_x - e_y} - m_y) \times 2^{e_y}$$

$$X \times Y = (m_x \times m_y) \times 2^{e_x + e_y}$$

$$X \div Y = (m_x \div m_y) \times 2^{e_x - e_y}$$

These operations can be divided into two halves:
One half is for exponent operations such as comparing their
relative magnitudes or adding/subtracting them; the other half is
for mantissa operations including four types of fixed-point
operations.
The floating-point addition and subtraction can be
performed in four segments as shown in Fig.2-6. The registers
labeled R are latches used to interface between segments and to
store intermediate results. The sub-operations that are performed
in the four segments are:

1. Compare the exponents:
   The exponents are compared by subtracting them to determine
   their difference. The larger exponent is chosen as the
   exponent of the result.

2. Align the mantissas:
   The mantissa associated with the smaller exponent is shifted
   to the right a number of times equal to the exponent
   difference.

3. Add or subtract the mantissas:
   The two mantissas are added or subtracted.

4. Normalize the result:
   The result is normalized so that it has a fraction (mantissa)
   with a nonzero first digit.

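Fig.2-6 Four-segment pipeline for floating-point addition and subtraction. Inputs: exponents ex, ey and mantissas mx, my. Segment 1: compare exponents by subtraction. Segment 2: choose exponent and align mantissas. Segment 3: add or subtract mantissas. Segment 4: adjust exponent and normalize result. Registers R separate the segments.
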
The pipeline in Fig.2-6 refers to binary numbers, but in the
following numerical example we use decimal numbers for
simplicity (r = 10). Consider the two normalized floating-point
numbers:

$$X = 0.99 \times 10^{4}$$

$$Y = 0.5 \times 10^{3}$$

The sub-operations are carried out through the four segments as
follows:

Segment 1:
The two exponents are subtracted to obtain 4 - 3 = 1. The larger
exponent 4 is chosen to be the exponent of the result.

Segment 2:
The mantissa of Y is shifted to the right once to obtain

$$X = 0.99 \times 10^{4}$$

$$Y = 0.05 \times 10^{4}$$

This aligns the two mantissas under the same exponent.

Segment 3:
The two mantissas are added to produce the sum

$$Z = 1.04 \times 10^{4}$$

Segment 4:
The sum is adjusted by normalizing the result. This is done by
shifting the mantissa once to the right and incrementing the
exponent to obtain

$$Z = 0.104 \times 10^{5}$$
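
The four sub-operations can also be written as a short routine. The sketch below is an illustration only: the function name `fp_add` and the use of exact fractions for the mantissas are assumptions, chosen so that the decimal example is reproduced without rounding error. A hardware unit would shift fixed-width binary mantissas instead.

```python
from fractions import Fraction


def fp_add(mx, ex, my, ey, r=10):
    """Add X = mx * r**ex and Y = my * r**ey using the four sub-operations."""
    # Segment 1: compare the exponents by subtraction; keep the larger one.
    diff = ex - ey
    e = max(ex, ey)
    # Segment 2: shift right the mantissa that belongs to the smaller exponent.
    if diff >= 0:
        my = my / r ** diff
    else:
        mx = mx / r ** (-diff)
    # Segment 3: add the aligned mantissas.
    m = mx + my
    # Segment 4: normalize so that the first fraction digit is nonzero.
    while abs(m) >= 1:                       # overflow: shift right, increment exponent
        m, e = m / r, e + 1
    while m != 0 and abs(m) < Fraction(1, r):  # leading zeros: shift left, decrement
        m, e = m * r, e - 1
    return m, e


# X = 0.99 x 10^4 and Y = 0.5 x 10^3, as in the numerical example.
m, e = fp_add(Fraction(99, 100), 4, Fraction(5, 10), 3)
print(float(m), e)   # mantissa and exponent of the normalized sum
```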

Example 2.4
The time delays in the four segments in the pipeline of Fig.2-6
are as follows:
t1 = 50 ns, t2 = 30 ns, t3 = 95 ns, t4 = 45 ns. The interface registers
have a delay time tr = 5 ns.
a. How long would it take to add 100 pairs of numbers in the
pipeline?
b. How can we reduce the total time to about one-half of the
time calculated in part (a)?

Solution

(a)
The time taken to add n pairs of numbers in a k-segment
pipeline = (k + n - 1) tp

Given: k = 4 and n = 100

Calculating the clock cycle tp:

The total time delay in a segment i:
TDi = the time delay in the segment circuit + the time delay of
the interface register

⇒ The maximum time delay of all segments:
TDmax = the maximum time delay in segment circuits + the time
delay of the interface register
= 95 ns + 5 ns = 100 ns

As the segments have different time delays, the clock cycle must
be larger than or equal to the maximum delay.
⇒ The clock cycle tp = 100 ns

⇒ The time taken to add 100 pairs of numbers in the 4-segment
pipeline = (k + n - 1) tp
= (4 + 100 - 1) × 100 ns
= 10 300 ns = 10.3 µs
(b)
To reduce the time calculated in part (a) to about one-half, we have
to halve the clock cycle tp. We can achieve this if we can reduce
the maximum time delay TDmax to 50 ns.

This can be reached by reducing t1 from 50 ns to 45 ns and t3 from
95 ns to 45 ns, so that no segment circuit takes longer than 45 ns.
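
The arithmetic of Example 2.4 can be checked with the sketch below. The function names `clock_cycle` and `total_time` are assumptions made for illustration; all times are in nanoseconds.

```python
def clock_cycle(segment_delays, register_delay):
    """The clock period must cover the slowest segment plus its interface register."""
    return max(segment_delays) + register_delay


def total_time(k, n, tp):
    """Time for n tasks in a k-segment pipeline: (k + n - 1) * tp."""
    return (k + n - 1) * tp


tr = 5                                    # interface register delay (ns)
delays = [50, 30, 95, 45]                 # t1..t4 from Example 2.4 (ns)
tp = clock_cycle(delays, tr)
t_a = total_time(len(delays), 100, tp)    # part (a)

faster = [45, 30, 45, 45]                 # part (b): t1 and t3 reduced to 45 ns
tp_b = clock_cycle(faster, tr)
t_b = total_time(len(faster), 100, tp_b)

print(f"(a) tp = {tp} ns, total = {t_a} ns")
print(f"(b) tp = {tp_b} ns, total = {t_b} ns")
```

With every segment circuit at 45 ns or less, the clock period drops from 100 ns to 50 ns and the total time halves.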

2.2.6 Instruction Pipeline


An instruction pipeline reads consecutive instructions from
memory while previous instructions are being executed in other
segments. This causes the instruction fetch and execute phases to
overlap and perform simultaneous operations.

Instruction cycle
Computers with complex instructions (CISC) require other
phases in addition to the fetch and execute to process an
instruction completely. In the most general case, the computer
needs to process each instruction with the following sequence of
steps:

1. Fetch the instruction from memory.
2. Decode the instruction.
3. Calculate the effective address.
4. Fetch operands from memory.
5. Execute the instruction.
6. Store the result.

There are certain difficulties that will prevent the instruction
pipeline from operating at its maximum rate. Different segments
may take different times to operate on the incoming information.
Therefore, the design of an instruction pipeline will be
most efficient if the instruction cycle is divided into segments of
equal duration.

2.2.6.1 Four Segment Instruction Pipeline


A pipelined processor may process each instruction in four steps,
as follows:

F Fetch: Read the instruction from the memory.
D Decode: Decode the instruction and fetch the source
operand(s).
E Execute: Perform the operation specified by the
instruction.
W Write: Store the result in the destination location.

Fig.2-7 shows how the instruction cycle in the CPU can be
processed with a four-segment pipeline.

An instruction in the sequence may be a program control type
that causes a branch out of normal sequence. In that case, the
pending operations in the last two segments are completed and
the instruction buffer is emptied. The pipeline then restarts from
the new address stored in the program counter. Similarly, an
interrupt request, when acknowledged, will cause the pipeline to
empty and restart from a new address value, which is the
beginning of an interrupt service routine.

Fig.2-7 Four-segment CPU pipeline. Segment 1: fetch instruction from memory. Segment 2: decode instruction and fetch the source operand. Segment 3: perform the operation specified by the instruction. Segment 4: store the result. A branch or an acknowledged interrupt (interrupt handling) updates the PC and empties the pipe.

Fig.2-8 shows the operation of the four-segment pipeline.

Clock cycles:       1   2   3   4   5   6   7
Instructions: I1    F1  D1  E1  W1  -   -   -
              I2    -   F2  D2  E2  W2  -   -
              I3    -   -   F3  D3  E3  W3  -
              I4    -   -   -   F4  D4  E4  W4

Fig.2-8

For a variety of reasons, one of the pipeline stages may not be
able to complete its processing task for a given instruction in the
time allotted. Any condition that causes the pipeline to deviate
from its normal operation (to stall) is called a hazard. In general
there are three major hazards:

Data hazard.
Instruction hazard.
Structural hazard.

2.2.6.2 Data hazards


A data hazard is any condition in which either the source or the
destination operands of an instruction are not available at the
time expected in the pipeline. As a result some operation has to
be delayed, and the pipeline stalls. An example of this situation
arises when a source operand of an instruction is the result of the
previous instruction. The dependent instruction may be in the D
segment while the previous instruction is still in the E segment,
which means that the result of the previous instruction has not yet
been stored in the destination when the dependent instruction
requests it. This causes a conflict.
Pipelined computers deal with such conflicts in a variety of ways.

Hardware Interlock
The most straightforward method is to insert a hardware interlock.
An interlock is a circuit that detects instructions whose source
operands are destinations of instructions farther up in the
pipeline. Detection of this situation causes the instruction whose
source is not available to be delayed by enough clock cycles to
resolve the conflict. Hardware is used to insert the required
delays.

Operand forwarding
Instead of only transferring an ALU result into a destination
register, the hardware checks the destination operand, and if it is
needed as a source in the next instruction, it passes the result
directly into the ALU input. Fig.2-9 shows a part of a pipeline
processor which carries out this task.

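Fig.2-9 Operand forwarding: a forwarding path returns the result at the output of the E segment (Execute, ALU) to its source input, alongside the normal path to the W segment (Write, register file).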

Example 2.5
Consider the four instructions in the following program. Suppose
that the first instruction starts from step 1 in the pipeline used in
Fig.2-8. Specify what operations are performed in the four
segments during step 5 and step 6, assuming:

(a) The pipeline system uses the hardware interlock technique to
handle data hazards.
(b) The pipeline system uses the operand forwarding
technique in Fig.2-9 to handle data hazards.

LOAD     R1 ← M[312]
ADD      R2 ← R2 + M[313]
INC      R3 ← R4 + 1
STORE    M[314] ← R3

Solution
The STORE instruction will cause a data conflict, because its source
operand (R3) is the result of the previous instruction (INC). This
results in a data hazard. Without any hazard handling, the timing
for the pipeline of the above program would be as follows:

Clock cycles:          1   2   3   4   5   6   7
Instructions: LOAD     F1  D1  E1  W1  -   -   -
              ADD      -   F2  D2  E2  W2  -   -
              INC      -   -   F3  D3  E3  W3  -
              STORE    -   -   -   F4  D4  E4  W4

(a)
In a pipeline system using hardware interlock, the interlock
circuit will detect that the STORE instruction's source operand is
the destination of the INC instruction. We assume that the
interlock circuit detects such an instruction after it passes the F
segment. After detecting this situation, the interlock will cause
the STORE instruction to be delayed for 2 clock cycles, which is
the minimum number of cycles needed for the INC instruction to
write its result to the destination (R3). The timing of the pipeline
will become as follows:

Clock cycles:          1   2   3   4   5   6   7   8   9
Instructions: LOAD     F1  D1  E1  W1  -   -   -   -   -
              ADD      -   F2  D2  E2  W2  -   -   -   -
              INC      -   -   F3  D3  E3  W3  -   -   -
              STORE    -   -   -   F4  -   -   D4  E4  W4

As shown above, at step 5 and 6, the situation will be as follows:

Instruction   Step 5                                 Step 6
LOAD          Completed                              Completed
ADD           Writing the result to destination (R2) Completed
INC           Performing the operation (R4 + 1)      Writing the result to destination (R3)
STORE         Delayed                                Delayed

(b)
In a pipeline system using operand forwarding, as shown in Fig.2-9,
the result of an instruction can be fed back directly into the E
segment in the case of data dependency. In the above program,
hardware will detect the data dependency between the INC and the
STORE instructions. This will cause the result of the INC
instruction to be passed directly into the input of the E segment.
The timing of the pipeline will become as follows:

Clock cycles:          1   2   3   4   5   6   7
Instructions: LOAD     F1  D1  E1  W1  -   -   -
              ADD      -   F2  D2  E2  W2  -   -
              INC      -   -   F3  D3  E3  W3  -
              STORE    -   -   -   F4  D4  E4  W4

In the above figure, it can be seen that no delays are
needed, as the operand will be forwarded directly from the output
of the E segment to its input. In this situation, fetching the source
operand is skipped in the D segment of the STORE instruction.
At step 5 and 6, the situation will be as follows:

Instruction   Step 5                                 Step 6
LOAD          Completed                              Completed
ADD           Writing the result to destination (R2) Completed
INC           Performing the operation (R4 + 1)      Writing the result to destination (R3)
STORE         Decoding the instruction               Performing the operation (no operation to be performed)
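
The two timings of Example 2.5 can be derived mechanically. The sketch below is a simplified model, not a description of real interlock hardware: the function name `schedule`, the tuple layout of the program, and the rule that an instruction waits in front of the D segment until the producing instruction has finished W are assumptions made for illustration. Memory operands are not tracked as register sources.

```python
def schedule(program, forwarding=False):
    """Return the clock cycle at which each instruction occupies F, D, E and W.

    program: list of (name, destination register or None, set of source registers).
    """
    timing = []
    for name, dest, sources in program:
        # The F segment is free once the previous instruction has moved on to D.
        f = timing[-1]["D"] if timing else 1
        d = f + 1
        if not forwarding:
            # Interlock: hold the instruction before D until every earlier
            # instruction that produces one of its sources has completed W.
            for earlier, (_, earlier_dest, _) in zip(timing, program):
                if earlier_dest is not None and earlier_dest in sources:
                    d = max(d, earlier["W"] + 1)
        timing.append({"name": name, "F": f, "D": d, "E": d + 1, "W": d + 2})
    return timing


program = [
    ("LOAD",  "R1", set()),      # R1 <- M[312]
    ("ADD",   "R2", {"R2"}),     # R2 <- R2 + M[313]
    ("INC",   "R3", {"R4"}),     # R3 <- R4 + 1
    ("STORE", None, {"R3"}),     # M[314] <- R3
]
for mode in (False, True):
    print("forwarding" if mode else "interlock")
    for row in schedule(program, forwarding=mode):
        print("  ", row)
```

Under this model the interlock case places the D step of STORE in cycle 7, while the forwarding case keeps it in cycle 5.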

2.2.6.3 Instruction hazards


The purpose of the instruction fetch unit is to supply the
execution units with a steady stream of instructions. Whenever
this stream is interrupted, the pipeline stalls, as in the case of a
cache miss. A branch instruction may also cause the pipeline to
stall. Fig.2-10 illustrates the effect of branching on the instruction
stream.

Fig.2-10 Effect of a branch instruction (I2) on the instruction stream of the four-segment pipeline over clock cycles 1 to 9: (a) successful branch condition, where I3 and I4 are discarded and execution continues with Ik and Ik+1; (b) unsuccessful branch condition, where execution continues with I3 and I4.

Fig.2-10 shows a sequence of instructions being
executed in a four-stage pipeline. Instructions I1 to I4 are stored
at successive memory addresses, and I2 is a branch instruction.
Let the branch target be instruction Ik. In clock cycle 3, the fetch
operation for instruction I3 is in progress at the same time the
branch instruction is being decoded and the target address
computed. In clock cycle 4, if the branch is taken (Fig.2-10a), the
processor must discard I3 and fetch instruction Ik. If the branch is
not taken (Fig.2-10b), I3 can be used. In the meantime, the
hardware unit responsible for the Execute (E) step must be told
to do nothing during that clock period. Thus, the pipeline is
stalled for two clock cycles. The time lost as a result of a branch
instruction is often referred to as the branch penalty.

Pipelined computers employ various hardware techniques
to minimize the performance degradation caused by instruction
branching.

Pre-fetch target instruction


One way of handling a conditional branch is to pre-fetch the
target instruction in addition to the instruction following the
branch. Both are saved until the branch is executed. If the branch
condition is successful, the pipeline continues from the branch
target instruction.

Branch target buffer (BTB)

Another possibility is the use of a branch target buffer or BTB.
The BTB is an associative memory included in the fetch segment of
the pipeline. Each entry of the BTB consists of the address of a
previously executed branch instruction and the target instruction
of that branch. It also stores the next few instructions after the
branch target instruction. When the pipeline decodes a branch
instruction, it searches the BTB for the address of the
instruction. If it is in the BTB, the instruction is available directly
and fetch continues from the new path.
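
The lookup behaviour of a BTB can be pictured as a small table keyed by branch address. The sketch below is a software analogy only: the class name `BranchTargetBuffer`, the capacity of four entries, and the oldest-first replacement policy are all assumptions made for illustration, since the text does not specify how entries are replaced.

```python
class BranchTargetBuffer:
    """Maps the address of a previously executed branch to its stored target path."""

    def __init__(self, capacity=4):
        self.capacity = capacity          # number of entries (assumed value)
        self.entries = {}                 # branch address -> list of target-path instructions

    def lookup(self, branch_address):
        """Return the stored target instructions, or None on a miss."""
        return self.entries.get(branch_address)

    def record(self, branch_address, target_instructions):
        """Store the target instruction and the next few instructions after it."""
        if branch_address not in self.entries and len(self.entries) >= self.capacity:
            oldest = next(iter(self.entries))   # dicts keep insertion order
            del self.entries[oldest]            # oldest-first eviction (assumption)
        self.entries[branch_address] = list(target_instructions)


btb = BranchTargetBuffer(capacity=2)
print(btb.lookup(0x104))                  # miss: the branch has not been executed yet
btb.record(0x104, ["Ik", "Ik+1"])
print(btb.lookup(0x104))                  # hit: fetch can continue from the stored path
```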

Loop buffer
A loop buffer is a small, very high-speed register file included in
the fetch segment of the pipeline. When a program loop is detected
in the program, it is stored in the loop buffer, including all
branches. The program loop can be executed directly without
accessing memory until the loop mode is removed by the final
branch.
Branch prediction
A pipeline with branch prediction uses some additional logic to
guess the outcome of a conditional branch instruction before it is
executed. A correct prediction eliminates the wasted time caused
by the branch penalties.

2.2.6.4 Structural hazards


This is the situation when two instructions require the use of a
given hardware resource at the same time. The most common
case is in access to memory. One instruction may need to access
memory as part of the Execute (E) or Write (W) stage, while
another instruction is being fetched. Many processors use
separate instruction and data memories to avoid this conflict.

2.2.7 RISC Instruction Pipeline


Reduced Instruction Set Computers (RISC) are able to use an
efficient instruction pipeline by taking advantage of the following
RISC characteristics:

• Because of the fixed-length instruction format, the
  decoding of the operation can occur at the same time as the
  register selection.
• All data-manipulation instructions have register-to-register
  operations. Since all operands are in registers, there is no
  need for calculating the effective address or fetching
  operands from memory.

Single cycle instruction execution


One of the major advantages of RISC over CISC is that each RISC
pipeline segment requires just one clock cycle, while
CISC uses many segments in its pipeline, with the longest
segment requiring two or more clock cycles.

Compiler support
Instead of designing hardware to handle difficulties (hazards)
associated with data conflicts and branch penalties, RISC
processors rely on the efficiency of the compiler to detect and
minimize the delays.

2.2.7.1 Three-Segment Instruction Pipeline


The instruction cycle for a RISC computer can be divided into
three sub-operations and implemented in three segments:

I Instruction fetch: fetches the instruction from the
program memory.

A ALU operation: decodes the instruction and
performs an ALU operation. An ALU operation may
be one of three types: data manipulation, data
transfer, or program control.

E Execute instruction: directs the output of the ALU to
the destination.

Fig.2-11 shows the operation of the three-segment pipeline.

Clock cycles:       1   2   3   4   5   6   7
Instructions: I1    I   A   E   -   -   -   -
              I2    -   I   A   E   -   -   -
              I3    -   -   I   A   E   -   -

Fig.2-11
The following sections illustrate how RISC computers use the
compiler to handle data and instruction hazards.
Delayed load
The compiler of RISC computers is designed to detect data
conflicts (data hazards) and reorder the instructions as necessary
to delay the loading of the conflicting data, inserting no-operation
instructions where needed. For example, consider the two
instructions (written in Berkeley RISC I format):

ADD R1, R2, R3      R3 ← R1 + R2
SUB R3, R4, R5      R5 ← R3 - R4

These give rise to a data dependency (data hazard). The result of
the Add instruction is placed into register R3, which in turn is one
of the two source operands of the Subtract instruction. There will
be a conflict in the Subtract instruction. This can be seen from the
pipeline timing shown in Fig.2-12a. The A segment in clock
cycle 3 is using data from R3, which will not be the correct value
since the Addition operation is not yet completed. When the
compiler detects such a situation, it searches for a useful
instruction to put after the Add; if it cannot find such an
instruction, it inserts a no-operation instruction, as illustrated in
Fig.2-12b. This is a type of instruction that is fetched from memory
but has no operation, thus wasting a clock cycle.

Clock cycles:        1   2   3   4   5   6   7
Instructions: ADD    I   A   E   -   -   -   -
              SUB    -   I   A   E   -   -   -

(a) Pipeline timing with data conflict

Clock cycles:        1   2   3   4   5   6   7
Instructions: ADD    I   A   E   -   -   -   -
              NOP    -   I   A   E   -   -   -
              SUB    -   -   I   A   E   -   -

(b) Pipeline timing with delayed load

Fig.2-12
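
The fallback step of the delayed load, inserting a no-operation instruction when no useful instruction is available, can be sketched as follows. This is an illustration only: the function name `insert_delays` and the tuple layout (operation, source registers, destination register) are assumptions, and the search for a useful instruction to fill the slot is deliberately left out.

```python
NOP = ("NOP", (), None)


def insert_delays(program):
    """Insert a NOP wherever an instruction reads the register written by the one just before it."""
    result = []
    for op, sources, dest in program:
        if result:
            _, _, previous_dest = result[-1]
            # In the three-segment pipeline the previous result is written in its
            # E segment, the same cycle in which this instruction's A segment reads.
            if previous_dest is not None and previous_dest in sources:
                result.append(NOP)
        result.append((op, sources, dest))
    return result


program = [
    ("ADD", ("R1", "R2"), "R3"),   # R3 <- R1 + R2
    ("SUB", ("R3", "R4"), "R5"),   # R5 <- R3 - R4, depends on R3
]
for op, sources, dest in insert_delays(program):
    print(op, sources, dest)
```

Applied to the ADD and SUB pair above, the sketch yields the ADD, NOP, SUB order of Fig.2-12b.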

Delayed Branch
In Fig.2-10a, the processor fetches instruction I3 before it
determines whether the current instruction I2 is a branch
instruction. When execution of I2 is completed and a branch is to
be made, the processor must discard I3 and fetch the instruction
at the branch target Ik. The location following the branch
instruction is called a branch delay slot. The instructions in the
delay slots are always fetched and at least partially executed
before the branch decision is made and the branch target address
is computed. The compiler for a processor that uses delayed
branches is designed to analyze the instructions before and after
the branch and rearrange the program sequence by inserting
useful (or no-operation) instructions in the delay slots.
RISC computers use delayed branch to handle branch related
instruction hazards. An example of delayed branch is shown in
Fig.2-13. The program sequence for this example consists of the
following instructions (written in Berkeley RISC I format).

AND R0, #10, R0     R0 ← R0 ∧ #10
SLL R2, #1, R1      R1 ← Shift-left R2 once
ADD R3, R4, R5      R5 ← R3 + R4
JMP Z, #2(R6)       PC ← R6 + 2 (Conditional jump, result = 0)
SUB R7, #3, R6      R6 ← R7 - 3

In the above program sequence, the branch (JMP) instruction
must be followed by two delay slots to avoid incorrect fetching
of the SUB instruction (in the case of a successful branch
condition). Instead of inserting no-operation instructions (as in
Fig.2-13a), the compiler searches for two useful instructions to
place in the delay slots. The AND instruction and the shift-left
(SLL) instruction are a good choice, as their source operands and
results are not related to any other instruction. The
program will be rearranged as follows.

ADD R3, R4, R5      R5 ← R3 + R4
JMP Z, #2(R6)       PC ← R6 + 2 (Conditional jump, result = 0)
AND R0, #10, R0     R0 ← R0 ∧ #10
SLL R2, #1, R1      R1 ← Shift-left R2 once
SUB R7, #3, R6      R6 ← R7 - 3

The pipeline timing for this new arrangement is shown in Fig.2-13b.

Clock cycles:        1   2   3   4   5   6   7   8   9
Instructions: AND    I   A   E   -   -   -   -   -   -
              SLL    -   I   A   E   -   -   -   -   -
              ADD    -   -   I   A   E   -   -   -   -
              JMP    -   -   -   I   A   E   -   -   -
              NOP    -   -   -   -   I   A   E   -   -
              NOP    -   -   -   -   -   I   A   E   -
              SUB    -   -   -   -   -   -   I   A   E

(a) Using no-operation instructions

Clock cycles:        1   2   3   4   5   6   7
Instructions: ADD    I   A   E   -   -   -   -
              JMP    -   I   A   E   -   -   -
              AND    -   -   I   A   E   -   -
              SLL    -   -   -   I   A   E   -
              SUB    -   -   -   -   I   A   E

(b) Rearranging the instructions

Fig.2-13