Pipelining
CIT 595
Spring 2007
Sequential Laundry
Pipelined Laundry
In each stage, one phase of an instruction is carried out, and the stages of successive instructions are overlapped
$t_n = k \times t_p$
[Figure: pipelined datapath with Data and Control paths, pipeline registers S2/S3, S3/S4, S4/S5, S5/S6, and a CONTROL unit.]
Note:
Since Evaluate Address and Execute both use ALU, we can make
this one stage
The Operand Fetch is separated into Register Fetch and Memory
Access (one phase is split into two stages)
Store consists of only register writes (not memory writes)
Memory Write is part of Memory Access
Thus we have a total of 6 stages
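The overlap of the six stages can be sketched as a small schedule builder. This is only an illustration of an ideal pipeline with no stalls: the stage names follow the note above (the name "Fetch" for S1 is an assumption), and the function names and the zero-based cycle numbering are hypothetical choices, not taken from the slides.

```python
# Stage names per the note above; "Fetch" for S1 is an assumption.
STAGES = ["Fetch", "Decode", "Register Fetch", "Execute/EA",
          "Memory Access", "Write Back"]


def schedule(num_instructions, num_stages=len(STAGES)):
    """Return {instruction index: {cycle: stage number}} for an ideal pipeline.

    Instruction i (0-based) is in stage s (1-based) during cycle i + s - 1:
    one new instruction starts every cycle and nothing stalls.
    """
    return {i: {i + s - 1: s for s in range(1, num_stages + 1)}
            for i in range(num_instructions)}


def total_cycles(num_instructions, num_stages=len(STAGES)):
    """Cycles needed to finish all instructions: k + n - 1."""
    return num_stages + num_instructions - 1


def render(num_instructions):
    """Build one text row per instruction, one column per cycle."""
    table = schedule(num_instructions)
    width = total_cycles(num_instructions)
    rows = []
    for i in range(num_instructions):
        cells = ["S%d" % table[i][c] if c in table[i] else "  "
                 for c in range(width)]
        rows.append("i%d " % (i + 1) + " ".join(cells))
    return rows
```

Calling `render(2)` gives two rows in which the second instruction is shifted right by one cycle, which is the overlap the laundry analogy describes.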
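Recall
$$\text{CPU Time} = \frac{\text{time}}{\text{program}} = \frac{\text{time}}{\text{cycle}} \times \frac{\text{cycles}}{\text{instruction}} \times \frac{\text{instructions}}{\text{program}} = \text{Clock Cycle time} \times \text{CPI} \times \text{Instruction Count}$$
5 stage to 10 stage: clock cycle time reduced by one half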
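As a quick numeric check of the CPU time relation, here is a minimal sketch. The instruction count and cycle times are made-up values chosen only for illustration, and the sketch assumes CPI and instruction count stay the same when the pipeline is deepened, which real designs do not guarantee.

```python
def cpu_time(instruction_count, cpi, clock_cycle_time):
    """CPU time = Instruction Count x CPI x Clock Cycle time."""
    return instruction_count * cpi * clock_cycle_time


# Hypothetical workload: 1000 instructions at an ideal CPI of 1.
five_stage = cpu_time(1000, 1.0, 2.0)   # 2.0 ns cycle (made-up value)
ten_stage = cpu_time(1000, 1.0, 1.0)    # clock cycle time reduced by one half
ratio = five_stage / ten_stage          # how much faster the 10-stage case is
```

With everything else held equal, halving the clock cycle time halves the CPU time.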
Pipeline Hazards
1. Structural Hazard
2. Data Hazard
3. Control Hazard
Structural Hazard
Occurs when hardware cannot support a combination of
instructions that we want to execute in parallel
Data Hazard
Example 1:
i1: ADD R1, R2, R3
i2: AND R5, R1, R4
Example 2:
i1: ADD R1, R2, R3
i2: ST R1, A
Example 3:
i1: LD R1, A
i2: ADD R2, R1, R2
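What the three examples have in common is that i2 reads a register that i1 writes. A minimal sketch of that check follows. It assumes the operand conventions visible in the examples (first operand is the destination, except for ST, whose first operand is the source register); the function names and the register pattern R0 to R7 are assumptions made for illustration.

```python
import re

REGISTER = re.compile(r"R[0-7]$")  # assumed register names R0..R7


def registers(operands):
    """Keep only the operands that name a register (labels such as A are dropped)."""
    return {op for op in operands if REGISTER.match(op)}


def reads_and_writes(instr):
    """Split e.g. 'ADD R1, R2, R3' into (set of registers read, register written)."""
    opcode, rest = instr.split(None, 1)
    operands = [op.strip() for op in rest.split(",")]
    if opcode == "ST":
        # ST Rs, A reads Rs and writes memory only, so no register is written.
        return registers(operands), None
    # ADD / AND / LD: the first operand is the destination register.
    return registers(operands[1:]), operands[0]


def raw_hazard(i1, i2):
    """True if i2 reads a register that i1 writes (read-after-write)."""
    _, written = reads_and_writes(i1)
    read, _ = reads_and_writes(i2)
    return written is not None and written in read
```

For instance, `raw_hazard("LD R1, A", "ADD R2, R1, R2")` flags Example 3, while a pair with no shared register is not flagged.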
S2: Decode
S3: Register Fetch
S4: Execute/EA
S5: Memory Access (LD/ST)
S6: Write Back (register write)
[Timing diagram, cycles 0 to 7: i1 passes through S1 to S6; i2 follows one cycle behind through S1 to S6.]
Annotations: i2 fetching R1 gets a stale value; i1 is completed (i.e. the value is written to R1) only later.
Solution to Example 1
[Timing diagram, cycles 0 to 8: i1 passes through S1 to S6; i2 completes S1 and S2, stalls (shown as two ".." entries), then continues through S3 to S6.]
[Timing diagram, cycles 0 to 7: i1 passes through S1 to S6; i2 follows one cycle behind through S1 to S6.]
S2: Decode
S3: Register Fetch
S4: Execute/EA
Both i1 and i2 use stage 4, but 1 cycle apart
S2: Decode
S3: Register Fetch
S4: Execute/Evaluate Addr
S5: Memory Access (LD/ST)
S6: Write Back (register write)
[Timing diagram, cycles 0 to 7: i1 passes through S1 to S6; i2 follows one cycle behind through S1 to S6.]
Annotations:
i2 fetches old value of R1
i2 stores old value of R1
i1 completed, i.e. value is written to R1
Solution to Example 2
Stall the Pipeline
[Timing diagram, cycles 0 to 9: i1 passes through S1 to S6; i2 completes S1 and S2, stalls for three cycles (shown as ".."), then continues through S3 to S6.]
Forwarding
Realize that the data value (to be put in R1) is actually available at
the end of cycle 4 and is also propagated through to the next stage
Don't need to wait till cycle 6; forward a copy of the data to i2's
stage S5
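The trade-off between stalling and forwarding can be sketched with a simple stage-distance model. This is a simplified sketch, not the slides' method: it assumes a value produced at the end of a stage can be used by the consumer in any later cycle, and the function name `stalls_needed` and its parameters are hypothetical.

```python
def stalls_needed(produce_stage, consume_stage, distance=1):
    """Stall cycles for a consumer issued `distance` cycles after its producer.

    Assumption (not from the slides): a value produced at the end of stage p
    of the producer is usable from the next cycle on, so the consumer's
    stage c must fall at least one cycle after the producer's stage p.
    """
    return max(0, produce_stage - consume_stage - distance + 1)


# Without forwarding: the value is written in S6 and read in S3 (Register Fetch).
no_forwarding = stalls_needed(produce_stage=6, consume_stage=3)

# Example 2 with forwarding: the ALU result of ADD (end of S4) goes to ST's S5.
example2 = stalls_needed(produce_stage=4, consume_stage=5)

# Example 3 with forwarding: LD data (end of S5) is needed by ADD in S4.
example3 = stalls_needed(produce_stage=5, consume_stage=4)
```

Under this model the no-forwarding case needs 3 stall cycles, Example 2 with forwarding needs none, and the load-use case of Example 3 still needs 1.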
[Timing diagram, cycles 0 to 7: i1 passes through S1 to S6; i2 follows one cycle behind through S1 to S6.]
[Two further timing diagrams, cycles 0 to 7, each showing i1 and i2 passing through S1 to S6.]
Example 3
load-use hazard: i2 fetching R1 gets stale value
Solution to Example 3
[Timing diagram: i1 passes through S1 to S6; i2 completes S1 to S3, stalls for one cycle (shown as ".."), then continues through S4 to S6.]
[Figure: datapath fragment showing PC, IR, BR, NZP, SEXT (9 to 16 bits), and the ALU ADD with PC + 1.]
Control Hazards
Control Hazard
The problem with branch instructions is that:
[Timing diagram, cycles 0 to 7: i1 (instruction before BR), i2 (BR) and i3 (instruction after BR) each pass through S1 to S6, one cycle apart. Annotations: branch address resolved; 3 delay slots.]
So ultimately we cannot do anything till cycle 5, but we have fetched i3
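The cost of waiting for the branch to resolve can be estimated with a rough average-CPI calculation. This is a sketch under stated assumptions: every branch pays the full penalty, the base CPI is the ideal 1, and the 20% branch fraction is a made-up number; the function name `effective_cpi` is hypothetical.

```python
def effective_cpi(branch_fraction, penalty_cycles, base_cpi=1.0):
    """Average CPI when every branch costs `penalty_cycles` extra cycles."""
    return base_cpi + branch_fraction * penalty_cycles


# Made-up workload: 20% of instructions are branches, 3 wasted cycles each.
cpi = effective_cpi(branch_fraction=0.20, penalty_cycles=3)
```

With these illustrative numbers the average CPI rises from 1 to about 1.6, which is why the later slides look at predicting the branch outcome.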
[Timing diagram: i1 (instruction before BR), i2 (BR), i3 (instruction after BR), i4 and i5 (target or PC + 1) enter the pipeline in successive cycles and pass through S1 to S6. Annotations: actual condition learnt; may be aborted (twice).]
Cannot do anything about i3 (it will be fetched)
Also update the prediction unit with the actual result, i.e. the
outcome of the current branch instruction (for future branches)
Types of Prediction
Static (wired/fixed)
Always guess taken or not taken
Effective only with loop control structure
Dynamic
Additional hardware in the datapath keeps track of
the branch history as the instructions are executing
E.g. 2-bit branch predictor using saturating counters
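[State diagram: four states, Strongly Taken (11), Weakly Taken, Weakly Not Taken and Strongly Not Taken (00), with the two weak states encoded 10 and 01; T (taken) and NT (not taken) outcomes move the counter between states.]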
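A 2-bit saturating counter can be sketched in a few lines. The slide gives 11 for Strongly Taken and 00 for Strongly Not Taken; this sketch assumes the usual encoding for the weak states (10 weakly taken, 01 weakly not taken) and moves one step per outcome. The class and function names are hypothetical.

```python
class TwoBitPredictor:
    """2-bit saturating counter: 0 = strongly not taken ... 3 = strongly taken."""

    def __init__(self, state=0):
        self.state = state

    def predict(self):
        """Predict taken when the counter is in one of the two 'taken' states."""
        return self.state >= 2

    def update(self, taken):
        """Move one step toward the actual outcome, saturating at 0 and 3."""
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def mispredictions(outcomes, state=0):
    """Count wrong predictions over a sequence of actual branch outcomes."""
    predictor = TwoBitPredictor(state)
    wrong = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            wrong += 1
        predictor.update(taken)
    return wrong


# A loop branch taken 9 times and then falling through once.
loop = [True] * 9 + [False]
```

Starting from Strongly Taken, the loop pattern above is mispredicted only once (at the loop exit), and the predictor still guesses taken the next time the loop is entered.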
[Timing diagram, cycles 0 to 7: i1 (JUMP), i2 (instruction after JUMP) and i3 each pass through S1 to S6, one cycle apart. Annotation: jump discovered.]
Exceptions
Exceptions are used for signaling a certain condition
Pipelining Complications
[Timing diagram, cycles 0 to 7: i1 (LDR) and i2 (ADD) each pass through S1 to S6, one cycle apart. Annotations: memory protection violation by LDR; undefined instruction; arithmetic exception; stage labels Execute and Memory Access.]
Summary of Pipelining
Improves performance
CPI = 1 (ideal)
Speedup = number of pipeline stages (ideal)
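The ideal speedup can be checked numerically. The sketch below uses the standard textbook model, which is an assumption here rather than something stated on this slide: each of the k stages takes one stage-time, an unpipelined instruction takes k stage-times, and a pipeline finishes n instructions in k + n - 1 stage-times. The function name `speedup` is hypothetical.

```python
def speedup(k, n):
    """Speedup of a k-stage pipeline over unpipelined execution of n instructions.

    Unpipelined: n * k stage-times (each instruction takes k stage-times).
    Pipelined:   k + n - 1 stage-times (fill the pipe, then one result per cycle).
    """
    return (n * k) / (k + n - 1)


one_instruction = speedup(6, 1)        # no overlap is possible with one instruction
many_instructions = speedup(6, 10**6)  # approaches the stage count k
```

For a single instruction there is no gain, and as the instruction count grows the speedup approaches the number of stages, which is the ideal case in the summary.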
LC3 pipelined
[Figure: pipelined LC-3 datapath with Data and Control paths, pipeline registers S1/S2, S2/S3, S3/S4, S4/S5, S5/S6, and a CONTROL unit.]