You are on page 1of 83

MC9211Computer

Organization
Unit 4 Lesson 1
Processor Design
Basic Processing Unit
Connection Between the
Processor and the Memory
Memory

MAR MDR
Control

PC R0

R1
Processo
IR

ALU
Rn - 1

n general purpose
registers

Figure 1.2. Connections between the processor and the memory.


Add LOCA, R0

• Transfer the contents of register PC to register MAR


• Issue a Read command to memory, and then wait until it has transferred the
requested word into register MDR
• Transfer the instruction from MDR into IR and decode it
• Transfer the address LOCA from IR to MAR
• Issue a Read command and wait until MDR is loaded
• Transfer contents of MDR to the ALU
• Transfer contents of R0 to the ALU
• Perform addition of the two operands in the ALU and transfer result into R0
• Transfer contents of PC to ALU
• Add 4 to operand in ALU and transfer incremented address to PC
Overview
• Instruction Set Processor (ISP)
• Central Processing Unit (CPU)
• A typical computing task consists of a
series of steps specified by a sequence of
machine instructions that constitute a
program.
• An instruction is executed by carrying out
a sequence of more rudimentary
operations.
Some Fundamental Concepts
Fundamental Concepts
• Processor fetches one instruction at a time and
perform the operation specified.
• Instructions are fetched from successive
memory locations until a branch or a jump
instruction is encountered.
• Processor keeps track of the address of the
memory location containing the next instruction
to be fetched using Program Counter (PC).
• Instruction Register (IR)
Executing an Instruction
• Fetch the contents of the memory location
pointed to by the PC. The contents of this
location are loaded into the IR (fetch phase).
IR ← [[PC]]
• Assuming that the memory is byte addressable,
increment the contents of the PC by 4 (fetch
phase).
PC ← [PC] + 4
• Carry out the actions specified by the instruction
in the IR (execution phase).
Processor Organization
Internal processor
bus

Control signals

PC

Instruction
Address
decoder and
lines
MAR control logic

Memory
bus

MDR
Data
lines IR

Datapath
Y

Constant 4 R0

Select MUX

Add
A B
ALU Sub R( n - 1)
control ALU
lines
Carry-in
XOR TEMP

Z
Textbook Page 413

Figure 7.1. Single-bus organization of the datapath inside a proc


Executing an Instruction
• Transfer a word of data from one
processor register to another or to the
ALU.
• Perform an arithmetic or a logic operation
and store the result in a processor
register.
• Fetch the contents of a given memory
location and load them into a processor
register.
• Store a word of data from a processor
register into a given memory location.
Register Transfers Internal processor
bus
Riin

Ri

Riout

Yin

Constant 4

Select MUX

A B
ALU

Zin

Textbook Page 416 Zout

Figure 7.2. Input and output gating for the registers in Figure 7.1.
Register Transfers
• All operations and data transfers are controlled by the processor
clock.
Bus

D Q
1
Q
Riout

Ri in
Clock

Figure
Figure7.3.
7.3.Input
Inputand outputgating
and output ating
g for one
for oneregister
re
gisterbit.
bit.
Performing an Arithmetic or
Logic Operation
• The ALU is a combinational circuit that has no
internal storage.
• ALU gets the two operands from MUX and bus.
The result is temporarily stored in register Z.
• What is the sequence of operations to add the
contents of register R1 to those of R2 and store
the result in R3?
1. R1out, Yin
2. R2out, SelectY, Add, Zin
3. Zout, R3in
Fetching a Word from Memory
• Address into MAR; issue Read operation; data into
MDR.
Memory-bus Internal processo
data lines MDRoutE MDRout bus

MDR

MDR inE MDRin

Figure
Figure7.4.
7.4. Connection and
Connection and control
control signals
signals forgister
re MDR.
for register MDR.
Fetching a Word from Memory
• The response time of each memory access
varies (cache miss, memory-mapped I/O,…).
• To accommodate this, the processor waits until it
receives an indication that the requested
operation has been completed (Memory-
Function-Completed, MFC).
• Move (R1), R2
¾ MAR ← [R1]
¾ Start a Read operation on the memory bus
¾ Wait for the MFC response from the memory
¾ Load MDR from the memory bus
¾ R2 ← [MDR]
Step 1 2 3

Clock
Timing
MARin

Assume MAR Address


is always available
on the address lines
Read
of the memory bus.
MR

MDRinE
z Move (R1), R2
1. R1out, MARin, Read Data

2. MDRinE, WMFC
MFC
3. MDRout, R2in
MDR out

Figure 7.5. Timing of a memory Read operation.


Execution of a Complete
Instruction
• Add (R3), R1
• Fetch the instruction
• Fetch the first operand (the contents of the
memory location pointed to by R3)
• Perform the addition
• Load the result into R1
Architecture Internal processor
bus
Riin

Ri

Riout

Yin

Constant 4

Select MUX

A B
ALU

Zin

Zout

Figure 7.2. Input and output gating for the registers in Figure 7.1.
Execution of a Complete
Instruction Internal processor
bus

Add (R3), R1 Control signals

PC
Step Action
Instruction
Address
decoder and
lines
1 PCout , MAR in , Read, Select4,Add, Zin MAR control logic

Memory
2 Zout , PCin , Y in , WMF C bus

3 MDR out , IR in Data


MDR

lines IR
4 R3out , MAR in , Read
5 R1out , Y in , WMF C Y

Constant 4 R0
6 MDR out , SelectY, Add, Zin
7 Zout , R1in , End Select MUX

Add
A B
ALU Sub R( n - 1)
control ALU
lines
Carry-in
Figure 7.6. Control sequencefor execution of the instruction Add (R3),R1.
XOR TEMP

Z
Add R2, R1 ?

Figure 7.1. Single-bus organization of the datapath inside a proc


Execution of a Complete
Instruction Internal processor
bus

Add R2, R1 Control signals

PC
Step Action
Instruction
Address
decoder and
lines
1 PCout , MAR in , Read, Select4,Add, Zin MAR control logic

Memory
2 Zout , PCin , Y in , WMF C bus

3 MDR out , IR in Data


MDR

lines IR
4 R3out , MAR in , Read
5 R1out , Y in , WMF C Y

Constant 4 R0
6 MDR out , SelectY, Add, Zin
R2out
7 Zout , R1in , End Select MUX

Add
A B
ALU Sub R( n - 1)
control ALU
lines
Carry-in
Figure 7.6. Control sequencefor execution of the instruction Add (R3),R1.
XOR TEMP

Figure 7.1. Single-bus organization of the datapath inside a proc


Execution of Branch Instructions
• A branch instruction replaces the contents
of PC with the branch target address,
which is usually obtained by adding an
offset X given in the branch instruction.
• The offset X is usually the difference
between the branch target address and
the address immediately following the
branch instruction.
• Conditional branch
Execution of Branch Instructions

Step Action

1 PCout , MAR in , Read, Select4,Add, Z in


2 Zout , PCin , Yin , WMF C
3 MDR out , IR in
4 Offset-field-of-IRout, Add, Z in
5 Z out, PCin , End

Figure 7.7. Control sequence for an unconditional branch instruction.


Multiple-Bus Organization
Bus A Bus B Bus C

Incrementer
Textbook Page 424
PC

• Allow the contents of two


different registers to be
Re gister
file

Constant 4
accessed simultaneously and
have their contents placed on
buses A and B.
MUX

ALU R

B • Allow the data on bus C to


be loaded into a third register
Instruction
decoder during the same clock cycle.
IR
• Incrementer unit.
MDR
• ALU simply passes one of
MAR
ts two input operands
unmodified to bus C
Memory b us Address
data lines lines
Æ control signal: R=A or R=B
Figure 7.8. Three-b us or ganization of the datapath.
Multiple-Bus Organization
• Add R4, R5, R6

Step Action

1 PCout, R=B, MAR in , Read, IncPC


2 WMFC
3 MDR outB , R=B, IR in
4 R4outA , R5outB , SelectA, Add, R6in , End

Figure 7.9. Control sequence for the instruction. Add R4,R5,R6,


for the three-bus organization in Figure 7.8.
Exercise
Internal processor
bus

Control signals

• What is the control PC

sequence for Address


lines
Instruction
decoder and

execution of the
MAR control logic

Memory
bus

instruction Data
lines
MDR
IR

Add R1, R2 Constant 4


Y
R0

including the Select MUX

instruction fetch ALU


Add
Sub
A B
R( n - 1)

phase? (Assume control


lines
ALU

Carry-in

single bus
XOR TEMP

architecture)
Figure 7.1. Single-bus organization of the datapath inside a proc
Hardwired Control
Overview
• To execute instructions, the processor
must have some means of generating the
control signals needed in the proper
sequence.
• Two categories: hardwired control and
microprogrammed control
• Hardwired system can operate at high
speed; but with little flexibility.
Control Unit Organization
CLK Control step
Clock counter

External
inputs
Decoder/
IR
encoder
Condition
codes

Control signals

Figure 7.10. Control unit organization.


Detailed Block Description
CLK
Clock Control step Reset
counter

Step decoder

T 1 T2 Tn

INS1
External
INS2 inputs
Instruction
IR Encoder
decoder
Condition
codes
INSm

Run End

Control signals

Figure 7.11. Separation of the decoding and encoding function


Generating Zin
• Zin = T1 + T6 • ADD + T4 • BR + …
Branch Add

T4 T6

T1

Figure 7.12. Generation of the Zin control signal for the processor in Figure 7.1.
Generating End
• End = T7 • ADD + T5 • BR + (T5 • N + T4 • N) • BRN +…
Branch<0
Add Branch
N N

T7 T5 T4 T5

End

Figure 7.13. Generation of the End control signal.


A Complete Processor
Instruction Integer Floating-point
unit unit unit

Instruction Data
cache cache

Bus interface
Processor

System b
us

Main Input/
memory Output

Figure 7.14. Block diagram of a complete processor


.
Microprogrammed Control
Microprogrammed Control
• Control signals are generated by a program similar to machine
language programs.
• Control Word (CW); microroutine; microinstruction
: Textbook page430

MDRout

WMFC
MAR in

Select
Read
PCout

R1out

R3out
Micro -

End
PCin

R1in
Add

Z out
IRin
Yin
instruction

Zin
1 0 1 1 1 0 0 0 1 1 1 0 0 0 0 0 0
2 1 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0
3 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0
4 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0
5 0 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0
6 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1

Figure 7.15 An example of microinstructions for Figure 7.6.


Overview

Textbook page 421


Step Action

1 PCout , MAR in , Read, Select4,Add, Zin


2 Zout , PCin , Y in , WMF C
3 MDR out , IR in
4 R3out , MAR in , Read
5 R1out , Y in , WMF C
6 MDR out , SelectY, Add, Zin
7 Zout , R1in , End

Figure 7.6. Control sequencefor execution of the instruction Add (R3),R1.


Basic organization of a
microprogrammed control unit
• Control store
Starting
IR address
generator One function
cannot be carried
out by this simple
organization.

Clock µPC

Control
store CW

Figure 7.16. Basic organization of a microprogrammed control


Conditional branch
• The previous organization cannot handle the situation when the
control unit is required to check the status of the condition codes or
external inputs to choose between alternative courses of action.
• Use conditional branch microinstruction.

AddressMicroinstruction

0 PCout , MAR in , Read, Select4,Add, Z in


1 Zout , PCin , Y in , WMF C
2 MDRout , IR in
3 Branch to starting addressof appropriatemicroroutine
. ... .. ... ... .. ... .. ... ... .. ... ... .. ... .. ... ... .. ... .. ... ... .. ... ..
25 If N=0, then branch to microinstruction0
26 Offset-field-of-IRout , SelectY, Add, Z in
27 Zout , PCin , End

Figure 7.17. Microroutine for the instruction Branch<0.


Microprogrammed Control
External
inputs
Starting and
branch address Condition
IR codes
generator

Clock µPC

Control
store CW

Figure 7.18. Organization of the control unit to allow


conditional branching in the microprogram.
Microinstructions
• A straightforward way to structure
microinstructions is to assign one bit
position to each control signal.
• However, this is very inefficient.
• The length can be reduced: most signals
are not needed simultaneously, and many
signals are mutually exclusive.
• All mutually exclusive signals are placed in
the same group in binary coding.
Partial Format for the
Microinstructions
Microinstruction

F1 F2 F3 F4 F5

F1 (4 bits) F2 (3 bits) F3 (3 bits) F4 (4 bits) F5 (2 bits)

0000: No transfer 000: No transfer 000: No transfer 0000: Add 00: No action
0001: PCout 001: PCin 001: MARin 0001: Sub 01: Read
0010: MDRout 010: IRin 010: MDRin 10: Write
0011: Zout 011: Zin 011: TEMPin
0100: R0out 100: R0in 100: Yin 1111: XOR
0101: R1out 101: R1in
0110: R2out 110: R2in 16 ALU
functions
0111: R3out 111: R3in
1010: TEMPout
1011: Offsetout

F6 F7 F8
What is the price paid for
this scheme?
F6 (1 bit) F7 (1 bit) F8 (1 bit)
Require a little more hardware
0: SelectY 0: No action 0: Continue
1: Select4 1: WMFC 1: End

Figure 7.19. An example of a partial format for field-encoded microinstructions


Further Improvement
• Enumerate the patterns of required signals
in all possible microinstructions. Each
meaningful combination of active control
signals can then be assigned a distinct
code.
Textbook page 434
• Vertical organization
• Horizontal organization
Microprogram Sequencing
• If all microprograms require only straightforward
sequential execution of microinstructions except
for branches, letting a µPC governs the
sequencing would be efficient.
• However, two disadvantages:
¾ Having a separate microroutine for each machine instruction
results in a large total number of microinstructions and a large
control store.
¾ Longer execution time because it takes more time to carry out
the required branches.
• Example: Add src, Rdst
• Four addressing modes: register, autoincrement,
autodecrement, and indexed (with indirect
forms)
Textbook page 436

- Bit-ORing
- Wide-Branch Addressing
- WMFC
Mode

Contents of IR OP code 0 1 0 Rsrc Rdst

11 10 87 4 3 0

Address Microinstruction Textbook page 439


(octal)

000 PCout, MARin, Read, Select


4, Add, Zin
001 Zout, PCin, Yin, WMFC
002 MDRout, IRin
003 µBranch {µ PC← 101 (from Instruction decoder);
µPC5,4 ← [IR10,9]; µPC3 ← [IR 10] ⋅ [IR9] ⋅ [IR8]}
121 Rsrcout, MARin , Read, Select4, Add,inZ
122 Zout, Rsrcin
123 µBranch {µPC← 170;µPC0 ← [IR8]}, WMFC
170 MDRout, MARin, Read, WMFC
171 MDRout, Yin
172 Rdstout, SelectY
, Add, Zin
173 Zout, Rdstin, End

Figure 7.21. Microinstruction for Add (Rsrc)+,Rdst.


Note:Microinstruction at location 170 is not executed for this addressing mode.
Microinstructions with Next-
Address Field
• The microprogram we discussed requires
several branch microinstructions, which perform
no useful operation in the datapath.
• A powerful alternative approach is to include an
address field as a part of every microinstruction
to indicate the location of the next
microinstruction to be fetched.
• Pros: separate branch microinstructions are
virtually eliminated; few limitations in assigning
addresses to microinstructions.
• Cons: additional bits for the address field
(around 1/6)
Microinstructions with Next-
Address Field IR

External Condition
Inputs codes

Decoding circuits

µA R

Control store

Next address µI R

Microinstruction decoder

Control signals

Figure 7.22. Microinstruction-sequencing organization.


Microinstruction

F0 F1 F2 F3

F0 (8 bits) F1 (3 bits) F2 (3 bits) F3 (3 bits)

Address of next 000: No transfer 000: No transfer 000: No transfer


microinstruction 001: PCout 001: PCin 001: MARin
010: MDRout 010: IRin 010: MDRin
011: Zout 011: Zin 011: TEMPin
100: Rsrcout 100: Rsrcin 100: Yin
101: Rdstout 101: Rdstin
110: TEMPout

F4 F5 F6 F7

F4 (4 bits) F5 (2 bits) F6 (1 bit) F7 (1 bit)

0000: Add 00: No action 0: SelectY 0: No action


0001: Sub 01: Read 1: Select4 1: WMFC
10: Write
1111: XOR

F8 F9 F10

F8 (1 bit) F9 (1 bit) F10 (1 bit)

0: NextAdrs 0: No action 0: No action


1: InstDec 1: ORmode 1: ORindsrc

Figure 7.23. Format for microinstructions in the example of Section 7


Implementation of the
Microroutine
Octal
address F0 F1 F2 F3 F4 F5 F6 F7 F8 F9 F10

0 0 0 0 0 0 0 0 0 0 1 0 0 1 01 1 0 0 1 0 0 0 0 01 1 0 0 0 0
0 0 1 0 0 0 0 0 0 1 0 0 1 1 00 1 1 0 0 0 0 0 0 00 0 1 0 0 0
0 0 2 0 0 0 0 0 0 1 1 0 1 0 01 0 0 0 0 0 0 0 0 00 0 0 0 0 0
0 0 3 0 0 0 0 0 0 0 0 0 0 0 00 0 0 0 0 0 0 0 0 00 0 0 1 1 0

121 0 1 0 1 0 0 1 0 1 0 0 01 1 0 0 1 0 0 0 0 01 1 0 0 0 0
122 0 1 1 1 1 0 0 0 0 1 1 10 0 0 0 0 0 0 0 0 00 0 1 0 0 1

1 7 0 0 1 1 1 1 0 0 1 0 1 0 00 0 0 0 1 0 0 0 0 01 0 1 0 0 0
1 7 1 0 1 1 1 1 0 1 0 0 1 0 00 0 1 0 0 0 0 0 0 00 0 0 0 0 0
1 7 2 0 1 1 1 1 0 1 1 1 0 1 01 1 0 0 0 0 0 0 0 00 0 0 0 0 0
1 7 3 0 0 0 0 0 0 0 0 0 1 1 10 1 0 0 0 0 0 0 0 00 0 0 0 0 0

Figure 7.24. Implementation of the microroutine of Figure 7.21 usin


next-microinstruction address field.
(See Figure 7.23 for encoded signa
R15in R15out R0in R0out

Decoder

Decoder

IR Rsrc Rdst

InstDecout
External
inputs ORmode
Decoding
circuits
Condition ORindsrc
codes

µA R

Control store

Next address F1 F2 F8 F9 F10

Rdstout

Rdstin
Microinstruction
decoder
Rsrcout

Rsrcin

Other control signals

Figure 7.25. Some details of the control-signal-generating circuitry.


bit-ORing
Pipelining
Overview
• Pipelining is widely used in modern
processors.
• Pipelining improves system performance
in terms of throughput.
• Pipelined organization requires
sophisticated compilation techniques.
Basic Concepts
Making the Execution of
Programs Faster
• Use faster circuit technology to build the
processor and the main memory.
• Arrange the hardware so that more than
one operation can be performed at the
same time.
• In the latter way, the number of operations
performed per second is increased even
though the elapsed time needed to
perform any one operation is not changed.
Traditional Pipeline Concept

• Laundry Example
• Ann, Brian, Cathy, Dave
each have one load of clothes
to wash, dry, and fold A B C D
• Washer takes 30 minutes

• Dryer takes 40 minutes

• “Folder” takes 20 minutes


Traditional Pipeline Concept
6 PM 7 8 9 10 11 Midnight

Time

30 40 20 30 40 20 30 40 20 30 40 20
• Sequential laundry takes 6
A hours for 4 loads
• If they learned pipelining,
B
how long would laundry
take?

D
Traditional Pipeline Concept

6 PM 7 8 9 10 11 Midnight

Time
T
a 30 40 40 40 40 20
s
A
k • Pipelined laundry
takes 3.5 hours for 4
O B loads
r
d C
e
r D
Traditional Pipeline Concept
• Pipelining doesn’t help
6 PM 7 8 9
latency of single task, it
Time helps throughput of entire
T workload
a 30 40 40 40 40 20 • Pipeline rate limited by
s slowest pipeline stage
A
k • Multiple tasks operating
simultaneously using
O B different resources
r • Potential speedup =
d
Number pipe stages
C
e • Unbalanced lengths of pipe
r
stages reduces speedup
D
• Time to “fill” pipeline and
time to “drain” it reduces
speedup
• Stall for Dependences
Use the Idea of Pipelining in a
Computer
Fetch + Execution
Time
I1 I2 I3
Time
Clock cycle 1 2 3 4
F E F E F E
1 1 2 2 3 3 Instruction

I1 F1 E1
(a) Sequential execution

I2 F2 E2
Interstage buffer
B1
I3 F3 E3

Instruction Execution
fetch unit (c) Pipelined execution
unit

Figure 8.1. Basic idea of instruction pipelining.


(b) Hardware organization
Use the Idea of Pipelining in a
Computer Clock cycle

Instruction
1 2 3 4 5 6 7
Time

I1 F1 D1 E1 W1
Fetch + Decode
+ Execution + Write I2 F2 D2 E2 W2

I3 F3 D3 E3 W3

I4 F4 D4 E4 W4

(a) Instruction execution divided into four steps

Interstage u
bffers

D : Decode
F : Fetch instruction E: Execute W : Write
instruction and fetch operation results
operands
B1 B2 B3

(b) Hardware organization

Textbook page: 457

Figure 8.2. A 4-stage pipeline.


Role of Cache Memory
• Each pipeline stage is expected to complete in
one clock cycle.
• The clock period should be long enough to let
the slowest pipeline stage to complete.
• Faster stages can only wait for the slowest one
to complete.
• Since main memory is very slow compared to
the execution, if each instruction needs to be
fetched from main memory, pipeline is almost
useless.
• Fortunately, we have cache.
Pipeline Performance
• The potential increase in performance
resulting from pipelining is proportional to
the number of pipeline stages.
• However, this increase would be achieved
only if all pipeline stages require the same
time to complete, and there is no
interruption throughout program execution.
• Unfortunately, this is not true.
Pipeline Performance
Time
Clock cycle 1 2 3 4 5 6 7 8 9

Instruction
I1 F1 D1 E1 W1

I2 F2 D2 E2 W2

I3 F3 D3 E3 W3

I4 F4 D4 E4 W4

I5 F5 D5 E5

Figure 8.3. Effect of an e


xecution operation taking more than one clock
ycle.
c
Pipeline Performance
• The previous pipeline is said to have been stalled for two
clock cycles.
• Any condition that causes a pipeline to stall is called a
hazard.
• Data hazard – any condition in which either the source or
the destination operands of an instruction are not
available at the time expected in the pipeline. So some
operation has to be delayed, and the pipeline stalls.
• Instruction (control) hazard – a delay in the availability of
an instruction causes the pipeline to stall.
• Structural hazard – the situation when two instructions
require the use of a given hardware resource at the
same time.
Pipeline Performance
Time
Clock cycle 1 2 3 4 5 6 7 8 9
Instruction Instruction
hazard I1 F1 D1 E1 W1

I2 F2 D2 E2 W2

I3 F3 D3 E3 W3

(a) Instruction execution steps in successive clock cycles

Time
Clock cycle 1 2 3 4 5 6 7 8 9

Stage
F: Fetch F1 F2 F2 F2 F2 F3
Idle periods –
D: Decode D1 idle idle idle D2 D3
stalls (bubbles)
E: Execute E1 idle idle idle E2 E3

W: Write W1 idle idle idle W2 W3

(b) Function performed by each processor stage in successive clock cycles

Figure 8.4. Pipeline stall caused by a cache miss in F2.


Pipeline Performance
Load X(R1), R2
Structural
Time
hazard Clock cycle 1 2 3 4 5 6 7

Instruction
I1 F1 D1 E1 W1

I 2 (Load) F2 D2 E2 M2 W2

I3 F3 D3 E3 W3

I4 F4 D4 E4

I5 F5 D5

Figure 8.5. Effect of a Load instruction on pipeline timing.


Pipeline Performance
• Again, pipelining does not result in individual
instructions being executed faster; rather, it is
the throughput that increases.
• Throughput is measured by the rate at which
instruction execution is completed.
• Pipeline stall causes degradation in pipeline
performance.
• We need to identify all hazards that may cause
the pipeline to stall and to find ways to minimize
their impact.
Quiz
• Four instructions, the I2 takes two clock
cycles for execution. Pls draw the figure
for 4-stage pipeline, and figure out the
total cycles needed for the four
instructions to complete.
Data Hazards
Data Hazards
• We must ensure that the results obtained when
instructions are executed in a pipelined processor are
identical to those obtained when the same instructions
are executed sequentially.
• Hazard occurs
A←3+A
B←4×A
• No hazard
A←5×C
B ← 20 + C
• When two operations depend on each other, they must
be executed sequentially in the correct order.
• Another example:
Mul R2, R3, R4
Add R5, R4, R6
Data Hazards
Time
Clock cycle 1 2 3 4 5 6 7 8 9

Instruction

I 1 (Mul) F1 D1 E1 W1

I 2 (Add) F2 D2 D2A E2 W2

I3 F3 D3 E3 W3

I4 F4 D4 E4 W4

Figure 8.6. Pipeline stalled by data dependenc


y between D
2 and W1.
Figure 8.6. Pipeline stalled by data dependency between D2 and W1.
Operand Forwarding
• Instead of from the register file, the second
instruction can get data directly from the
output of ALU after the previous instruction
is completed.
• A special arrangement needs to be made
to “forward” the output of ALU to the input
of ALU.
Source 1
Source 2

SRC1 SRC2

Register
file

ALU

RSLT

Destination

(a) Datapath

SRC1,SRC2 RSLT

E: Execute W: Write
(ALU) (Register file)

Forwarding path

(b) Position of the source and result registers in the processor pipeline

Figure 8.7. Operand forwarding in a pipelined processor


.
Handling Data Hazards in
Software
• Let the compiler detect and handle the
hazard:
I1: Mul R2, R3, R4
NOP
NOP
I2: Add R5, R4, R6
• The compiler can reorder the instructions
to perform some useful work during the
NOP slots.
Side Effects
• The previous example is explicit and easily detected.
• Sometimes an instruction changes the contents of a
register other than the one named as the destination.
• When a location other than one explicitly named in an
instruction as a destination operand is affected, the
instruction is said to have a side effect. (Example?)
• Example: conditional code flags:
Add R1, R3
AddWithCarry R2, R4
• Instructions designed for execution on pipelined
hardware should have few side effects.
Instruction Hazards
Overview
• Whenever the stream of instructions
supplied by the instruction fetch unit is
interrupted, the pipeline stalls.
• Cache miss
• Branch
Unconditional Branches
Time
Clock cycle 1 2 3 4 5 6

Instruction
I1 F1 E1

I 2 (Branch) F2 E2 Execution unit idle

I3 F3 X

Ik Fk Ek

I k+1 Fk+1 Ek+1

Figure 8.8. An idle cycle caused by a branch instruction.


Time

Branch Timing
Clock cycle 1 2 3 4 5 6 7 8

I1 F1 D1 E1 W1

I 2 (Branch) F2 D2 E2

I3 F3 D3 X

- Branch penalty I4 F4 X

Ik Fk Dk Ek Wk
- Reducing the penalty
I k+1 Fk+1 Dk+1 Ek+1

(a) Branch address computed inecute


Ex stage

Time
Clock cycle 1 2 3 4 5 6 7

I1 F1 D1 E1 W1

I 2 (Branch) F2 D2

I3 F3 X

Ik Fk Dk Ek Wk

I k+1 Fk+1 D k+1 Ek+1

(b) Branch address computed in Decode stage

Figure 8.9. Branch timing.


Instruction Queue and
Prefetching
Instruction fetch unit
Instruction queue
F : Fetch
instruction

D : Dispatch/
Decode E : Execute W : Write
instruction results
unit

Figure 8.10. Use of an instruction queue in the hardware organization of Figure 8.2b.
Conditional Braches
• A conditional branch instruction introduces
the added hazard caused by the
dependency of the branch condition on the
result of a preceding instruction.
• The decision to branch cannot be made
until the execution of that instruction has
been completed.
• Branch instructions represent about 20%
of the dynamic instruction count of most
programs.
Super Scalar Architecture
F: Instruction
Fetch Unit Instruction Queue

Floating
Point Unit

Dispatch Unit W: Write


Results

Integer
Unit
Super Scalar Operation
• Equip processor with multiple processing
units – several instructions can be
executed in the same clock cycle –
multiple issue processor
• Throughput can be > 1 instruction / cycle
• Compiler should interleave floating point
and integer instructions
• Out-Of-Order Execution should be taken
care

You might also like