Mc9211co U4l1

MC9211Computer
Organization
Unit 4 Lesson 1
Processor Design
Basic Processing Unit
Connection Between the
Processor and the Memory
[Figure: memory connected to the processor through MAR and MDR; inside the processor: control, PC, IR, ALU, and n general purpose registers R0 to Rn-1.]
Figure 1.2. Connections between the processor and the memory.

Add LOCA, R0
• Transfer the contents of register PC to register MAR

• Issue a Read command to memory, and then wait until it has transferred the
requested word into register MDR
• Transfer the instruction from MDR into IR and decode it
• Transfer the address LOCA from IR to MAR
• Issue a Read command and wait until MDR is loaded
• Transfer contents of MDR to the ALU
• Transfer contents of R0 to the ALU
• Perform addition of the two operands in the ALU and transfer result into R0
• Transfer contents of PC to ALU
• Add 4 to operand in ALU and transfer incremented address to PC
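The steps listed for Add LOCA, R0 can be traced in a small Python sketch. The dictionary-based memory, the tuple used as an instruction encoding, and the values stored at address 0 and at LOCA are assumptions made purely for illustration; they are not part of the course material.

```python
# Toy model of the processor/memory connection (all values are assumptions).
LOCA = 100
memory = {0: ("Add", LOCA, "R0"), LOCA: 7}   # instruction at address 0, operand at LOCA
regs = {"PC": 0, "MAR": 0, "MDR": None, "IR": None, "R0": 5}

def execute_add_loca_r0(regs, memory):
    regs["MAR"] = regs["PC"]               # PC -> MAR
    regs["MDR"] = memory[regs["MAR"]]      # Read; requested word arrives in MDR
    regs["IR"] = regs["MDR"]               # MDR -> IR, then decode
    _, address, dst = regs["IR"]
    regs["MAR"] = address                  # address LOCA from IR -> MAR
    regs["MDR"] = memory[regs["MAR"]]      # Read; operand arrives in MDR
    regs[dst] = regs["MDR"] + regs[dst]    # ALU adds MDR and R0, result -> R0
    regs["PC"] = regs["PC"] + 4            # ALU adds 4 to PC
    return regs

execute_add_loca_r0(regs, memory)
print(regs["R0"], regs["PC"])
```

With the assumed initial values (R0 = 5, word at LOCA = 7) the sketch leaves 12 in R0 and 4 in PC.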
Overview
• Instruction Set Processor (ISP)
• Central Processing Unit (CPU)
• A typical computing task consists of a
series of steps specified by a sequence of
machine instructions that constitute a
program.
• An instruction is executed by carrying out
a sequence of more rudimentary
operations.
Some Fundamental Concepts
Fundamental Concepts
• Processor fetches one instruction at a time and
performs the operation specified.
• Instructions are fetched from successive
memory locations until a branch or a jump
instruction is encountered.
• Processor keeps track of the address of the
memory location containing the next instruction
to be fetched using Program Counter (PC).
• Instruction Register (IR)
Executing an Instruction
• Fetch the contents of the memory location
pointed to by the PC. The contents of this
location are loaded into the IR (fetch phase).
IR ← [[PC]]
• Assuming that the memory is byte addressable,
increment the contents of the PC by 4 (fetch
phase).
PC ← [PC] + 4
• Carry out the actions specified by the instruction
in the IR (execution phase).
Processor Organization
[Figure: the internal processor bus connects PC, MAR (address lines to the memory bus), MDR (data lines to the memory bus), IR with the instruction decoder and control logic, Y, R0 to R(n-1), TEMP and Z; a MUX (Select) chooses between Y and the constant 4 for ALU input A, the bus drives input B, and the ALU control lines include Add, Sub, XOR and Carry-in.]
Figure 7.1. Single-bus organization of the datapath inside a processor. (Textbook page 413)

Executing an Instruction
• Transfer a word of data from one
processor register to another or to the
ALU.
• Perform an arithmetic or a logic operation
and store the result in a processor
register.
• Fetch the contents of a given memory
location and load them into a processor
register.
• Store a word of data from a processor
register into a given memory location.
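Register Transfers
[Figure: register Ri with gates Riin and Riout on the internal processor bus; Y (gate Yin) and the constant 4 feed the MUX (Select) into ALU input A, the bus feeds input B, and Z sits on the ALU output with gates Zin and Zout.]
Figure 7.2. Input and output gating for the registers in Figure 7.1. (Textbook page 416)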
Register Transfers
• All operations and data transfers are controlled by the processor
clock.
[Figure: a D flip-flop (D, Q) driven by Clock, with Riin gating its input and Riout gating its output onto the bus.]
Figure 7.3. Input and output gating for one register bit.
Performing an Arithmetic or
Logic Operation
• The ALU is a combinational circuit that has no
internal storage.
• ALU gets the two operands from MUX and bus.
The result is temporarily stored in register Z.
• What is the sequence of operations to add the
contents of register R1 to those of R2 and store
the result in R3?
1. R1out, Yin
2. R2out, SelectY, Add, Zin
3. Zout, R3in
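The three control steps above can be checked with a minimal Python sketch of the single-bus datapath. The function name `run_steps`, the convention that a signal ending in "out" drives the bus and one ending in "in" loads from it, and the register values are illustrative assumptions, not the textbook's implementation.

```python
def run_steps(regs, steps):
    """Apply single-bus control steps; each step is a set of signal names."""
    for signals in steps:
        sources = [s[:-3] for s in signals if s.endswith("out")]
        assert len(sources) <= 1, "only one register may drive the bus per step"
        bus = regs[sources[0]] if sources else None
        alu = None
        if "Add" in signals:
            # MUX picks Y or the constant 4 for input A; input B is the bus.
            a = regs["Y"] if "SelectY" in signals else 4
            alu = a + bus
        for s in signals:
            if s.endswith("in"):
                name = s[:-2]
                # Z is loaded from the ALU output, every other register from the bus.
                regs[name] = alu if name == "Z" else bus
    return regs

regs = {"R1": 10, "R2": 32, "R3": 0, "Y": 0, "Z": 0}
steps = [
    {"R1out", "Yin"},
    {"R2out", "SelectY", "Add", "Zin"},
    {"Zout", "R3in"},
]
run_steps(regs, steps)
print(regs["R3"])
```

Three steps are needed because the single bus can carry only one operand at a time, so R1 must first be parked in Y and the sum must be parked in Z.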
Fetching a Word from Memory
• Address into MAR; issue Read operation; data into
MDR.
[Figure: register MDR sits between the memory-bus data lines (signals MDRinE, MDRoutE) and the internal processor bus (signals MDRin, MDRout).]
Figure 7.4. Connection and control signals for register MDR.
Fetching a Word from Memory
• The response time of each memory access
varies (cache miss, memory-mapped I/O,…).
• To accommodate this, the processor waits until it
receives an indication that the requested
operation has been completed (Memory-
Function-Completed, MFC).
• Move (R1), R2
  - MAR ← [R1]
  - Start a Read operation on the memory bus
  - Wait for the MFC response from the memory
  - Load MDR from the memory bus
  - R2 ← [MDR]
Timing
• Assume MAR is always available on the address lines of the memory bus.
• Move (R1), R2
1. R1out, MARin, Read
2. MDRinE, WMFC
3. MDRout, R2in
[Figure: waveforms over steps 1 to 3 for Clock, MARin, Address, Read, MR, MDRinE, Data, MFC and MDRout.]
Figure 7.5. Timing of a memory Read operation.

Execution of a Complete
Instruction
• Add (R3), R1
• Fetch the instruction
• Fetch the first operand (the contents of the
memory location pointed to by R3)
• Perform the addition
• Load the result into R1
Architecture
[Figure 7.2 repeated: input and output gating for the registers in Figure 7.1.]
Instruction
Add (R3), R1
Step Action
1 PCout, MARin, Read, Select4, Add, Zin
2 Zout, PCin, Yin, WMFC
3 MDRout, IRin
4 R3out, MARin, Read
5 R1out, Yin, WMFC
6 MDRout, SelectY, Add, Zin
7 Zout, R1in, End
Figure 7.6. Control sequence for execution of the instruction Add (R3),R1.
Instruction
Add R2, R1 ?
Step Action
1 PCout, MARin, Read, Select4, Add, Zin
2 Zout, PCin, Yin, WMFC
3 MDRout, IRin
5 R1out, Yin, WMFC
  R2out
7 Zout, R1in, End

Execution of Branch Instructions
• A branch instruction replaces the contents
of PC with the branch target address,
which is usually obtained by adding an
offset X given in the branch instruction.
• The offset X is usually the difference
between the branch target address and
the address immediately following the
branch instruction.
• Conditional branch
Execution of Branch Instructions
Step Action
1 PCout, MARin, Read, Select4, Add, Zin
2 Zout, PCin, Yin, WMFC
3 MDRout, IRin
4 Offset-field-of-IRout, Add, Zin
5 Zout, PCin, End
Figure 7.7. Control sequence for an unconditional branch instruction.

Multiple-Bus Organization
• Allow the contents of two different registers to be
accessed simultaneously and have their contents placed on
buses A and B.
• Allow the data on bus C to be loaded into a third register
during the same clock cycle.
• Incrementer unit.
• ALU simply passes one of its two input operands
unmodified to bus C → control signal: R=A or R=B
[Figure: buses A, B and C connect the PC with its incrementer, the register file, the constant 4 and MUX, the ALU (output R), the instruction decoder with IR, and MDR and MAR, which attach to the memory bus data lines and address lines.]
Figure 7.8. Three-bus organization of the datapath. (Textbook page 424)
Multiple-Bus Organization
• Add R4, R5, R6
Step Action
1 PCout, R=B, MARin, Read, IncPC
2 WMFC
3 MDRoutB, R=B, IRin
4 R4outA, R5outB, SelectA, Add, R6in, End
Figure 7.9. Control sequence for the instruction Add R4,R5,R6
for the three-bus organization in Figure 7.8.
Exercise
• What is the control sequence for execution of the
instruction Add R1, R2 including the instruction fetch
phase? (Assume single bus architecture)
[Figure 7.1 repeated: single-bus organization of the datapath.]
Hardwired Control
Overview
• To execute instructions, the processor
must have some means of generating the
control signals needed in the proper
sequence.
• Two categories: hardwired control and
microprogrammed control
• Hardwired system can operate at high
speed; but with little flexibility.
Control Unit Organization
[Figure: the clock (CLK) drives the control step counter; a decoder/encoder block takes the counter output together with IR, external inputs and condition codes and produces the control signals.]
Figure 7.10. Control unit organization.

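Detailed Block Description
[Figure: the control step counter (CLK, Reset) feeds a step decoder producing T1, T2, ..., Tn; the instruction decoder turns IR into INS1, INS2, ..., INSm; the encoder takes these together with external inputs and condition codes and generates the control signals, including Run and End.]
Figure 7.11. Separation of the decoding and encoding functions.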

Generating Zin
• Zin = T1 + T6 • ADD + T4 • BR + …
[Figure: T6 gated with Add and T4 gated with Branch are ORed with T1 to produce Zin.]
Figure 7.12. Generation of the Zin control signal for the processor in Figure 7.1.
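Generating End
• End = T7 • ADD + T5 • BR + (T5 • N + T4 • ¬N) • BRN + …
(¬N is the complement of N; the overbar was lost in extraction. Branch<0 ends at T4 when N = 0 and at T5 when N = 1.)
[Figure: T7 gated with Add, T5 gated with Branch, and (T5 with N, T4 with ¬N) gated with Branch<0 are ORed to produce End.]
Figure 7.13. Generation of the End control signal.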
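As a rough illustration of how a hardwired encoder evaluates such expressions, the sketch below writes the listed terms of Zin and End as Boolean functions of the time step and the decoded instruction. The function names and keyword arguments are hypothetical. For Branch<0 the sketch assumes the instruction ends at T4 when N = 0 (branch not taken) and at T5 when N = 1, which is what the microroutine in Figure 7.17 implies.

```python
def z_in(t, add=False, br=False):
    """Zin = T1 + T6.ADD + T4.BR (only the terms listed on the slide)."""
    return t == 1 or (t == 6 and add) or (t == 4 and br)

def end(t, add=False, br=False, brn=False, n=False):
    """End = T7.ADD + T5.BR + (T5.N + T4.(not N)).BRN (listed terms only)."""
    return (
        (t == 7 and add)
        or (t == 5 and br)
        or (brn and ((t == 5 and n) or (t == 4 and not n)))
    )

# Which steps of Add assert Zin, and where does each instruction end?
print([t for t in range(1, 8) if z_in(t, add=True)])
print([t for t in range(1, 8) if end(t, add=True)])
print([t for t in range(1, 8) if end(t, brn=True, n=False)])
```

In hardware each function is an OR of AND gates, one AND gate per term, which is why adding an instruction to a hardwired controller means changing wiring rather than data.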

A Complete Processor
[Figure: the processor contains an instruction unit, an integer unit and a floating-point unit, with an instruction cache and a data cache connected through a bus interface to the system bus, which links main memory and input/output.]
Figure 7.14. Block diagram of a complete processor.
Microprogrammed Control
• Control signals are generated by a program similar to machine
language programs.
• Control Word (CW); microroutine; microinstruction (Textbook page 430)
Microinstruction  PCin  PCout  MARin  Read  MDRout  IRin  Yin  Select  Add  Zin  Zout  R1out  R1in  R3out  WMFC  End
1                 0     1      1      1     0       0     0    1       1    1    0     0      0     0      0     0
2                 1     0      0      0     0       0     1    0       0    0    1     0      0     0      1     0
3                 0     0      0      0     1       1     0    0       0    0    0     0      0     0      0     0
4                 0     0      1      1     0       0     0    0       0    0    0     0      0     1      0     0
5                 0     0      0      0     0       0     1    0       0    0    0     1      0     0      1     0
6                 0     0      0      0     1       0     0    0       1    1    0     0      0     0      0     0
7                 0     0      0      0     0       0     0    0       0    0    1     0      1     0      0     1
Figure 7.15. An example of microinstructions for Figure 7.6.
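To make the idea of a control word concrete, the following sketch stores the seven rows of Figure 7.15 as bit strings and decodes each one back into its active signals. The column order (PCin, PCout, MARin, Read, MDRout, IRin, Yin, Select, Add, Zin, Zout, R1out, R1in, R3out, WMFC, End) is an assumption: it is the order under which every row agrees with the corresponding step of Figure 7.6. The names `SIGNALS`, `CONTROL_STORE` and `active_signals` are illustrative.

```python
# Assumed column order of the control word, one bit per control signal.
SIGNALS = ["PCin", "PCout", "MARin", "Read", "MDRout", "IRin", "Yin", "Select",
           "Add", "Zin", "Zout", "R1out", "R1in", "R3out", "WMFC", "End"]

# Microinstructions 1 to 7 of the microroutine for Add (R3), R1.
CONTROL_STORE = [
    "0111000111000000",
    "1000001000100010",
    "0000110000000000",
    "0011000000000100",
    "0000001000010010",
    "0000100011000000",
    "0000000000101001",
]

def active_signals(word):
    """Return the names of the signals whose bit is 1 in a control word."""
    return [name for name, bit in zip(SIGNALS, word) if bit == "1"]

for step, word in enumerate(CONTROL_STORE, start=1):
    print(step, ", ".join(active_signals(word)))
```

Here Select = 1 corresponds to Select4 and Select = 0 to SelectY, so step 6 decodes to MDRout, Add, Zin with Y selected implicitly.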

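Overview
Step Action
1 PCout, MARin, Read, Select4, Add, Zin
2 Zout, PCin, Yin, WMFC
3 MDRout, IRin
4 R3out, MARin, Read
5 R1out, Yin, WMFC
6 MDRout, SelectY, Add, Zin
7 Zout, R1in, End
Figure 7.6. Control sequence for execution of the instruction Add (R3),R1. (Textbook page 421)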

Basic organization of a
microprogrammed control unit
• Control store
• One function cannot be carried out by this simple
organization.
[Figure: IR feeds the starting address generator, which loads the µPC (driven by Clock); the µPC addresses the control store, which outputs the CW.]
Figure 7.16. Basic organization of a microprogrammed control unit.

Conditional branch
• The previous organization cannot handle the situation when the
control unit is required to check the status of the condition codes or
external inputs to choose between alternative courses of action.
• Use conditional branch microinstruction.
Address Microinstruction
0 PCout, MARin, Read, Select4, Add, Zin
1 Zout, PCin, Yin, WMFC
2 MDRout, IRin
3 Branch to starting address of appropriate microroutine
...
25 If N=0, then branch to microinstruction 0
26 Offset-field-of-IRout, SelectY, Add, Zin
27 Zout, PCin, End
Figure 7.17. Microroutine for the instruction Branch<0.

[Figure: IR, external inputs and condition codes feed the starting and branch address generator, which loads the µPC (driven by Clock); the µPC addresses the control store, which outputs the CW.]
Figure 7.18. Organization of the control unit to allow
conditional branching in the microprogram.
Microinstructions
• A straightforward way to structure
microinstructions is to assign one bit
position to each control signal.
• However, this is very inefficient.
• The length can be reduced: most signals
are not needed simultaneously, and many
signals are mutually exclusive.
• All mutually exclusive signals are placed in
the same group in binary coding.
Partial Format for the
Microinstructions
Microinstruction fields: F1 F2 F3 F4 F5 F6 F7 F8
F1 (4 bits): 0000 No transfer; 0001 PCout; 0010 MDRout; 0011 Zout; 0100 R0out; 0101 R1out; 0110 R2out; 0111 R3out; 1010 TEMPout; 1011 Offsetout
F2 (3 bits): 000 No transfer; 001 PCin; 010 IRin; 011 Zin; 100 R0in; 101 R1in; 110 R2in; 111 R3in
F3 (3 bits): 000 No transfer; 001 MARin; 010 MDRin; 011 TEMPin; 100 Yin
F4 (4 bits): 0000 Add; 0001 Sub; ...; 1111 XOR (16 ALU functions)
F5 (2 bits): 00 No action; 01 Read; 10 Write
F6 (1 bit): 0 SelectY; 1 Select4
F7 (1 bit): 0 No action; 1 WMFC
F8 (1 bit): 0 Continue; 1 End
• What is the price paid for this scheme? Require a little more hardware
Figure 7.19. An example of a partial format for field-encoded microinstructions.
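A short sketch can show how field encoding packs a microinstruction. It copies the codes of the partial format in Figure 7.19 into a table and builds the 19-bit word field by field; the names `FIELDS` and `encode`, and the choice to raise an error when two signals from the same field are requested, are illustrative assumptions.

```python
# (field name, width in bits, {signal: code}); code 0 is the field's default.
FIELDS = [
    ("F1", 4, {"PCout": 1, "MDRout": 2, "Zout": 3, "R0out": 4, "R1out": 5,
               "R2out": 6, "R3out": 7, "TEMPout": 10, "Offsetout": 11}),
    ("F2", 3, {"PCin": 1, "IRin": 2, "Zin": 3, "R0in": 4, "R1in": 5,
               "R2in": 6, "R3in": 7}),
    ("F3", 3, {"MARin": 1, "MDRin": 2, "TEMPin": 3, "Yin": 4}),
    ("F4", 4, {"Add": 0, "Sub": 1, "XOR": 15}),
    ("F5", 2, {"Read": 1, "Write": 2}),
    ("F6", 1, {"SelectY": 0, "Select4": 1}),
    ("F7", 1, {"WMFC": 1}),
    ("F8", 1, {"End": 1}),
]

def encode(signals):
    """Pack a set of control signals into the field-encoded bit string."""
    bits = ""
    for name, width, codes in FIELDS:
        chosen = [s for s in signals if s in codes]
        if len(chosen) > 1:
            raise ValueError(f"{sorted(chosen)} are mutually exclusive in {name}")
        code = codes[chosen[0]] if chosen else 0
        bits += format(code, f"0{width}b")
    return bits

word = encode({"PCout", "MARin", "Read", "Select4", "Add", "Zin"})
print(word, len(word))
```

Grouping works because signals within a field are never needed together; asking for PCout and Zout in the same microinstruction is rejected, just as two registers cannot drive the single bus at once. The extra hardware mentioned on the slide is the decoder needed for each field.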

Further Improvement
• Enumerate the patterns of required signals
in all possible microinstructions. Each
meaningful combination of active control
signals can then be assigned a distinct
code.
Textbook page 434
• Vertical organization
• Horizontal organization
Microprogram Sequencing
• If all microprograms require only straightforward
sequential execution of microinstructions except
for branches, letting a µPC govern the
sequencing would be efficient.
• However, two disadvantages:
  - Having a separate microroutine for each machine instruction
results in a large total number of microinstructions and a large
control store.
  - Longer execution time because it takes more time to carry out
the required branches.
• Example: Add src, Rdst
• Four addressing modes: register, autoincrement,
autodecrement, and indexed (with indirect
forms)
Textbook page 436
- Bit-ORing
- Wide-Branch Addressing
- WMFC
Contents of IR: OP code (down to bit 11) | Mode = 0 1 0 (bits 10 to 8) | Rsrc (bits 7 to 4) | Rdst (bits 3 to 0)
(Textbook page 439)
Address (octal) Microinstruction
000 PCout, MARin, Read, Select4, Add, Zin
001 Zout, PCin, Yin, WMFC
002 MDRout, IRin
003 µBranch {µPC ← 101 (from Instruction decoder);
    µPC5,4 ← [IR10,9]; µPC3 ← ¬[IR10] ⋅ ¬[IR9] ⋅ [IR8]}
121 Rsrcout, MARin, Read, Select4, Add, Zin
122 Zout, Rsrcin
123 µBranch {µPC ← 170; µPC0 ← ¬[IR8]}, WMFC
170 MDRout, MARin, Read, WMFC
171 MDRout, Yin
172 Rdstout, SelectY, Add, Zin
173 Zout, Rdstin, End
Figure 7.21. Microinstruction for Add (Rsrc)+,Rdst.
(¬ marks a complemented bit; the overbars were lost in extraction.)
Note: Microinstruction at location 170 is not executed for this addressing mode.
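The bit-ORing used by the two µBranch microinstructions can be sketched as follows. The complemented bits are an assumption here: overbars do not survive in the extracted text, so the sketch takes µPC3 ← (not IR10)·(not IR9)·IR8 and µPC0 ← not IR8, which is consistent with the note that location 170 is skipped in the direct autoincrement mode. The function names are hypothetical.

```python
def branch_after_decode(ir):
    """Microinstruction 003: start from octal 101 and OR in the mode bits."""
    ir10, ir9, ir8 = (ir >> 10) & 1, (ir >> 9) & 1, (ir >> 8) & 1
    upc = 0o101
    upc |= (ir10 << 5) | (ir9 << 4)                 # uPC5,4 <- [IR10,9]
    upc |= ((1 - ir10) & (1 - ir9) & ir8) << 3      # uPC3 <- not IR10 . not IR9 . IR8 (assumed)
    return upc

def branch_at_123(ir):
    """Microinstruction 123: start from octal 170 and OR bit 0 with not IR8 (assumed)."""
    ir8 = (ir >> 8) & 1
    return 0o170 | (1 - ir8)

ir = 0b010 << 8   # mode field 010 in bits 10 to 8 (autoincrement, direct)
print(oct(branch_after_decode(ir)), oct(branch_at_123(ir)))
```

For mode 010 the first branch lands on octal 121 and the second on octal 171, so the microroutine runs 121, 122, 123, 171, 172, 173 and skips 170.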
Microinstructions with Next-
Address Field
• The microprogram we discussed requires
several branch microinstructions, which perform
no useful operation in the datapath.
• A powerful alternative approach is to include an
address field as a part of every microinstruction
to indicate the location of the next
microinstruction to be fetched.
• Pros: separate branch microinstructions are
virtually eliminated; few limitations in assigning
addresses to microinstructions.
• Cons: additional bits for the address field
(around 1/6)
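Microinstructions with Next-
Address Field
[Figure: IR, external inputs and condition codes feed the decoding circuits, which load the µAR; the µAR addresses the control store; the fetched word goes into the µIR, whose next-address part returns to the decoding circuits and whose remaining fields pass through the microinstruction decoder to become control signals.]
Figure 7.22. Microinstruction-sequencing organization.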

Microinstruction fields: F0 F1 F2 F3 F4 F5 F6 F7 F8 F9 F10
F0 (8 bits): Address of next microinstruction
F1 (3 bits): 000 No transfer; 001 PCout; 010 MDRout; 011 Zout; 100 Rsrcout; 101 Rdstout; 110 TEMPout
F2 (3 bits): 000 No transfer; 001 PCin; 010 IRin; 011 Zin; 100 Rsrcin; 101 Rdstin
F3 (3 bits): 000 No transfer; 001 MARin; 010 MDRin; 011 TEMPin; 100 Yin
F4 (4 bits): 0000 Add; 0001 Sub; ...; 1111 XOR
F5 (2 bits): 00 No action; 01 Read; 10 Write
F6 (1 bit): 0 SelectY; 1 Select4
F7 (1 bit): 0 No action; 1 WMFC
F8 (1 bit): 0 NextAdrs; 1 InstDec
F9 (1 bit): 0 No action; 1 ORmode
F10 (1 bit): 0 No action; 1 ORindsrc
Figure 7.23. Format for microinstructions in the example of Section 7

Implementation of the
Microroutine
Octal
address  F0        F1   F2   F3   F4    F5  F6  F7  F8  F9  F10
000      00000001  001  011  001  0000  01  1   0   0   0   0
001      00000010  011  001  100  0000  00  0   1   0   0   0
002      00000011  010  010  000  0000  00  0   0   0   0   0
003      00000000  000  000  000  0000  00  0   0   1   1   0
121      01010010  100  011  001  0000  01  1   0   0   0   0
122      01111000  011  100  000  0000  00  0   1   0   0   1
170      01111001  010  000  001  0000  01  0   1   0   0   0
171      01111010  010  000  100  0000  00  0   0   0   0   0
172      01111011  101  011  000  0000  00  0   0   0   0   0
173      00000000  011  101  000  0000  00  0   0   0   0   0
Figure 7.24. Implementation of the microroutine of Figure 7.21 using a
next-microinstruction address field.
(See Figure 7.23 for encoded signals.)
[Figure labels: IR fields Rsrc and Rdst, each with a decoder producing R0in, R0out through R15in, R15out; microinstruction decoder outputs Rsrcin, Rsrcout, Rdstin, Rdstout (fields F1, F2) and other control signals; InstDecout, ORmode and ORindsrc (fields F8, F9, F10) into the decoding circuits together with external inputs and condition codes (bit-ORing); µAR; control store; next address.]
Figure 7.25. Some details of the control-signal-generating circuitry.
Pipelining
Overview
• Pipelining is widely used in modern
processors.
• Pipelining improves system performance
in terms of throughput.
• Pipelined organization requires
sophisticated compilation techniques.
Basic Concepts
Making the Execution of
Programs Faster
• Use faster circuit technology to build the
processor and the main memory.
• Arrange the hardware so that more than
one operation can be performed at the
same time.
• In the latter way, the number of operations
performed per second is increased even
though the elapsed time needed to
perform any one operation is not changed.
Traditional Pipeline Concept
• Laundry Example
• Ann, Brian, Cathy, Dave
each have one load of clothes
to wash, dry, and fold
• Washer takes 30 minutes
• Dryer takes 40 minutes
• “Folder” takes 20 minutes
[Figure: sequential laundry timeline from 6 PM to midnight; loads A, B, C and D each take 30 + 40 + 20 minutes, one after another.]
• Sequential laundry takes 6 hours for 4 loads
• If they learned pipelining, how long would laundry take?
[Figure: pipelined laundry timeline starting at 6 PM; loads A to D in task order overlap, with intervals of 30, 40, 40, 40, 40 and 20 minutes.]
• Pipelined laundry takes 3.5 hours for 4 loads
• Pipelining doesn’t help latency of single task, it
helps throughput of entire workload
• Pipeline rate limited by slowest pipeline stage
• Multiple tasks operating simultaneously using
different resources
• Potential speedup = Number pipe stages
• Unbalanced lengths of pipe stages reduces speedup
• Time to “fill” pipeline and time to “drain” it reduces
speedup
• Stall for Dependences
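The laundry numbers can be reproduced with a few lines of Python. The sketch assumes identical loads that pass through the stages in order, so that after the first load fills the pipeline one load finishes every slowest-stage interval; the function names are illustrative.

```python
def sequential_minutes(stage_times, loads):
    """Each load runs through every stage before the next load starts."""
    return loads * sum(stage_times)

def pipelined_minutes(stage_times, loads):
    """First load fills the pipeline; the slowest stage then sets the rate."""
    return sum(stage_times) + (loads - 1) * max(stage_times)

stages = [30, 40, 20]          # washer, dryer, "folder"
seq = sequential_minutes(stages, 4)
pipe = pipelined_minutes(stages, 4)
print(seq / 60, pipe / 60, seq / pipe)
```

For four loads this gives 6 hours sequentially and 3.5 hours pipelined, a speedup of about 1.7 rather than 3, because the stages are unbalanced and the pipeline has to fill and drain.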
Use the Idea of Pipelining in a
Computer
Fetch + Execution
(a) Sequential execution: F1 E1 F2 E2 F3 E3 for instructions I1, I2, I3
(b) Hardware organization: instruction fetch unit, interstage buffer B1, execution unit
(c) Pipelined execution
Clock cycle   1    2    3    4
I1            F1   E1
I2                 F2   E2
I3                      F3   E3
Figure 8.1. Basic idea of instruction pipelining.
Use the Idea of Pipelining in a
Computer
Fetch + Decode + Execution + Write
(a) Instruction execution divided into four steps
Clock cycle   1    2    3    4    5    6    7
I1            F1   D1   E1   W1
I2                 F2   D2   E2   W2
I3                      F3   D3   E3   W3
I4                           F4   D4   E4   W4
(b) Hardware organization: F: Fetch instruction; D: Decode instruction and fetch operands; E: Execute operation; W: Write results; interstage buffers B1, B2, B3
Figure 8.2. A 4-stage pipeline. (Textbook page 457)

Role of Cache Memory
• Each pipeline stage is expected to complete in
one clock cycle.
• The clock period should be long enough to let
the slowest pipeline stage complete.
• Faster stages can only wait for the slowest one
to complete.
• Since main memory is very slow compared to
the execution, if each instruction needs to be
fetched from main memory, the pipeline is almost
useless.
• Fortunately, we have cache.
Pipeline Performance
• The potential increase in performance
resulting from pipelining is proportional to
the number of pipeline stages.
• However, this increase would be achieved
only if all pipeline stages require the same
time to complete, and there is no
interruption throughout program execution.
• Unfortunately, this is not true.
Time
Clock cycle 1 2 3 4 5 6 7 8 9
Instruction
I1 F1 D1 E1 W1
I2 F2 D2 E2 W2
I3 F3 D3 E3 W3
I4 F4 D4 E4 W4
I5 F5 D5 E5
Figure 8.3. Effect of an execution operation taking more than one clock cycle.
• The previous pipeline is said to have been stalled for two
clock cycles.
• Any condition that causes a pipeline to stall is called a
hazard.
• Data hazard – any condition in which either the source or
the destination operands of an instruction are not
available at the time expected in the pipeline. So some
operation has to be delayed, and the pipeline stalls.
• Instruction (control) hazard – a delay in the availability of
an instruction causes the pipeline to stall.
• Structural hazard – the situation when two instructions
require the use of a given hardware resource at the
same time.
Instruction hazard
(a) Instruction execution steps in successive clock cycles
Clock cycle   1    2    3    4    5    6    7    8    9
I1            F1   D1   E1   W1
I2                 F2   F2   F2   F2   D2   E2   W2
I3                                     F3   D3   E3   W3
(b) Function performed by each processor stage in successive clock cycles
Clock cycle   1    2    3    4    5    6    7    8    9
F: Fetch      F1   F2   F2   F2   F2   F3
D: Decode          D1   idle idle idle D2   D3
E: Execute              E1   idle idle idle E2   E3
W: Write                     W1   idle idle idle W2   W3
Idle periods: stalls (bubbles)
Figure 8.4. Pipeline stall caused by a cache miss in F2.

Structural hazard: Load X(R1), R2
Time
Clock cycle 1 2 3 4 5 6 7
Instruction
I1 F1 D1 E1 W1
I2 (Load) F2 D2 E2 M2 W2
I3 F3 D3 E3 W3
I4 F4 D4 E4
I5 F5 D5
Figure 8.5. Effect of a Load instruction on pipeline timing.

• Again, pipelining does not result in individual
instructions being executed faster; rather, it is
the throughput that increases.
• Throughput is measured by the rate at which
instruction execution is completed.
• Pipeline stall causes degradation in pipeline
performance.
• We need to identify all hazards that may cause
the pipeline to stall and to find ways to minimize
their impact.
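The effect of stalls on throughput can be estimated with a simple count. The sketch assumes every stage normally takes one clock cycle, that instructions enter the pipeline one per cycle, and that each stall cycle delays everything behind it by one cycle; the function names and the 1000-instruction example are illustrative assumptions.

```python
def pipeline_cycles(n_instructions, n_stages, stall_cycles=0):
    """Cycles to finish all instructions: fill time, one per cycle, plus stalls."""
    return n_stages + (n_instructions - 1) + stall_cycles

def speedup(n_instructions, n_stages, stall_cycles=0):
    """Ratio of unpipelined time (n_stages cycles per instruction) to pipelined time."""
    sequential = n_instructions * n_stages
    return sequential / pipeline_cycles(n_instructions, n_stages, stall_cycles)

print(pipeline_cycles(4, 4))         # four instructions, no stalls (as in Figure 8.2)
print(pipeline_cycles(3, 4, 3))      # three instructions, three stall cycles (as in Figure 8.4)
print(speedup(1000, 4), speedup(1000, 4, stall_cycles=200))
```

For a long instruction stream the speedup approaches the number of stages only when there are no stalls, which is why hazards are worth identifying and minimizing.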
Quiz
• There are four instructions, and I2 takes two clock
cycles for execution. Please draw the figure
for a 4-stage pipeline, and work out the
total number of cycles needed for the four
instructions to complete.
Data Hazards
Data Hazards
• We must ensure that the results obtained when
instructions are executed in a pipelined processor are
identical to those obtained when the same instructions
are executed sequentially.
• Hazard occurs
A←3+A
B←4×A
• No hazard
A←5×C
B ← 20 + C
• When two operations depend on each other, they must
be executed sequentially in the correct order.
• Another example:
Mul R2, R3, R4
Add R5, R4, R6
Data Hazards
Time
Clock cycle 1 2 3 4 5 6 7 8 9
Instruction
I1 (Mul) F1 D1 E1 W1
I2 (Add) F2 D2 D2A E2 W2
I3 F3 D3 E3 W3
I4 F4 D4 E4 W4
Figure 8.6. Pipeline stalled by data dependency between D2 and W1.
Operand Forwarding
• Instead of from the register file, the second
instruction can get data directly from the
output of ALU after the previous instruction
is completed.
• A special arrangement needs to be made
to “forward” the output of ALU to the input
of ALU.
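[Figure: (a) Datapath: Source 1 and Source 2 from the register file load SRC1 and SRC2, which feed the ALU; the ALU result goes to RSLT and then to the Destination in the register file. (b) Position of the source and result registers in the processor pipeline: SRC1, SRC2 and the ALU belong to E: Execute, RSLT and the register file to W: Write, with a forwarding path from the result back to the ALU inputs.]
Figure 8.7. Operand forwarding in a pipelined processor.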
Handling Data Hazards in
Software
• Let the compiler detect and handle the
hazard:
I1: Mul R2, R3, R4
NOP
NOP
I2: Add R5, R4, R6
• The compiler can reorder the instructions
to perform some useful work during the
NOP slots.
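What the compiler does in this simple case can be sketched in Python. The sketch assumes the operand order used on the slides (sources first, destination last), checks only adjacent instructions, and inserts a fixed two NOPs as in the example above; the helper names are hypothetical, and a real compiler would analyse a longer window and try to fill the slots with useful work.

```python
def dest_and_sources(instr):
    """Split 'Mul R2, R3, R4' into its destination (last operand) and sources."""
    operands = [r.strip() for r in instr.split(maxsplit=1)[1].split(",")]
    return operands[-1], operands[:-1]

def insert_nops(program, nops=2):
    """Insert NOPs wherever an instruction reads the result of the one before it."""
    result = []
    previous = None
    for instr in program:
        if previous is not None:
            dest, _ = dest_and_sources(previous)
            _, sources = dest_and_sources(instr)
            if dest in sources:          # data dependency on the previous result
                result.extend(["NOP"] * nops)
        result.append(instr)
        previous = instr
    return result

padded = insert_nops(["Mul R2, R3, R4", "Add R5, R4, R6"])
print(padded)
```

Handling the hazard in software keeps the hardware simple, at the price of larger code and of tying the program to one pipeline implementation.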
Side Effects
• The previous example is explicit and easily detected.
• Sometimes an instruction changes the contents of a
register other than the one named as the destination.
• When a location other than one explicitly named in an
instruction as a destination operand is affected, the
instruction is said to have a side effect. (Example?)
• Example: condition code flags:
Add R1, R3
AddWithCarry R2, R4
• Instructions designed for execution on pipelined
hardware should have few side effects.
Instruction Hazards
Overview
• Whenever the stream of instructions
supplied by the instruction fetch unit is
interrupted, the pipeline stalls.
• Cache miss
• Branch
Unconditional Branches
Clock cycle   1    2    3    4    5    6
I1            F1   E1
I2 (Branch)        F2   E2
I3                      F3   X
Ik                           Fk   Ek
Ik+1                              Fk+1 Ek+1
Execution unit idle in clock cycle 4.
Figure 8.8. An idle cycle caused by a branch instruction.

Branch Timing
- Branch penalty
- Reducing the penalty
(a) Branch address computed in Execute stage
Clock cycle   1    2    3    4    5    6    7    8
I1            F1   D1   E1   W1
I2 (Branch)        F2   D2   E2
I3                      F3   D3   X
I4                           F4   X
Ik                                Fk   Dk   Ek   Wk
Ik+1                                   Fk+1 Dk+1 Ek+1
(b) Branch address computed in Decode stage
Clock cycle   1    2    3    4    5    6    7
I1            F1   D1   E1   W1
I2 (Branch)        F2   D2
I3                      F3   X
Ik                           Fk   Dk   Ek   Wk
Ik+1                              Fk+1 Dk+1 Ek+1
Figure 8.9. Branch timing.

Instruction Queue and
Prefetching
[Figure: the instruction fetch unit (F: Fetch instruction) fills an instruction queue; the D: Dispatch/Decode unit takes instructions from the queue, followed by E: Execute instruction and W: Write results.]
Figure 8.10. Use of an instruction queue in the hardware organization of Figure 8.2b.
Conditional Branches
• A conditional branch instruction introduces
the added hazard caused by the
dependency of the branch condition on the
result of a preceding instruction.
• The decision to branch cannot be made
until the execution of that instruction has
been completed.
• Branch instructions represent about 20%
of the dynamic instruction count of most
programs.
Super Scalar Architecture
[Figure: F: instruction fetch unit, instruction queue, dispatch unit, floating point unit and integer unit, W: write results.]
Super Scalar Operation
• Equip processor with multiple processing
units – several instructions can be
executed in the same clock cycle –
multiple issue processor
• Throughput can be > 1 instruction / cycle
• Compiler should interleave floating point
and integer instructions
• Out-of-order execution must be taken
care of
