Professional Documents
Culture Documents
Organization
Unit 4 Lesson 1
Processor Design
Basic Processing Unit
Connection Between the
Processor and the Memory
Memory
MAR MDR
Control
PC R0
R1
Processo
IR
ALU
Rn - 1
n general purpose
registers
Control signals
PC
Instruction
Address
decoder and
lines
MAR control logic
Memory
bus
MDR
Data
lines IR
Datapath
Y
Constant 4 R0
Select MUX
Add
A B
ALU Sub R( n - 1)
control ALU
lines
Carry-in
XOR TEMP
Z
Textbook Page 413
Ri
Riout
Yin
Constant 4
Select MUX
A B
ALU
Zin
Figure 7.2. Input and output gating for the registers in Figure 7.1.
Register Transfers
• All operations and data transfers are controlled by the processor
clock.
Bus
D Q
1
Q
Riout
Ri in
Clock
Figure
Figure7.3.
7.3.Input
Inputand outputgating
and output ating
g for one
for oneregister
re
gisterbit.
bit.
Performing an Arithmetic or
Logic Operation
• The ALU is a combinational circuit that has no
internal storage.
• ALU gets the two operands from MUX and bus.
The result is temporarily stored in register Z.
• What is the sequence of operations to add the
contents of register R1 to those of R2 and store
the result in R3?
1. R1out, Yin
2. R2out, SelectY, Add, Zin
3. Zout, R3in
Fetching a Word from Memory
• Address into MAR; issue Read operation; data into
MDR.
Memory-bus Internal processo
data lines MDRoutE MDRout bus
MDR
Figure
Figure7.4.
7.4. Connection and
Connection and control
control signals
signals forgister
re MDR.
for register MDR.
Fetching a Word from Memory
• The response time of each memory access
varies (cache miss, memory-mapped I/O,…).
• To accommodate this, the processor waits until it
receives an indication that the requested
operation has been completed (Memory-
Function-Completed, MFC).
• Move (R1), R2
¾ MAR ← [R1]
¾ Start a Read operation on the memory bus
¾ Wait for the MFC response from the memory
¾ Load MDR from the memory bus
¾ R2 ← [MDR]
Step 1 2 3
Clock
Timing
MARin
MDRinE
z Move (R1), R2
1. R1out, MARin, Read Data
2. MDRinE, WMFC
MFC
3. MDRout, R2in
MDR out
Ri
Riout
Yin
Constant 4
Select MUX
A B
ALU
Zin
Zout
Figure 7.2. Input and output gating for the registers in Figure 7.1.
Execution of a Complete
Instruction Internal processor
bus
PC
Step Action
Instruction
Address
decoder and
lines
1 PCout , MAR in , Read, Select4,Add, Zin MAR control logic
Memory
2 Zout , PCin , Y in , WMF C bus
lines IR
4 R3out , MAR in , Read
5 R1out , Y in , WMF C Y
Constant 4 R0
6 MDR out , SelectY, Add, Zin
7 Zout , R1in , End Select MUX
Add
A B
ALU Sub R( n - 1)
control ALU
lines
Carry-in
Figure 7.6. Control sequencefor execution of the instruction Add (R3),R1.
XOR TEMP
Z
Add R2, R1 ?
PC
Step Action
Instruction
Address
decoder and
lines
1 PCout , MAR in , Read, Select4,Add, Zin MAR control logic
Memory
2 Zout , PCin , Y in , WMF C bus
lines IR
4 R3out , MAR in , Read
5 R1out , Y in , WMF C Y
Constant 4 R0
6 MDR out , SelectY, Add, Zin
R2out
7 Zout , R1in , End Select MUX
Add
A B
ALU Sub R( n - 1)
control ALU
lines
Carry-in
Figure 7.6. Control sequencefor execution of the instruction Add (R3),R1.
XOR TEMP
Step Action
Incrementer
Textbook Page 424
PC
Constant 4
accessed simultaneously and
have their contents placed on
buses A and B.
MUX
ALU R
Step Action
Control signals
execution of the
MAR control logic
Memory
bus
instruction Data
lines
MDR
IR
Carry-in
single bus
XOR TEMP
architecture)
Figure 7.1. Single-bus organization of the datapath inside a proc
Hardwired Control
Overview
• To execute instructions, the processor
must have some means of generating the
control signals needed in the proper
sequence.
• Two categories: hardwired control and
microprogrammed control
• Hardwired system can operate at high
speed; but with little flexibility.
Control Unit Organization
CLK Control step
Clock counter
External
inputs
Decoder/
IR
encoder
Condition
codes
Control signals
Step decoder
T 1 T2 Tn
INS1
External
INS2 inputs
Instruction
IR Encoder
decoder
Condition
codes
INSm
Run End
Control signals
T4 T6
T1
Figure 7.12. Generation of the Zin control signal for the processor in Figure 7.1.
Generating End
• End = T7 • ADD + T5 • BR + (T5 • N + T4 • N) • BRN +…
Branch<0
Add Branch
N N
T7 T5 T4 T5
End
Instruction Data
cache cache
Bus interface
Processor
System b
us
Main Input/
memory Output
MDRout
WMFC
MAR in
Select
Read
PCout
R1out
R3out
Micro -
End
PCin
R1in
Add
Z out
IRin
Yin
instruction
Zin
1 0 1 1 1 0 0 0 1 1 1 0 0 0 0 0 0
2 1 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0
3 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0
4 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0
5 0 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0
6 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1
Clock µPC
Control
store CW
AddressMicroinstruction
Clock µPC
Control
store CW
F1 F2 F3 F4 F5
0000: No transfer 000: No transfer 000: No transfer 0000: Add 00: No action
0001: PCout 001: PCin 001: MARin 0001: Sub 01: Read
0010: MDRout 010: IRin 010: MDRin 10: Write
0011: Zout 011: Zin 011: TEMPin
0100: R0out 100: R0in 100: Yin 1111: XOR
0101: R1out 101: R1in
0110: R2out 110: R2in 16 ALU
functions
0111: R3out 111: R3in
1010: TEMPout
1011: Offsetout
F6 F7 F8
What is the price paid for
this scheme?
F6 (1 bit) F7 (1 bit) F8 (1 bit)
Require a little more hardware
0: SelectY 0: No action 0: Continue
1: Select4 1: WMFC 1: End
- Bit-ORing
- Wide-Branch Addressing
- WMFC
Mode
11 10 87 4 3 0
External Condition
Inputs codes
Decoding circuits
µA R
Control store
Next address µI R
Microinstruction decoder
Control signals
F0 F1 F2 F3
F4 F5 F6 F7
F8 F9 F10
0 0 0 0 0 0 0 0 0 0 1 0 0 1 01 1 0 0 1 0 0 0 0 01 1 0 0 0 0
0 0 1 0 0 0 0 0 0 1 0 0 1 1 00 1 1 0 0 0 0 0 0 00 0 1 0 0 0
0 0 2 0 0 0 0 0 0 1 1 0 1 0 01 0 0 0 0 0 0 0 0 00 0 0 0 0 0
0 0 3 0 0 0 0 0 0 0 0 0 0 0 00 0 0 0 0 0 0 0 0 00 0 0 1 1 0
121 0 1 0 1 0 0 1 0 1 0 0 01 1 0 0 1 0 0 0 0 01 1 0 0 0 0
122 0 1 1 1 1 0 0 0 0 1 1 10 0 0 0 0 0 0 0 0 00 0 1 0 0 1
1 7 0 0 1 1 1 1 0 0 1 0 1 0 00 0 0 0 1 0 0 0 0 01 0 1 0 0 0
1 7 1 0 1 1 1 1 0 1 0 0 1 0 00 0 1 0 0 0 0 0 0 00 0 0 0 0 0
1 7 2 0 1 1 1 1 0 1 1 1 0 1 01 1 0 0 0 0 0 0 0 00 0 0 0 0 0
1 7 3 0 0 0 0 0 0 0 0 0 1 1 10 1 0 0 0 0 0 0 0 00 0 0 0 0 0
Decoder
Decoder
IR Rsrc Rdst
InstDecout
External
inputs ORmode
Decoding
circuits
Condition ORindsrc
codes
µA R
Control store
Rdstout
Rdstin
Microinstruction
decoder
Rsrcout
Rsrcin
• Laundry Example
• Ann, Brian, Cathy, Dave
each have one load of clothes
to wash, dry, and fold A B C D
• Washer takes 30 minutes
Time
30 40 20 30 40 20 30 40 20 30 40 20
• Sequential laundry takes 6
A hours for 4 loads
• If they learned pipelining,
B
how long would laundry
take?
D
Traditional Pipeline Concept
6 PM 7 8 9 10 11 Midnight
Time
T
a 30 40 40 40 40 20
s
A
k • Pipelined laundry
takes 3.5 hours for 4
O B loads
r
d C
e
r D
Traditional Pipeline Concept
• Pipelining doesn’t help
6 PM 7 8 9
latency of single task, it
Time helps throughput of entire
T workload
a 30 40 40 40 40 20 • Pipeline rate limited by
s slowest pipeline stage
A
k • Multiple tasks operating
simultaneously using
O B different resources
r • Potential speedup =
d
Number pipe stages
C
e • Unbalanced lengths of pipe
r
stages reduces speedup
D
• Time to “fill” pipeline and
time to “drain” it reduces
speedup
• Stall for Dependences
Use the Idea of Pipelining in a
Computer
Fetch + Execution
Time
I1 I2 I3
Time
Clock cycle 1 2 3 4
F E F E F E
1 1 2 2 3 3 Instruction
I1 F1 E1
(a) Sequential execution
I2 F2 E2
Interstage buffer
B1
I3 F3 E3
Instruction Execution
fetch unit (c) Pipelined execution
unit
Instruction
1 2 3 4 5 6 7
Time
I1 F1 D1 E1 W1
Fetch + Decode
+ Execution + Write I2 F2 D2 E2 W2
I3 F3 D3 E3 W3
I4 F4 D4 E4 W4
Interstage u
bffers
D : Decode
F : Fetch instruction E: Execute W : Write
instruction and fetch operation results
operands
B1 B2 B3
Instruction
I1 F1 D1 E1 W1
I2 F2 D2 E2 W2
I3 F3 D3 E3 W3
I4 F4 D4 E4 W4
I5 F5 D5 E5
I2 F2 D2 E2 W2
I3 F3 D3 E3 W3
Time
Clock cycle 1 2 3 4 5 6 7 8 9
Stage
F: Fetch F1 F2 F2 F2 F2 F3
Idle periods –
D: Decode D1 idle idle idle D2 D3
stalls (bubbles)
E: Execute E1 idle idle idle E2 E3
Instruction
I1 F1 D1 E1 W1
I 2 (Load) F2 D2 E2 M2 W2
I3 F3 D3 E3 W3
I4 F4 D4 E4
I5 F5 D5
Instruction
I 1 (Mul) F1 D1 E1 W1
I 2 (Add) F2 D2 D2A E2 W2
I3 F3 D3 E3 W3
I4 F4 D4 E4 W4
SRC1 SRC2
Register
file
ALU
RSLT
Destination
(a) Datapath
SRC1,SRC2 RSLT
E: Execute W: Write
(ALU) (Register file)
Forwarding path
(b) Position of the source and result registers in the processor pipeline
Instruction
I1 F1 E1
I3 F3 X
Ik Fk Ek
Branch Timing
Clock cycle 1 2 3 4 5 6 7 8
I1 F1 D1 E1 W1
I 2 (Branch) F2 D2 E2
I3 F3 D3 X
- Branch penalty I4 F4 X
Ik Fk Dk Ek Wk
- Reducing the penalty
I k+1 Fk+1 Dk+1 Ek+1
Time
Clock cycle 1 2 3 4 5 6 7
I1 F1 D1 E1 W1
I 2 (Branch) F2 D2
I3 F3 X
Ik Fk Dk Ek Wk
D : Dispatch/
Decode E : Execute W : Write
instruction results
unit
Figure 8.10. Use of an instruction queue in the hardware organization of Figure 8.2b.
Conditional Braches
• A conditional branch instruction introduces
the added hazard caused by the
dependency of the branch condition on the
result of a preceding instruction.
• The decision to branch cannot be made
until the execution of that instruction has
been completed.
• Branch instructions represent about 20%
of the dynamic instruction count of most
programs.
Super Scalar Architecture
F: Instruction
Fetch Unit Instruction Queue
Floating
Point Unit
Integer
Unit
Super Scalar Operation
• Equip processor with multiple processing
units – several instructions can be
executed in the same clock cycle –
multiple issue processor
• Throughput can be > 1 instruction / cycle
• Compiler should interleave floating point
and integer instructions
• Out-Of-Order Execution should be taken
care