Processor Design
Christos Kozyrakis
http://ee180.stanford.edu
2
Review: Digital Logic Design
n Key elements
n Basic gates: AND, OR, NOT
n Complex gates: multiplexors, NAND/NOR/XOR/…, n-input
AND/OR/…, adders, ALUs, multi-bit ALUs/multiplexors, …
n Simple state: flip-flops, multi-bit registers
n Memories: synchronous & asynchronous
n Clocking methodology
3
Building a Processor
n Simple or complex
4
Building a Processor
n Generally hardware consists of two parts
n Datapath: the hardware that processes and stores data
n Combinational circuits and state elements
n Control: the hardware that manages the datapath
6
Simple Processor
n Follow the ISA to the letter
n Execute one instruction to completion before moving
to the next one
8
Initial Processor Datapath
n Instruction = Memory[PC]
n Fetch the instruction from memory
n Always 32 bits
11
Fetching the Instruction
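[Figure: instruction fetch. The PC, a 32-bit register, supplies the 32-bit address to instruction memory; an adder increments it by 4 for the next instruction.]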
12
What Did We Fetch?
R-format
Bits 6 5 5 5 5 6
OP rs rt rd sa funct
I-format
Bits 6 5 5 16
OP rs rt imm
(rs: first source register; rt: second source register; imm: immediate)
J-format
Bits 6 26
OP target
(target: jump target address)
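The field widths above can be turned into shifts and masks. The following is a minimal sketch, assuming the fields are packed from the most significant bit downward in the order listed; the function name `decode` and the dictionary layout are illustrative, not part of the lecture.

```python
def decode(word):
    """Split a 32-bit MIPS instruction word into every field of the three formats.

    All fields are extracted unconditionally; which ones are meaningful
    depends on the format selected by the opcode.
    """
    return {
        "op":     (word >> 26) & 0x3F,   # bits 31-26, all formats
        "rs":     (word >> 21) & 0x1F,   # bits 25-21, R- and I-format
        "rt":     (word >> 16) & 0x1F,   # bits 20-16, R- and I-format
        "rd":     (word >> 11) & 0x1F,   # bits 15-11, R-format
        "sa":     (word >> 6) & 0x1F,    # bits 10-6,  R-format
        "funct":  word & 0x3F,           # bits 5-0,   R-format
        "imm":    word & 0xFFFF,         # bits 15-0,  I-format
        "target": word & 0x3FFFFFF,      # bits 25-0,  J-format
    }


# addu $3, $1, $2 encodes as op=0, rs=1, rt=2, rd=3, sa=0, funct=0x21
fields = decode(0x00221821)
```

Extracting every field in parallel mirrors the hardware: the wires for rs, rt, rd, and imm are always driven, and the control logic decides which of them are used.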
13
Nice Characteristics of MIPS ISA
n Instructions are fixed length
n Don’t need to decode first instruction to find next one
n Always add 4 bytes to instruction pointer (PC)
14
Register-Register Instructions
n In our subset this is only addu and subu
n We do not want to worry about overflow yet…
n Operation
addu rd, rs, rt # R[rd] <- R[rs] + R[rt]
subu rd, rs, rt # R[rd] <- R[rs] - R[rt]
Bits 6 5 5 5 5 6
OP=0 rs rt rd sa funct
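Because addu and subu ignore overflow, their results simply wrap around modulo 2^32. A small sketch of that behavior, with register values modeled as Python integers (the helper names `addu` and `subu` mirror the mnemonics and are illustrative only):

```python
MASK32 = 0xFFFFFFFF  # keep results within 32 bits


def addu(a, b):
    """R[rd] <- R[rs] + R[rt], wrapping on overflow instead of trapping."""
    return (a + b) & MASK32


def subu(a, b):
    """R[rd] <- R[rs] - R[rt], wrapping on underflow instead of trapping."""
    return (a - b) & MASK32
```

For example, `addu(0xFFFFFFFF, 1)` wraps to 0 and `subu(0, 1)` wraps to 0xFFFFFFFF.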
16
ORI Instruction
n OR immediate instruction
Bits 6 5 5 16
OP rs rt imm
[Figure: datapath for ori. Instruction[25–21] and Instruction[20–16] select the registers to read; the RegDst MUX picks the write register from Instruction[20–16] or Instruction[15–11]; Instruction[15–0] is sign- or zero-extended from 16 to 32 bits; the ALUSrc MUX picks Read data 2 or the extended immediate as the second ALU operand. Control signals shown: RegDst, ALUSrc, ALUOp.]
18
Load Instruction
n Load instruction
19
Datapath: Load Instruction
n Immediate is sign extended
n Extender handles either sign or zero extension
n ALU output fed to memory as address
n MUX selects between ALU result and Memory output
[Figure: datapath for lw. A data memory is added after the ALU: the ALU result drives the memory Address input, and the MemtoReg MUX selects the ALU result or the memory Read data as the register Write data. Control signals shown: RegWrite, RegDst, ALUSrc, ALUOp, MemRead, MemWrite, MemtoReg.]
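The extender and the load address calculation described on this slide can be sketched as follows. This is an illustrative model only: the function names and the `sign` flag (standing in for the extender's control input) are assumptions, not names from the lecture.

```python
MASK32 = 0xFFFFFFFF


def extend(imm16, sign):
    """Widen a 16-bit immediate to 32 bits.

    sign=True replicates bit 15 into the upper half (used by lw/sw);
    sign=False fills the upper half with zeros (used by ori).
    """
    imm16 &= 0xFFFF
    if sign and (imm16 & 0x8000):
        return imm16 | 0xFFFF0000
    return imm16


def load_address(rs_value, imm16):
    """Effective address for lw: R[rs] + SignExt(imm), as computed by the ALU."""
    return (rs_value + extend(imm16, sign=True)) & MASK32
```

With a base register of 0x1000 and an immediate of 0xFFFC (that is, -4), `load_address` yields 0x0FFC, while zero extension of the same immediate would give 0x0000FFFC.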
20
Store Instruction
n Store instruction
Bits 6 5 5 16
OP rs rt imm
21
Datapath: Store Instruction
n Memory address calculated just as in lw case
n Read Register 2 is passed to Memory as data
[Figure: datapath for sw. Same datapath as for lw; Read data 2 from the register file drives the data memory Write data input. Control signals shown: RegWrite, RegDst, ALUSrc, ALUOp, MemRead, MemWrite, MemtoReg.]
22
Branch Instruction
n Branch instruction: beq rs, rt, immediate
Bits 6 5 5 16
OP rs rt imm
(rs: first source register; rt: second source register; imm: immediate)
23
The Next PC
n PC is byte-addressed into instruction memory
n Sequential
PC[31:0] = PC[31:0] + 4
n Branch operation
PC[31:0] = PC[31:0] + 4 + SignExt(imm) × 4
n Simplification
n PC is byte addressed, but instructions are 4 bytes long
n Simplify hardware by using 30 bit PC
n Sequential
PC[31:2] = PC[31:2] + 1
n Branch operation
PC[31:2] = PC[31:2] + 1 + SignExt(imm)
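The two formulations above (32-bit byte address and 30-bit word address) can be checked against each other with a short sketch. The function names and the `take_branch` flag are illustrative assumptions; in the datapath that flag corresponds to the branch decision.

```python
MASK32 = 0xFFFFFFFF
MASK30 = 0x3FFFFFFF


def sign_extend16(imm16):
    """Interpret a 16-bit field as a signed integer (two's complement)."""
    imm16 &= 0xFFFF
    return imm16 - ((imm16 & 0x8000) << 1)


def next_pc(pc, imm16, take_branch):
    """Byte-addressed form: PC + 4, plus SignExt(imm) * 4 on a taken branch."""
    result = pc + 4
    if take_branch:
        result += sign_extend16(imm16) * 4
    return result & MASK32


def next_pc_word(pc30, imm16, take_branch):
    """Word-addressed form on PC[31:2]: PC + 1, plus SignExt(imm) if taken."""
    result = pc30 + 1
    if take_branch:
        result += sign_extend16(imm16)
    return result & MASK30
```

For any word-aligned PC, `next_pc_word(pc >> 2, imm, t) << 2` equals `next_pc(pc, imm, t)`, which is why the two low bits never need to be stored.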
24
Datapath for the PC
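[Figure: next-PC datapath with a 30-bit PC. One adder computes PC + 1; a second adder adds Instruction[15–0] sign-extended from 16 to 30 bits; a MUX controlled by Branch and Zero selects between the two results. The PC with 00 appended forms the instruction memory Read address.]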
25
Jump Instruction
n Jump instruction
Bits 6 26
OP target
26
Datapath for the PC
n MUX selects pseudodirect jump target
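[Figure: next-PC datapath with a 32-bit PC. One adder computes PC + 4; a second adder adds the branch offset shifted left by 2; a MUX controlled by Branch and Zero selects between the two results, and a second MUX controlled by Jump selects the jump target.]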
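A pseudodirect jump target can be sketched as a concatenation of three pieces. This is a minimal sketch under the standard MIPS assumption that the upper four bits come from the (incremented) PC, followed by the 26-bit target field and two zero bits; the function name is illustrative.

```python
def jump_target(pc, target26):
    """Form a 32-bit jump address: PC[31:28] ++ target[25:0] ++ 00.

    `pc` is assumed here to be the already-incremented PC.
    """
    upper = pc & 0xF0000000                 # keep PC[31:28]
    lower = (target26 & 0x3FFFFFF) << 2     # shift the 26-bit field left by 2
    return upper | lower
```

For instance, with the PC at 0x40000004 and a target field of 0x10, the jump lands at 0x40000040. A jump can therefore reach any instruction within the current 256 MB region, but not beyond it.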
27
Putting it All Together: Our First Processor (Datapath)
[Figure: complete single-cycle datapath combining the next-PC logic (jump target formed from PC[31–28], Instruction[25–0], and 00), instruction memory, register file, sign extender, ALU, and data memory. Control signals: Jump, Branch, RegWrite, RegDst, ALUSrc, ALUOp, MemRead, MemWrite, MemtoReg. The ALU comparison output is labeled Eq.]
28
Control
n State free
n Every instruction takes a single cycle
n Just decode instruction bits
[Figure: the single-cycle datapath with its control inputs marked: RegWrite, RegDst, ALUSrc, ALUOp, MemRead, MemWrite, MemtoReg, and the extender control.]
30
Control: addu
[Figure: full datapath annotated with the control values for addu: Jump = 0, Branch = 0, RegWrite = 1, RegDst = 1, ALUSrc = 0, ALUOp = R-type operation, MemWrite = 0, MemRead = 0, MemtoReg = 0, ExtOp = X.]
31
Control: Next PC
[Figure: full datapath annotated for sequential execution (next PC = PC + 4): Jump = 0, Branch = 0, RegWrite = 1, RegDst = 1, ALUSrc = 0, ALUOp = R-type operation, MemWrite = 0, MemRead = 0, MemtoReg = 0, ExtOp = X.]
32
Control: ori
[Figure: full datapath annotated with the control values for ori: Jump = 0, Branch = 0, RegWrite = 1, RegDst = 0, ALUSrc = 1, ALUOp = Or, MemWrite = 0, MemRead = 0, MemtoReg = 0, ExtOp = 0.]
33
Control: Load
[Figure: full datapath annotated with the control values for a load: Jump = 0, Branch = 0, RegWrite = 1, RegDst = 0, ALUSrc = 1, ALUOp = Add, MemWrite = 0, MemRead = 1, MemtoReg = 1, ExtOp = 1.]
34
Control: Store
[Figure: full datapath annotated with the control values for a store: Jump = 0, Branch = 0, RegWrite = 0, RegDst = X, ALUSrc = 1, ALUOp = Add, MemWrite = 1, MemRead = 0, MemtoReg = X, ExtOp = 1.]
35
Control: Branch
[Figure: full datapath annotated with the control values for a branch: Jump = 0, Branch = 1, RegWrite = 0, RegDst = X, ALUSrc = 0, ALUOp = Sub, MemWrite = 0, MemRead = 0, MemtoReg = X, ExtOp = 1.]
36
Control: Jump
[Figure: full datapath annotated with the control values for a jump: Jump = 1, Branch = 0, RegWrite = 0, RegDst = X, ALUSrc = X, ALUOp = X, MemWrite = 0, MemRead = 0, MemtoReg = X, ExtOp = X.]
37
Control Signals
38
Turning Control Tables to Gates
n What is logical equation for
39
Timing for MemWrite & RegWrite
n How quickly should the MemWrite signal go to 1?
n How would you implement this?
40
Multilevel Decoding
n You can have a single decoder block OR
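[Figure: two-level decoding. Main Control decodes the 6-bit op field and produces ALUop (N bits); a local ALU Control block combines ALUop with the 6-bit func field to produce the 3-bit ALUctr signal for the ALU.]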
41
Multilevel Decoding
Instruction  R-type    ori       lw        sw        beq       jump
op           00 0000   00 1101   10 0011   10 1011   00 0100   00 0010
RegDst       1         0         0         x         x         x
ALUSrc       0         1         1         1         0         x
MemtoReg     0         0         1         x         x         x
RegWrite     1         1         1         0         0         0
MemWrite     0         0         0         1         0         0
Branch       0         0         0         0         1         0
Jump         0         0         0         0         0         1
ExtOp        x         0         1         1         x         x
ALUop<N:0>   “R-type”  Or        Add       Add       Subtract  xxx
[Figure label: Instruction[5–0], the func field, feeds the local ALU control.]
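Since the control is state free, the main decoder is just a lookup from the 6-bit op field to a row of signal values. The sketch below transcribes the table into a dictionary; the names `CONTROL` and `main_control` are illustrative, and `None` is used here as a stand-in for a don't-care (x) entry.

```python
X = None  # don't-care entry in the control table

SIGNALS = ("RegDst", "ALUSrc", "MemtoReg", "RegWrite",
           "MemWrite", "Branch", "Jump", "ExtOp", "ALUop")

# One row per opcode, in the same signal order as SIGNALS.
CONTROL = {
    0b000000: (1, 0, 0, 1, 0, 0, 0, X, "R-type"),    # R-type
    0b001101: (0, 1, 0, 1, 0, 0, 0, 0, "Or"),        # ori
    0b100011: (0, 1, 1, 1, 0, 0, 0, 1, "Add"),       # lw
    0b101011: (X, 1, X, 0, 1, 0, 0, 1, "Add"),       # sw
    0b000100: (X, 0, X, 0, 0, 1, 0, X, "Subtract"),  # beq
    0b000010: (X, X, X, 0, 0, 0, 1, X, X),           # jump
}


def main_control(op):
    """Return the control signals for a 6-bit opcode as a name -> value dict."""
    return dict(zip(SIGNALS, CONTROL[op]))
```

Reading a column of the table top to bottom gives the truth table for one signal. RegWrite, for example, is 1 exactly for the R-type, ori, and lw opcodes, so its gate-level form is an OR of three opcode-match terms.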
43
Single Cycle Processor Performance
n Functional unit delay
n Memory: 200ps
n ALU and adders: 200ps
n Register file: 100 ps
n Cons
n Cycle time is the worst case path → long cycle times
n Worst case = load
n Hardware is underutilized
n ALU and memory used only for a fraction of clock cycle
n Not well amortized!
n Best possible CPI is 1
46
Variable Clock Single Cycle Processor Performance
n Instruction Mix
n 45% ALU
n 25% loads
n 10% stores
n 15% branches
n 5% jumps
Delay per instruction class, in ps (- means the unit is not used)
Class    Instruction memory  Register read  ALU operation  Data memory  Register write  Total
R-type   200                 100            200            -            100             600
load     200                 100            200            200          100             800
store    200                 100            200            200          -               700
branch   200                 100            200            -            -               500
jump     200                 -              -              -            -               200
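The table and instruction mix can be combined into an average instruction time for a hypothetical variable-length clock and compared with the fixed clock, which must accommodate the slowest class. A minimal sketch, assuming the 45% ALU share maps to the R-type row (variable names are illustrative):

```python
# Total delay per instruction class in ps, from the table above.
LATENCY_PS = {"R-type": 600, "load": 800, "store": 700, "branch": 500, "jump": 200}

# Instruction mix; the 45% ALU share is assumed to correspond to R-type.
MIX = {"R-type": 0.45, "load": 0.25, "store": 0.10, "branch": 0.15, "jump": 0.05}

# Fixed clock: every instruction pays for the worst case (the load).
fixed_cycle_ps = max(LATENCY_PS.values())

# Variable clock: each instruction takes only as long as it needs.
variable_avg_ps = sum(MIX[c] * LATENCY_PS[c] for c in MIX)

speedup = fixed_cycle_ps / variable_avg_ps
```

The weighted sum is 270 + 200 + 70 + 75 + 10 = 625 ps against a fixed 800 ps cycle, a speedup of 1.28.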
47
Key Tools for System Architects
1. Pipelining
2. Parallelism
3. Out-of-order execution
4. Prediction
5. Caching
6. Indirection
7. Amortization
8. Redundancy
9. Specialization
10. Focus on the common case
48
Pipelining: The Laundry Analogy
n Ann, Brian, Cathy, Dave doing laundry
49
Single-cycle Laundry
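[Figure: sequential laundry timeline, 6 PM to midnight. Each load takes 30 + 40 + 20 minutes, and the next load starts only when the previous one has finished.]
[Figure: pipelined laundry timeline. Loads A, B, C, D overlap; the timeline is divided into intervals of 30, 40, 40, 40, 40, 20 minutes.]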
Pipelined laundry takes 3.5 hours for 4 loads
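The two laundry schedules can be reproduced with a small simulation. This is a sketch under the assumption that each load passes through three stages of 30, 40, and 20 minutes, that each stage handles one load at a time, and that a load may wait between stages; the function names are illustrative.

```python
def sequential_time(stages, n_loads):
    """Each load runs all stages to completion before the next one starts."""
    return n_loads * sum(stages)


def pipelined_time(stages, n_loads):
    """Loads overlap: a stage starts once the load has arrived from the
    previous stage and the stage itself has finished the previous load."""
    stage_free_at = [0] * len(stages)  # when each stage finishes its last load
    for _ in range(n_loads):
        t = 0  # time at which this load is ready for its next stage
        for j, duration in enumerate(stages):
            start = max(t, stage_free_at[j])
            t = start + duration
            stage_free_at[j] = t
    return stage_free_at[-1]


STAGES_MIN = [30, 40, 20]
seq = sequential_time(STAGES_MIN, 4)   # 360 minutes = 6 hours
pipe = pipelined_time(STAGES_MIN, 4)   # 210 minutes = 3.5 hours
```

The pipelined total decomposes as 30 + 4 × 40 + 20: after the first stage fills the pipeline, throughput is set by the slowest stage, one load every 40 minutes.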
51
Lessons from Laundry Analogy
n Pipelining doesn’t help latency of a single task, it helps throughput of the entire workload
n Multiple tasks operating simultaneously
n Potential speedup = number of pipe stages
n Pipeline rate limited by slowest pipeline stage
n Unbalanced lengths of pipe stages reduce speedup
n Time to “fill” pipeline and time to “drain” it reduce speedup
[Figure: pipelined laundry timeline starting at 6 PM. Loads A, B, C, D in task order, overlapping, with intervals of 30, 40, 40, 40, 40, 20 minutes.]
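The effect of fill and drain time on speedup can be quantified for an idealized pipeline. The sketch below assumes perfectly balanced stages of one time unit each and no stalls, so n tasks on a k-stage pipeline take k + n - 1 units instead of n × k; the function name is illustrative.

```python
def ideal_pipeline_speedup(n_tasks, n_stages):
    """Speedup of a balanced, stall-free pipeline over unpipelined execution."""
    unpipelined = n_tasks * n_stages      # each task runs all stages alone
    pipelined = n_stages + n_tasks - 1    # fill once, then one task per unit
    return unpipelined / pipelined
```

With 5 stages, a single task gets no speedup (1.0), four tasks get 20 / 8 = 2.5, and the speedup approaches the number of stages only as the number of tasks grows large.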
52
Another Analogy: Model T Assembly Line
53
Pipelining the Processor
n 5 stages, one clock cycle per stage
n IF: instruction fetch from memory
n ID: instruction decode & register read
n EX: execute operation or calculate address
n MEM: access memory operand
n WB: write result back to register
lw IF RF/ID EX MEM WB
54
Pipelining the Processor
n Overlap instructions in different stages
n All hardware used all the time
n Clock cycle is fast
n CPI is still 1
[Figure: clock waveform over Cycle 1 through Cycle 7, with instructions overlapped in different pipeline stages.]
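The overlap can be visualized by computing which stage each instruction occupies in each cycle. This is a minimal sketch assuming an ideal five-stage pipeline with no stalls, where instruction i enters IF in cycle i; the names `STAGES` and `pipeline_diagram` are illustrative.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_diagram(n_instructions):
    """Return one row per instruction; row[c] is the stage occupied in
    cycle c (0-based), or an empty string if the instruction is not active."""
    n_cycles = n_instructions + len(STAGES) - 1
    rows = []
    for i in range(n_instructions):
        row = [""] * n_cycles
        for s, name in enumerate(STAGES):
            row[i + s] = name  # instruction i is in stage s during cycle i + s
        rows.append(row)
    return rows


# Three instructions occupy seven cycles in total.
for row in pipeline_diagram(3):
    print(" ".join(f"{stage:4}" for stage in row))
```

Each instruction still takes five cycles from fetch to write-back, but once the pipeline is full one instruction completes every cycle, which is why the CPI remains 1 at a much shorter clock period.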
55
To Be Continued
n Pipelined datapath and control
56