
EE108b Lecture 7

Processor Design
Christos Kozyrakis

http://ee180.stanford.edu

EE180 – Winter 2018 – Lecture 07


Announcements
n HW1 due today by 5pm

n Midterm on Tue Feb 13th, 3.30-6.30pm, STLC111


n All lectures up to Lecture 10 included, closed book
n One page of notes, calculator

n Review session on Friday


Fridays 3.30pm – 4.30pm, room 200-030

2
Review: Digital Logic Design
n Key elements
n Basic gates: AND, OR, NOT
n Complex gates: NAND/NOR/XOR/…, n-input AND/OR/…, multiplexors,
adders, ALUs, multi-bit ALUs/multiplexors, …
n Simple state: flip-flops, multi-bit registers
n Memories: synchronous & asynchronous
n Clocking methodology

3
Building a Processor

n Simple or complex

4
Building a Processor
n Generally hardware consists of two parts
n Datapath: the hardware that processes and stores data
n Combinational circuits and state elements
n Control: the hardware that manages the datapath

n Break instruction execution into steps


n Find simple + efficient datapath & control for each step
n Some are obvious, others can be more subtle

n Start simple, optimize performance/energy later


5
Subset of Instructions
n We will focus on a subset of the MIPS instructions
n Memory: lw and sw
n Arithmetic: addu, subu, and, ori, and slt
n Branch: beq and j

n Similar implementations for remaining instructions

6
Simple Processor
n Follow the ISA to the letter
n Execute one instruction to completion before moving to the next one

n Execute each instruction within 1 clock cycle


n CPI = 1

n Needed hardware components


n State: PC, 32-entry register file, memory
n ALU, multiplexors
7
Instruction Steps
1. Fetch instruction from memory
n Address is specified by PC
2. Read one or two registers
3. Do add/sub/.... using an ALU block
4. Fetch a value from memory
5. Store results to register-file/memory
6. Update the PC as needed

8
Initial Processor Datapath

n Major functional units & major connections


n Can you spot the major inconsistency on this diagram?
9
Initial Processor Datapath

n Cannot just join wires together


n These connections will actually require multiplexors
10
Fetching the Instruction
n Not that complex

n Instruction = Memory[PC]
n Fetch the instruction from memory
n Always 32-bits

n Update program counter for next cycle


n What is the address of the next instruction?
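A minimal sketch of the fetch step, assuming a Python dictionary keyed by byte address stands in for instruction memory. The names `fetch` and `imem` are hypothetical; because every instruction is 32 bits, the sequential successor is simply PC + 4.

```python
def fetch(imem, pc):
    """Return (instruction, next_pc) for a byte-addressed PC."""
    assert pc % 4 == 0, "instructions are word aligned"
    instruction = imem[pc]            # Instruction = Memory[PC]
    next_pc = (pc + 4) & 0xFFFFFFFF   # sequential next address, 32-bit wrap
    return instruction, next_pc

# Hypothetical one-instruction program: addu $1, $2, $3
imem = {0x00400000: 0x00430821}
instruction, next_pc = fetch(imem, 0x00400000)
```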

11
Fetching the Instruction

[Figure: instruction fetch datapath. The PC is a 32-bit register that addresses instruction memory; an adder increments it by 4 for the next instruction.]

12
What Did We Fetch?
Format    Fields (width in bits)
R-format  OP=0 (6) | rs (5) | rt (5) | rd (5) | sa (5) | funct (6)
          rs: first source register, rt: second source register,
          rd: result register, sa: shift amount, funct: function code
I-format  OP (6) | rs (5) | rt (5) | imm (16)
          rs: first source register, rt: second source register,
          imm: immediate
J-format  OP (6) | target (26)
          target: jump target address
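The field boundaries above can be sketched as shifts and masks. This is an illustrative decoder, not part of the lecture; `decode` is a hypothetical helper that extracts every field regardless of format, which mirrors the fixed-position decoding the hardware relies on.

```python
def decode(instr):
    """Split a 32-bit MIPS instruction word into all possible fields."""
    return {
        "op":     (instr >> 26) & 0x3F,   # bits 31-26
        "rs":     (instr >> 21) & 0x1F,   # bits 25-21
        "rt":     (instr >> 16) & 0x1F,   # bits 20-16
        "rd":     (instr >> 11) & 0x1F,   # bits 15-11 (R-format)
        "sa":     (instr >> 6) & 0x1F,    # bits 10-6  (R-format)
        "funct":  instr & 0x3F,           # bits 5-0   (R-format)
        "imm":    instr & 0xFFFF,         # bits 15-0  (I-format)
        "target": instr & 0x3FFFFFF,      # bits 25-0  (J-format)
    }

fields = decode(0x00430821)  # addu $1, $2, $3
```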

13
Nice Characteristics of
MIPS ISA
n Instructions are fixed length
n Don’t need to decode first instruction to find next one
n Always add 4 bytes to instruction pointer (PC)

n Register specifiers are always in the same place


n Destination moves around some, but
n Source registers are always in the same place
n Or you don’t need that register
n Can read the registers IN PARALLEL with decoding instruction
n Feed bits directly from the instruction memory
n Fixed field decoding

14
Register-Register Instructions
n In our subset this is only addu and subu
n We do not want to worry about overflow yet…

n Operation
addu rd, rs, rt # R[rd]<-R[rs]+ R[rt]
subu rd, rs, rt # R[rd]<-R[rs]- R[rt];

Bits   6     5   5   5   5   6
       OP=0  rs  rt  rd  sa  funct

rs: first source register, rt: second source register, rd: result register,
sa: shift amount, funct: function code
15
Arithmetic Instructions
n Read two register operands
n Perform arithmetic/logical operation
n Write register result
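The three steps can be sketched in software, assuming a 32-entry Python list models the register file. `exec_rtype` is a hypothetical name; the funct values 0x21 (addu) and 0x23 (subu) are the standard MIPS encodings, and results wrap modulo 2^32 because these instructions ignore overflow.

```python
MASK32 = 0xFFFFFFFF

def exec_rtype(regs, rd, rs, rt, funct):
    """Read two registers, apply the ALU operation, write the result."""
    a, b = regs[rs], regs[rt]          # read two register operands
    if funct == 0x21:                  # addu
        result = (a + b) & MASK32
    elif funct == 0x23:                # subu
        result = (a - b) & MASK32
    else:
        raise ValueError("funct not in this sketch's subset")
    if rd != 0:                        # register $0 is hardwired to zero
        regs[rd] = result

regs = [0] * 32
regs[2], regs[3] = 0xFFFFFFFF, 1
exec_rtype(regs, 1, 2, 3, 0x21)   # addu $1, $2, $3 wraps to 0
exec_rtype(regs, 4, 1, 3, 0x23)   # subu $4, $1, $3 wraps to 0xFFFFFFFF
```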

16
ORI Instruction
n OR immediate instruction

ori rt, rs, imm #R[rt]<-R[rs] OR ZeroExt(imm)

n Need to get instr[15:0] into the datapath

Bits   6   5   5   16
       OP  rs  rt  imm

rs: first source register, rt: second source register, imm: immediate
17
Datapath: ORI Instruction
n Write register is rt or rd based on instruction
n Read data 2 is ignored for immediates
n Immediates can be sign or zero extended
n ALUsrc and ALU operation set based on instruction
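A small sketch of the extender, assuming a one-bit control (called `ext_op` here, a hypothetical name) where 0 selects zero extension and 1 selects sign extension. The `ori` helper shows the zero-extended case.

```python
def extend(imm16, ext_op):
    """Widen a 16-bit immediate to 32 bits (0 = zero extend, 1 = sign extend)."""
    imm16 &= 0xFFFF
    if ext_op and imm16 & 0x8000:
        return imm16 | 0xFFFF0000   # replicate the sign bit into bits 31-16
    return imm16

def ori(rs_val, imm16):
    """R[rt] <- R[rs] OR ZeroExt(imm)."""
    return rs_val | extend(imm16, 0)
```

For example, `extend(0x8000, 1)` gives `0xFFFF8000`, while `extend(0x8000, 0)` leaves the upper half zero.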
[Figure: datapath for ori. Instruction[25–21] and Instruction[20–16] select the two read registers; a mux controlled by RegDst chooses Instruction[20–16] or Instruction[15–11] as the write register; Instruction[15–0] passes through a 16-to-32-bit sign or zero extender; a mux controlled by ALUSrc chooses Read data 2 or the extended immediate as the second ALU input. Control signals shown: RegWrite, ALUOp, ALUSrc, RegDst.]

18
Load Instruction
n Load instruction

lw rt, imm(rs) # Addr <- R[rs]+SignExt(imm)
               # R[rt] <- Mem[Addr]

n Notice this will use the immediate path as well
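A minimal sketch of the load semantics, assuming a dictionary keyed by byte address models data memory and unwritten locations read as 0. The names `sign_ext16`, `lw`, and `dmem` are hypothetical.

```python
def sign_ext16(imm):
    """Interpret a 16-bit immediate as a signed integer."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm

def lw(regs, dmem, rt, rs, imm):
    """Addr <- R[rs] + SignExt(imm); R[rt] <- Mem[Addr]. Returns Addr."""
    addr = (regs[rs] + sign_ext16(imm)) & 0xFFFFFFFF
    if rt != 0:                       # register $0 stays zero
        regs[rt] = dmem.get(addr, 0)
    return addr

regs = [0] * 32
regs[2] = 0x1000
dmem = {0x0FFC: 42}
addr = lw(regs, dmem, 1, 2, 0xFFFC)   # offset -4 from the base in $2
```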


Bits   6   5   5   16
       OP  rs  rt  imm

rs: first source register, rt: second source register, imm: immediate

19
Datapath: Load Instruction
n Immediate is sign extended
n Extender handles either sign or zero extension
n ALU output fed to memory as address
n MUX selects between ALU result and Memory output
[Figure: datapath for lw. The register file, extender, and ALU are connected as for ori; the ALU result drives the data memory Address input, Read data 2 drives the memory Write data input, and a mux controlled by MemtoReg chooses the memory Read data or the ALU result as the register write data. Control signals shown: RegWrite, ALUOp, ALUSrc, RegDst, MemWrite, MemRead, MemtoReg.]

20
Store Instruction
n Store instruction

sw rt, imm(rs) # Addr <- R[rs]+SignExt(imm)
               # Mem[Addr] <- R[rt]

Bits   6   5   5   16
       OP  rs  rt  imm

rs: first source register, rt: second source register, imm: immediate

21
Datapath: Store Instruction
n Memory address calculated just as in lw case
n Read data 2 (the value of register rt) is passed to memory as write data

[Figure: datapath for sw, identical to the lw datapath. The ALU result drives the data memory Address input and Read data 2 drives the memory Write data input. Control signals shown: RegWrite, ALUOp, ALUSrc, RegDst, MemWrite, MemRead, MemtoReg.]

22
Branch Instruction
n Branch instruction: beq rs, rt, immediate

cond <- R[rs] – R[rt]
if (cond eq 0)
    PC <- PC + 4 + SignExt(imm)*4
else
    PC <- PC + 4

Bits   6   5   5   16
       OP  rs  rt  imm

rs: first source register, rt: second source register, imm: immediate

23
The Next PC
n PC is byte-addressed into instruction memory
n Sequential
PC[31:0] = PC[31:0] + 4
n Branch operation
PC[31:0] = PC[31:0] + 4 + SignExt(imm) × 4

n Simplification
n PC is byte addressed, but instructions are 4 bytes long
n Simplify hardware by using 30 bit PC
n Sequential
PC[31:2] = PC[31:2] + 1
n Branch operation
PC[31:2] = PC[31:2] + 1 + SignExt(imm)
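The equivalence of the two formulations can be checked with a short sketch. The function names are hypothetical; `taken` stands for the branch condition, and the 30-bit version works on PC[31:2], so shifting its result left by 2 should reproduce the byte-addressed result.

```python
def sign_ext16(imm):
    """Interpret a 16-bit immediate as a signed integer."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm

def next_pc_byte(pc, imm, taken):
    """32-bit byte-addressed PC: PC + 4 (+ SignExt(imm) * 4 if taken)."""
    offset = sign_ext16(imm) * 4 if taken else 0
    return (pc + 4 + offset) & 0xFFFFFFFF

def next_pc_word(pc30, imm, taken):
    """30-bit word-addressed PC[31:2]: PC + 1 (+ SignExt(imm) if taken)."""
    offset = sign_ext16(imm) if taken else 0
    return (pc30 + 1 + offset) & 0x3FFFFFFF
```

The 30-bit form drops the multiply-by-4 shifter and two adder bits, which is the hardware saving the slide refers to.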

24
Datapath for the PC
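[Figure: 30-bit PC datapath. The PC, with 00 appended as the two low bits, drives the instruction memory Read address. One adder computes PC + 1; a second adder adds the sign-extended Instruction[15–0] (16 to 30 bits). A mux selected by Branch and Zero chooses between the two results as the next PC.]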

25
Jump Instruction
n Jump instruction

j target # PC[31:2]<-PC[31:28] || target[25:0]

Bits   6   26
       OP  target

target: jump target address
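A sketch of the pseudodirect target computation in byte-address form, assuming the upper four bits come from the current PC as the datapath diagram labels them (PC[31–28]). `jump_target` is a hypothetical helper name.

```python
def jump_target(pc, target26):
    """Concatenate PC[31:28], the 26-bit target field, and two zero bits."""
    upper = pc & 0xF0000000                 # PC[31:28]
    lower = (target26 & 0x3FFFFFF) << 2     # target[25:0] followed by 00
    return upper | lower
```

Four PC bits plus 26 target bits give the 30-bit word address; the two appended zeros restore byte addressing.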

26
Datapath for the PC
n MUX selects pseudodirect jump target
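[Figure: 32-bit next-PC datapath. One adder computes PC + 4; a second adder adds the branch offset shifted left by 2. A mux selected by Branch and Zero chooses between them, and a second mux selected by Jump chooses the pseudodirect target formed from PC[31–28] and Instruction[25–0] shifted left by 2. The PC drives the instruction memory Read address.]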

27
Putting it All Together:
Our First Processor (Datapath)
[Figure: complete single-cycle datapath. It combines the next-PC logic (PC + 4 adder, branch target adder with shift left 2, Branch mux, Jump mux), instruction memory, register file with RegDst mux, sign extender, ALU with ALUSrc mux, and data memory with MemtoReg mux. Control signals shown: Jump, Branch, RegWrite, ALUOp, MemWrite, ALUSrc, MemtoReg, RegDst, MemRead. The ALU comparison output is labeled Eq.]

28
Control
n State free
n Every instruction takes a single cycle
n Just decode instruction bits

n First part of cycle does not have any control


n Which is good, since we don’t have instruction yet

n There are also few control points


n Control on the multiplexers
n Operation type for the ALU
n Write control on the register file & data memory
29
Control: Instruction Fetch
[Figure: complete datapath during instruction fetch. The instruction is not yet available, so every control signal (Jump, Branch, RegWrite, ALUOp, MemWrite, ALUSrc, MemtoReg, RegDst, ExtOp, MemRead) is marked <prev>.]

30
Control: addu
[Figure: complete datapath annotated with control values for addu. Jump = 0, Branch = 0, RegWrite = 1, ALUOp = <op>, MemWrite = 0, ALUSrc = 0, MemtoReg = 0, RegDst = 1, ExtOp = X, MemRead = 0.]

31
Control: Next PC
[Figure: complete datapath annotated with the same control values as for addu, highlighting the next-PC path. Jump = 0, Branch = 0 (with an X, don't care, marked beside it), RegWrite = 1, ALUOp = <op>, MemWrite = 0, ALUSrc = 0, MemtoReg = 0, RegDst = 1, ExtOp = X, MemRead = 0.]

32
Control: ori
[Figure: complete datapath annotated with control values for ori. Jump = 0, Branch = 0, RegWrite = 1, ALUOp = Or, MemWrite = 0, ALUSrc = 1, MemtoReg = 0, RegDst = 0, ExtOp = 0, MemRead = 0.]

33
Control: Load
[Figure: complete datapath annotated with control values for lw. Jump = 0, Branch = 0, RegWrite = 1, ALUOp = Add, MemWrite = 0, ALUSrc = 1, MemtoReg = 1, RegDst = 0, ExtOp = 1, MemRead = 1.]

34
Control: Store
[Figure: complete datapath annotated with control values for sw. Jump = 0, Branch = 0, RegWrite = 0, ALUOp = Add, MemWrite = 1, ALUSrc = 1, MemtoReg = X, RegDst = X, ExtOp = 1, MemRead = 0.]

35
Control: Branch
[Figure: complete datapath annotated with control values for beq. Jump = 0, Branch = 1 (with a second 1 marked beside it), RegWrite = 0, ALUOp = Sub, MemWrite = 0, ALUSrc = 0, MemtoReg = X, RegDst = X, ExtOp = 1, MemRead = 0.]

36
Control: Jump
[Figure: complete datapath annotated with control values for j. Jump = 1, Branch = 0, RegWrite = 0, ALUOp = X, MemWrite = 0, ALUSrc = X, MemtoReg = X, RegDst = X, ExtOp = X, MemRead = 0.]

37
Control Signals

             add      sub      ori      lw       sw       beq      jump
func         10 0000  10 0010  (not important for the remaining columns)
op           00 0000  00 0000  00 1101  10 0011  10 1011  00 0100  00 0010
RegDst       1        1        0        0        x        x        x
ALUSrc       0        0        1        1        1        0        x
MemtoReg     0        0        0        1        x        x        x
RegWrite     1        1        1        1        0        0        0
MemWrite     0        0        0        0        1        0        0
Branch       0        0        0        0        0        1        0
Jump         0        0        0        0        0        0        1
ExtOp        x        x        0        1        1        x        x
ALUctr<2:0>  Add      Sub      Or       Add      Add      Sub      xxx

38
Turning Control
Tables to Gates
n What is the logical equation for

n The RegDst signal?

n The ALUSrc signal?
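One possible answer, sketched under the assumption that every don't care (x) in the control table is resolved to 0: RegDst is 1 only for R-type opcodes, and ALUSrc is 1 for ori, lw, and sw. In hardware each comparison is a 6-input AND of the opcode bits or their complements; the function names here are hypothetical.

```python
OP_RTYPE, OP_ORI, OP_LW, OP_SW = 0b000000, 0b001101, 0b100011, 0b101011

def reg_dst(op):
    """RegDst = (op == R-type), with don't cares chosen as 0."""
    return int(op == OP_RTYPE)

def alu_src(op):
    """ALUSrc = ori OR lw OR sw, with don't cares chosen as 0."""
    return int(op == OP_ORI or op == OP_LW or op == OP_SW)
```

Choosing the don't cares differently can shrink the gate count, which is why the table leaves them open.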

39
Timing for MemWrite & RegWrite
n How quickly should the MemWrite signal go to 1?
n How would you implement this?

40
Multilevel Decoding
n You can have a single decoder block OR

n Since only the ALU needs the func field


n Pass it to the ALU unit, and have a local decoder there

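[Figure: two-level decoding. The 6-bit op field feeds Main Control, which produces an N-bit ALUop; ALUop and the 6-bit func field feed the local ALU Control, which produces the 3-bit ALUctr that drives the ALU.]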
41
Multilevel Decoding
            R-type    ori      lw       sw       beq       jump
op          00 0000   00 1101  10 0011  10 1011  00 0100   00 0010
RegDst      1         0        0        x        x         x
ALUSrc      0         1        1        1        0         x
MemtoReg    0         0        1        x        x         x
RegWrite    1         1        1        0        0         0
MemWrite    0         0        0        1        0         0
Branch      0         0        0        0        1         0
Jump        0         0        0        0        0         1
ExtOp       x         0        1        1        x         x
ALUop<N:0>  “R-type”  Or       Add      Add      Subtract  xxx

n Control signals for the main control block
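The two-level scheme can be sketched as a table lookup followed by a local decode. This is an illustration only: the dictionary layout and the names `MAIN_CONTROL`, `FUNCT`, and `alu_ctr` are assumptions, and the funct codes are the two listed in the control-signal table (10 0000 for add, 10 0010 for sub).

```python
X = "x"  # don't care

# op -> (RegDst, ALUSrc, MemtoReg, RegWrite, MemWrite, Branch, Jump, ExtOp, ALUop)
MAIN_CONTROL = {
    0b000000: (1, 0, 0, 1, 0, 0, 0, X, "rtype"),  # R-type
    0b001101: (0, 1, 0, 1, 0, 0, 0, 0, "or"),     # ori
    0b100011: (0, 1, 1, 1, 0, 0, 0, 1, "add"),    # lw
    0b101011: (X, 1, X, 0, 1, 0, 0, 1, "add"),    # sw
    0b000100: (X, 0, X, 0, 0, 1, 0, X, "sub"),    # beq
    0b000010: (X, X, X, 0, 0, 0, 1, X, X),        # jump
}

# Local ALU decoder: consulted only when the main block says "rtype".
FUNCT = {0b100000: "add", 0b100010: "sub"}

def alu_ctr(op, funct):
    """Return the ALU operation chosen by main control plus local decode."""
    alu_op = MAIN_CONTROL[op][-1]
    return FUNCT[funct] if alu_op == "rtype" else alu_op
```

Only the local decoder looks at funct, so the main block stays a function of the 6 opcode bits alone.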


42
Putting It All Together:
Our First Processor
[Figure: complete single-cycle processor. A Control block decodes Instruction[31–26] and drives RegDst, Jump, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc, and RegWrite. A separate ALU control block takes ALUOp and Instruction[5–0] and selects the ALU operation. The datapath (PC, instruction memory, register file, sign extender, ALU, data memory, next-PC adders and muxes) is as on the earlier datapath slide.]

43
Single Cycle Processor
Performance
n Functional unit delay
n Memory: 200ps
n ALU and adders: 200ps
n Register file: 100 ps

Delays in ps:

Instruction  Instruction  Register  ALU        Data    Register  Total
class        memory       read      operation  memory  write
R-type       200          100       200        -       100       600
load         200          100       200        200     100       800
store        200          100       200        200     -         700
branch       200          100       200        -       -         500
jump         200          -         -          -       -         200

n CPU clock cycle = 800 ps = 0.8ns (1.25GHz)
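The cycle-time figure can be reproduced with a short calculation: sum the unit delays each instruction class uses and take the maximum. The dictionary names are hypothetical; the delays are the ones listed on this slide.

```python
UNIT_PS = {"imem": 200, "reg_read": 100, "alu": 200, "dmem": 200, "reg_write": 100}

# Functional units on the critical path of each instruction class.
PATH = {
    "R-type": ["imem", "reg_read", "alu", "reg_write"],
    "load":   ["imem", "reg_read", "alu", "dmem", "reg_write"],
    "store":  ["imem", "reg_read", "alu", "dmem"],
    "branch": ["imem", "reg_read", "alu"],
    "jump":   ["imem"],
}

total_ps = {name: sum(UNIT_PS[u] for u in units) for name, units in PATH.items()}
cycle_ps = max(total_ps.values())   # the clock must fit the slowest class
freq_ghz = 1000 / cycle_ps          # 1000 ps per ns
```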


45
Single Cycle MIPS Processor
n Pros
n Single cycle per instruction makes logic simple

n Cons
n Cycle time is the worst case path → long cycle times
n Worst case = load
n Hardware is underutilized
n ALU and memory used only for a fraction of clock cycle
n Not well amortized!
n Best possible CPI is 1
46
Variable Clock Single Cycle
Processor Performance
n Instruction mix
n 45% ALU
n 25% loads
n 10% stores
n 15% branches
n 5% jumps

Delays in ps:

Instruction  Instruction  Register  ALU        Data    Register  Total
class        memory       read      operation  memory  write
R-type       200          100       200        -       100       600
load         200          100       200        200     100       800
store        200          100       200        200     -         700
branch       200          100       200        -       -         500
jump         200          -         -          -       -         200

n CPU clock cycle = 0.6×45% + 0.8×25% + 0.7×10% + 0.5×15% + 0.2×5%
= 0.625 ns (1.6GHz)
n Difficult to implement
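The weighted average above can be checked with a few lines. The dictionary names are hypothetical; the per-class totals and the instruction mix are the values given on this slide.

```python
TOTAL_PS = {"R-type": 600, "load": 800, "store": 700, "branch": 500, "jump": 200}
MIX_PCT = {"R-type": 45, "load": 25, "store": 10, "branch": 15, "jump": 5}

# Average cycle time = sum over classes of (delay x fraction of instructions).
avg_ps = sum(TOTAL_PS[k] * MIX_PCT[k] for k in TOTAL_PS) / 100
freq_ghz = 1000 / avg_ps
speedup_vs_fixed = max(TOTAL_PS.values()) / avg_ps
```

Compared with the fixed 800 ps clock, the variable clock would be 800 / 625 = 1.28 times faster on this mix.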

47
Key Tools for System Architects
1. Pipelining
2. Parallelism
3. Out-of-order execution
4. Prediction
5. Caching
6. Indirection
7. Amortization
8. Redundancy
9. Specialization
10. Focus on the common case

48
Pipelining: The Laundry Analogy
n Ann, Brian, Cathy, Dave doing laundry

n Washer takes 30 minutes A B C D

n Dryer takes 40 minutes

n “Folding bench” takes 20 minutes

49
Single-cycle Laundry
[Figure: timeline from 6 PM to midnight. The loads run one after another in task order, each taking 30 + 40 + 20 minutes.]

Single-cycle laundry takes 6 hours for 4 loads


50
Pipelined Laundry
[Figure: timeline from 6 PM to midnight. Loads A, B, C, and D overlap in task order; the intervals along the time axis are 30, 40, 40, 40, 40, and 20 minutes.]
Pipelined laundry takes 3.5 hours for 4 loads
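Both laundry totals can be reproduced with a small simulation, assuming each station handles one load at a time and a load moves on as soon as the next station is free. The function names are hypothetical.

```python
def sequential_minutes(stage_times, loads):
    """Each load finishes all stages before the next load starts."""
    return loads * sum(stage_times)

def pipelined_minutes(stage_times, loads):
    """Loads overlap; each stage serves one load at a time, in order."""
    free_at = [0] * len(stage_times)   # time at which each stage becomes free
    for _ in range(loads):
        t = 0                          # time this load is ready for the next stage
        for s, duration in enumerate(stage_times):
            t = max(t, free_at[s]) + duration
            free_at[s] = t
    return free_at[-1]

STAGES = [30, 40, 20]   # washer, dryer, folding bench (minutes)
```

With four loads this gives 360 minutes (6 hours) sequentially and 210 minutes (3.5 hours) pipelined; the 40-minute dryer sets the steady-state rate.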
51
Lessons from Laundry Analogy
[Figure: pipelined laundry timeline starting at 6 PM for loads A, B, C, and D, with intervals of 30, 40, 40, 40, 40, and 20 minutes.]
n Pipelining doesn’t help latency of a single task, it helps throughput of the entire workload
n Multiple tasks operating simultaneously
n Potential speedup = number of pipe stages
n Pipeline rate limited by slowest pipeline stage
n Unbalanced lengths of pipe stages reduce speedup
n Time to “fill” pipeline and time to “drain” it reduce speedup

52
Another Analogy:
Model T Assembly Line

53
Pipelining the Processor
n 5 stages, one clock cycle per stage
n IF: instruction fetch from memory
n ID: instruction decode & register read
n EX: execute operation or calculate address
n MEM: access memory operand
n WB: write result back to register

     Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5
lw   IF       RF/ID    EX       MEM      WB

54
Pipelining the Processor
n Overlap instructions in different stages
n All hardware used all the time
n Clock cycle is fast
n CPI is still 1
Clock   Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  Cycle 7
1st lw  IF       RF/ID    EX       MEM      WB
2nd lw           IF       RF/ID    EX       MEM      WB
3rd lw                    IF       RF/ID    EX       MEM      WB
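The overlap in the diagram can be sketched as a schedule, assuming an ideal pipeline with no stalls where instruction i enters IF in cycle i + 1. The names `schedule` and `total_cycles` are hypothetical.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def schedule(n):
    """For each of n instructions, map cycle number -> stage occupied."""
    return [{i + s + 1: STAGES[s] for s in range(len(STAGES))} for i in range(n)]

def total_cycles(n):
    """Fill the pipeline once, then finish one instruction per cycle."""
    return len(STAGES) + n - 1

plan = schedule(3)   # three back-to-back lw instructions
```

Three instructions take 7 cycles instead of 15; as the instruction count grows, the cost approaches one cycle per instruction, which is why CPI stays at 1 while the clock gets faster.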

55
To Be Continued
n Pipelined datapath and control

n Pipeline dependencies, hazards, and stalls

n The limits of pipelining

56
