Lect 07 Processordesign PDF

EE108b Lecture 7
Processor Design
Christos Kozyrakis
http://ee180.stanford.edu
EE180– Winter 2018 – Lecture 07

Announcements
n HW1 due today by 5pm
n Midterm on Tue Feb 13th, 3.30-6.30pm, STLC111

n All lectures included up to Lecture 10 , closed books
n One page of notes, calculator
n Review session on Friday

Fridays 3.30pm – 4.30pm, room 200-030
2
Review: Digital Logic Design
n Key elements
n Basic gates: AND, OR, NOT
n Complex gates: multiplexors, NAND/NOR/XOR/…, n-input
AND/OR/…, adders, ALUs, multi-bit ALUs/multiplexors, …
n Simple state: flip-flops, multi-bit registers
n Memories: synchronous & asynchronous
n Clocking methodology
3
Building a Processor
n Simple or complex
4
Building a Processor
n Generally hardware consists of two parts
n Datapath: the hardware that processes and stores data
n Combinational circuits and state elements
n Control: the hardware that manages the datapath
n Break instruction execution into steps

n Find simple + efficient datapath & control for each step
n Some are obvious, others can be more subtle
n Start simple, optimize performance/energy later

5
Subset of Instructions
n We will focus on a subset of the MIPS instructions
n Memory: lw and sw
n Arithmetic: addu, subu, and, ori, and slt
n Branch: beq and j
n Similar implementations for remaining instructions
6
Simple Processor
n Follow the ISA to the letter
n Execute one instruction to completion before moving
to the next one
n Execute each instruction within 1 clock cycle

n CPI = 1
n Needed hardware components

n State: PC, 32-entry register file, memory
n ALU, multiplexors
7
Instruction Steps
1. Fetch instruction from memory
n Address is specified by PC
2. Read one or two registers
3. Do add/sub/.... using an ALU block
4. Fetch a value from memory
5. Store results to register-file/memory
6. Update the PC as needed
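The six steps above can be sketched as a tiny software model. This is only an illustration, not the hardware design from the lecture: the names step, r_type, i_type, and sign16 are hypothetical, memories are modeled as Python dicts, and the opcode and func values are the ones listed in the Control Signals table later in these slides.

```python
MASK32 = 0xFFFFFFFF

def sign16(x):
    """Interpret the low 16 bits of x as a signed value."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def r_type(rs, rt, rd, funct):
    """Hypothetical helper: assemble an R-format word (OP=0, sa=0)."""
    return (rs << 21) | (rt << 16) | (rd << 11) | funct

def i_type(op, rs, rt, imm):
    """Hypothetical helper: assemble an I-format word."""
    return (op << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)

def step(pc, regs, dmem, imem):
    """Run one instruction to completion and return the next PC."""
    instr = imem[pc]                                  # 1. fetch; address is the PC
    op = instr >> 26
    rs, rt, rd = (instr >> 21) & 31, (instr >> 16) & 31, (instr >> 11) & 31
    funct, imm = instr & 0x3F, instr & 0xFFFF
    a, b = regs[rs], regs[rt]                         # 2. read one or two registers
    next_pc = (pc + 4) & MASK32
    if op == 0x00 and funct == 0x20:                  # R-type, func 10 0000 (add)
        regs[rd] = (a + b) & MASK32                   # 3. ALU, then 5. write register
    elif op == 0x00 and funct == 0x22:                # R-type, func 10 0010 (sub)
        regs[rd] = (a - b) & MASK32
    elif op == 0x0D:                                  # ori: zero-extended immediate
        regs[rt] = a | imm
    elif op == 0x23:                                  # lw: 4. fetch a value from memory
        regs[rt] = dmem.get((a + sign16(imm)) & MASK32, 0)
    elif op == 0x2B:                                  # sw: 5. store result to memory
        dmem[(a + sign16(imm)) & MASK32] = b
    elif op == 0x04:                                  # beq
        if a == b:
            next_pc = (next_pc + (sign16(imm) << 2)) & MASK32
    elif op == 0x02:                                  # j
        next_pc = (pc & 0xF0000000) | ((instr & 0x3FFFFFF) << 2)
    else:
        raise ValueError(f"instruction {instr:#010x} is outside the subset")
    regs[0] = 0                                       # register 0 always reads as zero in MIPS
    return next_pc                                    # 6. update the PC

imem = {
    0x00: i_type(0x0D, 0, 1, 5),     # ori $1, $0, 5
    0x04: i_type(0x0D, 0, 2, 7),     # ori $2, $0, 7
    0x08: r_type(1, 2, 3, 0x20),     # $3 = $1 + $2
    0x0C: i_type(0x2B, 0, 3, 16),    # sw: Mem[16] <- $3
    0x10: i_type(0x23, 0, 4, 16),    # lw: $4 <- Mem[16]
}
regs, dmem, pc = [0] * 32, {}, 0
while pc in imem:                    # stop when the PC leaves the tiny program
    pc = step(pc, regs, dmem, imem)
```

Each call to step plays the role of one clock cycle of the simple processor: the whole instruction completes before the next one starts.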
8
Initial Processor Datapath
n Major functional units & major connections

n Can you spot the major inconsistency on this diagram?
9
Initial Processor Datapath
n Cannot just join wires together

n These connections will actually require multiplexors
10
Fetching the Instruction
n Not that complex
n Instruction = Memory[PC]
n Fetch the instruction from memory
n Always 32-bits
n Update program counter for next cycle

n What is the address of the next instruction?
11
Fetching the Instruction
[Figure: instruction fetch. The PC, a 32-bit register, addresses the instruction memory; an adder increments the PC by 4 for the next instruction.]
12
What Did We Fetch?
R-format
Bits    6     5    5    5    5    6
Field   OP=0  rs   rt   rd   sa   funct
rs: first source register, rt: second source register, rd: result register, sa: shift amount, funct: function code

I-format
Bits    6     5    5    16
Field   OP    rs   rt   imm
rs: first source register, rt: second source register, imm: immediate

J-format
Bits    6     26
Field   OP    target
target: jump target address
13
Nice Characteristics of
MIPS ISA
n Instructions are fixed length
n Don’t need to decode first instruction to find next one
n Always add 4 bytes to instruction pointer (PC)
n Register specifiers are always in the same place

n Destination moves around some, but
n Source registers are always in the same place
n Or you don’t need that register
n Can read the registers IN PARALLEL with decoding instruction
n Feed bits directly from the instruction memory
n Fixed field decoding
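As a rough illustration of fixed field decoding, the sketch below slices every field out of a 32-bit word with shifts and masks, without first looking at the opcode. The function name decode_fields and the dictionary layout are assumptions made for this example; the bit positions follow the three formats on the previous slide.

```python
def decode_fields(instr):
    """Split a 32-bit instruction word into its fixed-position fields."""
    return {
        "op":     (instr >> 26) & 0x3F,    # bits 31..26
        "rs":     (instr >> 21) & 0x1F,    # bits 25..21
        "rt":     (instr >> 16) & 0x1F,    # bits 20..16
        "rd":     (instr >> 11) & 0x1F,    # bits 15..11 (R-format)
        "sa":     (instr >> 6) & 0x1F,     # bits 10..6  (R-format)
        "funct":  instr & 0x3F,            # bits 5..0   (R-format)
        "imm":    instr & 0xFFFF,          # bits 15..0  (I-format)
        "target": instr & 0x3FFFFFF,       # bits 25..0  (J-format)
    }

# R-format word with rs=1, rt=2, rd=3, sa=0 and func 10 0000
word = (0 << 26) | (1 << 21) | (2 << 16) | (3 << 11) | (0 << 6) | 0x20
fields = decode_fields(word)
```

Because rs and rt sit in the same bits in every format that uses them, the register file can be read with fields["rs"] and fields["rt"] before the opcode is known; fields that do not apply to the instruction are simply ignored.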
14
Register-Register Instructions
n In our subset this is only addu and subu
n We do not want to worry about overflow yet…
n Operation
addu rd, rs, rt # R[rd]<-R[rs]+ R[rt]
subu rd, rs, rt # R[rd]<-R[rs]- R[rt];
Bits    6     5    5    5    5    6
Field   OP=0  rs   rt   rd   sa   funct
rs: first source register, rt: second source register, rd: result register, sa: shift amount, funct: function code
15
Arithmetic Instructions
n Read two register operands
n Perform arithmetic/logical operation
n Write register result
16
ORI Instruction
n OR immediate instruction
ori rt, rs, imm #R[rt]<-R[rs] OR ZeroExt(imm)
n Need to get instr[15:0] into the datapath
Bits    6     5    5    16
Field   OP    rs   rt   imm
rs: first source register, rt: second source register, imm: immediate
17
Datapath: ORI Instruction
n Write register is rt or rd based on instruction
n Read data 2 is ignored for immediates
n Immediates can be sign or zero extended
n ALUsrc and ALU operation set based on instruction
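[Figure: register file and ALU datapath for ori. Instruction [25–21] and Instruction [20–16] drive Read register 1 and Read register 2. A mux controlled by RegDst selects the write register: Instruction [20–16] (input 0) or Instruction [15–11] (input 1). Instruction [15–0] passes through a sign or zero extender (16 to 32 bits). A mux controlled by ALUSrc selects the second ALU input: Read data 2 (input 0) or the extended immediate (input 1). The ALU produces Zero and ALU result; the result returns to the register file as Write data. Control signals: RegWrite, RegDst, ALUSrc, ALUOp.]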
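The two extension modes can be sketched in a few lines. The function names are illustrative; the convention that an ExtOp of 0 means zero extension and 1 means sign extension is taken from the Control Signals table later in the slides (ori uses 0, lw and sw use 1).

```python
def zero_extend16(imm):
    """Pad a 16-bit immediate with zeros to 32 bits."""
    return imm & 0xFFFF

def sign_extend16(imm):
    """Copy bit 15 of a 16-bit immediate into bits 31..16."""
    imm &= 0xFFFF
    return (imm | 0xFFFF0000) if (imm & 0x8000) else imm

def extend(imm, ext_op):
    """Model of the extender block: ext_op 0 = zero extend, 1 = sign extend."""
    return sign_extend16(imm) if ext_op else zero_extend16(imm)
```

For example, the immediate 0x8000 becomes 0x00008000 for ori but 0xFFFF8000 for lw, which is why the extender needs its own control signal.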
18
Load Instruction
n Load instruction
lw rt, rs, imm # Addr <- R[rs]+SignExt(imm)

# R[rt] <- Mem[Addr];
n Notice this will use the immediate path as well

Bits    6     5    5    16
Field   OP    rs   rt   imm
rs: first source register, rt: second source register, imm: immediate
19
Datapath: Load Instruction
n Immediate is sign extended
n Extender handles either sign or zero extension
n ALU output fed to memory as address
n MUX selects between ALU result and Memory output
[Figure: datapath for lw. The register file and ALU are connected as for ori, with the 16-bit immediate (Instruction [15–0]) sign extended to 32 bits. The ALU result drives the Address input of the data memory, and Read data 2 drives its Write data input. A mux controlled by MemtoReg selects the register Write data: the ALU result (input 0) or the memory Read data (input 1). Control signals: RegWrite, RegDst, ALUSrc, ALUOp, MemRead, MemWrite, MemtoReg.]
20
Store Instruction
n Store instruction
sw rt, rs, imm # Addr <- R[rs]+SignExt(imm)

# Mem[Addr] <- R[rt]
Bits    6     5    5    16
Field   OP    rs   rt   imm
rs: first source register, rt: second source register, imm: immediate
21
Datapath: Store Instruction
n Memory address calculated just as in lw case
n Read Register 2 is passed to Memory as data
[Figure: datapath for sw, the same hardware as for lw. The ALU result drives the data memory Address and Read data 2 drives the data memory Write data. Control signals: RegWrite, RegDst, ALUSrc, ALUOp, MemRead, MemWrite, MemtoReg.]
22
Branch Instruction
n Branch instruction: beq rs, rt, immediate
Cond <- R[rs] – R[rt]
if (Cond eq 0)
    PC <- PC + 4 + SignExt(imm)*4
else
    PC <- PC + 4;
Bits    6     5    5    16
Field   OP    rs   rt   imm
rs: first source register, rt: second source register, imm: immediate
23
The Next PC
n PC is byte-addressed into instruction memory
n Sequential
PC[31:0] = PC[31:0] + 4
n Branch operation
PC[31:0] = PC[31:0] + 4 + SignExt(imm) × 4
n Simplification
n PC is byte addressed, but instructions are 4 bytes long
n Simplify hardware by using 30 bit PC
n Sequential
PC[31:2] = PC[31:2] + 1
n Branch operation
PC[31:2] = PC[31:2] + 1 + SignExt(imm)
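A small sketch can confirm that the 30-bit simplification computes the same addresses as the byte-addressed version. The function names and the taken flag (standing for "this is a beq and the operands are equal") are assumptions made for the example.

```python
def to_signed16(imm):
    """Interpret a 16-bit immediate as a signed value."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm

def next_pc_bytes(pc, imm, taken):
    """Byte-addressed form: PC + 4, plus SignExt(imm) * 4 on a taken branch."""
    nxt = pc + 4
    if taken:
        nxt += to_signed16(imm) * 4
    return nxt & 0xFFFFFFFF

def next_pc_words(pc30, imm, taken):
    """30-bit form operating on PC[31:2]: PC + 1, plus SignExt(imm) on a taken branch."""
    nxt = pc30 + 1
    if taken:
        nxt += to_signed16(imm)
    return nxt & 0x3FFFFFFF

pc = 0x00400010
backward = 0xFFFE                      # immediate of -2 words
by_bytes = next_pc_bytes(pc, backward, True)
by_words = next_pc_words(pc >> 2, backward, True) << 2
```

Both forms give the same next address; the 30-bit version just never stores or adds the two low bits, which are always 00.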
24
Datapath for the PC
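[Figure: datapath for a 30-bit PC. One adder computes PC + 1; a second adder adds the immediate from Instruction [15–0], extended from 16 to 30 bits. A mux controlled by Branch and the ALU Zero output selects the next PC. The PC with 00 appended forms the Read address of the instruction memory, which outputs Instruction [31–0].]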
25
Jump Instruction
n Jump instruction
j target # PC[31:2]<-PC[31:28] || target[25:0]
Bits    6     26
Field   OP    target
target: jump target address
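The pseudodirect target can be sketched as a concatenation of the upper four PC bits, the 26-bit target field, and two zero bits. The function name is illustrative, and this sketch takes the upper bits from the current PC as the slides write it; an implementation may take them from PC + 4 instead, which only differs at a 256 MB region boundary.

```python
def jump_target(pc, target26):
    """Form the 32-bit jump address: PC[31:28] || target[25:0] || 00."""
    return (pc & 0xF0000000) | ((target26 & 0x3FFFFFF) << 2)

addr_low = jump_target(0x00400010, 0x0100003)    # stays in the low 256 MB region
addr_high = jump_target(0x90000000, 0x0000001)   # upper four bits 1001 are kept
```

Four PC bits plus 26 target bits give the 30 bits of PC[31:2], so a jump can reach any instruction within the current 256 MB region.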
26
Datapath for the PC
n MUX selects pseudodirect jump target
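[Figure: datapath for a 32-bit PC with jump support. One adder computes PC + 4; a second adder adds the immediate shifted left by 2. A first mux, controlled by Branch and Zero, selects between PC + 4 and the branch target. A second mux, controlled by Jump, selects between that result and the jump address formed from PC [31–28], Instruction [25–0], and 00. The PC drives the Read address of the instruction memory, which outputs Instruction [31–0].]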
27
Putting it All Together:
Our First Processor (Datapath)
[Figure: complete single-cycle datapath. The next-PC logic (PC + 4 adder, branch adder with shift left 2, Branch and Jump muxes) is combined with the instruction memory, register file, sign extender (16 to 32 bits), ALU, and data memory. Control signals shown: Jump, Branch, RegWrite, ALUOp, RegDst, MemRead.]
28
Control
n State free
n Every instruction takes a single cycle
n Just decode instruction bits
n First part of cycle does not have any control

n Which is good, since we don’t have instruction yet
n There are also few control points

n Control on the multiplexers
n Operation type for the ALU
n Write control on the register file & data memory
29
Control: Instruction Fetch
[Figure: complete single-cycle datapath during instruction fetch. Every control signal (Jump, Branch, RegWrite, ALUOp, MemWrite, ALUSrc, MemtoReg, RegDst, ExtOp, MemRead) is marked "prev": it still holds its previous value because the new instruction has not been read yet.]
30
Control: addu
[Figure: complete single-cycle datapath annotated with the control values for addu: Jump = 0, Branch = 0, RegWrite = 1, ALUOp = the operation given by the instruction (marked "op" on the slide), MemWrite = 0, ALUSrc = 0, MemtoReg = 0, RegDst = 1, ExtOp = X, MemRead = 0.]
31
Control: Next PC
[Figure: the same datapath and control values as for addu, highlighting the next-PC logic: Jump = 0 and Branch = 0, so the Zero output is a don't-care (X) and the next PC is PC + 4.]
32
Control: ori
[Figure: complete single-cycle datapath annotated with the control values for ori: Jump = 0, Branch = 0, RegWrite = 1, ALUOp = Or, MemWrite = 0, ALUSrc = 1, MemtoReg = 0, RegDst = 0, ExtOp = 0, MemRead = 0.]
33
Control: Load
[Figure: complete single-cycle datapath annotated with the control values for lw: Jump = 0, Branch = 0, RegWrite = 1, ALUOp = Add, MemWrite = 0, ALUSrc = 1, MemtoReg = 1, RegDst = 0, ExtOp = 1, MemRead = 1.]
34
Control: Store
[Figure: complete single-cycle datapath annotated with the control values for sw: Jump = 0, Branch = 0, RegWrite = 0, ALUOp = Add, MemWrite = 1, ALUSrc = 1, MemtoReg = X, RegDst = X, ExtOp = 1, MemRead = 0.]
35
Control: Branch
[Figure: complete single-cycle datapath annotated with the control values for beq: Jump = 0, Branch = 1, RegWrite = 0, ALUOp = Sub, MemWrite = 0, ALUSrc = 0, MemtoReg = X, RegDst = X, MemRead = 0. The branch is taken when the ALU Zero output is 1.]
36
Control: Jump
[Figure: complete single-cycle datapath annotated with the control values for j: Jump = 1, Branch = 0, RegWrite = 0, ALUOp = X, MemWrite = 0, ALUSrc = X, MemtoReg = X, RegDst = X, ExtOp = X, MemRead = 0.]
37
Control Signals
             add      sub      ori      lw       sw       beq      jump
func         10 0000  10 0010  (not important for the remaining instructions)
op           00 0000  00 0000  00 1101  10 0011  10 1011  00 0100  00 0010
RegDst       1        1        0        0        x        x        x
ALUSrc       0        0        1        1        1        0        x
MemtoReg     0        0        0        1        x        x        x
RegWrite     1        1        1        1        0        0        0
MemWrite     0        0        0        0        1        0        0
Branch       0        0        0        0        0        1        0
Jump         0        0        0        0        0        0        1
ExtOp        x        x        0        1        1        x        x
ALUctr<2:0>  Add      Sub      Or       Add      Add      Sub      xxx
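The table can be read as a lookup from (op, func) to a set of control values. The sketch below encodes it that way; the names SIGNALS, ROWS, and control_for are assumptions for this example, and None stands for a don't-care (x) entry.

```python
SIGNALS = ("RegDst", "ALUSrc", "MemtoReg", "RegWrite", "MemWrite",
           "Branch", "Jump", "ExtOp", "ALUctr")
X = None  # don't-care

# One row per instruction, in the same signal order as the table.
ROWS = {
    "add":  (1, 0, 0, 1, 0, 0, 0, X, "Add"),
    "sub":  (1, 0, 0, 1, 0, 0, 0, X, "Sub"),
    "ori":  (0, 1, 0, 1, 0, 0, 0, 0, "Or"),
    "lw":   (0, 1, 1, 1, 0, 0, 0, 1, "Add"),
    "sw":   (X, 1, X, 0, 1, 0, 0, 1, "Add"),
    "beq":  (X, 0, X, 0, 0, 1, 0, X, "Sub"),
    "jump": (X, X, X, 0, 0, 0, 1, X, X),
}
OPCODES = {0b001101: "ori", 0b100011: "lw", 0b101011: "sw",
           0b000100: "beq", 0b000010: "jump"}
FUNCS = {0b100000: "add", 0b100010: "sub"}

def control_for(op, func=0):
    """Return the control values for an instruction; func matters only when op is 0."""
    name = FUNCS[func] if op == 0 else OPCODES[op]
    return dict(zip(SIGNALS, ROWS[name]))

lw_ctl = control_for(0b100011)
sub_ctl = control_for(0b000000, 0b100010)
```

Note that the signals that change state (RegWrite, MemWrite) and the ones that redirect the PC (Branch, Jump) never have don't-care entries, because a wrong value there would corrupt the machine.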
38
Turning Control
Tables to Gates
n What is the logical equation for
n The RegDst signal?
n The ALUSrc signal?
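One possible answer, sketched under the assumption that each opcode is fully decoded and that don't-care entries are treated as 0: RegDst is 1 exactly for the R-type opcode, and ALUSrc is the OR of the ori, lw, and sw decodes. The helper names are hypothetical, and exploiting the don't-cares could give smaller equations.

```python
def bit(op, i):
    """Bit i of the 6-bit opcode."""
    return (op >> i) & 1

def is_op(op, pattern):
    """A 6-input AND: each op bit, or its complement, must match the pattern."""
    return all(bit(op, i) == bit(pattern, i) for i in range(6))

def reg_dst(op):
    """RegDst = R-type, i.e. all six op bits are 0."""
    return int(is_op(op, 0b000000))

def alu_src(op):
    """ALUSrc = ori OR lw OR sw."""
    return int(is_op(op, 0b001101) or is_op(op, 0b100011) or is_op(op, 0b101011))

OPS = {"R-type": 0b000000, "ori": 0b001101, "lw": 0b100011,
       "sw": 0b101011, "beq": 0b000100, "jump": 0b000010}
reg_dst_values = {name: reg_dst(op) for name, op in OPS.items()}
alu_src_values = {name: alu_src(op) for name, op in OPS.items()}
```

In gates, is_op corresponds to a 6-input AND with inverters on the bits that must be 0, and the or in alu_src corresponds to a 3-input OR.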
39
Timing for MemWrite & RegWrite
n How quickly should the MemWrite signal go to 1?
n How would you implement this?
40
Multilevel Decoding
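n You can have a single decoder block, or
n Since only the ALU needs the func field, pass it to the ALU unit and have a local decoder there
[Figure: two-level decoding. Main Control takes op (6 bits) and produces ALUop (N bits). The local ALU Control takes ALUop and func (6 bits) and produces ALUctr (3 bits), which drives the ALU.]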
41
Multilevel Decoding
            R-type    ori      lw       sw       beq       jump
op          00 0000   00 1101  10 0011  10 1011  00 0100   00 0010
RegDst      1         0        0        x        x         x
ALUSrc      0         1        1        1        0         x
MemtoReg    0         0        1        x        x         x
RegWrite    1         1        1        0        0         0
MemWrite    0         0        0        1        0         0
Branch      0         0        0        0        1         0
Jump        0         0        0        0        0         1
ExtOp       x         0        1        1        x         x
ALUop<N:0>  “R-type”  Or       Add      Add      Subtract  xxx
n Control signals for the main control block
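The split between the main control and the local ALU control can be sketched as two small lookups. The names below are illustrative, the string labels stand in for the N-bit ALUop and 3-bit ALUctr encodings (which the slides do not specify), and only the two func codes listed in the slides are handled.

```python
# Main control: looks only at op and emits ALUop (None = don't-care for jump).
MAIN_ALUOP = {
    0b000000: "R-type",
    0b001101: "Or",
    0b100011: "Add",
    0b101011: "Add",
    0b000100: "Subtract",
    0b000010: None,
}
# Local ALU control: the only block that looks at func.
FUNC_TO_ALUCTR = {0b100000: "Add", 0b100010: "Sub"}
ALUOP_TO_ALUCTR = {"Or": "Or", "Add": "Add", "Subtract": "Sub"}

def main_control_aluop(op):
    return MAIN_ALUOP[op]

def alu_control(aluop, func):
    """Use func only when the main control reports an R-type instruction."""
    if aluop == "R-type":
        return FUNC_TO_ALUCTR[func]
    return ALUOP_TO_ALUCTR[aluop]

sub_ctr = alu_control(main_control_aluop(0b000000), 0b100010)
lw_ctr = alu_control(main_control_aluop(0b100011), 0b000000)
```

With this split the main control table needs one column for all R-type instructions, and adding another R-type operation only changes the local decoder.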

42
Putting It All Together:
Our First Processor
[Figure: complete single-cycle processor. The datapath from the earlier slides is driven by a Control block that decodes Instruction [31–26] and generates RegDst, Jump, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc, and RegWrite. A separate ALU control block next to the ALU generates the ALU operation.]
43
Single Cycle Processor
Performance
n Functional unit delay
n Memory: 200ps
n ALU and adders: 200ps
n Register file: 100 ps
Instruction  Instruction  Register  ALU        Data    Register  Total
class        memory       read      operation  memory  write     (ps)
R-type       200          100       200        -       100       600
load         200          100       200        200     100       800
store        200          100       200        200     -         700
branch       200          100       200        -       -         500
jump         200          -         -          -       -         200
n CPU clock cycle = 800 ps = 0.8ns (1.25GHz)
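The table and the resulting clock period can be reproduced from the functional unit delays. The dictionary names and the list of units each class uses are written out here as an illustration of the table above.

```python
# Functional unit delays in picoseconds, from the slide.
DELAY_PS = {"imem": 200, "reg_read": 100, "alu": 200, "dmem": 200, "reg_write": 100}

# Units on the path of each instruction class (the non-empty table cells).
PATHS = {
    "R-type": ["imem", "reg_read", "alu", "reg_write"],
    "load":   ["imem", "reg_read", "alu", "dmem", "reg_write"],
    "store":  ["imem", "reg_read", "alu", "dmem"],
    "branch": ["imem", "reg_read", "alu"],
    "jump":   ["imem"],
}

totals_ps = {cls: sum(DELAY_PS[u] for u in units) for cls, units in PATHS.items()}
cycle_ps = max(totals_ps.values())   # the clock must fit the slowest instruction
freq_ghz = 1000 / cycle_ps           # 1000 ps per ns, so GHz = 1000 / period in ps
```

The maximum is set by load, the only class that uses all five units, so every other instruction also pays for 800 ps.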

45
Single Cycle MIPS Processor
n Pros
n Single cycle per instruction makes logic simple
n Cons
n Cycle time is the worst case path → long cycle times
n Worst case = load
n Hardware is underutilized
n ALU and memory used only for a fraction of clock cycle
n Not well amortized!
n Best possible CPI is 1
46
Variable Clock Single Cycle
Processor Performance
n Instruction mix
n 45% ALU
n 25% loads
n 10% stores
n 15% branches
n 5% jumps
Instruction  Instruction  Register  ALU        Data    Register  Total
class        memory       read      operation  memory  write     (ps)
R-type       200          100       200        -       100       600
load         200          100       200        200     100       800
store        200          100       200        200     -         700
branch       200          100       200        -       -         500
jump         200          -         -          -       -         200
n CPU clock cycle = 0.6 × 45% + 0.8 × 25% + 0.7 × 10% + 0.5 × 15% + 0.2 × 5%
= 0.625 ns (1.6GHz)
n Difficult to implement
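The weighted average can be checked directly from the mix and the per-class totals. The variable names are illustrative; the 45% ALU share is applied to the R-type latency, as in the formula on the slide.

```python
# Instruction mix in percent and per-class latency in picoseconds, from the slide.
MIX_PCT = {"R-type": 45, "load": 25, "store": 10, "branch": 15, "jump": 5}
LATENCY_PS = {"R-type": 600, "load": 800, "store": 700, "branch": 500, "jump": 200}

# Average cycle time if each instruction took only as long as it needs.
avg_cycle_ps = sum(MIX_PCT[c] * LATENCY_PS[c] for c in MIX_PCT) / 100
fixed_cycle_ps = max(LATENCY_PS.values())
speedup = fixed_cycle_ps / avg_cycle_ps
```

With these numbers the variable clock averages 625 ps against a fixed 800 ps, a speedup of 1.28, which is the most this scheme could gain for this mix.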
47
Key Tools for System Architects
1. Pipelining
2. Parallelism
3. Out-of-order execution
4. Prediction
5. Caching
6. Indirection
7. Amortization
8. Redundancy
9. Specialization
10. Focus on the common case
48
Pipelining: The Laundry Analogy
n Ann, Brian, Cathy, Dave doing laundry
n Washer takes 30 minutes
n Dryer takes 40 minutes
n “Folding bench” takes 20 minutes
49
Single-cycle Laundry
[Figure: timeline from 6 PM to midnight with task order A, B, C, D. The loads run one after another, each taking 30 + 40 + 20 minutes.]
Single-cycle laundry takes 6 hours for 4 loads

50
Pipelined Laundry
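[Figure: timeline starting at 6 PM with task order A, B, C, D. The loads overlap, each one starting a stage as soon as the previous load leaves it. Intervals along the time axis: 30, 40, 40, 40, 40, 20 minutes.]
Pipelined laundry takes 3.5 hours for 4 loads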
51
Lessons from Laundry Analogy
[Figure: the pipelined laundry timeline from the previous slide, with loads A, B, C, D overlapping and intervals of 30, 40, 40, 40, 40, 20 minutes.]
n Pipelining doesn’t help latency of single task, it helps throughput of entire workload
n Multiple tasks operating simultaneously
n Potential speedup = Number pipe stages
n Pipeline rate limited by slowest pipeline stage
n Unbalanced lengths of pipe stages reduce speedup
n Time to “fill” pipeline and time to “drain” it reduce speedup
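The laundry numbers can be reproduced with a short sketch. It assumes a simple model in which every load passes through the same stages in order and a finished load may wait for the next stage to become free; the function names are hypothetical.

```python
STAGES_MIN = [30, 40, 20]   # washer, dryer, folding bench

def sequential_minutes(stages, loads):
    """One load at a time: each load takes the sum of all stages."""
    return loads * sum(stages)

def pipelined_minutes(stages, loads):
    """First load takes the full latency; each later load finishes
    one slowest-stage interval after the previous one."""
    return sum(stages) + (loads - 1) * max(stages)

seq = sequential_minutes(STAGES_MIN, 4)      # minutes for 4 loads, one at a time
pipe = pipelined_minutes(STAGES_MIN, 4)      # minutes for 4 loads, overlapped
speedup_4 = seq / pipe
speedup_limit = sum(STAGES_MIN) / max(STAGES_MIN)   # limit for very many loads
```

For four loads this gives 360 versus 210 minutes, a speedup of about 1.7. Even with many loads the speedup approaches only 2.25 rather than 3, because the 40-minute dryer sets the rate and the stages are unbalanced.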
52
Another Analogy:
Model T Assembly Line
53
Pipelining the Processor
n 5 stages, one clock cycle per stage
n IF: instruction fetch from memory
n ID: instruction decode & register read
n EX: execute operation or calculate address
n MEM: access memory operand
n WB: write result back to register
     Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5
lw   IF       RF/ID    EX       MEM      WB
54
Pipelining the Processor
n Overlap instructions in different stages
n All hardware used all the time
n Clock cycle is fast
n CPI is still 1
         Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  Cycle 7
1st lw   IF       RF/ID    EX       MEM      WB
2nd lw            IF       RF/ID    EX       MEM      WB
3rd lw                     IF       RF/ID    EX       MEM      WB
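The staggered diagram can be generated for any number of instructions. This sketch assumes an ideal pipeline in which a new instruction enters IF every cycle and nothing ever stalls; the function name schedule is illustrative.

```python
STAGES = ("IF", "RF/ID", "EX", "MEM", "WB")

def schedule(num_instructions):
    """Row i lists the stage instruction i occupies in each cycle ('' when idle)."""
    total_cycles = len(STAGES) + num_instructions - 1
    rows = []
    for i in range(num_instructions):
        row = [""] * total_cycles
        for s, name in enumerate(STAGES):
            row[i + s] = name        # instruction i enters IF in cycle i + 1
        rows.append(row)
    return rows

rows = schedule(3)
for i, row in enumerate(rows, start=1):
    print(f"lw {i}: " + " ".join(f"{cell:6}" for cell in row))
```

Three instructions finish in 5 + 3 - 1 = 7 cycles. Once the pipeline is full, one instruction completes per cycle, which is why the CPI is still 1 even though each instruction takes five cycles from start to finish.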
55
To Be Continued
n Pipelined datapath and control
n Pipeline dependencies, hazards, and stalls
n The limits of pipelining
56
