
Microarchitecture (Continued)

Sam Amiri
Introduction
 Microarchitecture
 How to implement an architecture in hardware
 Processor
 Datapath: functional blocks
 Control: control signals

[Figure: levels of abstraction, from top to bottom]
Application Software: programs
Operating Systems: device drivers
Architecture: instructions, registers
Microarchitecture: datapaths, controllers
Logic: adders, memories
Digital Circuits: AND gates, NOT gates
Analog Circuits: amplifiers, filters
Devices: transistors, diodes
Physics: electrons
Microarchitecture

 Multiple implementations for a single architecture:


 Single-cycle: Each instruction executes in a single cycle
 Pipelined: Each instruction is broken up into a series of steps, and multiple instructions execute at once

MIPS Processor

 Consider subset of MIPS instructions:


 R-type instructions: and, or, add, sub, slt
 Memory instructions: lw, sw
 Branch instructions: beq

Review: Single-Cycle Processor
[Figure: single-cycle MIPS processor. The Control Unit decodes Op (Instr 31:26) and Funct (Instr 5:0) into Jump, MemtoReg, MemWrite, Branch, ALUControl2:0, ALUSrc, RegDst, and RegWrite. The datapath comprises the PC, Instruction Memory, Register File, Sign Extend, ALU, and Data Memory.]

Review: Processor Performance

 Program execution time


Execution Time = (#instructions)(cycles/instruction)(seconds/cycle)
= #instructions x CPI x Tc
 Definitions:
 CPI: Cycles/instruction
 Clock period: seconds/cycle
 IPC: instructions/cycle = 1/CPI
 Challenge is to satisfy constraints of:
 Cost
 Power
 Performance

Single-Cycle Performance
[Figure: single-cycle datapath with the control signals set for lw (ALUControl = 010) and the critical path highlighted]

Tc limited by critical path (lw)

Single-Cycle Performance

 Single-cycle critical path:
 Tc = tpcq_PC + tmem + max(tRFread, tsext + tmux) + tALU + tmem + tmux + tRFsetup
 Typically, limiting paths are memory, ALU, register file:
 Tc = tpcq_PC + 2tmem + tRFread + tmux + tALU + tRFsetup
 tpcq: clock-to-Q propagation delay

[Figure: single-cycle datapath with the lw critical path highlighted]

Tc limited by critical path (lw)

Single-Cycle Performance Example

Element               Parameter   Delay (ps)
Register clock-to-Q   tpcq_PC     30
Register setup        tsetup      20
Multiplexer           tmux        25
ALU                   tALU        200
Memory read           tmem        250
Register file read    tRFread     150
Register file setup   tRFsetup    20

Tc = ?
Tc = tpcq_PC + 2tmem + tRFread + tmux + tALU + tRFsetup
   = [30 + 2(250) + 150 + 25 + 200 + 20] ps
   = 925 ps

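The arithmetic above can be checked with a short script. This is only a sketch: the dictionary keys reuse the parameter names from the table, while the function name single_cycle_period_ps is an illustrative choice, not something from the slides.

```python
# Delays in picoseconds, copied from the table on this slide.
delays_ps = {
    "tpcq_PC": 30,
    "tsetup": 20,
    "tmux": 25,
    "tALU": 200,
    "tmem": 250,
    "tRFread": 150,
    "tRFsetup": 20,
}


def single_cycle_period_ps(d):
    """Cycle time set by the lw critical path (simplified formula).

    Tc = tpcq_PC + 2*tmem + tRFread + tmux + tALU + tRFsetup
    """
    return (d["tpcq_PC"] + 2 * d["tmem"] + d["tRFread"]
            + d["tmux"] + d["tALU"] + d["tRFsetup"])


tc_ps = single_cycle_period_ps(delays_ps)
print(tc_ps)  # 925
```

Note that tsetup does not appear in the formula; only the register file setup time sits on the lw path.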
Single-Cycle Performance Example

 Program with 100 billion instructions:

Execution Time = # instructions x CPI x Tc
               = (100 × 10^9)(1)(925 × 10^-12 s)
               = 92.5 seconds
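The same calculation as a small sketch. The function name execution_time_s and its parameter names are illustrative assumptions; the numbers are the ones on this slide.

```python
def execution_time_s(num_instructions, cpi, tc_ps):
    """Execution time in seconds = #instructions x CPI x Tc.

    tc_ps is the clock period in picoseconds, so divide by 1e12 at the end.
    """
    return num_instructions * cpi * tc_ps / 1e12


# 100 billion instructions, CPI = 1 (single-cycle), Tc = 925 ps
t = execution_time_s(100 * 10**9, 1, 925)
print(t)  # 92.5
```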

Pipelined Analogy

 Pipelined laundry: overlapping execution
 Parallelism improves performance
 Four loads:
 Speedup = 8/3.5 = 2.3
 Non-stop:
 Speedup = 2n/(0.5n + 1.5) ≈ 4 for large n = number of stages
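A rough sketch of the laundry speedup as a function of the number of loads. It assumes four stages of 0.5 hours each; those values are not stated on the slide but are read off the formula 2n/(0.5n + 1.5). The function name laundry_speedup is hypothetical.

```python
def laundry_speedup(n_loads, stages=4, stage_hours=0.5):
    """Speedup of pipelined over sequential laundry for n_loads loads.

    Assumption: `stages` equal steps of `stage_hours` each
    (4 x 0.5 h, inferred from the slide's 2n and 0.5n + 1.5).
    """
    # Sequential: every load runs all stages before the next starts (2n hours).
    sequential = n_loads * stages * stage_hours
    # Pipelined: one load finishes per stage time once the pipeline
    # is full, plus the fill time of (stages - 1) steps (0.5n + 1.5 hours).
    pipelined = n_loads * stage_hours + (stages - 1) * stage_hours
    return sequential / pipelined


print(round(laundry_speedup(4), 1))   # 2.3
print(laundry_speedup(1_000_000))     # close to 4, the number of stages
```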

Pipelined MIPS Processor

 Temporal parallelism
 Divide single-cycle processor into 5 stages:
 Fetch
 Decode
 Execute
 Memory
 Writeback
 Add pipeline registers between stages
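As a sketch of how the five stages overlap in time, the snippet below models an ideal pipeline with no hazards or stalls (an assumption; hazards are treated later in the deck). The names stage_at and total_cycles are illustrative.

```python
STAGES = ["Fetch", "Decode", "Execute", "Memory", "Writeback"]


def stage_at(instr_index, cycle):
    """Stage occupied by instruction `instr_index` (0-based) during
    `cycle` (1-based), or None if it is not in the pipeline then.

    Ideal pipeline: instruction i enters Fetch in cycle i + 1 and
    advances one stage per cycle.
    """
    s = cycle - 1 - instr_index
    return STAGES[s] if 0 <= s < len(STAGES) else None


def total_cycles(n_instructions):
    """Cycles to finish n instructions: one per instruction plus fill time."""
    return n_instructions + len(STAGES) - 1


print(stage_at(0, 3), stage_at(1, 3), stage_at(2, 3))  # Execute Decode Fetch
print(total_cycles(6))  # 10
```

In cycle 3, three different instructions occupy three different stages, which is the temporal parallelism the slide refers to.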

Single-Cycle vs. Pipelined
[Figure: timing diagram, 0 to 1900 ps, comparing single-cycle and pipelined execution. Each instruction passes through Fetch (Instruction), Decode (Read Reg), Execute (ALU), Memory (Read/Write), and Write (Reg). Single-cycle: instructions 1 and 2 run back to back. Pipelined: instructions 1 to 3 overlap, each starting one stage after the previous one.]

Pipelined Processor Abstraction

[Figure: pipeline diagram over cycles 1 to 10. Each instruction passes through IM, RF, ALU, DM, and RF, starting one cycle after the previous instruction.]

    lw  $s2, 40($0)
    add $s3, $t1, $t2
    sub $s4, $s1, $s5
    and $s5, $t5, $t6
    sw  $s6, 20($s1)
    or  $s7, $t3, $t4

Single-Cycle & Pipelined Datapath
[Figure: single-cycle datapath (top) and pipelined datapath (bottom). Pipeline registers divide the datapath into Fetch, Decode, Execute, Memory, and Writeback stages. Signal names carry a stage suffix (F, D, E, M, W), e.g. PCF, InstrD, SrcAE, ALUOutM, ResultW.]

Corrected Pipelined Datapath
[Figure: corrected pipelined datapath. WriteReg is carried through the pipeline registers as WriteRegE, WriteRegM, and WriteRegW.]

WriteReg must arrive at same time as Result

Pipelined Processor Control
[Figure: pipelined processor with control. The Control Unit decodes Op and Funct in the Decode stage. RegWrite, MemtoReg, MemWrite, Branch, ALUControl, ALUSrc, and RegDst pass through the pipeline registers with the data (e.g. RegWriteD, RegWriteE, RegWriteM, RegWriteW). PCSrcM is produced in the Memory stage.]

 Same control unit as single-cycle processor
 Control delayed to proper pipeline stage

Pipeline Hazards

 When an instruction depends on a result from an instruction that hasn’t completed
 Types:
 Data hazard: register value not yet written back to register file
 Control hazard: next instruction not decided yet (caused by branches)

Data Hazard

[Figure: pipeline diagram over cycles 1 to 8. add writes $s0; and, or, and sub each read $s0.]

    add $s0, $s2, $s3
    and $t0, $s0, $s1
    or  $t1, $s4, $s0
    sub $t2, $s0, $s5

Handling Data Hazards

 Insert nops in code at compile time


 Rearrange code at compile time
 Forward data at run time
 Stall the processor at run time

Compile-Time Hazard Elimination

 Insert enough nops for result to be ready


 Or move independent useful instructions forward

[Figure: pipeline diagram over cycles 1 to 10, with two nops between add and the first instruction that reads $s0.]

    add $s0, $s2, $s3
    nop
    nop
    and $t0, $s0, $s1
    or  $t1, $s4, $s0
    sub $t2, $s0, $s5

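A minimal sketch of nop insertion at compile time. It assumes, as in the diagram on this slide, that two nops between a producer and its first consumer are enough (gap = 2). The tuple layout (text, dest, sources) and the function name insert_nops are illustrative assumptions, not a real assembler interface.

```python
NOP = ("nop", None, ())


def insert_nops(program, gap=2):
    """Return a copy of `program` with nops inserted so that no
    instruction reads a register written by one of the `gap`
    instructions issued immediately before it.

    Each instruction is a tuple (text, dest, sources).
    Assumption: gap=2 matches the two nops shown on the slide.
    """
    out = []
    for instr in program:
        _, _, sources = instr
        needed = 0
        # Check the last `gap` issued slots for a producer of our sources.
        for distance in range(1, min(gap, len(out)) + 1):
            prev_dest = out[-distance][1]
            if prev_dest not in (None, "$0") and prev_dest in sources:
                # A producer `distance` slots back needs this many nops.
                needed = max(needed, gap - distance + 1)
        out.extend([NOP] * needed)
        out.append(instr)
    return out


program = [
    ("add $s0, $s2, $s3", "$s0", ("$s2", "$s3")),
    ("and $t0, $s0, $s1", "$t0", ("$s0", "$s1")),
    ("or $t1, $s4, $s0", "$t1", ("$s4", "$s0")),
    ("sub $t2, $s0, $s5", "$t2", ("$s0", "$s5")),
]
for text, _, _ in insert_nops(program):
    print(text)
```

Run on the sequence from the slide, this prints add, two nops, then and, or, sub. Moving independent instructions into those two slots instead of nops would avoid the wasted cycles.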
Data Forwarding

[Figure: pipeline diagram over cycles 1 to 8 for the sequence below, with $s0 forwarded from add to the instructions that read it.]

    add $s0, $s2, $s3
    and $t0, $s0, $s1
    or  $t1, $s4, $s0
    sub $t2, $s0, $s5

Data Forwarding
[Figure: pipelined processor with forwarding. Three-input multiplexers (select values 00, 01, 10) in front of SrcAE and SrcBE choose among the register file output, the Writeback-stage result (ResultW), and the Memory-stage result (ALUOutM). The Hazard Unit receives RsE, RtE, RegWriteM, and RegWriteW and drives ForwardAE and ForwardBE.]

Data Forwarding

 Forward to Execute stage from either:


 Memory stage or
 Writeback stage
 Forwarding logic for ForwardAE:
    if ((rsE != 0) AND (rsE == WriteRegM) AND RegWriteM)
        then ForwardAE = 10
    else if ((rsE != 0) AND (rsE == WriteRegW) AND RegWriteW)
        then ForwardAE = 01
    else ForwardAE = 00
 Forwarding logic for ForwardBE: same, but replace rsE with rtE
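The forwarding logic above can be sketched as a small function for experimentation. The function name forward_select and its snake_case parameters are illustrative stand-ins for the slide's signals; register numbers are plain integers (for example $s0 = 16 in the MIPS register numbering).

```python
def forward_select(src_e, write_reg_m, reg_write_m, write_reg_w, reg_write_w):
    """Mux select for one ALU operand in the Execute stage.

    Pass rsE as src_e to get ForwardAE, or rtE to get ForwardBE.
    Returns 0b10 (forward from Memory stage), 0b01 (forward from
    Writeback stage), or 0b00 (use the register file value).
    """
    # The Memory stage holds the more recent result, so check it first.
    if src_e != 0 and src_e == write_reg_m and reg_write_m:
        return 0b10
    if src_e != 0 and src_e == write_reg_w and reg_write_w:
        return 0b01
    return 0b00


S0 = 16  # $s0
# "and $t0, $s0, $s1" in Execute while "add $s0, ..." is in Memory:
print(format(forward_select(S0, S0, True, 0, False), "02b"))  # 10
```

The check against register 0 keeps $0 from ever being forwarded, since it always reads as zero.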

Stalling

[Figure: pipeline diagram over cycles 1 to 8, marked "Trouble!": and needs $s0 in its Execute stage before lw has read it from data memory.]

    lw  $s0, 40($0)
    and $t0, $s0, $s1
    or  $t1, $s4, $s0
    sub $t2, $s0, $s5

Stalling

[Figure: pipeline diagram over cycles 1 to 9 for the sequence below. During the stall, and repeats its Decode (RF) stage and or repeats its Fetch (IM) stage for one cycle; sub is fetched one cycle later.]

    lw  $s0, 40($0)
    and $t0, $s0, $s1
    or  $t1, $s4, $s0
    sub $t2, $s0, $s5

Stalling Hardware
[Figure: pipelined processor with forwarding and stalling. The PC register and the Fetch/Decode pipeline register have enable (EN) inputs controlled by StallF and StallD; the Decode/Execute pipeline register has a clear (CLR) input controlled by FlushE. The Hazard Unit additionally receives MemtoRegE.]

Stalling Logic

lwstall = ((rsD==rtE) OR (rtD==rtE)) AND MemtoRegE

StallF = StallD = FlushE = lwstall
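The stall equations can be sketched the same way. The names lw_stall and hazard_outputs are hypothetical; the parameters mirror rsD, rtD, rtE, and MemtoRegE from the slide, with register numbers as plain integers.

```python
def lw_stall(rs_d, rt_d, rt_e, mem_to_reg_e):
    """True when the instruction in Decode reads the register that a
    lw in Execute is about to load (MemtoRegE is 1 only for lw)."""
    return bool((rs_d == rt_e or rt_d == rt_e) and mem_to_reg_e)


def hazard_outputs(rs_d, rt_d, rt_e, mem_to_reg_e):
    """StallF, StallD, and FlushE all follow lwstall."""
    stall = lw_stall(rs_d, rt_d, rt_e, mem_to_reg_e)
    return {"StallF": stall, "StallD": stall, "FlushE": stall}


S0, S1 = 16, 17  # $s0, $s1
# "lw $s0, 40($0)" in Execute (rtE = $s0), "and $t0, $s0, $s1" in Decode:
print(hazard_outputs(S0, S1, S0, True))
# {'StallF': True, 'StallD': True, 'FlushE': True}
```

Stalling Fetch and Decode holds the dependent instructions in place for one cycle, while flushing Execute inserts a bubble behind the lw.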

Thank You!
