DRAFT 7/13/2023

EEE 415 - Microprocessors and Embedded Systems

Microarchitecture:
Multicycle Processor

Lecture 6.1

Week 6

Dr. Sajid Muhaimin Choudhury, Assistant Professor


Department of Electrical and Electronics Engineering
Bangladesh University of Engineering and Technology

Multicycle ARM Processor


• Single-cycle:
+ simple
- cycle time limited by longest instruction (LDR)
- separate memories for instruction and data
- 3 adders/ALUs
• The multicycle processor addresses these issues by breaking each instruction into
shorter steps:
o shorter instructions take fewer steps
o hardware can be reused
o cycle time is shorter

EEE 415 - Department of EEE, BUET Digital Design and Computer Architecture: ARM® Edition © 2015

Multicycle ARM Processor


• Single-cycle:
+ simple
- cycle time limited by longest instruction (LDR)
- separate memories for instruction and data
- 3 adders/ALUs
• Multicycle:
+ higher clock speed
+ simpler instructions run faster
+ reuse expensive hardware on multiple cycles
- sequencing overhead paid many times

Same design steps as single-cycle:
• first datapath
• then control

Multicycle State Elements


Replace the separate instruction and data memories with
a single unified memory, which is more realistic

Multicycle Datapath: Instruction Fetch


STEP 1: Fetch instruction

LDR Rd, [Rn, imm12]

Multicycle Datapath: LDR Register Read


STEP 2: Read source operands from RF

LDR Rd, [Rn, imm12]

Multicycle Datapath: LDR Address


STEP 3: Compute the memory address

LDR Rd, [Rn, imm12]

Multicycle Datapath: LDR Memory Read


STEP 4: Read data from memory

LDR Rd, [Rn, imm12]

Multicycle Datapath: LDR Write Register


STEP 5: Write data back to register file

LDR Rd, [Rn, imm12]

Multicycle Datapath: Increment PC


STEP 6: Increment PC
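
The six LDR steps listed on these slides can be traced in a few lines of Python. This is only a sketch under stated assumptions: the unified memory is modeled as a dictionary keyed by byte address, the instruction word is a placeholder string, and `run_ldr` and its parameters are hypothetical names that do not come from the lecture.

```python
def run_ldr(regs, mem, pc, rd, rn, imm12):
    """Trace LDR Rd, [Rn, imm12] through the six steps listed on the slides.

    regs: dict mapping register number -> value (assumed model of the RF)
    mem:  dict mapping byte address -> word (assumed model of unified memory)
    Returns the updated PC and a list of (step, description) tuples.
    """
    trace = []
    instr = mem[pc]                       # Step 1: fetch instruction at PC
    trace.append((1, f"fetch {instr!r} at PC={pc:#x}"))
    base = regs[rn]                       # Step 2: read source operand Rn from RF
    trace.append((2, f"read R{rn}={base:#x}"))
    addr = (base + imm12) & 0xFFFFFFFF    # Step 3: compute the memory address
    trace.append((3, f"address={addr:#x}"))
    data = mem[addr]                      # Step 4: read data from memory
    trace.append((4, f"data={data:#x}"))
    regs[rd] = data                       # Step 5: write data back to RF
    trace.append((5, f"write R{rd}"))
    pc = (pc + 4) & 0xFFFFFFFF            # Step 6: increment PC
    trace.append((6, f"PC={pc:#x}"))
    return pc, trace


# Instruction and data live in the same (unified) memory.
regs = {0: 0x100, 2: 0}
mem = {0x0: "LDR R2, [R0, #40]", 0x128: 0xCAFE}
new_pc, steps = run_ldr(regs, mem, pc=0x0, rd=2, rn=0, imm12=40)
```

The sketch keeps the slide order for readability; in the datapath itself the PC increment can overlap the fetch step, since the ALU is otherwise idle during fetch.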

Multicycle Datapath: Access to PC


PC can be read or written by an instruction
• Read: R15 (PC+8) is available in the register file

Multicycle Datapath: Read to PC (R15)


Example: ADD R1, R15, R2
• R15 needs to be read as PC+8 from Register File (RF) in 2nd step
• So (also in 2nd step) PC + 8 is produced by ALU and routed to R15
input of RF
– SrcA = PC (which was already updated in step 1 to PC+4)
– SrcB = 4
– ALUResult = PC + 8
• ALUResult is fed to R15 input port of RF in 2nd step (which is then
routed to RD1 output of RF)
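
The PC+8 bookkeeping above is easy to get wrong, so a small arithmetic sketch may help. It assumes 32-bit wraparound and uses hypothetical helper names (`r15_as_read`, `add_r1_r15_r2`); it models only the values, not the datapath wiring.

```python
MASK32 = 0xFFFFFFFF


def r15_as_read(instr_addr: int) -> int:
    """Value that an instruction located at instr_addr sees when it reads R15."""
    pc = (instr_addr + 4) & MASK32        # step 1: PC was already updated to PC + 4
    src_a, src_b = pc, 4                  # step 2: SrcA = PC, SrcB = 4
    alu_result = (src_a + src_b) & MASK32
    return alu_result                     # fed to the R15 input of the RF


def add_r1_r15_r2(instr_addr: int, r2: int) -> int:
    """ADD R1, R15, R2 executed at instr_addr; returns the new value of R1."""
    return (r15_as_read(instr_addr) + r2) & MASK32
```

For an instruction at address 0x1000, `r15_as_read` returns 0x1008, which is the address of the instruction plus 8.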

Multicycle Datapath: Access to PC


PC can be read or written by an instruction
• Read: R15 (PC+8) is available in the register file
• Write: the result of an instruction can be written to the PC

Multicycle Datapath: Write to PC (R15)


Example: SUB R15, R8, R3
• Result of instruction needs to be written to the PC register
• ALUResult already routed to the PC register, just assert PCWrite

Multicycle Datapath: STR


STR Rd, [Rn, imm12]: write data in Rd to memory at address Rn + imm12

Multicycle Datapath: Data-processing


With immediate addressing (i.e., an
immediate Src2), no additional changes
needed for datapath

Multicycle Datapath: Data-processing


With register addressing (register Src2):
Read from Rn and Rm

Multicycle Datapath: B
Calculate branch target address:
BTA = (ExtImm) + (PC+8)
ExtImm = Imm24 << 2 and sign-extended
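
As a worked illustration of the branch target formula, the sketch below computes BTA from a 24-bit immediate. The function name `branch_target` and the 32-bit masking are assumptions made for the example; only the formula itself comes from the slide.

```python
def branch_target(pc: int, imm24: int) -> int:
    """BTA = (PC + 8) + ExtImm, where ExtImm = sign-extended (Imm24 << 2)."""
    imm24 &= 0xFFFFFF              # keep only the 24-bit immediate field
    offset = imm24 << 2            # shift left by 2: a 26-bit value
    if offset & (1 << 25):         # bit 25 is the sign bit after shifting
        offset -= 1 << 26          # sign-extend
    return (pc + 8 + offset) & 0xFFFFFFFF
```

With `pc = 0x8000`, an immediate of 0 targets 0x8008, and an immediate of 0xFFFFFE (that is, -2) targets 0x8000, a branch back to the branch instruction itself.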

Multicycle ARM Processor

Multicycle Control

• First, discuss Decoder


• Then, Conditional Logic

Multicycle Control: Decoder

ALU Decoder and PC Logic same as single-cycle

Multicycle Control: Instr Decoder


Instr Decoder: input Op1:0; outputs ImmSrc1:0 and RegSrc1:0

RegSrc0 = (Op == 10₂)
RegSrc1 = (Op == 01₂)
ImmSrc1:0 = Op

Instruction    Op   Funct5   Funct0   RegSrc0   RegSrc1   ImmSrc1:0
LDR            01   X        1        0         X         01
STR            01   X        0        0         1         01
DP immediate   00   1        X        0         X         00
DP register    00   0        X        0         0         00
B              10   X        X        1         X         10
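
The decoder equations can be checked against the truth table with a short sketch. The function and table names (`instr_decoder`, `TABLE`, `matches_table`) are hypothetical; `None` stands in for a don't-care (X) entry.

```python
def instr_decoder(op: int) -> dict:
    """Decode the 2-bit Op field using the three equations on the slide."""
    return {
        "RegSrc0": int(op == 0b10),
        "RegSrc1": int(op == 0b01),
        "ImmSrc": op & 0b11,
    }


# Rows of the slide's table: (Op, RegSrc0, RegSrc1, ImmSrc); None means X.
TABLE = {
    "LDR": (0b01, 0, None, 0b01),
    "STR": (0b01, 0, 1, 0b01),
    "DP immediate": (0b00, 0, None, 0b00),
    "DP register": (0b00, 0, 0, 0b00),
    "B": (0b10, 1, None, 0b10),
}


def matches_table() -> bool:
    """True if the equations agree with every specified (non-X) table entry."""
    for op, reg_src0, reg_src1, imm_src in TABLE.values():
        out = instr_decoder(op)
        if out["RegSrc0"] != reg_src0 or out["ImmSrc"] != imm_src:
            return False
        if reg_src1 is not None and out["RegSrc1"] != reg_src1:
            return False
    return True
```

Note that the equation gives RegSrc1 = 1 for LDR as well as STR; this is consistent with the table because the LDR entry is a don't-care.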

Multicycle Control: Main FSM

Main Controller FSM: Fetch

Main Controller FSM: Decode

Main Controller FSM: Address

Main Controller FSM: Read Memory

Multicycle ARM Processor

Main Controller FSM: LDR

Main Controller FSM: STR

Main Controller FSM: Data-processing

Multicycle Controller FSM

Multicycle Control: Cond. Logic

Single-Cycle Conditional Logic

Multicycle Conditional Logic


• PCWrite asserted in Fetch state
• ExecuteI/ExecuteR state:
– CondEx asserts
– ALUFlags generated
• ALUWB state:
– Flags updated
– CondEx changes
– PCWrite, RegWrite, and MemWrite don’t see the change till the new instruction (Fetch state)

[Figure: multicycle conditional logic; signals shown: NextPC, PCS, RegW, MemW, FlagW1:0, Cond3:0, ALUFlags3:0, Flags3:2, Flags1:0, FlagWrite1:0, CondEx (from Condition Check), PCWrite, RegWrite, MemWrite]

Multicycle Processor Performance


• Instructions take different number of cycles.

Multicycle Controller FSM

Microarchitecture:
Multicycle Processor

Lecture 6.2

Week 6

Almost identical to the single-cycle versions

Practice!

Store PC in the register file (LR),
then branch

[Figure: control unit with Decoder and Conditional Logic blocks]

Multicycle Processor Performance


• Instructions take different numbers of cycles:
– 3 cycles: B
– 4 cycles: DP, STR
– 5 cycles: LDR
• CPI is a weighted average over the instruction mix
• SPECINT2000 benchmark:
– 25% loads
– 10% stores
– 13% branches
– 52% data-processing (R-type)

Average CPI = (0.13)(3) + (0.52 + 0.10)(4) + (0.25)(5) = 4.12
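
The weighted average can be reproduced with a short sketch. The dictionary keys and the function name `average_cpi` are hypothetical labels; the cycle counts and mix percentages are the ones listed on the slide.

```python
# Cycles per instruction class, as listed on the slide.
CYCLES = {"branch": 3, "data_processing": 4, "store": 4, "load": 5}

# SPECINT2000 instruction mix from the slide (fractions sum to 1).
MIX = {"load": 0.25, "store": 0.10, "branch": 0.13, "data_processing": 0.52}


def average_cpi(cycles: dict, mix: dict) -> float:
    """Weighted average: sum over classes of (fraction of class) x (cycles of class)."""
    return sum(mix[name] * cycles[name] for name in mix)


cpi = average_cpi(CYCLES, MIX)  # approximately 4.12
```

Changing the mix (for example, a load-heavy program) shifts the CPI toward 5, which is why the benchmark choice matters for this figure.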

Multicycle Processor Performance


Multicycle critical path:
• Assumptions:
– RF is faster than memory
– writing memory is faster than reading memory

Tc2 = tpcq + 2tmux + max(tALU + tmux, tmem) + tsetup

Multicycle Performance Example


Element               Parameter   Delay (ps)
Register clock-to-Q   tpcq_PC     40
Register setup        tsetup      50
Multiplexer           tmux        25
ALU                   tALU        120
Decoder               tdec        70
Memory read           tmem        200
Register file read    tRFread     100
Register file setup   tRFsetup    60

Tc2 = tpcq + 2tmux + max[tALU + tmux, tmem] + tsetup
    = [40 + 2(25) + max(120 + 25, 200) + 50] ps
    = [40 + 50 + 200 + 50] ps = 340 ps
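
The same calculation as a minimal Python sketch, using the delays from the table. The dictionary keys and the function name `multicycle_cycle_time` are assumptions made for the example.

```python
# Delays in picoseconds, taken from the table above.
DELAYS_PS = {"tpcq": 40, "tsetup": 50, "tmux": 25, "tALU": 120, "tmem": 200}


def multicycle_cycle_time(d: dict) -> int:
    """Tc2 = tpcq + 2*tmux + max(tALU + tmux, tmem) + tsetup."""
    alu_path = d["tALU"] + d["tmux"]          # 145 ps with the table values
    slowest = max(alu_path, d["tmem"])        # memory read (200 ps) dominates
    return d["tpcq"] + 2 * d["tmux"] + slowest + d["tsetup"]


tc2 = multicycle_cycle_time(DELAYS_PS)
```

With these values the memory read sets the cycle time, so `tc2` evaluates to 340.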

Multicycle Performance Example


For a program with 100 billion instructions
executing on a multicycle ARM processor
– CPI = 4.12 cycles/instruction
– Clock cycle time: Tc2 = 340 ps

Execution Time = (# instructions) × CPI × Tc
               = (100 × 10^9)(4.12)(340 × 10^-12 s)
               = 140 seconds

This is slower than the single-cycle processor (84 sec.)
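
To make the comparison concrete, the sketch below evaluates the execution-time formula for both processors. The single-cycle figures (CPI = 1, Tc = 840 ps) are the ones used in the single-cycle example of this lecture; `execution_time_s` is a hypothetical helper name.

```python
def execution_time_s(instructions: float, cpi: float, tc_ps: float) -> float:
    """Execution time in seconds = (# instructions) x CPI x Tc."""
    return instructions * cpi * tc_ps * 1e-12   # convert picoseconds to seconds


N = 100e9                                       # 100 billion instructions
multicycle = execution_time_s(N, 4.12, 340)     # about 140 s
single_cycle = execution_time_s(N, 1.0, 840)    # about 84 s
slowdown = multicycle / single_cycle            # about 1.67x
```

The multicycle clock is faster (340 ps versus 840 ps), but each instruction needs about four cycles on average, so the product CPI × Tc ends up larger.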

Single-Cycle ARM Processor


[Figure: single-cycle ARM processor, datapath and control unit]

Single-Cycle Control

Single-Cycle ARM Processor


[Figure: single-cycle ARM processor with critical-path delays labeled: tpcq_PC, tmem, tdec, tmux, tRFread, tALU, tmem, tmux, tRFsetup]

Critical path: LDR reg, [reg, imm]


Single-Cycle Performance Example


Element               Parameter   Delay (ps)
Register clock-to-Q   tpcq_PC     40
Register setup        tsetup      50
Multiplexer           tmux        25
ALU                   tALU        120
Decoder               tdec        70
Memory read           tmem        200
Register file read    tRFread     100
Register file setup   tRFsetup    60

Tc1 = tpcq_PC + 2tmem + tdec + tRFread + tALU + 2tmux + tRFsetup
    = [40 + 2(200) + 70 + 100 + 120 + 2(25) + 60] ps
    = 840 ps
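
A quick check of the single-cycle sum, written as a sketch with the table's delays. The dictionary keys mirror the parameter names in the table; `single_cycle_time` is a hypothetical helper name.

```python
# Delays in picoseconds, taken from the table above.
DELAYS_PS = {
    "tpcq_PC": 40, "tmem": 200, "tdec": 70, "tRFread": 100,
    "tALU": 120, "tmux": 25, "tRFsetup": 60,
}


def single_cycle_time(d: dict) -> int:
    """Tc1 = tpcq_PC + 2*tmem + tdec + tRFread + tALU + 2*tmux + tRFsetup."""
    return (d["tpcq_PC"] + 2 * d["tmem"] + d["tdec"] + d["tRFread"]
            + d["tALU"] + 2 * d["tmux"] + d["tRFsetup"])


tc1 = single_cycle_time(DELAYS_PS)
```

The two memory accesses (instruction fetch and data read) contribute 400 ps, almost half of the total.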

Single-Cycle Performance Example


Program with 100 billion instructions:

Execution Time = (# instructions) × CPI × Tc
               = (100 × 10^9)(1)(840 × 10^-12 s)
               = 84 seconds

Review: Multicycle ARM Processor

Multicycle ARM Processor: Critical Path

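Two candidate critical paths:
• ALU path: tpcq + 2tmux + tALU + tmux + tsetup
• Memory path: tpcq + 2tmux + tmem + tsetup

When tmem exceeds tALU + tmux, the memory path dominates:
Tc2 = tpcq + 2tmux + tmem + tsetup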

Tc2 = tpcq + 2tmux + max[tALU + tmux, tmem] + tsetup

Decreasing tmem below tALU + tmux brings no further improvement,
because the ALU path then sets the cycle time.

Smallest useful tmem = tALU + tmux = 120 + 25 = 145 ps

New cycle time = 40 + 2(25) + 145 + 50 = 285 ps
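
A small sweep shows where faster memory stops helping. The sketch below assumes the delay values used throughout this example; the function name `tc2_ps` and the chosen sweep points are hypothetical.

```python
def tc2_ps(tmem: int, tpcq: int = 40, tmux: int = 25,
           talu: int = 120, tsetup: int = 50) -> int:
    """Multicycle cycle time for a given memory read delay (all in ps)."""
    return tpcq + 2 * tmux + max(talu + tmux, tmem) + tsetup


# Cycle time for progressively faster memories.
sweep = {tmem: tc2_ps(tmem) for tmem in (200, 170, 145, 120, 100)}
```

The cycle time falls from 340 ps to 285 ps as tmem drops from 200 ps to 145 ps, and stays at 285 ps for anything faster.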

Microarchitecture:
Pipeline Architecture

Lecture 6.3

Week 6

What Is Pipelining?
[Figure: two timing diagrams of car manufacturing (Car A, B, and then C) through the Engine, Chassis, and Paint steps; axes: time vs. order of manufacturing]

What Is Pipelining?
• We arrange for the different phases of
execution to be overlapped. We aim to
exploit “temporal” parallelism.
• How are latency and throughput affected?

[Image: Volkswagen Beetle assembly line (by Alden Jewell, license: CC BY 2.0)]

Pipelining and Maintaining Correctness


• We can break the execution of instructions into stages and overlap the
execution of different instructions.
• We need to ensure that the results produced by our new pipelined
processor are no different to the unpipelined one.
• What sort of situations could cause problems once we pipeline our
processor?

Pipelining

[Figure: a block of combinational logic with delay T between registers with overhead C, clocked by clk; below, the same logic split into two T/2 halves by a pipelining register]

Clock period = T + C

Now, if we create two pipeline stages:

Clock period = T/2 + C

If C is small, our clock frequency has almost doubled.
Hence, our throughput will almost double, too.
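
The effect of the register overhead C can be quantified with a short sketch. It assumes the idealized case where the logic delay T splits evenly across stages; `clock_period` and `throughput_gain` are hypothetical helper names, and the numeric values are illustrative only.

```python
def clock_period(t_logic: float, c_overhead: float, stages: int) -> float:
    """Clock period when logic delay T is split evenly over `stages` stages."""
    return t_logic / stages + c_overhead


def throughput_gain(t_logic: float, c_overhead: float, stages: int) -> float:
    """Ratio of pipelined to unpipelined clock frequency (idealized)."""
    return clock_period(t_logic, c_overhead, 1) / clock_period(t_logic, c_overhead, stages)


# Illustrative numbers (not from the lecture): T = 1000 ps, C = 50 ps.
gain_two_stage = throughput_gain(1000, 50, 2)    # 1050 / 550, a little under 2
gain_no_overhead = throughput_gain(1000, 0, 2)   # exactly 2 when C = 0
```

The gain reaches 2 only when C is zero; as more stages are added, C becomes a larger share of each clock period and the returns diminish.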

Pipelining Our Processor


• We can insert pipelining registers into our processor to divide the logic into
different pipeline stages.
• If added carefully, the worst-case delay between any two registers will be
reduced.
• We will also need to be careful to pipeline our control signals so that decoded
control information accompanies each instruction as it progresses down the
pipeline.

Pipelined ARM Processor


• Temporal parallelism
• Divide single-cycle processor into 5 stages:
– Fetch
– Decode
– Execute
– Memory
– Writeback
• Add pipeline registers between stages

Single-Cycle vs. Pipelined


[Figure: timing diagrams (time axis 0 to 1500 ps) comparing (a) single-cycle and (b) pipelined execution; each instruction passes through Fetch Instruction, Dec Read Reg, Execute ALU, Memory Read/Write, and Wr Reg]

Pipelined Processor Abstraction


[Figure: pipeline diagram over cycles 1 to 10; each instruction passes through IM, RF (read), ALU, DM, and RF (write), starting one cycle after the previous instruction]

LDR R2, [R0, #40]
ADD R3, R9, R10
SUB R4, R1, R5
AND R5, R12, R13
STR R6, [R1, #20]
ORR R7, R11, #42
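
The cycle-by-cycle occupancy of this abstraction can be generated with a short sketch. It assumes an ideal five-stage pipeline with no hazards or stalls; `pipeline_schedule` and `total_cycles` are hypothetical helper names.

```python
STAGES = ["Fetch", "Decode", "Execute", "Memory", "Writeback"]

PROGRAM = [
    "LDR R2, [R0, #40]",
    "ADD R3, R9, R10",
    "SUB R4, R1, R5",
    "AND R5, R12, R13",
    "STR R6, [R1, #20]",
    "ORR R7, R11, #42",
]


def pipeline_schedule(program):
    """Map each instruction to {cycle: stage}, assuming no hazards or stalls."""
    schedule = {}
    for index, instr in enumerate(program):
        # Instruction `index` enters Fetch in cycle index + 1 (cycles are 1-based).
        schedule[instr] = {index + 1 + s: stage for s, stage in enumerate(STAGES)}
    return schedule


def total_cycles(program):
    """Cycles until the last instruction leaves Writeback."""
    return len(program) + len(STAGES) - 1


schedule = pipeline_schedule(PROGRAM)
```

Six instructions finish in 6 + 5 - 1 = 10 cycles, which matches the cycle axis of the diagram.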

Single-Cycle & Pipelined Datapath


[Figure: (a) single-cycle datapath; (b) pipelined datapath, with pipeline registers separating the Fetch, Decode, Execute, Memory, and Writeback stages and signals suffixed by stage (e.g., PCF, InstrD, SrcAE, ALUOutM, ResultW)]

Corrected Pipelined Datapath


[Figure: corrected pipelined datapath; the write address WA3 is carried through the pipeline as WA3D, WA3E, WA3M, WA3W]

• WA3 must arrive at the same time as Result
• Register file is written on the falling edge of CLK

Optimized Pipelined Datapath


[Figure: pipelined datapath before and after the optimization; the PCPlus8 adder is removed and PCPlus8D is taken from PCPlus4F]

Remove the adder by using PCPlus4F as PCPlus8D after PC has been updated to PC+4

Pipelined Processor Control


[Figure: pipelined processor with control unit; control signals (PCSrc, RegWrite, MemtoReg, MemWrite, ALUControl, Branch, ALUSrc, FlagWrite) are registered alongside the datapath with stage suffixes D, E, M, W]

• Same control unit as single-cycle processor
• Control delayed to proper pipeline stage

Pipeline Hazards
• Occur when an instruction depends on a result from an instruction that
hasn’t completed
• Types:
– Data hazard: register value not yet written back to register file
– Control hazard: next instruction not decided yet (caused by branch)

Data Hazard
[Figure: pipeline diagram over cycles 1 to 8; R1 is written by ADD and read by the instructions that follow]

ADD R1, R4, R5
AND R8, R1, R3
ORR R9, R6, R1
SUB R10, R1, R7

Handling Data Hazards


• Insert NOPs in code at compile time
• Rearrange code at compile time
• Forward data at run time
• Stall the processor at run time

Compile-Time Hazard Elimination


• Insert enough NOPs for result to be ready
• Or move independent useful instructions forward
[Figure: pipeline diagram over cycles 1 to 10 with two NOPs inserted after ADD]

ADD R1, R4, R5
NOP
NOP
AND R8, R1, R3
ORR R9, R6, R1
SUB R10, R1, R7

Data Forwarding
[Figure: pipeline diagram over cycles 1 to 8 with forwarding of R1]

ADD R1, R4, R5
AND R8, R1, R3
ORR R9, R6, R1
SUB R10, R1, R7

• Check if register read in Execute stage matches register
written in Memory or Writeback stage
• If so, forward result

Data Forwarding
[Figure: pipelined processor with Hazard Unit; inputs Match, RegWriteM, and RegWriteW; outputs ForwardAE and ForwardBE control three-input multiplexers (00, 01, 10) feeding SrcAE and SrcBE]



Data Forwarding
• Execute stage register matches Memory stage register?
Match_1E_M = (RA1E == WA3M)
Match_2E_M = (RA2E == WA3M)
• Execute stage register matches Writeback stage register?
Match_1E_W = (RA1E == WA3W)
Match_2E_W = (RA2E == WA3W)
• If it matches, forward the result:
if (Match_1E_M • RegWriteM) ForwardAE = 10;
else if (Match_1E_W • RegWriteW) ForwardAE = 01;
else ForwardAE = 00;
ForwardBE uses the same logic with Match_2E_M and Match_2E_W
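
The forwarding equations can be exercised in software. The following is a minimal sketch of the select logic for one source operand. The function name `forward_select` and the use of plain integers for register numbers are assumptions made for illustration; the signal names in the comments follow the slide.

```python
def forward_select(ra_e, wa3_m, wa3_w, reg_write_m, reg_write_w):
    """Return the 2-bit forwarding select for one Execute-stage source register."""
    match_m = ra_e == wa3_m  # Match_xE_M: source matches Memory-stage destination
    match_w = ra_e == wa3_w  # Match_xE_W: source matches Writeback-stage destination
    if match_m and reg_write_m:
        return 0b10  # forward from the Memory stage (checked first: most recent result)
    if match_w and reg_write_w:
        return 0b01  # forward from the Writeback stage
    return 0b00      # no match: use the value read from the register file


# AND R8, R1, R3 is in Execute while ADD R1, R4, R5 is in Memory.
# Register 9 in Writeback is an arbitrary unrelated destination.
forward_ae = forward_select(ra_e=1, wa3_m=1, wa3_w=9, reg_write_m=True, reg_write_w=True)
forward_be = forward_select(ra_e=3, wa3_m=1, wa3_w=9, reg_write_m=True, reg_write_w=True)
print(format(forward_ae, "02b"), format(forward_be, "02b"))
```

The Memory-stage match is tested first so that, when both later stages write the same register, the more recent result is the one forwarded.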


Stalling
[Pipeline timing diagram, cycles 1-8: LDR R1, [R4, #40]; AND R8, R1, R3; ORR R9, R6, R1; SUB R10, R1, R7.]
Trouble! LDR does not finish reading R1 from data memory until the end of cycle 4, but AND needs R1 at the start of cycle 4, so forwarding alone cannot resolve this hazard.


Stalling
[Pipeline timing diagram, cycles 1-9: the same sequence with a one-cycle stall. AND repeats its register read (Decode) and ORR repeats its fetch in cycle 4, so AND enters Execute in cycle 5, after LDR has read R1 from data memory.]


Stalling Hardware
[Figure: pipelined datapath with stalling hardware. The Hazard Unit additionally takes MemtoRegE as an input and drives StallF, StallD, FlushD, and FlushE, which connect to the enable (EN) and clear (CLR) inputs of the PC and pipeline registers.]


Stalling Logic
• Is either source register in the Decode stage the same as the register being written in the Execute stage?
Match_12D_E = (RA1D == WA3E) + (RA2D == WA3E)

• Is an LDR in the Execute stage AND is Match_12D_E true?
ldrStallD = Match_12D_E • MemtoRegE
StallF = StallD = FlushE = ldrStallD
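
The load-use stall condition above can be checked with a few lines of code. This is a minimal sketch under the assumption that register numbers are plain integers; the function name `ldr_stall` is hypothetical, and the signal names in the comments follow the slide.

```python
def ldr_stall(ra1_d, ra2_d, wa3_e, mem_to_reg_e):
    """Return True when the Decode-stage instruction must wait for a load in Execute."""
    # Match_12D_E: either Decode-stage source equals the Execute-stage destination
    match_12d_e = (ra1_d == wa3_e) or (ra2_d == wa3_e)
    # MemtoRegE is asserted only when the Execute-stage instruction is a load
    return match_12d_e and mem_to_reg_e


# LDR R1, [R4, #40] in Execute, AND R8, R1, R3 in Decode: stall required.
stall = ldr_stall(ra1_d=1, ra2_d=3, wa3_e=1, mem_to_reg_e=True)
stall_f = stall_d = flush_e = stall

# ADD R1, R4, R5 in Execute instead of a load: forwarding suffices, no stall.
no_stall = ldr_stall(ra1_d=1, ra2_d=3, wa3_e=1, mem_to_reg_e=False)
print(stall, no_stall)
```

Stalling Fetch and Decode holds the dependent instruction in place, while flushing Execute inserts the bubble that gives the load time to reach the Writeback stage.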


Control Hazards
• B:
• The branch is not resolved until the Writeback stage of the pipeline
• Instructions after the branch are fetched before the branch takes effect
• These four instructions must be flushed if the branch is taken

• Writes to the PC (R15) behave similarly


Control Hazards
[Pipeline timing diagram, cycles 1-10: the branch at address 20 (B 3C) is followed by AND R8, R1, R3 (24), ORR R9, R6, R1 (28), SUB R10, R1, R7 (2C), and SUB R11, R1, R8 (30), which are labelled "Flush these instructions". Execution resumes at ADD R12, R3, R4 (64).]

Branch misprediction penalty
• Number of instructions flushed when the branch is taken (4)
• May be reduced by determining the branch target address (BTA) earlier


Early Branch Resolution


• Determine BTA in Execute stage
– Branch misprediction penalty = 2 cycles
• Hardware changes
– Add a branch multiplexer before the PC register to select BTA from ALUResultE
– Add a BranchTakenE select signal for this multiplexer (only asserted if the branch condition is satisfied)
– PCSrcW is now only asserted for writes to the PC


Pipelined processor with Early BTA


[Figure: pipelined datapath with early branch resolution. A branch multiplexer in front of the PC register, selected by BranchTakenE, chooses ALUResultE as the next PC. Hazard Unit signals shown: Match, MemtoRegE, RegWriteM, RegWriteW, ForwardAE, ForwardBE, StallF, StallD, FlushD, FlushE.]


Control Hazards with Early BTA


[Pipeline timing diagram, cycles 1-10: with the branch at address 20 (B 3C) resolved in Execute, only AND R8, R1, R3 (24) and ORR R9, R6, R1 (28) are labelled "Flush these instructions". SUB R10, R1, R7 (2C) and SUB R11, R1, R8 (30) do not enter the pipeline, and execution resumes at ADD R12, R3, R4 (64).]

Control Stalling Logic


• PCWrPendingF = 1 if write to PC in Decode, Execute or Memory
PCWrPendingF = PCSrcD + PCSrcE + PCSrcM
• Stall Fetch if PCWrPendingF
StallF = ldrStallD + PCWrPendingF
• Flush Decode if PCWrPendingF OR PC is written in Writeback OR branch is taken
FlushD = PCWrPendingF + PCSrcW + BranchTakenE
• Flush Execute if ldrStallD OR branch is taken
FlushE = ldrStallD + BranchTakenE
• Stall Decode if ldrStallD (as before)
StallD = ldrStallD
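
The four control equations above can be collected into one function for experimentation. This is a minimal sketch: the `HazardInputs` record and the `control_signals` function are hypothetical names, while the signal names in the comments and dictionary keys are taken from the slide.

```python
from dataclasses import dataclass


@dataclass
class HazardInputs:
    """Inputs to the stall/flush logic; each field mirrors a signal on the slide."""
    ldr_stall_d: bool = False     # ldrStallD
    pc_src_d: bool = False        # PCSrcD
    pc_src_e: bool = False        # PCSrcE
    pc_src_m: bool = False        # PCSrcM
    pc_src_w: bool = False        # PCSrcW
    branch_taken_e: bool = False  # BranchTakenE


def control_signals(h: HazardInputs) -> dict:
    """Evaluate StallF, StallD, FlushD, and FlushE from the hazard inputs."""
    # A write to the PC is in flight in Decode, Execute, or Memory.
    pc_wr_pending_f = h.pc_src_d or h.pc_src_e or h.pc_src_m
    return {
        "StallF": h.ldr_stall_d or pc_wr_pending_f,
        "StallD": h.ldr_stall_d,
        "FlushD": pc_wr_pending_f or h.pc_src_w or h.branch_taken_e,
        "FlushE": h.ldr_stall_d or h.branch_taken_e,
    }


taken_branch = control_signals(HazardInputs(branch_taken_e=True))
load_use = control_signals(HazardInputs(ldr_stall_d=True))
print(taken_branch)
print(load_use)
```

A taken branch flushes Decode and Execute without stalling, which corresponds to the two-cycle penalty, while a load-use hazard stalls Fetch and Decode and flushes only Execute.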

ARM Pipelined Processor with Hazard Unit


[Figure: complete pipelined ARM datapath and control unit with the Hazard Unit. Signals shown at the Hazard Unit: Match, MemtoRegE, RegWriteM, RegWriteW, ForwardAE, ForwardBE, StallF, StallD, FlushD, FlushE. BranchTakenE selects the branch multiplexer in front of the PC register.]



Pipelined Performance Example


• SPECINT2000 benchmark:
– 25% loads
– 10% stores
– 13% branches
– 52% data-processing
• Suppose:
– 40% of loads are used by the next instruction
– 50% of branches are mispredicted
• What is the average CPI?
– Load CPI = 1 when not stalling, 2 when stalling
So, CPI_ldr = 1(0.6) + 2(0.4) = 1.4
– Branch CPI = 1 when predicted correctly, 3 when mispredicted
So, CPI_b = 1(0.5) + 3(0.5) = 2

Average CPI = (0.25)(1.4) + (0.1)(1) + (0.13)(2) + (0.52)(1) = 1.23
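
The weighted average above can be reproduced numerically. This is a minimal sketch using only the percentages given on the slide; the dictionary keys and variable names are chosen here for illustration.

```python
# Instruction mix from the slide (fractions of all instructions).
mix = {"load": 0.25, "store": 0.10, "branch": 0.13, "other": 0.52}

# Loads: 60% take 1 cycle, 40% stall for one extra cycle.
cpi_load = 1 * 0.6 + 2 * 0.4
# Branches: 50% take 1 cycle, 50% are mispredicted and cost 2 extra cycles.
cpi_branch = 1 * 0.5 + 3 * 0.5

cpi = {"load": cpi_load, "store": 1.0, "branch": cpi_branch, "other": 1.0}

# Average CPI is the mix-weighted sum of the per-class CPIs.
average_cpi = sum(mix[kind] * cpi[kind] for kind in mix)
print(round(average_cpi, 2))
```

Changing the stall or misprediction fractions in this sketch shows how sensitive the average CPI is to each hazard type.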


Pipelined Performance
• Pipelined processor critical path:
Tc3 = max [
    tpcq + tmem + tsetup,            (Fetch)
    2(tRFread + tsetup),             (Decode)
    tpcq + 2tmux + tALU + tsetup,    (Execute)
    tpcq + tmem + tsetup,            (Memory)
    2(tpcq + tmux + tRFwrite)        (Writeback)
]



Pipelined Performance Example


Element               Parameter   Delay (ps)
Register clock-to-Q   tpcq        40
Register setup        tsetup      50
Multiplexer           tmux        25
ALU                   tALU        120
Memory read           tmem        200
Register file read    tRFread     100
Register file setup   tRFsetup    60
Register file write   tRFwrite    70

Cycle time: Tc3 = 2(tRFread + tsetup) = 2[100 + 50] ps = 300 ps
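
To see why the Decode stage sets the cycle time, each term of the critical-path expression from the Pipelined Performance slide can be evaluated with the delays in the table. This is a minimal sketch; the dictionary layout and variable names are assumptions made for illustration.

```python
# Element delays in picoseconds, copied from the table.
t = {
    "pcq": 40, "setup": 50, "mux": 25, "ALU": 120,
    "mem": 200, "RFread": 100, "RFsetup": 60, "RFwrite": 70,
}

# One entry per pipeline stage, following the critical-path expression.
stage_delay_ps = {
    "Fetch": t["pcq"] + t["mem"] + t["setup"],
    "Decode": 2 * (t["RFread"] + t["setup"]),
    "Execute": t["pcq"] + 2 * t["mux"] + t["ALU"] + t["setup"],
    "Memory": t["pcq"] + t["mem"] + t["setup"],
    "Writeback": 2 * (t["pcq"] + t["mux"] + t["RFwrite"]),
}

# The slowest stage determines the pipelined cycle time Tc3.
critical_stage = max(stage_delay_ps, key=stage_delay_ps.get)
tc3_ps = stage_delay_ps[critical_stage]
print(critical_stage, tc3_ps)
```

The Decode and Writeback terms are doubled because each uses the register file for only half a cycle.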


Pipelined Performance Example


Program with 100 billion instructions
Execution Time = (# instructions) × CPI × Tc
= (100 × 10^9)(1.23)(300 × 10^-12)
= 36.9 seconds


Processor Performance Comparison

Processor      Execution Time (seconds)   Speedup (single-cycle as baseline)
Single-cycle   84                         1
Multicycle     140                        0.6
Pipelined      36.9                       2.28
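
The pipelined entry and the speedup column can be recomputed from the figures in this lecture. This is a minimal sketch: the single-cycle and multicycle times are taken from the table as given, and the variable names are chosen here for illustration.

```python
# Pipelined execution time = (# instructions) x CPI x Tc
instructions = 100e9   # 100 billion instructions
cpi = 1.23             # average CPI from the performance example
tc_seconds = 300e-12   # 300 ps cycle time

t_pipelined = instructions * cpi * tc_seconds

# Single-cycle and multicycle times are copied from the comparison table.
exec_time_s = {"Single-cycle": 84.0, "Multicycle": 140.0, "Pipelined": t_pipelined}

# Speedup is measured relative to the single-cycle processor.
speedup = {name: exec_time_s["Single-cycle"] / secs for name, secs in exec_time_s.items()}

for name in exec_time_s:
    print(f"{name:<13}{exec_time_s[name]:>7.1f} s  speedup {speedup[name]:.2f}")
```

A speedup below 1, as for the multicycle design here, means that processor is slower than the single-cycle baseline.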
