
CSE 379

Microprocessor Overview & Pipelining



Instruction Execution
What happens when an instruction is executed?
Fetch
The instruction is copied from memory into the processor
Program counter points to address where instruction resides
ARM uses r15 as the program counter (PC)
R15 is a general purpose register
Be careful!
Some instructions take on a slightly different meaning when r15 is the destination
Refer to manual (ADD or SUB as an example)
Note, every instruction must access memory since every instruction must be fetched!
Decode
The instruction consists of 32 ones and zeros
What do they mean?
What should the processor be doing?
Execute
Carry out the work of the instruction
What is done depends upon the instruction
Data Processing Instructions
Perform the computation (ADD, SUB, XOR, etc.)
Load/Store Instructions
Calculate the effective address as specified by the addressing mode
Memory Access
Compare/Test Instructions
Perform the compare/test
Set the flags of the CPSR accordingly
Branch
Calculate target address
Determine if branch is taken
Write the results back
CPSR, Register, Memory
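The fetch, decode, execute, and write-back steps above can be sketched as a loop. This is a minimal illustration only: the tuple instruction format, the `run` function, and the register names used as dictionary keys are hypothetical and are not the ARM encoding.

```python
def run(program, registers):
    """Run a toy program; each instruction is a tuple (op, dest, src1, src2)."""
    pc = 0  # program counter: index of the next instruction to fetch
    while pc < len(program):
        instruction = program[pc]      # fetch: copy the instruction from memory
        pc += 1                        # the PC now points at the next instruction
        op, dest, src1, src2 = instruction  # decode: split into fields
        if op == "ADD":                # execute: carry out the work
            result = registers[src1] + registers[src2]
        elif op == "SUB":
            result = registers[src1] - registers[src2]
        else:
            raise ValueError(f"unknown operation: {op}")
        registers[dest] = result       # write the result back
    return registers


program = [("ADD", "r0", "r1", "r2"), ("SUB", "r3", "r0", "r1")]
print(run(program, {"r1": 5, "r2": 7}))
```

Note that the PC is advanced during fetch, before the instruction executes. The same ordering is what makes r15 reads and exception return addresses need care on a real ARM.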



Instruction Execution Example
[Figure: three snapshots of the microprocessor (registers r0 to r12, SP, LR, PC, CPSR, ALU, control unit) and its memory map (nonvolatile memory, RAM, memory-mapped IO) while the instruction stored at address 0x40008004 executes.]
Snapshot 1 (Fetch): PC = 0x40008004, r4 = 0x40008020. The word 0xE5940008 at address 0x40008004 is copied into the processor.
Snapshot 2 (Decode): PC = 0x40008008. Instruction: LDR r0, [r4, #0x008]. The control unit instructs the ALU to compute the effective address by adding.
Snapshot 3 (Execute): the ALU adds 0x40008020 + 0x00000008. r0 is loaded with 0xC9E81005.
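The decode and address calculation in this example can be checked with a short sketch. It assumes the standard ARM single data transfer encoding with an immediate offset (bits 27-26 = 01, bit 25 = 0). The function names and the dictionary layout are hypothetical, and only the fields needed here are extracted.

```python
def decode_ldr_str_immediate(word):
    """Split a 32-bit ARM LDR/STR word (immediate offset form) into fields."""
    if (word >> 26) & 0b11 != 0b01 or (word >> 25) & 1:
        raise ValueError("not an immediate-offset LDR/STR encoding")
    return {
        "cond": (word >> 28) & 0xF,             # condition field (0xE = always)
        "pre_indexed": bool((word >> 24) & 1),  # P bit
        "add_offset": bool((word >> 23) & 1),   # U bit: add or subtract offset
        "load": bool((word >> 20) & 1),         # L bit: 1 = LDR, 0 = STR
        "rn": (word >> 16) & 0xF,               # base register number
        "rd": (word >> 12) & 0xF,               # destination register number
        "offset": word & 0xFFF,                 # 12-bit immediate offset
    }


def effective_address(fields, registers):
    """Compute the address accessed, as the ALU does during execute."""
    base = registers[fields["rn"]]
    if not fields["pre_indexed"]:
        return base  # post-indexed: the base address is used unmodified
    offset = fields["offset"] if fields["add_offset"] else -fields["offset"]
    return (base + offset) & 0xFFFFFFFF


fields = decode_ldr_str_immediate(0xE5940008)
address = effective_address(fields, {4: 0x40008020})
print(fields["load"], fields["rd"], fields["rn"], hex(fields["offset"]), hex(address))
# prints: True 0 4 0x8 0x40008028
```

The result agrees with the example: a load into r0, base register r4, offset 0x008, and an effective address of 0x40008020 + 0x00000008.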
Pipelining
A Simplified Model of the Processor
A functional unit is dedicated to each of the three tasks that take place when an instruction is executed
The control unit orchestrates the process






Functional Unit Utilization on Nonpipelined Processor
Overview
[Figure: the microprocessor (fetch unit, decode unit, ALU, control unit) drawn three times along a time axis, labeled Cycle 1, Cycle 2, and Cycle 3. Each cycle lasts one clock period T.]
Timing
Timed by crystal oscillator
Period T
Frequency 1/T
[Figure: timing diagram with tick marks every 20 ns from 0 ns to 120 ns. Instruction #1: Fetch, Decode, Execute. Instruction #2: Fetch, Decode, Execute. Each stage takes one clock period T.]
Observations
Only one of the three functional units is in use at a time
The Pipelined Processor
Instructions are overlapped
Multiple instructions executed simultaneously
Scalar Pipelined Processor
One instruction fetched per cycle
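As a baseline for the pipelined case, the nonpipelined timing can be sketched as follows. The function names are hypothetical. The 20 ns period is read off the tick marks of the timing diagram, and the sketch assumes each stage takes exactly one clock period.

```python
STAGES = ("Fetch", "Decode", "Execute")


def clock_frequency_mhz(period_ns):
    """Frequency is 1/T; with T in nanoseconds the result is in MHz."""
    return 1000 / period_ns


def nonpipelined_schedule(instruction_count, period_ns):
    """Return (instruction number, stage, start ns, end ns) tuples.

    Only one functional unit is busy at a time, so each instruction
    finishes all three stages before the next one is fetched.
    """
    schedule = []
    time_ns = 0
    for number in range(1, instruction_count + 1):
        for stage in STAGES:
            schedule.append((number, stage, time_ns, time_ns + period_ns))
            time_ns += period_ns
    return schedule


print(clock_frequency_mhz(20), "MHz")
for number, stage, start, end in nonpipelined_schedule(2, 20):
    print(f"Instruction #{number} {stage:8s} {start:3d} ns to {end:3d} ns")
```

Two instructions occupy six clock periods, which is the 0 ns to 120 ns span of the diagram.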
The Pipelined Architecture
[Figure: the microprocessor drawn once per cycle for Cycle 1 through Cycle 6, each lasting one clock period T. Contents of each functional unit:]
Cycle   Fetch Unit   Decode Unit   ALU
1       Inst. 1      -             -
2       Inst. 2      Inst. 1       -
3       Inst. 3      Inst. 2       Inst. 1
4       Inst. 4      Inst. 3       Inst. 2
5       Inst. 5      Inst. 4       Inst. 3
6       Inst. 6      Inst. 5       Inst. 4
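The way instructions move through the functional units can be generated with a short sketch. It assumes an ideal three-stage pipeline with no stalls, in which instruction n enters stage d (counting from 0) in cycle n + d. The function name and the dictionary layout are hypothetical.

```python
def pipeline_occupancy(instruction_count, cycle_count,
                       stages=("Fetch", "Decode", "Execute")):
    """For each cycle, map every stage to the instruction number it holds.

    A stage maps to None when it is idle (pipeline filling or draining).
    """
    table = []
    for cycle in range(1, cycle_count + 1):
        row = {}
        for depth, stage in enumerate(stages):
            number = cycle - depth  # instruction held by this stage, if any
            row[stage] = number if 1 <= number <= instruction_count else None
        table.append(row)
    return table


for cycle, row in enumerate(pipeline_occupancy(6, 6), start=1):
    print(cycle, row)
```

From cycle 3 onward all three units are busy in every cycle, in contrast to the nonpipelined processor, where only one unit is in use at a time.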
The Timing Diagram
[Figure: ARM7 Three-stage Pipeline. Timing diagram in which the Fetch, Decode, and Execute stages of successive instructions overlap. The time axis is marked in clock periods T, and the initial cycles are labeled "Time Required to Fill Pipeline".]
Throughput
Number of instructions executed over time
Instruction Latency
Time it takes for an individual instruction to execute
What happens when an instruction is dependent upon another instruction?
Hazard Exists
Types of Hazards
Structural
Multiple instructions require the same hardware simultaneously
Data
Instruction needs results from a previous instruction
Control
Instruction after a branch is fetched, but may not be executed
At time of fetch, it is unknown if branch will occur
[Figure: pipeline timing diagram with tick marks every 20 ns from 0 ns to 160 ns, for instructions labeled BIC, SUB, ORR, EOR, BLT, and ADD. (STALL) entries in the Decode and Execute stages are marked "Stalls due to BLT". The time required to fill the pipeline is also marked.]
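Throughput, latency, and the cost of stalls can be put into numbers with a small sketch. It assumes one clock period per stage and treats the number of stall cycles as a given input. The function names and the stall count in the example are hypothetical and are not taken from the diagram.

```python
def pipelined_time_ns(instruction_count, period_ns, depth=3, stall_cycles=0):
    """Total time: fill the pipeline, then finish one instruction per cycle.

    The first instruction needs `depth` cycles, each later instruction
    adds one cycle, and every stall adds one more cycle.
    """
    cycles = depth + (instruction_count - 1) + stall_cycles
    return cycles * period_ns


def latency_ns(period_ns, depth=3):
    """Time for one instruction to pass through all stages without stalls."""
    return depth * period_ns


def throughput_per_us(instruction_count, total_ns):
    """Instructions executed per microsecond."""
    return instruction_count * 1000 / total_ns


ideal = pipelined_time_ns(6, 20)
stalled = pipelined_time_ns(6, 20, stall_cycles=2)
print(ideal, "ns,", throughput_per_us(6, ideal), "instructions per microsecond")
print(stalled, "ns,", throughput_per_us(6, stalled), "instructions per microsecond")
print("latency:", latency_ns(20), "ns")
```

Pipelining leaves the latency of an individual instruction unchanged (three periods here) and raises throughput. Stalls caused by hazards give part of that gain back.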
Pipelining in the ARM Architecture
ARM7
[Figure: ARM7 Three-stage Pipeline. Stages: Fetch, Decode, Execute. Time axis in clock periods T, with the time required to fill the pipeline marked.]
ARM9
[Figure: ARM9 Five-stage Pipeline. Stages: Fetch, Decode, Execute, Memory, Write. Time axis in clock periods T, with the time required to fill the pipeline marked.]
13% throughput increase over ARM7 when running Dhrystone benchmark
ARM10










34% throughput increase over ARM7 when running Dhrystone benchmark

What about the ARM8?




Backward compatibility was maintained as these architectures evolved
Code written for ARM7 will run on ARM10

The Effects of Pipelining on Assembly Language
Consider exception handling
Prefetch abort stores address of aborted instruction + 4 in LR
Data abort stores address of aborted instruction + 8 in LR
The difference is due to the fact that the aborted instruction is in a different stage of the pipeline when the abort occurs
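An abort handler has to undo these offsets to find the instruction that aborted. A minimal sketch, assuming the two offsets stated above; the function name, the dictionary, and the example address are hypothetical.

```python
# Offset between the aborted instruction's address and the value saved in LR.
LR_OFFSET = {"prefetch_abort": 4, "data_abort": 8}


def aborted_instruction_address(lr, abort_kind):
    """Recover the address of the aborted instruction from the saved LR."""
    return (lr - LR_OFFSET[abort_kind]) & 0xFFFFFFFF


# Example: the instruction at 0x40008004 aborts.
print(hex(aborted_instruction_address(0x40008008, "prefetch_abort")))
print(hex(aborted_instruction_address(0x4000800C, "data_abort")))
```

Both calls yield 0x40008004. The saved LR differs by 4 between the two cases because the PC has advanced by one more instruction by the time a data abort is detected.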
The Vector Table
When the PC is updated with an address stored in memory, the PC-relative load must account for the fact that the PC has been incremented
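The adjustment can be sketched numerically. The sketch assumes ARM state, where reading the PC yields the address of the current instruction + 8 (two instructions ahead). The function names and the example addresses are hypothetical.

```python
PC_READ_AHEAD = 8  # assumption: ARM state, PC reads as instruction address + 8


def pc_relative_offset(instruction_address, literal_address):
    """Offset to encode so that a PC-relative load reads literal_address."""
    return literal_address - (instruction_address + PC_READ_AHEAD)


def pc_relative_target(instruction_address, offset):
    """Address actually read by a PC-relative load with the given offset."""
    return (instruction_address + PC_READ_AHEAD + offset) & 0xFFFFFFFF


# Example: a vector table entry at 0x00000018 loads the PC from a word
# stored at 0x00000038.
offset = pc_relative_offset(0x00000018, 0x00000038)
print(hex(offset), hex(pc_relative_target(0x00000018, offset)))
# prints: 0x18 0x38
```

The encoded offset is 0x18 rather than 0x20 because the PC is already 8 bytes ahead of the load instruction when the address is calculated.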

References
Kris Schindler, Introduction to Microprocessor Based Systems Using the ARM Processor, Second Edition, Pearson, 2013
[Figure: ARM10 Six-stage Pipeline. Stages: Fetch, Issue, Decode, Execute, Memory, Write. Time axis in clock periods T, with the time required to fill the pipeline marked.]
