Processor Overview and Pipelining

CSE 379
Microprocessor Overview & Pipelining

Instruction Execution
What happens when an instruction is executed?
Fetch
The instruction is copied from memory into the processor
Program counter points to address where instruction resides
ARM uses r15 as the program counter (PC)
R15 is a general purpose register
Be careful!
Some instructions take on a slightly different meaning when r15 is the destination
Refer to the manual (ADD or SUB as an example)
Note that every instruction must access memory, since every instruction must be fetched!
Decode
The instruction consists of 32 ones and zeros
What do they mean?
What should the processor be doing?
Execute
Carry out the work of the instruction
What is done depends upon the instruction
Data Processing Instructions
Perform the computation (ADD, SUB, XOR, etc.)
Load/Store Instructions
Calculate the effective address as specified by the addressing mode
Memory Access
Compare/Test Instructions
Perform the compare/test
Set the flags of the CPSR accordingly
Branch
Calculate target address
Determine if the branch is taken
Write the results back
CPSR, Register, Memory
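
The fetch, decode, execute, and write-back steps above can be sketched as a toy interpreter loop. This is a minimal sketch in Python, not ARM: the tuple instruction format, the `run` helper, and the three opcodes it accepts are assumptions made only for illustration.

```python
# Toy fetch-decode-execute loop. The (opcode, rd, rn, operand) tuple format is
# hypothetical and far simpler than a real 32-bit ARM encoding.

def run(program, regs, memory):
    pc = 0  # program counter: index of the next instruction to fetch
    while pc < len(program):
        instr = program[pc]              # Fetch: copy the instruction into the processor
        pc += 1                          # the PC now points at the following instruction
        op, rd, rn, operand = instr      # Decode: split the instruction into its fields
        if op == "ADD":                  # Execute (data processing): perform the computation
            result = regs[rn] + operand
        elif op == "SUB":
            result = regs[rn] - operand
        elif op == "LDR":                # Execute (load): effective address, then memory access
            result = memory[regs[rn] + operand]
        else:
            raise ValueError(f"unknown opcode {op}")
        regs[rd] = result & 0xFFFFFFFF   # Write back: store the result in a register
    return regs


regs = run(
    [("ADD", 1, 0, 5), ("SUB", 2, 1, 2), ("LDR", 3, 2, 1)],
    regs=[0] * 4,
    memory={4: 0x1234},
)
print([hex(r) for r in regs])
```

Note that the fetch happens on every pass through the loop, regardless of the opcode, which is the point made under Fetch above.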

Instruction Execution Example
[Figure: three snapshots of the microprocessor (registers r0-r12, R13 SP, R14 LR, R15 PC, CPSR, ALU, Control Unit) and its memory map (Nonvolatile Memory from 0x00000000, RAM from 0x40000000, Memory-Mapped IO from 0xE000C000) while one instruction executes. r4 holds 0x40008020 throughout.]
Fetch
PC = 0x40008004
The instruction word 0xE5940008 is copied from address 0x40008004 into the processor
Decode
PC = 0x40008008
0xE5940008 is decoded as LDR r0, [r4, #0x008]
The control unit instructs the ALU to compute the effective address by adding
Execute
The ALU adds 0x40008020 (r4) and 0x00000008 (the offset)
r0 is loaded with 0xC9E81005
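
The decode and execute steps of this example can be checked by hand. The following is a minimal sketch, assuming the standard ARM single-data-transfer encoding with an immediate offset; the function names `decode_ldr_str_immediate` and `effective_address` are illustrative and do not come from the course material.

```python
def decode_ldr_str_immediate(word):
    """Split a 32-bit ARM LDR/STR word (immediate-offset form) into its fields."""
    if (word >> 26) & 0b11 != 0b01:
        raise ValueError("not a single data transfer instruction")
    if (word >> 25) & 1:
        raise ValueError("register-offset form is not handled in this sketch")
    return {
        "cond": (word >> 28) & 0xF,             # 0xE means "always"
        "pre_indexed": bool((word >> 24) & 1),  # P bit
        "add_offset": bool((word >> 23) & 1),   # U bit: add (1) or subtract (0)
        "byte": bool((word >> 22) & 1),         # B bit
        "writeback": bool((word >> 21) & 1),    # W bit
        "load": bool((word >> 20) & 1),         # L bit: LDR (1) or STR (0)
        "rn": (word >> 16) & 0xF,               # base register
        "rd": (word >> 12) & 0xF,               # destination/source register
        "offset": word & 0xFFF,                 # 12-bit immediate offset
    }


def effective_address(fields, regs):
    """Compute the address the transfer uses, as the ALU would."""
    base = regs[fields["rn"]]
    if not fields["pre_indexed"]:
        return base  # post-indexed: the unmodified base is used for the access
    delta = fields["offset"] if fields["add_offset"] else -fields["offset"]
    return (base + delta) & 0xFFFFFFFF


fields = decode_ldr_str_immediate(0xE5940008)
address = effective_address(fields, {4: 0x40008020})
print(fields["load"], fields["rd"], fields["rn"], hex(fields["offset"]), hex(address))
```

With r4 = 0x40008020, the sum computed here is the address the load reads from.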
Pipelining
A Simplified Model of the Processor
A functional unit is dedicated to each of the three tasks that take place when an instruction is executed
The control unit orchestrates the process

Functional Unit Utilization on Nonpipelined Processor
Fetch

Decode

Execute

[Figures: four block diagrams of the microprocessor, each showing the ALU, Control Unit, Fetch Unit, and Decode Unit.]
Overview

Timing
Timed by crystal oscillator
Period T
Frequency 1/T

Observations
Only one of the three functional units is in use at a time
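
The timing relationships above can be worked through numerically. This is a small sketch assuming a 20 ns clock period (the tick spacing of the timing diagram) and three clock periods per instruction; the constant and function names are illustrative.

```python
T_NS = 20    # clock period T in nanoseconds (assumed from the diagram's 20 ns ticks)
STAGES = 3   # fetch, decode, execute


def frequency_mhz(period_ns):
    """f = 1/T; a period in nanoseconds gives a frequency in MHz via 1000/T."""
    return 1000 / period_ns


def nonpipelined_time_ns(n_instructions, period_ns=T_NS, stages=STAGES):
    """Each instruction occupies the processor for `stages` full clock periods."""
    return n_instructions * stages * period_ns


# On a nonpipelined processor each functional unit is busy one period out of three.
utilization = 1 / STAGES

print(frequency_mhz(T_NS), nonpipelined_time_ns(2), utilization)
```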
The Pipelined Processor
Instructions are overlapped
Multiple instructions executed simultaneously
Scalar Pipelined Processor
One instruction fetched per cycle
[Figure: the microprocessor (ALU, Control Unit, Fetch Unit, Decode Unit) in Cycle 1, Cycle 2, and Cycle 3, each lasting one clock period T.]
[Figure: timing diagram from 0 ns to 120 ns in steps of T = 20 ns, where T is the clock cycle time (clock period). Instruction #1 is fetched, decoded, and executed, then Instruction #2 is fetched, decoded, and executed.]
The Pipelined Architecture

[Figure: the microprocessor in Cycle 1 through Cycle 6, each lasting one clock period T. The Fetch Unit holds Inst. 1 through Inst. 6 in successive cycles, the Decode Unit holds Inst. 1 through Inst. 5 starting in Cycle 2, and the ALU holds Inst. 1 through Inst. 4 starting in Cycle 3.]
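
The overlap shown for the pipelined architecture can be reproduced with a few lines of code. This is a minimal sketch of an ideal three-stage pipeline with no stalls; `pipeline_occupancy` is a hypothetical helper name.

```python
def pipeline_occupancy(n_instructions, n_cycles, stages=("Fetch", "Decode", "Execute")):
    """For each cycle, report which instruction number (or None) occupies each stage."""
    table = []
    for cycle in range(1, n_cycles + 1):
        row = {}
        for depth, stage in enumerate(stages):
            inst = cycle - depth  # instruction i reaches the stage at `depth` in cycle i + depth
            row[stage] = inst if 1 <= inst <= n_instructions else None
        table.append(row)
    return table


occupancy = pipeline_occupancy(n_instructions=6, n_cycles=6)
for cycle, row in enumerate(occupancy, start=1):
    print(f"Cycle {cycle}: {row}")
```

From the third cycle onward every functional unit is busy, in contrast with the nonpipelined case.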
The Timing Diagram

Throughput
Number of instructions executed per unit of time
Instruction Latency
Time it takes for an individual instruction to execute
What happens when an instruction is dependent upon another instruction?
Hazard Exists
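
Setting hazards aside for a moment, throughput and latency can be compared for the pipelined and nonpipelined cases. The sketch below assumes an ideal three-stage pipeline with no stalls and a 20 ns clock period; the function names are illustrative.

```python
T_NS = 20    # assumed clock period in nanoseconds
STAGES = 3   # fetch, decode, execute


def nonpipelined_cycles(n, stages=STAGES):
    """Every instruction runs start to finish before the next one begins."""
    return stages * n


def pipelined_cycles(n, stages=STAGES):
    """The first instruction takes `stages` cycles; each later one completes one cycle after."""
    return stages + n - 1


n = 100
latency_ns = STAGES * T_NS  # unchanged by pipelining: one instruction still needs 3 cycles
speedup = nonpipelined_cycles(n) / pipelined_cycles(n)
throughput_per_us = n / (pipelined_cycles(n) * T_NS) * 1000  # instructions per microsecond

print(latency_ns, pipelined_cycles(n), nonpipelined_cycles(n), round(speedup, 2))
```

Pipelining improves throughput, approaching one instruction per cycle for long sequences, while the latency of an individual instruction stays the same.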

Types of Hazards
Structural
Multiple instructions require the same hardware simultaneously
Data
Instruction needs results from a previous instruction
Control
Instruction after a branch is fetched, but may not be executed
At the time of fetch, it is unknown whether the branch will be taken
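
A data hazard of the kind listed above can be detected by comparing the register an earlier instruction writes with the registers a later instruction reads. This is a minimal sketch; the `Instr` record and `has_data_hazard` function are hypothetical, and whether such a dependency actually causes a stall depends on the specific pipeline.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    text: str                 # assembly text, for display only
    dest: Optional[str]       # register written, or None if nothing is written
    sources: Tuple[str, ...]  # registers read


def has_data_hazard(earlier: Instr, later: Instr) -> bool:
    """True when `later` reads a register that `earlier` writes."""
    return earlier.dest is not None and earlier.dest in later.sources


add = Instr("ADD r1, r2, r3", dest="r1", sources=("r2", "r3"))
sub = Instr("SUB r4, r1, r5", dest="r4", sources=("r1", "r5"))
eor = Instr("EOR r6, r7, r8", dest="r6", sources=("r7", "r8"))

print(has_data_hazard(add, sub))  # SUB reads r1, which ADD writes
print(has_data_hazard(add, eor))  # EOR touches none of ADD's registers
```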

[Figure: ARM7 Three-stage Pipeline timing diagram. Successive instructions pass through Fetch, Decode, and Execute, each offset by one clock period T (clock cycle time), with the time required to fill the pipeline marked.]
[Figure: timing diagram from 0 ns to 160 ns in steps of T = 20 ns for instructions labeled BLT, ADD, BIC, SUB, OR, and EOR. Fetch, Decode, and Execute slots marked (STALL) show the stalls due to BLT, and the time required to fill the pipeline is marked.]
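
The cost of stalls like those caused by the branch can be estimated by adding penalty cycles to the ideal pipelined cycle count. This is a rough sketch, not a model of the exact diagram: it assumes a three-stage pipeline and a penalty of two cycles per taken branch (the two instructions already fetched and decoded behind the branch are discarded). Both the penalty value and the function name are assumptions.

```python
def cycles_with_branch_stalls(n_completed, taken_branches, stages=3, penalty=2):
    """Ideal pipelined cycle count plus a fixed penalty for each taken branch."""
    ideal = stages + n_completed - 1
    return ideal + taken_branches * penalty


T_NS = 20  # assumed clock period in nanoseconds

no_branch = cycles_with_branch_stalls(n_completed=6, taken_branches=0)
one_branch = cycles_with_branch_stalls(n_completed=6, taken_branches=1)
print(no_branch * T_NS, one_branch * T_NS)
```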
Pipelining in the ARM Architecture
ARM7

ARM9

13% throughput increase over ARM7 when running the Dhrystone benchmark

[Figure: ARM7 Three-stage Pipeline timing diagram. Stages: Fetch, Decode, Execute. Successive instructions are offset by one clock period T, with the time required to fill the pipeline marked.]
[Figure: ARM9 Five-stage Pipeline timing diagram. Stages: Fetch, Decode, Execute, Memory, Write. Successive instructions are offset by one clock period T, with the time required to fill the pipeline marked.]
ARM10

34% throughput increase over ARM7 when running the Dhrystone benchmark

What about the ARM8?

Backward compatibility was maintained as these architectures evolved
Code written for ARM7 will run on ARM10

The Effects of Pipelining on Assembly Language
Consider exception handling
Prefetch abort stores the address of the aborted instruction + 4 in LR
Data abort stores the address of the aborted instruction + 8 in LR

The difference is due to the fact that the aborted instruction is in a different stage of the pipeline when the abort occurs
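
The link register values for the two abort types can be illustrated with a short calculation: LR holds the aborted instruction's address plus 4 after a prefetch abort and plus 8 after a data abort. The sketch below uses hypothetical helper names and an arbitrary example address.

```python
LR_OFFSET = {"prefetch": 4, "data": 8}  # bytes added to the aborted instruction's address


def lr_after_abort(instruction_address, abort_kind):
    """Value placed in LR when the instruction at `instruction_address` aborts."""
    return (instruction_address + LR_OFFSET[abort_kind]) & 0xFFFFFFFF


def aborted_instruction_address(lr, abort_kind):
    """Recover the address of the aborted instruction from the saved LR."""
    return (lr - LR_OFFSET[abort_kind]) & 0xFFFFFFFF


addr = 0x40008004  # example address, chosen only for illustration
print(hex(lr_after_abort(addr, "prefetch")), hex(lr_after_abort(addr, "data")))
```

This is why an abort handler that retries the aborted instruction conventionally returns with SUBS PC, LR, #4 after a prefetch abort and SUBS PC, LR, #8 after a data abort.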
The Vector Table
When the PC is updated with an address stored in memory, the PC-relative load must account for the fact that the PC has been incremented
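
The adjustment for the incremented PC can be made concrete. In ARM state, an instruction that reads the PC sees its own address plus 8, so the offset in a PC-relative load is measured from that value. The sketch below assumes a hypothetical table layout in which the handler address for the IRQ vector (at 0x00000018) is stored at 0x00000038; the function names are illustrative.

```python
PC_READ_AHEAD = 8  # in ARM state, reading the PC yields the instruction's address + 8


def pc_relative_load_address(instruction_address, offset):
    """Address read by LDR PC, [PC, #offset] located at `instruction_address`."""
    return (instruction_address + PC_READ_AHEAD + offset) & 0xFFFFFFFF


def offset_for(instruction_address, literal_address):
    """Offset to encode so that the load reaches `literal_address`."""
    return literal_address - (instruction_address + PC_READ_AHEAD)


irq_vector = 0x00000018     # IRQ exception vector
handler_slot = 0x00000038   # hypothetical location holding the handler's address

offset = offset_for(irq_vector, handler_slot)
print(hex(offset), hex(pc_relative_load_address(irq_vector, offset)))
```

Under these assumptions the vector entry would be written as LDR PC, [PC, #0x18], not #0x20, because 8 bytes of the distance are already accounted for by the PC itself.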

References
Kris Schindler, Introduction to Microprocessor Based Systems Using the ARM Processor, Second Edition, Pearson, 2013
[Figure: ARM10 Six-stage Pipeline timing diagram. Stages: Fetch, Issue, Decode, Execute, Memory, Write. Successive instructions are offset by one clock period T, with the time required to fill the pipeline marked.]
