Processor Overview and Pipelining
[Figure: a microprocessor (control unit, ALU, and registers r0 to r12, R13 SP, R14 LR, R15 PC, and CPSR) connected to memory-mapped I/O (addresses from 0xE000C000), RAM (addresses from 0x40000000), and nonvolatile memory (addresses from 0x00000000), each drawn as Address/Contents columns. RAM address 0x40008004 holds 0xE5940008.]
Instruction: LDR r0, [r4, #0x008]
The control unit instructs the ALU to compute the effective address by adding (the figure shows the operands 0x40008020 and 0x00000008)
[Register values shown in the figure: r0 = 0xC9E81005, R15 PC = 0x40008008.]
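As a rough illustration of what happens to the instruction word in the figure, the sketch below decodes 0xE5940008 using the ARM single-data-transfer (LDR/STR, immediate offset) field layout and then computes the effective address. The function names are made up for this example, and the value of r4 (0x40008020) is an assumption read from the figure. Only the immediate-offset form is handled.

```python
def decode_ldr_str_immediate(word):
    """Decode an ARM LDR/STR instruction that uses an immediate offset.

    Illustrative sketch only: any other encoding is rejected.
    """
    if (word >> 25) & 0b111 != 0b010:
        raise ValueError("not an immediate-offset LDR/STR encoding")
    return {
        "cond": (word >> 28) & 0xF,             # 0xE means "always"
        "pre_indexed": bool((word >> 24) & 1),  # P bit
        "add_offset": bool((word >> 23) & 1),   # U bit: 1 = add, 0 = subtract
        "load": bool((word >> 20) & 1),         # L bit: 1 = LDR, 0 = STR
        "rn": (word >> 16) & 0xF,               # base register number
        "rd": (word >> 12) & 0xF,               # destination register number
        "offset": word & 0xFFF,                 # 12-bit immediate offset
    }


def effective_address(fields, registers):
    """Compute the address the ALU would produce for the decoded fields."""
    base = registers[fields["rn"]]
    if not fields["pre_indexed"]:
        return base  # post-indexed: the base itself is the access address
    offset = fields["offset"] if fields["add_offset"] else -fields["offset"]
    return (base + offset) & 0xFFFFFFFF


INSTRUCTION_WORD = 0xE5940008   # contents of RAM address 0x40008004
registers = {4: 0x40008020}     # assumed value of r4, read from the figure

fields = decode_ldr_str_immediate(INSTRUCTION_WORD)
address = effective_address(fields, registers)
mnemonic = "LDR" if fields["load"] else "STR"
print(f"{mnemonic} r{fields['rd']}, [r{fields['rn']}, #0x{fields['offset']:03X}]")
print(f"effective address = 0x{address:08X}")
```

Under these assumptions the decoded fields match the instruction shown on the slide, and the effective address is the base in r4 plus the offset 0x008.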
Pipelining
A Simplified Model of the Processor
A functional unit is dedicated to each of the three tasks that take place when an
instruction is executed
The control unit orchestrates the process
Functional Unit Utilization on Nonpipelined Processor
[Figure: four views of the microprocessor (control unit, fetch unit, decode unit, ALU), labeled Overview, Fetch, Decode, and Execute.]
Timing
Timed by crystal oscillator
Period = T
Frequency = 1/T
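As a quick worked example of the relationship between period and frequency, the sketch below converts a clock period into a clock frequency. The 20 ns period is an assumption taken from the 20 ns steps on the timing diagrams in these slides, and the function name is illustrative.

```python
def clock_frequency_hz(period_s):
    """Return the clock frequency in hertz for a clock period in seconds."""
    if period_s <= 0:
        raise ValueError("clock period must be positive")
    return 1.0 / period_s


T = 20e-9  # assumed clock period: 20 ns
f = clock_frequency_hz(T)
print(f"T = {T * 1e9:.0f} ns gives f = {f / 1e6:.0f} MHz")
```

A 20 ns period corresponds to a clock of about 50 MHz.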
Observations
Only one of the three functional units is in use at a time
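The observation can be made concrete with a small model of the nonpipelined processor, in which each instruction passes through Fetch, Decode, and Execute before the next one starts. This is a sketch under the simplifying assumption that every stage takes exactly one clock period. The 20 ns period and all names are illustrative.

```python
STAGES = ("Fetch", "Decode", "Execute")


def nonpipelined_schedule(num_instructions):
    """Return one (cycle, instruction, active_stage) tuple per clock cycle."""
    schedule = []
    cycle = 1
    for instruction in range(1, num_instructions + 1):
        for stage in STAGES:
            schedule.append((cycle, instruction, stage))
            cycle += 1
    return schedule


def unit_utilization(schedule):
    """Fraction of unit-cycles in which a functional unit is doing work."""
    total_cycles = len(schedule)
    busy_unit_cycles = len(schedule)  # exactly one unit is busy in each cycle
    return busy_unit_cycles / (total_cycles * len(STAGES))


T_NS = 20  # assumed clock period in nanoseconds
schedule = nonpipelined_schedule(2)
for cycle, instruction, stage in schedule:
    print(f"cycle {cycle}: instruction #{instruction} in {stage}")
total_time_ns = len(schedule) * T_NS
utilization = unit_utilization(schedule)
print(f"total time: {total_time_ns} ns, utilization: {utilization:.2f}")
```

In this model two instructions occupy six clock periods, and each functional unit sits idle for two cycles out of every three.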
The Pipelined Processor
Instructions are overlapped
Multiple instructions executed simultaneously
Scalar Pipelined Processor
One instruction fetched per cycle
[Figure: the microprocessor (control unit, fetch unit, decode unit, ALU) shown in Cycle 1, Cycle 2, and Cycle 3 along a time axis, each cycle lasting T.]
[Timing diagram: Instruction #1 and Instruction #2 each pass through Fetch, Decode, and Execute, one stage per clock period T; time axis from 0 ns to 120 ns in 20 ns steps.]
T = Clock Cycle Time (Clock Period)
The Pipelined Architecture
[Figure: the pipelined microprocessor in Cycles 1 to 6, each lasting T. Inst. 1 to Inst. 6 enter the fetch unit in successive cycles and advance to the decode unit and then the ALU, so up to three instructions are in progress at once.]
T = Clock Cycle Time (Clock Period)
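The cycle-by-cycle occupancy of the three units can be tabulated with a short sketch. It assumes an ideal three-stage pipeline with no stalls, in which instruction i is fetched in cycle i; the function name and table layout are illustrative.

```python
STAGES = ("Fetch", "Decode", "Execute")


def pipeline_occupancy(num_instructions, num_cycles):
    """Return one dict per cycle mapping each stage to an instruction number.

    A stage that holds no instruction in a given cycle maps to None.
    """
    table = []
    for cycle in range(1, num_cycles + 1):
        row = {}
        for depth, stage in enumerate(STAGES):
            instruction = cycle - depth  # instruction i is fetched in cycle i
            in_range = 1 <= instruction <= num_instructions
            row[stage] = instruction if in_range else None
        table.append(row)
    return table


table = pipeline_occupancy(num_instructions=6, num_cycles=6)
for cycle, row in enumerate(table, start=1):
    cells = ", ".join(f"{stage}: {row[stage]}" for stage in STAGES)
    print(f"Cycle {cycle}: {cells}")
```

From the third cycle onward all three units are busy at once, which is the overlap the pipelined architecture relies on.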
The Timing Diagram
Throughput
Number of instructions executed over time
Instruction Latency
Time it takes for an individual instruction to execute
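To make the two terms concrete, the sketch below compares total time, latency, and throughput for an ideal nonpipelined and pipelined processor. It assumes three stages, one clock period per stage, no stalls, and a 20 ns clock period; these figures and the function names are illustrative assumptions rather than measured values.

```python
def nonpipelined_time_ns(num_instructions, stages, period_ns):
    """Each instruction finishes all stages before the next one starts."""
    return num_instructions * stages * period_ns


def pipelined_time_ns(num_instructions, stages, period_ns):
    """One instruction completes per cycle once the pipeline has filled."""
    return (stages + num_instructions - 1) * period_ns


def latency_ns(stages, period_ns):
    """Time for one instruction to pass through every stage."""
    return stages * period_ns


def throughput_per_us(num_instructions, total_ns):
    """Instructions completed per microsecond."""
    return num_instructions / total_ns * 1000


N, STAGES, T_NS = 6, 3, 20  # assumed workload and clock period
slow = nonpipelined_time_ns(N, STAGES, T_NS)
fast = pipelined_time_ns(N, STAGES, T_NS)
print(f"latency per instruction: {latency_ns(STAGES, T_NS)} ns in both cases")
print(f"nonpipelined: {slow} ns, {throughput_per_us(N, slow):.1f} instr/us")
print(f"pipelined:    {fast} ns, {throughput_per_us(N, fast):.1f} instr/us")
```

In this idealized model pipelining leaves the latency of an individual instruction unchanged and raises throughput, with the gain approaching the number of stages as the instruction count grows.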
What happens when an instruction is dependent upon another instruction?
Hazard Exists
Types of Hazards
Structural
Multiple instructions require the same hardware simultaneously
Data
Instruction needs results from a previous instruction
Control
Instruction after a branch is fetched, but may not be executed
At time of fetch, it is unknown if branch will occur
[Timing diagram: ARM7 Three-stage Pipeline. Successive instructions overlap in the Fetch, Decode, and Execute stages, each stage lasting one clock period T. The initial clock periods are marked as the time required to fill the pipeline.]
T = Clock Cycle Time (Clock Period)
[Timing diagram: stalls due to BLT in the three-stage pipeline. Instructions labeled BIC, SUB, OR, EOR, BLT, and ADD pass through Fetch, Decode, and Execute, and some Decode and Execute slots are marked (STALL). The initial clock periods are marked as the time required to fill the pipeline. Time axis from 0 ns to 160 ns in 20 ns steps.]
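The cost of a control hazard can be estimated with a simplified cycle count. The sketch below assumes that every taken branch discards the instructions already fetched and decoded behind it, costing (stages - 1) cycles, and that nothing else stalls. The instruction trace is hypothetical, reusing mnemonics from the diagram, and is not a reconstruction of it.

```python
def cycles_with_branches(trace, stages=3):
    """Estimate the cycles needed to run a dynamic instruction trace.

    trace is a list of (mnemonic, is_taken_branch) pairs in execution order.
    Simplified model: (stages - 1) cycles to fill the pipeline, one cycle per
    instruction, plus (stages - 1) lost cycles for each taken branch.
    """
    fill_cycles = stages - 1
    flush_penalty = stages - 1
    taken_branches = sum(1 for _, taken in trace if taken)
    return fill_cycles + len(trace) + taken_branches * flush_penalty


# Hypothetical executed instruction stream; BLT is the only branch.
branch_taken = [("BIC", False), ("SUB", False), ("BLT", True), ("ADD", False)]
branch_not_taken = [("BIC", False), ("SUB", False), ("BLT", False), ("ADD", False)]

cycles_taken = cycles_with_branches(branch_taken)
cycles_not_taken = cycles_with_branches(branch_not_taken)
print(f"BLT taken:     {cycles_taken} cycles")
print(f"BLT not taken: {cycles_not_taken} cycles")
```

Under this model the same four instructions cost two extra cycles when the branch is taken, because the work already started on the instructions behind the branch is thrown away.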
Pipelining in the ARM Architecture
ARM7
[Timing diagram: ARM7 Three-stage Pipeline (Fetch, Decode, Execute), one stage per clock period T, with the initial clock periods marked as the time required to fill the pipeline.]
ARM9
13% throughput increase over ARM7 when running Dhrystone benchmark
[Timing diagram: ARM9 Five-stage Pipeline (Fetch, Decode, Execute, Memory, Write), one stage per clock period T, with the initial clock periods marked as the time required to fill the pipeline.]
T = Clock Cycle Time (Clock Period)
ARM10
34% throughput increase over ARM7 when running Dhrystone benchmark
What about the ARM8?
Backward compatibility was maintained as these architectures evolved
Code written for ARM7 will run on ARM10
The Effects of Pipelining on Assembly Language
Consider exception handling
Prefetch abort stores the address of the aborted instruction + 4 in LR
Data abort stores the address of the aborted instruction + 8 in LR
The difference is due to the fact that the aborted instruction is in a different stage of the
pipeline when the abort occurs
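The arithmetic an abort handler has to undo can be sketched as follows. This assumes ARM state with 4-byte instructions; the function names are illustrative, and the example address 0x40008004 is chosen arbitrarily from the RAM addresses used earlier in these slides.

```python
# Offset added to the aborted instruction's address when LR is written.
LR_OFFSET = {"prefetch_abort": 4, "data_abort": 8}


def link_register_on_abort(kind, aborted_instruction_address):
    """Value stored in LR when the given kind of abort is taken."""
    return (aborted_instruction_address + LR_OFFSET[kind]) & 0xFFFFFFFF


def address_to_retry(kind, lr):
    """Address the handler must return to in order to retry the instruction."""
    return (lr - LR_OFFSET[kind]) & 0xFFFFFFFF


ABORTED_AT = 0x40008004  # example address, chosen for illustration
for kind in ("prefetch_abort", "data_abort"):
    lr = link_register_on_abort(kind, ABORTED_AT)
    retry = address_to_retry(kind, lr)
    print(f"{kind}: LR = 0x{lr:08X}, retry at 0x{retry:08X}")
```

In assembly language this corresponds to returning with SUBS PC, LR, #4 after a prefetch abort and SUBS PC, LR, #8 after a data abort when the aborted instruction is to be executed again.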
The Vector Table
When the PC is updated with an address stored in memory, the PC-relative load must
account for the fact that the PC has been incremented
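The adjustment can be worked through numerically. In ARM state the PC reads as the address of the current instruction plus 8, so the offset in a PC-relative load is measured from that value. The sketch below assumes a common layout in which the reset vector at 0x00000000 loads the PC from a literal stored at 0x00000020; that layout and the function names are assumptions for illustration.

```python
PC_READ_AHEAD = 8  # ARM state: PC reads as current instruction address + 8


def pc_relative_offset(instruction_address, literal_address):
    """Offset to encode so a load at instruction_address reads literal_address."""
    return literal_address - (instruction_address + PC_READ_AHEAD)


def encode_ldr_pc_pc_relative(offset):
    """Encode LDR PC, [PC, #offset] for a non-negative 12-bit offset."""
    if not 0 <= offset <= 0xFFF:
        raise ValueError("offset must fit in 12 bits and be non-negative")
    return 0xE59FF000 | offset


VECTOR_ADDRESS = 0x00000000   # reset vector
LITERAL_ADDRESS = 0x00000020  # assumed location of the handler address

offset = pc_relative_offset(VECTOR_ADDRESS, LITERAL_ADDRESS)
word = encode_ldr_pc_pc_relative(offset)
print(f"LDR PC, [PC, #0x{offset:X}] encodes as 0x{word:08X}")
```

The literal sits 0x20 bytes past the vector, yet the encoded offset is 0x18, because the PC has already advanced by 8 when the load executes.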
References
Kris Schindler, Introduction to Microprocessor Based Systems Using the ARM Processor,
Second Edition, Pearson, 2013
[Timing diagram: ARM10 Six-stage Pipeline (Fetch, Issue, Decode, Execute, Memory, Write), one stage per clock period T, with the initial clock periods marked as the time required to fill the pipeline.]
T = Clock Cycle Time (Clock Period)