Processor Architecture
Pipelining
Memory Systems
Processor Benchmarks
References
The address bus carries the address of the memory location or I/O device selected for a
data transfer.
Address bus width determines the addressing range.
Width of the data bus determines the amount of data transferable in one step
Most microcontrollers have 8-bit data buses
Can transfer 1 byte at any one time
A 32-bit word requires 4 transfers
ARM has a 32-bit data bus
Can transfer 4 bytes at once
Some chips have an external bus with a selectable width of 8, 16 or 32 bits
Selecting a smaller data bus width results in lower performance but enables interfacing to lower-cost
memory devices
[Figure: memory map — a 24-bit address bus provides 2^24 addresses, starting at $000000]
C version                 Assembly version
---------                 ----------------
void main(void)           Address  Assembly
{                         -------  ---------------
    int a = 1;            0x1000   LOAD  0x2000
    int b = 2;            0x1002   ADD   0x2002
    int c;                0x1004   STORE 0x2004
    c = a + b;
}

Explanation:
LOAD 0x2000   Load the value of a into the data register.
ADD 0x2002    After adding the previously loaded value of a and the
              newly loaded value of b, save the result in ACC.
STORE 0x2004  Save the added result to the address of c.
The instruction execution steps can be refined to increase the number of pipeline stages
Non-pipelined
Pipelined
Latency
Defined as the time (or #cycles) from entering the pipeline until an instruction completes
Pipelining doesn’t help latency of single task
Throughput
Defined as the number of instructions executed per time period
Potential speedup = Number of pipeline stages
Trivia
The longest pipeline on a commercial machine is 31 stages on the Intel Pentium 4.
Speedup: to process n tasks on a k-stage pipeline with clock cycle τ:

For the non-pipelined processor: T1 = n·k·τ

For the pipelined processor: k cycles for the first task and
n − 1 cycles for the remaining n − 1 tasks, so Tk = [k + (n − 1)]·τ

Sk = Tnon−pipelined / Tpipelined = T1 / Tk = n·k·τ / [k + (n − 1)]·τ = n·k / (k + n − 1)

Sk → k as n → ∞
[Figure: adjacent pipeline stages Si and Si+1 separated by a latch; each stage has logic delay τm followed by latch delay d]

Latch delay: d
Clock cycle of the pipeline: τ = max(τm) + d
Pipeline frequency: f = 1/τ
Hazards prevent next instruction from executing during its designated clock cycle
Structural hazards
Two instructions attempting to use the same resources at the same time
Data hazards
Instruction attempting to use data before it’s available in the register file
Control hazards
Caused by branch instructions, which invalidate data already in the pipeline, requiring flushing and refilling.
Simplest solution is to stall the pipeline until the hazard is resolved, inserting one or more
“bubbles” in the pipeline
More stall cycles = lower performance
Complex solutions include branch prediction and data forwarding
Speculative execution means predicting the outcome of a branch and executing instructions based
on the prediction. The results of the execution are committed if the prediction is correct, or
discarded if the guess is wrong. Modern processors guess correctly about 98% of the time.
The details of branch prediction/speculation: out of the scope of this course
Trivia: Read about the Spectre and Meltdown vulnerabilities caused by speculative execution at
https://blog.trailofbits.com/2018/01/30/an-accessible-overview-of-meltdown-and-spectre-part-1/
CPU time = Seconds / Program
         = (Instructions / Program) × (Cycles / Instruction) × (Seconds / Cycle)
         = IC × CPI × Clk
Variable length:
  Use if code size is more important
  Commonly used operations are shorter → smaller programs
  Complex operations difficult to decode → control unit must use microprogramming
  Slow due to multiple memory accesses during instruction fetch
  Difficult to pipeline

Fixed length:
  Use if performance is more important
  Wastes code space because opcode is always wide
  Simple to decode → hardware decoder is possible
  Works well with pipelining
[Figure: 0-address instruction format — opcode field only]
Complex Instruction Set Computers (CISC):
  Powerful instruction set, variable format, multi-word
  Dense code, simple compiler
  Each instruction can do more work
  Multi-cycle execution
  Microcoded control unit

Reduced Instruction Set Computers (RISC):
  Simple instruction set, fixed format
  Complex optimizing compiler
  Single-cycle execution, easier to pipeline
  Simple hardwired control unit → high clock rate, lower dev. costs, smaller die size, lower power
                               CISC                          RISC
CPI                            ↑                             ↓
Code density                   ↑                             ↓
Instruction length             Variable                      Fixed
Instruction decoder            Complex                       Simple
Operand                        Memory, register, immediate   Register, immediate
Addressing modes               Various                       Limited
General purpose register file  Small                         Large
[Figure: processor block diagram — register file and ALU, with load/store paths between the register file and memory]

Only load/store instructions can access memory: no direct path from memory to ALU
Load-store architecture:
  Must load data into a CPU register, modify the data in the register, then store
  from the register back to memory
    LD  R1,0x100
    ADD #1,R1
    ST  R1,0x100

Non load-store architecture:
  A single instruction can access memory and modify data
    INC 0x100
Harvard: separate program and data memory spaces
von Neumann: single memory space for program and data

True Harvard:
  Can't access data in program memory
  Data memory is more expensive than program memory
  Don't waste data memory on constant data

Modified Harvard:
  Has special instructions and a hardware pathway to data in program memory
What it is:
Small amount of fast memory
Sits between normal main memory and CPU
May be located on CPU chip or in system
Objective: To make slower main memory look like fast memory.
Hold frequently accessed blocks of main memory
CPU looks first for data in cache, then in main memory.
May have additional levels, e.g. L1 and L2 caches
Cache operation is invisible to the programmer
[Figure: ARM7 — control unit and register file connected through a single unified cache to main memory; instructions and data share one path ("von Neumann" cache)]

[Figure: ARM9 — control unit fetches instructions through a separate I-cache while the register file accesses data through a D-cache, both backed by main memory ("Harvard" cache: logically Harvard, physically von Neumann)]
© Dr Usman Ullah Sheikh & Muním Zabidi
Processor Benchmarks
[1] D. A. Patterson and J. L. Hennessy, Computer Organization and Design ARM Edition: The Hardware/Software Interface. Morgan Kaufmann, 2016.
[2] D. S. Dawoud and R. Peplow, Digital System Design: Use of Microcontroller. Wharton, TX, USA: River Publishers, 2010. Downloadable at https://www.riverpublishers.com/pdf/ebook/RP_E9788793102293.pdf.