
Microprocessor Principles

MKEL1123 Advanced Microprocessor Systems

Dr Usman Ullah Sheikh & Muním Zabidi


School of Electrical Engineering, UTM
Table of Contents

1 Processor Architecture

2 How the CPU Works

3 Pipelining

4 Instruction Set Architecture

5 Memory Systems

6 Processor Benchmarks

7 References

© Dr Usman Ullah Sheikh & Muním Zabidi 2


Processor Architecture



Architecture

In the context of computers, architecture has many meanings


Instruction Set Architecture (ISA)
The parts of a processor design that one needs to understand in order to write assembly/machine code
e.g. instruction set specifications, registers
Example ISAs:
Intel: x86, IA32, x86-64, Itanium
ARM
Microarchitecture (or Computer Organization)
Implementation of the architecture
e.g. cache size, core frequency
Platform Architecture (or System Design)
Memory and I/O buses
Memory controllers
Direct memory access



Components of the Computer

Since 1946 all computers have had 4 components:

Processor controls the system
Memory contains program and data
Input devices get data from the outside world into the computer
Output devices present computation results to the outside world

(Figure: Input Device → Central Processing Unit ↔ Memory; Central Processing Unit → Output Device.)


What is the CPU?

CPU is the brain of the computer system


The computer works by executing a program containing many instructions
Program
Sequence of instructions that perform a task
Think of executing a program like playing music
Instruction
A binary code representing the simplest operation performed by the processor
Think of an individual instruction as a note coming from a musical instrument
Code forms:
Machine code: byte-level program that a processor executes
Assembly code: text representation of machine code



Memory

k × m array of stored bits (k is usually 2^n)


Address unique (n-bit) identifier of location
Contents m-bit value stored in location
Basic operations:
LOAD read a value from a memory location
STORE write a value to a memory location
Basic memory types:
RAM read-and-write
ROM read-only
Flash most popular variant of ROM



Buses

Most processors use the three-bus architecture.


Address bus
Carries the address of memory or I/O device for a
data transfer. Determines the addressing range.
Unidirectional: always acts as an output of the CPU
Data bus
Carries data to be transferred between processor and
memory or I/O.
Bidirectional: set as input when reading data, and
output when writing data
Control bus
Carries status and control signals required for
various operations
An assortment of signals, anything not address or
data
e.g. R/W*, IO/M*, Interrupt, DMA



Address Bus

The address bus contains the address of memory location or I/O device selected for a
data transfer.
Address bus width determines the addressing range.

CPU     Address bus width   Total addressable memory

8051    16                  2^16 = 65,536 = 64 KB
8086    20                  2^20 = 1,048,576 = 1 MB
68000   24                  2^24 = 16,777,216 = 16 MB
ARM     32                  2^32 = 4,294,967,296 = 4 GB
Xeon    64                  2^64 = 18,446,744,073,709,551,616 = 16 EB
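The addressable-memory column follows directly from the bus width; a quick Python check (the helper name is ours, not from the slides):

```python
# Total bytes addressable with an n-bit address bus (one byte per address).
def addressable_bytes(width_bits: int) -> int:
    return 2 ** width_bits

print(addressable_bytes(16))  # 65536 -> 64 KB (8051)
print(addressable_bytes(20))  # 1048576 -> 1 MB (8086)
print(addressable_bytes(64))  # 18446744073709551616 -> 16 EB (Xeon)
```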



Data Bus

Width of the data bus determines the amount of data transferable in one step
Most microcontrollers have 8-bit data buses
Can transfer 1 byte at any one time
A 32-bit word requires 4 transfers
ARM has a 32-bit data bus
Can transfer 4 bytes at once
Some chips have an external bus with a selectable width of 8, 16 or 32 bits
Selecting a smaller data bus results in lower performance but enables interfacing to lower-cost memory devices



Address and Data Buses

(Figure: CPU connected to Memory by a 24-bit address bus covering $000000 to $FFFFFF, i.e. 2^24 addresses, and a 16-bit data bus, bits 15 to 0.)

Motorola 68000 bus sizes.


Inside the CPU: Memory-Facing Registers

Registers store binary data. The following registers interface with memory.

Program Counter (PC) points to the address of the instruction currently being executed by the CPU
Instruction Register (IR) stores the instruction read from the address indicated by the PC
Address Register stores the address of the current data
Data Register stores the value from the address indicated by the Address Register


Inside the CPU: Internal Components

These components do not have direct access to memory.

Decoder interprets the instruction brought to the IR and passes it to the CU
Control Unit (CU) generates control signals based on the instruction decoded by the Decoder
Arithmetic/Logic Unit (ALU) performs arithmetic & logic operations
Accumulator (ACC) a register that stores the values used before and after an ALU operation


ALU size

The ALU operates on several bits simultaneously


“The size of the processor”
Usually (but not always) determines data bus size
Typical sizes:
4 bits (remote controllers etc)
8 bits (microcontrollers: 68HC05, 8051, PIC)
16 bits (low-end microprocessors: Intel 8086)
32 bits (most popular size today: ARM, MIPS)
64 bits (servers: IBM POWER, Intel Xeon)



How the CPU Works



How does the CPU Work?

Fetch: fetch instructions from memory
Decode: interpret the binary instruction
Execute: perform the requested action

An instruction has 3 phases of execution
The Control Unit (CU) orchestrates the complete execution of each instruction
At the CU's heart is a Finite State Machine (FSM) that sets up the state of the logic circuits according to each instruction
This process is governed by the system clock: the FSM goes through one transition ("machine cycle") for each tick of the clock


Inside the CPU: An Example Program

C version:

void main(void)
{
    int a = 1;
    int b = 2;
    int c;
    c = a + b;
}

Assembly version:

Address  Assembly
-------  -----------
0x1000   LOAD 0x2000
0x1002   ADD 0x2002
0x1004   STORE 0x2004

Explanation:

LOAD 0x2000   Load the value of a into the Data Register
ADD 0x2002    Add the previously loaded value of a to the newly loaded value of b; save the sum in ACC
STORE 0x2004  Save the result to the address of c
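The trace of this three-instruction program can be sketched with a toy accumulator-machine interpreter in Python; the opcodes and addresses come from the slide, while the interpreter itself is purely illustrative:

```python
# Toy accumulator machine running the slide's three-instruction program.
memory = {0x2000: 1, 0x2002: 2, 0x2004: 0}          # a = 1, b = 2, c = ?
program = [("LOAD", 0x2000), ("ADD", 0x2002), ("STORE", 0x2004)]

acc = 0
for opcode, addr in program:                        # fetch each instruction
    if opcode == "LOAD":                            # decode, then execute
        acc = memory[addr]
    elif opcode == "ADD":
        acc += memory[addr]
    elif opcode == "STORE":
        memory[addr] = acc

print(memory[0x2004])  # 3, i.e. c = a + b
```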


Executing LOAD instruction

1. The address the CPU wants to execute, 0x1000, is in the PC
2. 0x1000 is put in the Address Register
3. When 0x1000 enters the Address Register, it automatically accesses location 0x1000 of the memory
4. The instruction there is read from memory
5. The instruction is stored in the IR (LOAD 0x2000)
6. The instruction goes into the Decoder. At the same time the PC is incremented
7. The Decoder interprets the content. The CU understands that the instruction is to get the value at address 0x2000
8. The CU generates control signals to read the value at 0x2000 from memory
9. The value 1 is entered into the Data Register by the control signals generated by the CU
10. The value of the Data Register is available to any circuit needing it
11. Since this value may be operated on by the ALU, it is temporarily stored in ACC


Executing ADD instruction

1. As with LOAD, the address the CPU will execute next is 0x1002, to which the PC has already been incremented
2. 0x1002 is put in the Address Register
3. Address 0x1002 is accessed
4. The value at 0x1002 is available
5. This value is stored in the IR
6. The value in the IR is sent to the Decoder. At the same time the PC is incremented
7. The Decoder interprets the value in the IR. The CU understands that the instruction is to add the value at address 0x2002
8. The CU generates control signals to read the value at 0x2002 based on the Decoder's interpretation. The ALU is given a control signal to add
9. The data from 0x2002 is loaded and saved in the Data Register
10. The ALU adds the data in the Data Register to the current value in ACC
11. The sum replaces the old value of ACC


Executing STORE instruction

1. The address of the current instruction to be executed, 0x1004, is the PC value
2. The value in the PC is transferred to the Address Register
3. Location 0x1004 is accessed
4. The value at 0x1004 is made available to the CPU
5. This value is saved in the IR
6. The value in the IR is made available to the Decoder, and the PC is incremented
7. The Decoder interprets the value in the IR
8. The CU generates control signals to store the value in ACC at 0x2004 based on the Decoder's interpretation
9. The output of the ALU is stored in the ACC
10. Finally, the value in ACC is stored in location 0x2004


Pipelining



Idea of the Pipeline 1/3

The sequence of operations in the CPU is the same regardless of the instruction
Separate circuits are active during different phases of execution
Each phase can be executed in parallel


Idea of the Pipeline 2/3

With pipelining, each stage (fetch, decode, execute) of different instructions can be processed at once
Pipelining is used even in the smallest $2 microcontroller
For our short program, while LOAD 0x2000 is actually being executed, ADD 0x2002 is being decoded and STORE 0x2004 is being fetched from memory


Idea of the Pipeline 3/3

The 3-stage pipeline of the famous ARM7
Each cell represents 1 clock cycle
By the third cycle, the first opcode is being executed, the second opcode decoded, and the third opcode fetched, all at once
Executing fetch-decode-execute without pipelining would take 3 × 3 = 9 cycles for 3 opcodes
With a pipeline, it takes only 5 cycles to execute 3 opcodes
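The 9-versus-5 cycle count generalizes to n instructions on a k-stage pipeline (a small sketch, function names are ours):

```python
# Cycle counts for n instructions with and without a k-stage pipeline.
def cycles_without_pipeline(n: int, k: int) -> int:
    return n * k                 # each instruction occupies all k stages in turn

def cycles_with_pipeline(n: int, k: int) -> int:
    return k + (n - 1)           # k cycles to fill the pipe, then 1 per cycle

print(cycles_without_pipeline(3, 3))  # 9
print(cycles_with_pipeline(3, 3))     # 5
```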


A 5-stage Pipeline

The instruction execution steps can be refined to increase the number of pipeline stages

(Figures: the same instruction stream shown non-pipelined and pipelined across the stages.)


Pipeline Performance

Latency
Defined as the time (or #cycles) from entering the pipeline until an instruction completes
Pipelining doesn’t help latency of single task
Throughput
Defined as the number of instructions executed per time period
Potential speedup = Number of pipeline stages

Trivia
The longest pipeline on a commercial machine is 31 stages on the Intel Pentium 4.



Speedup

To process n tasks on a k-stage pipeline with clock period τ:

For the non-pipelined processor:
    T1 = n k τ

For the pipelined processor, k cycles for the first task and n − 1 cycles for the remaining n − 1 tasks:
    Tk = [k + (n − 1)] τ

Speedup:
    Sk = T_non-pipelined / T_pipelined = T1 / Tk = n k τ / ([k + (n − 1)] τ) = n k / (k + n − 1)

    Sk → k as n → ∞
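The formula Sk = nk/(k + n − 1) and its limit can be checked numerically:

```python
# Pipeline speedup S_k = n*k / (k + n - 1) for n tasks on k stages.
def speedup(n: int, k: int) -> float:
    return n * k / (k + n - 1)

print(speedup(3, 3))                 # 1.8 (the 9-cycle vs 5-cycle case)
print(round(speedup(10**6, 5), 3))   # close to k = 5 for large n
```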


Clocking

(Figure: pipeline stages S_i and S_{i+1} separated by a latch; stage delay τ_m, latch delay d.)

Latch delay: d
Clock cycle of the pipeline: τ = max(τ_m) + d
Pipeline frequency: f = 1/τ

∴ Pipeline rate is limited by the slowest pipeline stage.

Also, increasing the number of stages adds more latch delays d
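With hypothetical stage delays, the clock formula works out as follows (all numbers invented for illustration):

```python
# tau = max(stage delay) + latch delay; f = 1/tau.
stage_delays_ns = [4.0, 6.0, 5.0]   # hypothetical per-stage logic delays
latch_delay_ns = 1.0                # hypothetical latch delay d

tau_ns = max(stage_delays_ns) + latch_delay_ns
freq_mhz = 1e3 / tau_ns             # 1/(tau in ns) is GHz; times 1000 -> MHz

print(tau_ns)                       # 7.0 ns
print(round(freq_mhz, 1))           # 142.9 MHz
```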


Limits to Pipelining

Hazards prevent the next instruction from executing during its designated clock cycle
Structural hazards
Two instructions attempting to use the same resource at the same time
Data hazards
An instruction attempting to use data before it is available in the register file
Control hazards
Caused by branch instructions, which invalidate instructions already in the pipeline, requiring flushing and refilling
The simplest solution is to stall the pipeline until the hazard is resolved, inserting one or more
"bubbles" into the pipeline
More stall cycles = lower performance
Complex solutions include branch prediction and data forwarding
Speculative execution means predicting the outcome of a branch and executing instructions based
on the prediction. The results of the execution are committed if the prediction is correct or
discarded if the guess is wrong. Modern processors make about 98% correct guesses.
The details of branch prediction/speculation are out of the scope of this course

Trivia: Read about the Spectre and Meltdown vulnerabilities caused by speculative execution at
https://blog.trailofbits.com/2018/01/30/an-accessible-overview-of-meltdown-and-spectre-part-1/


CPU Performance

Processor performance is a function of:

IC: instruction count
CPI: cycles per instruction
Clk: clock cycle time

CPU time = Seconds / Program
         = (Instructions / Program) × (Cycles / Instruction) × (Seconds / Cycle)
         = IC × CPI × Clk

Reducing any of the 3 factors leads to improved performance
Pipelining reduces CPI. Best case: CPI = 1.
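A worked example of the CPU time equation (all numbers are invented for illustration):

```python
# CPU time = IC * CPI * clock period.
def cpu_time_s(ic: int, cpi: float, clock_period_s: float) -> float:
    return ic * cpi * clock_period_s

# 10 million instructions, CPI = 2, 100 MHz clock (10 ns period):
t = cpu_time_s(10_000_000, 2.0, 10e-9)
print(t)  # about 0.2 seconds
```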


Instruction Set Architecture



Instruction Set Architecture

What is an Instruction Set Architecture (ISA)?


Defines:
Sets of instructions, how they behave
How the processor manipulates the data
Where the processor stores or obtains information
Data types
Registers
Memory model
Addressing modes
How input/output is done
Exceptions



Fixed vs Variable Length

Variable length:
Use if code size is more important
Commonly used operations are shorter → smaller programs
Complex operations are difficult to decode → control unit must use microprogramming
Slow due to multiple memory accesses during instruction fetch
Difficult to pipeline

Fixed length:
Use if performance is more important
Wastes code space because the opcode is always wide
Simple to decode → hardware decoder is possible
Works well with pipelining


Instruction Encoding

opcode 0-address

opcode address 1-address: e.g. 8051

opcode address1 address2 2-address: e.g. ARM Thumb

opcode address1 address2 address3 3-address: e.g. ARM



Machine Language Instructions

Example: ADD M,N

Add the content of memory locations M and N, and store the result back in memory location M (M = M + N)
Assume: the opcode for ADD is 11, and the addresses are M = 98, N = 100

00001011         0000000001100010     0000000001100100
Opcode (8 bits)  Address 1 (16 bits)  Address 2 (16 bits)
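The 40-bit instruction word above can be reproduced by shifting the fields into place (a sketch of this slide's made-up format; the helper name is ours):

```python
# Pack an 8-bit opcode and two 16-bit addresses into a 40-bit instruction word.
def encode(opcode: int, addr1: int, addr2: int) -> int:
    return (opcode << 32) | (addr1 << 16) | addr2

word = encode(11, 98, 100)  # ADD M,N with opcode 11, M = 98, N = 100
print(f"{word:040b}")       # 0000101100000000011000100000000001100100
```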


RISC vs CISC

Complex Instruction Set Computers (CISC):
Powerful instruction set, variable format, multi-word
Dense code, simple compiler
Each instruction can do more work
Multi-cycle execution
Microcoded control unit

Reduced Instruction Set Computers (RISC):
Simple instruction set, fixed format
Complex optimizing compiler
Single-cycle execution, easier to pipeline
Simple hardwired control unit → high clock rate, lower development costs, smaller die size, lower power


RISC vs CISC

                               CISC                         RISC
CPI                            ↑                            ↓
Code density                   ↑                            ↓
Instruction length             Variable                     Fixed
Instruction decoder            Complex                      Simple
Operand                        Memory, register, immediate  Register, immediate
Addressing modes               Various                      Limited
General purpose register file  Small                        Large


CPU Performance

Recall the performance measure:

CPU time = Seconds / Program
         = (Instructions / Program) × (Cycles / Instruction) × (Seconds / Cycle)
         = IC × CPI × Clk

RISC has a higher IC but lower CPI and Clk


Load-Store Architecture

Real computer architectures are neither pure RISC nor pure CISC
Today's RISCs have added complex instructions
And CISC CPUs may preprocess the instruction queue (trace cache) into internal RISC-like instructions
Recent embedded machines (ARM, MIPS) added an optional mode to execute a subset of 16-bit-wide
instructions (Thumb, MIPS16); per procedure, decide performance or density
ARM Thumb-2 can be considered variable-length, as 16-bit and 32-bit instructions can be mixed freely
Most RISC processors today are RINO (RISC In Name Only), except RISC-V, which is a clean-slate makeover
A better term is "load-store architecture":
Only load/store instructions can access memory
Operands must be located in registers
The operation result is put into a register


Load-Store Architecture

(Figure: the Processor contains a Register file and ALU; load and store paths connect the Register file to Memory.)

Only load/store instructions can access memory: there is no direct path from memory to the ALU.


Load-Store Architecture Increment data in memory

Non load-store architecture
A single instruction can access memory and modify the data:

    INC 0x100

Load-store architecture
Must load the data into a CPU register, increment it in the register, then store it from the register back to memory:

    LD R1,0x100
    ADD #1,R1
    ST R1,0x100


Memory Systems





Harvard vs von-Neumann Memory Architectures

Harvard:
Separate program and data memory spaces
Allows two simultaneous memory fetches
Hardware can be optimized for access type and bus width
Used in DSPs: greater memory bandwidth
Used in PIC µCs: non-power-of-2 instruction sizes (12-bit, 14-bit), while data is still 8-bit

von Neumann:
Single memory space for program and data
Easier to program, but a performance bottleneck
Code can be accessed and modified, just like data
Beneficial in some cases, e.g. self-modifying code
Also a security risk, e.g. malware & software defects


When Harvard is not Really Harvard

True Harvard:
Can't access data in program memory
Data memory is more expensive than program memory
Don't waste data memory on constant data
Modified Harvard:
Has special instructions and a hardware pathway to data in program memory


Cache Memory

What it is:
Small amount of fast memory
Sits between normal main memory and CPU
May be located on CPU chip or in system
Objective: To make slower main memory look like fast memory.
Hold frequently accessed blocks of main memory
CPU looks first for data in cache, then in main memory.
May have additional levels, e.g. L1 and L2 caches
Cache operation is invisible to the programmer


Cache Memory Concept

Data must be found in the cache often enough for the concept to be effective
Hit rate (h) = percentage of access attempts that find the data in the cache
Effective memory access time = hit rate × cache access time + miss rate × main memory access time:

    t_eff = h · t_cache + (1 − h) · t_main

e.g. h = 0.9, t_cache = 10 ns, t_main = 50 ns:

    t_eff = 0.9(10) + 0.1(50) = 14 ns

(Figure: memory hierarchy — a tiny, very fast register file; a small, fast cache with slightly more capacity; a big, slow main memory with room for many words.)
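The slide's numeric example, recomputed (the function name is ours):

```python
# Effective access time: t_eff = h * t_cache + (1 - h) * t_main.
def t_eff_ns(h: float, t_cache_ns: float, t_main_ns: float) -> float:
    return h * t_cache_ns + (1 - h) * t_main_ns

print(round(t_eff_ns(0.9, 10, 50), 1))  # 14.0 ns, matching the slide
```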


When Harvard is not Really Harvard (Again?)

ARM7: instructions and data pass through a single unified cache between the CPU (Control Unit, Register File) and a single main memory. This is a "von Neumann" cache.
ARM9: instructions pass through an I-Cache and data through a D-Cache, both in front of a single main memory. This is a "Harvard" cache: logically Harvard, physically von Neumann.
Processor Benchmarks



Benchmarking: Common Benchmarks

Benchmarks provide a method of comparing the performance of various subsystems across


different chip/system architectures.
Popular benchmarks for embedded systems are Dhrystone and CoreMark.
Clock speed
Not accurate as CPI varies especially across architectures
MIPS
Million Instructions Per Second
Slightly better than clock speed
Not meaningful when comparing different instruction sets especially RISC vs CISC
e.g. a high-level task may require more instructions in RISC but execute faster
also: “meaningless information on processor speed”
Dhrystone
Synthetic benchmark, developed in 1984
Consists of simple programs to statistically simulate processor usage of a common set of applications
Benchmark does not contain floating-point operations
The name is a pun of Whetstone benchmark for floating-point operations
Output is Dhrystones per second (number of iterations of the main loop per second)



Dhrystone Numbers for Some MCUs



Improving On Dhrystone

Issues with Dhrystone:
Contains unusual code not found in real programs
Susceptible to compiler optimizations
Small code size may fit in the cache, so results may differ from the customer's actual application
Non-standard reporting
CoreMark:
Developed in 2009 by the Embedded Microprocessor Benchmark Consortium (EEMBC)
Addresses the issues with Dhrystone
Implements these algorithms: list processing, matrix operations, state machines and cyclic redundancy checks
Scores are published at https://www.eembc.org/coremark/scores.php
The industry has many other benchmarks, e.g. BDTImark for DSP, SPEC for general-purpose computing


References



References

[1] D. A. Patterson and J. L. Hennessy, Computer Organization and Design ARM Edition: The Hardware/Software Interface. Morgan Kaufmann, 2016.
[2] D. S. Dawoud and R. Peplow, Digital System Design - Use of Microcontroller. Wharton, TX, USA: River Publishers, 2010. Downloadable at https://www.riverpublishers.com/pdf/ebook/RP_E9788793102293.pdf.
[3] S. S. Mukherjee, "New challenges in benchmarking future processors," in Fifth Workshop on Computer Architecture Evaluation using Commercial Workloads, Citeseer, 2002.
