
Microprocessor Principles

MKEL1123 Advanced Microprocessor Systems

Dr Usman Ullah Sheikh & Muním Zabidi


School of Electrical Engineering, UTM
Table of Contents

1 Processor Architecture

2 How the CPU Works

3 Pipelining

4 Instruction Set Architecture

5 Memory Systems

6 Processor Benchmarks

7 References

© Dr Usman Ullah Sheikh & Muním Zabidi 2


Processor Architecture



Architecture

In the context of computers, architecture has many meanings


Instruction Set Architecture (ISA)
The parts of a processor design that one needs to understand in order to write assembly/machine code
e.g. instruction set specifications, registers
Example ISAs:
Intel: x86, IA32, x86-64, Itanium
ARM
Microarchitecture (or Computer Organization)
Implementation of the architecture
e.g. cache size, core frequency
Platform Architecture (or System Design)
Memory and I/O buses
Memory controllers
Direct memory access



Components of the Computer

Since 1946 all computers have had 4 components:

Processor controls the system
Memory contains program and data
Input devices get data from the outside world into the computer
Output devices present computation results to the outside world

(Figure: Input Device → Central Processing Unit ↔ Memory; Central Processing Unit → Output Device.)


What is the CPU?

CPU is the brain of the computer system


The computer works by executing a program containing many instructions
Program
Sequence of instructions that perform a task
Think of executing a program like playing music
Instruction
A binary code representing the simplest operation performed by the processor
Think of an individual instruction as a note coming from a musical instrument
Code forms:
Machine code: byte-level program that a processor executes
Assembly code: text representation of machine code



Memory

k × m array of stored bits (k is usually 2^n)


Address unique (n-bit) identifier of location
Contents m-bit value stored in location
Basic operations:
LOAD read a value from a memory location
STORE write a value to a memory location
Basic memory types:
RAM read-and-write
ROM read-only
Flash most popular variant of ROM



Buses

Most processors use the three-bus architecture.


Address bus
Carries the address of memory or I/O device for a
data transfer. Determines the addressing range.
Unidirectional: always acts as an output of the CPU
Data bus
Carries data to be transferred between processor and
memory or I/O.
Bidirectional: set as input when reading data, and
output when writing data
Control bus
Carries status and control signals required for
various operations
An assortment of signals, anything not address or
data
e.g. R/W*, IO/M*, Interrupt, DMA



Address Bus

The address bus contains the address of memory location or I/O device selected for a
data transfer.
Address bus width determines the addressing range.

CPU     Address bus width   Total addressable memory

8051    16                  2^16 = 65,536 = 64 KB
8086    20                  2^20 = 1,048,576 = 1 MB
68000   24                  2^24 = 16,777,216 = 16 MB
ARM     32                  2^32 = 4,294,967,296 = 4 GB
Xeon    64                  2^64 = 18,446,744,073,709,551,616 = 16 EB
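The addressable-memory column follows directly from the bus width; a quick Python check (the helper name is ours, not from the slides):

```python
# Total bytes addressable with an n-bit address bus (one byte per address).
def addressable_bytes(width_bits: int) -> int:
    return 2 ** width_bits

print(addressable_bytes(16))  # 65536 -> 64 KB (8051)
print(addressable_bytes(20))  # 1048576 -> 1 MB (8086)
print(addressable_bytes(64))  # 18446744073709551616 -> 16 EB (Xeon)
```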



Data Bus

Width of the data bus determines the amount of data transferable in one step
Most microcontrollers have 8-bit data buses
Can transfer 1 byte at any one time
A 32-bit word requires 4 transfers
ARM has a 32-bit data bus
Can transfer 4 bytes at once
Some chips have an external bus with a selectable width of 8, 16 or 32 bits
Selecting a smaller data bus results in lower performance but enables interfacing to lower-cost memory devices



Address and Data Buses

(Figure: CPU connected to Memory by a 24-bit address bus covering $000000 to $FFFFFF, i.e. 2^24 addresses, and a 16-bit data bus, bits 15 to 0.)

Motorola 68000 bus sizes.


Inside the CPU: Memory-Facing Registers

Registers store binary data. The following registers interface with memory.

Program Counter (PC) points to the address of the instruction currently being executed by the CPU
Instruction Register (IR) stores the instruction read from the address indicated by the PC
Address Register stores the address of the current data
Data Register stores the value from the address indicated by the Address Register


Inside the CPU: Internal Components

These components do not have direct access to memory.

Decoder interprets the instruction brought to the IR and passes it to the CU
Control Unit (CU) generates control signals based on the instruction decoded by the Decoder
Arithmetic/Logic Unit (ALU) performs arithmetic & logic operations
Accumulator (ACC) a register that stores the values used before and after an ALU operation


ALU size

The ALU operates on several bits simultaneously


“The size of the processor”
Usually (but not always) determines data bus size
Typical sizes:
4 bits (remote controllers etc)
8 bits (microcontrollers: 68HC05, 8051, PIC)
16 bits (low-end microprocessors: Intel 8086)
32 bits (most popular size today: ARM, MIPS)
64 bits (servers: IBM POWER, Intel Xeon)



How the CPU Works



How does the CPU Work?

Fetch: fetch instructions from memory
Decode: interpret the binary instruction
Execute: perform the requested action

An instruction has 3 phases of execution
The Control Unit (CU) orchestrates the complete execution of each instruction
At the CU's heart is a Finite State Machine (FSM) that sets up the state of the logic circuits according to each instruction
This process is governed by the system clock: the FSM goes through one transition ("machine cycle") for each tick of the clock


Inside the CPU: An Example Program

C version:

void main(void)
{
    int a = 1;
    int b = 2;
    int c;
    c = a + b;
}

Assembly version:

Address  Assembly
-------  -----------
0x1000   LOAD 0x2000
0x1002   ADD 0x2002
0x1004   STORE 0x2004

Explanation:

LOAD 0x2000   Load the value of a into the Data Register
ADD 0x2002    Add the previously loaded value of a to the newly loaded value of b; save the sum in ACC
STORE 0x2004  Save the result to the address of c
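The trace of this three-instruction program can be sketched with a toy accumulator-machine interpreter in Python; the opcodes and addresses come from the slide, while the interpreter itself is purely illustrative:

```python
# Toy accumulator machine running the slide's three-instruction program.
memory = {0x2000: 1, 0x2002: 2, 0x2004: 0}          # a = 1, b = 2, c = ?
program = [("LOAD", 0x2000), ("ADD", 0x2002), ("STORE", 0x2004)]

acc = 0
for opcode, addr in program:                        # fetch each instruction
    if opcode == "LOAD":                            # decode, then execute
        acc = memory[addr]
    elif opcode == "ADD":
        acc += memory[addr]
    elif opcode == "STORE":
        memory[addr] = acc

print(memory[0x2004])  # 3, i.e. c = a + b
```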


Executing LOAD instruction

1. The address the CPU wants to execute, 0x1000, is in the PC
2. 0x1000 is put in the Address Register
3. When 0x1000 enters the Address Register, it automatically accesses location 0x1000 of the memory
4. The instruction there is read from memory
5. The instruction is stored in the IR (LOAD 0x2000)
6. The instruction goes into the Decoder. At the same time the PC is incremented
7. The Decoder interprets the content. The CU understands that the instruction is to get the value at address 0x2000
8. The CU generates control signals to read the value at 0x2000 from memory
9. The value 1 is entered into the Data Register by the control signals generated by the CU
10. The value of the Data Register is available to any circuit needing it
11. Since this value may be operated on by the ALU, it is temporarily stored in ACC


Executing ADD instruction

1. As with LOAD, the address the CPU will execute next is 0x1002, to which the PC has already been incremented
2. 0x1002 is put in the Address Register
3. Address 0x1002 is accessed
4. The value at 0x1002 is available
5. This value is stored in the IR
6. The value in the IR is sent to the Decoder. At the same time the PC is incremented
7. The Decoder interprets the value in the IR. The CU understands that the instruction is to add the value at address 0x2002
8. The CU generates control signals to read the value at 0x2002 based on the Decoder's interpretation. The ALU is given a control signal to add
9. The data from 0x2002 is loaded and saved in the Data Register
10. The ALU adds the data in the Data Register to the current value in ACC
11. The sum replaces the old value of ACC


Executing STORE instruction

1. The address of the current instruction to be executed, 0x1004, is the PC value
2. The value in the PC is transferred to the Address Register
3. Location 0x1004 is accessed
4. The value at 0x1004 is made available to the CPU
5. This value is saved in the IR
6. The value in the IR is made available to the Decoder, and the PC is incremented
7. The Decoder interprets the value in the IR
8. The CU generates control signals to store the value in ACC at 0x2004 based on the Decoder's interpretation
9. The output of the ALU is stored in the ACC
10. Finally, the value in ACC is stored in location 0x2004


Pipelining



Idea of the Pipeline 1/3

The sequence of operations in the CPU is the same regardless of the instruction
Separate circuits are active during different phases of execution
Each phase can be executed in parallel


Idea of the Pipeline 2/3

With pipelining, each stage (fetch, decode, execute) of different instructions can be processed at once
Pipelining is used even in the smallest $2 microcontroller
For our short program, while LOAD 0x2000 is actually being executed, ADD 0x2002 is being decoded and STORE 0x2004 is being fetched from memory


Idea of the Pipeline 3/3

The 3-stage pipeline of the famous ARM7
Each cell represents 1 clock cycle
By the third cycle, the first opcode is being executed, the second opcode decoded, and the third opcode fetched, all at once
Executing fetch-decode-execute without pipelining would take 3 × 3 = 9 cycles for 3 opcodes
With a pipeline, it takes only 5 cycles to execute 3 opcodes
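The 9-versus-5 cycle count generalizes to n instructions on a k-stage pipeline (a small sketch, function names are ours):

```python
# Cycle counts for n instructions with and without a k-stage pipeline.
def cycles_without_pipeline(n: int, k: int) -> int:
    return n * k                 # each instruction occupies all k stages in turn

def cycles_with_pipeline(n: int, k: int) -> int:
    return k + (n - 1)           # k cycles to fill the pipe, then 1 per cycle

print(cycles_without_pipeline(3, 3))  # 9
print(cycles_with_pipeline(3, 3))     # 5
```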


A 5-stage Pipeline

The instruction execution steps can be refined to increase the number of pipeline stages

(Figures: the same instruction stream shown non-pipelined and pipelined across the stages.)


Pipeline Performance

Latency
Defined as the time (or #cycles) from entering the pipeline until an instruction completes
Pipelining doesn’t help latency of single task
Throughput
Defined as the number of instructions executed per time period
Potential speedup = Number of pipeline stages

Trivia
The longest pipeline on a commercial machine is 31 stages on the Intel Pentium 4.



Speedup

To process n tasks on a k-stage pipeline with clock period τ:

For the non-pipelined processor:
    T1 = n k τ

For the pipelined processor, k cycles for the first task and n − 1 cycles for the remaining n − 1 tasks:
    Tk = [k + (n − 1)] τ

Speedup:
    Sk = T_non-pipelined / T_pipelined = T1 / Tk = n k τ / ([k + (n − 1)] τ) = n k / (k + n − 1)

    Sk → k as n → ∞
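The formula Sk = nk/(k + n − 1) and its limit can be checked numerically:

```python
# Pipeline speedup S_k = n*k / (k + n - 1) for n tasks on k stages.
def speedup(n: int, k: int) -> float:
    return n * k / (k + n - 1)

print(speedup(3, 3))                 # 1.8 (the 9-cycle vs 5-cycle case)
print(round(speedup(10**6, 5), 3))   # close to k = 5 for large n
```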


Clocking

(Figure: pipeline stages S_i and S_{i+1} separated by a latch; stage delay τ_m, latch delay d.)

Latch delay: d
Clock cycle of the pipeline: τ = max(τ_m) + d
Pipeline frequency: f = 1/τ

∴ Pipeline rate is limited by the slowest pipeline stage.

Also, increasing the number of stages adds more latch delays d
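With hypothetical stage delays, the clock formula works out as follows (all numbers invented for illustration):

```python
# tau = max(stage delay) + latch delay; f = 1/tau.
stage_delays_ns = [4.0, 6.0, 5.0]   # hypothetical per-stage logic delays
latch_delay_ns = 1.0                # hypothetical latch delay d

tau_ns = max(stage_delays_ns) + latch_delay_ns
freq_mhz = 1e3 / tau_ns             # 1/(tau in ns) is GHz; times 1000 -> MHz

print(tau_ns)                       # 7.0 ns
print(round(freq_mhz, 1))           # 142.9 MHz
```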


Limits to Pipelining

Hazards prevent the next instruction from executing during its designated clock cycle
Structural hazards
Two instructions attempting to use the same resource at the same time
Data hazards
An instruction attempting to use data before it is available in the register file
Control hazards
Caused by branch instructions, which invalidate instructions already in the pipeline, requiring flushing and refilling
The simplest solution is to stall the pipeline until the hazard is resolved, inserting one or more
"bubbles" into the pipeline
More stall cycles = lower performance
Complex solutions include branch prediction and data forwarding
Speculative execution means predicting the outcome of a branch and executing instructions based
on the prediction. The results of the execution are committed if the prediction is correct or
discarded if the guess is wrong. Modern processors make about 98% correct guesses.
The details of branch prediction/speculation are out of the scope of this course

Trivia: Read about the Spectre and Meltdown vulnerabilities caused by speculative execution at
https://blog.trailofbits.com/2018/01/30/an-accessible-overview-of-meltdown-and-spectre-part-1/


CPU Performance

Processor performance is a function of:

IC: instruction count
CPI: cycles per instruction
Clk: clock cycle time

CPU time = Seconds / Program
         = (Instructions / Program) × (Cycles / Instruction) × (Seconds / Cycle)
         = IC × CPI × Clk

Reducing any of the 3 factors leads to improved performance
Pipelining reduces CPI. Best case: CPI = 1.
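A worked example of the CPU time equation (all numbers are invented for illustration):

```python
# CPU time = IC * CPI * clock period.
def cpu_time_s(ic: int, cpi: float, clock_period_s: float) -> float:
    return ic * cpi * clock_period_s

# 10 million instructions, CPI = 2, 100 MHz clock (10 ns period):
t = cpu_time_s(10_000_000, 2.0, 10e-9)
print(t)  # about 0.2 seconds
```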


Instruction Set Architecture



Instruction Set Architecture

What is an Instruction Set Architecture (ISA)?


Defines:
Sets of instructions, how they behave
How the processor manipulates the data
Where the processor stores or obtains information
Data types
Registers
Memory model
Addressing modes
How input/output is done
Exceptions



Fixed vs Variable Length

Variable length:
Use if code size is more important
Commonly used operations are shorter → smaller programs
Complex operations are difficult to decode → control unit must use microprogramming
Slow due to multiple memory accesses during instruction fetch
Difficult to pipeline

Fixed length:
Use if performance is more important
Wastes code space because the opcode is always wide
Simple to decode → hardware decoder is possible
Works well with pipelining


Instruction Encoding

opcode 0-address

opcode address 1-address: e.g. 8051

opcode address1 address2 2-address: e.g. ARM Thumb

opcode address1 address2 address3 3-address: e.g. ARM



Machine Language Instructions

Example: ADD M,N

Add the content of memory locations M and N, and store the result back in memory location M (M = M + N)
Assume: the opcode for ADD is 11, and the addresses are M = 98, N = 100

00001011         0000000001100010     0000000001100100
Opcode (8 bits)  Address 1 (16 bits)  Address 2 (16 bits)
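The 40-bit instruction word above can be reproduced by shifting the fields into place (a sketch of this slide's made-up format; the helper name is ours):

```python
# Pack an 8-bit opcode and two 16-bit addresses into a 40-bit instruction word.
def encode(opcode: int, addr1: int, addr2: int) -> int:
    return (opcode << 32) | (addr1 << 16) | addr2

word = encode(11, 98, 100)  # ADD M,N with opcode 11, M = 98, N = 100
print(f"{word:040b}")       # 0000101100000000011000100000000001100100
```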


RISC vs CISC

Complex Instruction Set Computers (CISC):
Powerful instruction set, variable format, multi-word
Dense code, simple compiler
Each instruction can do more work
Multi-cycle execution
Microcoded control unit

Reduced Instruction Set Computers (RISC):
Simple instruction set, fixed format
Complex optimizing compiler
Single-cycle execution, easier to pipeline
Simple hardwired control unit → high clock rate, lower development costs, smaller die size, lower power


RISC vs CISC

                               CISC                         RISC
CPI                            ↑                            ↓
Code density                   ↑                            ↓
Instruction length             Variable                     Fixed
Instruction decoder            Complex                      Simple
Operand                        Memory, register, immediate  Register, immediate
Addressing modes               Various                      Limited
General purpose register file  Small                        Large


CPU Performance

Recall the performance measure:

CPU time = Seconds / Program
         = (Instructions / Program) × (Cycles / Instruction) × (Seconds / Cycle)
         = IC × CPI × Clk

RISC has a higher IC but lower CPI and Clk


Load-Store Architecture

Real computer architectures are neither pure RISC nor pure CISC
Today's RISCs have added complex instructions
And CISC CPUs may preprocess the instruction queue (trace cache) into internal RISC-like instructions
Recent embedded machines (ARM, MIPS) added an optional mode to execute a subset of 16-bit-wide
instructions (Thumb, MIPS16); per procedure, decide performance or density
ARM Thumb-2 can be considered variable-length, as 16-bit and 32-bit instructions can be mixed freely
Most RISC processors today are RINO (RISC In Name Only), except RISC-V, which is a clean-slate makeover
A better term is "load-store architecture":
Only load/store instructions can access memory
Operands must be located in registers
The operation result is put into a register


Load-Store Architecture

(Figure: the Processor contains a Register file and ALU; load and store paths connect the Register file to Memory.)

Only load/store instructions can access memory: there is no direct path from memory to the ALU.


Load-Store Architecture Increment data in memory

Non load-store architecture
A single instruction can access memory and modify the data:

    INC 0x100

Load-store architecture
Must load the data into a CPU register, increment it in the register, then store it from the register back to memory:

    LD R1,0x100
    ADD #1,R1
    ST R1,0x100


Memory Systems





Harvard vs von-Neumann Memory Architectures

Harvard:
Separate program and data memory spaces
Allows two simultaneous memory fetches
Hardware can be optimized for access type and bus width
Used in DSPs: greater memory bandwidth
Used in PIC µCs: non-power-of-2 instruction sizes (12-bit, 14-bit), while data is still 8-bit

von Neumann:
Single memory space for program and data
Easier to program, but a performance bottleneck
Code can be accessed and modified, just like data
Beneficial in some cases, e.g. self-modifying code
Also a security risk, e.g. malware & software defects


When Harvard is not Really Harvard

True Harvard:
Can't access data in program memory
Data memory is more expensive than program memory
Don't waste data memory on constant data
Modified Harvard:
Has special instructions and a hardware pathway to data in program memory


Cache Memory

What it is:
Small amount of fast memory
Sits between normal main memory and CPU
May be located on CPU chip or in system
Objective: To make slower main memory look like fast memory.
Hold frequently accessed blocks of main memory
CPU looks first for data in cache, then in main memory.
May have additional levels, e.g. L1 and L2 caches
Cache operation is invisible to the programmer


Cache Memory Concept

Data must be found in the cache often enough for the concept to be effective
Hit rate (h) = percentage of access attempts that find the data in the cache
Effective memory access time = hit rate × cache access time + miss rate × main memory access time:

    t_eff = h · t_cache + (1 − h) · t_main

e.g. h = 0.9, t_cache = 10 ns, t_main = 50 ns:

    t_eff = 0.9(10) + 0.1(50) = 14 ns

(Figure: memory hierarchy — a tiny, very fast register file; a small, fast cache with slightly more capacity; a big, slow main memory with room for many words.)
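The slide's numeric example, recomputed (the function name is ours):

```python
# Effective access time: t_eff = h * t_cache + (1 - h) * t_main.
def t_eff_ns(h: float, t_cache_ns: float, t_main_ns: float) -> float:
    return h * t_cache_ns + (1 - h) * t_main_ns

print(round(t_eff_ns(0.9, 10, 50), 1))  # 14.0 ns, matching the slide
```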


When Harvard is not Really Harvard (Again?)

ARM7: instructions and data pass through a single unified cache between the CPU (Control Unit, Register File) and a single main memory. This is a "von Neumann" cache.
ARM9: instructions pass through an I-Cache and data through a D-Cache, both in front of a single main memory. This is a "Harvard" cache: logically Harvard, physically von Neumann.
Processor Benchmarks



Benchmarking: Common Benchmarks

Benchmarks provide a method of comparing the performance of various subsystems across


different chip/system architectures.
Popular benchmarks for embedded systems are Dhrystone and CoreMark.
Clock speed
Not accurate as CPI varies especially across architectures
MIPS
Million Instructions Per Second
Slightly better than clock speed
Not meaningful when comparing different instruction sets especially RISC vs CISC
e.g. a high-level task may require more instructions in RISC but execute faster
also: “meaningless information on processor speed”
Dhrystone
Synthetic benchmark, developed in 1984
Consists of simple programs to statistically simulate processor usage of a common set of applications
Benchmark does not contain floating-point operations
The name is a pun of Whetstone benchmark for floating-point operations
Output is Dhrystones per second (number of iterations of the main loop per second)



Dhrystone Numbers for Some MCUs



Improving On Dhrystone

Issues with Dhrystone:
Contains unusual code not found in real programs
Susceptible to compiler optimizations
Small code size may fit in the cache, so results may differ from the customer's actual application
Non-standard reporting
CoreMark:
Developed in 2009 by the Embedded Microprocessor Benchmark Consortium (EEMBC)
Addresses the issues with Dhrystone
Implements these algorithms: list processing, matrix operations, state machines and cyclic redundancy checks
Scores are published at https://www.eembc.org/coremark/scores.php
The industry has many other benchmarks, e.g. BDTImark for DSP, SPEC for general-purpose computing


References



References

[1] D. A. Patterson and J. L. Hennessy, Computer Organization and Design ARM Edition: The Hardware/Software Interface. Morgan Kaufmann, 2016.
[2] D. S. Dawoud and R. Peplow, Digital System Design - Use of Microcontroller. Wharton, TX, USA: River Publishers, 2010. Downloadable at https://www.riverpublishers.com/pdf/ebook/RP_E9788793102293.pdf.
[3] S. S. Mukherjee, "New challenges in benchmarking future processors," in Fifth Workshop on Computer Architecture Evaluation using Commercial Workloads, Citeseer, 2002.
