
Virtual Memory

Computer Architecture
Readings
• Digital Design and Computer Architecture – David
Harris & Sarah Harris

• Chapter 8
Page Replacement Algorithms
• If physical memory is full (i.e., list of free physical pages is empty),
which physical frame to replace on a page fault?

• Is True LRU feasible?


• 4GB memory, 4KB pages, how many possibilities of ordering?

• Modern systems use approximations of LRU


• E.g., the CLOCK algorithm
• And, more sophisticated algorithms to take into account
“frequency” of use
• E.g., the ARC algorithm
• Megiddo and Modha, “ARC: A Self-Tuning, Low Overhead Replacement
Cache,” FAST 2003.
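The ordering question above can be worked through numerically. The following is a minimal sketch, assuming "4GB" means 2^32 bytes and "4KB" means 2^12 bytes; the variable names are illustrative and not from the slides.

```python
import math

MEMORY_BYTES = 4 * 2**30   # 4 GB of physical memory (assumed to mean 2^32 bytes)
PAGE_BYTES = 4 * 2**10     # 4 KB pages (assumed to mean 2^12 bytes)

num_frames = MEMORY_BYTES // PAGE_BYTES   # number of physical frames

# True LRU must be able to represent any recency ordering of the frames,
# i.e., num_frames! possible orderings. The factorial is far too large to
# print, so compute its base-2 logarithm: log2(n!) = lgamma(n + 1) / ln(2).
bits_for_ordering = math.lgamma(num_frames + 1) / math.log(2)

print(f"physical frames: {num_frames}")
print(f"bits to encode one ordering: about {bits_for_ordering:.3e}")
```

With 2^20 frames there are (2^20)! orderings, which takes on the order of 2 x 10^7 bits to encode and would have to be updated on every memory access. This is why approximations of LRU are used instead.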
CLOCK Page Replacement Algorithm
• Keep a circular list of physical frames in memory
• Keep a pointer (hand) to the last-examined frame in the list
• When a page is accessed, set the R bit in the PTE
• When a frame needs to be replaced, replace the first frame that
has the reference (R) bit not set, traversing the circular list
starting from the pointer (hand) clockwise
• During traversal, clear the R bits of examined frames
• Set the hand pointer to the next frame in the list
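The steps above can be sketched in software. This is a minimal illustration, not an operating system implementation: the `Clock` class and its method names are hypothetical, the R bit is kept in a plain list rather than in a PTE, and the hand is stored as the next frame to examine (equivalent to pointing one past the last-examined frame).

```python
class Clock:
    """Sketch of CLOCK replacement over a fixed circular list of frames."""

    def __init__(self, num_frames):
        self.pages = [None] * num_frames   # page held by each frame (None = free)
        self.ref = [False] * num_frames    # R bit of each frame
        self.hand = 0                      # next frame to examine

    def access(self, page):
        """Touch a page. Returns True on a hit, False on a page fault."""
        if page in self.pages:
            self.ref[self.pages.index(page)] = True   # set R on access
            return True
        victim = self._find_victim()
        self.pages[victim] = page          # replace the victim frame
        self.ref[victim] = True
        return False

    def _find_victim(self):
        """Return the first frame, starting at the hand, whose R bit is clear."""
        n = len(self.pages)
        while True:
            i = self.hand
            self.hand = (self.hand + 1) % n   # hand moves to the next frame
            if not self.ref[i]:
                return i
            self.ref[i] = False            # clear R of every examined frame


clock = Clock(3)
hits = [clock.access(p) for p in [1, 2, 3, 4, 2, 5]]
print(hits, clock.pages)
```

In this trace, page 2 is re-referenced before the last fault, so its R bit protects it once and page 3 is evicted instead.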
Cache versus Page Replacement
• Physical memory (DRAM) is a cache for disk
• Usually managed by system software via the virtual memory subsystem

• Page replacement is similar to cache replacement


• Page table is the “tag store” for physical memory data store

• What is the difference?


• Required speed of access to cache vs. physical memory
• Number of blocks in a cache vs. physical memory
• “Tolerable” amount of time to find a replacement candidate
(disk versus memory access latency)
• Role of hardware versus software
Memory Protection
• Multiple programs (processes) run at once
• Each process has its own page table
• Each process can use entire virtual address space without worrying
about where other programs are

• A process can only access physical pages mapped in its page table; it cannot overwrite the memory of another process
• Provides protection and isolation between processes
• Enables access control mechanisms per page
Page Table is Per Process
• Each process has its own virtual address space
• Full address space for each program
• Simplifies memory allocation, sharing, linking and
loading.
[Figure: address translation maps the virtual address spaces of Process 1 and Process 2 (each with virtual pages VP 1, VP 2, ... at addresses 0 to N-1) onto the physical address space (DRAM, addresses 0 to M-1). Physical pages shown: PP 2, PP 7, PP 10; PP 7 holds, e.g., read-only library code.]
Access Protection/Control
via Virtual Memory
Page-Level Access Control (Protection)
• Not every process is allowed to access every page
• E.g., may need supervisor level privilege to access system pages

• Idea: Store access control information on a page basis in the process’s page table

• Enforce access control at the same time as translation

→ Virtual memory system serves two functions today
• Address translation (for illusion of large physical memory)
• Access control (protection)
VM as a Tool for Memory Access Protection
◼ Extend Page Table Entries (PTEs) with permission bits
◼ Check bits on each access and during a page fault
❑ If violated, generate exception (Access Protection exception)
Page Tables

Process i:   Read?   Write?   Physical Addr
   VP 0:     Yes     No       PP 6
   VP 1:     Yes     Yes      PP 4
   VP 2:     No      No       XXXXXXX
   ...

Process j:   Read?   Write?   Physical Addr
   VP 0:     Yes     Yes      PP 6
   VP 1:     Yes     No       PP 9
   VP 2:     No      No       XXXXXXX
   ...

Memory: PP 0, PP 2, PP 4, PP 6, PP 8, PP 10, PP 12, ...
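The permission check can be sketched as part of translation. This is a simplified model, assuming 4 KB pages and a dictionary as the page table; `PTE`, `translate`, and `ProtectionFault` are hypothetical names, and the entries reproduce Process i from the table above.

```python
from dataclasses import dataclass
from typing import Optional

PAGE_SIZE = 4096  # assumed 4 KB pages


class ProtectionFault(Exception):
    """Stands in for the access protection exception."""


@dataclass
class PTE:
    readable: bool
    writable: bool
    ppn: Optional[int]   # physical page number; None where the table shows XXXXXXX


def translate(page_table, vaddr, is_write):
    """Check the permission bits, then translate a virtual address."""
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    pte = page_table[vpn]
    allowed = pte.writable if is_write else pte.readable
    if not allowed:
        raise ProtectionFault(f"VP {vpn}: {'write' if is_write else 'read'} denied")
    return pte.ppn * PAGE_SIZE + offset


# Page table of Process i from the table above.
process_i = {
    0: PTE(readable=True, writable=False, ppn=6),
    1: PTE(readable=True, writable=True, ppn=4),
    2: PTE(readable=False, writable=False, ppn=None),
}

read_addr = translate(process_i, 0x10, is_write=False)              # read of VP 0
write_addr = translate(process_i, PAGE_SIZE + 8, is_write=True)     # write to VP 1
```

A write to VP 0 or any access to VP 2 raises `ProtectionFault` in this sketch, because the check happens before the physical address is formed.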
Virtual Memory Summary
• Virtual memory gives the illusion of “infinite” capacity

• A subset of virtual pages are located in physical memory

• A page table maps virtual pages to physical pages – this is called address translation

• A TLB speeds up address translation

• Using different page tables for different programs provides memory protection
Microarchitecture
Readings
◼Digital Design and Computer Architecture - Sarah Harris,
David Harris
◼Chapter 7.1-7.3
• Instruction Set Architectures (ISA): LC-3
• Assembly programming: LC-3
• Memory Technologies, Memory Hierarchy
• Caches
• Virtual Memory
• Microarchitecture (principles & single-cycle uarch)
• Multi-cycle microarchitecture
• Pipelining
• Issues in Pipelining: Control & Data Dependence Handling,
State Maintenance and Recovery, …
•…
Recall: The Von Neumann Model
[Figure: the von Neumann model. MEMORY (Mem Addr Reg, Mem Data Reg); INPUT (keyboard, mouse, disk, ...); PROCESSING UNIT (ALU, TEMP); OUTPUT (monitor, printer, disk, ...); CONTROL UNIT (PC or IP, Inst Register).]
Recall: LC-3: A Von Neumann Machine

Recall: The Instruction Cycle

• FETCH
• DECODE
• EVALUATE ADDRESS
• FETCH OPERANDS
• EXECUTE
• STORE RESULT
Recall: The Instruction Set Architecture
◼ The ISA is the interface between what the software commands and what the hardware carries out
◼ The ISA specifies
❑ The memory organization
◼ Address space (LC-3: 2^16, MIPS: 2^32)
◼ Addressability (LC-3: 16 bits, MIPS: 8 bits)
◼ Word- or Byte-addressable
❑ The register set
◼ R0 to R7 in LC-3
◼ 32 registers in MIPS
❑ The instruction set
◼ Opcodes
◼ Data types
◼ Addressing modes
◼ Semantics of instructions
[Figure: levels of transformation, top to bottom: Problem, Algorithm, Program, ISA, Microarchitecture, Circuits, Electrons.]
Microarchitecture
• An implementation of the ISA

• How do we implement the ISA?


• We will discuss this for many lectures

• There can be many implementations of the same ISA


• MIPS R2000, R10000, …
• Intel 80486, Pentium, Pentium Pro, Pentium 4, Kaby Lake, Coffee
Lake, … AMD K5, K7, K9, Bulldozer, BobCat, …
The Von-Neumann Model
• All major instruction set architectures today use this model
• x86, ARM, MIPS, SPARC, Alpha, POWER, RISC-V, …

• Underneath (at the microarchitecture level), the execution model of almost all implementations (or, microarchitectures) is very different
• Pipelined instruction execution: Intel 80486 uarch
• Multiple instructions at a time: Intel Pentium uarch
• Out-of-order execution: Intel Pentium Pro uarch
• Separate instruction and data caches

• But, what happens underneath that is not consistent with the von Neumann model is not exposed to software
• Difference between ISA and microarchitecture
Property of ISA vs. Uarch?
• ADD instruction’s opcode
• Bit-serial adder vs. Ripple-carry adder
• Number of general purpose registers
• Number of cycles to execute the MUL instruction
• Number of ports to the register file
• Whether or not the machine employs pipelined instruction execution

• Remember
• Microarchitecture: Implementation of the ISA under
specific design constraints and goals
Implementing the ISA: Basics
Microarchitecture
Now That We Have an ISA
• How do we implement it?

• i.e., how do we design a system that obeys the hardware/software interface?

• Aside: “System” can be solely hardware or a combination of hardware and software
• “Translation of ISAs”
• A virtual ISA can be converted by “software” into an implementation
ISA

• We will assume “hardware” implementation for most lectures


How Does a Machine Process Instructions?
• What does processing an instruction mean?
• We will assume the von Neumann model (for now)

AS = Architectural (programmer visible) state before an instruction is processed

Process instruction

AS’ = Architectural (programmer visible) state after an instruction is processed

• Processing an instruction: Transforming AS to AS’ according to the ISA specification of the instruction
The Von Neumann Model/Architecture

Stored program

Sequential instruction processing


The “Process Instruction” Step
• ISA specifies abstractly what AS’ should be, given an instruction
and AS
• It defines an abstract finite state machine where
• State = programmer-visible state
• Next-state logic = instruction execution specification
• From ISA point of view, there are no “intermediate states” between AS
and AS’ during instruction execution
• One state transition per instruction

• Microarchitecture implements how AS is transformed to AS’


• There are many choices in implementation
• We can have programmer-invisible state to optimize the speed of
instruction execution: multiple state transitions per instruction
• Choice 1: AS → AS’ (transform AS to AS’ in a single clock cycle)
• Choice 2: AS → AS+MS1 → AS+MS2 → AS+MS3 → AS’ (take multiple clock cycles
to transform AS to AS’)
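The two choices can be contrasted with a toy model. The sketch below assumes a hypothetical machine whose only instruction is a register-register add and whose architectural state is a dictionary with `pc` and `regs`; the function names and the three-step split are illustrative, not a real microarchitecture.

```python
def single_cycle(state, instr):
    """Choice 1: AS -> AS' in one transition, with no intermediate state."""
    rd, rs, rt = instr
    regs = list(state["regs"])
    regs[rd] = regs[rs] + regs[rt]
    return {"pc": state["pc"] + 4, "regs": regs}


def multi_cycle(state, instr):
    """Choice 2: AS -> AS+MS1 -> AS+MS2 -> AS'. Returns (AS', cycles used)."""
    rd, rs, rt = instr
    ms = {}                                   # programmer-invisible state
    # Cycle 1: read operands into microarchitectural latches (AS + MS1).
    ms["a"], ms["b"] = state["regs"][rs], state["regs"][rt]
    # Cycle 2: execute; the result is still invisible to software (AS + MS2).
    ms["sum"] = ms["a"] + ms["b"]
    # Cycle 3: update architectural state only at the end (AS').
    regs = list(state["regs"])
    regs[rd] = ms["sum"]
    return {"pc": state["pc"] + 4, "regs": regs}, 3


AS = {"pc": 0, "regs": [0, 5, 7, 0]}
add_r3_r1_r2 = (3, 1, 2)                      # rd=3, rs=1, rt=2

as_single = single_cycle(AS, add_r3_r1_r2)
as_multi, cycles = multi_cycle(AS, add_r3_r1_r2)
print(as_single == as_multi, cycles)
```

Both functions produce the same AS’, which is the point: from the ISA's view there is one state transition per instruction, however many steps the implementation takes internally.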
A Very Basic Instruction Processing Engine
• Each instruction takes a single clock cycle to execute
• Only combinational logic is used to implement
instruction execution
• No intermediate, programmer-invisible state updates

AS = Architectural (programmer visible) state at the beginning of a clock cycle

Process instruction in one clock cycle

AS’ = Architectural (programmer visible) state at the end of a clock cycle
A Very Basic Instruction Processing Engine
• Single-cycle machine

[Figure: combinational logic computes AS’ from AS; sequential logic (state) holds AS and is updated with AS’ each clock cycle.]

• What is the clock cycle time determined by?


• What is the critical path of the combinational logic
determined by?
Recall: Programmer Visible (Architectural) State

[Figure: programmer visible (architectural) state.
Memory: array of storage locations M[0], M[1], ..., M[N-1], indexed by an address.
Registers: given special names in the ISA (as opposed to addresses); general vs. special purpose.
Program Counter: memory address of the current instruction.]

Instructions (and programs) specify how to transform the values of programmer visible state
Single-cycle vs. Multi-cycle Machines
• Single-cycle machines
• Each instruction takes a single clock cycle
• All state updates made at the end of an instruction’s execution
• Big disadvantage: The slowest instruction determines cycle time → long
clock cycle time

• Multi-cycle machines
• Instruction processing broken into multiple cycles/stages
• State updates can be made during an instruction’s execution
• Architectural state updates made at the end of an instruction’s execution
• Advantage over single-cycle: The slowest “stage” determines cycle time

◼ Both single-cycle and multi-cycle machines literally follow the von Neumann model at the microarchitecture level
Instruction Processing “Cycle”
• Instructions are processed under the direction of a “control unit”
step by step.
• Instruction cycle: Sequence of steps to process an instruction
• Fundamentally, there are six steps:

• Fetch
• Decode
• Evaluate Address
• Fetch Operands
• Execute
• Store Result

• Not all instructions require all six steps


Instruction Processing “Cycle” vs. Machine Clock Cycle

• Single-cycle machine:
• All six phases of the instruction processing cycle take a single
machine clock cycle to complete

• Multi-cycle machine:
• All six phases of the instruction processing cycle can take
multiple machine clock cycles to complete
• In fact, each phase can take multiple clock cycles to complete
Instruction Processing Viewed Another Way

• Instructions transform Data (AS) to Data’ (AS’)


• This transformation is done by functional units
• Units that “operate” on data
• These units need to be told what to do to the data

• An instruction processing engine consists of two components


• Datapath: Consists of hardware elements that deal with and transform
data signals
• functional units that operate on data
• hardware structures (e.g. wires and muxes) that enable the flow of data into
the functional units and registers
• storage units that store data (e.g., registers)
• Control logic: Consists of hardware elements that determine control
signals, i.e., signals that specify what the datapath elements should do
to the data
Single-cycle vs. Multi-cycle: Control & Data

• Single-cycle machine:
• Control signals are generated in the same clock cycle as the one
during which data signals are operated on
• Everything related to an instruction happens in one clock cycle
(serialized processing)

• Multi-cycle machine:
• Control signals needed in the next cycle can be generated in the
current cycle
• Latency of control processing can be overlapped with latency of
datapath operation (more parallelism)
Flash-Forward: Performance Analysis
• Execution time of an instruction
• {CPI} x {clock cycle time}
• Execution time of a program
• Sum over all instructions [{CPI} x {clock cycle time}]
• {# of instructions} x {Average CPI} x {clock cycle time}

• Single-cycle microarchitecture performance
• CPI = 1
• Clock cycle time = long
• Multi-cycle microarchitecture performance
• CPI = different for each instruction
• Average CPI → hopefully small
• Clock cycle time = short
• Here, we have two degrees of freedom to optimize independently
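The execution time formula can be tried with numbers. The figures below (instruction count, average CPI, and cycle times) are made up purely for illustration and do not come from any measured machine.

```python
def execution_time_ns(num_instructions, average_cpi, clock_cycle_time_ns):
    """{# of instructions} x {Average CPI} x {clock cycle time}."""
    return num_instructions * average_cpi * clock_cycle_time_ns


N = 1_000_000   # hypothetical dynamic instruction count

# Single-cycle: CPI = 1, but the slowest instruction sets a long cycle time.
single = execution_time_ns(N, average_cpi=1.0, clock_cycle_time_ns=10.0)

# Multi-cycle: higher average CPI, but a much shorter cycle time.
multi = execution_time_ns(N, average_cpi=4.2, clock_cycle_time_ns=2.0)

print(f"single-cycle: {single / 1e6:.1f} ms, multi-cycle: {multi / 1e6:.1f} ms")
```

With these assumed numbers the multi-cycle design is faster because the cycle time shrinks by more than the average CPI grows. With a different instruction mix the comparison could go the other way.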
A Single-Cycle Microarchitecture
A Closer Look
Remember…
• Single-cycle machine

[Figure: combinational logic computes AS’ from AS; sequential logic (state) holds AS and is updated with AS’ each clock cycle.]
Let’s Start with the State Elements
• Data and control inputs
[Figure: state elements and their inputs. Instruction memory (instruction address in, instruction out); program counter (PC); adder (Add, Sum); register file (5-bit read register 1 and 2, 5-bit write register, write data, read data 1 and 2, RegWrite control); ALU (3-bit ALU operation control, Zero and ALU result outputs); data memory (address, write data, read data, MemWrite and MemRead controls); sign-extension unit (16 bits to 32 bits).]
MIPS State Elements
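[Figure: MIPS state elements, each clocked by CLK. Program counter (32-bit input PC', 32-bit output PC); instruction memory (32-bit address A, 32-bit read data RD); register file (5-bit addresses A1, A2, A3; 32-bit read data RD1, RD2; 32-bit write data WD3; write enable WE3); data memory (32-bit address A, 32-bit read data RD, 32-bit write data WD, write enable WE).]
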
• Program counter:
32-bit register
• Instruction memory:
Takes input 32-bit address A and reads the 32-bit data (i.e., instruction)
from that address to the read data output RD.
• Register file:
The 32-element, 32-bit register file has 2 read ports and 1 write port
• Data memory:
Has a single read/write port. If the write enable, WE, is 1, it writes data
WD into address A on the rising edge of the clock. If the write enable is 0,
it reads address A onto RD.
Instruction Processing
• 5 generic steps
• Instruction fetch (IF)
• Instruction decode and register operand fetch (ID/RF)
• Execute/Evaluate memory address (EX/AG)
• Memory operand fetch (MEM)
• Store/writeback result (WB)
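[Figure: the five steps mapped onto the datapath: IF (PC addresses the instruction memory), ID/RF (register numbers select data from the registers), EX/AG (ALU), MEM (ALU output addresses the data memory), WB (data is written back to the registers).]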
What Is To Come: The Full MIPS Datapath
[Figure: the full single-cycle MIPS datapath. The control unit decodes Instruction [31:26] and generates RegDst, Jump, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc, and RegWrite; the ALU control also uses Instruction [5:0]. Instruction [25:21] and [20:16] select the read registers; a multiplexer picks Instruction [20:16] or [15:11] as the write register; Instruction [15:0] is sign-extended from 16 to 32 bits. The next PC is chosen among PC+4, the branch target (PCSrc2 = Br Taken, from bcond), and the jump address formed from PC+4 [31:28] and Instruction [25:0] shifted left by 2 (PCSrc1 = Jump).]

JAL, JR, JALR omitted

Another Complete Single-Cycle Processor
[Figure: a complete single-cycle processor. The control unit takes Op (Instr 31:26) and Funct (Instr 5:0) and produces MemtoReg, MemWrite, Branch, ALUControl2:0, ALUSrc, RegDst, and RegWrite; PCSrc selects between PCPlus4 and PCBranch. Register file addresses: A1 = Instr 25:21, A2 = Instr 20:16, A3 = WriteReg4:0 (Instr 20:16 or 15:11). Other datapath signals: PC', PC, Instr, SrcA, SrcB, Zero, ALUResult, WriteData, ReadData, SignImm (Instr 15:0 sign-extended, then <<2 for PCBranch), Result.]
Single-Cycle Datapath for
Arithmetic and Logical Instructions
R-Type ALU Instructions
• R-type: 3 register operands
MIPS assembly (e.g., register-register signed addition)
add $s0, $s1, $s2 #$s0=rd, $s1=rs, $s2=rt

Machine Encoding

0        rs       rt       rd       0        add (32)    R-Type
6 bits   5 bits   5 bits   5 bits   5 bits   6 bits

• Semantics
if MEM[PC] == add rd rs rt
GPR[rd] ← GPR[rs] + GPR[rt]
PC ← PC + 4
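The R-type field layout can be checked by packing the fields into a 32-bit word. This is a small sketch; `encode_r_type` is a hypothetical helper, and the register numbers 16, 17, and 18 for $s0, $s1, and $s2 follow the standard MIPS register numbering, which the slide does not spell out.

```python
def encode_r_type(rs, rt, rd, shamt, funct, opcode=0):
    """Pack the six R-type fields (6, 5, 5, 5, 5, 6 bits) into one word."""
    return (
        (opcode << 26)
        | (rs << 21)
        | (rt << 16)
        | (rd << 11)
        | (shamt << 6)
        | funct
    )


S0, S1, S2 = 16, 17, 18   # $s0, $s1, $s2 in the standard MIPS numbering
ADD_FUNCT = 32            # funct field of add, as in the encoding above

# add $s0, $s1, $s2  ->  rd=$s0, rs=$s1, rt=$s2
word = encode_r_type(rs=S1, rt=S2, rd=S0, shamt=0, funct=ADD_FUNCT)
print(f"{word:#010x}")
```

Note that the assembly operand order (rd, rs, rt) differs from the field order in the machine word (rs, rt, rd).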
(R-Type) ALU Datapath

[Figure: (R-type) ALU datapath. The PC addresses the instruction memory; Instruction 25:21 and 20:16 select read registers 1 and 2, and Instruction 15:11 selects the write register; the ALU (3-bit ALU operation) result is written back to the register file with RegWrite = 1; an adder computes PC + 4. Stages: IF, ID, EX, MEM, WB.]

if MEM[PC] == ADD rd rs rt
GPR[rd] ← GPR[rs] + GPR[rt]
PC ← PC + 4
(combinational state update logic)

**Based on original figure from [P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]
Example: ALU Design
◼ ALU operation (F2:0) comes from the control logic

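[Figure: N-bit ALU with inputs A and B and output Y. F2 controls a 2:1 multiplexer on the B input and feeds the adder, which produces Cout and the sum S; bit [N-1] of S is zero-extended; a 4:1 multiplexer controlled by F1:0 selects the N-bit output Y.]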
I-Type ALU Instructions
• I-type: 2 register operands and 1 immediate
MIPS assembly (e.g., register-immediate signed addition)
addi $s0, $s1, 5 #$s0=rt, $s1=rs

Machine Encoding

addi (8)   rs       rt       immediate    I-Type
6 bits     5 bits   5 bits   16 bits

• Semantics
if MEM[PC] == addi rt rs immediate
PC ← PC + 4
GPR[rt] ← GPR[rs] + sign-extend(immediate)
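The sign-extension step in the semantics above can be made concrete. The sketch below models the register file as a Python list and wraps results to 32 bits; real MIPS `addi` raises an exception on signed overflow, which this simplified model ignores. The function names are illustrative.

```python
def sign_extend16(imm16):
    """Interpret the low 16 bits as a two's-complement signed value."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def execute_addi(gpr, pc, rs, rt, imm16):
    """Return (new register file, new PC) after one addi."""
    new_gpr = list(gpr)
    new_gpr[rt] = (gpr[rs] + sign_extend16(imm16)) & 0xFFFFFFFF  # wrap to 32 bits
    return new_gpr, pc + 4


S0, S1 = 16, 17            # $s0 = rt, $s1 = rs (standard MIPS numbering)
gpr = [0] * 32
gpr[S1] = 10

gpr_after, pc_after = execute_addi(gpr, pc=0, rs=S1, rt=S0, imm16=5)   # addi $s0, $s1, 5
gpr_neg, _ = execute_addi(gpr, pc=0, rs=S1, rt=S0, imm16=0xFFFE)       # immediate encodes -2
print(gpr_after[S0], gpr_neg[S0], pc_after)
```

The second call shows why the immediate must be sign-extended rather than zero-extended: the 16-bit pattern 0xFFFE has to act as -2, giving 8 instead of a large positive number.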
Datapath for R and I-Type ALU Insts.

[Figure: datapath for R- and I-type ALU instructions. Compared with the R-type datapath, a multiplexer (RegDest = isItype) selects Instruction 20:16 or 15:11 as the write register, and a second multiplexer (ALUSrc = isItype) selects read data 2 or the sign-extended (16 to 32 bits) immediate as the second ALU input; RegWrite = 1. The data memory (MemWrite, MemRead) is not used. Stages: IF, ID, EX, MEM, WB.]

if MEM[PC] == ADDI rt rs immediate
GPR[rt] ← GPR[rs] + sign-extend (immediate)
PC ← PC + 4
(combinational state update logic)
Recall: ADD with one Literal in LC-3
• ADD assembly and machine code

LC-3 assembly
ADD R1, R4, #-2

Field Values
OP      DR     SR     (bit 5)   imm5
1       1      4      1         -2

Machine Code
OP      DR     SR     (bit 5)   imm5
0001    001    100    1         11110
15:12   11:9   8:6    5         4:0

[Figure: the DR and SR fields of the instruction register index the register file; imm5 is sign-extended; control signals come from the FSM.]
Single-Cycle Datapath for
Data Movement Instructions
Load Instructions
• Load 4-byte word

MIPS assembly
lw $s3, 8($s0) #$s0=rs, $s3=rt

Machine Encoding
op        rs=base   rt       imm=offset
lw (35)   base      rt       offset        I-Type
31:26     25:21     20:16    15:0

• Semantics
if MEM[PC] == lw rt offset16 (base)
PC ← PC + 4
EA = sign-extend(offset) + GPR[base]
GPR[rt] ← MEM[ translate(EA) ]
LW Datapath

[Figure: LW datapath. The ALU adds GPR[base] and the sign-extended offset to form the address for the data memory; the data read is written to register rt. Control values: ALU operation = add, RegDest = isItype, ALUSrc = isItype, MemWrite = 0, MemRead = 1, RegWrite = 1. Stages: IF, ID, EX, MEM, WB.]

if MEM[PC] == LW rt offset16 (base)
EA = sign-extend(offset) + GPR[base]
GPR[rt] ← MEM[ translate(EA) ]
PC ← PC + 4
(combinational state update logic)
Store Instructions
• Store 4-byte word

MIPS assembly
sw $s3, 8($s0) #$s0=rs, $s3=rt

Machine Encoding
op        rs=base   rt       imm=offset
sw (43)   base      rt       offset        I-Type
31:26     25:21     20:16    15:0

• Semantics
if MEM[PC] == sw rt offset16 (base)
PC ← PC + 4
EA = sign-extend(offset) + GPR[base]
MEM[ translate(EA) ] ← GPR[rt]
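The load and store semantics share the same effective address calculation. The sketch below assumes an identity `translate` (no virtual memory), models memory as a dictionary keyed by byte address, and ignores alignment checks; all function names are illustrative.

```python
def sign_extend16(value):
    """Interpret the low 16 bits as a two's-complement signed value."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def effective_address(gpr, base, offset16):
    """EA = sign-extend(offset) + GPR[base], wrapped to 32 bits."""
    return (gpr[base] + sign_extend16(offset16)) & 0xFFFFFFFF


def lw(gpr, mem, rt, base, offset16):
    """GPR[rt] <- MEM[EA]; untouched memory reads as 0 in this model."""
    gpr[rt] = mem.get(effective_address(gpr, base, offset16), 0)


def sw(gpr, mem, rt, base, offset16):
    """MEM[EA] <- GPR[rt]."""
    mem[effective_address(gpr, base, offset16)] = gpr[rt]


T0, S0, S3 = 8, 16, 19     # $t0, $s0, $s3 in the standard MIPS numbering
gpr = [0] * 32
mem = {}
gpr[S0] = 0x1000
gpr[S3] = 0xDEADBEEF

sw(gpr, mem, rt=S3, base=S0, offset16=8)   # sw $s3, 8($s0)
lw(gpr, mem, rt=T0, base=S0, offset16=8)   # lw $t0, 8($s0)
print(hex(gpr[T0]))
```

Both instructions use the ALU only to add the base register and the offset, which is why the load and store datapaths can share the same adder path.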
SW Datapath

[Figure: SW datapath. The ALU adds GPR[base] and the sign-extended offset to form the address for the data memory; read data 2 (GPR[rt]) is the write data. Control values: ALU operation = add, RegDest = isItype, ALUSrc = isItype, MemWrite = 1, MemRead = 0, RegWrite = 0. Stages: IF, ID, EX, MEM, WB.]

if MEM[PC] == SW rt offset16 (base)
EA = sign-extend(offset) + GPR[base]
MEM[ translate(EA) ] ← GPR[rt]
PC ← PC + 4
(combinational state update logic)
Load-Store Datapath

[Figure: combined load-store datapath. Control values: ALU operation = add, RegDest = isItype, ALUSrc = isItype, MemWrite = isStore, MemRead = isLoad, RegWrite = !isStore.]
Datapath for Non-Control-Flow Insts.

[Figure: datapath for all non-control-flow instructions. Control values: RegDest = isItype, ALUSrc = isItype, MemWrite = isStore, MemRead = isLoad, RegWrite = !isStore, MemtoReg = isLoad (selects the data memory output instead of the ALU result as the register write data).]
Single-Cycle Datapath for
Control Flow Instructions
Jump Instruction
• Unconditional branch or jump
j target

j (2)     immediate    J-Type
6 bits    26 bits

• 2 = opcode
• immediate (target) = target address

• Semantics
if MEM[PC] == j immediate26
target = { PC✝[31:28], immediate26, 2’b00 }
PC ← target
✝This is the incremented PC
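The concatenation in the jump semantics can be written with shifts and masks. This is a minimal sketch; `jump_target` is a hypothetical helper and the example addresses are arbitrary.

```python
def jump_target(pc, immediate26):
    """target = { (PC + 4)[31:28], immediate26, 2'b00 }."""
    pc_plus_4 = (pc + 4) & 0xFFFFFFFF            # the incremented PC
    upper = pc_plus_4 & 0xF0000000               # keep bits 31:28
    lower = (immediate26 & 0x03FFFFFF) << 2      # 26-bit field, word-aligned
    return upper | lower


t1 = jump_target(0x00400000, 0x0100004)
t2 = jump_target(0x90000000, 0x0000001)
print(hex(t1), hex(t2))
```

Because only 28 bits come from the instruction, a `j` can reach targets only within the 256 MB region that shares the upper four bits of the incremented PC.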
Unconditional Jump Datapath

[Figure: unconditional jump datapath. isJ drives PCSrc, which selects the jump target formed by concatenation (concat) instead of PC + 4. The ALU operation and ALUSrc are don't-cares (X); MemWrite = 0, MemRead = 0, RegWrite = 0.]

if MEM[PC] == J immediate26
PC = { PC[31:28], immediate26, 2’b00 }

What about JR, JAL, JALR?
Other Jumps in MIPS
• jal: jump and link (function calls)
◼ Semantics
if MEM[PC] == jal immediate26
$ra ← PC + 4
target = { PC✝[31:28], immediate26, 2’b00 }
PC ← target

• jr: jump register
◼ Semantics
if MEM[PC] == jr rs
PC ← GPR[rs]

• jalr: jump and link register
◼ Semantics
if MEM[PC] == jalr rs
$ra ← PC + 4
PC ← GPR[rs]

✝This is the incremented PC



Conditional Branch Instructions
◼ beq (Branch if Equal)

beq $s0, $s1, offset #$s0=rs,$s1=rt

beq (4)   rs       rt       immediate=offset    I-Type
6 bits    5 bits   5 bits   16 bits

◼ Semantics (assuming no branch delay slot)
if MEM[PC] == beq rs rt immediate16
target = PC✝ + sign-extend(immediate) x 4
if GPR[rs]==GPR[rt] then PC ← target
else PC ← PC + 4

❑ Variations: beq, bne, blez, bgtz

✝This is the incremented PC
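The branch semantics above (no delay slot) reduce to a next-PC function. This is a small sketch; `beq_next_pc` is a hypothetical helper and the example addresses are arbitrary.

```python
def sign_extend16(value):
    """Interpret the low 16 bits as a two's-complement signed value."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def beq_next_pc(pc, rs_value, rt_value, immediate16):
    """Next PC for beq, assuming no branch delay slot."""
    pc_plus_4 = (pc + 4) & 0xFFFFFFFF
    # The offset counts instructions, so it is multiplied by 4 and added
    # to the incremented PC, not to the PC of the branch itself.
    target = (pc_plus_4 + sign_extend16(immediate16) * 4) & 0xFFFFFFFF
    return target if rs_value == rt_value else pc_plus_4


taken_forward = beq_next_pc(0x100, 7, 7, 3)          # taken, offset +3
taken_backward = beq_next_pc(0x100, 7, 7, 0xFFFF)    # taken, offset -1
not_taken = beq_next_pc(0x100, 7, 8, 3)              # registers differ
print(hex(taken_forward), hex(taken_backward), hex(not_taken))
```

An offset of -1 branches back to the `beq` itself, which shows that the offset is relative to the incremented PC.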
Conditional Branch Datapath (for you to finish)

[Figure: conditional branch datapath. An adder computes the branch target from PC + 4 (from the instruction datapath; marked "watch out") and the sign-extended (16 to 32 bits) immediate shifted left by 2. The ALU performs sub on the two register operands, and its Zero output (bcond) goes to the branch control logic, which drives PCSrc. RegWrite = 0.]

How to uphold the delayed branch semantics?



Putting It All Together
[Figure: the full single-cycle MIPS datapath. The control unit decodes Instruction [31:26] and generates RegDst, Jump, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc, and RegWrite; the ALU control also uses Instruction [5:0]. Instruction [25:21] and [20:16] select the read registers; a multiplexer picks Instruction [20:16] or [15:11] as the write register; Instruction [15:0] is sign-extended from 16 to 32 bits. The next PC is chosen among PC+4, the branch target (PCSrc2 = Br Taken, from bcond), and the jump address formed from PC+4 [31:28] and Instruction [25:0] shifted left by 2 (PCSrc1 = Jump).]

JAL, JR, JALR omitted
