Operating Systems II

Steven Hand
Michaelmas Term 2005
8 lectures for CST IB
Operating Systems II — P/L/TT/11
Course Aims
This course aims to:
• impart a detailed understanding of the algorithms
and techniques used within operating systems,
• consolidate the knowledge learned in earlier
courses, and
• help students appreciate the trade-offs involved in
designing and implementing an operating system.
Why another operating systems course?
• OSes are some of the largest software systems
around ⇒ illustrate many s/w engineering issues.
• OSes motivate most of the problems in concurrency
• more people end up ‘writing OSes’ than you think
– modifications to existing systems (e.g. linux)
– embedded software for custom hardware
– research operating systems are lots of fun
• various subsystems not covered to date. . .
Course Outline
• Introduction and Review:
OS functions & structures. Virtual processors
(processes and threads).
• CPU Scheduling.
Scheduling for multiprocessors. RT scheduling
(RM, EDF). SRT/multimedia scheduling.
• Memory Management.
Virtual addresses and translation schemes.
Demand paging. Page replacement. Frame
allocation. VM case studies. Other VM techniques.
• Storage Systems.
Disks & disk scheduling. Caching, buffering. Filing
systems. Case studies (FAT, FFS, NTFS, LFS).
• Protection.
Subjects and objects. Authentication schemes.
Capability systems.
• Conclusion.
Course summary. Exam info.
Recommended Reading
• Bacon J [ and Harris T ]
Concurrent Systems or Operating Systems
Addison Wesley 1997, Addison Wesley 2003
• Tanenbaum A S
Modern Operating Systems (2nd Ed)
Prentice Hall 2001
• Silberschatz A, Peterson J and Galvin P
Operating Systems Concepts (5th Ed)
Addison Wesley 1998
• Leffler S J
The Design and Implementation of the 4.3BSD
UNIX Operating System.
Addison Wesley 1989
• Solomon D [ and Russinovich M ]
Inside Windows NT (2nd Ed) or Inside Windows
2000 (3rd Ed)
Microsoft Press 1998, Microsoft Press 2000
• OS links (via course web page)
http://www.cl.cam.ac.uk/Teaching/current/OpSysII/
Operating System Functions
[Diagram: applications App 1, App 2, …, App N running on top of the Operating System, which runs on the Hardware]
An operating system is a collection of software which:
• securely multiplexes resources, i.e.
– protects applications from each other, yet
– shares physical resources between them.
• provides an abstract virtual machine, e.g.
– time-shares CPU to provide virtual processors,
– allocates and protects memory to provide
per-process virtual address spaces,
– presents h/w independent virtual devices.
– divides up storage space by using filing systems.
And ideally it does all this efficiently and robustly.
Hardware Support for Operating Systems
Recall that OS should securely multiplex resources.
⇒ we need to ensure that an application cannot:
• compromise the operating system.
• compromise other applications.
• deny others service (e.g. abuse resources)
To achieve this efficiently and flexibly, we need
hardware support for (at least) dual-mode operation.
Then we can:
• add memory protection hardware
⇒ applications confined to subset of memory;
• make I/O instructions privileged
⇒ applications cannot directly access devices;
• use a timer to force execution interruption
⇒ OS cannot be starved of CPU.
Most modern hardware provides protection using
these techniques (c/f Computer Design course).
Operating System Structures
[Diagrams: left, a kernel-based OS, where unprivileged applications invoke a privileged kernel (scheduler, device drivers, file system, protocol code) via system calls; right, a microkernel, where a minimal privileged kernel (scheduler, device drivers) supports file system, device and other servers running as unprivileged processes alongside applications]
Traditionally have had two main variants:
1. Kernel-based (lhs above)
• set of OS services accessible via software
interrupt mechanism called system calls.
2. Microkernel-based (rhs above)
• push various OS services into server processes
• access servers via some interprocess
communication (IPC) scheme.
• increased modularity (decreased performance?)
Vertically Structured Operating Systems
[Diagram: a vertically structured OS, where each unprivileged application links against shared library code providing OS functionality (syscalls, driver stubs), above a minimal privileged scheduler, device drivers and system processes]
• Consider interface people really see, e.g.
– set of programming libraries / objects.
– a command line interpreter / window system.
• Separate concepts of protection and abstraction ⇒
get extensibility, accountability & performance.
• Examples: Nemesis, Exokernel, Cache Kernel.
We’ll see more on this next year. . .
Virtual Processors
Why virtual processors (VPs) ?
• to provide the illusion that a computer is doing
more than one thing at a time;
• to increase system throughput (i.e. run a thread
when another is blocked on I/O);
• to encapsulate an execution context;
• to provide a simple programming paradigm.
VPs implemented via processes and threads:
• A process (or task) is a unit of resource ownership
— a process is allocated a virtual address space,
and control of some resources.
• A thread (or lightweight process) is a unit of
dispatching — a thread has an execution state and
a set of scheduling parameters.
• OS stores information about processes and threads
in process control blocks (PCBs) and thread
control blocks (TCBs) respectively.
• In general, have 1 process ↔n threads, n ≥ 1
⇒ PCB holds references to one or more TCBs.
Thread Architectures
• User-level threads
– Kernel unaware of threads’ existence.
– Thread management done by application using
an unprivileged thread library.
– Pros: lightweight creation/termination; fast ctxt
switch (no kernel trap); application-specific
scheduling; OS independence.
– Cons: non-preemption; blocking system calls;
cannot utilise multiple processors.
– e.g. FreeBSD pthreads
• Kernel-level threads
– All thread management done by kernel.
– No thread library (but augmented API).
– Scheduling can be two-level, or direct.
– Pros: can utilise multiple processors; blocking
system calls just block thread; preemption easy.
– Cons: higher overhead for thread mgt and
context switching; less flexible.
– e.g. Windows NT/2K, Linux (?).
Hybrid schemes also exist. . . (see later)
Thread Scheduling Algorithms
[State diagram: New -(admit)-> Ready -(dispatch)-> Running -(release)-> Exit; Running -(timeout or yield)-> Ready; Running -(event-wait)-> Blocked -(event)-> Ready]
A scheduling algorithm is used to decide which ready
thread(s) should run (and for how long).
Typically use dynamic priority scheduling:
• each thread has an associated [integer] priority.
• to schedule: select highest priority ready thread(s)
• to resolve ties: use round-robin within priority
– different quantum per priority?
– CPU bias: per-thread quantum adaption?
• to avoid starvation: dynamically vary priorities
• e.g. BSD Unix: 128 pris, 100ms fixed quantum,
load- and usage-dependent priority damping.
• e.g. Windows NT/2K: 15 dynamic pris, adaptive
∼20ms quantum; priority boosts, then decays.
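The select-highest-priority-then-round-robin rule above can be sketched as follows (a minimal illustration; the `Scheduler` class and its method names are invented for the example, with priority 0 taken as highest):

```python
from collections import deque

class Scheduler:
    def __init__(self, num_priorities):
        # one FIFO ready queue per priority level (0 = highest)
        self.queues = [deque() for _ in range(num_priorities)]

    def make_ready(self, thread, priority):
        self.queues[priority].append(thread)

    def pick_next(self):
        # select the highest-priority non-empty queue; round-robin
        # within it by rotating the chosen thread to the tail
        for q in self.queues:
            if q:
                thread = q.popleft()
                q.append(thread)   # runs again after its peers
                return thread
        return None                # nothing ready: idle
```

Dynamic priority adjustment (priority damping, boosts) would simply move threads between queues before `pick_next` is called.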
Multiprocessors
Two main kinds of [shared-memory] multiprocessor:
1. Uniform Memory Access (UMA), aka SMP.
[Diagram: four CPUs, each with a private cache, sharing a single main memory over a common bus]
• all (main) memory takes the same time to access
• scales only to 4, 8 processors.
2. Non-Uniform Memory Access (NUMA).
[Diagram: four CPUs, each with a private cache and a local memory module, connected by an interconnect]
• rarer and more expensive
• can have 16, 64, 256 CPUs . . .
Whole area becoming more important. . .
Multiprocessor Operating Systems
Multiprocessor OSes may be roughly classed as either
symmetric or asymmetric.
• Symmetric Operating Systems:
– identical system image on each processor
⇒ convenient abstraction.
– all resources directly shared
⇒ high synchronisation cost.
– typical scheme on SMP (e.g. Linux, NT).
• Asymmetric Operating Systems:
– partition functionality among processors.
– better scalability (and fault tolerance?)
– partitioning can be static or dynamic.
– common on NUMA (e.g. Hive, Hurricane).
– NB: asymmetric ⇏ trivial “master-slave”
• Also get hybrid schemes, e.g. Disco:
– (re-)introduce virtual machine monitor
– can fake out SMP (but is this wise?)
– can run multiple OSes simultaneously. . .
Multiprocessor Scheduling (1)
• Objectives:
– Ensure all CPUs are kept busy.
– Allow application-level parallelism.
• Problems:
– Preemption within critical sections:
∗ thread preempted while holding a spinlock.
⇒ other threads can waste many CPU cycles.
∗ similar situation with producer/consumer
threads (i.e. wasted schedule).
– Cache pollution:
∗ if thread from different application runs on a
given CPU, lots of compulsory misses.
∗ generally, scheduling a thread on a new
processor is expensive.
∗ (can get degradation of a factor of 10 or more)
– Frequent context switching:
∗ if number of threads greatly exceeds the
number of processors, get poor performance.
Multiprocessor Scheduling (2)
Consider basic ways in which one could adapt
uniprocessor scheduling techniques:
• Central Queue:
+ simple extension of uniprocessor case.
+ load-balancing performed automatically.
− n-way mutual exclusion on queue.
− inefficient use of caches.
− no support for application-level parallelism.
• Dedicated Assignment:
+ contention reduced to thread creation/exit.
+ better cache locality.
− lose strict priority semantics.
− can lead to load imbalance.
Are there better ways?
Multiprocessor Scheduling (3)
• Processor Affinity:
– modification of central queue.
– threads have affinity for a certain processor ⇒
can reduce cache problems.
– but: load balance problem again.
– make dynamic? (cache affinity?)
• ‘Take’ Scheduling:
– pseudo-dedicated assignment: idle CPU “takes”
task from most loaded.
– can be implemented cheaply.
– nice trade-off: load high ⇒ no migration.
• Co-scheduling / Gang Scheduling:
– Simultaneously schedule “related” threads.
⇒ can reduce wasted context switches.
– Q: how to choose members of gang?
– Q: what about cache performance?
Example: Mach
• Basic model: dynamic priority with central queue.
• Processors grouped into disjoint processor sets:
– Each processor set has 32 shared ready queues
(one for each priority level).
– Each processor has own local ready queue:
absolute priority over global threads.
• Increase quantum when number of threads is small
– ‘small’ means #threads < 2 × #CPUs
– idea is to have a sensible effective quantum
– e.g. 10 processors, 11 threads
∗ if use default 100ms quantum, each thread
spends an expected 10ms on runqueue
∗ instead stretch quantum to 1s ⇒ effective
quantum is now 100ms.
• Applications provide hints to improve scheduling:
1. discouragement hints: mild, strong and absolute
2. handoff hints (aka “yield to”) — can improve
producer-consumer synchronization
• Simple gang scheduling used for allocation.
MP Thread Architectures
[Diagram: Solaris-style hybrid threading, with user threads in each process multiplexed onto LWPs, each LWP backed by a kernel thread; kernel threads are scheduled onto CPU1 and CPU2, and some LWPs are bound to a particular CPU]
Want benefits of both user and kernel threads without
any of the drawbacks ⇒ use hybrid scheme.
E.g. Solaris 2 uses three-level scheduling:
• 1 kernel thread ↔ 1 LWP ↔ n user threads
• user-level thread scheduler ⇒ lightweight & flexible
• LWPs allow potential multiprocessor benefit:
– more LWPs ⇒ more scope for true parallelism
– LWPs can be bound to individual processors ⇒
could in theory have user-level MP scheduler
– kernel scheduler is relatively cache agnostic
(although have processor sets (≠ Mach’s)).
Overall: either first-class threads (Psyche) or
scheduler activations probably better for MP.
Real-Time Systems
• Need to both produce correct results and meet
predefined deadlines.
• “Correctness” of output related to time delay it
requires to be produced, e.g.
– nuclear reactor safety system
– JIT manufacturing
– video on demand
• Typically distinguish hard real-time (HRT) and soft
real-time (SRT):
HRT — output value = 100% before the
deadline, 0 (or less) after the deadline.
SRT — output value = 100% before the
deadline, (100 - f(t))% if t seconds late.
• Building such systems is all about predictability.
• It is not about speed.
Real-Time Scheduling
[Timeline for task i: arrives at time a_i, needs s_i units of CPU time before its deadline d_i; for a periodic task this repeats every p_i units]
• Basic model:
– consider set of tasks T_i, each of which arrives at
time a_i and requires s_i units of CPU time before
a (real-time) deadline of d_i.
– often extended to cope with periodic tasks:
require s_i units every p_i units.
• Best-effort techniques give no predictability
– in general priority specifies what to schedule but
not when or how much.
– i.e. CPU allocation for thread t_i with priority p_i
depends on all other threads t_j s.t. p_j ≥ p_i.
– with dynamic priority adjustment becomes even
more difficult.
⇒ need something different.
Static Offline Scheduling
Advantages:
• Low run-time overhead.
• Deterministic behaviour.
• System-wide optimization.
• Resolve dependencies early.
• Can prove system properties.
Disadvantages:
• Inflexibility.
• Low utilisation.
• Potentially large schedule.
• Computationally intensive.
In general, offline scheduling only used when
determinism is the overriding factor, e.g. MARS.
Static Priority Algorithms
Most common is Rate Monotonic (RM)
• Assign static priorities to tasks off-line (or at
‘connection setup’), high-frequency tasks receiving
high priorities.
• Tasks then processed with no further
rearrangement of priorities required (⇒ reduces
scheduling overhead).
• Optimal, static, priority-driven alg. for preemptive,
periodic jobs: i.e. no other static algorithm can
schedule a task set that RM cannot schedule.
• Admission control: the schedule calculated by RM
is always feasible if the total utilisation of the
processor is less than ln 2 ≈ 69%.
• For many task sets RM produces a feasible
schedule for higher utilisation (up to ∼ 88%); if
periods harmonic, can get 100%.
• Predictable operation during transient overload.
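The utilisation-based admission test can be sketched as below (a minimal illustration; the function name is invented). For n tasks the sufficient bound is n(2^(1/n) − 1), which falls towards ln 2 as n grows:

```python
# RM admission control sketch: a periodic task set, where task i needs
# s units of CPU every p units, is guaranteed schedulable by RM if its
# total utilisation stays under the n(2^(1/n) - 1) bound.

def rm_admissible(tasks):
    """tasks: list of (s, p) pairs. True if the sufficient
    utilisation bound guarantees a feasible RM schedule."""
    n = len(tasks)
    utilisation = sum(s / p for (s, p) in tasks)
    return utilisation <= n * (2 ** (1 / n) - 1)
```

Note the test is sufficient but not necessary: task sets above the bound (e.g. with harmonic periods) may still be schedulable.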
Dynamic Priority Algorithms
Most popular is Earliest Deadline First (EDF):
• Scheduling pretty simple:
– keep queue of tasks ordered by deadline
– dispatch the one at the head of the queue.
• EDF is an optimal, dynamic algorithm:
– it may reschedule periodic tasks in each period
– if a task set can be scheduled by any priority
assignment, it can be scheduled by EDF
• Admission control: EDF produces a feasible
schedule whenever processor utilisation is ≤ 100%.
• Problem: scheduling overhead can be large.
• Problem: if system overloaded, all bets are off.
Notes:
1. Also get least slack-time first (LSTF):
• similar, but not identical
• e.g. A: 2ms every 7ms; B: 4ms every 8ms
2. RM, EDF, LSTF all preemptive (c/f P3Q7, 2000).
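The EDF dispatch rule and its 100% admission test can be sketched as follows (an illustrative fragment; the class and function names are invented):

```python
import heapq

def edf_admissible(tasks):
    # EDF admits any periodic task set (s units every p units)
    # whose total utilisation is at most 100%
    return sum(s / p for (s, p) in tasks) <= 1.0

class EDFQueue:
    def __init__(self):
        self.heap = []   # tasks keyed by absolute deadline

    def release(self, deadline, task):
        heapq.heappush(self.heap, (deadline, task))

    def dispatch(self):
        # always run the task at the head of the deadline order
        return heapq.heappop(self.heap)[1] if self.heap else None
```

Contrast with RM: the task set (3 ms every 4 ms, 2 ms every 8 ms) has utilisation exactly 1, so EDF admits it while the RM bound does not.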
Priority Inversion
• All priority-based schemes can potentially suffer
from priority inversion:
• e.g. consider low, medium and high priority
processes called P_l, P_m and P_h respectively.
1. first P_l admitted, and locks a semaphore o.
2. then other two processes enter.
3. P_h runs since highest priority, tries to lock o and
blocks.
4. then P_m gets to run, thus preventing P_l from
releasing o, and hence P_h from running.
• Usual solution is priority inheritance:
– associate with every semaphore o the priority P
of the highest priority process waiting for it.
– then temporarily boost priority of holder of
semaphore up to P.
– can use handoff scheduling to implement.
• NT “solution”: priority boost for CPU starvation
– checks if ∃ ready thread not run ≥ 300 ticks.
– if so, doubles quantum & boosts priority to 15
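The inheritance mechanism can be sketched as follows (a toy model; `Thread` and `Mutex` and their fields are invented for the example, with larger numbers meaning higher priority):

```python
class Thread:
    def __init__(self, name, priority):
        self.name = name
        self.base_priority = priority
        self.priority = priority   # effective (possibly boosted)

class Mutex:
    def __init__(self):
        self.holder = None
        self.waiters = []

    def acquire(self, thread):
        if self.holder is None:
            self.holder = thread
            return True
        self.waiters.append(thread)
        # boost the holder to the highest waiting priority
        top = max(w.priority for w in self.waiters)
        self.holder.priority = max(self.holder.priority, top)
        return False               # caller blocks

    def release(self):
        self.holder.priority = self.holder.base_priority  # undo boost
        self.holder = self.waiters.pop(0) if self.waiters else None
```

With the boost in place, P_m can no longer preempt P_l while P_h is waiting, so the lock is released promptly.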
Multimedia Scheduling
• Increasing interest in multimedia applications (e.g.
video conferencing, mp3 player, 3D games).
• Challenges OS since require presentation (or
processing) of data in a timely manner.
• OS needs to provide sufficient control so that apps
behave well under contention.
• Main technique: exploit SRT scheduling.
• Effective since:
– the value of multimedia data depends on the
timeliness with which it is presented/processed.
⇒ real-time scheduling allows apps to receive
sufficient and timely resource allocation to
handle their needs even when the system is
under heavy load.
– multimedia data streams are often somewhat
tolerant of information loss.
⇒ informing applications and providing soft
guarantees on resources are sufficient.
• Still ongoing research area. . .
Example: Lottery Scheduling (MIT)
[Diagram: a ticket line 0–20 divided into blocks of 10, 2, 5, 1 and 2 tickets (total = 20); a random draw in [1..20] = 15 lands in one block, whose owner is the winner]
• Basic idea:
– distribute set of tickets between processes.
– to schedule: select a ticket [pseudo-]randomly,
and allocate resource (e.g. CPU) to the winner.
– approximates proportional share
• Why would we do this?
– simple uniform abstraction for resource
management (e.g. use tickets for I/O also)
– “solve” priority inversion with ticket transfer.
• How well does it work?
– σ² = np(1 − p) ⇒ accuracy improves with √n
– i.e. “asymptotically fair”
• Stride scheduling much better. . .
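A draw can be sketched as below (illustrative only; the function name and the injectable `rng` parameter are invented for testability):

```python
import random

def hold_lottery(tickets, rng=random):
    """tickets: dict process -> ticket count. Draw one ticket
    uniformly and return the process holding it."""
    total = sum(tickets.values())
    draw = rng.randint(1, total)      # ticket number in [1..total]
    for proc, count in tickets.items():
        if draw <= count:
            return proc               # this block holds the ticket
        draw -= count
```

Each process wins with probability proportional to its ticket count, which is exactly the proportional-share approximation described above.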
Example: BVT (Stanford)
• Lottery/stride scheduling don’t explicitly support
latency sensitive threads:
– some applications don’t require much CPU but
need it at the right time
– e.g. MPEG player application requires CPU
every frame time.
• Borrowed Virtual Time (BVT) uses a number of
techniques to try to address this.
– execution of threads measured in virtual time
– for each thread t_i maintain its actual virtual
time A_i and effective virtual time E_i
– E_i = A_i − (warpOn ? W_i : 0), where W_i is the
virtual time warp for t_i
– always run thread with lowest E_i
– after thread runs for ∆ time units, update
A_i ← A_i + (∆ × m_i)
– (m_i is an inversely proportional weight)
– also have warp time limit L_i and unwarp time
requirement U_i to prevent pathological cases
• Overall: lots of confusing parameters to set. . .
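The core bookkeeping (ignoring warp limits L_i and U_i) can be sketched as below; the class and field names are invented, and a larger share corresponds to a smaller m_i:

```python
class BVTThread:
    def __init__(self, name, m, warp=0, warp_on=False):
        self.name = name
        self.m = m            # virtual-time advance rate (1/share)
        self.A = 0            # actual virtual time A_i
        self.warp = warp      # W_i: credit for latency-sensitive work
        self.warp_on = warp_on

    def effective(self):
        # E_i = A_i - (warpOn ? W_i : 0)
        return self.A - (self.warp if self.warp_on else 0)

def run_next(threads, delta):
    # dispatch lowest effective virtual time, charge it for the run
    t = min(threads, key=lambda t: t.effective())
    t.A += delta * t.m
    return t.name
```

A warped thread (e.g. an MPEG player) jumps ahead of CPU-bound threads until its borrowed virtual time is used up, then falls back into the normal share order.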
Example: Atropos (CUCL)
[State diagram: tasks move between Run, Wait and Block states via Allocate, Deschedule, Unblock and Wakeup transitions, driven by interrupts or system calls; the scheduler tries EDF tasks first, then OPT (slack time), then IDLE]
• Basic idea:
– use EDF with implicit deadlines to effect
proportional share over explicit timescales
– if no EDF tasks runnable, schedule best-effort
• Scheduling parameters are (s, p, x) :
– requests s milliseconds per p milliseconds
– x means “eligible for slack time”
• Uses explicit admission control
• Actual scheduling is easy (∼200 lines C)
Virtual Memory Management
• Limited physical memory (DRAM), need space for:
– operating system image
– processes (text, data, heap, stack, . . . )
– I/O buffers
• Memory management subsystem deals with:
– Support for address binding (i.e. loading,
dynamic linking).
– Allocation of limited physical resources.
– Protection & sharing of ‘components’.
– Providing convenient abstractions.
• Quite complex to implement:
– processor-, motherboard-specific.
– trade-offs keep shifting.
• Coming up in this section:
– virtual addresses and address translation,
– demand paged virtual memory management,
– 2 case studies (Unix and VMS), and
– a few other VM-related issues.
Logical vs Physical Addresses (1)
Old systems directly accessed [physical] memory,
which caused some problems, e.g.
• Contiguous allocation:
– need large lump of memory for process
– with time, get [external] fragmentation
⇒ require expensive compaction
• Address binding (i.e. dealing with absolute
addressing):
– “int x; x = 5;” → “movl $0x5, ????”
– compile time ⇒ must know load address.
– load time ⇒ work every time.
– what about swapping?
• Portability:
– how much memory should we assume a
“standard” machine will have?
– what happens if it has less? or more?
Can avoid lots of problems by separating concept of
logical (or virtual) and physical addresses.
Logical vs Physical Addresses (2)
[Diagram: the CPU issues logical addresses to the MMU, which either translates them to physical addresses sent to memory, or raises a fault to the OS]
Run time mapping from logical to physical addresses
performed by special hardware (the MMU).
If we make this mapping a per process thing then:
• Each process has own address space.
• Allocation problem split:
– virtual address allocation easy.
– allocate physical memory ‘behind the scenes’.
• Address binding solved:
– bind to logical addresses at compile-time.
– bind to real addresses at load time/run time.
Two variants: segmentation and paging.
Segmentation
[Diagram: the CPU issues a logical address (s, o); the MMU selects segment register s, holding a (base, limit) pair, and checks o against the limit — on failure it raises an address fault, otherwise base + o is sent to memory]
• MMU has a set (≥ 1) of segment registers (e.g.
orig x86 had four: cs, ds, es and ss).
• CPU issues tuple (s, o):
1. MMU selects segment s.
2. Checks o ≤ limit.
3. If ok, forwards base+o to memory controller.
• Typically augment translation information with
protection bits (e.g. read, write, execute, etc.)
• Overall, nice logical view (protection & sharing)
• Problem: still have [external] fragmentation.
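The translation step can be sketched in a few lines (illustrative only; the function name and the list-of-pairs representation of the segment registers are invented):

```python
def translate(seg_regs, s, o):
    """seg_regs: list of (base, limit) pairs, one per segment
    register. Translate logical address (s, o), faulting if the
    offset exceeds the segment limit."""
    base, limit = seg_regs[s]
    if o > limit:
        raise MemoryError("address fault: offset beyond segment limit")
    return base + o
```

A real MMU would also consult the protection bits here, in parallel with the limit check.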
Paging
[Diagram: the CPU issues a logical address (p, o); the page table maps page p to frame f, and (f, o) forms the physical address sent to memory]
1. Physical memory: f frames, each 2^s bytes.
2. Virtual memory: p pages, each 2^s bytes.
3. Page table maps {0, …, p − 1} → {0, …, f − 1}
4. Allocation problem has gone away!
Typically have p ≫ f ⇒ add valid bit to say if a
given page is represented in physical memory.
• Problem: now have internal fragmentation.
• Problem: protection/sharing now per page.
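The paged translation, including the valid-bit check, can be sketched as below (illustrative; the dict-based page table and 4 KB page size are assumptions for the example):

```python
PAGE_SHIFT = 12                 # assume 4 KB pages, i.e. s = 12

def translate(page_table, vaddr):
    """page_table: dict page -> frame; absence models a clear
    valid bit. Split vaddr into (page, offset) and translate."""
    page = vaddr >> PAGE_SHIFT
    offset = vaddr & ((1 << PAGE_SHIFT) - 1)
    if page not in page_table:
        raise MemoryError("page fault")   # invalid or non-resident
    return (page_table[page] << PAGE_SHIFT) | offset
```

The faulting path is where demand paging (covered later) hooks in: a non-resident page is fetched from backing store and the mapping installed before retrying.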
Segmentation versus Paging
             logical view    allocation
Segmentation     good           poor
Paging           poor           good
⇒ try combined scheme.
• E.g. paged segments (Multics, OS/2)
– divide each segment s_i into k = ⌈l_i / 2^n⌉ pages,
where l_i is the limit (length) of the segment.
– have page table per segment.
− high hardware cost / complexity.
− not very portable.
• E.g. software segments (most modern OSs)
– consider pages [m, …, m + l] to be a segment.
– OS must ensure protection / sharing kept
consistent over region.
− loss in granularity.
+ relatively simple / portable.
Translation Lookaside Buffers
Typically #pages large ⇒ page table lives in memory.
Add TLB, a fully associative cache for mapping info:
• Check each memory reference in TLB first.
• If miss ⇒ need to load info from page table:
– may be done in h/w or s/w (by OS).
– if full, replace entry (usually h/w)
• Include protection info ⇒ can perform access
check in parallel with translation.
• Context switch requires [expensive] flush:
– can add process tags to improve performance.
– “global” bit useful for wide sharing.
• Use superpages for large regions.
• So TLB contains n entries something like:
[TLB entry: Tag = (PTag, Page Number); Data = (Frame Number, protection bits X W R K, valid bit V, superpage bit S, global bit G)]
Most parts also present in page table entries (PTEs).
Page Table Design Issues
• A page table is a data-structure to translate pages
p ∈ P to frames f ∈ F ∪ {⊥}
• Various concerns need to be addressed:
– time-efficiency: we need to do a page-table
lookup every time we get a TLB miss ⇒ we
wish to minimize the time taken for each one.
– space-efficiency: since we typically have a page
table for every process, we wish to minimize the
memory required for each one.
– superpage support: ideally should support any
size of superpage the TLB understands.
– sharing: given that many processes will have
similar address spaces (fork, shared libraries),
would like easy sharing of address regions.
– sparsity: particularly for 64-bit address spaces,
|P| ≫ |F| – best if we can have page table size
proportional to |F| rather than |P|.
– cache friendly: PTEs for close together p1, p2
should themselves be close together in memory.
– simplicity: all other things being equal, the
simpler the solution the better.
Multi-Level Page Tables
[Diagram: the virtual address splits into P1, P2 and Offset; the Base Register points at the L1 Page Table, indexed by P1 to find the L2 Page Table address, which is indexed by P2 to find the Leaf PTE]
• Modern systems have 2^32 or 2^64 byte VAS ⇒ have
between 2^22 and 2^42 pages (and hence PTEs).
• Solution: use N-ary tree (N large, 256–4096)
• Keep PTBR per process and context switch.
• Advantages: easy to implement; cache friendly.
• Disadvantages:
– Potentially poor space overhead.
– Inflexibility: superpages, residency.
– Require d ≥ 2 memory references.
MPT Pseudo Code
Assuming 32-bit virtual addresses, 4K pages, 2-level
MPT (so |P1| = |P2| = 10 bits, and |offset| = 12 bits)
.entry TLBMiss:
mov r0 <- Ptbr // load page table base
mov r1 <- FaultingVA // get the faulting VA
shr r1 <- r1, #12 // clear offset bits
shr r2 <- r1, #10 // get L1 index
shl r2 <- r2, #2 // scale to sizeof(PDE)
pld r0 <- [r0, r2] // load[phys] PDE
shr r0 <- r0, #12 // clear prot bits
shl r0 <- r0, #12 // get L2 address
shl r1 <- r1, #22 // zap L1 index
shr r1 <- r1, #20 // scale to sizeof(PTE)
pld r0 <- [r0, r1] // load[phys] PTE
mov Tlb <- r0 // install TLB entry
iret // return
Linear Page Tables
[Diagram: the virtual address (20-bit page number, 12-bit offset) indexes the Linear Page Table, a 4MB region in the VAS, to fetch the PTE (stages ①, ②); the access to the LPT can itself take a nested TLB miss, resolved via the System Page Table (stages ③, ④, ⑤), after which the mappings are installed in the TLB]
• Modification of MPTs:
– typically implemented in software
– pages of LPT translated on demand
– i.e. stages ③, ④ and ⑤ not always needed.
• Advantages:
– can require just 1 memory reference.
– (initial) miss handler simple.
• But doesn’t fix sparsity / superpages.
• Guarded page tables (≈ tries) claim to fix these.
LPT Pseudo Code
Assuming 32-bit virtual addresses, 4K pages (so
|page number| = 20 bits, and |offset| = 12 bits)
.entry TLBMiss:
mov r0 <- Ptbr // load page table base
mov r1 <- FaultingVA // get the faulting VA
shr r1 <- r1, #12 // clear offset bits
shl r1 <- r1, #2 // scale to sizeof(PTE)
vld r0 <- [r0, r1] // virtual load PTE:
// - this can fault!
mov Tlb <- r0 // install TLB entry
iret // return
.entry DoubleTLBMiss:
add r2 <- r0, r1 // copy dble fault VA
mov r3 <- SysPtbr // get system PTBR
shr r2 <- r2, #12 // clear offset bits
shl r2 <- r2, #22 // zap "L1" index
shr r2 <- r2, #20 // scale to sizeof(PTE)
pld r3 <- [r3, r2] // phys load LPT PTE
mov Tlb <- r3 // install TLB entry
iret // retry vld above
Inverted Page Tables
[Diagram: the virtual address (Page Number PN, Offset); PN is hashed, h(PN) = k, and entry k of the Inverted Page Table (one entry per frame, 0 … f−1) holds the PN and control bits]
• Recall f ≪ p → keep entry per frame.
• Then table size bounded by physical memory!
• IPT: frame number is h(pn)
+ only one memory reference to translate.
+ no problems with sparsity.
+ can easily augment with process tag.
− no option on which frame to allocate.
− dealing with collisions.
− cache unfriendly.
Hashed Page Tables
[Diagram: the virtual address (Page Number PN, Offset); PN is hashed, h(PN) = k, and bucket k of the Hashed Page Table holds a PTE that records the PN explicitly along with the frame number]
• HPT: simply extend IPT into proper hash table.
• i.e. make frame number explicit.
+ can map to any frame.
+ can choose table size.
− table now bigger.
− sharing still hard.
− still cache unfriendly, no superpages.
• Can solve these last with clustered page tables.
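A hashed page table with chained collision handling can be sketched as below (illustrative; the class name, `map`/`lookup` methods and the modulo hash are invented for the example):

```python
class HashedPageTable:
    def __init__(self, nbuckets):
        # table size is a free parameter, unlike the inverted scheme
        self.buckets = [[] for _ in range(nbuckets)]

    def _hash(self, page):
        return page % len(self.buckets)   # stand-in for a real hash

    def map(self, page, frame):
        self.buckets[self._hash(page)].append((page, frame))

    def lookup(self, page):
        # colliding PTEs share a bucket; walk the chain
        for (p, f) in self.buckets[self._hash(page)]:
            if p == page:
                return f
        return None                        # not mapped -> page fault
```

Because the PTE stores the page number explicitly, any frame can be mapped to any page, which the pure inverted scheme cannot do.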
Virtual Memory
• Virtual addressing allows us to introduce the idea
of virtual memory:
– already have valid or invalid page translations;
introduce new “non-resident” designation
– such pages live on a non-volatile backing store
– processes access non-resident memory just as if
it were ‘the real thing’.
• Virtual memory (VM) has a number of benefits:
– portability: programs work regardless of how
much actual memory present
– convenience: programmer can use e.g. large
sparse data structures with impunity
– efficiency: no need to waste (real) memory on
code or data which isn’t used.
• VM typically implemented via demand paging:
– programs (executables) reside on disk
– to execute a process we load pages in on
demand; i.e. as and when they are referenced.
• Also get demand segmentation, but rare.
Demand Paging Details
When loading a new process for execution:
• create its address space (e.g. page tables, etc)
• mark PTEs as either “invalid” or “non-resident”
• add PCB to scheduler.
Then whenever we receive a page fault:
1. check PTE to determine if “invalid” or not
2. if an invalid reference ⇒ kill process;
3. otherwise ‘page in’ the desired page:
• find a free frame in memory
• initiate disk I/O to read in the desired page
• when I/O is finished modify the PTE for this
page to show that it is now valid
• restart the process at the faulting instruction
Scheme described above is pure demand paging:
• never brings in a page until required ⇒ get lots of
page faults and I/O when process begins.
• hence many real systems explicitly load some core
parts of the process first
Page Replacement
• When paging in from disk, we need a free frame of
physical memory to hold the data we’re reading in.
• In reality, size of physical memory is limited ⇒
– need to discard unused pages if total demand for
pages exceeds physical memory size
– (alternatively could swap out a whole process to
free some frames)
• Modified algorithm: on a page fault we
1. locate the desired page on disk
2. to select a free frame for the incoming page:
(a) if there is a free frame use it
(b) otherwise select a victim page to free,
(c) write the victim page back to disk, and
(d) mark it as invalid in its process page tables
3. read desired page into freed frame
4. restart the faulting process
• Can reduce overhead by adding a ‘dirty’ bit to
PTEs (can potentially omit step 2c above)
• Question: how do we choose our victim page?
Page Replacement Algorithms
• First-In First-Out (FIFO)
– keep a queue of pages, discard from head
– performance difficult to predict: no idea whether
page replaced will be used again or not
– discard is independent of page use frequency
– in general: pretty bad, although very simple.
• Optimal Algorithm (OPT)
– replace the page which will not be used again for
longest period of time
– can only be done with an oracle, or in hindsight
– serves as a good comparison for other algorithms
• Least Recently Used (LRU)
– LRU replaces the page which has not been used
for the longest amount of time
– (i.e. LRU is OPT running backwards in time)
– assumes past is a good predictor of the future
– Q: how do we determine the LRU ordering?
Implementing LRU
• Could try using counters
– give each page table entry a time-of-use field
and give CPU a logical clock (counter)
– whenever a page is referenced, its PTE is
updated to clock value
– replace page with smallest time value
– problem: requires a search to find min value
– problem: adds a write to memory (PTE) on
every memory reference
– problem: clock overflow
• Or a page stack:
– maintain a stack of pages (doubly linked list)
with most-recently used (MRU) page on top
– discard from bottom of stack
– requires changing 6 pointers per [new] reference
– very slow without extensive hardware support
• Neither scheme seems practical on a standard
processor ⇒ need another way.
Approximating LRU (1)
• Many systems have a reference bit in the PTE
which is set by h/w whenever the page is touched
• This allows not recently used (NRU) replacement:
– periodically (e.g. 20ms) clear all reference bits
– when choosing a victim to replace, prefer pages
with clear reference bits
– if also have a modified bit (or dirty bit) in the
PTE, can extend NRU to use that too:
Ref? Dirty? Comment
no no best type of page to replace
no yes next best (requires writeback)
yes no probably code in use
yes yes bad choice for replacement
• Or can extend by maintaining more history, e.g.
– for each page, the operating system maintains
an 8-bit value, initialized to zero
– periodically (e.g. 20ms) shift reference bit onto
high order bit of the byte, and clear reference bit
– select lowest value page (or one of) to replace
Operating Systems II — Page Replacement Algorithms 44
Approximating LRU (2)
• Popular NRU scheme: second-chance FIFO
– store pages in queue as per FIFO
– before discarding head, check its reference bit
– if reference bit is 0, discard, otherwise:
∗ reset reference bit, and
∗ add page to tail of queue
∗ i.e. give it “a second chance”
• Often implemented with a circular queue and a
current pointer; in this case usually called clock.
• If no h/w provided reference bit can emulate:
– to clear “reference bit”, mark page no access
– if referenced ⇒ trap, update PTE, and resume
– to check if referenced, check permissions
– can use similar scheme to emulate modified bit
Operating Systems II — Page Replacement Algorithms 45
Other Replacement Schemes
• Counting Algorithms: keep a count of the number
of references to each page
– LFU: replace page with smallest count
– MFU: replace highest count because low count
⇒ most recently brought in.
• Page Buffering Algorithms:
– keep a min. number of victims in a free pool
– new page read in before writing out victim.
• (Pseudo) MRU:
– consider access of e.g. large array.
– page to replace is one application has just
finished with, i.e. most recently used.
– e.g. track page faults and look for sequences.
– discard the k-th page in the victim sequence.
• Application-specific:
– stop trying to second guess what’s going on.
– provide hook for app. to suggest replacement.
– must be careful with denial of service. . .
Operating Systems II — Page Replacement Algorithms 46
Performance Comparison
[Graph: page faults per 1000 references (0–45) against number of page
frames available (5–15), with one curve each for FIFO, CLOCK, LRU and OPT]
Graph plots page-fault rate against number of
physical frames for a pseudo-local reference string.
• want to minimise area under curve
• FIFO can exhibit Belady’s anomaly (although it
doesn’t in this case)
• getting frame allocation right has major impact. . .
Operating Systems II — Page Replacement Algorithms 47
Frame Allocation
• A certain fraction of physical memory is reserved
per-process and for core OS code and data.
• Need an allocation policy to determine how to
distribute the remaining frames.
• Objectives:
– Fairness (or proportional fairness)?
∗ e.g. divide m frames between n processes as
m/n, with remainder in the free pool
∗ e.g. divide frames in proportion to size of
process (i.e. number of pages used)
– Minimize system-wide page-fault rate?
(e.g. allocate all memory to few processes)
– Maximize level of multiprogramming?
(e.g. allocate min memory to many processes)
• Most page replacement schemes are global : all
pages considered for replacement.
⇒ allocation policy implicitly enforced during page-in:
– allocation succeeds iff policy agrees
– ‘free frames’ often in use ⇒ steal them!
Operating Systems II — Frame Allocation 48
The Risk of Thrashing
[Graph: CPU utilisation against degree of multiprogramming —
utilisation rises with MPL, then collapses once thrashing sets in]
• As more processes enter the system, the
frames-per-process value can get very small.
• At some point we hit a wall:
– a process needs more frames, so steals them
– but the other processes need those pages, so
they fault to bring them back in
– number of runnable processes plunges
• To avoid thrashing we must give processes as many
frames as they “need”
• If we can’t, we need to reduce the MPL
(a better page-replacement algorithm will not help)
Operating Systems II — Frame Allocation 49
Locality of Reference
[Graph: miss address (0x10000–0xc0000) against miss number for a traced
program run — phases (Parse, Optimise, Output, Kernel Init, move image,
clear bss, timer IRQs, connector daemon) and regions (user code, user
data/bss, user stack, I/O buffers, initial/extended malloc, VM workspace,
kernel code, kernel data/bss) visible as distinct clusters]
Locality of reference: in a short time interval, the
locations referenced by a process tend to be grouped
into a few regions in its address space.
• procedure being executed
• . . . sub-procedures
• . . . data access
• . . . stack variables
Note: have locality in both space and time.
Operating Systems II — Frame Allocation 50
Avoiding Thrashing
We can use the locality of reference principle to help
determine how many frames a process needs:
• define the Working Set (Denning, 1967)
– set of pages that a process needs in store at
“the same time” to make any progress
– varies between processes and during execution
– assume process moves through phases
– in each phase, get (spatial) locality of reference
– from time to time get phase shift
• Then OS can try to prevent thrashing by
maintaining sufficient pages for current phase:
– sample page reference bits every e.g. 10ms
– if a page is “in use”, say it’s in the working set
– sum working set sizes to get total demand D
– if D > m we are in danger of thrashing ⇒
suspend a process
• Alternatively use page fault frequency (PFF):
– monitor per-process page fault rate
– if too high, allocate more frames to process
Operating Systems II — Frame Allocation 51
Other Performance Issues
Various other factors influence VM performance, e.g.
• Program structure: consider for example
for(j=0; j<1024; j++)
for(i=0; i<1024; i++)
array[i][j] = 0;
– a killer on a system with 1024 word pages; either
have all pages resident, or get 2^20 page faults.
– if reverse order of iteration (i,j) then works fine
• Language choice:
– ML, lisp: use lots of pointers, tend to randomise
memory access ⇒ kills spatial locality
– Fortran, C, Java: relatively few pointer refs
• Pre-paging:
– avoid problem with pure demand paging
– can use WS technique to work out what to load
• Real-time systems:
– no paging in hard RT (must lock all pages)
– for SRT, trade-offs may be available
Operating Systems II — Demand Paged Virtual Memory 52
Case Study 1: Unix
• Swapping allowed from very early on.
• Kernel Per-process info. split into two kinds:
– proc and text structures always resident.
– page tables, user structure and kernel stack
could be swapped out.
• Swapping performed by special process: the
swapper (usually process 0).
– periodically awaken and inspect processes on
disk.
– choose one waiting longest time and prepare to
swap in.
– victim chosen by looking at scheduler queues:
try to find process blocked on I/O.
– other metrics: priority, overall time resident,
time since last swap in (for stability).
• From 3BSD / SVR2 onwards, implemented
demand paging.
• Today swapping only used when dire shortage of
physical memory.
Operating Systems II — Virtual Memory Case Studies 53
Unix: Address Space
[Diagram: 4.3BSD VAX address space — P0 segment from 0Gb holds text
(program code), static data + BSS and dynamic data (heap, growing up);
P1 segment below 2Gb holds the user stack (growing down), user structure
and kernel stack; the system segment occupies 2Gb–3Gb; 3Gb–4Gb reserved]
4.3 BSD Unix address space borrows from VAX:
• 0Gb–1Gb: segment P0 (text/data, grow upward)
• 1Gb–2Gb: segment P1 (stack, grows downward)
• 2Gb–3Gb: system segment (for kernel).
Address translation done in hardware LPT:
• System page table always resident.
• P0, P1 page tables in system segment.
• Segments have page-aligned length.
Operating Systems II — Virtual Memory Case Studies 54
Unix: Page Table Entries
[Diagram: 4.3BSD PTE formats — bit 31 is Valid (VLD), with FOD and FS
bits and a frame/block number field distinguishing: Resident (frame
number, AXS, M bits), On backing store, Transit/Sampling, Demand Zero,
and Fill from File (block number)]
• PTEs for valid pages determined by h/w.
• If valid bit not set ⇒ interpretation up to OS.
• BSD uses FOD bit, FS bit and the block number.
• First pair are “fill on demand”:
– DZ used for BSS, and growing stack.
– FFF used for executables (text & data).
– Simple pre-paging implemented via klusters.
• Sampling used to simulate reference bit.
• Backing store pages located via swap map(s).
Operating Systems II — Virtual Memory Case Studies 55
Unix: Paging Dynamics
• Physical memory managed by core map:
– array of structures, one per cluster.
– records if free or in use and [potentially] the
associated page number and disk block.
– free list threads through core map.
• Page replacement carried out by the page dæmon.
– every 250ms, OS checks if "enough" (viz.
lotsfree) physical memory free.
– if not ⇒ wake up page dæmon.
• Basic algorithm: global [two-handed] clock:
– hands point to different entries in core map
– first check if can replace front cluster; if not,
clear its “reference bit” (viz. mark invalid).
– then check if back cluster referenced (viz.
marked valid); if so given second chance.
– else flush to disk (if necessary), and put cluster
onto end of free list.
– move hands forward and repeat. . .
• System V Unix uses an almost identical scheme. . .
Operating Systems II — Virtual Memory Case Studies 56
Case Study 2: VMS
• VMS released in 1978 to run on the VAX-11/780.
• Aimed to support a wide range of hardware, and a
job mix of real-time, timeshared and batch tasks.
• This led to a design with:
– A local page replacement scheme,
– A quota scheme for physical memory, and
– An aggressive page clustering policy.
• First two based around idea of resident set:
– simply the set of pages which a given process
currently has in memory.
– each process also has a resident-set limit.
• Then during execution:
– pages faulted in by pager on demand.
– once hit limit, choose victim from resident set.
⇒ minimises impact on others.
• Also have swapper for extreme cases.
Operating Systems II — Virtual Memory Case Studies 57
VMS: Paging Dynamics
[Diagram: VMS paging dynamics — pages faulted in from disk into a
per-process resident set; evicted clean pages go to the Free Page List,
dirty pages to the Modified Page List; a faulting page may be reclaimed
from either list before any disk I/O; page-out from the MPL is clustered]
• Basic algorithm: simple [local] FIFO.
• Suckful ⇒ augment with software “victim cache”:
– Victim pages placed on tail of FPL/MPL.
– On fault, search lists before do I/O.
• Lists also allow aggressive page clustering:
– if |MPL| ≥ hi, write (|MPL| − lo) pages.
– Get ∼100 pages per write on average.
Operating Systems II — Virtual Memory Case Studies 58
VMS: Other Issues
• Modified page replacement:
– introduce callback for privileged processes.
– prefer to retain pages with TLB entries.
• Automatic resident set limit adjustment:
– system counts #page faults per process.
– at quantum end, check if rate > PFRATH.
– if so and if “enough” memory ⇒ increase RSL.
– swapper trimming used to reduce RSLs again.
– NB: real-time processes are exempt.
• Other system services:
– $SETSWM: disable process swapping.
– $LCKPAG: lock pages into memory.
– $LKWSET: lock pages into resident set.
• VMS still alive:
– recent versions updated to support 64-bit
address space
– son-of-VMS aka Win2K/XP also going strong.
Operating Systems II — Virtual Memory Case Studies 59
Other VM Techniques
Once have MMU, can (ab)use for other reasons
• Assume OS provides:
– system calls to change memory protections.
– some way to “catch” memory exceptions.
• This enables a large number of applications.
• e.g. concurrent garbage collection:
– mark unscanned areas of heap as no-access.
– if mutator thread accesses these, trap.
– on trap, collector scans page(s), copying and
forwarding as necessary.
– finally, resume mutator thread.
• e.g. incremental checkpointing:
– at time t atomically mark address space
read-only.
– on each trap, copy page, mark r/w and resume.
– no significant interruption
– more space efficient
Operating Systems II — Virtual Memory Case Studies 60
Single Address Space Operating Systems
• Emerging large (64-bit) address spaces mean that
having a SVAS is plausible once more.
• Separate concerns of “what we can see” and “what
we are allowed to access”.
• Advantages: easy sharing (unified addressing).
• Problems:
– address binding issues return.
– cache/TLB setup for MVAS model.
• Distributed shared virtual memory:
– turn a NOW into a SMP.
– how seamless do you think this is?
• Persistent object stores:
– support for pickling & compression?
– garbage collection?
• Sensible use requires restraint. . .
Operating Systems II — Virtual Memory Case Studies 61
Disk I/O
• Performance of disk I/O is crucial to
swapping/paging and file system operation
• Key parameters:
1. wait for controller and disk.
2. seek to appropriate disk cylinder
3. wait for desired block to come under the head
4. transfer data to/from disk
• Performance depends on how the disk is organised
[Diagram: disk anatomy — platters on a rotating spindle; each surface
has tracks divided into sectors; tracks at the same radius across
platters form a cylinder; read-write heads mounted on an actuator arm]
Operating Systems II — Disk Management 62
Disk Scheduling
• In a typical multiprogramming environment have
multiple users queueing for access to disk
• Also have VM system requests to load/swap/page
processes/pages
• We want to provide best performance to all users
— specifically reducing seek time component
• Several policies for scheduling a set of disk
requests onto the device, e.g.
1. FIFO: perform requests in their arrival order
2. SSTF: if the disk controller knows where the
head is (hope so!) then it can schedule the
request with the shortest seek from the current
position
3. SCAN (“elevator algorithm”): relieves problem
that an unlucky request could receive bad
performance due to queue position
4. C-SCAN: scan in one direction only
5. N-step-SCAN and FSCAN: ensure that the disk
head always moves
Operating Systems II — Disk Management 63
Reference String = 55, 58, 39, 18, 90, 160, 150, 38, 184
[Graphs: cylinder position (0–200) against time for FIFO, SSTF, SCAN
and C-SCAN on the above reference string]
Operating Systems II — Disk Management 64
Other Disk Scheduling Issues
• Priority: usually beyond disk controller’s control.
– system decides to prioritise, for example by
ensuring that swaps get done before I/O.
– alternatively interactive processes might get
greater priority over batch processes.
– or perhaps short requests given preference over
larger ones (avoid “convoy effect”)
• SRT disk scheduling (e.g. Cello, USD):
– per client/process scheduling parameters.
– two stage: admission, then queue.
– problem: overall performance?
• 2-D Scheduling (e.g. SPTF).
– try to reduce rotational latency.
– typically require h/w support.
• Bad blocks remapping:
– typically transparent ⇒ can potentially undo
scheduling benefits.
– some SCSI disks let OS into bad-block story
Operating Systems II — Disk Management 65
Logical Volumes
[Diagram: Hard Disk A divided into Partitions A1, A2 and A3;
Hard Disk B divided into Partitions B1 and B2]
Modern OSs tend to abstract away from physical disk;
instead use logical volume concept.
• Partitions first step.
• Augment with “soft partitions”:
– allow v. large number of partitions on one disk.
– can customize, e.g. “real-time” volume.
– aggregation: can make use of v. small partitions.
• Overall gives far more flexibility:
– e.g. dynamic resizing of partitions
– e.g. striping for performance.
• E.g. IRIX xlm, OSF/1 lvm, NT FtDisk.
• Other big opportunity is reliability. . .
Operating Systems II — Disk Management 66
RAID
RAID = Redundant Array of Inexpensive Disks:
• Uses various combinations of striping and
mirroring to increase performance.
• Can implement (some levels) in h/w or s/w
• Many levels exist:
– RAID0: striping over n disks (so actually !R)
– RAID1: simple mirroring, i.e. write n copies of
data to n disks (where n is 2 ;-).
– RAID2: hamming ECC (for disks with no
built-in error detection)
– RAID3: stripe data on multiple disks and keep
parity on a dedicated disk. Done at byte level ⇒
need spindle-synchronised disks.
– RAID4: same as RAID3, but block level.
– RAID5: same as RAID4, but no dedicated parity
disk (round robin instead).
• AutoRAID trades off RAIDs 1 and 5.
• Even funkier stuff emerging. . .
Operating Systems II — Disk Management 67
Disk Caching
• Cache holds copy of some of disk sectors.
• Can reduce access time by applications if the
required data follows the locality principle
• Design issues:
– transfer data by DMA or by shared memory?
– replacement strategy: LRU, LFU, etc.
– reading ahead: e.g. track based.
– write through or write back?
– partitioning? (USD. . . )
• Typically O/S also provides a cache in s/w:
– may be done per volume, or overall.
– also get unified caches — treat VM and FS
caching as part of the same thing.
• Software caching issues:
– should we treat all filesystems the same?
– do applications know better?
Operating Systems II — Disk Management 68
4.3 BSD Unix Buffer Cache
• Name? Well buffers data to/from disk, and caches
recently used information.
• Modern Unix deals with logical blocks, i.e. FS
block within a given partition / logical volume.
• “Typically” prevents 85% of implied disk transfers.
• Implemented as a hash table:
– Hash on (devno, blockno) to see if present.
– Linked list used for collisions.
• Also have LRU list (for replacement).
• Internal interface:
– bread(): get data & lock buffer.
– brelse(): unlock buffer (clean).
– bdwrite(): mark buffer dirty (lazy write).
– bawrite(): asynchronous write.
– bwrite(): synchronous write.
• Uses sync every 30 secs for consistency.
• Limited prefetching (read-ahead).
Operating Systems II — Disk Management 69
NT/2K Cache Manager
• Cache Manager caches “virtual blocks”:
– viz. keeps track of cache “lines” as offsets within
a file rather than a volume.
– disk layout & volume concept abstracted away.
⇒ no translation required for cache hit.
⇒ can get more intelligent prefetching
• Completely unified cache:
– cache “lines” all in virtual address space.
– decouples physical & virtual cache systems: e.g.
∗ virtually cache in 256K blocks,
∗ physically cluster up to 64K.
– NT virtual memory manager responsible for
actually doing the I/O.
– allows lots of FS cache when VM system lightly
loaded, little when system is thrashing.
• NT/2K also provides some user control:
– if specify temporary attrib when creating file ⇒
will never be flushed to disk unless necessary.
– if specify write through attrib when opening a
file ⇒ all writes will synchronously complete.
Operating Systems II — Disk Management 70
File systems
What is a filing system?
Normally consider it to comprise two parts:
1. Directory Service: this provides
• naming mechanism
• access control
• existence control
• concurrency control
2. Storage Service: this provides
• integrity, data needs to survive:
– hardware errors
– OS errors
– user errors
• archiving
• mechanism to implement directory service
What is a file?
• an ordered sequence of bytes (UNIX)
• an ordered sequence of records (ISO FTAM)
Operating Systems II — Filing Systems 71
File Mapping Algorithms
Need to be able to work out which disk blocks belong
to which files ⇒ need a file-mapping algorithm, e.g.
1. chaining in the media
2. chaining in a map
3. table of pointers to blocks
4. extents
Aspects to consider:
• integrity checking after crash
• automatic recovery after crash
• efficiency for different access patterns
– of data structure itself
– of I/O operations to access it
• ability to extend files
• efficiency at high utilization of disk capacity
Operating Systems II — Filing Systems 72
Chaining in the Media
[Diagram: directory entries pointing to file blocks, each block
chained to the next]
Each disk block has pointer to next block in file.
Can also chain free blocks.
• OK for sequential access – poor for random access
• cost to find disk address of block n in a file:
– best case: n disk reads
– worst case: n disk reads
• Some problems:
– not all of file block is file info
– integrity check tedious. . .
Operating Systems II — Filing Systems 73
Chaining in a map
[Diagram: directory entries pointing into an in-memory map whose
chains mirror the on-disk block structure]
Maintain the chains of pointers in a map (in
memory), mirroring disk structure.
• disk blocks only contain file information
• integrity check easy: only need to check map
• handling of map is critical for
– performance: must cache bulk of it.
– reliability: must replicate on disk.
• cost to find disk address of block n in a file:
– best case: n memory reads
– worst case: n disk reads
Operating Systems II — Filing Systems 74
Table of pointers
[Diagram: directory entry pointing to a table of pointers, one per
disk block of the file]
• access cost to find block n in a file
– best case: 1 memory read
– worst case: 1 disk read
• i.e. good for random access
• integrity check easy: only need to check tables
• free blocks managed independently (e.g. bitmap)
• table may get large ⇒ must chain tables, or build
a tree of tables (e.g. UNIX inode)
• access cost for chain of tables? for hierarchy?
Operating Systems II — Filing Systems 75
Extent lists
[Diagram: directory entry pointing to a list of extents (e.g. of
lengths 2, 3 and 2), each describing a contiguous run of disk blocks]
Using contiguous blocks can increase performance. . .
• list of disk addresses and lengths (extents)
• access cost: [perhaps] a disk read and then a
searching problem, O(log(number of extents))
• can use bitmap to manage free space (e.g. QNX)
• system may have some maximum #extents
– new error can’t extend file
– could copy file (i.e. compact into one extent)
– or could chain tables or use a hierarchy as for
table of pointers.
Operating Systems II — Filing Systems 76
File Meta-Data (1)
What information is held about a file?
• times: creation, access and change?
• access control: who and how
• location of file data (see above)
• backup or archive information
• concurrency control
What is the name of a file?
• simple system: only name for file is human
comprehensible text name
• perhaps want multiple text names for file
– soft (symbolic) link: text name → text name
– hard link: text name → file id
– if we have hard links, must have reference
counts on files
Together with the data structure describing the disk
blocks, this information is known as the file
meta-data.
Operating Systems II — Filing Systems 77
File Meta-Data (2)
[Diagram: directory entries foo (RW) and bar (R) both pointing to a
shared file node, which in turn points to the file data]
Where is file information kept?
• no hard links: keep it in the directory structure.
• if have hard links, keep file info separate from
directory entries
– file info in a block: OK if blocks small (e.g.
TOPS10)
– or in a table (UNIX i-node / v-node table)
• on OPEN, (after access check) copy info into
memory for fast access
• on CLOSE, write updated file data and meta-data
to disk
How do we handle caching meta-data?
Operating Systems II — Filing Systems 78
Directory Name Space
• simplest — flat name space (e.g. Univac Exec 8)
• two level (e.g. CTSS, TOPS10)
• general hierarchy
– a tree,
– a directed (acyclic) graph (DAG)
• structure of name space often reflects data
structures used to implement it
– e.g. hierarchical name space ↔ hierarchical data
structures
– but, could have hierarchical name space and
huge hash table!
General hierarchy:
• reflects structure of organisation, users’ files etc.
• name is full path name, but can get rather long:
e.g. /usr/groups/X11R5/src/mit/server/os/4.2bsd/utils.c
– offer relative naming
– login directory
– current working directory
Operating Systems II — Filing Systems 79
Directory Implementation
• Directories often don’t get very large (especially if
access control is at the directory level rather than
on individual files)
⇒ often quick look up
⇒ directories may be small compared to underlying
allocation unit
• But: assuming small dirs means lookup is naïve ⇒
trouble if get big dirs:
– optimise for iteration.
– keep entries sorted (e.g. use a B-Tree).
• Query based access:
– split filespace into system and user.
– system wants fast lookup but doesn’t much care
about friendly names (or may actively dislike)
– user wishes ‘easy’ retrieval ⇒ build content
index and make searching the default lookup.
– Q: how do we keep index up-to-date?
– Q: what about access control?
Operating Systems II — Filing Systems 80
File System Backups
• Backup: keep (recent) copy of whole file system to
allow recovery from:
– CPU software crash
– bad blocks on disk
– disk head crash
– fire, war, famine, pestilence
• What is a recent copy?
– in real time systems recent means mirrored disks
– daily usually sufficient for ‘normal’ machines
• Can use incremental technique, e.g.
– full dump performed daily or weekly
∗ copy whole file system to another disk or tape
∗ ideally can do it while file system live. . .
– incremental dump performed hourly or daily
∗ only copy files and directories which have
changed since the last time.
∗ e.g. use the file ‘last modified’ time
– to recover: first restore full dump, then
successively add in incrementals
Operating Systems II — File System Integrity 81
Ruler Function
[Graph: ruler function — tape number to use (0–4) against operation
day number (1, 2, 3, 4, 5, 6, 7, 8, ...)]
• Want to minimise #tapes needed, time to backup
• Want to maximise the time a file is held on backup
– number the days starting at 1
– on day n use tape t such that 2^t is the highest
power of 2 which divides n
– whenever we use tape t, dump all files modified
since we last used that tape, or any tape with a
higher number
• If a file is created on day c and deleted on day d, a
dumped version remains on tape until substantially after d
• the length of time it is saved depends on d − c and
the exact values of c, d
Operating Systems II — File System Integrity 82
Crash Recovery
• Most failures affect only files being modified
• At start up after a crash run a disk scavenger:
– try to recover data structures from memory
(bring back core memory!)
– get current state of data structures from disk
– identify inconsistencies (may require operator
intervention)
– isolate suspect files and reload from backup
– correct data structures and update disk
• Usually much faster and better (i.e. more recent)
than recovery from backup.
• Can make scavenger’s job simpler:
– replicate vital data structures
– spread replicas around the disk
• Even better: use journal file to assist recovery.
– record all meta-data operations in an
append-only [infinite] file.
– ensure log records written before performing
actual modification.
Operating Systems II — File System Integrity 83
Immutable files
• Avoid concurrency problems: use write-once files.
– multiple version numbers: foo!11, foo!12
– invent new version number on close
• Problems:
– disk space is not infinite
∗ only keep last k versions (archive rest?)
∗ have a explicit keep call
∗ share disk blocks between different versions —
complicated file system structures
– what does name without version mean?
– and the killer. . . directories aren’t immutable!
• But:
– only need concurrency control on version number
– could be used (for files) on unusual media types
∗ write once optical disks
∗ remote servers (e.g. Cedar FS)
– provides an audit trail
∗ required by the spooks
∗ often implemented on top of normal file
system; e.g. UNIX RCS
Operating Systems II — Other FS Issues 84
Multi-level stores
Archiving (c.f. backup): keep frequently used files on
fast media, migrate others to slower (cheaper) media.
Can be done by:
• user — encouraged by accounting penalties
• system — migrate files by periodically looking at
time of last use
• can provide transparent naming but not
performance, e.g. HSM (Windows 2000)
Can integrate multi-level store with ideas from
immutable files, e.g. Plan-9:
• file servers with fast disks
• write once optical juke box
• every night, mark all files immutable
• start migration of files which changed the previous
day to optical disk
• access to archive explicit
e.g. /archive/08Jan2005/users/smh/. . .
Operating Systems II — Other FS Issues 85
Case Study 1: FAT16/32
[Diagram: the FAT as an array indexed by cluster number, each entry
holding the index of the next cluster in a file's chain, EOF, or Free;
plus the directory entry layout: File Name (8 bytes), Extension
(3 bytes), Attribute bits (A: Archive, D: Directory, V: Volume Label,
S: System, H: Hidden, R: Read-Only), Reserved (10 bytes), Time
(2 bytes), Date (2 bytes), First Cluster (2 bytes), File Size (4 bytes)]
• A file is a linked list of clusters: a cluster is a set
of 2^n contiguous disk blocks, n ≥ 0.
• Each entry in the FAT contains either:
– the index of another entry within the FAT, or
– a special value EOF meaning “end of file”, or
– a special value Free meaning “free”.
• Directory entries contain index into the FAT
• FAT16 could only handle partitions up to (2^16 × c)
bytes, where c is the cluster size ⇒ max 2Gb
partition with 32K clusters.
• (and big cluster size is bad)
Operating Systems II — FS Case Studies 86
Extending FAT16 to FAT32
• Obvious extension: instead of using 2 bytes per
entry, FAT32 uses 4 bytes per entry
⇒ can support e.g. 8Gb partition with 4K clusters
• Further enhancements with FAT32 include:
– can locate the root directory anywhere on the
partition (in FAT16, the root directory had to
immediately follow the FAT(s)).
– can use the backup copy of the FAT instead of
the default (more fault tolerant)
– improved support for demand paged executables
(consider the 4K default cluster size . . . ).
• VFAT on top of FAT32 does long name support:
unicode strings of up to 256 characters.
– want to keep same directory entry structure for
compatibility with e.g. DOS
⇒ use multiple directory entries to contain
successive parts of name.
– abuse V attribute to avoid listing these
Operating Systems II — FS Case Studies 87
Case Study 2: BSD FFS
The original Unix file system: simple, elegant, slow.
[Diagram: original Unix FS disk layout — each partition starts with a
boot block and super-block, followed by the inode table, then the data
blocks]
The fast file-system (FFS) was developed in the hope
of overcoming the following shortcomings:
1. Poor data/metadata layout:
• widely separating data and metadata ⇒
almost guaranteed long seeks
• head crash near start of partition disastrous.
• consecutive file blocks not close together.
2. Data blocks too small:
• 512 byte allocation size good to reduce internal
fragmentation (median file size ∼ 2K)
• but poor performance for somewhat larger files.
Operating Systems II — FS Case Studies 88
FFS: Improving Performance
The FFS set out to address these issues:
• Block size problem:
– use larger block size (e.g. 4096 or 8192 bytes)
– but: last block in a file may be split into
fragments of e.g. 512 bytes.
• Random allocation problem:
– ditch free list in favour of bitmap ⇒ since easier
to find contiguous free blocks (e.g.
011100000011101)
– divide disk into cylinder groups containing:
∗ a superblock (replica),
∗ a set of inodes,
∗ a bitmap of free blocks, and
∗ usage summary information.
– (cylinder group ≈ little Unix file system)
• Cylinder groups used to:
– keep inodes near their data blocks
– keep inodes of a directory together
– increase fault tolerance
Operating Systems II — FS Case Studies 89
FFS: Locality and Allocation
• Locality key to achieving high performance
• To achieve locality:
1. don’t let disk fill up ⇒ can find space nearby
2. spread unrelated things far apart.
• e.g. the BSD allocator tries to keep files in a
directory in the same cylinder group, but spread
directories out among the various cylinder groups
• similarly allocates runs of blocks within a cylinder
group, but switches to a different one after 48K
• So does all this make any difference?
– yes! about 10x–20x original FS performance
– get up to 40% of disk bandwidth on large files
– and much better small file performance.
• Problems?
– block-based scheme limits throughput ⇒ need
decent clustering, or skip-sector allocation
– crash recovery not particularly fast
– rather tied to disk geometry. . .
Operating Systems II — FS Case Studies 90
Case Study 3: NTFS
[Diagram: the Master File Table (MFT) — an array of file records;
entries 0–7 hold the metadata files ($Mft, $MftMirr, $LogFile, $Volume,
$AttrDef, \ (root), $Bitmap, $BadClus); entries from 16 onwards hold
user files/directories; each file record holds attributes such as
Standard Information, Filename and Data]
• Fundamental structure of NTFS is a volume:
– based on a logical disk partition
– may occupy a portion of a disk, an entire disk,
or span across several disks.
• An array of file records is stored in a special file
called the Master File Table (MFT).
• The MFT is indexed by a file reference (a 64-bit
unique identifier for a file)
• A file itself is a structured object consisting of a
set of attribute/value pairs of variable length. . .
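The 64-bit file reference is commonly described as a 48-bit index into the MFT plus a 16-bit sequence number that detects stale references to reused records; a sketch under that assumption:

```python
# Sketch: packing/unpacking an NTFS-style file reference
# (48-bit MFT index + 16-bit sequence number). Illustrative only.

def make_ref(mft_index, seq):
    assert mft_index < (1 << 48) and seq < (1 << 16)
    return (seq << 48) | mft_index

def split_ref(ref):
    return ref & ((1 << 48) - 1), ref >> 48

ref = make_ref(5, 1)       # record 5 (the root directory in the MFT)
print(split_ref(ref))      # -> (5, 1)
```

If record 5 is freed and reused, its sequence number is bumped, so a dangling reference carrying the old sequence number can be rejected.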
Operating Systems II — FS Case Studies 91
NTFS: Recovery
• To aid recovery, all file system data structure
updates are performed inside transactions:
– before a data structure is altered, the
transaction writes a log record that contains
redo and undo information.
– after the data structure has been changed, a
commit record is written to the log to signify
that the transaction succeeded.
– after a crash, the file system can be restored to
a consistent state by processing the log records.
• Does not guarantee that all the user file data can
be recovered after a crash — just that metadata
files will reflect some prior consistent state.
• The log is stored in the third metadata file at the
beginning of the volume ($LogFile).
• Logging functionality not part of NTFS itself:
– NT has a generic log file service
⇒ could in principle be used by e.g. database
• Overall makes for far quicker recovery after crash
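A minimal sketch of the redo/undo scheme described above; the record formats are illustrative, not NTFS's on-disk layout:

```python
# Write-ahead logging sketch: each update logs undo and redo values, a
# commit record marks success, and recovery redoes committed
# transactions and undoes the rest.

def recover(log, disk):
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    # Redo phase: reapply updates of committed transactions.
    for rec in log:
        if rec[0] == "update" and rec[1] in committed:
            _, tx, key, old, new = rec
            disk[key] = new
    # Undo phase: roll back uncommitted updates, newest first.
    for rec in reversed(log):
        if rec[0] == "update" and rec[1] not in committed:
            _, tx, key, old, new = rec
            disk[key] = old
    return disk

log = [
    ("update", 1, "inode7", "len=0", "len=512"),
    ("commit", 1),
    ("update", 2, "inode9", "len=0", "len=100"),  # crashed mid-transaction
]
disk = {"inode7": "len=0", "inode9": "len=0"}
print(recover(log, disk))   # inode7 redone, inode9 rolled back
```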
Operating Systems II — FS Case Studies 92
NTFS: Other Features
• Security:
– each file object has a security descriptor
attribute stored in its MFT record.
– this attribute contains the access token of the
owner of the file plus an access control list
• Fault Tolerance:
– FtDisk driver allows multiple partitions to be
combined into a logical volume (RAID 0, 1, 5)
– FtDisk can also handle sector sparing where
the underlying SCSI disk supports it
– NTFS supports software cluster remapping.
• Compression:
– NTFS can divide a file’s data into compression
units (blocks of 16 contiguous clusters)
– NTFS also has support for sparse files
∗ clusters with all zeros not allocated or stored
∗ instead, gaps are left in the sequences of
VCNs kept in the file record
∗ when reading a file, gaps cause NTFS to
zero-fill that portion of the caller’s buffer.
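The sparse-file read path can be sketched as follows, assuming runs are (start VCN, LCN, length) triples; the tiny cluster size is illustrative:

```python
# Sketch: runs map ranges of VCNs (virtual cluster numbers) to LCNs on
# disk; VCNs not covered by any run are holes and read back as zeros.

CLUSTER = 4   # tiny cluster size for the example

def read(runs, storage, vcn):
    """Return the data for one virtual cluster."""
    for start_vcn, lcn, length in runs:
        if start_vcn <= vcn < start_vcn + length:
            return storage[lcn + (vcn - start_vcn)]
    return b"\x00" * CLUSTER    # gap in the VCN sequence: zero-fill

runs = [(0, 10, 1), (3, 11, 1)]       # VCNs 1 and 2 are a hole
storage = {10: b"head", 11: b"tail"}
print(read(runs, storage, 0))   # b'head'
print(read(runs, storage, 1))   # b'\x00\x00\x00\x00'
```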
Operating Systems II — FS Case Studies 93
Case Study 4: LFS (Sprite)
LFS is a log-structured file system — a radically
different file system design:
• Premise 1: CPU speed increasing faster than disk speed.
• Premise 2: memory cheap ⇒ large disk caches
• Premise 3: large cache ⇒ most disk reads “free”.
⇒ performance bottleneck is writing & seeking.
Basic idea: solve write/seek problems by using a log:
• log is [logically] an append-only piece of storage
comprising a set of records.
• all data & meta-data updates written to log.
• periodically flush entire log to disk in a single
contiguous transfer:
– high bandwidth transfer.
– can make blocks of a file contiguous on disk.
• have two logs ⇒ one in use, one being written.
What are the problems here?
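The append-only log, together with the inode-map lookup detailed on the next slide, can be sketched as follows (all structures illustrative):

```python
# Sketch: an append-only log holding data, inodes, an inode map, and a
# checkpoint kept at one fixed disk address.

log = {}                      # disk address -> record (append-only)
CHECKPOINT = "fixed-region"   # the one location at a fixed address

log[100] = {"type": "data", "bytes": b"hello"}
log[101] = {"type": "inode", "ino": 7, "block0": 100}
log[102] = {"type": "imap", "map": {7: 101}}
log[CHECKPOINT] = {"imap_addr": 102}

def read_inode(ino):
    imap_addr = log[CHECKPOINT]["imap_addr"]   # 1. read checkpoint
    inode_addr = log[imap_addr]["map"][ino]    # 2. consult inode map
    return log[inode_addr]                     # 3. read the inode

print(log[read_inode(7)["block0"]]["bytes"])   # -> b'hello'
```

Updating the file means appending a new data block, a new inode, a new inode map and eventually a new checkpoint; the old records simply become dead space for the cleaner.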
Operating Systems II — FS Case Studies 94
LFS: Implementation Issues
1. How do we find data in the log?
• can keep basic UNIX structure (directories,
inodes, indirect blocks, etc)
• then just need to find inodes ⇒ use inode map
• find inode maps by looking at a checkpoint
• checkpoints live in fixed region on disk.
2. What do we do when the disk is full?
• need asynchronous scavenger to run over old
logs and free up some space.
• two basic alternatives:
1. compact live information to free up space.
2. thread log through free space.
• neither great ⇒ use segmented log:
– divide disk into large fixed-size segments.
– compact within a segment, thread between
segments.
– when writing use only clean segments
– occasionally clean segments
– choosing segments to clean is hard. . .
Subject of ongoing debate in the OS community. . .
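One well-known answer to the segment-choice problem is the Sprite LFS cost-benefit policy, which favours segments holding little live data (low utilisation u) and old, stable data:

```python
# Sketch of the Sprite LFS cost-benefit cleaning policy: score each
# segment as (1 - u) * age / (1 + u) and clean the highest scorer.

def score(u, age):
    # u: fraction of the segment still live; age: time since last write.
    # (1 - u) is the space reclaimed; (1 + u) is the read+rewrite cost.
    return (1 - u) * age / (1 + u)

segments = {"A": (0.9, 100), "B": (0.3, 10), "C": (0.3, 100)}
best = max(segments, key=lambda s: score(*segments[s]))
print(best)   # the old, mostly-empty segment C is cleaned first
```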
Operating Systems II — FS Case Studies 95
Protection
Require protection against unauthorised:
• release of information
– reading or leaking data
– violating privacy legislation
– using proprietary software
– covert channels
• modification of information
– changing access rights
– can do sabotage without reading information
• denial of service
– causing a crash
– causing high load (e.g. processes or packets)
– changing access rights
Also wish to protect against the effects of errors:
• isolate for debugging
• isolate for damage control
Protection mechanisms impose controls on access by
subjects (e.g. users) on objects (e.g. files).
Operating Systems II — Protection 96
Protection and Sharing
If we have a single user machine with no network
connection in a locked room then protection is easy.
But we want to:
• share facilities (for economic reasons)
• share and exchange data (application requirement)
Some mechanisms we have already come across:
• user and supervisor levels
– usually one of each
– could have several (e.g. MULTICS rings)
• memory management hardware
– protection keys
– relocation hardware
– bounds checking
– separate address spaces
• files
– access control list
– groups etc
Operating Systems II — Protection 97
Design of Protection System
• Some other protection mechanisms:
– lock the computer room (prevent people from
tampering with the hardware)
– restrict access to system software
– de-skill systems operating staff
– keep designers away from final system!
– use passwords (in general challenge/response)
– use encryption
– legislate
• ref: Saltzer + Schroeder Proc. IEEE Sept 75
– design should be public
– default should be no access
– check for current authority
– give each process minimum possible authority
– mechanisms should be simple, uniform and built
in to lowest layers
– should be psychologically acceptable
– cost of circumvention should be high
– minimize shared access
Operating Systems II — Protection 98
Authentication of User to System (1)
Passwords currently widely used:
• want a long sequence of random characters issued
by system, but user would write it down
• if allow user selection, they will use dictionary
words, car registration, their name, etc.
• best bet probably is to encourage the use of an
algorithm to remember password
• other top tips:
– don’t reflect on terminal, or overprint
– add delay after failed attempt
– use encryption if line suspect
• what about security of password file?
– only accessible to login program (CAP, TITAN)
– hold scrambled, e.g. UNIX
∗ only need to write protect file
∗ need irreversible function (without password)
∗ maybe ‘one-way’ function
∗ however, off-line attack possible
⇒ really should use shadow passwords
Operating Systems II — Protection 99
Authentication of User to System (2)
E.g. passwords in UNIX:
• simple for user to remember
arachnid
• sensible user applies an algorithm
!r!chn#d
• password is DES-encrypted 25 times using a 2-byte
per-user ‘salt’ to produce an 11-byte string
• salt followed by these 11 bytes are then stored
IML.DVMcz6Sh2
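The salt-plus-one-way-function idea can be sketched with a modern iterated hash standing in for the 25 rounds of DES in classic crypt(3); the parameters are illustrative:

```python
# Sketch: store salt + hash(salt, password), never the password itself.
# The salt defeats precomputed dictionaries; iteration slows guessing.

import hashlib, os

def store(password):
    salt = os.urandom(2)                      # 2-byte per-user salt
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt,
                                 100_000)     # deliberately slow
    return salt + digest                      # salt stored in the clear

def check(password, stored):
    salt, digest = stored[:2], stored[2:]
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt,
                               100_000) == digest

entry = store("!r!chn#d")
print(check("!r!chn#d", entry))   # True
print(check("arachnid", entry))   # False
```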
Really require unforgeable evidence of identity that
system can check:
• enhanced password: challenge-response.
• id card inserted into slot
• fingerprint, voiceprint, face recognition
• smart cards
Operating Systems II — Protection 100
Authentication of System to User
User wants to avoid:
• talking to wrong computer
• right computer, but not the login program
Partial solution in old days for directly wired terminals:
• make login character same as terminal attention, or
• always do a terminal attention before trying login
But, today micros used as terminals ⇒
• local software may have been changed
• so carry your own copy of the terminal program
• but hardware / firmware in public machine may
have been modified
Anyway, still have the problem of comms lines:
• wiretapping is easy
• workstation can often see all packets on network
⇒ must use encryption of some kind, and must trust
encryption device (e.g. a smart card)
Operating Systems II — Protection 101
Mutual suspicion
• We need to encourage lots and lots of suspicion:
– system of user
– users of each other
– user of system
• Called programs should be suspicious of caller (e.g.
OS calls always need to check parameters)
• Caller should be suspicious of called program
• e.g. Trojan horse:
– a ‘useful’ looking program — a game perhaps
– when called by user (in many systems) inherits
all of the user’s privileges
– it can then copy files, modify files, change
password, send mail, etc. . .
– e.g. Multics editor trojan horse, copied files as
well as edited.
• e.g. Virus:
– often starts off as Trojan horse
– self-replicating (e.g. ILOVEYOU)
Operating Systems II — Protection 102
Access matrix
Access matrix is a matrix of subjects against objects.
Subject (or principal) might be:
• users e.g. by uid
• executing process in a protection domain
• sets of users or processes
Objects are things like:
• files
• devices
• domains / processes
• message ports (in microkernels)
Matrix is large and sparse ⇒ don’t want to store it all.
Two common representations:
1. by object: store list of subjects and rights with
each object ⇒ access control list
2. by subject: store list of objects and rights with
each subject ⇒ capabilities
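Both representations encode the same sparse matrix; a tiny sketch:

```python
# Sketch: one access matrix, stored two ways. Names are illustrative.

matrix = {                     # (subject, object) -> rights
    ("alice", "exam.txt"): {"r", "w"},
    ("bob",   "exam.txt"): {"r"},
    ("alice", "printer"):  {"w"},
}

# 1. By object: an access control list per object.
acl = {}
for (subj, obj), rights in matrix.items():
    acl.setdefault(obj, {})[subj] = rights

# 2. By subject: a capability list per subject.
caps = {}
for (subj, obj), rights in matrix.items():
    caps.setdefault(subj, {})[obj] = rights

print(acl["exam.txt"])   # who may touch the exam file
print(caps["alice"])     # everything alice holds capabilities for
```

The ACL view makes "who can access this object?" cheap; the capability view makes "what can this subject access?" cheap, which is why revocation is awkward with capabilities.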
Operating Systems II — Protection 103
Access Control Lists
Often used in storage systems:
• system naming scheme provides for ACL to be
inserted in naming path, e.g. files
• if ACLs stored on disk, check is made in software
⇒ must only use on low duty cycle
• for higher duty cycle must cache results of check
• e.g. Multics: open file = memory segment.
On first reference to segment:
1. interrupt (segment fault)
2. check ACL
3. set up segment descriptor in segment table
• most systems check ACL
– when file opened for read or write
– when code file is to be executed
• access control by program, e.g. Unix
– exam prog, RWX by examiner, X by student
– data file, A by exam program, RW by examiner
• allows arbitrary policies. . .
Operating Systems II — Protection 104
Capabilities
Capabilities associated with active subjects, so:
• store in address space of subject
• must make sure subject can’t forge capabilities
• easily accessible to hardware
• can be used with high duty cycle
e.g. as part of addressing hardware
– Plessey PP250
– CAP I, II, III
– IBM system/38
– Intel iAPX432
• have special machine instructions to modify
(restrict) capabilities
• support passing of capabilities on procedure
(program) call
Can also use software capabilities:
• checked by encryption
• nice for distributed systems
Operating Systems II — Protection 105
Password Capabilities
• Capabilities nice for distributed systems but:
– messy for application, and
– revocation is tricky.
• Could use timeouts (e.g. Amoeba).
• Alternatively: combine passwords and capabilities.
• Store ACL with object, but key it on capability
(not implicit concept of “principal” from OS).
• Advantages:
– revocation possible
– multiple “roles” available.
• Disadvantages:
– still messy (use ‘implicit’ cache?).
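A sketch of an encryption-checked capability of the kind described above: the server MACs (object, rights) with a per-object secret, and can revoke every outstanding capability by rotating that secret. Names and formats are illustrative:

```python
# Sketch: software capabilities checked by (keyed) encryption.

import hmac, hashlib

keys = {"file42": b"per-object-secret"}    # held server-side only

def grant(obj, rights):
    tag = hmac.new(keys[obj], f"{obj}:{rights}".encode(),
                   hashlib.sha256).digest()
    return (obj, rights, tag)              # unforgeable without the key

def check(cap):
    obj, rights, tag = cap
    good = hmac.new(keys[obj], f"{obj}:{rights}".encode(),
                    hashlib.sha256).digest()
    return hmac.compare_digest(tag, good)

cap = grant("file42", "rw")
print(check(cap))                          # True
keys["file42"] = b"rotated"                # revocation: rotate the key
print(check(cap))                          # False: old caps now invalid
```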
Operating Systems II — Protection 106
Covert channels
Information leakage by side-effects: lots of fun!
At the hardware level:
• wire tapping
• monitor signals in machine
• modification to hardware
• electromagnetic radiation of devices
By software:
• leak a bit stream as:
– file exists / page fault / compute for a while ⇒ 1
– no file / no page fault / sleep for a while ⇒ 0
• system may provide statistics
e.g. TENEX password cracker using system
provided count of page faults
In general, guarding against covert channels is
prohibitively expensive.
(only usually a consideration for military types)
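The first row of the table, file exists versus no file, sketched as a one-bit-per-round channel (purely illustrative):

```python
# Sketch: sender leaks one bit per round by creating (1) or not
# creating (0) an agreed file; the receiver may test for the file's
# existence even though it cannot read it.

fs = set()    # stand-in for a shared filing-system namespace

def send(bits, name="/tmp/covert"):
    trace = []
    for b in bits:
        fs.add(name) if b else fs.discard(name)
        trace.append(name in fs)      # what the receiver observes
    return trace

def receive(trace):
    return [1 if seen else 0 for seen in trace]

print(receive(send([1, 0, 1, 1])))    # -> [1, 0, 1, 1]
```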
Operating Systems II — Protection 107
Summary & Outlook
• An operating system must:
1. securely multiplex resources.
2. provide / allow abstractions.
• Major aspect of OS design is choosing trade-offs.
– e.g. protection vs. performance vs. portability
– e.g. prettiness vs. power.
• Future systems bring new challenges:
– scalability (multi-processing/computing)
– reliability (computing infrastructure)
– ubiquity (heterogeneity/security)
• Lots of work remains. . .
Exams 2005:
• 2Qs, one each in Papers 3 & 4
• (not mandatory: P3 and P4 both x-of-y)
• Qs generally book-work plus ‘decider’
Operating Systems II — Conclusion 108
1999: Paper 3 Question 8
FIFO, LRU, and Clock are three page replacement
algorithms.
a) Briefly describe the operation of each algorithm.
[6 marks]
b) The Clock strategy assumes some hardware support.
What could you do to allow the use of Clock if this
hardware support were not present? [2 marks]
c) Assuming good temporal locality of reference, which of
the above three algorithms would you choose to use
within an operating system? Why would you not use the
other schemes? [2 marks]
What is a buffer cache? Explain why one is used, and how
it works. [6 marks]
Which buffer cache replacement strategy would you choose
to use within an operating system? Justify your answer.
[2 marks]
Give two reasons why the buffering requirements for network
data are different from those for file systems. [2 marks]
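Not a model answer, but the Clock algorithm from part (a) can be sketched as below. Note this variant installs pages with the reference bit clear; some formulations set it on installation:

```python
# Sketch of Clock (second-chance) replacement: frames form a circular
# list; on a miss the hand sweeps, clearing reference bits, and evicts
# the first page found with its bit clear.

class Clock:
    def __init__(self, nframes):
        self.frames = [None] * nframes     # page held by each frame
        self.refbit = [0] * nframes
        self.hand = 0

    def access(self, page):
        if page in self.frames:            # hit: set the reference bit
            self.refbit[self.frames.index(page)] = 1
            return None
        while self.refbit[self.hand]:      # sweep: grant second chances
            self.refbit[self.hand] = 0
            self.hand = (self.hand + 1) % len(self.frames)
        victim = self.frames[self.hand]    # evict and install
        self.frames[self.hand] = page
        self.refbit[self.hand] = 0
        self.hand = (self.hand + 1) % len(self.frames)
        return victim

c = Clock(3)
for p in ["A", "B", "C", "A", "D"]:
    c.access(p)
print(c.frames)   # A was re-referenced, so B was evicted: ['A', 'D', 'C']
```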
Operating Systems II — Past Examination Questions 109
1999: Paper 4 Question 7
The following are three ways which a file system may use to
determine which disk blocks make up a given file.
a) chaining in a map
b) tables of pointers
c) extent lists
Briefly describe how each scheme works. [3 marks each]
Describe the benefits and drawbacks of using scheme (c).
[6 marks]
You are part of a team designing a distributed filing system
which replicates files for performance and fault-tolerance
reasons. It is required that rights to a given file can be
revoked within T milliseconds (T ≥ 0). Describe how you
would achieve this, commenting on how the value of T would
influence your decision. [5 marks]
Operating Systems II — Past Examination Questions 110
2000: Paper 3 Question 7
Why are the scheduling algorithms used in general purpose
operating systems such as Unix and Windows NT not suitable
for real-time systems? [4 marks]
Rate monotonic (RM) and earliest deadline first (EDF) are
two popular scheduling algorithms for real-time systems.
Describe these algorithms, illustrating your answer by
showing how each of them would schedule the following
task set.
Task Requires Exactly Every
A 2ms 10ms
B 1ms 4ms
C 1ms 5ms
You may assume that context switches are instantaneous.
[8 marks]
Exhibit a task set which is schedulable under EDF but not
under RM. You should demonstrate that this is the case, and
explain why.
Hint: consider the relationship between task periods.
[8 marks]
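Not a model answer, but a sketch EDF simulator for the task set above (1 ms ticks, deadlines equal to periods); the task set's utilisation is 2/10 + 1/4 + 1/5 = 0.65, so EDF should meet every deadline:

```python
# Sketch EDF simulator: at each tick, run the ready job with the
# earliest absolute deadline. Illustrative, not production code.

def edf(tasks, horizon):
    """tasks: {name: (compute_ms, period_ms)}. Returns missed deadlines."""
    remaining = {n: 0 for n in tasks}
    deadline = {n: 0 for n in tasks}
    missed = []
    for t in range(horizon):
        for n, (c, p) in tasks.items():
            if t % p == 0:                    # new job released
                if remaining[n] > 0:
                    missed.append((n, t))     # previous job overran
                remaining[n], deadline[n] = c, t + p
        ready = [n for n in tasks if remaining[n] > 0]
        if ready:                             # run the earliest deadline
            n = min(ready, key=lambda n: deadline[n])
            remaining[n] -= 1
    return missed

tasks = {"A": (2, 10), "B": (1, 4), "C": (1, 5)}
print(edf(tasks, 20))   # -> [] : no deadline missed over one hyperperiod
```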
Operating Systems II — Past Examination Questions 111
2000: Paper 4 Question 7
Why is it important for an operating system to schedule disk
requests? [4 marks]
Briefly describe each of the SSTF, SCAN and C-SCAN disk
scheduling algorithms. Which problem with SSTF does
SCAN seek to overcome? Which problem with SCAN does
C-SCAN seek to overcome? [5 marks]
Consider a Winchester-style hard disk with 100 cylinders, 4
double-sided platters and 25 sectors per track. The following
is the (time-ordered) sequence of requests for disk sectors:
{ 3518, 1846, 8924, 6672, 1590, 4126, 107, 9750,
158, 6621, 446, 11 }.
The disk arm is currently at cylinder 10, moving towards
100. For each of SSTF, SCAN and C-SCAN, give the order
in which the above requests would be serviced. [3 marks]
Which factors do the above disk arm scheduling algorithms
ignore? How could these be taken into account? [4 marks]
Discuss ways in which an operating system can construct
logical volumes which are (a) more reliable and (b) higher
performance than the underlying hardware. [4 marks]
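A sketch of the arithmetic behind the question: 4 double-sided platters give 8 surfaces, so each cylinder holds 8 × 25 = 200 sectors and sector n lies on cylinder n // 200. One SCAN pass (sweeping up from cylinder 10, then servicing the rest on the way back) is then:

```python
# Sketch helper for the disk-scheduling question, not a full answer.

SECTORS_PER_CYL = 8 * 25     # 8 surfaces x 25 sectors per track

def to_cylinders(requests):
    return [r // SECTORS_PER_CYL for r in requests]

def scan_up_then_down(requests, head):
    cyls = sorted(to_cylinders(requests))
    up = [c for c in cyls if c >= head]      # serviced on the way out
    down = [c for c in cyls if c < head]     # serviced coming back
    return up + list(reversed(down))

reqs = [3518, 1846, 8924, 6672, 1590, 4126, 107, 9750,
        158, 6621, 446, 11]
print(scan_up_then_down(reqs, head=10))
# -> [17, 20, 33, 33, 44, 48, 9, 7, 2, 0, 0, 0]
```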
Operating Systems II — Past Examination Questions 112
2001: Paper 3 Question 7
What are the key issues with scheduling for shared-memory
multiprocessors? [3 marks]
Processor affinity, take scheduling and gang scheduling
are three techniques used within multiprocessor operating
systems.
a) Briefly describe the operation of each. [6 marks]
b) Which problem does the processor affinity technique seek
to overcome? [2 marks]
c) What problem does the processor affinity technique suffer
from, and how could this problem in turn be overcome?
[2 marks]
d) In which circumstances is a gang scheduling approach
most appropriate? [2 marks]
What additional issues does the virtual memory management
system have to address when dealing with shared-memory
multiprocessor systems? [5 marks]
Operating Systems II — Past Examination Questions 113
2001: Paper 4 Question 7
In the context of virtual memory management:
a) What is demand paging? How is it implemented?
[4 marks]
b) What is meant by temporal locality of reference?
[2 marks]
c) How does the assumption of temporal locality of
reference influence page replacement decisions? Illustrate
your answer by briefly describing an appropriate page
replacement algorithm or algorithms. [3 marks]
d) What is meant by spatial locality of reference?
[2 marks]
e) In what ways does the assumption of spatial locality
of reference influence the design of the virtual memory
system? [3 marks]
A student suggests that the virtual memory system should
really deal with “objects” or “procedures” rather than with
pages. Make arguments both for and against this suggestion.
[4 and 2 marks respectively]
Operating Systems II — Past Examination Questions 114
2003: Paper 3 Question 5
Modern server-class machines often use a Redundant Array
of Inexpensive Disks (RAID) to provide non-volatile storage.
a) What is the basic motivation behind this? [2 marks]
b) Describe RAID level 0. What are the benefits and
drawbacks of this scheme? [3 marks]
c) Describe RAID level 1. What are the benefits and
drawbacks of this scheme? [3 marks]
d) Compare and contrast RAID levels 3, 4 & 5. What
problem(s) with the former pair does the last hope to
avoid? [6 marks]
A server machine has k identical high-performance IDE disks
attached to independent IDE controllers. You are asked to
write operating system software to treat these disks as a RAID
level 5 array containing a single file-system. Your software
will include routines to read filesystem data, write filesystem
data, and to schedule these read and write requests. What
difficulties arise here and how may they be addressed?
[6 marks]
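Background for the question: RAID level 5 keeps, per stripe, the XOR parity of the data blocks (rotated across the disks), so any single lost block can be rebuilt from the survivors. A sketch:

```python
# Sketch: XOR parity as used by RAID levels 3/4/5. Illustrative sizes.

from functools import reduce

def parity(blocks):
    # Byte-wise XOR across same-length blocks.
    return bytes(reduce(lambda a, b: a ^ b, t) for t in zip(*blocks))

stripe = [b"abcd", b"efgh", b"ijkl"]   # one stripe across 3 data disks
p = parity(stripe)

# The disk holding the second block fails: rebuild it from the
# surviving data blocks plus the parity block.
rebuilt = parity([stripe[0], stripe[2], p])
print(rebuilt)   # -> b'efgh'
```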
Operating Systems II — Past Examination Questions 115
2003: Paper 4 Question 5
Describe the basic operation of a log-structured file system.
What are the potential benefits? What are the problems?
[8 marks]
Several modern file systems make use of journalling.
Describe how a journal is used by the file system, and
the situations in which it is beneficial. [6 marks]
You are assigned the task of designing a tape backup strategy
for an important file server. The goal is to maximize the
time any file is held on backup while minimizing the number
of tapes required. Sketch your strategy, commenting on the
support you require from the file system, and justifying your
design decisions. [6 marks]
Operating Systems II — Past Examination Questions 116
2004: Paper 3 Question 5
(a) What problem do real-time scheduling algorithms try to
solve? [2 marks]
(b) Describe one static priority and one dynamic priority
real-time scheduling algorithm. You should discuss
the issue of admission control, and comment on the
data structures that an implementation would need to
maintain and on how these would be used to make
scheduling decisions. [8 marks]
(c) A designer of a real-time system wishes to have
concurrently executing tasks share a data structure
protected by a mutual exclusion lock.
(i) What scheduling problem could arise here? [2 marks]
(ii) How could this problem be overcome? [2 marks]
(d) The designer also wishes the real-time system to use
demand paged virtual memory for efficiency. What
problems could arise here, and how could they be
overcome? [6 marks]
Operating Systems II — Past Examination Questions 117
2004: Paper 4 Question 5
(a) Most conventional hardware translates virtual addresses
to physical addresses using multi-level page tables
(MPTs):
(i) Describe with the aid of a diagram how translation is
performed when using MPTs. [3 marks]
(ii) What problem(s) with MPTs do linear page tables
attempt to overcome? How is this achieved?
[3 marks]
(iii) What problem(s) with MPTs do inverted page tables
(IPTs) attempt to overcome? How is this achieved?
[3 marks]
(iv) What problem(s) with IPTs do hashed page tables
attempt to overcome? How is this achieved?
[3 marks]
(b) Operating systems often cache part of the contents of
the disk(s) in main memory to speed up access. Compare
and contrast the way in which this is achieved in (i) 4.3
BSD Unix and (ii) Windows 2000. [8 marks]
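A sketch of the two-level walk in part (a)(i), using the illustrative 10 + 10 + 12 bit split familiar from 32-bit x86:

```python
# Sketch: multi-level page table translation. Split the virtual address
# into two table indices and a page offset, then follow the tables.

L1_BITS, L2_BITS, OFF_BITS = 10, 10, 12

def translate(l1_table, vaddr):
    l1 = (vaddr >> (L2_BITS + OFF_BITS)) & ((1 << L1_BITS) - 1)
    l2 = (vaddr >> OFF_BITS) & ((1 << L2_BITS) - 1)
    off = vaddr & ((1 << OFF_BITS) - 1)
    l2_table = l1_table[l1]      # in hardware a missing entry faults
    frame = l2_table[l2]
    return (frame << OFF_BITS) | off

l1_table = {1: {2: 0x1234}}      # sparse tables modelled as dicts
vaddr = (1 << 22) | (2 << 12) | 0xABC
print(hex(translate(l1_table, vaddr)))   # -> 0x1234abc
```

The sparseness of the dicts is the point of the multi-level scheme: unused regions of the address space need no second-level tables at all.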
Operating Systems II — Past Examination Questions 118
2005: Paper 3 Question 5
(a) Modern operating systems typically support both threads
and processes. What is the basic difference between
a thread and a process? Why do operating systems
support both concepts? [2 marks]
(b) You get a summer job with a company which has an in-
house operating system called sOs. sOs uses static priority
scheduling, supports at most 32 concurrently-executing
processes, and only works on uniprocessor machines.
Describe with justification how you would modify sOs in
order to:
(i) support up to 65536 concurrently executing processes.
[2 marks]
(ii) reduce or eliminate the possibility of starvation.
[3 marks]
(iii) efficiently schedule processes on an 8-CPU symmetric
multiprocessor (SMP) machine. [5 marks]
(iv) support threads in addition to processes on SMP
machines. [3 marks]
(c) How would you go about reducing the time taken to boot
a modern operating system? [5 marks]
Operating Systems II — Past Examination Questions 119
2005: Paper 4 Question 5
(a) Scheduling disk requests is important to reduce the
average service time. Describe briefly how the SSTF,
SCAN and C-SCAN scheduling algorithms work.
[3 marks]
(b) Recently there have been proposals that 2-D disk
scheduling algorithms should be used. What is the
basic idea behind 2-D disk scheduling? [2 marks]
(c) You are asked to develop support for 2-D disk scheduling
in a commodity operating system.
(i) What would be the major difficulty that you face?
[2 marks]
(ii) Sketch a design for how you would go about overcoming
this difficulty, and comment on how well you think the
resulting system would work. [8 marks]
(d) Several modern file-systems and databases make use of
a journal to aid in crash recovery.
(i) Briefly describe how journalling helps crash recovery.
[2 marks]
(ii) A researcher suggests adding 128 MB of NVRAM to a
disk drive and using this to store the journal. Discuss
the advantages and disadvantages of this approach.
[3 marks]
Operating Systems II — Past Examination Questions 120

Course Aims
This course aims to: • impart a detailed understanding of the algorithms and techniques used within operating systems, • consolidate the knowledge learned in earlier courses, and • help students appreciate the trade-offs involved in designing and implementing an operating system. Why another operating sytems course? • OSes are some of the largest software systems around ⇒ illustrate many s/w engineering issues. • OSes motivate most of the problems in concurrency • more people end up ‘writing OSes’ than you think – modifications to existing systems (e.g. linux) – embedded software for custom hardware – research operating systems lots of fun

• various subsystems not covered to date. . .
Operating Systems II — Aims i

Course Outline
• Introduction and Review: OS functions & structures. Virtual processors (processes and threads). • CPU Scheduling. Scheduling for multiprocessors. RT scheduling (RM, EDF). SRT/multimedia scheduling. • Memory Management. Virtual addresses and translation schemes. Demand paging. Page replacement. Frame allocation. VM case studies. Other VM techniques. • Storage Systems. Disks & disk scheduling. Caching, buffering. Filing systems. Case studies (FAT, FFS, NTFS, LFS). • Protection. Subjects and objects. Authentication schemes. Capability systems. • Conclusion. Course summary. Exam info.

Operating Systems II — Outline

ii

uk/Teaching/current/OpSysII/ Operating Systems II — Books iii .3BSD UNIX Operating System. Addison Wesley 2003 • Tannenbaum A S Modern Operating Systems (2nd Ed) Prentice Hall 2001 • Silberschatz A.cam.ac. Microsoft Press 2000 • OS links (via course web page) http://www.Recommended Reading • Bacon J [ and Harris T ] Concurrent Systems or Operating Systems Addison Wesley 1997. Addison Wesley 1989 • Solomon D [ and Russinovich M ] Inside Windows NT (2nd Ed) or Inside Windows 2000 (3rd Ed) Microsoft Press 1998.cl. Peterson J and Galvin P Operating Systems Concepts (5th Ed) Addison Wesley 1998 • Leffler S J The Design and Implementation of the 4.

yet – shares physical resources between them.g. e. And ideally it does all this efficiently and robustly.Operating System Functions App N Hardware App 1 App 2 Operating System An operating system is a collection of software which: • securely multiplexes resources. – time-shares CPU to provide virtual processors. Operating Systems II — Introduction & Review 1 . – divides up storage space by using filing systems. – presents h/w independent virtual devices. • provides an abstract virtual machine. – protects applications from each other. i. – allocates and protects memory to provide per-process virtual address spaces.e.

Hardware Support for Operating Systems Recall that OS should securely multiplex resources. • use a timer to force execution interruption ⇒ OS cannot be starved of CPU. • deny others service (e. Then we can: • add memory protection hardware ⇒ applications confined to subset of memory. • make I/O instructions privileged ⇒ applications cannot directly access devices. Most modern hardware provides protection using these techniques (c/f Computer Design course). Operating Systems II — Introduction & Review 2 . we need hardware support for (at least) dual-mode operation. ⇒ we need to ensure that an application cannot: • compromise the operating system. abuse resources) To achieve this efficiently and flexibly. • compromise other applications.g.

Kernel-based (lhs above) • set of OS services accessible via software interrupt mechanism called system calls. App. 2. App. App. App. App. • increased modularity (decreased performance?) Operating Systems II — Introduction & Review 3 .Operating System Structures App. Microkernel-based (rhs above) • push various OS services into server processes • access servers via some interprocess communication (IPC) scheme. App. App. Server Server Unpriv Priv Kernel System Calls Scheduler File System Protocol Code Device Driver Unpriv Priv Server Device Driver Device Driver S/W H/W Device Driver S/W H/W Kernel Scheduler Traditionally have had two main variants: 1.

Device Driver O. O.S. App. Operating Systems II — Introduction & Review 4 . • Separate concepts of protection and abstraction ⇒ get extensibility. e. We’ll see more on this next year. Exokernel.Vertically Structured Operating Systems System Process App.S. O.S. • Examples: Nemesis. O. Shared Library Code Unpriv Priv Sched. – a command line interpreter / window system. . Cache Kernel. accountability & performance. Syscalls Driver Stubs S/W H/W • Consider interface people really see. .g. – set of programming libraries / objects.S.

• In general.e. • to provide a simple programming paradigm. have 1 process ↔ n threads. n ≥ 1 ⇒ PCB holds references to one or more TCBs. • A process (or task) is a unit of resource ownership — a process is allocated a virtual address space. Operating Systems II — Introduction & Review 5 VPs implemented via processes and threads: . • OS stores information about processes and threads in process control blocks (PCBs) and thread control blocks (TCBs) respectively. run a thread when another is blocked on I/O). • to encapsulate an execution context.Virtual Processors Why virtual processors (VPs) ? • to provide the illusion that a computer is doing more than one thing at a time. • to increase system throughput (i. • A thread (or lightweight process) is a unit of dispatching — a thread has an execution state and a set of scheduling parameters. and control of some resources.

. – Cons: higher overhead for thread mgt and context switching. – Pros: lightweight creation/termination. – Thread management done by application using an unpriviliged thread library. Linux (?). . cannot utilise multiple processors. blocking system calls.Thread Architectures • User-level threads – Kernel unaware of threads’ existence. application-specific scheduling. Hybrid schemes also exist. Scheduling can be two-level. FreeBSD pthreads • Kernel-level threads – – – – All thread management done by kernel. fast ctxt switch (no kernel trap). preemption easy. (see later) Operating Systems II — Introduction & Review 6 . Pros: can utilise multiple processors.g. No thread library (but augmented API). – e. blocking system calls just block thread. – Cons: non-preemption. OS independence. less flexible. – e. or direct.g. Windows NT/2K.

Operating Systems II — Introduction & Review 7 • to avoid starvation: dynamically vary priorities . load. BSD Unix: 128 pris. priority boosts.g. then decays. • to resolve ties: use round-robin within priority • e. Typically use dynamic priority scheduling: • to schedule: select highest priority ready thread(s) – different quantum per priority? – CPU bias: per-thread quantum adaption? • each thread has an associated [integer] priority.Thread Scheduling Algorithms admit dispatch Ready event timeout or yield Running event-wait release New Exit Blocked A scheduling algorithm is used to decide which ready thread(s) should run (and for how long). • e.and usage-dependent priority damping. adaptive ∼20ms quantum.g. Windows NT/2K: 15 dynamic pris. 100ms fixed quantum.

CPU Cache CPU Cache CPU Cache CPU Cache Main Memory 2. . . Whole area becoming more important. 8 processors. Operating Systems II — Multiprocessors 8 • rarer and more expensive • can have 16. Non-Uniform Memory Access (NUMA). CPU Cache Mem CPU Cache Mem CPU Cache Mem CPU Cache Mem • all (main) memory takes the same time to access • scales only to 4.Multiprocessors Two main kinds of [shared-memory] multiprocessor: 1. . . 64. Uniform Memory Access (UMA). . 256 CPUs . aka SMP.

Multiprocessor Operating Systems Multiprocessor OSes may be roughly classed as either symmetric or asymmetric. • Symmetric Operating Systems: – identical system image on each processor ⇒ convenient abstraction.g. . – all resources directly shared ⇒ high synchronisation cost. common on NUMA (e.g. partition functionality among processors. Disco: – (re-)introduce virtual machine monitor – can fake out SMP (but is this wise?) – can run multiple OSes simultaneously. e. Hive. NB: asymmetric ⇒ trivial “master-slave” • Asymmetric Operating Systems: – – – – – • Also get hybrid schemes. 9 Operating Systems II — Multiprocessors . . Hurricane).g. – typical scheme on SMP (e. Linux. better scalability (and fault tolerance?) partitioning can be static or dynamic. NT).

e. Operating Systems II — Multiprocessors 10 . lots of compulsory misses. – Allow application-level parallelism. wasted schedule). • Problems: – Preemption within critical sections: ∗ thread A preempted while holding spinlock. – Cache pollution: ∗ if thread from different application runs on a given CPU. ∗ generally. ⇒ other threads can waste many CPU cycles. ∗ (can get degradation of factor or 10 or more) – Frequent context switching: ∗ if number of threads greatly exceeds the number of processors.Multiprocessor Scheduling (1) • Objectives: – Ensure all CPUs are kept busy. ∗ similar situation with producer/consumer threads (i. get poor performance. scheduling a thread on a new processor is expensive.

lose strict priority semantics. can lead to load imbalance. • Dedicated Assignment: contention reduced to thread creation/exit. no support for application-level parallelism.Multiprocessor Scheduling (2) Consider basic ways in which one could adapt uniprocessor scheduling techniques: • Central Queue:          simple extension of uniprocessor case. load-balancing performed automatically. n-way mutual exclusion on queue. better cache locality. inefficient use of caches. Are there better ways? Operating Systems II — Multiprocessors 11 .

Multiprocessor Scheduling (3)

• Processor Affinity:
  – modification of central queue.
  – threads have affinity for a certain processor ⇒ can reduce cache problems.
  – but: load balance problem again.
  – make dynamic? (cache affinity?)

• ‘Take’ Scheduling:
  – pseudo-dedicated assignment: idle CPU “takes” task from most loaded.
  – can be implemented cheaply.
  – nice trade-off: load high ⇒ no migration.

• Co-scheduling / Gang Scheduling:
  – Simultaneously schedule “related” threads.
  – ⇒ can reduce wasted context switches.
  – Q: how to choose members of gang?
  – Q: what about cache performance?

Operating Systems II — Multiprocessors 12

Example: Mach

• Basic model: dynamic priority with central queue.
• Processors grouped into disjoint processor sets:
  – Each processor set has 32 shared ready queues (one for each priority level).
  – Each processor has own local ready queue: absolute priority over global threads.
• Increase quantum when number of threads is small:
  – ‘small’ means #threads < (2 × #CPUs)
  – idea is to have a sensible effective quantum
  – e.g. 10 processors, 11 threads:
    ∗ if use default 100ms quantum, each thread spends an expected 10ms on runqueue
    ∗ instead stretch quantum to 1s ⇒ effective quantum is now 100ms.
• Applications provide hints to improve scheduling:
  1. discouragement hints: mild, strong and absolute
  2. handoff hints (aka “yield to”) — can improve producer-consumer synchronization
• Simple gang scheduling used for allocation.

Operating Systems II — Multiprocessors 13

E. • 1 kernel thread ↔ 1 LWP ↔ n user threads • LWPs allow potential multiprocessor benefit: Overall: either first-class threads (Psyche) or scheduler activations probably better for MP.g.MP Thread Architectures Process 1 Process 2 Process 3 Process 4 User Threads LWPs Kernel Threads CPU1 CPU2 Want benefits of both user and kernel threads without any of the drawbacks ⇒ use hybrid scheme. Operating Systems II — Multiprocessors 14 . Solaris 2 uses three-level scheduling: • user-level thread scheduler ⇒ lightweight & flexible – more LWPs ⇒ more scope for true parallelism – LWPs can be bound to individual processors ⇒ could in theory have user-level MP scheduler – kernel scheduler is relatively cache agnostic (although have processor sets (= Mach’s)).

Real-Time Systems

• Need to both produce correct results and meet predefined deadlines.
• “Correctness” of output related to the time delay it requires to be produced, e.g.
  – nuclear reactor safety system
  – JIT manufacturing
  – video on demand
• Typically distinguish hard real-time (HRT) and soft real-time (SRT):
  HRT — output value = 100% before the deadline; 0 (or less) after the deadline.
  SRT — output value = 100% before the deadline; (100 − f(t))% if t seconds late.
• It is not about speed.
• Building such systems is all about predictability.

Operating Systems II — Real-Time Systems 15

Real-Time Scheduling

[Figure: task Ti arrives at time ai, requires si units of CPU time before deadline di; period pi]

• Basic model:
  – consider set of tasks Ti, each of which arrives at time ai and requires si units of CPU time before a (real-time) deadline of di.
  – often extended to cope with periodic tasks: require si units every pi units.
• Best-effort techniques give no predictability:
  – in general priority specifies what to schedule but not when or how much.
  – i.e. CPU allocation for thread ti with priority pi depends on all other threads tj s.t. pj ≥ pi.
  – with dynamic priority adjustment becomes even more difficult.
⇒ need something different.

Operating Systems II — Real-Time Systems 16

Static Offline Scheduling

Advantages:
• Low run-time overhead.
• Deterministic behaviour.
• System-wide optimization.
• Resolve dependencies early.
• Can prove system properties.

Disadvantages:
• Inflexibility.
• Low utilisation.
• Potentially large schedule.
• Computationally intensive.

In general, offline scheduling only used when determinism is the overriding factor, e.g. MARS.

Operating Systems II — Real-Time Systems 17

Static Priority Algorithms

Most common is Rate Monotonic (RM):
• Assign static priorities to tasks off-line (or at ‘connection setup’), high-frequency tasks receiving high priorities.
• Tasks then processed with no further rearrangement of priorities required (⇒ reduces scheduling overhead).
• Optimal, static, priority-driven alg. for preemptive, periodic jobs: i.e. no other static algorithm can schedule a task set that RM cannot schedule.
• Admission control: the schedule calculated by RM is always feasible if the total utilisation of the processor is less than ln 2 (≈ 0.69).
• For many task sets RM produces a feasible schedule for higher utilisation (up to ∼88%); if periods harmonic, can get 100%.
• Predictable operation during transient overload.

Operating Systems II — Real-Time Systems 18
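The admission test above can be checked mechanically. A minimal sketch in Python (illustrative, not from the course materials): `rm_admissible` uses the per-task-count bound n(2^(1/n) − 1), which decreases towards the ln 2 limit quoted on the slide as n grows.

```python
def rm_admissible(tasks):
    """Sufficient (not necessary) RM admission test.

    tasks: list of (s, p) pairs -- s units of CPU required every p units.
    Accept if total utilisation is within the n(2^(1/n) - 1) bound,
    which tends to ln 2 (~0.693) as n -> infinity.
    """
    n = len(tasks)
    utilisation = sum(s / p for (s, p) in tasks)
    return utilisation <= n * (2 ** (1 / n) - 1)
```

Note that failing this test is inconclusive: as the slide says, many task sets (e.g. with harmonic periods) remain schedulable by RM at much higher utilisation.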

Dynamic Priority Algorithms

Most popular is Earliest Deadline First (EDF):
• Scheduling pretty simple:
  – keep queue of tasks ordered by deadline
  – dispatch the one at the head of the queue.
• EDF is an optimal, dynamic algorithm:
  – it may reschedule periodic tasks in each period
  – if a task set can be scheduled by any priority assignment, it can be scheduled by EDF.
• Admission control: EDF produces a feasible schedule whenever processor utilisation is ≤ 100%.
• Problem: scheduling overhead can be large.
• Problem: if system overloaded, all bets are off.

Notes:
1. Also get least slack-time first (LSTF):
   • similar, but not identical
   • e.g. A: 2ms every 7ms; B: 4ms every 8ms
2. RM, EDF, LSTF all preemptive (c/f P3Q7, 2000).

Operating Systems II — Real-Time Systems 19
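The dispatch loop above is easy to simulate. A hypothetical sketch (my own, with unit-time quanta and implicit deadlines at the end of each period):

```python
def edf_schedule(tasks, horizon):
    """Preemptive EDF over periodic tasks; tasks = [(s, p)], deadline = period end.

    Returns the per-tick schedule (task index or None when idle),
    or None if some job misses its deadline.
    """
    jobs = []  # each job is [deadline, remaining, task index]
    timeline = []
    for t in range(horizon):
        for i, (s, p) in enumerate(tasks):
            if t % p == 0:
                jobs.append([t + p, s, i])  # release a new job
        if any(d <= t for (d, rem, i) in jobs):
            return None  # an unfinished job has passed its deadline
        jobs.sort()  # earliest deadline at the head
        if jobs:
            jobs[0][1] -= 1  # run head job for one tick
            timeline.append(jobs[0][2])
            if jobs[0][1] == 0:
                jobs.pop(0)
        else:
            timeline.append(None)
    return timeline
```

With the LSTF example task set (A: 2ms every 7ms; B: 4ms every 8ms, U ≈ 0.79) EDF succeeds; push utilisation over 100% and, as the slide warns, deadlines are missed.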

Priority Inversion

• All priority-based schemes can potentially suffer from priority inversion.
• E.g. consider low, medium and high priority processes called Pl, Pm and Ph respectively:
  1. first Pl admitted, and locks a semaphore S.
  2. then other two processes enter.
  3. Ph runs since highest priority; tries to lock S and blocks.
  4. then Pm gets to run, thus preventing Pl from releasing S, and hence Ph from running.
• Usual solution is priority inheritance:
  – associate with every semaphore S the priority P of the highest priority process waiting for it.
  – then temporarily boost priority of holder of semaphore up to P.
  – can use handoff scheduling to implement.
• NT “solution”: priority boost for CPU starvation:
  – checks if ∃ ready thread not run ≥ 300 ticks.
  – if so, doubles quantum & boosts priority to 15.

Operating Systems II — Real-Time Systems 20
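The inheritance rule can be sketched as a recursive computation of a holder's effective priority (an illustrative model, not any real kernel's implementation; the recursion gives transitive inheritance through chains of blocked holders, and assumes no deadlock cycles):

```python
def effective_priority(thread, mutexes):
    """thread: {'base': int}; mutexes: [{'holder': thread, 'waiters': [thread, ...]}].

    A holder runs at the maximum of its own base priority and the
    effective priorities of all threads waiting on mutexes it holds.
    """
    p = thread['base']
    for m in mutexes:
        if m['holder'] is thread:
            for w in m['waiters']:
                p = max(p, effective_priority(w, mutexes))
    return p
```

With the Pl/Pm/Ph example: once Ph blocks on S, Pl's effective priority rises to Ph's, so Pm can no longer keep Pl (and hence Ph) from running.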

Multimedia Scheduling

• Increasing interest in multimedia applications (e.g. mp3 player, video conferencing, 3D games).
• Challenges OS since require presentation (or processing) of data in a timely manner.
• Main technique: exploit SRT scheduling.
• Effective since:
  – the value of multimedia data depends on the timeliness with which it is presented/processed
    ⇒ real-time scheduling allows apps to receive sufficient and timely resource allocation to handle their needs even when the system is under heavy load.
  – multimedia data streams are often somewhat tolerant of information loss
    ⇒ informing applications and providing soft guarantees on resources are sufficient.
• OS needs to provide sufficient control so that apps behave well under contention.
• Still ongoing research area.

Operating Systems II — Real-Time Systems 21

Example: Lottery Scheduling (MIT)

[Figure: total of 20 tickets distributed among processes; a random draw in [1..20] (here 15) selects the winner]

• Basic idea:
  – distribute set of tickets between processes.
  – to schedule: select a ticket [pseudo-]randomly, and allocate resource (e.g. CPU) to the winner.
  – approximates proportional share.
• Why would we do this?
  – simple uniform abstraction for resource management (e.g. use tickets for I/O also)
  – “solve” priority inversion with ticket transfer.
• How well does it work?
  – σ = √(np(1 − p)) ⇒ accuracy improves with √n
  – i.e. “asymptotically fair”
• Stride scheduling much better.

Operating Systems II — Real-Time Systems 22
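A minimal sketch of the draw itself (illustrative; the function and field names are made up):

```python
import random

def lottery_pick(rng, tickets):
    """tickets: dict mapping process -> ticket count.
    Draw one ticket uniformly and return the process holding it."""
    draw = rng.randrange(sum(tickets.values()))
    for proc, n in tickets.items():
        if draw < n:
            return proc
        draw -= n
```

Over many draws each process wins in proportion to its tickets (15 of 20 tickets ⇒ about 75% of quanta), with the binomial standard deviation √(np(1 − p)) quoted above.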

Example: BVT (Stanford)

• Lottery/stride scheduling don’t explicitly support latency-sensitive threads:
  – some applications don’t require much CPU but need it at the right time
  – e.g. MPEG player application requires CPU every frame time.
• Borrowed Virtual Time (BVT) uses a number of techniques to try to address this:
  – execution of threads measured in virtual time
  – for each thread ti maintain its actual virtual time Ai and effective virtual time Ei
  – Ei = Ai − (warpOn ? Wi : 0), where Wi is the virtual time warp for ti
  – always run thread with lowest Ei
  – after thread runs for ∆ time units, update Ai ← Ai + (∆ × mi)
  – (mi is an inversely proportional weight)
  – also have warp time limit Li and unwarp time requirement Ui to prevent pathological cases
• Overall: lots of confusing parameters to set.

Operating Systems II — Real-Time Systems 23
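The core dispatch decision follows directly from the definitions above. A sketch (field names invented for illustration; the Li/Ui safeguards are omitted):

```python
def bvt_pick(threads):
    """Run the thread with lowest effective virtual time
    E_i = A_i - (W_i if warped else 0)."""
    return min(threads, key=lambda t: t['A'] - (t['W'] if t['warp'] else 0))

def bvt_charge(thread, delta):
    """After running for delta time units: A_i <- A_i + delta * m_i,
    where m_i is the thread's (inversely proportional) weight."""
    thread['A'] += delta * thread['m']
```

A latency-sensitive thread (the MPEG player, say) turns warp on when it wakes: it temporarily "borrows" virtual time to run ahead, and pays it back as its Ai is charged for the CPU it uses.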

Example: Atropos (CUCL)

[Figure: scheduler state machine — threads move between Run, Wait and Block via Deschedule, Allocate, Interrupt or System Call, Unblock and Wakeup transitions; run queues EDF, OPT and IDLE]

• Basic idea:
  – use EDF with implicit deadlines to effect proportional share over explicit timescales
  – if no EDF tasks runnable, schedule best-effort.
• Scheduling parameters are (s, p, x):
  – requests s milliseconds per p milliseconds
  – x means “eligible for slack time”
• Uses explicit admission control.
• Actual scheduling is easy (∼200 lines C).

Operating Systems II — Real-Time Systems 24

Virtual Memory Management

• Limited physical memory (DRAM); need space for:
  – operating system image
  – processes (text, data, heap, stack, . . . )
  – I/O buffers
• Memory management subsystem deals with:
  – Support for address binding (i.e. loading, dynamic linking).
  – Allocation of limited physical resources.
  – Protection & sharing of ‘components’.
  – Providing convenient abstractions.
• Quite complex to implement:
  – processor-, motherboard-specific.
  – trade-offs keep shifting.
• Coming up in this section:
  – virtual addresses and address translation,
  – demand paged virtual memory management,
  – 2 case studies (Unix and VMS),
  – and a few other VM-related issues.

Operating Systems II — Virtual Memory Management 25

Logical vs Physical Addresses (1)

Old systems directly accessed [physical] memory, which caused some problems, e.g.

• Contiguous allocation:
  – need large lump of memory for process
  – with time, get [external] fragmentation ⇒ require expensive compaction
• Address binding (i.e. dealing with absolute addressing):
  – “int x; x = 5;” → “movl $0x5, ????”
  – compile time ⇒ must know load address.
  – load time ⇒ work every time.
  – what about swapping?
• Portability:
  – how much memory should we assume a “standard” machine will have?
  – what happens if it has less? or more?

Can avoid lots of problems by separating concept of logical (or virtual) and physical addresses.

Operating Systems II — Virtual Addressing 26

Logical vs Physical Addresses (2)

[Figure: CPU issues logical address → MMU translates to physical address for memory; translation fault delivered to OS]

Run-time mapping from logical to physical addresses performed by special hardware (the MMU). If we make this mapping a per-process thing then:

• Each process has own address space.
• Address binding solved:
  – bind to logical addresses at compile-time,
  – bind to real addresses at load time/run time.
• Allocation problem split:
  – virtual address allocation easy.
  – allocate physical memory ‘behind the scenes’.

Two variants: segmentation and paging.

Operating Systems II — Virtual Addressing 27

Segmentation

[Figure: CPU issues logical address (s, o); segment registers hold (limit, base) pairs; o is checked against the limit of segment s — if o ≤ limit, base + o is sent to memory, otherwise an address fault is raised]

• MMU has a set (≥ 1) of segment registers (e.g. orig x86 had four: cs, ds, es and ss).
• CPU issues tuple (s, o):
  1. MMU selects segment s.
  2. Checks o ≤ limit.
  3. If ok, forwards base + o to memory controller.
• Typically augment translation information with protection bits (e.g. read, write, execute, etc.)
• Overall, nice logical view (protection & sharing)
• Problem: still have [external] fragmentation.

Operating Systems II — Virtual Addressing 28
Paging

[Figure: CPU issues logical address (p, o); the page table maps page number p to frame number f; physical address is (f, o)]

1. Physical memory: f frames, each 2^s bytes.
2. Virtual memory: p pages, each 2^s bytes.
3. Page table maps {0, . . . , p − 1} → {0, . . . , f − 1}
4. Allocation problem has gone away!

• Typically have p > f ⇒ add valid bit to say if a given page is represented in physical memory.
• Problem: now have internal fragmentation.
• Problem: protection/sharing now per page.

Operating Systems II — Virtual Addressing 29

Segmentation versus Paging

                 Segmentation   Paging
  logical view        ✓           ✗
  allocation          ✗           ✓

⇒ try combined scheme.

• E.g. paged segments (Multics, OS/2)
  – divide each segment si into k = li/2^n pages, where li is the limit (length) of the segment.
  – have page table per segment.
  ✗ high hardware cost / complexity.
  ✗ not very portable.

• E.g. software segments (most modern OSs)
  – consider pages [m, . . . , m + l] to be a segment.
  – OS must ensure protection / sharing kept consistent over region.
  ✗ loss in granularity.
  ✓ relatively simple / portable.

Operating Systems II — Virtual Addressing 30

Translation Lookaside Buffers

Typically #pages large ⇒ page table lives in memory. Add TLB, a fully associative cache for mapping info:

• Check each memory reference in TLB first.
• So TLB contains n entries something like:

    Tag:  [ V | PTag | Page Number ]      Data: [ Frame Number | S K R W X | G ]

  Most parts also present in page table entries (PTEs).
• If miss ⇒ need to load info from page table:
  – may be done in h/w or s/w (by OS).
  – if full, replace entry (usually h/w)
• Include protection info ⇒ can perform access check in parallel with translation.
• Context switch requires [expensive] flush:
  – can add process tags to improve performance.
  – “global” bit useful for wide sharing.
• Use superpages for large regions.

Operating Systems II — Virtual Addressing 31

Page Table Design Issues

• A page table is a data structure to translate pages p ∈ P to frames f ∈ {F ∪ ⊥}.
• Various concerns need to be addressed:
  – time-efficiency: we need to do a page-table lookup every time we get a TLB miss ⇒ we wish to minimize the time taken for each one.
  – space-efficiency: since we typically have a page table for every process, we wish to minimize the memory required for each one.
  – sparsity: particularly for 64-bit address spaces, |P| ≫ |F|; best if we can have page table size proportional to |F| rather than |P|.
  – sharing: given that many processes will have similar address spaces (fork, shared libraries), would like easy sharing of address regions.
  – cache friendly: PTEs for close together p1, p2 should themselves be close together in memory.
  – superpage support: ideally should support any size of superpage the TLB understands.
  – simplicity: all other things being equal, the simpler the solution the better.

Operating Systems II — Virtual Addressing 32

Multi-Level Page Tables

[Figure: virtual address split into P1 | P2 | Offset; a per-process base register points at the L1 page table; P1 indexes the L1 table to find the L2 table's address; P2 indexes the L2 table to find the leaf PTE]

• Modern systems have 2^32 or 2^64 byte VAS ⇒ have between 2^22 and 2^42 pages (and hence PTEs).
• Solution: use N-ary tree (N large, e.g. 256–4096).
• Keep PTBR per process and context switch.
• Advantages:
  – easy to implement.
  – cache friendly.
• Disadvantages:
  – Require d ≥ 2 memory references.
  – Potentially poor space overhead.
  – Inflexibility: superpages, residency.

Operating Systems II — Virtual Addressing 33

MPT Pseudo Code

Assuming 32-bit virtual addresses, 4K pages, 2-level MPT (so |P1| = |P2| = 10 bits, and |offset| = 12 bits):

    .entry TLBMiss:
        mov r0 <- Ptbr          // load page table base
        mov r1 <- FaultingVA    // get the faulting VA
        shr r1 <- r1, #12       // clear offset bits
        shr r2 <- r1, #10       // get L1 index
        shl r2 <- r2, #2        // scale to sizeof(PDE)
        pld r0 <- [r0, r2]      // load [phys] PDE
        shr r0 <- r0, #12       // clear prot bits
        shl r0 <- r0, #12       // get L2 address
        shl r1 <- r1, #22       // zap L1 index
        shr r1 <- r1, #20       // scale to sizeof(PTE)
        pld r0 <- [r0, r1]      // load [phys] PTE
        mov Tlb <- r0           // install TLB entry
        iret                    // return

Operating Systems II — Virtual Addressing 34
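The same two-level walk in higher-level form may be easier to follow. A sketch (nested dicts stand in for the two table levels; a real PDE/PTE would also carry the protection bits the handler strips):

```python
def mpt_lookup(l1_table, va):
    """2-level page-table walk for 32-bit VAs with 4K pages:
    VA = 10-bit L1 index | 10-bit L2 index | 12-bit offset."""
    p1 = (va >> 22) & 0x3FF
    p2 = (va >> 12) & 0x3FF
    offset = va & 0xFFF
    l2_table = l1_table.get(p1)
    if l2_table is None:
        raise KeyError("page fault: no L2 table for this region")
    frame = l2_table.get(p2)
    if frame is None:
        raise KeyError("page fault: page not mapped")
    return (frame << 12) | offset   # physical address
```

The sparsity benefit is visible here: an entirely unmapped 4MB region costs one absent L1 entry rather than 1024 absent PTEs.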

Linear Page Tables

[Figure: 32-bit virtual address = page number (31–12) | offset (11–0). ① index the linear page table (a 4Mb region in the VAS) — a potential nested TLB miss; ② resolve the LPT page via the system page table; ③ install that mapping in the TLB; ④ retry; ⑤ install the original mapping]

• Modification of MPTs:
  – typically implemented in software
  – pages of LPT translated on demand
  – i.e. the nested-miss stages (② and ③) not always needed.
• Advantages:
  – can require just 1 memory reference.
  – (initial) miss handler simple.
• But doesn’t fix sparsity / superpages.
• Guarded page tables (≈ tries) claim to fix these.

Operating Systems II — Virtual Addressing 35

LPT Pseudo Code

Assuming 32-bit virtual addresses, 4K pages (so |page number| = 20 bits, and |offset| = 12 bits):

    .entry TLBMiss:
        mov r0 <- Ptbr          // load page table base
        mov r1 <- FaultingVA    // get the faulting VA
        shr r1 <- r1, #12       // clear offset bits
        shl r1 <- r1, #2        // scale to sizeof(PTE)
        vld r0 <- [r0, r1]      // virtual load PTE:
                                //  .. this can fault!
        mov Tlb <- r0           // install TLB entry
        iret                    // return

    .entry DoubleTLBMiss:
        add r2 <- r0, r1        // copy dble fault VA
        mov r3 <- SysPtbr       // get system PTBR
        shr r2 <- r2, #12       // clear offset bits
        shl r2 <- r2, #22       // zap "L1" index
        shr r2 <- r2, #20       // scale to sizeof(PTE)
        pld r3 <- [r3, r2]      // phys load LPT PTE
        mov Tlb <- r3           // install TLB entry
        iret                    // retry vld above

Operating Systems II — Virtual Addressing 36

Inverted Page Tables

[Figure: 32-bit virtual address = page number | offset; hash h(PN) = k indexes a table with one entry per frame (0 . . . f−1), each holding a page number and status bits]

• Recall f < p ⇒ keep one entry per frame instead.
• Then table size bounded by physical memory!
• IPT: frame number is h(pn):
  ✓ only one memory reference to translate.
  ✓ no problems with sparsity.
  ✓ can easily augment with process tag.
  ✗ no option on which frame to allocate.
  ✗ dealing with collisions.
  ✗ cache unfriendly.

Operating Systems II — Virtual Addressing 37

Hashed Page Tables

[Figure: hash h(PN) = k indexes a hashed page table; each entry holds the page number and a full PTE]

• HPT: simply extend IPT into proper hash table, i.e. make frame number explicit:
  ✓ can map to any frame.
  ✓ can choose table size.
  ✗ table now bigger.
  ✗ no superpages.
  ✗ sharing still hard.
  ✗ still cache unfriendly.
• Can solve these last with clustered page tables.

Operating Systems II — Virtual Addressing 38
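The HPT structure can be sketched with chained buckets (an illustration, not any particular kernel's layout; the process tag goes into the key, as discussed for IPTs):

```python
def hpt_insert(buckets, pid, page, frame):
    """buckets: list of chains of (pid, page, frame) tuples."""
    buckets[hash((pid, page)) % len(buckets)].append((pid, page, frame))

def hpt_lookup(buckets, pid, page):
    """Return the frame for (pid, page), or None if unmapped."""
    for (p, pn, f) in buckets[hash((pid, page)) % len(buckets)]:
        if p == pid and pn == page:
            return f
    return None
```

Unlike the IPT, the table size is a free parameter and any frame can be chosen on allocation; but walking a chain on every miss is exactly the cache-unfriendliness the slide complains about.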

Virtual Memory

• Virtual addressing allows us to introduce the idea of virtual memory:
  – already have valid or invalid page translations; introduce new “non-resident” designation
  – such pages live on a non-volatile backing store
  – processes access non-resident memory just as if it were ‘the real thing’.
• Virtual memory (VM) has a number of benefits:
  – portability: programs work regardless of how much actual memory present
  – convenience: programmer can use e.g. large sparse data structures with impunity
  – efficiency: no need to waste (real) memory on code or data which isn’t used.
• VM typically implemented via demand paging:
  – programs (executables) reside on disk
  – to execute a process we load pages in on demand, i.e. as and when they are referenced.
• Also get demand segmentation, but rare.

Operating Systems II — Demand Paged Virtual Memory 39

Demand Paging Details

When loading a new process for execution:
• create its address space (e.g. page tables, etc)
• mark PTEs as either “invalid” or “non-resident”
• add PCB to scheduler.

Then whenever we receive a page fault:
1. check PTE to determine if “invalid” or not
2. if an invalid reference ⇒ kill process.
3. otherwise ‘page in’ the desired page:
   • find a free frame in memory
   • initiate disk I/O to read in the desired page
   • when I/O is finished modify the PTE for this page to show that it is now valid
   • restart the process at the faulting instruction

Scheme described above is pure demand paging:
• never brings in a page until required ⇒ get lots of page faults and I/O when process begins.
• hence many real systems explicitly load some core parts of the process first.

Operating Systems II — Demand Paged Virtual Memory 40

Page Replacement

• When paging in from disk, we need a free frame of physical memory to hold the data we’re reading in.
• In reality, size of physical memory is limited ⇒
  – need to discard unused pages if total demand for pages exceeds physical memory size
  – (alternatively could swap out a whole process to free some frames)
• Modified algorithm: on a page fault we
  1. locate the desired replacement page on disk
  2. to select a free frame for the incoming page:
     (a) if there is a free frame use it
     (b) otherwise select a victim page to free,
     (c) write the victim page back to disk, and
     (d) mark it as invalid in its process page tables
  3. read desired page into freed frame
  4. restart the faulting process
• Can reduce overhead by adding a ‘dirty’ bit to PTEs (can potentially omit step 2c above)
• Question: how do we choose our victim page?

Operating Systems II — Demand Paged Virtual Memory 41

Page Replacement Algorithms

• First-In First-Out (FIFO):
  – keep a queue of pages, discard from head
  – performance difficult to predict: no idea whether page replaced will be used again or not
  – discard is independent of page use frequency
  – in general: pretty bad, although very simple.
• Optimal Algorithm (OPT):
  – replace the page which will not be used again for longest period of time
  – can only be done with an oracle, or in hindsight
  – serves as a good comparison for other algorithms
• Least Recently Used (LRU):
  – LRU replaces the page which has not been used for the longest amount of time
  – (i.e. LRU is OPT with -ve time)
  – assumes past is a good predictor of the future
  – Q: how do we determine the LRU ordering?

Operating Systems II — Page Replacement Algorithms 42
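The three policies can be compared on a concrete reference string. A simulation sketch (OPT is implemented "in hindsight" by scanning the remainder of the string):

```python
def count_faults(refs, nframes, policy):
    """Count page faults for 'FIFO', 'LRU' or 'OPT' on reference string refs."""
    frames = []  # FIFO: load order; LRU: front = least recently used
    faults = 0
    for i, page in enumerate(refs):
        if page in frames:
            if policy == 'LRU':
                frames.remove(page)
                frames.append(page)  # mark most recently used
            continue
        faults += 1
        if len(frames) == nframes:
            if policy == 'OPT':
                # evict the page whose next use is furthest in the future
                def next_use(q):
                    return refs.index(q, i + 1) if q in refs[i + 1:] else len(refs)
                frames.remove(max(frames, key=next_use))
            else:
                frames.pop(0)  # oldest (FIFO) or least recently used (LRU)
        frames.append(page)
    return faults
```

On the classic string 1 2 3 4 1 2 5 1 2 3 4 5, FIFO takes 9 faults with 3 frames but 10 with 4 (Belady's anomaly, mentioned on the performance-comparison slide), while OPT needs only 7.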

Implementing LRU

• Could try using counters:
  – give each page table entry a time-of-use field and give CPU a logical clock (counter)
  – whenever a page is referenced, its PTE is updated to clock value
  – replace page with smallest time value
  – problem: requires a search to find min value
  – problem: adds a write to memory (PTE) on every memory reference
  – problem: clock overflow
• Or a page stack:
  – maintain a stack of pages (doubly linked list) with most-recently used (MRU) page on top
  – discard from bottom of stack
  – requires changing 6 pointers per [new] reference
  – very slow without extensive hardware support
• Neither scheme seems practical on a standard processor ⇒ need another way.

Operating Systems II — Page Replacement Algorithms 43

Approximating LRU (1)

• Many systems have a reference bit in the PTE which is set by h/w whenever the page is touched.
• This allows not recently used (NRU) replacement:
  – periodically (e.g. 20ms) clear all reference bits
  – when choosing a victim to replace, prefer pages with clear reference bits
  – if also have a modified bit (or dirty bit) in the PTE, can extend NRU to use that too:

      Ref?  Dirty?  Comment
      no    no      best type of page to replace
      no    yes     next best (requires writeback)
      yes   no      probably code in use
      yes   yes     bad choice for replacement

• Or can extend by maintaining more history, e.g.:
  – for each page, the operating system maintains an 8-bit value, initialized to zero
  – periodically (e.g. 20ms) shift reference bit onto high order bit of the byte, and clear reference bit
  – select lowest value page (or one of) to replace

Operating Systems II — Page Replacement Algorithms 44

Approximating LRU (2)

• Popular NRU scheme: second-chance FIFO:
  – store pages in queue as per FIFO
  – before discarding head, check its reference bit
  – if reference bit is 0, discard; otherwise:
    ∗ reset reference bit, and
    ∗ add page to tail of queue
    ∗ i.e. give it “a second chance”
• Often implemented with a circular queue and a current pointer; in this case usually called clock.
• If no h/w provided reference bit can emulate:
  – to clear “reference bit”, mark page no access
  – if referenced ⇒ trap; update PTE, and resume
  – to check if referenced, check permissions
  – can use similar scheme to emulate modified bit

Operating Systems II — Page Replacement Algorithms 45
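The clock variant above fits in a few lines. A sketch (each frame is represented here as a dict carrying its reference bit):

```python
def clock_evict(frames, hand):
    """Advance the clock hand, clearing reference bits as it passes,
    until a frame with ref == 0 is found.
    Returns (victim index, new hand position)."""
    while True:
        if frames[hand]['ref'] == 0:
            return hand, (hand + 1) % len(frames)
        frames[hand]['ref'] = 0          # second chance: clear bit, move on
        hand = (hand + 1) % len(frames)
```

Because the hand only clears bits as it sweeps, a frequently-touched page keeps getting its bit re-set by the hardware (or the trap-based emulation above) and is never chosen.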

Other Replacement Schemes

• Counting Algorithms: keep a count of the number of references to each page
  – LFU: replace page with smallest count
  – MFU: replace highest count, because low count ⇒ most recently brought in.
• Page Buffering Algorithms:
  – keep a min. number of victims in a free pool
  – new page read in before writing out victim.
• (Pseudo) MRU:
  – consider access of e.g. large array.
  – page to replace is one application has just finished with, i.e. most recently used.
  – e.g. track page faults and look for sequences.
  – discard the kth in victim sequence.
• Application-specific:
  – stop trying to second guess what’s going on.
  – provide hook for app. to suggest replacement.
  – must be careful with denial of service.

Operating Systems II — Page Replacement Algorithms 46

Performance Comparison

[Graph: page faults per 1000 references (0–45) against number of page frames available (5–15), for FIFO, CLOCK, LRU and OPT on a pseudo-local reference string]

• want to minimise area under curve
• FIFO can exhibit Belady’s anomaly (although it doesn’t in this case)
• getting frame allocation right has major impact. . .

Operating Systems II — Page Replacement Algorithms 47

Frame Allocation

• A certain fraction of physical memory is reserved per-process and for core OS code and data.
• Need an allocation policy to determine how to distribute the remaining frames.
• Objectives:
  – Fairness (or proportional fairness)?
    ∗ e.g. divide m frames between n processes as m/n, with remainder in the free pool
    ∗ e.g. divide frames in proportion to size of process (i.e. number of pages used)
  – Minimize system-wide page-fault rate? (e.g. allocate all memory to few processes)
  – Maximize level of multiprogramming? (e.g. allocate min memory to many processes)
• Most page replacement schemes are global: all pages considered for replacement.
⇒ allocation policy implicitly enforced during page-in:
  – allocation succeeds iff policy agrees
  – ‘free frames’ often in use ⇒ steal them!

Operating Systems II — Frame Allocation 48

The Risk of Thrashing

[Graph: CPU utilisation rises with degree of multiprogramming, then collapses at the onset of thrashing]

• As more processes enter the system, the frames-per-process value can get very small.
• At some point we hit a wall:
  – a process needs more frames, so steals them
  – but the other processes need those pages, so they fault to bring them back in
  – number of runnable processes plunges
• To avoid thrashing we must give processes as many frames as they “need”
• If we can’t, we need to reduce the MPL (a better page-replacement algorithm will not help)

Operating Systems II — Frame Allocation 49

Locality of Reference

[Graph: miss address (0x10000–0xc0000) against miss number (0–80000) for a compiler-like workload; distinct bands for kernel code, kernel data/bss, connector daemon, timer IRQs, user code, user data/bss, user stack, I/O buffers and malloc regions, with execution phases (kernel init, parse, optimise, output) visible as clustered misses]

Locality of reference: in a short time interval, the locations referenced by a process tend to be grouped into a few regions in its address space:
• procedure being executed
• . . . sub-procedures
• . . . data access
• . . . stack variables

Note: have locality in both space and time.

Operating Systems II — Frame Allocation 50

Avoiding Thrashing

We can use the locality of reference principle to help determine how many frames a process needs:
• define the Working Set (Denning, 1967):
  – set of pages that a process needs in store at “the same time” to make any progress
  – varies between processes and during execution
  – assume process moves through phases:
    ∗ in each phase, get (spatial) locality of reference
    ∗ from time to time get phase shift
• Then OS can try to prevent thrashing by maintaining sufficient pages for current phase:
  – sample page reference bits every e.g. 10ms
  – if a page is “in use”, say it’s in the working set
  – sum working set sizes to get total demand D
  – if D > m we are in danger of thrashing ⇒ suspend a process
• Alternatively use page fault frequency (PFF):
  – monitor per-process page fault rate
  – if too high, allocate more frames to process

Operating Systems II — Frame Allocation 51
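The sampling step can be modelled as a sliding window over a reference trace (an illustration of the idea, not any real kernel's bookkeeping):

```python
def working_set(refs, t, tau):
    """refs: list of (time, page) references by one process.
    WS(t, tau) = distinct pages referenced in the window (t - tau, t]."""
    return {page for (u, page) in refs if t - tau < u <= t}

def thrashing_risk(traces, t, tau, nframes):
    """traces: one reference list per process.  D = sum of working-set
    sizes; danger of thrashing when D exceeds the available frames."""
    demand = sum(len(working_set(r, t, tau)) for r in traces)
    return demand > nframes
```

Choosing the window tau is the crux: too small and the working set misses the current phase; too large and it spans a phase shift and overstates demand.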

Other Performance Issues

Various other factors influence VM performance:
• Program structure: consider for example

    for(j=0; j<1024; j++)
        for(i=0; i<1024; i++)
            array[i][j] = 0;

  – a killer on a system with 1024-word pages: either have all pages resident, or get 2^20 page faults.
  – if reverse order of iteration (i↔j) then works fine.
• Language choice:
  – ML, lisp: use lots of pointers, tend to randomise memory access ⇒ kills spatial locality
  – Fortran, C, Java: relatively few pointer refs
• Pre-paging:
  – avoid problem with pure demand paging
  – can use WS technique to work out what to load
• Real-time systems:
  – no paging in hard RT (must lock all pages)
  – for SRT, trade-offs may be available

Operating Systems II — Demand Paged Virtual Memory 52

Case Study 1: Unix

• Kernel per-process info split into two kinds:
  – proc and text structures always resident.
  – page tables, user structure and kernel stack could be swapped out.
• Swapping allowed from very early on:
  – performed by special process: the swapper (usually process 0).
  – periodically awaken and inspect processes on disk.
  – choose one waiting longest time and prepare to swap in.
  – victim chosen by looking at scheduler queues: try to find process blocked on I/O.
  – other metrics: priority, overall time resident, time since last swap in (for stability).
• From 3BSD / SVR2 onwards, implemented demand paging.
• Today swapping only used when dire shortage of physical memory.

Operating Systems II — Virtual Memory Case Studies 53

Operating Systems II — Virtual Memory Case Studies 54 . • P 0. • Segments have page-aligned length. P 1 page tables in system segment.3 BSD Unix address space borrows from VAX: • 0Gb–1Gb: segment P 0 (text/data.Unix: Address Space 4 Gb 3 Gb Reserved 2 Gb Kernel Kernel Stack User Structure User Stack System P1 1 Gb Dynamic Data (Heap) Static Data + BSS 0 Gb P0 Text (Program Code) 4. grow upward) • 1Gb–2Gb: segment P 1 (stack. grows downward) • 2Gb–3Gb: system segment (for kernel). Address translation done in hardware LPT: • System page table always resident.

Unix: Page Table Entries

[Figure: PTE formats, bit 31 = VLD (valid), bit 26 = FOD (fill on demand), bit 25 = FS, plus AXS (access) and M (modified) fields]

                      VLD  AXS  M  FOD  FS   remaining bits
    Resident           1   AXS  M   0   x    Frame Number
    Demand Zero        0   AXS  x   1   0    xxxx....xxxx
    Fill from File     0   AXS  x   1   1    Block Number
    Transit/Sampling   0   AXS  M   0   x    Frame Number
    On backing store   0   AXS  x   0   x    0000....0000

• PTEs for valid pages determined by h/w.
• If valid bit not set ⇒ use up to OS.
• BSD uses FOD bit, FS bit and the block number.
• First pair are “fill on demand”:
  – DZ used for BSS and growing stack.
  – FFF used for executables (text & data).
  – Simple pre-paging implemented via klusters.
• Backing store pages located via swap map(s).
• Sampling used to simulate reference bit.

Operating Systems II — Virtual Memory Case Studies 55

Unix: Paging Dynamics

• Physical memory managed by core map:
  – array of structures, one per cluster.
  – records if free or in use and [potentially] the associated page number and disk block.
  – free list threads through core map.
• Page replacement carried out by the page dæmon:
  – every 250ms, OS checks if “enough” (viz. lotsfree) physical memory free.
  – if not ⇒ wake up page dæmon.
• Basic algorithm: global [two-handed] clock:
  – hands point to different entries in core map.
  – first check if can replace front cluster; if not, clear its “reference bit” (viz. mark invalid).
  – then check if back cluster referenced (viz. marked valid): if so, give it a second chance; else flush to disk (if necessary), and put cluster onto end of free list.
  – move hands forward and repeat.
• System V Unix uses an almost identical scheme.

Operating Systems II — Virtual Memory Case Studies 56

Case Study 2: VMS

• VMS released in 1978 to run on the VAX-11/780.
• Aimed to support a wide range of hardware, and a job mix of real-time, timeshared and batch tasks.
• This led to a design with:
  – A local page replacement scheme,
  – A quota scheme for physical memory, and
  – An aggressive page clustering policy.
• First two based around idea of resident set:
  – simply the set of pages which a given process currently has in memory.
  – each process also has a resident-set limit.
  – once hit limit, choose victim from resident set
    ⇒ minimises impact on others.
• Then during execution:
  – pages faulted in by pager on demand.
• Also have swapper for extreme cases.

Operating Systems II — Virtual Memory Case Studies 57

VMS: Paging Dynamics

[Diagram: clean pages evicted from the resident sets of P1 and P2 go to the Free Page List, dirty ones to the Modified Page List; page reclaims pull pages back into a resident set; clustered page-outs move pages from the MPL to disk; free frames are allocated from the FPL; page-ins come from disk.]

• Basic algorithm: simple [local] FIFO.
• Suckful ⇒ augment with software "victim cache":
  – Victim pages placed on tail of FPL/MPL.
  – On fault, search lists before do I/O.
  – if |MPL| ≥ hi, write (|MPL| − lo) pages.
• Lists also allow aggressive page clustering:
  – Get ∼100 pages per write on average.

Operating Systems II — Virtual Memory Case Studies 58

VMS: Other Issues

• Automatic resident set limit adjustment:
  – system counts #page faults per process.
  – at quantum end, check if rate > PFRATH.
  – if so and if "enough" memory ⇒ increase RSL.
  – swapper trimming used to reduce RSLs again.
  – NB: real-time processes are exempt.
• Modified page replacement:
  – introduce callback for privileged processes.
  – prefer to retain pages with TLB entries.
• Other system services:
  – $SETSWM: disable process swapping.
  – $LKWSET: lock pages into resident set.
  – $LCKPAG: lock pages into memory.
• VMS still alive:
  – recent versions updated to support 64-bit address space.
  – son-of-VMS aka Win2K/XP also going strong.

Operating Systems II — Virtual Memory Case Studies 59

• e. – on each trap.Other VM Techniques Once have MMU. collector scans page(s).g. copy page. mark r/w and resume.  no significant interruption. – on trap. trap. can (ab)use for other reasons • Assume OS provides: – system calls to change memory protections. – finally. incremental checkpointing: – at time t atomically mark address space read-only. concurrent garbage collection: – mark unscanned areas of heap as no-access. • This enables a large number of applications.g. • e. resume mutator thread. – if mutator thread accesses these.  more space efficient Operating Systems II — Virtual Memory Case Studies 60 . copying and forwarding as necessary. – some way to “catch” memory exceptions.

Single Address Space Operating Systems

• Emerging large (64-bit) address spaces mean that having a SVAS is plausible once more.
• Separate concerns of "what we can see" and "what we are allowed to access".
• Advantages: easy sharing (unified addressing).
• Problems:
  – address binding issues return.
  – cache/TLB setup for MVAS model.
• Distributed shared virtual memory:
  – turn a NOW into a SMP.
  – how seamless do you think this is?
• Persistent object stores:
  – support for pickling & compression?
  – garbage collection?
• Sensible use requires restraint. . .

Operating Systems II — Virtual Memory Case Studies 61

Disk I/O

• Performance of disk I/O is crucial to swapping/paging and file system operation.
• Key parameters:
  1. wait for controller and disk.
  2. seek to appropriate disk cylinder.
  3. wait for desired block to come under the head.
  4. transfer data to/from disk.
• Performance depends on how the disk is organised.

[Diagram: disk geometry — platters on a rotating spindle; each surface divided into tracks and sectors; matching tracks across surfaces form a cylinder; read-write heads on an arm moved by the actuator.]

Operating Systems II — Disk Management 62

Disk Scheduling

• In a typical multiprogramming environment have multiple users queueing for access to disk.
• Also have VM system requests to load/swap/page processes/pages.
• We want to provide best performance to all users — specifically reducing seek time component.
• Several policies for scheduling a set of disk requests onto the device, e.g.
  1. FIFO: perform requests in their arrival order.
  2. SSTF: if the disk controller knows where the head is (hope so!) then it can schedule the request with the shortest seek from the current position.
  3. SCAN ("elevator algorithm"): relieves problem that an unlucky request could receive bad performance due to queue position.
  4. C-SCAN: scan in one direction only.
  5. N-step-SCAN and FSCAN: ensure that the disk head always moves.

Operating Systems II — Disk Management 63

Reference String = 55, 58, 39, 18, 90, 160, 150, 38, 184

[Four plots of head position (track 0–200) against time, one per policy: FIFO, SSTF, SCAN and C-SCAN, each serving the reference string above.]

Operating Systems II — Disk Management 64
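Total head movement on the reference string can be compared directly; this sketch assumes a starting head position of 100 (the plots do not record it) and, for SCAN, an initial sweep towards higher track numbers.

```python
REQS = [55, 58, 39, 18, 90, 160, 150, 38, 184]

def fifo(head, reqs):
    """Serve requests in arrival order; sum the seek distances."""
    total = 0
    for r in reqs:
        total += abs(head - r)
        head = r
    return total

def sstf(head, reqs):
    """Always serve the pending request with the shortest seek."""
    pending, total = list(reqs), 0
    while pending:
        r = min(pending, key=lambda x: abs(x - head))
        total += abs(head - r)
        head = r
        pending.remove(r)
    return total

def scan(head, reqs):
    """Elevator: sweep up through all requests above the head, then down."""
    up = sorted(r for r in reqs if r >= head)
    down = sorted((r for r in reqs if r < head), reverse=True)
    total, pos = 0, head
    for r in up + down:
        total += abs(pos - r)
        pos = r
    return total
```

On this workload SSTF and SCAN roughly halve FIFO's total head movement (498 vs 248 and 250 tracks), which is the point the four plots make visually.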

Other Disk Scheduling Issues

• Priority: usually beyond disk controller's control.
  – system decides to prioritise, for example by ensuring that swaps get done before I/O.
  – alternatively interactive processes might get greater priority over batch processes.
  – or perhaps short requests given preference over larger ones (avoid "convoy effect").
• SRT disk scheduling (e.g. Cello, USD):
  – per client/process scheduling parameters.
  – two stage: admission, then queue.
  – problem: overall performance?
• 2-D Scheduling (e.g. SPTF):
  – try to reduce rotational latency.
  – typically require h/w support.
• Bad blocks remapping:
  – typically transparent ⇒ can potentially undo scheduling benefits.
  – some SCSI disks let OS into bad-block story.

Operating Systems II — Disk Management 65

Logical Volumes

[Diagram: Hard Disk A divided into partitions A1–A3; Hard Disk B into partitions B1–B2.]

Modern OSs tend to abstract away from physical disk; instead use logical volume concept.
• Partitions first step.
• Augment with "soft partitions":
  – allow v. large number of partitions on one disk.
  – aggregation: can make use of v. small partitions.
  – dynamic resizing of partitions.
• Overall gives far more flexibility:
  – can customize, e.g. striping for performance, "real-time" volume. . .
  – e.g. NT FtDisk, IRIX xlm, OSF/1 lvm.
• Other big opportunity is reliability. . .

Operating Systems II — Disk Management 66

RAID

RAID = Redundant Array of Inexpensive Disks:
• Uses various combinations of striping and mirroring to increase performance.
• Can implement (some levels) in h/w or s/w.
• Many levels exist:
  – RAID0: striping over n disks (so actually !R).
  – RAID1: simple mirroring, i.e. write n copies of data to n disks (where n is 2 ;-).
  – RAID2: hamming ECC (for disks with no built-in error detection).
  – RAID3: stripe data on multiple disks and keep parity on a dedicated disk. Done at byte level ⇒ need spindle-synchronised disks.
  – RAID4: same as RAID3, but block level.
  – RAID5: same as RAID4, but no dedicated parity disk (round robin instead).
• AutoRAID trades off RAIDs 1 and 5.
• Even funkier stuff emerging. . .

Operating Systems II — Disk Management 67
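The parity used by RAID levels 3–5 is plain XOR, which is what makes single-disk recovery possible: a lost block is just the XOR of the parity with the surviving data blocks. A small sketch:

```python
def parity_block(blocks):
    """Byte-wise XOR parity across equal-sized data blocks (RAID 3/4/5)."""
    p = bytes(len(blocks[0]))
    for b in blocks:
        p = bytes(x ^ y for x, y in zip(p, b))
    return p

def reconstruct(surviving, parity):
    """Recover a single lost block: XOR the parity with all survivors."""
    return parity_block(surviving + [parity])
```

Because XOR is its own inverse, writing one data block only requires re-reading the old data and old parity (the "small write" cost that RAID5 spreads round-robin), not the whole stripe.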

Disk Caching

• Cache holds copy of some of disk sectors.
• Can reduce access time by applications if the required data follows the locality principle.
• Typically O/S also provides a cache in s/w:
  – may be done per volume, or overall.
  – also get unified caches — treat VM and FS caching as part of the same thing.
• Design issues:
  – transfer data by DMA or by shared memory?
  – replacement strategy: LRU, LFU, . . .
  – write through or write back?
  – partitioning? (USD, . . . )
  – reading ahead: e.g. track based, etc.
• Software caching issues:
  – should we treat all filesystems the same?
  – do applications know better?

Operating Systems II — Disk Management 68

4.3 BSD Unix Buffer Cache

• Name? Well, buffers data to/from disk, and caches recently used information.
• Modern Unix deals with logical blocks, i.e. FS block within a given partition / logical volume.
• Implemented as a hash table:
  – Hash on (devno, blockno) to see if present.
  – Linked list used for collisions.
• Also have LRU list (for replacement).
• Internal interface:
  – bread(): get data & lock buffer.
  – brelse(): unlock buffer (clean).
  – bdwrite(): mark buffer dirty (lazy write).
  – bawrite(): asynchronous write.
  – bwrite(): synchronous write.
• Uses sync every 30 secs for consistency.
• Limited prefetching (read-ahead).
• "Typically" prevents 85% of implied disk transfers.

Operating Systems II — Disk Management 69
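The hash-table-plus-LRU-list structure can be sketched as below. This is a toy model, not the BSD code: a dict stands in for the hash table with its collision chains, and `read_fn` stands in for the driver-level disk read that `bread()` would trigger on a miss.

```python
class BufferCache:
    """Toy 4.3BSD-style buffer cache: lookup keyed on (devno, blockno),
    with a separate LRU list deciding replacement order."""

    def __init__(self):
        self.table = {}   # (devno, blockno) -> data
        self.lru = []     # replacement order, most recently used last

    def bread(self, devno, blockno, read_fn):
        key = (devno, blockno)
        if key not in self.table:
            self.table[key] = read_fn(devno, blockno)  # miss: go to disk
        else:
            self.lru.remove(key)                        # hit: re-rank
        self.lru.append(key)
        return self.table[key]
```

A second `bread()` of the same block is served from the table without touching the disk — the mechanism behind the "prevents 85% of implied disk transfers" figure.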

NT/2K Cache Manager

• Cache Manager caches "virtual blocks":
  – viz. keeps track of cache "lines" as offsets within a file rather than a volume.
  – disk layout & volume concept abstracted away.
  ⇒ no translation required for cache hit.
  ⇒ can get more intelligent prefetching.
• Completely unified cache:
  – cache "lines" all in virtual address space.
  – decouples physical & virtual cache systems: e.g.
    ∗ virtually cache in 256K blocks,
    ∗ physically cluster up to 64K.
  – NT virtual memory manager responsible for actually doing the I/O.
  – allows lots of FS cache when VM system lightly loaded, little when system is thrashing.
• NT/2K also provides some user control:
  – if specify temporary attrib when creating file ⇒ will never be flushed to disk unless necessary.
  – if specify write through attrib when opening a file ⇒ all writes will synchronously complete.

Operating Systems II — Disk Management 70

File systems

What is a filing system? Normally consider it to comprise two parts:
1. Directory Service: this provides
   • naming mechanism
   • access control
   • existence control
   • concurrency control
2. Storage Service: this provides
   • integrity — data needs to survive:
     – hardware errors
     – OS errors
     – user errors
   • archiving
   • mechanism to implement directory service

What is a file?
• an ordered sequence of bytes (UNIX)
• an ordered sequence of records (ISO FTAM)

Operating Systems II — Filing Systems 71

File Mapping Algorithms

Need to be able to work out which disk blocks belong to which files ⇒ need a file-mapping algorithm, e.g.
1. chaining in the material
2. chaining in a map
3. table of pointers to blocks
4. extents

Aspects to consider:
• integrity checking after crash
• automatic recovery after crash
• efficiency for different access patterns
  – of data structure itself
  – of I/O operations to access it
• ability to extend files
• efficiency at high utilization of disk capacity

Operating Systems II — Filing Systems 72

Chaining in the Media

[Diagram: directory entries point to the first block of each file; every disk block holds a pointer to the next block in the file.]

Each disk block has pointer to next block in file. Can also chain free blocks.
• OK for sequential access — poor for random access.
• cost to find disk address of block n in a file:
  – best case: n disk reads
  – worst case: n disk reads
• Some problems:
  – not all of file block is file info
  – integrity check tedious. . .

Operating Systems II — Filing Systems 73

Chaining in a map

[Diagram: directory entries index into an in-memory map whose chains of pointers mirror the disk structure; the disk blocks themselves hold only file data.]

Maintain the chains of pointers in a map (in memory), mirroring disk structure.
• disk blocks only contain file information.
• integrity check easy: only need to check map.
• handling of map is critical for
  – performance: must cache bulk of it.
  – reliability: must replicate on disk.
• cost to find disk address of block n in a file:
  – best case: n memory reads
  – worst case: n disk reads

Operating Systems II — Filing Systems 74

Table of pointers

[Diagram: directory entry points to a table of pointers, each of which points directly to a disk block.]

• access cost to find block n in a file:
  – best case: 1 memory read
  – worst case: 1 disk read
• i.e. good for random access.
• integrity check easy: only need to check tables.
• free blocks managed independently (e.g. bitmap).
• table may get large ⇒ must chain tables, or build a tree of tables (e.g. UNIX inode).
• access cost for chain of tables? for hierarchy?

Operating Systems II — Filing Systems 75
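A worked example of the tree-of-tables idea: the classic UNIX inode layout of 12 direct pointers plus single, double and triple indirect blocks (those counts are an assumption for illustration — they vary by system) bounds the maximum file size as follows.

```python
def max_file_size(block_size, ptr_size=4, direct=12):
    """Maximum file size for an inode with `direct` direct pointers
    plus one single, one double and one triple indirect block."""
    per = block_size // ptr_size          # pointers per indirect block
    blocks = direct + per + per**2 + per**3
    return blocks * block_size
```

With 4K blocks and 4-byte pointers each indirect block holds 1024 pointers, so the triple-indirect term dominates and the limit is a little over 4TB; note the access cost also grows, since block n of a huge file may need up to three indirect-block reads before the data read.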

. QNX) • system may have some maximum #extents – new error can’t extend file – could copy file (i.e.Extent lists Directory List of Extents Disc blocks 2 3 2 Using contiguous blocks can increase performance. . • list of disk addresses and lengths (extents) • access cost: [perhaps] a disk read and then a searching problem. O(log(number of extents)) • can use bitmap to manage free space (e. compact into one extent) – or could chain tables or use a hierarchy as for table of pointers. 76 Operating Systems II — Filing Systems .g.

File Meta-Data (1)

What information is held about a file?
• times: creation, access and change?
• access control: who and how
• location of file data (see above)
• backup or archive information
• concurrency control

What is the name of a file?
• simple system: only name for file is human comprehensible text name
• perhaps want multiple text names for file
  – soft (symbolic) link: text name → text name
  – hard link: text name → file id
  – if we have hard links, must have reference counts on files

Together with the data structure describing the disk blocks, this information is known as the file meta-data.

Operating Systems II — Filing Systems 77

File Meta-Data (2)
[Diagram: two directory entries (foo RW, bar R) both point to a single file node holding the meta-data, which in turn locates the file data.]

Where is file information kept?
• no hard links: keep it in the directory structure.
• if have hard links, keep file info separate from directory entries:
  – file info in a block: OK if blocks small (e.g. TOPS10)
  – or in a table (UNIX i-node / v-node table)
• on OPEN, (after access check) copy info into memory for fast access.
• on CLOSE, write updated file data and meta-data to disk.

How do we handle caching meta-data?

Operating Systems II — Filing Systems

78

Directory Name Space
• simplest — flat name space (e.g. Univac Exec 8)
• two level (e.g. CTSS, TOPS10)
• general hierarchy:
  – a tree,
  – a directed (acyclic) graph (DAG)
• structure of name space often reflects data structures used to implement it
  – e.g. hierarchical name space ↔ hierarchical data structures
  – but, could have hierarchical name space and huge hash table!

General hierarchy:
• reflects structure of organisation, users' files etc.
• name is full path name, but can get rather long:
  e.g. /usr/groups/X11R5/src/mit/server/os/4.2bsd/utils.c
  – offer relative naming
  – login directory
  – current working directory
Operating Systems II — Filing Systems 79

Directory Implementation
• Directories often don't get very large (especially if access control is at the directory level rather than on individual files)
  ⇒ often quick look up
  ⇒ directories may be small compared to underlying allocation unit
• But: assuming small dirs means lookup is naïve ⇒ trouble if get big dirs:
  – keep entries sorted (e.g. use a B-Tree).
  – optimise for iteration.
• Query based access:
  – split filespace into system and user.
  – system wants fast lookup but doesn't much care about friendly names (or may actively dislike)
  – user wishes 'easy' retrieval ⇒ build content index and make searching the default lookup.
  – Q: how do we keep index up-to-date?
  – Q: what about access control?

Operating Systems II — Filing Systems

80

File System Backups

• Backup: keep (recent) copy of whole file system to allow recovery from:
  – CPU software crash
  – bad blocks on disk
  – disk head crash
  – fire, war, famine, pestilence
• What is a recent copy?
  – in real time systems recent means mirrored disks
  – daily usually sufficient for 'normal' machines
• Can use incremental technique:
  – full dump performed daily or weekly
    ∗ copy whole file system to another disk or tape
    ∗ ideally can do it while file system live. . .
  – incremental dump performed hourly or daily
    ∗ only copy files and directories which have changed since the last time.
    ∗ e.g. use the file 'last modified' time
  – to recover: first restore full dump, then successively add in incrementals

Operating Systems II — File System Integrity 81

Ruler Function

[Diagram: "tape to use" (0–4) plotted against day number (1–8), tracing the ruler pattern 0, 1, 0, 2, 0, 1, 0, 3.]

• Want to minimise #tapes needed, time to backup.
• Want to maximise the time a file is held on backup:
  – number days starting at 1
  – on day n use tape t such that 2^t is highest power of 2 which divides n
  – whenever we use tape t, dump all files modified since we last used that tape, or any tape with a higher number
• If file is created on day c and deleted on day d, a dump version will be saved substantially after d.
• The length of time it is saved depends on d − c and the exact values of c, d.

Operating Systems II — File System Integrity 82
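The tape-selection rule is just counting trailing zero bits of the day number:

```python
def tape_for_day(n):
    """Tape t such that 2**t is the highest power of two dividing day n."""
    t = 0
    while n % 2 == 0:
        n //= 2
        t += 1
    return t
```

Iterating over days 1–8 reproduces the ruler pattern in the diagram (0, 1, 0, 2, 0, 1, 0, 3): low-numbered tapes are reused often and hold recent changes, while each higher tape covers an exponentially longer window, which is what keeps a deleted file recoverable long after day d.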

Crash Recovery

• Most failures affect only files being modified.
• At start up after a crash run a disk scavenger:
  – try to recover data structures from memory (bring back core memory!)
  – get current state of data structures from disk
  – identify inconsistencies (may require operator intervention)
  – isolate suspect files and reload from backup
  – correct data structures and update disk
• Usually much faster and better (i.e. more recent) than recovery from backup.
• Can make scavenger's job simpler:
  – replicate vital data structures
  – spread replicas around the disk
• Even better: use journal file to assist recovery:
  – record all meta-data operations in an append-only [infinite] file.
  – ensure log records written before performing actual modification.

Operating Systems II — File System Integrity 83

Immutable files

• Avoid concurrency problems: use write-once files.
  – multiple version numbers: foo!11, foo!12 . . .
  – invent new version number on close
• Problems:
  – disk space is not infinite
    ∗ only keep last k versions (archive rest?)
    ∗ have an explicit keep call
    ∗ share disk blocks between different versions — complicated file system structures
  – what does name without version mean?
  – and the killer: directories aren't immutable!
• But:
  – only need concurrency control on version number
  – could be used (for files) on unusual media types
    ∗ write once optical disks
    ∗ remote servers (e.g. Cedar FS)
  – provides an audit trail
    ∗ required by the spooks
    ∗ often implemented on top of normal file system, e.g. UNIX RCS

Operating Systems II — Other FS Issues 84

Multi-level stores

Archiving (c.f. backup): keep frequently used files on fast media, migrate others to slower (cheaper) media. Can be done by:
• user — encouraged by accounting penalties
• system — migrate files by periodically looking at time of last use
• can provide transparent naming but not performance, e.g. HSM (Windows 2000)

Can integrate multi-level store with ideas from immutable files, e.g. Plan-9:
• file servers with fast disks
• write once optical juke box
• every night, mark all files immutable
• start migration of files which changed the previous day to optical disk
• access to archive explicit, e.g. /archive/08Jan2005/users/smh/. . .

Operating Systems II — Other FS Issues 85

• FAT16 could only handle partitions up to (216 × c) bytes ⇒ max 2Gb partition with 32K clusters.Case Study 1: FAT16/32 0 1 2 Disk Info 8 n-2 EOF 3 Free 4 7 Free A Attribute Bits File Name (8 Bytes) Extension (3 Bytes) A: Archive D: Directory V: Volume Label S: System H: Hidden R: Read-Only B 6 7 8 9 AD VSHR Reserved (10 Bytes) Time (2 Bytes) n-1 Free EOF Free Date (2 Bytes) First Cluster (2 Bytes) File Size (4 Bytes) • A file is a linked list of clusters: a cluster is a set of 2n contiguous disk blocks. n ≥ 0. • Each entry in the FAT contains either: – the index of another entry within the FAT. or – a special value Free meaning “free”. or – a special value EOF meaning “end of file”. • (and big cluster size is bad ) Operating Systems II — FS Case Studies 86 • Directory entries contain index into the FAT .

Extending FAT16 to FAT32

• Obvious extension: instead of using 2 bytes per entry, FAT32 uses 4 bytes per entry ⇒ can support e.g. 8Gb partition with 4K clusters.
• Further enhancements with FAT32 include:
  – can locate the root directory anywhere on the partition (in FAT16, the root directory had to immediately follow the FAT(s)).
  – can use the backup copy of the FAT instead of the default (more fault tolerant).
  – improved support for demand paged executables (consider the 4K default cluster size . . . ).
• VFAT on top of FAT32 does long name support: unicode strings of up to 256 characters.
  – want to keep same directory entry structure for compatibility with e.g. DOS ⇒ use multiple directory entries to contain successive parts of name.
  – abuse V attribute to avoid listing these.

Operating Systems II — FS Case Studies 87

Case Study 2: BSD FFS

[Diagram: disk divided into a boot block and two partitions, each holding a super-block, an inode table and data blocks.]

The original Unix file system: simple, elegant, slow. The fast file-system (FFS) was developed in the hope of overcoming the following shortcomings:
1. Data blocks too small:
   • 512 byte allocation size good to reduce internal fragmentation (median file size ∼ 2K)
   • but poor performance for somewhat larger files.
2. Poor data/metadata layout:
   • widely separating data and metadata ⇒ almost guaranteed long seeks
   • consecutive file blocks not close together.
   • head crash near start of partition disastrous.

Operating Systems II — FS Case Studies 88

FFS: Improving Performance

The FFS set out to address these issues:
• Block size problem:
  – use larger block size (e.g. 4096 or 8192 bytes)
  – but: last block in a file may be split into fragments of e.g. 512 bytes.
• Random allocation problem:
  – ditch free list in favour of bitmap ⇒ easier to find contiguous free blocks (e.g. 011100000011101)
  – divide disk into cylinder groups containing:
    ∗ a superblock (replica),
    ∗ a set of inodes,
    ∗ a bitmap of free blocks, and
    ∗ usage summary information.
  – (cylinder group ≈ little Unix file system)
• Cylinder groups used to:
  – keep inodes near their data blocks
  – keep inodes of a directory together
  – increase fault tolerance

Operating Systems II — FS Case Studies 89

FFS: Locality and Allocation

• Locality key to achieving high performance:
  1. spread unrelated things far apart.
  2. don't let disk fill up ⇒ can find space nearby.
• To achieve locality:
  – e.g. the BSD allocator tries to keep files in a directory in the same cylinder group, but spread directories out among the various cylinder groups.
  – similarly allocates runs of blocks within a cylinder group, but switches to a different one after 48K.
• So does all this make any difference?
  – yes! about 10x–20x original FS performance.
  – get up to 40% of disk bandwidth on large files.
  – and much better small file performance.
• Problems?
  – block-based scheme limits throughput ⇒ need decent clustering, or skip-sector allocation.
  – crash recovery not particularly fast.
  – rather tied to disk geometry. . .

Operating Systems II — FS Case Studies 90

Case Study 3: NTFS

[Diagram: the Master File Table (MFT) — records 0–16 hold metadata files ($Mft, $MftMirr, $LogFile, $Volume, $AttrDef, \, $Bitmap, $BadClus, . . . ), followed by user file/directory records; each file record is a list of attributes, e.g. Standard Information, Filename, Data.]

• Fundamental structure of NTFS is a volume:
  – based on a logical disk partition.
  – may occupy a portion of a disk, an entire disk, or span across several disks.
• An array of file records is stored in a special file called the Master File Table (MFT).
• The MFT is indexed by a file reference (a 64-bit unique identifier for a file).
• A file itself is a structured object consisting of a set of attribute/value pairs of variable length.

Operating Systems II — FS Case Studies 91

NTFS: Recovery

• To aid recovery, all file system data structure updates are performed inside transactions:
  – before a data structure is altered, the transaction writes a log record that contains redo and undo information.
  – after the data structure has been changed, a commit record is written to the log to signify that the transaction succeeded.
  – after a crash, the file system can be restored to a consistent state by processing the log records.
• Does not guarantee that all the user file data can be recovered after a crash — just that metadata files will reflect some prior consistent state.
• The log is stored in the third metadata file at the beginning of the volume ($Logfile).
• Logging functionality not part of NTFS itself:
  – NT has a generic log file service ⇒ could in principle be used by e.g. database.
• Overall makes for far quicker recovery after crash.

Operating Systems II — FS Case Studies 92

NTFS: Other Features

• Security:
  – each file object has a security descriptor attribute stored in its MFT record.
  – this attribute contains the access token of the owner of the file plus an access control list.
• Fault Tolerance:
  – FtDisk driver allows multiple partitions to be combined into a logical volume (RAID 0, 1, 5).
  – FtDisk can also handle sector sparing where the underlying SCSI disk supports it.
  – NTFS supports software cluster remapping.
• Compression:
  – NTFS can divide a file's data into compression units (blocks of 16 contiguous clusters).
  – NTFS also has support for sparse files:
    ∗ clusters with all zeros not allocated or stored.
    ∗ instead, gaps are left in the sequences of VCNs kept in the file record.
    ∗ when reading a file, gaps cause NTFS to zero-fill that portion of the caller's buffer.

Operating Systems II — FS Case Studies 93

Case Study 4: LFS (Sprite)

LFS is a log-structured file system — a radically different file system design:
• Premise 1: CPUs getting faster faster than disks ⇒ performance bottleneck is writing & seeking.
• Premise 2: memory cheap ⇒ large disk caches.
• Premise 3: large cache ⇒ most disk reads "free".

Basic idea: solve write/seek problems by using a log:
• log is [logically] an append-only piece of storage comprising a set of records.
• all data & meta-data updates written to log.
• periodically flush entire log to disk in a single contiguous transfer:
  – high bandwidth transfer.
  – can make blocks of a file contiguous on disk.
• have two logs ⇒ one in use, one being written.

What are the problems here?

Operating Systems II — FS Case Studies 94

LFS: Implementation Issues

1. How do we find data in the log?
   • can keep basic UNIX structure (directories, inodes, indirect blocks, etc).
   • then just need to find inodes ⇒ use inode map.
   • find inode maps by looking at a checkpoint.
   • checkpoints live in fixed region on disk.
2. What do we do when the disk is full?
   • need asynchronous scavenger to run over old logs and free up some space.
   • two basic alternatives:
     1. thread log through free space.
     2. compact live information to free up space.
   • neither great ⇒ use segmented log:
     – divide disk into large fixed-size segments.
     – thread between segments; compact within a segment.
     – when writing use only clean segments.
     – occasionally clean segments.
     – choosing segments to clean is hard. . . subject of ongoing debate in the OS community.

Operating Systems II — FS Case Studies 95

g. Operating Systems II — Protection 96 . files). users) on objects (e. processes or packets) – changing access rights Also wish to protect against the effects of errors: • isolate for debugging • isolate for damage control Protection mechanisms impose controls on access by subjects (e.g.Protection Require protection against unauthorised: • release of information – – – – reading or leaking data violating privacy legislation using proprietary software covert channels • modification of information – changing access rights – can do sabotage without reading information • denial of service – causing a crash – causing high load (e.g.

Protection and Sharing

If we have a single user machine with no network connection in a locked room then protection is easy. But we want to:
• share facilities (for economic reasons)
• share and exchange data (application requirement)

Some mechanisms we have already come across:
• user and supervisor levels
  – usually one of each
  – could have several (e.g. MULTICS rings)
• memory management hardware
  – protection keys
  – relocation hardware
  – bounds checking
  – separate address spaces
• files
  – access control list
  – groups etc

Operating Systems II — Protection 97

Design of Protection System

• Some other protection mechanisms:
  – lock the computer room (prevent people from tampering with the hardware)
  – restrict access to system software
  – de-skill systems operating staff
  – keep designers away from final system!
  – use passwords (in general challenge/response)
  – use encryption
  – legislate
• ref: Saltzer + Schroeder Proc. IEEE Sept 75:
  – design should be public
  – default should be no access
  – check for current authority
  – give each process minimum possible authority
  – mechanisms should be simple, uniform and built in to lowest layers
  – should be psychologically acceptable
  – cost of circumvention should be high
  – minimize shared access

Operating Systems II — Protection 98

Authentication of User to System (1)

Passwords currently widely used:
• want a long sequence of random characters issued by system, but user would write it down.
• if allow user selection, they will use dictionary words, e.g. their name, car registration, etc.
• best bet probably is to encourage the use of an algorithm to remember password.
• other top tips:
  – don't reflect on terminal, or overprint
  – add delay after failed attempt
  – use encryption if line suspect
• what about security of password file?
  – only accessible to login program (CAP, TITAN)
  – hold scrambled, e.g. UNIX:
    ∗ only need to write protect file
    ∗ need irreversible function (without password)
    ∗ maybe 'one-way' function
    ∗ however, off line attack possible ⇒ really should use shadow passwords

Operating Systems II — Protection 99

Authentication of User to System (2)

E.g. passwords in UNIX:
• simple for user to remember: arachnid
• sensible user applies an algorithm: !r!chn#d
• password is DES-encrypted 25 times using a 2-byte per-user 'salt' to produce an 11 byte string
• salt followed by these 11 bytes are then stored: IML.DVMcz6Sh2

Really require unforgeable evidence of identity that system can check:
• enhanced password: challenge-response
• id card inserted into slot
• fingerprint, voiceprint, face recognition
• smart cards

Operating Systems II — Protection 100
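The salt-plus-one-way-function scheme can be sketched as below. Real crypt(3) iterated a modified DES 25 times; this sketch substitutes SHA-256 from Python's standard library purely to show the structure (salt stored in the clear alongside the digest, so the same password hashes differently for different users).

```python
import hashlib
import os

def make_entry(password):
    """Build a password-file entry: hex salt, '$', then the digest of
    salt + password (SHA-256 standing in for iterated DES)."""
    salt = os.urandom(2)                     # 2-byte per-user salt
    digest = hashlib.sha256(salt + password.encode()).hexdigest()
    return salt.hex() + "$" + digest

def check(password, entry):
    """Re-derive the digest with the stored salt; no decryption needed."""
    salt_hex, digest = entry.split("$")
    salt = bytes.fromhex(salt_hex)
    return hashlib.sha256(salt + password.encode()).hexdigest() == digest
```

The salt defeats precomputed dictionary tables shared across users, but — as the slide notes — a readable password file still permits an off-line guessing attack, hence shadow passwords.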

Authentication of System to User

User wants to avoid:
• talking to the right computer, but not the login program, or
• talking to wrong computer.

Partial solution in old days for directly wired terminals:
• always do a terminal attention before trying login.

But, today micros used as terminals ⇒
• local software may have been changed
• make login character same as terminal attention
• but hardware / firmware in public machine may have been modified

Anyway, still have the problem of comms lines:
• wiretapping is easy
• workstation can often see all packets on network
• so carry your own copy of the terminal program
⇒ must use encryption of some kind, and must trust encryption device (e.g. a smart card)

Operating Systems II — Protection 101

Mutual suspicion

• We need to encourage lots and lots of suspicion:
  – system of user
  – users of each other
  – user of system
• Called programs should be suspicious of caller (e.g. OS calls always need to check parameters).
• Caller should be suspicious of called program.
• e.g. Trojan horse:
  – a 'useful' looking program — a game perhaps
  – when called by user (in many systems) inherits all of the user's privileges
  – it can then copy files, modify files, change password, send mail, etc. . . .
  – e.g. Multics editor trojan horse: copied files as well as edited.
• e.g. Virus:
  – often starts off as Trojan horse
  – self-replicating (e.g. ILOVEYOU)

Operating Systems II — Protection 102

Access matrix

Access matrix is a matrix of subjects against objects.
Subject (or principal) might be:
• users, e.g. by uid
• executing process in a protection domain
• sets of users or processes

Objects are things like:
• files
• devices
• domains / processes
• message ports (in microkernels)

Matrix is large and sparse ⇒ don't want to store it all.
Two common representations:
1. by object: store list of subjects and rights with each object ⇒ access control list
2. by subject: store list of objects and rights with each subject ⇒ capabilities

Operating Systems II — Protection 103

Access Control Lists

Often used in storage systems:
• system naming scheme provides for ACL to be inserted in naming path, e.g. files.
• if ACLs stored on disk, check is made in software ⇒ must only use on low duty cycle.
• for higher duty cycle must cache results of check.
• e.g. Multics: open file = memory segment. On first reference to segment:
  1. interrupt (segment fault)
  2. check ACL
  3. set up segment descriptor in segment table
• most systems check ACL:
  – when file opened for read or write
  – when code file is to be executed
• access control by program, e.g. Unix:
  – exam prog: RWX by examiner, X by student.
  – data file: RW by examiner, A by exam program.
• allows arbitrary policies. . .

Operating Systems II — Protection 104
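Storing the sparse access matrix column-wise gives one small rights list per object; a minimal sketch using the slide's exam example, with single-letter codes for the rights (the dict layout is an assumption for illustration):

```python
# One ACL per object: {object: {subject: rights}}; r/w/x/a = read,
# write, execute, append. The exam program itself acts as a subject
# when it appends results to the data file.
acls = {
    "exam_prog": {"examiner": "rwx", "student": "x"},
    "data_file": {"examiner": "rw", "exam_prog": "a"},
}

def allowed(subject, obj, right):
    """Default is no access: unlisted subjects get the empty rights string."""
    return right in acls.get(obj, {}).get(subject, "")
```

Note how the policy is arbitrary: the student may run the exam program but never touch the data file directly, while the program may only append to it — exactly the kind of rule a per-object list expresses cheaply.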

Capabilities

Capabilities associated with active subjects, so:
• store in address space of subject.
• must make sure subject can't forge capabilities.
• easily accessible to hardware.
• can be used with high duty cycle, e.g. as part of addressing hardware:
  – Plessey PP250
  – CAP I, II, III
  – IBM system/38
  – Intel iAPX432
• have special machine instructions to modify (restrict) capabilities.
• support passing of capabilities on procedure (program) call.

Can also use software capabilities:
• checked by encryption
• nice for distributed systems

Operating Systems II — Protection 105

Password Capabilities

• Capabilities nice for distributed systems but:
– messy for application, and
– revocation is tricky.
• Could use timeouts (e.g. Amoeba).
• Alternatively: combine passwords and capabilities.
• Store ACL with object, but key it on capability (not implicit concept of “principal” from OS).
• Advantages:
– revocation possible
– multiple “roles” available.
• Disadvantages:
– still messy (use ‘implicit’ cache?).

Operating Systems II — Protection 106

Covert Channels

Information leakage by side-effects: lots of fun!

At the hardware level:
• wire tapping
• monitor signals in machine
• modification to hardware
• electromagnetic radiation of devices

By software:
• leak a bit stream as, e.g.:
– file exists / no file → 1 / 0
– page fault / no page fault → 1 / 0
– compute a while / sleep for a while → 1 / 0
• system may provide statistics, e.g. TENEX password cracker using system-provided count of page faults

In general, guarding against covert channels is prohibitively expensive.
(only usually a consideration for military types)

Operating Systems II — Protection 107
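The bit-stream encodings above can be made concrete with a deliberately simplified model of the compute/sleep channel (no real timing or noise, which is where the practical difficulty actually lies):

```python
def encode(bits):
    # sender: 1 -> "compute a while", 0 -> "sleep for a while"
    return ["compute" if b else "sleep" for b in bits]

def decode(activity):
    # receiver: classify each interval by the CPU activity it observed
    return [1 if a == "compute" else 0 for a in activity]

secret = [1, 0, 1, 1, 0]
assert decode(encode(secret)) == secret
```

A real receiver would infer "compute" vs "sleep" indirectly, e.g. from its own scheduling latency, which is exactly why such channels are hard to close.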

Summary & Outlook

• An operating system must:
1. securely multiplex resources,
2. provide / allow abstractions.
• Major aspect of OS design is choosing trade-offs:
– e.g. protection vs. performance
– e.g. prettiness vs. power, portability
• Future systems bring new challenges:
– scalability (multi-processing/computing)
– reliability (computing infrastructure)
– ubiquity (heterogeneity/security)
• Lots of work remains. . .

Exams 2005:
• 2 Qs, one each in Papers 3 & 4
• (not mandatory: P3 and P4 both x-of-y)
• Qs generally book-work plus ‘decider’

Operating Systems II — Conclusion 108

1999: Paper 3 Question 8

FIFO, LRU and Clock are three page replacement algorithms.
a) Briefly describe the operation of each algorithm. [6 marks]
b) The Clock strategy assumes some hardware support. What could you do to allow the use of Clock if this hardware support were not present? [2 marks]
c) Assuming good temporal locality of reference, which of the above three algorithms would you choose to use within an operating system? Why would you not use the other schemes? [2 marks]
What is a buffer cache? Explain why one is used, and how it works. [6 marks]
Give two reasons why the buffering requirements for network data are different from those for file systems. [2 marks]
Which buffer cache replacement strategy would you choose to use within an operating system? Justify your answer. [2 marks]

Operating Systems II — Past Examination Questions 109
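As a worked aid for parts (a) and (b), here is a minimal sketch of the Clock (second-chance) algorithm; the frame count and reference string are arbitrary. If the hardware supplies no referenced bit, the OS can emulate one by protecting pages and setting the bit on the resulting fault:

```python
class Clock:
    def __init__(self, nframes):
        self.frames = [None] * nframes   # page held by each frame
        self.ref = [False] * nframes     # per-frame referenced bit
        self.hand = 0

    def access(self, page):
        if page in self.frames:                  # hit: mark referenced
            self.ref[self.frames.index(page)] = True
            return
        while self.ref[self.hand]:               # give second chances
            self.ref[self.hand] = False
            self.hand = (self.hand + 1) % len(self.frames)
        self.frames[self.hand] = page            # evict victim, install page
        self.ref[self.hand] = True
        self.hand = (self.hand + 1) % len(self.frames)

c = Clock(3)
for p in [1, 2, 3, 1, 4]:
    c.access(p)
assert c.frames == [4, 2, 3]   # all bits were set, so the hand swept to frame 0
```

Note Clock approximates LRU at far lower cost, which is the usual answer to part (c).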

1999: Paper 4 Question 7

The following are three ways which a file system may use to determine which disk blocks make up a given file:
a) chaining in a map
b) tables of pointers
c) extent lists
Briefly describe how each scheme works. [3 marks each]
Describe the benefits and drawbacks of using scheme (c). [6 marks]
You are part of a team designing a distributed filing system which replicates files for performance and fault-tolerance reasons. It is required that rights to a given file can be revoked within T milliseconds (T ≥ 0). Describe how you would achieve this, commenting on how the value of T would influence your decision. [5 marks]

Operating Systems II — Past Examination Questions 110
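A minimal sketch of scheme (c): with extent lists a file is described by (start block, run length) pairs, and a logical block number is resolved by walking the runs. The extents below are invented for illustration:

```python
extents = [(100, 3), (220, 2), (500, 4)]   # (first disk block, run length)

def block_of(extents, logical):
    for start, length in extents:
        if logical < length:
            return start + logical   # falls inside this run
        logical -= length            # skip past this run
    raise IndexError("offset beyond end of file")

assert block_of(extents, 0) == 100
assert block_of(extents, 3) == 220   # first block of the second extent
assert block_of(extents, 8) == 503
```

The lookup is O(number of extents), which is tiny for contiguous files but degrades as the file fragments, one of the trade-offs the question is after.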

2000: Paper 3 Question 7

Why are the scheduling algorithms used in general purpose operating systems such as Unix and Windows NT not suitable for real-time systems? [4 marks]
Rate monotonic (RM) and earliest deadline first (EDF) are two popular scheduling algorithms for real-time systems. Describe these algorithms, illustrating your answer by showing how each of them would schedule the following task set. You may assume that context switches are instantaneous. [8 marks]

Task  Requires Exactly  Every
A     2ms               10ms
B     1ms               4ms
C     1ms               5ms

Exhibit a task set which is schedulable under EDF but not under RM. You should demonstrate that this is the case, and explain why. Hint: consider the relationship between task periods. [8 marks]

Operating Systems II — Past Examination Questions 111
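The question's task set can be sanity-checked with the standard utilisation tests: EDF schedules any independent periodic set with U ≤ 1, while the Liu & Layland bound n(2^(1/n) − 1) is a sufficient (not necessary) condition for RM:

```python
tasks = [(2, 10), (1, 4), (1, 5)]   # (compute time, period) for A, B, C

U = sum(c / t for c, t in tasks)    # total utilisation
n = len(tasks)
rm_bound = n * (2 ** (1 / n) - 1)   # Liu & Layland sufficient bound for RM

assert abs(U - 0.65) < 1e-9
assert U <= 1            # EDF can schedule this set
assert U <= rm_bound     # RM also passes the sufficient test here
```

For the last part of the question, the hint points at task sets whose periods are not harmonic and whose utilisation lies between the RM bound and 1: EDF still copes, RM can miss deadlines.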

2000: Paper 4 Question 7

Why is it important for an operating system to schedule disk requests? [4 marks]
Briefly describe each of the SSTF, SCAN and C-SCAN disk scheduling algorithms. Which problem with SSTF does SCAN seek to overcome? Which problem with SCAN does C-SCAN seek to overcome? [5 marks]
Consider a Winchester-style hard disk with 100 cylinders, 4 double-sided platters and 25 sectors per track. The disk arm is currently at cylinder 10, moving towards 100. The following is the (time-ordered) sequence of requests for disk sectors: { 3518, 1846, 8924, 6672, 1590, 4126, 107, 9750, 158, 6621, 446, 11 }. For each of SSTF, SCAN and C-SCAN, give the order in which the above requests would be serviced. [3 marks]
Which factors do the above disk arm scheduling algorithms ignore? How could these be taken into account? [4 marks]
Discuss ways in which an operating system can construct logical volumes which are (a) more reliable and (b) higher performance than the underlying hardware. [4 marks]

Operating Systems II — Past Examination Questions 112
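One way to work this example: with 4 double-sided platters (8 surfaces) and 25 sectors per track, each of the 100 cylinders holds 200 sectors, so cylinder = sector // 200. Under that assumption the three service orders can be computed as a sketch:

```python
requests = [3518, 1846, 8924, 6672, 1590, 4126, 107, 9750, 158, 6621, 446, 11]
cyls = [s // 200 for s in requests]   # 200 sectors per cylinder (8 * 25)
head = 10                             # arm at cylinder 10, moving upwards

def sstf(pending, head):
    order, pending = [], list(pending)
    while pending:                    # always pick the nearest request
        nxt = min(pending, key=lambda c: abs(c - head))
        pending.remove(nxt)
        order.append(nxt)
        head = nxt
    return order

def scan(pending, head):              # sweep up, then reverse and sweep down
    up = sorted(c for c in pending if c >= head)
    down = sorted((c for c in pending if c < head), reverse=True)
    return up + down

def cscan(pending, head):             # sweep up, jump back to cylinder 0
    up = sorted(c for c in pending if c >= head)
    wrapped = sorted(c for c in pending if c < head)
    return up + wrapped

assert scan(cyls, head) == [17, 20, 33, 33, 44, 48, 9, 7, 2, 0, 0, 0]
assert cscan(cyls, head) == [17, 20, 33, 33, 44, 48, 0, 0, 0, 2, 7, 9]
assert sstf(cyls, head) == [9, 7, 2, 0, 0, 0, 17, 20, 33, 33, 44, 48]
```

Note SSTF immediately turns back for cylinder 9 and strands the upper requests, which is exactly the starvation tendency SCAN fixes.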

2001: Paper 3 Question 7

What are the key issues with scheduling for shared-memory multiprocessors? [3 marks]
Processor affinity, take scheduling and gang scheduling are three techniques used within multiprocessor operating systems.
a) Briefly describe the operation of each. [6 marks]
b) Which problem does the processor affinity technique seek to overcome? [2 marks]
c) What problem does the processor affinity technique suffer from, and how could this problem in turn be overcome? [2 marks]
d) In which circumstances is a gang scheduling approach most appropriate? [2 marks]
What additional issues does the virtual memory management system have to address when dealing with shared-memory multiprocessor systems? [5 marks]

Operating Systems II — Past Examination Questions 113

2001: Paper 4 Question 7

In the context of virtual memory management:
a) What is demand paging? How is it implemented? [4 marks]
b) What is meant by temporal locality of reference? [2 marks]
c) How does the assumption of temporal locality of reference influence page replacement decisions? Illustrate your answer by briefly describing an appropriate page replacement algorithm or algorithms. [3 marks]
d) What is meant by spatial locality of reference? [2 marks]
e) In what ways does the assumption of spatial locality of reference influence the design of the virtual memory system? [3 marks]
A student suggests that the virtual memory system should really deal with “objects” or “procedures” rather than with pages. Make arguments both for and against this suggestion. [4 and 2 marks respectively]

Operating Systems II — Past Examination Questions 114

2003: Paper 3 Question 5

Modern server-class machines often use a Redundant Array of Inexpensive Disks (RAID) to provide non-volatile storage.
a) What is the basic motivation behind this? [2 marks]
b) Describe RAID level 0. What are the benefits and drawbacks of this scheme? [3 marks]
c) Describe RAID level 1. What are the benefits and drawbacks of this scheme? [3 marks]
d) Compare and contrast RAID levels 3, 4 & 5. What problem(s) with the former pair does the last hope to avoid? [6 marks]
A server machine has k identical high-performance IDE disks attached to independent IDE controllers. You are asked to write operating system software to treat these disks as a RAID level 5 array containing a single file-system. Your software will include routines to read filesystem data, write filesystem data, and to schedule these read and write requests. What difficulties arise here and how may they be addressed? [6 marks]

Operating Systems II — Past Examination Questions 115
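The core of RAID level 5 (and the source of its small-write difficulties) is stripe parity: the parity block is the XOR of the stripe's data blocks, so any single lost block can be rebuilt from the survivors. A toy illustration with invented block contents (parity rotation across disks not shown):

```python
from functools import reduce

stripe = [b"\x01\x02", b"\x0f\x00", b"\xa0\x55"]   # data blocks, one per disk

def xor_blocks(blocks):
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

parity = xor_blocks(stripe)   # stored on the (rotating) parity disk

# The disk holding stripe[1] fails: rebuild its block from the survivors.
assert xor_blocks([stripe[0], stripe[2], parity]) == stripe[1]
```

The same identity explains the write penalty: updating one data block requires reading the old data and old parity to recompute the new parity before both can be written.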

2003: Paper 4 Question 5

Describe the basic operation of a log-structured file system. What are the potential benefits? What are the problems? [8 marks]
Several modern file systems make use of journalling. Describe how a journal is used by the file system, and the situations in which it is beneficial. [6 marks]
You are assigned the task of designing a tape backup strategy for an important file server. The goal is to maximize the time any file is held on backup while minimizing the number of tapes required. Sketch your strategy, commenting on the support you require from the file system, and justifying your design decisions. [6 marks]

Operating Systems II — Past Examination Questions 116
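A minimal model (the record format is invented) of how a journal aids crash recovery: updates are appended to the log before the main structures are touched, and on recovery only transactions whose commit record reached the log are replayed:

```python
journal = [
    ("begin", 1), ("write", 1, "inode", "v2"), ("commit", 1),
    ("begin", 2), ("write", 2, "bitmap", "v9"),   # crash: no commit for tx 2
]

def replay(journal):
    committed = {rec[1] for rec in journal if rec[0] == "commit"}
    state = {}
    for rec in journal:
        if rec[0] == "write" and rec[1] in committed:
            _, _, block, value = rec
            state[block] = value                   # redo committed writes only
    return state

assert replay(journal) == {"inode": "v2"}   # uncommitted tx 2 is discarded
```

This is why recovery time depends on journal length rather than file-system size, the key benefit over a full fsck.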

2004: Paper 3 Question 5

(a) What problem do real-time scheduling algorithms try to solve? [2 marks]
(b) Describe one static priority and one dynamic priority real-time scheduling algorithm. You should discuss the issue of admission control, and comment on the data structures that an implementation would need to maintain and on how these would be used to make scheduling decisions. [8 marks]
(c) A designer of a real-time system wishes to have concurrently executing tasks share a data structure protected by a mutual exclusion lock.
(i) What scheduling problem could arise here? [2 marks]
(ii) How could this problem be overcome? [2 marks]
(d) The designer also wishes the real-time system to use demand paged virtual memory for efficiency. What problems could arise here, and how could they be overcome? [6 marks]

Operating Systems II — Past Examination Questions 117

2004: Paper 4 Question 5

(a) Most conventional hardware translates virtual addresses to physical addresses using multi-level page tables (MPTs):
(i) Describe with the aid of a diagram how translation is performed when using MPTs. [3 marks]
(ii) What problem(s) with MPTs do linear page tables attempt to overcome? How is this achieved? [3 marks]
(iii) What problem(s) with MPTs do inverted page tables (IPTs) attempt to overcome? How is this achieved? [3 marks]
(iv) What problem(s) with IPTs do hashed page tables attempt to overcome? How is this achieved? [3 marks]
(b) Operating systems often cache part of the contents of the disk(s) in main memory to speed up access. Compare and contrast the way in which this is achieved in (i) 4.3BSD Unix and (ii) Windows 2000. [8 marks]

Operating Systems II — Past Examination Questions 118
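For part (a)(i), here is a sketch of a two-level MPT walk using an assumed 10/10/12 split of a 32-bit virtual address (as in classic x86); the dictionaries are toy stand-ins for the in-memory page directory and page tables:

```python
PD_SHIFT, PT_SHIFT = 22, 12
PAGE_MASK = (1 << 12) - 1   # 4 KB pages

page_dir = {1: {2: 0x5000}}   # dir index -> page table; table index -> frame base

def translate(vaddr):
    pd_i = vaddr >> PD_SHIFT                  # top 10 bits: page directory index
    pt_i = (vaddr >> PT_SHIFT) & 0x3FF        # next 10 bits: page table index
    table = page_dir.get(pd_i)
    if table is None or pt_i not in table:
        raise KeyError("page fault")          # missing mapping at either level
    return table[pt_i] | (vaddr & PAGE_MASK)  # frame base + page offset

vaddr = (1 << 22) | (2 << 12) | 0x123
assert translate(vaddr) == 0x5123
```

The sparseness this exploits (only populated directory slots need page tables) is the contrast point with linear and inverted tables in parts (ii)-(iv).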

2005: Paper 3 Question 5

(a) Modern operating systems typically support both threads and processes. What is the basic difference between a thread and a process? Why do operating systems support both concepts? [2 marks]
(b) You get a summer job with a company which has an in-house operating system called sOs. sOs uses static priority scheduling, supports at most 32 concurrently-executing processes, and only works on uniprocessor machines. Describe with justification how you would modify sOs in order to:
(i) support up to 65536 concurrently executing processes; [2 marks]
(ii) reduce or eliminate the possibility of starvation; [3 marks]
(iii) efficiently schedule processes on an 8 CPU symmetric multiprocessor (SMP) machine; [5 marks]
(iv) support threads in addition to processes on SMP machines. [3 marks]
(c) How would you go about reducing the time taken to boot a modern operating system? [5 marks]

Operating Systems II — Past Examination Questions 119

2005: Paper 4 Question 5

(a) Scheduling disk requests is important to reduce the average service time. Describe briefly how the SSTF, SCAN and C-SCAN scheduling algorithms work. [3 marks]
(b) Recently there have been proposals that 2-D disk scheduling algorithms should be used. What is the basic idea behind 2-D disk scheduling? [2 marks]
(c) You are asked to develop support for 2-D disk scheduling in a commodity operating system.
(i) What would be the major difficulty that you face? [2 marks]
(ii) Sketch a design for how you would go about overcoming this difficulty. [8 marks]
(d) Several modern file-systems and databases make use of a journal to aid in crash recovery.
(i) Briefly describe how journalling helps crash recovery. [2 marks]
(ii) A researcher suggests adding 128 MB of NVRAM to a disk drive and using this to store the journal. Discuss the advantages and disadvantages of this approach, and comment on how well you think the resulting system would work. [3 marks]

Operating Systems II — Past Examination Questions 120
