UNIX INTERNALS
Ms. Radha Senthilkumar, Lecturer
Department of IT
MIT, Chromepet
Anna University, Chennai.
CS502 / IT524 OVERVIEW

General Overview of the System
Topics
History
System Structure
User Perspective
Operating System Services
Assumptions about Hardware
Reference:
The Design of the UNIX Operating System by Maurice J. Bach

A Little History First: UNIX
Initial design by Ken Thompson, Dennis Ritchie and others at AT&T's Bell Telephone Laboratories (BTL) in 1969: 32 years ago!
AT&T made the source available to universities for research and educational use.
In 1973 UNIX was rewritten in C, resulting in Version 4.
The C language was itself originally designed and developed by Dennis Ritchie for use on the UNIX system.
C evolved from B, which was developed by Thompson.

UNIX History
AT&T sold UNIX to Novell; Novell passed the UNIX trademark to X/Open and sold the source code to the Santa Cruz Operation (SCO).
Plan 9 is AT&T's successor to UNIX.
AT&T was unable to market UNIX as a product, so it made the source code available to universities for use in research and education.
Influential variant: Berkeley Software Distribution (BSD), distributed by the Computer Systems Research Group, University of California at Berkeley.
Berkeley obtained UNIX from AT&T in December of 1974.

UNIX History
UNIX was ported to many different architectures.
Microsoft and SCO collaborated to port UNIX to the Intel 8086 architecture: XENIX.
AT&T purchased 20% of Sun; the result was a joint effort to develop SVR4.
In 1982 AT&T was broken up and was now able to market UNIX. It released System III in 1982 and System V the following year.
System V UNIX introduced virtual memory (different from BSD, called regions), IPC (shared memory, semaphores, message queues), remote file sharing, shared libraries and STREAMS.

BSD UNIX
2BSD: text editor vi
3BSD: demand-paged virtual memory
4.0BSD: performance improvements
4.1BSD: job control, autoconfiguration
4.2/4.3BSD: reliable signals, fast file system, improved networking (TCP/IP reference implementation), sophisticated IPC primitives
4.4BSD: stackable and extensible vnode interface, network file system, log-structured file system, other file systems, POSIX support, and other enhancements

Why UNIX
Historical significance
Advanced features developed for or ported to UNIX
Availability of source code and research papers
Highlights key design and architectural issues

CMU and Mach
As UNIX grew, the kernel, originally small and elegant, became large and complex.
In the mid-1980s, CMU researchers began development of a new OS: a microkernel providing a small set of essential services.
Supports the UNIX API
Supports uniprocessors and multiprocessors
Supports distributed environments
A collection of servers provides the necessary functionality
New source, so not encumbered by AT&T licenses
OSF/1 and NeXTSTEP are based on Mach

Functions of an OS
Resource management
Time management (temporal properties): CPU and disk transfer scheduling
Space management: main and secondary storage allocation
Synchronization and deadlock handling
Accounting and status information

Functions of an OS (cont.)
User environment: the OS layer transforms the bare hardware machine into higher-level abstractions
Execution environment: process management, file manipulation, interrupt handling, I/O operations, language
Error detection and handling
Protection and security
Fault tolerance and failure recovery

Design Approaches
Deal with the complexities of modern systems
Separation of policies and mechanisms
Policies: what should be done
Mechanisms: how it should be done
Levin, R., E. Cohen, W. Corwin, F. Pollack and W. Wulf, "Policy/Mechanism Separation in HYDRA," Proceedings of the 5th Symposium on Operating Systems Principles, 1975, pp. 132-140.
Three common approaches:
Layered Approach
Kernel Approach
Virtual Machine Approach

Layered Approach
Level  Name                    Objects                      Example operations
13     Shell                   User programming env.        Bash statements
12     User process            User process                 Quit, kill, suspend, resume
11     Directories             Directories                  Create, destroy, attach, list
10     Devices                 External: printer, display   Create, open, close
9      File system             Files                        Create, open, close
8      Communications          Pipes                        Create, open, close
7      Virtual memory          Segments, pages              Read, write, fetch
6      Local secondary store   Blocks, channel              Read, write, fetch
5      Primitive process       Process, semaphore           Suspend, resume, wait
4      Interrupts              Interrupt handlers           Invoke, mask, retry
3      Procedures              Procedure, stack, display    Mark stack, call, return
2      Instruction set         Evaluation stack             Load, store, add
1      Electronic circuit      Registers, gates, buses      Clear, transfer, activate
(Side labels in the original figure group the levels under "HW", "resource" and "environment".)
Simplifies design, implementation and testing
Modular: divides the OS into functional layers

Virtual Machine Approach
Virtual software layer over the hardware
Illusion of multiple instances of the hardware
Supports multiple instances of OSs
Figure (top to bottom): VM1, VM2, VM3, VM4 / Virtual machine software / Hardware

References
Layered:
Dijkstra, E. W., "The Structure of THE Multiprogramming System", Communications of the ACM, May 1968, pp. 341-346.
Layered (Ring):
Organick, E., The Multics System, MIT Press, Cambridge, MA, 1972.
Kernel:
Brinch Hansen, P., "The Nucleus of a Multiprogramming System", Communications of the ACM, Apr. 1970, pp. 238-241.
Wulf, W., E. Cohen, W. Corwin, A. Jones, R. Levin, C. Pierson, and F. Pollack, "HYDRA: The Kernel of a Multiprocessor Operating System", Communications of the ACM, June 1974, pp. 337-345.
Virtual:
Seawright, L., and R. MacKinnon, "VM/370 - A Study of Multiplicity and Usefulness", IBM Systems Journal, 1979, pp. 4-17.

System Structure
The UNIX system is functionally organized at three levels:
The kernel, which schedules tasks and manages storage;
The shell, which connects and interprets users' commands, calls programs from memory, and executes them; and
The tools and applications that offer additional functionality to the operating system.
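
System Structure
Figure: concentric layers with the hardware at the center, the kernel around it, and programs such as sh, date, vi, who, wc, a.out and cc, together with other application programs, in the outer layers.
There are 64 system calls in System V; 32 are used more frequently.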

System Structure: The Kernel
The kernel
The heart of the operating system, the kernel controls the hardware and turns parts of the system on and off at the programmer's command.
Originally found in /usr/sys, and composed of several sub-components:
conf: originally found in /usr/sys/conf, and composed of configuration and machine-dependent parts, often including boot code
dev: device drivers (originally /usr/sys/dev) for control of hardware (and sometimes pseudo-hardware)
sys: the "kernel" of the operating system, handling memory management, system calls, etc.
h (or include): header files, generally defining key interfaces within the system, and important system-specific invariables

System Structure
Commands: most Unix implementations make little distinction between commands (user-level programs) for system operation and maintenance (e.g. cron) and commands of general utility (e.g. grep).
Some major categories are:
sh: the shell, the primary user interface on Unix and the center of the command environment.
Utilities: the core of the Unix command set, including ls, grep, find and many others. This category can be subcategorized:
System utilities, such as mkfs, fsck, and many others; and
User utilities, such as passwd, kill, and others.
Document
Communications

Development Environment
Most implementations of Unix contained a development environment sufficient to recreate the system from source code.
The development environment included:
cc: the C language compiler (first appearing in V3 Unix)
as: the machine-language assembler for the machine
ld: the linker, for combining object files
lib: libraries; originally libc, the system library
make: the build manager (designed to effectively automate the build process)

User Perspective: File System
The UNIX file system is characterized by
A hierarchical structure
Consistent treatment of file data
The ability to create and delete files
Dynamic growth of files
The protection of file data
The treatment of peripheral devices as files
The file system is organized as a tree with a single root node called root. Every non-leaf node of the file system structure is a directory of files, and the files at the leaf nodes of the tree are directories, regular files or special device files.
The name of a file is given by a path name that describes how to locate the file in the file system hierarchy.

Sample file system tree
The path names /etc/passwd, /bin/who, and /usr/src/cmd/who.c designate files in the tree, but /bin/passwd and /usr/src/date.c do not.
A path name does not have to start from root; it can be given relative to the current directory. For example, starting from the directory /dev, the path name tty01 designates the file /dev/tty01.

User Perspective: File System
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

char buffer[2048];
int version = 1;

void copy(int old, int new);

int main(int argc, char *argv[])
{
    int fdold, fdnew;

    if (argc != 3)
    {
        printf("need 2 arguments for copy program\n");
        exit(1);
    }
    fdold = open(argv[1], O_RDONLY);    /* open the source file read-only */
    if (fdold == -1)
    {
        printf("cannot open the file %s\n", argv[1]);
        exit(1);
    }
    fdnew = creat(argv[2], 0666);       /* create the target, rw for all */
    if (fdnew == -1)
    {
        printf("cannot create file %s\n", argv[2]);
        exit(1);
    }
    copy(fdold, fdnew);
    exit(0);
}

/* read from old and write to new until end of file */
void copy(int old, int new)
{
    int count;

    while ((count = read(old, buffer, sizeof(buffer))) > 0)
        write(new, buffer, count);
}
Program to copy a file

User Perspective: Processing Environment
A program is an executable file, and a process is an instance of the program in execution.
Process control system calls:
fork
exec
wait
exit
The shell can execute a command synchronously or asynchronously.

User Perspective: Processing Environment
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    if (argc != 3)                      /* source file and target file */
        return 1;
    if (fork() == 0)                    /* child: run the copy program */
        execl("copy", "copy", argv[1], argv[2], (char *) 0);
    wait((int *) 0);                    /* parent: wait for the child */
    printf("copy done\n");
    return 0;
}
Program that creates a new process to copy a file
The returned value of fork():
If fork() returns a negative value, the creation of a child process was unsuccessful.
fork() returns zero to the newly created child process.
fork() returns a positive value, the process ID of the child process, to the parent.
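The three return values of fork() can be handled explicitly, as in the following minimal sketch. The helper names run_in_child and exit_with_seven are hypothetical, introduced only for illustration; the calls themselves (fork, waitpid, _exit) are standard POSIX.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical sample job for the child: its return value becomes the exit status. */
int exit_with_seven(void)
{
    return 7;
}

/* Hypothetical helper: runs work() in a child process and returns the
 * child's exit status, or -1 if fork/wait failed or the child died abnormally. */
int run_in_child(int (*work)(void))
{
    int status;
    pid_t pid = fork();

    if (pid < 0)                 /* negative: no child was created */
        return -1;
    if (pid == 0)                /* zero: we are the child */
        _exit(work());
    /* positive: we are the parent and pid is the child's process ID */
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling run_in_child(exit_with_seven) from a parent process should yield 7, the status the child passed to _exit.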

User Perspective: Processing Environment
The shell is the command interpreter program that users typically execute after logging into the system.
The shell usually executes a command synchronously, e.g.
who
The shell can also execute a command asynchronously, i.e. in the background:
who &
Because the shell is a user program and not part of the kernel, it is easy to modify it and tailor it to a particular environment.

User Perspective: Building Block Primitives
The philosophy of the UNIX system is to provide operating system primitives that enable users to write small, modular programs that can be used as building blocks to build more complex programs.
Redirect I/O
ls > output
mail aravind < letter
nroff -mm < doc1 > doc1.out 2> errors
Pipe
grep main a.c b.c c.c
grep main a.c b.c c.c | wc -l
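The pipe primitive behind the "|" operator can be sketched in C as follows. This is an illustration under stated assumptions, not how any particular shell is written: the helper name count_lines_through_pipe is hypothetical, the child plays the role of the producer (like grep), and the parent plays the role of wc -l by counting newlines instead of exec'ing a real program.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: the child writes text into a pipe, the parent reads
 * the other end and counts lines. Returns the line count, or -1 on error. */
int count_lines_through_pipe(const char *text)
{
    int fd[2];
    char buf[256];
    ssize_t n;
    int lines = 0;
    pid_t pid;

    if (pipe(fd) == -1)
        return -1;
    pid = fork();
    if (pid < 0) {
        close(fd[0]);
        close(fd[1]);
        return -1;
    }
    if (pid == 0) {                      /* child: producer */
        const char *p = text;
        size_t left = strlen(text);
        close(fd[0]);                    /* child does not read */
        while (left > 0) {
            ssize_t w = write(fd[1], p, left);
            if (w <= 0)
                _exit(1);
            p += w;
            left -= (size_t) w;
        }
        close(fd[1]);
        _exit(0);
    }
    close(fd[1]);                        /* parent does not write */
    while ((n = read(fd[0], buf, sizeof(buf))) > 0)
        for (ssize_t i = 0; i < n; i++)
            if (buf[i] == '\n')
                lines++;
    close(fd[0]);
    waitpid(pid, NULL, 0);
    return lines;
}
```

A real shell would additionally use dup2() to attach the pipe ends to standard output and standard input before calling exec in each child.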

Operating System Services
Controlling the execution of processes by allowing their creation, termination or suspension, and communication.
Scheduling processes fairly for execution on the CPU.
Allocating main memory for an executing process.
Allocating secondary memory for efficient storage and retrieval of user data.
Allowing processes controlled access to peripheral devices such as terminals, tape drives, disk drives, and network devices.

Assumptions about Hardware
Two levels of execution
User
Processes at this level can access their own instructions and data, but not kernel instructions and data.
Kernel
Can access both kernel and user addresses.
Interrupts and Exceptions
An exception refers to an unexpected event caused by a process, such as addressing illegal memory, executing a privileged instruction, dividing by zero, and so on.
Interrupts are caused by events that are external to a process.
Exceptions happen in the middle of the execution of an instruction.
Interrupts happen between the execution of two instructions.

Assumptions about Hardware
Processor execution level
The kernel must prevent the occurrence of interrupts during critical activity; otherwise an interrupt could corrupt the data.
Memory management
The kernel permanently resides in main memory, as does the currently executing process.
Virtual addresses are mapped to physical addresses.
Typical interrupt levels, from higher priority to lower priority:
Machine errors
Clock
Disk
Network devices
Terminals
Software interrupts

Traditional UNIX Kernel
Bloated kernel
Inflexible: supported a single type of
file system,
process scheduling,
executable file format
Figure: a monolithic kernel containing the file system, virtual memory, loader, block device and character device components.

Modern UNIX
Separation of policy and mechanism
Modular design and implementation (layered)

References
Original UNIX implementation:
D. M. Ritchie and K. Thompson, "The UNIX Time-Sharing System", Communications of the ACM, Vol. 17, No. 7, Jul. 1974, pp. 365-375.

Introduction to the Kernel
Topics
Kernel Architecture
File System
Process
Reference:
The Design of the UNIX Operating System by Maurice J. Bach

Kernel Architecture (UNIX)
Figure: block diagram of the kernel, top to bottom.
User level: user programs, libraries.
Kernel level: system call interface; file subsystem with the buffer cache and the character and block device drivers; process control subsystem (inter-process communication, scheduler, memory management); hardware control.
Hardware level: hardware.

Kernel Architecture (cont.)
The libraries map system calls to the primitives needed to enter the OS.
Assembly language programs may invoke system calls directly, without a system call library.
The libraries are linked with programs at compile time and are thus part of the user program.
The file subsystem manages files: allocating file space, administering free space, controlling access to files, and retrieving data for users.

Kernel Architecture (cont.)
Processes interact with the file subsystem via a specific set of system calls, such as open, close, read, write, chown and chmod.
The file subsystem accesses file data using a buffering mechanism that regulates data flow between the kernel and secondary storage devices.
Block I/O device drivers
Raw data I/O device drivers

Kernel Architecture (cont.)
The process control subsystem is responsible for
Process synchronization
Inter-process communication
Memory management
Process scheduling
System calls for controlling processes:
fork
exec
exit
wait
brk (controls the size of memory allocated to a process)
signal

Kernel Architecture (cont.)
The memory management module controls the allocation of memory
Swapping
Demand paging
The scheduler module allocates the CPU to processes.
Hardware control is responsible for handling interrupts and for communicating with the machine.
There are several forms of IPC, ranging from asynchronous signaling of events to synchronous transmission of messages between processes.
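
Mode, Space and Context
Figure: execution mode (user or kernel) versus context (process or system).
User mode, process context: application (user code).
Kernel mode, process context: system calls and exceptions.
Kernel mode, system context (system space): interrupts and system tasks.
User mode, system context: not allowed.
Kernel mode is privileged; UNIX uses only two privilege levels.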

File System
A file system consists of a sequence of logical blocks (512 bytes, 1024 bytes, etc.).
A file system has the following structure, in order on disk:
Boot Block | Super Block | Inode List | Data Blocks
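The layout above turns into simple block arithmetic. The sketch below assumes, purely for illustration, that the boot block and the super block each occupy exactly one logical block; the names fs_layout, make_layout and block_byte_offset are hypothetical and not taken from any real kernel.

```c
#include <assert.h>

/* Starting logical block number of each region of the file system. */
struct fs_layout {
    unsigned boot_start;
    unsigned super_start;
    unsigned inode_start;
    unsigned data_start;
    unsigned total_blocks;
};

/* Assumption: one block for the boot block and one for the super block. */
struct fs_layout make_layout(unsigned inode_blocks, unsigned data_blocks)
{
    struct fs_layout l;

    l.boot_start = 0;
    l.super_start = 1;
    l.inode_start = 2;
    l.data_start = l.inode_start + inode_blocks;  /* data follows the inode list */
    l.total_blocks = l.data_start + data_blocks;
    return l;
}

/* Byte offset of a logical block for a given logical block size. */
unsigned long block_byte_offset(unsigned block_no, unsigned block_size)
{
    return (unsigned long) block_no * block_size;
}
```

With 10 inode-list blocks and 1024-byte logical blocks, for example, the first data block is block 12, at byte offset 12288.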

File System: Boot Block
The beginning of the file system
Contains bootstrap code that loads and initializes the operating system
Typically occupies the first sector of the disk

File System: Super Block
Describes the state of a file system
How large it is
How many files it can store
Where to find free space on the file system
Other information

File System: Inode List
Inodes are used to access disk files.
Inodes map files to their location on disk.
For each file there is an inode entry in the inode list.
The inode list also keeps track of the directory structure.

File System: Data Blocks
Start at the end of the inode list
Contain the data of disk files
An allocated data block can belong to one and only one file in the file system

Process
Process: states + context
fork and exec
Executable file
Header: describes the attributes of the file
Text: program text
Data: initialized data (has initial values) + bss
Symbol table information

Exec Family
fork and exec: example at boot time
Figure: init (pid = 1, the ancestor of all user processes) forks a getty for each terminal line; getty execs login, and login execs the shell. The shell uses fork + exec to run a program such as a.out and waits until it exits.

Context
Text
Data
Stack
System-level context

Context Switch
context_switch (oldPCB, newPCB)
{
    save current register contents into oldPCB, including PC, SP, ...;
        (the saved PC is the resume address)
    restore register contents from newPCB into the registers, including PC (jump);
    // execution resumes here when another instance of context switch restores oldPCB
}
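The save-and-restore step can be imitated in plain C with a simulated register file. This is only a toy model under stated assumptions: a real context switch is written in assembly and touches the actual CPU registers, and the types regs and pcb below are hypothetical stand-ins.

```c
#include <assert.h>

/* Simulated CPU registers (a stand-in for the real register file). */
struct regs {
    unsigned long pc;      /* program counter */
    unsigned long sp;      /* stack pointer */
    unsigned long gpr[4];  /* a few general purpose registers */
};

/* Minimal PCB: only the saved machine context is modelled. */
struct pcb {
    int pid;
    struct regs saved;
};

/* Save the running process's registers into old_pcb, then load new_pcb's. */
void context_switch(struct regs *cpu, struct pcb *old_pcb, const struct pcb *new_pcb)
{
    old_pcb->saved = *cpu;     /* save, including PC and SP */
    *cpu = new_pcb->saved;     /* restore; loading PC acts as the jump */
}
```

Switching from process A to process B and later back to A leaves the simulated CPU with exactly the register values A had when it was switched out.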

Process States (CPU)
Running:
currently has control of the CPU
executing in user or kernel mode
Ready:
waiting to be scheduled (in a queue)
Blocked:
waiting for an event (such as I/O)
cannot be scheduled until the event occurs

Process States and Transitions
New (born) -> Ready
Ready -> Running: scheduled
Running -> Ready: time slice expiry, interrupt handling
Running -> Blocked: I/O request, wait for an event, wait for a message
Blocked -> Ready: I/O completion, event occurs
Running -> Exit (dead)
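The transitions listed above can be written down as a small table-like function. The enum and function names are hypothetical; the sketch only encodes the arrows of the diagram and is not a scheduler.

```c
#include <assert.h>

enum proc_state { P_NEW, P_READY, P_RUNNING, P_BLOCKED, P_EXIT };

/* Returns 1 if the diagram has an arrow from 'from' to 'to', else 0. */
int transition_allowed(enum proc_state from, enum proc_state to)
{
    switch (from) {
    case P_NEW:                       /* admitted to the ready queue */
        return to == P_READY;
    case P_READY:                     /* scheduled */
        return to == P_RUNNING;
    case P_RUNNING:                   /* preempted, blocked, or finished */
        return to == P_READY || to == P_BLOCKED || to == P_EXIT;
    case P_BLOCKED:                   /* I/O completion or event occurs */
        return to == P_READY;
    default:                          /* P_EXIT has no outgoing arrows */
        return 0;
    }
}
```

Note that a blocked process never moves directly to running: it must first return to the ready queue and be scheduled.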

Process Information

Process Control Block (PCB)
A kernel data structure holding a process's information for process management:
process id
user id
program file info
scheduling priority
state
the event being waited for
open file table
resources allocated (in use)

Process Control Block (PCB)
working directory
memory management info (text, data, stack, shared area)
machine context
register context, including PC, SP, PSW and general purpose registers (their contents at the time the process state changes from running to ready or blocked)
This context must be restored when the process is scheduled again.

The Buffer Cache
Topics
UNIX System Architecture
Buffer Cache
Buffer Pool Structure
Retrieval of a Buffer
Releasing a Buffer
Reading and Writing Disk Blocks
Reference:
The Design of the UNIX Operating System by Maurice J. Bach

The Buffer Cache
When a process wants to access data from a file, the kernel brings the data into main memory, where the process can examine or alter it, and then requests that it be saved in the file system.
Example: cp one.c two.c
To improve response time and throughput, the kernel minimizes the frequency of disk access by keeping a pool of internal data buffers, called the buffer cache.

UNIX Kernel Architecture
Figure: the kernel block diagram again. User programs and libraries at user level enter the kernel through a trap to the system call interface. At kernel level, the file subsystem sits on top of the buffer cache and the character and block device drivers, next to the process control subsystem (inter-process communication, scheduler, memory management). Hardware control connects the kernel level to the hardware level.

Buffer Cache
The buffer cache contains the data of recently used disk blocks.
When reading data from disk, the kernel first attempts to read from the buffer cache.
If the data is already in the buffer cache, the kernel does not need to read from disk.
If the data is not in the buffer cache, the kernel reads the data from disk and caches it.
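The hit-or-miss behaviour can be illustrated with a deliberately tiny cache. This is a simplified sketch, not the UNIX buffer cache: it is direct-mapped (each block number maps to one fixed slot), the "disk" is a stand-in function that fabricates block contents, block numbers are assumed non-negative, and every name (tiny_cache, tiny_cache_read, fake_disk_read) is hypothetical.

```c
#include <assert.h>
#include <string.h>

#define SLOTS 4   /* number of cache slots */
#define BLKSZ 8   /* bytes per block in this toy model */

struct tiny_cache {
    int  blkno[SLOTS];        /* block held by each slot, -1 if empty */
    char data[SLOTS][BLKSZ];
    int  disk_reads;          /* how often the "disk" was actually read */
};

void tiny_cache_init(struct tiny_cache *c)
{
    for (int i = 0; i < SLOTS; i++)
        c->blkno[i] = -1;
    c->disk_reads = 0;
}

/* Stand-in for a disk driver: fills the block with a letter derived from blkno. */
static void fake_disk_read(int blkno, char *dst)
{
    memset(dst, 'A' + blkno % 26, BLKSZ);
}

/* Returns the block's data, going to the "disk" only on a cache miss. */
const char *tiny_cache_read(struct tiny_cache *c, int blkno)
{
    int slot = blkno % SLOTS;

    if (c->blkno[slot] != blkno) {         /* miss: read from disk and cache it */
        fake_disk_read(blkno, c->data[slot]);
        c->blkno[slot] = blkno;
        c->disk_reads++;
    }
    return c->data[slot];                  /* hit: no disk access needed */
}
```

Reading block 5 twice costs one disk read; reading block 9 afterwards evicts block 5 in this direct-mapped toy, because both map to slot 1.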

Buffer Headers
A buffer consists of two parts:
a memory array
a buffer header
disk block : buffer = 1 : 1
Figure 3.1 Buffer Header
device num
block num
status
ptr to next buf on hash queue
ptr to previous buf on hash queue
ptr to next buf on free list
ptr to previous buf on free list
ptr to data area

Buffer Headers
device num
logical file system number
block num
block number of the data on disk
status
The buffer is currently locked.
The buffer contains valid data.
The buffer is marked delayed-write.
The kernel is currently reading or writing the contents of the buffer to disk.
A process is currently waiting for the buffer to become free.
The kernel identifies the buffer contents by examining the device num and block num.

Buffer Headers
struct buffer_head {
    /* First cache line: */
    struct buffer_head *b_next;         /* Hash queue list */
    unsigned long b_blocknr;            /* block number */
    unsigned long b_size;               /* block size */
    kdev_t b_dev;                       /* device (B_FREE = free) */
    kdev_t b_rdev;                      /* Read device */
    unsigned long b_rsector;            /* Real buffer location on disk */
    struct buffer_head *b_this_page;    /* circular list of buffers in one page */
    unsigned long b_state;              /* buffer state bitmap (see above) */
    struct buffer_head *b_next_free;
    unsigned int b_count;               /* users using this block */
    char *b_data;                       /* pointer to data block (1024 bytes) */
    unsigned int b_list;                /* List that this buffer appears on */
    unsigned long b_flushtime;          /* Time when this (dirty) buffer should be written */
    struct wait_queue *b_wait;
    struct buffer_head **b_pprev;       /* doubly linked list of hash queue */
    struct buffer_head *b_prev_free;    /* doubly linked list of buffers */
    struct buffer_head *b_reqnext;      /* request queue */
    /* I/O completion */
    void (*b_end_io)(struct buffer_head *bh, int uptodate);
    void *b_dev_id;
};

Buffer Headers
/* buffer head state bits */
#define BH_Uptodate  0   /* 1 if the buffer contains valid data */
#define BH_Dirty     1   /* 1 if the buffer is dirty */
#define BH_Lock      2   /* 1 if the buffer is locked */
#define BH_Req       3   /* 0 if the buffer has been invalidated */
#define BH_Protected 6   /* 1 if the buffer is protected */

Structures of the Buffer Pool
The buffer pool is managed according to LRU (least recently used).
The kernel maintains a free list of buffers:
a doubly linked circular list
When it needs a free buffer, the kernel takes one from the head of the free list.
When returning a buffer, it attaches the buffer to the tail.
Figure: free list head -> buf 1 -> buf 2 -> ... -> buf n, with forward and back pointers.
Figure 3.2 Free list of Buffers
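The take-from-head, return-to-tail discipline is what produces the LRU order. A minimal sketch of such a list follows, assuming a dummy header node that points to itself when the list is empty; the struct and function names are hypothetical and the buffer is reduced to the fields needed here.

```c
#include <assert.h>
#include <stddef.h>

struct fbuf {
    int blkno;
    struct fbuf *free_next;   /* forward pointer on the free list */
    struct fbuf *free_prev;   /* back pointer on the free list */
};

/* An empty circular list: the header points to itself in both directions. */
void freelist_init(struct fbuf *head)
{
    head->free_next = head;
    head->free_prev = head;
}

/* Returning a buffer: attach it at the tail (most recently used end). */
void freelist_put_tail(struct fbuf *head, struct fbuf *b)
{
    b->free_prev = head->free_prev;
    b->free_next = head;
    head->free_prev->free_next = b;
    head->free_prev = b;
}

/* Allocating a buffer: take the one at the head (least recently used). */
struct fbuf *freelist_take_head(struct fbuf *head)
{
    struct fbuf *b = head->free_next;

    if (b == head)            /* list is empty */
        return NULL;
    b->free_prev->free_next = b->free_next;
    b->free_next->free_prev = b->free_prev;
    b->free_next = b->free_prev = NULL;
    return b;
}
```

A buffer that was just released sits at the tail, so it is the last candidate for reuse; the buffer at the head is the one that has gone unused the longest.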

Structures of the Buffer Pool
When the kernel accesses a disk block, it searches for a buffer on a
separate queue (doubly linked circular list),
hashed as a function of the device and block num.
Every disk block exists on one and only one hash queue, and only once on that queue.
Hash queue headers:
blkno 0 mod 4: 28, 4, 64
blkno 1 mod 4: 17, 5, 97
blkno 2 mod 4: 98, 50, 10
blkno 3 mod 4: 3, 35, 99
Figure 3.3 Buffers on the Hash Queues
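The hashing in Figure 3.3 can be sketched as below. Two simplifications are assumptions of this sketch, not of the slides: the hash uses the block number only (as the figure does, ignoring the device number), and each queue is singly linked for brevity. The names hbuf, hash_index, hash_insert and hash_find are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define NHASH 4   /* four hash queues, as in Figure 3.3 */

struct hbuf {
    int blkno;
    struct hbuf *hash_next;   /* next buffer on the same hash queue */
};

/* Queue index for a block number: blkno mod 4 (blkno assumed non-negative). */
unsigned hash_index(int blkno)
{
    return (unsigned) blkno % NHASH;
}

/* Put a buffer on the front of the hash queue its block number selects. */
void hash_insert(struct hbuf *queues[NHASH], struct hbuf *b)
{
    unsigned i = hash_index(b->blkno);

    b->hash_next = queues[i];
    queues[i] = b;
}

/* Search only the one queue the block can be on; NULL means "not in cache". */
struct hbuf *hash_find(struct hbuf *queues[NHASH], int blkno)
{
    struct hbuf *b;

    for (b = queues[hash_index(blkno)]; b != NULL; b = b->hash_next)
        if (b->blkno == blkno)
            return b;
    return NULL;
}
```

Because a block can live on only one queue, a lookup for block 18 inspects just the "blkno 2 mod 4" queue (98, 50, 10 in the figure) before concluding that the block is not cached.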

Scenarios for Retrieval of a Buffer
Determine the logical device num and block num.
The algorithms for reading and writing disk blocks use the algorithm getblk.
The kernel finds the block on its hash queue:
The buffer is free.
The buffer is currently busy.
The kernel cannot find the block on the hash queue:
The kernel allocates a buffer from the free list.
In attempting to allocate a buffer from the free list, it finds a buffer on the free list that has been marked delayed write.
The free list of buffers is empty.

Retrieval of a Buffer: 1st Scenario (a)
The kernel finds the block on the hash queue and its buffer is free.
Figure: the hash queues of Figure 3.3 with the free list header; search for block 4.

Retrieval of a Buffer: 1st Scenario (b)
Figure: the same hash queues; remove block 4 from the free list.

Retrieval of a Buffer: 2nd Scenario (a)
The kernel cannot find the block on the hash queue, so it allocates a buffer from the free list.
Figure: the hash queues of Figure 3.3 with the free list header; search for block 18: not in cache.

Retrieval of a Buffer: 2nd Scenario (b)
Figure: remove the 1st block from the free list (block 3) and assign it to block 18; the buffer moves from the "blkno 3 mod 4" queue to the "blkno 2 mod 4" queue.

Retrieval of a Buffer: 3rd Scenario (a)
The kernel cannot find the block on the hash queue, and finds delayed-write buffers on the free list.
Figure: search for block 18; blocks 3 and 5 at the head of the free list are marked delayed write.

Retrieval of a Buffer: 3rd Scenario (b)
Figure 3.8 (b): writing blocks 3 and 5 to disk; the buffer for block 4 is reassigned to block 18.

Retrieval of a Buffer: 4th Scenario
The kernel cannot find the buffer on the hash queue, and the free list is empty.
Figure: search for block 18; the free list header points to no buffers.

Race for a Free Buffer

Retrieval of a Buffer: 5th Scenario
The kernel finds the buffer on the hash queue, but it is currently busy.
Figure: search for block 99; the buffer for block 99 is marked busy.

Race for a Locked Buffer

Algorithm: Get Block
getblk (file_system_no, block_no)
while (buffer not found)
{
    if (block in hash queue)
    {
        if (buffer busy)                        /* scenario 5 */
        {
            sleep (event: buffer becomes free);
            continue;
        }
        mark buffer busy;                       /* scenario 1 */
        remove buffer from free list;
        return buffer;
    }
    else
    {
        if (there is no buffer on free list)    /* scenario 4 */
        {
            sleep (event: any buffer becomes free);
            continue;
        }
        remove buffer from free list;
        if (buffer marked as delayed write)     /* scenario 3 */
        {
            asynchronous write of buffer to disk;
            continue;
        }
        remove buffer from old hash queue;      /* scenario 2 */
        put buffer onto new hash queue;
        return buffer;
    }
}
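The five scenarios can be exercised in a single-threaded simulation. The sketch below makes several simplifying assumptions that are not part of the algorithm itself: sleeping is replaced by a return code, the hash queues are replaced by a linear search over a pool of three buffers, the free list order is kept with a counter stamp instead of linked pointers, and block numbers are non-negative. All names (cache, cache_getblk, cache_release and so on) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define NBUF 3

enum getblk_result { GOT_CACHED, GOT_NEW, WOULD_SLEEP_BUSY, WOULD_SLEEP_NO_FREE };

struct sbuf {
    int blkno;                 /* disk block held, -1 if none */
    int busy;                  /* locked by some process */
    int valid;                 /* contents match the disk block */
    int delwri;                /* delayed write pending */
    int on_free;               /* currently on the free list */
    unsigned long free_stamp;  /* lower stamp = closer to the free list head */
};

struct cache {
    struct sbuf pool[NBUF];
    unsigned long clock;
    int writes_started;        /* asynchronous writes begun by getblk */
};

void cache_init(struct cache *c)
{
    c->clock = 0;
    c->writes_started = 0;
    for (int i = 0; i < NBUF; i++)
        c->pool[i] = (struct sbuf){ .blkno = -1, .on_free = 1, .free_stamp = ++c->clock };
}

/* brelse-like helper: unlock the buffer and put it at the tail of the free list. */
void cache_release(struct cache *c, struct sbuf *b)
{
    b->busy = 0;
    b->on_free = 1;
    b->free_stamp = ++c->clock;
}

static struct sbuf *free_list_head(struct cache *c)
{
    struct sbuf *best = NULL;

    for (int i = 0; i < NBUF; i++) {
        struct sbuf *b = &c->pool[i];
        if (b->on_free && (best == NULL || b->free_stamp < best->free_stamp))
            best = b;
    }
    return best;
}

enum getblk_result cache_getblk(struct cache *c, int blkno, struct sbuf **out)
{
    *out = NULL;
    for (int i = 0; i < NBUF; i++) {            /* "hash queue" lookup */
        struct sbuf *b = &c->pool[i];
        if (b->blkno == blkno) {
            if (b->busy)
                return WOULD_SLEEP_BUSY;        /* scenario 5 */
            b->busy = 1;                        /* scenario 1 */
            b->on_free = 0;
            *out = b;
            return GOT_CACHED;
        }
    }
    for (;;) {
        struct sbuf *b = free_list_head(c);
        if (b == NULL)
            return WOULD_SLEEP_NO_FREE;         /* scenario 4 */
        b->on_free = 0;
        if (b->delwri) {                        /* scenario 3: start write, try next */
            b->delwri = 0;
            b->busy = 1;
            c->writes_started++;
            continue;
        }
        b->busy = 1;                            /* scenario 2: reassign the buffer */
        b->valid = 0;
        b->blkno = blkno;
        *out = b;
        return GOT_NEW;
    }
}
```

In a real kernel the two WOULD_SLEEP results correspond to the process sleeping and retrying the whole loop after wakeup, which is why the algorithm is written as a while loop.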

Reading Disk Blocks
Figure 13-3 in Understanding the Linux Kernel: block device handler architecture for buffer I/O operations. In Linux, a block device file is served by the high-level block device handler (block_read(), block_write()), which goes through the buffer cache (bread(), breada(), getblk()) to the low-level block device handler (ll_rw_block()).

Algorithm: Reading a Disk Block
Algorithm bread    /* block read */
Input: file system block number
Output: buffer containing data
{
    get buffer for block (algorithm getblk);
    if (buffer data valid)
        return buffer;
    initiate disk read;
    sleep (event: disk read complete);
    return buffer;
}
If the block is not in the cache, the kernel calls the disk driver to schedule a read request.
The disk driver notifies the disk controller that it wants to read data, and the disk controller later transmits the data to the buffer.
The disk interrupt handler awakens the sleeping process.

Reading Disk Blocks
Read ahead
Improves performance
Reads an additional block before it is requested
Uses breada()
Algorithm breada
Input: (1) file system block number for immediate read
       (2) file system block number for asynchronous read
Output: buffer containing data for immediate read
{
    if (first block not in cache) {
        get buffer for first block (algorithm getblk);
        if (buffer data not valid)
            initiate disk read;
    }
    if (second block not in cache) {
        get buffer for second block (algorithm getblk);
        if (buffer data valid)
            release buffer (algorithm brelse);
        else
            initiate disk read;
    }
    if (first block was originally in cache) {
        read first block (algorithm bread);
        return buffer;
    }
    sleep (event: first buffer contains valid data);
    return buffer;
}

Writing Disk Blocks
Synchronous write
The calling process goes to sleep awaiting I/O completion and releases the buffer when it awakens.
Asynchronous write
The kernel starts the disk write and releases the buffer when the I/O completes.
Delayed write
The kernel puts off the physical write to disk until the buffer is reallocated.
See scenario 3.
Release
Use brelse()

Writing Disk Blocks
Algorithm bwrite
Input: buffer
Output: none
{
    initiate disk write;
    if (I/O synchronous) {
        sleep (event: I/O complete);
        release buffer (algorithm brelse);
    }
    else if (buffer marked for delayed write)
        mark buffer to put at head of free list;
}

Release Disk Block
Algorithm brelse
Input: locked buffer
Output: none
{
    wake up all processes waiting on the event: any buffer becomes free;
    wake up all processes waiting on the event: this buffer becomes free;
    raise processor execution level to block interrupts;
    if (buffer contents valid and buffer not old)
        enqueue buffer at end of free list;
    else
        enqueue buffer at beginning of free list;
    lower processor execution level to allow interrupts;
    unlock (buffer);
}
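The placement decision at the heart of brelse can be sketched as follows. The sketch leaves out the wakeups and the raising and lowering of the processor execution level, which need kernel support, and keeps only the choice between the two ends of the free list; the names rbuf, rlist_init and brelse_sketch are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct rbuf {
    int valid;    /* buffer contents are valid */
    int old;      /* marked "old", e.g. after a delayed write was flushed */
    int locked;
    struct rbuf *next, *prev;   /* free list links */
};

/* Empty circular free list: the header points to itself. */
void rlist_init(struct rbuf *head)
{
    head->next = head;
    head->prev = head;
}

/* Link b between two neighbouring nodes of the circular list. */
static void insert_between(struct rbuf *b, struct rbuf *before, struct rbuf *after)
{
    b->prev = before;
    b->next = after;
    before->next = b;
    after->prev = b;
}

/* Valid, not-old buffers go to the tail (kept longest); others go to the head
 * (reused first). The buffer is unlocked at the end, as in the algorithm. */
void brelse_sketch(struct rbuf *head, struct rbuf *b)
{
    if (b->valid && !b->old)
        insert_between(b, head->prev, head);    /* end of free list */
    else
        insert_between(b, head, head->next);    /* beginning of free list */
    b->locked = 0;
}
```

Putting invalid or old buffers at the head means getblk reuses them before it touches buffers whose contents might still be useful.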

Advantages and Disadvantages
Advantages
Allows uniform disk access
Eliminates the need for special alignment of user buffers, by copying data from user buffers to system buffers
Reduces the amount of disk traffic: fewer disk accesses
Ensures file system integrity: one disk block is in only one buffer
Disadvantages
Can be vulnerable to crashes when delayed write is used
Requires an extra data copy when reading and writing to and from user processes

What Happens to a Buffer: Summary So Far
Allocate a buffer: using getblk(), 5 scenarios
Mark busy: preserving integrity
Manipulate: using bread, breada, bwrite
Release: using the brelse algorithm

Reference
Linux Kernel Internals: Beck, Bohme, Dziadzka, Kunitz, Magnus, Verworner
The Design of the UNIX Operating System: Maurice J. Bach
Understanding the Linux Kernel: Bovet, Cesati
In Linux:
buffer_head: include/linux/fs.h
bread: fs/buffer.c
brelse: include/linux/fs.h